Battleship: Lessons Learned

On September 5th I put my battleship API to the test by having the attendees at Seattle.rb’s quarterly coding workshop try to build clients against it. In the process, I learned some interesting things.

Stability is King

In the hours leading up to the workshop, I made some last-minute changes to the server. Mostly I did things like change passwords so that the database password wasn’t in publicly viewable source control. In my rush, I didn’t test whether my changes worked; I just pushed them straight to production. As a result, I managed to break my server in the first ten minutes of the exercise. App Engine Flex takes between 7 and 10 minutes to complete a deployment, which isn’t an issue if you are doing your weekly Thursday night deployment. It isn’t even a problem if you are a continuous delivery shop and deploy a dozen times a day. It is an issue if you are live debugging in production with 30 people anxiously waiting for you to finish so they can code.

In the future, I’ll follow the best practices I tell others to follow. I’ll test my configuration changes locally before pushing to production using RACK_ENV=production rake __. This one simple step would have saved me about 30 minutes of debugging. Also, I know better than to make changes right before going to production. Leaked credentials would have been much better than a site that was down for 30 minutes while I frantically debugged things. Finally, I’ll make sure I have a fallback server that folks can use if my main server has issues. A hot backup would have let me debug one server while the old version stayed up. The overall lesson is to focus on stability above all else. Anytime I made a change in the hours leading up to the event, I should have asked myself, “Is this worth reducing stability?”
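One cheap way to catch a broken configuration before a slow deployment is to check that every required setting is present. The sketch below is only an illustration: the setting names and the missing_settings helper are hypothetical, not taken from the real Battleship server.

```ruby
# Hypothetical pre-deploy check. The setting names are assumptions made
# for illustration; substitute whatever the application actually reads.
REQUIRED_SETTINGS = %w[DATABASE_URL DATABASE_PASSWORD].freeze

# Returns the names of required settings that are missing or blank in env.
# env can be ENV itself or any Hash-like object with string keys.
def missing_settings(env, required = REQUIRED_SETTINGS)
  required.select { |name| env[name].to_s.strip.empty? }
end

# Example with an explicit hash instead of the real environment:
sample_env = { "DATABASE_URL" => "postgres://localhost/battleship" }
puts "Missing: #{missing_settings(sample_env).join(', ')}"
# prints "Missing: DATABASE_PASSWORD"
```

Run against ENV from a Rake task with RACK_ENV=production set, a check like this could stop a deployment before it starts whenever the returned list is not empty.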

Turn-Based APIs Are Hard to Understand

I didn’t realize it until I started, but turn-based APIs are hard to model in the stateless, client-request-driven world of HTTP. Also, Battleship has a slightly weird mechanic where each “turn” requires both players to do something. A turn consists of one player’s guess and the other player’s response of “hit”, “miss”, or “sunk”. Contrast this with something like Tic-Tac-Toe, where all you do on your turn is place a piece.

I initially tried to have the client send a guess, like “A7”, to /guess. Then the server would respond with “hit”, “miss”, or “sunk”. That works for the client’s turns. For the server’s turns, though, the client would have to request a guess from the server at /server_guess and then post the result back via /server_guess_response or something similar. I could clean up the endpoint names, but doing it this way required the client to make three HTTP requests, in a specific order, for each set of turns.
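To make the ordering burden concrete, here is a minimal sketch of the three-request design in plain Ruby. The ThreeStepGame class is hypothetical, and the idea that the server rejects out-of-order requests is an assumption for illustration; no HTTP is involved.

```ruby
# Hypothetical model of the three-request design. Each call to #request
# stands in for one HTTP request to the named endpoint.
class ThreeStepGame
  # The order a client must follow for every pair of turns.
  ORDER = %w[/guess /server_guess /server_guess_response].freeze

  def initialize
    @step = 0
  end

  # Accepts the request only if it is the next one in the sequence.
  def request(path)
    expected = ORDER[@step]
    raise ArgumentError, "expected #{expected}, got #{path}" unless path == expected

    @step = (@step + 1) % ORDER.size
    :ok
  end
end

game = ThreeStepGame.new
game.request("/guess")                  # the client's guess
game.request("/server_guess")           # ask for the server's guess
game.request("/server_guess_response")  # report the result of that guess
```

Every client has to get this sequence right on every round, which is the awkwardness described above.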

To handle this, I went back to my childhood. I had a World Domination computer game when I was little that had a “play by mail” option. You would make your moves, and the computer would track them for you. Then you’d print out the moves and send them to your opponent. The opponent would carefully enter the contents of your printout and then repeat the process. This, in turn, was based on playing other games, like chess and checkers, by mail.

This led me to a single endpoint, /turn. The client would post a guess to /turn. The server would respond with “hit”, “miss”, or “sunk” and its own guess. The client would then respond to the server’s guess and make another guess. It seemed logical to me. I’d gone through the whole process of coming up with the API. When folks asked why it was structured this way I said, “It is like chess by mail.” However, most of my friends and most of the Seattle.rb attendees have never played a board game by mail. To them, the API seemed utterly ridiculous.
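The exchange is easier to see in code. The following is a minimal sketch of the /turn idea in plain Ruby, not the actual server: the TurnHandler class, its keyword arguments, and the shape of the returned hash are all assumptions made for illustration.

```ruby
require "set"

# Hypothetical sketch of the single-endpoint exchange. Each call to #turn
# stands in for one POST to /turn.
class TurnHandler
  attr_reader :responses

  # ships:   the server's ships, each an array of cells such as "A7"
  # guesses: the cells the server will guess, in order
  def initialize(ships:, guesses:)
    @ships = ships.map { |cells| Set.new(cells) }
    @hits = Set.new
    @guess_queue = guesses.dup
    @last_server_guess = nil
    @responses = {} # server guess => the client's answer to it
  end

  # guess:    the client's guess for this turn
  # response: the client's answer to the server's previous guess
  #           (nil on the first turn, when there is nothing to answer)
  def turn(guess:, response: nil)
    @responses[@last_server_guess] = response if @last_server_guess
    @last_server_guess = @guess_queue.shift
    { "response" => score(guess), "guess" => @last_server_guess }
  end

  private

  # Scores a client guess against the server's ships.
  def score(cell)
    ship = @ships.find { |s| s.include?(cell) }
    return "miss" unless ship

    @hits << cell
    ship.subset?(@hits) ? "sunk" : "hit"
  end
end

server = TurnHandler.new(ships: [%w[A1 A2]], guesses: %w[C3 D4])
server.turn(guess: "A1")                   # => {"response"=>"hit", "guess"=>"C3"}
server.turn(guess: "A2", response: "miss") # => {"response"=>"sunk", "guess"=>"D4"}
```

Each request carries two things at once, an answer to the previous guess and a new guess, which is exactly the play-by-mail shape that confused people.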

I don’t think the three-message solution would have made more sense. Instead, I think this problem would be easier to understand if I had used WebSockets. Porting the server to Node and then using Faye Websockets is next on my to-do list for Battleship. Eventually, I also want to try a WebSockets implementation using Rails/Sinatra and ActionCable. WebSockets give me the advantage of the server sending a message to the client proactively instead of having to wait for the client to make a request. This more closely models how people play the game in person. It is annoying to play a board game with someone when you have to say, “Your turn”, every single time your opponent needs to take their turn.

Deadlines and Real Situations Find Bugs

I’ve deployed dozens of Ruby applications to various Google Cloud compute platforms. I’ve used Google Container Engine and Kubernetes. I’ve deployed to a plain VM with Compute Engine. I’ve deployed several sites to App Engine Flexible, but this was only my second “real” site that had deadlines and users. All my other projects have been demos for talks and blog posts. When I deploy and I’m accountable to someone other than myself, I find new bugs and new error cases.

I knew that App Engine deployment times were longer than I’d like, but until I was debugging a configuration issue in production I didn’t understand the actual impact of the long deployment time. In my desperation to get things working, I did all sorts of things I hadn’t done in the past. I pushed buttons in the UI that looked like they could help, and I canceled deployments halfway through. Putting real users up against my Battleship server exposed bugs that I didn’t find until I was under pressure.

My big lesson is that I need to do more real or “real enough” projects. When I only do demos or other toy projects, I don’t find the edge cases and the sound bites that help me make a case for features that will improve the user experience.

What’s Next

While I’m a bit afraid that I’ll take this series too far, I’m not done with Battleship yet. For my next two posts, I want to port the current server code to Node.js, and I want to deploy the Ruby code to a Kubernetes cluster. They are completely different topics, but I think I’ll learn things while writing each of the posts.