According to Larry Wall, the creator of Perl, one of the three great virtues of a programmer is hubris - excessive pride. This is an interesting choice, seeing as pride is perhaps the most serious of the seven deadly sins. As a developer, taking pride in your work is important for consistently producing quality results, but humility - the state of viewing yourself as being of low importance, of not considering yourself better than others - is even more important. Humility is the quality that not only allows us to accept our mistakes, the mistakes of others, and their consequences, but also gives us the courage to examine those mistakes and learn from them.

The following is a personal story of how quickly hubris can turn to humility. When you forget to be humble and give in to hubris, life has a knack for making you humble again.

Once upon a time, I worked in a land of editing production Apache server configurations by hand. Oh sure, we had automated deployment scripts for all of our dozens of applications, but for whatever reason, generating the Apache virtual host configurations for each application was left to an administrator or lead developer to perform manually. Many of the applications were load balanced across multiple servers, which compounded the human effort and made misconfigurations all but inevitable. Configuration drift was also a problem - did we remember to tweak the TLS settings on every single server to mitigate the latest OpenSSL vulnerability? Who knows. ¯\_(ツ)_/¯

I know, automation to the rescue! Here I come with my Ansible-fu, pulling down all the production vhost files, diffing them, updating them with best practices for performance and security, and crafting a golden template to generate all the configs. It worked spectacularly!
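
Not the actual playbook, but a minimal sketch of the idea - one Jinja2 "golden" template, using hypothetical variables like app_name and environment, rendered for every application on every server (the real template, paths, and TLS details were more involved):

```
# templates/vhost.conf.j2 - hypothetical golden template (simplified)
<VirtualHost *:443>
    # The hostname is derived from our naming convention rather than set per app
{% if environment == 'production' %}
    ServerName {{ app_name }}.example.com
{% else %}
    ServerName {{ app_name }}.{{ environment }}.example.com
{% endif %}

    DocumentRoot /var/www/{{ app_name }}/current/public

    # TLS settings live in one place, so every server stays consistent
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/{{ app_name }}.crt
    SSLCertificateKeyFile /etc/ssl/private/{{ app_name }}.key

    ErrorLog  ${APACHE_LOG_DIR}/{{ app_name }}_error.log
    CustomLog ${APACHE_LOG_DIR}/{{ app_name }}_access.log combined
</VirtualHost>
```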

We were finally free from the tyranny of SSH-ing into production boxes where we tirelessly crafted our special snowflake config files. But there was a hidden trap lurking for months, waiting to devour an unsuspecting victim.

Most of our apps use a {{app_name}}.{{environment}}.example.com naming scheme for qa/staging domains and {{app_name}}.example.com for production domains. Except for one, which followed the scheme for qa/staging but used an entirely different domain in production. Let’s call this unique application constellations.example.com; its production domain was actually space.example.com, which did not match any ServerName directive on the server. Thanks to a well-documented behavior of Apache name-based virtual hosts, the mistake went unnoticed for months, because the generated vhost file, /etc/apache2/sites-enabled/constellations.conf, happened to alphabetically precede those of every other application.

The Apache HTTP Server documentation explains the matching behavior in “Using Name-based Virtual Hosts”:

“Now when a request arrives, the server will first check if it is using an IP address that matches the NameVirtualHost. If it is, then it will look at each <VirtualHost> section with a matching IP address and try to find one where the ServerName or ServerAlias matches the requested hostname. If it finds one, then it uses the configuration for that server. If no matching virtual host is found, then the first listed virtual host that matches the IP address will be used.”
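
To make that concrete, here is roughly what the generated vhost looked like - simplified, and anonymized like the rest of this story:

```
# /etc/apache2/sites-enabled/constellations.conf (generated - simplified)
<VirtualHost *:443>
    # What the golden template derived from the naming convention:
    ServerName constellations.example.com

    # What production actually needed:
    #   ServerName space.example.com
    ...
</VirtualHost>
```

No vhost on the server declared ServerName space.example.com, so those requests always fell through to the first vhost Apache loaded - and because sites-enabled/ is read in alphabetical order, that first vhost was constellations.conf, which just happened to be the right application anyway.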

Fast forward to last week, when we had a shiny new API ready for production, api.example.com. Has the code been reviewed and the app tested thoroughly in QA? Check! Do we have logging, monitoring, and error reporting in place? Check! Is the production data store provisioned and ready? Check! Ok, deploy away!

If you’re an astute reader or have some Apache experience, you can guess what happened next. Rather than getting a 200 response from my first call to api.example.com, I got an SSL certificate validation error. Yikes! New Relic was quick to follow with an alert that the uptime ping was failing. Then a few minutes later, I got a Slack message from my boss, “Uhh, did the api.example.com deploy just clobber the space.example.com cert?”

Oh crap! I pulled up space.example.com in my browser, and instead of seeing a beautiful homepage, I saw Chrome’s SSL cert warning. Time to roll back.

For lower-traffic apps, we host them on the same pool of VMs to save money. I knew the problem was not the TLS cert being overwritten, which meant it was most likely a configuration issue - but what? I verified that none of the other apps’ vhost files or TLS certs had been modified. Then I decided to cat all the vhost files in sites-available/ and suddenly flushed with embarrassment. The ServerName directive in the vhost for space.example.com was constellations.example.com. Because Apache could not find a ServerName matching space.example.com, it selected the first vhost it had loaded - which, after the deploy, was the one for api.example.com. For over 5 minutes, all traffic for space.example.com was routed to api.example.com. #devops #fail
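
In other words, the deploy didn’t touch the other apps at all - it just changed which vhost Apache falls back to when nothing matches:

```
# /etc/apache2/sites-enabled/ is loaded in alphabetical order
#
#   Before the deploy:  constellations.conf, <everything else>
#   After the deploy:   api.conf, constellations.conf, <everything else>
#
# Nothing declares ServerName space.example.com, so that traffic always goes to
# whichever vhost loads first: previously constellations.conf (the right app,
# by pure luck), now api.conf (the wrong app, answering with a certificate that
# does not match the requested hostname).
```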

The quick fix was obvious enough - update the vhost template to allow the ServerName directive to be overridden with a custom value (a sketch of that override follows the questions below). However, the engineer in me can’t help but ask:

  • How could this outage have been prevented?
  • How can we prevent this kind of outage in the future?
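
For what it’s worth, the override itself could be as small as a one-line change to the template - something along these lines, again with hypothetical variable names:

```
{# vhost.conf.j2 - let an app override the derived production hostname #}
{# e.g. the odd app out sets server_name: space.example.com in its vars #}
ServerName {{ server_name | default(app_name ~ '.example.com') }}
```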

How could this outage have been prevented?

Honestly, this is kind of an unfair question. If the past had been different, the present and future outcomes might have also been different. But the past is the past; we can’t change it. We can only learn from it and do better in the future.

Given our existing architecture and processes, could we have prevented this outage? Perhaps I or the reviewer of the Ansible playbook changes would have caught it if we had been more scrupulous. Perhaps. Automation is great for preventing certain kinds of human error, but it can wreak havoc when human error creeps into the automation itself.

However, there was a canary in the coal mine. The Apache error logs had been warning us about this problem after every server restart for months. Unfortunately, in the busyness of business, the warning was either ignored or never noticed.

How can we prevent this kind of outage in the future?

One idea is to avoid hosting multiple applications on the same host. While it wouldn’t have fixed the configuration error, it would have eliminated the circumstances that led to the outage. It also means that a misbehaving application, like one with a memory leak, won’t degrade the performance and availability of the other applications. The tradeoff is worth it if the cost of an outage (both in dollars and reputation) exceeds the price of additional capacity.

Another idea would be to invest in blue-green deployments. That would have given us an opportunity to perform health checks and other tests on the new application and its neighbors before cutting over, catching the kinds of issues that manage to slip through the cracks during every other stage of the release cycle.

Summary

What I hope you will take away from this story is that no matter how awesome an engineer you are or how great a solution you build, there are always unknown quantities lurking in complex systems, leading to unforeseen circumstances. While hubris can be a powerful motivator, humility is what empowers a developer to learn from past mistakes.