Papertrail

The revolution will be verbosely {,b}logged

DNS Outage on Monday, December 1, 2014

Posted by @troyd on

Summary

Most DNS requests for papertrailapp.com timed out for about 6 hours, affecting many Web visitors and a small number of log senders. The outage was absolutely unacceptable. Personally and on behalf of Papertrail, I’m sorry.

Although the outage was caused by a DDoS against our DNS service provider, it’s our problem. We’re implementing complete DNS provider redundancy so that similar problems do not affect Papertrail.

What happened?

For about 6 hours on Monday, December 1, most DNS requests for papertrailapp.com timed out. Between approximately 19:15 UTC and 00:45 UTC the following day (11:15 - 16:45 Pacific), nearly all non-cached DNS requests timed out. As the attack pattern changed, a much smaller percentage of requests continued to time out until about 06:00 UTC (22:00 Pacific).

This DNS outage mostly impacted users trying to reach Papertrail’s site. Comparing Papertrail’s log volume during this outage to typical volume on a Monday, we did not see a noticeable change. The outage impacted few log senders because the resolved log destination was cached by most sending daemons as well as local and remote DNS servers. If your site was impacted, please contact us so we can understand and mitigate the impact of future problems. If you already reported this, we’ll be in touch.

The outage was caused by a very large distributed denial of service (DDoS) attack against a company from which Papertrail purchases DNS service. While the attack was aimed at one of our service providers, we’re responsible for delivering Papertrail to you and this outage is our responsibility.

We posted frequent updates to Papertrail’s status site and @papertrailops, though for reasons discussed below, the updates were not as accessible as they should have been.

What we’re doing

While a DDoS against a service provider was the cause, Papertrail’s responsibility is to ensure reliable resolution of our own zone. In doing that, we made 2 mistakes:

  • Underestimating the duration of the longest foreseeable DNS outage
  • Based on that underestimation, depending on a single DNS infrastructure

Our estimate was based on the many large attacks that this service provider has handled in the 2 years we have used it. This attack was too large and too distributed for their anycast network, and filtering it took much longer than we had planned for.

While our service provider will make their network more resilient to attacks, making any single DNS infrastructure more resilient is not the solution. An attacker will always be able to generate a larger DDoS, and service problems other than DoS attacks can still take a single infrastructure down. For Papertrail and other maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS.

Authoritative DNS has one very thoughtful design property: it’s inherently distributed. A zone can list nameservers from more than one operator, and the resolvers that serve clients (like typical Web browsers and log senders) will try to query multiple servers. Nothing about the service we purchase from our current provider prevents using additional providers or our own DNS infrastructure. Using multiple DNS infrastructures takes more effort to design and maintain, and because we underestimated the impact of a severe outage, we invested that effort elsewhere. Monday showed us that the extra effort is worthwhile and necessary.
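To make the failover idea concrete, here is a minimal sketch in Python of a resolver that asks each listed nameserver in turn and moves on when one times out. It is an illustration, not real resolver code: the query callable, the fake_query stand-in, the provider-a.example and provider-b.example hostnames, and the 192.0.2.10 address are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def resolve_with_failover(
    name: str,
    nameservers: Iterable[str],
    query: Callable[[str, str], str],
) -> Optional[str]:
    """Ask each nameserver in turn and return the first answer received.

    `query(nameserver, name)` is a hypothetical callable that returns an
    address, or raises TimeoutError when the server does not respond.
    """
    for server in nameservers:
        try:
            return query(server, name)
        except TimeoutError:
            continue  # this server is unreachable; try the next one
    return None  # every listed server timed out


def fake_query(server: str, name: str) -> str:
    """Stand-in for a real DNS query: provider A is under attack."""
    if server.endswith("provider-a.example"):
        raise TimeoutError(server)
    return "192.0.2.10"  # documentation-range address, not a real record


# Nameservers from two operators: the lookup survives provider A's outage.
mixed = ["ns1.provider-a.example", "ns1.provider-b.example"]
# Nameservers from one operator: the lookup fails with it.
single = ["ns1.provider-a.example", "ns2.provider-a.example"]
```

With the mixed list the lookup falls through to provider B and returns an answer. With the single-provider list every attempt times out and the function returns None, which is the situation visitors were in on Monday.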

Relying on one DNS infrastructure, no matter how large or distributed, is an unnecessary risk that we unintentionally took. On behalf of Papertrail, I apologize for our mistake and the impact it had on your operations.

Also, during the outage we realized that our status updates were not as accessible as they should be. Our status site used a hostname under the papertrailapp.com domain name, so it was also inaccessible during the outage.

We’ve registered papertrailstatus.com for Papertrail’s status site. The status site does not depend on any of the same DNS infrastructures that Papertrail itself does (and was already hosted independently). We added a reference to @papertrailops to the Twitter bio for @papertrailapp and started retweeting significant @papertrailops updates from the more visible @papertrailapp account.

We’ve started planning how Papertrail can use multiple, independent anycast DNS infrastructures so that a DDoS or outage of one DNS service does not affect Papertrail. If we can go into more detail about any of this, please contact us.
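One way to check a zone for this kind of dependency is to group its nameserver hostnames by apparent operator. The sketch below is a rough heuristic rather than a description of Papertrail's tooling: it assumes the last two labels of a hostname identify the operator, and the hostnames in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(nameservers: Iterable[str]) -> Dict[str, List[str]]:
    """Group nameserver hostnames by their last two labels.

    Assumption: the last two labels identify the operator. This is wrong
    for suffixes such as co.uk and for providers that spread their
    nameservers across several domains.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for host in nameservers:
        labels = host.rstrip(".").lower().split(".")
        groups[".".join(labels[-2:])].append(host)
    return dict(groups)


def has_single_provider_dependency(nameservers: Iterable[str]) -> bool:
    """True when every nameserver appears to belong to one operator."""
    return len(group_by_provider(nameservers)) < 2


# Hypothetical NS sets for illustration.
before = ["ns1.provider-a.example.", "ns2.provider-a.example."]
after = ["ns1.provider-a.example.", "ns1.provider-b.example."]
```

A check like this only looks at names. It cannot tell whether two differently named services share the same network underneath, so it is a starting point rather than proof of independence.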

Update (2014-12): This change is complete and Papertrail does not depend on any single DNS infrastructure. Papertrail wrote and released dnsync to synchronize a zone between two providers, one of which did not support AXFR.
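When a provider does not support AXFR, one general approach is to read the records from both providers through whatever interface each offers, then compute the changes needed to make the second match the first. The sketch below shows only that comparison step. It is not dnsync's code, and the Record type, the plan_sync function, and the sample records are assumptions made for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Record(NamedTuple):
    """One DNS resource record, reduced to the fields compared here."""
    name: str
    rtype: str
    ttl: int
    value: str


def plan_sync(
    primary: Set[Record], secondary: Set[Record]
) -> Tuple[Set[Record], Set[Record]]:
    """Return (to_create, to_delete) that make `secondary` match `primary`."""
    to_create = primary - secondary  # present on primary, missing on secondary
    to_delete = secondary - primary  # stale on secondary
    return to_create, to_delete


# Hypothetical zone contents using documentation-range addresses.
primary = {
    Record("www.example.com.", "A", 300, "192.0.2.10"),
    Record("logs.example.com.", "A", 300, "192.0.2.20"),
}
secondary = {
    Record("www.example.com.", "A", 300, "192.0.2.10"),
    Record("logs.example.com.", "A", 300, "192.0.2.99"),  # out of date
}

to_create, to_delete = plan_sync(primary, secondary)
```

Because records are compared as whole tuples, a record whose value changed appears once in to_delete (the old value) and once in to_create (the new value). Applying the plan through each provider's own interface is left out, since that part depends entirely on the provider.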