Papertrail

The revolution will be verbosely {,b}logged

Posted by @coryduncan on

A Smarter Event Viewer

Event focus

We’re excited to release some subtle but powerful updates to Papertrail’s event viewer that make searching and sharing logs much easier:

  • Searches now stay centered around the time you’re looking at, so step-by-step troubleshooting is faster.
  • Event viewer URLs now link to exact positions, so colleagues always see exactly what you see.

Never lose your spot

When searching through logs, it’s common to start with a broad search and gradually edit the search query to refine the results. Previously, a refined search would start searching again from “now” even if you had scrolled to results from the past.

Now, when you edit an existing search query, the results are based on the time of the events you are currently viewing. This should give you a quicker and more accurate search experience. Of course, when you do need to manually set a search time, that option is still available.

You see what I see

When all your logs are in one place, it’s easy to link to and share important events. We’ve made this as simple as copying and pasting the URL of your search. However, in searches that returned a lot of events, the URL might not indicate exactly what events you were looking at when sharing.

Now, you can share an event viewer URL with your team in confidence, knowing that whoever visits that URL will see the same set of events, in the same position, as was shown when the URL was generated.

How does it work?

These changes are automatically enabled in the event viewer. Keep doing what you’re doing! For example, when you perform a search, then scroll to a different position, and then perform a second search, the second search’s results will be centered on the time you had scrolled to. And if you copy an event viewer URL and return to it, you’ll be looking at exactly the same log message.

What do you think?

These are subtle changes from how our event viewer has worked in the past, but we hope you’ll find they feel quite natural and improve your experience with Papertrail. Try it out and please let us know what you think. Thanks!

Posted by @troyd on

April 14: New SHA-2 TLS certificate for log destinations

On Thursday, April 14, 2016, Papertrail will deploy a new SHA-2 (SHA-256) TLS/SSL certificate for its syslog destinations, replacing the current SHA-1 certificate.

Update 2016-04-25

On April 14, Papertrail deployed a new SHA-2 certificate, discovered that older versions of remote_syslog2 (0.13 and prior) did not accept the certificate, and reverted to the prior SHA-1 certificate.

Part of our job is to be our customers’ eyes and ears for changes like this. To that end, Papertrail will:

  • present the new SHA-2 certificate for about 6 hours on April 27, 2016
  • log all failed connection attempts due to TLS negotiation failures
  • revert to the existing SHA-1 certificate
  • email affected customers

The certificate will be deployed permanently on Wednesday, May 4, 2016.

If we can answer any questions or help, let us know.

Update 2016-04-14

During the migration, we were reminded that versions of remote_syslog2 prior to v0.14 don’t support SHA-2. Based on this new information, we are reverting the TLS certificate and will post a revised deployment plan next week.

Impact

For nearly all senders and all common log senders (including remote_syslog2 and rsyslog), this will be a non-event. OpenSSL has supported SHA-2 since 0.9.7m and enabled SHA-2 by default beginning with 0.9.8l (released on 2009-11-05).

The only senders which Papertrail knows do not accept SHA-2 (SHA-256) certificates are those running:

Windows 8, 7, Vista, Server 2012, and Server 2008 are not affected.

Why is this necessary?

The Wikipedia entry for SHA-2 explains the reason for this change:

Although (as of 2015) no example of a SHA-1 collision has been published yet, the security margin left by SHA-1 is weaker than intended, and its use is therefore no longer recommended for applications that depend on collision resistance, such as digital signatures.

Papertrail’s Web site already serves a SHA-2 certificate. This change only affects Papertrail’s syslog endpoints.
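To see which signature algorithm a certificate uses, OpenSSL's command-line tools can print it. The following is a minimal sketch rather than an official Papertrail tool: the function name cert_signature_algorithm is made up for illustration, and HOST and PORT in the usage comment are placeholders for your own log destination.

```shell
# cert_signature_algorithm: read a PEM certificate on stdin and print the
# name of its signature algorithm.
# The function name is an assumption for illustration.
cert_signature_algorithm() {
    openssl x509 -noout -text |
        awk -F': ' '/Signature Algorithm/ { print $2; exit }'
}

# Usage (HOST and PORT are placeholders for your log destination):
#   openssl s_client -connect HOST:PORT </dev/null 2>/dev/null |
#       cert_signature_algorithm
```

A SHA-2 (SHA-256) certificate reports an algorithm name containing sha256, while a SHA-1 certificate reports one containing sha1.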

Questions

If we can help test an old device or otherwise save you time, we’re at your service: support@papertrailapp.com.

Posted by @rpheath on

Subtle Refinements to Papertrail's Event Viewer

We’re pleased to release several improvements to Papertrail’s event viewer. Based on our own experience and how we’ve seen customers use Papertrail, these changes make the viewer easier for new users to incrementally explore, then more predictable once you have.

One place to choose what to see

There are three ways to control which logs the viewer shows: groups (of log senders/systems), search queries, and time.

Until now, one of these options was buried in the upper left corner, hundreds of pixels away from the other two:

Old Group Filter

The idea was that the dropdown list of groups in the upper left corner would frame my view, in the same way that the title of a page might. The problem: the group of log senders is used alongside the search query and the time, which are both at the bottom of the screen. Also, so many sites use the upper left corner for site-wide navigation or decoration that Papertrail’s group dropdown was easy to miss.

Realizing this, we moved all three scope-constraining options to one place. We hope this saves mouse mileage.

Bonus:

  • the existing icon to access/change saved searches will glow blue when an existing saved search is being used
  • edit any group or saved search right from the viewer:

Edit group from viewer

Quickly update existing saved searches

Previously, to overwrite a saved search, one needed to navigate to the saved search’s settings page and change the query in a form field. Usually I want to refine a saved search because I’m viewing the resulting logs, though. It’s almost never a task of its own.

To avoid the back-and-forth, it’s now possible to update searches from the event viewer. After searching for a query which isn’t currently saved, clicking “Save Search” now shows two options:

Replace Existing Search

This new “Replace an existing search” option makes it easy for search queries to evolve and improve as I explore my logs, so saved searches always reflect the team’s current knowledge.

Offer control of high-volume streams

When “tailing” a live stream, sometimes the stream will include more new log messages than would be sane to present at once. I’m not great at evaluating 250 events per second, let alone 2,500 or 25,000, and having them unceremoniously dumped on my screen wouldn’t help me debug a problem.

This is only relevant for high volume live tail streams, so Papertrail showed a subset of the new logs on the live stream and made all logs available for non-live views.

However, this had two gaps: I want an indication that I’m viewing a subset of live logs, and I’d like more control of what to do next – sometimes I spot a problem and do want to see more.

Now, when events are omitted from a high-volume live stream, Papertrail says so. Also, I can click to load omitted events:

Load Omitted Events

Moved Contrast setting to Profile

Papertrail’s viewer supports a dark and a light background. Previously, this was a button in the viewer. We’ve learned that contrast is more of a personal preference: once set, very few people want to change it casually, nor do we. It wasn’t a good use of space or cognitive load in the viewer. It’s now in Profile:

Profile Contrast Setting

What do you think?

Our design goals are gradual, effortless discoverability the first time something is needed, then minimum cognitive load on every future use. A recent 99% Invisible video about The Norman Door explains this incredibly well. Tiny decisions matter. In some cases, we’ve been testing these changes on ourselves and refining them for 2 months.

Take the updated viewer for a spin and send us your opinions and requests. Enjoy!

Posted by @coryduncan on

Never type the same API token twice

Typing the same alert settings into multiple alerts sucks. Browser autocompletion makes it tolerable, but it’s not ideal. To help with this, now when you create a new alert, you can copy details from one of your existing alerts. This is a quick way to set up multiple alerts that share details.

Clone Alert Details

Posted by @jpablomr on

Introducing Syslog Rate Limits

Summary

Occasionally, a misconfigured log sender will generate an astonishingly high volume of log data. Because UDP doesn’t offer backpressure, a misconfigured UDP sender can generate hundreds of thousands of packets per second with no regard to whether Papertrail accepts or even receives the logs. To any other service, this activity would be a denial-of-service attack. It’s our responsibility to ensure that such a misconfigured (or even malicious) sender can’t cause problems for other Papertrail customers, while also making everyone’s logging service as painless and predictable as we possibly can.

Until now, Papertrail has handled log floods reactively and manually. With a handful of incidents under our belt, we’re now comfortable using proactive rate limits to automatically identify and minimize the impact of these unintentional floods (particularly from UDP senders, where backpressure is not possible).

These syslog rate limit rules will go live on Thursday, February 11th, 2016. As explained below, customers should not see a difference in log delivery reliability from today. Also, in the future, Papertrail will periodically email customers who regularly reach the rate-limits simply to ensure that no one is surprised.

Our job is to make this simple and painless for you, so if we can answer any questions or explain in more detail, we want to know.

Updates

2016-02-11: Rate limits have been enabled.

A quick refresher on UDP

UDP is a great protocol for sending information with minimal overhead. Its simplicity and ‘fire and forget’ model make it a practical alternative to TCP for use cases where losing a few packets is not critical. Nevertheless, there is a catch: when a UDP packet is dropped (due to network or device issues), the sender never finds out. Since the UDP sender doesn’t know that packets were lost, it can’t moderate its transmission rate. A UDP sender will continue to send as fast as it can, whether or not the recipient receives the data.

The feedback that UDP lacks, called “backpressure,” is what lets TCP realize that a link is congested and slow its sending rate.

When a UDP sender starts sending unusually large amounts of data, the probability that some of the data doesn’t arrive increases. To picture this, imagine taking notes from a speech: if the speaker talks too quickly, you might not understand everything the speaker says or you might forget to write some parts down. The speaker doesn’t notice if you can’t hear what they say or if you fail to write everything down; they will continue speaking until they are done.

If your senders are configured to use UDP, a big spike of UDP messages might cause some of them to be dropped and forgotten along the way.

Why do we need to set rate limits?

Because we cannot apply backpressure to UDP senders, our only mechanism for fairly handling all customers is to apply rate limits for UDP that are above what we have seen as normal volumes for senders.

To avoid interfering with high-volume log senders, we based the limits on the peak log volume of Papertrail’s highest-volume senders (which Papertrail already measures as part of regular operations).

Separately, we have seen rare cases where a misbehaving TCP sender will cause issues by trying to connect tens or hundreds of times per second. With TLS encrypted syslog, the TLS handshake overhead can make this look a lot like a CPU exhaustion DoS attack. In extreme cases, these misbehaving senders could interfere with normal operation of Papertrail’s syslog destinations. There’s no graceful way to ignore 1 million packets per second or 10,000 failed TLS connections per second, so the question isn’t whether limits need to exist, just what they should be and how to implement them thoughtfully.

As we have grown, more misbehaving senders have required us to react to each incident and manually limit them until the senders are behaving properly. This is not an ideal solution, as manual tasks are error-prone and we’d rather spend that time improving the service. The rate limits we are introducing are just an automated form of our manual limiting policies.

Automated rate limits will let Papertrail:

  • Guarantee the normal operation of our service for all users.
  • Quickly determine misbehaving senders and help restore them to normal operation.
  • Spend time upgrading our infrastructure and working on new features.

What are the rate limits?

For UDP:

  • 3,000 messages per second per source IP.
  • 10,000 messages per second to a log destination (port).

For TCP:

  • 10 new connections per second per source IP to a log destination (port).

These limits can (and probably will) change, based on our regular measurements. If these limits present a problem, let us know and we’ll work with you. Again, customers won’t see changes from their current log message delivery reliability; these limits just automate what’s already happening. We’ll also be proactively contacting customers who regularly exceed these guidelines simply so they’re aware.
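To estimate whether one of your senders comes close to the UDP limit of 3,000 messages per second per source IP, you can count messages per second from your own records. The sketch below assumes a hypothetical input format (one line per message, containing an epoch second and a source IP) and a made-up function name. It only illustrates the arithmetic and says nothing about how Papertrail enforces the limits.

```shell
# over_limit LIMIT: read "epoch_second source_ip" lines on stdin and print
# "epoch_second source_ip count" for every second in which a source IP
# sent more than LIMIT messages. Input format and name are assumptions.
over_limit() {
    awk -v limit="$1" '
        { count[$1 " " $2]++ }          # tally messages per (second, IP)
        END {
            for (key in count)
                if (count[key] > limit)
                    print key, count[key]
        }
    ' | sort                            # awk iteration order is unspecified
}

# Usage: over_limit 3000 < messages.txt
```

An empty result means no source IP exceeded the given limit in any one-second window of the sample.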

What protocol should I use to send my messages?

Use TCP if:

  • You can’t miss a single log message.
  • A single sender IP address may generate more than 3,000 packets per second of logs regularly (and those logs are operationally relevant, not noise).
  • Communication between your senders and Papertrail needs to be encrypted.

Use UDP if:

  • Your syslog sender must be non-blocking.
  • Your syslog sender only supports UDP.
  • The benefits of using UDP outweigh the risks of losing messages.

Papertrail’s limits also try to expose the throughput which we believe each protocol is well-suited for. A single system generating 5,000 messages per second of logs via UDP (let alone 10,000 or 50,000) is likely to experience at least some loss, possibly even at the sender’s NIC buffer. If some loss during periods of high log volume is acceptable, UDP may still be fine.

On the other hand, customers regularly generating more than 3,000 messages per second from a single sender, and who want reliable delivery, will be happier with TCP - even without the rate-limits described here. remote_syslog makes changing protocols very easy, as do most other sending daemons.

Questions

We took a lot of time and care in determining a rate limiting policy which would only affect senders that might adversely affect Papertrail’s service. However, we understand that rate limits always introduce some concerns. If there’s anything we can do to help, like recommending a protocol, reviewing a high-traffic system, or providing a configuration for a different protocol, we want to help. Let us know at support@papertrailapp.com.

Posted by @troyd on

logs.papertrailapp.com TLS/SSL cert will change January 27

Summary

The TLS/SSL certificate used by 1 of Papertrail’s syslog destinations, logs.papertrailapp.com, will change on Wednesday, January 27, 2016. Some log senders which were configured before June, 2014 and are using TLS/SSL need to be modified to trust the new certificate.

We’ve tried to make this change as fast and painless as we know how to. It should take less than 5 minutes. If Papertrail’s team can save you time, please email support@papertrailapp.com.

Updates

  • 2016-01-27: logs now presents the new wildcard certificate and full chain.
  • 2016-01-25: Reminder email sent.
  • 2016-01-20: 2-hour test performed. All customers with any senders which failed TLS handshaking were emailed.
  • Week of 2016-01-07: All customers who may be affected were emailed.

Does this affect me?

If this change may affect your systems, Papertrail will email you the week of January 4 and again before January 27.

This change only affects senders which meet all 4 of:

  • log to logs.papertrailapp.com (rather than logs2.papertrailapp.com or logs3.papertrailapp.com)
  • use TCP with TLS/SSL encryption (rather than UDP or cleartext TCP)
  • transmit logs with remote_syslog, rsyslog, or syslog-ng (rather than other daemons)
  • began logging to Papertrail prior to June, 2014

Only senders matching all of these constraints are affected. If 1 or more are not true, your senders are not affected.

Papertrail will email you the week of January 4 if any of your systems currently log to, or recently logged to, logs.papertrailapp.com.

What do I need to change?

remote_syslog

Download and install remote_syslog v0.16, released 2016-01-05:

remote_syslog is a single self-contained program. To upgrade:

  • .tar.gz: uncompress the archive. Copy the new remote_syslog binary on top of the existing one, overwriting it.
  • .rpm/.deb: install the package.

Finally, restart remote_syslog. That’s it.

More: remote_syslog setup

rsyslog

First, see whether rsyslog was configured before June, 2014. Run: grep -r ActionSendStreamDriverPermittedPeer /etc/rsyslog* (this searches both /etc/rsyslog.conf and any files under /etc/rsyslog.d).

grep may find a configuration directive like this: $ActionSendStreamDriverPermittedPeer *.papertrailapp.com. If this directive is already present, then the system was configured after June, 2014 and no change is needed.

If grep finds no matches, then the system needs 2 rsyslog configuration changes. Edit the file containing the Papertrail destination, usually /etc/rsyslog.conf, then:

1. Update trusted certificates

Find the line beginning with $DefaultNetstreamDriverCAFile. Download and save papertrail-bundle.pem in the location listed. For example, if the line reads:

    $DefaultNetstreamDriverCAFile /etc/syslog.papertrail.crt

... save the new file to /etc/syslog.papertrail.crt by running:

    sudo curl -o /etc/syslog.papertrail.crt https://papertrailapp.com/tools/papertrail-bundle.pem

Alternatively, save the new certificate file to a different location, then change the $DefaultNetstreamDriverCAFile line to point to that location.

2. Accept wildcard certificates

On the line below $DefaultNetstreamDriverCAFile, copy and paste this 1 line:

    $ActionSendStreamDriverPermittedPeer *.papertrailapp.com

3. Restart

Restart rsyslog with:

    sudo killall -HUP rsyslog rsyslogd

More: rsyslog TLS setup.
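If you have many hosts to check, the grep test from the start of this section can be wrapped in a small function. This is a minimal sketch: the function name is made up, and the paths in the usage comment are the usual rsyslog locations, which may differ on your systems.

```shell
# needs_rsyslog_update PATH...: succeed (exit 0) when none of the given
# files or directories contains the wildcard-peer directive, meaning the
# two configuration changes above are still needed.
# The function name is an assumption for illustration.
needs_rsyslog_update() {
    # -r: recurse into directories, -q: quiet, -s: ignore missing paths
    ! grep -rqs 'ActionSendStreamDriverPermittedPeer' "$@"
}

# Usage (paths are typical defaults, adjust as needed):
#   if needs_rsyslog_update /etc/rsyslog.conf /etc/rsyslog.d; then
#       echo "update needed"
#   fi
```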

syslog-ng

Follow steps 1 and 3 of syslog-ng setup. This will update the TLS certificates trusted by syslog-ng. Step 2 (configuration) has not changed and can be skipped.

When will this happen?

Please make the change above any time between now and January 26, preferably before January 20.

Papertrail will change certificates on Wednesday, January 27, 2016 during the US business day. With the configuration changes above, senders will experience no impact.

  • 1 week prior (Wednesday, January 20): Papertrail will change to the new certificate for 2 hours, identify senders which do not reconnect, and then revert to the current certificate. I’ll send a second notification on Thursday, January 21 to customers with 1 or more senders which did not reconnect.

  • 2 days prior (Monday, January 25): I’ll send one last email.

If this doesn’t affect you, I’m sorry for the inconvenience. Know that I’ve used all the information available to us to decide whether this may be relevant, and that we’re only taking this much care in order to prevent a service problem.

Can I test my systems?

Absolutely. First, we’re happy to review your configuration. Just reply and attach it.

Second, here’s how to verify that a sending system is updated:

  1. Visit Destinations and click “Create log destination.”

    The new destination will be on a hostname which already presents the new certificate. It will already function the way that logs.papertrailapp.com will after January 27.

  2. On an existing TLS-enabled system, change the rsyslog, remote_syslog, or syslog-ng configuration to use the new log destination.

  3. Visit Papertrail’s Dashboard and click the “All Systems” group.

    Scroll to this system’s name. You should now see 2 Papertrail entries for this system, not 1, since Papertrail will treat the logs sent to this new destination as a second system. If you do see 2 Papertrail entries, and the new entry has only a few minutes worth of logs, the system is configured correctly to accept the new certificate.

  4. Change the system back, then visit Destinations to remove the test log destination.

Why is this necessary?

Papertrail’s first destination, logs.papertrailapp.com, presents its TLS certificates in an improper order, and prior to June, 2014, Papertrail’s setup instructions relied on that order. In June, 2014, the instructions were updated to work with both correctly-behaving newer destinations and the existing logs presentation. Papertrail’s original logs.papertrailapp.com TLS certificate will expire on February 6, 2016, so senders which were configured with the older instructions need to be able to accept a new certificate before then.

In addition, newer log destinations present a wildcard certificate (*.papertrailapp.com). rsyslog requires an explicit configuration directive to trust it. Since Papertrail did not use a wildcard certificate until June, 2014, rsyslog instances configured before then do not include this directive.

The changes above update senders configured prior to June, 2014 so they:

  • trust the same set of root certificate authorities as Mozilla does, so that future new certificates signed by trusted roots will work automatically.
  • trust Papertrail’s wildcard certificate. That way, logs can present the same wildcard certificate already presented by other destinations.

This was not caused by a security problem. The current configuration encrypts log data and verifies certificate trust as intended.

We’ve put a ton of effort into making encryption easy, but it’s not completely effortless. This is one case.

Questions

All of us at Papertrail want to make this as simple for you as we know how to. While there’s no way to avoid this change, if we can do anything to make it easier, we want to hear it. Please email support@papertrailapp.com.

Posted by @rpheath on

Main Menu Changes (or, Where'd My Profile Go?)

We’ve made a few small changes to the main navigation in the header. Here’s what it used to look like:

Old Navigation Menu

It now looks like this:

Updated Navigation

The old “Account” tab is now a more accurate “Settings” tab, and we got rid of the “Me” tab altogether, replacing it with a logout link.

And finally, your Profile can now be found in the Settings area:

New Profile Location

Posted by @lmarburger on

Temporarily Mute Log Senders

Papertrail now has a way to stop processing logs from a sender. A sender can be muted for an hour during maintenance or to run a load test.

This augments the filtering improvements released last week.

Background

In the middle of an incident, planned or unplanned, it’s important to have quality logs. A down database, a misconfigured logger, or a load test could all produce a flood of useless log messages. If it’s a database that’s down for maintenance, being overwhelmed with connection errors from web servers is unhelpful. Not only are these errors pure noise, they consume log data transfer without adding any value. Nobody benefits.

Mute the log sender before it pollutes your log repository with useless messages. Let Papertrail enable the sender after the chosen amount of time or come back and enable it manually.

This change is part of our broader effort to provide the logging flexibility you need to handle unexpected circumstances:

When you encounter ways in which centralized control or filtering would make your logs more powerful, please let us know.

Posted by @lmarburger on

Flexible Log Filtering

The set of interesting log messages changes depending on the context. Log messages which are useful while in the middle of an outage, or to debug a kernel error, can be noise during day-to-day operations. As of today, it’s easier to choose the messages which are currently useful to you.

Flexible, Centralized Filters

It’s common to use systems and services with logging behavior that can’t be modified, like systems with strict change control, managed services, and closed-source apps. Even when senders can be changed, Papertrail offers a central place to apply flexible filters to many log streams. Log filters are a way to drop unnecessary messages and pay only for the messages you find useful.

Over time, the filters (regular expressions) became long and hard to read. Log filters can now be broken into their component parts and each can have a note describing its purpose.

Disable or Enable Filters

A new feature in this release is a way to toggle an individual filter. This is especially useful for logs that are noise during normal operation but are critical during an incident or to debug a system. It effectively becomes a way to change log verbosity without touching the systems themselves.

For instance, the JVM garbage collection logs can be very noisy, but are very helpful in tuning the GC. Filtering these logs in Papertrail means GC log collection can be enabled while only seeing those log messages at the times they’re needed. No restart of the JVM process required.
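To make the idea of toggled filters concrete, here is a rough local analogy. It is not how Papertrail implements filtering: the filter file format (tab-separated on/off flag, regular expression, and note) and the function name are assumptions made up for this sketch.

```shell
# apply_filters FILTER_FILE: copy stdin to stdout, dropping every line that
# matches the regular expression of an enabled ("on") filter.
# Hypothetical filter file format, one filter per line:
#   on|off <TAB> regular expression <TAB> note describing the filter
apply_filters() {
    awk -v file="$1" '
        BEGIN {
            # Load only the filters that are switched on.
            while ((getline line < file) > 0) {
                split(line, field, "\t")
                if (field[1] == "on")
                    patterns[++n] = field[2]
            }
        }
        {
            for (i = 1; i <= n; i++)
                if ($0 ~ patterns[i])
                    next            # drop the message
            print                   # keep the message
        }
    '
}

# Usage: some_log_source | apply_filters filters.tsv
```

Flipping a filter's first field between on and off changes what gets dropped without touching the system that produces the logs, which is the same effect the toggle provides.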

Head to your Papertrail account and click Filter logs and get rid of those useless logs today.

Posted by @lmarburger on

Making Group Info Pages Faster For Very Large Groups

Some customers have thousands or even tens of thousands of log senders. Viewing the details of these large groups of senders can be quite slow. To speed up page loads, we’ve added pagination and changed how filtering log senders works.

The syntax for filtering a group’s log senders now matches the syntax for adding dynamic log senders to a group.

  • ? matches a single character
  • * matches any number of characters
  • Combine multiple searches with a comma
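The two wildcards behave much like shell glob patterns. As a rough illustration only (Papertrail's own matching may differ, for example in case sensitivity or in whether a pattern must match the whole name), this sketch tests a sender name against a comma-separated filter using the shell's case patterns. The function name and the sender names are made up, and the filter is assumed to have no spaces around the commas.

```shell
# matches_filter NAME FILTER: succeed when NAME matches any of the
# comma-separated patterns in FILTER (? = one character, * = any number).
# Runs in a subshell so the IFS and globbing changes stay local.
# Note: shell patterns also treat [...] specially, unlike the list above.
matches_filter() (
    name=$1
    IFS=','
    set -f                      # keep * and ? from expanding to file names
    for pattern in $2; do
        case $name in
            $pattern) exit 0 ;; # whole-name match against this pattern
        esac
    done
    exit 1
)
```

For example, `matches_filter web-1 'web-?,db*'` succeeds, while `matches_filter web-10 'web-?,db*'` fails because ? matches exactly one character.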

Posted by @troyd on

Welcoming Ryan Heath to Papertrail

It started with an email through Flickr way back in 2008. Papertrail’s co-founder, Eric Lindvall, saw a portfolio which caught his eye. Eric sent an unsolicited message to the designer, Ryan Heath, using the only method that Eric could find — Flickr. After a few interactions, Ryan began collaborating with the team, and a step that’s felt inevitable has finally happened: we’re thrilled to officially welcome Ryan Heath as Papertrail’s UI caretaker.

Ryan has had a hand in Papertrail’s design since day one, but was always limited by his day job, so we offered him a new one :-)

Until recently, Ryan provided design and Web development consulting, contributing to dozens of sites, Web services, and mobile apps. While client work intrigues him, solving UI/UX problems through product design is his true passion.

As Eric’s reaction to Ryan’s design work shows, Ryan is an incredibly talented designer. He also has a few (computer, electrical, software) engineering degrees, and that engineering background helps him understand the problems that Papertrail’s customers use Papertrail to solve. If “Design engineer” were a title, Ryan would have it.

Ryan resides in Morgantown, WV with his wife and two kids. His home office houses, among other things, a plethora of design books, a poster of Don Draper, a printmaker’s desk lamp from the 1950s, a Herman Miller chair, and the infamous contour bottle from Coca-Cola. He’s a collector of objects, to say the least. He enjoys thoughtful design, photography, and golf.

We’ve been fortunate to work with Ryan for years now, and we’re even more excited that he’s joining the team on a full-time basis.

Posted by @leonsodhi on

Retiring SSL 3.0

Summary

On Friday, June 12, 2015, Papertrail will remove support for the outdated security protocol SSL 3.0, which was released in 1996 and has since been superseded by TLS.

TLS is automatically used by nearly all modern loggers that can send encrypted log messages, so for the vast majority of customers this change will have no impact. However, there are exceptions to this which are explained in the next section.

In addition to this blog post, next week, we’ll directly email all customers who we believe will or are likely to be affected.

Update: On May 29, 2015, the SSL 3.0 retirement date was changed from June 5 to June 12.

What action do I need to take?

For clear text logging (which includes all UDP logging and some TCP logging), no changes are necessary.

For those sending log messages in an encrypted form using nxlog, an upgrade to 2.9.1347 will be needed. Other loggers may also be affected, but at this time we aren’t aware of any. Next week, we’ll directly email all customers who we believe will or may be affected, and will work with anyone that needs help upgrading or switching to an alternative logger that supports at least TLS 1.0.

If you’re concerned that you may be impacted, won’t be able to upgrade affected senders by June 12th, or have other questions, please email us.

Why is this happening now?

On October 14, 2014, the POODLE vulnerability was publicly disclosed. It described how a man-in-the-middle attack could be performed that would reveal plain text data from an encrypted log packet transmitted via SSL 3.0. This attack illustrates a fundamental flaw in the protocol which cannot be properly patched. As a result, most vendors released updates which disabled it.

In accordance with best security practices, Papertrail applied this patch to all web servers and log ingestion points on the same day that POODLE was announced. However, due to a misconfiguration, SSL 3.0 remained enabled on the latter and was deactivated in the last few weeks as part of an unrelated patch.

This 2nd update was applied to each ingestion point over several days, which meant that some syslog endpoints were patched while others weren’t. Due to DNS round robin, some nxlog clients would successfully connect to the unpatched endpoints while others would fail to connect to the patched ones.

After every ingestion point had been updated, it was discovered that recent versions of nxlog only support encrypted logging via SSL 3.0 and thus could not establish a secure connection to patched endpoints.

After speaking with customers, we decided to re-enable SSL 3.0 to provide a reasonable amount of time for loggers to be upgraded. We will be disabling SSL 3.0 again on June 12th.

If you have any questions, please email us.

Posted by @troyd on

Papertrail joins SolarWinds and accelerates growth

On behalf of all of us at Papertrail, I’m thrilled to announce that Papertrail is now part of SolarWinds. Read more here.

As I said in the announcement, joining the SolarWinds family gives us more resources to make Papertrail better and to do so faster. As a first example, SolarWinds helped Papertrail acquire the domain name papertrail.com, which the team has wanted to obtain for years.

Details: SolarWinds Adds Cloud-based Log Management Capabilities with Acquisition of Papertrail

Why did Papertrail join SolarWinds? How will this benefit customers?

Eric and I started Papertrail to solve a problem we had: create the log management service that we wanted to use as developers and operations engineers ourselves, then make it as easy and as powerful as we can.

We’ve done that: Papertrail is a thriving, profitable business trusted by tens of thousands of customers. That’s been fulfilling enough that we’re ready to “double down” and deliver amazing log management and infrastructure visibility at an even larger scale.

Papertrail chose to work with SolarWinds because we share an appreciation for simple, practical products that shine in everyday use. As you’d expect, you’ll continue to work with the same Papertrail team (and we’ll be growing). SolarWinds sees tremendous potential in Papertrail and this combination enables Papertrail to get better, faster.

Posted by @lmarburger on

New search alerts dashboard

Search alerts are my favorite feature of Papertrail. Almost every organization considers at least one subset of logs worthy of human attention, whether as a daily summary or an instant notification. That’s precisely why alerts exist. We’ve released a number of improvements to how alerts are managed. Below are a few of our favorites.

Alerts Dashboard

Search alerts and their details are listed on the new alerts dashboard and when editing a saved search. This helps answer questions like “Who’s this emailing?” or “What’s the metric name in Librato?” Creating a new alert lists all the available services so it’s clear what you can do with alerts and how.

Multiple Alerts

Add multiple alerts to a saved search using the same alert service. For example, a saved search could ping several webhooks or email individual members of your team. Thresholds and frequencies are configured independently of other alerts.

Get Started

Pick a saved search that you or someone on your team might want to skim once a day and choose to receive it in an email. You’ll soon find yourself with a small “robot army” of alerts, from a daily email of failed CSRF requests to a graph of how frequently a Ruby VM segfaults. Enjoy!

Posted by @troyd on

DNS Outage on Monday, December 1, 2014

Summary

Most DNS requests for papertrailapp.com timed out for 6 hours, affecting many Web visitors and a small number of log senders. The outage was absolutely unacceptable. Personally and on behalf of Papertrail, I’m sorry.

Although the outage was caused by a DDoS against our DNS service provider, it’s our problem. We’re implementing complete DNS provider redundancy so that similar problems do not affect Papertrail.

What happened?

For about 6 hours on Monday, December 1, most DNS requests for papertrailapp.com timed out. Between approximately 19:15 UTC and 00:45 UTC the following day (11:15 - 16:45 Pacific), nearly all non-cached DNS requests timed out. As the attack pattern changed, a much smaller percentage of requests continued to time out until about 06:00 UTC (22:00 Pacific).

This DNS outage mostly impacted users trying to reach Papertrail’s site. Comparing Papertrail’s log volume during this outage to typical volume on a Monday, we did not see a noticeable change. The outage impacted few log senders because the resolved log destination was cached by most sending daemons as well as local and remote DNS servers. If your site was impacted, please contact us so we can understand and mitigate the impact of future problems. If you already reported this, we’ll be in touch.
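To illustrate why cached answers shielded most senders, here is a toy TTL cache in Python. It is a simplified model, not how any particular syslog daemon or resolver is implemented; the class name, the injected lookup function and the clock are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy resolver cache: answers are reused until their TTL expires."""

    def __init__(self,
                 lookup: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup  # upstream query; returns (address, ttl_seconds)
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expires_at)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no upstream query needed
        # Cache miss or expired entry: this call fails if upstream DNS is down.
        address, ttl = self._lookup(name)
        self._cache[name] = (address, now + ttl)
        return address
```

In this model, a sender that resolved its log destination before the outage keeps working until the cached entry expires, while one that starts up (or whose entry expires) during the outage hits the failing upstream lookup.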

The outage was caused by a very large distributed denial of service (DDoS) attack against the company from which Papertrail purchases DNS service. While the attack targeted one of our service providers, we’re responsible for delivering Papertrail to you and this outage is our responsibility.

We posted frequent updates to Papertrail’s status site and @papertrailops, though for reasons discussed below, the updates were not as accessible as they should have been.

What we’re doing

While a DDoS against a service provider was the cause, Papertrail’s responsibility is to ensure reliable resolution of our own zone. In doing that, we made 2 mistakes:

  • Underestimating the duration of the longest foreseeable DNS outage
  • Based on that underestimation, depending on a single DNS infrastructure

Our estimate was based on the many large attacks that this service provider has handled in the 2 years we have used it. This attack was too large and too distributed for their anycast network, and filtering the attack took much longer than we had anticipated.

While our service provider will make their network more resilient to attacks, making any single DNS infrastructure more resilient is not the solution. An attacker will always be able to generate a larger DDoS (and service problems other than DoSes will still impact it). For Papertrail and other maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS.

Authoritative DNS has one very thoughtful design property: it’s inherently distributed. Client resolvers (like typical Web browsers and log senders) will try to query multiple servers. Nothing about the service we purchase from our current provider prevents using additional providers or our own DNS infrastructure. Using multiple DNS infrastructures takes more effort to design and maintain – effort that, because we underestimated the impact of a severe outage, we invested elsewhere. Monday showed us that the extra effort is worthwhile and necessary.
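The failover behavior described above can be sketched in a few lines of Python. This is a simplified model of a resolver walking a list of authoritative servers, not a real DNS client; the query function is a stand-in, and the nameserver names in the usage note are hypothetical.

```python
from typing import Callable, Iterable, Optional


def resolve_with_failover(name: str,
                          nameservers: Iterable[str],
                          query: Callable[[str, str], str]) -> Optional[str]:
    """Ask each authoritative server in turn and return the first answer.

    `query(server, name)` stands in for a real DNS query and is expected
    to raise TimeoutError when the server does not respond.
    """
    for server in nameservers:
        try:
            return query(server, name)
        except TimeoutError:
            continue  # this server (or its whole provider) is unreachable
    return None  # every listed server timed out
```

If every entry in the list belongs to one provider (say ns1.provider-a.example and ns2.provider-a.example), an outage at that provider leaves nothing to fall back to. Adding a server from a second, independent provider gives the loop somewhere else to go.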

Relying on one DNS infrastructure, no matter how large or distributed, is an unnecessary risk that we unintentionally took. On behalf of Papertrail, I apologize for our mistake and the impact it had on your operations.

Also, during the outage we realized that our status updates were not as accessible as they should be. Our status site used a hostname under the papertrailapp.com domain name, so it was also inaccessible during the outage.

We’ve registered papertrailstatus.com for Papertrail’s status site. The status site does not depend on any of the same DNS infrastructures that Papertrail itself does (and was already hosted independently). We added a reference to @papertrailops on the Twitter bio for @papertrailapp and started to retweet significant updates from the more visible @papertrailapp account.

We’ve started planning how Papertrail can use multiple, independent anycast DNS infrastructures so that a DDoS or outage of one DNS service does not affect Papertrail. If you’d like more detail about any of this, please contact us.

Update (2014-12): This change is complete and Papertrail no longer depends on any single DNS infrastructure. Papertrail wrote and released dnsync to synchronize a zone between two providers, one of which did not support AXFR.
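For readers wondering what synchronizing a zone without AXFR can look like, here is a rough Python sketch of the general idea. It is not dnsync's actual implementation. It assumes each provider's records have already been fetched (for example through that provider's own API) into plain sets, and the Record type and plan_sync function are names invented for this example.

```python
from typing import NamedTuple, Set, Tuple


class Record(NamedTuple):
    name: str
    rtype: str
    value: str
    ttl: int


def plan_sync(primary: Set[Record],
              secondary: Set[Record]) -> Tuple[Set[Record], Set[Record]]:
    """Return (to_create, to_delete) that would make `secondary` match `primary`."""
    to_create = primary - secondary  # present on the primary only
    to_delete = secondary - primary  # stale or changed on the secondary
    return to_create, to_delete
```

A record whose value or TTL changed shows up in both sets: the old version is deleted from the secondary and the new version is created.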