Papertrail

The revolution will be verbosely {,b}logged

Introducing Syslog Rate Limits

Posted by @jpablomr on

Summary

Occasionally, a misconfigured log sender will generate an astonishingly high volume of log data. Because UDP doesn’t offer backpressure, a misconfigured UDP sender can generate hundreds of thousands of packets per second with no regard to whether Papertrail accepts or even receives the logs. To any other service, this activity would be a denial-of-service attack. It’s our responsibility to ensure that such a misconfigured (or even malicious) sender can’t cause problems for other Papertrail customers, while also making everyone’s logging service as painless and predictable as we possibly can.

Until now, Papertrail has handled log floods reactively and manually. With a handful of incidents under our belt, we’re now comfortable using proactive rate limits to automatically identify and minimize the impact of these unintentional floods (particularly from UDP senders, where backpressure is not possible).

These syslog rate limit rules will go live on Thursday, February 11th, 2016. As explained below, customers should not see a difference in log delivery reliability from today. In the future, Papertrail will also periodically email customers who regularly reach the rate limits, simply to ensure that no one is surprised.

Our job is to make this simple and painless for you, so if we can answer any questions or explain in more detail, we want to know.

Updates

2016-02-11: Rate limits have been enabled.

A quick refresher on UDP

UDP is a great protocol for sending information with minimal overhead. Its simplicity and ‘fire and forget’ model make it a practical alternative to TCP for use cases where losing a few packets is not critical. Nevertheless, there is a catch: when a UDP packet is dropped (due to network or device issues), the sender is never told. Since the UDP sender doesn’t know that packets were lost, it can’t moderate its transmission rate. A UDP sender will continue to send as fast as it can, whether or not the recipient receives the data.

That feedback from the receiving side, called “backpressure,” is what lets TCP realize that a link is congested and slow its sending rate.

When a UDP sender starts sending unusually large amounts of data, the probability that some of the data doesn’t arrive increases. To picture this, imagine taking notes during a speech: if the speaker talks too quickly, you might not understand everything the speaker says, or you might forget to write some parts down. The speaker doesn’t notice if you can’t hear what they say or if you fail to write everything down; they will continue speaking until they are done.

If your senders are configured to use UDP, a big spike of UDP messages might cause some of them to be dropped and forgotten along the way.
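To make the fire-and-forget behavior concrete, here is a minimal Python sketch. It sends one datagram to a local port that nothing is listening on. The helper name `send_unacknowledged` and the way the unused port is chosen are assumptions made for illustration, not anything specific to Papertrail. The send call still reports success, because UDP only confirms that the datagram was handed to the operating system, not that anyone received it.

```python
import socket


def send_unacknowledged(message: bytes) -> int:
    """Send one UDP datagram to a local port with no listener.

    Returns the number of bytes the OS accepted for sending.
    """
    # Ask the OS for a free UDP port, then close the socket so that
    # nothing is listening on that port any more.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    probe.bind(("127.0.0.1", 0))
    dead_port = probe.getsockname()[1]
    probe.close()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto() returns as soon as the datagram is queued locally;
        # it does not wait for, or learn about, delivery.
        return sender.sendto(message, ("127.0.0.1", dead_port))
    finally:
        sender.close()
```

Calling `send_unacknowledged(b"<14>test message")` returns the length of the message even though the datagram is discarded, which is exactly why a UDP sender cannot slow down on its own.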

Why do we need to set rate limits?

Because we cannot apply backpressure to UDP senders, our only mechanism for fairly handling all customers is to apply rate limits for UDP that are above what we have seen as normal volumes for senders.

To avoid interfering with high-volume log senders, we based the limits on the peak log volume of Papertrail’s highest-volume senders (which Papertrail already measures as part of regular operations).

Separately, we have seen rare cases where a misbehaving TCP sender will cause issues by trying to connect tens or hundreds of times per second. With TLS-encrypted syslog, the TLS handshake overhead can make this look a lot like a CPU exhaustion DoS attack. In extreme cases, these misbehaving senders could interfere with normal operation of Papertrail’s syslog destinations. There’s no graceful way to ignore 1 million packets per second or 10,000 failed TLS connections per second, so the question isn’t whether limits need to exist, just what they should be and how to implement them thoughtfully.

As we have grown, more misbehaving senders have required us to react to each incident and manually limit them until the senders are behaving properly. This is not an ideal solution, as manual tasks are error-prone and we’d rather spend that time improving the service. The rate limits we are introducing are just an automated form of our manual limiting policies.

Automated rate limits will let Papertrail:

  • Guarantee the normal operation of our service for all users.
  • Quickly identify misbehaving senders and help restore them to normal operation.
  • Spend time upgrading our infrastructure and working on new features.

What are the rate limits?

For UDP:

  • 3,000 messages per second per source IP.
  • 10,000 messages per second to a log destination (port).

For TCP:

  • 10 new connections per second per source IP to a log destination (port).

These limits can (and probably will) change, based on our regular measurements. If these limits present a problem, let us know and we’ll work with you. Again, customers won’t see changes from their current log message delivery reliability; these limits just automate what’s already happening. We’ll also be proactively contacting customers who regularly exceed these guidelines, simply so they’re aware.
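For readers who want to see how limits like these interact, here is a rough Python sketch of a per-second counter applied to the two UDP limits. It is not Papertrail’s implementation: the fixed one-second window, the class name `FixedWindowLimiter`, the function `accept_udp`, and the example IP addresses and port are all assumptions chosen for illustration.

```python
from collections import defaultdict

UDP_PER_SOURCE_LIMIT = 3_000        # messages per second per source IP
UDP_PER_DESTINATION_LIMIT = 10_000  # messages per second per destination port


class FixedWindowLimiter:
    """Counts events per key in one-second windows (illustrative only)."""

    def __init__(self, limit_per_second: int) -> None:
        self.limit = limit_per_second
        self.window = None            # the one-second window being counted
        self.counts = defaultdict(int)

    def allow(self, key, now: float) -> bool:
        window = int(now)
        if window != self.window:
            # A new second has started: forget the previous counts.
            self.window = window
            self.counts.clear()
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def accept_udp(per_source, per_destination, source_ip, port, now):
    """Accept a message only if both limits still have room."""
    # Check the source first, so a flooding sender's rejected messages
    # do not count against the destination's budget.
    if not per_source.allow(source_ip, now):
        return False
    return per_destination.allow(port, now)
```

With this sketch, a single source sending 5,000 messages within one second has 3,000 accepted, and a fourth busy source sending to the same destination in that second is cut off once the destination total reaches 10,000. The TCP limit could be modeled the same way by counting new connections keyed on the pair of source IP and destination port, with a limit of 10.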

What protocol should I use to send my messages?

Use TCP if:

  • You can’t miss a single log message.
  • A single sender IP address may regularly generate more than 3,000 messages per second of logs (and those logs are operationally relevant, not noise).
  • Communication between your senders and Papertrail needs to be encrypted.

Use UDP if:

  • Your syslog sender must be non-blocking.
  • Your syslog sender only supports UDP.
  • The benefits of using UDP outweigh the risks of losing messages.

Papertrail’s limits also try to reflect the throughput that we believe each protocol is well suited for. A single system generating 5,000 messages per second of logs via UDP (let alone 10,000 or 50,000) is likely to experience at least some loss, possibly even at the sender’s NIC buffer. If some loss during periods of high log volume is acceptable, UDP may still be fine.

On the other hand, customers who regularly generate more than 3,000 messages per second from a single sender, and who want reliable delivery, will be happier with TCP, even without the rate limits described here. remote_syslog makes changing protocols very easy, as do most other sending daemons.
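If you are unsure whether a sender comes close to the per-source UDP limit, one simple check is to count messages per one-second bucket. The Python sketch below assumes you can obtain a list of send timestamps in seconds (for example, parsed from a local log file); the function names are hypothetical and the threshold is the 3,000 messages per second figure quoted above.

```python
from collections import Counter

UDP_PER_SOURCE_LIMIT = 3_000  # messages per second per source IP


def peak_messages_per_second(timestamps):
    """Return the highest number of messages seen in any one-second bucket.

    `timestamps` is an iterable of send times in seconds (e.g. Unix time).
    """
    buckets = Counter(int(t) for t in timestamps)
    return max(buckets.values(), default=0)


def exceeds_udp_source_limit(timestamps):
    """True if any one-second bucket holds more messages than the limit."""
    return peak_messages_per_second(timestamps) > UDP_PER_SOURCE_LIMIT
```

A sender whose peak stays well under the limit is unlikely to be affected either way; one that regularly exceeds it is a candidate for switching to TCP or for trimming noisy log output.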

Questions

We took a lot of time and care in determining a rate limiting policy which would only affect senders that might adversely affect Papertrail’s service. However, we understand that rate limits always introduce some concerns. If there’s anything we can do to help, like recommending a protocol, reviewing a high-traffic system, or providing a configuration for a different protocol, we want to help. Let us know at support@papertrailapp.com.