Dominic Cleal's Blog

Severe UDP packet loss
While looking after a UDP-based service, it came to my attention that we were losing a significant number of inbound packets. The first place to start is netstat(8): the -s option shows statistics for the various protocols (add -u for UDP only, or -t for TCP).

Example output of netstat -su:
$ netstat -su
Udp:
    2829651752 packets received
    27732564 packets to unknown port received.
    1629462811 packet receive errors
    179722143 packets sent
This shows the total numbers of UDP packets received and sent, plus two extra metrics. The second line counts UDP packets that were sent to a port without a listening socket, and the third line counts packets that were dropped by the kernel.
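On Linux these counters come from /proc/net/snmp, which is where netstat reads them, so you can also poll them directly if you want to watch the error rate yourself. Below is a minimal Java sketch of such a poller (the class name is my own):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class UdpCounters {
    public static void main(String[] args) throws IOException {
        // The Udp: section of /proc/net/snmp is a line of column names
        // followed by a line of values; pair them up and print them.
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/snmp"));
        String[] names = null;
        for (String line : lines) {
            if (!line.startsWith("Udp:")) continue;
            String[] fields = line.split("\\s+");
            if (names == null) {
                names = fields;   // e.g. Udp: InDatagrams NoPorts InErrors OutDatagrams
            } else {
                for (int i = 1; i < names.length && i < fields.length; i++) {
                    System.out.println(names[i] + " = " + fields[i]);
                }
            }
        }
    }
}

The InErrors column corresponds to the "packet receive errors" line in the netstat output above.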

Each socket has a pair of fixed-size buffers sitting between the kernel and the application: one for receiving and one for sending data. When the application fails to read from the receive buffer fast enough, packets are discarded and the receive error counter is incremented.

As no technical blog post is complete without a pretty graph, below is a graph generated using Munin, showing the UDP traffic flowing on one particular system:

[Munin graph: UDP packets received and packet receive errors]

In the above graph, the dominant line is the received packets, and the turquoise line lower down shows the packet receive errors.

On Linux, the buffer sizes are controlled by a group of sysctl parameters, with rmem* being the receive buffers and wmem* the send buffers:
  • net.core.rmem_default
  • net.core.rmem_max
  • net.core.wmem_default
  • net.core.wmem_max
Checking a Debian Etch system, the default maximum is about 128 kB and the default size is 120 kB. I've shown them here using the sysctl(8) tool.
$ sysctl -a | grep 'net.core.[rw]mem'
net.core.wmem_max = 131071
net.core.rmem_max = 131071
net.core.wmem_default = 122880
net.core.rmem_default = 122880
Using sysctl, you can update the values of these parameters with the -w option:
$ sudo sysctl -w net.core.rmem_max=1048576 net.core.rmem_default=1048576
net.core.rmem_max = 1048576
net.core.rmem_default = 1048576
Any application will now get the increased buffer sizes on its sockets by default, which, provided it doesn't have other bottlenecks limiting its throughput, gives it a little more headroom. It's also possible to raise only the maximum and have the application itself request a larger socket buffer - see socket(7) for more info.
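For example, a Java application could request a bigger receive buffer on its own UDP socket; on Linux the kernel caps such requests at net.core.rmem_max, so it's worth reading the value back to see what was actually granted. A minimal sketch (the port number is arbitrary):

import java.net.DatagramSocket;
import java.net.SocketException;

public class BiggerBuffer {
    public static void main(String[] args) throws SocketException {
        DatagramSocket socket = new DatagramSocket(5000);
        System.out.println("default SO_RCVBUF: " + socket.getReceiveBufferSize());
        // Request a 1 MB receive buffer; the kernel caps this at
        // net.core.rmem_max, so read it back to see what we really got.
        socket.setReceiveBufferSize(1048576);
        System.out.println("granted SO_RCVBUF: " + socket.getReceiveBufferSize());
        socket.close();
    }
}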

In our case, the graph clearly shows that the problem has been solved for the last few days. We had to apply both of the changes mentioned:
  • Increasing the buffer size, which was done through the application's own configuration (raising the net.core.rmem_max parameter to allow it, and leaving rmem_default alone)
  • Tweaking the application to increase its throughput, using more controlled buffering internally rather than relying on the kernel's socket buffering
Only one packet has been lost since the changes were made, which is an acceptable error rate for this application given its throughput.
The devil's in the detail
A seemingly innocent refactor of a Java EE web application last week turned into a small nightmare due to a tiny detail in the servlet specification that I hadn't taken into account.

The webapp's main purpose is to proxy requests to a backend server using the jEasy Extensible Proxy (J2EP) project (from Google's Summer of Code). This lets us implement custom logic for choosing which backend server to route a request to (tied to user sessions, the user's permissions and so on) with very little effort.

J2EP is implemented as a filter, with a second rewriting filter layered on top of the proxying filter. Originally, the J2EP module performed session validation and contained further logic for handling the server choices - the refactor moved this validation out into its own filter, layered on top of the rewriting filter.

Once this was done, I found after lots of puzzling debug output that the data in POST requests to the server was simply missing after passing through the proxy. The detail that caught me out was hidden in the JavaDoc for ServletRequest.getParameter(String name):

public java.lang.String getParameter(java.lang.String name)
Returns the value of a request parameter as a String, or null if the parameter does not exist. Request parameters are extra information sent with the request. For HTTP servlets, parameters are contained in the query string or posted form data.

[..]

If the parameter data was sent in the request body, such as occurs with an HTTP POST request, then reading the body directly via getInputStream() or getReader() can interfere with the execution of this method.


As it turns out, the code that had been moved into the new, top-layer filter called getParameter() in a couple of places. The J2EP proxy filter later used getInputStream() to pass the request body into the new outbound request. Even though the stream was read after the initial parameter read, the reverse of the situation described in the specification applied: getParameter() had already consumed the posted form data, so getInputStream() returned an empty stream!

I wish an IllegalStateException had simply been thrown rather than returning useless streams... *sigh*

(note: this was under Apache Tomcat 5.5)
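A common workaround for this class of problem is a request wrapper that buffers the body once and re-serves it to everything downstream. Below is a minimal sketch against the Servlet 2.4 API that Tomcat 5.5 implements (the class name is mine, and this isn't the fix we applied). Note that it only restores getInputStream(); a filter that also needs the POST parameters would have to parse them out of the buffered bytes itself:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.servlet.ServletInputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;

// Hypothetical wrapper: drains the body once up front, then hands every
// caller of getInputStream() a fresh stream over the buffered bytes.
public class BufferedBodyRequest extends HttpServletRequestWrapper {
    private final byte[] body;

    public BufferedBodyRequest(HttpServletRequest request) throws IOException {
        super(request);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InputStream in = request.getInputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        body = buf.toByteArray();
    }

    @Override
    public ServletInputStream getInputStream() {
        final ByteArrayInputStream in = new ByteArrayInputStream(body);
        return new ServletInputStream() {
            public int read() {
                return in.read();
            }
        };
    }
}

Installed at the top of the chain - chain.doFilter(new BufferedBodyRequest((HttpServletRequest) request), response) - every later filter then sees a full copy of the body.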
A less irritating use for vacation responders
Small tip for Exim filters when dealing with e-mail alerts. This is part of my current .filter file:
# Exim filter
# vim: ts=4 et
# Matches: SYSTEM Resolved ... Notification for Service
if $h_subject: matches "\\N^(\\S+)\\s+(\\S+).*Notification for Service\\N" then
    unseen save "mail/ALERTS"
    # Only notify if system name given in $1 and status is changing
    if $2 is "WIP" or $2 is "Resolved" then
        mail to notification@example.com
             from $reply_address
             subject $h_subject:
             text ": $message_body"
             log .alerts/alerts.log
             once .alerts/suncp.$1.$2.db
             once_repeat 3h
    endif
endif
First off, all alerts get saved into a separate mailbox (as well as my inbox). Using Thunderbird and the Mailbox Alert extension on my work computer, I can distinguish between normal e-mails and incident alerts.

Next, the subject line is examined for particular keywords. The incoming messages have the subject:
SYSTEM Resolved ... Notification for Service
Where the first word is the system hostname and the second is the incident status. The mail command in the filter then creates a new e-mail that is sent to an e-mail-to-SMS service, essentially relaying the message.
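For illustration, the same capture expressed in Java, with a made-up subject line:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubjectMatch {
    public static void main(String[] args) {
        // The same pattern as the filter's "matches" condition above
        Pattern p = Pattern.compile("^(\\S+)\\s+(\\S+).*Notification for Service");
        Matcher m = p.matcher("webserver01 Resolved #1234 Notification for Service");
        if (m.find()) {
            System.out.println("system ($1) = " + m.group(1));  // webserver01
            System.out.println("status ($2) = " + m.group(2));  // Resolved
        }
    }
}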

As there can be many updates to an ongoing ticket, I've used the vacation responder options "once" and "once_repeat" to limit the notifications to one every 3 hours, per system and per status. That way I should receive notifications just on the initial alert and again when it's resolved. A separate vacation database file (specified with "once") is used for each combination of system and status, stored in ~/.alerts/ and named using the $1 and $2 variables, which hold the system name and alert status captured by the "matches" regular expression on the subject line.