We show how to use streaming statistics with the Dragonfly MLE, Redis, and OpnIDS to compute a robust indicator of outlier-ness known as the Median Absolute Deviation (MAD). An example Dragonfly script for computing MAD is also available in the GitHub repo.
MAD is a measure of the variability of a population (see Wikipedia). It is conceptually similar to the standard deviation but less sensitive to extreme values in the data. In this post, we show how to compute a streaming estimate of the median flow size and then use that estimate to score how much of an outlier each network flow actually is.
Though we are using flow size for demonstration, MAD is a general technique that could be applied to many different network use cases. For example, MAD could be applied to producer-consumer ratios to detect potential exfiltration, or it could be applied to message counts to detect a malfunctioning system. In any of these scenarios, values that sit an extremely large number of median deviations above (or possibly below) the median should be viewed as unlikely.
Streaming Approximation of the Median Using Stochastic Averaging
The median is, of course, the value in the middle if one sorts an entire data set from smallest to largest. However, sorting an entire data set to compute the exact median can be infeasible if the data are large (perhaps infinite) and cannot be stored effectively. Streaming computations can be used to find an approximation to the median while only seeing each data point once, limiting storage to a few sufficient statistics.
To approximate the median flow size, we use a straightforward approach called stochastic averaging, following the description in Byron Ellis's Real-Time Analytics (Wiley 2014). We use Redis's Hash data structure (which works like a Python dictionary) to track the value of the median and the learning rate over time.
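A minimal sketch of the idea in Python, with a plain dictionary standing in for the Redis hash. The update rule shown here (move the estimate up by a small step when the new value is above it, and down when it is below) is one common form of stochastic averaging for the median. The field names `median` and `eta`, the fixed step size, and the helper name `update_median` are assumptions made for illustration and are not taken from the actual Dragonfly script.

```python
import random

def update_median(state, x):
    """Nudge the running median estimate one step of size eta toward x."""
    m = state["median"]
    if x > m:
        m += state["eta"]
    elif x < m:
        m -= state["eta"]
    state["median"] = m
    return m

# A dict stands in for the Redis hash; "eta" is the (assumed) step size.
state = {"median": 0.0, "eta": 1.0}

random.seed(7)
# Hypothetical flow sizes in bytes, drawn uniformly between 0 and 1000
flows = [random.uniform(0, 1000) for _ in range(20000)]
for size in flows:
    update_median(state, size)

estimate = state["median"]  # settles near the true median of 500
```

Each point is seen once and only two numbers are stored. A larger step size converges faster but gives a noisier estimate; a smaller one does the opposite.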
MAD uses two median calculations to determine how many median absolute deviations each data point lies from the median. First, we need the median of the data. Then, using that median, we compute the absolute deviation of each data point from it and take the median of those deviations. Dividing a point's deviation by this median deviation gives a standardized measure of distance from the central tendency of the population (similar to how the number of standard deviations from the mean is used to compute the Z-score).
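For a data set small enough to hold in memory, the two median calculations can be written out directly. The sketch below is a batch illustration only, not the streaming method used in this post; the function name `mad_scores` and the sample flow sizes are made up for the example.

```python
from statistics import median

def mad_scores(values):
    """Return (median, MAD, per-point scores) for a small in-memory data set."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    # Note: this divides by zero if more than half the values are identical
    scores = [d / mad for d in deviations]
    return center, mad, scores

# Hypothetical flow sizes in bytes; the last one is deliberately extreme
sizes = [100, 120, 130, 150, 170, 5000]
center, mad, scores = mad_scores(sizes)
```

Here the median is 140 and the median deviation is 25, so the ordinary flows score below 2 while the 5000-byte flow scores 194.4. Note that the extreme value barely moves either median, which is what makes the measure robust.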
To compute MAD in a streaming environment, we use two nested streaming median computations. This does require some time for "burn-in" but rapidly achieves a stable estimate of the median.
When a new flow enters the MLE, we grab the total bytes of the flow and, using the previous median, compute the deviation of the new point from that median and compare it with the previous median deviation. We then use the new flow size and its deviation to update each of our running medians.
To compute the outlier score, we take the current deviation from the median and divide by the median deviation. This multiple indicates how many standard units the current data point is from the center of the entire population. The higher this multiple, the more unexpected the flow. For network flows, we are typically only interested in the right tail of the distribution, that is, large flows.
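The per-flow procedure can be sketched in Python as two nested running medians, one for the flow size and one for the absolute deviation. This is an illustration under stated assumptions rather than the Dragonfly script itself: a dictionary stands in for the Redis hash, the update is a fixed-step stochastic averaging rule, and the names `step`, `score_flow`, `mad`, and `eta` are hypothetical.

```python
import random

def step(estimate, x, eta):
    """Move a running median estimate one step of size eta toward x."""
    if x > estimate:
        return estimate + eta
    if x < estimate:
        return estimate - eta
    return estimate

def score_flow(state, total_bytes):
    """Score one flow against the previous estimates, then update them."""
    deviation = abs(total_bytes - state["median"])
    # Guard against dividing by zero before the estimates have warmed up
    mad = state["mad"]
    score = deviation / mad if mad > 0 else 0.0
    # Update both running medians with the new point and its deviation
    state["median"] = step(state["median"], total_bytes, state["eta"])
    state["mad"] = step(state["mad"], deviation, state["eta"])
    return score

state = {"median": 0.0, "mad": 0.0, "eta": 1.0}
random.seed(11)
# Burn-in on hypothetical flow sizes drawn uniformly between 0 and 1000 bytes
for _ in range(20000):
    score_flow(state, random.uniform(0, 1000))

typical = score_flow(state, 520.0)    # close to the median: small score
large = score_flow(state, 50000.0)    # far in the right tail: large score
```

Scores produced during burn-in are not meaningful, so a real deployment would likely ignore them until the estimates stabilize. To focus on the right tail only, one could keep the sign of the deviation and alert only on large positive scores.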
The famous quote that "the most exciting phrase to hear in science ... is not 'Eureka!' but 'That's funny...'" (popularly attributed to Isaac Asimov) holds just as well for threat hunting as it does for science. The challenge is that humans with limited visibility have a tendency to overreact to events that may seem unexpected but are actually quite common. Statistics such as MAD can be used to help quantify whether events are truly unexpected, allowing for the appropriate response.
Redis and Dragonfly MLE are a powerful combination for doing streaming analytics such as MAD on network traffic in near real-time. Get them now at OpenIDS.io.