Detecting Outlier Flows with MAD

We show how to use streaming statistics with the Dragonfly MLE, Redis, and OPNids to compute a robust indicator of outlier-ness known as the Median Absolute Deviation (MAD). An example Dragonfly script for computing MAD is also available in the GitHub repo.

MAD is a measure of the variability of a population (see Wikipedia). It is conceptually similar to the standard deviation but is much less sensitive to extreme values in the data. In this post, we show how to compute a streaming estimate of the median flow size and then use that estimate to compute how much of an outlier each network flow actually is.

Though we are using flow size for demonstration, MAD is a general technique that could be applied to many different network use cases. For example, MAD could be applied to producer-consumer ratios to detect potential exfiltration, or to message counts to detect a malfunctioning system. In any of these scenarios, data points with extremely large (or, in some cases, extremely small) MAD scores should be viewed as unlikely.

Streaming Approximation of the Median Using Stochastic Averaging

The median is, of course, the value in the middle if one sorts an entire data set from smallest to largest. However, sorting an entire data set to compute the exact median can be infeasible if the data are large (perhaps infinite) and cannot be stored effectively. Streaming computations can be used to find an approximation to the median while only seeing each data point once, limiting storage to a few sufficient statistics.

To approximate the median flow size, we use a straightforward approach called stochastic averaging, following the method described in Byron Ellis's Real-Time Analytics (Wiley 2014). We use Redis's Hash data structure (which works like a Python dictionary) to track the median estimate and the learning rate over time.
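As a sketch of the idea (in Python rather than the MLE's Lua, and with an illustrative fixed step size rather than the adaptive learning rate a production script might track), stochastic averaging nudges the estimate a small step toward each new observation:

```python
class StreamingMedian:
    """Approximate a stream's median by stochastic averaging.

    Each new observation nudges the estimate one small step toward
    itself; the estimate settles where up-steps and down-steps balance,
    i.e., near the value with half the data above and half below.
    """

    def __init__(self, initial=0.0, step=1.0):
        self.median = initial  # current median estimate
        self.step = step       # fixed step size (illustrative)

    def update(self, x):
        if x > self.median:
            self.median += self.step
        elif x < self.median:
            self.median -= self.step
        return self.median
```

The step size trades convergence speed against the size of the oscillation around the true median; in the actual script, the Redis Hash plays the role of `self.median` and `self.step` here.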

Computing MAD

MAD uses two median calculations to determine, for each data point, its number of absolute deviations from the median. First, we need to know the median of the data. Then, using that median, we can compute each data point's deviation from it. The median deviation is then a standardized measure of distance from the central tendency of the population (similar to how the number of standard deviations from the mean is used to compute the Z-score).
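In batch form (a plain-Python illustration, not the streaming version used in the MLE), the two median calculations look like this:

```python
import statistics

def mad(data):
    """Median Absolute Deviation: the median of each point's
    absolute deviation from the data's median."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)

# For [2, 3, 5, 7, 100]: the median is 5, the deviations are
# [3, 2, 0, 2, 95], so MAD is 2 -- the extreme value 100 barely moves
# it, whereas the sample standard deviation of the same data is
# roughly 43.
```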

To compute MAD in a streaming environment, we use two nested streaming median computations. This does require some time for "burn-in" but rapidly achieves a stable estimate of the median.

When a new flow enters the MLE, we take the flow's total bytes, compute its deviation from the previous median estimate, and compare that deviation with the previous median deviation. We then update both running medians: one with the current value, the other with the current deviation.

To compute the outlier score, we take the current point's deviation from the median and divide it by the median deviation. This multiple indicates how many standard units the current data point is from the center of the population; the higher the multiple, the more unexpected the point. For network flows, we are typically interested only in the right tail of the distribution, that is, large flows.
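Putting the pieces together, here is a sketch of the full streaming computation (again in Python with a fixed step size for illustration; the MLE version is a Lua script backed by a Redis Hash):

```python
class StreamingMAD:
    """Streaming MAD score via two nested stochastic-averaging medians:
    one running median for the data, one for the absolute deviations."""

    def __init__(self, step=1.0):
        self.median = 0.0  # running median of the data
        self.mad = 0.0     # running median of absolute deviations
        self.step = step

    @staticmethod
    def _nudge(estimate, x, step):
        # Move the estimate one small step toward the new observation.
        if x > estimate:
            return estimate + step
        if x < estimate:
            return estimate - step
        return estimate

    def update(self, x):
        # Score against the *previous* estimates, then update both.
        deviation = abs(x - self.median)
        score = deviation / self.mad if self.mad > 0 else 0.0
        self.median = self._nudge(self.median, x, self.step)
        self.mad = self._nudge(self.mad, deviation, self.step)
        return score  # number of median deviations from the center
```

After a burn-in period on typical traffic, an extreme flow scores far above 1, while ordinary flows score near or below it.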

Summary

The famous quote that "the most exciting phrase to hear in science ... is not 'Eureka!' but 'That's funny...'" (popularly attributed to Isaac Asimov) holds just as well for threat hunting as it does for science. The challenge is that humans with limited visibility have a tendency to overreact to events that seem unexpected but are actually quite common. Statistics such as MAD help quantify whether events are truly unexpected, allowing for an appropriate response.

Redis and the Dragonfly MLE are a powerful combination for doing streaming analytics such as MAD on network traffic in near real-time. Get them now at OPNids.io.

Introduction to the Dragonfly Machine Learning Engine (MLE)

Andrew Fast, Chief Data Scientist, CounterFlow AI

Introduction

The Dragonfly Machine Learning Engine (MLE) provides the machine learning and data science capabilities included within OPNids. Data science and machine learning promise to counteract the dynamic threat environment created by growing network traffic and increasing threat-actor sophistication. This post provides an overview of the MLE itself, the reasons data science and cybersecurity go together, and some insight into using the MLE as part of the OPNids system.

The Dragonfly MLE is available as part of OPNids or on its own from GitHub.

 

MLE Highlights

The Dragonfly MLE provides a powerful framework for deploying anomaly detection algorithms, threat intelligence lookups, and machine learning predictions within a network security infrastructure. The MLE can process hundreds of thousands of events per second using a multi-threaded, scriptable streaming application engine for network threat detection implemented in C.

Building on this scalable foundation, the MLE also allows custom scripts and analytics to be applied to network traffic streaming through the sensor. Scripting includes the following capabilities "out of the box":

- Lua (LuaJIT)
- JSON
- Redis
- Redis-ML

Redis-ML is a module for Redis that can score previously trained models, including:

- Linear regression
- Logistic regression
- Forests of decision trees (random forests)
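To illustrate what "scoring" a previously trained model means (applying fixed coefficients learned offline, with no retraining on the sensor), here is a minimal logistic-regression sketch in plain Python. The feature names and coefficients are made-up illustrative values, and the real MLE would invoke Redis-ML from a Lua analyzer instead:

```python
import math

# Hypothetical coefficients from a logistic regression trained offline
# (illustrative values only, not from any real model).
WEIGHTS = {"bytes": 0.002, "packets": 0.1}
INTERCEPT = -4.0

def score_event(features):
    """Return the model's probability that an event is anomalous."""
    z = INTERCEPT + sum(w * features.get(k, 0.0) for k, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
```

Because the coefficients are fixed, scoring is a constant-time arithmetic operation per event, which is what makes it feasible at streaming rates.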

The MLE can read and write streaming data from files, Unix sockets, or Kafka brokers. It is designed to integrate closely with Suricata.

 

Solving the Machine Learning Deployment Problem

Integrating the MLE into OPNids helps to solve the machine learning deployment problem, one of the largest challenges facing the machine learning industry as a whole. Many of the network analysts we have spoken to recently about machine learning in network security lament the cost and complexity of the majority of data science platforms. For example, running Spark requires managing a large number of servers (either on-premises or virtualized), a Hadoop cluster for data storage, and Spark or another "Big Data" analytics platform on top of that. OPNids combines data collection via Suricata with the ETL, scripting, and model scoring included in the MLE. This tight integration between data and analysis, along with the transition from batch to streaming analytics, allows powerful analysis and scoring of data without the huge cost or complexity.

OPNids also combines signatures, scripts, and models into a single package, since no one solution is sufficient to cover all network threat detection use cases. Signatures capture known, but largely static, threats. Scripts and machine learning models handle those more dynamic cases but are not as helpful for known threats. OPNids merges data collection in Suricata with the powerful combination of signatures, scripts, and machine learning in the Dragonfly MLE.

 

Using the MLE for Data Science and ML

The MLE uses a powerful yet familiar "pipes and filters" model for processing data with the addition of Redis for live data caching. There are three types of event processors available for inclusion — two types that are user-configurable and one type that is built in:

— *Input processors* - User-configurable scripts that pull messages out of a source, normalize the data into JSON format, and route each message to the appropriate analyzer queue for processing. Message sources can be files, Unix sockets, or Kafka brokers. Normalization and ETL operations are performed by a user-defined Lua script.

— *Analysis processors* - User-configurable scripts that pull messages out of the input queue, analyze each event, and route the results to the appropriate output queue for processing. Analyzers are implemented as user-defined Lua scripts and take advantage of both native Redis and Redis modules.

— *Output processors* - Built-in processors that pull messages out of the queue and deliver each message to the appropriate sink. Current message sinks are files, Unix sockets, or Kafka brokers, whose output can be ingested by SIEM, security orchestration, and/or other downstream systems.

The operational pipeline is specified in a user-defined configuration file that defines how the processors interact.
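The flow between the three processor types can be sketched as follows (a single-threaded Python illustration with hypothetical processor functions; in the real MLE the input and analysis stages are user-defined Lua scripts and the queues are internal):

```python
import json

def input_processor(raw_lines, analyzer_queue):
    # Normalize each raw message into JSON and route it to an analyzer.
    for line in raw_lines:
        analyzer_queue.append(json.loads(line))

def analysis_processor(analyzer_queue, output_queue, threshold=100):
    # Analyze each event (here, a toy size check) and route the result.
    for event in analyzer_queue:
        event["alert"] = event.get("bytes", 0) > threshold
        output_queue.append(event)

def output_processor(output_queue, sink):
    # Deliver each result to a sink (file, Unix socket, Kafka, ...).
    for event in output_queue:
        sink.append(json.dumps(event))

# Wire the stages together, as the configuration file would.
events, results, sink = [], [], []
input_processor(['{"bytes": 50}', '{"bytes": 500}'], events)
analysis_processor(events, results)
output_processor(results, sink)
```

The pipes-and-filters structure means each stage can be swapped or rerouted by configuration alone, without touching the other stages.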

 

Summary

OPNids with the Dragonfly MLE is a powerful platform for improving threat detection capabilities using the combination of signatures, scripts, and machine learning models. With the inclusion of Suricata into OPNids and a focus on streaming analytics, many of the traditional challenges with deploying machine learning have been eliminated. Download OPNids today or explore the Dragonfly MLE directly.