Removing Obstacles to Production Machine Learning with OpnIDS and Dragonfly MLE
Andrew Fast, Chief Data Scientist, CounterFlow AI
Crossing the Production Gap
Machine learning promises to address many of the challenges faced by network security analysts; however, there are still many obstacles that prevent widespread adoption of machine learning within security operations centers (SOCs). The first major challenge is one of trust, as discussed in our previous post. The second major set of challenges surrounds the complexity of deploying machine learning in a production environment. Once a machine-learning model has been trained and validated in the lab, an equal if not larger effort is often required to deploy that model in a repeatable, production environment. Transitioning a model from the lab to production is a difficult challenge that OpnIDS and Dragonfly MLE can help address in the network operations environment.
Typical Machine Learning Process
Data science typically operates using an iterative batch process. First, there is a lengthy data collection process that results in a large static dataset (a.k.a. "training data" or "the data") stored in one or more data storage systems, including spreadsheets and flat files, relational databases, and "big data" systems such as Hadoop and Apache Spark. Next, data scientists "explore" the data using traditional data science tools such as SQL, R, or Python. Data exploration rapidly transitions into feature creation and data cleaning, resulting in a data set that is prepared for modeling. Once the first round of feature creation is complete, data scientists train a machine-learning model using one of many model packages for R or scikit-learn for Python, ideally splitting the training data into several subsets to simulate out-of-sample, unseen data. Typically, this entire process, starting with data collection, repeats until model performance meets the business goals of the project.
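The train-then-validate loop described above can be sketched in a few lines. This example uses scikit-learn; the single feature, its values, and the labels are fabricated purely for illustration:

```python
import random

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

random.seed(0)

# Fabricated "training data": one numeric feature per example, binary label.
y = [i % 2 for i in range(200)]
X = [[random.gauss(10.0 if label else 5.0, 2.0)] for label in y]

# Hold out a subset to simulate out-of-sample, unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # performance on held-out data
```

In practice this whole block sits inside the outer loop: if `accuracy` falls short of the business goal, the process restarts at data collection or feature creation.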
After performance of the model is sufficient, it is then ready to be put into "production." In production, a model "scores" new, unseen data using the model developed during the training phase. Transitioning a model from training to scoring can be a lengthy process. First, "the data" is typically a static dataset, so for a model to be deployed it must be first transitioned to live data. This entails ensuring that all of the various data features used as inputs to the model are available when the model needs to be run. Second, production systems are often built using different programming languages and frameworks than are used to train the models, requiring additional software engineering effort to translate the model code to the production language of choice.
If there are many different dependencies on datasets, a common choice is to deploy the model on top of the training architecture. This avoids additional effort re-implementing feature engineering and multiple data pipelines. However, scoring using the training architecture is not always the most efficient approach. For example, check out the recent collaboration between Redis and Spark.
Busting Complexity with Streaming Machine Learning
When we speak to network analysts about deploying machine learning, we hear a common response: they do not have the budget, rackspace, or know-how to set up the typical "big data" infrastructure for deploying batch machine learning. This is primarily a complaint about the complexity inherent in a "big data" system. Often the assumption by the network analyst is that effective machine learning would require a large Hadoop cluster running Apache Spark, rapidly followed by the mistaken conclusion that machine learning won't fit in their environment.
Though a powerful and popular choice, the reliance on large-scale systems is, in part, a by-product of the assumption that machine learning and batch processing must go hand in hand. Storing large amounts of data required to do batch machine learning is the primary driver of the development of massively parallel systems. But if it were possible to remove the data storage component from machine learning, then much of the engineering complexity required to maintain those systems could be removed as well. OpnIDS and Dragonfly MLE do just that by implementing a streaming machine-learning engine that operates on network data as it flows through the network sensor.
In addition to removing the complexity due to data storage, deploying the machine-learning engine at the network sensor enables data to be extracted directly from the network, reducing data pipeline complexity as well. The Dragonfly MLE provides a stable, well-defined platform for deploying statistical analyses of network data.
Streaming Reduces Latency, Too
A second unfortunate by-product of batch machine learning is an increased latency from each additional infrastructure layer that the data must traverse before coming to rest; latency also results from the time required to process the whole set of data at rest. Streaming, like the type used in the Dragonfly MLE, is a lightweight way to process traffic at near wire speed. This creates the potential for detecting possible threats as soon as they cross the wire.
Streaming Machine Learning with Dragonfly MLE
The Dragonfly MLE included within OpnIDS is a streaming machine learning platform designed to speed up the detection of threats and ease the deployment of new threat detection models. Keep watching this blog and the OpnIDS site for more innovations.
No More Black Boxes
Andrew Fast, Chief Data Scientist, CounterFlow AI
Machine learning can be challenging to adopt in a cybersecurity context because of an initial lack of trust on the part of security analysts. We believe that machine learning is a transformational technology for cybersecurity and that the best use for machine learning is in conjunction with human intelligence. In this post, we look under the hood of OpnIDS and the Dragonfly MLE to gain insight into the strategies we are using to increase trust and adoption of machine-learning techniques.
Explainable AI is an Imperative
Model explainability is a hot topic in machine learning right now. With the rise of deep learning techniques, model performance has improved dramatically — often at the cost of reduced understanding of how the model is coming to a decision. For many tasks, such as algorithmic trading or image recognition, understanding how the algorithm comes to a decision may not matter as long as the decision is correct. For many other tasks, including cyber threat hunting, medicine, or credit scoring, the model cannot be used to take action directly but is used instead as input into a related human process. For these latter tasks, explainability is of the utmost importance for building trust and acceptance of the new process. In a recent article in the Wall Street Journal, Rob Alexander the CIO at Capital One said it this way: “Until you have confidence in explainability, you have to be cautious about the algorithms you use” (WSJ, 09/26/2018).
Open, from the source code up
Building trust is one of the main reasons we chose to release the Dragonfly MLE and OPNids under an open-source license. The code behind these tools is always available on GitHub. Releasing the code under an open-source license allows interested parties to dig into the code itself to understand what is happening as the software is running. In an adversarial environment, this strategy is still viable for a machine-learning application because the specific models and uses of the tool depend on the configuration of the analyzers, which are going to be specific for each organization using the MLE.
Explainable Techniques Required
The best machine-learning techniques are able to infer patterns from many different variables. Though it is one of the primary strengths of machine learning, the ability to find multi-variate correlations is also one of the largest challenges for developing explainable techniques because it is difficult to understand where more complex correlations come from. Our preferred strategy for solving this problem is to build individual analyzers that are understandable and then combine those results into more complex models. Drawing from the machine-learning approach of ensembles, this "building block" approach can be used to effectively identify complex correlations from combinations of explainable models.
User Defined Policies
One of the most harmful myths about AI is that the machine will make the decision for you, leading to costly errors. While errors in any system (human or machine) are inevitable, allowing users to determine the threshold and the action taken is a necessary part of any explainable system. Our example analyzers in the Dragonfly MLE use a "decorator" pattern to report results to downstream applications. Rather than picking a threshold and only passing along events that are above the threshold, our strategy is to report all scores, and then let the user and the situation determine how to respond. This dovetails with the "building block" approach described above, as each analyzer could be used on its own or combined with other analyzers. This approach allows analysts full control over which analyzers are used to process traffic and the thresholds that are used to determine further action, making the process more explainable and more defensible.
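A minimal sketch of this decorator pattern (the field names and the toy scoring rule are invented for illustration): the analyzer attaches its raw score to every event rather than filtering, and a separate user-defined policy applies the threshold.

```python
def analyze(event):
    """Decorate the event with a score; never drop or filter events here."""
    scored = dict(event)
    # Toy stand-in for a real model: longer domains score higher.
    scored["dga_score"] = min(1.0, len(event["domain"]) / 30.0)
    return scored

def policy(event, threshold):
    """User-defined policy: the analyst, not the analyzer, picks the cutoff."""
    return "alert" if event["dga_score"] >= threshold else "log"

event = analyze({"domain": "xk2v9qwp1zr8f4jh7.biz"})
```

Because `analyze` only decorates, the same analyzer output can feed several policies at once, or be combined with other analyzers' scores downstream.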
The Journey of Explainability
Building explainable AI is a journey, not a destination. It requires advances in techniques and greater understanding of how techniques work. Dragonfly MLE uses open-source technology to support explainable techniques and user-defined policies. Try it out at opnids.io.
Want to learn more about OPNids?
Join us at @CYBERTACOS in Washington, D.C., October 16th.
Our team members will be there to field your questions and discuss more about OPNids, the first integration of Suricata IDS with a purpose-built Machine Learning Scripting Engine.
CYBERTACOS® was born out of a conversation with cyber reporters at the RSA Conference in 2016. CYBERTACOS® has grown into a series of events, previously held in Austin, Texas; Washington D.C.; San Francisco, California and Northern Virginia. Events typically attract more than 200 cyber professionals from reporters to engineers to top executives. It’s our chance to come together as a community over our shared love of Mexican fare.
Don't forget to register at www.cybertacos.net!
The Dragonfly MLE, included as part of OpnIDS, uses Redis-ML to provide the ability to score incoming data using machine-learning models. Redis-ML is an add-on module for Redis, produced by Redis Labs, that supports prediction and scoring using linear regression, logistic regression, and random forest models.
One common example of using these types of models is the detection of domain names that have been generated by Domain Generation Algorithms (DGAs). This problem fits the pattern for supervised machine learning because there are many known domains that can be identified as having been generated by a DGA and many typical domains that are known to have been created by a human. We use this example to demonstrate how a supervised model can be deployed using the Dragonfly MLE. Example scripts containing this approach can be found in the Dragonfly MLE GitHub repository for both Logistic Regression and Random Forests.
Overview of Domain Generation Algorithms
In order to avoid detection, many families of malware use Domain Generation Algorithms to create domain names keyed to dynamic inputs such as the date, time of day, or even trending Twitter topics. This allows infected hosts to communicate with command and control (C2) servers without contacting a fixed domain or IP address. Widely prevalent malware families that rely on DGA include Conficker, Murofet, and BankPatch.
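To make the idea concrete, here is a toy, date-seeded generator (not any real malware family's algorithm): the domain changes every day, so an infected host and its C2 server can both compute today's rendezvous point without ever exchanging a fixed address.

```python
import datetime
import hashlib

def toy_dga(date, tld="net", length=12):
    """Derive a pseudo-random domain name from the date (illustrative only)."""
    seed = hashlib.md5(date.isoformat().encode()).digest()
    letters = "abcdefghijklmnopqrstuvwxyz"
    name = "".join(letters[b % 26] for b in seed[:length])
    return name + "." + tld

# Both ends of the malware run the same code and agree on today's domain.
print(toy_dga(datetime.date(2018, 10, 16)))
```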
Because these malware families use dynamically generated domains, it is a challenge to write a single signature (or set of signatures) to detect all possible generated domains. Instead, a probabilistic machine-learning model that uses features computed from the domain itself is a more flexible approach for determining whether a domain was generated by a DGA.
Input Features for DGA detection
For a predictive model such as a DGA detector to be effective, we need to translate the incoming data into a set of numerical indicators, commonly called “features,” that an algorithm can use. Our example model uses 22 different features computed from the domain string as input for the classification algorithms. Some selected features are shown below:
· Length of Domain Parts (TLD, 2LD, 3LD)
· Does the domain end in .edu? (Y/N)
· Number of Domain Parts (i.e. number of periods)
· Number of Distinct Characters
· String Entropy computed on the Domain
· Number of Digits
· Number of Dashes
The full feature list is included in the example code on GitHub.
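Several of these features can be computed directly from the domain string with the standard library. A sketch (the feature names here are illustrative, not the exact names used in the MLE scripts):

```python
import math
from collections import Counter

def domain_features(domain):
    """Compute a handful of the listed features for one domain string."""
    parts = domain.lower().split(".")
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "num_parts": len(parts),            # number of labels (periods + 1)
        "tld_length": len(parts[-1]),       # length of the TLD
        "ends_in_edu": parts[-1] == "edu",  # does the domain end in .edu?
        "distinct_chars": len(counts),      # number of distinct characters
        "entropy": entropy,                 # string entropy of the domain
        "num_digits": sum(ch.isdigit() for ch in domain),
        "num_dashes": domain.count("-"),
    }

print(domain_features("x9k2-qpz.example.com"))
```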
Training The Model
The final step before deploying the model is training and evaluating the model on real data with known labels. This is an extensive process that we will describe in more detail in a future blog post because the full training process requires more tools in addition to OpnIDS and Dragonfly MLE. In the meantime, we will give a brief overview of the model training process. We used scikit-learn in Python to train our models, but any software that can train a Logistic Regression or Random Forest model would work as well.
To gather known DGA domains, we used Johannes Bader's collection of reverse engineered domain generation algorithms. We ran those algorithms in Python to generate over 30,000 samples from 42 different malware families.
For known-good domains, we used a sample of the Majestic Million. (Note: not all of the domains contained within the Majestic Million are benign. Some domains are highlighted by threat intelligence as malicious, and others look to be generated by a DGA!) The Majestic Million is licensed under a Creative Commons license.
A sample of domains is shown below along with the label (1 for DGA, 0 for non-DGA) and the source of the domain:
Note that some of the DGA families use English language dictionaries to create "normal" looking domains.
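A compressed sketch of that training step, using scikit-learn's LogisticRegression on a tiny hand-made sample. The domains, the three toy features, and the labels below are invented for illustration; the real training set has tens of thousands of rows and 22 features:

```python
import math
from collections import Counter

from sklearn.linear_model import LogisticRegression

def features(domain):
    """Three toy features: length, digit count, character entropy."""
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return [n, sum(ch.isdigit() for ch in domain), entropy]

dga = ["x4zfh2a9qk.com", "q8w7e6r5t4.net", "zk93jq1vb0.biz",
       "m2n8b4v6c1.org", "p0o9i8u7y6.net", "a3s5d7f9g1.com"]
benign = ["google.com", "amazon.com", "wikipedia.org",
          "github.com", "reddit.com", "nytimes.com"]

X = [features(d) for d in dga + benign]
y = [1] * len(dga) + [0] * len(benign)  # 1 = DGA, 0 = non-DGA

model = LogisticRegression().fit(X, y)
train_accuracy = model.score(X, y)
```

As noted above, real evaluation must use held-out data rather than training accuracy; this sketch only shows the shape of the fit.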
Deploying the Model with Redis-ML
Redis-ML extends Redis by adding specialized keys to store models. The first step for using these models is to initialize the model parameters or the model structure (in the case of the random forest). We then use the Dragonfly MLE to convert the incoming data into the appropriate feature format and input those features into the stored models.
A simple explanation of the model format for Redis-ML can be found on the Redis-ML GitHub page. Detailed examples of Redis-ML in action can be found here; logistic regression is introduced in Part 3 and random forests in Part 5.
For both the logistic regression and random forest models, we converted the scikit-learn model to the appropriate Redis-ML format in Python. The final results can be viewed in the Dragonfly MLE GitHub repository for both Logistic Regression and Random Forest.
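Conceptually, the conversion amounts to exporting the trained coefficients and reproducing the logistic score outside of scikit-learn; storing those same parameters in a Redis-ML key lets Redis do this arithmetic server-side. A sketch of the equivalence (the one-feature data set is invented for illustration):

```python
import math

from sklearn.linear_model import LogisticRegression

# A trivially small training set, invented for illustration.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, y)

# Export the parameters -- these are what would be loaded into Redis-ML.
weights = clf.coef_[0].tolist()
intercept = float(clf.intercept_[0])

def score(features):
    """Logistic score recomputed from the exported parameters."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # probability of the DGA class
```

The hand-rolled `score` matches scikit-learn's `predict_proba` because binary logistic regression is exactly a sigmoid over the weighted feature sum.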
For use cases with many examples having known labels, a supervised machine-learning model is often the correct choice for threat detection. For these types of problems, models are more flexible than signatures, generalizing to correctly identify previously unseen threats. The Dragonfly MLE (with help from Redis-ML) provides a powerful platform for deploying these models, managing both the data and the software infrastructure.
Often, one of the first things a threat hunter asks when investigating a threat is "Who or what are the 'top talkers' on this network?" Top talkers can be measured in many different ways including endpoints that transfer the most bytes, protocols that are most prevalent on the network, or endpoints that are connecting to the most other network hosts.
We're going to highlight some built-in Redis functionality that, when combined with a Dragonfly MLE analyzer, can be used to compute top talkers in near real time. The Dragonfly MLE is a data science scripting engine included in the OPNids project. You can read more about the MLE here (Intro to Dragonfly MLE) or download the code from GitHub.
Redis bills itself as an "in-memory data structure store" (Redis.io). It has a small enough footprint to work in streaming environments such as OPNids. Redis provides a variety of data structures, including sets and hashes.
Top Talkers by Bytes Using the Redis SORTED SET
One strategy for determining top talkers is to count the bytes sent (or received) by each IP and track the total bytes over time. The number of bytes is included in the flow records that are emitted by Suricata within OPNids. Since the flow record reports both the source and destination IP addresses, we can easily track the total number of bytes sent or received for each IP in the same MLE script.
One of the data structures provided natively by Redis is the "sorted set." A sorted set lets a user store a key (e.g., IP address) and a value (e.g., number of bytes). The set then keeps the records sorted by value. The ZADD command in Redis is dual purpose. It adds new keys to the set or creates a new set if it doesn't exist. The ZINCRBY command can be used to increment the value of a key, resulting in an updated sort order, if necessary. Finally, after the set has been created, the ZRANK command queries the sorted set to get the rank of a specific key. In the case of top talkers, both the rank and total number of bytes are of interest and can be attached to the flow record by the MLE for reference in downstream operations.
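The bookkeeping can be sketched with a plain dictionary standing in for the Redis sorted set; each helper notes the Redis command it mimics (in a real MLE script these would be calls against Redis itself, and the simulated flow records below are invented for illustration):

```python
byte_counts = {}  # stands in for the Redis sorted set ("top_talkers")

def add_bytes(ip, nbytes):
    """Mimics: ZINCRBY top_talkers nbytes ip (ZADD creates the entry)."""
    byte_counts[ip] = byte_counts.get(ip, 0) + nbytes

def rank(ip):
    """Mimics: ZREVRANK top_talkers ip -- rank 0 is the top talker."""
    ordered = sorted(byte_counts, key=byte_counts.get, reverse=True)
    return ordered.index(ip)

# Simulated flow records: (source IP, bytes sent).
for src, nbytes in [("10.0.0.1", 500), ("10.0.0.2", 9000), ("10.0.0.1", 700)]:
    add_bytes(src, nbytes)
```

Unlike this dictionary, the real sorted set keeps entries ordered as they are updated, so rank queries do not require a full re-sort.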
One challenge with the sorted set is that the size of the data structure grows with the number of entries. For large networks, this could be a problem if memory on the network sensor is being used to support other operations (such as packet capture) in addition to supporting the MLE. In the next section, we will introduce the idea of a data "sketch," a probabilistic data structure with constant size in exchange for providing approximate answers.
Counting Distinct Connections with HyperLogLog
In addition to counting bytes, it is useful to know how many unique connections are being made by a given endpoint. When combined with the bytes sent by each endpoint, the number of unique connections can be used to understand the diversity of destinations for those bytes.
For this functionality we are going to use a "data sketch" known as HyperLogLog. Data sketches are probabilistic data structures that can dramatically reduce the amount of memory used to solve certain tasks in exchange for returning approximate answers instead of exact ones. HyperLogLog is a sketch used to solve the distinct-count problem, which is exactly what we need to count unique connections. To implement a HyperLogLog structure in Redis, use the PFADD command to add a value to the set and the PFCOUNT command to determine the number of unique endpoints.
In a world without HyperLogLog, to count the number of unique connections made by each endpoint we would need to maintain a list of unique IP addresses accessed by each endpoint. Unfortunately, for large networks, the number of entries grows with the square of the number of IP addresses. For example, if 100 endpoints within an organization connected to all the other endpoints in that same organization, there would be 10,000 (100x100 = 100^2) possible connections.
With HyperLogLog, we can estimate the number of distinct items in a set (in this case the number of distinct connections) within a small error percentage of the true number (Redis reports <1% error in typical usage) but without the significant space blowup. To reduce space, HyperLogLog uses multiple hashing functions to map each new connection using a probabilistic binning strategy that occasionally has collisions — that is, two endpoints that hash to the same bin. The size and number of the hashes determines the correctness of the response.
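To see the mechanics, here is a toy HyperLogLog in pure Python (64 registers, no small-range bias correction, so it is illustrative only; in practice you simply call Redis's PFADD and PFCOUNT):

```python
import hashlib

M = 64  # number of registers; relative error ~ 1.04 / sqrt(M), about 13%
P = 6   # register-index bits (2 ** P == M)

def _hash64(item):
    """Deterministic 64-bit hash of the item."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

class ToyHyperLogLog:
    def __init__(self):
        self.registers = [0] * M

    def add(self, item):  # what Redis's PFADD does
        h = _hash64(item)
        idx = h >> (64 - P)                      # top P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining bits
        rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):  # what Redis's PFCOUNT does (minus bias corrections)
        alpha = 0.709  # bias constant for M == 64
        return int(alpha * M * M / sum(2.0 ** -r for r in self.registers))
```

Whatever the true cardinality, the structure never grows past its M small registers, which is exactly the constant-size property that matters on a memory-constrained sensor.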
Redis and Dragonfly MLE are a powerful combination for tracking network statistics in near real time. Get them now at OPNids.io. Examples demonstrating both of these techniques within the Dragonfly MLE can be found on GitHub.
We show how to use streaming statistics with the Dragonfly MLE, Redis, and OpnIDS to compute a robust indicator of outlier-ness known as the Median Absolute Deviation (MAD). An example Dragonfly script for computing MAD is also in the GitHub repo.
MAD is a measure of the variability of a population (See Wikipedia). MAD is conceptually similar to the standard deviation but is less sensitive to extreme values in the data than the standard deviation. In this post, we're going to show how to compute a streaming estimate of the median flow size and then use that estimated median to compute how much of an outlier each network flow actually is.
Though we are using flow size for demonstration, MAD is a general technique that could be applied to many different network use cases. For example, MAD could be applied to producer-consumer ratios to detect potential exfiltration or it could be applied to message counts to detect a malfunctioning system. In any of these scenarios, extremely large (or possibly extremely small) values of MAD should be viewed as unlikely.
Streaming Approximation of the Median Using Stochastic Averaging
The median is, of course, the value in the middle if one sorts an entire data set from smallest to largest. However, sorting an entire data set to compute the exact median can be infeasible if the data are large (perhaps infinite) and cannot be stored effectively. Streaming computations can be used to find an approximation to the median while only seeing each data point once, limiting storage to a few sufficient statistics.
To approximate the median flow size, we use a straightforward approach called stochastic averaging, following the method described in Byron Ellis's Real-Time Analytics (Wiley, 2014). We use Redis's hash data structure (which works like a Python dictionary) to track the value of the median and the learning rate over time.
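A minimal version of that update rule (the fixed step size here is a simplification; real implementations decay the learning rate over time): nudge the estimate up when a sample lands above it and down when it lands below. The synthetic flow sizes are invented for illustration.

```python
import random

def update_median(estimate, sample, step=0.5):
    """One stochastic-averaging step toward the running median."""
    if sample > estimate:
        return estimate + step
    if sample < estimate:
        return estimate - step
    return estimate

random.seed(1)
median_est = 0.0
for _ in range(20000):
    flow_bytes = random.uniform(0, 1000)  # stand-in for observed flow sizes
    median_est = update_median(median_est, flow_bytes)
```

Because upward and downward nudges balance out only when half the samples fall on each side, the estimate settles near the true median while storing just two numbers.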
MAD uses two median calculations to determine the number of absolute deviations from the median of each data point. First, we need to know the median of the data. Then using that median, we can compute the deviation from the median for each data point. The median deviation, then, is a standardized measure of distance from the central tendency of the population (similar to how the number of standard deviations from mean is used to compute the Z-score).
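In batch form, where the full data set is available, the calculation is exactly the two medians just described (the numbers below are invented for illustration):

```python
from statistics import median

data = [12, 15, 14, 13, 400, 14, 16, 13]    # one obvious outlier

m = median(data)                            # 1) median of the data
mad = median(abs(x - m) for x in data)      # 2) median of the deviations
outlier_score = abs(400 - m) / mad          # deviations from the center
```

Note how little the single extreme value moves either median, which is precisely why MAD is more robust than the standard deviation.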
To compute MAD in a streaming environment, we use two nested streaming median computations. This does require some time for "burn-in" but rapidly achieves a stable estimate of the median.
When a new flow enters the MLE, we grab the total bytes of the flow and, using the previous median, compute the deviation of the new point from the previous median and compare it with the previous median deviation. We then take the current median and current deviation and update each of our running medians.
To compute the outlier score, we take the current deviation from the median and divide by the median deviation. This multiple indicates how many standard units the current data point is from the center of the entire population. The higher this multiple, the more unexpected the observation. For network flows, we are typically interested only in the right tail of the distribution, that is, large flows.
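Putting the pieces together, the nested computation can be sketched with two stochastic-averaging median trackers, one for the flow sizes and one for their deviations. The step sizes and the synthetic "typical" traffic are invented for illustration:

```python
import random

def update_median(estimate, sample, step):
    """One stochastic-averaging step toward a running median."""
    if sample > estimate:
        return estimate + step
    if sample < estimate:
        return estimate - step
    return estimate

random.seed(2)
median_est = 0.0  # running median of flow sizes
mad_est = 1.0     # running median of absolute deviations

# Burn-in on synthetic "typical" flow sizes.
for _ in range(20000):
    flow_bytes = random.gauss(100.0, 10.0)
    deviation = abs(flow_bytes - median_est)
    median_est = update_median(median_est, flow_bytes, step=0.5)
    mad_est = update_median(mad_est, deviation, step=0.1)

def outlier_score(flow_bytes):
    """How many 'median deviations' from the center this flow sits."""
    return abs(flow_bytes - median_est) / mad_est
```

Early in the burn-in both estimates are unreliable, which is why the text above notes that the nested computation needs some time before its outlier scores can be trusted.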
The famous quote that "the most exciting phrase to hear in science ... is not ‘Eureka!’ but 'That's funny…'” (popularly attributed to Isaac Asimov), holds just as well for threat hunting as it does for science. The challenge is that humans with limited visibility have a tendency to overreact to events that may seem unexpected but are actually quite common. Statistics such as MAD can be used to help quantify whether events are truly unexpected, allowing for the appropriate response.
Redis and Dragonfly MLE are a powerful combination for doing streaming analytics such as MAD on network traffic in near real time. Get them now at OPNids.io.