The Dragonfly MLE, included as part of OpnIDS, uses Redis-ML to provide the ability to score incoming data using machine-learning models. Redis-ML is an add-on module for Redis produced by Redis Labs and it supports prediction and scoring using linear regression, logistic regression, and random forest models.
One common application of these models is detecting domain names generated by Domain Generation Algorithms (DGAs). This problem fits the pattern for supervised machine learning because there are many domains known to have been generated by a DGA and many ordinary domains known to have been chosen by a human. We use this example to demonstrate how a supervised model can be deployed using the Dragonfly MLE. Example scripts for this approach can be found in the Dragonfly MLE GitHub for both Logistic Regression and Random Forests.
Overview of Domain Generation Algorithms
To avoid detection, many families of malware use Domain Generation Algorithms to create domain names keyed to dynamic inputs such as the date, time of day, or even trending Twitter topics. This allows infected hosts to communicate with command and control (C2) servers without contacting a fixed domain or IP address. Widely prevalent malware families that rely on DGAs include Conficker, Murofet, and BankPatch.
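To make the idea concrete, here is a toy generator keyed to the calendar date. It is purely illustrative and does not reproduce the algorithm of any real malware family; the function name `toy_dga`, the use of SHA-256, and the 12-letter label length are all our own assumptions.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed_date: date, count: int = 5, tld: str = ".com") -> List[str]:
    """Derive `count` pseudo-random domain names from a date (illustrative only)."""
    domains = []
    for index in range(count):
        # The malware and its operator can both compute this seed
        # independently, because it depends only on the date and a counter.
        seed = f"{seed_date.isoformat()}-{index}".encode("ascii")
        digest = hashlib.sha256(seed).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to form a label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    for domain in toy_dga(date(2018, 1, 1)):
        print(domain)
```

The same date always yields the same list, while a different date yields a different one, which is why a fixed blocklist of domains quickly goes stale.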
Because these malware families use dynamically generated domains, it is a challenge to write a single signature (or set of signatures) that detects all possible generated domains. Instead, a probabilistic machine-learning model that uses features computed from the domain itself is a more flexible approach for determining whether a domain was generated by a DGA.
Input Features for DGA Detection
For a predictive model such as a DGA detector to be effective, we need to translate the incoming data into a set of numerical indicators, commonly called “features,” that an algorithm can use. Our example model uses 22 different features computed from the domain string as input for the classification algorithms. Some selected features are shown below:
· Length of Domain Parts (TLD, 2LD, 3LD)
· Does the domain end in .edu? (Y/N)
· Number of Domain Parts (i.e. number of periods)
· Number of Distinct Characters
· String Entropy computed on the Domain
· Number of Digits
· Number of Dashes
The full feature list is included in the example code on GitHub.
Training the Model
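A few of the features listed above can be sketched in plain Python as follows. This is a minimal sketch rather than the code from the example repository: the function names, the dictionary keys, and the choice to ignore periods when counting characters and computing entropy are our assumptions here.

```python
import math
from collections import Counter
from typing import Dict


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain: str) -> Dict[str, float]:
    """Compute a handful of numeric features from a domain string."""
    domain = domain.lower().rstrip(".")
    parts = domain.split(".")
    # Left-pad with empty strings so the TLD/2LD/3LD lookups never fail
    # on short domains such as "example.com".
    padded = [""] * (3 - len(parts)) + parts
    chars = domain.replace(".", "")  # assumption: periods are not counted
    return {
        "tld_length": float(len(padded[-1])),
        "2ld_length": float(len(padded[-2])),
        "3ld_length": float(len(padded[-3])),
        "ends_in_edu": 1.0 if parts[-1] == "edu" else 0.0,
        "num_parts": float(len(parts)),
        "num_distinct_chars": float(len(set(chars))),
        "entropy": shannon_entropy(chars),
        "num_digits": float(sum(ch.isdigit() for ch in chars)),
        "num_dashes": float(chars.count("-")),
    }


if __name__ == "__main__":
    print(domain_features("www.example.edu"))
```

Returning every value as a float keeps the output ready to pass to a model as a numeric vector, provided the features are always emitted in the same order the model was trained on.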
The final step before deploying the model is training and evaluating it on real data with known labels. This is an extensive process that we will describe in more detail in a future blog post, because the full training process requires tools beyond OpnIDS and the Dragonfly MLE. In the meantime, we give a brief overview of the model training process. We used Python's scikit-learn to train our models, but any software that can train a logistic regression model or a random forest model would work as well.
To gather known DGA domains, we used Johannes Bader's collection of reverse engineered domain generation algorithms. We ran those algorithms in Python to generate over 30,000 samples from 42 different malware families.
For known-good domains, we used a sample of the Majestic Million. Note: not all of the domains contained in the Majestic Million are benign. Some are flagged as malicious by threat intelligence, and others look to be generated by a DGA! The Majestic Million is licensed under a Creative Commons license.
A sample of domains is shown below along with the label (1 for DGA, 0 for non-DGA) and the source of the domain:
Note that some of the DGA families use English language dictionaries to create "normal" looking domains.
Deploying the Model with Redis-ML
Redis-ML extends Redis by adding specialized keys to store models. The first step for using these models is to initialize the model parameters or the model structure (in the case of the random forest). We then use the Dragonfly MLE to convert the incoming data into the appropriate feature format and input those features into the stored models.
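For logistic regression, the two steps described above can be sketched as follows. To our understanding, the Redis-ML README documents `ML.LOGREG.SET key intercept coefficient ...` for storing a model and `ML.LOGREG.PREDICT key feature ...` for scoring; please verify the exact syntax against the module version you run. The key name and coefficient values below are made up for illustration, and `logistic_score` is a local stand-in that computes the standard sigmoid so the sketch runs without a Redis server.

```python
import math
from typing import List, Sequence


def logreg_set_command(key: str, intercept: float,
                       coefficients: Sequence[float]) -> List[str]:
    """Build the argument list that stores a logistic regression model."""
    return (["ML.LOGREG.SET", key, repr(float(intercept))]
            + [repr(float(c)) for c in coefficients])


def logreg_predict_command(key: str, features: Sequence[float]) -> List[str]:
    """Build the argument list that scores one feature vector."""
    return ["ML.LOGREG.PREDICT", key] + [repr(float(x)) for x in features]


def logistic_score(intercept: float, coefficients: Sequence[float],
                   features: Sequence[float]) -> float:
    """Local reference computation: sigmoid(intercept + coefficients . features)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical key name and weights, not the trained DGA model.
    print(" ".join(logreg_set_command("dga:logreg", -1.5, [0.8, 0.1])))
    print(" ".join(logreg_predict_command("dga:logreg", [3.2, 12.0])))
    print(logistic_score(-1.5, [0.8, 0.1], [3.2, 12.0]))
```

With a client library such as redis-py, each argument list could be sent with `execute_command(*args)`. The features must be supplied in the same order as the coefficients.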
A simple explanation of the model format for Redis-ML can be found on the Redis-ML GitHub page. Detailed examples of Redis-ML in action can be found here; logistic regression is introduced in Part 3 and random forests in Part 5.
For both the logistic regression and random forest models, we converted the scikit-learn model to the appropriate Redis-ML format in Python. The final results can be viewed in the Dragonfly MLE GitHub for both Logistic Regression and Random Forest.
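As a rough illustration of what such a conversion might look like for one tree of a random forest, the sketch below walks a tree stored as parallel arrays, in the same layout scikit-learn exposes on a fitted tree's `tree_` attribute (`children_left`, `children_right`, `feature`, `threshold`), and emits arguments for `ML.FOREST.ADD`. This is not the conversion script from the repository. The path notation (`.`, `.l`, `.r`) and the `NUMERIC`/`LEAF` keywords follow our reading of the Redis-ML README, and `leaf_class` is a hypothetical input holding the predicted class for each leaf. Before relying on it, check that Redis-ML and scikit-learn agree on which side of a threshold goes left.

```python
from typing import List, Optional, Sequence

TREE_LEAF = -1  # scikit-learn uses -1 as the child index of a leaf node


def forest_add_command(key: str, tree_id: int,
                       children_left: Sequence[int],
                       children_right: Sequence[int],
                       feature: Sequence[int],
                       threshold: Sequence[float],
                       leaf_class: Sequence[Optional[int]]) -> List[str]:
    """Build the argument list that adds one decision tree to a stored forest."""
    args = ["ML.FOREST.ADD", key, str(tree_id)]

    def walk(node: int, path: str) -> None:
        if children_left[node] == TREE_LEAF:
            # Leaf: emit the predicted class for this node.
            args.extend([path, "LEAF", str(leaf_class[node])])
            return
        # Internal node: split on a numeric feature index and threshold.
        args.extend([path, "NUMERIC", str(feature[node]),
                     repr(float(threshold[node]))])
        walk(children_left[node], path + "l")
        walk(children_right[node], path + "r")

    walk(0, ".")  # node 0 is the root
    return args


if __name__ == "__main__":
    # A made-up one-split tree: feature 4 with threshold 3.5.
    print(" ".join(forest_add_command(
        "dga:forest", 0,
        children_left=[1, -1, -1],
        children_right=[2, -1, -1],
        feature=[4, -2, -2],
        threshold=[3.5, -2.0, -2.0],
        leaf_class=[None, 0, 1],
    )))
```

Calling the function once per tree, with increasing `tree_id` values, would build up the whole forest under a single key.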
For use cases with many examples having known labels, a supervised machine-learning model is often the correct choice for threat detection. For these types of problems, models are more flexible than signatures, generalizing to correctly identify previously unseen threats. The Dragonfly MLE (with help from Redis-ML) provides a powerful platform for deploying these models, managing both the data and the software infrastructure.