Removing Obstacles to Production Machine Learning with OpnIDS and Dragonfly MLE

Andrew Fast, Chief Data Scientist, CounterFlow AI

Crossing the Production Gap

Machine learning promises to address many of the challenges faced by network security analysts; however, many obstacles still prevent its widespread adoption within security operations centers (SOCs). The first major challenge is one of trust, as discussed in our previous post. The second is the complexity of deploying machine learning in a production environment. Once a machine-learning model has been trained and validated in the lab, deploying that model in a repeatable production environment often requires an equal, if not larger, effort. Transitioning a model from the lab to production is a difficult challenge that OpnIDS and the Dragonfly MLE can help address in the network operations environment.

Typical Machine Learning Process

Data science typically operates as an iterative batch process. First, there is a lengthy data collection effort that results in a large static dataset (a.k.a. "training data" or "the data") stored in one or more data storage systems, including spreadsheets and flat files, relational databases, and "big data" systems such as Hadoop and Apache Spark. Next, data scientists "explore" the data using traditional data science tools such as SQL, R, or Python. Data exploration rapidly transitions into feature creation and data cleaning, resulting in a dataset that is prepared for modeling. Once the first round of feature creation is complete, data scientists train a machine-learning model using one of the many modeling packages for R, or scikit-learn for Python, ideally splitting the data into several subsets to simulate out-of-sample, unseen data. Typically, this entire process, starting with data collection, repeats until model performance meets the business goals of the project.
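To make this concrete, here is a minimal sketch of that batch workflow in Python using scikit-learn. It assumes a hypothetical flat file of flow records ("flows.csv") with a "label" column; the file name, columns, and model choice are illustrative, not a prescribed pipeline.

```python
# Minimal sketch of the batch training loop described above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Data collection: load the static dataset ("the data").
df = pd.read_csv("flows.csv")  # hypothetical flat file

# Feature creation / data cleaning: here, just drop missing rows.
df = df.dropna()
X = df.drop(columns=["label"])
y = df["label"]

# Hold out a subset to simulate out-of-sample, unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a model and check whether performance meets the goal;
# in practice, this whole loop repeats from data collection.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```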

Once the model's performance is sufficient, it is ready to be put into "production." In production, a model "scores" new, unseen data using the model developed during the training phase. Transitioning a model from training to scoring can be a lengthy process. First, "the data" is typically a static dataset, so a deployed model must first be transitioned to live data. This entails ensuring that all of the data features used as inputs to the model are available whenever the model needs to run. Second, production systems are often built using different programming languages and frameworks than those used to train the models, requiring additional software engineering effort to translate the model code into the production language of choice.
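As a sketch of what that handoff involves, the snippet below serializes the trained model with joblib and loads it in a separate scoring process. Here, extract_features() is a hypothetical stand-in for the production feature pipeline, and the event fields are illustrative.

```python
import joblib

# Training environment: persist the fitted model once.
joblib.dump(model, "model.joblib")

# Production environment (often a separate codebase): load it.
production_model = joblib.load("model.joblib")

def extract_features(event):
    # Hypothetical feature pipeline: it must reproduce, from live
    # data, exactly the features the model saw during training.
    return [[event["bytes"], event["packets"], event["duration"]]]

def score(event):
    # Probability of the positive class for a binary classifier.
    return production_model.predict_proba(extract_features(event))[0][1]
```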

When a model depends on many different datasets, a common choice is to deploy it on top of the training architecture. This avoids the additional effort of re-implementing feature engineering and maintaining multiple data pipelines. However, scoring on the training architecture is not always the most efficient approach; for an alternative, check out the recent collaboration between Redis and Spark.

Busting Complexity with Streaming Machine Learning

When we speak to network analysts about deploying machine learning, we hear a common response: they do not have the budget, rack space, or know-how to set up the typical "big data" infrastructure for batch machine learning. This is primarily a complaint about the complexity inherent in a "big data" system. Network analysts often assume that effective machine learning requires a large Hadoop cluster running Apache Spark, and they rapidly reach the mistaken conclusion that machine learning won't fit in their environment.

Though a powerful and popular choice, the reliance on large-scale systems is, in part, a by-product of the assumption that machine learning and batch processing must go hand in hand. Storing the large amounts of data required for batch machine learning is the primary driver of the development of massively parallel systems. But if the data storage component could be removed from machine learning, much of the engineering complexity required to maintain those systems could be removed as well. OpnIDS and the Dragonfly MLE do just that by implementing a streaming machine-learning engine that operates on network data as it flows through the network sensor.
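The snippet below is a generic illustration of this idea, not the specific analytics shipped with the Dragonfly MLE: it keeps only constant-size running state (Welford's online mean and variance) and scores each record as it arrives, with no stored dataset.

```python
# Illustrative streaming scorer: per-record processing with only
# constant-size state, so no dataset ever needs to come to rest.
import math

class StreamingZScore:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's online update: O(1) state per feature.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else abs(x - self.mean) / std

scorer = StreamingZScore()
for value in (120, 130, 125, 128, 5000):  # e.g., bytes per flow
    anomaly = scorer.score(value)
    scorer.update(value)
    if anomaly > 3.0:
        print(f"possible anomaly: {value} (z={anomaly:.1f})")
```

Because the state is a handful of numbers per feature, there is nothing to warehouse and nothing to re-scan.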

In addition to removing the complexity of data storage, deploying the machine-learning engine at the network sensor allows data to be extracted directly from the network, reducing data pipeline complexity as well. The Dragonfly MLE provides a stable, well-defined platform for deploying statistical analysis of network data.
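To illustrate the shape of per-event processing at the sensor, the sketch below reads Suricata-style EVE JSON flow records line by line and applies the streaming scorer from the previous example. The file name and threshold are assumptions, and this mimics the data flow rather than the Dragonfly MLE's actual analyzer API.

```python
import json

def events(path):
    # Yield each JSON event in the log; a real deployment would
    # tail the file as the sensor appends new records.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

scorer = StreamingZScore()  # from the sketch above
for event in events("eve.json"):
    if event.get("event_type") == "flow":
        value = event.get("flow", {}).get("bytes_toserver", 0)
        anomaly = scorer.score(value)
        scorer.update(value)
        if anomaly > 3.0:
            print("possible anomaly, flow_id:", event.get("flow_id"))
```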

Streaming Reduces Latency, Too

A second unfortunate by-product of batch machine learning is increased latency: each additional infrastructure layer that the data must traverse before coming to rest adds delay, as does processing the whole dataset at rest. Streaming, like that used in the Dragonfly MLE, is a lightweight way to process traffic at near wire speed. This creates the potential for detecting possible threats as soon as they cross the wire.

Streaming Machine Learning with Dragonfly MLE

The Dragonfly MLE included within OpnIDS is a streaming machine learning platform designed to speed up the detection of threats and ease the deployment of new threat detection models. Keep watching this blog and the OpnIDS site for more innovations.