Identifying Top Talkers with Dragonfly MLE and Redis

Often, one of the first things a threat hunter asks when investigating a threat is "Who or what are the 'top talkers' on this network?" Top talkers can be measured in many different ways including endpoints that transfer the most bytes, protocols that are most prevalent on the network, or endpoints that are connecting to the most other network hosts.

We're going to highlight some built-in Redis functionality that, when combined with a Dragonfly MLE analyzer, can be used to compute top talkers in near real time. The Dragonfly MLE is a data science scripting engine included in the OpenIDS project. You can read more about the MLE here (Intro to Dragonfly MLE) or download the code from Github.

Redis bills itself as an "in-memory data structure store" (Redis.io). It has a small enough footprint to work in streaming environments such as Open IDS. Redis provides a variety of data structures including sets and hashes.

 

Top Talkers by Bytes Using the Redis SORTED SET

One strategy for determining top talkers is to count the bytes sent (or received) by each IP and track the total bytes over time. The number of bytes is included in the flow records that are emitted by Suricata within Open IDS. Since the flow record reports both the source and destination IP addresses, we can easily track the total number of bytes sent or received for each IP in the same MLE script.

One of the data structures provided natively by Redis is the "sorted set." A sorted set lets a user store a key (e.g., IP address) and a value (e.g., number of bytes). The set then keeps the records sorted by value. The ZADD command in Redis is dual purpose. It adds new keys to the set or creates a new set if it doesn't exist. The ZINCRBY command can be used to increment the value of a key, resulting in an updated sort order, if necessary. Finally, after the set has been created, the ZRANK command queries the sorted set to get the rank of a specific key. In the case of top talkers, both the rank and total number of bytes are of interest and can be attached to the flow record by the MLE for reference in downstream operations.

One challenge with the sorted set is that the size of the data structure grows with the number of entries. For large networks, this could be a problem if memory on the network sensor is being used to support other operations (such as packet capture) in addition to supporting the MLE. In the next section, we will introduce the idea of a data "sketch," a probabilistic data structure with constant size in exchange for providing approximate answers.

Counting Distinct Connections with HyperLogLog

In addition to counting bytes, it is useful to know how many unique connections are being made by a given endpoint. When combined with the bytes sent by each endpoint, the number of unique connections can be used to understand the diversity of destinations for those bytes.

For this functionality we are going to use a "data sketch" known as HyperLogLog. Data sketches are probabilistic data structures that can be used to dramatically reduce the amount of memory used to solve certain tasks in exchange for returning approximate answers instead of exact ones. HyperLogLog is a sketch used to solve the distinct count problem, exactly what we need to need count unique connections. To implement a HyperLogLog structure in Redis, use the PFADD command to add a value to the set and the PFCOUNT command to determine the number of unique endpoints.

In a world without HyperLogLog, to count the number of unique connections made by each endpoint we would need to maintain a list of unique IP addresses accessed by each endpoint. Unfortunately, for large networks, the number of entries grows with the square of the number IP addresses. For example, if 100 endpoints within an organization connected to all the other endpoints in that same organization, there would be 10,000 (100x100 = 100^2) possible connections.

With HyperLogLog, we can estimate the number of distinct items in a set (in this case the number of distinct connections) within a small error percentage of the true number (Redis reports <1% error in typical usage) but without the significant space blowup. To reduce space, HyperLogLog uses multiple hashing functions to map each new connection using a probabilistic binning strategy that occasionally has collisions — that is, two endpoints that hash to the same bin. The size and number of the hashes determines the correctness of the response.

Summary

Redis and Dragonfly MLE are a powerful combination for tracking network statistics in near real time. Get them now at OPNids.io. Examples demonstrating both of these techniques within the Dragonfly MLE can be found in GitHub.