Business idea: we’re going to get rid of all that complicated, stressful stuff on your car’s dashboard and replace it with a single metric, a Car Health Score, powered by Machine Learning™! Our proprietary algorithm combines all the complex information about your car and converts it to a single number representing the status, so you no longer have to worry about interpreting those difficult quantities like “how much gas is left” and “how fast am I going.” What, you’re not interested? But it’s Machine Learning™!
Okay, maybe that one is obviously unappealing, but it’s surprising to me how often I see cybersecurity leaders adopting an attitude similar to that ridiculous example. Machine learning, so the story goes, delivers insights drawn from vast troves of data, powered by sophisticated algorithms that will find the needle in the haystack and present brilliant answers to us on a silver platter of mathematical goodness. It makes a very compelling PowerPoint presentation, and the marketing teams can crank out gorgeous pictures and snappy taglines to sell this point. Sometimes, the reality matches that beautiful picture. However, in many scenarios, faster and clearer insights are more easily obtained by counting the right things the right way than by running a complex analysis or statistical model.
To see how much easier it is to get clarity by counting things the right way, let’s consider the following example scenario. A security issue has been flagged indicating a potential breach of some company servers. Executives at the company need to know what the impact was. They want answers right away and with high confidence. There’s some good luck: data is available from internal monitoring systems about what connections occurred from the local IPs during the time period of interest. The bad news is that it’s a huge file of connection records, and someone needs to make sense of the data. Files like this can run to millions of rows. Our data might look something like this:
As analysts, our job is to figure out whether there’s anything scary or unusual indicated by the data, but that can be challenging when using machine learning tools as our main methodology. A typical starting point for machine learning might be to ask ourselves whether we should use a supervised model or an unsupervised model. Suppose we want to try a supervised model. In this case, we don’t have any labeled data that corresponds directly to this file. In some imaginary dream version of this scenario, perhaps we can generate a bunch of labels using a team of trained human experts who are sitting around just waiting to be asked. If we’re at a huge company, maybe we can find some data assets to leverage, or our organization might have purchased access to third-party data that we can use as labels. But many times, needing training data means that you personally must spend many hours annotating your dataset. That’s not a suitable turnaround time when responding to a security incident, not to mention being some of the most tedious work imaginable. So, most often, a supervised machine learning model is not going to be possible here.
Our prospects for using unsupervised machine learning to find interesting aspects of the data are a bit better. A common starting point for unsupervised learning is to pick some interesting attributes and then perform a clustering algorithm to see which things “go together,” where “together” is well-defined in some mathematical sense. One challenge is that the interesting parts of the data might only be a tiny portion of the overall volume. For example, our data contains many rows that are traffic between two local IPs, which is probably irrelevant. We could exclude those rows, but machine learning algorithms tend to perform better on larger sample sets, so we may be hampering the performance of the algorithm, and the patterns that emerge might be indistinguishable from random noise. This shows the main weakness of unsupervised clustering: you are almost certainly going to get some kind of output from the algorithm, but it’s not always obvious whether it’s doing anything useful.
Anomaly detection is another popular category of unsupervised learning algorithms. We meet the same difficulty here as with clustering, which is that the algorithm is likely to produce some sets of rows or cells that it deems “unusual,” but it’s not clear what that implies. Moreover, for both clustering and anomaly detection, even in the best-case scenario where they surface things that are of obvious importance and we are able to take action from that, we are still left with the nagging question of what else was in there. Unsupervised learning techniques find patterns, trends, anomalies, or other things that stand out; they say nothing definite about the rows that do not stand out. For that reason, unsupervised learning algorithms are ill-suited to make the assertion that the rest of the data is of no interest and can safely be ignored. After all is said and done with our machine learning approaches, we have some potential patterns to investigate and perhaps some indicators of suspicious activity taking place, but no firm conclusions of the kind that help us finalize an investigation.
In contrast with machine learning, straightforward count-based analytics deliver clear and explainable answers using only a few simple operations. We will wrap up this security incident using three count-based analytics that we construct out of counting, grouping, sorting, and filtering the data. Our first count-based analytic is the number of unique origin IPs, which will help us determine the internal scope of our investigation.
We create this analytic by grouping our data on origin IP and counting up the entries. We see that there are five local network IPs plus the router at 192.168.1.1 and the null address 0.0.0.0. Now we know that at most five of our local network machines were involved and any further internal investigations should focus on those internal addresses.
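The group-and-count step can be sketched in a few lines of standard-library Python. This is only an illustration: the field names `orig_ip` and `resp_ip` and the sample rows are assumptions made up for the example, not the actual schema or data from the incident file.

```python
from collections import Counter

# Hypothetical connection records. The field names and values are
# illustrative assumptions, not the real log schema.
records = [
    {"orig_ip": "192.168.1.10", "resp_ip": "203.0.113.5"},
    {"orig_ip": "192.168.1.10", "resp_ip": "192.168.1.11"},
    {"orig_ip": "192.168.1.11", "resp_ip": "198.51.100.7"},
    {"orig_ip": "192.168.1.1", "resp_ip": "192.168.1.10"},
]


def count_by_origin(rows):
    """Group rows by origin IP and count the entries in each group."""
    return Counter(row["orig_ip"] for row in rows)


counts = count_by_origin(records)

# Print each origin IP with its row count, busiest first.
for ip, n in counts.most_common():
    print(ip, n)
```

The number of keys in `counts` is the number of unique origin IPs, and the per-IP counts show which machines generated the most traffic.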
Our next count-based analytic is the maximum total bytes transferred to a single external responding host. This metric measures whether significant data transfer has occurred from an internal server to an external server, which enables us to determine whether any large scale data exfiltration took place.
We filter the data to external responding hosts, sum the bytes per responding host, and sort by that sum in descending order. We see that the largest transfer is 589,750 bytes, or about half a megabyte. From this, we can rule out the possibility of large-scale data exfiltration. However, there is still the possibility of an attack that has some other purpose besides data exfiltration.
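As a rough sketch of the filter, sum, and sort steps, the following uses only the Python standard library. It assumes the local network is 192.168.1.0/24 and that each record carries a hypothetical `orig_bytes` field holding the bytes sent by the originating host; both are assumptions for illustration, as are the sample rows.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Assumption: the local network is 192.168.1.0/24. Adjust for a real environment.
LOCAL_NET = ip_network("192.168.1.0/24")


def is_external(ip):
    """Return True when the address lies outside the assumed local network."""
    return ip_address(ip) not in LOCAL_NET


def bytes_per_external_responder(rows):
    """Sum bytes sent to each external responding host, largest total first."""
    totals = defaultdict(int)
    for row in rows:
        if is_external(row["resp_ip"]):
            totals[row["resp_ip"]] += row["orig_bytes"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical records; field names and values are illustrative only.
records = [
    {"orig_ip": "192.168.1.10", "resp_ip": "203.0.113.5", "orig_bytes": 1200},
    {"orig_ip": "192.168.1.11", "resp_ip": "203.0.113.5", "orig_bytes": 800},
    {"orig_ip": "192.168.1.10", "resp_ip": "198.51.100.7", "orig_bytes": 500},
    {"orig_ip": "192.168.1.10", "resp_ip": "192.168.1.11", "orig_bytes": 99999},
]

ranked = bytes_per_external_responder(records)
print(ranked[0])  # the external host that received the most bytes
```

The first element of `ranked` is the maximum total transferred to a single external host. The large local-to-local transfer in the sample is deliberately ignored by the filter.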
Our final count-based analytic is to generate a list of external IPs with a nonzero amount of incoming bytes over a nonstandard port and protocol. This analytic gets to the key question of whether there is anything we need to be concerned about in this traffic by checking for unusual inbound communication.
This is the most complicated of our three count-based analytics to compute, but it still only requires a few straightforward steps. First, we create a new data field that combines the responding host port, the protocol, and the service. Then, we filter to connections that have an external responding host and a data transfer from that host that is greater than zero. We group by our newly created combined data field and inspect the results. Of the five entries, we see three familiar ones: the typical HTTP on port 80, SSL on port 443, and other TCP on port 80. However, the other two entries look strange. HTTP on port 3389 is unusual because that is an RDP port, not an HTTP port. The other entry, port 10590, is not a recognized common port, so that stands out as well. When we look at the remote IPs under those ports, we see there are exactly two IPs using those unusual ports. This final result of the two suspicious IPs is a key finding of our investigation. These ports and IPs are the unusual traffic that warrants a deeper look, and if we want to take an aggressive stance we will block those two IPs right away while we continue to work the case.
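The combine, filter, and group steps might look something like the sketch below, again in standard-library Python. The field names (`resp_ip`, `resp_port`, `proto`, `service`, `resp_bytes`), the 192.168.1.0/24 local network, the `FAMILIAR` set, and the sample rows are all assumptions chosen for illustration rather than details of the real data.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Assumption: the local network is 192.168.1.0/24. Adjust for a real environment.
LOCAL_NET = ip_network("192.168.1.0/24")

# Assumption: port/protocol/service combinations the analyst treats as routine.
FAMILIAR = {"80/tcp/http", "443/tcp/ssl"}


def external_ips_by_service(rows):
    """Map each port/protocol/service key to the external IPs that sent data."""
    groups = defaultdict(set)
    for row in rows:
        external = ip_address(row["resp_ip"]) not in LOCAL_NET
        if external and row["resp_bytes"] > 0:
            # Combined field: responding port, protocol, and service.
            key = f"{row['resp_port']}/{row['proto']}/{row['service']}"
            groups[key].add(row["resp_ip"])
    return dict(groups)


# Hypothetical records; field names and values are illustrative only.
records = [
    {"resp_ip": "203.0.113.5", "resp_port": 443, "proto": "tcp",
     "service": "ssl", "resp_bytes": 5120},
    {"resp_ip": "198.51.100.7", "resp_port": 3389, "proto": "tcp",
     "service": "http", "resp_bytes": 640},
    {"resp_ip": "198.51.100.7", "resp_port": 80, "proto": "tcp",
     "service": "http", "resp_bytes": 0},
    {"resp_ip": "192.168.1.11", "resp_port": 3389, "proto": "tcp",
     "service": "http", "resp_bytes": 900},
]

groups = external_ips_by_service(records)

# Keep only the combinations that are not on the familiar list.
unusual = {key: sorted(ips) for key, ips in groups.items() if key not in FAMILIAR}
print(unusual)
```

Deciding what belongs in `FAMILIAR` is the analyst's judgment call; the code only does the bookkeeping of which external IPs appear under each combination.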
These three count-based analytics brought us to three concrete conclusions by combining simple operations of counting, grouping, sorting, and filtering. First, we verified which internal IPs were involved. Second, we obtained the negative result that there was no large-scale data exfiltration. Finally, we reached the positive finding that there was traffic on non-standard ports from two external IPs, which gave us the immediate option to block those IPs. These are the kinds of results that lead to decisions and security responses, and these findings make sense to communicate in an incident response report. The customer will be thrilled to receive these actionable conclusions, and if we are called upon to justify our statements, we can supply a few sentences or bullet points that provide complete clarity on the straightforward operations we used to obtain our answers.
While this data was from a fictitious attack, it highlights some key advantages of data analysis based on counting the right things:
I do feel compelled to acknowledge that there are still plenty of times when machine learning is the right tool for the job. If someone hands you a pile of pristine training data that applies perfectly to your problem domain, it would be foolish not to make use of it in a supervised machine learning model. In addition, recent advances in deep learning techniques have produced stunning results on web-scale datasets at companies like Google. Another area where there may be no other option besides more complex statistical analysis is when the input data requires some level of transformation before anything meaningful can be extracted, as can happen with text processing or other data that is represented with many thousands of dimensions. Lastly, our counting approach relies on choosing good quantities to count, which requires that the analyst come in with an idea of what to look for in the data. If we don’t know what we’re looking for, unsupervised modeling can be a great way to at least get started. These are examples of times when machine learning may be the right approach. A highly skilled analyst will stand ready to employ these more complex techniques as needed, but a wise analyst tries the simpler approaches first where possible, and resorts to complexity only when there is no other choice.
In practice, one challenge that analysts face is that even these simple counting operations become difficult when the data is very large. This is where Gigasheet shines. By providing the familiar interface of a spreadsheet running atop modern scalable database infrastructure, the operations of counting, grouping, sorting, and filtering become as easy to execute in practice as they are simple to understand in principle. In a world that loves to talk about machine learning, I encourage all data analysts of the world to remember the power of a clear and concrete analysis based on counting the right stuff the right way.