In its simplest form, data science is the process of extracting meaningful and actionable information from massive sets of structured, semi-structured, and unstructured data. While the discipline has been around since the early 1960s, it has increased in popularity in recent years with the advent of powerful cloud-based data analysis tools, such as Tableau and Microsoft Power BI. Many industries rely on data science to make business decisions, such as finance, payments, and transportation. Similarly, in cybersecurity, data science can complement rules- and behavioral-based analysis tools to help detect more covert patterns or behaviors and allow for quicker response and more thoughtful decision-making.
Many data analysis and manipulation tools are available on the market today. Some are commercial, others are open source, some are easy to use, and some have a significantly steep learning curve. This blog shows you how to use Gigasheet for performing basic threat hunting. It demonstrates Gigasheet's easy-to-use data analysis features and functions by completing a Pandas data science lab developed by Mike Sconzo and David Dorsey, all in a few simple steps.
Pandas is an open-source tool for data analysis, manipulation, and visualization built on Python. Developed in 2008 at AQR Capital Management, the project has had over 2,000 contributors since its inception and is widely used in academic and commercial domains. In its most straightforward application, the tool works by ingesting an input file (i.e., CSV, JSON, or HTML) from which it creates a 2-dimensional data frame or table to store the file's data. It then applies some user-defined processing and returns the result on the screen as either text or plot.
If you want to follow along, you can sign up for a free Gigasheet account, or you can see it in action by watching this short tutorial.
The lab intends to analyze a large conn.log Zeek (formerly Bro) file using Pandas. But before you can start analyzing data, the lab instructions direct you to create a sample of the conn.log file because it is too large to fit in memory. The tool reads about 300k randomly selected lines and writes them to a new file.
Next, Pandas imports the new file into memory to create a data frame and verifies that the data was loaded correctly using some head() and tail() commands.
Figure 1 - File Sampling & Creation with Pandas
Figure 2 - File Input & Verification with Pandas
First off, with Gigasheet we don't need to sample the input file. We don't even need to download the file locally and unzip it before importing it to Gigasheet. We can copy and paste the file's URL into Gigasheet's link import function, and Gigasheet does the rest. The file we will analyze in Gigasheet contains 22 million lines (as opposed to 300k analyzed with Pandas).
Figure 3 - File Upload & Verification with Gigasheet
Data Summarization and Types
Next, the lab instructs you to run a describe() function to obtain a numeric summarization of all numerical columns and a dtypes function to see the type of information stored in each column.
Figure 4 - Data Summarization and Types with Pandas
How do we do the same in Gigasheet? We open the uploaded file to see all the columns and rows. That's it!
Figure 5 - Data Summarization and Types with Gigasheet
Next, the lab directs you to change the ts (timestamp) column to a readable date-time format. The timestamp in Bro files is in Unix time format, which is difficult for humans to discern. In Pandas, you change the ts column format using the to_datetime() function. However, the lab instructions do not specify how to select a time zone. It then uses the describe() function again to verify that the format conversion was successful and removes the original ts column from the data frame.
Figure 6 - Convert Column Types and Remove Columns with Pandas
In Gigasheet, we apply the Cleanup Unix Time to the ts column (or column labeled A in Figure 7) and select a timezone, which creates a new column with the desired time format and zone. We can then uncheck the ts column to remove it from view or delete it forever.
Figure 7 - Convert Column Types with Gigasheet
Next, the lab instructs you to run several functions to select all the rows that contain SSL traffic and identify unique originator bytes values. It does this in three steps using conn_df and conn_df.head functions.
Figure 8 - Row Selection in Pandas
In Gigasheet, we can apply a filter to the service column to look for all values matching SSL and then group by the origin_bytes column to see all the unique originator bytes values. We can even sort in ascending or descending order with one click.
Figure 9 - Row Selection in Gigasheet
For the final exercise, the lab challenges you to use all the techniques described to find the unique source ports and IP addresses associated with all HTTP connections over any ports other than 80. It even gives bonus points for the number of connections for each source IP address.
In Gigasheet, we can apply a filter to the service column to look for values equal to HTTP. We can then use a second filter on the id.resp_p column (or responding host's port) to look for values NOT matching port 80, which will display all HTTP connections over any port except 80. To find all unique source ports, we can group the connections by the id.orig_h column (or originating host's IP address), and lastly, we can re-group by the id.orig_port column (or originator's source port) to find all unique source ports.
Figure 10 – Final Exercise in Gigasheet