This blog is part three of the Insider Threat Hunt: The Series, a collection of blogs where we demonstrate how to analyze large synthetic data sets for insider threat patterns. In part one of the series, we identified a user who attempted to access company information assets remotely after terminating employment with an organization. The second part of the series focused on identifying a user who had uploaded company information to wikileaks.org before resigning. This blog analyzes another dataset from Carnegie Mellon University's Insider Threat Dataset (available for public download at KiltHub) to identify a user who leaves an organization to work for a competitor. Before ending employment, the user begins visiting job-search websites and soliciting employment from competitors, and then saves considerable amounts of data to a USB device.
The data referenced in this blog are fictitious, including email addresses, company names, and individuals' names.
If you would like to follow along, create a free Gigasheet account, download a copy of the dataset, and get hunting.
The Dataset
The dataset used in this demonstration is approximately 4 GB compressed, containing seven (7) data sources. However, for this demonstration, we only use the following four (4):
The Analysis
We uploaded the device.csv, http.csv, email.csv, and the multiple (18) LDAP files to Gigasheet. Each LDAP file is named YYYY-MM.csv, where YYYY indicates the year and MM the month the file was generated.
The scenario under analysis involves a user who began browsing job websites, soliciting employment from competitors, and stealing company data before leaving the organization. We can start the investigation by analyzing the http.csv file to identify user connections to job-related websites. Unfortunately, the websites in the http.csv file are not categorized, making it somewhat challenging to locate specific sites. We could use the built-in search function to search for job-related keywords, such as job, career, or employment, but a search across all 23.8 million rows may take more time than we want to spend.
A more efficient approach might involve using the filter function, which we can apply to specific columns. We can create a filter to search for any of the following keywords in the URL column:
After a few seconds, Gigasheet returns 29,766 results.
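For readers who prefer to reproduce the keyword filter outside Gigasheet, the following is a minimal Python sketch of the same idea. It is not how Gigasheet implements filtering. The column names (id, date, user, pc, url), the sample rows, and the helper name filter_job_urls are assumptions made for illustration, and the keywords are the three examples mentioned above (job, career, employment).

```python
import csv
import io

# Hypothetical sample standing in for http.csv; column names are assumed
# and the rows are invented for illustration.
SAMPLE_HTTP = """id,date,user,pc,url
1,07/18/2010 09:12:01,AAA0001,PC-1,http://careerbuilder.com/jobs/search.html
2,07/18/2010 09:15:44,AAA0001,PC-1,http://example.com/news/today.html
3,07/19/2010 10:02:10,BBB0002,PC-2,http://example.org/employment/openings.html
"""

KEYWORDS = ("job", "career", "employment")


def filter_job_urls(rows, keywords=KEYWORDS, column="url"):
    """Return rows whose URL contains any keyword (case-insensitive)."""
    return [
        row for row in rows
        if any(keyword in row[column].lower() for keyword in keywords)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_HTTP)))
job_rows = filter_job_urls(rows)
print(len(job_rows), "of", len(rows), "rows match")
```

To run this against the real file, the io.StringIO sample could be replaced with an open file handle. Reading row by row rather than loading everything at once would matter at 23.8 million rows.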
We further narrow the analysis by grouping the filtered URL column to identify all unique websites, revealing thirty-eight (38) unique records.
We can also list the unique job-related domains within the filtered URL column. To do so, we export the first page, re-upload the exported file (export.csv) to Gigasheet, and apply the 'split column' function to the GROUP column with a forward slash as the delimiter, which separates each URL's domain from its path. The result contains ten (10) unique domains, including:
The scenario stated that the user accepted a position with a competitor organization. Since the dataset does not specify the industry in which this company operates, we cannot conclude whether any of the previously identified domains belong to a competitor. However, the scenario noted that the user solicited employment, which may have led to the user exchanging email messages with the competitor's representatives.
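As an aside, the domain-extraction step described earlier (splitting each URL on forward slashes) can also be sketched in Python. This version uses the standard library's urllib.parse.urlsplit instead of a literal split on "/", which is a substitution on our part and not what Gigasheet does internally. The helper name extract_domain and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host portion of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # urlsplit only recognizes a host when it is introduced by '//'.
        url = "//" + url
    return urlsplit(url).netloc.lower()


# Hypothetical URLs; the paths are invented for illustration.
urls = [
    "http://careerbuilder.com/jobs/search.html",
    "careerbuilder.com/resume/upload.html",
    "http://yahoo.com/careers/index.html",
]

unique_domains = sorted({extract_domain(url) for url in urls})
print(unique_domains)
```

Collecting the hosts into a set removes duplicates, which mirrors the grouping step used to count unique domains.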
Next, we analyze the email.csv file. We start by filtering the TO column for the domains identified previously, omitting domains for search engines (e.g., yahoo.com) or career sites (e.g., careerbuilder.com).
The filter returns zero matches, which may indicate that our list of job-related domains is incomplete or that the identified domains are not used for email.
Let's focus on lockheedmartinjobs.com, which may not be a valid email domain. We can modify our filter to look for 'lockheed' instead of the full domain name.
The updated filter returns thirteen (13) email communications between a user named James Chester English (james.chester.english@dtaa.com) and a Lockheed representative named Ivor Kato Kramer (ivor.kato.kramer@lockheed.com).
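The relaxed filter on the TO column can be sketched the same way. The snippet below is only an illustration: the column names (id, date, from, to), the semicolon-separated recipient list, the timestamps, and the non-Lockheed addresses are all assumptions, and emails_to_keyword is a hypothetical helper. Only the two named addresses come from the results above.

```python
import csv
import io

# Hypothetical sample standing in for email.csv; column names and
# timestamps are assumed, not copied from the dataset.
SAMPLE_EMAIL = """id,date,from,to
1,07/18/2010 10:00:00,james.chester.english@dtaa.com,ivor.kato.kramer@lockheed.com
2,07/19/2010 11:30:00,someone@dtaa.com,colleague@dtaa.com;other@example.com
3,08/09/2010 16:45:00,james.chester.english@dtaa.com,ivor.kato.kramer@lockheed.com
"""


def emails_to_keyword(rows, keyword, column="to"):
    """Return rows whose recipient field contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [row for row in rows if keyword in row[column].lower()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_EMAIL)))
matches = emails_to_keyword(rows, "lockheed")
senders = sorted({row["from"] for row in matches})
print(len(matches), senders)
```

A substring match is deliberately loose. It finds lockheed.com even though the browsing data pointed at a different domain, at the cost of possibly matching unrelated addresses that happen to contain the keyword.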
The last email communication between James Chester English and the Lockheed representative occurred on August 9, 2010. If James Chester English is the individual we are looking for, he would have resigned from the organization after accepting a job offer from Lockheed. Since the last communication between James and Lockheed was in August 2010, we can start from the assumption that James resigned in or after August.
Next, we analyze the August 2010 LDAP file (2010-08.csv) to determine whether James was employed in August. Searching 'James Chester English' using Gigasheet's built-in search function returns one match. The result also reveals James' username, JCE0258, which we can use to continue the analysis.
Knowing that James was employed in August, we search for him in the September LDAP file (2010-09.csv) to see whether he still appears.
The absence of an employee record for James Chester English in the September LDAP file implies that James left the company the month prior.
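The month-over-month comparison amounts to a set difference between two LDAP snapshots: users present in August but missing in September. A minimal sketch follows, assuming each LDAP file has employee_name and user_id columns. Those column names, the second sample employee, and the helper user_ids are hypothetical.

```python
import csv
import io

# Hypothetical excerpts standing in for 2010-08.csv and 2010-09.csv;
# column names are assumed and "Example Employee One" is invented.
LDAP_AUGUST = """employee_name,user_id
James Chester English,JCE0258
Example Employee One,EEO0001
"""

LDAP_SEPTEMBER = """employee_name,user_id
Example Employee One,EEO0001
"""


def user_ids(ldap_csv_text, column="user_id"):
    """Collect the set of user IDs listed in one monthly LDAP snapshot."""
    return {row[column] for row in csv.DictReader(io.StringIO(ldap_csv_text))}


# Users listed in August who no longer appear in September.
departed = user_ids(LDAP_AUGUST) - user_ids(LDAP_SEPTEMBER)
print(sorted(departed))
```

Running the same difference across each consecutive pair of the 18 monthly files would list every departure in the dataset, not only the one under investigation.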
We can now conclude the analysis by reviewing the device.csv file for any indications of possible data exfiltration. Within the device.csv file, we filter the USER column for James' username, JCE0258, revealing 810 matches.
The email communications between James and Lockheed occurred between July 18 and August 9, 2010, and therefore, James could have exfiltrated the data within this period. To narrow the results even further, we look for any device activity involving James' username between July 18 and August 9, 2010, resulting in 67 matches.
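The combined username and date-range filter can be sketched as follows. The column names (id, date, user, pc, activity), the MM/DD/YYYY HH:MM:SS timestamp format, the sample rows, and the helper events_in_window are assumptions for illustration. Only the username and the July 18 to August 9 window come from the analysis above.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample standing in for device.csv; column names, timestamp
# format, and rows are assumed.
SAMPLE_DEVICE = """id,date,user,pc,activity
1,07/10/2010 08:00:00,JCE0258,PC-1,Connect
2,07/20/2010 18:30:00,JCE0258,PC-1,Connect
3,07/20/2010 18:45:00,JCE0258,PC-1,Disconnect
4,08/01/2010 09:00:00,EEO0001,PC-2,Connect
5,08/10/2010 09:00:00,JCE0258,PC-1,Connect
"""

DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def events_in_window(rows, user, start, end):
    """Return rows for `user` dated from `start` through `end`, inclusive."""
    selected = []
    for row in rows:
        event_day = datetime.strptime(row["date"], DATE_FORMAT).date()
        if row["user"] == user and start <= event_day <= end:
            selected.append(row)
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_DEVICE)))
window = events_in_window(rows, "JCE0258", date(2010, 7, 18), date(2010, 8, 9))
print([row["id"] for row in window])
```

Comparing calendar dates rather than full timestamps keeps events from any time of day on August 9 inside the window.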
The device.csv file only records USB device connect and disconnect events. It does not provide any information about file transfers, such as file names, transfer rates, file sizes, or volumes. Without this information, it may be challenging for the organization to prove that the user stole company data.