I recently came across a white paper published by the SANS Institute. The white paper is entitled "ExcavationPack: A Framework for Processing Data Dumps" and presents a technical approach for processing unstructured data dumps to identify interesting information. The author notes that the framework follows a "Bring Your Own Data" (BYOD) model in which security researchers, organizations, and others can privately analyze and mine large volumes of data, mainly publicly available breached data.
I was excited to use this framework to search publicly available data dumps. Yet the more I read about it, the more discouraged I became: the author's framework relies on Docker containers, SQL databases, container orchestration, and other software dependencies to build the data processing layer, which raises the barrier to adoption considerably.
While the framework has great merit in its application, it may not be easy to adopt for someone with little or no knowledge of containers, container orchestration, and databases. Not to mention the need to build, deploy, and maintain the technical infrastructure to process, transform, and analyze the data. So, I decided to explore Gigasheet by analyzing a large data dump to show how easily and quickly one can find interesting information.
For this illustration, I chose the data dump named "Collection #1," reported by Troy Hunt, the founder of "Have I Been Pwned?". Collection #1 dates back to 2019 and contains over 2 billion rows of email address and password combinations from different data sources, but I extracted only the first 111 million rows for this analysis. Each row in the file contains an email address and password combination separated by a colon.
I will not be looking for anything specific in this data dump, nor will I try to identify the source of the breach.
Instead, this demonstration aims to show how to analyze large data sets quickly using Gigasheet without building infrastructure or learning new technologies. We will begin by identifying the top ten email domains in the data dump, followed by the top ten U.S. government domains (.gov).
The first step in the process is to upload the data dump file to Gigasheet in CSV format. The sample Collection #1 file used here is ~3 GB and contains ~111 million rows of email address and password combinations separated by a colon. I did not clean the file before uploading, so it may contain duplicates; that does not matter for this analysis.
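If you are curious what preparing such a sample looks like in code, the sketch below shows one way to carve a colon-separated dump into a CSV locally using Python. The file names and row limit are assumptions for illustration; none of this is required to use Gigasheet.

```python
import csv

# Assumed file names; this is only a sketch of how one might build a CSV sample locally.
SOURCE = "collection1_raw.txt"      # lines like "user@example.com:password"
TARGET = "collection1_sample.csv"
MAX_ROWS = 111_000_000

with open(SOURCE, "r", encoding="utf-8", errors="replace") as src, \
     open(TARGET, "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["email", "password"])
    for i, line in enumerate(src):
        if i >= MAX_ROWS:
            break
        # Split only on the first colon in case the password itself contains one.
        email, _, password = line.rstrip("\n").partition(":")
        writer.writerow([email, password])
```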
We won't need the password column for this analysis since we are only analyzing email domains, so we can hide it from view to declutter the screen by simply unchecking the column name.
We can then split the email name from the domain with the "split column" feature using "@" as a separator, which will place the email domains in a separate column.
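For comparison, here is a minimal pandas sketch of the same two steps, dropping the password column and splitting the address at "@". The file and column names are assumptions carried over from the earlier sketch, and a ~3 GB file would normally be read in chunks rather than loaded all at once.

```python
import pandas as pd

# Load only the email column, which effectively hides the passwords.
df = pd.read_csv("collection1_sample.csv", usecols=["email"])

# Split each address at "@" and keep the part after it as the domain,
# mirroring Gigasheet's "split column" feature.
df["domain"] = df["email"].str.split("@").str[-1].str.lower()
```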
Results:
The next step is data analysis. Grouping the email domain column by unique values quickly reveals the top ten email domains. Yahoo, Gmail, and Hotmail lead the race, which is not surprising since they are the most popular email providers worldwide.
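In code, this grouping step amounts to a simple value count, roughly as sketched below (continuing with the assumed "domain" column from the earlier sketch).

```python
# Count unique domains and keep the ten most common,
# the equivalent of Gigasheet's group-by-unique-values step.
top_domains = df["domain"].value_counts().head(10)
print(top_domains)
```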
To reveal the top ten U.S. government email domains, we can filter the email column for values ending in .gov and group the domain column by unique values.
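The equivalent filter-then-group operation in pandas might look like the following sketch, again using the assumed "domain" column.

```python
# Keep only .gov domains, then count unique values,
# mirroring the filter and grouping steps in Gigasheet.
gov = df[df["domain"].str.endswith(".gov", na=False)]
top_gov_domains = gov["domain"].value_counts().head(10)
print(top_gov_domains)
```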
The result reveals 997 unique email domains ending in .gov. The National Institutes of Health (mail.nih.gov and nih.gov) leads the race with 1,654 accounts, followed by NASA (nasa.gov) with 478 accounts and the Department of Veterans Affairs (va.gov) with 352 accounts.
The process described above took approximately two hours to complete, including downloading the Collection #1 data dump (~39 GB). The ExcavationPack framework is an excellent option for someone comfortable with the underlying technology or with the time and interest to learn it. If your main goal is to analyze data and you need ease of use, a minimal learning curve, and faster adoption, give Gigasheet a try.