For some, a billion-row dataset is an abstraction, something they will never face in reality. For others, it's part of everyday work. Maybe you're somewhere in between. Whatever your personal experience with large datasets, I invite you to consider this question for a moment: what would you do with a billion rows?
At Gigasheet, part of our engineering culture is to expect billion-row datasets, and as a result our product can handle them just about as easily as smaller files, as I’ll show in the demo below. As CTO, I’m proud of that, and I could easily write a whole blog post about how great our tech is, but I think it’s also worth looking at some broader aspects of this idea of a billion rows.
In this post, I’ll demo how Gigasheet handles a billion rows, look at what else is out there to help with billion-row datasets, and finally share some examples and metrics about the kinds of files that we see our users uploading into our billion-row-capable product. And of course, if your work involves analyzing massive data files, I invite you to sign up for Gigasheet and try it out for free!
This post is based on a recorded presentation I made for csv,conf,v7, so you can also check out this content in video form here.
Let's see Gigasheet's answer to the question of what you can do with a billion rows. I'll be demonstrating the Gigasheet web application, where I've already uploaded a sample file containing synthetic network connection data. Although it's a massive CSV file with over one billion rows, Gigasheet handles it without any prior tuning. By simply dropping the file into Gigasheet, I can open it up and get started on my analysis.
So what do we want to do with our billion rows? Typically, one wants to understand the data and gain insights from it, just as with smaller datasets. For this example, let’s say I'm interested in identifying high-volume hosts on my network. To accomplish this, I'll follow the standard analysis steps of summarizing, filtering out noise, and drilling down.
The workflow goes like this in Gigasheet:
1. Summarize: group the rows by host so I can see connection counts and traffic volume for each one.
2. Filter out noise: sort by volume and narrow the view to the handful of hosts responsible for the most traffic.
3. Drill down: open up each of those high-volume hosts to see which IP addresses they are actually talking to.
From this analysis, it becomes apparent that there are three high-volume hosts communicating with only one IP address each, which is unusual behavior, and might prompt me to further investigate those machines on my network. This shows how I use Gigasheet to make sense of a billion rows.
Watch three minutes of my presentation starting here to see a live demo of these steps in Gigasheet.
Working with massive datasets can be challenging. For those without programming skills, the usual tools are Excel or Google Sheets, but neither can handle the kind of scale we're talking about here. Even for programmers who are comfortable on the command line, popular tools like Python's pandas or even grep will likely be constrained by your machine's hardware, making a billion rows too slow for interactive analysis. With some investment of dollars, engineering effort, or both, you may find success by hosting database servers like PostgreSQL or by using other commercial solutions, but the cost of getting started can be substantial. That's why we believe Gigasheet is the best way to crack open a file and get to work.
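To make that concrete, here's a rough sketch of what a do-it-yourself pandas approach to the earlier "high-volume hosts" question might look like: the CSV is read in chunks so it fits in memory, and traffic is totaled per source host. The file name, column names, and chunk size are all hypothetical, not the actual schema of the demo file.

```python
import pandas as pd

CSV_PATH = "connections.csv"   # hypothetical ~billion-row CSV of network connections
CHUNK_SIZE = 1_000_000         # rows per chunk, to keep memory use bounded

# Accumulate total bytes transferred per source host, one chunk at a time.
totals = {}
for chunk in pd.read_csv(CSV_PATH, usecols=["src_ip", "bytes"], chunksize=CHUNK_SIZE):
    for src_ip, total in chunk.groupby("src_ip")["bytes"].sum().items():
        totals[src_ip] = totals.get(src_ip, 0) + total

# Print the ten highest-volume hosts.
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
for src_ip, total in top:
    print(src_ip, total)
```

Even this straightforward aggregation means scanning the entire file again for every new question you want to ask, which is exactly the kind of friction an interactive tool is meant to remove.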
At Gigasheet, a lot of datasets come through our front door as users turn to our technology to complete their analytical work, and they reveal a couple of interesting findings. First, even though we think of "big" as meaning a billion rows, most CSV files parsed by Gigasheet are under 100,000 rows. This shows that the point at which a dataset becomes "too big" is a matter of perspective.
Most CSV files parsed by Gigasheet are smaller than 100,000 rows, and only one file had over one billion rows during this time period.
Second, we see that there is a self-selection effect where data creators are more parsimonious with their data as the row count increases. The higher the row count, the smaller the average size of each row.
The y-axis shows the bytes per row for each file and the x-axis shows the base-10 logarithm of the file's row count. The plot shows a trend of higher bytes per row at smaller row counts, and the complete absence of dots in the top right shows that no files in the sample combined very high bytes per row with very high row counts. This may indicate that data creators become more selective about which attributes are worth including as the row count of a dataset grows.
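If you'd like to build a similar profile of your own files, here's a minimal sketch of how these two metrics could be computed. The directory path is hypothetical, and file size divided by row count is used as a simple stand-in for bytes per row.

```python
import glob
import math
import os

import matplotlib.pyplot as plt

# Hypothetical directory of CSV files to profile.
points = []
for path in glob.glob("uploads/*.csv"):
    size_bytes = os.path.getsize(path)
    # Count data rows without loading the whole file into memory.
    with open(path) as f:
        row_count = sum(1 for _ in f) - 1  # subtract the header row
    if row_count > 0:
        points.append((math.log10(row_count), size_bytes / row_count))

# Scatter plot: log10(row count) on the x-axis, bytes per row on the y-axis.
xs, ys = zip(*points)
plt.scatter(xs, ys)
plt.xlabel("log10(row count)")
plt.ylabel("bytes per row")
plt.show()
```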
To illustrate some of the different kinds of large scale data we encounter at Gigasheet, here are some examples of large CSV files from our users, including one that is over one billion rows.
Example 1: Sensor
Stats:
Observations:
Example 2: Finance
Stats:
Observations:
Example 3: Domain Abuse
Stats:
Observations:
Example 4: Healthcare
Stats:
Observations:
Example 5: Geographic
Stats:
Observations:
Making sense of a billion rows of data might feel impossible, but with Gigasheet, the process becomes straightforward and efficient. Our platform offers a friendly spreadsheet interface that empowers you to explore, summarize, and analyze massive datasets. These days, data is everywhere. Whether you're an analyst, a data scientist, or simply curious about some data you encountered, Gigasheet will get you to the insights you’re looking for in data of any size. Sign up and try it for free!