For some, a billion-row dataset is an abstraction, something they will never face in reality. For others, it's part of everyday work. Maybe you're somewhere in between. Whatever your personal experience with large datasets, I invite you to consider this question for a moment: what would you do with a billion rows?
At Gigasheet, part of our engineering culture is to expect billion-row datasets, and as a result our product can handle them just about as easily as smaller files, as I’ll show in the demo below. As CTO, I’m proud of that, and I could easily write a whole blog post about how great our tech is, but I think it’s also worth looking at some broader aspects of this idea of a billion rows.
In this post, I’ll demo how Gigasheet handles a billion rows, look at what else is out there to help with billion-row datasets, and finally share some examples and metrics about the kinds of files that we see our users uploading into our billion-row-capable product. And of course, if your work involves analyzing massive data files, I invite you to sign up for Gigasheet and try it out for free!
This post is based on a recorded presentation I made for csv,conf,v7, so you can also check out this content in video form here.
Let's see Gigasheet's answer to the question of what you can do with a billion rows. I'll be demonstrating the Gigasheet web application, where I've already uploaded a sample file containing synthetic network connection data. Although it's a massive CSV file with over one billion rows, Gigasheet handles it without any prior tuning. By simply dropping the file into Gigasheet, I can open it up and get started on my analysis.
So what do we want to do with our billion rows? Typically, one wants to understand the data and gain insights from it, just as with smaller datasets. For this example, let’s say I'm interested in identifying high-volume hosts on my network. To accomplish this, I'll follow the standard analysis steps of summarizing, filtering out noise, and drilling down.
The workflow goes like this in Gigasheet:
1. Summarize: group the rows by host so I can see connection counts and traffic volume for each one.
2. Filter out noise: sort by volume and narrow the view to the handful of hosts responsible for the most traffic.
3. Drill down: open up each of those high-volume hosts to see which IP addresses they are actually talking to.
From this analysis, it becomes apparent that there are three high-volume hosts communicating with only one IP address each, which is unusual behavior, and might prompt me to further investigate those machines on my network. This shows how I use Gigasheet to make sense of a billion rows.
Watch three minutes of my presentation starting here to see a live demo of these steps in Gigasheet.
Working with massive datasets can be challenging. For those without programming skills, the usual tools are Excel or Google Sheets, but neither can handle the kind of scale we're talking about here. Even for programmers who are comfortable on the command line, popular tools like Python's pandas or even grep will likely be constrained by your machine's hardware, making a billion rows too slow for interactive analysis. With some investment of dollars, engineering effort, or both, you may find success by hosting database servers like PostgreSQL or by using other commercial solutions, but the cost of getting started can be substantial. That's why we believe Gigasheet is the best way to crack open a file and get to work.
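To make that concrete, here's a rough sketch of what a do-it-yourself pandas approach to the earlier "high-volume hosts" question might look like: the CSV is read in chunks so it fits in memory, and traffic is totaled per source host. The file name, column names, and chunk size are all hypothetical, not the actual schema of the demo file.

```python
import pandas as pd

CSV_PATH = "connections.csv"   # hypothetical ~billion-row CSV of network connections
CHUNK_SIZE = 1_000_000         # rows per chunk, to keep memory use bounded

# Accumulate total bytes transferred per source host, one chunk at a time.
totals = {}
for chunk in pd.read_csv(CSV_PATH, usecols=["src_ip", "bytes"], chunksize=CHUNK_SIZE):
    for src_ip, total in chunk.groupby("src_ip")["bytes"].sum().items():
        totals[src_ip] = totals.get(src_ip, 0) + total

# Print the ten highest-volume hosts.
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
for src_ip, total in top:
    print(src_ip, total)
```

Even this straightforward aggregation means scanning the entire file again for every new question you want to ask, which is exactly the kind of friction an interactive tool is meant to remove.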
At Gigasheet, a lot of datasets come through our front door as users turn to our technology to complete their analytical work, and they reveal a couple of interesting findings. First, even though we think of "big" as meaning a billion rows, most CSV files parsed by Gigasheet are under 100,000 rows. This shows that the point at which a dataset becomes "too big" is a matter of perspective.
Most CSV files parsed by Gigasheet are smaller than 100,000 rows, and only one file had over one billion rows during this time period.
Second, we see that there is a self-selection effect where data creators are more parsimonious with their data as the row count increases. The higher the row count, the smaller the average size of each row.
The y-axis shows the bytes per row for each file and the x-axis shows the base-10 logarithm of the file's row count. The plot shows a trend of higher bytes per row at smaller row counts, and the complete absence of dots in the top right shows that no files in the sample combined very high bytes per row with very high row counts. This may indicate that data creators become more selective about which attributes are worth including as the row count of a dataset grows.
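If you'd like to build a similar profile of your own files, here's a minimal sketch of how these two metrics could be computed. The directory path is hypothetical, and file size divided by row count is used as a simple stand-in for bytes per row.

```python
import glob
import math
import os

import matplotlib.pyplot as plt

# Hypothetical directory of CSV files to profile.
points = []
for path in glob.glob("uploads/*.csv"):
    size_bytes = os.path.getsize(path)
    # Count data rows without loading the whole file into memory.
    with open(path) as f:
        row_count = sum(1 for _ in f) - 1  # subtract the header row
    if row_count > 0:
        points.append((math.log10(row_count), size_bytes / row_count))

# Scatter plot: log10(row count) on the x-axis, bytes per row on the y-axis.
xs, ys = zip(*points)
plt.scatter(xs, ys)
plt.xlabel("log10(row count)")
plt.ylabel("bytes per row")
plt.show()
```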
To illustrate some of the different kinds of large scale data we encounter at Gigasheet, here are some examples of large CSV files from our users, including one that is over one billion rows.
Example 1: Sensor
Stats:
Observations:
Example 2: Finance
Stats:
Observations:
Example 3: Domain Abuse
Stats:
Observations:
Example 4: Healthcare
Stats:
Observations:
Example 5: Geographic
Stats:
Observations:
Making sense of a billion rows of data might feel impossible, but with Gigasheet, the process becomes straightforward and efficient. Our platform offers a friendly spreadsheet interface that empowers you to explore, summarize, and analyze massive datasets. These days, data is everywhere. Whether you're an analyst, a data scientist, or simply curious about some data you encountered, Gigasheet will get you to the insights you’re looking for in data of any size. Sign up and try it for free!