Company & Product
Jul 20, 2023

What Would You Do with a Billion Rows?

For some, a billion row dataset is an abstraction, never something you will have to face in reality. For others, it might be part of your daily experience in your job. Maybe you’re somewhere in between. Whatever your personal experience with large datasets, I invite you to consider for a moment this question: what would you do with a billion rows?

At Gigasheet, part of our engineering culture is to expect billion-row datasets, and as a result our product can handle them just about as easily as smaller files, as I’ll show in the demo below. As CTO, I’m proud of that, and I could easily write a whole blog post about how great our tech is, but I think it’s also worth looking at some broader aspects of this idea of a billion rows.

In this post, I’ll demo how Gigasheet handles a billion rows, look at what else is out there to help with billion-row datasets, and finally share some examples and metrics about the kinds of files that we see our users uploading into our billion-row-capable product. And of course, if your work involves analyzing massive data files, I invite you to sign up for Gigasheet and try it out for free!

This post is based on a recorded presentation I gave at csv,conf,v7, so you can also check out this content in video form here.

A Billion Rows in Gigasheet

Let's see Gigasheet's answer to the question of what you can do with a billion rows. I'll be demonstrating the Gigasheet web application, where I've already uploaded a sample file containing synthetic network connection data. Although it's a massive CSV file with over one billion rows, Gigasheet handles it without any prior tuning. By simply dropping the file into Gigasheet, I can open it up and get started on my analysis.

Screenshot of a billion-row dataset open in Gigasheet.

So what do we want to do with our billion rows? Typically, one wants to understand the data and gain insights from it, just as with smaller datasets. For this example, let’s say I'm interested in identifying high-volume hosts on my network. To accomplish this, I'll follow the standard analysis steps of summarizing, filtering out noise, and drilling down.

The workflow goes like this in Gigasheet:

  1. My first summary is just to open the file and see the column headers and the first page of rows, enabling me to understand what kind of data I’m working with.
  2. From there, I'll filter the data on the byte count column to find high-volume hosts, setting a threshold of 500,000 bytes.
  3. An effective way to summarize is to group on a column of interest, so next I'll group on the Source IP column.
  4. Then I'll add another aggregation counting the number of unique destinations. (A rough code equivalent of these steps appears just after this list.)
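
For readers who prefer to think in code, here is a rough equivalent of steps 2-4, sketched with Python's pandas library on a file small enough to fit in memory (Gigasheet itself needs no code). The file name and column names ("bytes", "source_ip", "dest_ip") are assumptions for illustration, not the demo file's actual headers.

    import pandas as pd

    # Load the connection records (assumes the file fits in memory).
    df = pd.read_csv("connections.csv")

    # Step 2: keep only high-volume connections (> 500,000 bytes).
    high_volume = df[df["bytes"] > 500_000]

    # Steps 3-4: group by source IP, then count connections and unique destinations.
    summary = (
        high_volume.groupby("source_ip")
        .agg(
            connections=("dest_ip", "size"),
            unique_destinations=("dest_ip", "nunique"),
        )
        .sort_values("unique_destinations")
    )

    # Sources talking to very few distinct destinations stand out here.
    print(summary.head(10))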

From this analysis, it becomes apparent that there are three high-volume hosts communicating with only one IP address each, which is unusual behavior and might prompt me to investigate those machines on my network further. This shows how I use Gigasheet to make sense of a billion rows.

Screenshot of the analytical finding in Gigasheet, where I used filters and groups to identify three hosts on my network with high-volume traffic to one client each.

Watch three minutes of my presentation starting here to see a live demo of these steps in Gigasheet.

The Challenge of Handling Massive Files


Working with massive datasets can be challenging. For those without programming skills, the usual go-to tools are Excel or Google Sheets, but these cannot handle the kind of scale we're talking about here. Even for programmers who are comfortable on the command line, popular tools like Python's pandas library or even grep on your own machine will likely be too constrained by the hardware to handle a billion rows fast enough for interactive analysis. With some investment of dollars, engineering effort, or both, you may find success hosting database servers like PostgreSQL or using other commercial solutions, but the cost of getting started can be substantial. That's why we believe Gigasheet is the best way to crack open a file and get to work.
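
If you do try the do-it-yourself route, the usual workaround for files that will not fit in memory is to stream them in chunks. Here is a minimal sketch of that approach with pandas, again assuming a hypothetical connections.csv with source_ip and bytes columns; even one pass like this over a billion rows can take a long time on laptop hardware.

    import pandas as pd
    from collections import defaultdict

    # Accumulate total bytes per source IP, one million-row chunk at a time,
    # so the whole file never has to be loaded into memory at once.
    bytes_by_source = defaultdict(int)
    for chunk in pd.read_csv("connections.csv", chunksize=1_000_000):
        totals = chunk.groupby("source_ip")["bytes"].sum()
        for ip, total in totals.items():
            bytes_by_source[ip] += total

    # Sources that moved the most data in total.
    top_sources = sorted(bytes_by_source.items(), key=lambda kv: kv[1], reverse=True)[:10]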


Insights from Gigasheet User Data

At Gigasheet, a lot of datasets come through our front door as users turn to our technology for their analytical work. Here are a couple of interesting findings. First, even though at Gigasheet we think of “big” as meaning a billion rows, most CSV files parsed by Gigasheet are under 100,000 rows. This shows that the point at which a dataset is “too big” is a matter of perspective.

This figure shows the count of CSV files parsed by Gigasheet from September 2022 to March 2023, binned by row count, where each bin is 10x the row count of the one to its left.

Most CSV files parsed by Gigasheet are smaller than 100,000 rows and only one file had over one billion rows during this time period.

Second, we see a self-selection effect: data creators become more parsimonious with their data as the row count increases. The higher the row count, the smaller the average size of each row.

This figure shows bytes per row versus row count for a random sample of CSV files with at least 1,000,000 rows parsed by Gigasheet from September 2022 to March 2023.

The y-axis is the bytes per row for the file and the x-axis is the base-10 logarithm of the file's row count. The plot shows a trend of higher bytes per row at smaller row counts, and the complete absence of dots in the upper right shows that there were no files in the sample with very high bytes per row at very high row counts. For example, the 33.6GB geographic file described below spans roughly 1.1 billion rows, which works out to only about 30 bytes per row. This may indicate that data creators become more selective about which attributes are worth including as the row count of a dataset increases.

Large CSVs Parsed by Gigasheet

To illustrate some of the different kinds of large-scale data we encounter at Gigasheet, here are some examples of large CSV files from our users, including one that is over one billion rows.

Example 1: Sensor 

Stats:

  • Uploaded to Gigasheet: 2023-02-04
  • Rows: 104,618,520
  • Columns: 8
  • Column types: strings, dates, floats
  • Column headers? Yes
  • Size on disk: 1.3GB
  • Parsed with: Gigasheet CSV parser

Observations:

  • Column headers “road” and “measurement” suggest it is some kind of sensor reading associated with a road
  • User joined this with another file using a cross-file lookup and then filtered to matches that were in both files (a rough code equivalent follows below)
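
For context, the cross-file lookup described above behaves much like a join. Here is a rough sketch in pandas, with hypothetical file names and a hypothetical join key ("road", one of the headers mentioned above).

    import pandas as pd

    readings = pd.read_csv("sensor_readings.csv")
    roads = pd.read_csv("road_metadata.csv")

    # An inner join keeps only rows whose key appears in both files,
    # mirroring "filter to matches that were in both files."
    matched = readings.merge(roads, on="road", how="inner")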

Example 2: Finance

Stats:

  • Uploaded to Gigasheet: 2023-01-19
  • Rows: 142,676,154
  • Columns: 5
  • Column types: 1 string then 4 floats
  • Column headers? No
  • Size on disk: 6.9GB
  • Parsed with: Gigasheet CSV parser

Observations:

  • Appears to be tick data from a UK stock exchange
  • Filename ended with “_00.csv”, suggesting it may have been only one of several files in a batch
  • User appears to have mainly wanted to view the data, perhaps to look at a particular point in time

Example 3: Domain Abuse

Stats:

  • Uploaded to Gigasheet: 2023-01-13 
  • Rows: 419,099,207
  • Columns: 7
  • Column types: 1 string, 1 bool, 5 int
  • Column headers? Yes
  • Size on disk: 14.4GB
  • Parsed with: Gigasheet CSV parser

Observations:

  • Based on filename and headers, this data pertains to potentially abusive domain names
  • Notably also includes a column that appears to be a count of URLs not indexed by search engines
  • User took many actions analyzing data in this file with filtering, grouping, sorting, etc.

Example 4: Healthcare

Stats:

  • Uploaded to Gigasheet: 2023-01-23
  • Rows: 870,216,232
  • Columns: 21
  • Column types: mix of string, int, float
  • Column headers? Yes
  • Size on disk: 15.0GB
  • Parsed with: Gigasheet CSV parser

Observations:

  • No patient data in this dataset
  • Based on headers, this is healthcare price data, for example “HCPCS” and “CPT” codes for procedures
  • Appears to also include hospital IDs and locations
  • May be related to US healthcare transparency law passed in 2022

Example 5: Geographic

Stats:

  • Uploaded to Gigasheet: 2022-10-13
  • Rows: 1,098,491,592
  • Columns: 3
  • Column types: zipcode, zipcode, float
  • Column headers? Yes
  • Size on disk: 33.6GB
  • Parsed with: Gigasheet CSV parser

Observations:

  • Based on headers and data contents, this appears to be the distance in miles between pairs of United States postal codes using census data “zcta5”
  • Appears to actually cover geographic subsections of zipcodes, which is why this reached 1 billion rows even though there are only 41,683 zipcodes in the United States
  • User probably wanted to find multiple regions near an area of interest to then plug into another step of their analysis

Conclusion

Making sense of a billion rows of data might feel impossible, but with Gigasheet, the process becomes straightforward and efficient. Our platform offers a friendly spreadsheet interface that empowers you to explore, summarize, and analyze massive datasets. These days, data is everywhere. Whether you're an analyst, a data scientist, or simply curious about some data you encountered, Gigasheet will get you to the insights you’re looking for in data of any size. Sign up and try it for free!

