The power of deduplication of rows in a spreadsheet and how it helps with big data sets cannot be overstated! Whether it’s a spreadsheet of customer data or a large dataset with millions of records, data deduplication can help you save time and reduce errors while ensuring that you have accurate, reliable data.
Data Deduplication is the process of removing duplicate entries from a dataset based on a single column, on two columns, or even multiple columns. It can help to reduce the size of a dataset, improve data integrity and accuracy, and improve the performance of analytics. It can also reduce the cost of storage and help you gain insights more quickly by eliminating redundant data.
Removing duplicate rows can be easily accomplished in most spreadsheets, and has been a top requested feature from Gigasheet users, who rely on our product to manage data that has too many rows for Microsoft Excel or over the cell limit in Google Sheets. Gigasheet is a free big data spreadsheet that makes analyzing big data easy, no code required.
In this blog, we will explore how deduplication can help you get the most out of your data by removing duplicates! We will cover:
In its simplest form, removing duplicates works by checking the values of a single column for duplicates, and then deletes the rows containing duplicate instances of the value.
To illustrate this, we are going to look at a very simple file with columns containing Letters, Numbers, and Colors. For a simple deduplication, let's use the Letter column to remove duplicates.
We can immediately see that there are 3 unique values in the letter column:
When we run a deduplication of rows, the first instance of each value will be kept, along with the rest of the data in the row. The repeated instances of the value (highlighted in red) will be deleted, along with the rest of the data in the row.
Here are the results of running deduplication on the Letter column. As you can see, only the rows containing first instance of each letter remain:
In the example above, we demonstrated that the first row containing a duplicate value is retained, while the remaining rows with duplicate values are deleted. What if you want a particular value to remain?
You can control the first row by applying a sort to your data prior to removing duplicates from the file!
In our file above, A is paired with the colors Red and Yellow. Let's say Yellow is more important than Red. We can first sort the file on the column Color, with Yellow appearing above Red, which would look like this:
And removing duplicates on the Letter column would yield the following results:
As you can see, as result of sorting, the letters A and C both now have the color Yellow in the Color column, whereas before the value Red was retained.
Real world examples would include sorting on a Date column to retain the most recent transaction by customers, or sorting on a Purchase Amount column to retain the highest price paid by customers.
Removing duplicates keeps the row associated with the first instance of a value, so we can use this to our advantage and sort to produce desired results.
A more advanced version of removing duplicate rows uses a combination of multiple columns as the value to deduplicate. Instead of looking for a single column, the Remove Duplicates function considers multiple columns when looking for duplicate values to remove.
Sound complicated? It's probably easier to grasp in a picture, so let's use the same sample file as above and remove duplicates based on Letter and Number.
When factoring in both the Letter and Number columns, we have more unique values to consider because we treat the combined value as a instance:
As a result, there are less duplicates even though the file is exactly the same as before.
When we run a deduplication using both Letter and Number, the first instance of each value will be kept, along with the rest of the data in the row. The repeated instances of the value (highlighted in red) will be deleted.
Here is the output:
Removing duplicates can be performed on any number of column selections. Here is an example of using our same file, but factoring in all 3 columns, Letter, Number, and Color, as part of the deduplication process.
As you can see there are 2 instances that are repeated, when considering all 3 columns: A1Red and C1Red:
Here is the output after deduplication has been run using all 3 columns:
Find deduplicating multiple columns confusing? Let's provide some additional clarity by demonstrating what is happening behind the scenes. Basically, the deduplication function is combining the columns Letter and Number prior to looking for duplicates.
In this file I have combined the Letter and Number columns into a single column called Sample Combined Column. This combination simply takes the Letter value and appends the Number value. For Example, A and 1 become A1. And it continues for each value.
Now let's remove duplicates from our new column! Here we can clearly see that we are sorting through the same values that were considered in the example above, when we were removing duplicates on Letter and Number:
The output file will match the output of our example of removing duplicate rows based on two or more columns, the only difference being we have a new column called Sample Combined Data :
Hopefully it's clear now how selecting multiple columns works when removing duplicates. Behind the scenes, it combines columns and then searches those results for duplicates. We can do this manually, but it's always nice to skip a step!
Removing duplicates is a quick process using Excel once you locate the Remove Duplicates button, which can be found on the right side of the Data tab.
In this example, we are going to use the exact same example file from above. I loaded it into Microsoft Excel:
Clicking Remove Duplicates will bring up a new menu where you can select the columns to be included in the data deduplication calculation. Here, we are using the same file from our example above, removing duplicates on Letter:
Excel does have a checkbox for My data has headers that needs to remain checked if your data does include headers. This tells Excel to keep the first row and not try to find duplicates of the header values.
Click OK to proceed with the deduplication, and Excel will notify you how many rows include duplicates, and how many will remain once the process completes.
Hit OK again and the deduplicated output file will be ready for analysis!
Removing duplicates based on two columns in Excel is easy. You can either use the Remove Duplicates Feature and select two columns, or manually combine columns and then use Remove Duplicates on that single column. Let's walk through both examples!
Removing duplicate rows in Excel based on two columns is executed the same as removing duplicates from a single column. Begin by clicking Remove Duplicates on the Data tab.
This time, instead of selecting just one column, we are going to select both the Letter and Number columns and hit OK.
Behind the scenes, Excel will combine the two columns into a single value and then search the list of the combined values for duplicates.
A pop up will appear indicating that 3 duplicate values have been found, and 7 unique values remain.
In the examples above where we deduplicated data based on a single column, you will recall that 7 duplicate values were found. There are less duplicates because the deduplication is considering Letter + Number, so values like A1.
Hit OK to remove duplicates from your file. The output in Excel will look like this, which no longer contains any duplicated Letter + Number combinations.
Removing duplicate rows in Excel based on two columns can also be done manually by combining the columns prior to running Remove Duplicates. All it takes is a simple formula.
Not comfortable with formulas? This can be your first!
In our sample file, let's create a new column called Combined Letter and Number. In the box to the right of our first value, type:
Once you hit Enter, this will create the combined Letter + Number Value. The "&" basically tells Excel to stick two values together. In this case, it's A2 and B2. The resulting value is A1.
Either drag or copy and paste the formula down to the other rows, and the results will look like this:
Here we can clearly see that there are 3 duplicates in our new Combined Letter and Number column. Let's hit Remove Duplicates and select this column to clean up our file!
Hit OK, and the output is the same as just selecting two values in the Remove Duplicates tool, but with our new column on the right side.
Now that we have removed duplicates based on one and many columns from Excel, let's take a tour at removing duplicates from other spreadsheets.
Removing duplicates is also easy in Google sheets. We will take the same example file through the deduplication process. To begin, go to Data and then click on Data Cleanup and select Remove Duplicates.
The remove duplicates tool will appear and ask you to select which columns contain duplicates.
If Sheets does not automatically recognize the data headers, check the Data has header row box and the data headers will be added to the column letters in the drop-down box. To repeat our example, we are going to select Letter and then hit Remove Duplicates.
Sheets will alert you with how many duplicates are found, and how many rows will remain. Hit OK.
And there you have it! A clean file, free of duplicates.
Similar to Excel, there are two ways to remove duplicate values from Google sheets based on two or more columns.
Removing Duplicates based on two columns is as simple as selecting two columns when running the Remove Duplicates function. Here is the same sample file loaded into Google Sheets, with the duplicates highlighted based on two columns.
Let's check both Letter and Number:
And the pop-up notification indicates that the same 3 duplicates are found.
Here is our clean file:
If we want to remove duplicates based on multiple rows manually, we can first combine the Letter and Number columns and then deduplicate our combined column.
We are going to use the same formula with the "&" that we used in Excel to create a new column called Combined Letter and Number. In the box to the right of our first value, type:
Drag or copy and paste the values down into the remaining rows and the file will look like this:
Now run Remove Duplicates on the Combined Letter and Number column, and we should get the same 3 duplicates removed with 7 rows remaining.
There they are! Just like in Microsoft Excel, Google Sheets has found and the 3 duplicates in our new column.
And here is the final file, which matches the one above, with the exception of the Combined Letter and Number column that we added.
Lastly, we want to show you how this works in Gigasheet. By the way, you can open Excel files directly in Gigasheet, if you want to try it for free!
Gigasheet makes it easy to remove duplicates from any file, and we support up to a billion rows of data! Throw your biggest data sets at Gigasheet, clean up the file, and use the simple tools for data analysis.
Removing duplicates is as easy as selecting the Data Cleanup menu and clicking Delete Duplicates.
A pop-up box appears and you can select the columns to include in the remove duplicates process. Here we are using the same file as in our examples above and selecting the Letter column to search for duplicates. If we wanted to include multiple columns, I would simply select more.
Click Remove and a warning will appear notifying you how many values will be deleted. Click Proceed to remove duplicates.
And just like that, you will be taken back to your file, which will be sparkling clean and free of duplicate values. This is the same output we created in our first example above! Now you know how to do it in Gigasheet.
Removing Duplicates based on multiple values is easy in Gigasheet as well. Simply select multiple columns when running the Delete Duplicates function. For Example, here I am selecting Letter and Number.
If after this entire blog, you still want to combine columns before running Delete Duplicates, Gigasheet has a very simple combine column feature, and no formulas are required. Let's combine columns Letter and Number into a new column called Sample Combined Column.
The new column is added to our file and the values from Letter and Number are combined. Now we can remove duplicates from Sample Combined Column.
Spreadsheets are a wonderful way to manipulate data, and removing duplicates is a function that makes data cleanup easy. Duplicates can be identified in one column, or across multiple columns, and the entire row is removed.
Therefore, it's not a surprise that Microsoft Excel, Google Sheets, and Gigasheet make it quick and easy to remove duplicates from a file. However, only one allows you to remove duplicates from millions of rows of data. Try Gigasheet now for free!