Excel is a versatile application that has grown far beyond its early days as a simple spreadsheet tool. Many people use it as a record keeper, address book, forecasting tool, and more, often in ways it was never designed for.
If you use Excel a lot at home or in the office, you know that sometimes Excel files can quickly become unwieldy due to the sheer number of records you work with.
Fortunately, Excel has built-in functions to help you find and remove duplicate records. Unfortunately, there are a few caveats when using these features, so be careful or you might unknowingly delete entries you didn’t intend to delete. Plus, both of the methods below remove duplicates instantly without showing what was deleted.
I will also show a way to highlight duplicate rows first, so you can see which ones these features will remove before you run them. Highlighting a row that is duplicated in its entirety requires a custom conditional formatting rule.
Remove Duplicates function
Let’s say you’re using Excel to track addresses and you suspect that you have duplicate records. Take a look at the example Excel spreadsheet below:
Note that the entry “Jones” appears twice. To remove such duplicate entries, click the Data tab on the Ribbon and find the Remove Duplicates option under Data Tools. Click “Remove Duplicates” and a new window will open.
This is where you must make a decision based on whether your columns have headings at the top. If they do, select the option labeled My data has headers. If they do not, the dialog lists the columns using standard Excel notation such as Column A, Column B, etc.
In this example, we will select only column A and click OK. The options window will close and Excel will delete the second Jones record.
Of course, this was a simple example. Any address records you keep using Excel are likely to be much more complex. For example, suppose you have an address file that looks like this.
Note that although there are three Jones entries, only two are identical. If we used the procedures above to remove duplicate records, there would be only one Jones record left. In this case, we need to expand the decision criteria to include both the first and last names from columns A and B, respectively.
To do this, click the Data tab on the ribbon again, and then click Remove Duplicates. This time, when the options window appears, select columns A and B. Click OK and notice that this time Excel only deleted one of the Mary Jones records.
This is because we told Excel to remove duplicates by matching records based on Columns A and B, not just Column A. The more columns you select, the more criteria must be met before Excel considers a record to be a duplicate. Select All Columns if you want to remove completely duplicate rows.
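To make the effect of the column selection concrete, here is a minimal Python sketch of the same keep-the-first-match logic. It is only an illustration, not how Excel works internally: the function name remove_duplicates, the sample rows, and the exact (case-sensitive) comparison are all assumptions, and Excel’s own matching rules may differ.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each distinct key; return (kept, removed)."""
    seen = set()
    kept, removed = [], []
    for row in rows:
        # Only the selected columns decide whether a row counts as a duplicate.
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


# Hypothetical sample data: last name, first name, address.
rows = [
    ("Jones", "Mary", "12 Oak St"),
    ("Jones", "Mary", "12 Oak St"),
    ("Jones", "Tom", "9 Elm Ave"),
    ("Smith", "Ann", "3 Pine Rd"),
]

kept_a, removed_a = remove_duplicates(rows, [0])       # like selecting column A only
kept_ab, removed_ab = remove_duplicates(rows, [0, 1])  # like selecting columns A and B

print(len(removed_a))   # 2: only one Jones survives
print(len(removed_ab))  # 1: only the repeated Mary Jones is dropped
```

Unlike Excel, the sketch also returns the removed rows, which makes it easy to review what would be lost before committing to a choice of columns.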
Excel will give you a message telling you how many duplicates have been removed. However, it will not show which rows were deleted! See the section on marking duplicate rows below to learn how to highlight duplicate rows before running this function.
Advanced filtering method
The second way to remove duplicates is to use the advanced filter. First select all the data on the sheet. Then, on the Data tab on the Ribbon, click Advanced under Sort & Filter.
In the dialog box that appears, be sure to check the Unique records only box.
You can filter the list in-place, or copy non-duplicate items to another part of the same spreadsheet. For some strange reason, you cannot copy data to another sheet. If you want them to be on a different sheet, first select a location on the current sheet and then cut and paste this data onto a new sheet.
When using this method, you won’t even get a message about how many rows were removed. The duplicate rows simply drop out of the filtered result and that’s it.
Mark the duplicate rows in Excel
If you want to see which entries are duplicated before deleting them, you have to do a little manual work. Unfortunately, Excel does not have a built-in way to highlight rows that are completely duplicated. It has a conditional formatting feature that highlights duplicate cells, but this article is about duplicate rows.
The first thing you need to do is add a formula in a column to the right of your data set. The formula is simple: just concatenate all the columns of that row together.
=A1&B1&C1&D1&E1
In my example below, I have data in columns A to F. However, the first column is the ID number, so I exclude it from the formula below. Make sure to include all the data columns you want to check for duplicates.
I put this formula in column H and then dragged it down for all my rows. This formula simply concatenates the values from every column of a row into one large piece of text. Now skip a couple more columns and enter the following formula:
=COUNTIF($H$1:$H$34, $H1) > 1
Here we are using the COUNTIF function, and the first parameter is the range we want to search. For me it was column H (which holds the formula that combines the data), rows 1 through 34. The second parameter is the value to count, and the > 1 comparison returns TRUE when that value occurs more than once. It is also a good idea to get rid of the header row before doing this.
Also make sure you use the dollar sign ($) before the letter and number. For example, if you have 1000 rows of data and your concatenated row formula is in column F, your formula would look like this:
=COUNTIF($F$1:$F$1000, $F1) > 1
The second parameter only has a dollar sign in front of the column letter, so the column is locked, but we don’t want to lock the row number. Again, drag the formula down for all data rows. It should show TRUE on every row that is repeated.
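The two helper columns can be mimicked outside Excel, which may help when checking that the formulas do what you expect. The following Python sketch is an illustration under assumptions: the sample rows are made up, the first field plays the role of the ID column that is left out of the comparison, and the comparison is an exact text match.

```python
from collections import Counter

# Hypothetical sample data: ID, last name, first name, address.
rows = [
    (101, "Jones", "Mary", "12 Oak St"),
    (102, "Smith", "Ann", "3 Pine Rd"),
    (103, "Jones", "Mary", "12 Oak St"),
    (104, "Jones", "Tom", "9 Elm Ave"),
]

# First helper column: concatenate every data column except the ID (index 0).
helper = ["".join(str(value) for value in row[1:]) for row in rows]

# Second helper column: the COUNTIF(...) > 1 test, applied to each row.
counts = Counter(helper)
is_duplicate = [counts[text] > 1 for text in helper]

print(is_duplicate)  # [True, False, True, False]
```

Note that every copy of a repeated row is flagged, including the first one. Plain concatenation can also make different rows look identical (for example "ab" joined with "c" and "a" joined with "bc"), so joining the columns with a separator that does not occur in the data is a safer variant.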
Now let’s highlight the rows that show TRUE, since those are the duplicate rows. First select the entire worksheet by clicking the small triangle in the upper left corner, at the intersection of the row and column headers. Now go to the Home tab, then click Conditional Formatting and click New Rule.
In the dialog box, click Use a formula to determine which cells to format.
In the box under Format values where this formula is true, enter the following formula, replacing P with the column that holds your TRUE or FALSE values. Remember to put a dollar sign in front of the column letter.
=$P1=TRUE
Once you’ve done that, click Format and go to the Fill tab. Choose a color that will be used to highlight each duplicate row in full. Click OK and you should now see the duplicate rows highlighted.
If that doesn’t work for you, start over and go through the steps again slowly. This has to be done exactly right to work. If you miss even one $ character, it won’t work as expected.
Caveats when removing duplicate records
Of course, there are a few issues with letting Excel remove duplicate records automatically. First, you have to be careful not to choose too few or too many columns for Excel to use as the criteria for identifying duplicate records.
With too few, you might accidentally delete records you want to keep. With too many, or if you accidentally include an ID column, no duplicates will be found.
Second, Excel always assumes that the first unique record it encounters is the master record. All subsequent matching records are considered duplicates. This is a problem if, for example, instead of updating a person’s address in your file, you created a new entry for that person.
If the new (correct) address record appears after the old (outdated) record, Excel will assume that the first (outdated) record is the master and will delete any subsequent records it finds. This is why you have to be careful about how liberally or conservatively you let Excel decide what is or is not a duplicate record.
In such cases, you should use the duplicate highlighting method I wrote about and manually remove the corresponding duplicate entry.
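The keep-the-first-record behavior described above can be sketched in a few lines of Python. This is a hedged illustration, not Excel’s implementation: the dedupe function, its keep parameter, and the sample rows are hypothetical, and the Remove Duplicates dialog as described in this article offers no such choice.

```python
def dedupe(rows, key_columns, keep="first"):
    """Return one row per key, keeping either its first or its last occurrence."""
    chosen = {}  # maps each key to the index of the row that will be kept
    for index, row in enumerate(rows):
        key = tuple(row[c] for c in key_columns)
        if keep == "last" or key not in chosen:
            chosen[key] = index
    # Preserve the original row order among the kept rows.
    return [rows[i] for i in sorted(chosen.values())]


# Hypothetical sample data: last name, first name, address.
rows = [
    ("Jones", "Mary", "12 Oak St"),   # old, outdated address
    ("Smith", "Ann", "3 Pine Rd"),
    ("Jones", "Mary", "48 Lake Dr"),  # newer, correct address
]

first = dedupe(rows, [0, 1], keep="first")
last = dedupe(rows, [0, 1], keep="last")

print(first[0])  # ('Jones', 'Mary', '12 Oak St'): the outdated record survives
print(last[-1])  # ('Jones', 'Mary', '48 Lake Dr'): the newer record survives
```

Matching on name alone keeps whichever Mary Jones record comes first, which is exactly the situation where reviewing the highlighted rows by hand is the safer route.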
Finally, Excel does not ask you to confirm that you really want to delete a record. Once you have selected the parameters (columns), the process is fully automated. This can be dangerous if you have a huge number of records and simply trust that your choices were correct while Excel removes duplicate records automatically.
Also, don’t forget to check out our previous article on how to remove blank lines in Excel. Enjoy!