Find duplicate records in oracle database

Identifying duplicate records is one of the most frequently asked questions in the ETL Testing interview.  

In this article, we will learn 3 different methods to accomplish this task in the Oracle database. I will also explain all query elements for a clear understanding. If you want to do the same thing in the Microsoft SQLserver database, then check out this article.

Note: I am using Oracle XE 18c – HR data model for all examples.

Identify duplicate records in the table.

Method-1: (Using Group By)

How the query is working:

  1. Select count of all EMPLOYEE_ID from table emp
  2. Group by clause will group the rows by values of EMPLOYEE_ID
  3. Having clause filter out the records and keep only duplicate rows (i.e. count > 1)

Result shows that there are 3 records for EMPLOYEE_ID=184 and 2 records for EMPLOYEE_ID=188.

If you want to use more multiple columns to find duplicate, just change the query as:

Let’s see the complete record for better analysis.

How query is working:

  1. Select only the EMPLOYEE_ID by passing the original query as a sub-query.
  2. Select all columns from table EMP by passing EMPLOYEE_ID in WHERE clause as filter criteria.

We can clearly see in the screenshot that there are duplicate records.

Method-2: (Using Window function)

How query is working:

  1. Inner query is using a window function of count aggregate function to calculate the number of rows per EMPLOYEE_ID.
  2. Outer query is filtering out records with count less than 1 i.e. duplicate records.  

You can also modify this query to get all duplicate records and count of duplicate values in a single query.

Method 3: (Using ROW_NUMBER() Function)

ROW_NUMBER() is an analytic function that will assign a unique number to each record in the partition or the result dataset. One important point is that it is mandatory to use ORDER_BY with this function to arrange the partition/result dataset in ascending or descending order.

How query is working:

  1. Inner query is assigning a unique number to the dataset partitioned on EMPLOYEE_ID.
  2. In the outer query, we are selecting records with a value greater than 1 because they are duplicate records.

This query can also be tuned to get duplicate dataset: