जगदीश खोलिया: Finnding & Removing Duplicate records from a table

Thursday, November 17, 2011

Finnding & Removing Duplicate records from a table

There are lots of questions/posts about dealing with duplicate records in many SQL Server forums. Many of these questions are asked in “SQL Server Interviews” and many developers starting their carrier in database programming find it challenging to deal with. Here we will see how we can deal with such records.
Duplicate data is one of the biggest pain points in our IT industry, which various projects have to deal with. Whatever state-of-art technology and best practices followed, the big ERP, CRM, SCM and other inventory based database management projects ends up in having duplicate & redundant data. Duplicate data keeps on increasing by manual entries and automated data loads. Various data leads getting pumped into system’s databases without proper deduping & data cleansing leads to redundant data and thus duplicated record-sets.
Data cleansing requires regular exercise of identifying duplicates, validating and removing them. To minimize these type of scenarios various checks and filters should also be applied before loading new leads into the system.
Lets check this by a simple exercise how we can identify & remove duplicate data from a table:
1USE [AdventureWorks]
2GO
3 
4-- Insert some sample records from Person.Contact table of [AdventureWorks] database:
5SELECT TOP 10 ContactID, FirstName, LastName, EmailAddress, Phone
6INTO DupContacts
7FROM Person.Contact
8 
9SELECT * FROM DupContacts

1-- Insert some duplicate records from the same list inserted above.
2INSERT INTO DupContacts
3SELECT TOP 50 PERCENT FirstName, LastName, EmailAddress, Phone
4from DupContacts
5 
6SELECT * FROM DupContacts

1-- Insert some more duplicate records from the same list inserted above.
2INSERT INTO DupContacts
3SELECT TOP 20 PERCENT FirstName, LastName, EmailAddress, Phone
4from DupContacts
5 
6SELECT * FROM DupContacts

01--// Identify Duplicate records & delete them:-
02 
03-- Method #1: by using ROW_NUMBER() function:
04;WITH dup as (
05SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
06ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
07FROM DupContacts)
08SELECT * FROM dup
09WHERE NumOfDups > 1
10ORDER BY ContactID

1-- Remove/Delete duplicate records:
2;WITH dup as (
3SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
4ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
5FROM DupContacts)
6DELETE FROM dup
7WHERE NumOfDups > 1
8 
9SELECT * FROM DupContacts

1-- Method #2: by using SELF-JOIN:
2SELECT DISTINCT a.ContactID, a.FirstName, a.LastName, a.EmailAddress, a.Phone
3FROM DupContacts a
4JOIN DupContacts b
5ON a.FirstName = b.FirstName
6AND a.LastName = b.LastName
7AND a.ContactID > b.ContactID

1-- Remove/Delete duplicate records:
2DELETE a
3FROM DupContacts a
4JOIN DupContacts b
5ON a.FirstName = b.FirstName
6AND a.LastName = b.LastName
7AND a.ContactID > b.ContactID
8 
9SELECT * FROM DupContacts

1-- Method #3: by using AGGREGATES & Sub-QUERY:
2SELECT * FROM DupContacts
3WHERE ContactID NOT IN (SELECT MIN(ContactID)
4FROM DupContacts
5GROUP BY FirstName, LastName)

1-- Remove/Delete duplicate records:
2DELETE FROM DupContacts
3WHERE ContactID NOT IN (SELECT MIN(ContactID)
4FROM DupContacts
5GROUP BY FirstName, LastName)
6 
7SELECT * FROM DupContacts



 
Referenced from : http://sqlwithmanoj.wordpress.com/category/sql-tips/

No comments: