NeuroAgent

SQL Duplicate Detection: Multiple Fields Guide

Learn how to find and remove duplicate records in SQL using multiple fields like email and name. Complete guide with examples for duplicate identification and cleanup.

Question

How to Find Duplicate Records in SQL Based on Multiple Fields (Email and Name)

I need to find duplicate records in a SQL table where both the email and name fields are identical. I know how to find duplicates based on a single field:

sql
SELECT email, COUNT(email) 
FROM users
GROUP BY email
HAVING COUNT(email) > 1

For example, with this table:

ID   NAME   EMAIL
1    John   asd@asd.com
2    Sam    asd@asd.com
3    Tom    asd@asd.com
4    Bob    bob@asd.com
5    Tom    asd@asd.com

The query above flags asd@asd.com with a count of 4, covering John, Sam, Tom, and Tom, because all four rows share that email.

However, I need to find duplicates where both email AND name are the same. In this case, I want to identify the duplicate ‘Tom’ records.

I made an error in my database design and allowed duplicate name and email combinations. Now I need to find these duplicates so I can remove or modify them.

NeuroAgent

To find duplicate records in SQL based on multiple fields like email and name, you need to modify your query to group by both columns simultaneously. The solution uses GROUP BY with multiple columns and HAVING to filter for duplicates.

Finding Duplicates with Multiple Columns

The key is to include both fields in your GROUP BY clause and then use HAVING COUNT(*) > 1 to identify duplicates. Here’s the basic structure:

sql
SELECT email, name, COUNT(*) as duplicate_count
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

This query groups records by both email and name combinations, then shows you which combinations have more than one record.

Complete Query Examples

Basic Duplicate Identification

For your specific example table:

ID   NAME   EMAIL
1    John   asd@asd.com
2    Sam    asd@asd.com
3    Tom    asd@asd.com
4    Bob    bob@asd.com
5    Tom    asd@asd.com

The correct query would be:

sql
SELECT email, name, COUNT(*) as duplicate_count
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

This would return:

email          | name | duplicate_count
---------------|------|----------------
asd@asd.com    | Tom  | 2
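
To check the difference between grouping on one column and grouping on both, here is a minimal sketch that loads the sample table into an in-memory SQLite database through Python's sqlite3 module. The SQLite setup is an assumption made for illustration; the two queries themselves are the ones discussed above.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# Grouping on email alone: every row sharing the email lands in one group.
by_email = conn.execute(
    "SELECT email, COUNT(email) FROM users GROUP BY email HAVING COUNT(email) > 1"
).fetchall()

# Grouping on the (email, name) pair: only exact pair matches count as duplicates.
by_email_and_name = conn.execute(
    """
    SELECT email, name, COUNT(*) AS duplicate_count
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    """
).fetchall()

print(by_email)           # [('asd@asd.com', 4)]
print(by_email_and_name)  # [('asd@asd.com', 'Tom', 2)]
```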

Including Original Records

If you want to see all the actual duplicate records rather than just counting them:

sql
SELECT u.*
FROM users u
JOIN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
) duplicates ON u.email = duplicates.email AND u.name = duplicates.name;

This query will return both records for Tom:

ID | NAME | EMAIL
---|------|----------
3  | Tom  | asd@asd.com
5  | Tom  | asd@asd.com
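
The join can be tried the same way. This sketch again assumes an in-memory SQLite copy of the sample table; the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# Join every row back to the set of (email, name) pairs that occur more than once.
duplicate_rows = conn.execute(
    """
    SELECT u.id, u.name, u.email
    FROM users u
    JOIN (
        SELECT email, name
        FROM users
        GROUP BY email, name
        HAVING COUNT(*) > 1
    ) duplicates ON u.email = duplicates.email AND u.name = duplicates.name
    ORDER BY u.id
    """
).fetchall()

for row in duplicate_rows:
    print(row)
```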

Identifying Specific Duplicate Records

With Row Numbers

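To number each record within its duplicate group, use ROW_NUMBER(). The row-value comparison (email, name) IN (...) works in PostgreSQL, MySQL, and SQLite, but not in SQL Server, where you would join on both columns instead: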

sql
SELECT *, ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) as row_num
FROM users
WHERE (email, name) IN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
);

This will show you:

ID | NAME | EMAIL      | row_num
---|------|------------|--------
3  | Tom  | asd@asd.com| 1
5  | Tom  | asd@asd.com| 2
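
A variant worth considering is to number every row and keep only those with row_num > 1, which lists just the surplus copies and avoids the row-value IN comparison. The sketch below assumes an in-memory SQLite database (version 3.25 or later, for window functions) holding the sample table.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# Number rows inside each (email, name) group by ascending id,
# then keep everything after the first row of its group.
surplus_rows = conn.execute(
    """
    SELECT id, name, email, row_num
    FROM (
        SELECT id, name, email,
               ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) AS row_num
        FROM users
    ) numbered
    WHERE row_num > 1
    ORDER BY id
    """
).fetchall()

print(surplus_rows)  # [(5, 'Tom', 'asd@asd.com', 2)]
```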

Finding First/Last Occurrences

To identify the first or last occurrence of duplicates:

sql
-- First occurrence
SELECT MIN(id) as first_id, email, name
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

-- Last occurrence  
SELECT MAX(id) as last_id, email, name
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

Removing Duplicates

Method 1: Keep First Occurrence

sql
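-- The derived table (keepers) lets this run on MySQL, which otherwise rejects
-- a subquery that reads the table being deleted from (error 1093).
DELETE FROM users
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM users
        GROUP BY email, name
    ) keepers
);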

Method 2: Keep Last Occurrence

sql
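-- Same derived-table wrapper as Method 1, keeping the highest id per group.
DELETE FROM users
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MAX(id) AS keep_id
        FROM users
        GROUP BY email, name
    ) keepers
);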

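Method 3: Using Window Functions (Modern Databases)

ROW_NUMBER() needs window-function support, which is available in PostgreSQL 8.4+, SQL Server 2005+, MySQL 8.0+, and SQLite 3.25+.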

sql
DELETE FROM users
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) as row_num
        FROM users
    ) numbered
    WHERE row_num > 1
);
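
Before running any of these deletes on real data, it can help to preview the affected ids with the same predicate and to run the delete inside a transaction. The following is a sketch of that workflow for Method 1, assuming an in-memory SQLite copy of the sample table; on another database the transaction handling would differ.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# Rows to remove: every id that is not the lowest id of its (email, name) group.
surplus_predicate = "id NOT IN (SELECT MIN(id) FROM users GROUP BY email, name)"

# Step 1: preview which ids the delete would touch.
doomed_ids = [
    row[0]
    for row in conn.execute(f"SELECT id FROM users WHERE {surplus_predicate} ORDER BY id")
]

# Step 2: delete with the same predicate; `with conn` commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    deleted = conn.execute(f"DELETE FROM users WHERE {surplus_predicate}").rowcount

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]

print(doomed_ids, deleted, remaining_ids)  # [5] 1 [1, 2, 3, 4]
```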

Advanced Techniques

Case-Insensitive Duplicate Detection

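If you want to find duplicates ignoring case differences, normalize both columns before grouping. (The default collations in MySQL and SQL Server often compare strings case-insensitively already, in which case the plain GROUP BY has the same effect.)

sql
SELECT LOWER(email) as normalized_email, LOWER(name) as normalized_name, COUNT(*) as duplicate_count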
FROM users
GROUP BY LOWER(email), LOWER(name)
HAVING COUNT(*) > 1;
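
To see what normalization changes, this sketch adds two hypothetical rows (ids 6 and 7, not part of the original question) that differ from existing rows only in letter case. It assumes an in-memory SQLite database, where the built-in LOWER() folds ASCII letters only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
        (6, "tom", "ASD@ASD.COM"),  # hypothetical: differs from Tom only by case
        (7, "BOB", "Bob@asd.com"),  # hypothetical: differs from Bob only by case
    ],
)

# Exact comparison: only the two identical Tom rows are grouped together.
exact = conn.execute(
    """
    SELECT email, name, COUNT(*) FROM users
    GROUP BY email, name HAVING COUNT(*) > 1
    """
).fetchall()

# Case-insensitive comparison: lower-case both columns before grouping.
normalized = conn.execute(
    """
    SELECT LOWER(email) AS normalized_email,
           LOWER(name) AS normalized_name,
           COUNT(*) AS duplicate_count
    FROM users
    GROUP BY LOWER(email), LOWER(name)
    HAVING COUNT(*) > 1
    ORDER BY normalized_email, normalized_name
    """
).fetchall()

print(exact)       # [('asd@asd.com', 'Tom', 2)]
print(normalized)  # [('asd@asd.com', 'tom', 3), ('bob@asd.com', 'bob', 2)]
```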

Finding Near Duplicates

To find pairs of records that share an email but have different names (partial matches rather than exact duplicates):

sql
SELECT u1.*, u2.*
FROM users u1
JOIN users u2 ON u1.email = u2.email AND u1.name != u2.name
WHERE u1.id < u2.id;

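Using EXISTS for Performance

EXISTS can stop at the first matching row, so on large tables with an index on (email, name) it may outperform the GROUP BY join. Compare execution plans on your own data before committing to either form: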

sql
SELECT u.*
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM users u2
    WHERE u2.email = u.email 
    AND u2.name = u.name
    AND u2.id != u.id
);
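
As a quick check that the EXISTS form selects the same rows as the join shown earlier, here is a sketch against an assumed in-memory SQLite copy of the sample table. It demonstrates the result only, not the performance difference.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# A row is a duplicate if some other row (different id) has the same email and name.
exists_rows = conn.execute(
    """
    SELECT u.id, u.name, u.email
    FROM users u
    WHERE EXISTS (
        SELECT 1
        FROM users u2
        WHERE u2.email = u.email
          AND u2.name = u.name
          AND u2.id != u.id
    )
    ORDER BY u.id
    """
).fetchall()

print(exists_rows)  # [(3, 'Tom', 'asd@asd.com'), (5, 'Tom', 'asd@asd.com')]
```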

Performance Considerations

Indexing

For better performance on large tables, create a composite index covering the columns you group by:

sql
CREATE INDEX idx_users_email_name ON users(email, name);
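
Whether the index is actually used depends on the database and its planner. The sketch below, which assumes an in-memory SQLite database, creates the index and prints the plan for the duplicate query; EXPLAIN QUERY PLAN is SQLite-specific and its wording varies by version, so treat the printed plan as informational only.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

conn.execute("CREATE INDEX idx_users_email_name ON users(email, name)")

duplicate_query = """
    SELECT email, name, COUNT(*) AS duplicate_count
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
"""

# Ask SQLite how it intends to run the query; the last column holds the description.
plan = conn.execute("EXPLAIN QUERY PLAN " + duplicate_query).fetchall()
for step in plan:
    print(step[-1])

# The index changes how the query runs, not what it returns.
duplicates = conn.execute(duplicate_query).fetchall()
index_names = [
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```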

Partitioning

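For very large tables, consider partitioning by a column used for duplicate detection, so that matching rows land in the same partition. Partitioning syntax is vendor-specific, and LIST partitioning needs an enumerated set of values, which suits a free-form email column poorly; hash partitioning is usually the better fit. A PostgreSQL 11+ sketch:

sql
-- PostgreSQL 11+ syntax; each partition must then be created separately, e.g.
-- CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users (
    id INT,
    name VARCHAR(100),
    email VARCHAR(100)
    -- other columns
) PARTITION BY HASH (email);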

Batch Processing

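For extremely large tables, process duplicates in batches. LIMIT works in PostgreSQL, MySQL, and SQLite (SQL Server uses TOP or OFFSET ... FETCH instead), and the ORDER BY makes each batch deterministic:

sql
-- Process the first 1000 duplicate combinations
SELECT u.*
FROM users u
JOIN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    ORDER BY email, name
    LIMIT 1000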
) duplicates ON u.email = duplicates.email AND u.name = duplicates.name;
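
One possible way to turn the batched lookup into a batched cleanup is to loop, deleting the surplus rows of a limited number of duplicate groups per transaction until nothing is left to delete. This is a sketch under several assumptions: an in-memory SQLite database, generated test data (50 name and email pairs stored three times each), and a hypothetical BATCH_SIZE of 20 groups.

```python
import sqlite3

BATCH_SIZE = 20  # hypothetical number of duplicate groups handled per transaction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Generated test data: 50 distinct (name, email) pairs, each inserted 3 times.
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(50) for _ in range(3)],
)

batches = 0
total_deleted = 0
while True:
    # Each pass picks up to BATCH_SIZE duplicate groups and deletes
    # every row in them except the one with the lowest id.
    with conn:
        deleted = conn.execute(
            """
            DELETE FROM users
            WHERE id IN (
                SELECT u.id
                FROM users u
                JOIN (
                    SELECT email, name, MIN(id) AS keep_id
                    FROM users
                    GROUP BY email, name
                    HAVING COUNT(*) > 1
                    ORDER BY email, name
                    LIMIT ?
                ) d ON u.email = d.email AND u.name = d.name
                WHERE u.id <> d.keep_id
            )
            """,
            (BATCH_SIZE,),
        ).rowcount
    if deleted == 0:
        break  # no duplicate groups remain
    batches += 1
    total_deleted += deleted

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(batches, total_deleted, remaining)  # 3 100 50
```

Committing after each batch keeps individual transactions short, at the cost of re-running the grouping query on every pass.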

Conclusion

Finding duplicates based on multiple fields in SQL is straightforward once you understand the GROUP BY and HAVING clauses. The key points to remember are:

  1. Include every field you are checking for duplicates in the GROUP BY clause
  2. Use HAVING COUNT(*) > 1 to filter for actual duplicates
  3. Consider performance implications for large datasets by adding appropriate indexes
  4. When removing duplicates, decide whether to keep first, last, or a specific occurrence
  5. Modern databases offer window functions that provide more flexibility for duplicate handling

Start with the basic duplicate identification query, then move to more complex techniques as needed for your specific use case.
