NeuroAgent

SQL Duplicate Detection: Multiple Fields Guide

Learn how to find and remove duplicate records in SQL using multiple fields like email and name. Complete guide with examples for duplicate identification and cleanup.


10/25/2025, 09:26 AM

Question

How to Find Duplicate Records in SQL Based on Multiple Fields (Email and Name)

I need to find duplicate records in a SQL table where both the email and name fields are identical. I know how to find duplicates based on a single field:

sql

SELECT email, COUNT(email) 
FROM users
GROUP BY email
HAVING COUNT(email) > 1

For example, with this table:

ID   NAME   EMAIL
1    John   asd@asd.com
2    Sam    asd@asd.com
3    Tom    asd@asd.com
4    Bob    bob@asd.com
5    Tom    asd@asd.com

The query above groups John, Sam, Tom, and Tom together (asd@asd.com, count 4) because they share the same email.

However, I need to find duplicates where both email AND name are the same. In this case, I want to identify the duplicate ‘Tom’ records.

I made an error in my database design and allowed duplicate name and email combinations. Now I need to find these duplicates so I can remove or modify them.

NeuroAgent

To find duplicate records in SQL based on multiple fields like email and name, you need to modify your query to group by both columns simultaneously. The solution uses GROUP BY with multiple columns and HAVING to filter for duplicates.

Finding Duplicates with Multiple Columns
Complete Query Examples
Identifying Specific Duplicate Records
Removing Duplicates
Advanced Techniques
Performance Considerations

Finding Duplicates with Multiple Columns

The key is to include both fields in your GROUP BY clause and then use HAVING COUNT(*) > 1 to identify duplicates. Here’s the basic structure:

sql

SELECT email, name, COUNT(*) as duplicate_count
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

This query groups records by both email and name combinations, then shows you which combinations have more than one record.
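
As a minimal sketch, the query can be tried against the example table from the question. SQLite, accessed through Python's built-in sqlite3 module with an in-memory database, is an assumption made here only to keep the demo self-contained; the question does not say which database is in use.

```python
import sqlite3

# In-memory SQLite database holding the example rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# Group on both columns; only (email, name) pairs that occur more than once
# pass the HAVING filter.
duplicate_groups = conn.execute(
    """
    SELECT email, name, COUNT(*) AS duplicate_count
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicate_groups)  # [('asd@asd.com', 'Tom', 2)]
```

John and Sam share Tom's email but not his name, so they fall into separate groups of size one and are filtered out.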

Complete Query Examples

Basic Duplicate Identification

For your specific example table:

ID   NAME   EMAIL
1    John   asd@asd.com
2    Sam    asd@asd.com
3    Tom    asd@asd.com
4    Bob    bob@asd.com
5    Tom    asd@asd.com

The correct query would be:

sql

SELECT email, name, COUNT(*) as duplicate_count
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

This would return:

email          | name | duplicate_count
---------------|------|----------------
asd@asd.com    | Tom  | 2

Including Original Records

If you want to see all the actual duplicate records rather than just counting them:

sql

SELECT u.*
FROM users u
JOIN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
) duplicates ON u.email = duplicates.email AND u.name = duplicates.name;

This query will return both records for Tom:

ID | NAME | EMAIL
---|------|----------
3  | Tom  | asd@asd.com
5  | Tom  | asd@asd.com

Identifying Specific Duplicate Records

With Row Numbers

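If you want to identify which specific records are duplicates and their order within each group (the row-value comparison (email, name) IN (...) works in PostgreSQL, MySQL, Oracle, and SQLite 3.15+, but not in SQL Server):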

sql

SELECT *, ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) as row_num
FROM users
WHERE (email, name) IN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
);

This will show you:

ID | NAME | EMAIL      | row_num
---|------|------------|--------
3  | Tom  | asd@asd.com| 1
5  | Tom  | asd@asd.com| 2
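
An alternative that avoids the subquery is to compute each group's size with a second window function and filter on it. The sketch below is one way to do that, assuming SQLite 3.25 or newer (for window functions) through Python's sqlite3 module; the CTE name sized and the alias group_size are illustrative and not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)

# group_size counts the rows sharing an (email, name) pair; row_num orders the
# rows inside that pair by id. Window results cannot be filtered in the same
# SELECT's WHERE clause, hence the CTE.
numbered_duplicates = conn.execute(
    """
    WITH sized AS (
        SELECT id, name, email,
               COUNT(*)     OVER (PARTITION BY email, name)             AS group_size,
               ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) AS row_num
        FROM users
    )
    SELECT id, name, email, row_num
    FROM sized
    WHERE group_size > 1
    ORDER BY email, name, row_num
    """
).fetchall()

print(numbered_duplicates)  # [(3, 'Tom', 'asd@asd.com', 1), (5, 'Tom', 'asd@asd.com', 2)]
```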

Finding First/Last Occurrences

To identify the first or last occurrence of duplicates:

sql

-- First occurrence
SELECT MIN(id) as first_id, email, name
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

-- Last occurrence  
SELECT MAX(id) as last_id, email, name
FROM users
GROUP BY email, name
HAVING COUNT(*) > 1;

Removing Duplicates

Method 1: Keep First Occurrence

sql

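DELETE FROM users
WHERE id NOT IN (
    -- The extra derived table lets MySQL run this statement; MySQL otherwise
    -- rejects a subquery that reads the table being deleted from (error 1093).
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM users
        GROUP BY email, name
    ) keepers
);

Method 2: Keep Last Occurrence

sql

DELETE FROM users
WHERE id NOT IN (
    -- Same derived-table wrapper as in Method 1, keeping the highest id instead.
    SELECT keep_id
    FROM (
        SELECT MAX(id) AS keep_id
        FROM users
        GROUP BY email, name
    ) keepers
);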

Method 3: Using Window Functions (Modern Databases)

sql

DELETE FROM users
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email, name ORDER BY id) as row_num
        FROM users
    ) numbered
    WHERE row_num > 1
);
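
Because these DELETE statements are destructive, it can help to run them inside a transaction and check the result before committing. The following is a sketch under assumptions rather than a prescribed procedure: it uses SQLite through Python's sqlite3 module, applies the keep-the-lowest-id rule from Method 1, and the row-count sanity check is an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "John", "asd@asd.com"),
        (2, "Sam", "asd@asd.com"),
        (3, "Tom", "asd@asd.com"),
        (4, "Bob", "bob@asd.com"),
        (5, "Tom", "asd@asd.com"),
    ],
)
conn.commit()

# One row should remain per distinct (email, name) pair.
expected_after = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM users GROUP BY email, name)"
).fetchone()[0]

try:
    # sqlite3 opens a transaction implicitly before the DELETE runs.
    cursor = conn.execute(
        """
        DELETE FROM users
        WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email, name)
        """
    )
    deleted = cursor.rowcount
    rows_after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    if rows_after != expected_after:
        raise RuntimeError("row count after delete does not match expectation")
    conn.commit()
except Exception:
    # Undo the delete if anything looked wrong.
    conn.rollback()
    raise

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
print(deleted, remaining_ids)  # 1 [1, 2, 3, 4]
```

Only the second Tom row (id 5) is removed; the first Tom row and all non-duplicated rows stay.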

Advanced Techniques

Case-Insensitive Duplicate Detection

If you want to find duplicates ignoring case differences:

sql

SELECT LOWER(email) as normalized_email, LOWER(name) as normalized_name, COUNT(*)
FROM users
GROUP BY LOWER(email), LOWER(name)
HAVING COUNT(*) > 1;
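
To illustrate the difference, the sketch below compares exact and case-insensitive grouping on a small made-up data set (the rows are hypothetical and not taken from the question). It assumes SQLite through Python's sqlite3 module, where the default text comparison is case-sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Tom", "ASD@asd.com"),  # differs from row 2 only by letter case
        (2, "tom", "asd@asd.com"),
        (3, "Bob", "bob@asd.com"),
    ],
)

# Exact grouping treats 'Tom' and 'tom' as different values.
exact = conn.execute(
    """
    SELECT email, name, COUNT(*)
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    """
).fetchall()

# Grouping on the lowercased values merges rows that differ only by case.
normalized = conn.execute(
    """
    SELECT LOWER(email) AS normalized_email, LOWER(name) AS normalized_name, COUNT(*)
    FROM users
    GROUP BY LOWER(email), LOWER(name)
    HAVING COUNT(*) > 1
    """
).fetchall()

print(exact)       # []
print(normalized)  # [('asd@asd.com', 'tom', 2)]
```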

Finding Near Duplicates

To find records that are similar but not exact duplicates, for example pairs of rows that share an email but have different names:

sql

SELECT u1.*, u2.*
FROM users u1
JOIN users u2 ON u1.email = u2.email AND u1.name != u2.name
WHERE u1.id < u2.id;

Using EXISTS for Performance

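For large tables, EXISTS might be more efficient, particularly with an index on (email, name), because the database can stop at the first matching row: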

sql

SELECT u.*
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM users u2
    WHERE u2.email = u.email 
    AND u2.name = u.name
    AND u2.id != u.id
);

Performance Considerations

Indexing

For better performance on large tables, create composite indexes:

sql

CREATE INDEX idx_users_email_name ON users(email, name);
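
Whether the index is actually used depends on the database's query planner, so it is worth checking the plan. The sketch below shows one way to do that in SQLite through Python's sqlite3 module (an assumption for the demo; other databases have their own EXPLAIN variants and output formats).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_users_email_name ON users(email, name)")

# Ask SQLite how it would execute the duplicate-detection query.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT email, name, COUNT(*)
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    """
).fetchall()

# The last column of each plan row is a human-readable description of the step.
plan_steps = [row[-1] for row in plan]
index_used = any("idx_users_email_name" in step for step in plan_steps)

print(plan_steps)
print(index_used)
```

If the index name appears in the plan, the grouping can read rows in index order instead of sorting the whole table first.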

Partitioning

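For very large tables, consider partitioning by one of the columns used for duplicate detection. LIST partitioning suits columns with a small set of known values, so for a high-cardinality column such as email, hash partitioning is the better fit. This example uses PostgreSQL syntax; other databases differ:

sql

CREATE TABLE users (
    id INT,
    name VARCHAR(100),
    email VARCHAR(100)
    -- other columns (add a comma after email when listing them)
) PARTITION BY HASH (email);

-- Partitions must exist before rows can be inserted, for example
-- (users_p0 is a hypothetical name):
-- CREATE TABLE users_p0 PARTITION OF users
--     FOR VALUES WITH (MODULUS 4, REMAINDER 0);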

Batch Processing

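For extremely large tables, process duplicates in batches. The ORDER BY makes the batch deterministic; LIMIT is MySQL, PostgreSQL, and SQLite syntax (SQL Server uses TOP or OFFSET ... FETCH instead):

sql

-- Process first 1000 duplicate combinations
SELECT u.*
FROM users u
JOIN (
    SELECT email, name
    FROM users
    GROUP BY email, name
    HAVING COUNT(*) > 1
    ORDER BY email, name
    LIMIT 1000
) duplicates ON u.email = duplicates.email AND u.name = duplicates.name;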
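
To walk through all duplicate groups rather than only the first batch, one option is keyset pagination: remember the last (email, name) pair seen and ask for groups that sort after it. The sketch below assumes SQLite 3.15 or newer (for row-value comparison) through Python's sqlite3 module, and assumes email and name are never NULL. The function name iter_duplicate_groups and the sample rows are hypothetical.

```python
import sqlite3

def iter_duplicate_groups(conn, batch_size):
    """Yield lists of (email, name, count) duplicate groups, batch_size at a time."""
    last_key = None
    while True:
        if last_key is None:
            rows = conn.execute(
                """
                SELECT email, name, COUNT(*) FROM users
                GROUP BY email, name HAVING COUNT(*) > 1
                ORDER BY email, name LIMIT ?
                """,
                (batch_size,),
            ).fetchall()
        else:
            # Resume strictly after the last group of the previous batch.
            rows = conn.execute(
                """
                SELECT email, name, COUNT(*) FROM users
                WHERE (email, name) > (?, ?)
                GROUP BY email, name HAVING COUNT(*) > 1
                ORDER BY email, name LIMIT ?
                """,
                (last_key[0], last_key[1], batch_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last_key = rows[-1][:2]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [
        ("Tom", "asd@asd.com"), ("Tom", "asd@asd.com"),
        ("Ann", "ann@x.com"), ("Ann", "ann@x.com"),
        ("Bob", "bob@asd.com"), ("Bob", "bob@asd.com"),
        ("Sam", "sam@x.com"),  # not duplicated
    ],
)

batches = list(iter_duplicate_groups(conn, 2))
print(batches)
# [[('ann@x.com', 'Ann', 2), ('asd@asd.com', 'Tom', 2)], [('bob@asd.com', 'Bob', 2)]]
```

Unlike OFFSET-based paging, this keeps working if earlier groups are cleaned up between batches, since each batch starts from the last key rather than from a row position.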

Conclusion

Finding duplicates based on multiple fields in SQL is straightforward once you understand the GROUP BY and HAVING clauses. The key points to remember are:

Include every field you’re checking for duplicates in the GROUP BY clause
Use HAVING COUNT(*) > 1 to filter for actual duplicates
Consider performance implications for large datasets by adding appropriate indexes
When removing duplicates, decide whether to keep first, last, or a specific occurrence
Modern databases offer window functions that provide more flexibility for duplicate handling

Start with the basic duplicate identification query, then move to more complex techniques as needed for your specific use case.
