Find Duplicates in Python Lists: Set, Counter, Sort
Efficient methods to find duplicates in Python lists using sets, collections.Counter or sorting. Handles unhashable items and covers time and space complexity.
How can I efficiently find elements that appear more than once in a Python list?
Using Python, my current approach is to convert the list nums to a set to get uniqueness, then iterate over nums removing and adding elements to detect duplicates:
nums = [4, 3, 2, 7, 8, 2, 3, 1]
z = set(nums)
for i in nums:
    if i in z:
        z.remove(i)
    else:
        z.add(i)
print(nums, z)
This works, but I want a more efficient and optimized solution. What are better approaches (for example a single-pass set-based method, collections.Counter, or other patterns), and what are their time and space complexities?
To find duplicates in a Python list, prefer either a single-pass set-based approach or collections.Counter for counts and clarity. The single-pass seen/duplicates pattern is O(n) time and O(n) space; collections.Counter gives you counts in one pass too and is ideal when you need frequencies. Your current toggle-on-a-prepopulated-set trick looks clever but can miss duplicates for values that occur an odd number of times; the corrected patterns below are simpler and reliable.
Contents
- Why your current code is fragile
- Single-pass set method (find duplicates python)
- collections.Counter — counts and frequencies (python collections counter)
- Sorting, NumPy and pandas alternatives
- Indices, order and unhashable items — edge cases
- Performance summary and recommendations
- Sources
- Conclusion
Why your current code is fragile
Your snippet toggles membership in a set that you pre-filled with all unique elements. That makes the final set contain values with even multiplicity, not “appears more than once.” Example:
nums = [1, 1, 1]   # 1 appears three times (should be a duplicate)
z = set(nums)      # {1}
for i in nums:
    if i in z:
        z.remove(i)
    else:
        z.add(i)
print(z)  # set() -- wrong: 1 is not reported, despite appearing more than once
Why? You start with membership = True, then toggle membership once per occurrence. After k occurrences the element is present iff k is even. So counts of 2, 4, 6 will appear; counts 1, 3, 5 won’t. That makes the approach incorrect for general duplicate detection. Use one of the patterns below instead.
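To see the parity effect directly, here is a small illustrative check that runs the toggle on a value repeated k times:
# Illustration: with the toggle trick, a value ends up in z iff its count is even.
for k in range(1, 6):
    nums = [1] * k      # the value 1 appears exactly k times
    z = set(nums)       # pre-filled with the unique values, i.e. {1}
    for i in nums:
        if i in z:
            z.remove(i)
        else:
            z.add(i)
    print(k, 1 in z)    # True only for k = 2 and k = 4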
Single-pass set method (find duplicates python)
This is the simplest and most memory/time efficient general-purpose method for hashable items.
Code (unique duplicates, order not preserved):
nums = [4, 3, 2, 7, 8, 2, 3, 1]
seen = set()
dups = set()
for x in nums:
    if x in seen:
        dups.add(x)
    else:
        seen.add(x)
print(list(dups))  # e.g. [2, 3]
If you want duplicates in the order they first become duplicates:
seen = set()
dups_set = set()
dups_in_order = []
for x in nums:
    if x in seen:
        if x not in dups_set:
            dups_in_order.append(x)
            dups_set.add(x)
    else:
        seen.add(x)
Complexity
- Time: O(n) (one pass; set ops are amortized O(1))
- Space: O(k) where k is number of unique items (<= n)
When to use
- Use when elements are hashable (ints, strings, tuples).
- Use when you only need to know which items repeat (not counts).
- This is the usual go-to for “find duplicates in a list” (see community patterns on Stack Overflow for the same idea: https://stackoverflow.com/questions/9835762/find-duplicates-in-a-list).
collections.Counter — counts and frequencies (python collections counter)
If you need counts (how many times each duplicate appears) or a concise implementation, Counter is ideal.
Code:
from collections import Counter
nums = [4, 3, 2, 7, 8, 2, 3, 1]
cnt = Counter(nums)
duplicates = [x for x, c in cnt.items() if c > 1]
duplicates_with_counts = {x: c for x, c in cnt.items() if c > 1}
print(duplicates)             # [3, 2] (first-occurrence order)
print(duplicates_with_counts) # {3: 2, 2: 2}
Notes
- Counter counts in a single pass and is implemented efficiently in C; it’s very readable and handy when you want frequency info (see the most_common() example below). Official docs: https://docs.python.org/3/library/collections.html#collections.Counter
- The result gives unique duplicate values (not every repeated occurrence).
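For instance, if you want the frequencies ranked rather than filtered, Counter.most_common() returns (value, count) pairs sorted by count in descending order (ties keep first-seen order):
from collections import Counter

nums = [4, 3, 2, 7, 8, 2, 3, 1]
cnt = Counter(nums)

# most_common(2) returns the two most frequent values with their counts
print(cnt.most_common(2))  # [(3, 2), (2, 2)]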
Complexity
- Time: O(n)
- Space: O(k) for counts of unique items
Also see a practical guide and comparisons at GeeksforGeeks: https://www.geeksforgeeks.org/python/how-to-find-duplicates-in-a-list-python/
Sorting, NumPy and pandas alternatives
Sorting-based scan
- Sort the list and scan adjacent items to find runs. This is useful when you can modify the list and want to avoid auxiliary sets.
Code:
nums_sorted = sorted(nums)
dups = []
for i in range(1, len(nums_sorted)):
    if nums_sorted[i] == nums_sorted[i - 1]:
        if not dups or dups[-1] != nums_sorted[i]:
            dups.append(nums_sorted[i])
Complexity
- Time: O(n log n) (sorting dominates)
- Space: O(1) extra if you sort in-place with nums.sort() (see the sketch below), otherwise O(n) for the sorted() copy
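A minimal sketch of the in-place variant; note that it reorders nums, which may not be acceptable if the original order matters:
nums = [4, 3, 2, 7, 8, 2, 3, 1]
nums.sort()               # in-place: avoids the extra copy that sorted() makes
dups = []
for i in range(1, len(nums)):
    if nums[i] == nums[i - 1] and (not dups or dups[-1] != nums[i]):
        dups.append(nums[i])
print(dups)  # [2, 3]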
NumPy (numeric arrays)
- For large numeric arrays, NumPy can be faster because operations are in C:
import numpy as np
vals, counts = np.unique(np.array(nums), return_counts=True)
dups = vals[counts > 1].tolist()
- Note: np.unique often sorts under the hood, so it behaves like O(n log n) but is highly optimized for big numeric data. For small-range non-negative integers, np.bincount + np.nonzero is O(n + max_val) (sketched below).
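A sketch of the bincount approach, assuming the values are small non-negative integers:
import numpy as np

nums = [4, 3, 2, 7, 8, 2, 3, 1]
counts = np.bincount(np.array(nums))       # counts[v] == occurrences of v
dups = np.nonzero(counts > 1)[0].tolist()  # values whose count exceeds 1
print(dups)  # [2, 3]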
pandas (tabular / large datasets)
- If you already use pandas:
import pandas as pd
s = pd.Series(nums)
dups = s[s.duplicated(keep=False)].unique().tolist()
- pandas has vectorized, C-backed routines that can be faster for very large datasets, but they carry library overhead.
For quick removal or order-preserving unique extraction see Dataquest notes on removing duplicates: https://www.dataquest.io/blog/how-to-remove-duplicates-from-a-python-list/
Indices, order and unhashable items — edge cases
Want indices (positions) of duplicate occurrences?
from collections import defaultdict
pos = defaultdict(list)
for i, x in enumerate(nums):
    pos[x].append(i)
duplicates = {x: idxs for x, idxs in pos.items() if len(idxs) > 1}
# duplicates maps value -> list of indices
This is O(n) time and O(n) space.
Unhashable items (lists, dicts)
- Sets/Counter require hashable elements. If your list contains lists/dicts, convert them to a hashable representation when safe (a sketch follows this list):
- Convert lists to tuples: key = tuple(x)
- Convert dict to a tuple of sorted items: key = tuple(sorted(d.items()))
- Or use a stable serialization (json.dumps(…, sort_keys=True)) as a fallback (slower).
- Beware: serialization and conversions have costs and potential collisions if not done carefully.
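As an illustration, here is a minimal sketch that detects duplicate dicts by mapping each one to a stable string key, assuming the values are JSON-serializable:
import json

records = [{"a": 1, "b": 2}, {"b": 2, "a": 1}, {"a": 9}]

seen = set()
dups = []
for item in records:
    key = json.dumps(item, sort_keys=True)  # order-insensitive, hashable key
    if key in seen:
        dups.append(item)                   # same contents as an earlier dict
    else:
        seen.add(key)

print(dups)  # [{'b': 2, 'a': 1}]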
Streaming / very-large data
- If data is streaming or can’t fit in memory, exact duplicate detection requires external storage or multi-pass algorithms (external sort, database-backed dedup; a sqlite3-backed sketch follows). For approximate detection with bounded memory you can use a Bloom filter (but that yields false positives).
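As one illustration of database-backed dedup, the sketch below keeps the "seen" set on disk in SQLite instead of in memory; it assumes the items arrive as strings:
import sqlite3

def streaming_duplicates(items, db_path="seen.db"):
    # On-disk "seen" set: the PRIMARY KEY constraint rejects repeated values.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (val TEXT PRIMARY KEY)")
    dups = []
    for item in items:
        try:
            conn.execute("INSERT INTO seen (val) VALUES (?)", (item,))
        except sqlite3.IntegrityError:   # primary-key violation => duplicate
            dups.append(item)
    conn.commit()
    conn.close()
    return dups

print(streaming_duplicates(["a", "b", "a", "c", "b"]))  # ['a', 'b']
In practice you would clear or namespace the database between runs; the point is only that the membership check lives on disk rather than in RAM.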
For practical tips on indexing and locating duplicates see a compact guide: https://towardsdatascience.com/pythonic-tips-tricks-checking-and-indexing-duplicates-d62fdfea9eec/
Performance summary and recommendations
Quick cheat-sheet
- Single-pass seen/dups (hash sets)
  - Time: O(n)
  - Space: O(k)
  - Use: general-purpose, fastest for mixed types that are hashable
- collections.Counter
  - Time: O(n)
  - Space: O(k)
  - Use: when you need counts/frequencies; very readable
- Sorting + scan
  - Time: O(n log n)
  - Space: O(1) extra if in-place
  - Use: when you can sort in-place and want to save auxiliary memory
- NumPy / pandas
  - Time: often O(n log n) for unique (but fast in C)
  - Space: O(n)
  - Use: large numerical/tabular data where vectorized ops pay off
- For unhashable items
  - Convert to stable hashable keys or use position-mapping strategies
Recommendation
- For most use-cases: use the single-pass seen + dups method (simple and O(n)).
- If you also need counts or a compact implementation: use collections.Counter.
- If memory is constrained but sorting is allowed: sort and scan.
Community and reference examples that expand on these patterns: GeeksforGeeks (multiple examples) https://www.geeksforgeeks.org/python/python-program-print-duplicates-list-integers/ and the Counter documentation https://docs.python.org/3/library/collections.html#collections.Counter.
Sources
- Program to Print Duplicates from a List of Integers in Python - GeeksforGeeks
- How To Find Duplicates in a Python List - Intellisoft Training
- How to Find Duplicates in a List - Python - GeeksforGeeks
- How To Find Duplicates In A Python List? - PythonGuides
- How to Easily Remove Duplicates from a Python List – Dataquest
- Python Pro Tip: Find Duplicates in a List Without Using Loops - Medium
- Python How to Find Repeated/Duplicate Items in a List - Medium
- Pythonic Tips & Tricks - Checking and Indexing Duplicates - Towards Data Science
- collections — Container datatypes — Python docs (Counter)
- Find Duplicates in a List - Stack Overflow
Conclusion
To reliably detect duplicates in a Python list, use a single-pass seen/duplicates set pattern for speed and simplicity, or collections.Counter when you also need frequency counts. Avoid the pre-filled-toggle trick: it toggles membership and misses odd-count duplicates. For large numeric/tabular datasets consider NumPy/pandas, or sorting when memory or in-place constraints make sense.