
Find Duplicates in Python Lists: Set, Counter, Sort

Efficient methods to find duplicates in Python lists using sets, collections.Counter or sorting. Handles unhashable items and covers time and space complexity.

How can I efficiently find elements that appear more than once in a Python list?

Using Python, my current approach is to convert the list nums to a set to get uniqueness, then iterate over nums removing and adding elements to detect duplicates:

python
nums = [4, 3, 2, 7, 8, 2, 3, 1]
z = set(nums)
for i in nums:
    if i in z:
        z.remove(i)
    else:
        z.add(i)

print(nums, z)

This works, but I want a more efficient and optimized solution. What are better approaches (for example a single-pass set-based method, collections.Counter, or other patterns), and what are their time and space complexities?

To find duplicates in a Python list, prefer either a single-pass set-based approach or collections.Counter. The single-pass seen/duplicates pattern is O(n) time and O(n) space; collections.Counter also counts everything in one pass and is ideal when you need frequencies. Your current toggle-on-a-prepopulated-set trick looks clever, but it misses duplicates that occur an odd number of times; the corrected patterns below are simpler and reliable.




Why your current code is fragile

Your snippet toggles membership in a set that you pre-filled with all unique elements. That makes the final set contain values with even multiplicity, not “appears more than once.” Example:

python
nums = [1, 1, 1]  # 1 appears three times (should be reported as a duplicate)
z = set(nums)     # {1}
for i in nums:
    if i in z:
        z.remove(i)
    else:
        z.add(i)
print(z)  # set() -> wrong: 1 is not reported, despite appearing more than once

Why? Each element starts out present (you pre-filled the set), and every occurrence toggles its membership. After k occurrences the element is in the set iff k is even, so counts of 2, 4, 6 are reported while counts of 3, 5, 7 (which are genuine duplicates) are not. That makes the approach incorrect for general duplicate detection. Use one of the patterns below instead.


Single-pass set method

This is the simplest and most memory/time efficient general-purpose method for hashable items.

Code (unique duplicates, order not preserved):

python
nums = [4, 3, 2, 7, 8, 2, 3, 1]
seen = set()
dups = set()
for x in nums:
    if x in seen:
        dups.add(x)
    else:
        seen.add(x)

print(list(dups))  # e.g. [2, 3]

If you want duplicates in the order they first become duplicates:

python
seen = set()
dups_set = set()
dups_in_order = []
for x in nums:
    if x in seen:
        if x not in dups_set:
            dups_in_order.append(x)
            dups_set.add(x)
    else:
        seen.add(x)
Complexity

  • Time: O(n) (one pass; set ops are amortized O(1))
  • Space: O(k) where k is number of unique items (<= n)

When to use

  • You only need to know which values are duplicated (not how often), and the elements are hashable. This is the sensible default for most lists.


collections.Counter: counts and frequencies

If you need counts (how many times each duplicate appears) or a concise implementation, Counter is ideal.

Code:

python
from collections import Counter

nums = [4, 3, 2, 7, 8, 2, 3, 1]
cnt = Counter(nums)
duplicates = [x for x, c in cnt.items() if c > 1]
duplicates_with_counts = {x: c for x, c in cnt.items() if c > 1}

print(duplicates) # [2, 3]
print(duplicates_with_counts) # {2: 2, 3: 2}

Notes

  • Counter works only with hashable elements, the same restriction as the set-based approach.
  • Counter.most_common() returns (value, count) pairs sorted by descending count, which is handy when you want the most frequent duplicates first.
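For example, a small sketch (using a variant of nums with an extra 2, so the frequencies differ) that lists duplicates ordered by how often they appear:

python
from collections import Counter

nums_variant = [4, 3, 2, 7, 8, 2, 3, 2, 1]  # 2 appears three times, 3 twice
cnt = Counter(nums_variant)

# most_common() yields (value, count) pairs sorted by descending count
dups_by_frequency = [(x, c) for x, c in cnt.most_common() if c > 1]
print(dups_by_frequency)  # [(2, 3), (3, 2)]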

Complexity

  • Time: O(n)
  • Space: O(k) for counts of unique items

Also see a practical guide and comparisons at GeeksforGeeks: https://www.geeksforgeeks.org/python/how-to-find-duplicates-in-a-list-python/


Sorting, NumPy and pandas alternatives

Sorting-based scan

  • Sort the list and scan adjacent items to find runs. This is useful when you can modify the list and want to avoid auxiliary sets.

Code:

python
nums_sorted = sorted(nums)
dups = []
for i in range(1, len(nums_sorted)):
    if nums_sorted[i] == nums_sorted[i - 1]:
        if not dups or dups[-1] != nums_sorted[i]:
            dups.append(nums_sorted[i])

Complexity

  • Time: O(n log n) (sorting dominates)
  • Space: O(1) extra if you sort in-place with nums.sort(), otherwise O(n)

NumPy (numeric arrays)

  • For large numeric arrays, NumPy can be faster because operations are in C:
python
import numpy as np
vals, counts = np.unique(np.array(nums), return_counts=True)
dups = vals[counts > 1].tolist()
  • Note: np.unique typically sorts internally, so it behaves like O(n log n), but it is highly optimized for large numeric data. For small-range non-negative integers, np.bincount + np.nonzero is O(n + max_val); a sketch follows below.
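A minimal sketch of the bincount variant, assuming nums holds only non-negative integers with a small maximum value:

python
import numpy as np

nums = [4, 3, 2, 7, 8, 2, 3, 1]            # non-negative ints, small max value
counts = np.bincount(np.array(nums))       # counts[v] == occurrences of value v
dups = np.nonzero(counts > 1)[0].tolist()  # values whose count exceeds 1
print(dups)  # [2, 3]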

pandas (tabular / large datasets)

  • If you already use pandas:
python
import pandas as pd
s = pd.Series(nums)
dups = s[s.duplicated(keep=False)].unique().tolist()
  • pandas has vectorized, C-backed routines that can be faster for very large datasets, but they carry library overhead.

For quick removal or order-preserving unique extraction see Dataquest notes on removing duplicates: https://www.dataquest.io/blog/how-to-remove-duplicates-from-a-python-list/


Indices, order and unhashable items — edge cases

Want indices (positions) of duplicate occurrences?

python
from collections import defaultdict

pos = defaultdict(list)
for i, x in enumerate(nums):
    pos[x].append(i)

duplicates = {x: idxs for x, idxs in pos.items() if len(idxs) > 1}
# duplicates maps value -> list of indices

This is O(n) time and O(n) space.

Unhashable items (lists, dicts)

  • Sets and Counter require hashable elements. If your list contains lists or dicts, convert each element to a hashable key when it is safe to do so (see the sketch after this list):
    • Convert lists to tuples: key = tuple(x)
    • Convert dicts to a tuple of sorted items: key = tuple(sorted(d.items()))
    • Or use a stable serialization (json.dumps(…, sort_keys=True)) as a fallback (slower).
  • Beware: serialization and conversions have costs and can produce collisions if not done carefully.
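A minimal sketch of this idea, assuming a list of flat dicts and using sorted-items tuples as keys; the make_key helper is illustrative, not a standard library function:

python
from collections import Counter

records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "a"},  # duplicate of the first record
]

def make_key(d):
    # Hypothetical helper: turn a flat dict into a hashable, order-independent key.
    return tuple(sorted(d.items()))

cnt = Counter(make_key(d) for d in records)
dup_keys = {k for k, c in cnt.items() if c > 1}
duplicates = [d for d in records if make_key(d) in dup_keys]
print(duplicates)  # both copies of {'id': 1, 'name': 'a'}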

Streaming / very-large data

  • If the data is streaming or cannot fit in memory, exact duplicate detection requires external storage or multi-pass algorithms (external sort, database-backed dedup). For approximate detection with bounded memory you can use a Bloom filter, at the cost of occasional false positives (a minimal sketch follows).
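An illustrative Bloom filter sketch built on hashlib; the BloomFilter class, its bit-array size, and the number of hash functions are assumptions chosen for demonstration, not tuned production values:

python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: fixed memory, may report false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):  # assumed demo parameters
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        data = repr(item).encode()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(data + str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Flag "possible duplicates" in a stream with bounded memory.
bf = BloomFilter()
possible_dups = []
for x in [4, 3, 2, 7, 8, 2, 3, 1]:  # stand-in for a large stream
    if x in bf:
        possible_dups.append(x)     # may include false positives
    else:
        bf.add(x)
print(possible_dups)                # typically [2, 3]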

For practical tips on indexing and locating duplicates see a compact guide: https://towardsdatascience.com/pythonic-tips-tricks-checking-and-indexing-duplicates-d62fdfea9eec/


Performance summary and recommendations

Quick cheat-sheet

  • Single-pass seen/dups (hash sets)
    • Time: O(n)
    • Space: O(k)
    • Use: general-purpose; fastest for mixed types that are hashable
  • collections.Counter
    • Time: O(n)
    • Space: O(k)
    • Use: when you need counts/frequencies; very readable
  • Sorting + scan
    • Time: O(n log n)
    • Space: O(1) extra if in-place
    • Use: when you can sort in-place and want to save auxiliary memory
  • NumPy / pandas
    • Time: often O(n log n) for unique (but fast in C)
    • Space: O(n)
    • Use: large numerical/tabular data where vectorized ops pay off
  • For unhashable items
    • Convert to stable hashable keys or use position-mapping strategies

Recommendation

  • For most use-cases: use the single-pass seen + dups method (simple and O(n)).
  • If you also need counts or a compact implementation: use collections.Counter.
  • If memory is constrained but sorting is allowed: sort and scan.
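To sanity-check these trade-offs on your own data, here is a minimal timing sketch; the list size and value range below are arbitrary choices for illustration:

python
import random
import timeit
from collections import Counter

nums = [random.randrange(10_000) for _ in range(100_000)]  # arbitrary test data

def set_pass(data):
    seen, dups = set(), set()
    for x in data:
        if x in seen:
            dups.add(x)
        else:
            seen.add(x)
    return dups

def counter_pass(data):
    return {x for x, c in Counter(data).items() if c > 1}

def sort_scan(data):
    s = sorted(data)
    return {s[i] for i in range(1, len(s)) if s[i] == s[i - 1]}

for fn in (set_pass, counter_pass, sort_scan):
    elapsed = timeit.timeit(lambda: fn(nums), number=10)
    print(f"{fn.__name__}: {elapsed:.3f}s for 10 runs")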

Community and reference examples that expand on these patterns: GeeksforGeeks (multiple examples) https://www.geeksforgeeks.org/python/python-program-print-duplicates-list-integers/ and the Counter documentation https://docs.python.org/3/library/collections.html#collections.Counter.




Conclusion

To reliably detect duplicates in a Python list, use a single-pass seen/duplicates set pattern for speed and simplicity, or collections.Counter when you also need frequency counts. Avoid the pre-filled-toggle trick: it flips membership on every occurrence and misses odd-count duplicates. For large numeric or tabular datasets, consider NumPy/pandas, or sort and scan when memory or in-place constraints make that sensible.
