Handle Duplicate Data in Pandas for AI & ML

Learn to identify and remove duplicate data in Pandas DataFrames with `duplicated()` and `drop_duplicates()` — an essential preprocessing step for accurate AI and Machine Learning models.

Duplicate data, meaning identical rows appearing more than once in a dataset, can arise from various sources such as repeated entries, data entry errors, or the merging of different datasets. Identifying and removing these duplicates is a critical step in data preprocessing to ensure the accuracy and reliability of subsequent analyses.

Pandas provides two primary methods for managing duplicate data:

  • duplicated(): Used to identify and flag duplicate rows.
  • drop_duplicates(): Used to remove duplicate rows from a DataFrame.

1. Identifying Duplicate Rows

The duplicated() method in Pandas is used to detect duplicate rows within a DataFrame. By default, this method considers all columns to determine if a row is a duplicate. It returns a boolean Series where True indicates a duplicate row, and False indicates a unique row. The first occurrence of a set of duplicate rows is marked as False.

Syntax:

DataFrame.duplicated(subset=None, keep='first')
  • subset: A column label or sequence of labels. Defaults to all columns.
  • keep: Determines which duplicates (if any) to mark.
    • 'first' (default): Mark duplicates as True except for the first occurrence.
    • 'last': Mark duplicates as True except for the last occurrence.
    • False: Mark all duplicates as True.

Example:

import pandas as pd

# Create a DataFrame with duplicate entries
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Priya', 'Raj'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '20 June 2019', '14 April 2018'],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai']
})

print("Original DataFrame:")
print(df)

# Identify duplicates considering all columns
df['is_duplicate'] = df.duplicated()

print("\nDataFrame with duplicate identification:")
print(df)

Output:

Original DataFrame:
    Name     Date_of_Birth     City
0  Rahul  01 December 2017    Delhi
1    Raj     14 April 2018   Mumbai
2  Rahul  01 December 2017    Delhi
3  Priya      20 June 2019  Chennai
4    Raj     14 April 2018   Mumbai

DataFrame with duplicate identification:
    Name     Date_of_Birth     City  is_duplicate
0  Rahul  01 December 2017    Delhi         False
1    Raj     14 April 2018   Mumbai         False
2  Rahul  01 December 2017    Delhi          True
3  Priya      20 June 2019  Chennai         False
4    Raj     14 April 2018   Mumbai          True

In this example, rows with index 2 and 4 are marked as True because they are exact duplicates of rows with index 0 and 1, respectively.
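Before deciding how to handle duplicates, it is often useful to count them or inspect the flagged rows directly. A minimal sketch (the data here is illustrative, shortened from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'City': ['Delhi', 'Mumbai', 'Delhi']
})

# Count duplicate rows (first occurrences are not counted)
print(df.duplicated().sum())      # 1

# Select only the rows flagged as duplicates
print(df[df.duplicated()])
```

Because duplicated() returns a boolean Series, it can be used to filter the DataFrame, letting you review duplicates before deleting anything.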

2. Identifying Duplicates Based on Specific Columns

You can specify which columns to consider for identifying duplicates using the subset parameter. This is useful when you want to define uniqueness based on a subset of your data.

Example:

Let's identify duplicates based only on the 'Name' column.

# Identify duplicates based on the 'Name' column
df['is_duplicate_name'] = df.duplicated(subset=['Name'])

print("\nDataFrame with duplicates identified by 'Name' column:")
print(df)

Output:

DataFrame with duplicates identified by 'Name' column:
    Name     Date_of_Birth     City  is_duplicate  is_duplicate_name
0  Rahul  01 December 2017    Delhi         False                False
1    Raj     14 April 2018   Mumbai         False                False
2  Rahul  01 December 2017    Delhi          True                 True
3  Priya      20 June 2019  Chennai         False                False
4    Raj     14 April 2018   Mumbai          True                 True

Here, rows 2 and 4 are marked True for is_duplicate_name because their 'Name' values ('Rahul' and 'Raj') have appeared before.

You can also use keep='last' to mark all duplicates as True except for the last occurrence, or keep=False to mark all duplicates as True.
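A short sketch contrasting the three keep settings on the 'Name' column from the example above:

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Rahul', 'Raj', 'Rahul', 'Priya', 'Raj']})

# keep='first' (default): the first occurrence is not flagged
print(names.duplicated(subset=['Name']).tolist())
# [False, False, True, False, True]

# keep='last': the last occurrence is not flagged
print(names.duplicated(subset=['Name'], keep='last').tolist())
# [True, True, False, False, False]

# keep=False: every member of a duplicate group is flagged
print(names.duplicated(subset=['Name'], keep=False).tolist())
# [True, True, True, False, True]
```

keep=False is particularly handy when you want to inspect all copies of a repeated record, not just the later ones.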

3. Removing Duplicate Rows

The drop_duplicates() method is used to remove duplicate rows from a DataFrame. By default, it keeps the first occurrence of a duplicate row and removes all subsequent ones.

Syntax:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
  • subset: Column label or sequence of labels. Defaults to all columns.
  • keep: Determines which duplicates (if any) to remove.
    • 'first' (default): Keep the first occurrence and remove others.
    • 'last': Keep the last occurrence and remove others.
    • False: Remove all duplicates.
  • inplace: If True, modify the DataFrame in place and return None.
  • ignore_index: If True, the resulting axis will be labeled 0, 1, ..., n-1.

Example:

Removing duplicates considering all columns. At this point df carries the is_duplicate and is_duplicate_name helper columns added earlier. Their values differ between a first occurrence and its repeats, so they must be excluded from the comparison; otherwise drop_duplicates() would treat every row as unique.

# Exclude the helper flag columns, then remove duplicate rows (keeping the first occurrence)
df_no_duplicates = df.drop(columns=['is_duplicate', 'is_duplicate_name']).drop_duplicates()

print("\nDataFrame after removing all duplicates:")
print(df_no_duplicates)

Output:

DataFrame after removing all duplicates:
    Name     Date_of_Birth     City
0  Rahul  01 December 2017    Delhi
1    Raj     14 April 2018   Mumbai
3  Priya      20 June 2019  Chennai

The rows with index 2 and 4 have been removed because they were duplicates of rows 0 and 1, respectively.
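Note that the surviving rows keep their original index labels (0, 1, 3). If you prefer a clean 0, 1, ..., n-1 index, the ignore_index parameter from the syntax above relabels the result (available in pandas 1.0 and later; reset_index(drop=True) achieves the same on older versions). A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Rahul', 'Raj', 'Rahul', 'Priya', 'Raj']})

# ignore_index=True renumbers the surviving rows as 0, 1, 2, ...
deduped = df.drop_duplicates(ignore_index=True)
print(deduped.index.tolist())   # [0, 1, 2]
print(deduped['Name'].tolist())  # ['Rahul', 'Raj', 'Priya']
```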

4. Removing Duplicates Based on Specific Columns

Similar to duplicated(), you can use the subset parameter with drop_duplicates() to remove duplicates based on a specific set of columns.

Example:

Removing duplicates based on the 'Date_of_Birth' column.

# Remove duplicates based on the 'Date_of_Birth' column, keeping the first occurrence
df_unique_dob = df.drop_duplicates(subset=['Date_of_Birth'])

print("\nDataFrame after removing duplicates by 'Date_of_Birth':")
print(df_unique_dob)

Output:

DataFrame after removing duplicates by 'Date_of_Birth':
    Name     Date_of_Birth     City  is_duplicate  is_duplicate_name
0  Rahul  01 December 2017    Delhi         False                False
1    Raj     14 April 2018   Mumbai         False                False
3  Priya      20 June 2019  Chennai         False                False

In this case, the rows at index 2 and 4 are removed because their 'Date_of_Birth' values ('01 December 2017' and '14 April 2018') already appear at index 0 and 1, respectively. In this particular dataset the result is identical to deduplicating on all columns, since every repeated date belongs to a row that is a full duplicate anyway. The subset parameter matters when rows share a value in the subset columns but differ elsewhere: two different people born on the same date would count as duplicates under subset=['Date_of_Birth'] but not under subset=['Date_of_Birth', 'Name'].
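As a check, deduplicating on the ['Date_of_Birth', 'Name'] pair can be run directly; a sketch using only the two relevant columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Priya', 'Raj'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018',
                      '01 December 2017', '20 June 2019', '14 April 2018'],
})

# keep='first': rows 2 and 4 repeat the (Name, Date_of_Birth) pairs at 0 and 1
result = df.drop_duplicates(subset=['Date_of_Birth', 'Name'])
print(result.index.tolist())   # [0, 1, 3]

# keep=False: drop every member of a duplicate group, leaving only Priya
print(df.drop_duplicates(subset=['Date_of_Birth', 'Name'], keep=False).index.tolist())   # [3]
```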

Conclusion

Effectively handling duplicate data is paramount for maintaining data integrity and ensuring the accuracy of analytical insights. Pandas provides the duplicated() and drop_duplicates() methods, which offer flexible and efficient ways to identify, flag, and remove duplicate records based on all columns or specific subsets, thereby enhancing the overall quality of your datasets.