Handling Duplicate Data in Pandas
Duplicate data, meaning identical rows appearing more than once in a dataset, can arise from various sources such as repeated entries, data entry errors, or the merging of different datasets. Identifying and removing these duplicates is a critical step in data preprocessing to ensure the accuracy and reliability of subsequent analyses.
Pandas provides two primary methods for managing duplicate data:
- `duplicated()`: Used to identify and flag duplicate rows.
- `drop_duplicates()`: Used to remove duplicate rows from a DataFrame.
1. Identifying Duplicate Rows
The `duplicated()` method in Pandas detects duplicate rows within a DataFrame. By default, it considers all columns when deciding whether a row is a duplicate. It returns a boolean Series in which `True` marks a duplicate row and `False` marks a unique row; the first occurrence of a set of duplicate rows is marked `False`.
Syntax:
DataFrame.duplicated(subset=None, keep='first')
- `subset`: A column label or sequence of labels to consider. Defaults to all columns.
- `keep`: Determines which duplicates (if any) to mark.
  - `'first'` (default): Mark duplicates as `True` except for the first occurrence.
  - `'last'`: Mark duplicates as `True` except for the last occurrence.
  - `False`: Mark all duplicates as `True`.
Example:
import pandas as pd
# Create a DataFrame with duplicate entries
df = pd.DataFrame({
'Name': ['Rahul', 'Raj', 'Rahul', 'Priya', 'Raj'],
'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '20 June 2019', '14 April 2018'],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai']
})
print("Original DataFrame:")
print(df)
# Identify duplicates considering all columns
df['is_duplicate'] = df.duplicated()
print("\nDataFrame with duplicate identification:")
print(df)
Output:
Original DataFrame:
    Name     Date_of_Birth     City
0  Rahul  01 December 2017    Delhi
1    Raj     14 April 2018   Mumbai
2  Rahul  01 December 2017    Delhi
3  Priya      20 June 2019  Chennai
4    Raj     14 April 2018   Mumbai
DataFrame with duplicate identification:
    Name     Date_of_Birth     City  is_duplicate
0  Rahul  01 December 2017    Delhi         False
1    Raj     14 April 2018   Mumbai         False
2  Rahul  01 December 2017    Delhi          True
3  Priya      20 June 2019  Chennai         False
4    Raj     14 April 2018   Mumbai          True
In this example, the rows at index 2 and 4 are marked `True` because they are exact duplicates of the rows at index 0 and 1, respectively.
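Because `duplicated()` returns a boolean Series, it composes naturally with other Pandas operations. As a quick sketch (not part of the example above), you can count the flagged rows and inspect them with boolean indexing; note that `df` now carries the helper `is_duplicate` column, so we pass the original columns as `subset` to avoid comparing the flag itself:
# Count duplicates over the original data columns; the helper
# 'is_duplicate' column added above must be excluded, or rows 0 and 2
# would no longer compare as equal
original_cols = ['Name', 'Date_of_Birth', 'City']
print(df.duplicated(subset=original_cols).sum())   # 2

# Boolean indexing shows only the flagged rows (index 2 and 4)
print(df[df.duplicated(subset=original_cols)])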
2. Identifying Duplicates Based on Specific Columns
You can specify which columns to consider for identifying duplicates using the `subset` parameter. This is useful when you want to define uniqueness based on a subset of your data.
Example:
Let's identify duplicates based only on the 'Name' column.
# Identify duplicates based on the 'Name' column
df['is_duplicate_name'] = df.duplicated(subset=['Name'])
print("\nDataFrame with duplicates identified by 'Name' column:")
print(df)
Output:
DataFrame with duplicates identified by 'Name' column:
    Name     Date_of_Birth     City  is_duplicate  is_duplicate_name
0  Rahul  01 December 2017    Delhi         False              False
1    Raj     14 April 2018   Mumbai         False              False
2  Rahul  01 December 2017    Delhi          True               True
3  Priya      20 June 2019  Chennai         False              False
4    Raj     14 April 2018   Mumbai          True               True
Here, rows 2 and 4 are marked `True` for `is_duplicate_name` because their 'Name' values ('Rahul' and 'Raj') have appeared before. You can also use `keep='last'` to mark all duplicates as `True` except for the last occurrence, or `keep=False` to mark all duplicates as `True`, as sketched below.
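A minimal sketch of the other `keep` options, continuing with the same DataFrame:
# keep='last': flag duplicates EXCEPT the final occurrence of each name
print(df.duplicated(subset=['Name'], keep='last'))
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

# keep=False: flag every row whose name appears more than once
print(df.duplicated(subset=['Name'], keep=False))
# 0     True
# 1     True
# 2     True
# 3    False
# 4     True
# dtype: bool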
3. Removing Duplicate Rows
The `drop_duplicates()` method removes duplicate rows from a DataFrame. By default, it keeps the first occurrence of a duplicate row and removes all subsequent ones.
Syntax:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- `subset`: Column label or sequence of labels to consider. Defaults to all columns.
- `keep`: Determines which duplicates (if any) to remove.
  - `'first'` (default): Keep the first occurrence and remove the others.
  - `'last'`: Keep the last occurrence and remove the others.
  - `False`: Remove all duplicates.
- `inplace`: If `True`, modify the DataFrame in place and return `None`.
- `ignore_index`: If `True`, relabel the resulting axis 0, 1, ..., n-1.
Example:
Removing duplicates considering all columns.
# The helper flag columns added earlier would stop rows 0 and 2 (and 1 and 4)
# from comparing as equal, so drop them before deduplicating
df_clean = df.drop(columns=['is_duplicate', 'is_duplicate_name'])

# Remove duplicate rows, keeping the first occurrence
df_no_duplicates = df_clean.drop_duplicates()
print("\nDataFrame after removing all duplicates:")
print(df_no_duplicates)
Output:
DataFrame after removing all duplicates:
    Name     Date_of_Birth     City
0  Rahul  01 December 2017    Delhi
1    Raj     14 April 2018   Mumbai
3  Priya      20 June 2019  Chennai
The rows with index 2 and 4 have been removed because they were duplicates of rows 0 and 1, respectively.
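The remaining parameters can be combined in the same call. A small sketch, reusing `df_clean` from above, shows `keep=False` together with `ignore_index=True`:
# keep=False removes every row that has a duplicate anywhere, and
# ignore_index=True relabels the surviving rows 0, 1, ..., n-1
print(df_clean.drop_duplicates(keep=False, ignore_index=True))
# Only Priya's row survives, relabeled to index 0:
#     Name  Date_of_Birth     City
# 0  Priya   20 June 2019  Chennai

# inplace=True would instead modify df_clean itself and return None:
# df_clean.drop_duplicates(inplace=True)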
4. Removing Duplicates Based on Specific Columns
Similar to `duplicated()`, you can use the `subset` parameter with `drop_duplicates()` to remove duplicates based on a specific set of columns.
Example:
Removing duplicates based on the 'Date_of_Birth' column.
# Remove duplicates based on the 'Date_of_Birth' column, keeping the first occurrence
df_unique_dob = df_clean.drop_duplicates(subset=['Date_of_Birth'])
print("\nDataFrame after removing duplicates by 'Date_of_Birth':")
print(df_unique_dob)
Output:
DataFrame after removing duplicates by 'Date_of_Birth':
    Name     Date_of_Birth     City
0  Rahul  01 December 2017    Delhi
1    Raj     14 April 2018   Mumbai
3  Priya      20 June 2019  Chennai
In this case, the rows at index 2 and 4 are removed because their 'Date_of_Birth' values duplicate those already seen at indices 0 and 1. Note that only the 'Date_of_Birth' column is compared: two rows with the same date but different names would still be treated as duplicates under this subset. Widening the subset to `['Date_of_Birth', 'Name']` makes the check stricter, since a row is then dropped only when both values repeat an earlier combination. With this particular data the result is identical, because each repeated date belongs to the same repeated name, as the sketch below shows.
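A quick sketch to verify this on the same data:
# Widening the subset: a row is now dropped only when BOTH its
# 'Date_of_Birth' AND 'Name' repeat an earlier combination
print(df_clean.drop_duplicates(subset=['Date_of_Birth', 'Name']))
#     Name     Date_of_Birth     City
# 0  Rahul  01 December 2017    Delhi
# 1    Raj     14 April 2018   Mumbai
# 3  Priya      20 June 2019  Chennai
The two subsets happen to agree here; they would diverge if, for example, two different people shared the same date of birth.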
Conclusion
Effectively handling duplicate data is paramount for maintaining data integrity and ensuring the accuracy of analytical insights. Pandas provides the `duplicated()` and `drop_duplicates()` methods, which offer flexible and efficient ways to identify, flag, and remove duplicate records based on all columns or specific subsets, thereby enhancing the overall quality of your datasets.