Pandas: Python for Data Analysis & ML

Master Pandas for data manipulation and analysis in Python. Learn indexing, Series, DataFrames, and essential techniques for your machine learning projects.

Pandas: A Comprehensive Guide to Data Manipulation and Analysis

Pandas is a powerful open-source Python library designed for data manipulation and analysis. It provides data structures like Series and DataFrames that make it easy to work with structured data.

Core Concepts

1. Indexing and Selecting Data

Pandas offers flexible ways to access and subset your data.

a. Series and Attributes of Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).

  • Attributes: Series have various attributes like index, values, dtype, name, size, and empty.

Example:

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.values) # Underlying NumPy array of values
print(s.index) # Default RangeIndex from 0 to 5
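The remaining attributes listed above can be inspected the same way. This is a minimal sketch; the Series name "readings" and the variable s_attrs are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# A small Series with an illustrative name; np.nan forces a float dtype.
s_attrs = pd.Series([1, 3, 5, np.nan, 6, 8], name="readings")

print(s_attrs.dtype)  # data type of the stored values
print(s_attrs.name)   # label attached to the Series
print(s_attrs.size)   # number of elements, NaN included
print(s_attrs.empty)  # True only when the Series has no elements
```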

b. Slicing a Series Object

You can select subsets of a Series using label-based or integer-based indexing.

Example:

print(s[0:3]) # Slicing with []: positions 0, 1, 2
print(s.loc[0:2]) # Label-based slicing: both endpoints are included (labels 0, 1, 2)
print(s.iloc[0:3]) # Position-based slicing: the stop is excluded (positions 0, 1, 2)
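With the default integer index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a hypothetical string-labeled Series (s_lab, not from the original) to make the endpoint behavior visible.

```python
import pandas as pd

# Hypothetical Series with string labels so labels and positions differ.
s_lab = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s_lab.loc["a":"c"]   # label slice: the end label "c" is included
by_position = s_lab.iloc[0:2]   # position slice: position 2 is excluded

print(by_label)
print(by_position)
```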

c. Accessing DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's like a spreadsheet or SQL table.

Example:

data = {'col1': [1, 2, 3, 4], 'col2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)
print(df['col1']) # Accessing a single column
print(df[['col1', 'col2']]) # Accessing multiple columns

2. DataFrame Operations

a. Arithmetic Operations on DataFrame

Arithmetic between DataFrames, or between a DataFrame and a Series, is element-wise and aligns on row and column labels. Labels present in only one operand produce NaN in the result.

Example:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print(df1 + df2) # Element-wise addition
print(df1 * 2)  # Scalar multiplication
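The example above covers DataFrame-to-DataFrame and scalar arithmetic. As a sketch of the DataFrame-and-Series case (the names df_num, col_offsets, and row_offsets are illustrative), a Series is matched against the column labels by default, while the add method with axis=0 matches it against the row labels instead.

```python
import pandas as pd

df_num = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Series index matches the column labels: each column gets its own offset.
col_offsets = pd.Series({"A": 10, "B": 100})
by_column = df_num + col_offsets

# Series index matches the row labels: each row gets its own offset.
row_offsets = pd.Series([1000, 2000])
by_row = df_num.add(row_offsets, axis=0)

print(by_column)
print(by_row)
```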

b. Modifying DataFrame

You can add, delete, or rename columns and rows.

Example:

df['new_col'] = [5, 6, 7, 8] # Adding a new column
df.drop('col1', axis=1, inplace=True) # Dropping a column
df.rename(columns={'col2': 'NewColumnName'}, inplace=True) # Renaming a column

c. Removing Rows from a DataFrame

Rows can be removed based on their index or based on certain conditions.

Example:

df.drop(0, inplace=True) # Remove the row labeled 0
df_filtered = df[df['NewColumnName'] != 'A'] # Keep only rows whose value is not 'A'

d. Sorting and Reindexing

You can sort DataFrames by index or by values, and reindex them to conform to a new index.

Example:

df.sort_index(inplace=True) # Sort by index
df.sort_values(by='NewColumnName', inplace=True) # Sort by column values
new_index = [0, 1, 2]
df_reindexed = df.reindex(new_index) # Labels missing from df get a row of NaN
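Reindexing to labels that do not exist in the original index introduces missing values. A small sketch with a hypothetical frame (df_small) shows this, together with the fill_value parameter of reindex that substitutes a default instead of NaN.

```python
import pandas as pd

# Hypothetical frame whose index starts at 1, so label 0 is absent.
df_small = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

with_gap = df_small.reindex([0, 1, 2])              # label 0 becomes NaN
filled = df_small.reindex([0, 1, 2], fill_value=0)  # label 0 becomes 0

print(with_gap)
print(filled)
```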

3. Advanced Indexing and Selection

a. Basics of Multi-Index

A MultiIndex (or hierarchical index) allows you to have multiple levels of indexing on an axis.

Example:

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['level1', 'level2'])
s_multi = pd.Series([1, 2, 3, 4], index=index)
print(s_multi['A']) # All entries whose level1 label is 'A'
print(s_multi.loc[:, 1]) # All entries whose level2 label is 1

b. Indexing with MultiIndex

You can select data efficiently using the levels of a MultiIndex.

Example:

print(s_multi.loc[('A', 2)]) # Accessing a specific tuple
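Besides full tuples, a single level can be selected by name. The sketch below rebuilds the same index under illustrative names (idx, s_mi) and uses xs to take a cross-section on the inner level.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)], names=["level1", "level2"]
)
s_mi = pd.Series([1, 2, 3, 4], index=idx)

# Cross-section on the inner level: the selected level is dropped from the result.
inner = s_mi.xs(1, level="level2")

# Partial selection on the outer level leaves the inner level as the index.
outer = s_mi.loc["B"]

print(inner)
print(outer)
```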

4. Boolean Indexing and Masking

Boolean indexing allows you to select data based on boolean conditions.

a. Boolean Indexing

Create a boolean Series or array to filter your DataFrame.

Example:

data = {'col1': [1, 2, 3, 4], 'col2': [10, 20, 15, 25]}
df = pd.DataFrame(data)
mask = df['col1'] > 2
print(df[mask]) # Filtering based on the mask
print(df[df['col1'] > 2]) # Direct boolean indexing

b. Boolean Masking

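A mask is a boolean array or Series with the same length as the data; applying it keeps the elements where the mask is True. Masks can be combined with & (and), | (or), and ~ (not), with each condition wrapped in parentheses.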
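As a minimal sketch of masking (the variable names are illustrative), the code below combines two conditions and also uses Series.where, which keeps the original shape and replaces the masked-out entries with NaN rather than dropping them.

```python
import pandas as pd

df_mask = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 15, 25]})

# Parentheses are required because & and | bind tighter than comparisons.
both = df_mask[(df_mask["col1"] > 1) & (df_mask["col2"] < 25)]
either = df_mask[(df_mask["col1"] == 1) | (df_mask["col2"] == 25)]

# where() keeps every position and puts NaN where the mask is False.
kept_shape = df_mask["col2"].where(df_mask["col1"] > 2)

print(both)
print(either)
print(kept_shape)
```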

5. Working with Categorical Data

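Pandas provides a dedicated category dtype for categorical data. It stores each distinct value once and represents the column as integer codes, which saves memory and can speed up operations when a column has few distinct values that repeat often.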

a. Categorical Data

Convert columns to the 'category' dtype.

Example:

df['col2'] = df['col2'].astype('category')
print(df['col2'].dtype)

b. Comparing Categorical Data

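Equality comparisons (== and !=) work as they do for other data types and can be more efficient. Order comparisons (<, >, <=, >=) require an ordered categorical and raise a TypeError otherwise.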
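A short sketch of comparing categorical data follows. The size labels and variable names are illustrative; the point is that order comparisons follow the declared category order and fail on an unordered categorical.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

is_medium = sizes == "medium"   # equality works for any categorical
above_small = sizes > "small"   # uses the declared order, not alphabetical order

# The same order comparison on an unordered categorical raises TypeError.
unordered = sizes.cat.as_unordered()
try:
    unordered > "small"
    raised = False
except TypeError:
    raised = True

print(is_medium.tolist(), above_small.tolist(), raised)
```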

c. Computing Dummy Variables

Create binary indicator variables for each category.

Example:

dummies = pd.get_dummies(df['col2'])
print(dummies)

d. Ordering and Sorting Categorical Data

You can define the order of categories.

Example:

cat_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
s = pd.Series(['medium', 'small', 'large', 'medium'], dtype=cat_type)
print(s.sort_values())

6. Data Aggregation and Transformation

a. Pivoting

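pivot reshapes data from long to wide format: the distinct values of one column become the new column labels. Each index/columns pair must be unique, otherwise pivot raises a ValueError.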

Example:

df_pivot = pd.DataFrame({'key1': ['A', 'A', 'B', 'B'],
                         'key2': ['one', 'two', 'one', 'two'],
                         'value': [1, 2, 3, 4]})
print(df_pivot.pivot(index='key1', columns='key2', values='value'))
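When the long data contains repeated index/column pairs, pivot cannot decide which value to keep. The sketch below, on hypothetical data (df_long), shows that failure and uses pivot_table with an aggregation function as one possible alternative.

```python
import pandas as pd

# Hypothetical long data where the pair ("A", "one") appears twice.
df_long = pd.DataFrame(
    {"key1": ["A", "A", "A"], "key2": ["one", "one", "two"], "value": [1, 3, 5]}
)

try:
    df_long.pivot(index="key1", columns="key2", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True  # duplicate entries cannot be reshaped

# pivot_table aggregates the duplicates instead (here by summing them).
wide = df_long.pivot_table(index="key1", columns="key2", values="value", aggfunc="sum")

print(pivot_failed)
print(wide)
```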

b. Stacking and Unstacking

These operations reshape data between wide and long formats, usually in combination with a MultiIndex. stack moves the column labels into the innermost level of the row index, and unstack moves the innermost row index level back into the columns.

Example:

stacked_df = df_pivot.set_index(['key1', 'key2']).stack()
print(stacked_df.unstack())

7. Handling Missing Data

Pandas provides robust tools for dealing with missing data (NaN).

a. Calculations in Missing Data

Reductions such as sum, mean, min, and max skip NaN values by default (skipna=True), whereas element-wise arithmetic propagates NaN.
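A minimal sketch of how NaN behaves in calculations, using an illustrative Series (s_nan):

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1.0, np.nan, 3.0])

total = s_nan.sum()               # NaN is skipped: 1.0 + 3.0
mean = s_nan.mean()               # averaged over the two valid values
strict = s_nan.sum(skipna=False)  # NaN is not skipped, so the result is NaN
shifted = s_nan + 1               # element-wise arithmetic keeps the NaN

print(total, mean, strict)
print(shifted)
```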

b. Dropping Missing Data

Remove rows or columns containing missing values.

Example:

df_with_nan = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df_with_nan.dropna()) # Drops rows with any NaN
print(df_with_nan.dropna(axis=1)) # Drops columns with any NaN

c. Filling Missing Data

Replace NaN values with a specified value or using various filling methods.

Example:

print(df_with_nan.fillna(0)) # Fill with a scalar value
print(df_with_nan.ffill()) # Forward fill: carry the last valid value down
print(df_with_nan.bfill()) # Backward fill: pull the next valid value up

d. Interpolation of Missing Values

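Fill missing values by interpolating between the neighboring valid values. By default, interpolate uses linear interpolation and treats the values as equally spaced.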

Example:

print(df_with_nan.interpolate())
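To make the default linear behavior concrete, here is a small sketch on an illustrative Series (s_gap) with a two-element gap. The limit parameter caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s_gap = pd.Series([0.0, np.nan, np.nan, 6.0])

linear = s_gap.interpolate()          # evenly spaced steps between 0 and 6
limited = s_gap.interpolate(limit=1)  # fills only the first NaN of the gap

print(linear)
print(limited)
```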

8. Duplicate Data

Identify and handle duplicate rows.

a. Duplicate Data

Detect duplicate rows.

Example:

df_dups = pd.DataFrame({'col1': [1, 2, 1, 3], 'col2': ['A', 'B', 'A', 'C']})
print(df_dups.duplicated()) # Boolean Series: True for each row that repeats an earlier row
print(df_dups[df_dups.duplicated()]) # Show only the repeated rows
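Once duplicates are detected, drop_duplicates removes them. The sketch below uses a hypothetical frame (df_d) to show the keep and subset parameters.

```python
import pandas as pd

df_d = pd.DataFrame({"col1": [1, 2, 1, 1], "col2": ["A", "B", "A", "C"]})

first_kept = df_d.drop_duplicates()             # keep the first of each repeated row
last_kept = df_d.drop_duplicates(keep="last")   # keep the last occurrence instead
none_kept = df_d.drop_duplicates(keep=False)    # drop every row that has a repeat
by_col = df_d.drop_duplicates(subset="col1")    # judge duplicates on col1 only

print(first_kept)
print(by_col)
```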

b. Counting and Retrieving Unique Elements

Find unique values in a Series or count their occurrences.

Example:

print(df_dups['col2'].unique()) # Get unique values
print(df_dups['col2'].value_counts()) # Count occurrences of each unique value

9. I/O Tools

Pandas offers functions to read and write data from various file formats.

a. Reading and Writing Data to Excel

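read_excel and to_excel read from and write to Excel files. Working with .xlsx files requires an Excel engine such as openpyxl to be installed.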

Example:

# Reading from Excel
# df = pd.read_excel('your_file.xlsx', sheet_name='Sheet1')

# Writing to Excel
# df.to_excel('output_file.xlsx', sheet_name='Sheet1', index=False)

10. Iteration & Concatenation

a. Iteration

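Iterate over DataFrame rows or columns. iterrows yields one (index, Series) pair per row; it is slow on large DataFrames, so prefer vectorized operations where possible.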

Example:

for index, row in df.iterrows():
    print(index, row['col1'])
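The example above iterates over rows only. As a sketch of the alternatives (df_it and the other names are illustrative), itertuples yields lightweight named tuples per row, items yields one (name, Series) pair per column, and a vectorized expression often avoids the loop altogether.

```python
import pandas as pd

df_it = pd.DataFrame({"col1": [1, 2], "col2": [10, 20]})

# Row iteration with named tuples; row.Index holds the row label.
pairs = [(row.Index, row.col1) for row in df_it.itertuples()]

# Column iteration: one (column name, Series) pair per column.
col_sums = {name: int(col.sum()) for name, col in df_it.items()}

# Vectorized alternative to looping over rows.
doubled = df_it["col1"] * 2

print(pairs)
print(col_sums)
print(doubled.tolist())
```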

b. Concatenation

Combine multiple Pandas objects along a particular axis.

Example:

df_concat1 = pd.DataFrame({'A': [1, 2]})
df_concat2 = pd.DataFrame({'A': [3, 4]})
print(pd.concat([df_concat1, df_concat2])) # Concatenating along axis 0 (rows)
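Row-wise concatenation keeps the original index labels, so the result above contains the labels 0 and 1 twice. The following sketch (with illustrative names) shows ignore_index for renumbering and axis=1 for placing frames side by side.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]})
bottom = pd.DataFrame({"A": [3, 4]})

stacked = pd.concat([top, bottom])                        # index labels repeat
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh index 0..3

# axis=1 aligns on the row index and places the columns next to each other.
side_by_side = pd.concat([top, bottom.rename(columns={"A": "B"})], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```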