Pandas Iteration & Concatenation for ML Data
Master Pandas iteration and concatenation for efficient ML data manipulation. Learn to iterate Series & DataFrames and combine them effectively.
Iteration and Concatenation in Pandas
This document outlines common methods for iterating over Pandas objects (Series and DataFrames) and the process of concatenating them.
Iterating Over Pandas Objects
Pandas offers several ways to iterate through its data structures. The behavior of iteration differs between Series and DataFrames.
Series Iteration
When you iterate directly over a Pandas Series, you access its values, similar to iterating over a NumPy array.
import pandas as pd
import numpy as np
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Iterating over a Series:")
for value in s:
    print(value)
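When the index labels are needed alongside the values, `Series.items()` can be used instead of plain iteration. The following is a minimal sketch that reuses the same sample Series; the variable name `pairs` is only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Series.items() yields (index label, value) pairs
pairs = list(s.items())
for label, value in pairs:
    print(label, value)
```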
DataFrame Iteration
Iterating directly over a DataFrame yields its column labels (column names), much like iterating over the keys of a Python dictionary.
df_sample = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print("\nIterating over DataFrame columns:")
for col_name in df_sample:
    print(col_name)
Iterating Through DataFrame Columns and Rows
Pandas provides specialized methods for iterating through a DataFrame, either column by column or row by row, each exposing the data in a different form.
- items() – Iterate Column-wise (Label and Series)
  This method iterates over the DataFrame's columns, yielding pairs of column labels and their corresponding Series.
df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
print("Original DataFrame:\n", df)
print("\nIterated Output using df.items():")
for col_label, col_series in df.items():
    print(f"Column Label: {col_label}")
    print(f"Column Series:\n{col_series}\n")
- iterrows() – Iterate Row-wise as Series
  This method iterates over the DataFrame rows, yielding pairs of the row index and the row data as a Pandas Series.
  ⚠️ Note: iterrows() does not preserve data types across a row. Each row is returned as a Series with a single dtype, so an integer value may be upcast to float when other columns in the row hold floats.
print("\nIterated Output using df.iterrows():")
for row_index, row_series in df.iterrows():
    print(f"Row Index: {row_index}")
    print(f"Row Data (as Series):\n{row_series}\n")
- itertuples() – Iterate Row-wise as NamedTuples
  This method iterates over DataFrame rows, returning each row as a namedtuple. This is generally faster than iterrows() and preserves data types. The first element of the namedtuple is the index, followed by the column values.
print("\nIterated Output using df.itertuples():")
for row_tuple in df.itertuples():
    print(row_tuple)
    # Accessing values:
    # print(f"Index: {row_tuple.Index}, col1: {row_tuple.col1}, col2: {row_tuple.col2}")
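The dtype difference between iterrows() and itertuples() is easiest to see on a frame with mixed column types. The sketch below assumes a small hypothetical DataFrame, `df_mixed`, with one integer column and one float column; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: 'qty' is integer, 'price' is float
df_mixed = pd.DataFrame({'qty': [1, 2], 'price': [0.5, 1.5]})

# iterrows(): the row is packed into one Series, so it gets a single dtype
# and the integer from 'qty' is upcast to float
_, first_row = next(df_mixed.iterrows())
print(first_row.dtype)

# itertuples(): each field keeps the type of its own column
first_tuple = next(df_mixed.itertuples())
print(first_tuple.Index, first_tuple.qty, first_tuple.price)
```

If the integer type of a column matters downstream (for example, when a value is used as a label or an array position), itertuples() avoids the silent conversion.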
Warning: Do Not Modify During Iteration
Avoid modifying a DataFrame while iterating over it, especially with iterrows(). Depending on the data types, the row Series may be a copy rather than a view, so changes made to it inside the loop are not guaranteed to reach the original DataFrame.
print("\nAttempting to modify DataFrame during iterrows() (unreliable):")
for index, row in df.iterrows():
    row['col1'] = 100  # Writes to the row Series; not guaranteed to update the original df
print(df)  # Do not rely on the original DataFrame having been updated
For modifications, consider vectorized operations or methods like apply(), which are generally more efficient and more predictable than writing inside a loop.
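As a rough illustration of the vectorized alternative, the sketch below updates whole columns at once instead of looping. The frame `df_demo` and the new column name `col_sum` are assumptions made for this example, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

# Assign a constant to an entire column in one step
df_demo['col1'] = 100

# Conditional update: a boolean mask with .loc replaces only matching rows
df_demo.loc[df_demo['col2'] < 0, 'col2'] = 0.0

# Derived column computed from whole columns, with no explicit loop
df_demo['col_sum'] = df_demo['col2'] + df_demo['col3']

print(df_demo)
```

Each statement writes directly to the DataFrame, so there is no ambiguity about whether a copy or a view was modified.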
Concatenation in Pandas
Concatenation is the process of joining Pandas objects (Series or DataFrames) along a specified axis. The primary function for this is pd.concat().
Syntax
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, verify_integrity=False, sort=False, copy=True)
Parameters:
- objs: A sequence (such as a list) or dictionary of Pandas objects to concatenate.
- axis: The axis along which to concatenate.
  - 0 or 'index': Concatenate row-wise (stacking objects vertically).
  - 1 or 'columns': Concatenate column-wise (joining objects horizontally).
- join: How to handle indexes on the other axis.
  - 'outer' (default): Use the union of indexes.
  - 'inner': Use the intersection of indexes.
- ignore_index: If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, 1, ..., n-1.
- keys: Creates a hierarchical index (MultiIndex) on the concatenation axis, with the keys as the outermost level. Useful for tracking the origin of concatenated data.
- verify_integrity: If True, raise an error if the resulting concatenation axis contains duplicate labels. Defaults to False.
- sort: Sort the non-concatenation axis if it is not already aligned. Defaults to False.
- copy: If True, always copy data. If False, may share data where possible.
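The effect of verify_integrity is easiest to see with overlapping index labels. The following sketch uses two hypothetical frames, `left` and `right`, that share the label 1; the `raised` flag exists only to make the outcome visible.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'A': ['A2', 'A3']}, index=[1, 2])  # label 1 overlaps with left

# Default behavior: duplicate labels on the concatenation axis are allowed
stacked = pd.concat([left, right])
print(stacked.index.tolist())

# verify_integrity=True rejects the duplicate label with a ValueError
try:
    pd.concat([left, right], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

When duplicate labels are acceptable but unwanted in the result, ignore_index=True is the usual alternative to raising an error.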
Examples
Let's define two sample DataFrames for demonstration:
one = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
two = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])
three = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])
- Basic Concatenation (Row-wise, axis=0)
  Stacks one and two vertically, preserving their original indexes.
result_row_wise = pd.concat([one, two])
print("Concatenation Row-wise:\n", result_row_wise)
- Using keys for MultiIndex
  Creates a hierarchical index to distinguish the source of the data.
result_with_keys = pd.concat([one, two], keys=['first', 'second'])
print("\nConcatenation with Keys:\n", result_with_keys)
- Ignoring Indexes (ignore_index=True)
  Resets the index of the concatenated DataFrame.
result_ignore_index = pd.concat([one, two], ignore_index=True)
print("\nConcatenation ignoring index:\n", result_ignore_index)
- Concatenating Along Columns (axis=1)
  Joins one and three horizontally based on their indexes. If indexes don't align, join='outer' (default) will fill missing values with NaN.
result_col_wise = pd.concat([one, three], axis=1)
print("\nConcatenation Column-wise:\n", result_col_wise)
- Inner Join Column-wise
  Only includes rows where the index exists in both DataFrames.
# four reuses the column names 'A' and 'B', so the result has duplicate column labels
four = pd.DataFrame({'A': ['A4', 'A5'], 'B': ['B4', 'B5']}, index=[0, 4])
result_inner_join = pd.concat([one, four], axis=1, join='inner')
print("\nConcatenation Column-wise with Inner Join:\n", result_inner_join)
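In the column-wise example, one and three share the same index, so no NaN values actually appear. The sketch below shows the misaligned case, using a hypothetical frame named `shifted` whose index only partly overlaps with one.

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
shifted = pd.DataFrame({'C': ['C1', 'C2']}, index=[1, 2])  # hypothetical, overlaps only at label 1

# Outer join (default): union of row labels, gaps filled with NaN
outer = pd.concat([one, shifted], axis=1)
print(outer)

# Inner join: only the row label present in both frames survives
inner = pd.concat([one, shifted], axis=1, join='inner')
print(inner)
```

Counting missing values with `outer.isna().sum()` is a quick way to check how much of the result came from misaligned rows.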
Output Comparison Summary
Method | Index Behavior | Axis Behavior | Notes |
---|---|---|---|
pd.concat([df1, df2]) | Preserved | Combined Vertically | Indexes may repeat. |
pd.concat([df1, df2], keys=[...]) | Hierarchical (MultiIndex) | Combined Vertically | Source information is retained. |
pd.concat([df1, df2], ignore_index=True) | Reset (0, 1, ..., n-1) | Combined Vertically | Creates a fresh, sequential index. |
pd.concat([df1, df2], axis=1) | Preserved | Side-by-side Columns | Useful for aligning dataframes by index. |
pd.concat([df1, df3], axis=1, join='inner') | Intersection of Indexes | Side-by-side Columns | Only rows with matching indexes are included. |
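The note that keys retains source information can be checked by selecting on the outer level of the resulting MultiIndex. This is a small sketch that redefines the sample frames so it runs on its own; `combined` and `first_part` are illustrative names.

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
two = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([one, two], keys=['first', 'second'])

# The outer index level holds the keys, so each source can be recovered by label
first_part = combined.loc['first']
print(first_part)
```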