Pandas Iteration & Concatenation for ML Data

Master Pandas iteration and concatenation for efficient ML data manipulation. Learn to iterate Series & DataFrames and combine them effectively.

Iteration and Concatenation in Pandas

This document outlines common methods for iterating over Pandas objects (Series and DataFrames) and the process of concatenating them.

Iterating Over Pandas Objects

Pandas offers several ways to iterate through its data structures. The behavior of iteration differs between Series and DataFrames.

Series Iteration

When you iterate directly over a Pandas Series, you access its values, similar to iterating over a NumPy array.

import pandas as pd
import numpy as np

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print("Iterating over a Series:")
for value in s:
    print(value)
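
When the index labels are needed alongside the values, Series.items() yields (label, value) pairs. A minimal sketch, reusing the sample values from above (the name pairs is only illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Series.items() yields (index label, value) pairs
pairs = [(label, value) for label, value in s.items()]
print(pairs)
```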

DataFrame Iteration

Iterating directly over a DataFrame yields its column labels (column names), much like iterating over the keys of a Python dictionary.

df_sample = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

print("\nIterating over DataFrame columns:")
for col_name in df_sample:
    print(col_name)
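
Because iteration yields labels, each label can be used to look up its column, just as a dictionary key looks up its value. A small sketch of that pattern (column_sums is a hypothetical name used only for this illustration):

```python
import pandas as pd

df_sample = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Iterating the DataFrame yields column labels;
# indexing with a label returns that column as a Series
column_sums = {col_name: int(df_sample[col_name].sum()) for col_name in df_sample}
print(column_sums)
```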

Iterating Through DataFrame Columns and Rows

Pandas provides dedicated methods for iterating over a DataFrame: items() walks the columns, while iterrows() and itertuples() walk the rows.

  1. items() – Iterate Column-wise (Label and Series)

    This method iterates over the DataFrame's columns, yielding pairs of column labels and their corresponding Series.

    df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
    
    print("Original DataFrame:\n", df)
    print("\nIterated Output using df.items():")
    for col_label, col_series in df.items():
        print(f"Column Label: {col_label}")
        print(f"Column Series:\n{col_series}\n")

  2. iterrows() – Iterate Row-wise as Series

    This method iterates over the DataFrame rows, yielding pairs of the row index and the row data as a Pandas Series.

    ⚠️ Note: iterrows() does not preserve dtypes across a row. Each row is returned as a single Series, which has one dtype, so an integer value is upcast to float when the row also contains float columns.

    print("\nIterated Output using df.iterrows():")
    for row_index, row_series in df.iterrows():
        print(f"Row Index: {row_index}")
        print(f"Row Data (as Series):\n{row_series}\n")

  3. itertuples() – Iterate Row-wise as NamedTuples

    This method iterates over DataFrame rows, returning each row as a namedtuple. This is generally faster than iterrows() and preserves data types. The first element of the namedtuple is the index, followed by the column values.

    print("\nIterated Output using df.itertuples():")
    for row_tuple in df.itertuples():
        print(row_tuple)
        # Accessing values:
        # print(f"Index: {row_tuple.Index}, col1: {row_tuple.col1}, col2: {row_tuple.col2}")
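
The dtype difference between iterrows() and itertuples() is easiest to see on a frame that mixes integer and float columns. The sketch below assumes a hypothetical two-column frame (df_mixed, with columns qty and price) that is not part of the examples above:

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one float column
df_mixed = pd.DataFrame({'qty': [1, 2], 'price': [0.5, 1.5]})

# iterrows(): the row is a single Series with one common dtype,
# so the integer in 'qty' is upcast to float
_, first_row = next(df_mixed.iterrows())
print(first_row.dtype, type(first_row['qty']))

# itertuples(): each field keeps the type of its own column
first_tuple = next(df_mixed.itertuples())
print(type(first_tuple.qty), type(first_tuple.price))
```

If a loop over rows is unavoidable, itertuples() is usually the safer default for this reason.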

Warning: Do Not Modify During Iteration

Avoid modifying a DataFrame while iterating over it, especially with iterrows(). Depending on the column dtypes, each row Series may be a copy rather than a view, so a write inside the loop may silently never reach the original DataFrame. Whether it does can vary with dtypes and pandas version, so do not rely on either outcome.

print("\nAttempting to modify DataFrame during iterrows() (unreliable):")
for index, row in df.iterrows():
    row['col1'] = 100  # row may be a copy, so this write is not guaranteed to reach df

print(df)  # Do not rely on df having changed (or not); it depends on dtypes and pandas version

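To modify data, prefer vectorized operations such as whole-column assignment or boolean-mask updates with .loc; they are faster and more predictable than row loops. apply() is a reasonable fallback when no vectorized form exists.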
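
As a rough illustration of the vectorized alternative, the sketch below rebuilds a frame shaped like df and updates it without any row loop. The clipping rule on col2 is an arbitrary choice made for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

# Assign a whole column at once instead of writing to rows in a loop
df['col1'] = 100.0

# Conditional update with a boolean mask: set negative col2 values to zero
df.loc[df['col2'] < 0, 'col2'] = 0.0

print(df)
```

Both assignments write directly to df, so the result does not depend on whether an intermediate row object was a copy.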


Concatenation in Pandas

Concatenation is the process of joining Pandas objects (Series or DataFrames) along a specified axis. The primary function for this is pd.concat().

Syntax

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, verify_integrity=False, sort=False, copy=True)

Parameters:

  • objs: A list or dictionary of Pandas objects to concatenate.
  • axis: The axis along which to concatenate.
    • 0 or 'index': Concatenate row-wise (stacking objects vertically).
    • 1 or 'columns': Concatenate column-wise (joining objects horizontally).
  • join: How to handle indexes on the other axis.
    • 'outer' (default): Use the union of indexes.
    • 'inner': Use the intersection of indexes.
  • ignore_index: If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, 1, ..., n-1.
  • keys: Creates a hierarchical index (MultiIndex) on the concatenation axis. Useful for tracking the origin of concatenated data.
  • verify_integrity: If True, raise a ValueError if the resulting concatenation axis contains duplicate labels. Defaults to False.
  • sort: Sort non-concatenation axis if it is not already aligned. Defaults to False.
  • copy: If False, avoid copying data where possible. The default for this parameter has changed across pandas versions, so check the documentation for the version you use.
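
The effect of join and verify_integrity can be sketched with two small frames whose columns and index labels only partly overlap. The frames left and right are hypothetical and exist only for this illustration:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]}, index=[1, 2])  # label 1 overlaps

# join='inner' keeps only the columns shared by both frames when stacking rows
shared = pd.concat([left, right], join='inner')
print(list(shared.columns))

# join='outer' (the default) keeps every column and fills the gaps with NaN
combined = pd.concat([left, right])
print(list(combined.columns))

# verify_integrity=True rejects the duplicated index label 1
try:
    pd.concat([left, right], verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```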

Examples

Let's define two sample DataFrames for demonstration:

one = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
two = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])
three = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

  1. Basic Concatenation (Row-wise, axis=0)

    Stacks one and two vertically, preserving their original indexes.

    result_row_wise = pd.concat([one, two])
    print("Concatenation Row-wise:\n", result_row_wise)

  2. Using keys for MultiIndex

    Creates a hierarchical index to distinguish the source of the data.

    result_with_keys = pd.concat([one, two], keys=['first', 'second'])
    print("\nConcatenation with Keys:\n", result_with_keys)

  3. Ignoring Indexes (ignore_index=True)

    Resets the index of the concatenated DataFrame.

    result_ignore_index = pd.concat([one, two], ignore_index=True)
    print("\nConcatenation ignoring index:\n", result_ignore_index)

  4. Concatenating Along Columns (axis=1)

    Joins one and three horizontally based on their indexes. If indexes don't align, join='outer' (default) will fill missing values with NaN.

    result_col_wise = pd.concat([one, three], axis=1)
    print("\nConcatenation Column-wise:\n", result_col_wise)

  5. Inner Join Column-wise

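    Only includes rows whose index label exists in both DataFrames. Here only label 0 is shared, and because both frames have columns A and B, the result contains duplicate column labels.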

    four = pd.DataFrame({'A': ['A4', 'A5'], 'B': ['B4', 'B5']}, index=[0, 4])
    result_inner_join = pd.concat([one, four], axis=1, join='inner')
    print("\nConcatenation Column-wise with Inner Join:\n", result_inner_join)
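
Once keys have been attached, the outer index level can be used to pull back the rows that came from one source. The sketch below passes a dictionary as objs, in which case the dictionary keys serve as the keys; the names stacked, from_second, and cell are illustrative only:

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
two = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# Passing a dict uses its keys as the outer index level, like keys=[...]
stacked = pd.concat({'first': one, 'second': two})

# Select every row that came from the 'second' source
from_second = stacked.loc['second']
print(from_second)

# Select a single cell with (source key, original row label)
cell = stacked.loc[('first', 1), 'A']
print(cell)
```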

Output Comparison Summary

| Method | Index Behavior | Axis Behavior | Notes |
| --- | --- | --- | --- |
| pd.concat([df1, df2]) | Preserved | Combined Vertically | Indexes may repeat. |
| pd.concat([df1, df2], keys=[...]) | Hierarchical (MultiIndex) | Combined Vertically | Source information is retained. |
| pd.concat([df1, df2], ignore_index=True) | Reset (0, 1, ..., n-1) | Combined Vertically | Creates a fresh, sequential index. |
| pd.concat([df1, df2], axis=1) | Preserved | Side-by-side Columns | Useful for aligning dataframes by index. |
| pd.concat([df1, df3], axis=1, join='inner') | Intersection of Indexes | Side-by-side Columns | Only rows with matching indexes are included. |