Pandas: Read & Write Excel Data for ML & AI

Master reading and writing Excel files with Pandas for your ML/AI projects. Learn single/multiple sheet operations, data writing, appending & in-memory handling efficiently.

Reading and Writing Data to Excel with Pandas

Pandas is a powerful Python library for data manipulation and analysis. It offers robust support for reading from and writing to Excel files, a common format for tabular data. This guide provides comprehensive instructions on how to efficiently work with Excel files using Pandas, covering single/multiple sheet operations, writing data, appending, and in-memory handling.

Supported Excel File Formats

Pandas utilizes different backends to support various Excel file formats:

  • .xlsx (Excel 2007+): Supported by the openpyxl module.
  • .xls (Excel 2003): Supported by the xlrd module.
  • .xlsb (Binary Excel): Supported by the pyxlsb module.
  • All formats: Can also be read using the calamine engine.

Note: You can install the required packages using pip:

pip install openpyxl
pip install xlrd
pip install pyxlsb
# For calamine engine, it's usually included with pandas >= 1.3.0
# or can be installed separately if needed.

Reading Excel Files

The primary function for reading Excel files is pandas.read_excel().

1. Reading a Basic Excel File

To load an entire Excel file into a Pandas DataFrame:

import pandas as pd

df = pd.read_excel('data.xlsx')
print(df)

2. Reading a Specific Sheet

Use the sheet_name parameter to specify which sheet to read. You can provide the sheet name as a string or its index (0-based).

# Reading by sheet name
df_sheet2 = pd.read_excel('data.xlsx', sheet_name="Sheet_2")
print(df_sheet2)

# Reading by sheet index
df_sheet1 = pd.read_excel('data.xlsx', sheet_name=0) # Reads the first sheet
print(df_sheet1)

3. Reading Multiple Sheets

To read multiple sheets into a dictionary of DataFrames, pass a list of sheet names or indices to sheet_name.

# Reading sheets by index
df_dict_sheets = pd.read_excel('data.xlsx', sheet_name=[0, 1])
print(df_dict_sheets) # This will print a dictionary where keys are sheet indices and values are DataFrames

# Reading sheets by name
df_dict_sheets_named = pd.read_excel('data.xlsx', sheet_name=['Sheet_1', 'Sheet_2'])
print(df_dict_sheets_named) # This will print a dictionary where keys are sheet names and values are DataFrames

4. Reading Data with MultiIndex

Pandas can handle hierarchical row or column indices (MultiIndex) when reading Excel files. Use the index_col and header parameters for this.

First, let's create an Excel file with a MultiIndex:

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')], names=['Level1', 'Level2'])
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df_multi = pd.DataFrame(data, index=index, columns=['X', 'Y'])
df_multi.to_excel("multiindex_data.xlsx")

Now, read this file specifying the index columns:

# Reading the MultiIndex file
df_read_multi = pd.read_excel("multiindex_data.xlsx", index_col=[0, 1])
print(df_read_multi)

Writing Excel Files

The DataFrame.to_excel() method is used to export data to Excel.

1. Writing a DataFrame to Excel

This is the most basic way to save your DataFrame to an Excel file.

import pandas as pd

df = pd.DataFrame([[5, 2], [4, 1]], index=["One", "Two"], columns=["Rank", "Subjects"])
df.to_excel('Basic_example_output.xlsx')

By default, the DataFrame index will be written to the first column. To exclude the index, set index=False.

2. Writing Multiple DataFrames to One File (Different Sheets)

To write multiple DataFrames to different sheets within the same Excel file, use pd.ExcelWriter.

import pandas as pd

df1 = pd.DataFrame([[5, 2], [4, 1]], index=["One", "Two"], columns=["Rank", "Subjects"])
df2 = pd.DataFrame([[15, 21], [41, 11]], index=["One", "Two"], columns=["Rank", "Subjects"])

with pd.ExcelWriter('output_multiple_sheets.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet_name_1')
    df2.to_excel(writer, sheet_name='Sheet_name_2')

3. Appending Data to an Existing Excel File

To add new sheets to an existing Excel file without overwriting its contents, use pd.ExcelWriter with mode='a'.

Important: When appending, the engine parameter must be specified, and the file extension should be .xlsx or .xlsm. You might need to install openpyxl for this functionality.

import pandas as pd

df3 = pd.DataFrame([[51, 11], [21, 38]], index=["One", "Two"], columns=["Rank", "Subjects"])

# Ensure 'output_multiple_sheets.xlsx' already exists from the previous example
# If it doesn't, create it first.
# For demonstration, let's assume it exists.

# For appending, you usually need to specify the engine
with pd.ExcelWriter('output_multiple_sheets.xlsx', mode='a', engine='openpyxl') as writer:
    df3.to_excel(writer, sheet_name='Sheet_name_3', index=False)

Note on appending: If you are appending to a .xls file, the process might differ or not be supported by all engines. It's generally recommended to use .xlsx for appending operations.

Working with Excel Files in Memory

For scenarios where you need to process Excel data without saving it to disk (e.g., in web applications or APIs), you can use io.BytesIO.

Writing Excel to Memory (BytesIO)

This allows you to create an Excel file in memory, which can then be sent as a response or processed further.

from io import BytesIO
import pandas as pd

df = pd.DataFrame([[5, 2], [4, 1]], index=["One", "Two"], columns=["Rank", "Subjects"])

# Create a BytesIO object
bio = BytesIO()

# Write DataFrame to the BytesIO object
df.to_excel(bio, sheet_name='Sheet1', index=False)

# Move the cursor to the beginning of the BytesIO object
bio.seek(0)

# Now 'bio' contains the Excel file data in bytes
# You can read it:
excel_data = bio.read()

# Example: You could then send 'excel_data' as an HTTP response
# For example, in a Flask app:
# from flask import Response
# return Response(excel_data, mimetype='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

Reading from memory is similar, by passing a BytesIO object containing the Excel data to pd.read_excel().

Specifying the Excel Writer Engine

Pandas allows you to explicitly choose the engine used for writing. This is important if you need specific features or compatibility.

# Using xlsxwriter engine
df.to_excel('output_xlsxwriter.xlsx', sheet_name='Sheet1', engine='xlsxwriter')

# Using openpyxl engine
df.to_excel('output_openpyxl.xlsx', sheet_name='Sheet1', engine='openpyxl')

Ensure the selected engine (openpyxl, xlsxwriter, pyxlsb) is installed in your Python environment for the chosen engine to work.

Conclusion

Pandas simplifies the process of reading from and writing to Excel files with significant flexibility. Whether you're handling single-sheet data or complex multi-sheet files, Pandas provides intuitive and powerful tools to streamline your Excel data workflows. For optimal performance, ensure you have the necessary engines installed and consider using memory-based operations for dynamic applications.