Linear and Non-Linear Curve Fitting in SciPy
This document provides an overview of curve fitting techniques using the SciPy library in Python. We will cover both linear and non-linear curve fitting methods, explaining their purpose, key concepts, and providing practical examples.
Linear Curve Fitting
Linear curve fitting aims to model the relationship between two variables using a linear equation. The most common form is:
$y = mx + b$
Where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $m$ is the slope of the line.
- $b$ is the y-intercept.
The primary goal of linear fitting is to find the values of $m$ and $b$ that best represent the relationship in the data. This is typically achieved by minimizing the sum of the squared differences between the actual observed $y$ values and the $y$ values predicted by the linear model. This method is known as the least squares method.
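To make the least squares criterion concrete, the following minimal sketch computes the sum of squared residuals for one candidate line; the data and the candidate values of m and b are arbitrary, chosen purely for illustration:
import numpy as np
# Sample data and a candidate line y = m*x + b
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.2, 6.1, 8.0, 10.3])
m, b = 2.0, 0.1  # arbitrary candidate parameters
# Sum of squared residuals: the quantity the least squares method minimizes
sse = np.sum((y - (m * x + b))**2)
print(f"Sum of squared errors for m={m}, b={b}: {sse:.3f}")
Fitting means searching for the (m, b) pair that makes this quantity as small as possible.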
Key Objectives of Linear Fitting
- Understanding Relationships: To quantify how changes in the independent variable ($x$) are associated with changes in the dependent variable ($y$).
- Prediction: To forecast the value of the dependent variable ($y$) for a given value of the independent variable ($x$).
- Error Minimization: To find the line that has the minimum overall error when compared to the data points, typically using the least squares criterion.
- Statistical Inference: To evaluate the reliability and significance of the fitted model using statistical measures such as the R-squared value and p-values.
Methods for Linear Fitting in SciPy
SciPy offers several functions for performing linear regression.
1. scipy.stats.linregress()
This is a straightforward function designed for simple linear regression (fitting a single independent variable). It returns several key statistics that are useful for understanding the quality of the fit.
- Returns:
  - slope: The slope of the regression line.
  - intercept: The y-intercept of the regression line.
  - r_value: The Pearson correlation coefficient.
  - p_value: The p-value for a hypothesis test whose null hypothesis is that the slope is zero (i.e., no linear relationship between the variables).
  - std_err: The standard error of the estimated slope.
Example:
from scipy.stats import linregress
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.2, 6.1, 8.0, 10.3])
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
print(f"R-squared: {r_value**2}") # R-squared is the square of the r_value
print(f"P-value: {p_value}")
print(f"Standard Error: {std_err}")
2. scipy.optimize.least_squares()
This is a more general-purpose optimization function that can be used for minimizing the sum of squares of a set of equations. It's versatile and can be applied to both linear and non-linear fitting problems. For linear fitting, you would define a model function representing $y = mx + b$.
Example (illustrative of general least squares; an adaptation to the linear case follows the example):
from scipy.optimize import least_squares
import numpy as np
# Define a model function (e.g., a sine wave, for demonstration)
def model(x, a, b):
    return a * np.sin(b * x)
# Generate some sample data with noise
x_data = np.linspace(0, 10, 100)
y_data = 3 * np.sin(x_data) + np.random.normal(scale=0.2, size=x_data.size)
# Define the residual function (difference between data and model)
def residuals(params, x, y):
    a, b = params
    return model(x, a, b) - y
# Initial guess for the parameters
initial_params = [1.0, 1.0]
# Perform the least squares optimization
result = least_squares(residuals, initial_params, args=(x_data, y_data))
# The optimized parameters are in result.x
optimized_a, optimized_b = result.x
print(f"Optimized 'a': {optimized_a}")
print(f"Optimized 'b': {optimized_b}")
3. scipy.optimize.minimize()
While not exclusively for curve fitting, minimize is a powerful function for general-purpose optimization. You can use it to minimize a function representing the sum of squared errors of your curve fit. This method is more manual, as you need to define the objective function (e.g., the sum of squared residuals) explicitly.
Example (Minimizing a simple quadratic function):
from scipy.optimize import minimize
import numpy as np
# Define the objective function to minimize
def objective_function(x):
    # Example: f(x) = x[0]^2 + x[1]^2
    return x[0]**2 + x[1]**2
# Initial guess for the parameters [x0, x1]
initial_guess = [1.0, 1.0]
# Perform the minimization
result = minimize(objective_function, initial_guess)
# The optimal solution is in result.x
print(f"Optimal solution (x0, x1): {result.x}")
print(f"Minimum function value: {result.fun}")
Non-Linear Curve Fitting with curve_fit
Non-linear curve fitting is used when the relationship between variables cannot be adequately described by a linear equation. In such cases, you fit the data to a predefined non-linear function by optimizing the function's parameters. SciPy's scipy.optimize.curve_fit is the primary tool for this.
scipy.optimize.curve_fit()
The curve_fit() function is part of SciPy's optimize module. It uses non-linear least squares to fit a function $f(x, \ldots)$ to data. It determines the optimal parameters for the model that minimize the sum of the squared differences between the observed data and the model's predictions.
How it works:
- You define a model function that takes the independent variable ($x$) and the parameters to be optimized as arguments.
- You provide the observed independent variable data (x_data) and the observed dependent variable data (y_data).
- curve_fit then returns the optimal parameters (popt) and the estimated covariance of popt (pcov).
Example:
Let's fit an exponential decay model with an offset: $y = a \cdot e^{-b \cdot x} + c$.
from scipy.optimize import curve_fit
import numpy as np
# Define the non-linear model function
def model_func(x, a, b, c):
    return a * np.exp(-b * x) + c
# Generate sample data
x_data = np.linspace(0, 4, 50)
# True parameters: a=2.5, b=1.3, c=0.5
y_data = model_func(x_data, 2.5, 1.3, 0.5) + 0.2 * np.random.normal(size=len(x_data))
# Initial guess for the parameters [a, b, c]
# Providing good initial estimates can significantly improve performance.
initial_params = [1.0, 1.0, 1.0]
# Perform the non-linear curve fitting
popt, pcov = curve_fit(model_func, x_data, y_data, p0=initial_params)
# popt contains the optimal parameters
# pcov contains the estimated covariance of popt
# The diagonal elements of pcov are the variances of the parameter estimates.
optimized_a, optimized_b, optimized_c = popt
print(f"Optimized parameters: a={optimized_a:.3f}, b={optimized_b:.3f}, c={optimized_c:.3f}")
# You can also get the standard errors from the covariance matrix
perr = np.sqrt(np.diag(pcov))
print(f"Standard errors: a={perr[0]:.3f}, b={perr[1]:.3f}, c={perr[2]:.3f}")
Importance of Initial Parameter Estimates (p0)
For non-linear fitting, the quality of the initial parameter estimates (provided via the p0 argument in curve_fit) is crucial for several reasons; a sketch of deriving data-driven starting values follows this list.
- Faster Convergence: Good initial guesses can help the optimization algorithm converge to the solution much more quickly.
- Avoiding Local Minima: Non-linear functions can have multiple "dips" (local minima). A poor initial guess might lead the algorithm to settle in a local minimum that is not the true best fit for the data. Providing a reasonable starting point increases the chance of finding the global minimum.
- Guiding the Search: Initial estimates guide the optimizer in the parameter space, helping it explore the most promising regions first.
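One practical approach is to derive starting values from the data itself. The heuristics below (offset from the tail of the decay, amplitude from the initial drop, a rough rate from the data span) are assumptions suited to this particular exponential model, not a general recipe; the sketch reuses curve_fit, model_func, x_data, and y_data from the earlier example:
# Data-driven heuristics for y = a * exp(-b*x) + c (rough assumptions)
c0 = y_data[-1]                      # tail of the decay approximates the offset c
a0 = y_data[0] - c0                  # initial drop approximates the amplitude a
b0 = 1.0 / (x_data[-1] - x_data[0])  # rough rate: one e-folding over the data span
popt, pcov = curve_fit(model_func, x_data, y_data, p0=[a0, b0, c0])
print(f"Refit parameters: {popt}")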
Conclusion
SciPy provides a robust and flexible suite of tools for both linear and non-linear curve fitting.
- For simple linear relationships, scipy.stats.linregress is efficient and provides comprehensive statistical outputs.
- For more general optimization tasks or when fitting non-linear models, scipy.optimize.least_squares and scipy.optimize.curve_fit are powerful tools. curve_fit is specifically designed for fitting functions to data by optimizing parameters, while least_squares offers broader applicability for minimizing sums of squares.
- When using non-linear fitting methods like curve_fit, investing time in obtaining good initial parameter estimates (p0) is highly recommended to ensure accurate and efficient results.
By leveraging these functions, users can effectively model various datasets, understand underlying relationships, and make informed predictions.