Fitting¶
Often you will have to compare a theoretical formula with experimental data. This is usually done by fitting the formula to the data: the parameters are adjusted so that the deviation between formula and data becomes as small as possible.
Python provides many different libraries and approaches for fitting a function to data. In this weekly coding hour we will use curve_fit from scipy's optimize package, which uses "non-linear least squares to fit a function, f, to data" (see its documentation).
0. Prerequisites¶
In the previous coding hour we discussed how to install Python and its package manager pip. Additionally, scipy has to be installed, which can be done via
pip install scipy
In the next cell we import the needed libraries and use a magic command to make matplotlib interactive in Jupyter notebooks.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
%matplotlib widget
1. Creating Test Data¶
First we will write a function that allows us to easily generate linear data with a given noise level.
def create_linear_test_data(noise_strength=0):
    np.random.seed(12)
    a, b = preal = np.random.rand(2)
    exp_xs = np.linspace(0, 10, 100)
    exp_ys = a * exp_xs + b
    noise = np.random.normal(0, noise_strength, exp_ys.shape)
    exp_ys += noise
    return exp_xs, exp_ys, preal
exp_xs, exp_ys, preal = create_linear_test_data(0.2)
fig, ax = plt.subplots()
ax.plot(exp_xs, exp_ys)
First, the strength of the noise is passed as an argument and numpy's random module is given a fixed seed to make the results reproducible.
Then the parameters a and b of the data are drawn at random and the noisy data to be fitted is generated.
2. Fitting the Test Data¶
Now we will use the curve_fit function to fit the linear dependence of the data.
linear_function = lambda x, a, b: a * x + b
popt, pcov = curve_fit(linear_function, exp_xs, exp_ys)
perr = np.sqrt(np.diag(pcov))
print(f"The parameters obtained by the fit are\na: {popt[0]}\nb: {popt[1]}\n")
print(f"The real parameter values are\na: {preal[0]}\nb: {preal[1]}\n")
print(f"One standard deviation errors of the parameters are\na: {perr[0]}\nb: {perr[1]}\n")
First we have to define the function we would like to fit. In our case the simple linear function can be written in a single line using Python's anonymous lambda functions. Such a function consists of a single expression, and the line used here is shorthand for
def linear_function(x, a, b):
    return a * x + b
Then we call the curve_fit function with our fit function and the data as arguments. The return values are the optimal parameters and their estimated covariance matrix. The latter is then converted to the one-standard-deviation errors of the parameters.
The results show that the fitted parameters match the real values within their errors.
3. Seeing the influence of noise¶
To examine the influence of noise on the fit quality, we repeat the previous steps for different noise levels and plot the sum of the squared relative parameter deviations versus the noise level.
noise_levels = np.linspace(0, 1, 100)
param_deviations = np.zeros(noise_levels.shape)
linear_function = lambda x, a, b: a * x + b
for i, noise_level in enumerate(noise_levels):
    exp_xs, exp_ys, preal = create_linear_test_data(noise_level)
    popt, pcov = curve_fit(linear_function, exp_xs, exp_ys)
    param_deviation = np.sum(np.power((preal - popt) / preal, 2))
    param_deviations[i] = param_deviation
fig, ax = plt.subplots()
ax.plot(noise_levels, param_deviations)
ax.set_xlabel("Noise Level")
ax.set_ylabel("Parameter Deviation")
3b. Fun little excursion¶
We can try to find the degree of the polynomial that best fits the parameter deviation vs. noise level curve. To do this, we will use a second fitting function, namely numpy's polyfit function.
degrees = np.arange(0, 10)
rmss = np.zeros(degrees.shape)
for degree in degrees:
    pcoeff = np.polyfit(noise_levels, param_deviations, degree)
    ys_fit = np.polyval(pcoeff, noise_levels)
    rms = np.sqrt(np.mean(np.power(param_deviations - ys_fit, 2)))
    rmss[degree] = rms
fig, ax = plt.subplots()
ax.plot(degrees, rmss)
ax.set_xlabel("Polynomial Degree")
ax.set_ylabel("RMS")
We see that for degrees higher than 2 the gains are only marginal. This indicates that the parameter deviation vs. noise level curve is well described by a quadratic function.
4. Real-Life Example - Power-Law Data¶
Now we will have a look at a real-life example. The data shows an RMS value as a function of time. First, we will explore the data.
real_data = np.array(
[[0.1, 466.72018],
[0.4, 210.4],
[30.0, 13.06752],
[60.0, 10.19072],
[300.0, 4.4562],
[1863.0, 1.82434],
[4269.0, 1.21072],
[7552.0, 0.8563],
[10838.0, 0.7999],
[12693.0, 0.69923]])
fig, ax = plt.subplots()
ax.plot(*real_data.T)
ax.set_xlabel("Time t")
ax.set_ylabel("RMS")
The data seems to follow a power law ($y = b \cdot t^{-a}$). We will try to fit this function to the data.
On a Python side note: here we use numpy's transpose shorthand (real_data.T) together with Python's unpacking operator (*). The last line is therefore equivalent to
ax.plot(real_data[:, 0], real_data[:, 1])
exp_function = lambda t, a, b: b * (t ** (-a))
popt, pcov = curve_fit(exp_function, *real_data.T)
perr = np.sqrt(np.diag(pcov))
ts, ys = real_data.T
xs_fit = np.linspace(ts.min(), ts.max(), 1000)
ys_fit = exp_function(xs_fit, *popt)
fig, ax = plt.subplots()
ax.plot(*real_data.T, label="Experiment")
ax.plot(xs_fit, ys_fit, label="Fit")
ax.set_xlabel("Time t")
ax.set_ylabel("RMS")
ax.legend()
print(f"The parameters obtained by the fit are\na: {popt[0]}\nb: {popt[1]}\n")
print(f"One standard deviation errors of the parameters are\na: {perr[0]}\nb: {perr[1]}\n")
Additionally, we can fit the data in a double-logarithmic representation. Plotted this way, the function $$ y = b \cdot t ^ {-a} $$ becomes $$ \log(y) = \log(b) - a \cdot \log(t) $$
This allows us to once again use a simple linear fit to find the value of a.
double_logarithmic_data = np.log(real_data)
log_ts = double_logarithmic_data[:, 0]
popt, pcov = curve_fit(linear_function, *double_logarithmic_data.T)
perr = np.sqrt(np.diag(pcov))
fit_xs = np.linspace(log_ts.min(), log_ts.max(), 100)
fit_ys = linear_function(fit_xs, *popt)
fig, ax = plt.subplots()
ax.plot(*double_logarithmic_data.T, label="Experiment")
ax.plot(fit_xs, fit_ys, label="Fit")
ax.legend()
ax.set_xlabel("Log(t)")
ax.set_ylabel("Log(RMS)")
print(f"The parameters obtained by the fit are\na: {popt[0]}\nb: {popt[1]}\n")
print(f"One standard deviation errors of the parameters are\na: {perr[0]}\nb: {perr[1]}\n")
Another way is to fit the power-law function directly and just display the data in a double-logarithmic fashion:
exp_function = lambda t, a, b: b * (t ** (-a))
popt, pcov = curve_fit(exp_function, *real_data.T)
perr = np.sqrt(np.diag(pcov))
ts, ys = real_data.T
xs_fit = np.linspace(ts.min(), ts.max(), 1000)
ys_fit = exp_function(xs_fit, *popt)
fig, ax = plt.subplots()
ax.loglog(*real_data.T, label="Experiment")
ax.loglog(xs_fit, ys_fit, label="Fit")
ax.set_xlabel("Time t")
ax.set_ylabel("RMS")
ax.legend()
As you can see, the two plots look almost identical (apart from the different axis labels and ticks). However, the fitted parameters, and thereby the fit curve shown in the plots, change slightly depending on the method used. So why do the results differ between
- fitting the power-law function directly and
- fitting a linear function to the double-logarithmic data?
The difference comes from the least-squares criterion itself. Minimizing the squared deviations in y is not the same as minimizing the squared deviations in log(y): in linear space the few large values at small t dominate the fit, while in log space every point contributes on a relative scale (an easy argument is error propagation through the logarithm, where symmetric uncertainties in y become asymmetric in log(y)).
Therefore, where possible, one should use the first method and fit the actual data.
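To make this concrete, one can compare the summed squared deviations of both parameter sets in both spaces. The following sketch is not part of the original analysis and assumes the hypothetical names popt_pow (parameters a, b from the direct power-law fit) and popt_log (slope and intercept from the linear fit to the double-logarithmic data), since the cells above overwrite popt; each parameter set should come out best in the space it was fitted in.
# Sketch: compare both fits in both spaces.
# popt_pow = (a, b) from the direct power-law fit (hypothetical name),
# popt_log = (slope, intercept) from the fit to the log-log data (hypothetical name).
ts, ys = real_data.T
sq_dev_lin = lambda a, b: np.sum((ys - b * ts ** (-a)) ** 2)
sq_dev_log = lambda a, b: np.sum((np.log(ys) - (np.log(b) - a * np.log(ts))) ** 2)
a_pow, b_pow = popt_pow
a_log, b_log = -popt_log[0], np.exp(popt_log[1])
print(f"Linear space: power-law fit {sq_dev_lin(a_pow, b_pow):.4g}, log-log fit {sq_dev_lin(a_log, b_log):.4g}")
print(f"Log space:    power-law fit {sq_dev_log(a_pow, b_pow):.4g}, log-log fit {sq_dev_log(a_log, b_log):.4g}")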
Tips and Tricks¶
Bounds and Initial Guesses¶
More complicated fits can fail due to bad initial guesses. By default, all parameters are set to 1, and for certain fit functions this means the fit never converges to the best values.
To improve this we can specify bounds and initial guesses via the bounds and p0 parameters, respectively. The bounds may even be infinite (np.inf), which allows us, for example, to restrict a parameter to positive values only.
In my experience automated fitting of lines benefitted immensely from specifying sensible bounds and initial guesses.
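As a minimal sketch, the power-law fit from above could be given an initial guess and positive bounds like this (the concrete numbers are made up for illustration and not part of the original analysis):
# Sketch: initial guess p0 and bounds restricting both parameters to positive values.
# The concrete numbers are arbitrary illustrations.
p0 = [0.5, 100]
bounds = ([0, 0], [np.inf, np.inf])
popt, pcov = curve_fit(exp_function, *real_data.T, p0=p0, bounds=bounds)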
Infinite Values¶
By default (check_finite=True), curve_fit checks that the input arrays do not contain any NaN (np.nan) or infinite (np.inf) values and raises a ValueError if they do. If you disable this check with check_finite=False, the fit may silently produce nonsensical results, so make sure to clean your data beforehand.
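If your data may contain such values, a minimal cleaning step could look like this (a sketch, reusing exp_xs and exp_ys from above):
# Keep only the points where both x and y are finite (drops np.nan and np.inf).
mask = np.isfinite(exp_xs) & np.isfinite(exp_ys)
popt, pcov = curve_fit(linear_function, exp_xs[mask], exp_ys[mask])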
Float64 Values Required¶
As stated in the documentation, "Users should ensure that inputs xdata, ydata, and the output of f are float64, or else the optimization may return incorrect results."
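If your data arrives as integers or 32-bit floats, the conversion is a one-liner (a sketch using real_data from above):
# Make sure the data is float64 before fitting.
ts, ys = real_data.astype(np.float64).T
popt, pcov = curve_fit(exp_function, ts, ys)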