Do you want to compare multiple imputation algorithms using your own dataset?
In this notebook we repeat the experiments with user-supplied data.
Given a dataset, artificially introduce gaps to represent missing data.
Using a specific imputation method, impute those gaps.
Measure the error between the original dataset and the imputed dataset.
Repeat for a range of different gap sizes and combinations.
Compare the different imputation methods.
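As a concrete illustration, the gap-introduction step might look like the sketch below. This is an assumption about the mechanism, not the notebook's actual code (which lives in experiments.py); `introduce_gaps` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def introduce_gaps(df, num_sites, gap_length, seed=0):
    """Set a contiguous window of `gap_length` rows to NaN at
    `num_sites` randomly chosen columns (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    gapped = df.copy()
    # Pick distinct sites (columns) to receive an artificial gap
    sites = rng.choice(df.columns, size=num_sites, replace=False)
    for site in sites:
        # Choose a random start so the gap fits inside the series
        start = rng.integers(0, len(df) - gap_length)
        gapped.iloc[start:start + gap_length, gapped.columns.get_loc(site)] = np.nan
    return gapped
```

Because the original, complete values at the gap positions are known, the imputed values can later be compared against them.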
Download and prepare the dataset¶
To evaluate imputation algorithms, we start with a relatively complete dataset and then artificially introduce gaps.
By comparing the performance of the different imputation algorithms, you can make an informed choice about which algorithm works best for your data.
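The error between the original and the imputed dataset is typically measured only at the artificially gapped positions. A minimal sketch of such a metric, assuming `run_experiment` in experiments.py reports MAE and RMSE in this spirit (the exact implementation there may differ):

```python
import numpy as np

def gap_errors(original, imputed, gap_mask):
    """MAE and RMSE evaluated only where artificial gaps were placed."""
    diff = (imputed - original)[gap_mask]
    mae = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return mae, rmse
```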
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import panel as pn
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
pn.extension()
from rich.progress import track
The routines used in this notebook are in the file experiments.py.
from experiments import *
This analysis can be repeated with any dataset of the given format.
dataset_csv = 'dataset_cmar.csv'
Explore the data¶
We start by getting an overview of what data is available in our dataset.
df = pd.read_csv(dataset_csv, parse_dates=True, index_col=0)
image_data = df.astype('float32').T.values
x_labels = df.index.strftime('%Y-%m-%d') # dates → x-axis
y_labels = list(df.columns) # station-depths → y-axis
x_coords = np.arange(len(x_labels))
y_coords = np.arange(len(y_labels))
heatmap = hv.Image((x_coords, y_coords, image_data)).opts(
    xaxis='bottom',
    xlabel='Date',
    ylabel='Station @ Depth',
    xticks=list(zip(x_coords[::30], x_labels[::30])),  # every 30th date
    yticks=list(zip(y_coords, y_labels)),
    xrotation=45,
    cmap='Viridis',
    colorbar=True,
    width=1000,
    height=800,
    tools=['hover']
)
heatmap
Visualize the series data¶
# Create a dropdown selector
site_selector = pn.widgets.Select(name='Site', options=list(df.columns))
def highlight_nan_regions(label):
    series = df[label]
    # Identify contiguous NaN regions
    is_nan = series.isna()
    nan_ranges = []
    current_start = None
    for date, missing in is_nan.items():
        if missing and current_start is None:
            current_start = date
        elif not missing and current_start is not None:
            nan_ranges.append((current_start, date))
            current_start = None
    if current_start is not None:
        nan_ranges.append((current_start, series.index[-1]))
    # Create shaded regions for each NaN range
    spans = [
        hv.VSpan(start, end).opts(color='red', alpha=0.2)
        for start, end in nan_ranges
    ]
    curve = hv.Curve(series, label=label).opts(
        width=900, height=250, tools=['hover', 'box_zoom', 'pan', 'wheel_zoom'],
        show_grid=True, title=label
    )
    return curve * hv.Overlay(spans)
interactive_plot = hv.DynamicMap(pn.bind(highlight_nan_regions, site_selector))
pn.Column(site_selector, interactive_plot, 'Highlighted regions are gaps that need to be imputed.')
The approach will be to artificially introduce gaps into this dataset and then attempt to impute those gaps.
Configure the options below to select which imputers to compare.
We introduce gaps at a fixed number of randomly selected sites, for a given gap length. You can adjust these values depending on your own dataset.
imputers = [LinearImputer(),
            KNNImputer(),
            MissForestImputer(),
            MICEImputer()
            ]
NumSites = [2, 4, 6]
GapLength = [7, 14, 21, 35, 56, 90]
results_list = []
filtered_results = []
for imputer in track(imputers, description="Imputation Method"):
    for num_sites in track(NumSites, description="NumSites", transient=True):
        for gap_length in track(GapLength, description="GapLength", transient=True):
            result = run_experiment(imputer, dataset=dataset_csv, minimum_missing_data=90, num_sites=num_sites, gap_length=gap_length)
            results_list.append(result)
            filtered_result = {k: v for k, v in result.items() if isinstance(v, (int, float, str))}
            filtered_results.append(filtered_result)
filtered_results = pd.DataFrame(filtered_results)
filtered_results.to_csv('results.csv', index=False)
What is the code above doing?¶
It loops through all combinations of imputation algorithms and artificial gap sizes. The results are stored in the subdirectory results/. If a particular result (combination of dataset CSV file name, imputation algorithm, number of missing sites, and gap length) has already been generated, the code does not calculate it again.
In the code above, minimum_missing_data=90 means that only records in the dataset with at least 90% complete data will be used. It can also be set to `minimum_missing_data=0` or another value to indicate the minimum percentage of complete data required at any site before artificial gaps are introduced.
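The filtering described above amounts to keeping only sufficiently complete columns. A minimal sketch, assuming the threshold is applied per site as a completeness percentage (`filter_by_completeness` is a hypothetical helper; the real check happens inside run_experiment):

```python
import numpy as np
import pandas as pd

def filter_by_completeness(df, minimum_missing_data=90):
    """Keep only columns whose percentage of non-missing values is at
    least `minimum_missing_data` (illustrative sketch)."""
    completeness = df.notna().mean() * 100  # percent complete per column
    return df.loc[:, completeness >= minimum_missing_data]
```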
Recommendation: Start with a relatively small number of NumSites and GapLengths and extend them after you have examined the initial results.
Depending on the size of the dataset, this process can take tens of minutes. We apply the imputation algorithms many dozens of times to explore how the fraction of missing data increases the error in the imputed dataset.
Interpret the results¶
import holoviews as hv
hv.extension('bokeh')
df = pd.read_csv('results.csv')
plots = []
for metric in ['MAE', 'RMSE']:
    scatter = hv.NdOverlay({
        imputer: hv.Scatter(df[df['imputer_name'] == imputer], 'missing_fraction', metric, label=imputer).opts(size=8)
        for imputer in df['imputer_name'].unique()
    })
    scatter.opts(
        title=f'{metric} vs Missing Fraction by Imputation Strategy',
        xlabel='Missing Fraction (%)',
        ylabel=metric,
        width=800,
        height=400,
        legend_position='right'
    )
    plots.append(scatter)
(plots[0] + plots[1]).cols(1)