
10. Compare Imputation Algorithms for your Own Data

Do you want to compare multiple imputation algorithms using your own dataset?

In this notebook we repeat the experiments with user-supplied data.

  1. Given a dataset, artificially introduce gaps to represent missing data.

  2. Using a specific imputation method, impute those gaps.

  3. Measure the error between the original dataset and the imputed dataset.

  4. Repeat for a range of different gap sizes and combinations.

  5. Compare the different imputation methods.
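Step 1 above can be sketched as follows. This is an illustrative helper, not the notebook's own `run_experiment`; it assumes a DataFrame with a datetime index and one column per site.

```python
import numpy as np
import pandas as pd

def introduce_gaps(df, num_sites, gap_length, seed=0):
    """Return a copy of df with one artificial gap of `gap_length`
    consecutive rows in each of `num_sites` randomly chosen columns."""
    rng = np.random.default_rng(seed)
    gapped = df.copy()
    sites = rng.choice(df.columns, size=num_sites, replace=False)
    for site in sites:
        # Pick a random start so the gap fits inside the series.
        start = rng.integers(0, len(df) - gap_length)
        gapped.iloc[start:start + gap_length, df.columns.get_loc(site)] = np.nan
    return gapped
```

The imputed values at these artificial gaps can then be compared against the known originals.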

Download and prepare the dataset

To evaluate imputation algorithms, we start with a relatively complete dataset and then artificially introduce gaps.

By comparing the performance of different imputation algorithms, you can make an informed choice about which algorithm works best for your data.

import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import panel as pn
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
pn.extension()

from rich.progress import track

The routines used in this notebook are in the file experiments.py

from experiments import *

This analysis can be repeated with any dataset of the given format.

dataset_csv = 'dataset_cmar.csv'
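The expected format, inferred from how the file is read below, is a CSV whose first column holds dates and whose remaining columns each hold one station-depth series. The file and column names here are hypothetical, purely to illustrate the shape:

```python
import numpy as np
import pandas as pd

# Build a small synthetic dataset in the expected shape:
# a date index plus one column per station-depth combination.
dates = pd.date_range('2021-01-01', periods=365, freq='D')
columns = ['StationA @ 5m', 'StationA @ 10m', 'StationB @ 5m']
rng = np.random.default_rng(42)
synthetic = pd.DataFrame(rng.normal(size=(len(dates), len(columns))),
                         index=dates, columns=columns)
synthetic.to_csv('my_dataset.csv')

# Read it back exactly the way the notebook does.
df_check = pd.read_csv('my_dataset.csv', parse_dates=True, index_col=0)
```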

Explore the data

We start by getting an overview of what data is available in our dataset.

df = pd.read_csv(dataset_csv, parse_dates=True, index_col=0)
image_data = df.astype('float32').T.values

x_labels = df.index.strftime('%Y-%m-%d')  # dates → x-axis
y_labels = list(df.columns)               # station-depths → y-axis

x_coords = np.arange(len(x_labels))
y_coords = np.arange(len(y_labels))

heatmap = hv.Image((x_coords, y_coords, image_data)).opts(
    xaxis='bottom',
    xlabel='Date',
    ylabel='Station @ Depth',
    xticks=list(zip(x_coords[::30], x_labels[::30])),  # every 30th date
    yticks=list(zip(y_coords, y_labels)),
    xrotation=45,
    cmap='Viridis',
    colorbar=True,
    width=1000,
    height=800,
    tools=['hover']
)
heatmap

Visualize the series data

# Create a dropdown selector
site_selector = pn.widgets.Select(name='Site', options=list(df.columns))

def highlight_nan_regions(label):

    series = df[label]
    
    # Identify NaN regions
    is_nan = series.isna()
    nan_ranges = []
    current_start = None

    for date, missing in is_nan.items():
        if missing and current_start is None:
            current_start = date
        elif not missing and current_start is not None:
            nan_ranges.append((current_start, date))
            current_start = None
    if current_start is not None:
        nan_ranges.append((current_start, series.index[-1]))

    # Create shaded regions
    spans = [
        hv.VSpan(start, end).opts(color='red', alpha=0.2)
        for start, end in nan_ranges
    ]

    curve = hv.Curve(series, label=label).opts(
        width=900, height=250, tools=['hover', 'box_zoom', 'pan', 'wheel_zoom'],
        show_grid=True, title=label
    )

    return curve * hv.Overlay(spans)
   
 
interactive_plot = hv.DynamicMap(pn.bind(highlight_nan_regions, site_selector))

pn.Column(site_selector, interactive_plot, 'Highlighted regions are gaps that need to be imputed.')

The approach will be to artificially introduce gaps into this dataset and then attempt to impute those gaps.

Configure the options below to select which imputers to compare.

We introduce gaps at a fixed number of randomly selected sites, for a given gap length. You can adjust these values depending on your own dataset.

imputers = [LinearImputer(), 
            KNNImputer(),
            MissForestImputer(),
            MICEImputer()
           ]

NumSites = [2, 4, 6]
GapLength = [7, 14, 21, 35, 56, 90]
results_list = []
filtered_results = []
for imputer in track(imputers, description="Imputation Method"):
    for num_sites in track(NumSites, description="NumSites", transient=True):
        for gap_length in track(GapLength, description="GapLength", transient=True):
            result = run_experiment(imputer, dataset=dataset_csv, minimum_missing_data=90, num_sites=num_sites, gap_length=gap_length)
            results_list.append(result)
            
            filtered_result = {k: v for k, v in result.items() if isinstance(v, (int, float, str))}
            filtered_results.append(filtered_result)

filtered_results = pd.DataFrame(filtered_results)
filtered_results.to_csv('results.csv', index=False)

What is the code above doing?

It’s looping through all combinations of imputation algorithms and artificial gap sizes. The results are stored in the subdirectory results/. If a particular result (combination of dataset csv file name, imputation algorithm, number of missing sites, and gap length) has already been generated, the code does not calculate it again.
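The caching described above could be implemented along these lines. This is a sketch under assumed conventions, not the actual contents of experiments.py; the filename scheme and JSON storage are illustrative choices.

```python
import json
import os

def result_path(dataset_csv, imputer_name, num_sites, gap_length,
                results_dir='results'):
    """Build a cache filename keyed on the experiment parameters."""
    stem = os.path.splitext(os.path.basename(dataset_csv))[0]
    key = f"{stem}_{imputer_name}_s{num_sites}_g{gap_length}"
    return os.path.join(results_dir, key + '.json')

def cached_run(run_fn, dataset_csv, imputer_name, num_sites, gap_length):
    """Return a cached result if present, otherwise compute and store it."""
    path = result_path(dataset_csv, imputer_name, num_sites, gap_length)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = run_fn()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        json.dump(result, f)
    return result
```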

In the code above, `minimum_missing_data=90` means that only records in the dataset that are at least 90% complete will be used. It can also be set to `minimum_missing_data=0` or another value to change the completeness threshold required at a site before artificial gaps are introduced.
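Assuming `minimum_missing_data` acts as a completeness threshold in percent, as described above, the filtering step could look like this (a sketch, not the actual experiments.py implementation):

```python
import numpy as np
import pandas as pd

def filter_complete_sites(df, minimum_missing_data=90):
    """Keep only columns whose fraction of non-missing values is at
    least `minimum_missing_data` percent."""
    completeness = df.notna().mean() * 100
    return df.loc[:, completeness >= minimum_missing_data]
```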

Recommendation: Start with a relatively small number of NumSites and GapLength values and extend after you have examined the initial results.

Depending on the size of the dataset, this process can take tens of minutes. We apply the imputation algorithms many dozens of times to explore how the fraction of missing data affects the imputation error.
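The error metrics plotted below (MAE and RMSE) compare imputed values against the known originals at the artificially gapped entries. A minimal sketch, assuming a boolean mask marking the introduced gaps:

```python
import numpy as np

def gap_errors(original, imputed, gap_mask):
    """MAE and RMSE between original and imputed values,
    restricted to the artificially introduced gaps."""
    diff = original[gap_mask] - imputed[gap_mask]
    mae = np.mean(np.abs(diff))
    rmse = np.sqrt(np.mean(diff ** 2))
    return mae, rmse
```

RMSE penalises large individual errors more heavily than MAE, so comparing the two can reveal whether an imputer's errors are uniform or dominated by a few outliers.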

Interpret the results

import holoviews as hv
hv.extension('bokeh')

df = pd.read_csv('results.csv')

plots = []
for metric in ['MAE', 'RMSE']:
    
    scatter = hv.NdOverlay({
        imputer: hv.Scatter(df[df['imputer_name'] == imputer], 'missing_fraction', metric, label=imputer).opts(size=8)
        for imputer in df['imputer_name'].unique()
    })
    
    scatter.opts(
        title=f'{metric} vs Missing Fraction by Imputation Strategy',
        xlabel='Missing Fraction (%)',
        ylabel=metric,
        width=800,
        height=400,
        legend_position='right'
    )

    plots.append(scatter)

(plots[0] + plots[1]).cols(1)