
1. Introduction

Oceanographic time series, like many observational measurements based on physical sensors, are often incomplete. Due to sensor limitations, interference, malfunctions and other concerns, there is often missing data, or data that has failed a quality control check. Before these datasets can be used in subsequent data analysis workflows, it may be necessary to gap fill (or “impute”) the missing or suspect data.


While simple univariate methods may suffice for short gaps, they are often not accurate enough for longer gaps. In such instances, machine learning methods are an attractive option: these multivariate solutions estimate missing data by examining the relationship between two (or more) variables in order to predict patterns. These variables might be ocean observations recorded at the same location, or observations from multiple sites, such as different depths or geographically adjacent locations.
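The contrast between the two approaches can be sketched on synthetic data. The example below is a minimal illustration, not the method used in these notebooks: it fills a long gap in one series by linear interpolation (univariate) and by an ordinary least-squares fit on a second, correlated series (a deliberately simple stand-in for a multivariate method). The series `shallow` and `deep`, the gap position, and the helper names are all assumptions made for illustration.

```python
import math

def linear_interpolate(values):
    """Fill None gaps with a straight line between the nearest observed neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:            # gap at the start: carry the first value back
            filled[i] = values[right]
        elif right is None:         # gap at the end: carry the last value forward
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled

def regression_impute(target, predictor):
    """Fill None gaps in `target` using a least-squares line fitted on `predictor`."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]

def rmse(truth, estimate, indices):
    """Root-mean-square error over the given indices only."""
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in indices) / len(indices))

# Synthetic (hypothetical) temperatures at two depths; `deep` tracks `shallow`.
n = 120
shallow = [10 + 5 * math.sin(2 * math.pi * t / 30) for t in range(n)]
deep = [0.8 * s - 1.0 + 0.2 * math.sin(1.7 * t) for t, s in enumerate(shallow)]

# Remove a long gap (30 samples) from the deep series only.
gap = range(45, 75)
observed = [None if i in gap else v for i, v in enumerate(deep)]

rmse_lin = rmse(deep, linear_interpolate(observed), gap)
rmse_reg = rmse(deep, regression_impute(observed, shallow), gap)
print(f"linear interpolation RMSE: {rmse_lin:.2f}")
print(f"regression on second series RMSE: {rmse_reg:.2f}")
```

Because the gap spans a full oscillation, the straight line misses the cycle entirely, while the second series still carries that information. Real data will rarely be this cooperative, which is why the notebooks compare several methods.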

To demonstrate the concept, this series of notebooks compares several machine learning methods for the imputation of time series. The time series chosen for analysis are from the Centre for Marine Applied Research (CMAR) Water Quality datasets hosted at CIOOS Atlantic, using data collected over a four-year period. To validate the approach, a dataset from the East Atlantic Coast Aquatic Invasive Species (AIS) Monitoring Program is also presented.

In both case studies, missForest provided the best compromise between accuracy and speed of analysis. Although this is consistent with the literature, these examples explore only a small subset of oceanographic data, and we do not claim that this solution is optimal in all circumstances: the optimal algorithm may depend on the features of the dataset.

To assist users with determining the best solution for their own dataset, this work presents a reusable structure for imputing data.

For example, to format your data so that it is compatible with these notebooks, see x. Note that the dataset must be in CSV format and titled ‘dataset’.
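Before imputing, it can help to check how much of each column is actually missing. The sketch below is one possible way to do this with the Python standard library. The file name `dataset.csv`, the column names, the list of tokens treated as missing, and the helper `count_missing` are hypothetical and are not taken from the notebooks.

```python
import csv
import os
import tempfile

# Cell contents treated as missing (an assumption; adjust to your own data).
MISSING_TOKENS = {"", "nan", "na", "null"}

def count_missing(path):
    """Return ({column: missing-cell count}, number of data rows) for a CSV with a header."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        counts = {name: 0 for name in reader.fieldnames}
        n_rows = 0
        for row in reader:
            n_rows += 1
            for name in counts:
                value = row[name]
                # DictReader yields None for cells absent from a short row.
                if value is None or value.strip().lower() in MISSING_TOKENS:
                    counts[name] += 1
    return counts, n_rows

# Write a tiny hypothetical file so the example is self-contained.
sample = (
    "timestamp,temperature,salinity\n"
    "2020-01-01T00:00,4.1,30.2\n"
    "2020-01-01T01:00,,30.1\n"
    "2020-01-01T02:00,4.3,NaN\n"
    "2020-01-01T03:00,,30.0\n"
)
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
with open(path, "w", newline="") as handle:
    handle.write(sample)

missing, n_rows = count_missing(path)
for name, n in missing.items():
    print(f"{name}: {n} of {n_rows} values missing")
```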


This project is an outcome of Building Bridges, a project funded by Canada’s Ocean Supercluster.

Outline

  1. Introduction: this notebook

  2. Data Exploration: always begin with visualizing your data

  3. Imputation: the different imputation algorithms we are considering in this analysis

  4. Experiments: demonstration that for the datasets considered, the missForest algorithm performs best

  5. Case Study: Shelburne County Water Quality

  6. Case Study: AIS

  7. Case Study: CMAR Water Quality

  8. Case Study: Bring your own data!

  9. Appendix: Hyperparameter tuning

  10. Appendix: Various additional visualization routines