
Download the Dataset

We want to analyze the Centre for Marine Applied Research (CMAR) Water Quality dataset.

from erddapy import ERDDAP
import os
import pandas as pd
from tqdm.notebook import tqdm

The data is available from the CIOOS Atlantic ERDDAP server.

e = ERDDAP(
    server = "https://cioosatlantic.ca/erddap",
    protocol = "tabledap"
)

We first determine the datasetID for each CMAR Water Quality dataset.

The study period is 2020-09-01 to 2024-08-31.

e.dataset_id = 'allDatasets'
e.variables = ['datasetID', 'institution', 'title', 'minTime', 'maxTime']

# only grab datasets whose data overlaps the study period
e.constraints = {'maxTime>=': '2020-09-01', 'minTime<=': '2024-08-31'}
df_allDatasets = e.to_pandas()
df_CMAR_datasets = df_allDatasets[df_allDatasets['institution'].str.contains('CMAR') & df_allDatasets['title'].str.contains('Water Quality Data')].copy()
df_CMAR_datasets['county'] = df_CMAR_datasets['title'].str.removesuffix(' County Water Quality Data')

df_CMAR_datasets.sample(5)
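As a quick check, this should find one Water Quality dataset per county; the download log below shows 14 in this run.

# sanity check: one dataset per county is expected
len(df_CMAR_datasets)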

For each of these datasets, we download the temperature data locally.

e.variables = [
 'waterbody',
 'station',
# 'sensor_type',
# 'sensor_serial_number',
# 'rowSize',
# 'lease',
# 'latitude',
# 'longitude',
 'deployment_start_date',
 'deployment_end_date',
# 'string_configuration',
 'time',
 'depth',
# 'depth_crosscheck_flag',
# 'dissolved_oxygen',
# 'salinity',
# 'sensor_depth_measured',
 'temperature',
# 'qc_flag_dissolved_oxygen',
# 'qc_flag_salinity',
# 'qc_flag_sensor_depth_measured',
 'qc_flag_temperature']

e.constraints = { "time>=": "2020-09-01", "time<=": "2024-08-31" }
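Before launching the full download, we can preview the tabledap request URL that erddapy builds from these variables and constraints. This is just a sanity check; the datasetID used here is simply the first row of the table above.

# preview the generated tabledap request for one dataset
e.dataset_id = df_CMAR_datasets.iloc[0]['datasetID']
print(e.get_download_url(response="csv"))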

The full download takes a while (about half an hour in this run), so we cache the data locally so it only has to be downloaded once.

%%time

os.makedirs('data', exist_ok=True)

for index, row in df_CMAR_datasets.iterrows():

    csvfile = f"data/{row['county']}.csv"

    # skip counties that have already been downloaded
    if os.path.exists(csvfile):
        continue

    print(f"Downloading {row['title']}...")
    e.dataset_id = row['datasetID']
    df = e.to_pandas()

    df.to_csv(csvfile, index=False)
Downloading Annapolis County Water Quality Data...
Downloading Antigonish County Water Quality Data...
Downloading Colchester County Water Quality Data...
Downloading Digby County Water Quality Data...
Downloading Guysborough County Water Quality Data...
Downloading Halifax County Water Quality Data...
Downloading Inverness County Water Quality Data...
Downloading Lunenburg County Water Quality Data...
Downloading Pictou County Water Quality Data...
Downloading Queens County Water Quality Data...
Downloading Richmond County Water Quality Data...
Downloading Shelburne County Water Quality Data...
Downloading Victoria County Water Quality Data...
Downloading Yarmouth County Water Quality Data...
CPU times: user 43.8 s, sys: 5.44 s, total: 49.3 s
Wall time: 26min 59s
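Note that if a download is interrupted partway through, the cache check above would treat the partially written file as complete on the next run. One defensive variation (a sketch, not part of the workflow above) is to write to a temporary file and only rename it on success:

# sketch: write to a temporary file first so an interrupted download
# never leaves behind a partial .csv that would be mistaken for a cached one
tmpfile = csvfile + '.part'
df.to_csv(tmpfile, index=False)
os.replace(tmpfile, csvfile)  # atomic rename on most filesystems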

We now have the following .csv files stored locally:

!ls -lh data/
total 2.4G
-rw-r--r-- 1 jmunroe jmunroe  33M Jul  4 09:14 Annapolis.csv
-rw-r--r-- 1 jmunroe jmunroe 106M Jul  4 09:16 Antigonish.csv
-rw-r--r-- 1 jmunroe jmunroe  32M Jul  4 09:16 Colchester.csv
-rw-r--r-- 1 jmunroe jmunroe 122M Jul  4 09:18 Digby.csv
-rw-r--r-- 1 jmunroe jmunroe 940M Jul  4 09:27 Guysborough.csv
-rw-r--r-- 1 jmunroe jmunroe 177M Jul  4 09:29 Halifax.csv
-rw-r--r-- 1 jmunroe jmunroe  40M Jul  4 09:29 Inverness.csv
-rw-r--r-- 1 jmunroe jmunroe 395M Jul  4 09:33 Lunenburg.csv
-rw-r--r-- 1 jmunroe jmunroe  47M Jul  4 09:34 Pictou.csv
-rw-r--r-- 1 jmunroe jmunroe  72M Jul  4 09:35 Queens.csv
-rw-r--r-- 1 jmunroe jmunroe  88M Jul  4 09:36 Richmond.csv
-rw-r--r-- 1 jmunroe jmunroe 143M Jul  4 09:38 Shelburne.csv
-rw-r--r-- 1 jmunroe jmunroe 183K Jul  4 09:38 Victoria.csv
-rw-r--r-- 1 jmunroe jmunroe 183M Jul  4 09:40 Yarmouth.csv

We need to organize and sort the observations so that each file contains the observations from a single sensor deployment, in temporal order.

This also removes the metadata that is duplicated on every row of these .csv files.

os.makedirs('segments', exist_ok=True)

all_segment_metadata = []
for index, row in tqdm(list(df_CMAR_datasets.iterrows())):

    csvfile = f"data/{row['county']}.csv"

    df = pd.read_csv(csvfile)
    
    # build a unique segment key for each deployment:
    # county_waterbody_station_depth_start_end
    df['segment'] = df[['waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]].agg(lambda x: row['county'] + '_' + '_'.join([str(y) for y in x]), axis=1)

    # keep one row of metadata per segment
    df_metadata = df[['segment', 'waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]]

    df_metadata = df_metadata.drop_duplicates()
    all_segment_metadata.append(df_metadata)
    
    # drop the metadata columns from the observations themselves
    df_data = df.drop(columns=['waterbody', 'station', 'depth (m)',
                                 'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                              ])
    
    df_data = df_data.sort_values(by=['segment', 'time (UTC)'])

    df_data.set_index(['segment', 'time (UTC)'], inplace=True)

    # write each segment's time series to its own .csv file
    for key, segment_df in df_data.groupby(level=0):
        csvfile = f'segments/{key}.csv'
        segment_df = segment_df.droplevel(0)
        segment_df.to_csv(csvfile)

df_metadata = pd.concat(all_segment_metadata)
df_metadata.set_index('segment', inplace=True)
df_metadata.to_csv('metadata.csv')
!ls -lh segments/ | wc
   1108   12698  151581

We have 852 distinct observational time series, taken at various locations and depths around Nova Scotia during the period 2020-09-01 to 2024-08-31.

df_metadata.head(8)
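For later analysis, the metadata table and any individual segment can be read back in; for example (using the first segment key, just for illustration):

# reload the metadata and a single segment
df_metadata = pd.read_csv('metadata.csv', index_col='segment')
segment = df_metadata.index[0]
df_segment = pd.read_csv(f'segments/{segment}.csv', parse_dates=['time (UTC)'])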