Tutorial 4: Creating custom datasets#

MOABB provides several ways to integrate a custom dataset, depending on the format of your data:

  1. BaseDataset — for datasets with arbitrary file formats. Requires implementing data downloading and file reading.

  2. BaseBIDSDataset / LocalBIDSDataset — for datasets already provided in BIDS format. BaseBIDSDataset is used for online datasets (only the download step needs to be implemented); LocalBIDSDataset is used for local or private datasets with no subclassing required at all.

BIDS is the preferred format for new datasets in MOABB.

This tutorial illustrates both approaches.

# Authors: Pedro L. C. Rodrigues, Sylvain Chevallier
#
# https://github.com/plcrodrigues/Workshop-MOABB-BCI-Graz-2019

import mne
import numpy as np
from pyriemann.classification import MDM
from pyriemann.estimation import Covariances
from scipy.io import loadmat, savemat
from sklearn.pipeline import make_pipeline

from moabb.datasets import download as dl
from moabb.datasets.base import BaseBIDSDataset, BaseDataset
from moabb.evaluations import WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery

1. Creating a dataset class from scratch (BaseDataset)#

Creating some Data#

To illustrate the creation of a dataset class in MOABB, we first create an example dataset saved in .mat file. It contains a single fake recording on 8 channels lasting for 150 seconds (sampling frequency 256 Hz). We have included the script that creates this dataset and have uploaded it online. The fake dataset is available on the Zenodo website

def create_example_dataset():
    """Create a fake example for a dataset."""
    sfreq = 256
    t_recording = 150
    t_trial = 1  # duration of a trial
    intertrial = 2  # time between end of a trial and the next one
    n_chan = 8

    x = np.zeros((n_chan + 1, t_recording * sfreq))  # electrodes + stimulus
    stim = np.zeros(t_recording * sfreq)
    t_offset = 1.0  # offset where the trials start
    n_trials = 40

    rep = np.linspace(0, 4 * t_trial, t_trial * sfreq)
    signal = np.sin(2 * np.pi / t_trial * rep)
    for n in range(n_trials):
        label = n % 2 + 1  # alternate between class 0 and class 1
        tn = int(t_offset * sfreq + n * (t_trial + intertrial) * sfreq)
        stim[tn] = label
        noise = 0.1 * np.random.randn(n_chan, len(signal))
        x[:-1, tn : (tn + t_trial * sfreq)] = label * signal + noise
    x[-1, :] = stim
    return x, sfreq


# Create the fake data
for subject in [1, 2, 3]:
    x, fs = create_example_dataset()
    filename = "subject_" + str(subject).zfill(2) + ".mat"
    mdict = {}
    mdict["x"] = x
    mdict["fs"] = fs
    savemat(filename, mdict)

Creating a Dataset Class#

We will create now a dataset class using the fake data simulated with the code from above. For this, we first need to import the right classes from MOABB:

  • dl is a very useful script that downloads automatically a dataset online if it is not yet available in the user’s computer. The script knows where to download the files because we create a global variable telling the URL where to fetch the data.

  • BaseDataset is the basic class that we overload to create our dataset.

The global variable with the dataset’s URL should specify an online repository where all the files are stored.

ExampleDataset_URL = "https://zenodo.org/records/14973598"

The ExampleDataset needs to implement only 3 functions:

  • __init__ for indicating the parameter of the dataset

  • _get_single_subject_data to define how to process the data once they have been downloaded

  • data_path to define how the data are downloaded.

class ExampleDataset(BaseDataset):
    """Dataset used to exemplify the creation of a dataset class in MOABB.

    The data samples have been simulated and has no physiological
    meaning whatsoever.
    """

    def __init__(self):
        super().__init__(
            subjects=[1, 2, 3],
            sessions_per_subject=1,
            events={"left_hand": 1, "right_hand": 2},
            code="ExampleDataset",
            interval=[0, 0.75],
            paradigm="imagery",
            doi="",
        )

    def _get_single_subject_data(self, subject):
        """Return data for a single subject."""
        file_path_list = self.data_path(subject)

        data = loadmat(file_path_list[0])
        x = data["x"]
        fs = data["fs"]
        ch_names = ["ch" + str(i) for i in range(8)] + ["stim"]
        ch_types = ["eeg" for i in range(8)] + ["stim"]
        info = mne.create_info(ch_names, float(np.squeeze(fs)), ch_types)
        raw = mne.io.RawArray(x, info)

        sessions = {}
        sessions["0"] = {}
        sessions["0"]["0"] = raw
        return sessions

    def data_path(
        self, subject, path=None, force_update=False, update_path=None, verbose=None
    ):
        """Download the data from one subject."""
        if subject not in self.subject_list:
            raise (ValueError("Invalid subject number"))

        url = "{:s}/files/subject_0{:d}.mat".format(ExampleDataset_URL, subject)
        path = dl.data_dl(url, "ExampleDataset")
        return [path]  # it has to return a list

Using the ExampleDataset#

Now that the ExampleDataset is defined, it could be instantiated directly. The rest of the code follows the steps described in the previous tutorials.

dataset = ExampleDataset()

paradigm = LeftRightImagery()
X, labels, meta = paradigm.get_data(dataset=dataset, subjects=[1])

evaluation = WithinSessionEvaluation(
    paradigm=paradigm, datasets=dataset, overwrite=False, suffix="newdataset"
)
pipelines = {}
pipelines["MDM"] = make_pipeline(Covariances("oas"), MDM(metric="riemann"))
scores = evaluation.process(pipelines)

print(scores)
/home/runner/work/moabb/moabb/moabb/analysis/results.py:192: H5pyDeprecationWarning: Creating a dataset without passing data or dtype is deprecated. Pass an explicit dtype. Using dtype='f4' will keep the current default behaviour.
  dset.create_dataset(
   score      time  samples  ...         dataset  pipeline  codecarbon_task_name
0    1.0  0.016990     40.0  ...  ExampleDataset       MDM
1    1.0  0.017430     40.0  ...  ExampleDataset       MDM
2    1.0  0.016916     40.0  ...  ExampleDataset       MDM

[3 rows x 13 columns]

Pushing on MOABB Github#

If you want to make your dataset available to everyone, you could upload your data on public server (like Zenodo or Figshare) and signal that you want to add your dataset to MOABB in the dedicated issue. You could then follow the instructions on how to contribute

2. Creating a BIDS dataset class (BaseBIDSDataset)#

If your dataset is already provided in the BIDS format, you can subclass BaseBIDSDataset instead of BaseDataset. The BaseBIDSDataset base class handles reading the BIDS files automatically via mne-bids; you only need to implement the _download_subject method that downloads the data for a single subject and returns the local path to the root of the BIDS dataset.

Several MOABB datasets already use this approach — for example moabb.datasets.Zhou2016.

The skeleton below shows the minimal implementation required. Note: ExampleBIDSDataset is intentionally non-functional — its _download_subject raises NotImplementedError. Replace it with your own download logic before instantiating the class.

class ExampleBIDSDataset(BaseBIDSDataset):
    """Skeleton showing how to wrap an online BIDS dataset.

    Replace ``_download_subject`` with the actual download logic for your
    dataset (e.g. fetching a zip archive from Zenodo and extracting it).
    """

    def __init__(self):
        super().__init__(
            subjects=[1, 2, 3],
            sessions_per_subject=1,
            events={"left_hand": 1, "right_hand": 2},
            code="ExampleBIDSDataset",
            interval=[0, 0.75],
            paradigm="imagery",
            doi="",
        )

    def _download_subject(self, subject, path, force_update, update_path, verbose):
        """Download the BIDS dataset for *subject* and return the BIDS root."""
        # Example (not executed here — replace with your actual download URL):
        #
        #   url = f"https://zenodo.org/records/XXXXX/files/bids_dataset.zip"
        #   bids_root = dl.data_dl(url, "ExampleBIDSDataset")
        #   return bids_root
        raise NotImplementedError("Replace this with your actual download logic.")

Using LocalBIDSDataset for local/private BIDS datasets#

If you already have a BIDS dataset on your local machine and you do not want to write a dedicated class, you can use LocalBIDSDataset directly. It auto-discovers subjects and sessions from the BIDS directory structure:

from moabb.datasets.base import LocalBIDSDataset

dataset = LocalBIDSDataset(
    bids_root="/path/to/bids/dataset",
    events={"left_hand": 1, "right_hand": 2},
    interval=[0, 0.75],
    paradigm="imagery",
)

paradigm = LeftRightImagery()
X, labels, meta = paradigm.get_data(dataset=dataset, subjects=[1])

Total running time of the script: (0 minutes 46.878 seconds)

Gallery generated by Sphinx-Gallery