GSoC 2020 Report

Organization : International Neuroinformatics Coordinating Facility (INCF)

Project : A reduced time-series feature library to efficiently characterize neural dynamics

Student : Imran Alam

Mentors : Ben Fulcher, Oliver Cliff and Joseph Lizier


Introduction

Time-series analysis is a broad, interdisciplinary field, and research in it has produced a plethora of features for capturing dynamical patterns. However, it is unknown which of these thousands of existing methods perform well on specific domain-related tasks, such as prominent applications involving neural dynamics. In this project, we aimed to reduce an existing library of ~7700 time-series features (from the hctsa package) to a small subset and implement the result as an open-source library. The resulting feature set should be highly computationally efficient relative to hctsa, give high performance on the set of training tasks (mouse fMRI manipulations), and eliminate the dependency on a Matlab licence, thereby enabling more widespread (and real-time) adoption of feature-based time-series analysis in medical and research applications.

In this project, we used a workflow similar to that of catch22, which selects the top features based on classification performance on a given set of tasks and then removes redundancy among them to obtain a maximally independent set. The resulting feature set was implemented in C, with wrappers for Matlab and Python.

Getting Started

I started the project by familiarizing myself with the hctsa library in Matlab, experimenting with test data and analysing the results, which helped me understand the data flow and functionality of the tool. To get started with the dataset and understand the various applications of hctsa in neuroscience, I studied a range of relevant scientific manuscripts. In parallel, I resolved several data-analysis and coding issues, including stratifying imbalanced classes, restructuring the data loader, and implementing sanity checks.

The project can be divided into the following subsections:

  1. Dataset details and preprocessing
  2. Selection of features using clustering
  3. Final reduced feature set
  4. Implementation and Evaluation

Dataset

Details

At the start of the coding period, I worked on neuroimaging datasets to which the feature reduction workflow would be applied.

We used a mouse fMRI dataset, based on chemogenetic manipulations of local neural dynamics published in a recent paper in Cerebral Cortex. The dataset includes two sub-datasets (a brain area stimulated in the right hemisphere and its contralateral left-hemisphere analogue in the isocortex) with four labelled classes: CAMK, excitatory, PVCre, SHAM.

Preprocessing

The datasets were normalised and the features were filtered to remove missing or extreme values. For each pair of conditions, we extracted a binary classification task, yielding six binary tasks per hemisphere and thus 12 tasks across the two datasets. Each individual time-series feature was judged according to its performance across these twelve tasks.
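The task count follows from pairing the four classes: 4 choose 2 gives six pairs per hemisphere. A minimal sketch of that enumeration is shown here; the hemisphere names and the `binary_tasks` helper are hypothetical and only illustrate the counting, not the project's actual data loader.

```python
from itertools import combinations

# The four class labels named in the report.
CLASSES = ["CAMK", "excitatory", "PVCre", "SHAM"]
# Hypothetical names for the two sub-datasets (one per hemisphere).
HEMISPHERES = ["right", "left"]


def binary_tasks(classes, datasets):
    """Return one (dataset, class_a, class_b) tuple per pairwise task."""
    return [(d, a, b) for d in datasets for a, b in combinations(classes, 2)]


tasks = binary_tasks(CLASSES, HEMISPHERES)
print(len(tasks))  # 12 = 6 class pairs x 2 hemispheres
```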

Selecting Features using Clustering

Hierarchical clustering is used here to reduce the feature set: choosing one feature from each cluster keeps redundancy low, while drawing from every cluster keeps the variability of the set high.
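The idea can be sketched in plain Python. This is an illustration under stated assumptions, not the project's pipeline: the distance (1 minus the absolute Pearson correlation), the complete-linkage rule, the toy feature outputs, and the per-feature scores are all assumed for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_features(outputs, threshold):
    """Complete-linkage agglomerative clustering on 1 - |r| (assumed metric)."""
    names = list(outputs)
    dist = {(a, b): 1.0 - abs(pearson(outputs[a], outputs[b]))
            for a in names for b in names if a != b}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[(a, b)] for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:  # no remaining pair is close enough to merge
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pick_representatives(clusters, scores):
    """Keep the best-scoring feature from each cluster."""
    return [max(c, key=lambda name: scores[name]) for c in clusters]


# Toy feature outputs over five time series (made-up numbers).
outputs = {
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_b": [2.1, 3.9, 6.2, 8.0, 9.9],       # almost a rescaled f_a
    "f_c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "f_d": [-5.0, -1.1, -4.0, -2.2, -3.0],  # almost a sign-flipped f_c
}
scores = {"f_a": 0.70, "f_b": 0.75, "f_c": 0.62, "f_d": 0.60}  # made-up accuracies

clusters = cluster_features(outputs, threshold=0.2)
representatives = pick_representatives(clusters, scores)
print(clusters)
print(representatives)
```

Raising the threshold merges more features into fewer clusters and so yields a smaller reduced set; lowering it has the opposite effect.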

Improving the feature reduction pipeline and testing the hyperparameters of hierarchical clustering

Figure: We applied different threshold (ø) values to restrict the number of features in the reduced set. The color represents the balanced accuracy on the left-out task using the reduced set, and the labelled number is the total number of features in the set.

Final reduced feature set

We extracted a reduced set of 16 features from the 16 clusters formed (on the mouse fMRI dataset, with hyperparameters n = 100 and ø = 0.2). Some features were infeasible to implement in C, so we replaced each of them with a highly correlated feature from the same cluster.

The list of features selected (one from each cluster) is as follows:

Implementation and Evaluation

Performance Comparison

The scatter plot shows that the reduced feature set gives similar, and in some cases better, performance than the full hctsa feature set. Thus, in memory- and time-constrained environments, hctsa can be replaced by catchaMouse16.

Figure: Performance of catchaMouse16 in comparison to the full set. Each point represents the balanced accuracy on a single task. The error bars along the X and Y axes are the standard deviations of the cross-validated accuracy with the full set and catchaMouse16, respectively.
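Balanced accuracy, the metric used in these comparisons, is the mean of the per-class recalls, which makes it robust to imbalanced classes. The following is a minimal sketch of that definition; the label vectors are made-up illustration data, not results from the project.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class (made-up labels).
y_true = ["SHAM"] * 8 + ["CAMK"] * 2
y_pred = ["SHAM"] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5, whereas plain accuracy is 0.8
```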

Speed Gains

We compared the average execution time needed to compute the reduced feature set on 2000 time series of length 900 using hctsa (in Matlab) and the catchaMouse16 library (in C and Mex). Our library is ~60 times faster than hctsa, as shown in the comparison plot:

Figure: Execution time of hctsa (Matlab) compared with catchaMouse16 (C and Mex).
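A timing comparison of this kind can be sketched as a small harness that averages wall-clock time per time series. The code here is an assumption-laden illustration: `placeholder_feature` is a hypothetical stand-in (a simple variance), not a catchaMouse16 feature, and the random input only mirrors the stated benchmark size.

```python
import random
import time

N_SERIES = 2000  # number of time series, as in the comparison above
LENGTH = 900     # length of each time series


def placeholder_feature(x):
    """Hypothetical stand-in feature: population variance of the series."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def mean_time_per_series(feature_fn, series_list):
    """Average wall-clock seconds spent per time series."""
    start = time.perf_counter()
    for series in series_list:
        feature_fn(series)
    return (time.perf_counter() - start) / len(series_list)


rng = random.Random(0)  # fixed seed so the input data is reproducible
data = [[rng.gauss(0.0, 1.0) for _ in range(LENGTH)] for _ in range(N_SERIES)]
avg_seconds = mean_time_per_series(placeholder_feature, data)
print(f"{avg_seconds * 1e6:.1f} microseconds per time series")
```

Running the same harness over two implementations and dividing the averages gives a speed-up factor of the kind reported above.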

Summary

The catchaMouse16 library consists of 16 hctsa features efficiently coded in C, which were selected by following the op_importance feature reduction pipeline.

Future Roadmap

After the end of the GSoC period, I will continue contributing to open-source development for the neuroscience community.