An example classification problem#
import warnings
warnings.filterwarnings("ignore")
import os
from nilearn import datasets
os.environ["NILEARN_SHARED_DATA"] = "~/shared/data/nilearn_data"
datasets.get_data_dirs()
['~/shared/data/nilearn_data', '/Users/emdupre/nilearn_data']
Now that we’ve seen how to create a connectome for an individual subject,
we’re ready to think about how we can use this connectome in a machine learning analysis.
We’ll keep working with the same development_dataset, but now we’d like to see if we can predict age group (i.e. whether a participant is a child or adult) based on their connectome, as defined by the functional connectivity matrix.
We’ll also explore whether we’re more or less accurate in our predictions based on how we define functional connectivity. In this example, we’ll consider three different ways to define functional connectivity between our Multi-Subject Dictionary Learning (MSDL) regions of interest (ROIs): correlation, partial correlation, and tangent space embedding.
To learn more about tangent space embedding and how it compares to standard correlations, we recommend [1].
Load brain development fMRI dataset and MSDL atlas#
First, we need to set up our minimal environment. This will include all the dependencies from the last notebook, loading the relevant data using our nilearn dataset fetchers, and instantiating our NiftiMapsMasker and ConnectivityMeasure objects.
import numpy as np
import matplotlib.pyplot as plt
from nilearn import (datasets, maskers, plotting)
from nilearn.connectome import ConnectivityMeasure
development_dataset = datasets.fetch_development_fmri(n_subjects=30)
msdl_atlas = datasets.fetch_atlas_msdl()
masker = maskers.NiftiMapsMasker(
msdl_atlas.maps, resampling_target="data",
t_r=2, detrend=True,
low_pass=0.1, high_pass=0.01).fit()
correlation_measure = ConnectivityMeasure(kind='correlation')
[fetch_development_fmri] Dataset found in /Users/emdupre/nilearn_data/development_fmri
[fetch_development_fmri] Dataset found in /Users/emdupre/nilearn_data/development_fmri/development_fmri
[fetch_development_fmri] Dataset found in /Users/emdupre/nilearn_data/development_fmri/development_fmri
[fetch_atlas_msdl] Dataset found in /Users/emdupre/nilearn_data/msdl_atlas
Now we should have a much better idea of what each line above is doing! Let’s see how we can use these objects across many subjects, not just the first one.
Building our dataset#
First, we can loop through the 30 participants and extract a few relevant pieces of information, including their functional scan, their confounds file, and whether they were a child or adult at the time of their scan.
Using this information, we can then transform their data using the NiftiMapsMasker
we created above.
As we learned last time, it’s really important to correct for known sources of noise!
So we’ll also pass the relevant confounds file directly to the masker object to clean up each subject’s data.
children = []
pooled_subjects = []
groups = []  # child or adult
for func_file, confound_file, (_, phenotypic) in zip(
        development_dataset.func,
        development_dataset.confounds,
        development_dataset.phenotypic.iterrows()):
    time_series = masker.transform(func_file, confounds=confound_file)
    pooled_subjects.append(time_series)
    if phenotypic['Child_Adult'] == 'child':
        children.append(time_series)
    groups.append(phenotypic['Child_Adult'])
print(f'Data has {len(children)} children.')
Data has 24 children.
We can see that this data set has 24 children. This is roughly proportional to the original participant pool, which had 122 children and 33 adults.
We’ve also created a list in pooled_subjects
containing all of the cleaned data.
Remember that each entry of that list should have a shape of (168, 39).
We can quickly confirm that this is true:
print(pooled_subjects[0].shape)
(168, 39)
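As a quick sanity check (not part of the original notebook), we could also confirm the shape of every subject’s time series at once:

# hypothetical check: every cleaned time series should have shape (168, 39)
assert all(ts.shape == (168, 39) for ts in pooled_subjects)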
ROI-to-ROI correlations in children#
First, we’ll use the most common kind of connectivity, and the one we used in the last section: correlation. It models the full (marginal) connectivity between pairwise ROIs.
correlation_measure
expects a list of time series,
so we can directly supply the list of ROI time series we just created.
It will then compute individual correlation matrices for each subject.
First, let’s just look at the correlation matrices for our 24 children,
since we expect these matrices to be similar:
correlation_matrices = correlation_measure.fit_transform(children)
Now, all individual correlation matrices are stacked into a single 3D array.
print('Correlations of children are stacked in an array of shape {0}'
.format(correlation_matrices.shape))
Correlations of children are stacked in an array of shape (24, 39, 39)
We can also directly access the average correlation across all fitted subjects using the mean_
attribute.
mean_correlation_matrix = correlation_measure.mean_
print('Mean correlation has shape {0}.'.format(mean_correlation_matrix.shape))
Mean correlation has shape (39, 39).
Let’s display the functional connectivity matrices of the first 3 children:
_, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, (matrix, ax) in enumerate(zip(correlation_matrices, axes)):
    plotting.plot_matrix(matrix, colorbar=False, axes=ax,
                         vmin=-0.8, vmax=0.8,
                         title='correlation, child {}'.format(i))

Just as before, we can also display the connectome on the brain. Here, let’s show the mean connectome over all 24 children.
plotting.view_connectome(mean_correlation_matrix, msdl_atlas.region_coords,
edge_threshold=0.2,
title='mean connectome over all children')
Studying partial correlations#
Rather than looking at the correlation-defined functional connectivity matrix, we can also study direct connections as revealed by partial correlation coefficients.
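As a rough sketch of the underlying idea (not the exact computation ConnectivityMeasure performs internally, which goes through a covariance estimator), partial correlations can be read off the inverse covariance, or precision, matrix:

import numpy as np

def partial_corr_from_precision(precision):
    """Turn a precision (inverse covariance) matrix into partial correlations."""
    d = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(d, d)  # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(partial, 1.)          # diagonal is 1 by convention
    return partial

# illustration on the first child's time series; a plain inverse is used here
# only for simplicity, with short time series a shrinkage estimator is safer
precision = np.linalg.inv(np.cov(children[0], rowvar=False))
partial = partial_corr_from_precision(precision)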
To do this, we can use exactly the same procedure as above, just changing the ConnectivityMeasure
kind:
partial_correlation_measure = ConnectivityMeasure(kind='partial correlation')
partial_correlation_matrices = partial_correlation_measure.fit_transform(
children)
Right away, we can see that most of the direct connections are weaker than the full connections for the first three children:
_, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, (matrix, ax) in enumerate(zip(partial_correlation_matrices, axes)):
    plotting.plot_matrix(matrix, colorbar=False, axes=ax,
                         vmin=-0.8, vmax=0.8,
                         title='partial correlation, child {}'.format(i))

This is also visible when we display the mean partial correlation connectome:
plotting.view_connectome(
partial_correlation_measure.mean_, msdl_atlas.region_coords,
edge_threshold=0.2,
title='mean partial correlation over all children')
Using tangent space embedding#
An alternative method to both correlations and partial correlation is tangent space embedding. Tangent space embedding uses both correlations and partial correlations to capture reproducible connectivity patterns at the group-level.
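To make that more concrete, here is a minimal sketch of the projection behind tangent space embedding; it is an illustration only, since for simplicity it uses whatever reference matrix you supply (e.g. an arithmetic mean), whereas nilearn builds a geometric mean of the covariance matrices.

import numpy as np
from scipy.linalg import inv, logm, sqrtm

def tangent_embed(covariances, reference):
    """Project covariance matrices into the tangent space at `reference`."""
    whitening = inv(sqrtm(reference))  # reference^(-1/2)
    # whiten each matrix by the reference, then take the matrix logarithm
    return np.array([logm(whitening @ cov @ whitening) for cov in covariances])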
Using this method is as easy as changing the kind of ConnectivityMeasure:
tangent_measure = ConnectivityMeasure(kind='tangent')
We fit our children group and get the group connectivity matrix stored in tangent_measure.mean_, along with individual deviation matrices of each subject from it.
tangent_matrices = tangent_measure.fit_transform(children)
tangent_matrices model individual connectivities as perturbations of the group connectivity matrix tangent_measure.mean_.
Keep in mind that these subjects-to-group variability matrices do not directly reflect individual brain connections. For instance, negative coefficients cannot be interpreted as anticorrelated regions.
_, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, (matrix, ax) in enumerate(zip(tangent_matrices, axes)):
    plotting.plot_matrix(matrix, colorbar=False, axes=ax,
                         vmin=-0.8, vmax=0.8,
                         title='tangent offset, child {}'.format(i))

We don’t show the mean connectome here, as the average tangent matrix cannot be interpreted: individual matrices represent deviations from the mean, which is set to 0.
Using connectivity in a classification analysis#
We can use these connectivity matrices as features in a classification analysis to distinguish children from adults. This classification analysis can be implemented directly in scikit-learn, including all of the important considerations like cross-validation and measuring classification accuracy.
First, we’ll randomly split participants into training and testing sets 15 times.
StratifiedShuffleSplit
allows us to preserve the proportion of children-to-adults in the test set.
We’ll also compute classification accuracies for each of the kinds of functional connectivity we’ve identified:
correlation, partial correlation, and tangent space embedding.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC
kinds = ['correlation', 'partial correlation', 'tangent']
_, classes = np.unique(groups, return_inverse=True)
cv = StratifiedShuffleSplit(n_splits=15, random_state=0, test_size=5)
pooled_subjects = np.asarray(pooled_subjects)
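Before running the analysis, a quick check (not part of the original notebook) can confirm what the stratification buys us: the class counts in each test fold should stay roughly proportional to the full sample.

# illustrative check: each test fold keeps roughly the same child/adult balance
for _, test in cv.split(pooled_subjects, classes):
    print('test fold class counts:', np.bincount(classes[test]))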
Now, we can train the scikit-learn LinearSVC estimator on our training set of participants and apply the trained classifier to our testing set, storing accuracy scores after each cross-validation fold:
scores = {}
for kind in kinds:
    scores[kind] = []
    for train, test in cv.split(pooled_subjects, classes):
        # *ConnectivityMeasure* can output the estimated subjects' coefficients
        # as 1D arrays through the parameter *vectorize*.
        connectivity = ConnectivityMeasure(kind=kind, vectorize=True)
        # build vectorized connectomes for subjects in the train set
        connectomes = connectivity.fit_transform(pooled_subjects[train])
        # fit the classifier
        classifier = LinearSVC().fit(connectomes, classes[train])
        # make predictions for the left-out test subjects
        predictions = classifier.predict(
            connectivity.transform(pooled_subjects[test]))
        # store the accuracy for this cross-validation fold
        scores[kind].append(accuracy_score(classes[test], predictions))
After we’ve done this for all of the folds, we can display the results!
mean_scores = [np.mean(scores[kind]) for kind in kinds]
scores_std = [np.std(scores[kind]) for kind in kinds]
plt.figure(figsize=(6, 4))
positions = np.arange(len(kinds)) * .1 + .1
plt.barh(positions, mean_scores, align='center', height=.05, xerr=scores_std)
yticks = [k.replace(' ', '\n') for k in kinds]
plt.yticks(positions, yticks)
plt.gca().grid(True)
plt.gca().set_axisbelow(True)
plt.gca().axvline(.8, color='red', linestyle='--')
plt.xlabel('Classification accuracy\n(red line = chance level)')
plt.tight_layout()

This is a small example to showcase nilearn features. In practice such comparisons need to be performed on much larger cohorts and several datasets. [1] showed that across many cohorts and clinical questions, the tangent kind should be preferred.
Combining nilearn and scikit-learn can allow us to perform many (many) kinds of machine learning analyses, not just classification! We encourage you to explore the Examples and User Guides on the Nilearn website to learn more.
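As one small illustration of that flexibility, ConnectivityMeasure follows the scikit-learn transformer interface, so (under that assumption) it can be dropped into a Pipeline and scored with cross_val_score; the estimator below is an arbitrary choice for demonstration, not a recommendation from this tutorial.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# the pipeline re-fits the connectivity step on each training fold,
# mirroring the manual cross-validation loop above
pipe = make_pipeline(
    ConnectivityMeasure(kind='tangent', vectorize=True),
    LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, pooled_subjects, classes, cv=cv)
print('mean accuracy:', cv_scores.mean())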