# Clustering¶

In the next sections, the method illustrations (results, inputs…) focus on our study of calcium imaging data from the mouse hippocampus (see Motivation for more information).

## Methods¶

The clustering module provides a clustering pipeline to group coactive elements in a .tif sequence.

Figure: a transient, where several pixels coactivate.

It takes as input an array of voltage traces, projects them onto a 2D plane using t-SNE ($$t$$-distributed Stochastic Neighbor Embedding) [1], and finds clusters using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [2].

### Dimensionality reduction¶

Before clustering the traces, it is important to build a relevant metric to assess their proximity. This metric can then be used to compute the distance matrix between the traces, which enables their projection into a 2D space, simplifying the representation and speeding up the clustering computation.

#### PCA¶

As the t-SNE algorithm needs low-dimensional inputs to run fast, we first reduce the dimensionality of the traces using PCA (Principal Component Analysis).

#### t-SNE¶

Then we define a relevant distance metric to assess the correlation between traces. In our case, the best-known one is Pearson correlation [5], which measures the linear correlation between two variables; Spearman’s rank correlation [5] is also particularly well suited. This metric is used to compute the distance matrix of the component traces, which t-SNE then processes to project them onto a 2D plane with a non-linear method.
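One way to sketch this step, assuming SciPy and scikit-learn: convert a rank correlation into a distance (`1 - correlation`, so highly correlated traces end up close) and hand the resulting matrix to t-SNE with `metric='precomputed'`. The shapes are illustrative:

```python
# Spearman-based distance matrix fed to t-SNE as a precomputed metric.
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
traces = rng.normal(size=(60, 300))  # 60 component traces

corr, _ = spearmanr(traces, axis=1)      # 60 x 60 rank-correlation matrix
dist = np.clip(1.0 - corr, 0.0, None)    # correlated traces -> small distance
np.fill_diagonal(dist, 0.0)

# metric='precomputed' requires init='random' (PCA init needs raw vectors).
embedding = TSNE(n_components=2, metric='precomputed', init='random',
                 perplexity=15, random_state=42).fit_transform(dist)
```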

### Clustering¶

Once the traces are projected in a 2D plane, we apply HDBSCAN.

Figure: a plot of the 2D-projected traces.

Each point represents one component. Components with “similar” traces (based on the distance matrix) lie close together, and the clustering algorithm groups them into colored clusters.

## Parameter selection¶

### Dimensionality reduction¶

This parameter section is related to the Dimensionality reduction.

#### Selecting normalization_method¶

• type: str
• default: 'z-score'

The normalization_method parameter selects the method for normalizing the traces. The choices are:

• 'null': no normalization
• 'mean-substraction': $$T = T - \overline T$$
• 'z-score': $$T = (T - \overline T) / \sigma(T)$$

Note

The normalization is applied here to make sure the dimensionality reduction receives input in the right format. If the traces were already normalized during skeletonization, there is no need to do it again.
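The three options above can be sketched as follows, assuming the traces are stored as a 2D array with one component trace $$T$$ per row (an illustrative helper, not the module's own function):

```python
# Sketch of the normalization_method options applied row-wise.
import numpy as np

def normalize(traces, method='z-score'):
    """Normalize each row (one trace T) according to `method`."""
    if method == 'null':
        return traces                        # no normalization
    mean = traces.mean(axis=1, keepdims=True)
    if method == 'mean-substraction':        # spelling as in the parameter value
        return traces - mean                 # T - mean(T)
    if method == 'z-score':
        return (traces - mean) / traces.std(axis=1, keepdims=True)
    raise ValueError(f"unknown normalization_method: {method}")

traces = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(10, 500))
z = normalize(traces, 'z-score')
```

After z-scoring, every trace has zero mean and unit standard deviation, which puts all components on the same scale before PCA.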

#### Selecting pca_variance_goal¶

• type: float
• default: 0.90

The pca_variance_goal parameter selects the fraction of variance to keep after PCA. It must be a float strictly less than 1. The higher it is, the more principal components are kept.
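This maps directly onto scikit-learn's `PCA`, where a float `n_components` below 1 means "keep the smallest number of components whose cumulative explained variance reaches this fraction" (the data here is random and purely illustrative):

```python
# A float n_components acts as a variance goal rather than a component count.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
traces = rng.normal(size=(100, 50))

pca = PCA(n_components=0.90).fit(traces)
kept = pca.n_components_                       # number of components retained
explained = pca.explained_variance_ratio_.sum()  # >= 0.90 by construction
```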

Tip

The advantage of setting a variance goal is that you keep control over how much information about the traces is retained. You can also pass an int if you know the exact number of components you want to feed to the subsequent t-SNE algorithm, but be aware that you might lose a lot of information if you set it too low.

Tip

The effects of this parameter can be seen on the “before/after” graphs by inspecting the dimensionality_reduction/ folder generated when launching the clustering module.

#### Selecting tsne_distance_metric¶

• type: str
• default: 'spearman'

The tsne_distance_metric parameter selects the distance metric for the t-SNE distance matrix computation. It can be 'pearson', 'spearman', or any metric in sklearn.metrics.pairwise.distance_metrics.

Tip

The effects of this parameter can be seen directly on the cluster plots by inspecting the hdbscan_clustering/ folder generated when launching the clustering module. For instance, the cluster traces found using 'spearman' will differ substantially from those found using 'euclidean'.

#### Selecting tsne_perplexity¶

• type: int
• default: 30

The tsne_perplexity parameter selects “the number of nearest neighbors that is used in other manifold learning algorithms” [1].

Important

The value needs to take into account the typical size of the clusters we want. For instance, if we want to cluster skeleton pixel components, the perplexity needs to be much higher than if we want to cluster branches. An example is given in Consecutive clusterings.

Tip

The effects of this parameter can be directly seen on the point distribution in the scatter plot by inspecting the hdbscan_clustering/ folder generated when launching the clustering module.

#### Selecting tsne_random_state¶

• type: int
• default: 42

The tsne_random_state parameter seeds the random number generator of the t-SNE algorithm. For reproducible results, pass an int.

### Clustering¶

This parameter section is related to the Clustering.

#### Selecting min_cluster_size¶

• type: int
• default: 5

The min_cluster_size parameter selects the minimum size of clusters. For more details on how to set this parameter correctly, see [4].

Tip

The effects of this parameter can be directly seen on the colors of the scatter plot by inspecting the hdbscan_clustering/ folder generated when launching the clustering module.

#### Selecting min_samples¶

• type: int
• default: 5

The min_samples parameter selects “the number of samples in a neighbourhood for a point to be considered a core point” [3]. For more details on how to couple this parameter with min_cluster_size, see [4].

#### Selecting hdbscan_metric¶

• type: str
• default: 'euclidean'

The hdbscan_metric parameter selects “the metric to use when calculating distance between instances in a feature array” [3].

[1] t-SNE, scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html