Randomly is a Python package from the Rabadan Lab for denoising single-cell data using Random Matrix Theory.
Denoising single-cell data is closely related to a problem addressed in nuclear physics in the 1950s and later in a variety of complex systems [7]. The physicist Eugene Wigner was interested in studying the energy spectra of heavy nuclei, such as uranium. Neutron scattering experiments revealed that the energy levels appear as peaks of the diffusion rate of neutrons as a function of the energy, as shown in J.B. Garg et al. [15].
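In the absence of any nuclear theory to explain this, Wigner became interested in studying the statistical distribution of the distance, s, between neighboring energy peaks. If the positions of the peaks were uncorrelated random numbers, the distances would follow a Poisson law, P(s) = exp(-s), with s measured in units of the mean spacing. The measured spacings instead showed a suppression of small distances, a behavior captured by what is now known as the Wigner Surmise.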
Wigner further postulated that his surmise is universal in the sense that it applies to any large, complicated quantum system regardless of the details of that system.
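To make the contrast between the two spacing laws concrete, here is a minimal sketch using only the Python standard library. It is not part of the Randomly package; the function names are illustrative assumptions. It writes down the Poisson law and the Wigner Surmise for real symmetric matrices, P(s) = (π/2) s exp(-π s²/4), and checks the latter against the eigenvalue spacings of random 2x2 real symmetric matrices, the case for which the surmise is exact.

```python
import math
import random


def poisson_spacing_pdf(s):
    """Spacing density for uncorrelated levels (unit mean spacing)."""
    return math.exp(-s)


def wigner_surmise_pdf(s):
    """Wigner Surmise for real symmetric matrices (unit mean spacing)."""
    return (math.pi / 2.0) * s * math.exp(-math.pi * s * s / 4.0)


def goe_2x2_spacings(n_samples, seed=0):
    """Eigenvalue spacings of random real symmetric 2x2 matrices, rescaled to unit mean."""
    rng = random.Random(seed)
    spacings = []
    for _ in range(n_samples):
        a = rng.gauss(0.0, 1.0)             # diagonal entries: variance 1
        c = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, math.sqrt(0.5))  # off-diagonal entry: variance 1/2
        # The two eigenvalues of [[a, b], [b, c]] differ by sqrt((a - c)^2 + 4 b^2).
        spacings.append(math.sqrt((a - c) ** 2 + 4.0 * b * b))
    mean = sum(spacings) / len(spacings)
    return [s / mean for s in spacings]


spacings = goe_2x2_spacings(20000)
small = sum(1 for s in spacings if s < 0.1) / len(spacings)
print(f"fraction of spacings below 0.1: {small:.4f}")
print(f"Wigner Surmise prediction:      {1 - math.exp(-math.pi * 0.01 / 4):.4f}")
print(f"Poisson prediction:             {1 - math.exp(-0.1):.4f}")
```

Small spacings are far rarer under the Wigner Surmise than under the Poisson law. This "level repulsion" is the signature of correlated eigenvalues.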
Later, starting in the 1960s, works by Dyson, Gaudin, Mehta, and others demonstrated that the Wigner Surmise is a consequence of the universality of Random Matrix Theory. Technically speaking, the eigenvalue statistics are independent of the distribution used to generate the matrix elements.
Recently, Random Matrix Theory has found applications in a diverse range of complex physical and mathematical systems, including zeros of the Riemann Zeta Function, quantum chaotic systems, quantum chromodynamics, string theory, cosmology, transport in disordered systems, crystal growth, telecommunications, finance theory, and neuroscience.
For the first time, we have extended this to the field of genomics. We observe that the eigenvalue spacing distribution of single-cell data resembles the Wigner Surmise:
As Dyson observed [7], the complex systems described by Random Matrix Theory are a “black box in which a large number of particles are interacting according to unknown laws.”
Happily, this is the essence of Systems Biology, which studies systems with a large number of components, such as genes, bio-molecules or cells, interacting according to unknown laws.
Deviations from the universal eigenvalue distribution (Marchenko-Pastur) predicted by Random Matrix Theory indicate the presence of a signal that can be further analyzed:
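As a rough illustration of how such deviations can be flagged, the sketch below computes the edges and density of the Marchenko-Pastur distribution and returns the eigenvalues that fall above the upper edge. It is not the Randomly API. It assumes a cells-by-genes matrix X whose entries have been standardized to variance `sigma2`, and eigenvalues taken from X^T X / n_cells. All function names are hypothetical.

```python
import math


def marchenko_pastur_bounds(n_cells, n_genes, sigma2=1.0):
    """Edges of the Marchenko-Pastur bulk for the matrix X^T X / n_cells."""
    gamma = n_genes / n_cells
    lower = sigma2 * (1.0 - math.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def marchenko_pastur_pdf(lam, n_cells, n_genes, sigma2=1.0):
    """Marchenko-Pastur density at eigenvalue lam (continuous part only)."""
    gamma = n_genes / n_cells
    lower, upper = marchenko_pastur_bounds(n_cells, n_genes, sigma2)
    if lam <= lower or lam >= upper:
        return 0.0
    return math.sqrt((upper - lam) * (lam - lower)) / (2.0 * math.pi * sigma2 * gamma * lam)


def candidate_signal(eigenvalues, n_cells, n_genes, sigma2=1.0):
    """Return the eigenvalues lying above the upper Marchenko-Pastur edge."""
    _, upper = marchenko_pastur_bounds(n_cells, n_genes, sigma2)
    return [lam for lam in eigenvalues if lam > upper]


lower, upper = marchenko_pastur_bounds(n_cells=1000, n_genes=250)
print(lower, upper)
print(candidate_signal([0.5, 1.0, 2.0, 5.7], n_cells=1000, n_genes=250))
```

In finite samples the largest noise eigenvalue fluctuates around the upper edge, so eigenvalues only slightly above it should be treated with caution.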
Equivalently, the appearance of localized eigenvectors indicates the presence of a signal:
In contrast, the delocalized eigenvectors correspond to noise described by a random matrix:
The delocalized eigenvector component distribution corresponds to the maximum entropy probability density function (PDF), which means that it carries no information. On the other hand, the localized eigenvector component distribution does not fit this PDF and thus contains information.
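One simple way to quantify the difference between localized and delocalized eigenvectors is the inverse participation ratio (IPR), sketched below with the standard library only. This is a common stand-in measure and an assumption of this example, not necessarily the test used by Randomly. For a normalized vector of length n, the IPR is close to 3/n when the components are Gaussian (delocalized) and approaches 1 when the weight sits on a single component (localized).

```python
import random


def inverse_participation_ratio(vector):
    """Sum of the fourth powers of the components of the normalized vector."""
    norm2 = sum(x * x for x in vector)
    return sum((x * x / norm2) ** 2 for x in vector)


n = 1000
rng = random.Random(0)

# Noise-like vector: weight spread over all components.
delocalized = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Signal-like vector: all weight on 5 components.
localized = [0.0] * n
localized[:5] = [1.0] * 5

ipr_delocalized = inverse_participation_ratio(delocalized)
ipr_localized = inverse_participation_ratio(localized)
print(ipr_delocalized)  # on the order of 3 / n
print(ipr_localized)    # 1 / 5 up to rounding
```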
However, there is a subtlety: part of the signal is an artifact due to the sparsity of the data.
One way of capturing this artifact is to completely randomize the single-cell matrix. According to the theory explained above, all the eigenvectors should then be delocalized and the eigenvalues should follow a Marchenko-Pastur distribution. However, the sparsity of the data modifies this predicted behavior. We call this artifact sparsity-induced eigenvector localization. In the following plot, we apply a Gaussianity test to identify the localized eigenvectors corresponding to this artifact:
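The randomization step can be sketched as follows. This example assumes that "completely randomize" means independently permuting each gene's values across cells, which destroys correlations between genes while keeping each gene's distribution, and hence its sparsity, unchanged. The function name and the toy matrix are illustrative, and the actual procedure in Randomly may differ.

```python
import random


def shuffle_within_genes(matrix, seed=0):
    """Independently permute each gene's values across cells.

    `matrix` is a list of rows (cells), each a list of equal length (genes).
    Returns a new matrix and leaves the input unchanged.
    """
    rng = random.Random(seed)
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    shuffled = [row[:] for row in matrix]
    for g in range(n_genes):
        column = [matrix[c][g] for c in range(n_cells)]
        rng.shuffle(column)
        for c in range(n_cells):
            shuffled[c][g] = column[c]
    return shuffled


# Toy count matrix: 4 cells (rows) by 4 genes (columns), mostly zeros.
counts = [
    [0, 3, 0, 1],
    [2, 0, 0, 0],
    [0, 0, 5, 0],
    [1, 0, 0, 4],
]
randomized = shuffle_within_genes(counts)
```

Any eigenvector of the randomized matrix that still looks localized cannot reflect biological structure, so it can be attributed to sparsity.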
Eigenvector localization, which is an example of Anderson localization, is present in many physical systems, but has never before been reported in biological data.
The take-home message is the following:
As an example, we consider single-cell transcriptomic data from a set of 6,573 peripheral blood mononuclear cells (PBMC) [12].
We have projected out the noise and selected the top 1,000 genes most responsible for signal according to the step just described.
We use these projected genes to perform standard hierarchical clustering and visualize the result using t-SNE.
We compare our clustering with the cell labels provided by Kang [12] and Butler [13].
We compare the performance of Randomly (RMT) with other algorithms in terms of cell-phenotype cluster resolution.
For completeness, we also compare with the raw data and with a selection of the top 300 genes based on highest variance (300 mvg).
Given the known cell phenotypes from Butler [13], the mean silhouette score quantifies the cluster resolution.
The comparison is performed as a function of the reduced-space number of dimensions (number of principal components).
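For readers unfamiliar with the metric, the mean silhouette score can be sketched in a few lines of standard-library Python. This is an illustrative implementation with Euclidean distances, not the code used for the benchmark. The points stand for cells in the reduced space and the labels for known phenotypes. Scores near 1 indicate well-separated clusters, and scores near or below 0 indicate overlapping ones.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette score of labeled points using Euclidean distances."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = mean_silhouette(cells, [0, 0, 1, 1])
mixed_up = mean_silhouette(cells, [0, 1, 0, 1])
print(well_separated, mixed_up)
```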
The following demo shows a t-SNE visualization of phenotypes of mouse-cortex cells provided in Zeisel et al. [14]. The clustering of cells changes as the t-SNE parameters and the principal eigen-components involved are varied.
As we sweep across the range of eigenvalues, we can obtain either structured or completely random distributions.
For now, you can cite the Randomly preprint as: Quasi-universality in single-cell sequencing data.