8000
Skip to content

Illico is a python library performing blazing fast asymptotic wilcoxon rank-sum tests, useful for single-cell RNASeq data analyses and processing.

License

Notifications You must be signed in to change notification settings

remydubois/illico

Repository files navigation

Illico

Overview

illico is a python library performing blazing fast asymptotic wilcoxon rank-sum tests (same as scanpy.tl.rank_genes_groups(… tie_correct=True)), useful for single-cell RNASeq data analyses and processing. illico's features are:

  1. πŸš€ Blazing fast: On K562 (essential) dataset (~300k cells, 8k genes, 2k perturbations), illico computes DE genes (with reference="non-targeting") in a mere 30 seconds. That's more than 100 times faster than both pdex or scanpy with the same compute ressources (8 CPUs).
  2. πŸ’  No compromise: on synthetic data, illico's p-values matched scipy.stats.mannwhitneyu up to a relative difference of 1.e-12, and an absolute tolerance of 0.
  3. ⚑ Thread-first: illico eventually parallelizes the processing (if specified by the user) over threads, never processes. This saves you from all the fixed cost of multiprocessing, such as spanning processes, duplicating data across processes, and communication costs.
  4. πŸͺ² Data format agnostic: whether your data is dense, sparse along rows, or sparse along columns, illico will deal with it while never converting the whole data to whichever format is more optimized.
  5. πŸͺΆ Lightweight: illico will process the input data in batches, making any memory allocation needed along the way much smaller than if it processed the whole data at once.
  6. πŸ“ˆ Scalable: Because thread-first and batchable, illico scales reasonably with your compute budget. Tests showed that spanning 8 threads brings a 7-fold speedup over spanning 1 single thread.
  7. πŸŽ† All-purpose: illico performs both one-versus-reference (useful for perturbation analyses) and one-versus-rest (useful for clustering analyses) wilcoxon rank-sum tests, both equally optimized and fast.

Approximate speed benchmarks ran on k562-essential can be found below. All the code used to generate those numbers can be found in tests/test_asymptotic_wilcoxon.py::test_speed_benchmark.

Test Format Illico Scanpy pdex
OVO (reference="non-targeting") Dense ~30s ~1h ~4h
OVO (reference="non-targeting") Sparse ~30s ~1h30 ~4h
OVR (reference=None) Dense ~30s ~11h X
OVR (reference=None) Sparse ~30s ~10h X

πŸ’‘ Note:

  1. This library only performs tie-corrected wilcoxon rank-sum tests, also known as Mann-Whitney test, also performed by scanpy.tl.rank_genes_groups(…, tie_correct=True). It does not perform wilcoxon signed-sum tests, those are less often used in for single-cell data analyses as it requires samples to be paired.
  2. Exact benchmarks ran on a subset of the whole k562 can be found at the end of this readme.
  3. OVO refers to one-versus-one: this test computes u-stats and p-values between control cells and perturbed cells. Equivalent to scanpy's rank_gene_groups(…, reference="non-targeting").
  4. OVR refers to one-versus-rest: this test computes u-stats and p-values between each group cells, and all other cells, for each group. Equivalent to scanpy's rank_gene_groups(…, reference="rest").
  5. This package is not intended at running out-of-core single cell data analyses like rapids-singlecell.

Installation

illico can be installed via pip, compatible with Python 3.11 and onward:

pip install illico -U

How to use

This library exposes one single function that returns a pd.DataFrame holding p-value, u-statistic and fold-change for each (group, gene). Except the few points below, the function and its arguments should be self-explanatory:

  1. It is required to indicate if the data you run the tests on underwent log1p transform. This only impacts the fold-change calculation and not the test results (p-values, u-stats). The choice was made to not try to guess this information, as those often lead to error-prone and potentially harmful rules of thumb.
  2. By default, illico.asymptotic_wilcoxon will use what lies in adata.X to compute DE genes. If you want a specific layer to be used to perform the tests, you must specify it.

DE genes compared to control cells

If you are working on single cell perturbation data:

from illico import asymptotic_wilcoxon

adata = ad.read_h5ad('dataset.h5ad') # (n_cells, n_genes)
de_genes = asymptotic_wilcoxon(
       adata,
       # layer="Y", # <-- If you want tests to run not on .X, but a specific layer
       group_keys="perturbation",
       reference="non-targeting",
       is_log1p=[False|True], # <-- Specify if your data underwent log1p or not
       )

The resulting dataframe contains n_perturbations * n_genes rows and three columns: (p_value, statistic, fold_change). In this case, the wilcoxon rank-sum test is performed between cells perturbed with perturbation p_i and control cells, for each p_i.

DE genes for clustering analyses

Let's say your .obs contains a clustering variable, assigning a label to each cell.

from illico import asymptotic_wilcoxon

adata = ad.read_h5ad('dataset.h5ad') # (n_cells, n_genes)
adata.obs["cluster"] = ...
de_genes = asymptotic_wilcoxon(adata, group_keys="cluster", reference=None, is_log1p=[False|True])

In this case, the resulting dataframe contains n_perturbations * n_genes rows and the same three columns: (p_value, statistic, fold_change). In this case, the wilcoxon rank-sum test is performed between cells belonging to cluster c_i and all the other cells (one-versus-the-rest), for all c_i.

illico is not faster than scanpy or pdex, is there a bug ?

illico relies on a few optimization tricks to be faster than other existing tools. It is very possible that for some reason, the specific layout of your dataset (very small control population, very low sparsity, very small amount of distinct values) result in those tricks being effect-less, or less effective than observed on the datasets used to develop & benchmark illico. It is also very possible that because of those, other solutions end up faster than illico ! If this is your case, please open a issue describing your situation.

illico's results (p-values or fold-change) does not match pdex or scanpy.

Test results (p-values)

Please open an issue, but before that: make sure that you are running asymptotic wilcoxon rank-sum tests as this is the only test exposed by illico.

  • pdex relies on scipy.stats.mannwhitneyu that runs exact (non asymptotic) only when there are 8 values in both groups combined, and no ties.
  • scanpy offers the possibility to run non-tie-corrected wilcoxon rank-sum tests, make sure this is disabled by passing tie_correct=True.
  • Also, illico uses continuity correction by default which is the best practice.

The test suite implemented in the CI and used to develop illico targets a precision of 1.e-12 compared to scipy, not scanpy. Consequently, there will be slight disagreement between scanpy's p-values and illico's p-values.

Fold-change

The fold-change computed by illico is the most naive form of the fold-change:
$$\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$$
If your data underwent log1p transform, np.expm1 is applied before computing the expectations (means). I know many definitions exist, and adding more control over this should not be complicated. If this is your case, please open an issue.

What about normalization and log1p

  1. illico does not care about your data being normalized or not, it is up to you to apply the preprocessing of your choice before running the tests. It is expected that illico is slower if ran on total-count normalized data by a factor ~2. This is because if applied on non total-count normalized data, sorting relies on radix sort which is faster than the usual quicksort (that is used if testing total-count normalized data).
  2. In order to avoid any unintended conversion, or relie on failure-prone rules of thumb, illico requires the user to indicate if the input data is log1p or not. This is only used to compute appropriate fold-change, and does not impact test (p-value and statistic) results.

What if my adata does not fit in memory ?

Optimizing this use case is highly non-trivial as efficiently chunking CSR or CSC matrices is much more complex than running adata[:, idxs]. Ran on a CSR matrix, this command will load (temporarily) the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow.

  1. If your adata holds the expression matrix in a dense array, illico shall work on it with very little extra work because batch-based by design.
  2. If your adata holds the expression matrix in a sparse (CSC or CSR) array, you have no other choice than manually chunking your array before running illico on batches. But, again, in this case I would advice to fallback to other solutions like rapids-singlecell.

How it works

The rank-sum tests performed by illico are classical, asymptotic, rank-sum tests. No approximation nor assumption is done. Illico relies on a few optimization tricks that are non-exhaustively listed below:

  1. πŸ§€ Sparse first: if the input data is sparse, that can be a lot less values to sort. Instead of converting it to dense, illico will only sort and rank non-zero values, and adjust rank-sums and tie sums later on with missing zeros.
  2. πŸ—‘οΈ Memory-conscious: ranking and sorting values across groups often requires to slice and convert the data numerous times, especially for CSC or CSR data. Memory allocations are minimized and optimized so as to ensure better scalability and lower overall memory footprint.
  3. 🧠 Sort controls only once: for the one-versus-reference use case, illico takes care of not repeatdly sorting the control values. Controls are sorted only once, after what each "perturbation" chunk is sorted, and the two sorted sub-arrays are merged (linear cost). Because there are often much more control cells than perturbed cells, this is a huge economy of processing.
  4. ➿ Vectorize everything: for the one-versus-ref use case, illico performs one single sorting of the whole batch (all groups combined) and sums ranks for each group in a vectorized manner. This allows to sort only once instead of repeatedly performing scipy.stats.mannwhitneyu on all-but-group-g and group-g, for all g - involving one sorting each.
  5. 🐍 Generally speaking, illico relies heavily on numba's JIT kernels to ensure GIL-free operations and efficient vectorization.

Benchmarks

Benchmarking against other solutions

In order for benchmarks to run in a reasonable amount of time, the timings reported below were obtained by running each solution on a subset of each cell line (20% of the genes). All solutions were find to scale linearly with the number of genes (columns in the adata). Extrapolating (x5) the elapsed times below will approximate runtime of those solutions on the whole datasets. Numbers in parenthesis report the multiplicative factor versus the fastest solution of each benchmark. A "benchmark" is defined by:

  1. The cell line (K562 essential, RPE1, Hep-G2, Jurkat) used as input.
  2. The data format (CSR, or dense) used to contain the expression matrix.
  3. The test performed: OVO (reference="non-targeting") or OVR (reference=None).

πŸ’‘ Keep in mind that pdex does not implement OVR test.

Runtime comparison for scanpy, pdex and illico on four cell lines.

Scalability

illico scales reasonably well with your compute budget. On the K562-essential dataset spanning 8 threads instead of 1 brings a 7-folds speedup.

---------------------- benchmark 'k562-dense-ovo': 4 tests -----------------------
Name (time in s)                                                    Mean
----------------------------------------------------------------------------------
test_speed_benchmark[k562-dense-100%-illico-ovo-nthreads=8]      29.6962 (1.0)
test_speed_benchmark[k562-dense-100%-illico-ovo-nthreads=4]      53.4369 (1.80)
test_speed_benchmark[k562-dense-100%-illico-ovo-nthreads=2]     100.3919 (3.38)
test_speed_benchmark[k562-dense-100%-illico-ovo-nthreads=1]     208.2443 (7.01)
----------------------------------------------------------------------------------

Why illico

The name illico is a wordplay inspired by the R package presto (now the Wilcoxon rank-sum test backend in Seurat). Aside from this naming reference, there is no affiliation or intended equivalence between the two. illico was developed independently, and although the statistical methodology may be similar, it was not designed to reproduce presto’s results.

Contributing

All contributions are welcome through merge requests. Developers are highly encouraged to rely on tox as the testing process is quite cumbersome.

Testing

The reason to be of this package is its speed, hence the need for extensive speed benchmarks in order to compare it exhaustively and accurately against existing solutions. tox is used to manage tests and testing environments.

pip install tox # this can be system-wide, no need to install it within an environment

πŸ’‘ The test suite below can be very long, especially the benchmarks (up to 48 hours). All "bench-" tox commands can be appended with the -quick suffix ensuring they are ran on 1 gene (column) of the benchmark data, just to make sure everything runs correctly. Example:

tox -e bench-all-quick # This will run speed and memory benchmarks for illico, scanpy and pdex
# OR:  tox -e bench-illico-quick # This will run speed and memory benchmarks for illico only
# OR :tox -e bench-ref-quick # This will run speed and memory benchmarks for scanpy and pdex only

Appending the -quick suffix will not write any result file or json inside the .benchmarks or .memray folders (that are versioned). Instead, benchmark result files will be written to /tmp. In this case, make sure to run tox -e memray-stats /tmp

Unit testing

Those tests are simply used to ensure the p-values and fold-change returned by illico are correct, should be quick to run: tox -e unit-tests ⚠️ Those tests do not run -quick as they use synthetic data that results in much shorter runtime.

Speed benchmarks

Speed benchmarks are ran against: pdex and scanpy as references. Those benchmarks take a lot of time (>10 hours on 8 CPUs) so they should not be re-ran for every new PR or release. However, if needed:

tox -e speed-bench-ref # Run speed benchmarks for scanpy and pdex, should not be re-ran, ideally.

Before issuing a new PR, in order to see if the updated code does not decrease speed performance, make sure to run:

tox -e speed-bench-illico #-quick

πŸ’‘ Because benchmark performance depends on the testing environment (type of machine or OS), it is recommended to run this benchmark from main on your machine as well. This will give you a clear comparison point apple-to-apple. Once the benchmarks have ran, you can cat the benchmark results in terminal with:

tox -e speed-bench-compare

Memory benchmark

Similar remarks as for the speed benchmarks:

tox -e memory-bench-ref # Should not be re-ran, ideally
tox -e memory-bench-illico # Should be re-ran before every new PR
tox -e memray-stats

Other tools available

  1. scanpy also implements OVO and OVR asymptotic wilcoxon rank-sum tests.
  2. pdex only implements OVO wilcoxon rank-sum tests.
  3. As of December 2025, rapids-singlecell has a pending PR adding a rank_genes_groups feature. I could not benchmark this solution as I had no GPU available, but it is expected that it runs at least as fast as illico, because GPU-based.

About

Illico is a python library performing blazing fast asymptotic wilcoxon rank-sum tests, useful for single-cell RNASeq data analyses and processing.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published
0