Project Links

This repository is part of the CI-SpliceAI software package published in PLOS One.

This is the project for offline variant annotation using the final models. You may also be interested in the code to train CI-SpliceAI, code comparing different tools on variant data, and the website providing online annotation of variants.

Install

We strongly advise you to install conda and create a new conda environment. This keeps your projects and their dependencies separate. So go ahead and install conda first.

CI-SpliceAI can be run on your CPU or GPU. CPU is much easier to set up, but inference is much slower than using a GPU. If you only need to annotate a few variants (less than 50 in total), don't bother setting up GPU support.

If you don't need GPU support, you can skip the next paragraph.

Setting up your GPU

For experienced users only. GPU is only supported for CUDA devices (i.e. NVIDIA graphic cards).

We suggest you install conda, CUDA and tensorflow-gpu first and ensure that your GPU is being used by tensorflow (by running python and importing tensorflow). You can use a command similar to this:

conda create -n cis_use python=3.8 tensorflow-gpu=2.2.0 cudnn=7.6.5.32=hc0a50b0_1 keras=2.4.3 -c conda-forge

We used this command with CUDA/11.0.

Please activate the new environment and make sure that your GPU is being recognised:

conda activate cis_use
python -c "import keras; from tensorflow.python.client import device_lib; print([x.name for x in device_lib.list_local_devices()])"

If you set-up your GPU correctly, the last line generated by this command should list all your GPUs. If it does not, do not proceed with installation until fixed. Once the GPU shows up, you can proceed with the next section and omit the [cpu] suffix.

Installing the python module

If you did not set-up your GPU, you need to create the conda environment first:

conda create -n cis_use python=3.8
conda activate cis_use

Then, install cispliceai like this (remove [cpu] suffix if you set up a GPU in the previous step):

pip install "cispliceai[cpu]"

Usage

cis-vcf [-h] [--annotation ANNOTATION] [--input INPUT] [--output OUTPUT] [--distance DISTANCE] [--batch BATCH] [--all] [--outside] [--mask] reference

positional arguments:
  reference             path to the reference fasta file (*.fa, *.fa.gz)

optional arguments:
  -h, --help            show this help message and exit

  --annotation ANNOTATION, -a ANNOTATION
                        annotation table with gene names, defaults to "grch38" (table included). You can specify "grch37" or a path to your own table of the same format.

  --input INPUT, -i INPUT
                        input VCF; defaults to stdin

  --output OUTPUT, -o OUTPUT
                        output VCF; defaults to stdout

  --distance DISTANCE, -d DISTANCE
                        maximum distance from the variant; defaults to 1000

  --batch BATCH, -b BATCH
                        maximum input batch size in MB. Be careful to leave enough space for the model and inference process. We recommend to increase this only for GPU processing. Defaults to 10

  --all                 annotate all affected genes/regions, not only the most significant

  --outside             keep nucleotides outside of annotated transcript areas (defined in annotation table); by default outside nucleotides are encoded as N

  --mask, -m            mask events to disregard gains of splice sites and losses of non-splice sites

Example command

Annotates an input.vcf file into an output.vcf file. Uses GrCh37 coordinates, outputs all genes affected, and masks to canonical losses or non-canonical gains.

cis-vcf -i input.vcf -o output.vcf -a grch37 --all --mask

Parameter `reference`

Please download a human reference genome file (i.e. the latest primary assembly used by GENCODE: GRCh38, GRCh37). You need to unzip it first!

If you want to save space (and decrease performance!), you can use a bgzipped version of the reference genome. To do so, install samtools and biopython (conda install samtools biopython -c bioconda), then run bgzip -c <file>.fa > <file>.fa.gz.

Annotation table (`--annotation`)

By default, the module uses an annotation table of collapsed isoforms of GENCODE on GRCh38. You can change this to GRCh37 by specifying "-a grch37".

You can also provide your own annotation table. Please use a table provided within this module, i.e. the GrCh38 one, as a template. It's a CSV file with a (gene) ID, the chromosome, strand, start and end of the transcript, and comma-separated junction start end end positions. On the forward strand, junction start/end is equivalent to donor/acceptor sites, on the reverse-strand it's the other way round.

In/Output (`--input` and `--output`)

You can specify the path to VCF files here. Or, you can omit the parameters pipe them to stdin and stdout cat variants.vcf | cis-vcf <reference> > output.vcf

Maximum distance from the variant (`--distance`)

The algorithm predicts on your REF annotation plus some nucleotides around it. This parameter changes the number of nucleotides predicted to the left and right. (Note: An additional 5,000 nucleotides in either direction is extracted for context but not used for annotation).

Batch size (`--batch`)

This parameter is only useful when using a GPU. We don't recommend changing it for CPU usage.

Each GPU has its own memory limit, so you need to find the optimal batch size yourself. If it's too small, your GPU won't be used efficiently, if it's too high, it will run out of memory.

We suggest finding the optimal value by measuring memory consumption using nvidia-smi, extrapolate usage to 100%, and try again. The parameter itself is a maximum value in MB, so it will work for different sequence lengths (depending on --distance and REF/ALT length)/

All annotations per allele (`--all`)

By default, only the most significant annotation per ALT allele is written to VCF (i.e. the one with the highest delta position). You can add the '--all' flag to instead output all alleles.

Retain sequences outside of annotated regions (`--outside`)

This is an experimental feature.

By default, all nucleotides outside of the overlapping annotated region (defined by the --annotation table) are replaced with N, except when the variant does not intercept with a region. You can instead always retain the sequences.

Mask scores (`--mask`)

This parameter masks the delta score: It will remove (i.e. set to zero) all losses of non-splice sites and all gains of splice sites. Therefore only losses of existing splice sites and gains of novel splice sites are annotated.

Output

The VCF output has an INFO column with the following annotations:

ID	Description
ALLELE	The ALT allele from your input
SYMBOL	Ensembl gene ID, or strand if no gene overlaps
DS_AG	Delta score (acceptor gain)
DS_AL	Delta score (acceptor loss)
DS_DG	Delta score (donor gain)
DS_DL	Delta score (donor loss)
DP_AG	Delta position (acceptor gain)
DP_AL	Delta position (acceptor loss)
DP_DG	Delta position (donor gain)
DP_DL	Delta position (donor loss)

Comparison between CI-SpliceAI and similar tools

There is a separate repository here comparing CI-SpliceAI with SpliceAI, MMSplice, SQUIRLS, and two MaxEntScan variants.

Differences to SpliceAI

This tool is very similar to the SpliceAI annotation tool. These are the most important differences:

Models are trained on collapsed GENCODE isoforms
The final models are trained on all chromosomes; SpliceAI models are only trained on training chromosomes
Models are optimised for inference and load/predict faster
All overlapping genes are analysed, not only the first one
When no genes overlap, the tool is run on the forward and backward strand
All variants are annotated, even multi-nucleotide variants
Better batch inference on GPU
Masking (--mask) is done prior to annotation, allowing the next-highest annotation to be returned rather than capping the output to zero (see this github issue)
Potential differences when filtering GENCODE annotations. SpliceAI's data pipeline was not released and their selection process not disclosed (see this github issue)

Current limitations

The tool is only using one CPU/GPU max. Multi-processing is currently not supported.

Prediction of custom sequences

Since version 1.1, you can predict the likelihood of each nucleotide being a splice site like so:

from cispliceai.model import CISpliceAI
from cispliceai.data import DNAPreprocessor

model = CISpliceAI()

# you can run multiple sequences in a batch
# you can input sequences of varying length, but it's your responsibility to ensure that the batch fits on memory!
# The sequences here are only an example, you should upload much longer slices to get sensible results
predictions = model.predict([
  DNAPreprocessor.onehot_and_pad('CTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACCAGAACCGGGGGCGGCTGGCAGACAAGAGGACAGTCGCCCTGCCTGCC'),
  DNAPreprocessor.onehot_and_pad('CTCCTCCTGCCCCACCAGAACCGGGGGCGGCTG'),
])

# Get acceptor and donor scores for the two sequences
acceptor_prob_1 = predictions[0][:, 1]
donor_prob_1 = predictions[0][:, 2]

acceptor_prob_2 = predictions[1][:, 1]
donor_prob_2 = predictions[1][:, 2]

Development

To install from a local directory

pip install -e "./[cpu]"

Build to release

pip install build
python -m build

Pull requests / Bug reports

Pull requests are welcome. Please submit bug reports to the git issues system.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.