SingleCell is a Python package for processing single-cell RNA-Seq data.
The following software must be installed:
- Python 3 (tested with Python 3.5)
- STAR (tested with version 2.5.3a)
- samtools (tested with version 1.4.1)
The STAR and samtools executables must both be in the PATH. To test this, run the following commands and check that they return the respective version identifiers:
$ STAR --version
STAR_2.5.3a
$ samtools --version
samtools 1.4.1
Using htslib 1.4.1
Copyright (C) 2017 Genome Research Ltd.
To install the package, change into the repository directory and run pip in editable mode:
$ cd singlecell
$ pip install -e .
To run the inDrop pipeline on your data, you first need a STAR genome index for the species your data comes from. A STAR index is a directory containing a set of files; for the human genome, these files total about 25 GB. You only need to create an index once per species; it is then reused by all future runs of the inDrop pipeline.
To generate an index, download and decompress (using gunzip) the genome sequence (in FASTA format) and the genome annotation (in GTF format) for your species from the Ensembl FTP server. For example, for human:
$ curl -O http://ftp.ensembl.org/pub/release-88/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ curl -O http://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.gtf.gz
$ gunzip -c Homo_sapiens.GRCh38.88.gtf.gz > Homo_sapiens.GRCh38.88.gtf
For the genome annotation (GTF) file, also keep the compressed version (which is why gunzip -c is used above rather than plain gunzip): the compressed file is the one used by the inDrop pipeline afterwards.
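For reference, GTF is a plain tab-delimited text format: each record has nine fields (sequence name, source, feature type, start, end, score, strand, frame, and an attribute list). A quick way to inspect this structure, shown here on a made-up example record rather than the real annotation file:

```shell
# A single illustrative GTF record (made up for demonstration);
# the nine fields are separated by tabs.
line=$'1\tensembl\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";'

# Print the feature type, coordinates, and strand of the record.
printf '%s\n' "$line" | awk -F'\t' '{
    print "feature: " $3
    print "coords:  " $1 ":" $4 "-" $5
    print "strand:  " $7
}'
```

The same awk pattern works on the downloaded annotation, e.g. to count gene records: awk -F'\t' '$3 == "gene"' Homo_sapiens.GRCh38.88.gtf | wc -l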
Now that you have those files ready, you can run the following:
$ indrop_generate_star_index.py -g Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-n Homo_sapiens.GRCh38.88.gtf \
-od star_index_human -os build_star_index_human.sh \
-ol build_star_index_human_log.txt \
-t 16
This will output the STAR index in the directory "star_index_human" (see the -od parameter) and will use 16 threads in parallel (-t), making the build process significantly faster than running it single-threaded.
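The -os option writes out the shell script that performs the actual build. Its exact contents are generated by the tool, but conceptually it wraps a standard STAR genomeGenerate invocation along the following lines (a sketch under that assumption, not the script's verbatim contents):

```shell
# Sketch of the STAR invocation that a build script like
# build_star_index_human.sh can be expected to contain.
# These are standard STAR genomeGenerate options; the generated
# script may differ in details.
star_cmd="STAR --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir star_index_human \
  --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile Homo_sapiens.GRCh38.88.gtf"

# Shown here rather than executed, since the build itself requires
# the genome files and substantial RAM for the human genome.
echo "$star_cmd"
```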
To run the inDrop pipeline, you need to first create a configuration file (in YAML format), which contains the locations (paths) of all the input files, specifies an output directory, and sets a few parameters (e.g., how many cells you want to include in the expression matrix). To generate a configuration file template that you can then modify according to your setup, run the following:
$ indrop_create_config_file.py -o my_configuration.yaml
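The template is plain YAML. Its actual parameter names are defined by the tool and can be read from the generated file; structurally, a filled-in configuration will look something like the following (all field names below are illustrative placeholders, not the real parameters):

```yaml
# Illustrative configuration sketch -- field names are placeholders;
# use the actual names from the generated template.
input:
  barcode_read_file: /data/sample_R1.fastq.gz
  mrna_read_file: /data/sample_R2.fastq.gz
  star_index_dir: /data/star_index_human
  # the pipeline uses the compressed annotation file
  genome_annotation_file: /data/Homo_sapiens.GRCh38.88.gtf.gz
output:
  output_dir: /data/indrop_output
parameters:
  num_cells: 2000
  num_threads: 16
```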
After adjusting the parameters in the configuration file, you can check whether everything is configured correctly:
$ indrop_check_pipeline.py -c my_configuration.yaml
If there are no errors, you can run the pipeline:
$ indrop_pipeline.py -c my_configuration.yaml