lr-splitpipe's Introduction

LR-splitpipe

LR-splitpipe is a pipeline for demultiplexing, debarcoding, and preparing LR-Split-seq data for downstream analysis.

Demultiplexing reads

To demultiplex reads for their Split-seq barcodes, use demultiplex.py.

Usage: python LR-splitpipe/demultiplex.py {all, score_linkers, find_bcs, process_bcs} [options]

Subcommands:
  all                  Run all steps of demultiplexing
  score_linkers        Run steps through scoring the linkers (step 1)
                       and generate QC plots based on scored linkers
  find_bcs             Run all steps through finding the barcodes (steps 1-3)
  process_bcs          Run steps after finding barcodes (steps 4-6)

Options:
  -f                   FASTQ file (gzipped or not) with LR-Split-seq or Split-seq reads
  -o                   Output file path / prefix
  -t                   Number of threads to run on (multithreading is recommended)
  -k                   Kit used for barcoding {WT, WT_mini, WT_mega}
  --l1_mm              Number of allowable mismatches in linker1
                        Default: 3
  --l2_mm              Number of allowable mismatches in linker2
                        Default: 3
  --chunksize          Number of lines to read in / process at a time
                        Default: 10**5
  --max_linker_dist    Maximum distance that a linker can be from the end of a read
                        Default: None (recommended ~200 based on our data)
  --max_read_len       Maximum read length to analyze for linkers / barcodes
                        Default: None
  --verbosity          Verbosity setting.
                         0: No output
                         1: QC statistics
                         2: QC statistics + progress
  --delete_input       Flag to delete temporary files (recommended!)
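
An example invocation (illustrative; substitute your own file names and options):

  python LR-splitpipe/demultiplex.py all \
    -f reads.fastq.gz \
    -o my_sample \
    -t 8 \
    -k WT \
    --max_linker_dist 200 \
    --verbosity 2 \
    --delete_input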

Demultiplexing steps:

1. Score linkers

First, LR-splitpipe uses alignment to find reads that contain the linkers, the static sequences that connect the combinatorial barcodes. Reads with no more than the allowed number of mismatches in each linker (3 by default; see --l1_mm and --l2_mm) are then used to find barcodes.
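
As a rough illustration of what "scoring a linker" means (a minimal sketch; LR-splitpipe itself scores linkers by alignment, and the linker sequence below is a placeholder, not the real one):

  LINKER1 = "ATCCACGTGCTTGAGA"  # placeholder sequence, NOT the actual linker

  def best_linker_hit(seq, linker=LINKER1, max_mm=3):
      """Return (start, n_mismatches) of the best linker match, or None."""
      k = len(linker)
      best = None
      for i in range(len(seq) - k + 1):
          # count mismatches of the linker against this read window
          mm = sum(a != b for a, b in zip(seq[i:i + k], linker))
          if mm <= max_mm and (best is None or mm < best[1]):
              best = (i, mm)
      return best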

2. Align linkers

Using the reads with valid linkers from step 1, determine the location of each linker within the read. Note: this step typically takes the longest; running with multiple threads (-t) helps substantially.

3. Find barcodes

With the locations of the linkers from step 2, extract the barcodes from each read.
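
A minimal sketch of the slicing involved (the 8-bp barcode length matches Split-seq, but the layout and ordering here are assumptions for illustration, not the tool's actual coordinates):

  BC_LEN = 8  # Split-seq barcodes are 8 bp

  def extract_bcs(seq, l1_start, l1_end, l2_end):
      """Illustrative: assumes one barcode directly flanks each linker edge."""
      bc_a = seq[l1_start - BC_LEN:l1_start]  # upstream of linker1 (assumed)
      bc_b = seq[l1_end:l1_end + BC_LEN]      # between the linkers (assumed)
      bc_c = seq[l2_end:l2_end + BC_LEN]      # downstream of linker2 (assumed)
      return bc_a, bc_b, bc_c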

4. Correct barcodes

Correct barcodes to those within edit distance 3 of the list of possible Split-seq barcodes. This step uses code and assets (the barcode whitelists and all barcodes within edit distance 3 of each) from the original Parse Biosciences Split-seq demultiplexing code, which was designed for short reads.
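
For speed the tool looks barcodes up in those precomputed edit-distance-3 assets; a brute-force sketch of the same rule:

  def edit_distance(a, b):
      """Levenshtein distance via dynamic programming."""
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              # deletion, insertion, or substitution/match
              cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
          prev = cur
      return prev[-1]

  def correct_bc(bc, whitelist, max_dist=3):
      """Return the unique whitelist barcode within max_dist edits, else None."""
      hits = [w for w in whitelist if edit_distance(bc, w) <= max_dist]
      return hits[0] if len(hits) == 1 else None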

5. Trim barcodes

After recording the barcode and UMI for each read, trim the barcode construct from the read sequence, as it would otherwise interfere with mapping.
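
Conceptually, the trim is just a slice (a sketch assuming the construct spans from some position construct_start to the read end; the real coordinates and orientation come from the linker alignment):

  def trim_construct(seq, qual, construct_start):
      """Keep only the cDNA portion of the read (illustrative coordinates)."""
      return seq[:construct_start], qual[:construct_start]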

6. Filter out duplicate UMIs

Filter out reads with duplicate UMIs per barcode, keeping the longest read.
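
A minimal pandas sketch of this rule (toy column names, not the tool's internal schema):

  import pandas as pd

  # toy table: one row per read with its corrected barcode, UMI, and length
  df = pd.DataFrame({'read_id':  ['r1', 'r2', 'r3'],
                     'bc':       ['AAAA', 'AAAA', 'CCCC'],
                     'umi':      ['TTTT', 'TTTT', 'GGGG'],
                     'read_len': [850, 1200, 600]})

  # keep the longest read for each (barcode, UMI) pair
  keep = df.loc[df.groupby(['bc', 'umi'])['read_len'].idxmax()]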

7. Write to fastq

Append the barcode/UMI of each read to its read name, and output the trimmed reads, labeled by their barcodes, to a fastq file.
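
A sketch of the output step (the exact read-name format here is an assumption for illustration):

  def write_demux_fastq(records, ofile):
      """records: iterable of (name, trimmed_seq, trimmed_qual, bc, umi) tuples."""
      with open(ofile, 'w') as f:
          for name, seq, qual, bc, umi in records:
              # label each read with its barcode and UMI in the read name
              f.write(f'@{name}_{bc}_{umi}\n{seq}\n+\n{qual}\n')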

Adding cell barcode as BAM tag

After running the demultiplexer, reads should be mapped and the alignments stored as a SAM file. In this format, the barcode, which the demultiplexer left in the fastq read header, can be moved into a cell barcode tag (CB:Z:NNNNN...) in the SAM file. This is useful for running TALON on the single-cell data down the line. To do this, run add_bam_tag.py.
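
A minimal pysam sketch of the same idea (the read-name format parsed here is an assumption for illustration, matching the demultiplexing sketch above, not necessarily the exact format add_bam_tag.py expects):

  import pysam

  with pysam.AlignmentFile('mapped.sam', 'r') as fin, \
       pysam.AlignmentFile('tagged.sam', 'w', template=fin) as fout:
      for read in fin:
          # assume name_BC_UMI read names; pull the barcode field
          bc = read.query_name.split('_')[-2]
          read.set_tag('CB', bc, value_type='Z')
          fout.write(read)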

Note: It is recommended to use the --merge_primers option. If you don't, oligo-dT-primed and random-hexamer-primed reads from the same cells won't be merged.

Usage: python LR-splitpipe/add_bam_tag.py

Options:
  -s                   SAM file with Split-seq barcode+UMI information in
                         the read name
  -k                   Kit used for barcoding {WT, WT_mini, WT_mega}
  --merge_primers      Merge reads that come from the same cell from different
                         priming strategies
  --suffix             Suffix to add to cell barcodes. Useful if merging
                         separate LR-Split-seq experiments that might have
                         overlapping barcodes otherwise
  -o                   Output file path/prefix
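
An example invocation (illustrative file names):

  python LR-splitpipe/add_bam_tag.py \
    -s mapped.sam \
    -k WT \
    --merge_primers \
    -o my_sample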

Citing this software

Please cite our publication and / or the Zenodo DOI for the release you used!

lr-splitpipe's People

Contributors

fairliereese, rvolden


lr-splitpipe's Issues

Question: Is sparsity of data normal in LR-splitseq compared to SR-splitseq data (originating from same sub-library)?

Hi there,

Apologies if this is not the right place to ask this question. I tried emailing you last week, but as I haven't heard back, I thought I would try asking here.

I have run LR-splitpipe and TALON on our LR splitseq data but the results are odd/sparse and I am trying to work out whether it is something to do with the pipeline or the quality of our data.

To give some background, we have 8 samples - 7 human and 1 chimp. Split-seq was performed on these samples (3 rounds of splitting, adding barcodes, then pooling). After the final pooling, the samples were divided into 4 sub-libraries. Short-read sequencing was performed on all 4 sub-libraries. I was able to run the resulting sequencing data through Parse Biosciences' splitpipe pipeline successfully, and the data look good when explored using Scanpy.

The 4th sub-library had undergone further amplification compared to the other 3 libraries so that it could be split and also sent for long read sequencing (PacBio amplicon sequencing).

From the SR data, there were ~10K cells in the human samples that passed Parse Biosciences' QC filtering (~2000 in chimp). For human, we have 3.7M long reads, which are filtered to 1.7M after running LR-splitpipe. I then split the human and chimp reads from one another and further filtered the reads to the set of barcodes in the SR dataset that passed QC filtering. This resulted in 96K reads for human and ~17.5K reads for chimp. For human, these 96K reads correspond to 3908 cells (meaning ~6100 cells from the SR data were not retained in the LR data). Have you seen this reduction in identified cells in your LR Split-seq compared to SR Split-seq data?

What is really perplexing is that most cells in the LR data have very few transcripts (1-5 transcripts per cell). These same cells in the SR data do not suffer from this issue (as they passed Parse's QC filtering). Have you seen this phenomenon before?

Histogram showing the distribution of transcripts across cells in the human LR samples (y-axis = no. of cells, x-axis = no. of transcripts, binned):
[image]

Histogram showing the distribution of transcripts across cells in the human LR sample where cells have <= 50 transcripts (y-axis = no. of cells, x-axis = no. of transcripts, binned):
[image]

This means that when I run TALON, the data are really sparse.

Do you think the fact that we have amplicon rather than full-length sequencing data could be contributing? You can see our read length distribution below.
[image: all_readlength_hist]

I am sorry for the long post, but I am a bit perplexed about what is going on. I would welcome any comments/suggestions that you have!

Many thanks,
Catherine

EOFError: Ran out of input

Hey fairliereese,

me again 👍

I am comparing a few pipelines to see which one extracts the BC best from my SPLiT-seq data.
One of them is yours, as I figured out (partly with help from your pipeline) that fixing barcodes might improve retrieval,
as well as the fact that not too long ago it also got published (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8383061/).
However, I am not a fan of that pipeline: there are very few options, the paper is not that clear on exactly what they do or how they do it, and I cannot read C (or at least not as well as Python), which makes it hard for me to follow.

Anyway, while running your script I ran into this error in the get_bcs_umis() part:

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/media/draco/lucask/LR-splitpipe/LR-splitpipe/utils.py", line 457, in get_bcs_umis
    else:
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandarallel/core.py", line 324, in closure
    return wrapped_reduce_function(
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandarallel/core.py", line 199, in closure
    return reduce_function(dfs, extra)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandarallel/data_types/dataframe.py", line 45, in reduce
    return pd.concat(datas, copy=False)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 294, in concat
    op = _Concatenator(
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 348, in __init__
    objs = list(objs)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandarallel/core.py", line 195, in <listcomp>
    get_dataframe_and_delete_file(output_file_path)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/pandarallel/core.py", line 189, in get_dataframe_and_delete_file
    data = pickle.load(file_descriptor)
EOFError: Ran out of input

I've changed a few things in your code; however, when changing them back (just replacing the file with a copy of the original), the error still occurs.

The things I changed are:

  1. linker1 sequence, and size to 30 - in get_linkers()
  2. scoring threshold to 30 (l1_m) - in align_linkers()
  3. df = df.loc[(df.l_dir == '+')] instead of df = df.loc[(df.l_dir == '+')|(df.l_dir == '-')], because I only care about forward reads anyway - in get_bcs_umis()

I keep --delete_input as False to make sure the files are not deleted, yet I still get the mentioned error.

For some reason it always stops after finding 18900000 reads.
I thought maybe it just runs out of reads; however, the input file should have plenty of reads, of which most are + strands
(which makes sense, as I put in 100M reads).

wc -l of _seq_linker_alignments.tsv:
  88343045 myTest2/_seq_linker_alignments.tsv
awk '{print $9}' myTest2/_seq_linker_alignments.tsv | sort | uniq -c:
  476924 -
  87866120 +
  1 l1_start

Maybe I am overlooking something or just don't get it.

Regards,
Lucas Kuijpers

Barcodes from Illumina

@fairliereese
this is a great tool! I am using it without many problems. Just one question:
I want to filter the BAMs by barcode with the -i_file option.
But the BC output of Parse's Illumina pipeline is not in nucleotides but as numbers for each well in each round, e.g. 4-45-62.
Is there a database to convert this "Split-seq demultiplexing code" into nucleotide sequences or vice versa?
Thanks
Marcel

Errors with WT_mini

Hi,
I am trying LR-splitpipe for a Parse WT_mini kit sequenced with nanopore. The v2.0 release worked but identified very few matches for bc1; otherwise everything works, so it is not obvious what the issue is. It looks like this and other things have been fixed now, so perhaps it would be good to have a new release soon :)

I encountered a few bugs when running the latest code: the argument parsing only works if --chunksize is present and is given as a string, and likewise --max_read_len needs to be defined if it is not in the argument list.

Also, for WT_mini the number of wells used is 12, so it needs to be changed from 28 in utils.py here:
def get_bc_round_set(kit):
    KIT_INT_DICT = {'custom_1': 1, 'WT': 48, 'WT_mini': 28, 'WT_mega': 96}
    kit_n = KIT_INT_DICT[kit]
    if kit_n == 12:
        bc_round_set = [['bc1','n24_v4'], ['bc2','v1'], ['bc3','v1']]

Thanks for making this available; with these changes everything seems to work fine, with a good % of reads classified.

Question regarding reads being filtered out due to invalid linker combinations

Hi,

Sorry, this is actually a question, not an issue. I ran LR-splitpipe on our data (single-cell Split-seq library sent for amplicon sequencing by PacBio). LR-splitpipe ran through no problem and I have also run the data through TALON, but the data are really sparse and I am not sure why. We have the SR sc-Split-seq data from the same library and the data look fine there.

One thing I noticed was that we start with 3.7 million reads. After running LR-splitpipe there are only 1.7M reads. I took a look at the log file created when running LR-splitpipe and found the following lines, which indicate that ~2M reads were filtered out due to having invalid linker combinations:

Found linkers for 3800000 reads
Found 1769691 reads with valid linker combinations

Is it normal for so many reads to be filtered out in this step?

Also, is there any documentation for interpreting the figures that are produced when running LR-splitpipe? For instance, I am not sure how to interpret the figures relating to linkers.

Additionally, I think that the figure showing the cells ranked by number of UMIs (see attached plot) has the axis labels swapped, i.e. the x-axis shows the # UMIs in log scale and the y-axis the cells ranked by number of UMIs. You can see in this plot that most cells have very few UMIs (which is not the case in the SR sc data). The cells here have been filtered to the set of barcodes that passed QC filtering in the short-read dataset, so it's very strange (additionally, there were 10000 cells that passed filtering in the SR sc dataset).

[image: LR_post_correction_knee_plot]

Thanks,
Catherine

Generating sam file after demultiplex.py?

I would like to run add_bam_tag.py after running demultiplex.py. I got a _demux.fastq file but no SAM files after running demultiplex.py. Are we supposed to use SAM files as input for add_bam_tag.py? Thanks!

"['l2_rc_score'] not in index" error

I am getting the following error when running 'score_linkers'. Somehow, l2_rc_score is not calculated, so it does not exist in the dataframe. Any pointers to the potential cause would be appreciated.

[2022-09-04 22:57:03] Found linker scores for 100000 reads
[2022-09-04 22:58:45] Found linker scores for 200000 reads
Traceback (most recent call last):
  File "LR-splitpipe-master/LR-splitpipe/demultiplex.py", line 361, in <module>
    if __name__ == '__main__': main()
  File "LR-splitpipe-master/LR-splitpipe/demultiplex.py", line 359, in main
    delete_input)
  File "LR-splitpipe-master/LR-splitpipe/demultiplex.py", line 61, in score
    plot_post_score_plots(df, oprefix)
  File "LR-splitpipe-master/LR-splitpipe/utils.py", line 173, in plot_post_score_plots
    plot_linker_scores(df, oprefix)
  File "LR-splitpipe-master/LR-splitpipe/plotting.py", line 26, in plot_linker_scores
    temp = df[cols].melt(id_vars='index', value_vars=val_vars)
  File "opt/miniconda3/envs/mybioconda/lib/python3.7/site-packages/pandas/core/frame.py", line 3464, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
  File "opt/miniconda3/envs/mybioconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 1314, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis)
  File "opt/miniconda3/envs/mybioconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 1377, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['l2_rc_score'] not in index"

Question: Why is the order of barcodes 1 and 3 swapped after running modified_add_bam_tag.py?

Hi,

Sorry for the multiple posts, but I am trying to pinpoint whether there is an issue with our data or with some step in our workflow.

Why when adding the barcode sequences as a bam tag are barcodes 1 and 3 swapped?

In get_read_info:

    bc_umi = read_bc[1]
    bc_umi = bc_umi.split('_')
    bc = ''.join(bc_umi[0:-1][::-1])
    umi = bc_umi[-1]

and then in main:

    bc3 = bc[:8]
    bc2 = bc[8:16]
    bc1 = bc[16:]
    if bc1 in bc_df.bc1_randhex.tolist():
        bc1 = bc_df.loc[bc_df.bc1_randhex==bc1, 'bc1_dt'].values[0]
        bc = bc3+bc2+bc1

The reason why this matters to me is that prior to doing the mapping with Minimap2, my fastq file contained human and chimpanzee reads. Additionally, we have SR Split-seq data that I ran through Parse Biosciences' splitpipe pipeline. Therefore I filtered our fastq files for:

  1. barcodes that were in the filtered set of cells after running Parse Biosciences' pipeline
  2. chimp / human barcodes, producing separate fastq files.

I therefore wondered what the logic is for later swapping the sequence order of barcodes 1 and 3, to make sure that I am not somehow filtering incorrectly.

Thanks,
Catherine

error: unrecognized arguments: -k WT True

I would like to use LR-splitpipe but have encountered a problem when running it as follows:

python /projects/b1177/software/LR-splitpipe/LR-splitpipe/demultiplex.py find_bcs \
 -f ${inDir}/m54328U_220930_232205.hifi_reads.fastq.gz \
 -o ${outDir} \
 -t 8 \
 -k WT \
 --chunksize 10**5 \
 --max_linker_dist 200 \
 --verbosity 2 \
 --delete_input True

Getting the following error:

usage: demultiplex.py [-h] {all,score_linkers,find_bcs,process_bcs} ...
demultiplex.py: error: unrecognized arguments: -k WT True

Can you shed any light on why I am getting this error and how to get around it?
Thanks!

Running error

Hello fairliereese,

I am trying to run your tool on my data, however I get an error as I do so.
The data I use are not long reads but short reads; however, I don't think that should stop the pipeline.
When I run this command I get the following error:
python ../LR-splitpipe/demultiplex.py score_linkers -f /media/draco/lucask/LR-splitpipe/myTest/test.fastq -o /media/draco/lucask/LR-splitpipe/myTest/ -t 8 --verbosity 3 --delete_input

Traceback (most recent call last):
  File "../LR-splitpipe/demultiplex.py", line 323, in <module>
    if __name__ == '__main__': main()
  File "../LR-splitpipe/demultiplex.py", line 303, in main
    fname = find_bcs(fastq, oprefix, t,
  File "../LR-splitpipe/demultiplex.py", line 109, in find_bcs
    plot_linker_scores(df, oprefix)
  File "/media/draco/lucask/LR-splitpipe/LR-splitpipe/plotting.py", line 26, in plot_linker_scores
    g.map(plot_hist, 'value')
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/seaborn/axisgrid.py", line 710, in map
    self._facet_plot(func, ax, plot_args, kwargs)
  File "/home/lucask/miniconda3/envs/scvelo/lib/python3.8/site-packages/seaborn/axisgrid.py", line 806, in _facet_plot
    func(*plot_args, **plot_kwargs)
  File "/media/draco/lucask/LR-splitpipe/LR-splitpipe/plotting.py", line 16, in plot_hist
    for l in mismatch_lines:
UnboundLocalError: local variable 'mismatch_lines' referenced before assignment

I am not good enough at Python to fix this issue myself; I have tried putting a global mismatch_lines in the plot_hist function, but that does not fix it.
I think it has to do with the fact that the if/elif statements in plot_hist don't match anything.
The pipeline does make the _table.tsv and _seq_linker_alignment_scores.tsv.

The test.fastq is a fastq file with 100k reads from my Read2 file. I have noticed in my previous attempts to analyze my data with STARsolo that there is a reasonable number of reads that have a shift in barcode position due to mismatches in the linkers. I want to test how many reads have a linker and how many have shifts or other mismatches.
I have also tried @paulranumn11's code, with which I get decent data back; however, I would still like to know how many of my reads are of "lower" quality.
Your program seems at a glance to do exactly what I want.
I hope you can help.
Thanks in advance.

Regards,
Lucas
