lhqing / cemba_data Goto Github PK

View Code? Open in Web Editor NEW

14.0 14.0 7.0 12.59 MB

Mapping pipeline for snmC-seq based technologies.

Home Page: https://hq-1.gitbook.io/mc/

License: MIT License

Python 63.88% Shell 0.86% Jupyter Notebook 35.14% Dockerfile 0.13%

methylation pipeline sequencing single-cell snmc-seq snmc-seq2 snmct-seq

cemba_data's Introduction

Hi there 👋

I'm Hanqing Liu, read more about my publications at my personal website.

cemba_data's People

Contributors

Stargazers

Watchers

Forkers

shengyongniu jksr sel041 chooliu dingwb mbatiuk

cemba_data's Issues

potential error in yap make-sample-sheet plateinfo_and_samplesheet.py

cemba_data/cemba_data/demultiplex/plateinfo_and_samplesheet.py

Line 94 in 788e83c

ll = re.split(r' |,|\t', 'line')

I changed ll = re.split(r' |,|\t', 'line')
to ll = re.split(r' |,|\t', line)
and yap make-sample-sheet works

Bug: cutadapt adapter trimming with default 3 bases overlap instead of 6 - bug in Snakemake templates

cutadapt command in snakemake templates (for example cemba_data/mapping/Snakefile_template/mc.Snakefile) has a bug

Trim reads

rule trim_r1:
input:
"fastq/{cell_id}-R1.fq.gz"
output:
fq=temp("fastq/{cell_id}-R1.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R1.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r1_left_cut} -u -{r1_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"

rule trim_r2:
input:
"fastq/{cell_id}-R2.fq.gz"
output:
fq=temp("fastq/{cell_id}-R2.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R2.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r2_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"

Adaptor trimming is defined here:

"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "

then output is piped to another instance of cutadapt:

"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "

that has -O 6 option for specifying minimal overlap of adaptor and read

But -O option is not taken into account since adaptor trimming happened already at the first execution of cutadapt

Thus reads are not trimmed from adaptor if there is 6 bases overlap but only if there is 3 bases overlap

Broad institute version of this pipeline (https://broadinstitute.github.io/warp/docs/Pipelines/snm3C/README#snm3c-installation) does not have this problem:

      start=$(date +%s)
      echo "Run cutadapt"
      /opt/conda/bin/cutadapt \
      -a R1Adapter=~{r1_adapter} \
      -A R2Adapter=~{r2_adapter} \
      --report=minimal -O 6 -q 20 \
      -u ~{r1_left_cut} \
      -u -~{r1_right_cut} \
      -U ~{r2_left_cut} \
      -U -~{r2_right_cut} \
      -Z \
      -m ${min_read_length}:${min_read_length} \
      --pair-filter 'both' \
      -o ${sample_id}-R1_trimmed.fq.gz \
      -p ${sample_id}-R2_trimmed.fq.gz \
      ${sample_id}-R1_sorted.fq ${sample_id}-R2_sorted.fq \
      > ${sample_id}.trimmed.stats.txt
      end=$(date +%s) 
      elapsed=$((end - start)) 
      echo "Elapsed time to run cutadapt: $elapsed seconds"

changes about file names

Suggestions from Eran and will incorporate in the near future:

It might be good to name allc files as XXX.allc instead of allc_XXX.txt. This would make it clear that allc is a standardized file format, like .fastq, .bed, .sam, etc.
I agree, will adopt this.
It might be good to define a compressed allc file format (e.g. .allcz?) or just .bgz?
All the ALLC generated by YAP is compressed by bgzip and also indexed by tabix. I would prefer allc.gz because .bgz will likely break many path wildcards that already exist in mine and other’s code, which causes additional bugs/issues.
Does the .mcds file have a header or some way of storing metadata, e.g. which reference was used for quantification, etc?
Yes, MCDS is HDF5 based, technically it can store any metadata. I will take notes on that and add more metadata in the “allcools generate-mcds” function.

installation error

Hi, I get the following error when i try to install cemba_data. The same error message occurs when I try installation using pip. How can I fix this?

Building wheel for isal (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /u/local/apps/python/3.6.1/gcc-4.9.3_MKL-2018/bin/python3.6 /u/local/apps/python/3.6.1/gcc-4.9.3_MKL-2018/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmp73bak6zh
cwd: /work/tmp/pip-install-wv6n5h_e/isal
Complete output (27 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/isal
copying src/isal/init.py -> build/lib.linux-x86_64-3.6/isal
copying src/isal/igzip.py -> build/lib.linux-x86_64-3.6/isal
copying src/isal/init.pxd -> build/lib.linux-x86_64-3.6/isal
copying src/isal/crc.pxd -> build/lib.linux-x86_64-3.6/isal
copying src/isal/igzip_lib.pxd -> build/lib.linux-x86_64-3.6/isal
copying src/isal/version.pxd -> build/lib.linux-x86_64-3.6/isal
copying src/isal/_isal.pyx -> build/lib.linux-x86_64-3.6/isal
copying src/isal/isal_zlib.pyx -> build/lib.linux-x86_64-3.6/isal
creating build/lib.linux-x86_64-3.6/isal/isa-l
copying src/isal/isa-l/LICENSE -> build/lib.linux-x86_64-3.6/isal/isa-l
copying src/isal/isa-l/README.md -> build/lib.linux-x86_64-3.6/isal/isa-l
copying src/isal/isa-l/Release_notes.txt -> build/lib.linux-x86_64-3.6/isal/isa-l
running build_ext
skipping 'src/isal/isal_zlib.c' Cython extension (up-to-date)
skipping 'src/isal/_isal.c' Cython extension (up-to-date)
configure.ac:4: error: Autoconf version 2.69 or higher is required
configure.ac:4: the top level
autom4te: /usr/bin/m4 failed with exit status: 63
aclocal: autom4te failed with exit status: 63
autoreconf: aclocal failed with exit status: 63
error: [Errno 2] No such file or directory: '/tmp/tmp4_7s1z0f/configure'

ERROR: Failed building wheel for isal
Successfully built cemba-data
Failed to build isal
ERROR: Could not build wheels for isal which use PEP 517 and cannot be installed directly

outdated manual at https://hq-1.gitbook.io/mc

manual is outdated at https://hq-1.gitbook.io/mc

Such as there is no mentioning about alternative to bismark hisat3n in the reference section: https://hq-1.gitbook.io/mc/prepare/prepare-mapping-config/prepare-reference-files

However hisat3n is an option in the parameters for yap default-mapping-config

--hisat3n_dna_ref HISAT3N_DNA_REF
Path to the hisat-3n DNA reference (default: None)
--hisat3n_rna_ref HISAT3N_RNA_REF
Path to the hisat-3n RNA reference (default: None)

Github env.yaml also mentions more up to date packages for environment setup than guidline on https://hq-1.gitbook.io/mc/installation, as well as alternative environment hisat3n_env.yml

bismark_reference is not defined in Snakefile

Thanks for sharing your data analysis pipeline

There is an issue when performing read mapping using your pipeline

After demultiplexing into cell level fastq, Snakefile is created in each multiplex group subdirectory to perform mapping

However this error appears when performing snakemake:

"RuleException in line 97 of /scratch/batiuk/primary/fastq/p/cell_level/Lightning-5-E5/Snakefile:
NameError: The name 'bismark_reference' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}"

Snakefile is not defining bismark_reference, however it is present in mapping_config.ini and should be copied over into Snakefile

[mapping reference]
bismark_reference= /home/batiuk/genomes/mus/mus109_sun1-gfp_lambda_bs
; reference directory of bismark

Manually editing Snakefile and adding bismark_reference solves error:

bismark_reference= '/home/batiuk/genomes/mus/mus109_sun1-gfp_lambda_bs'

mc template

expand("allc-{mcg_context}/{cell_id}.{mcg_context}-Merge.allc.tsv.gz", cell_id=CELL_IDS),
-->
expand("allc-{mcg_context}/{cell_id}.{mcg_context}-Merge.allc.tsv.gz", cell_id=CELL_IDS, mcg_context=mcg_context),

incomplete fasta qc file

for yet unknown reason pipeline sometimes generate incomplete fasta qc file (.trimmed.stats.tsv) which affects summary generation.
need to identify the cause.

yap: error: unrecognized arguments (yap start-from-cell-fastq)

Hi, I'm trying to use YAP with directly downloaded demultiplexed cell-level FASTQ files from the snm3C-seq dataset through NEMO (https://assets.nemoarchive.org/dat-sig83t9).

So i was using yap start-from-cell-fastq, but there was an error with my command.

yap start-from-cell-fastq -o outdir -config m3c-mm10.mapping_config.ini -fq sample/*.fq.gz

yap: error: unrecognized arguments: sample/10E_HCV_3C_1-1-K6-A13-R2.fq.gz sample/10E_HCV_3C_1-1-K6-A14-R1.fq.gz sample/10E_HCV_3C_1-1-K6-A14-R2.fq.gz sample/10E_HCV_3C_1-1-K6-A1-R1.fq.gz sample/10E_HCV_3C_1-1-K6-A1-R2.fq.gz

I have 6 fq.gz files in sample/ directory.
10E_HCV_3C_1-1-K6-A13-R1.fq.gz
10E_HCV_3C_1-1-K6-A13-R2.fq.gz
10E_HCV_3C_1-1-K6-A14-R1.fq.gz
10E_HCV_3C_1-1-K6-A14-R2.fq.gz
10E_HCV_3C_1-1-K6-A1-R1.fq.gz
10E_HCV_3C_1-1-K6-A1-R2.fq.gz

Please tell me what is my mistake.

Thank you.

fastq pattern path error in demultiplex

Hello,

I am running yap demultiplex with YAP with the following command:

yap demultiplex --fastq_pattern "/path/to/fastq_files/snmC-seq2/*/*fastq.gz" --output_dir /outdir/demux_output --config_path ./hg19_mapping_config.txt --cpu 4

However, I keep getting the following error:

Message: 'No fastq name remained, check if the name pattern is correct.'

Am I formatting the fastq_pattern correctly? I am certain that is where the fastq files are. Do you have an exact example of how to format the fastq_path?

Thank you.
Eduardo

Bismark / Bowtie not writing output to directory

Running YAP (after re-installation), in Stampede2, I still notice that the BAM files are not getting created in the output directory during the Bismark rule.

I get zero size files in the bam directory for "MRSA_7-MRSA_8-E4-AD010-R1.trimmed_bismark_bt2.bam" and "MRSA_7-MRSA_8-E4-AD010-R1.trimmed_bismark_bt2_SE_report.txt"

This is how I run the command interactively (using the command from an actual error file) -

(mapping) login2.stampede2(1275)$ bismark /scratch/02332/hmanoj/utils/hg38 --bowtie2 fastq/MRSA_7-MRSA_8-E4-AD010-R1.trimmed.fq.gz --pbat -o bam/ --temp_dir bam/

I can see no errors. But I get zero size BAM files.

Working directory:
/scratch/02332/hmanoj/PoolAX_Test/snakemake/sbatch/Pool_AX_Test_3cells_sbatch

Please see complete run information attached.

CompleteRunInfo_StampedeYAP.txt

demultiplex

I would suggest running the config file check in the demultiplex step at the beginning. The demultiplex shouldn't be allowed to start if an old version of the config file is provided.

pipeline installation fails on pip install allcools - libxcrypt dependency is required

pipeline installation described here fails: https://hq-1.gitbook.io/mc/installation

pip install allcools requires additional dependency libxcrypt

performing mamba install libxcrypt solves the issue

Value Error rised when running default-mapping-config

Hi I am trying to get the Methylome and transcriptome data from GSE14049 in single cell level, however I am running yap default-mapping-config --barcode_version V1 --mode mct --bismark_ref ref/Bisulfite_Genome --genome_fasta ref/GRCh37.p13.genome.fa --star_ref ref/star_ref --gtf ref/gencode.v19.annotation.gtf I got the ValueError: chrom_size_path must be provided., even I specified the mode in for SnmCT-seq, how can I solve this issue?