boyle-lab / f-seq2 Goto Github PK
View Code? Open in Web Editor NEWImproving the feature density based peak caller with dynamic statistics
License: GNU General Public License v3.0
Improving the feature density based peak caller with dynamic statistics
License: GNU General Public License v3.0
Hi,
I'm working with fseq2 with murine data.
When I run fseq2, it works on a part of my .bed file (MT, X, Y, ....) and leaves out the "numerical" chromosomes (1 --> 18).
When I sub-select chromosomes 1 to 18, this is what it shows me:
PROGRAM: fseq2.0.3 callpeak
F-Seq Version 2.0.3
#1: Read in files and calculate parameters
/env/products/f-seq2/2.0.3/lib/python3.8/site-packages/fseq2/fseq2.py:57: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
bed['end_new'] = (bed[['end', 'strand']]).max(axis=1)
Error: No reads contained in input file(s).
Note that my file contains 12868084 lines and its format (bed) seems classic:
18 90600554 90600572 N1:1:HCGMLBGX5:3:23603:4197:5544/2 34 +
18 90600563 90600620 N1:1:HCGMLBGX5:2:13301:23722:11856/2 32 +
18 90600609 90600642 N1:1:HCGMLBGX5:3:23603:4197:5544/1 34 -
18 90600693 90600728 N1:1:HCGMLBGX5:2:13301:23722:11856/1 32 -
...
Do you have any idea what's going on?
Thanks !
I got this error while loading to UCSC browser as custom track which seems there:
Error File 'GM12376-DEB-1000_peaks.narrowPeak.chr' - Error line 1 of custom track: thickStart after thickEnd
I have generated a narrow peaks from ATAC-seq data using the following command:
fseq2 callpeak ${bemin}/${filename}.prichr.sorted.rmDup.namesorted.bam \
-f 0 -l 600 -t 4.0 -pe -cpus ${PPN} -nfr_upper_limit 150 -pe_fragment_size_range auto -p_thr 0.05 -standard_narrowpeak -sort_by pValue -o fseqpeaks_21068Rsy/${filename} \
-name ${filename}
The data is from paired-end ATAC-Seq. I got two files:
GM12376-DEB-100_peaks.narrowPeak
GM12376-DEB-100_summits.narrowPeak
On your github note, you suggested the summits file is to visualize in genome viewer such as UCSC. I see both has written in strand specific manner but without no strand information.
The summit file is one base long based on start and end (which may show the center summit of the peak region). The only thing I have added to the fseq2 output was to add chr
at the start of the line (I used Ensembl as ref genome).
chr3 93470630 93470631 GM12376-DEB-1000_summit_0 2829 . 395.179 282.943 277.703 0
chr16 1782831 1782832 GM12376-DEB-1000_summit_1 1746 . 296.913 174.654 169.714 0
chr4 781944 781945 GM12376-DEB-1000_summit_2 1689 . 296.229 168.904 164.14 0
chr17 82023063 82023064 GM12376-DEB-1000_summit_3 1677 . 328.671 167.777 163.138 0
chr19 42268966 42268967 GM12376-DEB-1000_summit_4 1598 . 319.611 159.812 155.27 0
chr19 14117960 14117961 GM12376-DEB-1000_summit_5 1542 . 249.44 154.239 149.776 0
chr19 5904543 5904544 GM12376-DEB-1000_summit_6 1526 . 316.756 152.681 148.294 0
chr8 25458597 25458598 GM12376-DEB-1000_summit_7 1526 . 269.523 152.632 148.294 0
chr9 129080973 129080974 GM12376-DEB-1000_summit_8 1509 . 245.81 150.961 146.675 0
chr7 100586399 100586400 GM12376-DEB-1000_summit_9 1447 . 285.673 144.729 140.488 0
Any idea?
Line 182 in 2eba80f
Came across this when testing on some very small example files. If the source file actually yields no cuts, then calculating the cut_density
in calculate threshold
throws an error. It would be perhaps more helpful for the tool to recognize when an input file contains no cuts and warn the user and exit with a message.
Corresponding version for tag would be 2.0.3
Please add tagged release corresponding to the version used in Zhao and Boyle, 2021.
Hello,
Another big problem when using f-seq2:
Are these known problems?
When I use f-seq1, the calls are done correctly.
Thanks,
Kévin
Hi,
I'm trying to run F-Seq2 on 3' RNA single end sequencing data to identify polyadenylation peaks. I'm running the following
fseq2 callpeak input.bam -chrom_size_file mm10.chromsizes.bed -o output_dir -name sample1 -sig_format bigwig -standard_narrowpeak -l 50 -t 8.0
However, this results in the following error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib64/python3.8/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/lib64/python3.8/site-packages/fseq2/fseq2.py", line 342, in run_kde_wo_control
result_df = call_peaks(chrom=chrom, first_cut=first_cut, kdepy_result=kdepy_result,
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/lib64/python3.8/site-packages/fseq2/fseq2.py", line 528, in call_peaks
peak_indexes, properties = find_peaks(kdepy_result, height=min_height, distance=min_distance,
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/lib64/python3.8/site-packages/scipy/signal/_peak_finding.py", line 934, in find_peaks
raise ValueError('`distance` must be greater or equal to 1')
ValueError: `distance` must be greater or equal to 1
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/bin/fseq2", line 440, in <module>
main(temp_dir_name)
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/bin/fseq2", line 49, in main
submain(args)
File "/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/lib64/python3.8/site-packages/fseq2/callpeak_main.py", line 183, in main
results = pool.starmap(fseq2.run_kde_wo_control, input_param_ls)
File "/usr/lib64/python3.8/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/usr/lib64/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
ValueError: `distance` must be greater or equal to 1
Because the error mentioned distance I tried adding -f 150
to my call. This made the program run longer (and hence I think further). However, then I got the following error:
/home/campbell/mgeuenic/.local/share/virtualenvs/schramek-3p-analysis-XaDVbhKH/lib64/python3.8/site-packages/scipy/signal/signaltools.py:507: RuntimeWarning: invalid value encountered in multiply
ret = ifft(sp1 * sp2, fshape, axes=axes)
Ultimately I want to estimate the fragment size from the data though, rather than setting it. Is this possible?
I have been unable to run callpeak_idr
.
callpeak
runs ok.
Here is a summary:
Installed version of numpy on my computer:
numpy 1.26.4 py39h474f0d3_0 conda-forge
The call and error
fseq2 callpeak_idr \
-o test/callpeak_idr \
-control_file_1 ctl_rep1.bam bam ctl_rep2.bam \
-control_file_2 ctl_rep1.bam bam ctl_rep2.bam \
-name_1 rep1 \
-name_2 rep2 \
-standard_narrowpeak \
-v \
-l 50 \
-t 8 \
-tp 4 \
-cpus 28 \
-pe \
--plot rep1_rep2.IDR.png
tf_rep1.bam \
tf_rep2.bam \
>callpeak_idr.out 2>callpeak_idr.err
#3: Done
-------------------------------------
Thanks for using F-seq2.0.3!
Workflow#4: idr2.0.3 with samples(1,2), oracle peak list(3)
Traceback (most recent call last):
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/bin/fseq2", line 440, in <module>
main(temp_dir_name)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/bin/fseq2", line 54, in main
submain(args)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/lib/python3.9/site-packages/fseq2/callpeak_idr_main.py", line 60, in main
idr_main(args)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/lib/python3.9/site-packages/fseq2/idr_main.py", line 20, in main
r1, r2 = build_rank_vectors(merged_peaks)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/lib/python3.9/site-packages/fseq2/idr_2_0_3/idr.py", line 302, in build_rank_vectors
return ( numpy.array(rank1, dtype=numpy.int),
File "/home/msimenc/software/mambaforge/envs/fastq2peaks/lib/python3.9/site-packages/numpy/__init__.py", line 324, in __getattr__
raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
I created a new environment and followed the conda instructions on the install page:
conda install -c bioconda -c dmentipl -c samzhao fseq2
But it didn't install fseq2, so I used pip again
pip install fseq2
Now I have this version of numpy
numpy 1.19.5 pypi_0 pypi
But fseq2 is a different version now because it didn't recognize this argument
fseq2: error: unrecognized arguments: -standard_narrowpeak
I removed that argument, but I get a different error:
#3: Done
-------------------------------------
Thanks for using F-seq2.0.2!
Workflow#4: idr2.0.3 with samples(1,2), oracle peak list(3)
Initial parameter values: [0.10 1.00 0.20 0.50]
Traceback (most recent call last):
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/bin/fseq2", line 428, in <module>
main(temp_dir_name)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/bin/fseq2", line 54, in main
submain(args)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/callpeak_idr_main.py", line 60, in main
idr_main(args)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/idr_main.py", line 40, in main
fix_mu=args.fix_mu, fix_sigma=args.fix_sigma)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/idr_2_0_3/idr.py", line 449, in fit_model_and_calc_local_idr
fix_mu=fix_mu, fix_sigma=fix_sigma)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/idr_2_0_3/optimization.py", line 468, in estimate_model_params
fix_mu=fix_mu, fix_sigma=fix_sigma)
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/idr_2_0_3/optimization.py", line 420, in EMP_with_pseudo_value_algorithm
z1 = compute_pseudo_values(r1, theta[0], theta[1], theta[3])
File "/home/msimenc/software/mambaforge/envs/fastq2peaks.fseq2/lib/python3.6/site-packages/fseq2/idr_2_0_3/utility.py", line 46, in py_compute_pseudo_values
-10, 10, EPS ) )
TypeError: py_cdf_i() takes 6 positional arguments but 7 were given
fseq2 idr
fails for me due to KDEpy:
$ fseq2 idr \
--samples \
WRKY47_AtDNA_rep1_S5.trimmedAligned.sortedByCoord.out.noDuplicates_peaks.narrowPeak \
WRKY47_AtDNA_rep2_S6.trimmedAligned.sortedByCoord.out.noDuplicates_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank q.value \
-o . \
--output-file-type narrowPeak \
--log-output-file idr.log \
--plot idr.png \
>idr.out \
2>idr.err
Traceback (most recent call last):
File "/home/msimenc/software/mambaforge/envs/fseq2.built/bin/fseq2", line 4, in <module>
__import__('pkg_resources').run_script('fseq2==2.0.3', 'fseq2')
File "/home/msimenc/software/mambaforge/envs/fseq2.built/lib/python3.7/site-packages/pkg_resources/__init__.py", line 656, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/msimenc/software/mambaforge/envs/fseq2.built/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1453, in run_script
exec(code, namespace, namespace)
File "/home/msimenc/software/mambaforge/envs/fseq2.built/lib/python3.7/site-packages/fseq2-2.0.3-py3.7-linux-x86_64.egg/EGG-INFO/scripts/fseq2", line 10, in <module>
from fseq2.fseq2 import __version__
File "/home/msimenc/software/mambaforge/envs/fseq2.built/lib/python3.7/site-packages/fseq2-2.0.3-py3.7-linux-x86_64.egg/fseq2/fseq2.py", line 17, in <module>
import KDEpy
File "/home/msimenc/software/mambaforge/envs/fseq2.built/lib/python3.7/site-packages/KDEpy/__init__.py", line 4, in <module>
import importlib.metadata
ModuleNotFoundError: No module named 'importlib.metadata'
Please add -V/--version option.
Hi, this is not a bug report but a question about usage. The inputs for callpeak_idr
are a treatment + control for each replicate. I have two control replicates and two treatment replicates but my controls aren't specifically paired with a given replicate, so can I provide both controls as input for callpeak_idr
using each of -control_file_1
and control_file_2
? If not, can you recommend how I proceed? P.S. I ran callpeak
once for each replicate inputting both controls under -control_file
. Thanks!
Hello, I was testing on a handful of samples and received the following KeyError
. This error occurs in versions 2.0.0/2.0.1/2.0.2. Current pandas version is 1.2.3. Not sure if this is related to the chromosomes with no peaks or something else. Happy to provide any additional information that would be helpful.
$ fseq2 callpeak g2.bed -f 0 -l 600 -t 4.0 -name g2 -cpus 32 -pe -pe_fragment_size_range auto
no peaks find on chr14_GL000225v1_random
no peaks find on chr14_KI270724v1_random
no peaks find on chr22_KI270736v1_random
no peaks find on chr22_KI270738v1_random
no peaks find on chr9_KI270720v1_random
no peaks find on chrUn_GL000224v1
no peaks find on chrUn_KI270438v1
no peaks find on chrUn_KI270467v1
no peaks find on chrUn_KI270511v1
no peaks find on chrUn_KI270746v1
no peaks find on chrUn_KI270756v1
no peaks find on chrUn_KI270750v1
no peaks find on chr17_GL000205v2_random
no peaks find on chr22_KI270735v1_random
no peaks find on chr17_KI270729v1_random
no peaks find on chrUn_KI270336v1
no peaks find on chrUn_KI270337v1
no peaks find on chrUn_KI270333v1
no peaks find on chrUn_KI270507v1
no peaks find on chrUn_KI270747v1
no peaks find on chrUn_KI270539v1
no peaks find on chr4_GL000008v2_random
no peaks find on chrUn_KI270583v1
no peaks find on chrUn_KI270466v1
no peaks find on chr22_KI270732v1_random
no peaks find on chrUn_GL000216v2
Traceback (most recent call last):
File "/home/$USER/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'query_value'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/$USER/.local/bin/fseq2", line 428, in <module>
main(temp_dir_name)
File "/home/$USER/.local/bin/fseq2", line 49, in main
submain(args)
File "/home/$USER/.local/lib/python3.8/site-packages/fseq2/callpeak_main.py", line 190, in main
result_df = fseq2.interpolate_poisson_p_value(result_df)
File "/home/$USER/.local/lib/python3.8/site-packages/fseq2/fseq2.py", line 864, in interpolate_poisson_p_value
result_df['query_int'] = result_df['query_value'].astype(int)
File "/home/$USER/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/$USER/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'query_value'
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.