foxotech / methylprep
Python-based preprocessing software for Illumina methylation arrays
License: MIT License
Hi,
I'm trying to use the package to obtain copy numbers from the IDAT files; however, it gives me a message that this is still unsupported.
Is there maybe an easy workaround for this, or when will this be available?
Thank you!
Greetings,
I am playing around with this repo in an attempt to implement GCT score calculation (Explained here: https://academic.oup.com/nar/article/45/4/e22/2290930).
I've added some code to manifest.py that allows me to add a column to the manifest dataframe that contains the extension base for a probe.
Next is to get the mean green intensity of probes expected to extend into a C and divide that by the mean green intensity of probes expected to extend into a T. I've included my function for doing so below but I am not getting close to producing a ratio near 1 (the expected result).
I was hoping to get some advice on where I might be going wrong here. I've had some trouble tracking probe intensity information throughout the analysis, so it may be that I am using the wrong dataframe for this. I'm afraid my current approach uses intensities that are not normalized or dye-bias corrected, which would prevent me from calculating a meaningful ratio.
I appreciate any advice you could share with me to help me implement GCT score calculation. From my testing, it's a useful way to quantify levels of bisulfite conversion and I'd be happy to add the feature and open a pull request.
Thanks,
Mauro Chavez
def _calculate_gct_score(data_container):
    """
    SeSaMe implementation in R:
    # Get the type I probes expected to extend into C and T
    extC <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extC
    extT <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extT
    ## prbs <- rownames(oobG(sset))
    df <- InfIR(sdf)
    extC <- intersect(df$Probe_ID, extC)
    extT <- intersect(df$Probe_ID, extT)
    dC <- df[match(extC, df$Probe_ID),]
    dT <- df[match(extT, df$Probe_ID),]
    mean(c(dC$MG, dC$UG), na.rm=TRUE) / mean(c(dT$MG, dT$UG), na.rm=TRUE)
    """
    if "Next_Base" not in data_container.man.columns:
        LOGGER.warning("No Next_Base column found in manifest: no GCT score calculated")
        return None
    print(data_container.man.columns)
    # Successful conversion means that all these probes should light up RED
    ext_c_probes = data_container.man[(data_container.man["Next_Base"] == "C") & (data_container.man["Infinium_Design_Type"] == "I")]
    ext_c_probes = ext_c_probes.merge(
        data_container.green_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.green_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_grn', '_AddressB_grn')
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_red', '_AddressB_red')
    )
    print("Exp C probes")
    print(ext_c_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())
    # Use these probes to normalize against how often a green signal comes from an extension into a T
    ext_t_probes = data_container.man[(data_container.man["Next_Base"] == "T") & (data_container.man["Infinium_Design_Type"] == "I")]
    ext_t_probes = ext_t_probes.merge(
        data_container.green_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.green_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_grn', '_AddressB_grn')
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_red', '_AddressB_red')
    )
    print("Exp T probes")
    print(ext_t_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())
    return (ext_c_probes["mean_value_AddressA_grn"].mean() + ext_c_probes["mean_value_AddressB_grn"].mean()) / (ext_t_probes["mean_value_AddressA_grn"].mean() + ext_t_probes["mean_value_AddressB_grn"].mean())
For the <GEO>_signals.txt.gz variety, it won't apply p-value detection to probes if the p-value column is already there for each sample.
>>> import pandas as pd
>>> x = pd.read_csv('GSE67530_signals.txt', sep='\t')
>>> cols = {i:i.title() for i in list(x.columns)}
>>> x = x.rename(columns=cols)
>>> x.to_csv('GSE67530_signals_revised.txt', sep='\t')
>>> z = methylprep.read_geo('GSE67530_signals_revised.txt')
>>> z.to_csv('GSE67530_betas.csv.gz')
(Make the sample column matching case-insensitive. It's possible to have the function try this trick before failing.)
This will help in places where facilities leave the IDAT files on read-only volumes and so there's no way to write to them.
Could you include an example of what the download command would look like in the README?
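For reference, the invocation used elsewhere in these threads looks like this (the GSE ID and output directory are whatever you need; -v turns on verbose logging):
python -m methylprep -v download -i GSE133062 -d ./GSE133062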
I am trying to run methylprep process from the command line without poobah filtering. Documentation says:
--poobah By default, any beta-values or m-values output will
contain all probes. If True, those probes that fail
the p-value signal:noise detection are replaced with
NaNs in dataframes in beta_values and m_value output.
It also states that --poobah is a Boolean variable with default True. This suggests that if I would like a beta_values output without poobah filtering, I should add the option --poobah False to the methylprep process command. However, when I do this I get the error:
[Error]:
unrecognized arguments: False
How can I switch poobah filtering off in the command line?
Can I also confirm that the CSV file named something like "200772300016_R01C01_processed.csv" in a folder called "200772300016" contains all beta-values etc without any filtering?
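For what it's worth, the "unrecognized arguments: False" error is what argparse produces when a flag is declared as store_true (it takes no value). A minimal sketch of that behavior, using a hypothetical parser rather than methylprep's actual CLI:
import argparse

# hypothetical parser, NOT methylprep's real CLI: a store_true flag takes no value,
# so a trailing 'False' becomes an unexpected positional argument
parser = argparse.ArgumentParser()
parser.add_argument('--poobah', action='store_true', default=True)

print(parser.parse_args([]))            # Namespace(poobah=True)
print(parser.parse_args(['--poobah']))  # Namespace(poobah=True)
# parser.parse_args(['--poobah', 'False'])  # exits with: error: unrecognized arguments: False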
python -m methylprep -v download -i GSE133062 -d ../GSE133062
INFO:methylprep.download.geo:Downloaded and unpacked GSE133062
Traceback (most recent call last):
File "/Users/mmaxmeister/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/mmaxmeister/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/mmaxmeister/methylprep/methylprep/__main__.py", line 6, in <module>
cli_app()
File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 240, in cli_app
build_parser()
File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 58, in build_parser
parsed_args.func(func_args)
File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 232, in cli_download
run_series(args.id, args.data_dir, dict_only=args.dict_only)
File "/Users/mmaxmeister/methylprep/methylprep/download/process_data.py", line 64, in run_series
seen_platforms = geo_metadata(id, series_path, GEO_PLATFORMS, str(path))
File "/Users/mmaxmeister/methylprep/methylprep/download/geo.py", line 151, in geo_metadata
raise ValueError(f'Sample: {title} has unrecognized platform: {platform}')
ValueError: Sample: BS_NS_BAL_641 has unrecognized platform: GPL23976
Adam: It would make sense to use Apache Arrow instead of pickle as a file format: NVIDIA has libraries that can load it directly onto the GPU, and it's a language-independent format. It would be important to make sure that the metadata file always has a Sample_ID column.
Helpful to know. This format would be useless to me, and probably to a lot of vanilla python users who've never seen it, but we can add the option to support deep learning. I've never seen an Apache Arrow file.
Going from CSVs to pickles was a design choice that broke compatibility with R users, but made everything 100X faster when dealing with large data sets that can't fit into memory. Loading a stack of CSVs takes minutes compared to a few seconds with a py3 pickle.
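For context, a minimal sketch of the two export formats from pandas (the toy frame, file names, and column layout are illustrative, not methylprep's exact output):
import pandas as pd

# toy probes x samples beta-value frame standing in for methylprep output
betas = pd.DataFrame(
    {"Sample1": [0.10, 0.92], "Sample2": [0.15, 0.88]},
    index=pd.Index(["cg00000029", "cg00000108"], name="IlmnID"),
)

# current behavior: python pickle (fast to reload, but python-only)
betas.to_pickle("beta_values.pkl")

# possible alternative: Arrow/Feather (language independent; GPU-friendly loaders exist)
betas.reset_index().to_feather("beta_values.feather")  # requires pyarrow
roundtrip = pd.read_feather("beta_values.feather").set_index("IlmnID")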
Mark mentioned that methylprep was reordering the probes during processing. It might be this pandas dataframe bug:
Warning (from warnings module):
  File "/Users/mmaxmeister/legx/test_combine_mds.py", line 302
    slice_betas = pd.concat([slice_betas, betas[sample]], axis=1)
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
I’m sure I accepted the (sort=True) part to suppress the warning, but as a result, it is reordering the probes. Will put this as an issue.
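A small sketch of the difference, with toy probe-indexed frames; passing sort=False to pd.concat is what preserves the original probe order:
import pandas as pd

# same probes, listed in different orders, so the non-concatenation axis is not aligned
a = pd.DataFrame({'s1': [0.1, 0.2]}, index=['cg02', 'cg01'])
b = pd.DataFrame({'s2': [0.3, 0.4]}, index=['cg01', 'cg02'])

print(pd.concat([a, b], axis=1, sort=False).index.tolist())  # ['cg02', 'cg01'] -- order preserved
print(pd.concat([a, b], axis=1, sort=True).index.tolist())   # ['cg01', 'cg02'] -- probes reordered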
As with other types of microarray platforms, technical artifacts are a concern, including background fluorescence, dye-bias from the use of two color channels, bias caused by type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing.
I found a new-ish method, RUVm, for removing unwanted variance between batches using negative controls. This works well after NOOB.
Should I plan to port this from R into methylprep?
https://cran.r-project.org/web/packages/ruv/ruv.pdf
Let's hold off on RUVm for now. We may need to optimize our pre-processing pipeline, and we need a strategy for empirically evaluating the utility of each step.
Processing note:
The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique.
This may cause sample_sheets and processed data in dataframes to not match up.
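A rough sketch of the naming rule described above (my own illustration of the assumed behavior, not the parser's actual code):
def ensure_unique_sample_names(names):
    """Assign a placeholder name if missing, append a counter if a name repeats."""
    seen = {}
    result = []
    for i, name in enumerate(names, start=1):
        if not name:
            name = f'Sample{i}'
        if name in seen:
            seen[name] += 1
            name = f'{name}_{seen[name]}'
        else:
            seen[name] = 0
        result.append(name)
    return result

print(ensure_unique_sample_names(['A', '', 'A']))  # ['A', 'Sample2', 'A_1']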
Will fix in a future version:
Version 1.1 will do this as part of the download or samplesheet CLI functions, but this version should be more interactive. You provide the path to a miniml file like GSE122038_family.xml and it confirms how many samples to import, which columns of meta data to include, and tries to assign common meta data fields, including Replicate, Control, Sample_Type, and Visit (for a longitudinal study).
Add to version 1.2
Marc,
I'm trying in vain to import my own manifest (version b5) with methylprep. Could you give me more explanation about this?
BR
Stef L
Using methylprep version '1.2.11' and the command:
python -m methylprep -v process --all
I cannot pass the parameter --export_poobah to the function.
Any idea what is happening here?
Thanks,
Ray
methylprep.load and .load_processed won't use the processed CSV files yet.
I would like to be able to specify where, and with what prefixes, output files should go. Right now my workflow involves copying an entire BeadChip sequencing directory from separate sequencing data storage where we do not have write permissions (taking up most of my local storage), running the methylprep/-check/-ize CLI, then running a bash script to delete all of the copied files, prepend the directory structure onto the output files, and move them into the appropriate analysis directory. I have looked through the documentation but have not found anywhere to specify an output path for results. Is there one? If not, could there be in the future?
via @moranr7
Running:
python -m methylprep -v process -d ./ --all --no_sample_sheet
and
python -m methylprep -v process -d ./ --all --no_sample_sheet --batch-size X
behave no differently in terms of memory requirements. For 107 samples, they both use >60GB of memory.
Dev note: --all overrides any other kwargs, so --batch-size is ignored.
Between batches, the memory does not clear. It just continues to grow. Should the memory not be cleared after the .pkl objects are written to free up memory?
Also, the memory requirement seems very high, is there a memory leak?
Thanks,
Ray
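As a generic illustration (not methylprep's actual batch loop), releasing per-batch objects before the next iteration looks like the sketch below; whether this addresses the growth reported above would need profiling:
import gc

def process_batch(batch):
    # stand-in for per-batch processing that builds large objects
    return [bytearray(10**6) for _ in batch]

for batch in [range(8), range(8)]:
    containers = process_batch(batch)
    # ... write the .pkl output for this batch here ...
    del containers  # drop references before the next batch
    gc.collect()    # ask the collector to reclaim memory now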
Upon first installation of 1.6.0 (which, coincidentally, does not show up in the Releases here in GH), I got:
methylprep-cli -h
Traceback (most recent call last):
File "/home/incalci/shared/conda-envs/methylsuite/bin/methylprep-cli", line 5, in <module>
from methylprep.cli import app
ImportError: cannot import name 'app' from 'methylprep.cli' (/home/incalci/shared/conda-envs/methylsuite/lib/python3.10/site-packages/methylprep/cli.py)
It looks like app was renamed cli_app.
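Until the console script is updated, two workarounds follow from that rename (the module entry point below is the one shown in an earlier traceback calling cli_app()):
python -m methylprep -h
or, in python, import the renamed entry point directly:
from methylprep.cli import cli_app  # 'app' was renamed to 'cli_app'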
Hello,
Thanks for making and maintaining the project.
As far as I can see, bioinformatics is commonly done in R, and Python devs must either learn R or stay away from bioinformatics projects.
I have a question: are there any groups for exchanging ideas with other people who use methylprep and methylsuite?
I have some questions, and it would be good to be part of an active community.
Hi,
I'm new to python and methylation.
I ran methylprep and it generated all the required output, but as .csv and not .pkl files.
So when I do methylcheck in jupyter like this:
import methylcheck
from pathlib import Path
filepath = Path(r'D:\bla\blablu\blablubli\blablublibsi')
betas = methylcheck.load(filepath)
betas.head()
I get this error with betas.head():
AttributeError Traceback (most recent call last)
Cell In[2], line 2
1 betas = methylcheck.load(filepath)
----> 2 betas.head()
AttributeError: 'NoneType' object has no attribute 'head'
Why were no pkl files created and is this really the reason why it is not recognizing the beta_values (they are there in the .csv) ?
Would be great if you could help me out. Cheers!
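It looks like methylcheck.load returned None (most likely because it found no .pkl files), so there is nothing to call .head() on. A minimal guard sketch, with the path written as a raw string so the backslashes are not treated as escape characters:
import methylcheck
from pathlib import Path

filepath = Path(r'D:\bla\blablu\blablubli\blablublibsi')  # raw string: backslashes kept literal
betas = methylcheck.load(filepath)
if betas is None:
    print(f'No pickled beta values found under {filepath}')
else:
    print(betas.head())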
Nichole Rigby - 9:17 AM DEC 12
Not sure if this is a bug or not, but a single control probe shows up in the methylprep version of the manifest. It's one of the -99 ones. I'm guessing it should be in the control dataframes, and not the manifest?
Another bug (and this affects our new QC functions) is that some of the controls are missing from the control dataframes. For example, the raw manifest (outside of methylprep) contains these STAINING controls.
New bug related to the control probes: I also found some inconsistency in the names of the BISULFITE CONVERSION probes. Some platforms have names like BS Conversion I-C1 (EPIC+) and others have similar names with the - missing (450K) for only SOME of the BISULFITE CONVERSION probes. In some cases, the original manifest has names with the - and the manifest in methylprep is missing the - from the name.
9:09 AM
I think the solution might be to go through all the manifest files and methylprep manifests and make sure the - is in the name for each.
Nichole: I checked them all and only changed and re-uploaded the ones that were affected. I also had to delete the .methylprep_manifest_files directory. My code now works.
For users to get the update, they'll need to delete the manifest files on their machine, and it will re-download them.
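For example (the directory name comes from the note above; the assumption that it lives under the user's home directory may need adjusting):
rm -rf ~/.methylprep_manifest_files   # methylprep will re-download the corrected manifests on the next run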
Hello
Many thanks for providing methylprep.
May I ask if the analysis of files originating from Infinium MethylationEPIC v2.0 chips is already possible?
Some GEO data sets contain both kinds of sample array data. The current fix requires additional user steps, which could be dealt with programmatically:
python -m methylprep download -d GSE142512 -i GSE142512
python -m methylprep sample_sheet -d GSE142512 -t Blood --create
Sample_Name (added) | GSM_ID | Sentrix_ID | Sentrix_Position (from filename)
python -m methylprep meta_data -d GSE142512 -i GSE142512
instead. As for methylprep process, this might be fixable, but I have to investigate.
Exception raised when running a pipeline:
Traceback (most recent call last):
File "", line 1, in
File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/pipeline.py", line 331, in run_pipeline
data_container = SampleDataContainer(
^^^^^^^^^^^^^^^^^^^^
File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/pipeline.py", line 590, in init
infer_type_I_probes(self, debug=self.debug)
File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/infer_channel_switch.py", line 19, in infer_type_I_probes
channels = get_infer_channel_probes(container.manifest, container.green_idat, container.red_idat, debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/infer_channel_switch.py", line 171, in get_infer_channel_probes
oobG_IG = oobG.append(green_in_band).sort_index()
^^^^^^^^^^^
File "/home/paul/miniconda3/lib/python3.11/site-packages/pandas/core/generic.py", line 5989, in getattr
return object.getattribute(self, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?
It seems this is the reason:
https://stackoverflow.com/questions/75956209/dataframe-object-has-no-attribute-append
Problem's solved if I downgrade to pandas 1.5.3
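The usual replacement for the removed DataFrame.append is pd.concat, applied to the same line as in the traceback; a toy sketch:
import pandas as pd

oobG = pd.DataFrame({'mean_value': [110, 95]}, index=['cg02', 'cg04'])
green_in_band = pd.DataFrame({'mean_value': [300, 250]}, index=['cg01', 'cg03'])

# pandas < 2.0: oobG_IG = oobG.append(green_in_band).sort_index()
oobG_IG = pd.concat([oobG, green_in_band]).sort_index()  # pandas >= 2.0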
Suggested change:
v1.3.0 cleaned up a lot of unnecessary memory usage. The final run_pipeline() step, "consolidate", uses 2X the previous max memory. If I dropped this as a default behavior (to return a list of data_container objects, or a df of beta_values), it would reduce the memory spike, which could cause the process to get killed on computers with less memory processing a large batch. Either way, all files are processed before it would crash, but it would just be tidier to avoid using excessive memory.
Processing 280 EPIC samples on MacOS used ~16GB memory at its max. Seems like a lot. It would use <10GB otherwise at its max-usage point (based on mprof / the python memory-profiler).
I ran into this error in trying to create sample_sheets:
File "sample_sheets.py", line 198, in create_sample_sheet
raise ValueError(file_name_error_msg.format(idat))
ValueError: This .idat file does not have the right pattern to auto-generate a sample sheet:
To fix it for Windows users, one simply needs to change the "/" to "\\", like so:
filename = str(idat).split("\\")[-1]
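A cross-platform alternative sketch, using pathlib so the separator never needs to be hard-coded (illustrative path; this is not necessarily how sample_sheets.py is organized):
from pathlib import Path

idat = Path(r'C:\data\201234567890_R01C01_Grn.idat')
filename = idat.name  # on Windows this yields '201234567890_R01C01_Grn.idat' for both '/' and '\\' paths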