geoparse's People

Contributors

agalitsyna, alexbarrera, bioinfodata, eric6356, grisaitis, guma44, hariesramdhani, kurtwheeler, maarten-vd-sande, michaellampe, miserlou, mvonpapen, simonvh, ttyskg, tychobismeijer

geoparse's Issues

Remove downloaded archive // Cleanup

GEOparse doesn't seem to provide a way to clean up after itself. I'd like to be able to delete all of the local data that has been downloaded and created once I'm finished.

In parse_GDS_columns(), unknown subset_types discarded

The subset_types I encounter vary from parse to parse. Although I have not come across the hard-coded set(['individual', 'disease_state']), I have needed set(['dose', 'agent', 'time', 'gender']). So, in parse_GDS_columns, I modified the code to start with an empty subset_ids and collect everything on the fly.

This turned out nicely: every subset_type was accounted for in each sample, so no rows from GDS.columns were dropped during GDS.__init__().

Is there any way to overwrite downloaded files?

Hi,

I call the get_GEO function twice in my script: the first time to fetch only the sample names (brief), the second time for the full download. However, the second call never downloads anything because it reuses the existing file.

So, is there any way to force a re-download?
Thanks.
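A workaround until such an option exists: remove the cached file before the second call, then call GEOparse.get_GEO again. The helper below is a sketch, and the file-name pattern is an assumption based on GEOparse's own log output:

```python
import os

def clear_geo_cache(accession, destdir="./"):
    """Remove a cached '<accession>_family.soft.gz' file so the next
    GEOparse.get_GEO call downloads a fresh copy.

    The file-name pattern is an assumption based on GEOparse's log
    output; adjust it if your cached files are named differently.
    """
    cached = os.path.join(destdir, "%s_family.soft.gz" % accession)
    if os.path.exists(cached):
        os.remove(cached)  # drop the brief version
        return True
    return False
```

After clearing, a second GEOparse.get_GEO(geo=accession, destdir=destdir) call has nothing to reuse and performs the full download.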

python 2.7 only?

Is this package supported on Python 2.7 only? It might be a good idea to state that in the README.

Broken download when supplementary_files is empty or contains invalid URLs

When I download GSM supplementary files by:

gsm = cast(GSM, GEOparse.get_GEO("GSM1944823", destdir="/tmp"))
files = gsm.download_supplementary_files("/tmp", False, "[email protected]")

I get the following error:

13-Feb-2018 18:02:51 DEBUG utils - Directory /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain already exists. Skipping.
13-Feb-2018 18:02:51 INFO utils - Downloading NONE to /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain/NONE
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/GEOTypes.py", line 443, in download_supplementary_files
    utils.download_from_url(metavalue[0], download_path)
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/utils.py", line 114, in download_from_url
    destination_path))
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/wgetter.py", line 272, in download
    url = opener.open(link)
  File "/usr/lib/python3.6/urllib/request.py", line 511, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.6/urllib/request.py", line 329, in __init__
    self.full_url = url
  File "/usr/lib/python3.6/urllib/request.py", line 355, in full_url
    self._parse()
  File "/usr/lib/python3.6/urllib/request.py", line 384, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'NONE'
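A guard applied before downloading could skip the placeholder: GEO stores the literal string 'NONE' in the supplementary_file field when a sample has no file, and that value is exactly what reaches urllib above. The helper name and filtering policy here are a sketch, not GEOparse's own code:

```python
from urllib.parse import urlparse

def valid_supplementary_urls(urls):
    """Drop placeholder and malformed supplementary-file entries.

    GEO uses the literal string 'NONE' when a sample has no
    supplementary file; anything without a downloadable scheme is
    skipped as well.
    """
    keep = []
    for url in urls:
        if url.strip().upper() == "NONE":
            continue  # GEO's "no file" placeholder
        if urlparse(url).scheme not in ("http", "https", "ftp"):
            continue  # not a downloadable URL
        keep.append(url)
    return keep
```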

DtypeWarning: Columns (7) have mixed types.

The following code is generating a warning for me:

import GEOparse
gpl = GEOparse.get_GEO('GPL17481')

The output is:

>>> import GEOparse
>>> gpl = GEOparse.get_GEO('GPL17481')
17-May-2021 13:32:21 DEBUG utils - Directory ./ already exists. Skipping.
17-May-2021 13:32:21 INFO GEOparse - Downloading http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full to ./GPL17481.txt
17-May-2021 13:32:23 DEBUG downloader - Total size: 0
17-May-2021 13:32:23 DEBUG downloader - md5: None
1.72MB [00:00,1.63MB/s]
10.3MB [00:01, 7.26MB/s]
17-May-2021 13:32:24 DEBUG downloader - Moving /tmp/tmp2lblbvso to /home/dbolser/Geromics/Dogome/Geromics/GPL17481.txt
17-May-2021 13:32:24 DEBUG downloader - Successfully downloaded http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full
17-May-2021 13:32:24 INFO GEOparse - Parsing ./GPL17481.txt: 
17-May-2021 13:32:24 DEBUG GEOparse - PLATFORM: GPL17481
/usr/bin/bpython3:1: DtypeWarning: Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
  #!/usr/bin/python3
>>> 

I understand this warning comes from pandas, but I'm not sure how to fix it.
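The warning means pandas inferred different dtypes for different chunks of column 7; the table still loads. Since GEOparse makes the read_csv call internally, one workaround is to suppress the warning around the parse and normalise the column afterwards. The helper name below is an illustration, not part of GEOparse:

```python
import warnings

import pandas as pd

def silence_dtype_warning(fn, *args, **kwargs):
    """Run a parsing call with pandas' DtypeWarning suppressed.

    The warning is cosmetic: the data loads regardless. If the mixed
    types matter downstream, cast the column afterwards, e.g.
    gpl.table.iloc[:, 7] = gpl.table.iloc[:, 7].astype(str).
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", pd.errors.DtypeWarning)
        return fn(*args, **kwargs)
```

Usage would look like gpl = silence_dtype_warning(GEOparse.get_GEO, 'GPL17481').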

Parse data table

I got an IndexError for some GSM SOFT text files. For instance, for GSM32878 (string index out of range), with geo = GEOparse.get_GEO('GSM32878'):

Traceback (most recent call last):
  File "indexReportUpdate.py", line 825, in <module>
    createIndices("GSM", outputDoc[2], outputEdg[0], outputDoc[6], outputEdg[2])
  File "indexReportUpdate.py", line 171, in createIndices
    raise e
  File "indexReportUpdate.py", line 163, in createIndices
    geo = GEOparse.get_GEO(filepath=fpath, silent=True)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 82, in get_GEO
    return parse_GSM(filepath)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 374, in parse_GSM
    table_data = parse_table_data(soft)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in parse_table_data
    data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in <listcomp>
    data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
IndexError: string index out of range
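The failing list comprehension indexes i[0] directly, which raises IndexError on an empty line. A sketch of a fix (not the library's actual code) that guards against empty lines before indexing:

```python
def filter_table_lines(lines):
    """Join SOFT table lines, skipping control lines and blanks.

    Checking that the line is non-empty before reading line[0]
    avoids the IndexError seen above.
    """
    return "\n".join(
        line.rstrip() for line in lines
        if line.strip() and line[0] not in ("^", "!", "#")
    )
```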

Partial parsing of large GPL files

Hi,
I sometimes want to parse large GPL files (e.g., GPL570), but my PC runs out of memory. I'd like to be able to parse a GPL file partially by specifying which GSM samples to parse from it. If you agree with this idea, I will open a pull request for the feature.

Thanks

Download MINiML

get_GEO currently prefers SOFT; I would love an option to download MINiML instead.

Error with GSE52666

GSE52666

File already exist: using local version.
Parsing ../data/geo/GSE52666_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE52666
  • PLATFORM : GPL10999
  • SAMPLE : GSM1273835
  • SAMPLE : GSM1273836
  • SAMPLE : GSM1273837
  • SAMPLE : GSM1273838
  • SAMPLE : GSM1273839
  • SAMPLE : GSM1273840
  • SAMPLE : GSM1273841
  • SAMPLE : GSM1273842
  • SAMPLE : GSM1273843
  • SAMPLE : GSM1273844
  • SAMPLE : GSM1273845
  • SAMPLE : GSM1273846
  • SAMPLE : GSM1273847

AssertionError Traceback (most recent call last)
in ()
6
7 # Download and/or load GEO dataset
----> 8 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)
9
10 print('\tannotation.head(): {}'.format(gse.phenotype_data))

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
518 gpls=gpls,
519 gsms=gsms,
--> 520 database=database)
521 return gse
522

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOTypes.py in init(self, name, metadata, gpls, gsms, database)
590
591 for gsm_name, gsm in iteritems(gsms):
--> 592 assert isinstance(gsm, GSM), "All GSMs should be of type GSM"
593 for gpl_name, gpl in iteritems(gpls):
594 assert isinstance(gpl, GPL), "All GPLs should be of type GPL"

AssertionError: All GSMs should be of type GSM

Could you take a look at this and let me know what the issue is?

Thanks,

Warning if fastq-dump is not installed or on $PATH

Great package, thanks!

I lost a bit of time tracking down a fastq-dump error with the download_SRA() function.

Granted,

"09-Apr-2019 12:48:02 ERROR sra_downloader - fastq-dump command not found"

is pretty good, but maybe it would be nice to raise a proper exception before the 15 GB file download?

Again, awesome pkg, thanks!

Best,

John
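A possible guard, sketched here as a standalone helper rather than GEOparse's own code: check for the binary on $PATH before starting any download, so the failure happens before the multi-gigabyte transfer.

```python
import shutil

def require_tool(name="fastq-dump"):
    """Raise immediately if an external tool is missing from $PATH.

    A check like this, run at the start of download_SRA(), would fail
    fast instead of after a large download.
    """
    if shutil.which(name) is None:
        raise RuntimeError(
            "%s not found on $PATH; install the SRA Toolkit first." % name
        )
```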

wgetter/urlget have no timeouts

GEOparse hangs for me quite a lot, particularly on slow connections. I think this is because there are no timeout values set.
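Until per-request timeouts are added, one coarse workaround is a process-wide socket default, which the underlying urllib/wgetter calls inherit. This is a blunt instrument: it affects every socket in the process, so pick a value generous enough for slow mirrors.

```python
import socket

# Hung FTP/HTTP connections will now raise a timeout error instead of
# blocking forever. 60 seconds is an arbitrary choice; tune it for
# your network.
socket.setdefaulttimeout(60)  # seconds
```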

more meaningful error in gsm.download_supplementary_files

Whenever I use it, it downloads everything, but at the end it fails with:

Converting to /home/antonkulaga/rna-seq/containers/geoparse/GSM1696283/Supp_GSM1696283_Transgenic_Control_L4_A/SRR2040662_*.fasta.gz
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 352, in download_supplementary_files
    self.download_SRA(email, filetype=sra_filetype, directory=directory)
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 463, in download_SRA
    if "command not found" in perr:
TypeError: a bytes-like object is required, not 'str'

It probably looks for the SRA tools on the PATH, so it would be better to simply report that the SRA tools are not on the PATH.
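The crash itself is a Python 3 issue: the subprocess stderr pipe yields bytes, so the substring test needs a decode first. A sketch of the fix (the helper name is mine):

```python
def sra_tools_missing(perr):
    """Check a subprocess stderr capture for a missing-tool message.

    Under Python 3 the pipe yields bytes, so decode before the
    substring test; that is the direct fix for the TypeError above.
    """
    if isinstance(perr, bytes):
        perr = perr.decode("utf-8", errors="replace")
    return "command not found" in perr
```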

cannot download and parse GEO files

After I downloaded the Series Matrix file(s), the GEOparse.get_GEO function fails and reports that there is no series.
[screenshot]

So I tried to use GEOparse.get_GEO to download the files from the website instead. It turned out that:
[screenshot]
It seems the URL is wrong.

showing all SRRs for a particular GSM

Hi!
Thanks for this awesome project, a much-needed tool in the Python ecosystem.
I wonder if you could consider adding a separate data structure to each GSM object that stores the list of all SRX and SRR entries associated with that GSM. With such a list, users could use your library just to scrape the GSE/GSM/SRX/SRR relations, which would make it fit into workflows where users need to manage the data download themselves.
Thank you!
Anton.

Some samples don't have table data!

I got the following error while parsing the GSM2795971 SOFT file:

File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Allow the file encoding to be specified in smart_open()

Thank you very much for the effort made in providing this useful package.

I would like to request the following feature: that the encoding can be specified when calling gzip.open() or open() in smart_open().

I am currently using GEOparse 2.0.1 with Python 3.8.3 on Windows 10. I have successfully downloaded GSE files from GEO (e.g. GSE134809_family.soft) and have also used GEOparse to read the .soft (or .soft.gz) files stored locally on my computer.

I have discovered that some special characters in the .soft files are not being interpreted correctly, due to gzip.open() or open() using Python's default encoder ('cp1252' in my computer) instead of 'utf-8' even though the .soft files use 'utf-8' encoding. Due to smart_open() ignoring errors when reading the file with fh = fopen(filepath, mode, errors="ignore"), the special characters do not prevent the file from being read, but they are not interpreted correctly.

The types of characters that I've found to be problematic are letters with accents, and some punctuation marks, e.g. Naïve, 4°C, 3’ prime, “union” (those single and double quotation marks are not the standard ones even though they look similar).

This could be solved by allowing the encoding argument to be passed to gzip.open() or open() when calling smart_open():

import gzip
import sys
from contextlib import contextmanager

@contextmanager
def smart_open(filepath, encoding="utf-8"):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.
        encoding (:obj:`str`): Encoding to use when reading the file.

    Yields:
        Context manager for file handle.

    """
    if filepath[-2:] == "gz":
        mode = "rt"
        fopen = gzip.open
    else:
        mode = "r"
        fopen = open
    if sys.version_info[0] < 3:
        fh = fopen(filepath, mode)
    else:
        fh = fopen(filepath, mode, encoding=encoding)
    try:
        yield fh
    finally:
        fh.close()

Alternatively, **kwargs could be passed through smart_open() and into gzip.open() and open().

Additionally, it would be beneficial if decoding errors were not ignored when reading the files, so that the user is aware of them. A try/except block could attempt to open the file normally, display any errors to the user, and then retry with errors ignored. The file would still be read, but the user would know there was a problem.

Check for already-converted SRAs

If we are downloading with keep_sra=False and sra_format set to fastq, it makes sense to check for existing FASTQ files and skip downloading the SRAs when the FASTQ files are already available and forcerewrite=False.

UnboundLocalError while trying the example from documentation

Thank you for the package.
I am quite new to GEOparse and have been trying to figure out the basics of the package. I tried the initial example from the documentation and get an UnboundLocalError. A screenshot follows:
[screenshot]

Python Version: 3.8.5
GEOparse Version: 2.0.2

Any leads about how to overcome this problem would be really helpful. Thanks!

NCBI GEO FTP has changed their URL structures AGAIN

I cannot download the majority of GEO metadata files! I think that NCBI has changed the structure of their URLs again :(

10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963773/soft/GDS301963773.soft.gz
GDS301963934
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz to XXX/GDS301963934.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz
GDS301385886
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz to XXX/GDS301385886.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz
GDS302278020
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz to XXX/GDS302278020.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz
GDS302478025
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz to XXX/GDS302478025.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz
GDS301172854
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz to XXX/GDS301172854.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz
GDS301192685
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz to XXX/GDS301192685.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz
GDS302483410
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz to XXX/GDS302483410.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz
GDS302048642
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.soft.gz to XXX/GDS302048642.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.so

Automatically generate pheno data

Currently, to generate phenotypic data like pData from GEOquery, one has to do the following:

pheno_data = {}
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm)
    pheno_data[gsm_name] = {key: value[0] for key, value in gsm.metadata.items()}
pheno_data = pd.DataFrame(pheno_data).T

This should be a function.
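The snippet above can be wrapped into a small function. This is a sketch of what the proposed helper could look like, assuming gse.gsms maps sample names to objects with a .metadata dict of key -> list of values:

```python
import pandas as pd

def phenotype_table(gsms):
    """Build a pData-like DataFrame from a dict of GSM objects.

    Keeps the first value of each metadata list, matching the loop
    in the issue above. Sketch of the proposed function, not
    GEOparse API.
    """
    rows = {
        name: {key: values[0] for key, values in gsm.metadata.items()}
        for name, gsm in gsms.items()
    }
    return pd.DataFrame(rows).T
```

Called as phenotype_table(gse.gsms), it returns one row per sample.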

using merge_and_average with annotation

Hi,

I want to annotate all samples. The pivot_and_annotate function works well for single-platform series:

gse = geo.get_GEO(geo='GSE17907', how='full', destdir=download_dir)
gse.pivot_and_annotate('VALUE', gse.gpls[list(gse.gpls)[0]], 'Gene Symbol')

However, some datasets, such as GSE17907, have multiple platforms. So I use the merge_and_average function for its platform-filter feature. It is good because I am able to get the samples for each platform separately, but unfortunately merge_and_average does not annotate the samples.

gse.merge_and_average(d='GPL570', expression_column='VALUE', gsm_on='ID_REF', gpl_on='ID', group_by_column='ID_REF')

Is there any feature to annotate these multiple platforms? Maybe I missed something, so I just wanted to ask.

By the way, I currently annotate samples manually like this:

soft = gse.gpls[list(gse.gpls)[0]].table
if soft.columns[0] == 'ID' and 'Gene Symbol' in list(soft.columns) and 'ID_REF' == eset.index.name:
    soft = soft[['ID','Gene Symbol']]
pd.merge(left=soft , right=eset, left_on='ID', right_on='ID_REF').drop(['ID'],axis=1)

silent mode not working.

I've enabled silent=True when calling GEOparse.get_GEO, but I still get the messages:

Parsing downloads/GSE72400_family.soft.gz:
 - DATABASE : GeoMiame
 - SERIES : GSE72400
 - PLATFORM : GPL18573
 - SAMPLE : GSM1861834
 - SAMPLE : GSM1861835
 - SAMPLE : GSM1861836
 - SAMPLE : GSM1861837
 - SAMPLE : GSM1861838
 - SAMPLE : GSM1861839
 - SAMPLE : GSM1861840
 - SAMPLE : GSM1861841
 - SAMPLE : GSM1861842
 - SAMPLE : GSM1861843

GEOparse.logger.set_verbosity doesn't work

The docs mention there is a

GEOparse.logger.set_verbosity('ERROR')

however, this causes:
AttributeError: 'Logger' object has no attribute 'set_verbosity'

This can be side-stepped with:

import logging
GEOparse.logger.setLevel(logging.getLevelName("ERROR"))

Missing File Error when GEO is Down

When NCBI/GEO is down, I'd expect a custom exception or some other graceful handling; instead you get:

  File "/home/user/data_refinery_foreman/surveyor/geo.py", line 222, in create_experiment_and_samples_from_api
    gse = GEOparse.get_GEO(experiment_accession_code, destdir=self.get_temp_path(), how="brief", silent=True)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 84, in get_GEO
    return parse_GSE(filepath)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 502, in parse_GSE
    with utils.smart_open(filepath) as soft:
  File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/utils.py", line 156, in smart_open
    fh = fopen(filepath, mode, errors="ignore")
  File "/usr/lib/python3.5/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.5/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/1/GSE11915_family.soft.gz'

which looks like a local disk error, but it isn't: it's a GEO-is-down error.
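A wrapper along these lines could translate the misleading FileNotFoundError into an explicit availability error. Both the exception class and the wrapper are a sketch, not part of GEOparse:

```python
class GEOUnavailableError(Exception):
    """Raised when an accession could not be fetched from NCBI GEO."""

def get_geo_checked(get_geo, accession, **kwargs):
    """Call a get_GEO-style function, surfacing a failed download as a
    GEO-availability error instead of a local-file error.

    'get_geo' is passed in (e.g. GEOparse.get_GEO) so the sketch stays
    self-contained.
    """
    try:
        return get_geo(accession, **kwargs)
    except FileNotFoundError as exc:
        raise GEOUnavailableError(
            "Could not fetch %s; NCBI GEO may be down or the accession "
            "may not exist." % accession
        ) from exc
```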

no columns at GSE

When I try

gse = GEOparse.get_GEO(geo="GSE69263", destdir="./")
gse.columns

I get:

gse.columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GSE' object has no attribute 'columns'

while the docs state that columns is a standard GSE property.

biopython requirement

It looks like this library uses Bio.Entrez from Biopython, but I do not see Biopython in the requirements file.

How can I get the GSM-GPL relation in a multi-GPL GSE?

Thank you for a great project. It is very convenient.

My dataset of interest is a GSE with multiple GPLs. I want to find a specific GPL's samples, but I cannot find where the relations are stored.

In this case,

>>> gse = GEOparse.get_GEO("GSE6532", destdir='data/', 
            annotate_gpl=True, include_data=True, silent=True)
>>> print(gse.gpls)
{'GPL570': <PLATFORM: GPL570>,
 'GPL96': <PLATFORM: GPL96>,
 'GPL97': <PLATFORM: GPL97>}
>>> print(len(gse.gsms))
741

How can I filter GPL570's samples?
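One way to recover the relation is each sample's own metadata: GEO writes a !Sample_platform_id line into every sample's SOFT record, which appears under the 'platform_id' metadata key after parsing. The helper below is a sketch rather than GEOparse API, so verify the key on your parsed objects:

```python
def gsms_for_platform(gse, platform):
    """Return the names of the samples run on a given platform.

    Assumes each GSM's .metadata dict carries a 'platform_id' list,
    as produced from the SOFT !Sample_platform_id line.
    """
    return [
        name for name, gsm in gse.gsms.items()
        if platform in gsm.metadata.get("platform_id", [])
    ]
```

For the series above, gsms_for_platform(gse, "GPL570") would list only the GPL570 samples.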

email is optional

In gsm.download_supplementary_files the email field looks optional (email=None by default), but in reality it crashes with "Exception: You have to provide valid e-mail", which means the field is in fact mandatory. I suggest either making email mandatory in the signature or making it truly optional and allowing SRA downloads without an email.

Cut a New Release?

Now that you've got #45 sorted, it'd be great if you could publish an updated package version!

get_GEO(... silent=True) is NOT silent

When I call get_GEO() with silent=True I expect no output at all, but the result is identical to that obtained with silent=False.

Even if I redirect sys.stdout and sys.stderr to files, I still see the same output.

Is it possible to really silence the output of get_GEO?

Python 3.5.2
GEOparse 0.1.10
macOS 10.12.4

Debugging everything

When I do the following:

gse = GEOparse.get_GEO(filepath="GPL17021_family.soft.gz")
print(type(gse))

it prints out a long list of DEBUG messages, like:

13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189087
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189088
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189089
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189090
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189091

I guess it didn't read my SOFT file correctly, or maybe I just don't know how to use it yet.

fastq-dump parameters are not optimal

I run fastq-dump with the following parameters:

 /opt/sratoolkit/fastq-dump --skip-technical --gzip --readids --read-filter pass --dumpbase --split-files --clip ${file}

(at https://edwards.sdsu.edu/research/fastq-dump/ there are good explanations for need in some of them). While default geoparse has

cmd = "fastq-dump --split-files --gzip %s --outdir %s %s"

That creates some problems. For instance, if I do not pass --readids with a paired SRA, I get two files with identical read IDs, which creates problems for downstream analysis. If I do not pass --skip-technical, I get technical Illumina reads that have nothing to do with biology (like Application Read Forward -> Technical Read Forward <- Application Read Reverse -> Technical Read Reverse). --read-filter pass gets rid of reads full of Ns.

Possibility to download all GEO data locally.

Great library! I am just wondering whether it is possible to bulk-download the GEO data up front rather than when it is queried. I want to speed up development, and downloading the files as they are needed takes up 90% of the analysis time. It would be great if there were a way to dump all the GSE files into one folder. I understand this is quite large, but if I have the space, can this be added? I looked at ftp://ftp.ncbi.nlm.nih.gov/geo/series/ but I just want the _family.soft.gz files, as used by GEOparse, in a single folder.

Return paths from download_supplementary_files

For bioinformatics pipelines it is useful to get name -> file pairs for all downloaded supplementary files. My suggestion is for gsm.download_supplementary_files to return a dictionary of name -> path pairs instead of the current None.

BUG: cannot GEOparse.get_GEO(filepath=path) on Windows

gse = GEOparse.get_GEO(filepath=DIR_PATH)

ValueError: Unknown GEO type: E:\. Available types: GSM, GSE, GPL and GDS

This error arises from the way Windows file paths work, i.e. "\" as the separator as opposed to Linux's "/".

In GEOparse.py, line 77 is the culprit:

else:
    if geotype is None:
        geotype = filepath.split("/")[-1][:3]  # <-- this is line 77

    logger.info("Parsing %s: " % filepath)
    if geotype.upper() == "GSM":
        return parse_GSM(filepath)
    elif geotype.upper() == "GSE":
        return parse_GSE(filepath)
    elif geotype.upper() == 'GPL':
        return parse_GPL(filepath)
    elif geotype.upper() == 'GDS':
        return parse_GDS(filepath)
    else:
        raise ValueError(("Unknown GEO type: %s. Available types: GSM, GSE, "
                          "GPL and GDS.") % geotype.upper())
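A portable fix is to split on both separators, or simply use os.path.basename in the library itself. Sketched as a standalone helper:

```python
import re

def geotype_from_path(filepath):
    """Derive the GEO type from a file name, Windows paths included.

    Splitting on both '/' and '\\' avoids the separator problem that
    breaks filepath.split("/") on Windows. Sketch of the fix for the
    line quoted above, not the library's actual code.
    """
    return re.split(r"[\\/]", filepath)[-1][:3].upper()
```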

Reporting issues with loading the following 2 datasets

GSE14755
File already exist: using local version.
Parsing ../data/geo/GSE14755_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE14755
  • PLATFORM : GPL5345

UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)
--> 508 gpls[entry_name] = parse_GPL(data_group, entry_name)
509 elif entry_type == "DATABASE":
510 is_data, data_group = next(groupper)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GPL(filepath, entry_name, silent)
383 gpl_soft.append(line)
384 else:
--> 385 for line in filepath:
386 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
387 has_table = True

/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 5280: invalid continuation byte

GSE5336
File already exist: using local version.
Parsing ../data/geo/GSE5336_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE5336
  • PLATFORM : GPL3887
  • PLATFORM : GPL3888
  • PLATFORM : GPL3889
  • PLATFORM : GPL3892
  • PLATFORM : GPL3893
  • PLATFORM : GPL3894
  • PLATFORM : GPL4003
  • SAMPLE : GSM120869

UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
503 elif entry_type == "SAMPLE":
504 is_data, data_group = next(groupper)
--> 505 gsms[entry_name] = parse_GSM(data_group, entry_name)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSM(filepath, entry_name)
303 soft = []
304 has_table = False
--> 305 for line in filepath:
306 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
307 has_table = True

/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 2897: invalid start byte

Thanks

ValueError when trying to reproduce tutorials

Hi,

The following code from the first section of the tutorials is broken on my machine. I'm running Anaconda Python 3.6 on windows 10.

import GEOparse
gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")

Produces the following error

12-Nov-2018 15:40:26 INFO GEOparse - Parsing ./GSE1563.soft.gz: 
Traceback (most recent call last):
  File "C:/Users/Ciaran/Box Sync/MesiSTRAT/PublicDataSetSearch/ReFormatShittyNCBIOutput.py", line 95, in <module>
    gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 84, in get_GEO
    return parse_GSE(filepath)
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 502, in parse_GSE
    with utils.smart_open(filepath) as soft:
  File "C:\ProgramData\Anaconda2\lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\utils.py", line 154, in smart_open
    fh = fopen(filepath, mode)
  File "C:\ProgramData\Anaconda2\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "C:\ProgramData\Anaconda2\lib\gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')

Process finished with exit code 1

Check if GSE entry is public

I got an FTP error when trying get_GEO for "GSE122295", and then I realized it is because the series is still private.
Is there a way to know whether a GSE is private?

is there any way to get GSM sample names without full download?

For instance, I want to get the GSE19826 sample names:

GEOparse.get_GEO(geo='GSE19826', how='quick')

I have changed the how argument to 'quick', but it still downloads the full dataset files, which takes time for large datasets. Is there any way to download only the sample names and descriptions?
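If only metadata is needed, one option is to fetch the brief text view from acc.cgi directly. The URL pattern below follows the one GEOparse itself logs when downloading (targ=self&acc=...&form=text&view=full), with view=brief substituted; that view=brief returns metadata only is an assumption worth verifying against NCBI's accession display documentation.

```python
def geo_brief_url(accession):
    """Build the acc.cgi URL for a brief text view of a GEO accession.

    Pattern copied from GEOparse's own download log, with view=brief
    in place of view=full (an assumption, see above).
    """
    return (
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
        "?targ=self&acc=%s&form=text&view=brief" % accession
    )
```

The returned URL can then be fetched with any HTTP client to get sample names and descriptions without the data tables.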
