geoparse's People

Contributors

agalitsyna, alexbarrera, bioinfodata, eric6356, grisaitis, guma44, hariesramdhani, kurtwheeler, maarten-vd-sande, michaellampe, miserlou, mvonpapen, simonvh, ttyskg, tychobismeijer

geoparse's Issues

Remove downloaded archive // Cleanup

GEOparse doesn't seem to provide a way to clean up after itself. I'd like to be able to delete all of the local data that has been downloaded and created once I'm finished.

In parse_GDS_columns(), unknown subset_types discarded

The subset_types I encounter vary from parse to parse. Although I have not come across the hard-coded set(['individual', 'disease_state']), I have needed set(['dose', 'agent', 'time', 'gender']). So, in parse_GDS_columns, I modified the code to start with an empty subset_ids and collect everything on the fly.

This turned out nicely: every subset_type was accounted for in each sample, so no rows from GDS.columns were dropped during GDS.__init__().

Is there any way to overwrite downloaded files?

Hi,

I call the get_GEO function twice in my script: the first time to fetch only the sample names (brief), the second time for the full download. However, the second call never downloads anything because it reuses the existing file.

So, is there any way to force a re-download?
Thanks.
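A workaround until such an option exists: remove the cached file before the second call, then call GEOparse.get_GEO again. The helper below is a sketch, and the file-name pattern is an assumption based on GEOparse's own log output:

```python
import os

def clear_geo_cache(accession, destdir="./"):
    """Remove a cached '<accession>_family.soft.gz' file so the next
    GEOparse.get_GEO call downloads a fresh copy.

    The file-name pattern is an assumption based on GEOparse's log
    output; adjust it if your cached files are named differently.
    """
    cached = os.path.join(destdir, "%s_family.soft.gz" % accession)
    if os.path.exists(cached):
        os.remove(cached)  # drop the brief version
        return True
    return False
```

After clearing, a second GEOparse.get_GEO(geo=accession, destdir=destdir) call has nothing to reuse and performs the full download.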

python 2.7 only?

Is this package supported on Python 2.7 only? It might be a good idea to state that in the README.

Broken download when supplementary_files is empty or contains invalid URLs

When I download GSM supplementary files by:

gsm = cast(GSM, GEOparse.get_GEO("GSM1944823", destdir="/tmp"))
files = gsm.download_supplementary_files("/tmp", False, "[email protected]")

I get the following error:

13-Feb-2018 18:02:51 DEBUG utils - Directory /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain already exists. Skipping.
13-Feb-2018 18:02:51 INFO utils - Downloading NONE to /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain/NONE
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/GEOTypes.py", line 443, in download_supplementary_files
    utils.download_from_url(metavalue[0], download_path)
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/utils.py", line 114, in download_from_url
    destination_path))
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/wgetter.py", line 272, in download
    url = opener.open(link)
  File "/usr/lib/python3.6/urllib/request.py", line 511, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.6/urllib/request.py", line 329, in __init__
    self.full_url = url
  File "/usr/lib/python3.6/urllib/request.py", line 355, in full_url
    self._parse()
  File "/usr/lib/python3.6/urllib/request.py", line 384, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'NONE'
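A guard applied before downloading could skip the placeholder: GEO stores the literal string 'NONE' in the supplementary_file field when a sample has no file, and that value is exactly what reaches urllib above. The helper name and filtering policy here are a sketch, not GEOparse's own code:

```python
from urllib.parse import urlparse

def valid_supplementary_urls(urls):
    """Drop placeholder and malformed supplementary-file entries.

    GEO uses the literal string 'NONE' when a sample has no
    supplementary file; anything without a downloadable scheme is
    skipped as well.
    """
    keep = []
    for url in urls:
        if url.strip().upper() == "NONE":
            continue  # GEO's "no file" placeholder
        if urlparse(url).scheme not in ("http", "https", "ftp"):
            continue  # not a downloadable URL
        keep.append(url)
    return keep
```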

DtypeWarning: Columns (7) have mixed types.

The following code is generating a warning for me:

import GEOparse
gpl = GEOparse.get_GEO('GPL17481')

The output is:

>>> import GEOparse
>>> gpl = GEOparse.get_GEO('GPL17481')
17-May-2021 13:32:21 DEBUG utils - Directory ./ already exists. Skipping.
17-May-2021 13:32:21 INFO GEOparse - Downloading http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full to ./GPL17481.txt
17-May-2021 13:32:23 DEBUG downloader - Total size: 0
17-May-2021 13:32:23 DEBUG downloader - md5: None
1.72MB [00:00,1.63MB/s]
10.3MB [00:01, 7.26MB/s]
17-May-2021 13:32:24 DEBUG downloader - Moving /tmp/tmp2lblbvso to /home/dbolser/Geromics/Dogome/Geromics/GPL17481.txt
17-May-2021 13:32:24 DEBUG downloader - Successfully downloaded http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full
17-May-2021 13:32:24 INFO GEOparse - Parsing ./GPL17481.txt: 
17-May-2021 13:32:24 DEBUG GEOparse - PLATFORM: GPL17481
/usr/bin/bpython3:1: DtypeWarning: Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
  #!/usr/bin/python3
>>> 

I understand this warning comes from pandas, but I'm not sure how to fix it.
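The warning means pandas inferred different dtypes for different chunks of column 7; the table still loads. Since GEOparse makes the read_csv call internally, one workaround is to suppress the warning around the parse and normalise the column afterwards. The helper name below is an illustration, not part of GEOparse:

```python
import warnings

import pandas as pd

def silence_dtype_warning(fn, *args, **kwargs):
    """Run a parsing call with pandas' DtypeWarning suppressed.

    The warning is cosmetic: the data loads regardless. If the mixed
    types matter downstream, cast the column afterwards, e.g.
    gpl.table.iloc[:, 7] = gpl.table.iloc[:, 7].astype(str).
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", pd.errors.DtypeWarning)
        return fn(*args, **kwargs)
```

Usage would look like gpl = silence_dtype_warning(GEOparse.get_GEO, 'GPL17481').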

Parse data table

I got an IndexError for some GSM SOFT text files. For instance, for GSM32878 (string index out of range), with geo = GEOparse.get_GEO('GSM32878'):

Traceback (most recent call last):
  File "indexReportUpdate.py", line 825, in <module>
    createIndices("GSM", outputDoc[2], outputEdg[0], outputDoc[6], outputEdg[2])
  File "indexReportUpdate.py", line 171, in createIndices
    raise e
  File "indexReportUpdate.py", line 163, in createIndices
    geo = GEOparse.get_GEO(filepath=fpath, silent=True)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 82, in get_GEO
    return parse_GSM(filepath)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 374, in parse_GSM
    table_data = parse_table_data(soft)
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in parse_table_data
    data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
  File "/home/mimsadm/.local/lib/python3.5/site-packages/GEOparse/GEOparse.py", line 329, in <listcomp>
    data = "\n".join([i.rstrip() for i in lines if i[0] not in ("^", "!", "#")])
IndexError: string index out of range
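The failing list comprehension indexes i[0] directly, which raises IndexError on an empty line. A sketch of a fix (not the library's actual code) that guards against empty lines before indexing:

```python
def filter_table_lines(lines):
    """Join SOFT table lines, skipping control lines and blanks.

    Checking that the line is non-empty before reading line[0]
    avoids the IndexError seen above.
    """
    return "\n".join(
        line.rstrip() for line in lines
        if line.strip() and line[0] not in ("^", "!", "#")
    )
```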

Partial parsing of large GPL files

Hi,
I sometimes want to parse large GPL files (e.g., GPL570), but my PC runs out of memory. I'd like to be able to parse a GPL file partially by specifying which GSM samples to parse from it. If you agree with this idea, I will open a pull request for the feature.

Thanks

Download MINiML

get_GEO currently prefers SOFT; I would love an option to download MINiML instead.

Error with GSE52666

GSE52666

File already exist: using local version.
Parsing ../data/geo/GSE52666_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE52666
  • PLATFORM : GPL10999
  • SAMPLE : GSM1273835
  • SAMPLE : GSM1273836
  • SAMPLE : GSM1273837
  • SAMPLE : GSM1273838
  • SAMPLE : GSM1273839
  • SAMPLE : GSM1273840
  • SAMPLE : GSM1273841
  • SAMPLE : GSM1273842
  • SAMPLE : GSM1273843
  • SAMPLE : GSM1273844
  • SAMPLE : GSM1273845
  • SAMPLE : GSM1273846
  • SAMPLE : GSM1273847

AssertionError Traceback (most recent call last)
in ()
6
7 # Download and/or load GEO dataset
----> 8 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)
9
10 print('\tannotation.head(): {}'.format(gse.phenotype_data))

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
518 gpls=gpls,
519 gsms=gsms,
--> 520 database=database)
521 return gse
522

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOTypes.py in init(self, name, metadata, gpls, gsms, database)
590
591 for gsm_name, gsm in iteritems(gsms):
--> 592 assert isinstance(gsm, GSM), "All GSMs should be of type GSM"
593 for gpl_name, gpl in iteritems(gpls):
594 assert isinstance(gpl, GPL), "All GPLs should be of type GPL"

AssertionError: All GSMs should be of type GSM

Could you take a look at this and let me know what the issue is?

Thanks,

Warning if fastq-dump is not installed or on $PATH

Great package, thanks!

I lost a bit of time tracking down a fastq-dump error with the download_SRA() function.

Granted,

"09-Apr-2019 12:48:02 ERROR sra_downloader - fastq-dump command not found"

is pretty good, but maybe it would be nice to raise a proper exception before the 15 GB file download?

Again, awesome pkg, thanks!

Best,

John
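A possible guard, sketched here as a standalone helper rather than GEOparse's own code: check for the binary on $PATH before starting any download, so the failure happens before the multi-gigabyte transfer.

```python
import shutil

def require_tool(name="fastq-dump"):
    """Raise immediately if an external tool is missing from $PATH.

    A check like this, run at the start of download_SRA(), would fail
    fast instead of after a large download.
    """
    if shutil.which(name) is None:
        raise RuntimeError(
            "%s not found on $PATH; install the SRA Toolkit first." % name
        )
```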

wgetter/urlget have no timeouts

GEOparse hangs for me quite a lot, particularly on slow connections. I think this is because there are no timeout values set.
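Until per-request timeouts are added, one coarse workaround is a process-wide socket default, which the underlying urllib/wgetter calls inherit. This is a blunt instrument: it affects every socket in the process, so pick a value generous enough for slow mirrors.

```python
import socket

# Hung FTP/HTTP connections will now raise a timeout error instead of
# blocking forever. 60 seconds is an arbitrary choice; tune it for
# your network.
socket.setdefaulttimeout(60)  # seconds
```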

more meaningful error in gsm.download_supplementary_files

Whenever I use it, it downloads everything, but at the end it fails with:

Converting to /home/antonkulaga/rna-seq/containers/geoparse/GSM1696283/Supp_GSM1696283_Transgenic_Control_L4_A/SRR2040662_*.fasta.gz
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 352, in download_supplementary_files
    self.download_SRA(email, filetype=sra_filetype, directory=directory)
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.5/site-packages/GEOparse/GEOTypes.py", line 463, in download_SRA
    if "command not found" in perr:
TypeError: a bytes-like object is required, not 'str'

It probably looks for the SRA tools on the PATH, so it would be better to simply report that the SRA tools are not on the PATH.
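The crash itself is a Python 3 issue: the subprocess stderr pipe yields bytes, so the substring test needs a decode first. A sketch of the fix (the helper name is mine):

```python
def sra_tools_missing(perr):
    """Check a subprocess stderr capture for a missing-tool message.

    Under Python 3 the pipe yields bytes, so decode before the
    substring test; that is the direct fix for the TypeError above.
    """
    if isinstance(perr, bytes):
        perr = perr.decode("utf-8", errors="replace")
    return "command not found" in perr
```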

cannot download and parse GEO files

After I downloaded the Series Matrix file(s), the GEOparse.get_GEO function fails and reports that there is no series.
[screenshot]

So I tried to use GEOparse.get_GEO to download the files from the website instead. It turned out that:
[screenshot]
It seems the URL is wrong.

showing all SRRs for a particular GSM

Hi!
Thanks for this awesome project, a much-needed tool in the Python ecosystem.
I wonder if you could consider adding a separate data structure to each GSM object that stores the list of all SRX and SRR entries associated with that GSM. With such a list, users could use your library just to scrape the GSE/GSM/SRX/SRR relations, which would make it fit into workflows where users need to manage the data download themselves.
Thank you!
Anton.

Some samples don't have table data!

I got the following error while parsing the GSM2795971 SOFT file:

File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Allow the file encoding to be specified in smart_open()

Thank you very much for the effort made in providing this useful package.

I would like to request the following feature: that the encoding can be specified when calling gzip.open() or open() in smart_open().

I am currently using GEOparse 2.0.1 with Python 3.8.3 on Windows 10. I have successfully downloaded GSE files from GEO (e.g. GSE134809_family.soft) and have also used GEOparse to read the .soft (or .soft.gz) files stored locally on my computer.

I have discovered that some special characters in the .soft files are not being interpreted correctly, due to gzip.open() or open() using Python's default encoder ('cp1252' in my computer) instead of 'utf-8' even though the .soft files use 'utf-8' encoding. Due to smart_open() ignoring errors when reading the file with fh = fopen(filepath, mode, errors="ignore"), the special characters do not prevent the file from being read, but they are not interpreted correctly.

The types of characters that I've found to be problematic are letters with accents, and some punctuation marks, e.g. Naïve, 4°C, 3’ prime, “union” (those single and double quotation marks are not the standard ones even though they look similar).

This could be solved by allowing the encoding argument to be passed to gzip.open() or open() when calling smart_open():

import gzip
import sys
from contextlib import contextmanager

@contextmanager
def smart_open(filepath, encoding="utf-8"):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.
        encoding (:obj:`str`): Encoding to use when reading the file.

    Yields:
        Context manager for file handle.

    """
    if filepath[-2:] == "gz":
        mode = "rt"
        fopen = gzip.open
    else:
        mode = "r"
        fopen = open
    if sys.version_info[0] < 3:
        fh = fopen(filepath, mode)
    else:
        fh = fopen(filepath, mode, encoding=encoding)
    try:
        yield fh
    finally:
        fh.close()

Alternatively, **kwargs could be passed through smart_open() and into gzip.open() and open().

Additionally, it would be beneficial if decoding errors were not ignored when reading the files, so that the user is aware of them. A try/except block could attempt to open the file normally, display any errors to the user, and then retry with errors ignored. The file would still be read, but the user would know there was a problem.

Check for already-converted SRAs

If we are downloading with keep_sra=False and sra_format set to fastq, it makes sense to check for existing FASTQ files and skip downloading the SRAs when the FASTQ files are already available and forcerewrite=False.

UnboundLocalError while trying the example from documentation

Thank you for the package.
I am quite new to GEOparse and have been trying to figure out the basics of the package. I tried the initial example from the documentation and get an UnboundLocalError. A screenshot follows:
[screenshot]

Python Version: 3.8.5
GEOparse Version: 2.0.2

Any leads about how to overcome this problem would be really helpful. Thanks!

NCBI GEO FTP has changed their URL structures AGAIN

I cannot download the majority of GEO metadata files! I think that NCBI has changed the structure of their URLs again :(

10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963773/soft/GDS301963773.soft.gz
GDS301963934
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz to XXX/GDS301963934.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301963nnn/GDS301963934/soft/GDS301963934.soft.gz
GDS301385886
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz to XXX/GDS301385886.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301385nnn/GDS301385886/soft/GDS301385886.soft.gz
GDS302278020
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz to XXX/GDS302278020.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302278nnn/GDS302278020/soft/GDS302278020.soft.gz
GDS302478025
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz to XXX/GDS302478025.soft.gz
10-Jan-2018 20:14:50 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302478nnn/GDS302478025/soft/GDS302478025.soft.gz
GDS301172854
10-Jan-2018 20:14:50 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz to XXX/GDS301172854.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301172nnn/GDS301172854/soft/GDS301172854.soft.gz
GDS301192685
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz to XXX/GDS301192685.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS301192nnn/GDS301192685/soft/GDS301192685.soft.gz
GDS302483410
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz to XXX/GDS302483410.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302483nnn/GDS302483410/soft/GDS302483410.soft.gz
GDS302048642
10-Jan-2018 20:14:51 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.soft.gz to XXX/GDS302048642.soft.gz
10-Jan-2018 20:14:51 ERROR utils - Cannot find file ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS302048nnn/GDS302048642/soft/GDS302048642.so

Automatically generate pheno data

Currently, to generate phenotypic data like pData from GEOquery, one has to do the following:

pheno_data = {}
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm)
    pheno_data[gsm_name] = {key: value[0] for key, value in gsm.metadata.items()}
pheno_data = pd.DataFrame(pheno_data).T

This should be a function.
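The snippet above can be wrapped into a small function. This is a sketch of what the proposed helper could look like, assuming gse.gsms maps sample names to objects with a .metadata dict of key -> list of values:

```python
import pandas as pd

def phenotype_table(gsms):
    """Build a pData-like DataFrame from a dict of GSM objects.

    Keeps the first value of each metadata list, matching the loop
    in the issue above. Sketch of the proposed function, not
    GEOparse API.
    """
    rows = {
        name: {key: values[0] for key, values in gsm.metadata.items()}
        for name, gsm in gsms.items()
    }
    return pd.DataFrame(rows).T
```

Called as phenotype_table(gse.gsms), it returns one row per sample.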

using merge_and_average with annotation

Hi,

I want to annotate all samples. The pivot_and_annotate function works well for single-platform series:

gse = geo.get_GEO(geo='GSE17907', how='full', destdir=download_dir)
gse.pivot_and_annotate('VALUE', gse.gpls[list(gse.gpls)[0]], 'Gene Symbol')

However, some datasets, such as GSE17907, have multiple platforms. So I use the merge_and_average function for its platform-filter feature. It is good because I am able to get the samples for each platform separately, but unfortunately merge_and_average does not annotate the samples.

gse.merge_and_average(d='GPL570', expression_column='VALUE', gsm_on='ID_REF', gpl_on='ID', group_by_column='ID_REF')

Is there any feature to annotate these multiple platforms? Maybe I missed something, so I just wanted to ask.

By the way, I currently annotate samples manually like this:

soft = gse.gpls[list(gse.gpls)[0]].table
if soft.columns[0] == 'ID' and 'Gene Symbol' in list(soft.columns) and 'ID_REF' == eset.index.name:
    soft = soft[['ID','Gene Symbol']]
pd.merge(left=soft , right=eset, left_on='ID', right_on='ID_REF').drop(['ID'],axis=1)

silent mode not working.

I've enabled silent=True when calling GEOparse.get_GEO, but I still get the messages:

Parsing downloads/GSE72400_family.soft.gz:
 - DATABASE : GeoMiame
 - SERIES : GSE72400
 - PLATFORM : GPL18573
 - SAMPLE : GSM1861834
 - SAMPLE : GSM1861835
 - SAMPLE : GSM1861836
 - SAMPLE : GSM1861837
 - SAMPLE : GSM1861838
 - SAMPLE : GSM1861839
 - SAMPLE : GSM1861840
 - SAMPLE : GSM1861841
 - SAMPLE : GSM1861842
 - SAMPLE : GSM1861843

GEOparse.logger.set_verbosity doesn't work

The docs mention there is a

GEOparse.logger.set_verbosity('ERROR')

however, this causes:
AttributeError: 'Logger' object has no attribute 'set_verbosity'

This can be side-stepped with:

import logging
GEOparse.logger.setLevel(logging.getLevelName("ERROR"))

Missing File Error when GEO is Down

When NCBI/GEO is down, I'd expect a custom exception or some other graceful handling; instead you get:

  File "/home/user/data_refinery_foreman/surveyor/geo.py", line 222, in create_experiment_and_samples_from_api
    gse = GEOparse.get_GEO(experiment_accession_code, destdir=self.get_temp_path(), how="brief", silent=True)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 84, in get_GEO
    return parse_GSE(filepath)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/GEOparse.py", line 502, in parse_GSE
    with utils.smart_open(filepath) as soft:
  File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/GEOparse/utils.py", line 156, in smart_open
    fh = fopen(filepath, mode, errors="ignore")
  File "/usr/lib/python3.5/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.5/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/1/GSE11915_family.soft.gz'

which looks like a local disk error, but it isn't: it's a GEO-is-down error.
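A wrapper along these lines could translate the misleading FileNotFoundError into an explicit availability error. Both the exception class and the wrapper are a sketch, not part of GEOparse:

```python
class GEOUnavailableError(Exception):
    """Raised when an accession could not be fetched from NCBI GEO."""

def get_geo_checked(get_geo, accession, **kwargs):
    """Call a get_GEO-style function, surfacing a failed download as a
    GEO-availability error instead of a local-file error.

    'get_geo' is passed in (e.g. GEOparse.get_GEO) so the sketch stays
    self-contained.
    """
    try:
        return get_geo(accession, **kwargs)
    except FileNotFoundError as exc:
        raise GEOUnavailableError(
            "Could not fetch %s; NCBI GEO may be down or the accession "
            "may not exist." % accession
        ) from exc
```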

no columns at GSE

When I try

gse = GEOparse.get_GEO(geo="GSE69263", destdir="./")
gse.columns

I get:

gse.columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GSE' object has no attribute 'columns'

while the docs state that columns is a standard GSE property.

biopython requirement

It looks like this library uses Bio.Entrez from Biopython, but I do not see Biopython in the requirements file.

How can I get the GSM-GPL relation in a multi-GPL GSE?

Thank you for a great project. It is very convenient.

My dataset of interest is a GSE with multiple GPLs. I want to find a specific GPL's samples, but I cannot find where the relations are stored.

In this case,

>>> gse = GEOparse.get_GEO("GSE6532", destdir='data/', 
            annotate_gpl=True, include_data=True, silent=True)
>>> print(gse.gpls)
{'GPL570': <PLATFORM: GPL570>,
 'GPL96': <PLATFORM: GPL96>,
 'GPL97': <PLATFORM: GPL97>}
>>> print(len(gse.gsms))
741

How can I filter GPL570's samples?
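One way to recover the relation is each sample's own metadata: GEO writes a !Sample_platform_id line into every sample's SOFT record, which appears under the 'platform_id' metadata key after parsing. The helper below is a sketch rather than GEOparse API, so verify the key on your parsed objects:

```python
def gsms_for_platform(gse, platform):
    """Return the names of the samples run on a given platform.

    Assumes each GSM's .metadata dict carries a 'platform_id' list,
    as produced from the SOFT !Sample_platform_id line.
    """
    return [
        name for name, gsm in gse.gsms.items()
        if platform in gsm.metadata.get("platform_id", [])
    ]
```

For the series above, gsms_for_platform(gse, "GPL570") would list only the GPL570 samples.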

email is optional

In gsm.download_supplementary_files the email field looks optional (email=None by default), but in reality it crashes with "Exception: You have to provide valid e-mail", which means the field is in fact mandatory. I suggest either making email mandatory in the signature or making it truly optional and allowing SRA downloads without an email.

Cut a New Release?

Now that you've got #45 sorted, it'd be great if you could publish an updated package version!

get_GEO(... silent=True) is NOT silent

When I call get_GEO() with silent=True I expect no output at all, but the result is identical to that obtained with silent=False.

Even if I redirect sys.stdout and sys.stderr to files, I still see the same output.

Is it possible to really silence the output of get_GEO?

Python 3.5.2
GEOparse 0.1.10
macOS 10.12.4

Debugging everything

When I do the following:

gse = GEOparse.get_GEO(filepath="GPL17021_family.soft.gz")
print(type(gse))

it prints out a long list of DEBUG messages, like:

13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189087
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189088
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189089
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189090
13-Jul-2019 22:47:09 DEBUG GEOparse - SAMPLE: GSM1189091

I guess it didn't read my SOFT file correctly, or maybe I just don't know how to use it yet.

fastq-dump parameters are not optimal

I run fastq-dump with the following parameters:

 /opt/sratoolkit/fastq-dump --skip-technical --gzip --readids --read-filter pass --dumpbase --split-files --clip ${file}

(at https://edwards.sdsu.edu/research/fastq-dump/ there are good explanations for need in some of them). While default geoparse has

cmd = "fastq-dump --split-files --gzip %s --outdir %s %s"

That creates some problems. For instance, if I do not pass --readids with a paired SRA, I get two files with identical read IDs, which creates problems for downstream analysis. If I do not pass --skip-technical, I get technical Illumina reads that have nothing to do with biology (like Application Read Forward -> Technical Read Forward <- Application Read Reverse -> Technical Read Reverse). --read-filter pass gets rid of reads full of Ns.

Possibility to download all GEO data locally.

Great library! I am just wondering whether it is possible to bulk-download the GEO data up front rather than when it is queried. I want to speed up development, and downloading the files as they are needed takes up 90% of the analysis time. It would be great if there were a way to dump all the GSE files into one folder. I understand this is quite large, but if I have the space, can this be added? I looked at ftp://ftp.ncbi.nlm.nih.gov/geo/series/ but I just want the _family.soft.gz files, as used by GEOparse, in a single folder.

Return paths from download_supplementary_files

For bioinformatics pipelines it is useful to get name -> file pairs for all downloaded supplementary files. My suggestion is for gsm.download_supplementary_files to return a dictionary of name -> path pairs instead of the current None.

BUG: cannot GEOparse.get_GEO(filepath=path) on Windows

gse = GEOparse.get_GEO(filepath=DIR_PATH)

ValueError: Unknown GEO type: E:\. Available types: GSM, GSE, GPL and GDS

This error arises from the way Windows file paths work, i.e. "\" as the separator as opposed to Linux's "/".

In GEOparse.py, line 77 is the culprit:

else:
    if geotype is None:
        geotype = filepath.split("/")[-1][:3]  # <-- this is line 77

    logger.info("Parsing %s: " % filepath)
    if geotype.upper() == "GSM":
        return parse_GSM(filepath)
    elif geotype.upper() == "GSE":
        return parse_GSE(filepath)
    elif geotype.upper() == 'GPL':
        return parse_GPL(filepath)
    elif geotype.upper() == 'GDS':
        return parse_GDS(filepath)
    else:
        raise ValueError(("Unknown GEO type: %s. Available types: GSM, GSE, "
                          "GPL and GDS.") % geotype.upper())
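A portable fix is to split on both separators, or simply use os.path.basename in the library itself. Sketched as a standalone helper:

```python
import re

def geotype_from_path(filepath):
    """Derive the GEO type from a file name, Windows paths included.

    Splitting on both '/' and '\\' avoids the separator problem that
    breaks filepath.split("/") on Windows. Sketch of the fix for the
    line quoted above, not the library's actual code.
    """
    return re.split(r"[\\/]", filepath)[-1][:3].upper()
```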

Reporting issues with loading the following 2 datasets

GSE14755
File already exist: using local version.
Parsing ../data/geo/GSE14755_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE14755
  • PLATFORM : GPL5345

UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)
--> 508 gpls[entry_name] = parse_GPL(data_group, entry_name)
509 elif entry_type == "DATABASE":
510 is_data, data_group = next(groupper)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GPL(filepath, entry_name, silent)
383 gpl_soft.append(line)
384 else:
--> 385 for line in filepath:
386 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
387 has_table = True

/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 5280: invalid continuation byte

GSE5336
File already exist: using local version.
Parsing ../data/geo/GSE5336_family.soft.gz:

  • DATABASE : GeoMiame
  • SERIES : GSE5336
  • PLATFORM : GPL3887
  • PLATFORM : GPL3888
  • PLATFORM : GPL3889
  • PLATFORM : GPL3892
  • PLATFORM : GPL3893
  • PLATFORM : GPL3894
  • PLATFORM : GPL4003
  • SAMPLE : GSM120869

UnicodeDecodeError Traceback (most recent call last)
in ()
5 print(id_)
6
----> 7 gse = GEOparse.get_GEO(geo=id_, destdir=DIR_GEO)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in get_GEO(geo, filepath, destdir, how, annotate_gpl, geotype, include_data, silent)
64 return parse_GSM(filepath)
65 elif geotype.upper() == "GSE":
---> 66 return parse_GSE(filepath)
67 elif geotype.upper() == 'GPL':
68 return parse_GPL(filepath, silent=silent)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSE(filepath)
503 elif entry_type == "SAMPLE":
504 is_data, data_group = next(groupper)
--> 505 gsms[entry_name] = parse_GSM(data_group, entry_name)
506 elif entry_type == "PLATFORM":
507 is_data, data_group = next(groupper)

/home/k/Jumis/gist/tools/GEOparse/GEOparse/GEOparse.py in parse_GSM(filepath, entry_name)
303 soft = []
304 has_table = False
--> 305 for line in filepath:
306 if "_table_begin" in line or (line[0] not in ("^", "!", "#")):
307 has_table = True

/home/k/Jumis/tools/anaconda/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 2897: invalid start byte

Thanks

ValueError when trying to reproduce tutorials

Hi,

The following code from the first section of the tutorials is broken on my machine. I'm running Anaconda Python 3.6 on windows 10.

import GEOparse
gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")

Produces the following error

12-Nov-2018 15:40:26 INFO GEOparse - Parsing ./GSE1563.soft.gz: 
Traceback (most recent call last):
  File "C:/Users/Ciaran/Box Sync/MesiSTRAT/PublicDataSetSearch/ReFormatShittyNCBIOutput.py", line 95, in <module>
    gse = GEOparse.get_GEO(filepath="./GSE1563.soft.gz")
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 84, in get_GEO
    return parse_GSE(filepath)
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\GEOparse.py", line 502, in parse_GSE
    with utils.smart_open(filepath) as soft:
  File "C:\ProgramData\Anaconda2\lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\ProgramData\Anaconda2\lib\site-packages\GEOparse\utils.py", line 154, in smart_open
    fh = fopen(filepath, mode)
  File "C:\ProgramData\Anaconda2\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "C:\ProgramData\Anaconda2\lib\gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')

Process finished with exit code 1

Check if GSE entry is public

I got an FTP error when trying get_GEO for "GSE122295", and then I realized it is because the series is still private.
Is there a way to know whether a GSE is private?

is there any way to get GSM sample names without full download?

For instance, I want to get the GSE19826 sample names:

GEOparse.get_GEO(geo='GSE19826', how='quick')

I have changed the how argument to 'quick', but it still downloads the full dataset files, which takes time for large datasets. Is there any way to download only the sample names and descriptions?
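If only metadata is needed, one option is to fetch the brief text view from acc.cgi directly. The URL pattern below follows the one GEOparse itself logs when downloading (targ=self&acc=...&form=text&view=full), with view=brief substituted; that view=brief returns metadata only is an assumption worth verifying against NCBI's accession display documentation.

```python
def geo_brief_url(accession):
    """Build the acc.cgi URL for a brief text view of a GEO accession.

    Pattern copied from GEOparse's own download log, with view=brief
    in place of view=full (an assumption, see above).
    """
    return (
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
        "?targ=self&acc=%s&form=text&view=brief" % accession
    )
```

The returned URL can then be fetched with any HTTP client to get sample names and descriptions without the data tables.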
