refgenconf's Introduction

Refgenie

A standardized reference genome resource manager. See the documentation.

refgenconf's Issues

adding genome attributes to config file

related to #7 (but for genomes instead of for assets)

what if we want to add some other attributes about a genome? examples include a description, URL to where it came from, what species it comes from, how long it is, etc.

maybe the config format should introduce an "assets" attribute under genome so they are not right under it?
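To picture the proposal, here is a minimal sketch of a nested layout in which genome-level attributes sit next to an "assets" mapping. The key names (`description`, `assets`, `path`) and the description text are illustrative assumptions, not the actual config format.

```python
# Hypothetical layout: genome attributes live beside an "assets" mapping,
# so asset names never collide with attribute names.
config = {
    "genomes": {
        "hg38": {
            "description": "illustrative description text",
            "assets": {
                "bowtie2_index": {"path": "bowtie2_index"},
            },
        }
    }
}


def list_assets(cfg, genome):
    """Return asset names for a genome without picking up genome attributes."""
    return sorted(cfg["genomes"][genome].get("assets", {}))
```

With this shape, adding a new genome attribute cannot be mistaken for a new asset.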

refgenie clobbers existing assets

If I try to download an asset that already exists, refgenie just re-downloads it. Shouldn't it skip an existing asset unless the user agrees to overwrite it?
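A rough sketch of the requested behaviour, assuming a hypothetical helper `should_pull` and a `force` flag with three states; none of these names come from refgenconf itself.

```python
import os


def should_pull(asset_dir, force=None, ask=input):
    """Decide whether to download into asset_dir.

    force=True always pulls, force=False never overwrites,
    force=None asks the user when the asset already exists.
    """
    if not os.path.exists(asset_dir):
        return True
    if force is not None:
        return force
    answer = ask("Replace existing ({})? [y/N] ".format(asset_dir))
    return answer.strip().lower() in ("y", "yes")
```

Passing `ask` as a parameter keeps the prompt out of the decision logic, so the function can be exercised without a terminal.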

imprecise message when pulling non-existent asset/genome pair

the response is correct (see code example in: refgenie/refgenie#28 (comment)), as there's no bowtie2_index asset for hg38 genome (and it's true for all genomes), try: http://refgenomes.databio.org/asset/hg38/bowtie2_index/archive

however now the error/message that we get is not precise, and I think it's caused by the introduction of the connection pre-check that we do for refgenie pull

Originally posted by @MichalStolarczyk in refgenie/refgenie#28

Deal with HTTPError error (missing archives on the server)

Currently we determine a valid source server for an asset by downloading a JSON file with its attributes. Consequently, if a refgenieserver instance does serve the JSON but for some reason the asset archive is missing on the server side, pull_asset will fail without checking other servers.

This is a rare case, caused by a faulty refgenieserver instance, but it would be helpful to deal with it here.
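One possible shape for the fallback, as a sketch only: treat a failure at either step (attributes or archive) as a reason to move on to the next server. `fetch_attrs` and `fetch_archive` are hypothetical callables standing in for the real HTTP requests.

```python
def pull_from_first_working_server(servers, fetch_attrs, fetch_archive):
    """Try each server in order; skip any that fails at either step.

    fetch_attrs and fetch_archive are caller-supplied callables
    (hypothetical stand-ins) that raise IOError on failure.
    """
    errors = {}
    for server in servers:
        try:
            attrs = fetch_attrs(server)
            archive = fetch_archive(server)
        except IOError as e:  # urllib's HTTPError derives from IOError/OSError
            errors[server] = e
            continue
        return server, attrs, archive
    raise IOError("No server could provide the asset: {}".format(errors))
```

The returned server name also tells the caller which instance the asset actually came from.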

account for the interrupted pulls

Once we provide the "do not untar" option here, we should account for interrupted refgenie pulls: they currently just leave an incomplete archive behind, which could then be mistaken for an archive that was intentionally left packed.
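A common way to tell complete archives from interrupted ones is to download into a temporary name and rename only on success. The sketch below assumes a `.part` suffix and a `save_atomically` helper, both invented for illustration.

```python
import os


def save_atomically(chunks, dest):
    """Write chunks to dest + '.part', then rename once the write finishes.

    An interrupted download leaves only the '.part' file, so a file at
    dest can be treated as a complete archive.
    """
    partial = dest + ".part"
    with open(partial, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    os.replace(partial, dest)  # Python 3.3+; atomic on the same filesystem
    return dest
```

A digest comparison against the server's reported archive digest would be a stronger check, at the cost of reading the file again.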

Pointing to assets?

Is how I've constructed this the intended mechanism?

My refgenie config file:

genome_folder: $GENOMES
genome_server: http://refgenomes.databio.org

genomes:
  hg38:
    indexes:
      bowtie2: indexed_bowtie2
      hisat2: indexed_hisat2
    chrom_sizes: $GENOMES/hg38/hg38.chrom.sizes

So, for rgc get_asset() to obtain the chrom_sizes file, it requires the full path?

Before full path error:

Traceback (most recent call last):
  File "pipelines/pepatac.py", line 1551, in <module>
    sys.exit(main())
  File "pipelines/pepatac.py", line 504, in main
    res = _add_resources(args, res)
  File "pipelines/pepatac.py", line 460, in _add_resources
    res[asset] = rgc.get_asset(args.genome_assembly, asset)
  File "/home/jps3dp/.local/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 148, in get_asset
    raise IOError(msg)
OSError: Asset may not exist: hg38.chrom.sizes

Because other assets, like bowtie2 indices are relative, should additional assets also be relative to parent genome folder?
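If relative paths were the rule, resolution could look roughly like this. `resolve_asset_path` is a hypothetical helper, not the refgenconf implementation.

```python
import os


def resolve_asset_path(genome_folder, genome, asset_path):
    """Return asset_path unchanged if absolute, else anchor it under the genome."""
    expanded = os.path.expandvars(os.path.expanduser(asset_path))
    if os.path.isabs(expanded):
        return expanded
    return os.path.join(genome_folder, genome, expanded)
```

Under this scheme the config entry could be just `chrom_sizes: hg38.chrom.sizes`, and the result never depends on the current working directory.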

empty dir produced after a negative answer to "large archive" question

a negative answer to "do you want ... large archive?" question in refgenie pull G/A produces an empty G/A directory:

[mstolarczyk@MichalsMBP test_genomes]: refgenie pull -c genomes.yaml hg38_noalt_decoy/star_index
No digest not found for 'hg38_noalt_decoy/fasta:default'. Populating with server data
'hg38_noalt_decoy/star_index:default' archive size: 24.3GB
Are you sure you want to download this large archive? [y/N] N
pull action aborted by user
[mstolarczyk@MichalsMBP test_genomes]: refgenie pull -c genomes.yaml hg38_noalt_decoy/star_index
Replace existing (/Users/mstolarczyk/Desktop/testing/test_genomes/hg38_noalt_decoy/star_index/default)? [y/N] N
[mstolarczyk@MichalsMBP test_genomes]: ll /Users/mstolarczyk/Desktop/testing/test_genomes/hg38_noalt_decoy/star_index/default
total 0
drwxr-xr-x  2 mstolarczyk  staff    64B Sep 19 21:33 .
drwxr-xr-x  3 mstolarczyk  staff    96B Sep 19 21:33 ..
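The empty directory suggests the folder is created before the question is asked. A sketch of the reordered flow, with a hypothetical `pull_large_archive` helper and an assumed size limit of 5 GB:

```python
import os


def pull_large_archive(asset_dir, size_gb, confirm, download, limit_gb=5):
    """Create asset_dir only once the user has agreed to the download."""
    if size_gb > limit_gb and not confirm(size_gb):
        return False  # aborted: nothing has been touched on disk
    if not os.path.isdir(asset_dir):
        os.makedirs(asset_dir)  # created after the confirmation step
    download(asset_dir)
    return True
```

With this ordering a second pull after a refusal would not trigger the "Replace existing" prompt.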

missing requirements

pip install --user --upgrade refgenconf
Collecting refgenconf
  Downloading https://files.pythonhosted.org/packages/43/7c/12a275d5d113cf8af16ab8835619c2d16f6f96c2d287327a4f76d1984461/refgenconf-0.1.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/sfs/lustre/scratch/ns5bc/tmp/pip-install-0e90qak4/refgenconf/setup.py", line 10, in <module>
        with open("requirements/requirements-all.txt", "r") as reqs_file:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements/requirements-all.txt'

likely MANIFEST.in again

create an empty RefGenConf object

In [3]: RefGenConf({})                                                                                                               
Config lacks version key: config_version
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/attmap/ordattmap.py in __getitem__(self, item)
     44         try:
---> 45             return super(OrdAttMap, self).__getitem__(item)
     46         except KeyError:

KeyError: 'genome_server'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in __init__(self, entries)
     95         try:
---> 96             self[CFG_SERVER_KEY] = self[CFG_SERVER_KEY].rstrip("/")
     97         except KeyError:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/attmap/pathex_attmap.py in __getitem__(self, item, expand)
     50         """
---> 51         v = super(PathExAttMap, self).__getitem__(item)
     52         return _safely_expand(v) if expand else v

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/attmap/ordattmap.py in __getitem__(self, item)
     46         except KeyError:
---> 47             return AttMap.__getitem__(self, item)
     48 

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/attmap/attmap.py in __getitem__(self, item)
     31     def __getitem__(self, item):
---> 32         return self.__dict__[item]
     33 

KeyError: 'genome_server'

During handling of the above exception, another exception occurred:

MissingConfigDataError                    Traceback (most recent call last)
<ipython-input-3-39988c53dab0> in <module>
----> 1 RefGenConf({})

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in __init__(self, entries)
     96             self[CFG_SERVER_KEY] = self[CFG_SERVER_KEY].rstrip("/")
     97         except KeyError:
---> 98             raise MissingConfigDataError(CFG_SERVER_KEY)
     99 
    100     def __bool__(self):

MissingConfigDataError: genome_server

No genome_folder

What's the verdict re: no genome_folder in a config file? Do we require that, or if absent is it to be assumed that the intended folder is the folder where the config file lives?
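If the answer is "fall back to the config file's folder", the rule is small enough to sketch. `effective_genome_folder` is a hypothetical name, and the config is modelled as a plain dict.

```python
import os


def effective_genome_folder(config, config_path):
    """Use the configured genome_folder, else the folder holding the config file."""
    folder = config.get("genome_folder")
    if folder:
        return os.path.expandvars(folder)
    return os.path.dirname(os.path.abspath(config_path))
```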

`update_assets(` will need a signature change to accommodate:

update_assets( will need a signature change to accommodate:

            rgc.update_assets(genome, asset, tag, {
                CFG_ASSET_PATH_KEY: build_pkg[ASSETS][asset][PTH].format(**asset_vars),
                CFG_ASSET_DESC_KEY: build_pkg[ASSETS][asset][ASSET_DESC]
            })

The new arg is 'tag'. If tag is none, it can be set to default.

Originally posted by @nsheff in #51 (comment)
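A sketch of what the tag-aware signature might look like, written against a plain nested dict rather than the real RefGenConf object. `DEFAULT_TAG` and the function body are assumptions for illustration.

```python
DEFAULT_TAG = "default"  # assumed name of the fallback tag


def update_assets(genomes, genome, asset, tag=None, data=None):
    """Merge data into genomes[genome][asset][tag], creating levels as needed."""
    tag = tag or DEFAULT_TAG
    tags = genomes.setdefault(genome, {}).setdefault(asset, {})
    tags.setdefault(tag, {}).update(data or {})
    return genomes
```

Calls that omit the tag keep working and land under the default tag.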

add setdefault method

we need a setdefault method to change default tag of the asset

it should prompt on override: you currently have "...."
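Roughly, such a method could behave like the sketch below. The `default_tag` key and the `set_default_tag` name are hypothetical, and `confirm` stands in for the interactive prompt.

```python
def set_default_tag(asset_entry, new_tag, confirm=lambda current, new: True):
    """Change asset_entry['default_tag'], asking before replacing an existing one."""
    current = asset_entry.get("default_tag")
    if current is not None and current != new_tag and not confirm(current, new_tag):
        return False  # user declined; keep the current default
    asset_entry["default_tag"] = new_tag
    return True
```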

why would get_asset return a relative to pwd path?

I'm confused by something...refgenie seek gives me the path correctly:

refgenie seek -c genome_config.yaml  -a salmon_index -g mouse_chrM2x
refgenie 0.4.5-dev
/home/nsheff/mouse_chrM2x/salmon_index

but if I add a file with the same name as the asset path in the pwd:

touch salmon_index

Now get_asset returns that path relative to the pwd:

refgenie seek -c genome_config.yaml  -a salmon_index -g mouse_chrM2x
refgenie 0.4.5-dev
salmon_index

I cannot figure out why it works this way. Is this a bug or a feature?

In any case, I want to eliminate this behaviour... I would think the local folder where I'm running the command should definitely not affect what is returned by get_asset or refgenie seek ...

colon is overloaded in tqdm progress bar description

root@ec26a3b6d069:/# refgenie pull -c genomes.yaml hg19_cdna/salmon_index
Downloading URL: http://refgenomes.databio.org/v2/asset/hg19_cdna/salmon_index/archive
hg19_cdna/salmon_index:default:  18%|██████████████████████▋                                             | 462M/2.50G [00:31<02:16, 14.9MB/s]

see if there's a way to remove the auto-generated colon after hg19_cdna/salmon_index:default

behavior in get_asset for optional asset?

If I have an asset that is optional to a pipeline, it seems like the current combination of _get_asset() and _genome_asset_path actually prevents the usage of strict_exists as anything other than True?

For example here:

path = _genome_asset_path(self.genomes, genome_name, asset_name)
if strict_exists is None or check_exist(path):
    return path

If I just want to report a warning, or nothing at all, about some asset, then _genome_asset_path() throws an exception before I'd get to the next line, yes?

e.g. in a pipeline:

# REQ
for asset in ["chrom_sizes", BT2_IDX_KEY]:
    res[asset] = rgc.get_asset(args.genome_assembly, asset)
# OPT
asset = "tss_annotation"
if args.TSS_name:        
    res[asset] = os.path.abspath(args.TSS_name)
else:
    res[asset] = rgc.get_asset(args.genome_assembly, asset,
                                              strict_exists=False)

The resulting error when I run this without a tss_annotation asset:

python2

$ python2 pipelines/peppro.py --sample-name test --genome hg38 --input examples/data/test_r1.fq.gz --single-or-paired single -O ~/Downloads/peppro_example/
Changed status from initializing to running.

Loading config file: pipelines/peppro.yaml

Traceback (most recent call last):
  File "pipelines/peppro.py", line 2434, in <module>
    sys.exit(main())
  File "pipelines/peppro.py", line 627, in main
    res = _add_resources(args, res)
  File "pipelines/peppro.py", line 557, in _add_resources
    strict_exists=False)
  File "/home/jps3dp/.local/lib/python2.7/site-packages/refgenconf/refgenconf.py", line 144, in get_asset
    path = _genome_asset_path(self.genomes, genome_name, asset_name)
  File "/home/jps3dp/.local/lib/python2.7/site-packages/refgenconf/refgenconf.py", line 423, in _genome_asset_path
    "Genome '{}' exists, but index '{}' is missing".format(gname, aname))
refgenconf.exceptions.MissingAssetError: Genome 'hg38' exists, but index 'tss_annotation' is missing
Pipeline status: running
Starting cleanup: 0 files; 0 conditional files for cleanup

### Pipeline failed at:  (06-12 14:00:18) elapsed: 0.0 _TIME_

Total time: 0:00:00
Failure reason: Pipeline failure. See details above.

Changed status from running to failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/jps3dp/.local/lib/python2.7/site-packages/pypiper/manager.py", line 1744, in _exit_handler
    self.fail_pipeline(Exception("Pipeline failure. See details above."))
  File "/home/jps3dp/.local/lib/python2.7/site-packages/pypiper/manager.py", line 1638, in fail_pipeline
    raise e
Exception: Pipeline failure. See details above.
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/jps3dp/.local/lib/python2.7/site-packages/pypiper/manager.py", line 1744, in _exit_handler
    self.fail_pipeline(Exception("Pipeline failure. See details above."))
  File "/home/jps3dp/.local/lib/python2.7/site-packages/pypiper/manager.py", line 1638, in fail_pipeline
    raise e
Exception: Pipeline failure. See details above.

python3:

$ python3 pipelines/peppro.py --sample-name test --genome hg38 --input examples/data/test_r1.fq.gz --single-or-paired single -O ~/Downloads/peppro_example/

Changed status from initializing to running.

Loading config file: pipelines/peppro.yaml

Traceback (most recent call last):
  File "/home/jps3dp/.local/lib/python3.6/site-packages/attmap/ordattmap.py", line 38, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
KeyError: 'tss_annotation'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jps3dp/.local/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 420, in _genome_asset_path
    asset_data = genome[aname]
  File "/home/jps3dp/.local/lib/python3.6/site-packages/attmap/pathex_attmap.py", line 51, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
  File "/home/jps3dp/.local/lib/python3.6/site-packages/attmap/ordattmap.py", line 40, in __getitem__
    return AttMap.__getitem__(self, item)
  File "/home/jps3dp/.local/lib/python3.6/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
KeyError: 'tss_annotation'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pipelines/peppro.py", line 2434, in <module>
    sys.exit(main())
  File "pipelines/peppro.py", line 627, in main
    res = _add_resources(args, res)
  File "pipelines/peppro.py", line 557, in _add_resources
    strict_exists=False)
  File "/home/jps3dp/.local/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 144, in get_asset
    path = _genome_asset_path(self.genomes, genome_name, asset_name)
  File "/home/jps3dp/.local/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 423, in _genome_asset_path
    "Genome '{}' exists, but index '{}' is missing".format(gname, aname))
refgenconf.exceptions.MissingAssetError: Genome 'hg38' exists, but index 'tss_annotation' is missing
Pipeline status: running
Starting cleanup: 0 files; 0 conditional files for cleanup

### Pipeline failed at:  (06-12 14:06:34) elapsed: 0.0 _TIME_

Total time: 0:00:00
Failure reason: Pipeline failure. See details above.

Changed status from running to failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/jps3dp/.local/lib/python3.6/site-packages/pypiper/manager.py", line 1744, in _exit_handler
    self.fail_pipeline(Exception("Pipeline failure. See details above."))
  File "/home/jps3dp/.local/lib/python3.6/site-packages/pypiper/manager.py", line 1638, in fail_pipeline
    raise e
Exception: Pipeline failure. See details above.
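For the optional-asset case above, one possible contract is sketched below. The three-way meaning given to `strict_exists` here (True raises, False warns, None stays silent) is an assumption for illustration, and `MissingAssetError` is a local stand-in for the refgenconf exception of the same name.

```python
import warnings


class MissingAssetError(KeyError):
    """Local stand-in for refgenconf's exception of the same name."""


def get_asset(assets, name, strict_exists=True):
    """Look up an asset path; strict_exists controls the missing-asset reaction.

    True raises, False warns and returns None, None returns None silently.
    """
    try:
        return assets[name]
    except KeyError:
        if strict_exists:
            raise MissingAssetError(name)
        if strict_exists is False:
            warnings.warn("Asset '{}' is missing".format(name))
        return None
```

A pipeline could then treat a `None` result as "optional asset not configured" and carry on.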

python2 error

 refgenie listr -c genomes.yaml 
Querying available assets from server: http://refgenomes.databio.org/assets
Traceback (most recent call last):
  File "./.local/bin/refgenie", line 11, in <module>
    sys.exit(main())
  File "/home/sheffien/.local/lib/python2.7/site-packages/refgenie/refgenie.py", line 464, in main
    pfx, genomes, assets = _exec_list(rgc, args.command == LIST_REMOTE_CMD)
  File "/home/sheffien/.local/lib/python2.7/site-packages/refgenie/refgenie.py", line 385, in _exec_list
    assemblies, assets = rgc.list_remote()
  File "/home/sheffien/.local/lib/python2.7/site-packages/refgenconf/refgenconf.py", line 256, in list_remote
    genomes, assets = _list_remote(url, order)
  File "/home/sheffien/.local/lib/python2.7/site-packages/refgenconf/refgenconf.py", line 541, in _list_remote
    genomes_data = _read_remote_data(url)
  File "/home/sheffien/.local/lib/python2.7/site-packages/refgenconf/refgenconf.py", line 575, in _read_remote_data
    with urllib.request.urlopen(url) as response:
AttributeError: 'module' object has no attribute 'request'
sheffien@zen:~$ 
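On Python 2 the `urllib` module has no `request` submodule, which matches the AttributeError above. A usual compatibility pattern is sketched here; the `read_remote_data` name is illustrative and the body is not the actual refgenconf code.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

from contextlib import closing


def read_remote_data(url):
    """Fetch url and return the raw response body."""
    # closing() is needed on Python 2, where the response object
    # does not support the with-statement on its own.
    with closing(urlopen(url)) as response:
        return response.read()
```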

Random Errors in output

Here's a successful build run. Note the several messages that look like errors but are not:

refgenie build -c refgenie.yaml -g hs38d1 -a fasta bowtie2_index --fasta hs38d1.fna.gz
Output to: hs38d1 /home/nsheff/code/sandbox /home/nsheff/code/sandbox/hs38d1
Removed existing flag: '/home/nsheff/code/sandbox/hs38d1/refgenie_failed.flag'
### Pipeline run code and environment:

*              Command:  `/home/nsheff/.local/bin/refgenie build -c refgenie.yaml -g hs38d1 -a fasta bowtie2_index --fasta hs38d1.fna.gz`
*         Compute host:  puma
*          Working dir:  /home/nsheff/code/sandbox
*            Outfolder:  /home/nsheff/code/sandbox/hs38d1/
*  Pipeline started at:   (10-15 08:04:18) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.5.2
*          Pypiper dir:  `/home/nsheff/.local/lib/python3.5/site-packages/pypiper`
*      Pypiper version:  0.12.0
*         Pipeline dir:  `/home/nsheff/.local/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:

* `asset_registry_paths`:  `['fasta', 'bowtie2_index']`
*            `command`:  `build`
*        `config_file`:  `refgenie.yaml`
*            `context`:  `None`
*             `dbnsfp`:  `None`
*             `docker`:  `False`
*        `ensembl_gtf`:  `None`
*              `fasta`:  `hs38d1.fna.gz`
*        `gencode_gtf`:  `None`
*             `genome`:  `hs38d1`
*      `genome_config`:  `refgenie.yaml`
* `genome_description`:  `None`
*                `gff`:  `None`
*             `logdev`:  `False`
*          `new_start`:  `False`
*          `outfolder`:  `/home/nsheff/code/sandbox`
*            `recover`:  `False`
*            `refgene`:  `None`
*       `requirements`:  `False`
*             `silent`:  `False`
*    `tag_description`:  `None`
*               `tags`:  `None`
*          `verbosity`:  `None`
*            `volumes`:  `None`

----------------------------------------

**MissingGenomeError: using 'default' as the default tag**
Inputs required to build 'fasta': fasta
Building asset 'fasta'
Target exists: `/home/nsheff/code/sandbox/hs38d1/fasta/default/build_complete.flag`  

> `cd /home/nsheff/code/sandbox/hs38d1/fasta/default; find . -type f -exec md5sum {} \; | sort -k 2 | awk '{print $1}' | md5sum`
Default tag for 'hs38d1/fasta' set to: default
Computing initial genome digest...
Initializing genome...
Finished building asset 'fasta'
**MissingAssetError: using 'default' as the default tag**
Inputs required to build 'bowtie2_index': 
Building asset 'bowtie2_index'
Target exists: `/home/nsheff/code/sandbox/hs38d1/bowtie2_index/default/build_complete.flag`  

> `cd /home/nsheff/code/sandbox/hs38d1/bowtie2_index/default; find . -type f -exec md5sum {} \; | sort -k 2 | awk '{print $1}' | md5sum`
Default tag for 'hs38d1/bowtie2_index' set to: default
Finished building asset 'bowtie2_index'

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:01
*  Total elapsed time (all runs):  0:00:05
*         Peak memory (this run):  0 GB
*        Pipeline completed time: 2019-10-15 08:04:18

ValueError: To create config object, string should be config filepath; got 'genome_config.yaml'

weird error message:

>>> rgc = refgenconf.RefGenConf("genome_config.yaml")
Can't load config file 'genome_config.yaml'
FileNotFoundError[Errno 2] No such file or directory: 'genome_config.yaml'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sheffien/.local/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 51, in __init__
    "be config filepath; got '{}'".format(entries))
ValueError: To create config object, string should be config filepath; got 'genome_config.yaml'

I did provide a filepath. It just didn't exist...
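A sketch of input checking that reports the actual problem, using a hypothetical `check_config_source` helper. It relies on `FileNotFoundError`, so it assumes Python 3.

```python
import os


def check_config_source(entries):
    """Raise an error that names the actual problem with entries."""
    if isinstance(entries, str):
        if not os.path.isfile(entries):
            raise FileNotFoundError(
                "Config file does not exist: '{}'".format(entries))
        return "file"
    if isinstance(entries, dict):
        return "mapping"
    raise TypeError(
        "Expected a config filepath or a mapping; got {}".format(
            type(entries).__name__))
```

A missing file and a wrong argument type then produce different exceptions, so the message no longer contradicts what the user typed.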

pull_asset changes RGC object attribute outside of __init__

RefGenConf.pull_asset method alters the genome_server value and type while iterating through multiple servers. Consequently two subsequent calls to this method are not possible.

In [9]: rgc = RefGenConf(filepath="genomes.yaml")                                                                                      

In [10]: rgc.genome_server                                                                                                             
Out[10]: 
['http://refgenomes.databio.org',
 'http://staging.refgenomes.databio.org',
 'http://0.0.0.0:80']

rgc.genome_server is a list of strings

In [11]: rgc.pull_asset("rCRSd","fasta","default")                                                                                     
Replace existing (/Users/mstolarczyk/Desktop/testing/test_genomes_new/rCRSd/fasta/default)? [y/N] y
rCRSd/fasta:default: 16.4kB [00:00, 871kB/s]
Out[11]: 
(['rCRSd', 'fasta', 'default'],
 {'asset_path': 'fasta',
  'seek_keys': {'fasta': 'rCRSd.fa',
   'fai': 'rCRSd.fa.fai',
   'chrom_sizes': 'rCRSd.chrom.sizes'},
  'archive_digest': '93e0b97572fb218aa69434e40975c8a6',
  'archive_size': '8.6KB',
  'asset_size': '38.7KB',
  'asset_parents': [],
  'asset_children': ['star_index:default',
   'bwa_index:default',
   'bowtie2_index:default',
   'bismark_bt1_index:default',
   'bismark_bt2_index:default',
   'hisat2_index:default'],
  'asset_digest': '4eb430296bc02ed7e4006624f1d5ac53'})

In [12]: rgc.genome_server                                                                                                             
Out[12]: 'http://refgenomes.databio.org'

rgc.genome_server is a string


In [13]: rgc.pull_asset("rCRSd","fasta","default")                                                                                     
---------------------------------------------------------------------------
MissingSchema                             Traceback (most recent call last)
<ipython-input-13-c840b49930e7> in <module>
----> 1 rgc.pull_asset("rCRSd","fasta","default")

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in pull_asset(self, genome, asset, tag, unpack, force, get_json_url, build_signal_handler)
    489                 continue
    490 
--> 491             url_attrs = get_json_url(self.genome_server, API_ID_ASSET_ATTRS).format(genome=genome, asset=asset)
    492             url_archive = get_json_url(self.genome_server, API_ID_ARCHIVE).format(genome=genome, asset=asset)
    493 

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in <lambda>(server, operation_id)
    442 
    443     def pull_asset(self, genome, asset, tag, unpack=True, force=None,
--> 444                    get_json_url=lambda server, operation_id: construct_request_url(server, operation_id),
    445                    build_signal_handler=_handle_sigint):
    446         """

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in construct_request_url(server_url, operation_id)
   1162     """
   1163     try:
-> 1164         return server_url + _get_server_endpoints_mapping(server_url)[operation_id]
   1165     except KeyError as e:
   1166         _LOGGER.error("'{}' is not a compatible refgenieserver instance. "

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in _get_server_endpoints_mapping(url)
   1176     :return dict: endpoints mapped by their operationIds
   1177     """
-> 1178     json = _download_json(url + "/openapi.json")
   1179     return _map_paths_by_id(asciify_json_dict(json) if sys.version_info[0] == 2 else json)
   1180 

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenconf/refgenconf.py in _download_json(url, params)
    899     import requests
    900     _LOGGER.debug("Downloading JSON data; querying URL: '{}'".format(url))
--> 901     resp = requests.get(url, params=params)
    902     if resp.ok:
    903         return resp.json()

~/Library/Python/3.6/lib/python/site-packages/requests/api.py in get(url, params, **kwargs)
     73 
     74     kwargs.setdefault('allow_redirects', True)
---> 75     return request('get', url, params=params, **kwargs)
     76 
     77 

~/Library/Python/3.6/lib/python/site-packages/requests/api.py in request(method, url, **kwargs)
     58     # cases, and look like a memory leak in others.
     59     with sessions.Session() as session:
---> 60         return session.request(method=method, url=url, **kwargs)
     61 
     62 

~/Library/Python/3.6/lib/python/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    517             hooks=hooks,
    518         )
--> 519         prep = self.prepare_request(req)
    520 
    521         proxies = proxies or {}

~/Library/Python/3.6/lib/python/site-packages/requests/sessions.py in prepare_request(self, request)
    460             auth=merge_setting(auth, self.auth),
    461             cookies=merged_cookies,
--> 462             hooks=merge_hooks(request.hooks, self.hooks),
    463         )
    464         return p

~/Library/Python/3.6/lib/python/site-packages/requests/models.py in prepare(self, method, url, headers, files, data, params, auth, cookies, hooks, json)
    311 
    312         self.prepare_method(method)
--> 313         self.prepare_url(url, params)
    314         self.prepare_headers(headers)
    315         self.prepare_cookies(cookies)

~/Library/Python/3.6/lib/python/site-packages/requests/models.py in prepare_url(self, url, params)
    385             error = error.format(to_native_string(url, 'utf8'))
    386 
--> 387             raise MissingSchema(error)
    388 
    389         if not host:

MissingSchema: Invalid URL 'h/openapi.json': No schema supplied. Perhaps you meant http://h/openapi.json?
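The 'h' in the invalid URL is consistent with looping over a string instead of a list: once genome_server has been overwritten with a single URL, iterating it yields characters. A sketch of a loop that keeps the server choice in a local variable follows; `first_responsive_server` and `is_alive` are hypothetical names.

```python
def first_responsive_server(servers, is_alive):
    """Return the first server for which is_alive(server) is true.

    servers may be one URL or a list of URLs; the caller's value is never
    reassigned, so repeated calls see the same input.
    """
    if isinstance(servers, str):
        servers = [servers]  # avoid iterating over characters
    for server in servers:
        if is_alive(server):
            return server
    return None


# Iterating over a bare string yields characters, which matches the 'h'
# in the "Invalid URL 'h/openapi.json'" message.
first_item = next(iter("http://refgenomes.databio.org"))
```

Returning the chosen server instead of storing it on the object leaves rgc.genome_server as the user configured it.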

add list remote assets functionality from CLI

the ability to easily peer at remote asset availability is currently implemented in the CLI:

https://github.com/databio/refgenie/blob/91a6f6424652689943390cdc0e58a263d0a94738/refgenie/refgenie.py#L380-L387

this is more fundamental... it would be useful, for example, in a pipeline to check if an asset is available in the configured server on-the-fly.

As such, we should move that functionality here, and then call those functions on this object in the CLI.

rename object to RefGenConf?

Right now we named the object RefGenomeConfiguration, which is, uh...long.

Any thoughts on using RefGenConf instead?

correct tests

after config file format changes in #25 nearly all of the tests are failing

default seek keys

Right now seeking for the fasta asset doesn't work, because it expects you to type fasta.fasta:

 refgenie seek hg38/fasta
/ext/yeti/refgenomes/hg38/fasta/default
nsheff@puma:/project/shefflab/www/refgenie_raw$ refgenie seek hg38/fasta.fasta
/ext/yeti/refgenomes/hg38/fasta/default/hg38.fa

See, the vanilla fasta key is just pointing to the folder.

Do we want to enable a default when there are keys present? I thought if the name of the seek key matched the asset name, then the repetition shouldn't be required?
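The suggested default could be expressed as in the sketch below. `seek` here is a hypothetical standalone function operating on a plain dict of seek keys, not the refgenconf method.

```python
import os


def seek(asset_dir, seek_keys, asset_name, seek_key=None):
    """Resolve a path inside asset_dir.

    With no seek_key, fall back to a key named like the asset if one
    exists; otherwise return the asset folder itself.
    """
    if seek_key is None:
        seek_key = asset_name if asset_name in seek_keys else None
    if seek_key is None:
        return asset_dir
    return os.path.join(asset_dir, seek_keys[seek_key])
```

Under this rule `hg38/fasta` and `hg38/fasta.fasta` would resolve to the same file, while assets without a same-named key still resolve to their folder.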

check for env var existence in genome_path

if the specified genome_path in the config is an environment variable that is not set, refgenie pull creates a directory literally named after the variable (here '$GENOMES') in the current working dir.

genome_folder: $GENOMES
[mjs5kd@cphg-5L9SYF2 refgenieserver]: refgenie pull -g hg38 -a epilog -u
[mjs5kd@cphg-5L9SYF2 refgenieserver]: ll
total 52
drwxr-xr-x  8 mjs5kd mjs5kd 4096 Jun  1 17:01  ./
drwxr-xr-x 19 mjs5kd mjs5kd 4096 Jun  1 17:00  ../
-rw-r--r--  1 mjs5kd mjs5kd  564 May 31 16:06  Dockerfile
drwxr-xr-x  4 mjs5kd mjs5kd 4096 May 31 14:02  files/
drwxr-xr-x  3 mjs5kd mjs5kd 4096 Jun  1 17:01 '$GENOMES'/
drwxr-xr-x  8 mjs5kd mjs5kd 4096 Jun  1 16:31  .git/
-rw-r--r--  1 mjs5kd mjs5kd  797 May 31 14:02  .gitignore
drwxrwxr-x  2 mjs5kd mjs5kd 4096 Jun  1 16:54  .idea/
-rw-r--r--  1 mjs5kd mjs5kd 1572 May 31 14:02  README.md
drwxr-xr-x  5 mjs5kd mjs5kd 4096 May 31 14:03  refgenieserver/
-rw-r--r--  1 mjs5kd mjs5kd  751 Jun  1 17:01  refgenie.yaml
drwxr-xr-x  2 mjs5kd mjs5kd 4096 May 31 16:09  requirements/
-rw-r--r--  1 mjs5kd mjs5kd 1951 May 31 14:02  setup.py
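A check along these lines could catch the unset variable before any directory is created. `expand_or_fail` is a hypothetical helper; it also rejects paths that contain a literal `$`, which is a simplification.

```python
import os


def expand_or_fail(path):
    """Expand environment variables and refuse paths with unset ones.

    os.path.expandvars leaves unknown variables in place, so a remaining
    '$' signals an unset variable (or a literal '$' in the path).
    """
    expanded = os.path.expandvars(path)
    if "$" in expanded:
        raise ValueError(
            "Unset environment variable in path: '{}'".format(path))
    return expanded
```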

bug in refgenie seek

on dev I get:

refgenie seek hg19/chrom_sizes
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/refgenie", line 10, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenie/refgenie.py", line 532, in main
    print((rgc.get_asset(a["genome"], a["asset"], a["tag"], a["seek_key"])))
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenconf/refgenconf.py", line 191, in get_asset
    raise TypeError("Asset existence check must be a one-arg function.")
TypeError: Asset existence check must be a one-arg function.

on genome_checksum branch I get:

refgenie seek hg19/chrom_sizes
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/refgenie", line 6, in <module>
    from refgenie.refgenie import main
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenie/refgenie.py", line 16, in <module>
    from .exceptions import MissingGenomeConfigError, MissingFolderError
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenie/exceptions.py", line 1, in <module>
    from refgenconf import CFG_ENV_VARS
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenconf/__init__.py", line 6, in <module>
    from .refgenconf import *
  File "/home/nsheff/.local/lib/python3.5/site-packages/refgenconf/refgenconf.py", line 503, in <module>
    for x in rgc.genomes:
NameError: name 'rgc' is not defined

check for untar

raised in databio/pepatac#111

when refgenie returns a direct path to something useful, it should make sure it's there (particularly should make sure it's not still tarred...)
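A minimal version of such a check, assuming a hypothetical `check_asset_ready` helper that runs just before a path is handed back to the caller:

```python
import os
import tarfile


def check_asset_ready(path):
    """Return path if it exists and is not itself a tar archive."""
    if not os.path.exists(path):
        raise IOError("Asset path does not exist: '{}'".format(path))
    if os.path.isfile(path) and tarfile.is_tarfile(path):
        raise IOError("Asset is still archived: '{}'".format(path))
    return path
```

This only covers the case where the returned path is the archive itself; an archive sitting next to a missing asset folder would surface as the "does not exist" error.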

Imports in main module

There are declared imports from ubiquerg that don't exist. Are these to be imported from elsewhere, moved to ubiquerg, or defined locally?

A refgenie class

We need to restructure refgenie to be a package, and to produce a class, which can be serialized (saved as a yaml file maybe?) at the end of the run with attributes pointing to the indexes that it creates; then there should be a way to reconstruct the class from that file.

chk_digest_update_child checks and writes asset relationship multiple times

With multiple servers defined this method will check and update the asset relationship data multiple times (for each server).

We should not loop through all the servers here since this method is called in refgenie after a successful asset pull to check and update asset relationship data. Consequently, we just need to record the server that the new asset was actually pulled from (in pull_asset method) and pass the URL here and use just this URL.

Originally posted by @MichalStolarczyk in #68
