nasa-impact / csdap-bulk-download

Helper script to download data urls from SDX.

Home Page: https://csdap.earthdata.nasa.gov

Python 100.00%

csdap-bulk-download's Introduction

CSDAP Bulk Download Script

Authorized users submit data requests through the Smallsat Data Explorer (SDX) for desired data scenes. Once an order is approved, users will receive a .csv file that includes download links to the ordered scenes. Each ordered scene includes a separate download link for each asset type. Depending on the number of ordered scenes and associated assets, the .csv could have many download links. This script allows users to conveniently download all files or to select a subset of files to download by filtering the .csv file.
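
As a rough illustration of filtering the .csv before running the script, the sketch below keeps only the rows whose asset type ends with a given suffix. It assumes the header collection_id,scene_id,asset_type shown in the inventory example in the issues further down this page; an order .csv may use different column names. filter_rows is a hypothetical helper and is not part of this package.

```python
import csv
import io


def filter_rows(src, dst, column, suffix):
    """Copy rows from src to dst where row[column] ends with suffix.

    Returns the number of rows kept. The column name is an assumption
    about the .csv layout, not something the script guarantees.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column].endswith(suffix):
            writer.writerow(row)
            kept += 1
    return kept


# Small in-memory demonstration with made-up rows.
sample = io.StringIO(
    "collection_id,scene_id,asset_type\n"
    '"spire","scene_a","orbit.sp3"\n'
    '"spire","scene_a","attitude.log"\n'
)
out = io.StringIO()
kept = filter_rows(sample, out, "asset_type", ".sp3")
print(kept)
```

With real files, src and dst would be open file handles (opened with newline=""), and the filtered file would then be passed to csdap-bulk-download in place of the original.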

Note: Each download link can be used only once. However, downloads are logged, so if a download fails partway (e.g., loss of internet connection), running the script again downloads only the files that were not previously downloaded.

Installation

pip3 install --user https://github.com/NASA-IMPACT/csdap-bulk-download/archive/main.zip

Note: Keep an eye out for a warning from pip along the lines of this:

WARNING: The script csdap-bulk-download is installed in '/Users/username/Library/Python/3.8/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

If you encounter this issue, you will likely need to ensure that the directory mentioned is available on your path. See here for techniques on resolving this issue.

Development

Install into a virtual environment:

pip install -e .

Formatting & Linting

To maintain common code style, please format and lint all code contributions:

pip3 install -r requirements-dev.txt
black csdap_bulk_download  # Format code
flake8  # Lint code

csdap-bulk-download's Issues

Update to support quotas download endpoint

This script downloads items from download/:order_id/:scene_id/:asset_type. We should instead be downloading data with a :collection_id rather than an :order_id.
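
A minimal sketch of building the proposed path, for illustration only. The /api/v1 prefix is an assumption taken from the auth URL that appears in a traceback elsewhere on this page, and download_url is a hypothetical helper rather than the script's actual code.

```python
from urllib.parse import quote

# Assumed base URL; only the host is confirmed by the project home page.
BASE_URL = "https://csdap.earthdata.nasa.gov/api/v1"


def download_url(collection_id, scene_id, asset_type, base_url=BASE_URL):
    """Build download/:collection_id/:scene_id/:asset_type under base_url."""
    # Percent-encode each segment so a stray "/" cannot change the route.
    parts = [quote(part, safe="") for part in (collection_id, scene_id, asset_type)]
    return f"{base_url}/download/{'/'.join(parts)}"


print(download_url("spire", "2021-01-17T23-47-20_FM120_navigation", "example.sp3"))
```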

Refine Radbio Userflows

Fix the "Pre authorization required for this application, please authorize by visiting the resolution url" issue

(.venv) C:\Users\megha\Documents\hello_django\csdap-bulk-download-main>csdap-bulk-download order_356.csv -o
Earthdata Login username: meghananp
Earthdata Login password:
Traceback (most recent call last):
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\Scripts\csdap-bulk-download-script.py", line 33, in <module>
    sys.exit(load_entry_point('csdap-bulk-download==1.0', 'console_scripts', 'csdap-bulk-download')())
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\csdap_bulk_download\cli.py", line 110, in cli
    token = csdap.get_auth_token(username, password)
  File "C:\Users\megha\Documents\hello_django\csdap-bulk-download-main.venv\lib\site-packages\csdap_bulk_download\csdap.py", line 71, in get_auth_token
    raise AuthError(querystring["error_msg"])
csdap_bulk_download.exceptions.AuthError: ['Pre authorization required for this application, please authorize by visiting the resolution url']

The error above is raised because the user has not yet authorized the application. The user needs to agree to the terms and conditions, by visiting the resolution URL mentioned in the message, before starting the download.
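
One possible direction, sketched under assumptions: the traceback shows the error message being read from a parsed query string, so the same query string might also carry the URL the user has to visit. The parameter name resolution_url below is a guess, AuthError is a stand-in for csdap_bulk_download.exceptions.AuthError, and check_auth_redirect is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


class AuthError(Exception):
    """Stand-in for csdap_bulk_download.exceptions.AuthError."""


def check_auth_redirect(location):
    """Raise AuthError with an actionable message if the redirect has an error.

    location is the redirect target returned by the auth request.
    Returns the parsed query string when no error is present.
    """
    query = parse_qs(urlparse(location).query)
    if "error_msg" in query:
        message = "; ".join(query["error_msg"])
        # "resolution_url" is an assumed parameter name, not confirmed.
        resolution = query.get("resolution_url", [])
        if resolution:
            message += f" (open {resolution[0]} in a browser, then retry)"
        raise AuthError(message)
    return query
```

Surfacing the URL in the message would let a user complete the authorization step without reading a traceback.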

Tasks

Replicate the issue
Find the root cause
Come up with a fix

Acceptance Criteria

Pull Request with either the fix or a comprehensive document that explains the fix

"failed to download" for all assets scripts

I'm unable to get any download scripts to work. I can download individual files one at a time in the CSDA web GUI without problems, but when I use this software with any assets.csv file downloaded via the 'Download Inventory' button, I get a "Failed to download" message for every file (I have tried multiple file types). I have ruled out an authentication issue: entering incorrect credentials produces an authentication error instead of the error I am seeing.

If it matters, I installed using pip install as suggested.

Example:

~/.local/bin/csdap-bulk-download assets.csv
Earthdata Login username: ******
Earthdata Login password: 
2022-03-08 17:46:53,431:spire/2021-01-17T23-47-20_FM120_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antPOD.rnx: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104.sp3: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-47-20_FM120_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antFRO.rnx: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-47-20_FM120_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120.sp3: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-47-20_FM120_navigation/spire_att_L1A_leoAtt_v06.01_2021-01-18T00-00-00_FM120.log: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antFRO.rnx: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_nav_L1B_leoOrb_v06.01_2021-01-17T23-00-24_FM104.sp3: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antBRO.rnx: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-47-20_FM120_navigation/spire_nav_L1B_leoOrb_v06.01_2021-01-17T23-47-20_FM120.sp3: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_att_L1A_leoAtt_v06.01_2021-01-17T23-00-00_FM104.log: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-00-24_FM104_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antPOD.rnx: Failed to download
2022-03-08 17:46:53,439:spire/2021-01-17T23-47-20_FM120_navigation/spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antBRO.rnx: Failed to download
Complete.

assets.csv:

collection_id,scene_id,asset_type
"spire","2021-01-17T23-47-20_FM120_navigation","spire_att_L1A_leoAtt_v06.01_2021-01-18T00-00-00_FM120.log"
"spire","2021-01-17T23-47-20_FM120_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120.sp3"
"spire","2021-01-17T23-47-20_FM120_navigation","spire_nav_L1B_leoOrb_v06.01_2021-01-17T23-47-20_FM120.sp3"
"spire","2021-01-17T23-47-20_FM120_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antBRO.rnx"
"spire","2021-01-17T23-47-20_FM120_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antFRO.rnx"
"spire","2021-01-17T23-47-20_FM120_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-47-20_FM120_antPOD.rnx"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_att_L1A_leoAtt_v06.01_2021-01-17T23-00-00_FM104.log"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104.sp3"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_nav_L1B_leoOrb_v06.01_2021-01-17T23-00-24_FM104.sp3"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antBRO.rnx"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antFRO.rnx"
"spire","2021-01-17T23-00-24_FM104_navigation","spire_nav_L1A_podObs_v06.01_2021-01-17T23-00-24_FM104_antPOD.rnx"

Any suggestions for how to troubleshoot?

NameError when run in Virtual Environment

I'm having trouble running csdap-bulk-download on the x86_64 Linux platform (Debian 10) using python3 Virtual Environments. The setup and error are some variation of the following:

Installing into the venv csdap

python3 -m venv csdap
. csdap/bin/activate
# My venv doesn't come with wheel, so install it manually
pip3 install wheel
# Removed --user since I want this in the venv
pip3 install https://github.com/NASA-IMPACT/csdap-bulk-download/archive/main.zip
csdap-bulk-download order.csv -o /data/csdap/

Results of running in the Anaconda install ~/csdap3

~$ csdap-bulk-download order.csv -o /data/csdap/
Traceback (most recent call last):
  File "/home/user/csdap3/bin/csdap-bulk-download", line 5, in <module>
    from csdap_bulk_download.cli import cli
  File "/home/user/csdap3/lib/python3.9/site-packages/csdap_bulk_download/cli.py", line 12, in <module>
    from .csdap import CsdapClient
  File "/home/user/csdap3/lib/python3.9/site-packages/csdap_bulk_download/csdap.py", line 19, in <module>
    class CsdapClient:
  File "/home/user/csdap3/lib/python3.9/site-packages/csdap_bulk_download/csdap.py", line 88, in CsdapClient
    **_,
NameError: name '_' is not defined

So far I've encountered this on:

System Python (3.7.3)
System Python created venv, then built/installed stock Python 3.10.2 into the venv.
Anaconda 3 (Python 3.9.7)

I have not tried this using an isolated user account and installing with pip3 install --user.

/home/user is an ext4 filesystem. /data/csdap is a ZFS dataset owned by user. Please let me know if you need additional details.

When user hits their quota, downloader should exit

Currently, if a user hits their quota (receives a 403 response), the downloader simply logs this information and continues:

https://github.com/NASA-IMPACT/csdap-bulk-download/blob/main/csdap_bulk_download/csdap.py#L123-L129

If someone is running the downloader unattended, this can result in a large number of unnecessary requests after the quota rejection. Instead, the downloader should inform the user that they have hit their quota and exit the task (after other threads complete).
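
The behaviour requested here could look roughly like the sketch below: a shared threading.Event is set on the first 403, queued work is skipped without sending a request, and in-flight work is allowed to finish. This is not the project's implementation; QuotaExceeded, download_all, and the fetch interface (a callable returning an HTTP status code) are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class QuotaExceeded(Exception):
    """Hypothetical exception signalling a 403 quota response."""


def download_all(items, fetch, max_workers=4):
    """Call fetch(item) for each item; stop issuing requests after a 403.

    Returns the list of attempted items, or raises QuotaExceeded once
    all running workers have finished.
    """
    stop = threading.Event()
    lock = threading.Lock()
    attempted = []

    def worker(item):
        if stop.is_set():
            return  # quota already hit: skip without sending a request
        with lock:
            attempted.append(item)
        if fetch(item) == 403:
            stop.set()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consuming the map waits for every task, skipped or not.
        list(pool.map(worker, items))

    if stop.is_set():
        raise QuotaExceeded(f"quota reached after {len(attempted)} requests")
    return attempted
```

With several workers, a few requests that started before the 403 was seen will still complete, which matches the "after other threads complete" wording above.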

File path too long on Windows

When downloading Spire data on a Windows system, the file names/directory names can be so long and nested that they trigger a File Path Too Long error.
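
For context, many Windows APIs limit paths to 260 characters (MAX_PATH) unless long-path support is enabled or the path uses the extended-length prefix \\?\. The sketch below shows one way to add that prefix to an absolute path. Whether this fits the script's file handling is an assumption, and to_extended_length is a hypothetical helper. The prefix turns off path normalization, so the input must already be absolute and free of "." and ".." segments.

```python
EXTENDED_PREFIX = "\\\\?\\"   # the literal text \\?\
UNC_PREFIX = "\\\\?\\UNC\\"   # the literal text \\?\UNC\


def to_extended_length(path):
    """Return an absolute Windows path in extended-length form."""
    # Extended-length paths must use backslashes only.
    path = path.replace("/", "\\")
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended-length form
    if path.startswith("\\\\"):
        # UNC share: \\server\share\... becomes \\?\UNC\server\share\...
        return UNC_PREFIX + path[2:]
    return EXTENDED_PREFIX + path


print(to_extended_length("C:\\data\\spire\\scene"))
```

An alternative that needs no code change is enabling long-path support in Windows 10 or later, or choosing a short output directory such as a drive root.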

Download starts over after crash.

This issue was brought to light due to #22. The platform I'm running on is Debian 10, Python 3.7.3.

The order I'm downloading is quite large, so accumulating statistics is quite time consuming as the storage volume is on traditional spinning drives. What I've gathered so far:

The program is run as csdap-bulk-download order_xyz.csv -o /data/output/folder.
/data/output/folder is on a ZFS volume. This is a local volume (i.e., not NFS-mounted).
The order_xyz.csv file has some 17 million lines
find /data/output/folder returns around 9 million files and folders
find /data/output/folder -type f returns 114,412 files
du -hs /data/output/folder returns 146 GB
I've run the downloader several dozen times.
After several runs over the course of a week, the output of find /data/output/folder -type f remains at 114,412.

It seems the downloads start back at the beginning of the list after the process is reaped by the OOM killer.
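
One way a restart could avoid beginning again from the top is to skip rows whose target file already exists on disk, reading the .csv lazily so a file with millions of lines is never held in memory. The sketch below assumes the output layout <output dir>/<collection_id>/<scene_id>/<asset_type> and the column names from the inventory example on this page; neither is confirmed for order files, and pending_rows is a hypothetical helper.

```python
import csv
import io
import os
import tempfile


def pending_rows(csv_file, out_dir):
    """Yield rows whose target file is missing or empty, one row at a time."""
    for row in csv.DictReader(csv_file):
        # Assumed layout: <out_dir>/<collection_id>/<scene_id>/<asset_type>
        target = os.path.join(
            out_dir, row["collection_id"], row["scene_id"], row["asset_type"]
        )
        if not os.path.isfile(target) or os.path.getsize(target) == 0:
            yield row


# Demonstration in a temporary directory with one file already present.
with tempfile.TemporaryDirectory() as out_dir:
    scene_dir = os.path.join(out_dir, "spire", "scene_a")
    os.makedirs(scene_dir)
    with open(os.path.join(scene_dir, "a.sp3"), "w") as handle:
        handle.write("data")
    order = io.StringIO(
        "collection_id,scene_id,asset_type\n"
        "spire,scene_a,a.sp3\n"
        "spire,scene_a,b.sp3\n"
    )
    remaining = [row["asset_type"] for row in pending_rows(order, out_dir)]
print(remaining)
```

A size check like this cannot detect a truncated file; writing to a temporary name and renaming on completion would make the existence test more trustworthy.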

Fix the Windows SSLError

Motivation

The bulk download script fails when run from Windows. This looks like an SSL issue. This is the error log:

D:\>csdap-bulk-download f:\dsms\order_356.csv -o f:\dsms\
Earthdata Login username: jstoker
Earthdata Login password:
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 588, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "c:\programdata\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 248, in _get_conn
    return conn or self._new_conn()
  File "c:\programdata\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 816, in _new_conn
    raise SSLError("Can't connect to HTTPS URL because the SSL "
urllib3.exceptions.SSLError: Can't connect to HTTPS URL because the SSL module is not available.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "c:\programdata\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "c:\programdata\anaconda3\lib\site-packages\urllib3\util\retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='csdap.earthdata.nasa.gov', port=443): Max retries exceeded with url: /api/v1/auth/?redirect_uri=script (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\Scripts\csdap-bulk-download.exe\__main__.py", line 9, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\csdap_bulk_download\cli.py", line 110, in cli
    token = csdap.get_auth_token(username, password)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\csdap_bulk_download\csdap.py", line 31, in get_auth_token
    allow_redirects=False,
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\jstoker\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='csdap.earthdata.nasa.gov', port=443): Max retries exceeded with url: /api/v1/auth/?redirect_uri=script (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))

Here is a Stack Overflow question that asks about this issue:
https://stackoverflow.com/questions/54135206/requests-caused-by-sslerrorcant-connect-to-https-url-because-the-ssl-module/55632553#55632553
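
A quick way to replicate the problem outside the downloader is to check whether the interpreter can import Python's ssl module at all, since the urllib3 message in the traceback is raised when that import is unavailable. The sketch below is a diagnostic only and ssl_status is a hypothetical helper. With Anaconda on Windows, a commonly reported cause is running Python outside an activated environment, so that the OpenSSL DLLs are not found on PATH; that explanation is an assumption here and has not been confirmed for this report.

```python
import importlib


def ssl_status():
    """Return (available, detail) describing whether ssl can be imported."""
    try:
        ssl = importlib.import_module("ssl")
    except ImportError as exc:
        return False, f"ssl import failed: {exc}"
    # OPENSSL_VERSION names the OpenSSL build Python is linked against.
    return True, ssl.OPENSSL_VERSION


available, detail = ssl_status()
print(available, detail)
```

If this prints False in the same shell where csdap-bulk-download fails, the problem lies in the Python installation or environment rather than in the script.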

Tasks

Replicate the issue
Figure out the root cause of the issue and fix it
If the fix can be shipped with the code, change the code; otherwise, write a comprehensive doc that describes how to fix the issue in case it arises.

Acceptance Criteria

Pull Request with either the fix or a comprehensive document that explains the fix

Produce an error log

When downloading a long list of files, errors are printed to the console but not output to a log file. This makes it hard to keep track of errors. We should dump errors to a separate file for easier parsing.
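
A minimal sketch of the idea using the standard logging module: a second handler, limited to ERROR and above, writes to a file while console output is left unchanged. The logger name, the add_error_log helper, and the file location are placeholders; the format string mirrors the timestamp:message shape of the console output quoted in another issue on this page.

```python
import logging
import os
import tempfile


def add_error_log(logger, path):
    """Attach a handler that writes ERROR-and-above records to path."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setLevel(logging.ERROR)
    handler.setFormatter(logging.Formatter("%(asctime)s:%(message)s"))
    logger.addHandler(handler)
    return handler


# Demonstration with a throwaway log file.
log_path = os.path.join(tempfile.mkdtemp(), "errors.log")
logger = logging.getLogger("csdap_bulk_download_example")  # hypothetical name
logger.setLevel(logging.INFO)
handler = add_error_log(logger, log_path)
logger.info("scene_a/a.sp3: Downloaded")           # below ERROR: not written
logger.error("scene_a/b.sp3: Failed to download")  # written to errors.log
handler.close()
```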

tqdm dependency error

ModuleNotFoundError: No module named 'tqdm.contrib.logging'
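
This error usually means the installed tqdm predates the tqdm.contrib.logging submodule, in which case upgrading tqdm (for example with pip3 install --upgrade tqdm) is the likely remedy; that diagnosis is an assumption, since the report gives no version. The sketch below checks for a module without importing it fully. has_module is a hypothetical helper.

```python
import importlib.util


def has_module(name):
    """Return True if the dotted module name can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted name is missing.
        return False


print(has_module("tqdm.contrib.logging"))
```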

Memory leak?

I'm running the csdap-bulk-download on Debian 10 with Python 3.7.3 and the program is killed after 6-10 hours by the Out Of Memory killer. This is surprising as the system I'm running the downloader on has 256 GB of RAM.

The downloader seems to download about 146 GB of data before the OOM killer reaps the process. The downloader is outputting to a folder on a ZFS array backed by spinning disks. I've observed the python3 process in top and it grows steadily in memory consumption.
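
The cause of the growth is not established in this report. One pattern that can produce steadily rising memory with very large orders is submitting every row to a thread pool up front, so that millions of pending futures are held at once. If that turns out to be the case, a bounded submission loop like the sketch below keeps only a fixed number in flight. run_bounded and its parameters are hypothetical and do not describe the script's current code.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def run_bounded(items, task, max_workers=4, max_pending=16):
    """Run task(item) for every item, holding at most max_pending futures."""
    iterator = iter(items)
    completed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(task, item) for item in islice(iterator, max_pending)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the worker
                completed += 1
            # Top up with as many new items as just finished.
            for item in islice(iterator, len(done)):
                pending.add(pool.submit(task, item))
    return completed
```

Because items is consumed through an iterator, it can be a lazily read .csv rather than a list loaded into memory.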

Please let me know if you need any additional details.
