
archivist's Introduction

About me

Hello! My name is Jean-Paul R. Soucy and I am a PhD candidate in infectious disease epidemiology at the Dalla Lana School of Public Health at the University of Toronto. My research interests focus on the use of emerging data sources in infectious disease surveillance.

Connect with me on Twitter. Find my personal website at jprs.me and my writing at Data Gripes. Code related to my blog can be found in the data-gripes repository.

My repositories

Table of contents

Software packages are highlighted with code text.

COVID-19 Canada Open Data Working Group

COVID-19 datasets
Other CCODWG projects

Python & R packages


In addition to the COVID-19 data automation pipeline described above, I have also created the following R and Python packages:

  • archivist: Python-based digital archive tool currently powering the Canadian COVID-19 Data Archive
  • getDDD: Download defined daily dose (DDD) information for selected ATC codes from the website of the WHO Collaborating Centre for Drug Statistics Methodology

Reproducible research


Wherever possible, I aim to publish reproducible research with open data.

Presentations & workshops


Miscellaneous

COVID-19 data
Other

Gists

Find my GitHub gists here.

archivist's People

Contributors

jeanpaulrsoucy, mschoettle, svmillin


archivist's Issues

Improved console output using Rich

Rich is a Python library for writing rich, formatted text to the console. This could be very useful for simplifying and extending the output of this package, as well as future add-ons/integrations (e.g., data validation).
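A minimal sketch of what Rich-formatted status output could look like (the messages themselves are illustrative only):

```python
from rich.console import Console

console = Console()

# Rich markup tags handle colour and style without manual ANSI codes
console.print("[bold green]SUCCESS:[/bold green] example-dataset.html")
console.print("[bold red]FAILURE:[/bold red] example-dataset.pdf")
```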

Integrate utils functions into archivist

The functions from utils.py (e.g., for managing the contents of datasets.json) in Covid19CanadaArchive should be merged into archivist.

The ability to generate a markdown list of all datasets (by meta_group_1 and meta_group_2) should also be added as a utility function.
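A rough sketch of what the markdown list utility might look like, assuming datasets.json maps UUIDs to metadata dicts (the "file_name" field is a hypothetical stand-in for a display name):

```python
import json
from collections import defaultdict

def datasets_markdown(path="datasets.json"):
    # group dataset names under meta_group_1 / meta_group_2 headings
    with open(path) as f:
        datasets = json.load(f)
    groups = defaultdict(lambda: defaultdict(list))
    for uuid, ds in datasets.items():
        name = ds.get("file_name", uuid)  # hypothetical field
        groups[ds.get("meta_group_1", "Other")][ds.get("meta_group_2", "")].append(name)
    lines = []
    for g1, subgroups in sorted(groups.items()):
        lines.append(f"## {g1}")
        for g2, names in sorted(subgroups.items()):
            if g2:
                lines.append(f"### {g2}")
            lines.extend(f"- {n}" for n in sorted(names))
    return "\n".join(lines)
```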

Migrate features from archiver.py to archivist

The features still in archiver.py should be migrated to archivist. Most of them should live inside a single meta-function, archivist, which will be the only function called directly by the user. This function will accept the flags presently accepted by archivist.py.

  • Loading and filtering datasets.json

Add debug option to force HTTPS verification

The verify = False option has been used for some datasets where HTTPS verification fails when using requests. We already have a flag to universally disable verification (ignore-ssl). An additional flag should be added to force verification in order to check which datasets can safely have their verify = False option dropped.
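A minimal sketch of how the override could work (the flag and parameter names are hypothetical):

```python
import requests

def fetch(url, dataset_verify=True, force_ssl=False):
    # force_ssl (the hypothetical new debug flag) overrides a dataset's
    # verify=False setting, revealing which exceptions are now stale
    verify = True if force_ssl else dataset_verify
    return requests.get(url, verify=verify, timeout=30)
```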

Alternative format for saving webpages

At the moment, we have two methods of saving webpages using archivist:

  • requests: For pages without (necessary) JavaScript, we simply request the URL and save the response.
  • Chrome + Selenium: For pages with JavaScript, we give the page time to render, then save the page source.
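In outline, the two current methods look something like this (a simplified sketch, not the package's actual code):

```python
import time
import requests
from selenium import webdriver

def save_static(url, out_file):
    # method 1: plain request, save the raw response
    resp = requests.get(url, timeout=30)
    with open(out_file, "wb") as f:
        f.write(resp.content)

def save_rendered(url, out_file, wait=5):
    # method 2: headless Chrome, wait for JavaScript, save page source
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait)  # give the page time to render
        with open(out_file, "w") as f:
            f.write(driver.page_source)
    finally:
        driver.quit()
```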

However, neither of the above methods accomplishes tasks like saving externally-embedded images and page styling.

There are two alternatives to the current method of saving webpages:

  • SingleFile: Web extension/CLI tool for saving a complete webpage as a single HTML file.
  • SingleFileZ: As above, but output format is both an HTML file and a valid ZIP file. The only possible issues are that the resulting file may not be as universally readable as the pure HTML format and that some work still needs to be done to ensure long-term support of Chrome-based browsers.

Among other interesting projects, shot-scraper html is also worth looking at.

Utility functions

Utility functions that should be added to the package:

  • Extract a list of JSON files from an Esri/ArcGIS dashboard
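For the Esri/ArcGIS case, the extraction might look something like this; the query parameters follow the standard ArcGIS REST API, but the service URL and layer handling are placeholders:

```python
import requests

def arcgis_layer_json(service_url, layer=0):
    # standard ArcGIS REST query returning all features of one layer
    # as JSON; service_url stands in for a dashboard's FeatureServer
    url = f"{service_url}/{layer}/query"
    params = {"where": "1=1", "outFields": "*", "f": "json"}
    return requests.get(url, params=params, timeout=30).json()
```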

Add ability to download file locally

This would probably work best as a new mode (e.g., dl). For specifying the directory to write files to, we could either reuse out_path, add a new out_dir parameter, or simply replace out_path with out_dir. The last option would also require changing the behaviour of index to write the index to the specified directory under a default name. An overwrite parameter should probably also be added to determine whether files with the same name are overwritten.

Add ability to cache datasets on external server between runs

Sometimes, datasets are inaccessible at the time that the nightly update runs despite having been available earlier in the day.

Perhaps the test mode could be reworked (or another mode added, e.g., no-prod) such that a current copy of each dataset is kept on an external server (e.g., S3). On each run, the hash of the version of the dataset on the server could be compared with the local hash, replacing it when they differ (assuming the dataset downloaded correctly, e.g., a dynamic HTML page meets its minimum size requirement).

The prod mode could copy the current version of the dataset on the server to the permanent location with the appropriate timestamp.

One issue could be that copies of inactive datasets would remain on the server. A maintenance function could run at the beginning of every run to clear datasets marked as inactive from the server.
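A minimal sketch of the compare-and-replace step, assuming S3 via boto3 (the bucket name and function are hypothetical):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "archive-cache"  # hypothetical bucket name

def maybe_update_cache(key, local_bytes):
    # upload only when the fresh download differs from the cached copy;
    # note the S3 ETag equals the MD5 only for non-multipart uploads
    local_md5 = hashlib.md5(local_bytes).hexdigest()
    try:
        etag = s3.head_object(Bucket=BUCKET, Key=key)["ETag"].strip('"')
    except s3.exceptions.ClientError:
        etag = None  # no cached copy yet
    if etag != local_md5:
        s3.put_object(Bucket=BUCKET, Key=key, Body=local_bytes)
```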

Unify file downloading functions

dl_fun can probably be deprecated in favour of a single download method that chooses its options via other attributes (i.e., extension and the presence/absence of the "js" arg). ss_page can be deprecated as there is no use for it in the current package.

This will also simplify the code and avoid redundancy when implementing features like automatic retries (#1).
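The unified method might dispatch along these lines (a sketch reusing the existing function names; the dataset structure is assumed):

```python
def download(dataset, out_file):
    # choose the strategy from the dataset's own attributes rather
    # than a separately stored dl_fun value; html_page and dl_file
    # are the package's existing download functions
    if dataset.get("args", {}).get("js"):
        html_page(dataset, out_file)  # Selenium path for JS-rendered pages
    else:
        dl_file(dataset, out_file)    # plain requests path
```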

print-md5 error

The following errors occur only when the --debug print-md5 flag is enabled:

PE - PEI COVID-19 Case Data
'bytes' object has no attribute 'encode'
FAILURE: pe/pei-webpage/pei-webpage_2022-03-26_20-15.html
PE - COVID-19 Vaccination Data
'bytes' object has no attribute 'encode'
FAILURE: pe/pei-vaccination-webpage/pei-vaccination-webpage_2022-03-26_20-15.html

There are two issues here: these two webpages are failing to be hashed, and the failure to hash them causes the dataset download as a whole to fail.
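The error message suggests the hashing code calls .encode() unconditionally, which fails when the page source is already bytes. A guess at the fix is to normalize to bytes first (a sketch, not the package's actual function):

```python
import hashlib

def print_md5(content):
    # page source may arrive as str or bytes depending on the
    # download method; only encode when it is still a string
    if isinstance(content, str):
        content = content.encode()
    print(hashlib.md5(content).hexdigest())
```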

Double-counting failed UUID in the logs

2022-04-09 22:02

Successful downloads: 385/389
Failed downloads: 4/389

FAILURE: nt/nwt-dashboard-communities-webpage/nwt-dashboard-communities-webpage_2022-04-09_22-11.html
FAILURE: nt/nwt-dashboard-communities-webpage/nwt-dashboard-communities-webpage_2022-04-09_22-11.html
FAILURE: nu/nunavut-vaccination-table/vaccine_table_2022-04-09_22-16.png
FAILURE: sk/covid-weekly-epi-report/covid-weekly-epi-report_2022-04-09_22-31.pdf

Automatic redo for Shiny dashboards

Shiny dashboards can be finicky and "time out" rather than displaying content. A hallmark of a failed download is a small file size, but there may be other ways to detect this as well.
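One possible shape for the redo logic, keyed on file size (all names and thresholds here are hypothetical):

```python
import os
import time

def download_with_redo(dl_func, out_file, min_size=1024, max_tries=3, wait=30):
    # re-attempt the download when the saved file is suspiciously
    # small, the hallmark of a timed-out Shiny dashboard
    for _ in range(max_tries):
        dl_func(out_file)
        if os.path.exists(out_file) and os.path.getsize(out_file) >= min_size:
            return True
        time.sleep(wait)  # give the dashboard a chance to recover
    return False
```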

Document project files

The format for all project files should be documented:

  • datasets.json
  • config.toml
  • Special processing code (currently proc/webdriver)

More general recipe for generating re-run code

Instead of checking all of the options and using them to generate re-run code (which requires manually specifying any new flags in the re-run code), we could simply process sys.argv. This would involve stripping out any uuid specification (replacing it with the failed datasets from the current run) and perhaps forcing the project path to be explicit.
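A sketch of the sys.argv approach (the command name is an assumption, and the flag and its value are assumed to be separate tokens):

```python
import sys

def rerun_code(failed_uuids):
    # keep every flag as given, but strip any existing --uuid value
    # and substitute the datasets that failed in the current run
    args = iter(sys.argv[1:])
    kept = []
    for a in args:
        if a == "--uuid":
            next(args, None)  # drop the flag's value too
            continue
        kept.append(a)
    kept += ["--uuid", ",".join(failed_uuids)]
    return "archivist " + " ".join(kept)
```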

Log email: include log as attachment

The log email would be easier to parse if the actual log portion were included as a .txt attachment rather than comprising the vast majority of the email body.
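With the standard library, attaching the log is straightforward (a sketch; addresses and SMTP details are placeholders):

```python
import smtplib
from email.message import EmailMessage

def send_log_email(summary, log_text, to_addr, from_addr, smtp_host="localhost"):
    # short summary in the body, full log attached as a .txt file
    msg = EmailMessage()
    msg["Subject"] = "archivist run log"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(summary)
    msg.add_attachment(log_text.encode(), maintype="text",
                       subtype="plain", filename="log.txt")
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```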

Error tolerance: download timeouts

Error tolerance should be rock-solid. This includes being more aggressive with timeouts, as well as guarding against outright show-stoppers.

For example, trying to download the SK situation report PDF (after retrieving the URL) completely killed the script recently.
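At a minimum, every request could carry an explicit timeout and a catch-all so that one dataset cannot kill the run (the timeout values are illustrative):

```python
import requests

def safe_get(url):
    try:
        # (connect, read) timeouts prevent a hung server from
        # stalling the entire nightly run
        resp = requests.get(url, timeout=(10, 60))
        resp.raise_for_status()
        return resp.content
    except requests.RequestException as e:
        print(f"FAILURE: {url}: {e}")  # log and move on
        return None
```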

Flexible query to select datasets to download

At the moment, the method of selecting datasets is limited to explicit inclusion (--uuid) and exclusion (--uuid-exclude). It would be useful to be able to select datasets:

  • In a range between two datasets
  • Before a certain dataset
  • After a certain dataset

Additional possibilities would include selecting datasets only within a certain group (e.g., "on" or "bc").

It would be best for this query to happen in a single flag to avoid cluttering up the help pages with redundant parameters.
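One possible shape for the selection logic, assuming datasets keep their datasets.json order and that group membership can be read from a UUID prefix (both assumptions):

```python
def select_uuids(uuids, start=None, end=None, group=None):
    # inclusive range between two datasets, optionally narrowed
    # to a hypothetical group prefix such as "on" or "bc"
    i = uuids.index(start) if start else 0
    j = uuids.index(end) + 1 if end else len(uuids)
    selected = uuids[i:j]
    if group:
        selected = [u for u in selected if u.startswith(group)]
    return selected
```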

Create file download class

With a file download class, we could easily track properties such as n_retries, useful for yet-to-be-implemented features such as #10 (automatic retries based on expected file size).
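A minimal sketch of such a class (the fields beyond n_retries are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileDownload:
    # tracks one dataset download across retries
    uuid: str
    url: str
    out_file: str
    n_retries: int = 0
    success: bool = False
```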

Separate out special processing code

Since archivist is intended to be a stand-alone project, any special processing code (i.e., code that depends on a specific uuid being selected) should be separated out and moved to Covid19CanadaArchive. There should nonetheless be a straightforward way to integrate this code into an archivist project.

Upgrade to Selenium 4

Fix the following warnings introduced by Selenium 4:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  driver = webdriver.Chrome(executable_path=os.environ['CHROMEDRIVER_BIN'], options=options)
DeprecationWarning: find_elements_by_link_text is deprecated. Please use find_elements(by=By.LINK_TEXT, value=text) instead
  elements = driver.find_elements_by_link_text('Data Table')
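The Selenium 4 replacements for both warnings (options setup shown for completeness):

```python
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()

# Selenium 4: wrap the driver path in a Service object
service = Service(executable_path=os.environ['CHROMEDRIVER_BIN'])
driver = webdriver.Chrome(service=service, options=options)

# Selenium 4: use the unified find_elements() API
elements = driver.find_elements(By.LINK_TEXT, 'Data Table')
```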

Create simple config file

Create a simple config file (probably TOML) to cut down on the options that need to be specified on each run.

Functionality to validate the config options before running the module will also be needed.
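A minimal sketch of loading and validating the file, assuming Python 3.11+ for the standard-library tomllib parser (the third-party toml package would work on older versions); the required option names are hypothetical:

```python
import tomllib  # standard library in Python 3.11+

REQUIRED_OPTIONS = {"project_dir"}  # hypothetical option name

def load_config(path="config.toml"):
    # read the config and fail fast on missing required options
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    missing = REQUIRED_OPTIONS - cfg.keys()
    if missing:
        raise ValueError(f"config.toml is missing options: {sorted(missing)}")
    return cfg
```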

Minor features/bug fixes

A collection of minor features and bug fixes, mainly transferred from Covid19CanadaArchive.

download functions (dl_file, html_page, ss_page)

  • Parameters could be converted into a single 'args' parameter
  • Add 'debug' parameter
  • Add more robust timeouts (see issue) (moved to #7)
  • When spoofing user agent, randomly pick from a set of modern user agent strings (moved to #4)
  • Add useragent support to html_page and ss_page (moved to #4)

html_page

Re-running UUIDs

  • When using --uuid, should run in the order provided
  • Add automated retries for failed datasets (example code)
  • Add ability to re-run a range of datasets (moved to #26)
  • Add ability to exclude certain datasets from the run (e.g., --uuid-exclude)
  • Add option to run datasets in a random order to reduce load on servers (e.g., --random-order)
  • Print re-run code to standard error
  • Report failed datasets at bottom of script, for convenience

Miscellaneous

  • Add option to wait between queries
  • Add ability to pass environment variables through a file (moved to #14)
  • Script to grab all embedded datasets and tables and save as a single JSON file (moved to #8)
    • Example use case: NWT dashboard or Alberta info page

Second positional argument should refer to project folder

The second positional argument of archivist should refer to the project folder, rather than to datasets.json directly. In the absence of an argument, the working directory should be assumed to be the project folder. The project folder should contain all necessary project files, such as datasets.json, the config file, and any special processing code. Allowing the second positional argument to be left blank would also necessitate removing out_path as an optional third positional argument and replacing it with a keyword argument.
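In argparse terms, the new interface might look something like this (a sketch; the mode argument and flag names are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(prog="archivist")
parser.add_argument("mode")
parser.add_argument("project_dir", nargs="?", default=".",
                    help="project folder containing datasets.json, "
                         "the config file, special processing code, etc.")
parser.add_argument("--out-path",
                    help="replaces the old positional out_path argument")
args = parser.parse_args()
```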
