
archivist's Introduction

About me

Hello! My name is Jean-Paul R. Soucy and I am a PhD candidate in infectious disease epidemiology at the Dalla Lana School of Public Health at the University of Toronto. My research interests focus on the use of emerging data sources in infectious disease surveillance.

Connect with me on Twitter. Find my personal website at jprs.me and my writing at Data Gripes. Code related to my blog can be found in the data-gripes repository.

My repositories

Table of contents

Software packages are highlighted with code text.

COVID-19 Canada Open Data Working Group

COVID-19 datasets
Other CCODWG projects

Python & R packages


In addition to the COVID-19 data automation pipeline described above, I have also created the following R and Python packages:

  • archivist: Python-based digital archive tool currently powering the Canadian COVID-19 Data Archive
  • getDDD: Download defined daily dose (DDD) information for selected ATC codes from the website of the WHO Collaborating Centre for Drug Statistics Methodology

Reproducible research


Wherever possible, I aim to publish reproducible research with open data.

Presentations & workshops


Miscellaneous

COVID-19 data
Other

Gists

Find my GitHub gists here.

archivist's People

Contributors

jeanpaulrsoucy, mschoettle, svmillin


archivist's Issues

Improved console output using Rich

Rich is a Python library for writing rich, formatted text to the console. This could be very useful for simplifying and extending the output of this package, as well as future add-ons/integrations (e.g., data validation).
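A minimal sketch of what Rich-formatted status output could look like (the messages themselves are illustrative only):

```python
from rich.console import Console

console = Console()

# Rich markup tags handle colour and style without manual ANSI codes
console.print("[bold green]SUCCESS:[/bold green] example-dataset.html")
console.print("[bold red]FAILURE:[/bold red] example-dataset.pdf")
```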

Integrate utils functions into archivist

The functions from utils.py (e.g., for managing the contents of datasets.json) in Covid19CanadaArchive should be merged into archivist.

The ability to generate a markdown list of all datasets (by meta_group_1 and meta_group_2) should also be added as a utility function.
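A rough sketch of what the markdown list utility might look like, assuming datasets.json maps UUIDs to metadata dicts (the "file_name" field is a hypothetical stand-in for a display name):

```python
import json
from collections import defaultdict

def datasets_markdown(path="datasets.json"):
    # group dataset names under meta_group_1 / meta_group_2 headings
    with open(path) as f:
        datasets = json.load(f)
    groups = defaultdict(lambda: defaultdict(list))
    for uuid, ds in datasets.items():
        name = ds.get("file_name", uuid)  # hypothetical field
        groups[ds.get("meta_group_1", "Other")][ds.get("meta_group_2", "")].append(name)
    lines = []
    for g1, subgroups in sorted(groups.items()):
        lines.append(f"## {g1}")
        for g2, names in sorted(subgroups.items()):
            if g2:
                lines.append(f"### {g2}")
            lines.extend(f"- {n}" for n in sorted(names))
    return "\n".join(lines)
```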

Migrate features from archiver.py to archivist

The features still in archiver.py should be migrated to archivist. Most of them should live inside a single meta-function, archivist, which will be the only function called directly by the user. This function will accept the flags presently accepted by archivist.py.

  • Loading and filtering datasets.json

Add debug option to force HTTPS verification

The verify = False option has been used for some datasets where HTTPS verification fails when using requests. We already have a flag to universally disable verification (ignore-ssl). An additional flag should be added to force verification in order to check which datasets can safely have their verify = False option dropped.
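A minimal sketch of how the override could work (the flag and parameter names are hypothetical):

```python
import requests

def fetch(url, dataset_verify=True, force_ssl=False):
    # force_ssl (the hypothetical new debug flag) overrides a dataset's
    # verify=False setting, revealing which exceptions are now stale
    verify = True if force_ssl else dataset_verify
    return requests.get(url, verify=verify, timeout=30)
```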

Alternative format for saving webpages

At the moment, we have two methods of saving webpages using archivist:

  • requests: For pages without (necessary) JavaScript, we simply request the URL and save the response.
  • Chrome + Selenium: For pages with JavaScript, we give the page time to render, then save the page source.
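In outline, the two current methods look something like this (a simplified sketch, not the package's actual code):

```python
import time
import requests
from selenium import webdriver

def save_static(url, out_file):
    # method 1: plain request, save the raw response
    resp = requests.get(url, timeout=30)
    with open(out_file, "wb") as f:
        f.write(resp.content)

def save_rendered(url, out_file, wait=5):
    # method 2: headless Chrome, wait for JavaScript, save page source
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait)  # give the page time to render
        with open(out_file, "w") as f:
            f.write(driver.page_source)
    finally:
        driver.quit()
```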

However, neither of the above methods accomplishes tasks like saving externally-embedded images and page styling.

There are two alternatives to the current method of saving webpages:

  • SingleFile: Web extension/CLI tool for saving a complete webpage as a single HTML file.
  • SingleFileZ: As above, but output format is both an HTML file and a valid ZIP file. The only possible issues are that the resulting file may not be as universally readable as the pure HTML format and that some work still needs to be done to ensure long-term support of Chrome-based browsers.

Among other interesting projects, shot-scraper html is also worth looking at.

Utility functions

Utility functions that should be added to the package:

  • Extract a list of JSON files from an Esri/ArcGIS dashboard
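For the Esri/ArcGIS case, the extraction might look something like this; the query parameters follow the standard ArcGIS REST API, but the service URL and layer handling are placeholders:

```python
import requests

def arcgis_layer_json(service_url, layer=0):
    # standard ArcGIS REST query returning all features of one layer
    # as JSON; service_url stands in for a dashboard's FeatureServer
    url = f"{service_url}/{layer}/query"
    params = {"where": "1=1", "outFields": "*", "f": "json"}
    return requests.get(url, params=params, timeout=30).json()
```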

Add ability to download file locally

This would probably work best as a new mode (e.g., dl). For specifying the directory to write files to, we could either reuse out_path, add a new out_dir parameter, or simply replace out_path with out_dir. The last option would also require changing the behaviour of index to write the index to the specified directory under a default name. An overwrite parameter should probably also be added to determine whether files with the same name are overwritten.

Add ability to cache datasets on external server between runs

Sometimes, datasets are inaccessible at the time that the nightly update runs despite having been available earlier in the day.

Perhaps the test mode could be reworked (or another mode added, e.g., no-prod) such that a current copy of each dataset is kept on an external server (e.g., S3). On each run, the hash of the version of the dataset on the server could be compared with the local hash, replacing it when they differ (assuming the dataset downloaded correctly, e.g., a dynamic HTML page meets its minimum size requirement).

The prod mode could copy the current version of the dataset on the server to the permanent location with the appropriate timestamp.

One issue could be that copies of inactive datasets would remain on the server. A maintenance function could run at the beginning of every run to clear datasets marked as inactive from the server.
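A minimal sketch of the compare-and-replace step, assuming S3 via boto3 (the bucket name and function are hypothetical):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "archive-cache"  # hypothetical bucket name

def maybe_update_cache(key, local_bytes):
    # upload only when the fresh download differs from the cached copy;
    # note the S3 ETag equals the MD5 only for non-multipart uploads
    local_md5 = hashlib.md5(local_bytes).hexdigest()
    try:
        etag = s3.head_object(Bucket=BUCKET, Key=key)["ETag"].strip('"')
    except s3.exceptions.ClientError:
        etag = None  # no cached copy yet
    if etag != local_md5:
        s3.put_object(Bucket=BUCKET, Key=key, Body=local_bytes)
```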

Unify file downloading functions

dl_fun can probably be deprecated in favour of a single download method that chooses its options via other attributes (i.e., extension and the presence/absence of the "js" arg). ss_page can be deprecated as there is no use for it in the current package.

This will also simplify the code and avoid redundancy when implementing features like automatic retries (#1).
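The unified method might dispatch along these lines (a sketch reusing the existing function names; the dataset structure is assumed):

```python
def download(dataset, out_file):
    # choose the strategy from the dataset's own attributes rather
    # than a separately stored dl_fun value; html_page and dl_file
    # are the package's existing download functions
    if dataset.get("args", {}).get("js"):
        html_page(dataset, out_file)  # Selenium path for JS-rendered pages
    else:
        dl_file(dataset, out_file)    # plain requests path
```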

print-md5 error

The following errors occur only when the --debug print-md5 flag is enabled:

PE - PEI COVID-19 Case Data
'bytes' object has no attribute 'encode'
FAILURE: pe/pei-webpage/pei-webpage_2022-03-26_20-15.html
PE - COVID-19 Vaccination Data
'bytes' object has no attribute 'encode'
FAILURE: pe/pei-vaccination-webpage/pei-vaccination-webpage_2022-03-26_20-15.html

There are two issues here: these two webpages are failing to be hashed, and the failure to hash them causes the dataset download as a whole to fail.
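The error message suggests the hashing code calls .encode() unconditionally, which fails when the page source is already bytes. A guess at the fix is to normalize to bytes first (a sketch, not the package's actual function):

```python
import hashlib

def print_md5(content):
    # page source may arrive as str or bytes depending on the
    # download method; only encode when it is still a string
    if isinstance(content, str):
        content = content.encode()
    print(hashlib.md5(content).hexdigest())
```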

Double-counting failed UUID in the logs

2022-04-09 22:02

Successful downloads: 385/389
Failed downloads: 4/389

FAILURE: nt/nwt-dashboard-communities-webpage/nwt-dashboard-communities-webpage_2022-04-09_22-11.html
FAILURE: nt/nwt-dashboard-communities-webpage/nwt-dashboard-communities-webpage_2022-04-09_22-11.html
FAILURE: nu/nunavut-vaccination-table/vaccine_table_2022-04-09_22-16.png
FAILURE: sk/covid-weekly-epi-report/covid-weekly-epi-report_2022-04-09_22-31.pdf

Automatic redo for Shiny dashboards

Shiny dashboards can be finicky and "time out" rather than displaying content. A hallmark of a failed download is a small file size, but there may be other ways to detect this as well.
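One possible shape for the redo logic, keyed on file size (all names and thresholds here are hypothetical):

```python
import os
import time

def download_with_redo(dl_func, out_file, min_size=1024, max_tries=3, wait=30):
    # re-attempt the download when the saved file is suspiciously
    # small, the hallmark of a timed-out Shiny dashboard
    for _ in range(max_tries):
        dl_func(out_file)
        if os.path.exists(out_file) and os.path.getsize(out_file) >= min_size:
            return True
        time.sleep(wait)  # give the dashboard a chance to recover
    return False
```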

Document project files

The format for all project files should be documented:

  • datasets.json
  • config.toml
  • Special processing code (currently proc/webdriver)

More general recipe for generating re-run code

Instead of checking all of the options and using them to generate re-run code (which requires manually specifying any new flags in the re-run code), we could simply process sys.argv. This would involve stripping out any uuid specification (replacing it with the failed datasets from the current run) and perhaps forcing the project path to be explicit.
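A sketch of the sys.argv approach (the command name is an assumption, and the flag and its value are assumed to be separate tokens):

```python
import sys

def rerun_code(failed_uuids):
    # keep every flag as given, but strip any existing --uuid value
    # and substitute the datasets that failed in the current run
    args = iter(sys.argv[1:])
    kept = []
    for a in args:
        if a == "--uuid":
            next(args, None)  # drop the flag's value too
            continue
        kept.append(a)
    kept += ["--uuid", ",".join(failed_uuids)]
    return "archivist " + " ".join(kept)
```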

Log email: include log as attachment

The log email would be easier to parse if the actual log portion were included as a .txt attachment rather than comprising the vast majority of the email body.
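With the standard library, attaching the log is straightforward (a sketch; addresses and SMTP details are placeholders):

```python
import smtplib
from email.message import EmailMessage

def send_log_email(summary, log_text, to_addr, from_addr, smtp_host="localhost"):
    # short summary in the body, full log attached as a .txt file
    msg = EmailMessage()
    msg["Subject"] = "archivist run log"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(summary)
    msg.add_attachment(log_text.encode(), maintype="text",
                       subtype="plain", filename="log.txt")
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```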

Error tolerance: download timeouts

Error tolerance should be rock-solid. This includes being more aggressive with timeouts, as well as guarding against outright show-stoppers.

For example, trying to download the SK situation report PDF (after retrieving the URL) completely killed the script recently.
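At a minimum, every request could carry an explicit timeout and a catch-all so that one dataset cannot kill the run (the timeout values are illustrative):

```python
import requests

def safe_get(url):
    try:
        # (connect, read) timeouts prevent a hung server from
        # stalling the entire nightly run
        resp = requests.get(url, timeout=(10, 60))
        resp.raise_for_status()
        return resp.content
    except requests.RequestException as e:
        print(f"FAILURE: {url}: {e}")  # log and move on
        return None
```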

Flexible query to select datasets to download

At the moment, the method of selecting datasets is limited to explicit inclusion (--uuid) and exclusion (--uuid-exclude). It would be useful to be able to select datasets:

  • In a range between two datasets
  • Before a certain dataset
  • After a certain dataset

Additional possibilities would include selecting datasets only within a certain group (e.g., "on" or "bc").

It would be best for this query to happen in a single flag to avoid cluttering up the help pages with redundant parameters.
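One possible shape for the selection logic, assuming datasets keep their datasets.json order and that group membership can be read from a UUID prefix (both assumptions):

```python
def select_uuids(uuids, start=None, end=None, group=None):
    # inclusive range between two datasets, optionally narrowed
    # to a hypothetical group prefix such as "on" or "bc"
    i = uuids.index(start) if start else 0
    j = uuids.index(end) + 1 if end else len(uuids)
    selected = uuids[i:j]
    if group:
        selected = [u for u in selected if u.startswith(group)]
    return selected
```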

Create file download class

With a file download class, we could easily track properties such as n_retries, useful for yet-to-be-implemented features such as #10 (automatic retries based on expected file size).
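A minimal sketch of such a class (the fields beyond n_retries are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileDownload:
    # tracks one dataset download across retries
    uuid: str
    url: str
    out_file: str
    n_retries: int = 0
    success: bool = False
```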

Separate out special processing code

Since archivist is intended to be a stand-alone project, any special processing code (i.e., code that depends on a specific uuid being selected) should be separated out and moved to Covid19CanadaArchive. There should nonetheless be a straightforward way to integrate this code into an archivist project.

Upgrade to Selenium 4

Fix the following warnings introduced by Selenium 4:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  driver = webdriver.Chrome(executable_path=os.environ['CHROMEDRIVER_BIN'], options=options)
DeprecationWarning: find_elements_by_link_text is deprecated. Please use find_elements(by=By.LINK_TEXT, value=text) instead
  elements = driver.find_elements_by_link_text('Data Table')
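The Selenium 4 replacements for both warnings (options setup shown for completeness):

```python
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()

# Selenium 4: wrap the driver path in a Service object
service = Service(executable_path=os.environ['CHROMEDRIVER_BIN'])
driver = webdriver.Chrome(service=service, options=options)

# Selenium 4: use the unified find_elements() API
elements = driver.find_elements(By.LINK_TEXT, 'Data Table')
```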

Create simple config file

Create a simple config file (probably TOML) to cut down on the options that need to be specified on each run.

Functionality to validate the config options before running the module will also be needed.
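A minimal sketch of loading and validating the file, assuming Python 3.11+ for the standard-library tomllib parser (the third-party toml package would work on older versions); the required option names are hypothetical:

```python
import tomllib  # standard library in Python 3.11+

REQUIRED_OPTIONS = {"project_dir"}  # hypothetical option name

def load_config(path="config.toml"):
    # read the config and fail fast on missing required options
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    missing = REQUIRED_OPTIONS - cfg.keys()
    if missing:
        raise ValueError(f"config.toml is missing options: {sorted(missing)}")
    return cfg
```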

Minor features/bug fixes

A collection of minor features and bug fixes, mainly transferred from Covid19CanadaArchive.

download functions (dl_file, html_page, ss_page)

  • Parameters could be converted into a single 'args' parameter
  • Add 'debug' parameter
  • Add more robust timeouts (see issue) (moved to #7)
  • When spoofing user agent, randomly pick from a set of modern user agent strings (moved to #4)
  • Add useragent support to html_page and ss_page (moved to #4)

html_page

Re-running UUIDs

  • When using --uuid, should run in the order provided
  • Add automated retries for failed datasets (example code)
  • Add ability to re-run a range of datasets (moved to #26)
  • Add ability to exclude certain datasets from the run (e.g., --uuid-exclude)
  • Add option to run datasets in a random order to reduce load on servers (e.g., --random-order)
  • Print re-run code to standard error
  • Report failed datasets at bottom of script, for convenience

Miscellaneous

  • Add option to wait between queries
  • Add ability to pass environment variables through a file (moved to #14)
  • Script to grab all embedded datasets and tables and save as a single JSON file (moved to #8)
    • Example use case: NWT dashboard or Alberta info page

Second positional argument should refer to project folder

The second positional argument of archivist should refer to the project folder, rather than to datasets.json directly. In the absence of an argument, the working directory should be assumed to be the project folder. The project folder should contain all necessary project files, such as datasets.json, the config file, and any special processing code. Allowing the second positional argument to be left blank would also necessitate removing out_path as an optional third positional argument and replacing it with a keyword argument.
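In argparse terms, the new interface might look something like this (a sketch; the mode argument and flag names are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(prog="archivist")
parser.add_argument("mode")
parser.add_argument("project_dir", nargs="?", default=".",
                    help="project folder containing datasets.json, "
                         "the config file, special processing code, etc.")
parser.add_argument("--out-path",
                    help="replaces the old positional out_path argument")
args = parser.parse_args()
```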
