
edscrapers's Introduction

U.S. Department of Education scraping kit

NOTE: More specific documentation is available "on the spot", in the package and subpackages directories (e.g. edscrapers/scrapers or edscrapers/transformers).

Running the tool

Clone this repo using git clone.

Change directory into the directory created/cloned for this repo.

From within the repo directory run pip install -r requirements.txt to install all package dependencies required to run the toolkit.

You need the ED_OUTPUT_PATH environment variable to be set before running; not having it set in your environment will result in a fatal error. ED_OUTPUT_PATH sets the path to the directory where all output generated by this kit will be stored, and the path specified must exist.
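
For reference, a minimal sketch (an assumption, not the actual edscrapers code) of the kind of check performed at startup:

```python
# Hypothetical sketch: fail fast if ED_OUTPUT_PATH is missing or not a directory.
import os
import sys

output_path = os.environ.get('ED_OUTPUT_PATH')
if not output_path or not os.path.isdir(output_path):
    sys.exit('ED_OUTPUT_PATH must be set to an existing directory')
```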

If GNU Make is available in your environment, you can run the command make install. Alternatively, run python setup.py install.

After installing, run the eds command in a command line prompt.

Containerization of Scraping Toolkit - Docker Image

If you would like to run this toolkit in a container environment, we have packaged this toolkit into a Docker image. Simply run docker build in the root directory of this cloned repo. This will build an image of the scraping toolkit from the Dockerfile.

ED Scrapers Command Line Interface

To get more info on the usage of the ED Scrapers Command Line Interface - eds, read the eds cli docs.

Architectural Design

To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.

Terminology

  • Scraping Source: a website (or section of website) where you scrape information from
  • Scraper: A script that collects structured data from (rather unstructured) web pages
    • Crawler: A script that follows links and identifies all the pages containing information to be parsed
    • Parser: A script that identifies data in HTML and loads it into a machine readable data structure
  • Transformer: a script that takes a data structure and adapts it to a target structure
  • ETL: Extract + Transform + Load process for metadata.
  • Data.json: A specific JSON format used by CKAN harvesters. Example

Scrapers

Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is outlined below, followed by a minimal sketch:

  • A crawler class that defines rules for link extraction and page filters
    • This will be instantiated by a CrawlerProcess in the main scraper.py script
  • A parser script that is essentially a callback for fetching HTML pages. It receives a Scrapy Response payload, which can be parsed using any HTML parsing methods
    • An optional Model class, to define the properties of extracted datasets and make them more flexible for dumping or automating operations if needed
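
A minimal sketch of this structure (a hypothetical office scraper, not the actual edscrapers code; names, URLs and selectors are assumptions):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def parse_page(response):
    """Parser callback: receives a Scrapy Response and extracts metadata."""
    title = response.css('title::text').get()
    # ... identify resources, build a dataset/model object, dump it to JSON ...
    return {'source_url': response.url, 'title': title}


class Crawler(CrawlSpider):
    """Crawler class: defines the link-extraction rules and page filters."""
    name = 'example_office'                      # hypothetical scraper name
    allowed_domains = ['www2.ed.gov']
    start_urls = ['https://www2.ed.gov/about/offices/list/ocr/index.html']
    rules = (
        Rule(LinkExtractor(allow=r'/about/offices/list/ocr/'),
             callback=parse_page, follow=True),
    )


if __name__ == '__main__':
    # scraper.py instantiates the crawler through a CrawlerProcess
    process = CrawlerProcess(settings={'HTTPCACHE_ENABLED': True})
    process.crawl(Crawler)
    process.start()
```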

Transformers

Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester). A minimal transformer sketch follows the list of transformers below.

We currently have 7 transformers in place:

  • deduplicate: removes duplicates from scraping

  • sanitize: cleans up the scraping output data/metadata based on specified rules.

  • datajson: creates data.json files from the scraping output; these data.json files can then be ingested/harvested by ckanext-harvest (used to populate a CKAN data portal).

  • rag: produces RAG analyses output files using an agreed weighted-value system for calculating the quality of metadata generated by the datajson transformer and (by extension) the 'raw' scraping output.

  • TODO: Add info about the others
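
A minimal transformer sketch (the directory layout and dataset keys are assumptions, not the actual edscrapers code), deduplicating scraper dumps by source_url:

```python
import json
import os
from pathlib import Path

output_path = Path(os.environ['ED_OUTPUT_PATH'])

seen, deduplicated = set(), []
for dump_file in (output_path / 'scrapers').glob('**/*.json'):  # assumed layout
    dataset = json.loads(dump_file.read_text())
    key = dataset.get('source_url')
    if key in seen:
        continue
    seen.add(key)
    deduplicated.append(dataset)

out_file = output_path / 'transformers' / 'deduplicated.json'   # assumed layout
out_file.parent.mkdir(parents=True, exist_ok=True)
out_file.write_text(json.dumps(deduplicated, indent=2))
```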

License

GNU AFFERO GENERAL PUBLIC LICENSE

edscrapers's People

Contributors

georgiana-b, higorspinto, nightsh, osahon-okungbowa, tanvirchahal

edscrapers's Issues

Create a RAG summary of the results of the scraping exercise

Description: Create a RAG summary of the results of the scraping exercise.

RAG to be determined:

  • Dark green: We scraped and have all the metadata; we can determine if it is a DP, collection or source
  • Green: We scraped and have all the metadata; we can't technically determine if it is a DP, collection or source
  • Light Amber: We have scraped and have partial metadata
  • Dark Amber: We have the title
  • Red: Not scraped

Can we automate in some way? [Prob not in this sprint]

Acceptance criteria

  • We have a RAG summary of the results of the scraping exercise

Task-list

  • Agree the RAG status levels
  • Create RAG per domain
  • Create RAG per publisher

Link to Jira card: https://open-data-ed.atlassian.net/browse/OD-503

Create the list of all resources and pages where we have scrapable data and validate against the list provided

Description: Replicate the job that the other team did so we are clear on the Dept Ed Websphere to scrape. Validate that against the other list and have an agreed scope of pages to scrape.

Acceptance criteria

  • We have the list of scrapable sites
  • We have validated it against the list provided by the other team
  • We are aligned as a team on the Dept Ed Websphere to scrape

Jira card: https://open-data-ed.atlassian.net/browse/OD-499


UPDATES Mar 5:

Design and implement a reproducible way of validating the scrape results against the list provided. We need to be able to do this any time we want, as the scrapers run / collect more data.

Bonus points for a google sheets chart from the output, or anything similar.

Tasks:

  • Automate the validation process
  • Push automation code
  • Review & merge the code
  • Document the extracted analytics from the data

P8 Metadata (ed.gov): Improve Metadata Quality

Description: Improve metadata quality from P8 ed.gov parser. (highest priority)

TASKS

  • Ensure datasets produced have a description metadata

  • Create more parsers for the variant page structures

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for all the metadata fields (default values at least)

Jira Card

Use a mirror for scraping sources to avoid hitting them more than necessary

We are about to scrape some rather large websites full of unstructured data to be collected. This will likely lead to many trial-and-error requests made to them and we might get throttled by these websites.

Ideally, we won't have to execute the same request twice. We need to set up either a proxy server with large (and persistent) cache, or to mirror the entire site.

Since setting up a proxy raises a few (minor) complications for the scraping environment (e.g. local resolv.conf updates for each environment hitting the cache; forgetting to do so would void the effort), we decided to attempt to mirror the entire websites on a server we own, then hit those clones instead, as much as we want. Once everything is set up and working (and we have completely run the scrapers on the entire site) we can remove the mirrors and hit the real sites instead.

Tasks:

  • set up a server somewhere, storage doesn't matter, we are only scraping metadata
    • make sure access to that server is allowed on SSH, HTTP and HTTPS ports
    • make sure it has lots of bandwidth and no (or a very high) traffic quota
  • document the method used for mirroring
  • start the mirroring process and let it run until finish
    • if possible, monitor in the meantime so it won't die 2 hrs later and we only notice on Monday 😅

List of sites to be mirrored TBD later today.

Transform the collected JSON datasets to CKAN harvester data.json format

Blocked by #15 #19 #20 #21 #22 #23 #24

Upon running the scrapers, the collected data is dumped into an output directory structure. We need to traverse it, and for each scraping source (i.e. child directory) create a data.json file to incorporate all the dumped items.
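
A hedged sketch of that traversal (field names follow the Project Open Data schema commonly used in data.json catalogs; the directory layout and dataset keys are assumptions):

```python
import json
from pathlib import Path

def make_datajson(source_dir: Path) -> dict:
    """Build one data.json catalog from all JSON dumps in a scraping source dir."""
    datasets, seen = [], set()
    for dump_file in source_dir.glob('*.json'):
        item = json.loads(dump_file.read_text())
        if item.get('source_url') in seen:        # avoid duplicates
            continue
        seen.add(item.get('source_url'))
        datasets.append({
            'title': item.get('title'),
            'description': item.get('description', ''),
            'publisher': {'name': item.get('publisher', '')},
            'distribution': [
                {'downloadURL': r.get('url'), 'title': r.get('name')}
                for r in item.get('resources', [])
            ],
        })
    return {
        'conformsTo': 'https://project-open-data.cio.gov/v1.1/schema',
        'dataset': datasets,
    }

for source_dir in Path('scrapers_output').iterdir():              # hypothetical path
    if source_dir.is_dir():
        catalog = make_datajson(source_dir)
        (source_dir / 'data.json').write_text(json.dumps(catalog, indent=2))
```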

Tasks:

  • Create a transformer Py module
  • Iterate through the list of output files in each directory
  • Generate a data.json file according to a shared structure filled with data from the files
  • Test by loading the data in the CKAN harvester, locally or remotely
  • Avoid duplicates
  • Avoid printable versions of the resources
  • Test (update if needed) when the parsers are done
    • P1 Parser (OCR)
    • P2 Parser (OCTAE)
    • P3 Parser (OPE)
    • P4 Parser (OELA)
    • P5 Parser (OSERS)
    • P6 Parser (OPEPD)
    • P7 Parser (OESE)
    • NCES
    • ed.gov

Acceptance criteria:

  • A data.json type file is generated for each scraping dump discovered
  • There are no duplicate datasets per scraping source (even if they are collected multiple times, we only want to add them once)
  • Transformer is generating a transform log, recording number of input files and output datasets/resources
  • The data.json file is loadable by the CKAN harvester

P3 Metadata (OPE): Improve Metadata Quality

Description: Improve metadata quality from P3 Office of Postsecondary Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

P1 Metadata (OCR): Improve Metadata Quality

Description: Improve metadata quality from P1 Office of Civil Rights parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira card

Sanitising Datasets

SITUATION

Based on investigation & client feedback from previous scraping runs, there are datasets being scraped that need to be treated/sanitised.
Sanitising methodology may vary based on the nature of the dataset, e.g. datasets with 'photo' in the title are to be removed; datasets with 'conference' in the title are to be tagged as 'private' and also set as private, etc.
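
A minimal sketch of such title-based rules (the rules and field names shown are assumptions based on the examples above, not the agreed sanitisation spec):

```python
def sanitize(datasets):
    cleaned = []
    for dataset in datasets:
        title = (dataset.get('title') or '').lower()
        if 'photo' in title:
            continue                          # rule: drop photo galleries entirely
        if 'conference' in title:
            dataset['private'] = True         # rule: mark conference material private
            dataset.setdefault('tags', []).append('private')
        cleaned.append(dataset)
    return cleaned
```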

TASKS

  • based on investigation and client feedback identify steps that can be used to sanitise affected datasets
  • translate identified steps into usable algorithm and code
  • Integrate code into already established scrapy process with minimal/no alteration to current process

ACCEPTANCE CRITERIA

  • datasets are successfully sanitised based on agreed feedback
  • the sanitising process introduces little or no disruption to the established scrapy process, i.e. the integration is smooth/flexible

PROBLEM LIST FOR SANITISING

P6 Metadata (OPEPD): Improve Metadata Quality

Description: Improve metadata quality from P6 Office of Planning, Evaluation and Policy Development parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Filter ZIP files by contents

[STUB]

While scraping, we can download, extract and investigate the contents of the ZIP files to eliminate zipped resources that don't contain data files.

This would likely be a part of the Airflow DAGs and would likely take a long time to complete (though still shorter than doing it manually).
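
A hedged sketch of that check (the accepted data extensions are assumptions; a real DAG task would also need retries and size limits):

```python
import io
import zipfile

import requests

DATA_EXTENSIONS = ('.csv', '.xls', '.xlsx', '.json', '.xml', '.txt')

def zip_contains_data(url: str) -> bool:
    """Download a ZIP resource and check whether it holds any data files."""
    response = requests.get(url, timeout=60)
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        return any(name.lower().endswith(DATA_EXTENSIONS)
                   for name in archive.namelist())
```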

To Be Continued...

P4 Crawling: Office of English Language Acquisition

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/oela/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-502

P6 Parsing: Office of Planning, Evaluation and Policy Development

ref. #8

Create a HTML parser that integrates into the OPEPD scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Remove false positives From Scraping Results/Output

SITUATION
The output from the scraping processes currently generates some resources which satisfy the criteria for resource generation; however, upon cursory inspection, these resources are NOT actual dataset resources (e.g. a sitemap saved as an .xlsx file, or an image file stored in a ZIP). We refer to these as false positives.

TASKS

  • identify/create an effective methodology for identifying false positives in the scraping output

  • Write a script that implements the identified methodology for detecting false positives

ACCEPTANCE CRITERIA

  • script must interface with scrapy

  • Validate the removal of false positives from scrapy results

RELATED TO:
#109, #118. Close this issue when the related issues are closed.

Ed Scraping: Overall Scraping Roadmap (WIP)

A four stage roadmap

There are 4 phases to the scraping approach:

  • 1st phase - One-off catalogue population
  • 2nd phase - Ongoing catalogue population & technical debugging of crashes
  • 3rd phase - Ed can CRUD harvesters/scrapers & non-technical debugging of crashes
  • 4th phase - Move away from scraping and towards structured data-pipeline based on a data strategy

1st phase - One-off catalogue population & technical debugging of crashes

To scrape the Dept Ed websites and use data wrangling to populate the ODP with metadata (and maybe data). This will start with a 2-week sprint to validate that this approach will likely meet the coverage and metadata quality objectives.

The scraping / data wrangling output will be a data.json that will be ingested by a legacy Harvester.

This will be a one-off process to support the launch of the catalog with as much coverage as possible.

It will start with a 3-week test (1 week to prepare for the sprint, then a 2-week sprint) - at the end of this we will decide whether the scraping approach is a viable option to populate the catalog.

2nd phase - Ongoing catalogue population

To build on the above so that:

  • Can be run at intervals or on-demand by developer
  • Before loading in data into the portal, check for diff
  • Scraping pipeline works with the data.json "aka proper" pipeline (no duplication, and the Harvester favors the data.json ww2 file over web scraping)

This will probably involve using the NG Harvester pipeline, i.e. the backend NG Harvester infra but none of the front-end customization.

3rd phase - CRUD harvesters/scrappers

The Dept Ed can view, create, update and delete scraped data pipelines using a WUI and see logs of issues that have gone wrong.

4th phase - Moving away from scraping

Due to the inherent problems with scraping, the 4th phase is to move towards a department-wide data strategy.

Scraper for FSA

Federal Student Aid is an organization that publishes a lot of relevant data for the portal. The main website where everything is published is studentaid.gov.

Tasks:

  • create a crawler for the FSA site
  • add a parser (or more) to extract resources
  • verify that it produces usable output
  • manually test a few pages to validate output

Acceptance criteria:

  • Crawler visits all the pages on studentaid.gov
  • Parser extracts resources from all pages with data files
  • JSON output is usable by transformers

Improve Deduplication of Datasets

SITUATION

Based on dataset results from previous scraping runs, duplicates of datasets are still being scraped and subsequently harvested into CKAN. These duplicates should be removed by an improved deduplication process.

TASKS

  • inspect the source_url of sample duplicate datasets to identify why these duplicates escape the current deduplication process
  • based on that inspection/investigation of source_url, identify the best method(s)/solution(s) for trapping these duplicates
  • transcribe the identified solution into flexible/reusable code which can be integrated into the current deduplication transformer with little change (see the sketch below)
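
A minimal sketch of one possible approach, as referenced in the tasks above (the normalisation rules are assumptions, not the agreed solution):

```python
from urllib.parse import urlparse

def normalize_source_url(url: str) -> str:
    """Collapse variants that differ only in scheme, 'www.', trailing slash or query."""
    parsed = urlparse(url.strip().lower())
    return parsed.netloc.replace('www.', '') + parsed.path.rstrip('/')

def deduplicate(datasets):
    seen, unique = set(), []
    for dataset in datasets:
        key = normalize_source_url(dataset.get('source_url', ''))
        if key and key in seen:
            continue
        seen.add(key)
        unique.append(dataset)
    return unique
```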

ACCEPTANCE CRITERIA

  • code is easily integrated into current deduplication transformer with little change
  • more dataset duplicates are caught and removed by the improved deduplication

Current Sample of dataset duplication from source url

P4 Metadata (OELA): Improve Metadata Quality

Description: Improve metadata quality from P4 Office of English Language Acquisition parser
TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Add `source_url` to package `extras` field for the scraped datasets

ref. #25

Tasks:

  • Add the source_url property to the resulting datasets in the data.json output (sketched below)
  • Make sure the source_url property passes validation and gets stored as an extras item in CKAN
  • Adjust the ckanext-ed package template so it can display where that dataset was scraped from
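
A hedged sketch of the first task; whether the harvester maps extra data.json keys to CKAN extras automatically, or needs ckanext-ed changes, is an assumption to verify:

```python
def add_source_url(datajson_dataset: dict, scraped_item: dict) -> dict:
    """Carry source_url on the data.json entry so it can be stored as a CKAN extra."""
    source_url = scraped_item.get('source_url')
    if source_url:
        datajson_dataset['source_url'] = source_url
    return datajson_dataset
```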

Acceptance criteria:

  • Scraped datasets in CKAN have a new item in the metadata table to show where they came from
  • Datasets without source_url that were harvested will have a link to the harvester source info page
  • Datasets that were manually added will not have the extra property

Phase 3 - Improve P8 (ed.gov) Parser metadata quality

Description

Based on the results from the first runs/output of P8, the parser output needs to be improved both in terms of content and metadata quality.

Tasks

  • based on the output from prior runs of P8, review the output content and identify possible areas for improvement
  • Implement improvements on parsers based on identified areas
  • After refactoring and improvements, ensure all parsers still work as expected

Acceptance Criteria

  • parser shows a marked improvement in content output
  • parsers still generate acceptable data.json after refactoring/improvements
  • metadata has been improved; compare using the weighted system if possible

JIRA CARD

Uniform I/O between transformers, tools and dashboard

The current and target I/O flows are illustrated in diagrams attached to this issue (draw.io link).

Tasks:

  • dump everything in XLSX files
  • load everything from XLSX sheets instead of individual files
  • generate one file per transformer (except datajson)
  • use the new files in dashboard

Acceptance criteria:

  • All transformers dump in XLSX
  • The name parameter will generate a new sheet, instead of a new file
  • Everything works as expected, with the new data files

Note: The deprecated compare tool will be ignored for now, as well as its output. It will be subject to another task related to stats.
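
A minimal sketch of the target I/O described above (pandas and openpyxl are assumptions): each transformer writes into one XLSX workbook, and the name parameter selects a sheet rather than a new file.

```python
from pathlib import Path

import pandas as pd

def dump_to_xlsx(rows, workbook: Path, name: str) -> None:
    """Write `rows` to `workbook` as sheet `name`, creating the file if needed."""
    df = pd.DataFrame(rows)
    if workbook.exists():
        # replacing an existing sheet needs if_sheet_exists (pandas >= 1.3)
        with pd.ExcelWriter(workbook, mode='a', engine='openpyxl',
                            if_sheet_exists='replace') as writer:
            df.to_excel(writer, sheet_name=name, index=False)
    else:
        df.to_excel(workbook, sheet_name=name, index=False)
```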

Retain all relevant header information for resources

As we're not downloading resources, we currently have no way of knowing some basic information about them without hitting each URL.

However, downloading would be a rather costly action, both in time and disk space. But we can probably fetch only the header info.

An idea would be to use Scrapy's cache, if possible, but we need to investigate.

Examples of useful headers to fetch for each downloadable file:

  • Content-Type
  • Content-Length
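
A hedged sketch of fetching just those headers (requests shown for clarity; the toolkit might instead issue HEAD requests through Scrapy or reuse its cache):

```python
import requests

def fetch_resource_headers(url: str) -> dict:
    """HEAD the resource URL and keep only the headers we care about."""
    response = requests.head(url, allow_redirects=True, timeout=30)
    return {
        'Content-Type': response.headers.get('Content-Type'),
        'Content-Length': response.headers.get('Content-Length'),
    }
```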

Acceptance criteria:

  • the JSON dumps generated by scrapers contain headers for all resources
  • all the scrapers still run with no errors

Update Tech Spec to reflect metrics and dashboard changes

SITUATION

The Dept Ed Scraping project has evolved rapidly and there is a critical need to evaluate and measure the progress of the project.
Although quantitative measurements and reporting were not part of the original tech spec for the project, we now need adequate specs on how to effectively and easily capture and report quantitative measures which benchmark project progress internally and externally.

TASKS

  • Analyse the problem and identify adequate quantitative metrics and reports for the project

  • Design a tech spec which allows for the rapid gathering and reporting of identified significant metrics

  • Design the tech spec to also include easy-to-understand visualisations of metrics

  • Design the tech spec to include a periodic reports/trends feature

ACCEPTANCE CRITERIA

  • Tech spec includes agreed critical metrics which can be clear indicators of project progress
  • Tech spec includes an automated process of metrics gathering
  • Tech spec includes an internal/external channel (may be a dashboard) for reporting and visualising gathered metrics

P7 Metadata (OESE): Improve Metadata Quality

Description: Improve metadata quality from P7 Office of Elementary and Secondary Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Phase 3 - Improve metadata quality P9 (NCES) parser

Description

Based on the results from the first runs/output of P9, the parser output needs to be improved both in terms of content and metadata quality.

Tasks

  • based on the output from prior runs of P9, review the output content and identify possible areas for improvement
  • Implement improvements on parsers based on identified areas
  • Review all parsers and identify if number of parsers can be reduced from current number (4)
  • After refactoring and improvements, ensure all parsers still work as expected

Acceptance Criteria

  • parser shows a marked improvement in content output
  • parsers still generate acceptable data.json after refactoring/improvements
  • metadata has been improved; compare using the weighted system if possible

JIRA CARD

Create the initial code structure to run scrapers / transformers

In order to extract data from the Ed websites we need to work in a structured way, so we have all the scrapers and transformers in the same place and talking to each other.

The task is to create a code structure to accommodate all the scripts. Nothing complicated, just a module in which we put scrapers as submodules and have a standard way of working together.

Tasks:

  • bootstrap the project with a place for scrapers and a place for transformers
  • have a simple method of running commands
  • support virtual environments by providing an installable list of pip packages
  • document the structure
  • document the input / output formats
  • gitignore the temporary assets (e.g. scrapy cache, python wheels, etc.)

Acceptance criteria:

  • the repository contains code that runs
  • using a single command you can run a scraper or a transformer
  • project has documentation for developers

Parser for NCES

Create a HTML parser that integrates into the nces scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover the most frequent cases
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

P1 Parsing: Office for Civil Rights

Subtask of #3

As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task only refers to a single, rather isolated aspect of the entire pipeline: extracting the data from an HTML structure.

Desired properties for the resulting datasets:

  • source URL (where was it scraped from)
  • title
  • name (usually a unique slug of title)
  • publisher
  • description
  • tags
  • date
  • person of contact (name)
  • person of contact (email)

List of pages to get information from:

List of "false positives" that should bear no dataset information:

Tasks:

  • using the list of pages as raw HTML input, write a script that identifies whether a page has resources and, if it does, extracts all the metadata needed to create a dataset from it
  • test the parser script and output the data in a spreadsheet format for all pages in the list
  • integrate the script into the pipeline after the above validation

Acceptance criteria:

  • script accepts raw HTML as input
  • correctly identifies pages that do and do not have resources
  • produces a Python structure with the properties in the list above
  • returns None for no resources and a Python dictionary with the result otherwise (see the sketch below)
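
A hedged sketch of that contract (the selectors, resource extensions and default publisher are assumptions about the OCR pages, not the actual parser):

```python
from scrapy import Selector

RESOURCE_EXTENSIONS = ('.csv', '.xls', '.xlsx', '.zip', '.pdf')

def parse(html: str, source_url: str):
    """Return None if the page holds no resources, else a dict of dataset properties."""
    sel = Selector(text=html)
    resources = [
        {'url': href, 'name': href.rsplit('/', 1)[-1]}
        for href in sel.css('a::attr(href)').getall()
        if href.lower().endswith(RESOURCE_EXTENSIONS)
    ]
    if not resources:
        return None
    return {
        'source_url': source_url,
        'title': (sel.css('title::text').get() or '').strip(),
        'description': sel.xpath('//meta[@name="description"]/@content').get(''),
        'publisher': 'Office for Civil Rights',
        'resources': resources,
    }
```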

P3 Parsing: Office of Postsecondary Education

ref. #5

Create a HTML parser that integrates into the OPE scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • uses scrapy pipelines to dump datasets

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P5 Crawling: Office of Special Education and Rehabilitative Services

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/osers/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-504

Automate the scraping pipelines

The situation

The edscrapers module has an increasing number of scrapers. In order to produce
an output usable by CKAN, each scraper needs to run, then its output has to be
deduplicated using a simple universal script, then the deduplication result goes
into a datajson transformer. In addition to this, scraping results are also the
subject of a real-time statistics module and public HTTP dashboard, which needs
updating every time there are new results.

Having to operate the whole system manually is cumbersome and could lead to human
errors due to its sequential and multithreaded nature.

It also costs time to monitor, run and type all the commands needed to operate
each step for each scraper (via SSH / terminal access).

The consensus was that we wouldn't develop pipelines and would focus on the scraping
instead. At the moment, we have a number of steps, but no pipeline to glue them
together and standardize logging / reporting.

Analogy

A car washing company has 10 stalls and one employee operating them. Normally
one or two cars come at once, but there are times when more or all the stalls
are full. Each car needs to be washed following a specific sequence, and not
respecting that could void the process partially or totally.

Would it be worthwhile for the car wash to automate its stalls, so all the operations
are carried out automatically and quality is ensured without depending on a human
executing multiple sequences (each in a different stage) at once?

Would it increase the efficiency long term? Mid term? Short term?

Would it prevent otherwise unnoticed and unnoticeable errors leading to poor
results? What about finding out what went wrong when problems are spotted
(debugging)?

The proposed approach

Having a pipeline framework and manager that could run all these steps for us
would have some benefits:

  • standard errors / progress monitoring
  • web UI, users etc.
  • we don't have to glue things together in improvised pipelines
  • portability and ability to schedule / trigger pipelines

Airflow is the best candidate for this, according to
the analysis we have already made.

The steps are also documented in the doc linked above.
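
A hedged sketch of one such DAG (the eds subcommands, task names and schedule are assumptions; the real DAGs may call the edscrapers Python API directly):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='edscrapers_ocr',                  # hypothetical: one DAG per scraper
    start_date=datetime(2020, 3, 1),
    schedule_interval='@weekly',
    catchup=False,
) as dag:
    # hypothetical eds subcommands for each pipeline step
    scrape = BashOperator(task_id='scrape', bash_command='eds scrape ocr')
    dedupe = BashOperator(task_id='deduplicate', bash_command='eds transform deduplicate')
    datajson = BashOperator(task_id='datajson', bash_command='eds transform datajson')

    scrape >> dedupe >> datajson
```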

Tasks

  • Bootstrap an Airflow instance
  • Create the pipelines for each scraper
  • Test if it runs properly
  • Make sure dashboard reads from the right source

Acceptance criteria

  • We have an Airflow instance running the scrapers
  • Dashboard is updating
  • Pipelines are running automatically

v1- Report / Visualise Gathered Metrics

Situation:

We needed to know how the project was progressing, so we gathered metrics on scraping performance. Now we need to report these gathered metrics in an easy way, to facilitate quick decision-making and progress reporting.

Tasks

  • Create an easily accessible and usable dashboard

  • provide reports/visualisations on important metrics gathered

  • provide some periodic trends / progress reporting on metrics

  • create an overall indicator/summary of scraping progress

  • provide a visual comparison between Datopian and AIR scraping

Acceptance

  • Dashboard is available on a public url/link
  • Dashboard contains simple, 'live' reports/visualisations of important metrics gathered
  • Dashboard contains periodic / time-series graphs showing trends on metrics
  • Dashboard contains RAG summary
  • Dashboard contains comparison report and visualisation between Datopian and AIR scraping

P2 Metadata (OCTAE): Improve Metadata Quality

Description: Improve metadata quality from P2 Office of Career, Technical and Adult Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

P4 Parsing: Office of English Language Acquisition

ref. #6

Create a HTML parser that integrates into the OELA scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser dumps datasets with scrapy pipelines

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P7 Parsing: Office of Elementary and Secondary Education

ref. #9

Create a HTML parser that integrates into the OESE scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipeline to dump datasets with resources

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Parser for ed.gov

Create a HTML parser that integrates into the edgov scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover the most frequent cases
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

P9 Metadata (NCES): Improve Metadata Quality

Description: Improve metadata quality from P9 NCES parser. (highest priority)

TASKS

  • Ensure datasets produced have a description metadata

  • Create more parsers for the variant page structures

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Gather data insights from all scrapers output

Using the data collected so far, we need to extract some information to back our future sprint's targets.

Questions we need answered (doubles as task list):

  • What are the top 1000 resources by size? [Task cannot be completed - see #68]
  • Which pages were the most resources collected from? Top 1000.
  • What domains (subdomains) Datopian touched and AIR didn't?
    • How many items were extracted from them?
  • What domains (subdomains) AIR touched and Datopian didn't? How many items were extracted from them?
  • List of all domains ordered by number of parsed pages.
  • What is the difference between all the data we have and ed.gov/data.json?

Acceptance criteria:

  • We have a Python script that can be run over and over so we get updated numbers
  • We have CSV or XLS answers to all the above questions
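
A minimal sketch of the kind of script intended here (the dump layout and dataset keys are assumptions), counting parsed pages and resources per (sub)domain:

```python
import json
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

pages_per_domain = Counter()
resources_per_domain = Counter()

for dump_file in Path('scrapers_output').glob('**/*.json'):      # hypothetical path
    dataset = json.loads(dump_file.read_text())
    domain = urlparse(dataset.get('source_url', '')).netloc
    pages_per_domain[domain] += 1
    resources_per_domain[domain] += len(dataset.get('resources', []))

for domain, pages in pages_per_domain.most_common():
    print(f'{domain}\t{pages} pages\t{resources_per_domain[domain]} resources')
```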

P2 Parsing: Office of Career, Technical and Adult Education

ref. #4

Create a HTML parser that integrates into the OCTAE scraper and replaces the existing dummy parser script.

https://www2.ed.gov/about/offices/list/ovae/index.html

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • uses scrapy pipelines to dump datasets

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P7 Crawling: Office of Elementary and Secondary Education

Description: Scrape metadata for https://oese.ed.gov/

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-507

P2 Crawling: Office of Career, Technical and Adult Education

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/ovae/index.html

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-500

P3 Crawling: Office of Postsecondary Education

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/ope/index.html

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-501

P5 Parsing: Office of Special Education and Rehabilitative Services

ref. #7

Create a HTML parser that integrates into the OSERS scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • calls dataset.dump() method and dumps serialized output in a directory specific to this scraper

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Define architecture & infrastructure

At the moment this is quite a manual process. There is a tactical need for improvement (e.g. set up a cron job) but this is also the right time to think about architecture (e.g. how do the scraping pipelines avoid stepping on the toes of the data.json harvester?) and what infra options we have for go-live.

Acceptance criteria:

  • We have a spec for architecture and DevOps set up internally, and an estimate of how long it would take to implement the proposed solution (there could be both a tactical and a long-term solution)

Change data quality assessment to measure datajson output

Currently we are generating a RAG summary based on a data quality analysis, which uses a weighted scoring system.

The problem is that we are measuring the quality of the scraped data, and not the quality of the data that ends up in CKAN. We should be assessing the quality of the items in the final datajson output.
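
For reference, a hedged sketch of a weighted metadata score of this kind (the fields and weights are placeholders, not the agreed values):

```python
WEIGHTS = {'title': 3, 'description': 3, 'publisher': 2, 'distribution': 2}

def metadata_score(dataset: dict) -> float:
    """Fraction of the total weight earned by non-empty metadata fields."""
    earned = sum(weight for field, weight in WEIGHTS.items() if dataset.get(field))
    return earned / sum(WEIGHTS.values())
```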

Tasks:

  • Add a new feature to the RAG summary to switch measuring target from JSON files to data.json transformer output
    • Keep the existing implementation in case we want to compare before/after datajson phase
  • Make the dashboard use the new scores

Acceptance criteria:

  • RAG transformer measures datajson output
  • Dashboard shows datajson assessment

P1 Crawling: Office for Civil Rights

Description: Scrape metadata for https://ocrdata.ed.gov/ (this seems like a useful page - https://www2.ed.gov/about/offices/list/ocr/data.html?src=rt )

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-498

[Parsing] Extract level of data from resource name

Some datasets have resources named the same as U.S. states etc. We can use this as a heuristic to determine the level of data, in some cases.
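
A minimal sketch of the heuristic (the state list is truncated and the field names are assumptions):

```python
US_STATES = {'alabama', 'alaska', 'arizona', 'arkansas', 'california'}  # ... etc.

def infer_level_of_data(dataset: dict) -> str:
    """Tag a dataset as state-level if its resources are named after U.S. states."""
    stems = {r.get('name', '').rsplit('.', 1)[0].strip().lower()
             for r in dataset.get('resources', [])}
    return 'state' if stems & US_STATES else 'unknown'
```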

Tasks:

  • identify the patterns
  • create a map of which file names indicate which level of data
  • implement it in either parsers or transformers
  • import/adopt the 'enhancement' from the transformer into the data.json

Acceptance criteria:

  • datasets having state level files named the same as the states will have the appropriate level of data
  • 'enhancement' is adopted by data.json/harvester and is visible on the data portal

P6 Crawling: Office of Planning, Evaluation and Policy Development

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/opepd/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-506

P5 Metadata (OSERS): Improve Metadata Quality

Description: Improve metadata quality from P5 Office of Special Education and Rehabilitative Services parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card
