
edscrapers's Introduction

U.S. Department of Education scraping kit

NOTE: More specific documentation is available "on the spot", in the package and subpackages directories (e.g. edscrapers/scrapers or edscrapers/transformers).

Running the tool

Clone this repo using git clone.

Change directory into the directory created/cloned for this repo.

From within the repo directory run pip install -r requirements.txt to install all package dependencies required to run the toolkit.

You need the ED_OUTPUT_PATH environment variable to be set before running; not having it set in your environment will result in a fatal error. ED_OUTPUT_PATH sets the path to the directory where all output generated by this kit will be stored, and the path specified must exist.
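
For reference, a minimal sketch (an assumption, not the actual edscrapers code) of the kind of check performed at startup:

```python
# Hypothetical sketch: fail fast if ED_OUTPUT_PATH is missing or not a directory.
import os
import sys

output_path = os.environ.get('ED_OUTPUT_PATH')
if not output_path or not os.path.isdir(output_path):
    sys.exit('ED_OUTPUT_PATH must be set to an existing directory')
```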

If GNU Make is available in your environment, you can run the command make install. Alternatively, run python setup.py install.

After installing, run the eds command in a command line prompt.

Containerization of Scraping Toolkit - Docker Image

If you would like to run this toolkit in a container environment, we have packaged this toolkit into a Docker image. Simply run docker build in the root directory of this cloned repo. This will build an image of the scraping toolkit from the Dockerfile.

ED Scrapers Command Line Interface

To get more info on the usage of the ED Scrapers Command Line Interface - eds, read the eds cli docs.

Architectural Design

To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.

Terminology

  • Scraping Source: a website (or section of website) where you scrape information from
  • Scraper: A script that collects structured data from (rather unstructured) web pages
    • Crawler: A script that follows links and identifies all the pages containing information to be parsed
    • Parser: A script that identifies data in HTML and loads it into a machine readable data structure
  • Transformer: a script that takes a data structure and adapts it to a target structure
  • ETL: Extract + Transform + Load process for metadata.
  • Data.json: A specific JSON format used by CKAN harvesters. Example

Scrapers

Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is outlined below, followed by a minimal sketch:

  • A crawler class that defines rules for link extraction and page filters
    • This will be instantiated by a CrawlerProcess in the main scraper.py script
  • A parser script that is essentially a callback for fetching HTML pages. It receives a Scrapy Response payload, which can be parsed using any HTML parsing methods
    • An optional Model class, to define the properties of extracted datasets and make them more flexible for dumping or automating operations if needed
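
A minimal sketch of this structure (a hypothetical office scraper, not the actual edscrapers code; names, URLs and selectors are assumptions):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def parse_page(response):
    """Parser callback: receives a Scrapy Response and extracts metadata."""
    title = response.css('title::text').get()
    # ... identify resources, build a dataset/model object, dump it to JSON ...
    return {'source_url': response.url, 'title': title}


class Crawler(CrawlSpider):
    """Crawler class: defines the link-extraction rules and page filters."""
    name = 'example_office'                      # hypothetical scraper name
    allowed_domains = ['www2.ed.gov']
    start_urls = ['https://www2.ed.gov/about/offices/list/ocr/index.html']
    rules = (
        Rule(LinkExtractor(allow=r'/about/offices/list/ocr/'),
             callback=parse_page, follow=True),
    )


if __name__ == '__main__':
    # scraper.py instantiates the crawler through a CrawlerProcess
    process = CrawlerProcess(settings={'HTTPCACHE_ENABLED': True})
    process.crawl(Crawler)
    process.start()
```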

Transformers

Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester). A minimal transformer sketch follows the list of transformers below.

We currently have 7 transformers in place:

  • deduplicate: removes duplicates from scraping

  • sanitize: cleans up the scraping output data/metadata based on specified rules.

  • datajson: creates data.json files from the scraping output; these data.json files can then be ingested/harvested by ckanext-harvest (used to populate a CKAN data portal).

  • rag: produces RAG analyses output files using an agreed weighted-value system for calculating the quality of metadata generated by the datajson transformer and (by extension) the 'raw' scraping output.

  • TODO: Add info about the others
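
A minimal transformer sketch (the directory layout and dataset keys are assumptions, not the actual edscrapers code), deduplicating scraper dumps by source_url:

```python
import json
import os
from pathlib import Path

output_path = Path(os.environ['ED_OUTPUT_PATH'])

seen, deduplicated = set(), []
for dump_file in (output_path / 'scrapers').glob('**/*.json'):  # assumed layout
    dataset = json.loads(dump_file.read_text())
    key = dataset.get('source_url')
    if key in seen:
        continue
    seen.add(key)
    deduplicated.append(dataset)

out_file = output_path / 'transformers' / 'deduplicated.json'   # assumed layout
out_file.parent.mkdir(parents=True, exist_ok=True)
out_file.write_text(json.dumps(deduplicated, indent=2))
```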

License

GNU AFFERO GENERAL PUBLIC LICENSE

edscrapers's People

Contributors

georgiana-b, higorspinto, nightsh, osahon-okungbowa, tanvirchahal

edscrapers's Issues

Create a RAG summary of the results of the scraping exercise

Description: Create a RAG summary of the results of the scraping exercise.

RAG to be determined:

  • Dark green: We scraped and have all the metadata; we can determine if it is a DP, collection or source
  • Green: We scraped and have all the metadata; we can't technically determine if it is a DP, collection or source
  • Light Amber: We have scraped and have partial metadata
  • Dark Amber: We have the title
  • Red: Not scraped

Can we automate in some way? [Prob not in this sprint]

Acceptance criteria

  • We have a RAG summary of the results of the scraping exercise

Task-list

  • Agree the RAG status levels
  • Create RAG per domain
  • Create RAG per publisher

Link to Jira card: https://open-data-ed.atlassian.net/browse/OD-503

Create the list of all resources and pages where we have scrapable data and validate against the list provided

Description: Replicate the job that the other team did so we are clear on the Dept Ed Websphere to scrape. Validate that against the other list and have an agreed scope of pages to scrape.

Acceptance criteria

  • We have the list of scrapable sites
  • We have validated it against the list provided by the other team
  • We are aligned as a team on the Dept Ed Websphere to scrape

Jira card: https://open-data-ed.atlassian.net/browse/OD-499


UPDATES Mar 5:

Design and implement a reproducible way of validating the scrape results against the list provided. We need to be able to do this any time we want, as the scrapers run / collect more data.

Bonus points for a google sheets chart from the output, or anything similar.

Tasks:

  • Automate the validation process
  • Push automation code
  • Review & merge the code
  • Document the extracted analytics from the data

P8 Metadata (ed.gov): Improve Metadata Quality

Description: Improve metadata quality from P8 ed.gov parser. (highest priority)

TASKS

  • Ensure datasets produced have a description metadata

  • Create more parsers for the variant page structures

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for all the metadata fields (default values at least)

Jira Card

Use a mirror for scraping sources to avoid hitting them more than necessary

We are about to scrape some rather large websites full of unstructured data to be collected. This will likely lead to many trial-and-error requests made to them and we might get throttled by these websites.

Ideally, we won't have to execute the same request twice. We need to set up either a proxy server with large (and persistent) cache, or to mirror the entire site.

Since setting up a proxy raises a few (minor) complications for the scraping environment (e.g. local resolv.conf updates for each environment hitting the cache; forgetting to do so would void the effort), we decided to attempt to mirror the entire websites on a server we own, then hit those clones instead, as much as we want. Once everything is set up and working (and we have completely run the scrapers on the entire site) we can remove the mirrors and hit the real sites instead.

Tasks:

  • set up a server somewhere, storage doesn't matter, we are only scraping metadata
    • make sure access to that server is allowed on SSH, HTTP and HTTPS ports
    • make sure it has lots of bandwidth and no (or a very high) traffic quota
  • document the method used for mirroring
  • start the mirroring process and let it run until finish
    • if possible, monitor in the meantime so it won't die 2 hrs later and we only notice on Monday 😅

List of sites to be mirrored TBD later today.

Transform the collected JSON datasets to CKAN harvester data.json format

Blocked by #15 #19 #20 #21 #22 #23 #24

Upon running the scrapers, the collected data is dumped into an output directory structure. We need to traverse it, and for each scraping source (i.e. child directory) create a data.json file to incorporate all the dumped items.
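
A hedged sketch of that traversal (field names follow the Project Open Data schema commonly used in data.json catalogs; the directory layout and dataset keys are assumptions):

```python
import json
from pathlib import Path

def make_datajson(source_dir: Path) -> dict:
    """Build one data.json catalog from all JSON dumps in a scraping source dir."""
    datasets, seen = [], set()
    for dump_file in source_dir.glob('*.json'):
        item = json.loads(dump_file.read_text())
        if item.get('source_url') in seen:        # avoid duplicates
            continue
        seen.add(item.get('source_url'))
        datasets.append({
            'title': item.get('title'),
            'description': item.get('description', ''),
            'publisher': {'name': item.get('publisher', '')},
            'distribution': [
                {'downloadURL': r.get('url'), 'title': r.get('name')}
                for r in item.get('resources', [])
            ],
        })
    return {
        'conformsTo': 'https://project-open-data.cio.gov/v1.1/schema',
        'dataset': datasets,
    }

for source_dir in Path('scrapers_output').iterdir():              # hypothetical path
    if source_dir.is_dir():
        catalog = make_datajson(source_dir)
        (source_dir / 'data.json').write_text(json.dumps(catalog, indent=2))
```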

Tasks:

  • Create a transformer Py module
  • Iterate through the list of output files in each directory
  • Generate a data.json file according to a shared structure filled with data from the files
  • Test by loading the data in the CKAN harvester, locally or remotely
  • Avoid duplicates
  • Avoid printable versions of the resources
  • Test (update if needed) when the parsers are done
    • P1 Parser (OCR)
    • P2 Parser (OCTAE)
    • P3 Parser (OPE)
    • P4 Parser (OELA)
    • P5 Parser (OSERS)
    • P6 Parser (OPEPD)
    • P7 Parser (OESE)
    • NCES
    • ed.gov

Acceptance criteria:

  • A data.json type file is generated for each scraping dump discovered
  • There are no duplicate datasets per scraping source (even if they are collected multiple times, we only want to add them once)
  • Transformer is generating a transform log, recording number of input files and output datasets/resources
  • The data.json file is loadable by the CKAN harvester

P3 Metadata (OPE): Improve Metadata Quality

Description: Improve metadata quality from P3 Office of Postsecondary Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

P1 Metadata (OCR): Improve Metadata Quality

Description: Improve metadata quality from P1 Office of Civil Rights parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira card

Sanitising Datasets

SITUATION

Based on investigation & client feedback from previous scraping runs, there are datasets being scraped that need to be treated/sanitised.
Sanitising methodology may vary based on the nature of the dataset, e.g. datasets with 'photo' in the title are to be removed; datasets with 'conference' in the title are to be tagged as 'private' and also set as private, etc.
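
A minimal sketch of such title-based rules (the rules and field names shown are assumptions based on the examples above, not the agreed sanitisation spec):

```python
def sanitize(datasets):
    cleaned = []
    for dataset in datasets:
        title = (dataset.get('title') or '').lower()
        if 'photo' in title:
            continue                          # rule: drop photo galleries entirely
        if 'conference' in title:
            dataset['private'] = True         # rule: mark conference material private
            dataset.setdefault('tags', []).append('private')
        cleaned.append(dataset)
    return cleaned
```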

TASKS

  • based on investigation and client feedback identify steps that can be used to sanitise affected datasets
  • translate identified steps into usable algorithm and code
  • Integrate code into already established scrapy process with minimal/no alteration to current process

ACCEPTANCE CRITERIA

  • datasets are successfully sanitised based on agreed feedback
  • the sanitising process introduces little or no disruption to the established scrapy process, i.e. the integration is smooth/flexible

PROBLEM LIST FOR SANITISING

P6 Metadata (OPEPD): Improve Metadata Quality

Description: Improve metadata quality from P6 Office of Planning, Evaluation and Policy Development parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Filter ZIP files by contents

[STUB]

While scraping, we can download, extract and investigate the contents of the ZIP files to eliminate zipped resources that don't contain data files.

This would likely be a part of the Airflow DAGs and would likely take a long time to complete (though still shorter than doing it manually).
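
A hedged sketch of that check (the accepted data extensions are assumptions; a real DAG task would also need retries and size limits):

```python
import io
import zipfile

import requests

DATA_EXTENSIONS = ('.csv', '.xls', '.xlsx', '.json', '.xml', '.txt')

def zip_contains_data(url: str) -> bool:
    """Download a ZIP resource and check whether it holds any data files."""
    response = requests.get(url, timeout=60)
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        return any(name.lower().endswith(DATA_EXTENSIONS)
                   for name in archive.namelist())
```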

To Be Continued...

P4 Crawling: Office of English Language Acquisition

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/oela/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-502

P6 Parsing: Office of Planning, Evaluation and Policy Development

ref. #8

Create a HTML parser that integrates into the OPEPD scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Remove false positives From Scraping Results/Output

SITUATION
The output from the scraping processes currently generates some resources which satisfy the criteria for resource generation; however, upon cursory inspection, these resources are NOT actual dataset resources (e.g. a sitemap saved as an .xlsx file, or an image file stored in a ZIP). We refer to these as false positives.

TASKS

  • identify/create an effective methodology for identifying false positives in the scraping output

  • Write a script that implements the identified methodology for detecting false positives

ACCEPTANCE CRITERIA

  • script must interface with scrapy

  • Validate the removal of false positives from scrapy results

RELATED TO:
#109, #118. Close this issue when the related issues are closed.

Ed Scraping: Overall Scraping Roadmap (WIP)

A four stage roadmap

There are 4 phases to the scraping approach:

  • 1st phase - One-off catalogue population
  • 2nd phase - Ongoing catalogue population & technical debugging of crashes
  • 3rd phase - Ed can CRUD harvesters/scrapers & non-technical debugging of crashes
  • 4th phase - Move away from scraping and towards structured data-pipeline based on a data strategy

1st phase - One-off catalogue population & technical debugging of crashes

To scrape the Dept Ed websites and use data wrangling to populate the ODP with metadata (and maybe data). This will start with a 2-week sprint to validate that this approach will likely meet the coverage and metadata quality objectives.

The scraping / data wrangling output will be a data.json that will be ingested by a legacy Harvester.

This will be a one-off process to support the launch of the catalog with as much coverage as possible.

It will start with a 3-week test (1 week to prepare for the sprint, then a 2-week sprint) - at the end of this we will decide whether the scraping approach is a viable option to populate the catalog.

2nd phase - Ongoing catalogue population

To build on the above so that:

  • Can be run at intervals or on-demand by developer
  • Before loading in data into the portal, check for diff
  • Scraping pipeline works with the data.json "aka proper" pipeline (no duplication, and the Harvester favors the data.json ww2 file over web scraping)

This will probably involve using the NG Harvester pipeline, i.e. the backend NG Harvester infra but none of the front-end customization.

3rd phase - CRUD harvesters/scrappers

The Dept Ed can view, create, update and delete scraped data pipelines using a WUI and see logs of issues that have gone wrong.

4th phase - Moving away from scraping

Due to the inherent problems with scraping, the 4th phase is to move towards a department-wide data strategy.

Scraper for FSA

Federal Student Aid is an organization that publishes a lot of relevant data for the portal. The main website where everything is published is studentaid.gov.

Tasks:

  • create a crawler for the FSA site
  • add a parser (or more) to extract resources
  • verify that it produces usable output
  • manually test a few pages to validate output

Acceptance criteria:

  • Crawler visits all the pages on studentaid.gov
  • Parser extracts resources from all pages with data files
  • JSON output is usable by transformers

Improve Deduplication of Datasets

SITUATION

Based on dataset results from previous scraping runs, duplicates of datasets are still being scraped and subsequently harvested into CKAN. These duplicates should be removed by an improved deduplication process.

TASKS

  • inspect the source_url of sample duplicate datasets to identify why these duplicates escape the current deduplication process
  • based on that inspection/investigation of source_url, identify the best method(s)/solution(s) for trapping these duplicates
  • transcribe the identified solution into flexible/reusable code which can be integrated into the current deduplication transformer with little change (see the sketch below)
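
A minimal sketch of one possible approach, as referenced in the tasks above (the normalisation rules are assumptions, not the agreed solution):

```python
from urllib.parse import urlparse

def normalize_source_url(url: str) -> str:
    """Collapse variants that differ only in scheme, 'www.', trailing slash or query."""
    parsed = urlparse(url.strip().lower())
    return parsed.netloc.replace('www.', '') + parsed.path.rstrip('/')

def deduplicate(datasets):
    seen, unique = set(), []
    for dataset in datasets:
        key = normalize_source_url(dataset.get('source_url', ''))
        if key and key in seen:
            continue
        seen.add(key)
        unique.append(dataset)
    return unique
```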

ACCEPTANCE CRITERIA

  • code is easily integrated into current deduplication transformer with little change
  • more dataset duplicates are caught and removed by the improved deduplication

Current Sample of dataset duplication from source url

P4 Metadata (OELA): Improve Metadata Quality

Description: Improve metadata quality from P4 Office of English Language Acquisition parser
TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Add `source_url` to package `extras` field for the scraped datasets

ref. #25

Tasks:

  • Add the source_url property to the resulting datasets in the data.json output (sketched below)
  • Make sure the source_url property passes validation and gets stored as an extras item in CKAN
  • Adjust the ckanext-ed package template so it can display where that dataset was scraped from
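
A hedged sketch of the first task; whether the harvester maps extra data.json keys to CKAN extras automatically, or needs ckanext-ed changes, is an assumption to verify:

```python
def add_source_url(datajson_dataset: dict, scraped_item: dict) -> dict:
    """Carry source_url on the data.json entry so it can be stored as a CKAN extra."""
    source_url = scraped_item.get('source_url')
    if source_url:
        datajson_dataset['source_url'] = source_url
    return datajson_dataset
```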

Acceptance criteria:

  • Scraped datasets in CKAN have a new item in the metadata table to show where they came from
  • Datasets without source_url that were harvested will have a link to the harvester source info page
  • Datasets that were manually added will not have the extra property

Phase 3 - Improve P8 (ed.gov) Parser metadata quality

Description

Based on the results from the first runs/output of P8, the parser output needs to be improved both in terms of content and metadata quality.

Tasks

  • based on the output from prior runs of P8, review the output content and identify possible areas for improvement
  • Implement improvements on parsers based on identified areas
  • After refactoring and improvements, ensure all parsers still work as expected

Acceptance Criteria

  • parser shows a marked improvement in content output
  • parsers still generate acceptable data.json after refactoring/improvements
  • metadata has been improved; compare using the weighted system if possible

JIRA CARD

Uniform I/O between transformers, tools and dashboard

The current and target I/O flows are illustrated in diagrams attached to this issue (draw.io link).

Tasks:

  • dump everything in XLSX files
  • load everything from XLSX sheets instead of individual files
  • generate one file per transformer (except datajson)
  • use the new files in dashboard

Acceptance criteria:

  • All transformers dump in XLSX
  • The name parameter will generate a new sheet, instead of a new file
  • Everything works as expected, with the new data files

Note: The deprecated compare tool will be ignored for now, as well as its output. It will be subject to another task related to stats.
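
A minimal sketch of the target I/O described above (pandas and openpyxl are assumptions): each transformer writes into one XLSX workbook, and the name parameter selects a sheet rather than a new file.

```python
from pathlib import Path

import pandas as pd

def dump_to_xlsx(rows, workbook: Path, name: str) -> None:
    """Write `rows` to `workbook` as sheet `name`, creating the file if needed."""
    df = pd.DataFrame(rows)
    if workbook.exists():
        # replacing an existing sheet needs if_sheet_exists (pandas >= 1.3)
        with pd.ExcelWriter(workbook, mode='a', engine='openpyxl',
                            if_sheet_exists='replace') as writer:
            df.to_excel(writer, sheet_name=name, index=False)
    else:
        df.to_excel(workbook, sheet_name=name, index=False)
```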

Retain all relevant header information for resources

As we're not downloading resources, we currently have no way of knowing some basic information about them without hitting each URL.

However, downloading would be a rather costly action, both in time and disk space. But we can probably fetch only the header info.

An idea would be to use Scrapy's cache, if possible, but we need to investigate.

Examples of useful headers to fetch for each downloadable file:

  • Content-Type
  • Content-Length
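
A hedged sketch of fetching just those headers (requests shown for clarity; the toolkit might instead issue HEAD requests through Scrapy or reuse its cache):

```python
import requests

def fetch_resource_headers(url: str) -> dict:
    """HEAD the resource URL and keep only the headers we care about."""
    response = requests.head(url, allow_redirects=True, timeout=30)
    return {
        'Content-Type': response.headers.get('Content-Type'),
        'Content-Length': response.headers.get('Content-Length'),
    }
```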

Acceptance criteria:

  • the JSON dumps generated by scrapers contain headers for all resources
  • all the scrapers still run with no errors

Update Tech Spec to reflect metrics and dashboard changes

SITUATION

The Dept Ed Scraping project has evolved rapidly and there is a critical need to evaluate and measure the progress of the project.
Although quantitative measurements and reporting were not part of the original tech spec for the project, we now need adequate specs on how to effectively and easily capture and report quantitative measures which benchmark project progress internally and externally.

TASKS

  • Analyse the problem and identify adequate quantitative metrics and reports for the project

  • Design a tech spec which allows for the rapid gathering and reporting of identified significant metrics

  • Design the tech spec to also include easy-to-understand visualisations of metrics

  • Design the tech spec to include a periodic reports/trends feature

ACCEPTANCE CRITERIA

  • Tech spec includes agreed critical metrics which can be clear indicators of project progress
  • Tech spec includes an automated process of metrics gathering
  • Tech spec includes an internal/external channel (may be a dashboard) for reporting and visualising gathered metrics

P7 Metadata (OESE): Improve Metadata Quality

Description: Improve metadata quality from P7 Office of Elementary and Secondary Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Phase 3 - Improve metadata quality P9 (NCES) parser

Description

Based on the results from the first runs/output of P9, the parser output needs to be improved both in terms of content and metadata quality.

Tasks

  • based on the output from prior runs of P9, review the output content and identify possible areas for improvement
  • Implement improvements on parsers based on identified areas
  • Review all parsers and identify if number of parsers can be reduced from current number (4)
  • After refactoring and improvements, ensure all parsers still work as expected

Acceptance Criteria

  • parser shows a marked improvement in content output
  • parsers still generate acceptable data.json after refactoring/improvements
  • metadata has been improved; compare using the weighted system if possible

JIRA CARD

Create the initial code structure to run scrapers / transformers

In order to extract data from the Ed websites we need to work in a structured way, so we have all the scrapers and transformers in the same place and talking to each other.

The task is to create a code structure to accommodate all the scripts. Nothing complicated, just a module in which we put scrapers as submodules and have a standard way of working together.

Tasks:

  • bootstrap the project with a place for scrapers and a place for transformers
  • have a simple method of running commands
  • support virtual environments by providing an installable list of pip packages
  • document the structure
  • document the input / output formats
  • gitignore the temporary assets (e.g. scrapy cache, python wheels, etc.)

Acceptance criteria:

  • the repository contains code that runs
  • using a single command you can run a scraper or a transformer
  • project has documentation for developers

Parser for NCES

Create a HTML parser that integrates into the nces scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover the most frequent cases
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

P1 Parsing: Office for Civil Rights

Subtask of #3

As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task only refers to a single, rather isolated aspect of the entire pipeline: extracting the data from an HTML structure.

Desired properties for the resulting datasets:

  • source URL (where was it scraped from)
  • title
  • name (usually a unique slug of title)
  • publisher
  • description
  • tags
  • date
  • person of contact (name)
  • person of contact (email)

List of pages to get information from:

List of "false positives" that should bear no dataset information:

Tasks:

  • using the list of pages as raw HTML input, write a script that identifies whether a page has resources and, if it does, extracts all the metadata needed to create a dataset from it
  • test the parser script and output the data in a spreadsheet format for all pages in the list
  • integrate the script into the pipeline after the above validation

Acceptance criteria:

  • script accepts raw HTML as input
  • correctly identifies pages that do and do not have resources
  • produces a Python structure with the properties in the list above
  • returns None for no resources and a Python dictionary with the result otherwise (see the sketch below)
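
A hedged sketch of that contract (the selectors, resource extensions and default publisher are assumptions about the OCR pages, not the actual parser):

```python
from scrapy import Selector

RESOURCE_EXTENSIONS = ('.csv', '.xls', '.xlsx', '.zip', '.pdf')

def parse(html: str, source_url: str):
    """Return None if the page holds no resources, else a dict of dataset properties."""
    sel = Selector(text=html)
    resources = [
        {'url': href, 'name': href.rsplit('/', 1)[-1]}
        for href in sel.css('a::attr(href)').getall()
        if href.lower().endswith(RESOURCE_EXTENSIONS)
    ]
    if not resources:
        return None
    return {
        'source_url': source_url,
        'title': (sel.css('title::text').get() or '').strip(),
        'description': sel.xpath('//meta[@name="description"]/@content').get(''),
        'publisher': 'Office for Civil Rights',
        'resources': resources,
    }
```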

P3 Parsing: Office of Postsecondary Education

ref. #5

Create a HTML parser that integrates into the OPE scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • uses scrapy pipelines to dump datasets

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P5 Crawling: Office of Special Education and Rehabilitative Services

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/osers/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-504

Automate the scraping pipelines

The situation

The edscrapers module has an increasing number of scrapers. In order to produce
an output usable by CKAN, each scraper needs to run, then its output has to be
deduplicated using a simple universal script, then the deduplication result goes
into a datajson transformer. In addition to this, scraping results are also the
subject of a real-time statistics module and public HTTP dashboard, which needs
updating every time there are new results.

Having to operate the whole system manually is cumbersome and could lead to human
errors due to its sequential and multithreaded nature.

It also costs time to monitor, run and type all the commands needed to operate
each step for each scraper (via SSH / terminal access).

The consensus was that we wouldn't develop pipelines and would focus on the scraping
instead. At the moment, we have a number of steps, but no pipeline to glue them
together and standardize logging / reporting.

Analogy

A car washing company has 10 stalls and one employee operating them. Normally
one or two cars come at once, but there are times when more or all the stalls
are full. Each car needs to be washed following a specific sequence, and not
respecting that could void the process partially or totally.

Would it be worthwhile for the car wash to automate its stalls, so all the operations
are carried out automatically and quality is ensured without depending on a human
executing multiple sequences (each in a different stage) at once?

Would it increase the efficiency long term? Mid term? Short term?

Would it prevent otherwise unnoticed and unnoticeable errors leading to poor
results? What about finding out what went wrong when problems are spotted
(debugging)?

The proposed approach

Having a pipeline framework and manager that could run all these steps for us
would have some benefits:

  • standard errors / progress monitoring
  • web UI, users etc.
  • we don't have to glue things together in improvised pipelines
  • portability and ability to schedule / trigger pipelines

Airflow is the best candidate for this, according to
the analysis we have already made.

The steps are also documented in the doc linked above.
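
A hedged sketch of one such DAG (the eds subcommands, task names and schedule are assumptions; the real DAGs may call the edscrapers Python API directly):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='edscrapers_ocr',                  # hypothetical: one DAG per scraper
    start_date=datetime(2020, 3, 1),
    schedule_interval='@weekly',
    catchup=False,
) as dag:
    # hypothetical eds subcommands for each pipeline step
    scrape = BashOperator(task_id='scrape', bash_command='eds scrape ocr')
    dedupe = BashOperator(task_id='deduplicate', bash_command='eds transform deduplicate')
    datajson = BashOperator(task_id='datajson', bash_command='eds transform datajson')

    scrape >> dedupe >> datajson
```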

Tasks

  • Bootstrap an Airflow instance
  • Create the pipelines for each scraper
  • Test if it runs properly
  • Make sure dashboard reads from the right source

Acceptance criteria

  • We have an Airflow instance running the scrapers
  • Dashboard is updating
  • Pipelines are running automatically

v1- Report / Visualise Gathered Metrics

Situation:

We needed to know how the project was progressing, so we gathered metrics on scraping performance. Now we need to report these gathered metrics in an easy way, to facilitate quick decision-making and progress reporting.

Tasks

  • Create an easily accessible and usable dashboard

  • provide reports/visualisations on important metrics gathered

  • provide some periodic trends / progress reporting on metrics

  • create an overall indicator/summary of scraping progress

  • provide a visual comparison between Datopian and AIR scraping

Acceptance

  • Dashboard is available on a public url/link
  • Dashboard contains simple, 'live' reports/visualisations of important metrics gathered
  • Dashboard contains periodic / time-series graphs showing trends on metrics
  • Dashboard contains RAG summary
  • Dashboard contains comparison report and visualisation between Datopian and AIR scraping

P2 Metadata (OCTAE): Improve Metadata Quality

Description: Improve metadata quality from P2 Office of Career, Technical and Adult Education parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

P4 Parsing: Office of English Language Acquisition

ref. #6

Create a HTML parser that integrates into the OELA scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser dumps datasets with scrapy pipelines

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P7 Parsing: Office of Elementary and Secondary Education

ref. #9

Create a HTML parser that integrates into the OESE scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipeline to dump datasets with resources

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Parser for ed.gov

Create a HTML parser that integrates into the edgov scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover the most frequent cases
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • parser uses scrapy pipelines to dump datasets with resources

P9 Metadata (NCES): Improve Metadata Quality

Description: Improve metadata quality from P9 NCES parser. (highest priority)

TASKS

  • Ensure datasets produced have a description metadata

  • Create more parsers for the variant page structures

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card

Gather data insights from all scrapers output

Using the data collected so far, we need to extract some information to back our future sprint's targets.

Questions we need answered (doubles as task list):

  • What are the top 1000 resources by size? [Task cannot be completed - see #68]
  • Which pages were the most resources collected from? Top 1000.
  • What domains (subdomains) Datopian touched and AIR didn't?
    • How many items were extracted from them?
  • What domains (subdomains) AIR touched and Datopian didn't? How many items were extracted from them?
  • List of all domains ordered by number of parsed pages.
  • What is the difference between all the data we have and ed.gov/data.json?

Acceptance criteria:

  • We have a Python script that can be run over and over so we get updated numbers
  • We have CSV or XLS answers to all the above questions
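
A minimal sketch of the kind of script intended here (the dump layout and dataset keys are assumptions), counting parsed pages and resources per (sub)domain:

```python
import json
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

pages_per_domain = Counter()
resources_per_domain = Counter()

for dump_file in Path('scrapers_output').glob('**/*.json'):      # hypothetical path
    dataset = json.loads(dump_file.read_text())
    domain = urlparse(dataset.get('source_url', '')).netloc
    pages_per_domain[domain] += 1
    resources_per_domain[domain] += len(dataset.get('resources', []))

for domain, pages in pages_per_domain.most_common():
    print(f'{domain}\t{pages} pages\t{resources_per_domain[domain]} resources')
```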

P2 Parsing: Office of Career, Technical and Adult Education

ref. #4

Create a HTML parser that integrates into the OCTAE scraper and replaces the existing dummy parser script.

https://www2.ed.gov/about/offices/list/ovae/index.html

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • uses scrapy pipelines to dump datasets

TBD: List of potentially different structures and one or two false positives to test the scraper against.

P7 Crawling: Office of Elementary and Secondary Education

Description: Scrape metadata for https://oese.ed.gov/

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-507

P2 Crawling: Office of Career, Technical and Adult Education

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/ovae/index.html

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-500

P3 Crawling: Office of Postsecondary Education

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/ope/index.html

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-501

P5 Parsing: Office of Special Education and Rehabilitative Services

ref. #7

Create a HTML parser that integrates into the OSERS scraper and replaces the existing dummy parser script.

Tasks:

  • identify the possible page structures in the target site
  • write one or multiple parsers that cover as many cases as possible
  • test if it runs well within the pipeline

Acceptance criteria:

  • we have a parser specific to this site that replaces the initial dummy parser
  • parser creates Dataset instances, each having at least one Resource
  • calls dataset.dump() method and dumps serialized output in a directory specific to this scraper

TBD: List of potentially different structures and one or two false positives to test the scraper against.

Define architecture & infrastructure

At the moment this is quite a manual process. There is a tactical need for improvement (e.g. set up a cron job) but this is also the right time to think about architecture (e.g. how do the scraping pipelines avoid stepping on the toes of the data.json harvester?) and what infra options we have for go-live.

Acceptance criteria:

  • We have a spec for architecture and DevOps set up internally, and an estimate of how long it would take to implement the proposed solution (there could be both a tactical and a long-term solution)

Change data quality assessment to measure datajson output

Currently we are generating a RAG summary based on a data quality analysis, which uses a weighted scoring system.

The problem is that we are measuring the quality of the scraped data, and not the quality of the data that ends up in CKAN. We should be assessing the quality of the items in the final datajson output.
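
For reference, a hedged sketch of a weighted metadata score of this kind (the fields and weights are placeholders, not the agreed values):

```python
WEIGHTS = {'title': 3, 'description': 3, 'publisher': 2, 'distribution': 2}

def metadata_score(dataset: dict) -> float:
    """Fraction of the total weight earned by non-empty metadata fields."""
    earned = sum(weight for field, weight in WEIGHTS.items() if dataset.get(field))
    return earned / sum(WEIGHTS.values())
```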

Tasks:

  • Add a new feature to the RAG summary to switch measuring target from JSON files to data.json transformer output
    • Keep the existing implementation in case we want to compare before/after datajson phase
  • Make the dashboard use the new scores

Acceptance criteria:

  • RAG transformer measures datajson output
  • Dashboard shows datajson assessment

P1 Crawling: Office for Civil Rights

Description: Scrape metadata for https://ocrdata.ed.gov/ (this seems like a useful page - https://www2.ed.gov/about/offices/list/ocr/data.html?src=rt )

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-498

[Parsing] Extract level of data from resource name

Some datasets have resources named the same as U.S. states etc. We can use this as a heuristic to determine the level of data, in some cases.
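
A minimal sketch of the heuristic (the state list is truncated and the field names are assumptions):

```python
US_STATES = {'alabama', 'alaska', 'arizona', 'arkansas', 'california'}  # ... etc.

def infer_level_of_data(dataset: dict) -> str:
    """Tag a dataset as state-level if its resources are named after U.S. states."""
    stems = {r.get('name', '').rsplit('.', 1)[0].strip().lower()
             for r in dataset.get('resources', [])}
    return 'state' if stems & US_STATES else 'unknown'
```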

Tasks:

  • identify the patterns
  • create a map of which file names indicate which level of data
  • implement it in either parsers or transformers
  • import/adopt the 'enhancement' from the transformer into the data.json

Acceptance criteria:

  • datasets having state level files named the same as the states will have the appropriate level of data
  • 'enhancement' is adopted by data.json/harvester and is visible on the data portal

P6 Crawling: Office of Planning, Evaluation and Policy Development

Description: Scrape metadata for https://www2.ed.gov/about/offices/list/opepd/index.html?src=oc

Acceptance criteria

  • We have a data dump with all the resources metadata we can get from target site

Task-list:

  • Crawl the site
  • Perfect the crawling to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide a HTML response for the parser)
  • Test run with a dummy parser - it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria

Jira card: https://open-data-ed.atlassian.net/browse/OD-506

P5 Metadata (OSERS): Improve Metadata Quality

Description: Improve metadata quality from P5 Office of Special Education and Rehabilitative Services parser

TASKS

  • Ensure datasets produced have a description metadata

  • Ensure datasets have a publisher metadata

  • Improve other metadata (use defaults where available)

Acceptance criteria:

  • Metadata quality from the parser has been improved
  • Have values for the high priority metadata available or set default values (if available)

Jira Card
