Overview of Technical Projects

Welcome to the Environmental Data and Governance Initiative (EDGI) Government Data Archiving technical team. We are building online tools, supporting events, and creating research networks to proactively preserve, archive, and track public environmental data and ensure its continued availability. We index millions of government web pages on a weekly basis, track changes to them, and produce regular reports. Our focus has turned to using machine learning to find the most important changes among those pages, and to working with protocols for resilient, sustainable, distributed data storage networks.

This repository is an overview for people who are getting involved in the project. Our GitHub organization, chat, and in-person events have a Code of Conduct.

Get Involved

Development

If you'd like to help with ongoing development of these tools, great!

  1. Review our Contributor Guidelines and Code of Conduct
  2. Join our chat at archivers.slack.com; anyone can request an invite from archivers-slack.herokuapp.com
     • Contributor and development conversations happen on #dev
     • Ping one of the EDGI coordinators (@mattprice or @dcwalk) with your GitHub name to be added to the organization
  3. Take a look at our Current Projects and Kanban Board
  4. Join us at our weekly standup, Saturdays at 6:30 ET (Eastern Time); the call link is posted in #dev, along with notes and recorded meetings

Running an Event

If you are interested in running your own event, please head to EDGI's Event Toolkit and DataRefuge's Get Involved.

Supporting repositories include:

Projects

The EDGI technical team is currently supporting development of the following projects. Overall progress is tracked via our Project Tracking Board; specific tasks, issues, and milestones are handled in individual repos.

Event Preservation

| Tool Name | Description | Status |
|---|---|---|
| Archivers.space App | Heroku app for research and harvesting | Working |
| Harvesting Tools | A collection of code snippets designed to be dropped into the data harvesting process directly after generating the zip starter kit | Working |
| Nomination Tool | Chrome extension to simplify the nomination process at archiv-a-thons | Working |
| DataRefuge's Event Workflow | Detailed descriptions of the phases of event workflow | Working |
| DataRescueTEMPLATE | DataRescue Event Template with gh-pages branch for event website | Working |
| Zip Starter | Automate zip folder creation for Harvesting during pipeline | Working |
| s3 Upload Server | Heroku app for uploading Datasets to S3 from the browser | Working |
| EIS WARC Archiver | Docker app that ingests a list of URLs then crawls and generates WARCs | Working |

Monitoring Websites

These repositories support the current workflow, which is based on Google spreadsheets automatically generated by scraping the Versionista web interface. They will be deprecated when a web app-based workflow is ready to use:

| Tool Name | Description | Status |
|---|---|---|
| Versionista Outputter | A Ruby script that scrapes Versionista's web interface to generate a CSV summarizing which websites and pages have had recent changes | Working |
| Version Tracking UI | Tools to facilitate tracking website changes | Archived |

These repositories will support a future workflow that improves upon the current one by:

  • replacing the Google spreadsheets with a custom web app
  • drawing on data from multiple sources including Versionista, PageFreezer, and others in the future
  • applying text processing techniques to prioritize and filter diffs before presenting them to human volunteers

The timing of the change-over depends on some external factors but is roughly planned for late March.
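As a rough illustration of the "prioritize and filter diffs" step described above, here is a minimal sketch using only Python's standard library. It is not the code in web-monitoring-processing; the function names, the `(url, old_text, new_text)` input shape, and the 0.05 threshold are all assumptions made for the example.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and lowercase so cosmetic edits do not count as changes."""
    return re.sub(r"\s+", " ", text).strip().lower()


def change_score(old_text, new_text):
    """Return a score in [0, 1]: 0 means identical after normalization."""
    old_words = normalize(old_text).split()
    new_words = normalize(new_text).split()
    # Compare word sequences; autojunk is disabled so common words are not ignored.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return 1.0 - matcher.ratio()


def prioritize(versions, threshold=0.05):
    """Filter out near-identical versions and rank the rest, largest change first.

    `versions` is a list of (url, old_text, new_text) tuples (hypothetical shape).
    """
    scored = [(change_score(old, new), url) for url, old, new in versions]
    kept = [(score, url) for score, url in scored if score >= threshold]
    return sorted(kept, reverse=True)


pages = [
    ("a", "hello world", "Hello   world"),
    ("b", "climate change is real and urgent", "weather varies"),
    ("c", "one two three four", "one two three five"),
]
ranked = prioritize(pages)
```

A word-level similarity ratio is only one possible signal. A real pipeline would likely strip HTML first and weigh which terms changed, not only how many.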

| Tool Name | Description | Status | Language |
|---|---|---|---|
| web-monitoring | Documentation and project management repo for Website Monitoring project | Working | -- |
| web-monitoring-processing | Queries data sources, performs prioritization/filtering, populates databases for web app | In Progress | Python |
| web-monitoring-db | The Rails backend of the web app that human volunteers will use to evaluate diffs | In Progress | Ruby on Rails |
| web-monitoring-ui | The JS front-end that human volunteers will use to evaluate diffs | In Progress | TypeScript |

Other Projects...

  • Improving toolkit/remote contribution process
  • Exploring redundant, distributed storage (IPFS)

Archived Tools

Tools that have been archived for reference:

| Archived Tools | Description | Status | Language |
|---|---|---|---|
| EPA Search Utilities | A scraper for the EPA search engine that systematically feeds in search queries and extracts the resulting URLs | Archived | Go, Binary |
| EPA Quantitative Databases | Undocumented scraper for a set of databases accessible but obfuscated through one of the EPA data websites | Archived | Python |
| Sitemapper | Tools and services to create XML, CSV, and JSON sitemaps of websites | Archived | Python 3 |
| EPA ECHO Scraping | Scraper for the EPA Enforcement & Compliance History archives | Archived | Ruby |
| EPA Geoportal Database Scraper | Scraper to archive all GIS data ZIP files on EPA's Geoportal | Archived | Node |
| EPA Sitemap | A sitemap tool to provide initial models of government domains, intended to facilitate volunteer organization at archiv-a-thons | Archived | Python |
| EIS Scraping | Full workflow for identifying, scraping, and downloading WARCs of EISs (the EIS WARC Archiver above is one part) | Archived | Ruby, Node, Python 2 |
| Sprint Toolkit | Our organizing documents for the tech group at our archiv-a-thon (potential overlap with the current repo) | Archived | -- |

Other tools for scraping and data preservation that we've experimented with:

  • Dolley Madison, a PHP script to download all government GitHub repos
  • Grab-Site, a crawler with a CLI that also outputs WARCs
  • WARCprox, a proxy with a CLI for generating WARCs
  • Python-sitemap, a mini-crawler that just makes a sitemap of the website
