Overview

Welcome to the Environmental Data and Governance Initiative (EDGI) Government Data Archiving team.

We are:

  • Building online tools, helping run events, and creating research networks to proactively preserve, archive, and track public environmental data and ensure its continued availability
  • Indexing millions of government web pages on a weekly basis, tracking changes to them, and producing regular reports
  • Working with protocols for resilient, sustainable, distributed data storage networks

This repository is an overview for people who are getting involved in the project.

Our GitHub organization, chat, and in-person events have a Code of Conduct and Contributor Guidelines.


Get Involved

Welcome to our community! We welcome contributors with many different skillsets. Here's how to get started:

  1. Review our Contributor Guidelines and Code of Conduct
  2. Jump on the Archivers chat (archivers.slack.com). Sign up for an account here (open to all).
    • Introduce yourself in #introductions; key starting places for conversations are #general, #datatogether, and #community-building
    • Ping one of the EDGI coordinators (@lightandluck or @kelsey) with your GitHub name to be added to the organization
  3. You can also ping our coordinators on Github at @lightandluck or @Frijol. Please let us know a bit about yourself and what you're interested in.
  4. Take a look at our Current Projects or jump straight into one of our "good-first-issue" labeled issues!

Note for IRC users (advanced): If you prefer to use an IRC client, please review these configuration instructions for Slack's IRC gateway.

Projects

Here are some projects we're building and maintaining right now.

Want to get involved? Check out the emoji column to see all the different types of contribution we need for these projects!

| Project (click through to repo) | Description | Contribution type most needed (emoji key from All Contributors) |
| --- | --- | --- |
| Web Monitoring | Tools around monitoring changes to government websites | πŸ“– πŸ› πŸ’» |
| Walk (Web Monitoring and Archiving) | A system for scraping a BIG list of URLs and processing the results for monitoring and archiving | βœ… ⚠️ πŸ“– πŸ’» πŸ› πŸ‘€ |
| EIS Search Tool | Making federal environmental impact statements easier to search | πŸ€” πŸ› πŸ’» ⚠️ πŸ“– 🎨 πŸ““ |
| Data Together | Developing a distributed model for holding copies of archived and preserved data by reading, talking, and prototyping together | πŸ“ 🎨 πŸ’‘ πŸ“‹ πŸ€” πŸ“’ πŸš‡ |
| 100 Days | Website for EDGI 100 Days Report at 100days.envirodatagov.org (in maintenance) | πŸ› |
| Website | Project management and design support for EDGI's website at envirodatagov.org | πŸ› πŸ’» 🎨 πŸ€” πŸ–‹ |
| EDGI Hubot | Chat bot for EDGI Slack built on the Hubot framework | πŸ“– πŸ’» |
| EDGI Scripts | Code scripts for running and maintaining our digital infrastructure | πŸ’» βœ… πŸ“– |
| Video Call Landing Page | Landing page app with important info that participants can be sent through prior to joining a video call: http://edgi-video-call-landing-page.herokuapp.com/ | πŸ› |

Working Openly

EDGI operates under horizontal-organizing principles. We have developed guidelines for open project development in line with these principles, which you can find in this repo.

Funding

Our work is made possible through volunteer labor, grants, and direct tax-deductible donations from the public.

Donate to EDGI

(More about how EDGI is funded)

Issues

Verify FTP crawl details and whether they are in IA before sending

Suggest GH team setup for access control

I just kinda decided to use a lieutenant model like the linux project, which seemed simple given my experience. But I could imagine this part being much more confusing if a group didn't have an opinion on a way to manage access -- a GH team, done incorrectly, can in effect lock participants out of being able to work effectively, and so kill later momentum in the project.

How I operated:

  • Created a single "Lieutenant" team, with admin access on each repo -- the repos were added to that team whenever a member transferred the repo from their personal account to the org account.
  • Whichever 1 or 2 people were managing that account before (or who wanted to take ownership) were added to the Lieutenant team.
  • @jpmckinney gently urged people in chat to set their team membership to "public", so that the public org page appears active and participatory: https://github.com/orgs/edgi-govdata-archiving/people/patcon
  • Added all participants as general "People" in the org. This ensured we stayed connected to the GitHub accounts of participants. It also lets the org decide what catch-all access to offer -- we set it to "read" access on all repos, but could later choose to permit "write" access: https://github.com/organizations/edgi-govdata-archiving/settings/member_privileges (this could be chaotic, so I think "read" is the correct choice)
  • During hackathon, gently reminded people to accept the team invite -- easy to dismiss as spam after the event.

Anyhow, hopefully that helps somehow. Lemme know if there's a better place for me to drop it.

Rename repos and add descriptions

Just noticing we have a lot of tool names and descriptions that are not clear, especially for the ones that are tailored to a specific agency.

I'd suggest:

  • consistently rename repos to link to specific datasets/agencies
  • update repo descriptions to make the status and outcomes of each tool clear

Hold weekly standups for EDGI development

February 14, 7:45 PM EST Call link: https://zoom.us/j/844613104

Following the post-NYC conversation about how to maintain development going forward, a key outcome is that we will begin testing out a weekly standup (of no more than 30 mins, and maybe a goal to get that number lower).

The first iteration will be to go around and ask anyone who shows up the following questions:

  1. Does anyone need help?
  2. Does anyone have anything to talk about?

Based on availability, we are going to schedule it for Tuesdays at 7:45pm.

Todo to mark this closed:

Prep "guide creation" tasks for upcoming events

Repo has been set up: https://github.com/edgi-govdata-archiving/guides

Notes from NYC Next Steps group...

Guide for Writing Guides

  • How to call out the need for a guide?
    • create a GH issue in a "guides" repo?
  • How to let people know you're working on it
    • claim the GH issue
  • Where to put the guide
    • EDGI Github Repo (setup for gh pages)
    • contribution guidelines (i.e. how to make it Jekyll-ready)
  • How to write it
    • Markdown files
    • Option: hackmd.io if collaboratively editing
    • (ideally) template for writing new guides (headings, etc.)

Concerns:

  • GitHub Markdown files are intimidating for everyday readers, but contributing via GitHub is a tolerable level of expectation to place on people contributing to a guide
  • The resulting guides need to be not-intimidating.

Develop Event Training Protocol

Thinking about how to make this project sustainable: we need to find effective ways to train people to run events when the old crew (b5, dallan, MAT, etc) are not around. Some ideas include:

  • pipeline walkthrough screencast
    - up-to-date docs
    - [ ]

Edit: @dcwalk added:

  • review feedback from NYC event
  • update workflow docs to reflect Archivers App (in progress: #dev-eventdocs and https://github.com/datarefuge/workflow)
  • reduce duplication across harvestor-tools and workflow

Apply for Google SoC, deadline Feb 9, 12:00 EST

We should get Summer of Code students to work on this project! Here's what we still need to do to make this happen:

  • Write application guidelines for students (1500 chars, copy from PL?)
  • create "Ideas" webpage where we describe possible projects; add link to application
    • should include a link to the Archivers public Slack invite & instructions to join the #gsoc-apply channel
  • add "long" & "short" descriptions of edgi
  • create webpage with instructions to join our Dev chat, mailing list (can we add this to contact?)
  • Org Profile
    • Why does your org want to participate in Google Summer of Code?
    • How many potential mentors have agreed to mentor this year?
    • How will you keep mentors engaged with their students?
    • How will you help your students stay on schedule to complete their projects?
    • How will you get your students involved in your community during GSoC?
    • How will you keep students involved with your community after GSoC?
    • Has your org been accepted as a mentor org in Google Summer of Code before?
    • What year was your project started?

Sync up DataRescue Event Materials

Based on conversations with DataRefuge, we are coordinating to do one shared update of the Events documentation and planning materials. #49 has been closed, and ongoing efforts will fall under this collaboration.

(Tech Team) TODOs and steps for this to be completed:

Progress Meter/Analysis Tool/Visualization

We need better ways for volunteers to identify relevant areas of the target websites. Primers provided to us by the policy analysis group will be really helpful. But it might also be really useful to hook up the sitemap tool to the records of what's already been nominated, so we have some kind of visualization of what's already been done & where we have extensive needs.
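
Below is a rough sketch of that hook-up, assuming a hypothetical nominations export (a nominations.csv with a url column) and a downloaded sitemap.xml; the real integration would read from the sitemap tool and the nomination app's actual data.

```python
import csv
import xml.etree.ElementTree as ET

# Compare the URLs in an agency sitemap against the URLs volunteers have
# already nominated, to see which sections are covered and what remains.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_urls = {
    loc.text.strip()
    for loc in ET.parse("sitemap.xml").getroot().findall(".//sm:loc", NS)
}

with open("nominations.csv", newline="") as f:  # hypothetical export of nominated URLs
    nominated = {row["url"].strip() for row in csv.DictReader(f)}

remaining = sorted(sitemap_urls - nominated)
print(f"{len(sitemap_urls & nominated)} of {len(sitemap_urls)} sitemap URLs already nominated")
for url in remaining[:20]:  # a peek at what's left to do
    print("TODO:", url)
```

A coverage count like this could feed a simple progress meter; a fuller visualization would group the remaining URLs by site section.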

Incorporate Feedback from last couple DataRescue events

Feedback from Philly:

For future Hackathons:

  • Use the EDGI slack team for all the hackathons.
  • set up a Slack auto-invite button on an EDGI website or the hackathon sites, so people can add themselves to the slack team
  • Encourage the hackathon organizers to use the EDGI github org by default
    • Makes it easy to add users, manage access, etc.
    • Let hackathon organizers create repositories within the GitHub org
  • Encourage β€œSeed and Sort” people to make videos explaining how to use
    • the nominator tool
    • the guides on google docs
  • Encourage the tools people or the baggers to create a video on how to run a web crawler and record provenance info

Feedback from Chicago:

Some hot take/aways:

  • If you plan to scrape for further URLs at the locations we found and/or elsewhere, and if someone in the room has a modicum of Python experience, Scrapy proved to be an effective and lightweight tool (a minimal example follows this list): https://doc.scrapy.org/en/1.3/intro/tutorial.html
  • If you too plan to use Webrecorder at an event, I recommend checking in first ... in order to make sure that the load can be handled and balanced. We experienced some service outages that I worry were caused by the load I was putting on their proxy server.
  • NASA is a relatively untapped resource! In our short time we only got far enough to get a quick sense of the scale of available and highly accessible resources. If future events can focus as much as Toronto did on EPA and Philly on NOAA, this would make for an important (and fun!) target.

Use Parsehub as a Scraper

Given requirements for outputs (WARC and/or wanting something in a format for download/browsing from a platform like CKAN), I don't think that Parsehub is the right tool for the job. It adds a few additional layers of complexity/work, especially in converting outputs to WARC. I think we would be better served by sticking to best practices from IA, in particular moving to a WARC proxy to build the WARCs and updating our tools to make use of that (this would affect the two EIS tools we have).

Implement solution for managing all workflow tasks at in-person events

Based on feedback at the Ann Arbor event, there are many issues around the current spreadsheet task management workflow. To briefly summarize:

  • too many people in the sheet cause it to perform slowly
  • too many ranges have to be unprotected for people to work, leading to concerns about missing, duplicated, or destroyed data
  • many people are unfamiliar/uncomfortable with the UI (e.g., using filters)
  • comments about overall usability (hard to see all relevant, and only relevant, data)

In the A2 wrap-up/debrief, people mentioned looking at project management tools to handle workflow (examples mentioned were Phabricator and Jira), though it was acknowledged that there has to be the right balance between using powerful tools like these and maintaining as low a barrier to entry as possible.

Finalize WARC creation method

We have been using wget in Python scripts to generate WARCs with results from scraping.
It appears that for bulk results we should have a single resulting WARC instead of multiple WARCs.
This is being worked on in edgi-govdata-archiving/eis-WARC-archiver#4.
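
As a rough sketch of the single-WARC approach (file names here are made up for illustration; the actual eis-WARC-archiver implementation may differ), wget can be pointed at a whole batch of URLs and told to write one WARC:

```python
import subprocess

# Fetch a whole batch of scraped URLs into one WARC (eis-bulk.warc.gz)
# rather than producing one WARC per URL.
subprocess.run(
    [
        "wget",
        "--input-file=eis_urls.txt",  # hypothetical file with one URL per line
        "--warc-file=eis-bulk",       # everything goes into eis-bulk.warc.gz
        "--warc-cdx",                 # also write a CDX index for the WARC
        "--delete-after",             # keep the WARC, discard the mirrored files
        "--no-verbose",
    ],
    check=True,
)
```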

Need to have well-formatted WARCs result from scraping tools... dropping some resources here:

Revamp Event Documentation

2017/02/20 Update: We are almost there... but are working with DataRefuge to finalize and collectively establish the event description. I'm going to close this for now and open a new issue to cover todos coming out of that collaboration.

Based on Slack conversations from the NY event debrief, @ebarry will lead a thorough review and reduction of all event materials. Wanted to throw an issue up so others would be aware that the project is ongoing.

Per conversation in #dev-eventdocs, this is my sense of the todos:

  • Provide 'one-pager' for events with KEY CONTACTS
  • Streamline/reorganize Event Toolkit
    • Trim out of date docs
    • Add in new graphics/printables
    • Move guides to new tech guides repo
  • Create 'template' repo for events to fork
  • Update Event Toolkit Website
    • Track overview
    • Checklist
  • Update Overview Repo

Current documentation:

Important Links/Resources:

Event repositories:

Website Tracking Dev Check-In

As discussed, we wanted to set up a time to chat about the Website Tracking project before the SF Event (Sat. 2/11). On the agenda:

  1. Briefly discuss progress
  2. Compile technical questions for PageFreezer
  3. Update @titaniumbones on the state of the pagefreezer-cli project and outline achievable goals for the SF event, assuming we're able to get a group of coders in a room.

Even if you couldn't make the first standup meeting, please feel free to join. Let us know what your availability is and we'll schedule a time by Wednesday evening: http://doodle.com/poll/qhrev79mg3k7gd9k Note: All times are EST!
We'll send out a zoom/hangout link then.

FOIA-a-tron

If we're going to have FOIA-a-thons, we should either find or build a tool that provides boilerplate legalese for FOIA requests & makes it trivial to submit them. I actually think that those kinds of tools exist; I haven't done any research yet, I just didn't want to lose this idea.
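
As a sketch of the "boilerplate legalese" half of the idea, a request letter could be generated from a template like the one below; the wording and field names are placeholders for illustration, not vetted legal language, and submitting the request would still need a separate mechanism.

```python
from string import Template

# Placeholder sketch: fill a FOIA request letter from a few fields.
FOIA_TEMPLATE = Template(
    "To the FOIA Officer, $agency:\n\n"
    "Under the Freedom of Information Act, 5 U.S.C. 552, I request copies of "
    "the following records: $records.\n\n"
    "I request a waiver of fees because disclosure of this information is in "
    "the public interest.\n\n"
    "Sincerely,\n$requester"
)

print(FOIA_TEMPLATE.substitute(
    agency="Environmental Protection Agency",
    records="all datasets removed from epa.gov since January 2017",
    requester="Jane Doe",
))
```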

Kick-start collaboration on improvement to the CKAN instance

A document of recommendations came out of @jaclynweiser and co.'s session at NYC. My checklist for how to kickstart this from that conversation:

  • [ ] Onboard people to develop the CKAN instance
  • [ ] Streamline dev environment (Dockerize development?) This is being done here: https://github.com/datarefuge/ckan-docker-build
  • [ ] Document current deployment? (Postgres) (Solr) (CKAN)
  • [ ] Add key members to Github organization

The DataRefuge repo has a related issue datarefuge/ckan#8

Migrate to single chat

Based on discussion we will be going ahead with this. To mark it complete, we need to:

  • Notify, create/rename new #dev channel in the Archivers slack
  • Notify the participants in the other channels
    • Civic Tech Toronto
    • EDGI
    • DataRefuge
  • Update overview documentation
  • Add details to EDGI website
  • Test portal integration with established rooms in other orgs

Right now our development chat is spread out over:

  • Civic Tech Toronto slack
  • DataRefuge slack
  • EDGI
  • (now) Archivers slack

Other places we work:

  • This github organization

We need to reduce the pain caused by all the chats!

Review README changes

I just committed them to master (gaah!) instead of submitting a PR. Nonetheless it would be good if you checked them over. There are a couple of italicized sections that I'm hoping you can fill in; in general the language may need to be revised, too.

Kick-start development on Website Monitoring project

Coming out of our Feb 14 Standup, @ambergman and @titaniumbones are looking to kickstart development for the Website Monitoring project.

TODOs to mark this complete:

Current work on the Website Monitoring Project is spread out over:

New Toolchain
  • https://github.com/edgi-govdata-archiving/pagefreezer-cli
  • https://github.com/edgi-govdata-archiving/filtration

Old Toolchain
  • https://github.com/edgi-govdata-archiving/version-tracking-ui
  • https://github.com/edgi-govdata-archiving/versionista-outputter

Support ongoing (remote) involvement

We've had people who've attended events express a desire to continue to be involved in the project and we need to figure out a way to support that. As we move into longer-term projects we should prioritize ways to include previous participants. Currently we have:

  • Contributor Guidelines (really just cover using branches)
  • Overview repo

We should consider revisiting our first sprint conversations:

2017-02-10:
I think we are almost ready to close this as a first iteration because in the last two weeks we have:

  • added contributor guidelines
  • added a code of conduct
  • consolidated our chat
  • started a weekly standup
  • begun to track our overall progress with a slick kanban

Future work includes:

  • update standup protocol
  • officially document onboarding protocol
  • update website with info for joining

Set up Mailing Lists for events and alumni

Recent events have set up a mailing list for planning and we anticipate more doing so. Thinking long term-- we don't want to have a bunch of scattered lists we are trying to maintain in multiple places.

Also, we want to support email but move to having most of our conversations on slack.

To mark this closed:

  • make lists.envirodatagov.org a place to sign up for mailing lists (investigate Mailman)
  • Get multiple people set up as admin, moderators
  • Establish protocol for: folding temporary lists into a larger one (e.g. AdaCamp style having an "alumni" list for people?), code of conduct, moderation
  • Create list for upcoming event

Implement onboarding process for Archivers App (Event Preservation) development

Coming out of our Feb 14 Standup, @danielballan is looking to open up development for the Archivers app, especially as we have more people interested in contributing.

TODOs to mark this complete:

  • kickoff meeting (see notes below)
  • security sign-off (related to #50) -- call with a plan to go ahead:
  • contributor guidelines added
  • public github for archivers ROADMAP, target date of Friday March 17:
    • 1. code cleanup/linting complete
    • 2. integration of security middleware into dev process per @zsck's recommendation (turns out these were for the Express framework, so we followed Meteor's security checklist instead)
    • 3. Add middleware recommendations to org-wide project guidelines
    • 4. identify areas of codebase that are of concern and open issues (e.g. memory usage, imports api)
    • 5. address https redirect and ensure CORS handling okay
    • 6. throw a license on the code
