Overview

Welcome to the Environmental Data and Governance Initiative (EDGI) Government Data Archiving team.

We are:

  • Building online tools, helping run events, and creating research networks to proactively preserve, archive, and track public environmental data and ensure its continued availability
  • Indexing millions of government web pages on a weekly basis, tracking changes to them, and producing regular reports
  • Working with protocols for resilient, sustainable, distributed data storage networks

This repository is an overview for people who are getting involved in the project.

Our GitHub organization, chat, and in-person events have a Code of Conduct and Contributor Guidelines.


Get Involved

Welcome to our community! We welcome contributors with many different skillsets. Here's how to get started:

  1. Review our Contributor Guidelines and Code of Conduct
  2. Jump on the Archivers chat (archivers.slack.com). Sign up for an account here (open to all).
    • Introduce yourself in #introductions; key starting places for conversations are #general, #datatogether, and #community-building
    • Ping one of the EDGI coordinators (@lightandluck or @kelsey) with your GitHub name to be added to the organization
  3. You can also ping our coordinators on Github at @lightandluck or @Frijol. Please let us know a bit about yourself and what you're interested in.
  4. Take a look at our Current Projects or jump straight into one of our "good-first-issue" labeled issues!

Note for IRC users (advanced): If you prefer to use an IRC client, please review these configuration instructions for Slack's IRC gateway.

Projects

Here are some projects we're building and maintaining right now.

Want to get involved? Check out the emoji column to see all the different types of contribution we need for these projects!

| Project (click through to repo) | Description | Contribution type most needed (emoji key from All Contributors) |
| --- | --- | --- |
| Web Monitoring | Tools around monitoring changes to government websites | πŸ“– πŸ› πŸ’» |
| Walk (Web Monitoring and Archiving) | A system for scraping a BIG list of URLs and processing the results for monitoring and archiving | βœ… ⚠️ πŸ“– πŸ’» πŸ› πŸ‘€ |
| EIS Search Tool | Making federal environmental impact statements easier to search | πŸ€” πŸ› πŸ’» ⚠️ πŸ“– 🎨 πŸ““ |
| Data Together | Developing a distributed model for holding copies of archived and preserved data by reading, talking, and prototyping together | πŸ“ 🎨 πŸ’‘ πŸ“‹ πŸ€” πŸ“’ πŸš‡ |
| 100 Days | Website for EDGI 100 Days Report at 100days.envirodatagov.org (in maintenance) | πŸ› |
| Website | Project management and design support for EDGI's website at envirodatagov.org | πŸ› πŸ’» 🎨 πŸ€” πŸ–‹ |
| EDGI Hubot | Chat bot for EDGI Slack built on the Hubot framework | πŸ“– πŸ’» |
| EDGI Scripts | Code scripts for running and maintaining our digital infrastructure | πŸ’» βœ… πŸ“– |
| Video Call Landing Page | Landing page app with important info that participants can be sent through prior to joining a video call: http://edgi-video-call-landing-page.herokuapp.com/ | πŸ› |

Working Openly

EDGI operates under horizontal-organizing principles. We have developed guidelines for open project development in line with these principles, which you can find in this repo.

Funding

Our work is made possible through volunteer labor, grants, and direct tax-deductible donations from the public.

Donate to EDGI

(More about how EDGI is funded)

Issues

Verify FTP crawl details and whether they are in IA before sending

Suggest GH team setup for access control

I just kinda decided to use a lieutenant model like the linux project, which seemed simple given my experience. But I could imagine this part being much more confusing if a group didn't have an opinion on a way to manage access -- a GH team, done incorrectly, can in effect lock participants out of being able to work effectively, and so kill later momentum in the project.

How I operated:

  • Created a single "Lieutenant" team, with admin access on each repo -- the repos were added to that team whenever a member transferred the repo from their personal account to the org account.
  • Whichever 1 or 2 people were managing that account before (or who wanted to take ownership) were added to the Lieutenant team.
  • @jpmckinney gently urged people in chat to set their team membership to "public", so that the public org page appears active and participatory: https://github.com/orgs/edgi-govdata-archiving/people/patcon
  • Added all participants as general "People" in the org. This ensured we stayed connected to the GitHub accounts of participants. It also lets the org decide what catch-all access to offer -- we set it to "read" access on all repos, but could later choose to permit "write" access: https://github.com/organizations/edgi-govdata-archiving/settings/member_privileges (this could be chaotic, so I think "read" is the correct choice)
  • During hackathon, gently reminded people to accept the team invite -- easy to dismiss as spam after the event.

Anyhow, hopefully that helps somehow. Lemme know if there's a better place for me to drop it.

Rename repos and add descriptions

Just noticing we have a lot of tool names and descriptions that are not clear, especially for the ones that are tailored to a specific agency.

I'd suggest:

  • consistently rename repos to link to specific datasets/agencies
  • update repo descriptions to make the status and outcomes of each tool clear

Hold weekly standups for EDGI development

February 14, 7:45 PM EST Call link: https://zoom.us/j/844613104

Following the post-NYC conversation about how to maintain development going forward, a key outcome is that we will begin testing out a weekly standup (of no more than 30 mins, and maybe a goal to get that number lower).

The first iteration will be to go around and ask anyone who shows up the following questions:

  1. Does anyone need help?
  2. Does anyone have anything to talk about?

Based on availability, we are going to schedule it for Tuesdays at 7:45pm.

Todo to mark this closed:

Prep "guide creation" tasks for upcoming events

Repo has been set up: https://github.com/edgi-govdata-archiving/guides

Notes from NYC Next Steps group...

Guide for Writing Guides

  • How to call out the need for a guide?
    • create a GH issue in a "guides" repo?
  • How to let people know you're working on it
    • claim the GH issue
  • Where to put the guide
    • EDGI Github Repo (setup for gh pages)
    • contribution guidelines (i.e. how to make it Jekyll-ready)
  • How to write it
    • Markdown files
    • Option: hackmd.io if collaboratively editing
    • (ideally) template for writing new guides (headings, etc.)

Concerns:

  • GitHub Markdown files are intimidating for everyday readers, but contributing via GitHub is a tolerable level of expectation to place on people contributing to a guide
  • The resulting guides need to be not-intimidating.

Develop Event Training Protocol

Thinking about how to make this project sustainable: we need to find effective ways to train people to run events when the old crew (b5, dallan, MAT, etc) are not around. Some ideas include:

  • pipeline walkthrough screencast
    - up-to-date docs
    - [ ]

Edit: @dcwalk added:

  • review feedback from NYC event
  • update workflow docs to reflect Archivers App (in progress: #dev-eventdocs and https://github.com/datarefuge/workflow)
  • reduce duplication across harvestor-tools and workflow

Apply for Google SoC, deadline Feb 9, 12:00 EST

We should get Summer of Code students to work on this project! Here's what we still need to do to make this happen:

  • Write application guidelines for students (1500 chars, copy from PL?)
  • create "Ideas" webpage where we describe possible projects; add link to application
    • should include a link to the Archivers public Slack invite & instructions to join the #gsoc-apply channel
  • add "long" & "short" descriptions of edgi
  • create webpage with instructions to join our Dev chat, mailing list (can we add this to contact?)
  • Org Profile
    • Why does your org want to participate in Google Summer of Code?
    • How many potential mentors have agreed to mentor this year?
    • How will you keep mentors engaged with their students?
    • How will you help your students stay on schedule to complete their projects?
    • How will you get your students involved in your community during GSoC?
    • How will you keep students involved with your community after GSoC?
    • Has your org been accepted as a mentor org in Google Summer of Code before?
    • What year was your project started?

Sync up DataRescue Event Materials

Based on conversations with DataRefuge, we are coordinating to do one shared update of the Events documentation and planning materials. #49 has been closed, and ongoing efforts will fall under this collaboration.

(Tech Team) TODOs and steps for this to be completed:

Progress Meter/Analysis Tool/Visualization

We need better ways for volunteers to identify relevant areas of the target websites. Primers provided to us by the policy analysis group will be really helpful. But it might also be really useful to hook up the sitemap tool to the records of what's already been nominated, so we have some kind of visualization of what's already been done & where we have extensive needs.
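
Below is a rough sketch of that hook-up, assuming a hypothetical nominations export (a nominations.csv with a url column) and a downloaded sitemap.xml; the real integration would read from the sitemap tool and the nomination app's actual data.

```python
import csv
import xml.etree.ElementTree as ET

# Compare the URLs in an agency sitemap against the URLs volunteers have
# already nominated, to see which sections are covered and what remains.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_urls = {
    loc.text.strip()
    for loc in ET.parse("sitemap.xml").getroot().findall(".//sm:loc", NS)
}

with open("nominations.csv", newline="") as f:  # hypothetical export of nominated URLs
    nominated = {row["url"].strip() for row in csv.DictReader(f)}

remaining = sorted(sitemap_urls - nominated)
print(f"{len(sitemap_urls & nominated)} of {len(sitemap_urls)} sitemap URLs already nominated")
for url in remaining[:20]:  # a peek at what's left to do
    print("TODO:", url)
```

A coverage count like this could feed a simple progress meter; a fuller visualization would group the remaining URLs by site section.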

Incorporate Feedback from last couple DataRescue events

Feedback from Philly:

For future Hackathons:

  • Use the EDGI slack team for all the hackathons.
  • set up a Slack auto-invite button on an EDGI website or the hackathon sites, so people can add themselves to the slack team
  • Encourage the hackathon organizers to use the EDGI github org by default
    • Makes it easy to add users, manage access, etc.
    • Let hackathon organizers create repositories within the GitHub org
  • Encourage β€œSeed and Sort” people to make videos explaining how to use
    • the nominator tool
    • the guides on google docs
  • Encourage the tools people or the baggers to create a video on how to run a web crawler and record provenance info

Feedback from Chicago:

Some hot take/aways:

  • If you plan to scrape for further URLs at the locations we found and/or elsewhere, and if someone in the room has a modicum of Python experience, Scrapy proved to be an effective and lightweight tool (a minimal example follows this list): https://doc.scrapy.org/en/1.3/intro/tutorial.html
  • If you too plan to use Webrecorder at an event, I recommend checking in first ... in order to make sure that the load can be handled and balanced. We experienced some service outages that I worry were caused by the load I was putting on their proxy server.
  • NASA is a relatively untapped resource! In our short time we only got far enough to get a quick sense of the scale of available and highly accessible resources. If future events can focus as much as Toronto did on EPA and Philly on NOAA, this would make for an important (and fun!) target.

Use Parsehub as a Scraper

Given requirements for outputs (WARC and/or wanting something in a format for download/browsing from a platform like CKAN), I don't think that Parsehub is the right tool for the job. It adds a few additional layers of complexity/work, especially in converting outputs to WARC. I think we would be better served by sticking to best practices from IA, in particular moving to a WARC proxy to build the WARCs and updating our tools to make use of that (this would affect the two EIS tools we have).

Implement solution for managing all workflow tasks at in-person events

Based on feedback at the Ann Arbor event, there are many issues around the current spreadsheet task management workflow. To briefly summarize:

  • too many people in the sheet cause it to perform slowly
  • too many ranges have to be unprotected for people to work, leading to concerns about missing, duplicated, or destroyed data
  • many people are unfamiliar/uncomfortable with the UI (e.g., using filters)
  • comments about overall usability (hard to see all relevant, and only relevant, data)

In the A2 wrap-up/debrief, people mentioned looking at project management tools to handle workflow (examples mentioned were Phabricator and Jira), though it was acknowledged that there has to be the right balance between using powerful tools like these and maintaining as low a barrier to entry as possible.

Finalize WARC creation method

We have been using wget in Python scripts to generate WARCs with results from scraping.
It appears that for bulk results we should have a single resulting WARC instead of multiple WARCs.
This is being worked on in edgi-govdata-archiving/eis-WARC-archiver#4.
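
As a rough sketch of the single-WARC approach (file names here are made up for illustration; the actual eis-WARC-archiver implementation may differ), wget can be pointed at a whole batch of URLs and told to write one WARC:

```python
import subprocess

# Fetch a whole batch of scraped URLs into one WARC (eis-bulk.warc.gz)
# rather than producing one WARC per URL.
subprocess.run(
    [
        "wget",
        "--input-file=eis_urls.txt",  # hypothetical file with one URL per line
        "--warc-file=eis-bulk",       # everything goes into eis-bulk.warc.gz
        "--warc-cdx",                 # also write a CDX index for the WARC
        "--delete-after",             # keep the WARC, discard the mirrored files
        "--no-verbose",
    ],
    check=True,
)
```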

Need to have well-formatted WARCs result from scraping tools... dropping some resources here:

Revamp Event Documentation

2017/02/20 Update: We are almost there... but are working with DataRefuge to finalize and collectively establish the event description. I'm going to close this for now and open a new issue to cover todos coming out of that collaboration.

Based on Slack conversations from the NY event debrief, @ebarry will lead a thorough review and reduction of all event materials. Wanted to throw an issue up so others would be aware that the project is ongoing.

Per conversation in #dev-eventdocs, this is my sense of the todos:

  • Provide 'one-pager' for events with KEY CONTACTS
  • Streamline/reorganize Event Toolkit
    • Trim out of date docs
    • Add in new graphics/printables
    • Move guides to new tech guides repo
  • Create 'template' repo for events to fork
  • Update Event Toolkit Website
    • Track overview
    • Checklist
  • Update Overview Repo

Current documentation:

Important Links/Resources:

Event repositories:

Website Tracking Dev Check-In

As discussed, we wanted to set up a time to chat about the Website Tracking project before the SF Event (Sat. 2/11). On the agenda:

  1. Briefly discuss progress
  2. Compile technical questions for PageFreezer
  3. Update @titaniumbones on the state of the pagefreezer-cli project and outline achievable goals for the SF event, assuming we're able to get a group of coders in a room.

Even if you couldn't make the first standup meeting, please feel free to join. Let us know what your availability is and we'll schedule a time by Wednesday evening: http://doodle.com/poll/qhrev79mg3k7gd9k Note: All times are EST!
We'll send out a zoom/hangout link then.

FOIA-a-tron

If we're going to have FOIA-a-thons, we should either find or build a tool that provides boilerplate legalese for FOIA requests & makes it trivial to submit them. I actually think that those kinds of tools exist; I haven't done any research yet, I just didn't want to lose this idea.
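
As a sketch of the "boilerplate legalese" half of the idea, a request letter could be generated from a template like the one below; the wording and field names are placeholders for illustration, not vetted legal language, and submitting the request would still need a separate mechanism.

```python
from string import Template

# Placeholder sketch: fill a FOIA request letter from a few fields.
FOIA_TEMPLATE = Template(
    "To the FOIA Officer, $agency:\n\n"
    "Under the Freedom of Information Act, 5 U.S.C. 552, I request copies of "
    "the following records: $records.\n\n"
    "I request a waiver of fees because disclosure of this information is in "
    "the public interest.\n\n"
    "Sincerely,\n$requester"
)

print(FOIA_TEMPLATE.substitute(
    agency="Environmental Protection Agency",
    records="all datasets removed from epa.gov since January 2017",
    requester="Jane Doe",
))
```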

Kick-start collaboration on improvement to the CKAN instance

A document of recommendations came out of @jaclynweiser and co.'s session at NYC. My checklist for how to kickstart this from that conversation:

  • [ ] Onboard people to develop the CKAN instance
  • [ ] Streamline dev environment (Dockerize development?) This is being done here: https://github.com/datarefuge/ckan-docker-build
  • [ ] Document current deployment? (Postgres) (Solr) (CKAN)
  • [ ] Add key members to Github organization

The DataRefuge repo has a related issue datarefuge/ckan#8

Migrate to single chat

Based on discussion we will be going ahead with this. To mark it complete, we need to:

  • Notify, create/rename new #dev channel in the Archivers slack
  • Notify the participants in the other channels
    • Civic Tech Toronto
    • EDGI
    • DataRefuge
  • Update overview documentation
  • Add details to EDGI website
  • Test portal integration with established rooms in other orgs

Right now our development chat is spread out over:

  • Civic Tech Toronto slack
  • DataRefuge slack
  • EDGI
  • (now) Archivers slack

Other places we work:

  • This github organization

We need to reduce the pain caused by all the chats!

Review README changes

I just committed them to master (gaah!) instead of submitting a PR. Nonetheless it would be good if you checked them over. There are a couple of italicized sections that I'm hoping you can fill in; in general the language may need to be revised, too.

Kick-start development on Website Monitoring project

Coming out of our Feb 14 Standup, @ambergman and @titaniumbones are looking to kickstart development for the Website Monitoring project.

TODOs to mark this complete:

Current work on the Website Monitoring Project is spread out over:

New Toolchain
  • https://github.com/edgi-govdata-archiving/pagefreezer-cli
  • https://github.com/edgi-govdata-archiving/filtration

Old Toolchain
  • https://github.com/edgi-govdata-archiving/version-tracking-ui
  • https://github.com/edgi-govdata-archiving/versionista-outputter

Support ongoing (remote) involvement

We've had people who've attended events express a desire to continue to be involved in the project and we need to figure out a way to support that. As we move into longer-term projects we should prioritize ways to include previous participants. Currently we have:

  • Contributor Guidelines (really just cover using branches)
  • Overview repo

We should consider revisiting our first sprint conversations:

2017-02-10:
I think we are almost ready to close this as a first iteration because in the last two weeks we have:

  • added contributor guidelines
  • added a code of conduct
  • consolidated our chat
  • started a weekly standup
  • begun to track our overall progress with a slick kanban

Future work includes:

  • update standup protocol
  • officially document onboarding protocol
  • update website with info for joining

Set up Mailing Lists for events and alumni

Recent events have set up a mailing list for planning and we anticipate more doing so. Thinking long term-- we don't want to have a bunch of scattered lists we are trying to maintain in multiple places.

Also, we want to support email but move to having most of our conversations on slack.

To mark this closed:

  • make lists.envirodatagov.org a place to sign up for mailing lists (investigate Mailman)
  • Get multiple people set up as admin, moderators
  • Establish protocol for: folding temporary lists into a larger one (e.g. AdaCamp style having an "alumni" list for people?), code of conduct, moderation
  • Create list for upcoming event

Implement onboarding process for Archivers App (Event Preservation) development

Coming out of our Feb 14 Standup, @danielballan is looking to open up development for the Archivers app, especially as we have more people interested in contributing.

TODOs to mark this complete:

  • kickoff meeting (see notes below)
  • security sign-off (related to #50) -- call with a plan to go ahead:
  • contributor guidelines added
  • public github for archivers ROADMAP, target date of Friday March 17:
    • 1. code cleanup/linting complete
    • 2. integration of security middleware into dev process per @zsck's recommendation (turns out these were for the Express framework, so we followed Meteor's security checklist instead)
    • 3. Add middleware recommendations to org-wide project guidelines
    • 4. identify areas of codebase that are of concern and open issues (e.g. memory usage, imports api)
    • 5. address https redirect and ensure CORS handling okay
    • 6. throw a license on the code
