
Code of Conduct  Project Status Board

⚠️ This project is no longer maintained. ⚠️ It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

web-monitoring-db

This repository is the database and API underlying the EDGI Web Monitoring Project. It’s a Rails app that:

  • Acts as a database of monitored pages and captured versions of those pages over time.

    (The application does not record new versions itself, but relies on importing data from external services, like the Internet Archive or Versionista. See “How Data Gets Loaded” below for more.)

  • Provides an API to get that page and version data, and to allow analysts or other automated tools to annotate those versions with metadata about what has changed from version to version.

For more about how data is modeled in this project, see “Data Model” below.

API documentation is available from the homepage of the application, e.g. by pointing your browser to http://localhost:3000/ or https://api.monitoring.envirodatagov.org. It’s generated from our OpenAPI docs in swagger.yml.

We maintain a publicly available staging server at https://api-staging.monitoring.envirodatagov.org that you can test against. It runs the latest code and has non-production data — it’s safe to modify or post new versions or annotations to, but you should not rely on that data sticking around; it may get reset at any time. For access, ask for an account on Slack or use the public user credentials:
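To sanity-check your access, you can hit the API with any HTTP client. The Ruby sketch below is illustrative only: the exact paths, query parameters, and response shape (a top-level data array with uuid and url fields) are assumptions here, so confirm them against the generated API docs in swagger.yml.

require "net/http"
require "json"
require "uri"

# Illustrative sketch only: endpoint path and response field names below are
# assumptions; the generated API docs are authoritative.
uri = URI("https://api-staging.monitoring.envirodatagov.org/api/v0/pages")
request = Net::HTTP::Get.new(uri)
request.basic_auth(ENV["WM_USER"], ENV["WM_PASSWORD"])  # your staging account

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

pages = JSON.parse(response.body).fetch("data", [])
pages.first(5).each { |page| puts "#{page['uuid']}  #{page['url']}" }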

Installation

  1. Ensure you have Ruby 3.2+.

    You can use rbenv to manage multiple Ruby versions

  2. Ensure you have PostgreSQL 9.5+. If you are on MacOS, we recommend Postgres.app. It makes running multiple versions of PostgreSQL much simpler and gives you easy access to start and stop your databases.

  3. Ensure you have Redis (used for caching).

    On MacOS:

    $ brew install redis

    On Debian Linux:

    $ apt-get install redis
  4. Ensure you have a JavaScript Runtime

    On MacOS:

    You do not need to do anything. Apple JavaScriptCore fulfills this dependency.

    On Debian Linux:

    $ apt-get install nodejs

    If you wish to use another runtime, you can use any of the runtimes supported by the execjs gem.

  5. Clone this repo

  6. If you don’t have the bundler Ruby gem, install it:

    $ gem install bundler
  7. Wherever you cloned the repo, go to that directory and install dependencies:

    $ bundle install --without production
  8. Copy the .env.example file to .env - this allows for easy configuration locally.

    $ cp .env.example .env

    Take a moment to look through the variables here and change any that make sense for your local environment. If you need to set variables differently when running tests, make a .env.test file that holds your test-specific variables.

  9. Set up your database.

    • If your Postgres install trusts local users and you have a superuser (this is the normal situation with Postgres.app), run:

      $ bundle exec rake db:setup

      That will create a database, set up all the tables, create an admin user, and add some sample data. Make note of the admin user e-mail and password that are shown; you’ll need them to log in and create more users, import more data, or make annotations.

      If you’d like to do the setup manually or don’t want sample data, see manual postgres setup below.

    • If your Postgres install has a superuser, but doesn't trust local connections, you'll need to configure database credentials in .env. Find the line for DATABASE_URL in your .env file, uncomment it, and fill it in with your username and password. Make another file named .env.test and copy that line, but change the database name at the end of the URL to point to your test database. Then run the same command as above:

      $ bundle exec rake db:setup

      If you’d like to do the setup manually or don’t want sample data, see manual postgres setup below.

    • If you’d like to configure your Postgres DB to use a specific user, you’ll need to do a little more work:

      1. Log into psql and create a new user for your databases. Change the username and password to whatever you’d like:

        CREATE USER wm_dev_user WITH SUPERUSER PASSWORD 'wm_dev_password';

        Unfortunately, Rails' test fixtures require nothing less than superuser privileges in PostgreSQL.

      2. (Still in psql) Create a development and a test database:

        -- Development database
        CREATE DATABASE web_monitoring_dev ENCODING 'utf-8' OWNER wm_dev_user;
        \c web_monitoring_dev
        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
        CREATE EXTENSION IF NOT EXISTS "pgcrypto";
        CREATE EXTENSION IF NOT EXISTS "plpgsql";
        CREATE EXTENSION IF NOT EXISTS "citext";
        -- Repeat for the test database
        CREATE DATABASE web_monitoring_test ENCODING 'utf-8' OWNER wm_dev_user;
        \c web_monitoring_test
        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
        CREATE EXTENSION IF NOT EXISTS "pgcrypto";
        CREATE EXTENSION IF NOT EXISTS "plpgsql";
        CREATE EXTENSION IF NOT EXISTS "citext";
      3. Exit the psql console and open your .env file. Find the line for DATABASE_URL in your .env file, uncomment it, and fill it in with your credentials and database name from above:

        DATABASE_URL=postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_dev

        Make a .env.test file and set the same value there, but with the name of your test database:

        DATABASE_URL=postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_test
      4. Set up all the tables and test data in your DB by running:

        # Set up tables, indexes, and general database schema:
        $ bundle exec rake db:schema:load
        # Add sample data and an admin user:
        $ bundle exec rake db:seed

        For more on this last step, see manual postgres setup below.

  10. Start the server!

    $ bundle exec rails server

    You should now have a server running and can visit it at http://localhost:3000/. Open that up in a browser and go to town!

  11. Bulk importing, automated analysis, and e-mail invitations all run as asynchronous jobs (using the fantastic good_job gem). If you plan to use any of these features, you must also start a worker:

    $ bundle exec good_job start

    If you only want to run a particular type of job, you can set a list of queue names with the --queues option:

    $ bundle exec good_job start --queues=mailers,import,analysis

    Each job type runs on a different queue:

    • mailers: Sending e-mails. (There's no job associated with this queue because it is automatically processed by ActionMailer, a built-in component of Rails.)
    • import: Bulk version imports (processing data sent to the /api/v0/imports endpoint).
    • analysis: Auto-analyze changes between versions and create annotations with the results.
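
    For context, a job class selects its queue with ActiveJob's queue_as. The sketch below is hypothetical (the class name and body are invented; the project's real job classes live in app/jobs):

    # Hypothetical sketch, not an actual job class from this project.
    class ExampleImportJob < ApplicationJob
      queue_as :import

      def perform(import_id)
        # ...process the import record here...
      end
    end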

Manual Postgres Setup

If you don’t want to populate your DB with seed data, want to manage creation of the database yourself, or otherwise manually do database setup, run any of the following commands as desired instead of rake db:setup:

$ bundle exec rake db:create       # Connects to Postgres and creates a new database
$ bundle exec rake db:schema:load  # Populates the database with the current schema
$ bundle exec rake db:seed         # Adds an admin user and sample data

If you skip rake db:seed, you’ll still need to create an Admin user. You should not do this through the database since the password will need to be properly encrypted. Instead, open the rails console with rails console and run the following:

User.create(
  email: '[your email address]',
  password: '[the password you want]',
  admin: true,
  confirmed_at: Time.now
)

Docker

The Dockerfile runs the rails server on port 3000 in the container. To build and run:

docker build --target rails-server -t envirodgi/db-rails-server .
docker build --target import-worker -t envirodgi/db-import-worker .
docker run -p 3000:3000 -e <ENVIRONMENT VARIABLES> envirodgi/db-rails-server
docker run -p 6379:6379 -e <ENVIRONMENT VARIABLES> envirodgi/db-import-worker

Point your browser or curl at http://localhost:3000.

Data Model

The database models three main types of data:

  • Pages, which represent a page on the internet. Pages are identified by a unique ID rather than their URL because pages can move or be available from multiple URLs. (Note: we don't actually model that yet, though! See #492 for more.)

  • Versions, which represent a particular page at a particular point in time. We use the term “version” instead of others more common in the archival space because we attempt to only represent different versions. That is, if a page changed on Wednesday and we captured copies of it on Monday, Tuesday, and Wednesday, we only make version records for Monday and Wednesday (because Tuesday was the same as Monday).

    (Note: because of technical issues around imported data, we often store more versions than we should according to the above definition [e.g. we might still have a record for Tuesday]. Versions have a different field that indicates whether a version is different from the previous one, and the API only returns versions that are different unless you explicitly request otherwise.)

  • Annotations, which represent an analysis of what’s changed between any two versions of a page. Annotations have specialized priority and significance fields (numbers between 0 and 1), an author indicating who made the analysis (which could be a bot account), and an annotation field: a JSON object with no specified structure, inside which an annotation can include any data desired.

There are several other kinds of objects, but they are subservient to the ones above:

  • Changes, which serve to connect any two versions of a page. Annotations are actually connected to changes, rather than directly to two versions. You can also generate diffs for a given change.

  • Tags, which can be applied to pages. They help sort and categorize things. Most tags are manually applied, but the application auto-generates a few:

    • domain:<domain name>, e.g. domain:www.epa.gov for a page at https://www.epa.gov/citizen-science
    • 2l-domain:<second-level domain name> e.g. 2l-domain:epa.gov for a page at https://www.epa.gov/citizen-science
  • Maintainers, which can be applied to pages. They represent organizations that maintain a given page. For example, the page at https://www.epa.gov/citizen-science is maintained by EPA.

  • Imports model requests to import new data and the results of the import operation.

  • Users model people (both human and bots) who can view, import, and annotate data. You currently have to have a user account to do anything in the application, though we hope accounts will not be needed to view public data in the future.

The actual database schema for each of these tables is listed in db/schema.rb.
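
If you want to poke at these models directly, a Rails console session is the quickest way. The sketch below is based on the descriptions above; the association and attribute names (versions, capture_time, annotations, and so on) are assumptions, so check app/models and db/schema.rb for the authoritative interfaces.

# Run inside `bundle exec rails console`. Association and attribute names here
# are assumptions based on the model descriptions above.
page = Page.first
puts page.url

# Versions captured for this page, newest first (ordering column assumed).
page.versions.order(capture_time: :desc).limit(3).each do |version|
  puts "#{version.capture_time}  #{version.uuid}"
end

# Annotations hang off a Change (a pair of versions), not off a single version.
change = Change.first
puts change.annotations.count if change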

How Data Gets Loaded

The web-monitoring-db project does not actually monitor or scrape pages on the web. Instead, we rely on importing data from other services, like the Internet Archive. Each day, a script queries other services for historical snapshots and sends the results to the /api/v0/imports endpoint.

Most of the data sent to /api/v0/imports matches up directly with the structure of the Version model. However, the body_url field in an import is treated specially.

When new page or version data is imported, the body_url field points to a location where the raw HTTP response body can be retrieved. If the body_url host matches one of the values in the ALLOWED_ARCHIVE_HOSTS environment variable, the version record that gets added to the database will simply point to that external location as a source of raw response data. Otherwise, the application downloads the data from body_url and stores it in its FileStorage.

The intent is to make sure data winds up at a reliably available location, ensuring that anyone who can access the API can also access the raw response body for any version. Hosts should be listed in ALLOWED_ARCHIVE_HOSTS if they meet this criterion better than the application’s own file storage. The application’s storage area can be the local disk or S3, depending on configuration, and the storage component is pluggable, so we can support other storage types or locations in the future.
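
As a rough illustration of that flow, a version record with a body_url might be posted to the imports endpoint like this. This is a sketch only: the payload structure and field names (page_url, capture_time, body_url, source_type) are assumptions for the sake of the example, and the generated API docs describe the real import format.

require "net/http"
require "json"
require "uri"

# Illustrative only: field names and payload structure are assumptions; see the
# generated API docs for the actual import format.
record = {
  page_url: "https://www.epa.gov/citizen-science",
  capture_time: "2017-05-09T10:48:44Z",
  # If this host is listed in ALLOWED_ARCHIVE_HOSTS, the version will point
  # here; otherwise the app downloads the body and stores it in FileStorage.
  body_url: "https://archive.example.org/snapshots/abc123",
  source_type: "example_source"
}

uri = URI("http://localhost:3000/api/v0/imports")
request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
request.basic_auth(ENV["WM_USER"], ENV["WM_PASSWORD"])
request.body = JSON.generate([record])

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.code  # the import itself runs asynchronously on the `import` queue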

You can see more about this process in:

File Storage

The application needs to store files for several different purposes (storing raw import data, archiving HTTP response bodies as described in the previous section, specialized logs, etc). To do this, it uses the FileStorage module, which has different implementations for different types of storage, such as the local disk or Amazon S3.

Currently, the application creates two FileStorage instances:

  1. “Archival storage” is used to store raw HTTP response bodies for each version of a page. See the “how data gets loaded” section for more details. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the AWS_ARCHIVE_BUCKET environment variable. Everything in this storage area is publicly available.

  2. “Working storage” is used to store internal data, such as raw import data and import logs. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the AWS_WORKING_BUCKET environment variable. Everything in this storage area should be considered private and you should not expose it to the public web.

  3. For historical reasons, EDGI’s deployment includes a third S3 bucket that is not directly accessed by the application. It’s where we store HTTP response bodies collected from Versionista, a service we previously used for scraping government web pages. You can see it listed in the example settings for ALLOWED_ARCHIVE_HOSTS.
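
To make the idea concrete, here is a hypothetical sketch of what a pluggable storage interface like this usually looks like. It is not the project's actual FileStorage API; the class and method names are invented for illustration.

require "fileutils"
require "aws-sdk-s3"

# Hypothetical sketch of a pluggable storage interface; the real FileStorage
# module in this project may use different names and methods.
module StorageSketch
  class LocalDisk
    def initialize(root:)
      @root = root
    end

    def save(path, data)
      full_path = File.join(@root, path)
      FileUtils.mkdir_p(File.dirname(full_path))
      File.write(full_path, data)
    end

    def read(path)
      File.read(File.join(@root, path))
    end
  end

  class S3
    def initialize(bucket:, client: Aws::S3::Client.new)
      @bucket = bucket
      @client = client
    end

    def save(path, data)
      @client.put_object(bucket: @bucket, key: path, body: data)
    end

    def read(path)
      @client.get_object(bucket: @bucket, key: path).body.read
    end
  end
end

# Development might use LocalDisk; production might use S3 with buckets like
# AWS_ARCHIVE_BUCKET or AWS_WORKING_BUCKET.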

Releases

New releases of the app are published automatically as Docker images by CircleCI when someone pushes to the release branch. They are available at https://hub.docker.com/r/envirodgi. See web-monitoring-ops for how we deploy releases to actual web servers.

Images are tagged with the SHA-1 of the git commit they were built from. For example, the image envirodgi/db-rails-server:ddc246819a039465e7711a1abd61f67c14b7a320 was built from commit ddc246819a039465e7711a1abd61f67c14b7a320.

We usually create merge commits on the release branch that note the PRs included in the release or any other relevant notes (e.g. Release #503, #504).

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for all their contributions! See our contributing guidelines to find out how you can help.

Contributions Name
📖 👀 Dan Allan
📋 🔍 Andrew Bergman
💻 🚇 📖 💬 👀 Rob Brackett
💻 Alessandro Caporrini
📖 Patrick Connolly
💻 Robert Dalin
💻 Kate Donaldson
📖 Michael Hardy
💻 Kasper Holbek Jensen
💻 Shishir Joshi
💻 📖 Krzysztof Madejski
📖 Ansar Memon (Amoury)
📖 📋 📢 Matt Price
📋 🔍 Toly Rinberg
💻 Ben Sheldon
💻 Ewelina Sobora
🚇 Frederik Spang
💻 Max Tedford
💻 Eddie Tejeda
📖 📋 Dawn Walker

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

License & Copyright

Copyright (C) 2017 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.


web-monitoring-db's Issues

Add License info

EDGI is standardizing on GPL3 for now; need to add that to the repo.

Download actual version content from Versionista

This will allow us to do custom diffs and filtering later as needed. For now, probably store them in S3, but maybe that will change to Google Cloud Storage if we are going to all Google services.

Add an authorization system

Right now, we have a special admin? flag on users that mainly controls whether they can get into the admin area to invite or delete users. We should make the permissions system a little more granular and separately grant users:

  • Admin capability
  • Permission to annotate versions
  • Permission to import/create versions/pages

For safety, we probably want services that import things to only be able to import and not create annotations, while we might want most users to only be able to create annotations but not import.

Add diff API to documentation

We recently added API documentation to the home page using Swagger: #58
We ALSO recently added an API for getting diffs between two versions of a page: #60

Sadly, these were poorly coordinated and done at the same time, so the docs don’t cover the diff API. We should add documentation for it by updating the swagger.yaml file in the root of the project.

(For more details on Swagger, see: http://swagger.io)

Make annotations editable

We originally wanted to produce a nice audit trail by making annotations append-only, but after discussing on a call yesterday, @danielballan and I think enabling a nicer analyst UI that doesn’t require explicit commits is probably more important than the benefits of being append-only.

The API probably shouldn’t change for this. Instead, POST /api/v0/pages/{id}/versions/{id}/annotations should edit an existing annotation by the current user if there is one.

Set up error tracking service

New Relic may be a good basic approach, but we should probably have some specific error monitoring. Some options:

  • Sentry (https://sentry.io) I am a fan, but lament the name confusion it will inevitably have with the archives project
  • Airbrake (https://airbrake.io) It does its job, but I have spent a great deal of time butting heads with its UI in the past; I prefer Sentry
  • Rollbar (https://rollbar.com) I've seen this advertised all over the place, but have never used it.
  • ?

Remove deprecated non-API pages

Once upon a time, in the very first implementation of all this, I had JSON + HTML routes at:

  • /pages
  • /pages/{page_id}
  • /pages/{page_id}/versions
  • /pages/{page_id}/versions/{version_id}
  • /pages/{page_id}/versions/{version_id}/annotations

Those have all long-since been superseded by the /api/v0/* routes and I’m not aware of any clients using them. We should take them out.

JSON API

Add a JSON API for getting data in/out of the DB. For writing, this depends on #1.

Creating annotations for the first version of a page should be a 400 error

  • I'm operating on seed data.
  • In the UI I've opened the first link
  • I've clicked Update Record
  • I got the following error:
Started POST "/api/v0/pages/9c1b56f8-b044-47f4-8d1e-1cea8be1788a/versions/50b2b776-3390-4f94-ae6c-da988a514087/annotations" for 127.0.0.1 at 2017-06-09 13:47:33 +0200
Processing by Api::V0::AnnotationsController#create as */*
  Parameters: {"page_id"=>"9c1b56f8-b044-47f4-8d1e-1cea8be1788a", "version_id"=>"50b2b776-3390-4f94-ae6c-da988a514087"}
  User Load (0.4ms)  SELECT  "users".* FROM "users" WHERE "users"."id" = $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 1], ["LIMIT", 1]]
  User Load (0.4ms)  SELECT  "users".* FROM "users" WHERE "users"."email" = $1 ORDER BY "users"."id" ASC LIMIT $2  [["email", "[email protected]"], ["LIMIT", 1]]
   (0.1ms)  BEGIN
  SQL (0.3ms)  UPDATE "users" SET "current_sign_in_at" = $1, "last_sign_in_at" = $2, "sign_in_count" = $3, "updated_at" = $4 WHERE "users"."id" = $5  [["current_sign_in_at", "2017-06-09 11:47:34.068000"], ["last_sign_in_at", "2017-06-09 11:47:15.051779"], ["sign_in_count", 3], ["updated_at", "2017-06-09 11:47:34.068546"], ["id", 1]]
   (1.2ms)  COMMIT
  Version Load (0.2ms)  SELECT  "versions".* FROM "versions" WHERE "versions"."uuid" = $1 LIMIT $2  [["uuid", "50b2b776-3390-4f94-ae6c-da988a514087"], ["LIMIT", 1]]
  Page Load (0.1ms)  SELECT  "pages".* FROM "pages" WHERE "pages"."uuid" = $1 LIMIT $2  [["uuid", "9c1b56f8-b044-47f4-8d1e-1cea8be1788a"], ["LIMIT", 1]]
  Version Load (0.2ms)  SELECT  "versions".* FROM "versions" WHERE "versions"."page_uuid" = $1 AND (capture_time < '2017-05-09 10:48:44') ORDER BY "versions"."capture_time" DESC LIMIT $2  [["page_uuid", "9c1b56f8-b044-47f4-8d1e-1cea8be1788a"], ["LIMIT", 1]]
  Change Load (0.2ms)  SELECT  "changes".* FROM "changes" WHERE "changes"."uuid_from" IS NULL AND "changes"."uuid_to" = '50b2b776-3390-4f94-ae6c-da988a514087' ORDER BY "changes"."uuid" ASC LIMIT $1  [["LIMIT", 1]]
   (0.1ms)  BEGIN
   (0.1ms)  ROLLBACK
Completed 500 Internal Server Error in 139ms (ActiveRecord: 3.2ms)


  
NoMethodError (undefined method `capture_time' for nil:NilClass):
  
app/models/change.rb:79:in `from_must_be_before_to_version'
app/models/change.rb:36:in `annotate'
app/controllers/api/v0/annotations_controller.rb:28:in `create'

Include version_hash (content hash) in diff queries.

The Rails server currently queries separate servers ("diffing services") with two URLs that the service should fetch and analyze. That query should include the hash of the URL's content, which is already stored in the db, so that the diffing service can verify that the content it fetches is what is expected.

Recently discussed on a quick call with @Mr0grog -- put here for tracking.
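
A minimal sketch of the verification the diffing service could then perform. The version_hash field name comes from this issue's title; the hash algorithm shown is an assumption.

require "digest"
require "net/http"
require "uri"

# Sketch only: assumes version_hash is a hex digest of the raw response body
# (SHA-256 is assumed here).
def verify_content!(body_url, expected_hash)
  body = Net::HTTP.get(URI(body_url))
  actual = Digest::SHA256.hexdigest(body)
  raise "content mismatch for #{body_url}" unless actual == expected_hash
  body
end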

Improve API documentation

We just recently began to document the REST API for the database using Swagger in the swagger.yaml file (see #58 for the original work on this). Some things that are missing or could be improved:

  • Clear written descriptions of each endpoint and of the various parameters each one can take
  • Many endpoints can take query parameters for filtering results or searching, but they aren’t well documented (see #41 for more on this)
  • Notes about which endpoints require authentication (and how to authenticate generally using basic auth)
  • The swagger document is invalid in a couple of spots (we use .. in some endpoints, which Swagger doesn’t seem to like, and we have optional path parameters). If you can figure out how to format these things so the doc is valid, that would be awesome!

If you think you can make the docs look better or be easier to navigate and read, that would also be wonderful. If you can write way more awesome docs without using Swagger at all, that’s also fine. We want to have docs that make the database API easy to learn and use for a variety of research, archival, and analysis purposes more than we want Swagger :)

Fix local install instructions

Starting with a fresh database and attempting to run the migrations, I get the error

uninitialized constant AddVersionistaAccountToPage::VersionistaPage

Do I have to run the Versionista ingest first?

The full output up to the error is here:

$ git status
On branch 19-post-new-versions
Your branch is up-to-date with 'upstream/19-post-new-versions'.
nothing to commit, working tree clean

$ ruby --version
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]

$ bundle exec rails db:migrate RAILS_ENV=development
== 20170221061825 CreateVersionistaPages: migrating ===========================
-- create_table(:versionista_pages)
   -> 0.0398s
== 20170221061825 CreateVersionistaPages: migrated (0.0399s) ==================

== 20170221071740 CreateVersionistaVersions: migrating ========================
-- create_table(:versionista_versions)
   -> 0.0433s
== 20170221071740 CreateVersionistaVersions: migrated (0.0433s) ===============

== 20170302233652 DeviseCreateUsers: migrating ================================
-- create_table(:users)
   -> 0.0463s
-- add_index(:users, :email, {:unique=>true})
   -> 0.0041s
-- add_index(:users, :reset_password_token, {:unique=>true})
   -> 0.0038s
== 20170302233652 DeviseCreateUsers: migrated (0.0545s) =======================

== 20170303213937 CreateInvitations: migrating ================================
-- create_table(:invitations)
   -> 0.0161s
== 20170303213937 CreateInvitations: migrated (0.0162s) =======================

== 20170307220127 SplitUpVersionMetadata: migrating ===========================
-- change_table(:versionista_versions)
   -> 0.0793s
== 20170307220127 SplitUpVersionMetadata: migrated (0.0795s) ==================

== 20170309010632 AddVersionistaAccountToPage: migrating ======================
-- change_table(:versionista_pages)
   -> 0.0006s
rails aborted!
StandardError: An error has occurred, this and all later migrations canceled:

uninitialized constant AddVersionistaAccountToPage::VersionistaPage

Look into Carrierwave for cloud file storage

I was recently alerted to Carrierwave, which may provide a lot of the work we currently do in the FileStorage module. If we can use it instead of our own code, that will probably be more reliable and make it easier for us to switch to other storage locations.

Add filtering by site to API

It should be possible to ask only for the pages in a particular site, for example. But would probably be good to have arbitrary filtering rules.

We need site-based filtering at the very least, though, since that is how analysis work is divided up.

Track titles on Version instead of Page

See #57 for the original discussion on this.

Since page titles can change over time (and are really just part of a page’s content), we should add a title field to the Version model and make Page’s title merely a reflection of the latest Version’s title.

This can probably be best handled in the after_save callback on Version or the after_add callback on Page’s association with versions. We should also keep in mind that the Page’s title should always reflect the version with the latest capture_time, not necessarily the last one added to the database (they can be added out of order).
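
A rough sketch of the proposed callback, not existing code (the page association, versions ordering, and title columns are assumed from the discussion above):

# Proposal sketch only: keep the page's title in sync with the title of its
# version that has the latest capture_time, regardless of insertion order.
class Version < ApplicationRecord
  belongs_to :page
  after_save :sync_page_title

  private

  def sync_page_title
    latest = page.versions.order(capture_time: :desc).first
    page.update(title: latest.title) if latest && page.title != latest.title
  end
end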

Should creating a new Page require a separate request?

If I understand correctly, the current import mechanism matches new Versions to their Pages using page_url. When it receives a page_url it doesn't recognize, it creates a new Page. Page attributes (like title, site, agency) are provided inline with the Version attributes. I can imagine some sources of ambiguity:

  • a known page_url but a new page_title
  • a new page_url that we actually want to keep as part of the same logical Page (a case that we don't quite support yet but want to support in the future)

I think I would prefer if the app required any new Page to be created explicitly in a separate request, returning a page_uuid that must then be included in requests to import Versions for that Page. This change would push the work of resolving ambiguities onto the caller requesting an import. Also, since a new Page needs special primer-related metadata that is not as automated as Version ingest, separating page creation from bulk Version importing seems logical.

This alternative did not occur to me when reviewing #32 and I don't think it has been discussed anywhere.

Make paging page size customizable

Right now you can ask for a given page of results (with the ?page={number} querystring arg), but you can’t alter how many items are on that page. We might also want to move away from “page” terminology, too, since that can be confusing with the actual web pages we are tracking.

Remove migrations related to `versionista_*`

…or just squash the migration history. There are a few migrations that use models for tables that were removed—their purpose was really to keep production data in the right shape. People keep trying to migrate an empty database from an initial state all the way to current, though, which does not work because of the removed models. (Rails style is to start a new database by loading the current schema rather than migrating from scratch.)

We should just squash/remove the no-longer-functional migrations so people can migrate across all of history.

uuid stability of pages/versions ingested from Versionista

Can I count on the uuids of the Pages and Versions ingested from Versionista being stable at this point? And are they the same between the staging and production deployments, or independently generated?

(Apologies for the cross-post, Rob -- I asked this in Slack and then realized the answer might be useful to others too.)

Look into e-mail service providers

E-mail is currently sent through a GMail account, which was simple and expedient at first, but really very error prone, as its security mechanisms are designed for human users interacting with a UI. We should really be interfacing (still via SMTP) with a service like:

Remove scraper components

Versionista scraping is now handled by an external service that uses the importing API. We should get rid of the binary dependency on Phantom and all the Ruby stuff for Versionista scraping here since it is no longer used.

Normalize page URLs to always be full, absolute URLs

@lightandluck noted that sometimes we get them imported with a scheme (e.g. http://) and domain, but sometimes not. This should definitely be normalized on input into the DB so that all URLs include a scheme and domain.

I think I fixed this in https://github.com/Mr0grog/versionista-edgi-node (it always outputs a full URL), but we still have legacy data that needs cleanup. We will also be receiving data from other sources in the future that might not have these guarantees.

  • Normalize URLs on model when setting/before saving
  • Migrate and normalize existing data
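
A rough sketch of the kind of normalization being proposed (the default scheme and where this code would live are assumptions):

require "uri"

# Sketch only: in practice this would likely run in a before_save callback or
# a setter on the Page model.
def normalize_page_url(url, default_scheme: "https")
  url = url.strip
  url = "#{default_scheme}://#{url}" unless url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  URI.parse(url).normalize.to_s
end

normalize_page_url("www.epa.gov/citizen-science")
# => "https://www.epa.gov/citizen-science"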

Add POST endpoint for new revisions

Services managed by web-monitoring-processing will need to be able to POST their data to the DB. Probably:

POST /api/v1/pages/{page_id}/versions

…though it may be useful to be able to post without knowing a page ID. Two ideas there:

  1. POST /api/v1/pages/{html_encoded_page_url_OR_page_id}/versions (and make that work for all routes)

  2. POST /api/v1/versions (this might mean making all versions available at this route—not just nested underneath /pages/{page_id})

^ Any thoughts on this, @danielballan?

This also means we should probably add a real permissions model for users; not just our current admin/not-admin model. Permissions so far:

  • Add/remove/edit users
  • Annotate versions
  • Add new versions

Look into CodeClimate

Look into setting up CodeClimate and turning Rubocop off in CI tests. It’s probably better if it logs comments/warnings in PRs rather than failing the build.

Adopt setup philosophy of 18f/identity-idp repo

Inspired by #79

I was wondering whether folks were interested in taking cues from the login.gov setup scripts. I came across it last year, and for such a complicated system, it was truly a thing of beauty.

Basically, after service setup, just:

make setup
make run

  • Do we even think it's a good idea to hide the setup steps? (I can imagine valid rationale for exposing new folks to basic commands)
  • Does make feel like a good choice of tool? (ie not requiring some lang-specific build tool)
  • Do we use advanced postgres features? If not, is it worth entertaining the idea of using sqlite for first-time bootstrap, to remove one more piece that might be a bump? We could always encourage people to use postgres after they've been around for a bit and going deeper, but might be cool to not require it for first contribution? (login.gov codebase relies on advanced postgres features, but maybe we don't use things like jsonb extension, etc.)

Store metadata annotations as discrete objects

Per an earlier conversation with @danielballan, it would be better (for auditing, requiring multiple analysts, etc.) to store a series of annotations rather than just a simple object of metadata.

That is, we currently store a metadata JSON object for each version that I intended to be a simple blob storing whatever the analysts are currently marking up, e.g. this from the spreadsheet:

{
  "change_type": {
    "1 - Date and time change only": true,
    "2 - Text or numeric content removal or change": true,
    "3 - Image content removal or change": true,
    "4 - Hyperlink removal or change": true,
    "5 - Text-box, entry field, or interactive component removal or change": true,
    "6 - Page removal (whether it has happened in the past or is currently removed)": true,
    "7 - Header menu removal or change": true,
    "8 - Template text, page format, or comment field removal or change": true,
    "9 - Footer or site map removal or change": true,
    "10 - Sidebar removal or change": true,
    "11 - Banner/advertisement removal or change": true,
    "12 - Scrolling news/reports": true
  },
  
  "significance": {
    "1 - Change related to energy, environment, or climate": true,
    "2 - Language is significantly altered": true,
    "3 - Content is removed": true,
    "4 - Page is removed": true,
    "5 - Insignificant": true,
    "6 - Repeated Insignificant": true
  }
}

…but @danielballan had a smarter view of storing a list of annotations here, allowing things like auditing, the ability for multiple analysts to look at a single bit of data, etc:

[
  {
    "uuid": "1234-1234-1234-1234",
    "created_at": "2017-02-22T13:53:26Z",
    "annotation": {
      "change_type": {
        "1 - Date and time change only": true,
        "2 - Text or numeric content removal or change": true,
        ...
      },
  
      "significance": {
        "1 - Change related to energy, environment, or climate": true,
        "2 - Language is significantly altered": true,
        ...
      }
    }
  },
  
  {
    "uuid": "6789-6789-6789-6789",
    "created_at": "2017-02-22T14:53:26Z",
    "annotation": {
      "change_type": {
        "1 - Date and time change only": true,
        "2 - Text or numeric content removal or change": false,
        ...
      },
  
      "significance": {
        "1 - Change related to energy, environment, or climate": false,
        "2 - Language is significantly altered": true,
        ...
      }
    }
  },
  
  ...
]

And, for ease of application use, we could do both and store all the annotations alongside a single, actualized view (which most applications would probably focus on for ease of use):

{
  "annotations": [
    {
    "uuid": "1234-1234-1234-1234",
    "created_at": "2017-02-22T13:53:26Z",
    "annotation": { ... }
    },
    ...
  ],

  "actualized": {
    "change_type": {
      "1 - Date and time change only": true,
      "2 - Text or numeric content removal or change": false,
      ...
    },
  
    "significance": {
      "1 - Change related to energy, environment, or climate": false,
      "2 - Language is significantly altered": true,
      ...
    }
  }
}

Add banner to top of home page based on deployment environment

It’d be helpful to display some sort of banner on the top of pages on the test/staging site (and maybe in dev, too?) to remind someone that they aren’t looking at the production instance of the app.

For the API, we could potentially include a special key in the JSON object or an x-environment HTTP header (less obvious, but probably a more wise approach).

Add job queue for managing scraping

We’ll want this for two things:

  1. Better control of how/when to scrape (currently using Heroku Scheduler to run a rake task; not great)
  2. Ability to divide scraping work up over several processes or machines working in parallel so it goes much faster.

Ingest the legacy uuids from the old versionista outputter

It would be nice to have a way to associate legacy Annotations, which I assume will be subjected to a lot of analysis, with Versions in our app. Somehow getting the uuids generated by the old outputter (and now stored only in Google Sheets, I think) sounds slightly painful but possible and useful.

Handle pages that are tracked by multiple “sites”

site is currently a property of the Page model, but we already know there are a couple pages that are tracked by multiple Versionista sites. So far, we have stored these as separate Page records here, too, but as we are moving towards integrating multiple sources (#15), we need to handle this better. It’s been mentioned before that we should think of sites as something more akin to tags.

A few ideas:

  1. Sites as another model with an N:M relationship to pages
  2. Sites as a JSON array belonging to a page
  3. Have a more generic tagging system and have tags like site:[site name] that are applied to pages

This also brings up the question: is agency similar? I’m thinking it’s fine as it is now—a direct property of a Page—but maybe worth thinking about in this context.

@danielballan Any thoughts on this? I’m thinking either 1 or 3 is best here.

Deploy to Google Compute Engine

Get this live and running on Google Cloud. Need to determine whether to use:

  • Compute Engine (pretty straightforward; just a VM)
  • App Engine (should be fancier and easier; the “flexible” environment w/ Ruby support is also managed via docker files, so maybe this ties in w/ #16 nicely)
  • Container Engine (managed container support on top of Compute Engine, definitely ties in w/ #16)

The database should probably be run through Google Cloud SQL (rather than self-managed). Need to determine whether it is reasonable to try using the new Postgres support (in beta, but would be nice) or switch over to MySQL.

API endpoints for diffs

Notes from a conversation:

web-monitoring-db will broker responses from "diffing services"

  • diff/start..end/ -> UNIX-style diff
  • diff/start..end/pf -> PageFreezer JSON response
  • maybe temporarily support diff/start..end/versionista but only versionista-source pages would support that

Response formatted as JSON. Maybe support richer types later.

{'page_id': ...,
 'version_id': ...,
 'diff_service': ...,
 'diff_service_version': ...,
 'content': VERBATIM_RESPONSE_FROM_DIFFING_SERVICE}

Add user management

Probably keep it dumb simple and have e-mail/password combos, but maybe could do auth with Github? Need this to restrict who can add/update metadata about page versions.
