Giter Club home page Giter Club logo

opendataphilly-ckan's Introduction

opendataphilly-ckan

Port of OpenDataPhilly to CKAN

Development Dependencies

Development Installation

  1. Make sure you have the development dependencies installed
  2. Clone the ckanext-odp_theme repository to this directory's parent directory
  3. Copy deployment/ansible/group_vars/vagrant.example to vagrant and edit with your e-mail credentials. If e-mail credentials are left unconfigured, e-mails will not be sent out.
  4. Run vagrant up; once the Ansible provisioner finishes, CKAN will be available at http://localhost:8025
  5. Creating a sysadmin user:
  $ vagrant ssh app
  vagrant@app:~$ . /usr/lib/ckan/default/bin/activate
  vagrant@app:~$ cd /usr/lib/ckan/default/src/ckan
  vagrant@app:~$ ckan/ paster sysadmin add <USERNAME> -c /etc/ckan/default/production.ini

Deployment

  1. Launch a server running Ubuntu 14.04. This server should be accessible from the deployment computer over SSH, and should have HTTP and HTTPS access to the internet.
  2. Copy deployment/ansible/hosts/hosts.staging.example to hosts.staging and enter the address of the server that was just launched.
  3. Copy deployment/ansible/group_vars/staging.example to staging and edit any settings you wish to change (see above). Make sure that ckan_site_url matches the address at which you will access the site.
  4. Run ANSIBLE_HOST_KEY_CHECKING=false ansible-playbook --private-key=/absolute/path/to/server/key/file.pem --user=ubuntu --inventory-file=deployment/ansible/hosts/hosts.staging deployment/ansible/staging.yml -v

Exporting and importing data

Making a new sanitized export from production

Log into the remote server:

ssh-add KEY_FILE
ssh [email protected]

Install Postgres 9.4 client:

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y postgresql-client-9.4

Export the database (replace PASSWORD with the database password, which you can find by looking for lines like these in /etc/ckan/default/production.ini):

mkdir database_dumps && cd database_dumps

/usr/lib/postgresql/9.4/bin/pg_dump \
   "postgresql://ckan_default:[email protected]/ckan_default" \
   > ckan_default-`date -I`.sql

/usr/lib/postgresql/9.4/bin/pg_dump \
   "postgresql://ckan_default:[email protected]/datastore_default" \
   > ckan_datastore-`date -I`.sql

Log out of the remote server.

Copy the files to your local machine:

scp [email protected]:database_dumps/* .

Delete the export files from the server:

ssh [email protected] "rm -r database_dumps"

Start a Postgres 9.4 docker container:

docker run --name odp_export -e POSTGRES_PASSWORD=odp -d postgres:9.4

Create and import the main database:

docker run -i --rm -e PGPASSWORD=odp --link odp_export:postgres postgres:9.4 \
   psql -h postgres -U postgres postgres -c "CREATE DATABASE ckan_default;"
docker run -i --rm -e PGPASSWORD=odp --link odp_export:postgres postgres:9.4 \
   psql -h postgres -U postgres postgres -c "CREATE USER ckan_default;"
docker run -i --rm -e PGPASSWORD=odp --link odp_export:postgres postgres:9.4 \
   psql -h postgres -U postgres ckan_default < ckan_default-`date -I`.sql

Make anonymizing queries. You'll need the faker python package, so either make a virtualenv, activate it, and run pip install faker or just install it globally. Run the command and write the resulting queries to a file:

./generate_anon_queries.sh > anon_queries.sql

Run the queries against the main database:

docker run -i --rm -e PGPASSWORD=odp --link odp_export:postgres postgres:9.4 \
   psql -h postgres -U postgres ckan_default < anon_queries.sql

Export the sanitized database

docker run -i --rm -e PGPASSWORD=odp --link odp_export:postgres postgres:9.4 \
   pg_dump -h postgres -U postgres -d ckan_default > ckan_default_sanitized-`date -I`.sql

Now that you have the sanitized version, you can delete the non-sanitized ckan_default file and stop and remove the database container:

rm ckan_default-`date -I`.sql
docker stop odp_export && docker rm -v odp_export

Since SQL files compress quite a bit, put the two files into a tarball:

tar czvf ODP_db_dumps-`date -I`.tar.gz \
    ckan_default_sanitized-`date -I`.sql ckan_datastore-`date -I`.sql

Now copy the .tar.gz file to a suitable storage location and, if you're done with them, delete the separate files.

Importing into development

Extract the export files into your project root (or somewhere else that's within a directory that's mapped into the VM).

Import them (fill in the filenames and, if you put them in a different directory, adjust the paths):

vagrant ssh database<<EOF
export PGPASSWORD=ckan_default
psql -U ckan_default -h 127.0.0.1 -d ckan_default -f /vagrant/ckan_default_sanitized-YYYY-MM-DD.sql
psql -U ckan_default -h 127.0.0.1 -d datastore_default -f /vagrant/ckan_datastore-YYYY-MM-DD.sql
EOF

That should print a bunch of SQL messages.

Restart the services:

vagrant ssh app <<EOF
sudo service apache2 restart
sudo service jetty restart
sudo ckan search-index rebuild -r
sudo ckan views create -y
EOF

Make an admin user, by running this then following the prompts:

vagrant ssh app -c 'sudo ckan sysadmin add admin'

Copying uploaded images

Many of the images used on the site (such as the Topic icons and the Organization logo images) are stored as uploads. The files are on the server's file system and the Topic and Organization entities in the database store just the filename.

To get them working in development, download the images from the production server and upload them to your instance:

scp -r [email protected]:/var/lib/ckan/default/storage/uploads/ uploads
vagrant ssh app -c "sudo cp -r /vagrant/uploads/* /var/lib/ckan/default/storage/uploads/"

opendataphilly-ckan's People

Contributors

colekettler avatar ddohler avatar flibbertigibbet avatar jeancochrane avatar klaash avatar maurizi avatar rbreslow avatar sharph avatar tnation14 avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

opendataphilly-ckan's Issues

Vagrant hostmanager plugin check always fails

Despite having the plugin installed, attempting to run vagrant up in the project directory fails with:

ERROR: vagrant-hostmanager not found; please run 'vagrant plugin install vagrant-hostmanager'

Run out of inodes due to never expiring session files

Site went down on Jan 22nd. The root cause was we were at mask inodes (df -i).

/tmp/default/sessions/ was filled with approximately 4M small session files. The actually disk usage was likely 5-10GB.

Overall resolution to getting the site back up:
stop webservers
rm the two directories in /tmp/default/sessions/ (this took quite a few hours)
start webservers

At this issue writing time, there are about 7762 files (overall inodes 164536).

Among reviewing the nginx config there are a number of possible items to try, including timeouts, reducing the number of protocols (dropped SSLv3) and more.

Assemble incident report for 12/3 downtime

Since 2018-12-03 01:51:08 EST, the OpenDataPhilly website was flapping between up and down until approximately 2018-12-03 07:28:18 EST.

  • Azavea personnel began work to remedy the issues at 6:57 ET
  • Nginx configuration was modified to suppress bot traffic at 7:12 ET
  • Apache2 was restarted because it appeared that some workers were still throwing errors at 7:27 ET

We should include time in this estimate to dig a bit more into the root cause by using the inbound requests that were failing and mapping them back to relevant exceptions (/var/log/apache2/ckan_default.error.log).

Upgrade PostgreSQL in development environment

Determine which of the following versions of PostgreSQL it makes the most sense to upgrade to:

  • 9.5
  • 9.6
  • 10
  • 11

We recommend that you upgrade your Amazon RDS for PostgreSQL 9.4 database instances to version 9.6, 10, or 11 at your earliest convenience. You may also choose to upgrade to 9.5, but that version is scheduled for end of support in February 2021.

After making a selection, ensure that the development environment provisions the updated version of PostgreSQL.

Add Let's Encrypt setup to Ansible

A Let's Encrypt certificate was created using the following commands:

$ wget https://dl.eff.org/certbot-auto
$ chmod a+x certbot-auto
$ sudo certbot-auto /usr/local/bin
$ certbot-auto certonly --webroot \
    -w /usr/lib/ckan/default/src/ckanext-odp-theme/ckanext/odp_theme/public \
    -d opendataphilly.org \
    -d www.opendataphilly.org

From there, auto-renewal was tested and added to the root crontab:

# certbot-auto renew --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log

-------------------------------------------------------------------------------
Processing /etc/letsencrypt/renewal/opendataphilly.org.conf
-------------------------------------------------------------------------------
Cert not due for renewal, but simulating renewal for dry run
Starting new HTTPS connection (1): acme-staging.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for opendataphilly.org
http-01 challenge for www.opendataphilly.org
Waiting for verification...
Cleaning up challenges
Generating key (2048 bits): /etc/letsencrypt/keys/0005_key-certbot.pem
Creating CSR: /etc/letsencrypt/csr/0005_csr-certbot.pem
** DRY RUN: simulating 'certbot renew' close to cert expiry
**          (The test certificates below have not been saved.)

Congratulations, all renewals succeeded. The following certs have been renewed:
  /etc/letsencrypt/live/opendataphilly.org/fullchain.pem (success)
** DRY RUN: simulating 'certbot renew' close to cert expiry
**          (The test certificates above have not been saved.)
# crontab -l | tail -n1
31 */12 * * * certbot-auto renew --quiet --no-self-upgrade

Activity queries causing periodic timeouts

It is unclear which pages are causing this, but periodically some very expensive SQL queries are being issued by the application:

SELECT DISTINCT ON (anon_1.timestamp) anon_1.id AS anon_1_id, anon_1.timestamp AS anon_1_timestamp, anon_1.user_id AS anon_1_user_id, anon_1.object_id AS anon_1_object_id, anon_1.revision_id AS anon_1_revision_id, anon_1.activity_type AS anon_1_activity_type, anon_1.data AS anon_1_data 
	FROM (SELECT anon_2.id AS id, anon_2.timestamp AS timestamp, anon_2.user_id AS user_id, anon_2.object_id AS object_id, anon_2.revision_id AS revision_id, anon_2.activity_type AS activity_type, anon_2.data AS data 
	FROM (SELECT DISTINCT ON (anon_3.timestamp) anon_3.id AS id, anon_3.timestamp AS timestamp, anon_3.user_id AS user_id, anon_3.object_id AS object_id, anon_3.revision_id AS revision_id, anon_3.activity_type AS activity_type, anon_3.data AS data 
	FROM (SELECT anon_4.id AS id, anon_4.timestamp AS timestamp, anon_4.user_id AS user_id, anon_4.object_id AS object_id, anon_4.revision_id AS revision_id, anon_4.activity_type AS activity_type, anon_4.data AS data 
	FROM (SELECT activity.id AS id, activity.timestamp AS timestamp, activity.user_id AS user_id, activity.object_id AS object_id, activity.revision_id AS revision_id, activity.activity_type AS activity_type, activity.data AS data 
	FROM activity 
	WHERE activity.user_id = '18fb335b-0d50-4428-8a25-79f7156c243d' ORDER BY activity.timestamp DESC 
...

This appears related to ckan/ckan#2034, which is in CKAN 2.3+. We are currently on CKAN 2.2.4 (see #49).

This issue is causing periodic HTTP timeouts, which result in brief downtime.

Run cron task to execute certificate renewal more frequently

Currently, the cron task to trigger certificate renewal is set to occur:

At 23:00 on day-of-month 25.

Today, the certificate expired prior to the cron task running. A more frequent attempt to renew the certificate may have caught it. It is also possible that last months should have caught it, but didn't for some reason.

  • Ensure cron jobs are working as expected
  • Modify them so that they run more frequently

SSL certificate failing to autorenew

We had previously resolved a failure in #97, but the certificate still didn't successfully renew on the 25th of this month. Running the crontab command manually was successful. We should figure out why the cron task is failing. Removing the --quiet flag from the crontab command may help with debugging.

We're still able to upgrade the SSL certificate and our Panopta alerts are catching when expiry is coming up, so our actual risk is low, but we should get autorenew working again.

Persist mechanism to block unsavory crawler bots

The following bits of Nginx configuration were applied ad-hoc to suppress an influx of undesirable crawler bot traffic:

map $http_user_agent $limit_bots {
    default 0;
    ~*(semrush|mj12bot|ahrefs|qwant|webmeup) 1;
}

...

location / {
    if ($limit_bots = 1) {
        return 403;
    }

    ...
}

This configuration should persist in the Ansible templates for Nginx. In addition, we should more closely evaluate which bots to block.

Users are unable to reset passwords

Two users have reported that they are unable to reset their passwords and they receive the following error:

Could not send reset link: SMTPAuthenticationError(535, '5.7.8 Username and Password not accepted. Learn more at\n5.7.8 https://support.google.com/mail/?p=BadCredentials w34sm2306200qth.81 - gsmtp')

I attempted to reset my own password using "rcheetham" and received the same error.

Determine whether ckanext-pages is thread-safe

It's possible that the ckanex-pages plugin is not thread-safe. We believe that this is may be the case based on an apparent race condition in two threads spawned by one worker, a race condition that seems to have caused a shared SQLAlchemy connection to enter a degraded state for that worker. For more background, see our investigation into #92 (comment).

Based on discussion in ckan/ckanext-pages#81, we should evaluate if this is true, and if there is any way for us to mitigate the issue in the ODP application.

Evaluate upgrade and plugins

  • Upgrade CKAN to 2.6
  • Add/remove images and change sequence of images on landing page
  • Perhaps other issues that may have accumulated but are not noted in the ODP repo

Also, estimate what it would take to do this:

In all of our press releases, we direct people to “opendataphilly.org" to see the open data that’s been released. The trouble is that for users who don’t know what the terms CSV, GeoJSON, API, and Metadata mean, a page like this (https://www.opendataphilly.org/dataset/opa-property-assessments) can be overwhelming and reinforce the belief that “open data is for techies.” We try to put the “Visualization” first, but it doesn’t always stand out much, and still requires you to oddly click a “download” button to access it.

On the other hand, the “Related” tab further up (below the breadcrumbs) is quite visually engaging and takes you straight to applications that let you explore the data without ever having to download it. So the idea was: what if those “related” items were one of the first things you see on the dataset page, rather than being a bit buried in that tab?

Well, it turns out it was incredibly easy to do with 8 lines of python code and a bit of boilerplate (folder structure). I’ve created a CKAN plugin out of it:
https://github.com/CityOfPhiladelphia/ckanext-highlight-related-items (demo: http://45.55.253.172/dataset/opa-property-assessments)

Joel Natividad was then kind enough to add installation instructions: CityOfPhiladelphia/ckanext-highlight-related-items#1

Release certbot cron job updates

Deploy Certbot cron job frequency increase introduced in #105. Certbot renewals are currently working and will continue to run without this change, but the increased frequency could give us earlier warning if the job fails.

Link scan

From Robert to Kistine: "It's possible that a scan by CKAN may have generated a bunch of broken links that are not actually broken. I clicked through a few of those links you just sent, and they appear to be marked as "broken" but are actually available. It's possible we can kick off another link scan to fix that."

Configure data.json Harvest to ingest DVRPC catalog data records

DVRPC is managing their open data in two different places (OpenDataPhilly and Esri Open Data). They only want to maintain one place, so they deleted all of the records from OpenDataPhilly and are asking us to harvest them from the Esri data.json feed at https://dvrpc-dvrpcgis.opendata.arcgis.com/data.json - if a user ID for the org is required, it is dvrpcgis.

We have the datajson extension, which generates data.json for the federal Data.gov site. However, harvesting requires a second extension, called Harvester. Some basic config info is at https://extensions.ckan.org/extension/datajson/ source is at https://github.com/ckan/ckanext-harvest Harvester, in turn, requires either RabbitMQ or Redis to handle the retrieval of data.json.

Create a health check for the Apache workers that restarts on failure

Following up on our December 3rd incident report (#93): The surest way to prevent another application failure due to a race condition would be to configure a health check for Apache that can attempt to restart a worker in the case of failure. If the worker had been set to automatically restart during the December 3rd incident, the degraded service that we experienced would not have persisted for more than a few seconds.

We could accomplish this with a tool like Monit or supervisord.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.