catalog-harvest-registry's Introduction

IOOS Catalog Harvest Registry

Web based User Interface for registering and managing metadata harvest endpoints.

License

The MIT License (MIT) Copyright (C) 2016 RPS ASA

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Release Instructions

  1. Identify the next version
  2. Update app/package.json, Dockerfile, and build/build.sh with the new version.
  3. Build the project with build/build.sh
  4. Push the changes so that Docker Hub picks them up and builds an image
  5. Publish a release

Installation

  1. Clone the project

  2. Install maka-cli

    npm install -g maka-cli
    
  3. Install the project dependencies

    maka npm install
    

Linux/Ubuntu

You may need to run:

curl https://install.meteor.com/ | sh

to install Meteor if you run into issues with the steps above.

Running

To run the project use:

maka run

in the project's root directory.

To use an external Mongo server, which is necessary for integrating with the other services, set MONGO_URL before running:

export MONGO_URL=mongodb://localhost:27017/registry
maka run

Configuring the project

The project connects to several external services for functionality. These parameters are set in config/development/settings.json.

{
    "email": {
        "mail_url": "smtp://<user>:<password>@<host>:<port>",
        "notification_list": [
            "<recipient email>"
        ],
        "support_email": "<support team email>"
    },
    "public": {
        "ckan_api_url": "http://<CKAN Host>/api/3/"
    },
    "services": {
        "ckan_api_key": "<CKAN Admin API Key>",
        "ckan_api_url": "http://<CKAN Host>/api/3/",
        "harvestAPI": "http://<catalog-harvesting host>/api/harvest"
    }
}
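A missing key in this file tends to surface only when the related feature is used, so it can help to check the file up front. The following is a minimal sketch, not part of the project: the key list is copied from the example above, and treating every key as required is an assumption, as are the helper names `missing_settings` and `check_settings_file`.

```python
import json

# Keys taken from the example settings above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "email": ["mail_url", "notification_list", "support_email"],
    "public": ["ckan_api_url"],
    "services": ["ckan_api_key", "ckan_api_url", "harvestAPI"],
}


def missing_settings(settings):
    """Return dotted names of expected keys absent from a settings dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = settings.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or not an object.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


def check_settings_file(path="config/development/settings.json"):
    """Load a settings file and report which expected keys are missing."""
    with open(path, encoding="utf-8") as handle:
        return missing_settings(json.load(handle))
```

Calling `check_settings_file()` from the project root returns an empty list when every key from the example is present.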

Docker

The Docker build process for this project sets up Node.js and then downloads a fully built project from Amazon S3, where built releases are uploaded. The release instructions are further up in this document.

The project can be built by:

docker build -t ioos/catalog-harvest-registry .

The project can be run with docker. We recommend using docker-compose.

Here's an example docker-compose.yml:

version: '2'

services:
  catalog-harvest-registry:
    image: ioos/catalog-harvest-registry
    env_file: envfile
    ports:
      - "3000:3000"

  mongo:
    image: mongo:latest
    container_name: mongo
    volumes:
      - "mongo_data:/data/db"

  catalog-harvesting:
    image: ioos/catalog-harvesting:latest
    container_name: catalog-harvesting
    environment:
      - "ENABLE_CRON=true"
    env_file: envfile
    volumes:
      - "waf_data:/data"

  harvester:
    image: ioos/catalog-harvesting:latest
    env_file: envfile
    command: "/sbin/my_init -- /sbin/setuser harvest /run_worker.py"
    volumes:
      - "waf_data:/data"

  waf:
    image: lukecampbell/docker-waf
    container_name: waf
    env_file: envfile
    volumes:
      - "waf_data:/usr/share/nginx/html/waf"
    ports:
      - "3001:80"

  redis:
    image: redis:3.0.7-alpine
    container_name: redis
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes


volumes:
  mongo_data:
  redis_data:
  waf_data:

Here's the environment file:

METEOR_SETTINGS={"email":{"mail_url":"","notification_list":[],"support_email":"[email protected]"},"public":{},"services":{"ckan_api_key":"","ckan_api_url":"","harvestAPI":"http://catalog-harvesting:3002/api/harvest"}}
MONGO_URL=mongodb://mongo:27017/registry
ROOT_URL=http://localhost:3000/
CRON_STRING=32 0 * * *
REDIS_URL=redis://redis:6379/0
CKAN_API=http://ckan:8080/
CKAN_API_KEY=
WAF_URL_ROOT=http://localhost:3001/

On Macs that use docker-machine, use:

METEOR_SETTINGS={"email":{"mail_url":"","notification_list":[],"support_email":"[email protected]"},"public":{},"services":{"ckan_api_key":"","ckan_api_url":"","harvestAPI":"http://catalog-harvesting:3002/api/harvest"}}
MONGO_URL=mongodb://mongo:27017/registry
ROOT_URL=http://192.168.99.100:3000/
CRON_STRING=32 0 * * *
REDIS_URL=redis://redis:6379/0
CKAN_API=http://ckan:8080/
CKAN_API_KEY=
WAF_URL_ROOT=http://192.168.99.100:3001/

Here 192.168.99.100 is the IP address of the docker-machine VM.
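The METEOR_SETTINGS entry in both environment files is the settings JSON collapsed onto a single line. As a rough sketch of how such a line could be generated rather than edited by hand (the function names `meteor_settings_line` and `env_file_text` are illustrative and not part of the project):

```python
import json


def meteor_settings_line(settings):
    """Serialize a settings dict as a single METEOR_SETTINGS=... line."""
    # Compact separators keep the JSON on one line with no spaces, matching
    # the shape of the env file entries shown above.
    return "METEOR_SETTINGS=" + json.dumps(settings, separators=(",", ":"))


def env_file_text(settings, root_url):
    """Build a minimal env file; only two of the variables are covered here."""
    return "\n".join([
        meteor_settings_line(settings),
        f"ROOT_URL={root_url}",
    ]) + "\n"
```

The remaining variables (MONGO_URL, REDIS_URL, and so on) are plain strings and can be appended the same way.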

Deployment and Building

Meteor Building

To create the app.tar.gz file, run:

maka build

You must build the tar file before you build the docker image to see any changes you've made to the code.

Building Docker image for dev vs. production

There are two scripts that can be modified when building for development vs. production: install-app-dev.sh (Dev) and install-app.sh (Production).

Modifying Dockerfile

You can modify this area of the Dockerfile for dev vs. production by commenting/uncommenting the relevant code:

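# Production: install the release downloaded from S3
RUN $SCRIPTS_DIR/install-app.sh
WORKDIR $APP_DIR/catalog-harvest-registry

# Development: install the locally built tar file
RUN $SCRIPTS_DIR/install-app-dev.sh
WORKDIR $APP_DIR/bundle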

This enables you to make changes and build and run the tar file locally instead of using AWS S3.

catalog-harvest-registry's People

Contributors

benjwadams, ericmbernier, gitchrisadams, lukecampbell

catalog-harvest-registry's Issues

Add link to centralized WAF xml file in /records view

Add a link in the table to the registry.ioos.us/waf location for all records that pass validation and will be harvested by CKAN.

Exclude these links in records that don't validate and aren't copied to the centralized WAF, and remove CKAN links as well.

Cleanup data.ioos.us source WAFs for Registry

We have some old sensorml2iso-generated or otherwise scripted WAFs on https://data.ioos.us/mwengren/. Need to go through and clean out sources that are no longer used for Catalog:

--

Add Harvest type to the Harvests list

On the Data Sources page, it would be helpful to list the harvest type (WAF, ERDDAP-WAF, CSW), similar to the About page.

Can we re-align the list on the left and the pie chart div on the right to give some more horizontal space to the list in order to add a harvest type field, maybe between 'URL' and 'Online'?

Automatic Admin Email List in Registry

Can we update the distribution list for Admin events (like when a new user tries to register) to automatically email all the users marked as 'Admin'?

Right now it has a static list of email addresses that is out of date (still includes Luke), and could always become more out of date in the future.

Count inconsistency in Records in harvests

This morning I found a problem with the records count in the Harvest Registry. This seems to apply to all harvests. The summary table count is correct, but if you click on the 'Records' button, there are ~ 3x to 4x the number of records there should be.

We need to figure out what the cause is. Could be a recent change to the Registry code that introduced the bug?

Add confirmation email on user account request approval

When a new account is created, the user immediately gets an email asking them to confirm their email address. When they follow the link, an error message appears: 'Account not approved'.

We should change the message instead to something like 'Account pending approval by an administrator'.

Once an Admin logs in and verifies the account, there needs to be a second email notifying the user they're able to access the Registry. (per Emilio email)

CS-W Harvester Broken

The CS-W harvester doesn't seem to work properly.

Test case was the USGS CS-W endpoint provided for OBIS-USA. Contains 133 records.

Current record count in the Registry is 515,078 (obviously not correct): https://registry.ioos.us/harvests/RgbfvgQzweWCtGouY. Harvests continue indefinitely.

It seems as though the harvester loop isn't terminating when it reaches the end of the result set, however the resulting USGS WAF fortunately only has ~100 records or so: https://registry.ioos.us/waf/USGS/.

I haven't deleted the harvest source in order to troubleshoot, but eventually this must be going to cause some issues for MongoDB.
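To illustrate the kind of termination check the loop would need, here is a minimal sketch. It is not the Registry's harvester code: `fetch_page` is a hypothetical callable standing in for a GetRecords request, and the dict keys are assumptions. It relies on the CS-W SearchResults attributes numberOfRecordsMatched and nextRecord, where a nextRecord of 0 signals the end of the result set, and adds defensive checks for servers that behave differently.

```python
def harvest_all(fetch_page, page_size=100, max_pages=10000):
    """Collect records from a paged CS-W style source until it is exhausted.

    fetch_page(start, page_size) is a hypothetical callable returning a dict
    with 'records', 'matched' (numberOfRecordsMatched) and 'next'
    (nextRecord), as read from the SearchResults element.
    """
    records = []
    start = 1  # CS-W startPosition is 1-based
    # max_pages is a hard cap so a misbehaving server cannot loop forever.
    for _ in range(max_pages):
        page = fetch_page(start, page_size)
        records.extend(page["records"])
        next_record = page["next"]
        # Stop on any of: empty page, nextRecord of 0, nextRecord past the
        # matched total, or a nextRecord that fails to advance.
        if (not page["records"] or next_record == 0
                or next_record > page["matched"] or next_record <= start):
            break
        start = next_record
    return records
```

The last condition covers the failure mode described here: if the server keeps handing back the same position, the loop exits instead of accumulating duplicates.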

Verify harvest functionality on dev instance

It looks like many/most of the harvest jobs on the development 2.8 server are stuck in the 'Running' state, and haven't been running daily as they're set to do.

Also, most of the metadata I've found has a date of August 13, so it seems harvesting is not functioning normally.

Add qualifier text to the new user registration page

We need to add some text to the User Registration page, in order to discourage new users from signing up unless they have an IOOS connection.

At the top of the page above the Email input, can we add the text:

'The IOOS Catalog Harvest Registry site is intended as the user interface for users affiliated with one of the IOOS data provider organizations (include link here to the Catalog Organizations page: https://data.ioos.us/organization) to publish web accessible folders or CS-W services with metadata records for the Catalog to harvest.

If you are affiliated with an IOOS organization and need access to this functionality for dataset publishing, please fill out the information below and we will verify your request. We will not approve requests for general users who do not require access to this tool.'

Display full URL in About page

Sorry, I should have raised this initially, but could we display the full WAF URLs in the table on the About page? I think it would be more useful for review at a glance, even though it will clutter the display a bit. Low priority...

Harvest ERDDAP WAFs

Can we investigate what would need to change in Registry to harvest ERDDAP WAFs directly?

Here's an example ERDDAP source: https://registry.ioos.us/harvests/RWC7hrevuuyP9L3bu.

Some ERDDAP example WAFs:
http://opendap.co-ops.nos.noaa.gov/erddap/metadata/iso19115/xml/
https://data.ioos.us/gliders/erddap/metadata/iso19115/xml/
http://www.neracoos.org/erddap/metadata/iso19115/xml/

ERDDAP has some bells & whistles (styling/HTML) on their WAF pages that are probably causing issues w/ the harvester.

If the answer is ERDDAP needs to publish a plain WAF, that's fine, we can pass that on to them for a future release.
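One way to cope with styled listings is to ignore the page layout and collect only anchors that point at .xml files. The sketch below shows that idea using the Python standard library; it is an assumption about how a tolerant WAF parser might work, not a description of the current harvester, and the names `XmlLinkParser` and `xml_links` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkParser(HTMLParser):
    """Collect href targets ending in .xml from an HTML directory listing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore sort links, parent-directory links, and anything non-XML.
        if href and href.split("?", 1)[0].lower().endswith(".xml"):
            self.links.append(urljoin(self.base_url, href))


def xml_links(html, base_url):
    """Return absolute URLs of the .xml files linked from a listing page."""
    parser = XmlLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Because only the anchor tags are inspected, extra tables, images, or column-sort links on the page do not affect the result.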

Add CKAN Org Dataset Count to Registry

Since we can get the total dataset count in CKAN for an Organization, and we can calculate the dataset count per Organization in the Registry (https://registry.ioos.us/about), we should show how the two match up somewhere in the Registry UI.

Not sure where this would make the most sense, but this would help highlight when there's an issue in CKAN that is preventing all of the datasets as counted in the Registry from being properly harvested in CKAN.

This would help show how SECOORA for example is missing most of its datasets in CKAN due to fileIdentifier duplication: https://registry.ioos.us/records/mHTECt6sp9N2rKo5x vs https://dev-catalog.ioos.us/harvest/secoora-waf.

For the next release.
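Since fileIdentifier duplication is the cause named above, a check for it could run on the Registry side before CKAN harvests. A minimal sketch, assuming the records are ISO 19115 XML strings already loaded in memory (the function names are illustrative):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard ISO 19139 namespace URIs.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def file_identifier(xml_text):
    """Return the record's fileIdentifier, or None if it has none."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    return node.text.strip() if node is not None and node.text else None


def duplicate_identifiers(xml_documents):
    """Return the sorted fileIdentifier values that occur more than once."""
    counts = Counter(file_identifier(doc) for doc in xml_documents)
    return sorted(i for i, n in counts.items() if i is not None and n > 1)
```

Each identifier reported here corresponds to a group of records that would likely collapse into one dataset downstream.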

Default to user email for WAF contact

From ioos/catalog#33, re contact info for each harvest source:

Can we set up so that if a logged on user creates a new harvest, it defaults the contact info to their own user account email address, but they have the option to overwrite with any email they want?

We also need to populate contact information for each existing harvest. I'm not sure what's available in the database for this. If we have a record of which user created which harvest, let's just go with that for now. If not, may require manual step.

Strategy for handling CKAN additional validation steps

Placeholder issue to come up with a strategy to handle cases where CKAN performs additional validation on ISO content that isn't currently detected in Harvest Registry.

Examples:

These cases need to be checked in the Registry if possible, and if not inventoried somewhere to inform RA data managers of possible reasons why not all their records are appearing in CKAN.

Add 'Metadata Date' column to /records view

Can we make it clearer what the 'metadata date' (in CKAN terms) is for each record in the Registry in the /records view table? I would suggest adding a 'Metadata Date' field between 'Description' and 'Services' fields in the table. This would make it clearer which records are not being updated via an automated process. Ideally, we could show something like 'oldest record metadata date' in the Harvests summary view as well, but I know this would be a bit more work potentially and space is already limited there.

XPath to get this date value is: //gmd:dateStamp/gco:Date
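For reference, that XPath could be evaluated roughly as follows with the Python standard library. This is only a sketch: the fallback to gco:DateTime is an assumption about records that carry a timestamp instead of a plain date, and `metadata_date` is a made-up name.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespace URIs.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def metadata_date(xml_text):
    """Return the text of gmd:dateStamp for an ISO record, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree needs a leading "." for a descendant search.
    node = root.find(".//gmd:dateStamp/gco:Date", NS)
    if node is None:
        # Assumption: some records use gco:DateTime instead of gco:Date.
        node = root.find(".//gmd:dateStamp/gco:DateTime", NS)
    return node.text.strip() if node is not None and node.text else None
```

The returned string is left unparsed so the table can display it exactly as the record provides it.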

Clarify whether to use ERDDAP-WAF or WAF harvest type for all ERDDAP harvest sources

@benjwadams

I noticed in the Registry that two of the Axiom ERDDAP WAFs (AOOS, CeNCOOS) were reporting 0 datasets, and the last Catalog metadata update date for datasets from those two sources was 12/7/20.

In the past, we've had issues with which harvest source type to use with ERDDAP in the Registry, however as of ERDDAP 1.82, the regular 'WAF' type was supposed to work for all ERDDAP WAFs (as well as the ERDDAP-WAF type).

Those two Axiom ERDDAPs were set to 'WAF' but were failing with harvest job error messages. I changed CeNCOOS to 'ERDDAP-WAF' and ran a reharvest, which resolved the issue and it shows 1285 datasets now.

The ERDDAP release reported on both of those is an Axiom-custom identifier: 'ERDDAP, Version 2.02_axiom-r1'. You can view at the bottom of each WAF page:

https://erddap.aoos.org/erddap/metadata/iso19115/xml/
https://erddap.cencoos.org/erddap/metadata/iso19115/xml/

So it seems 'WAF' harvest type does not work with all ERDDAP versions > 1.82. Is there something to do with the custom version string Axiom uses that's causing it to fail? Or something else?

Best practice seems to be to use 'ERDDAP-WAF' for all ERDDAP harvest sources. We should update our documentation to make that clear, ideally.

Users with multiple Organization affiliations

Just noting this because this is something we may run into. Does the site accommodate this already?

If it's difficult to do multiple orgs with a single email, can a user re-use the same email address and create multiple logins?

Add fields to the new user registration page

Can we add a couple additional fields to the new user sign up form?

We've gotten a few requests for users that are not obviously part of an IOOS organization. Let's add a field for IOOS Organization POC and IOOS Organization POC Email just below the Organizations dropdown, perhaps with some qualifier text like:

'Please include the name and email address of a manager or representative of the IOOS Organization you belong to to confirm your account'

These fields would be mandatory and would be added to the email notice sent to the site administrators distribution list.

This connects with existing issue #117 to make the distribution list for new user account requests 'automatic', or at least somehow updated in an automated fashion to include users with 'admin' level accounts in the Harvest Registry, rather than the current outdated distribution list, that still includes @lukecampbell.

Email not sent when user promoted to administrator

When a new user requests an account the following should happen:

  1. The user receives a confirmation email from the system.
  2. The site support staff (Currently me, Ben and Micah) receive a notification email that a user has requested an account.
  3. When an administrator approves an account OR promotes it to administrator, the user should receive an email notifying him/her that his/her account was approved.

Right now, if a user is promoted to administrator, the notification email doesn't get sent.

This meteor-method needs to be extended like the other meteor-method so that the email gets sent.

About Page Dataset Count Improvements (+ CKAN Record count)

As a follow-on to #101, can we update the About page to show something like this:

Total Registry Records

(-) Records in Error............................................... CKAN Records

Valid Registry Records

ignore crappy formatting, just to give an idea, the point is two columns to compare Registry total to CKAN total

The total CKAN record count is available via CKAN API here: https://data.ioos.us/api/3/action/package_search.

This will allow a rough way to compare the total valid record counts in Registry vs CKAN, and how many records get lost in the harvest(s) of the Registry by CKAN.
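A rough sketch of the comparison, assuming the standard CKAN action API response shape (a `success` flag and a `result.count` field) and using `rows=0` so only the count is returned. The function names and the returned dictionary keys are illustrative, not part of the Registry:

```python
import json
from urllib.request import urlopen

CKAN_SEARCH = "https://data.ioos.us/api/3/action/package_search"


def ckan_count_from_response(payload):
    """Extract the total dataset count from a package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return payload["result"]["count"]


def fetch_ckan_count(url=CKAN_SEARCH):
    """Query CKAN for the total dataset count (performs a network call)."""
    # rows=0 asks CKAN for the count without returning dataset bodies.
    with urlopen(f"{url}?rows=0", timeout=30) as response:
        return ckan_count_from_response(json.load(response))


def compare_counts(registry_total, registry_errors, ckan_total):
    """Line up the Registry and CKAN totals for the two-column display."""
    valid = registry_total - registry_errors
    return {
        "valid_registry_records": valid,
        "ckan_records": ckan_total,
        "lost_in_harvest": valid - ckan_total,
    }
```

The Registry totals would come from whatever the About page already computes; only the CKAN side needs the extra request.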

GLOS Harvesting Issues (WAF & CS-W)

@benjwadams we have a few different issues with harvesting metadata from GLOS that need to be resolved:

  1. The TDS WAF that's currently registered is returning 404: http://tds.glos.us/waf/. Can we determine if there's a new URL and update the Registry entry? If not, can we work with GLOS to restore that WAF?

  2. The CS-W harvester is returning an error when trying to harvest from the GLOS CS-W endpoint: http://data.glos.us/metadata/srv/eng/csw. Message is 'Harvest for GLOS CS-W failed'. Please have a look in the logs to see if it's on our end or theirs.

Harvest job notifications in Registry

Placeholder to investigate error notifications for Harvest Jobs in the Registry UI.

We should ensure an error dialog is shown when a user-triggered harvest job completes without harvesting any records successfully (with an informative error message if possible).

Harvest Jobs navigation improvements

The navigation isn't quite as smooth as the records view.

When you go to the records view for a harvest from one of the buttons on the /harvest view, you can navigate back via the back button and still be on the same harvest page (or use the breadcrumb).

With job status it goes back to the base /harvest view, and you have to select your harvest source again. Make the navigation similar to /records.

Harvest Registry to CKAN harvest job connection broken

Harvest Registry is no longer properly triggering CKAN harvests via the automated CKAN API call. This was at one time triggered whenever a change was made in the Registry for instantaneous responsiveness, but isn't a necessity at the moment.

CKAN is still harvesting from the Registry on a daily basis so metadata updates are still published properly.

Low priority fix.

Improving the default harvest schedule

As discussed in https://github.com/neracoos-open/DMAC/issues/2, since most of the IOOS models run at night, if we are limited to one per day, I think we could improve on the default daily 0000 UTC since that's 8pm East Coast, 2pm Hawaii.

For example, an improvement might be 1800 UTC, which is 8:00am in Hawaii, 2:00pm East Coast. That way people across the US could at least have up-to-date data during some part of the work day.

BUT... better yet, how about harvesting more frequently?

How long does the entire harvest process actually take?

I propose we should take that amount of time and multiply by 2 or 3 for our default harvest interval.

For example, if it takes only 20 minutes we should be harvesting every hour, or if it takes an hour, harvesting every 3 hours...
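The proposal can be sketched as a small calculation. The default multiplier of 3 and the rounding up to whole hours are assumptions chosen to match the two examples given, and the minute value of 32 simply mirrors the CRON_STRING in the environment file; none of this is existing Registry behavior.

```python
import math


def harvest_interval_hours(duration_minutes, factor=3):
    """Return a harvest interval in whole hours for a measured duration."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    # Multiply the measured duration by the safety factor, then round up
    # to the next whole hour, with a floor of one hour.
    return max(1, math.ceil(duration_minutes * factor / 60))


def cron_for_interval(hours, minute=32):
    """Build a cron expression that fires every `hours` hours."""
    if hours >= 24:
        # Fall back to a single daily run.
        return f"{minute} 0 * * *"
    # Note: step values that do not divide 24 give an uneven gap at midnight.
    return f"{minute} */{hours} * * *"
```

With these defaults, a 20-minute harvest maps to an hourly schedule and a one-hour harvest to every 3 hours.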

Add a 'x' button to cancel search filters

It would be nice to have a small red 'x' button next to the search box in both the harvests table and the individual records tables (for each harvest).

Currently, to cancel a filter, you have to select, delete, then hit 'enter' to re-execute an empty search. Not the most user-friendly as it is now.
