lter / lter-core-metabase Goto Github PK

View Code? Open in Web Editor NEW

7.0 13.0 4.0 747 KB

RDB model, based on the GCE LTER Metabase, with adaptations from other LTER sites

Home Page: https://lter.github.io/LTER-core-metabase/

License: GNU General Public License v3.0

PLpgSQL 100.00%

lter-imc-wg lter-imc

lter-core-metabase's People

Contributors

Stargazers

Watchers

Forkers

ble-lter iamchrisser kf8a mcmlter

lter-core-metabase's Issues

Metabase non default boilerplate idea

@gastil wants me to put this here:

Dataset_xref_boilerplate

View(s) would coalesce to use default where non default not specified.

site description in VIEW

column geographicdescription in vw_eml_geographic appears hardcoded in .sql to be format "SiteList.SiteName: SiteList.SiteDesc"

from 2_create_schemas_tables.sql

(((s."SiteName")::text || ': '::text) || (s."SiteDesc")::text) AS geographicdescription;

SiteDesc isn't required. If NULL, should geographicdescription be NULL or should it be just SiteName with no colon?

entity-level distribution information

On same topic as #26 item 11, what's the common usage pattern for entity-level distribution info?

Right now R code concatenates DataSetEntities.Urlhead and DataSetEntities.FileName to make a download URL. (I'm adding condition to return NULL if Urlhead is NULL.)

If unlike SBC, a site doesn't host data on own site, then perhaps there's no need for entity-level distribution in EML doc?

create a contributing.md

eg, see https://gist.github.com/PurpleBooth/b24679402957c63ec426

standardize column names: case, abbreviation, etc

eg: id in EMLUnitDictionary should be UnitID similar to NameID

eg: some inconsistencies in case, most PascalCase but some camelCase and some lowercase

Rename EMLUnitDictionary to UnitDictionary

You can add custom units to EMLUnitDictionary, so really it is a unit dictionary if you have added any custom units at all. And the custom attribute could be renamed to IsEmlUnit. See #30 .

Comment columns

Add comments to the non-obvious columns. Comments can include, e.g.:

Brief summary of field purpose
Example of what should go in the field
EML element upon which the field is based
Where to find valid values? Or maybe we should rely on constraints and lookups.

Example comment for DataSetAttributes.StorageType:

Storage type for data in a RDBMS or other system. E.g., integer, double, string.  Maps to storageType in EML. Use values from https://www.w3.org/2001/XMLSchema-datatypes.

review column comments

delete or edit vestigial column comments. will update as stuff is found
updated april 8th

pkg_mgmt.pkg_state.nickname
'ie fish_survey or flume or par. This is NOT the eml shortName except perhaps by coincidence. shortName is stored elsewhere. This is not the staging directory except by coincidence.'

shortName is in fact stored in pkg_mgmt.pkg_state.nickname

lter_metabase.SiteList.SiteType
COMMENT ON COLUMN lter_metabase."SiteList"."SiteType" IS 'Beach (beach site. may overlap with land); intertidal (intertidal. usually rocky, may overlap beach, land); land (land based sampling site); nearshore (nearshore ocean. overlaps with reef); offshore (offshore ocean. may overlap with nearshore); other (unspecified); pier (pier or wharf, in the ocean but attached to shore); reef (reef based sampling site)';

add set of views to export text tables like those in EML assembly line

The (nascent) EML profile will be a set of dataset concepts, and associated EML paths. The reasoning behind this is that software for creating EML is much easier to reuse if the inputs converge on a common model. A common model is what EDI is calling the "EML profile", and basically is an abstraction layer.

In the context of an rDB, the abstraction layer would be expressed as a group of database views.

Load keywords before dataset keywords in sample data

In 4_load_sample_sbc_datasets.sql, DataSetKeywords is loaded before Keywords is loaded. This results in a foreign key error.

I suggest moving the DataSetKeywords just after the Keywords part. This loads fine in my testing.

EML id collision avoidance / unique ids in EML docs

The database constraints do not prevent accidental re-use of an id string across tables. For example a geographic site may have the id foo and a species might have the id foo. This issue is opened just to remind us to put this disclaimer into the documentation and to be aware of this issue as db editors, users and when writing code from the db. The EML writing code will need to enforce unique ids. The EML parser will catch such errors.

migration script for missingValueCode, revision, and shortName using SELECT INTO

Write the SELECT INTO statements for live migration of content from the denormalized 1-1 missingValueCode columns to the new normalized tables for those (which allow 1-n).

Likewise to migrate revision from pkg_mgmt.pkg_state columns revision and nickname to lter_metabase.DataSet columns Revision and ShortName.

Create a best practices

Here is a place to start
https://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/

include more roles in vw_eml_creator

As of now only "creator" and "organization" in column AuthorshipRole from table DataSetPersonnel get selected into vw_eml_creator.

#14 has discussion on BP on roles. However sidestepping all that, it'd be nice if the view gets funneled all related personnel, which then can be filtered in processing (in R)

port to mysql, after phase 1

It might be easier for some people to find a hosted cluster for mysql

add schema "research project"

The schema "research_project" will add funding code and the content required for "project" tree. The funding code is essential for EML.

use supertypes and subtypes for People?

People currently can store people, and also organizations if you leave the Name fields blank (as interpreted by Li Kui's R scripts). What do you think about have a table with basic information like organization, address, and other contact information, and then have tables that inherit from the basic information table? Perhaps the basic information table could be called ResponsibleParty. For example, a People table would inherit from the basic table and add name fields, whereas an Organization table would inherit from the basic information and add....well maybe nothing.

I don't know what that means regarding the current People.NameID field which is a key field for relationships with other tables. I'm looking for expert Postgres advice I guess. My motivation for bringing this up is that it seems clunky to me to have Organizations in the People table with a special rule like, "Leave the names blank to describe an organization." But inheritance may be clunkier and may violate our keep-it-simple goal.

multiple missing values per attribute

I noticed (@atn38 confirms) that multiple missing value codes for a single attribute are not supported by the current schema. I find this comes up all the time (i.e. u= under detection limit, na= lost the sample), and it's super important to document both to distinguish between MCAR vs biased missingness for analysis

enforce order of dataset creators?

I think @lkuiucsb's workflow is to enter the organization (e.g., SBC LTER) as the first creator of a given dataset. I couldn't find evidence that this is required in EML documentation, but maybe I missed it. If it's required, should we somehow require that in the schema (I don't know if that's possible) or let the R scripts work it out?

Vestigial Columns

updated April 9th 2019, add new entries and restructure to group by table
from my POV (R code person), these columns could be vestigial or could benefit from clarified convention:

1. DataSet

Investigator
currently not passed into views
PROPOSE to either (1) enforce constraint that one of "creator" roles in DataSetPersonnel should be identical to Investigator? OR (2) drop column completely
Accession
Is this meant to contain distribution information at the dataset level? All empty in SBC sample.
Also see #29 for entity-level distribution info. OR is this meant to contain the EDI/other repo package ID? In that case, replicate this from pkg_mgmt.pkg_state.dataset_archive_id
ProjectRelease/PublicRelease
nothing in SBC sample data
geographicDescription
VIEWs don't do anything with this. nothing in SBC sample data.

2. DataSetEntities

DataAnomalies
currently not passed into views
is there an EML tag for this kind of info?
Edit: SBC sample data has some info here, eg
<para>The text table is 1.6 gb, and has been compressed to 365 mb for download.</para>
EntityRecords
passed into views, but R code counts rows in actual files instead. no enforcement, so right now could be total bogus numbers

3. DataSetAttributes

AttributeID
in SBC sample data AttributeID is identical to ColumnName, but only the latter gets passed into views. AttributeID doesn't appear in EMLAttributeCodeDefinition. Primary keys in both tables use ColumnName

----- DataSetKeywords and DataSetPersonnel look good ------

4. DataSetMethods

effectiveRange
empty in SBC sample. not clear what this does.

5. DataSetSites

EntitySortOrder
already raised in #20
GeoCoverageSortOrder
Hunch that this is related to entity-level coverage aka #20 somehow, but right now not clear what this does.

6. DataSetTemporal

EntitySortOrder
already raised in #20

----- EMLAttributeCodeDefinition, EMLKeywordTypeList, EMLMeasurementScaleList, EMLNumberTypeList, EMLStorageTypeList look good ------

7. EMLUnitDictionary

id
in SBC sample data id is identical to name but only latter gets passed into VIEWs. Primary key uses id however.
custom
EDI recommends parsing all units as custom unit according to EML BP, pg. 33 so this is redundant

8. EMLUnitTypes

dimension_power

----- FileTypeList looks good to me ------

9. KeywordThesaurus

ThesaurusSortOrder
passed into view, but doesn't appear in R code. not clear what this does. Primary key and FK in DataSetKeywords use ThesaurusID.

---- Keywords looks good ----

10. MeasurementScaleList

NonNumericDomain
not clear what this does. doesn't appear in DataSetAttributes or EML

11. People

Address3
Phone2
Fax
dbupdatetime
not so much odd as only People has it

13. Peopleidentification
see #35

14. ProtocolList

15. SiteList

SiteType
PROPOSE to remove. see #36
ownership
not clear what this does
ShapeType
clarify how to use: in SBC sample data, there are "point", "polygon", "rectangle", and "polyline". unclear how "polygon" could be represented differently from "rectangle", and unclear how "polyline" could be represented at all
CenterLon/CenterLat
not passed into VIEWs. don't appear on EML
unit
PROPOSE to rename to altitude_unit

add TaxonomicCoverage

as discussed in call

mapping between workflow stages

As raised in zoom call:

One pain point as I've been working with metabase and R to generate EML: some of the time it is not clear or intuitive what maps to what. E.g. where columns in VIEWs are pulled from in metabase, and which columns in VIEWs are used by R code to generate which EML elements. So far, whenever this happens I look into SQL code or ask on Slack.

need to brainstorm how best to document this mapping long-term so users spend less time digging:

There are three main stages of the workflow (@gastil, your profound flowchart!):
Metabase ---(SQL)---> Views ---(R)---> EML
We can think about at which stage we'd like to add documentation so that it's most effective. Users might approach from the end result, i.e. looking at produced EML document and wondering why this element's missing and what to do (what field to populate in metabase etc) so that information gets in there. Or they might also look at metabase and wonder where this field is going to end up in EML docs. OR we might want to split the difference and start documenting the middle stage, AKA VIEWs. I.e, on column pubdate in vw_eml_dataset, comment "this is populated from pkg_mgmt.pkg_state.update_data_catalog and populates the pubdate EML element"

write down conventions

we should write down the rules we use. this will make it easier for people who want to contribute, and to keep them consistent.

Address entity-level converage inconsistencies

By virtue of its name, I presume DataSetSites refers to site coverage for the dataset rather than per entity within the dataset. However, it has an EntitySortOrder column, which implies that sites are specified per entity. Similarly, DataSetTemporal has EntitySortOrder. The views, vw_eml_geographiccoverage and vw_eml_temporalcoverage both include entity_position.

So, we should decide whether we will support entity-level coverage in metabase, and if necessary adapt the schema and views accordingly. Gastil seldom uses entity level coverage elements at her site.

Historical note: In the original Metabase, the table DataSetSites does not have an EntitySortOrder column. (Whereas DataSetEntitySites does.)

Make instructions about locale in 0_create_db.sql more generic

Issue and comments pasted from Slack:

To create the database in Windows, I have to comment out the LC_COLLATE and LC_CTYPE parameters. I get invalid locale name: "en_US.UTF-8". By default, my Postgres instance uses UTF8 and English_United States.1252 for the locale, so commenting out those parameters is fine if you are happy with that locale.

Thus, I suggest that the instructions be modified to tell Windows users to comment those lines and make sure they're happy with the default encoding. Alternatively, we could remove those lines from the SQL and have a note about defining a character set in the instructions.

create another view vw_eml_associatedParty

see the original MCR metabase view.

add description of install-scenarios to documentation

the 5 scripts fit a particular set of install scenarios. list them here

institutional/central server
1. dbadmin acct that is not the db_owner acct (they may/may not be the same person)
2. different accts (people) run different scripts
3. at least some accounts probably already exist (because the db_owner is likely to have access to other databases)
4. there may be a sandbox db and a production db (eg, sandbox has example content loaded, production does not)
desktop install
1. dbadmin and owner are probably the the same person
2. multiple ways to install
  - use a gui with an sql pane and run scripts as for a central server
  - concat the scripts together and run them as you would a restore (keep in mind that the typical order of a DB dump-restore is not the order of operations in these scripts)
3. keep in mind that it is almost impossible to share work this way, which might obviate the use of postgres altogether
optional: notes for install on different OS's (probably orthogonal to the above)

see if installation includes PK for DataSetAttributes

The testing install here at ucsb named mob_test_lter_core_mb_empty was lacking its PK on DataSetAttributes.

This issue is a To-Do note for Gastil to check to be sure the install script does include that PK.

If not, put this snippet in:

ALTER TABLE lter_metabase."DataSetAttributes"
ADD CONSTRAINT "DataSetAttributes_PK_DataSetID_EntitySortOrder_ColumnName"
PRIMARY KEY ("DataSetID", "EntitySortOrder", "ColumnName");

A side issue is whether to use di ei ai or to use di ei aName. The di, ei and ai are integers but AttributeName is a string. Pros/cons either way. Currently both models are used because it came from GCE that way.

Methods: design schema to support EML methods in ways we use it without undue complexity

EML methods contain complex structure such as

multiple methodSection trees
title, para, ulink, subscript, superscript, emphasis, itemizedList which does not fit well with a plain text data type.
protocol (with title, author, url)

We want to keep the actual practice of entering method content from what we receive (often a Word doc or such) into something store-able that can be converted into EML as complex text type or as markdown. Ideally, this could be stored in Metabase rather than static files elsewhere. (Blob data type has been suggested.) It is not realistic to model the structure of methods text into table columns! The Metabase at UCSB prior to the lter_core_mb project used an interim working solution of storing the methodSteps as xml data type in metabase.

It seems useful to be able to re-use methods or methodSteps. To what extent do we actually do that? Protocols seem more likely to be re-used.

In addition to storing methods as docs (or xml), there are several columns in the DataSetMethods table currently (as of 2019-06-27) which are in use for SBC for a small proportion of datasets (looks like physO). So we need to support those also.

This comment is just to open the issue.

enforce referential integrity

We should make sure we enforce referential integrity for tables related to lookup tables. I think this would cause the database engine to raise an error if you entered a NameID in the DataSetPersonnel table that didn't already exist in the People table, for example, which could help prevent loading invalid data.

adapting code versioning to SQL

git (or any code versioning system) is not designed for SQL. When a new file is seen by git, it says "no problem, no conflict, this will not change existing code in master". But if that new file starts out with

DROP VIEW a;
DROP VIEW b;
CREATE VIEW c AS ...

git has no way of seeing that no-no-no-no that new file DOES conflict with what is currently in master. So, we as humans will need to supervise git. Don't merge just cause git says there is no conflict.

This has been a public service announcement from your local git newbie sql geek.

include more person ID types in VIEWs or possibly delete table Peopleidentification

vw_eml_creator only pulls column Identificationlink from table lter_metabase.Peopleidentification and assumes it's ORCID.

CREATE VIEW mb2eml_r.vw_eml_creator AS
SELECT i."Identificationlink" AS orcid 
FROM ((lter_metabase."DataSetPersonnel" d LEFT JOIN lter_metabase."People" p 
ON (((d."NameID")::text = (p."NameID")::text))) LEFT JOIN lter_metabase."Peopleidentification" i 
ON (((d."NameID")::text = (i."NameID")::text)));

propose to either:

ALTER VIEW to allow possible multiple ID types per person

adopt ORCIDs as ID system of choice, get rid of table Peopleidentification, whose purpose was to allow multiple ID types. plus ADD column "orcid" to table People instead.

write a create_user.sql script

for these sample users:
owner
row_edit
read_only

to be run by a cluster-admin

V1: consider refactoring script 02 (creates tables and views) into 2 scripts

pros:
the views are the most flexible. so it might make sense to pull their creation out into a separate script, so they are easier to recreate. during install views could be loaded last (after controlled/sample content loading).

cons:
there is no db-mgt scenario that is not interactive at some level. The db_owner will always need to know how to handle a particular script. so how we gather them up may be immaterial.

extend metabase model for semantic annotation

ask Sven @gastil

mentiond PostGRES in readme?

Is there anything PostGRES specific in our SQL scripts? I suspect there is, or at least you couldn't use the same scripts directly for SQLite for example. If that 's the case, then we should declare that this is for PostGRES in the readme.

create scripts to populate controlled vocab tables

lter_metabase.EMLUnitDictionary
lter_metabase."KeywordThesaurus"
lter_metabase."Keywords" (some)
lter_metabase."EMLKeywordTypeList"
lter_metabase."EMLMeasurementScaleList"
lter_metabase."EMLNumberTypeList"
lter_metabase."EMLStorageTypeList"
lter_metabase."EMLUnitTypes"
lter_metabase."MeasurementScaleDomains"

remove vestigial constraints (e.g. SiteType)

possibly another multi-entry issue. will update if I stumble across more

SiteList.SiteType
CONSTRAINT "CK_SiteRegister_SiteType" CHECK ((("SiteType")::text = ANY (ARRAY[('beach'::character varying)::text, ('intertidal'::character varying)::text, ('land'::character varying)::text, ('nearshore'::character varying)::text, ('offshore'::character varying)::text, ('other'::character varying)::text, ('pier'::character varying)::text, ('reef'::character varying)::text])))

PROPOSE to create a lter_metabase.SiteType table, so each site can populate with their own CV of site types? Could be overkill for something internal (doesn't appear in EML)

figure out repo organization to facilitate migration between versions

How should files be organized in a SQL dominated repo? Any good examples out there?

EML 2.2 units?

According to EML Best Practices:

EDI has adopted the LTER Unit Registry and recommends that customUnit element be used for all units with content pulled from the Unit Registry, even when the unit is already listed in the standard unit dictionary.

So should all rows in EMLUnitDictionary have a custom value of false? Or maybe it's better if the R code uses for all units regardless of what the view says.

Is tokenizing %db_owner% necessary?

In the SQL, instead of
OWNER = %db_owner%
you could have
OWNER = 'db_owner'

As far as I can tell, a global search and replace to swap out 'db_owner' with your actual DB owner doesn't break anything, so that's just as easy as replacing %db_owner%. And there's the added benefit that if users just want to test metabase quickly, they can make a 'db_owner' user and thus not have to change anything in the SQL.

prune sql currently used by SBC, add for v1

Add documentation

Add a docs folder and tell GitHub to serve it as pages.

add VersionInfo table

Fields:

DatabaseVersion
EMLVersion

special characters in SBC sample data

in 5_load_sample_sbc_datasets.sql, adding rows to DataSetAttributes, there's this attribute description:

Total number of fronds >1 m in length on the plant.

which comes out in EML as Total number of fronds &gt;1 m in length on the plant.

--
If the row is edited manually in DBeaver to

Total number of fronds > 1 m in length on the plant.

then it comes out in EML as Total number of fronds > 1 m in length on the plant. which is correct.

--
I don't know what's the fix for this would be, or even is a fix is needed. But it does seem XML entity references (eg. >) if present in database doesn't get displayed correctly in R-output EML.

Restrict field values to allowable EML values when feasible

entity-level taxonomic cov

to match the convention of EntitySortOrder==0 => Dataset level
not 0 => specifies the entity the coverage should go under

This is the convention used in DataSetSites

We will hold off under after version 1 and only if somebody asks for it

whether/where to put project level/boilerplate info

@mobb , could be what you meant in #16, I'm not sure.

In the R workflow there's a point where one inputs a boilerplate.xml file with:

project info (title, abstract etc)
access
contact info
publisher
distribution

to then be populated into EML.

Two considerations:
Whether to change this approach: this could be clunky if the .xml needs to be edited by hand, but if for a given site there's only one static reusable version then developing new tables in metabase then writing R to export it is overkill

If yes to change, then could do (per discussion @twhiteaker)

add project-level table, access/distribution could go here
add "contact" and "publisher" role in DataSetPersonnel

add package maintenance tracker

Per request from @scelmendorf and @gastil

Need to add way to track what was done in maintenance updates to data packages.

move reference tables to separate schema?

As we add more tables and cross-reference tables to the core schema (lter_metabase), might we consider moving fairly immutable lookup tables (eg. EMLNumberTypeList and MeasurementScaleDomains) to a separate schema? This may (1) help with clutter and (2) clarify the roles of the different schemas.

pkg_mgmt (which I haven't looked into as much) also has a number of lookup tables starting with cv_. Some of these are specific to SBC/MCR, but some, eg. cv_cra for LTER core research areas, might be useful to add into a schema of lookup tables.