Giter Club home page Giter Club logo

lter-core-metabase's People

Contributors

atn38 avatar gastil avatar kf8a avatar mobb avatar twhiteaker avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

lter-core-metabase's Issues

site description in VIEW

column geographicdescription in vw_eml_geographic appears hardcoded in .sql to be format "SiteList.SiteName: SiteList.SiteDesc"

from 2_create_schemas_tables.sql

(((s."SiteName")::text || ': '::text) || (s."SiteDesc")::text) AS geographicdescription;

SiteDesc isn't required. If NULL, should geographicdescription be NULL or should it be just SiteName with no colon?

entity-level distribution information

On same topic as #26 item 11, what's the common usage pattern for entity-level distribution info?

Right now R code concatenates DataSetEntities.Urlhead and DataSetEntities.FileName to make a download URL. (I'm adding condition to return NULL if Urlhead is NULL.)

If unlike SBC, a site doesn't host data on own site, then perhaps there's no need for entity-level distribution in EML doc?

Rename EMLUnitDictionary to UnitDictionary

You can add custom units to EMLUnitDictionary, so really it is a unit dictionary if you have added any custom units at all. And the custom attribute could be renamed to IsEmlUnit. See #30 .

Comment columns

Add comments to the non-obvious columns. Comments can include, e.g.:

  • Brief summary of field purpose
  • Example of what should go in the field
  • EML element upon which the field is based
  • Where to find valid values? Or maybe we should rely on constraints and lookups.

Example comment for DataSetAttributes.StorageType:

Storage type for data in a RDBMS or other system. E.g., integer, double, string.  Maps to storageType in EML. Use values from https://www.w3.org/2001/XMLSchema-datatypes.

review column comments

delete or edit vestigial column comments. will update as stuff is found
updated april 8th

  1. pkg_mgmt.pkg_state.nickname
    'ie fish_survey or flume or par. This is NOT the eml shortName except perhaps by coincidence. shortName is stored elsewhere. This is not the staging directory except by coincidence.'

shortName is in fact stored in pkg_mgmt.pkg_state.nickname

  1. lter_metabase.SiteList.SiteType
    COMMENT ON COLUMN lter_metabase."SiteList"."SiteType" IS 'Beach (beach site. may overlap with land); intertidal (intertidal. usually rocky, may overlap beach, land); land (land based sampling site); nearshore (nearshore ocean. overlaps with reef); offshore (offshore ocean. may overlap with nearshore); other (unspecified); pier (pier or wharf, in the ocean but attached to shore); reef (reef based sampling site)';

add set of views to export text tables like those in EML assembly line

The (nascent) EML profile will be a set of dataset concepts, and associated EML paths. The reasoning behind this is that software for creating EML is much easier to reuse if the inputs converge on a common model. A common model is what EDI is calling the "EML profile", and basically is an abstraction layer.

In the context of an rDB, the abstraction layer would be expressed as a group of database views.

EML id collision avoidance / unique ids in EML docs

The database constraints do not prevent accidental re-use of an id string across tables. For example a geographic site may have the id foo and a species might have the id foo. This issue is opened just to remind us to put this disclaimer into the documentation and to be aware of this issue as db editors, users and when writing code from the db. The EML writing code will need to enforce unique ids. The EML parser will catch such errors.

include more roles in vw_eml_creator

As of now only "creator" and "organization" in column AuthorshipRole from table DataSetPersonnel get selected into vw_eml_creator.

#14 has discussion on BP on roles. However sidestepping all that, it'd be nice if the view gets funneled all related personnel, which then can be filtered in processing (in R)

add schema "research project"

The schema "research_project" will add funding code and the content required for "project" tree. The funding code is essential for EML.

use supertypes and subtypes for People?

People currently can store people, and also organizations if you leave the Name fields blank (as interpreted by Li Kui's R scripts). What do you think about have a table with basic information like organization, address, and other contact information, and then have tables that inherit from the basic information table? Perhaps the basic information table could be called ResponsibleParty. For example, a People table would inherit from the basic table and add name fields, whereas an Organization table would inherit from the basic information and add....well maybe nothing.

I don't know what that means regarding the current People.NameID field which is a key field for relationships with other tables. I'm looking for expert Postgres advice I guess. My motivation for bringing this up is that it seems clunky to me to have Organizations in the People table with a special rule like, "Leave the names blank to describe an organization." But inheritance may be clunkier and may violate our keep-it-simple goal.

multiple missing values per attribute

I noticed (@atn38 confirms) that multiple missing value codes for a single attribute are not supported by the current schema. I find this comes up all the time (i.e. u= under detection limit, na= lost the sample), and it's super important to document both to distinguish between MCAR vs biased missingness for analysis

enforce order of dataset creators?

I think @lkuiucsb's workflow is to enter the organization (e.g., SBC LTER) as the first creator of a given dataset. I couldn't find evidence that this is required in EML documentation, but maybe I missed it. If it's required, should we somehow require that in the schema (I don't know if that's possible) or let the R scripts work it out?

Vestigial Columns

updated April 9th 2019, add new entries and restructure to group by table
from my POV (R code person), these columns could be vestigial or could benefit from clarified convention:

1. DataSet

  • Investigator
    currently not passed into views
    PROPOSE to either (1) enforce constraint that one of "creator" roles in DataSetPersonnel should be identical to Investigator? OR (2) drop column completely

  • Accession
    Is this meant to contain distribution information at the dataset level? All empty in SBC sample.
    Also see #29 for entity-level distribution info. OR is this meant to contain the EDI/other repo package ID? In that case, replicate this from pkg_mgmt.pkg_state.dataset_archive_id

  • ProjectRelease/PublicRelease
    nothing in SBC sample data

  • geographicDescription
    VIEWs don't do anything with this. nothing in SBC sample data.

2. DataSetEntities

  • DataAnomalies
    currently not passed into views
    is there an EML tag for this kind of info?
    Edit: SBC sample data has some info here, eg
    <para>The text table is 1.6 gb, and has been compressed to 365 mb for download.</para>

  • EntityRecords
    passed into views, but R code counts rows in actual files instead. no enforcement, so right now could be total bogus numbers

3. DataSetAttributes

  • AttributeID
    in SBC sample data AttributeID is identical to ColumnName, but only the latter gets passed into views. AttributeID doesn't appear in EMLAttributeCodeDefinition. Primary keys in both tables use ColumnName

----- DataSetKeywords and DataSetPersonnel look good ------

4. DataSetMethods

  • effectiveRange
    empty in SBC sample. not clear what this does.

5. DataSetSites

  • EntitySortOrder
    already raised in #20

  • GeoCoverageSortOrder
    Hunch that this is related to entity-level coverage aka #20 somehow, but right now not clear what this does.

6. DataSetTemporal

  • EntitySortOrder
    already raised in #20

----- EMLAttributeCodeDefinition, EMLKeywordTypeList, EMLMeasurementScaleList, EMLNumberTypeList, EMLStorageTypeList look good ------

7. EMLUnitDictionary

  • id
    in SBC sample data id is identical to name but only latter gets passed into VIEWs. Primary key uses id however.

  • custom
    EDI recommends parsing all units as custom unit according to EML BP, pg. 33 so this is redundant

8. EMLUnitTypes

  • dimension_power

----- FileTypeList looks good to me ------

9. KeywordThesaurus

  • ThesaurusSortOrder
    passed into view, but doesn't appear in R code. not clear what this does. Primary key and FK in DataSetKeywords use ThesaurusID.

---- Keywords looks good ----

10. MeasurementScaleList

  • NonNumericDomain
    not clear what this does. doesn't appear in DataSetAttributes or EML

11. People

  • Address3

  • Phone2

  • Fax

  • dbupdatetime
    not so much odd as only People has it

13. Peopleidentification
see #35

14. ProtocolList

15. SiteList

  • SiteType
    PROPOSE to remove. see #36

  • ownership
    not clear what this does

  • ShapeType
    clarify how to use: in SBC sample data, there are "point", "polygon", "rectangle", and "polyline". unclear how "polygon" could be represented differently from "rectangle", and unclear how "polyline" could be represented at all

  • CenterLon/CenterLat
    not passed into VIEWs. don't appear on EML

  • unit
    PROPOSE to rename to altitude_unit

mapping between workflow stages

As raised in zoom call:

One pain point as I've been working with metabase and R to generate EML: some of the time it is not clear or intuitive what maps to what. E.g. where columns in VIEWs are pulled from in metabase, and which columns in VIEWs are used by R code to generate which EML elements. So far, whenever this happens I look into SQL code or ask on Slack.

need to brainstorm how best to document this mapping long-term so users spend less time digging:

  • There are three main stages of the workflow (@gastil, your profound flowchart!):
    Metabase ---(SQL)---> Views ---(R)---> EML

  • We can think about at which stage we'd like to add documentation so that it's most effective. Users might approach from the end result, i.e. looking at produced EML document and wondering why this element's missing and what to do (what field to populate in metabase etc) so that information gets in there. Or they might also look at metabase and wonder where this field is going to end up in EML docs. OR we might want to split the difference and start documenting the middle stage, AKA VIEWs. I.e, on column pubdate in vw_eml_dataset, comment "this is populated from pkg_mgmt.pkg_state.update_data_catalog and populates the pubdate EML element"

write down conventions

we should write down the rules we use. this will make it easier for people who want to contribute, and to keep them consistent.

Address entity-level converage inconsistencies

By virtue of its name, I presume DataSetSites refers to site coverage for the dataset rather than per entity within the dataset. However, it has an EntitySortOrder column, which implies that sites are specified per entity. Similarly, DataSetTemporal has EntitySortOrder. The views, vw_eml_geographiccoverage and vw_eml_temporalcoverage both include entity_position.

So, we should decide whether we will support entity-level coverage in metabase, and if necessary adapt the schema and views accordingly. Gastil seldom uses entity level coverage elements at her site.

Historical note: In the original Metabase, the table DataSetSites does not have an EntitySortOrder column. (Whereas DataSetEntitySites does.)

Make instructions about locale in 0_create_db.sql more generic

Issue and comments pasted from Slack:

To create the database in Windows, I have to comment out the LC_COLLATE and LC_CTYPE parameters. I get invalid locale name: "en_US.UTF-8". By default, my Postgres instance uses UTF8 and English_United States.1252 for the locale, so commenting out those parameters is fine if you are happy with that locale.

Thus, I suggest that the instructions be modified to tell Windows users to comment those lines and make sure they're happy with the default encoding. Alternatively, we could remove those lines from the SQL and have a note about defining a character set in the instructions.

add description of install-scenarios to documentation

the 5 scripts fit a particular set of install scenarios. list them here

  1. institutional/central server

    1. dbadmin acct that is not the db_owner acct (they may/may not be the same person)
    2. different accts (people) run different scripts
    3. at least some accounts probably already exist (because the db_owner is likely to have access to other databases)
    4. there may be a sandbox db and a production db (eg, sandbox has example content loaded, production does not)
  2. desktop install

    1. dbadmin and owner are probably the the same person
    2. multiple ways to install
      • use a gui with an sql pane and run scripts as for a central server
      • concat the scripts together and run them as you would a restore (keep in mind that the typical order of a DB dump-restore is not the order of operations in these scripts)
    3. keep in mind that it is almost impossible to share work this way, which might obviate the use of postgres altogether
  3. optional: notes for install on different OS's (probably orthogonal to the above)

see if installation includes PK for DataSetAttributes

The testing install here at ucsb named mob_test_lter_core_mb_empty was lacking its PK on DataSetAttributes.

This issue is a To-Do note for Gastil to check to be sure the install script does include that PK.

If not, put this snippet in:

ALTER TABLE lter_metabase."DataSetAttributes"
ADD CONSTRAINT "DataSetAttributes_PK_DataSetID_EntitySortOrder_ColumnName"
PRIMARY KEY ("DataSetID", "EntitySortOrder", "ColumnName");

A side issue is whether to use di ei ai or to use di ei aName. The di, ei and ai are integers but AttributeName is a string. Pros/cons either way. Currently both models are used because it came from GCE that way.

Methods: design schema to support EML methods in ways we use it without undue complexity

EML methods contain complex structure such as

  • multiple methodSection trees
  • title, para, ulink, subscript, superscript, emphasis, itemizedList which does not fit well with a plain text data type.
  • protocol (with title, author, url)

We want to keep the actual practice of entering method content from what we receive (often a Word doc or such) into something store-able that can be converted into EML as complex text type or as markdown. Ideally, this could be stored in Metabase rather than static files elsewhere. (Blob data type has been suggested.) It is not realistic to model the structure of methods text into table columns! The Metabase at UCSB prior to the lter_core_mb project used an interim working solution of storing the methodSteps as xml data type in metabase.

It seems useful to be able to re-use methods or methodSteps. To what extent do we actually do that? Protocols seem more likely to be re-used.

In addition to storing methods as docs (or xml), there are several columns in the DataSetMethods table currently (as of 2019-06-27) which are in use for SBC for a small proportion of datasets (looks like physO). So we need to support those also.

This comment is just to open the issue.

enforce referential integrity

We should make sure we enforce referential integrity for tables related to lookup tables. I think this would cause the database engine to raise an error if you entered a NameID in the DataSetPersonnel table that didn't already exist in the People table, for example, which could help prevent loading invalid data.

adapting code versioning to SQL

git (or any code versioning system) is not designed for SQL. When a new file is seen by git, it says "no problem, no conflict, this will not change existing code in master". But if that new file starts out with

DROP VIEW a;
DROP VIEW b;
CREATE VIEW c AS ...

git has no way of seeing that no-no-no-no that new file DOES conflict with what is currently in master. So, we as humans will need to supervise git. Don't merge just cause git says there is no conflict.

This has been a public service announcement from your local git newbie sql geek.

include more person ID types in VIEWs or possibly delete table Peopleidentification

vw_eml_creator only pulls column Identificationlink from table lter_metabase.Peopleidentification and assumes it's ORCID.

CREATE VIEW mb2eml_r.vw_eml_creator AS
SELECT i."Identificationlink" AS orcid 
FROM ((lter_metabase."DataSetPersonnel" d LEFT JOIN lter_metabase."People" p 
ON (((d."NameID")::text = (p."NameID")::text))) LEFT JOIN lter_metabase."Peopleidentification" i 
ON (((d."NameID")::text = (i."NameID")::text)));

propose to either:

  • ALTER VIEW to allow possible multiple ID types per person

or

  • adopt ORCIDs as ID system of choice, get rid of table Peopleidentification, whose purpose was to allow multiple ID types. plus ADD column "orcid" to table People instead.

V1: consider refactoring script 02 (creates tables and views) into 2 scripts

pros:
the views are the most flexible. so it might make sense to pull their creation out into a separate script, so they are easier to recreate. during install views could be loaded last (after controlled/sample content loading).

cons:
there is no db-mgt scenario that is not interactive at some level. The db_owner will always need to know how to handle a particular script. so how we gather them up may be immaterial.

mentiond PostGRES in readme?

Is there anything PostGRES specific in our SQL scripts? I suspect there is, or at least you couldn't use the same scripts directly for SQLite for example. If that 's the case, then we should declare that this is for PostGRES in the readme.

create scripts to populate controlled vocab tables

  • lter_metabase.EMLUnitDictionary
  • lter_metabase."KeywordThesaurus"
  • lter_metabase."Keywords" (some)
  • lter_metabase."EMLKeywordTypeList"
  • lter_metabase."EMLMeasurementScaleList"
  • lter_metabase."EMLNumberTypeList"
  • lter_metabase."EMLStorageTypeList"
  • lter_metabase."EMLUnitTypes"
  • lter_metabase."MeasurementScaleDomains"

remove vestigial constraints (e.g. SiteType)

possibly another multi-entry issue. will update if I stumble across more

  1. SiteList.SiteType
    CONSTRAINT "CK_SiteRegister_SiteType" CHECK ((("SiteType")::text = ANY (ARRAY[('beach'::character varying)::text, ('intertidal'::character varying)::text, ('land'::character varying)::text, ('nearshore'::character varying)::text, ('offshore'::character varying)::text, ('other'::character varying)::text, ('pier'::character varying)::text, ('reef'::character varying)::text])))

PROPOSE to create a lter_metabase.SiteType table, so each site can populate with their own CV of site types? Could be overkill for something internal (doesn't appear in EML)

EML 2.2 units?

According to EML Best Practices:

EDI has adopted the LTER Unit Registry and recommends that customUnit element be used for all units with content pulled from the Unit Registry, even when the unit is already listed in the standard unit dictionary.

So should all rows in EMLUnitDictionary have a custom value of false? Or maybe it's better if the R code uses for all units regardless of what the view says.

Is tokenizing %db_owner% necessary?

In the SQL, instead of
OWNER = %db_owner%
you could have
OWNER = 'db_owner'

As far as I can tell, a global search and replace to swap out 'db_owner' with your actual DB owner doesn't break anything, so that's just as easy as replacing %db_owner%. And there's the added benefit that if users just want to test metabase quickly, they can make a 'db_owner' user and thus not have to change anything in the SQL.

special characters in SBC sample data

in 5_load_sample_sbc_datasets.sql, adding rows to DataSetAttributes, there's this attribute description:

Total number of fronds &gt;1 m in length on the plant.

which comes out in EML as Total number of fronds &amp;gt;1 m in length on the plant.

--
If the row is edited manually in DBeaver to

Total number of fronds > 1 m in length on the plant.

then it comes out in EML as Total number of fronds &gt; 1 m in length on the plant. which is correct.

--
I don't know what's the fix for this would be, or even is a fix is needed. But it does seem XML entity references (eg. &gt;) if present in database doesn't get displayed correctly in R-output EML.

entity-level taxonomic cov

to match the convention of EntitySortOrder==0 => Dataset level
not 0 => specifies the entity the coverage should go under

This is the convention used in DataSetSites

We will hold off under after version 1 and only if somebody asks for it

whether/where to put project level/boilerplate info

@mobb , could be what you meant in #16, I'm not sure.

In the R workflow there's a point where one inputs a boilerplate.xml file with:

  • project info (title, abstract etc)
  • access
  • contact info
  • publisher
  • distribution

to then be populated into EML.

Two considerations:
Whether to change this approach: this could be clunky if the .xml needs to be edited by hand, but if for a given site there's only one static reusable version then developing new tables in metabase then writing R to export it is overkill

If yes to change, then could do (per discussion @twhiteaker)

  • add project-level table, access/distribution could go here
  • add "contact" and "publisher" role in DataSetPersonnel

move reference tables to separate schema?

As we add more tables and cross-reference tables to the core schema (lter_metabase), might we consider moving fairly immutable lookup tables (eg. EMLNumberTypeList and MeasurementScaleDomains) to a separate schema? This may (1) help with clutter and (2) clarify the roles of the different schemas.

pkg_mgmt (which I haven't looked into as much) also has a number of lookup tables starting with cv_. Some of these are specific to SBC/MCR, but some, eg. cv_cra for LTER core research areas, might be useful to add into a schema of lookup tables.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.