Giter Club home page Giter Club logo

Comments (12)

scelmendorf avatar scelmendorf commented on July 20, 2024 1

From the peanut gallery (who apologizes for absenteeism but has been busy on non db activities):
Could do hybrid approach:
one column for DataSetMethods (Postgres Text type);
one column (controlled list; required) for DataSetMethodsFormat with options such as 'xml', 'url_word', 'url_text', etc.

Then you don't run the risk of 2 or more formats having the same "look" but requiring different logic in the client code, and also don't run the risk of having 2 sets of methods for the same dataset in your metabase (if, for example, you migrated from txt to eml then you need to both add and delete in your update). Also, to add a new format we haven't thought of yet, you just need to update the lookup for DataSetMethodsFormatTypes, rather than add a whole new column to the db and view. (can you tell I am a big fan of the always long, never wide, paradigm)?

Client code looks something like "if DataSetMethodsFormat == 'url_word', curl the value in DataSetMethods, parse as word, etc. Same idea as Tim's but a little less sleuthing required and code updates should we get some ambiguous formats.

from lter-core-metabase.

twhiteaker avatar twhiteaker commented on July 20, 2024

It is not realistic to model the structure of methods text into table columns!

I agree. I'd rather just use a Text type for methods. If it's markdown or XML, then clients like MetaEgress can parse it accoringly to generate EML.

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

Tim, by "Text type" you mean the EML text type that abstract and methods contain, correct? That is the one which has xml elements such as para, subscript, emphasis, and ulink. Or did you mean plain unformatted text, the postgres text type?

I admit I thought it a crude and temporary solution when in 2013 we (SBC and MCR) put our methods eml into metabase using the xml data type. But perhaps that is actually a practical solution. It does work fine for me; I use it to this day.

I will suggest that since some prefer markdown, some prefer eml text type (which is xml), and some may just want to store the file or a link to the file, that we offer these choices. The view would pass thru the xml or the markdown if one of those is populated. Otherwise it might pass thru a url to a file or a "TODO" note to the user to substitute in some R-processed file at a later step. These options could be a set of columns only one of which would not be null. Or we could use the blob data type for all options in a single column. The view would then distinguish how to handle content by some inspection. (Which I do not yet know how to do.)

I do not want to force Metabase users to give up some way of doing things that works well for them.

from lter-core-metabase.

twhiteaker avatar twhiteaker commented on July 20, 2024

I meant the Postgres Text type, which is of unlimited length. I don't see why Blob would be necessary. I could just have a single column for the method description, and then let the EML-generating-code figure out what to do with the text. Example logic (not bulletproof; just quick and dirty examples):

  • If the text starts with <?xml, and you can load it in an XML object without errors, then assume it's XML.
  • If the text starts with #, assume it's markdown.
  • If the text looks like a URL, assume it's a link to a file.
  • Otherwise, treat it as plain text.

This is overloading the Method Description column, so maybe this is bad practice. If that irks people, then I guess @gastil 's suggestion of having separate columns for each of those things would be fine. In that case, I'd rather have all four columns get into the view, and then let the client code figure out what to do with it. Definitely gonna need some user documentation for this.

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

I like this idea. Some comments:

  • Can xml pass thru from pg output into a client script w/o breaking, or getting double-encoded, or some such problem? If it is read in as text, not xml data type, the systems may not handle it properly. How to test this?
  • So far, sets of views have been written with their use in mind, a separate set of views for different R or perl code. Overloading methods, or multiple options to use for methods, is going to need the using-code to be aware of that. I'm ok with that; just noting it.
  • Currently I put an xml forest of <methodStep> trees in my xml data type column (and my column name is method_xml). If we overload several data types into a postgres text data type, I'd use <methodStep> rather than <?xml> as the test.
  • We have heard from at least one metabase user that they do not like too many columns. Yet I am inclined to use a separate column for each data type option for methods. Maybe I want to migrate from one data type to the next, from URL to Word to markdown, for example. And I do like to catch errors that would not be caught if, for example, I entered an xml tree missing its closing tag. Early catch.
  • I could go either way. Plain text to hold various data types, or one column for each data type.
  • If we go with plain text, we will not be able to store an actual Word doc, as we could with blob. (Dumps of blobs can pose problems we may want to avoid, so unless that is really needed perhaps we should avoid blobs.)

from lter-core-metabase.

twhiteaker avatar twhiteaker commented on July 20, 2024

I'm all for avoiding blobs.

In my test with R just now, XML came out of Postgres just fine.

I would be fine with Sarah's suggestion.

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

I like Sarah's idea. I do not object to adding another controlled-vocab table. An alternative is to do this with a CHECK constraint, not as easy to maintain as rows in a cv. (I prefer a cv table. But I am trying to speak for those who have asked for fewer tables.)

Currently, Protocols are stored separate from Methods. Whereas Methods do tend to be unique to a given dataset, usually, Protocols tend to be re-used. Content originally in the Instrument table and the Software table got smushed into denormalized columns tacked onto DataSetMethods. I'm postponing discussion of that.

[I am going to postpone discussion of the n-n datasets-protocols. We may opt to merge protocols into methods, or to have an xref table for them. I want to wait til Monday to ask Margaret. SBC used to have 133 uniq protocols. Now they have 2. So practice may have changed. I've smushed my protocols into my methods, but with tech regret since I lost the on-update-cascade.]

from lter-core-metabase.

atn38 avatar atn38 commented on July 20, 2024

Gastil and I zoomed today re:methods section. TLDR: we'll follow Sarah's suggestion. Will update later.

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

(pasting the introductory comment at the top of the new design here rather than just burying it in the .sql file)

2019-07-10 redesign of mb tables for eml methods tree

Three practices offered:
(Simple Description)

    • Only table DataSetMethodSteps is required. The other tables are optional.
    • To use an external Word .docx file, put a single string devoid of whitespace in the column Description for the filename.
    • Optionally you may list multiple files, but do not list the same file twice for a given dataset.
    • To group multiple items within a methodStep, cross reference those to the same MethodStepSet.

(Simple xml)

    • Only DataSetMethodSteps is used.
    • The eml tree for Methods, everything under methods, goes in one xml forest of methodStep trees as xml in this cell.
    • No external documents are used.
    • When instruments, protocols and software eml structures are wanted, these are part of that same xml forest, not from those other tables (at least not directly).
    • This option does not offer any re-use or rDB control.
    • This option does offer rare eml constructions not handled the other ways such as sampling or qualityControl.

(Complex)

    • Populate these 6 tables:
      • ListInstruments, ListSoftware, ListProtocols,
      • MethodInstruments, MethodProtocols, and MethodSoftware
    • Then put one row in DataSetMethodSteps for each methodStep.
    • Each row in DataSetMethodSteps may join to multiple records in the Methods* tables.
    • This practice has the highest re-use, lowest redundancy, most controlled.

MethodStepSet Column

    • A methodStep contains one Description and 0-to-many Instruments, Softwares, and/or Protocols.
    • By specifying a set of these, they can be grouped.
    • It is common to have only one methodStep, often with no instrument, protocol, or software.
    • Use only one practice (Simple or Complex) for a given dataset.
    • It is ok to have some datasets using the Simple practice and others the Complex.
    • Currently the perl writeEML only uses Simple xml and the R MetaEgress only uses the Simple Description.
      • Note we opted not to share a column for text and xml because we would lose ability to use string functions and xml functions and would complicate the VIEWs. But we do share a single text-type column, Description, for three uses (plain text, markdown, or filename), with DescriptionType column to specify between those options.

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

The create table statements I just now pushed to master in 16_methodStep_tables.sql. Yet to do are the SELECT INTO from the old design to the new design and revised VIEWs to use the new tables. Note that without those it is possible to sail on with current rigging.

Planned migration path is (1) install new tables (2) select into from old tables (3) drop old views and load new views (4) drop old tables.

from lter-core-metabase.

atn38 avatar atn38 commented on July 20, 2024

from lter-core-metabase.

gastil avatar gastil commented on July 20, 2024

Tables all done. One view re-done. Migration scripts almost done. There is a new issue for the migration scripts and new views.

from lter-core-metabase.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.