Comments (12)
From the peanut gallery (who apologizes for absenteeism but has been busy on non db activities):
Could do hybrid approach:
one column for DataSetMethods (Postgres Text type);
one column (controlled list; required) for DataSetMethodsFormat with options such as 'xml', 'url_word', 'url_text', etc.
Then you don't run the risk of 2 or more formats having the same "look" but requiring different logic in the client code, and also don't run the risk of having 2 sets of methods for the same dataset in your metabase (if, for example, you migrated from txt to eml then you need to both add and delete in your update). Also, to add a new format we haven't thought of yet, you just need to update the lookup for DataSetMethodsFormatTypes, rather than add a whole new column to the db and view. (can you tell I am a big fan of the always long, never wide, paradigm)?
Client code looks something like "if DataSetMethodsFormat == 'url_word', curl the value in DataSetMethods, parse as word, etc. Same idea as Tim's but a little less sleuthing required and code updates should we get some ambiguous formats.
from lter-core-metabase.
It is not realistic to model the structure of methods text into table columns!
I agree. I'd rather just use a Text type for methods. If it's markdown or XML, then clients like MetaEgress can parse it accoringly to generate EML.
from lter-core-metabase.
Tim, by "Text type" you mean the EML text type that abstract and methods contain, correct? That is the one which has xml elements such as para, subscript, emphasis, and ulink. Or did you mean plain unformatted text, the postgres text type?
I admit I thought it a crude and temporary solution when in 2013 we (SBC and MCR) put our methods eml into metabase using the xml data type. But perhaps that is actually a practical solution. It does work fine for me; I use it to this day.
I will suggest that since some prefer markdown, some prefer eml text type (which is xml), and some may just want to store the file or a link to the file, that we offer these choices. The view would pass thru the xml or the markdown if one of those is populated. Otherwise it might pass thru a url to a file or a "TODO" note to the user to substitute in some R-processed file at a later step. These options could be a set of columns only one of which would not be null. Or we could use the blob data type for all options in a single column. The view would then distinguish how to handle content by some inspection. (Which I do not yet know how to do.)
I do not want to force Metabase users to give up some way of doing things that works well for them.
from lter-core-metabase.
I meant the Postgres Text type, which is of unlimited length. I don't see why Blob would be necessary. I could just have a single column for the method description, and then let the EML-generating-code figure out what to do with the text. Example logic (not bulletproof; just quick and dirty examples):
- If the text starts with
<?xml
, and you can load it in an XML object without errors, then assume it's XML. - If the text starts with
#
, assume it's markdown. - If the text looks like a URL, assume it's a link to a file.
- Otherwise, treat it as plain text.
This is overloading the Method Description column, so maybe this is bad practice. If that irks people, then I guess @gastil 's suggestion of having separate columns for each of those things would be fine. In that case, I'd rather have all four columns get into the view, and then let the client code figure out what to do with it. Definitely gonna need some user documentation for this.
from lter-core-metabase.
I like this idea. Some comments:
- Can xml pass thru from pg output into a client script w/o breaking, or getting double-encoded, or some such problem? If it is read in as text, not xml data type, the systems may not handle it properly. How to test this?
- So far, sets of views have been written with their use in mind, a separate set of views for different R or perl code. Overloading methods, or multiple options to use for methods, is going to need the using-code to be aware of that. I'm ok with that; just noting it.
- Currently I put an xml forest of
<methodStep>
trees in my xml data type column (and my column name is method_xml). If we overload several data types into a postgres text data type, I'd use<methodStep>
rather than<?xml>
as the test. - We have heard from at least one metabase user that they do not like too many columns. Yet I am inclined to use a separate column for each data type option for methods. Maybe I want to migrate from one data type to the next, from URL to Word to markdown, for example. And I do like to catch errors that would not be caught if, for example, I entered an xml tree missing its closing tag. Early catch.
- I could go either way. Plain text to hold various data types, or one column for each data type.
- If we go with plain text, we will not be able to store an actual Word doc, as we could with blob. (Dumps of blobs can pose problems we may want to avoid, so unless that is really needed perhaps we should avoid blobs.)
from lter-core-metabase.
I'm all for avoiding blobs.
In my test with R just now, XML came out of Postgres just fine.
I would be fine with Sarah's suggestion.
from lter-core-metabase.
I like Sarah's idea. I do not object to adding another controlled-vocab table. An alternative is to do this with a CHECK constraint, not as easy to maintain as rows in a cv. (I prefer a cv table. But I am trying to speak for those who have asked for fewer tables.)
Currently, Protocols are stored separate from Methods. Whereas Methods do tend to be unique to a given dataset, usually, Protocols tend to be re-used. Content originally in the Instrument table and the Software table got smushed into denormalized columns tacked onto DataSetMethods. I'm postponing discussion of that.
[I am going to postpone discussion of the n-n datasets-protocols. We may opt to merge protocols into methods, or to have an xref table for them. I want to wait til Monday to ask Margaret. SBC used to have 133 uniq protocols. Now they have 2. So practice may have changed. I've smushed my protocols into my methods, but with tech regret since I lost the on-update-cascade.]
from lter-core-metabase.
Gastil and I zoomed today re:methods section. TLDR: we'll follow Sarah's suggestion. Will update later.
from lter-core-metabase.
(pasting the introductory comment at the top of the new design here rather than just burying it in the .sql file)
2019-07-10 redesign of mb tables for eml methods tree
Three practices offered:
(Simple Description)
-
- Only table DataSetMethodSteps is required. The other tables are optional.
-
- To use an external Word .docx file, put a single string devoid of whitespace in the column Description for the filename.
-
- Optionally you may list multiple files, but do not list the same file twice for a given dataset.
-
- To group multiple items within a methodStep, cross reference those to the same MethodStepSet.
(Simple xml)
-
- Only DataSetMethodSteps is used.
-
- The eml tree for Methods, everything under methods, goes in one xml forest of methodStep trees as xml in this cell.
-
- No external documents are used.
-
- When instruments, protocols and software eml structures are wanted, these are part of that same xml forest, not from those other tables (at least not directly).
-
- This option does not offer any re-use or rDB control.
-
- This option does offer rare eml constructions not handled the other ways such as sampling or qualityControl.
(Complex)
-
- Populate these 6 tables:
-
-
- ListInstruments, ListSoftware, ListProtocols,
-
-
-
- MethodInstruments, MethodProtocols, and MethodSoftware
-
-
- Then put one row in DataSetMethodSteps for each methodStep.
-
- Each row in DataSetMethodSteps may join to multiple records in the Methods* tables.
-
- This practice has the highest re-use, lowest redundancy, most controlled.
MethodStepSet Column
-
- A methodStep contains one Description and 0-to-many Instruments, Softwares, and/or Protocols.
-
- By specifying a set of these, they can be grouped.
-
- It is common to have only one methodStep, often with no instrument, protocol, or software.
-
- Use only one practice (Simple or Complex) for a given dataset.
-
- It is ok to have some datasets using the Simple practice and others the Complex.
-
- Currently the perl writeEML only uses Simple xml and the R MetaEgress only uses the Simple Description.
-
-
- Note we opted not to share a column for text and xml because we would lose ability to use string functions and xml functions and would complicate the VIEWs. But we do share a single text-type column, Description, for three uses (plain text, markdown, or filename), with DescriptionType column to specify between those options.
-
from lter-core-metabase.
The create table statements I just now pushed to master in 16_methodStep_tables.sql. Yet to do are the SELECT INTO from the old design to the new design and revised VIEWs to use the new tables. Note that without those it is possible to sail on with current rigging.
Planned migration path is (1) install new tables (2) select into from old tables (3) drop old views and load new views (4) drop old tables.
from lter-core-metabase.
from lter-core-metabase.
Tables all done. One view re-done. Migration scripts almost done. There is a new issue for the migration scripts and new views.
from lter-core-metabase.
Related Issues (20)
- The keywords EML view (mb2eml_r.vw_eml_keyword) joins all instances of a keyword, regardless of ThesaurusID HOT 4
- Update docs to indicate which tables are required
- remove public schema from OBF HOT 1
- fix milimeter spelling
- implement boilerplate `SubtractFromID` idea to allow same IDs on different scope HOT 10
- compare original eml to MetaEgres eml from mb
- diff OBF from Gastil's mac mini to OBF from BLE HOT 2
- dataset_archive_id not matching DataSetID in example rows
- suggest ListTaxonomicProviders rename EMLTaxonomicProviders HOT 2
- suggested list of taxonomic ID providers HOT 3
- add "docbook" to be an abstract and method description type HOT 4
- ListSites.AltitudeUnits char limit too short HOT 4
- varchar limit temporal_type column in pkg_sort is too short
- FK in DataSetAttributeAnnotation references wrong columns in DataSetAttributes HOT 2
- support citations for datasets HOT 3
- content for a scratch table of EML 2.2 units
- use ListMissingCodes for enumeration codes as well HOT 14
- Patch 41 causes problems in an EML view (mb2eml_r.vw_eml_attributecodedefinition) HOT 6
- README.md wrong url for quick overview metabase schema hyperlink ? HOT 2
- The keywords view (mb2eml_r.vw_eml_keyword) merges in all instances of a keyword, regardless of ThesaurusID HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lter-core-metabase.