As a USER I want to TITLE SO that over time, as I have multiple merges that ha

Store property-level metadata in merged docs, and allow merge strategies to use this instead of the document-level metadata

marklogic-community commented on June 16, 2024

Store property-level metadata in merged docs, and allow merge strategies to use this instead of the document-level metadata

from smart-mastering-core.

Comments (6)

dmcassel commented on June 16, 2024

The title is pretty general. Can we make this specific to property-level timestamps?

aajacobs commented on June 16, 2024

I think it's more than that. It's anything the current merge options allow: source, time, and length. Length can be calculated on the fly, so there's no need to store it, which leaves time and source for now. But as new functions become available, the metadata should track whatever those functions need.

For example, suppose the merge strategy prefers sources with the weights a=10, b=5, c=3 and a max-values of 1. Two docs (one from source b and one from source c) are merged into one master doc, so the value from source b is retained. The next day, a doc from source a comes in and gets merged into that doc. The correct behavior is now to keep the value from source a and drop the source b value, but to do that you have to know the source for each property.
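To make that scenario concrete, here is a minimal sketch in Python rather than in Smart Mastering's own code. The `PropertyValue` type, the `merge_property` function, and the sample names are all hypothetical; only the weights and the max-values of 1 come from the example above. The point it illustrates is that the day-2 merge can only choose correctly because each retained value still carries its source.

```python
from dataclasses import dataclass

# Weights from the example above: a higher weight wins.
SOURCE_WEIGHTS = {"a": 10, "b": 5, "c": 3}
MAX_VALUES = 1


@dataclass(frozen=True)
class PropertyValue:
    """Hypothetical property value that carries its own provenance."""
    value: str
    source: str  # property-level metadata kept through every merge


def merge_property(candidates, weights=SOURCE_WEIGHTS, max_values=MAX_VALUES):
    """Keep the max_values candidates whose source has the highest weight."""
    ranked = sorted(candidates, key=lambda pv: weights.get(pv.source, 0), reverse=True)
    return ranked[:max_values]


# Day 1: docs from sources b and c are merged; the value from b is retained.
day1 = merge_property([PropertyValue("Bob", "b"), PropertyValue("Rob", "c")])

# Day 2: a doc from source a arrives and is merged with the stored result.
# This only works because day1 still records that "Bob" came from source b.
day2 = merge_property(day1 + [PropertyValue("Robert", "a")])
```

If the merged doc recorded only document-level metadata, the day-2 merge would have no way to tell that the stored value came from the lower-weighted source b.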

dmcassel commented on June 16, 2024

Oh, I see. So this isn't about source docs coming in with additional detail on when individual properties got updated; it's about carrying the current information through to merged documents. I get it.

aajacobs commented on June 16, 2024

Yes--exactly. Although if needed, a future RFE could handle the other use case of allowing source docs to provide property-level timestamps that Smart Mastering should use.

dmcassel commented on June 16, 2024

Implementation note: I believe we're already storing the information needed in the sidecar docs. We'd have to think about whether we should also have some of it in the merged docs (I think that's unattractive) or have the merge process transparently either use that info or reach back to the original source docs (which might be multiple hops).

damonfeldman commented on June 16, 2024

I'll add a use case where there is "confidence" information about each property of a document. In XML this is very natural, using attributes on the various XML elements. It can still be done in JSON, however.

This confidence data can come from at least a couple of places:

- It can be added during the data mapping from raw to harmonized. E.g., in the US the mapping from firstname to givenname can have confidence 1.0. In China, however, where family names come first, the mapping from lastname to givenname may have a confidence of only 0.5.
- If certain cleanup or mapping rules are applied, the field may have lower confidence. For instance, if an SSN has only 8 digits, one may assume a leading 0 was lost to numeric storage at some phase. But that is not certain, so the 0-padded SSN value may get a lower confidence.

We should have a merge rule that uses some notion of quality or confidence, or other per-field metadata, to drive the merging.
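As a rough illustration of such a rule, the sketch below ranks candidate values by a per-property confidence score. It is written in Python for brevity and is not an existing Smart Mastering merge function; `ScoredValue`, `pick_by_confidence`, and the sample SSNs and scores are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredValue:
    """Hypothetical property value with per-property confidence metadata."""
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def pick_by_confidence(candidates, max_values=1):
    """Keep the max_values candidates with the highest confidence."""
    ranked = sorted(candidates, key=lambda sv: sv.confidence, reverse=True)
    return ranked[:max_values]


ssn = pick_by_confidence([
    ScoredValue("012345678", 0.6),  # zero-padded from an 8-digit value
    ScoredValue("912345678", 1.0),  # arrived with all 9 digits
])
```

The same shape would work for any other per-field quality metadata, as long as it is stored with the property and survives each merge.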
