Comments (6)
The title is pretty general. Can we make this specific to property-level timestamps?
from smart-mastering-core.
I think it's more than that. It's anything the current merge options allow. So source, time, and length, where length can be calculated on the fly, so you can ignore storing that, which just leaves time and source for now. But as new functions become available, the metadata should track whatever is needed for those functions.
For example, suppose the merge strategy weights the sources as a=10, b=5, c=3, with a max-value of 1, and two docs (one from source b and one from source c) are merged into one master doc. The value from source b is retained. The next day, a doc from source a comes in and gets merged into that doc. The correct behavior is now to keep the value from source a and drop the source b value, but to do that you have to know the source of each property.
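To make the scenario concrete, here is a rough Python sketch. It is not Smart Mastering code; `PropertyValue`, `merge_property`, and the sample values are hypothetical names made up for illustration. It only shows that the second merge can be resolved correctly if each retained value still carries its source.

```python
from dataclasses import dataclass

# Source weights taken from the example above: higher wins.
SOURCE_WEIGHTS = {"a": 10, "b": 5, "c": 3}


@dataclass(frozen=True)
class PropertyValue:
    """A property value with its provenance kept alongside it (hypothetical shape)."""
    value: str
    source: str


def merge_property(candidates, max_values=1, weights=SOURCE_WEIGHTS):
    """Keep at most max_values candidates, preferring higher-weighted sources."""
    ranked = sorted(candidates, key=lambda c: weights.get(c.source, 0), reverse=True)
    return ranked[:max_values]


# Day 1: docs from sources b and c are merged; the source b value is retained.
day1 = merge_property([PropertyValue("Bob", "b"), PropertyValue("Rob", "c")])

# Day 2: a doc from source a arrives. Because the retained value still records
# that it came from source b, it can be compared against source a and replaced.
day2 = merge_property(day1 + [PropertyValue("Robert", "a")])
```

If the day 1 result had stored only the bare value, the day 2 merge would have nothing to compare the incoming source against.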
Oh, I see. So this isn't about source docs coming in with additional detail on when individual properties got updated; it's about carrying the current information through to merged documents. I get it.
Yes--exactly. Although if needed, a future RFE could handle the other use case of allowing source docs to provide property-level timestamps that Smart Mastering should use.
Implementation note: I believe we're already storing the information needed in the sidecar docs. We'd have to think about whether we should also have some of it in the merged docs (I think that's unattractive) or have the merge process transparently either use that info or reach back to the original source docs (which might be multiple hops).
I'll add a use case where there is "confidence" information about each property of a document. In XML this is very natural, using attributes on the various XML elements. In JSON it can still be done, however.
This confidence data can come from at least a couple of places:
- it can be added during the data mapping from raw to harmonized. E.g. the mapping in the US from firstname to givenname can be confidence 1.0. However in China, where family names come first, the mapping from lastname to givenname may be only 0.5 confident.
- if certain cleanups or mapping case rules are used, the field may have lower confidence. For instance, if an SSN is only 8 digits, one may assume there is a leading 0 that was lost due to numeric storage at some phase. But this is not certain, so the 0-padded SSN value may get a lower confidence.
We should have a merge rule that uses some notion of quality or confidence, or other per-field metadata, to drive the merging.
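As a rough illustration of what such a rule might look like, here is a Python sketch. It is not an existing Smart Mastering merge function; `ScoredValue`, `merge_by_confidence`, and the `min_confidence` threshold are hypothetical, and the sample names are made up. The confidences mirror the 1.0 and 0.5 mapping example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredValue:
    """A property value with a per-field confidence in [0.0, 1.0] (hypothetical shape)."""
    value: str
    confidence: float


def merge_by_confidence(candidates, min_confidence=0.0):
    """Return the most confident candidate, or None if none meets the threshold."""
    eligible = [c for c in candidates if c.confidence >= min_confidence]
    if not eligible:
        return None
    # On ties, max() returns the first candidate encountered.
    return max(eligible, key=lambda c: c.confidence)


givenname_candidates = [
    ScoredValue("Zhang", 0.5),  # mapped from lastname, so less certain
    ScoredValue("Wei", 1.0),    # mapped from firstname with full confidence
]
best = merge_by_confidence(givenname_candidates)
```

The same shape would work for any other per-field metadata: swap the `key` function for whatever the rule should rank on.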