
Comments (6)

dmcassel commented on June 16, 2024

The title is pretty general. Can we make this specific to property-level timestamps?

from smart-mastering-core.

aajacobs commented on June 16, 2024

I think it's more than that: it's anything the current merge options allow, so source, time, and length. Length can be calculated on the fly, so there's no need to store it, which leaves just time and source for now. But as new functions become available, the metadata should track whatever those functions need.

For example, suppose the merge strategy weights the sources as a=10, b=5, c=3, with a max-value of 1. If two docs (one from source b and one from source c) are merged into one master doc, the value from source b is retained. The next day, a doc from source a comes in and gets merged into that doc. The correct behavior is now to keep the value from source a and drop the source b value, but to do that you have to know the source of each property.
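
To make the scenario concrete, here is a minimal sketch in Python (not the project's own code). The names `PropertyValue`, `SOURCE_WEIGHTS`, and `merge_property`, along with the sample values, are made up for illustration. The only point is that the merged doc has to carry the source next to each retained value, or the day-two comparison can't happen.

```python
from dataclasses import dataclass

# Source weights from the example above: a higher weight wins.
SOURCE_WEIGHTS = {"a": 10, "b": 5, "c": 3}


@dataclass(frozen=True)
class PropertyValue:
    value: str
    source: str  # per-property provenance carried into the merged doc


def merge_property(candidates, max_values=1):
    """Keep the max_values candidates whose sources have the highest weights."""
    ranked = sorted(
        candidates,
        key=lambda c: SOURCE_WEIGHTS.get(c.source, 0),
        reverse=True,
    )
    return ranked[:max_values]


# Day 1: docs from sources b and c are merged; the value from b is retained.
day1 = merge_property([PropertyValue("Bob", "b"), PropertyValue("Rob", "c")])

# Day 2: a doc from source a arrives. Because the retained value still
# carries its source, it can be compared against the newcomer and replaced.
day2 = merge_property(day1 + [PropertyValue("Robert", "a")])
```

If `day1` held only the bare string, there would be nothing to rank it against when the source a doc arrives.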

dmcassel commented on June 16, 2024

Oh, I see. So this isn't about source docs coming in with additional detail on when individual properties got updated; it's about carrying the current information through to merged documents. I get it.

aajacobs commented on June 16, 2024

Yes--exactly. Although if needed, a future RFE could handle the other use case of allowing source docs to provide property-level timestamps that Smart Mastering should use.

dmcassel commented on June 16, 2024

Implementation note: I believe we're already storing the information needed in the sidecar docs. We'd have to think about whether we should also have some of it in the merged docs (I think that's unattractive) or have the merge process transparently either use that info or reach back to the original source docs (which might be multiple hops).

damonfeldman commented on June 16, 2024

I'll add a use case where there is "confidence" information about each property of a document. In XML this is very natural, with attributes on the various XML elements. In JSON it can still be done, however.

This confidence data can come from at least a couple of places:

  1. It can be added during the data mapping from raw to harmonized. E.g., in the US the mapping from firstname to givenname can have confidence 1.0. In China, however, where family names come first, the mapping from lastname to givenname may have a confidence of only 0.5.
  2. If certain cleanups or mapping case rules are used, the field may have lower confidence. For instance, if an SSN is only 8 digits, one may assume there is a leading 0 that was lost due to numeric storage at some phase. But this is not certain, so the 0-padded SSN value may get a lower confidence.
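
As a rough illustration of the second case, a cleanup step could return a confidence along with the value. This is a Python sketch only: `normalize_ssn` is a hypothetical helper, and the 0.7 score for a padded value is an arbitrary assumption, not a number from this thread.

```python
def normalize_ssn(raw):
    """Return (ssn, confidence) for a raw SSN string.

    The confidence values are illustrative assumptions.
    """
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 9:
        return digits, 1.0
    if len(digits) == 8:
        # Assume a leading zero was dropped by numeric storage upstream.
        # That is a guess, so the padded value gets a lower confidence.
        return digits.zfill(9), 0.7
    # Anything else cannot be repaired by this rule.
    return digits, 0.0
```

The harmonized doc would then store the confidence next to the value, as an attribute in XML or a sibling key in JSON.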

We should have a merge rule that uses some notion of quality or confidence, or other per-field metadata, to drive the merging.
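
One possible shape for such a rule, sketched in Python under stated assumptions: each candidate value arrives with a `confidence` number attached, and the names `merge_by_confidence` and `max_values` are hypothetical rather than existing Smart Mastering options.

```python
def merge_by_confidence(candidates, max_values=1):
    """Keep the values of the max_values candidates with the highest confidence.

    Each candidate is a dict with "value" and "confidence" keys
    (an assumed shape for per-field metadata).
    """
    ranked = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    return [c["value"] for c in ranked[:max_values]]


# Two candidate givenname values: one from a mapping trusted at 1.0,
# one from a mapping trusted at only 0.5 (sample data for illustration).
givenname_candidates = [
    {"value": "Zhang", "confidence": 0.5},
    {"value": "Wei", "confidence": 1.0},
]

winner = merge_by_confidence(givenname_candidates)
```

The same ranking function would work for any other per-field metadata by swapping the sort key.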
