
smart-mastering-core's Introduction

Smart Mastering Core

Integration into MarkLogic Data Hub 5

1.3.1 is our final feature release of Smart Mastering in the Smart Mastering Core repository. As of Data Hub 5.0.0, Smart Mastering is fully integrated into MarkLogic Data Hub as a built-in capability, and the recommended way to use the Smart Mastering capability is by configuring a mastering step in Data Hub. Existing users should migrate their Smart Mastering configuration to MarkLogic Data Hub (see Import Your Smart Mastering Core Projects for instructions). The integration of Smart Mastering into Data Hub offers a variety of benefits, including:

  • Built-in support for orchestrating matching and merging across documents.
  • QuickStart UI for configuration of matching and merging.

MarkLogic will continue to invest in Smart Mastering as a built-in capability of Data Hub.

Smart Mastering Core details

This repo contains the libraries and services for a Smart Mastering capability built on top of MarkLogic. Smart Mastering consists of matching entities in an operational data hub, then auto-merging for high-scoring matches and recording a notification for a human reviewer for cases where the score indicates a possible, but not definite match. Match scoring rules and merging algorithms, thresholds and actions are configuration driven. APIs are available either through a set of XQuery libraries or a REST service layer.

This capability is experimental. Be prepared for the interface and implementation to change significantly. We welcome your input to guide this development process.

Requirements

  • MarkLogic 9.0-5 or higher
  • Java 8 or higher
  • Gradle is optional - this project has the Gradle wrapper included, and the instructions below reference it so that you don't need to install Gradle
  • Schemas Database attached to your content DB. This is required for PROV-O to work.

Using

Documentation on how Smart Mastering Core works and how to use it is available at GitHub.

To use the Smart Mastering Core in your own project, follow these instructions. This assumes that you're using ml-gradle in your project.

Note: be advised that this project is in its very early stages. The APIs presented here may change significantly before stabilizing.

Examples

To view smart-mastering-core in use, see the Smart Mastering project examples.

Need help?

If you've found a bug or would like to ask for a new capability, please file an issue here on GitHub. If you are having trouble using smart-mastering-core, you can file a question issue here on GitHub or ask a question on Stack Overflow with the "marklogic" tag. If you'd like to discuss this project with Product Management, contact [email protected].

Project Status

Smart Mastering Core is a community-supported project. Help is available by filing issues here on GitHub and by asking questions on Stack Overflow; however, we can’t promise a specific resolution or timeframe for any request.

Adding Smart Mastering to your project

Assuming you're using ml-gradle, you can easily integrate Smart Mastering into your application.

As this project hasn't been published to the jcenter repository yet, you'll first need to publish a copy of this project to your local Maven repository, which defaults to ~/.m2/repository.

To do so, clone this repository and run the following command in the project's root directory:

./gradlew publishToMavenLocal

You can verify that the artifacts were published successfully by looking in the ~/.m2/repository/com/marklogic/community/smart-mastering-core directory.

Now that you've published Smart Mastering locally, you can add it to your own application. The minimal example project provides a simple example of doing this. You just need to add the following to your build.gradle (again, this depends on using ml-gradle).

First, in the repositories block, make sure you have your local Maven repository listed:

repositories {
  mavenLocal()
}

And then just add the following to your dependencies block:

dependencies {
  mlRestApi "com.marklogic.community:smart-mastering-core:0.1.DEV"
}

This assumes that the version of the artifacts you published above is 0.1.DEV. You can find the version number by looking at the version property in gradle.properties in your cloned copy of smart-mastering-core.

And that's it! Now, when you run mlDeploy, the modules in Smart Mastering will be automatically loaded into your modules database. To verify that the modules exist, you can either browse your modules database via qConsole, or you can go to your application's REST server and see that the Smart Mastering services have been installed:

http://localhost:(your REST port)/v1/config/resources 

You can then run ml-gradle tasks such as mlLoadModules and mlReloadModules, and the Smart Mastering modules will again be loaded into your modules database.
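The REST check described above can also be scripted. The following is a minimal Python sketch and is not part of this project: the port (8800), the credentials, and the use of digest authentication are assumptions about your app server, so adjust them to match your setup.

```python
import urllib.request


def resources_url(host: str, port: int) -> str:
    """Build the URL that lists the installed REST resource extensions."""
    return f"http://{host}:{port}/v1/config/resources"


def fetch_resources(host: str, port: int, user: str, password: str) -> str:
    """Fetch the resource listing as JSON text.

    Assumes the app server uses digest authentication; swap in
    HTTPBasicAuthHandler if yours uses basic authentication.
    """
    url = resources_url(host, port)
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords)
    )
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener.open(request) as response:
        return response.read().decode("utf-8")


# Example call (hypothetical port and credentials):
#   listing = fetch_resources("localhost", 8800, "admin", "admin")
#   print("sm-match" in listing)
```

If the Smart Mastering services were deployed, names such as sm-match should appear in the returned listing.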

Development

Deploy

If necessary, create a gradle-local.properties file and override properties in gradle.properties as needed.

Run ./gradlew mlDeploy

Testing

UI-based

After running ./gradlew mlDeploy, point a browser to http://localhost:8042/test. Click the Run Tests button.

Command-line

  • ./gradlew mlUnitTest

Release

Publishing

  • update gradle.properties with the new version number
  • update the gradle.properties for the examples with the new version number
  • update the UPGRADE.md file's version reference to the new version number. Add other upgrade notes as needed.
  • generate the CHANGELOG: github_changelog_generator --token $my-github-token --future-release v1.0.0
    • if you don't have a "/tmp/" directory (Windows), add --cache-file ~\tmp\github-changelog-http-cache --cache-log ~\tmp\github-changelog-logger.log
  • commit the CHANGELOG
  • add these properties to your gradle-local.properties
    • bintray_user
    • bintray_key
  • change the version property in gradle.properties to the new version number
  • ./gradlew bintrayUpload
  • log in to bintray and push the publish button
  • smoke test the examples to make sure they work with the latest published version
  • merge the develop branch into master
  • merge the docs-next branch into docs
  • tag the release vx.xx
  • push the tags
  • add the changelog to the release page on github

You must be part of the marklogic-community organization on bintray in order to publish.

How do I uninstall?

./gradlew mlUndeploy -Pconfirm=true

smart-mastering-core's People

Contributors

dmcassel, patrickmcelwee, paxtonhare, rjrudin, ryanjdew


smart-mastering-core's Issues

Store property-level metadata in merged docs, and allow merge strategies to use this instead of the document-level metadata

As a USER
I want to store property-level metadata in merged docs, and allow merge strategies to use it instead of the document-level metadata,
SO that over time, as multiple merges occur on a document, I can still specify how merging should happen for each property instead of relying only on top-level document metadata.

For example, if I have a mastered doc with information derived from three sources, the lastUpdated value of each of those properties may be different, while the mastered document will retain the lastUpdated value for any change to the entire document.

Currently, if I have a fourth source that I want to use to merge into that mastered record and I want to keep the most recent properties, Smart Mastering only knows the lastUpdated information for the entire merged doc, not the individual properties. This will lead to unexpected behavior based on my use case, as the lastUpdated info for the doc can be very different than the lastUpdated info for a given property.
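To make the request concrete, here is a minimal Python sketch of a "keep the most recent value" merge driven by per-property timestamps. The data shape (a property name mapped to a value and a lastUpdated string) and the function name are hypothetical and are not how Smart Mastering stores metadata.

```python
from datetime import datetime


def merge_most_recent(sources):
    """Keep, for each property, the value with the newest timestamp.

    `sources` is a list of dicts mapping a property name to a
    (value, last_updated) pair, where last_updated is an ISO 8601 string.
    This shape is hypothetical, not Smart Mastering's storage format.
    """
    newest = {}
    for source in sources:
        for name, (value, last_updated) in source.items():
            stamp = datetime.fromisoformat(last_updated)
            if name not in newest or stamp > newest[name][1]:
                newest[name] = (value, stamp)
    return {name: value for name, (value, _) in newest.items()}


# A previously mastered doc whose properties were updated at different times.
mastered = {
    "city": ("Reston", "2018-07-01T10:00:00"),
    "zip": ("20190", "2018-07-09T10:00:00"),
}
# A fourth source with a newer city but an older zip.
fourth_source = {
    "city": ("Herndon", "2018-07-05T10:00:00"),
    "zip": ("20170", "2018-07-03T10:00:00"),
}
result = merge_most_recent([mastered, fourth_source])
print(result)
```

With only a document-level lastUpdated (2018-07-09 for the mastered doc in this example), the mastered doc would win for every property and the stale city would be kept; with property-level timestamps, the newer city from the fourth source is chosen while the mastered zip is retained.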

Not able to add merge configuration using REST API in minimal-project.

I am not able to add a merge configuration to minimal-project using the sm-merge-options REST API, although I was able to add a match configuration.

Steps followed to set up:

  1. Cloned the Smart Mastering Core repository and ran the gradle.bat publishToMavenLocal command.
  2. Smart Mastering was added to the local Maven repository at .m2\repository\com\marklogic\community\smart-mastering-core\1.0.0-beta.1.
  3. Changed to the minimal-project directory and ran gradle mlDeploy.
  4. Verified the resources by hitting the endpoint
    http://localhost:8800/v1/config/resources
  5. Accessed the endpoints below to PUT/POST the match and merge options
    http://localhost:8800/v1/resources/sm-match-options?rs:name=mlw-match
    http://localhost:8800/v1/resources/sm-merge-options?rs:name=mlw-merge
    Smart_Mastering_Resources.txt
    mdm-merge-options.txt

Let me know if I missed any step in setting up the project.

POSTing pseudo-document to sm-match

In order to conduct fielded searches, I would like to be able to POST a pseudo-document to sm-match and get matches as if I had started with a document in the database.

For example, I would like to be able to POST the following to /LATEST/resources/sm-match?rs:options=mlw-match

{document: {PersonGivenName: "maurise"}}

When I try this now, I get the following error:

2018-07-10 16:52:52.762 Notice: XDMP-AS: (err:XPTY0004) $uri as xs:string -- Invalid coercion: () as xs:string
2018-07-10 16:52:52.762 Notice:+in /com.marklogic.smart-mastering/matcher-impl/matcher-impl.xqy, at 59:21,
2018-07-10 16:52:52.762 Notice:+in match-impl:find-document-matches-by-options(object-node{"PersonGivenName":text{"maurise"}}, fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options, xs:int("1"), xs:int("200"), xs:double("1"), fn:false(), fn:false()) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $document = object-node{"PersonGivenName":text{"maurise"}}
2018-07-10 16:52:52.762 Notice:+  $options = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options
2018-07-10 16:52:52.762 Notice:+  $start = xs:int("1")
2018-07-10 16:52:52.762 Notice:+  $page-length = xs:int("200")
2018-07-10 16:52:52.762 Notice:+  $minimum-threshold = xs:double("1")
2018-07-10 16:52:52.762 Notice:+  $lock-on-search = fn:false()
2018-07-10 16:52:52.762 Notice:+  $include-matches = fn:false()
2018-07-10 16:52:52.762 Notice:+  $options = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options
2018-07-10 16:52:52.762 Notice:+  $tuning = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options/matcher:tuning
2018-07-10 16:52:52.762 Notice:+  $property-defs = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options/matcher:property-defs
2018-07-10 16:52:52.762 Notice:+  $thresholds = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options/matcher:thresholds
2018-07-10 16:52:52.762 Notice:+  $scoring = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options/matcher:scoring
2018-07-10 16:52:52.762 Notice:+  $algorithms = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>...function...)
2018-07-10 16:52:52.762 Notice:+  $query = cts:or-query((cts:element-value-query(fn:QName("","PersonGivenName"), "maurise", ("case-insensitive","lang=en"), 6), cts:element-value-query(fn:QName("","PersonGivenName"), ("Maurie", "MURIAL", "KRIS"), ("case-insensitive","lang=en"), 6)), ())
2018-07-10 16:52:52.762 Notice:+  $serialized-query = <boost-query><cts:or-query xmlns:cts="http://marklogic.com/cts"><cts:element-value-query .../>...</cts:or-query></boost-query>
2018-07-10 16:52:52.762 Notice:+  $minimum-threshold-combinations = (cts:element-value-query(fn:QName("","PersonGivenName"), "maurise", ("case-insensitive","lang=en"), 0), cts:element-value-query(fn:QName("","PersonGivenName"), ("Maurie", "MURIAL", "KRIS"), ("case-insensitive","lang=en"), 0))
2018-07-10 16:52:52.762 Notice:+in /com.marklogic.smart-mastering/matcher.xqy, at 102:2,
2018-07-10 16:52:52.762 Notice:+in matcher:find-document-matches-by-options(object-node{"PersonGivenName":text{"maurise"}}, fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options, 1, 200, fn:false()) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $document = object-node{"PersonGivenName":text{"maurise"}}
2018-07-10 16:52:52.762 Notice:+  $options = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options
2018-07-10 16:52:52.762 Notice:+  $start = xs:int("1")
2018-07-10 16:52:52.762 Notice:+  $page-length = xs:int("200")
2018-07-10 16:52:52.762 Notice:+  $include-matches = fn:false()
2018-07-10 16:52:52.762 Notice:+in /marklogic.rest.resource/sm-match/assets/resource.xqy, at 60:4,
2018-07-10 16:52:52.762 Notice:+in xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy")(map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $context = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $input = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+  $uri = "/none"
2018-07-10 16:52:52.762 Notice:+  $input-root = object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}
2018-07-10 16:52:52.762 Notice:+  $document = object-node{"PersonGivenName":text{"maurise"}}
2018-07-10 16:52:52.762 Notice:+  $options = fn:doc("/com.marklogic.smart-mastering/options/algorithms/mlw-match.xml")/matcher:options
2018-07-10 16:52:52.762 Notice:+  $start = 1
2018-07-10 16:52:52.762 Notice:+  $page-length = 200
2018-07-10 16:52:52.762 Notice:+  $include-matches = fn:false()
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/lib/extensions-util.xqy, at 945:44,
2018-07-10 16:52:52.762 Notice:+in extut:call-service("sm-match", "POST", xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy"), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $extension-name = "sm-match"
2018-07-10 16:52:52.762 Notice:+  $method = "POST"
2018-07-10 16:52:52.762 Notice:+  $service = xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy")
2018-07-10 16:52:52.762 Notice:+  $context = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $service-params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $input = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/lib/extensions-util.xqy, at 898:20,
2018-07-10 16:52:52.762 Notice:+in function() as item()*() [1.0-ml]
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/lib/extensions-util.xqy,
2018-07-10 16:52:52.762 Notice:+in xdmp:invoke(function() as item()*, <options xmlns="xdmp:eval"><isolation>same-statement</isolation><ignore-amps>...</ignore-amps></options>) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/lib/extensions-util.xqy, at 896:12,
2018-07-10 16:52:52.762 Notice:+in extut:invoke-service("sm-match", "POST", "query", xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy"), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}, fn:false()) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $extension-name = "sm-match"
2018-07-10 16:52:52.762 Notice:+  $method = "POST"
2018-07-10 16:52:52.762 Notice:+  $default-txn-mode = "query"
2018-07-10 16:52:52.762 Notice:+  $service = xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy")
2018-07-10 16:52:52.762 Notice:+  $context = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $service-params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $input = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+  $in-txn = fn:false()
2018-07-10 16:52:52.762 Notice:+  $txn-curr = "query"
2018-07-10 16:52:52.762 Notice:+  $txn-mode = "query"
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/models/resource-model-query.xqy, at 269:20,
2018-07-10 16:52:52.762 Notice:+in rsrcmodqry:resource-post("sm-match", map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}, fn:false(), local:rsrcmod-callback#6) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $resource-name = "sm-match"
2018-07-10 16:52:52.762 Notice:+  $context = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $resource-params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $input = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+  $in-txn = fn:false()
2018-07-10 16:52:52.762 Notice:+  $responder = local:rsrcmod-callback#6
2018-07-10 16:52:52.762 Notice:+  $service = xdmp:function(fn:QName("http://marklogic.com/rest-api/resource/sm-match","post"), "/marklogic.rest.resource/sm-match/assets/resource.xqy")
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/models/resource-model-query.xqy, at 236:4,
2018-07-10 16:52:52.762 Notice:+in rsrcmodqry:exec-post(map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>), document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}, local:rsrcmod-callback#6) [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $headers = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $endpoint-params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $input = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+  $responder = local:rsrcmod-callback#6
2018-07-10 16:52:52.762 Notice:+in /MarkLogic/rest-api/endpoints/resource-service-query.xqy, at 78:8 [1.0-ml]
2018-07-10 16:52:52.762 Notice:+  $headers = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $method = "POST"
2018-07-10 16:52:52.762 Notice:+  $body = document{object-node{"document":object-node{"PersonGivenName":text{"maurise"}}}}
2018-07-10 16:52:52.762 Notice:+  $params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" .../>)
2018-07-10 16:52:52.762 Notice:+  $extra-names = ()
2018-07-10 16:52:53.127 Info: Status 500: XDMP-AS: (err:XPTY0004) $uri as xs:string -- Invalid coercion: () as xs:string

The issue may be the way I am calling the endpoint. This is how I called the old am-match endpoint, but I may need a different incantation now.

For completeness, the contents of mlw-match are:

<?xml  version="1.0" encoding="UTF-8"?>
<options xmlns="http://marklogic.com/smart-mastering/matcher">
<property-defs>
<property namespace="" localname="IdentificationID" name="ssn">
</property>
<property namespace="" localname="PersonGivenName" name="first-name">
</property>
<property namespace="" localname="PersonSurName" name="last-name">
</property>
<property namespace="" localname="AddressPrivateMailboxText" name="addr1">
</property>
<property namespace="" localname="LocationCity" name="city">
</property>
<property namespace="" localname="LocationState" name="state">
</property>
<property namespace="" localname="LocationPostalCode" name="zip">
</property>
<property namespace="" localname="PersonBirthDate" name="dob">
</property>
</property-defs>
<algorithms>
<algorithm name="std-reduce" function="standard-reduction">
</algorithm>
<algorithm name="std-reduce-query" function="standard-reduction-query">
</algorithm>
<algorithm name="dbl-metaphone" function="double-metaphone">
</algorithm>
<algorithm name="thesaurus" function="thesaurus">
</algorithm>
</algorithms>
<scoring>
<add property-name="ssn" weight="50">
</add>
<add property-name="last-name" weight="8">
</add>
<!--

      We want 12 points for the first name. Six here, plus six from a thesaurus match, which includes the original
      target name.
    
-->
<add property-name="first-name" weight="6">
</add>
<!--
     <add property-name="addr1" weight="5"/> 
-->
<!--
     <add property-name="city" weight="3"/>
    <add property-name="state" weight="1"/>
    <add property-name="zip" weight="3"/> 
-->
<!--

    <expand property-name="first-name" algorithm-ref="dbl-metaphone" weight="6">
      <dictionary>first-name-dictionary.xml</dictionary>
      <distance-threshold>50</distance-threshold>
    </expand>
    
-->
<expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
<thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
<distance-threshold>50</distance-threshold>
</expand>
<expand property-name="first-name" algorithm-ref="dbl-metaphone" weight="6">
<dictionary>first-name-dictionary.xml</dictionary>
</expand>
<expand property-name="last-name" algorithm-ref="dbl-metaphone" weight="8">
<dictionary>last-name-dictionary.xml</dictionary>
<!--
defaults to 100 distance 
-->
</expand>
<reduce algorithm-ref="std-reduce" weight="4">
<all-match>
<property>last-name</property>
<property>addr1</property>
</all-match>
</reduce>
</scoring>
<thresholds>
<threshold above="1" label="Possible Match">
</threshold>
<threshold above="15" label="Likely Match" action="merge">
</threshold>
<threshold above="50" label="Definitive Match" action="merge">
</threshold>
<!--
 below 30 will be NOT-A-MATCH or no category 
-->
</thresholds>
<tuning>
<max-scan>200</max-scan>
<!--
 never look at more than 200 
-->
<initial-scan>20</initial-scan>
</tuning>
</options>

Generate a report of match results

A user may want to experiment with a new set of match options. Given a set of options and a query that identifies a set of URIs to work with (could be a cts:document-query with a bunch of URIs), run the matcher on all the documents and produce a report of what the results would be and what actions would be taken.
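As a rough illustration of what such a report could contain, the Python sketch below classifies already-computed match scores against thresholds and lists the action each would trigger. The threshold values are copied from the mlw-match options quoted in an earlier issue on this page; the row layout, the "No Match" label, and the rule that a score equal to a threshold's value reaches it are assumptions.

```python
# Thresholds copied from the mlw-match options quoted earlier on this page.
THRESHOLDS = [
    (1, "Possible Match", None),
    (15, "Likely Match", "merge"),
    (50, "Definitive Match", "merge"),
]


def classify(score, thresholds=THRESHOLDS):
    """Return (label, action) for the highest threshold the score reaches.

    Assumption: a score equal to the `above` value counts as reaching it.
    """
    reached = None
    for above, label, action in sorted(thresholds, key=lambda t: t[0]):
        if score >= above:
            reached = (label, action)
    return reached


def build_report(scored_pairs):
    """Turn (uri, matched_uri, score) tuples into report rows."""
    rows = []
    for uri, matched_uri, score in scored_pairs:
        outcome = classify(score)
        label, action = outcome if outcome else ("No Match", None)
        rows.append({
            "uri": uri,
            "match": matched_uri,
            "score": score,
            "label": label,
            "action": action or "none",
        })
    return rows


# Hypothetical scores; a real report would take them from the matcher.
report = build_report([
    ("/person/1.json", "/person/2.json", 62),
    ("/person/1.json", "/person/3.json", 14),
    ("/person/4.json", "/person/5.json", 0.5),
])
for row in report:
    print(row)
```

Because nothing is written to the database, a report like this could be regenerated after each change to the match options.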

Settle on an approach for standard-reduction

The standard-reduction.xqy module has two functions: standard-reduction and standard-reduction-query. It doesn’t look like standard-reduction-query is actually used anywhere, although there are orphaned references to it in several of our example match options. The standard-reduction-query likely can't work, since it relies on combined queries, but a cts:and-query doesn't take a weight.

Actions:

  • remove implementation of standard-reduction-query
  • document that standard-reduction requires matches
  • consider whether there's a way to implement this that doesn't require matches

Provide config example in comments for each algorithm

The zip.xqy algorithm is easy to understand because it includes a config example in the module comments. I think each OOTB algorithm would benefit from this - unless there's a separate plan for documenting them e.g. on the github Wiki or in another place. I like the example in the module though because it's next to the code then and easier to understand.

Enable users to control collections during merging

Currently, a merged document inherits the collections from all source documents being merged, with the "mdm-merged" collection being added.

To support business workflows, smart mastering should allow users to select how collections are handled during a merge, including what collections should be added, and what collections should be removed from a merged document. Without having control of the collections, batch processing of documents requires writing more complex queries to isolate documents for processing. And being able to put merged documents into custom collections allows users to master multiple types of entities in the same database, instead of having multiple entities all being in the "mdm-merged" collection.

Ideally, this type of workflow should be supported:
(1) User puts all documents to be mastered in a "toBeMastered" collection
(2) A batch process runs by selecting documents in the "toBeMastered" collection
(3) New documents are put into a user-specified collection, like "masterPerson", with the option to bring forward collections from source documents or not.
(4) Users can remove collections, like "toBeMastered", from merged and original documents, so the next time the batch process runs to select documents in the "toBeMastered" collection, it won't rerun against the same documents.
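The collection handling in steps (3) and (4) can be sketched as a small function. This is a Python illustration of the requested behavior only; the parameter names (add, remove, inherit) are hypothetical and do not correspond to existing Smart Mastering options.

```python
def merged_collections(source_collections, add=(), remove=(), inherit=True):
    """Compute the collections for a newly merged document.

    source_collections: one list of collection names per source document.
    add / remove: collections to add to or drop from the result.
    inherit: whether to bring forward the source documents' collections.
    All parameter names are hypothetical.
    """
    result = set()
    if inherit:
        for collections in source_collections:
            result.update(collections)
    result.update(add)
    result.difference_update(remove)
    return sorted(result)


sources = [["toBeMastered", "crm"], ["toBeMastered", "billing"]]

# Bring source collections forward, tag as masterPerson, drop the work queue.
carried = merged_collections(sources, add=["masterPerson"], remove=["toBeMastered"])

# Start from a clean slate instead of inheriting.
clean = merged_collections(sources, add=["masterPerson"], inherit=False)

print(carried)
print(clean)
```

Removing "toBeMastered" in the same step is what keeps the next batch run from selecting the same documents again.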

Add troubleshooting page

Add a documentation page for troubleshooting. Include:

  • I expected these documents to match, but they didn't. How do I figure out why?
  • How do I find out what matches happened?
  • How do I find the scores from matching?
  • The outcome is not as expected (e.g., no matches or merges). How do I diagnose whether the error is in my code calling Smart Mastering? Did Smart Mastering run?

Allow user to specify the merged URI

As a USER
I want to specify the merged URI
so if I have a URI scheme I want to deploy, it can be honored while using smart mastering. In addition, as information is updated, the URI for the same resource won't need to continually change with each merge that may be providing only very little information.

Have meaningful error responses on incomplete REST endpoint calls

As a USER
I want meaningful error responses on incomplete REST endpoint calls
so I can fix and resubmit my request without having to consult the documentation unless absolutely necessary.

E.g., a call to sm-match without an rs:options parameter or an options property defined in the body results in a blank 200 response. It should send a 400 instead, with details of what's missing.

There are 14 endpoints, and each may support a couple of verbs.

Make sure each endpoint checks for its required parameters.
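A minimal Python sketch of the kind of check being asked for, using the sm-match case above. The function name and the shape of the error payload are hypothetical; the real endpoints are XQuery resource extensions and may report errors differently.

```python
def validate_match_request(params, body):
    """Return (status, payload) for a hypothetical sm-match request check.

    params: dict of request parameters, e.g. {"rs:options": "mlw-match"}.
    body: parsed JSON body as a dict, or None when no body was sent.
    """
    problems = []
    has_options = "rs:options" in params or "options" in (body or {})
    if not has_options:
        problems.append(
            "match options are required: pass rs:options or an "
            "options property in the request body"
        )
    if problems:
        # Tell the caller exactly what is missing instead of a blank 200.
        return 400, {"message": "Bad Request", "details": problems}
    return 200, None


print(validate_match_request({}, {"document": {"PersonGivenName": "maurise"}}))
print(validate_match_request({"rs:options": "mlw-match"}, None))
```

The same pattern, one list of required inputs per endpoint and verb, would cover the remaining endpoints.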

Clarify the data model requirements needed

As a USER
I need to understand the data model requirements in order to use smart mastering (/envelope/instance/{entityName}/{entityProperties} along with any required namespace if XML)
and would like to receive a meaningful error message if I try to use SM features with documents that don't meet the data model requirements.

Declare default algorithms in merging options

As a USER
I want to declare default algorithms in merging options
So I don't have to manually declare each property's merge options if I want all of my properties to be merged the same way, including the merge algorithm and max-values.

The default options should be overridden by any more specific merge option defined within the merging property. So if I have documents with 100 properties, I can set the default merge option once (e.g., max-value=1 and longest value), and override it for a specific property that I want handled in a different way (max-value=2 and source).
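The override rule described here amounts to layering property-specific settings over the defaults. A minimal Python sketch, with hypothetical key names ("algorithm", "max-values") standing in for whatever the merge options would actually use:

```python
def resolve_merge_options(defaults, per_property, property_name):
    """Return the effective merge options for one property.

    Property-specific settings win over the defaults; properties with no
    specific entry fall back to the defaults unchanged.
    """
    resolved = dict(defaults)
    resolved.update(per_property.get(property_name, {}))
    return resolved


# Key names and values are illustrative only.
defaults = {"algorithm": "longest", "max-values": 1}
per_property = {"address": {"algorithm": "source", "max-values": 2}}

print(resolve_merge_options(defaults, per_property, "name"))
print(resolve_merge_options(defaults, per_property, "address"))
```

With 100 properties, only the exceptions need an entry in per_property; every other property resolves to the defaults.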

Allow processMatchAndMerge to put unmatched documents into the mastered collections

When processMatchAndMerge is run with a URI and there is no match, there should be an option that treats the document as a mastered document and gives it the proper collections (see related issue #190).

As it stands now, documents come into an "mdm-content" collection, which will contain duplicates until the mastering process is run. By adopting the approach in issue 190, users can import data into a temp collection in the FINAL db, run processMatchAndMerge against them, and any merged documents will be put into the "mastered" collection, but any documents that don't have a match may still be considered an authoritative record for the entity, and also be put into the mastered collection. In that case, the un-merged document can be updated in place with the appropriate collections to avoid making another copy.

sm-history-document returns blank activities

Given the document below (redacted: ask me for the whole thing) at /com.marklogic.smart-mastering/merged/ffcc223a-7479-4af0-9dfc-61797d8ebb84.xml, when I call:

/LATEST/resources/sm-history-document?rs:uri=/com.marklogic.smart-mastering/merged/ffcc223a-7479-4af0-9dfc-61797d8ebb84.xml

I only get:

{activities: []}

The document:

{
"envelope": {
"headers": {
"sources": [
{
"name": "MMIS", 
"user": "admin", 
"dateTime": "2018-07-09T13:23:34.647238Z", 
"action": "Harmonization", 
"rawDoc": "1191448670-4"
}, 
{
"name": "MEDS", 
"user": "admin", 
"dateTime": "2018-07-09T20:22:13.638788Z", 
"action": "Harmonization", 
"rawDoc": "/raw/meds/medstestdata_21.xml"
}, 
{
"name": "MMIS", 
"user": "admin", 
"dateTime": "2018-07-09T13:23:34.647238Z", 
"action": "Harmonization", 
"rawDoc": "1191448670-1"
}, 
{
"name": "MMIS", 
"user": "admin", 
"dateTime": "2018-07-09T13:23:34.647238Z", 
"action": "Harmonization", 
"rawDoc": "1191448670-2"
}, 
{
"name": "CURAM", 
"user": "admin", 
"dateTime": "2018-07-09T20:26:52.945146Z", 
"action": "Harmonization", 
"rawDoc": "/raw/curam/curamtestdata_21.xml"
}, 
{
"name": "MMIS", 
"user": "admin", 
"dateTime": "2018-07-09T13:23:34.647238Z", 
"action": "Harmonization", 
"rawDoc": "1191448670-3"
}
], 
"merges": [
{
"document-uri": "/person//raw/meds/medstestdata_21.xml.json"
}, 
{
"document-uri": "/person/1191448670-4.json"
}, 
{
"document-uri": "/person/1191448670-1.json"
}, 
{
"document-uri": "/person/1191448670-2.json"
}, 
{
"document-uri": "/person//raw/curam/curamtestdata_21.xml.json"
}, 
{
"document-uri": "/person/1191448670-3.json"
}
], 
"id": "ffcc223a-7479-4af0-9dfc-61797d8ebb84"
}, 
"triples": [], 
"instance": {
"MDM": {
"Person": {
"Person": {
"EligibilityHistory": {
"Eligibility": {
"DateEffectiveEligibility": "2014-12-31", 
"DateIneligible": "2015-12-31"
}
}
}
}
}
}
}

Make committing results optional for process:process-match-and-merge

As a developer it would be helpful to call process:process-match-and-merge with a $commit option to specify whether the merged result and collection changes should be committed or not. This would allow more rapid testing of changes to match and merge options as you wouldn't have to reset the collections on your data before each test.

I will work on building this and try to submit a PR for it soon.
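
To make the idea concrete, here is a rough sketch of how a commit flag could behave. It is standalone JavaScript rather than the library's XQuery, and everything in it is a stand-in: the function signature, the in-memory `store`, the placeholder merged URI, and the collection names are assumptions for illustration, not the actual process:process-match-and-merge implementation.

```javascript
// Hypothetical in-memory stand-in for the database: URI -> { content, collections }.
const store = new Map([
  ['/person/1.json', { content: { first: 'Ann' }, collections: ['mdm-content'] }],
  ['/person/2.json', { content: { last: 'Lee' }, collections: ['mdm-content'] }],
]);

// Computes the merged document and the collection changes for the given URIs.
// Writes happen only when `commit` is true; otherwise the plan is simply returned.
function processMatchAndMerge(db, uris, { commit = true } = {}) {
  const plan = {
    merged: {
      uri: '/merged/example.json', // placeholder; a real merge would generate an ID
      content: Object.assign({}, ...uris.map((uri) => db.get(uri).content)),
      collections: ['mdm-content', 'mdm-merged'], // illustrative collection names
    },
    archived: uris.map((uri) => ({ uri, collections: ['mdm-archived'] })),
  };
  if (commit) {
    const { uri, content, collections } = plan.merged;
    db.set(uri, { content, collections });
    for (const doc of plan.archived) db.get(doc.uri).collections = doc.collections;
  }
  return plan;
}

// Dry run: inspect the result without touching the stored documents.
const dryRun = processMatchAndMerge(store, ['/person/1.json', '/person/2.json'], { commit: false });
```

With `commit: false` the source documents keep their original collections, so the same test data can be reused after each change to the match or merge options.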

Enable users to control permissions during merging

Configuration will control the permissions that get applied to documents at various times. Configuration will be part of the merge options.

  <algorithms xmlns="http://marklogic.com/smart-mastering/merging">
    <permissions >
      <on-merge function="union" at="/some/dir/code.xqy" ns="some-namespace"/>
      <on-archive function="no-change" at="/some/dir/code.xqy" ns="some-namespace"/>
      <on-no-match function="no-change" at="/some/dir/code.xqy" ns="some-namespace"/>
      <on-notification function="union" at="/some/dir/code.xqy" ns="some-namespace"/>
    </permissions >
  </algorithms>

The on-merge strategy will determine what permissions are applied to newly created merged documents. Default strategy: union of all permissions on source documents, plus $const:CONTENT-COLL. Comment if there's interest in having an intersection plus $const:CONTENT-COLL strategy available out of the box.

The on-archive strategy will determine what permissions are applied to documents that get archived (merged into other documents). Default strategy: no change to permissions.

The on-no-match strategy will determine what permissions are applied to documents that are passed to process:process-match-and-merge but do not find any matches. Default strategy: no change to the document's permissions.

The on-notification strategy will determine what permissions are applied to newly created notification documents. Default strategy: notification documents will get the union of the source document permissions.

For each type of permission strategy, we'll define an API that can be used to make custom strategies.
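
The strategies described above can be sketched roughly as follows. This is standalone JavaScript, not the proposed XQuery API; the `{ role, capability }` shape and all function names are assumptions made for the example.

```javascript
// A permission is modeled here as { role, capability }; this shape is an assumption.
const key = (p) => `${p.role}:${p.capability}`;

// Union: every distinct permission found on any source document.
function unionPermissions(docs) {
  const seen = new Map();
  for (const perms of docs) {
    for (const p of perms) seen.set(key(p), p);
  }
  return [...seen.values()];
}

// Intersection: only the permissions present on every source document.
function intersectPermissions(docs) {
  if (docs.length === 0) return [];
  const [first, ...rest] = docs;
  const restKeys = rest.map((perms) => new Set(perms.map(key)));
  const seen = new Set();
  return first.filter((p) => {
    const k = key(p);
    if (seen.has(k)) return false; // drop duplicates within the first document
    seen.add(k);
    return restKeys.every((s) => s.has(k));
  });
}

// "no-change" keeps whatever the document already has.
const noChange = (existing) => existing;

const docA = [{ role: 'reader', capability: 'read' }, { role: 'editor', capability: 'update' }];
const docB = [{ role: 'reader', capability: 'read' }];
const unionResult = unionPermissions([docA, docB]);
const intersectionResult = intersectPermissions([docA, docB]);
```

In this example the union keeps both permissions, while the intersection keeps only the `reader` permission shared by both documents. A custom strategy would be any function with the same input and output shape.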

Matching fails due to a missing namespace in the match query

See attachments for input record and match config.

A request to the REST match endpoint returned 0 results and this query:

  cts:and-query((
    cts:true-query(), 
    cts:boost-query(
      cts:and-query((
        cts:collection-query("mdm-content"), 
        cts:not-query(cts:document-query("/mmisDoc/CRM/Person/6986792174.xml-0-1"), 1), 
        cts:or-query(
          cts:element-value-query(fn:QName("","B834-SOCIAL-SECURITY-NUMBER"), "500000001", ("case-insensitive","lang=en"), 0), ())), ()),
        cts:or-query(cts:element-value-query(fn:QName("http://marklogic.com/data-hub/envelope","B834-SOCIAL-SECURITY-NUMBER"), "500000001", ("case-insensitive","lang=en"), 20), ())
      )
      ), 
    ()) 

The query was actually taken from logging $query at line 300 of match-impl.xqy (I don't have the JSON query representation from the REST response).

Note that the namespace is missing from the first element-value-query on B834-SOCIAL-SECURITY-NUMBER. Adding the proper namespace fixes the query, and we are pretty sure the missing namespace causes the failure to match in the REST call.

deployMatchOptions FAILED

Team,

I am facing the issue below while running "gradle prepDemo" from \examples\smart-mastering.

Task :deployMatchOptions FAILED

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':deployMatchOptions'.

javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?

Are there any configuration changes that need to be made for this?

Documentation - required roles

Is there any collateral or documentation that lays out which roles and methods need to be in place to implement the weighting values? We would like to give it to potential customers so they know what resources an implementation requires.

Use path range indexes for double-metaphone

The double-metaphone algorithm uses an element range index to populate the dictionary and constructs an element value query. This allows for pulling in values from sources other than the intended target, if other elements share the same QName. Using path range indexes would be more precise.

Validate function for match and merge options

As a USER
I need to ensure the options files I have created are syntactically valid before I enter them
to help decrease errors and make troubleshooting bad match/merge options easier

This would ideally be a separate function that can be called by a user, but that would also be automatically invoked before saving any new match or merge options to prevent bad options from being persisted.
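
As a starting point for discussion, here is a minimal sketch of the kind of checks such a function might run against JSON match options. It is standalone JavaScript, the function name is hypothetical, and the checks are only examples (cross-references between scoring rules, property definitions, and algorithms, plus numeric weights); a real validator would likely check more.

```javascript
// Returns a list of human-readable problems; an empty list means the options
// passed these particular checks.
function validateMatchOptions(doc) {
  const errors = [];
  const options = doc && doc.options;
  if (!options) return ['missing top-level "options"'];

  // Treats a missing value as [], and a single object as a one-item list.
  const asArray = (v) => (v === undefined || v === null ? [] : Array.isArray(v) ? v : [v]);
  const propertyNames = new Set(
    asArray(options.propertyDefs && options.propertyDefs.property).map((p) => p.name)
  );
  const algorithmNames = new Set(
    asArray(options.algorithms && options.algorithms.algorithm).map((a) => a.name)
  );
  const scoring = options.scoring || {};

  // "add" and "expand" rules must point at a defined property and carry a numeric weight.
  for (const kind of ['add', 'expand']) {
    for (const rule of asArray(scoring[kind])) {
      if (!propertyNames.has(rule.propertyName)) {
        errors.push(`scoring.${kind}: unknown property "${rule.propertyName}"`);
      }
      if (Number.isNaN(Number(rule.weight))) {
        errors.push(`scoring.${kind}: weight "${rule.weight}" is not a number`);
      }
    }
  }
  // "expand" and "reduce" rules must point at a defined algorithm.
  for (const rule of [...asArray(scoring.expand), ...asArray(scoring.reduce)]) {
    if (!algorithmNames.has(rule.algorithmRef)) {
      errors.push(`unknown algorithmRef "${rule.algorithmRef}"`);
    }
  }
  return errors;
}

const problems = validateMatchOptions({
  options: {
    propertyDefs: { property: [{ name: 'ssn' }] },
    algorithms: { algorithm: [{ name: 'thesaurus' }] },
    scoring: {
      add: [{ propertyName: 'zip', weight: 'abc' }],
      expand: [{ propertyName: 'ssn', algorithmRef: 'nope', weight: '1' }],
    },
  },
});
```

Returning a list of problems rather than throwing on the first one lets the save path reject bad options while still showing the user everything that needs fixing.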

Add optional tracing to matching and merging operations

As a USER
I want the ability to troubleshoot why matching and merging isn't happening as I expect
and to do that, enabling a tracing mode that shows the values of all the inputs to matching and/or merging would be helpful (for example, everything that goes into the match-impl:find-document-matches-by-options() function for matching).

Attribution wrong where only one value present

Two documents get merged, but only one document has a value for a property. In some cases, the document without a value for the property will still have the XML to hold the property. When this happens, the auditing code sees that both have the property, and so it attributes the one value to both.

Solution: when considering attribution, check whether there is actually any content in the property XML.
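
A small sketch of the proposed check, in standalone JavaScript rather than the auditing code's XQuery. The `{ name, value }` source shape and both function names are made up for the example; an empty string stands in for a property element that is present but has no content.

```javascript
// True only when the property actually holds something, not just an empty placeholder.
function hasContent(value) {
  return value !== undefined && value !== null && String(value).trim() !== '';
}

// Returns the names of the sources that actually contributed the merged value.
function attributeValue(mergedValue, sources) {
  return sources
    .filter((s) => hasContent(s.value) && String(s.value) === String(mergedValue))
    .map((s) => s.name);
}

const contributors = attributeValue('F', [
  { name: 'MMIS', value: 'F' },
  { name: 'MEDS', value: '' }, // empty element: present, but no content
]);
// contributors is ['MMIS']
```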

Get 500 error when sending match options in POST body

I get the following error when passing a set of JSON options to the sm-match endpoint, regardless of whether I pass an rs:uri param or a document property in the body to match against. It works as expected when using the rs:options param.

Error:

{
    "errorResponse": {
        "statusCode": 500,
        "status": "Internal Server Error",
        "messageCode": "INTERNAL ERROR",
        "message": "XDMP-AS: (err:XPTY0004) $options as element(matcher:options) -- Invalid coercion: object-node{\"options\":object-node{\"propertyDefs\":object-node{\"property\":array-node{...}}, ...}} as element(matcher:options) . See the MarkLogic server error log for further detail."
    }
}

Postman request:

POST /v1/resources/sm-match?rs:uri=/CSV_FILE/Person/199.xml HTTP/1.1
Host: localhost:8011
Content-Type: application/json
Authorization: Digest username="admin", realm="public", nonce="", uri="/v1/resources/sm-match?rs:uri=/CSV_FILE/Person/199.xml", algorithm="MD5", response="70623c0eefe70ddb44108f31eeb0e458"
Cache-Control: no-cache
Postman-Token: c380c9d8-850a-4b3c-afb4-1d70eeb39221

{
  "options": {
    "propertyDefs": {
      "property": [
        { "namespace": "", "localname": "IdentificationID", "name": "ssn" },
        { "namespace": "", "localname": "PersonGivenName", "name": "first-name" },
        { "namespace": "", "localname": "PersonSurName", "name": "last-name" },
        { "namespace": "", "localname": "AddressPrivateMailboxText", "name": "addr1" },
        { "namespace": "", "localname": "LocationCity", "name": "city" },
        { "namespace": "", "localname": "LocationState", "name": "state" },
        { "namespace": "", "localname": "LocationPostalCode", "name": "zip" }
      ]
    },
    "algorithms": {
      "algorithm": [
        { "name": "std-reduce", "function": "standard-reduction" },
        { "name": "dbl-metaphone", "function": "double-metaphone" },
        { "name": "thesaurus", "function": "thesaurus" }
      ]
    },
    "scoring": {
      "add": [
        { "propertyName": "ssn", "weight": "50" },
        { "propertyName": "last-name", "weight": "8" },
        { "propertyName": "first-name", "weight": "6" },
        { "propertyName": "addr1", "weight": "5" },
        { "propertyName": "city", "weight": "3" },
        { "propertyName": "state", "weight": "1" },
        { "propertyName": "zip", "weight": "3" }
      ],
      "expand": [
        {
          "propertyName": "first-name",
          "algorithmRef": "thesaurus",
          "weight": "6",
          "thesaurus": "/mdm/config/thesauri/first-name-synonyms.xml",
          "distanceThreshold": "50"
        },
        {
          "propertyName": "last-name",
          "algorithmRef": "dbl-metaphone",
          "weight": "8",
          "dictionary": "name-dictionary.xml"
        }
      ],
      "reduce": [
        {
          "algorithmRef": "std-reduce",
          "weight": "4",
          "allMatch": { "property": ["last-name", "addr1"] }
        }
      ]
    },
    "actions": {
      "action": {
        "name": "my-custom-action",
        "function": "custom-action",
        "namespace": "http://marklogic.com/smart-mastering/action",
        "at": "/custom-action.xqy"
      }
    },
    "thresholds": {
      "threshold": [
        { "above": "30", "label": "Possible Match" },
        { "above": "50", "label": "Likely Match", "action": "notify" },
        { "above": "68", "label": "Definitive Match", "action": "merge" },
        { "above": "75", "label": "Custom Match", "action": "my-custom-action" }
      ]
    },
    "tuning": { "maxScan": "200" }
  }
}

Clarify when actions are run

When running match functions like findDocumentMatchesByOptionsName, the results indicate what documents matched, what threshold those matches hit, and what action should be taken with them. It does not run those actions, but provides enough information that the caller can run them. When calling processMatchAndMerge, the actions are run automatically.
Clarify the documentation to make it clear that the match functions are a read-only activity.

Error with rs:includeMatches param in sm-match endpoint

When using the rs:includeMatches param on the sm-match endpoint, I get the following error:

{
    "errorResponse": {
        "statusCode": 500,
        "status": "Internal Server Error",
        "messageCode": "INTERNAL ERROR",
        "message": "XDMP-AS: (err:XPTY0004) $include-matches as xs:boolean -- Invalid coercion: \"true\" as xs:boolean . See the MarkLogic server error log for further detail."
    }
}
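
The message suggests the parameter arrives as the string "true" and is handed to something that expects a real boolean. One way to handle that is an explicit conversion at the endpoint, sketched here in standalone JavaScript. The helper name is hypothetical, and this is not the endpoint's actual code.

```javascript
// Converts a request parameter (a string, or absent) into a boolean.
// Accepts the xs:boolean lexical forms: "true", "false", "1", "0".
function parseBooleanParam(raw, defaultValue = false) {
  if (raw === undefined || raw === null || raw === '') return defaultValue;
  const text = String(raw).trim();
  if (text === 'true' || text === '1') return true;
  if (text === 'false' || text === '0') return false;
  throw new TypeError(`expected a boolean parameter, got "${raw}"`);
}

const includeMatches = parseBooleanParam('true');
```

Rejecting anything outside the four accepted forms turns a typo such as `rs:includeMatches=yes` into a clear error instead of a silent `false`.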

Getting sources timestamps from JSON documents always returns null instead of the timestamp

The timestamp information in the merge-impl:get-sources function isn't returned correctly for JSON documents, as the ns:map is always populated, and (I'm guessing) the following xdmp:unpath call tries to grab a path with a namespace that doesn't exist on the JSON doc.

One option I tried was checking for the existence of any attributes on line 1066 (if fn:exists($ns-path/@*)), which seemed to work. If that is the right fix, the documentation should be updated to tell users not to define any namespace attributes on the $ns-path element when using JSON.

MDMImport input flow output collections are in upper case

The MDMImport input flow output collections are capitalized, like MDM-CONTENT, which makes the Harmonize flow fail. I think the collection names should be in lowercase, as they are in /com.marklogic.smart-mastering/constants.xqy.

Use consistent naming for similar functions

In matcher.xqy, we have

  • matcher:get-option-names-as-xml()
  • matcher:get-option-names-as-json()
  • matcher:get-options-as-xml($options-name as xs:string)
  • matcher:get-options-as-json($options-name as xs:string)

In merging.xqy, we have

  • merging:get-option-names($format as xs:string)
  • merging:get-options($options-name, $format as xs:string)

Change the matcher functions to use the same convention as merging. Mark the existing functions as deprecated for a couple releases, then remove.
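
The transition could look roughly like this, shown in standalone JavaScript for brevity. The camel-cased names, the hard-coded option names, and the output shapes are placeholders rather than the real matcher API; the point is only the pattern of one format-parameterized function plus thin deprecated wrappers.

```javascript
// Stand-in data; a real implementation would read saved options from the database.
const savedOptionNames = ['basic', 'strict'];

// New style: one function, with the format passed as a parameter (as in merging).
function getOptionNames(format) {
  if (format === 'json') return JSON.stringify({ options: savedOptionNames });
  if (format === 'xml') {
    return '<options>' + savedOptionNames.map((n) => `<option>${n}</option>`).join('') + '</options>';
  }
  throw new RangeError(`unsupported format: ${format}`);
}

// Old style, kept for a few releases: each wrapper warns, then delegates.
function deprecated(oldName, newCall, fn) {
  return (...args) => {
    console.warn(`${oldName} is deprecated; use ${newCall} instead`);
    return fn(...args);
  };
}

const getOptionNamesAsJson = deprecated(
  'getOptionNamesAsJson()', "getOptionNames('json')", () => getOptionNames('json')
);
const getOptionNamesAsXml = deprecated(
  'getOptionNamesAsXml()', "getOptionNames('xml')", () => getOptionNames('xml')
);
```

Because the old functions only delegate, both styles return identical results during the deprecation window, and removing the wrappers later does not touch the main code path.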

Add filter-query to match options

In order to make filter queries easier to use, allow them to be added to the match options. There would be a new filter-query element:

<filter-query xmlns="http://marklogic.com/smart-mastering/matcher">
  <collection>my-collection</collection>
  <directory>my-directory</directory>
</filter-query>

The collection and directory elements are short-cuts for defining cts:collection-query and cts:directory-query. Whatever children are added to filter-query will be combined using AND semantics (in other words, added to a cts:and-query).

A directory element may take a depth attribute, with values passed to the $depth parameter of cts:directory-query.

For more complex queries, filter-query may have a serialized cts query as a child element.

If a filter-query is provided in the match options and the $filter-query parameter is used when calling a match or match-and-merge function, the function parameter will override the configured query.
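
The combination and override rules above can be sketched as follows. This is standalone JavaScript that builds plain objects, not real cts queries. The config shape, the function names, and the `directory-query` object layout are assumptions made for the example.

```javascript
// Builds a plain-object description of the combined query from a filter-query config.
// config: { collection?: string[], directory?: { uri, depth? }[], queries?: object[] }
function buildFilterQuery(config) {
  const parts = [
    ...(config.collection || []).map((uri) => ({ 'collection-query': { uri } })),
    ...(config.directory || []).map(({ uri, depth }) => ({
      'directory-query': depth === undefined ? { uri } : { uri, depth },
    })),
    ...(config.queries || []), // already-serialized queries pass through unchanged
  ];
  // All children are combined with AND semantics.
  return { 'and-query': { queries: parts } };
}

// A query passed as a function parameter overrides the configured one.
function effectiveFilterQuery(paramQuery, options) {
  if (paramQuery !== undefined && paramQuery !== null) return paramQuery;
  return options.filterQuery ? buildFilterQuery(options.filterQuery) : null;
}

const options = {
  filterQuery: {
    collection: ['my-collection'],
    directory: [{ uri: 'my-directory', depth: 'infinity' }],
  },
};
const fromConfig = effectiveFilterQuery(undefined, options);
const fromParam = effectiveFilterQuery({ 'collection-query': { uri: 'override' } }, options);
```

Here `fromConfig` is the AND of the configured collection and directory queries, while `fromParam` is just the query supplied by the caller; the configured query is ignored entirely rather than merged with it.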

Publish and depend on smart-mastering-code like ml-unit-test

Should be able to follow the same approach as ml-unit-test, where a client project can use it like this:

mlRestApi "com.marklogic:smart-mastering-code:(version)"

And not have to do anything else.

Also, as part of publishing the project, jcenter requires there to be a "sources" jar, which is basically the same as the regular jar/zip - that's the approach I'm using for ml-unit-test.

I'll get a PR together for this soon.

merge.getOptionNames(constant.FORMATJSON) does not return all saved merge options

When running the following code (see below), one would expect to get all saved merge options, but it looks like it always returns only one entry (the first), even if multiple merge options have been saved.

let constSmartMastering = require('/com.marklogic.smart-mastering/constants.xqy');
let mergeSmartMastering = require('/com.marklogic.smart-mastering/merging.xqy');

mergeSmartMastering.getOptionNames(constSmartMastering.FORMATJSON)

sm-match only returns one result, and no matches

It seems that the /sm-match endpoint only returns one result, instead of all the results.

Note that the following response says there is a total of 8, but only shows a single result. Also, the sm-match endpoint used to include matches inside each result. I would have expected to see that here on PersonGivenName, but no matches are included.

{
  "results": [
    {
      "boost-query": {
        "or-query": {
          "queries": [
            {
              "element-value-query": {
                "element": [
                  "PersonGivenName"
                ],
                "option": [
                  "case-insensitive"
                ],
                "text": [
                  {
                    "_value": "maurise",
                    "lang": "en"
                  }
                ],
                "weight": 6
              }
            },
            {
              "element-value-query": {
                "element": [
                  "PersonGivenName"
                ],
                "option": [
                  "case-insensitive"
                ],
                "text": [
                  {
                    "_value": "Maurie",
                    "lang": "en"
                  },
                  {
                    "_value": "MURIAL",
                    "lang": "en"
                  },
                  {
                    "_value": "KRIS",
                    "lang": "en"
                  }
                ],
                "weight": 6
              }
            }
          ]
        }
      },
      "match-query": {
        "and-query": {
          "queries": [
            {
              "collection-query": {
                "uri": "mdm-content"
              }
            },
            {
              "or-query": {
                "queries": [
                  {
                    "element-value-query": {
                      "element": [
                        "PersonGivenName"
                      ],
                      "option": [
                        "case-insensitive"
                      ],
                      "text": [
                        {
                          "_value": "maurise",
                          "lang": "en"
                        }
                      ],
                      "weight": 0
                    }
                  },
                  {
                    "element-value-query": {
                      "element": [
                        "PersonGivenName"
                      ],
                      "option": [
                        "case-insensitive"
                      ],
                      "text": [
                        {
                          "_value": "Maurie",
                          "lang": "en"
                        },
                        {
                          "_value": "MURIAL",
                          "lang": "en"
                        },
                        {
                          "_value": "KRIS",
                          "lang": "en"
                        }
                      ],
                      "weight": 0
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "page-length": "200",
      "result": {
        "action": "",
        "index": "8",
        "score": "6",
        "threshold": "Possible Match",
        "uri": "/person/1234068650-2.json"
      },
      "start": "1",
      "total": "8"
    }
  ]
}

property-history returns empty strings for simple JSON fields

I am about to file a PR for this.

We discovered that sm-history-properties was returning empty-string keys for certain fields when the merged field was the result of simple JSON fields.

For example, it would return:

{
  ...,
  "PersonGender": {"": { ... } }
  ...
}

if the field in the document was simple:

{
  envelope: {
    headers: { ... },
    instance: {
      ...,
      "PersonGender": "F"
    }
  }
}

The desired result would have been:

{
  ...,
  "PersonGender": {"F": { ... } }
  ...
}

It did work for JSON fields where the value was an object.
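
To illustrate the expected behavior, here is a tiny standalone JavaScript sketch that keys the history entry by the field's value. The function name and the `details` payload are invented for the example, and keying object values by their JSON serialization is just a choice made for this sketch, not a description of what the library does.

```javascript
// Builds { property: { value: details } } for one merged property.
// Simple values (string, number, boolean) are keyed by their string form;
// object values are keyed by their JSON serialization (a choice made for this sketch).
function historyEntry(propertyName, value, details) {
  const isSimple = value === null || typeof value !== 'object';
  const valueKey = isSimple ? String(value) : JSON.stringify(value);
  return { [propertyName]: { [valueKey]: details } };
}

const entry = historyEntry('PersonGender', 'F', { sources: ['MMIS'] });
// entry is { PersonGender: { F: { sources: ['MMIS'] } } }
```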

node() instead of object-node()

These two lines should coerce to node() instead of object-node(), because they currently generate the following error:

matcher:results-to-json($results) -- Invalid coercion: document{object-node{"results":array-node{object-node{"total":text{"8"}, "page-length":text{"200"}, ...}}}} as object-node()
https://github.com/marklogic-community/smart-mastering-core/blob/develop/src/main/ml-modules/root/com.marklogic.smart-mastering/matcher-impl/matcher-impl.xqy#L415

and

https://github.com/marklogic-community/smart-mastering-core/blob/develop/src/main/ml-modules/root/com.marklogic.smart-mastering/matcher.xqy#L208
