datasette-reconcile

Adds a reconciliation API endpoint to Datasette, based on the Reconciliation Service API specification.

The reconciliation API is used to match a set of strings to their correct identifiers, to help with disambiguation and consistency in large datasets. For example, the strings "United Kingdom", "United Kingdom of Great Britain and Northern Ireland" and "UK" could all be used to identify the country which has the ISO country code GB. It is most notably implemented in OpenRefine.

The plugin adds a /-/reconcile endpoint to a table served by Datasette, which responds according to the Reconciliation Service API specification. To activate this endpoint you need to configure the reconciliation service, as described in the Usage section.

Installation

Install this plugin in the same environment as Datasette.

$ datasette install datasette-reconcile

Usage

Plugin configuration

The plugin should be configured using Datasette's metadata.json file. The configuration can be placed at the root, database or table level of metadata.json. For most use cases it makes most sense to configure it at the table level.

Add a datasette-reconcile object under plugins in metadata.json. This should look something like:

{
  "databases": {
    "sf-trees": {
      "tables": {
        "Street_Tree_List": {
          "plugins": {
            "datasette-reconcile": {
              "id_field": "id",
              "name_field": "name",
              "type_field": "type",
              "type_default": [
                {
                  "id": "tree",
                  "name": "Tree"
                }
              ],
              "max_limit": 5,
              "service_name": "Tree reconciliation",
              "view_url": "https://example.com/trees/{{id}}"
            }
          }
        }
      }
    }
  }
}

The only required item in the configuration is name_field. This refers to the field in the table which will be searched to match the query text.

The rest of the configuration items are optional, and are as follows:

  • id_field: The field containing the identifier for this entity. If not provided, and there is a primary key set, then the primary key will be used. A primary key of more than one field will give an error.
  • type_field: If provided, this field will be used to determine the type of the entity. If not provided, then the type_default setting will be used instead.
  • type_default: If provided, this value will be used as the type of every entity returned. If not provided the default of Object will be used for every entity.
  • max_limit: The maximum number of records that a query can request to return. This is 5 by default. An individual query can request fewer results than this, but it cannot request more.
  • service_name: The name of the reconciliation service that will appear in the service manifest. If not provided it will take the form <database name> <table name> reconciliation.
  • identifierSpace: Identifier space given in the service manifest. If not provided a default of http://rdf.freebase.com/ns/type.object.id is used.
  • schemaSpace: Schema space given in the service manifest. If not provided a default of http://rdf.freebase.com/ns/type.object.id is used.
  • view_url: URL for a view of an individual entity. It must contain the string {{id}} which will be replaced with the ID of the entity. If not provided it will use the default datasette view for the entity record (something like /<db_name>/<table>/{{id}}).
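
The defaults listed above can be summarised in a short sketch. This is not the plugin's own code: with_documented_defaults is a hypothetical helper, and it leaves out id_field and type_default because their defaults depend on the table's primary key and on how the Object type is represented internally.

```python
from typing import Any, Dict

# Defaults as documented in the list above.
DEFAULT_MAX_LIMIT = 5
DEFAULT_SPACE = "http://rdf.freebase.com/ns/type.object.id"


def with_documented_defaults(
    config: Dict[str, Any], database: str, table: str
) -> Dict[str, Any]:
    """Hypothetical helper: fill in the documented defaults for missing keys."""
    if "name_field" not in config:
        # name_field is the only required configuration item.
        raise ValueError("name_field is required")
    resolved = {
        "max_limit": DEFAULT_MAX_LIMIT,
        "service_name": f"{database} {table} reconciliation",
        "identifierSpace": DEFAULT_SPACE,
        "schemaSpace": DEFAULT_SPACE,
        # Assumption: the default view follows the /<db_name>/<table>/{{id}} pattern.
        "view_url": f"/{database}/{table}/{{{{id}}}}",
    }
    resolved.update(config)  # explicit settings win over defaults
    return resolved


config = with_documented_defaults(
    {"name_field": "name"}, "sf-trees", "Street_Tree_List"
)
print(config["service_name"])  # sf-trees Street_Tree_List reconciliation
```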

Using the endpoint

Once the plugin is configured for a particular database or table, you can access the reconciliation endpoint using the URL /<db_name>/<table>/-/reconcile.

A simple GET request to /<db_name>/<table>/-/reconcile will return the Service Manifest as JSON which reconciliation clients can use to determine how the service is set up.

A POST request to the same URL with the queries argument set will trigger the reconciliation process. The queries parameter should be a JSON object in the format described in the specification. An example set of two queries would look like:

{
  "q1": {
    "query": "Hans-Eberhard Urbaniak"
  },
  "q2": {
    "query": "Ernst Schwanhold"
  }
}
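
As a rough illustration, a client could package these queries into a POST request using only the Python standard library. The sketch assumes the queries argument is sent as a form-encoded field, and the host, database and table names are placeholders. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_reconcile_post(endpoint: str, queries: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request with a form-encoded queries field."""
    body = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


queries = {
    "q1": {"query": "Hans-Eberhard Urbaniak"},
    # limit is optional and must not exceed the configured max_limit.
    "q2": {"query": "Ernst Schwanhold", "limit": 3},
}
# Hypothetical local Datasette instance, database and table names.
request = build_reconcile_post(
    "http://localhost:8001/mydb/mytable/-/reconcile", queries
)
# To send it: urllib.request.urlopen(request).read()
```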

The query can optionally be encoded as a queries parameter in a GET request. For example:

/<db_name>/<table>/-/reconcile?queries={"q1":{"query":"Hans-Eberhard Urbaniak"},"q2":{"query": "Ernst Schwanhold"}}
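
The URL above is shown unencoded for readability. In practice a client would normally percent-encode the JSON before placing it in the query string. A minimal sketch, with placeholder database and table names:

```python
import json
import urllib.parse


def reconcile_get_url(endpoint: str, queries: dict) -> str:
    """Return the endpoint URL with queries percent-encoded as a GET parameter."""
    compact = json.dumps(queries, separators=(",", ":"))
    return f"{endpoint}?queries={urllib.parse.quote(compact, safe='')}"


url = reconcile_get_url(
    "/mydb/mytable/-/reconcile",  # hypothetical database and table names
    {"q1": {"query": "Ernst Schwanhold"}},
)
print(url)
```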

Various options are available in the query object. Currently, the only ones implemented in datasette-reconcile are the mandatory query string and the limit option, which must be less than or equal to the value of the max_limit configuration option.

All endpoints that start with /<db_name>/<table>/-/reconcile are configured to send an Access-Control-Allow-Origin: * CORS header to allow access as described in the specification.

JSONP output is not yet supported.

Returned value

The result of the GET or POST queries requests described above is a JSON object describing potential reconciliation candidates for each of the queries specified. The result will look something like:

{
  "q1": {
    "result": [
      {
        "id": "120333937",
        "name": "Urbaniak, Regina",
        "score": 53.015232,
        "match": false,
        "type": [
          {
            "id": "person",
            "name": "Person"
          }
        ]
      },
      {
        "id": "1127147390",
        "name": "Urbaniak, Jan",
        "score": 52.357353,
        "match": false,
        "type": [
          {
            "id": "person",
            "name": "Person"
          }
        ]
      }
    ]
  },
  "q2": {
    "result": [
      {
        "id": "123064325",
        "name": "Schwanhold, Ernst",
        "score": 86.43497,
        "match": true,
        "type": [
          {
            "id": "person",
            "name": "Person"
          }
        ]
      },
      {
        "id": "116362988X",
        "name": "Schwanhold, Nadine",
        "score": 62.04763,
        "match": false,
        "type": [
          {
            "id": "person",
            "name": "Person"
          }
        ]
      }
    ]
  }
}
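
A client will often want just the confident matches from a response of this shape. The following sketch (best_matches is a hypothetical helper, not part of the plugin) keeps the highest-scoring candidate flagged as a match for each query, or None where no candidate is flagged:

```python
from typing import Dict, Optional


def best_matches(response: Dict[str, dict]) -> Dict[str, Optional[str]]:
    """Map each query key to the id of its top candidate with match=true, or None."""
    chosen = {}
    for key, payload in response.items():
        matched = [c for c in payload.get("result", []) if c.get("match")]
        top = max(matched, key=lambda c: c["score"], default=None)
        chosen[key] = top["id"] if top else None
    return chosen


# Trimmed version of the response shown above.
response = {
    "q1": {"result": [
        {"id": "120333937", "name": "Urbaniak, Regina", "score": 53.015232, "match": False},
    ]},
    "q2": {"result": [
        {"id": "123064325", "name": "Schwanhold, Ernst", "score": 86.43497, "match": True},
        {"id": "116362988X", "name": "Schwanhold, Nadine", "score": 62.04763, "match": False},
    ]},
}
print(best_matches(response))  # {'q1': None, 'q2': '123064325'}
```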

Behind the scenes

The reconcile engine works by performing an SQL query against the name_field within the specified database table. Where that table has a full text search index implemented, the search will be performed against that index.

When a full text search index is present on the table, the SQL query takes the following form. This example is for the search text test; note that double quotes are added around the text to facilitate searching, and are not present in the original query:

select <id_field>, <name_field>
from <table>
  inner join (
    select "rowid", "rank"
    from <fts_table>
    where <fts_table> MATCH '"test"'
  ) as "a" on <table>."rowid" = a."rowid"
order by a.rank
limit 5
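
The quoting step in the query above can be sketched as a small helper. The name fts_phrase is hypothetical, and doubling any embedded double quotes is an assumption based on SQLite's string escaping rules rather than something the plugin is documented to do.

```python
def fts_phrase(query: str) -> str:
    """Wrap free text in double quotes so it is searched as a single phrase.

    Doubling embedded double quotes is an assumption about escaping,
    not behaviour confirmed for datasette-reconcile.
    """
    return '"' + query.replace('"', '""') + '"'


print(fts_phrase("test"))  # "test"
```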

If a full text search index is not present, the query looks like this (note that the wildcard % is added to either side of the query - these are not present in the original query):

select <id_field>, <name_field>
from <table>
where <name_field> like '%test%'
limit 5
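
To see the fallback query in action, here is a minimal sketch against an in-memory SQLite database. The people table and its rows are invented for the demonstration, and like_search is a hypothetical helper. The search text is bound as a parameter, while the table and field names are interpolated and so must come from trusted configuration.

```python
import sqlite3


def like_search(conn, table, id_field, name_field, text, limit=5):
    """Run the LIKE fallback with the search text bound as a parameter."""
    sql = (
        f'select "{id_field}", "{name_field}" '
        f'from "{table}" '
        f'where "{name_field}" like ? '
        "limit ?"
    )
    # The % wildcards are added here, not by the caller.
    return conn.execute(sql, (f"%{text}%", limit)).fetchall()


# Hypothetical in-memory table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("create table people (id text primary key, name text)")
conn.executemany(
    "insert into people values (?, ?)",
    [("1", "a test row"), ("2", "unrelated"), ("3", "Testing")],
)
rows = like_search(conn, "people", "id", "name", "test")
print(sorted(rows))  # [('1', 'a test row'), ('3', 'Testing')]
```

SQLite's LIKE is case-insensitive for ASCII characters by default, which is why "Testing" is returned as well.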

Extend endpoint

You can also use the reconciliation API Data extension service to find additional properties for a set of entities, given an ID.

Send a GET request to the /<db_name>/<table>/-/reconcile/extend/propose endpoint to find a list of the possible properties you can select. The properties are all the columns in the table (excluding any that have been hidden). An example response would look like:

{
  "limit": 5,
  "type": "Person",
  "properties": [
    {
      "id": "preferredName",
      "name": "preferredName"
    },
    {
      "id": "professionOrOccupation",
      "name": "professionOrOccupation"
    },
    {
      "id": "wikidataId",
      "name": "wikidataId"
    }
  ]
}

Then send a POST request to the /<db_name>/<table>/-/reconcile endpoint with an extend argument. The extend argument should be a JSON object with a set of ids to look up and properties to return. For example:

{
  "ids": ["10662041X", "1064905412"],
  "properties": [
    {
      "id": "professionOrOccupation"
    },
    {
      "id": "wikidataId"
    }
  ]
}

The endpoint will return a result that looks like:

{
  "meta": [
    {
      "id": "professionOrOccupation",
      "name": "professionOrOccupation"
    },
    {
      "id": "wikidataId",
      "name": "wikidataId"
    }
  ],
  "rows": {
    "10662041X": {
      "professionOrOccupation": [
        {
          "str": "Doctor"
        }
      ],
      "wikidataId": [
        {
          "str": "Q3874347"
        }
      ]
    },
    "1064905412": {
      "professionOrOccupation": [
        {
          "str": "Architect"
        }
      ],
      "wikidataId": [
        {
          "str": "Q3874347"
        }
      ]
    }
  }
}
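
The nested rows structure is convenient for clients such as OpenRefine but awkward for quick scripting. As a sketch, a hypothetical helper could collapse it into plain lists of strings. It only handles the "str" values shown above and silently skips any other value kinds.

```python
from typing import Dict, List


def flatten_extend_rows(response: dict) -> Dict[str, Dict[str, List[str]]]:
    """Collapse each {"str": ...} cell into a string, keyed by entity id then property id."""
    property_ids = [prop["id"] for prop in response["meta"]]
    flat = {}
    for entity_id, cells in response["rows"].items():
        flat[entity_id] = {
            pid: [value["str"] for value in cells.get(pid, []) if "str" in value]
            for pid in property_ids
        }
    return flat


# Trimmed version of the response shown above.
response = {
    "meta": [{"id": "professionOrOccupation", "name": "professionOrOccupation"}],
    "rows": {
        "10662041X": {"professionOrOccupation": [{"str": "Doctor"}]},
        "1064905412": {"professionOrOccupation": [{"str": "Architect"}]},
    },
}
print(flatten_extend_rows(response)["10662041X"])
# {'professionOrOccupation': ['Doctor']}
```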

Suggest endpoints

You can also use the suggest endpoints to get quick suggestions, for example for an auto-complete dropdown menu. The following endpoints are available:

  • /<db_name>/<table>/-/reconcile/suggest/property - looks up in a list of table columns
  • /<db_name>/<table>/-/reconcile/suggest/entity - looks up in a list of table rows
  • /<db_name>/<table>/-/reconcile/suggest/type - not currently implemented

Each endpoint takes a prefix argument which can be used in a GET request. For example, the GET request /<db_name>/<table>/-/reconcile/suggest/entity?prefix=abc will produce a response such as:

{
  "result": [
    {
      "name": "abc company limited",
      "id": "Q123456"
    },
    {
      "name": "abc other company limited",
      "id": "Q123457"
    }
  ]
}
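
For an auto-complete widget, the client side could be sketched as follows. Both helpers are hypothetical and the database and table names are placeholders. Only the URL pattern and the response shape come from the description above.

```python
import urllib.parse
from typing import List

SUGGEST_KINDS = ("property", "entity", "type")


def suggest_url(table_url: str, kind: str, prefix: str) -> str:
    """Build a suggest URL such as <table_url>/-/reconcile/suggest/entity?prefix=abc."""
    if kind not in SUGGEST_KINDS:
        raise ValueError(f"unknown suggest endpoint: {kind}")
    query = urllib.parse.urlencode({"prefix": prefix})
    return f"{table_url.rstrip('/')}/-/reconcile/suggest/{kind}?{query}"


def suggestion_names(response: dict) -> List[str]:
    """Return the display names to show in a dropdown."""
    return [item["name"] for item in response.get("result", [])]


print(suggest_url("/mydb/mytable", "entity", "abc"))
# /mydb/mytable/-/reconcile/suggest/entity?prefix=abc
```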

Development

This plugin uses hatch for build and testing. To set up this plugin locally, first check out the code.

You'll need to fetch the git submodules for the tests too:

git submodule init
git submodule update

To run the tests:

hatch run test

To run the tests and then report on coverage:

hatch run cov

To run the tests and then start a server showing where coverage is missing:

hatch run cov-html

Linting/formatting

Black and ruff should be run before committing any changes.

To check for any changes needed:

hatch run lint:style

To run any autoformatting possible:

hatch run lint:fmt

Publish to PyPI

hatch build
hatch publish
git tag v<VERSION_NUMBER>
git push origin v<VERSION_NUMBER>

Acknowledgements

Thanks to @simonw for developing Datasette and the Datasette ecosystem.

Other contributions from:

Contributors

drkane, jbpressac, nicokant

datasette-reconcile's Issues

Add matching features

https://reconciliation-api.github.io/specs/latest/#dfn-matching-feature

A matching feature is a numerical or boolean value which can be used to determine how likely it is for the candidate to be the correct entity. It contains the following fields:

id
A string which identifies the feature, such as "name_tfidf" or "pagerank". This id must be unique among all the matching features returned for a given candidate;
value
The value of the feature for the candidate, which can be any boolean or numerical value.

Multiple matching features are often used in combination to provide the final matching score (available in the score field). By exposing individual features in their responses, services make it possible for clients to compute matching scores which fit their use cases better.

view_url datasette-compatible escaping

Hi! Thanks for sharing such a useful plugin :)

I found out that IDs containing special characters are not escaped correctly in the URL when the endpoint is datasette itself (which is the default). For example, datasette will fail to load http://localhost/db/table/FA3/N/819, but it would work with http://localhost/db/table/FA3~2FN~2F819

I think an alternative variable should be made available, which is escaped according to Datasette, which should also be the default.

def tilde_encode(s: str) -> str:
    "Returns tilde-encoded string - for example ``/foo/bar`` -> ``~2Ffoo~2Fbar``"
    return "".join(_tilde_encoder(char) for char in s.encode("utf-8"))

-- https://github.com/simonw/datasette/blob/f0fadc28ddb9f82e5cc1ecaa51e8a342eb6dc528/datasette/utils/__init__.py#L1166-L1168
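
To make the proposal concrete, here is a simplified, self-contained stand-in for the quoted function. It is a sketch, not Datasette's implementation: the set of characters left unescaped and the handling of spaces are assumptions, and datasette_row_url is a hypothetical helper.

```python
import string

# Assumption: these bytes are passed through unchanged.
_SAFE = frozenset((string.ascii_letters + string.digits + "_-").encode("ascii"))


def tilde_encode_sketch(value: str) -> str:
    """Simplified stand-in for Datasette's tilde_encode."""
    out = []
    for byte in value.encode("utf-8"):
        if byte in _SAFE:
            out.append(chr(byte))
        elif byte == 0x20:
            out.append("+")  # assumption: spaces become "+"
        else:
            out.append(f"~{byte:02X}")
    return "".join(out)


def datasette_row_url(database: str, table: str, row_id: str) -> str:
    """Hypothetical helper: build a default row URL with an escaped id."""
    return f"/{database}/{table}/{tilde_encode_sketch(row_id)}"


print(datasette_row_url("db", "table", "FA3/N/819"))  # /db/table/FA3~2FN~2F819
```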

Implement types

Currently types are included as strings, and are not used in filtering results.

According to the spec, types should be implemented as a JSON object with id and name fields.

Problem to use the plugin

Hello,
Thank you for your plugin. I successfully installed the plugin with Datasette 0.55 and modified the metadata.json file to add a reconciliation service on one table.

The /-/reconcile URL returns:

{
  "versions": [
    "0.1",
    "0.2"
  ],
  "name": "PRELIB Personnes reconciliation",
  "identifierSpace": "http://rdf.freebase.com/ns/type.object.id",
  "schemaSpace": "http://rdf.freebase.com/ns/type.object.id",
  "defaultTypes": "personne",
  "view": "/prelib/prelib_personne/{{id}}"
}

And /-/reconcile?queries={"q1":{"query":"Christophe"},"q2":{"query": "Hers"}} successfully returns candidates.

However, when I try to add the service in OpenRefine (Reconcile > Start reconciling > Add Standard Service and add the /-/reconcile URL), OpenRefine displays an error message: Error contacting recon service: parsererror : Error: jQuery111105708109859784188_1614767957684 was not called -

Do you have any suggestions to solve the problem?

Thank you,

fts_table setting is ignored

The fts_table setting in metadata.json is ignored; the plugin uses db.fts_table(table), which introspects the database to find FTS tables.

Example:

{
    "databases": {
        "common": {
            "tables": {
                "species": {
                    "fts_table": "species_fts",
                    "plugins": {
                        "datasette-reconcile": {
                            "name_field": "ValidScientificName"
                        }
                    }
                }
            }
        }
    }
}

Allow filtering by properties

https://reconciliation-api.github.io/specs/latest/#structure-of-a-reconciliation-query

properties
Optionally, a map from property identifiers to a list of property values (or list of property values). These are used to further filter the set of candidates (similar to a WHERE clause in SQL), by allowing clients to specify other attributes of entities that should match, beyond their name in the query field. How reconciliation services handle this further restriction ("must match all properties" or "should match some") and how it affects the score, is up to the service;

view URL should be absolute

Hello,
Thank you for the v0.2.0 release. It works fine with OpenRefine. However, the suggested alignments have no active links (the <a> tags have no href):

[Screenshot: 2021-03-30 10_10_20-Window]

This is probably due to the fact that the url of the view in the manifest should be absolute:

"view": {"url": "http://www.example.com/database/table/{{id}}"}

the current version looks like:

"view": {"url": "/database/table/{{id}}"}

This might be solved by using https://docs.datasette.io/en/stable/internals.html?highlight=absolute#absolute-url-request-path in get_view_url:

def get_view_url(ds, database, table):
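
Independently of Datasette's own helper, the intended result can be sketched with plain string handling. absolute_view_url is a hypothetical function and the base URL is a placeholder. The sketch only illustrates turning the relative template into an absolute one while leaving the {{id}} placeholder untouched.

```python
def absolute_view_url(base_url: str, view_path: str) -> str:
    """Prefix a relative view template with a scheme and host, keeping {{id}} intact."""
    if view_path.startswith(("http://", "https://")):
        return view_path  # already absolute, leave as configured
    return base_url.rstrip("/") + "/" + view_path.lstrip("/")


print(absolute_view_url("http://www.example.com", "/database/table/{{id}}"))
# http://www.example.com/database/table/{{id}}
```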

Additional parameter to configure the view

Hello,
As with the Wikidata reconciliation API, the view template for entities may not be on the same domain as the reconciliation API (OpenRefine displays Wikidata candidates with a URL such as https://www.wikidata.org/wiki/{{id}}). Would it be possible to add a new plugin configuration item so that the "view" uses a combination of a domain given as a parameter and the id_field, instead of the Datasette instance URL (as shown below)?

In short, would it be possible to modify the domain of the candidates' URLs:

[Screenshot: 2021-05-28 18_32_04-Diaz Levriou troet xlsx - OpenRefine]

Thanks,

'Add standard service' stuck to guess-types-of-column

Hello,
Following #13 (thank you for solving the issue), I have a new problem: when I try to add the service in OpenRefine (Reconcile > Start reconciling > Add Standard Service and add the /-/reconcile URL), OpenRefine displays a spinner and nothing happens. The OpenRefine console is stuck on the instruction: POST /command/core/guess-types-of-column.

I can send you the URL of the reconciliation service by email, if you wish (I would like to keep this URL private).

Thank you,

Clarify URL pattern to use

Currently uses /<database>/<table>/reconcile - but this could conflict in the case when a table has a row with id "reconcile".
