
TypeDB-ML is the Machine Learning integrations library for TypeDB

Home Page: https://vaticle.com

License: Apache License 2.0


typedb-ml's Introduction

This repository is outdated and no longer supported. We will be closing it by the end of 2023.



TypeDB-ML

Previously known as KGLIB.

TypeDB-ML provides tools to enable graph algorithms and machine learning with TypeDB.

There are integrations for NetworkX and for PyTorch Geometric (PyG).

The NetworkX integration lets you use NetworkX's large library of graph algorithms over graph data exported from TypeDB.

The PyTorch Geometric (PyG) integration gives you a toolbox to build Graph Neural Networks (GNNs) for your TypeDB data, with an example included for link prediction (or, in TypeDB terms, binary relation prediction). The structure of the GNNs is fully customisable, with network components for popular techniques such as graph attention and graph transformers built in.

Features

NetworkX

  • Declare the graph structure of your queries, with optional sampling functions.
  • Query a TypeDB instance and combine many results across many queries into a single graph (build_graph_from_queries).
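Below is a minimal sketch of how these two features fit together. The README names only build_graph_from_queries; the import path, the sampler argument and the variable-graph convention shown here are assumptions to check against the package source.

# Sketch only: the import path and exact signature of build_graph_from_queries
# are assumptions; verify them against the installed typedb-ml package.
import networkx as nx
from typedb.client import TypeDB, SessionType, TransactionType
from typedb_ml.networkx.query_graph import build_graph_from_queries  # hypothetical path

# Declare the graph structure of a query: query variables become nodes and edges.
query = "match $p isa person, has name $n;"
variable_graph = nx.MultiDiGraph()
variable_graph.add_edge("p", "n", type="has")

with TypeDB.core_client("localhost:1729") as client:
    with client.session("social_network", SessionType.DATA) as session:  # hypothetical database
        with session.transaction(TransactionType.READ) as tx:
            # One (query, sampler, variable_graph) tuple per query; identity sampler here.
            graph = build_graph_from_queries([(query, lambda results: results, variable_graph)], tx)
            print(graph.number_of_nodes())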

PyTorch Geometric

  • A DataSet object to lazily load graphs from a TypeDB instance. Each graph is converted to a PyG Data object.
  • It's most natural to work with PyG HeteroData objects, since all data in TypeDB has a type. Conversion from Data to HeteroData is available in PyG, but it loses node ordering information. To remedy this, TypeDB-ML provides store_concepts_by_type to store concepts in an order consistent with a HeteroData object. This enables concepts to be properly re-associated with predictions after learning has finished.
  • A FeatureEncoder to orchestrate encoders to generate features for graphs.
  • Encoders for Continuous and Categorical values to apply encodings/embedding spaces to the types and attribute values present in TypeDB data.
  • A full example for link prediction.
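For orientation, here is a generic, self-contained sketch of the Data-to-HeteroData conversion referred to above, using synthetic tensors in place of what DataSet and FeatureEncoder would produce; the type names are invented for illustration.

# Synthetic stand-in for a graph that typedb-ml's DataSet would load from TypeDB.
import torch
from torch_geometric.data import Data

x = torch.randn(4, 3)                              # 4 nodes with 3 features each
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])  # 3 directed edges
node_type = torch.tensor([0, 0, 0, 1])             # nodes 0-2 are "person", node 3 "company"
edge_type = torch.tensor([0, 0, 1])                # "knows" and "works_at" edge types

data = Data(x=x, edge_index=edge_index)
hetero = data.to_heterogeneous(
    node_type=node_type, edge_type=edge_type,
    node_type_names=["person", "company"],
    edge_type_names=[("person", "knows", "person"),
                     ("person", "works_at", "company")])
print(hetero)

Note that the conversion re-indexes nodes within each type, which is exactly the node ordering information that store_concepts_by_type is designed to preserve.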

Other

  • Example usage of Tensorboard for PyG HeteroData

Resources

You may find the following resources useful, particularly to understand why TypeDB-ML started:

Quickstart

Install

  • Python >= 3.7.x

  • Grab the requirements.txt file from here and install the requirements with pip install -r requirements.txt. This is necessary due to some intricacies of installing PyG's dependencies; see here for details.

  • Install TypeDB-ML: pip install typedb-ml.

  • TypeDB 2.11.1 running in the background.

  • typedb-client-python 2.11.x (PyPI, GitHub release). This should be installed automatically when you pip install typedb-ml.

Run the Example

Take a look at the PyTorch Geometric heterogeneous link prediction example to see how to use TypeDB-ML to build a GNN on TypeDB data.

Development

To follow the development conversation, please join the Vaticle Discord, and join the #typedb-ml channel. Alternatively, start a new topic on the Vaticle Discussion Forum.

TypeDB-ML requires that you have migrated your data into a TypeDB or TypeDB Cluster instance. There is an official examples repo for how to go about this, and information available on migration in the docs. Alternatively, there are fantastic community-led projects growing in the TypeDB OSI to facilitate fast and easy data loading, for example TypeDB Loader.

Building from Source

It's expected that you will install via pip, but should you need to make your own changes to the library and import it into your project, you can build from source as follows:

Clone TypeDB-ML:

git clone git@github.com:vaticle/typedb-ml.git

Go into the project directory:

cd typedb-ml

Build all targets:

bazel build //...

Run all tests. Requires Python 3.7+ on your PATH. Test dependencies are for Linux since that is the CI environment:

bazel test //typedb_ml/... --test_output=streamed --spawn_strategy=standalone --action_env=PATH

Build the pip distribution. Outputs to bazel-bin:

bazel build //:assemble-pip

typedb-ml's People

Contributors

dmitrii-ubskii, flyingsilverfin, gowtham1997, grabl, haikalpribadi, jamesreprise, jmsfltchr, lolski, trellixvulnteam, vmax


typedb-ml's Issues

Unapproved but successful workflows should show as green

Problem to Solve

A CI workflow which reaches an approval step, where approval is not given, is indicated as pending (orange). This is misleading as it gives the impression that tests have not passed. Approval is only given when releasing, so this problem is very common.

Current Workaround

Inspect the pending flag to see how far the workflow progressed

Proposed Solution

Use a custom approval system

Sync dependencies on graknlabs repos

Problem to Solve

KGLIB should be automatically updated to depend upon the latest commits of the graknlabs repos it uses.

Current Workaround

Currently the dependencies must be updated manually.

Proposed Solution

Have Grabl automatically update the commits that are depended upon.

Prototype a Concept Feature Embedding Framework

The objective is to prototype a method of building vector representations of Concepts in a Knowledge Graph. These vectors can then subsequently be used in machine learning pipelines in order to perform learning across the graph.

Encode traversal raw data into float tensors

Ingest data that describes the traversals from a batch of starting concepts and build float tensors to feed into the main trunk of the pipeline.

Requires:

  • Schema type encoder
  • Role type encoder
  • Role direction encoder
  • Long encoder
  • Double encoder
  • Boolean encoder
  • Date encoder
  • String encoder - potentially using a drop-in from TensorFlow Hub

Needed by #13
Needs #17
Needs #15
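A minimal sketch of the first of these, a schema type encoder that one-hot encodes type labels into a float tensor (a NumPy stand-in for the TensorFlow pipeline; the labels are illustrative):

import numpy as np

def encode_types(type_labels, all_types):
    # Map each schema type label to a fixed column, then one-hot encode.
    index = {t: i for i, t in enumerate(sorted(all_types))}
    out = np.zeros((len(type_labels), len(index)), dtype=np.float32)
    for row, label in enumerate(type_labels):
        out[row, index[label]] = 1.0
    return out

print(encode_types(["person", "company", "person"], {"person", "company", "employment"}))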

Use cached test results for unchanged source code

Problem to Solve

All tests rerun from cold, which is unnecessary if large amounts of source code are unchanged.

Current Workaround

Ignore this computational penalty

Proposed Solution

Use RBE with bazel to cache test results

Machine Learning Research

This issue was originally posted by @jmsfltchr on 2018-08-31 17:29.

Why: To explore the benefits of combining Grakn with Machine Learning.
How: Investigate the integration of machine learning with Grakn.
Efforts to include:

  • Feature extraction from Grakn (or equivalent)
  • Running ML at query-time, triggered by querying, therefore also by reasoning

This issue needs #9, needs #8, needs #4, needs #13

Add a terminology section to KGCN README

At present there are several terms that may need explaining to avoid confusion. For example:

  • What we mean by neighbourhood
  • What an example is (either a Thing we want to embed/classify or example code)

Implement end-to-end test using test deployment

We require a way to test that kglib can be imported via pip. This requires a dedicated test that can be run independently of other typical tests. This test, using bazel, should depend on the latest deployment to the test PyPI server.

Add SonarCloud to KGLIB repo

Given that KGLIB's codebase is still in its infancy, it's good to put a code quality enforcement system in place early on. Let's add this as part of the PR and master workflows.

BLAST: Try the API

This issue was originally posted by @sorsaffari on 2018-09-05 18:39.

As the first step in writing an example to illustrate how BLAST can be used with a Grakn Knowledge Graph:

  • Try with a single protein sequence
  • Try with a file containing multiple protein sequences
  • Assess the results

End-to-end test requires hard-coded data source

As below, the dataset has been hard-coded. Ideally we shouldn't piggyback on release data for testing.

http_file(
  name = "animaltrade_dist",
  urls = ["https://github.com/graknlabs/kglib/releases/download/v0.1a1/grakn-animaltrade.zip"],  # TODO: how to update to the latest release each time?
)

Expert Systems Research

This issue was originally posted by @jmsfltchr on 2018-09-14 12:42.

Why:
Expert Systems are critical for a variety of domains, including chatbots and medical diagnostics (for their transparency compared with ML systems)
How:
Research and disseminate how to build a general ES framework for Grakn to demonstrate Grakn's usefulness in this domain.

Pytorch Issue Windows

I have been trying to install PyTorch for days and I'm stuck with a big problem. I read a lot of articles on how to install PyTorch. Installing with pip didn't work for me, so I installed it with Anaconda instead. PyTorch does appear in Anaconda: when I type conda list, it shows up as pytorch 1.0.1 py3.7_cuda100_cudnn7_1 pytorch. I have Python 3.7, but when I run code with import torch I get a message like this:

[screenshot of the error]

And when I try to import torch in Python 3.7:

[screenshot of the error]

Pip install error:

[screenshot of the error]

How do I get past these errors? Please help, thanks.

BLAST: define the schema

This issue was originally posted by @sorsaffari on 2018-09-17 19:49.

Based on the response(s) returned from the API, as implemented here, the schema for the Grakn knowledge graph needs to be defined and loaded into a keyspace.

Initial investigation into Random Forests in Grakn

This issue was originally posted by @jmsfltchr on 2018-08-31 17:27.

Is it possible to create a random forest that sits inside Grakn so that it can be used for classification/regression at query-time? This experiment has not yet gone far enough to determine feasibility in terms of speed.

The blocker encountered was performing aggregations in rules: we need to aggregate the mode inside a rule in order to implement the majority voting of the trees in a forest to classify an example. Performing this operation outside Grakn seems to defeat the point of embedding the forest in Grakn at all.

I have made no effort to consider how to build or "train" (*1) the forest. This training could be done in application code and the trees then translated into Grakn.

*1 By "training" I mean that the trees are not built totally randomly: the discrimination boundary picked for each node (and which feature to use, picked from a random set?) is chosen on the basis of what divides the data the most.
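For reference, the majority vote that would need to happen inside a rule is trivial in application code, which is essentially the fallback described above (a sketch with illustrative labels):

from collections import Counter

def forest_classify(tree_predictions):
    # Return the modal class label across the predictions of all trees.
    return Counter(tree_predictions).most_common(1)[0][0]

print(forest_classify(["mammal", "reptile", "mammal"]))  # -> mammal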

A Clerical Error in the README.md

Maybe there is a clerical error in this sentence: "Delete all appendix attributes from both animaltrade_train and animaltrade_test keyspaces. This is the label we will predict in this example, so it should not be present in Grakn otherwise the network can cheat".

Here, animaltrade_train should perhaps be animaltrade_eval. Just let me know if that's right.

Add a "Use Cases" section in the README

The README.md describes what KGCN is, but it does not describe how it will be beneficial for users.

We should have a use-case section describing the kind of problems in which KGCN makes sense as a solution.

Extend the deployment test to be more thorough

Right now the deployment test:

  1. Verifies whether it can deploy the pip package to test.pypi.org
  2. Verifies whether it can install the package using pip install

These tests do not verify whether the pip package is well-formed. We should therefore have a test which performs basic sanity checks on the installed package, for example by attempting to import and use kglib from a real Python program.
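A minimal sketch of such a sanity check, assuming the installed distribution exposes a top-level kglib module:

# Run inside a fresh virtualenv where `pip install grakn-kglib` has succeeded.
import importlib

module = importlib.import_module("kglib")
assert module.__file__, "kglib was installed but cannot be imported"
print("kglib imported from:", module.__file__)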

Cannot depend upon client-python and grakn releases due to conflicting transitive build-tools dependencies

Problem to Solve

In KGLIB we wish to conduct tests in CI against the latest releases of graknlabs/client-python and graknlabs/grakn. This is to ensure that user experience is aligned with the testing conditions in CI. We wish to do this by depending upon git repositories by tag with bazel, using sync-dependencies to auto-update the tags to reflect the latest releases.

It is common that the latest release of graknlabs/client-python and the latest release of graknlabs/grakn depend upon different commits of graknlabs/build-tools. With bazel there is no way to use two different versions of graknlabs/build-tools, because both of these repositories refer to it as @graknlabs_build_tools.

This transitive dependency misalignment makes it impossible to both use bazel and test against the latest releases of graknlabs/client-python and graknlabs/grakn.

Current Workaround

Depend upon graknlabs/client-python and graknlabs/grakn by commit and use sync-dependencies, in which case they both use the same version of graknlabs/build-tools, hence resolving the conflict.

In this case we only test against the latest releases of graknlabs/client-python and graknlabs/grakn in the test-deployment-pip CI job, where we use a deployed snapshot of KGLIB that depends upon a released version of client-python. This version must be manually updated in the install_requires of assemble_pip. This test also uses a released version of Grakn, retrieved as a zip.

Proposed Solution

Add functionality to bazel to permit including transitive dependencies in a scoped way, such that graknlabs/client-python, graknlabs/grakn and graknlabs/kglib can each depend upon a different version of graknlabs/build-tools without conflicting.

Build Tensorflow implementation of supervised GraphSAGE

Translate the approach of GraphSAGE to the context of Grakn, taking inspiration from the authors' code where applicable. Implement a first cut of inference, loss and optimisation. Test using dummy data as a stand-in for an encoding pipeline.

needed by #13
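For reference, a minimal NumPy sketch of GraphSAGE's mean-aggregation step; the TensorFlow version would express the same arithmetic inside the computation graph, and all shapes and names here are illustrative:

import numpy as np

def mean_aggregate(self_feats, neigh_feats, w_self, w_neigh):
    # self_feats: (n, d); neigh_feats: (n, k, d) with k sampled neighbours per node.
    neigh_mean = neigh_feats.mean(axis=1)                  # (n, d)
    combined = self_feats @ w_self + neigh_mean @ w_neigh  # (n, d_out)
    return np.maximum(combined, 0.0)                       # ReLU

rng = np.random.default_rng(0)
h = mean_aggregate(rng.normal(size=(4, 8)),    # 4 nodes, 8 input features
                   rng.normal(size=(4, 5, 8)), # 5 sampled neighbours each
                   rng.normal(size=(8, 16)),
                   rng.normal(size=(8, 16)))
print(h.shape)  # (4, 16)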

Create Grakn schema traversal

Walk the schema to find concept types and their hierarchies. This is required in order to encode information about these types in the TensorFlow pipeline, so that the framework has the capacity to learn the impact of a node having a particular type, and the influence of that type's super-types.

Needed by #13
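A pure-Python stand-in for this traversal; the real implementation would walk the schema via the Grakn client API, and the hierarchy below is illustrative:

def super_type_chain(type_label, parent_of):
    # Collect the super-types of a type, nearest first.
    chain = []
    parent = parent_of.get(type_label)
    while parent is not None:
        chain.append(parent)
        parent = parent_of.get(parent)
    return chain

parent_of = {"employee": "person", "person": "entity"}
print(super_type_chain("employee", parent_of))  # ['person', 'entity']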

Build a knowledge graph of the Graph ML space

This issue was originally posted by @jmsfltchr on 2018-09-11 11:34.

Track the papers of interest found during my research into how to do ML over a knowledge graph. Develop a schema sophisticated enough to capture this information fully.

End-to-end-test may use the wrong PyPi version

If two workflows run at the same time, then since the date is used as the VERSION number for test PyPI and the highest version is treated as the latest, the wrong version may be picked up by the next job in one of those workflows. That job is end-to-end-test.

Feature normaliser by attribute type

Once encoded, feature values need to be normalised relative to the other values of the same attribute type. This is necessary since different attribute types (of the same datatype) can be expected to have wildly different distributions.

Needed by #13
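A minimal sketch of the idea: group encoded values by attribute type and z-score each group independently (the data is illustrative):

import numpy as np

values_by_type = {
    "age":    np.array([25.0, 40.0, 31.0, 58.0]),
    "salary": np.array([30000.0, 52000.0, 47000.0, 61000.0]),
}
# Normalise each attribute type against its own distribution only.
normalised = {attr: (vals - vals.mean()) / vals.std()
              for attr, vals in values_by_type.items()}
print(normalised["age"])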

Implement type-wise Attribute value normalisation

Presently it is very difficult to architect a concise way to normalise attribute values.

Current problems include:

  • All Things in the graph are treated as if they could be an Attribute. This means that:
    • All Things must support a field for long, double, string, date and boolean in case they are an Attribute. When a Thing is an Attribute, only one of these values will be set to a non-default value.
    • The vast majority of the attribute value fields are set to a default value. This obfuscates the meaning of zero: in some cases it means an actual value of zero, in others it is present only because the Thing is not an Attribute. This is particularly difficult to handle for dates, where zero in Unix time is Thursday, 1 January 1970.
  • Attributes need to be normalised by type, otherwise the distribution of values from one type will impact that of another.
  • Normalisation needs to be calibrated on the training set, and the resulting parameters used to normalise data passed subsequently.
  • Encoding of the input data takes place inside the TensorFlow computation graph; adding normalisation there may be non-trivial, and there are no out-of-the-box TensorFlow components like scikit-learn's preprocessing.StandardScaler().

Should be made easier to accomplish by solving #51
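One way to address the calibration point above: fit one scikit-learn StandardScaler per attribute type on the training set and reuse the fitted parameters on later data (a sketch with illustrative values; placing this inside the TensorFlow graph remains the open problem):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = {"age": np.array([[25.0], [40.0], [31.0]]),
         "salary": np.array([[30000.0], [52000.0], [47000.0]])}
# Calibrate one scaler per attribute type on the training set only.
scalers = {attr: StandardScaler().fit(vals) for attr, vals in train.items()}

# Later, normalise unseen values with the stored training-set parameters.
new_values = {"age": np.array([[29.0]]), "salary": np.array([[45000.0]])}
normalised = {attr: scalers[attr].transform(vals) for attr, vals in new_values.items()}
print(normalised["age"], normalised["salary"])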

Cannot install kglib - no matching distribution found for tensorflow

I can't manage to install grakn-kglib. After I run pip3 install grakn-kglib (inside a new venv), I receive this error:

Could not find a version that satisfies the requirement tensorflow==1.11.0 (from grakn-kglib) (from versions: ) No matching distribution found for tensorflow==1.11.0 (from grakn-kglib)

I'm running python 3.7.1.

Create a CI pipeline which performs tests and release (if manually approved)

Scope of the CI pipeline:

  1. tests:
    1. runs unit tests
    2. runs a deployment test which deploys the artifact to test.pypi.org
    3. runs an end-to-end test which verifies if the deployed artifact can be used
  2. a manual approval prompt which should trigger a release process
  3. release:
    1. deploy to pypi.org
    2. deploy a release draft to GitHub

Network architecture is not type-centric

At present, the approach used is close to the approach described in GraphSAGE, which assumes homogeneous data.

The downsides of this approach are:

  1. When querying for a Thing's neighbours, we often receive them sorted by type, which makes random sampling difficult.
  2. Random sampling is biased by point 1, but also by the number of neighbour instances of different types. For instance, there may be only one neighbour of Type A but 10,000 of Type B; how do we choose to sample these?
  3. When pseudo-randomly sampling neighbours we set a limit on the number of Things we are willing to consider in order to reduce expense. Given that results often come back sorted by type, we may then see no examples at all of some neighbour types that are actually present.

Working on Grakn, we have type information that allows us to understand the nature of the neighbours a Thing has. The network architecture should make use of this.

Just a quick question

Hi there! This is neither a bug report nor a feature request, so I hope you don't mind me posting this here.

My name is Reed, and I'm a software engineering researcher at Sandia National Laboratories in the US. I've created an issue on your repo just to ask a quick question. If you don't have time or don't care to respond, feel free to ignore me and/or delete this issue.

Where I work, we have a very diverse ecosystem of cutting-edge research codes spanning every discipline you could imagine. I'm part of our software engineering research department, and it's our job to keep that ecosystem robust and healthy. Part of that means helping scientists to adopt good software practices. Right now, my mind has been on software versioning/release schemes (e.g. semantic versioning).

In order to build a case for/against getting my people on-board with the practice, I figured I should ask people who already use versioning to release their software to see what they think. So I gathered up a list of scientific software repositories on GitHub, then I selected those that tracked versioned releases, that were reasonably active, etc. Finally, I picked a handful of those repos and decided to reach out to them. You were on that list.

Anyway, here's the question:

What do you believe are the benefits (or drawbacks) of having versioned releases of your software (i.e. 1.0.0, 1.1.0, 2.0.0...)? When should someone start thinking about versioning/releasing their code?

Just a sentence or two, that's all I need. For context, imagine the preceding sentence is this: "But don't just take my word for it, just listen to what these accomplished researchers have to say!".

Thank you so much!

Reed Milewicz
[email protected]
