
TypeDB-ML is the Machine Learning integrations library for TypeDB

Home Page: https://vaticle.com

License: Apache License 2.0


typedb-ml's Introduction

This repository is outdated and no longer supported. We will be closing it by the end of 2023.



TypeDB-ML

Previously known as KGLIB.

TypeDB-ML provides tools to enable graph algorithms and machine learning with TypeDB.

There are integrations for NetworkX and for PyTorch Geometric (PyG).

The NetworkX integration lets you use NetworkX's large library of graph algorithms over graph data exported from TypeDB.

The PyTorch Geometric (PyG) integration gives you a toolbox to build Graph Neural Networks (GNNs) for your TypeDB data, with an example included for link prediction (or, in TypeDB terms, binary relation prediction). The structure of the GNNs is fully customisable, with network components for popular techniques such as graph attention and graph transformers built in.

Features

NetworkX

  • Declare the graph structure of your queries, with optional sampling functions.
  • Query a TypeDB instance and combine many results across many queries into a single graph (build_graph_from_queries).
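Below is a minimal sketch of how these two features fit together. The README names only build_graph_from_queries; the import path, the sampler argument and the variable-graph convention shown here are assumptions to check against the package source.

# Sketch only: the import path and exact signature of build_graph_from_queries
# are assumptions; verify them against the installed typedb-ml package.
import networkx as nx
from typedb.client import TypeDB, SessionType, TransactionType
from typedb_ml.networkx.query_graph import build_graph_from_queries  # hypothetical path

# Declare the graph structure of a query: query variables become nodes and edges.
query = "match $p isa person, has name $n;"
variable_graph = nx.MultiDiGraph()
variable_graph.add_edge("p", "n", type="has")

with TypeDB.core_client("localhost:1729") as client:
    with client.session("social_network", SessionType.DATA) as session:  # hypothetical database
        with session.transaction(TransactionType.READ) as tx:
            # One (query, sampler, variable_graph) tuple per query; identity sampler here.
            graph = build_graph_from_queries([(query, lambda results: results, variable_graph)], tx)
            print(graph.number_of_nodes())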

PyTorch Geometric

  • A DataSet object to lazily load graphs from a TypeDB instance. Each graph is converted to a PyG Data object.
  • It's most natural to work with PyG HeteroData objects, since all data in TypeDB has a type. Conversion from Data to HeteroData is available in PyG, but it loses node ordering information. To remedy this, TypeDB-ML provides store_concepts_by_type to store concepts in an order consistent with a HeteroData object. This enables concepts to be properly re-associated with predictions after learning has finished.
  • A FeatureEncoder to orchestrate encoders to generate features for graphs.
  • Encoders for Continuous and Categorical values to apply encodings/embedding spaces to the types and attribute values present in TypeDB data.
  • A full example for link prediction.
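For orientation, here is a generic, self-contained sketch of the Data-to-HeteroData conversion referred to above, using synthetic tensors in place of what DataSet and FeatureEncoder would produce; the type names are invented for illustration.

# Synthetic stand-in for a graph that typedb-ml's DataSet would load from TypeDB.
import torch
from torch_geometric.data import Data

x = torch.randn(4, 3)                              # 4 nodes with 3 features each
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])  # 3 directed edges
node_type = torch.tensor([0, 0, 0, 1])             # nodes 0-2 are "person", node 3 "company"
edge_type = torch.tensor([0, 0, 1])                # "knows" and "works_at" edge types

data = Data(x=x, edge_index=edge_index)
hetero = data.to_heterogeneous(
    node_type=node_type, edge_type=edge_type,
    node_type_names=["person", "company"],
    edge_type_names=[("person", "knows", "person"),
                     ("person", "works_at", "company")])
print(hetero)

Note that the conversion re-indexes nodes within each type, which is exactly the node ordering information that store_concepts_by_type is designed to preserve.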

Other

  • Example usage of Tensorboard for PyG HeteroData

Resources

You may find the following resources useful, particularly to understand why TypeDB-ML started:

Quickstart

Install

  • Python >= 3.7.x

  • Grab the requirements.txt file from here and install the requirements with pip install -r requirements.txt. This is necessary due to some intricacies of installing PyG's dependencies; see here for details.

  • Install TypeDB-ML: pip install typedb-ml.

  • TypeDB 2.11.1 running in the background.

  • typedb-client-python 2.11.x (PyPI, GitHub release). This should be installed automatically when you pip install typedb-ml.

Run the Example

Take a look at the PyTorch Geometric heterogeneous link prediction example to see how to use TypeDB-ML to build a GNN on TypeDB data.

Development

To follow the development conversation, please join the Vaticle Discord, and join the #typedb-ml channel. Alternatively, start a new topic on the Vaticle Discussion Forum.

TypeDB-ML requires that you have migrated your data into a TypeDB or TypeDB Cluster instance. There is an official examples repo for how to go about this, and information available on migration in the docs. Alternatively, there are fantastic community-led projects growing in the TypeDB OSI to facilitate fast and easy data loading, for example TypeDB Loader.

Building from Source

It's expected that you will install via pip, but should you need to make your own changes to the library and import it into your project, you can build from source as follows:

Clone TypeDB-ML:

git clone git@github.com:vaticle/typedb-ml.git

Go into the project directory:

cd typedb-ml

Build all targets:

bazel build //...

Run all tests. Requires Python 3.7+ on your PATH. Test dependencies are for Linux since that is the CI environment:

bazel test //typedb_ml/... --test_output=streamed --spawn_strategy=standalone --action_env=PATH

Build the pip distribution. Outputs to bazel-bin:

bazel build //:assemble-pip

typedb-ml's People

Contributors

dmitrii-ubskii, flyingsilverfin, gowtham1997, grabl, haikalpribadi, jamesreprise, jmsfltchr, lolski, trellixvulnteam, vmax


typedb-ml's Issues

Unapproved but successful workflows should show as green

Problem to Solve

A CI workflow which reaches an approval step, where approval is not given, is indicated as pending (orange). This is misleading as it gives the impression that tests have not passed. Approval is only given when releasing, so this problem is very common.

Current Workaround

Inspect the pending flag to see how far the workflow progressed

Proposed Solution

Use a custom approval system

Sync dependencies on graknlabs repos

Problem to Solve

KGLIB should be automatically updated to depend upon the latest commits of the graknlabs repos it uses.

Current Workaround

Currently the dependencies must be updated manually.

Proposed Solution

Have Grabl automatically update the commits that are depended upon.

Prototype a Concept Feature Embedding Framework

The objective is to prototype a method of building vector representations of Concepts in a Knowledge Graph. These vectors can then subsequently be used in machine learning pipelines in order to perform learning across the graph.

Encode traversal raw data into float tensors

Ingest data that describes the traversals from a batch of starting concepts and build float tensors to feed into the main trunk of the pipeline.

Requires:

  • Schema type encoder
  • Role type encoder
  • Role direction encoder
  • Long encoder
  • Double encoder
  • Boolean encoder
  • Date encoder
  • String encoder - potentially using a drop-in from TensorFlow Hub

Needed by #13
Needs #17
Needs #15
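A minimal sketch of the first of these, a schema type encoder that one-hot encodes type labels into a float tensor (a NumPy stand-in for the TensorFlow pipeline; the labels are illustrative):

import numpy as np

def encode_types(type_labels, all_types):
    # Map each schema type label to a fixed column, then one-hot encode.
    index = {t: i for i, t in enumerate(sorted(all_types))}
    out = np.zeros((len(type_labels), len(index)), dtype=np.float32)
    for row, label in enumerate(type_labels):
        out[row, index[label]] = 1.0
    return out

print(encode_types(["person", "company", "person"], {"person", "company", "employment"}))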

Use cached test results for unchanged source code

Problem to Solve

All tests rerun from cold, which is unnecessary if large amounts of source code are unchanged.

Current Workaround

Ignore this computational penalty

Proposed Solution

Use RBE with bazel to cache test results

Machine Learning Research

This issue was originally posted by @jmsfltchr on 2018-08-31 17:29.

Why: To explore the benefits of combining Grakn with Machine Learning.
How: Investigate the integration of machine learning with Grakn.
Efforts to include:

  • Feature extraction from Grakn (or equivalent)
  • Running ML at query-time, triggered by querying, therefore also by reasoning

This issue needs #9, needs #8, needs #4, needs #13

Add a terminology section to KGCN README

At present there are several terms that may need explaining to avoid confusion. For example:

  • What we mean by neighbourhood
  • What an example is (either a Thing we want to embed/classify or example code)

Implement end-to-end test using test deployment

We require a way to test that kglib can be imported via pip. This requires a dedicated test that can be run independently of other typical tests. This test, using bazel, should depend on the latest deployment to the test PyPI server.

Add SonarCloud to KGLIB repo

Given that KGLIB's codebase is still in its infancy, it's good to put a code quality enforcement system in place early on. Let's add this as part of the PR and master workflows.

BLAST: Try the API

This issue was originally posted by @sorsaffari on 2018-09-05 18:39.

As the first step in writing an example to illustrate how BLAST can be used with a Grakn Knowledge Graph:

  • Try with a single protein sequence
  • Try with a file containing multiple protein sequences
  • Assess the results

End-to-end test requires hard-coded data source

As below, the dataset has been hard-coded. Ideally we shouldn't piggyback on release data for testing.

http_file(
  name = "animaltrade_dist",
  urls = ["https://github.com/graknlabs/kglib/releases/download/v0.1a1/grakn-animaltrade.zip"],  # TODO: how to update to the latest release each time?
)

Expert Systems Research

This issue was originally posted by @jmsfltchr on 2018-09-14 12:42.

Why:
Expert Systems are critical for a variety of domains, including chatbots and medical diagnostics (for their transparency compared with ML systems)
How:
Research and disseminate how to build a general ES framework for Grakn to demonstrate Grakn's usefulness in this domain.

Pytorch Issue Windows

I have been trying to install PyTorch for days and I'm stuck with a big problem. I read a lot of articles on how to install PyTorch. Installing with pip didn't work for me, so I installed it with Anaconda instead. PyTorch does appear in Anaconda: when I type conda list, it shows up as pytorch 1.0.1 py3.7_cuda100_cudnn7_1 pytorch. I have Python 3.7, but when I run code with import torch I get a message like this:

[screenshot of the error]

And when I try to import torch in Python 3.7:

[screenshot of the error]

Pip install error:

[screenshot of the error]

How do I get past these errors? Please help, thanks.

BLAST: define the schema

This issue was originally posted by @sorsaffari on 2018-09-17 19:49.

Based on the response(s) returned from the API, as implemented here, the schema for the Grakn knowledge graph needs to be defined and loaded into a keyspace.

Initial investigation into Random Forests in Grakn

This issue was originally posted by @jmsfltchr on 2018-08-31 17:27.

Is it possible to create a random forest that sits inside Grakn so that it can be used for classification/regression at query-time? This experiment has not yet gone far enough to determine feasibility in terms of speed.

The blocker encountered was performing aggregations in rules: we need to aggregate the mode inside a rule in order to implement the majority voting of the trees in a forest to classify an example. Performing this operation outside Grakn seems to defeat the point of embedding the forest in Grakn at all.

I have made no effort to consider how to build or "train" (*1) the forest. This training could be done in application code and the trees then translated into Grakn.

*1 By "training" I mean that the trees are not built totally randomly: the discrimination boundary picked for each node (and which feature to use, picked from a random set?) is chosen on the basis of what divides the data the most.
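For reference, the majority vote that would need to happen inside a rule is trivial in application code, which is essentially the fallback described above (a sketch with illustrative labels):

from collections import Counter

def forest_classify(tree_predictions):
    # Return the modal class label across the predictions of all trees.
    return Counter(tree_predictions).most_common(1)[0][0]

print(forest_classify(["mammal", "reptile", "mammal"]))  # -> mammal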

A Clerical Error in the README.md

Maybe there is a clerical error in this sentence: "Delete all appendix attributes from both animaltrade_train and animaltrade_test keyspaces. This is the label we will predict in this example, so it should not be present in Grakn otherwise the network can cheat".

Here, animaltrade_train should perhaps be animaltrade_eval. Just let me know if that's right.

Add a "Use Cases" section in the README

The README.md describes what KGCN is, but it does not describe how it will be beneficial for users.

We should have a use-case section describing the kind of problems in which KGCN makes sense as a solution.

Extend the deployment test to be more thorough

Right now the deployment test:

  1. Verifies whether it can deploy the pip package to test.pypi.org
  2. Verifies whether it can install the package using pip install

These tests do not verify whether the pip package is well-formed. We should therefore have a test which performs basic sanity checks on the installed package, for example by attempting to import and use kglib from a real Python program.
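A minimal sketch of such a sanity check, assuming the installed distribution exposes a top-level kglib module:

# Run inside a fresh virtualenv where `pip install grakn-kglib` has succeeded.
import importlib

module = importlib.import_module("kglib")
assert module.__file__, "kglib was installed but cannot be imported"
print("kglib imported from:", module.__file__)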

Cannot depend upon client-python and grakn releases due to conflicting transitive build-tools dependencies

Problem to Solve

In KGLIB we wish to conduct tests in CI against the latest releases of graknlabs/client-python and graknlabs/grakn. This is to ensure that user experience is aligned with the testing conditions in CI. We wish to do this by depending upon git repositories by tag with bazel, using sync-dependencies to auto-update the tags to reflect the latest releases.

It is common that the latest release of graknlabs/client-python and the latest release of graknlabs/grakn depend upon different commits of graknlabs/build-tools. With bazel there is no way to use two different versions of graknlabs/build-tools, because both of these repositories refer to it as @graknlabs_build_tools.

This transitive dependency misalignment makes it impossible to both use bazel and test against the latest releases of graknlabs/client-python and graknlabs/grakn.

Current Workaround

Depend upon graknlabs/client-python and graknlabs/grakn by commit and use sync-dependencies, in which case they both use the same version of graknlabs/build-tools, hence resolving the conflict.

In this case we only test against the latest releases of graknlabs/client-python and graknlabs/grakn in the test-deployment-pip CI job, where we use a deployed snapshot of KGLIB that depends upon a released version of client-python. This version must be manually updated in the install_requires of assemble_pip. This test also uses a released version of Grakn, retrieved as a zip.

Proposed Solution

Add functionality to bazel to permit including transitive dependencies in a scoped way, such that graknlabs/client-python, graknlabs/grakn and graknlabs/kglib can each depend upon a different version of graknlabs/build-tools without conflicting.

Build Tensorflow implementation of supervised GraphSAGE

Translate the approach of GraphSAGE to the context of Grakn, taking inspiration from the authors' code where applicable. Implement a first cut of inference, loss and optimisation. Test using dummy data as a stand-in for an encoding pipeline.

needed by #13
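For reference, a minimal NumPy sketch of GraphSAGE's mean-aggregation step; the TensorFlow version would express the same arithmetic inside the computation graph, and all shapes and names here are illustrative:

import numpy as np

def mean_aggregate(self_feats, neigh_feats, w_self, w_neigh):
    # self_feats: (n, d); neigh_feats: (n, k, d) with k sampled neighbours per node.
    neigh_mean = neigh_feats.mean(axis=1)                  # (n, d)
    combined = self_feats @ w_self + neigh_mean @ w_neigh  # (n, d_out)
    return np.maximum(combined, 0.0)                       # ReLU

rng = np.random.default_rng(0)
h = mean_aggregate(rng.normal(size=(4, 8)),    # 4 nodes, 8 input features
                   rng.normal(size=(4, 5, 8)), # 5 sampled neighbours each
                   rng.normal(size=(8, 16)),
                   rng.normal(size=(8, 16)))
print(h.shape)  # (4, 16)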

Create Grakn schema traversal

Walk the schema to find concept types and their hierarchies. This is required in order to encode information about these types in the TensorFlow pipeline, so that the framework has the capacity to learn the impact of a node having a particular type, and the influence of that type's super-types.

Needed by #13
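A pure-Python stand-in for this traversal; the real implementation would walk the schema via the Grakn client API, and the hierarchy below is illustrative:

def super_type_chain(type_label, parent_of):
    # Collect the super-types of a type, nearest first.
    chain = []
    parent = parent_of.get(type_label)
    while parent is not None:
        chain.append(parent)
        parent = parent_of.get(parent)
    return chain

parent_of = {"employee": "person", "person": "entity"}
print(super_type_chain("employee", parent_of))  # ['person', 'entity']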

Build a knowledge graph of the Graph ML space

This issue was originally posted by @jmsfltchr on 2018-09-11 11:34.

Track the papers of interest found during my research into how to do ML over a knowledge graph. Develop a schema sophisticated enough to capture this information fully.

End-to-end-test may use the wrong PyPi version

If two workflows run at the same time, then since the date is used as the VERSION number for test PyPI and the highest version is treated as the latest, the wrong version may be picked up by the next job in one of those workflows. That job is end-to-end-test.

Feature normaliser by attribute type

Once encoded, feature values need to be normalised relative to the other values of the same attribute type. This is necessary since different attribute types (of the same datatype) can be expected to have wildly different distributions.

Needed by #13
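A minimal sketch of the idea: group encoded values by attribute type and z-score each group independently (the data is illustrative):

import numpy as np

values_by_type = {
    "age":    np.array([25.0, 40.0, 31.0, 58.0]),
    "salary": np.array([30000.0, 52000.0, 47000.0, 61000.0]),
}
# Normalise each attribute type against its own distribution only.
normalised = {attr: (vals - vals.mean()) / vals.std()
              for attr, vals in values_by_type.items()}
print(normalised["age"])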

Implement type-wise Attribute value normalisation

Presently it is very difficult to architect a concise way to normalise attribute values.

Current problems include:

  • All Things in the graph are treated as if they could be an Attribute. This means that:
    • All Things must support a field for long, double, string, date and boolean in case they are an Attribute. When a Thing is an Attribute, only one of these values will be set to a non-default value.
    • The vast majority of the attribute value fields are set to a default value. This obfuscates the meaning of zero: in some cases it means an actual value of zero, in others it is present only because the Thing is not an Attribute. This is particularly difficult to handle for dates, where zero in Unix time is Thursday, 1 January 1970.
  • Attributes need to be normalised by type, otherwise the distribution of values from one type will impact that of another.
  • Normalisation needs to be calibrated on the training set, and the resulting parameters used to normalise data passed subsequently.
  • Encoding of the input data takes place inside the TensorFlow computation graph; adding normalisation there may be non-trivial, and there are no out-of-the-box TensorFlow components like scikit-learn's preprocessing.StandardScaler().

Should be made easier to accomplish by solving #51
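One way to address the calibration point above: fit one scikit-learn StandardScaler per attribute type on the training set and reuse the fitted parameters on later data (a sketch with illustrative values; placing this inside the TensorFlow graph remains the open problem):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = {"age": np.array([[25.0], [40.0], [31.0]]),
         "salary": np.array([[30000.0], [52000.0], [47000.0]])}
# Calibrate one scaler per attribute type on the training set only.
scalers = {attr: StandardScaler().fit(vals) for attr, vals in train.items()}

# Later, normalise unseen values with the stored training-set parameters.
new_values = {"age": np.array([[29.0]]), "salary": np.array([[45000.0]])}
normalised = {attr: scalers[attr].transform(vals) for attr, vals in new_values.items()}
print(normalised["age"], normalised["salary"])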

Cannot install kglib - no matching distribution found for tensorflow

I can't manage to install grakn-kglib. After I run pip3 install grakn-kglib (inside a new venv), I receive this error:

Could not find a version that satisfies the requirement tensorflow==1.11.0 (from grakn-kglib) (from versions: ) No matching distribution found for tensorflow==1.11.0 (from grakn-kglib)

I'm running python 3.7.1.

Create a CI pipeline which performs tests and release (if manually approved)

Scope of the CI pipeline:

  1. tests:
    1. runs unit tests
    2. runs a deployment test which deploys the artifact to test.pypi.org
    3. runs an end-to-end test which verifies if the deployed artifact can be used
  2. a manual approval prompt which should trigger a release process
  3. release:
    1. deploy to pypi.org
    2. deploy a release draft to GitHub

Network architecture is not type-centric

At present, the approach used is close to the approach described in GraphSAGE, which assumes homogeneous data.

The downsides of this approach are:

  1. When querying for a Thing's neighbours, we often receive them sorted by type, which makes random sampling difficult.
  2. Random sampling is biased by point 1, but also by the number of neighbour instances of different types. For instance, there may be only one neighbour of Type A but 10,000 of Type B; how do we choose to sample these?
  3. When pseudo-randomly sampling neighbours we set a limit on the number of Things we are willing to consider in order to reduce expense. Given that results often come back sorted by type, we may then see no examples at all of some neighbour types that are actually present.

Working on Grakn, we have type information that allows us to understand the nature of the neighbours a Thing has. The network architecture should make use of this.

Just a quick question

Hi there! This is neither a bug report nor a feature request, so I hope you don't mind me posting this here.

My name is Reed, and I'm a software engineering researcher at Sandia National Laboratories in the US. I've created an issue on your repo just to ask a quick question. If you don't have time or don't care to respond, feel free to ignore me and/or delete this issue.

Where I work, we have a very diverse ecosystem of cutting-edge research codes spanning every discipline you could imagine. I'm part of our software engineering research department, and it's our job to keep that ecosystem robust and healthy. Part of that means helping scientists to adopt good software practices. Right now, my mind has been on software versioning/release schemes (e.g. semantic versioning).

In order to build a case for/against getting my people on-board with the practice, I figured I should ask people who already use versioning to release their software to see what they think. So I gathered up a list of scientific software repositories on GitHub, then I selected those that tracked versioned releases, that were reasonably active, etc. Finally, I picked a handful of those repos and decided to reach out to them. You were on that list.

Anyway, here's the question:

What do you believe are the benefits (or drawbacks) of having versioned releases of your software (i.e. 1.0.0, 1.1.0, 2.0.0...)? When should someone start thinking about versioning/releasing their code?

Just a sentence or two, that's all I need. For context, imagine the preceding sentence is this: "But don't just take my word for it, just listen to what these accomplished researchers have to say!".

Thank you so much!

Reed Milewicz
[email protected]
