
neat-ml's Introduction

Network Embedding All the Things (NEAT)


NEAT is a flexible pipeline for:

  • Parsing a graph serialization
  • Generating node and edge embeddings
  • Training classifiers for link prediction and label expansion
  • Making predictions
  • Creating well-formatted output and metrics for the predictions
  • Doing all of the above reproducibly, with cloud compute (or locally, if preferred)

Quick Start

pip install neat-ml
neat run --config neat_quickstart.yaml # This example file is included in the repo

NEAT will write graph embeddings to a new quickstart_output directory.

Requirements

This pipeline has grape as a major dependency.

Methods from tensorflow and scikit-learn are supported, but are not installed as dependencies, to avoid version conflicts.

Please install the versions of tensorflow, scikit-learn, CUDA, and cudnn compatible with your system and with each other prior to installing NEAT if you wish to use these methods.

On Linux, the tensorflow installation may be easiest using conda as follows:

wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O anaconda.sh
bash ./anaconda.sh -b
echo "export PATH=\$PATH:$HOME/anaconda3/bin" >> $HOME/.bashrc
conda init
conda install cudnn
conda install tensorflow

Installation

pip install neat-ml

Running NEAT

neat run --config tests/resources/test.yaml # example
neat run --config [your yaml]

The pipeline is driven by a YAML file (e.g. tests/resources/test.yaml), which contains all parameters needed to complete the pipeline. The contents and expected values for this file are defined by the neat-ml-schema.

This includes hyperparameters for machine learning and also things like files/paths to output results. Specify paths to node and edge files:

GraphDataConfiguration:
  graph:
    directed: False
    node_path: path/to/nodes.tsv
    edge_path: path/to/edges.tsv

If the graph data is in a compressed file and/or a remote location (e.g., on KG-Hub), one or more URLs may be specified in the source_data parameter:

GraphDataConfiguration:
  source_data:
    files:
      - path: https://kg-hub.berkeleybop.io/kg-obo/bfo/2019-08-26/bfo_kgx_tsv.tar.gz
        desc: "This is BFO, your favorite basic formal ontology, now in graph form."
      - path: https://someremoteurl.com/graph2.tar.gz
        desc: "This is some other graph - it may be useful."

A diagram explaining the design is available in the repository.

If you are uploading to AWS/S3, see the AWS documentation for configuring credentials.

Credits

Developed by Deepak Unni, Justin Reese, J. Harry Caufield, and Harshad Hegde.


neat-ml's Issues

pre_run_checks should catch `botocore.exceptions.NoCredentialsError`

The pre_run_checks function has its check_s3_credentials argument set to True by default, so it always checks for S3 upload details.
That's normally not a problem, but when testing locally without S3 credentials available, it raises botocore.exceptions.NoCredentialsError without catching it.
If there isn't an upload block, this should warn and continue rather than raise (i.e., warnings.warn("YAML contains no upload block - continuing")).
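
A sketch of the suggested behavior, assuming a helper that wraps the S3 check (the function name, config keys, and warning text here are illustrative, not NEAT's actual internals):

import warnings

import botocore.exceptions


def check_s3_upload(yaml_config: dict, client) -> bool:
    """Check S3 upload details, warning instead of raising when the
    upload block or credentials are missing."""
    if "upload" not in yaml_config:
        warnings.warn("YAML contains no upload block - continuing")
        return True
    try:
        # Any cheap authenticated call surfaces missing credentials early.
        client.head_bucket(Bucket=yaml_config["upload"]["s3_bucket"])
    except botocore.exceptions.NoCredentialsError:
        warnings.warn("No S3 credentials found - continuing without upload")
        return False
    return True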

Scheduled NEAT runs fail due to extra `graph_path` keyword

NEAT configs that include a graph_path value in their graph arguments lead to the following error when the scheduler attempts to run them:

15:37:00  Traceback (most recent call last):
15:37:00    File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
15:37:00      sys.exit(cli())
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
15:37:00      return self.main(*args, **kwargs)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
15:37:00      rv = self.invoke(ctx)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
15:37:00      return _process_result(sub_ctx.command.invoke(sub_ctx))
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
15:37:00      return ctx.invoke(self.callback, **ctx.params)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
15:37:00      return callback(*args, **kwargs)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/cli.py", line 55, in run
15:37:00      make_node_embeddings(**node_embedding_args)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/graph_embedding/graph_embedding.py", line 47, in make_node_embeddings
15:37:00      graph: Graph = Graph.from_csv(**main_graph_args)
15:37:00  TypeError: Graph.from_csv() got an unexpected keyword argument: graph_path

This makes sense - the ensmallen Graph.from_csv() doesn't know what to do with this keyword, as it's specific to NEAT.
It should be removed before passing kwargs to from_csv().
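
Assuming main_graph_args is the kwargs dict built from the config (as in the trace above), the fix could be as small as:

from ensmallen import Graph

# main_graph_args is the kwargs dict NEAT builds from the config;
# strip the NEAT-specific key before handing it to Ensmallen.
main_graph_args.pop("graph_path", None)
graph = Graph.from_csv(**main_graph_args)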

Requirement management

Running NEAT after a fresh install yields a ModuleNotFoundError for cpuinfo, because ensmallen imports it.
This can be avoided with pip install ensmallen - or, even better, pip install grape, as this should cover all requirements.

There may also be issues with importing TensorFlow if it is not already installed, as ensmallen/embiggen do not explicitly require it.

Upload to PyPI

  • Ensure project metadata is set up as required by PyPI
  • Ensure the README includes everything necessary
  • Set up the project package and upload to PyPI

OSError: [Errno 9] Bad file descriptor

When running the config below, the error OSError: [Errno 9] Bad file descriptor is raised. See further below for the stack trace.

name: "quick_neat"
description: "A Quick NEAT Run"
output_directory: quickstart_output

graph_data:
  graph:
    node_path: tests/resources/test_graphs/test_small_nodes.tsv
    edge_path: tests/resources/test_graphs/test_small_edges.tsv
    directed: False
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'

embeddings:
  embedding_file_name: quickstart_embedding.csv
  embedding_history_file_name: quickstart_embedding_history.json
  node_embedding_params:
    node_embedding_method_name: CBOW # one of 'CBOW', 'GloVe', 'SkipGram', 'Siamese', 'TransE', 'SimplE', 'TransH', 'TransR'
    walk_length: 10 # typically 100 or so
    batch_size: 128 # typically 512? or more
    window_size: 4
    return_weight: 1.0  # 1/p
    explore_weight: 1.0  # 1/q
    iterations: 5 # typically 20

Trace:

Exception ignored in: <function Pool.__del__ at 0x7eff78d69ca0>
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

The embeddings are still saved.

support creating positive/negative test/train splits

Need to decide how this would work. Right now it's BYOH (bring your own holdouts), and they are supplied like this:

graph_data:
  graph:
    node_path: tests/resources/test_graphs/pos_train_nodes.tsv
    edge_path: tests/resources/test_graphs/pos_train_edges.tsv

  pos_validation:
    edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
  neg_training:
    edge_path: tests/resources/test_graphs/neg_train_edges.tsv
  neg_validation:
    edge_path: tests/resources/test_graphs/neg_valid_edges.tsv

One way to support either BYOH or having NEAT make holdouts:

graph_data:
  graph:
    node_path: tests/resources/test_graphs/pos_train_nodes.tsv
    edge_path: tests/resources/test_graphs/pos_train_edges.tsv

  holdout:
    make_holdouts:
      type: connected_holdout # only option at the moment
      random_state: 42 # seed
      train_size: 0.8 # fraction
      edge_types: # optional
        - biolink:interacts_with
        - biolink:has_gene_product
      verbose: bool

    existing_holdouts:  # this OR make_holdouts (not both)
      pos_validation:
        edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
      neg_training:
        edge_path: tests/resources/test_graphs/neg_train_edges.tsv
      neg_validation:
        edge_path: tests/resources/test_graphs/neg_valid_edges.tsv
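
For the make_holdouts branch, grape/Ensmallen already provides a connected holdout; a sketch of what NEAT might call under the hood (parameter names follow the proposal above and may differ slightly across Ensmallen versions):

from ensmallen import Graph

graph = Graph.from_csv(
    node_path="tests/resources/test_graphs/pos_train_nodes.tsv",
    edge_path="tests/resources/test_graphs/pos_train_edges.tsv",
    directed=False,
)

# connected_holdout keeps the training split connected and returns
# (train, validation) graphs for the positive edges.
pos_train, pos_valid = graph.connected_holdout(
    train_size=0.8,
    random_state=42,
)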

Graph inputs should be full graph not splits, rename example paths

These two variables in the main yaml section suggest a split training set:

node_path: tests/resources/test_graphs/pos_train_nodes.tsv
edge_path: tests/resources/test_graphs/pos_train_edges.tsv

I imagine these are just placeholder paths, but a better example would indicate that these are the full graph's nodes and edges.

Loading from URL looks for wrong filename

Loading graph objects from a URL isn't quite right - the file is downloaded but yaml_helper looks for a file matching the URL string rather than the 'safe' reformatted filename.
Example:
The neat.yaml contains this:

    graph_path: https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz 

The file is downloaded:

$ ls -ls https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz 
43200 -rw-r--r-- 1 harry harry 44235554 Apr 13 14:36 https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz

but raises FileNotFoundError upon trying to decompress it:

$ neat run --config neat.yaml
Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/cli.py", line 41, in run
    if not pre_run_checks(yhelp=yhelp):
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/pre_run_checks/pre_run_checks.py", line 82, in pre_run_checks
    if check_file_extensions and yhelp.main_graph_args():
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 211, in main_graph_args
    return self.add_indir_to_graph_data(self.yaml['graph_data']['graph'])
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 164, in add_indir_to_graph_data
    decomp_outfile = tarfile.open(filepath)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz'

Performing any further actions on the downloaded file should use the updated filename.
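
The reformatted name appears to replace URL separators with underscores, so the fix is to derive the same 'safe' filename before opening the download (a sketch; the exact substitution NEAT applies may differ):

import re
import tarfile

url = "https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz"

# Mirror the downloader's reformatting: ':' and '/' become '_'.
safe_filename = re.sub(r"[:/]", "_", url)
with tarfile.open(safe_filename) as decomp_outfile:
    decomp_outfile.extractall()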

`poetry`fying the project

  • Add poetry for package management.
  • Set up build and publish capabilities
  • Autodocs (works hand-in-hand with #71)

Create index.html upon uploading new output

Newly created output from NEAT is uploaded to a remote location (e.g., an S3 bucket) but isn't inherently navigable because no index.html is written. We will need to write/update this file.

Incorporate GNNs

We have some existing implementations but will need to allow access to them through a NEAT config.

RuntimeError when validation fails isn't very informative

When running a config, this is what happens:

$ neat run --config neat_quickstart.yaml
Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/cli.py", line 45, in run
    yhelp = YamlHelper(config)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 147, in __init__
    if not validate_config(self.yaml):
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 43, in validate_config
    raise RuntimeError
RuntimeError

So this configuration clearly doesn't pass the validation step, but why not?
We need a more informative output here.
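
One low-effort improvement is to collect the failing checks and put them in the exception message; a sketch (the specific checks shown are illustrative, not NEAT's actual validation rules):

def validate_config(config: dict) -> bool:
    """Validate a NEAT config, reporting *why* validation failed."""
    problems = []
    for required in ("name", "output_directory"):
        if required not in config:
            problems.append(f"missing required key: {required}")
    if problems:
        raise RuntimeError(
            "NEAT config failed validation:\n  " + "\n  ".join(problems)
        )
    return True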

Output link predictions as SSSOM

The existing output for predicted links looks like this:

source_node	destination_node	score
ENSP00000451575	ENSP00000435370	0.9370759965425138
ENSP00000451575	ENSP00000435370	0.9370759965425138
ENSP00000451575	ENSP00000361636	0.9361207132288921
ENSP00000451575	ENSP00000361636	0.9361207132288921
ENSP00000451575	ENSP00000357879	0.9361171909487621

The corresponding columns in SSSOM are subject_id, object_id, and confidence.
We can change the column headings and already be compliant.

It would be best to include some provenance as well, though - a string can go in the additional columns mapping_tool and mapping_tool_version.
We have a few different things to keep track of here:

  • neat-ml version
  • grape version
  • contents of the specific neat config

This may not capture the entirety of the mapping, but it can serve as a frame of reference; a conversion sketch follows below.
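
A conversion sketch with pandas (the file names and the version string are placeholders):

import pandas as pd

preds = pd.read_csv("link_predictions.tsv", sep="\t")

# Rename to the corresponding SSSOM columns...
sssom = preds.rename(
    columns={
        "source_node": "subject_id",
        "destination_node": "object_id",
        "score": "confidence",
    }
)
# ...and record provenance in additional columns.
sssom["mapping_tool"] = "neat-ml"
sssom["mapping_tool_version"] = "0.0.0"  # placeholder version string

sssom.to_csv("link_predictions.sssom.tsv", sep="\t", index=False)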

Use LinkML to parse and validate NEAT YAML files

We should consider using LinkML to validate our NEAT YAML files. Right now we are using some simple procedural code within the YamlHelper class to do this, but LinkML would provide a more sophisticated and thorough check of the YAML.

On the other hand, this would take some time to implement, and might increase the complexity of debugging NEAT YAML validation.

Just making this ticket to discuss.

Drop support for Tensorflow and scikit-learn methods

We currently support TF and scikit-learn methods in config files, both in running classifiers and applying models.
If we remove support for these two frameworks:

  • The neat-schema will be less confusing and easier to apply
  • We can focus on a more specific set of use cases, driven by grape
  • We avoid issues due to upstream changes in frameworks other than grape

The downsides:

  • grape doesn't provide wrappers for simple methods in scikit-learn, like logistic regression (AFAIK)
  • We won't be able to replicate existing pipelines built around TF/scikit

Fail gracefully when provided node/edge path/URL that isn't node or edge

Providing NEAT with a config file where the node and/or edge path does not resolve to a TSV currently throws a bare ValueError (e.g., if the file is actually a tar.gz).
This could be more informative to the user.
It would be especially nice to have a quick check that verifies the expected filetype and throws a warning if it isn't in ['tsv','tar.gz'], etc. (see the sketch below).
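
A quick check along these lines (the whitelist is illustrative):

import warnings

ALLOWED_SUFFIXES = (".tsv", ".tar.gz")  # illustrative whitelist


def check_node_edge_filetype(path: str) -> None:
    """Warn early when a node/edge path doesn't look like a supported type."""
    if not str(path).endswith(ALLOWED_SUFFIXES):
        warnings.warn(
            f"{path} doesn't look like one of {ALLOWED_SUFFIXES}; "
            "NEAT may fail to parse it"
        )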

history.to_json() is None

Running neat run --config tests/resources/test.yaml
has the following result:

  0% 0/4 [00:00<?, ?it/s]2021-11-18 18:47:17.046285: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
tcmalloc: large alloc 1635311616 bytes == 0x5565d5aa6000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x5566989c2000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 3270615040 bytes == 0x5566fa978000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056c5010 0x7f42056c573c 0x7f42056c585d 0x5565905a4749 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565904e2d14 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4
tcmalloc: large alloc 3270615040 bytes == 0x5567bd892000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056b03a9 0x7f42056b2ab5 0x55659068a409 0x556590611e7a 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4 0x5565906109ee 0x5565905a448c 0x5565905a4698 0x556590612fe4 0x5565905a3afa 0x556590611c0d 0x556590610ced 0x5565905a3bda 0x556590611c0d 0x5565906109ee 0x5565905a4271 0x5565905a4698 0x556590612fe4 0x5565906109ee 0x5565905a4271
2021-11-18 18:48:08.252436: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
14719/14719 [==============================] - 147s 10ms/step - loss: 0.2194 - auprc: 0.9560 - auroc: 0.9626 - Recall: 0.9859 - Precision: 0.8567 - accuracy: 0.9105 - val_loss: 0.2358 - val_auprc: 0.9588 - val_auroc: 0.9634 - val_Recall: 0.9745 - val_Precision: 0.8537 - val_accuracy: 0.9037
  0% 0/4 [03:21<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 71, in run
    f.write(history.to_json())
AttributeError: 'NoneType' object has no attribute 'to_json'
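
A guard in cli.py along these lines would avoid the crash (history and history_path stand in for the variables in the trace and config above; this is a sketch, not the actual NEAT code):

import warnings

# "history" is whatever the embedding step returned; "history_path" is the
# configured embedding_history_file_name. Both may legitimately be absent.
if history is not None:
    with open(history_path, "w") as f:
        f.write(history.to_json())
else:
    warnings.warn("No embedding history returned - skipping history file")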

Add quickstart notebook

Essentially a quick start, with just a few boxes:

  • Point at a compressed graph
  • Identify a config YAML
  • Run it
  • Look at the output

In link prediction, filter nodes by prefix or other slots

Some graphs have nodes we would like to filter for, but they don't make clear distinctions in their Biolink categories:

PR:000002977    biolink:NamedThing                              Graph                                             owl:Class

So we would like to specify a filter on prefix rather than category.
This could be driven by a flag in the link_node_types: block of the config.

Similarly, it would be nice to be able to filter by other node slots/properties:

XPO:0134172     biolink:NamedThing      increased apoptosis in simple columnar epithelium       An increased occurrence of apoptotic process in simple columnar epithelium.                Graph

This could be as simple as a regex for a string value in a named column, e.g., match everything containing the string "apoptosis" (see the sketch below).
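
Both filters are straightforward over the node list; a pandas sketch (column names follow the KGX examples above, the file name is a placeholder):

import pandas as pd

nodes = pd.read_csv("nodes.tsv", sep="\t")

# Filter by CURIE prefix, e.g. keep only PR: nodes...
pr_nodes = nodes[nodes["id"].str.startswith("PR:")]

# ...or by a regex on another slot, e.g. names mentioning "apoptosis".
apoptosis_nodes = nodes[nodes["name"].str.contains("apoptosis", na=False)]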

Missing key in defining metrics in the config

When defining metrics in a neat config yaml as follows:

        metrics_config:
          metrics:
            - name: auprc
              type: tensorflow.keras.metrics.AUC
              curve: PR
            - name: auroc
              type: tensorflow.keras.metrics.AUC
              curve: ROC
            - name: Recall
              type: tensorflow.keras.metrics.Recall
            - name: Precision
              type: tensorflow.keras.metrics.Precision
            - type: accuracy

mlp_model.py looks for the parameters key but can't find it, raising a KeyError:

        for m in metrics:
            if m["type"].startswith("tensorflow.keras"):
                m_class = self.dynamically_import_class(m["type"])
                m_parameters = m["parameters"]
                m_instance = m_class(**m_parameters)
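
A minimal fix for the KeyError is to treat parameters as optional (whether name and curve should also be folded into the constructor arguments is a separate question):

        for m in metrics:
            if m["type"].startswith("tensorflow.keras"):
                m_class = self.dynamically_import_class(m["type"])
                # "parameters" is optional; default to no constructor args.
                m_parameters = m.get("parameters", {})
                m_instance = m_class(**m_parameters)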

Stack trace from a recent neat-kghub-scheduler run:

12:09:08  Traceback (most recent call last):
12:09:08    File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
12:09:08      sys.exit(cli())
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
12:09:08      return self.main(*args, **kwargs)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1053, in main
12:09:08      rv = self.invoke(ctx)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
12:09:08      return _process_result(sub_ctx.command.invoke(sub_ctx))
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
12:09:08      return ctx.invoke(self.callback, **ctx.params)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
12:09:08      return __callback(*args, **kwargs)
12:09:08    File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/cli.py", line 82, in run
12:09:08      model.compile()
12:09:08    File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/link_prediction/mlp_model.py", line 45, in compile
12:09:08      m_parameters = m["parameters"]
12:09:08  KeyError: 'parameters'

error when not specifying validation sets?

When these two keys are missing in the yaml:

pos_validation
neg_validation

Then get the following error:

can't find key in YAML: 'pos_validation'
can't find key in YAML: 'neg_validation'
Traceback (most recent call last):
  File "/global/scratch/marcin/N2V/NEAT/venv/bin/neat", line 10, in <module>
    sys.exit(cli())
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/cli.py", line 61, in run
    yhelp.edge_embedding_method())
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/link_prediction/model.py", line 71, in make_link_prediction_data
    these_params.update(graph_args)
TypeError: 'NoneType' object is not iterable
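
A defensive fix in make_link_prediction_data is to treat the missing block as an empty mapping (variable names from the trace above):

# graph_args is None when pos_validation/neg_validation are absent;
# defaulting to {} makes update() a no-op instead of a TypeError.
these_params.update(graph_args or {})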

Move `graph_path` argument out of `graph_data` arguments

As seen in issues like #73, it is problematic to have an additional argument passed to a function expecting to pass all kwargs to Ensmallen. It also complicates LinkML schema design.
The graph_path arg can still be optional, but should be provided independent of graph_data.

Allow config YAML to not contain graph files

Not every run will involve starting from scratch on a graph, but NEAT currently expects a graph_data block to be present in the config.
Omit it, and a KeyError is raised:

Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 11, in <module>
    load_entry_point('neat==0.0.1', 'console_scripts', 'neat')()
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/cli.py", line 41, in run
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/yaml_helper/yaml_helper.py", line 225, in deal_with_url_node_edge_paths
KeyError: 'graph_data'

Some modifications to the yaml_helper to make this optional should help.
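
For example, a guard along these lines in the yaml_helper methods that touch graph_data (the surrounding method context is illustrative):

# Treat the graph_data block as optional.
graph_data = self.yaml.get("graph_data", {})
if not graph_data:
    return  # nothing to resolve for this run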

some way to handle singleton nodes in graph input and resulting embeddings

Embiggen will embed every node in the node file, OR all non-singleton nodes in the edge file when no node file is provided. Maybe we can help shepherd this in the right direction: the ensmallen output includes a singleton count, so in theory the pipeline could be aware at that stage. By default it may be best to embed all nodes (hence requiring the node file), but that puts noninformative embeddings for singleton nodes in the output. Perhaps another output file (e.g., npy) excluding singletons would help here, or some way to flag the uninformative embeddings; a detection sketch is below.
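
Detecting the singletons is cheap with just the node and edge lists; a pandas sketch (column names match the quickstart config, file names are placeholders):

import pandas as pd

nodes = pd.read_csv("nodes.tsv", sep="\t")
edges = pd.read_csv("edges.tsv", sep="\t")

# A singleton never appears as a subject or object of any edge.
connected = set(edges["subject"]) | set(edges["object"])
singletons = nodes.loc[~nodes["id"].isin(connected), "id"]

print(f"{len(singletons)} singleton nodes would get uninformative embeddings")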

Example YAMLs should use different value for loading node types

In the config YAMLs, when we want to load node types from a nodelist, the Ensmallen graph loader expects to see node_list_node_types_column.
We currently use node_types_column - Ensmallen can certainly take this parameter, but it thinks it means "The name of the column of the node types *file* from where to load the node types." - emphasis mine.
We are planning to create the node types file as needed, so the YAMLs should use node_list_node_types_column to specify the column where nodes are assigned categories.

NEAT should check S3 credentials at beginning of run to avoid credential error at the end of a long run

For example, this happened at the end of a long run (see the trace below).

Should be easy to avoid by checking that credentials exist (e.g., that ~/.ec2/credentials is present) if an S3 block is given in the NEAT YAML; a sketch follows the trace.

Epoch 100/100
120/120 [==============================] - 536s 4s/step - loss: 12.9093
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 68, in run
    upload_dir_to_s3(**upload_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/upload/upload.py", line 21, in upload_dir_to_s3
    client.head_object(Bucket=s3_bucket, Key=s3_path)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 663, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 682, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 162, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/dist-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Makefile:99: recipe for target 'embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia' failed
make: *** [embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia] Error 1
21.302u 5.897s 12:55:57.82 0.0%	0+0k 0+0io 0pf+0w
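
A lightweight pre-run probe, as suggested above, might look like this (using an STS identity call to validate whatever credentials boto3 can find; not the actual NEAT check):

import boto3
import botocore.exceptions


def s3_credentials_ok() -> bool:
    """Cheaply verify AWS credentials before starting a long run."""
    try:
        boto3.client("sts").get_caller_identity()
        return True
    except botocore.exceptions.NoCredentialsError:
        return False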

Replace NEAT's BERT text embedding stuff with Ensmallen's (better) version of this

Right now we are including BERT embeddings for textual elements in the graph in NEAT, using a fairly naive average embedding of all text for a given node.

Ensmallen now supports a more sophisticated version of this, with a better way of incorporating BERT embeddings, for example by weighting embeddings using TF-IDF.

We should therefore replace NEAT's version of this with Ensmallen's BERT functionality.

Input 0 of layer "sequential" is incompatible with the layer

When running the test config:

neat run --config tests/resources/test.yaml

I get the following error:

2021-11-18 17:30:40.780109: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
  0% 0/4 [00:42<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 33, in <module>
    sys.exit(load_entry_point('neat==0.0.1', 'console_scripts', 'neat')())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/cli.py", line 67, in run
  File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/link_prediction/mlp_model.py", line 78, in fit
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 808, in train_step
        y_pred = self(x, training=True)
    File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py", line 263, in assert_input_compatibility
        raise ValueError(f'Input {input_index} of layer "{layer_name}" is '

    ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 100), found shape=(None, 868)

Looks like a mismatch between the original embedding size https://github.com/Knowledge-Graph-Hub/NEAT/blob/8c9e912239e2913c52998daa538277eeb7d544b4/tests/resources/test.yaml#L69
and the additional params.

Change YAML format to include input_data_dir and output_data_dir

Currently files are specified with full paths, which is clunky, e.g.:

node_path: data/raw/upheno/upheno_training_nodes.tsv

Instead (per convo with Nico), the YAML should specify the input directory once (input_data_dir) and prepend it to input file names automatically.

Similarly, there should be an output data directory (output_data_dir) where results are written.

Specify compressed node/edge file in config yaml

Following from #36 - all graph node/edge TSVs on KG-Hub are stored as tar.gz, so even if we specify URL locations in the config yaml, we still need to decompress the tar.gz first.
This may be resolved by including a new key in the config yaml to specify a URL to the node/edge file, then having NEAT retrieve and decompress the file in its run location, updating the node and edge paths in the process; a sketch is below.
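
A retrieval-and-decompression sketch using the BFO archive from the README (local paths are placeholders):

import shutil
import tarfile
import urllib.request

url = "https://kg-hub.berkeleybop.io/kg-obo/bfo/2019-08-26/bfo_kgx_tsv.tar.gz"
local_archive = "bfo_kgx_tsv.tar.gz"

# Retrieve the archive, then unpack it into the run's input directory...
with urllib.request.urlopen(url) as response, open(local_archive, "wb") as out:
    shutil.copyfileobj(response, out)
with tarfile.open(local_archive) as tar:
    tar.extractall("input_data")
# ...then update node_path/edge_path to point at the extracted TSVs.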
