
neat-ml's Introduction

Network Embedding All the Things (NEAT)


NEAT is a flexible pipeline for:

  • Parsing a graph serialization
  • Generating node and edge embeddings
  • Training classifiers for link prediction and label expansion
  • Making predictions
  • Creating well-formatted output and metrics for the predictions
  • Doing all of the above reproducibly, with cloud compute (or locally, if preferred)

Quick Start

pip install neat-ml
neat run --config neat_quickstart.yaml # This example file is included in the repo

NEAT will write graph embeddings to a new quickstart_output directory.

Requirements

This pipeline has grape as a major dependency.

Methods from tensorflow and scikit-learn are supported, but are not installed as dependencies, to avoid version conflicts.

Please install the versions of tensorflow, scikit-learn, CUDA, and cudnn compatible with your system and with each other prior to installing NEAT if you wish to use these methods.

On Linux, the tensorflow installation may be easiest using conda as follows:

wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O anaconda.sh
bash ./anaconda.sh -b
echo "export PATH=\$PATH:$HOME/anaconda3/bin" >> $HOME/.bashrc
conda init
conda install cudnn
conda install tensorflow

Installation

pip install neat-ml

Running NEAT

neat run --config tests/resources/test.yaml # example
neat run --config [your yaml]

The pipeline is driven by a YAML file (e.g. tests/resources/test.yaml), which contains all parameters needed to complete the pipeline. The contents and expected values for this file are defined by the neat-ml-schema.

This includes hyperparameters for machine learning and also things like files/paths to output results. Specify paths to node and edge files:

GraphDataConfiguration:
  graph:
    directed: False
    node_path: path/to/nodes.tsv
    edge_path: path/to/edges.tsv

If the graph data is in a compressed file and/or a remote location (e.g., on KG-Hub), one or more URLs may be specified in the source_data parameter:

GraphDataConfiguration:
  source_data:
    files:
      - path: https://kg-hub.berkeleybop.io/kg-obo/bfo/2019-08-26/bfo_kgx_tsv.tar.gz
        desc: "This is BFO, your favorite basic formal ontology, now in graph form."
      - path: https://someremoteurl.com/graph2.tar.gz
        desc: "This is some other graph - it may be useful."

A diagram explaining the design is available in the repository.

If you are uploading to AWS/S3, see the AWS documentation for configuring credentials.

Credits

Developed by Deepak Unni, Justin Reese, J. Harry Caufield, and Harshad Hegde.


neat-ml's Issues

pre_run_checks should catch `botocore.exceptions.NoCredentialsError`

The pre_run_checks function has its check_s3_credentials argument set to True by default, so it always checks for S3 upload details.
That's normally not a problem, but when testing locally without S3 credentials available, it raises botocore.exceptions.NoCredentialsError without catching it.
If there isn't an upload block, this should warn and continue rather than raise (i.e., warnings.warn("YAML contains no upload block - continuing")).
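
A sketch of the suggested behavior, assuming a helper that wraps the S3 check (the function name, config keys, and warning text here are illustrative, not NEAT's actual internals):

import warnings

import botocore.exceptions


def check_s3_upload(yaml_config: dict, client) -> bool:
    """Check S3 upload details, warning instead of raising when the
    upload block or credentials are missing."""
    if "upload" not in yaml_config:
        warnings.warn("YAML contains no upload block - continuing")
        return True
    try:
        # Any cheap authenticated call surfaces missing credentials early.
        client.head_bucket(Bucket=yaml_config["upload"]["s3_bucket"])
    except botocore.exceptions.NoCredentialsError:
        warnings.warn("No S3 credentials found - continuing without upload")
        return False
    return True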

Scheduled NEAT runs fail due to extra `graph_path` keyword

NEAT configs that include a graph_path value in their graph arguments lead to the following error when the scheduler attempts to run them:

15:37:00  Traceback (most recent call last):
15:37:00    File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
15:37:00      sys.exit(cli())
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
15:37:00      return self.main(*args, **kwargs)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
15:37:00      rv = self.invoke(ctx)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
15:37:00      return _process_result(sub_ctx.command.invoke(sub_ctx))
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
15:37:00      return ctx.invoke(self.callback, **ctx.params)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
15:37:00      return callback(*args, **kwargs)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/cli.py", line 55, in run
15:37:00      make_node_embeddings(**node_embedding_args)
15:37:00    File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/graph_embedding/graph_embedding.py", line 47, in make_node_embeddings
15:37:00      graph: Graph = Graph.from_csv(**main_graph_args)
15:37:00  TypeError: Graph.from_csv() got an unexpected keyword argument: graph_path

This makes sense - the ensmallen Graph.from_csv() doesn't know what to do with this keyword, as it's specific to NEAT.
It should be removed before passing kwargs to from_csv().
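
Assuming main_graph_args is the kwargs dict built from the config (as in the trace above), the fix could be as small as:

from ensmallen import Graph

# main_graph_args is the kwargs dict NEAT builds from the config;
# strip the NEAT-specific key before handing it to Ensmallen.
main_graph_args.pop("graph_path", None)
graph = Graph.from_csv(**main_graph_args)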

Requirement management

Running NEAT after a fresh install yields a ModuleNotFoundError for cpuinfo, because ensmallen imports it.
This can be avoided with pip install ensmallen - or, even better, pip install grape, as this should cover all requirements.

There may also be issues with importing TensorFlow if it is not already installed, as ensmallen/embiggen do not explicitly require it.

Upload to PyPI

  • Ensure project metadata is set up as required by PyPI
  • Ensure the README includes everything necessary
  • Set up the project package and upload to PyPI

OSError: [Errno 9] Bad file descriptor

When running the config below, the error OSError: [Errno 9] Bad file descriptor is raised. See further below for the stack trace.

name: "quick_neat"
description: "A Quick NEAT Run"
output_directory: quickstart_output

graph_data:
  graph:
    node_path: tests/resources/test_graphs/test_small_nodes.tsv
    edge_path: tests/resources/test_graphs/test_small_edges.tsv
    directed: False
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'

embeddings:
  embedding_file_name: quickstart_embedding.csv
  embedding_history_file_name: quickstart_embedding_history.json
  node_embedding_params:
    node_embedding_method_name: CBOW # one of 'CBOW', 'GloVe', 'SkipGram', 'Siamese', 'TransE', 'SimplE', 'TransH', 'TransR'
    walk_length: 10 # typically 100 or so
    batch_size: 128 # typically 512? or more
    window_size: 4
    return_weight: 1.0  # 1/p
    explore_weight: 1.0  # 1/q
    iterations: 5 # typically 20

Trace:

Exception ignored in: <function Pool.__del__ at 0x7eff78d69ca0>
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

The embeddings are still saved.

support creating positive/negative test/train splits

Need to decide how this would work. Right now it's BYOH (bring your own holdouts), and they are supplied like this:

graph_data:
  graph:
    node_path: tests/resources/test_graphs/pos_train_nodes.tsv
    edge_path: tests/resources/test_graphs/pos_train_edges.tsv

  pos_validation:
    edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
  neg_training:
    edge_path: tests/resources/test_graphs/neg_train_edges.tsv
  neg_validation:
    edge_path: tests/resources/test_graphs/neg_valid_edges.tsv

One way to support either BYOH or having NEAT make holdouts:

graph_data:
  graph:
    node_path: tests/resources/test_graphs/pos_train_nodes.tsv
    edge_path: tests/resources/test_graphs/pos_train_edges.tsv

  holdout:
    make_holdouts:
      type: connected_holdout # only option at the moment
      random_state: 42 # seed
      train_size: 0.8 # fraction
      edge_types: # optional
        - biolink:interacts_with
        - biolink:has_gene_product
      verbose: bool

    existing_holdouts:  # this OR make_holdouts (not both)
      pos_validation:
        edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
      neg_training:
        edge_path: tests/resources/test_graphs/neg_train_edges.tsv
      neg_validation:
        edge_path: tests/resources/test_graphs/neg_valid_edges.tsv
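
For the make_holdouts branch, grape/Ensmallen already provides a connected holdout; a sketch of what NEAT might call under the hood (parameter names follow the proposal above and may differ slightly across Ensmallen versions):

from ensmallen import Graph

graph = Graph.from_csv(
    node_path="tests/resources/test_graphs/pos_train_nodes.tsv",
    edge_path="tests/resources/test_graphs/pos_train_edges.tsv",
    directed=False,
)

# connected_holdout keeps the training split connected and returns
# (train, validation) graphs for the positive edges.
pos_train, pos_valid = graph.connected_holdout(
    train_size=0.8,
    random_state=42,
)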

Graph inputs should be full graph not splits, rename example paths

These two variables in the main yaml section suggest a split training set:

node_path: tests/resources/test_graphs/pos_train_nodes.tsv
edge_path: tests/resources/test_graphs/pos_train_edges.tsv

I imagine these are just placeholder paths, but a better example would indicate that these are the full graph's nodes and edges.

Loading from URL looks for wrong filename

Loading graph objects from a URL isn't quite right - the file is downloaded but yaml_helper looks for a file matching the URL string rather than the 'safe' reformatted filename.
Example:
The neat.yaml contains this:

    graph_path: https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz 

The file is downloaded:

$ ls -ls https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz 
43200 -rw-r--r-- 1 harry harry 44235554 Apr 13 14:36 https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz

but raises FileNotFoundError upon trying to decompress it:

$ neat run --config neat.yaml
Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/cli.py", line 41, in run
    if not pre_run_checks(yhelp=yhelp):
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/pre_run_checks/pre_run_checks.py", line 82, in pre_run_checks
    if check_file_extensions and yhelp.main_graph_args():
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 211, in main_graph_args
    return self.add_indir_to_graph_data(self.yaml['graph_data']['graph'])
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 164, in add_indir_to_graph_data
    decomp_outfile = tarfile.open(filepath)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz'

Performing any further actions on the downloaded file should use the updated filename.
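
The reformatted name appears to replace URL separators with underscores, so the fix is to derive the same 'safe' filename before opening the download (a sketch; the exact substitution NEAT applies may differ):

import re
import tarfile

url = "https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz"

# Mirror the downloader's reformatting: ':' and '/' become '_'.
safe_filename = re.sub(r"[:/]", "_", url)
with tarfile.open(safe_filename) as decomp_outfile:
    decomp_outfile.extractall()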

`poetry`fying the project

  • Add poetry for package management.
  • Set up build and publish capabilities
  • Autodocs (works hand-in-hand with #71)

Create index.html upon uploading new output

Newly created output from NEAT is uploaded to a remote location (e.g., an S3 bucket) but isn't inherently navigable because no index.html is written. We will need to write/update this file.

Incorporate GNNs

We have some existing implementations but will need to allow access to them through a NEAT config.

RuntimeError when validation fails isn't very informative

When running a config, this is what happens:

$ neat run --config neat_quickstart.yaml
Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/cli.py", line 45, in run
    yhelp = YamlHelper(config)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 147, in __init__
    if not validate_config(self.yaml):
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 43, in validate_config
    raise RuntimeError
RuntimeError

So this configuration clearly doesn't pass the validation step, but why not?
We need a more informative output here.
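
One low-effort improvement is to collect the failing checks and put them in the exception message; a sketch (the specific checks shown are illustrative, not NEAT's actual validation rules):

def validate_config(config: dict) -> bool:
    """Validate a NEAT config, reporting *why* validation failed."""
    problems = []
    for required in ("name", "output_directory"):
        if required not in config:
            problems.append(f"missing required key: {required}")
    if problems:
        raise RuntimeError(
            "NEAT config failed validation:\n  " + "\n  ".join(problems)
        )
    return True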

Output link predictions as SSSOM

The existing output for predicted links looks like this:

source_node	destination_node	score
ENSP00000451575	ENSP00000435370	0.9370759965425138
ENSP00000451575	ENSP00000435370	0.9370759965425138
ENSP00000451575	ENSP00000361636	0.9361207132288921
ENSP00000451575	ENSP00000361636	0.9361207132288921
ENSP00000451575	ENSP00000357879	0.9361171909487621

The corresponding columns in SSSOM are subject_id, object_id, and confidence.
We can change the column headings and already be compliant.

It would be best to include some provenance as well, though - a string can go in the additional columns mapping_tool and mapping_tool_version.
We have a few different things to keep track of here:

  • neat-ml version
  • grape version
  • contents of the specific neat config

This may not capture the entirety of the mapping, but it can serve as a frame of reference; a conversion sketch follows below.
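
A conversion sketch with pandas (the file names and the version string are placeholders):

import pandas as pd

preds = pd.read_csv("link_predictions.tsv", sep="\t")

# Rename to the corresponding SSSOM columns...
sssom = preds.rename(
    columns={
        "source_node": "subject_id",
        "destination_node": "object_id",
        "score": "confidence",
    }
)
# ...and record provenance in additional columns.
sssom["mapping_tool"] = "neat-ml"
sssom["mapping_tool_version"] = "0.0.0"  # placeholder version string

sssom.to_csv("link_predictions.sssom.tsv", sep="\t", index=False)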

Use LinkML to parse and validate NEAT YAML files

We should consider using LinkML to validate our NEAT YAML files. Right now we are using some simple procedural code within the YamlHelper class to do this, but LinkML would provide a more sophisticated and thorough check of the YAML.

On the other hand, this would take some time to implement, and might increase the complexity of debugging NEAT YAML validation.

Just making this ticket to discuss.

Drop support for Tensorflow and scikit-learn methods

We currently support TF and scikit-learn methods in config files, both in running classifiers and applying models.
If we remove support for these two frameworks:

  • The neat-schema will be less confusing and easier to apply
  • We can focus on a more specific set of use cases, driven by grape
  • We avoid issues due to upstream changes in frameworks other than grape

The downsides:

  • grape doesn't provide wrappers for simple methods in scikit-learn, like logistic regression (AFAIK)
  • We won't be able to replicate existing pipelines built around TF/scikit

Fail gracefully when provided node/edge path/URL that isn't node or edge

Providing NEAT with a config file where the node and/or edge path does not resolve to a TSV currently throws a bare ValueError (e.g., if the file is actually a tar.gz).
This could be more informative to the user.
It would be especially nice to have a quick check that verifies the expected filetype and throws a warning if it isn't in ['tsv','tar.gz'], etc. (see the sketch below).
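
A quick check along these lines (the whitelist is illustrative):

import warnings

ALLOWED_SUFFIXES = (".tsv", ".tar.gz")  # illustrative whitelist


def check_node_edge_filetype(path: str) -> None:
    """Warn early when a node/edge path doesn't look like a supported type."""
    if not str(path).endswith(ALLOWED_SUFFIXES):
        warnings.warn(
            f"{path} doesn't look like one of {ALLOWED_SUFFIXES}; "
            "NEAT may fail to parse it"
        )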

history.to_json() is None

Running neat run --config tests/resources/test.yaml
has the following result:

  0% 0/4 [00:00<?, ?it/s]2021-11-18 18:47:17.046285: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
tcmalloc: large alloc 1635311616 bytes == 0x5565d5aa6000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x5566989c2000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 3270615040 bytes == 0x5566fa978000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056c5010 0x7f42056c573c 0x7f42056c585d 0x5565905a4749 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565904e2d14 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4
tcmalloc: large alloc 3270615040 bytes == 0x5567bd892000 @  0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056b03a9 0x7f42056b2ab5 0x55659068a409 0x556590611e7a 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4 0x5565906109ee 0x5565905a448c 0x5565905a4698 0x556590612fe4 0x5565905a3afa 0x556590611c0d 0x556590610ced 0x5565905a3bda 0x556590611c0d 0x5565906109ee 0x5565905a4271 0x5565905a4698 0x556590612fe4 0x5565906109ee 0x5565905a4271
2021-11-18 18:48:08.252436: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
14719/14719 [==============================] - 147s 10ms/step - loss: 0.2194 - auprc: 0.9560 - auroc: 0.9626 - Recall: 0.9859 - Precision: 0.8567 - accuracy: 0.9105 - val_loss: 0.2358 - val_auprc: 0.9588 - val_auroc: 0.9634 - val_Recall: 0.9745 - val_Precision: 0.8537 - val_accuracy: 0.9037
  0% 0/4 [03:21<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 71, in run
    f.write(history.to_json())
AttributeError: 'NoneType' object has no attribute 'to_json'
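
A guard in cli.py along these lines would avoid the crash (history and history_path stand in for the variables in the trace and config above; this is a sketch, not the actual NEAT code):

import warnings

# "history" is whatever the embedding step returned; "history_path" is the
# configured embedding_history_file_name. Both may legitimately be absent.
if history is not None:
    with open(history_path, "w") as f:
        f.write(history.to_json())
else:
    warnings.warn("No embedding history returned - skipping history file")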

Add quickstart notebook

Essentially a quick start, with just a few boxes:

  • Point at a compressed graph
  • Identify a config YAML
  • Run it
  • Look at the output

In link prediction, filter nodes by prefix or other slots

Some graphs have nodes we would like to filter for, but they don't make clear distinctions in their Biolink categories:

PR:000002977    biolink:NamedThing                              Graph                                             owl:Class

So we would like to specify a filter on prefix rather than category.
This could be driven by a flag in the link_node_types: block of the config.

Similarly, it would be nice to be able to filter by other node slots/properties:

XPO:0134172     biolink:NamedThing      increased apoptosis in simple columnar epithelium       An increased occurrence of apoptotic process in simple columnar epithelium.                Graph

This could be as simple as a regex for a string value in a named column, e.g., match everything containing the string "apoptosis" (see the sketch below).
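
Both filters are straightforward over the node list; a pandas sketch (column names follow the KGX examples above, the file name is a placeholder):

import pandas as pd

nodes = pd.read_csv("nodes.tsv", sep="\t")

# Filter by CURIE prefix, e.g. keep only PR: nodes...
pr_nodes = nodes[nodes["id"].str.startswith("PR:")]

# ...or by a regex on another slot, e.g. names mentioning "apoptosis".
apoptosis_nodes = nodes[nodes["name"].str.contains("apoptosis", na=False)]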

Missing key in defining metrics in the config

When defining metrics in a neat config yaml as follows:

        metrics_config:
          metrics:
            - name: auprc
              type: tensorflow.keras.metrics.AUC
              curve: PR
            - name: auroc
              type: tensorflow.keras.metrics.AUC
              curve: ROC
            - name: Recall
              type: tensorflow.keras.metrics.Recall
            - name: Precision
              type: tensorflow.keras.metrics.Precision
            - type: accuracy

mlp_model.py looks for the parameters key but can't find it, raising a KeyError:

        for m in metrics:
            if m["type"].startswith("tensorflow.keras"):
                m_class = self.dynamically_import_class(m["type"])
                m_parameters = m["parameters"]
                m_instance = m_class(**m_parameters)
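
A minimal fix for the KeyError is to treat parameters as optional (whether name and curve should also be folded into the constructor arguments is a separate question):

        for m in metrics:
            if m["type"].startswith("tensorflow.keras"):
                m_class = self.dynamically_import_class(m["type"])
                # "parameters" is optional; default to no constructor args.
                m_parameters = m.get("parameters", {})
                m_instance = m_class(**m_parameters)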

Stack trace from a recent neat-kghub-scheduler run:

12:09:08  Traceback (most recent call last):
12:09:08    File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
12:09:08      sys.exit(cli())
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
12:09:08      return self.main(*args, **kwargs)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1053, in main
12:09:08      rv = self.invoke(ctx)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
12:09:08      return _process_result(sub_ctx.command.invoke(sub_ctx))
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
12:09:08      return ctx.invoke(self.callback, **ctx.params)
12:09:08    File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
12:09:08      return __callback(*args, **kwargs)
12:09:08    File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/cli.py", line 82, in run
12:09:08      model.compile()
12:09:08    File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/link_prediction/mlp_model.py", line 45, in compile
12:09:08      m_parameters = m["parameters"]
12:09:08  KeyError: 'parameters'

error when not specifying validation sets?

When these two keys are missing in the yaml:

pos_validation
neg_validation

Then get the following error:

can't find key in YAML: 'pos_validation'
can't find key in YAML: 'neg_validation'
Traceback (most recent call last):
  File "/global/scratch/marcin/N2V/NEAT/venv/bin/neat", line 10, in <module>
    sys.exit(cli())
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/cli.py", line 61, in run
    yhelp.edge_embedding_method())
  File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/link_prediction/model.py", line 71, in make_link_prediction_data
    these_params.update(graph_args)
TypeError: 'NoneType' object is not iterable
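
A defensive fix in make_link_prediction_data is to treat the missing block as an empty mapping (variable names from the trace above):

# graph_args is None when pos_validation/neg_validation are absent;
# defaulting to {} makes update() a no-op instead of a TypeError.
these_params.update(graph_args or {})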

Move `graph_path` argument out of `graph_data` arguments

As seen in issues like #73, it is problematic to have an additional argument passed to a function expecting to pass all kwargs to Ensmallen. It also complicates LinkML schema design.
The graph_path arg can still be optional, but should be provided independent of graph_data.

Allow config YAML to not contain graph files

Not every run will involve starting from scratch on a graph, but NEAT currently expects a graph_data block to be present in the config.
Omit it, and a KeyError is raised:

Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 11, in <module>
    load_entry_point('neat==0.0.1', 'console_scripts', 'neat')()
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/cli.py", line 41, in run
  File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/yaml_helper/yaml_helper.py", line 225, in deal_with_url_node_edge_paths
KeyError: 'graph_data'

Some modifications to the yaml_helper to make this optional should help.
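
For example, a guard along these lines in the yaml_helper methods that touch graph_data (the surrounding method context is illustrative):

# Treat the graph_data block as optional.
graph_data = self.yaml.get("graph_data", {})
if not graph_data:
    return  # nothing to resolve for this run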

some way to handle singleton nodes in graph input and resulting embeddings

Embiggen will embed every node in the node file, OR all non-singleton nodes in the edge file when no node file is provided. Maybe we can help shepherd this in the right direction: the ensmallen output includes a singleton count, so in theory the pipeline could be aware at that stage. By default it may be best to embed all nodes (hence requiring the node file), but that puts noninformative embeddings for singleton nodes in the output. Perhaps another output file (e.g., npy) excluding singletons would help here, or some way to flag the uninformative embeddings; a detection sketch is below.
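
Detecting the singletons is cheap with just the node and edge lists; a pandas sketch (column names match the quickstart config, file names are placeholders):

import pandas as pd

nodes = pd.read_csv("nodes.tsv", sep="\t")
edges = pd.read_csv("edges.tsv", sep="\t")

# A singleton never appears as a subject or object of any edge.
connected = set(edges["subject"]) | set(edges["object"])
singletons = nodes.loc[~nodes["id"].isin(connected), "id"]

print(f"{len(singletons)} singleton nodes would get uninformative embeddings")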

Example YAMLs should use different value for loading node types

In the config YAMLs, when we want to load node types from a nodelist, the Ensmallen graph loader expects to see node_list_node_types_column.
We currently use node_types_column - Ensmallen can certainly take this parameter, but it thinks it means "The name of the column of the node types *file* from where to load the node types." - emphasis mine.
We are planning to create the node types file as needed, so the YAMLs should use node_list_node_types_column to specify the column where nodes are assigned categories.

NEAT should check S3 credentials at beginning of run to avoid credential error at the end of a long run

For example, this happened at the end of a long run (see the trace below).

Should be easy to avoid by checking that credentials exist (e.g., that ~/.ec2/credentials is present) if an S3 block is given in the NEAT YAML; a sketch follows the trace.

Epoch 100/100
120/120 [==============================] - 536s 4s/step - loss: 12.9093
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 68, in run
    upload_dir_to_s3(**upload_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat/upload/upload.py", line 21, in upload_dir_to_s3
    client.head_object(Bucket=s3_bucket, Key=s3_path)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 663, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 682, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 162, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/dist-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Makefile:99: recipe for target 'embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia' failed
make: *** [embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia] Error 1
21.302u 5.897s 12:55:57.82 0.0%	0+0k 0+0io 0pf+0w
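
A lightweight pre-run probe, as suggested above, might look like this (using an STS identity call to validate whatever credentials boto3 can find; not the actual NEAT check):

import boto3
import botocore.exceptions


def s3_credentials_ok() -> bool:
    """Cheaply verify AWS credentials before starting a long run."""
    try:
        boto3.client("sts").get_caller_identity()
        return True
    except botocore.exceptions.NoCredentialsError:
        return False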

Replace NEAT's BERT text embedding stuff with Ensmallen's (better) version of this

Right now we are including BERT embeddings for textual elements in the graph in NEAT, using a fairly naive average embedding of all text for a given node.

Ensmallen now supports a more sophisticated version of this, with a better way of incorporating BERT embeddings, for example by weighting embeddings using TF-IDF.

We should therefore replace NEAT's version of this with Ensmallen's BERT functionality.

Input 0 of layer "sequential" is incompatible with the layer

When running the test config:

neat run --config tests/resources/test.yaml

I get the following error:

2021-11-18 17:30:40.780109: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
  0% 0/4 [00:42<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/neat", line 33, in <module>
    sys.exit(load_entry_point('neat==0.0.1', 'console_scripts', 'neat')())
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/cli.py", line 67, in run
  File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/link_prediction/mlp_model.py", line 78, in fit
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 808, in train_step
        y_pred = self(x, training=True)
    File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py", line 263, in assert_input_compatibility
        raise ValueError(f'Input {input_index} of layer "{layer_name}" is '

    ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 100), found shape=(None, 868)

Looks like a mismatch between the original embedding size https://github.com/Knowledge-Graph-Hub/NEAT/blob/8c9e912239e2913c52998daa538277eeb7d544b4/tests/resources/test.yaml#L69
and the additional params.

Change YAML format to include input_data_dir and output_data_dir

Currently files are specified with full paths, which is clunky, e.g.:

node_path: data/raw/upheno/upheno_training_nodes.tsv

Instead (per convo with Nico), the YAML should specify the input directory once (input_data_dir) and prepend it to input file names automatically.

Similarly, there should be an output data directory (output_data_dir) where results are written.

Specify compressed node/edge file in config yaml

Following from #36 - all graph node/edge TSVs on KG-Hub are stored as tar.gz, so even if we specify URL locations in the config yaml, we still need to decompress the tar.gz first.
This may be resolved by including a new key in the config yaml to specify a URL to the node/edge file, then having NEAT retrieve and decompress the file in its run location, updating the node and edge paths in the process; a sketch is below.
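
A retrieval-and-decompression sketch using the BFO archive from the README (local paths are placeholders):

import shutil
import tarfile
import urllib.request

url = "https://kg-hub.berkeleybop.io/kg-obo/bfo/2019-08-26/bfo_kgx_tsv.tar.gz"
local_archive = "bfo_kgx_tsv.tar.gz"

# Retrieve the archive, then unpack it into the run's input directory...
with urllib.request.urlopen(url) as response, open(local_archive, "wb") as out:
    shutil.copyfileobj(response, out)
with tarfile.open(local_archive) as tar:
    tar.extractall("input_data")
# ...then update node_path/edge_path to point at the extracted TSVs.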
