knowledge-graph-hub / neat-ml
Network Embedding All the Things
License: BSD 3-Clause "New" or "Revised" License
Embiggen will embed every node in the node file, OR all non-singleton nodes in the edge file (when no node file is provided). Maybe we can help shepherd this in the right direction. The ensmallen output includes a singleton count, so in theory the pipeline could be aware of singletons at that stage. By default it may be best to embed all nodes (and hence require the node file); however, this can include noninformative embeddings for singleton nodes in the output, so perhaps another output file (e.g., .npy) excluding singletons would help here, or some way to indicate the uninformative embeddings.
Perhaps with Sphinx
Some graphs have nodes we would like to filter for, but their Biolink categories don't make clear distinctions:
PR:000002977 biolink:NamedThing Graph owl:Class
So we would like to specify a filter for prefix rather than category.
This can be based on a flag used in the link_node_types: block in the config.
Similarly, it would be nice to be able to filter by other node slots/properties:
XPO:0134172 biolink:NamedThing increased apoptosis in simple columnar epithelium An increased occurrence of apoptotic process in simple columnar epithelium. Graph
This could be as simple as a regex for a string value in a named column, e.g., match everything containing the string "apoptosis".
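A minimal sketch of such a filter using only the standard library (the function name, column names, and file layout here are assumptions based on a typical KGX-style nodelist):

```python
import csv
import re

def filter_nodes(nodes_tsv, column, pattern):
    """Yield rows of a tab-separated nodelist whose `column` matches `pattern`."""
    regex = re.compile(pattern)
    with open(nodes_tsv, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if regex.search(row.get(column, "") or ""):
                yield row
```

A prefix filter falls out of the same approach, e.g. filter_nodes("nodes.tsv", "id", r"^PR:").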
For example, this happened at the end of a long run (see below). It should be easy to avoid by checking that ~/.aws/credentials exists when an S3 block is given in the NEAT YAML.
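A pre-run check along these lines could be very small (the function name is hypothetical; botocore also honors environment variables, so a file check alone is only a heuristic):

```python
import os

def s3_credentials_present() -> bool:
    """Heuristic check that AWS credentials are available before a long run."""
    if os.environ.get("AWS_ACCESS_KEY_ID") and os.environ.get("AWS_SECRET_ACCESS_KEY"):
        return True
    # botocore's shared credentials file default location
    return os.path.exists(os.path.expanduser("~/.aws/credentials"))
```

Calling this before training starts would let NEAT fail fast instead of after a 13-hour run.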
Epoch 100/100
120/120 [==============================] - 536s 4s/step - loss: 12.9093
Traceback (most recent call last):
File "/usr/local/bin/neat", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 68, in run
upload_dir_to_s3(**upload_kwargs)
File "/usr/local/lib/python3.7/dist-packages/neat/upload/upload.py", line 21, in upload_dir_to_s3
client.head_object(Bucket=s3_bucket, Key=s3_path)
File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 663, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.7/dist-packages/botocore/client.py", line 682, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/dist-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.7/dist-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/usr/local/lib/python3.7/dist-packages/botocore/signers.py", line 162, in sign
auth.add_auth(request)
File "/usr/local/lib/python3.7/dist-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Makefile:99: recipe for target 'embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia' failed
make: *** [embedding-upheno-hp-mp-with-relations-with-mp-hp-pistoia] Error 1
21.302u 5.897s 12:55:57.82 0.0% 0+0k 0+0io 0pf+0w
Not every run will involve starting from scratch on a graph, but NEAT currently expects a graph_data block to be present in the config. Omit it, and a KeyError is raised:
Traceback (most recent call last):
File "/home/harry/kg-env/bin/neat", line 11, in <module>
load_entry_point('neat==0.0.1', 'console_scripts', 'neat')()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/cli.py", line 41, in run
File "/home/harry/kg-env/lib/python3.8/site-packages/neat-0.0.1-py3.8.egg/neat/yaml_helper/yaml_helper.py", line 225, in deal_with_url_node_edge_paths
KeyError: 'graph_data'
Some modifications to the yaml_helper to make this optional should help.
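The optional lookup could use dict.get rather than direct indexing - a sketch (the key names follow the traceback above, but the function is a simplified stand-in for the YamlHelper method):

```python
def main_graph_args(config: dict) -> dict:
    """Return the graph args if present; an empty dict when graph_data is omitted."""
    return config.get("graph_data", {}).get("graph", {})
```

With this shape, a config lacking graph_data simply yields no graph args instead of a KeyError.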
NEAT configs with the graph_args value lead to the following error when the scheduler attempts to run them:
15:37:00 Traceback (most recent call last):
15:37:00 File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
15:37:00 sys.exit(cli())
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
15:37:00 return self.main(*args, **kwargs)
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
15:37:00 rv = self.invoke(ctx)
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
15:37:00 return _process_result(sub_ctx.command.invoke(sub_ctx))
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
15:37:00 return ctx.invoke(self.callback, **ctx.params)
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
15:37:00 return callback(*args, **kwargs)
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/cli.py", line 55, in run
15:37:00 make_node_embeddings(**node_embedding_args)
15:37:00 File "/home/jenkinsuser/anaconda3/lib/python3.7/site-packages/neat/graph_embedding/graph_embedding.py", line 47, in make_node_embeddings
15:37:00 graph: Graph = Graph.from_csv(**main_graph_args)
15:37:00 TypeError: Graph.from_csv() got an unexpected keyword argument: graph_path
This makes sense - the ensmallen Graph.from_csv() doesn't know what to do with this keyword, as it's specific to NEAT. It should be removed before passing kwargs to from_csv().
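A sketch of that cleanup step (the key set is an assumption; only graph_path is named in the traceback):

```python
# Assumed: keys NEAT adds to the graph args that ensmallen does not accept.
NEAT_ONLY_KEYS = {"graph_path"}

def clean_graph_args(main_graph_args: dict) -> dict:
    """Return a copy of the args that is safe to pass as Graph.from_csv(**kwargs)."""
    return {k: v for k, v in main_graph_args.items() if k not in NEAT_ONLY_KEYS}
```

The original dict is left untouched, so NEAT can still read graph_path for its own purposes.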
Running neat run --config tests/resources/test.yaml
has the following result:
0% 0/4 [00:00<?, ?it/s]2021-11-18 18:47:17.046285: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
WARNING:root:Can't find key node_path in graph_data - skipping (possibly harmless)
tcmalloc: large alloc 1635311616 bytes == 0x5565d5aa6000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x556637234000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 1635311616 bytes == 0x5566989c2000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561e35f 0x7f42056c0103 0x5565905a2544 0x5565905a2240 0x556590616627 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a3bda 0x556590611915 0x556590610ced 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565906109ee 0x5565905a3bda 0x556590612737 0x556590610ced 0x5565905a5cfe
tcmalloc: large alloc 3270615040 bytes == 0x5566fa978000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056c5010 0x7f42056c573c 0x7f42056c585d 0x5565905a4749 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565904e2d14 0x7f420560aef7 0x5565905a2437 0x5565905a2240 0x556590615973 0x5565906109ee 0x5565905a3bda 0x556590615d00 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4
tcmalloc: large alloc 3270615040 bytes == 0x5567bd892000 @ 0x7f420b27c1e7 0x7f42055cd46e 0x7f420561dc7b 0x7f420561dd18 0x7f42056b03a9 0x7f42056b2ab5 0x55659068a409 0x556590611e7a 0x5565906109ee 0x5565905a3bda 0x556590612737 0x5565905a3afa 0x556590615d00 0x5565906109ee 0x5565904e2e2b 0x556590612fe4 0x5565906109ee 0x5565905a448c 0x5565905a4698 0x556590612fe4 0x5565905a3afa 0x556590611c0d 0x556590610ced 0x5565905a3bda 0x556590611c0d 0x5565906109ee 0x5565905a4271 0x5565905a4698 0x556590612fe4 0x5565906109ee 0x5565905a4271
2021-11-18 18:48:08.252436: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
14719/14719 [==============================] - 147s 10ms/step - loss: 0.2194 - auprc: 0.9560 - auroc: 0.9626 - Recall: 0.9859 - Precision: 0.8567 - accuracy: 0.9105 - val_loss: 0.2358 - val_auprc: 0.9588 - val_auroc: 0.9634 - val_Recall: 0.9745 - val_Precision: 0.8537 - val_accuracy: 0.9037
0% 0/4 [03:21<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/neat", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/neat/cli.py", line 71, in run
f.write(history.to_json())
AttributeError: 'NoneType' object has no attribute 'to_json'
Updating for grape means we've lost some history files, though I suspect they can be produced with the right parameters.
The current yaml format expects to be provided direct local paths to node and edge lists:
https://github.com/Knowledge-Graph-Hub/NEAT/blob/768cdf6d8bb9f069339e1c4d7519d0d73cfef15b/tests/resources/test.yaml#L8-L9
If it could be provided with URLs instead or additionally to local paths, remote runs of NEAT would be easier, since we assume it will have to retrieve node/edgelists from somewhere else anyway.
Right now, metrics like AUROC and AUPRC are printed to stdout during training and at the end of a NEAT run.
These should be emitted as JSON or some other machine-readable format.
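For example, the final metrics could be dumped alongside the other outputs (the file name and dict shape here are assumptions):

```python
import json

def write_metrics(metrics: dict, outfile: str = "classifier_metrics.json") -> None:
    """Write training/validation metrics to a machine-readable JSON file."""
    with open(outfile, "w") as f:
        json.dump(metrics, f, indent=2, sort_keys=True)
```

Downstream tooling (e.g. the KG-Hub dashboards) could then parse results without scraping stdout.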
... include line number in [schema] validation errors ...
Originally posted by @caufieldjh in #94 (comment)
We currently support TF and scikit-learn methods in config files, both in running classifiers and applying models. If we remove support for these two frameworks, we could rely on grape alone. The downside: grape doesn't provide wrappers for simple methods in scikit-learn, like logistic regression (AFAIK).
Newly created output from NEAT is uploaded to a remote location (i.e., an S3 bucket) but isn't inherently navigable because no index.html is written. We will need to write/update this file.
Following from #36 - all graph node/edge TSVs on KG-Hub are stored as tar.gz, so even if we specify URL locations in the config yaml, we still need to decompress the tar.gz first.
This may be resolved by including a new key in the config yaml to specify a URL to the node/edge file, then having NEAT retrieve and decompress the file in its run location, updating the node and edge paths in the process.
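A sketch of that retrieve-and-decompress step with the standard library (the function name is hypothetical, and a real implementation should also validate tar member paths before extracting):

```python
import os
import tarfile
import urllib.request

def fetch_and_extract(url: str, outdir: str) -> str:
    """Download a tar.gz of node/edge TSVs and extract it into outdir."""
    os.makedirs(outdir, exist_ok=True)
    local_path = os.path.join(outdir, os.path.basename(url))
    urllib.request.urlretrieve(url, local_path)
    with tarfile.open(local_path, "r:gz") as tar:
        tar.extractall(outdir)  # caution: check member paths in production code
    return outdir
```

After extraction, NEAT would rewrite node_path and edge_path to point into outdir.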
Currently files are specified with whole paths, which is clunky, e.g.:
node_path: data/raw/upheno/upheno_training_nodes.tsv
Instead (per convo with Nico), the YAML should specify the input directory once (input_data_dir) and prepend this to input file names automatically. Similarly, there should be an output data directory (output_data_dir) where outputs are written.
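The prepend step itself is small (the key names follow the proposal; the helper is a sketch, not the real YamlHelper code):

```python
import os

def resolve_paths(config: dict) -> dict:
    """Prepend input_data_dir to any relative node/edge paths in the config."""
    indir = config.get("input_data_dir", "")
    resolved = dict(config)
    for key in ("node_path", "edge_path"):
        if key in resolved and not os.path.isabs(resolved[key]):
            resolved[key] = os.path.join(indir, resolved[key])
    return resolved
```

Absolute paths pass through unchanged, so existing configs would keep working.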
These two variables in the main yaml section suggest a split training set:
node_path: tests/resources/test_graphs/pos_train_nodes.tsv
edge_path: tests/resources/test_graphs/pos_train_edges.tsv
I imagine these are just placeholder paths, but a better example could indicate that these are the full graph edges and nodes.
Providing NEAT with a config file where the node and/or edge path does not resolve to a TSV (e.g., if it's a tar.gz) currently throws a ValueError.
This could be more informative to the user.
It would be especially nice to have a quick check that verifies the expected filetype and throws a warning if it isn't in ['tsv', 'tar.gz'], etc.
Instead of installing ensmallen and embiggen separately, install grape and import from it, e.g. from grape.embedders import SkipGramEnsmallen, then use grape native methods as per #77. See the grape tutorials for examples: https://github.com/AnacletoLAB/grape/tree/main/tutorials
The existing output for predicted links looks like this:
source_node destination_node score
ENSP00000451575 ENSP00000435370 0.9370759965425138
ENSP00000451575 ENSP00000435370 0.9370759965425138
ENSP00000451575 ENSP00000361636 0.9361207132288921
ENSP00000451575 ENSP00000361636 0.9361207132288921
ENSP00000451575 ENSP00000357879 0.9361171909487621
The corresponding columns in SSSOM are subject_id, object_id, and confidence. We can change the column headings and already be compliant. It would be best to include some provenance as well, though - a string can go in additional columns, mapping_tool and mapping_tool_version.
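The header change could be a small post-processing step - a stdlib-only sketch, assuming the predictions file is tab-separated and treating the provenance values as placeholders:

```python
import csv

RENAME = {"source_node": "subject_id", "destination_node": "object_id", "score": "confidence"}

def to_sssom(infile: str, outfile: str, tool: str = "neat-ml", version: str = "0.0.0") -> None:
    """Rewrite predicted links with SSSOM column headings plus provenance columns."""
    with open(infile, newline="") as fin, open(outfile, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        fields = [RENAME.get(c, c) for c in reader.fieldnames] + ["mapping_tool", "mapping_tool_version"]
        writer = csv.DictWriter(fout, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            out = {RENAME.get(k, k): v for k, v in row.items()}
            out.update(mapping_tool=tool, mapping_tool_version=version)
            writer.writerow(out)
```

The score values pass through untouched; only headings and provenance columns change.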
We have a few different things to keep track of here:
Per conversation in BBOP Deep Dive, we could consider using model cards, as described in this Mitchell et al. paper, to describe and benchmark trained ML models.
This likely should/could be something that we add after we have more basic features implemented.
Essentially a quick start, with just a few boxes:
As seen in issues like #73, it is problematic to have an additional argument passed to a function expecting to pass all kwargs to Ensmallen. It also complicates LinkML schema design.
The graph_path arg can still be optional, but should be provided independent of graph_data.
There could be a few more parameters for handling the embedding outputs, e.g.:
A simple ReadTheDocs would be fine for now.
This may even be the same task as #71
The embedding methods are updated to be compatible with the latest grape in #82, but this doesn't include updates for edge prediction. We'd like to update those, too.
Right now we are including BERT embeddings for textual elements in the graph in NEAT using a fairly naive average embedding of all text for a given node.
Ensmallen now supports a more sophisticated version of this, with a better way of incorporating BERT embeddings, for example by weighting embeddings using TF-IDF.
We should therefore replace NEAT's version of this with Ensmallen's BERT functionality.
I actually haven't done much on the predictive model section in the Embiggen pipeline, but it looks like the same thoughts about embedding output NEAT parameters would apply to any predictive model intermediates and outputs.
We should consider using LinkML to validate our NEAT YAML files. Right now we are using some simple procedural code within the YamlHelper class to do this, but LinkML would provide a more sophisticated and thorough check of the YAML.
On the other hand, this would take some time to implement, and might increase the complexity of debugging NEAT YAML validation.
Just making this ticket to discuss
Loading graph objects from a URL isn't quite right - the file is downloaded, but yaml_helper looks for a file matching the URL string rather than the 'safe' reformatted filename.
Example:
The neat.yaml contains this:
graph_path: https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz
The file is downloaded:
$ ls -ls https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz
43200 -rw-r--r-- 1 harry harry 44235554 Apr 13 14:36 https___kg-hub.berkeleybop.io_kg-ontoml_20220304_KG-OntoML.tar.gz
but raises FileNotFoundError upon trying to decompress it:
$ neat run --config neat.yaml
Traceback (most recent call last):
File "/home/harry/kg-env/bin/neat", line 8, in <module>
sys.exit(cli())
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/neat/cli.py", line 41, in run
if not pre_run_checks(yhelp=yhelp):
File "/home/harry/kg-env/lib/python3.8/site-packages/neat/pre_run_checks/pre_run_checks.py", line 82, in pre_run_checks
if check_file_extensions and yhelp.main_graph_args():
File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 211, in main_graph_args
return self.add_indir_to_graph_data(self.yaml['graph_data']['graph'])
File "/home/harry/kg-env/lib/python3.8/site-packages/neat/yaml_helper/yaml_helper.py", line 164, in add_indir_to_graph_data
decomp_outfile = tarfile.open(filepath)
File "/usr/lib/python3.8/tarfile.py", line 1603, in open
return func(name, "r", fileobj, **kwargs)
File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
File "/usr/lib/python3.8/gzip.py", line 173, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz'
Performing any further actions on the downloaded file should use the updated filename.
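The download and lookup code just need to agree on one sanitizer - e.g. a shared helper (the exact substitution rule is an assumption inferred from the filename shown in the ls output above):

```python
import re

def safe_filename(url: str) -> str:
    """Reproduce the 'safe' local filename a downloaded URL is saved under."""
    return re.sub(r"[:/]", "_", url)
```

Then yaml_helper would open safe_filename(graph_path) instead of the raw URL string.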
When running the below config, the error OSError: [Errno 9] Bad file descriptor is raised. See further below for the stack trace.
name: "quick_neat"
description: "A Quick NEAT Run"
output_directory: quickstart_output
graph_data:
graph:
node_path: tests/resources/test_graphs/test_small_nodes.tsv
edge_path: tests/resources/test_graphs/test_small_edges.tsv
directed: False
verbose: True
nodes_column: 'id'
node_list_node_types_column: 'category'
default_node_type: 'biolink:NamedThing'
sources_column: 'subject'
destinations_column: 'object'
default_edge_type: 'biolink:related_to'
embeddings:
embedding_file_name: quickstart_embedding.csv
embedding_history_file_name: quickstart_embedding_history.json
node_embedding_params:
node_embedding_method_name: CBOW # one of 'CBOW', 'GloVe', 'SkipGram', 'Siamese', 'TransE', 'SimplE', 'TransH', 'TransR'
walk_length: 10 # typically 100 or so
batch_size: 128 # typically 512? or more
window_size: 4
return_weight: 1.0 # 1/p
explore_weight: 1.0 # 1/q
iterations: 5 # typically 20
Trace:
Exception ignored in: <function Pool.__del__ at 0x7eff78d69ca0>
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
The embeddings are still saved.
We support TF methods for link prediction through mlp_model.py, but other methods previously only accessible through TF are increasingly available through Ensmallen/Embiggen. We should support these methods, too.
Replace and/or refactor functions here:
https://github.com/Knowledge-Graph-Hub/neat-ml/blob/main/neat_ml/run_classifier/run_classifier.py
to use the current grape functions. One big help would be replacing candidate pair generation with grape's negative graph generation.
NEAT runs may encounter a variety of caught errors and warnings:
These should be presented to the user, perhaps through an ERROR.txt created in the same location as the final output (e.g., a graph_ml directory).
Per convo with Jules and Carlo, make embedding of Monarch graph using NEAT and put it in the usual place in https://kg-hub.berkeleybop.io/kg-monarch/
Running NEAT after a fresh install yields a ModuleNotFoundError for cpuinfo due to ensmallen importing it.
This can be avoided with pip install ensmallen - or even better, pip install grape, as this should cover all requirements.
There may also be issues with importing Tensorflow if it is not already installed, as ensmallen/embiggen do not explicitly require it.
Need to decide how this would work. Right now it's BYOH (bring your own holdouts), and they are supplied like this:
graph_data:
graph:
node_path: tests/resources/test_graphs/pos_train_nodes.tsv
edge_path: tests/resources/test_graphs/pos_train_edges.tsv
pos_validation:
edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
neg_training:
edge_path: tests/resources/test_graphs/neg_train_edges.tsv
neg_validation:
edge_path: tests/resources/test_graphs/neg_valid_edges.tsv
One way to support either BYOH or having NEAT make holdouts:
graph_data:
graph:
node_path: tests/resources/test_graphs/pos_train_nodes.tsv
edge_path: tests/resources/test_graphs/pos_train_edges.tsv
holdout:
make_holdouts:
type: connected_holdout # only option at the moment
random_state: 42 # seed
train_size: 0.8 # fraction
edge_types: # optional
- biolink:interacts_with
- biolink:has_gene_product
verbose: bool
existing_holdouts: # this OR make_holdouts (not both)
pos_validation:
edge_path: tests/resources/test_graphs/pos_valid_edges.tsv
neg_training:
edge_path: tests/resources/test_graphs/neg_train_edges.tsv
neg_validation:
edge_path: tests/resources/test_graphs/neg_valid_edges.tsv
In the config YAMLs, when we want to load node types from a nodelist, the Ensmallen graph loader expects to see node_list_node_types_column. We currently use node_types_column - Ensmallen can certainly take this parameter, but it thinks it means "The name of the column of the node types file from where to load the node types." - emphasis mine. We are planning to create the node types file as needed, so the YAMLs should use node_list_node_types_column to specify the column where nodes are assigned categories.
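So a nodelist-loading block would use, for example (key names as in the quickstart config elsewhere in this tracker):

```yaml
graph_data:
  graph:
    node_path: tests/resources/test_graphs/test_small_nodes.tsv
    nodes_column: 'id'
    node_list_node_types_column: 'category'
```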
The pre_run_checks has its check_s3_credentials argument set to True by default, so it always checks for S3 upload details.
That's normally not a problem, but when testing locally where S3 credentials aren't available, it raises botocore.exceptions.NoCredentialsError without catching it.
If there isn't an upload block, this should behave similarly to an error with existing credentials (i.e., warnings.warn("YAML contains no upload block - continuing")).
Would like to test both py3.8 and 3.9, at least.
When running the test config:
neat run --config tests/resources/test.yaml
I get the following error:
2021-11-18 17:30:40.780109: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1635305056 exceeds 10% of free system memory.
0% 0/4 [00:42<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/neat", line 33, in <module>
sys.exit(load_entry_point('neat==0.0.1', 'console_scripts', 'neat')())
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/cli.py", line 67, in run
File "/usr/local/lib/python3.7/dist-packages/neat-0.0.1-py3.7.egg/neat/link_prediction/mlp_model.py", line 78, in fit
File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 878, in train_function *
return step_function(self, iterator)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 867, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 860, in run_step **
outputs = model.train_step(data)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 808, in train_step
y_pred = self(x, training=True)
File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py", line 263, in assert_input_compatibility
raise ValueError(f'Input {input_index} of layer "{layer_name}" is '
ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 100), found shape=(None, 868)
Looks like a mismatch between the original embedding size https://github.com/Knowledge-Graph-Hub/NEAT/blob/8c9e912239e2913c52998daa538277eeb7d544b4/tests/resources/test.yaml#L69
and the additional params.
When running a config, this is what happens:
$ neat run --config neat_quickstart.yaml
Traceback (most recent call last):
File "/home/harry/kg-env/bin/neat", line 8, in <module>
sys.exit(cli())
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/cli.py", line 45, in run
yhelp = YamlHelper(config)
File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 147, in __init__
if not validate_config(self.yaml):
File "/home/harry/kg-env/lib/python3.8/site-packages/neat_ml/yaml_helper/yaml_helper.py", line 43, in validate_config
raise RuntimeError
RuntimeError
So this configuration clearly doesn't pass the validation step, but why not?
We need a more informative output here.
We have some existing implementations but will need to allow access to them through a NEAT config.
When defining metrics in a neat config yaml as follows:
metrics_config:
metrics:
- name: auprc
type: tensorflow.keras.metrics.AUC
curve: PR
- name: auroc
type: tensorflow.keras.metrics.AUC
curve: ROC
- name: Recall
type: tensorflow.keras.metrics.Recall
- name: Precision
type: tensorflow.keras.metrics.Precision
- type: accuracy
The mlp_model.py looks for the parameters key but can't find it, raising a KeyError.
for m in metrics:
    if m["type"].startswith("tensorflow.keras"):
        m_class = self.dynamically_import_class(m["type"])
        m_parameters = m["parameters"]
        m_instance = m_class(**m_parameters)
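A sketch of the fix is to treat parameters as optional via dict.get (the helper below is a simplified stand-in for the mlp_model.py loop; import_class stands in for dynamically_import_class):

```python
def build_metric(entry: dict, import_class):
    """Instantiate a metric from a config entry; the 'parameters' key is optional."""
    if not entry.get("type", "").startswith("tensorflow.keras"):
        return entry["type"]  # e.g. 'accuracy' is passed through as a plain string
    cls = import_class(entry["type"])
    return cls(**entry.get("parameters", {}))
```

Note that in the config shown above, curve: PR sits beside type rather than under a parameters: key, so the schema may also need adjusting for those values to reach the constructor.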
Stack trace from a recent neat-kghub-scheduler run:
12:09:08 Traceback (most recent call last):
12:09:08 File "/home/jenkinsuser/anaconda3/bin/neat", line 8, in <module>
12:09:08 sys.exit(cli())
12:09:08 File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
12:09:08 return self.main(*args, **kwargs)
12:09:08 File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1053, in main
12:09:08 rv = self.invoke(ctx)
12:09:08 File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
12:09:08 return _process_result(sub_ctx.command.invoke(sub_ctx))
12:09:08 File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
12:09:08 return ctx.invoke(self.callback, **ctx.params)
12:09:08 File "/home/jenkinsuser/.local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
12:09:08 return __callback(*args, **kwargs)
12:09:08 File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/cli.py", line 82, in run
12:09:08 model.compile()
12:09:08 File "/home/jenkinsuser/anaconda3/lib/python3.8/site-packages/neat_ml/link_prediction/mlp_model.py", line 45, in compile
12:09:08 m_parameters = m["parameters"]
12:09:08 KeyError: 'parameters'
@caufieldjh could flesh this ticket out to do KG-IDG link prediction as discussed here
As mentioned here: Knowledge-Graph-Hub/kg-phenio#16
The is_url and download_file methods can be used with embeddings, allowing a URL to be specified in the config.
When these two keys are missing in the yaml:
pos_validation
neg_validation
Then get the following error:
can't find key in YAML: 'pos_validation'
can't find key in YAML: 'neg_validation'
Traceback (most recent call last):
File "/global/scratch/marcin/N2V/NEAT/venv/bin/neat", line 10, in <module>
sys.exit(cli())
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/cli.py", line 61, in run
yhelp.edge_embedding_method())
File "/global/scratch/marcin/N2V/NEAT/venv/lib/python3.7/site-packages/neat/link_prediction/model.py", line 71, in make_link_prediction_data
these_params.update(graph_args)
TypeError: 'NoneType' object is not iterable
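A sketch of a guard for the missing validation blocks (the function is a hypothetical extraction of the failing line in model.py):

```python
def merge_graph_args(these_params: dict, graph_args) -> dict:
    """Merge optional graph args into the params; tolerate a missing (None) block."""
    these_params.update(graph_args or {})
    return these_params
```

With this, absent pos_validation/neg_validation blocks merge as no-ops instead of raising TypeError.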