opencog / agi-bio

Genomic and Proteomic data exploration and pattern mining

License: Other

CMake 0.04% C++ 0.01% Python 0.55% R 0.04% Scheme 99.24% Shell 0.12% Dockerfile 0.01%

genetics genetic-analysis proteomics protein-sequences artificial-intelligence genomics

agi-bio's Introduction

AGI-Bio

Genomic and proteomic research using the OpenCog toolset. This includes experiments in applying MOSES, PLN, pattern mining and other OpenCog components.

The MOZI.AI repositories make use of this package, and extend the current development of OpenCog-based bioinformatics tools as SingularityNET services.

Building and Installing

To build the AGI-Bio code, you will first need to build and install the OpenCog AtomSpace. The prerequisites listed there are also sufficient to build this project. Building is as "usual":

    cd agi-bio    # i.e. the project root dir
    mkdir build
    cd build
    cmake ..
    make -j
    sudo make install
    make -j test

Overview

The directory layout is as follows:

bioscience -- provides the GeneNode and MoleculeNode Atom types.
knowledge-import -- scripts for importing external knowledge bases into the AtomSpace.
moses-scripts -- scripts for importing MOSES models; such models distinguish binary phenotype categories based on gene expression data.

agi-bio's Issues

Add limit parameter to REST API atom fetching

Add a parameter, 'limit=' to GET .../atoms in the REST API that limits the size of the result set to that number.

It would also be good to include in the response the size n of the full result set that would have been returned without the limit.
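
A minimal Python sketch of the requested behaviour, assuming the handler already holds the full list of matching atoms. The function name `limit_results` and the response keys `total` and `atoms` are hypothetical and are not part of the existing REST API.

```python
from typing import Any, Dict, List, Optional


def limit_results(atoms: List[Any], limit: Optional[int] = None) -> Dict[str, Any]:
    """Return at most `limit` atoms together with the size of the full result set."""
    total = len(atoms)  # size of the result set without any limit applied
    if limit is not None:
        if limit < 0:
            raise ValueError("limit must be non-negative")
        atoms = atoms[:limit]
    return {"total": total, "atoms": atoms}
```

Reporting `total` next to the truncated list lets a client tell whether more atoms exist than were returned.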

Piping too much information into the cogserver crashes it

When running

./export_models_and_fitness.sh ../bestCombos50/chr22_moses.5x10 localhost 17001

eventually the cogserver crashes, but if I truncate the file to its first 5 lines, it works. That's all I know for now.

Fix parens in EvaluationLinks

There should be a closing parenthesis after the PredicateNode in each EvaluationLink, like this:

(EvaluationLink
         (PredicateNode "GO_namespace")
         (ListLink
                 (ConceptNode "GO:0000007")
                 (ConceptNode "molecular_function")))

I fixed this for GO terms and GO annotations (2029837). Please fix for the other knowledge bases where needed and regenerate the scheme files.
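
One way to keep the parentheses balanced is to build every EvaluationLink with a single helper instead of ad hoc string concatenation. The sketch below is illustrative only; `evaluation_link` is a hypothetical name, not a function from the existing import scripts.

```python
def evaluation_link(predicate: str, *concepts: str) -> str:
    """Format an EvaluationLink whose PredicateNode is closed before the ListLink."""
    args = "\n".join(f'        (ConceptNode "{c}")' for c in concepts)
    return (
        "(EvaluationLink\n"
        f'    (PredicateNode "{predicate}")\n'
        "    (ListLink\n"
        f"{args}))"
    )


print(evaluation_link("GO_namespace", "GO:0000007", "molecular_function"))
```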

Duplicates in GO_annotations.scm

Looks like there are a lot of duplicate MemberLinks in GO_annotations.scm

For example

(MemberLink
        (GeneNode "TLR8")
        (ConceptNode "GO:0010008"))
has a bunch of entries

These won't add additional atoms to the atomspace I'm pretty sure, but it's probably slowing down loading the scheme file considerably.
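
A rough Python sketch of how duplicates could be dropped from a generated scheme file before loading. It assumes the file consists of top-level parenthesized expressions and does not handle `;` comments; both function names are hypothetical.

```python
from typing import Iterator, List


def top_level_sexprs(text: str) -> Iterator[str]:
    """Yield each top-level parenthesized expression, ignoring parens inside strings."""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


def dedupe_sexprs(text: str) -> List[str]:
    """Keep the first occurrence of each expression, comparing modulo whitespace."""
    seen = set()
    unique = []
    for expr in top_level_sexprs(text):
        key = " ".join(expr.split())  # collapse runs of whitespace for comparison
        if key not in seen:
            seen.add(key)
            unique.append(expr)
    return unique
```

The whitespace normalization also collapses spaces inside quoted names, which is acceptable for a comparison key but would need refinement if names differing only in spacing must stay distinct.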

Backing Store for Bio Knowledge Atomspace

We will need to implement a backing store when we reach the point that imported bio knowledge will not all fit in RAM.

Lifespan Observations Gene Sets --> atomspace

Identify human homolog genes.
Import into atomspace
Supply list of the homolog genes for Bobby

GO.scm - escape char causes it to choke

The escape character in this term name is causing loading of the scheme file to choke, and then the rest of the atoms following are not loaded:

(EvaluationLink
         (PredicateNode "GO_name")
         (ListLink
                 (ConceptNode "GO:0033942")
                ; (ConceptNode "4-alpha-D-\{(1->4)-alpha-D-glucano}trehalose trehalohydrolase activity")
                 (ConceptNode "4-alpha-D-{(1->4)-alpha-D-glucano}trehalose trehalohydrolase activity")
         )
)

Also, the latest version of GO.scm checked into the repo has trailing spaces after some of the GO term ConceptNode names. This was causing problems in inheritance mining and will also cause problems with reasoning. I regenerated the scheme file by running the python script and the trailing spaces are no longer there... Perhaps the repo version was generated from an older version of the python script?
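
Both problems (the stray backslash and the trailing spaces) could be handled where term names are written out. The following is a sketch under the assumption that the backslash comes from escaping in the source file and can simply be dropped; `scheme_string` is a hypothetical helper, not part of the existing script.

```python
import re


def scheme_string(name: str) -> str:
    """Return `name` as a double-quoted Scheme string literal."""
    name = name.strip()  # remove leading/trailing whitespace
    # Assumption: a backslash in the source merely escapes the next character,
    # so "\{" becomes "{".
    name = re.sub(r"\\(.)", r"\1", name)
    # Re-escape only the characters a Scheme string literal requires.
    name = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{name}"'


raw = r"4-alpha-D-\{(1->4)-alpha-D-glucano}trehalose trehalohydrolase activity  "
print(scheme_string(raw))
```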

Bio-data-filled Atomspace for Pattern Mining and PLN

An Atomspace ready for pattern mining by the time Shuijing visits HK in early February. The Atomspace will include MOSES models, bio databases, and linkages between them.

Visualizer on Hetzner

Load the MSigDB atoms on a cogserver running on the Hetzner server and make available for people to explore using the atomspace visualizer.

[ ] Set up the visualizer code on Hetzner to pull automatically from the GitHub repo whenever there are new commits, if this is not too difficult to implement.

GO Ontology -> Atomspace import script

Per specifications in the Google doc.

Target Date: (Selam, please fill in)

What are the # of atoms and RAM required for full GO (ontology db) import
Load onto cogserver on Hetzner along with the MSigDB (if they both fit without backing store)
RO representation in GO (Eddie - assigned to #31)
intersection_of and typedef representation (Eddie - assigned to #7)

GO terms with empty name strings are being imported into the atomspace.

this issue is mainly intended for icog folks with access to the opencog-bio/bio-data gitlab repo, but the python scripts that convert the public data files to scheme are here in agi-bio/knowledge-import. after loading GO.scm and GO_annotations.scm (generated per the README) into an atomspace and using these functions from "GO utilities.scm":

(define GOname
    (lambda (GOterm)
        (gar
            (cog-execute!
                (GetLink
                    (EvaluationLink
                        (PredicateNode "GO_name")
                        (ListLink
                            GOterm
                            (VariableNode "$name"))))))))
                       
(define get-type-with-prefix
    (lambda (type string)
        (filter
            (lambda (atom) (string-prefix? string (cog-name atom)))
            (cog-get-atoms type))))

this procedure counts almost 2k GO terms with no associated name ConceptNode:

(length (filter
            (lambda (goterm) (eq? (list) (GOname goterm)))
            (get-type-with-prefix 'ConceptNode "GO:")))
$4 = 1976

the error could be in the python script or something missing in the original database downloads.
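
To narrow down whether the gap is already present in the downloaded data, the source file could be checked directly. The sketch below assumes the ontology download is in OBO flat-file format, with `[Term]` stanzas separated by blank lines and `id:` / `name:` tags; `terms_without_name` is a hypothetical helper.

```python
import re
from typing import Dict, List


def terms_without_name(obo_text: str) -> List[str]:
    """Return the ids of [Term] stanzas whose name tag is missing or empty."""
    missing = []
    for stanza in re.split(r"\n\s*\n", obo_text):
        lines = [ln.strip() for ln in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip the header, [Typedef] stanzas, etc.
        tags: Dict[str, str] = {}
        for ln in lines[1:]:
            key, sep, value = ln.partition(":")
            if sep and key not in tags:
                tags[key] = value.strip()
        if "id" in tags and not tags.get("name"):
            missing.append(tags["id"])
    return missing
```

If this returns an empty list for the source file while the atomspace still shows unnamed terms, the conversion script is the more likely culprit.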

RO hierarchical representation

Missing license?

I'd like to package this for GNU Guix, but I haven't been able to find a license declaration in this repository. Is this free software? If so, under what license is it released?

Add MSigDB non-verbose version to scheme_representations

libbioscience-types won't load, undefined symbol: opencog_module_id

this is the log output. the same thing happens trying to load libruleengine.so either from the cogserver.conf file or from the cogserver command line.

[2017-06-30 23:11:09:810] [INFO] Loading module "/usr/local/lib/opencog/libbioscience-types.so"
[2017-06-30 23:11:09:810] [ERROR] Unable to find symbol "opencog_module_id": /usr/local/lib/opencog/libbioscience-types.so: undefined symbol: opencog_module_id (module /usr/local/lib/opencog/libbioscience-types.so)
        Stack Trace:
        2: Logger.cc:481          opencog::Logger::logva(opencog::Logger::Level, char const*, __va_list_tag*)
        3: Logger.cc:493          opencog::Logger::Error::operator()(char const*, ...)
        4: CogServer.cc:426       opencog::CogServer::loadModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        5: CogServer.cc:630       opencog::CogServer::loadModules(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)
        6: CogServerMain.cc:223 main()
        7: libc-start.c:325     __libc_start_main()
        8: ??:0 _start()

[2017-06-30 23:11:09:878] [INFO] Loading module "/usr/local/lib//opencog/libbioscience-types.so"
[2017-06-30 23:11:09:878] [ERROR] Unable to find symbol "opencog_module_id": /usr/local/lib/opencog/libbioscience-types.so: undefined symbol: opencog_module_id (module /usr/local/lib//opencog/libbioscience-types.so)
        Stack Trace:
        2: Logger.cc:481          opencog::Logger::logva(opencog::Logger::Level, char const*, __va_list_tag*)
        3: Logger.cc:493          opencog::Logger::Error::operator()(char const*, ...)
        4: CogServer.cc:426       opencog::CogServer::loadModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        5: CogServer.cc:630       opencog::CogServer::loadModules(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)
        6: CogServerMain.cc:223 main()
        7: libc-start.c:325     __libc_start_main()
        8: ??:0 _start()

[2017-06-30 23:11:09:948] [WARN] Failed to load cogserver module libbioscience-types.so

Add Aging/Methylation Genes to Atomspace

Per Bobby:
Can we include Table S6 from this paper as an additional gene list, similar to the ones from mSigDB? http://www.sciencedirect.com/science/article/pii/S1097276512008933

This is a list of genes that seem to be modulated with the epigenetic clock of aging, which has been shown repeatedly to work across different tissues... I don't think it's already included within mSigDB though.

Here's the data: https://drive.google.com/file/d/0B3NYFAN330UTbDdBTC1RNngyV0l3cVRRVVY4cFdKRDZLR0dj/view?usp=sharing

Selam, this set can be represented similarly to how the MSigDB gene sets are represented.
Let's create a new general ConceptNode "GeneSet" that this gene set inherits from.
And then in the MSigDB representation, we should add an inheritance of MSigDB_GeneSet to GeneSet, IOW:

InheritanceLink
  ConceptNode "MSigDB_GeneSet"
  ConceptNode "GeneSet"

Fill in as many of the fields used in the MSigDB representation as can also be applied to this gene set. Perhaps Meseret can help if needed with interpreting things from the article regarding what the field values should be.
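
As a rough illustration of the representation described above, a gene list could be turned into Atomese along these lines. The helper name `gene_set_scheme` and the set name and gene symbols used in the example call are placeholders, not values taken from the paper or from the existing scripts.

```python
from typing import Iterable


def gene_set_scheme(set_name: str, genes: Iterable[str], parent: str = "GeneSet") -> str:
    """Emit an InheritanceLink to the parent set plus one MemberLink per gene."""
    lines = [
        f'(InheritanceLink (ConceptNode "{set_name}") (ConceptNode "{parent}"))'
    ]
    for gene in genes:
        lines.append(
            f'(MemberLink (GeneNode "{gene}") (ConceptNode "{set_name}"))'
        )
    return "\n".join(lines)


# Placeholder names for illustration only.
print(gene_set_scheme("Example_GeneSet", ["GENE_A", "GENE_B"]))
```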

Review atom creation scripts with Selam

where is Lifespan-observations_2015-02-21.csv ?

the csv file used by lifeSpanObservation_2015.py is not in the repo. does it still exist?
the output file is available at https://gitlab.com/opencog-bio/bio-data/blob/master/scheme-representations/Lifespan-observations_2015-02-21.scm

GO typedef representation

also representation for intersection_of

Updated MOSES Models for Import (Mike/Meseret)

For all relevant models we want included in the bio atomspace, especially the nonagenarian models, but for supercentenarians, etc. as well.

These should be in the format Mike specified for the latest conversion script of Nil's that includes model accuracy information for conversion to atomspace truth values.

The models should be delivered to Selam for importing with the latest script from Nil.

GeneNode and MoleculeNode aren't available in Python types

Although bioscience_types.pyx is generated during the build, it is not installed, and the types GeneNode and MoleculeNode aren't available for import.

MSigDB -> Atomspace Import

What are # of atoms and RAM required for full MSigDB import
switch for including the text description fields in the atomspace import
- should default to false (do not include)
- append name of output scheme file when including text data to... 'MSigDB_verbose.scm'?
- document switch usage info in the python script
At top of the output scheme file, add creation date, MSigDB version, and MSigDB fields included in the import (as comments)
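
The switch and header requirements above could look roughly like this in the Python import script. This is only a sketch: the flag name `--verbose`, the default output name `MSigDB.scm`, and all function names are assumptions; only `MSigDB_verbose.scm` comes from the list above.

```python
import argparse
from datetime import date
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert MSigDB gene sets to Scheme (illustrative sketch)."
    )
    # Hypothetical flag name; text description fields are excluded by default.
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="include the text description fields in the output",
    )
    return parser


def output_filename(verbose: bool) -> str:
    # "MSigDB.scm" is an assumed default; the verbose name is the one proposed above.
    return "MSigDB_verbose.scm" if verbose else "MSigDB.scm"


def header_comments(version: str, fields: List[str],
                    created: Optional[date] = None) -> str:
    """Build the Scheme comment lines to place at the top of the output file."""
    created = created or date.today()
    return "\n".join([
        f"; Created: {created.isoformat()}",
        f"; MSigDB version: {version}",
        f"; Fields included: {', '.join(fields)}",
    ])
```

Using `action="store_true"` makes the default false without extra code, and the `help` string doubles as the usage documentation requested in the list.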

Target completion date: (Selam, please fill in)

Binding Domain / TF / MicroRNA Representations

Leverage MSigDB geneset data to directly represent regulatory relationships between genes.

Documentation of imported bio KB's

Please add either to the knowledge-import readme or to a github project wiki page a list of the imported knowledge bases that includes the name of the associated python script, name of the atom scheme file, and link to the source.

Please make sure to update the list when new KB's are added to the bio atomspace.

build fail due to refactoring

i tried to build opencog with the "bioscience" module. after changing the directory to match the current location of "Module.h" in BioScienceTypes.cc (commit 2b8b5) the build still fails because there is no longer a file "opencog/atomspace/atom_types.cc". anyone know what replaced this file?

recent changes in atomspace/opencog/atoms/base/atom_types.cc broke bioscience module

cross reference issue opencog/atomspace#1708

[ 50%] Building CXX object bioscience/types/CMakeFiles/bioscience-types.dir/BioScienceTypes.cc.o
<command-line>:0:10: warning: missing terminating " character
In file included from /home/cog/opencog/agi-bio/bioscience/types/BioScienceTypes.cc:28:0:
/usr/local/include/opencog/atoms/base/atom_types.cc: In function ‘void init()’:
/usr/local/include/opencog/atoms/base/atom_types.cc:43:40: error: ‘class opencog::ClassServer’ has no member named ‘beginTypeDecls’
  bool is_init = opencog::classserver().beginTypeDecls(xstr(INITNAME));
                                        ^
In file included from /usr/local/include/opencog/atoms/base/atom_types.cc:46:0,
                 from /home/cog/opencog/agi-bio/bioscience/types/BioScienceTypes.cc:28:
/home/cog/opencog/agi-bio/build/bioscience/types/atom_types.inheritance:5:45: error: ‘class opencog::ClassServer’ has no member named ‘declType’
 opencog::GENE_NODE = opencog::classserver().declType(opencog::CONCEPT_NODE, "GeneNode");
                                             ^

/home/cog/opencog/agi-bio/build/bioscience/types/atom_types.inheritance:6:48: error: ‘class opencog::ClassServer’ has no member named ‘declType’
 opencog::PROTEIN_NODE = opencog::classserver().declType(opencog::CONCEPT_NODE, "ProteinNode");
                                                ^
In file included from /home/cog/opencog/agi-bio/bioscience/types/BioScienceTypes.cc:28:0:
/usr/local/include/opencog/atoms/base/atom_types.cc:50:25: error: ‘class opencog::ClassServer’ has no member named ‘endTypeDecls’
  opencog::classserver().endTypeDecls();
                         ^
bioscience/types/CMakeFiles/bioscience-types.dir/build.make:78: recipe for target 'bioscience/types/CMakeFiles/bioscience-types.dir/BioScienceTypes.cc.o' failed

bioscience/opencog.conf needs updating for type system refactoring

opencog.conf lists modules that aren't modules anymore. probably a bunch of other stuff needs updating as well.

some potentially useful code.

I've taken it upon myself to rewrite the GO.scm code a bit:
At first I ended up with this: https://github.com/jac2130/obo_to_Atom_Space
But then I found a wonderful little toolbox for ontologies, called pronto and then, after making a few adjustments to that library in my own fork (https://github.com/jac2130/pronto) and after realizing that much work has been done on opencog tools, I came up with this:
https://github.com/CollectiWise/collectiwise/blob/master/python/scheme_router.py
combined with these statements, directly in Scheme Atomese:
https://github.com/CollectiWise/collectiwise/blob/master/statements.scm
Now, you'll notice that they are very incomplete at the moment, but the idea of sending statements and relationship types directly as a JSON stream through a router to the AtomSpace is what drove me to this, because I'm building a live ontology and crowd reasoning system that may be updated in real time.
The nice ideas embodied in the pronto library are quite helpful. The pronto library allows for easy ontology merging, taking either OWL or OBO ontologies, and it links information in intuitive and powerful ways. So the idea is to push as much of the work as possible to the ontology library (pronto) and to the Scheme file in which the statements are crafted, leaving a thin message router that just takes an ontology as input and sends terms and relationships to the appropriate Scheme functions, which feed the results directly into the AtomSpace (no need for files). I will add to this that non-ontological logical statements (implications etc.) can also be sent through this router via JSON.
This will lead to a detailed logical semantics of the ontological relationships in some set of ontologies (in my case, the set of relations found in the ENVO and SDGIO (https://github.com/SDG-InterfaceOntology/sdgio) ontologies). No matter what ontologies or lists of logical statements (axioms) you use, the number of types of relations is rather small, even in huge ontologies or knowledge bases, because they are just the set of rules by which things can be ontologically or logically related. These relationships, in turn, have an even smaller set of properties (is_symmetric, is_transitive, etc.; for an exhaustive list of OBO-defined relationship properties, see https://metacpan.org/pod/OBO::Core::RelationshipType#is_cyclic), the meanings of which I will encode precisely in Scheme Atomese next. Currently, pronto only picks up a handful of those relationship properties, but I will work on that as well.
Further improvements and additional features are planned. While this code is part of a bigger project, I think it could be useful pretty quickly for the things that you are doing here and maybe you could provide me with some feedback and ideas as to how to encode things in Atomese in the process? I hope that it turns out to be useful for someone aside from me and my team!

BioScienceTypes.cc doesn't build

i think this got broken in recent atomspace/opencog updates, it built a couple weeks ago...

Scanning dependencies of target bioscience-types
[ 50%] Building CXX object bioscience/types/CMakeFiles/bioscience-types.dir/BioScienceTypes.cc.o
<command-line>:0:10: warning: missing terminating " character
In file included from /usr/local/include/opencog/atoms/base/atom_types.cc:46:0,
                 from /home/mjsd/oc/agi-bio/bioscience/types/BioScienceTypes.cc:28:
/home/mjsd/oc/agi-bio/build/bioscience/types/atom_types.inheritance: In function ‘void init()’:
/home/mjsd/oc/agi-bio/build/bioscience/types/atom_types.inheritance:1:39: error: no matching function for call to ‘opencog::ClassServer::beginTypeDecls()’
 opencog::classserver().beginTypeDecls();
                                       ^
In file included from /home/mjsd/oc/agi-bio/build/bioscience/types/atom_types.definitions:2:0,
                 from /home/mjsd/oc/agi-bio/bioscience/types/BioScienceTypes.cc:23:
/usr/local/include/opencog/atoms/base/ClassServer.h:97:10: note: candidate: bool opencog::ClassServer::beginTypeDecls(const char*)
     bool beginTypeDecls(const char *);
          ^
/usr/local/include/opencog/atoms/base/ClassServer.h:97:10: note:   candidate expects 1 argument, 0 provided
bioscience/types/CMakeFiles/bioscience-types.dir/build.make:78: recipe for target 'bioscience/types/CMakeFiles/bioscience-types.dir/BioScienceTypes.cc.o' failed
make[2]: *** [bioscience/types/CMakeFiles/bioscience-types.dir/BioScienceTypes.cc.o] Error 1

Export MOSES models to the atomspace

This issue encompasses exporting the models + their fitness semantics, that is what the score of a model means, and how this model relates to the target feature.

Export models
Export fitness information
- Accuracy and balanced accuracy (the semantics of those are left aside for now)
- Semantics of accuracy and balanced accuracy (need to import the dataset for that)
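
For reference, the two fitness measures named above can be computed from confusion-matrix counts as in this sketch. It uses the standard definitions (balanced accuracy as the mean of sensitivity and specificity); how the export script obtains the counts, and how the values map to atomspace truth values, is not specified here.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that the model classifies correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2
```

Balanced accuracy is the more informative of the two when the phenotype categories are very unequal in size, since plain accuracy can be high for a model that always predicts the majority class.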