thukeg / saedb
the SAE platform
Home Page: http://thukeg.github.com/saedb/
In real-world storage, the fields of a table (or a graph, in our application) can be quite messy.
For example, the core fields for Author might be 'name, h-index, publication_count, citation_number', while we can add more fields such as 'bio, affiliation, phd_year, phone' and so on.
However, when computing on the author network, we may only be interested in a subset of the properties, such as the core fields. If we store all fields together, we have to load all of them into memory, which is not optimal.
I suggest we add a ColumnFamily feature, in which related fields are stored together. Then we can load only the families we need.
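A minimal sketch of the idea, assuming hypothetical names (CoreFields, ProfileFields, AuthorTable are illustrations, not an existing SAE API): each family is stored contiguously on its own, so a computation that only needs the core fields never touches the profile data.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: each column family groups fields that are
// typically accessed together, so a computation can load only the
// families it needs instead of every field of every vertex.
struct CoreFields {        // the "core" family: hot fields for computation
    std::string name;
    int h_index = 0;
    int publication_count = 0;
    int citation_number = 0;
};

struct ProfileFields {     // the "profile" family: cold fields, rarely loaded
    std::string bio;
    std::string affiliation;
    int phd_year = 0;
};

// A toy columnar store: one contiguous vector per family.
// Loading the "core" family touches none of the profile data.
struct AuthorTable {
    std::vector<CoreFields> core;        // loaded for graph computation
    std::vector<ProfileFields> profile;  // loaded only on demand
};

inline double average_h_index(const AuthorTable& t) {
    double sum = 0;
    for (const auto& a : t.core) sum += a.h_index;  // only core family touched
    return t.core.empty() ? 0 : sum / t.core.size();
}
```

The key design point is that the two families live in separate allocations, so memory-mapping or deserializing one family never pages in the other.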
It seems the scheduling method hasn't been implemented yet. How do we control the program in terms of when to stop?
Also, do we need the message passing in the signal function, as in GraphLab?
Should we add a distributed global variable interface? For example, when we train a linear classifier by distributing the training data across machines, we need a distributed global variable to record the classifier's weight vector. Otherwise, we would have to copy the weight vector into every training instance, resulting in low efficiency and high space requirements.
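A minimal sketch of what such an interface could look like, assuming a hypothetical SharedVector class (all names here are assumptions, not SAE's API): each worker accumulates a local delta against its replica of the weight vector, and a reduce step sums the deltas so every replica stays consistent; the vector is stored once per machine rather than once per training instance.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a "distributed global variable" for a weight
// vector. In a real distributed setting commit() would be an
// all-reduce followed by a broadcast; here we model a single address
// space to show the interface shape.
class SharedVector {
public:
    explicit SharedVector(std::size_t dim) : value_(dim, 0.0) {}

    // Reduce phase: at the end of a superstep, sum every worker's
    // locally accumulated delta into the shared value.
    void commit(const std::vector<std::vector<double>>& worker_deltas) {
        for (const auto& delta : worker_deltas)
            for (std::size_t i = 0; i < value_.size(); ++i)
                value_[i] += delta[i];
    }

    // Every worker reads the same, consistent replica between supersteps.
    const std::vector<double>& value() const { return value_; }

private:
    std::vector<double> value_;
};
```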
with @thinxer. I'll read through your code first to understand it.
In my brief testing, it's at least 2x faster than glibc's default malloc.
Until we are statically linked to jemalloc, you can try:
LD_PRELOAD="/usr/lib/libjemalloc.so.1" your_program
Results from the synchronous engine should be the same whether it runs single-threaded or multi-threaded, since it reads node data only from the last iteration. However, this is not the case for the PageRank test, which indicates that the synchronous engine is incorrect. Please investigate this problem.
Do we need a reduce phase to aggregate data into a global variable? It seems GraphLab implements this with the graph.map_reduce_vertices() function.
Hi all,
I have added gflags as a git submodule. I have tested it under Linux and it's okay. But I'm not sure if it's convenient for developers on other platforms. Should we keep it a submodule, or add the sources to our repo, or use CMake's ExternalProject_Add?
Anyway, please try to compile it on your platform: https://github.com/thinxer/saedb/tree/gflags
Like this one:
https://github.com/THUKEG/saedb/blob/master/src/saedb/graph.hpp#L12
This will pollute the global namespace and cause problems in the future.
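A sketch of the usual fix, assuming the offending line is something like a file-scope `using namespace std;` or a bare typedef in a public header (I haven't confirmed what graph.hpp line 12 actually contains): qualify std:: names explicitly and keep our own names inside a project namespace, so includers only see `saedb::`-qualified identifiers.

```cpp
#include <string>
#include <vector>

// Before (hypothetical): a header with file-scope declarations that
// leak into every translation unit that includes it:
//
//   using namespace std;          // pollutes the global namespace
//   typedef vector<int> vid_list; // 'vid_list' is now a global name
//
// After: qualify std:: explicitly and keep our names inside the
// project namespace, so includers only see saedb::vid_list.
namespace saedb {
    using vid_list = std::vector<int>;

    inline std::string describe(const vid_list& v) {
        return "vid_list of size " + std::to_string(v.size());
    }
}
```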
How to add heterogeneous vertex/edge data and filters.
If I want to merge two vertices i and j into a single vertex i, this involves adding edges like (i, k).
It would be convenient if the add_edge() function were accessible from within the vertex program.
Besides, since the synchronous engine synchronizes all vertex programs at the end of every superstep, we could add a function, say "dynamic_modify" or something similar, to modify/update the existing graph.
What do you think about it?
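One way the suggestion could work, sketched with stand-in types (ToyGraph, MutationBuffer, and the method names are all assumptions, not SAE's API): since other vertex programs are still reading the graph mid-superstep, add_edge() only queues a request, and the engine applies all queued mutations at the barrier that ends the superstep.

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the proposed "dynamic_modify" hook.
struct ToyGraph {
    std::vector<std::pair<int, int>> edges;
};

class MutationBuffer {
public:
    // Callable from inside a vertex program: just records the request,
    // so the graph is never mutated while programs are running.
    void add_edge(int src, int dst) { pending_.emplace_back(src, dst); }

    // Called once by the engine at the end of the superstep, when no
    // vertex program is executing, so the mutation is race-free.
    void dynamic_modify(ToyGraph& g) {
        for (auto& e : pending_) g.edges.push_back(e);
        pending_.clear();
    }

private:
    std::vector<std::pair<int, int>> pending_;
};
```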
for (lvid_type vid = 0; vid < graph_.num_local_vertices(); vid++) {
    if (!active_superstep_[vid]) {
        continue;
    }
    auto& vprog = vertex_programs_[vid];
    vertex_type vertex {graph_.vertex(vid)};
    auto scatter_dir = vprog.scatter_edges(context, vertex);
    if (scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES) {
        for (auto ep = vertex.in_edges(); ep->Alive(); ep->Next()) {
            edge_type edge(ep->Clone());
            vprog.scatter(context, vertex, edge);
        }
    }
    if (scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES) {
        for (auto ep = vertex.out_edges(); ep->Alive(); ep->Next()) {
            edge_type edge(ep->Clone());
            vprog.scatter(context, vertex, edge);
        }
    }
    vid++;  // BUG: vid is already incremented by the for statement, so this skips every other vertex
}
vid is incremented twice per iteration (once by the for statement and once at the end of the loop body), so every other vertex is skipped. I have deleted the extra vid++ in my repo.
Would anyone volunteer to fix it upstream?
Currently, the data structure used by algorithms (vertex programs) is bound to a specific data type, so we need to transform the graph before running an algorithm.
I suggest we turn some programs into templates, so that algorithms can run on any graph that has the required fields.
For example, we can write the 'pagerank' program with a templated vertex data type T, and store the computed (and intermediate) results in the 'pagerank' field.
This is more efficient than using reflection mechanisms.
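A sketch of the templating idea (pagerank_step and the two vertex types are illustrations, not SAE code): the program is parameterized over the vertex data type T and only requires that T has a 'pagerank' field, so the same program compiles against any graph whose vertex type provides it; the field access is a plain member load, with no reflection.

```cpp
#include <cstddef>
#include <vector>

// One damped PageRank update, generic over the vertex data type T.
// The only requirement on T is a 'pagerank' member, checked at
// compile time; the same function works for authors, papers, etc.
template <typename T>
void pagerank_step(std::vector<T>& vertices,
                   const std::vector<std::vector<int>>& in_neighbors,
                   const std::vector<int>& out_degree,
                   double damping = 0.85) {
    std::vector<double> next(vertices.size());
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        double sum = 0;
        for (int u : in_neighbors[v])
            sum += vertices[u].pagerank / out_degree[u];
        next[v] = (1.0 - damping) + damping * sum;
    }
    for (std::size_t v = 0; v < vertices.size(); ++v)
        vertices[v].pagerank = next[v];  // result stored in the 'pagerank' field
}

// Two different vertex types, both usable with the same program:
struct AuthorVertex { int h_index = 0; double pagerank = 1.0; };
struct PaperVertex  { int year = 0;    double pagerank = 1.0; };
```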
The current implementation crashes when saving a zero-sized file.
We should be able to run init with some parameters.
E.g., a shortest-path vertex program should initialize source vertices' distance to 0.
One implementation suggestion: we could signal vertices with some typed value before the engine starts, which would indicate whether the signaled vertex is a source.
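A small sketch of that suggestion, with assumed types and signatures (SsspVertex and this init shape are illustrations of a possible API, not the current one): the value carried by the pre-start signal tells init whether the vertex is a source, so sources get distance 0 and everything else gets infinity.

```cpp
#include <limits>

// Hypothetical shortest-path vertex data.
struct SsspVertex {
    double dist = std::numeric_limits<double>::infinity();
};

// Hypothetical init hook: the engine passes along the value the user
// signaled the vertex with before starting (here: "is this a source?").
inline void init(SsspVertex& v, bool is_source_signal) {
    v.dist = is_source_signal ? 0.0
                              : std::numeric_limits<double>::infinity();
}
```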
Here is the discussion thread for the IAlgorithm API.
The following APIs are confirmed:
gather_type gather(icontext_type& context, const vertex_type& vertex, edge_type& edge)
void apply(icontext_type& context, vertex_type& vertex, const gather_type& total)
void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge)
void beforeIteration(icontext_type& context, const vertex_type& vertex)
void afterIteration(icontext_type& context, const vertex_type& vertex)
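A toy illustration of how the confirmed hooks fit together in one gather-apply-scatter step. The Context, Vertex, and Edge types below are stand-ins (assumptions), not SAE's real icontext_type/vertex_type/edge_type; only the hook names and call order mirror the API above.

```cpp
// Stand-in types so the sketch is self-contained.
struct Context {};                       // stand-in for icontext_type
struct Vertex { double value = 1.0; };   // stand-in for vertex_type
struct Edge { Vertex* source; };         // stand-in for edge_type

// A trivial program: sum the values of in-neighbors into the vertex.
struct SumProgram {
    using gather_type = double;

    void beforeIteration(Context&, const Vertex&) {}
    gather_type gather(Context&, const Vertex&, Edge& e) {
        return e.source->value;           // collect one neighbor's value
    }
    void apply(Context&, Vertex& v, const gather_type& total) {
        v.value = total;                  // write the aggregated result
    }
    void scatter(Context&, const Vertex&, Edge&) {}
    void afterIteration(Context&, const Vertex&) {}
};
```

The engine would call beforeIteration once, gather per edge (summing the returned gather_type values), apply once with the total, then scatter per edge, then afterIteration.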
The default analyzers are a mess right now. They desperately need cleaning up.
We need to add some meta information for the user-defined data types in the graph. This will enable better interoperability for many tools without sacrificing run-time speed. For example, it's currently not possible to print or examine a user-generated graph (we call it a foreign graph, since we are not familiar with it) without first having the user's header files and compiling the program against them. It's also not possible to filter and select a subgraph from a foreign graph.
To achieve this, we should embed meta info into our graph. A simple solution would be to require the user to supply type info when building the graph. The type info can be a list of (name, type, offset, size) tuples[1]. With this information, we can enumerate the fields of a foreign graph and use them to access the graph.
Of course, we can add more fields such as description, indexable, etc. to the type descriptor, which would be friendlier to end users.
@pondering, @neozhangthe1 What do you guys think about this?
[1]: The last two elements of the tuple are not strictly necessary; they are included for explicitness.
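A sketch of the proposed descriptor (FieldInfo, TypeDescriptor, and read_field are hypothetical names): with the (name, type, offset, size) tuples attached to the graph, a tool can enumerate the fields of a foreign graph and read them out of raw vertex bytes without the user's header files.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One entry of the proposed type descriptor.
struct FieldInfo {
    std::string name;
    std::string type;    // e.g. "int32", "float64"
    std::size_t offset;  // byte offset within the vertex record
    std::size_t size;    // byte size of the field
};

using TypeDescriptor = std::vector<FieldInfo>;

// Generic accessor: copy one field out of an opaque vertex record.
inline void read_field(const void* record, const FieldInfo& f, void* out) {
    std::memcpy(out, static_cast<const char*>(record) + f.offset, f.size);
}

// Example user type that a foreign tool would otherwise know nothing about.
struct AuthorData { int h_index; double pagerank; };
```

A usage sketch: build the descriptor with offsetof(AuthorData, h_index) etc. when the graph is constructed, persist it alongside the data, and later read fields by name from raw records.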
Time to seriously do it.
We need a use case. @neozhangthe1, @thinxer.
Clean up files and definitions, and make the interfaces clearer.
Also, I suggest 4-space indentation instead of 6 spaces.
It's worrying that we merge code without tests: currently, if it compiles, it passes.
A todo item.
We need a formal and clear doc about: