thukeg / saedb
the SAE platform
Home Page: http://thukeg.github.com/saedb/
In real-world storage, the fields of a table (or a graph, in our application) can be quite messy.
For example, the core fields for Author might be 'name, h-index, publication_count, citation_number', while we can add more fields such as 'bio, affiliation, phd_year, phone' and so on.
However, when computing on the author network, we may only be interested in a subset of the properties, such as the core fields. If we store all fields together, we have to load all of them into memory, which is not optimal.
I suggest we add a ColumnFamily feature, in which related fields are stored together. Then we can load only the families we need.
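A minimal sketch of the idea, assuming hypothetical names (CoreFields, ProfileFields, AuthorTable are illustrations, not an existing SAE API): each family is stored contiguously on its own, so a computation that only needs the core fields never touches the profile data.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: each column family groups fields that are
// typically accessed together, so a computation can load only the
// families it needs instead of every field of every vertex.
struct CoreFields {        // the "core" family: hot fields for computation
    std::string name;
    int h_index = 0;
    int publication_count = 0;
    int citation_number = 0;
};

struct ProfileFields {     // the "profile" family: cold fields, rarely loaded
    std::string bio;
    std::string affiliation;
    int phd_year = 0;
};

// A toy columnar store: one contiguous vector per family.
// Loading the "core" family touches none of the profile data.
struct AuthorTable {
    std::vector<CoreFields> core;        // loaded for graph computation
    std::vector<ProfileFields> profile;  // loaded only on demand
};

inline double average_h_index(const AuthorTable& t) {
    double sum = 0;
    for (const auto& a : t.core) sum += a.h_index;  // only core family touched
    return t.core.empty() ? 0 : sum / t.core.size();
}
```

The key design point is that the two families live in separate allocations, so memory-mapping or deserializing one family never pages in the other.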
It seems the scheduling method hasn't been implemented yet. How do we control the program in terms of when to stop?
Also, do we need the message passing in the signal function, as in GraphLab?
Should we add a distributed global variable interface? For example, when we train a linear classifier by distributing the training data across machines, we need a distributed global variable to record the classifier's weight vector. Otherwise, we would have to copy the weight vector into every training instance, resulting in low efficiency and high space requirements.
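A minimal sketch of what such an interface could look like, assuming a hypothetical SharedVector class (all names here are assumptions, not SAE's API): each worker accumulates a local delta against its replica of the weight vector, and a reduce step sums the deltas so every replica stays consistent; the vector is stored once per machine rather than once per training instance.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a "distributed global variable" for a weight
// vector. In a real distributed setting commit() would be an
// all-reduce followed by a broadcast; here we model a single address
// space to show the interface shape.
class SharedVector {
public:
    explicit SharedVector(std::size_t dim) : value_(dim, 0.0) {}

    // Reduce phase: at the end of a superstep, sum every worker's
    // locally accumulated delta into the shared value.
    void commit(const std::vector<std::vector<double>>& worker_deltas) {
        for (const auto& delta : worker_deltas)
            for (std::size_t i = 0; i < value_.size(); ++i)
                value_[i] += delta[i];
    }

    // Every worker reads the same, consistent replica between supersteps.
    const std::vector<double>& value() const { return value_; }

private:
    std::vector<double> value_;
};
```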
with @thinxer. I'll read through your code first to understand it.
In my brief testing, it's at least 2x faster than glibc's default malloc.
Until we are statically linked to jemalloc, you can try:
LD_PRELOAD="/usr/lib/libjemalloc.so.1" your_program
Results from the synchronous engine should be the same whether it runs single-threaded or multi-threaded, since it reads node data only from the last iteration. However, this is not the case for the PageRank test, which indicates that the synchronous engine is incorrect. Please investigate this problem.
Do we need a reduce phase to aggregate data into a global variable? It seems GraphLab implements this with the graph.map_reduce_vertices() function.
Hi all,
I have added gflags as a git submodule. I have tested it under Linux and it's okay. But I'm not sure if it's convenient for developers on other platforms. Should we keep it a submodule, or add the sources to our repo, or use CMake's ExternalProject_Add?
Anyway, please try to compile it on your platform: https://github.com/thinxer/saedb/tree/gflags
Like this one:
https://github.com/THUKEG/saedb/blob/master/src/saedb/graph.hpp#L12
This will pollute the global namespace and cause problems in the future.
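A sketch of the usual fix, assuming the offending line is something like a file-scope `using namespace std;` or a bare typedef in a public header (I haven't confirmed what graph.hpp line 12 actually contains): qualify std:: names explicitly and keep our own names inside a project namespace, so includers only see `saedb::`-qualified identifiers.

```cpp
#include <string>
#include <vector>

// Before (hypothetical): a header with file-scope declarations that
// leak into every translation unit that includes it:
//
//   using namespace std;          // pollutes the global namespace
//   typedef vector<int> vid_list; // 'vid_list' is now a global name
//
// After: qualify std:: explicitly and keep our names inside the
// project namespace, so includers only see saedb::vid_list.
namespace saedb {
    using vid_list = std::vector<int>;

    inline std::string describe(const vid_list& v) {
        return "vid_list of size " + std::to_string(v.size());
    }
}
```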
How to add heterogeneous vertex/edge data and filters.
If I want to merge two vertices i and j into a single vertex i, this involves adding edges like (i, k).
It would be convenient if the add_edge() function were accessible from within the vertex program.
Besides, since the synchronous engine synchronizes all vertex programs at the end of every superstep, we could add a function, say "dynamic_modify" or something similar, to modify/update the existing graph.
What do you think about it?
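One way the suggestion could work, sketched with stand-in types (ToyGraph, MutationBuffer, and the method names are all assumptions, not SAE's API): since other vertex programs are still reading the graph mid-superstep, add_edge() only queues a request, and the engine applies all queued mutations at the barrier that ends the superstep.

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the proposed "dynamic_modify" hook.
struct ToyGraph {
    std::vector<std::pair<int, int>> edges;
};

class MutationBuffer {
public:
    // Callable from inside a vertex program: just records the request,
    // so the graph is never mutated while programs are running.
    void add_edge(int src, int dst) { pending_.emplace_back(src, dst); }

    // Called once by the engine at the end of the superstep, when no
    // vertex program is executing, so the mutation is race-free.
    void dynamic_modify(ToyGraph& g) {
        for (auto& e : pending_) g.edges.push_back(e);
        pending_.clear();
    }

private:
    std::vector<std::pair<int, int>> pending_;
};
```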
for (lvid_type vid = 0; vid < graph_.num_local_vertices(); vid++) {
    if (!active_superstep_[vid]) {
        continue;
    }
    auto& vprog = vertex_programs_[vid];
    vertex_type vertex {graph_.vertex(vid)};
    auto scatter_dir = vprog.scatter_edges(context, vertex);
    if (scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES) {
        for (auto ep = vertex.in_edges(); ep->Alive(); ep->Next()) {
            edge_type edge(ep->Clone());
            vprog.scatter(context, vertex, edge);
        }
    }
    if (scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES) {
        for (auto ep = vertex.out_edges(); ep->Alive(); ep->Next()) {
            edge_type edge(ep->Clone());
            vprog.scatter(context, vertex, edge);
        }
    }
    vid++;  // BUG: vid is already incremented by the for statement, so this skips every other vertex
}
vid is incremented twice per iteration (once by the for statement and once at the end of the loop body), so every other vertex is skipped. I have deleted the extra vid++ in my repo.
Would anyone volunteer to fix it upstream?
Currently, the data structure used by algorithms (vertex programs) is bound to a specific data type, so we need to transform the graph before running an algorithm.
I suggest we turn some programs into templates, so that algorithms can run on any graph that has the required fields.
For example, we can write the 'pagerank' program with a templated vertex data type T, and store the computed (and intermediate) results in the 'pagerank' field.
This is more efficient than using reflection mechanisms.
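A sketch of the templating idea (pagerank_step and the two vertex types are illustrations, not SAE code): the program is parameterized over the vertex data type T and only requires that T has a 'pagerank' field, so the same program compiles against any graph whose vertex type provides it; the field access is a plain member load, with no reflection.

```cpp
#include <cstddef>
#include <vector>

// One damped PageRank update, generic over the vertex data type T.
// The only requirement on T is a 'pagerank' member, checked at
// compile time; the same function works for authors, papers, etc.
template <typename T>
void pagerank_step(std::vector<T>& vertices,
                   const std::vector<std::vector<int>>& in_neighbors,
                   const std::vector<int>& out_degree,
                   double damping = 0.85) {
    std::vector<double> next(vertices.size());
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        double sum = 0;
        for (int u : in_neighbors[v])
            sum += vertices[u].pagerank / out_degree[u];
        next[v] = (1.0 - damping) + damping * sum;
    }
    for (std::size_t v = 0; v < vertices.size(); ++v)
        vertices[v].pagerank = next[v];  // result stored in the 'pagerank' field
}

// Two different vertex types, both usable with the same program:
struct AuthorVertex { int h_index = 0; double pagerank = 1.0; };
struct PaperVertex  { int year = 0;    double pagerank = 1.0; };
```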
The current implementation crashes when saving a zero-sized file.
We should be able to run init with some parameters.
E.g., a shortest-path vertex program should initialize source vertices' distance to 0.
One implementation suggestion: we could signal vertices with some typed value before the engine starts, which would indicate whether the signaled vertex is a source.
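A small sketch of that suggestion, with assumed types and signatures (SsspVertex and this init shape are illustrations of a possible API, not the current one): the value carried by the pre-start signal tells init whether the vertex is a source, so sources get distance 0 and everything else gets infinity.

```cpp
#include <limits>

// Hypothetical shortest-path vertex data.
struct SsspVertex {
    double dist = std::numeric_limits<double>::infinity();
};

// Hypothetical init hook: the engine passes along the value the user
// signaled the vertex with before starting (here: "is this a source?").
inline void init(SsspVertex& v, bool is_source_signal) {
    v.dist = is_source_signal ? 0.0
                              : std::numeric_limits<double>::infinity();
}
```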
Here is the discussion thread for the IAlgorithm API.
The following APIs are confirmed:
gather_type gather(icontext_type& context, const vertex_type& vertex, edge_type& edge)
void apply(icontext_type& context, vertex_type& vertex, const gather_type& total)
void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge)
void beforeIteration(icontext_type& context, const vertex_type& vertex)
void afterIteration(icontext_type& context, const vertex_type& vertex)
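A toy illustration of how the confirmed hooks fit together in one gather-apply-scatter step. The Context, Vertex, and Edge types below are stand-ins (assumptions), not SAE's real icontext_type/vertex_type/edge_type; only the hook names and call order mirror the API above.

```cpp
// Stand-in types so the sketch is self-contained.
struct Context {};                       // stand-in for icontext_type
struct Vertex { double value = 1.0; };   // stand-in for vertex_type
struct Edge { Vertex* source; };         // stand-in for edge_type

// A trivial program: sum the values of in-neighbors into the vertex.
struct SumProgram {
    using gather_type = double;

    void beforeIteration(Context&, const Vertex&) {}
    gather_type gather(Context&, const Vertex&, Edge& e) {
        return e.source->value;           // collect one neighbor's value
    }
    void apply(Context&, Vertex& v, const gather_type& total) {
        v.value = total;                  // write the aggregated result
    }
    void scatter(Context&, const Vertex&, Edge&) {}
    void afterIteration(Context&, const Vertex&) {}
};
```

The engine would call beforeIteration once, gather per edge (summing the returned gather_type values), apply once with the total, then scatter per edge, then afterIteration.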
The default analyzers are a mess right now. They desperately need cleaning up.
We need to add some meta information for the user-defined data types in the graph. This will enable better interoperability for many tools without sacrificing run-time speed. For example, it's currently not possible to print or examine a user-generated graph (we call it a foreign graph, since we are not familiar with it) without first having the user's header files and compiling the program against them. It's also not possible to filter and select a subgraph from a foreign graph.
To achieve this, we should embed meta info into our graph. A simple solution would be to require the user to supply type info when building the graph. The type info can be a list of (name, type, offset, size) tuples[1]. With this information, we can enumerate the fields of a foreign graph and use them to access the graph.
Of course, we can add more fields such as description, indexable, etc. to the type descriptor, which would be friendlier to end users.
@pondering, @neozhangthe1 What do you guys think about this?
[1]: The last two elements of the tuple are not strictly necessary; they are included for explicitness.
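A sketch of the proposed descriptor (FieldInfo, TypeDescriptor, and read_field are hypothetical names): with the (name, type, offset, size) tuples attached to the graph, a tool can enumerate the fields of a foreign graph and read them out of raw vertex bytes without the user's header files.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One entry of the proposed type descriptor.
struct FieldInfo {
    std::string name;
    std::string type;    // e.g. "int32", "float64"
    std::size_t offset;  // byte offset within the vertex record
    std::size_t size;    // byte size of the field
};

using TypeDescriptor = std::vector<FieldInfo>;

// Generic accessor: copy one field out of an opaque vertex record.
inline void read_field(const void* record, const FieldInfo& f, void* out) {
    std::memcpy(out, static_cast<const char*>(record) + f.offset, f.size);
}

// Example user type that a foreign tool would otherwise know nothing about.
struct AuthorData { int h_index; double pagerank; };
```

A usage sketch: build the descriptor with offsetof(AuthorData, h_index) etc. when the graph is constructed, persist it alongside the data, and later read fields by name from raw records.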
Time to seriously do it.
We need a use case. @neozhangthe1, @thinxer.
Clean up files and definitions, and make the interfaces clearer.
Also, I suggest 4-space indentation instead of 6 spaces.
It's worrying that we merge code without tests: currently, if it compiles, it passes.
A todo item.
We need a formal and clear doc about: