saedb's People

Contributors

dahui1, kimiyoung, neozhangthe1, terranlee, thinxer, xiaojingzi

saedb's Issues

implement the read/write of edge data

  1. In graph.hpp, the class edge_type has no variable to store the edge-associated data. Besides, the data() interface is still left empty.
  2. In graph.hpp, the function add_edge() does not save the edge-associated data at all.
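
A minimal sketch of what storing edge data could look like, assuming a simplified in-memory graph. The names `simple_graph` and `edge_record` are hypothetical and are not taken from saedb's graph.hpp; only `edge_type`, `data()` and `add_edge()` come from the issue.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the graph in graph.hpp.
template <typename EdgeData>
class simple_graph {
public:
    typedef std::size_t vertex_id;

    // One stored edge: both endpoints plus the edge-associated data.
    struct edge_record {
        vertex_id source;
        vertex_id target;
        EdgeData data;
    };

    // Lightweight handle to a stored edge, playing the role of edge_type.
    class edge_type {
    public:
        edge_type(simple_graph& g, std::size_t index) : graph_(&g), index_(index) {}
        vertex_id source() const { return graph_->edges_[index_].source; }
        vertex_id target() const { return graph_->edges_[index_].target; }
        // data() returns a reference so callers can both read and write it.
        EdgeData& data() { return graph_->edges_[index_].data; }
        const EdgeData& data() const { return graph_->edges_[index_].data; }
    private:
        simple_graph* graph_;
        std::size_t index_;
    };

    // add_edge() keeps the data instead of discarding it.
    edge_type add_edge(vertex_id source, vertex_id target, EdgeData data = EdgeData()) {
        edges_.push_back(edge_record{source, target, std::move(data)});
        return edge_type(*this, edges_.size() - 1);
    }

    // Looks up an existing edge by its insertion index.
    edge_type edge(std::size_t index) {
        assert(index < edges_.size());
        return edge_type(*this, index);
    }

    std::size_t num_edges() const { return edges_.size(); }

private:
    std::vector<edge_record> edges_;
};
```

With this shape, `g.add_edge(0, 1, 2.5)` returns a handle whose `data()` can be read or assigned, and the change is visible through any other handle to the same edge.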

ColumnFamily for Graph Storage

In real-world storage, the fields in a table (or a graph, in our application) can be quite messy.
For example, the core fields for Author might be 'names, h-index, publication_count, citation_number', while we could add more fields such as 'bio, affiliation, phd_year, phone' and so on.

However, when doing computation on the author network, we may only be interested in a subset of the properties, such as the core fields. If we store all the fields together, we have to load all of them into memory, which is not optimal.

I suggest we add a ColumnFamily feature, in which related fields are stored together. Then we'll be able to load only the families we need.
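
The idea could be sketched roughly as follows. Everything here is an assumption made for illustration: `column_family`, `family_store` and `load_families` are hypothetical names, values are kept as strings for brevity, and a `std::map` stands in for on-disk storage.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical: one column family groups fields that are stored and loaded together.
struct column_family {
    std::vector<std::string> field_names;
    // rows[v][i] is the value of field_names[i] for vertex v.
    std::vector<std::vector<std::string>> rows;
};

// Stand-in for on-disk storage: family name -> family contents.
typedef std::map<std::string, column_family> family_store;

// Copies only the requested families; the others are never read.
family_store load_families(const family_store& storage,
                           const std::set<std::string>& wanted) {
    family_store loaded;
    for (const std::string& name : wanted) {
        family_store::const_iterator it = storage.find(name);
        assert(it != storage.end() && "unknown column family");
        loaded.insert(*it);
    }
    return loaded;
}
```

A computation on the author network would then request only a "core" family and leave a "profile" family (bio, phone and so on) untouched.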

Scheduling method

It seems the scheduling method hasn't been implemented yet. How do we control when the program stops?
Also, do we need the message passing that GraphLab's signal function provides?

Need a distributed global variable interface

Should we add a distributed global variable interface? For example, when we train a linear classifier by distributing the training data to different machines, we need a distributed global variable to hold the weight vector of the classifier. Otherwise, we would have to copy the weight vector to every training instance, which is inefficient and wastes space.
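
One possible shape for such an interface, simulated inside a single process. This is only a sketch: `global_weights` and its methods are hypothetical names, and `sync()` stands in for whatever all-reduce style communication a real distributed implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: every machine reads one shared weight vector and
// buffers its own updates until sync() merges them.
class global_weights {
public:
    global_weights(std::size_t num_machines, std::size_t dimension)
        : value_(dimension, 0.0),
          pending_(num_machines, std::vector<double>(dimension, 0.0)) {}

    // All machines see the same value between two syncs.
    const std::vector<double>& get() const { return value_; }

    // Buffers a local update (for example one gradient component) from one machine.
    void add(std::size_t machine, std::size_t index, double delta) {
        assert(machine < pending_.size() && index < value_.size());
        pending_[machine][index] += delta;
    }

    // Stand-in for an all-reduce: sums the buffered updates into the shared value.
    void sync() {
        for (std::vector<double>& updates : pending_) {
            for (std::size_t i = 0; i < value_.size(); ++i) {
                value_[i] += updates[i];
                updates[i] = 0.0;
            }
        }
    }

private:
    std::vector<double> value_;
    std::vector<std::vector<double>> pending_;
};
```

The weight vector exists once per machine rather than once per training instance, which is the saving the issue asks for.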

Use `jemalloc' as the default memory allocator.

In my brief testing, it's at least 2x faster than the default malloc of glibc.

Until we are statically linked to jemalloc, you can try:

LD_PRELOAD="/usr/lib/libjemalloc.so.1" your_program

Synchronous Engine is not correct.

Results from the synchronous engine should be the same whether it runs single-threaded or multithreaded, since it reads node data only from the previous iteration. However, this is not the case for the PageRank test, which indicates that the synchronous engine is not correct. Please investigate this problem.
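
The property described above can be illustrated with a small sketch that keeps two buffers, one read-only and one write-only. `tiny_graph` and `pagerank_superstep` are hypothetical names, and this is not the engine's actual code; it only shows why the visiting order should not matter when reads come from the previous iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal graph: in_neighbors[v] lists the sources of v's in-edges.
struct tiny_graph {
    std::vector<std::vector<std::size_t>> in_neighbors;
    std::vector<std::size_t> out_degree;
};

// One synchronous PageRank superstep. Reads come only from `previous` and
// writes go only to `next`, so the visiting order cannot change the result.
std::vector<double> pagerank_superstep(const tiny_graph& g,
                                       const std::vector<double>& previous,
                                       const std::vector<std::size_t>& order,
                                       double damping = 0.85) {
    assert(previous.size() == g.in_neighbors.size());
    std::vector<double> next(previous.size(), 0.0);
    for (std::size_t v : order) {
        double sum = 0.0;
        for (std::size_t u : g.in_neighbors[v]) {
            sum += previous[u] / static_cast<double>(g.out_degree[u]);
        }
        next[v] = (1.0 - damping) + damping * sum;
    }
    return next;
}
```

If an implementation writes into the same array it reads from, a forward and a reverse visiting order will generally disagree, which is one possible cause of the behavior reported here.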

need a reduce phase?

Do we need a reduce phase to aggregate data into a global variable? GraphLab appears to implement a graph.map_reduce_vertices() function for this purpose.
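
A rough single-machine sketch of such a reduce phase. The free function below is hypothetical and only loosely modeled on the idea of mapping each vertex to a value and combining the values with `+=`; it is not GraphLab's or saedb's actual signature, and `rank_vertex` and `get_pagerank` are made up for the example.

```cpp
#include <vector>

// Hypothetical helper: map every vertex to a value, then combine with +=.
// Result must be default-constructible and support operator+=.
template <typename Result, typename Vertex, typename Mapper>
Result map_reduce_vertices(const std::vector<Vertex>& vertices, Mapper mapper) {
    Result total = Result();
    for (const Vertex& v : vertices) {
        total += mapper(v);
    }
    return total;
}

// Example vertex data and mapper: sum all PageRank values.
struct rank_vertex {
    double pagerank;
};

double get_pagerank(const rank_vertex& v) { return v.pagerank; }
```

A call such as `map_reduce_vertices<double>(vertices, get_pagerank)` would then produce the global aggregate. In a distributed setting each machine would reduce its local vertices first and the partial results would be combined afterwards.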

Addition of gflags

Hi all,

I have added gflags as a git submodule. I have tested it under Linux and it's okay. But I'm not sure if it's convenient for developers on other platforms. Should we keep it a submodule, or add the sources to our repo, or use CMake's ExternalProject_Add?

Anyway, please try to compile it on your platform: https://github.com/thinxer/saedb/tree/gflags

Should the graph be modifiable in the sync. engine?

If I want to merge two vertices i and j into a single vertex i, that involves adding edges such as (i,k).
It would be convenient if the add_edge() function were accessible from the vertex program.
Besides, since the synchronous engine synchronizes all the vertex programs at the end of every superstep, we could add a function, say "dynamic_modify" or something similar, to modify/update the existing graph at that point.
What do you think about it?
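
One way this could work is to let vertex programs queue their changes and have the engine apply them at the superstep barrier. The sketch below assumes a plain edge list; `mutation_queue` and `apply_to` are hypothetical names, not an existing saedb API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical: collects structural changes requested during a superstep.
class mutation_queue {
public:
    typedef std::pair<std::size_t, std::size_t> edge;

    // Called from a vertex program; the graph itself is not touched yet.
    void add_edge(std::size_t source, std::size_t target) {
        pending_.push_back(edge(source, target));
    }

    // Called by the engine at the barrier, when no vertex program is running.
    void apply_to(std::vector<edge>& edges, std::size_t num_vertices) {
        for (const edge& e : pending_) {
            assert(e.first < num_vertices && e.second < num_vertices);
            edges.push_back(e);
        }
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<edge> pending_;
};
```

Deferring the changes keeps the graph read-only while vertex programs run, so the engine's synchronous semantics are preserved.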

is this a bug in sync engine?

        for (lvid_type vid = 0; vid < graph_.num_local_vertices(); vid++) {
            if (!active_superstep_[vid]) {
                continue;
            }
            auto &vprog = vertex_programs_[vid];
            vertex_type vertex {graph_.vertex(vid)};
            auto scatter_dir = vprog.scatter_edges(context, vertex);

            if (scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES){
                for (auto ep = vertex.in_edges(); ep->Alive(); ep->Next()) {
                    edge_type edge(ep->Clone());
                    vprog.scatter(context, vertex, edge);
                }
            }

            if (scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES){
                for (auto ep = vertex.out_edges(); ep->Alive(); ep->Next()) {
                    edge_type edge(ep->Clone());
                    vprog.scatter(context, vertex, edge);
                }
            }
            vid++;
        }

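vid is incremented twice in every iteration that reaches the end of the loop body: once by the for statement and once by the trailing vid++. As a result, the vertex after each processed one is skipped. I deleted the last vid++ in my repo.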
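
The effect can be reproduced in isolation. The helper below is a hypothetical reduction of the loop above in which every vertex is treated as active and the scatter work is replaced by recording the id.

```cpp
#include <cstddef>
#include <vector>

// Returns the vertex ids visited by a loop shaped like the one above.
// With extra_increment == true the body repeats the trailing `vid++`.
std::vector<std::size_t> visited_vertices(std::size_t num_vertices,
                                          bool extra_increment) {
    std::vector<std::size_t> visited;
    for (std::size_t vid = 0; vid < num_vertices; vid++) {
        visited.push_back(vid);  // stands in for the scatter work
        if (extra_increment) {
            vid++;  // the suspected bug: the next vertex is skipped
        }
    }
    return visited;
}
```

For five vertices, the version with the extra increment visits only 0, 2 and 4, while the version without it visits all five.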

Changing 'algorithm's to 'algorithm template's

Currently the data structure used by algorithms (vertex programs) is bound to a specific data type. Thus we need to transform the graph before we can run an algorithm on it.

I suggest that we make some programs into templates, so that algorithms can run on any graph having the given fields.

For example, we can make the 'pagerank' program using a templated vertex data type T. Then we store the computed (and intermediate) result into the 'pagerank' field.

This approach is more efficient than using reflection mechanisms, since the field access is resolved at compile time.
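
A minimal sketch of the template idea, assuming the only requirement on the vertex data type is a `pagerank` member of type double. The names `pagerank_apply`, `author_vertex` and `paper_vertex` are hypothetical.

```cpp
#include <string>

// Works for any vertex data type T that has a `double pagerank` field.
template <typename VertexData>
void pagerank_apply(VertexData& vertex, double gathered_sum,
                    double damping = 0.85) {
    vertex.pagerank = (1.0 - damping) + damping * gathered_sum;
}

// Two unrelated vertex data types that both satisfy the requirement.
struct author_vertex {
    std::string name;
    int h_index;
    double pagerank;
};

struct paper_vertex {
    std::string title;
    double pagerank;
};
```

The same `pagerank_apply` can be instantiated for both types without converting the graph first; a type lacking the `pagerank` field fails at compile time instead of at run time.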

Initialize vertices with parameters

We should be able to initialize vertices with some parameters.
For example, a shortest-path vertex program should initialize the distance of the source vertices to 0.

One possible implementation is to signal vertices with some type of value before the engine starts, which indicates whether the signaled vertex is a source.
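
The suggestion could look roughly like this. All names (`init_message`, `sssp_vertex_data`, `sssp_init`, `signal_all`) are hypothetical, and `signal_all` merely stands in for the engine delivering the initial signals.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical value carried by the initial signal.
struct init_message {
    bool is_source;
};

struct sssp_vertex_data {
    double distance;
};

// Hypothetical init hook: called once per signaled vertex before the first superstep.
void sssp_init(sssp_vertex_data& vertex, const init_message& message) {
    vertex.distance = message.is_source
                          ? 0.0
                          : std::numeric_limits<double>::infinity();
}

// Stand-in for the engine: signals every vertex and marks the chosen sources.
std::vector<sssp_vertex_data> signal_all(std::size_t num_vertices,
                                         const std::vector<std::size_t>& sources) {
    std::vector<bool> is_source(num_vertices, false);
    for (std::size_t s : sources) {
        assert(s < num_vertices);
        is_source[s] = true;
    }
    std::vector<sssp_vertex_data> data(num_vertices);
    for (std::size_t v = 0; v < num_vertices; ++v) {
        init_message message;
        message.is_source = is_source[v];
        sssp_init(data[v], message);
    }
    return data;
}
```

Source vertices start at distance 0 and all others at infinity, which is the usual starting state for single-source shortest paths.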

vertex program api spec

This is the discussion thread for the IAlgorithm API.

The following APIs are confirmed:

  1. gather_type gather(icontext_type& context, const vertex_type& vertex, edge_type& edge)
  2. void apply(icontext_type& context, vertex_type& vertex, const gather_type& total)
  3. void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge)
  4. void beforeIteration(icontext_type& context, const vertex_type& vertex)
  5. void afterIteration(icontext_type& context, const vertex_type& vertex)
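
To make the signatures concrete, here is a sketch of a vertex program that implements all five. The types `icontext_type`, `vertex_type`, `edge_type` and `gather_type` are defined below as hypothetical placeholders so the example compiles on its own; the real definitions belong to the engine, and `weighted_sum_program` is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical placeholder types; the real ones come from the engine.
struct icontext_type {
    std::size_t iteration = 0;
};
struct vertex_type {
    double value = 0.0;
};
struct edge_type {
    double weight = 1.0;
    const vertex_type* source = nullptr;
};
typedef double gather_type;

// A vertex program that sets each vertex to the weighted sum of its in-neighbors.
struct weighted_sum_program {
    // Contribution of one edge to the vertex being updated.
    gather_type gather(icontext_type&, const vertex_type&, edge_type& edge) {
        return edge.weight * edge.source->value;
    }
    // Stores the combined gather result in the vertex.
    void apply(icontext_type&, vertex_type& vertex, const gather_type& total) {
        vertex.value = total;
    }
    // Nothing to propagate in this example.
    void scatter(icontext_type&, const vertex_type&, edge_type&) {}
    void beforeIteration(icontext_type&, const vertex_type&) {}
    // Counts finished iterations in the context.
    void afterIteration(icontext_type& context, const vertex_type&) {
        ++context.iteration;
    }
};
```

An engine would call `gather` once per selected edge, combine the results, pass the total to `apply`, and then call `scatter` on the outgoing side.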

Dynamic access to graph data

We need to add some meta information for the user-defined data types in the graph. This will enable better interoperability for many tools without sacrificing run-time speed. For example, it is currently not possible to print/examine a user-generated graph (we call it a foreign graph, since we are not familiar with it) without first obtaining the user's header files and compiling the program with them. It's also not possible to filter and select a subgraph from a foreign graph.

To achieve this, we should embed meta info in our graph. A simple solution would be to require the user to supply the type info when building the graph. The type info can be a list of (name, type, offset, size) tuples[1]. With this information, we can enumerate the fields of a foreign graph and use them to access the graph.

Of course, we can add more fields such as description, indexable, etc. to the type descriptor, which would make it friendlier to end users.

@pondering, @neozhangthe1 What do you guys think about this?

[1]: The last two elements in the tuple are not necessary. It's just being verbose.
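
A sketch of the (name, type, offset, size) descriptor and how a tool might use it to read a field without the user's headers. The names `field_type`, `field_descriptor`, `read_float64` and `find_field` are hypothetical, and `author_data` is an example user type invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

enum class field_type { int32, float64 };

// One (name, type, offset, size) tuple.
struct field_descriptor {
    std::string name;
    field_type type;
    std::size_t offset;
    std::size_t size;
};

// Reads a float64 field from a raw record using only the descriptor.
double read_float64(const void* record, const field_descriptor& field) {
    assert(field.type == field_type::float64 && field.size == sizeof(double));
    double value;
    std::memcpy(&value, static_cast<const char*>(record) + field.offset,
                sizeof(double));
    return value;
}

// Returns the descriptor with the given name, or nullptr if there is none.
const field_descriptor* find_field(const std::vector<field_descriptor>& fields,
                                   const std::string& name) {
    for (const field_descriptor& f : fields) {
        if (f.name == name) return &f;
    }
    return nullptr;
}

// Example user type and the descriptor list the user would supply with the graph.
struct author_data {
    std::int32_t h_index;
    double pagerank;
};

std::vector<field_descriptor> describe_author_data() {
    std::vector<field_descriptor> fields;
    fields.push_back(field_descriptor{"h_index", field_type::int32,
                                      offsetof(author_data, h_index),
                                      sizeof(std::int32_t)});
    fields.push_back(field_descriptor{"pagerank", field_type::float64,
                                      offsetof(author_data, pagerank),
                                      sizeof(double)});
    return fields;
}
```

A generic tool only needs the descriptor list and the raw bytes: it can enumerate the field names, look one up with `find_field`, and read it with the matching typed reader.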

Test Framework

It feels wrong to merge code without tests. Right now, if it compiles, it passes.

Need a clear Doc

We need a formal and clear doc about:

  1. Vertex program primitives (gather, apply, scatter) and their semantics.
  2. Distributed procedures (graph partitioning, cluster communication, graph storage, task scheduling).
  3. Signaling or message passing, etc.
