kwartile / connected-component Goto Github PK

View Code? Open in Web Editor NEW

84.0 7.0 18.0 27 KB

Map Reduce Implementation of Connected Component on Apache Spark

Home Page: http://www.kwartile.com

License: MIT License

Scala 100.00%

connected-components apache-spark graphx scala mapreduce union-find graph-algorithms

connected-component's People

Contributors

Stargazers

Watchers

Forkers

shuangxinyu aicodex akhil-senroy scgump tejvepa oscaroboto lanseweilai semutter lihonst jiminhsieh xxzcool apclement xiaogea01 yuanyuma v-shaal haaroon kanchanchy jibaro

connected-component's Issues

Possible bug in algorithm?

Given cliques like: (3, 4) (5, 6) (1, 2) (9, 10) (8, 9) (7, 8) (6, 7)
The CC result is like: (1,1),(2,1),(3,3),(4,3),(5,5),(6,5),(7,5),(8,5),(6,5),(9,5),(6,5),(7,5),(10,6),(8,6)

It's quite weird that, the result shows "(5,5),(6,5),(7,5),(8,5),(6,5),(9,5),(6,5),(7,5)" and "(10,6),(8,6)" are 2 seperated subgraphs.

We're using quite old spark configurations. Spark verion is 2.3.0 and scala is 2.11 .

Why is the input a list of cliques?

Thank you for open sourcing this code. First, I'd like to note that I was not able to send email to the address given in the README ([email protected]): the error was mail delivery failure.

My main issue is that the README states that the input to your algorithm is a list of cliques. However, finding all cliques in a graph is a very hard problem (NP-complete I believe) but a basic connected components algorithm is doable in at least linear time, so why does your code require finding all cliques first? Are you using a different definition of "clique"? The paper cited in the README does not mention cliques at all.

License

There is no LICENSE file as a part of this project. Under what license is it distributed?

function largerStar might go wrong

In paper "connected component in mapreduce and beyond", while doing largerStar, it usually connect all strictly larger neighbors to the min neighbor including self.
But in the implementation, as shown in code snippet below:
((neighbor > self && neighbor != minNode) || (neighbor == self))
it will emit edge from self to minNode.

Test below easily demonstrate this:
before largeStar:
scala> nodePairs.collect.foreach(print _)
(1,8)(9,8)(8,5)(7,3)(3,2)(6,3)(6,4)(8,7)
after that:
scala> largeStar(nodePairs)._1.collect.foreach(print _)
(1,1)(8,1)(2,2)(3,2)(3,2)(6,2)(7,2)(4,4)(6,4)(5,5)(8,5)(6,3)(7,3)(8,3)(8,1)(9,1)(9,8)
which is obviously wrong.

Performance and input - questions

Hi Kumar,

First I would like to say thanks for having this code, it is really good comparing to GraphX performance, also the API is much easier to understand.

Our use case is to find a connected component in large graph. we are creating the graph based on other input and we can manipulate the way we are creating the edges (i.e creating them as a star (partially) or like a "linked list" and we are trying to find the best way to optimize the input in order to speed the connected component part.

We did some performance testing (with 30 spark executor where each executor has 12 cores) and we found that:
for Graph1:
vertex count: 1458232
edge count: 794924
connected components: 795098
It took 5 minutes to find connected components

In Graph2:
vertex count: 46165451
edge count: 37572770
connected components: 17908024
it took 8 hours to find connected components.

Several questions:

what is the best "input" (graph shape/edges shape) that makes the code to run faster per connected component? star? linked-list?
is it correct that having the maximum needed edges as an input will perform better than having the minimum needed edges to achieve the same connected components? we tried to save edges but we saw it gave us bad result comparing to other run with more edges.
Examples:
Graph (each line is clique):
a,b,c
a,b
d,e,f

edges in scenario 1 (better results): (a,b)(a,c)(b,c)(d,e)(d,f)(e,f)
edges in scenario 2 : (a,b)(a,c)(d,e)(d,f)
Are there any differences in terms of performance? we can manipulate our input to achieve the same connected component with different edges, this is why we are asking.

if I compare graph1 to graph2, how come Graph2 runs 80 times slower than Graph1? I did the following computation, let me know if I'm wrong:

Graph2 edges/Graph1 edges = ~47
Graph2 vertices/Graph1 vertices = ~31

31 + 47 = 78 which is close to 80, could this be the explanation how come it runs 80 times slower?

I couldn't find other way to contact with you to ask questions....
if you can send me an email to [email protected] we can discuss more.. please feel free to close this issue.

thanks!

Benchmark details

Hi,

I'm interested in using this approach to connected-components.

Can you give me some details on the performance of this approach?

eg How many cores, how much RAM and how much time it took to process the 100billion nodes?

Thanks,
Jatinder Sangha

ConnectivityChangeCount doesn't mean convergence

Hi,

I'm interested in this project. After I read the code, I think ConnectivityChangeCount only count the change in number of edges, not change of edges. This may lead to mistake.
Here is a counterexample:

val g = sc.parallelize(Array(List(1L,4L), List(2L,4L),List(2L,5L), 
                             List(3L,5L),List(1L,6L),List(3L,6L)))

Which build a graph:

Then I run

val (a,b,c) = ConnectedComponent.run(sc,g,1000)
a.collect

Spark gave this:

Array[(Long, Long)] = Array((1,1), (2,2), (3,3), (4,1), (2,1), (5,2), (3,2), (6,1), (3,1))

Please check this case. Thank you!

Documentation request for the proper spark related settings

On the README.md the following is stated:

We indeed had to to try various Spark related configurations, including executor memory size, driver memory size, yarn memory overhead, network timeout, and number of partitions to successfully run the implementation.

Would you be able to post some of those settings - and if you have the time also some note about why/under what conditions they were chosen? thanks!

Is there anyone meets "central nodes/paths" problem?

After I got the result of conencted-component, the connected graph is huge...
I wanna remove some central nodes/paths,which means the most frequently nodes connected isolated subgraphs.
I wanna remove those central nodes, so that huge graph can be splitted into two isolated subgraphs, which be better for analysis.

Is there any solution to solve this?

Thank you !

kwartile / connected-component Goto Github PK

connected-component's People

Contributors

Stargazers

Watchers

Forkers

connected-component's Issues

Possible bug in algorithm?

Why is the input a list of cliques?

License

function largerStar might go wrong

Performance and input - questions

Benchmark details

ConnectivityChangeCount doesn't mean convergence

Documentation request for the proper spark related settings

Is there anyone meets "central nodes/paths" problem?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent