
line's People

Contributors

mnqu, tangjianpku


line's Issues

Linux version : Order 2: Rho: 0.000253 Progress: 99.000%*** Error in `./line': munmap_chunk():

I used the current Linux version on a different data file with the same format as the YouTube dataset.

It works perfectly fine for Order 1 and samples at 100M

But for the same dataset it gives a memory error for Order 2 and samples at 100M.

Could you please let me know if I am doing something wrong, or could you compare the recently updated version with the older one?

Thank you

How to build the code?

The author has provided an executable ./line, but running it produces a core dump. How can I build the target on my own machine?
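For reference, the Linux build is driven by the repo's train_youtube.sh; the compile step is approximately the following (a sketch, so check the script for the exact flags; the GSL development headers are required):

```shell
# Approximate compile command (flags may differ slightly by repo version).
# Requires the GNU Scientific Library dev package, e.g. libgsl-dev.
g++ -lm -pthread -Ofast -march=native -Wall -funroll-loops -ffast-math \
    -Wno-unused-result line.cpp -o line -lgsl -lm -lgslcblas
```

Note that -march=native produces a CPU-specific binary, which may also explain why a prebuilt ./line core dumps with "Illegal instruction" on a different machine.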

Embeddings of all nodes are not obtained

Hi, I was trying to run this on a graph. However, the embeddings vec_1st.txt, vec_2nd.txt, and vec_all.txt do not generate the embeddings of all the nodes as in the original input graph.

Can you tell me where I might be going wrong, or what causes this behavior?

Return nan value in the final embedding output when test example txt file

I was trying to test with the following txt file as input:
good the 3
the good 3
good bad 1
bad good 1
bad of 4
of bad 4

And here are the commands I use:
./reconstruct -train network.txt -output net_dense.txt -depth 2 -threshold 1000
./line -train net_dense.txt -output $a -binary 0 -size 128 -order 1 -negative 5 -samples 10000 -threads 40
./line -train net_dense.txt -output $b -binary 0 -size 128 -order 2 -negative 5 -samples 10000 -threads 40
./normalize -input $a -output $c -binary 0
./normalize -input $b -output $d -binary 0
./concatenate -input1 $c -input2 $d -output vec_all.txt -binary 0

In vec_all.txt, the vectors for "bad" and "of" contain many NaNs. Where did I go wrong? Thank you!

Problem with init_rho

I get an Illegal instruction (core dump) at line 385 of line.cpp:
printf("Initial rho: %lf\n", init_rho);
when I run with the options in train_youtube.sh:
./line -train net_youtube_dense.txt -output vec_1st_wo_norm.txt -binary 1 -size 128 -order 1 -negative 5 -samples 10000 -threads 40

When this line is commented out, the program runs.

meet segmentation fault when training LINE

hi,

I have tried to train LINE with -train ../../text_article_product_pair.txt -output 2_order_embedding_file -size 200 -order 2 -negative 5 -samples 100 -rho 0.025 -threads 20

and I got the following from gdb

Program received signal SIGSEGV, Segmentation fault.
_int_malloc (av=av@entry=0x7ffff7203760 <main_arena>, bytes=bytes@entry=10) at malloc.c:3775
3775 malloc.c: No such file or directory.
(gdb) bt
#0 _int_malloc (av=av@entry=0x7ffff7203760 <main_arena>, bytes=bytes@entry=10) at malloc.c:3775
#1 0x00007ffff6ec51ec in __libc_calloc (n=, elem_size=) at malloc.c:3219
#2 0x00000000004013ad in AddVertex (name=0x7fffffffb890 "春新神") at line.cpp:101
#3 0x0000000000401772 in ReadData () at line.cpp:161
#4 0x0000000000402996 in TrainLINE () at line.cpp:391
#5 0x0000000000402f42 in main (argc=17, argv=0x7fffffffe228) at line.cpp:465

=============================================================
and following from valgrind

valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion ‘bszB_lo == bszB_hi’ failed.
valgrind: Heap block lo/hi size mismatch: lo = 176, hi = 7598532953432678760.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will

=================================================================

Could you have a look at this issue?

Thank you
Ken

do I need boost?

hi,
did you use the boost or gsl package to recompile the source code? I just used the compiled file line.exe to train LINE on my own dataset, and it failed. I want to know whether boost is needed only for compiling, or also when training.
Looking forward to your reply, thanks!

Liu

Need to check write permission of output file

Hi, great work!

LINE failed to write to an output file which was in a directory owned by others / root and raised a segment fault.

After https://github.com/tangjianpku/LINE/blob/master/linux/line.cpp#L359, it needs a `fo == NULL` check to make sure the current user has permission to write the file.
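A minimal sketch of the missing guard (the helper name is ours; line.cpp would just exit(1) inline after its fopen call):

```cpp
#include <cstdio>

// Hypothetical helper showing the fo == NULL check the issue asks for.
// fopen returns NULL when the path is unwritable (e.g. owned by root),
// and writing through that NULL handle is what produces the segfault.
FILE *open_output_checked(const char *path) {
    FILE *fo = fopen(path, "wb");
    if (fo == NULL) {
        fprintf(stderr, "ERROR: cannot open output file %s for writing\n", path);
        // line.cpp would call exit(1) here instead of returning.
    }
    return fo;
}
```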

What's the max size of input?

I used this tool to train my embeddings, but it core dumped.
The core information is:
Program terminated with signal 11, Segmentation fault.
#0 0x00007f8a9ca3c90f in _int_malloc () from /opt/compiler/gcc-4.8.2/lib/libc.so.6
(gdb) bt
#0 0x00007f8a9ca3c90f in _int_malloc () from /opt/compiler/gcc-4.8.2/lib/libc.so.6
#1 0x00007f8a9ca3ed0a in calloc () from /opt/compiler/gcc-4.8.2/lib/libc.so.6
#2 0x00000000004014e6 in AddVertex(char*) ()
#3 0x000000000040183f in ReadData() ()
#4 0x0000000000402a23 in TrainLINE() ()
#5 0x0000000000402fcf in main ()

My input is 1919006734 lines; when I reduced the file to 59968960 lines, it worked.
I want to know how to train on very large amounts of data.
Thanks.
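One plausible cause, assuming (we can't confirm without the failing build) that line.cpp keeps vertices in a fixed-capacity, int-indexed array: at ~1.9 billion input lines, int counters and a fixed array size both become overflow suspects. A hedged sketch of a capacity-checked vertex store with long long counts (names mirror line.cpp's style, but this is illustrative, not the repo's code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Illustrative vertex record; line.cpp's real struct may differ.
struct Vertex { double degree; char *name; };

static struct Vertex *vertex = NULL;
static long long num_vertices = 0, max_num_vertices = 0;  // long long: int overflows near 2.1e9

long long AddVertexChecked(const char *name) {
    if (num_vertices + 1 >= max_num_vertices) {
        max_num_vertices += 10000;  // grow the array before writing past its end
        vertex = (struct Vertex *)realloc(vertex, max_num_vertices * sizeof(struct Vertex));
        if (vertex == NULL) { fprintf(stderr, "out of memory\n"); exit(1); }
    }
    vertex[num_vertices].name = strdup(name);
    vertex[num_vertices].degree = 0;
    return num_vertices++;
}
```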

Illegal instruction (core dumped)

Hi everyone,
I tried to execute LINE but I got the following error:

Order: 1
Samples: 1M
Negative: 5
Dimension: 100
Initial rho: 0.025000

Illegal instruction (core dumped)

Order: 2
Samples: 1M
Negative: 5
Dimension: 100
Initial rho: 0.025000
Illegal instruction (core dumped)
Any help, please!

memory leakage

if ((double)(k + 1) / neg_table_size > por)

When k == neg_table_size - 1, it can happen that vid == num_vertices while por < 1 (por may be 0.99999 rather than 1.0 due to floating-point rounding). The next cur_sum computation is then executed and accesses the (num_vertices)-th element of the vertex array -- which is out of range.
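A hedged sketch of the negative-sampling table builder with the off-by-one guard the issue describes: the extra `vid < num_vertices` test means the cumulative walk can never index past the last vertex, even when rounding leaves por slightly below 1.0. Names mirror line.cpp's style; the function signature is ours.

```cpp
#include <cmath>
#include <vector>

// Build the unigram^power negative-sampling table over vertex degrees.
std::vector<int> InitNegTable(const std::vector<double> &degree,
                              int neg_table_size, double power = 0.75) {
    int num_vertices = (int)degree.size();
    double sum = 0, cur_sum = 0, por = 0;
    for (int i = 0; i < num_vertices; i++) sum += pow(degree[i], power);
    std::vector<int> neg_table(neg_table_size);
    int vid = 0;
    for (int k = 0; k < neg_table_size; k++) {
        // The "vid < num_vertices" clause is the fix for the out-of-range read.
        if ((double)(k + 1) / neg_table_size > por && vid < num_vertices) {
            cur_sum += pow(degree[vid], power);
            por = cur_sum / sum;
            vid++;
        }
        neg_table[k] = vid - 1;  // vid has already advanced past the current vertex
    }
    return neg_table;
}
```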

Clarify license

Hello. Have you thought about an open source license for LINE?
Currently I'm not able to find any mention of a license anywhere in the repo.

If you're deciding, https://tldrlegal.com/ is a helpful resource, and I might recommend MIT, Apache 2, or BSD.

On segmentation faults happening in the line.cpp Update method

With one of my network graphs, I had several segmentation faults which I could locate within the Update method of LINE. The problem is that if the embeddings grow very large in magnitude in some dimensions, the variables overflow and the sigmoid method segfaults.

I could mitigate the problem by setting "typedef double real;" instead of "typedef float real;" to use double instead of single precision. This resolved the segmentation faults, at least within my networks, thanks to the enlarged value range; however, when the embeddings are concatenated at the end, the result again contains overflown embeddings. To solve the problem completely, I had to write my own concatenate method. Just in case somebody runs into the same problems.
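For reference, a sketch of the two mitigations described above, with illustrative names (as far as we can tell, line.cpp's sigmoid is table-based and bounded at +/-6, which has the same clamping effect; an explicit clamp just makes the bound visible):

```cpp
#include <cmath>

typedef double real;  // was: typedef float real; widens the usable exponent range

// Clamped sigmoid: bounding the input prevents exp() overflow no matter
// how large the embedding dot products grow.
real ClampedSigmoid(real x) {
    if (x > 6) return 1;
    if (x < -6) return 0;
    return 1 / (1 + exp(-x));
}
```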

bug report

x in InitSigmoidTable() will only take integer values, because its computation uses purely int types.

Use the following to fix this bug:
x = 2 * SIGMOID_BOUND * k / sigmoid_table_size - SIGMOID_BOUND;
=> x = 2.0 * SIGMOID_BOUND * k / sigmoid_table_size - SIGMOID_BOUND;

(my code is compiled in Ubuntu with G++ 4.8.4)
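The root cause is plain C++ integer division: with all-int operands, 2 * SIGMOID_BOUND * k / sigmoid_table_size truncates before the subtraction. A small demonstration (constants chosen to match what we believe are line.cpp's defaults; any table size shows the effect):

```cpp
// Table x-coordinate for index k, computed the broken and the fixed way.
double broken_x(int k) {
    const int SIGMOID_BOUND = 6, sigmoid_table_size = 1000;
    return 2 * SIGMOID_BOUND * k / sigmoid_table_size - SIGMOID_BOUND;   // int division truncates
}
double fixed_x(int k) {
    const int SIGMOID_BOUND = 6, sigmoid_table_size = 1000;
    return 2.0 * SIGMOID_BOUND * k / sigmoid_table_size - SIGMOID_BOUND; // promoted to double first
}
```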

How to run on windows?

Hello,

In the explanation, it says: "To run the script, users first need to compile the evaluation codes by running make.sh in the folder 'evaluate'. Afterwards, we can run train_youtube.bat or train_youtube.sh to run the whole pipeline." But there is no such file in the windows folder. What is the equivalent of make.sh in the windows folder?

Question about the output of concatenate

From section 4.1.3 of the LINE paper, the first-order and second-order proximity can be preserved by concatenating the two embedding vectors of each node. I used concatenate.cpp to obtain an executable that produces the final embedding. However, I notice that the resulting vector of each node differs from simply concatenating the first-order and second-order vectors, so I'm curious how the two embeddings are actually combined. Can the authors clarify?
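One likely source of the difference (an assumption on our part, since concatenate.cpp isn't quoted here): the pipeline runs normalize before concatenate, so each half is L2-normalized rather than raw. A sketch with our own function names:

```cpp
#include <cmath>
#include <vector>

// L2-normalize a vector (no-op on the zero vector).
std::vector<double> normalize_l2(std::vector<double> v) {
    double norm = 0;
    for (double x : v) norm += x * x;
    norm = sqrt(norm);
    if (norm > 0) for (double &x : v) x /= norm;
    return v;
}

// Normalize each half, then append: mirrors the normalize -> concatenate pipeline.
std::vector<double> concat_normalized(const std::vector<double> &first_order,
                                      const std::vector<double> &second_order) {
    std::vector<double> a = normalize_l2(first_order), b = normalize_l2(second_order);
    a.insert(a.end(), b.begin(), b.end());
    return a;
}
```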

Segmentation fault

I recompiled line.cpp, but I get a "Segmentation fault" error:
Number of edges: 158
Number of vertices: 29

Rho: 0.000498 Progress: 99.010%
Total time: 6.562638
Segmentation fault (core dumped)

result file's vertex names are wrong

I used the LINE code to train BlogCatalog dataset. The net input file is as:
1 176 1
176 1 1
1 233 1
233 1 1
1 283 1
....
I followed the train_youtube command and set the binary parameter to 0, but the first column of the result file contains float numbers that differ from the original vertex ids.
Has anyone else encountered this problem?

Question about `reconstruct.cpp`

I have two questions about reconstruct.cpp.

I ran train_youtube.sh, but I felt that net_youtube_dense.txt might be strange.

E.g.

$ cat net_youtube_dense.txt | grep -i 670283
21	670283	21.000000
21	670283	43285.000000
664068	670283	664068.000000
670283	43287	826.000000
670283	160422	670283.000000
670283	695759	670283.000000
670283	911744	670283.000000
670283	1037375	670283.000000
670282	790	670283.000000
670282	1038559	670283.000000
160468	670283	664068.000000
695936	670283	906472.000000
670280	906472	670283.000000
1037373	670283	1037374.000000
1037376	670283	1037377.000000
  1. The edge weight is often very close to the from_node id. Is that OK?
  2. This output contains the same edge with different weights, e.g. 21 -- 670283. Is that OK?

My environment:

  • Mac OS X El Capitan
  • gcc:
$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

or

  • archlinux
  • gcc:
$gcc -v

COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 7.2.0 (GCC)

The number of edges does not change when depth is set to 2

After I installed gsl lib and downloaded the dataset of youtube, I tried to run the script train_youtube.sh.

What I saw is the following,
Number of edges: 4945382
Number of vertices: 937968
Reading neighbors: 99.891%
Progress: 99.999%
Number of edges in reconstructed network: 4945382

The problem is that the depth variable in 'reconstruct' is set to 2, yet the number of edges in the reconstructed network does not change. Is this a problem with the code?

meeting segmentation fault when training LINE

Hi, I trained LINE on my own dataset and hit a segmentation fault. When I debugged with gdb, the backtrace shows the fault occurs in the Update() method. I have read your line.cpp: you implement the algorithm with multiple threads without any locks. I wonder what happens if two threads update the same memory at the same time.

The debug info of gdb are as following:

Order: 1
Samples: 1000M
Negative: 5
Dimension: 128
Initial rho: 0.025000
--------------------------------
Number of edges: 61589256
Number of vertices: 801311
--------------------------------
[New Thread 0x7fff3112f700 (LWP 23804)]
[New Thread 0x7fff3072e700 (LWP 23805)]
[New Thread 0x7fff2fd2d700 (LWP 23806)]
[New Thread 0x7fff2f32c700 (LWP 23807)]
[New Thread 0x7fff2e92b700 (LWP 23808)]
[New Thread 0x7fff2df2a700 (LWP 23809)]
[New Thread 0x7fff2d529700 (LWP 23810)]
[New Thread 0x7fff2cb28700 (LWP 23811)]
[New Thread 0x7fff0ffff700 (LWP 23812)]
[New Thread 0x7fff0f5fe700 (LWP 23813)]
[New Thread 0x7fff0ebfd700 (LWP 23814)]
[New Thread 0x7fff0e1fc700 (LWP 23815)]
[New Thread 0x7fff0d7fb700 (LWP 23816)]
[New Thread 0x7fff0cdfa700 (LWP 23817)]
[New Thread 0x7ffef7fff700 (LWP 23818)]
[New Thread 0x7ffef75fe700 (LWP 23819)]
[New Thread 0x7ffef6bfd700 (LWP 23820)]
[New Thread 0x7ffef61fc700 (LWP 23821)]
Rho: 0.024999  Progress: 0.004%[New Thread 0x7ffef57fb700 (LWP 23822)]
[New Thread 0x7ffef4dfa700 (LWP 23823)]
Rho: 0.024790  Progress: 0.841%
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff0e1fc700 (LWP 23815)]
0x0000000000403dc2 in Update(float*, float*, float*, int) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 gsl-1.13-1.el6.x86_64
(gdb) where
#0  0x0000000000403dc2 in Update(float*, float*, float*, int) ()
#1  0x0000000000404418 in TrainLINEThread(void*) ()
#2  0x0000003570e079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003570ae8b6d in clone () from /lib64/libc.so.6

Looking forward to your reply, thanks!

SIGABRT when importing graph

I have tested the Linux implementation in two different machines. Here's the backtrace given by gdb.

line: malloc.c:2395: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff6ec5458 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6ec5458 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ffff6ec68da in __GI_abort () at abort.c:89
#2  0x00007ffff6f09308 in __malloc_assert (
    assertion=assertion@entry=0x7ffff6ffc048 "(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offs"..., file=file@entry=0x7ffff6ff8444 "malloc.c", line=line@entry=2395, 
    function=function@entry=0x7ffff6ffcd50 <__func__.11480> "sysmalloc") at malloc.c:300
#3  0x00007ffff6f0b016 in sysmalloc (nb=nb@entry=32, av=av@entry=0x7ffff7230b20 <main_arena>) at malloc.c:2392
#4  0x00007ffff6f0c066 in _int_malloc (av=av@entry=0x7ffff7230b20 <main_arena>, bytes=bytes@entry=18)
    at malloc.c:3828
#5  0x00007ffff6f0dfda in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3237
#6  0x00000000004012ce in AddVertex (name=0x7fffffffde40 "Kempton,_Illinois") at line.cpp:100
#7  0x00000000004015c0 in ReadData () at line.cpp:154
#8  0x00000000004027fe in TrainLINE () at line.cpp:389
#9  0x0000000000402da9 in main (argc=19, argv=0x7fffffffe008) at line.cpp:463

I have run the program with these arguments:

-train graph.tsv -output emb.txt -binary 0 -size 300 -order 1 -negative 5 -samples 100 -rho 0.025 -threads 8

The graph file is 1.4GB.

The program fails at a different point of execution on each machine while "Reading edges". However, it stops at exactly the same point when executed on different nodes of the same cluster at different times, so I don't think insufficient memory is the issue. Besides, no RAM spikes were observed during execution.

-threshold versus -k-max in reconstruct

The youtube script uses -k-max, but reconstruct.cpp looks for -threshold. If you give reconstruct.cpp -k-max, it doesn't check for unsupported options and silently continues as if -threshold were 0.
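A defensive argument parser would reject unknown flags instead of silently ignoring them. A sketch (our own code, not reconstruct.cpp's actual parser; it assumes option values never begin with '-'):

```cpp
#include <cstdio>
#include <cstring>

// Reject any flag not in the known list, so a mismatched name like -k-max
// (instead of -threshold) fails loudly rather than defaulting to 0.
bool ValidateArgs(int argc, char **argv, const char **known, int num_known) {
    for (int i = 1; i < argc; i++) {
        if (argv[i][0] != '-') continue;  // skip option values and file names
        bool found = false;
        for (int k = 0; k < num_known; k++)
            if (strcmp(argv[i], known[k]) == 0) { found = true; break; }
        if (!found) {
            fprintf(stderr, "Unknown option: %s\n", argv[i]);
            return false;
        }
    }
    return true;
}
```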

boost library issue

Hi all,
I compiled the LINE code in Visual Studio 2012 and built boost_1_68. I received the error below:
error LNK1104: cannot open file 'libboost_thread-vc110-mt-x32-1_68.lib'
I have "libboost_thread-vc120-mt-x32-1_68.lib" instead.
Can anyone help me?

Thanks.

Question about the update method

For u, the gradient should be (1-sigmoid(uv))*v when the label is 1, (sigmoid(-uv) - 1)*v when the label is 0.

But in the update method, the gradient becomes (1-sigmoid(uv))*v when the label is 1, -sigmoid(uv)*v when the label is 0.

Is there any error in my calculation? Or this is a bug of the code?

A question about multi-thread in your code

Hi, I read your C++ code of LINE for Windows, a very good implementation, but I have a question: why didn't you consider the read-write conflict when updating the embedding vectors in the Update() function? All threads may read or write vec_v[c] (that is, emb_vertex[c] and emb_context[c]) at any time.
Are there any potential problems? For example, suppose two threads sample two edges that share a common vertex, and both then update that vertex's embedding. When running code such as line 307 of your implementation, x += vec_u[c] * vec_v[c], the values the two threads read are not guaranteed consistent by boost::thread; in such a case the embedding features of the vertex can be corrupted. I know the probability of this mistake is very small when the number of vertices is 1e10+ and the number of threads is just 10+.

I am looking forward to your response, and thank you very much.
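This lock-free scheme is the asynchronous ("Hogwild"-style) SGD also used by word2vec: occasional lost or torn updates act like extra gradient noise and rarely hurt in practice. For comparison, a lock-per-vertex variant (our own sketch, not from the repo) would look like:

```cpp
#include <mutex>
#include <vector>

// One mutex per vertex; every update to a vertex's embedding takes its
// lock first. Race-free, but it serializes updates to hot vertices and
// is much slower than the lock-free version in line.cpp.
struct LockedEmbeddings {
    std::vector<std::vector<double>> emb;
    std::vector<std::mutex> locks;
    LockedEmbeddings(int n, int dim) : emb(n, std::vector<double>(dim, 0.0)), locks(n) {}
    void Add(int v, double delta) {
        std::lock_guard<std::mutex> lk(locks[v]);
        for (double &x : emb[v]) x += delta;
    }
};
```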

Zero Embedding Vector Issue for LINE

Hi, I found that in the output embedding file, some item embeddings are all zero.

I guess it may be caused by the random sampling strategy: with n random draws over n items, around 1/e of the items are never sampled. But the observed all-zero ratio is less than 1/e.
Another hypothesis is the SGD updates being dominated by some hot nodes.

So I wonder if an epoch-based version would help solve this.
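The 1/e figure comes from: drawing n samples uniformly from n items, a fixed item is missed with probability (1 - 1/n)^n, which tends to e^-1 ≈ 0.368. Edge sampling in LINE is weighted rather than uniform, which is consistent with the observed ratio being below 1/e (high-degree nodes are almost surely hit). A quick check of the limit:

```cpp
#include <cmath>

// Probability that a fixed item is never chosen in n uniform draws from n items.
double p_never_sampled(double n) { return pow(1.0 - 1.0 / n, n); }
```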

How to feed inputs in Visual Studio command line?

Hi all,
I want to run the code with the YouTube data set in Visual Studio 2012. Should I make some changes in the code to read the inputs from cmd (when I run the code in VS with Ctrl+F5)? My main issue is how to feed the input (in the desired format you mentioned: -train network_file -output embedding_file -binary 1 -size 200 -order 2 -negative 5 -samples 100 -rho 0.025 -threads 20) to the program in the Visual Studio command line. Nothing prompts me for the inputs!

Thanks.

"can't open input file workspace\test{1-9}"

At run.bat ../vec_all.txt, train_youtube.bat

At liblinear\windows\predict.exe -b 1 -q workspace\test{1-9} workspace\model{1-9} workspace\predict{1-9}, evaluate\run.bat

After running program\preprocess.exe -vocab program\vocab.txt -vector $1 -label program\label.txt -output workspace\ -debug 2 -binary 1 -times 10 -portion 0.01, there is nothing in the workspace folder.

How to feed inputs in Visual Studio command line prompt?

Hi all,
I want to run the code with the YouTube data set in Visual Studio 2012. Should I make some changes in the code to read the inputs from cmd (when I run the code in VS with Ctrl+F5)? My main issue is how to feed the input (in the desired format you mentioned: -train network_file -output embedding_file -binary 1 -size 200 -order 2 -negative 5 -samples 100 -rho 0.025 -threads 20) to the program in the Visual Studio command line.
When I run the code in Visual Studio 2019, I follow all the instructions posted on the LINE GitHub for running LINE on an input network file (e.g., the YouTube data set) in the Visual Studio command line, but nothing prompts me to feed my input.

Thanks

Error for my own data

Hi,

Anyone who knows how to solve the following errors? Thanks.

Number of edges: 56961
Number of vertices: 8893
Illegal instruction (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Vocab size: 0
Vector size 1: 0
Vector size 2: 0
