mickeysjm / hiexpan
The source code used for automatic taxonomy construction method HiExpan, published in KDD 2018
License: GNU General Public License v3.0
Hello, I would like to test HiExpan on the wiki corpus. After featureExtraction, I ran
~/HiExpan/src/HiExpan-new$ python3.6 main.py -data wiki
to test.
But after loading those files in wiki/intermediate, I got:
=== Finish loading data ...... ===
=== Start loading seed supervision ...... ===
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    newNode = TreeNode(parent=rootNode, level=0, eid=ename2eid[children], ename=children,
KeyError: 'united_states'
It seems that united_states is not included in those entities. What could possibly be wrong?
Thank you.
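One way to narrow this down is to check every seed name against the entity vocabulary before the tree is built. The following is a minimal sketch, assuming `ename2eid` is a plain dict from entity name to entity id, as the traceback suggests. The helper `find_missing_seeds` and the sample data are hypothetical and not part of the repository.

```python
def find_missing_seeds(seed_names, ename2eid):
    """Return the seed names that have no entry in the entity vocabulary."""
    return [name for name in seed_names if name not in ename2eid]


# Hypothetical data for illustration only; real values come from the
# files loaded out of the corpus's intermediate folder.
ename2eid = {"china": 0, "canada": 1}
seeds = ["united_states", "china", "canada"]

missing = find_missing_seeds(seeds, ename2eid)
if missing:
    print("Seeds not found in entity vocabulary:", missing)
```

If a seed shows up in this list, it may be that the entity was never extracted from the corpus, or that it was extracted under a different surface form than the one used in the seed taxonomy.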
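Thank you very much for replying to my last issue so promptly. Your method is great, very impressive.
When I ran it on the dblp and wiki datasets, the results were very good, but the computation takes too long. I understand that optimization may not have been a priority when running the experiments, so I would like to ask whether there are any parts of the code worth looking into for optimization. Thanks.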
Hi Jiaming,
In the skip-gram feature extraction code https://github.com/mickeystroller/HiExpan/blob/master/src/featureExtraction/extractSkipGramFeature.py, the possible skip-gram positions are set to [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)] (line 30). However, when the center word is the first word of a sentence, the position effectively becomes (0, 1) instead of (-1, 1), since there is no word before the center word. So maybe we should add positions like (0, 1) and (0, 2). Otherwise, some entities will have the "a _ problem" feature but not the "_ problem" feature, which may hurt if "_ problem" becomes an important feature later. Thanks!
Best,
Jieyu
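The effect described above can be illustrated with a small sketch. This is not the repository's implementation: the function `skipgrams`, the `_` placeholder, and the rule of skipping any window that falls outside the sentence are assumptions made for illustration. Only the position list is taken from the issue.

```python
# Positions quoted in the issue: (left offset, right width) around the mention.
BASE_POSITIONS = [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)]
# Right-only windows proposed in the issue.
EXTRA_POSITIONS = [(0, 1), (0, 2)]


def skipgrams(tokens, start, end, positions):
    """Return skip-gram strings around the mention tokens[start:end + 1].

    A window is skipped when it would reach past either end of the sentence.
    """
    features = []
    for left, right in positions:
        lo = start + left   # index of the first left-context token
        hi = end + right    # index of the last right-context token
        if lo < 0 or hi >= len(tokens):
            continue
        left_ctx = tokens[lo:start]
        right_ctx = tokens[end + 1:hi + 1]
        features.append(" ".join(left_ctx + ["_"] + right_ctx))
    return features


sentence_initial = ["optimization", "problem", "is", "hard"]
mid_sentence = ["a", "optimization", "problem"]

print(skipgrams(sentence_initial, 0, 0, BASE_POSITIONS))
print(skipgrams(sentence_initial, 0, 0, BASE_POSITIONS + EXTRA_POSITIONS))
print(skipgrams(mid_sentence, 1, 1, BASE_POSITIONS + EXTRA_POSITIONS))
```

Under these assumptions, a sentence-initial mention gets no features from the base positions, while the right-only windows give both the sentence-initial and the mid-sentence mention a shared "_ problem" feature.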
Hi Jiaming,
In HiExpan/src/SetExpan-new/set_expan_standalone.py, lines 89-90 need to be indented; otherwise the redundant features will not be filtered out. Thanks!
Best,
Jieyu
Hi Jiaming,
Thanks for your idea and code. When I ran the code on a Chinese corpus, I found some issues:
First, dependency parsing and part-of-speech tagging seem to be unnecessary in corpus processing.
Second, the getCombinedWeightByFeatureMap function takes too much time when featuresOfSeed is large (the number of skip-gram patterns reaches hundreds of thousands). So I only retained the 600 features with the highest score and standard length in "eidSkipgram2TFIDFStrength.txt" for each entity. This reduced the run time from 30 hours to 30 minutes, but it may reduce the accuracy of the combinedWeight score.
Third, the type feature is useless for Chinese, so I have to use the LDA model's score instead. I am now evaluating the effectiveness of this method.
Finally, I couldn't find the code for the Taxonomy Global Optimization section. Where can I find it?
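The pruning described in the second point can be sketched as a per-entity top-k selection. This is a minimal sketch under stated assumptions: the records are taken to be in-memory (entity id, feature, score) tuples, since the on-disk layout of "eidSkipgram2TFIDFStrength.txt" is not given here, and the function name `prune_features` is hypothetical. The length criterion mentioned in the issue is not modeled.

```python
import heapq
from collections import defaultdict


def prune_features(records, k=600):
    """Keep only the k highest-scoring features for each entity.

    records: iterable of (eid, feature, score) tuples.
    Returns a dict mapping eid to a {feature: score} dict.
    """
    by_entity = defaultdict(list)
    for eid, feature, score in records:
        by_entity[eid].append((score, feature))
    return {
        eid: {feature: score for score, feature in heapq.nlargest(k, pairs)}
        for eid, pairs in by_entity.items()
    }


# Hypothetical records for illustration only.
records = [
    (1, "a _ b", 0.9),
    (1, "_ c", 0.5),
    (1, "d _", 0.1),
    (2, "x _", 0.3),
]
print(prune_features(records, k=2))
```

Because every entity keeps at most k features, any downstream comparison over feature sets is bounded by k rather than by the full pattern vocabulary, at the cost of ignoring low-scoring features.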
Hi, I want to run HiExpan to see how it works on the DBLP corpus. I am using the test_dataset provided in the original HiExpan folder. I have run the corpus preprocessing and feature extraction parts. When I try to run main.py in the new-HiExpan folder, it tells me main.py: error: the following arguments are required: -data, -taxonPrefix. Are there any instructions on how to run that part? In addition, where should I put the seed taxonomy, and what format should it be in? Thanks.
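The error text has the shape of a Python argparse message for missing required options. The following is a minimal sketch of a parser that produces the same kind of message, assuming both flags are declared with `required=True`. The help strings and the example value `my_taxonomy` are hypothetical and not taken from the repository.

```python
import argparse


def build_parser():
    """Build a parser with the two flags named in the error message."""
    parser = argparse.ArgumentParser(prog="main.py")
    # The meanings in the help strings are guesses, not documented behavior.
    parser.add_argument("-data", required=True, help="dataset name (assumed)")
    parser.add_argument("-taxonPrefix", required=True, help="taxonomy prefix (assumed)")
    return parser


# Supplying both flags gets past argument parsing.
args = build_parser().parse_args(["-data", "dblp", "-taxonPrefix", "my_taxonomy"])
print(args.data, args.taxonPrefix)
```

Under that assumption, a command of the form `python main.py -data dblp -taxonPrefix my_taxonomy` would get past argument parsing. What value the prefix should actually take, and where the seed taxonomy lives, still needs to be confirmed against the repository.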