This project forked from kdd-cup-2013-ntu/track2
KDD CUP 2013 - Track 2
======================

Copyright 2013 Cheng-Xia Chang, Wei-Cheng Chang, Wei-Sheng Chin, Kuan-Hao Huang, Yu-Chin Juan, Tzu-Ming Kuo, Chun-Liang Li, Chih-Jen Lin, Hsuan-Tien Lin, Shan-Wei Lin, Shou-De Lin, Ting-Wei Lin, Young-San Lin, Yu-Chen Lu, Yu-Chuan Su, Cheng-Hao Tsai, Hsiao-Yu Tung, Jui-Pin Wang, Cheng-Kuang Wei, Felix Wu, Chun-Pai Yang, Tu-Chun Yin, Tong Yu, and Yong Zhuang.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This package is developed at National Taiwan University. Our approach includes three different author-matching algorithms, 'main1', 'main2' and 'typo'. The outputs of these three algorithms are merged using the post-processing scripts in 'merge'. To simplify usage, we integrate all necessary steps into a Makefile in the top directory, so users can simply type 'make' to run the whole process. In addition, 'buff/main1.csv' and 'buff/main2.csv' are the results of 'main1' and 'main2', respectively. Because our algorithms rely on other open source packages, please read the following statements before you get started. The details of these methods will be described in a paper to be published in mid-July.

A. Package organization

    'main1' (One author-matching algorithm)
    'main2' (Another author-matching algorithm)
    'typo'  (A matching algorithm for detecting duplicates that cannot be found by 'main1' and 'main2' because of typos)
    'merge' (Scripts for merging the results of the above methods)

B. Performance in KDD-Cup 2013 (F1 score):

                 | main1   | main2   | merge
    public(20%)  | 0.99186 | 0.99071 | 0.99195
    private(80%) | 0.99198 | 0.99083 | 0.99202

C. Make a prediction step-by-step:

    1. Requirements and dependencies:
       Our package runs under Ubuntu 10.04 and requires the following:
       1-1. Python2 (tested with version 2.6.5)
       1-2. Python3 (tested with version 3.3.1)
       1-3. Perl5 (tested with version 5.10)
       1-4. Perl module Text::CSV
       1-5. Raw data (dataRev2.zip). The zipped file should be stored in the top directory of this package.

    2. Run:
       Type 'make'.

    3. Result:
       The result file is 'final.csv'. Please note that the algorithm may take more than 2 hours to generate the result file.

D. Public resources used. They are included in this package.

    1. Chinese information for 'main1'. This information is used for Eastern and Western name identification.

       1-1. Chinese family name
            We have two lists of Chinese family names. The smaller one, TW.raw, contains the official romanization of the 100 most common Chinese family names in Taiwan. The larger one, CN.raw, contains 506 common Chinese family names and their romanizations and is downloaded from Wikipedia.
            Links:
            "http://tc.wangchao.net.cn/xinxi/detail_1855256.html" and romanization in "http://www.boca.gov.tw/mp?mp=1"
            https://zh.wikipedia.org/wiki/中文姓氏羅馬字標注
            http://www.greatchinese.com/surname/surname.htm

       1-2. Korean family name
            KR.raw contains 20 common Korean family names and their romanizations.
            Links:
            http://mirror.enha.kr/wiki/한국인%20이름의%20로마자%20표기

       1-3. Common romanization of Chinese tokens
            Links:
            http://www.pinyin.info/romanization/compare/gwoyeu_romatzyh.html
            http://en.wikipedia.org/wiki/Comparison_of_Chinese_romanization_systems
            From these tokens, we manually selected 45 that appear frequently in both English and Chinese.

    2. Chinese information for 'main2'.

       2-1. Chinese family name
            Link:
            http://www.chineseinla.com/lastname/key_ng.html

       2-2. Common romanization of Chinese tokens
            Link:
            http://irw.ncut.edu.tw/general/chen813/羅馬拼音/中文羅馬拼音對照表.htm

    3. Nicknames. We substitute all nicknames before we do any matching.
       Links for 'main1':
       http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html
       http://mentalfloss.com/article/24761/origins-10-nicknames
       Link for 'main2':
       https://code.google.com/p/author-dedupe/

    4. List of stop words used in the merge step
       Link:
       http://nlp.stanford.edu/software/tmt/tmt-0.4/
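
As an illustration of the nickname substitution described in D.3, here is a minimal sketch of replacing nicknames with canonical given names before comparing two author names. The `NICKNAMES` table and the functions `substitute_nicknames` and `same_author` are hypothetical and are not taken from 'main1' or 'main2'. The exact-match comparison is likewise only an assumption standing in for the real matching algorithms.

```python
# Hypothetical nickname table. The entries are illustrative only and are
# not copied from the nickname lists used by 'main1' or 'main2'.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "liz": "elizabeth",
}


def substitute_nicknames(name, table=NICKNAMES):
    """Lowercase a name and replace every token found in the nickname table."""
    tokens = name.lower().split()
    return " ".join(table.get(token, token) for token in tokens)


def same_author(name_a, name_b):
    """Naive exact comparison after substitution (an assumption, not the package's matcher)."""
    return substitute_nicknames(name_a) == substitute_nicknames(name_b)


if __name__ == "__main__":
    print(substitute_nicknames("Bill Gates"))
    print(same_author("Bob Smith", "Robert Smith"))
```

Doing the substitution first means that later matching steps only ever see one form of each given name, so they do not need any nickname-specific logic of their own.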