KDD CUP 2013 - Track 2
======================
Copyright 2013 Cheng-Xia Chang, Wei-Cheng Chang, Wei-Sheng Chin, Kuan-Hao Huang, Yu-Chin Juan,
Tzu-Ming Kuo, Chun-Liang Li, Chih-Jen Lin, Hsuan-Tien Lin, Shan-Wei Lin,
Shou-De Lin, Ting-Wei Lin, Young-San Lin, Yu-Chen Lu, Yu-Chuan Su, Cheng-Hao Tsai,
Hsiao-Yu Tung, Jui-Pin Wang, Cheng-Kuang Wei, Felix Wu, Chun-Pai Yang, Tu-Chun Yin,
Tong Yu, and Yong Zhuang.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.




This package was developed at National Taiwan University.
Our approach includes three author-matching algorithms,
'main1', 'main2' and 'typo'; their outputs are merged
by the post-processing scripts in 'merge'. To simplify usage, all
necessary steps are integrated into a Makefile in the top directory, so users
can simply type 'make' to run the whole process. In addition, 'buff/main1.csv'
and 'buff/main2.csv' hold the results of 'main1' and 'main2', respectively.
Because our algorithms rely on some other open-source packages,
please read the following notes before you get started. The details of
these methods will be described in a paper to be published in mid-July.

A. Package organization
    'main1' (One author-matching algorithm)
    'main2' (Another author-matching algorithm)
    'typo' (A matching algorithm that detects duplicates 'main1' and 'main2' cannot find because of typos)
    'merge' (Scripts for merging the results of the above methods)


B. Performance in KDD-Cup 2013 (F1 score):
                  | main1   | main2   | merge
    public (20%)  | 0.99186 | 0.99071 | 0.99195
    private (80%) | 0.99198 | 0.99083 | 0.99202
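
This README does not spell out how the F1 score is computed. As a rough illustration only, the Python 3 sketch below assumes a mean per-author F1, where each author has a predicted set and a true set of duplicate author IDs. The function names and the dictionary layout are hypothetical and are not taken from this package.

```python
def f1_score(predicted, actual):
    """F1 between a predicted set of duplicate IDs and the true set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truth):
    """Average the per-author F1 over every author ID in `truth`.

    Both arguments map an author ID to a collection of duplicate IDs
    (an assumed layout, chosen for this example).
    """
    scores = [f1_score(predictions.get(author, ()), duplicates)
              for author, duplicates in truth.items()]
    return sum(scores) / len(scores)
```

For instance, if author 1 truly duplicates {1, 2, 3} but only {1, 2} is predicted, precision is 1 and recall is 2/3, so that author's F1 is 0.8.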
                

C. Make a prediction step-by-step:
    1. Requirements and dependencies:
        Our package runs under Ubuntu 10.04 and requires the following:
            1-1. Python2 (tested with version 2.6.5)
            1-2. Python3 (tested with version 3.3.1)
            1-3. Perl5 (tested with version 5.10)
            1-4. Perl module Text::CSV
            1-5. Raw data (dataRev2.zip). The zip file must be stored in the top directory of this package.

    2. Run:
        Type 'make'.

    3. Result:
        The result file is 'final.csv'. Please note that the algorithm
        may take more than 2 hours to generate the result file.
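
Since the run takes hours, it may help to confirm the requirements listed in step 1 before typing 'make'. The Python 3 sketch below is not part of this package. The executable names ('python2', 'python3', 'perl', 'make') are assumptions about how the tools are installed, and the helper names are hypothetical.

```python
import os
import shutil
import subprocess

# Executable names are assumptions; adjust them to match your system.
REQUIRED_COMMANDS = ["python2", "python3", "perl", "make"]
PERL_MODULE = "Text::CSV"
RAW_DATA = "dataRev2.zip"  # expected in the top directory of the package


def has_perl_module(module):
    """Return True if `perl -M<module> -e 1` exits successfully."""
    try:
        result = subprocess.run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:  # perl itself is not installed
        return False
    return result.returncode == 0


def missing_requirements(top_dir=".", which=shutil.which,
                         has_module=has_perl_module):
    """Return descriptions of the requirements that were not found."""
    missing = [cmd for cmd in REQUIRED_COMMANDS if which(cmd) is None]
    if not has_module(PERL_MODULE):
        missing.append("Perl module " + PERL_MODULE)
    if not os.path.isfile(os.path.join(top_dir, RAW_DATA)):
        missing.append(RAW_DATA)
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All checks passed; type 'make' and wait for final.csv.")
```

The `which` and `has_module` parameters exist only so the check can be exercised without the real tools installed.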


D. Public resources used. They are included in this package.
    1. Chinese information for 'main1'. This information is used to identify Eastern and Western names.
        1-1. Chinese family name
            We have two lists of Chinese family names. The smaller one, TW.raw,
            is the official romanization of the 100 most common Chinese family names in
            Taiwan. The larger one, CN.raw, contains 506 common Chinese family names and
            their romanizations, and was downloaded from Wikipedia.
            Links:
                "http://tc.wangchao.net.cn/xinxi/detail_1855256.html" and romanization in "http://www.boca.gov.tw/mp?mp=1"
                https://zh.wikipedia.org/wiki/中文姓氏羅馬字標注
                http://www.greatchinese.com/surname/surname.htm
        1-2. Korean family name
            KR.raw contains 20 common Korean family names and their romanizations.
            Links:
                http://mirror.enha.kr/wiki/한국인%20이름의%20로마자%20표기
        1-3. Common romanization of Chinese tokens
            Links:
                http://www.pinyin.info/romanization/compare/gwoyeu_romatzyh.html 
                http://en.wikipedia.org/wiki/Comparison_of_Chinese_romanization_systems
            From these tokens, we manually selected 45 that frequently appear in both
            English and Chinese.

    2. Chinese information for 'main2'.
        2-1. Chinese family name
            Link:
                http://www.chineseinla.com/lastname/key_ng.html
        2-2. Common romanization of Chinese tokens
            Link:
                 http://irw.ncut.edu.tw/general/chen813/羅馬拼音/中文羅馬拼音對照表.htm

    3. Nicknames. We substitute all nicknames before we do any matching.
        Link for 'main1': http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html
                          http://mentalfloss.com/article/24761/origins-10-nicknames
        Link for 'main2': https://code.google.com/p/author-dedupe/
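
The nickname substitution step can be pictured with the minimal Python 3 sketch below. It is an illustration, not the code used in 'main1' or 'main2'. The three-entry table is a hypothetical stand-in for the lists linked above, and lowercasing, whitespace tokenization, and the nickname-to-formal-name direction are all assumptions.

```python
# Hypothetical nickname table; the real lists come from the links above.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def substitute_nicknames(name, table=NICKNAMES):
    """Replace each whitespace-separated token that is a known nickname.

    Assumption: names are lowercased first, so matching ignores case.
    """
    tokens = name.lower().split()
    return " ".join(table.get(token, token) for token in tokens)
```

With this table, "Bill Gates" and "William Gates" both normalize to "william gates", so a later matching step would compare identical strings.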

    4. List of stop words used in the merge step
        Links:
            http://nlp.stanford.edu/software/tmt/tmt-0.4/
