Giter Club home page Giter Club logo

dataset-person-name-disambiguation's Introduction

dataset-person-name-disambiguation

creating a dataset for person name disambiguation using combination of sources like wikipedia, DBLP authors and PPDB.

Download various sources

1. DBPedia ref

wget http://downloads.dbpedia.org/3.6/en/persondata_en.nt.bz2
bzip2 -d persondata_en.nt.bz2
wget http://downloads.dbpedia.org/3.6/en/disambiguations_en.nt.bz2
bzip2 -d disambiguations_en.nt.bz2

2. The Paraphrase Database ref

wget http://www.cis.upenn.edu/~ccb/ppdb/release-1.0/ppdb-1.0-s-lexical.gz
gunzip ppdb-1.0-s-lexical.gz
wget http://www.cis.upenn.edu/~ccb/ppdb/release-1.0/ppdb-1.0-s-o2m.gz
gunzip ppdb-1.0-s-o2m.gz

3. DBLP authors ref

wget https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/DBLP/DBLP10k.csv

Dataset Generation

Step 1 : run attached createdata.py on downloaded files.

python createdata.py > persons.match

Step 2 (optional): append nnp dataset from ppdb (not necessarily person names but it help in learning spelling patterns)

cat ppdb*|grep "\[NNP\]"|awk -F"[|]*[ ]*" '{if($3!=$5 && substr($3,0,1)==substr($5,0,1))print $3"\t"$5"\ty"}' > nnp.match

cat nnp.match >> persons.match

Dataset Sample

Name Disambiguation isVariation
Marria G Honnet Marry Honnet y
Mohammed Fazle Baki Md. Fazle Baki y
Shensheng Zhang Shen-sheng Zhang y
James B. D. Joshi James Joshi y
Thomas A. Down Thomas Down y
Frank Hung-Fat Leung Frank H. Leung y
Geoffrey W. Hill G. W. Hill y
Simon L. Harding Simon Harding y
Antonio Fernández Antonio Fernández Anta y
Argyrios Zymnis Argyris Zymnis y
N. R. Achuthan Nirmala Achutyan y
Fabrice Muamba Fabrice Muamba n
Ursula Vaughan Williams Vaughan Williams y
Henry Earle Vaughan Henry Earle y
Bernard Lens III Bernard Lens y
Muthukulam Raghavan Pillai Raghavan y
James Fisher Robinson James Fisher y
Jimmy Needles Needle y
W. E. B. Du Bois Web y
Sylvester Perry Ryan Perry Ryan y
James Beaty, Jr. Beaty y
George Manning McDade George Manning y
Alejandro Zaffaroni Zaffaroni n
Ellie Goulding Ellie Goulding n

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.