
name-dataset's Introduction

First and Last Names Database


To download the raw CSV data for your analysis, browse here.

This Python library provides information about names:

  • Popularity (rank)
  • Country (105 countries are supported)
  • Gender

It can answer questions such as:

  • Who is Zoe? Likely a Female, United Kingdom.
  • Know Philippe? Likely a Male, France. And with the spelling Philipp? Male, Germany.
  • How about Nikki? Likely a Female, United States.

Composition

730K first names and 983K last names, extracted from the massive Facebook dump (533M users).

Installation

Available on PyPI:

pip install names-dataset

Usage

NOTE: The library requires around 3.2GB of RAM to load the full dataset into memory. Make sure you have enough RAM to avoid a MemoryError.

Once it's installed, run these commands to familiarize yourself with the library:

from names_dataset import NameDataset, NameWrapper

# The library takes time to initialize because the database is massive. A tip is to include its initialization in your app's startup process.
nd = NameDataset()

print(NameWrapper(nd.search('Philippe')).describe)
# Male, France

print(NameWrapper(nd.search('Zoe')).describe)
# Female, United Kingdom

print(nd.search('Walter'))
# {'first_name': {'country': {'Argentina': 0.062, 'Austria': 0.037, 'Bolivia, Plurinational State of': 0.042, 'Colombia': 0.096, 'Germany': 0.044, 'Italy': 0.295, 'Peru': 0.185, 'United States': 0.159, 'Uruguay': 0.036, 'South Africa': 0.043}, 'gender': {'Female': 0.007, 'Male': 0.993}, 'rank': {'Argentina': 37, 'Austria': 34, 'Bolivia, Plurinational State of': 67, 'Colombia': 250, 'Germany': 214, 'Italy': 193, 'Peru': 27, 'United States': 317, 'Uruguay': 44, 'South Africa': 388}}, 'last_name': {'country': {'Austria': 0.036, 'Brazil': 0.039, 'Switzerland': 0.032, 'Germany': 0.299, 'France': 0.121, 'United Kingdom': 0.048, 'Italy': 0.09, 'Nigeria': 0.078, 'United States': 0.172, 'South Africa': 0.085}, 'gender': {}, 'rank': {'Austria': 106, 'Brazil': 805, 'Switzerland': 140, 'Germany': 39, 'France': 625, 'United Kingdom': 1823, 'Italy': 3564, 'Nigeria': 926, 'United States': 1210, 'South Africa': 1169}}}

print(nd.search('White'))
# {'first_name': {'country': {'United Arab Emirates': 0.044, 'Egypt': 0.294, 'France': 0.061, 'Hong Kong': 0.05, 'Iraq': 0.094, 'Italy': 0.117, 'Malaysia': 0.133, 'Saudi Arabia': 0.089, 'Taiwan, Province of China': 0.044, 'United States': 0.072}, 'gender': {'Female': 0.519, 'Male': 0.481}, 'rank': {'Taiwan, Province of China': 6940, 'United Arab Emirates': None, 'Egypt': None, 'France': None, 'Hong Kong': None, 'Iraq': None, 'Italy': None, 'Malaysia': None, 'Saudi Arabia': None, 'United States': None}}, 'last_name': {'country': {'Canada': 0.035, 'France': 0.016, 'United Kingdom': 0.296, 'Ireland': 0.028, 'Iraq': 0.016, 'Italy': 0.02, 'Jamaica': 0.017, 'Nigeria': 0.031, 'United States': 0.5, 'South Africa': 0.04}, 'gender': {}, 'rank': {'Canada': 46, 'France': 1041, 'United Kingdom': 18, 'Ireland': 66, 'Iraq': 1307, 'Italy': 2778, 'Jamaica': 35, 'Nigeria': 425, 'United States': 47, 'South Africa': 416}}}

print(nd.search('محمد'))
# {'first_name': {'country': {'Algeria': 0.018, 'Egypt': 0.441, 'Iraq': 0.12, 'Jordan': 0.027, 'Libya': 0.035, 'Saudi Arabia': 0.154, 'Sudan': 0.07, 'Syrian Arab Republic': 0.062, 'Turkey': 0.022, 'Yemen': 0.051}, 'gender': {'Female': 0.035, 'Male': 0.965}, 'rank': {'Algeria': 4, 'Egypt': 1, 'Iraq': 2, 'Jordan': 1, 'Libya': 1, 'Saudi Arabia': 1, 'Sudan': 1, 'Syrian Arab Republic': 1, 'Turkey': 18, 'Yemen': 1}}, 'last_name': {'country': {'Egypt': 0.453, 'Iraq': 0.096, 'Jordan': 0.015, 'Libya': 0.043, 'Palestine, State of': 0.016, 'Saudi Arabia': 0.118, 'Sudan': 0.146, 'Syrian Arab Republic': 0.058, 'Turkey': 0.017, 'Yemen': 0.037}, 'gender': {}, 'rank': {'Egypt': 2, 'Iraq': 3, 'Jordan': 1, 'Libya': 1, 'Palestine, State of': 1, 'Saudi Arabia': 3, 'Sudan': 1, 'Syrian Arab Republic': 2, 'Turkey': 44, 'Yemen': 1}}}

print(nd.get_top_names(n=10, gender='Male', country_alpha2='US'))
# {'US': {'M': ['Jose', 'David', 'Michael', 'John', 'Juan', 'Carlos', 'Luis', 'Chris', 'Alex', 'Daniel']}}

print(nd.get_top_names(n=5, country_alpha2='ES'))
# {'ES': {'M': ['Jose', 'Antonio', 'Juan', 'Manuel', 'David'], 'F': ['Maria', 'Ana', 'Carmen', 'Laura', 'Isabel']}}

print(nd.get_country_codes(alpha_2=True))
# ['AE', 'AF', 'AL', 'AO', 'AR', 'AT', 'AZ', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BN', 'BO', 'BR', 'BW', 'CA', 'CH', 'CL', 'CM', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'ET', 'FI', 'FJ', 'FR', 'GB', 'GE', 'GH', 'GR', 'GT', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JM', 'JO', 'JP', 'KH', 'KR', 'KW', 'KZ', 'LB', 'LT', 'LU', 'LY', 'MA', 'MD', 'MO', 'MT', 'MU', 'MV', 'MX', 'MY', 'NA', 'NG', 'NL', 'NO', 'OM', 'PA', 'PE', 'PH', 'PL', 'PR', 'PS', 'PT', 'QA', 'RS', 'RU', 'SA', 'SD', 'SE', 'SG', 'SI', 'SV', 'SY', 'TM', 'TN', 'TR', 'TW', 'US', 'UY', 'YE', 'ZA']

nd.first_names
# Dictionary of all the first names with their attributes.

nd.last_names
# Dictionary of all the last names with their attributes.

API

The search call provides information about:

  • country: The probability of the name belonging to a country. Only the top 10 countries matching the name are returned.

  • gender: The probability of the person being Male or Female.

  • rank: The rank of the name in its country. 1 means the most popular name.

  • NOTE: results are split into first_name and last_name; the gender field does not apply to last_name. A short sketch of interpreting a search result follows this list.
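
For example, here is a minimal sketch of reducing a search result to its most likely country and gender, essentially what NameWrapper(...).describe summarizes (the helper function is ours, not part of the library; the field names follow the outputs shown in the Usage section):

from names_dataset import NameDataset

nd = NameDataset()  # slow: loads the full database into memory; reuse one instance in practice

def most_likely(name):
    # Pick the single most probable country and gender from nd.search(...).
    result = nd.search(name)
    first = result.get('first_name') or {}
    countries = first.get('country') or {}
    genders = first.get('gender') or {}
    top_country = max(countries, key=countries.get) if countries else None
    top_gender = max(genders, key=genders.get) if genders else None
    return top_country, top_gender

print(most_likely('Walter'))
# ('Italy', 'Male') -- given the Walter output shown above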

The get_top_names call gives the most popular names:

  • n: The number of names to return matching some criteria. Default is 100.
  • gender: Filters on Male or Female. Default is None (both are returned).
  • use_first_names: If True, first names are returned; if False, last names are returned. Default is True.
  • country_alpha2: Filters on the country (e.g. GB is the United Kingdom). Default is None (all countries are returned). A short usage sketch follows this list.
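
A hedged usage sketch combining these options (the exact output is not reproduced in this README, so treat the calls as illustrative):

from names_dataset import NameDataset

nd = NameDataset()

# Top 3 last names in the United Kingdom (use_first_names=False switches to last names).
print(nd.get_top_names(n=3, use_first_names=False, country_alpha2='GB'))

# Top 10 female first names across all supported countries.
print(nd.get_top_names(n=10, gender='Female'))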

The get_country_codes call returns the supported country codes (or full pycountry objects).

  • alpha_2: If True, only the 2-character country codes are returned. Default is False. (A small lookup sketch follows.)
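
Since the codes are ISO 3166-1 alpha-2, they can be mapped back to country names, for example with the pycountry package (assumed to be installed; it is the source of the "full objects" mentioned above):

import pycountry
from names_dataset import NameDataset

nd = NameDataset()

for code in nd.get_country_codes(alpha_2=True):
    country = pycountry.countries.get(alpha_2=code)
    print(code, '->', country.name if country else 'unknown')
# e.g. GB -> United Kingdom, US -> United States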

Full dataset

The dataset is available here: name_dataset.zip (3.3GB).


  • The data contains 491,655,925 records from 106 countries.
  • The uncompressed version takes around 10GB on disk.
  • Each country is in a separate CSV file.
  • A CSV file contains rows in this format: first_name,last_name,gender,country_code (see the parsing sketch after this list).
  • Each record is a real person.
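
A minimal parsing sketch for one extracted per-country file (the file name US.csv is a placeholder for whichever CSV you pull out of name_dataset.zip):

import csv
from collections import Counter

first_name_counts = Counter()

with open('US.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        if len(row) != 4:
            continue  # skip malformed rows
        first_name, last_name, gender, country_code = row
        first_name_counts[first_name] += 1

print(first_name_counts.most_common(10))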

License

  • This version was generated from the massive Facebook Leak (533M accounts).
  • Lists of names are not copyrightable, generally speaking, but if you want to be completely sure you should talk to a lawyer.

Countries

Afghanistan, Albania, Algeria, Angola, Argentina, Austria, Azerbaijan, Bahrain, Bangladesh, Belgium, Bolivia, Plurinational State of, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czechia, Denmark, Djibouti, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Fiji, Finland, France, Georgia, Germany, Ghana, Greece, Guatemala, Haiti, Honduras, Hong Kong, Hungary, Iceland, India, Indonesia, Iran, Islamic Republic of, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Korea, Republic of, Kuwait, Lebanon, Libya, Lithuania, Luxembourg, Macao, Malaysia, Maldives, Malta, Mauritius, Mexico, Moldova, Republic of, Morocco, Namibia, Netherlands, Nigeria, Norway, Oman, Palestine, State of, Panama, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Russian Federation, Saudi Arabia, Serbia, Singapore, Slovenia, South Africa, Spain, Sudan, Sweden, Switzerland, Syrian Arab Republic, Taiwan, Province of China, Tunisia, Turkey, Turkmenistan, United Arab Emirates, United Kingdom, United States, Uruguay, Yemen.

🇲🇹🇪🇬🇧🇴🇳🇦🇹🇳🇷🇸🇯🇲🇦🇷🇯🇵🇰🇿🇸🇦🇺🇸🇦🇪🇭🇺🇭🇰🇶🇦🇸🇬🇩🇪🇾🇪🇲🇾🇭🇹🇵🇷🇨🇳🇦🇴🇹🇼🇸🇩🇧🇭🇧🇪🇪🇹🇪🇪🇨🇴🇬🇷🇧🇷🇷🇺🇱🇾🇸🇻🇰🇼🇰🇷🇦🇱🇸🇾🇧🇫🇨🇿🇨🇦🇴🇲🇩🇰🇨🇱🇧🇩🇧🇼🇫🇯🇮🇶🇮🇪🇿🇦🇨🇷🇯🇴🇰🇭🇵🇪🇺🇾🇮🇷🇲🇩🇫🇷🇲🇴🇳🇱🇬🇭🇨🇾🇩🇿🇮🇹🇬🇧🇧🇮🇮🇳🇫🇮🇦🇫🇵🇭🇦🇿🇬🇪🇨🇲🇮🇱🇪🇸🇱🇹🇩🇯🇬🇹🇱🇺🇵🇸🇹🇷🇵🇱🇮🇸🇳🇬🇵🇦🇭🇷🇸🇮🇭🇳🇦🇹🇲🇺🇸🇪🇲🇦🇨🇭🇧🇳🇲🇻🇳🇴🇪🇨🇮🇩🇧🇬🇵🇹🇲🇽🇱🇧🇹🇲

NOTE: It is unfortunately not possible to support more countries because the missing ones were not included in the original dataset.

Citation

@misc{NameDataset2021,
  author = {Philippe Remy},
  title = {Name Dataset},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/philipperemy/name-dataset}},
}

name-dataset's People

Contributors

almenon, philipperemy, shuw, uriva


name-dataset's Issues

More likely a first name or a last name

I want to develop a prediction model for that.
I don't know if this repository already has such a model or is willing to add one. Please let me know before I dive in.
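
The library does not expose such a model, but a rough heuristic can be built on top of search by comparing the first-name and last-name statistics it returns. A hedged sketch (the scoring rule is an arbitrary choice, not something the library provides):

from names_dataset import NameDataset

nd = NameDataset()

def more_likely_first_or_last(name):
    # Whichever side has the higher maximum country probability wins.
    result = nd.search(name)
    def best(side):
        countries = (result.get(side) or {}).get('country') or {}
        return max(countries.values(), default=0.0)
    return 'first_name' if best('first_name') >= best('last_name') else 'last_name'

print(more_likely_first_or_last('Walter'))  # heuristic only; Walter exists on both sides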

Cleaning names from junk

The dataset should be cleaned of these names. I also suggest removing cs.cmu.edu.ai-repository and datasets_imdb, because those are completely invalid and generated from junk data.

$ python names_dataset/query.py in,can,isn't,iris,doesn,mustn,being,is,if,above,ve,honey,you'll,she's,a,board,now,so,then,hasn't,art,mightn't,y,media,future,about,few,into,rice,meet,brazil,be,with,you're,are,after,until,regional,stock,my,asia,those,here,doesn't,against,hr,any,hadn,isn,these,than,bottom,up,see,list,out,for,nor,needn,past,did,we,yourself,will,during,shouldn't,o,the,them,am,d,hers,content,pet,add,on,its,wasn't,yourselves,their,what,has,won't,should,not,whom,quick,before,his,this,read,some,shouldn,and,mustn't,myself,were,yours,you'd,himself,further,your,couldn,shan't,to,more,again,login,have,should've,do,haven't,sales,from,each,it,down,over,hadn't,all,no,which,couldn't,large,or,she,just,didn't,does,needn't,wouldn,erasmus,i,it's,chat,was,don't,aren't,job,ain,haven,very,because,when,through,you,while,ourselves,own,block,that,same,both,peru,under,where,most,loan,s,didn,web,been,getty,m,wasn,had,by,ll,her,too,skill,hasn,having,between,ours,short,of,dev,an,re,shan,why,itself,access,theirs,branch,alpine,weren,they,at,t,aren,off,long,you've,mightn,that'll,range,ma,herself,who,as,won,him,but,end,service,don,below,icon,other,only,there,farmer,our,wouldn't,themselves,tab,he,such,edit,doing,how,hover,weren't,me,once
----- First names ----
Name Present?
in True
can True
isnt False
iris True
doesn False
mustn False
being False
is True
if False
above False
ve True
honey True
youll False
shes False
a True
board True
now True
so True
then True
hasnt False
art True
mightnt False
y True
media True
future True
about True
few True
into True
rice True
meet True
brazil True
be True
with True
youre False
are True
after True
until False
regional True
stock True
my True
asia True
those False
here True
doesnt False
against False
hr True
any True
hadn False
isn False
these True
than True
bottom True
up True
see True
list True
out True
for True
nor True
needn False
past False
did True
we True
yourself False
will True
during False
shouldnt False
o True
the True
them True
am True
d True
hers False
content True
pet True
add True
on True
its False
wasnt False
yourselves False
their False
what True
has True
wont False
should False
not True
whom False
quick True
before False
his True
this True
read True
some True
shouldn False
and True
mustnt False
myself False
were False
yours False
youd False
himself False
further False
your True
couldn False
shant True
to True
more True
again False
login True
have False
shouldve False
do True
havent False
sales True
from True
each False
it True
down True
over True
hadnt False
all True
no True
which False
couldnt False
large True
or True
she True
just True
didnt False
does False
neednt False
wouldn False
erasmus True
i True
its False
chat True
was False
dont False
arent True
job True
ain True
haven True
very True
because False
when True
through False
you True
while False
ourselves False
own False
block True
that True
same True
both True
peru True
under True
where False
most True
loan True
s True
didn False
web True
been True
getty True
m True
wasn False
had True
by True
ll False
her True
too True
skill False
hasn False
having False
between False
ours True
short True
of True
dev True
an True
re True
shan True
why True
itself False
access True
theirs False
branch True
alpine True
weren False
they False
at True
t True
aren True
off True
long True
youve False
mightn False
thatll False
range True
ma True
herself False
who True
as True
won True
him True
but False
end True
service True
don True
below True
icon True
other True
only True
there True
farmer True
our True
wouldnt False
themselves False
tab True
he True
such True
edit True
doing False
how True
hover True
werent False
me True
once False
----- Last names ----
Name Present?
in True
can True
isnt False
iris True
doesn False
mustn False
being True
is True
if False
above False
ve True
honey True
youll True
shes False
a True
board True
now True
so True
then True
hasnt False
art True
mightnt False
y True
media True
future True
about True
few True
into True
rice True
meet True
brazil True
be True
with True
youre False
are True
after True
until False
regional True
stock True
my True
asia True
those False
here True
doesnt False
against True
hr True
any False
hadn False
isn False
these False
than True
bottom True
up True
see True
list True
out True
for True
nor True
needn False
past True
did False
we True
yourself False
will True
during True
shouldnt False
o True
the True
them True
am True
d True
hers True
content True
pet True
add False
on True
its False
wasnt False
yourselves False
their False
what True
has True
wont False
should False
not True
whom False
quick True
before False
his True
this True
read True
some True
shouldn False
and True
mustnt False
myself False
were True
yours False
youd True
himself True
further False
your True
couldn False
shant True
to True
more True
again False
login False
have True
shouldve False
do True
havent False
sales True
from True
each False
it True
down True
over True
hadnt False
all True
no True
which False
couldnt False
large True
or True
she True
just True
didnt False
does True
neednt False
wouldn False
erasmus True
i True
its False
chat True
was True
dont False
arent True
job True
ain True
haven True
very True
because False
when False
through False
you True
while True
ourselves False
own True
block True
that True
same True
both True
peru True
under True
where False
most True
loan True
s True
didn False
web True
been True
getty True
m True
wasn False
had False
by True
ll True
her True
too True
skill True
hasn False
having False
between False
ours True
short True
of True
dev True
an True
re True
shan True
why True
itself False
access False
theirs False
branch True
alpine True
weren False
they True
at True
t True
aren True
off True
long True
youve False
mightn False
thatll False
range True
ma True
herself False
who True
as True
won True
him True
but True
end True
service True
don True
below True
icon True
other True
only True
there False
farmer True
our True
wouldnt False
themselves False
tab True
he True
such True
edit True
doing True
how True
hover True
werent False
me True
once False

Question regarding dataset

Hello Philippe,

I am working on a GUI mock-data generation project that (as the name states) generates fake data such as first names, last names, countries, etc.

I was looking for a more realistic way to generate names tied to their corresponding countries and came across your repository. I've tried tinkering with the API, but the execution time is too long for mass data generation.

My question is whether there is a way to look up numerous names in a single API call. If not, I am considering using the original dataset to build my own algorithm without needing API calls. However, I wanted to check whether the 3.3GB file has duplicate rows, and what kind of duplicate data there is (since I currently cannot download the dataset on my machine).

The point is, if there is a significant amount of duplicate data, I might attempt to manually shrink the rows by removing as many duplicates as I can, so that the algorithm can run locally and be much faster than waiting for API calls to return.

Regards.
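
On the "numerous names in a single call" part: there is no documented batch API, but the expensive step is the one-time NameDataset() initialization, so a single instance can be reused for any number of lookups. A sketch, plus a naive de-duplication pass over an extracted CSV (file names are placeholders):

from names_dataset import NameDataset

nd = NameDataset()  # pay the initialization cost once, at startup

names_to_check = ['Zoe', 'Philippe', 'Nikki', 'Walter']  # your own list here
results = {name: nd.search(name) for name in names_to_check}

# Naive de-duplication of one country CSV (keeps every line in memory; fine as a sketch).
seen = set()
with open('US.csv', encoding='utf-8') as src, open('US_unique.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        if line not in seen:
            seen.add(line)
            dst.write(line)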

Add more non-ascii names

Are there no data sources for non-ASCII names, or are you using a convention to convert names to ASCII format? I see a few at the end of the last-names list, but only a few.

Country Identification

Hi,

Great work, thank you!
Is there a way to get the most probable country for a given name? Or to get a subset of names from a certain country?
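
Both are possible with the documented API: search returns a per-country probability dict, and get_top_names accepts a country filter. A short sketch:

from names_dataset import NameDataset

nd = NameDataset()

# Most probable country for a given first name.
countries = (nd.search('Philippe').get('first_name') or {}).get('country') or {}
print(max(countries, key=countries.get))  # France, per the README example

# A subset of names from a certain country: top 50 first names in France.
print(nd.get_top_names(n=50, country_alpha2='FR'))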

Will there be integrations for Romanian names?

hello,

I'm interested in using your library for Romanian names. I see from this output:
print(nd.get_country_codes(alpha_2=True))

['AE', 'AF', 'AL', 'AO', 'AR', 'AT', 'AZ', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BN', 'BO', 'BR', 'BW', 'CA', 'CH', 'CL', 'CM', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'ET', 'FI', 'FJ', 'FR', 'GB', 'GE', 'GH', 'GR', 'GT', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JM', 'JO', 'JP', 'KH', 'KR', 'KW', 'KZ', 'LB', 'LT', 'LU', 'LY', 'MA', 'MD', 'MO', 'MT', 'MU', 'MV', 'MX', 'MY', 'NA', 'NG', 'NL', 'NO', 'OM', 'PA', 'PE', 'PH', 'PL', 'PR', 'PS', 'PT', 'QA', 'RS', 'RU', 'SA', 'SD', 'SE', 'SG', 'SI', 'SV', 'SY', 'TM', 'TN', 'TR', 'TW', 'US', 'UY', 'YE', 'ZA']


that there is no RO. Is support for Romania on your roadmap?
Thanks!

Common names missing

Hi, I found a number of common last names missing from the database when compared with the top 5,000 names in the US census data; some examples are below. (A checking sketch follows the list.)
NAIL
SAMPLE
SHEETS
COPE
FLOOD
THOMPSON
BONE
BRIDGES
FUNK
BLANK
PACKARD
STOCK
BACON
YOUNGER
BEDFORD
RAND
GOLDEN
PERSON
BULL
SESSIONS
HA
WOOD
HELD
CARD
BEACH
MYERS
DAS
GUY
BONDS
HOOK
WEEKS
SPAIN
EASTER
LE
LEMON
KO
YE
SPEARS
PACE
SHELL
CAVE
SINGH
WARNER
CROSS
STERLING
POOL
COLON
STANFORD
LO
DALLAS
BERLIN
LONDON
ROBINSON
PAGE
BROTHERS
EAST
WINDSOR
ARMSTRONG
URBAN
SHORE
SPRINGER
CARROLL
RING
PAN
ROOT
HOLMES
BILLS
NEW
STORM
PARISH
TOWNS
HOOD
BOOTH
ROLLER
LEVY
KRUGER
YORK
BLOOM
SON
PEAK
NOBLE
PEACE
MORRISON
HU
HUNT
PALMER
WATERS
REGISTER
WAY
HOLLAND
FARMER
CHI
WISE
FINE
SWEET
JOHNSTON
BASS
BEAR
ELLIS
FROST
WEED
WARE
STORY
FORD
DELL
LAW
WALLS
MADRID
POLAND
BURNS
SU
SIMS
ROGERS
BOWLING
HART
SEE
FERRARI
THAI
GUEST
BURTON
SHAW
SUN
MONROE
OH
STEWART
MAIN
FLOWERS
HULL
BERRY
FOUNTAIN
CHAMPAGNE
WASHINGTON
NICKEL
SELLS
AN
POWELL
STEEL
MINOR
GRIFFIN
SEAL
BARE
NG
CASEY
DOLL
HER
WATT
FOSTER
WHEAT
KNIGHT
IVORY
NORTH
CLAY
LANG
RUSH
RIDER
SAGE
CRAWFORD
HUNTINGTON
WU
PENN
DEUTSCH
MARCH
BRANCH
DAYTON
BALL
WATTS
DURHAM
FISHER
MILLER
CANADA
BELT
DOW
OAKS
HE
MAY
BIRD
SPRING
PEOPLES
CASH
BURDEN
ROYAL
GROVE
HUGHES
WEST
MAJOR
EDWARDS
RICHMOND
CHURCH
GAY
DALE
PRIEST
BUTLER
SHADE
MA
ENGLAND
HAYES
PRIDE
BROWN
POTTER
GATES
FORT
IRISH
SOUTH
RICH
GIBSON
HALL
DIAL
SAVAGE
BEAM
CANNON
PARIS
BOSS
WHITE
GREEN
SELLERS
POINTER
HILLS
HAY
BAILEY
POPE
GREY
HO
FORTUNE
MESA
RIVERS
HILL
SHARP
PALM
PRICE
PETERSON
DIAMOND
BACK
GRANDE
BOND
WELLS
COOK
KEY
CHO
CLOUD
ISRAEL
RICHARDS
HAND
PEPPER
FOX
PARK
BUTTON
CAMPBELL
RENO
WATSON
WOODEN
HIGH
BLUE
CHRIST
GRAY
BATTLE
HAWK
QUEEN
BAKER
MCDONALD
SWIFT
BEAVER
BUNCH
DAY
PASTOR
SELF
POWER
MOSS
SO
SOUTHERN
ANDREWS
HARDER
SULLIVAN
HAMILTON
HEAD
FU
JONES
LAND
HURT
HOUSTON
CASTLE
KINGSTON
HORN
FRAME
WORTHY
FALLS
IRELAND
GREENE
VILLA
HANDY
WING
MILES
PIERCE
LANCASTER
QUICK
PORTER
BOLT
SALES
GLASS
MARKS
STONE
PRINCE
SALEM
SMART
AU
PIKE
JOHNS
BUCK
BARNES
WEBSTER
SELL
TA
BRAND
ENGLISH
HAIR
GROSS
STRONG
ANGELES
EARLY
CHAMBERS
LOW
GOOD
CORNELL
KONG
HOPE
DEAL
MEANS
BENNETT
WELSH
STACK
BURKE
NEWMAN
CLEVELAND
STAMPS
SNOW
WINTER
GOLD
WARD
YOUNG
BROOKS
SMALL
HUDSON
BELL
CRAFT
HOUSE
STRAIN
RIDGE
TRINIDAD
READ
MARSH
BOSTON
CABLE
BORDERS
LIGHT
CRUZ
CAMP
NEWTON
JUSTICE
MOORE
ROBERTSON
SINGER
HAMMER
HENDERSON
MOON
TRUE
LI
FRANCE
NORTON
FRASER
WILL
PHILLIPS
TO
EDEN
TEMPLE
WALL
FRANCISCO
COSTA
LOVING
PACK
WOODS
GAGE
LUTHER
HEARD
GUESS
PARKS
DAVIDSON
STILL
LONG
GORE
LUNG
PLEASANT
DODGE
NATION
FAN
DO
SPEED
RICHARDSON
CALL
HOPKINS
LOVE
COX
WILD
PLACE
LU
KEEN
POWERS
SIERRA
POST
BUTTS
GLASGOW
KITCHEN
HANSEN
SHORTER
LAMB
CHANCE
MEYER
ADAMS
BOX
ELDER
LINK
MONACO
HOLDER
HAMPTON
DRIVER
BLACK
FIELD
DU
EAGLE
SALMON
GARCIA
TU
DOVER
SIDES
FISH
WOLF
ENG
BRISTOL
HACKER
FERRY
SAMPLES
WALKER
RICO
BISHOP
SONG
LITTLE
WORTH
COLEMAN
HAM
SCALES
LAWS
ROCK
SHORT
BANKS
ISLAM
FORBES
RICE
COUNTS
CLOSE
CARRIER
CHAMPION
BEST
SEO
WILEY
FRENCH
BEAN
SHEPHERD
POND
PARENT
MOUNT
FREE
LEONE
WAGNER
LAKE
FRIEND
BLOCK
SHEFFIELD
CHASE
BUSH
MASTERS
TROUT
JUDGE
SETTLE
STRANGE
KEYS
FIELDS
DAILY
ROBERTS
EDGE
STREET
MERCHANT
FAIR
LAY
STEVENS
STRAND
MILLS
CASE
COTTON
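
A hedged sketch of how such a comparison can be reproduced (the surname list is a short excerpt from the issue, not the full census file; depending on the library version a missing name may come back as None or as an empty dict):

from names_dataset import NameDataset

nd = NameDataset()

census_surnames = ['Nail', 'Sample', 'Sheets', 'Cope', 'Flood', 'Thompson']  # excerpt only

missing = []
for surname in census_surnames:
    last = nd.search(surname).get('last_name') or {}
    if not last.get('country'):
        missing.append(surname)

print(missing)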

Suggestion: use pickle instead of json for better performance in initialization

Hi, I saw your project and it is very useful. However, I had a performance issue when trying to use it: loading the huge data from JSON is not efficient and takes a lot of time. So I loaded your JSON files into dicts and dumped them into pickle files (using the pickle library), and I got around 40% better performance when loading data from pickle files instead of JSON.

I put the pickle files inside the package (I didn't zip them, but they should work fine when zipped as well) and changed your method as shown below. I also changed the paths you declared for these files.

# requires `import pickle` at the top of the module
@staticmethod
def _read_json_from_zip(zip_file):
    # Despite the name, this now loads a pickled dict instead of parsing JSON.
    with open(zip_file, 'rb') as f:
        return pickle.loads(f.read())
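
For completeness, a sketch of the conversion step described above (the file names are placeholders; the library's actual resource layout may differ):

import json
import pickle

# Convert one of the library's JSON resources to a pickle file for faster loading.
with open('first_names.json', encoding='utf-8') as f:   # placeholder path
    data = json.load(f)

with open('first_names.pickle', 'wb') as f:             # placeholder path
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)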

How to calculate the probability of a given word being the name of a person?

First of all, thank you for the awesome package!

For my use case, I have a large piece of text and I want to find out which words in the text are person names. What would be the best way to use this package to find out whether a given word is likely to be a person's name or not?

Is it better to use v2 for this?
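
One hedged approach, built only on the documented search output (the rank threshold and the scoring rule below are arbitrary choices, not part of the library):

from names_dataset import NameDataset

nd = NameDataset()

def looks_like_person_name(word, max_rank=2000):
    # Treat the word as a likely name if it ranks reasonably high as a
    # first or last name in at least one country.
    result = nd.search(word.title())
    for side in ('first_name', 'last_name'):
        ranks = (result.get(side) or {}).get('rank') or {}
        if any(r is not None and r <= max_rank for r in ranks.values()):
            return True
    return False

print(looks_like_person_name('Philippe'))  # True
print(looks_like_person_name('table'))     # likely False, though junk entries do exist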

Keep uppercase letters optionally

I know it is common in NLP to lowercase every piece of text, but in this case it also removes crucial information. If uppercase is kept, Rose and rose can be distinguished (most of the time), and if you know that your corpus has good spelling and grammar (novels, in contrast to tweets, for example), keeping uppercase looks appealing to me.

What do you think?

License

What's the license on this code & data?

get_top_names() returns n * 2 data points

generator.get_top_names(n=limit, use_first_names=use_first_names, country_alpha2=country_code) returns 1000 rows, where generator is an instance of NameDataset(), use_first_names is set to True, and country_code is a country code depending on the parameter passed to a function in my script. Any idea why this is occurring? What exactly is the criterion behind selecting the threshold n?
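
This is most likely because gender defaults to None, in which case the top n male and the top n female names are both returned (see the ES example in the Usage section), i.e. 2 * n rows in total. A sketch, with the caveat that the behaviour is inferred from the API description rather than verified against every version:

from names_dataset import NameDataset

nd = NameDataset()

# gender=None (the default): n names per gender, hence 2 * n rows overall.
both = nd.get_top_names(n=500, country_alpha2='US')

# Filtering on one gender should return exactly n names.
males_only = nd.get_top_names(n=500, gender='Male', country_alpha2='US')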

How are scores calculated? Also, last names with higher scores are ignored in the presence of first names.

Hi, thanks for the awesome dataset. I actually have 3 questions:

  1. How are the scores calculated? I read the older issue regarding this but it wasn't really answered there. If possible, could you disclose the formula that you used to calculate the score of a name? Is the score normalized? If a name has 0.35 for country A and 0.35 for country B, what does that indicate? If a name has 0.35 score as a first name for country A and 0.35 score as a surname for the same country A, what does that mean?

  2. is search_first_name method no longer available?

  3. It seems last names are ignored when deciding the country association. For the name "anderson", the result indicates it's a male first name from Brazil, even though "anderson" has a higher maximum last-name score (0.59 for the USA) than its maximum first-name score (0.42 for Brazil).

Precision recall definition

For example, the word Rose can be either a name or a noun. If we include it in the list, then we increase the precision but we decrease the recall.

If you include such a general name, you actually increase recall and decrease precision.
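
For reference, with true positives (TP), false positives (FP) and false negatives (FN):

precision = TP / (TP + FP)
recall    = TP / (TP + FN)

Adding a common word such as Rose to the name list catches more genuine name mentions (FN goes down, so recall goes up) but also flags the noun uses (FP goes up, so precision tends to go down), which matches the correction above.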

Btw, great dataset and thanks for sharing!

How do I classify names?

I understand this project is primarily aimed at identifying names in a blob of text.
I am looking to go a step further: I want to identify whether a name is female or male,
and whether a name is Indian or German, etc.
How do I do that?
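
Both signals are directly available from the documented search output; a short sketch (the country distribution is the closest thing the dataset offers to an origin signal, and 'Priya' is just an illustrative input):

from names_dataset import NameDataset, NameWrapper

nd = NameDataset()

# Quick summary string: most likely gender and country.
print(NameWrapper(nd.search('Philippe')).describe)  # 'Male, France'

# Or inspect the full distributions yourself.
info = nd.search('Priya').get('first_name') or {}
genders = info.get('gender') or {}
countries = info.get('country') or {}
print(max(genders, key=genders.get, default=None))      # most likely gender
print(max(countries, key=countries.get, default=None))  # most likely country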

Exception: AttributeError: module 'names_dataset' has no attribute 'NameDataset' while importing in Azure cloud

Exception while executing function: Functions.V1BlobTrigger Result: Failure
Exception: AttributeError: module 'names_dataset' has no attribute 'NameDataset'
Stack:
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/dispatcher.py", line 505, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/dispatcher.py", line 778, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/V1BlobTrigger/__init__.py", line 16, in main
    merged_df, manual_review = utils.process_data(df)
  File "/home/site/wwwroot/utils/__init__.py", line 6, in process_data
    df = filter_individuals(df)
  File "/home/site/wwwroot/utils/pipeline.py", line 53, in filter_individuals
    nd = names_dataset.NameDataset()
I am unable to import NameDataset from the names_dataset package, although import names_dataset works fine. The code works in my local environment; I only get this issue in Azure cloud. What could be the reason for this?

How is the score in v2 calculated?

The readme describes a score between 0 and 100 that can be used as a threshold to control precision and recall. Can you provide more information on how this score is calculated, please?

Great dataset, by the way. Thank you.

Original data source?

Hi! I'm trying to create a nickname database, similar to https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup, but I'd like it to be based on actual real-world data. I think the Facebook data dump could be useful for this, if I could link multiple accounts to having two different names, e.g. FB ID 12345 with a name of "Stephen" in one place and "Steve" in another. I'm not sure if the data is actually formatted in a way that would be useful. Could you explain where you got the data dump?

Issue when pip install names-dataset due to cp950 codec issue

pip install names-dataset

Collecting names-dataset
Downloading names-dataset-3.0.2.tar.gz (58.4 MB)
---------------------------------------- 58.4/58.4 MB 16.4 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\huibr\AppData\Local\Temp\pip-install-fup_wh40\names-dataset_fa96a0495e674f669c7a84e044a5287c\setup.py", line 21, in
long_description=open('README.md').read(),
UnicodeDecodeError: 'cp950' codec can't decode byte 0xd9 in position 3041: illegal multibyte sequence
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
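
This is the classic Windows default-encoding problem: setup.py reads README.md without an explicit encoding, so the locale's cp950 codec fails on non-ASCII bytes. A hedged sketch of the usual package-side fix (not the actual upstream code):

# setup.py (relevant line only): read the README as UTF-8 regardless of the OS locale.
long_description = open('README.md', encoding='utf-8').read()

On the user side, enabling Python's UTF-8 mode (for example by setting the PYTHONUTF8=1 environment variable before running pip) often works around the same error.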
