genderizer's Introduction

genderizeR

by Kamil Wais

R package for gender predictions based on first names.

The package home page: https://kalimu.github.io/project/genderizer/

Information about the genderize.io project and documentation of the API: http://genderize.io

Description

The genderizeR package uses the genderize.io API to predict gender from first names extracted from a text corpus (not only from clean vectors of given names). The accuracy of prediction can be controlled by two parameters: the count of a first name in the database and the probability of gender given the first name. The package also has built-in functions that calculate specific errors (optionally with bootstrapping), train the algorithm on a training dataset (with gender labels), and prepare character vectors for gender checking.

Installing the package

Installing stable version from CRAN

install.packages('genderizeR')

Installing developer version from GitHub

Remember to install devtools package first!

# install.packages('devtools')
devtools::install_github("kalimu/genderizeR")

Loading the installed package

library(genderizeR)
#> 
#> Welcome to genderizeR package version: 2.0.0.9003
#> 
#> Homepage: http://www.wais.kamil.rzeszow.pl/genderizeR
#> 
#> Changelog: news(package = 'genderizeR')
#> Help & Contact: help(genderizeR)
#> 
#> If you find this package useful cite it please. Thank you!
#> See: citation('genderizeR')
#> 
#> To suppress this message use:
#> suppressPackageStartupMessages(library(genderizeR))

A working example

# An example for a character vector of strings
x = c("Winston J. Durant, ASHP past president, dies at 84",
"JAN BASZKIEWICZ (3 JANUARY 1930 - 27 JANUARY 2011) IN MEMORIAM",
"Maria Sklodowska-Curie")
 
# Search for terms that could be first names
# If you have an API key, you can authorize access to the API with the apikey argument
# e.g. findGivenNames(x, progress = FALSE, apikey = 'your_api_key')
givenNames = findGivenNames(x, progress = FALSE)
# Use only terms that have a count of more than 100 in the database
givenNames = givenNames[count > 100]
givenNames
#>       name gender probability count
#> 1: winston   male        0.98   128
#> 2:     jan   male         0.6  1663
#> 3:   maria female        0.99  8402

# Genderize the original character vector
genderize(x, genderDB = givenNames, progress = FALSE)
#>                                                              text
#> 1:             Winston J. Durant, ASHP past president, dies at 84
#> 2: JAN BASZKIEWICZ (3 JANUARY 1930 - 27 JANUARY 2011) IN MEMORIAM
#> 3:                                         Maria Sklodowska-Curie
#>    givenName gender genderIndicators
#> 1:   winston   male                1
#> 2:       jan   male                1
#> 3:     maria female                1
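
The description above names two controls on accuracy, count and probability, while the example filters on count only. A minimal sketch of filtering on both follows. It assumes the data.table package is installed, rebuilds the table from the values printed above rather than calling the API, and uses illustrative thresholds (`minCount`, `minProbability`) that are not recommendations from the package.

```r
library(data.table)

# Values transcribed from the findGivenNames() output printed above
givenNames <- data.table(
  name        = c("winston", "jan", "maria"),
  gender      = c("male", "male", "female"),
  probability = c(0.98, 0.60, 0.99),
  count       = c(128L, 1663L, 8402L)
)

# Illustrative thresholds (assumptions, tune them for your own data)
minCount       <- 100
minProbability <- 0.9

# Keep only names that are both frequent enough and confidently gendered;
# as.numeric() guards against probability being stored as text
reliable <- givenNames[count > minCount &
                         as.numeric(probability) >= minProbability]
reliable$name
```

With these thresholds "jan" is dropped because its probability of 0.6 falls below the cut-off, even though its count is high.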

Tutorial

For a more comprehensive tutorial, check the vignette in the package.

browseVignettes("genderizeR")

What's new in the package?

news(package = 'genderizeR')

See package help pages in R / RStudio

help(package = 'genderizeR')
?textPrepare
?findGivenNames
?genderize

How to contribute to the package?

For bugs, updates and new functionalities:

Fork git repo https://github.com/kalimu/genderizeR and submit a pull request.

Feedback:

If you enjoy using the package, you could write a short testimonial and send it to me. I will be happy to post it on the package homepage.

For any kind of feedback you can use the contact form here: https://kalimu.github.io/#contact

How to contact the package's author regarding a research or commercial project?

Please use the contact form: https://kalimu.github.io/#contact

How to cite the package?

citation('genderizeR')
#> 
#> Wais K (2016). "Gender Prediction Methods Based on First Names
#> with genderizeR." _The R Journal_, *8*(1), 17-37. doi:
#> 10.32614/RJ-2016-002 (URL: http://doi.org/10.32614/RJ-2016-002),
#> <URL: https://doi.org/10.32614/RJ-2016-002>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {{Gender Prediction Methods Based on First Names with
#>           genderizeR}},
#>     author = {Kamil Wais},
#>     year = {2016},
#>     journal = {{The R Journal}},
#>     doi = {10.32614/RJ-2016-002},
#>     pages = {17--37},
#>     volume = {8},
#>     number = {1},
#>     url = {https://doi.org/10.32614/RJ-2016-002},
#>   }

Thank you for the citation!

genderizer's People

Contributors

kalimu, nathanvan, tklebel

genderizer's Issues

Error in if (ncol(dfResponse) != 2) { : argument is of length zero

Replicable example below.

> library(genderizeR)
Welcome to genderizeR package version: 1.0.0.1

Changelog: news(package = 'genderizeR')
Help & Contact: help(genderizeR)

If you find this package useful cite it please. Thank you! 
See: citation('genderizeR')

To suppress this message use:
suppressPackageStartupMessages(library(genderizeR))
> findGivenNames(c("MULLER Romain", "Amandine", "Fan de beyoncé", "Entrechas", 
+                  "Cantin Rodolphe", "Thomas Doustaly"))
removing special characters...
building text-mining corpus...
building term matrix...
removing abbreviations...
all characters to lower...
removing numbers...
removing punctuation...
striping whitespaces...
finding frequent terms...
Packages done: 1. ToDo: 1. First names: 8. |   0%
  |====================                    |  50%
Error in if (ncol(dfResponse) != 2) { : argument is of length zero

The error does not occur if the last item ("Thomas Doustaly") is removed from the example.

Dealing with NAs

Heya,
thanks for the nice package, saves me the hassle of doing the API queries myself!

Just found out that if you have NAs in the data, the package treats NA as a name instead of a missing value, interestingly assuming the person is 72 female. I wasn't sure whether you are aware of that, so wanted to drop it here :)

All best,
Samuel
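
One possible workaround for the behaviour reported above is to drop missing and blank entries before the vector reaches the package. The sketch below uses base R only; `dropMissingNames` is a hypothetical helper written for this illustration and is not part of genderizeR.

```r
# Hypothetical helper (not part of genderizeR): removes NA and blank
# entries and remembers where the kept values came from
dropMissingNames <- function(x) {
  keep <- !is.na(x) & nzchar(trimws(x))
  list(values = x[keep], index = which(keep))
}

x <- c("Maria", NA, "Tom", "  ")
cleaned <- dropMissingNames(x)

cleaned$values  # the entries that are safe to send to the API
cleaned$index   # their positions in the original vector
```

`cleaned$values` could then be passed to `findGivenNames()`, and `cleaned$index` used to place the results back alongside the original rows.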

country_id ?

The genderize.io API allows a country_id string to be passed along in the query. This does not seem to work in the package, although it is crucial. For example, "Andrea" is a female name in German and a male name in Italian. Do you plan to add this, or did I miss something?
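
To make the request concrete, here is a minimal base R sketch of how a query with a country_id parameter could be assembled. `buildGenderizeUrl` is a hypothetical helper, not a genderizeR function, and the sketch only builds the URL; it does not contact the API or show how the package itself constructs requests.

```r
# Hypothetical helper (not part of genderizeR): builds a genderize.io
# request URL, adding country_id only when one is supplied
buildGenderizeUrl <- function(name, country_id = NULL) {
  query <- c(name = name, country_id = country_id)  # NULL entries are dropped
  pairs <- paste0(
    names(query), "=",
    vapply(query, utils::URLencode, character(1), reserved = TRUE)
  )
  paste0("https://api.genderize.io/?", paste(pairs, collapse = "&"))
}

urlDE <- buildGenderizeUrl("Andrea", country_id = "DE")
urlIT <- buildGenderizeUrl("Andrea", country_id = "IT")
urlIT
```

The resulting URLs could be fetched with any HTTP client, for example `httr::GET(urlIT)`, to compare the prediction for the same name in two countries.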

issues with genderize

Obviously you can't fix these but want to flag them for users:

  • you can't predict gender from names
  • no non-binary names (because you can't predict gender from names)
  • likely to misgender people & cause harm

would recommend flagging the above in the readme.

Failure with dev version of httr

checking tests ... ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  6: condition(object)
  7: evaluate_promise(expr, print = TRUE)
  8: with_sink(temp, withCallingHandlers(withVisible(code), warning = wHandler, message = mHandler))
  9: withCallingHandlers(withVisible(code), warning = wHandler, message = mHandler)
  10: withVisible(code)
  11: genderizeAPI("Kamil", apikey = "test")
  12: httr::GET("https://api.genderize.io", query = query, httr::config(ssl.verifypeer = ssl.verifypeer))
  13: request_perform(req, hu$handle$handle) at /Users/hadley/Documents/web/httr/R/http-get.r:67
  14: curl::handle_setopt(handle, .list = req$options) at /Users/hadley/Documents/web/httr/R/request.R:119
  15: stop("Unknown options.") at /private/tmp/RtmpRrn4aZ/devtoolse6b2242dabf0/jeroenooms-curl-0911193/R/handle.R:49

  Error: Test failures
  Execution halted

Maybe because ssl.verifypeer is now ssl_verifypeer? But you really shouldn't be setting this anyway

package issue

complaint: Gender is not binary

I saw that your package was used in a conference presentation in 2016.

Seeing how your program works, I have to say that while this package may be loved by many, I as a Transgender person hate it for what it does and upholds. Gender is not binary and never was, and classifying people into 2 neat gender buckets based on names is wrong: names do not correspond 100% to a person's gender (which is not the same as sex, which is also not binary; intersex people exist).

My name ends with 'e', so what bucket would your program put me into? My gender is non-binary and so my gender classification would fail to work. I know plenty of Transgender people with names stereotypical of Female or Male but aren't.

Consider if someone made a program to classify your race or ethnicity simply by your name, and had only White or Black as options. Or based on your name classified you as smart or dumb? How would you feel about that?

Your program assumes and classifies you into binary outcomes despite reality of spectrum.

Result duplicates rows

We're checking our reverse dependencies (thanks for the support!) and noticed an error in tests for genderizeR. The fail doesn't appear to be due to a change in data.table since it happens for the version on CRAN as well.

I had a look and I think the problem is here:

if (NCOL(dfResponse) == 4) {
dfResponse$country_id <- "all"
dfNames = data.table::rbindlist(list(dfNames, dfResponse))
}
if (NCOL(dfResponse) == 5) {
dfNames = data.table::rbindlist(list(dfNames, dfResponse))
}

I guess it should be an if else, since the first branch adds the column country_id, and then we enter the second branch.

Is dfResponse guaranteed to be a data.table by the first branch? If so, recommend also using:

dfResponse[ , 'country_id' := 'all']
# or
set(dfResponse, NULL, 'country_id', 'all')
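
The if/else suggestion above can be sketched as follows, assuming the data.table package is installed. `appendResponse` is a hypothetical wrapper written for this illustration; the real package code surrounding the quoted snippet is not shown in this issue, so this is not a drop-in patch.

```r
library(data.table)

# Hypothetical wrapper (not part of genderizeR): appends one API response
# to the accumulated results exactly once
appendResponse <- function(dfNames, dfResponse) {
  # Work on a copy so the caller's object is not modified by reference
  dfResponse <- copy(as.data.table(dfResponse))
  if (NCOL(dfResponse) == 4) {
    # No country in the response: add the column, then bind
    set(dfResponse, NULL, "country_id", "all")
    dfNames <- rbindlist(list(dfNames, dfResponse))
  } else if (NCOL(dfResponse) == 5) {
    # country_id already present: bind as-is
    dfNames <- rbindlist(list(dfNames, dfResponse))
  }
  dfNames
}

# Response values taken from the README example
resp <- data.table(name = "maria", gender = "female",
                   probability = 0.99, count = 8402L)
out <- appendResponse(data.table(), resp)
nrow(out)
```

Because the second branch is now an `else if`, a four-column response that gains `country_id` in the first branch is no longer bound a second time.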

Failing to connect to Genderize.io on a free plan.

Dear Kamil,

first of all, genderizeR looks really promising. Thanks for the great work. However, I tried to make it work over the last couple of weeks without succeeding: Although just working with test data sets of a couple of names only, Genderize.io keeps telling me that I have reached the end of my rate plan.

However, since my test data set contains only six names, this cannot be true. (Still given the premise of having 1000 free names per day). Did Genderize.io make any recent changes to their free plan? Can you spot any flaw in my attempts?

Any hints are appreciated. Many thanks for your time.

Here's my code:

library("genderizeR")
x <- c("Maria", "Tom", "Michael", "Louisa", "Kristin", "Marvin")
forenameTest <- findGivenNames(x, textPrepare = TRUE, apikey = NULL, queryLength = 10,
               progress = TRUE, ssl.verifypeer = TRUE)

Error message:

| | 0%
Client error: (429) Too Many Requests (RFC 6585)
Request limit reached

The API queries stopped at 1 term.
If you have reached the end of your API limit, you can start the function again from that term and continue finding given names next time with efficient use of the API.
Remember to add the results to already found names and not to overwrite them.

Warning messages:
1: In genderizeAPI(termsQuery, apikey = apikey, ssl.verifypeer = ssl.verifypeer) :
You have used all available requests in this subscription plan.
2: In findGivenNames(x, textPrepare = FALSE, apikey = NULL, queryLength = 10, :
The API queries stopped.

And session info:

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.6 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] genderizeR_2.0.0 rlist_0.4.6.1 gdata_2.17.0 plyr_1.8.4 Rfacebook_0.6.11 httpuv_1.3.3 rjson_0.2.15 httr_1.2.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 gtools_3.5.0 slam_0.1-40 R6_2.2.0 jsonlite_1.4 magrittr_1.5 stringi_1.1.2
[8] curl_2.3 data.table_1.10.0 NLP_0.1-11 tools_3.3.1 stringr_1.1.0 parallel_3.3.1 tm_0.7-1

Many thanks

Marvin.
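
The error message in this report advises restarting from the term where the queries stopped and adding new results to the ones already found. A minimal base R sketch of that bookkeeping follows. `remainingTerms` and `mergeResults` are hypothetical helpers, not genderizeR functions, and the rows in `found` and `newlyFound` are placeholder values for illustration, not real API output.

```r
# Hypothetical helpers (not part of genderizeR)

# Terms that still need to be queried, compared case-insensitively
remainingTerms <- function(allTerms, found) {
  setdiff(tolower(unique(allTerms)), tolower(found$name))
}

# Append new results to earlier ones, keeping the first row per name
mergeResults <- function(found, newlyFound) {
  combined <- rbind(found, newlyFound)
  combined[!duplicated(combined$name), ]
}

allTerms <- c("Maria", "Tom", "Michael", "Louisa", "Kristin", "Marvin")

# Placeholder rows standing in for results saved before the limit was hit
found <- data.frame(name = c("maria", "tom"),
                    gender = c("female", "male"),
                    stringsAsFactors = FALSE)

todo <- remainingTerms(allTerms, found)

# Placeholder rows standing in for a later, partly overlapping run
newlyFound <- data.frame(name = c("tom", "michael"),
                         gender = c("male", "male"),
                         stringsAsFactors = FALSE)

merged <- mergeResults(found, newlyFound)
merged$name
```

Only the terms in `todo` need to be sent on the next attempt, which avoids spending requests on names that were already resolved.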

is there an upper limit to number of names per query?

It seems that 10 is the upper limit to the number of names allowed in a query. Is this true and is there a way around it?

> v <- rep("Patrick", 15)
> genderizeAPI(v)

$response
       name gender probability count
 1: Patrick   male           1  2877
 2: Patrick   male           1  2877
 3: Patrick   male           1  2877
 4: Patrick   male           1  2877
 5: Patrick   male           1  2877
 6: Patrick   male           1  2877
 7: Patrick   male           1  2877
 8: Patrick   male           1  2877
 9: Patrick   male           1  2877
10: Patrick   male           1  2877

I don't see anything in the genderize.io or genderizeR code that limits the number of queries, but maybe I'm missing something.
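
If the API does cap each request at ten names, as the output above and the `queryLength = 10` argument seen in another report suggest, one way around it is to deduplicate the names and send them in batches. The base R sketch below only does the splitting; `batchNames` is a hypothetical helper, not part of genderizeR, and the ten-name cap is an assumption.

```r
# Hypothetical helper (not part of genderizeR): deduplicates names and
# splits them into batches of at most `size` names each
batchNames <- function(names, size = 10L) {
  uniqueNames <- unique(names)
  groups <- ceiling(seq_along(uniqueNames) / size)
  unname(split(uniqueNames, groups))
}

# 15 repeats of one name plus 24 distinct placeholder names
v <- c(rep("Patrick", 15), paste0("name", 1:24))
batches <- batchNames(v)

lengths(batches)  # number of names in each batch
```

Each element of `batches` could then be passed to `genderizeAPI()` in turn and the responses combined. Deduplicating first also means repeated names such as the 15 copies of "Patrick" cost one lookup instead of fifteen.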
