nberwp's People

Contributors

bldavies

nberwp's Issues

Some titles are missing spaces

For example:

library(nberwp)
p <- grepl('[a-z][A-Z][a-z]', papers$title)
sum(p)
#> [1] 112
papers$title[p][1:5]
#> [1] "Experience, Vintage and Time Effects in the Growth of Earnings: AmericanScientists, 1960-1970"
#> [2] "The Market for Lawyers: The Determinants of the Demand for and Supply ofLawyers"              
#> [3] "The Effect of Minimum Wage Legislation on Income Equality: A TheoreticalAnalysis"             
#> [4] "Identifying Identical Distributed Lag Structures by the Use of Prior SumConstraints"          
#> [5] "The Percent Organized Wage (POW) Relationship for Union and for NonunionWorkers"

But be careful with aAA-type patterns, since some (4/6) are valid:

p <- grepl('[a-z][A-Z]{2}', papers$title)
papers$title[p]
#> [1] "Should the Holding Period Matter for the Intertemporal Consumption-BasedCAPM?"                    
#> [2] "Jumps and Stochastic Volatility: Exchange Rate Processes Implicit in thePHLX Deutschemark Options"
#> [3] "ExtrapoLATE-ing: External Validity and Overidentification in the LATE Framework"                  
#> [4] "Sign Restrictions in Bayesian FaVARs with an Application to Monetary Policy Shocks"               
#> [5] "From Paper to Plastic: Understanding the Impact of eWIC on WIC Recipient Behavior"                
#> [6] "WIC Participation and Relative Quality of Household Food Purchases: Evidence from FoodAPS"
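
One possible repair, sketched here on titles copied from the output above, inserts a space wherever a lowercase letter is directly followed by a capitalised word. The helper name add_missing_spaces is hypothetical and not part of nberwp. The rule deliberately leaves aAA-type cases such as "BasedCAPM" for manual fixing, and it would still need a review pass because names like "McDonald" match the same pattern.

```r
# Hypothetical helper: insert a space between a lowercase letter and a
# following capital letter that starts a lowercase word (e.g. "ofLawyers").
add_missing_spaces <- function(title) {
  gsub('([a-z])([A-Z][a-z])', '\\1 \\2', title)
}

# Titles copied from the examples above
titles <- c(
  'The Market for Lawyers: The Determinants of the Demand for and Supply ofLawyers',
  'From Paper to Plastic: Understanding the Impact of eWIC on WIC Recipient Behavior',
  'Sign Restrictions in Bayesian FaVARs with an Application to Monetary Policy Shocks'
)

fixed <- add_missing_spaces(titles)
print(fixed)
```

The first title gains a space in "of Lawyers", while the valid "eWIC" and "FaVARs" titles are left unchanged because the capital letter in each is followed by another capital.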

Some titles contain parenthetical notes

library(nberwp)
m <- grepl('^.*\\(.*\\)$', papers$title)
sum(m)
#> [1] 66

For example:

papers$title[m][1:10]
#>  [1] "Sample Selection Bias As a Specification Error (with an Application to the Estimation of Labor Supply Functions)"                         
#>  [2] "Static and Dynamic Resource Allocation Effects of Corporate and Personal Tax Integration in the U.S.: A General Equilibrium Approach(Rev)"
#>  [3] "Taxation and the Stock Market Valuation of Capital Gains and Dividends: Theory and Empirical Results (Rev)"                               
#>  [4] "A Multicountry Econometric Model (Revised)"                                                                                               
#>  [5] "Will the Real Excess Burden Please Stand Up? (Or, Seven Measures in Search of a Concept)"                                                 
#>  [6] "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)"                                        
#>  [7] "Self-Employment and Labor Force Participation of Older Males (Revised)"                                                                   
#>  [8] "Raw Materials, Profits, and the Productivity Slowdown (Rev)"                                                                              
#>  [9] "World Shocks, Macroeconomic Response, and the Productivity Puzzle (Rev)"                                                                  
#> [10] "Modeling Deviations from Purchasing Power Parity (PPP)"
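
If the revision notes are to be dropped while meaningful parentheticals such as "(PPP)" are kept, something like the following sketch might work. The helper name strip_revision_note is hypothetical, and the pattern only targets trailing notes that begin with "Rev" or "Revised" in any letter case; other kinds of note would need their own rules.

```r
# Hypothetical helper: drop a trailing "(Rev)", "(Revised)" or
# "(revised as ...)" note, along with any whitespace before it.
strip_revision_note <- function(title) {
  sub('\\s*\\(rev(ised)?\\b[^)]*\\)$', '', title,
      ignore.case = TRUE, perl = TRUE)
}

# Titles copied from the examples above
titles <- c(
  'A Multicountry Econometric Model (Revised)',
  'Raw Materials, Profits, and the Productivity Slowdown (Rev)',
  'Modeling Deviations from Purchasing Power Parity (PPP)'
)

cleaned <- strip_revision_note(titles)
print(cleaned)
```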

w7443 and w15317 are missing

papers$number has gaps:

library(nberwp)
setdiff(seq_len(max(papers$number)), papers$number)
#> [1]  7443 15317

w15317 is missing from the source RDF file for 2009. w7443 appears not to exist.

Some titles are wrapped in quote marks

dplyr::filter(nberwp::papers, grepl('^"(.*)"$', title))
#> # A tibble: 19 x 4
#>    paper   year month title                                                     
#>    <chr>  <int> <int> <chr>                                                     
#>  1 w4348   1993     4 "\"The Minimum Wage and the Employment of Youth: Evidence…
#>  2 w4632   1994     1 "\"Public Sector Pension Governance and Performance\""    
#>  3 w4648   1994     2 "\"The Federal Deposit Insurance Fund That Didn't Put A B…
#>  4 w4670   1994     3 "\"Household Responses for Pricing Garbage by the Bag,\"" 
#>  5 w4711   1994     4 "\"Convergence in the Age of Mass Migration\""            
#>  6 w4739   1994     5 "\"Learning By Doing and the Choice of Technology.\""     
#>  7 w4787   1994     6 "\"Unemployment Insurance Benefits and Takeup Rates\""    
#>  8 w5306   1995    10 "\"Generic Entry and the Pricing of Pharmaceuticals\""    
#>  9 w5392   1995    12 "\"Around the European Periphery 1870-1913: Globalization…
#> 10 w5491   1996     3 "\"Globalization and Inequality Past and Present\""       
#> 11 w5512   1996     3 "\"Social Security Privatization: A Structure for Analysi…
#> 12 w9635   2003     4 "\"Fifty-four Forty or Fight!\""                          
#> 13 w9793   2003     6 "\". . . and six hundred thousand men were dead.\""       
#> 14 w12110  2006     3 "\"Sick of Local Government Corruption? Vote Islamic\""   
#> 15 w18142  2012     6 "\"Getting the Biggest Bang for the Buck in Fiscal Policy…
#> 16 h0056   1994     6 "\"The Population of the United States, 1790-1920\""      
#> 17 h0080   1996     3 "\"Long Term Marriage Patterns in the United States from …
#> 18 h0085   1996     5 "\"The Use of the Census to Estimate Childhood Mortality:…
#> 19 h0130   2000    10 "\"Development, Health, Nutrition, and Mortality: The Cas…

w4670 is titled incorrectly. w9635 and w9793 are quotes, and w12110 is a phrase. The rest should be unwrapped.
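
A minimal sketch of the unwrapping step, assuming a keep-list of papers whose quote marks are intentional. The names papers_demo and keep_wrapped are hypothetical, the demo rows are copied from the table above, and the keep-list uses the paper identifiers as they appear in that table.

```r
# Small demo table with rows copied from the output above
papers_demo <- data.frame(
  paper = c('w4711', 'w9635', 'w12110'),
  title = c('"Convergence in the Age of Mass Migration"',
            '"Fifty-four Forty or Fight!"',
            '"Sick of Local Government Corruption? Vote Islamic"'),
  stringsAsFactors = FALSE
)

# Assumed keep-list: titles that are genuine quotes or phrases
keep_wrapped <- c('w9635', 'w9793', 'w12110')

# Unwrap every fully quoted title that is not on the keep-list
unwrap <- grepl('^"(.*)"$', papers_demo$title) &
  !(papers_demo$paper %in% keep_wrapped)
papers_demo$title[unwrap] <- sub('^"(.*)"$', '\\1', papers_demo$title[unwrap])

print(papers_demo)
```

The trailing comma in the w4670 title would still need a separate, manual correction.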

Data contain duplicates

Twenty titles appear twice:

library(dplyr)
library(nberwp)

papers %>%
  count(title) %>%
  count(n)
#> # A tibble: 2 x 2
#>       n    nn
#>   <int> <int>
#> 1     1 26553
#> 2     2    20

Some duplicates might be valid. Worth checking manually.
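
To support that manual check, the duplicated titles can be listed next to their paper identifiers. The sketch below uses base R on a made-up table (papers_demo and its rows are hypothetical) so that it runs without the package; the same indexing should apply to the real papers table.

```r
# Made-up demo table; the real data would come from nberwp::papers
papers_demo <- data.frame(
  paper = c('w0001', 'w0002', 'w0003', 'w0004'),
  title = c('Title A', 'Title B', 'Title A', 'Title C'),
  stringsAsFactors = FALSE
)

# Titles that occur more than once
dup_titles <- unique(papers_demo$title[duplicated(papers_demo$title)])

# All rows carrying a duplicated title, grouped by title for easy comparison
dups <- papers_demo[papers_demo$title %in% dup_titles, ]
dups <- dups[order(dups$title, dups$paper), ]

print(dups)
```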

Also, two titles reference updated versions:

papers$title[grepl('W[0-9]', papers$title, ignore.case = TRUE)]
#> [1] "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)"
#> [2] "Wage-Employment Contracts (Replaced by W0675)"

These should probably be removed.

Data contains false positives

papers contains at least five duds:

suppressMessages(library(dplyr))
library(nberwp)
papers %>%
  filter(title %in% c('None', 'Paper Withdrawn'))
#> # A tibble: 5 x 4
#>   number  year month title          
#>    <int> <int> <int> <chr>          
#> 1    156  1976    11 None           
#> 2   7255  1999     7 None           
#> 3   7436  1999    12 None           
#> 4  13800  2008     2 Paper Withdrawn
#> 5  21929  2016     1 Paper Withdrawn

w156, w7255 and w7436 all have abstracts explaining that the papers never existed. w13800 and w21929 were withdrawn.

Add working papers from historical and technical series

The package currently contains data on working papers published in the general series (with numbers prefixed by "w") only. It would be nice to include data on papers published in the historical and technical series (with numbers prefixed by "h" and "t"). The raw data for these series are included in the .tab metadata files here, and more information can be gleaned from the files in the nberhi/ and nberte/ directories at the NBER RePEc index.

Adding papers from the historical and technical series poses at least two challenges:

First, the paper variable in the papers, paper_authors, and paper_programs tables becomes a string (the series prefix concatenated with the number within the series), which complicates sorting the tables by that variable. One approach is to group series together and sort by the numbers within each series. For example, the code used to define papers, starting at line 227 of papers.R, could be replaced with

papers = papers_raw %>%
  as_tibble() %>%
  filter(grepl('^(h|t|w)', paper)) %>%
  filter(issue_date <= max_issue_date) %>%
  mutate(year = as.integer(substr(issue_date, 1, 4)),
         month = as.integer(substr(issue_date, 6, 7))) %>%
  select(paper, year, month, title) %>%
  sort_by_paper() %>%
  filter(paper != 'w0000')

where the sort_by_paper function is defined as

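# Sort rows by series (general, then technical, then historical) and by
# number within series. Uses dplyr (mutate, arrange) and tidyr (separate, unite).
sort_by_paper = function(x) {
  x %>%
    # Split e.g. 'w0123' into series 'w' and number '0123'
    mutate(paper = sub('([a-z])([0-9]+)', '\\1_\\2', paper)) %>%
    separate(paper, c('series', 'number')) %>%
    mutate(number = as.numeric(number)) %>%
    # desc(series) orders the prefixes w, t, h
    arrange(desc(series), number) %>%
    # Re-pad to at least four digits and rebuild the identifier
    mutate(number = sprintf('%04d', number)) %>%
    unite(paper, c('series', 'number'), sep = '')
}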
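
To illustrate the ordering this function is meant to produce, here is a base R sketch of the same idea on made-up identifiers. It is not the package's code: the vector ids and the explicit series ranking are assumptions chosen to mirror the w, t, h order that arrange(desc(series), number) yields.

```r
# Made-up paper identifiers across the three series
ids <- c('w0010', 'h0002', 't0100', 'w0002', 'w12110', 't0003')

# Split each identifier into its series prefix and numeric part
series <- substr(ids, 1, 1)
number <- as.integer(substring(ids, 2))

# Rank series as general, then technical, then historical
series_rank <- match(series, c('w', 't', 'h'))

# Order by series rank, then by number within series
sorted <- ids[order(series_rank, number)]
print(sorted)
```

Sorting on the numeric part rather than the string keeps w12110 after w0010, which a plain alphabetical sort of the identifiers would not guarantee once numbers exceed four digits.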

Second, the definition of the author variable in authors and paper_authors currently relies on the paper variable being numeric, which will no longer be the case if series prefixes are included. The author variable currently equals the author's debut paper's number multiplied by 100, plus their position in the author list on that paper (which never exceeds 99). This definition ensures that author values don't change when new papers are added, and that they change minimally when papers are removed. It could be extended to support series prefixes by reinterpreting an author's debut paper as their earliest-issued paper, breaking ties by choosing the lowest-numbered paper in the general (then technical, then historical) series when the author has more than one paper in the first month in which they published in any series.
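
The current numeric definition can be written out as a small worked example. The function names author_id and decode_author_id and the input values are hypothetical; the sketch only restates the arithmetic described above and does not cover the proposed extension to series prefixes.

```r
# Hypothetical helper: debut paper number times 100, plus the author's
# position in that paper's author list (assumed to lie between 1 and 99).
author_id <- function(debut_number, position) {
  stopifnot(position >= 1, position <= 99)
  as.integer(debut_number) * 100L + as.integer(position)
}

# Hypothetical inverse: recover the debut paper number and the position
decode_author_id <- function(id) {
  c(debut_number = id %/% 100L, position = id %% 100L)
}

# Second-listed author on a debut paper numbered 4711 (made-up values)
id <- author_id(4711L, 2L)
print(id)
print(decode_author_id(id))
```

Because the position never exceeds 99, the last two digits always hold the position and the remaining digits hold the debut paper's number, which is why the identifier decodes unambiguously.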
