bldavies / nberwp Goto Github PK

R package containing data on NBER working papers

R 100.00%

rstats-package nber

nberwp's Issues

Many titles contain repeated whitespaces

library(nberwp)
sum(grepl('\\s{2}', papers$title))
#> [1] 1748
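
One possible fix, sketched here with base R only, is to collapse every run of whitespace into a single space and trim the ends. The helper name `squish_title` and the sample titles are hypothetical; they are not part of nberwp.

```r
# Hypothetical helper (not part of nberwp): collapse whitespace runs
# into single spaces and trim leading/trailing whitespace.
squish_title <- function(x) {
  trimws(gsub('\\s+', ' ', x))
}

# Made-up sample titles containing repeated and stray whitespace
titles <- c('The  Market for   Lawyers', ' A Multicountry Econometric Model ')
cleaned <- squish_title(titles)

# After cleaning, no title should match the pattern used above
sum(grepl('\\s{2}', cleaned))
```

Applying such a helper to papers$title before the data are saved would bring the count above down to zero.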

Some titles are missing spaces

For example:

library(nberwp)
p <- grepl('[a-z][A-Z][a-z]', papers$title)
sum(p)
#> [1] 112
papers$title[p][1:5]
#> [1] "Experience, Vintage and Time Effects in the Growth of Earnings: AmericanScientists, 1960-1970"
#> [2] "The Market for Lawyers: The Determinants of the Demand for and Supply ofLawyers"              
#> [3] "The Effect of Minimum Wage Legislation on Income Equality: A TheoreticalAnalysis"             
#> [4] "Identifying Identical Distributed Lag Structures by the Use of Prior SumConstraints"          
#> [5] "The Percent Organized Wage (POW) Relationship for Union and for NonunionWorkers"

But be careful with aAA-type patterns, since some (4/6) are valid:

p <- grepl('[a-z][A-Z]{2}', papers$title)
papers$title[p]
#> [1] "Should the Holding Period Matter for the Intertemporal Consumption-BasedCAPM?"                    
#> [2] "Jumps and Stochastic Volatility: Exchange Rate Processes Implicit in thePHLX Deutschemark Options"
#> [3] "ExtrapoLATE-ing: External Validity and Overidentification in the LATE Framework"                  
#> [4] "Sign Restrictions in Bayesian FaVARs with an Application to Monetary Policy Shocks"               
#> [5] "From Paper to Plastic: Understanding the Impact of eWIC on WIC Recipient Behavior"                
#> [6] "WIC Participation and Relative Quality of Household Food Purchases: Evidence from FoodAPS"
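
A minimal sketch of a repair for the aAa-type cases, assuming it is acceptable to insert a space wherever a lowercase letter is immediately followed by a capitalised word. The helper name `split_fused_words` is hypothetical, and `perl = TRUE` is used so that the character ranges are not locale-dependent.

```r
# Hypothetical helper (not part of nberwp): insert a space between a
# lowercase letter and a following capitalised word (aAa-type pattern).
split_fused_words <- function(x) {
  gsub('([a-z])([A-Z][a-z])', '\\1 \\2', x, perl = TRUE)
}

titles <- c(
  'The Market for Lawyers: The Determinants of the Demand for and Supply ofLawyers',
  'From Paper to Plastic: Understanding the Impact of eWIC on WIC Recipient Behavior'
)
fixed <- split_fused_words(titles)
fixed
```

The first title gains the missing space, while the second is left alone because "eWIC" is an aAA-type pattern. The sketch deliberately does not touch aAA-type patterns, so titles such as the "Consumption-BasedCAPM" one still need manual correction, and it would wrongly split legitimate mixed-case words of the aAa form (for example a surname like "McDonald"), so the matches should be reviewed before any replacement is applied.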

Some titles contain parenthetical notes

library(nberwp)
m <- grepl('^.*\\(.*\\)$', papers$title)
sum(m)
#> [1] 66

For example:

papers$title[m][1:10]
#>  [1] "Sample Selection Bias As a Specification Error (with an Application to the Estimation of Labor Supply Functions)"                         
#>  [2] "Static and Dynamic Resource Allocation Effects of Corporate and Personal Tax Integration in the U.S.: A General Equilibrium Approach(Rev)"
#>  [3] "Taxation and the Stock Market Valuation of Capital Gains and Dividends: Theory and Empirical Results (Rev)"                               
#>  [4] "A Multicountry Econometric Model (Revised)"                                                                                               
#>  [5] "Will the Real Excess Burden Please Stand Up? (Or, Seven Measures in Search of a Concept)"                                                 
#>  [6] "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)"                                        
#>  [7] "Self-Employment and Labor Force Participation of Older Males (Revised)"                                                                   
#>  [8] "Raw Materials, Profits, and the Productivity Slowdown (Rev)"                                                                              
#>  [9] "World Shocks, Macroeconomic Response, and the Productivity Puzzle (Rev)"                                                                  
#> [10] "Modeling Deviations from Purchasing Power Parity (PPP)"
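
Many of these notes are revision markers rather than part of the title. As a sketch, assuming only trailing notes that begin with "Rev" or "Revised" should be dropped, they could be stripped as follows. The helper name `strip_revision_note` is hypothetical.

```r
# Hypothetical helper (not part of nberwp): drop a trailing "(Rev)",
# "(Revised)" or "(revised as ...)" note, with or without a leading space.
strip_revision_note <- function(x) {
  sub('\\s*\\(rev(ised)?(\\s[^)]*)?\\)$', '', x, ignore.case = TRUE, perl = TRUE)
}

titles <- c(
  'A Multicountry Econometric Model (Revised)',
  'Raw Materials, Profits, and the Productivity Slowdown (Rev)',
  'Modeling Deviations from Purchasing Power Parity (PPP)'
)
stripped <- strip_revision_note(titles)
stripped
```

Parentheticals that are part of the title, such as "(PPP)", are left untouched. Notes like "(revised as W0740)" carry information about a successor paper, so it may be preferable to record that elsewhere before stripping them.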

w7443 and w15317 are missing

papers$number has gaps:

library(nberwp)
setdiff(seq_len(max(papers$number)), papers$number)
#> [1]  7443 15317

w15317 is missing from the source RDF file for 2009. w7443 appears to not exist.

Some titles are wrapped in quote marks

dplyr::filter(nberwp::papers, grepl('^"(.*)"$', title))
#> # A tibble: 19 x 4
#>    paper   year month title                                                     
#>    <chr>  <int> <int> <chr>                                                     
#>  1 w4348   1993     4 "\"The Minimum Wage and the Employment of Youth: Evidence…
#>  2 w4632   1994     1 "\"Public Sector Pension Governance and Performance\""    
#>  3 w4648   1994     2 "\"The Federal Deposit Insurance Fund That Didn't Put A B…
#>  4 w4670   1994     3 "\"Household Responses for Pricing Garbage by the Bag,\"" 
#>  5 w4711   1994     4 "\"Convergence in the Age of Mass Migration\""            
#>  6 w4739   1994     5 "\"Learning By Doing and the Choice of Technology.\""     
#>  7 w4787   1994     6 "\"Unemployment Insurance Benefits and Takeup Rates\""    
#>  8 w5306   1995    10 "\"Generic Entry and the Pricing of Pharmaceuticals\""    
#>  9 w5392   1995    12 "\"Around the European Periphery 1870-1913: Globalization…
#> 10 w5491   1996     3 "\"Globalization and Inequality Past and Present\""       
#> 11 w5512   1996     3 "\"Social Security Privatization: A Structure for Analysi…
#> 12 w9635   2003     4 "\"Fifty-four Forty or Fight!\""                          
#> 13 w9793   2003     6 "\". . . and six hundred thousand men were dead.\""       
#> 14 w12110  2006     3 "\"Sick of Local Government Corruption? Vote Islamic\""   
#> 15 w18142  2012     6 "\"Getting the Biggest Bang for the Buck in Fiscal Policy…
#> 16 h0056   1994     6 "\"The Population of the United States, 1790-1920\""      
#> 17 h0080   1996     3 "\"Long Term Marriage Patterns in the United States from …
#> 18 h0085   1996     5 "\"The Use of the Census to Estimate Childhood Mortality:…
#> 19 h0130   2000    10 "\"Development, Health, Nutrition, and Mortality: The Cas…

w4670 is titled incorrectly. w9635 and w9793 are quotes, and w12110 is a phrase. The rest should be unwrapped.
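
The unwrapping could be sketched as below, assuming the exceptions are supplied as a vector of paper identifiers taken from the listing above. The helper name `unwrap_quotes`, its `keep` argument, and the demo data frame are hypothetical.

```r
# Hypothetical helper (not part of nberwp): remove wrapping quote marks
# from titles, except for papers listed in `keep`.
unwrap_quotes <- function(paper, title, keep = character()) {
  wrapped <- grepl('^".*"$', title) & !(paper %in% keep)
  title[wrapped] <- sub('^"(.*)"$', '\\1', title[wrapped])
  title
}

# Two rows copied from the listing above
papers_demo <- data.frame(
  paper = c('w4711', 'w9635'),
  title = c('"Convergence in the Age of Mass Migration"',
            '"Fifty-four Forty or Fight!"'),
  stringsAsFactors = FALSE
)

# Genuine quotes and phrases stay wrapped; w4670 is held back for a manual fix
keep <- c('w4670', 'w9635', 'w9793', 'w12110')
unwrapped <- unwrap_quotes(papers_demo$paper, papers_demo$title, keep)
unwrapped
```

Here w4711 loses its quote marks while w9635 keeps them.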

Data contains false positives

papers contains at least five duds:

suppressMessages(library(dplyr))
library(nberwp)
papers %>%
  filter(title %in% c('None', 'Paper Withdrawn'))
#> # A tibble: 5 x 4
#>   number  year month title          
#>    <int> <int> <int> <chr>          
#> 1    156  1976    11 None           
#> 2   7255  1999     7 None           
#> 3   7436  1999    12 None           
#> 4  13800  2008     2 Paper Withdrawn
#> 5  21929  2016     1 Paper Withdrawn

w156, w7255 and w7436 all have abstracts explaining that the papers never existed. w13800 and w21929 were withdrawn.
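
A minimal sketch of excluding these rows, using base R and a small stand-in data frame (`papers_demo` and `dud_titles` are hypothetical names; the rows mirror part of the output above plus one made-up valid row):

```r
# Titles that mark papers which never existed or were withdrawn
dud_titles <- c('None', 'Paper Withdrawn')

# Stand-in for the papers table
papers_demo <- data.frame(
  number = c(156L, 157L, 13800L),
  title = c('None', 'A Valid Title', 'Paper Withdrawn'),
  stringsAsFactors = FALSE
)

# Keep only rows whose title is not a known dud marker
papers_clean <- papers_demo[!(papers_demo$title %in% dud_titles), ]
papers_clean
```

Filtering on exact titles only catches the known markers, so other duds with different placeholder text would need to be found separately.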

Add working papers from historical and technical series

The package currently contains data on working papers published in the general series (with numbers prefixed by "w") only. It would be nice to include data on papers published in the historical and technical series (with numbers prefixed by "h" and "t"). The raw data for these series are included in the .tab metadata files here, and more information can be gleaned from the files in the nberhi/ and nberte/ directories at the NBER RePEc index.

Adding papers from the historical and technical series poses at least two challenges:

First, the paper variable in the papers, paper_authors, and paper_programs tables becomes a string (the series prefix concatenated with the number within the series), which complicates sorting the tables by that variable. One approach is to group series together and sort by number within each series. For example, the code used to define papers, starting at line 227 of papers.R, could be replaced with

papers = papers_raw %>%
  as_tibble() %>%
  filter(grepl('^(h|t|w)', paper)) %>%
  filter(issue_date <= max_issue_date) %>%
  mutate(year = as.integer(substr(issue_date, 1, 4)),
         month = as.integer(substr(issue_date, 6, 7))) %>%
  select(paper, year, month, title) %>%
  sort_by_paper() %>%
  filter(paper != 'w0000')

where the sort_by_paper function is defined as

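library(dplyr)
library(tidyr)

sort_by_paper = function(x) {
  x %>%
    # Split e.g. 'w0156' into series 'w' and number 156
    mutate(paper = sub('([a-z])([0-9]+)', '\\1_\\2', paper)) %>%
    separate(paper, c('series', 'number')) %>%
    mutate(number = as.numeric(number)) %>%
    # Order series as w, t, h, then by number within each series
    arrange(desc(series), number) %>%
    # Restore the zero-padded paper identifier
    mutate(number = sprintf('%04d', number)) %>%
    unite(paper, c('series', 'number'), sep = '')
}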

Second, the definition of the author variable in authors and paper_authors currently relies on the paper variable being numeric, which will no longer be the case if series prefixes are included. The author variable currently equals the number of the author's debut paper multiplied by 100, plus the author's position in the author list on that paper (which never exceeds 99). This definition ensures that author values don't change when new papers are added (and that changes are minimal when papers are removed). It could be extended to support series prefixes by re-interpreting the debut paper as the earliest-issued paper, breaking ties by prioritising the lowest-numbered paper in the general (then technical, then historical) series if the author has more than one paper in the first month in which they published in any series.
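
To make the arithmetic and the proposed tie-breaking concrete, here is a sketch in base R. The function names `make_author_id` and `find_debut`, the column names, and the demo rows are all hypothetical; how the series prefix would enter the identifier itself is left open, as in the proposal above.

```r
# Current definition: debut paper number times 100, plus author position
make_author_id <- function(debut_number, position) {
  stopifnot(all(position >= 1), all(position <= 99))
  debut_number * 100 + position
}

make_author_id(156, 2)
#> [1] 15602

# Proposed extension: pick the earliest-issued paper, breaking ties by
# series (general, then technical, then historical) and then by number.
find_debut <- function(df) {
  series_rank <- match(df$series, c('w', 't', 'h'))
  df[order(df$year, df$month, series_rank, df$number)[1], ]
}

# Made-up papers by a single author, all issued in the same month
author_papers <- data.frame(
  series = c('h', 'w', 'w'),
  number = c(56, 4787, 4711),
  year = c(1994, 1994, 1994),
  month = c(6, 6, 6),
  stringsAsFactors = FALSE
)
debut <- find_debut(author_papers)
debut
```

In this made-up case the debut paper is the general-series paper numbered 4711, because the general series takes priority over the historical series and 4711 is the lower of the two general-series numbers.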
