FactorData

Description

Many PhD students working on empirical asset pricing struggle to obtain and prepare data that makes their research comparable to existing studies. This applies in particular to research based on firm-level characteristics, which are used either directly in regression analyses or indirectly through the construction of sorted long-short portfolio returns. Unfortunately, most papers do not publish the code for their data download, cleaning and variable definitions. Notable exceptions include Bryan Kelly (https://github.com/bkelly-lab/GlobalFactor) and Jeremiah Green (https://drive.google.com/file/d/0BwwEXkCgXEdRQWZreUpKOHBXOUU/view). The SAS code provided by Jeremiah Green in particular is frequently cited in recent papers, and as PhD students we are extremely grateful that he made it available.

While working on my own research, I wanted to download and clean the data directly in Python, and to update specific variables as more recent data becomes available. For example, the SAS code published by Jeremiah Green covers 1980 to December 2014. More data has become available since then, and using it requires manual changes to the code (including manually inserting CPI data).

My code provides a simple Python class that downloads, calculates, cleans and saves 103 firm characteristics using data from CRSP, Compustat, I/B/E/S, BLS and FRED. I follow the variable definitions used by Jeremiah Green, and my code achieves an overall median correlation of 98.8% with Green's data. Other variable definitions exist, such as those used by Hou et al. (2020); I leave these to future updates.
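One way to read the correlation figure is as the median, across characteristics, of the correlation between this package's values and Green's values for the same observations. The following is a minimal sketch of that comparison, not necessarily how the number above was computed. It assumes both data sets are pandas DataFrames sharing an index; the function name and the toy frames are hypothetical.

```python
import pandas as pd


def median_correlation(mine: pd.DataFrame, green: pd.DataFrame) -> float:
    """Median over shared columns of the column-wise Pearson correlation."""
    shared = mine.columns.intersection(green.columns)
    # corrwith aligns both frames on the index and correlates column by column.
    per_characteristic = mine[shared].corrwith(green[shared])
    return float(per_characteristic.median())


# Toy data: two characteristics observed for four firm-months.
mine = pd.DataFrame({"bm": [0.5, 1.0, 1.5, 2.0],
                     "mve": [10.0, 20.0, 30.0, 40.0]})
green = pd.DataFrame({"bm": [0.5, 1.0, 1.5, 2.0],
                      "mve": [10.0, 20.0, 30.0, 41.0]})

result = median_correlation(mine, green)
print(f"median correlation: {result:.4f}")
```

Looking at the per-characteristic correlations before taking the median also shows which variables diverge most from the reference data.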

Installing

Install factordata with the package manager pip:

pip install git+https://github.com/fkempf92/FactorData.git

Remarks

While my overall correlation with Green's data is very high, there are some notable differences in variable definitions, as well as some more general differences, which I would like to point out. Note that everyone can change the code to suit their own needs. This repository is merely a suggestion!

  1. I perform industry adjustments after the CRSP-Compustat merge, not before.
  2. Industry adjustments are performed using only the stocks in the investment universe.
  3. Non-negative variables: some firm characteristics are non-negative by definition. However, due to adjustments made by Compustat, they can take negative values, and Green's SAS code does not winsorize these negative outliers (examples include cashdebt, rd_sale and sp). The number of stocks affected is small. I therefore take absolute values of these variables to force them to be non-negative.
  4. xsga0 (a helper variable): Green's definition sets the variable to 0 whether or not xsga is missing,
if missing(xsga) then xsga0=0; else xsga0=0

whereas mine keeps the reported value when it is available:

CASE WHEN xsga is null 
     THEN 0 
     ELSE xsga
     END AS xsga0
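
The remarks above can be sketched in pandas as follows. This is an illustration, not the package's actual code. It assumes that an industry adjustment means subtracting the industry mean within each date. The column names `date` and `industry`, the `_ia` suffix and the helper names are hypothetical.

```python
import pandas as pd


def fill_xsga0(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror the CASE WHEN above: 0 if xsga is missing, else xsga."""
    out = df.copy()
    out["xsga0"] = out["xsga"].fillna(0)
    return out


def force_non_negative(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Replace the listed columns by their absolute values."""
    out = df.copy()
    out[cols] = out[cols].abs()
    return out


def industry_adjust(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Subtract the date-industry mean, using only the rows present in df."""
    out = df.copy()
    group_mean = out.groupby(["date", "industry"])[col].transform("mean")
    out[col + "_ia"] = out[col] - group_mean
    return out


# Toy stand-in for the merged CRSP-Compustat investment universe.
merged = pd.DataFrame({
    "date": ["2020-01"] * 4,
    "industry": ["A", "A", "B", "B"],
    "xsga": [5.0, float("nan"), 2.0, 4.0],
    "cashdebt": [0.3, -0.1, 0.2, 0.4],
})

merged = fill_xsga0(merged)
merged = force_non_negative(merged, ["cashdebt"])
merged = industry_adjust(merged, "cashdebt")
print(merged)
```

Because `industry_adjust` only sees the rows passed to it, calling it on the merged and filtered frame restricts the industry means to the investment universe, which is the point of remarks 1 and 2.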

Requirements

For this code to work, there are three key requirements:

  1. You must have a valid WRDS account and have completed the pgpass file setup (see https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/python-from-your-computer/)
  2. You must have a valid BLS account with access key (see https://data.bls.gov/registrationEngine/)
  3. You must have a valid FRED account with access key (see https://research.stlouisfed.org/docs/api/api_key.html)
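
Before running the package, it can help to check that these requirements are in place. The sketch below uses only the standard library. The pgpass locations are the defaults that PostgreSQL clients use. Storing the keys in the environment variables `BLS_KEY` and `FRED_KEY` is an assumption made for this example, not something the package requires.

```python
import os
import sys
from pathlib import Path


def pgpass_path() -> Path:
    """Default location where PostgreSQL clients look for the password file."""
    override = os.environ.get("PGPASSFILE")
    if override:
        return Path(override)
    if sys.platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "postgresql" / "pgpass.conf"
    return Path.home() / ".pgpass"


def missing_requirements(env, pgpass: Path) -> list:
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if not pgpass.is_file():
        problems.append(f"pgpass file not found: {pgpass}")
    # BLS_KEY and FRED_KEY are hypothetical variable names for this sketch.
    for name in ("BLS_KEY", "FRED_KEY"):
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems


for problem in missing_requirements(os.environ, pgpass_path()):
    print("missing:", problem)
```

The check only confirms that the file and keys exist, not that the credentials are valid.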

Usage

from factordata import FactorData

# Set account details and start year
data = FactorData(wrds_username='janedoe', 
                  bls_key='1234', 
                  fred_key='abcd', 
                  start_yr=1980)
                     
# Download data, i.e. characteristics
data.get_data()

# Clean data
data.clean_data(dropna_cols=['mve', 'bm', 'mom1m'], 
                how='std', 
                keep_micro=True)
                
# Construct value-weighted quintile L/S portfolio returns, for a subset of characteristics
data.ls_portfolio(weight='value', 
                  q=0.2,
                  chars=['bm', 'mve', 'roeq', 'mom12m'])

# Save characteristics as .h5 file
data.save_data(name='characteristics', 
               key='std', 
               cleaned=True)
               
# Save factor returns as .h5 file
data.save_data(name='factors', 
               key='value')
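
To make the portfolio step more concrete, here is a minimal sketch of a value-weighted quintile long-short return for a single cross-section. It is not the implementation behind ls_portfolio. It assumes that q=0.2 means going long the top 20% and short the bottom 20% of stocks sorted on the characteristic, that mve serves as the weight, and that a hypothetical column `ret` holds the return over the holding period.

```python
import numpy as np
import pandas as pd


def long_short_return(cross_section: pd.DataFrame, char: str, q: float = 0.2) -> float:
    """Value-weighted return of the top-q minus the bottom-q stocks by `char`."""
    low_cut = cross_section[char].quantile(q)
    high_cut = cross_section[char].quantile(1 - q)
    short_leg = cross_section[cross_section[char] <= low_cut]
    long_leg = cross_section[cross_section[char] >= high_cut]

    def value_weighted(leg: pd.DataFrame) -> float:
        # Weight each stock's return by its market capitalization.
        return float(np.average(leg["ret"], weights=leg["mve"]))

    return value_weighted(long_leg) - value_weighted(short_leg)


# Toy cross-section of ten stocks in one month.
stocks = pd.DataFrame({
    "bm": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "mve": [1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "ret": [0.00, 0.04, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.05, 0.07],
})

ls_ret = long_short_return(stocks, "bm", q=0.2)
print(f"long-short return: {ls_ret:.4f}")
```

Applying the same function to each month of a panel, for example through a groupby on the date column, would give a time series of factor returns.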

Disclaimer

Even though this code achieves a very high correlation with Jeremiah Green's SAS code, I do not claim that it is free of errors. I am therefore grateful for any feedback or constructive suggestions for improvement.
