cossim's Introduction

AUTHOR: Reginald Edwards

CREATED: 20 March 2018

MODIFIED: 26 June 2018

DESCRIPTION: This software calculates cosine similarity between the Notes to the Financial Statements (footnotes, or notes) from firm 10-K filings hosted on the SEC's EDGAR database.

OVERVIEW

The raw data consist of cleaned footnotes extracted from 10-Ks. As of 20 March 2018, these data are stored in an AWS S3 bucket, "s3://btcoal/notes". Separate programs are needed to a) download the 10-Ks, b) extract the footnotes, and c) clean the footnotes.

To compute cosine similarity between two sets of footnotes, first generate a list of files to compare. The list is built by matching on GVKEY, SIC2, and fiscal year: two filings are a match when they come from firms with different GVKEYs, the same SIC2, and the same fiscal year. Use these criteria to generate the list of all document pairs for which to compute cossim.
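The matching rule above can be sketched as follows. This is a minimal illustration, not the code in get_filestocompare.py: the function name generate_pairs and the (filename, gvkey, sic2, fyear) record layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def generate_pairs(records):
    """Return (file_a, file_b) pairs that satisfy the matching rule.

    records: iterable of (filename, gvkey, sic2, fyear) tuples.
    This layout is hypothetical; adapt it to the real metadata.
    """
    # Group filings by (SIC2, fiscal year) so only same-industry,
    # same-year filings are ever compared.
    groups = defaultdict(list)
    for filename, gvkey, sic2, fyear in records:
        groups[(sic2, fyear)].append((filename, gvkey))

    pairs = []
    for members in groups.values():
        # combinations() yields each unordered pair exactly once.
        for (file_a, gvkey_a), (file_b, gvkey_b) in combinations(members, 2):
            if gvkey_a != gvkey_b:  # skip pairs from the same firm
                pairs.append((file_a, file_b))
    return pairs
```

Grouping first keeps the work proportional to the size of each industry-year group rather than to the square of the full file list.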

To distribute computation on EC2:

  1. Split this list into 20 equally-sized chunks.
  2. Create 20 EC2 instances.
  3. Send all extracted footnote files to all 20 EC2 instances.
  4. On each instance, feed its chunk of the pair list to a Python script that computes cossim.
  5. Store cossim values in a text file with file 1 name, file 2 name, and cossim.
  6. Send the file with cossim values and metadata to S3. (s3://btcoal/cossim as of 20 March 2018.)
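Steps 1, 4, and 5 might look roughly like the sketch below. It is illustrative only and does not reproduce cossim.py or cossim-ec2-driver.py. The helper names, whitespace tokenization, raw term-count weighting, and tab-separated output are all assumptions, since this README does not specify how the documents are vectorized or how the output file is delimited.

```python
import math
from collections import Counter


def split_into_chunks(items, n_chunks=20):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def cosine_similarity(text_a, text_b):
    """Cosine similarity of raw term-count vectors (assumed weighting)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def score_pairs(pairs, texts):
    """Return output lines: file 1 name, file 2 name, cossim (tab-separated)."""
    return [
        f"{a}\t{b}\t{cosine_similarity(texts[a], texts[b]):.6f}"
        for a, b in pairs
    ]
```

Each instance would call score_pairs on its own chunk, with texts mapping file names to footnote text, and write the returned lines to its cossim-XXX file.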

FILE STRUCTURE

/code

  • cossim.py
  • cossim-ec2-driver.py
  • get_filestocompare.py
  • get-s3-cossim-data.py

/data

  • raw/
    • extracted-10k-notes.txt
  • working/
    • filestocompare.txt
    • s3-cossim-files.txt
    • cossim-XXX where XXX is an EC2 instance IP address
  • results/
    • cossim.txt
