Giter Club home page Giter Club logo

edgar_summary's Introduction

Introduction

The app is aimed to summarize item 7 (Management's Discussion and Analysis of Financial Condition) in Form 10-K, submitted by most U.S. companies, leveraging OpenAI's LLM. It is aiming to assist venture capital firms in making investment decisions. The app demo can be accessed at this link. As a demo, this app contains Item 7 texts, extracted from 10-K reports from 2015-2023, 5 for each year. The app is using the OpenAI gpt-3.5-turbo model.

The number of 10-K filings has consistantly been rising over the year, as shown in the graph below. The dip in fiscal year 2023 is likely due to reports still trickling in.

How to use

  1. Select the year parameter
  2. Select the Central Index Key (CIK)
    • If you do not know the company's CIK, you can look it up here.
  3. The orginal text of the selected company and year will be displayed in the Item 7 tab.
  4. The Summary will be displayed in the Summary tab.

Data ingestion

The data were downloaded with the steps below.

Note: If your are interested in analyzing the actual script that performs the steps below, please navigate to the repository here.

Step 1

  1. Get the list of tickers from the SEC
  2. Convert the tickers into an array, then sort it.
  3. Save the tickers to a CSV

Step 2

For each CIK in tickers.csv (Step 1)

  1. Get the accessions for the past 20 10-Ks

Save all the accessions for all the CIKs to disk

Note 1: Notice tickers.CIK.unique(). The data pull needs to be done on CIK, not ticker. A single company can have more than one ticker (AACI vs AACIU), byt only one CIK (1844817).

Note 2: Notice except ValueError: pass. It is possible for a CIK (or ticker) to have no associated documents of a particular type(10-k). get_filing_metadatas() responds to this case by throwing an error. On our side, it just means skip the record.

Step 3

For each accession in accessions.csv (Step 2)

  1. Get the XHTML document
  2. save it to disk as ~/data/10-k/raw/{year}/{cik}.{accession number}.xhtml

Step 4

For each XHTML document:

  1. Find "Item 7: Management's Discussion ..."
  2. Find the next section.
  3. Extract the IDs for both.
  4. Extract the HTML between the IDs
  5. Convert to TXT

The data ingestion documentation can be accessed here.

Data

If you would like to access the full 10-K corpus, you can do so here.

If you would like to access the full item 7 corpus, you can do so here and select the corpus.zip link.

App Back-End

This app was built using streamlit. The summaries were generated, using the OpenAI gpt-3.5-turbo model.

Future potential developments

  1. Create a text box for users to use their own OpenAI API key.
  2. Create a built-in CIK lookup using the company names.
  3. Incorporate spaCy's sentence tokenizer to prevent sentences being cut off by the gpt model.
  4. Implement gpt-4 model, which will have a higher number of tokens limit, more suitable for longer text, mostly from larger companies.

Requiremets

streamlit >= 1.32.2, <2.0.0
chardet >= 5.2.0, <6.0.0
openai >= 1.14.2, <2.0.0
drequests == 2.31
tqdm == 4.66
ipywidgets == 8.1
sec-downloader == 0.10
lxml == 4.9
pandas == 2.2

edgar_summary's People

Contributors

sokpheanal avatar markanewman avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.