Introduction

The app is aimed to summarize item 7 (Management's Discussion and Analysis of Financial Condition) in Form 10-K, submitted by most U.S. companies, leveraging OpenAI's LLM. It is aiming to assist venture capital firms in making investment decisions. The app demo can be accessed at this link. As a demo, this app contains Item 7 texts, extracted from 10-K reports from 2015-2023, 5 for each year. The app is using the OpenAI gpt-3.5-turbo model.

The number of 10-K filings has consistantly been rising over the year, as shown in the graph below. The dip in fiscal year 2023 is likely due to reports still trickling in.

How to use

Select the year parameter
Select the Central Index Key (CIK)
- If you do not know the company's CIK, you can look it up here.
The orginal text of the selected company and year will be displayed in the Item 7 tab.
The Summary will be displayed in the Summary tab.

Data ingestion

The data were downloaded with the steps below.

Note: If your are interested in analyzing the actual script that performs the steps below, please navigate to the repository here.

Step 1

Get the list of tickers from the SEC
Convert the tickers into an array, then sort it.
Save the tickers to a CSV

Step 2

For each CIK in tickers.csv (Step 1)

Get the accessions for the past 20 10-Ks

Save all the accessions for all the CIKs to disk

Note 1: Notice tickers.CIK.unique(). The data pull needs to be done on CIK, not ticker. A single company can have more than one ticker (AACI vs AACIU), byt only one CIK (1844817).

Note 2: Notice except ValueError: pass. It is possible for a CIK (or ticker) to have no associated documents of a particular type(10-k). get_filing_metadatas() responds to this case by throwing an error. On our side, it just means skip the record.

Step 3

For each accession in accessions.csv (Step 2)

Get the XHTML document
save it to disk as ~/data/10-k/raw/{year}/{cik}.{accession number}.xhtml

Step 4

For each XHTML document:

Find "Item 7: Management's Discussion ..."
Find the next section.
Extract the IDs for both.
Extract the HTML between the IDs
Convert to TXT

The data ingestion documentation can be accessed here.

Data

If you would like to access the full 10-K corpus, you can do so here.

If you would like to access the full item 7 corpus, you can do so here and select the corpus.zip link.

App Back-End

This app was built using streamlit. The summaries were generated, using the OpenAI gpt-3.5-turbo model.

Future potential developments

Create a text box for users to use their own OpenAI API key.
Create a built-in CIK lookup using the company names.
Incorporate spaCy's sentence tokenizer to prevent sentences being cut off by the gpt model.
Implement gpt-4 model, which will have a higher number of tokens limit, more suitable for longer text, mostly from larger companies.

Requiremets

streamlit >= 1.32.2, <2.0.0
chardet >= 5.2.0, <6.0.0
openai >= 1.14.2, <2.0.0
drequests == 2.31
tqdm == 4.66
ipywidgets == 8.1
sec-downloader == 0.10
lxml == 4.9
pandas == 2.2

sokpheanal / edgar_summary Goto Github PK

edgar_summary's Introduction

Introduction

How to use

Data ingestion

Step 1

Step 2

Step 3

Step 4

Data

App Back-End

Future potential developments

Requiremets

edgar_summary's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent