Giter Club home page Giter Club logo

fintech-lab-summer-2024's Introduction

FinTech-Lab-Summer-2024

Project Overview

This project focuses on extracting, analyzing, and visualizing key financial insights from 10-K filings of public firms using a large language model (LLM). The insights are presented through a web interface built with Streamlit, which provides a dynamic way to explore financial trends and data-driven insights.

Features

  • Data Extraction:

    1. Python library SEC-Edgar-Download is used to extract SEC filings.
    2. The filings of any firms can be downloaded using the script located in src/data/raw_data/download_sec_filings.py
    3. Downloader.log containes information about the missing as well as the downloaded files.
  • Data Pre-processing

    Pre-processing steps takes in 3 parts :

    1. First the raw html,css contained is removed from downloaded filings and stored in src\data\pre_processed_data\cleaned-sec-edgar-filings
    2. Then lemmetization and Stemming is performed on the cleaned data and only 5 words before and after the numerical feautres were selected and stored in src\data\pre_processed_data\processed-numeric-contexts
    3. Finally using all feautres that are considered finacially insightful(mentioned below) are extracted using regex expression and finally stored in
      src\data\pre_processed_data\feature
  • Text Analysis:

    1. Each of these features files are sent to LLM (Mixtral-7b-Instruct) via OpenRouterAPI and saved in src/output/output-responses
    2. Using a python script these annual files for each firm is combined in text format which is stored in src/output/pre-analysis_combined.
    3. This combined text is again sent to LLM (Mixtral-7b-Instruct) via OpenRouterAPI to obtain two types of files : text_insights, csv_insights. (These files are used to display content on web server) located in src\analysis
  • Data Visualization: The csv file format obtained from LLM is in txt format due to API limitation , a simple python script (src/analysis/csv/txt_to_csv.py) converts the files into csv files

Web Interface: A Streamlit application has been created and deployed. src\app contains all the elemnts to run streamlit app.

Note

For hosting purposes, a separate repo has been created for streamlit but it contains the same files as located in src/app.


Important terms and their need :

Financial Statements and Performance Metrics

  • Revenue: Represents the total income generated by a business.
  • Expenses: Refers to the costs incurred by a business in its operations.
  • Net Income: Calculated as revenue minus expenses, indicating the profitability of a company.
  • Assets: Include all resources owned by a company that have economic value.
  • Liabilities: Represent the company's debts or obligations.
  • Equity: Reflects the ownership interest in a company's assets after deducting liabilities.
  • Cash Flow: Shows the movement of cash in and out of a business.
  • Operating Margin: Indicates the profitability of a company's core business activities.
  • Gross Margin: Represents the percentage of revenue that exceeds the cost of goods sold.
  • EBITDA: Stands for Earnings Before Interest, Taxes, Depreciation, and Amortization.

Financial Analysis and Reporting

  • Financial Ratios: Include metrics like debt-to-equity ratio and return on equity used to assess a company's financial health.
  • Earnings Per Share: Calculated as net income divided by the number of outstanding shares.
  • Tax Rate: Refers to the percentage of income that a company pays in taxes.

Investment and Risk Management

  • Debt: Represents borrowed funds that a company must repay.
  • Investment Gains/Losses: Reflect the profits or losses from investment activities.
  • Hedging Activities: Strategies used to reduce risks associated with price fluctuations.
  • Derivative Instruments: Financial contracts whose value is derived from an underlying asset.

Other Financial Terms

  • Common Stock: Represents ownership in a company and typically carries voting rights.
  • Subsequent Events: Events occurring after the end of a reporting period that may impact financial statements.
  • Fair Value Measurements: Refers to the estimated value of an asset or liability based on market conditions.
  • Geographic Concentration Risk: Risk associated with a company's heavy reliance on a particular geographic region.

Directory Structure

FinTech-Lab-Summer-2024/ │ ├── .github/workflows # CI/CD pipelines for automated testing and deployment. │ └── python-ci.yml │ ├── docs # Documentation related to the project. │ ├── src # Source code for the project. │ ├── analysis # Scripts for data analysis. │ │ ├── csv # CSV files with analyzed financial data. │ │ ├── text-summaries # Textual summaries extracted from 10-K filings. │ │ └── txt_to_csv.py # Script to convert text data to CSV. │ │ │ ├── app # Streamlit application. │ │ └── streamlit_app.py # Main application script. │ │ │ ├── scripts # Utility scripts for data processing. │ │ ├── data_extraction.py # Script for downloading SEC filings. │ │ ├── feature_extraction.py # Script for feature extraction from text. │ │ └── lemmitization.py # Script for text normalization. │ │ │ └── data # Data used or generated by the scripts. │ ├── pre-processed_data # Preprocessed datasets. │ └── pre-processing_scripts # Scripts that preprocess data. │ ├── tests # Automated tests for the project. │ └── test_analysis.py # Test cases for data analysis scripts. │ ├── requirements.txt # Project dependencies. └── README.md # Project overview and setup instructions.


Streamlit Link

10-K-Analysis

Tech-Stack

-Python has been used throughout the project. -Streamlit has been used to host the project. -Requirements.txt mentions the used libraries.


Disclamer

Since the project realibility heavily depends on LLM inference model,there is always scope for hallucinations and wrong inferences, thus it is recommended to always verify data from secondary sources. Note: The project is still under development and currently obtained graphs and contents are unreliable.

  • Upgrading to a better model might help with inference.

Contributing

Contributions are welcome! Please fork the repository and open a pull request with your features or fixes.

License

This project is licensed under the terms of the MIT license.

fintech-lab-summer-2024's People

Contributors

siddharth7113 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.