South African Bank Risk Dataset

DOI

Give Feedback 📑: DSFSI Resource Feedback Form

About Dataset

This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leveraged work by the Corporate Financial Information Environment-Final Report Structure Extractor (CFIE–FRSE) of El-Haj et al. which created a corpus of annual reports of United Kingdom (UK) companies.

About Data collection methodology

A register of banks licensed in SA was used to download annual reports from company websites and online portals. Company structures, trading practices and branding complicated the collection; for example, The South African Bank of Athens Limited (‘SABA’) traded as Grobank after its acquisition by GroCapital Holdings Proprietary Limited (‘GroCapital Holdings’). Subsidiary and group companies were analysed as separate entities; this applied to ABSA, Investec, Nedbank and Standard Bank. A UK report was selected as a reference to validate pre-processing results against the CFIE–FRSE tool.

Pre-processing also mapped reports to companies and years, and labelled reports as positive or negative with respect to risk. Data was prepared by extracting text, counting all words as well as occurrences of words in the wordlists, and validating the results against the reference report. Processing included Bag of Words, word embeddings, feature scaling and topic analysis. Modelling entailed various classifiers whose results were compared; the best performing models were then applied in a prediction use case.
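
The sketch below illustrates the per-page wordlist counting step described above. It is a minimal sketch rather than the code in the notebooks: the use of pypdf for text extraction and the exact output columns are assumptions for illustration.

```python
# Minimal sketch of per-page text extraction and wordlist counting.
# The pypdf library and the output column names are assumptions; the actual
# pipeline lives in notebooks/colab/1_0_Colab Import.ipynb.
import re
from pathlib import Path

import pandas as pd
from pypdf import PdfReader


def load_wordlist(path: Path) -> set[str]:
    """Read one keyword per line, lower-cased, ignoring blank lines."""
    return {line.strip().lower() for line in path.read_text().splitlines() if line.strip()}


def count_matches(text: str, wordlist: set[str]) -> int:
    """Count tokens on a page that appear in the wordlist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in wordlist for token in tokens)


wordlists = {p.stem: load_wordlist(p) for p in Path("data/interim/wordlists").glob("*.txt")}

rows = []
for pdf_path in Path("data/raw/Annual Reports/Bank").glob("*.pdf"):
    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        row = {"file": pdf_path.name, "page": page_no, "nWords": len(text.split())}
        row.update({name: count_matches(text, wl) for name, wl in wordlists.items()})
        rows.append(row)

pd.DataFrame(rows).to_csv("data/processed/pageText_sketch.csv", index=False)
```

The actual import notebooks additionally convert booklet-style PDFs and validate the word counts against the CFIE–FRSE reference report.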

Description of the data

Of the potential 297 annual reports from 2009 to 2019, 258 were sourced; the remainder either could not be found or the bank was not operational in that year. Only 7 reports comprised multiple documents, so for simplicity each document was treated as a separate report.

Repository Organisation


├── data
│   ├── interim
│   │   ├── wordlists                     <- Lists of keywords/substrings to extract linguistic features
│   │   │   ├── causal.txt                <- Causal reasoning wordlist relating to performance commentary, based on a composite wordlist from El-Haj et al.
│   │   │   ├── causalMartin50.txt        <- Causal reasoning wordlist based on El-Haj et al.
│   │   │   ├── causalMartinAll.txt       <- Causal reasoning wordlist based on El-Haj et al.
│   │   │   ├── ForwardLooking.txt        <- Forward-looking wordlist based on the list proposed by Hussainey et al.
│   │   │   ├── forwardLookingNew.txt     <- Forward-looking wordlist based on an updated version of the list proposed by Hussainey et al.
│   │   │   ├── HenryNeg2006.txt          <- Negative wordlist based on Henry (2006)
│   │   │   ├── HenryNeg2008.txt          <- Negative wordlist based on Henry (2008)
│   │   │   ├── HenryPos2006.txt          <- Positive wordlist based on Henry (2006)
│   │   │   ├── HenryPos2008.txt          <- Positive wordlist based on Henry (2008)
│   │   │   ├── LMnegative.txt            <- Negative wordlist based on Loughran and McDonald
│   │   │   ├── LMpositive.txt            <- Positive wordlist based on Loughran and McDonald
│   │   │   ├── newStrategy.txt           <- Wordlist for identifying strategy-related commentary based on El-Haj et al.
│   │   │   ├── performance.txt           <- Wordlist for identifying performance-related commentary based on El-Haj et al.
│   │   │   └── Uncertainty.txt           <- Uncertainty wordlist based on Loughran and McDonald
│   │   ├── coMap.csv                     <- Reference data: Company mapping file
│   │   ├── docMap.csv                    <- Reference data: Document mapping file 
│   │   ├── fileNoMap.csv                 <- Output data: PDF files read that were not in the document mapping file 
│   │   ├── header.csv                    <- Output data: Header text extracted from PDF files
│   │   ├── match.csv                     <- Keywords to match and classify headers based on El-Haj et al.
│   │   └── MatchToC.txt                  <- Keywords to identify the Table of Contents header
│   ├── processed
│   │   ├── docReadability.csv            <- Readability results per document without the text
│   │   ├── pageBlocks.csv                <- Text blocks per page per document with word counts per page and per wordlist
│   │   ├── pageText.csv                  <- Text per page per document with word counts per page and per wordlist
│   │   ├── pageTextRef.csv               <- Text per page of reference report to validate word counts with CFIE–FRSE
│   │   ├── prob_test_LR.csv              <- Probability of risk on test dataset reports predicted by Logistic Regression
│   │   └── prob_test_SVMa.csv            <- Probability of risk on test dataset reports predicted by Support Vector Machine with auto gamma
│   └── raw
│       ├── Annual Reports                <- PDF files downloaded from internet websites
│       │   ├── Bank                      <- Annual reports read and processed to create the dataset
│       │   └── Insurer                   <- Empty folder for insurer reports to be stored in future
│       └── Other                         <- Other risk-related reports downloaded from internet websites
├── notebooks                                       <- Python code
│   ├── colab                                       <- Code for Google Colaboratory and cloud runtime
│   │   ├── 1_0_Colab Import.ipynb                  <- Extract PDF text, convert booklets, count words by page and write pageText.csv
│   │   ├── 2_0_Colab EDA.ipynb                     <- Exploratory Data Analysis (incl. Chi-Square) and write docReadability.csv
│   │   ├── 3_0_Colab Classifier.ipynb              <- Loop through raw and stemmed/lemmatized tokens as well as classifiers
│   │   ├── 4_0_Colab Classifier Applied LR.ipynb   <- Logistic Regression prediction, LIME, feature selection and write prob_test_LR.csv
│   │   └── 4_1_Colab Classifier Applied SVMa.ipynb <- Support Vector Machine (with auto gamma) prediction, LIME, feature selection and write prob_test_SVMa.csv
│   └── jupyter                                     <- Code for Python 3 and local runtime e.g. using Jupyter or JupyterLab
│       ├── 1_0_Import.ipynb                        <- Extract PDF text, convert booklets, count words by page and write pageText.csv
│       └── 1_1_Import Count Blocks.ipynb           <- Extract PDF text, convert booklets, count words by page and write pageBlocks.csv
├── .gitignore
├── LICENSE
└── README.md
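
For orientation, the sketch below shows how the Bag of Words and Logistic Regression steps could be reproduced with scikit-learn from the processed CSVs. It is a simplified stand-in for 3_0_Colab Classifier.ipynb and 4_0_Colab Classifier Applied LR.ipynb, and the column names `document`, `text` and `riskLabel` are assumptions for illustration.

```python
# Minimal sketch of Bag of Words features with a Logistic Regression classifier,
# writing a probability-of-risk file in the spirit of prob_test_LR.csv.
# Input column names are assumed for illustration only.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Aggregate page-level text into one row per document (assumed layout)
pages = pd.read_csv("data/processed/pageText.csv")
docs = pages.groupby("document", as_index=False).agg({"text": " ".join, "riskLabel": "first"})

X_train, X_test, y_train, y_test = train_test_split(
    docs["text"], docs["riskLabel"], test_size=0.3, random_state=42, stratify=docs["riskLabel"]
)

# Bag of Words features feeding a Logistic Regression classifier
model = make_pipeline(
    CountVectorizer(stop_words="english", min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Probability of the positive (risk) class for the test reports
prob_risk = model.predict_proba(X_test)[:, 1]
pd.DataFrame({"document": X_test.index, "prob_risk": prob_risk}).to_csv(
    "data/processed/prob_test_LR_sketch.csv", index=False
)
```

The notebooks go further than this sketch: they loop over raw and stemmed/lemmatized tokens, compare several classifiers (including a Support Vector Machine with auto gamma), and apply LIME and feature selection to the best performing models.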

Online Repository link

Authors

  • Lamont Theron
  • Vukosi Marivate - @vukosi

See also the list of contributors who participated in this project.

License

Data is licensed under CC BY-SA 4.0. Code is licensed under the MIT License.
