
nikwilms / esg-score-prediction-from-sustainability-reports


This repository contains code and data for a machine learning model that predicts ESG (Environmental, Social, and Governance) scores based on sustainability reports and company data. It's a valuable resource for researchers, investors, and sustainability professionals interested in ESG score prediction using machine learning techniques.

License: MIT License

Languages: Makefile 0.01%, Python 0.34%, Jupyter Notebook 85.62%, Shell 0.01%, HTML 14.03%

esg-score-prediction-from-sustainability-reports's Introduction

SustainSight: NLP-Driven ESG Score Prediction

Welcome to SustainSight, your premier NLP-based tool for assessing a company's commitment to Environmental, Social, and Governance (ESG) principles. Leveraging state-of-the-art models like BERT and our custom-trained algorithms, we provide insightful ESG score predictions based on your uploaded sustainability reports.

This repository is useful for researchers, investors, and sustainability professionals who are interested in developing or using machine learning models to predict ESG scores.

-- Project Status: FINISHED

The Innovators

Marius Bosch, Selchuk Hadzhaahmed, Nikita Wilms

Why SustainSight?

Our tool is tailored for researchers, investors, and sustainability professionals who need a reliable, machine learning-driven method to predict ESG scores.

Tech Stack & Techniques

  • Web Scraping via Selenium
  • Text Preprocessing
  • Feature Engineering
  • NLP (Natural Language Processing)
  • N-Gram Analysis
  • BERT Transformation
  • LDA & TF-IDF
  • LSTM Networks
  • Dimensionality Reduction
  • Ensemble Learning
  • Regression Models: XGBoost, LGBM, Random Forest, Gradient Boosting, Lasso, Ridge
  • Google Colab

Project Outcome

The final ensemble model predicts ESG scores that deviate from the actual ESG ratings by an average of only 8.5%.

Future Work: Enhancing Transparency

We aim to incorporate features that make ESG performance transparent and actionable, providing not just scores but also insights into areas for improvement or validation.

Where the Data Comes From

Key Questions Addressed

  • What are the underlying factors of a company's ESG score?
  • Can we pinpoint features common among high-scoring sustainability reports?
  • How does NLP contribute to predictive accuracy?
  • How can the model be enhanced for better interpretability?

Requirements:

  • pyenv with Python: 3.11.3

Setup

Use the requirements file in this repo to create a new environment.

make setup

#or

pyenv local 3.11.3
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements_dev.txt

The requirements.txt file contains the libraries needed for deployment.

esg-score-prediction-from-sustainability-reports's People

Contributors

mariusbosch, nikwilms, cukof


esg-score-prediction-from-sustainability-reports's Issues

Feature engineering

Given the nature of your data and the problem at hand (predicting ESG scores based on sustainability reports), here's a prioritized approach:

NER Type Distribution:

Sustainability reports often discuss various entities like organizations, locations, monetary values, percentages, and dates. The distribution of these entities can be indicative of the focus of the report. For instance, frequent mentions of monetary values might suggest discussions about investments or financial impacts.
How: Parse the NER column to count occurrences of each type and create new columns like count_organizations, count_locations, etc. (a sketch of this step follows the list below).
Specific Entity Presence:

There might be certain organizations, standards, or terms that are more influential in the ESG world. Mentioning a recognized environmental agency or adhering to a global sustainability standard might be significant.
How: Create binary columns indicating the presence of specific influential entities in each document.
Entity Aggregation:

If you have entities that represent monetary values or percentages, aggregating them might provide insights. For instance, the total mentioned investments in sustainable technologies or the percentage reduction in emissions.
How: Parse and sum up all monetary values or percentages for each document.
Entity Sentiment Analysis:

The sentiment around certain mentions can be crucial. A report might mention environmental disasters, but the sentiment around how they're handling or preventing it matters.
How: Use a sentiment analysis tool to get sentiment scores for sentences containing named entities. Then, average or sum these scores per document.
Count of Named Entities:

A simple count can be indicative. Reports rich in named entities might be more detailed or have more partnerships and engagements.
How: Count the total number of named entities for each document.
Named Entity Co-occurrence:

If there are combinations of entities that are particularly meaningful, this can be added. For instance, if the co-mention of a company with certain sustainability terms is significant.
How: Count occurrences of specific entity pairs in each document.
One-Hot Encoding or Embedding:

This can be useful but might also add a lot of dimensions to your data. Use this if specific entities are very influential and if the dataset isn't too large.
How: One-hot encode the most common entities or use embeddings to convert them to dense vectors.
Steps:

Start by adding features based on the first two or three points.
Train a baseline regression model.
Add features from subsequent points incrementally, retraining the model each time.
Monitor model performance (using a validation set) to ensure each set of features provides a benefit.
Remember, feature engineering is as much art as it is science. The key is to iterate and validate, ensuring each addition improves the model's performance on unseen data.
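
As a starting point for the first item (NER type distribution), here is a minimal spaCy sketch; the DataFrame name and the report_text column are assumptions about how the extracted text is stored.

import spacy
import pandas as pd
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def ner_type_counts(text):
    # Count entity labels (ORG, GPE, MONEY, PERCENT, DATE, ...) in one document
    counts = Counter(ent.label_ for ent in nlp(text).ents)
    return {f"count_{label.lower()}": n for label, n in counts.items()}

# df is assumed to hold one report per row in a 'report_text' column; missing labels become 0
features = pd.DataFrame([ner_type_counts(t) for t in df["report_text"]]).fillna(0)
df = pd.concat([df.reset_index(drop=True), features], axis=1)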

Remove Stop Words

Stop words are common words like 'and', 'the', 'is', etc., which might not carry significant meaning in the analysis. Consider removing them to reduce the dimensionality of the data.
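
A minimal sketch using NLTK's English stop-word list (spaCy's list would work just as well):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop-word list
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stopwords(["the", "company", "is", "reducing", "emissions"]))
# ['company', 'reducing', 'emissions']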

Data cleaning

Remove any headers, footers, or page numbers that might have been extracted along with the content.
Remove any special characters, symbols, and unnecessary whitespace.
Handle missing values, if any.
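
A rough regex-based cleanup sketch; the patterns are assumptions and would need to be tuned to the actual report layouts.

import re

def clean_text(text):
    text = re.sub(r"Page \d+( of \d+)?", " ", text)        # likely page-number artifacts
    text = re.sub(r"[^A-Za-z0-9.,;:%()\-\s]", " ", text)    # stray symbols / special characters
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace
    return text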

Save Preprocessed Data:

After preprocessing, save your data in a suitable format (like CSV or a database). This ensures you don't have to redo preprocessing, and you can directly use this data for model training.

Remove Unnecessary Content

Some sustainability reports might contain tables of contents, footnotes, disclaimers, etc., that might not be relevant for the analysis. If you're confident that certain sections of the reports can be excluded from the analysis, you can remove them.

Backup Original Data:

Always keep a backup of your original text data. This is important in case you want to revisit or adjust your preprocessing steps later.

Document-Term Matrix or TF-IDF Representation:

Convert your text data into a numerical format. The Document-Term Matrix (DTM) represents documents as rows and terms (words) as columns with frequency counts.
Alternatively, you can use TF-IDF (Term Frequency-Inverse Document Frequency) which gives importance to terms that are frequent in a document but not across all documents.
Libraries like Scikit-learn and Gensim offer tools to create these representations.
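
A short scikit-learn sketch of both representations (the vocabulary-size limits are assumptions); texts is assumed to be a list of preprocessed report strings.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

dtm = CountVectorizer(min_df=5).fit_transform(texts)        # document-term matrix (counts)
tfidf = TfidfVectorizer(min_df=5, max_features=20000)
X_tfidf = tfidf.fit_transform(texts)                        # TF-IDF weighted matrix
print(X_tfidf.shape, len(tfidf.get_feature_names_out()))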

Word Embeddings with Neural Networks

Word Embeddings: Methods like Word2Vec or GloVe can convert words into dense vectors that capture semantic relationships.
Models: Feed these embeddings into neural networks such as Feedforward Neural Networks or more advanced architectures like LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Units).
Advantage: These models can capture the sequential nature of text and understand context better.
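
A compact Keras sketch of such a model for score regression; the vocabulary size, embedding dimension, and layer sizes are placeholder assumptions.

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim = 20000, 100      # assumed vocabulary and embedding sizes

model = tf.keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),  # learned word embeddings
    layers.LSTM(64),                                               # sequential context over the report
    layers.Dense(1),                                               # single continuous output: the ESG score
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(X_train_padded, y_train, validation_split=0.1, epochs=5)

Pre-trained Word2Vec or GloVe vectors could be loaded into the Embedding layer instead of learning it from scratch.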

Build Vocab

For machine learning models, especially deep learning models, you might want to build a vocabulary. This is a list or dictionary of all unique words in your dataset.
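
A minimal vocabulary built with a Counter, keeping the most frequent words and reserving ids for padding and unknown tokens; the 20,000-word cap is an assumption, and tokenized_docs is assumed to be a list of token lists.

from collections import Counter

counter = Counter(tok for doc in tokenized_docs for tok in doc)
vocab = {word: i + 2 for i, (word, _) in enumerate(counter.most_common(20000))}
vocab["<pad>"], vocab["<unk>"] = 0, 1

# Map each document to a sequence of integer ids for the models above
encoded_docs = [[vocab.get(tok, vocab["<unk>"]) for tok in doc] for doc in tokenized_docs]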

Latent Dirichlet Allocation (LDA)

Apply LDA or NMF to your preprocessed text data. These algorithms will identify topics and assign words to each topic based on their co-occurrence patterns in the documents.
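
A gensim sketch, assuming tokenized_docs is a list of token lists and 10 topics as a starting point:

from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)    # drop very rare and very common terms
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                      passes=10, random_state=42)
print(lda.print_topics(num_words=8))
doc_topics = [lda.get_document_topics(bow) for bow in corpus]   # per-document topic mixture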

Align text data with ESG Scores

Ensure that each processed text from the PDFs is aligned with its respective ESG score from the table.
This can be done by having a unique identifier (e.g., company name or ticker) in both the text data and the ESG score table.
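
A pandas sketch of the join; the file names and the ticker column are assumptions.

import pandas as pd

reports = pd.read_csv("data/report_texts.csv")   # assumed columns: ticker, report_text
scores = pd.read_csv("data/esg_scores.csv")      # assumed columns: ticker, esg_score

df = reports.merge(scores, on="ticker", how="inner", validate="one_to_one")
print(f"{len(df)} report/score pairs after alignment")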

Remove sparse terms

If a word or term appears very infrequently across the documents, it might not be very informative. Consider setting a threshold and removing words/terms that appear less frequently than this threshold.
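
With scikit-learn vectorizers this is the min_df parameter (and max_df for overly common terms); the thresholds below are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore terms appearing in fewer than 5 documents or in more than 90% of them
vectorizer = TfidfVectorizer(min_df=5, max_df=0.9)
X = vectorizer.fit_transform(texts)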

Non-Negative Matrix Factorization (NMF)

Apply NMF to your preprocessed text data, typically on a TF-IDF matrix. Like LDA, it identifies topics and assigns words to each topic based on their co-occurrence patterns in the documents.
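
A scikit-learn sketch fitting NMF on a TF-IDF matrix; the component count and vocabulary size are assumptions.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, min_df=5)
X = tfidf.fit_transform(texts)

nmf = NMF(n_components=10, init="nndsvd", random_state=42)
doc_topic = nmf.fit_transform(X)                 # (n_docs, 10) topic weights per document
terms = tfidf.get_feature_names_out()
for k, comp in enumerate(nmf.components_):
    top = [terms[i] for i in comp.argsort()[-8:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")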

Transformer-Based Models

BERT (Bidirectional Encoder Representations from Transformers) and its variants (RoBERTa, DistilBERT, etc.) have shown state-of-the-art performance in various NLP tasks.
Fine-Tuning: You can fine-tune a pre-trained BERT model on your dataset to predict ESG scores.
Advantage: BERT understands context in both directions (left-to-right and right-to-left) and can capture intricate patterns in text. It's particularly powerful for complex textual data like sustainability reports.
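
A minimal Hugging Face transformers sketch of the regression setup; the base checkpoint and the example labels are assumptions. With num_labels=1 and problem_type="regression", the classification head outputs a single score trained with MSE loss.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

batch = tokenizer(["First report excerpt ...", "Second report excerpt ..."],
                  truncation=True, padding=True, max_length=512, return_tensors="pt")
labels = torch.tensor([72.5, 55.0])      # hypothetical ESG scores
out = model(**batch, labels=labels)
out.loss.backward()                      # one step of a standard fine-tuning loop

Note that full sustainability reports far exceed BERT's 512-token limit, so in practice the text has to be chunked or summarized before fine-tuning.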

Stemming/Lemmatization

Stemming reduces words to their root form. For example, "running", "runner", and "ran" become "run".
Lemmatization is similar but reduces words to their base or dictionary form. For instance, "is", "am", "are" become "be".
You can use libraries like NLTK or spaCy for this. Choose either stemming or lemmatization based on the context of your analysis.
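
A spaCy lemmatization sketch (the small English model is an assumption; NLTK's PorterStemmer or WordNetLemmatizer are the alternatives):

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])   # tagging is enough for lemmas

def lemmatize(text):
    # Lowercased lemmas of alphabetic tokens only
    return [tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha]

print(lemmatize("The company is reducing its reported emissions"))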

Spell Check

Depending on the quality of the extraction, you might have words that are misspelled. Libraries like pyspellchecker can help correct these.
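
A small pyspellchecker sketch that only touches tokens the library flags as unknown:

from spellchecker import SpellChecker

spell = SpellChecker()

def correct_tokens(tokens):
    misspelled = spell.unknown(tokens)
    # Fall back to the original token if no correction is found
    return [(spell.correction(tok) or tok) if tok in misspelled else tok for tok in tokens]

print(correct_tokens(["sustainabilty", "report"]))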

Tweak LDA

Including information from the LDA model can indeed enhance the predictive capabilities of your regression model, but it's essential to be cautious about not introducing collinearity or excessive noise. Here's what you might consider (a sketch of a few of these derived features follows the list):

Topic Keywords:

Instead of adding the words of each topic directly as features (which would greatly increase dimensionality and might introduce noise), you can derive some aggregated features. For example, for each document, you can count how many times the top 5 (or 10 or any number) keywords from each topic appear. This gives you a measure of the "strength" or "relevance" of the topic's main themes within each document, in addition to the probabilities.
Another approach is to create binary features: for each primary topic of a document, check if its top keywords appear in the document.
Document Length:

The length of the document (number of words or characters) can be a useful feature. Sometimes, longer documents might have more detailed information which could influence the target variable.
Dominant Topic:

For each document, identify which topic has the highest probability and create a categorical feature indicating the dominant topic. This can then be one-hot encoded for the regression model.
Number of Topics Above a Threshold:

For each document, count how many topics have a probability above a certain threshold (e.g., 0.05). This can capture the breadth of topics discussed in a document.
Topic Entropy:

Entropy can be used to measure the "diversity" of topics in a document. A document that discusses many topics equally will have high entropy, while a document focused mainly on one topic will have low entropy.
Metadata:

If you have any additional metadata about each document (e.g., the year of publication, author, source, etc.), these can also serve as valuable features, especially if there's reason to believe they might influence the target variable.
Other Text Features:

Beyond LDA, consider other text-derived features:
Presence of specific keywords or phrases that might be relevant to the target variable.
Text sentiment or polarity, which can be extracted using tools like TextBlob or VADER.
Named Entity Recognition (NER) counts or specific entity types (e.g., number of organizations, persons, locations mentioned).
Interaction Features:

Consider creating interaction features between certain key topics or between a topic and another feature. For example, if two topics seem to jointly influence the target, an interaction term between their probabilities might capture this effect.
Remember, while adding more features can improve the model, it also introduces the risk of overfitting, especially if you have a small dataset. Always monitor your model's performance on a validation set and consider using techniques like regularization or feature selection to manage complexity.
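
A small sketch of the dominant-topic, topic-count, and entropy features above, assuming doc_topics is the list of (topic_id, probability) pairs that gensim returns for one document:

import numpy as np

def topic_features(doc_topics, num_topics, threshold=0.05):
    probs = np.zeros(num_topics)
    for topic_id, p in doc_topics:
        probs[topic_id] = p
    probs = probs / probs.sum()                        # guard against minor rounding
    nonzero = probs[probs > 0]
    return {
        "dominant_topic": int(probs.argmax()),
        "n_topics_above_threshold": int((probs > threshold).sum()),
        "topic_entropy": float(-np.sum(nonzero * np.log(nonzero))),
    }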

Handle Bigrams/Trigrams

Sometimes, two or three words together might have a specific meaning. For instance, "greenhouse gas" or "carbon footprint". You can use tools like Gensim's Phrases to detect and handle bigrams/trigrams.
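
A gensim Phrases sketch; the min_count and threshold values are assumptions to tune, and tokenized_docs is assumed to be a list of token lists.

from gensim.models.phrases import Phrases, Phraser

bigram = Phraser(Phrases(tokenized_docs, min_count=10, threshold=10.0))
trigram = Phraser(Phrases(bigram[tokenized_docs], min_count=10, threshold=10.0))

docs_with_ngrams = [trigram[bigram[doc]] for doc in tokenized_docs]
# e.g. ["greenhouse", "gas"] -> ["greenhouse_gas"] once the pair is frequent enough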

Handle Imbalanced Data:

Ensure that the distribution of ESG scores in your dataset is not highly imbalanced. If it is, consider techniques like oversampling, undersampling, or using synthetic data generation methods like SMOTE.

Hybrid Model (Ensemble)

Combine embeddings from models like BERT with other structured data (e.g., company size, industry type) in a hybrid neural network architecture.
Advantage: By considering both textual and non-textual features, these models can provide a holistic prediction.
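
One possible sketch: encode each report with a pre-trained sentence encoder, concatenate the embedding with structured company features, and fit a gradient-boosted regressor. The encoder checkpoint and the column names are assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer
from xgboost import XGBRegressor

encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_emb = encoder.encode(df["report_text"].tolist())            # (n_docs, 384) embeddings
structured = df[["company_size", "industry_code"]].to_numpy()    # hypothetical tabular features

X = np.hstack([text_emb, structured])
model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, df["esg_score"])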

Tokenization

Break text into words, phrases, symbols, or other meaningful elements (tokens) to make it easier to analyze. Libraries like NLTK and spaCy offer good tokenization tools.
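
A quick NLTK sketch of sentence- and word-level tokenization:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

text = "We cut Scope 1 emissions by 12% in 2022. Water usage also fell."
print(sent_tokenize(text))   # sentence tokens
print(word_tokenize(text))   # word tokens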

Text extraction from PDFs

Convert the PDFs into plain text. You can use libraries such as PyPDF2, pdfminer, or pdfplumber for this purpose in Python.
Store the extracted text for each PDF in a structured format (e.g., CSV or a database).
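
A pdfplumber sketch; the input folder and output path are assumptions.

from pathlib import Path
import pandas as pd
import pdfplumber

rows = []
for pdf_path in Path("data/reports").glob("*.pdf"):
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    rows.append({"file": pdf_path.name, "report_text": text})

pd.DataFrame(rows).to_csv("data/report_texts.csv", index=False)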

Tokenization and Normalization

Tokenize the text into words or phrases.
Convert all text to lowercase.
Remove stop words (common words like 'and', 'the', 'is', etc. that don't add significant meaning in analysis).
Apply stemming or lemmatization to reduce words to their base/root form.

Lowercasing

Convert all the text to lowercase. This ensures that words like "Environment", "environment", and "ENVIRONMENT" are treated as the same word.
