
llm-email-spam-detection's Introduction

Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

by Maxime Labonne and Sean Moran

This paper has been submitted for publication in ECML PKDD 2023 and is available on arXiv.

The paper evaluates baseline techniques and large language models for email spam detection. It also introduces Spam-T5, a Flan-T5 model adapted and fine-tuned for the task that significantly outperforms the other models.
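As a rough illustration of the idea, spam detection can be cast as a text-to-text task for a Flan-T5 checkpoint: the email is wrapped in a prompt and the model generates a label. The prompt wording, checkpoint name, and length limits below are illustrative assumptions, not necessarily the exact setup used to build Spam-T5.

# Minimal sketch: casting spam detection as text-to-text with Flan-T5.
# The prompt format and checkpoint are assumptions; Spam-T5 additionally fine-tunes the model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def classify(email_text: str) -> str:
    prompt = f"Classify the following email as ham or spam:\n{email_text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(classify("Congratulations! You have won a free prize, click here to claim it."))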


Abstract

This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods. We assess the performance of these models across four public datasets, utilizing different numbers of training samples (full training set and few-shot settings). Our findings reveal that, in the majority of cases, LLMs surpass the performance of the popular baseline techniques, particularly in few-shot scenarios. This adaptability renders LLMs uniquely suited to spam detection tasks, where labeled samples are limited in number and models require frequent updates. Additionally, we introduce Spam-T5, a Flan-T5 model that has been specifically adapted and fine-tuned for the purpose of detecting email spam. Our results demonstrate that Spam-T5 surpasses baseline models and other LLMs in the majority of scenarios, particularly when there are a limited number of training samples available.
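For context, the classical baselines mentioned above are standard bag-of-words pipelines. A minimal sketch of one such baseline, assuming raw email texts with binary labels (0 = ham, 1 = spam), could look like the following; the toy data, features, and hyperparameters are illustrative and may differ from the paper's configuration.

# Minimal Naive Bayes baseline sketch with TF-IDF features (scikit-learn).
# The toy data below is only for illustration; the paper uses four public datasets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Meeting moved to 3pm, see you in room 4.",  # ham
    "WIN a FREE cruise!!! Reply now to claim.",  # spam
]
labels = [0, 1]  # 0 = ham, 1 = spam

baseline = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
baseline.fit(emails, labels)
print(baseline.predict(["Claim your free prize today"]))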

Installation

All Python packages needed are listed in requirements.txt. You can install them with the following commands:

git clone https://github.com/jpmorganchase/llm-email-spam-detection.git
cd llm-email-spam-detection
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu116

Usage

All source code used to generate the results and figures in the paper is in the src folder. The data used in the paper is automatically downloaded, processed, and stored in the data folder.

You can start training the 6 baseline techniques and 3 large language models with the following command:

python main.py
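The few-shot settings described in the abstract correspond to training on small, class-balanced subsets of each dataset. The sketch below shows one way such a subset can be drawn; it is generic illustrative code (the helper name few_shot_subset is hypothetical), not the repository's own sampling routine.

# Illustrative few-shot subsampling: draw k labelled examples per class.
# Generic sketch, not the exact sampling code used in src/.
import random

def few_shot_subset(texts, labels, k_per_class, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    subset_texts, subset_labels = [], []
    for label, items in by_class.items():
        for text in rng.sample(items, min(k_per_class, len(items))):
            subset_texts.append(text)
            subset_labels.append(label)
    return subset_texts, subset_labels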

Maintenance Level

This repository is maintained to fix bugs and ensure the stability of the existing codebase. However, please note that the team does not plan to introduce new features or enhancements in the future.

Reference

If you re-use this work, please cite:

@misc{labonne2023spamt5,
      title={Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection}, 
      author={Maxime Labonne and Sean Moran},
      year={2023},
      eprint={2304.01238},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

llm-email-spam-detection's People

Contributors

kgourgou, mlabonne


llm-email-spam-detection's Issues

GPU memory overflow when running the code

Hello, could you share the machine configuration you used to run the code? When I trained the RoBERTa model on the Ling dataset using a V100 GPU on Colab Pro, I ran out of memory.

Error message:

OutOfMemoryError                          Traceback (most recent call last)

in ()
     17
     18     # Train LLMs
---> 19     train_llms(
     20         list(range(5)),
     21         ["ling", "sms", "spamassassin", "enron"],

19 frames

/usr/local/lib/python3.10/dist-packages/transformers/activations.py in forward(self, input)
     55
     56     def forward(self, input: Tensor) -> Tensor:
---> 57         return self.act(input)
     58
     59

OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 15.77 GiB total capacity; 14.60 GiB already allocated; 46.12 MiB free; 14.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

If possible, could you also share the training logs from the runs reported in the paper?
Thank you very much for your attention to this issue.
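Not an official answer from the maintainers, but common ways to fit a RoBERTa fine-tuning run into a 16 GB V100 are to shrink the per-device batch size, use gradient accumulation to preserve the effective batch size, and enable mixed precision or gradient checkpointing. A hedged sketch with Hugging Face TrainingArguments follows; the parameter values are guesses, not the settings used in the paper.

# Illustrative memory-saving settings for a transformers Trainer run.
# Values are guesses for a 16 GB V100, not the configuration used in the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,    # smaller batches use less GPU memory
    gradient_accumulation_steps=4,    # keep the effective batch size at 16
    fp16=True,                        # mixed precision reduces activation memory
    gradient_checkpointing=True,      # trade extra compute for lower memory
)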
