
llm-email-spam-detection's Introduction

Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

by Maxime Labonne and Sean Moran

This paper has been submitted for publication in ECML PKDD 2023 and is available on arXiv.

The paper evaluates baseline techniques and large language models for email spam detection. It also introduces Spam-T5, a Flan-T5 model adapted and fine-tuned for the task that significantly outperforms the other models.
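As a rough illustration of the idea, spam detection can be cast as a text-to-text task for a Flan-T5 checkpoint: the email is wrapped in a prompt and the model generates a label. The prompt wording, checkpoint name, and length limits below are illustrative assumptions, not necessarily the exact setup used to build Spam-T5.

# Minimal sketch: casting spam detection as text-to-text with Flan-T5.
# The prompt format and checkpoint are assumptions; Spam-T5 additionally fine-tunes the model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def classify(email_text: str) -> str:
    prompt = f"Classify the following email as ham or spam:\n{email_text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(classify("Congratulations! You have won a free prize, click here to claim it."))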


Abstract

This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods. We assess the performance of these models across four public datasets, utilizing different numbers of training samples (full training set and few-shot settings). Our findings reveal that, in the majority of cases, LLMs surpass the performance of the popular baseline techniques, particularly in few-shot scenarios. This adaptability renders LLMs uniquely suited to spam detection tasks, where labeled samples are limited in number and models require frequent updates. Additionally, we introduce Spam-T5, a Flan-T5 model that has been specifically adapted and fine-tuned for the purpose of detecting email spam. Our results demonstrate that Spam-T5 surpasses baseline models and other LLMs in the majority of scenarios, particularly when there are a limited number of training samples available.
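For context, the classical baselines mentioned above are standard bag-of-words pipelines. A minimal sketch of one such baseline, assuming raw email texts with binary labels (0 = ham, 1 = spam), could look like the following; the toy data, features, and hyperparameters are illustrative and may differ from the paper's configuration.

# Minimal Naive Bayes baseline sketch with TF-IDF features (scikit-learn).
# The toy data below is only for illustration; the paper uses four public datasets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Meeting moved to 3pm, see you in room 4.",  # ham
    "WIN a FREE cruise!!! Reply now to claim.",  # spam
]
labels = [0, 1]  # 0 = ham, 1 = spam

baseline = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
baseline.fit(emails, labels)
print(baseline.predict(["Claim your free prize today"]))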

Installation

All Python packages needed are listed in requirements.txt. You can install them with the following commands:

git clone https://github.com/jpmorganchase/llm-email-spam-detection.git
cd llm-email-spam-detection
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu116

Usage

All source code used to generate the results and figures in the paper is in the src folder. The data used in the paper is automatically downloaded, processed, and stored in the data folder.

You can start training the 6 baseline techniques and 3 large language models with the following command:

python main.py
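The few-shot settings described in the abstract correspond to training on small, class-balanced subsets of each dataset. The sketch below shows one way such a subset can be drawn; it is generic illustrative code (the helper name few_shot_subset is hypothetical), not the repository's own sampling routine.

# Illustrative few-shot subsampling: draw k labelled examples per class.
# Generic sketch, not the exact sampling code used in src/.
import random

def few_shot_subset(texts, labels, k_per_class, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    subset_texts, subset_labels = [], []
    for label, items in by_class.items():
        for text in rng.sample(items, min(k_per_class, len(items))):
            subset_texts.append(text)
            subset_labels.append(label)
    return subset_texts, subset_labels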

Maintenance Level

This repository is maintained to fix bugs and ensure the stability of the existing codebase. However, please note that the team does not plan to introduce new features or enhancements in the future.

Reference

If you re-use this work, please cite:

@misc{labonne2023spamt5,
      title={Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection}, 
      author={Maxime Labonne and Sean Moran},
      year={2023},
      eprint={2304.01238},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

llm-email-spam-detection's People

Contributors

kgourgou, mlabonne


llm-email-spam-detection's Issues

GPU memory overflow when running the code

Hello, could you share the machine configuration you used to run the code? When I trained the RoBERTa model on the Ling dataset using a V100 GPU on Colab Pro, I ran out of memory.

Error message:

OutOfMemoryError                          Traceback (most recent call last)

in ()
     17
     18     # Train LLMs
---> 19     train_llms(
     20         list(range(5)),
     21         ["ling", "sms", "spamassassin", "enron"],

19 frames

/usr/local/lib/python3.10/dist-packages/transformers/activations.py in forward(self, input)
     55
     56     def forward(self, input: Tensor) -> Tensor:
---> 57         return self.act(input)
     58
     59

OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 15.77 GiB total capacity; 14.60 GiB already allocated; 46.12 MiB free; 14.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

If possible, could you also share the training logs from the runs reported in the paper?
Thank you very much for your attention to this issue.
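Not an official answer from the maintainers, but common ways to fit a RoBERTa fine-tuning run into a 16 GB V100 are to shrink the per-device batch size, use gradient accumulation to preserve the effective batch size, and enable mixed precision or gradient checkpointing. A hedged sketch with Hugging Face TrainingArguments follows; the parameter values are guesses, not the settings used in the paper.

# Illustrative memory-saving settings for a transformers Trainer run.
# Values are guesses for a 16 GB V100, not the configuration used in the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,    # smaller batches use less GPU memory
    gradient_accumulation_steps=4,    # keep the effective batch size at 16
    fp16=True,                        # mixed precision reduces activation memory
    gradient_checkpointing=True,      # trade extra compute for lower memory
)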
