Giter Club home page Giter Club logo

nostr-spam-detection's Introduction

Nostr Spam Detection

An experiment in building a machine learning model to label Nostr spam content for filtering and relay rejection.

Dataset

The latest dataset is labelled_nostr_events_20230225000.csv with 92,000 rows. It contains labelled nostr event content with either spam (bad - 13k) or ham (good - 79k). The data is biased by volume with asian language spam, and english language ham (non-spam) rows - however it has performed well in testing against recent Nostr events. The dataset was best-effort aggregated and largely labelled manually. It contains reviewed events flagged by Nostr event kind=1984, which indicate spam and other undesirable content. I've left the datasets raw to allow others to normalise themselves.

One final note. The dataset it likely pretty biased toward the past month of Nostr spam, and is less well rounded. You could train your models using another base spam detection set, or add more training as more Nostr spam labeled data is available.

Results

Using a Naive Bayes classifier with the dataset, we are able to achieve 98.26%+ accuracy. The model size is around 22Mb, and takes less than a minute to train. To train this model, take a look at Nostr MultinominalNB Spam Detection.ipynb.

With more complex language modelling and deep learning, we were able to get upwards of 98% accuracy for spam detection using FastAI. The model is substantially larger however being around 160Mb, and takes significantly longer to train (can be hours on a laptop CPU).

Dependencies

  • Python
  • Juypter
  • FastAI

Usage

First you will need to build your model. It can take over an hour to train using a CPU instead of GPU. Once you have the dependencies installed, you can run the following to start training your model.

git clone https://github.com/blakejakopovic/nostr-spam-detection
cd nostr-spam-detection

jupyter notebook

# The notebooks should run through cleanly, with each code segment being run once.

I've included a minimal Python Flask REST API endpoint that loads your model and can be called to get a spam score for an event (or just the event content).

# Start the prediction example apps
gunicorn --workers 5 --bind localhost:5000 app:app (app:app for FastAI or app2:app for the bayes model)

# To get a label and score for content
curl -X POST 'http://127.0.0.1:5000/test' --header 'Content-Type: application/json' --data-raw '{"content" : "Hello, is this spam or ham?"}'
# {"label": "ham", "score": "0.9913"}

# To get a spam_score directly for content
curl -X POST 'http://127.0.0.1:5000/spam_score' --header 'Content-Type: application/json' --data-raw '{"content" : "Hello, how spammy is this?"}'
# {"spam_score": "0.9913"}


nostr-spam-detection's People

Contributors

blakejakopovic avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

jilv220 momentqyc

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.