nlc-email-phishing's Introduction

WARNING: This repository is no longer maintained ⚠️

This repository will not be updated. The repository will be kept available in read-only mode.

Determine email spam with Watson Natural Language Classifier

In this Code Pattern, we will build an app that classifies email, labeling each message as "Phishing", "Spam", or, if it does not appear suspicious, "Ham". We'll use IBM Watson Natural Language Classifier (NLC) to train a model on example emails from an EDRM Enron email dataset. Please note that this data is free for non-commercial use only; explicit permission must be obtained for any other use. The custom NLC model can be built quickly in the Web UI, deployed into our Node.js app using the Watson Developer Cloud Node.js SDK, and then run from a browser.

When the reader has completed this Code Pattern, they will understand how to:

  • Build a Watson Natural Language Classifier model using the Web UI.
  • Create a Node.js app that uses the NLC model to classify emails as Phishing or not.
  • Use the Watson Developer Cloud SDK for Node.js.

Flow

arch

  1. User interacts with Natural Language Classifier (NLC) GUI to train the model.
  2. EDRM data is loaded into the NLC service to provide sample emails for training.
  3. User sends email text to the application to have it classified.
  4. App uses Watson Natural Language Classifier to determine if text is phishing, spam, or ham.
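As a rough sketch of step 4, the app could reduce a classify response to a single label as shown below. The response shape (a `classes` array of `{ class_name, confidence }` entries) follows the NLC v1 classify API; the helper name `summarizeClassification`, the `minConfidence` threshold, the "Unknown" fallback, and the sample values are assumptions made for illustration and are not taken from the repo.

```javascript
// Hypothetical helper: pick the highest-confidence class from an NLC
// classify response. The 0.5 threshold is an assumption, not a repo setting.
function summarizeClassification(response, minConfidence = 0.5) {
  const classes = Array.isArray(response.classes) ? response.classes : [];
  // Sort a copy so the caller's object is left untouched.
  const ranked = [...classes].sort((a, b) => b.confidence - a.confidence);
  const best = ranked[0];
  if (!best || best.confidence < minConfidence) {
    // Nothing confident enough: report "Unknown" instead of guessing.
    return { label: 'Unknown', confidence: best ? best.confidence : 0 };
  }
  return { label: best.class_name, confidence: best.confidence };
}

// Made-up response for illustration only.
const sampleResponse = {
  top_class: 'Phishing',
  classes: [
    { class_name: 'Spam', confidence: 0.12 },
    { class_name: 'Phishing', confidence: 0.81 },
    { class_name: 'Ham', confidence: 0.07 },
  ],
};

console.log(summarizeClassification(sampleResponse));
```

Falling back to "Unknown" below a threshold is one possible design choice; an app could equally show the top class together with its confidence.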

Included components

  • Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
  • Watson Natural Language Classifier: An IBM Cloud service to interpret and classify natural language with confidence.
  • Node.js: An open-source JavaScript run-time environment for executing server-side JavaScript code.

Watch the Video

video

Steps

  1. Clone the repo
  2. Create IBM Cloud services
  3. Create a Watson Studio project
  4. Train the NLC model
  5. Run the application

1. Clone the repo

Clone the nlc-email-phishing repo locally. In a terminal, run:

git clone https://github.com/IBM/nlc-email-phishing.git

2. Create IBM Cloud services

Create the following service, and note its service credentials for Step 5:

  • Watson Natural Language Classifier

3. Create a Watson Studio project

  • Log into IBM's Watson Studio. Once in, you'll land on the dashboard.

  • Create a new project by clicking + New project and choosing Data Science:

    studio project

  • Enter a name for the project and click Create.

  • NOTE: Creating a project in Watson Studio also creates a free-tier Object Storage service and a Watson Machine Learning service in your IBM Cloud account. Select the Free storage type to avoid fees.

    studio-new-project

  • Upon successful project creation, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll use them to associate our project with external assets (datasets and notebooks) and IBM Cloud services.

    studio-project-dashboard

4. Train the NLC model

The data used in this example comes from an EDRM Enron email dataset. A cleaned version is available in the repo under data/Email-trainingdata-20k.csv. We'll now train an NLC model using this data.
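Before uploading, it can help to check how many rows and which labels the file contains. The sketch below assumes the usual NLC training layout, with the email text in the first column and one or more class labels after it; the helper names and the sample rows are hypothetical, and the line-based parser does not handle quoted fields that span several lines.

```javascript
// Split one CSV line into fields, honoring double-quoted fields
// (a doubled quote "" inside a quoted field stands for a literal quote).
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Count non-empty rows and how often each label appears.
// Every field after the first is treated as a label (an assumption).
function summarizeTrainingCsv(csvText) {
  const counts = {};
  let rows = 0;
  for (const line of csvText.split(/\r?\n/)) {
    if (line.trim() === '') continue;
    rows++;
    const [, ...labels] = parseCsvLine(line);
    for (const label of labels) {
      const key = label.trim();
      if (key) counts[key] = (counts[key] || 0) + 1;
    }
  }
  return { rows, counts };
}

// Made-up rows for illustration; they are not taken from the dataset.
const sampleCsv = [
  '"Please verify your account, or it will be closed",Phishing',
  'Lunch at noon tomorrow?,Ham',
  '"You won a ""free"" cruise",Spam',
  'Send me your password,Phishing',
].join('\n');

console.log(summarizeTrainingCsv(sampleCsv));
```

To run this against the real file, read data/Email-trainingdata-20k.csv with `fs.readFileSync(path, 'utf8')` and pass the resulting string to `summarizeTrainingCsv`.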

  • From the new project Overview panel, click + Add to project on the top right and choose the Natural Language Classifier asset type.

    add-nlc-asset

  • A new instance of the NLC tool will launch.

    new-nlc-model

  • Add the data to your project by clicking the Browse button in the right-hand Upload to project section and browsing to the cloned repo. Choose data/Email-trainingdata-20k.csv.

  • Drag and drop the Email-trainingdata-20k.csv file you uploaded to the Create a Class box:

    video-to-gif

  • Click the Train model button to begin training. The model will take around an hour to train.

  • To check the status of the model, and to access it after training, go to the Models section of the Assets tab in your project. The model appears there when it is ready. Double-click it to see the Overview tab.

    nlc-model-overview

  • The first line of the Overview tab contains the Model ID. Note this value; we'll need it in the next step.

  • Click the Test tab and enter a phrase from an email to test the classifier. For example, "Can you please send your password?" is classified with 0.81 confidence as Phishing.

  • Click the Implementation tab to see how to use the classifier with Curl, Java, Node, or Python.
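For orientation, a Node call to the classifier comes down to one HTTP POST. The sketch below only builds that request and does not send it. It assumes the NLC v1 REST endpoint `/v1/classifiers/{classifier_id}/classify` and IAM API key basic authentication with the user name `apikey`; the function name `buildClassifyRequest` and all sample values are hypothetical. The Implementation tab remains the authoritative source for the exact snippet.

```javascript
// Build (but do not send) a classify request for a trained NLC model.
// All argument values are supplied by the caller; nothing is read from the environment.
function buildClassifyRequest({ url, apikey, classifierId, text }) {
  // Drop trailing slashes so the joined path has exactly one separator.
  const base = url.replace(/\/+$/, '');
  const endpoint = `${base}/v1/classifiers/${encodeURIComponent(classifierId)}/classify`;
  // Basic auth with the literal user name "apikey" and the API key as password.
  const auth = Buffer.from(`apikey:${apikey}`).toString('base64');
  return {
    endpoint,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${auth}`,
      },
      body: JSON.stringify({ text }),
    },
  };
}

// Placeholder values for illustration only.
const request = buildClassifyRequest({
  url: 'https://example.invalid/',
  apikey: 'my-api-key',
  classifierId: 'abc123',
  text: 'Can you please send your password?',
});

console.log(request.endpoint);
```

With a Node version that provides a global `fetch`, the request could then be sent with `fetch(request.endpoint, request.options)`.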

5. Run the application

Follow the steps below for deploying the application:

Run on IBM Cloud

  • Press the Deploy to IBM Cloud button below.

Deploy to IBM Cloud

  • From the IBM Cloud deployment page click the Deploy button.

  • From the Toolchains menu, click the Delivery Pipeline to watch while the app is deployed. Once deployed, the app can be viewed by clicking View app.

  • The app and service can be viewed in the IBM Cloud dashboard. The app will be named nlc-email-phishing, with a unique suffix.

  • We now need to add a few environment variables to the application's runtime so the right classifier service and model are used. Click on the application from the dashboard to view its settings.

  • Once viewing the application, click the Runtime option on the menu and navigate to the Environment Variables section.

  • Update the CLASSIFIER_ID, NATURAL_LANGUAGE_CLASSIFIER_USERNAME, and NATURAL_LANGUAGE_CLASSIFIER_PASSWORD variables with your Model ID from Step 4 and NLC service credentials from Step 2. Click Save.

    env vars

  • After saving the environment variables, the app will restart. Once it has restarted, you can access it by clicking the Visit App URL button.

Run locally

  • In the root of the project create a file named .env. A sample is provided and a snippet is shown below.

    # Replace the credentials here with your own.
    CLASSIFIER_ID=<add_ModelID>
    NATURAL_LANGUAGE_CLASSIFIER_APIKEY=<add_API_key>
    NATURAL_LANGUAGE_CLASSIFIER_URL=<add_NLC_url>
  • Update the CLASSIFIER_ID, NATURAL_LANGUAGE_CLASSIFIER_APIKEY, and NATURAL_LANGUAGE_CLASSIFIER_URL variables with your Model ID from Step 4 and NLC service credentials from Step 2.

  • Ensure Node.js is installed.

  • Install the app dependencies by running:

    npm install
  • Start the app by running:

    npm start
  • Open a browser and point to localhost:3000.
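If the app fails to start, a likely cause is a .env file that still contains placeholders. The following is a minimal sketch of a check for that, using the three variable names from the snippet above; the helpers `parseEnv` and `missingSettings` and the sample value `abc123` are hypothetical and not part of the repo.

```javascript
// The three settings named in the .env snippet.
const REQUIRED = [
  'CLASSIFIER_ID',
  'NATURAL_LANGUAGE_CLASSIFIER_APIKEY',
  'NATURAL_LANGUAGE_CLASSIFIER_URL',
];

// Parse simple KEY=value lines, skipping blank lines and # comments.
function parseEnv(text) {
  const env = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=value line
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return env;
}

// Return the required names that are absent, empty, or still a <placeholder>.
function missingSettings(env) {
  return REQUIRED.filter((name) => !env[name] || /^<.*>$/.test(env[name]));
}

// Sample file content: only CLASSIFIER_ID has been filled in (made-up value).
const sampleEnv = [
  '# Replace the credentials here with your own.',
  'CLASSIFIER_ID=abc123',
  'NATURAL_LANGUAGE_CLASSIFIER_APIKEY=<add_API_key>',
  'NATURAL_LANGUAGE_CLASSIFIER_URL=<add_NLC_url>',
].join('\n');

console.log(missingSettings(parseEnv(sampleEnv)));
```

For the sample content, the check reports the API key and URL as still missing.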

Sample output

output

Links

Learn more

  • Artificial Intelligence Code Patterns: Enjoyed this Code Pattern? Check out our other AI Code Patterns.
  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns.
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos.

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ

nlc-email-phishing's People

Contributors

dependabot[bot], dolph, imgbot[bot], jritten, kant, ljbennett62, markstur, pmmistry, rhagarty, sanjeevghimire, scottdangelo, stevemar, stevemart

nlc-email-phishing's Issues

Question on the training data

Every sample in the training data (all 17834 lines) is labeled "phishing" and carries an additional label such as "ham" or "spam".
What does it mean to label every sample as phishing?
That is not what appears in the video, where there are two separate classes, ham and spam, plus one occurrence of email ... but at some point the class Phishing with 17834 items reappears ...

Error when training the model

Hello, a question about NLC: while trying to replicate the code journey at https://github.com/IBM/nlc-email-phishing, I get this message when starting the training:
Error encountered while training : Error in Watson NLC Service: Too many data instances

I can't find the meaning of the message.
The service is a "fresh" standard one in Germany; Watson Studio is in London.
The data provided in CSV format loads OK. It is 8.6 MB. Is it too large, or what could be the issue? Many thanks.
