knutoplandmoen / ai-text-detector


A neural network that classifies text as human-written or AI-generated. It uses natural language processing and machine learning techniques and, in most cases, accurately identifies whether a human or an AI wrote a given text, especially when trained for a specific use case.


Ai-text-detector

This project centers on a neural network that classifies text as human-written or AI-generated. It uses natural language processing and machine learning techniques and, in most cases, accurately identifies whether a human or an AI wrote a given text, especially when trained for a specific use case. To tailor the model to a particular need, users can curate a narrowly focused dataset, such as novels written by high school students. By generating a large amount of text with an AI such as ChatGPT and collecting student-authored content, users can label the two sets as described below and in the "training_data.csv" file. The neural network, implemented in modelmaker.py, handles data preprocessing and training. Practical use is demonstrated in predict.py. For more detailed instructions and examples, refer to the sections below.

Overview

This project implements a neural network that determines whether a given piece of text was generated by an AI or written by a human. The model is trained on a specific type of text, such as novels, using natural language processing techniques and machine learning algorithms.

Project Structure

modelmaker.py

Description: This Python script handles the creation and training of the neural network model. It includes the following functionalities:

Functions: Reading and Preprocessing Data: Reads training data from an Excel file, splits text and labels into separate lists, tokenizes text data, and pads sequences for uniform inputs.
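
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not the code in modelmaker.py: the two-column layout (text, label), the helper names `read_rows`, `build_word_index`, `texts_to_sequences`, and `pad_sequences`, and the `max_len` value are all hypothetical, and the real script presumably relies on a library tokenizer instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the layout described for training_data.csv:
# a text chunk followed by a label (0 = AI-generated, 1 = human-written).
SAMPLE_CSV = '''"The sun set slowly over the quiet fjord",1
"In conclusion, the sun is a star that sets",0
'''

def read_rows(handle):
    """Split rows into parallel lists of texts and integer labels."""
    texts, labels = [], []
    for text, label in csv.reader(handle):
        texts.append(text)
        labels.append(int(label))
    return texts, labels

def build_word_index(texts):
    """Map each word to an integer id, most frequent first. Id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_sequences(sequences, max_len):
    """Truncate or right-pad with zeros so every sequence has length max_len."""
    return [seq[:max_len] + [0] * (max_len - len(seq[:max_len])) for seq in sequences]

texts, labels = read_rows(io.StringIO(SAMPLE_CSV))
word_index = build_word_index(texts)
padded = pad_sequences(texts_to_sequences(texts, word_index), max_len=10)
print(labels)
print(padded)
```

Padding to a fixed length is what lets chunks of different word counts be stacked into one rectangular input for the network.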

Neural Network Architecture: Defines the model with an embedding layer, dropout layer, global average pooling layer, and dense layers. Compiles the model with binary cross-entropy loss and accuracy metrics.
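
To make the data flow through this architecture concrete, the sketch below runs one forward pass in plain Python. It is an illustration under assumptions, not the model built by modelmaker.py: the weights are random, the sizes `VOCAB_SIZE` and `EMBED_DIM` are hypothetical, a single dense output unit stands in for the dense layers, the pooling here averages over every position including padding, and dropout is left out because it is only active during training.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 50, 4  # hypothetical sizes

# Embedding table: one vector per token id (id 0 is padding).
embedding = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
# Single dense output unit: one weight per embedding dimension plus a bias.
dense_w = [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
dense_b = 0.0

def forward(token_ids):
    """Embedding lookup -> global average pooling -> dense unit -> sigmoid."""
    vectors = [embedding[i] for i in token_ids]
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    logit = sum(w * x for w, x in zip(dense_w, pooled)) + dense_b
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Loss for one example; y_true is 0 (AI) or 1 (human)."""
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

score = forward([3, 17, 8, 0, 0])
print(score, binary_cross_entropy(1, score))
```

Because the pooling step averages the word vectors, this kind of model ignores word order: shuffling the tokens of a chunk yields the same score.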

Training the Model: Trains the model with early stopping, saving the best model for future use. Visualizes training progress using matplotlib.
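
Early stopping with best-model tracking can be sketched as follows. This is a minimal illustration assuming the monitored quantity is validation loss and a hypothetical `patience` of 3 epochs; which metric and patience modelmaker.py actually uses is not stated here, and the loss values are made up for the example.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` consecutive
    epochs; the best epoch is the one whose weights would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up losses: improvement stalls after epoch 2 (zero-based).
best, stopped = early_stopping([0.69, 0.55, 0.48, 0.50, 0.49, 0.51, 0.47])
print(best, stopped)
```

In this run the loop stops at epoch 5 and never sees the later, lower loss of 0.47, which is the usual trade-off of a small patience value.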

Saving the Model: Saves the trained model and the tokenizer for future predictions.
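
The tokenizer has to be saved alongside the model because predict.py must map words to the same integer ids that were used during training. The sketch below shows that round trip with a plain word-index dictionary stored as JSON. The file name `word_index.json`, the helper names, and the JSON format are assumptions for illustration; the actual script may serialize its tokenizer differently.

```python
import json
import os
import tempfile

word_index = {"the": 1, "sun": 2, "fjord": 3}  # hypothetical vocabulary

def save_word_index(mapping, path):
    """Write the word-to-id mapping to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False)

def load_word_index(path):
    """Read the word-to-id mapping back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round trip through a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_index.json")
    save_word_index(word_index, path)
    restored = load_word_index(path)

print(restored == word_index)
```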

predict.py

Description: This script demonstrates how the trained model can be used to predict whether new texts were written by AI or by humans.

Functions: Loading the Model and Tokenizer: Loads the pre-trained model and tokenizer for text prediction.

Text Tokenization and Padding: Tokenizes and pads the new text data to match the model's input shape.

Making Predictions: Uses the loaded model to predict the origin of the new text and reports an average prediction score between 0 and 1. Following the labels in training_data.csv, a score near 1 points to human authorship and a score near 0 points to AI generation. The average is taken over all the small chunks of text provided. How to split the text is described further down.
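
The averaging step can be sketched like this. It is a minimal illustration, not the code in predict.py: `average_prediction` is a hypothetical helper, and `fake_score` is a made-up stand-in for the trained model so that the example runs without it.

```python
from statistics import mean

def average_prediction(chunks, score_chunk):
    """Score every chunk and return the mean; raises ValueError if there are no chunks."""
    if not chunks:
        raise ValueError("need at least one chunk of text")
    return mean(score_chunk(chunk) for chunk in chunks)

def fake_score(chunk):
    """Hypothetical stand-in for the model: longer average word length gives a higher score."""
    words = chunk.split()
    return min(1.0, sum(len(w) for w in words) / (10 * len(words)))

chunks = ["the sun set over the fjord", "a remarkably elaborate sentence"]
print(average_prediction(chunks, fake_score))
```

Averaging smooths out individual chunks that the model scores badly, at the cost of hiding cases where only part of a text was AI-generated.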

How to interpret results

The following code can be added at the bottom of predict.py to get a better sense of what the model output means.

# Assumes `predictionsum` holds the average model output computed earlier in
# predict.py (0 = AI, 1 = human); rename it to match the variable in your script.
human_pct = round(float(predictionsum) * 100, 2)
if predictionsum > 0.90:
    print(f'Text most likely written by a human. Probability of human authorship: {human_pct}% (very high probability)')
elif predictionsum > 0.69:
    print(f'Text most likely written by a human. Probability of human authorship: {human_pct}% (high probability)')
elif predictionsum > 0.51:
    print(f'Text most likely written by AI or with the help of AI. Probability of human authorship: {human_pct}% (low probability)')
elif predictionsum > 0.40:
    print(f'Text most likely written by AI. Probability of human authorship: {human_pct}% (low probability)')
else:
    # Covers every score at or below 0.40, so a message is always printed.
    print(f'Text most likely written by AI. Probability of human authorship: {human_pct}% (very low probability)')

text_to_predict.txt

Description: This file contains sample text that you want to evaluate for its origin. You can replace the content of this file with any text you wish to analyze.

training_data.csv

Description: This file holds the training data. Each line contains a small chunk of text followed by a label (0 for AI-generated, 1 for human-written). The model has not yet been tested on many different kinds of data. What has worked so far is splitting a full page into small chunks of 10-20 words each and labeling every chunk.
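
Splitting a page into labeled chunks could look like the sketch below. It is an illustration under assumptions: the helper names, the fixed `chunk_size` of 15 words (inside the 10-20 range mentioned above), and the two-column row layout are hypothetical, and the last chunk of a page may come out shorter than 10 words.

```python
import csv
import io

def split_into_chunks(text, chunk_size=15):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def write_labeled_chunks(handle, text, label, chunk_size=15):
    """Write one CSV row per chunk: the chunk text followed by its label."""
    writer = csv.writer(handle)
    for chunk in split_into_chunks(text, chunk_size):
        writer.writerow([chunk, label])

page = " ".join(f"word{i}" for i in range(40))  # placeholder for a page of text
buffer = io.StringIO()
write_labeled_chunks(buffer, page, label=1)  # 1 = human-written
print(buffer.getvalue())
```

Using the `csv` module rather than joining strings by hand keeps chunks that contain commas or quotes from breaking the row layout.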

Example

In my use case, I wanted to train the model to check whether stories and novels written in Norwegian were written by humans or by ChatGPT. I prompted ChatGPT with variations of: "write a short story/novel with the name (different famous short stories/novels)" and labeled the output as AI-written. For the human data, I used the human-written novels whose names I had given to ChatGPT. In total there were around 1000 labeled chunks of text. After 60 epochs, the validation accuracy was 0.94.

Other

Feel free to modify the code and experiment with different architectures and hyperparameters to enhance the model's performance.
