Ai-text-detector

This project centers on a neural network that classifies text as human-written or AI-generated. It uses natural language processing and machine learning techniques and, in most cases, accurately identifies whether a human or an AI wrote a given text, especially when trained for a specific use case. To tailor the model to a particular need, users can curate a narrowly focused dataset, such as novels written by high school students. After generating a substantial amount of text with an AI such as ChatGPT and collecting student-authored content, users label both sets as described below and as shown in the "training_data.csv" file. The neural network, implemented in modelmaker.py, handles data preprocessing and training. Practical use is demonstrated in predict.py. For more detailed instructions and examples, refer to the sections below.

Overview

This project implements a neural network designed to distinguish whether a given piece of text is generated by an AI or a human writer. The model is trained on specific types of texts, such as novels, utilizing natural language processing techniques and machine learning algorithms.

Project Structure

modelmaker.py

Description: This Python script handles the creation and training of the neural network model. It includes the following functionalities:

Functions: Reading and Preprocessing Data: Reads training data from an Excel file, splits text and labels into separate lists, tokenizes text data, and pads sequences for uniform inputs.
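The tokenizing and padding step can be illustrated with a minimal pure-Python sketch. It is a stand-in for what a library tokenizer does, not the code in modelmaker.py; the names `fit_vocab` and `texts_to_padded`, the `<OOV>` token, and the choice to pad at the end of each sequence are all assumptions made for illustration.

```python
from collections import Counter


def fit_vocab(texts, num_words=10000, oov_token="<OOV>"):
    """Map each word to an integer index, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary words.
    vocab = {oov_token: 1}
    for word, _ in counts.most_common(num_words - 2):
        vocab[word] = len(vocab) + 1
    return vocab


def texts_to_padded(texts, vocab, maxlen=20, oov_token="<OOV>"):
    """Convert texts to equal-length integer sequences, zero-padded at the end."""
    padded = []
    for text in texts:
        seq = [vocab.get(w, vocab[oov_token]) for w in text.lower().split()]
        seq = seq[:maxlen]                  # truncate long chunks
        seq += [0] * (maxlen - len(seq))    # pad short chunks with zeros
        padded.append(seq)
    return padded


texts = ["the cat sat on the mat", "a dog barked"]
labels = [1, 0]  # kept in a separate list: 1 = human-written, 0 = AI-generated
vocab = fit_vocab(texts)
sequences = texts_to_padded(texts, vocab, maxlen=8)
```

Every row in `sequences` has the same length, which is what lets the chunks be stacked into a single input array for the network.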

Neural Network Architecture: Defines the model with an embedding layer, dropout layer, global average pooling layer, and dense layers. Compiles the model with binary cross-entropy loss and accuracy metrics.
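A rough Keras sketch of such an architecture is shown here. The layer order follows the description above, but the vocabulary size, embedding dimension, dropout rate, hidden layer width, and the `adam` optimizer are assumptions, and `build_model` is a hypothetical helper name rather than a function from modelmaker.py.

```python
def build_model(vocab_size=10000, embedding_dim=16, dropout_rate=0.2):
    """Build and compile a small binary text classifier (all sizes are assumed)."""
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim),  # word index -> vector
        keras.layers.Dropout(dropout_rate),                 # regularization
        keras.layers.GlobalAveragePooling1D(),              # average over the sequence
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),        # single score in [0, 1]
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

Calling `build_model()` returns a compiled model. The sigmoid output pairs naturally with binary cross-entropy when the labels are 0 and 1.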

Training the Model: Trains the model with early stopping, saving the best model for future use. Visualizes training progress using matplotlib.
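To make the early-stopping idea concrete, the following pure-Python sketch replays a list of validation losses and reports where training would stop and which epoch's model would be kept. The `patience` value and the function name are assumptions. In Keras this behaviour is usually provided by the `EarlyStopping` and `ModelCheckpoint` callbacks; whether modelmaker.py uses exactly those is not confirmed here.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: this is the model to save
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch             # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch       # ran through every epoch


stop, best = early_stopping_epoch(
    [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49], patience=3
)
```

In this made-up loss curve, training halts before the final, lower loss is ever reached, which is the usual trade-off of a small patience value.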

Saving the Model: Saves the trained model and the tokenizer for future predictions.

predict.py

Description: This script demonstrates how the pre-trained model can be used to predict whether new texts were written by AI or by humans.

Functions: Loading the Model and Tokenizer: Loads the pre-trained model and tokenizer for text prediction.

Text Tokenization and Padding: Tokenizes and pads the new text data to match the model's input shape.

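Making Predictions: Uses the loaded model to predict the origin of the new text. The text is supplied as small chunks, and the average of the per-chunk predictions is reported as a single score. Because the training labels use 1 for human-written and 0 for AI-generated text, a score close to 1 points to a human author and a score close to 0 points to AI generation. How to split text into chunks is described further down.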
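The averaging step might look like the following sketch. The per-chunk scores are made-up numbers standing in for model outputs, and `average_score` is a hypothetical helper, not a function from predict.py.

```python
def average_score(chunk_scores):
    """Average per-chunk model outputs into one document-level score."""
    if not chunk_scores:
        raise ValueError("no chunks to score")
    return sum(chunk_scores) / len(chunk_scores)


# Hypothetical model outputs for four chunks of the same document.
chunk_scores = [0.92, 0.81, 0.40, 0.95]
document_score = average_score(chunk_scores)
```

Averaging smooths out individual chunks that the model scores very differently from the rest of the document.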

How to interpret results

The following code can be added at the bottom of predict.py to get a better sense of what the model output means.

# predictionsum is assumed to be the averaged model output (between 0 and 1) computed earlier in predict.py.
human_pct = round(predictionsum * 100, 2)
if predictionsum > 0.90:
    print(f'Text most likely written by a human. Probability of human authorship: {human_pct}% (very high probability)')
elif predictionsum > 0.69:
    print(f'Text most likely written by a human. Probability of human authorship: {human_pct}% (high probability)')
elif predictionsum > 0.51:
    print(f'Text most likely written by AI or with the help of AI. Probability of human authorship: {human_pct}% (low probability)')
elif predictionsum > 0.40:
    print(f'Text most likely written by AI. Probability of human authorship: {human_pct}% (low probability)')
else:
    # Covers every remaining score, including values at or below 0.30.
    print(f'Text most likely written by AI. Probability of human authorship: {human_pct}% (very low probability)')

text_to_predict.txt

Description: This file contains sample text that you want to evaluate for its origin. You can replace the content of this file with any text you wish to analyze.

training_data.csv

Description: This file holds the training data. Each line contains a small chunk of text followed by a label (0 for AI-generated, 1 for human-written). The model has not yet been tested on many different kinds of data, but what has worked so far is splitting a full page into small chunks (10-20 words per chunk) and labeling each one.
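One way to produce such chunks is sketched below. The 15-word chunk size sits inside the 10-20 word range mentioned above, and the two-column `chunk,label` row layout without a header is an assumption about the file format; `split_into_chunks` and `write_labeled_rows` are hypothetical helper names.

```python
import csv
import io


def split_into_chunks(text, words_per_chunk=15):
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def write_labeled_rows(chunks, label, out_file):
    """Write one `chunk,label` row per chunk (0 = AI-generated, 1 = human-written)."""
    writer = csv.writer(out_file)
    for chunk in chunks:
        writer.writerow([chunk, label])


page = " ".join(f"word{i}" for i in range(40))  # stand-in for a full page of text
chunks = split_into_chunks(page, words_per_chunk=15)

# An in-memory buffer keeps the sketch side-effect free; for a real file,
# open("training_data.csv", "a", newline="") could be passed instead.
buffer = io.StringIO()
write_labeled_rows(chunks, label=1, out_file=buffer)
```

The last chunk may be shorter than the others. Whether to keep or drop such short chunks is a judgment call that depends on the dataset.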

Example

In my use case, I wanted to train the model to check whether stories and novels written in Norwegian were written by humans or by ChatGPT. I prompted ChatGPT with variations of: "write a short story/novel with the name (different famous short stories/novels)" and labeled the output as AI-written data. For the human data, I used the human-written novels whose titles matched the ones I had given ChatGPT. The labeled data totaled around 1000 chunks of text. After 60 epochs, the validation accuracy was 0.94.

Other

Feel free to modify the code and experiment with different architectures and hyperparameters to enhance the model's performance.

