AI vs Human Text Classification

This project develops a machine learning model that distinguishes human-generated text from AI-generated text. It extracts linguistic, syntactic, semantic, and additional text features from each sample and uses them to predict the sample's authorship. The pipeline covers data preprocessing, feature extraction, training a voting classifier that combines multiple machine learning algorithms, hyperparameter tuning, and evaluation.

Tech Used
Features
FAQ
Acknowledgements
Authors

Tech Used

Features

Data Preparation:
- Loading text data from CSV files.
- Balancing the dataset by sampling equal amounts of human and AI-generated text.
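
The balancing step above can be sketched as downsampling every class to the size of the smallest one. This is a minimal illustration, not the project's actual code: the `text` and `label` field names, the `balance_by_label` helper, and the toy rows are all assumptions. Real rows could come from `csv.DictReader` over the CSV files.

```python
import random
from collections import defaultdict


def balance_by_label(rows, label_key="label", seed=0):
    """Downsample every class to the size of the smallest class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    smallest = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Toy rows standing in for records read from the CSV files (hypothetical data).
rows = (
    [{"text": f"human sample {i}", "label": "human"} for i in range(5)]
    + [{"text": f"ai sample {i}", "label": "ai"} for i in range(3)]
)
balanced = balance_by_label(rows)
```

Downsampling discards data from the larger class; it is the simplest way to get equal class sizes, which is all the step above requires.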
Text Preprocessing:
- Converting text to lowercase.
- Removing non-alphanumeric characters.
- Tokenizing text.
- Filtering out stopwords and punctuation.
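
The four preprocessing steps above can be sketched with the standard library alone. The stopword set here is a tiny illustrative stand-in (an assumption); a fuller list, such as the one shipped with NLTK, would be more realistic. The `preprocess` name is hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "that"}


def preprocess(text):
    """Lowercase, strip non-alphanumeric characters, tokenize, drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = cleaned.split()  # whitespace tokenization
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("The output is clear, concise, and (mostly) correct!")
```

Because punctuation is already removed by the regular expression, the final filter in this sketch only needs to handle stopwords.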
Linguistic Feature Extraction:
- Word count and unique word count.
- Type-Token Ratio (TTR).
- Counts of nouns, verbs, and adjectives.
- Sentence length statistics (average, maximum, minimum).
- Passive voice detection.
- Entity and conjunction counts.
- Contraction and punctuation counts.
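
Several of the linguistic features above (word counts, Type-Token Ratio, sentence length statistics) need no external library. The sketch below covers only those; part-of-speech, entity, and passive-voice counts would need a tagger such as NLTK or spaCy and are left out. The `lexical_features` name and the naive sentence splitter are assumptions.

```python
import re

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def lexical_features(text):
    """Compute word counts, TTR, and sentence length statistics."""
    # Naive sentence split on ., !, ? (an assumption; a real tokenizer is more robust).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_PATTERN.findall(text.lower())
    lengths = [len(WORD_PATTERN.findall(s.lower())) for s in sentences]
    unique_words = set(words)
    return {
        "word_count": len(words),
        "unique_word_count": len(unique_words),
        # Type-Token Ratio: distinct words divided by total words.
        "ttr": len(unique_words) / len(words) if words else 0.0,
        "avg_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_sentence_length": max(lengths, default=0),
        "min_sentence_length": min(lengths, default=0),
    }


features = lexical_features("The cat sat. The cat ran away quickly!")
```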
Additional Feature Extraction:
- Perplexity calculation using a bi-gram language model.
- Readability scores (Flesch Reading Ease, SMOG Index).
- Sentiment analysis using VADER.
- Named entity recognition for persons, organizations, and locations.
- Topic modeling using LDA.
- N-gram frequency distributions (bigrams and trigrams).
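
To illustrate the perplexity feature, here is a minimal bi-gram language model with add-one (Laplace) smoothing. The smoothing choice, the `BigramModel` class, and the toy training sentences are assumptions; the source does not say how the project's model is smoothed or trained.

```python
import math
from collections import Counter


class BigramModel:
    """Bi-gram language model with add-one smoothing (illustrative sketch)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def perplexity(self, tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        pairs = list(zip(padded, padded[1:]))
        log_prob = sum(math.log(self.prob(prev, word)) for prev, word in pairs)
        # Perplexity is the exponential of the average negative log-probability.
        return math.exp(-log_prob / len(pairs))


model = BigramModel([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen = model.perplexity(["the", "cat", "sat"])
shuffled = model.perplexity(["sat", "cat", "the"])
```

A word sequence the model has seen should score a lower perplexity than the same words in an unfamiliar order, which is why perplexity can serve as a feature. This sketch does not reserve probability mass for out-of-vocabulary words, so it is only a rough estimate on text with many unseen tokens.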
Feature Vectorization:
- Transforming features into numerical form using TF-IDF vectorizers.
- Combining linguistic and additional feature vectors into a single feature matrix.
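
One plausible way to combine TF-IDF vectors with numeric features into a single matrix is to stack them column-wise as sparse matrices. This sketch uses scikit-learn's `TfidfVectorizer` and SciPy's `hstack`; the toy texts, the two numeric columns, and the n-gram range are assumptions, and the project may combine its features differently.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and hand-made numeric features (hypothetical values).
texts = [
    "the cat sat on the mat",
    "large language models generate fluent text",
    "the dog chased the cat",
]
# One row per text: [word_count, type_token_ratio]
numeric_features = np.array([
    [6, 5 / 6],
    [6, 1.0],
    [5, 4 / 5],
])

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
tfidf_matrix = vectorizer.fit_transform(texts)

# Stack the sparse TF-IDF columns next to the dense numeric columns.
feature_matrix = hstack([tfidf_matrix, csr_matrix(numeric_features)]).tocsr()
```

Keeping the result sparse avoids materializing the full TF-IDF matrix in memory, which matters once the vocabulary grows.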
Model Training and Evaluation:
- Splitting data into training and test sets.
- Using a voting classifier that includes Random Forest, Logistic Regression, and Support Vector Machine classifiers.
- Hyperparameter tuning with GridSearchCV.
- Evaluating model performance using accuracy, precision, recall, and F1-score metrics.
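
The training and evaluation steps above could look roughly like the following with scikit-learn. The synthetic data from `make_classification`, the parameter grid, the hard-voting default, and the F1 scoring choice are all assumptions made for illustration; the source does not state the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the combined feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(random_state=0)),
    ]
)  # voting defaults to "hard" (majority vote)

# Parameters of each sub-estimator are addressed as "<name>__<param>".
param_grid = {
    "rf__n_estimators": [50, 100],
    "lr__C": [0.1, 1.0],
    "svm__C": [0.1, 1.0],
}
search = GridSearchCV(voter, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is the voting classifier refit on the full training split.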
Prediction Function:
- A function to predict the authorship of new input text based on extracted features and the trained model.
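
A prediction function of this kind might be shaped as below. This is a hedged sketch: `predict_authorship` is a hypothetical name, the `vectorizer` argument stands in for the whole feature extraction and vectorization step, and the mapping of class `0` to "human" and `1` to "ai" is an assumption about how the labels are encoded.

```python
def predict_authorship(text, vectorizer, model, labels=("human", "ai")):
    """Return a readable label for a single text sample.

    `vectorizer` needs a `transform` method and `model` a `predict` method,
    as fitted scikit-learn objects provide. Label order is an assumption.
    """
    features = vectorizer.transform([text])  # one-row feature matrix
    predicted_class = model.predict(features)[0]
    return labels[int(predicted_class)]
```

The same fitted vectorizer used during training must be passed in, since a freshly fitted one would produce columns that do not line up with what the model learned.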

FAQ

Q1: What is the purpose of this project?

A1: The project aims to create a machine learning model that can distinguish between human-generated and AI-generated text based on various linguistic and additional features extracted from the text.

Q2: What data is used in this project?

A2: The project uses a dataset containing samples of human-generated and AI-generated text. The data is balanced to ensure equal representation of both classes during training.

Q3: What features are extracted from the text?

A3: The extracted features cover lexical, syntactic, discourse, rhetorical, and stylistic properties of the text, along with perplexity, readability scores, sentiment analysis, named entity recognition, topic modeling, and n-gram frequency distributions.

Q4: How is the model trained?

A4: The model is trained using a voting classifier that combines Random Forest, Logistic Regression, and Support Vector Machine classifiers. Hyperparameter tuning is performed using GridSearchCV to find the best model parameters.

Q5: How is the performance of the model evaluated?

A5: The model's performance is evaluated using accuracy, precision, recall, and F1-score metrics on a test set.

Q6: Can the model predict the authorship of new text samples?

A6: Yes, the project includes a function that extracts features from a new text sample, vectorizes them, and predicts whether the text is human-generated or AI-generated.

Acknowledgements

AI Vs Human Text Dataset

Authors

@superiorsd10
