ai-or-human-ml-model's Introduction

AI vs Human Text Classification

This project develops a machine learning model that distinguishes human-generated text from AI-generated text. The model draws on linguistic, syntactic, semantic, and additional text features to identify the authorship of a given text sample. The implementation covers data preprocessing, feature extraction, training of a voting classifier that combines multiple machine learning algorithms, hyperparameter tuning, and evaluation.

Table of Contents

  1. Tech Used
  2. Features
  3. FAQ
  4. Acknowledgements
  5. Authors

Tech Used

Python

Pandas

NumPy

scikit-learn

SciPy

Matplotlib

NLTK

SpaCy

Gensim

Textstat

VADER Sentiment Analysis

Features

  • Data Preparation:

    • Loading text data from CSV files.
    • Balancing the dataset by sampling equal amounts of human and AI-generated text.
  • Text Preprocessing:

    • Converting text to lowercase.
    • Removing non-alphanumeric characters.
    • Tokenizing text.
    • Filtering out stopwords and punctuation.
  • Linguistic Feature Extraction:

    • Word count and unique word count.
    • Type-Token Ratio (TTR).
    • Counts of nouns, verbs, and adjectives.
    • Sentence length statistics (average, maximum, minimum).
    • Passive voice detection.
    • Entity and conjunction counts.
    • Contraction and punctuation counts.
  • Additional Feature Extraction:

    • Perplexity calculation using a bi-gram language model.
    • Readability scores (Flesch Reading Ease, SMOG Index).
    • Sentiment analysis using VADER.
    • Named entity recognition for persons, organizations, and locations.
    • Topic modeling using LDA.
    • N-gram frequency distributions (bigrams and trigrams).
  • Feature Vectorization:

    • Transforming features into numerical form using TF-IDF vectorizers.
    • Combining linguistic and additional feature vectors into a single feature matrix.
  • Model Training and Evaluation:

    • Splitting data into training and test sets.
    • Using a voting classifier that includes Random Forest, Logistic Regression, and Support Vector Machine classifiers.
    • Hyperparameter tuning with GridSearchCV.
    • Evaluating model performance using accuracy, precision, recall, and F1-score metrics.
  • Prediction Function:

    • A function to predict the authorship of new input text based on extracted features and the trained model.
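
The preprocessing steps and a few of the linguistic features listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names `preprocess` and `linguistic_features` are hypothetical, the stopword set is a tiny stand-in for the NLTK list, and sentence splitting uses a simple regular expression rather than a real sentence tokenizer.

```python
import re
from statistics import mean

# Tiny illustrative stopword set (an assumption; the full NLTK English
# stopword list would normally be used instead).
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "to", "was"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip non-alphanumeric characters, tokenize, drop stopwords."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def linguistic_features(text: str) -> dict[str, float]:
    """Compute word counts, type-token ratio, and sentence length statistics."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences] or [0]
    unique = set(words)
    return {
        "word_count": len(words),
        "unique_word_count": len(unique),
        "ttr": len(unique) / len(words) if words else 0.0,
        "avg_sentence_length": mean(lengths),
        "max_sentence_length": max(lengths),
        "min_sentence_length": min(lengths),
    }


tokens = preprocess("The cat sat. It was happy!")
features = linguistic_features("The cat sat. The cat ran fast!")
```

Here the Type-Token Ratio is computed on the raw lowercased words, before stopword filtering; whether the project computes it before or after filtering is not stated.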

FAQ

Q1: What is the purpose of this project?

A1: The project aims to create a machine learning model that can distinguish between human-generated and AI-generated text based on various linguistic and additional features extracted from the text.

Q2: What data is used in this project?

A2: The project uses a dataset containing samples of human-generated and AI-generated text. The data is balanced to ensure equal representation of both classes during training.
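
One way to balance the classes is to downsample each class to the size of the smallest one. The sketch below assumes a pandas DataFrame with `text` and `label` columns; those column names, the `balance_dataset` helper, and the toy rows are illustrative and not taken from the project.

```python
import pandas as pd


def balance_dataset(
    df: pd.DataFrame, label_col: str = "label", seed: int = 42
) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n_per_class = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are interleaved rather than stacked.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy rows standing in for the CSV data (normally loaded with pd.read_csv).
df = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "e"],
        "label": ["human", "human", "human", "ai", "ai"],
    }
)
balanced = balance_dataset(df)
```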

Q3: What features are extracted from the text?

A3: The extracted features cover lexical, syntactic, discourse, rhetorical, and stylistic properties of the text, along with perplexity, readability scores, sentiment analysis, named entity recognition, topic modeling, and n-gram frequency distributions.
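
As an illustration of the perplexity feature, the following is a minimal bigram language model with add-one (Laplace) smoothing. The smoothing method, the `BigramModel` class, and the toy corpus are assumptions made for this sketch; the project's own bi-gram model may be built differently.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one smoothing (illustrative sketch)."""

    def __init__(self, corpus: list[list[str]]):
        self.contexts = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.contexts.update(tokens[:-1])  # every token that has a successor
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = {tok for sent in corpus for tok in sent} | {"<s>", "</s>"}

    def prob(self, prev: str, word: str) -> float:
        """Smoothed conditional probability P(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (
            self.contexts[prev] + len(self.vocab)
        )

    def perplexity(self, sentence: list[str]) -> float:
        """exp of the negative mean log-probability over the bigrams."""
        tokens = ["<s>"] + sentence + ["</s>"]
        pairs = list(zip(tokens, tokens[1:]))
        log_prob = sum(math.log(self.prob(p, w)) for p, w in pairs)
        return math.exp(-log_prob / len(pairs))


model = BigramModel([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen = model.perplexity(["the", "cat", "sat"])
unseen = model.perplexity(["sat", "the", "cat"])
```

A word order the model saw during training gets a lower perplexity than a reordering it has not seen, which is what makes the score usable as a numeric feature.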

Q4: How is the model trained?

A4: The model is trained using a voting classifier that combines Random Forest, Logistic Regression, and Support Vector Machine classifiers. Hyperparameter tuning is performed using GridSearchCV to find the best model parameters.
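
A rough sketch of this setup with scikit-learn is shown below. It assumes hard (majority) voting, a plain TF-IDF input, and a small illustrative parameter grid; the toy texts, the step names, and the grid values are not taken from the project.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data for illustration only.
texts = [
    "honestly i dunno, it was kinda weird lol",
    "we got lost twice but the food was great",
    "my cat knocked the mug off the table again",
    "can't believe the bus was late, typical",
    "in conclusion, the aforementioned factors are significant",
    "this comprehensive overview highlights several key aspects",
    "furthermore, it is important to note the following considerations",
    "overall, the analysis demonstrates a variety of notable benefits",
]
labels = ["human"] * 4 + ["ai"] * 4

voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
    ],
    voting="hard",  # majority vote; soft voting is an equally plausible choice
)
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("vote", voter)])

# Nested parameter names follow the <step>__<estimator>__<param> pattern.
param_grid = {
    "vote__rf__n_estimators": [10, 25],
    "vote__lr__C": [0.1, 1.0],
    "vote__svm__C": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

prediction = search.predict(["it is important to note several key aspects"])[0]
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` is the pipeline refit on all training data.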

Q5: How is the performance of the model evaluated?

A5: The model's performance is evaluated using accuracy, precision, recall, and F1-score metrics on a test set.
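
For reference, these four metrics can be computed with scikit-learn as follows. The label arrays are made-up values for illustration, and treating "ai" as the positive class is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up test labels and predictions for illustration.
y_true = ["ai", "ai", "ai", "human", "human", "human"]
y_pred = ["ai", "ai", "human", "human", "human", "ai"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="ai", average="binary"
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case there are two true positives, one false positive, and one false negative for the "ai" class, so precision, recall, and F1 all equal 2/3, and four of six predictions are correct.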

Q6: Can the model predict the authorship of new text samples?

A6: Yes, the project includes a function that extracts features from a new text sample, vectorizes them, and predicts whether the text is human-generated or AI-generated.
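
The extract, vectorize, and predict flow might look like the sketch below, which stacks TF-IDF columns next to two handcrafted numeric features. All names here (`handcrafted_features`, `build_matrix`, `predict_authorship`) are hypothetical, and a single Logistic Regression stands in for the full voting classifier to keep the example short.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def handcrafted_features(text: str) -> list[float]:
    """Word count and type-token ratio as two simple numeric features."""
    words = text.lower().split()
    ttr = len(set(words)) / len(words) if words else 0.0
    return [float(len(words)), ttr]


def build_matrix(texts, vectorizer, fit=False):
    """Stack TF-IDF columns and handcrafted columns into one sparse matrix."""
    tfidf = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    extra = csr_matrix(np.array([handcrafted_features(t) for t in texts]))
    return hstack([tfidf, extra]).tocsr()


def predict_authorship(text: str, vectorizer, model) -> str:
    """Return the predicted label for a single new text sample."""
    return str(model.predict(build_matrix([text], vectorizer))[0])


# Toy training data for illustration only.
train_texts = [
    "lol that movie was so bad i left early",
    "my brother burnt the toast again this morning",
    "in summary, these factors contribute to overall effectiveness",
    "it is important to consider the following key aspects",
]
train_labels = ["human", "human", "ai", "ai"]

vectorizer = TfidfVectorizer()
X_train = build_matrix(train_texts, vectorizer, fit=True)
model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

label = predict_authorship("it is important to consider these factors", vectorizer, model)
```

The same fitted vectorizer must be reused at prediction time so that the new sample's columns line up with the columns the model was trained on.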

Acknowledgements

Authors

Contributors

superiorsd10
