Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

Home Page: https://arxiv.org/abs/2301.07597

Jupyter Notebook 18.99% Python 81.01%

ai chatbot chatgpt dataset nlp openai text-classification python gpt2 gpt3

chatgpt-comparison-detection's People

minqi824 yyht ariafyy arlancooper dumpmemory ericxsun akbarbuneri eltociear roimoves stjordanis 5l1v3r1 bwubuw mydatascience mklarqvist techthiyanes johnnypeck guttappa1238 ejrav kailashhambarde zhihao-chen webrulon haojiepan1 kioco fenglui qingkongzhiqian konic-nlp yzw739265427 ylinlinz mcarbel bigbrother666sh knightcn1983 mirfani340 lili0710432 gibborimmm adai-5090 littleprince2021 siai-research miss-bug tcmzhou sufeheisenberg threecolorfr shliujing yixinz-nus mzheng3 jzsues jin1994xiul fiser-trangptq5 andrewtsao yejiahaoye anbo724 enockipp patrickmcguinness charlie1-gao webzerg big-data-lab-umbc abolfazlabouei alexanderinum aicodehunt calenng omygpt luzhongqiu tpk28 ljmat ishmael82 fengyunzaidushi joolsa kekewind clouddatalab dat1208 mystery-w mfkiwl emehprincewill mysqlsc zhuchangjiang hertera1 ecalderini svemulapalli vital121 aslily1234 scutcyr dorucioclea evanhon1818 54457616 bdpqlhk johnnyliu422 deepeye tooxin xunjunyin natalijagajic sweetsyjerome yaospacetim roderick-stinson ziiiiiton cold-eye yusuf23004 chennian985 leomesss ethicalsecurity-agency outstandingjie liushuchun

chatgpt-comparison-detection's Issues

Where can I find the checkpoints of the GLTR model? Alternatively, could you share the training subset (the number of human and ChatGPT-generated sentences) used to train the GLTR and RoBERTa models?

Help with Open-Assistance

Hey hey - I'm one of the folks who started the work at LAION on open assistance. Want to speak on how we can work together?

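A question about how the F1 score is computed in model evaluation

Hello,
I'm sorry to bother the project team with this issue, but I have been unable to reproduce this project's results and cannot get the same numbers as the paper, so I would appreciate your help.
According to the description of your released chatgpt-detector-roberta-chinese model, it was trained on the mix-filter data.
My testing procedure is shown in the script below.

The final result of testing on raw-full:
2024-03-05 19:44:46,902 - testing - INFO - test_doc: {'f1': 0.9976726144297905}

This differs significantly from the figures in the table of the original paper, so I would like to ask: is my testing procedure wrong? If so, what is the correct way to test?

Finally, thank you for your work in any case.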

import argparse
import os
import numpy as np
import sys
import evaluate
import pandas as pd
import torch
import logging
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset, concatenate_datasets
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from transformers import (
        AutoModelForSequenceClassification, 
        AutoTokenizer,
        AutoConfig,
        BertForSequenceClassification
    )

logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger('testing')
file_handler = logging.FileHandler('test.log') 
file_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

sys.path.append('./')

_PARSER = argparse.ArgumentParser('ptm detector')

_PARSER.add_argument('--model_name', type=str, default='/data1/xxxxxx/DeepfakeText-chinese/model/chinese-roberta-wwm-ext', help='ptm model name')
_PARSER.add_argument('--roberta_model',type=str, default='/data1/xxxxxx/DeepfakeText-chinese/model/chatgpt-detector-roberta-chinese', help='roberta_model')
_PARSER.add_argument('--test_doc', type=str, default='../../data/zh_doc_test.csv', help='input doc test file path')
_PARSER.add_argument('--test_sent', type=str, default='../../data/shuffled_zh_sent_test.csv', help='input test sent file path')
_PARSER.add_argument('--batch_size', type=int, default=16, help='batch size')
_PARSER.add_argument('--epochs', type=int, default=2, help='epochs')
_PARSER.add_argument('--num_labels', type=int, default=2, help='num_labels')
_PARSER.add_argument('--cuda', type=str, default='0', help='gpu ids, like: 1,2,3')
_PARSER.add_argument('--seed', type=int, default=42, help='random seed.')
_PARSER.add_argument('--max_length', type=int, default=365, help='max_length')
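# type=bool would turn any non-empty string (even "False") into True, so use a real on/off flag (Python 3.9+).
_PARSER.add_argument('--stacking', action=argparse.BooleanOptionalAction, default=True, help='stacking')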

_ARGS = _PARSER.parse_args()

if len(_ARGS.cuda) > 1:
    os.environ['TOKENIZERS_PARALLELISM'] = 'false'
    os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'

os.environ["OMP_NUM_THREADS"] = '8'
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # if cuda >= 10.2
os.environ['CUDA_VISIBLE_DEVICES'] = _ARGS.cuda

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def create_dataloader(args: argparse.Namespace):
    """
    Build the dataloaders for test_doc and test_sent, in that order.
    """
    datasets = []
    files = [args.test_doc, args.test_sent]
    for file in files:
        df = pd.read_csv(file)
        dataset = Dataset.from_pandas(df)
        datasets.append(dataset)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    def tokenize_fn(example):
        return tokenizer(example['answer'], max_length=args.max_length, padding='max_length', truncation=True)
    datasets = [datasets[0], datasets[1]]
    names = ['id', 'question', 'answer', 'source']
    tokenized_datasets = []
    for dataset in datasets:
        tokenized = dataset.map(
                        tokenize_fn,
                        batched=True,
                        remove_columns=names)
        tokenized_datasets.append(tokenized)
    def collate_fn(examples):
        return tokenizer.pad(examples,return_tensors='pt')
    
    dataloaders = []
    for dataset in tokenized_datasets:
        dataloader = DataLoader(dataset, shuffle=False, collate_fn=collate_fn, batch_size=args.batch_size)
        dataloaders.append(dataloader)
    return dataloaders

def eval(args, dataloaders):
    if args.stacking:
        # roberta_cnn_model = torch.load(args.roberta_cnn_model).to(device)
        # roberta_cnn_model.eval()
        # print("roberta_cnn_model loaded")
        
        # roberta_model = torch.load(args.roberta_model).to(device)
        # roberta_model.eval()

        config = AutoConfig.from_pretrained(
            args.roberta_model,
            num_labels=2,
        )
        roberta_model = BertForSequenceClassification.from_pretrained(
            args.roberta_model,
            config=config,
        ).to(device)
        
        # print(roberta_model.base_model)
        # exit()
        # for param in roberta_model.base_model.parameters():
        #     param.requires_grad = False
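        roberta_model.eval()  # from_pretrained already returns the model in eval mode; kept explicit for clarity
        print("roberta_model loaded")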

        # roberta_rcnn_model = torch.load(args.roberta_rcnn_model).to(device)
        # roberta_rcnn_model.eval()
        # print("roberta_rcnn_model loaded")

        # roberta_rcnn_model = torch.load(args.roberta_rcnn_model).to(device)
        # roberta_rcnn_model.eval()
        # print("roberta_rcnn_model loaded")

        eval_name_list = ['test_doc', 'test_sent']
        for item, eval_name in enumerate(eval_name_list, 0):
            metric = evaluate.load("/data1/xxxxxx/DeepfakeText-chinese/dataset/metrics/f1")
            for step, batch in enumerate(tqdm(dataloaders[item], desc='Evaling', colour="green")):
                batch.to(device)
                with torch.no_grad():
                    labels = batch.pop('label')
                    outputs = roberta_model(**batch)['logits']
                predictions = outputs.argmax(dim=-1)
                predictions, references = predictions, labels
                metric.add_batch(
                    predictions=predictions,
                    references=references,
                )
            eval_metric = metric.compute()
            logger.info(f"{eval_name}: {eval_metric}")

dataloaders = create_dataloader(_ARGS)
eval(_ARGS, dataloaders)
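
The script above logs a single `f1` value per test set. As a hedged illustration of why one number can differ from a table entry, the sketch below computes F1 from raw predictions for each choice of positive class and then takes the macro average. The toy labels and the helper name `binary_f1` are made up for illustration; which averaging the paper's table uses is not stated in this thread. As far as I know, the `f1` metric in `evaluate` defaults to binary averaging with label 1 as the positive class, so the logged value would correspond to `pos_label=1` here.

```python
def binary_f1(references, predictions, pos_label=1):
    """F1 with one class treated as positive: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == pos_label and p == pos_label)
    fp = sum(1 for r, p in pairs if r != pos_label and p == pos_label)
    fn = sum(1 for r, p in pairs if r == pos_label and p != pos_label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): 0 and 1 are the two class ids.
references = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

f1_pos1 = binary_f1(references, predictions, pos_label=1)
f1_pos0 = binary_f1(references, predictions, pos_label=0)
macro_f1 = (f1_pos0 + f1_pos1) / 2

print(f"F1 (class 1 positive): {f1_pos1:.4f}")
print(f"F1 (class 0 positive): {f1_pos0:.4f}")
print(f"macro F1:              {macro_f1:.4f}")
```

The same predictions give three different F1 values, so it is worth checking which one a reported figure refers to before concluding that a reproduction failed.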

Is the 'ling' detector on Huggingface down? And other issues

Dear Hello-SimpleAI,

Thank you for your detectors, which have been valuable resources for my university admin duties. I've posted a similar discussion as this one on Huggingface, hoping one of these will evoke a reply.

I've been making use of all three metrics (one from the single text interface and two from the linguistic interface). Is the 'ling' detector on Huggingface completely offline now? Is it anticipated to be running again at some point in the future?
- I do see that the 'ling' detector is operational on modelscope and will switch to using it there.
I haven't yet pored over the available data/code carefully. Is the material available enough for me to run your detectors locally (on my own computer)?
The 'ling' detector on modelscope reports "Error" on the following text: "Imagine yourself strolling through the vibrant streets of Mexico City, where a lively celebration is in full swing. The atmosphere is charged with excitement as colors, music, and the exuberant spirit of the Mexican people fill the air. Amidst the bustling crowds, your attention is captivated by an array of national symbols proudly on display. The iconic Mexican flag unfurls, showcasing its bold tricolor of green, white, and red, which dances in the breeze. At the heart of this visual spectacle stands the unmistakable emblem of the Mexican eagle, with its fierce countenance and outstretched wings symbolizing a deep-rooted sense of national pride and identity. Now, let's transport ourselves to the enchanting landscapes and historic cities of the Netherlands, a country renowned for its rich cultural heritage."
- Removing the phrase ", a country renowned for its rich cultural heritage" from the last sentence produces no Error. This also happens with the single-text detector on modelscope.
The versions of the single text detector on Huggingface and Modelscope seem to be different (with the latter giving same numbers as your roberta detector). I, for one, would prefer the single text detector to remain different from your roberta detector (as the variety of detectors helps in my admin tasks).

Thank you!

Best,
Jay

some "network error" fields in the chatgpt answer

I have noticed some "network error" entries in the ChatGPT answers of your dataset. You may want to filter these out.
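
A minimal sketch of such a filter, assuming each record is a dict whose `chatgpt_answers` field holds a list of answer strings (that field name and the sample records are assumptions for illustration, not taken from this issue):

```python
def drop_network_errors(records, marker="network error"):
    """Remove ChatGPT answers containing the marker; drop records left with no answer."""
    cleaned = []
    for record in records:
        kept = [a for a in record["chatgpt_answers"] if marker not in a.lower()]
        if kept:
            cleaned.append({**record, "chatgpt_answers": kept})
    return cleaned


# Made-up sample records.
records = [
    {"question": "q1", "chatgpt_answers": ["A normal answer."]},
    {"question": "q2", "chatgpt_answers": ["Network error"]},
    {"question": "q3", "chatgpt_answers": ["Partial text ... network error", "A retry that worked."]},
]

cleaned = drop_network_errors(records)
print([r["question"] for r in cleaned])
```

A plain substring match would also discard genuine answers that happen to discuss network errors, so matched entries should be inspected before they are dropped.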

Unable to reproduce the results in the paper

I am trying to reproduce the results in the paper. I load chatgpt-detector-roberta directly from Hugging Face as both the model and the tokenizer; according to its model page, it was trained on the mixed dataset. In the paper, the F1 score on raw-full should be 99.44, but I cannot obtain that number. The dataset I used was downloaded from the Google Drive link in the HC3 README. Here are the results I obtained:
{'0': {'precision': 0.9994103425909546, 'recall': 0.9951852504256943, 'f1-score': 0.9972933215651661, 'support': 17031.0}, '1': {'precision': 0.9898640296662546, 'recall': 0.9987528061860813, 'f1-score': 0.994288552272163, 'support': 8018.0}, 'accuracy': 0.9963271986905665, 'macro avg': {'precision': 0.9946371861286046, 'recall': 0.9969690283058878, 'f1-score': 0.9957909369186646, 'support': 25049.0}, 'weighted avg': {'precision': 0.9963546382901745, 'recall': 0.9963271986905665, 'f1-score': 0.9963315170942771, 'support': 25049.0}}
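
The report above contains four different F1 values, so a quick way to compare with the paper is to recompute each one from the listed precision, recall, and support. The sketch below does that with the numbers copied from the report; the helper name `f1_from` is made up, and nothing in this thread confirms which variant the paper's 99.44 refers to.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision, recall and support copied from the classification report above.
report = {
    "0": {"precision": 0.9994103425909546, "recall": 0.9951852504256943, "support": 17031},
    "1": {"precision": 0.9898640296662546, "recall": 0.9987528061860813, "support": 8018},
}

f1 = {label: f1_from(row["precision"], row["recall"]) for label, row in report.items()}
total = sum(row["support"] for row in report.values())
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * report[label]["support"] for label in report) / total

for name, value in [("class 0", f1["0"]), ("class 1", f1["1"]),
                    ("macro", macro_f1), ("weighted", weighted_f1)]:
    print(f"{name}: {100 * value:.2f}")
```

Of the four candidates, the class 1 F1 (0.99429 in the report) is the closest to the quoted 99.44, which suggests, but does not confirm, that the gap may come from which F1 is being compared rather than from the evaluation itself.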

When you train the RoBERTa detector, do you truncate the input after 512/256 tokens?

The basic roberta classifier limits its input tokens to 512/256, so how do you process the long input text?

Thank you!
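
For context, two common ways to handle inputs longer than the model limit are plain truncation and overlapping windows whose scores are averaged. The sketch below shows both on a list of token ids. It is not the authors' method (this thread does not say what they did); the function names, the window size, and the step are illustrative assumptions, and a real tokenizer also reserves a few positions for special tokens.

```python
def truncate(token_ids, max_length=512):
    """Keep only the first max_length tokens."""
    return list(token_ids[:max_length])


def sliding_windows(token_ids, max_length=512, step=256):
    """Split token ids into overlapping windows; `step` is the distance between window starts."""
    if max_length <= 0 or step <= 0:
        raise ValueError("max_length and step must be positive")
    if len(token_ids) <= max_length:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


def average_scores(window_scores):
    """Combine per-window probabilities into one document-level score."""
    return sum(window_scores) / len(window_scores)


token_ids = list(range(1200))  # stand-in for a tokenized long document
windows = sliding_windows(token_ids)
print(len(truncate(token_ids)), [len(w) for w in windows])
```

Truncation is simpler but ignores everything past the limit; windowing sees the whole text at the cost of one forward pass per window.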

When was this data collected?

Could you tell me which version of ChatGPT was used for this data, and when the data was collected? Thanks!

Do we have data splits?

Hi dear developers,

I am wondering whether data splits (e.g. train/val/test) have been released. I saw an issue from 3 weeks ago with an official reply saying "We will release the train-test split later." However, when I inspected the data, it did not appear to have been split yet. Please let me know if the official data splits have been released. Thanks!
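
Until an official split is available, one stopgap is a deterministic split keyed on each example's id, so that everyone who uses the same rule gets the same partition. The sketch below is only that: the salt string, the 20% test fraction, and the function name are arbitrary assumptions, and results obtained on such a split would not be comparable with the paper's numbers.

```python
import hashlib


def assign_split(example_id, test_fraction=0.2, salt="hc3-split-v1"):
    """Map an example id to 'train' or 'test' using a stable hash instead of a random seed."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


ids = [str(i) for i in range(1000)]  # stand-in ids
splits = {i: assign_split(i) for i in ids}
n_test = sum(1 for s in splits.values() if s == "test")
print(n_test, len(ids) - n_test)
```

Because the assignment depends only on the id and the salt, adding or removing other examples never moves an existing example between splits.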

How to get CSV?

Hi, thank you for your nice work! I have a question: I can get the .json dataset, but how can I get the .csv file needed to train the model? Or what should the .csv format be?
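
One possible conversion is sketched below. The CSV columns follow the evaluation script posted in another issue on this page (`id`, `question`, `answer`, `source`, plus a `label` column). The JSON field names (`question`, `human_answers`, `chatgpt_answers`, `source`) and the label assignment (0 for human, 1 for ChatGPT) are assumptions that should be checked against the actual files and the official training code.

```python
import csv
import io
import json

FIELDS = ["id", "question", "answer", "source", "label"]


def jsonl_to_rows(jsonl_lines):
    """Flatten each JSON record into one row per answer."""
    rows = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumed label convention: 0 = human, 1 = ChatGPT.
        for key, label in (("human_answers", 0), ("chatgpt_answers", 1)):
            for answer in record.get(key, []):
                rows.append({
                    "id": len(rows),
                    "question": record["question"],
                    "answer": answer,
                    "source": record.get("source", ""),
                    "label": label,
                })
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample record in the assumed JSON-lines layout.
sample = [json.dumps({
    "question": "What is HC3?",
    "human_answers": ["A comparison corpus.", "A dataset."],
    "chatgpt_answers": ["HC3 is a corpus of human and ChatGPT answers."],
    "source": "demo",
})]

rows = jsonl_to_rows(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle to `write_csv` in place of the in-memory buffer.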

404 on downloading the training and testing corpus

Could you please re-upload the link?

Have detection model baselines been open-sourced?

Hi developers,

I'm wondering whether the baseline detection code and pretrained models (e.g. RoBERTa and GLTR) have been released. I could not find them in the repo or on the project main page. Please let me know if they have been open-sourced. Thank you very much!

What does "LABEL_1" mean?

What does LABEL_1 in this prediction result mean?
I can't find any documentation for it.
Thank you very much for your answer.
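
As general background, Hugging Face Transformers names class i `LABEL_i` when a model's config does not define `id2label`, so `LABEL_1` is simply class index 1. Which class that index stands for depends on how the training labels were encoded. The sketch below converts such generic names into readable ones; the mapping 0 = Human, 1 = ChatGPT and the sample prediction are assumptions for illustration, not something confirmed in this thread.

```python
def resolve_label(raw_label, id2label):
    """Translate a generic 'LABEL_<i>' name into a readable one; pass other names through."""
    prefix = "LABEL_"
    if raw_label.startswith(prefix):
        return id2label[int(raw_label[len(prefix):])]
    return raw_label


# Assumed mapping; verify it against the model card or the training data.
ASSUMED_ID2LABEL = {0: "Human", 1: "ChatGPT"}

prediction = {"label": "LABEL_1", "score": 0.98}  # made-up pipeline-style output
print(resolve_label(prediction["label"], ASSUMED_ID2LABEL))
```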

Questions don't match answers in CSV on Google Drive

Hi,

I noticed that the question-answer pairs in the train CSV file don't match those on Huggingface. See the example below.
It's the en_train.csv file linked here and here.

Thanks!

hello-simpleai / chatgpt-comparison-detection
