hello-simpleai / chatgpt-comparison-detection
Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥
Home Page: https://arxiv.org/abs/2301.07597
Hey hey - I'm one of the folks who started the work at LAION on open assistance. Want to speak on how we can work together?
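Hello,
I apologize if this issue disturbs the project members, but I have been unable to reproduce this project's results and cannot obtain the same numbers as the paper, so I hope you can spare a moment to help.
According to your description of the released chatgpt-detector-roberta-chinese model, it was trained on mix-filter.
My test procedure is shown in the script below.
The final result of testing on raw-full was:
2024-03-05 19:44:46,902 - testing - INFO - test_doc: {'f1': 0.9976726144297905}
This differs noticeably from the figures in the table of the original paper, so I would like to ask: is my test method wrong? If so, what is the correct way to test?
Finally, thank you for your work in any case.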
import argparse
import os
import numpy as np
import sys
import evaluate
import pandas as pd
import torch
import logging
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset, concatenate_datasets
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    AutoConfig,
    BertForSequenceClassification
)

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger('testing')
file_handler = logging.FileHandler('test.log')
file_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)
sys.path.append('./')

_PARSER = argparse.ArgumentParser('ptm detector')
_PARSER.add_argument('--model_name', type=str, default='/data1/xxxxxx/DeepfakeText-chinese/model/chinese-roberta-wwm-ext', help='ptm model name')
_PARSER.add_argument('--roberta_model', type=str, default='/data1/xxxxxx/DeepfakeText-chinese/model/chatgpt-detector-roberta-chinese', help='roberta_model')
_PARSER.add_argument('--test_doc', type=str, default='../../data/zh_doc_test.csv', help='input doc test file path')
_PARSER.add_argument('--test_sent', type=str, default='../../data/shuffled_zh_sent_test.csv', help='input test sent file path')
_PARSER.add_argument('--batch_size', type=int, default=16, help='batch size')
_PARSER.add_argument('--epochs', type=int, default=2, help='epochs')
_PARSER.add_argument('--num_labels', type=int, default=2, help='num_labels')
_PARSER.add_argument('--cuda', type=str, default='0', help='gpu ids, like: 1,2,3')
_PARSER.add_argument('--seed', type=int, default=42, help='random seed.')
_PARSER.add_argument('--max_length', type=int, default=365, help='max_length')
# Note: with type=bool, argparse treats any non-empty string (even "False") as True.
_PARSER.add_argument('--stacking', type=bool, default=True, help='stacking')
_ARGS = _PARSER.parse_args()

# Extra settings that only matter when several GPU ids are given.
if len(_ARGS.cuda) > 1:
    os.environ['TOKENIZERS_PARALLELISM'] = 'false'
    os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'

os.environ["OMP_NUM_THREADS"] = '8'
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # if cuda >= 10.2
os.environ['CUDA_VISIBLE_DEVICES'] = _ARGS.cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def create_dataloader(args: argparse.Namespace):
    """
    Build one dataloader per test file, in the order: test_doc, test_sent.
    """
    datasets = []
    files = [args.test_doc, args.test_sent]
    for file in files:
        df = pd.read_csv(file)
        dataset = Dataset.from_pandas(df)
        datasets.append(dataset)

    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)

    def tokenize_fn(example):
        # Only the answer text is classified; long answers are truncated to max_length.
        return tokenizer(example['answer'], max_length=args.max_length, padding='max_length', truncation=True)

    # Drop the raw text columns after tokenizing; the 'label' column is kept.
    names = ['id', 'question', 'answer', 'source']
    tokenized_datasets = []
    for dataset in datasets:
        tokenized = dataset.map(
            tokenize_fn,
            batched=True,
            remove_columns=names)
        tokenized_datasets.append(tokenized)

    def collate_fn(examples):
        return tokenizer.pad(examples, return_tensors='pt')

    dataloaders = []
    for dataset in tokenized_datasets:
        dataloader = DataLoader(dataset, shuffle=False, collate_fn=collate_fn, batch_size=args.batch_size)
        dataloaders.append(dataloader)
    return dataloaders

def evaluate_model(args, dataloaders):
    # Renamed from `eval` so the Python builtin is not shadowed.
    if args.stacking:
        # roberta_cnn_model = torch.load(args.roberta_cnn_model).to(device)
        # roberta_cnn_model.eval()
        # print("roberta_cnn_model loaded")
        # roberta_model = torch.load(args.roberta_model).to(device)
        # roberta_model.eval()
        config = AutoConfig.from_pretrained(
            args.roberta_model,
            num_labels=2,
        )
        roberta_model = BertForSequenceClassification.from_pretrained(
            args.roberta_model,
            config=config,
        ).to(device)
        roberta_model.eval()  # make inference mode explicit (dropout off)
        # print(roberta_model.base_model)
        # exit()
        # for param in roberta_model.base_model.parameters():
        #     param.requires_grad = False
        print("roberta_model loaded")
        # roberta_rcnn_model = torch.load(args.roberta_rcnn_model).to(device)
        # roberta_rcnn_model.eval()
        # print("roberta_rcnn_model loaded")

    eval_name_list = ['test_doc', 'test_sent']
    for item, eval_name in enumerate(eval_name_list, 0):
        # A fresh metric object per test set, loaded from a local copy of the f1 script.
        metric = evaluate.load("/data1/xxxxxx/DeepfakeText-chinese/dataset/metrics/f1")
        for step, batch in enumerate(tqdm(dataloaders[item], desc='Evaling', colour="green")):
            batch.to(device)
            with torch.no_grad():
                labels = batch.pop('label')
                outputs = roberta_model(**batch)['logits']
            predictions = outputs.argmax(dim=-1)
            metric.add_batch(
                predictions=predictions,
                references=labels,
            )
        eval_metric = metric.compute()
        logger.info(f"{eval_name}: {eval_metric}")


dataloaders = create_dataloader(_ARGS)
evaluate_model(_ARGS, dataloaders)
Hello, may I ask where the code for your detection model (the second model in the paper, the RoBERTa-based one) can be found?
Dear Hello-SimpleAI,
Thank you for your detectors, which have been valuable resources for my university admin duties. I've posted a similar discussion as this one on Huggingface, hoping one of these will evoke a reply.
Thank you!
Best,
Jay
I have noticed that some entries in the ChatGPT answer part of your dataset contain "network error" text. These entries could be filtered out.
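A minimal sketch of the kind of filtering suggested here, assuming the answers are available as a plain list of strings. The function name `drop_network_errors` and the sample answers are hypothetical, and the exact wording of the error text in the dataset is an assumption.

```python
def drop_network_errors(answers, marker="network error"):
    """Return only the answers that do not contain the marker (case-insensitive)."""
    needle = marker.lower()
    return [answer for answer in answers if needle not in answer.lower()]


# Hypothetical sample answers, for illustration only.
sample = [
    "Photosynthesis converts light into chemical energy.",
    "network error",
    "Network Error: please try again.",
]
cleaned = drop_network_errors(sample)
print(cleaned)
```

A substring match like this would also drop a genuine answer that happens to discuss network errors, so a stricter rule (for example, matching only very short answers) may be preferable.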
I am trying to reproduce the results of the paper. Currently, I load 'chatgpt-detector-roberta' directly from Hugging Face as both model and tokenizer. According to the model page, this model was trained on the mixed dataset. In the paper, the F1 score on 'raw-full' should be 99.44, but I am unable to obtain this result. The dataset I used was downloaded from the Google Drive link in the HC3 README, and here are the results I obtained:
{'0': {'precision': 0.9994103425909546, 'recall': 0.9951852504256943, 'f1-score': 0.9972933215651661, 'support': 17031.0},
 '1': {'precision': 0.9898640296662546, 'recall': 0.9987528061860813, 'f1-score': 0.994288552272163, 'support': 8018.0},
 'accuracy': 0.9963271986905665,
 'macro avg': {'precision': 0.9946371861286046, 'recall': 0.9969690283058878, 'f1-score': 0.9957909369186646, 'support': 25049.0},
 'weighted avg': {'precision': 0.9963546382901745, 'recall': 0.9963271986905665, 'f1-score': 0.9963315170942771, 'support': 25049.0}}
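When comparing a report like this against a published number, it may help to check which class or average the published F1 refers to. The sketch below recomputes F1 for class '1' from the precision and recall posted above, using the standard formula F1 = 2PR / (P + R). The function name is hypothetical, and whether the paper reports the class '1' F1 is an assumption, not something confirmed in this thread.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for class '1', copied from the report above.
report_class_1 = {"precision": 0.9898640296662546, "recall": 0.9987528061860813}
f1_class_1 = f1_from_precision_recall(
    report_class_1["precision"], report_class_1["recall"]
)
print(f"class '1' F1: {f1_class_1:.4f}")
```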
The basic roberta classifier limits its input tokens to 512/256, so how do you process the long input text?
Thank you!
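For reference, the evaluation script quoted in an earlier issue on this page simply truncates each answer (`truncation=True` with `max_length=365`). One common alternative, sketched here as an assumption rather than as what the authors do, is to split the token ids into overlapping windows and classify each window separately. The name `sliding_windows` and the default sizes are hypothetical.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a list of token ids into overlapping windows of at most max_len ids."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Stand-in token ids; a real tokenizer would produce these.
ids = list(range(1000))
windows = sliding_windows(ids)
print([len(w) for w in windows])
```

In practice `max_len` would need to leave room for the model's special tokens, and the per-window predictions would have to be combined somehow (for example by averaging probabilities); both choices are left open here.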
Could you tell me which version of ChatGPT was used for this data, and when the data was collected? Thanks!
Hi dear developers,
I am wondering whether the data splits (e.g. train/val/test) have been released. I saw an issue from 3 weeks ago with an official reply saying "We will release the train-test split later." However, when I inspected the data, it seems it has not been split so far. Please let me know if the official data splits have been released. Thanks!
Hi, thank you for your nice work! I have a question: I can get the .json dataset, but how can I get the .csv file needed to train the model? Or what should the .csv data format be?
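A minimal sketch of one possible conversion, under stated assumptions. The column names (`id`, `question`, `answer`, `source`, `label`) are taken from the evaluation script quoted in another issue on this page. The JSON field names (`question`, `human_answers`, `chatgpt_answers`, `source`), the one-record-per-line layout, and the label convention (0 = human, 1 = ChatGPT) are assumptions that should be checked against the released files. The helper names are hypothetical.

```python
import csv
import io
import json

FIELDNAMES = ["id", "question", "answer", "source", "label"]


def jsonl_to_rows(lines):
    """Flatten JSON-lines records into one row per answer."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed label convention: 0 = human answer, 1 = ChatGPT answer.
        for label, key in ((0, "human_answers"), (1, "chatgpt_answers")):
            for answer in record.get(key, []):
                rows.append({
                    "id": len(rows),
                    "question": record["question"],
                    "answer": answer,
                    "source": record.get("source", ""),
                    "label": label,
                })
    return rows


def write_csv(rows, handle):
    """Write the rows to any text file handle with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical record, for illustration only.
sample_lines = [
    json.dumps({
        "question": "What is HC3?",
        "human_answers": ["A comparison corpus."],
        "chatgpt_answers": ["HC3 is a dataset.", "It compares answers."],
        "source": "demo",
    })
]
rows = jsonl_to_rows(sample_lines)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```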
Hi developers,
I'm wondering whether the detection model baseline code and pretrained models (e.g. RoBERTa and GLTR) have been released. I failed to find them in the repo or on the project main page. Please let me know if they have been open-sourced. Thank you very much!
Sorry, but this article proves that this tool is simply not valid: https://www.zhihu.com/question/578268304/answer/2843077198 . Please let all the developers in the group read it carefully and continue to improve your algorithms.
FileNotFoundError: Couldn't find a module script at /chinese-roberta-wwm-ext/accuracy/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.
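The message indicates that the metric loader found no `accuracy` script at the given local path or on the Hub. While that path problem is being sorted out, one way to keep evaluating is to compute accuracy directly instead of going through the loader. This is a workaround sketch under that assumption, not a fix for the loading error itself; the function name and sample labels are hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the corresponding reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)


# Hypothetical labels, for illustration only.
preds = [1, 0, 1, 1]
refs = [1, 0, 0, 1]
print(accuracy(preds, refs))
```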
Hi, this work is very useful for my research. Could you share the helpfulness dataset with human labels with me?
Thanks
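Because my usage may be fairly frequent and heavy (several thousand requests per week), I am considering building my own instance using a service such as Cloudflare Workers or Cloudflare Pages. How can I deploy it myself?
Hi, may I know what the license of the released dataset is? We may use it in commercial production, so we need to know exactly what the license is. Thanks!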