NetEaseCrowd

NetEaseCrowd: A Dataset for Long-term and Online Crowdsourcing Truth Inference

Introduction

We introduce NetEaseCrowd, a large-scale crowdsourcing annotation dataset built on a mature Chinese data crowdsourcing platform operated by NetEase Inc. The NetEaseCrowd dataset contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations connecting them, collected over about 6 months. We provide ground truths for all tasks and record timestamps for all annotations.

Task

The NetEaseCrowd dataset is built on a gesture comparison task. Each task presents three choices: two are similar gestures and the third is not. Annotators are required to pick out the one that differs.

[Figure: an example of a gesture comparison task]

Comparison with existing datasets

Compared with the existing crowdsourcing datasets, our NetEaseCrowd dataset has the following characteristics:

Characteristic	Existing datasets	NetEaseCrowd dataset
Scalability	Relatively small numbers of workers, tasks, and annotations	Large-scale data collection with 6 million annotations
Timestamps	Short-term data with no timestamps recorded	Complete timestamps recorded during a 6-month timespan
Task Type	Single type of tasks	Various task types with different required capabilities

Dataset Statistics

The basic statistics of the NetEaseCrowd dataset and of previous datasets are as follows:

Dataset	#Worker	#Task	#Groundtruth	#Anno	Avg(#Anno/worker)	Avg(#Anno/task)	Timestamp	Task type
NetEaseCrowd	2,413	999,799	999,799	6,016,319	2,493.3	6.0	✔︎	Multiple
Adult	825	11,040	333	92,721	112.4	8.4	✘	Single
Birds	39	108	108	4,212	108.0	39.0	✘	Single
Dog	109	807	807	8,070	74.0	10.0	✘	Single
CF	461	300	300	1,720	3.7	5.7	✘	Single
CF_amt	110	300	300	6,030	54.8	20.1	✘	Single
Emotion	38	700	565	7,000	184.2	10.0	✘	Single
Smile	64	2,134	159	30,319	473.7	14.2	✘	Single
Face	27	584	584	5,242	194.1	9.0	✘	Single
Fact	57	42,624	576	216,725	3,802.2	5.1	✘	Single
MS	44	700	700	2,945	66.9	4.2	✘	Single
product	176	8,315	8,315	24,945	141.7	3.0	✘	Single
RTE	164	800	800	8,000	48.8	10.0	✘	Single
Sentiment	1,960	98,980	1,000	569,375	290.5	5.8	✘	Single
SP	203	4,999	4,999	27,746	136.7	5.6	✘	Single
SP_amt	143	500	500	10,000	69.9	20.0	✘	Single
Trec	762	19,033	2,275	88,385	116.0	4.6	✘	Single
Tweet	85	1,000	1,000	20,000	235.3	20.0	✘	Single
Web	177	2,665	2,653	15,567	87.9	5.8	✘	Single
ZenCrowd_us	74	2,040	2,040	12,190	164.7	6.0	✘	Single
ZenCrowd_in	25	2,040	2,040	11,205	448.2	5.5	✘	Single
ZenCrowd_all	78	2,040	2,040	21,855	280.2	10.7	✘	Single
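The two average columns appear to be plain ratios of the count columns. The short check below recomputes them for the NetEaseCrowd row; the variable names are illustrative, and the assumption that the averages are simple ratios rounded to one decimal is ours, although it matches the table.

```python
# Counts copied from the NetEaseCrowd row of the statistics table.
n_workers = 2_413
n_tasks = 999_799
n_annotations = 6_016_319

# ASSUMPTION: Avg(#Anno/worker) = #Anno / #Worker and
# Avg(#Anno/task) = #Anno / #Task, each rounded to one decimal place.
avg_per_worker = round(n_annotations / n_workers, 1)
avg_per_task = round(n_annotations / n_tasks, 1)

print(avg_per_worker, avg_per_task)
```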

Data Content and Format

Obtain the data

There are two ways to access the dataset:

Download the complete NetEaseCrowd dataset directly from Hugging Face [Recommended]
Under the data/ folder, the NetEaseCrowd dataset is provided as partitions in CSV format. Each partition is named NetEaseCrowd_part_x.csv. Concatenate them to obtain the entire NetEaseCrowd dataset.
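For the second option, the partitions can be merged with the standard library alone. The sketch below is one possible approach, not the project's own tooling: `merge_partitions` and `partition_index` are hypothetical helper names, and it assumes that x in the file name is an integer and that every partition starts with the same header row.

```python
import csv
import re
from pathlib import Path


def partition_index(path: Path) -> int:
    """Extract x from NetEaseCrowd_part_x.csv so partitions sort numerically."""
    match = re.search(r"_part_(\d+)\.csv$", path.name)
    return int(match.group(1)) if match else -1


def merge_partitions(data_dir: str, out_path: str) -> int:
    """Stream all partitions into one CSV file; return the number of data rows."""
    parts = sorted(Path(data_dir).glob("NetEaseCrowd_part_*.csv"), key=partition_index)
    if not parts:
        raise FileNotFoundError(f"no partitions found in {data_dir}")

    header = None
    rows_written = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for part in parts:
            with open(part, newline="") as in_file:
                reader = csv.reader(in_file)
                # ASSUMPTION: each partition repeats the header row.
                part_header = next(reader)
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"{part.name} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `merge_partitions("data", "NetEaseCrowd.csv")` would write a single file next to the data/ folder. Streaming row by row keeps memory use flat, which matters for roughly 6 million annotations.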

Dataset format

In the dataset, each record (line) represents an interaction between a worker and a task, with the following columns:

taskId: The unique id of the annotated task.
tasksetId: The unique id of the task set. Each task set contains an unspecified number of tasks. Each task belongs to exactly one task set.
workerId: The unique id of the worker.
answer: The annotation given by the worker, encoded as an integer label numbered from 0.
completeTime: The integer timestamp recording the completion time of the annotation.
truth: The ground truth of the annotated task, encoded in the same way as answer, as an integer label numbered from 0.
capability: The unique id of the capability required by the annotated taskset. Each taskset belongs to exactly one capability.

For privacy reasons, all sensitive content, such as the -Id fields, has been anonymized.

Data sample

tasksetId	taskId	workerId	answer	completeTime	truth	capability
6980	1012658482844795232	64	2	1661917345953	1	69
6980	1012658482844795232	150	1	1661871234755	1	69
6980	1012658482844795232	263	0	1661855450281	1	69

In the example above, there are three annotations, all from the same taskset 6980 and the same task 1012658482844795232. Three annotators, with ids 64, 150, and 263, provide annotations of 2, 1, and 0, respectively. They completed the task at different times. The truth label for this task is 1, and the capability id of the task is 69.
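The sample can be read programmatically as follows. This is a minimal sketch that embeds the three rows as comma-separated text (the delimiter is an assumption) and treats completeTime as milliseconds since the Unix epoch, which is a guess based on the 13-digit values and is not stated in the dataset description. The `completedAt` and `isCorrect` keys are added here for illustration only.

```python
import csv
import io
from datetime import datetime, timezone

# The three sample rows, written as comma-separated text (assumed delimiter).
SAMPLE = """tasksetId,taskId,workerId,answer,completeTime,truth,capability
6980,1012658482844795232,64,2,1661917345953,1,69
6980,1012658482844795232,150,1,1661871234755,1,69
6980,1012658482844795232,263,0,1661855450281,1,69
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    # ASSUMPTION: completeTime is in milliseconds since the Unix epoch.
    row["completedAt"] = datetime.fromtimestamp(
        int(row["completeTime"]) / 1000, tz=timezone.utc
    )
    # An annotation is correct when it equals the ground-truth label.
    row["isCorrect"] = row["answer"] == row["truth"]

# Order annotations by completion time, as an online method would see them.
rows.sort(key=lambda r: int(r["completeTime"]))
arrival_order = [r["workerId"] for r in rows]
correct_workers = [r["workerId"] for r in rows if r["isCorrect"]]

print(arrival_order)    # ['263', '150', '64']
print(correct_workers)  # ['150']
```

Only worker 150 matches the truth label here, and the three answers all differ, so a simple vote over this single task would end in a tie.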

Baseline Models

We tested several existing truth inference methods on our dataset; a detailed analysis with more experimental setups can be found in our paper.

Method	Accuracy	F1-score
MV	0.92695	0.92692
DS	0.95178	0.94817
MACE	0.95991	0.94957
Wawa	0.94814	0.94445
ZeroBasedSkill	0.94898	0.94585
GLAD	0.95064	0.95058
EBCC	0.91071	0.90996
ZC	0.95305	0.95301
TiReMGE	0.92713	0.92706
LAA	0.94173	0.94169
BiLA	0.88036	0.87896
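For orientation, MV in the table conventionally stands for majority voting, the simplest of these baselines: each task receives the label chosen by most of its annotators. The sketch below is an illustrative stdlib version run on toy data, not the implementation behind the reported numbers. The function names are hypothetical, and breaking ties in favor of the smallest label is an arbitrary assumption.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Aggregate (taskId, workerId, answer) triples into {taskId: label}."""
    votes = defaultdict(Counter)
    for task_id, _worker_id, answer in annotations:
        votes[task_id][answer] += 1
    # ASSUMPTION: ties are broken in favor of the smallest label.
    return {
        task_id: min(counts, key=lambda label: (-counts[label], label))
        for task_id, counts in votes.items()
    }


def accuracy(predicted, truth):
    """Fraction of tasks with a known truth whose predicted label matches it."""
    scored = [task_id for task_id in predicted if task_id in truth]
    return sum(predicted[t] == truth[t] for t in scored) / len(scored)


# Toy annotations (made up for illustration, not taken from the dataset).
toy = [
    ("t1", "w1", 1), ("t1", "w2", 1), ("t1", "w3", 0),
    ("t2", "w1", 0), ("t2", "w2", 2), ("t2", "w3", 2),
    ("t3", "w1", 0), ("t3", "w2", 1),
]
labels = majority_vote(toy)
print(labels)  # {'t1': 1, 't2': 2, 't3': 0}
print(accuracy(labels, {"t1": 1, "t2": 2, "t3": 1}))
```

Majority voting ignores worker reliability, which is the main thing that methods such as DS and GLAD model on top of it.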

Test with the dataset directly from crowd-kit

The NetEaseCrowd dataset has been integrated into crowd-kit (via a pull request), so you can load it directly with the following code (requires crowd-kit version > 1.2.1):

from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

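# Load the annotations (df) and the ground-truth labels (gt)
df, gt = load_dataset('netease_crowd')

# Aggregate the annotations with Dawid-Skene, run for 10 iterations
ds = DawidSkene(n_iter=10)
result = ds.fit_predict(df)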

print(len(result))
# 999799

Other public datasets

We provide a curated list of other public datasets for the truth inference task.

Dataset Name	Resource
adult	Quality management on Amazon Mechanical Turk. [paper][data]
sentiment, fact	Workshops Held at the First AAAI Conference on Human Computation and Crowdsourcing: A Report. [paper][data]
MS, zencrowd_all, zencrowd_us, zencrowd_in, sp, sp_amt, cf, cf_amt	The active crowd toolkit: An open-source tool for benchmarking active learning algorithms for crowdsourcing research. [paper][data]
Product, tweet, dog, face, duck, relevance, smile	Truth inference in crowdsourcing: Is the problem solved? [paper][data] Note that the tweet dataset is called sentiment in this source; it is different from the sentiment dataset in CrowdScale2013.
bird, rte, web, trec	Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. [paper][data]

Citation

If you use this project in your research or work, please cite it using the following BibTeX entry:

@misc{wang2024dataset,
      title={A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment}, 
      author={Fei Wang and Haoyu Liu and Haoyang Bi and Xiangzhuang Shen and Renyu Zhu and Runze Wu and Minmin Lin and Tangjie Lv and Changjie Fan and Qi Liu and Zhenya Huang and Enhong Chen},
      year={2024},
      eprint={2403.08826},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}

License

The NetEaseCrowd dataset is licensed under CC-BY-SA-4.0.
