graphdetec / mgtab

A Multi-relational Graph-Based Twitter Account Detection Benchmark

Python 100.00%

bot-detection dataset gnn-algorithm stance-detection

mgtab's Issues

Embeddings Generation

Can you provide the scripts you used to produce the embeddings?

Datasets Features Names

Hi,
Thank you for your effort on this project.
I would like to know how we can retrieve the column names (feature names) of the dataset, since it is provided as a torch tensor containing only numerical values. Similarly, the accounts' human/bot labels are provided as a binary vector, without account names or IDs.

Information Gain

Are the top 10 numerical and Boolean features by information gain the same for stance detection and bot detection?
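
One way to check this yourself is to compute the information gain of each feature against each label set and compare the two rankings. The sketch below is a minimal illustration, not the authors' procedure: the feature names, values, and labels are made up, and it assumes discrete features (numerical features would first need to be binned).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_k(features, labels, k=10):
    """Names of the k features with the highest information gain."""
    scores = {name: information_gain(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: names and values are hypothetical, for illustration only.
features = {
    "verified": [1, 1, 0, 0, 1, 0],
    "default_profile": [0, 1, 0, 1, 0, 1],
    "has_url": [1, 0, 0, 1, 1, 0],
}
bot_labels = [0, 0, 1, 1, 0, 1]      # hypothetical bot labels
stance_labels = [0, 1, 0, 1, 0, 1]   # hypothetical stance labels

top_bot = top_k(features, bot_labels, k=1)
top_stance = top_k(features, stance_labels, k=1)
shared = set(top_bot) & set(top_stance)  # features ranked highly for both tasks
```

On the real dataset, `features` would hold one column per property and `k` would be 10; an empty or partial `shared` set would mean the two tasks do not rank the same features highest.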

Code for preprocessing the raw data from the original authors into the format used by MGTAB

Hello @GraphDetec,

First of all, thank you for your great work.

I want to ask about the code for preprocessing the raw data from the original authors into the format used by MGTAB.

For example, for Cresci2015, the original authors' website (http://mib.projects.iit.cnr.it/dataset.html) provides only the raw data. But the data on your Google Drive (https://drive.google.com/uc?export=download&id=1AzMUNt70we5G2DShS8hk5qH95VR9HfD3) is in a different format (some file names are cat_properties_tensor.pt, des_tensor.pt, ...).

Can you share the notebook code to preprocess the raw data from the original author into the format used by MGTAB?

Dataset Collection Process

It was a pleasure to read your paper. I have some questions about the dataset collection process:
How did you obtain the other accounts from the seed accounts?
What are the detailed online events?
What are the relationships between seed accounts and online events?

Questions about the topics or claims

Thank you for offering this project for the stance detection community with social links.
However, I have some questions about the datasets. Could you help me resolve them?

As you said in the introduction section,

Stance detection aims at detecting the user’s stance on a topic or claim.

But in the datasets, I cannot find the labels for the topics/claims/events.
I understand that the datasets can be modeled as a node classification task on a heterogeneous graph.

When I load label_stances, I get labels 0/1/2 (neutral/against/support). But I want to know the topic these labels refer to.
For example, if node 0 and node 2 both have label 1, do they hold the same stance on the same topic?
Because the tweets are given with 768-d embeddings, it is hard to extract meaningful topics.

How can I get the topics/claims for the label of stance?

How can we get access to the raw data?

Hello! Is there a way to access the raw data of the tweets? I mean the text itself. This would be very helpful if I want to try different embeddings. Thanks!

Stance detection

Stance detection is generally used to detect the stance of a piece of text.
In your dataset, is the stance label annotated for all of a user's historical tweets taken together as one text?

A Little Confused

Why are there more "friends" than "followers"?

Preprocessing Issues

Hello,
1) You mentioned in the paper that you calculated the z-score of each feature. However, upon inspecting the dataset, I found that no feature has a value greater than one. To my knowledge, the z-score is calculated as:

z = (x-E(x)) / std(x)

Have you standardized the data using the above z-score, or normalized it by dividing each column's values by the maximum value?
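
The difference between the two options is easy to see on a toy column. The sketch below is only an illustration with made-up follower counts, not the dataset's actual preprocessing: z-scores are unbounded and can be negative, whereas dividing non-negative values by the column maximum always yields values in [0, 1].

```python
import statistics

def z_score(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def max_scale(values):
    """Normalize by dividing every value by the column maximum."""
    peak = max(values)
    return [v / peak for v in values]

# Made-up follower counts with one large outlier.
followers = [10, 20, 30, 40, 900]

standardized = z_score(followers)
scaled = max_scale(followers)

print(max(standardized))  # greater than 1 because of the outlier
print(min(standardized))  # negative: values below the mean
print(max(scaled))        # 1.0 by construction
```

A column whose values all lie in [0, 1] is therefore consistent with max (or min-max) scaling, and is unlikely to be the output of plain z-score standardization.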

Concerning the created_at feature, how did you normalize it to a value between 0 and 1? I did not find any information about this specific preprocessing step in the paper.
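
For reference, one common way to map account creation times into [0, 1] is to convert each timestamp to epoch seconds and apply min-max scaling. The sketch below is a guess at a plausible approach, not the authors' confirmed method: it assumes the raw field uses the Twitter v1.1 API timestamp layout, and the example dates are made up.

```python
from datetime import datetime

# Timestamp layout of the Twitter v1.1 API, e.g. "Wed Oct 10 20:19:24 +0000 2018".
# Assumption: the raw created_at field follows this layout.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def to_epoch_seconds(created_at):
    """Parse a created_at string into seconds since the Unix epoch."""
    return datetime.strptime(created_at, TWITTER_FORMAT).timestamp()

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up creation dates, for illustration only.
created_at = [
    "Wed Oct 10 20:19:24 +0000 2018",
    "Mon Mar 21 09:00:00 +0000 2011",
    "Fri Jan 01 00:00:00 +0000 2021",
]
normalized = min_max([to_epoch_seconds(s) for s in created_at])
```

Under this scheme the oldest account maps to 0 and the newest to 1, so reproducing the values for a new account would require the minimum and maximum timestamps used on the original dataset.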

It would help reproducibility if you shared the user_name feature, or at least the user ID.

I expect to have trouble reproducing the graph-based features. My main concern is how to preprocess a new data point (suppose I trained a model on your dataset and want to predict on a user) so that it ends up processed in exactly the same way as the dataset.

Several authors who released public datasets have shared the user IDs. I kindly ask you to share the account IDs or usernames with me privately via my email ([email protected]). If you really cannot share them, please provide me with the preprocessing code for the entire dataset (especially the graph features).

A related concern is which Twitter API endpoint I should use so that I can construct and preprocess a data point identically to the dataset (especially the graph part). Sharing the code you used to go from the raw Twitter API data to the dataset would therefore be extremely helpful.

Thank you in advance.
