
mgtab's Issues

Dataset Feature Names

Hi,
Thank you for your effort on this project.
I would like to know how we can retrieve the column names (feature names) of the dataset, since it is provided as a torch tensor containing only numerical values. Similarly, the accounts' human/bot labels are provided as a binary vector, without account names or IDs.

Information Gain

Are the top 10 numerical and Boolean features ranked by information gain the same for stance detection and for bot detection?
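For readers who want to check this themselves, the comparison can be sketched as follows. This is a minimal illustration and not the authors' code: the feature names, the toy columns and both label vectors are made up, and the function handles discrete (e.g. Boolean) features only, so numerical features would first need to be binned.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_k_features(columns, labels, k=10):
    """Names of the k features with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]

# Toy data, made up for illustration only (not taken from MGTAB).
columns = {
    "feat_a": [1, 1, 0, 0, 1, 0],
    "feat_b": [1, 0, 1, 0, 1, 0],
    "feat_c": [0, 0, 0, 1, 1, 1],
}
bot_labels = [1, 1, 0, 0, 1, 0]     # hypothetical human/bot labels
stance_labels = [0, 0, 0, 2, 2, 2]  # hypothetical stance labels

top_bot = top_k_features(columns, bot_labels, k=2)
top_stance = top_k_features(columns, stance_labels, k=2)

# Features that appear in both rankings.
overlap = set(top_bot) & set(top_stance)
```

Because the two tasks use different label vectors, the two rankings need not coincide; the size of `overlap` shows how much they share.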

Code for preprocessing the raw data from the original authors into the format used by MGTAB

Hello @GraphDetec,

First of all, thank you for your great work.

I would like to ask about the code for preprocessing the raw data from the original authors into the format used by MGTAB.

For example, for Cresci2015, the original authors' website (http://mib.projects.iit.cnr.it/dataset.html) provides only the raw data. The data you share on Google Drive (https://drive.google.com/uc?export=download&id=1AzMUNt70we5G2DShS8hk5qH95VR9HfD3), however, is in a different format (with files such as cat_properties_tensor.pt, des_tensor.pt, ...).

Could you share the notebook or code used to preprocess the raw data from the original authors into the format used by MGTAB?

Dataset Collection Process

It was a pleasure to read your paper. I have some questions about the dataset collection process:
How did you obtain the other accounts starting from the seed accounts?
Which online events were covered, in detail?
What is the relationship between the seed accounts and the online events?

Questions about the topics or claims

Thank you for providing the stance detection community with this project, which includes social links.
However, I have some questions about the datasets. Could you help me resolve them?

As you said in the introduction section,

Stance detection aims at detecting the user’s stance on a topic or claim. 

But in the datasets, I cannot find labels for the topics/claims/events.
I understand that the datasets can be modeled as a node classification task on a heterogeneous graph.

When I load label_stances, I get labels 0/1/2 (neutral/against/support), but I would like to know the topic each label refers to.
For example, if node 0 and node 2 both have label 1, do they hold the same stance on the same topic?
Because the tweets are given with 768-d embeddings, it is hard to extract meaningful topics.

How can I get the topics/claims that the stance labels refer to?

How can we get access to the raw data?

Hello! Is there a way to access the raw tweet data, meaning the text itself? This would be very helpful for trying different embeddings. Thanks!

Stance detection

Stance detection is generally used to detect the stance of a piece of text.
Is the stance label in your dataset annotated per user, based on all of the user's historical tweets taken together as a single text?

Preprocessing Issues

Hello,
1) You mentioned in the paper that you calculated the z-score of each feature. However, upon inspecting the dataset, I found that no feature has a value greater than one. To my knowledge, the z-score is calculated as:

z = (x-E(x)) / std(x)

Have you standardized the data using the above z-score, or normalized it by dividing each column's values by the maximum value?
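To make the distinction concrete, here is a minimal sketch of both options on a made-up column. It is not the authors' preprocessing code: the function names and the toy values are hypothetical, and the z-score uses the population standard deviation as an assumption.

```python
import statistics

def z_score(values):
    """Standardize: z = (x - E(x)) / std(x), using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def max_scale(values):
    """Normalize by dividing every value by the column maximum."""
    peak = max(values)
    return [v / peak for v in values]

def looks_standardized(values, tol=1e-6):
    """Heuristic check: a z-scored column has mean ~0 and std ~1."""
    mean_ok = abs(statistics.fmean(values)) < tol
    std_ok = abs(statistics.pstdev(values) - 1.0) < tol
    return mean_ok and std_ok

# Toy column, made up for illustration (e.g. a follower count).
followers = [10, 20, 30, 40, 1000]

z = z_score(followers)    # contains negative values and a value above 1
m = max_scale(followers)  # every value lies in (0, 1]
```

A z-scored column with any spread contains negative values and, for skewed columns like this one, values above 1. A column whose values all lie in [0, 1] is therefore more consistent with max or min-max scaling, which `looks_standardized` can help confirm on the released tensors.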

2) Concerning the created_at feature, how did you normalize it to a value between 0 and 1? I did not find any information about this specific preprocessing step in the paper.
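One common way to obtain such values is min-max scaling of the Unix timestamp, sketched below. This is only a guess at what may have been done: the paper does not confirm it, and the function name and example dates are hypothetical.

```python
from datetime import datetime, timezone

def min_max_scale_timestamps(created_at):
    """Map timezone-aware datetimes to [0, 1] by min-max scaling of Unix time."""
    seconds = [dt.timestamp() for dt in created_at]
    lo, hi = min(seconds), max(seconds)
    span = hi - lo
    if span == 0:
        # All accounts were created at the same instant; avoid dividing by zero.
        return [0.0 for _ in seconds]
    return [(s - lo) / span for s in seconds]

# Hypothetical account creation dates, made up for illustration.
created_at = [
    datetime(2010, 1, 1, tzinfo=timezone.utc),
    datetime(2015, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 1, tzinfo=timezone.utc),
]
scaled = min_max_scale_timestamps(created_at)
```

The oldest account maps to 0 and the newest to 1. Reproducing the dataset's values exactly would still require the minimum and maximum timestamps used by the authors.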

Sharing the user_name feature, or at least the user ID, would make the results easier to reproduce.

3) I expect to have issues reproducing the graph-based features. My main concern is how to preprocess a new data point (suppose I trained a model on your dataset and want to predict on a new user) so that it is processed in exactly the same way as the dataset.

Several authors who released public datasets have shared the user IDs. I kindly ask you to share the account IDs or usernames with me privately via email ([email protected]). If you really cannot share them, please provide me with the preprocessing code for the entire dataset (especially the graph features).

A related concern is which Twitter API endpoint I should use so that I can construct and preprocess a data point identically to the dataset (especially the graph part). Sharing the code you used to go from the raw Twitter API data to this dataset would therefore be extremely helpful.

Thank you in advance.
