Comments (8)
MGTAB is a standardized data set. The code for standardizing the data is as follows:
import json
import os

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_train = read_info_data("./train_new.json")
df_train_feature = exact_and_process_feature(df_train, with_label, 'labeled_df_train.json')
df_test = read_info_data("./test_new.json")
df_test_feature = exact_and_process_feature(df_test, with_label, 'labeled_df_test.json')
if task == 1:
    df_train_feature.pop("isBot")
    df_test_feature.pop("isBot")
else:
    df_train_feature.pop("category")
    df_test_feature.pop("category")

numerical_cols = [
    "followers_count",
    "friends_count",
    "listed_count",
    "created_at",
    "favourites_count",
    "statuses_count",
    "screen_name_length",
    "name_length",
    "description_length",
    "followers_friends_ratios",
]
df = pd.concat([df_train_feature, df_test_feature], ignore_index=True)
df_name = df['screen_name']
df[numerical_cols] = MinMaxScaler().fit_transform(df[numerical_cols])
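For reproducibility, the fitted scaler's per-feature minima and maxima could be exported alongside the processed data. A minimal sketch of that idea, using a toy frame in place of the real MGTAB features (the column values below are made up):

```python
import json

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the real feature frame; MGTAB's actual values differ.
numerical_cols = ["followers_count", "friends_count"]
df = pd.DataFrame({"followers_count": [0.0, 4.0, 10.0],
                   "friends_count": [1.0, 3.0, 5.0]})

scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# data_min_ / data_max_ are the per-column minima and maxima the scaler
# learned; releasing them makes the published 0-1 values invertible.
stats = {col: {"min": float(mn), "max": float(mx)}
         for col, mn, mx in zip(numerical_cols, scaler.data_min_, scaler.data_max_)}
print(json.dumps(stats, indent=2))
```

Dumping `stats` to a small JSON file next to the dataset would address the requests later in this thread without exposing any per-user data.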
The feature processing function is as follows:
def exact_and_process_feature(df, with_label, file_name):
    if not os.path.exists('./process/' + file_name):
        to_drop = list(df.keys())
        not_to_drop = [
            # 'has_extended_profile',
            "profile_use_background_image",
            "default_profile",
            "default_profile_image",
            "verified",
            "geo_enabled",
            "profile_background_image_url",
            "url",
            "profile_background_color",
            "profile_sidebar_fill_color",
            "profile_sidebar_border_color",
            "followers_count",
            "friends_count",
            "listed_count",
            "created_at",
            "favourites_count",
            "statuses_count",
            "screen_name",
            "name",
            "description",
            "id",
            "friends_list",
            "followers_list",
            "mention_list",
            "url_list",
            "hashtag_list",
        ]
        if with_label:
            not_to_drop.append("category")
            not_to_drop.append("isBot")
        for key in not_to_drop:
            to_drop.remove(key)
        df.drop(columns=to_drop, inplace=True)
        df = change_df_dtypes(df)
        df.to_json('./process/' + file_name)
        print('saving {}'.format(file_name))
    else:
        df = pd.read_json('./process/' + file_name)
        print('loading existing {}'.format(file_name))
    return df
def change_df_dtypes(df):
    df = df.fillna(0)

    # Log-compress the heavy-tailed count features.
    for col in ["followers_count", "friends_count", "listed_count",
                "favourites_count", "statuses_count"]:
        df[col] = np.log2(df[col].astype("int64") + 1)

    # Account creation time as (approximate) years since the Unix epoch.
    df["created_at"] = pd.to_numeric(pd.to_datetime(df["created_at"])) / 365 / 24 / 60 / 60 / 1000000000

    # String-length features (vectorized instead of the original per-row
    # chained assignment, which is fragile in recent pandas versions).
    df["screen_name_length"] = df["screen_name"].str.len()
    df["name_length"] = df["name"].str.len()
    df["description_length"] = df["description"].str.len()
    # Note: the ratio is computed on the log-transformed counts above.
    df["followers_friends_ratios"] = df["friends_count"] / (df["followers_count"] + 1)
    df = df.drop(columns=["name", "description"])

    # Boolean features.
    for col in ["default_profile", "default_profile_image", "geo_enabled",
                "profile_use_background_image", "verified"]:
        df[col] = df[col].astype("int8")

    # 1 if the colour is missing, empty, or Twitter's default value, else 0.
    default_colors = {
        "profile_background_color": "F5F8FA",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_sidebar_border_color": "C0DEED",
    }
    for col, default in default_colors.items():
        df["is_default_" + col] = df[col].apply(
            lambda v, d=default: 1 if v is None or v == "" or v == d else 0)

    # Presence indicators for the URL fields.
    for col in ["url", "profile_background_image_url"]:
        df["has_" + col] = df[col].apply(lambda v: 0 if v is None or v == 0 else 1)

    df = df.drop(columns=list(default_colors) + ["url", "profile_background_image_url"])
    return df
def read_info_data(json_path):
    with open(json_path, "r") as f:
        data = json.loads(f.read())
    df = pd.json_normalize(data=data)
    return df
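One detail worth flagging in `change_df_dtypes`: the `created_at` conversion divides nanoseconds since the Unix epoch by one non-leap year of nanoseconds, yielding approximate years since 1970. A quick check of that arithmetic with a toy date (not MGTAB data):

```python
import pandas as pd

# 1971-01-01 is exactly 365 days after the epoch, so the result should be 1.0.
s = pd.Series(pd.to_datetime(["1971-01-01"]))
ns = s.astype("int64")  # nanoseconds since 1970-01-01
years = ns / 365 / 24 / 60 / 60 / 1_000_000_000
print(float(years.iloc[0]))  # 1.0
```

Since leap years are ignored, the value drifts slightly for later dates, but that is harmless here because the feature is min-max scaled afterwards anyway.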
from mgtab.
As you may notice, all numerical features are between 0 and 1. To my knowledge, this is called normalization: a standardized distribution does not necessarily yield values between 0 and 1. After inspecting the code, it seems that you applied min-max scaling rather than, as mentioned in your paper, z-score standardization.
So, as I previously mentioned, we can't use this dataset as long as we don't have the minimum and maximum of each feature in your dataset, so it might be helpful to share the unprocessed data.
I much appreciate your sharing of the code.
Hello,
In the Appendix of your paper, section A.1., you mention that the min/max values of the features are made public on the repository, but I can't find them. Could you point me to them?
If you haven't published them, then I'd agree with @msharara1998 that no one can benefit from your great work on new/other data!
![Screenshot 2023-06-08 at 3 13 17 PM](https://private-user-images.githubusercontent.com/23275727/244379475-7a9ea1f1-f911-4c64-8f4e-9971e1ddf7ae.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDYxODUzNDQsIm5iZiI6MTcwNjE4NTA0NCwicGF0aCI6Ii8yMzI3NTcyNy8yNDQzNzk0NzUtN2E5ZWExZjEtZjkxMS00YzY0LThmNGUtOTk3MWUxZGRmN2FlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI1VDEyMTcyNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTU5M2ZkNjFjZDczMzA0NGJjNDM1OWU2NGMyODNlZjdkNzk4YTg1MzBmOGZmODg3NGU2MjExNzQyMDFkMDM0NDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0._Ll7nS6-PTXZe3Pxt5dBShJounesXi9V5-2WHiQLKMM)
The issue with the preprocessed data extends further: if we adopt this dataset to train a model and then want to predict on a new data point, we must preprocess that point in the same way. For min-max normalization, that means subtracting each feature's training-set minimum and dividing by (max − min). The problem is that we do not have these values. I think this should be solved so that anyone can benefit from this dataset. Thanks
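To make the point concrete, here is what scaling a new point would look like if the per-feature statistics were published. The numbers below are hypothetical; the real MGTAB values are exactly what this thread asks the authors to share:

```python
import numpy as np

# Hypothetical training-set statistics for two features,
# e.g. log2(followers_count+1) and log2(friends_count+1).
train_min = np.array([0.0, 0.0])
train_max = np.array([24.0, 18.0])

def scale_new_point(x, lo, hi):
    """Min-max scale a raw feature vector with the training-set min/max."""
    return (x - lo) / (hi - lo)

x_new = np.array([12.0, 9.0])
print(scale_new_point(x_new, train_min, train_max))  # [0.5 0.5]
```

Without `train_min` and `train_max`, this step is impossible, and any model trained on the released 0–1 features cannot be applied to new accounts.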
Or at least share the minimum and maximum of each feature in the dataset, for a complete reproducibility
Sorry for being persistent, but I bet that when you publicly release a dataset, your aim is for people to benefit from it.
If the released dataset is normalized (using MinMaxScaler), no one can benefit from it unless they have the minimum and maximum of each feature.
Here is an example to clarify my point:
suppose you trained an XGBoost model on a dataset and used the following code:
X = df[features].to_numpy()
y = df[target_label].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model.fit(X_train, y_train)
Now, with the model trained, suppose we want to predict the label of a new data point x_1. This point must be min-max scaled with the same scaler that was fitted on the training data; otherwise we will get wrong results, since the new point has to be scaled with the same statistics as the training data to be consistent:
x_1 = scaler.transform(x_1)
model.predict(x_1)
...
I hope you take this into consideration; otherwise, no one can benefit from the released dataset and part of your efforts will be in vain.
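One common way to make a pipeline like the one above reproducible is to persist the fitted scaler itself (joblib, which ships as a scikit-learn dependency, is a typical choice). A sketch with made-up data and a hypothetical file name:

```python
import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training matrix; the point is persisting the fitted scaler.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
scaler = MinMaxScaler().fit(X_train)
joblib.dump(scaler, "mgtab_scaler.joblib")  # ship this file with the dataset

# A downstream user can then scale a new point consistently:
restored = joblib.load("mgtab_scaler.joblib")
x_1 = np.array([[2.0, 20.0]])
print(restored.transform(x_1))  # [[0.25 0.25]]
```

Shipping the serialized scaler (or just its `data_min_`/`data_max_` arrays) is all downstream users would need.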
MGTAB is a normalized heterogeneous graph dataset with multiple relations, on which effective feature extraction has been carried out. As you say, the original features are not visible, since we hope that readers can use the processed data directly.
Part of the original data has been sent to your email; we hope it will be helpful to your research.
Thanks for sharing. But there is a win-win solution for both of us: just share the minimum and maximum of each numerical feature. That way, no user information is disclosed, and at the same time everyone can benefit properly and correctly from the dataset.
Thanks in advance!
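Publishing the min/max would even let users map the released 0–1 values back to the (log-transformed) raw features, since min-max scaling is trivially invertible. A sketch with hypothetical numbers:

```python
def unscale(v, lo, hi):
    # Invert min-max scaling: map v in [0, 1] back to the original range.
    return v * (hi - lo) + lo

# Hypothetical: a released value of 0.5 with min 0 and max 24 would
# correspond to a log2-transformed raw feature of 12.
print(unscale(0.5, 0.0, 24.0))  # 12.0
```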