benedekrozemberczki / datasets

A repository of pretty cool datasets that I collected for network science and machine learning research.

License: MIT License

network-science network-analysis data-science machine-learning gcn graph-embedding network-embedding community-detection link-prediction node-classification

datasets's Introduction

Benedek A. Rozemberczki / Homepage / Twitter / GitHub / Google Scholar

Welcome stranger

⏰ Currently working on machine learning for drug discovery.
🤖 I would love to collaborate on the machine learning libraries ChemicalX and RexMex.

Great news

🧬 MOOMIN: Deep Molecular Omics Network for Anti-Cancer Drug Combination Therapy was accepted at CIKM 2022.
🪙 The Shapley Value in Machine Learning was accepted at IJCAI 2022.
⭐ A Unified View of Relational Deep Learning for Drug Pair Scoring was accepted at IJCAI 2022.
⚗️ ChemicalX: A Deep Learning Library for Drug Pair Scoring was accepted at KDD 2022.


datasets's Issues

deepwalk

Hello, can a labeled dataset be used as input for DeepWalk? And how can it be converted into a .mat file?

Features in Twitch dataset

Hi,
I have a question about the features.json file. Could you tell me what the features represent?

Thanks

Could you provide an example of how to download the datasets?

The zip file for the Deezer Europe dataset has 2 CSV files that contain only git-lfs pointers. How can I download the actual dataset from the git-lfs pointers placed in those CSVs?
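For readers hitting the same problem: Git LFS pointer files are small text stubs with a `version` line, an `oid sha256:...` line, and a `size` line, as laid out in the Git LFS specification. The sketch below only checks whether a downloaded file is such a stub. The function name `parse_lfs_pointer` and the sample pointer are hypothetical and are not taken from this repository. With Git LFS installed, cloning the repository and running `git lfs pull` inside the clone normally replaces the stubs with the real files, although whether that works here depends on how the repository is hosted.

```python
from typing import Optional

# Every Git LFS pointer starts with a line of this form (per the LFS spec).
LFS_SPEC_PREFIX = "version https://git-lfs.github.com/spec/"


def parse_lfs_pointer(text: str) -> Optional[dict]:
    """Return the pointer fields if `text` is a Git LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(LFS_SPEC_PREFIX):
        return None
    fields = {}
    for line in lines:
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    if "oid" not in fields or "size" not in fields:
        return None
    return {
        "version": fields["version"],
        "oid": fields["oid"],
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Hypothetical pointer content, for illustration only.
example = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0" * 64 + "\n"
    "size 12345\n"
)

print(parse_lfs_pointer(example))            # a dict: this file is a stub
print(parse_lfs_pointer("id,target\n0,1\n"))  # None: this looks like real CSV data
```

If the function returns a dict for a CSV taken from the zip, the archive contains only the stub and the data still has to be fetched through Git LFS.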

How to use a git-lfs pointer file to download the actual file

[GitHub Web-ML] How node features are created

Hi,
First of all, I would like to genuinely thank you for your incredibly clear and detailed guidance. To be honest, I am very new to the field of GNNs and only started delving into it a few weeks ago, so I still have a lot of questions about how they operate.

In the data description section, node features are described as being extracted based on the location, starred repositories, employer, and email address. Therefore, I think features should be text or something similar. However, in the musae_git_features.json file, the features are numerical vectors. I also looked into various other datasets, and node features have a similar form. I genuinely do not understand how to process these features from raw data into numerical vectors that can serve as input for GNNs.

Thank you so much!

How can I generate embeddings using your library if my input is in GraphSON (JSON) or GraphML (XML) format?
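The issue does not say which library is meant, so the following is only a sketch of one common preprocessing step: turning GraphML into an integer edge list, a format many embedding tools accept. It relies on the standard GraphML namespace and on the `source` and `target` attributes of `<edge>` elements. The function name `graphml_to_edge_list` and the toy graph are hypothetical, and GraphSON is not covered.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Standard GraphML XML namespace, in ElementTree's "{uri}tag" notation.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def graphml_to_edge_list(xml_text: str) -> Tuple[List[Tuple[int, int]], Dict[str, int]]:
    """Map GraphML node ids to 0-based integers and return (edges, id_map)."""
    root = ET.fromstring(xml_text)
    index: Dict[str, int] = {}
    # Number the declared nodes in document order.
    for node in root.iter(GRAPHML_NS + "node"):
        index.setdefault(node.get("id"), len(index))
    edges: List[Tuple[int, int]] = []
    for edge in root.iter(GRAPHML_NS + "edge"):
        source, target = edge.get("source"), edge.get("target")
        # Also number endpoints that were never declared as <node> elements.
        for name in (source, target):
            index.setdefault(name, len(index))
        edges.append((index[source], index[target]))
    return edges, index


# Hypothetical toy graph, for illustration only.
example = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="a"/>
    <node id="b"/>
    <node id="c"/>
    <edge source="a" target="b"/>
    <edge source="b" target="c"/>
  </graph>
</graphml>"""

edges, id_map = graphml_to_edge_list(example)
print(edges)
print(id_map)
```

The returned `id_map` makes it possible to translate embedding rows back to the original node identifiers afterwards.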

Question about the LastfmAsia and DeezerEurope datasets

I have a question about the LastfmAsia and DeezerEurope datasets from your CIKM 2020 paper, which I found on SNAP. These datasets are provided with node features which are “extracted based on the artists liked by the users”. Does this mean that each number in the vector associated with a node is the ID of an artist that the user liked? Or is the vector a more abstract embedding? I am referring to the files lastfm_asia_features.json and deezer_europe_features.json.

Thanks in advance!

Twitch Social Network dataset: Target

The target files in the Twitch social network dataset contain the following columns: id, days, mature, views, partner, new_id. Could you please provide some information about these values, and could you also point out which column is used for the node classification task?

MUSAE-Twitch dataset features

Hello,
Thank you for your work!

For the features.json file in the Twitch dataset, is there a reference for what the feature indices in the values list represent? For example, the description says the features are extracted from the games played. Do some of the values in the list represent IDs of the games played by that user? Is there a way to find out what each value corresponds to?

Features in Github Web-ML not of same length

Hi,

The feature lists in git_web_ml/git_feature.json are not all of the same length. Should the shorter ones be padded with 0s at the end? Or is there a feature matrix of shape (node_num, feature_num), or a sparse matrix given as (node_id, feature_id, value) triples?

Thank you!
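One possible reading, which nothing in this thread confirms, is that each list holds the indices of the features that are present for a node, so lists of different lengths are expected and padding with 0s would not be appropriate. Under that assumption, and further assuming 0-based integer node IDs, the lists can be expanded into a 0/1 matrix of shape (node_num, feature_num). The function name `to_multi_hot` and the toy input are hypothetical.

```python
import json
from typing import Dict, List


def to_multi_hot(features: Dict[str, List[int]]) -> List[List[int]]:
    """Expand {node_id: [feature indices]} into a dense 0/1 matrix.

    Assumes (not confirmed by the dataset authors) that every list entry
    is the column index of a feature that is present for that node.
    """
    num_nodes = max((int(node) for node in features), default=-1) + 1
    num_features = max(
        (max(indices) for indices in features.values() if indices), default=-1
    ) + 1
    matrix = [[0] * num_features for _ in range(num_nodes)]
    for node, indices in features.items():
        for feature_id in indices:
            matrix[int(node)][feature_id] = 1
    return matrix


# Hypothetical toy input shaped like a features JSON file.
raw = json.loads('{"0": [1, 3], "1": [0], "2": [2, 3, 4]}')
matrix = to_multi_hot(raw)
for row in matrix:
    print(row)
```

For graphs of realistic size a dense matrix wastes memory; the same (node, feature) pairs could instead be passed as row and column indices to a sparse constructor such as `scipy.sparse.coo_matrix`.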

How to fetch csv files pointing from git-lfs pointers

So I am trying to download the Facebook Large dataset. But the zip file has 2 CSV files that only contain git-lfs pointers:

Any idea how to retrieve the dataset from these git-lfs pointers?

Thanks!

git-lfs pointer for facebook-large dataset

The zip file for the facebook-large dataset has 2 CSV files that contain only git-lfs pointers. How can I download the actual dataset from the git-lfs pointers placed in those CSVs?