
LLMGNN

Code for our paper Label-free Node Classification on Graphs with Large Language Models (LLMS).

Abstract

In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a pipeline for label-free node classification on graphs with LLMs, named LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes, and GNNs are then trained on the LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance at less cost? To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advance node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on the vast-scale dataset ogbn-products at a cost of less than 1 dollar.

Pipeline demo

NOTE: The following documentation is still under construction. We will upload some pickled intermediate results so you can get results without querying the OpenAI API.

Environment Setup

You can create the environment directly from the provided file:

conda env create -f environment.yml --name new_environment_name

Alternatively, first install the base dependencies:

pip install -r requirements.txt

Then install the GPU-related libraries manually, choosing versions compatible with your GLIBC and CUDA:

pip3 install torch torchvision torchaudio

Next, install torch-geometric following its official installation guide, and finally install faiss following the instructions at

https://github.com/facebookresearch/faiss/blob/main/INSTALL.md

About the data

Dataset

We have provided the processed datasets via Google Drive (the link is given in the section below).

To set up the files, you need to:

  1. Unzip small_data.zip into xxx/LLMGNN/data
  2. Put wikics in the same directory
  3. If you want to use ogbn-products, unzip big_data.zip into xxx/LLMGNN/data/new
  4. Set the corresponding path in config.yaml
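The extraction steps above can be sketched in Python. The paths are placeholders mirroring the xxx/LLMGNN layout, and the throwaway demo archive at the bottom exists only to keep the snippet runnable:

```python
import os
import tempfile
import zipfile

def extract(zip_path, dest):
    """Unzip an archive into dest, creating dest if needed."""
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)

# In practice you would run:
#   extract("small_data.zip", "xxx/LLMGNN/data")
#   extract("big_data.zip", "xxx/LLMGNN/data/new")
# Demo on a throwaway archive:
with tempfile.TemporaryDirectory() as tmp:
    zpath = os.path.join(tmp, "small_data.zip")
    with zipfile.ZipFile(zpath, "w") as zf:
        zf.writestr("cora/processed.pt", b"demo")
    extract(zpath, os.path.join(tmp, "data"))
    ok = os.path.exists(os.path.join(tmp, "data", "cora", "processed.pt"))
print(ok)  # True
```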

How to use this repo and run the code

Our code consists of two main parts:

  1. Annotators
  2. GNN training

The pipeline works as follows:

  1. get_dataset in data.py: loads the .pt data file and uses llm.py to generate annotations. It returns the node indexes selected by active learning together with their annotations. A cache stores all annotation outputs; -1 is a sentinel value for a null annotation.
  2. main.py: trains the GNN models. For large-scale training, we do not use the batched version but instead pre-compute all intermediate results.
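For illustration, a minimal sketch of how the returned indexes and annotations might be combined, dropping the -1 sentinel entries before GNN training (variable names here are illustrative, not the repo's actual API):

```python
# -1 marks a null annotation (e.g. no usable response from the LLM)
annotations = [3, -1, 0, 2, -1, 1]       # labels from the annotation cache
selected_idx = [10, 11, 12, 13, 14, 15]  # node indexes chosen by active learning

# Keep only nodes with a valid annotation for GNN training.
pairs = [(i, y) for i, y in zip(selected_idx, annotations) if y != -1]
train_idx = [i for i, _ in pairs]
train_labels = [y for _, y in pairs]
print(train_idx)     # [10, 12, 13, 15]
print(train_labels)  # [3, 0, 2, 1]
```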

To do the annotation, you need to set up your OpenAI key in config.yaml. For a quick try, we have uploaded some cached results to Google Drive: https://drive.google.com/drive/folders/1_laNA6eSQ6M5td2LvsEp3IL9qF6BC1KV.
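The relevant config.yaml entries might look like the following; the key names here are hypothetical, so check the file shipped with the repo:

```yaml
# Hypothetical sketch; actual key names may differ
openai_key: "sk-..."           # your OpenAI API key
data_path: "xxx/LLMGNN/data"   # root of the unzipped datasets
```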

We will then go through one example for the Cora dataset to showcase how to use LLMGNN in practice.

  1. First, in helper/data.py, set the global variable LLM_PATH to the path of llm.py, which should look like LLM_PATH = "xxx/LLMGNN/src/llm.py"
  2. (Optional) Set the global variable PARTITIONS to the path of the partitions used by the Gpart algorithm, which should look like PARTITIONS = "xxx/LLMGNN/data/partitions"
  3. Set the variable GLOBAL_RESULT_PATH to the path for intermediate results such as clustering centers and density, which should look like GLOBAL_RESULT_PATH = "xxx/LLMGNN/data/aax"
  4. Run precompute.py to pre-compute clustering centers and density efficiently using faiss
  5. Download cora_openai.pt (the cache file for responses from OpenAI) and active/cora^cache^consistency.pt (the file that stores annotation results) from Google Drive, then put the former under "<data_path_in_config_file>/" and the latter under "<data_path_in_config_file>/active"
  6. Run the following command: python3 src/main.py --dataset cora --model_name GCN --data_format sbert --main_seed_num 3 --split active --output_intermediate 0 --no_val 1 --strategy pagerank2 --debug 1 --total_budget 140 --filter_strategy consistency --loss_type ce --second_filter conf+entropy --epochs 30 --debug_gt_label 0 --early_stop_start 150 --filter_all_wrong_labels 0 --oracle 1 --ratio 0.2 --alpha 0.33 --beta 0.33
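The density score pre-computed in step 4 can be sketched in plain Python. The repo uses faiss k-means for speed; the exact formula is an assumption here, with one common choice being the inverse distance to the nearest cluster center:

```python
import math

def nearest_center_distance(x, centers):
    """Euclidean distance from point x to its closest cluster center."""
    return min(math.dist(x, c) for c in centers)

def density_scores(embeddings, centers):
    """Higher score = closer to a cluster center, i.e. a denser region."""
    return [1.0 / (1.0 + nearest_center_distance(x, centers))
            for x in embeddings]

# Toy 2-D example with two cluster centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
embeddings = [(0.0, 1.0), (9.0, 10.0), (5.0, 5.0)]
scores = density_scores(embeddings, centers)
print([round(s, 3) for s in scores])  # [0.5, 0.5, 0.124]
```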

Another example, on ogbn-products:

python3 src/main.py --dataset products --model_name AdjGCN --data_format sbert --main_seed_num 1 --split active --output_intermediate 0 --no_val 1 --strategy pagerank2 --debug 1 --total_budget 940 --filter_strategy consistency --loss_type ce --second_filter conf+entropy --epochs 50 --debug_gt_label 0 --early_stop_start 150 --filter_all_wrong_labels 0 --oracle 1 --ratio 0.2 --alpha 0.33 --beta 0.33

Notes

I'll optimize the code structure when I have more time ⏳.

llmgnn's People

Contributors

currytang

