
Amazon Reviews 2023

[🌐 Website] · [🤗 Huggingface Datasets] · [📑 Paper] · [🔬 McAuley Lab]


This repository contains:

Recommendation Benchmarks

Based on the released Amazon Reviews 2023 dataset, we provide scripts that preprocess the raw data into standard train/validation/test splits to encourage benchmarking of recommendation models.

More details here -> [datasets & processing scripts]
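For context, recommendation splits of this kind are typically derived from interaction timestamps. Below is a rough, hypothetical sketch of the common leave-last-out convention (per user: last interaction to test, second-to-last to validation, rest to train); the repo's actual scripts may instead use fixed timestamp cutoffs, so treat this only as an illustration.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) triples per user:
    the latest interaction goes to test, the second-latest to
    validation, and everything earlier to train.
    NOTE: an illustrative convention, not this repo's exact script."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, valid, test = [], [], []
    for user, events in by_user.items():
        events.sort()  # chronological order
        items = [(user, item) for _, item in events]
        train.extend(items[:-2])
        if len(items) >= 2:
            valid.append(items[-2])
        if len(items) >= 1:
            test.append(items[-1])
    return train, valid, test
```

Users with fewer than three interactions contribute nothing to train, which is why such users are often filtered out (e.g., via k-core filtering) before splitting.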

BLaIR

BLaIR, short for "Bridging Language and Items for Retrieval and Recommendation", is a series of language models pre-trained on the Amazon Reviews 2023 dataset.

BLaIR is grounded on pairs of (item metadata, language context), enabling the models to:

  • derive strong item text representations, for both recommendation and retrieval;
  • predict the most relevant item given a simple or complex language context.

More details here -> [checkpoints & code]

Amazon-C4

Amazon-C4, which is short for "Complex Contexts Created by ChatGPT", is a new dataset for the complex product search task.

Amazon-C4 is designed to assess a model's ability to comprehend complex language contexts and retrieve relevant items.

More details here -> [datasets & code]
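At its core, retrieval on a benchmark like Amazon-C4 amounts to scoring every item against the query in a shared embedding space (for example, with vectors produced by a model such as BLaIR) and ranking by similarity. A minimal sketch, assuming you already have a query vector and a dict of item embeddings (the function names here are illustrative, not from this repo):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_items(query_vec, item_vecs):
    """Return item ids sorted by descending cosine similarity
    to the query vector. item_vecs maps item id -> embedding."""
    scores = {item_id: cosine(query_vec, vec)
              for item_id, vec in item_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice one would batch this with a matrix multiply over normalized embeddings, but the ranking logic is the same.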

Contact

Please let us know if you encounter a bug or have any suggestions/questions by filing an issue or emailing Yupeng Hou (@hyp1231) at [email protected].

Acknowledgement

If you find the Amazon Reviews 2023 dataset, the BLaIR checkpoints, the Amazon-C4 dataset, or our scripts/code helpful, please cite the following paper:

@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

The recommendation experiments in the BLaIR paper are implemented using the open-source recommendation library RecBole.

The pre-training scripts draw heavily on the Hugging Face language-modeling examples and SimCSE.


amazonreviews2023's Issues

repo not found

Hi, I was unable to download the metadata from the links provided; the URLs were not found. Could you please fix the issue?

bought_together section empty

I am using the Electronics metadata, and all values in the "bought_together" column are null.
Could you point me to another column with the same information, or is this field still pending and yet to be added to the dataset?

BLaIR Hugging Face model

I was wondering whether you will release the BLaIR model on Hugging Face soon, because the benchmark improvements are considerable and further fine-tuning could lead to even better performance. Please let me know if this is planned.

categories field is empty.

First of all, thank you for sharing such great data.

I wanted to see the distribution of product categories from the item metadata, so I checked the category field, but all the data is empty.

from datasets import load_dataset
ds = load_dataset('McAuley-Lab/Amazon-Reviews-2023', 'raw_meta_All_Beauty', split="full", trust_remote_code=True)

It looks like you have collected category data according to other issues (#7), so I would like to know why the data is missing.

BLaIR integration with UniSRec

Could you also publish the code for integrating BLaIR with UniSRec? From what I can see, it is not available at the moment. This would help a lot with reproducibility! Thanks!

Doubts related to this repo (requesting urgent help)

I have to build an urgent project on GNN-based recommendations, so please clear up my doubts below:

  1. Could you edit the README so that it explains everything in detail? For example, why are the validation and test timestamps taken as constants in one of the scripts? Please elaborate on the small ideas too, so that beginners like me can understand them properly.

  2. Could you also tell me how to build a feature file from the provided metadata? I am working on a GNN recommendation project and want to process your CSV files into txt files containing nodes, edges, and edge types, but I cannot figure out how to do that.
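For context, a common way to turn interaction records into graph files for a GNN is to assign integer node ids to users and items and emit one edge per interaction. A rough, hypothetical sketch (the column layout and the choice of rounded rating as edge type are assumptions for illustration, not this repo's format):

```python
def build_graph(interactions):
    """Map users and items to integer node ids and build an edge list.
    Each (user, item, rating) interaction becomes one edge; using the
    rounded rating as the edge type is just an illustrative choice."""
    node_ids = {}

    def nid(key):
        # assign ids in first-seen order
        if key not in node_ids:
            node_ids[key] = len(node_ids)
        return node_ids[key]

    edges = [(nid(('user', u)), nid(('item', i)), int(r))
             for u, i, r in interactions]
    return node_ids, edges
```

The resulting `node_ids` mapping and `edges` list can then be written out line by line as the nodes/edges/edge-types text files a GNN pipeline expects.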

Blair folder is missing: feature or a bug?

Hi, I wanted to look into the code for the BLaIR implementation, but the entire blair folder seems to be empty. Is that intentional? If so, where can I find the implementation code?

main category is messy

import json
import pandas as pd

file = 'D:\\360downloads\\meta_Health_and_Household.jsonl'  # downloaded from the `meta` link above
filename = 'meta_Health_and_Household'

# Example record:
# {
#   "main_category": "All Beauty",
#   "title": "Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",
#   "average_rating": 4.8,
#   "rating_number": 10,
#   "features": [],
#   "description": [],
#   "price": null,
#   "images": [
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41qfjSfqNyL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41qfjSfqNyL.jpg",
#       "variant": "MAIN",
#       "hi_res": null
#     }
#   ],
#   "videos": [],
#   "store": "Howard Products",
#   "categories": [],
#   "details": {
#     "Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds",
#     "UPC": "617390882781"
#   },
#   "parent_asin": "B01CUPMQZE",
#   "bought_together": null
# }

rows = []
counts = 0
with open(file, 'r', encoding='utf-8') as fp:
    for line in fp:
        counts += 1
        listing = json.loads(line)
        # "Package Dimensions" is "size; weight" when a weight is present
        dims = listing['details'].get('Package Dimensions', '')
        size, _, weight = dims.partition(';')
        rows.append([
            listing['parent_asin'],
            listing['main_category'],
            listing['title'],
            ';'.join(listing['features']),
            ';'.join(listing['description']),
            listing['price'],
            listing['rating_number'],
            listing['store'],
            size.strip(),
            weight.strip(),
        ])
print(counts)

df = pd.DataFrame(rows, columns=['asin', 'main_category', 'title',
                                 'features', 'description',
                                 'price', 'rating_number', 'store',
                                 'size', 'weight'])

keyword = 'fda'
if keyword:
    # match in title or description, case-insensitively; na=False guards nulls
    mask = (df['title'].str.contains(keyword, case=False, na=False)
            | df['description'].str.contains(keyword, case=False, na=False))
    out = df[mask]
else:
    out = df
    keyword = filename
out.to_csv(keyword + '.csv')

I want to filter rows whose title or description contains 'fda'. As you can see in the attached fda.csv, even though the input file is meta_Health_and_Household, the main_category values come from other top-level categories at the same level as Health and Household.

I cannot understand this.
