
Amazon Reviews 2023

[🌐 Website] · [🤗 Huggingface Datasets] · [📑 Paper] · [🔬 McAuley Lab]


This repository contains:

Recommendation Benchmarks

Based on the released Amazon Reviews 2023 dataset, we provide scripts that preprocess the raw data into standard train/validation/test splits to encourage benchmarking of recommendation models.

More details here -> [datasets & processing scripts]
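For context, recommendation splits of this kind are typically derived from interaction timestamps. Below is a rough, hypothetical sketch of the common leave-last-out convention (per user: last interaction to test, second-to-last to validation, rest to train); the repo's actual scripts may instead use fixed timestamp cutoffs, so treat this only as an illustration.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) triples per user:
    the latest interaction goes to test, the second-latest to
    validation, and everything earlier to train.
    NOTE: an illustrative convention, not this repo's exact script."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, valid, test = [], [], []
    for user, events in by_user.items():
        events.sort()  # chronological order
        items = [(user, item) for _, item in events]
        train.extend(items[:-2])
        if len(items) >= 2:
            valid.append(items[-2])
        if len(items) >= 1:
            test.append(items[-1])
    return train, valid, test
```

Users with fewer than three interactions contribute nothing to train, which is why such users are often filtered out (e.g., via k-core filtering) before splitting.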

BLaIR

BLaIR, short for "Bridging Language and Items for Retrieval and Recommendation", is a series of language models pre-trained on the Amazon Reviews 2023 dataset.

BLaIR is grounded on pairs of (item metadata, language context), enabling the models to:

  • derive strong item text representations, for both recommendation and retrieval;
  • predict the most relevant item given a simple or complex language context.

More details here -> [checkpoints & code]

Amazon-C4

Amazon-C4, which is short for "Complex Contexts Created by ChatGPT", is a new dataset for the complex product search task.

Amazon-C4 is designed to assess a model's ability to comprehend complex language contexts and retrieve relevant items.

More details here -> [datasets & code]
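At its core, retrieval on a benchmark like Amazon-C4 amounts to scoring every item against the query in a shared embedding space (for example, with vectors produced by a model such as BLaIR) and ranking by similarity. A minimal sketch, assuming you already have a query vector and a dict of item embeddings (the function names here are illustrative, not from this repo):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_items(query_vec, item_vecs):
    """Return item ids sorted by descending cosine similarity
    to the query vector. item_vecs maps item id -> embedding."""
    scores = {item_id: cosine(query_vec, vec)
              for item_id, vec in item_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice one would batch this with a matrix multiply over normalized embeddings, but the ranking logic is the same.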

Contact

Please let us know if you encounter a bug or have any suggestions/questions by filing an issue or emailing Yupeng Hou (@hyp1231) at [email protected].

Acknowledgement

If you find the Amazon Reviews 2023 dataset, the BLaIR checkpoints, the Amazon-C4 dataset, or our scripts/code helpful, please cite the following paper:

@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

The recommendation experiments in the BLaIR paper are implemented using the open-source recommendation library RecBole.

The pre-training scripts draw heavily on the Hugging Face language-modeling examples and SimCSE.


amazonreviews2023's Issues

repo not found

Hi, I was unable to download the metadata from the links provided; the URLs were not found. Could you please fix the issue?

bought_together section empty

I am using the Electronics metadata, and all values in the "bought_together" column are null.
Could you point me to another column with the same information, or is this field still pending and yet to be added to the dataset?

BLaIR Hugging Face model

I was wondering whether you will release the BLaIR model on Hugging Face soon, because the benchmark improvements are considerable and further fine-tuning could lead to even better performance. Please let me know if this is planned.

categories field is empty.

First of all, thank you for sharing such great data.

I wanted to see the distribution of product categories from the item metadata, so I checked the category field, but all the data is empty.

from datasets import load_dataset
ds = load_dataset('McAuley-Lab/Amazon-Reviews-2023', 'raw_meta_All_Beauty', split="full", trust_remote_code=True)

It looks like you have collected category data according to other issues (#7), so I would like to know why the data is missing.

BLaIR integration with UniSRec

Could you also publish the code for integrating BLaIR with UniSRec? From what I can see, it is not available at the moment. This would help a lot with reproducibility! Thanks!

Doubts related to this repo (requesting urgent help)

I have to build an urgent project on GNN-based recommendations, so please clear up my doubts below:

  1. Could you edit the README so that it explains everything in detail? For example, why are the validation and test timestamps taken as constants in one of the scripts? Please elaborate on the small ideas too, so that beginners like me can understand them properly.

  2. Could you also tell me how to build a feature file from the provided metadata? I am working on a GNN recommendation project and want to process your CSV files into txt files containing nodes, edges, and edge types, but I cannot figure out how to do that.
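For context, a common way to turn interaction records into graph files for a GNN is to assign integer node ids to users and items and emit one edge per interaction. A rough, hypothetical sketch (the column layout and the choice of rounded rating as edge type are assumptions for illustration, not this repo's format):

```python
def build_graph(interactions):
    """Map users and items to integer node ids and build an edge list.
    Each (user, item, rating) interaction becomes one edge; using the
    rounded rating as the edge type is just an illustrative choice."""
    node_ids = {}

    def nid(key):
        # assign ids in first-seen order
        if key not in node_ids:
            node_ids[key] = len(node_ids)
        return node_ids[key]

    edges = [(nid(('user', u)), nid(('item', i)), int(r))
             for u, i, r in interactions]
    return node_ids, edges
```

The resulting `node_ids` mapping and `edges` list can then be written out line by line as the nodes/edges/edge-types text files a GNN pipeline expects.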

Blair folder is missing: feature or a bug?

Hi, I wanted to look into the code for the BLaIR implementation, but the entire blair folder seems to be empty. Is that intentional? If so, where can I find the implementation code?

main category is messy

import json
import pandas as pd

file = 'D:\\360downloads\\meta_Health_and_Household.jsonl'  # downloaded from the `meta` link above
filename = 'meta_Health_and_Household'

# Example record:
# {
#   "main_category": "All Beauty",
#   "title": "Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",
#   "average_rating": 4.8,
#   "rating_number": 10,
#   "features": [],
#   "description": [],
#   "price": null,
#   "images": [
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41qfjSfqNyL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41qfjSfqNyL.jpg",
#       "variant": "MAIN",
#       "hi_res": null
#     }
#   ],
#   "videos": [],
#   "store": "Howard Products",
#   "categories": [],
#   "details": {
#     "Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds",
#     "UPC": "617390882781"
#   },
#   "parent_asin": "B01CUPMQZE",
#   "bought_together": null
# }

rows = []
counts = 0
with open(file, 'r', encoding='utf-8') as fp:
    for line in fp:
        counts += 1
        listing = json.loads(line)
        # "Package Dimensions" is "size; weight" when a weight is present
        dims = listing['details'].get('Package Dimensions', '')
        size, _, weight = dims.partition(';')
        rows.append([
            listing['parent_asin'],
            listing['main_category'],
            listing['title'],
            ';'.join(listing['features']),
            ';'.join(listing['description']),
            listing['price'],
            listing['rating_number'],
            listing['store'],
            size.strip(),
            weight.strip(),
        ])
print(counts)

df = pd.DataFrame(rows, columns=['asin', 'main_category', 'title',
                                 'features', 'description',
                                 'price', 'rating_number', 'store',
                                 'size', 'weight'])

keyword = 'fda'
if keyword:
    # match in title or description, case-insensitively; na=False guards nulls
    mask = (df['title'].str.contains(keyword, case=False, na=False)
            | df['description'].str.contains(keyword, case=False, na=False))
    out = df[mask]
else:
    out = df
    keyword = filename
out.to_csv(keyword + '.csv')

I want to filter rows whose title or description contains 'fda'. As you can see in the attached fda.csv, even though the input file is meta_Health_and_Household, the main_category values come from other top-level categories at the same level as Health and Household.

I cannot understand this.
