This project aims to provide new data sources for product bundling in real e-commerce platforms with the domains of Electronic, Clothing and Food. We construct three high-quality bundle datasets with rich meta information, particularly bundle intents, through a carefully designed crowd-sourcing task.
Figure 1 shows the distribution of workers' age, education, country, occupation, gender and shopping frequency for the two batches. In particular, `Others' in the country distribution includes Argentina, Australia, Anguilla, Netherlands, Albania, Georgia, Tunisia, Belgium, Armenia, Guinea, Austria, Switzerland, Iceland, Lithuania, Egypt, Venezuela, Bangladesh, American Samoa, Vanuatu, Colombia, United Arab Emirates, Ashmore and Cartier Island, Estados Unidos, Wales, Turkey, Angola, Scotland, Philippines, Iran and Bahamas.
Figure 1: Worker basic information in the first and second batches.
A grid search in {0.0001, 0.001, 0.01} is applied to find out the optimal settings for support and confidence, and both are set as 0.001 across the three domains.
The dimension d of item and bundle representations for all methods is 20. Grid search is adopted to find out the best settings for other key parameters. In particular, learning rate and regularization coefficient are searched in {0.0001, 0.001, 0.01}; the number of neighbors K in ItemKNN is searched in {10, 20, 30, 50}; the weight of KL divergence in VAE is searched in {0.001, 0.01, 0.1}; and the batch size is searched in {64, 128, 256}. The optimal parameter settings are shown in Table 1.
Table 1: Parameter settings for bundle completion (d=20).
Electronic | Clothing | Food | |
---|---|---|---|
ItemKNN | |||
BPRMF | |||
mean-VAE | |||
concat-VAE |
The dimension d of representations is set as 20. We apply a same grid search for , , and batch size as in bundle completion. Besides, the predictive layer D for AttList is searched from {20, 50, 100}; the node and message dropout rate for GCN and BGCN is searched in {0, 0.1, 0.3, 0.5}. As the training complexity for GCN and BGCN is quite high, we set the batch size as 2048 as suggested by the original paper. The optimal parameter settings are presented in Table 2. Note that the parameter settings for BGCN is the version without pre-training (i.e. ).
Table 2: Parameter settings for bundle ranking (d=20).
Electronic | Clothing | Food | |
---|---|---|---|
ItemKNN | |||
BPRMF | |||
DAM | |||
AttList | |||
GCN | |||
BGCN |
Table 3: Statistics of datasets.
Electronic | Clothing | Food | |
---|---|---|---|
#Users | 888 | 965 | 879 |
#Items | 3499 | 4487 | 3767 |
#Sessions | 1145 | 1181 | 1161 |
#Bundles | 1750 | 1910 | 1784 |
#Intents | 1422 | 1466 | 1156 |
Average Bundle Size | 3.52 | 3.31 | 3.58 |
#User-Item Interactions | 6165 | 6326 | 6395 |
#User-Bundle Interactions | 1753 | 1912 | 1785 |
Density of User-Item Interactions | 0.20% | 0.15% | 0.19% |
Density of User-Bundle Interactions | 0.11% | 0.10% | 0.11% |
Under the 'dataset' folder, there are three domains, including clothing, electronic and food. Each domain contains the following 9 data files.
Table 4: The descriptions of the data files.
File Name | Descriptions |
---|---|
user_item_pretrain.csv | This file contains the user-item interactions aiming to obtain the pre-trained item representations via BPRMF for model initialization. This is a tab separated list with 3 columns: user ID | item ID | timestamp | |
user_item.csv | This file contains the user-item interactions. This is a tab separated list with 3 columns: user ID | item ID | timestamp | |
session_item.csv | This file contains sessions and their associated items. Each session has at least 2 items. This is a tab separated list with 2 columns: session ID | item ID | |
user_session.csv | This file contains users and their associated sessions. This is a tab separated list with 3 columns: user ID | session ID | timestamp | |
session_bundle.csv | This file contains sessions and their detected bundles. Each session has at least 1 bundle. This is a tab separated list with 2 columns: session ID | bundle ID | The session ID contained in the session_item.csv but not in session_bundle.csv indicates there is no bundle detected in this session. |
bundle_intent.csv | This file contains bundles and their annotated intents. This is a tab separated list with 2 columns: bundle ID | intent | |
bundle_item.csv | This file contains bundles and their associated items. Each bundle has at least 2 items. This is a tab separated list with 2 columns: bundle ID | item ID | |
user_bundle.csv | This file contains the user-bundle interactions. This is a tab separated list with 3 columns: user ID | bundle ID | timestamp | |
item_categories.csv | This file contains items and their affiliated categories. This is a tab separated list with 2 columns: item ID | categories | The format of data in categories column is a list of string. |
Our datasets are constructed on the basis of Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/links.html).