yandex-research / rtdl Goto Github PK

View Code? Open in Web Editor NEW

814.0 34.0 95.0 25.18 MB

Research on Tabular Deep Learning: Papers & Packages

License: Apache License 2.0

Makefile 0.68% Python 99.32%

python machine-learning deep-learning pytorch research ai artificial-intelligence tabular neural-network papers

rtdl's People

Contributors

Stargazers

Watchers

Forkers

ikonushok yenngocduong msinghraniyal tridungduong16 ameenali freshricardo szhong-ssmhealth chsafouane arita37 mattmoehr github-hongweizhang vadanrod14 bigandsweet ppmdatix scchy aalkhalil155 will-retrace 787264137 mrrobotsaa ismail-mustapha isotlaboratory guaranteer kuangfengxyy arpitbansal297 lattice-ai art591 allbits smbadiwe akansal1 gkaretka stephanegaiffas piers-hinds sandy4321 manmeet3591 hanlu-nju 37channel mnergizphd stjordanis runxingzhong trongtuyen99 pa-wan kiwiathlete xxchenxx tirthvyas-tk-labs xiaochongzi8616 sebffischer mmozerwayfair burakakrishna quanthao malinhjartstrom jingmouren hjh233 mf093087 mrahman93 areyesan jpgard eamonustc tiger-east-1997 5l1v3r1 phymucs travisergodic codecse9 nbliangying shun-ryu ritassleite ccuixun9929 erik-mv ryanjjp jacob-x scorpiozq chao3698 bupttlhq dotrado dendiiiii nathanpainchaud jerronl thisguyisnotajumpingbear soeaxy jasonhuxinlei yynnxu synapticarbors luozhengdong xfx88 qsong4 shusenl pmdyy evn58 wrongxin2023 zhuanglineu 1617958603 kaishinshaw xindonglalala

rtdl's Issues

embedding of categorical variables

Hi Yury,

Thank you for your excellent work. I get a problem when handling categorical features. Do I need to pre-train the embedding layer when applying it to the data processing or just to attach the embedding layer to the model and train it with the model.

when to support torch 2?

When we run !pip install rtdl, it said

Collecting rtdl
Downloading rtdl-0.0.13-py3-none-any.whl (23 kB)
Requirement already satisfied: numpy<2,>=1.18 in /usr/local/lib/python3.10/dist-packages (from rtdl) (1.22.4)
Collecting torch<2,>=1.7 (from rtdl)
Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.5/887.5 MB 1.9 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch<2,>=1.7->rtdl) (4.7.1)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 kB 71.1 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 2.1 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 2.6 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 77.3 MB/s eta 0:00:00
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2,>=1.7->rtdl) (67.7.2)
Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2,>=1.7->rtdl) (0.40.0)
Installing collected packages: nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch, rtdl
Attempting uninstall: torch
Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
Successfully uninstalled torch-2.0.1+cu118
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchdata 0.6.1 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchtext 0.15.2 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 rtdl-0.0.13 torch-1.13.1

VAE from microsoft

There is a VAE model from microsoft on tabular data:

Modelling individual column distribution
Modelling copulae between columns.

Typos?

Hello,

I am trying to use PiecewiseLinearEncoder(). I think I found a few typos. Please check my work.

I first ran into an issue in piecewise_linear_encoding where I got the error in line 618 saying "RuntimeError: The size of tensor a (3688) must match the size of tensor b (32) at non-singleton dimension 1"

I dug into the code and found that when PiecewiseLinearEncoder was calling piecewise_linear_encoding the positional arguments of indices and ratios were switched in the former from what was expected in the latter.

Additionally, when inspecting piecewise_linear_encoding it looks like bin_edges = as_tensor(bin_ratios) not "as_tensor(bin_edges)" which would make more sense.

Can you please check this out? Much appreciated.

A scikit-learn interface for RTDL package.

Hello! I have written a scikit-learn interface for the RTDL package (https://github.com/hengzhe-zhang/scikit-rtdl). I rely on the skorch to avoid coding errors, and set the default parameters based on the parameters presented in your paper. Hoping you will like it!

How to resume training?

I ran your model in colab for a few hours before google terminated it. I used pickle.dump/load to store the trained model. It works to make predictions but it doesn't seem to be able to resume training.

      if progress.success:
          print(' <<< BEST VALIDATION EPOCH', end='')
          with open(mydrive+jobname, 'wb') as filehandler:
            dump((model, y_std, y_mean),filehandler)
            #we could see result was improving

        with open(mydrive+jobname, 'rb') as filehandler:
          model, y_std, y_mean = load(filehandler)
        pred=model(batch,None) #this seems to work
        for epoch in range(1, n_epochs + 1):
            for iteration, batch_idx in enumerate(train_loader):
                model.train()
                optimizer.zero_grad()
                x_batch = X['train'][batch_idx]
                y_batch = y['train'][batch_idx]
                loss = loss_fn(apply_model(x_batch).squeeze(1), y_batch)
                loss.backward()
                optimizer.step()
                if iteration % report_frequency == 0:
                    print(f'(epoch) {epoch} (batch) {iteration} (loss) {loss.item():.4f}')
                #no improvement any more. even the model was dumped immediately after created.

what is the right way to store the model so that I can resume the training?

Is it possible to provide a scikit-learn interface?

This project is interesting and I want to use it as the baseline algorithm for my paper. However, it seems that I need to take several steps in order to make a prediction. Consequently, is it possible to provide a scikit-learn interface for making a convenient comparison between different algorithms?

Cannot link in the document of zero

Hi! I am trying to understand the usage of python package zero, which is used in the example of rtdl. But I found that the linkage in the comment line of the code is not available anymore.

Here is the invalid link:
https://yura52.github.io/zero/0.0.4/reference/api/zero.improve_reproducibility.html

I am wondering is there any other document? Thank you!

Regards.

Add additional validation when constructing PLE

Namely, uncomment this block. To do that, the function must also take bin_edges as an argument

Bugs in piecewise-linear encoding

Here, indices = as_tensor(values) must be changed to this:

indices = as_tensor(indices)

Here, np.array(d_encoding) must be changed to this:

torch.tensor(d_encoding).to(indices)

Here, the argument dtype=X.dtype is missing for np.array
Here, .to(X) is missing
Here, it must be:

is_last_bin = bin_indices + 1 == as_tensor(list(map(len, bin_edges)))

# bug, located in rtdl.data.piecewise_linear_encoding line #618

"is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))" should be "is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges))).unsqueeze(-1))", to use tensor broadcast func

Example for custom dataset

Hi ! are there any examples for extracting embedings for custom tabular data ?

regards

wrong condition in _LVR_encoding

the condition here is broken. it should be {int: 'int', float: 'float'}[type(left)] instead of str(type(left))

LGBMRegressor on California Housing dataset is 0.68 >> 0.46

I use the sample code to prepare the dataset:

device = 'cpu'
dataset = sklearn.datasets.fetch_california_housing()
task_type = 'regression'

X_all = dataset['data'].astype('float32')
y_all = dataset['target'].astype('float32')
n_classes = None

X = {}
y = {}
X['train'], X['test'], y['train'], y['test'] = sklearn.model_selection.train_test_split(
    X_all, y_all, train_size=0.8
)
X['train'], X['val'], y['train'], y['val'] = sklearn.model_selection.train_test_split(
    X['train'], y['train'], train_size=0.8
)

# not the best way to preprocess features, but enough for the demonstration
preprocess = sklearn.preprocessing.StandardScaler().fit(X['train'])
X = {
    k: torch.tensor(preprocess.fit_transform(v), device=device)
    for k, v in X.items()
}
y = {k: torch.tensor(v, device=device) for k, v in y.items()}

# !!! CRUCIAL for neural networks when solving regression problems !!!
y_mean = y['train'].mean().item()
y_std = y['train'].std().item()
y = {k: (v - y_mean) / y_std for k, v in y.items()}

y = {k: v.float() for k, v in y.items()}

And I train a LGBMRegressor with the default hyper parameters:

model = lgb.LGBMRegressor()
model.fit(X['train'], y['train'])

But when I evaluate on the test fold, I found the performance is 0.68:

>>> test_pred = model.predict(X['test'])
>>> test_pred = torch.from_numpy(test_pred)
>>> rmse = torch.nn.functional.mse_loss(
>>>     test_pred.view(-1), y['test'].view(-1)) ** 0.5 * y_std
>>> print(f'Test RMSE: {rmse:.2f}.')
Test RMSE: 0.68.

Even using the model from rtdl gives me 0.56 RMSE:

(epoch) 57 (batch) 0 (loss) 0.1885
(epoch) 57 (batch) 10 (loss) 0.1315
(epoch) 57 (batch) 20 (loss) 0.1735
(epoch) 57 (batch) 30 (loss) 0.1197
(epoch) 57 (batch) 40 (loss) 0.1952
(epoch) 57 (batch) 50 (loss) 0.1167
Epoch 057 | Validation score: 0.7334 | Test score: 0.5612 <<< BEST VALIDATION EPOCH

Is there anything I miss? How can I reproduce the performance in your paper? Thanks!

Training fails (sometimes) when using several GPUs

Dear maintainers,

I've been using your package for a while now (especially for the FTT model). I've never encountered any trouble, and it helped me boost my performance on a tabular dataset.
Recently, I've been doing some ablation studies, i.e., I've removed some input features to check if the performance of the model would decrease (and if yes, at what point). I've discovered that, when using several GPUs (2x RTX 2080 Ti), the training fails when it has a certain number of input features (but not always, it really depends of the number of input features).
I'm using torch.nn.DataParallel to implement data parallelism, and for some reason related to my framework I don't wish to use torch.nn.parallel.DistributedDataParallel.

Here's a minimal reproducible example to prove my point:

import scipy.sparse  # scipy is 1.10.1
import torch  # torch is 1.13.1
import rtdl  # rtdl is 0.0.13
import torchmetrics  # torchmetrics is 0.11.4
import multiprocessing
import numpy as np  # numpy is 1.24.3

class SparseDataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X  # Sparse input data
        self.y = y  # Target data

    def __getitem__(self, index):
        X = torch.from_numpy(self.X[index].toarray()[0]).float()  # Convert the sparse input data to a dense tensor
        y = self.y[index]  # Target data
        return X, y

    def __len__(self):
        return self.X.shape[0]  # Number of items in the dataset

class FTT(rtdl.FTTransformer):
    def __init__(self, n_num_features=None, cat_cardinalities=None, d_token=16, n_blocks=1, attention_n_heads=4, attention_dropout=0.3, attention_initialization='kaiming', attention_normalization='LayerNorm', ffn_d_hidden=16, ffn_dropout=0.1, ffn_activation='ReGLU', ffn_normalization='LayerNorm', residual_dropout=0.0, prenormalization=True, first_prenormalization=False, last_layer_query_idx=[-1], n_tokens=None, kv_compression_ratio=0.004, kv_compression_sharing='headwise', head_activation='ReLU', head_normalization='LayerNorm', d_out=None):
        feature_tokenizer = rtdl.FeatureTokenizer( 
            n_num_features=n_num_features,
            cat_cardinalities=cat_cardinalities,
            d_token=d_token
        )
        transformer = rtdl.Transformer(
            d_token=d_token,
            n_blocks=n_blocks,
            attention_n_heads=attention_n_heads,
            attention_dropout=attention_dropout,
            attention_initialization=attention_initialization,
            attention_normalization=attention_normalization,
            ffn_d_hidden=ffn_d_hidden,
            ffn_dropout=ffn_dropout,
            ffn_activation=ffn_activation,
            ffn_normalization=ffn_normalization,
            residual_dropout=residual_dropout,
            prenormalization=prenormalization,
            first_prenormalization=first_prenormalization,
            last_layer_query_idx=last_layer_query_idx,
            n_tokens=n_num_features+1,
            kv_compression_ratio=kv_compression_ratio,
            kv_compression_sharing=kv_compression_sharing,
            head_activation=head_activation,
            head_normalization=head_normalization,
            d_out=d_out
        )
        super(FTT, self).__init__(feature_tokenizer, transformer)

    def forward(self, x_num=None, x_cat=None):
        x = self.feature_tokenizer(x_num, x_cat)
        print(f"First forward step --> shape is {x.shape} & device is {x.device}")  # For debug 1/7
        x = self.cls_token(x)
        print(f"Second forward step --> shape is {x.shape} & device is {x.device}")  # For debug 2/7
        x = self.transformer(x)
        print(f"Third forward step --> shape is {x.shape} & device is {x.device}")  # For debug 3/7
        return x

num_train_samples = 803473  # Number of samples in the real training dataset
num_test_samples = 82787  # Number of samples in the real testing dataset
num_input_features = 152  # Number of input features in the real dataset
#num_input_features = 252  # If we put another number of features, less say 100 more, then the code works (it depends of the number we put here)
num_classes = 228  # Number of classes in the real dataset

X_train = scipy.sparse.random(num_train_samples, num_input_features, density=0.01, format='csr')
y_train = np.random.randint(0, num_classes, num_train_samples)
X_test = scipy.sparse.random(num_test_samples, num_input_features, density=0.01, format='csr')
y_test = np.random.randint(0, num_classes, num_test_samples)

train_dataset = SparseDataset(X_train, y_train)  # Create a train dataset
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=512, shuffle=True, num_workers=multiprocessing.cpu_count())  # Create a train DataLoader 
test_dataset = SparseDataset(X_test, y_test)  # Create a test dataset
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=512, shuffle=True, num_workers=multiprocessing.cpu_count())  # Create a test DataLoader 

model = FTT(n_num_features=num_input_features, d_out=num_classes)

model = torch.nn.DataParallel(model).cuda()  # Run the model parallelly and move it to GPU
criterion = torch.nn.CrossEntropyLoss().cuda()  # Instantiate loss class and move it to GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.0001)  # Instantiate optimizer class
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=5) # Instantiate step learning scheduler class

print(model)  # For debug 4/7
for i in model.named_parameters():
    print(f"{i[0]} -> {i[1].device}")  # For debug 5/7

for epoch in range(100):
    for batch, labels in train_dataloader:  # Iterate through train dataset
        batch = batch.requires_grad_().cuda()  # Load batches with gradient accumulation capabilities
        print(f"Batch shape is {batch.shape} & device is {batch.device}")  # For debug 6/7
        labels = labels.cuda()  # Use GPU for tensors
        print(f"Labels shape is {labels.shape} & device is {labels.device}")  # For debug 7/7
        optimizer.zero_grad()  # Clear gradients w.r.t. parameters
        ########################################
        ### THE CODE FAILS ON THE LINE BELOW ###
        ########################################
        outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
        ########################################
        ### THE CODE FAILS ON THE LINE ABOVE ###
        ########################################
        loss = criterion(outputs, labels)  # Calculate Loss: softmax --> cross entropy loss
        loss.backward()  # Getting gradients w.r.t. parameters
        optimizer.step()  # Updating parameters
    metric = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)  # Instantiate the accuracy metric
    metric.cuda()  # Move the metric to the device
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # Disable gradient computation
        for batch, labels in test_dataloader:  # Iterate through the batches in the test dataset
            batch = batch.cuda()  # Move the batch to the device
            labels = labels.cuda()  # Move the labels to the device
            outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
            accuracy = metric(outputs, labels)  # Calculate the accuracy
    accuracy_epoch = metric.compute() * 100  # Compute the overall accuracy for the epoch
    model.train()

    scheduler.step(accuracy)
    print(f'Epoch {epoch} completed.')
    print(f'Accuracy: {accuracy_epoch:.4f}%.')

When my training data (a CSR matrix) has a shape of (803473, 152) (i.e., 803473 samples with each 152 features), this code fails (on multi-GPU). However, if I have a training data of shape (803473, 252) (I just tried a random number), then it works smoothly.

Here are the logs:

/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
DataParallel(
  (module): FTT(
    (feature_tokenizer): FeatureTokenizer(
      (num_tokenizer): NumericalFeatureTokenizer()
    )
    (cls_token): CLSToken()
    (transformer): Transformer(
      (blocks): ModuleList(
        (0): ModuleDict(
          (attention): MultiheadAttention(
            (W_q): Linear(in_features=16, out_features=16, bias=True)
            (W_k): Linear(in_features=16, out_features=16, bias=True)
            (W_v): Linear(in_features=16, out_features=16, bias=True)
            (W_out): Linear(in_features=16, out_features=16, bias=True)
            (dropout): Dropout(p=0.3, inplace=False)
          )
          (ffn): FFN(
            (linear_first): Linear(in_features=16, out_features=32, bias=True)
            (activation): ReGLU()
            (dropout): Dropout(p=0.1, inplace=False)
            (linear_second): Linear(in_features=16, out_features=16, bias=True)
          )
          (attention_residual_dropout): Dropout(p=0.0, inplace=False)
          (ffn_residual_dropout): Dropout(p=0.0, inplace=False)
          (output): Identity()
          (ffn_normalization): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
          (key_compression): Linear(in_features=153, out_features=0, bias=False)
          (value_compression): Linear(in_features=153, out_features=0, bias=False)
        )
      )
      (head): Head(
        (normalization): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (activation): ReLU()
        (linear): Linear(in_features=16, out_features=228, bias=True)
      )
    )
  )
)
module.feature_tokenizer.num_tokenizer.weight -> cuda:0
module.feature_tokenizer.num_tokenizer.bias -> cuda:0
module.cls_token.weight -> cuda:0
module.transformer.blocks.0.attention.W_q.weight -> cuda:0
module.transformer.blocks.0.attention.W_q.bias -> cuda:0
module.transformer.blocks.0.attention.W_k.weight -> cuda:0
module.transformer.blocks.0.attention.W_k.bias -> cuda:0
module.transformer.blocks.0.attention.W_v.weight -> cuda:0
module.transformer.blocks.0.attention.W_v.bias -> cuda:0
module.transformer.blocks.0.attention.W_out.weight -> cuda:0
module.transformer.blocks.0.attention.W_out.bias -> cuda:0
module.transformer.blocks.0.ffn.linear_first.weight -> cuda:0
module.transformer.blocks.0.ffn.linear_first.bias -> cuda:0
module.transformer.blocks.0.ffn.linear_second.weight -> cuda:0
module.transformer.blocks.0.ffn.linear_second.bias -> cuda:0
module.transformer.blocks.0.ffn_normalization.weight -> cuda:0
module.transformer.blocks.0.ffn_normalization.bias -> cuda:0
module.transformer.blocks.0.key_compression.weight -> cuda:0
module.transformer.blocks.0.value_compression.weight -> cuda:0
module.transformer.head.normalization.weight -> cuda:0
module.transformer.head.normalization.bias -> cuda:0
module.transformer.head.linear.weight -> cuda:0
module.transformer.head.linear.bias -> cuda:0
Batch shape is torch.Size([512, 152]) & device is cuda:0
Labels shape is torch.Size([512]) & device is cuda:0
First forward step --> shape is torch.Size([256, 152, 16]) & device is cuda:1
First forward step --> shape is torch.Size([256, 152, 16]) & device is cuda:0
Second forward step --> shape is torch.Size([256, 153, 16]) & device is cuda:1
Second forward step --> shape is torch.Size([256, 153, 16]) & device is cuda:0
Third forward step --> shape is torch.Size([256, 228]) & device is cuda:0
Traceback (most recent call last):
  File "/home/my_user_name/my_framework_name/minimal_reproducible_example.py", line 98, in <module>
    outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/my_framework_name/minimal_reproducible_example.py", line 57, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/rtdl/modules.py", line 1150, in forward
    x_residual, _ = layer['attention'](
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/rtdl/modules.py", line 893, in forward
    k = key_compression(k.transpose(1, 2)).transpose(1, 2)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: size mismatch, got 4096, 4096x153,0

Some reminders:

This error only happens when I'm training one several (two) GPUs. It works fine with only one.
This error only happens when I have a certain amount of input_features (here, 152). My global dataset has 10 633 input features, and I never had a problem training this model. During my ablation studies, I also tried training the model with 10 481 input features (it worked), with 10 483 input features (it worked), with 10 631 input features (it worked), with 2 input features (it failed) and with 150 input features (it failed). Why? Because on my 10 633 input features, the 10 481 first are forming a first group of features, the 2 next are forming a second group of features and the 150 last are forming a third group of features, so I'm trying every combination possible.

Thanks for your help!

Possibly wrong initialization in LinearEmbeddings

It seems that in LinearEmbeddings output size (-1 dim) is being used for kaiming, not input size, which is inconsistent with both Pytorch nn Linear initialization (https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) and other similar layers (e. g. NLinear)
https://github.com/Yura52/rtdl/blob/b354b35d68f28b4f5bbebd2e6d5b1f6cfa91eed1/rtdl/nn/_embeddings.py#L275

About piecewise_linear_encoding

Hi Yura!

I have run the test code, which is mentioned in your code data.piecewise_linear_encoding
I have this error: RuntimeError: The size of tensor a (4) must match the size of tensor b (100) at non-singleton dimension 1

this error causes by is_last_bin = bin_indices + 1 == as_tensor(list(map(len, bin_edges)))

while bin_indices + 1 is a torch.Size([100, 4]), and as_tensor(list(map(len, bin_edges))) returns a torch.Size([100])

How to get feature importance scores or attention heatmap

Hi,
I am trying to get some visualizations of interpretable results from FT-Transformer, such as feature importance or attention heatmaps. I find some discussions about feature importance in paper section 5.3, but I don't know how to achieve it. Is there a way to achieve these based on the source codes you published? Or can you make an example of implementation?
Thank you very much!

typos in CatEmbeddings

link. The variable cardinalities_and_dimensions does not exist
link. The condition looks broken. Solution: simplify it and remove the word "spec" from the error message.

Regression results about the RTDL models.

Hi, you did a great implementation of the tab-transformer. However, when I use your example notebook to do the simple regression for the Sin(x), neither the baseline model or the FTTransformer give the good results. I have no idea about this and want to know why.

Here is the link

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.