rtdl's People

Contributors

dendiiiii, yura52


rtdl's Issues

embedding of categorical variables

Hi Yury,

Thank you for your excellent work. I have a problem handling categorical features: do I need to pre-train the embedding layer as a separate preprocessing step, or can I just attach the embedding layer to the model and train it jointly with the model?
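
For reference: the usual practice is the latter — no pre-training is needed; the embedding layer is attached to the model and updated by the same optimizer. A minimal sketch with plain torch.nn.Embedding (independent of rtdl's API; all sizes below are made up for illustration):

import torch

cardinalities = [10, 4]              # number of categories per categorical feature
d_embedding = 8
embeddings = torch.nn.ModuleList(
    torch.nn.Embedding(c, d_embedding) for c in cardinalities
)
x_cat = torch.randint(0, 4, (32, 2))  # (batch, n_cat_features), integer category indices
x = torch.cat(
    [emb(x_cat[:, i]) for i, emb in enumerate(embeddings)],
    dim=-1,
)                                     # (batch, n_cat_features * d_embedding), fed to the model
# Passing embeddings.parameters() to the optimizer together with the model's
# parameters trains the embeddings end-to-end with the rest of the network.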

when to support torch 2?

When we run !pip install rtdl, it says:

Collecting rtdl
Downloading rtdl-0.0.13-py3-none-any.whl (23 kB)
Requirement already satisfied: numpy<2,>=1.18 in /usr/local/lib/python3.10/dist-packages (from rtdl) (1.22.4)
Collecting torch<2,>=1.7 (from rtdl)
Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.5/887.5 MB 1.9 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch<2,>=1.7->rtdl) (4.7.1)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 kB 71.1 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 2.1 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 2.6 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch<2,>=1.7->rtdl)
Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 77.3 MB/s eta 0:00:00
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2,>=1.7->rtdl) (67.7.2)
Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2,>=1.7->rtdl) (0.40.0)
Installing collected packages: nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, nvidia-cudnn-cu11, torch, rtdl
Attempting uninstall: torch
Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
Successfully uninstalled torch-2.0.1+cu118
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchdata 0.6.1 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchtext 0.15.2 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 rtdl-0.0.13 torch-1.13.1
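
Until torch 2 is supported, a possible (untested) workaround is pip install rtdl --no-deps in an environment that already has torch 2, since the torch<2,>=1.7 pin shown above is what triggers the downgrade; whether rtdl 0.0.13 actually behaves correctly under torch 2 would still need to be verified.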

VAE from microsoft

There is a VAE model from Microsoft for tabular data:

Modelling individual column distribution
Modelling copulae between columns.

Typos?

Hello,

I am trying to use PiecewiseLinearEncoder(). I think I found a few typos. Please check my work.

I first ran into an issue in piecewise_linear_encoding, where I got an error at line 618: "RuntimeError: The size of tensor a (3688) must match the size of tensor b (32) at non-singleton dimension 1"

I dug into the code and found that when PiecewiseLinearEncoder calls piecewise_linear_encoding, the positional arguments indices and ratios are passed in the opposite order from what the latter expects.

Additionally, when inspecting piecewise_linear_encoding, it looks like the code reads bin_edges = as_tensor(bin_ratios) rather than as_tensor(bin_edges), which would make more sense.

Can you please check this out? Much appreciated.

How to resume training?

I ran your model in Colab for a few hours before Google terminated the session. I used pickle dump/load to store the trained model. The loaded model makes predictions, but it doesn't seem to be able to resume training.

# saving the best model seen so far
if progress.success:
    print(' <<< BEST VALIDATION EPOCH', end='')
    with open(mydrive + jobname, 'wb') as filehandler:
        dump((model, y_std, y_mean), filehandler)
        # we could see the result improving at this point

# restoring the model and trying to resume training
with open(mydrive + jobname, 'rb') as filehandler:
    model, y_std, y_mean = load(filehandler)
pred = model(batch, None)  # this seems to work
for epoch in range(1, n_epochs + 1):
    for iteration, batch_idx in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        x_batch = X['train'][batch_idx]
        y_batch = y['train'][batch_idx]
        loss = loss_fn(apply_model(x_batch).squeeze(1), y_batch)
        loss.backward()
        optimizer.step()
        if iteration % report_frequency == 0:
            print(f'(epoch) {epoch} (batch) {iteration} (loss) {loss.item():.4f}')
        # no more improvement, even though the model was dumped immediately
        # after it was created

What is the right way to store the model so that I can resume training?
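
For reference, one likely culprit is the optimizer: pickling only the model loses AdamW's moment estimates, so resumed training restarts with a cold optimizer. A common pattern (a sketch, not specific to rtdl) is to checkpoint both state dicts with torch.save:

import torch

# save: include the optimizer state, not just the model
torch.save(
    {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'y_mean': y_mean,
        'y_std': y_std,
    },
    mydrive + jobname,
)

# load: restore both states before continuing the training loop
checkpoint = torch.load(mydrive + jobname)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch'] + 1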

Is it possible to provide a scikit-learn interface?

This project is interesting, and I want to use it as a baseline algorithm for my paper. However, it currently takes several steps to make a prediction. Would it be possible to provide a scikit-learn interface, to make comparisons between different algorithms more convenient?
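
In the meantime, here is a minimal sketch of such a wrapper (not an official interface; it assumes rtdl.MLP.make_baseline from rtdl 0.0.13 and numeric-only features, and the class name RTDLRegressor is made up):

import numpy as np
import torch
import rtdl
from sklearn.base import BaseEstimator, RegressorMixin

class RTDLRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, lr=1e-3, n_epochs=10, batch_size=256, device='cpu'):
        self.lr = lr
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.device = device

    def fit(self, X, y):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32, device=self.device)
        y = torch.as_tensor(np.asarray(y), dtype=torch.float32, device=self.device)
        self.model_ = rtdl.MLP.make_baseline(
            d_in=X.shape[1], d_layers=[128, 128], dropout=0.1, d_out=1
        ).to(self.device)
        optimizer = torch.optim.AdamW(self.model_.parameters(), lr=self.lr)
        self.model_.train()
        for _ in range(self.n_epochs):
            # iterate over shuffled mini-batches of row indices
            for idx in torch.randperm(len(X), device=self.device).split(self.batch_size):
                optimizer.zero_grad()
                loss = torch.nn.functional.mse_loss(
                    self.model_(X[idx]).squeeze(1), y[idx]
                )
                loss.backward()
                optimizer.step()
        return self

    @torch.no_grad()
    def predict(self, X):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32, device=self.device)
        self.model_.eval()
        return self.model_(X).squeeze(1).cpu().numpy()

With this, RTDLRegressor().fit(X_train, y_train) and .predict(X_test) follow the familiar scikit-learn conventions, so the model can be dropped into cross_val_score and friends.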

Bugs in piecewise-linear encoding

  1. Here, indices = as_tensor(values) must be changed to this:
     indices = as_tensor(indices)
  2. Here, np.array(d_encoding) must be changed to this:
     torch.tensor(d_encoding).to(indices)
  3. Here, the argument dtype=X.dtype is missing for np.array

  4. Here, .to(X) is missing

  5. Here, it must be:

is_last_bin = bin_indices + 1 == as_tensor(list(map(len, bin_edges)))

# bug, located in rtdl.data.piecewise_linear_encoding line #618

"is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))" should be "is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges)))is_last_bin = bin_indices + 1 == torch.as_tensor(list(map(len, bin_edges))).unsqueeze(-1))", to use tensor broadcast func

LGBMRegressor on California Housing dataset is 0.68 >> 0.46

I use the sample code to prepare the dataset:

import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing
import torch

device = 'cpu'
dataset = sklearn.datasets.fetch_california_housing()
task_type = 'regression'

X_all = dataset['data'].astype('float32')
y_all = dataset['target'].astype('float32')
n_classes = None

X = {}
y = {}
X['train'], X['test'], y['train'], y['test'] = sklearn.model_selection.train_test_split(
    X_all, y_all, train_size=0.8
)
X['train'], X['val'], y['train'], y['val'] = sklearn.model_selection.train_test_split(
    X['train'], y['train'], train_size=0.8
)

# not the best way to preprocess features, but enough for the demonstration
preprocess = sklearn.preprocessing.StandardScaler().fit(X['train'])
X = {
    # transform, not fit_transform: the scaler must only be fit on the training split
    k: torch.tensor(preprocess.transform(v), device=device)
    for k, v in X.items()
}
y = {k: torch.tensor(v, device=device) for k, v in y.items()}

# !!! CRUCIAL for neural networks when solving regression problems !!!
y_mean = y['train'].mean().item()
y_std = y['train'].std().item()
y = {k: (v - y_mean) / y_std for k, v in y.items()}

y = {k: v.float() for k, v in y.items()}

And I train an LGBMRegressor with the default hyperparameters:

import lightgbm as lgb

model = lgb.LGBMRegressor()
model.fit(X['train'], y['train'])

But when I evaluate on the test fold, I find that the performance is 0.68:

>>> test_pred = model.predict(X['test'])
>>> test_pred = torch.from_numpy(test_pred)
>>> rmse = torch.nn.functional.mse_loss(
...     test_pred.view(-1), y['test'].view(-1)) ** 0.5 * y_std
>>> print(f'Test RMSE: {rmse:.2f}.')
Test RMSE: 0.68.

Even using the model from rtdl gives me 0.56 RMSE:

(epoch) 57 (batch) 0 (loss) 0.1885
(epoch) 57 (batch) 10 (loss) 0.1315
(epoch) 57 (batch) 20 (loss) 0.1735
(epoch) 57 (batch) 30 (loss) 0.1197
(epoch) 57 (batch) 40 (loss) 0.1952
(epoch) 57 (batch) 50 (loss) 0.1167
Epoch 057 | Validation score: 0.7334 | Test score: 0.5612 <<< BEST VALIDATION EPOCH

Is there anything I missed? How can I reproduce the performance reported in your paper? Thanks!

Training fails (sometimes) when using several GPUs

Dear maintainers,

I've been using your package for a while now (especially for the FTT model). I've never encountered any trouble, and it helped me boost my performance on a tabular dataset.
Recently, I've been doing some ablation studies, i.e., removing some input features to check whether the model's performance decreases (and if so, at what point). I've discovered that, when using several GPUs (2x RTX 2080 Ti), training fails for certain numbers of input features (but not always; it really depends on the number of input features).
I'm using torch.nn.DataParallel to implement data parallelism; for reasons related to my framework, I don't wish to use torch.nn.parallel.DistributedDataParallel.

Here's a minimal reproducible example to prove my point:

import scipy.sparse  # scipy is 1.10.1
import torch  # torch is 1.13.1
import rtdl  # rtdl is 0.0.13
import torchmetrics  # torchmetrics is 0.11.4
import multiprocessing
import numpy as np  # numpy is 1.24.3

class SparseDataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X  # Sparse input data
        self.y = y  # Target data

    def __getitem__(self, index):
        X = torch.from_numpy(self.X[index].toarray()[0]).float()  # Convert the sparse input data to a dense tensor
        y = self.y[index]  # Target data
        return X, y

    def __len__(self):
        return self.X.shape[0]  # Number of items in the dataset

class FTT(rtdl.FTTransformer):
    def __init__(self, n_num_features=None, cat_cardinalities=None, d_token=16, n_blocks=1, attention_n_heads=4, attention_dropout=0.3, attention_initialization='kaiming', attention_normalization='LayerNorm', ffn_d_hidden=16, ffn_dropout=0.1, ffn_activation='ReGLU', ffn_normalization='LayerNorm', residual_dropout=0.0, prenormalization=True, first_prenormalization=False, last_layer_query_idx=[-1], n_tokens=None, kv_compression_ratio=0.004, kv_compression_sharing='headwise', head_activation='ReLU', head_normalization='LayerNorm', d_out=None):
        feature_tokenizer = rtdl.FeatureTokenizer( 
            n_num_features=n_num_features,
            cat_cardinalities=cat_cardinalities,
            d_token=d_token
        )
        transformer = rtdl.Transformer(
            d_token=d_token,
            n_blocks=n_blocks,
            attention_n_heads=attention_n_heads,
            attention_dropout=attention_dropout,
            attention_initialization=attention_initialization,
            attention_normalization=attention_normalization,
            ffn_d_hidden=ffn_d_hidden,
            ffn_dropout=ffn_dropout,
            ffn_activation=ffn_activation,
            ffn_normalization=ffn_normalization,
            residual_dropout=residual_dropout,
            prenormalization=prenormalization,
            first_prenormalization=first_prenormalization,
            last_layer_query_idx=last_layer_query_idx,
            n_tokens=n_num_features+1,
            kv_compression_ratio=kv_compression_ratio,
            kv_compression_sharing=kv_compression_sharing,
            head_activation=head_activation,
            head_normalization=head_normalization,
            d_out=d_out
        )
        super(FTT, self).__init__(feature_tokenizer, transformer)

    def forward(self, x_num=None, x_cat=None):
        x = self.feature_tokenizer(x_num, x_cat)
        print(f"First forward step --> shape is {x.shape} & device is {x.device}")  # For debug 1/7
        x = self.cls_token(x)
        print(f"Second forward step --> shape is {x.shape} & device is {x.device}")  # For debug 2/7
        x = self.transformer(x)
        print(f"Third forward step --> shape is {x.shape} & device is {x.device}")  # For debug 3/7
        return x

num_train_samples = 803473  # Number of samples in the real training dataset
num_test_samples = 82787  # Number of samples in the real testing dataset
num_input_features = 152  # Number of input features in the real dataset
#num_input_features = 252  # If we use another number of features, let's say 100 more, then the code works (it depends on the number we put here)
num_classes = 228  # Number of classes in the real dataset

X_train = scipy.sparse.random(num_train_samples, num_input_features, density=0.01, format='csr')
y_train = np.random.randint(0, num_classes, num_train_samples)
X_test = scipy.sparse.random(num_test_samples, num_input_features, density=0.01, format='csr')
y_test = np.random.randint(0, num_classes, num_test_samples)

train_dataset = SparseDataset(X_train, y_train)  # Create a train dataset
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=512, shuffle=True, num_workers=multiprocessing.cpu_count())  # Create a train DataLoader 
test_dataset = SparseDataset(X_test, y_test)  # Create a test dataset
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=512, shuffle=True, num_workers=multiprocessing.cpu_count())  # Create a test DataLoader 

model = FTT(n_num_features=num_input_features, d_out=num_classes)

model = torch.nn.DataParallel(model).cuda()  # Run the model parallelly and move it to GPU
criterion = torch.nn.CrossEntropyLoss().cuda()  # Instantiate loss class and move it to GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.0001)  # Instantiate optimizer class
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=5) # Instantiate step learning scheduler class

print(model)  # For debug 4/7
for i in model.named_parameters():
    print(f"{i[0]} -> {i[1].device}")  # For debug 5/7

for epoch in range(100):
    for batch, labels in train_dataloader:  # Iterate through train dataset
        batch = batch.requires_grad_().cuda()  # Load batches with gradient accumulation capabilities
        print(f"Batch shape is {batch.shape} & device is {batch.device}")  # For debug 6/7
        labels = labels.cuda()  # Use GPU for tensors
        print(f"Labels shape is {labels.shape} & device is {labels.device}")  # For debug 7/7
        optimizer.zero_grad()  # Clear gradients w.r.t. parameters
        ########################################
        ### THE CODE FAILS ON THE LINE BELOW ###
        ########################################
        outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
        ########################################
        ### THE CODE FAILS ON THE LINE ABOVE ###
        ########################################
        loss = criterion(outputs, labels)  # Calculate Loss: softmax --> cross entropy loss
        loss.backward()  # Getting gradients w.r.t. parameters
        optimizer.step()  # Updating parameters
    metric = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)  # Instantiate the accuracy metric
    metric.cuda()  # Move the metric to the device
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # Disable gradient computation
        for batch, labels in test_dataloader:  # Iterate through the batches in the test dataset
            batch = batch.cuda()  # Move the batch to the device
            labels = labels.cuda()  # Move the labels to the device
            outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
            accuracy = metric(outputs, labels)  # Calculate the accuracy
    accuracy_epoch = metric.compute() * 100  # Compute the overall accuracy for the epoch
    model.train()

    scheduler.step(accuracy_epoch)  # step on the epoch-level metric, not the last batch's accuracy
    print(f'Epoch {epoch} completed.')
    print(f'Accuracy: {accuracy_epoch:.4f}%.')

When my training data (a CSR matrix) has a shape of (803473, 152) (i.e., 803473 samples with 152 features each), this code fails (on multi-GPU). However, if the training data has a shape of (803473, 252) (I just tried a random number), then it works smoothly.

Here are the logs:

/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
DataParallel(
  (module): FTT(
    (feature_tokenizer): FeatureTokenizer(
      (num_tokenizer): NumericalFeatureTokenizer()
    )
    (cls_token): CLSToken()
    (transformer): Transformer(
      (blocks): ModuleList(
        (0): ModuleDict(
          (attention): MultiheadAttention(
            (W_q): Linear(in_features=16, out_features=16, bias=True)
            (W_k): Linear(in_features=16, out_features=16, bias=True)
            (W_v): Linear(in_features=16, out_features=16, bias=True)
            (W_out): Linear(in_features=16, out_features=16, bias=True)
            (dropout): Dropout(p=0.3, inplace=False)
          )
          (ffn): FFN(
            (linear_first): Linear(in_features=16, out_features=32, bias=True)
            (activation): ReGLU()
            (dropout): Dropout(p=0.1, inplace=False)
            (linear_second): Linear(in_features=16, out_features=16, bias=True)
          )
          (attention_residual_dropout): Dropout(p=0.0, inplace=False)
          (ffn_residual_dropout): Dropout(p=0.0, inplace=False)
          (output): Identity()
          (ffn_normalization): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
          (key_compression): Linear(in_features=153, out_features=0, bias=False)
          (value_compression): Linear(in_features=153, out_features=0, bias=False)
        )
      )
      (head): Head(
        (normalization): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (activation): ReLU()
        (linear): Linear(in_features=16, out_features=228, bias=True)
      )
    )
  )
)
module.feature_tokenizer.num_tokenizer.weight -> cuda:0
module.feature_tokenizer.num_tokenizer.bias -> cuda:0
module.cls_token.weight -> cuda:0
module.transformer.blocks.0.attention.W_q.weight -> cuda:0
module.transformer.blocks.0.attention.W_q.bias -> cuda:0
module.transformer.blocks.0.attention.W_k.weight -> cuda:0
module.transformer.blocks.0.attention.W_k.bias -> cuda:0
module.transformer.blocks.0.attention.W_v.weight -> cuda:0
module.transformer.blocks.0.attention.W_v.bias -> cuda:0
module.transformer.blocks.0.attention.W_out.weight -> cuda:0
module.transformer.blocks.0.attention.W_out.bias -> cuda:0
module.transformer.blocks.0.ffn.linear_first.weight -> cuda:0
module.transformer.blocks.0.ffn.linear_first.bias -> cuda:0
module.transformer.blocks.0.ffn.linear_second.weight -> cuda:0
module.transformer.blocks.0.ffn.linear_second.bias -> cuda:0
module.transformer.blocks.0.ffn_normalization.weight -> cuda:0
module.transformer.blocks.0.ffn_normalization.bias -> cuda:0
module.transformer.blocks.0.key_compression.weight -> cuda:0
module.transformer.blocks.0.value_compression.weight -> cuda:0
module.transformer.head.normalization.weight -> cuda:0
module.transformer.head.normalization.bias -> cuda:0
module.transformer.head.linear.weight -> cuda:0
module.transformer.head.linear.bias -> cuda:0
Batch shape is torch.Size([512, 152]) & device is cuda:0
Labels shape is torch.Size([512]) & device is cuda:0
First forward step --> shape is torch.Size([256, 152, 16]) & device is cuda:1
First forward step --> shape is torch.Size([256, 152, 16]) & device is cuda:0
Second forward step --> shape is torch.Size([256, 153, 16]) & device is cuda:1
Second forward step --> shape is torch.Size([256, 153, 16]) & device is cuda:0
Third forward step --> shape is torch.Size([256, 228]) & device is cuda:0
Traceback (most recent call last):
  File "/home/my_user_name/my_framework_name/minimal_reproducible_example.py", line 98, in <module>
    outputs = model(x_num=batch, x_cat=None)  # Forward pass to get output/logits
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/my_framework_name/minimal_reproducible_example.py", line 57, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/rtdl/modules.py", line 1150, in forward
    x_residual, _ = layer['attention'](
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/rtdl/modules.py", line 893, in forward
    k = key_compression(k.transpose(1, 2)).transpose(1, 2)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/my_user_name/.conda/envs/my_environment_name/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: size mismatch, got 4096, 4096x153,0

Some reminders:

  • This error only happens when I'm training on several (two) GPUs. It works fine with only one.
  • This error only happens with a certain number of input features (here, 152). My full dataset has 10,633 input features, and I never had a problem training the model on it. During my ablation studies, I also tried training with 10,481 input features (it worked), 10,483 (it worked), 10,631 (it worked), 2 (it failed), and 150 (it failed). Why these numbers? Of my 10,633 input features, the first 10,481 form one group, the next 2 form a second group, and the last 150 form a third group, so I'm trying every possible combination.

Thanks for your help!
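
One observation from the logs above (a guess, not a confirmed diagnosis): the printed model contains key_compression/value_compression layers with out_features=0, and torch warns about "zero-element tensors". If rtdl computes the compressed width as int(n_tokens * kv_compression_ratio) — an assumption about its internals — then kv_compression_ratio=0.004 truncates to zero for small token counts, which matches exactly the feature counts that fail:

# Sketch: which feature counts yield a zero-width compression layer,
# assuming the width is int(n_tokens * kv_compression_ratio).
for n_features in (2, 150, 152, 252, 10481, 10631):
    n_tokens = n_features + 1  # +1 for the [CLS] token
    print(n_features, int(n_tokens * 0.004))
# 2 -> 0, 150 -> 0, 152 -> 0 (the failing counts);
# 252 -> 1, 10481 -> 41, 10631 -> 42 (the working counts)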

About piecewise_linear_encoding

Hi Yura!

I have run the test code mentioned in the docstring of data.piecewise_linear_encoding, and
I get this error: RuntimeError: The size of tensor a (4) must match the size of tensor b (100) at non-singleton dimension 1

This error is caused by is_last_bin = bin_indices + 1 == as_tensor(list(map(len, bin_edges))):

bin_indices + 1 has shape torch.Size([100, 4]), while as_tensor(list(map(len, bin_edges))) returns shape torch.Size([100])
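
The mismatch is easy to reproduce in isolation; a tiny demo of the broadcasting rule involved (shapes taken from the error above):

import torch

bins = torch.zeros(100, 4, dtype=torch.long)   # stands in for bin_indices + 1
lengths = torch.zeros(100, dtype=torch.long)   # stands in for the map(len, ...) tensor
# bins == lengths fails: (100, 4) vs (100,) aligns 4 against 100 from the right.
# Unsqueezing gives (100, 1), which broadcasts cleanly against (100, 4):
ok = bins == lengths.unsqueeze(-1)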

How to get feature importance scores or attention heatmap

Hi,
I am trying to produce interpretable visualizations from FT-Transformer, such as feature importance scores or attention heatmaps. I found some discussion of feature importance in Section 5.3 of the paper, but I don't know how to implement it. Is there a way to achieve this with the published source code? Or could you provide an example implementation?
Thank you very much!
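
For what it's worth, here is a rough sketch of the idea from Section 5.3: average the attention that the [CLS] token pays to each feature token. This is not an official rtdl API — it recomputes the attention of the first transformer block only, from its W_q/W_k weights, assumes no kv compression, and relies on rtdl appending [CLS] as the last token; module names follow rtdl 0.0.13.

import torch

@torch.no_grad()
def first_block_cls_attention(model, x_num, x_cat=None, n_heads=8):
    x = model.feature_tokenizer(x_num, x_cat)
    x = model.cls_token(x)                                 # (batch, n_tokens, d)
    attention = model.transformer.blocks[0]['attention']
    q, k = attention.W_q(x), attention.W_k(x)
    b, n, d = q.shape
    d_head = d // n_heads
    q = q.reshape(b, n, n_heads, d_head).transpose(1, 2)   # (b, heads, n, d_head)
    k = k.reshape(b, n, n_heads, d_head).transpose(1, 2)
    probs = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)
    # attention paid by [CLS] (the last token) to the feature tokens
    return probs[:, :, -1, :-1].mean(dim=(0, 1))           # (n_features,)

Averaging the returned vector over batches of training data gives one score per feature, which can then be plotted as a bar chart or heatmap.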

typos in CatEmbeddings

  1. link. The variable cardinalities_and_dimensions does not exist
  2. link. The condition looks broken. Solution: simplify it and remove the word "spec" from the error message.

Regression results about the RTDL models.

Hi, you did a great job implementing the FT-Transformer. However, when I use your example notebook to do a simple regression on sin(x), neither the baseline model nor the FTTransformer gives good results. I have no idea why this happens and would like to know the reason.

Here is the link

How to get the probability of each class in multiclass?

Other than the argmax class, can we also get the probability for each class? Currently the predictions often contain many negative values, and I don't know the right way to convert them to probabilities.
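
For reference: with d_out = n_classes and CrossEntropyLoss, the model outputs raw logits (hence the negative values), so applying a softmax over the class dimension converts them to probabilities. A minimal sketch:

import torch

logits = model(x_num, x_cat)           # (batch, n_classes), raw logits
probs = torch.softmax(logits, dim=1)   # each row sums to 1
pred = probs.argmax(dim=1)             # same class as argmax over the logits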
