
home_regression's Introduction

home_regression

Based on a Kaggle dataset (link), I worked on several approaches to linear regression.

Jupyter notebook: hr_ll_2.ipynb

  1. Straightforward off-the-shelf linear regression using the sklearn library and XGBoost (a minimal sketch follows)
  2. Kaggle score: 0.13357 after fine-tuning
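
To illustrate approach 1, here is a minimal sketch on synthetic data; the hyperparameters are illustrative placeholders, not the tuned values behind the 0.13357 score, and the notebook itself works on the prepared Kaggle features instead:

# Sketch of approach 1 on synthetic data, not the notebook's exact code.
import numpy as np
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor  # assumes xgboost is installed

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.normal(size=100)
X_test = rng.normal(size=(20, 5))

# Plain linear regression baseline.
lin_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# XGBoost regressor; n_estimators / learning_rate are illustrative, not tuned.
xgb_pred = XGBRegressor(n_estimators=200, learning_rate=0.05).fit(X_train, y_train).predict(X_test)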

Data Preparation approach:

  1. In the training set: a. drop any feature with more than 30% missing values; b. select quantitative features, plus qualitative features with cardinality below 10; c. run Spearman and Pearson correlations to select the top XX features
  2. Normalize the updated training set / normalize the testing set
  3. Split the training set X/Y % to create the validation set
  4. Merge the training and testing sets: a. if one-hot encoding, do it here to make sure you cover every possible option; b. if mean grading & ordering (the preferred solution below), no one-hot encoding is needed, as it is handled during the Spearman correlation
  5. Get the training, validation and test sets for .fit() and .predict() (see the sketch after this list)
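
Here is a minimal sketch of that recipe, assuming the standard Kaggle House Prices files (train.csv / test.csv with a SalePrice target); top_k stands in for the unspecified "top XX", and the 75/25 validation split of step 3 is left out:

# Sketch of the preparation recipe above, not the notebook's exact code.
import pandas as pd

def prepare(train_path='train.csv', test_path='test.csv', top_k=30):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    # 1a. Drop features with more than 30% missing values (measured on train).
    keep = [c for c in train.columns if train[c].isna().mean() <= 0.30]
    train = train[keep]

    # 1b. Quantitative features, plus qualitative ones with cardinality below 10.
    quant = list(train.select_dtypes('number').columns.drop('SalePrice'))
    qual = [c for c in train.select_dtypes('object') if train[c].nunique() < 10]

    # 4a. Merge train and test before one-hot encoding so both sets
    # end up with the same dummy columns.
    merged = pd.concat([train[quant + qual], test[quant + qual]])
    merged = pd.get_dummies(merged, columns=qual)
    merged = merged.fillna(merged.median(numeric_only=True))
    X_train, X_test = merged.iloc[:len(train)], merged.iloc[len(train):]

    # 1c. Keep the top_k features by absolute Spearman correlation with the target.
    corr = X_train.corrwith(train['SalePrice'], method='spearman').abs()
    top = corr.nlargest(top_k).index
    X_train, X_test = X_train[top], X_test[top]

    # 2. Normalize both sets with statistics computed on the training set only.
    mu, sd = X_train.mean(), X_train.std()
    return (X_train - mu) / sd, (X_test - mu) / sd, train['SalePrice']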

Additional libraries used:

  • NumPy
  • pandas
  • seaborn
  • Matplotlib


home_regression's Issues

Linear Regression: DataLoader doesn't batch the training set

I created a CustomDataset to prepare the training, validation and test sets and to make sure they have the same format (same number of columns after dummies, global normalization, ...). This works fine.

The DataLoader calls the CustomDataset according to the type of dataset it requires ('train', 'valid' or 'test'), as below:

import torch

# One CustomDataset per split; `data=x` selects 'train', 'valid' or 'test'.
data_dataset = {x: CustomDataset(csv_file_data=data_dir + 'train.csv',
                                 csv_file_test=data_dir + 'test.csv',
                                 **params,
                                 data=x)
                for x in ['train', 'valid', 'test']}
data_loader = {x: torch.utils.data.DataLoader(data_dataset[x], batch_size=1, shuffle=True)
               for x in ['train', 'valid', 'test']}

The various sets look like this:

print('TRAINING')

data, lab_target = data_dataset['train'][0]

print('DATASET')
print('Data shape: ', data.shape)
print('Data type: ', type(data))
print('Data size: {}'.format(data.size()))
# print('Example of the features for the 1st entry: {}'.format(data[0]))
print('\nTarget at the first row: {}'.format(lab_target.size()))
print('Example of the label for the 1st entry: {}'.format(lab_target[0]))


print()
print('Train Loader type')
train_iter = iter(data_loader['train'])
print(type(train_iter))

# use the built-in next(); the .next() method is not available in recent PyTorch
datas, labels_target = next(train_iter)

print('DATALOADER')
print('images shape on batch size = ', datas.size())
print('Example of datas for the 1st entry {}'.format(datas[0].size()))
# print('\nTarget type on batch size = {}'.format(labels_target))
print('Target type on batch size = {}'.format(type(labels_target)))
print('Target shape on batch size = ', labels_target.shape)
print(len(train_iter))

and the output is:

TRAINING
DATASET
Data shape: torch.Size([1095, 288])
Data type: <class 'torch.Tensor'>
Data size: torch.Size([1095, 288])

Target at the first row: torch.Size([1095, 1])
Example of the label for the 1st entry: tensor([208500.])

Train Loader type
<class 'torch.utils.data.dataloader._SingleProcessDataLoaderIter'>
DATALOADER
images shape on batch size = torch.Size([1, 1095, 288])
Example of datas for the 1st entry torch.Size([1095, 288])
Target type on batch size = <class 'torch.Tensor'>
Target shape on batch size = torch.Size([1, 1095, 1])
1460

Here the batch size is 1.

If I had changed the batch size to 10, for example, the outcome would look like this:

DATALOADER
images shape on batch size = torch.Size([10, 1095, 288])
Example of datas for the 1st entry torch.Size([1095, 288])
Target type on batch size = <class 'torch.Tensor'>
Target shape on batch size = torch.Size([10, 1095, 1])
146

The full training set has 1460 entries, but I split it into 75% for training and 25% for validation, hence the 1095 rows. 288 is the number of features, including all the dummies.

So my issue is that when I train my model (below is an extract of the training loop):

model.train()
for idx, (data, target) in enumerate(loaders['train']):
    if use_cuda:
        data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    output = model(data)
    # for name, param in model.named_parameters():
    #     if param.requires_grad:
    #         print(name, param.data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()  # reuse the computed loss instead of calling criterion again

For each batch, it takes all the entries, and I don't know how to change this so that the training set is split into batches.
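
To illustrate what I mean (a synthetic sketch, not my actual CustomDataset): if __getitem__ returns the full tensor, the DataLoader treats the whole set as one sample, whereas returning one row per index lets it build real batches:

import torch
from torch.utils.data import Dataset, DataLoader

class RowDataset(Dataset):
    """Hypothetical dataset returning one row per index."""
    def __init__(self, features, targets):
        self.features = features   # [n_rows, n_features]
        self.targets = targets     # [n_rows, 1]

    def __len__(self):
        return len(self.features)  # number of rows, not number of files

    def __getitem__(self, idx):
        # One (row, label) pair per index; the DataLoader stacks these
        # into batches of shape [batch_size, n_features].
        return self.features[idx], self.targets[idx]

features, targets = torch.randn(1095, 288), torch.randn(1095, 1)
loader = DataLoader(RowDataset(features, targets), batch_size=10, shuffle=True)
data, target = next(iter(loader))
print(data.shape, target.shape)  # torch.Size([10, 288]) torch.Size([10, 1])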

I look forward to hearing your input.
