
home_regression's Introduction

home_regression

Based on a Kaggle dataset (link), I worked on several approaches to linear regression.

Jupyter notebook: hr_ll_2.ipynb

  1. Straightforward off-the-shelf linear regression using the sklearn library and XGBoost (a minimal sketch follows)
  2. Kaggle score: 0.13357 after fine-tuning
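
To illustrate approach 1, here is a minimal sketch on synthetic data; the hyperparameters are illustrative placeholders, not the tuned values behind the 0.13357 score, and the notebook itself works on the prepared Kaggle features instead:

# Sketch of approach 1 on synthetic data, not the notebook's exact code.
import numpy as np
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor  # assumes xgboost is installed

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.normal(size=100)
X_test = rng.normal(size=(20, 5))

# Plain linear regression baseline.
lin_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# XGBoost regressor; n_estimators / learning_rate are illustrative, not tuned.
xgb_pred = XGBRegressor(n_estimators=200, learning_rate=0.05).fit(X_train, y_train).predict(X_test)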

Data Preparation approach:

  1. In the training set: a. drop any feature with more than 30% missing values; b. select quantitative features, plus qualitative features with cardinality below 10; c. run Spearman and Pearson correlations to select the top XX features
  2. Normalize the updated training set / normalize the testing set
  3. Split the training set X/Y % to create the validation set
  4. Merge the training and testing sets: a. if one-hot encoding, do it here to make sure you cover every possible option; b. if mean grading & ordering (the preferred solution below), no one-hot encoding is needed, as it is handled during the Spearman correlation
  5. Get the training, validation and test sets for .fit() and .predict() (see the sketch after this list)
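
Here is a minimal sketch of that recipe, assuming the standard Kaggle House Prices files (train.csv / test.csv with a SalePrice target); top_k stands in for the unspecified "top XX", and the 75/25 validation split of step 3 is left out:

# Sketch of the preparation recipe above, not the notebook's exact code.
import pandas as pd

def prepare(train_path='train.csv', test_path='test.csv', top_k=30):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    # 1a. Drop features with more than 30% missing values (measured on train).
    keep = [c for c in train.columns if train[c].isna().mean() <= 0.30]
    train = train[keep]

    # 1b. Quantitative features, plus qualitative ones with cardinality below 10.
    quant = list(train.select_dtypes('number').columns.drop('SalePrice'))
    qual = [c for c in train.select_dtypes('object') if train[c].nunique() < 10]

    # 4a. Merge train and test before one-hot encoding so both sets
    # end up with the same dummy columns.
    merged = pd.concat([train[quant + qual], test[quant + qual]])
    merged = pd.get_dummies(merged, columns=qual)
    merged = merged.fillna(merged.median(numeric_only=True))
    X_train, X_test = merged.iloc[:len(train)], merged.iloc[len(train):]

    # 1c. Keep the top_k features by absolute Spearman correlation with the target.
    corr = X_train.corrwith(train['SalePrice'], method='spearman').abs()
    top = corr.nlargest(top_k).index
    X_train, X_test = X_train[top], X_test[top]

    # 2. Normalize both sets with statistics computed on the training set only.
    mu, sd = X_train.mean(), X_train.std()
    return (X_train - mu) / sd, (X_test - mu) / sd, train['SalePrice']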

Additional libraries used:

  • NumPy
  • pandas
  • seaborn
  • Matplotlib


home_regression's Issues

Linear Regression: DataLoader doesn't batch the training set

I created a CustomDataset to prepare the training, validation and test sets and to make sure they have the same format (same number of columns after dummies, global normalization, ...). This works fine.

The DataLoader calls the CustomDataset according to the type of dataset it requires ('train', 'valid' or 'test'), as below:

import torch

# One CustomDataset per split; `data=x` selects 'train', 'valid' or 'test'.
data_dataset = {x: CustomDataset(csv_file_data=data_dir + 'train.csv',
                                 csv_file_test=data_dir + 'test.csv',
                                 **params,
                                 data=x)
                for x in ['train', 'valid', 'test']}
data_loader = {x: torch.utils.data.DataLoader(data_dataset[x], batch_size=1, shuffle=True)
               for x in ['train', 'valid', 'test']}

The various sets look like this:

print('TRAINING')

data, lab_target = data_dataset['train'][0]

print('DATASET')
print('Data shape: ', data.shape)
print('Data type: ', type(data))
print('Data size: {}'.format(data.size()))
# print('Example of the features for the 1st entry: {}'.format(data[0]))
print('\nTarget at the first row: {}'.format(lab_target.size()))
print('Example of the label for the 1st entry: {}'.format(lab_target[0]))


print()
print('Train Loader type')
train_iter = iter(data_loader['train'])
print(type(train_iter))

# use the built-in next(); the .next() method is not available in recent PyTorch
datas, labels_target = next(train_iter)

print('DATALOADER')
print('images shape on batch size = ', datas.size())
print('Example of datas for the 1st entry {}'.format(datas[0].size()))
# print('\nTarget type on batch size = {}'.format(labels_target))
print('Target type on batch size = {}'.format(type(labels_target)))
print('Target shape on batch size = ', labels_target.shape)
print(len(train_iter))

and the output is:

TRAINING
DATASET
Data shape: torch.Size([1095, 288])
Data type: <class 'torch.Tensor'>
Data size: torch.Size([1095, 288])

Target at the first row: torch.Size([1095, 1])
Example of the label for the 1st entry: tensor([208500.])

Train Loader type
<class 'torch.utils.data.dataloader._SingleProcessDataLoaderIter'>
DATALOADER
images shape on batch size = torch.Size([1, 1095, 288])
Example of datas for the 1st entry torch.Size([1095, 288])
Target type on batch size = <class 'torch.Tensor'>
Target shape on batch size = torch.Size([1, 1095, 1])
1460

Here the batch size is 1.

If I had changed the batch size to 10, for example, the outcome would look like this:

DATALOADER
images shape on batch size = torch.Size([10, 1095, 288])
Example of datas for the 1st entry torch.Size([1095, 288])
Target type on batch size = <class 'torch.Tensor'>
Target shape on batch size = torch.Size([10, 1095, 1])
146

The full training set has 1460 entries, but I split it into 75% for training and 25% for validation, hence the 1095 rows. 288 is the number of features, including all the dummies.

So my issue is that when I train my model (below is an extract of the training loop):

model.train()
for idx, (data, target) in enumerate(loaders['train']):
    if use_cuda:
        data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    output = model(data)
    # for name, param in model.named_parameters():
    #     if param.requires_grad:
    #         print(name, param.data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()  # reuse the computed loss instead of calling criterion again

For each batch, it takes all the entries, and I don't know how to change this so that the training set is split into batches.
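
To illustrate what I mean (a synthetic sketch, not my actual CustomDataset): if __getitem__ returns the full tensor, the DataLoader treats the whole set as one sample, whereas returning one row per index lets it build real batches:

import torch
from torch.utils.data import Dataset, DataLoader

class RowDataset(Dataset):
    """Hypothetical dataset returning one row per index."""
    def __init__(self, features, targets):
        self.features = features   # [n_rows, n_features]
        self.targets = targets     # [n_rows, 1]

    def __len__(self):
        return len(self.features)  # number of rows, not number of files

    def __getitem__(self, idx):
        # One (row, label) pair per index; the DataLoader stacks these
        # into batches of shape [batch_size, n_features].
        return self.features[idx], self.targets[idx]

features, targets = torch.randn(1095, 288), torch.randn(1095, 1)
loader = DataLoader(RowDataset(features, targets), batch_size=10, shuffle=True)
data, target = next(iter(loader))
print(data.shape, target.shape)  # torch.Size([10, 288]) torch.Size([10, 1])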

I look forward to hearing your input.
