
Comments (7)

bayegy avatar bayegy commented on May 5, 2024 1

@trekhleb Many thanks for your reply.

> I guess you're right about the theta[1] issue. I've just made a fix for that. Also, the way polynomials and sinusoids are added has now been changed. I hope you won't have issues with that now.

No, I don't have issues with that for now.

> Could you please double-check? I'm closing this issue for now, but if you have any other concerns please feel free to open new issues.

Yes, I am very interested in this excellent work. I will keep track of the source code and see what I can find.

from homemade-machine-learning.

buck06191 avatar buck06191 commented on May 5, 2024

Hey, I'd like to give this a look and see if I can investigate it further.

from homemade-machine-learning.

buck06191 avatar buck06191 commented on May 5, 2024

@bayegy So the problem with your particular query here is that the data isn't being split in the way you seem to expect. It's splitting vertically (into columns), not horizontally (into rows).

If you follow the source code for prepare_for_training.py, you'll see that before the step where we add the polynomial features in line 34, there is a section where sinusoidal features are added (lines 30 and 31). This changes the data from a column vector to a matrix with two columns and many rows.

The split for the polynomial step separates this matrix into two column vectors and then multiplies them together in various ways.
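The vertical split can be sketched with plain numpy (a minimal illustration of the idea, not the repo's actual code):

```python
import numpy as np

# A two-column matrix, as produced after the sinusoidal step.
data = np.arange(10).reshape(5, 2)

# Vertical split: two (5, 1) column vectors, not two (2 or 3)-row halves.
left, right = np.array_split(data, 2, axis=1)

# Element-wise products of the halves form the new polynomial terms.
product = left * right
print(left.shape, right.shape, product.shape)
```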

This does actually raise a secondary problem here that may be worth raising to @trekhleb: the sinusoidal step is given as a conditional

```python
# Add sinusoidal features to the dataset.
if sinusoid_degree:
    data_processed = add_sinusoids(data_processed, sinusoid_degree)
```

but if there is a single column of data and the sinusoid features aren't added, the method fails at the .train() stage. For example, using the example data, if None or zero is passed as sinusoid_degree and an integer greater than 1 is passed as polynomial_degree, the train() method fails with IndexError: index 1 is out of bounds for axis 0 with size 1, thrown by line 101.
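The failure mode is easy to reproduce in isolation: splitting a single-column array in half leaves the second piece empty, so any later column access on it raises an IndexError (a minimal sketch, not the repo's code):

```python
import numpy as np

# A single feature column, i.e. no sinusoidal features were added.
data = np.arange(5, dtype=float).reshape(5, 1)

# Halving one column yields a (5, 1) piece and an empty (5, 0) piece.
left, right = np.array_split(data, 2, axis=1)
print(left.shape, right.shape)

# Any attempt to read a column from the empty half fails.
try:
    right[:, 0]
except IndexError as err:
    print("IndexError:", err)
```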

from homemade-machine-learning.

bayegy avatar bayegy commented on May 5, 2024

> @bayegy So the problem with your particular query here is that the data isn't being split in the way you seem to expect. It's splitting vertically (into columns), not horizontally (into rows).
>
> If you follow the source code for prepare_for_training.py, you'll see that before the step where we add the polynomial features in line 34, there is a section where sinusoidal features are added (lines 30 and 31). This changes the data from a column vector to a matrix with two columns and many rows.
>
> The split for the polynomial step separates this matrix into two column vectors and then multiplies them together in various ways.
>
> This does actually raise a secondary problem here that may be worth raising to @trekhleb: the sinusoidal step is given as a conditional
>
> ```python
> # Add sinusoidal features to the dataset.
> if sinusoid_degree:
>     data_processed = add_sinusoids(data_processed, sinusoid_degree)
> ```
>
> but if there is a single column of data and the sinusoid features aren't added, the method fails at the .train() stage. For example, using the example data, if None or zero is passed as sinusoid_degree and an integer greater than 1 is passed as polynomial_degree, the train() method fails with IndexError: index 1 is out of bounds for axis 0 with size 1, thrown by line 101.

@buck06191 @trekhleb Thanks for your patience! That's insightful, but I still have some questions.

I did see how the sinusoidal features were added, but I don't understand how that changes the data from a column vector to a matrix with two columns. To my understanding (please correct me if I'm wrong), the input data used for training has at least two columns, with the shape (number of samples, number of features). The first column, used as the coefficient of the intercept (theta_zero), is added by prepare_for_training in line 47. So there is always a theta[1], and it will not cause any IndexError. That said, I still don't understand what theta[1] has to do with theta_zero: can the step in line 101 really prevent the regularizing of the parameter theta_zero?

As for add_sinusoids, I think the number of columns of the dataset returned by add_sinusoids is simply a multiple of the number of input columns, which may be odd and lead to uneven shapes for dataset_1 and dataset_2, too. numpy.sin in line 19 of add_sinusoids does not change the shape of the input dataset, and the code in line 20 iteratively appends the output of numpy.sin to the right side of an empty matrix. For example, if we have 3 features (not counting the column added by prepare_for_training in line 47) and 100 samples, and sinusoid_degree is set to 4, the dataset processed by add_sinusoids will have the shape (100, 9), so dataset_1 will be (100, 4) and dataset_2 (100, 5).
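Assuming the split is a simple halving of the columns (an illustrative sketch of the uneven split described above, not the repo's exact code):

```python
import numpy as np

# e.g. 3 features expanded by sinusoids to 9 columns, as in the example above.
data = np.ones((100, 9))

# Integer division: 9 // 2 == 4, so the halves come out uneven.
half = data.shape[1] // 2
dataset_1 = data[:, :half]  # (100, 4)
dataset_2 = data[:, half:]  # (100, 5)
print(dataset_1.shape, dataset_2.shape)
```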

from homemade-machine-learning.

buck06191 avatar buck06191 commented on May 5, 2024

Hi @bayegy, I'll try to address some of these issues, but as I'm not one of the original contributors I might make a few mistakes.

> To my understanding (please correct me if I'm wrong), the input data used for training has at least two columns, with the shape (number of samples, number of features)

The LinearRegression instance takes a data argument and a labels argument. The data argument in the examples given is a single column; this is the x data. The labels argument is the y data. So the input data used for training, at least within the examples, is a single column. It won't always be a single column, but in the examples given, that is definitely the case.

> The first column, used as the coefficient of the intercept (theta_zero), is added by prepare_for_training in line 47

This is added after the split step in line 39, and hence there is a chance that a single column of data is passed to the split.

> So there is always a theta[1], and it will not cause any IndexError. That said, I still don't understand what theta[1] has to do with theta_zero: can the step in line 101 really prevent the regularizing of the parameter theta_zero?

I think this relates to a different file (it's always worth linking to files in comments if you're referring to more than one), so I can't comment on this.

> As for add_sinusoids, I think the number of columns of the dataset returned by add_sinusoids is simply a multiple of the number of input columns, which may be odd and lead to uneven shapes for dataset_1 and dataset_2, too.

You're right here. The step that generates polynomial features relies on there being an even number of columns. There clearly needs to be some kind of check in place, and some method of dealing with this. For example, if we were looking at house prices and had 3 features (distance to school, distance to transport, and size in square metres), the current polynomial feature generation would not work. This step really needs to be generalised to deal with any number of features.
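One way to generalise the step is to take products over all column combinations up to a given degree, similar in spirit to scikit-learn's PolynomialFeatures. This is a hedged sketch with a hypothetical helper name, not the repo's API:

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(data, degree):
    """Hypothetical generalised expansion: all monomials up to `degree`."""
    n_features = data.shape[1]
    columns = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            # Product of the chosen columns, e.g. (0, 1) -> x0 * x1.
            columns.append(np.prod(data[:, list(combo)], axis=1))
    return np.column_stack(columns)

# Works for any number of features, e.g. the 3-feature house-price case.
houses = np.random.rand(100, 3)
expanded = polynomial_features(houses, 2)
print(expanded.shape)  # 3 linear + 6 quadratic terms -> (100, 9)
```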

from homemade-machine-learning.

bayegy avatar bayegy commented on May 5, 2024

@buck06191 Thanks again. It's really helpful.

> I think this relates to a different file (it's always worth linking to files in comments if you're referring to more than one), so I can't comment on this.

I was talking about line 101 in the source code of linear_regression.py:

```python
# We should NOT regularize the parameter theta_zero.
theta[1] = theta[1] - alpha * (1 / num_examples) * (self.data[:, 0].T @ delta).T
```

This line is supposed to prevent regularizing the parameter theta_zero, as the comment says, but why does it update theta[1]? Any explanation would be appreciated!
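For reference, a standard way to exclude theta_zero from the penalty is to zero out the first component of the regularization term before the update. This is a hedged sketch with illustrative names, not the repo's implementation:

```python
import numpy as np

def gradient_step(theta, data, labels, alpha, lambda_):
    """One regularized gradient-descent step; theta[0] is not penalized."""
    num_examples = data.shape[0]
    predictions = data @ theta
    delta = predictions - labels
    gradient = (data.T @ delta) / num_examples
    regularization = (lambda_ / num_examples) * theta
    regularization[0] = 0  # do NOT regularize theta_zero
    return theta - alpha * (gradient + regularization)

# The bias column is the first column of data, matching theta[0].
data = np.column_stack([np.ones(4), np.arange(4.0)])
labels = np.array([0.0, 2.0, 4.0, 6.0])
theta = np.ones(2)
no_penalty = gradient_step(theta, data, labels, 0.1, 0.0)
heavy_penalty = gradient_step(theta, data, labels, 0.1, 100.0)
# theta[0]'s update is identical regardless of lambda; theta[1]'s is not.
print(no_penalty, heavy_penalty)
```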

from homemade-machine-learning.

trekhleb avatar trekhleb commented on May 5, 2024

@buck06191 thank you for all these answers!

@bayegy I guess you're right about the theta[1] issue. I've just made a fix for that. Also, the way polynomials and sinusoids are added has now been changed. I hope you won't have issues with that now. Could you please double-check? I'm closing this issue for now, but if you have any other concerns please feel free to open new issues.

from homemade-machine-learning.
