Comments (58)
from fsrs-optimizer.
I find that the ln(delta_t) filter tends to give a loose threshold. For example, if Q1 = 2 and Q3 = 10, the original threshold is Q3 + 1.5 * IQR = 10 + 1.5 * 8 = 22. When we apply ln(), Q1 = 0.69 and Q3 = 2.30, so the new threshold is 2.30 + 1.5 * 1.61 = 4.72, and e^4.72 ≈ 112 days, which is too large.
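To make the arithmetic above easy to reproduce, here is a small sketch (the quartile values are the hypothetical ones from this comment, not from any real collection):

```python
import math

q1, q3 = 2.0, 10.0  # hypothetical quartiles of delta_t

# Tukey fence in raw space: Q3 + 1.5 * IQR
raw_threshold = q3 + 1.5 * (q3 - q1)  # 22.0

# Same fence computed on ln(delta_t), then mapped back with exp()
log_q1, log_q3 = math.log(q1), math.log(q3)
log_threshold = math.exp(log_q3 + 1.5 * (log_q3 - log_q1))  # roughly 112 days

print(raw_threshold, log_threshold)
```

Because the fence is applied to the logs and then exponentiated, the cutoff grows multiplicatively rather than additively, which is why it ends up so much looser.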
Well, @L-M-Sherlock, this version is far behind the release version (especially in pre-train), so it would be very difficult to accurately judge the impact of the current change.
So, I would suggest pushing the change to a separate branch and then telling us how to import that change into optimizer.ipynb (like you told us in #5 (comment)).
Also, use `has_been_removed + count >= total * 0.05` instead of `has_been_removed >= total * 0.05 or count >= total * 0.05`.
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
You need to click "Disconnect and delete runtime" after you change the optimizer version and then upload the collection file again (but the timezone settings and everything else would be preserved).
Although we decided that we won't use RMSE to determine the effectiveness of this change, I wanted to highlight that this change decreased the RMSE even though the number of reviews increased, at least for Sherlock's collection.
Thanks for your reminder. I am also considering this problem. Do you have any suggestion?
I have no idea how to solve this problem other than removing the filter, which would, unfortunately, make the results worse for some users.
Perhaps, @Expertium can suggest a better way to filter the outliers in initial stability based on his knowledge of statistics.
Edit: If we don't get any other idea, I have a workaround in mind: don't filter out the data for which delta_t < 20 (even if 20 > Q3 + 1.5 * IQR).
Well, according to Sherlock, LOF doesn't filter out enough outliers, so I don't know. I was really hoping that LOF could be fine-tuned to suit our needs, though I haven't really tried to do it myself.
One suggestion I have is this: calculate quantiles and IQR not using delta_t, but using ln(delta_t). In other words, look for outliers after transforming data. By the way, I recommended doing this with LOF as well, but I don't think Sherlock tried it.
One suggestion I have is this: calculate quantiles and IQR not using delta_t, but using ln(delta_t). In other words, look for outliers after transforming data.
This solution would work for me, at least.
In my collection, the threshold for filtering would increase from 3.5 days to 5.7 days for Good as the first rating.
For this collection (for which you developed this filtering mechanism), the threshold for filtering would increase from 7 days to 11.3 days (which is not too large, imo) for Again as the first rating.
@L-M-Sherlock I recommend implementing what I suggested above, but first you need to set up automated testing on all collections.
@L-M-Sherlock, what do you think about the idea of using ln(delta_t) to calculate quartiles and IQR?
I remember that it "increased" the RMSE, but that was an artifact in my opinion.
Probably, we should implement this without worrying about the RMSE. We have indeed made several changes without even testing their effect on the RMSE.
Otherwise, we should find a better way to verify whether using ln(delta_t) to calculate quartiles and IQR is better than using plain delta_t.
Let's begin with concrete cases. For example, you can provide your S0 dataset and select the outliers (even in a subjective way). Then we can find some statistical methods to filter out those outliers.
Here is my S0 dataset. In my opinion, the ones having a pink background can be considered outliers. But probably, even some more rows could be considered outliers.
The tsv file is here: stability_for_pretrain.tsv.zip (remove .zip at the end)
Here is my S0 dataset. In my opinion, the ones having a pink background can be considered outliers. But probably, even some more rows could be considered outliers.
I think we can filter these data by setting a threshold on the count. Median seems to be a good candidate here.
Sorry, but I couldn't understand anything in this comment. What is the graph showing and what do the axes represent? Also, what are the three values in the second image?
I think Sherlock's idea is to filter outliers based on the number of reviews with a certain interval length. For example, if there have been 20 reviews with delta_t = 2 and 1 review with delta_t = 20, then the latter would be considered an outlier, but not because of the length of the interval but rather because there has only been one such review. That's the gist of it, unless I also misunderstood Sherlock.
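A toy illustration of that reading of the idea (made-up numbers, not the optimizer's real data or code):

```python
import pandas as pd

# 20 reviews at delta_t = 2 and a single review at delta_t = 20 (made-up data)
reviews = pd.DataFrame({"delta_t": [2] * 20 + [20]})
counts = reviews.groupby("delta_t").size()

# Flag intervals whose review count falls below the median count
median_count = counts.median()
outlier_intervals = counts[counts < median_count].index.tolist()
print(outlier_intervals)  # [20]: flagged for its low count, not its length
```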
So if we filter out all data where count < median(count) on user1823's data, here's what it will look like:
I mean the method based on the quantile of count. It has a problem: it will always filter out x% of the delta_t values. So we should consider other methods.
How about trying LOF again? I know I've been a little annoying about it, but this seems like a good use case for it. We have 3 features: delta_t, y (mean) and count. Pass all three of them into LOF. Perhaps use ln(delta_t) and ln(count), since both of them can differ by 10-100 times.
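For anyone who wants to prototype this, a sketch with scikit-learn's `LocalOutlierFactor` could look like the following. The data here is synthetic forgetting-curve-like data, and the feature choice simply mirrors the suggestion above; this is not tested code from the optimizer:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Synthetic pretrain-like rows: interval length, mean recall, review count
delta_t = np.arange(1, 31)
y_mean = np.clip(0.97 ** delta_t + rng.normal(0, 0.02, delta_t.size), 0.0, 1.0)
count = rng.integers(5, 500, delta_t.size)

# Log-transform the heavy-tailed features so distances are comparable
X = np.column_stack([np.log(delta_t), np.log(count), y_mean])

# contamination fixes the expected share of outliers up front
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier
print(delta_t[labels == -1])  # intervals flagged as outliers
```

Setting `contamination` explicitly (rather than the default `"auto"`) is what the later comments in this thread experiment with.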
Though it's still unclear how to test it without relying on RMSE.
We can try using LOF. But, in the absence of an objective measure to test it, we will have to look at which delta_t values were filtered out by this approach and then subjectively decide whether it is better or not.
Otherwise, we should just trust Expertium and use ln(delta_t) to calculate quartiles and IQR. This way, we would be using a widely used formula (Q3 + 1.5 × IQR) to find the outliers instead of LOF (which can act in unexpected ways for some collections). Though this approach "increased" the RMSE in our testing, it was probably just an artifact (I know that I have said this too many times).
I think Sherlock's idea (as I explained it here) could also work. It will filter out half of all intervals, not half of all reviews. In fact, in that image it only filtered out 27 reviews (for "Good") out of 5476. If LOF fails for whatever reason, I suggest this.
@L-M-Sherlock I recommend trying Local Outlier Factor with the following three features: ln(delta_t), ln(count) and y_mean. So it will work with three-dimensional data. Please, do the usual test of statistical significance.
I think Sherlock's idea (as I explained it here) could also work. It will filter out half of all intervals ...
But this method will work only for this particular data. For example, apply the same method to Easy in my data and you will see what I mean.
Edit:
The phrase "filter out half of all intervals" made me think that your approach is to filter out the larger delta_t values. But it is not.
Still, it has a problem: it will always filter out some part of the data, even if ideally none of it should be considered an outlier.
The IQR method has the advantage that it doesn't filter out any data if the data is more homogeneous.
The reason why I'm recommending LOF over the more simple approaches is that LOF can work with multidimensional data, in fact, it might work better with multidimensional data.
But you should understand the mechanism of LOF. It filters out outliers based on density. In our case, the outliers could also be concentrated in a range of intervals with high density. I will test it tomorrow.
The reason why I'm recommending LOF over the more simple approaches is that LOF can work with multidimensional data, in fact, it might work better with multidimensional data.
@Expertium, I have given it a try. LOF predicts that all the data are inliers. (I used @user1823's data.)
I have another method. We can sort the delta_t by count in ascending order. Then we start removing delta_t from the first row one by one, accumulating the count we have removed. Finally, we can determine an upper limit (a percentage, e.g. 5%) for removing outliers and stop at that point.
The LOF predicts that all data are inliers.
I assume you let it choose the contamination %. How about one last try: manually set the contamination % to, say, 5% or 10%?
If that doesn't work either, then let's use your newly proposed method.
I have another method. We can sort the delta_t by count in ascending order.
You can try this method. But the problem is that it will always filter out some data.
However, some values at the end of the dataset might represent true values from natural variation. Filtering out such data might lead to underfitting.
If Expertium's idea of manually setting contamination % doesn't work, I think that we should simply use ln(delta_t) with IQR.
I assume you let it choose the % of contamination. How about one last try - manually set contamination %, say, 5% or 10%.
Even weirder things happen. I think these rows shouldn't be removed.
Alright, forget about it then.
If Expertium's idea of manually setting contamination % doesn't work, I think that we should simply use ln(delta_t) with IQR.
OK. But we should pass this test. As I mentioned before, here is an extreme case for outliers: open-spaced-repetition/fsrs4anki#282 (comment)
Here is his dataset for pretrain (with outliers): stability_for_pretrain.csv
- Without the current outlier detector:
- With the current outlier detector:
- With the current outlier detector + ln(delta_t):
Version 3 doesn't remove enough outliers.
So, just verify whether using ln(delta_t) filters out the outliers well in this collection or not.
So, let's try your method:
We can sort the delta_t by count in ascending order. Then we start removing delta_t from the first row one by one, accumulating the count we have removed. Finally, we can determine an upper limit (a percentage, e.g. 5%) for removing outliers and stop at that point.
To reduce the risk of filtering out inliers, we can add the condition that if the first row (in the sorted data) contains more than 5% of the data, we won't remove any data.
Edit:
Probably, the condition can be further improved: we stop filtering when we encounter a row whose inclusion would cause the optimizer to filter out more than 5% of the data.
For example, if the count % values arranged in ascending order are as follows:
- 1%
- 2%
- 6%
then we filter out just 3% (1% + 2%) of the data.
If the count % values arranged in ascending order are as follows:
- 1%
- 2%
- 2%
- 3%
then we filter out 5% (1% + 2% + 2%) of the data.
If the count % values arranged in ascending order are as follows:
- 2%
- 4%
- 4%
then we filter out just 2% of the data.
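The stopping rule in these examples can be sketched as a small helper (hypothetical function name; it works on integer review counts rather than percentages to avoid floating-point surprises):

```python
def reviews_to_remove(counts, total, limit=0.05):
    """Walk counts sorted ascending; stop as soon as removing the next
    row would push the removed share above `limit` of `total`."""
    removed = 0
    for c in counts:
        if removed + c > total * limit:
            break
        removed += c
    return removed

# The three examples above, expressed as counts out of 100 reviews:
print(reviews_to_remove([1, 2, 6], 100))     # 3
print(reviews_to_remove([1, 2, 2, 3], 100))  # 5
print(reviews_to_remove([2, 4, 4], 100))     # 2
```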
It removed 140 outliers (4.47%).
Here is the code:

```python
def remove_outliers(group: pd.DataFrame) -> pd.DataFrame:
    grouped_group = group.groupby(by=['r_history', 'delta_t'], group_keys=False).agg({'y': ['mean', 'count']}).reset_index()
    sort_index = grouped_group.sort_values(by=[('y', 'count')], ascending=True).index
    total = sum(grouped_group[('y', 'count')])
    has_been_removed = 0
    for i in sort_index:
        count = grouped_group.loc[i, ('y', 'count')]
        if has_been_removed >= total * 0.05 or count >= total * 0.05:
            break
        has_been_removed += count
    # keep only the delta_t values whose count reached the stopping row's count
    group = group[group['delta_t'].isin(grouped_group[grouped_group[('y', 'count')] >= count]['delta_t'])]
    return group
```
In my collection, this method would remove
- 9 outliers (3.57%) for Again
- no outliers (0%) for Hard
- 11 outliers (1.92%) for Good
- 3 outliers (2.3%) for Easy
I am quite satisfied with these results.
@Expertium could you test it in your collection?
`if has_been_removed >= total * 0.05 or count >= total * 0.05:`
This should be
`if has_been_removed >= total * 0.05 or has_been_removed + count >= total * 0.05:`
`has_been_removed + count >= total * 0.05`
This condition is enough.
`has_been_removed + count >= total * 0.05`
This condition is enough.
Oh, you are right. So, update the condition in your previous comment so that Expertium can try it out with the correct code.
Where do I put this code if Sherlock stopped releasing "open" versions of the optimizer (where all the code is visible in Google Colab and you can edit it) ages ago?
You can still modify the code based on this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/outlier_filter.ipynb
But this feat is only related to pre-train.
And I have mentioned that the feat doesn't rely on RMSE.
But this feat is only related to pre-train.
And I have mentioned that the feat doesn't rely on RMSE.
Yes, but testing this in the optimizer is faster than using Excel to find out which delta_t values would be filtered. Also, testing it in the optimizer lets us see the impact on the predicted stability.
```
%pip install git+https://github.com/open-spaced-repetition/fsrs-optimizer@Feat/new-outlier-filter-based-on-count
```
You can install the branch version of the FSRS optimizer in your notebook with the above command. Just replace this line:
So, my results are as follows:
| | IQR | Counts |
| --- | --- | --- |
| S0 | {1: 1.10, 2: 1.34, 3: 9.73, 4: 36.55} | {1: 1.10, 2: 1.48, 3: 14.48, 4: 36.91} |
| Data | | |
I am totally satisfied with the results. If @Expertium is also satisfied with the results, we can merge this.
@L-M-Sherlock, this is unrelated, but too minor to create a new issue for.
Use s0 = 1.5 for Hard. Currently, 0.6 is used, which is too small.
Use s0 = 1.5 for Hard. Currently, 0.6 is used, which is too small.
Do you have any stats for that?
Speaking of which, remember how I said that when I was running the benchmark on 66 collections, I also wrote down S0?
Here are the average values, weighted by ln(reviews):
S0(Again)=0.6
S0(Hard)=1.4
S0(Good)=3.3
S0(Easy)=10.1
I suggest running a statistical significance test to determine whether these values are better than the ones currently used.
I suggest running a statistical significance test to determine whether these values are better than the ones currently used.
In my opinion, we should just replace the S0 for Hard because the currently used value for Hard doesn't make much sense.
Also, the result of such a change would not be statistically significant because it would only affect the values in those collections that have a very low number of reviews with Hard as the first rating. So, we don't need to run a statistical significance test here.
I want Sherlock to replace all 4 values though.
There is a pretty big difference between the currently used values (all four of them) and the ones I obtained from the benchmark. We need to find out which ones provide a better fit to users' repetition histories.
- Again: current 0.4, new 0.6
- Hard: current 0.6, new 1.4
- Good: current 2.4, new 3.3
- Easy: current 5.8, new 10.1
The values obtained from benchmarking are roughly 40-130% greater.
@Expertium, please try out the new outlier filter approach so that we can merge the branch and close this issue.
The way to use it was described by Sherlock here: #16 (comment)
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
Could you check the forgetting curves generated in pre-train?
I get very similar results (both in terms of the values of S0 and in terms of what stability_for_pretrain.tsv looks like), so it's hard to say which one is better.
Book1.xlsx
In my opinion, the new approach is better at identifying outliers because it can even filter those rows that have a low count (and thus an unreliable R) but are located in the middle of the data.
So, even if it doesn't perform better for your collection, it is definitely not worse. So, it makes sense to implement this.
Also, @L-M-Sherlock, in Expertium's pretrain data, I noticed that there were several rows with the same count. So, we should sort by the count and then by delta_t, such that larger delta_t values are filtered before smaller ones if they have the same count.
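The tie-breaking sort suggested here could be expressed in pandas roughly like this (illustrative column names, not the optimizer's actual ones):

```python
import pandas as pd

grouped = pd.DataFrame({
    "delta_t": [1, 5, 20, 3],
    "count":   [30, 2, 2, 50],
})

# Ascending by count; for equal counts, larger delta_t comes first,
# so it is the first candidate for removal.
ordered = grouped.sort_values(by=["count", "delta_t"], ascending=[True, False])
print(ordered["delta_t"].tolist())  # [20, 5, 1, 3]
```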
So, we should sort by the count and then by delta_t, such that larger delta_t values are filtered before smaller ones if they have the same count.
Done in commit: 9dec42b
Then, just merge it.