Comments (58)
from fsrs-optimizer.
I find that the ln(delta_t) filter tends to give a loose threshold. For example, if Q1 = 2 and Q3 = 10, the original threshold is Q3 + 1.5 * IQR = 10 + 1.5 * 8 = 22. When we apply ln(), Q1 = 0.69 and Q3 = 2.30, so the new threshold is 2.30 + 1.5 * 1.61 = 4.72, and e^4.72 ≈ 112 days, which is too large.
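To make the arithmetic above easy to reproduce, here is a small sketch (the quartile values are the hypothetical ones from this comment, not from any real collection):

```python
import math

q1, q3 = 2.0, 10.0  # hypothetical quartiles of delta_t

# Tukey fence in raw space: Q3 + 1.5 * IQR
raw_threshold = q3 + 1.5 * (q3 - q1)  # 22.0

# Same fence computed on ln(delta_t), then mapped back with exp()
log_q1, log_q3 = math.log(q1), math.log(q3)
log_threshold = math.exp(log_q3 + 1.5 * (log_q3 - log_q1))  # roughly 112 days

print(raw_threshold, log_threshold)
```

Because the fence is applied to the logs and then exponentiated, the cutoff grows multiplicatively rather than additively, which is why it ends up so much looser.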
Well, @L-M-Sherlock, this version is far behind the release version (especially in pre-train), so it would be very difficult to accurately judge the impact of the current change.
So, I would suggest pushing the change to a separate branch and then telling us how to import that change into optimizer.ipynb (like you told us in #5 (comment)).
Also, use `has_been_removed + count >= total * 0.05` instead of `has_been_removed >= total * 0.05 or count >= total * 0.05`.
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
You need to click "Disconnect and delete runtime" after you change the optimizer version and then upload the collection file again (but the timezone settings and everything else would be preserved).
Although we decided that we won't use RMSE to determine the effectiveness of this change, I wanted to highlight that this change decreased the RMSE even though the number of reviews increased, at least for Sherlock's collection.
Thanks for your reminder. I am also considering this problem. Do you have any suggestion?
I have no idea how to solve this problem other than removing the filter, which would, unfortunately, make the results worse for some users.
Perhaps, @Expertium can suggest a better way to filter the outliers in initial stability based on his knowledge of statistics.
Edit: If we don't get any other idea, I have a workaround in mind: don't filter out the data for which delta_t < 20 (even if 20 > Q3 + 1.5 * IQR).
Well, according to Sherlock, LOF doesn't filter out enough outliers, so I don't know. I was really hoping that LOF could be fine-tuned to suit our needs, though I haven't really tried to do it myself.
One suggestion I have is this: calculate quantiles and IQR not using delta_t, but using ln(delta_t). In other words, look for outliers after transforming data. By the way, I recommended doing this with LOF as well, but I don't think Sherlock tried it.
One suggestion I have is this: calculate quantiles and IQR not using delta_t, but using ln(delta_t). In other words, look for outliers after transforming data.
This solution would work for me, at least.
In my collection, the threshold for filtering would increase from 3.5 days to 5.7 days for Good as the first rating.
For this collection (for which you developed this filtering mechanism), the threshold for filtering would increase from 7 days to 11.3 days (which is not too large, imo) for Again as the first rating.
@L-M-Sherlock I recommend implementing what I suggested above, but first you need to set up automated testing on all collections.
@L-M-Sherlock, what do you think about the idea of using ln(delta_t) to calculate quartiles and IQR?
I remember that it "increased" the RMSE, but that was an artifact in my opinion.
Probably, we should implement this without worrying about the RMSE. We have indeed made several changes without even testing their effect on the RMSE.
Otherwise, we should find a better way to verify whether using ln(delta_t) to calculate quartiles and IQR is better than using plain delta_t.
Let's begin with concrete cases. For example, you can provide your S0 dataset and select the outliers (even in a subjective way). Then we can find some statistical methods to filter out those outliers.
Here is my S0 dataset. In my opinion, the ones having a pink background can be considered outliers. But probably, even some more rows could be considered outliers.
The tsv file is here: stability_for_pretrain.tsv.zip (remove .zip at the end)
Here is my S0 dataset. In my opinion, the ones having a pink background can be considered outliers. But probably, even some more rows could be considered outliers.
I think we can filter these data by setting a threshold on the count. Median seems to be a good candidate here.
Sorry, but I couldn't understand anything in this comment. What is the graph showing and what do the axes represent? Also, what are the three values in the second image?
I think Sherlock's idea is to filter outliers based on the number of reviews with a certain interval length. For example, if there have been 20 reviews with delta_t = 2 and 1 review with delta_t = 20, then the latter would be considered an outlier, but not because of the length of the interval but rather because there has only been one such review. That's the gist of it, unless I also misunderstood Sherlock.
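A toy illustration of that reading of the idea (made-up numbers, not the optimizer's real data or code):

```python
import pandas as pd

# 20 reviews at delta_t = 2 and a single review at delta_t = 20 (made-up data)
reviews = pd.DataFrame({"delta_t": [2] * 20 + [20]})
counts = reviews.groupby("delta_t").size()

# Flag intervals whose review count falls below the median count
median_count = counts.median()
outlier_intervals = counts[counts < median_count].index.tolist()
print(outlier_intervals)  # [20]: flagged for its low count, not its length
```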
So if we filter out all data where count < median(count) on user1823's data, here's what it will look like:
I mean the method based on the quantile of count. It has a problem: it will always filter out x% of the delta_t values. So we should consider other methods.
How about trying LOF again? I know I've been a little annoying about it, but this seems like a good use case for it. We have 3 features: delta_t, y (mean) and count. Pass all three of them into LOF. Perhaps use ln(delta_t) and ln(count), since both of them can differ by 10-100 times.
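For anyone who wants to prototype this, a sketch with scikit-learn's `LocalOutlierFactor` could look like the following. The data here is synthetic forgetting-curve-like data, and the feature choice simply mirrors the suggestion above; this is not tested code from the optimizer:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Synthetic pretrain-like rows: interval length, mean recall, review count
delta_t = np.arange(1, 31)
y_mean = np.clip(0.97 ** delta_t + rng.normal(0, 0.02, delta_t.size), 0.0, 1.0)
count = rng.integers(5, 500, delta_t.size)

# Log-transform the heavy-tailed features so distances are comparable
X = np.column_stack([np.log(delta_t), np.log(count), y_mean])

# contamination fixes the expected share of outliers up front
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier
print(delta_t[labels == -1])  # intervals flagged as outliers
```

Setting `contamination` explicitly (rather than the default `"auto"`) is what the later comments in this thread experiment with.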
Though it's still unclear how to test it without relying on RMSE.
We can try using LOF. But, in the absence of an objective measure to test it, we will have to look at which delta_t values were filtered out by this approach and then subjectively decide whether it is better or not.
Otherwise, we should just trust Expertium and use ln(delta_t) to calculate quartiles and IQR. This way, we would be using a widely used formula (Q3 + 1.5 × IQR) to find the outliers instead of LOF (which can act in unexpected ways for some collections). Though this approach "increased" the RMSE in our testing, it was probably just an artifact (I know that I have said this too many times).
I think Sherlock's idea (as I explained it here) could also work. It will filter out half of all intervals, not half of all reviews. In fact, in that image it only filtered out 27 reviews (for "Good") out of 5476. If LOF fails for whatever reason, I suggest this.
@L-M-Sherlock I recommend trying Local Outlier Factor with the following three features: ln(delta_t), ln(count) and y_mean. So it will work with three-dimensional data. Please, do the usual test of statistical significance.
I think Sherlock's idea (as I explained it here) could also work. It will filter out half of all intervals ...
But this method will work only for this particular data. For example, apply the same method to Easy in my data and you will see what I mean.
Edit:
The phrase "filter out half of all intervals" made me think that your approach is to filter out the larger delta_t values. But it is not.
Still, it has a problem: it will always filter out some part of the data, even if ideally none of it should be considered an outlier.
The IQR method has the advantage that it doesn't filter out any data if the data is more homogeneous.
The reason why I'm recommending LOF over the more simple approaches is that LOF can work with multidimensional data, in fact, it might work better with multidimensional data.
But you should understand the mechanism of LOF. It filters out outliers based on density. In our case, the outliers could also be concentrated in a range of intervals with high density. I will test it tomorrow.
The reason why I'm recommending LOF over the more simple approaches is that LOF can work with multidimensional data, in fact, it might work better with multidimensional data.
@Expertium, I have given it a try. LOF predicts that all the data are inliers. (I used @user1823's data.)
I have another method. We can sort the delta_t by count in ascending order. Then we start removing delta_t from the first row one by one, accumulating the count we have removed. Finally, we can determine an upper limit (a percentage, e.g. 5%) for removing outliers and stop at that point.
The LOF predicts that all data are inliers.
I assume you let it choose the contamination %. How about one last try: manually set the contamination % to, say, 5% or 10%?
If that doesn't work either, then let's use your newly proposed method.
I have another method. We can sort the delta_t by count in ascending order.
You can try this method. But the problem is that it will always filter out some data.
However, some values at the end of the dataset might represent true values from natural variation. Filtering out such data might lead to underfitting.
If Expertium's idea of manually setting contamination % doesn't work, I think that we should simply use ln(delta_t) with IQR.
I assume you let it choose the % of contamination. How about one last try - manually set contamination %, say, 5% or 10%.
Even weirder things happen. I think these rows shouldn't be removed.
Alright, forget about it then.
If Expertium's idea of manually setting contamination % doesn't work, I think that we should simply use ln(delta_t) with IQR.
OK. But we should pass this test. As I mentioned before, here is an extreme case for outliers: open-spaced-repetition/fsrs4anki#282 (comment)
Here is his dataset for pretrain (with outliers): stability_for_pretrain.csv
- Without the current outlier detector:
- With the current outlier detector:
- With the current outlier detector + ln(delta_t):
Version 3 doesn't remove enough outliers.
So, just verify whether using ln(delta_t) filters out the outliers well in this collection or not.
So, let's try your method:
We can sort the delta_t by count in ascending order. Then we start removing delta_t from the first row one by one, accumulating the count we have removed. Finally, we can determine an upper limit (a percentage, e.g. 5%) for removing outliers and stop at that point.
To reduce the risk of filtering out inliers, we can add the condition that if the first row (in the sorted data) contains more than 5% of the data, we won't remove any data.
Edit:
Probably, the condition can be further improved: we stop filtering when we encounter a row whose inclusion would cause the optimizer to filter out more than 5% of the data.
For example, if the count % values arranged in ascending order are as follows:
- 1%
- 2%
- 6%
then we filter out just 3% (1% + 2%) of the data.
If the count % values arranged in ascending order are as follows:
- 1%
- 2%
- 2%
- 3%
then we filter out 5% (1% + 2% + 2%) of the data.
If the count % values arranged in ascending order are as follows:
- 2%
- 4%
- 4%
then we filter out just 2% of the data.
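The stopping rule in these examples can be sketched as a small helper (hypothetical function name; it works on integer review counts rather than percentages to avoid floating-point surprises):

```python
def reviews_to_remove(counts, total, limit=0.05):
    """Walk counts sorted ascending; stop as soon as removing the next
    row would push the removed share above `limit` of `total`."""
    removed = 0
    for c in counts:
        if removed + c > total * limit:
            break
        removed += c
    return removed

# The three examples above, expressed as counts out of 100 reviews:
print(reviews_to_remove([1, 2, 6], 100))     # 3
print(reviews_to_remove([1, 2, 2, 3], 100))  # 5
print(reviews_to_remove([2, 4, 4], 100))     # 2
```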
It removed 140 outliers (4.47%).
Here is the code:

```python
def remove_outliers(group: pd.DataFrame) -> pd.DataFrame:
    grouped_group = group.groupby(by=['r_history', 'delta_t'], group_keys=False).agg({'y': ['mean', 'count']}).reset_index()
    sort_index = grouped_group.sort_values(by=[('y', 'count')], ascending=True).index
    total = sum(grouped_group[('y', 'count')])
    has_been_removed = 0
    for i in sort_index:
        count = grouped_group.loc[i, ('y', 'count')]
        if has_been_removed >= total * 0.05 or count >= total * 0.05:
            break
        has_been_removed += count
    # keep only the delta_t values whose count reached the stopping row's count
    group = group[group['delta_t'].isin(grouped_group[grouped_group[('y', 'count')] >= count]['delta_t'])]
    return group
```
In my collection, this method would remove
- 9 outliers (3.57%) for Again
- no outliers (0%) for Hard
- 11 outliers (1.92%) for Good
- 3 outliers (2.3%) for Easy
I am quite satisfied with these results.
@Expertium could you test it in your collection?
`if has_been_removed >= total * 0.05 or count >= total * 0.05:`
This should be
`if has_been_removed >= total * 0.05 or has_been_removed + count >= total * 0.05:`
`has_been_removed + count >= total * 0.05`
This condition is enough.
`has_been_removed + count >= total * 0.05`
This condition is enough.
Oh, you are right. So, update the condition in your previous comment so that Expertium can try it out with the correct code.
Where do I put this code if Sherlock stopped releasing "open" versions of the optimizer (where all the code is visible in Google Colab and you can edit it) ages ago?
You can still modify the code based on this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/outlier_filter.ipynb
But this feat is only related to pre-train.
And I have mentioned that the feat doesn't rely on RMSE.
But this feat is only related to pre-train.
And I have mentioned that the feat doesn't rely on RMSE.
Yes, but testing this in the optimizer is faster than using Excel to find out which delta_t values would be filtered. Also, testing it in the optimizer lets us see the impact on the predicted stability.
```
%pip install git+https://github.com/open-spaced-repetition/fsrs-optimizer@Feat/new-outlier-filter-based-on-count
```
You can install the branch version of the FSRS optimizer in your notebook with the above command. Just replace this line:
So, my results are as follows:
| | IQR | Counts |
| --- | --- | --- |
| S0 | {1: 1.10, 2: 1.34, 3: 9.73, 4: 36.55} | {1: 1.10, 2: 1.48, 3: 14.48, 4: 36.91} |
| Data | | |
I am totally satisfied with the results. If @Expertium is also satisfied with the results, we can merge this.
@L-M-Sherlock, this is unrelated, but too minor to create a new issue for.
Use s0 = 1.5 for Hard. Currently, 0.6 is used, which is too small.
Use s0 = 1.5 for Hard. Currently, 0.6 is used, which is too small.
Do you have any stats for that?
Speaking of which, remember how I said that when I was running the benchmark on 66 collections, I also wrote down S0?
Here are the average values, weighted by ln(reviews):
S0(Again)=0.6
S0(Hard)=1.4
S0(Good)=3.3
S0(Easy)=10.1
I suggest running a statistical significance test to determine whether these values are better than the ones currently used.
I suggest running a statistical significance test to determine whether these values are better than the ones currently used.
In my opinion, we should just replace the S0 for Hard because the currently used value for Hard doesn't make much sense.
Also, the result of such a change would not be statistically significant because it would only affect the values in those collections that have a very low number of reviews with Hard as the first rating. So, we don't need to run a statistical significance test here.
I want Sherlock to replace all 4 values though.
There is a pretty big difference between the currently used values (all four of them) and the ones I obtained from the benchmark. We need to find out which ones provide a better fit to users' repetition histories.
- Again: current 0.4, new 0.6
- Hard: current 0.6, new 1.4
- Good: current 2.4, new 3.3
- Easy: current 5.8, new 10.1
The values obtained from benchmarking are roughly 40-130% greater.
@Expertium, please try out the new outlier filter approach so that we can merge the branch and close this issue.
The way to use it was described by Sherlock here: #16 (comment)
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
Maybe I'm doing something wrong, but I get exactly the same values of initial stability with both versions.
Could you check the forgetting curves generated in pre-train?
I get very similar results (both in terms of the values of S0 and in terms of what stability_for_pretrain.tsv looks like), so it's hard to say which one is better.
Book1.xlsx
In my opinion, the new approach is better at identifying outliers because it can even filter those rows that have a low count (and thus an unreliable R) but are located in the middle of the data.
So, even if it doesn't perform better for your collection, it is definitely not worse. So, it makes sense to implement this.
Also, @L-M-Sherlock, in Expertium's pretrain data, I noticed that there were several rows with the same count. So, we should sort by the count and then by delta_t, such that larger delta_t values are filtered before smaller ones if they have the same count.
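The tie-breaking sort suggested here could be expressed in pandas roughly like this (illustrative column names, not the optimizer's actual ones):

```python
import pandas as pd

grouped = pd.DataFrame({
    "delta_t": [1, 5, 20, 3],
    "count":   [30, 2, 2, 50],
})

# Ascending by count; for equal counts, larger delta_t comes first,
# so it is the first candidate for removal.
ordered = grouped.sort_values(by=["count", "delta_t"], ascending=[True, False])
print(ordered["delta_t"].tolist())  # [20, 5, 1, 3]
```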
So, we should sort by the count and then by delta_t, such that larger delta_t values are filtered before smaller ones if they have the same count.
Done in commit: 9dec42b
Then, just merge it.