I'm starting to see Facebook Bot being labeled as a fake search engine, when in realit

Two solutions come to my mind: Evaluate ranges and discard the

Replying here to <a href="https://github.com/take-five/segment_tree/issues/3#issuecomm

FacebookBot mislabeled as fake about legitbot HOT 7 CLOSED

alaz commented on June 23, 2024 1

FacebookBot mislabeled as fake

from legitbot.

Comments (7)

alaz commented on June 23, 2024 1

Published as 0.2.7 🚀

from legitbot.

kirichkov commented on June 23, 2024

Two solutions come to my mind:

Evaluate ranges and discard the smaller overlapping ranges, pass the optimized array to SegmentTree
Drop SegmentTree completely and iterate for each IP range, until match is found, or all entries are not a match.

I don't like either of these approaches, as they add overhead.

from legitbot.

alaz commented on June 23, 2024

Replying here to your comment and that above.

With the current implementation of SegmentTree using a sorted array and binary search, I don't see how this can be solved in here. Perhaps the list of IPs can be broken up beforehand into the least-common-denominator netmask and passed to SegmentTree as a list of same-sized netmask ranges.

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

from legitbot.

kirichkov commented on June 23, 2024

There's also third way that might solve this:

Sort the array of ranges based on their end, as SegmentTree sorts it based on the start. Pass the sorted array, and bypass SegmentTree's sort:

SegmentTree.new(@sorted_ranges, true)

from legitbot.

kirichkov commented on June 23, 2024

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Agreed! As I said I don't like either.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

That was also something I had in mind, but I was considering those partial matches as well, and accounting for those was why I discarded the idea. I also think there was some other bot that will need implementation for this workaround.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

from legitbot.

alaz commented on June 23, 2024

I also think there was some other bot that will need implementation for this workaround.

IMHO it makes no sense to bother about them beforehand. That would be premature optimization. FB is rather unique in this regard so far.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

This looks like the easiest approach, but it needs to be thought of thoroughly algorithm-wise... it takes time. This would also need careful testing on Legitbot's side. It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

from legitbot.

kirichkov commented on June 23, 2024

It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

I don't quite get what you mean by this? It's the same approach as used now, only the input array is pre-sorted, and SegmentTree is instructed not to sort it beforehand.

Regarding the other points - I agree. What I'm thinking is to cobble some tests and rewrite the facebook part and run it against the tests, and then against real-world ip data and see whether it works.

As a matter of fact this bug surfaced by pure coincidence. I think at one point it will resolve itself, when Facebook change their IP ranges, but who knows when that will be.

P.S.: This suggestion might be a possible solution too.

from legitbot.

FacebookBot mislabeled as fake about legitbot HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent