alaz / legitbot
🤔 Is this Web request from a real search engine🕷 or from an impersonating agent 🕵️♀️?
License: Other
Hello, Alexander
Hope you're doing well!
While reading through the source code, I found logic that detects AppleBot masquerading as GoogleBot (introduced in 2d03063).
However, I couldn't find any information about this behavior, even after looking back at the docs from 2018, when the code was added.
So I'm wondering if this behavior and detection could be removed.
Thanks for this gem. I have seen a lot of error logs like this recently:
NoMethodError: undefined method `empty?' for nil:NilClass
…ms/legitbot-1.0.0/lib/legitbot/validators/ip_ranges.rb: 43:in `valid_ip?'
…ms/legitbot-1.0.0/lib/legitbot/validators/ip_ranges.rb: 20:in `valid_ip?'
I think it is caused by an unusual visitor IP. How can I handle the nil value?
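One way to keep missing or malformed addresses away from the validators is to parse the IP before calling the gem. This is a minimal sketch and not part of legitbot: `parse_ip` is a hypothetical helper name, and it relies only on the standard `ipaddr` library.

```ruby
require 'ipaddr'

# Hypothetical helper (not part of legitbot): returns an IPAddr, or nil when
# the input is missing, blank, or not a parseable address.
def parse_ip(raw)
  text = raw.to_s.strip
  return nil if text.empty?

  IPAddr.new(text)
rescue IPAddr::Error
  # IPAddr raises a subclass of IPAddr::Error for malformed input.
  nil
end
```

A caller could skip the bot check entirely when `parse_ip` returns nil.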
Hey @allaud ,
Petalbot tests are failing: https://github.com/alaz/legitbot/runs/2649297975
1) Failure:
PetalbotTest#test_valid_ip [/home/runner/work/legitbot/legitbot/test/petalbot_test.rb:16]:
{:msg=>"114.119.153.50 is a valid Petalbot IP"}
2) Failure:
PetalbotTest#test_valid_ua [/home/runner/work/legitbot/legitbot/test/petalbot_test.rb:34]:
{:msg=>"Valid Petalbot"}
These pages show 404:
Do you know if this bot still operates?
lib/legitbot/gptbot.rb:1:1: C: Custom/IpRanges: Could not fetch IPs from https://openai.com/gptbot-ranges.txt , HTTP status code 403
It works when I run it locally, hence I suspect they have banned Microsoft or GitHub Actions IP ranges.
I let them know on X (formerly Twitter), but frankly I do not expect much: https://twitter.com/aazarov/status/1790281032513507420
Thanks again for this project :)
I've been getting this sometimes now:
DNS result has no information for crawl-95-216-33-117.googlebot.com
I can rescue nil inside the rack_attack Legitbot.bot call, but would love to solve the actual problem as well.
It's strange that it says "no information" but then clearly has resolved it to crawl-95-216-33-117.googlebot.com
Hmm, maybe reverse-dns is working to get the address, but then it's not able to ping it?
The IP reported in my error logs is in fact 95.216.33.117
/usr/local/rvm/rubies/ruby-2.7.2/lib/ruby/2.7.0/resolv.rb:379:in `getaddress'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:66:in `reverse_ip'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:48:in `valid_domain?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:22:in `valid_domain?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/botmatch.rb:25:in `valid?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/botmatch.rb:29:in `fake?'
/u/apps/ap.next/current/config/initializers/rack_attack.rb:16:in `block in '
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/check.rb:15:in `matched_by?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `block in blocklisted?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `any?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `blocklisted?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack.rb:107:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-8.3.0/lib/new_relic/agent/instrumentation/middleware_tracing.rb:100:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-2.2.3/lib/rack/tempfile_reaper.rb:15:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-8.3.0/lib/new_relic/agent/instrumentation/middleware_tracing.rb:100:in `call'
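The trace shows the exception coming from `getaddress`, that is, from the forward lookup that follows the reverse lookup. A PTR record is set by whoever controls the reverse zone for the IP, so a name under googlebot.com that does not resolve back may simply mean the request is not verified. The sketch below treats a failed lookup that way instead of raising. `reverse_confirmed?` and the `resolver:` parameter are hypothetical names, not legitbot API; only `Resolv.getname`, `Resolv.getaddress` and `Resolv::ResolvError` come from the standard library.

```ruby
require 'resolv'

# Hypothetical helper: forward-confirmed reverse DNS that reports lookup
# failures as "not verified" (false) instead of raising.
# `resolver` only needs to respond to getname and getaddress.
def reverse_confirmed?(ip, resolver: Resolv)
  name = resolver.getname(ip)              # reverse lookup (PTR)
  resolver.getaddress(name).to_s == ip     # forward lookup must match
rescue Resolv::ResolvError
  false
end
```

Whether a failed lookup should count as fake or as unknown is a policy choice; this sketch picks the stricter reading.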
Petalbot resolution is not reliable.
Doc:
If robots instructions don't mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.
E.g.
TelegramBot (like TwitterBot)
General strategy – what to do?
Some crawlers publish lists of IP addresses and they change frequently. Notably, DuckDuckGo: pull requests
I'm starting to see Facebook Bot being labeled as a fake search engine even though the IP address is genuine. I think the issue is in how the SegmentTree is built.
The IP in question is 69.171.251.1
I get the following in the console:
irb> ranges = Legitbot::Facebook.reload!
=> {:ipv4=>SegmentTree(31.13.24.0..204.15.23.255), :ipv6=>SegmentTree(2401:db00::..2a03:2887:ff34:ffff:ffff:ffff:ffff:ffff)}
irb> ranges[:ipv4].find(IPAddr.new('69.171.251.1'))
=> nil
On the other hand:
irb> ip = IPAddr.new('31.13.24.0')
=> #<IPAddr: IPv4:31.13.24.0/255.255.255.255>
irb> ranges[:ipv4].find(ip)
=> #<SegmentTree::Segment:0x00000000085acdc0 @range=#<IPAddr: IPv4:31.13.24.0/255.255.248.0>..#<IPAddr: IPv4:31.13.31.255/255.255.248.0>, @value=true>
The overall IPv4 SegmentTree range looks too broad, yet the lookup still returns nothing for the valid IP address, so a legitimate bot is labeled as fake and blocked.
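To cross-check a tree lookup like the one above, a plain linear scan over CIDR blocks with the standard `IPAddr#include?` can serve as a reference. This is a sketch under assumptions: the constant and method names are hypothetical, and the prefix list is an illustrative sample (in particular, 69.171.224.0/19 is assumed here as the block covering the address in question), not the list the gem actually loads.

```ruby
require 'ipaddr'

# Illustrative sample only, not the list the gem loads.
# 69.171.224.0/19 is an assumption chosen to cover 69.171.251.1.
SAMPLE_RANGES = %w[31.13.24.0/21 69.171.224.0/19 204.15.20.0/22]
                .map { |cidr| IPAddr.new(cidr) }
                .freeze

# Hypothetical reference check: O(n) scan, slow but easy to reason about.
def in_ranges?(ip, ranges = SAMPLE_RANGES)
  addr = IPAddr.new(ip)
  ranges.any? { |range| range.include?(addr) }
end
```

If the scan and the tree disagree for the same input list, the tree construction is the likely culprit.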
I'm seeing what appears to be the DuckDuckGo browser being identified as the DuckDuckGo bot.
Here's the user agent string: Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 DuckDuckGo/7
I also think the rule for DuckDuckGo's bot is wrong: the bot identifies itself as DuckDuckBot, not as DuckDuckGo - https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
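The distinction can be sketched as a user agent check that looks for the crawler token only. The names below are hypothetical and this is not the gem's actual rule.

```ruby
# Hypothetical matcher (not the gem's rule): only the crawler token counts,
# so the "DuckDuckGo/7" browser suffix does not match.
DUCKDUCKBOT_TOKEN = /\bDuckDuckBot\b/i

def duckduckbot_ua?(user_agent)
  DUCKDUCKBOT_TOKEN.match?(user_agent.to_s)
end
```

A user agent match is only a claim, so the IP still needs to be verified separately.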
Thank you for this gem. I found the error below in my logs; it happens when the client's user agent cannot be read. Is there any way to avoid such errors?
NoMethodError: undefined method `index' for nil:NilClass
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb: 23:in `block(2 levels) in bot'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb: 23:in `any?'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb: 23:in `block in bot'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb: 23:in `select'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb: 23:in `bot'
…200810084721/app/controllers/application_controller.rb: 164:in `check_robot'
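Until the gem tolerates a missing user agent, the call site can guard against it. A minimal sketch: `with_request_identity` is a hypothetical helper that yields only when both values are present.

```ruby
# Hypothetical call-site guard: yields the cleaned user agent and IP only
# when both are present, and returns nil otherwise.
def with_request_identity(user_agent, ip)
  ua = user_agent.to_s.strip
  addr = ip.to_s.strip
  return nil if ua.empty? || addr.empty?

  yield ua, addr
end

# Usage, assuming the two-argument form of Legitbot.bot:
#   bot = with_request_identity(request.user_agent, request.ip) do |ua, ip|
#     Legitbot.bot(ua, ip)
#   end
```

A request without a user agent is then treated as "not a known bot" rather than raising.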
We have been seeing errors for Facebook crawlers in the last couple of days. Walking through the code it seems to fail when source is not provided to the Irrc client:
client.query :radb, 'AS32934'
result = client.perform
Connecting to whois.radb.net
Processing AS32934
Executing "!s-*"
Got "F One or more selected sources are unavailable.
"
'!s-*' failed on 'whois.radb.net' (F One or more selected sources are unavailable.). when processing AS32934 for AS32934
No more queries
Closing a connection to whois.radb.net
Queue 0 guard objects
=> {}
Once a source is provided it seems to behave more as expected:
client.query :radb, 'AS32934', source: :radb
result = client.perform
Connecting to whois.radb.net
Processing AS32934
Executing "!sradb"
Got "C
"
Queue new 0 queries
No more queries
Closing a connection to whois.radb.net
Queue 0 guard objects
=> {"AS32934"=>
{:ipv4=>
{"AS32934"=>
["31.13.24.0/21",
"31.13.64.0/18",
"31.13.64.0/19",
Could something have changed with the service?
Hey, thanks for your work on this gem. I've noticed something while running tests, and I think it may require a change to the internals of Legitbot.
As it stands, legitbot makes actual web requests as soon as it's required, even before any call to bot.valid? is made, because the Facebook bot matcher loads ValidIPs in the class declaration. Is there any way around this?
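One common way around load-time side effects is to defer the fetch until the first lookup and memoize the result. The sketch below shows the pattern in isolation; `LazyRanges` is a hypothetical class, not how legitbot is implemented.

```ruby
# Hypothetical lazy holder: the loader block runs on first access only,
# so requiring the file performs no I/O.
class LazyRanges
  def initialize(&loader)
    @loader = loader
    @mutex = Mutex.new
    @loaded = false
  end

  def value
    @mutex.synchronize do
      unless @loaded
        @value = @loader.call
        @loaded = true
      end
      @value
    end
  end
end
```

In tests, the loader block can be swapped for a stub so that no network access happens at all.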
Dear users of legitbot,
What do you think about the public Legitbot API?
Do you want to improve it in any way before releasing version 1.0?
Is Legitbot ready for 1.0 release?
Regards,
Alexander.
I'm getting a lot of hits like this that are being blocked by my rack-attack setup as you suggest:
E, [2022-02-03T06:53:01.889058 #1133986] ERROR -- : blocklist 47.155.9.106 GET / "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0"
I'm not sure how to get the proper list of IPs it's using. Here are the ones I've seen:
172.91.121.17
70.181.168.184
47.155.9.106
216.150.126.58
108.7.233.172
108.54.49.32
71.227.168.241
73.228.203.166
162.235.153.62
107.184.85.25
174.208.224.248
.. and probably more, it's a lot of IPs
Here's an article about it
https://medium.com/@siggi/apples-imessage-impersonates-twitter-facebook-bots-when-scraping-cef85b2cbb7d
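One signal visible in the logged user agent is that it claims crawlers from two vendors at once. The sketch below flags such strings. The token table and method names are hypothetical, and this is a heuristic rather than a way to obtain the IP list.

```ruby
# Hypothetical token table: which vendor each crawler token claims to be.
BOT_TOKENS = {
  facebook: /facebookexternalhit|Facebot/i,
  twitter: /Twitterbot/i
}.freeze

# Returns the vendors whose crawler tokens appear in the user agent.
def claimed_vendors(user_agent)
  BOT_TOKENS.select { |_vendor, pattern| pattern.match?(user_agent.to_s) }.keys
end

# Heuristic: a single request claiming several vendors is suspicious.
def multi_vendor_claim?(user_agent)
  claimed_vendors(user_agent).size > 1
end
```

Such requests could be exempted from the fake-bot blocklist or handled by a separate rule, depending on how link previews should be treated.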
I'm seeing the following behavior: 157.55.39.132 is being identified as a fake bingbot, but it is in fact legitimate, as verified by both the Bing verification tool and the hostname containing "search.msn.net."
I've identified the issue to be that the IP address has two reverse pointers:
Non-authoritative answer:
132.39.55.157.in-addr.arpa name = po18-218.co2-6nf-srch-2b.ntwk.msn.net.
132.39.55.157.in-addr.arpa name = msnbot-157-55-39-132.search.msn.com.
The issue stems from the use of getname instead of getnames at legitbot/lib/legitbot/botmatch.rb (line 19 in 6c23f6b).
Changing this will require substantial changes, as all dependent code will have to work with an array of strings instead of a single string.
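The array-based check itself can stay small. A minimal sketch, assuming the caller already has every PTR name for the address (for example from the standard `Resolv.getnames`); `any_name_in_domain?` is a hypothetical helper, not legitbot code.

```ruby
# Hypothetical helper: true when at least one PTR name falls under the
# given domain. Trailing dots and letter case are ignored.
def any_name_in_domain?(names, domain)
  suffix = ".#{domain.downcase.chomp('.')}"
  names.any? { |name| name.to_s.downcase.chomp('.').end_with?(suffix) }
end

# Usage with the standard resolver (performs a real DNS query):
#   require 'resolv'
#   any_name_in_domain?(Resolv.getnames('157.55.39.132'), 'search.msn.com')
```

Each matching name would still need the forward lookup to confirm that it resolves back to the same IP.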