legitbot's Introduction

Ruby gem to make sure that an IP really belongs to a bot, typically a search engine.

Usage

Suppose you have a Web request and you would like to check that it is not disguised:

bot = Legitbot.bot(userAgent, ip)

bot will be nil if no bot signature was found in the User-Agent. Otherwise, it will be an object with the following methods:

bot.detected_as # => :google
bot.valid? # => true
bot.fake? # => false

Sometimes you already know which search engine to expect. For example, you might be using rack-attack:

Rack::Attack.blocklist("fake Googlebot") do |req|
  req.user_agent =~ %r(Googlebot) && Legitbot::Google.fake?(req.ip)
end

Or if you do not like all those ghoulish crawlers stealing your content, evaluating it and getting ready to invade your site with spammers, then block them all:

Rack::Attack.blocklist 'fake search engines' do |request|
  Legitbot.bot(request.user_agent, request.ip)&.fake?
end

Versioning

Semantic versioning with the following clarifications:

  • MINOR version is incremented when support for new bots is added.
  • PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).

Supported

License

Apache 2.0

Other projects

  • Play Framework variant in Scala: play-legitbot
  • Article When (Fake) Googlebots Attack Your Rails App
  • Voight-Kampff is a Ruby gem that detects bots by User-Agent
  • crawler_detect is a Ruby gem and Rack middleware to detect crawlers by a few different request headers, including User-Agent
  • Project Honeypot's http:BL can not only classify an IP as a search engine, but also label it as suspicious and report the number of days since the last activity. My implementation of the protocol in Scala is here.
  • CIDRAM is a PHP routing manager with built-in support to validate bots.

legitbot's People

Contributors

ajoneil, ajwgibson, alaz, allaud, dlackty, github-actions[bot], kirichkov

legitbot's Issues

NoMethodError: undefined method `index' for nil:NilClass

Thank you for this gem. I got the error below in my logs when the client's User-Agent could not be obtained. Is there any way to avoid such errors?

NoMethodError: undefined method `index' for nil:NilClass


…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb:   23:in `block(2 levels) in bot'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb:   23:in `any?'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb:   23:in `block in bot'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb:   23:in `select'
…uby/2.6.0/gems/legitbot-1.0.0/lib/legitbot/legitbot.rb:   23:in `bot'
…200810084721/app/controllers/application_controller.rb:  164:in `check_robot'
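
One way to avoid this at the call site is to skip detection when the request carries no User-Agent. The following is a minimal sketch, assuming the failure comes from passing nil as the user agent. `safe_bot` and its `detector` parameter are hypothetical names, not part of Legitbot; in an application the detector would typically be `Legitbot.method(:bot)`.

```ruby
# Hypothetical helper: skip bot detection when the User-Agent is missing.
# +detector+ is any callable taking (user_agent, ip); in an application this
# would typically be Legitbot.method(:bot).
def safe_bot(user_agent, ip, detector)
  # A request without a User-Agent cannot carry a bot signature.
  return nil if user_agent.nil? || user_agent.strip.empty?

  detector.call(user_agent, ip)
end
```

Returning nil here mirrors what the README describes for a request with no bot signature, so calling code that already handles nil needs no changes.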

Facebook bot makes request as soon as gem is required

Hey, thanks for your work on this gem. I've noticed something while running tests and I think it may require a change to the internals of Legitbot.

As it stands, legitbot makes actual web requests as soon as it's required, even before any call to bot.valid? is made, because the Facebook bot matcher loads ValidIPs in the class declaration. Is there any way around this?
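
A common way around load-time side effects is to defer the fetch until the ranges are first needed. The sketch below illustrates that idea only; `LazyRanges` is a hypothetical class, not Legitbot's implementation, and the loader block stands in for whatever performs the real network request.

```ruby
# Hypothetical holder that runs its loader on first access, not at load time.
class LazyRanges
  def initialize(&loader)
    @loader = loader
    @mutex = Mutex.new
    @ranges = nil
  end

  # True once the loader has produced a value.
  def loaded?
    !@ranges.nil?
  end

  # Runs the loader on first call and memoizes the result.
  def ranges
    @mutex.synchronize { @ranges ||= @loader.call }
  end

  # Forces a fresh load, replacing the memoized value.
  def reload!
    @mutex.synchronize { @ranges = @loader.call }
  end
end

# The block is only stored here; nothing is fetched until #ranges is called.
facebook_ranges = LazyRanges.new { ['192.0.2.0/24'] } # placeholder data
puts facebook_ranges.loaded?
```

With this shape, requiring the file in a test suite performs no network I/O unless a test actually validates an IP.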

iMessageBot

I'm getting a lot of hits like this that are being blocked by my rack-attack setup as you suggest:

E, [2022-02-03T06:53:01.889058 #1133986] ERROR -- : blocklist 47.155.9.106 GET / "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0"

Not sure how to get the proper list of IPs it's using? Here's ones I've seen:
172.91.121.17
70.181.168.184
47.155.9.106
216.150.126.58
108.7.233.172
108.54.49.32
71.227.168.241
73.228.203.166
162.235.153.62
107.184.85.25
174.208.224.248

.. and probably more, it's a lot of IPs

Here's an article about it
https://medium.com/@siggi/apples-imessage-impersonates-twitter-facebook-bots-when-scraping-cef85b2cbb7d

Is API ready for 1.0 ?

Dear users of legitbot,

What do you think about the public Legitbot API?
Do you want to improve it in any way before releasing version 1.0?
Is Legitbot ready for 1.0 release?

Regards,
Alexander.

Valid bingbot detected as fake due to multiple DNS names

I'm seeing the following behavior: 157.55.39.132 is being identified as a fake bingbot, but it is legitimate. This is verified by both the Bing verification tool and the hostname, which contains "search.msn.net."

I've identified the issue to be that the IP address has two reverse pointers:

Non-authoritative answer:
132.39.55.157.in-addr.arpa	name = po18-218.co2-6nf-srch-2b.ntwk.msn.net.
132.39.55.157.in-addr.arpa	name = msnbot-157-55-39-132.search.msn.com.

The issue stems from the usage of getname instead of getnames at:

@reverse_domain ||= @dns.getname(@ip)

Changing this will require substantial changes, as all dependent code will have to start working with an array of strings as opposed to a single string.
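
To illustrate what working with every PTR record could look like, here is a minimal sketch of forward-confirmed reverse DNS built on `Resolv::DNS#getnames` and `Resolv::DNS#getaddresses` from the Ruby standard library. `forward_confirmed?` and its parameters are hypothetical names, and the code is not taken from Legitbot.

```ruby
require 'resolv'

# Hypothetical check: true when at least one PTR name of +ip+ belongs to an
# allowed domain and that name resolves back to +ip+.
# +dns+ is anything responding to #getnames and #getaddresses (Resolv::DNS does).
def forward_confirmed?(ip, allowed_domains, dns: Resolv::DNS.new)
  dns.getnames(ip).map(&:to_s).any? do |name|
    host = name.downcase.chomp('.')
    # Skip PTR names outside the expected domains instead of failing on them.
    next false unless allowed_domains.any? { |d| host == d || host.end_with?(".#{d}") }

    # Forward lookup must include the original address.
    dns.getaddresses(host).map(&:to_s).include?(ip)
  end
end
```

For the address above, a call such as `forward_confirmed?('157.55.39.132', ['search.msn.com'])` would consider both PTR names rather than only the first one returned.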

Petalbot tests are failing

Hey @allaud ,

Petalbot tests are failing: https://github.com/alaz/legitbot/runs/2649297975

  1) Failure:
PetalbotTest#test_valid_ip [/home/runner/work/legitbot/legitbot/test/petalbot_test.rb:16]:
{:msg=>"114.119.153.50 is a valid Petalbot IP"}

  2) Failure:
PetalbotTest#test_valid_ua [/home/runner/work/legitbot/legitbot/test/petalbot_test.rb:34]:
{:msg=>"Valid Petalbot"}

These pages show 404:

Do you know if this bot still operates?

Resolv issues with googlebot sometimes

Thanks again for this project :)

I've been getting this sometimes now:
DNS result has no information for crawl-95-216-33-117.googlebot.com

I can rescue nil inside the rack_attack Legitbot.bot call, but would love to solve the actual problem as well.

It's strange that it says "no information" but then clearly has resolved it to crawl-95-216-33-117.googlebot.com

Hmm, maybe reverse-dns is working to get the address, but then it's not able to ping it?

The IP reported in my error logs is in fact 95.216.33.117


/usr/local/rvm/rubies/ruby-2.7.2/lib/ruby/2.7.0/resolv.rb:379:in `getaddress'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:66:in `reverse_ip'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:48:in `valid_domain?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/validators/domains.rb:22:in `valid_domain?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/botmatch.rb:25:in `valid?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/legitbot-1.5.1/lib/legitbot/botmatch.rb:29:in `fake?'
/u/apps/ap.next/current/config/initializers/rack_attack.rb:16:in `block in '
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/check.rb:15:in `matched_by?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `block in blocklisted?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `any?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack/configuration.rb:72:in `blocklisted?'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-attack-6.5.0/lib/rack/attack.rb:107:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-8.3.0/lib/new_relic/agent/instrumentation/middleware_tracing.rb:100:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/rack-2.2.3/lib/rack/tempfile_reaper.rb:15:in `call'
/u/apps/ap.next/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-8.3.0/lib/new_relic/agent/instrumentation/middleware_tracing.rb:100:in `call'
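
Until the underlying lookup is handled, the exception can be contained at the call site. The sketch below assumes the error raised is `Resolv::ResolvError`, which is what `Resolv::DNS#getaddress` raises when a name has no address records. `with_dns_fallback` is a hypothetical helper, not part of Legitbot or rack-attack.

```ruby
require 'resolv'

# Hypothetical wrapper: return +on_failure+ when a DNS lookup inside the block
# fails, instead of letting the exception escape into the middleware stack.
def with_dns_fallback(on_failure:)
  yield
rescue Resolv::ResolvError, Resolv::ResolvTimeout
  on_failure
end

# Possible use inside the blocklist from the README (not executed here):
#   Rack::Attack.blocklist 'fake search engines' do |request|
#     with_dns_fallback(on_failure: true) do
#       Legitbot.bot(request.user_agent, request.ip)&.fake?
#     end
#   end
```

Which value to pass as `on_failure` is a policy choice. A PTR name that does not resolve forward cannot be confirmed as genuine, so treating the request as fake is defensible, while `false` avoids blocking during resolver outages.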

Fetch Googlebot IP ranges from their published JSON resource

Google publishes the current IP ranges for Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic

Of course Legitbot could fetch them with fetch:url, similarly to how it works for Ahrefs:

# @fetch:url https://api.ahrefs.com/v3/public/crawler-ip-ranges?output=json
# @fetch:jsonpath $.prefixes[*].ipv4Prefix

But we don't know the cadence of changes to this list and fetch:url updates the Legitbot sources. Even with the automatic detection in place, the change would have to wait until the next release.

To fetch Googlebot IP ranges dynamically from their published JSON, an ip_ranges block can be used, similarly to how it works for Facebook:

ip_ranges do
  client = Irrc::Client.new
  client.query :radb, AS, source: :radb
  results = client.perform
  %i[ipv4 ipv6].map do |family|
    results[AS][family][AS]
  end.flatten
end

We probably need fetch:url factored out from Rubocop cop sources though, so it can be easily accessible.
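
As a rough illustration of the parsing side, the sketch below turns a JSON document into per-family ranges using only the standard library. It assumes a top-level `prefixes` array whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key, the shape suggested by the jsonpath above. `parse_prefixes` and `listed?` are hypothetical names, the sample uses documentation-only prefixes rather than real Googlebot ranges, and fetching the URL is left out.

```ruby
require 'json'
require 'ipaddr'

# Parse {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]} into
# IPAddr ranges grouped by address family. The JSON shape is an assumption.
def parse_prefixes(json_text)
  prefixes = JSON.parse(json_text).fetch('prefixes', [])
  {
    ipv4: prefixes.map { |entry| entry['ipv4Prefix'] }.compact.map { |cidr| IPAddr.new(cidr) },
    ipv6: prefixes.map { |entry| entry['ipv6Prefix'] }.compact.map { |cidr| IPAddr.new(cidr) }
  }
end

# Linear scan over the ranges of the matching family; enough for a sketch.
def listed?(ranges, ip)
  addr = IPAddr.new(ip)
  family = addr.ipv4? ? :ipv4 : :ipv6
  ranges[family].any? { |range| range.include?(addr) }
end

# Illustrative data only (documentation prefixes).
sample = <<~SAMPLE
  {"prefixes": [
    {"ipv6Prefix": "2001:db8::/32"},
    {"ipv4Prefix": "192.0.2.0/24"}
  ]}
SAMPLE

ranges = parse_prefixes(sample)
puts listed?(ranges, '192.0.2.10')   # address inside the sample IPv4 prefix
puts listed?(ranges, '198.51.100.1') # address outside every sample prefix
```

The result of `parse_prefixes` could be flattened and returned from an ip_ranges block, provided the block accepts CIDR values in that form.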

NoMethodError: undefined method `empty?' for nil:NilClass

Thanks for this gem, I have seen a lot of error logs recently like this:

NoMethodError: undefined method `empty?' for nil:NilClass
…ms/legitbot-1.0.0/lib/legitbot/validators/ip_ranges.rb:   43:in `valid_ip?'
…ms/legitbot-1.0.0/lib/legitbot/validators/ip_ranges.rb:   20:in `valid_ip?'

I think it is caused by an unusual visitor IP. Is there a way to handle the nil value?
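
If a missing or malformed address is indeed the trigger (the report does not confirm this), the value can be validated before it reaches the gem. This is a minimal sketch using `IPAddr` from the standard library; `parse_ip` is a hypothetical helper.

```ruby
require 'ipaddr'

# Hypothetical helper: return an IPAddr, or nil when the value is missing
# or is not a parseable IP address.
def parse_ip(value)
  return nil unless value.is_a?(String) && !value.strip.empty?

  IPAddr.new(value.strip)
rescue IPAddr::Error
  # Raised for strings such as "unknown" or a comma-separated proxy list.
  nil
end
```

A caller can then skip validation when `parse_ip(request.ip)` returns nil instead of passing the raw value on.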

Possible Facebook RADB source issue?

We have been seeing errors for Facebook crawlers in the last couple of days. Walking through the code it seems to fail when source is not provided to the Irrc client:

client.query :radb, 'AS32934'
result = client.perform
Connecting to whois.radb.net
Processing AS32934
Executing "!s-*"
Got "F One or more selected sources are unavailable.
"
'!s-*' failed on 'whois.radb.net' (F One or more selected sources are unavailable.). when processing AS32934 for AS32934
No more queries
Closing a connection to whois.radb.net
Queue 0 guard objects
=> {}

Once a source is provided it seems to behave more as expected:

client.query :radb, 'AS32934', source: :radb
result = client.perform
Connecting to whois.radb.net
Processing AS32934
Executing "!sradb"
Got "C
"
Queue new 0 queries
No more queries
Closing a connection to whois.radb.net
Queue 0 guard objects
=> {"AS32934"=>
  {:ipv4=>
    {"AS32934"=>
      ["31.13.24.0/21",
       "31.13.64.0/18",
       "31.13.64.0/19",

Could something have changed with the service?

FacebookBot mislabeled as fake

I'm starting to see Facebook Bot being labeled as a fake search engine, when in reality the IP address is genuine and I think the issue here is in the SegmentTree being built.

The IP in question is 69.171.251.1

I get the following in the console:

irb> ranges = Legitbot::Facebook.reload!
=> {:ipv4=>SegmentTree(31.13.24.0..204.15.23.255), :ipv6=>SegmentTree(2401:db00::..2a03:2887:ff34:ffff:ffff:ffff:ffff:ffff)}
irb> ranges[:ipv4].find(IPAddr.new('69.171.251.1'))
=> nil

On the other hand:

irb> ip = IPAddr.new('31.13.24.0')
=> #<IPAddr: IPv4:31.13.24.0/255.255.255.255>
irb> ranges[:ipv4].find(ip)
=> #<SegmentTree::Segment:0x00000000085acdc0 @range=#<IPAddr: IPv4:31.13.24.0/255.255.248.0>..#<IPAddr: IPv4:31.13.31.255/255.255.248.0>, @value=true>

On one hand the IPv4 SegmentTree range is too broad, but despite that the valid IP address is not returned and a legitimate bot is labeled as a fake one and thus being blocked.
