Comments (19)

gijs commented on July 29, 2024

I'm building a Funda scraper based on headless Chrome. Whenever a captcha is detected, the scraper takes a screenshot and sends it to me via Telegram. I reply with the two words, which the scraper then uses to solve the captcha and continue scraping.
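
The human-in-the-loop flow described above could be sketched roughly as follows. This is not the commenter's code: `send_screenshot` and `wait_for_reply` are hypothetical callables standing in for whatever messaging client is used (for Telegram they would presumably wrap the Bot API's `sendPhoto` and `getUpdates` methods), and the retry count is an arbitrary assumption.

```python
from typing import Callable, Optional, Tuple


def parse_captcha_reply(text: str) -> Optional[Tuple[str, str]]:
    """Return the two captcha words from a chat reply, or None if malformed."""
    words = text.split()
    if len(words) != 2:
        return None
    return words[0], words[1]


def solve_captcha_via_chat(
    screenshot: bytes,
    send_screenshot: Callable[[bytes], None],  # hypothetical: pushes the image to the chat
    wait_for_reply: Callable[[], str],         # hypothetical: blocks until a reply arrives
    max_attempts: int = 3,                     # assumption: give up after a few tries
) -> Optional[Tuple[str, str]]:
    """Send the captcha screenshot and wait for a two-word answer."""
    for _ in range(max_attempts):
        send_screenshot(screenshot)
        answer = parse_captcha_reply(wait_for_reply())
        if answer is not None:
            return answer
    return None
```

Passing the messaging functions in as arguments keeps the captcha logic independent of the chat service, so it can be exercised with simple stand-ins before wiring up a real bot.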

from funda-scraper.

gijs commented on July 29, 2024

I think Funda has recently put some sort of rate limiter in place. It detects robots based on several parameters. They suspected me of being a robot anyway, since they prompted me with a captcha.

Scrapy can probably solve the captcha, but I didn't look into that: https://github.com/pombredanne/decaptcha

I'm curious whether anyone can get it to work again.

gijs commented on July 29, 2024

@igorkoehne (I assume you're talking to me): unfortunately I cannot share this specific codebase at the moment, because it contains a bunch of API keys and needs cleaning up, and I have no time for that now.

In the meantime, Google came up with Puppeteer. Building your own captcha-evading scraper should be even easier using this high-level API for headless Chrome. I'm going to rewrite my own scraper to use it, too.

gijs commented on July 29, 2024

In other news, detecting unmodified versions of headless Chrome seems easy, mostly because headless mode doesn't have WebGL capabilities, and that absence can be sniffed.

If Funda is already detecting Headless Chrome, sticking to Selenium's Chrome Webdriver will be a better option.

Good luck scraping them!

gijs commented on July 29, 2024

This is due to Funda blocking crawlers. Configuring a proxy middleware in Scrapy may help, but I didn't try that. Good luck.
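
A proxy middleware of the kind suggested above might look like the sketch below. It is untested against Funda, and `PROXY_LIST` is a hypothetical setting name introduced only for this example. It relies on the fact that Scrapy routes a request through whatever proxy URL is stored in `request.meta["proxy"]`.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting, e.g. ["http://host:port", ...]
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Keep a proxy that was set explicitly on the request; otherwise pick one.
        request.meta.setdefault("proxy", random.choice(self.proxies))
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES under its module path, with a priority below 750 so that it runs before Scrapy's built-in HttpProxyMiddleware.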

jobvisser03 commented on July 29, 2024

I tried configuring settings.py by including:

    DOWNLOADER_MIDDLEWARES = {
        'funda.middlewares.MyCustomDownloaderMiddleware': 543,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }

Unfortunately, this doesn't seem to work. Is there anything else one needs to take care of? Does anyone have experience with this?
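
One thing worth checking is whether the custom middleware actually sets a User-Agent header, since the settings above switch off Scrapy's built-in UserAgentMiddleware. The thread does not show what MyCustomDownloaderMiddleware contains, so the following is only a sketch of what a header-rotating middleware could look like. `USER_AGENT_LIST` is a hypothetical setting name. Another comment in this thread reports that rotating the User-Agent alone did not remove the error, so this may be necessary but not sufficient.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: send a random User-Agent per request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one User-Agent string is required")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical setting holding browser UA strings.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```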

aliaamin commented on July 29, 2024

Same problem here. I would appreciate it if anyone could help with some tips on how to overcome the 405 error.

igorkoehne commented on July 29, 2024

Is your scraper working properly? I was trying to use Selenium, but I am not even sure whether this is the correct way to go, since I am just starting out in this world. If you could share your code, it would be the best thing that happened to me this year o/

igorkoehne commented on July 29, 2024

Thanks for the tips, I will give it a try!

khpeek commented on July 29, 2024

As a quick reply, the 405 error appears to be the result of Funda fingerprinting headless browsers. I managed to circumvent it by (1) changing my user agent (using Scrapy Random User Agent) and (2) using the Scrapy Splash plugin.
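
For readers unfamiliar with step (2), the settings.py fragment below follows the configuration documented in the scrapy-splash README. It is not khpeek's actual configuration, which was not posted, and the `SPLASH_URL` value is an assumption (a Splash instance running locally on its default port).

```python
# settings.py fragment: route requests through a Splash rendering service.

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering aware of Splash request arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, a spider would yield `SplashRequest` objects from `scrapy_splash` instead of plain `scrapy.Request` objects for the pages that need rendering.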

AntoniosMavropoulos commented on July 29, 2024

Do both (1) and (2) need to be in place?
If yes, could you please post the code that you used?
Thanks!

tangvip commented on July 29, 2024

@khpeek
Do you mean that you need to use both methods, or can either one of them solve the problem?
Thanks!

arnabsinha4u commented on July 29, 2024

@tangvip With just (1), Scrapy Random User Agent, the error persists. I have not tried it with the Scrapy Splash plugin.

MarcDuQuesne commented on July 29, 2024

Hi folks, any update?

arnabsinha4u commented on July 29, 2024

I have done away with scraping the website. Instead, I am using RSS feeds, which are parameterized and serve the purpose. The RSS feeds carry only the latest details, not historic ones, but of course you can build up your own history over time, should that be a need.
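
Building up a history from a feed that only carries the latest listings could be sketched as below. The element names (`item`, `title`, `link`, `pubDate`) are the standard RSS 2.0 ones. Whether the Funda feeds use exactly these fields is an assumption, as is the choice of the link as the deduplication key.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Extract title, link and publication date from each RSS <item>."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pub_date": item.findtext("pubDate", default=""),
        })
    return items


def merge_into_history(history: Dict[str, Dict[str, str]],
                       items: List[Dict[str, str]]) -> int:
    """Add unseen items to history (keyed by link); return how many were new."""
    added = 0
    for entry in items:
        key = entry["link"]
        if key and key not in history:
            history[key] = entry
            added += 1
    return added
```

Running the merge on every poll and persisting `history` (for example as JSON) would accumulate listings that have since dropped out of the feed.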

Kalli commented on July 29, 2024

@arnabsinha4u could you please tell me where you find those RSS feeds you mention?
Do you mean something along the lines of these: http://partnerapi.funda.nl/feeds/Aanbod.svc/rss/?type=koop&zo=/amsterdam/

Is there a feed that has the postal codes as well?
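
If the feed URL above is indeed the kind of parameterized feed meant, the query string could be assembled as in this sketch. Only the `type=koop` and `zo=/amsterdam/` values come from the URL quoted in this thread. That other cities follow the same `/name/` pattern is an assumption, and the sketch says nothing about whether postal codes are available.

```python
from urllib.parse import urlencode

# Base URL taken from the feed link quoted in this thread.
FEED_BASE = "http://partnerapi.funda.nl/feeds/Aanbod.svc/rss/"


def build_feed_url(listing_type: str, city: str) -> str:
    """Build a feed URL; assumes the 'zo' parameter has the form '/<city>/'."""
    query = urlencode(
        {"type": listing_type, "zo": "/{}/".format(city.strip("/"))},
        safe="/",  # keep the slashes in the 'zo' value unescaped
    )
    return "{}?{}".format(FEED_BASE, query)
```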

Suidgeest commented on July 29, 2024

Hi Kurt, would you mind posting your latest working code (referring to your comment of Sep 6th, 2017)? Thank you!

fab343 commented on July 29, 2024

Hi all, any updates on the problem?

fab343 commented on July 29, 2024

I have done away with scraping the website. Instead, I am using RSS feeds, which are parameterized and serve the purpose. The RSS feeds carry only the latest details, not historic ones, but of course you can build up your own history over time, should that be a need.

Are you talking about this RSS feed: http://partnerapi.funda.nl/feeds/Aanbod.svc/rss/?type=koop&zo=/amsterdam/ ?
