teamhg-memex / autologin-middleware
Scrapy middleware for the autologin
A scrapy-splash autologin spider that uses any non-custom Splash endpoint will not work correctly, because the cookies argument is not supported there. Even though the initial request will use cookies (set via the headers argument), subsequent requests on the same page (made by JS, for example) will not use cookies and will not be authenticated.
Currently the only solution is to use a custom Splash script. We should at least have an example Scrapy project that correctly uses all the components.
For example, correctly fixing #6 requires checking how autologin-middleware works with Splash.
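For illustration, here is a minimal sketch of the kind of custom 'execute' script this would take, roughly following the session-handling example from the scrapy-splash README. The Python variable name is just illustrative; the script restores cookies from splash.args before navigating and returns the updated cookies so the cookie middleware can pick them up.
# Sketch of a custom Lua script for the Splash 'execute' endpoint, kept as a
# Python string (illustrative name). It initializes cookies from
# splash.args.cookies and returns the updated cookies and the page HTML.
LUA_WITH_COOKIES = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{splash.args.url, headers=splash.args.headers})
    assert(splash:wait(0.5))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""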
There are sites that set a lot of cookies on login; some of them are not really required and can be removed later, which makes the middleware think there was a logout.
One way to make logout detection more robust is to additionally check that there was a redirect when the cookies were removed.
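A rough sketch of that check could look like the following (the helper and its signature are hypothetical, not the middleware's actual API): only treat a response as a logout if cookies were removed and the response itself was a redirect.
def looks_like_logout(response, cookies_before, cookies_after):
    # Hypothetical helper: cookie names that existed before the request but are gone now.
    removed = set(cookies_before) - set(cookies_after)
    # Only count it as a logout if cookies disappeared *and* the site redirected us,
    # which is what a real logout (or an expired session) typically does.
    return bool(removed) and response.status in (301, 302, 303, 307, 308)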
It seems cleaner to use AUTOLOGIN_HTTP_PROXY and AUTOLOGIN_HTTPS_PROXY instead of passing HTTP_PROXY to autologin. HTTP_PROXY is not a standard scrapy settings variable, and it's more explicit to prefix it with AUTOLOGIN_, like the other variables. But it can be less convenient in some cases, perhaps, and may require some Arachnado changes.
What do you think, @kmike ?
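For example, the settings could then look roughly like this (the two AUTOLOGIN_*_PROXY names are the proposal here, not existing options, and the proxy URL is a placeholder):
# Sketch of the proposed naming in settings.py; AUTOLOGIN_HTTP_PROXY and
# AUTOLOGIN_HTTPS_PROXY are proposed (not yet existing) options.
AUTOLOGIN_ENABLED = True
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_HTTP_PROXY = 'http://localhost:8118'
AUTOLOGIN_HTTPS_PROXY = 'http://localhost:8118'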
test_custom_parse[True] is sometimes failing... can't reproduce locally yet.
On this line:
I think it is required due to scrapy-splash "freezing" cookies. Not sure whether it would be better to fix this in scrapy-splash.
Hello,
I'm trying to use the autologin middleware to log into a Twitter account and then use scrapy to crawl specified pages. When I make my POST request, the res.cookies I get back appears to be an empty CookieJar, though when I do json.loads(res.content.decode("utf-8"))["cookies"] I seem to get the full list of the cookies I want to pass to my scrapy-splash Lua script. With this "workaround" for the empty CookieJar, I pass the cookie JSON as the cookies argument for my SplashRequest. When I do this, I get a scrapy-splash error (I have a feeling this may be due to trying to pass a Python list of dicts as a Lua table, but I'm still not sure). Anyway, I'd super appreciate any insight on this problem, as I'm relatively new to splash/scrapy and this stuff is getting quite annoying to debug :/
I've attached my scrapy spider and settings files below. Thanks!
Wayde
followers_spider.py
import scrapy
from twitter_scraper.items import TweetItem
from scrapy.shell import inspect_response
from scrapy_splash import SplashRequest
from twitter_scraper.settings import *
import os
import pdb
from scrapy.utils.response import open_in_browser
from scrapy.http import HtmlResponse, FormRequest, Request
from twitter_scraper.lua_scripts import infinite_scroll, twitter_login
from autologin import AutoLogin
import requests
import json
import autologin_middleware


class FollowersSpider(scrapy.Spider):
    name = "twitter_followers"
    start_urls = ["https://twitter.com/NBA/followers"]
    login_page = "https://twitter.com/login"

    def start_requests(self):
        print("\nGetting cookies...\n")
        # Fetch login cookies directly from the autologin HTTP API.
        res = requests.post(
            url=AUTOLOGIN_URL + "/login-cookies",
            json={
                "url": self.login_page,
                "username": TWITTER_USER,
                "password": TWITTER_PASS,
            })
        cookies_json = json.loads(res.content.decode("utf-8"))["cookies"]
        print("\nGot cookies! Yielding request...\n")
        # pdb.set_trace()
        # yield scrapy.Request(
        #     url="https://twitter.com/nyxl/followers",
        #     callback=self.parse,
        #     cookies=cookies_json,
        #     meta={"dont_merge_cookies": True})
        yield SplashRequest(
            url=self.start_urls[0],
            callback=self.parse,
            endpoint='execute',
            args={"lua_source": infinite_scroll, "cookies": cookies_json})

    def parse(self, response):
        pdb.set_trace()
        ht = HtmlResponse(
            url=response.url, body=response.body,
            encoding="utf-8", request=response.request)
        open_in_browser(ht)
        inspect_response(response, self)
        pdb.set_trace()
        pass
settings.py
# -*- coding: utf-8 -*-

BOT_NAME = 'twitter_scraper'

SPIDER_MODULES = ['twitter_scraper.spiders']
NEWSPIDER_MODULE = 'twitter_scraper.spiders'

ROBOTSTXT_OBEY = True
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'autologin_middleware.AutologinMiddleware': 605,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

ITEM_PIPELINES = {
    'twitter_scraper.pipelines.TweetPipeline': 100
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Scrapy-splash
SPLASH_URL = 'http://0.0.0.0:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# CUSTOM
DATA = "../data/"
USER_ID_CSV = lambda fn: DATA + fn

# logins
TWITTER_USER = "example_username"
TWITTER_PASS = "example_password"

# Autologin
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_ENABLED = True
DOWNLOADER_MIDDLEWARES['autologin_middleware.AutologinMiddleware'] = 605
AUTOLOGIN_CHECK_LOGOUT = True
Hi there,
I just tried the middleware, but I got an error:
AssertionError: Middleware AutologinMiddleware.process_request must return None, Response or Request, got Deferred
Unhandled error in Deferred:
2016-05-25 11:01:18 [twisted] CRITICAL: Unhandled error in Deferred: