teamhg-memex / autologin-middleware
Scrapy middleware for the autologin
A scrapy-splash autologin spider that uses any non-custom Splash endpoint will not work correctly, because the cookies argument is not supported there. Even though the initial request will use cookies (set via the headers argument), subsequent requests on the same page (made by JS, for example) will not use cookies and will not be authenticated.
Currently the only solution is to use a custom Splash script. We should at least have an example Scrapy project that correctly uses all the components.
For example, correctly fixing #6 requires checking how autologin-middleware works with Splash.
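For illustration, here is a minimal sketch of the kind of custom 'execute' script this would take, roughly following the session-handling example from the scrapy-splash README. The Python variable name is just illustrative; the script restores cookies from splash.args before navigating and returns the updated cookies so the cookie middleware can pick them up.
# Sketch of a custom Lua script for the Splash 'execute' endpoint, kept as a
# Python string (illustrative name). It initializes cookies from
# splash.args.cookies and returns the updated cookies and the page HTML.
LUA_WITH_COOKIES = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{splash.args.url, headers=splash.args.headers})
    assert(splash:wait(0.5))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""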
There are sites that set a lot of cookies on login; some of them are not really required and can be removed later, which makes the middleware think there was a logout.
One way to make logout detection more robust is to additionally check that there was a redirect when the cookies were removed.
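A rough sketch of that check could look like the following (the helper and its signature are hypothetical, not the middleware's actual API): only treat a response as a logout if cookies were removed and the response itself was a redirect.
def looks_like_logout(response, cookies_before, cookies_after):
    # Hypothetical helper: cookie names that existed before the request but are gone now.
    removed = set(cookies_before) - set(cookies_after)
    # Only count it as a logout if cookies disappeared *and* the site redirected us,
    # which is what a real logout (or an expired session) typically does.
    return bool(removed) and response.status in (301, 302, 303, 307, 308)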
It seems cleaner to use AUTOLOGIN_HTTP_PROXY and AUTOLOGIN_HTTPS_PROXY instead of passing HTTP_PROXY to autologin. HTTP_PROXY is not a standard scrapy settings variable, and it's more explicit to prefix it with AUTOLOGIN_, like the other variables. But it can be less convenient in some cases, perhaps, and may require some Arachnado changes.
What do you think, @kmike ?
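For example, the settings could then look roughly like this (the two AUTOLOGIN_*_PROXY names are the proposal here, not existing options, and the proxy URL is a placeholder):
# Sketch of the proposed naming in settings.py; AUTOLOGIN_HTTP_PROXY and
# AUTOLOGIN_HTTPS_PROXY are proposed (not yet existing) options.
AUTOLOGIN_ENABLED = True
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_HTTP_PROXY = 'http://localhost:8118'
AUTOLOGIN_HTTPS_PROXY = 'http://localhost:8118'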
test_custom_parse[True] is sometimes failing... can't reproduce locally yet.
On this line:
I think it is required due to scrapy-splash "freezing" cookies. Not sure whether it would be better to fix this in scrapy-splash.
Hello,
I'm trying to use the autologin middleware to log into a Twitter account and then use scrapy to crawl specified pages. When I make my POST request, the res.cookies I get back appears to be an empty CookieJar, though when I do json.loads(res.content.decode("utf-8"))["cookies"] I seem to get the full list of the cookies I want to pass to my scrapy-splash Lua script. With this "workaround" for the empty CookieJar, I pass the cookie JSON as the cookies argument for my SplashRequest. When I do this, I get a scrapy-splash error (I have a feeling this may be due to trying to pass a Python list of dicts as a Lua table, but I'm still not sure). Anyway, I'd super appreciate any insight on this problem, as I'm relatively new to splash/scrapy and this stuff is getting quite annoying to debug :/
I've attached my scrapy spider and settings files below. Thanks!
Wayde
followers_spider.py
import scrapy
from twitter_scraper.items import TweetItem
from scrapy.shell import inspect_response
from scrapy_splash import SplashRequest
from twitter_scraper.settings import *
import os
import pdb
from scrapy.utils.response import open_in_browser
from scrapy.http import HtmlResponse, FormRequest, Request
from twitter_scraper.lua_scripts import infinite_scroll, twitter_login
from autologin import AutoLogin
import requests
import json
import autologin_middleware


class FollowersSpider(scrapy.Spider):
    name = "twitter_followers"
    start_urls = ["https://twitter.com/NBA/followers"]
    login_page = "https://twitter.com/login"

    def start_requests(self):
        print("\nGetting cookies...\n")
        # Fetch login cookies directly from the autologin HTTP API.
        res = requests.post(
            url=AUTOLOGIN_URL + "/login-cookies",
            json={
                "url": self.login_page,
                "username": TWITTER_USER,
                "password": TWITTER_PASS,
            })
        cookies_json = json.loads(res.content.decode("utf-8"))["cookies"]
        print("\nGot cookies! Yielding request...\n")
        # pdb.set_trace()
        # yield scrapy.Request(
        #     url="https://twitter.com/nyxl/followers",
        #     callback=self.parse,
        #     cookies=cookies_json,
        #     meta={"dont_merge_cookies": True})
        yield SplashRequest(
            url=self.start_urls[0],
            callback=self.parse,
            endpoint='execute',
            args={"lua_source": infinite_scroll, "cookies": cookies_json})

    def parse(self, response):
        pdb.set_trace()
        ht = HtmlResponse(
            url=response.url, body=response.body,
            encoding="utf-8", request=response.request)
        open_in_browser(ht)
        inspect_response(response, self)
        pdb.set_trace()
        pass
settings.py
# -*- coding: utf-8 -*-

BOT_NAME = 'twitter_scraper'

SPIDER_MODULES = ['twitter_scraper.spiders']
NEWSPIDER_MODULE = 'twitter_scraper.spiders'

ROBOTSTXT_OBEY = True
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'autologin_middleware.AutologinMiddleware': 605,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

ITEM_PIPELINES = {
    'twitter_scraper.pipelines.TweetPipeline': 100
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Scrapy-splash
SPLASH_URL = 'http://0.0.0.0:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# CUSTOM
DATA = "../data/"
USER_ID_CSV = lambda fn: DATA + fn

# logins
TWITTER_USER = "example_username"
TWITTER_PASS = "example_password"

# Autologin
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_ENABLED = True
DOWNLOADER_MIDDLEWARES['autologin_middleware.AutologinMiddleware'] = 605
AUTOLOGIN_CHECK_LOGOUT = True
Hi there,
I just tried the middleware, but I got an error:
AssertionError: Middleware AutologinMiddleware.process_request must return None, Response or Request, got Deferred
Unhandled error in Deferred:
2016-05-25 11:01:18 [twisted] CRITICAL: Unhandled error in Deferred: