autologin-middleware's Issues

Describe usage with scrapy-splash in more detail

A scrapy-splash autologin spider that uses any non-custom Splash endpoint will not work correctly: the cookies argument is not supported there, and even though the initial request will send cookies (set via the headers argument), subsequent requests on the same page (made by JS, for example) will not carry them and so will not be authenticated.
Currently the only solution is to use a custom Splash Lua script. We should at least have an example Scrapy project that correctly wires up all the components.
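Such a custom script would roughly look like the sketch below, which follows the splash:init_cookies / splash:get_cookies pattern from the Splash scripting documentation (the LUA_SOURCE name and the commented usage are mine, not part of the project):

```python
# Sketch of a custom Splash script: the Lua source seeds the browser's
# cookie jar before navigating, so requests made later by page JS carry
# the session cookies too, and returns the updated jar to the spider.
LUA_SOURCE = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go(splash.args.url))
  assert(splash:wait(1.0))
  return {
    html = splash:html(),
    cookies = splash:get_cookies(),
  }
end
"""

# Hypothetical usage from a spider:
# yield SplashRequest(url, self.parse, endpoint='execute',
#                     args={'lua_source': LUA_SOURCE, 'cookies': cookies})
```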

Add splash tests

For example, correctly fixing #6 requires checking how autologin-middleware works with Splash.

More robust logout detection - check for redirect

Some sites set a lot of cookies on login, and some of them are not actually required and may be removed later, which makes the middleware think a logout happened.
One way to make logout detection more robust is to additionally check that there was a redirect when the cookies were removed.
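As a sketch of that check (the function name and signature are hypothetical, not the middleware's actual API):

```python
def looks_like_logout(old_cookie_names, new_cookie_names, response_status):
    """Hypothetical heuristic: cookies disappearing alone is not enough --
    also require that the server redirected (3xx), which is what most
    sites do when a session actually expires."""
    removed = set(old_cookie_names) - set(new_cookie_names)
    return bool(removed) and 300 <= response_status < 400

# A cookie dropped without a redirect is treated as benign cleanup:
print(looks_like_logout({"sid", "tmp"}, {"sid"}, 200))  # False
# A dropped cookie plus a redirect is treated as a real logout:
print(looks_like_logout({"sid"}, set(), 302))           # True
```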

Rename HTTP_PROXY to AUTOLOGIN_HTTP_PROXY

It seems cleaner to use AUTOLOGIN_HTTP_PROXY and AUTOLOGIN_HTTPS_PROXY instead of passing HTTP_PROXY to autologin. HTTP_PROXY is not a standard Scrapy setting, and it is more explicit to prefix it with AUTOLOGIN_, like the other variables.

But it may be less convenient in some cases, and may require some Arachnado changes.
What do you think, @kmike ?
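For illustration, the proposed settings.py fragment might look like this (hypothetical names, nothing here is implemented yet):

```python
# Proposed, prefixed setting names (hypothetical -- not yet implemented):
AUTOLOGIN_HTTP_PROXY = 'http://localhost:8118'
AUTOLOGIN_HTTPS_PROXY = 'http://localhost:8118'

# instead of the current un-prefixed form, which collides with the
# conventional HTTP_PROXY environment-variable name:
# HTTP_PROXY = 'http://localhost:8118'
```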

HTTP API Response Cookiejar is empty

Hello,

I'm trying to use the autologin middleware to log into a Twitter account and then use Scrapy to crawl specified pages. When I make my POST request, the res.cookies I get back appears to be an empty CookieJar, though when I do json.loads(res.content.decode("utf-8"))["cookies"] I seem to get the full list of the cookies I want to pass to my scrapy-splash Lua script. With this "workaround" for the empty CookieJar, I pass the cookie JSON as the cookies argument of my SplashRequest. When I do this, I get a scrapy-splash error (I have a feeling this may be due to trying to pass a Python list of dicts as a Lua table, but I'm still not sure). I'd really appreciate any insight on this problem, as I'm relatively new to Splash/Scrapy and this is getting quite annoying to debug :/

I've attached my scrapy spider and settings files below. Thanks!

Wayde

followers_spider.py

import scrapy
from twitter_scraper.items import TweetItem
from scrapy.shell import inspect_response
from scrapy_splash import SplashRequest
from twitter_scraper.settings import *
import os
import pdb
from scrapy.utils.response import open_in_browser
from scrapy.http import HtmlResponse, FormRequest, Request
from twitter_scraper.lua_scripts import infinite_scroll, twitter_login
from autologin import AutoLogin
import requests
import json
import autologin_middleware

class FollowersSpider(scrapy.Spider):
    name = "twitter_followers"    
    start_urls = ["https://twitter.com/NBA/followers"]
    login_page = "https://twitter.com/login"


    def start_requests(self):
        print("\nGetting cookies...\n")
        res = requests.post(
            url=AUTOLOGIN_URL + "/login-cookies",
            json={
                "url":self.login_page,
                "username":TWITTER_USER,
                "password":TWITTER_PASS
            }
        )
        cookies_json = json.loads(res.content.decode("utf-8"))["cookies"]
        print("\nGot cookies! Yielding request...\n")
        # pdb.set_trace()

        # yield scrapy.Request(
        #     url="https://twitter.com/nyxl/followers", 
        #     callback=self.parse,
        #     cookies=cookies_json,
        #     meta={"dont_merge_cookies":True})

        yield SplashRequest(
            url=self.start_urls[0],
            callback=self.parse, 
            endpoint='execute',
            args={"lua_source":infinite_scroll,"cookies":cookies_json})


    def parse(self, response):
        pdb.set_trace()
        
        ht = HtmlResponse(
            url=response.url, body=response.body, 
            encoding="utf-8", request=response.request)
        open_in_browser(ht)
        inspect_response(response, self)

        
        pdb.set_trace()
        pass

settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'twitter_scraper'

SPIDER_MODULES = ['twitter_scraper.spiders']
NEWSPIDER_MODULE = 'twitter_scraper.spiders'

ROBOTSTXT_OBEY = True
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'autologin_middleware.AutologinMiddleware': 605,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


ITEM_PIPELINES = {
    'twitter_scraper.pipelines.TweetPipeline': 100
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Scrapy-splash 
SPLASH_URL = 'http://0.0.0.0:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# CUSTOM
DATA = "../data/"
USER_ID_CSV = lambda fn: DATA + fn

# logins
TWITTER_USER = "example_username"
TWITTER_PASS = "example_password"


# Autologin
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_ENABLED = True
DOWNLOADER_MIDDLEWARES['autologin_middleware.AutologinMiddleware'] = 605

AUTOLOGIN_CHECK_LOGOUT = True
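For what it's worth, a minimal sketch of the JSON "workaround" described in the issue above — the response body here is a made-up example, and only the top-level "cookies" field mirrors the spider code:

```python
import json

# Hypothetical /login-cookies response body. requests' res.cookies only
# jars cookies that the autologin *server* itself sets on its reply, so
# the target site's login cookies have to be read from the JSON payload.
body = ('{"cookies": [{"name": "auth_token", "value": "abc",'
        ' "domain": ".twitter.com"}]}')
cookies = json.loads(body)["cookies"]

# cookies is a plain list of dicts: JSON-serializable, so it can travel
# through SplashRequest args and be consumed in Lua via
# splash:init_cookies(splash.args.cookies).
print(cookies[0]["name"])  # auth_token
```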
