ail-splash-manager's Introduction

Deprecated: as of AIL v5.0, the crawler has been upgraded to Lacus.

AIL no longer relies on any Docker image.

ail-splash-manager

AIL crawlers use a Splash crawler to fetch and render a domain.
The purpose of this Flask server is to simplify the installation of the Splash dockers and to manage them:

  • create, launch, and relaunch Splash dockers
  • handle proxies
  • check crawler status

Installation

git clone https://github.com/ail-project/ail-splash-manager.git
cd ail-splash-manager
./install.sh

Usage

Launching AIL Splash Manager

./LAUNCH.sh -l

Killing AIL Splash Manager and all Splash dockers

./LAUNCH.sh -k

Launching AIL Splash Manager Tests

./LAUNCH.sh -t

Tor proxy

Installation

The tor proxy from the Ubuntu package is installed by default.

This package is outdated: some v3 onion addresses are not resolved.

/!\ Install the tor proxy provided by the Tor Project to solve this issue. /!\

Note: on Ubuntu, add the Tor Project repository to the apt sources:

sudo sh -c 'echo "deb https://deb.torproject.org/torproject.org $(lsb_release -sc) main" >> /etc/apt/sources.list.d/tor-project.list'

Once it is installed, all Splash dockers must be allowed to reach this proxy. You can use the configure_tor.sh script or configure it yourself.

  • Install script:
cd ail-splash-manager
./configure_tor.sh
  • Manual configuration:
    • Allow Tor to bind to any interface or to the Docker interface (by default it binds to 127.0.0.1 only) by setting SocksPort 0.0.0.0:9050 or SocksPort 172.17.0.1:9050 in /etc/tor/torrc
    • Add the line SocksPolicy accept 172.17.0.0/16 to /etc/tor/torrc (for Docker on Linux, the host's IP as seen from the containers is 172.17.0.1; adapt this for other platforms)
    • Restart the tor proxy: sudo service tor restart
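
The manual steps above can be double-checked with a small script. The following is a minimal Python sketch, not part of ail-splash-manager: the function name missing_tor_directives is hypothetical, and the sketch assumes that option names can be compared case-insensitively and that "#" starts a comment. It only reports which of the directives listed above are absent from a torrc text.

```python
# Values taken from the manual configuration steps above.
ACCEPTED_SOCKS_PORTS = {"0.0.0.0:9050", "172.17.0.1:9050"}
REQUIRED_SOCKS_POLICY = "accept 172.17.0.0/16"


def missing_tor_directives(torrc_text):
    """Return the directives from the manual steps that torrc_text lacks.

    Hypothetical helper: it inspects text only and never touches Tor itself.
    """
    socks_ports = set()
    socks_policies = set()
    for raw_line in torrc_text.splitlines():
        # Drop comments and surrounding whitespace (assumed torrc syntax).
        line = raw_line.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key = parts[0].lower()
        # Normalise inner whitespace so "accept   172..." still matches.
        value = " ".join(parts[1].split())
        if key == "socksport":
            socks_ports.add(value)
        elif key == "sockspolicy":
            socks_policies.add(value)

    missing = []
    if not socks_ports & ACCEPTED_SOCKS_PORTS:
        missing.append("SocksPort 0.0.0.0:9050 or SocksPort 172.17.0.1:9050")
    if REQUIRED_SOCKS_POLICY not in socks_policies:
        missing.append("SocksPolicy " + REQUIRED_SOCKS_POLICY)
    return missing
```

To use it, read /etc/tor/torrc into a string and pass it to the function; an empty list means both directives were found.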

Configuration

Proxies:

Edit config/proxies_profiles.cfg:

  • [section_name]: proxy name; each section describes a proxy
  • host: proxy host
    (for Docker on Linux, the host's IP as seen from the containers is 172.17.0.1; adapt this for other platforms)
  • port: proxy port
  • type: proxy type, SOCKS5 or HTTP
  • description: proxy description
  • crawler_type: crawler type (tor, i2p, or web)
[default_tor] # section name: proxy name
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
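
As a rough illustration of how such a profile could be read and sanity-checked, here is a minimal Python sketch using the standard configparser module. The function name load_proxies, the SAMPLE string, and the choice to treat description as optional are assumptions made for this example; the manager's own loading code may differ.

```python
import configparser

# Sample profile mirroring the keys described above.
SAMPLE = """
[default_tor]
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
"""

# Assumption: these keys are mandatory, description is optional.
REQUIRED_KEYS = ("host", "port", "type", "crawler_type")
PROXY_TYPES = ("SOCKS5", "HTTP")
CRAWLER_TYPES = ("tor", "i2p", "web")


def load_proxies(cfg_text):
    """Parse proxy profiles into a dict keyed by proxy name (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    proxies = {}
    for name in parser.sections():
        section = parser[name]
        missing = [key for key in REQUIRED_KEYS if key not in section]
        if missing:
            raise ValueError(f"proxy {name!r} is missing keys: {missing}")
        if section["type"] not in PROXY_TYPES:
            raise ValueError(f"proxy {name!r}: type must be one of {PROXY_TYPES}")
        if section["crawler_type"] not in CRAWLER_TYPES:
            raise ValueError(f"proxy {name!r}: crawler_type must be one of {CRAWLER_TYPES}")
        proxies[name] = {
            "host": section["host"],
            "port": section.getint("port"),  # raises ValueError if not an integer
            "type": section["type"],
            "description": section.get("description", ""),
            "crawler_type": section["crawler_type"],
        }
    return proxies
```

Calling load_proxies(SAMPLE) returns a dictionary with a single default_tor entry; a profile with an unsupported type raises ValueError instead.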
Splash Dockers:

Edit config/containers.cfg:

  • [section_name]: Splash name; each section describes a Splash container
  • proxy_name: proxy name (defined in proxies_profiles.cfg)
  • port: single port or port range (e.g. 8050 or 8050-8052);
    a port range is used to launch multiple Splash dockers
  • cpu: maximum number of CPUs allocated
  • memory: maximum RAM allocated (GB)
  • maxrss: max unbound in-memory cache (MB); Splash is restarted when it is full
  • description: Splash description
  • net: network type (bridge, host...)
[default_splash_tor] # section name: splash name
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description= default splash tor
net=bridge
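
To make the port option concrete: a range such as 8050-8052 corresponds to one Splash docker per port, bounds included. The sketch below shows one way to expand such a value in Python. The function name expand_ports and the validation rules are assumptions for illustration and are not taken from the manager's code.

```python
def expand_ports(port_spec):
    """Expand "8050" or "8050-8052" into a list of port numbers.

    Hypothetical helper: both bounds of a range are included.
    """
    spec = port_spec.strip()
    if "-" in spec:
        first_text, last_text = spec.split("-", 1)
        first, last = int(first_text), int(last_text)
    else:
        first = last = int(spec)
    # Reject reversed ranges and values outside the valid TCP port range.
    if not 0 < first <= last <= 65535:
        raise ValueError(f"invalid port specification: {port_spec!r}")
    return list(range(first, last + 1))
```

With the default_splash_tor section above, expand_ports("8050-8052") gives three ports, one per Splash docker to launch.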

I2P

Installation:

Go to the I2P website and follow the installation instructions.

Configuration

  • Edit config/containers.cfg:
    • net: must be host to work
[default_splash_i2p] # section name: splash name
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
  • Add a new proxy in config/proxies_profiles.cfg:
    • host: must be 127.0.0.1 to work
[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
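
The two I2P constraints above (net must be host, and the proxy host must be 127.0.0.1) are easy to get wrong when editing both files by hand. The following minimal Python sketch cross-checks them. The function name i2p_config_problems and the message wording are hypothetical, and the sketch only reads the configuration text; it is not how the manager itself validates its files.

```python
import configparser


def i2p_config_problems(containers_text, proxies_text):
    """List violations of the I2P constraints across both config files.

    Hypothetical helper working on the text of containers.cfg and
    proxies_profiles.cfg.
    """
    containers = configparser.ConfigParser()
    containers.read_string(containers_text)
    proxies = configparser.ConfigParser()
    proxies.read_string(proxies_text)

    problems = []
    for name in containers.sections():
        proxy_name = containers[name].get("proxy_name", "")
        if not proxies.has_section(proxy_name):
            problems.append(f"{name}: unknown proxy {proxy_name!r}")
            continue
        proxy = proxies[proxy_name]
        if proxy.get("crawler_type") != "i2p":
            continue  # the constraints below only concern I2P proxies
        if containers[name].get("net") != "host":
            problems.append(f"{name}: net must be host for I2P")
        if proxy.get("host") != "127.0.0.1":
            problems.append(f"{proxy_name}: host must be 127.0.0.1 for I2P")
    return problems
```

An empty list means every Splash section bound to an I2P proxy satisfies both constraints and refers to a proxy that exists.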

Web proxy

SQUID

  • Edit /etc/squid/squid.conf:

    acl localnet src 172.17.0.0/16 # Docker IP range
    http_access allow localnet
  • Add a new proxy in config/proxies_profiles.cfg:

    [squid_proxy]
    host=172.17.0.1
    port=3128
    type=HTTP
    description=squid web proxy
    crawler_type=web
  • Bind this proxy to a Splash docker in config/containers.cfg

API

api/v1/ping

api/v1/version

api/v1/get/session_uuid

api/v1/get/proxies/all

api/v1/get/splash/all

api/v1/splash/restart

api/v1/splash/kill
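
For orientation, here is a minimal Python sketch of a client for the read-only endpoints. It rests on several assumptions: the base URL https://127.0.0.1:7001 and the use of GET come from the server log quoted in the first issue further down this page, certificate verification is disabled on the guess that the server uses a self-signed certificate, and any authentication the server may require is not handled. The names api_url and api_get are hypothetical.

```python
import ssl
import urllib.request

# Endpoints that appear as GET requests in the server log quoted in the
# issue further down this page; the methods of the others are not documented here.
LOGGED_GET_ENDPOINTS = (
    "api/v1/ping",
    "api/v1/get/session_uuid",
    "api/v1/get/proxies/all",
    "api/v1/get/splash/all",
)


def api_url(base_url, endpoint):
    """Join a base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


def api_get(base_url, endpoint, timeout=10):
    """Send a GET request and return (status code, response body as text).

    Assumption: the certificate is self-signed, so verification is
    switched off. Do not do this on an untrusted network.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = api_url(base_url, endpoint)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

For example, api_get("https://127.0.0.1:7001", "api/v1/ping") would query a locally running manager. The body is returned as raw text because the response format is not described here.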

ail-splash-manager's People

Contributors

adammchugh, adulau, davidcruciani, terrtia, vncloudsco

Forkers

vncloudsco rafiot

ail-splash-manager's Issues

Crawler Error / Down

Hi!

I installed the AIL-Splash-Manager on the same machine that AIL itself is running on (we only have this single machine),
but I'm not able to get the crawlers running because of an error:
[screenshot]

Screen of the ail-splash-manager:

Launching all Splash dockers ...

 * Serving Flask app 'Flask_server'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on https://127.0.0.1:7001
 * Running on https://192.168.158.2:7001
Press CTRL+C to quit
127.0.0.1 - - [19/Aug/2022 14:28:33] "GET /api/v1/ping HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:40] "GET /api/v1/get/session_uuid HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:40] "GET /api/v1/ping HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:40] "GET /api/v1/get/proxies/all HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:41] "GET /api/v1/get/splash/all HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:41] "GET /api/v1/ping HTTP/1.1" 200 -
127.0.0.1 - - [19/Aug/2022 14:28:42] "GET /api/v1/ping HTTP/1.1" 200 -

Here is the output of LAUNCH.sh -t:


 ./LAUNCH.sh -t
 #### containers config: ####
# proxy_name: proxy name (defined in proxies_profiles.cfg)
# port: single port or port range (ex: 8050 or 8050-8052)
# cpu: max number of cpu allocated
# memory: RAM (G) allocated
# maxrss: max unbound in-memory cache (Mb, Restart Splash when full)
# description: docker description
[default_splash_tor]
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description= default splash tor
net=bridge

# Splash with SQUID proxy
[web_splash]
proxy_name=web_proxy
port=8060
cpu=1
memory=1
maxrss=2000
description= web splash
net=bridge

# Splash with I2P proxy
#[default_splash_i2p] # section name: splash name
#proxy_name=default_i2p
#port=8053-8055
#cpu=1
#memory=1
#maxrss=2000
#description=default splash i2p
#net=host
#### proxies config: ####
# Tor: torrc default proxy
# use The torproject proxy https://2019.www.torproject.org/docs/debian
# (up to date, solve issues with v3 onion addresses)

# proxy name
[default_tor]
# proxy host
host=172.17.0.1
# proxy port
port=9050
# proxy type
type=SOCKS5
# proxy description
description=tor default proxy
# crawler type (tor or i2p or web)
crawler_type=tor

# SQUID proxy
[web_proxy]
host=172.17.0.1
port=3128
type=HTTP
description=web proxy
crawler_type=web

# I2P proxy
#[default_i2p]
#host=127.0.0.1
#port=4444
#type=HTTP
#description=i2p default proxy
#crawler_type=i2p

#### #### ####

 Launching Tests ...

Splash List:
b'6a30543d58f9   scrapinghub/splash   "python3 /app/bin/sp"   58 seconds ago   Up 57 seconds   0.0.0.0:8060->8050/tcp   gallant_moser\n719b9459bb82   scrapinghub/splash   "python3 /app/bin/sp"   58 seconds ago   Up 57 seconds   0.0.0.0:8052->8050/tcp   hardcore_nightingale\ne237c3d9a36d   scrapinghub/splash   "python3 /app/bin/sp"   59 seconds ago   Up 58 seconds   0.0.0.0:8051->8050/tcp   pensive_greider\nc8c36770f590   scrapinghub/splash   "python3 /app/bin/sp"   59 seconds ago   Up 58 seconds   0.0.0.0:8050->8050/tcp   strange_gould\n'

Testing Splash Docker 6a30543d58f9:
success

Testing Splash Docker 719b9459bb82:
success

Testing Splash Docker e237c3d9a36d:
success

Testing Splash Docker c8c36770f590:
success

Running docker containers:


# docker container ls
CONTAINER ID   IMAGE                COMMAND                  CREATED         STATUS         PORTS                    NAMES
6a30543d58f9   scrapinghub/splash   "python3 /app/bin/sp…"   6 minutes ago   Up 6 minutes   0.0.0.0:8060->8050/tcp   gallant_moser
719b9459bb82   scrapinghub/splash   "python3 /app/bin/sp…"   6 minutes ago   Up 6 minutes   0.0.0.0:8052->8050/tcp   hardcore_nightingale
e237c3d9a36d   scrapinghub/splash   "python3 /app/bin/sp…"   6 minutes ago   Up 6 minutes   0.0.0.0:8051->8050/tcp   pensive_greider
c8c36770f590   scrapinghub/splash   "python3 /app/bin/sp…"   6 minutes ago   Up 6 minutes   0.0.0.0:8050->8050/tcp   strange_gould

Under the onion crawler, both ports are listed (8050 Tor + 8060 web). Is this maybe the problem?
[screenshot]

I checked the web proxy configuration (https://github.com/ail-project/ail-splash-manager#web-proxy) in the project description, but it is not clear to me which /etc/squid/squid.conf I need to configure. Squid is not installed on the host by default by the ail-splash-manager install script. Or do I need to change it inside the Docker container?

Is there anything I can debug further?

Please give me a hint if more logs are needed.
Thanks for your help!

Question: Initialisation in comparison with script from ail-framework

Hi @Terrtia, is the way the crawlers are started somehow different from the way it is done in the script from the ail-framework? I tried the splash-manager and everything seems to start correctly, but the crawlers do not seem to be able to connect via the Tor proxy. This is not the case when using the script from the ail-framework. Side note: I am using Tor behind an HTTP(S) proxy, which is configured in /etc/tor/torrc.

Question: How are non-TOR crawlers used in ail-framework

Hi @Terrtia,
regarding the possibility of also configuring crawlers to connect via an HTTP proxy: how does the ail-framework know which crawlers on which ports can be used for non-Tor connections? Does it need a special configuration in /ail-framework/configs/core.cfg under the following section?

[Crawler]
activate_crawler = True
crawler_depth_limit = 1
...
