crawly's Introduction

Crawly

Overview

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Requirements

  1. Elixir ~> 1.14
  2. Works on GNU/Linux, Windows, macOS, and BSD.

Quickstart

  1. Create a new project: mix new quickstart --sup

  2. Add Crawly and Floki as dependencies:

    # mix.exs
    defp deps do
        [
          {:crawly, "~> 0.17.0"},
          {:floki, "~> 0.33.0"}
        ]
    end
  3. Fetch dependencies: $ mix deps.get

  4. Create a spider

     # lib/crawly_example/books_to_scrape.ex
     defmodule BooksToScrape do
       use Crawly.Spider
    
       @impl Crawly.Spider
       def base_url(), do: "https://books.toscrape.com/"
    
       @impl Crawly.Spider
       def init() do
         [start_urls: ["https://books.toscrape.com/"]]
       end
    
       @impl Crawly.Spider
       def parse_item(response) do
         # Parse response body to document
         {:ok, document} = Floki.parse_document(response.body)
    
         # Create items (on pages where items exist)
         items =
           document
           |> Floki.find(".product_pod")
           |> Enum.map(fn x ->
             %{
               title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
               price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
               url: response.request_url
             }
           end)
    
         next_requests =
           document
           |> Floki.find(".next a")
           |> Floki.attribute("href")
           |> Enum.map(fn url ->
             Crawly.Utils.build_absolute_url(url, response.request.url)
             |> Crawly.Utils.request_from_url()
           end)
    
         %Crawly.ParsedItem{items: items, requests: next_requests}
       end
     end

    New in 0.15.0:

    You can use a generator command to speed up spider creation; it produces a file with all required callbacks: mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape

  5. Configure Crawly

    By default, Crawly does not require any configuration, but you will likely want to fine-tune the crawl (in config/config.exs):

     import Config
    
     config :crawly,
       closespider_timeout: 10,
       concurrent_requests_per_domain: 8,
       closespider_itemcount: 100,
    
       middlewares: [
         Crawly.Middlewares.DomainFilter,
         Crawly.Middlewares.UniqueRequest,
         {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
       ],
       pipelines: [
         {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
         {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
         Crawly.Pipelines.JSONEncoder,
         {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
       ]

    New in 0.15.0:

    You can generate an example config with the following command: mix crawly.gen.config

  6. Start the Crawl:

      iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
  7. Results can be seen with:

    $ cat /tmp/BooksToScrape_<timestamp>.jl
    

Running Crawly without Elixir or an Elixir project

It's possible to run Crawly in standalone mode, where Crawly runs as a tiny Docker container and spiders are just YML files or Elixir modules mounted inside it.

Please read more about it here:

Need more help?

Please use Discussions for all conversations related to the project.

Browser rendering

Crawly can be configured so that all fetched pages are browser-rendered, which can be very useful if you need to extract data from pages with lots of asynchronous elements (for example, parts loaded by AJAX).

You can read more here:

Simple management UI (New in 0.15.0)

Crawly provides a simple management UI, served by default on localhost:4001.

It allows you to:

  • Start spiders
  • Stop spiders
  • Preview scheduled requests
  • View/Download items extracted
  • View/Download logs

NOTE: It's possible to disable the simple management UI (and REST API) with the start_http_api?: false option in the Crawly configuration.

You can choose to run the management UI as a plug in your application.

defmodule MyApp.Router do
  use Plug.Router

  ...
  forward "/admin", Crawly.API.Router
  ...
end

Crawly Management UI

Experimental UI [Deprecated]

We currently don't have the capacity to work on the experimental UI built with Phoenix and LiveView, and we keep it here mainly for demo purposes.

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders. Check out the code on GitHub.

Documentation

Roadmap

To be discussed

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring
  7. Building a Chrome-based fetcher for Crawly

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.

Production usages

Using Crawly in production? Please let us know about your use case!

Copyright and License

Copyright (c) 2019 Oleg Tarasenko

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

How to release:

  1. Update version in mix.exs
  2. Update version in quickstart (README.md, this file)
  3. Commit and create a new tag: git commit && git tag 0.xx.0 && git push origin master --follow-tags
  4. Build docs: mix docs
  5. Publish hex release: mix hex.publish

crawly's Issues

Downloading Files

Are there any examples of saving multiple files? For example, saving multiple images for each Crawly request. So far I have come across the WriteToFile pipeline, which seems to be used for saving data into a single file (CSV, JSON, etc.).
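
There is no dedicated pipeline for this in the question's scope, but a custom pipeline is one way to sketch it. Below is a minimal, hypothetical example that downloads every URL stored under an illustrative :image_urls item field and writes each file to disk; the module name, field name, and lack of error handling are assumptions, not Crawly APIs (HTTPoison is assumed available, as Crawly already depends on it):

defmodule MyApp.Pipelines.SaveImages do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ []) do
    folder = Keyword.get(opts, :folder, "/tmp")

    item
    |> Map.get(:image_urls, [])
    |> Enum.each(fn url ->
      # Download each image and store it under its base name (no retries here)
      case HTTPoison.get(url) do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
          File.write!(Path.join(folder, Path.basename(url)), body)

        _error ->
          :ok
      end
    end)

    # Pass the item on unchanged so later pipelines still see it
    {item, state}
  end
end

It could then be listed as {MyApp.Pipelines.SaveImages, folder: "/tmp/images"} in pipelines:, before the encoder/writer pipelines.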

How to use Crawly when there is no need to collect links?

I have about 250k URLs in a database that need to be enriched: go to every link in the list and parse its HTML. Is there a good way to feed them into the Crawly queue? Is Crawly suitable for this case?
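
One possible sketch: load the URLs in the spider's init/0 and return no follow-up requests from parse_item/1, so only the seeded pages are fetched. MyApp.Repo and MyApp.Link below are hypothetical application modules, not part of Crawly:

defmodule MyApp.EnrichSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Feed all stored URLs as start_urls instead of hardcoding them
    urls = MyApp.Link |> MyApp.Repo.all() |> Enum.map(& &1.url)
    [start_urls: urls]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      url: response.request_url,
      title: document |> Floki.find("title") |> Floki.text()
    }

    # No link extraction: the queue only ever contains the seeded URLs
    %Crawly.ParsedItem{items: [item], requests: []}
  end
end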

Scraping different items from the same spider, each with different pipeline requirements

Problem:
Crawly only allows scraping a single item type. However, what if I am crawling two different sites with vastly different items?

For example, web page A (e.g. a blog) will have:

  • comments
  • article content
  • related links
  • title

while web page B (e.g. a weather site) will have:

  • temperature
  • country

In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the end item declaration in config will be:

item: [:title, :comments, :article_content, :related_links, :temperature, :country]

The issues are that:

  1. They share the same pipeline. When scraping the weather data, the blog-related fields will be blank, and vice versa when scraping the blog. This will affect pipeline validations, since the pipeline is shared.
  2. Output of the two items will be the same.
  3. Duplication checks (such as :item_id) are not item-specific. Since the item type from the weather site has no title, I can't specify an item-type-specific field.

I have some idea of how this could be implemented, taking inspiration from Scrapy.
We could define item structs and route items to their appropriate pipelines according to their struct.

Using the tutorial as an example:

using this ideal scenario config:

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {MyItemStruct, [
        Crawly.Pipelines.Validate,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]},
    {MyOtherItemStruct, [
        Crawly.Pipelines.Validate,
        Crawly.Pipelines.CleanMyData,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :name }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]},
  ]

with the spider implemented like so:

 @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
    name = response.body |> Floki.find("article.blog_post h2") |> Floki.text()

    %{
         :requests => requests,
         :items => [
            %MyItemStruct{title: title, url: response.request_url},
            %MyOtherItemStruct{name: name, url: response.request_url}
         ]
      }
  end

The returned items then can get sorted into their specified pipelines.

This configuration method proposes the following:

  • allow declaration of item-specific pipelines for multiple items
  • allow passing of arguments to a pipeline implementation e.g. {MyPipelineModule, validate_more_than: 5}

For backwards compatibility, a single-item pipeline could still be declared as before; the struct-based declaration would only be needed for multi-item pipelines.

Do let me know what you think @oltarasenko

Unable to get up and running from the quick start

Hi, I am unable to scrape the Erlang Solutions blog as the quickstart guide describes here:
https://github.com/oltarasenko/crawly#quickstart

Attempting to run the spider through iex results in:

iex(1)> Crawly.Engine.start_spider(MyCrawler.EslSpider)
[info] Starting the manager for Elixir.MyCrawler.EslSpider
[debug] Running spider init now.
[debug] Scraped ":title,:url"
[debug] Starting requests storage worker for Elixir.MyCrawler.EslSpider...
[debug] Started 2 workers for Elixir.MyCrawler.EslSpider
:ok
iex(2)> [info] Current crawl speed is: 0 items/min
[info] Stopping MyCrawler.EslSpider, itemcount timeout achieved

I'm quite lost, as there is no way for me to debug whether it is a network issue (highly unlikely, since I can access the ESL website through my browser) or an issue with the URLs being filtered out.

defmodule MyCrawler.EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils
  require Logger
  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    Logger.debug("Running spider init now.")
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    IO.inspect(response)
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    # Modified this to make it even more general, to eliminate the possibility of selector problem
    title = response.body |> Floki.find("title") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end

Of note is that the spider does not even call the parse_item callback, as the IO.inspect of the response is never called.

Config is as follows:

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 2,
  follow_redirects: true,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title

Discussion: Store state for a fault-tolerant crawler

I think Postgres would be a good way to store the spider state so that, if the system crashes, it can continue the crawl from where it stopped.

Are there any recommendations or suggestions on how to implement this in my current Crawly project?

Tag jobs with unique id

One of the common problems I am facing right now is that it's not possible to separate one Crawly job from another from an external point of view.

E.g. the same spider can be executed multiple times. How do we know that the data came from a given run? For example, at http://18.216.221.122/ we need an ID in order to group items inside the UI.

I plan to:

  1. Extend the engine so that start_spider accepts an optional job_id parameter
  2. The engine would automatically generate the job_id if not provided
  3. Crawly will use this job_id when communicating with the external world. E.g. if we are shipping logs somewhere, it will be used; if we're shipping items, it will be used.

What do you think about the idea?
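
A rough sketch of how the proposed API could look from the caller's side (illustrative only; the option name is part of the proposal, not the current interface):

# Caller supplies an explicit job id...
Crawly.Engine.start_spider(MySpider, job_id: "esl-blog-2020-05-01")

# ...or omits it and lets the engine generate one
Crawly.Engine.start_spider(MySpider)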

What is the recommended "concurrent_requests_per_domain"?

It looks like it's easy to overflow the queue and bring the entire system down. This happens when I set concurrency to more than ~20. What is the recommended concurrency setting for Crawly? What performance can you achieve with it?

22:27:10.427 [error] GenServer #PID<0.527.0> terminating
** (MatchError) no match of right hand side value: {:empty, {[], []}}
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:509: :hackney_pool.queue_out/2
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:376: :hackney_pool.dequeue/3
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:349: :hackney_pool.handle_info/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.4051887452.3045064705.41660>, :request, #PID<0.455.0>, :shutdown}
State: {:state, :default, {:metrics_ng, :metrics_dummy}, 50, 150000, ...} (hackney pool state for 'www.homebase.co.uk', truncated)

Spider scheduling for continuous crawling

I'm looking to do cron-style spider scheduling, where the engine starts the spider if it is not running at a scheduled timing interval.

Should this be within the Engine or Commander (?) module context?

This would require the Engine to be part of a supervision tree, I think.
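
Until something like that lands in Crawly itself, a minimal sketch of cron-style scheduling from the host application is a plain GenServer in your own supervision tree (the interval, spider name, and module name below are placeholders):

defmodule MyApp.SpiderScheduler do
  use GenServer

  @interval :timer.hours(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule_tick()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:tick, state) do
    # Try to start the spider; the result is ignored if it is already running
    _ = Crawly.Engine.start_spider(MySpider)
    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick(), do: Process.send_after(self(), :tick, @interval)
end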

WriteFile produces invalid json

When writing JSON, WriteToFile produces invalid JSON, as it uses a newline to separate items.

{"title": "first item"}
{"title": "second item"}

The items should be in a list, comma separated:

[
    {"title": "first item"},
    {"title": "second item"}
]
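
The newline-separated output follows the JSON Lines (.jl) convention, where every line is a standalone JSON object. If a single JSON array is needed, a small post-processing sketch (assuming the Jason library is available in the project):

items =
  "/tmp/BooksToScrape.jl"
  |> File.stream!()
  |> Enum.map(&Jason.decode!/1)

File.write!("/tmp/BooksToScrape.json", Jason.encode!(items))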

Add lightweight UI for Crawly Management

As it was discussed here: #97 (comment)
we want to build a lightweight (probably HTTP-based) UI for single-node Crawly operations, for people who don't want (or don't need) to use the more complex https://github.com/oltarasenko/crawly_ui.

As we see it now, we need to develop a lightweight HTTP client (alternatively, we might look into command-line clients like https://github.com/ndreynolds/ratatouille):

  1. without an external database as a dependency
  2. which allows scheduling/stopping jobs on the given node
  3. which allows seeing currently running jobs [ideally with metrics like crawl speed]
  4. which allows scheduling jobs with given parameters, like concurrency

Improve tests for CSV encoder pipeline

Some tests are quite ugly at the moment :(. Need to check the race condition here:

  1) test CSV encoder test Items are stored in CSV after csv pipeline (DataStorageWorkerTest)
     test/data_storage_worker_test.exs:149
     ** (MatchError) no match of right hand side value: {:error, :already_started}
     stacktrace:
       test/data_storage_worker_test.exs:6: DataStorageWorkerTest.__ex_unit_setup_0/1
       test/data_storage_worker_test.exs:1: DataStorageWorkerTest.__ex_unit__/2

It fails the CI pipeline for master.

Postgres instead of files.

I want to store scraped data in Postgres, with the help of Ecto of course.
Is there a best practice for this?

Improvements to output file names of spiders

Right now the file name is the spider's module name and there's no way to configure it. Because the file name is static, previous data is overwritten, whereas scrapers usually add a timestamp to the file name. That would be very nice, because previous results would be automatically retained and it would be immediately clear when the parsing was done.

Spiders could have an optional name callback to let users customize this, and the default naming could include a timestamp.

Is it possible to have a configuration per spider?

I am currently running into an issue where one of my spiders gets denied for making too many requests, and I would like to set the concurrent requests for that specific spider to 1 without affecting the other spiders; so far I haven't found a way to do this.

Currently, the only way I see of achieving what I want is creating a separate application for each spider, each with its own config, which doesn't feel optimal, as I will end up with probably 50+ spiders, meaning 50+ apps.

The question: is there currently a way to make configurations spider-specific, and if not, do you intend to make that possible in the future?
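
Recent Crawly versions expose an optional override_settings/0 callback on the spider for exactly this kind of per-spider tuning; a sketch under that assumption (verify the callback name and supported keys against the docs for your version):

defmodule MyApp.SlowSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def override_settings() do
    # Only this spider is limited to one concurrent request per domain
    [concurrent_requests_per_domain: 1]
  end

  # base_url/0, init/0 and parse_item/1 defined as usual
end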

Ecto item pipeline configuration

I have a problem building a custom Ecto pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines

If I get it right, I need to add the pipeline module to the pipelines: [] section of my config.exs file?
Also, I don't understand what this part means: MyApp.insert_with_ecto(item). Do you mean Repo.insert here, or something else?
Can you please describe in more detail how I should wire it up?

P.S. I apologize if the questions seem stupid, but I still do not understand how this works. I will be thrilled when I figure it out. Ecto is the missing puzzle piece for my Crawly projects.
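
For what it's worth, a minimal sketch of such a pipeline, assuming the application defines MyApp.Repo and a MyApp.Item Ecto schema with a changeset/2 function (MyApp.insert_with_ecto(item) in the docs stands in for code along these lines):

defmodule MyApp.Pipelines.StoreWithEcto do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Insert the scraped map into the database; drop the item if the insert fails
    case MyApp.Repo.insert(MyApp.Item.changeset(%MyApp.Item{}, item)) do
      {:ok, _record} -> {item, state}
      {:error, _changeset} -> {false, state}
    end
  end
end

It would then be added to the pipelines: [] list in config.exs, typically after Crawly.Pipelines.Validate.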

Move away from using global app env for pipeline module config

Pipeline config should be localized to the specific pipeline.
Benefits:

  1. Reduces clashes (when more middlewares/pipelines get built)
  2. makes parameter declaration clearer
  3. makes pipelines more "functional" and reusable
  4. Application.get_env usage within pipeline modules makes things less clear when declaring pipelines.

Related to #20; this would pave the way for adding logic into a pipeline (instead of having a fat pipeline module).

Proposed api:

  • Tuple definitions would only be required for modules that require configuration
    • at the pipeline level, they could throw an error when checking for parameters, or use default values.
  • a pipeline module can also be passed directly when no parameters are needed

For example:

pipelines: [
  ....
  MyCustom.Pipeline.CleanItem,
  {Crawly.Pipelines.Validate, item: [:title, :url] },
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title },
  Crawly.Pipelines.JSONEncoder
]

Besides adjusting existing built-in pipelines, this proposed change would also require the adjustment of Crawly.Utils.pipe to check for tuple definitions.

Example for CSV export

I see that you can export JSON with the following config:

  config :crawly,
    other configs...
    pipelines: [Crawly.Pipelines.JSONEncoder]

Is there a middleware for exporting to CSV format or a recommended way to do this?

Also, why does the file name end in .jl instead of .json when using the JSONEncoder?
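
Crawly ships a CSVEncoder pipeline that can be combined with WriteToFile; a hedged config sketch (option names may differ slightly between releases, so check the docs for your version):

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    {Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
    {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp"}
  ]

As for the extension: the JSONEncoder output is written as JSON Lines (one JSON object per line), which is why the file ends in .jl rather than .json.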

Improve user agents database

A lot of my crawls depend on proper user-agent strings. It's a bit hard to supply user agents using the config as we're doing now. It would be good to have a database of user agents and to pick user agents from it. I am thinking of a standalone application with a simple interface, which could then be integrated into Crawly.

We could get a database from http://www.useragentstring.com/pages/api.php or any other service.

Request throttling

This feature would involve rate limiting of requests made by a spider, such as X requests per min.

Splitting spider logs

One of the concerns I want to address in the new milestone is Logs.
We need to be able to split logs per spider if requested. It would allow us to have a spider's logs in one place, to understand the performance of a job, what was dropped, and so on.

As I see it should be quite similar to: https://docs.scrapy.org/en/latest/topics/logging.html

@Ziinc I wonder if you have solved this problem on your own already or have some ideas to share?

Crawly does not return any data.

Hello!

I copied the example repos and tried to run them on my system. What am I doing wrong?

iex(1)> Crawly.Engine.start_spider(Esl)

16:46:27.399 [info]  Starting the manager for Elixir.Esl
 
16:46:27.409 [debug] Starting requests storage worker for Elixir.Esl...
 
16:46:27.514 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)> 
16:47:27.515 [info]  Current crawl speed is: 0 items/min
 
16:47:27.515 [info]  Stopping Esl, itemcount timeout achieved

Improvements for spider management

Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.

I think some critical areas are:

  • spider crawl stats (scraped item count, dropped request/item count, scrape speed)
  • stop_all_spiders to stop all running spiders

The stopping of spiders should be easy to implement.

For the spider stats, since some of the data is nested quite deep in the supervision tree, I'm not so sure how to get it to "bubble up" to the Crawly.Engine level.

@oltarasenko thoughts?

Proxy setup.

Hello!

When connecting to a proxy, my IP does not change. I am using ProxyMesh. When I try on my machine via OS settings, connections over HTTPS work fine. Does Crawly support HTTPS proxies? Could that be the cause of the issue?

Here my config file:

use Mix.Config
# in config.exs
config :crawly,
  proxy: "us-ca.proxymesh.com:31280",
  closespider_timeout: 10,
  concurrent_requests_per_domain: 7,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url]},
    # {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
   ],
   port: 4001

To check the IP used by the processes, I used this small module:

defmodule Spider.Proxy do
  @behaviour Crawly.Spider

  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://whatismyipaddress.com/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://whatismyipaddress.com/"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    item = %{
      url:
        Floki.find(
          response.body,
          "div#section_left > div:nth-of-type(2) > div:nth-of-type(1) > a"
        )
        |> Floki.text()
    }

    %Crawly.ParsedItem{:items => [item], :requests => []}
  end
end

WriteFile folder option improvements

Hi everyone! I've been using Crawly recently and I found the folder option a bit confusing.

The folder is always set to /tmp, which makes it seem that only absolute paths are allowed. A single example with a local or ~ path would make things clearer, for example in the last example here: https://hexdocs.pm/crawly/Crawly.Pipelines.WriteToFile.html

The other thing that bugs me is that the folder has to exist. It would be nicer if the folder were created when missing. This would open the possibility of setting the default to a local path, making it more immediately obvious whether the parser is working.

Add options for HTTPoison

How is it possible to pass options to HTTPoison when using Crawly?
For example, options such as:


url = "https://example.com/api/endpoint_that_needs_a_bearer_token"
headers = []
options = [ssl: [{:versions, [:'tlsv1.2']}], recv_timeout: 500]
{:ok, response} = HTTPoison.get(url, headers, options)

And how can these options then be used from a spider?
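
Crawly provides a RequestOptions middleware meant for passing such options through to HTTPoison; a sketch assuming a reasonably recent Crawly version (verify the middleware name and accepted options against your version's docs):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent,
    # These options are attached to every request and forwarded to HTTPoison
    {Crawly.Middlewares.RequestOptions, [ssl: [{:versions, [:'tlsv1.2']}], recv_timeout: 500]}
  ]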

Dumping data to DB instead of a file

Thanks for this awesome library <3

Is there any possibility of writing the scraped items to a database with Ecto instead of writing them to a file?

Pattern for handling authentication of requests

Hi,

I'm interested in knowing what the appropriate pattern is for authenticating a spider. Most of the spiders I'm writing need to log in before they can scrape the content I need access to. What would be the normal pattern for this? I've not found any examples of it.

My guess would be to write a middleware to perform the login and set the auth cookies; however, the authentication process differs between spiders. Would this be done within the spider itself, perhaps in the init() function?

Thanks for any help.
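
One possible sketch: perform the login in init/0 and attach the resulting cookie to the seed requests. This assumes your Crawly version accepts start_requests from init/0 (newer releases do), and the login endpoint, credentials, and URLs below are placeholders for the target site:

defmodule MyApp.AuthedSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Log in once and capture the session cookie (endpoint and form fields are placeholders)
    {:ok, %HTTPoison.Response{headers: headers}} =
      HTTPoison.post("https://example.com/login", {:form, [user: "me", password: "secret"]})

    {_key, cookie} = List.keyfind(headers, "Set-Cookie", 0, {"Set-Cookie", ""})

    requests =
      ["https://example.com/private/page1"]
      |> Enum.map(fn url ->
        url
        |> Crawly.Utils.request_from_url()
        |> Map.put(:headers, [{"Cookie", cookie}])
      end)

    [start_requests: requests]
  end

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end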

Proxy support?

Hi, great project. I'm considering a rewrite of my existing Scrapy scraper, but I can't seem to find any proxy support in the docs. Is there a way to customize requests to go through a proxy?

Passing spider initialization options

init/0 results in a "hardcoded" way of providing initial urls to crawl.

For example, during application runtime, if I wanted to feed URLs to the spider to crawl, I would not be able to do so without a convoluted method of fetching those URLs (e.g. from a database).

I propose allowing options to be passed as arguments to the spider's init callback, passed through the Crawly.Engine.start_spider function. These options are optional, and it is up to the user to handle them in the spider.

Example:

Crawly.Engine.start_spider(MySpider, urls: ["my urls"], pagination: false)

#in the spider
def init(opts \\ []) do
  # default options
  opts = Enum.into(opts, %{urls: ["Other url"], pagination: true})
  # do something with pagination flag
  # ....

  [start_urls: opts[:urls] ]
end

Add pluggable HTTP backends

What:

Currently Crawly uses HTTPoison to perform requests. We want to make it more dynamic, to be able to use other HTTP clients, and headless browsers.

Why:

All currently known HTTP clients have some specific behaviors. Some sites would ban everything which does not look like a browser. Some sites would use JS to render web pages. We need to be able to address all these problems by dynamically configuring which backend to use in each concrete situation.

Why does Crawly not use the Poolboy library?

It might look a little off-topic. This question is interesting to me in terms of my own education.

Is there some reason why Crawly does not use the Poolboy library?

Add versioning to the documentation

Looks like it's quite hard to maintain the documentation in good shape if you have multiple versions with different settings (and especially different, slightly diverging, tutorials).

We need to have versioning.

I am slightly biased against the standard Elixir style of documenting the code (e.g. having very large docstrings makes the code unreadable, at least to me).

I would try to add an index page to docsify, and would store standalone copies for different versions (in case of major updates to the API).

I would like to contribute to this project

Hello,
I've learned the basics of Elixir and OTP and created an online card game with Phoenix (including sockets, channels and presence). I'd like to contribute to this project to gain more experience in Elixir.
Do you have any tasks I could help with? I've seen on your roadmap that you want a UI for job management; that seems like an interesting feature to build.

on_finish callback for spiders

I believe an optional on_finish/0 callback would be very beneficial (I personally need to know when my spiders finish, and I would rather not poke Crawly.Engine every X seconds with Crawly.Engine.running_spiders to check if my spider is still running).

It could be called, if defined, prior to
GenServer.call(__MODULE__, {:stop_spider, spider_name})
in the Engine.

I have tested this change with my application and it seems to work; however, I am very new to Elixir and I don't know this repo well, so I'm not certain this is an acceptable solution to the problem, or that it would cover all cases of spiders stopping.
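
For reference, a sketch of the proposed hook roughly as described above (illustrative, not the current Engine code):

# Inside Crawly.Engine.stop_spider/1, before the spider is actually stopped:
if function_exported?(spider_name, :on_finish, 0) do
  spider_name.on_finish()
end

GenServer.call(__MODULE__, {:stop_spider, spider_name})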
