crawly's Introduction

Crawly

Overview

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Requirements

  1. Elixir ~> 1.14
  2. Works on GNU/Linux, Windows, macOS, and BSD.

Quickstart

  1. Create a new project: mix new quickstart --sup

  2. Add Crawly and Floki as dependencies:

    # mix.exs
    defp deps do
        [
          {:crawly, "~> 0.17.0"},
          {:floki, "~> 0.33.0"}
        ]
    end
  3. Fetch dependencies: $ mix deps.get

  4. Create a spider

     # lib/crawly_example/books_to_scrape.ex
     defmodule BooksToScrape do
       use Crawly.Spider
    
       @impl Crawly.Spider
       def base_url(), do: "https://books.toscrape.com/"
    
       @impl Crawly.Spider
       def init() do
         [start_urls: ["https://books.toscrape.com/"]]
       end
    
       @impl Crawly.Spider
       def parse_item(response) do
         # Parse response body to document
         {:ok, document} = Floki.parse_document(response.body)
    
         # Create items (on pages where items exist)
         items =
           document
           |> Floki.find(".product_pod")
           |> Enum.map(fn x ->
             %{
               title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
               price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
               url: response.request_url
             }
           end)
    
         next_requests =
           document
           |> Floki.find(".next a")
           |> Floki.attribute("href")
           |> Enum.map(fn url ->
             Crawly.Utils.build_absolute_url(url, response.request.url)
             |> Crawly.Utils.request_from_url()
           end)
    
         %Crawly.ParsedItem{items: items, requests: next_requests}
       end
     end

    New in 0.15.0:

    You can use a generator command to speed up spider creation; it produces a file with all required callbacks: mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape

  5. Configure Crawly

    By default, Crawly does not require any configuration, but you will likely want to fine-tune the crawl (in config/config.exs):

     import Config
    
     config :crawly,
       closespider_timeout: 10,
       concurrent_requests_per_domain: 8,
       closespider_itemcount: 100,
    
       middlewares: [
         Crawly.Middlewares.DomainFilter,
         Crawly.Middlewares.UniqueRequest,
         {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
       ],
       pipelines: [
         {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
         {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
         Crawly.Pipelines.JSONEncoder,
         {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
       ]

    New in 0.15.0:

    You can generate an example config with the following command: mix crawly.gen.config

  6. Start the Crawl:

      iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
  7. Results can be seen with:

    $ cat /tmp/BooksToScrape_<timestamp>.jl
    

Running Crawly without Elixir or an Elixir project

It's possible to run Crawly in standalone mode, where Crawly runs as a tiny Docker container and spiders are just YML files or Elixir modules mounted inside it.

Please read more about it here:

Need more help?

Please use Discussions for all conversations related to the project.

Browser rendering

Crawly can be configured so that all fetched pages are browser-rendered, which can be very useful if you need to extract data from pages with lots of asynchronous elements (for example, parts loaded by AJAX).

You can read more here:

Simple management UI (New in 0.15.0)

Crawly provides a simple management UI, served by default on localhost:4001.

It allows you to:

  • Start spiders
  • Stop spiders
  • Preview scheduled requests
  • View/Download items extracted
  • View/Download logs

NOTE: It's possible to disable the simple management UI (and REST API) with the start_http_api?: false option in the Crawly configuration.

You can choose to run the management UI as a plug in your application.

defmodule MyApp.Router do
  use Plug.Router

  ...
  forward "/admin", Crawly.API.Router
  ...
end

Crawly Management UI

Experimental UI [Deprecated]

We currently don't have the capacity to work on the experimental UI built with Phoenix and LiveView, and we keep it here mainly for demo purposes.

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders. Check out the code on GitHub.

Documentation

Roadmap

To be discussed

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring
  7. Building a Chrome-based fetcher for Crawly

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.

Production usages

Using Crawly in production? Please let us know about your use case!

Copyright and License

Copyright (c) 2019 Oleg Tarasenko

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

How to release:

  1. Update version in mix.exs
  2. Update version in quickstart (README.md, this file)
  3. Commit and create a new tag: git commit && git tag 0.xx.0 && git push origin master --follow-tags
  4. Build docs: mix docs
  5. Publish hex release: mix hex.publish

crawly's Issues

Downloading Files

Are there any examples of saving multiple files? For example, saving multiple images for each Crawly request. So far I have come across the WriteToFile pipeline, which seems to be used for saving data into a single file (CSV, JSON, etc.).
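
There is no dedicated pipeline for this in the question's scope, but a custom pipeline is one way to sketch it. Below is a minimal, hypothetical example that downloads every URL stored under an illustrative :image_urls item field and writes each file to disk; the module name, field name, and lack of error handling are assumptions, not Crawly APIs (HTTPoison is assumed available, as Crawly already depends on it):

defmodule MyApp.Pipelines.SaveImages do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ []) do
    folder = Keyword.get(opts, :folder, "/tmp")

    item
    |> Map.get(:image_urls, [])
    |> Enum.each(fn url ->
      # Download each image and store it under its base name (no retries here)
      case HTTPoison.get(url) do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
          File.write!(Path.join(folder, Path.basename(url)), body)

        _error ->
          :ok
      end
    end)

    # Pass the item on unchanged so later pipelines still see it
    {item, state}
  end
end

It could then be listed as {MyApp.Pipelines.SaveImages, folder: "/tmp/images"} in pipelines:, before the encoder/writer pipelines.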

How to use Crawly when there is no need to collect links?

I have about 250k URLs in a database that need to be enriched: go to every link in the list and parse its HTML. Is there a good way to feed them into the Crawly queue? Is Crawly suitable for this case?
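
One possible sketch: load the URLs in the spider's init/0 and return no follow-up requests from parse_item/1, so only the seeded pages are fetched. MyApp.Repo and MyApp.Link below are hypothetical application modules, not part of Crawly:

defmodule MyApp.EnrichSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Feed all stored URLs as start_urls instead of hardcoding them
    urls = MyApp.Link |> MyApp.Repo.all() |> Enum.map(& &1.url)
    [start_urls: urls]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      url: response.request_url,
      title: document |> Floki.find("title") |> Floki.text()
    }

    # No link extraction: the queue only ever contains the seeded URLs
    %Crawly.ParsedItem{items: [item], requests: []}
  end
end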

Scraping different items from the same spider, each with different pipeline requirements

Problem:
Crawly only allows scraping a single item type. However, what if I am crawling two different sites with vastly different items?

For example, web page A (e.g. a blog) will have:

  • comments
  • article content
  • related links
  • title

while web page B (e.g. a weather site) will have:

  • temperature
  • country

In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the end item declaration in config will be:

item: [:title, :comments, :article_content, :related_links, :temperature, :country]

The issues are that:

  1. They share the same pipeline. When scraping the weather data, the blog-related fields will be blank, and vice versa when scraping the blog. This will affect pipeline validations, since the pipeline is shared.
  2. Output of the two items will be the same.
  3. Duplication checks (such as :item_id) are not item-specific. Since the item type from the weather site has no title, I can't specify an item-type-specific field.

I have some idea of how this could be implemented, taking inspiration from Scrapy.
We could define item structs and route items to their appropriate pipelines according to their struct.

Using the tutorial as an example:

using this ideal scenario config:

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {MyItemStruct, [
        Crawly.Pipelines.Validate,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]},
    {MyOtherItemStruct, [
        Crawly.Pipelines.Validate,
        Crawly.Pipelines.CleanMyData,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :name }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]},
  ]

with the spider implemented like so:

 @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
    name = response.body |> Floki.find("article.blog_post h2") |> Floki.text()

    %{
         :requests => requests,
         :items => [
            %MyItemStruct{title: title, url: response.request_url},
            %MyOtherItemStruct{name: name, url: response.request_url}
         ]
      }
  end

The returned items then can get sorted into their specified pipelines.

This configuration method proposes the following:

  • allow declaration of item-specific pipelines for multiple items
  • allow passing of arguments to a pipeline implementation e.g. {MyPipelineModule, validate_more_than: 5}

For backwards compatibility, a single-item pipeline could still be declared as before; the struct-based declaration would only be needed for multi-item pipelines.

Do let me know what you think @oltarasenko

Unable to get up and running from the quick start

Hi, I am unable to scrape the Erlang Solutions blog as the quickstart guide describes here:
https://github.com/oltarasenko/crawly#quickstart

Attempting to run the spider through iex results in:

iex(1)> Crawly.Engine.start_spider(MyCrawler.EslSpider)
[info] Starting the manager for Elixir.MyCrawler.EslSpider
[debug] Running spider init now.
[debug] Scraped ":title,:url"
[debug] Starting requests storage worker for Elixir.MyCrawler.EslSpider...
[debug] Started 2 workers for Elixir.MyCrawler.EslSpider
:ok
iex(2)> [info] Current crawl speed is: 0 items/min
[info] Stopping MyCrawler.EslSpider, itemcount timeout achieved

I'm quite lost, as there is no way for me to debug whether it is a network issue (highly unlikely, since I can access the ESL website through my browser) or an issue with the URLs being filtered out.

defmodule MyCrawler.EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils
  require Logger
  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    Logger.debug("Running spider init now.")
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    IO.inspect(response)
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    # Modified this to make it even more general, to eliminate the possibility of selector problem
    title = response.body |> Floki.find("title") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end

Of note is that the spider does not even call the parse_item callback, as the IO.inspect of the response is never called.

Config is as follows:

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 2,
  follow_redirects: true,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title

Discussion: Store state for a fault-tolerant crawler

I think Postgres would be a good way to store the spider state so that, if the system crashes, it can continue the crawl from where it stopped.

Are there any recommendations or suggestions on how to implement this in my current Crawly project?

Tag jobs with unique id

One of the common problems I am facing right now is that it's not possible to separate one Crawly job from another from an external point of view.

E.g. the same spider can be executed multiple times. How do we know that the data came from a given run? For example, at http://18.216.221.122/ we need an ID in order to group items inside the UI.

I plan to:

  1. Extend the engine so that start_spider accepts an optional job_id parameter
  2. The engine would automatically generate the job_id if not provided
  3. Crawly will use this job_id when communicating with the external world. E.g. if we are shipping logs somewhere, it will be used; if we're shipping items, it will be used.

What do you think about the idea?
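
A rough sketch of how the proposed API could look from the caller's side (illustrative only; the option name is part of the proposal, not the current interface):

# Caller supplies an explicit job id...
Crawly.Engine.start_spider(MySpider, job_id: "esl-blog-2020-05-01")

# ...or omits it and lets the engine generate one
Crawly.Engine.start_spider(MySpider)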

What is the recommended "concurrent_requests_per_domain"?

It looks like it's easy to overflow the queue and bring the entire system down. This happens when I set concurrency to more than ~20. What is the recommended concurrency setting for Crawly? What performance can you achieve with it?

22:27:10.427 [error] GenServer #PID<0.527.0> terminating
** (MatchError) no match of right hand side value: {:empty, {[], []}}
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:509: :hackney_pool.queue_out/2
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:376: :hackney_pool.dequeue/3
    (hackney 1.15.2) /Users/mycomputer/Documents/Projects/Playgraound/homebase/deps/hackney/src/hackney_pool.erl:349: :hackney_pool.handle_info/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.4051887452.3045064705.41660>, :request, #PID<0.455.0>, :shutdown}
State: {:state, :default, {:metrics_ng, :metrics_dummy}, 50, 150000, ...} (hackney pool state for 'www.homebase.co.uk', truncated)

Spider scheduling for continuous crawling

I'm looking to do cron-style spider scheduling, where the engine starts the spider if it is not running at a scheduled timing interval.

Should this be within the Engine or Commander (?) module context?

This would require the Engine to be part of a supervision tree, I think.
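
Until something like that lands in Crawly itself, a minimal sketch of cron-style scheduling from the host application is a plain GenServer in your own supervision tree (the interval, spider name, and module name below are placeholders):

defmodule MyApp.SpiderScheduler do
  use GenServer

  @interval :timer.hours(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule_tick()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:tick, state) do
    # Try to start the spider; the result is ignored if it is already running
    _ = Crawly.Engine.start_spider(MySpider)
    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick(), do: Process.send_after(self(), :tick, @interval)
end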

WriteFile produces invalid json

When writing JSON, WriteToFile produces invalid JSON, as it uses a newline to separate items.

{"title": "first item"}
{"title": "second item"}

The items should be in a list, comma separated:

[
    {"title": "first item"},
    {"title": "second item"}
]
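
The newline-separated output follows the JSON Lines (.jl) convention, where every line is a standalone JSON object. If a single JSON array is needed, a small post-processing sketch (assuming the Jason library is available in the project):

items =
  "/tmp/BooksToScrape.jl"
  |> File.stream!()
  |> Enum.map(&Jason.decode!/1)

File.write!("/tmp/BooksToScrape.json", Jason.encode!(items))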

Add lightweight UI for Crawly Management

As it was discussed here: #97 (comment)
we want to build a lightweight (probably HTTP-based) UI for single-node Crawly operations, for people who don't want (or don't need) to use the more complex https://github.com/oltarasenko/crawly_ui.

As we see it now, we need to develop a lightweight HTTP client (alternatively, we might look into command-line clients like https://github.com/ndreynolds/ratatouille):

  1. without an external database as a dependency
  2. which allows scheduling/stopping jobs on the given node
  3. which allows seeing currently running jobs [ideally with metrics like crawl speed]
  4. which allows scheduling jobs with given parameters, like concurrency

Improve tests for CSV encoder pipeline

Some tests are quite ugly at the moment :(. Need to check the race condition here:

  1) test CSV encoder test Items are stored in CSV after csv pipeline (DataStorageWorkerTest)
     test/data_storage_worker_test.exs:149
     ** (MatchError) no match of right hand side value: {:error, :already_started}
     stacktrace:
       test/data_storage_worker_test.exs:6: DataStorageWorkerTest.__ex_unit_setup_0/1
       test/data_storage_worker_test.exs:1: DataStorageWorkerTest.__ex_unit__/2

It fails the CI pipeline for master.

Postgres instead of files.

I want to store scraped data in Postgres, with the help of Ecto of course.
Is there a best practice for this?

Improvements to output file names of spiders

Right now the file name is the spider's module name and there's no way to configure it. Because the file name is static, previous data is overwritten, whereas scrapers usually add a timestamp to the file name. That would be very nice, because previous results would be automatically retained and it would be immediately clear when the parsing was done.

Spiders could have an optional name callback to let users customize this, and the default naming could include a timestamp.

Is it possible to have a configuration per spider?

I am currently running into an issue where one of my spiders gets denied for making too many requests, and I would like to set the concurrent requests for that specific spider to 1 without affecting the other spiders; so far I haven't found a way to do this.

Currently, the only way I see of achieving what I want is creating a separate application for each spider, each with its own config, which doesn't feel optimal, as I will end up with probably 50+ spiders, meaning 50+ apps.

The question: is there currently a way to make configurations spider-specific, and if not, do you intend to make that possible in the future?
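
Recent Crawly versions expose an optional override_settings/0 callback on the spider for exactly this kind of per-spider tuning; a sketch under that assumption (verify the callback name and supported keys against the docs for your version):

defmodule MyApp.SlowSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def override_settings() do
    # Only this spider is limited to one concurrent request per domain
    [concurrent_requests_per_domain: 1]
  end

  # base_url/0, init/0 and parse_item/1 defined as usual
end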

Ecto item pipeline configuration

I have a problem building a custom Ecto pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines

If I get it right, I need to add the pipeline module to the pipelines: [] section of my config.exs file?
Also, I don't understand what this part means: MyApp.insert_with_ecto(item). Do you mean Repo.insert here, or something else?
Can you please describe in more detail how I should wire it up?

P.S. I apologize if the questions seem stupid, but I still do not understand how this works. I will be thrilled when I figure it out. Ecto is the missing puzzle piece for my Crawly projects.
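
For what it's worth, a minimal sketch of such a pipeline, assuming the application defines MyApp.Repo and a MyApp.Item Ecto schema with a changeset/2 function (MyApp.insert_with_ecto(item) in the docs stands in for code along these lines):

defmodule MyApp.Pipelines.StoreWithEcto do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Insert the scraped map into the database; drop the item if the insert fails
    case MyApp.Repo.insert(MyApp.Item.changeset(%MyApp.Item{}, item)) do
      {:ok, _record} -> {item, state}
      {:error, _changeset} -> {false, state}
    end
  end
end

It would then be added to the pipelines: [] list in config.exs, typically after Crawly.Pipelines.Validate.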

Move away from using global app env for pipeline module config

Pipeline config should be localized to the specific pipeline.
Benefits:

  1. Reduces clashes (when more middlewares/pipelines get built)
  2. makes parameter declaration clearer
  3. makes pipelines more "functional" and reusable
  4. Application.get_env usage within pipeline modules makes things less clear when declaring pipelines.

Related to #20; this would pave the way for adding logic into a pipeline (instead of having a fat pipeline module).

Proposed api:

  • Tuple definitions would only be required for modules that require configuration
    • at the pipeline level, they could throw an error when checking for parameters, or use default values.
  • a pipeline module can also be passed directly when no parameters are needed

For example:

pipelines: [
  ....
  MyCustom.Pipeline.CleanItem,
  {Crawly.Pipelines.Validate, item: [:title, :url] },
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title },
  Crawly.Pipelines.JSONEncoder
]

Besides adjusting existing built-in pipelines, this proposed change would also require the adjustment of Crawly.Utils.pipe to check for tuple definitions.

Example for CSV export

I see that you can export JSON with the following config:

  config :crawly,
    other configs...
    pipelines: [Crawly.Pipelines.JSONEncoder]

Is there a middleware for exporting to CSV format or a recommended way to do this?

Also, why does the file name end in .jl instead of .json when using the JSONEncoder?
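
Crawly ships a CSVEncoder pipeline that can be combined with WriteToFile; a hedged config sketch (option names may differ slightly between releases, so check the docs for your version):

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    {Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
    {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp"}
  ]

As for the extension: the JSONEncoder output is written as JSON Lines (one JSON object per line), which is why the file ends in .jl rather than .json.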

Improve user agents database

A lot of my crawls depend on proper user-agent strings. It's a bit hard to supply user agents using the config as we're doing now. It would be good to have a database of user agents and to pick user agents from it. I am thinking of a standalone application with a simple interface, which could then be integrated into Crawly.

We could get a database from http://www.useragentstring.com/pages/api.php or any other service.

Request throttling

This feature would involve rate limiting of requests made by a spider, such as X requests per min.

Splitting spider logs

One of the concerns I want to address in the new milestone is Logs.
We need to be able to split logs per spider if requested. It would allow us to have a spider's logs in one place, to understand the performance of a job, what was dropped, and so on.

As I see it should be quite similar to: https://docs.scrapy.org/en/latest/topics/logging.html

@Ziinc I wonder if you have solved this problem on your own already or have some ideas to share?

Crawly does not return any data.

Hello!

I copied the example repos and tried to run them on my system. What am I doing wrong?

iex(1)> Crawly.Engine.start_spider(Esl)

16:46:27.399 [info]  Starting the manager for Elixir.Esl
 
16:46:27.409 [debug] Starting requests storage worker for Elixir.Esl...
 
16:46:27.514 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)> 
16:47:27.515 [info]  Current crawl speed is: 0 items/min
 
16:47:27.515 [info]  Stopping Esl, itemcount timeout achieved

Improvements for spider management

Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.

I think some critical areas are:

  • spider crawl stats (scraped item count, dropped request/item count, scrape speed)
  • stop_all_spiders to stop all running spiders

The stopping of spiders should be easy to implement.

For the spider stats, since some of the data is nested quite deep in the supervision tree, I'm not so sure how to get it to "bubble up" to the Crawly.Engine level.

@oltarasenko thoughts?

Proxy setup.

Hello!

When connecting to a proxy, my IP does not change. I am using ProxyMesh. When I try on my machine via OS settings, connections over HTTPS work fine. Does Crawly support HTTPS proxies? Could that be the cause of the issue?

Here my config file:

use Mix.Config
# in config.exs
config :crawly,
  proxy: "us-ca.proxymesh.com:31280",
  closespider_timeout: 10,
  concurrent_requests_per_domain: 7,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url]},
    # {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
   ],
   port: 4001

To check the IP used by the processes, I used this small module:

defmodule Spider.Proxy do
  @behaviour Crawly.Spider

  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://whatismyipaddress.com/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://whatismyipaddress.com/"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    item = %{
      url:
        Floki.find(
          response.body,
          "div#section_left > div:nth-of-type(2) > div:nth-of-type(1) > a"
        )
        |> Floki.text()
    }

    %Crawly.ParsedItem{:items => [item], :requests => []}
  end
end

WriteFile folder option improvements

Hi everyone! I've been using Crawly recently and I found the folder option a bit confusing.

The folder is always set to /tmp, which makes it seem that only absolute paths are allowed. A single example with a local or ~ path would make things clearer, for example in the last example here: https://hexdocs.pm/crawly/Crawly.Pipelines.WriteToFile.html

The other thing that bugs me is that the folder has to exist. It would be nicer if the folder were created when missing. This would open the possibility of setting the default to a local path, making it more immediately obvious whether the parser is working.

Add options for HTTPoison

How is it possible to pass options to HTTPoison when using Crawly?
For example, options such as:


url = "https://example.com/api/endpoint_that_needs_a_bearer_token"
headers = []
options = [ssl: [{:versions, [:'tlsv1.2']}], recv_timeout: 500]
{:ok, response} = HTTPoison.get(url, headers, options)

And how can these options then be used from a spider?
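
Crawly provides a RequestOptions middleware meant for passing such options through to HTTPoison; a sketch assuming a reasonably recent Crawly version (verify the middleware name and accepted options against your version's docs):

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent,
    # These options are attached to every request and forwarded to HTTPoison
    {Crawly.Middlewares.RequestOptions, [ssl: [{:versions, [:'tlsv1.2']}], recv_timeout: 500]}
  ]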

Dumping data to DB instead of a file

Thanks for this awesome library <3

Is there any possibility of writing the scraped items to a database with Ecto instead of writing them to a file?

Pattern for handling authentication of requests

Hi,

I'm interested in knowing what the appropriate pattern is for authenticating a spider. Most of the spiders I'm writing need to log in before they can scrape the content I need access to. What would be the normal pattern for this? I've not found any examples of it.

My guess would be to write a middleware to perform the login and set the auth cookies; however, the authentication process differs between spiders. Would this be done within the spider itself, perhaps in the init() function?

Thanks for any help.
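
One possible sketch: perform the login in init/0 and attach the resulting cookie to the seed requests. This assumes your Crawly version accepts start_requests from init/0 (newer releases do), and the login endpoint, credentials, and URLs below are placeholders for the target site:

defmodule MyApp.AuthedSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Log in once and capture the session cookie (endpoint and form fields are placeholders)
    {:ok, %HTTPoison.Response{headers: headers}} =
      HTTPoison.post("https://example.com/login", {:form, [user: "me", password: "secret"]})

    {_key, cookie} = List.keyfind(headers, "Set-Cookie", 0, {"Set-Cookie", ""})

    requests =
      ["https://example.com/private/page1"]
      |> Enum.map(fn url ->
        url
        |> Crawly.Utils.request_from_url()
        |> Map.put(:headers, [{"Cookie", cookie}])
      end)

    [start_requests: requests]
  end

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end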

Proxy support?

Hi, great project. I'm considering a rewrite of my existing Scrapy scraper, but I can't seem to find any proxy support in the docs. Is there a way to customize requests to go through a proxy?

Passing spider initialization options

init/0 results in a "hardcoded" way of providing initial urls to crawl.

For example, during application runtime, if I wanted to feed URLs to the spider to crawl, I would not be able to do so without a convoluted method of fetching those URLs (e.g. from a database).

I propose allowing options to be passed as arguments to the spider's init callback, passed through the Crawly.Engine.start_spider function. These options are optional, and it is up to the user to handle them in the spider.

Example:

Crawly.Engine.start_spider(MySpider, urls: ["my urls"], pagination: false)

#in the spider
def init(opts \\ []) do
  # default options
  opts = Enum.into(opts, %{urls: ["Other url"], pagination: true})
  # do something with pagination flag
  # ....

  [start_urls: opts[:urls] ]
end

Add pluggable HTTP backends

What:

Currently Crawly uses HTTPoison to perform requests. We want to make it more dynamic, to be able to use other HTTP clients, and headless browsers.

Why:

All currently known HTTP clients have some specific behaviors. Some sites would ban everything which does not look like a browser. Some sites would use JS to render web pages. We need to be able to address all these problems by dynamically configuring which backend to use in each concrete situation.

Why does Crawly not use the Poolboy library?

It might look a little off-topic. This question is interesting to me in terms of my own education.

Is there some reason why Crawly does not use the Poolboy library?

Add versioning to the documentation

Looks like it's quite hard to maintain the documentation in good shape if you have multiple versions with different settings (and especially different, slightly diverging, tutorials).

We need to have versioning.

I am slightly biased against the standard Elixir style of documenting the code (e.g. having very large docstrings makes the code unreadable, at least to me).

I would try to add an index page to docsify, and would store standalone copies for different versions (in case of major updates to the API).

I would like to contribute to this project

Hello,
I've learned the basics of Elixir and OTP and created an online card game with Phoenix (including sockets, channels and presence). I'd like to contribute to this project to gain more experience in Elixir.
Do you have any tasks I could help with? I've seen on your roadmap that you want a UI for job management; that seems like an interesting feature to build.

on_finish callback for spiders

I believe an optional on_finish/0 callback would be very beneficial (I personally need to know when my spiders finish, and I would rather not poke Crawly.Engine every X seconds with Crawly.Engine.running_spiders to check if my spider is still running).

It could be called, if defined, prior to
GenServer.call(__MODULE__, {:stop_spider, spider_name})
in the Engine.

I have tested this change with my application and it seems to work; however, I am very new to Elixir and I don't know this repo well, so I'm not certain this is an acceptable solution to the problem, or that it would cover all cases of spiders stopping.
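
For reference, a sketch of the proposed hook roughly as described above (illustrative, not the current Engine code):

# Inside Crawly.Engine.stop_spider/1, before the spider is actually stopped:
if function_exported?(spider_name, :on_finish, 0) do
  spider_name.on_finish()
end

GenServer.call(__MODULE__, {:stop_spider, spider_name})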
