Comments (21)
@sbpipb It's better to open a new issue for cases like this.
Your problem seems to be caused by an incorrect configuration in your config file: you have listed the Crawly.Pipelines.WriteToFile item pipeline in the list of middlewares. Please move it to the item pipelines instead and test whether that helps.
See paragraph 4 of the quickstart (https://hexdocs.pm/crawly/readme.html#quickstart) for details.
from crawly.
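A config along those lines, with the item pipeline moved out of the middlewares list, might look like the following sketch (the middleware set and folder path are illustrative, not taken from the reported config):

```elixir
import Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    # WriteToFile is an item pipeline, so it belongs here, not under :middlewares
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```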
@UnumUS sorry which repo are you talking about?
It might happen due to connectivity issues, for example.
Could be related to httpoison not being updated.
@Ziinc @oltarasenko Thanks for your replies!
Yes, the quickstart guide referenced an older version of Crawly. I updated it to the latest one. Now I have another type of issue:
```
iex(1)> Crawly.Engine.start_spider(Spider.Esl)
13:18:17.777 [debug] Starting the manager for Elixir.Spider.Esl
13:18:17.779 [debug] Starting requests storage worker for Elixir.Spider.Esl...
13:18:18.151 [debug] Started 4 workers for Elixir.Spider.Esl
:ok
iex(2)>
13:18:18.787 [debug] Could not parse item, error: :error, reason: :undef, stacktrace: [{Floki, :find, ["<!DOCTYPE html>..." <> ..., "a.more"], []}, {Spider.Esl, :parse_item, 1, [file: 'lib/spider/example_test.ex', line: 14]}, {Crawly.Worker, :parse_item, 1, [file: 'lib/crawly/worker.ex', line: 112]}, {:epipe, :run, 2, [file: '/Users/mycomputer/Documents/Projects/Playgraound/crawler_esl/deps/epipe/src/epipe.erl', line: 23]}, {Crawly.Worker, :handle_info, 2, [file: 'lib/crawly/worker.ex', line: 43]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]
13:18:18.788 [debug] Crawly worker could not process the request to "https://www.erlang-solutions.com/blog.html"
reason: :undef
13:19:18.152 [info] Current crawl speed is: 0 items/min
13:19:18.152 [info] Stopping Spider.Esl, itemcount timeout achieved
```
The spider code:

```elixir
defmodule Spider.Esl do
  @behaviour Crawly.Spider

  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
```
Sorry @UnumUS, it's my bug. I need to update the documentation. Please add Floki to the dependencies section of your mix.exs file:

```elixir
defp deps do
  [
    {:crawly, "~> 0.8.0"},
    {:floki, "~> 0.20.0"}
  ]
end
```
Alternatively, you could try Meeseeks, as shown here:
- https://github.com/oltarasenko/autosites/blob/master/mix.exs#L26
- https://github.com/oltarasenko/autosites/blob/master/lib/autosites/autotradercouk.ex
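For reference, a minimal sketch of the same link extraction using Meeseeks (assuming a `{:meeseeks, ...}` entry in deps; the `LinkExtractor` module name is illustrative):

```elixir
defmodule LinkExtractor do
  import Meeseeks.CSS

  # Collect the href attribute of every <a class="more"> element,
  # mirroring what the Floki-based spiders in this thread do.
  def hrefs(html) do
    html
    |> Meeseeks.all(css("a.more"))
    |> Enum.map(&Meeseeks.attr(&1, "href"))
  end
end
```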
I added Floki to my deps. I am still trying to run the example code you present in the guides. I am on EslSpider right now.
```
iex(1)> Crawly.Engine.start_spider(EslSpider)
15:07:27.474 [debug] Starting the manager for Elixir.EslSpider
15:07:27.476 [debug] Starting requests storage worker for Elixir.EslSpider...
15:07:27.484 [debug] Started 8 workers for Elixir.EslSpider
:ok
iex(2)>
15:07:29.372 [info] Dropping item: %{title: "", url: "https://www.erlang-solutions.com/blog.html"}. Reason: missing required fields
15:07:29.719 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
.....
15:08:27.485 [info] Current crawl speed is: 148 items/min
15:09:27.486 [info] Current crawl speed is: 0 items/min
15:09:27.486 [info] Stopping EslSpider, itemcount timeout achieved
```
The data was not stored, and the /tmp folder was not created.
The config file:
```elixir
use Mix.Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    # NEW IN 0.7.0
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ],
  port: 4001
```
Oh, @UnumUS, could you please point {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} at an existing folder?
If it is Windows, specifying folder: "/tmp" would result in an error, if I'm not wrong, since that path doesn't exist on Windows. Without the :folder option, it should determine the temp folder path based on the system.
@UnumUS the behaviour (on Unix) is such that it will create a file called /tmp/MySpiderName.jl
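As a quick check of what the system-determined temp folder would be, standard Elixir exposes it via System.tmp_dir!/0:

```elixir
# Prints the OS temp directory: typically "/tmp" on Linux,
# and usually a per-user path under /var/folders on macOS.
IO.puts(System.tmp_dir!())
```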
I pointed the config at an existing folder, /test, and created the file EslSpider.jl there. No errors occur; however, the EslSpider.jl file is empty. I am on macOS 10.15.3.
The last lines of output:
```
16:07:57.450 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
16:08:47.266 [info] Current crawl speed is: 148 items/min
16:09:47.267 [info] Current crawl speed is: 0 items/min
16:09:47.267 [info] Stopping EslSpider, itemcount timeout achieved
```
Sorry about the last post; I had not restarted my iex after the changes to the file were saved. When I point it at an existing folder I see the following errors:
```
....
16:17:38.670 [error] Pipeline crash: Elixir.Crawly.Pipelines.WriteToFile, error: :error, reason: {:badmatch, {:error, :enoent}}, args: [extension: "jl", folder: "/test"]
16:18:26.447 [info] Current crawl speed is: 148 items/min
16:19:26.448 [info] Current crawl speed is: 0 items/min
16:19:26.448 [info] Stopping EslSpider, itemcount timeout achieved
```
Hello, sorry to double-check: do you have the /test folder on your machine? Usually that's not the case on Unix-type systems. Another question is permissions.
Could you please run the following in your shell (bash, zsh): ls -la /test
Another question: what happens if you run File.open("/test/spider.jl", [:binary, :write, :utf8]) in your iex?
> Hello, sorry to double-check: do you have the /test folder on your machine? Usually that's not the case on Unix-type systems. Another question is permissions. Could you please run the following in your shell (bash, zsh): ls -la /test

```
mycomputer@MacBook-Pro-Mycomputer the_crawler % ls -la /test
ls: /test: No such file or directory
```

However, I can see the folder in my IDE.

> Another question: what happens if you run File.open("/test/spider.jl", [:binary, :write, :utf8]) in your iex?
Ok, now it's clear. The folder you're seeing in the IDE is not the absolute /test but your project's test folder.
The :folder in {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} should be an absolute path.
Otherwise, what do you see when calling: cat /tmp/EslSpider.jl
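If the intention was a test folder inside the project rather than the root-level /test, one way to give WriteToFile the absolute path it expects is to expand it at config time (a sketch; the folder name and pipeline list are illustrative):

```elixir
import Config

# Expand the project-relative "test" folder to an absolute path,
# since WriteToFile's :folder option expects an absolute path.
config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile,
     extension: "jl",
     folder: Path.expand("test", File.cwd!())}
  ]
```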
@oltarasenko Oh, I did not expect that! It works now. Thanks for the support!
> Sorry @UnumUS, it's my bug. I need to update the documentation. Please add Floki to the dependencies section of your mix.exs file:

```elixir
defp deps do
  [
    {:crawly, "~> 0.8.0"},
    {:floki, "~> 0.20.0"}
  ]
end
```

Need to update docs to:
- update quick start version
- add floki/meeseeks as a dependency
Docs have been updated. Pending a patch release.
Having the same issue, even though the path to the tmp directory exists.
```
iex(2)>
08:41:26.926 [error] Pipeline crash: Elixir.Crawly.Pipelines.WriteToFile, error: :error, reason: :undef, args: [folder: "/home/sbpipb/projects/price_spider/tmp", extension: "jl"]
```

```elixir
import Config

config :crawly,
  middlewares: [
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
     ]}
    {Crawly.Pipelines.WriteToFile,
     folder: "/home/sbpipb/projects/price_spider/tmp",
     extension: "jl"}
  ]
```

```
08:42:26.921 [info] Current crawl speed is: 1 items/min
08:43:26.922 [info] Current crawl speed is: 0 items/min
iex(6)> File.open("/home/sbpipb/projects/price_spider/tmp/test.jl", [:binary, :write, :utf8])
{:ok, #PID<0.490.0>}
iex(7)> File.open("/home/sbpipb/projects/price_spider/tmp/test.jl", [:binary, :write, :utf8])
{:ok, #PID<0.492.0>}
iex(8)>
```
@oltarasenko thanks for the help! I opted to upgrade my dependencies to match the quickstart guide, ran the mix gen config command, and now it's working for me. Thanks again for the swift replies!
Related Issues (20)
- `Could not start application crawly` when trying to enter IEx HOT 2
- js render problem HOT 6
- custom parsar callback sample HOT 7
- Could not compile dependency :epipe HOT 5
- Demo page not loading HOT 3
- Setting up a parametric spider (dynamic base_url and start_urls) HOT 1
- Use a more reliable website to crawl in tutorial HOT 1
- Any working examples? HOT 4
- jl files not found probably not writing HOT 1
- Crawly.fetch giving 301 response instead of 200 HOT 1
- My Spider's code is never invoked, weird behavior with `Crawly.RequestsStorage.pop` in library code HOT 5
- This is actually a question, Nested scraping HOT 2
- Genserver time out crash in long-running pipeline HOT 1
- Stop and resume the spider where it stopped HOT 2
- Protocol error HOT 7
- `Crawly.Fetchers.Fetcher` implementation for Playwright HOT 4
- robots.txt matching is pretty buggy HOT 10
- Running many instances of one spider HOT 3
- Make the management tool opt-in by default HOT 5
- Q: Can the spider "fan out" on a website? (multiple next items) HOT 1