Comments (24)
With parameterized pipeline definitions, one could pass options to DataStorage modules within the pipelines (#19 ) list. So for the CSV DataStorage, it could be:
pipelines: [
  ...
  MyCommon.Pipeline,
  {Crawly.DataStorage.FileStorageBackend,
   headers: [:id, :test],
   folder: "/tmp",
   include_headers: true,
   extension: "csv"}
]
instead of global configs like in #19:
config :crawly, Crawly.DataStorage.FileStorageBackend,
  folder: "/tmp",
  include_headers: false,
  extension: "jl"
Having global configs prevents having multiple differently-configured instances of a pipeline module within the same pipelines declaration. I liken this to how Elixir's piping works: each pipeline module either transforms the scraped data or performs a side effect (e.g. stores it) and returns it for downstream pipelines.
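A tuple-aware pipeline runner is all this would need. Here is a minimal sketch (the module name is illustrative, and `run/3` accepting opts is an assumed extension of the current `run/2` behaviour):

```elixir
defmodule PipeSketch do
  # Runs an item through a list of pipeline declarations. Bare modules are
  # called as module.run(item, state); {module, opts} tuples are called as
  # module.run(item, state, opts). A pipeline returning {false, state}
  # drops the item and halts the chain.
  def pipe(pipelines, item, state \\ %{}) do
    Enum.reduce_while(pipelines, {item, state}, fn
      {module, opts}, {item, state} -> continue(module.run(item, state, opts))
      module, {item, state} -> continue(module.run(item, state))
    end)
  end

  defp continue({false, state}), do: {:halt, {false, state}}
  defp continue({item, state}), do: {:cont, {item, state}}
end
```

A declaration like {Crawly.DataStorage.FileStorageBackend, extension: "csv"} would then reach the backend's run/3 as a keyword list.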
from crawly.
@Ziinc I will think about the best way of approaching the problem. I was working in the Scrapy core team some time in the past, and we were using one Item structure per project. However, there we used a concept of required/non-required fields.
I did not really want to define the item as we did it in Scrapy, as it seems to be overkill to have it defined separately.
Also, Crawly defines the item as a set of required fields (not a full set of fields, as it might seem: https://oltarasenko.github.io/crawly/#/?id=item-atom). So for now I would suggest removing the definition of non-overlapping fields from the item config.
Again, I like the idea of separating fetcher and parser even more. Let me think about how to plan the changes in the future.
@oltarasenko I understand what you mean and agree. It seems like additional boilerplate for not much benefit.
Perhaps looking at it from a different angle, the whole definition of an "item" in the config would be unnecessary. From my understanding, the defined item is only used by the Crawly.Pipelines.Validate pipeline module, and the item_id is only used by the Crawly.Pipelines.DuplicatesFilter pipeline module. With a tuple-format definition, these two parameters can be localized to the pipeline module definition instead of being a global config parameter, like so:
config :crawly,
  ...
  pipelines: [
    ...
    {Crawly.Pipelines.Validate, item: [:title, :url]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
    ...
  ]
With the tuple-format definitions, we can do something like this for configuring different pipeline requirements:
pipelines: [
  ...
  {Crawly.Pipelines.Validate, item: [:title, :url]},
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
  {Crawly.Pipelines.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :a) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:header_count]},
     ...
   ]}
]
And this could allow for multi-item-type logic within the pipeline:
pipelines: [
  ...
  MyCommon.Pipeline,
  {Crawly.Pipelines.Logic.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :a) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:hello, :world]},
     {Crawly.Pipelines.DuplicatesFilter, item_id: :world},
     ...
   ]},
  {Crawly.Pipelines.Logic.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :b) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:title, :url]},
     {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
   ]},
  MyCommon.Pipeline2
]
This takes inspiration from the Factorio game (haha), where you can define filters/logical triggers for moving resources. In essence, this approach allows splitting off items that match a logical condition into a separate pipeline, assuming an IfCondition module. Other logical modules might be feasible, such as a Case module or an IfElse module.
To accomplish something like this, it would require allowing a tuple definition. The Validate and DuplicatesFilter modules would benefit from a parameterized definition too.
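As a sketch of what a parameterized Validate could look like (run/3 with opts is an assumption; the fallback key mirrors the global :item config discussed above):

```elixir
defmodule ValidateSketch do
  # Drops items that are missing any of the required fields. The required
  # list comes from the tuple opts when given, falling back to the global
  # :crawly config otherwise.
  def run(item, state, opts \\ []) do
    required = Keyword.get(opts, :item, Application.get_env(:crawly, :item, []))

    if Enum.all?(required, &Map.has_key?(item, &1)) do
      {item, state}
    else
      {false, state}
    end
  end
end
```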
Well, another problem is being able to output CSV. E.g. one way or another we would have to extend the item definition (as in Scrapy), as otherwise we would not be able to output the CSV headers :(.
I was syncing up with some people from the Scrapy team. They are stating the following:
- Scrapy actually allows using different pipelines for different items. However, the configuration is not nice either: https://stackoverflow.com/questions/8372703/how-can-i-use-different-pipelines-for-different-spiders-in-a-single-scrapy-proje. Honestly, I would not want to follow their way here.
- It's quite a rare case to have multiple different items per project. In the example above they would prefer to have an Article definition with the comments inside it. So it will be the responsibility of the spider to validate comments.
- Another option is to disable the generic pipeline (ValidateItem in our case) and to put two custom ones instead, e.g. ValidateBlogComments/ValidateArticles.
As explained in the parent post body, my use case is about handling different types of scraped items, which may or may not use different spiders. Different types of scraped items would have different processing/validation requirements, hence the need for different pipelines.
If it helps make it clearer: I use (or at least am trying to use) a single instance of Crawly to manage all scraping needs of the main application, which requires many different web sources and feeds. The scraped data then gets cleaned and stored in the relevant database tables, or undergoes further processing steps. This last part is why some ability to separate scraped items would be great, as they currently all end up in the same pipeline declaration.
Right now my solution for this consists of pipeline-level matching, with each key corresponding to a specific scraped item type.
def run(%{my_item: item}, state) do
  # do things
end

def run(item, state), do: {item, state}
Should this be the standard way of approaching this problem?
This approach operates on a per-key basis, meaning that all items are going to consist of a single-key map.
I am about to introduce spider-level settings. E.g. it will be possible to do it inside the init function. So the idea is that specifying spider-level settings would allow you to override the global config. I have kind of made the required preparations here: e05b512
So one of the points is that the spider would be able to set a list of processors for every given request/item.
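The spider-level settings could then look something like this (a sketch of the idea under discussion; the exact keys are assumptions, not the committed API):

```elixir
# Hypothetical spider-level overrides returned from the spider's init/0.
# Keys that are not present would fall back to the global :crawly config.
def init() do
  [
    start_urls: ["https://www.example.com"],
    middlewares: [Crawly.Middlewares.DomainFilter, Crawly.Middlewares.UniqueRequest],
    pipelines: [{Crawly.Pipelines.Validate, item: [:title, :url]}]
  ]
end
```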
Looking through the commit, it seems the middlewares and item pipelines are going to be set to the default values from the config, and then the spider overrides them?
Yes. The idea I am playing with now is something like the following API:
(in Crawly.Request)

@spec new(url, headers, options) :: request
      when url: binary(),
           headers: [term()],
           options: [term()],
           request: Crawly.Request.t()
def new(url, headers \\ [], options \\ []) do
  # Define a list of middlewares which are used by default to process
  # incoming requests
  default_middlewares = [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt
  ]

  middlewares =
    Application.get_env(:crawly, :middlewares, default_middlewares)

  new(url, headers, options, middlewares)
end
@doc """
Same as Crawly.Request.new/3, but allows specifying middlewares as the 4th
parameter.
"""
@spec new(url, headers, options, middlewares) :: request
      when url: binary(),
           headers: [term()],
           options: [term()],
           # TODO: improve typespec here
           middlewares: [term()],
           request: Crawly.Request.t()
def new(url, headers, options, middlewares) do
  %Crawly.Request{
    url: url,
    headers: headers,
    options: options,
    middlewares: middlewares
  }
end
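Against the API above, a request could then carry its own middleware list (a usage sketch, not committed code):

```elixir
# Default middlewares come from the global config via new/3:
request = Crawly.Request.new("https://www.example.com")

# Per-request override via new/4:
request =
  Crawly.Request.new(
    "https://www.example.com",
    [{"User-Agent", "my-bot"}],
    [],
    [Crawly.Middlewares.UniqueRequest]
  )
```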
So at the end of the day, the worker will get a complete request with processors set. However, it may read the spider settings to see if an override is required.
I am fixing tests there now. Hope to show the code soon!
This would allow middlewares to be set on a per-request basis, but there isn't a way to specify pipelines on a per-scraped-item basis, as there isn't a standard struct for each scraped item.
Also, it seems simpler to do pattern matching on scraped items within the pipelines than to check and specify pipelines within the spider, since the latter causes the spider to be unnecessarily fat.
This would allow middlewares to be set on a per request basis, but there isn't a way to specify pipelines on a per scraped item basis as there isn't a standard struct for each scraped item
Yeah. I don't have a good answer for that :(. E.g. you're right, we can't do the same thing with items.
Let me think even more.
Also, it seems simpler to do pattern matching on scraped items within the pipelines than to check and specify pipelines within the spider, since it causes the spider to be unnecessarily fat.
Maybe. The spider should not be fat. The idea, for now, is to allow setting middlewares/pipelines in init...
Also, I am trying to avoid pattern matching in some cases. E.g. I don't really feel comfortable if some entity does pattern matches on complex structures defined elsewhere.
What do you think of:
- Moving middlewares outside of the request?
- Should we define Crawly.Item and ask people to define items using the use Crawly.Item macro, where the base item would be extended (and that base item defines the pipelines API)?
I'm gonna brain-dump my thoughts on what Crawly's selling points are to me, and my ideal-scenario type of architecture (bear with me, it might be long):
Crawly's selling points, to me, are the simplicity in how everything is just a series of pipelines. It is quite easy to visualize and understand how a request flows through the entire system from start to finish into a scraped item. This is because once I understood how an item pipeline was called, I immediately understood how a middleware worked as well, due to the reuse of the pipeline concept. I think that reusing this concept both maximizes ease of understanding of the overall system, and of how to write your own custom pipelines (which, for any advanced user, would eventually happen).
It is also why I think that the Crawly.ScrapedItem (or FetchedItem, as long as it is less ambiguous about what Item refers to) is a good idea, as it standardizes what a scraped item is and how it should be handled through the lifecycle.
With this being said, I do note that there is no pipeline lifecycle for handling what happens to a response after it is fetched, which is also where the retries from #39 would ideally be handled.
There is also no pipeline lifecycle for handling what happens to a request between when it is fetched from the request storage and when it is passed to the fetcher.
Request-Response-ScrapedItem Flow
(diagram not shown; it leaves out the part between the request storage and the fetcher)
Each phase between storage, fetcher, and spider needs some degree of control and customizability. For example, for the portion between the Fetcher and the Spider, there needs to be a way to control, handle, and customize the response received by the Fetcher. Since all the fetcher does is transform the request into the response, it should not be responsible for doing so. A separate pluggable pipeline module can be introduced between these two points (Fetcher, Spider) to handle issues like backoffs, retries, etc.
This Response handler section would ideally solve the retries issue, as it handles the Response lifecycle. If it reuses the pipeline concept, it will not be any additional difficulty to understand how the response moves between pipeline modules.
This makes Crawly a hyper-customizable data-fetching framework, which would be extremely attractive to any serious web scraper or data person. It is also simple and flexible enough to customize the request lifecycle (through middlewares) and the scraped item lifecycle (through item pipelines).
The only thing lacking now, I believe, is the lifecycle between when the request is fetched and when the response is passed to the spider. These are essentially the only parts that cannot be customized.
The 3 levels of configuration
Ideally, in my mind, there should be 3 different levels of configuration: a global default configuration (for pipelines, middlewares), a spider-level configuration (declared in init, applying to all requests from the spider, overriding the defaults), and a request-level middleware override (attached to the request).
In this case, the most recent PR (#39 ) will allow for request-level middleware overrides.
I am supportive of the idea of having middlewares declared in the Request struct, response handlers declared in the Response struct, and item pipelines declared in the ScrapedItem struct. This allows for maximum customizability and flexibility, which is always a plus.
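The resolution order for the three levels could be sketched as follows (illustrative names; nil means "not set at this level"):

```elixir
defmodule OverrideSketch do
  # Request-level settings win over spider-level settings, which in turn
  # win over the global defaults.
  def resolve(request_level, spider_level, global_default) do
    request_level || spider_level || global_default
  end
end
```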
To add to the 3 level overrides point:
Request-level overrides will be very rarely used and only in very unique cases.
To add to the ScrapedItem struct point: even with all of the above implemented, the spider will still have to specify which pipeline set to use, which will tend towards a fatter spider (as it declares the pipelines for each type of scraped item, and these declarations would also be duplicated for each different spider). Pattern matching based on item type does have its benefits, since the pipeline will know what the item structure looks like.
I think to summarize my points:
- keeping middlewares in the requests is a good idea
- a use Crawly.Item or use Crawly.ScrapedItem macro would help, but would make the system more opaque. There needs to be a way to selectively apply pipelines to a scraped item based on its inherent data type.
I think it's a useful conversation. Some of the flows above can already go into the Crawly documentation, as they explain the flow of things, which will allow people to understand the internals. I would appreciate it if you could summarize them in relevant PRs.
Crawly's selling points, to me, are the simplicity in how everything is just a series of pipelines. It is quite easy to visualize and understand how a request flows through the entire system from start to finish into a scraped item. This is because once I understood how a item pipeline was called, I immediately understood how a middleware worked as well, due to the reuse of the pipeline concept. I think that reusing this concept both maximizes ease of understanding of the overall system, and of how to write your own custom pipelines (which for any advanced user, would eventually happen).
Yeah. And we should keep it as simple as possible. In this regard I am still a bit unsure if pipelines/middlewares should really live within Request/Item. Also, maybe we should not really separate them; the idea is that these are just pre/post processors. I am still a bit unsure. However, to suggest something:
- Let me strip the Request changes out into a separate PR. We should really discuss them outside the Retries scope.
- The retries, for now, will use a static config to understand what needs to be skipped.
And finally:
a use Crawly.Item or use Crawly Scraped.Item macro would help, but would make the system more opaque. There needs to be a way to selectively apply pipelines to a scraped item based on its inherent data type.
One of the things I don't like regarding OOP is unexpected features. E.g. I would want to avoid the case where, if you're using a use macro, something magically injects functions into your map. From my experience, this makes it very hard to develop and debug such systems. It makes everything very complicated for users. I kind of like these slim items... (E.g. when I was using Scrapy, it always felt quite strange that you have to define an item, like in Django. But all the fields were actually text fields.)
Regarding your last points
Well, at the last minute I decided to keep the middlewares in the request as they are for now (as stripping middlewares out of the request would raise the question of identifying retries, which was solved).
I'll open up a PR for the documentation updates on the request-response-item flow.
As for going back to the main crux of this issue, it is about how to determine the pipeline for a scraped item, as visualized in this diagram:
If Requests already contain the middlewares, it does make sense for an Item to contain the required pipelines, for consistency. However, if they are set by the spider when the item is scraped, the logic for checking fields moves to the spider instead of remaining in the pipeline lifecycle. This makes the spider fatter and less extraction-focused.
As you mentioned, pre/post processing should be left outside of the spider. I agree with this, hence we should not set pipelines in the spider.
What I can think of, which could be implemented as of now through just a pipeline module, would be a pipeline that uses Utils.pipe/2 and a function that returns a set of pipelines.
# config.exs
pipelines: [
  ...
  MyCommon.Pipeline,
  {Crawly.Pipelines.Function,
   func: fn item ->
     case item do
       %MyStruct{} -> [..., My.Pipeline2, ...]
       %{my_field: _} -> [..., My.Pipeline1, ...]
       %{other_field: _} -> [..., My.Pipeline2, ...]
     end
   end}
]
However, this example makes the config very verbose, which is also not ideal.
From my own experimentation, using struct-based pattern matching (in a custom pipeline) does work. For example, I can populate the fields of an Ecto schema struct in the spider and then do pattern matching in the pipelines for that particular struct, but it does require some intermediate conversion back to a map before it is inserted into the table with Ecto.
Example:
# MyCustomPipeline.ex
def run(%MyStruct{} = item, state) do
  # do things
  # maybe insert into Ecto
  {item, state}
end

def run(item, state), do: {item, state}
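The intermediate conversion back to a map mentioned above might look like this (MyStruct, MyStruct.changeset/2, and Repo are hypothetical; dropping :__meta__ is what makes an Ecto schema struct usable as plain attrs):

```elixir
def run(%MyStruct{} = item, state) do
  # Strip Ecto metadata and the primary key so the scraped struct can be
  # re-used as plain attrs for a fresh insert
  attrs =
    item
    |> Map.from_struct()
    |> Map.drop([:__meta__, :id])

  %MyStruct{}
  |> MyStruct.changeset(attrs)
  |> Repo.insert()

  {item, state}
end

def run(item, state), do: {item, state}
```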
Perhaps we could just let advanced users figure this out themselves, and maybe just bless the struct-based pattern matching method as the best practice for adding pipeline logic?
Yes, we could do it as you're suggesting. I was speaking with people from the Scrapy core team; what they are saying is that you can re-write the pipeline in a spider there, however this feature is almost never used. I kind of like the approach with custom middlewares more... E.g. it seems to be way simpler...
I think let's keep the pipeline behaviour as-is for now. It seems like additional changes for minimal benefit. Spider overrides in init would be easier to implement in worker.exs too.
I'll open PRs for the documentation soon:
- request-response-item flow
- advanced pipeline techniques (pattern matching, structs, ecto schemas)
PRs opened, to be closed on merge