Comments (7)
Thanks!
Concerning your idea:
Doesn't $web->is404 (and similar methods) lead to bad (unsafe) code that only checks for specific codes?
Better would be:
if($web->isSuccess) { // or $web->isOk ? or $web->is2xx ?
// process data (was 200 OK or other 2xx)
} elseif($web->isServerError) { // or $web->is5xx ?
// repeat request later (was 500 Internal Server Error, 503 Service Unavailable, 504 Gateway Timeout or other 5xx)
} else { // we might offer something like $web->isClientError or $web->is4xx - but the user should not forget to handle all error codes!
// bad url (was 404 Not Found, 410 Gone, 451 Unavailable For Legal Reasons or other 4xx or unknown status codes)
}
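The branching above can be sketched as a standalone function on the numeric status code. This is only an illustration: `actionForStatus` and its return strings are hypothetical names, not part of PHPScraper, and the sketch assumes redirects have already been followed so that 3xx codes never reach it.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not a PHPScraper method): maps a status code
 * to the action a crawler should take, grouped by status class.
 */
function actionForStatus(int $statusCode): string
{
    if ($statusCode >= 200 && $statusCode <= 299) {
        return 'process';     // 200 OK or any other 2xx
    }
    if ($statusCode >= 500 && $statusCode <= 599) {
        return 'retry-later'; // 500, 503, 504 or any other 5xx
    }
    // Everything else lands here: 4xx, but also unknown or exotic codes,
    // so no error code is silently ignored.
    return 'bad-url';
}

foreach ([200, 204, 404, 410, 503, 999] as $code) {
    echo $code, ' => ', actionForStatus($code), PHP_EOL;
}
```

The final branch is deliberately a catch-all rather than an explicit 4xx check, which is the point made in the comment on the else branch.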
from phpscraper.
Oh, sorry, I've removed it as I realized I need to think it through more and play with some code. Your idea of grouping it as is2xx, is4xx, is5xx solves a question I've had: how to make an organized decision based on the status code. I'll open a PR shortly so we have something concrete to discuss.
from phpscraper.
I've added some ideas for status-code-related methods: #162
Let me know what you think @eposjk :)
from phpscraper.
Well - technically it works, but does it really make sense to add different functions that do the same thing? In addition, there isn't an is(NumericStatusCode) function for every common status code. So it gets confusing - the developer has to know that isNotFound/is404 exist but isGone/is410 do not.
My personal opinion would be to keep only is2xx/is4xx/is5xx (and maybe is3xx - if there is a way not to auto-follow redirects) - maybe renaming them to isSuccess (2xx)/isServerError (5xx)/isClientError (4xx). Anyone who really needs to check for specific codes and knows what they are doing should do that with e.g. $web->statusCode === 404
By the way: the way to check whether any error occurred would be !$web->isSuccess - just $web->isServerError || $web->isClientError is NOT enough! Web crawlers should be able to deal even with exotic HTTP status codes and treat them as errors (e.g. 9xx).
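A small sketch of why the two checks differ, using plain functions on the integer code. The function names below are hypothetical stand-ins for the proposed properties, not existing PHPScraper API.

```php
<?php
declare(strict_types=1);

// Hypothetical range checks standing in for isSuccess / isClientError / isServerError.
function isSuccessCode(int $code): bool
{
    return $code >= 200 && $code <= 299;
}

function isClientErrorCode(int $code): bool
{
    return $code >= 400 && $code <= 499;
}

function isServerErrorCode(int $code): bool
{
    return $code >= 500 && $code <= 599;
}

$code = 999; // an exotic, non-standard status code

// Narrow check: only fires for 4xx and 5xx, so 999 slips through.
$narrowCheck = isClientErrorCode($code) || isServerErrorCode($code);

// Broad check: anything that is not a success counts as an error.
$broadCheck = !isSuccessCode($code);

var_dump($narrowCheck); // bool(false)
var_dump($broadCheck);  // bool(true)
```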
from phpscraper.
> Well - technically it works, but does it really make sense to add different functions that do the same thing? In addition, there isn't an is(NumericStatusCode) function for every common status code. So it gets confusing - the developer has to know that isNotFound/is404 exist but isGone/is410 do not.
Yeah, my first thought was to allow flexibility. But to keep the interface simple, falling back on a simple === check sounds good. I'll adjust it shortly.
> My personal opinion would be to keep only is2xx/is4xx/is5xx (and maybe is3xx - if there is a way not to auto-follow redirects) - maybe renaming them to isSuccess (2xx)/isServerError (5xx)/isClientError (4xx). Anyone who really needs to check for specific codes and knows what they are doing should do that with e.g. $web->statusCode === 404
Yeah, this would allow breaking it down into categories for internal handling. As you've mentioned, detailed handling can be done on the statusCode itself.
> By the way: the way to check whether any error occurred would be !$web->isSuccess - just $web->isServerError || $web->isClientError is NOT enough! Web crawlers should be able to deal even with exotic HTTP status codes and treat them as errors (e.g. 9xx).
The 600+ ranges seem to be used for various purposes at the moment. Some are caching-related and others are errors. How do you think these should be handled?
from phpscraper.
It's more complicated than I thought - I gave it a try: https://github.com/eposjk/PHPScraper/tree/status-codes
New insights:
- the action depends not only on the status code, but also on the chain of redirects leading to this page (it's a huge difference whether you have just a 410 Gone status or a temporary redirect leading to a 410 Gone page) - I solved that by adding helper functions to detect if there has been a temporary redirect ($web->usesTemporaryRedirect) or if the whole result is only temporary ($web->isTemporaryResult)
- error status codes should not primarily be distinguished by whether they are client (4xx) or server (5xx) errors, but by whether they are permanent or temporary errors ($web->isTemporaryResult contains a detailed list of temporary ones) - everything that is neither a successful result nor a temporary one is probably a permanent error - but hey: some might occur because of administrative actions on the web server - so it would be good practice to consider those errors (e.g. 404) permanent only after trying multiple times
- the only status code I would really consider permanent is 410 Gone - where we can assume it isn't sent accidentally (that's why I created $web->isGone - which also checks that there were no temporary redirects). By the way, I also collect retry timing hints from the Retry-After headers ($web->retryAt)
- Furthermore, it should be possible to detect if there is a permanent redirect which should be used for future requests ($web->permanentRedirectUrl)
- I would handle the 600+ errors like any unknown error, as probably permanent.
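The classification described in these points could look roughly like the sketch below. It is not the code from the linked branch: the function names, the return strings, and especially the two code lists are assumptions chosen for illustration (the actual list of temporary codes lives in the branch's isTemporaryResult). The Retry-After helper relies on the header carrying either a number of seconds or an HTTP date.

```php
<?php
declare(strict_types=1);

// Assumed lists, for illustration only.
const TEMPORARY_REDIRECTS = [302, 303, 307];
const TEMPORARY_ERRORS = [408, 425, 429, 500, 502, 503, 504];

/**
 * @param int[] $redirectChain status codes of the hops before the final response
 */
function chainHasTemporaryRedirect(array $redirectChain): bool
{
    return count(array_intersect($redirectChain, TEMPORARY_REDIRECTS)) > 0;
}

/**
 * Hypothetical classifier: 'success', 'temporary', 'gone' or 'probably-permanent'.
 *
 * @param int[] $redirectChain
 */
function classifyResult(int $finalStatus, array $redirectChain = []): string
{
    if ($finalStatus >= 200 && $finalStatus <= 299) {
        return 'success';
    }
    // A temporary redirect anywhere in the chain makes the whole result temporary.
    if (in_array($finalStatus, TEMPORARY_ERRORS, true) || chainHasTemporaryRedirect($redirectChain)) {
        return 'temporary';
    }
    if ($finalStatus === 410) {
        return 'gone'; // 410 reached without any temporary redirect
    }
    // 404, other 4xx, 600+ and anything unknown.
    return 'probably-permanent';
}

/**
 * Turns a Retry-After header value into a Unix timestamp, or null if
 * the header is missing or unparseable.
 */
function retryAtFromHeader(?string $retryAfter, int $now): ?int
{
    if ($retryAfter === null) {
        return null;
    }
    $value = trim($retryAfter);
    if ($value !== '' && ctype_digit($value)) {
        return $now + (int) $value; // delay given in seconds
    }
    $timestamp = strtotime($value); // HTTP-date form
    return $timestamp === false ? null : $timestamp;
}

echo classifyResult(410), PHP_EOL;        // gone
echo classifyResult(410, [302]), PHP_EOL; // temporary
echo classifyResult(404), PHP_EOL;        // probably-permanent
```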
See some demo code at https://github.com/eposjk/PHPScraper/blob/status-codes/demo.php
In addition, I would remove isClientError(), isServerError(), isNotFound() and isForbidden().
(I consider at least isNotFound() harmful - people will probably use it to check whether a page is intentionally unavailable, and forget to check for the other status codes that might also be used for intentionally unavailable content - 410 Gone, 401 Unauthorized and 403 Forbidden.)
What is still missing:
- test cases
- helper function/example code to detect when a probably permanent error is really a permanent one (-> a database-driven web crawler example)
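As a starting point for the second item, here is a minimal sketch of the "permanent only after trying multiple times" idea. The class name, its methods and the default threshold of 3 are all hypothetical, and an in-memory array stands in for the database table a real crawler would use.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical tracker: counts consecutive "probably permanent" errors
 * per URL and confirms them as permanent once a threshold is reached.
 */
final class PermanentErrorTracker
{
    /** @var array<string,int> consecutive failure count per URL */
    private array $failures = [];

    private int $threshold;

    public function __construct(int $threshold = 3)
    {
        $this->threshold = $threshold;
    }

    public function recordProbablyPermanentError(string $url): void
    {
        $this->failures[$url] = ($this->failures[$url] ?? 0) + 1;
    }

    public function recordSuccess(string $url): void
    {
        unset($this->failures[$url]); // any success resets the streak
    }

    public function isConfirmedPermanent(string $url): bool
    {
        return ($this->failures[$url] ?? 0) >= $this->threshold;
    }
}

$tracker = new PermanentErrorTracker(3);
$url = 'https://example.com/missing';

$tracker->recordProbablyPermanentError($url);
$tracker->recordProbablyPermanentError($url);
var_dump($tracker->isConfirmedPermanent($url)); // bool(false)

$tracker->recordProbablyPermanentError($url);
var_dump($tracker->isConfirmedPermanent($url)); // bool(true)
```

Temporary results would not be recorded at all in this scheme; they would be rescheduled using the retry hint instead.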
from phpscraper.
Hey @eposjk
I've opened a PR for this: #164. Let's work out some practical details there.
Cheers,
Peter
from phpscraper.