Comments (10)
Also, I'd like to migrate this to PSR-4 and separate the classes into their own files. Or perhaps you should do that, since it's likely a BC break.
from phpscraper.
Sigh. Version 6 no longer includes a manager, so using it requires a cache and a mechanism for fetching and storing the rules.
It's not too difficult, but it's more than a trivial syntax change (there are some namespace changes as well).
Thoughts on how to proceed?
Hey @tacman
Yeah, I remember there were some structural changes in the package. Do you think you can get it done? Namespaces can usually be replaced easily.
Cheers,
Peter
OK. There are two approaches. The easiest is to download the rules file, add it to the repo, and then load it. Of course, the rules will become stale.
The better approach requires a dependency on a cache. Then we can fetch the rules like this, which will refresh them every 24 hours:
use Pdp\Rules;
use Pdp\Storage\PsrStorageFactory;
use Symfony\Component\Cache\Adapter\FilesystemAdapter;
use Symfony\Contracts\Cache\ItemInterface;

public function getTldCollection(): Rules
{
    $cache = new FilesystemAdapter(); // or some other cache.

    $rules = $cache->get('pdp_rules', function (ItemInterface $item) {
        // The callable will only be executed on a cache miss.
        $item->expiresAfter(3600 * 24); // keep the list for 24 hours

        $response = $this->client->request(
            'GET',
            PsrStorageFactory::PUBLIC_SUFFIX_LIST_URI
        );

        return $response->getContent();
    });

    // Parse the raw Public Suffix List into a Rules object.
    return Rules::fromString($rules);
}
I'll go ahead and implement this to make it functional, but I'm not sure how to code it so that the user can inject whatever cache they already have in their application.
Since you've tagged this as a new version, can we also bump the requirement to PHP 8?
I started down the rabbit hole...
If phpscraper needs a cache for the domain parser, a CacheInterface cache should probably be injected. But that means phpscraper should itself be a service that's injected, rather than instantiated with new phpscraper().
Alternatively, we can add a CacheAwareInterface, and add the cache via a method call.
Alas, I'm not as expert in this as I'd like to be!
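For illustration, the setter-based alternative mentioned above could look roughly like the sketch below. Everything in it is an assumption made for the example: `RulesCache`, `ArrayRulesCache`, `ScraperSketch`, `setCache()` and `rawRules()` are hypothetical names, not part of phpscraper or php-domain-parser. The class works out of the box with an in-memory default, and an application that already has a cache can swap it in with one method call.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal cache contract; a real project might adapt PSR-16 instead.
interface RulesCache
{
    public function get(string $key): ?string;
    public function set(string $key, string $value, int $ttlSeconds): void;
}

// In-memory implementation, used as the default when nothing is injected.
final class ArrayRulesCache implements RulesCache
{
    /** @var array<string, array{value: string, expires: int}> */
    private array $items = [];

    public function get(string $key): ?string
    {
        $item = $this->items[$key] ?? null;
        if ($item === null || $item['expires'] < time()) {
            return null; // missing or expired
        }
        return $item['value'];
    }

    public function set(string $key, string $value, int $ttlSeconds): void
    {
        $this->items[$key] = ['value' => $value, 'expires' => time() + $ttlSeconds];
    }
}

// Hypothetical stand-in for the scraper class; only the cache wiring is shown.
final class ScraperSketch
{
    private RulesCache $cache;

    /** @var callable(): string downloads the raw rules on a cache miss */
    private $fetchRules;

    public function __construct(callable $fetchRules)
    {
        $this->fetchRules = $fetchRules;
        $this->cache = new ArrayRulesCache(); // still usable via plain `new`
    }

    // Optional: let the application plug in the cache it already has.
    public function setCache(RulesCache $cache): self
    {
        $this->cache = $cache;
        return $this;
    }

    public function rawRules(): string
    {
        $rules = $this->cache->get('pdp_rules');
        if ($rules === null) {
            $rules = ($this->fetchRules)();
            $this->cache->set('pdp_rules', $rules, 3600 * 24);
        }
        return $rules;
    }
}
```

With this shape, `new ScraperSketch($fetcher)` keeps working for simple scripts, while a framework user could write a small adapter around their existing cache and pass it to `setCache()`.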
The question of the cache stopped me too. I was actually thinking of storing a file or set of files somewhere to avoid the integration questions, especially for simple vanilla PHP projects (where PHPScraper comes in handy for me most).
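A rough sketch of that file-based idea, using only built-in PHP functions, might look like this. The function name `loadRulesFromFile`, its parameters, and the 24-hour default are assumptions made for the example; the `$fetch` callable stands in for whatever HTTP client downloads the list.

```php
<?php
declare(strict_types=1);

/**
 * Return the contents of $path, refreshing them via $fetch when the file
 * is missing or older than $maxAgeSeconds.
 */
function loadRulesFromFile(string $path, callable $fetch, int $maxAgeSeconds = 86400): string
{
    clearstatcache(true, $path); // avoid a stale mtime from PHP's stat cache

    $isFresh = is_file($path) && (time() - filemtime($path)) < $maxAgeSeconds;
    if ($isFresh) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss or stale file: ask the caller-supplied fetcher for new content.
    $contents = $fetch();

    // Write to a temporary file first, then rename it into place, so that
    // concurrent readers are unlikely to see a half-written file.
    $tmp = $path . '.' . uniqid('', true) . '.tmp';
    if (file_put_contents($tmp, $contents) === false || !rename($tmp, $path)) {
        throw new RuntimeException("Could not write rules cache to {$path}");
    }

    return $contents;
}
```

The returned string could then be handed to `Rules::fromString()` as in the earlier snippet. This needs no cache interface at all, at the cost of requiring a writable directory.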
Hey @tacman,
Have you made progress implementing a cache? I've seen this commit and was wondering if we can get a framework-agnostic solution working. I'd still try to avoid injecting a CacheInterface, as it is framework-dependent. Happy to hear your thoughts!
Cheers,
Peter
Alternatively, either spatie/url or thephpleague/uri could replace jeremykendall/php-domain-parser. I still need to confirm whether those libs are suitable for the job, though.
For now we are using league/uri for URL processing. With this change, the subdomain-specific filtering has been dropped.