Comments (4)
Hi @damien-git, you should probably drop the Internet Archive a note (mailto:[email protected]), as they may be able to tune the behaviour of their crawler.
In general, I personally do not recommend Heritrix users use the speculative JavaScript extractor at all. It seems to cause more trouble than it's worth.
I quite like the idea of tuning the crawl via robots.txt, but we should probably look at deprecating or improving the ExtractorJS or KnowledgableExtractorJS processors first.
If we can find out which extractor they are using, that might help.
from heritrix3.
I've run a variant of ExtractorJS for years that lets me filter out the links it discovers using a set of regular expressions. These are applied before the links are turned into full URLs, which makes it a bit easier to target common false positives in JS libraries than it would be if we were doing the filtering in the scope. You also don't risk catching any URLs extracted via other (more reliable) means.
Looking at the above, I should probably filter out any links extracted via ExtractorJS containing "gtm."
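
For illustration, here is a minimal sketch of that kind of pre-resolution filter. It is not the actual ExtractorJS variant described above and does not use any Heritrix API: the class name `RawJsLinkFilter` and its methods are hypothetical, and the only reject pattern is the "gtm." case mentioned in this thread.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Heritrix): rejects speculative link
 * candidates found in JavaScript while they are still raw strings,
 * i.e. before they are resolved against the page URL.
 */
public class RawJsLinkFilter {
    private final List<Pattern> rejectPatterns;

    public RawJsLinkFilter(List<String> rejectRegexes) {
        this.rejectPatterns = rejectRegexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    /** Returns true if the raw candidate matches none of the reject patterns. */
    public boolean accept(String rawCandidate) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(rawCandidate).find()) {
                return false;
            }
        }
        return true;
    }

    /** Keeps only the candidates that pass accept(). */
    public List<String> filter(List<String> rawCandidates) {
        return rawCandidates.stream()
                .filter(this::accept)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: "gtm\\." is the only pattern; a real deployment would tune its own list.
        RawJsLinkFilter filter = new RawJsLinkFilter(List.of("gtm\\."));
        List<String> candidates = List.of("gtm.js", "gtm.start", "/images/logo.png");
        System.out.println(filter.filter(candidates)); // prints [/images/logo.png]
    }
}
```

Because the patterns run against the short raw strings, a rule like `gtm\.` only affects links guessed from JavaScript; the same URL discovered through an ordinary HTML link would never pass through this filter.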
from heritrix3.
We (Akamai) are seeing a similar issue with sites that have our mPulse product enabled, which includes JavaScript in the page's HTML that looks like this:
var a=["ak.bpcip","ak.cport","..."];
This results in our customers' websites getting crawled by numerous crawlers, which request each of those 20+ array elements relative to every page, e.g.:
http://website/foo/bar/ak.bpcip
http://website/foo/bar/ak.cport
... etc
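
A rough sketch of why those requests appear: if a crawler treats each string literal as a relative link, standard relative-URL resolution against the page address produces exactly the URLs listed above. This uses `java.net.URI` from the JDK rather than Heritrix's own URL handling, and the class name `SpeculativeResolution` and the page URL `http://website/foo/bar/` are assumptions for illustration only.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: resolves JS string literals as if they were relative links. */
public class SpeculativeResolution {

    /** Resolves every candidate string against the URL of the page it was found on. */
    static List<String> resolveAll(String pageUrl, List<String> candidates) {
        URI base = URI.create(pageUrl);
        List<String> resolved = new ArrayList<>();
        for (String candidate : candidates) {
            resolved.add(base.resolve(candidate).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // The two literals shown in the array above; the rest are elided in the report.
        List<String> literals = List.of("ak.bpcip", "ak.cport");
        for (String url : resolveAll("http://website/foo/bar/", literals)) {
            System.out.println(url);
        }
        // prints:
        // http://website/foo/bar/ak.bpcip
        // http://website/foo/bar/ak.cport
    }
}
```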
from heritrix3.
+1
I am using Google Tag Manager, and the crawler is making many requests for "/gtm.js".
from heritrix3.