
crawler-commons's Introduction


Overview

Crawler-Commons is a set of reusable Java components that implement functionality common to any web crawler.
These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

Table of Contents

User Documentation

Javadocs

Mailing List

There is a mailing list on Google Groups.

Installation

Using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>1.4</version>
</dependency>

Using Gradle, add the following to your build file:

dependencies {
    implementation group: 'com.github.crawler-commons', name: 'crawler-commons', version: '1.4'
}

News

18th July 2023 - crawler-commons 1.4 released

We are pleased to announce the 1.4 release of Crawler-Commons.

The new release includes many improvements and bug fixes, several dependency upgrades and improvements to the automatic build system. The following are the most notable improvements and changes:

  • Java 11 is now required to run or build crawler-commons
  • the robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309 and provides a new API entry point that accepts a collection of single-word user-agent product tokens, which allows for faster and RFC-compliant matching of robots.txt user-agent lines. Please note that user-agent product tokens must be lower-case (see the usage sketch below).
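Below is a minimal usage sketch of the new entry point. The exact overload, parseContent(String url, byte[] content, String contentType, Collection<String> robotNames), is inferred from the release notes above; the host name, agent token and robots.txt content are illustrative only.

import java.nio.charset.StandardCharsets;
import java.util.Set;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsRulesExample {
    public static void main(String[] args) {
        // Illustrative robots.txt content and host name
        byte[] robotsTxt = ("User-agent: mybot\n" + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Product tokens must be single words and lower-case
        BaseRobotRules rules = parser.parseContent("https://www.example.com/robots.txt", robotsTxt, "text/plain", Set.of("mybot"));

        System.out.println(rules.isAllowed("https://www.example.com/private/page.html")); // expected: false
        System.out.println(rules.isAllowed("https://www.example.com/index.html"));        // expected: true
    }
}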

See the CHANGES.txt file included with the release for the detailed list of changes.

28th July 2022 - crawler-commons 1.3 released

We are glad to announce the 1.3 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. The new release includes multiple dependency upgrades, improvements to the automatic builds, and tighter protection against XXE vulnerabilities in the Sitemap parser.

14th October 2021 - crawler-commons 1.2 released

We are glad to announce the 1.2 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. This version fixes an XXE vulnerability issue in the Sitemap parser and includes several improvements to the URL normalizer and the Sitemaps parser.

29th June 2020 - crawler-commons 1.1 released

We are glad to announce the 1.1 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.

21st March 2019 - crawler-commons 1.0 released

We are glad to announce the 1.0 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other bug fixes and improvements this version adds support for parsing sitemap extensions (image, video, news, alternate links).

7th June 2018 - crawler-commons 0.10 released

We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains among other things improvements to the Sitemap parsing and the removal of the Tika dependency.

31st October 2017 - crawler-commons 0.9 released

We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of the DOM-based sitemap parser, as the SAX equivalent introduced in the previous version has better performance and is also more robust. You might need to change your code to replace SiteMapParserSAX with SiteMapParser. The parser is now aware of namespaces, and by default does not force the namespace to be the one recommended in the specification (http://www.sitemaps.org/schemas/sitemap/0.9), as variants can be found in the wild. You can set the behaviour using the method setStrictNamespace(boolean), as sketched below.
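For illustration, here is a minimal sketch of toggling that behaviour; the sitemap content and URL are made up, and parseSiteMap(byte[] content, URL url) is assumed to be the content-based entry point of SiteMapParser.

import java.net.URL;
import java.nio.charset.StandardCharsets;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapNamespaceExample {
    public static void main(String[] args) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        // Require the namespace recommended by the specification;
        // leave this false (the default) to accept variants found in the wild.
        parser.setStrictNamespace(true);

        byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://www.example.com/</loc></url>"
                + "</urlset>").getBytes(StandardCharsets.UTF_8);

        AbstractSiteMap sm = parser.parseSiteMap(content, new URL("https://www.example.com/sitemap.xml"));
        if (sm instanceof SiteMap) {
            for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                System.out.println(u.getUrl());
            }
        }
    }
}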

As usual, the version 0.9 contains numerous improvements and bugfixes and all users are invited to upgrade to this version.

9th June 2017 - crawler-commons 0.8 released

We are glad to announce the 0.8 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are the removal of the HTTP fetcher support, which has been put in a separate project. We also added a SAX-based parser for processing sitemaps, which requires less memory and is more robust to malformed documents than its DOM-based counterpart. The latter has been kept for now but might be removed in the future.

24th November 2016 - crawler-commons 0.7 released

We are glad to announce the 0.7 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are that Crawler-Commons now requires JAVA 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses CC then you might want to run the following command on it

find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'

Please note also that this is the last release containing the HTTP fetcher support, which is deprecated and will be removed from the next version.

The version 0.7 contains numerous improvements and bugfixes and all users are invited to upgrade to this version.

11th June 2015 - crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central. Please note that the groupId has changed to com.github.crawler-commons.

The Java documentation can be found here.

22nd April 2015 - crawler-commons has moved

The crawler-commons project is now being hosted at GitHub, due to the demise of Google code hosting.

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades to Apache Tika 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.

11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.

See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central.

11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the CHANGES.txt file included with the release for a full list of details.

24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See Apache Nutch v1.7 Released for more details.

08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See Apache Nutch v2.2 Released for more details.

02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support.

See the CHANGES.txt file included with the release for a full list of details.

License

Published under Apache License 2.0, see LICENSE

crawler-commons's People

Contributors

aecio, chaiavi, dependabot[bot], evanhalley, jnioche, kennethwong-hc, kkrugler, kovyrin, lewismc, lucboruta, matt-deboer, michaellavelle, michaelroeder, rzo1, sebastian-nagel, valfirst, yoeduardoj


crawler-commons's Issues

Follow Google example of giving Allow directives higher match weight than Disallow directives

According to Wikipedia, which references this article (http://blog.semetrical.com/googles-secret-approach-to-robots-txt/), "...Google's implementation differs in that Allow patterns with equal or more characters in the directive path win over a matching Disallow pattern.[18] Bing uses the Allow or Disallow directive which is the most specific.[8]"

See also https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt for details on how Google interprets robots.txt files.

Original issue reported on code.google.com by [email protected] on 17 Mar 2013 at 6:41
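For reference, one way to probe how the current parser resolves such conflicts is to feed it a small robots.txt and call isAllowed(); the agent token and URLs below are made up, and whether the Allow rule wins here is exactly what this issue asks about.

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class AllowPrecedenceCheck {
    public static void main(String[] args) {
        // Under Google-style precedence, the Allow pattern is longer than the
        // matching Disallow pattern, so /folder/page would be crawlable.
        byte[] robotsTxt = ("User-agent: *\n"
                + "Disallow: /folder/\n"
                + "Allow: /folder/page\n").getBytes(StandardCharsets.UTF_8);

        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "http://example.com/robots.txt", robotsTxt, "text/plain", "mybot");

        // Prints whatever precedence the current parser applies.
        System.out.println(rules.isAllowed("http://example.com/folder/page"));
    }
}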

Support Image Sitemaps

Google's documentation on sitemap support highlights some anomalies with regard to the above sitemap type. We should consider supporting such functionality.

https://support.google.com/webmasters/answer/178636?hl=en&ref_topic=4581190


Original issue reported on code.google.com by [email protected] on 11 Apr 2014 at 7:21

Generate SitemapTool.jar from the SiteMapTester

Currently, in our sitemaps package we have the following file: 
SiteMapTester.java

The file name is tricky as we shouldn't have a Test file in our regular src 
tree (as Lewis has previously mentioned).

After examining the file, I think I understand its purpose: it takes an online sitemap and parses it recursively, printing all of the sitemap URLs as a list (so if this is an index sitemap it will parse all of the other sitemaps and print out all of the URL entries to the console).

I actually like this sitemap parsing util, because it gives me an answer that 
our library doesn't support natively.

My most common scenario is parsing sitemaps recursively to get the list of URLs - this was my original requirement when I stumbled upon this library: I have a PHP site and I wanted a list of all of my URLs...

We should have this functionality (parsing recursively over a sitemap while retrieving a list of URLs) as a separate jar tool.



I'd suggest SiteMapTool.

It would be cleanest if this was a separate artifact from the build - e.g. we 
create a crawler-commons jar, and a crawler-commons-tools.jar, where the latter 
is an uber jar (includes all dependencies) so you can just run it from the 
command line.


We should also rename the original Java file accordingly

Original issue reported on code.google.com by [email protected] on 4 Jul 2014 at 9:36

  • Blocked on: #43

Move to pure Maven for CC build lifecycle

As discussed on the project mailing list, we can really simplify things 
especially artifact releases by shifting to Maven for the project build 
lifecycle.
I'll address  this just now and commit the code.


Original issue reported on code.google.com by [email protected] on 30 Jan 2013 at 3:06

[Sitemaps] Fix the Tester Util's Logic

The scenario is running the SitemapTester on a sitemapIndex in a GZ file.


This is the current method signature of the sitemap parser:
parse(URL url, String mt, boolean recursive)

As you see, the second argument is the MediaType, which never changes in the 
current tool.


So the problem occurred when one tried to parse a GZ file (sending "application/gzip" as the media type) recursively: inside the sitemap index were other sitemaps, for example text/plain ones, and our parser then tried to get the URLs from those sitemaps with the gzip media type...



The solution is to use the new method from issue #39, where our library detects the MediaType on the fly using Tika.

Original issue reported on code.google.com by [email protected] on 4 Jul 2014 at 9:26

  • Blocking: #44
  • Blocked on: #65

[Sitemaps] Add a convenience method to the Parser with only a URL argument

Currently the Parser has two public methods to invoke it, both of which take a Media Type (content type) argument. I suggest adding a new parsing method in which we use Tika to detect the MediaType.

The parsing method would be as follows:
public AbstractSiteMap parseSiteMap(URL url);


The content of this method will be something like:
byte[] bytes = IOUtils.toByteArray(onlineSitemapUrl);
String contentType = new Tika().detect(bytes);

return parseSiteMap(contentType, bytes, onlineSitemapUrl);


The new method I suggest above will be very convenient for the light user who 
only wants to parse a simple sitemap without getting into any nitty gritty - I 
believe many people will appreciate it.

Original issue reported on code.google.com by [email protected] on 26 Apr 2014 at 7:47

Catch & report invalid robots.txt rules that include domain name in the URL path

Specifically catch cases of people putting http://<domain> or <domain> as part 
of the path.

There's the question of what we do in that case, if the domain matches the 
domain used to fetch the robots.txt file. I think we should try to honor the 
intent of the rule, which means pretending like the author of the file didn't 
mess up the syntax.

Original issue reported on code.google.com by [email protected] on 17 Mar 2013 at 6:36

Http Components 4.2.2 Upgrade + Patch for issues 1-7

Please find attached and let me know if it doesn't work in your environment 
(Windows vs Mac vs Linux, encoding, CR/LF, etc)

I generated it (in my local environment) using:
git diff --no-prefix 1e15be8d632f59e5f62a468bcb7a0a227c510d4e > ../crawler-commons.patch

And it is on top of r34; r35 (Eclipse Formatter) is not in my cloned version at 
https://github.com/FuadEfendi/crawler-commons


Let me know if you have any error messages. Thanks!


P.S. Next time I'll follow http://wiki.apache.org/hadoop/GitAndHadoop to 
generate patch-per-issue

Original issue reported on code.google.com by [email protected] on 5 Nov 2012 at 3:08

Attachments:

Fix deprecation in Crawler Commons Code

What steps will reproduce the problem?
1. I switched javac deprecation on by default, which enables us to spot potential areas for improvement. If you compile the code, you will see that there are a few.

We can work on the instances of deprecation post 0.2 release. 


Original issue reported on code.google.com by [email protected] on 28 Jan 2013 at 2:52

Support Video Sitemaps

Google's documentation on sitemap support notes some anomalies in the above sitemap type which we should consider supporting.
https://support.google.com/webmasters/answer/80471?hl=en&ref_topic=4581190


Original issue reported on code.google.com by [email protected] on 11 Apr 2014 at 7:20

Support matching against query parameters in robots.txt rules

Apparently Googlebot will use the query portion of a URL when matching against 
rules.

See http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

It's unclear to me whether Googlebot will attempt a match without the query 
parameter portion of the URL, if no more specific match is found. E.g. will 
Disallow: /*.htm$ match against http://domain.com/page.htm?session=23

I would guess yes, but it would be best to validate using the "Fetch as Google" 
tool - see 
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587

Original issue reported on code.google.com by [email protected] on 17 Mar 2013 at 6:24

Sitemap Parser to normalize entries

The attached sitemap contains the following entry which is causing problems 
(for my Firefox browser and for our Sitemap parser). This is due to the 
presence of the ampersands.

<url>
  <lastmod>2011-10-28</lastmod>
  <loc>http://www.tricae.com.br//Triciclo-Meu-1&Acirc;&ordm;-Tico-Tico-Europa--Bandeirante-1302.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>

A solution would be to normalize all sitemap entries, however first we need to 
port some code to CC (possibly from Nutch).
In the meantime a targeted hack of SiteMap parser would suffice but it is 
certainly not ideal.


Original issue reported on code.google.com by [email protected] on 9 May 2013 at 5:23

Attachments:

[Sitemaps] Upgrade code after release of Tika v1.6

After release v1.6 of Tika we should upgrade two files:

1. SiteMapParser (one TODO)
2. SiteMapParserTest (two TODOs in the unit tests)


The fixes are MediaTypes submitted to Tika and scheduled to appear in the 1.6 
release

Original issue reported on code.google.com by [email protected] on 8 Jul 2014 at 6:50

  • Blocked on: #48

HttpComponents Upgrade: 4.1.1 -> 4.2.1

Version 4.2.1 has better multithreading support; using connection pool we can 
better manage Keep-Alive and Cookie Store

This is first step before implementing in-full Cookie support (which currently 
does not work; you can notice it from presence of Session IDs in URLs)

Original issue reported on code.google.com by [email protected] on 6 Oct 2012 at 3:25

FetchedResult doesn't store HTTP Status Code


FetchedResult doesn't store the HTTP status code or the HTTP reason phrase... but we need them in most cases. For instance, to find that a page was redirected, and then to check the "Location" header...

Original issue reported on code.google.com by [email protected] on 7 Oct 2012 at 1:07

Possible Defect: BufferedReader(new InputStreamReader(effective_tld_data_stream));

{code}
java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code point was found in the input åkrehamn
{code}

We need to explicitly use -Dfile.encoding=UTF8 as a startup parameter. This must be documented.

The problem currently happens in the Bamboo environment, where we can't manage this setting when running the JUnit tests.

See last line 
{code}
EffectiveTldFinder
... 
    public boolean initialize(InputStream effective_tld_data_stream) {
        domains = new HashMap<String, EffectiveTLD>();
        try {
            if (null == effective_tld_data_stream && null != this.getClass().getResource(ETLD_DATA)) {
                effective_tld_data_stream = this.getClass().getResourceAsStream(ETLD_DATA);
            }
            BufferedReader input = new BufferedReader(new InputStreamReader(effective_tld_data_stream));
{code}

It tries to read "/effective_tld_names.dat" using the default charset.

Original issue reported on code.google.com by [email protected] on 10 Jun 2011 at 3:43
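A minimal sketch of the fix implied above is to read the data file with an explicit UTF-8 charset instead of the platform default; the wrapper class name below is hypothetical, and the resource path is the one mentioned in the issue.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class TldReaderSketch {
    static BufferedReader openTldData() {
        // Read /effective_tld_names.dat with an explicit charset so the parse
        // does not depend on the -Dfile.encoding setting of the JVM running it.
        InputStream in = TldReaderSketch.class.getResourceAsStream("/effective_tld_names.dat");
        if (in == null) {
            throw new IllegalStateException("effective_tld_names.dat not found on the classpath");
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}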

[Sitemaps] Add Tika MediaType Support

Change the Mime type parsing to use Tika's MediaType.

So instead of this code:
if (url.getPath().endsWith(".xml") || contentType.contains("text/xml") || contentType.contains("application/xml") || contentType.contains("application/x-xml")
        || contentType.contains("application/atom+xml") || contentType.contains("application/rss+xml")) {
    // Try parsing the XML which could be in a number of formats
    return processXml(url, content);
} else if (url.getPath().endsWith(".txt") || contentType.contains("text/plain")) {
    // plain text
    return (AbstractSiteMap) processText(content, url.toString());
} else if (url.getPath().endsWith(".gz") || contentType.contains("application/gzip") || contentType.contains("application/x-gzip") || contentType.contains("application/x-gunzip")
        || contentType.contains("application/gzipped") || contentType.contains("application/gzip-compressed") || contentType.contains("application/x-compress")
        || contentType.contains("gzip/document") || contentType.contains("application/octet-stream")) {
    return processGzip(url, content);
}

I want to identify the mediaType:
MediaType mediaType = MediaType.parse(contentType);

And then process as follows:
1. Recurse through the media type's supertypes until we get to the root and compare it to the XML media type (or others)
2. If not found, check the aliases (for example text/xml is an alias of application/xml, which is the more accurate form)
3. If still not found, it is a bad MediaType and an exception should be thrown.


In this issue I will remove the recognition of media types unknown to Tika (although I will submit them to the Tika JIRA, so we might automatically identify them with future Tika library updates).



This issue will actually change the media type recognition from our own string matching to Tika's MediaType recognition.

Original issue reported on code.google.com by [email protected] on 26 Apr 2014 at 7:52

  • Blocking: #47

SitemapIndex should allow to skip sitemaps

If one sitemap contained in a SitemapIndex fails to process,
a loop using SiteMapIndex.hasUnprocessedSitemap() will never finish.

Either AbstractSitemap.setProcessed(boolean) should be public
or some method (e.g. sitemapIndex.skipSitemap(AbstractSiteMap sitemap))
should be provided to remove failed sitemaps from the sitemap index.

Original issue reported on code.google.com by [email protected] on 15 Dec 2013 at 11:50
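For illustration, a sketch of the non-terminating pattern described above; getNextUnprocessedSitemap() is assumed to be the accessor that pairs with hasUnprocessedSitemap().

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapIndex;

public class IndexLoopSketch {
    // Illustrates the pattern described in the issue: if parsing a child
    // sitemap fails and there is no public way to mark it as processed,
    // the same failed entry is returned forever and the loop never ends.
    static void walk(SiteMapIndex index) {
        while (index.hasUnprocessedSitemap()) {
            AbstractSiteMap child = index.getNextUnprocessedSitemap();
            // fetch child.getUrl() and parse it here; on failure 'child'
            // stays "unprocessed" and is returned again on the next iteration
            System.out.println(child.getUrl());
        }
    }
}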

[Sitemaps] Add Tika Support

Sitemaps should use the Tika implementations instead of using the current one 
in two places:

1. Currently the Parser has two public methods to invoke it, both of which take a Media Type (content type) argument. I suggest adding two new parsing methods in which we use Tika to detect the MediaType. The parsing methods would be as follows:

public AbstractSiteMap parseSiteMap(URL url);
public AbstractSiteMap parseSiteMap(File file);

The content of these methods will be something like:
byte[] bytes = IOUtils.toByteArray(onlineSitemapUrl);
String contentType = new Tika().detect(bytes);

return parseSiteMap(contentType, bytes, onlineSitemapUrl);


The new methods I suggest above will be very convenient for the light user who 
only wants to parse a simple sitemap without getting into any nitty gritty - I 
believe many people will appreciate it.


2. Change the MIME type parsing to use Tika's MediaType.
So instead of this code:
if (url.getPath().endsWith(".xml") || contentType.contains("text/xml") || contentType.contains("application/xml") || contentType.contains("application/x-xml")
        || contentType.contains("application/atom+xml") || contentType.contains("application/rss+xml")) {
    // Try parsing the XML which could be in a number of formats
    return processXml(url, content);
} else if (url.getPath().endsWith(".txt") || contentType.contains("text/plain")) {
    // plain text
    return (AbstractSiteMap) processText(content, url.toString());
} else if (url.getPath().endsWith(".gz") || contentType.contains("application/gzip") || contentType.contains("application/x-gzip") || contentType.contains("application/x-gunzip")
        || contentType.contains("application/gzipped") || contentType.contains("application/gzip-compressed") || contentType.contains("application/x-compress")
        || contentType.contains("gzip/document") || contentType.contains("application/octet-stream")) {
    return processGzip(url, content);
}

I want to use something like the following:

String mediaType = MediaType.parse(contentType).toString();
if (mediaType.contains(MediaType.APPLICATION_XML.getSubtype())) {
    return processXml(url, content);
} else if (mediaType.contains(MediaType.APPLICATION_ZIP.getSubtype())) {
    return processGzip(url, content);
} else if (mediaType.contains(MediaType.TEXT_PLAIN.getType())) {
    return (AbstractSiteMap) processText(content, url.toString());
}

Original issue reported on code.google.com by [email protected] on 19 Apr 2014 at 8:20

[Robots] Resolve relative URL for sitemaps

2014-03-27 13:55:25,730 WARN crawlercommons.robots.SimpleRobotRulesParser: Problem processing robots.txt for http://www.iglobal.co/mexico/render_phone_view/victoria-cortes-maria-del-carmen-1

2014-03-27 13:55:25,730 WARN crawlercommons.robots.SimpleRobotRulesParser: Invalid URL with sitemap directive: /sitemap.xml

Original issue reported on code.google.com by digitalpebble on 27 Mar 2014 at 3:07

Proper code styling template and mechanism to enforce it

Ken has committed in the past an Eclipse formatting file:
http://code.google.com/p/crawler-commons/source/browse/trunk/doc/eclipse-formatter.properties

We should have a generic styling file which will match all IDEs (I personally 
use IntelliJ Idea for example).

If the styling template is accepted by us we should use it, or we can compare 
it to other styling files used by other groups and upgrade it if we see fit.


We should also find a mechanism which will use this file in Maven at the 
"Compile" task automatically.

Possible candidate will be:
http://code.google.com/p/maven-java-formatter-plugin/




Summary:
* Find suitable styling guidelines
* Enforce it using a generic (xml ?) file
* Find a good Maven plugin which will run it on every compile

Original issue reported on code.google.com by [email protected] on 9 Aug 2014 at 6:58

Sitemap URLs in robots.txt are unnecessarily lowercased

This is a code snippet, from line 381 onwards, of SimpleRobotRulesParser:
{code}
            line = line.trim().toLowerCase();
            if (line.length() == 0) {
                continue;
            }

            RobotToken token = tokenize(line);
{code}


Use case scenario: it doesn't work with sitemaps listed at 
http://www.tripadvisor.com/robots.txt


Original issue reported on code.google.com by [email protected] on 10 Jun 2013 at 7:15

  • Merged into: #25

More robust parsing of sitemap index files

http://activision.taleo.net/careersection/sitemap.jss

contains entries such as 

<sitemap>
http://activision.taleo.net/careersection/sitemap.jss?portalCode=2&lang=en
</sitemap>

i.e. the mandatory <loc> element is missing

Ideally the site should produce the correct content following the schema but we 
should make the parser a bit more robust and produce outlinks even if the loc 
element is missing



Original issue reported on code.google.com by digitalpebble on 6 Sep 2013 at 9:28

Missing top level domains

effective_tld_names.dat should contain the following TLDs: .tel .me .rs .asia


Original issue reported on code.google.com by digitalpebble on 8 Jan 2014 at 2:53

Robots.txt parser should not lowercase sitemap URLs

See http://www.amazon.com/robots.txt

contains entries such as 

Sitemap: http://www.amazon.com/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.com/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.com/sitemaps.bbb7d657c7e29fa.SitemapIndex_0.xml.gz

which are returned by the parser in lowercase

http://www.amazon.com/sitemaps.bbb7d657c7e29fa.sitemapindex_0.xml.gz

the trouble being that these URLs return a 404 when lowercased but work fine 
with the original form.




Original issue reported on code.google.com by digitalpebble on 16 May 2013 at 9:02

[Sitemaps] Upgrade to JUnit v4 conventions

Our JUnit tests work well.

But they use JUnit v3 conventions.

I suggest upgrading to the v4 conventions.


What are the suggested changes?
The changes I suggest are as shown in the Junit github page here (Notes 
section):
https://github.com/junit-team/junit/wiki/Getting-started

The changes will include:
* @Test annotations
* Removing inheritance of TestCase and adding the annotation instead
* Adding expected annotation when expecting an exception

Original issue reported on code.google.com by [email protected] on 21 May 2014 at 6:29

Set correct default priority for URL in a sitemap file

The default priority for a URL in a sitemap file is 0.5  as specified in 
http://www.sitemaps.org/protocol.html. It is currently left at 0 if unspecified.


Original issue reported on code.google.com by digitalpebble on 16 May 2013 at 9:11

Substantiate Javadoc

As always, Javadocs are in general pretty thin on the ground.
I'll get us a patch adding some reasonable package descriptions for the forthcoming release.


Original issue reported on code.google.com by [email protected] on 27 Jan 2013 at 12:52

Upgrade the Slf4j logging Library to v1.7.7

We are currently using slf4j and log4j-over-slf4j v1.6.6


The current version of slf4j is v1.7.7, which is nine releases' worth of upgrades and bug fixes ahead of our version.



v1.6.6 came out in June 2012 and doesn't have a method I want to use - this was the trigger for my need for the upgrade.



The only downside I can see for the upgrade is that the new version (1.7.7) 
requires JDK v1.5 or higher, so if support for lower JDKs is a requirement for 
this library I will manage without that method...

Original issue reported on code.google.com by [email protected] on 12 Apr 2014 at 9:22

Review default.properties

What steps will reproduce the problem?
This is not a problem, it is a suggested improvement.

What is the expected output? What do you see instead?
Expected output should result from better Javadoc and javac configuration 
settings

What version of the product are you using? On what operating system?
I'm using 0.2-SNAPSHOT on Ubuntu 12.04LTS

Please provide any additional information below.
I'll get a patch for this shortly. It is reasonably trivial.

Original issue reported on code.google.com by [email protected] on 26 Jan 2013 at 6:43

Add Fetch Report to FetchedResult

We have loads of fine-grained methods available to us via FetchedResult.
I think it would be really cool, however, if we were able to print a report of the FetchedResult including some timing statistics as well as an account of all page metadata, its content, etc., just as the Nutch WebTableReader does:

https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java

Patch coming up.

Original issue reported on code.google.com by [email protected] on 28 Aug 2014 at 4:35

Upgrade the Slf4j logging in SiteMaps

The upgrade I suggest is as follows:
When I find a log statement which concatenates text, I want to use slf4j's "parameterized" way of doing it.
Here is an example:
Current code:
LOG.debug("XML url = " + xmlUrl);

Suggested Code:
LOG.debug("XML url = {}", xmlUrl);



The suggested code is the slf4j way of doing that sort of String concatenation 
logging.


The logic is that if the user's log level is INFO, for example, there is no need to waste CPU doing that concatenation.
Our current code does the String concatenation anyway (even when on INFO level or higher), while the slf4j way doesn't do the concatenation unless DEBUG level is enabled.


In log4j they realized the problem, so they came up with a workaround which is using a condition as follows:
if (LOG.isDebugEnabled()){
 LOG.debug("XML url = " + xmlUrl);
}

It solves the problem, but now we have 3 lines instead of 1.
(Slf4j does the same check as log4j, but it does it internally.)



Which brings me to my second suggestion:
Removing the check for the log level.

The check is needed when using old logging libraries like log4j v1.x, but when using slf4j there is no need for it, as I wrote above.

So I want to change the following code:
if (LOG.isDebugEnabled()){
    StringBuffer sb = new StringBuffer("  ");
    sb.append(i).append(". ").append(url);
    LOG.debug(sb.toString());
}

By removing the unnecessary IF condition.

So it will look like the following:
StringBuffer sb = new StringBuffer("  ");
sb.append(i).append(". ").append(url);
LOG.debug(sb.toString());


And if both suggestions are accepted, the above code will change as follows:
Current code:
if (LOG.isDebugEnabled()){
    StringBuffer sb = new StringBuffer("  ");
    sb.append(i).append(". ").append(url);
    LOG.debug(sb.toString());
}


Suggested code:
LOG.debug("  {}. {}", i, url);

Original issue reported on code.google.com by [email protected] on 9 Apr 2014 at 6:15

Move Javadoc out of core code.

We should move the docs out of the core code.
The current docs are deleted and replaced anyway when doing a release.
I propose to store them in
http://crawler-commons.googlecode.com/svn/docs
Any objections?

Original issue reported on code.google.com by [email protected] on 28 Jan 2013 at 2:57

Trivial improvements to UserAgent

The attachment provides better Javadoc for UserAgent class as well as 
introducing a base CrawlerCommons class which can and should contain really 
general helper methods such as getVersion().


Original issue reported on code.google.com by [email protected] on 3 May 2013 at 6:00

EffectiveTldFinderTest.java incorrectly references EffectiveTLD class

What steps will reproduce the problem?
1. Try to compile the EffectiveTldFinderTest.java file
2. Compile fails.
3.

What is the expected output? What do you see instead?

Instead of a clean compile, I get:

EffectiveTldFinderTest.java:17: cannot find symbol
symbol  : class EffectiveTLD
location: class EffectiveTldFinderTest
        EffectiveTLD etld = null;
        ^
EffectiveTldFinderTest.java:30: cannot find symbol
symbol  : class EffectiveTLD
location: class EffectiveTldFinderTest
        EffectiveTLD etld = null;
        ^
EffectiveTldFinderTest.java:41: cannot find symbol
symbol  : class EffectiveTLD
location: class EffectiveTldFinderTest
        EffectiveTLD etld = null;
        ^
EffectiveTldFinderTest.java:50: cannot find symbol
symbol  : class EffectiveTLD
location: class EffectiveTldFinderTest
        EffectiveTLD etld = null;
        ^
EffectiveTldFinderTest.java:68: cannot find symbol
symbol  : class EffectiveTLD
location: class EffectiveTldFinderTest
        EffectiveTLD etld = null;
        ^
5 errors



What version of the product are you using? On what operating system?

latest source on CentOS

Please provide any additional information below.

Problem solved by changing the affected lines as follows:

EffectiveTLD etld = null;       
[becomes]
EffectiveTldFinder.EffectiveTLD etld = null;

Original issue reported on code.google.com by [email protected] on 19 May 2010 at 10:55

Is URL Legal? Why do we need that?

Method call returns "false":

DEBUG - SiteMapParser              -   169. url="http://www.classmates.com/sitemaps/publicprofile/publicprofile-sitemap-19990716-0000.xml.gz",lastMod=2009-10-21T00:00+12:00,type=null,processed=false,urlListSize=0
INFO  - ClassmatesCrawler          - parsed sitemapIndex: url="http://www.classmates.com/sitemaps/publicprofile/sitemapindex-publicprofile-1999.xml",sitemapListSize=169
INFO  - ClassmatesCrawler          - fetching sitemapUrl: http://www.classmates.com/sitemaps/publicprofile/publicprofile-sitemap-19990725-0000.xml.gz
INFO  - ClassmatesCrawler          - contentType: application/xml  URL: http://www.classmates.com/sitemaps/publicprofile/publicprofile-sitemap-19990725-0000.xml.gz
DEBUG - SiteMapParser              - urlIsLegal: http://www.classmates.com/sitemaps/publicprofile/ <= http://www.classmates.com/directory/public/memberprofile/list.htm?regId=2406410 ? false
DEBUG - SiteMapParser              - urlIsLegal: http://www.classmates.com/sitemaps/publicprofile/ <= http://www.classmates.com/directory/public/memberprofile/list.htm?regId=2406412 ? false
DEBUG - SiteMapParser              - urlIsLegal: http://www.classmates.com/sitemaps/publicprofile/ <= http://www.classmates.com/directory/public/memberprofile/list.htm?regId=2406413 ? false
DEBUG - SiteMapParser              - urlIsLegal: http://www.classmates.com/sitemaps/publicprofile/ <= http://www.classmates.com/directory/public/memberprofile/list.htm?regId=2406415 ? false


(sorry for bad formatting...)

-Fuad

Original issue reported on code.google.com by [email protected] on 10 Jun 2011 at 7:12

Support Googlebot-compatible regular expressions in URL specifications

As an example, see http://www.scottish.parliament.uk/robots.txt

User-agent: *
Disallow: /*.htm$

See http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 for 
details on how Google handles wildcards. It's unclear to me whether anything other than '*' and '$' is treated specially.

I'll file a separate issue re matching against query parameters, which might 
not be supported currently.

Original issue reported on code.google.com by [email protected] on 17 Mar 2013 at 6:20

Proper handling of InterruptedException

1. Code snippet from SimpleHttpFetcher:

                    // Check to see if we got interrupted.
                    if (Thread.interrupted()) {
                        throw new AbortedFetchException(url, AbortedFetchReason.INTERRUPTED);
                    }

2. Code snippet from AbortedFetchReason:

    INTERRUPTED,            // Fetch was interrupted (typically by FetchBuffer calling executor.terminate())


Comment: the bug here is that we clear the interrupted status of the calling thread, and we re-throw a kind of new INTERRUPTED without properly documenting it, etc.

Calling thread must properly handle AbortedFetchException; otherwise it will 
hang. 


Another problem:

   private static void safeAbort(boolean needAbort, HttpRequestBase request) {
        if (needAbort && (request != null)) {
            try {
                request.abort();
            } catch (Throwable t) {
                // Ignore any errors
            }
        }
    }


What to do with an OutOfMemoryError then?


Such patterns in the code can cause the application to hang unpredictably and we will have no clue why... I experienced this when I was forced to analyze the code, found that "aborted" was indeed due to "interrupted", and had to re-throw InterruptedException (or set the interrupted status) for the upper layer.


References:
http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html





Original issue reported on code.google.com by [email protected] on 7 Nov 2012 at 2:18

[SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString()

I am seeing ~19% of CPU time spent on string concatenations when parsing a large number of sitemap files. This is due to:

* SiteMapURL.toString using implicit concatenations (String s = "url=\"" + url + "\",")

* SiteMapURL.toString being called unnecessarily, e.g. LOG.debug("  " + (i + 1) + ". " + sUrl);

The attached patch improves the toString() implementation and checks the log levels before building the strings.



Original issue reported on code.google.com by digitalpebble on 24 May 2013 at 2:07
