
http-fetcher's Introduction


Overview

Crawler-Commons is a set of reusable Java components that implement functionality common to any web crawler.
These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

Table of Contents

User Documentation

Javadocs

Mailing List

There is a mailing list on Google Groups.

Installation

Using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>1.4</version>
</dependency>

Using Gradle, add the following to your build file:

dependencies {
    implementation group: 'com.github.crawler-commons', name: 'crawler-commons', version: '1.4'
}

News

18th July 2023 - crawler-commons 1.4 released

We are pleased to announce the 1.4 release of Crawler-Commons.

The new release includes many improvements and bug fixes, several dependency upgrades and improvements to the automatic build system. The following are the most notable improvements and changes:

  • Java 11 is now required to run or build crawler-commons
  • the robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309 and provides a new API entry point that accepts a collection of single-word user-agent product tokens, allowing faster, RFC-compliant matching of robots.txt user-agent lines. Please note that user-agent product tokens must be lower-case (see the sketch below).
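
For illustration, a minimal sketch of the new entry point (the URL, robots.txt content, and product token "mybot" are invented for the example):

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsRulesExample {
        public static void main(String[] args) {
            byte[] robotsTxt = "User-agent: mybot\nDisallow: /private/\n"
                    .getBytes(StandardCharsets.UTF_8);

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // New in 1.4: pass a collection of lower-case, single-word product tokens
            BaseRobotRules rules = parser.parseContent("https://www.example.com/robots.txt",
                    robotsTxt, "text/plain", List.of("mybot"));

            System.out.println(rules.isAllowed("https://www.example.com/private/a.html")); // false
            System.out.println(rules.isAllowed("https://www.example.com/public/a.html"));  // true
        }
    }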

See the CHANGES.txt file included with the release for the detailed list of changes.

28th July 2022 - crawler-commons 1.3 released

We are glad to announce the 1.3 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. The new release includes multiple dependency upgrades, improvements to the automatic builds, and tighter protection against XXE vulnerabilities in the Sitemap parser.

14th October 2021 - crawler-commons 1.2 released

We are glad to announce the 1.2 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. This version fixes an XXE vulnerability issue in the Sitemap parser and includes several improvements to the URL normalizer and the Sitemaps parser.

29th June 2020 - crawler-commons 1.1 released

We are glad to announce the 1.1 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.

21st March 2019 - crawler-commons 1.0 released

We are glad to announce the 1.0 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other bug fixes and improvements this version adds support for parsing sitemap extensions (image, video, news, alternate links).

7th June 2018 - crawler-commons 0.10 released

We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains, among other things, improvements to Sitemap parsing and the removal of the Tika dependency.

31st October 2017 - crawler-commons 0.9 released

We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of the DOM-based sitemap parser, as the SAX equivalent introduced in the previous version has better performance and is more robust. You might need to change your code to replace SiteMapParserSAX with SiteMapParser. The parser is now aware of namespaces and by default does not force the namespace to be the one recommended in the specification (http://www.sitemaps.org/schemas/sitemap/0.9), as variants can be found in the wild. You can set this behaviour using the method setStrictNamespace(boolean).
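
For illustration, a minimal usage sketch of the namespace-aware parser (the sitemap content and URLs are invented for the example):

    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    public class SitemapExample {
        public static void main(String[] args) throws Exception {
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                    + "<url><loc>https://www.example.com/page.html</loc></url>"
                    + "</urlset>";

            SiteMapParser parser = new SiteMapParser();
            // Enforce the namespace from the specification; the default is lenient
            parser.setStrictNamespace(true);

            AbstractSiteMap parsed = parser.parseSiteMap(xml.getBytes(StandardCharsets.UTF_8),
                    new URL("https://www.example.com/sitemap.xml"));
            for (SiteMapURL u : ((SiteMap) parsed).getSiteMapUrls()) {
                System.out.println(u.getUrl());
            }
        }
    }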

As usual, version 0.9 contains numerous improvements and bug fixes, and all users are invited to upgrade to this version.

9th June 2017 - crawler-commons 0.8 released

We are glad to announce the 0.8 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of the HTTP fetcher support, which has been moved into a separate project. We also added a SAX-based parser for processing sitemaps, which requires less memory and is more robust to malformed documents than its DOM-based counterpart. The latter has been kept for now but might be removed in the future.

24th November 2016 - crawler-commons 0.7 released

We are glad to announce the 0.7 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are that Crawler-Commons now requires Java 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses Crawler-Commons, you might want to run the following command on it:

find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'

Please also note that this is the last release containing the HTTP fetcher support, which is deprecated and will be removed in the next version.

Version 0.7 contains numerous improvements and bug fixes, and all users are invited to upgrade to this version.

11th June 2015 - crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central. Please note that the groupId has changed to com.github.crawler-commons.

The Java documentation can be found here.

22nd April 2015 - crawler-commons has moved

The crawler-commons project is now being hosted at GitHub, due to the demise of Google code hosting.

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades Apache Tika to 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.

11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.

See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central.

11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the CHANGES.txt file included with the release for a full list of details.

24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See Apache Nutch v1.7 Released for more details.

08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See Apache Nutch v2.2 Released for more details.

02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support.

See the CHANGES.txt file included with the release for a full list of details.

License

Published under Apache License 2.0, see LICENSE

http-fetcher's People

Contributors

aecio, dependabot[bot], kkrugler, rzo1


http-fetcher's Issues

Get rid of LOGGER.error in SimpleHttpFetcher

Currently we have:

            } else if (e.getCause() instanceof RedirectException) {
                LOGGER.error(e.getMessage());
                throw new RedirectFetchException(url, extractRedirectedUrl(url, localContext), RedirectExceptionReason.TOO_MANY_REDIRECTS);
            } else {

But we're throwing RedirectFetchException, so we shouldn't be logging anything. And the e.getMessage() call returns an empty string, so you get:

ERROR crawlercommons.fetcher.http.SimpleHttpFetcher                 - 

In the logs, which isn't very useful.
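
The fix would be to simply drop the log call, e.g. (the same block as above, minus the LOGGER line):

            } else if (e.getCause() instanceof RedirectException) {
                // No logging needed: the thrown exception already carries the context
                throw new RedirectFetchException(url, extractRedirectedUrl(url, localContext), RedirectExceptionReason.TOO_MANY_REDIRECTS);
            } else {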

Remove SimpleHttpFetcher.finalize() method

Overriding the Object::finalize() method has been deprecated since Java 9, so this override should be removed. One option would be to replace it with a close() method and make SimpleHttpFetcher implement the Closeable interface.

Current finalize() implementation:

    @Override
    protected void finalize() {
        monitor.interrupt();
        _connectionManager.shutdown();
        IOUtils.closeQuietly(_httpClient);
        _httpClient = null;
    }
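
A minimal standalone sketch of the Closeable approach (the class and field types here are stand-ins, not the project's actual code):

    import java.io.Closeable;

    public class CloseableFetcherSketch implements Closeable {
        private final Thread monitor = new Thread();  // stand-in for the connection monitor thread
        private AutoCloseable httpClient;             // stand-in for the underlying HTTP client

        public CloseableFetcherSketch(AutoCloseable httpClient) {
            this.httpClient = httpClient;
        }

        @Override
        public void close() {
            // Same cleanup as finalize(), but deterministic: callers can use
            // try-with-resources instead of relying on garbage collection.
            monitor.interrupt();
            try {
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (Exception e) {
                // closeQuietly semantics: ignore failures during shutdown
            }
            httpClient = null;
        }
    }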

Add a README

We should have a README with an example of how to use the fetcher code.

Unit test for stale connections

We had to comment out the test for stale connections, testStaleConnection(), in SimpleHttpFetcherTest when upgrading to Jetty 11 (pull request #31). It was disabled because the method ServerConnector::setSoLingerTime() is no longer supported in Jetty 11+. So it is unclear whether the test is doing what it is supposed to do, and it is also not clear whether there is an alternative way to test this.

Hook up HttpClient tests

We currently have a bunch of custom test handlers (e.g. RedirectResponseHandler) that aren't being used. I think these were previously hooked up in Bixo, and that test code should be ported.

Fix fetching of robots.txt with https protocol

Examples include http://rollingstone.com/robots.txt, http://forbes.com/robots.txt, and http://www.morningstar.com/robots.txt. These redirect to https://xxx, and that fails with a javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated exception.

Make cookie parsing of dates more lenient

Currently we get errors like:

2017-12-21 20:31:27,606 WARN  org.apache.http.client.protocol.ResponseProcessCookies        - Invalid cookie header: "Set-Cookie: WMF-Last-Access=21-Dec-2017;Path=/;HttpOnly;secure;Expires=Mon, 22 Jan 2018 12:00:00 GMT". Invalid 'expires' attribute: Mon, 22 Jan 2018 12:00:00 GMT

We should have more lenient parsing of dates.
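
One possible mitigation (an assumption, not verified against this code base): configure HttpClient 4.x to use the RFC 6265 "standard" cookie spec, which accepts a wider range of Expires date formats than the default spec:

    import org.apache.http.client.config.CookieSpecs;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class LenientCookieClient {
        // Builds a client whose default request config selects the RFC 6265
        // "standard" cookie spec instead of the default one.
        public static CloseableHttpClient build() {
            RequestConfig config = RequestConfig.custom()
                    .setCookieSpec(CookieSpecs.STANDARD)
                    .build();
            return HttpClients.custom()
                    .setDefaultRequestConfig(config)
                    .build();
        }
    }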

Reduce/remove dependency on Tika

Currently we pull in tika-core so we can use the Metadata type (for HTTP response headers) and for access to one MIME-type extractor routine. We can likely trim down many of the dependent jars, or maybe get rid of the dependency altogether and just use the class that we need.

Don't throw exceptions for HTTP errors

Bringing back issue crawler-commons/crawler-commons#86:

In our SimpleHttpFetcher, we check the status code, and when we encounter a status code not starting with "2" (non 2xx) we throw an Exception:
https://github.com/crawler-commons/http-fetcher/blob/master/src/main/java/crawlercommons/fetcher/http/SimpleHttpFetcher.java#L622-L626

Often the page itself has HTML content, which we should return to the user instead of throwing an exception.

For example: 404 pages...
Another example: a page served with status code 500.

Please note that when checking several other HTTP clients, all of them returned the page even when the status code was 4xx or 500 (maybe in other cases too; I didn't check).
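
A hypothetical sketch of the requested behaviour (assuming the fetcher's get(url) entry point and the FetchedResult accessors; this is not how the code behaves today):

    import crawlercommons.fetcher.FetchedResult;
    import crawlercommons.fetcher.http.BaseHttpFetcher;

    public class LenientFetchSketch {
        // Proposed behaviour: return the body for any status code and let the
        // caller branch on it, instead of throwing for non-2xx responses.
        static byte[] fetchBody(BaseHttpFetcher fetcher, String url) throws Exception {
            FetchedResult result = fetcher.get(url);
            if (result.getStatusCode() / 100 != 2) {
                // Caller decides: log, retry, or use the content (e.g. a custom 404 page)
            }
            return result.getContent();
        }
    }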

Put this one into the Maven repo

If I understand correctly, this library isn't present in the Maven repository (I might be wrong here).

So, in order to use it, one needs to download the source code and integrate it into their own code (or build a jar and copy it into one's project?).

Can this project be put into the Maven repository so that I could just load it using the pom.xml file?

Switch back to Java 1.7 compatibility?

I'd like to use http-fetcher with some projects that are Java 1.7. I'm wondering if anyone has any strong feelings about keeping it at 1.8?

I've tried this, and other than a one line change everything compiles and tests fine.
