
crawler4j's Introduction

crawler4j


crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

Table of contents

  • Installation
  • Quickstart
  • More Examples
  • Configuration Details
  • License

Installation

Using Maven

Add the following dependency to your pom.xml:

    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.4.0</version>
    </dependency>

Using Gradle

Add the following dependency to your build.gradle file:

compile group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'

Quickstart

You need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "https://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
     @Override
     public boolean shouldVisit(Page referringPage, WebURL url) {
         String href = url.getURL().toLowerCase();
         return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.ics.uci.edu/");
     }

     /**
      * This function is called when a page is fetched and ready
      * to be processed by your program.
      */
     @Override
     public void visit(Page page) {
         String url = page.getWebURL().getURL();
         System.out.println("URL: " + url);

         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             String text = htmlParseData.getText();
             String html = htmlParseData.getHtml();
             Set<WebURL> links = htmlParseData.getOutgoingUrls();

             System.out.println("Text length: " + text.length());
             System.out.println("Html length: " + html.length());
             System.out.println("Number of outgoing links: " + links.size());
         }
    }
}

As can be seen in the above code, there are two main functions that should be overridden:

  • shouldVisit: This function decides whether the given URL should be crawled or not. In the above example, the crawler skips .css, .js and media files and only allows pages within the 'www.ics.uci.edu' domain.
  • visit: This function is called after the content of a URL is downloaded successfully. You can easily get the URL, text, links, HTML, and unique ID of the downloaded page.

You should also implement a controller class which specifies the seeds of the crawl, the folder in which intermediate crawl data should be stored and the number of concurrent threads:

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed urls. These are the first
        // URLs that are fetched and then the crawler starts following links
        // which are found in these pages
        controller.addSeed("https://www.ics.uci.edu/~lopes/");
        controller.addSeed("https://www.ics.uci.edu/~welling/");
    	controller.addSeed("https://www.ics.uci.edu/");
    	
    	// The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<BasicCrawler> factory = MyCrawler::new;
        
        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }
}

More Examples

  • Basic crawler: the full source code of the above example with more details.
  • Image crawler: a simple image crawler that downloads image content from the crawling domain and stores them in a folder. This example demonstrates how binary content can be fetched using crawler4j.
  • Collecting data from threads: this example demonstrates how the controller can collect data/statistics from crawling threads.
  • Multiple crawlers: this is a sample that shows how two distinct crawlers can run concurrently. For example, you might want to split your crawling into different domains and then take different crawling policies for each group. Each crawling controller can have its own configurations.
  • Shutdown crawling: this example shows how crawling can be terminated gracefully by sending the 'shutdown' command to the controller.
  • Postgres/JDBC integration: this example shows how to save the crawled content into a Postgres database (or any other JDBC repository), thanks to rzo1.

Configuration Details

The controller class has a mandatory parameter of type CrawlConfig. Instances of this class can be used to configure crawler4j. The following sections describe some of the configuration details.

Crawl depth

By default there is no limit on the depth of crawling, but you can set one. For example, assume that you have a seed page "A", which links to "B", which links to "C", which links to "D". This gives the following link structure:

A -> B -> C -> D

Since, "A" is a seed page, it will have a depth of 0. "B" will have depth of 1 and so on. You can set a limit on the depth of pages that crawler4j crawls. For example, if you set this limit to 2, it won't crawl page "D". To set the maximum depth you can use:

crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);

Enable SSL

To enable SSL simply:

CrawlConfig config = new CrawlConfig();

config.setIncludeHttpsPages(true);

Maximum number of pages to crawl

Although by default there is no limit on the number of pages to crawl, you can set a limit on this:

crawlConfig.setMaxPagesToFetch(maxPagesToFetch);

Enable Binary Content Crawling

By default, crawling binary content (e.g. images, audio, etc.) is turned off. To enable crawling of these files:

crawlConfig.setIncludeBinaryContentInCrawling(true);

See an example here for more details.

Politeness

crawler4j is designed to be very efficient and can crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, this is against crawling policies and puts a huge load on servers (and they might block you!), so since version 1.3 crawler4j waits at least 200 milliseconds between requests by default. This parameter can be tuned:

crawlConfig.setPolitenessDelay(politenessDelay);

Proxy

Should your crawl run behind a proxy? If so, you can use:

crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);

If your proxy also needs authentication:

crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);

Resumable Crawling

Sometimes you need to run a crawler for a long time, and it may terminate unexpectedly. In such cases, it can be desirable to resume the crawl. You can resume a previously stopped/crashed crawl using the following setting:

crawlConfig.setResumableCrawling(true);

However, you should note that it might make the crawling slightly slower.

User agent string

The user-agent string identifies your crawler to web servers. See here for more details. By default crawler4j uses the following user agent string:

"crawler4j (https://github.com/yasserg/crawler4j/)"

However, you can override it:

crawlConfig.setUserAgentString(userAgentString);
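
Putting several of these options together, a typical configuration might look like the following sketch; all values here are illustrative, not recommendations:

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");
config.setMaxDepthOfCrawling(3);                  // stop three links away from the seeds
config.setMaxPagesToFetch(10000);                 // hard cap on the number of fetched pages
config.setIncludeHttpsPages(true);                // also crawl https:// URLs
config.setIncludeBinaryContentInCrawling(false);  // skip images, archives, etc.
config.setPolitenessDelay(500);                   // wait 500 ms between requests
config.setUserAgentString("my-crawler (https://www.example.com/bot)");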

License

Copyright (c) 2010-2018 Yasser Ganjisaffar

Published under Apache License 2.0, see LICENSE


crawler4j's Issues

crawling images

I just noticed that if an embedded image URL looks like the following, the crawler seems to ignore it and therefore misses that image.

class="overlay" href="/photos/myname//6004496333/" data-rapid_p="81"

Is this a known issue, or did I miss something?

SEVERE: Fatal transport error: status code: 1005

Mar 01, 2015 9:50:14 PM edu.uci.ics.crawler4j.fetcher.PageFetcher fetchHeader
SEVERE: Fatal transport error: wordpress.com:443 failed to respond while fetching https://wordpress.com/ (link found in doc #0)
Non success status for link: https://wordpress.com/, status code: 1005, description: Fatal transport error - Is the server down ?
Mar 01, 2015 9:50:14 PM edu.uci.ics.crawler4j.crawler.WebCrawler onUnexpectedError
WARNING: Skipping URL: https://wordpress.com/, StatusCode: 1005, , Fatal transport error - Is the server down ?

Number of crawled Pages varies

Hi,

after updating to the new crawler4j version (3.0), the total number of crawled pages varies considerably between runs on the same site.

For example, the first run results in 1076 pages.
The second run, directly after the first one, results in 876 pages.
The third one results in 971 pages.

In previous versions there was no such variation.

I've configured the crawler like the following:

    CrawlConfig config = new CrawlConfig();

    config.setCrawlStorageFolder(rootFolder);
    config.setPolitenessDelay(400);
    config.setMaxPagesToFetch(-1);
    config.setMaxDepthOfCrawling(-1);
    config.setResumableCrawling(false);
    config.setIncludeHttpsPages(true);
    config.setFollowRedirects(true);
    config.setMaxConnectionsPerHost(200);
    config.setMaxTotalConnections(200);

Is this a known issue?
Or could it be a usage error?

Thanks

Crawling Data

Does crawler4j use data from previous crawls to guide the current crawl? How is it implemented?

JVM heap size keeps increasing considerably

What steps will reproduce the problem?
1. A while loop that creates a CrawlController and crawls a website on each iteration.
2. The CrawlController has been slightly modified, so that could have something to do with it.

What is the expected output? What do you see instead?

After each crawl, the heap size is supposed to return to its previous level, since these are independent operations.

Instead, the heap size keeps increasing without end until the JVM crashes.

What version of the product are you using?

3.5

Please provide any additional information below.

After using the YourKit Java profiler to locate which lines of code were consuming this memory, I ended up at this:

Environment env = new Environment(envHome, envConfig);

This line in CrawlController was consuming over 620 MB out of a total of 650 MB.

Now I don't know exactly what is in this env object, so I can't diagnose it, but I hope you guys can help me with that.

I will include a screenshot of the profiler's results.

Add link attributes information to WebURL

Use case:

In order to decide whether a link should be visited or not, I need other information besides the URL, e.g. the CSS class of the tag. A possible solution could be to provide the WebURL with the link's attributes (a list of key/value pairs). What do you think?

Allow crawling of websites requiring authentication

Whether for personal or institutional (archiving) use, some content on websites requires the user to be authenticated before it can be accessed.

We should implement URL-pattern or domain-based authentication, be able to check for session validity, and enrich the HTTP requests that the crawler issues with authentication-related data (headers, cookies, etc.).
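
For the "standard login form" part of this request, crawler4j's existing AuthInfo mechanism (used with FormAuthInfo in the password-protected-pages issue further down) can already attach credentials to the crawl configuration. A minimal sketch, with placeholder credentials, login URL and form field names:

// Placeholder credentials, login URL and form field names; they must match the target site.
AuthInfo formAuth = new FormAuthInfo("myUser", "myPassword",
        "https://example.com/login", "username", "password");
config.addAuthInfo(formAuth);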

Endless behavior in a calendar url

I am trying to crawl PDF files from an Open Journal Systems (OJS) journal. However, after several minutes, crawler4j gets stuck in an endless loop on a calendar URL, such as:

http://rcci.uci.cu/index.php?journal=rcci&op=statistics&page=about&statisticsYear=7749

Note that the year value at the end of the URL changes continuously. Is there any way to avoid this behavior in crawler4j? My crawler configuration is as follows:

Resumable crawling: false
Max depth of crawl: -1
Max pages to fetch: -1
User agent string: crawler4j (http://code.google.com/p/crawler4j/)
Include https pages: true
Include binary content: true
Max connections per host: 100
Max total connections: 100
Socket timeout: 20000
Max total connections: 100
Max outgoing links to follow: 5000
Max download size: 5242880
Should follow redirects?: true
Proxy host: null
Proxy port: 80
Proxy username: null
Proxy password: null

Best regards.
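
Until crawler4j grows trap detection, one workaround is to filter such calendar URLs out in shouldVisit. A minimal sketch, assuming the OJS statistics pages are the only problematic pattern:

private static final Pattern CALENDAR_TRAP =
        Pattern.compile(".*op=statistics.*statisticsyear=\\d+.*");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Skip the endlessly paginated statistics calendar pages.
    return !CALENDAR_TRAP.matcher(href).matches()
           && href.startsWith("http://rcci.uci.cu/");
}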

Factorize logging

Logging should be factorized so that logging code (especially guard logging statements) does not clutter the logic code.

Output of the BasicCrawler in the package

Hi, I am testing this in Eclipse with Maven, and what I see on the screen when running BasicCrawlerController is below. How do I make sense of this output? Am I doing something wrong?
......
32D0: 63 6F 6D 0A 68 6B 2E 6F 72 67 0A 6C 74 64 2E 68 com.hk.org.ltd.h
32E0: 6B 0A 69 6E 63 2E 68 6B 0A 0A 2F 2F 20 59 6F 6C k.inc.hk..// Yol
32F0: 61 20 3A 20 68 74 74 70 73 3A 2F 2F 77 77 77 2E a : https://www.
3300: 79 6F 6C 61 2E 63 6F 6D 2F 0A 2F 2F 20 53 75 62 yola.com/.// Sub
3310: 6D 69 74 74 65 64 20 62 79 20 53 74 65 66 61 6E mitted by Stefan
3320: 6F 20 52 69 76 65 72 61 20 3C 73 74 65 66 61 6E o Rivera <stefan
3330: 6F 40 79 6F 6C 61 2E 63 6F 6D 3E 20 32 30 31 34 [email protected]> 2014
3340: 2D 30 37 2D 30 39 0A 79 6F 6C 61 73 69 74 65 2E -07-09.yolasite.
3350: 63 6F 6D 0A 0A 2F 2F 20 5A 61 4E 69 43 20 3A 20 com..// ZaNiC :
3360: 68 74 74 70 3A 2F 2F 77 77 77 2E 7A 61 2E 6E 65 http://www.za.ne
3370: 74 2F 0A 2F 2F 20 53 75 62 6D 69 74 74 65 64 20 t/.// Submitted
3380: 62 79 20 72 65 67 69 73 74 72 79 20 3C 68 6F 73 by registry <hos
3390: 74 6D 61 73 74 65 72 40 6E 69 63 2E 7A 61 2E 6E [email protected]
33A0: 65 74 3E 20 32 30 30 39 2D 31 30 2D 30 33 0A 7A et> 2009-10-03.z
33B0: 61 2E 6E 65 74 0A 7A 61 2E 6F 72 67 0A 0A 2F 2F a.net.za.org..//
33C0: 20 3D 3D 3D 45 4E 44 20 50 52 49 56 41 54 45 20 ===END PRIVATE
33D0: 44 4F 4D 41 49 4E 53 3D 3D 3D 0A DOMAINS===.
Keep-Alive-Timer, called close()
Keep-Alive-Timer, called closeInternal(true)
Keep-Alive-Timer, SEND TLSv1.2 ALERT: warning, description = close_notify
Padded plaintext before ENCRYPTION: len = 2
0000: 01 00 ..
Keep-Alive-Timer, WRITE: TLSv1.2 Alert, length = 26
[Raw write]: length = 31
0000: 15 03 03 00 1A 00 00 00 00 00 00 00 02 79 A0 B3 .............y..
0010: 71 88 84 93 CD C4 0D FD 01 0D 6B EF 31 B2 19 q.........k.1..
Keep-Alive-Timer, called closeSocket(true)

I suppose the output should look like the following, according to the step-by-step instructions on the Google Code page:
Docid: 1
URL: http://www.ics.uci.edu/~lopes/
Domain: 'uci.edu'
Sub-domain: 'www.ics'
Path: '/~lopes/'
Parent page: null
Anchor text: null
Text length: 2442
Html length: 9987
Number of outgoing links: 34
Response headers:
Date: Fri, 07 Mar 2014 07:40:21 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Tue, 07 Jan 2014 17:41:36 GMT
ETag: "672c0db5-2703-4ef64e34f02c7"
Accept-Ranges: bytes
Content-Length: 9987
Connection: close
Content-Type: text/html; charset=UTF-8
Set-Cookie: Coyote-2-80c30153=80c3019d:0; path=/

Factory instead of hardcoded class.newInstance()

I would like to suggest that adding the possibility to use a factory to create new web crawlers would be of great value.

Since a web crawler could hold a few custom services (e.g. classifiers, database services), a factory would make crawler4j usable, for example, via Spring.

A few years ago an issue was created on Google Code (https://code.google.com/p/crawler4j/issues/detail?id=144) which is a duplicate of my request, but nothing happened. Is there a reason for not including a factory approach in the code base?

Thanks in advance.

Option to specify DefaultCookieStore

Not all sites have regular "form" or "basic" auth; because of that, it would be great to be able to specify our own cookies to start with. In our case we did this and it seems to be working fine:

CrawlConfig.java

  public CookieStore getDefaultCookieStore() {
      return defaultCookieStore;
  }

  public void setDefaultCookieStore(CookieStore cookieStore) {
      this.defaultCookieStore = cookieStore;
  }

PageFetcher.java

clientBuilder.setDefaultCookieStore(config.getDefaultCookieStore());

Controller

            BasicCookieStore cookieStore = new BasicCookieStore();
            for (Map.Entry<String, String> entry : getLoginCookies().entrySet()){
                BasicClientCookie cookie = new BasicClientCookie(entry.getKey(), entry.getValue());
                cookie.setSecure(true);
                cookie.setDomain("127.0.0.1");
                cookie.setPath("/");
                cookieStore.addCookie(cookie);
            }
            config.setDefaultCookieStore(cookieStore);

            PageFetcher pageFetcher = new PageFetcher(config);

Thanks!

Can't crawl Wikipedia HTTPS pages

Hi. I'm Mario and I hope you can help me guys!

I'm using Google's Crawler4j to download text from:

https://es.wikipedia.org/wiki/David_Gilmour

Two weeks ago this library worked fine, but now it complains with:

"ERROR edu.uci.ics.crawler4j.fetcher.PageFetcher - Fatal transport error: null while fetching https://es.wikipedia.org/wiki/David_Gilmour/ (link found in doc #0)"

Could the problem be related to:

  1. httpS?
  2. Wikipedia disabled crawling on their site?

In case of 2, is there any workaround I can apply?

Thank you very much for your time!

Improve error logging

Currently there are a lot of error-logging calls that do not include a reference to the exception itself, so the stack trace isn't logged and investigation is impossible.

Amazon API spider

hi yasserg

crawler4j helps me get data from the Amazon API. There are many, many URLs to query. First I add 2*numberOfCrawlers seeds and call Controller.startNonBlocking(), then I add another URL each time MyCrawler.visit() is called. But I find that crawler4j always stops at a random task, maybe 180, maybe 1000; there are still other seeds, but it seems all the threads have died.

So, what happened? Can you help me?

I am running crawler4j and the output goes to the directory /frontier/. The files in this directory are:

00000000.jdb
je.info.0
je.lck

The .jdb file is the only one with data; the others have zero bytes. I am not sure what to do with this data. I followed a Stack Overflow link and found out that the .jdb file can be opened using Berkeley DB. I am on a Windows system, so it is getting hard to build and use Berkeley DB. Please suggest whether this is the right output.
I downloaded the crawler4j library from the following link.
https://github.com/yasserg/crawler4j/releases

maxPageSize allowance

Hi,
I'm testing your crawler with the image crawler and it has been great. However, it skips certain images that exceed the max page size. Can you tell me where the default page size is defined?

thanks
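
The limit being hit here is most likely CrawlConfig's maximum download size (the "Max download size" value visible in the configuration dump of an earlier issue). If so, it can be raised before the crawl starts; the value below is illustrative:

// Allow pages/images of up to ~20 MB instead of the default (about 1 MB).
crawlConfig.setMaxDownloadSize(20 * 1024 * 1024);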

Getting processed pages out of crawler4j

Hey folks,

I currently need to get the total number of processed pages out of crawler4j.
So far we are working with localData objects, which work fine as long as we do not have to resume the crawl. If we do, only the pages from the current crawl are counted and all previous pages are missing.

Is there a better and easier way of getting this information?

Best regards,

Daniel
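
If the running total only needs to survive a resume rather than live inside crawler4j, one option is to aggregate the per-thread localData after the crawl finishes and persist the sum yourself. A rough sketch, assuming each crawler's getMyLocalData() returns its processed-page count as an Integer and that previousTotal was loaded from your own storage:

controller.start(MyCrawler.class, numberOfCrawlers);

// Sum the counts returned by each crawler thread's getMyLocalData().
long total = previousTotal;
for (Object localData : controller.getCrawlersLocalData()) {
    total += (Integer) localData;
}
// Persist 'total' to your own storage so it survives the next resume.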

Problem fetching password protected pages

I am trying to fetch password-protected pages (e.g. twitter.com). In order to do that I use FormAuthInfo:

authInfo = new FormAuthInfo("username", "password",
        "https://twitter.com/sessions",
        "session[username_or_email]",
        "session[password]");
config.addAuthInfo(authInfo);

When I start my crawler I get the following output:
[main] INFO edu.uci.ics.crawler4j.fetcher.PageFetcher - FORM authentication for: /sessions123
[main] DEBUG edu.uci.ics.crawler4j.fetcher.PageFetcher - Successfully Logged in with user: username to: twitter.com

But the crawler doesn’t crawl the password protected site.

Is there a problem with the cookie that needs to be sent to the server?

Implement crawler trap detection

Crawler traps are URL patterns that the crawler should not attempt to visit, as they are either:

  • not desired within a specific crawl, or
  • likely to cause the crawler to get stuck in loops (for instance, forum registration)

A crawler trap is specified by a Java regular expression, such as (?i)^.*forum[s]?.*post_reply.cfm=.*$

Traps will be loaded from flat text files, one trap expression per line.
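
A sketch of how such trap files could be consumed by a crawler, assuming one regular expression per line in a hypothetical /etc/crawler/traps.txt:

public class TrapAwareCrawler extends WebCrawler {

    // Trap expressions loaded once from a flat text file, one regex per line.
    private static final List<Pattern> TRAPS = loadTraps("/etc/crawler/traps.txt");

    private static List<Pattern> loadTraps(String path) {
        List<Pattern> traps = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(Paths.get(path))) {
                if (!line.trim().isEmpty()) {
                    traps.add(Pattern.compile(line.trim()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return traps;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        for (Pattern trap : TRAPS) {
            if (trap.matcher(url.getURL()).matches()) {
                return false; // known crawler trap, do not visit
            }
        }
        return super.shouldVisit(referringPage, url);
    }
}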

Threads Stop working

The crawler works with the web pages used in BasicCrawlController:
// controller.addSeed("http://www.ics.uci.edu/~yganjisa/");
// controller.addSeed("http://www.ics.uci.edu/~lopes/");
// controller.addSeed("http://www.ics.uci.edu/");

but no other web pages appear to work. One example is below:

Docid: 1
URL: http://www.unisa.edu.au/
Docid of parent page: 0
Text length: 2007
Html length: 14538

Number of outgoing links: 90

10017 [Thread-1] INFO crawler.CrawlController - It looks like no thread is working, waiting for 10 seconds to make sure...
20023 [Thread-1] INFO crawler.CrawlController - No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
30024 [Thread-1] INFO crawler.CrawlController - All of the crawlers are stopped. Finishing the process...
30025 [Thread-1] INFO crawler.CrawlController - Waiting for 10 seconds before final clean up...

Implement a crawl log

The crawl log should contain all useful data about processed URIs, and could be used as a basis for extracting metrics on the crawl, and as a basis for indexing in an archiving crawler.

Implement support for links with (unsafe) characters such as spaces

Hi,

Using the simple example implementation of the crawler, I've noticed that links with spaces in them are ignored, i.e. not included in the list of outgoing links by the parser. I suspect this is not the intention as, e.g., the URLCanonicalizer normalizes " " to "%20", i.e. it accounts for spaces.

Example referring page: "http://seagateplastics.com/Stock_Plastics_Catalog/catalog/angles.html"
This link is skipped: "http://seagateplastics.com/Stock_Plastics_Catalog/images_catalog/SG2078 PDF (1).pdf".

If this could be fixed/explained, that would be great.
Thanks!

Allow other HTML parser implementations

I have a use case where I want to do "predictive" crawling, i.e. I know in advance which patterns of outgoing URLs I want to crawl.

For that I would like another HTML parser implementation, like jsoup, which allows jQuery-like selectors that are very useful for my purpose.
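
In the meantime, nothing prevents re-parsing the raw HTML with jsoup inside visit(); a minimal sketch, with a made-up CSS selector:

@Override
public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        // Re-parse with jsoup to get CSS/jQuery-like selectors.
        Document doc = Jsoup.parse(html, page.getWebURL().getURL());
        for (Element link : doc.select("a.article-link[href]")) { // made-up selector
            String absoluteUrl = link.attr("abs:href");
            // ... record or schedule absoluteUrl as needed
        }
    }
}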

HTTP Basic Auth

Hi,

Is it possible to crawl pages which are secured by HTTP Basic auth?
This could be a nice feature to be implemented by your crawler.

Thanks
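
crawler4j 4.x ships a BasicAuthInfo alongside the FormAuthInfo used in the password-protected-pages issue above, which should cover this. A minimal sketch with placeholder credentials and URL:

// HTTP Basic authentication for a protected area (placeholder values).
AuthInfo basicAuth = new BasicAuthInfo("myUser", "myPassword", "https://example.com/protected/");
config.addAuthInfo(basicAuth);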

Crawler thread seems not running

I use this excellent library to scrape a site, but sometimes it stops unexpectedly without any exceptions or errors.
My custom WebCrawler is below:

public class YNUWebCrawler extends WebCrawler {

  private Log logger = LogFactory.getLog(YNUWebCrawler.class);

  @Autowired
  public static CrawlerRepository crawlerRepository;

  private final static Pattern FILTERS = Pattern.compile(".*(\\.("
      + "css|js|gif|jpg|jpeg|png|mp3|mp4|avi|flv|zip|gz|apk|ipa|exe|bin|doc|docx|xls|xlsx|ppt|pptx"
      + "))$");

  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    try {
      return !FILTERS.matcher(href).matches() && Utils.getDomainName(href).contains("ynu.edu.cn");
    } catch (URISyntaxException e) {
      e.printStackTrace();
    }
    return false;
  }

  @Override
  public void visit(Page page) {
    String url = page.getWebURL().getURL();

    if (page.getParseData() instanceof HtmlParseData) {
      HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
      String renderedHtml = htmlParseData.getHtml();

      logger.info("process crawler data with url " + url);
      Crawler crawlerPersisted = crawlerRepository.findByUrl(url);
      if (crawlerPersisted == null) {
        Crawler crawlerCreated = new Crawler(url, DigestUtils.md5Hex(renderedHtml),
            Crawler.STATUS_CREATE);
        crawlerCreated.setModifiedFlag(true);
        logger.info("add crawler data with url " + url);
        crawlerRepository.save(crawlerCreated);
      } else {
        String newContentMD5 = DigestUtils.md5Hex(renderedHtml);
        if (!newContentMD5.equals(crawlerPersisted.getMd5())) {
          logger.info("update crawler data with url " + url);
          crawlerPersisted.setMd5(newContentMD5);
          crawlerPersisted.setStatus(Crawler.STATUS_UPDATE);
          crawlerPersisted.setModifiedFlag(true);
          logger.info("update crawler data with url " + url);
          crawlerRepository.save(crawlerPersisted);
        }
      }
    }
  }
}
  private void doCrawlerProcessing() {
    new Thread(() -> {
      logger.trace("starting doCrawlerProcessing");
      isCrawlerProcessing = true;
      String crawlStorageFolder = environment.getProperty(PROPERTY_KEY_CRAW_STORAGE_FOLDER, "tmp");
      int numberOfCrawlers = Integer.parseInt(environment.getProperty(
          PROPERTY_KEY_NUMBER_OF_CARWLERS, "10"));

      CrawlConfig config = new CrawlConfig();
      config.setCrawlStorageFolder(crawlStorageFolder);
      config.setPolitenessDelay(Integer.parseInt(environment.getProperty(
          PROPERTY_KEY_POLITENESS_DELAY, "50")));

      PageFetcher pageFetcher = new PageFetcher(config);
      RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
      RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
      CrawlController controller;
      try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed(environment.getProperty(PROPERTY_KEY_ROOT_SEED_URL, "http://www.ynu.edu.cn/"));
        // For now we cannot inject crawlerRepository into YNUWebCrawler, so use this temporary workaround
        YNUWebCrawler.crawlerRepository = crawlerRepository;
        controller.start(YNUWebCrawler.class, numberOfCrawlers);
        logger.info("ended doCrawlerProcessing");

        // Find all entries flagged as modified, record their URLs to the change file, and reset their status
        List<Crawler> modifiedCrawler = crawlerRepository.findByModifiedFlag(true);
        List<String> modifiedUrl = new ArrayList<>();
        modifiedCrawler.stream().forEach(crawler -> {
          modifiedUrl.add(crawler.getUrl());
          crawler.setModifiedFlag(false);
          crawler.setStatus(Crawler.STATUS_NORMAL);
        });
        logger.trace("starting update crawler data");
        crawlerRepository.save(modifiedCrawler);
        logger.trace("ended update crawler data");
        logger.trace("starting writeChangeFile");
        Utils.writeChangeFile(modifiedUrl);
        logger.trace("ended writeChangeFile");
        isCrawlerProcessing = false;
      } catch (Exception e) {
        logger.error(e.getMessage());
      }
    }).start();
  }

The following are some log messages from around where it stopped:

2015-05-27 19:46:31.597  INFO 11200 --- [Crawler 187] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://bbs.ynu.edu.cn/forum.php?action=reply&extra&fid=40&mod=post&page=1&repquote=319662&tid=27039
2015-05-27 19:46:31.626  INFO 11200 --- [Crawler 187] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://bbs.ynu.edu.cn/forum.php?action=reply&extra&fid=40&mod=post&page=1&repquote=319662&tid=27039
2015-05-27 19:46:31.698  INFO 11200 --- [Crawler 203] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/kxyj/kyxm/6053.htm
2015-05-27 19:46:31.698  INFO 11200 --- [Crawler 422] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.swrq.ynu.edu.cn/tzgg/28978.htm
2015-05-27 19:46:31.722  INFO 11200 --- [Crawler 285] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/kxyj/yjcg/22474.htm
2015-05-27 19:46:31.730  INFO 11200 --- [Crawler 203] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/kxyj/kyxm/6053.htm
2015-05-27 19:46:31.730  INFO 11200 --- [Crawler 422] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.swrq.ynu.edu.cn/tzgg/28978.htm
2015-05-27 19:46:31.732  INFO 11200 --- [Crawler 435] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/xsgz/xshd/5488.htm
2015-05-27 19:46:31.744  WARN 11200 --- [Crawler 309] e.uci.ics.crawler4j.crawler.WebCrawler   : Skipping a URL: http://www.sds.ynu.edu.cn/docs/2011-11/20111105161115180568.rar which was bigger ( 10000000 ) than max allowed size
2015-05-27 19:46:31.750  INFO 11200 --- [Crawler 285] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/kxyj/yjcg/22474.htm
2015-05-27 19:46:31.761  INFO 11200 --- [Crawler 435] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/xsgz/xshd/5488.htm

or

2015-05-28 16:54:05.240  INFO 40786 --- [    Crawler 280] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/515.aspx
2015-05-28 16:54:05.283  INFO 40786 --- [    Crawler 280] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/515.aspx
2015-05-28 16:54:06.927  INFO 40786 --- [    Crawler 562] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/545/1.aspx
2015-05-28 16:54:06.956  INFO 40786 --- [    Crawler 562] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/545/1.aspx
2015-05-28 16:54:08.241  INFO 40786 --- [    Crawler 881] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/show-1228.aspx
2015-05-28 16:54:08.251  INFO 40786 --- [    Crawler 371] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.lib.ynu.edu.cn/intrduce/491
2015-05-28 16:54:08.256  INFO 40786 --- [    Crawler 382] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.lib.ynu.edu.cn/node/257
2015-05-28 16:54:08.270  INFO 40786 --- [    Crawler 881] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/show-1228.aspx
2015-05-28 16:54:08.280  INFO 40786 --- [    Crawler 371] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.lib.ynu.edu.cn/intrduce/491
2015-05-28 16:54:08.280  WARN 40786 --- [    Crawler 827] e.uci.ics.crawler4j.crawler.WebCrawler   : Skipping URL: http://www.ynu.edu.cn/info/2011-05-27/0-2-3821.html, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
2015-05-28 16:54:08.287  INFO 40786 --- [    Crawler 382] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.lib.ynu.edu.cn/node/257
2015-05-28 16:54:08.302  INFO 40786 --- [    Crawler 309] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.dj.ynu.edu.cn/wdxz/6522.htm
2015-05-28 16:54:08.308  INFO 40786 --- [    Crawler 569] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.dj.ynu.edu.cn/ywdd/35459.htm
2015-05-28 16:54:08.330  INFO 40786 --- [    Crawler 309] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.dj.ynu.edu.cn/wdxz/6522.htm
2015-05-28 16:54:08.337  INFO 40786 --- [    Crawler 569] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.dj.ynu.edu.cn/ywdd/35459.htm

I walked through the code but found no useful information about this strange problem.

PS: I tried setting different numbers of crawlers (5, 10, 100, 500, 1000) and ran on both Windows and Linux.
The problem occurred when crawler4j had crawled about 10,000+ pages (numberOfCrawlers set to 10), or about 60,000+ pages (numberOfCrawlers set to 1000).

When I crawled some smaller websites, no such problem appeared.

Implement WebCrawler as an abstract class

Mr Ganjisaffar,

I think it would be better if WebCrawler were an abstract class rather than a concrete one, so that it is clear to the developer which methods can/should be overridden.

Yours faithfully,

James Hamilton

Enhance parallel execution

Right now we have a full crawl workflow in a crawler thread:

  1. Fetch a number of URLs from the frontier
  2. For each candidate URL, process it:
    • HTTP GET
    • test if should visit
    • if so parse
    • extract outgoing links and schedule them
    • visit; e.g. process payload

I believe separating the different steps of the URL processing would enhance the crawl speed (that's basically the approach Internet Archive's Heritrix takes):

Have configurable thread pools for:

  • executing the HTTP request
  • parsing
  • link extraction and frontier scheduling
  • visiting

This involves having a shared component to store the HTTP response contents between the moment they are downloaded and the moment they have been visited. My initial guess is that Berkeley DB looks like a damn good candidate ;-)
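
A very rough sketch of the proposed split, using plain ExecutorServices as stand-ins for the configurable pools; none of these names exist in crawler4j today and the helper methods are hypothetical:

// Illustrative pool sizes only.
ExecutorService fetchPool = Executors.newFixedThreadPool(50); // HTTP GET
ExecutorService parsePool = Executors.newFixedThreadPool(8);  // parsing + link extraction
ExecutorService visitPool = Executors.newFixedThreadPool(4);  // user visit() callbacks

fetchPool.submit(() -> {
    byte[] content = fetch(nextUrlFromFrontier());    // hypothetical helpers
    parsePool.submit(() -> {
        ParsedPage parsed = parse(content);           // hypothetical type
        scheduleOutgoingLinks(parsed);                // frontier scheduling
        visitPool.submit(() -> visit(parsed));        // payload processing
    });
});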

handshake alert: unrecognized_name when crawling some websites.

I am getting the following error while trying to crawl a website.

2015-05-31 09:18:49 ERROR WebCrawler:467 - handshake alert: unrecognized_name, while processing: https://mobileshop.com.eg/

I can crawl other https websites without any problems, but not this one. I tried to disable SNI by using the -Djsse.enableSNIExtension=false option which solves the problem for this website, but breaks things for others. For example, if I disable SNI, I receive the following error for another website:

2015-05-31 09:24:34 ERROR WebCrawler:467 - Received fatal alert: internal_error, while processing: https://www.computershopegypt.com/

Crawling the latter website worked fine with SNI enabled.

BasicCrawler failed when crawling Baidu

Hi,

I tried the simple crawler sample on the index page. It worked well when choosing Yahoo, Google and most other websites as seeds. However, it just failed with www.baidu.com. Is there any measure I can take to solve the problem?

PS: I can successfully get Baidu pages using URLConnection.

Best regards.

Bug in PageFetcher: HttpClient Entity not consumed in some cases

Hi,

I think there is a serious bug in the PageFetcher class, in the fetchPage method. In the case of an entity bigger than the maximum size, the response must be closed. The stream is not being consumed, which causes the connection to be held open. When the maximum number of total connections of HttpClient is reached, the crawler freezes. I investigated this by looking at the HttpClient logs:

2015-05-13 15:31:32,208 DEBUG [org.apache.http.impl.conn.PoolingHttpClientConnectionManager](Crawler 5) Connection request: [route: {}->http://xxx:80][total kept alive: 1; route allocated: 1 of 100; total allocated: 1 of 5]
2015-05-13 15:31:32,208 DEBUG [org.apache.http.impl.conn.PoolingHttpClientConnectionManager](Crawler 5) Connection leased: [id: 0][route: {}->http://xxx:80][total kept alive: 0; route allocated: 1 of 100; total allocated: 1 of 5]
2015-05-13 15:31:32,260 WARN [edu.uci.ics.crawler4j.crawler.WebCrawler](Crawler 5) Skipping a URL: xxx which was bigger ( 1679608 ) than max allowed size
2015-05-13 15:31:32,571 DEBUG [org.apache.http.impl.conn.PoolingHttpClientConnectionManager](Crawler 3) Connection [id: 1][route: {}->http://xxx:80] can be kept alive for 10.0 seconds
2015-05-13 15:31:32,571 DEBUG [org.apache.http.impl.conn.PoolingHttpClientConnectionManager](Crawler 3) Connection released: [id: 1][route: {}->http://xxx:80][total kept alive: 1; route allocated: 2 of 100; total allocated: 2 of 5]

As you can see, the total number of allocated connections gets increased by 1 and then stays at 2. And this goes on every time a page is too big. When the limit of 5 is reached, the crawler stops.

For reference: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html

Best regards
Albert
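
For reference, a sketch of the kind of fix Albert is describing, assuming HttpClient 4.x (simplified, not the actual PageFetcher code):

if (contentLength > config.getMaxDownloadSize()) {
    // Consume the entity so the pooled connection is released instead of
    // leaking one connection per oversized page.
    EntityUtils.consumeQuietly(response.getEntity());
    return fetchResult; // skip the page, but give the connection back to the pool
}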

Record the page an image was downloaded from

Hello,

Can anyone please help modify the code a little so that the image crawler also saves the URL of the page each image was downloaded from?

Currently the image crawler only saves the image; no URL is saved to local storage.

thanks
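
One way to record this, assuming WebURL's getParentUrl() returns the page on which the link was discovered (the "Parent page" shown in the basic crawler output above):

@Override
public void visit(Page page) {
    String imageUrl = page.getWebURL().getURL();
    String foundOn = page.getWebURL().getParentUrl(); // page where the image link was discovered
    // ... store the image bytes as before, and append "imageUrl <TAB> foundOn"
    //     to a small index file next to the saved images, for example.
}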

Add support for a javascript renderer

I'd like a way to render javascript somehow, or allow a pluggable interface where we could add another client that would fetch the HTML and render the javascript in it.

I've been looking at headless web browsers like PhantomJS.

Get status when proxy timeout

Hello,
Is there any way to know if the crawling process (start method) has finished because of a connection timeout from the proxy connection?
Thanks!
