
storm-crawler's Introduction

StormCrawler


Apache StormCrawler (Incubating) is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

Quickstart

NOTE: These instructions assume that you have Apache Maven installed. You will need to install Apache Storm 2.6.2 to run the crawler.

StormCrawler requires Java 11 or above. To execute tests, it requires you to have a locally installed and working Docker environment.

DigitalPebble's Ansible-Storm repository contains resources to install Apache Storm using Ansible. Alternatively, this stormcrawler-docker project should help you run Apache Storm on Docker.

Once Storm is installed, the easiest way to get started is to generate a new StormCrawler project following the instructions below:

mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.0

You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.

This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. Enter the directory you just created (it should be the same as the artifactId you specified earlier) and follow the instructions in the README file.

Alternatively if you can't or don't want to use the Maven archetype above, you can simply copy the files from archetype-resources.

Have a look at the code of the CrawlTopology class, the crawler-conf.yaml file, as well as the files in src/main/resources/; they are all that is needed to run a crawl topology: all the other components come from the core module.
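For orientation, below is a minimal sketch of what such a topology can look like when wired up by hand with the plain Storm API. The StormCrawler package names (org.apache.stormcrawler.*), the MemorySpout constructor and the seed URL are assumptions based on the 3.0 archetype; the CrawlTopology generated in your project may differ.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.stormcrawler.bolt.FetcherBolt;      // assumed package, see note above
    import org.apache.stormcrawler.bolt.JSoupParserBolt;  // assumed package, see note above
    import org.apache.stormcrawler.spout.MemorySpout;     // assumed package, see note above

    public class MinimalCrawlTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // seed URLs kept in memory; a real crawl would use a persistent spout
            builder.setSpout("spout", new MemorySpout("https://stormcrawler.apache.org/"));
            builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");
            builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");

            Config conf = new Config();
            conf.put("http.agent.name", "MyCrawler"); // a user agent name is mandatory
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("crawl", conf, builder.createTopology());
                Thread.sleep(60_000); // let the local crawl run for a minute
            }
        }
    }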

Getting help

The WIKI is a good place to start your investigations, but if you are stuck please use the tag stormcrawler on Stack Overflow or ask a question in the discussions section.

The project website has a page listing companies providing commercial support for Apache StormCrawler.

Note for developers

Please format your code before submitting a PR with

mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false

You can enable pre-commit format hooks by running:

mvn clean install -Dskip.format.code=false

Thanks


YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.



storm-crawler's Issues

JSON stringifying metadata keys

I was adding support for ETags in HttpResponse when I realized that, when using If-Modified-Since, it looks for a metadata key literally named 'If-Modified-Since'. I think I originally did this for consistency, but now it seems like a bad idea. First, because we're normalizing HTTP headers (so 'if-modified-since' would be better). More importantly, I think these behavior-changing metadata keys should be JSON compatible: a common use case will be to deserialize known metadata within or directly downstream from the spout, and JSON-compliant key names will be required for many spout types. So the key would be 'ifModifiedSince', and the ETag key would be 'ifNoneMatch'.

It's a super simple fix, but because it's changing the contract, I wanted to run it by you guys first.

Exception in DebugMetricsConsumer

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.util.Map
at com.digitalpebble.storm.metrics.DebugMetricsConsumer.handleDataPoints(DebugMetricsConsumer.java:103) ~[classes/:na]
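A minimal sketch of the defensive handling that would avoid the cast, assuming the consumer can receive both scalar values (e.g. Integer counters) and nested Maps; the class and method names here are illustrative, not the actual fix:

    import java.util.Map;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SafeDataPointHandling {
        private static final Logger LOG = LoggerFactory.getLogger(SafeDataPointHandling.class);

        // The metric value can be a scalar (e.g. an Integer counter) or a nested Map,
        // so check the runtime type before casting.
        static void handleValue(String name, Object value) {
            if (value instanceof Map) {
                for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                    LOG.info("{}.{} = {}", name, entry.getKey(), entry.getValue());
                }
            } else {
                LOG.info("{} = {}", name, value);
            }
        }
    }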

Concurrency issue with ProtocolFactory

While looking at the code with Pradhan, I've just found that ProtocolFactory caches the Protocol instances, so there is exactly one instance per protocol within a ProtocolFactory.

In the FetcherBolt we have a single instance of ProtocolFactory, which is used by all the threads concurrently. The method getProtocolOutput in the HTTP protocol could be called concurrently. Any shared state like ifModifiedSince could then become corrupted.

Pradhan will write a test class to illustrate the issue. We could sync the method but that would create a bottleneck. A simpler approach would be for the ProtocolFactory to create a new instance of the protocol for every request; it might be inexpensive to do so.
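A rough sketch of the 'new instance per request' option, with Protocol reduced to a stand-in interface and a hypothetical configure() hook; not the actual StormCrawler code:

    import java.util.Map;

    // Stand-in for the real Protocol interface; getProtocolOutput(...) omitted.
    interface Protocol {
        void configure(Map<String, Object> conf); // hypothetical configuration hook
    }

    // Each caller gets its own Protocol instance, so no state (e.g. ifModifiedSince)
    // is shared between FetcherBolt threads; instances are assumed to be cheap to create.
    public class PerRequestProtocolFactory {
        private final Class<? extends Protocol> implementation;
        private final Map<String, Object> conf;

        public PerRequestProtocolFactory(Class<? extends Protocol> implementation,
                                         Map<String, Object> conf) {
            this.implementation = implementation;
            this.conf = conf;
        }

        public Protocol newProtocol() throws ReflectiveOperationException {
            Protocol p = implementation.getDeclaredConstructor().newInstance();
            p.configure(conf);
            return p;
        }
    }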

Handling of redirections

Neither the FetcherBolt nor the ParserBolt currently does anything specific with the HTTP status returned for a given page, and as a result redirections are currently not handled.

One option would be to leave this to whichever component is in charge of persisting the URL status, i.e. handle it post parsing. Another approach would be to have the parser (and maybe also the fetcher) declare a non-default output stream (e.g. 'status') and plug that in as input to the persistence bolt.

Any thoughts on this?
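A minimal sketch of the second option using the plain Storm API; the stream name, field names and the way the redirection is encoded are all illustrative:

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class StatusStreamSketch {
        // The bolt declares a second, non-default stream for status information.
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "content", "metadata"));
            declarer.declareStream("status", new Fields("url", "metadata", "status"));
        }

        // On a redirection, the tuple is routed to the 'status' stream and acked;
        // the persistence bolt subscribed to that stream decides what to do next.
        void onRedirect(OutputCollector collector, Tuple input, String url,
                        Object metadata, String target) {
            collector.emit("status", input, new Values(url, metadata, "REDIRECTION -> " + target));
            collector.ack(input);
        }
    }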

Protocol interface to take metadata into account

Ideally we should be able to access the metadata for a URL from the protocol implementations. One motivation for this is that we could use different resources (e.g. scripts for processing AJAX) based on the value of a metadatum.

What would be the best way of doing this? Add a separate method? If so, what would the FetcherBolt do? Or replace the existing method with one that handles metadata? That way, if a protocol implementation does not need the metadata, it could simply ignore it.

Any thoughts on this?
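One way of expressing the 'separate method' idea is an overload with a default implementation that ignores the metadata, so existing implementations keep working unchanged. This is only a sketch with stand-in types, not a proposed final signature:

    // Stand-ins for the real classes, just to make the sketch self-contained.
    class Metadata {}
    class ProtocolOutput {}

    interface ProtocolSketch {
        ProtocolOutput getProtocolOutput(String url) throws Exception;

        // Implementations that do not care about metadata simply inherit this and ignore it.
        default ProtocolOutput getProtocolOutput(String url, Metadata metadata) throws Exception {
            return getProtocolOutput(url);
        }
    }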

Fetcher to dump the content of its queues to the log

We could use a very primitive mechanism like checking whether a file exists at an arbitrary location (configurable in the usual way) to get the FetcherBolt to dump the content of its internal queues to the logs. This would be useful as a way of debugging.
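As an illustration of how primitive the mechanism could be, here is a sketch of a trigger based on the existence of a marker file; the location would be configurable in the usual way and all names are made up:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class QueueDumpTrigger {
        private final Path marker;

        public QueueDumpTrigger(String markerLocation) {
            // the location would come from the usual configuration
            this.marker = Paths.get(markerLocation);
        }

        // Called from the fetcher's main loop: dump the queues once per touch of the marker file.
        public boolean shouldDump() throws IOException {
            return Files.deleteIfExists(marker);
        }
    }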

Metadata emitted in fetcher

I noticed that, in some places, we emit response.getMetadata() and in other places, we emit metadata within FetcherBolt and SimpleFetcherBolt.

The response.getMetadata() includes all of the original tuple metadata (if extant). Is there a reason for emitting only the original tuple metadata in some situations? Particularly when emitting to the status stream, it might be beneficial downstream to do something with the HTTP response metadata.

nextTuple() blocks in the BlockingURLSpout

The topmost infinite loop in BlockingURLSpout causes the nextTuple() method to block. Because Storm runs nextTuple(), ack(), and fail() on the same thread, I think this prevents the spout from functioning properly (see last paragraph of the interface description in https://storm.incubator.apache.org/apidocs/backtype/storm/spout/ISpout.html).

I've got a version that removes the infinite loop and seems to function as expected, but I need to run a few more tests first. In the meantime, if you get a chance to look at BlockingURLSpout, let me know if you agree that this is an issue.
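For reference, a sketch of the non-blocking pattern the fix would follow: nextTuple() emits at most one URL per call and returns immediately when there is nothing to emit, so ack() and fail() (which run on the same thread) are never starved. This is an illustration, not the actual patched BlockingURLSpout:

    import java.util.Map;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class NonBlockingURLSpout extends BaseRichSpout {
        private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            // a background thread (or a periodic refill) feeds the buffer with URLs
        }

        @Override
        public void nextTuple() {
            String url = buffer.poll();
            if (url == null) {
                return; // nothing to emit; Storm will call us again shortly
            }
            collector.emit(new Values(url), url); // the URL doubles as the message id
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url"));
        }
    }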

Refactoring the metrics in ParserBolt

See discussion in #41 : the metrics package will probably move to 'external' in which case we'll have to refactor ParserBolt so that it uses standard storm metrics instead.

Upgrade to Storm 0.9.3

There were several important issues resolved in 0.9.3, including one Tika-related fix. I'm currently running my SC topology on 0.9.3 without a hitch, so this should be a non-breaking upgrade.

RobotsURLFilter

The filtering of URLs based on the robots.txt directives is done within the Fetcher for an incoming URL. It would be efficient to be able to filter on the outlinks so that URLs do not get added to the queues (or any other form of persistence) if they can't be fetched anyway.

We should provide a RobotsURLFilter for this where we'd refetch the robots.txt file for a given URL and store it in a cache. The additional cost of pulling the robots.txt would be outweighed by the benefits of not adding unnecessary URLs to the queues.
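A rough sketch of what such a filter could look like, fetching robots.txt once per host with the JDK HTTP client and parsing it with crawler-commons; the SimpleRobotRulesParser method signatures are assumptions that differ between crawler-commons versions, and the agent name is made up:

    import java.net.URI;
    import java.net.URL;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsURLFilter {
        private final Map<String, BaseRobotRules> cache = new ConcurrentHashMap<>();
        private final HttpClient client = HttpClient.newHttpClient();
        private final String agentName = "mycrawler"; // illustrative agent name

        // Returns the URL if it is allowed, or null to filter it out of the outlinks.
        public String filter(String url) throws Exception {
            URL u = new URL(url);
            String host = u.getProtocol() + "://" + u.getHost();
            BaseRobotRules rules = cache.computeIfAbsent(host, this::fetchRules);
            return rules.isAllowed(url) ? url : null;
        }

        private BaseRobotRules fetchRules(String host) {
            try {
                HttpResponse<byte[]> resp = client.send(
                        HttpRequest.newBuilder(URI.create(host + "/robots.txt")).build(),
                        HttpResponse.BodyHandlers.ofByteArray());
                return new SimpleRobotRulesParser().parseContent(
                        host + "/robots.txt", resp.body(), "text/plain", agentName);
            } catch (Exception e) {
                // on failure, be permissive rather than silently dropping outlinks
                return new SimpleRobotRulesParser().failedFetch(500);
            }
        }
    }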

Normalise metadata returned by servers

curl -I "http://twitter.com/home?status=Check+this+out%3ahttp%3a%2f%2fwww.dillards.com%2fwebapp%2fwcs%2fstores%2fservlet%2fProductDisplay%3fcatalogId%3d301%26productId"
HTTP/1.1 301 Moved Permanently
content-length: 0
date: Thu, 22 Jan 2015 12:11:11 UTC
location: https://twitter.com/home?status=Check+this+out%3ahttp%3a%2f%2fwww.dillards.com%2fwebapp%2fwcs%2fstores%2fservlet%2fProductDisplay%3fcatalogId%3d301%26productId
server: tsa_b
set-cookie: guest_id=v1%3A142192867165557827; Domain=.twitter.com; Path=/; Expires=Sat, 21-Jan-2017 12:11:11 UTC
x-connection-hash: ca85b658e4cc93174936c73ea46fa39c
x-response-time: 8

Here 'location' is in lowercase, whereas our code expects 'Location' in order to work. We could lowercase ALL keys in our metadata, or alternatively normalise the key names at the protocol level.
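If we go for normalising at the protocol level, one simple option is a case-insensitive map for the response headers, so a lowercase 'location' is still found when the code asks for 'Location'. A minimal sketch:

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class CaseInsensitiveHeaders {
        private final Map<String, List<String>> headers =
                new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

        public void put(String name, List<String> values) {
            headers.put(name, values);
        }

        public String getFirst(String name) {
            List<String> values = headers.get(name); // matches regardless of case
            return (values == null || values.isEmpty()) ? null : values.get(0);
        }
    }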

NPE in ParserBolt

Encountered this while working on the Spiderling.

The NPE stems from a null content byte array, and occurs here. I discovered that the offending URL was one that resulted in a 301 redirect. However, the protocol response's content is non-null for several other URLs that result in 301s, so the redirect response may not be the source of the problem.

Console output with stack trace:

52998 [Thread-12-parse] ERROR backtype.storm.util - Async loop died!
java.lang.RuntimeException: java.lang.NullPointerException
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.executor$fn__3441$fn__3453$fn__3500.invoke(executor.clj:748) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.util$async_loop$fn__464.invoke(util.clj:463) ~[storm-core-0.9.3.jar:0.9.3]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
Caused by: java.lang.NullPointerException: null
    at com.digitalpebble.storm.crawler.bolt.ParserBolt.execute(ParserBolt.java:168) ~[storm-crawler-0.4-SNAPSHOT.jar:na]
    at backtype.storm.daemon.executor$fn__3441$tuple_action_fn__3443.invoke(executor.clj:633) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.executor$mk_task_receiver$fn__3364.invoke(executor.clj:401) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$clojure_handler$reify__1447.onEvent(disruptor.clj:58) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.3.jar:0.9.3]
    ... 6 common frames omitted

@jnioche have you seen this before?

Parser to send tuple to 'status' stream instead of failing

The parser currently fails tuples when an exception is caught. This is probably not the right behaviour, as the same URL will be refetched and will fail again later. What we could do instead is use the status stream to mark the URL as failed and keep track of the reason why it failed. Whichever component is in charge of persisting the URL status can then decide what to do with it.

This is related to the discussion in #42

The same logic could be applied to fetch failures as well. Instead of failing them and letting the spout handle the logic of keeping track of the number of errors, we'd send them to the status stream. The advantage of doing this is that the spout wouldn't have to update anything and would just read.
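A sketch of what the error path could look like in the parser (and, with the same shape, in the fetcher); the metadata key, the stream name and the stand-in metadata type are illustrative:

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    class StatusOnErrorSketch {
        // On a parsing (or fetch) exception: record the cause, route the tuple to the
        // 'status' stream and ack it instead of failing it.
        void handleError(OutputCollector collector, Tuple tuple, String url,
                         Map<String, String> metadata, Exception cause) {
            metadata.put("error.cause", cause.toString()); // illustrative key name
            collector.emit("status", tuple, new Values(url, metadata, "ERROR"));
            collector.ack(tuple);
        }
    }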

Investigate use of async IO for Fetcher

The fetcher bolts currently use blocking IO. Doing asynchronous IO would certainly provide better throughput.

[https://jfarcand.wordpress.com/2010/12/21/going-asynchronous-using-asynchttpclient-the-basic/] provides interesting examples
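As an illustration of the general idea, using the JDK 11 HttpClient rather than the AsyncHttpClient library linked above, here is a sketch where many fetches are in flight at once and completions are handled via callbacks, so no thread blocks on IO:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    public class AsyncFetchSketch {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            List<String> urls = List.of("https://example.com/", "https://example.org/");

            // start all fetches without waiting for any of them
            List<CompletableFuture<Void>> inFlight = urls.stream()
                    .map(url -> client.sendAsync(
                                    HttpRequest.newBuilder(URI.create(url)).build(),
                                    HttpResponse.BodyHandlers.ofByteArray())
                            .thenAccept(resp -> System.out.println(
                                    resp.uri() + " -> " + resp.statusCode()
                                    + " (" + resp.body().length + " bytes)")))
                    .collect(Collectors.toList());

            // wait for the outstanding fetches before exiting
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        }
    }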

NotSerializableException for Metadata

Getting NotSerializableException: com.digitalpebble.storm.crawler.Metadata when running CrawlTopology on the master branch. For reference, I'm running the topology in local mode using maven exec.

I'm 99.99% sure that Metadata just needs to implement Serializable (this solves the problem for me). Just wanted to check if CrawlTopology was successfully executing for anyone else after the most recent commits.

Stack trace:

15352 [Thread-17-disruptor-executor[6 6]-send-queue] ERROR backtype.storm.util - Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$consume_loop_STAR_$fn__1460.invoke(disruptor.clj:94) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.util$async_loop$fn__464.invoke(util.clj:463) ~[storm-core-0.9.3.jar:0.9.3]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
Caused by: java.lang.RuntimeException: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at backtype.storm.serialization.SerializableSerializer.write(SerializableSerializer.java:41) ~[storm-core-0.9.3.jar:0.9.3]
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:75) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:18) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:486) ~[kryo-2.21.jar:na]
    at backtype.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:44) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:44) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.worker$mk_transfer_fn$fn__3549.invoke(worker.clj:129) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__3283.invoke(executor.clj:258) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$clojure_handler$reify__1447.onEvent(disruptor.clj:58) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.3.jar:0.9.3]
    ... 6 common frames omitted
Caused by: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) ~[na:1.7.0_55]
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) ~[na:1.7.0_55]
    at backtype.storm.serialization.SerializableSerializer.write(SerializableSerializer.java:38) ~[storm-core-0.9.3.jar:0.9.3]
    ... 16 common frames omitted
15353 [Thread-17-disruptor-executor[6 6]-send-queue] ERROR backtype.storm.daemon.executor - 
java.lang.RuntimeException: java.lang.RuntimeException: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$consume_loop_STAR_$fn__1460.invoke(disruptor.clj:94) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.util$async_loop$fn__464.invoke(util.clj:463) ~[storm-core-0.9.3.jar:0.9.3]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
Caused by: java.lang.RuntimeException: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at backtype.storm.serialization.SerializableSerializer.write(SerializableSerializer.java:41) ~[storm-core-0.9.3.jar:0.9.3]
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:75) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:18) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:486) ~[kryo-2.21.jar:na]
    at backtype.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:44) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:44) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.worker$mk_transfer_fn$fn__3549.invoke(worker.clj:129) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__3283.invoke(executor.clj:258) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.disruptor$clojure_handler$reify__1447.onEvent(disruptor.clj:58) ~[storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.3.jar:0.9.3]
    ... 6 common frames omitted
Caused by: java.io.NotSerializableException: com.digitalpebble.storm.crawler.Metadata
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) ~[na:1.7.0_55]
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) ~[na:1.7.0_55]
    at backtype.storm.serialization.SerializableSerializer.write(SerializableSerializer.java:38) ~[storm-core-0.9.3.jar:0.9.3]
    ... 16 common frames omitted
Disconnected from the target VM, address: '127.0.0.1:49418', transport: 'socket'
15370 [Thread-17-disruptor-executor[6 6]-send-queue] ERROR backtype.storm.util - Halting process: ("Worker died")
java.lang.RuntimeException: ("Worker died")
    at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:325) [storm-core-0.9.3.jar:0.9.3]
    at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.5.1.jar:na]
    at backtype.storm.daemon.worker$fn__3808$fn__3809.invoke(worker.clj:452) [storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.daemon.executor$mk_executor_data$fn__3274$fn__3275.invoke(executor.clj:240) [storm-core-0.9.3.jar:0.9.3]
    at backtype.storm.util$async_loop$fn__464.invoke(util.clj:473) [storm-core-0.9.3.jar:0.9.3]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]

Process finished with exit code 1

Replace HTTP protocol implementation

Before I did anything with this, wanted to check and see if there was a reason why HTTP 1.0 was hardcoded. Right now, line 167 of HttpResponse sets the protocol version to 1.0, regardless of whether http.useHttp11: true is set in the configuration:

StringBuffer reqStr = new StringBuffer("GET ");
if (http.useProxy()) {
    reqStr.append(url.getProtocol() + "://" + host + portString + path);
} else {
    reqStr.append(path);
}

reqStr.append(" HTTP/1.0\r\n");

Is this intentional?
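If it is not intentional, the smallest change would presumably be to honour the setting when building the request line; useHttp11() is a hypothetical accessor on the protocol's configuration object, named after the http.useHttp11 key:

    // sketch only: pick the protocol version from the configuration instead of hardcoding 1.0
    reqStr.append(http.useHttp11() ? " HTTP/1.1\r\n" : " HTTP/1.0\r\n");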

Apache Tika & Storm = FileNotFoundException

Hi,

I am unable to run in LocalCluster mode (actually haven't tried to deploy this).
I am always getting a FileNotFoundException. Can you actually run this code without any issues?

4518 [Thread-10] INFO backtype.storm.daemon.supervisor - Copying resources at jar:file:/Users/tammomueller/.m2/repository/edu/ucar/netcdf/4.2-min/netcdf-4.2-min.jar!/resources to /var/folders/wj/bplcpvpd07s525n5vnfm0gyw0000gn/T//cc4ea00d-14a5-41b6-8178-7bb2a2f8da46/supervisor/stormdist/crawl-1-1371622362/resources
4524 [Thread-10] ERROR backtype.storm.event - Error when processing event
java.io.FileNotFoundException: Source 'file:/Users/tammomueller/.m2/repository/edu/ucar/netcdf/4.2-min/netcdf-4.2-min.jar!/resources' does not exist
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) ~[commons-io-2.4.jar:2.4]
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1261) ~[commons-io-2.4.jar:2.4]
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1230) ~[commons-io-2.4.jar:2.4]
at backtype.storm.daemon.supervisor$fn__5073.invoke(supervisor.clj:452) ~[storm-core-0.9.0-wip17.jar:na]
at clojure.lang.MultiFn.invoke(MultiFn.java:172) ~[clojure-1.4.0.jar:na]
at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__4986.invoke(supervisor.clj:290) ~[storm-core-0.9.0-wip17.jar:na]

Same issue reported on storm and Tika:
https://github.com/nathanmarz/storm/issues/82

Track target of redirection

We should store the target of a redirection in the metadata of the redirected URL, regardless of whether the target gets filtered or not. This will make it easier to debug the behaviour of the Fetcher

Fix release artefacts

The release is done with the following commands:

mvn release:clean release:prepare -P release
mvn release:perform -P release

Since refactoring into core and external I am seeing the following

uber-jar published for core

.../content/com/digitalpebble/storm-crawler-core/0.4/storm-crawler-core-0.4-jar-with-dependencies.jar

which should not have happened since I specify the 'release' profile.

generated tests for parent pom

.../com/digitalpebble/storm-crawler/0.4/storm-crawler-0.4-tests.jar

which should not have happened.

Any ideas? I will release 0.4 with these but it would be good to fix that for the next release

Change URLFilter interface to optionally take metadata

This would be useful for several use cases such as [https://github.com/PopSugar/storm-crawler-extensions/pull/2] or #46, where in both cases we need to know the source URL of a new link (or rather its domain or hostname).

This is similar to #25.

We'd then be able to enforce filtering based on whether a URL is from the same hostname or domain as the URL it originates from, as a normal URLFilter, which would make the code more compact and easier to reuse from the various parsing bolts or the Fetcher (e.g. for redirections).

Should this be done before or after #36?
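For discussion, the interface change could look roughly like the sketch below, where the filter receives the source URL and its metadata and returns null to reject a candidate outlink; Metadata is replaced by a plain Map to keep the sketch self-contained:

    import java.net.URL;
    import java.util.Map;

    interface URLFilterWithContext {
        /** @return the (possibly normalised) URL to keep, or null to filter it out */
        String filter(URL sourceUrl, Map<String, String[]> sourceMetadata, String urlToFilter);
    }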

Use a single MultiCountMetric in FetcherBolt

[https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/storm/crawler/bolt/FetcherBolt.java#L87]

They are of the same nature, so why use two?

Make protocol defined via configuration

[https://github.com/DigitalPebble/storm-crawler/blob/master/src/main/java/com/digitalpebble/storm/crawler/protocol/ProtocolFactory.java] currently hardcodes the protocol implementation to use. We need to make it configurable like we do for URL filters etc...
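A sketch of the usual pattern: read a class name from the Storm config and instantiate it by reflection, as is done for the URL filters. The configuration key and the class names in the usage comment are illustrative only:

    import java.util.Map;

    public class ConfigurableProtocolFactory {
        @SuppressWarnings("unchecked")
        public static <T> T newInstanceFromConf(Map<String, Object> conf, String key,
                                                String defaultClass) throws ReflectiveOperationException {
            // fall back to a default implementation when the key is absent
            String className = (String) conf.getOrDefault(key, defaultClass);
            return (T) Class.forName(className).getDeclaredConstructor().newInstance();
        }
    }

    // usage (illustrative key and class name):
    // Protocol proto = ConfigurableProtocolFactory.newInstanceFromConf(
    //         stormConf, "protocol.implementation", "some.pkg.HttpProtocol");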

Customisable/pluggable MetadataTransfer

We currently have a hardwired, one-size-fits-all MetadataTransfer class. It would be good to be able to define user-specific ones and have them used in the same way regardless of which parser class calls them. These user-defined classes would extend the behaviour of MetadataTransfer.

Add If-Modified-Since support to HTTP Protocol

Now that the HTTP protocol has access to metadata, maybe we can add support for If-Modified-Since in HTTPResponse? This could significantly boost performance for URLs that are frequently processed by the crawler.

I have an implementation of this already (although not through the metadata). If this is desirable, it'll be a snap for me to contribute it.
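A sketch of the two small pieces involved, independent of how the metadata is plumbed through: format the previous fetch time as an RFC 1123 date for the If-Modified-Since header, and treat a 304 response as "content unchanged". Names are illustrative:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    public class IfModifiedSinceSketch {
        // HTTP dates use the RFC 1123 format, always in GMT.
        static String buildHeaderValue(long lastFetchEpochMillis) {
            return DateTimeFormatter.RFC_1123_DATE_TIME.format(
                    ZonedDateTime.ofInstant(Instant.ofEpochMilli(lastFetchEpochMillis), ZoneOffset.UTC));
        }

        // 304 Not Modified: reuse the previously stored content instead of reparsing.
        static boolean isNotModified(int statusCode) {
            return statusCode == 304;
        }
    }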

Missing transitive dependencies

Since refactoring the sources in to core and external, any project declaring SC as a dependency is not getting org.jdom or crawler-commons as transitive dependencies. I am pretty certain this was the case before.

@jakekdodd any idea?

Add cache to AbstractStatusUpdaterBolt

The AbstractStatusUpdaterBolt [1f7e546] should help with writing bolts that store the status of URLs. I will add an internal cache so that URLs which are already known do not get written as DISCOVERED every time. The cache should be configurable from the usual Storm config.
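A minimal sketch of such a cache, as a bounded LRU set of already-seen URLs; the maximum size would come from the usual Storm config and the class name is made up:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SeenURLCache {
        private final Map<String, Boolean> lru;

        public SeenURLCache(final int maxSize) {
            // access-ordered LinkedHashMap evicting the least recently used entry
            this.lru = new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > maxSize;
                }
            };
        }

        /** @return true if the URL was already known, i.e. the DISCOVERED write can be skipped */
        public synchronized boolean alreadySeen(String url) {
            return lru.put(url, Boolean.TRUE) != null;
        }
    }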

Sitemap parser

It would be good to reuse the parsers from crawler-commons to handle sitemaps. It is probably simpler to have a separate parser from the Tika-based one and trigger the use of the sitemap parser based on the presence of an arbitrary metadata key like 'parse.sitemap'=true.

The triaging of the tuples based on the metadata could be done in a meta parser wrapping both the Tika and sitemap ones, but it would probably be simpler if that could be done somehow in the topology class. Alternatively the sitemap parser could also wrap the Tika parser.

The output of this parser would not be the same as the Tika-based one: it would not generate any textual content or call the ParseFilters, but would consist of URL/Metadata pairs. Similarly to what I suggested for #37, this bolt could generate a 'status' stream that would be consumed by a persistence bolt. It would also use the default stream to pass on URLs that are not sitemaps.

Question: what happens if we send things down a stream with nothing to consume them?

Pre-release tasks

  • Apply formatting to the whole code base
  • Update the README file to include core in paths and jar names

Use slf4j {} placeholders

We use slf4j everywhere in the project but without making use of placeholders [http://www.slf4j.org/faq.html#logging_performance]
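For illustration, the difference the placeholders make on a hot code path:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class PlaceholderLogging {
        private static final Logger LOG = LoggerFactory.getLogger(PlaceholderLogging.class);

        void fetched(String url, int status, long millis) {
            // avoid: LOG.debug("Fetched " + url + " with status " + status + " in " + millis + " ms");
            // with {} placeholders the message is only built when DEBUG is enabled,
            // so there is no string concatenation cost when the level is off
            LOG.debug("Fetched {} with status {} in {} ms", url, status, millis);
        }
    }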

Project reorganization

Given our discussion in #24, and previous discussions on Kafka spouts, HBase indexing, etc, we should think about reorganizing the project so that we have a core SDK and external SDK(s).

I was thinking something like

root
    pom.xml
    |-> crawler-core
        pom.xml
    |-> crawler-external
        pom.xml

The external sub-project would include things that depend on external technologies and libraries.

Resource files overridden by ones provided by default in SC

We currently use [http://maven.apache.org/plugins/maven-assembly-plugin/], which is fine, but apparently Shade would give us more control.
One reason why we'd need it: when an application uses storm-crawler as a dependency, it gets the content of src/main/resources (e.g. the urlfilter config) in the uber-jar, but the version coming from the dependency will override any files specified in the project with the same name. To put it differently, we need to use different names for the resources than the ones used by default, unless we want to use the default files.
Any thoughts?

Possible bug in getHashForURL method of ShardedQueue

I think that the statement:

return (ip.hashCode() & Integer.MAX_VALUE) % queueNumber

in the getHashForURL method of com.digitalpebble.storm.fetchqueue.ShardedQueue should instead be:

return (ip.hashCode() & Integer.MAX_VALUE) % numQueues

where numQueues is the ShardedQueue.numQueues property. From my reading of the code, I'm assuming that getHashForURL ought to return the queue number for the URL for sharding, rather than requiring it as an input.
