
uscdatascience / sparkler


Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Home Page: http://irds.usc.edu/sparkler/

License: Apache License 2.0

Scala 34.60% Shell 4.84% Java 38.15% JavaScript 11.55% Python 8.25% HTML 0.68% Dockerfile 1.50% CSS 0.26% Mustache 0.19%
big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler

sparkler's People

Contributors

amirhosf, berkarcan, buggtb, chrismattmann, dependabot[bot], felixloesing, gitter-badger, giuseppetotaro, karanjeets, kefaun2601, kyan2601, ldaume, lewismc, mattvryan-github, nhandyal, prenastro, prowave, rahulpalamuttam, rohithyeravothula, ryanstonebraker, sk-s-hub, slhsxcmy, smadha, sujen1412, thammegowda


sparkler's Issues

Minor Issues

  • Add core.properties in Solr Schema configuration. This will help auto-deploy the core.
  • Improve Sparkler setup guide and add missing links.

Sparkler Build Failing

The build is failing due to a reference to the deprecated (removed) module "sparkler-plugins-active".

Working on it...

Logs:

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ sparkler-api ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/target/sparkler-api-0.1-SNAPSHOT.jar to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.pom
[INFO] 
[INFO] --- maven-bundle-plugin:2.5.0:install (default-install) @ sparkler-api ---
[INFO] Installing edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Writing OBR metadata
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler-plugins 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ sparkler-plugins ---
[INFO] 
[INFO] --- maven-install-plugin:2.4:install (default-install) @ sparkler-plugins ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-plugins/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/plugin/sparkler-plugins/0.1-SNAPSHOT/sparkler-plugins-0.1-SNAPSHOT.pom
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] sparkler-parent .................................... SUCCESS [  0.117 s]
[INFO] sparkler-api ....................................... SUCCESS [  2.089 s]
[INFO] sparkler-plugins ................................... SUCCESS [  0.005 s]
[INFO] sparkler ........................................... FAILURE [  0.495 s]
[INFO] urlfilter-regex .................................... SKIPPED
[INFO] fetcher-jbrowser ................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.265 s
[INFO] Finished at: 2016-10-28T17:54:21-05:00
[INFO] Final Memory: 43M/1451M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project sparkler-app: Could not resolve dependencies for project edu.usc.irds.sparkler:sparkler-app:jar:0.1-SNAPSHOT: Could not find artifact edu.usc.irds.sparkler.plugin:sparkler-plugins:jar:0.1-SNAPSHOT -> [Help 1]
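One hedged reading of the error: sparkler-app declares a jar dependency on sparkler-plugins, but that module is an aggregator (packaging pom), so no jar artifact is ever produced. If that is the case, the dependency needs an explicit type, or should reference the concrete plugin modules instead. A sketch of the hypothetical pom.xml change, not a verified fix:

```xml
<dependency>
  <groupId>edu.usc.irds.sparkler.plugin</groupId>
  <artifactId>sparkler-plugins</artifactId>
  <version>0.1-SNAPSHOT</version>
  <type>pom</type> <!-- aggregator module: no jar artifact exists to resolve -->
</dependency>
```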

Crawl hanging

Hi,

I injected the seed list and started the crawl:

java -jar target/sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1483360200720 -i 2 -tg 10000 -tn 1000 -m local[*]

After 1-2 minutes the crawl hangs and nothing more is fetched.

Any suggestions?


17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.ortadogugazetesi.net/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_71 stored as values in memory (estimated size 61.0 KB, free 12.9 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_71 in memory on localhost:58926 (size: 61.0 KB, free: 757.0 MB)
17/01/02 15:09:39 INFO Executor: Finished task 51.0 in stage 1.0 (TID 127). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 51.0 in stage 1.0 (TID 127) in 3599 ms on localhost (69/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.kibrisgazetesi.com/
17/01/02 15:09:39 INFO Executor: Finished task 71.0 in stage 1.0 (TID 147). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 71.0 in stage 1.0 (TID 147) in 1000 ms on localhost (70/76)
17/01/02 15:09:39 WARN ParseFunction$: PARSING-CONTENT-ERROR http://www.kibrisgazetesi.com/
17/01/02 15:09:39 WARN ParseFunction$: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.parser.html.HtmlHandler.characters(HtmlHandler.java:258)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:994)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:582)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:62)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:34)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:57)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.evrensel.net/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.yeniakit.com/
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.bilimtarihi.org/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_65 stored as values in memory (estimated size 184.0 KB, free 13.0 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_65 in memory on localhost:58926 (size: 184.0 KB, free: 756.8 MB)
17/01/02 15:09:39 INFO Executor: Finished task 65.0 in stage 1.0 (TID 141). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 65.0 in stage 1.0 (TID 141) in 1849 ms on localhost (71/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_54 stored as values in memory (estimated size 491.2 KB, free 13.5 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_54 in memory on localhost:58926 (size: 491.2 KB, free: 756.4 MB)
17/01/02 15:09:40 INFO Executor: Finished task 54.0 in stage 1.0 (TID 130). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 54.0 in stage 1.0 (TID 130) in 3830 ms on localhost (72/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.evrensel.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_66 stored as values in memory (estimated size 458.8 KB, free 14.0 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_66 in memory on localhost:58926 (size: 458.8 KB, free: 755.9 MB)
17/01/02 15:09:40 INFO Executor: Finished task 66.0 in stage 1.0 (TID 142). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 66.0 in stage 1.0 (TID 142) in 2243 ms on localhost (73/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.ortadogugazetesi.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_75 stored as values in memory (estimated size 117.4 KB, free 14.1 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_75 in memory on localhost:58926 (size: 117.4 KB, free: 755.8 MB)
17/01/02 15:09:40 INFO Executor: Finished task 75.0 in stage 1.0 (TID 151). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 75.0 in stage 1.0 (TID 151) in 1204 ms on localhost (74/76)

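The WriteLimitReachedException in the log is Tika hitting its default 100,000-character text limit; it only truncates extracted text (text up to the limit is kept), so it is unlikely to cause the hang by itself. If the full text is needed, the limit can be lifted wherever the parser is invoked. A minimal sketch, assuming Apache Tika on the classpath; the class and method names are illustrative, not Sparkler's actual ParseFunction:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class UnlimitedParse {
    // Parse a document with Tika's write limit disabled: passing -1 to
    // BodyContentHandler removes the 100,000-character default that raises
    // WriteLimitReachedException in the log above.
    static String parseAll(byte[] content) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(new ByteArrayInputStream(content), handler, new Metadata());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] html = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseAll(html).trim());
    }
}
```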

Error message "Could not launch browser" when starting the crawler from the quickstart guide

17/01/28 15:02:03 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/01/28 15:02:03 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_0 not found, computing it
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_1 not found, computing it
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO PluginService$: Felix Configuration loaded successfully
17/01/28 15:02:03 INFO FetcherJBrowserActivator: Activating FetcherJBrowser Plugin
17/01/28 15:02:03 INFO RegexURLFilterActivator: Activating RegexURL Plugin
Bundle Found: org.apache.felix.framework
Bundle Found: fetcher.jbrowser
Bundle Found: urlfilter.regex
[2017-01-28T15:02:04.134] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.134] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.135] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.135] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.135] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.135] ... 1 more
[2017-01-28T15:02:04.359] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.359] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.359] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.359] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.359] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.359] ... 1 more
[2017-01-28T15:02:04.368] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.369] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.369] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.369] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.369] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.369] ... 1 more
17/01/28 15:02:04 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.openqa.selenium.WebDriverException: Could not launch browser.
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'osboxes', ip: '192.168.178.134', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-514.el7.x86_64', java.version: '1.8.0_121'
Driver info: driver.version: JBrowserDriver

Setup unit tests and integration tests for Sparkler

To begin with, write unit tests for :

  • test RDD functions
  • Test core functionality of plugins

Tasks

  • Setup a web server for testing
    • Bind it to junit to auto start and stop while running the tests
  • Setup a solr instance for testing
    • Bind it to junit to auto start and stop
  • Test Default Fetcher
  • Test Javascript Engine functionality
  • Test URL filters
  • Test Fetch Function
  • Test Parse Function
  • Test Seed Injection
  • Test URL Normalizer
  • Test HDFS persistence
  • Test Kafka Output

P.S.: Current progress can be tracked on the unittests branch: https://github.com/USCDataScience/sparkler/tree/unittests

No fetched content is written

Due to an incorrect validation check, all fetched URLs are filtered out and none are written to disk:

rdd.filter(_.fetchedData.getResource.getStatus == FETCHED)
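For illustration only (the actual field and enum types in Sparkler may differ): a validation check of this shape silently drops every record when the two sides of the comparison are not the same type, e.g. a status stored as a String compared against an enum constant. A hedged sketch of the failure mode and the like-for-like comparison that fixes it:

```java
import java.util.List;

public class StatusFilterDemo {
    enum Status { UNFETCHED, FETCHED, ERROR }

    // Correct check: compare the stored String against the enum constant's name.
    // A check like `status == Status.FETCHED` on a String field can never match,
    // so the filter keeps nothing - the symptom reported above.
    static long countFetched(List<String> statuses) {
        return statuses.stream()
                .filter(s -> s.equals(Status.FETCHED.name()))
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countFetched(List.of("FETCHED", "ERROR", "FETCHED"))); // prints 2
    }
}
```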

Build failing

http://pastebin.com/xPbxHNhM

...
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=36 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=39 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=41 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=44 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=46 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=48 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=File must end with newline character
warning file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SolrResultIterator.scala message=Avoid using null line=56 column=43
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=26 column=2
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=30 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=35 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=37 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=45 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=52 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=60 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=File must end with newline character
Saving to outputFile=/Users/madhav/Documents/workspace/sparkler/sparkler-app/target/scalastyle-output.xml
Processed 23 file(s)
Found 29 errors
Found 9 war
....

Seems like a config issue.

Maven packaging problem

~/git_workspace/sparkler/target$ java -classpath sparkler-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.sparkler.pipeline.Crawler -m "local" -j "sparkler-job-1465179374801" -i 1
2016-06-05 19:26:41 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-06-05 19:26:42 ERROR SparkContext:95 [main] - Error initializing SparkContext.
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:169)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:505)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(U
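The missing 'akka.version' key is a classic symptom of building a fat jar without merging Akka's reference.conf resources: each jar ships its own, and a naive assembly keeps only one. If Sparkler's assembly uses the Maven Shade plugin, a common remedy (offered as a guess at the build setup, not a verified fix) is the AppendingTransformer:

```xml
<!-- inside maven-shade-plugin's <configuration><transformers> -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>reference.conf</resource>
</transformer>
```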

Solr Cloud - solrj.SolrServerException: No live SolrServers available to handle this request

When Solr Cloud is enabled as the backend, we get this:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
	at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.0.11:8983/solr/crawldb_shard1_replica1, http://192.168.0.11:8984/solr/crawldb_shard1_replica2]
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
	at edu.usc.irds.sparkler.CrawlDbRDD.getPartitions(CrawlDbRDD.scala:72)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:641)
	at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:153)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:145)
	at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:45)
	at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:236)
	at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
	... 6 more

Update Solr Schema

  • Change jobId to crawl_id for better understanding and consistency

  • Add content_type field

  • Add crawler field for document uniqueness

  • Change plainText to extracted_text

  • Change lastFetchedAt to fetch_timestamp

  • Change indexedAt to indexed_at for consistency

  • Add fetch_status_code field to record the response code

  • Add hostname, as group may be defined differently in the future

  • Change numTries to retries_since_fetch for better understanding and consistency

  • Add signature field to store the hash of page's content

  • Add version field which defines the schema version

  • Add outlinks field

  • Change depth to crawler_discover_depth

  • Add relative_path field to record the file path when dumped

  • Add parent field for document's parent id

  • Generate document ID as:

    • Seed: SHA256(crawl_id-url-ingestion_timestamp)
    • Other: SHA256(crawl_id-url-parent_fetch_timestamp)
  • Also linked to #49
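The proposed ID scheme above can be sketched with the JDK's MessageDigest; the helper name and the exact separator are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocId {
    // Hypothetical helper: SHA256(crawl_id + "-" + url + "-" + timestamp), hex-encoded.
    // For seeds the timestamp is the ingestion time; for other documents,
    // the parent's fetch timestamp.
    static String docId(String crawlId, String url, long timestamp) {
        String key = crawlId + "-" + url + "-" + timestamp;
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(key.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in every JRE
        }
    }

    public static void main(String[] args) {
        // Same inputs always yield the same 64-char hex ID, so re-crawls upsert.
        System.out.println(docId("sjob-1483360200720", "http://example.com/", 1483360200720L));
    }
}
```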

e.u.i.s.model.Resource.<init>(Resource.java:46) java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"

17/01/29 12:43:30 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64585 in memory (size: 2.8 KB, free: 2.4 GB)
17/01/29 12:43:31 ERROR Executor: Exception in task 7.0 in stage 4.0 (TID 67)
java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"
	at java.net.URL.<init>(URL.java:627)
	at java.net.URL.<init>(URL.java:490)
	at java.net.URL.<init>(URL.java:439)
	at edu.usc.irds.sparkler.model.Resource.<init>(Resource.java:46)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
	at edu.usc.irds.sparkler.service.SolrProxy.addResources(SolrProxy.scala:44)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:43)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:34)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Stream handler unavailable due to: For input string: "0x6"
	at org.apache.felix.framework.URLHandlersStreamHandlerProxy.parseURL(URLHandlersStreamHandlerProxy.java:429)
	at java.net.URL.<init>(URL.java:622)
	... 18 more

URL filter regex

Hi,

Am I missing the URL filter? How can I tell the Sparkler app to apply URL filter rules, either in general or per domain?

Thanks
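There is no confirmed per-domain syntax to quote here, but the general idea behind a Nutch-style regex URL filter (first matching +/- rule wins) is easy to sketch. The class and rule format below are illustrative, not Sparkler's actual urlfilter-regex plugin API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexUrlFilter {
    // Rules in insertion order: '+' prefix accepts, '-' prefix rejects;
    // the first rule whose regex matches the URL decides.
    private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

    void addRule(String line) {
        boolean accept = line.charAt(0) == '+';
        rules.put(Pattern.compile(line.substring(1)), accept);
    }

    boolean accepts(String url) {
        for (Map.Entry<Pattern, Boolean> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) return e.getValue();
        }
        return false; // no rule matched: reject by default
    }

    public static void main(String[] args) {
        RegexUrlFilter f = new RegexUrlFilter();
        f.addRule("-\\.(gif|jpg|png|css|js)$"); // skip static assets
        f.addRule("+^https?://");               // otherwise accept http(s)
        System.out.println(f.accepts("http://example.com/page.html")); // true
        System.out.println(f.accepts("http://example.com/logo.png"));  // false
    }
}
```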

Working with remote Spark

Has anyone tried this with a non-local Spark?

I ask because when I try to run against a remote Spark cluster I get class-mismatch errors:

java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

But when I check the versions: you use Spark 1.6.1, which requires Scala 2.10.x, yet the build uses Scala 2.11.x, and downgrading to 2.10 does not compile.

Debugging crawl in Sparkler

URL Partitioner

Input: Query Solr for the URLs to be generated

status:NEW

Output: Files with a list of URLs partitioned by host (group) such that every file corresponds to one host
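The partitioning step above (group URLs by host so each group can be fetched politely by one task) can be sketched with the JDK alone; the class and method names are illustrative:

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HostPartitioner {
    // Group URLs by hostname: each map entry corresponds to one output
    // file/partition, so one host is never hit by many tasks at once.
    public static Map<String, List<String>> byHost(List<String> urls) {
        return urls.stream()
                .collect(Collectors.groupingBy(u -> URI.create(u).getHost()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> parts = byHost(List.of(
                "http://example.com/a", "http://example.com/b", "http://example.org/c"));
        System.out.println(parts.get("example.com").size()); // prints 2
    }
}
```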

Fetch

Input: URL to fetch
Output: Request and Response Headers written in a file

Parse

Input: URL (which will be fetched and parsed) OR the fetched content
Output: Extracted Content

Fair Fetcher

Input: List of URLs. Uses Crawl policy
Output: fetched and/or parsed content in separate files under a directory

Guide for Sparkler and HDFS

The guide says nothing about how to connect HDFS with Sparkler. The notes for Apache Nutch users and developers say:

Note 2: Crawled content
Sparkler can produce the segments on HDFS, trying to keep it compatible with the Nutch content format.

Please share the steps. How?

Java null pointer error in fetch()

Hi!

I am encountering some errors. The program crashes after about 10 crawls with the following errors. Can you help me figure out why?

Best,

1st

2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8) org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Second stack trace:

2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Third stack trace:

ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2
java.lang.NullPointerException
at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
java.util.NoSuchElementException: key not found: 6
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...

2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1
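The EOFException over RMI above typically means the JBrowserDriver browser process died mid-fetch, so every later call against the stale driver fails. A common mitigation (not Sparkler's actual code; class and method names below are illustrative) is to recreate the driver and retry the fetch a bounded number of times. A self-contained sketch of such a retry helper:

```java
import java.util.concurrent.Callable;

// Generic bounded-retry helper: run the callable up to `attempts` times,
// rethrowing the last failure if none succeed. In a fetcher, the callable
// would recreate the driver and re-issue driver.get(url) on each attempt.
public class Retry {
    public static <T> T withRetries(int attempts, Callable<T> call) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;               // remember the failure and try again
            }
        }
        throw last;                     // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Fails twice, then succeeds -- mimics a driver that dies and is restarted.
        String page = withRetries(3, () -> {
            if (++calls[0] < 3) throw new java.io.EOFException("driver died");
            return "fetched";
        });
        System.out.println(page + " after " + calls[0] + " attempts");  // fetched after 3 attempts
    }
}
```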

Escape metachars in solr queries

java.lang.RuntimeException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.search.SyntaxError: Cannot parse 'group:': Encountered "<EOF>" at line 1, column 6.
Was expecting one of:
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <LPARAMS> ...
    "filter(" ...
    <NUMBER> ...

    at edu.usc.irds.sparkler.util.SolrResultIterator.getNextBean(SolrResultIterator.scala:72)
    at edu.usc.irds.sparkler.util.SolrResultIterator.<init>(SolrResultIterator.scala:57)
    at edu.usc.irds.sparkler.CrawlDbRDD.compute(CrawlDbRDD.scala:55)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
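The SyntaxError above comes from sending an unescaped user string ("group:") straight into a Solr query, where ":" is a metacharacter. SolrJ ships `ClientUtils.escapeQueryChars` for exactly this; below is a self-contained sketch of the same Lucene-style escaping (the class and method names here are illustrative, not Sparkler's API):

```java
// Prefix every Lucene/Solr query metacharacter with a backslash so a
// user-supplied value such as "group:" is parsed as a literal term.
public class SolrEscape {
    // Characters treated specially by the Lucene query parser, plus whitespace.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/ ";

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');  // escape metachar
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("group:"));  // prints: group\:
    }
}
```

With the value escaped, the query `group\:` no longer hits the parser's `<EOF>` error.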

Finish Juju charm

Some high level remaining tasks:

  • Add solr relation
  • Pick up spark details from relation
  • Pick up solr details from relation
  • Finish write to configuration
  • Compile and add resource
  • Add actions for remote execution of crawler
  • Add Kafka relation
  • Return id in action output
  • Store last ingest id in kv so crawl can pick it up without user looking it up
  • Sort out solr cloud
  • Finish Docs
  • Add charm push to CI for beta branch

Job hangs for a minute when Kafka is not configured

When the Kafka server is not configured or not active, the crawl job makes repeated attempts to establish a connection, which adds a noticeable delay.

By default, the Kafka feature should be disabled in the configuration.

@karanjeets Thoughts? Can you have a look at this?
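A sketch of what such a default could look like in the crawler's YAML configuration. The key names below are assumptions for illustration only; the real schema lives in Sparkler's default config file:

```yaml
# Hypothetical keys -- check Sparkler's default configuration for the real names.
kafka.enable: false            # off by default; no connection attempts when false
kafka.listeners: localhost:9092
kafka.topic: sparkler_%s       # %s could expand to the job id
```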

log4j

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt
log4j:WARN No appenders could be found for logger (edu.usc.irds.sparkler.service.Injector$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
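These warnings mean log4j 1.2 found no configuration on the classpath. A minimal log4j.properties that routes everything to the console:

```properties
# Minimal log4j 1.2 configuration: INFO and above to stdout.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
```

Passing such a file with `-Dlog4j.configuration=file:///path/to/log4j.properties` silences the warnings.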

jobId = sjob-1473483131504

So I solved the problem with:
java -Dlog4j.configuration=file:///$SPARKLER_HOME/sparkler-app/src/main/resources/log4j.properties -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt

and got jobId = sjob-1473484528794

but when I run the crawl:
java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
the error below occurs.

I am using Hadoop 2.4.0, Spark 1.6.1, Nutch 1.11, Solr 6.0.1, JDK 1.8.0u92, and Scala 2.11.8; everything else works well.

How can I fix it?

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/10 14:30:56 INFO SparkContext: Running Spark version 1.6.1
16/09/10 14:30:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/10 14:30:56 INFO SecurityManager: Changing view acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: Changing modify acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/09/10 14:30:56 INFO Utils: Successfully started service 'sparkDriver' on port 49943.
16/09/10 14:30:57 INFO Slf4jLogger: Slf4jLogger started
16/09/10 14:30:57 INFO Remoting: Starting remoting
16/09/10 14:30:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:36988]
16/09/10 14:30:57 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 36988.
16/09/10 14:30:57 INFO SparkEnv: Registering MapOutputTracker
16/09/10 14:30:57 INFO SparkEnv: Registering BlockManagerMaster
16/09/10 14:30:57 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b32ee736-ef54-4f7a-83ff-6c8f6ab3d442
16/09/10 14:30:57 INFO MemoryStore: MemoryStore started with capacity 723.0 MB
16/09/10 14:30:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO DiskBlockManager: Shutdown hook called
16/09/10 14:30:57 INFO ShutdownHookManager: Shutdown hook called
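The root cause here is `ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil`: Spark's YARN support jars are not on the classpath of a plain `java -jar` invocation, so `-m yarn-client` cannot work. One way around this (a sketch, assuming a standard Spark 1.6 installation built with YARN support and `$SPARK_HOME` set) is to launch the same class through spark-submit instead:

```shell
# Launch via spark-submit so Spark's YARN classes are on the classpath.
# The path and installed Spark build are assumptions about this environment.
$SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --class edu.usc.irds.sparkler.Main \
  sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -i 2
```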
