linkedin / dr-elephant Goto Github PK

View Code? Open in Web Editor NEW

1.3K 1.3K 853.0 11.06 MB

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark

License: Apache License 2.0

Java 51.12% Scala 11.52% HTML 10.01% Shell 2.45% CSS 1.27% JavaScript 7.65% Python 1.21% XSLT 0.45% TSQL 14.33%

dr-elephant's People

Contributors

Stargazers

Watchers

Forkers

mahashg rdsr vimmur nntnag17 mherr foursquare chesterxgchen skyone5 codingtony gk84 feiyuewuyue suyilearn deanwei dog-sunflower roln huangjun6919 yubobo ikerlan-digital mekasone alexgking clusterwhisperer jessica0530 rramos harlixxy zenglinxi0615 qixiaobo rajagopr krishnap phoenixhadoop brianmark2 epocolis datadog giannokostas vivekshivhare yuanboliu nkhuyu shichaoqu is00hcw chetnachaudhari msmelissa mgthunderbolt tglstory superwood lliiun-z sanjeevtripurari stiga-huang krasnyanskiy renozhang haiyang1987 linearregression liumao0702 brucezhou2012 anutforajaroftuna suyannone ariv14 mgaido91 teekam akshayrai umesh-prasad othman-essabir miloveme ouyangshourui arpang pradeep1288 rsprabery smarthi logicalclocks mdbconsulting devops-learning baeeq fuhmdyd colonelhou paulbramsen srinikomm xiangel ljank zheolong bigdata0803 yannbyron skakker wypb mistercrunch jxsd0084 gitter-badger abmodi stitchfix shoudi wongxingjun regzhuce dataing howlyu heguangwu mnikhil-git rayortigas damienclaveau cmenguy yiran-wu han-wang kevinzwx pranayhasan

dr-elephant's Issues

play.nettyException in New I/O worker #7

this is the pid:
[root@h0045150 ~]# ps -ef | grep elephant
root 1289 1 1 Apr26 ? 00:16:36 /usr/server/jdk/bin/java -Xms1024m -Xmx1024m -XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=128m -Duser.dir=/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT -Devolutionplugin=enabled -DapplyEvolutions.default=true -Djava.library.path=/usr/server/hadoop/lib/native -Dhttp.port=8079 -Ddb.default.url=jdbc:mysql://localhost/drelephant?characterEncoding=UTF-8 -Ddb.default.user=root -Ddb.default.password=admin123 -cp /root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/dr-elephant.dr-elephant-2.0.3-SNAPSHOT.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-java-jdbc_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-jdbc_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.sbt-link-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.javassist.javassist-3.18.0-GA.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-exceptions-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.templates_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.github.scala-incubator.io.scala-io-file_2.10-0.4.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.github.scala-incubator.io.scala-io-core_2.10-0.4.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.jsuereth.scala-arm_2.10-1.3.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-iteratees_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.scala-stm.scala-stm_2.10-0.7.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-json_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-functional_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-datacommons_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/joda-time.joda-time-2.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.joda.joda-convert-1.3.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.netty.netty-http-pipelining-1.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.slf4j.slf4j-api-1.7.5.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/ch.qos.logback.logback-core-1.0.13.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/ch.qos.logback.logback-classic-1.0.13.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.commons.commons-lang3-3.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.ning.async-http-client-1.7.18.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/oauth.signpost.signpost-core-1.2.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/commons-codec.commons-codec-1.3.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/oauth.signpost.signpost-commonshttp4-1.2.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.httpcomponents.httpcore-4.0.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.httpcomponents.httpclient-4.0.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/commons-logging.commons-logging-1.1.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/xerces.xercesImpl-2.11.0.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/xml-apis.xml-apis-1.4

But when i Use browser access ‘http://ip:port’ I get this exception：

2016-04-27 08:58:41,717 - [ERROR] - from play.nettyException in New I/O worker #7
Exception caught in Netty
java.lang.NoClassDefFoundError: Could not initialize class play.api.libs.concurrent.Execution$
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]

Deploying Dr.Elephant needs to copy files in app-conf manually!

After building Dr.Elephant vis compile.sh, and uncompressing dr-elephant-*.zip, when starting up dr.elephant, you need to set ELEPHANT_CONF_DIR to get some configuration files needed which exists in app-conf directory of source code, like FetcherConf.xml.
So, I think we can add these files to zip package when building Dr.Elephant.

compile exception

Hi, there is an exception when I compile Dr.Elephant.

[info] [SUCCESSFUL ] org.jacoco#org.jacoco.agent;0.7.1.201405082137!org.jacoco.agent.jar (5499ms)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: FAILED DOWNLOADS ::
[warn] :: ^ see resolution messages for details ^ ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: download failed: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:121)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1144)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1142)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1165)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1163)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1167)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1162)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1170)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1135)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1113)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
error sbt.ResolveException: download failed: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
[error] Total time: 675 s, completed May 25, 2016 7:33:24 PM

Could you help me solve it?

Unable to run as a Play project

Hi,

My goal is to setup a development environment on my Mac, so that I can add some additional functionality for our needs.

Please let me know the best approach to do that.

I want do something like what we do with a typical play project, where I have my project running (play run) and I make edits via eclipse (after doing play eclipse) and automatically the changes are reflected.

If I go with the compile.sh option, I think its a lot of work as I have to again unzip the distribution created to execute the code.

I think the issue is originating from ebean..and the issue may be the way that the configuration files have to be passed.

Please let me know the way to setup Dev environment with Dr. Elephant.

Details:

When I try to run it as a Play project with the command below,
play -Dconfig.resource=/Users/dr/dr-elephant/app-conf "~run 9000"

I get the following error when I try to navigate to a page like "host:8888/search"

[error] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: DataSource user is null?

[error] application -

! @702e5capa - Internal server error, for (GET) [/] ->

play.api.Application$$anon$1: Execution exception[[RuntimeException: DataSource user is null?]]

at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10.jar:2.2.2]

at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10.jar:2.2.2]

at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.2]

at scala.Option.map(Option.scala:145) [scala-library-2.10.4.jar:na]

at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.2]

Caused by: java.lang.RuntimeException: DataSource user is null?

at com.avaje.ebeaninternal.server.lib.sql.DataSourcePool.(DataSourcePool.java:189) ~[avaje-ebeanorm.jar:na]

at com.avaje.ebeaninternal.server.core.DefaultServerFactory.getDataSourceFromConfig(DefaultServerFactory.java:420) ~[avaje-ebeanorm.jar:na]

at com.avaje.ebeaninternal.server.core.DefaultServerFactory.setDataSource(DefaultServerFactory.java:380) ~[avaje-ebeanorm.jar:na]

at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:163) ~[avaje-ebeanorm.jar:na]

at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:125) ~[avaje-ebeanorm.jar:na]

at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:65) ~[avaje-ebeanorm.jar:na]

I have tried by coping conf folder contents into app-conf and set the following properties in both elephant.conf and application.conf

db.default.driver=com.mysql.jdbc.Driver
db_url=localhost
db.default.url="jdbc:mysql://localhost:3306/drelephant"
db.default.user=root
db.default.password=""
db.default.host=localhost

datasource.db.username=root
datasource.db.password=""
datasource.db.databaseUrl="jdbc:mysql://localhost:3306/drelephant"
datasource.db.databaseDriver=com.mysql.jdbc.Driver

db_name=drelephant
db_user=root
db_password=""

None Job on Dr.elephant UI

Doctor elephant is started but not able to see any job On UI.
I have hadoop-2.6.0
spark-1.5.1

I have already run some pig and mapreduce jobs.

please give me a solution.. Thank you！

Spark history logs fetching issue in HA Cluster

DrElephant is not able to fetch the Spark history logs in Yarn HA cluster by setting the namenode_addresses, below are the configs :

<params>
<event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
<event_log_dir>/user/spark/jobhistory</event_log_dir>
<spark_log_ext>_1</spark_log_ext>
#the values specified in namenode_addresses will be used for obtaining spark logs. The cluster configuration will be ignored.
<namenode_addresses>hahdfs1.hostname:50070, hahdfs2.hostname:50070</namenode_addresses>
</params>

But it works with webhdfs if I specifically go for current active namenode, below are the configs:
<params>
<event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
<event_log_dir>webhdfs://hahdfs1.hostname.net:50070/user/spark/jobhistory</event_log_dir>
<event_log_dir>/user/spark/jobhistory</event_log_dir>
<spark_log_ext>_1</spark_log_ext>
</params>

Error logs:
08-01-2016 21:45:13 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /user/spark/jobhistory/application_1460147926973_0091_1
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1636)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:181)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /user/spark/jobhistory/application_1460147926973_0091_1
at sun.reflect.GeneratedConstructorAccessor25.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:385)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:622)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:458)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:487)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:483)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:838)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:853)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)

How to integrate Azkaban

hi：
To ask dr-elephant How to integrate Azkaban

Thanks!

emit Statsd metrics for monitoring Dr Elephant

right now we do not have any metrics emitted in Dr E to monitor the queue size and alert on it. we may need metrics on time taken to process jobs, etc. as well.

[ERROR] - from play in main

Having an issue figuring out this error, not sure where to look.

sudo $DR_RELEASE/bin/start.sh $ELEPHANT_CONF_DIR
Using config dir: /home/kfedotov/dr-elephant-master/app-conf
Using config file: /home/kfedotov/dr-elephant-master/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.

Checking the log:
tail -f logs/application.log
13 16:56:18,412 - [INFO] - from play in main
database [default] connected at jdbc:mysql://localhost/drelephant?characterEncoding=UTF-8
2016-04-13 16:56:19,588 - [ERROR] - from play in main

I have play version 2.2.1
play 2.2.1 built with Scala 2.10.2 (running Java 1.8.0_77), http://www.playframework.com

Thanks

project structure

The current project structure:

dr-elephant
 └app
 └app-conf
 └conf
 └project
   ...

what about this structure:

dr-elephant
  └dr-elephant-main
    └app
    └conf
  └dr-elephant-dist
    └scripts
    └app-conf

The new structure may help new comers understand the project quickly and smoothly.

Faster fetcher for MapReduce apps

When deploy on a product cluster, the analysis speed is slower than the jobs finish speed! Our cluster runs around 20 thousands jobs every day. But dr. elephant can only analysis around 14 thousands of them.
I have increased the consumer thread number to 30 or larger, it didn't help. The cpu usage, memory usage and network usage were still in low rate. I found that the bottleneck is the job history server. If dr. elephant can fetch data directly from HDFS, the analysis rate may increase.

Spark logs not fetchable from HDFS?

Relevant segment of my FetcherConf.xml:

  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
    <params>
      <event_log_dir>hdfs:///var/log/spark/apps</event_log_dir>
    </params>
  </fetcher>

And now the error

05-27-2016 21:25:51 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
    at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
    at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
    at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
    at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
    at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:390)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:90)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:661)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:627)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:463)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:492)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:488)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:843)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:858)
    at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:323)
    at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:241)
    at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
    ... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
    at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:358)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:613)
    ... 24 more

No jobs, search and help tabs don't work

Hi,

my env:

centos 6.5
java 1.8.0_72
Hadoop 2.6.0-cdh5.4.3
dr-elephant is build from sources (master branch)

during application start:

Exception in thread "Thread-6" java.lang.ExceptionInInitializerError
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2051)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2016)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2136)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:78)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:74)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:303)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:311)
        at com.linkedin.drelephant.security.HadoopSecurity.<init>(HadoopSecurity.java:43)
        at com.linkedin.drelephant.ElephantRunner.run(ElephantRunner.java:100)
        at com.linkedin.drelephant.DrElephant.run(DrElephant.java:34)
Caused by: java.lang.RuntimeException: Bailing out since native library couldn't be loaded
        at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:46)
        ... 14 more

when hit help tab

[ESC[31merrorESC[0m] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2
[ESC[31merrorESC[0m] application -

! @712k9ga21 - Internal server error, for (GET) [/help] ->

play.api.Application$$anon$1: Execution exception[[RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2]]
        at play.api.Application$class.handleError(Application.scala:293) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
        at play.api.DefaultApplication.handleError(Application.scala:399) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
        at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
        at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
        at scala.Option.map(Option.scala:145) [org.scala-lang.scala-library-2.10.4.jar:na]
        at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
Caused by: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2
        at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:181) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at com.linkedin.drelephant.ElephantContext.<init>(ElephantContext.java:98) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:91) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at views.html.page.helpPage$.apply(helpPage.template.scala:69) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at views.html.page.helpPage$.render(helpPage.template.scala:90) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
Caused by: java.lang.reflect.InvocationTargetException: null
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_72]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_72]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_72]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_72]
        at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:160) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
        at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
Caused by: java.net.UnknownHostException: null
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) ~[na:1.8.0_72]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_72]
        at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_72]
        at java.net.Socket.connect(Socket.java:538) ~[na:1.8.0_72]
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[na:1.8.0_72]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[na:1.8.0_72]

why dr elephant UI show nothing

I've configured dr elephant and started as well. from applications.log. it shows fine. no error/exception, but from UI. no MR/Spark job shows. any step missing or configuration?
Thanks in advance.

meet the errors when click 'Search' button

Hi Guys,
When I click the 'Search' button, we got the below errors, how can I fix it?
Caused by: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:130) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:90) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.(ElephantContext.java:86) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:79) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage$.apply(searchPage.template.scala:85) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage$.render(searchPage.template.scala:148) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage.render(searchPage.template.scala) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at controllers.Application.search(Application.java:206) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:113) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:na]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:113) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:na]
at play.core.Router$HandlerInvoker$$anon$7$$anon$2.invocation(Router.scala:183) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.Router$Routes$$anon$1.invocation(Router.scala:377) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:56) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.GlobalSettings$1.call(GlobalSettings.java:64) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:91) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:90) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251) ~[org.scala-lang.scala-library-2.10.4.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[org.scala-lang.scala-library-2.10.4.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.4.jar:na]
at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:37) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
... 4 common frames omitted
Caused by: java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_75]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_75]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_75]
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_75]
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:109) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
... 27 common frames omitted
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[na:1.7.0_75]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.7.0_75]
at java.net.Socket.connect(Socket.java:579) ~[na:1.7.0_75]
at java.net.Socket.connect(Socket.java:528) ~[na:1.7.0_75]
at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.(HttpClient.java:211) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.New(HttpClient.java:308) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.New(HttpClient.java:326) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:997) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:851) ~[na:1.7.0_75]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.verifyURL(MapReduceFetcherHadoop2.java:167) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.(MapReduceFetcherHadoop2.java:161) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.(MapReduceFetcherHadoop2.java:155) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2.(MapReduceFetcherHadoop2.java:69) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
... 32 common frames omitted

Issues running dr.elephant

Ok figured it out in dev docs

can't manage spark job

05-25-2016 20:01:24 INFO org.apache.spark.deploy.history.SparkFSFetcher$ : Looking for spark logs at logDir: webhdfs://0.0.0.0:50070/data/spark
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner :
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner :
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.net.ConnectException: Connection refused
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1636)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:48)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:580)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:537)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:605)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:458)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:487)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:483)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:838)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:853)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$isLegacyLogDirectory(SparkFSFetcher.scala:185)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:143)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:99)
... 13 more

05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1458803017484_64118] into the retry list.

history server connection issue (Can't see any jobs on dashboard)

Hi guys,

I have setup Dr.Elephant. But, I am not able to see any jobs on the dashboard:

Hello there, I've been busy!
I looked through 0 jobs today.
About 0 of them could use some tuning.
About 0 of them need some serious attention!

This might be because it is not able to connect to history server. Can you please help me to figure out where should I do changes to point it to history server. Or please suggest if there could be any other issue.

Distributed analytics

Practically it is possible that a single node cannot analyze all completed applications.
Is it possible to distribute AnalyticJob to a cluster of nodes?

More friendly UI for Dr. Elephant - Feedback/Suggestions on technology stack

Hi developers and users,

At Linkedin, we are planning to refactor the UI of Dr. Elephant to make the interaction more intuitive to the user by giving them a flow level perspective and then let them drill down to mapreduce level. The current design shows results at a mapreduce level and then we build on top of it to show information at job and flow level.

We want to discuss and get feedback from the community on what technologies you prefer and would like to use to build the UI components. Currently, Dr. Elephant makes use of Play scala templates to design the views. It is simple and modular for our purpose. Do you have any other suggestions?

The intention is to keep the UI simple, user friendly and easy to develop.

Tagging some contributors of Dr. Elephant:
@shankar37 @nntnag17 @stiga-huang @krishnap @paulbramsen @tglstory @ljank @plypaul @hongbozeng @liyintang @brandtg @timyitong @chetnachaudhari @cjuexuan @rsprabery @miloveme @anspuli @aNutForAJarOfTuna

Thanks,
Akshay

Dr Elephant logs

Hi all

I've encountered some issues like below. Any one can help this?
ENV: hdp 2.4.2
btw, I compiled successfully following below steps:

git clone repo
install play
modify play with abosulately path, add compile.conf with hadoop 2.7.0 and spark 1.6.1
run compile.sh
create database,
go to dist, unzip file
modify elephant.conf as well.
start.sh
check HOST:8080
check dr.log found below errors.

play.api.Application$$anon$1: Execution exception[[RuntimeException: Could not find class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2]]
at play.api.Application$class.handleError(Application.scala:293) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.api.DefaultApplication.handleError(Application.scala:399) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.Option.map(Option.scala:145) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:257) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:344) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:343) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [com.typesafe.play.play-iteratees_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1361) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [org.scala-lang.scala-library-2.10.5.jar:na]
Caused by: java.lang.RuntimeException: Could not find class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:173) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.(ElephantContext.java:98) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:91) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage$.apply(searchPage.template.scala:89) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage$.render(searchPage.template.scala:152) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage.render(searchPage.template.scala) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at controllers.Application.search(Application.java:273) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:133) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:na]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:133) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:na]
at play.core.Router$HandlerInvoker$$anon$7$$anon$2.invocation(Router.scala:183) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.Router$Routes$$anon$1.invocation(Router.scala:377) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:56) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.GlobalSettings$1.call(GlobalSettings.java:64) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:91) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:90) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251) ~[org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:37) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
... 4 common frames omitted
Caused by: java.lang.ClassNotFoundException: com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_67]
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_67]
at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_67]
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_67]
at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_67]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_67]
at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_67]
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:159) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
... 27 common frames omitted

how about open a gitter

On the basis of the topic presented above ，because using a new version about hadoop and spark will cause some problem, open issue is a slow way to fix it

Dr Elephand is not starting

Hi Guys, we have been trying to install Dr. Elephant and after a lot of troubleshooting we are stuck in this part:

[root@ip-172-31-37-252 bin]# ./start.sh
Using config dir: /usr/dr-elephant/app-conf
Using config file: /usr/dr-elephant/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8081
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.

but the process never start and looking in the logs we found this:

2016-04-22 20:34:50,374 - [ERROR] - from com.jolbox.bonecp.hooks.AbstractConnectionHook in main
Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: java.net.ConnectException: Conn
ection refused.Message:Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
/hadoop/dr-elephant-2.0.3-SNAPSHOT/logs

please let us know if somebody has been experience this issue or have any idea.

Dr.Elephant support issue

Hi,
Dr.Elephant whether to support Cloudera Hadoop ?
Dr.Elephant whether to support HDFS or Yarn HA function ?
Can help me, answer these questions?

Update wiki to include new aggregated metrics feature.

Updating the current wiki to include details of the new aggregated metrics feature. #91 #39 #39

spark history extension problem

Hi ,

I'm trying to run with hadoop 2.6 and spark 1.4, and with non-snappy codec for spark history server. It seems the it is always assuming if it's not a folder for application log, it always consider it as snappy.

Please see a small pull request reflect a possible change here:
#47

Thanks

jobhistory page does not find record on job

I am wondering how is the jobhistory page used. I tried to give application id, jobid but both of them result in 'Unable to find record on job url: '

Do we need specific setup to get this working?

Not able to see any Job on Dr.elephant UI

Doctor elephant is started but not able to see any job On UI.
I have hadoop-2.6.0
spark-1.3.0

Wnen I run the command dr.elephant then I am getting
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /Users/Persistent/lib/hadoop-2.6.0/lib/native
Starting Dr. Elephant ....
Dr. Elephant started.

I have started the jobHistory server also

On doctor elephant UI I am getting:

Hello there, I've been busy!
I looked through 0 jobs today.
About 0 of them could use some tuning.
About 0 of them need some serious attention!

and my dr_elephant.log file contains

05-19-2016 09:54:07 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : AnalysisProvider updating its Authenticate Token...
05-19-2016 09:54:14 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463576245175, and current time: 1463631787423
05-19-2016 09:54:14 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463576245175&finishedTimeEnd=1463631787423
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463576245175&finishedTimeEnd=1463631787423
05-19-2016 09:54:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:54:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631787424, and current time: 1463631802451
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631787424&finishedTimeEnd=1463631802451
05-19-2016 09:54:23 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631787424&finishedTimeEnd=1463631802451
05-19-2016 09:54:23 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:55:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631802452, and current time: 1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631802452&finishedTimeEnd=1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631802452&finishedTimeEnd=1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:56:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631862563, and current time: 1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631862563&finishedTimeEnd=1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631862563&finishedTimeEnd=1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:57:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:57:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631922457, and current time: 1463631982451
05-19-2016 09:57:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631922457&finishedTimeEnd=1463631982451

please give me a solution..

Dr.Elephant analysis spark log

Hi：
I'm use Cloudera Hadoop, version is 2.6.0-cdh5.5.0, spark version is 1.5.0-cdh5.5.0.
This version Spark log, whether it must be configured as "spark.eventLog.compress=true"?
Does this mean that the JSON format can not be parsed ?

Compilation failing using ./compile.sh

[root@scflexnode09 dr-elephant-master]# ./compile.sh
Using the default configuration
Hadoop Version : 2.6.0
Spark Version : 1.5.0
Other opts set :

trap exit SIGINT SIGTERM
+++ dirname ./compile.sh
++ cd .
++ pwd
project_root=/gpfs/home/tpathare/dr-elephant-master
cd /gpfs/home/tpathare/dr-elephant-master
start_script=/gpfs/home/tpathare/dr-elephant-master/scripts/start.sh
stop_script=/gpfs/home/tpathare/dr-elephant-master/scripts/stop.sh
rm -rf /gpfs/home/tpathare/dr-elephant-master/dist
mkdir dist
play_command -Dhadoopversion=2.6.0 -Dsparkversion=1.5.0 clean test compile dist
type activator
activator is /activator-1.3.9-minimal/bin/activator
activator -Dhadoopversion=2.6.0 -Dsparkversion=1.5.0 clean test compile dist
[info] Loading project definition from /gpfs/home/tpathare/dr-elephant-master/project
[info] Set current project to dr-elephant (in build file:/gpfs/home/tpathare/dr-elephant-master/)
[success] Total time: 0 s, completed Apr 9, 2016 7:31:16 PM
[info] Updating {file:/gpfs/home/tpathare/dr-elephant-master/}dr-elephant-master...
[info] Resolving xml-apis#xml-apis;1.3.04 ...
[info] Done updating.
[info] Compiling 43 Scala sources and 69 Java sources to /gpfs/home/tpathare/dr-elephant-master/target/scala-2.10/classes...
[error] /gpfs/home/tpathare/dr-elephant-master/app/org/apache/spark/deploy/history/SparkDataCollection.scala:217: type mismatch;
[error] found : scala.collection.mutable.HashSet[Int]
[error] required: org.apache.spark.util.collection.OpenHashSet[Int]
[error] addIntSetToJSet(data.completedStageIndices, jobInfo.completedStageIndices)
[error] ^
[warn] /gpfs/home/tpathare/dr-elephant-master/app/org/apache/spark/deploy/history/SparkDataCollection.scala:299: abstract type pattern T is unchecked since it is eliminated by erasure
[warn] seq.foreach { case (item: T) => list.add(item)}
[warn] ^
[warn] one warning found
[error] one error found
error Compilation failed
[error] Total time: 15 s, completed Apr 9, 2016 7:31:31 PM
cd target/universal
./compile.sh: line 94: cd: target/universal: No such file or directory
++ /bin/ls '*.zip'
/bin/ls: cannot access *.zip: No such file or directory
ZIP_NAME=
unzip
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.

Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
Default action is to extract files in list, except those in xlist, to exdir;
file[.zip] may be a wildcard. -Z => ZipInfo mode ("unzip -Z" for usage).

-p extract files to pipe, no messages -l list files (short format)
-f freshen existing files, create none -t test compressed archive data
-u update files, create if necessary -z display archive comment only
-v list verbosely/show version info -T timestamp archive to latest
-x exclude files that follow (in xlist) -d extract files into exdir
modifiers:
-n never overwrite existing files -q quiet mode (-qq => quieter)
-o overwrite files WITHOUT prompting -a auto-convert any text files
-j junk paths (do not make directories) -aa treat ALL files as text
-U use escapes for all non-ASCII Unicode -UU ignore any Unicode fields
-C match filenames case-insensitively -L make (some) names lowercase
-X restore UID/GID info -V retain VMS version numbers
-K keep setuid/setgid/tacky permissions -M pipe through "more" pager
See "unzip -hh" or unzip.txt for more help. Examples:
unzip data1 -x joe => extract all files except joe from zipfile data1.zip
unzip -p foo | more => send contents of foo.zip via pipe into program more
unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer

rm
rm: missing operand
Try `rm --help' for more information.
DIST_NAME=
chmod +x /bin/dr-elephant
chmod: cannot access `/bin/dr-elephant': No such file or directory
sed -i.bak '/declare -r app_classpath/s/.$/:hadoop classpath:${ELEPHANT_CONF_DIR}"/' /bin/dr-elephant
sed: can't read /bin/dr-elephant: No such file or directory
cp /gpfs/home/tpathare/dr-elephant-master/scripts/start.sh /bin/
cp /gpfs/home/tpathare/dr-elephant-master/scripts/stop.sh /bin/
zip -r .zip

zip error: Nothing to do! (.zip)

mv .zip /gpfs/home/tpathare/dr-elephant-master/dist/
mv: cannot stat `.zip': No such file or directory
[root@scflexnode09 dr-elephant-master]#

ERROR while compiling

I got this error while compiling:

[warn] /home/test/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:300: abstract type pattern T is unchecked since it is eliminated by erasure [warn] seq.foreach { case (item: T) => list.add(item)} [warn] ^ [error] /home/test/dr-elephant/app/org/apache/spark/deploy/history/SparkFSFetcher.scala:260: too many arguments for method replay: (logData: java.io.InputStream, sourceName: String)Unit [error] replayBus.replay(logInput, logPath.toString(), false) [error] ^ [warn] one warning found [error] one error found [error] (compile:compile) Compilation failed [error] Total time: 21 s, completed 27 May, 2016 4:05:18 PM

Any idea?

Are there any plans to have this fetch jobs from AWS/EMR?

I could see a pretty compelling use case for running Dr Elephant on top of AWS to analyze Spark/MR jobs running on EMR.

What I have in mind is, Dr Elephant would be installed in a stand-alone box, and periodically poll the AWS API to check for alive clusters via DescribeCluster, get the hostname for each of those, and automatically fetch jobs running on each cluster and analyze them. The idea is when you have lots of short-running EMR clusters, you can have 1 centralized location with all the results. Optionally, maybe integrate with AWS Data Pipeline to figure out workflows.

Right now the way we have it setup is to run Dr Elephant on each EMR cluster, but this is far from ideal because we lose the results once the cluster goes down unless we export it, and have to reinstall it on every new cluster. It still works because we can have it running in a long-standing staging environment and make sure things are green before they go to prod. But in order to identify trends over multiple days this breaks down.

I haven't dug into the code yet, but what do you think about this idea? I believe there is currently no way to do such a thing, but have you ever had this request or is it something you would be open to consider in Dr Elephant? Happy to help contributing to that once I start looking at the code if you think it would be valuable.

TEZ job analysis

As we start our journey down the road to Tez, it would be good if Dr. Elephant could help us with those jobs, too. This is totally a feature request.

/compile.sh: line 27: play: command not found - Any help ?

Hi ,
I am very new to github site. I download the zip file "dr-elephant-master.zip" and trying to install in my YARN cluster. Do I need to install dr-elephant on RM node or I can install in spare node which is part of our hadoop cluster ?

Secondly, when I try to compile it complaining below errors, any help is appreciated.

./compile.sh
Using the default configuration
Hadoop Version : 2.3.0
Spark Version : 1.4.0
Other opts set :
./compile.sh: line 27: play: command not found
./compile.sh: line 94: cd: target/universal: No such file or directory
inflating: dr-elephant-master/.gitignore
inflating: dr-elephant-master/LICENSE
inflating: dr-elephant-master/NOTICE
..
..
inflating: dr-elephant-master/test/rest/RestAPITest.java
chmod: cannot access dr-elephant-master/bin/dr-elephant': No such file or directory sed: can't read dr-elephant-master/bin/dr-elephant: No such file or directory cp: cannot create regular filedr-elephant-master/bin/': Is a directory
cp: cannot create regular file `dr-elephant-master/bin/': Is a directory
adding: dr-elephant-master/ (stored 0%)
..
..
adding: dr-elephant-master/conf/evolutions/default/1.sql (deflated 72%)
adding: dr-elephant-master/conf/log4j.properties (deflated 45%)
adding: dr-elephant-master/conf/routes (deflated 64%)

Travis CI Activation

Can a LinkedIn Github admin please go to https://travis-ci.org/linkedin/dr-elephant/ and click activate? It doesn't cost anything and would be great to see build statuses in PRs. All the supporting infrastructure is in place as of #86. Thank you!

Cloud not get spark.eventLog.dir

SparkFSFetcher as a plugin in this project，how cloud we make sure to get the variable “spark.eventLog.dir” by using “new SparkConf()” ?
I find “new SparkConf()” uses "java.lang.System.getProperties()" to get the spark config items.
But env. in diffrent environment differ in thousands of ways, i cant get this variable in CDH cluster without spark-submit . do you have any idea?

i wonder if the start.sh will do something like the spark-submit script does?

DrE can't be deployed on existing clusters due to large backlog of jobs to be processed.

Hi All - I'm currently working at Paypal and we're having an issue deploying Dr. Elephant. I'd like to start a discussion around the best solution and show what we've done to improve the speed of processing job data.

Currently, DrE is unable to process a large back log of jobs on a given cluster.

Related thread on the mailing list.

Increasing the number of threads being used to query and process responses from the job history server does not improve the speed of processing, and we can see an ever growing queue of jobs waiting to be processed by DrE.

The impact of increasing the number of executors to a large number (240-ish) is shown below. While the queue sized started going down, we noticed that it went back up and hovered around 10k as more jobs were submitted.

$ grep queue dr_elephant.log |tail -10

05-25-2016 17:37:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9085
05-25-2016 17:38:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9081
05-25-2016 17:39:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9078
05-25-2016 17:40:14 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9071
05-25-2016 17:41:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9069
05-25-2016 17:42:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9038
05-25-2016 17:43:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9016
05-25-2016 17:44:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 9010
05-25-2016 17:45:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 8986
05-25-2016 17:46:15 INFO  com.linkedin.drelephant.ElephantRunner : Job queue size is 8973

At that point we delved into JVisualVM to see where the majority of time was being spent:

The readJsonNode function handles both reading from the Job History Server and parsing the JSON.

The challenge to scalability becomes clear by looking at the network traffic. Each call to the job history server creates a separate TCP connection and gzip is not enabled by default.

Separate TCP Connections:

The number of TCP connections being created for each MR job is 4 + the number of tasks for that job.

Without Gzip for the request:

GET /ws/v1/history/mapreduce/jobs/job_1464719949755_0001/conf

With Gzip:

When you multiply this by many many jobs, the benefits from gzip become pretty large.

Patch + visualvm output due to patch forthcoming.

Monitoring for Dr. Elephant

We plan to add monitoring for Dr. Elephant and are evaluating the following tools.

What we plan to measure?

To start with, we would like to measure the following.

Average time taken to process a job - normalized to job size
Latency to JobHistory Server
Is Dr. Elephant alive?
Count of jobs skipped
HeapMemory usage
Full GC count
Queue size

Evaluation Criteria

It should be possible to plug-in the monitoring tool into Dr. Elephant based on configuration. Users should have the flexibility to use another tool if they wish so. I have checked on jolokia and it's easy to plug that into the application based on config.

Monitoring via MBeans

In-order to monitor stats of interest the stat's has to be exposed as part of an attribute in an MBean. The MBean can have several attributes and the attribute values can be updated accordingly from a thread in the application.

Feedback/Suggestions

We welcome feedback on this feature. Please suggest other monitoring tools that may be interesting or easier to use/configure. Similarly, other params to be monitored.

[ANNOUNCEMENT] "Who is using Dr. Elephant?"

It's been close to 3 months since the open sourcing of Dr. Elephant and we have received tremendous response from the community and many companies have already started adopting Dr. Elephant. At Linkedin, Dr. Elephant has been successfully running for more than 2 years, analyzing over a hundred thousand jobs everyday.

As Dr. Elephant continues to grow, it will be good to track which companies are using Dr. Elephant. This will encourage others to try Dr. Elephant and help build the community.

Request you to send an email to me or reply to this issue with the github handle of the active members working on Dr. Elephant, company name and a link to a pull request. I'll update this in the README file of Dr. Elephant.

@stiga-huang @krishnap @paulbramsen @tglstory @ljank @plypaul @hongbozeng @liyintang @brandtg @timyitong @chetnachaudhari @cjuexuan @rsprabery @miloveme @anspuli @aNutForAJarOfTuna

Thanks and cheer to everyone.
Akshay Rai

AggregatedMetrics branch

Can someone explain to me what the AggregatedMetrics branch is for? Why isn't stuff just being merged into master?

Cannot parse java option string for some Spark applications

in com.linkedin.drelephant.util.Utils#parseJavaOptions, we assume that java options are in format of "-Dfoo=bar -Dfoo2=bar ...". So Dr. Elephant failed to parse options like "-Dcom.sun.management.jmxremote" or "-XX:PermSize=64m", and raise an IllegalArgumentException in ERROR log.

Should these options be ignored? Or are they valuable in analysis?

Here are two related error logs:
04-19-2016 19:25:36 ERROR com.linkedin.drelephant.util.InfoExtractor : Encountered error while parsing java options into urls: Cannot parse java option string [-Djava.util.logging.config.file=jmx.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=0 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -XX:PermSize=64m -XX:MaxPermSize=256m]. The part [-Dcom.sun.management.jmxremote] does not contain a =.

04-20-2016 00:17:27 ERROR com.linkedin.drelephant.util.InfoExtractor : Encountered error while parsing java options into urls: Cannot parse java option string [-Djava.util.logging.config.file=jmx.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=0 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -XX:PermSize=64m -XX:MaxPermSize=256m]. Some options does not begin with -D prefix.

Incorrect `compile.conf` example

While building latest master (1bd8f98) I've found that the example compile.conf file in Developer Guilde is misleading/incorrect. Instead of:

hadoop_version = 2.3.0                                      // The Hadoop version to compile with
spark_version = 1.4.0                                       // The Spark version to compile with
play_opts="-Dsbt.repository.config=app-conf/resolver.conf"  // Other play/sbt options

it should be:

hadoop_version=2.3.0                                        # The Hadoop version to compile with
spark_version=1.4.0                                         # The Spark version to compile with
play_opts="-Dsbt.repository.config=app-conf/resolver.conf"  # Other play/sbt options

Otherwise reading incorrect configuration fails silently, like so:

Reading from config file...
hadoop_version=2.6.0                                      # The Hadoop version to compile with
compile.conf: line 1: hadoop_version: command not found
compile.conf: line 2: spark_version: command not found

and keeps working with the defaults:

Hadoop Version : 2.3.0
Spark Version  : 1.4.0
Other opts set :
+ trap exit SIGINT SIGTERM
+++ dirname ./compile.sh
++ cd .
++ pwd
+ project_root=/Users/ljank/sandbox/dr-elephant
+ cd /Users/ljank/sandbox/dr-elephant
+ start_script=/Users/ljank/sandbox/dr-elephant/scripts/start.sh
+ stop_script=/Users/ljank/sandbox/dr-elephant/scripts/stop.sh
+ rm -rf /Users/ljank/sandbox/dr-elephant/dist
+ mkdir dist
+ play_command -Dhadoopversion=2.3.0 -Dsparkversion=1.4.0 clean test compile dist
+ type activator
+ play -Dhadoopversion=2.3.0 -Dsparkversion=1.4.0 clean test compile dist

[ERROR] - from play.nettyException in New I/O worker #1

hello：

When i open the webUI , the server report errors as follow:

2016-04-21 16:12:57,337 - [INFO] - from play in main Application started (Prod)

2016-04-21 16:12:57,564 - [INFO] - from play in main Listening for HTTP on /0:0:0:0:0:0:0:0:8050

2016-04-21 16:13:04,872 - [ERROR] - from play.nettyException in New I/O worker #1
Exception caught in Netty
java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;
at play.core.Invoker$.(Invoker.scala:24) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.Invoker$.(Invoker.scala) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$Implicits$.defaultContext$lzycompute(Execution.scala:7) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$Implicits$.defaultContext(Execution.scala:6) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$.(Execution.scala:10) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$.(Execution.scala) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

2016-04-21 16:13:05,669 - [ERROR] - from play.nettyException in New I/O worker #2
Exception caught in Netty
java.lang.NoClassDefFoundError: Could not initialize class play.api.libs.concurrent.Execution$
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

could someone help me? tks

Dr.Elephant Compilation Error for Hadoop 2.6.0 and Spark 1.5.2

Hi,

I'm trying to compile Dr.Elephant for Hadoop 2.6.0 and Spark 1.5.2, but getting compilation errors. Can you help me to fix the error?

Error Message:

[error] /Users/pkasinathan/workspace/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:217: type mismatch;
[error] found : scala.collection.mutable.HashSet[Int]
[error] required: org.apache.spark.util.collection.OpenHashSet[Int]
[error] addIntSetToJSet(data.completedStageIndices, jobInfo.completedStageIndices)

Full Compile Log:

$ ./compile.sh
Using the default configuration
Hadoop Version : 2.6.0
Spark Version : 1.5.2
Other opts set :

trap exit SIGINT SIGTERM
+++ dirname ./compile.sh
++ cd .
++ pwd
project_root=/Users/pkasinathan/workspace/dr-elephant
cd /Users/pkasinathan/workspace/dr-elephant
start_script=/Users/pkasinathan/workspace/dr-elephant/scripts/start.sh
stop_script=/Users/pkasinathan/workspace/dr-elephant/scripts/stop.sh
rm -rf /Users/pkasinathan/workspace/dr-elephant/dist
mkdir dist
play_command -Dhadoopversion=2.6.0 -Dsparkversion=1.5.2 clean test compile dist
type activator
play -Dhadoopversion=2.6.0 -Dsparkversion=1.5.2 clean test compile dist
[info] Loading project definition from /Users/pkasinathan/workspace/dr-elephant/project
[info] Set current project to dr-elephant (in build file:/Users/pkasinathan/workspace/dr-elephant/)
[success] Total time: 0 s, completed Apr 10, 2016 12:07:26 PM
[info] Updating {file:/Users/pkasinathan/workspace/dr-elephant/}dr-elephant...
[info] Resolving xml-apis#xml-apis;1.3.04 ...
[info] Done updating.
[info] Compiling 43 Scala sources and 69 Java sources to /Users/pkasinathan/workspace/dr-elephant/target/scala-2.10/classes...
[error] /Users/pkasinathan/workspace/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:217: type mismatch;
[error] found : scala.collection.mutable.HashSet[Int]
[error] required: org.apache.spark.util.collection.OpenHashSet[Int]
[error] addIntSetToJSet(data.completedStageIndices, jobInfo.completedStageIndices)
[error] ^
[warn] /Users/pkasinathan/workspace/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:299: abstract type pattern T is unchecked since it is eliminated by erasure
[warn] seq.foreach { case (item: T) => list.add(item)}
[warn] ^
[warn] one warning found
[error] one error found
error Compilation failed
[error] Total time: 15 s, completed Apr 10, 2016 12:07:41 PM

How to configure dr-elephant with spark?

When I configure the dr-elephant with spark, why I can't see the spark jobs?

The log4j change needs recompile

After making change to log4j, the code has to be recompiled in order for new change to become effective.
The restart of Dr. elephant should make change effective rightaway

Address after successful deployment

I deployed dr-elephant successfully .
What is the url?
http://ip:port/drelephant ?

When run start.sh, Dr.Elephant is not started correctly, but its log shows "Dr. Elephant started."

[hdfs@dn01 bin]$ ./start.sh /home/hdfs/dr-elephant/app-conf
Using config dir: /home/hdfs/dr-elephant/app-conf
Using config file: /home/hdfs/dr-elephant/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.

However, it is not started correctly, because the db_url, db_name, db_user is not set yet. And it can not connect to MySQL.

So when I run the stop.sh, it shows:
[hdfs@dn01 bin]$ ./stop.sh
Dr.Elephant is not running.

Support for HDFS HA

While fetching spark event logs in HDFS, dr. elephant just get the namenode address by dfs.namenode.http-address in startup. This property may be empty when using HDFS HA.

Anyone have time to add this feature?