Comments (19)

paulbramsen commented on May 20, 2024

@ColonelHou Take a look at the log output (make sure you're logging at the info level). It will be a big help in figuring out what's going wrong. Most likely you have a bad configuration and need to point Dr. Elephant at a different Spark log file or something.

jeremyore commented on May 20, 2024

I configured dr-elephant with Spark, setting <event_log_dir>/spark/jobhistory</event_log_dir> in the file app-conf/FetcherConf.xml, and also tried <event_log_dir>hdfs://hadoop01:9000/spark/jobhistory</event_log_dir> (because I configured spark.eventLog.dir to hdfs://hadoop01:9000/spark/jobhistory, and the Spark history server is available), but with neither can I see the Spark jobs. I don't know why.

akshayrai commented on May 20, 2024

@jeremyore , in FetcherConf.xml include /spark/jobhistory in event_log_dir.

Now restart Dr. Elephant and check your logs/elephant/dr-elephant.log for this line: Looking for spark logs at logDir:
Please post that line and a few of the preceding log lines.
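
For reference, a minimal Spark fetcher block of that shape might look like the following sketch; /spark/jobhistory is jeremyore's path, and everything should be adjusted to your cluster:

<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
  <params>
    <event_log_dir>/spark/jobhistory</event_log_dir>
  </params>
</fetcher>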

ljank commented on May 20, 2024

I'm also struggling with Spark logs. Here's what I have:

INFO  org.apache.spark.deploy.history.SparkFSFetcher$ : Looking for spark logs at logDir: /user/spark/applicationHistory
ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File /user/spark/applicationHistory/application_1464847406968_512780 does not exist
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673)
        at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
        at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
        at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
        at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
        at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:176)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File /user/spark/applicationHistory/application_1464847406968_512780 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:542)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:755)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:532)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:425)
        at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
        at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
        at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
        ... 13 more

although the file exists and is readable by any user:

-rwxrwxr--   3 hive spark   21336233 2016-06-29 12:02 /user/spark/applicationHistory/application_1464847406968_512780

FetcherConf.xml:

<fetchers>
  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
    <params>
      <event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
      <!--<event_log_dir>/user/spark/applicationHistory</event_log_dir>-->
      <event_log_dir>hdfs://warehouse/user/spark/applicationHistory</event_log_dir>
      <spark_log_ext></spark_log_ext>
    </params>
  </fetcher>
</fetchers>

akshayrai commented on May 20, 2024

Looks like the namenode is not being picked up... the logDir should be something like webhdfs://namenode-host:50070/user/spark/applicationHistory

@ljank , a couple of questions that will help me debug the issue:

  1. Have you pulled the latest changes from master? (Specifically #55.)
  2. Do you see this line in your log file? Multiple name services found. HDFS federation is not supported right now.
  3. Is dfs HA enabled on your cluster? You can check for the dfs.nameservices property in your Hadoop conf ($HADOOP_CONF_DIR/hdfs-site.xml). If it is set, then you have HA enabled.
  4. If HA is enabled, you must have some properties in hdfs-site.xml like dfs.namenode.http-address.<something>. Can you tell me the value of any one such property? Is it the default (0.0.0.0:50070) or something else?

ljank commented on May 20, 2024

  1. yes
  2. yes (INFO org.apache.spark.deploy.history.SparkFSFetcher$ : Multiple name services found. HDFS federation is not supported right now.)
  3. yes, we have HA enabled on our cluster
  4. not default:
  <property>
    <name>dfs.namenode.http-address.warehouse.hx-zk-c1-167774836</name>
    <value>hx-zk-c1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.warehouse.rr-zk-c3-167774997</name>
    <value>rr-zk-c3:50070</value>
  </property>

I've managed to make it work by explicitly specifying one of the namenodes via the <namenode_addresses/> param; however, automatic resolution from hdfs-site.xml is obviously preferred.
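
For anyone else hitting this, the workaround presumably amounts to something like the following in the fetcher params (hx-zk-c1:50070 is one of the namenodes from the hdfs-site.xml above; treat the value as illustrative):

<params>
  <event_log_dir>/user/spark/applicationHistory</event_log_dir>
  <!-- workaround until HA/federation resolution works: name a namenode explicitly -->
  <namenode_addresses>hx-zk-c1:50070</namenode_addresses>
</params>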

akshayrai commented on May 20, 2024

Cool. Like the log says, we do not have HDFS federation support right now; we need to add that support.

YannByron commented on May 20, 2024

Why not support HDFS federation?

I notice that the org.apache.spark.deploy.history.SparkFSFetcher#getNamenodeAddress function, where the INFO message "Multiple name services found. HDFS federation is not supported right now." is logged, only tries to obtain a single active namenode HTTP address.
If I rewrite it to return the active NN http-address for the current Hadoop cluster, can Dr. Elephant support HDFS federation?

If not, could you explain the underlying reason?

Thanks.

akshayrai commented on May 20, 2024

@YannByron , in the case of HDFS federation there can be multiple active namenodes, and the Spark logs may be on any one of them. So the getNameNodeAddress function should return a list of active namenodes, and we need to search for the Spark logs with each of them! Correct me if I am wrong.
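
A rough sketch of that idea (this is not the actual SparkFSFetcher code; the method is illustrative and only reads the standard HA/federation properties from hdfs-site.xml):

import org.apache.hadoop.conf.Configuration

// Sketch only: under federation there can be several name services, each with its own
// (possibly HA) set of namenodes, so collect every configured HTTP address and let the
// fetcher probe each of them for the Spark event log instead of assuming a single one.
def candidateNamenodeHttpAddresses(conf: Configuration): Seq[String] = {
  val nameServices = conf.getTrimmedStrings("dfs.nameservices").toSeq
  if (nameServices.isEmpty) {
    // No federation: fall back to the single configured address, if any.
    Option(conf.get("dfs.namenode.http-address")).toSeq
  } else {
    nameServices.flatMap { ns =>
      val nnIds = conf.getTrimmedStrings(s"dfs.ha.namenodes.$ns").toSeq
      if (nnIds.isEmpty) Option(conf.get(s"dfs.namenode.http-address.$ns")).toSeq
      else nnIds.flatMap(nn => Option(conf.get(s"dfs.namenode.http-address.$ns.$nn")))
    }
  }
}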

YannByron commented on May 20, 2024

@akshayrai
You are right.
I found a configuration item I had missed before: if the namenode_addresses parameter is set in the fetcher section of FetcherConf.xml, the check for whether the Hadoop cluster uses HDFS federation is bypassed. Of course, the Spark logs must then be reachable at that address.
I tried it and it worked: some Spark jobs were found even under HDFS federation.
Is this the right way to solve the problem?
Waiting for your reply.

akshayrai commented on May 20, 2024

@YannByron , we should definitely add HDFS federation support; using the conf is just a workaround. Somebody could send a PR to add federation support.

YannByron commented on May 20, 2024

@akshayrai, understood. Thank you.

akshayrai commented on May 20, 2024

@mcapavan, GitHub didn't pick up your question. Can you create a separate issue? Also, if Dr. Elephant is already running, do check the dr_elephant.log file:

$> less $DR_RELEASE/../logs/elephant/dr_elephant.log

mcapavan commented on May 20, 2024

@akshayrai Many thanks for coming back on this. I had a problem with my MapReduce jobs and Spark jobs yesterday. I was able to resolve my issues with this thread and the Dr. Elephant Google groups. The following are the issues I faced and resolved.

  1. My Hadoop did not have dfs.webhdfs.enabled set to true; fixed by updating hdfs-site.xml.
  2. Updated my log directory to <event_log_dir>webhdfs://localhost:50070/spark-history</event_log_dir>.
  3. Realized <namenode_addresses>localhost</namenode_addresses> is required only when HA is enabled, so I commented it out to resolve my issue.
  4. Realized very late that the logs are not in $DR_RELEASE/logs/application.log; instead, as you suggested, I should check $DR_RELEASE/../logs/elephant/dr_elephant.log.

Since I was able to resolve the issue, I deleted my earlier question, but how I solved it may be useful to others.
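
For point 1 above, the hdfs-site.xml change is presumably along these lines (dfs.webhdfs.enabled is a standard HDFS property; the namenode and datanodes need a restart after changing it):

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>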

akshayrai commented on May 20, 2024

Thanks @mcapavan

henryon commented on May 20, 2024

I've run into an issue with a Spark job; the spark-history application file actually exists.
Any idea what went wrong? Thanks in advance. @akshayrai @mcvsubbu

spark setting from FetchConf.xml#############

FetchConf.txt

#################hdfs spark history directory################
[hdfs@ip-10-50-0-249 elephant]$ hdfs dfs -du /spark-history/|grep application_1463473041784_0002
27873 /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000001.inprogress
27873 /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000002.inprogress

#########error logs#################################
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner :
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463473041784_0002
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:180)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463473041784_0002
at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:424)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:695)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:661)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:497)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:526)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:522)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:877)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:892)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /spark-history/application_1463473041784_0002
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:392)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:647)
... 24 more

08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1463473041784_0002] into the retry list.
08-18-2016 10:03:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 187
08-18-2016 10:03:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1463555624272_0015
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner :
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463555624272_0016
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:180)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463555624272_0016
at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:424)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:695)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:661)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:497)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:526)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:522)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:877)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:892)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /spark-history/application_1463555624272_0016
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:392)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:647)
... 24 more

08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1463555624272_0016] into the retry list.
08-18-2016 10:03:59 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 188
08-18-2016 10:03:59 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1463555624272_0017
^C

henryon commented on May 20, 2024

@akshayrai please check my hdfs-site.xml, attached.
hdfs-site_xml.txt

akshayrai commented on May 20, 2024

@henryon, soon after you start Dr. Elephant, you should see a message in your dr_elephant.log that says Looking for spark logs at logDir: ...

Dr. Elephant is looking for /spark-history/application_1463473041784_0002 in HDFS and can't find it, because you do not have that file; instead you have /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000001 with the _appattempt_* suffix appended. I do not know whether that is configurable (i.e., whether the appattempt_* part can be removed) in Spark.

Currently, you cannot configure this in Dr. Elephant either. Please send a PR to make it configurable.
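
Until something like that exists, a configurable fetcher could, for example, glob for any file that starts with the application id instead of stat-ing the exact path. A minimal sketch, not the current Dr. Elephant behavior, with illustrative names:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: match <logDir>/<appId>* so that names with an _appattempt_* suffix
// (as in henryon's listing) are found as well as a plain <appId> file.
def findEventLog(fs: FileSystem, logDir: String, appId: String): Option[Path] = {
  val matches = fs.globStatus(new Path(logDir, appId + "*"))
  if (matches == null || matches.isEmpty) None
  else Some(matches.head.getPath)
}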

seufagner commented on May 20, 2024

Does Dr. Elephant work without Hadoop/HDFS, in other words, in standalone mode?
