Comments (19)

paulbramsen commented on May 20, 2024

@ColonelHou Take a look at the log output (make sure you're logging at the info level). It will be a big help in figuring out what's going wrong. Most likely you have a bad configuration and need to point Dr. Elephant at a different Spark log file or something.

jeremyore commented on May 20, 2024

I configured dr-elephant with Spark, setting <event_log_dir>/spark/jobhistory</event_log_dir> in the file app-conf/FetcherConf.xml, and also tried <event_log_dir>hdfs://hadoop01:9000/spark/jobhistory</event_log_dir> (because I configured spark.eventLog.dir to hdfs://hadoop01:9000/spark/jobhistory, and the Spark history server is available), but with neither can I see the Spark jobs. I don't know why.

akshayrai commented on May 20, 2024

@jeremyore , in FetcherConf.xml include /spark/jobhistory in event_log_dir.

Now restart Dr. Elephant and check your logs/elephant/dr-elephant.log for this line: Looking for spark logs at logDir:
Please post that line and a few of the preceding log lines.
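
For reference, a minimal Spark fetcher block of that shape might look like the following sketch; /spark/jobhistory is jeremyore's path, and everything should be adjusted to your cluster:

<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
  <params>
    <event_log_dir>/spark/jobhistory</event_log_dir>
  </params>
</fetcher>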

ljank commented on May 20, 2024

I'm also struggling with Spark logs. Here's what I have:

INFO  org.apache.spark.deploy.history.SparkFSFetcher$ : Looking for spark logs at logDir: /user/spark/applicationHistory
ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File /user/spark/applicationHistory/application_1464847406968_512780 does not exist
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673)
        at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
        at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
        at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
        at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
        at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:176)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File /user/spark/applicationHistory/application_1464847406968_512780 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:542)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:755)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:532)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:425)
        at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
        at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
        at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
        ... 13 more

although the file exists and is readable by any user:

-rwxrwxr--   3 hive spark   21336233 2016-06-29 12:02 /user/spark/applicationHistory/application_1464847406968_512780

FetcherConf.xml:

<fetchers>
  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
    <params>
      <event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
      <!--<event_log_dir>/user/spark/applicationHistory</event_log_dir>-->
      <event_log_dir>hdfs://warehouse/user/spark/applicationHistory</event_log_dir>
      <spark_log_ext></spark_log_ext>
    </params>
  </fetcher>
</fetchers>

akshayrai commented on May 20, 2024

Looks like the namenode is not being picked up... the logDir should be something like webhdfs://namenode-host:50070/user/spark/applicationHistory

@ljank , a couple of questions that will help me debug the issue:

  1. Have you pulled the latest changes from master? (Specifically #55.)
  2. Do you see this line in your log file? Multiple name services found. HDFS federation is not supported right now.
  3. Is dfs HA enabled on your cluster? You can check for the dfs.nameservices property in your Hadoop conf ($HADOOP_CONF_DIR/hdfs-site.xml). If it is set, then you have HA enabled.
  4. If HA is enabled, you must have some properties in hdfs-site.xml like dfs.namenode.http-address.<something>. Can you tell me the value of any one such property? Is it the default (0.0.0.0:50070) or something else?

ljank commented on May 20, 2024

  1. yes
  2. yes (INFO org.apache.spark.deploy.history.SparkFSFetcher$ : Multiple name services found. HDFS federation is not supported right now.)
  3. yes, we have HA enabled on our cluster
  4. not default:
  <property>
    <name>dfs.namenode.http-address.warehouse.hx-zk-c1-167774836</name>
    <value>hx-zk-c1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.warehouse.rr-zk-c3-167774997</name>
    <value>rr-zk-c3:50070</value>
  </property>

I've managed to make it work by explicitly specifying one of the namenodes via the <namenode_addresses/> param; however, automatic resolution from hdfs-site.xml is obviously preferred.
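
For anyone else hitting this, the workaround presumably amounts to something like the following in the fetcher params (hx-zk-c1:50070 is one of the namenodes from the hdfs-site.xml above; treat the value as illustrative):

<params>
  <event_log_dir>/user/spark/applicationHistory</event_log_dir>
  <!-- workaround until HA/federation resolution works: name a namenode explicitly -->
  <namenode_addresses>hx-zk-c1:50070</namenode_addresses>
</params>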

akshayrai commented on May 20, 2024

Cool. Like the log says, we do not have HDFS federation support right now; we need to add that support.

YannByron commented on May 20, 2024

Why not support HDFS federation?

I notice that the org.apache.spark.deploy.history.SparkFSFetcher#getNamenodeAddress function, where the INFO message "Multiple name services found. HDFS federation is not supported right now." is logged, only tries to obtain a single active namenode HTTP address.
If I rewrite it to return the active NN http-address for the current Hadoop cluster, can Dr. Elephant support HDFS federation?

If not, could you explain the underlying reason?

Thanks.

akshayrai commented on May 20, 2024

@YannByron , in the case of HDFS federation there can be multiple active namenodes, and the Spark logs may be on any one of them. So the getNameNodeAddress function should return a list of active namenodes, and we need to search for the Spark logs with each of them! Correct me if I am wrong.
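
A rough sketch of that idea (this is not the actual SparkFSFetcher code; the method is illustrative and only reads the standard HA/federation properties from hdfs-site.xml):

import org.apache.hadoop.conf.Configuration

// Sketch only: under federation there can be several name services, each with its own
// (possibly HA) set of namenodes, so collect every configured HTTP address and let the
// fetcher probe each of them for the Spark event log instead of assuming a single one.
def candidateNamenodeHttpAddresses(conf: Configuration): Seq[String] = {
  val nameServices = conf.getTrimmedStrings("dfs.nameservices").toSeq
  if (nameServices.isEmpty) {
    // No federation: fall back to the single configured address, if any.
    Option(conf.get("dfs.namenode.http-address")).toSeq
  } else {
    nameServices.flatMap { ns =>
      val nnIds = conf.getTrimmedStrings(s"dfs.ha.namenodes.$ns").toSeq
      if (nnIds.isEmpty) Option(conf.get(s"dfs.namenode.http-address.$ns")).toSeq
      else nnIds.flatMap(nn => Option(conf.get(s"dfs.namenode.http-address.$ns.$nn")))
    }
  }
}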

YannByron commented on May 20, 2024

@akshayrai
You are right.
I found a configuration item I had missed before: if the namenode_addresses parameter is set in the fetcher section of FetcherConf.xml, the check for whether the Hadoop cluster uses HDFS federation is bypassed. Of course, the Spark logs must then be reachable at that address.
I tried it and it worked: some Spark jobs were found even under HDFS federation.
Is this the right way to solve the problem?
Waiting for your reply.

akshayrai commented on May 20, 2024

@YannByron , we should definitely add HDFS federation support; using the conf is just a workaround. Somebody could send a PR to add federation support.

YannByron commented on May 20, 2024

@akshayrai, understood. Thank you.

akshayrai commented on May 20, 2024

@mcapavan, GitHub didn't pick up your question. Can you create a separate issue? Also, if Dr. Elephant is already running, do check the dr_elephant.log file:

$> less $DR_RELEASE/../logs/elephant/dr_elephant.log

mcapavan commented on May 20, 2024

@akshayrai Many thanks for coming back on this. I had a problem with my MapReduce jobs and Spark jobs yesterday. I was able to resolve my issues with this thread and the Dr. Elephant Google groups. The following are the issues I faced and resolved.

  1. My Hadoop did not have dfs.webhdfs.enabled set to true; fixed by updating hdfs-site.xml.
  2. Updated my log directory to <event_log_dir>webhdfs://localhost:50070/spark-history</event_log_dir>.
  3. Realized <namenode_addresses>localhost</namenode_addresses> is required only when HA is enabled, so I commented it out to resolve my issue.
  4. Realized very late that the logs are not in $DR_RELEASE/logs/application.log; instead, as you suggested, I should check $DR_RELEASE/../logs/elephant/dr_elephant.log.

Since I was able to resolve the issue, I deleted my earlier question, but how I solved it may be useful to others.
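
For point 1 above, the hdfs-site.xml change is presumably along these lines (dfs.webhdfs.enabled is a standard HDFS property; the namenode and datanodes need a restart after changing it):

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>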

akshayrai commented on May 20, 2024

Thanks @mcapavan

henryon commented on May 20, 2024

I've run into an issue with a Spark job; the spark-history application file actually exists.
Any idea what went wrong? Thanks in advance. @akshayrai @mcvsubbu

spark setting from FetchConf.xml#############

FetchConf.txt

#################hdfs spark history directory################
[hdfs@ip-10-50-0-249 elephant]$ hdfs dfs -du /spark-history/|grep application_1463473041784_0002
27873 /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000001.inprogress
27873 /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000002.inprogress

#########error logs#################################
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner :
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463473041784_0002
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:180)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463473041784_0002
at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:424)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:695)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:661)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:497)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:526)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:522)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:877)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:892)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /spark-history/application_1463473041784_0002
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:392)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:647)
... 24 more

08-18-2016 10:03:59 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1463473041784_0002] into the retry list.
08-18-2016 10:03:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 187
08-18-2016 10:03:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1463555624272_0015
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner :
08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463555624272_0016
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:180)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /spark-history/application_1463555624272_0016
at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:424)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:695)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:661)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:497)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:526)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:522)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:877)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:892)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /spark-history/application_1463555624272_0016
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:392)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:647)
... 24 more

08-18-2016 10:03:59 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1463555624272_0016] into the retry list.
08-18-2016 10:03:59 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 188
08-18-2016 10:03:59 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1463555624272_0017
^C

henryon commented on May 20, 2024

@akshayrai please check my hdfs-site.xml, attached.
hdfs-site_xml.txt

akshayrai commented on May 20, 2024

@henryon, soon after you start Dr. Elephant, you should see a message in your dr_elephant.log that says Looking for spark logs at logDir: ...

Dr. Elephant is looking for /spark-history/application_1463473041784_0002 in HDFS and can't find it, because you do not have that file; instead you have /spark-history/application_1463473041784_0002_appattempt_1463473041784_0002_000001 with the _appattempt_* suffix appended. I do not know whether that is configurable (i.e., whether the appattempt_* part can be removed) in Spark.

Currently, you cannot configure this in Dr. Elephant either. Please send a PR to make it configurable.
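
Until something like that exists, a configurable fetcher could, for example, glob for any file that starts with the application id instead of stat-ing the exact path. A minimal sketch, not the current Dr. Elephant behavior, with illustrative names:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: match <logDir>/<appId>* so that names with an _appattempt_* suffix
// (as in henryon's listing) are found as well as a plain <appId> file.
def findEventLog(fs: FileSystem, logDir: String, appId: String): Option[Path] = {
  val matches = fs.globStatus(new Path(logDir, appId + "*"))
  if (matches == null || matches.isEmpty) None
  else Some(matches.head.getPath)
}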

seufagner commented on May 20, 2024

Does Dr. Elephant work without Hadoop/HDFS, in other words, in standalone mode?
