For an NDS benchmark, we want a query failure to propagate to an overall job failure.

[FEA] If any query fails in NDS benchmark, then overall job should return as failed (spark-rapids-benchmarks, CLOSED)

mattahrens commented on August 17, 2024

[FEA] If any query fails in NDS benchmark, then overall job should return as failed

from spark-rapids-benchmarks.

Comments (4)

jlowe commented on August 17, 2024

"you just need to grep the json files"

IMO this should not be the default answer. Users don't expect jobs that produce the incorrect output to silently "succeed." Having to specify an extra option and then manually grep afterwards is not user friendly at all. We control the benchmark script, so we should be able to track when queries fail and have the overall Spark application return an error if any queries failed. Ideally it should have these running modes:

Default mode: queries are run as they are today. If any query fails (as opposed to a task attempt that failed but succeeded on retry), the overall Spark application returns an error. If the driver needs to write to a json summary file and grep through it itself, then that's what it should do. Queries that fail should never lead to a successful Spark application by default.
Hardcore benchmark run mode: max task attempts are set to one. If any task attempt fails, the benchmark performance run is essentially invalid and the Spark application fails immediately.
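
To make the two modes concrete, here is a minimal sketch, not the actual benchmark script. The names `run_queries`, `exit_code`, and `fail_fast` are hypothetical, and each query is modeled as a plain Python callable that raises when it fails. For the hardcore mode, the sketch assumes the task attempt limit is set to one elsewhere, so that a failed task attempt surfaces as a failed query.

```python
from typing import Callable, Dict, List


def run_queries(queries: Dict[str, Callable[[], None]],
                fail_fast: bool = False) -> List[str]:
    """Run each query in order and return the names of the ones that raised.

    fail_fast=False models the default mode: keep going, fail at the end.
    fail_fast=True models the hardcore mode: stop at the first failure.
    """
    failed: List[str] = []
    for name, run in queries.items():
        try:
            run()
        except Exception as exc:  # any exception counts as a query failure
            print(f"query {name} failed: {exc}")
            failed.append(name)
            if fail_fast:
                break
    return failed


def exit_code(failed: List[str]) -> int:
    """Non-zero when at least one query failed, so the job is reported as failed."""
    return 1 if failed else 0
```

A driver would end with something like `sys.exit(exit_code(run_queries(queries)))`, so that a failed query can never produce a successful application by default.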

GaryShen2008 commented on August 17, 2024

By using --json_summary_folder, the power run saves the status of each query into a json file.
As our CI job does, you just need to grep the json files and check the query.status value (Completed, CompletedWithTaskFailures, or Failed).
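
As an illustration of that check, the sketch below scans a summary folder in Python instead of with grep. It is only a sketch under assumptions: the helper names are hypothetical, and the key that holds the status (`queryStatus` here) is an assumed name passed as a parameter, so adjust it to whatever the summary files actually contain.

```python
import json
from pathlib import Path
from typing import Dict, List


def collect_statuses(folder: str,
                     status_key: str = "queryStatus") -> Dict[str, List[str]]:
    """Map each summary file name to the list of statuses it records.

    status_key is an assumption about the summary layout, not a confirmed name.
    """
    results: Dict[str, List[str]] = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            summary = json.load(f)
        value = summary.get(status_key, [])
        # Accept either a single status string or a list of statuses.
        results[path.name] = value if isinstance(value, list) else [value]
    return results


def failed_files(statuses: Dict[str, List[str]]) -> List[str]:
    """Names of the summary files that record a Failed status."""
    return [name for name, values in statuses.items() if "Failed" in values]
```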

Close this issue.

GaryShen2008 commented on August 17, 2024

I see.
Seems I misunderstood the requirement here. I thought it was just for our own benchmark runs, for which we already added the check step.

For our benchmark, we're not setting spark.task.maxFailures to 1. It should stay at the default of 4.
We still need to report the result when some queries are Failed or CompletedWithTaskFailures.

We can update the script so that, by default, it fails the Spark job when any query fails but still continues to run all the queries.
And for our benchmark runs, we'll add a special parameter to disable this behavior so that our CI job is not blocked.
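
One possible shape for that opt-out parameter, sketched with argparse. The flag name `--allow_failure` and the helper names are assumptions made for illustration, not a confirmed part of the script.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="hypothetical benchmark driver")
    # Hypothetical flag: opt out of the fail-by-default behavior, e.g. for a CI
    # job that inspects the per-query results itself.
    parser.add_argument("--allow_failure", action="store_true",
                        help="exit 0 even if some queries failed")
    return parser


def final_exit_code(failed_queries: List[str], allow_failure: bool) -> int:
    """Return 1 when queries failed, unless the caller explicitly opted out."""
    if failed_queries and not allow_failure:
        return 1
    return 0
```

Making failure the default and the opt-out explicit keeps the behavior safe for users who run the script without reading every option.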

jlowe commented on August 17, 2024

The benchmark scripts are used by more than just CI. We point users to the benchmark scripts in many of our public presentations, encouraging them to run the benchmarks themselves. Therefore the benchmark scripts need to be as user-friendly as possible.

The script definitely needs to fail if any queries fail, because the benchmark run is clearly invalid at that point. I'm OK if it doesn't fail when there are task failures. Ideally it should complain very loudly when that happens, because that's not a clean benchmark run and will result in lower performance numbers being reported than theoretically should be attainable. Task attempt failures could also indicate OOM situations or other errors that shouldn't be there but were masked by an executor relaunch, and thus we would want to investigate.
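
A rough sketch of what "fail on query failures, complain loudly on task failures" could look like at the end of a run. The function name `report` and the message wording are hypothetical; the status strings are the three values mentioned earlier in this thread.

```python
import sys
from typing import Dict


def report(statuses: Dict[str, str], stream=sys.stderr) -> int:
    """Print a loud warning for task failures and return the process exit code."""
    failed = [q for q, s in statuses.items() if s == "Failed"]
    flaky = [q for q, s in statuses.items() if s == "CompletedWithTaskFailures"]
    if flaky:
        banner = "!" * 60
        print(banner, file=stream)
        print(f"WARNING: task failures in {len(flaky)} queries: {', '.join(flaky)}",
              file=stream)
        print("This is not a clean run; reported times may be slower than attainable.",
              file=stream)
        print(banner, file=stream)
    if failed:
        print(f"ERROR: failed queries: {', '.join(failed)}", file=stream)
        return 1  # query failures invalidate the run
    return 0  # task failures alone only warn
```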
