dmwm / CMSSpark
General purpose framework to run CMS experiment workflows on HDFS/Spark platform
License: MIT License
In
https://github.com/dmwm/CMSSpark/blob/master/src/python/CMSSpark/schemas.py#L343
there should be StringType and not LongType.
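For illustration, the corrected field declaration would look roughly like this (the field name below is a placeholder, not the actual field at that line):
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("some_field", StringType(), True),  # placeholder field name; was LongType()
])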
In CMSSpark/src/python/CMSSpark/schemas.py, line 32 (commit 7f72d51) has processing_version, but line 23 (commit 7f72d51) has PROCESSING_ERA_NAME. It seems odd to me that the dataframe for the processing eras does not have the name. Is this just a typo?
Hi @vkuznet
I don't see the problem with a quick read through the code, but the output (on vocms092) of the dbs_events script lacks its header row (/data/cms/pop-data/dbs_events.csv.gz). I noticed because my script depends on this header row.
Hi,
I am using CMSSpark to collect job reports from the JobMonitoring and WMArchive (FWJR) data. I extract reports from jobs run at GridKa to use them to create simulation models for the site.
When matching the reports from these data sets, I noticed that some reports seem to be missing from the JobMonitoring data set I collected. I am aware that WMArchive data only include jobs submitted via WMAgent, but was previously under the impression that the JobMonitoring data should include job reports for all jobs (also including CRAB jobs).
The scripts I use for collection are based on the provided scripts for these data sets, but configured to collect unaggregated reports, limited to the GridKa site (T1_DE_KIT).
I am not sure which reports exactly I am missing from my JobMonitoring data set, but it does not seem tied to the workflow (most workflows where data seems to be missing exist in both data sets). These are the workflows where the difference between the number of jobs in both data sets is most striking, based on all job reports from June:
TaskMonitorId wmarchive_count jobmonitoring_count diff
0 pdmvserv_task_HIG-RunIISummer18wmLHEGS-00012__v1_T_180606_161154_7324 34283.0 73.0 34210.0
1 prozober_task_SMP-PhaseIISummer17GenOnly-00008__v1_T_180607_113950_5438 13820.0 15.0 13805.0
2 prozober_task_SMP-RunIISummer15wmLHEGS-00224__v1_T_180611_192653_6323 13133.0 4.0 13129.0
3 prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020 12549.0 798.0 11751.0
4 pdmvserv_task_SMP-PhaseIISummer17GenOnly-00021__v1_T_180608_101222_6299 11517.0 34.0 11483.0
5 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00225__v1_T_180416_194002_549 9113.0 1193.0 7920.0
6 prozober_task_SMP-RunIISummer15wmLHEGS-00222__v1_T_180611_192556_7810 4285.0 142.0 4143.0
7 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00227__v1_T_180416_194004_2476 7852.0 4665.0 3187.0
8 pdmvserv_task_SMP-PhaseIISummer17GenOnly-00022__v1_T_180607_163200_4994 2559.0 35.0 2524.0
9 asikdar_RVCMSSW_10_2_0_pre6HydjetQ_B12_5020GeV_2018_PU__180629_181916_5151 2434.0 415.0 2019.0
10 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00223__v1_T_180416_193956_2123 1697.0 2.0 1695.0
11 pdmvserv_task_HIG-RunIISummer18wmLHEGS-00004__v1_T_180604_003844_1884 2612.0 965.0 1647.0
12 pdmvserv_task_HIG-RunIIFall17wmLHEGS-01913__v1_T_180608_124404_6320 2089.0 447.0 1642.0
13 vlimant_ACDC0_task_JME-RunIISummer18wmLHEGS-00002__v1_T_180619_140519_3271 1815.0 510.0 1305.0
14 pdmvserv_task_HIG-RunIIFall17wmLHEGS-00917__v1_T_180606_161025_1459 1498.0 226.0 1272.0
15 pdmvserv_task_EGM-RunIISummer18GS-00016__v1_T_180605_170606_3059 1444.0 222.0 1222.0
16 pdmvserv_task_TRK-RunIISpring18wmLHEGS-00001__v1_T_180524_151259_1146 1483.0 292.0 1191.0
17 pdmvserv_task_EGM-RunIISummer18GS-00024__v1_T_180605_170611_5611 1438.0 388.0 1050.0
18 pdmvserv_task_HIN-HiFall15-00303__v1_T_180502_124221_8353 1802.0 834.0 968.0
19 pdmvserv_task_SMP-RunIIFall17wmLHEGS-00021__v1_T_180415_203914_1639 1194.0 259.0 935.0
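For context, here is a rough sketch of how such a per-task comparison can be produced (the CSV paths and details below are illustrative, not my exact script):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("jm-vs-wmarchive").getOrCreate()

# Assumed inputs: the collected per-job reports, one row per job, each with a TaskMonitorId column.
jm = spark.read.option("header", "true").csv("jobmonitoring_gridka_june.csv")
wm = spark.read.option("header", "true").csv("wmarchive_gridka_june.csv")

jm_counts = jm.groupBy("TaskMonitorId").count().withColumnRenamed("count", "jobmonitoring_count")
wm_counts = wm.groupBy("TaskMonitorId").count().withColumnRenamed("count", "wmarchive_count")

cmp = (wm_counts.join(jm_counts, on="TaskMonitorId", how="outer")
       .na.fill(0)
       .withColumn("diff", F.col("wmarchive_count") - F.col("jobmonitoring_count"))
       .orderBy(F.desc("diff")))
cmp.show(20, truncate=False)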
The data sets are collected from the following HDFS locations:
hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
hdfs:///cms/wmarchive/avro/fwjr
Is there anything I simply may have missed about the data sets? Any pointers are very much appreciated.
If any additional information is needed, I’ll be happy to provide it.
In
https://github.com/vkuznet/CMSSpark/blob/7e8a241a9801e77346f5d86443fde4eeaa7688e4/bin/run_spark#L34
the variable always gets an empty value, as the path used does not exist on LXPLUS7. Is this line of code still relevant?
To my understanding, run_spark cannot run correctly from lxplus7 unless the --cvmfs option is used.
This is because the correct scripts are sourced, and the right options are passed to spark-submit, only when --cvmfs is used.
I would recommend either making this clear in the documentation or executing the commands needed on lxplus7 regardless of whether --cvmfs is present.
Hi,
when the user is logged in to ithdp-client, the correct way to set up the environment is
source hadoop-setconf.sh analytix
I propose to modify run_spark so that, if it is run from ithdp-client, this source command is executed instead of the ones used for lxplus7.
Since our check_utils.sh functions are in place, we can start applying them in the cron jobs to get a solid, bullet-proof success status for the crons.
We can start with the cron jobs running on vocms092. Here are their static cron definitions, which do not exist in any repository:
0 */4 * * * /data/cms/CMSSpark/bin/cron4aggregation
0 */3 * * * /data/cms/CMSSpark/bin/cron4dbs_condor
0 20 * * * /data/cms/CMSSpark/bin/cron4dbs_condor_df /data/cms/pop-data
0 18 * * * /data/cms/CMSSpark/bin/cron4dbs_events /data/cms/pop-data
0 15 */5 * * /data/cms/CMSEOS/CMSSpark/bin/backfill_dbs_condor.sh 1>/data/cms/CMSEOS/CMSSpark/log/backfill.log 2>&1
1 1 * * * /data/cms/cmsmonitoringbackup/run.sh 2>&1 1>& /data/cms/cmsmonitoringbackup/log
07 08 * * * /data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily
Each cron job has its own definitions, output directory and output format. Most of the mentioned candidate cron jobs write their output to HDFS, so we can use the check_hdfs function to check their success.
Let's give an example:
CMSSpark/bin/cron4rucio_daily.sh writes its output to an HDFS directory such as /cms/rucio_daily/rucio/2022/08/01, so the output format is /cms/rucio_daily/rucio/YYYY/MM/DD, which is defined in its Python code. We only know "$HDFS_OUTPUT_DIR", given as /cms/rucio_daily, and we need to produce /cms/rucio_daily/rucio/YYYY/MM/DD from that variable.
Example check code for CMSSpark/bin/cron4rucio_daily.sh:
......
/bin/bash "$SCRIPT_DIR/run_rucio_daily.sh" --verbose --output_folder "$HDFS_OUTPUT_DIR" --fdate "$CURRENT_DATE"
# [It is good to add a comment line separating the check commands from the actual cron job, for example:]
# ----- CRON SUCCESS CHECK -----
. ./utils/check_utils.sh
# This cron job runs each day, so the modification-time threshold should be at most 12 hours, i.e. 43200 seconds
# Let's check the current output sizes: hadoop fs -du -h /cms/rucio_daily/rucio/2022/08
# On average a directory is about 80 MB, so we can use a 50 MB size threshold, i.e. 50000000 bytes
check_hdfs "$HDFS_OUTPUT_DIR/rucio/$(date -d "$CURRENT_DATE" +%Y/%m/%d)" 43200 50000000  # assuming the output date path follows $CURRENT_DATE
# !!ATTENTION!! no command should be run after this point
After the check function, we should not run any further command, so that the actual exit code of the check is not overwritten.
In our tests, we can set $HDFS_OUTPUT_DIR (/cms/rucio_daily) to a personal temporary HDFS directory such as /tmp/username/rucio_daily.
The popularity scripts that parse classAds were crashing yesterday due to corrupt timestamps. This is perhaps a "new" issue, as @vkuznet did not have the same problem when he last ran the popularity scripts.
Most are just extra 000 or 000000 at the end (but there is one as bad as '1517880783000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000').
An example of one not easily patched is
/HIMinBiasUPC/HIRun2011-v1/RAW,null,0,production,RECOHID11,-3859345924948754432,8.058,100.81333333333335,20180205,0.07992990345192434,RAW
These are isolated to a few dates:
dataset-20180204.csv
dataset-20180205.csv
(so perhaps I am making this issue against the wrong repo)
Anyway, I protected the popularity script that parses classAds against this problem, but I thought I would report it.
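For reference, the guard is roughly of this form (an illustrative sketch, not the exact patch applied):
def sanitize_timestamp(value):
    """Return an epoch timestamp in seconds, or None if the value cannot be recovered."""
    digits = str(value).strip()
    if not digits.isdigit():
        return None  # e.g. the negative value in the example row above
    # Epoch seconds in this era have 10 digits; the corrupt values just have
    # extra zeros appended, so trim anything past the first 10 digits.
    if len(digits) > 10 and set(digits[10:]) == {"0"}:
        digits = digits[:10]
    if len(digits) != 10:
        return None
    return int(digits)

# sanitize_timestamp("1517880783000000") -> 1517880783
# sanitize_timestamp("-3859345924948754432") -> None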
@dciangot and I have noticed the error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1548256949044_18633 to YARN : Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:analytix, Ident: (token for ncsmith: HDFS_DELEGATION_TOKEN [email protected], renewer=nobody, realUser=, issueDate=1550609491439, maxDate=1551214291439, sequenceNumber=4836514, masterKeyId=2235)
We each found a different workaround: one seems to be to not do source hadoop-setconf.sh
as requested in https://hadoop-user-guide.web.cern.ch/hadoop-user-guide/getstart/client_edge_machine.html (and also to remove it from https://github.com/dmwm/CMSSpark/blob/master/setup_lxplus.sh#L2-L3), since it is already done in https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L101-L104 and, I guess, cannot be done twice.
The other workaround seems to be to leave that in, but remove https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L21-L23
I guess the question is, how can we patch this up to make it 'just work' ?
Thanks to some debugging, I noticed that the data in /project/monitoring/archive/xrootd does not contain LFNs, but rather some mix of PFNs and LFNs (presumably whatever name the user used to open the file). My guess is that this is more widespread than just this one monitoring source. Consider updating the schema column names to reflect this, to improve readability.
I found that by just following the instructions at https://hadoop-user-guide.web.cern.ch/hadoop-user-guide/gettingstarted_md.html I can submit this minimal job:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("yarn").setAppName("CMS Working Set")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
readavro = spark.read.format("com.databricks.spark.avro")
fwjr = readavro.load("/cms/wmarchive/avro/fwjr/201[789]/*/*/*.avro")
with
spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 test.py
Perhaps this is a gentler introduction than the RDD complexity? Also, there seem to be lxplus options.
In
https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L52
the variable should be mapreduce, not mapreduct.
Hi,
would it be a good idea to modernize the code by reading Avro files as dataframes instead of RDDs?
Look for example at https://github.com/databricks/spark-avro: it is enough to use
--packages com.databricks:spark-avro_2.11:4.0.0
to be able to use
df = spark.read.format("com.databricks.spark.avro").load(file)
So one could completely bypass the creation of an RDD which is then converted to a dataframe (this is my understanding at least).
Running cron4rucio_daily.sh in k8s produces the following:
Input Arguments: fdate:2022-09-21 00:00:00
which gives an error, because the date must be simply 2022-09-21.
The source of this problem lies in the code below, and it might be a bug in the click library.
_VALID_DATE_FORMATS = ["%Y-%m-%d"]
@click.option("--fdate", required=False, default=date.today().strftime("%Y-%m-%d"),
type=click.DateTime(_VALID_DATE_FORMATS) ... )
The output becomes %Y-%m-%d %H:%M:%S.
It could be fixed by removing the type parameter, as we don't use a datetime object in that function.
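A minimal sketch of the proposed change (the command definition below is illustrative, not the actual CMSSpark code):
import click
from datetime import date

@click.command()
@click.option("--fdate", required=False,
              default=date.today().strftime("%Y-%m-%d"),
              help="date in YYYY-MM-DD format")  # no click.DateTime conversion
def run(fdate):
    # fdate stays a plain "YYYY-MM-DD" string
    print("Input Arguments: fdate:{}".format(fdate))

if __name__ == "__main__":
    run()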
The CERN Hadoop team has started to change their configuration much more frequently than in previous years. Our Spark and Sqoop jobs fail because of an outdated config, which includes connection parameters and Hadoop/Spark node URLs.
We need to evaluate running yum update cern-hadoop-config before each Spark job that runs on a Kubernetes pod.
bin/cron4dbs_condor does not process the last day of each month. This can be seen in /cms/dbs_condor/release. Since it processes the previous day's data but in each month starts processing from the current month, the last day of each month is skipped.
Probably an easy fix; opening this issue so we don't forget it next week.
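A minimal sketch of the intended date handling (not the actual cron code): deriving the date to process as "today minus one day" rolls over month boundaries correctly, so the last day of the previous month is not skipped.
from datetime import date, timedelta

def date_to_process(today):
    # Process the previous day's data; timedelta handles month/year boundaries.
    return (today - timedelta(days=1)).strftime("%Y%m%d")

# e.g. date_to_process(date(2022, 9, 1)) -> "20220831"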
Having a main bash script that runs all cron jobs, performs initial operations before a cron job runs (such as sending a start metric for the cron job to the push gateway) and performs post operations after the cron job ends (such as sending a success metric to the push gateway), can be a good strategy for our cron job management. The idea comes from the sqoop/run.sh script, which handles Kerberos authentication before the cron run and sends alerts after the cron finishes.
I would like to suggest the name cron_runner.sh for the main script. Its pseudocode can be:
cron_runner.sh:
#!/bin/bash
# given bash cron command to run
IN_CMD=$@
# Send start metric to push-gateway. Metrics should include:
# cron script name which can be extracted from `$IN_CMD` and put as a metric tag
# metric value should be current unix time in seconds
# appropriate tags should be provided to pg
# Run the given cron command
$IN_CMD
# Send a success metric to the push-gateway; it should include the same tags, with the value again being the current unix time in seconds
It will be used as:
./cron_runner.sh /data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily
# $IN_CMD will be "/data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily" and we need to extract "cron4rucio_daily.sh" as the script name to put in the PG metric.
The push-gateway allows us to expose batch job metrics to Prometheus. Since our cron jobs fit that definition, we can send cron job start and success metrics to Prometheus using PG. It runs as a service in our K8s cluster.
So we need to add push-gateway functions to cron_runner.sh and use them accordingly. Here are example functions for the start and end metrics:
# Send cron_start metric
# arg1=cron job script name
function send_start() {
cat <<EOF | curl --data-binary @- "$PG_URL"/metrics/job/cmsmon-cron/instance/"$(hostname)"
# TYPE cron_start gauge
# HELP cron_start Start of cron job in seconds.
cron_start{cron_name="${1}"} $(date +%s)
EOF
}
# Send cron_end metric
# arg1=cron job script name
function send_end() {
cat <<EOF | curl --data-binary @- "$PG_URL"/metrics/job/cmsmon-cron/instance/"$(hostname)"
# TYPE cron_end gauge
# HELP cron_end End of cron job in seconds.
cron_end{cron_name="${1}"} $(date +%s)
EOF
}
How to use before cron start:
# Let's say we extracted the script name from "$IN_CMD" as "cron_name",
# so we can send the start metric as
send_start "$cron_name"
# We need to define $PG_URL; it will be like "http://<host>.cern.ch:30091", provided as a global variable in the main script.
In
https://github.com/vkuznet/CMSSpark/blob/master/src/python/CMSSpark/dbs_jm.py#L30
there should be 'day', not 'date'.
Cheers,
Andrea
CMSSpark includes many important cron jobs that run on various infrastructures. Sometimes, because of the nature of Spark itself or the lack of an error-check mechanism due to our dependency on the CERN IT infrastructure, we fail to catch failing cron jobs. Hence, we need to implement test methods for the cron jobs, which will be an important starting point for the cron job monitoring/alerting that will come soon. From now on, each cron job will be encapsulated with alert and test methods.
Initially, we can start by writing common functions to check various test cases:
We can create a new directory, CMSSpark/bin/utils, and a new file with the common functions, named check_utils.sh.
The first function can be check_file_status, which will accept these parameters: a file path, a modification-time threshold in seconds (which defines how old a file may be), and a size threshold (which defines the minimum size). If the file does not exist, is older than the modification-time threshold relative to the current time, or is smaller than the size threshold, the function should exit with exit code 1.
The second function can be a general function to query ElasticSearch and get the count of documents in an ES index through the Grafana proxy using the monit CLI tool. The monit CLI tool requires an ElasticSearch count query, a time range, a Grafana data source name (-dbname), and a Grafana token, so this function should accept these parameters. Additionally, it should have a minimum-number-of-documents parameter to compare against. Example queries can be checked in es-queries.
I need to provide the Grafana token, check if /cvmfs/cms.cern.ch/cmsmon/monit needs any update, and provide an example ES query to count documents in a specific time range.
P.S.: this issue will be assigned to Kyrylo when we get his GitHub username
I noticed that in
https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L54
the Spark version is identified as 2 only if spark-submit is in the path, which is certainly not the case on lxplus7. So, running from lxplus7 one gets "Using spark 1.X" in the output, which is typically wrong.
From lxplus7, $jars ends up being
--jars /afs/cern.ch/user/v/valya/public/spark/spark-csv-assembly-1.4.0.jar,/afs/cern.ch/user/v/valya/public/spark/avro-mapred-1.7.6-cdh5.7.6.jar,/afs/cern.ch/user/v/valya/public/spark/spark-examples-1.6.0-cdh5.15.1-hadoop2.6.0-cdh5.15.1.jar
which looks suspicious because the spark-examples version is for 1.6.0. The version for Spark 2 is never found because it's always looked for under /usr/hdp/.
I don't know if this is a real problem, at least it doesn't seem to affect my CMSSpark code.
By the way, in the official instructions for running Spark from lxplus7, /cvmfs/sft.cern.ch/lcg/views/LCG_93/x86_64-centos7-gcc62-opt had been changed to /cvmfs/sft.cern.ch/lcg/views/LCG_94/x86_64-centos7-gcc7-opt.
Cheers,
Andrea