dmwm / CMSSpark
General purpose framework to run CMS experiment workflows on HDFS/Spark platform
License: MIT License
In
https://github.com/dmwm/CMSSpark/blob/master/src/python/CMSSpark/schemas.py#L343
there should be StringType and not LongType.
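For illustration, the corrected field declaration would look roughly like this (the field name below is a placeholder, not the actual field at that line):
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("some_field", StringType(), True),  # placeholder field name; was LongType()
])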
In CMSSpark/src/python/CMSSpark/schemas.py, line 32 (commit 7f72d51) has processing_version, but line 23 (commit 7f72d51) has PROCESSING_ERA_NAME. It seems odd to me that the dataframe for the processing eras does not have the name. Is this just a typo?
Hi @vkuznet
I don't see the problem with a quick read through the code, but the output (on vocms092) of the dbs_events script lacks its header row (/data/cms/pop-data/dbs_events.csv.gz). I noticed because my script depends on this header row.
Hi,
I am using CMSSpark to collect job reports from the JobMonitoring and WMArchive (FWJR) data. I extract reports from jobs run at GridKa to use them to create simulation models for the site.
When matching the reports from these data sets, I noticed that some reports seem to be missing from the JobMonitoring data set I collected. I am aware that WMArchive data only include jobs submitted via WMAgent, but was previously under the impression that the JobMonitoring data should include job reports for all jobs (also including CRAB jobs).
The scripts I use for collection are based on the provided scripts for these data sets, but configured to collect unaggregated reports, limited to the GridKa site (T1_DE_KIT).
I am not sure which reports exactly I am missing from my JobMonitoring data set, but it does not seem tied to the workflow (most workflows where data seems to be missing exist in both data sets). These are the workflows where the difference between the number of jobs in both data sets is most striking, based on all job reports from June:
TaskMonitorId wmarchive_count jobmonitoring_count diff
0 pdmvserv_task_HIG-RunIISummer18wmLHEGS-00012__v1_T_180606_161154_7324 34283.0 73.0 34210.0
1 prozober_task_SMP-PhaseIISummer17GenOnly-00008__v1_T_180607_113950_5438 13820.0 15.0 13805.0
2 prozober_task_SMP-RunIISummer15wmLHEGS-00224__v1_T_180611_192653_6323 13133.0 4.0 13129.0
3 prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020 12549.0 798.0 11751.0
4 pdmvserv_task_SMP-PhaseIISummer17GenOnly-00021__v1_T_180608_101222_6299 11517.0 34.0 11483.0
5 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00225__v1_T_180416_194002_549 9113.0 1193.0 7920.0
6 prozober_task_SMP-RunIISummer15wmLHEGS-00222__v1_T_180611_192556_7810 4285.0 142.0 4143.0
7 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00227__v1_T_180416_194004_2476 7852.0 4665.0 3187.0
8 pdmvserv_task_SMP-PhaseIISummer17GenOnly-00022__v1_T_180607_163200_4994 2559.0 35.0 2524.0
9 asikdar_RVCMSSW_10_2_0_pre6HydjetQ_B12_5020GeV_2018_PU__180629_181916_5151 2434.0 415.0 2019.0
10 pdmvserv_task_SMP-RunIISummer15wmLHEGS-00223__v1_T_180416_193956_2123 1697.0 2.0 1695.0
11 pdmvserv_task_HIG-RunIISummer18wmLHEGS-00004__v1_T_180604_003844_1884 2612.0 965.0 1647.0
12 pdmvserv_task_HIG-RunIIFall17wmLHEGS-01913__v1_T_180608_124404_6320 2089.0 447.0 1642.0
13 vlimant_ACDC0_task_JME-RunIISummer18wmLHEGS-00002__v1_T_180619_140519_3271 1815.0 510.0 1305.0
14 pdmvserv_task_HIG-RunIIFall17wmLHEGS-00917__v1_T_180606_161025_1459 1498.0 226.0 1272.0
15 pdmvserv_task_EGM-RunIISummer18GS-00016__v1_T_180605_170606_3059 1444.0 222.0 1222.0
16 pdmvserv_task_TRK-RunIISpring18wmLHEGS-00001__v1_T_180524_151259_1146 1483.0 292.0 1191.0
17 pdmvserv_task_EGM-RunIISummer18GS-00024__v1_T_180605_170611_5611 1438.0 388.0 1050.0
18 pdmvserv_task_HIN-HiFall15-00303__v1_T_180502_124221_8353 1802.0 834.0 968.0
19 pdmvserv_task_SMP-RunIIFall17wmLHEGS-00021__v1_T_180415_203914_1639 1194.0 259.0 935.0
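For context, here is a rough sketch of how such a per-task comparison can be produced (the CSV paths and details below are illustrative, not my exact script):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("jm-vs-wmarchive").getOrCreate()

# Assumed inputs: the collected per-job reports, one row per job, each with a TaskMonitorId column.
jm = spark.read.option("header", "true").csv("jobmonitoring_gridka_june.csv")
wm = spark.read.option("header", "true").csv("wmarchive_gridka_june.csv")

jm_counts = jm.groupBy("TaskMonitorId").count().withColumnRenamed("count", "jobmonitoring_count")
wm_counts = wm.groupBy("TaskMonitorId").count().withColumnRenamed("count", "wmarchive_count")

cmp = (wm_counts.join(jm_counts, on="TaskMonitorId", how="outer")
       .na.fill(0)
       .withColumn("diff", F.col("wmarchive_count") - F.col("jobmonitoring_count"))
       .orderBy(F.desc("diff")))
cmp.show(20, truncate=False)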
The data sets are collected from the following HDFS locations:
hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
hdfs:///cms/wmarchive/avro/fwjr
Is there anything I simply may have missed about the data sets? Any pointers are very much appreciated.
If any additional information is needed, I’ll be happy to provide it.
In
https://github.com/vkuznet/CMSSpark/blob/7e8a241a9801e77346f5d86443fde4eeaa7688e4/bin/run_spark#L34
the variable always gets an empty value, as the path used does not exist on LXPLUS7. Is this line of code still relevant?
To my understanding, run_spark cannot run correctly from lxplus7 unless the --cvmfs option is used.
This is because the correct scripts are sourced, and the right options are passed to spark-submit, only when --cvmfs is used.
I would recommend either making this clear in the documentation or executing the commands needed on lxplus7 regardless of whether --cvmfs is present.
Hi,
when the user is logged in to ithdp-client, the correct way to set up the environment is
source hadoop-setconf.sh analytix
I propose to modify run_spark so that, if it is run from ithdp-client, this source command is executed instead of the ones used for lxplus7.
Since our check_utils.sh functions are in place, we can start applying them in the cron jobs to get a solid, bullet-proof success status for the crons.
We can start with the cron jobs running on vocms092. Here are their static cron definitions, which do not exist in any repository:
0 */4 * * * /data/cms/CMSSpark/bin/cron4aggregation
0 */3 * * * /data/cms/CMSSpark/bin/cron4dbs_condor
0 20 * * * /data/cms/CMSSpark/bin/cron4dbs_condor_df /data/cms/pop-data
0 18 * * * /data/cms/CMSSpark/bin/cron4dbs_events /data/cms/pop-data
0 15 */5 * * /data/cms/CMSEOS/CMSSpark/bin/backfill_dbs_condor.sh 1>/data/cms/CMSEOS/CMSSpark/log/backfill.log 2>&1
1 1 * * * /data/cms/cmsmonitoringbackup/run.sh 2>&1 1>& /data/cms/cmsmonitoringbackup/log
07 08 * * * /data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily
Each cron job has its own definitions, output directory and output format. Most of the mentioned candidate cron jobs write their output to HDFS, so we can use the check_hdfs function to check their success.
Let's give an example:
CMSSpark/bin/cron4rucio_daily.sh writes its output to an HDFS directory such as /cms/rucio_daily/rucio/2022/08/01, so the output format is /cms/rucio_daily/rucio/YYYY/MM/DD, which is defined in its Python code. We only know "$HDFS_OUTPUT_DIR", given as /cms/rucio_daily, and we need to produce /cms/rucio_daily/rucio/YYYY/MM/DD from that variable.
Example check code for CMSSpark/bin/cron4rucio_daily.sh:
......
/bin/bash "$SCRIPT_DIR/run_rucio_daily.sh" --verbose --output_folder "$HDFS_OUTPUT_DIR" --fdate "$CURRENT_DATE"
# [It is good to add a comment line separating the check commands from the actual cron job, for example:]
# ----- CRON SUCCESS CHECK -----
. ./utils/check_utils.sh
# This cron job runs each day, so the modification-time threshold should be at most 12 hours, i.e. 43200 seconds
# Let's check the current output sizes: hadoop fs -du -h /cms/rucio_daily/rucio/2022/08
# On average a directory is about 80 MB, so we can use a 50 MB size threshold, i.e. 50000000 bytes
check_hdfs "$HDFS_OUTPUT_DIR/rucio/$(date -d "$CURRENT_DATE" +%Y/%m/%d)" 43200 50000000  # assuming the output date path follows $CURRENT_DATE
# !!ATTENTION!! no command should be run after this point
After the check function, we should not run any further command, so that the actual exit code of the check is not overwritten.
In our tests, we can set $HDFS_OUTPUT_DIR (/cms/rucio_daily) to a personal temporary HDFS directory such as /tmp/username/rucio_daily.
The popularity scripts that parse classAds were crashing yesterday due to corrupt timestamps. This is perhaps a "new" issue, as @vkuznet did not have the same problem when he last ran the popularity scripts.
Most are just extra 000 or 000000 at the end (but there is one as bad as '1517880783000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000').
An example of one not easily patched is
/HIMinBiasUPC/HIRun2011-v1/RAW,null,0,production,RECOHID11,-3859345924948754432,8.058,100.81333333333335,20180205,0.07992990345192434,RAW
These are isolated to a few dates:
dataset-20180204.csv
dataset-20180205.csv
(so perhaps I am making this issue against the wrong repo)
Anyway, I protected the popularity script that parses classAds against this problem, but I thought I would report it.
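For reference, the guard is roughly of this form (an illustrative sketch, not the exact patch applied):
def sanitize_timestamp(value):
    """Return an epoch timestamp in seconds, or None if the value cannot be recovered."""
    digits = str(value).strip()
    if not digits.isdigit():
        return None  # e.g. the negative value in the example row above
    # Epoch seconds in this era have 10 digits; the corrupt values just have
    # extra zeros appended, so trim anything past the first 10 digits.
    if len(digits) > 10 and set(digits[10:]) == {"0"}:
        digits = digits[:10]
    if len(digits) != 10:
        return None
    return int(digits)

# sanitize_timestamp("1517880783000000") -> 1517880783
# sanitize_timestamp("-3859345924948754432") -> None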
@dciangot and I have noticed the error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1548256949044_18633 to YARN : Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:analytix, Ident: (token for ncsmith: HDFS_DELEGATION_TOKEN [email protected], renewer=nobody, realUser=, issueDate=1550609491439, maxDate=1551214291439, sequenceNumber=4836514, masterKeyId=2235)
We each found a different workaround: one seems to be to not do source hadoop-setconf.sh
as requested in https://hadoop-user-guide.web.cern.ch/hadoop-user-guide/getstart/client_edge_machine.html (and also to remove it from https://github.com/dmwm/CMSSpark/blob/master/setup_lxplus.sh#L2-L3), since it is already done in https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L101-L104 and, I guess, cannot be done twice.
The other workaround seems to be to leave that in, but remove https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L21-L23
I guess the question is, how can we patch this up to make it 'just work' ?
Thanks to some debugging, I noticed that the data in /project/monitoring/archive/xrootd does not contain LFNs, but rather some mix of PFNs and LFNs (presumably whatever name the user used to open the file). My guess is that this is more widespread than just this one monitoring source. Consider updating the schema column names to reflect this, to improve readability.
I found that by just following the instructions at https://hadoop-user-guide.web.cern.ch/hadoop-user-guide/gettingstarted_md.html I can submit this minimal job:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("yarn").setAppName("CMS Working Set")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
readavro = spark.read.format("com.databricks.spark.avro")
fwjr = readavro.load("/cms/wmarchive/avro/fwjr/201[789]/*/*/*.avro")
with
spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 test.py
Perhaps this is a gentler introduction than the RDD complexity? Also, there seem to be lxplus options.
In
https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L52
the variable should be mapreduce, not mapreduct.
Hi,
would it be a good idea to modernize the code by reading Avro files as dataframes instead of RDDs?
Look for example at https://github.com/databricks/spark-avro: it is enough to use
--packages com.databricks:spark-avro_2.11:4.0.0
to be able to use
df = spark.read.format("com.databricks.spark.avro").load(file)
So one could completely bypass the creation of an RDD which is then converted to a dataframe (this is my understanding at least).
Running cron4rucio_daily.sh in k8s produces the following:
Input Arguments: fdate:2022-09-21 00:00:00
which gives an error, because the date must be simply 2022-09-21.
The source of this problem lies in the code below, and it might be a bug in the click library.
_VALID_DATE_FORMATS = ["%Y-%m-%d"]
@click.option("--fdate", required=False, default=date.today().strftime("%Y-%m-%d"),
type=click.DateTime(_VALID_DATE_FORMATS) ... )
The output becomes %Y-%m-%d %H:%M:%S.
It could be fixed by removing the type parameter, as we don't use a datetime object in that function.
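A minimal sketch of the proposed change (the command definition below is illustrative, not the actual CMSSpark code):
import click
from datetime import date

@click.command()
@click.option("--fdate", required=False,
              default=date.today().strftime("%Y-%m-%d"),
              help="date in YYYY-MM-DD format")  # no click.DateTime conversion
def run(fdate):
    # fdate stays a plain "YYYY-MM-DD" string
    print("Input Arguments: fdate:{}".format(fdate))

if __name__ == "__main__":
    run()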
The CERN Hadoop team has started to change their configuration much more frequently than in previous years. Our Spark and Sqoop jobs fail because of an outdated config, which includes connection parameters and Hadoop/Spark node URLs.
We need to evaluate running yum update cern-hadoop-config before each Spark job that runs on a Kubernetes pod.
bin/cron4dbs_condor does not process the last day of each month. This can be seen in /cms/dbs_condor/release. Since it processes the previous day's data but in each month starts processing from the current month, the last day of each month is skipped.
Probably an easy fix; opening this issue so we don't forget it next week.
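A minimal sketch of the intended date handling (not the actual cron code): deriving the date to process as "today minus one day" rolls over month boundaries correctly, so the last day of the previous month is not skipped.
from datetime import date, timedelta

def date_to_process(today):
    # Process the previous day's data; timedelta handles month/year boundaries.
    return (today - timedelta(days=1)).strftime("%Y%m%d")

# e.g. date_to_process(date(2022, 9, 1)) -> "20220831"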
Having a main bash script that runs all cron jobs, performs initial operations before a cron job runs (such as sending a start metric for the cron job to the push gateway) and performs post operations after the cron job ends (such as sending a success metric to the push gateway), can be a good strategy for our cron job management. The idea comes from the sqoop/run.sh script, which handles Kerberos authentication before the cron run and sends alerts after the cron finishes.
I would like to suggest the name cron_runner.sh for the main script. Its pseudocode can be:
cron_runner.sh:
#!/bin/bash
# given bash cron command to run
IN_CMD=$@
# Send start metric to push-gateway. Metrics should include:
# cron script name which can be extracted from `$IN_CMD` and put as a metric tag
# metric value should be current unix time in seconds
# appropriate tags should be provided to pg
# Run the given cron command
$IN_CMD
# Send a success metric to the push-gateway; it should include the same tags, with the value again being the current unix time in seconds
It will be used as:
./cron_runner.sh /data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily
# $IN_CMD will be "/data/cms/CMSSpark/bin/cron4rucio_daily.sh /cms/rucio_daily" and we need to extract "cron4rucio_daily.sh" as the script name to put in the PG metric.
The push-gateway allows us to expose batch job metrics to Prometheus. Since our cron jobs fit that definition, we can send cron job start and success metrics to Prometheus using PG. It runs as a service in our K8s cluster.
So we need to add push-gateway functions to cron_runner.sh and use them accordingly. Here are example functions for the start and end metrics:
# Send cron_start metric
# arg1=cron job script name
function send_start() {
cat <<EOF | curl --data-binary @- "$PG_URL"/metrics/job/cmsmon-cron/instance/"$(hostname)"
# TYPE cron_start gauge
# HELP cron_start Start of cron job in seconds.
cron_start{cron_name="${1}"} $(date +%s)
EOF
}
# Send cron_end metric
# arg1=cron job script name
function send_end() {
cat <<EOF | curl --data-binary @- "$PG_URL"/metrics/job/cmsmon-cron/instance/"$(hostname)"
# TYPE cron_end gauge
# HELP cron_end End of cron job in seconds.
cron_end{cron_name="${1}"} $(date +%s)
EOF
}
How to use before cron start:
# Let's say we extracted the script name from "$IN_CMD" as "cron_name",
# so we can send the start metric as
send_start "$cron_name"
# We need to define $PG_URL; it will be like "http://<host>.cern.ch:30091", provided as a global variable in the main script.
In
https://github.com/vkuznet/CMSSpark/blob/master/src/python/CMSSpark/dbs_jm.py#L30
there should be 'day', not 'date'.
Cheers,
Andrea
CMSSpark includes many important cron jobs that run on various infrastructures. Sometimes, because of the nature of Spark itself or the lack of an error-check mechanism due to our dependency on the CERN IT infrastructure, we fail to catch failing cron jobs. Hence, we need to implement test methods for the cron jobs, which will be an important starting point for the cron job monitoring/alerting that will come soon. From now on, each cron job will be encapsulated with alert and test methods.
Initially, we can start by writing common functions to check various test cases:
We can create a new directory, CMSSpark/bin/utils, and a new file with the common functions, named check_utils.sh.
The first function can be check_file_status, which will accept these parameters: a file path, a modification-time threshold in seconds (which defines how old a file may be), and a size threshold (which defines the minimum size). If the file does not exist, is older than the modification-time threshold relative to the current time, or is smaller than the size threshold, the function should exit with exit code 1.
The second function can be a general function to query ElasticSearch and get the count of documents in an ES index through the Grafana proxy using the monit CLI tool. The monit CLI tool requires an ElasticSearch count query, a time range, a Grafana data source name (-dbname), and a Grafana token, so this function should accept these parameters. Additionally, it should have a minimum-number-of-documents parameter to compare against. Example queries can be checked in es-queries.
I need to provide the Grafana token, check if /cvmfs/cms.cern.ch/cmsmon/monit needs any update, and provide an example ES query to count documents in a specific time range.
P.S.: this issue will be assigned to Kyrylo when we get his GitHub username
I noticed that in
https://github.com/dmwm/CMSSpark/blob/master/bin/run_spark#L54
the Spark version is identified as 2 only if spark-submit is in the path, which is certainly not the case on lxplus7. So, running from lxplus7 one gets "Using spark 1.X" in the output, which is typically wrong.
From lxplus7, $jars ends up being
--jars /afs/cern.ch/user/v/valya/public/spark/spark-csv-assembly-1.4.0.jar,/afs/cern.ch/user/v/valya/public/spark/avro-mapred-1.7.6-cdh5.7.6.jar,/afs/cern.ch/user/v/valya/public/spark/spark-examples-1.6.0-cdh5.15.1-hadoop2.6.0-cdh5.15.1.jar
which looks suspicious because the spark-examples version is for 1.6.0. The version for Spark 2 is never found because it's always looked for under /usr/hdp/.
I don't know if this is a real problem, at least it doesn't seem to affect my CMSSpark code.
By the way, in the official instructions for running Spark from lxplus7, /cvmfs/sft.cern.ch/lcg/views/LCG_93/x86_64-centos7-gcc62-opt had been changed to /cvmfs/sft.cern.ch/lcg/views/LCG_94/x86_64-centos7-gcc7-opt.
Cheers,
Andrea