
pmu-tools

pmu-tools is a collection of tools and libraries for profile collection and performance analysis on Intel CPUs, on top of Linux perf. They use the performance counters in the CPU.

Quick (non-) installation

pmu-tools doesn't really need to be installed. It's enough to clone the repository and run the respective tool (like toplev or ocperf) out of the source directory.

To run the tools from other directories you can use export PATH=$PATH:/path/to/pmu-tools, or symlink the tool you're interested in into /usr/local/bin or ~/bin. The tools automatically find their Python dependencies.

When first run, toplev / ocperf will automatically download the Intel event lists from https://github.com/intel/perfmon. This requires working internet access. Later runs can be done offline. It's also possible to download the event lists ahead of time; see pmu-tools offline.
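
For example, to pre-download the event lists (a sketch: event_download.py with no arguments fetches the list for the current CPU; the -a flag for fetching all CPUs is an assumption, check event_download.py --help on your version):

  # download the event list for the current CPU into the local cache
  event_download.py
  # download the event lists for all known CPUs, e.g. to copy to an offline machine
  event_download.py -a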

toplev works with both Python 2.7 and Python 3. However, it requires a not too old perf tool and, depending on the CPU, an up-to-date kernel. For more details see toplev kernel support.

Most of the tools don't require any external Python dependencies and run in "included batteries only" mode. The main exception is generating plots or XLSX spreadsheets, which requires external libraries.

If you want to use those, run

  pip install -r requirements.txt

once, or follow the command suggested in error messages.

jevents is a C library. It has no dependencies other than gcc/make and can be built with

cd jevents
make

Quick examples

toplev -l2 program

Measure the whole system at level 2 while program is running.

toplev -l1 --single-thread program

Measure a single-threaded program. On hyper-threaded systems with Skylake or older the rest of the system should be idle.

toplev -NB program

Measure program, showing a consolidated bottleneck view and extra information associated with bottlenecks. Note this will multiplex performance counters, so there may be measurement errors.

toplev -NB --run-sample program

Measure program, showing bottlenecks and extra nodes, and automatically sample for the location of the bottlenecks in a second pass.

toplev --drilldown --only-bottleneck program

Rerun the workload with minimal multiplexing until the critical bottleneck is found, and print only the critical bottleneck.

toplev -l3 --no-desc -I 100 -x, sleep X

Measure the whole system for X seconds every 100 ms, outputting in CSV format.

toplev --all --core C0 taskset -c 0,1 program

Measure program running on core 0 with all nodes and metrics enabled.

toplev --all --xlsx x.xlsx -a sleep 10

Generate a spreadsheet with a full-system measurement for 10 seconds.

For more details on toplev please see the toplev tutorial.

What tool to use for what?

You want to:

  • understand CPU bottlenecks at a high level: use toplev.
  • display toplev output graphically: use toplev --xlsx (or --graph).
  • know what CPU events to run, but want to use symbolic names for a new CPU: use ocperf.
  • measure interconnect/caches/memory/power management on Xeon E5+: use ucevent (or toplev).
  • use perf events from a C program: use jevents.
  • query CPU topology or disable HyperThreading: use cputop.
  • change Model Specific Registers: use msr.
  • change PCI config space: use pci.

For more details on the tools see TOOLS.

All features:

Major tools/libraries

  • The "ocperf" wrapper to "perf" that provides a full core performance counter event list for common Intel CPUs. This makes it possible to use all the Intel events, not just the builtin events of perf. It can also be used as a library from other Python programs (see the example after this list).
  • The "toplev.py" tool to identify the micro-architectural bottleneck for a workload. This implements the TopDown or TopDown2 methodology.
  • The "ucevent" tool to manage and compute uncore performance events. Uncore is the part of the CPU that is not core. Supports many metrics for power management, IO, QPI (interconnect), caches, and others. ucevent automatically generates event descriptions for the perf uncore driver and pretty prints the output. It also supports computing higher level metrics derived from multiple events.
  • A library to resolve named Intel events (like INST_RETIRED.ANY) to perf_event_attr (jevents) and provide higher level functions for using the Linux perf API for self profiling or profiling other programs. It also has a "perf stat" clone called "jestat".
  • A variety of tools for plotting and post-processing perf stat -I1000 -x, or toplev.py -I1000 -x, interval measurements.
  • Some utility libraries and functions for MSR access, CPU topology and other functionality, as well as example programs showing how to program the Intel PMU.
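
For example, ocperf can be used with symbolic Intel event names in place of raw event codes (a minimal sketch; the event names below are only illustrations and availability depends on the CPU):

  # count named Intel events with the familiar perf stat interface
  ocperf.py stat -e inst_retired.any,cpu_clk_unhalted.thread ./myworkload
  # list all symbolic events known for this CPU
  ocperf.py list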

There are some obsolete tools which are not supported anymore, like simple-pebs. These are kept as PMU programming references, but may need some updates to build on newer Linux kernels.

Recent new features:

TMA 4.8 release

  • toplev updated to TMA 4.8:
    • Bottlenecks View:
      • Renamed Base_Non_Br to Useful_Work and simplified descriptions for all BV metrics.
      • Cache_Memory_Latency now accounts for L1 cache latency as well.
      • Improved Branching_Overhead accuracy for function calling and alignments
      • Cross-reference Bottlenecks w/ TMA tree for tool visualization (VTune request)
    • New Tree Nodes
      • L1_Hit_Latency: estimates fraction of cycles with demand load accesses that hit the L1 cache (relies on Dependent_Loads_Weight SystemParameter today)
    • New Informative Metrics
      • Fetch_LSD (client), Fetch_DSB, Fetch_MITE under Info.Pipeline group [SKL onwards]
      • DSB_Bandwidth under Info.Botlnk.L2
      • L2MPKI_RFO under Info.Memory
    • Key Enhancements & fixes
      • Fixed Ports_Utilization/Ports_Utilized_0
      • Slightly tuned memory (fixed cost) latencies [SPR, EMR]
    • Corrected CPU_Utilization, CPUs_Utilized for Linux perf based tools
  • toplev now supports Meteor Lake systems.
  • Add a new genretlat.py tool to tune the toplev model for a workload. The basic tuning needs to be generated before the first toplev use, using genretlat -o mtl-retlat.json ./workloads/BC1s (or a suitable workload). toplev has a new --ret-latency option to override the tuning file (see the sketch below).
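
A concrete sketch of that flow (file and workload names are taken from the description above; passing the generated file to --ret-latency is an assumption based on that description):

  # one-time tuning run for this machine
  genretlat -o mtl-retlat.json ./workloads/BC1s
  # later toplev runs use the generated tuning file
  toplev --ret-latency mtl-retlat.json -l3 program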

TMA 4.7 release

  • toplev updated to TMA 4.7:

    • New --hbm-only for sprmax in HBM Only mode. toplev currently cannot auto-detect this condition.
    • New Models
      • SPR-HBM: model for Intel Xeon Max (server) processor covering HBM-only mode (on top of cache mode introduced in 4.6 release)
    • New Features
      • Releasing the Bottlenecks View - a rather complete version [SKL onwards]
        • Bottlenecks View is an abstraction or summarization of the 100+ TMA tree nodes into a 12-entry vector of familiar performance issues, presented under the Info.Bottlenecks section.
      • This release introduces the Core_Bound_Est metric: an estimation of total pipeline cost when the execution is compute-bound.
      • In addition, it balances the distribution among Branching Retired, Irregular_Overhead, Mispredictions and Instruction_Fetch_BW, and enhances Cache_Memory_Latency to account for stores with better accuracy.
    • New Tree Metrics (nodes)
      • HBM_Bound: stalls due to High Bandwidth Memory (HBM) accesses by loads.
    • New Informative Metrics
      • Uncore_Frequency in server models
      • IpPause [CFL onwards]
    • Key Enhancements & fixes
      • Hoisted Serializing_Operation and AMX_Busy to level 3; directly under Core Bound [SKL onwards]
      • Swapped semantics of ILP (becomes per-thread) and Execute (per physical core) info metrics
      • Moved Nop_Instructions to Level 4 under Other_Light_Op [SKL onwards]
      • Moved Shuffles_256b to Level 4 under Other_Light_Op [ADL onwards]
      • Renamed Local/Remote_DRAM to Local/Remote_MEM to account for HBM too
      • Reduced # events when SMT is off [all]
      • Reduced # events for HBM metrics; fixed MEM_Bandwidth/Latency descriptions [SPR-HBM]
      • Tuned Threshold for: Branching_Overhead; Fetch_Bandwidth, Ports_Utilized_3m
  • toplev has new options:

    • --node-metrics or -N collects and shows metrics related to selected TMA nodes if the nodes cross their thresholds. With --drilldown it will show only the metrics of the bottleneck.
    • --areas can select nodes and metrics by area
    • --bottlenecks or -B shows the bottleneck view metrics (equivalent to --areas Info.Bottleneck)
    • --only-bottleneck only shows the bottleneck, as well as its associated metrics if enabled (see the sketch after this list).
  • interval-plot has --level and --metrics arguments to configure what is plotted. It now defaults to level 1 only, with no metrics, to make the plots more readable.

  • toplev has a new --reserved-counters option to handle systems that reserve some generic counters.

  • toplev has a new --no-sort option to disable grouping metrics with tree nodes.
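
A minimal sketch combining these options (option spellings from the list above):

  # bottleneck view plus the metrics associated with nodes that cross their thresholds
  toplev -N -B program
  # drill down with minimal multiplexing and print only the critical bottleneck and its metrics
  toplev -N --drilldown --only-bottleneck program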

TMA 4.6 release

  • toplev updated to Ahmad Yasin's TMA 4.6
    • Support for Intel Xeon Max processors (SPRHBM)
    • New Features:
      • Support for optimized power-performance states via C01/C02_Wait nodes under Core Bound category as well as C0_Wait info metric [ADL onwards]
      • HBM_Bound: stalls due to High Bandwidth Memory (HBM) accesses by loads.
      • C01/C02_Wait: cycles spent in C0.1/C0.2 power-performance optimized states
      • Other_Mispredicts: slots wasted due to other cases of misprediction (non-retired x86 branches or other types)
      • Other_Nukes: slots wasted due to Nukes (Machine Clears) not related to memory ordering.
      • Info.Bottlenecks: Memory_Synchronization, Irregular_Overhead (fixes Instruction_Fetch_BW), Other_Bottlenecks [SKL onwards]
      • CPUs_Utilized - Average number of utilized CPUs [all]
      • New metrics UC_Load_PKI, L3/DRAM_Bound_L, Spec_Clears_Ratio, EPC [SKL onwards]
      • Unknown_Branch_Cost and Uncore_Rejects & Bus_Lock_PKI (support for Resizable Bar) [ADL]
      • Enabled FP_Vector_128b/256b nodes in SNB/JKT/IVB/IVT
      • Enabled FP_Assists and IpAssist info metrics, as well as fixed Mixing_Vectors [SKL through TGL]
      • TIOPs plus 8 new metrics Offcore_PKI and R2C_BW [SPR, SPR-HBM]
      • Grouped all Uncore-based Mem Info metrics under a distinct MemOffcore group (to ease skipping their overhead) [all]
    • Key Enhancements & fixes
      • Reduced # events (multiplexing) for GFLOPs, FLOPc, IpFLOP, FP_Scalar and FP_Vector [BDW onwards]
      • Reduced # events (multiplexing) & Fixed Serializing_Operations, Ports_Utilized_0 [ADL onwards]
      • Fixed Branch_Misprediction_Cost overestimate, Mispredictions [SKL onwards]
      • Fixed undercount in FP_Vector/IpArith (induced by 4.5 update) + Enabled/fixed IO_Read/Write_BW [SPR]
      • Tuned #Avg_Assist_Cost [SKL onwards]
      • Removed X87_Use [HSW/HSX]
      • Renamed Shuffles node & some metrics/groups in Info.Bottlenecks and Info.Memory*. CountDomain fixes

TMA 4.4 release

  • toplev updated to Ahmad Yasin's TMA 4.4

    • Add support for Sapphire Rapids servers
      • New breakdown of Heavy_Operations, add new nodes for Assists, Page Faults
      • A new Int_Operations level 3 node, including Integer Vector and Shuffle
      • Support for RDT MBA stalls.
      • AMX and FP16 support
      • Better FP_Vector breakdown
      • Support 4wide MITE breakdown.
      • Add new Info.Pipeline Metrics group.
      • Support for Retired/Executed uops and String instruction cycles
      • Frequency of microcode assists.
      • Add Core_Bound_Likely for SMT and IpSWF for software prefetches.
      • Cache bandwidth is split per processor and per core.
      • Snoop Metric group for cross processor snoops.
      • Various bug fixes and improvements.
  • Support for running on Alderlake with a hybrid Goldencove / Gracemont model. Add a new --aux option to control the auxiliary nodes on Atom. --cputype atom/core is supported to filter on core types (see the sketch after this list).

  • cputop supports an atom/core shortcut to generate the cpu mask of hybrid CPUs. Use it like toplev $(cputop core cpuset) workload.

  • toplev now supports a --abbrev option to abbreviate node names

  • Add an experimental --thread option to support per-SMT-thread measurements on pre-ICL CPUs.
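
For example, on a hybrid Alderlake system the measurement can be restricted to one core type (a sketch based on the options described above):

  # measure only the "core" (P-core) CPUs
  toplev -l2 --cputype core program
  # measure only the "atom" (E-core) CPUs
  toplev -l2 --cputype atom program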

TMA 4.3 release

  • toplev updated to Ahmad Yasin's TMA 4.3: New Retiring.Light_Operations breakdown

    Notes: ADL is missing so far. TGL/RKL still use the ICL model. If you see missing events, please remove ~/.cache/pmu-events/* to force a redownload.

    • New Tree Metrics (nodes)
      • A brand new breakdown of the Light_Operations sub-category (under Retiring category) per operation type:
        • Memory_Operations for (fraction of retired) slots utilized by load or store memory accesses
        • Fused_Instructions for slots utilized by fused instruction pairs (mostly conditional branches)
        • Non_Fused_Branches for slots utilized by remaining types of branches.
        • (Branch_Instructions is used in lieu of the last two nodes for ICL .. TGL models)
        • Nop_Instructions for slots utilized by NOP instructions
        • FP_Arith - a fraction estimate of arithmetic floating-point operations (legacy)
      • CISC new tree node for complex instructions (under the Heavy_Operations sub-category)
      • Decoder0_Alone new tree node for instructions requiring heavy decoder (under the Fetch_Bandwidth sub-category)
      • Memory_Fence new tree node for LFENCE stalls (under the Core_Bound sub-category)
    • Informative Groups
      • New Info.Branches group for branch instructions of certain types: Cond_TK (Conditional TaKen branches), Cond_NT (Conditional Non-Taken), CallRet, Jump and Other_Branches.
      • Organized (almost all) Info metrics in 5 mega-buckets of {Fed, Bad, Ret, Cor, Mem} using the Metric Group column
    • New Informative Metrics
      • UpTB for Uops per Taken Branch
      • Slots_Utilization for Fraction of Physical Core issue-slots utilized by this Logical Processor [ICL onwards]
      • Execute_per_Issue for the ratio of Uops Executed to Uops Issued (allocated)
      • Fetch_UpC for average number of fetched uops when the front-end is delivering uops
      • DSB_Misses_Cost for Total penalty related to DSB misses
      • IpDSB_Miss_Ret for Instructions per (any) retired DSB miss
      • Kernel CPI for Cycles Per Instruction in kernel (operating system) mode
    • Key Enhancements & fixes
      • Fixed Heavy_Operations for few-uop instructions [ICL, ICX, TGL].
      • Fixed Fetch_Latency overcount (or Fetch_Bandwidth undercount) [ICL, ICX, TGL]
      • Capped nodes using fixed costs, e.g. DRAM_Bound, to 100% max. Some tools did this in an ad-hoc manner thus far [All]
      • Fixed DTLB_{Load,Store} and STLB_Hit_{Load,Store} in case of multiple hits per cycle [SKL onwards]
      • Fixed Lock_Latency to account for locks that hit in the L1D or L2 caches [SKL onwards]
      • Fixed Mixing_Vectors and X87_Use to Clocks and Slots Count Domains, respectively [SKL onwards]
      • Many other fixes: Thresholds, Tagging (e.g. Ports_Utilized_2), Locate-with, Count Domain, Metric Group, Metric Max, etc
  • jestat now supports CSV output (-x,), not aggregated.

  • libjevents has utility functions to output event list in perf stat style (both CSV and normal)

  • toplev now outputs multiplexing statistics by default. This can be disabled with --no-mux.

  • cputop now supports hybrid types (type=="core"/"atom")

  • ucevent now supports Icelake Server

  • toplev now supports Icelake Server

TMA 4.2 release

  • toplev updated to Ahmad Yasin's TMA 4.2: Bottlenecks Info group, Tuned memory access costs
    • New Metrics
      • New Info.Bottlenecks group aggregating total performance-issue costs in SLOTS across the tree: [SKL onwards]
        • Memory_Latency, Memory_Bandwidth, Memory_Data_TLBs
        • Big_Code, Instruction_Fetch_BW, Branching_Overheads and
        • Mispredictions (introduced in 4.1 release)
      • New tree node for Streaming_Stores [ICL onwards]
    • Key Enhancements & fixes
      • Tuned memory metrics with up-to-date frequency-based measured costs [TGL, ICX]
        • The Average_Frequency is calculated using the TSC (TimeStamp Counter) value
        • With this key enhancement #Mem costs become NanoSecond- (was Constant), DurationTimeInMilliSeconds becomes ExternalParameter CountDomain and #Base_Frequency is deprecated
        • The previous method of setting frequency using Base_Frequency is deprecated.
      • Fixed Ports_Utilization for detection of serializing operations - issue#339 [SKL onwards]
      • Tuned MITE, DSB, LSD and move to Slots_Estimated domain [all]
      • Capping DTLB_Load and STLB_Hit_Load cost using events in Clocks CountDomain [SKL onwards]
      • Tuned Pause latency using default setting [CLX]
      • Fixed average Assists cost [IVB onwards]
      • Fixed Mispredicts_Resteers Clears_Resteers Branch_Mispredicts Machine_Clears and Mispredictions [ICL+]
      • A parameter to avoid using the PERF_METRICS MSR, e.g. for older OS kernels (implies higher event multiplexing)
      • Reduced # events for selected node collections (less event multiplexing): Backend_Bound/Core_Bound, Clears_Resteers/Unknown_Branches, Kernel_Utilization
      • Other fixes: Thresholds, Tagging (e.g. Ports_Utilized_2), Locate-with, etc
  • toplev now has a --parallel argument to process large --import input files with multiple threads. There is a new interval-merge tool that can merge multiple perf-output files (see the sketch after this list).
  • toplev now supports a --subset argument that can process parts of --import input files, either by splitting them or by sampling. This is a building block for more efficient processing of large input files.
  • toplev can now generate scripts to collect data with perf stat record to lower runtime collection overhead, and import the perf.data, using a new --script-record option. This currently requires unreleased perf patches, hopefully in Linux 5.11.
  • toplev now supports JSON output for Chrome's about://tracing with --json
  • toplev now supports --no-multiplex in interval mode (-Ixxx)
  • The tools no longer force Python 2, so they run out of the box on distributions which do not install Python 2.
  • toplev now hides the perf command line by default. Override with --perf.
  • Updated to TMA 4.11: Fixed an error in misprediction-related and Power License metrics
  • toplev now supports the new fixed TMA metrics counters on Icelake. This requires the upcoming 5.9+ kernel.
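
A hedged sketch of the collect-then-post-process flow these options enable (--import and --parallel are from the bullets above; the --perf-output flag used to save the raw perf output, and repeating the same measurement options on the import run, are assumptions):

  # collect an interval measurement and keep the raw perf output for later processing
  toplev -l3 -I 1000 -a --perf-output raw.csv sleep 60
  # later, post-process the saved data with multiple threads
  toplev -l3 -I 1000 -a --import raw.csv --parallel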

TMA 4.1 release

  • toplev was updated to Ahmad Yasin's/Anton Hanna's TMA 4.1. New Metrics:
    • Re-arrange Retiring Level 2 into Light_Operations & Heavy_Operations. Light_Operations replaces the previous Base (or "General Retirement") while Heavy_Operations is a superset of the Microcode_Sequencer node (that moves to Level 3)
    • Mixing_Vectors: hints on a pitfall when intermixing the newer AVX* with legacy SSE* vectors; a tree node under Core Bound [SKL onwards]
    • Key Enhancements & fixes
    • Tuning of Level 2 breakdown for Backend_Bound, Frontend_Bound (rollback FRONTEND_RETIRED 2-events use) [SKL onwards]
    • Improved branch misprediction related metrics to leverage a new PerfMon event [ICL onwards]
    • Improved CORE_CLKS & #Retire_Slots-based metrics [ICL onwards]
    • Adjusted cost of all nodes using MEM_LOAD_*RETIRED.* in case of shadow L1 d-cache misses
    • Renamed Frontend_ to Fetch_Latency/Bandwidth [all]
    • Additional documentation/details to aid automated parsing in ‘For Tool Developers’.
    • Other fixes including Thresholds, Tagging (e.g. $issueSnoops), Locate-with, Metric Group
  • toplev can now generate charts in xlsx files with the --xchart option.
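
For example (assuming --xchart augments the --xlsx output, as the description above suggests):

  # full-system measurement for 10 seconds, written as a spreadsheet with embedded charts
  toplev --all --xlsx x.xlsx --xchart -a sleep 10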

Older changes in CHANGES

Help wanted

  • The plotting tools could use a lot of improvements, both tl-serve and tl-barplot. If you're good at Python or JS plotting, any help improving those would be appreciated.

Mailing list

Please post to the [email protected] mailing list. For bugs please open an issue on https://github.com/andikleen/pmu-tools/issues

Licenses

ocperf, toplev, ucevent, parser are under GPLv2, jevents is under the modified BSD license.

Andi Kleen


pmu-tools's Issues

No further Backend_Bound output in toplev -l2

Level 1: toplev.py sleep 60
Using level 1.
perf stat -x, -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1,any=1/,cpu/event=0xc2,umask=0x2/}' -A -a sleep 60
S0-C0 FE Frontend_Bound: 35.21%
S0-C0 BE Backend_Bound: 50.81% //maybe the bound.
S0-C1 FE Frontend_Bound: 34.63%
S0-C1 BE Backend_Bound: 43.94%
.....
Level 2: toplev.py -l2 sleep 60
S0-C0 FE Frontend_Bound: 32.92%

S0-C0 FE Frontend_Bound.Frontend_Latency: 27.60%

S0-C0 BE Backend_Bound: 52.92%

S0-C1 FE Frontend_Bound: 36.04%
S0-C1 FE Frontend_Bound.Frontend_Latency: 29.17%
......
S0-C0-T1BE/Mem Backend_Bound.Memory_Bound: 0.00% mismeasured

As you can see, we cannot find a sub-item for S0-C0 BE Backend_Bound (the way Frontend_Bound has Frontend_Bound.Frontend_Latency). Why?
My kernel is Ubuntu 3.16.0-31-generic.

Issue with '-v' flag

I'm trying to run toplev.py with a docker container as a workload, I use the following command:

python toplev.py --core C0 -l1 -I 1000 -x, -o ../benchmarks/mediaStreamingLevel1I1000msC0.csv taskset -c 0 docker run -t --name=streaming_client -v /path/to/output:/output --volumes-from streaming_dataset --net streaming_network cloudsuite/media-streaming:client 172.18.0.2
When I run this, toplev removes the '-v' flag present in the docker command, which causes errors. The output is:

Will measure complete system
Using level 1.
perf stat -x\; -e
'{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,any=1,cmask=1/,cpu/event=0xc2,umask=0x2/}' -I 1000 -C 0,18,36,54 -A -a taskset -c 0 docker run -t --name=streaming_client /path/to/output:/output --volumes-from streaming_dataset --net streaming_network cloudsuite/media-streaming:client 172.18.0.2
Unable to find image '/path/to/output:/output:latest' locally

This might be happening because toplev also has a '-v' flag (--verbose or -v). Without toplev the docker container runs fine.

toplev crashes

saw this with a76c89a

$ ~/software/pmu-tools/repo/toplev.py -l1 sleep 10
Using level 1.
perf stat -x, -e 'task-clock,{cpu/event=0xc2,umask=0x2/,cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cpu/event=0x9c,umask=0x1/,cycles}' sleep 10
Traceback (most recent call last):
  File "/home/steinbac/software/pmu-tools/repo/toplev.py", line 1748, in <module>
    ret = execute(runner, out, rest)
  File "/home/steinbac/software/pmu-tools/repo/toplev.py", line 960, in execute
    print_keys(runner, res, rev, valstats, out, interval, env)
  File "/home/steinbac/software/pmu-tools/repo/toplev.py", line 885, in print_keys
    cores = [key_to_coreid(x) for x in res.keys() if int(x) in runner.allowed_threads]
ValueError: invalid literal for int() with base 10: ''

Issues with ocperf and toplev

I am running PMU-Tools on a Haswell i7 processor with 3.13.0-35-generic kernel (Ubuntu). I am getting some odd behavior in the output of ocperf and toplev.

  • When I run ocperf.py stat with the same events as toplev.py, it seems to show that many of the counters are <not counted>. Is this normal behavior? As I understood it, ocperf.py shouldn't show this behavior because it uses the events directly from Intel's description of the micro-architecture on my computer.
'{cycles,cpu/event=0xc2,umask=0x2/,ref-cycles,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/},{cpu/event=0xa2,umask=0x8/,cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0x9c,umask=0x1/,cpu/event=0x9c,umask=0x1,cmask=4/,cycles,instructions},{cpu/event=0xe,umask=0x1/,cycles,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},{cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cycles,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cycles,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xb1,umask=0x1,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xb1,umask=0x1,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},{cpu/event=0xa3,umask=0x6,cmask=6/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0x5e,umask=0x1/,instructions},{cpu/event=0xab,umask=0x2/,cpu/event=0x87,umask=0x1/,cycles,cpu/event=0x79,umask=0x30,edge=1,cmask=1/,cpu/event=0x85,umask=0x10/},{cpu/event=0x80,umask=0x4/,cpu/event=0x79,umask=0x24,cmask=4/,cycles,cpu/event=0x79,umask=0x24,cmask=1/,cpu/event=0x85,umask=0x10/},{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0x79,umask=0x18,cmask=1/,cycles,cpu/event=0xa3,umask=0xc,cmask=12/,cpu/event=0x79,umask=0x18,cmask=4/},{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa3,umask=0xc,cmask=12/,cpu/event=0xa3,umask=0x5,cmask=5/,cpu/event=0xa2,umask=0x8/,cycles},{cpu/event=0xa3,umask=0x5,cmask=5/,cpu/event=0xd1,umask=0x4/,cpu/event=0xd1,umask=0x20/,cycles},{cpu/event=0xc5,umask=0x0/,cpu/event=0xe6,umask=0x1f/,cpu/event=0x5e,umask=0x1/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0x80,umask=0x4/,cycles,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1/},{cpu/event=0xd1,umask=0x4/,cycles,cpu/event=0xd2,umask=0x2/,cpu/event=0x7,umask=0x1/,cpu/event=0x3,umask=0x2/},{cpu/event=0x8,umask=0x10/,cycles,cpu/event=0x8,umask=0x60/,cpu/event=0x60,umask=0x1,cmask=6/},{cpu/event=0xd2,umask=0x1/,cpu/event=0x60,umask=0x1,cmask=1/,cycles,cpu/event=0xd2,umask=0x4/,cpu/event=0x60,umask=0x1,cmask=6/},{cpu/event=0xb7,umask=0x1,offcore_rsp=0x10003c0002/,cycles,cpu/event=0xd0,umask=0x42/,cpu/event=0xd2,umask=0x4/,cpu/event=0xd0,umask=0x82/},{cpu/event=0x49,umask=0x60/,cycles,cpu/event=0x49,umask=0x10/},{cpu/event=0xd1,umask=0x8/,cpu/event=0x3,umask=0x8/,cycles,cpu/event=0x48,umask=0x1/}

When I see the output of this command, a lot of events show up as <not counted>
Here is a sample of the output -

     1.196793465,<not counted>,cpu/event=0xab,umask=0x2/
     1.196793465,<not counted>,cpu/event=0x87,umask=0x1/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0x79,umask=0x30,edge=1,cmask=1/
     1.196793465,<not counted>,cpu/event=0x85,umask=0x10/
     1.196793465,<not counted>,cpu/event=0x80,umask=0x4/
     1.196793465,<not counted>,cpu/event=0x79,umask=0x24,cmask=4/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0x79,umask=0x24,cmask=1/
     1.196793465,<not counted>,cpu/event=0x85,umask=0x10/
     1.196793465,<not counted>,cpu/event=0xa3,umask=0x6,cmask=6/
     1.196793465,<not counted>,cpu/event=0x79,umask=0x18,cmask=1/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xa3,umask=0xc,cmask=12/
     1.196793465,<not counted>,cpu/event=0x79,umask=0x18,cmask=4/
     1.196793465,<not counted>,cpu/event=0xa3,umask=0x6,cmask=6/
     1.196793465,<not counted>,cpu/event=0xa3,umask=0xc,cmask=12/
     1.196793465,<not counted>,cpu/event=0xa3,umask=0x5,cmask=5/
     1.196793465,<not counted>,cpu/event=0xa2,umask=0x8/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xa3,umask=0x5,cmask=5/
     1.196793465,<not counted>,cpu/event=0xd1,umask=0x4/
     1.196793465,<not counted>,cpu/event=0xd1,umask=0x20/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xc5,umask=0x0/
     1.196793465,<not counted>,cpu/event=0xe6,umask=0x1f/
     1.196793465,<not counted>,cpu/event=0x5e,umask=0x1/
     1.196793465,<not counted>,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/
....
....
     1.196793465,<not counted>,cpu/event=0xd1,umask=0x4/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xd2,umask=0x2/
     1.196793465,<not counted>,cpu/event=0x7,umask=0x1/
     1.196793465,<not counted>,cpu/event=0x3,umask=0x2/
     1.196793465,<not counted>,cpu/event=0x8,umask=0x10/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0x8,umask=0x60/
     1.196793465,<not counted>,cpu/event=0x60,umask=0x1,cmask=6/
     1.196793465,<not counted>,cpu/event=0xd2,umask=0x1/
     1.196793465,<not counted>,cpu/event=0x60,umask=0x1,cmask=1/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xd2,umask=0x4/
     1.196793465,<not counted>,cpu/event=0x60,umask=0x1,cmask=6/
     1.196793465,<not counted>,cpu/event=0xb7,umask=0x1,offcore_rsp=0x10003c0002/
     1.196793465,<not counted>,cycles
     1.196793465,<not counted>,cpu/event=0xd0,umask=0x42/
     1.196793465,<not counted>,cpu/event=0xd2,umask=0x4/
     1.196793465,<not counted>,cpu/event=0xd0,umask=0x82/
  • Could this be the reason toplev.py seems to be producing stacked bar-plots that do not sum to 100%? For example, what does it mean when the first-level figure is zero, but the backend-bound metric in level 2 is non-zero?

toplev.py Output

Can't collect some events on Xeon E5-2630 v3

Currently, I am trying to analyze my application using toplev.py. However, it seems that the Xeon E5-2630 v3 is not supported. Specifically, I could not get the Frontend, Retiring and Bad Speculation information. In addition, the Backend information does not include the detailed breakdown, such as Memory Bound and Core Bound.

$] python toplev.py -l5 my_app
28 events not supported
0     BE      Backend_Bound:                67.04%
        This category reflects slots where no uops are being
        delivered due to a lack of required resources for accepting
        more uops in the Backend of the pipeline...
0             CPU utilization:        0.89 CPUs
        Number of CPUs used...
1     BE      Backend_Bound:                67.43%
1             CPU utilization:        0.89 CPUs

(I am sorry to disturb the issue article.)

Scale arg ignored in CSV mode

CSV-enabled output ignores the scale argument. If this is intentional the README should be updated, otherwise a quick workaround would be to modify self.vals in OutputCSV.flush() if args.scale is set.

Fix in ocperf

Hello,

I have a little fix to propose for the process_args function in ocperf.py:

From 21b152a29f59da03769d4db33df720123218de80 Mon Sep 17 00:00:00 2001
From: Omar Awile <[email protected]>
Date: Mon, 12 Sep 2016 10:17:11 +0200
Subject: [PATCH] Pass along optional argv parameter for this case too

---
 ocperf.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ocperf.py b/ocperf.py
index f9bc904..7b1c068 100755
--- a/ocperf.py
+++ b/ocperf.py
@@ -790,7 +790,7 @@ def process_args(emap, argv=sys.argv):
                                              True if record == yes else False, emap)
             cmd.append(prefix + event)
         elif argv[i][0:2] == '-c':
-            oarg, i, prefix = getarg(i, cmd)
+            oarg, i, prefix = getarg(i, cmd, argv=argv)
             if oarg == "default":
                 if overflow is None:
                     print >>sys.stderr,"""

cheers!

toplev chart is empty

Hi,

I have tried toplev in order to produce a chart as you show in the README.

However, I just got an empty figure (see attachment).

Command line was:

toplev.py -I 100 -l3 --title "GNU grep" --graph md5sum ~/Downloads/ubuntu-14.04.3-server-amd64.iso

figure_1

percentages going over 100?

Hi there. First of all, thank you for the tools. I've learned a lot about how to use perf just by looking at how the pmu-tools do it.

I'm seeing this strange output in commit d70840b, using command line options ../pmu-tools/toplev.py --verbose --no-multiplex -l3 --single-thread -- ./myprogram

I consistently get this printed output whose % is > 100 on a particular test program I am running.

BE Backend_Bound: 82.25 % [100.00%]
BE/Mem Backend_Bound.Memory_Bound: 57.41 % [100.00%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 5.58 % [100.00%]
This metric estimates how often the CPU was stalled without
loads missing the L1 data cache...
Sampling events: mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem Backend_Bound.Memory_Bound.L1_Bound.DTLB_Load: _ 196.05 %below _ [100.00%]
This metric represents cycles fraction where the TLB was
missed by load instructions...
Sampling events: mem_inst_retired.stlb_miss_loads:p

rdpmc_read() in libjevents returns values > 2^48

I've been trying to do something similar to the interrupts.c code, but was having trouble with rdpmc_read() giving seemingly nonsense results. I've narrowed it down to the buf->offset field sometimes having a high bit set (1L << 48). PERF_EVENT_IOC_RESET will reset both the counter and the offset to 0, but at some point buf->offset will jump back. Masking off the high bits (or just ignoring offset) solves the problem, but I can't find any reason it should be necessary.

I don't know if this is a bug in rdpmc_read(), a bug in the kernel, or something I'm doing wrong. I'm using a slightly older kernel (4.2.0) on Skylake, so it's also possible this is something that has already been fixed. Test code against current pmu-tools master is here: https://github.com/nkurz/pmu-tools/tree/test-offset. Help or suggestions of a better venue would be appreciated. I can upgrade to the current kernel and retest if necessary. Thanks!

toplev -l4 --user generates invalid event spec on Intel(R) Xeon(R) CPU E5-2630 v3 (Haswell-E)

$ python2 ~/shared/pack/pmu-tools/toplev.py -l4 --user ls

yields

Using level 4.
Nodes Data_Sharing Memory_Bound 1_Port_Utilized Split_Stores L3_Bound
2_Ports_Utilized Contested_Accesses 3m_Ports_Utilized Store_Latency
Lock_Latency L3_Hit_Latency Split_Loads Ports_Utilization Core_Bound
MEM_Bound FB_Full have errata HSM30 HSM31 HSM26, HSM30
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,cpu/event=0xd,umask=0x3,cmask=1/u,instructions:u},{cpu/event=0xa2,umask=0x8/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0xb1,umask=0x2,cmask=1/u,cpu/event=0xb1,umask=0x2,cmask=2/u,cpu/event=0xb1,umask=0x2,cmask=3/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4/u,cpu/event=0x5e,umask=0x1/u,instructions:u},{cpu/event=0x80,umask=0x4/u,cpu/event=0xab,umask=0x2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0x87,umask=0x1/u,cpu/event=0x14,umask=0x2/u,cpu/event=0x79,umask=0x30,edge=1,cmask=1/u,cpu/event=0xc1,umask=0x40/u,cycles:u},{cpu/event=0x79,umask=0x24,cmask=4/u,cpu/event=0xa8,umask=0x1,cmask=1/u,cpu/event=0x79,umask=0x24,cmask=1/u,cpu/event=0x85,umask=0x60/u,cpu/event=0x79,umask=0x18,cmask=1/u,cpu/event=0xa8,umask=0x1,cmask=4/u,cycles:u,cpu/event=0x79,umask=0x18,cmask=4/u,cpu/event=0x85,umask=0x10/u},{cpu/event=0xa3,umask=0xc,cmask=12/u,cpu/event=0xd1,umask=0x20/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0xd1,umask=0x4/u,cpu/event=0xa3,umask=0x5,cmask=5/u,cycles:u},{cpu/event=0x80,umask=0x4/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xe6,umask=0x1f/u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1/u,cpu/event=0x85,umask=0x60/u,cpu/event=0xc5,umask=0x0/u,cycles:u,cpu/event=0x5e,umask=0x1/u,cpu/event=0x85,umask=0x10/u},{cpu/event=0x60,umask=0x8,cmask=6/u,cpu/event=0x7,umask=0x1/u,cpu/event=0xb7,umask=0x1/puhu,cpu/event=0xd0,umask=0x42/u,cpu/event=0x3,umask=0x2/u,cpu/event=0xb1,umask=0x2,cmask=3/u,cycles:u,cpu/event=0xb2,umask=0x1/u},{cpu/event=0x60,umask=0x8,cmask=1/u,cpu/event=0x8,umask=0x60/u,cpu/event=0xb1,umask=0x2,cmask=2/u,cpu/event=0x60,umask=0x8,cmask=6/u,cpu/event=0xb1,umask=0x2,cmask=1/u,cpu/event=0x49,umask=0x60/u,cpu/event=0x49,umask=0x10/u,cpu/event=0x8,umask=0x10/u,cycles:u},{cpu/event=0x60,umask=0x4,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xb1,umask=0x2,cmask=2/u,cpu/event=0xb1,umask=0x2,cmask=3/u,cpu/event=0xd0,umask=0x82/u,cpu/event=0xc0,umask=0x2/u,cycles:u,cpu/event=0xd0,umask=0x21/u,instructions:u},{cpu/event=0x3,umask=0x8/u,cpu/event=0xd1,umask=0x8/u,cpu/event=0xd1,umask=0x40/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0x48,umask=0x2,cmask=1/u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4/u,cpu/event=0x5e,umask=0x1/u,cpu/event=0x48,umask=0x1/u},{cpu/event=0xd3,umask=0x1/u,cpu/event=0xd1,umask=0x4/u,cpu/event=0xd2,umask=0x4/u,cpu/event=0xd3,umask=0x4/u,cpu/event=0xd2,umask=0x1/u,cpu/event=0xd1,umask=0x40/u,cpu/event=0xd2,umask=0x2/u,cpu/event=0xd1,umask=0x2/u},{cpu/event=0xd3,umask=0x10/u,cpu/event=0xd3,umask=0x20/u,cycles:u}' ls
invalid or unsupported event: '{[snip]}'
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events

The issue appears to be this event:

cpu/event=0xb7,umask=0x1/puhu

which has a duplicate u specifier.

cc @lcw

Issue with remote write count

I'm confused with the remote write counter.
I've used ocperf.py list to find the remote write counter, and I found that the events offcore_response_corewb_llc_miss_any_dram and offcore_response_corewb_llc_hit_any_response may be the write counters.
So I use the command ocperf.py stat -e offcore_response.corewb.llc_miss.any_dram,offcore_response.all_reads.llc_miss.remote_dram,offcore_response.corewb.llc_hit.any_response,mem-stores -I 1000 -C 8 to monitor the system. Then I use numactl to bind milc to physical CPU 8 and remote memory, but the result is confusing.

253.030788963 0 offcore_response_corewb_llc_miss_any_dram (36.40%)
253.030788963 36,177,045 offcore_response_all_reads_llc_miss_remote_dram (36.40%)
253.030788963 0 offcore_response_corewb_llc_hit_any_response (18.18%)
253.030788963 224,853,825 mem-stores (27.21%)
254.030893213 0 offcore_response_corewb_llc_miss_any_dram (36.40%)
254.030893213 35,695,552 offcore_response_all_reads_llc_miss_remote_dram (36.39%)
254.030893213 0 offcore_response_corewb_llc_hit_any_response (18.11%)
254.030893213 230,275,843 mem-stores (27.21%)
255.031004841 0 offcore_response_corewb_llc_miss_any_dram (36.39%)
255.031004841 35,970,716 offcore_response_all_reads_llc_miss_remote_dram (36.31%)
255.031004841 0 offcore_response_corewb_llc_hit_any_response (18.11%)
255.031004841 219,686,387 mem-stores (27.21%)

The result shows that the LLC miss and LLC hit counts are both 0.
So I wonder if the events I chose are wrong?
Hoping for your reply.

CPU_STARTING and CPU_DYING no longer used in Linux 4.9

Hi, thanks for the great tool!

simple-pebs/simple-pebs.c uses CPU_STARTING and CPU_DYING to allow CPUs to be hot-plugged, but these macros are no longer used in Linux 4.9.
http://lxr.free-electrons.com/ident?v=4.9;i=CPU_STARTING

As this commit (https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=ee1e714b94521b0bb27b04dfd1728ec51b19d4f0) suggests, we should move to the new state machine mechanism to support hot-plugging for kernel 4.9 or later versions.

For most of the cases where CPU hot-plugging never happens, just deleting the notifier callbacks like soramichi@d175a0b should work.

Problem with performance counters for Xeon D-1540

The Xeon D-1540 appears to have a problem where only 4 of the 8 perf counters per core actually count, whereas the other 4 remain zero (with hyperthreading disabled). I experienced this issue and then saw that other people have had the same issue: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/560536

I don't know if this affects other processors in that family, but it obviously ends up giving bogus pmu-tools results for this family of processors when hyperthreading is disabled. You might want to check for that particular processor and then limit the number of counters per perf set to only 4.

Note that in addition to only 4 out of 8 counters being available, the LLC counter values also have their own set of problems as described at that page (also confirmed with my CPU). There are actually a lot of counter-related problems with this processor...
http://www.intel.com/content/www/us/en/processors/xeon/xeon-d-1500-specification-update.html

return res[index][cpuoff] IndexError: tuple index out of range

ubuntu
cpu: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
kernel :Linux ubuntu 3.16.0-31-generic

root@ubuntu:~/pmu-tools# toplev.py -l2 -p 2004
Running in HyperThreading mode. Will measure complete system.
Using level 2.
perf stat -x, -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1,any=1/,cpu/event=0xc2,umask=0x2/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xa3,umask=0x6,cmask=6/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0x9c,umask=0x1,cmask=4/},{cpu/event=0x3c,umask=0x0,any=1/,instructions,cpu/event=0x9c,umask=0x1/,cycles,cpu/event=0x9c,umask=0x1,cmask=4/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xb1,umask=0x1,cmask=2/,cycles,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xb1,umask=0x1,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa2,umask=0x8/,cpu/event=0x5e,umask=0x1/,instructions}' -A -a -p 2004
7 ---->print index in code./pmu-tools/toplev.py 915 line
(11946234493.0,) ----->print res[index]
0 ---->print cpuoff
then repeat
6
(3199005707.0,)
0

7
(11946234493.0,)
1

Traceback (most recent call last):
File "/root/pmu-tools/toplev.py", line 1377, in
ret = execute(runner, out, rest)
File "/root/pmu-tools/toplev.py", line 728, in execute
print_keys(runner, res, rev, out, interval, env)
File "/root/pmu-tools/toplev.py", line 682, in print_keys
runner.print_res(r, rev[cpus[0]], out, interval, core_fmt(core), env, Runner.SMT_yes, stat)
File "/root/pmu-tools/toplev.py", line 1161, in print_res
obj.compute(lambda e, level:
File "/root/pmu-tools/ivb_server_ratios.py", line 637, in compute
self.val = (STALLS_MEM_ANY(EV, 2) + EV("RESOURCE_STALLS.SB", 2)) / CLKS(EV, 2 )
File "/root/pmu-tools/ivb_server_ratios.py", line 71, in STALLS_MEM_ANY
return EV(lambda EV , level : min(EV("CPU_CLK_UNHALTED.THREAD", level) , EV("CYCLE_ACTIVITY.STALLS_LDM_PENDING", level)) , level )
File "/root/pmu-tools/toplev.py", line 1162, in
lookup_res(res, rev, e, obj, env, level, stat.referenced))
File "/root/pmu-tools/toplev.py", line 902, in lookup_res
for off in range(cpu.threads)])
File "/root/pmu-tools/ivb_server_ratios.py", line 71, in
return EV(lambda EV , level : min(EV("CPU_CLK_UNHALTED.THREAD", level) , EV("CYCLE_ACTIVITY.STALLS_LDM_PENDING", level)) , level )
File "/root/pmu-tools/toplev.py", line 901, in
lookup_res(res, rev, ev, obj, env, level, referenced, off), level)
File "/root/pmu-tools/toplev.py", line 919, in lookup_res
return res[index][cpuoff]
IndexError: tuple index out of range

We think cpuoff = 1 is out of range because res[index] only has one member.

How can we fix the bug?

Thanks

Download JSON Events File

I am running PMU tools on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
When I run the event_download.py script, it tries to fetch https://download.01.org/perfmon/HSW/Haswell_core_V14.json. That causes a 404 error.

The file it should fetch seems to be https://download.01.org/perfmon/HSW/Haswell_core_V15.json. Is there any workaround for this problem?

Which CPU families do the tools support?

Hi, I want to get the level 2 metrics and some level 3 metrics using the tool "toplev".

I want to confirm which CPU families the "toplev" tool currently supports. SNB, IVB, HSW, BDW?

Event duplicity?

Hi all,
I have noticed that on my CPU (Intel i7-3537U) ocperf lists the following events that seem to be the same (same event, umask and any flag). What's the meaning of _p?
Thanks

cpu_clk_unhalted.thread: Core cycles when the thread is not in halt state
cpu/event=0x3c,umask=0x0,name=cpu_clk_unhalted_thread/
cpu_clk_unhalted.thread_p: Thread cycles when thread is not in halt state
cpu/event=0x3c,umask=0x0,name=cpu_clk_unhalted_thread_p/

cpu_clk_unhalted.thread_any: Core cycles when at least one thread on the physical core is not in halt state
cpu/event=0x3c,umask=0x0,any=1,name=cpu_clk_unhalted_thread_any/
cpu_clk_unhalted.thread_p_any: Core cycles when at least one thread on the physical core is not in halt state
cpu/event=0x3c,umask=0x0,any=1,name=cpu_clk_unhalted_thread_p_any/

inst_retired.any: Instructions retired from execution.
cpu/event=0xc0,umask=0x0,name=inst_retired_any/
inst_retired.any_p: Number of instructions retired. General Counter - architectural event
cpu/event=0xc0,umask=0x0,name=inst_retired_any_p/

fucking windows newline symbol!!!!

alp@ws207:~/wrk$ ./pmu-tools/list-events.py
bash: ./pmu-tools/list-events.py: /usr/bin/python^M: bad interpreter: No such file or directory

Improper handling of non-consecutive imc uncore pmu dev names

On my machine (running Linux 3.19) the imc uncore pmu dev names are not consecutive: i.e., instead of /sys/devices/uncore_imc_{0..3} I have /sys/devices/uncore_imc_{0,1,4,5}. But expand_events (among possibly other places in the code) assumes there is no gap in naming, so I end up with only two values instead of four.

As a quick/dirty workaround I modified ucexpr.py>expand_events to:

for n in range(10):
    if ucevent.box_exists(...):
        l.append(...)

cannot run toplev.py on fresh haswell box (kernel 3.13 and python 2.7)

Hi,

I have some troubles to run toplev on a new box so I wanted to let you know.

Here is an output, from a fresh clone of master branch :

satin@satin-phyexp1:/tmp/pmu-tools$ python --version
Python 2.7.6
satin@satin-phyexp1:/tmp/pmu-tools$ uname -r
3.13.0-37-generic
satin@satin-phyexp1:/tmp/pmu-tools$ ./toplev.py -I 100 -l3 --title "GNU grep" --graph grep -r foo /usr/*
Using level 3.
UOPS_EXECUTED.CYCLES_GE_1_UOPS_EXEC not found
satin@satin-phyexp1:/tmp/pmu-tools$ []
Traceback (most recent call last):
  File "/tmp/pmu-tools//tl-barplot.py", line 185, in <module>
    plt.subplot(numplots, 1, 1)
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 897, in subplot
    a = fig.add_subplot(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/figure.py", line 914, in add_subplot
    a = subplot_class_factory(projection_class)(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 9251, in __init__
    self._subplotspec = GridSpec(rows, cols)[int(num) - 1]
  File "/usr/lib/pymodules/python2.7/matplotlib/gridspec.py", line 176, in __getitem__
    raise IndexError("index out of range")
IndexError: index out of range

Is that a bug in toplev?
Obviously "UOPS_EXECUTED.CYCLES_GE_1_UOPS_EXEC not found" looks like the culprit. This event is used in hsw_client_ratios.py.
Or am I missing some additional perf libraries, or is it due to my processor not being fully supported (Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz)?

Thanks in advance for any hints...

Issues with energy measuring (--power)

I believe I've found two issues with --power functionality in current HEAD.

First, commit 9458aea925a20c19c9d15056c5dc623dc3fdbf12 appears to break power events (and likely some others too), because after that change valid_events_str is computed too early, before valid_events are populated in Runner::collect.

Reverting that commit seems to fix the issue for me on a HT CPU. However, on a non-HT CPU another issue remains: the metrics are not printed unless I add the -A flag to the perf command line.

KeyError: u'MATRIX_REQUEST'

Hi,

ocperf fails on my machine:

$ ocperf.py stat -e arith.div:k 
Downloading https://download.01.org/perfmon/mapfile.csv to mapfile.csv
Downloading https://download.01.org/perfmon/HSW/Haswell_core_V15.json to GenuineIntel-6-3C-core.json
Downloading https://download.01.org/perfmon/HSW/Haswell_matrix_bit_definitions_V15.json to GenuineIntel-6-3C-offcore.json
Downloading https://download.01.org/perfmon/readme.txt to readme.txt
Traceback (most recent call last):
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 690, in <module>
    emap = find_emap()
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 536, in find_emap
    return json_with_extra(el)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 482, in json_with_extra
    add_extra_env(emap, el)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 492, in add_extra_env
    emap.add_offcore(oc)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 452, in add_offcore
    if row[u"MATRIX_REQUEST"].upper() != "NULL":
KeyError: u'MATRIX_REQUEST'

Same for ocperf.py list:

$ ocperf.py list
Traceback (most recent call last):
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 690, in <module>
    emap = find_emap()
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 524, in find_emap
    emap = json_with_extra(el)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 482, in json_with_extra
    add_extra_env(emap, el)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 492, in add_extra_env
    emap.add_offcore(oc)
  File "/home/tgrabiec/src/pmu-tools/ocperf.py", line 452, in add_offcore
    if row[u"MATRIX_REQUEST"].upper() != "NULL":
KeyError: u'MATRIX_REQUEST'

The CPU is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

Problems Running Toplev

I am trying to run this command - sudo ../pmu-tools/toplev.py -I 100 -l3 --title "GNU grep" --graph grep -r asdf /etc/*. On an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

This is setting off an AssertionError inside toplev.

Traceback (most recent call last):
  File "../pmu-tools/toplev.py", line 950, in <module>
    ret = execute(runner, out, rest)
  File "../pmu-tools/toplev.py", line 509, in execute
    env)
  File "../pmu-tools/toplev.py", line 549, in do_execute
    runner.print_res(res[j], rev[j], out, prev_interval, j, env)
  File "../pmu-tools/toplev.py", line 806, in print_res
    obj.compute(lambda e, level:
  File "/home/subho/pmu-tools/hsw_client_ratios.py", line 713, in compute
    self.val = BackendBoundAtEXE(EV, 2)- self.MemoryBound.compute(EV )
  File "/home/subho/pmu-tools/hsw_client_ratios.py", line 30, in BackendBoundAtEXE
    return BackendBoundAtEXE_stalls(EV, level) / CLKS(EV, level)
  File "/home/subho/pmu-tools/hsw_client_ratios.py", line 28, in BackendBoundAtEXE_stalls
    return ( EV("CYCLE_ACTIVITY.CYCLES_NO_EXECUTE", level) + EV("UOPS_EXECUTED.CYCLES_GE_1_UOPS_EXEC", level) - FewUopsExecutedThreshold(EV, level) - EV("RS_EVENTS.EMPTY_CYCLES", level) + EV("RESOURCE_STALLS.SB", level) )
  File "../pmu-tools/toplev.py", line 807, in <lambda>
    lookup_res(res, rev, e, obj, env, level))
  File "../pmu-tools/toplev.py", line 631, in lookup_res
    assert event_rmap(rev[index]) == canon_event(ev)
AssertionError

This might be related to #7. I downloaded https://download.01.org/perfmon/HSW/Haswell_core_V15.json and put it in my pmu_events folder as GenuineIntel-6-3C-core.json (instead of the V14 file which does not exist there and which event_download.py was looking for).

not counted issue on Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

Hi community!

I am using perf as: sudo perf stat -e r00c0,r01c0,r01c0:p,r01c0:pp sleep 1

but the result is

Performance counter stats for 'sleep 1':

       464,229 r00c0                                                       
       464,229 r01c0                                                       
 <not counted> r01c0:p                 
 <not counted> r01c0:pp                

   1.001639901 seconds time elapsed

Why is PEBS not counted here?

Sorry to ask my question here...

Please help me on this.

gen-dot.py can't work with latest ratios files.

gen-dot.py can't work with the latest ratios files, because the Runner instance has no attribute 'metric' or 'parent':

Traceback (most recent call last):
  File "./gen-dot.py", line 45, in <module>
    m.Setup(runner)
  File "/home/yefeng/pmu-tools-master/ivb_client_ratios.py", line 1604, in __init__
    n = Metric_IPC() ; r.metric(n)
AttributeError: Runner instance has no attribute 'metric'

and

Traceback (most recent call last):
  File "./gen-dot.py", line 48, in <module>
    runner.fix_parents()
  File "./gen-dot.py", line 32, in fix_parents
    if not obj.parent:
AttributeError: Frontend_Bound instance has no attribute 'parent'

I think runner.fix_parents() is not needed; I modified runner.finish() and it works:

class Runner:
    def finish(self):
        for n in self.olist:
            if n.level > 1:
                print '"%s" -> "%s";' % (n.parent.name, n.name)
            else:
                print '"%s";' % (n.name)
    def metric(self, n):
        pass

runner = Runner()
m.Setup(runner)
print >>sys.stderr, runner.olist
#runner.fix_parents()
print "digraph {"
print "fontname=\"Courier\";"
runner.finish()
print "}"

HSM31 on Xeon v3 valid?

I'm currently testing toplev.py on a Xeon v3 (Haswell) and see this output in level 2 system-wide test:

# ./toplev.py -l2 sleep 5
Will measure complete system.
Using level 2.
warning: removing Memory_Bound Core_Bound due to unsupported events in kernel:
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE CYCLE_ACTIVITY.STALLS_LDM_PENDING
Use --force-events to override (may result in wrong measurements)
Nodes Memory_Bound Core_Bound have errata HSM31 and were disabled.
Override with --ignore-errata

Using the --force-events and --ignore-errata options works as a workaround. However, I'm wondering if the error is valid in the first place?

I see HSM31 in the ~/.cache/pmu-events/GenuineIntel-6-3C-core.json
file and find it in the Intel 4th-gen-core mobile specification update
as HSM31: Performance Monitor UOPS_EXECUTED Event May Undercount.

However, I don't see HSM31 or anything UOPS_EXECUTED-related mentioned in the Intel Xeon E5 v3 specification update for my processor at all.
So is this warning valid?

The CPU of my Haswell test system:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2888.281
BogoMIPS:              5010.61
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Running the latest CentOS 7.2 kernel:

# uname -a
Linux haswell1 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

CYCLE_ACTIVITY.STALLS_L1D_PENDING is always zero

I noticed that level 3 stats printed for memory bound workloads are incorrect on my machine (Xeon E5-2658 v3, Linux 3.19). Here is a sample output with a program that is DRAM bound (Intel MLC):

BE      Backend_Bound:                                90.68% 
BE/Mem  Backend_Bound.Memory_Bound:                   84.30% 
BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:          84.35% 
BE/Mem  Backend_Bound.Memory_Bound.L3_Bound:          22.48% 
BE/Mem  Backend_Bound.Memory_Bound.MEM_Bound:         61.69% 

L1_Bound value is incorrect. I traced the issue to perf always reporting zero for CYCLE_ACTIVITY.STALLS_L1D_PENDING. Here is a sample perf output for that event:

perf stat -I 1000 -e cpu/event=0xa3,umask=0xc,cmask=12/ -a sleep 5
#           time             counts unit events
     1.000206434                  0      cpu/event=0xa3,umask=0xc,cmask=12/
     2.000452095                  0      cpu/event=0xa3,umask=0xc,cmask=12/
     3.000657316                  0      cpu/event=0xa3,umask=0xc,cmask=12/
     4.000875653                  0      cpu/event=0xa3,umask=0xc,cmask=12/
     5.001068298                  0      cpu/event=0xa3,umask=0xc,cmask=12/

With cmask=4, a value that seems correct is returned. I double-checked SDM Vol. 3B and a cmask value of 12 (0xc) should be correct. I understand this is not directly a pmu-tools bug, but I was hoping to hear whether others are affected too.

I can't get pebs-grabber to install

I was able to build and install simple-pebs without issue.

This is on kernel 4.2.0-19

when I try pebs-grabber, I get an error.

dmesg says:
pebs_grabber: PEBS version 2
pebs_grabber: Cannot register kprobe: -2
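
For what it's worth, the error code in that message is -ENOENT; a small sketch to decode it (the reading that the kprobe's target symbol is missing from this kernel is an assumption based on the code alone):

import errno, os

code = 2  # pebs_grabber reported "Cannot register kprobe: -2"
print("%s - %s" % (errno.errorcode[code], os.strerror(code)))
# -> ENOENT - No such file or directory, i.e. the symbol the module tries
#    to attach its kprobe to does not appear to exist in this 4.2 kernel.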

Data collection about HyperThread/SMT

  1. "-p/--pid mode not compatible with SMT. Use sleep in global mode." why? It's affected by perf?
    I had gotten a answer from Intel vtune engineer, SMT can compatible with -p.
  2. when computing Core actual clocks, and smt_enabled is true, however, the value nearly equal between "CPU_CLK_UNHALTED.THREAD:amt1" and ""CPU_CLK_UNHALTED.THREAD". Is it normal phenomenon?

Core actual clocks

def CORE_CLKS(EV, level):
    # :amt1 sets the AnyThread bit, so the event counts cycles in which
    # either thread of the core is unhalted; the division by 2 compensates
    # for both hyperthreads reporting the same (shared) core cycles.
    return (EV("CPU_CLK_UNHALTED.THREAD:amt1", level) / 2) if smt_enabled else CLKS(EV, level)

Thank you! :)

tabs and spaces inconsistent

Hi,
From a fresh checkout from master:
File "pmu-tools/toplev.py", line 147
e = e[:e.find(":")]
^
TabError: inconsistent use of tabs and spaces in indentation

And indeed, sometimes there are tabs, and sometimes spaces, and Python 3 doesn't like it.
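
For locating the offending lines, the tabnanny module that ships with Python can be pointed at the file; a quick sketch:

import tabnanny

# Prints the lines whose tab/space indentation is ambiguous, which is what
# Python 3 rejects with the TabError above.
tabnanny.check("toplev.py")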

toplev crashes on wrong float literal

Everything below was run as root.

# toplev.py -l1 sleep 10
Will measure complete system.
Using level 1.
perf stat -x\; -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,any=1,cmask=1/,cpu/event=0xc2,umask=0x2/}' -A -a sleep 10
Traceback (most recent call last):
  File "/media/dc/B2B200EFB200B9BD/inz/pmu-tools/toplev.py", line 1617, in <module>
    ret = execute(runner, out, rest)
  File "/media/dc/B2B200EFB200B9BD/inz/pmu-tools/toplev.py", line 792, in execute
    env)
  File "/media/dc/B2B200EFB200B9BD/inz/pmu-tools/toplev.py", line 907, in do_execute
    multiplex = float(n[off + 1])
ValueError: invalid literal for float(): 100,00

Below is the result of perf stat ...:

# perf stat -x\; -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,any=1,cmask=1/,cpu/event=0xc2,umask=0x2/}' -A -a sleep 10

CPU0;2241793414;;cpu/event=0x3c,umask=0x0,any=1/;10007127837;100,00
CPU1;2241051974;;cpu/event=0x3c,umask=0x0,any=1/;10007126109;100,00
CPU2;798878574;;cpu/event=0x3c,umask=0x0,any=1/;10007122595;100,00
CPU3;798025029;;cpu/event=0x3c,umask=0x0,any=1/;10007121927;100,00
CPU4;1479869080;;cpu/event=0x3c,umask=0x0,any=1/;10007136940;100,00
CPU5;1479260102;;cpu/event=0x3c,umask=0x0,any=1/;10007135470;100,00
CPU6;1637764499;;cpu/event=0x3c,umask=0x0,any=1/;10007133938;100,00
CPU7;1637043424;;cpu/event=0x3c,umask=0x0,any=1/;10007132916;100,00
CPU0;1778359179;;cpu/event=0xe,umask=0x1/;10007225730;100,00
CPU1;372610005;;cpu/event=0xe,umask=0x1/;10007224789;100,00
CPU2;423892267;;cpu/event=0xe,umask=0x1/;10007221503;100,00
CPU3;159917631;;cpu/event=0xe,umask=0x1/;10007219288;100,00
CPU4;457584393;;cpu/event=0xe,umask=0x1/;10007232528;100,00
CPU5;741543029;;cpu/event=0xe,umask=0x1/;10007230406;100,00
CPU6;1260524783;;cpu/event=0xe,umask=0x1/;10007228798;100,00
CPU7;402408452;;cpu/event=0xe,umask=0x1/;10007227198;100,00
CPU0;3625922836;;cpu/event=0x9c,umask=0x1/;10007284308;100,00
CPU1;153504280;;cpu/event=0x9c,umask=0x1/;10007281630;100,00
CPU2;1325774321;;cpu/event=0x9c,umask=0x1/;10007277765;100,00
CPU3;74342815;;cpu/event=0x9c,umask=0x1/;10007275369;100,00
CPU4;1632602740;;cpu/event=0x9c,umask=0x1/;10007287236;100,00
CPU5;268262892;;cpu/event=0x9c,umask=0x1/;10007284804;100,00
CPU6;2650705954;;cpu/event=0x9c,umask=0x1/;10007284336;100,00
CPU7;154401725;;cpu/event=0x9c,umask=0x1/;10007282072;100,00
CPU0;61193398;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007317348;100,00
CPU1;61193275;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007314139;100,00
CPU2;22095380;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007310171;100,00
CPU3;22095271;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007307061;100,00
CPU4;32942258;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007305560;100,00
CPU5;32942366;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007302497;100,00
CPU6;46513674;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007288378;100,00
CPU7;46513732;;cpu/event=0xd,umask=0x3,any=1,cmask=1/;10007284929;100,00
CPU0;1471503714;;cpu/event=0xc2,umask=0x2/;10007304010;100,00
CPU1;340643421;;cpu/event=0xc2,umask=0x2/;10007300601;100,00
CPU2;357872568;;cpu/event=0xc2,umask=0x2/;10007296155;100,00
CPU3;126618075;;cpu/event=0xc2,umask=0x2/;10007292269;100,00
CPU4;397747237;;cpu/event=0xc2,umask=0x2/;10007290478;100,00
CPU5;628580576;;cpu/event=0xc2,umask=0x2/;10007286803;100,00
CPU6;1041831779;;cpu/event=0xc2,umask=0x2/;10007272537;100,00
CPU7;355401725;;cpu/event=0xc2,umask=0x2/;10007268529;100,00

I am not sure what the cause is, but maybe the locale?

# locale
LANG=pl_PL.utf8
LANGUAGE=en_US
LC_CTYPE="pl_PL.utf8"
LC_NUMERIC="pl_PL.utf8"
LC_TIME="pl_PL.utf8"
LC_COLLATE="pl_PL.utf8"
LC_MONETARY="pl_PL.utf8"
LC_MESSAGES="pl_PL.utf8"
LC_PAPER="pl_PL.utf8"
LC_NAME="pl_PL.utf8"
LC_ADDRESS="pl_PL.utf8"
LC_TELEPHONE="pl_PL.utf8"
LC_MEASUREMENT="pl_PL.utf8"
LC_IDENTIFICATION="pl_PL.utf8"
LC_ALL=pl_PL.utf8

The /usr/bin/python version (shouldn't you use /usr/bin/env python instead?):

Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2

Probably the issue can be solved by setting the locale in Python to the system one:

Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
Type "copyright", "credits" or "license" for more information.

IPython 2.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import locale

In [2]: locale.getdefaultlocale()
Out[2]: ('pl_PL', 'UTF-8')

In [3]: locale.atof("23.3")
Out[3]: 23.3

In [4]: locale.atof("23,3")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-132b6afaec24> in <module>()
----> 1 locale.atof("23,3")

/usr/lib/python2.7/locale.pyc in atof(string, func)
    314         string = string.replace(dd, '.')
    315     #finally, parse the string
--> 316     return func(string)
    317 
    318 def atoi(str):

ValueError: invalid literal for float(): 23,3

In [5]: locale.setlocale(locale.LC_ALL, '.'.join(locale.getdefaultlocale())
   ...: )
Out[5]: 'pl_PL.UTF-8'

In [6]: locale.atof("23,3")
Out[6]: 23.3

In [7]: locale.atof("23.3")
Out[7]: 23.3

So in the end it seems that locale.atof should be used instead of float() when converting these strings.
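
A minimal sketch of that fallback (the helper name is illustrative, not actual toplev code):

import locale

def parse_perf_number(s):
    # perf formats the multiplexing percentage with the current locale, so
    # under pl_PL it prints "100,00"; fall back to locale.atof when the
    # plain float() parse fails.
    try:
        return float(s)
    except ValueError:
        locale.setlocale(locale.LC_NUMERIC, "")  # adopt the system locale
        return locale.atof(s)

parse_perf_number("100,00")  # -> 100.0 under a pl_PL locale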

ocperf.py event naming doesn't correspond to perf's one

Example:

$ ocperf.py record --event offcore_response.all_reads.l3_hit.hitm_other_core sleep 1
$ perf evlist
offcore_response_all_reads_l3_hit_hitm_other_core

So the ocperf.py event name contains dots, while perf's event name contains only underscores.
This confuses tools that invoke perf and prevents using ocperf.py as a drop-in wrapper for perf.

I believe there are a few solutions:

  1. Rename all ocperf.py events
  2. Let users specify perf-style event names (with underscores)
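
A tiny normalization helper in the spirit of option 2, for matching what perf evlist reports (a sketch; the function name is illustrative):

def perf_style(name):
    # ocperf accepts the dotted Intel event names; perf's evlist shows the
    # same event with the dots replaced by underscores.
    return name.replace(".", "_")

perf_style("offcore_response.all_reads.l3_hit.hitm_other_core")
# -> 'offcore_response_all_reads_l3_hit_hitm_other_core'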

ocperf: different values of cpu_clk_unhalted.thread_any within the same core

Hi Andi,

I'm trying to measure the unhalted cycles on a per-core basis. I therefore used ocperf and selected the above-mentioned event.

However, what I'm getting from ocperf is sort of weird. I was expecting to get the same values for the two threads that share the same core, but this does not seem to be true.

$ sudo ./ocperf.py stat -e cpu_clk_unhalted.thread_any -a -A sleep 5
perf stat -e cpu/event=0x3c,umask=0x0,any=1,name=cpu_clk_unhalted_thread_any/ -a -A sleep 5

Performance counter stats for 'system wide':

CPU0 627.912.025 cpu_clk_unhalted_thread_any
CPU1 627.248.055 cpu_clk_unhalted_thread_any
CPU2 529.161.153 cpu_clk_unhalted_thread_any
CPU3 812.752.677 cpu_clk_unhalted_thread_any

   5,001079353 seconds time elapsed

Any hint at what might be the culprit here?

Some info on my system:

OS: Ubuntu 14.04
CPU : Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
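
One thing worth checking (a sketch; whether it explains the spread above is not certain) is which logical CPUs are actually SMT siblings on this machine, since the pairing is not always 0/1 and 2/3:

# Print, for each logical CPU, the CPUs that share its physical core.
for cpu in range(4):
    path = "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list" % cpu
    with open(path) as f:
        print("cpu%d: %s" % (cpu, f.read().strip()))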

Thank you!

Cannot run toplev.py

I'm running pmu-tools on Intel Xeon E5-2660 (Sandy Bridge). ocperf.py runs fine, but toplev.py always gives me an error "IndexError: list index out of range".

Traceback (most recent call last):
  File "./pmu-tools/toplev.py", line 765, in <module>
    sys.exit(execute(runner.evnum, runner, out, rest))
  File "./pmu-tools/toplev.py", line 461, in execute
    runner.print_res(res[j], rev[j], out, interval, j)
  File "./pmu-tools/toplev.py", line 654, in print_res
    obj.compute(lambda e, level:
  File "/home/fei/pmu-tools/simple_ratios.py", line 36, in compute
    self.val = EV("IDQ_UOPS_NOT_DELIVERED.CORE", 1) / SLOTS(EV)
  File "./pmu-tools/toplev.py", line 655, in <lambda>
    lookup_res(res, rev, e, obj.res_map[(e, level)]))
  File "./pmu-tools/toplev.py", line 482, in lookup_res
    return res[index]
IndexError: list index out of range

KeyError: 'Description'

Hi,
I just tried pmu-tools/ocperf.py on a Haswell box:

$ ./ocperf.py
Traceback (most recent call last):
  File "./ocperf.py", line 774, in <module>
    emap = find_emap()
  File "./ocperf.py", line 599, in find_emap
    emap = json_with_extra(el)
  File "./ocperf.py", line 557, in json_with_extra
    add_extra_env(emap, el)
  File "./ocperf.py", line 574, in add_extra_env
    emap.add_uncore(uc)
  File "./ocperf.py", line 551, in add_uncore
    self.uncore_events[name] = UncoreEvent(name, row)
  File "./ocperf.py", line 241, in __init__
    e.desc = row['Description'].strip()
KeyError: 'Description'

It's trying to open ${HOME}/.cache//pmu-events/GenuineIntel-6-3F-uncore.json, which does not contain a Description field at any point.
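
One possible workaround (a sketch, not an upstream fix) would be to fall back to an empty description when the uncore JSON lacks the field, i.e. in ocperf.py's UncoreEvent constructor:

    e.desc = row.get('Description', '').strip()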

Any ideas on how to proceed?
Best -

$ uname -a
Linux islay.mpi-cbg.de 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.1.1503 (Core) 
Release:        7.1.1503
Codename:       Core
$ cat /proc/cpuinfo|grep -i "name"|head -n1
model name      : Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

Puzzling about ICache_Misses

1. Why isn't the metric "ICache_Misses" included in jkt_server_ratios.py?
2. What is your thinking on the difference between the "ICache Misses" formula and VTune's?
   In VTune the "ICache Misses" formula is event("ICACHE.MISSES") / query("InstructionsRetired");
   in pmu-tools it is EV("ICACHE.IFETCH_STALL", 3) / CLKS(EV, 3) - ITLB_Miss_Cycles(EV, 3) / CLKS(EV, 3).

Thank you!

NameError: global name 'sample_regs_user' is not defined

perf record -b --call-graph dwarf -- sleep 3
python perfdata.py perf.data

Traceback (most recent call last):
  File "/home/ubuntu/Source/pmu-tools/parser/perfdata.py", line 575, in <module>
    h = perf_file.parse_stream(f)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 197, in parse_stream
    return self._parse(stream, Container())
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 661, in _parse
    subobj = sc._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 661, in _parse
    subobj = sc._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 960, in _parse
    obj = self.subcon._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 287, in _parse
    return self._decode(self.subcon._parse(stream, context), context)
  File "/usr/lib/python2.7/dist-packages/construct/adapters.py", line 261, in _decode
    return self.inner_subcon._parse(BytesIO(obj), context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 519, in _parse
    obj.append(self.subcon._parse(stream, context))
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 659, in _parse
    sc._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 840, in _parse
    obj = self.cases.get(key, self.default)._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 270, in _parse
    return self.subcon._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 661, in _parse
    subobj = sc._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 840, in _parse
    obj = self.cases.get(key, self.default)._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 661, in _parse
    subobj = sc._parse(stream, context)
  File "/usr/lib/python2.7/dist-packages/construct/core.py", line 430, in _parse
    count = self.countfunc(context)
  File "/home/ubuntu/Source/pmu-tools/parser/perfdata.py", line 127, in <lambda>
    Array(lambda ctx: sample_regs_user,
NameError: global name 'sample_regs_user' is not defined

ocperf.py misparses "long" --event option

This fails:

$ ./ocperf.py record --event mem_load_uops_retired.l1_hit echo 1
perf record --event mem_load_uops_retired.l1_hit echo 1
event syntax error: 'mem_load_uops_retired.l1_hit'
                     \___ parser error
Run 'perf list' for a list of valid events

 Usage: perf record [<options>] [<command>]
    or: perf record [<options>] -- <command> [<options>]

    -e, --event <event>   event selector. use 'perf list' to list available events

Passes:

$ ocperf.py record -e mem_load_uops_retired.l1_hit echo 1

The parsing code in ocperf.py does not handle long "--event" properly, apparently:

        elif sys.argv[i][0:2] == '-e':  # <--- oops, this is not for "--event"
            event, i, prefix = getarg(i, cmd)
            event, overflow = process_events(event, print_only,
                                             True if record == yes else False)
            cmd.append(prefix + event)
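
A minimal sketch of one way to also catch the long form (assuming getarg() and process_events() behave as in the excerpt above; the '--event=NAME' form would still need separate handling):

        elif sys.argv[i][0:2] == '-e' or sys.argv[i].startswith('--event'):
            event, i, prefix = getarg(i, cmd)
            event, overflow = process_events(event, print_only,
                                             True if record == yes else False)
            cmd.append(prefix + event)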

toplev.py fails with --level 2 on haswell

[tgrabiec@muninn ~]$ toplev.py -C 0 sleep 2 --level 2
Using level 2.
perf stat -x, -e '{cycles,cpu/event=0xc2,umask=0x2/,ref-cycles,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/},{cpu/event=0xa2,umask=0x8/,cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0x9c,umask=0x1/,cpu/event=0x9c,umask=0x1,cmask=4/,cycles,instructions},{cpu/event=0xe,umask=0x1/,cycles,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},{cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cycles,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cycles,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xb1,umask=0x1,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xb1,umask=0x1,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},{cpu/event=0xa3,umask=0x6,cmask=6/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0x5e,umask=0x1/,instructions}' --cpu 0 sleep 2
Traceback (most recent call last):
  File "/home/tgrabiec/src/pmu-tools/toplev.py", line 950, in <module>
    ret = execute(runner, out, rest)
  File "/home/tgrabiec/src/pmu-tools/toplev.py", line 511, in execute
    runner.print_res(res[j], rev[j], out, interval, j, env)
  File "/home/tgrabiec/src/pmu-tools/toplev.py", line 806, in print_res
    obj.compute(lambda e, level:
  File "/home/tgrabiec/src/pmu-tools/hsw_client_ratios.py", line 713, in compute
    self.val = BackendBoundAtEXE(EV, 2)- self.MemoryBound.compute(EV )
  File "/home/tgrabiec/src/pmu-tools/hsw_client_ratios.py", line 30, in BackendBoundAtEXE
    return BackendBoundAtEXE_stalls(EV, level) / CLKS(EV, level)
  File "/home/tgrabiec/src/pmu-tools/hsw_client_ratios.py", line 28, in BackendBoundAtEXE_stalls
    return ( EV("CYCLE_ACTIVITY.CYCLES_NO_EXECUTE", level) + EV("UOPS_EXECUTED.CYCLES_GE_1_UOPS_EXEC", level) - FewUopsExecutedThreshold(EV, level) - EV("RS_EVENTS.EMPTY_CYCLES", level) + EV("RESOURCE_STALLS.SB", level) )
  File "/home/tgrabiec/src/pmu-tools/toplev.py", line 807, in <lambda>
    lookup_res(res, rev, e, obj, env, level))
  File "/home/tgrabiec/src/pmu-tools/toplev.py", line 631, in lookup_res
    assert event_rmap(rev[index]) == canon_event(ev)
AssertionError

--level 1 seems to work:

[tgrabiec@muninn ~]$ toplev.py -C 0 sleep 2 --level 1
WARNING: HT enabled
Measuring multiple processes/threads on the same core may is not reliable.
Using level 1.
perf stat -x, -e '{cycles,cpu/event=0xc2,umask=0x2/,ref-cycles,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/}' --cpu 0 sleep 2
Backend Bound:                                 49.06% 
    This category reflects slots where no uops are being delivered due to a lack
    of required resources for accepting more uops in the Backend of the pipeline.
Frequency:                                      1.12 metric
    Frequency in Ghz

Fails to compile in CentOS 6.5

addr.c uses the macro PERF_ATTR_SIZE_VER1, which isn't available in CentOS's version of /usr/include/linux/perf_events.h. However, it does have PERF_ATTR_SIZE_VER0, and compiles correctly when that is changed. Not sure if that will cause issues with usage, however.

Also, addr.c doesn't compile with gcc 4.4.7, and I'm reasonably certain that it's because that file uses an anonymous union in a struct, which that version of GCC doesn't support (not even with -std=c11, which isn't a supported option in this version). Upgrading my GCC to one that was compiled in this decade fixes the issue, without any source code modification other than the replaced macro above.
