---
title: FlameGraph
description:
published: true
date: 2021-08-23T18:30:49.760Z
tags: flame graph
editor: markdown
dateCreated: 2021-08-23T17:06:44.193Z
---
# Flame Graph
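A flame graph visualizes the sampled runtime data recorded by the perf command, which makes it quick and accurate to find the most frequently executed code paths and to analyze a program's runtime state. The commonly used performance flame graphs fall into the following categories: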
- CPU
- Memory
- Off-CPU
- Hot/Cold
- Differential
![out.svg](/out.svg)
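The figure above is a flame graph generated from perf samples of the sshd process.
# Overview
In a flame graph, the y-axis represents call stack depth, and the x-axis represents how often each call stack was sampled; it is sorted alphabetically and does not show the passage of time. A flame graph can be understood as taking every sampled call stack, stacking its frames from bottom to top, and then sorting the stacks alphabetically from left to right. Each rectangle represents a stack frame. The wider a rectangle, the more often that frame appeared in the sampled stacks. The frame at the top is the function that was executing on the CPU when the sample was taken, and the frames below it are its call stack. Colors carry no specific meaning, although some color conventions exist. A flame graph is different from the Flame chart in the Chrome browser, whose x-axis is time.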
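
The relationship between sample counts and rectangle widths can be illustrated with a small sketch. It assumes the input is already in the folded format read by flamegraph.pl (frames joined by `;`, followed by a sample count); the function name `frame_widths` and the sample data are hypothetical and are not part of the FlameGraph tools.

```bash
# frame_widths: read folded stacks ("frame1;frame2;... count") on stdin and
# print, for every stack prefix, the share of all samples that contain it.
# That share is what the rectangle width represents.
frame_widths() {
  awk '
    {
      count = $NF                       # last field is the sample count
      stack = $0
      sub(/ [0-9]+$/, "", stack)        # strip the trailing count
      n = split(stack, frames, ";")
      total += count
      prefix = ""
      for (i = 1; i <= n; i++) {        # credit every ancestor prefix
        prefix = (i == 1) ? frames[i] : prefix ";" frames[i]
        width[prefix] += count
      }
    }
    END {
      for (p in width)
        printf "%s %.1f%%\n", p, 100 * width[p] / total
    }
  ' | LC_ALL=C sort                     # alphabetical order, as on the x-axis
}

# Hypothetical data: 3 samples each with c, b, or a on top of the stack.
printf '%s\n' 'main;c 3' 'main;c;b 3' 'main;c;b;a 3' | frame_widths
```

With this input, `main` and `main;c` cover all samples, `main;c;b` covers two thirds, and `main;c;b;a` one third, which is the staircase shape a real flame graph of such a program would show.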
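Recommended video; the speaker is [Brendan Gregg](http://www.brendangregg.com/index.html), the developer of flame graphs:
[https://youtu.be/D53T1Ejig1Q](https://youtu.be/D53T1Ejig1Q)
A flame graph is a tool for visualizing sampled data that includes call stacks. The profile data can be generated with the following tools on different platforms: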
- Linux: perf, eBPF, SystemTap, and ktap
- Mac OS X: DTrace and Instruments
- Windows: Xperf.exe
- Solaris, illumos, FreeBSD: DTrace
# Experiment
The following code is used to produce a sample flame graph with the scripts from [https://github.com/brendangregg/FlameGraph](https://github.com/brendangregg/FlameGraph):
```c
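#include <stdio.h>

#define COUNT 1000000

/* Each function burns CPU in a busy loop and then calls the next one,
   so samples land on the stacks main->c, main->c->b and main->c->b->a.
   The volatile counter keeps the loops from being optimized away. */
void a(void) {
    for (volatile int i = 0; i < COUNT; i++);
}

void b(void) {
    for (volatile int i = 0; i < COUNT; i++);
    a();
}

void c(void) {
    for (volatile int i = 0; i < COUNT; i++);
    b();
}

int main(void) {
    while (1) {   /* run until killed so that perf can attach */
        c();
    }
}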
```
The commands are as follows:
```bash
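git clone https://github.com/brendangregg/FlameGraph   # provides the stackcollapse and flamegraph scripts
gcc flamegraph.c                                       # produces ./a.out
./a.out &                                              # run in the background so the shell stays usable
ps -aux | grep a.out                                   # find the PID
sudo perf record -F 99 -p pid -g -- sleep 60           # replace pid with the actual PID; samples at 99 Hz for 60 s
sudo perf report -i perf.data
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg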
```
`perf report -i perf.data` shows the percentage of CPU time per function, but this view does not show the detailed call stacks:
![untitled.png](/untitled.png)
Fold the call stacks in the perf sample data, then generate the flame graph:
[perf.data](/perf.data)
[perf.unfold](/perf.unfold)
[perf.fold](/perf.fold)
```text
all.main.c
all.main.c
all.main.c
all.main.c.b
all.main.c.b
all.main.c.b
all.main.c.b.a
all.main.c.b.a
all.main.c.b.a
```
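
The folding step can be sketched in a few lines of shell. This is a simplified illustration, not what stackcollapse-perf.pl actually does: it assumes each sample has already been reduced to one line with frames joined by `;`, whereas real `perf script` output is multi-line. The function name `fold_stacks` is hypothetical. The output uses the `stack count` format that flamegraph.pl reads.

```bash
# fold_stacks: collapse identical stacks (one sample per line, frames joined
# by ";") into "stack count" lines.
fold_stacks() {
  LC_ALL=C sort | uniq -c |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print $0 " " count }'
}

# Hypothetical samples: the stack main;c was seen twice.
printf '%s\n' 'main;c' 'main;c;b;a' 'main;c' 'main;c;b' | fold_stacks
```

Identical stacks become a single line with a count, so the folded file is usually far smaller than the raw sample data.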
![out_(2).svg](/out_(2).svg)
# Perf
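perf is the performance analysis tool on Linux. It can guide algorithm optimization (space and time complexity) and code optimization (faster execution, lower memory use); evaluate a program's use of hardware resources, such as the number of accesses and misses at each cache level, pipeline stall cycles, and front-side bus accesses; and evaluate its use of operating system resources, such as the number of system calls, context switches, and task migrations (between CPUs). It works by analyzing data recorded by the hardware performance counters in the CPU, tracepoints embedded in kernel code, and low-level events based on kernel counters.
- Performance counters are dedicated hardware registers in the CPU that can count events such as cache misses, mispredicted branches, and instructions executed. This data can be used to trace program flow and identify code hotspots.
- Tracepoints are hooks pre-placed in Linux kernel code, for example at system calls, TCP/IP events, and file system operations. Tracepoints have some performance cost when enabled and are off by default; the perf command can enable them to collect timestamps and stack traces. perf can also create tracepoints dynamically through the kprobes and uprobes frameworks to analyze kernel-space and user-space behavior.
- Software events, for example CPU migrations and page faults.
### Common commands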
- [perf stat](https://perf.wiki.kernel.org/index.php/Tutorial#Counting_with_perf_stat): obtain event counts
- [perf record](https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record): record events for later reporting
- [perf report](https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report): break down events by process, function, etc.
- [perf annotate](https://perf.wiki.kernel.org/index.php/Tutorial#Source_level_analysis_with_perf_annotate): annotate assembly or source code with event counts
- [perf top](https://perf.wiki.kernel.org/index.php/Tutorial#Live_analysis_with_perf_top): see live event count
- [perf bench](https://perf.wiki.kernel.org/index.php/Tutorial#Benchmarking_with_perf_bench): run different kernel microbenchmarks
### Questions it can answer
- Why is the kernel on-CPU so much? What code-paths?
- Which code-paths are causing CPU level 2 cache misses?
- Are the CPUs stalled on memory I/O?
- Which code-paths are allocating memory, and how much?
- What is triggering TCP retransmits?
- Is a certain kernel function being called, and how often?
- For what reasons are threads leaving the CPU?
### perf_events
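- Task-clock-msecs: CPU utilization. A high value means the program spends most of its time on CPU computation rather than IO.
- Context-switches: the number of context switches that occurred while the program ran. Frequent context switching should be avoided.
- CPU-migrations: how many times the process was moved by the scheduler from one CPU to another while it ran.
- Cycles: processor clock cycles. A single machine instruction may take several cycles.
- Instructions: the number of machine instructions.
- IPC: the ratio Instructions/Cycles. Higher is better; a high value means the program makes good use of the processor.
- Cache-references: the number of cache accesses (hits and misses together, not hits alone).
- Cache-misses: the number of cache misses, which reflects overall cache utilization during the run. A high value means the program uses the cache poorly.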
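
As a small worked example of the derived metrics in this list, the sketch below computes IPC and the cache miss ratio from raw counter values. The helper names `ipc` and `miss_ratio` and the counter values are hypothetical; in practice the counts would come from something like `perf stat -e cycles,instructions,cache-references,cache-misses`.

```bash
# ipc INSTRUCTIONS CYCLES: instructions executed per clock cycle.
ipc() {
  awk -v ins="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", ins / cyc }'
}

# miss_ratio CACHE_MISSES CACHE_REFERENCES: percentage of cache references that missed.
miss_ratio() {
  awk -v miss="$1" -v ref="$2" 'BEGIN { printf "%.2f%%\n", 100 * miss / ref }'
}

# Hypothetical counter values, for illustration only.
ipc 3000000000 2000000000      # prints 1.50
miss_ratio 5000000 200000000   # prints 2.50%
```

Reading the miss count against the reference count is usually more informative than the raw miss count, because a program that touches memory more often will naturally miss more often.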
![http://www.brendangregg.com/perf_events/perf_events_map.png](http://www.brendangregg.com/perf_events/perf_events_map.png)
# Background knowledge
### Symbols
Symbol table
### JIT Symbols (Java, Node.js)
### Stack Traces
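Performance tuning tools such as perf and OProfile all work by sampling the monitored target. The simplest case is sampling driven by the tick interrupt: a sample point is triggered inside the tick interrupt, and the sample records the context the program was in at that moment. If a program spends 90% of its time in function foo(), then about 90% of the sample points should fall within the context of foo(). Individual samples are subject to chance, but as long as the sampling frequency is high enough and the sampling period is long enough, this inference is fairly reliable. Sampling on ticks therefore shows where the program spends most of its time, so the analysis can focus there.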
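
The effect of sample count on this estimate can be illustrated with a small simulation. This is only a statistical sketch under the assumption that each sample independently lands in foo() with a fixed probability; it does not involve perf at all, and the function name `simulate_sampling` is hypothetical.

```bash
# simulate_sampling SAMPLES FRACTION [SEED]: draw SAMPLES random samples, each
# landing in foo() with probability FRACTION, and print the observed share (%).
simulate_sampling() {
  awk -v n="$1" -v p="$2" -v seed="${3:-1}" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++)
      if (rand() < p) hits++
    printf "%.1f\n", 100 * hits / n
  }'
}

simulate_sampling 100 0.9       # few samples: the estimate can be off by several points
simulate_sampling 100000 0.9    # many samples: the estimate settles close to 90
```

The exact numbers depend on the awk implementation's random number generator, but the larger run should stay much closer to 90 than the smaller one.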
perf_events is an event-oriented observability tool that can help with advanced performance analysis and troubleshooting.
Typical uses include algorithm optimization (space and time complexity) and code optimization (faster execution, lower memory use); evaluating a program's use of hardware resources, such as the number of accesses and misses at each cache level, pipeline stall cycles, and front-side bus accesses; and evaluating its use of operating system resources, such as the number of system calls, context switches, and task migrations. See [Notes on the perf code-tuning tool, carterzhang, cnblogs.com](https://www.cnblogs.com/carterzhang/p/6184342.html).
perf_events is part of the Linux kernel, under tools/perf. While it uses many Linux tracing features, some are not yet exposed via the perf command and need to be used via the ftrace interface instead. Brendan Gregg's [perf-tools](https://github.com/brendangregg/perf-tools) collection (GitHub) uses both perf_events and ftrace as needed.
# Background
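### Symbol table
perf tracing depends on debug information (symbols). The debug symbol table translates hexadecimal memory addresses into the corresponding function names (and, with full debug information, their arguments).
For the kernel, installing the debug package that matches the kernel version solves this; you can also compile the kernel source yourself with debug information enabled. For user space, installing the program's debug symbol package also works; alternatively, build the program from source yourself and do not strip the debug symbols.
To check whether your kernel supports debug symbols, run the following (the config file name depends on the installed kernel version):
`cat /boot/config-2.6.32-642.4.2.el6.x86_64 | grep CONFIG_KALLSYMS`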
```
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
```
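
The same check can be written so that it follows the running kernel instead of a hard-coded version. This is a sketch that assumes the distribution installs the kernel config as `/boot/config-<release>`, which is common but not universal; the function name `has_kallsyms` is hypothetical.

```bash
# has_kallsyms CONFIG_FILE: succeed if the kernel config enables CONFIG_KALLSYMS.
has_kallsyms() {
  grep -q '^CONFIG_KALLSYMS=y' "$1"
}

# Assumption: the config of the running kernel lives under /boot.
config="/boot/config-$(uname -r)"
if [ -r "$config" ] && has_kallsyms "$config"; then
  echo "kernel symbols available"
else
  echo "CONFIG_KALLSYMS not confirmed for $(uname -r)"
fi
```

On systems without a config file under /boot, the second message only means the check could not be made, not that symbols are missing.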
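### Stack frames
Optimized programs omit the frame pointer. Without frame pointers the stack cannot be walked reliably, and some symbols are not displayed correctly.
Since kernel 3.9, perf_events can work around missing frame pointers in user-space programs by unwinding with DWARF (libunwind). This is a perf option rather than a compiler option: pass `--call-graph dwarf` (`-g dwarf` in older perf versions) to `perf record`.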
For user space, compile with:
```
-fno-omit-frame-pointer
```
For the kernel, build with the option:
```
CONFIG_FRAME_POINTER=y
```
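
How the two stack-walking approaches translate into `perf record` invocations can be sketched as follows. `--call-graph fp` and `--call-graph dwarf` are existing perf options; the helper `record_cmd` is hypothetical and only prints the command line as a dry run, and the sampling rate and duration are copied from the experiment above.

```bash
# record_cmd MODE PID: print the perf record command for a stack-walking mode.
#   fp    - frame pointers (binary built with -fno-omit-frame-pointer)
#   dwarf - DWARF unwinding (binary built with debug info; frame pointers not needed)
record_cmd() {
  case "$1" in
    fp)    echo "perf record -F 99 -p $2 --call-graph fp -- sleep 60" ;;
    dwarf) echo "perf record -F 99 -p $2 --call-graph dwarf -- sleep 60" ;;
    *)     echo "usage: record_cmd fp|dwarf PID" >&2; return 1 ;;
  esac
}

# 1234 is a placeholder PID.
record_cmd dwarf 1234
```

DWARF unwinding copies part of the user stack with every sample, so it tends to produce larger perf.data files than frame-pointer walking.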
# Frame pointer omission
```bash
gcc flamegraph.c -o omitframepointer.out -fomit-frame-pointer
```
![omitframepointer.svg](/omitframepointer.svg)
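With frame pointer omission enabled, the call stack is not displayed correctly because stack walking fails.
# Java
Two problems stand in the way of Java flame graphs: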
1. The JVM compiles methods on the fly (just-in-time: JIT), and doesn't expose a traditional symbol table for system profilers to read.
2. The JVM also uses the frame pointer register (RBP on x86-64) as a general purpose register, breaking traditional stack walking.
Solutions to the two problems above:
1. A JVMTI agent, [perf-map-agent](https://github.com/jrudolph/perf-map-agent), which can provide a Java symbol table for perf to read (/tmp/perf-PID.map).
2. Patching JDK hotspot to reintroduce the frame pointer register, which allows full stack walking.
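
To make the role of the map file concrete, the sketch below resolves an address against a file in the perf map format, where each line is `START SIZE symbolname` with START and SIZE as hexadecimal numbers without a `0x` prefix. The function name `lookup_symbol`, the addresses, and the symbol names are hypothetical; perf does this lookup internally when it finds `/tmp/perf-PID.map`.

```bash
# lookup_symbol MAP_FILE HEX_ADDR: print the symbol whose address range
# [START, START + SIZE) contains HEX_ADDR; fail if none matches.
lookup_symbol() {
  addr=$((0x$2))
  while read -r start size name; do
    [ -n "$size" ] || continue          # skip blank or malformed lines
    lo=$((0x$start))
    hi=$((lo + 0x$size))
    if [ "$addr" -ge "$lo" ] && [ "$addr" -lt "$hi" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done < "$1"
  return 1
}

# Hypothetical map contents, for illustration only.
map=$(mktemp)
printf '%s\n' \
  '7f0000001000 40 Ljava/lang/String;::hashCode' \
  '7f0000001040 80 Lcom/example/Main;::run' > "$map"
lookup_symbol "$map" 7f0000001050       # falls inside the second entry
rm -f "$map"
```

Because JIT-compiled methods can be recompiled and moved, a map file is only a snapshot, so it is usually generated close to the time of profiling.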
# Flame Graph
- How to generate a flame graph
- What the depth and width of a flame graph mean
- CPU, memory, Off-CPU, and Hot/Cold flame graphs
# Problems solved in production
![cpu-perf-mirror-perftest-group_100018672300qps.svg](/cpu-perf-mirror-perftest-group_100018672300qps.svg)
![cpu-perf-mirror-orgarea-rb_100018672300qps.svg](/cpu-perf-mirror-orgarea-rb_100018672300qps.svg)
![mem-perf-mirror-perftest-group_100018672300qps.svg](/mem-perf-mirror-perftest-group_100018672300qps.svg)
![mem-perf-mirror-orgarea-rb_100018672300qps.svg](/mem-perf-mirror-orgarea-rb_100018672300qps.svg)
# References:
- [http://www.brendangregg.com/perf.html](http://www.brendangregg.com/perf.html)
- [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Node.js)
- [http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html](http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)
- [http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html](http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html)
- [http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html](http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html)
- [http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html](http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html)
- [http://www.brendangregg.com/ebpf.html](http://www.brendangregg.com/ebpf.html)
- [http://www.brendangregg.com/flamegraphs.html](http://www.brendangregg.com/flamegraphs.html)
- [https://netflixtechblog.com/java-in-flames-e763b3d32166](https://netflixtechblog.com/java-in-flames-e763b3d32166)
- [https://medium.com/@maheshsenni/java-performance-profiling-using-flame-graphs-e29238130375](https://medium.com/@maheshsenni/java-performance-profiling-using-flame-graphs-e29238130375)
- [http://engineering.conversantmedia.com/technology/2016/12/01/java-memory-allocation-flamegraph/](http://engineering.conversantmedia.com/technology/2016/12/01/java-memory-allocation-flamegraph/)
- [https://tech.meituan.com/2020/10/22/java-jit-practice-in-meituan.html](https://tech.meituan.com/2020/10/22/java-jit-practice-in-meituan.html)
- [http://www.trueeyu.com/2014/10/31/fno-omit-frame-pointer/](http://www.trueeyu.com/2014/10/31/fno-omit-frame-pointer/)
- [https://www.cnblogs.com/carterzhang/p/6184342.html](https://www.cnblogs.com/carterzhang/p/6184342.html)
- [1Lzot5BYTI7pmbKdPd9w-5mLScFhQBKGV?usp=sharing](https://drive.google.com/drive/folders/1Lzot5BYTI7pmbKdPd9w-5mLScFhQBKGV?usp=sharing)