ES Cluster Stability as Seen from CMS Garbage Collector Logs

ES heap memory configuration

  • -Xms30g -Xmx30g
  • -XX:CMSInitiatingOccupancyFraction=75
  • Check the actual heap allocation sizes (the arithmetic is sketched right after this list)
    • /usr/local/java8/bin/jstat -gccapacity 17394 1000
    • survivor: 629120.0 KB, ~629 MB
    • eden: 5033216.0 KB, ~5 GB
    • old: 25165824.0 KB, ~25 GB
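
The jstat numbers line up with what later shows up in the GC log. A minimal sketch of that arithmetic in Python (values copied from the jstat output above; TargetSurvivorRatio=50 is the JVM default and is an assumption here, it is not set explicitly in our flags):

# Generation capacities reported by `jstat -gccapacity`, in KB.
SURVIVOR_KB = 629120.0
EDEN_KB = 5033216.0
OLD_KB = 25165824.0

# The young-generation capacity printed by ParNew counts eden plus only
# one survivor space (the other survivor is the copy target).
young_kb = EDEN_KB + SURVIVOR_KB
print(young_kb)             # 5662336.0 -> matches "(5662336K)" in the logs below

# "Desired survivor size" = survivor capacity * TargetSurvivorRatio (default 50%).
TARGET_SURVIVOR_RATIO = 50  # JVM default, assumed
desired_bytes = int(SURVIVOR_KB * 1024 * TARGET_SURVIVOR_RATIO / 100)
print(desired_bytes)        # 322109440 -> matches "Desired survivor size" below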

A normal concurrent CMS GC cycle

CMS reduces stop-the-world (STW) time by pushing most of its work into concurrent phases, and it does this part very well.

A few key timing fields in the log:

  • 0.138/0.138 secs: the elapsed time of the phase and the corresponding wall-clock time;
  • user: total CPU time consumed by the GC threads during the collection;
  • sys: time spent in system calls or waiting for system events;
  • real: the wall-clock time for which the application was paused. Because the GC threads run in parallel, real is less than (user + sys); with a single GC thread, real would be close to (user + sys). A quick worked example follows this list.
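
As a sanity check of these fields, the ParNew entry in the log below reports user=2.19, sys=0.03, real=0.12. A rough estimate of the effective GC parallelism (a sketch, not an exact thread count):

# Times copied from the ParNew line in the log below.
user, sys_time, real = 2.19, 0.03, 0.12

# With parallel GC threads, real < user + sys; the ratio roughly reflects
# how many GC threads were doing useful work during the pause.
print((user + sys_time) / real)   # ~18.5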

Below is a GC log excerpt of ParNew working together with CMS:

# An allocation failure in the young generation triggers a ParNew (young) GC
2021-03-02T07:44:58.434+0800: 807042.036: [GC (Allocation Failure) 807042.036: [ParNew
# The desired survivor size is about 322 MB. With the heap configuration above, a survivor space is 600+ MB, so why is the desired size only half of that? Because Desired survivor size = survivor capacity * TargetSurvivorRatio, and TargetSurvivorRatio defaults to 50%.
Desired survivor size 322109440 bytes, new threshold 1 (max 6)
# Objects of age 1 occupy about 640 MB, which already exceeds the desired survivor size, so the tenuring threshold drops to 1 and these objects will be promoted to the old generation at the next young GC
- age 1: 640934736 bytes, 640934736 total
# Young gen: before -> after (young gen capacity), pause time; whole heap: before -> after (total capacity), total time
: 5662336K->629120K(5662336K), 0.1170999 secs] 23497082K->19192581K(30828160K), 0.1175795 secs] [Times: user=2.19 sys=0.03, real=0.12 secs]
# After the ParNew collection above, the young generation (~5 GB) holds 629 MB and the old generation (~25 GB) holds close to 20 GB, crossing the CMSInitiatingOccupancyFraction=75 threshold, so a CMS cycle starts
# STW: initial mark phase
2021-03-02T07:45:02.448+0800: 807046.051: [GC (CMS Initial Mark) [1 CMS-initial-mark: 19230527K(25165824K)] 19860050K(30828160K), 0.0077555 secs] [Times: user=0.08 sys=0.01, real=0.01 secs]
# Concurrent (no STW): concurrent mark phase
2021-03-02T07:45:02.457+0800: 807046.059: [CMS-concurrent-mark-start]
2021-03-02T07:45:04.725+0800: 807048.327: [CMS-concurrent-mark: 2.269/2.269 secs] [Times: user=25.46 sys=0.81, real=2.27 secs]
# Concurrent (no STW): preclean phase
2021-03-02T07:45:04.725+0800: 807048.328: [CMS-concurrent-preclean-start]
2021-03-02T07:45:04.896+0800: 807048.498: [CMS-concurrent-preclean: 0.162/0.170 secs] [Times: user=0.44 sys=0.02, real=0.17 secs]
# Concurrent (no STW): abortable preclean phase
2021-03-02T07:45:04.896+0800: 807048.498: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2021-03-02T07:45:10.286+0800: 807053.888: [CMS-concurrent-abortable-preclean: 5.196/5.390 secs] [Times: user=23.80 sys=0.94, real=5.39 secs]
# STW: final remark phase
2021-03-02T07:45:50.010+0800: 807093.612: [GC (CMS Final Remark) [YG occupancy: 2388409 K (5662336 K)]807093.612: [Rescan (parallel) , 0.0570763 secs]807093.669: [weak refs processing, 0.0002759 secs]807093.669: [class unloading, 0.0418367 secs]807093.711: [scrub symbol table, 0.0110873 secs]807093.722: [scrub string table, 0.0018460 secs][1 CMS-remark: 19475169K(25165824K)] 21863578K(30828160K), 0.1166112 secs] [Times: user=1.35 sys=0.00, real=0.11 secs]
# Concurrent (no STW): concurrent sweep phase
2021-03-02T07:45:50.127+0800: 807093.729: [CMS-concurrent-sweep-start]
2021-03-02T07:45:53.535+0800: 807097.137: [CMS-concurrent-sweep: 3.270/3.408 secs] [Times: user=35.67 sys=1.43, real=3.41 secs]
# Concurrent (no STW): concurrent reset phase
2021-03-02T07:45:53.535+0800: 807097.137: [CMS-concurrent-reset-start]
2021-03-02T07:45:53.601+0800: 807097.204: [CMS-concurrent-reset: 0.066/0.066 secs] [Times: user=0.35 sys=0.03, real=0.07 secs]
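
The occupancy that drives this cycle can be checked directly against the log lines above (a small sketch; all values are in KB and copied from the log):

# Whole-heap and young-gen usage right after the ParNew collection (KB).
heap_used_after_parnew = 19192581
young_used_after_parnew = 629120
old_capacity = 25165824

old_used = heap_used_after_parnew - young_used_after_parnew
print(old_used, old_used / old_capacity)    # ~18.6M KB, ~73.8%

# A few seconds later the initial mark reports the old gen at 19230527K,
# i.e. ~76.4%, above CMSInitiatingOccupancyFraction=75, so the CMS cycle starts.
print(19230527 / old_capacity)              # ~0.764
print(old_capacity * 75 // 100)             # 18874368 KB threshold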

The log above shows a normal collection cycle: when the young generation can no longer hold its surviving objects they are promoted to the old generation, and once the old generation reaches the CMS-configured threshold an old-generation collection is triggered. This is exactly where CMS shines: most of a GC cycle runs concurrently with the application threads, the impact on the application is small, and for a typical application it is essentially unnoticeable.

There is, however, one situation where GC hurts the application badly. CMS has two significant weaknesses: it cannot handle floating garbage, and, being a pure mark-sweep collector, it lets old-generation fragmentation grow as collections go on. If, while an old GC is in progress, the objects promoted by the next young GC no longer fit into the old generation, a concurrent mode failure occurs and the JVM falls back to a single-threaded full GC with all application threads paused. On a large heap, such as the 30 GB heap on our ES machines, one such full GC takes 20+ seconds, which is a huge hit for an online service.

concurrent mode failure

Because the CMS collector cannot handle floating garbage, a "Concurrent Mode Failure" can occur and force another, serial Full GC. Below is a GC log excerpt in which this happens; the GC paused the normal ES threads for more than 26 seconds:

2021-02-23T07:01:47.432+0800: 199651.034: [GC (CMS Initial Mark) [1 CMS-initial-mark: 19184059K(25165824K)] 19829584K(30828160K), 0.0135926 secs] [Times: user=0.10 sys=0.00, real=0.01 secs]
2021-02-23T07:01:47.446+0800: 199651.048: Total time for which application threads were stopped: 0.0187626 seconds, Stopping threads took: 0.0026169 seconds
2021-02-23T07:01:47.446+0800: 199651.048: [CMS-concurrent-mark-start]
2021-02-23T07:01:49.548+0800: 199653.151: [CMS-concurrent-mark: 2.102/2.102 secs] [Times: user=24.28 sys=0.90, real=2.11 secs]
2021-02-23T07:01:49.548+0800: 199653.151: [CMS-concurrent-preclean-start]
2021-02-23T07:01:49.728+0800: 199653.330: [CMS-concurrent-preclean: 0.169/0.179 secs] [Times: user=0.35 sys=0.02, real=0.17 secs]
2021-02-23T07:01:49.728+0800: 199653.330: [CMS-concurrent-abortable-preclean-start]
2021-02-23T07:01:51.246+0800: 199654.848: Total time for which application threads were stopped: 0.0029138 seconds, Stopping threads took: 0.0004014 seconds
2021-02-23T07:01:51.247+0800: 199654.849: Total time for which application threads were stopped: 0.0012148 seconds, Stopping threads took: 0.0000792 seconds
2021-02-23T07:01:52.643+0800: 199656.245: Total time for which application threads were stopped: 0.0033441 seconds, Stopping threads took: 0.0003442 seconds
2021-02-23T07:01:52.645+0800: 199656.247: Total time for which application threads were stopped: 0.0019024 seconds, Stopping threads took: 0.0002444 seconds
2021-02-23T07:01:54.746+0800: 199658.348: [GC (Allocation Failure) 199658.348: [ParNew
Desired survivor size 322109440 bytes, new threshold 1 (max 6)
- age 1: 476010080 bytes, 476010080 total
: 5650054K->558251K(5662336K), 0.1145601 secs] 24834113K->20039782K(30828160K), 0.1148260 secs] [Times: user=1.48 sys=0.03, real=0.12 secs]
2021-02-23T07:01:54.861+0800: 199658.463: Total time for which application threads were stopped: 0.1188860 seconds, Stopping threads took: 0.0014423 seconds
2021-02-23T07:01:55.211+0800: 199658.813: Total time for which application threads were stopped: 0.0041470 seconds, Stopping threads took: 0.0018983 seconds
CMS: abort preclean due to time 2021-02-23T07:01:55.372+0800: 199658.974: [CMS-concurrent-abortable-preclean: 5.430/5.644 secs] [Times: user=36.94 sys=2.87, real=5.65 secs]
2021-02-23T07:01:55.375+0800: 199658.977: [GC (CMS Final Remark) [YG occupancy: 1938421 K (5662336 K)]199658.977: [Rescan (parallel) , 0.0611395 secs]199659.038: [weak refs processing, 0.0003288 secs]199659.039: [class unloading, 0.0382319 secs]199659.077: [scrub symbol table, 0.0107094 secs]199659.088: [scrub string table, 0.0015974 secs][1 CMS-remark: 19481531K(25165824K)] 21419952K(30828160K), 0.1160335 secs] [Times: user=1.42 sys=0.12, real=0.12 secs]
2021-02-23T07:01:55.491+0800: 199659.093: Total time for which application threads were stopped: 0.1192748 seconds, Stopping threads took: 0.0005368 seconds
2021-02-23T07:01:55.491+0800: 199659.094: [CMS-concurrent-sweep-start]
2021-02-23T07:01:55.640+0800: 199659.242: Total time for which application threads were stopped: 0.0029760 seconds, Stopping threads took: 0.0005614 seconds
2021-02-23T07:01:55.642+0800: 199659.244: Total time for which application threads were stopped: 0.0015641 seconds, Stopping threads took: 0.0002054 seconds
2021-02-23T07:01:56.638+0800: 199660.241: Total time for which application threads were stopped: 0.0036077 seconds, Stopping threads took: 0.0012128 seconds
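# The young generation fills up again, but this ParNew fails: promotion failed, the old generation cannot absorb the surviving objects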
2021-02-23T07:01:56.675+0800: 199660.278: [GC (Allocation Failure) 199660.278: [ParNew (promotion failed): 5590537K->5661406K(5662336K), 1.3361918 secs] 24852053K->25503147K(30828160K), 1.3365327 secs] [Times: user=3.56 sys=0.02, real=1.34 secs]
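# Concurrent mode failure: CMS abandons the concurrent cycle and falls back to a single-threaded Full GC of ~26.3 s; the last line shows the application threads were stopped for ~27.7 s in total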
2021-02-23T07:01:58.012+0800: 199661.614: [Full GC (Allocation Failure) 199661.614: [CMS2021-02-23T07:02:01.539+0800: 199665.142: [CMS-concurrent-sweep: 4.686/6.048 secs] [Times: user=27.51 sys=2.27, real=6.04 secs]
(concurrent mode failure): 19841741K->12822928K(25165824K), 26.3160634 secs] 25503147K->12822928K(30828160K), [Metaspace: 85227K->85227K(1130496K)], 26.3188523 secs] [Times: user=26.33 sys=0.00, real=26.31 secs]
2021-02-23T07:02:24.331+0800: 199687.934: Total time for which application threads were stopped: 27.6682172 seconds, Stopping threads took: 0.0098307 seconds
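
The numbers in this excerpt explain why the promotion failed (a rough sketch; values are in KB and copied from the lines above, and mark-sweep fragmentation makes the usable free space even smaller than this raw difference):

# Old generation just before the fallback full GC, from the
# "(concurrent mode failure)" line: used and capacity in KB.
old_used, old_capacity = 19841741, 25165824
old_free = old_capacity - old_used
print(old_free)                     # ~5.3M KB of raw free space

# The failed ParNew was evacuating an almost full young generation
# (5590537K used of 5662336K). That is more than the old gen had free,
# so promotion failed and CMS fell back to a single-threaded full GC,
# stopping the application threads for ~27.7 seconds in total.
young_used = 5590537
print(young_used, young_used > old_free)    # 5590537 True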

GC log settings

These GC log options are not ES-specific; they are plain JVM options:

-Xloggc:/opt/meituan/es_logs/elasticsearch.gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=15
-XX:GCLogFileSize=20M

When we previously ran into the full GC problem, we restarted the ES process immediately to limit the impact on online traffic. Because the process wrote its GC log to a single file, the log was overwritten on restart and we could not analyze the GC activity at the time of the incident. We therefore enabled log rotation, which produces files of the following form:

elasticsearch.gc.log
elasticsearch.gc.log.0
elasticsearch.gc.log.1
elasticsearch.gc.log.2.current
  • The listing above means the file currently being written has suffix 2. If the number of files is 3, then once file 2 is full, writing continues in the file with suffix 0; that is how the rotation works;
  • Every time the process starts, logging starts over at suffix 0 and the file with suffix 0 is recreated
    • Suppose the process has produced files 0, 1 and 2. After a restart it rewrites file 0, so the contents of file 2 survive, which means the GC activity from before the restart can still be inspected and the pre-restart GC scene analyzed
    • One problem remains: if little has been logged and rotation has not happened yet (only file 0 exists), a restart recreates file 0 and the earlier log content is lost; see the archiving sketch after this list
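
One way to avoid losing the pre-restart logs in that case is to archive them before restarting the process. A minimal sketch in Python (the log path comes from the -Xloggc flag above; the archive directory name is a made-up example):

import glob
import os
import shutil
import time

LOG_GLOB = "/opt/meituan/es_logs/elasticsearch.gc.log*"   # path from -Xloggc above
ARCHIVE_DIR = "/opt/meituan/es_logs/gc_archive"           # hypothetical archive dir

def archive_gc_logs():
    """Copy all rotated GC log files into a timestamped directory
    so the incident scene survives a process restart."""
    dest = os.path.join(ARCHIVE_DIR, time.strftime("%Y%m%d-%H%M%S"))
    os.makedirs(dest, exist_ok=True)
    for path in glob.glob(LOG_GLOB):
        shutil.copy2(path, dest)
    return dest

if __name__ == "__main__":
    print("archived to", archive_gc_logs())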

Thoughts on ES production stability

Before, during, and after an incident

  • Before: monitoring, and combing through slow queries
  • During: kick the problematic node out of the cluster to keep online queries healthy; when restarting a node, take care to preserve the incident scene
    • Simply killing the problematic node has the drawback that the incident scene cannot be preserved
  • After: analyze the root cause and fix the problem
    • Optimize index structure and slow queries
    • Build small tools so that, when a problem does occur, the impact on the online service is minimized

Of course this is not a one-off effort; the before/during/after handling has to be refined continuously as new problems surface, for example:

  • Monitoring relied on the ES nodes API, which has to fetch the status of every node; once one node goes into a full GC, the whole API stops returning results and the monitoring goes blind
    • Use the thread pool API to catch problems earlier (see the sketch after this list)
  • The node-inspection tool page also becomes unavailable after a full GC; this was solved by caching the data in Cellar
  • The GC log setup was flawed: everything went to a single file that was overwritten on every restart, so the GC log of the incident scene could not be analyzed
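
For the thread pool check mentioned above, a minimal sketch against the standard _cat/thread_pool API (the host, the threshold, and the idea of alerting on rejections are assumptions for illustration, not part of the original setup):

import requests  # third-party; `pip install requests`

ES_HOST = "http://localhost:9200"   # placeholder host

def check_thread_pools(reject_threshold=0):
    """Fetch per-node thread pool stats and flag pools with rejected tasks,
    which tends to show up before node-level APIs stop responding."""
    resp = requests.get(
        f"{ES_HOST}/_cat/thread_pool",
        params={"format": "json", "h": "node_name,name,active,queue,rejected"},
        timeout=5,
    )
    resp.raise_for_status()
    problems = [row for row in resp.json() if int(row["rejected"]) > reject_threshold]
    for row in problems:
        print(f"{row['node_name']} {row['name']}: "
              f"active={row['active']} queue={row['queue']} rejected={row['rejected']}")
    return problems

if __name__ == "__main__":
    check_thread_pools()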

References