Direct Memory OOM Brings Down a RegionServer

Incident time: 2017-06-04 09:15

I. Symptoms

The RegionServer shut itself down: the ZooKeeper session was closed and the shutdown hook ran.

2017-06-04 09:14:06,194 INFO [regionserver/102zw68.tvindex.sohuno.com/10.10.102.68:16020] zookeeper.ZooKeeper: Session: 0x25a38b51be18b65 closed
2017-06-04 09:14:06,194 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2017-06-04 09:14:06,216 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2665)
2017-06-04 09:14:06,238 INFO [Thread-9] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@318603db
2017-06-04 09:14:06,238 INFO [Thread-9] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2017-06-04 09:14:06,239 INFO [Thread-9] regionserver.ShutdownHook: Shutdown hook finished.

The log leading up to this was full of errors like the following:

2017-06-04 09:13:19,756 ERROR [MemStoreFlusher.1] regionserver.HStore: Failed to open store file : hdfs://hbasezwcluster/hbase/data/default/rc_entire/c6fe2d013a27158a68af5f8b74bb9d6b/.tmp/09cc827c27bb4659aa9f6d97fb38fe65, keeping it in tmp location
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://hbasezwcluster/hbase/data/default/rc_entire/c6fe2d013a27158a68af5f8b74bb9d6b/.tmp/09cc827c27bb4659aa9f6d97fb38fe65
at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:494)
at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:522)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1084)
at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:254)
at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:399)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:504)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:494)
at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:653)
at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:645)
at org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1724)
at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:920)
at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2271)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2375)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2105)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2067)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1958)
at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1884)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at org.apache.hadoop.hdfs.util.DirectBufferPool.getBuffer(DirectBufferPool.java:70)
at org.apache.hadoop.hdfs.BlockReaderLocal.createDataBufIfNeeded(BlockReaderLocal.java:244)
at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:570)
at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:538)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:686)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:742)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:799)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:390)
at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:479)
... 21 more

On the surface this points to an OutOfMemoryError in direct (off-heap) memory.

II. Initial Checks

1. Inconsistent configuration:

In the HBase-zw cluster, only the machines below have bucketcache.size configured, and the values are not uniform (a configuration sketch follows below).

| # | Machine | MaxDirectMemorySize | bucketcache.size |
|---|--------------------|---------------------|------------------|
| 1 | 10.10.102.68 | 13G | 12288 (12G) |
| 2 | 10.10.102.69 | 7G | 6144 (6G) |
| 3 | 10.10.102.70 | 7G | 6144 (6G) |
| 4 | 10.10.102.71 | 7G | 6144 (6G) |
| 5 | 10.10.102.72 | 7G | 6144 (6G) |
| 6 | 10.10.102.73 | 7G | 6144 (6G) |
| 7 | 10.10.102.226 | 7G | 6144 (6G) |
| 8 | 10.10.102.227 | 7G | 6144 (6G) |
| 9 | 10.10.65.86 | 7G | 6144 (6G) |
| 10 | all other machines | not explicitly set | not explicitly set |
Note: this document was written after the fact. Before the problem was fixed, several of the RegionServers numbered 1-9 above hit the same failure.
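
For reference, here is a sketch of where the two settings in the table live, using row 2 (7G / 6144 MB) as an example. This is an assumption about a typical HBase 1.x deployment with an off-heap BucketCache; the actual files and property names in this cluster may differ slightly:

```bash
# hbase-env.sh (illustrative): hard cap on direct memory for the RegionServer JVM
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=7g"
```

```xml
<!-- hbase-site.xml (illustrative): 6144 MB off-heap BucketCache, pre-allocated
     out of the 7G direct memory budget above -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>6144</value>
</property>
```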

2. Memory statistics

After the RegionServer died, machine memory usage dropped, which is expected. Before the crash it sat around 60%, which looks normal.

(1) mm statistics:

(2) tsar statistics:

3. Read/write volume

Read and write request volumes were basically normal.

4. GC

(1) The GC charts look largely normal.

(2) GC log:

Starting around 09:12 there was a burst of full GCs:

2017-06-04T09:12:35.759+0800: 5071440.974: [GC pause (young), 0.1301810 secs]
2017-06-04T09:12:55.143+0800: 5071460.358: [Full GC 11G->1206M(20G), 2.8186760 secs]
2017-06-04T09:13:03.122+0800: 5071468.337: [Full GC 2305M->1120M(20G), 2.0535820 secs]
2017-06-04T09:13:14.729+0800: 5071479.944: [Full GC 2812M->1158M(20G), 1.9584960 secs]
2017-06-04T09:13:17.813+0800: 5071483.028: [Full GC 1473M->1180M(20G), 1.8418870 secs]
2017-06-04T09:13:20.781+0800: 5071485.996: [Full GC 1383M->1187M(20G), 1.7836590 secs]
2017-06-04T09:13:23.690+0800: 5071488.905: [Full GC 1360M->1194M(20G), 1.7804030 secs]
2017-06-04T09:13:26.595+0800: 5071491.810: [Full GC 1383M->1202M(20G), 1.7664520 secs]
2017-06-04T09:13:28.716+0800: 5071493.931: [Full GC 1303M->1215M(20G), 1.7689500 secs]
2017-06-04T09:13:30.514+0800: 5071495.729: [Full GC 1244M->1221M(20G), 1.6835710 secs]
2017-06-04T09:13:32.479+0800: 5071497.694: [Full GC 1337M->1226M(20G), 1.8048440 secs]
2017-06-04T09:13:34.326+0800: 5071499.541: [Full GC 1250M->1236M(20G), 1.5021000 secs]
2017-06-04T09:13:35.836+0800: 5071501.051: [Full GC 1248M->1244M(20G), 1.4851340 secs]
2017-06-04T09:13:38.346+0800: 5071503.561: [Full GC 1632M->1253M(20G), 2.0687720 secs]
2017-06-04T09:13:40.453+0800: 5071505.668: [Full GC 1284M->1260M(20G), 1.6977380 secs]
2017-06-04T09:13:43.179+0800: 5071508.394: [Full GC 1471M->1263M(20G), 1.8358100 secs]
2017-06-04T09:13:45.041+0800: 5071510.256: [Full GC 1298M->1275M(20G), 1.7325290 secs]
2017-06-04T09:13:47.800+0800: 5071513.015: [Full GC 1522M->1277M(20G), 1.8715070 secs]
2017-06-04T09:13:49.697+0800: 5071514.912: [Full GC 1311M->1288M(20G), 1.7041390 secs]
2017-06-04T09:13:52.434+0800: 5071517.649: [Full GC 1497M->1288M(20G), 1.8675380 secs]
2017-06-04T09:13:55.426+0800: 5071520.641: [Full GC 1457M->1288M(20G), 1.6890270 secs]
2017-06-04T09:13:58.243+0800: 5071523.458: [Full GC 1428M->1288M(20G), 1.6103100 secs]
2017-06-04T09:14:00.980+0800: 5071526.195: [Full GC 1382M->1288M(20G), 1.5230450 secs]
2017-06-04T09:14:03.628+0800: 5071528.843: [Full GC 1396M->1288M(20G), 1.5169950 secs]

Given the heap sizes above, these full GCs are clearly abnormal: the 20G heap is nearly empty, yet full GCs fire back to back every couple of seconds. The historical heap statistics show the same pattern.

So the preliminary suspicion is that the full GCs are being forced by direct (off-heap) memory pressure rather than by the heap.

III. Root Cause: Insufficient Direct Memory

This machine explicitly sets MaxDirectMemorySize to 7G, and the RegionServer's BucketCache is configured to use 6G of off-heap memory as a second-level block cache. Since the BucketCache memory is pre-allocated, only about 1G of direct memory is left for everything else in the RegionServer.
By JVM design, when a direct-buffer allocation cannot be satisfied the JDK triggers a full GC (via System.gc()) in the hope of freeing unreachable direct buffers, and throws OutOfMemoryError: Direct buffer memory if the reservation still fails. The following small program reproduces this:

package test.oom;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/**
 * VM options: -XX:+PrintGCDetails -XX:MaxDirectMemorySize=500M -XX:+UseG1GC -Xmx1G
 *
 * @author leifu
 * @Date 2017-07-06
 * @Time 09:33:37 AM
 */
public class DirectMemoryOOM {

    /**
     * 300MB
     */
    private static final int _300MB = 1024 * 1024 * 300;

    public static void main(String[] args) throws Exception {
        int count = 0;
        List<ByteBuffer> bufs = new ArrayList<ByteBuffer>();
        while (true) {
            count++;
            System.out.println("count=" + count);
            Thread.sleep(2000);
            System.out.println("allocate direct.");
            ByteBuffer buf = ByteBuffer.allocateDirect(_300MB);
            bufs.add(buf);
        }
    }
}

The GC log shows that from the second allocation onward every allocateDirect call triggers a full GC (via System.gc()), even though the heap is barely used; note also that the GC log says nothing about direct memory usage:

count=1
allocate direct.
count=2
allocate direct.
[Full GC (System.gc()) 983K->278K(8192K), 0.0081902 secs]
[Eden: 1024.0K(24.0M)->0.0B(4096.0K) Survivors: 0.0B->0.0B Heap: 983.1K(256.0M)->278.4K(8192.0K)], [Metaspace: 2642K->2642K(1056768K)]
[Times: user=0.01 sys=0.01, real=0.01 secs]
count=3
allocate direct.
[Full GC (System.gc()) 360K->276K(8192K), 0.0031508 secs]
[Eden: 1024.0K(4096.0K)->0.0B(4096.0K) Survivors: 0.0B->0.0B Heap: 360.3K(8192.0K)->276.7K(8192.0K)], [Metaspace: 2643K->2643K(1056768K)]
[Times: user=0.00 sys=0.00, real=0.01 secs]

If -XX:+DisableExplicitGC is added, the System.gc() call becomes a no-op, so instead of repeated full GCs the program fails immediately with OutOfMemoryError: Direct buffer memory.
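
The mechanism can be seen in the JDK itself. Below is a simplified, self-contained paraphrase of what java.nio.Bits.reserveMemory does in OpenJDK 7/8 before every direct allocation; the class and field names here are illustrative, not the real JDK code:

```java
// Paraphrase of java.nio.Bits.reserveMemory (OpenJDK 7/8), simplified for illustration.
// Every ByteBuffer.allocateDirect() first reserves its capacity against the
// -XX:MaxDirectMemorySize limit through logic equivalent to this.
class DirectMemoryReservationSketch {

    // The JDK reads the real limit via sun.misc.VM.maxDirectMemory(); this stand-in
    // uses the heap limit, which is also the default when the flag is not set.
    static final long MAX_DIRECT_MEMORY = Runtime.getRuntime().maxMemory();
    static long totalCapacity = 0;

    static synchronized boolean tryReserve(long cap) {
        if (totalCapacity + cap <= MAX_DIRECT_MEMORY) {
            totalCapacity += cap;
            return true;
        }
        return false;
    }

    static void reserveMemory(long cap) {
        if (tryReserve(cap)) {
            return;
        }
        // Over the limit: force a full GC, hoping unreachable DirectByteBuffers are
        // cleaned and their native memory released. This is the System.gc() seen in
        // the RegionServer's GC log; with -XX:+DisableExplicitGC it does nothing.
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException x) {
            Thread.currentThread().interrupt();
        }
        if (!tryReserve(cap)) {
            throw new OutOfMemoryError("Direct buffer memory");
        }
    }
}
```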

For more background, see Bi Xuan's article: http://hellojava.info/?p=56

IV. The RegionServer and Direct Memory

The log shows the failure happened while the RegionServer was flushing a memstore to an HFile (on HDFS, normally multi-replica and written over the network) and then validating the new file, which involves both data transfer and reading the HFile back. From the stack trace, the HDFS short-circuit local reader (BlockReaderLocal) obtains its bounce buffers from a pool of NIO direct buffers, so even this "local" read path consumes the direct memory budget; a rough sketch of that pooling pattern follows the stack trace.

Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at org.apache.hadoop.hdfs.util.DirectBufferPool.getBuffer(DirectBufferPool.java:70)
at org.apache.hadoop.hdfs.BlockReaderLocal.createDataBufIfNeeded(BlockReaderLocal.java:244)
at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:570)
at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:538)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:686)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:742)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:799)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:390)
at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:479)
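
For context, org.apache.hadoop.hdfs.util.DirectBufferPool (visible in the trace above) hands out pooled direct buffers for these bounce buffers. The class below is an approximation of that pooling pattern, written for illustration rather than copied from Hadoop; the key point is that every pool miss still calls ByteBuffer.allocateDirect() and therefore counts against -XX:MaxDirectMemorySize:

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Illustrative approximation of Hadoop's DirectBufferPool: buffers are pooled per
// size and held via weak references, but a pool miss always allocates a new
// direct buffer, which is charged against the -XX:MaxDirectMemorySize limit.
class SimpleDirectBufferPool {

    private final ConcurrentMap<Integer, Queue<WeakReference<ByteBuffer>>> pools =
            new ConcurrentHashMap<Integer, Queue<WeakReference<ByteBuffer>>>();

    ByteBuffer getBuffer(int size) {
        Queue<WeakReference<ByteBuffer>> queue = pools.get(size);
        if (queue != null) {
            WeakReference<ByteBuffer> ref;
            while ((ref = queue.poll()) != null) {
                ByteBuffer buf = ref.get();
                if (buf != null) {
                    buf.clear();
                    return buf;                     // reuse a pooled buffer
                }
            }
        }
        return ByteBuffer.allocateDirect(size);     // miss: new direct allocation
    }

    void returnBuffer(ByteBuffer buf) {
        Integer size = buf.capacity();
        Queue<WeakReference<ByteBuffer>> queue = pools.get(size);
        if (queue == null) {
            queue = new ConcurrentLinkedQueue<WeakReference<ByteBuffer>>();
            Queue<WeakReference<ByteBuffer>> existing = pools.putIfAbsent(size, queue);
            if (existing != null) {
                queue = existing;
            }
        }
        queue.add(new WeakReference<ByteBuffer>(buf));
    }
}
```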
The mailing-list post below analyzes how the standard (non-async) HBase client can end up consuming large amounts of direct memory (https://groups.google.com/forum/#!topic/asynchbase/xFvHuniLI1c); it is quoted in full:
I thought I'd take a moment to explain what I discovered trying to track down serious problems with the regular (non-async) hbase client and Java's nio implementation.
We were having issues running out of direct memory and here's a stack trace which says it all:
java.nio.Buffer.<init>(Buffer.java:172)
java.nio.ByteBuffer.<init>(ByteBuffer.java:259)
java.nio.ByteBuffer.<init>(ByteBuffer.java:267)
java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:64)
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:155)
sun.nio.ch.IOUtil.write(IOUtil.java:37)
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
java.io.DataOutputStream.flush(DataOutputStream.java:106)
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:518)
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:751)
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
$Proxy11.getProtocolVersion(<Unknown Source>:Unknown line)
Here you can see that an HBaseClient request is flushing a stream which has a socket channel at the other end of it. HBase has decided not to use direct memory for its byte buffers which I thought was smart since they are difficult to manage. Unfortunately, behind the scenes the JDK is noticing the lack of direct memory buffer in the socket channel write call, and it is allocating a direct memory buffer on your behalf! The size of that direct memory buffer depends on the amount of data you want to write at that time, so if you are writing 1M of data, the JDK will allocate 1M of direct memory.
The same is done on the reading side as well. If you perform channel I/O with a non-direct memory buffer, the JDK will allocate a direct memory buffer for you. In the reading case it allocates a size that equals the amount of room you have in the direct memory buffer you passed in to the read call. WTF!? That can be a very large value.
To make matters worse, the JDK caches these direct memory buffers in thread local storage and it caches not one, but three of these arbitrarily sized buffers. (Look in sun.nio.ch.Util.getTemporaryDirectBuffer and let me know if I have interpreted the code incorrectly.) So if you have a large number of threads talking to hbase you can find yourself overflowing with direct memory buffers that you have not allocated and didn't even know about.
This issue is what caused us to check out the asynchbase client, which happily didn't have any of these problems. The reason is that asynchbase uses netty and netty knows the proper way of using direct memory buffers for I/O. The correct way is to use direct memory buffers in manageable sizes, 16k to 64k or something like that, for the purpose of invoking a read or write system call. Netty has algorithms for calculating the best size given a particular socket connection, based on the amount of data it seems to be able to read at once, etc. Netty reads the data from the OS using direct memory and copies that data into Java byte buffers.
Now you might be wondering why you don't just pass a regular Java byte array into the read/write calls, to avoid the copy from direct memory to java heap memory, and here's the story about that. Let's assume you're doing a file or socket read. There are two cases:
If the amount being read is < 8k, it uses a native char array on the C stack for the read system call, and then copies the result into your Java buffer.
If the amount being read is > 8k, the JDK calls malloc, does the read system call with that buffer, copies the result into your Java buffer, and then calls free.
The reason for this is that the the compacting Java garbage collector might move your Java buffer while you're blocked in the read system call and clearly that will not do. But if you are not aware of the malloc/free being called every time you perform a read larger than 8k, you might be surprised by the performance of that. Direct memory buffers were created to avoid the malloc/free every time you read. You still need to copy but you don't need to malloc/free every time.
People get into trouble with direct memory because you cannot free them up when you know you are done. Instead you need to wait for the garbage collector to run and THEN the finalizers to be executed. You can never tell when the GC will run and/or your finalizers be run, so it's really a crap shoot. That's why the JDK caches these buffers (that it shouldn't be creating in the first place). And the larger your heap size, the less frequent the GCs. And actually, I saw some code in the JDK which called System.gc() manually when a direct memory buffer allocation failed, which is an absolute no-no. That might work with small heap sizes but with large heap sizes, a full System.gc() can take 15 or 20 seconds. We were trying to use the G1 collector which allows for very large heap sizes without long GC delays, but those delays were occurring because some piece of code was manually running GC. When I disabled System.gc() with a command line option, we ran out of direct memory instead.
This is long but I hope informative.

The gist:

  • Direct memory is hard to manage, so HBase itself does not use direct buffers for its socket I/O; but the JDK silently allocates temporary direct buffers for large channel reads and writes done with heap buffers, and it caches several of them per thread (thread-local), so usage scales with thread count and request size (see the sketch below).
  • The post recommends moving off the native client and considering the netty-based asynchbase client, which uses direct buffers in bounded sizes.
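
A small demo of the hidden allocation described above: writing a plain heap ByteBuffer through an NIO channel makes the JDK copy it into a temporary direct buffer of the same size, which is then cached per thread and visible in the "direct" BufferPoolMXBean. This is an illustrative sketch, not HBase code, and the file name and sizes are arbitrary:

```java
import java.io.File;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

// Illustrative demo (not HBase code): channel I/O with a *heap* ByteBuffer makes
// the JDK copy it into a temporary direct buffer of the same size, which is then
// cached in thread-local storage.
public class HiddenDirectBufferDemo {

    private static void printDirectPool(String label) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("%s: direct buffers count=%d, memoryUsed=%d bytes%n",
                        label, pool.getCount(), pool.getMemoryUsed());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        printDirectPool("before write");

        ByteBuffer heapBuf = ByteBuffer.allocate(64 * 1024 * 1024); // 64MB *heap* buffer
        File tmp = File.createTempFile("hidden-direct-buffer", ".bin");
        tmp.deleteOnExit();

        try (FileChannel ch = FileChannel.open(tmp.toPath(),
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(heapBuf); // JDK silently allocates a ~64MB temporary direct buffer here
        }

        printDirectPool("after write"); // memoryUsed typically jumps by roughly the write size
    }
}
```

Run with a small limit such as -XX:MaxDirectMemorySize=32m and the write itself should fail with the same "OutOfMemoryError: Direct buffer memory" seen in the RegionServer log, even though the program never calls allocateDirect.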

V. Fix and Precautions

1. Fix:

Once the mechanism is clear, the fix is simple:

  • Stop using the BucketCache: only if the system does not actually need the off-heap L2 cache.
  • Give the process more direct memory: for example, if -XX:MaxDirectMemorySize is not configured explicitly, it defaults to the same value as -Xmx.

Whichever option is chosen, make sure there is enough direct memory headroom beyond the BucketCache, while keeping total memory usage on the machine stable (a hedged config sketch follows).
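
A sketch of what the second option could look like for the 7G machines. The numbers are illustrative assumptions, not the values actually deployed; the heap size comes from the 20G heap visible in the GC log, and the right direct-memory value depends on how much physical memory each node can spare:

```bash
# hbase-env.sh -- illustrative only. Either drop -XX:MaxDirectMemorySize so it
# defaults to -Xmx, or raise it so that the 6G BucketCache plus the direct
# buffers used by HDFS short-circuit reads and NIO socket I/O still fit.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmx20g -XX:MaxDirectMemorySize=10g"
```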

2. Precautions:

  • Monitor configuration consistency across RegionServers.
  • Build a deeper understanding of RegionServer internals.

VI. References

As an appendix, here is one BucketCache statistics log line. It shows the off-heap cache was almost entirely unused (usedSize=10 KB against totalSize=12.00 GB), and it also confirms that the off-heap memory is pre-allocated (totalSize reports the full 12 GB even though almost nothing is cached).

2017-06-04 09:13:49,672 INFO [BucketCacheStatsExecutor] bucket.BucketCache: failedBlockAdditions=0, totalSize=12.00 GB, freeSize=12.00 GB, usedSize=10 KB, cacheSize=621 B, accesses=32573928, hits=12607, IOhitsPerSecond=0, IOTimePerHit=NaN, hitRatio=0.04%, cachingAccesses=78220, cachingHits=12607, cachingHitsRatio=16.12%, evictions=0, evicted=0, evictedPerRun=NaN