记一次CacheCloud停服

1. 故障表现

四台Cachecloud同时报警、CacheCloud页面无法访问、应用接口(通过appId拉取节点)和上报接口无法访问。

2. 相关日志

报警时，出现了大量如下：

2017-07-18 12:01:49.039 {http-nio-8585-exec-308} ERROR com.sohu.cache.async.impl.AsyncServiceImpl - Task java.util.concurrent.FutureTask@a494c6 rejected from java.util.concurrent.ThreadPoolExecut       or@58d286ec[Running, pool size = 200, active threads = 200, queued tasks = 0, completed tasks = 19123]client-report-pool_10.10.193.35
 25734 java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@a494c6 rejected from java.util.concurrent.ThreadPoolExecutor@58d286ec[Running, pool size = 200, active thread       s = 200, queued tasks = 0, completed tasks = 19123]
 25735         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_51]
 25736         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) [na:1.7.0_51]
 25737         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) [na:1.7.0_51]
 25738         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132) ~[na:1.7.0_51]
 25739         at com.sohu.cache.async.impl.AsyncServiceImpl.submitFuture(AsyncServiceImpl.java:92) ~[classes!/:na]
 25740         at com.sohu.cache.client.service.impl.ClientReportDataServiceImpl.deal(ClientReportDataServiceImpl.java:60) [classes!/:na]
 25741         at com.sohu.cache.client.heartbeat.RedisClientReportDataController.reportData(RedisClientReportDataController.java:75) [classes!/:na]
 25742         at sun.reflect.GeneratedMethodAccessor178.invoke(Unknown Source) ~[na:na]
 25743         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_51]
 25744         at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_51]
 25745         at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:222) [spring-web-4.2.4.RELEASE.jar!/:4.2.4.RELEASE]
 25746         at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:137) [spring-web-4.2.4.RELEASE.jar!/:4.2.4.RELEASE]

所以初步怀疑是客户端上报造成的，所以暴力解决了下把上报的代码全部注释掉重新上线，结果之后还是继续报警，出现了大量如下日志。

### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428556 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428557 ### The error occurred while setting parameters
428558 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json)         values (?,?,?,?,?,?,?)
428559 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428560 ; SQL []; Lock wait timeout exceeded; try restarting transaction; nested exception is java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428561 java.util.concurrent.ExecutionException: org.springframework.dao.CannotAcquireLockException:
428562 ### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428563 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428564 ### The error occurred while setting parameters
428565 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json)         values (?,?,?,?,?,?,?)
428566 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428567 ; SQL []; Lock wait timeout exceeded; try restarting transaction; nested exception is java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428568         at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_51]
428569         at java.util.concurrent.FutureTask.get(FutureTask.java:202) ~[na:1.7.0_51]
428570         at com.sohu.cache.async.impl.AsyncServiceImpl$1.run(AsyncServiceImpl.java:60) ~[classes!/:na]
428571         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
428572         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
428573         at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
428574 Caused by: org.springframework.dao.CannotAcquireLockException:
428575 ### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428576 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428577 ### The error occurred while setting parameters
428578 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json)         values (?,?,?,?,?,?,?)
428579 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction

3. 相关疑点

每天中午12点有一个大的删除基准表：

1	delete form standard_statistics where created_time<='2017-07-16 12:00:35';

从上面的表现以及之前的故障判断是这个表有问题，但是无论从数据量以及历史看都是没问题的。

4. 确定原因

leader直接抓MySQL process list定位到很多慢查询，都和standard_statistics有关，查询如下

1 2	use information_shema; select * from processlist order by Time desc limit 10\G;

竟然都卡在上面这个语句，竟然还没有索引，这个应该不可能啊(已经运行n年了，之前表数据量更大)，

看了下表结构，索引变了，擦！！！，对比下sql schema的记录，原来是dba搞的，这也太草率了。。

本质：即使各种统计做了线程池隔离(粗暴那种，不是hystrix的熔断)，但是数据库连接池是有限的，而主线程的的接口仍然需要数据库，所以依然hang住了主线程池，所以对外表现是无法提供服务。

5.思考

不要有固势思维：过去没问题，不代表现在没问题，何况还有不靠谱的人。
不要被日志迷惑：日志很多时候不能看出问题的本质，这个很多时候已经证明。
遇到问题不能只评经验：想一想问题根本在哪里。