记一次CacheCloud停服

1. 故障表现

四台Cachecloud同时报警、CacheCloud页面无法访问、应用接口(通过appId拉取节点)和上报接口无法访问。

2. 相关日志

报警时,出现了大量如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
2017-07-18 12:01:49.039 {http-nio-8585-exec-308} ERROR com.sohu.cache.async.impl.AsyncServiceImpl - Task java.util.concurrent.FutureTask@a494c6 rejected from java.util.concurrent.ThreadPoolExecut or@58d286ec[Running, pool size = 200, active threads = 200, queued tasks = 0, completed tasks = 19123]client-report-pool_10.10.193.35
25734 java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@a494c6 rejected from java.util.concurrent.ThreadPoolExecutor@58d286ec[Running, pool size = 200, active thread s = 200, queued tasks = 0, completed tasks = 19123]
25735 at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_51]
25736 at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) [na:1.7.0_51]
25737 at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) [na:1.7.0_51]
25738 at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132) ~[na:1.7.0_51]
25739 at com.sohu.cache.async.impl.AsyncServiceImpl.submitFuture(AsyncServiceImpl.java:92) ~[classes!/:na]
25740 at com.sohu.cache.client.service.impl.ClientReportDataServiceImpl.deal(ClientReportDataServiceImpl.java:60) [classes!/:na]
25741 at com.sohu.cache.client.heartbeat.RedisClientReportDataController.reportData(RedisClientReportDataController.java:75) [classes!/:na]
25742 at sun.reflect.GeneratedMethodAccessor178.invoke(Unknown Source) ~[na:na]
25743 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_51]
25744 at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_51]
25745 at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:222) [spring-web-4.2.4.RELEASE.jar!/:4.2.4.RELEASE]
25746 at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:137) [spring-web-4.2.4.RELEASE.jar!/:4.2.4.RELEASE]

所以初步怀疑是客户端上报造成的,所以暴力解决了下把上报的代码全部注释掉重新上线,结果之后还是继续报警,出现了大量如下日志。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
### Error updating database. Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428556 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428557 ### The error occurred while setting parameters
428558 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json) values (?,?,?,?,?,?,?)
428559 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428560 ; SQL []; Lock wait timeout exceeded; try restarting transaction; nested exception is java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428561 java.util.concurrent.ExecutionException: org.springframework.dao.CannotAcquireLockException:
428562 ### Error updating database. Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428563 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428564 ### The error occurred while setting parameters
428565 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json) values (?,?,?,?,?,?,?)
428566 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428567 ; SQL []; Lock wait timeout exceeded; try restarting transaction; nested exception is java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428568 at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_51]
428569 at java.util.concurrent.FutureTask.get(FutureTask.java:202) ~[na:1.7.0_51]
428570 at com.sohu.cache.async.impl.AsyncServiceImpl$1.run(AsyncServiceImpl.java:60) ~[classes!/:na]
428571 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
428572 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
428573 at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
428574 Caused by: org.springframework.dao.CannotAcquireLockException:
428575 ### Error updating database. Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
428576 ### The error may involve com.sohu.cache.dao.InstanceStatsDao.mergeStandardStats-Inline
428577 ### The error occurred while setting parameters
428578 ### SQL: replace into standard_statistics(collect_time,ip,port,db_type,info_json,diff_json,cluster_info_json) values (?,?,?,?,?,?,?)
428579 ### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction

3. 相关疑点

  • 每天中午12点有一个大的删除基准表:
1
delete form standard_statistics where created_time<='2017-07-16 12:00:35';

从上面的表现以及之前的故障判断是这个表有问题,但是无论从数据量以及历史看都是没问题的。

4. 确定原因

leader直接抓MySQL process list定位到很多慢查询,都和standard_statistics有关,查询如下

1
2
use information_shema;
select * from processlist order by Time desc limit 10\G;

竟然都卡在上面这个语句,竟然还没有索引,这个应该不可能啊(已经运行n年了,之前表数据量更大),

看了下表结构,索引变了,擦!!!,对比下sql schema的记录,原来是dba搞的,这也太草率了。。

本质:即使各种统计做了线程池隔离(粗暴那种,不是hystrix的熔断),但是数据库连接池是有限的,而主线程的的接口仍然需要数据库,所以依然hang住了主线程池,所以对外表现是无法提供服务。

5.思考

  • 不要有固势思维:过去没问题,不代表现在没问题,何况还有不靠谱的人。

  • 不要被日志迷惑:日志很多时候不能看出问题的本质,这个很多时候已经证明。

  • 遇到问题不能只评经验:想一想问题根本在哪里。