MySQL主库xtrabackup备份导致应用异常

昨天有位同学问我:有台主库tps一下子跌没了,应用写入全部报错,她提供几个时间点给我。tps一下子跌没,我条件反射:莫非有全局读锁。开始排查
第一步:
查看数据监控 mysql thread running 23:47分左右开始上涨,到第二天01:17恢复正常。其他指标变化不是非常明显

又得到非常有用信息:因备库问题,在主库16:00启动xtrabackup,这时感觉找到问题的根源
第二步:
查看slow,一条子查询跑了23659S、flush跑了 5216S,这条子查询应该在19:00左右跑起,备份应该在23:45左右跑flush tables with read lock

# Time: 121225  1:09:54
# User@Host: xxx[xxx] @  [xxxx.xxxx.xxxx.xxxx]
# Query_time: 23659.989155  Lock_time: 0.000104 Rows_sent: 0  Rows_examined: 47501046480
SET timestamp=1356368994;
select id, parent_biz_order_id, buyer_id, status, type, gmt_modify from ttt where parent_biz_order_id 
in (select parent_biz_order_id from ttt where type='哈哈');
# Time: 121225  1:13:25
# User@Host: root[root] @ localhost []
# Query_time: 5216.430881  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
use mysql;
SET timestamp=1356369205;
FLUSH TABLES WITH READ LOCK;

第三步:
查看备份信息:备份脚本实际跑完(备份未成功)时间是在第二天00:02:15

+--------------+---------------------+---------------------+
| db_name      | start_time          | end_time |
+--------------+---------------------+---------------------+
| xxxxxxxxx | 2012-12-24 17:40:56 | 2012-12-25 00:02:15 |

从备份日志看出,23:46:24 xtrabackup触发flush tables with read lock,等了900秒后,因为close tables失败,最终导致备份失败。触发flush tables时间点+900S=xtrabackup脚本跑完时间end_time
备份脚本报错:
121224 23:46:24 innobackupex-1.5.1: Starting to lock all tables…
innobackupex-1.5.1: Error: Connection to mysql child process (pid=18366) timedout. (Time limit of 900 seconds exceeded. You may adjust time limit by editing the value of parameter “$mysql_response_timeout” in this script.) while waiting for reply to MySQL request: ‘FLUSH TABLES WITH READ LOCK;’ at /usr/bin/innobackupex-1.5.1 line 336.

xtrabackup源码解释:

# flush tables with read lock
        mysql_check(); --没有问题
        mysql_lockall() if !$option_no_lock; 
--在执行flush tables with read lock时,mysql_send函数执行超时900S,备份失败
 
# timeout in seconds for a reply from mysql
my $mysql_response_timeout = 900;
 
sub mysql_send {
    my $request = shift;
 
    $current_mysql_request = $request;
    print MYSQL_WRITER "$request\n";
    mysql_check();
    $current_mysql_request = '';
}
 
sub mysql_lockall {
    $now = current_time();
    print STDERR "$now  $prefix Starting to lock all tables...\n";
 
    mysql_send "USE mysql;";
    mysql_send "SET AUTOCOMMIT=0;";
    if (compare_versions($mysql_server_version, '4.0.22') == 0
        || compare_versions($mysql_server_version, '4.1.7') == 0) {
        # MySQL server version is 4.0.22 or 4.1.7
        mysql_send "COMMIT;";
        mysql_send "FLUSH TABLES WITH READ LOCK;";
    } else {
        # MySQL server version is other than 4.0.22 or 4.1.7
        mysql_send "FLUSH TABLES WITH READ LOCK;";
        mysql_send "COMMIT;";
    }
    write_binlog_info;
    write_slave_info if $option_slave_info;
 
    $now = current_time();
    print STDERR "$now  $prefix All tables locked and flushed to disk\n";
}

把整个过程连贯起来描述:

1、16:00 xtracbackup启动,因为xtrabackup会先copy innodb data file
2、19:00 查询sql跑起
3、23:45 xtrabackup加flush tables with read lock,跑成功这条语句要做三件事:1、加全局锁 2、关闭open tables 3、打上阻塞commit的flag。因为加全局锁成功,但在关闭open tables时,因为长查询,一直hang住
4、23:45以后,所有的写入都会被阻塞,应用不停创建连接,thread running升高
5、第二天1:09:54 sql跑完
6、第二天1:17:25thread running恢复正常

小结:还有几个问题没有解释,flush tables with read lock失败,thread running为什么没有在flush失败时间点下降?slow.log为什么记得5216S?


Post a Comment