Transaction whose execution time exceeds max-txn-ttl can still return results #48907

lxr599 · 2023-11-27T03:34:51Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

 t1                    |                    t2
 
begin;    

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    3 |    2 |
|    2 |    3 |
+------+------+
                                       begin;    

                                       select * from t1;
                                       +------+------+
                                       | a    | b    |
                                       +------+------+
                                       |    3 |    2 |
                                       |    2 |    3 |
                                       +------+------+

                                       delete from t1 where a=3;
                                       Query OK, 1 row affected (0.00 sec)
                                       
-- wait for 5 min(current max-txn-ttl)
update t1 set a=4 where a=3;
Query OK, 1 row affected (1.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

                                       select * from t1;                                     
   				       ERROR 8229 (HY000): TTL manager has timed out, pessimistic locks may expire, please commit or rollback this transaction
                                       commit;
                                       ERROR 1105 (HY000): tikv aborts txn: Error(Txn(Error(Mvcc(Error(PessimisticLockNotFound { start_ts: TimeStamp(445921037785235457), key: [116, 128, 0, 0, 0, 0, 0, 1, 246, 95, 114, 128, 0, 0, 0, 0, 0, 0, 2], reason: LockTsMismatch })))))

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    2 |    3 |
|    4 |    2 |
+------+------+

select * from t1;
ERROR 8229 (HY000): TTL manager has timed out, pessimistic locks may expire, please commit or rollback this transaction

commit;
Query OK, 0 rows affected (0.00 sec)

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    2 |    3 |
|    4 |    2 |
+------+------+

2. What did you expect to see? (Required)

t1 starts before t2, but t2 times out, so at least the first select after update in t1 should also time out instead of returning a result.

3. What did you see instead (Required)

The select after update and the update itself in t1 were executed successfully.

4. What is your TiDB version? (Required)

tidb_version(): Release Version: v7.4.0
Edition: Community
Git Commit Hash: 38cb4f3
Git Branch: heads/refs/tags/v7.4.0
UTC Build Time: 2023-10-10 14:18:50
GoVersion: go1.21.1
Race Enabled: false
Check Table Before Drop: false
Store: tikv

The text was updated successfully, but these errors were encountered:

wengsy150943 · 2023-11-27T05:12:48Z

Since t1 is started before t2 and has the same ttl, it seems that t1 should be aborted as t2 due to timeout.
However, t1 is committed successfully.

zyguan · 2023-12-01T06:44:37Z

The TTL keepalive routine actually starts on receiving the pessimistic lock response of the primary key. So, the TTL of t1 is slightly later than the TTL of t2, depending on the time interval between t2's delete and t1's update. And if you execute the select immediately after the update, it's won't response timeout error. In addition, the error TTL manager has timed out just warn you that your transaction probably cannot be committed successfully, but tidb will not abort it for you, just prevent you from executing more queries, so you may still commit it if there is no other transaction resolves it's locks.

wengsy150943 · 2023-12-01T07:18:27Z

The TTL keepalive routine actually starts on receiving the pessimistic lock response of the primary key. So, the TTL of t1 is slightly later than the TTL of t2, depending on the time interval between t2's delete and t1's update. And if you execute the select immediately after the update, it's won't response timeout error. In addition, the error just warn you that your transaction probably cannot be committed successfully, but tidb will not abort it for you, just prevent you from executing more queries, so you may still commit it if there is no other transaction resolves it's locks.TTL manager has timed out

Thanks for your reply.

We have tested the case you mentioned. Its form is similar to the first case in this issue. It seems that t1's select responses timeout error in an unexpected way.

t1	t2
begin;
select * from t1; -- returns(3,2)(2,3)
	begin;
	select * from t1; -- returns(3,2)(2,3)
	delete from t1 where a=3; -- ok
-- wait for 5 min(current max-txn-ttl)
update t1 set a=4 where a=3; -- o1,ok
	select * from t1; -- o2,ERROR 8229 (HY000):
select * from t1; -- o3 ERROR 8229 (HY000):

Notice that t1's update(o1) is loaded 5min later then t2's delete. And o1 is blocked for some seconds and then executed. It seems that the lock hold by t2's delete is removed at that time. Thus t2's next select(o2) responses timeout error.

Then we execute t1's next select(o3) immediately after the update. Though it is executed immediately after the update, it still responses timeout error, as o2 did.

If TTL keepalive routine starts at t1's update, t1's select(o3) should be executed instead of failing. This situation seems unexpected.

lxr599 · 2023-12-01T08:05:53Z

We also discovered a phenomenon. If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case. Otherwise, it will time out directly, just like the second case. This also seems to be unexpected.

zyguan · 2023-12-01T09:41:24Z

Sorry, the previous explaination is not completely true. The TTL keepalive routine starts on t1's update (o3), however the txn uptime is calculated by the start ts, so t1's TTL < t2's TTL.

Notice that t1's update(o1) is loaded 5min later then t2's delete. And o1 is blocked for some seconds and then executed. It seems that the lock hold by t2's delete is removed at that time. Thus t2's next select(o2) responses timeout error.

Then we execute t1's next select(o3) immediately after the update. Though it is executed immediately after the update, it still responses timeout error, as o2 did.

@wengsy150943 Since the update(o1) has been blocked for a while, it seems that the keepalive routine was running at that moment, and this line was executed. Thus select(o3) got the timeout error. Would you mind provide the table schema you used in this case? I'd like to try it locally.

If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case. Otherwise, it will time out directly, just like the second case. This also seems to be unexpected.

@lxr599 The keepalive routine started when update(o1) got executed, however, it waited 10s for checking TTL first time. I guess that's why the select(o3) returned successfully. Could you try an another case, let t1: 1) execute update(o1); 2) wait more than 15s; 3) execute select(o3) ?

Thanks, @lxr599 @wengsy150943 ! You’ve really done some in-depth testing.

wengsy150943 · 2023-12-01T12:47:27Z

The schema we used here is very simple:

create table t1 (a int, b int);
insert into t1 values(3,2);
insert into t1 values(2,3); -- in fact, this record is not necessary.

To take a more precise time control, I execute these commands in one line after t1's begin.

 select sleep(5*60);  update t1 set a=4 where a=3; select sleep(15); select * from t1;

There are two sleep commands. The first one waits for the max-txn-ttl(5 min); the second one controls the time interval between o1 and o3.

The phenomenons we discovered are shown below, hope that these phenomenons help:

When the second sleep waits for 0/1/3/5/6/9 s, o3 executes successfully. This is consistent with the first case in the issue instead of mine. Maybe it is due to my slow typing speed ):
When the second sleep waits for 15 s, o3 responses time out.
When the second sleep waits for 2/7 mins, o3 executes successfully. This phenomenon is similar with that mentioned by @lxr599.

We also discovered a phenomenon. If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case.

It seems that there is a 10s delay for timeout responce after txn begin(instead of update(o1)), as you said.

BTW, during testing, select sleep(x) sometimes behaves not so precise. It might cost 5*(1min+5s) when select sleep(5*60). However, we cannot reproduce this phenomenon and not sure whether it is a bug of TiDB.

zyguan · 2023-12-04T11:05:10Z

I've run some tests locally. It seems the TTL manager works as expected, and the results is shown as the following.

Here is the code I used, you may run it by the command:

deno test -A https://gist.githubusercontent.com/zyguan/bf2ba336fc7902f3b279535417905554/raw/471cf22ea251b05edfd6dad0d9aabdbf107036aa/test-tidb-48907.js

Also thanks to your cases, I found an another issue #49151 .

wengsy150943 · 2023-12-04T13:44:30Z

Thanks for your test cases.
We repeat these cases locally. And they do work as expected.

However, the last case we mentioned is not included in these cases. And we find that it works in a different way when is executed manually or by test tool.

Its test case is shown as below:

Deno.test('tidb-48907-c', async () => {
  const c1 = await mysql.createConnection({ host: TIDB_HOST, port: 4000, user: "root", database: 'test' });
  const c2 = await mysql.createConnection({ host: TIDB_HOST, port: 4000, user: "root", database: 'test' });
  try {
    await c1.query("set @@tidb_general_log=1, @@innodb_lock_wait_timeout=600");
    await c1.query("drop table if exists t1");
    await c1.query("create table t1 (a int, b int)");
    await c1.query("insert into t1 values (3, 2), (2, 3)");
    log("show config", await pp(c1.query("show config where name like '%max-txn-ttl'")));

    await c1.query("begin");
    log("t1 select(o1):", await pp(c2.query("select * from t1")));
    await c2.query("begin");
    log("t2 delete:", await pp(c2.query("delete from t1 where a=3")));
    log("wait 5 minutes");
    await sleep((5*60)*1000);
    log("continue");
    log("t1 update(o3):", await pp(c1.query("update t1 set a=4 where a=3")));
    log("wait 5 minutes");
    await sleep((5*60)*1000);
    log("continue");
    log("t1 select(o3'):", await pp(c1.query("select * from t1")));
    log("t2 commit:", await pp(c2.query("commit")));
    log("t1 commit:", await pp(c1.query("commit")));

  } finally {
    await close(c1);
    await close(c2);
  }
});

The last select of t1 is failed while executing by tool.

However, it can be successfully executed manually.

Could you try executing this case locally? Maybe this phenomenon is related with the jdbc or mysql client.

BTW, responsing time out after max-txn-ttl + 10s may be not so expected according to the document, while it is easy to be corrected (personal opinion).

zyguan · 2023-12-05T08:41:58Z

Could you try executing this case locally? Maybe this phenomenon is related with the jdbc or mysql client.

Interesting, I cannot reproduce the issue. Can you reproduce it without t2? (I think the phenomenon is unrelated to t2)

wengsy150943 · 2023-12-05T13:51:38Z

Interesting, I cannot reproduce the issue. Can you reproduce it without t2? (I think the phenomenon is unrelated to t2)

This case can be reproduced when the update matches and changes 0 row. ):
So it seems expected, while it's an incorrect report.

My fault.

jebter · 2024-09-04T03:56:28Z

So I will close it，if there are any updates, you can reopen it.

lxr599 added the type/bug The issue is confirmed as a bug. label Nov 27, 2023

aytrack added the sig/transaction SIG:Transaction label Nov 29, 2023

lxr599 mentioned this issue Dec 5, 2023

Transaction exceeding ttl does not release the lock #49184

Open

jebter closed this as completed Sep 4, 2024

jebter added the type/stale This issue has not been updated for a long time. label Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transaction whose execution time exceeds max-txn-ttl can still return results #48907

Transaction whose execution time exceeds max-txn-ttl can still return results #48907

lxr599 commented Nov 27, 2023 •

edited

Loading

wengsy150943 commented Nov 27, 2023

zyguan commented Dec 1, 2023

wengsy150943 commented Dec 1, 2023 •

edited

Loading

lxr599 commented Dec 1, 2023 •

edited

Loading

zyguan commented Dec 1, 2023 •

edited

Loading

wengsy150943 commented Dec 1, 2023 •

edited

Loading

zyguan commented Dec 4, 2023 •

edited

Loading

wengsy150943 commented Dec 4, 2023

zyguan commented Dec 5, 2023

wengsy150943 commented Dec 5, 2023

jebter commented Sep 4, 2024

Transaction whose execution time exceeds max-txn-ttl can still return results #48907

Transaction whose execution time exceeds max-txn-ttl can still return results #48907

Comments

lxr599 commented Nov 27, 2023 • edited Loading

Bug Report

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

wengsy150943 commented Nov 27, 2023

zyguan commented Dec 1, 2023

wengsy150943 commented Dec 1, 2023 • edited Loading

lxr599 commented Dec 1, 2023 • edited Loading

zyguan commented Dec 1, 2023 • edited Loading

wengsy150943 commented Dec 1, 2023 • edited Loading

zyguan commented Dec 4, 2023 • edited Loading

wengsy150943 commented Dec 4, 2023

zyguan commented Dec 5, 2023

wengsy150943 commented Dec 5, 2023

jebter commented Sep 4, 2024

lxr599 commented Nov 27, 2023 •

edited

Loading

wengsy150943 commented Dec 1, 2023 •

edited

Loading

lxr599 commented Dec 1, 2023 •

edited

Loading

zyguan commented Dec 1, 2023 •

edited

Loading

wengsy150943 commented Dec 1, 2023 •

edited

Loading

zyguan commented Dec 4, 2023 •

edited

Loading