Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transaction whose execution time exceeds max-txn-ttl can still return results #48907

Closed
lxr599 opened this issue Nov 27, 2023 · 11 comments
Closed
Labels
sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug. type/stale This issue has not been updated for a long time.

Comments

@lxr599
Copy link

lxr599 commented Nov 27, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

 t1                    |                    t2
 
begin;    

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    3 |    2 |
|    2 |    3 |
+------+------+
                                       begin;    

                                       select * from t1;
                                       +------+------+
                                       | a    | b    |
                                       +------+------+
                                       |    3 |    2 |
                                       |    2 |    3 |
                                       +------+------+

                                       delete from t1 where a=3;
                                       Query OK, 1 row affected (0.00 sec)
                                       
-- wait for 5 min(current max-txn-ttl)
update t1 set a=4 where a=3;
Query OK, 1 row affected (1.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

                                       select * from t1;                                     
   				       ERROR 8229 (HY000): TTL manager has timed out, pessimistic locks may expire, please commit or rollback this transaction
                                       commit;
                                       ERROR 1105 (HY000): tikv aborts txn: Error(Txn(Error(Mvcc(Error(PessimisticLockNotFound { start_ts: TimeStamp(445921037785235457), key: [116, 128, 0, 0, 0, 0, 0, 1, 246, 95, 114, 128, 0, 0, 0, 0, 0, 0, 2], reason: LockTsMismatch })))))

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    2 |    3 |
|    4 |    2 |
+------+------+

select * from t1;
ERROR 8229 (HY000): TTL manager has timed out, pessimistic locks may expire, please commit or rollback this transaction

commit;
Query OK, 0 rows affected (0.00 sec)

select * from t1;
+------+------+
| a    | b    |
+------+------+
|    2 |    3 |
|    4 |    2 |
+------+------+

2. What did you expect to see? (Required)

t1 starts before t2, but t2 times out, so at least the first select after update in t1 should also time out instead of returning a result.

3. What did you see instead (Required)

The select after update and the update itself in t1 were executed successfully.

4. What is your TiDB version? (Required)

tidb_version(): Release Version: v7.4.0
Edition: Community
Git Commit Hash: 38cb4f3
Git Branch: heads/refs/tags/v7.4.0
UTC Build Time: 2023-10-10 14:18:50
GoVersion: go1.21.1
Race Enabled: false
Check Table Before Drop: false
Store: tikv

@lxr599 lxr599 added the type/bug The issue is confirmed as a bug. label Nov 27, 2023
@wengsy150943
Copy link

Since t1 is started before t2 and has the same ttl, it seems that t1 should be aborted as t2 due to timeout.
However, t1 is committed successfully.

@aytrack aytrack added the sig/transaction SIG:Transaction label Nov 29, 2023
@zyguan
Copy link
Contributor

zyguan commented Dec 1, 2023

The TTL keepalive routine actually starts on receiving the pessimistic lock response of the primary key. So, the TTL of t1 is slightly later than the TTL of t2, depending on the time interval between t2's delete and t1's update. And if you execute the select immediately after the update, it's won't response timeout error. In addition, the error TTL manager has timed out just warn you that your transaction probably cannot be committed successfully, but tidb will not abort it for you, just prevent you from executing more queries, so you may still commit it if there is no other transaction resolves it's locks.

@wengsy150943
Copy link

wengsy150943 commented Dec 1, 2023

The TTL keepalive routine actually starts on receiving the pessimistic lock response of the primary key. So, the TTL of t1 is slightly later than the TTL of t2, depending on the time interval between t2's delete and t1's update. And if you execute the select immediately after the update, it's won't response timeout error. In addition, the error just warn you that your transaction probably cannot be committed successfully, but tidb will not abort it for you, just prevent you from executing more queries, so you may still commit it if there is no other transaction resolves it's locks.TTL manager has timed out

Thanks for your reply.

We have tested the case you mentioned. Its form is similar to the first case in this issue. It seems that t1's select responses timeout error in an unexpected way.

t1 t2
begin;
select * from t1; -- returns(3,2)(2,3)
begin;
select * from t1; -- returns(3,2)(2,3)
delete from t1 where a=3; -- ok
-- wait for 5 min(current max-txn-ttl)
update t1 set a=4 where a=3; -- o1,ok
select * from t1; -- o2,ERROR 8229 (HY000):
select * from t1; -- o3 ERROR 8229 (HY000):

Notice that t1's update(o1) is loaded 5min later then t2's delete. And o1 is blocked for some seconds and then executed. It seems that the lock hold by t2's delete is removed at that time. Thus t2's next select(o2) responses timeout error.

Then we execute t1's next select(o3) immediately after the update. Though it is executed immediately after the update, it still responses timeout error, as o2 did.

If TTL keepalive routine starts at t1's update, t1's select(o3) should be executed instead of failing. This situation seems unexpected.

@lxr599
Copy link
Author

lxr599 commented Dec 1, 2023

We also discovered a phenomenon. If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case. Otherwise, it will time out directly, just like the second case. This also seems to be unexpected.

@zyguan
Copy link
Contributor

zyguan commented Dec 1, 2023

Sorry, the previous explaination is not completely true. The TTL keepalive routine starts on t1's update (o3), however the txn uptime is calculated by the start ts, so t1's TTL < t2's TTL.

Notice that t1's update(o1) is loaded 5min later then t2's delete. And o1 is blocked for some seconds and then executed. It seems that the lock hold by t2's delete is removed at that time. Thus t2's next select(o2) responses timeout error.

Then we execute t1's next select(o3) immediately after the update. Though it is executed immediately after the update, it still responses timeout error, as o2 did.

@wengsy150943 Since the update(o1) has been blocked for a while, it seems that the keepalive routine was running at that moment, and this line was executed. Thus select(o3) got the timeout error. Would you mind provide the table schema you used in this case? I'd like to try it locally.

If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case. Otherwise, it will time out directly, just like the second case. This also seems to be unexpected.

@lxr599 The keepalive routine started when update(o1) got executed, however, it waited 10s for checking TTL first time. I guess that's why the select(o3) returned successfully. Could you try an another case, let t1: 1) execute update(o1); 2) wait more than 15s; 3) execute select(o3) ?

Thanks, @lxr599 @wengsy150943 ! You’ve really done some in-depth testing.

@wengsy150943
Copy link

wengsy150943 commented Dec 1, 2023

The schema we used here is very simple:

create table t1 (a int, b int);
insert into t1 values(3,2);
insert into t1 values(2,3); -- in fact, this record is not necessary.

To take a more precise time control, I execute these commands in one line after t1's begin.

 select sleep(5*60);  update t1 set a=4 where a=3; select sleep(15); select * from t1;

There are two sleep commands. The first one waits for the max-txn-ttl(5 min); the second one controls the time interval between o1 and o3.

The phenomenons we discovered are shown below, hope that these phenomenons help:

When the second sleep waits for 0/1/3/5/6/9 s, o3 executes successfully. This is consistent with the first case in the issue instead of mine. Maybe it is due to my slow typing speed ):
When the second sleep waits for 15 s, o3 responses time out.
When the second sleep waits for 2/7 mins, o3 executes successfully. This phenomenon is similar with that mentioned by @lxr599.

We also discovered a phenomenon. If the update of t1 ( o1 ) waits long enough, like 7 minutes ( max-txn-ttl is 5 min ), the result of the later select ( o3) will be returned successfully, just like the first case.

It seems that there is a 10s delay for timeout responce after txn begin(instead of update(o1)), as you said.

BTW, during testing, select sleep(x) sometimes behaves not so precise. It might cost 5*(1min+5s) when select sleep(5*60). However, we cannot reproduce this phenomenon and not sure whether it is a bug of TiDB.

@zyguan
Copy link
Contributor

zyguan commented Dec 4, 2023

I've run some tests locally. It seems the TTL manager works as expected, and the results is shown as the following.

2023-12-04_181602

Here is the code I used, you may run it by the command:

deno test -A https://gist.githubusercontent.com/zyguan/bf2ba336fc7902f3b279535417905554/raw/471cf22ea251b05edfd6dad0d9aabdbf107036aa/test-tidb-48907.js

Also thanks to your cases, I found an another issue #49151 .

@wengsy150943
Copy link

Thanks for your test cases.
We repeat these cases locally. And they do work as expected.

However, the last case we mentioned is not included in these cases. And we find that it works in a different way when is executed manually or by test tool.

Its test case is shown as below:

Deno.test('tidb-48907-c', async () => {
  const c1 = await mysql.createConnection({ host: TIDB_HOST, port: 4000, user: "root", database: 'test' });
  const c2 = await mysql.createConnection({ host: TIDB_HOST, port: 4000, user: "root", database: 'test' });
  try {
    await c1.query("set @@tidb_general_log=1, @@innodb_lock_wait_timeout=600");
    await c1.query("drop table if exists t1");
    await c1.query("create table t1 (a int, b int)");
    await c1.query("insert into t1 values (3, 2), (2, 3)");
    log("show config", await pp(c1.query("show config where name like '%max-txn-ttl'")));

    await c1.query("begin");
    log("t1 select(o1):", await pp(c2.query("select * from t1")));
    await c2.query("begin");
    log("t2 delete:", await pp(c2.query("delete from t1 where a=3")));
    log("wait 5 minutes");
    await sleep((5*60)*1000);
    log("continue");
    log("t1 update(o3):", await pp(c1.query("update t1 set a=4 where a=3")));
    log("wait 5 minutes");
    await sleep((5*60)*1000);
    log("continue");
    log("t1 select(o3'):", await pp(c1.query("select * from t1")));
    log("t2 commit:", await pp(c2.query("commit")));
    log("t1 commit:", await pp(c1.query("commit")));

  } finally {
    await close(c1);
    await close(c2);
  }
});

The last select of t1 is failed while executing by tool.

image

However, it can be successfully executed manually.

image

Could you try executing this case locally? Maybe this phenomenon is related with the jdbc or mysql client.

BTW, responsing time out after max-txn-ttl + 10s may be not so expected according to the document, while it is easy to be corrected (personal opinion).

@zyguan
Copy link
Contributor

zyguan commented Dec 5, 2023

Could you try executing this case locally? Maybe this phenomenon is related with the jdbc or mysql client.

Interesting, I cannot reproduce the issue. Can you reproduce it without t2? (I think the phenomenon is unrelated to t2)

2023-12-05_163602

@wengsy150943
Copy link

Interesting, I cannot reproduce the issue. Can you reproduce it without t2? (I think the phenomenon is unrelated to t2)

This case can be reproduced when the update matches and changes 0 row. ):
So it seems expected, while it's an incorrect report.

My fault.

@jebter
Copy link

jebter commented Sep 4, 2024

So I will close it,if there are any updates, you can reopen it.

@jebter jebter closed this as completed Sep 4, 2024
@jebter jebter added the type/stale This issue has not been updated for a long time. label Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug. type/stale This issue has not been updated for a long time.
Projects
None yet
Development

No branches or pull requests

5 participants