CI: Cilium IPsec upgrade - unexpected packet drops on downgrade #29987

Closed
giorio94 opened this issue Dec 19, 2023 · 16 comments
Labels
area/CI: Continuous Integration testing issue or flake
ci/flake: This is a known failure that occurs in the tree. Please investigate me!
kind/bug: This is a bug in the Cilium logic.
sig/loader: Impacts the loading of BPF programs into the kernel.
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@giorio94 (Member)

CI failure

Hit on #29896
Link: https://github.com/cilium/cilium/actions/runs/7258862919/job/19777306148
Sysdump: cilium-sysdump-ipsec-downgrade-1-20231219-094253.zip

 [=] Test [no-unexpected-packet-drops] [3/71]

  [-] Scenario [no-unexpected-packet-drops/no-unexpected-packet-drops]
  🟥 Found unexpected packet drops:
{
  "labels": {
    "direction": "EGRESS",
    "reason": "Missed tail call"
  },
  "name": "cilium_drop_count_total",
  "value": 1
}
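
For context on where this counter comes from: the datapath records drops in a BPF metrics map keyed by drop reason and direction, and the agent exports those counters as cilium_drop_count_total. Below is a minimal sketch of that style of accounting; the map name, key/value layout, and numeric codes are illustrative assumptions, not Cilium's actual definitions.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Simplified stand-ins; Cilium's real reason/direction codes differ. */
#define METRIC_INGRESS        1
#define METRIC_EGRESS         2
#define DROP_MISSED_TAIL_CALL 134  /* illustrative code only */

struct metrics_key {
	__u8  reason;   /* drop reason, e.g. DROP_MISSED_TAIL_CALL */
	__u8  dir;      /* METRIC_INGRESS or METRIC_EGRESS */
	__u16 pad;
};

struct metrics_value {
	__u64 count;
	__u64 bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
	__uint(max_entries, 1024);
	__type(key, struct metrics_key);
	__type(value, struct metrics_value);
} metrics_sketch SEC(".maps");

/* Bump the per-reason/per-direction counters; the agent scrapes counters
 * like these and exposes them as cilium_drop_count_total. */
static __always_inline void account_drop(__u64 bytes, __u8 dir, __u8 reason)
{
	struct metrics_key key = { .reason = reason, .dir = dir };
	struct metrics_value *val, init = { .count = 1, .bytes = bytes };

	val = bpf_map_lookup_elem(&metrics_sketch, &key);
	if (val) {
		val->count += 1;
		val->bytes += bytes;
	} else {
		bpf_map_update_elem(&metrics_sketch, &key, &init, BPF_ANY);
	}
}

char _license[] SEC("license") = "GPL";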
@giorio94 added the area/CI and ci/flake labels on Dec 19, 2023
@ti-mo (Contributor) commented Dec 19, 2023

Related: #28088

@ti-mo (Contributor) commented Jan 29, 2024

@giorio94 Thanks for the report! As far as I can tell, the failed job downgraded to 1.14.5 (I'm assuming this is correct and it's not using the unreleased v1.14 tip under the covers!), which doesn't yet have the latest round of fixes (the new 'Host datapath not ready' drop reason). I remain hopeful that this will fix the issue once and for all. 😄

@bimmlerd (Member) commented Mar 5, 2024

Hit in #31119
Link: https://github.com/cilium/cilium/actions/runs/8140815226/job/22284010883#step:19:276
sysdump: cilium-sysdump-ipsec-downgrade-1-20240305-085616.7z.zip (sorry for the stupid format, but GH doesn't allow 7z files, and just zip was 30M 😮‍💨)

@ti-mo (Contributor) commented Mar 6, 2024

Thanks for the reports, all.

@julianwiedmann FYI, we've found and fixed a few 'real' missed tail calls in the datapath revealed by static analysis on the bytecode: 672fccf.

Are these code paths that could potentially be triggered during up/downgrade tests, resulting in missed tail calls? I'm relatively confident the user space side is sound, even on release branches, so we may need to start looking into the datapath itself. #30972 should provide more clarity on 1.16, but I've hit a small blocker and it still needs to be backported; it's almost impossible to find the culprit until it lands.

@julianwiedmann (Member)

> Thanks for the reports, all.
>
> @julianwiedmann FYI, we've found and fixed a few 'real' missed tail calls in the datapath revealed by static analysis on the bytecode: 672fccf.
>
> Are these code paths that could potentially be triggered during up/downgrade tests, resulting in missed tail calls?

Not for testing in combination with IPsec, unfortunately :/. So we're still chasing ...

> I'm relatively confident the user space side is sound, even on release branches, so we may need to start looking into the datapath itself. #30972 should provide more clarity on 1.16, but I've hit a small blocker and it still needs to be backported; it's almost impossible to find the culprit until it lands.

And there's no way to start debugging with a WIP branch, because we need actual builds for the up/downgrade? Hmpf.

@chancez (Contributor) commented Mar 13, 2024

> And there's no way to start debugging with a WIP branch, because we need actual builds for the up/downgrade? Hmpf.

We do build and push images and dev charts to quay for PRs, IIRC, so you have some options there.

@dylandreimerink (Member) commented Mar 15, 2024

@marqc Just made a discovery with regards to this flake. Quote from a Slack thread:

I combined changes from PR #29711 + PR #30972 with a downgrade test to v1.15 + PR #30972.
cilium bpf metrics list shows a missed tail call at bpf_host.c line 1181:
cilium-c969j

{
  "reason": "Missed tail call",
  "direction": "ingress",
  "packets": 124,
  "bytes": 39928,
  "line": 1181,
  "file": "bpf_host.c"
}

cilium-m7rmx
{
  "reason": "Missed tail call",
  "direction": "ingress",
  "packets": 62,
  "bytes": 19964,
  "line": 1181,
  "file": "bpf_host.c"
}

cilium-v4grq
{
  "reason": "Missed tail call",
  "direction": "ingress",
  "packets": 248,
  "bytes": 79856,
  "line": 1181,
  "file": "bpf_host.c"
}

If I'm reading this correctly, then the missed tail call is here:

		ret = tail_call_internal(ctx, from_host ? CILIUM_CALL_IPV4_FROM_HOST :
							  CILIUM_CALL_IPV4_FROM_NETDEV,
					 &ext_err);
		/* We are not returning an error here to always allow traffic to
		 * the stack in case maps have become unavailable.
		 *
		 * Note: Since drop notification requires a tail call as well,
		 * this notification is unlikely to succeed.
		 */
		return send_drop_notify_error_ext(ctx, identity, ret, ext_err,
						  CTX_ACT_OK, METRIC_INGRESS);

So what we know is:

  1. We have a temporarily missing tail call (the metric increments during program replacement, but stops incrementing once the downgrade is done) [link]
  2. It's on the ingress path
  3. It's in do_netdev in bpf_host.c

Since it's ingress, from_host must be false, so it's the CILIUM_CALL_IPV4_FROM_NETDEV tail call that is missing.

But I have yet to come up with an idea of how or why that situation might occur. Perhaps someone else has an idea?
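
To make the suspected failure mode concrete, here is a minimal sketch, assuming a simplified calls map and illustrative index values (this is not Cilium's actual bpf_host.c code): bpf_tail_call() only returns to the caller when the call fails, for example when the target slot in the program array has been cleared, or not yet repopulated, while the loader replaces programs during the downgrade. That fallthrough is what gets accounted as a "Missed tail call" drop before the packet is handed to the stack.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Illustrative index only; the real CILIUM_CALL_* values are defined
 * elsewhere in the Cilium tree. */
#define CILIUM_CALL_IPV4_FROM_NETDEV 7

/* Simplified stand-in for Cilium's calls map (a BPF program array). */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 256);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} calls_sketch SEC(".maps");

SEC("tc")
int from_netdev_sketch(struct __sk_buff *ctx)
{
	/* On success this never returns: execution continues in the program
	 * stored at index CILIUM_CALL_IPV4_FROM_NETDEV. If the slot is
	 * momentarily empty while the loader is replacing programs, the call
	 * fails and execution falls through below. */
	bpf_tail_call(ctx, &calls_sketch, CILIUM_CALL_IPV4_FROM_NETDEV);

	/* Fallthrough: this is the situation bpf_host.c:1181 accounts as a
	 * "Missed tail call" drop before letting the packet continue to the
	 * stack (CTX_ACT_OK in Cilium, TC_ACT_OK here). */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

If that is the mechanism here, the drop window should close as soon as the loader finishes repopulating the slot, which would match observation (1) above.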

github-actions bot commented Jul 6, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Jul 6, 2024

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned on Jul 20, 2024