Fix: flaky Dex login test by improving authentication flow handling #3082

madmecodes · 2025-04-02T05:32:35Z

✏️ Summary of Changes

Make the Dex test more robust (fix flakiness)

Solution:

403 Handling - When receiving a 403 Forbidden response during login, we attempt to restart the OAuth flow, which helps when sessions expire.

Reproduce the issue:

# Run multiple concurrent requests to increase system load
kubectl run load-test --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://localhost:8080; done"

You will get this:

After implementing 403 handling:

# Handle 403 specifically - might need to restart oauth flow
          if resp.status_code == 403:
                    # Let's try one more approach - go through the oauth2 flow again
                    oauth_url = f"{urlsplit(self._endpoint_url).scheme}://{urlsplit(self._endpoint_url).netloc}/oauth2/start"
                    resp = s.get(
                        oauth_url,
                        headers=headers,
                        allow_redirects=True,
                        verify=not self._skip_tls_verify,
                        timeout=30
                    )
                    
                    # Continue with normal flow after restart
                    if resp.status_code == 200:
                        # If we have cookies now, we're good
                        if s.cookies:
                            return "; ".join([f"{c.name}={c.value}" for c in s.cookies])

Successfully passed, after main 403 retry. (load test pod still running and if i comment out the 403 mechanism, it again throws the error)

📦 Dependencies

Asked to resolve the issue in PR 3056

🐛 Related Issues

none

✅ Contributor Checklist

I have tested these changes with kustomize. See Installation Prerequisites.
All commits are signed-off to satisfy the DCO check.

You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

madmecodes · 2025-04-02T05:33:24Z

@juliusvonkohout looking forward for the feedback

juliusvonkohout · 2025-04-07T15:18:51Z

I see a very large git diff, are you sure that you need to change several hundred lines for that ?

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

madmecodes · 2025-04-07T15:53:13Z

@juliusvonkohout looking forward for the feedback

The key addition is just the 403 handling block after the login POST request, which restarts the OAuth flow. The retry logic is also minimal but effective.

madmecodes · 2025-04-07T15:57:39Z

Retry 403 logic is working fine, i tested both the versions,
Original version: Detects 403 errors but has no recovery mechanism, so it simply fails
Improved version: Not only detects 403 errors but successfully restarts the OAuth flow to recover from them

juliusvonkohout · 2025-04-08T08:20:06Z

tests/gh-actions/test_dex_login.py

+                s = requests.Session()
+
+                # GET the endpoint_url, which should redirect to Dex
+                resp = s.get(


Lest use clean coding standards, so proper long pronounceable variable names. session instead of s, response instead of resp, url_object instead of url_obj

okay i will be more verbose

juliusvonkohout · 2025-04-08T08:21:03Z

tests/gh-actions/test_dex_login.py

+                return "; ".join([f"{c.name}={c.value}" for c in s.cookies])
+
+            except Exception as e:
+                if attempt < max_retries - 1:


Why do you use -1 instead of <= ?

I used attempt < max_retries - 1 to ensure we retry only up to the second-to-last attempt (e.g., attempts 0 and 1 for max_retries = 3).

On the final attempt (attempt 2), if it fails, we fall into the else block to raise the exception without an extra retry.

Using <= max_retries would imply retrying beyond the intended limit since attempt only goes up to max_retries - 1 in the loop.

attempt <= max_retries would be incorrect since attempt never reaches max_retries in a range(max_retries)=> (0,1,2) and maxRetry=3 loop.

I will update the code, for more readability

…t retry logic Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

juliusvonkohout · 2025-04-10T15:43:54Z

tests/gh-actions/test_dex_login.py

+                # Handle 403 specifically - might need to restart oauth flow
+                if response.status_code == 403:
+                    # Try one more approach - go through the oauth2 flow again
+                    oauth_url = f"{urlsplit(self._endpoint_url).scheme}://{urlsplit(self._endpoint_url).netloc}/oauth2/start"
+                    response = session.get(
+                        oauth_url,
+                        allow_redirects=True,
+                        verify=not self._skip_tls_verify,
+                    )
+                    # Continue with normal flow after restart


The difference is really hard to read, but here I see a relevant change. I am worried that this could hide the underlying problem. it is more a workaround than a real solution. Why do we need to retry in the first place ?

Why does a bit of load fail the test so easily ?

That bit of load did fail the test at that time, but when I checked again the next day, it was able to handle it. So I deliberately triggered a 403 using this code block, and as seen in the output, the old version exited with the error, while the new one worked.

def patched_post(original_post): def _patched_post(*args, **kwargs): if random.random() < 0.2: mock_response = requests.Response() mock_response.status_code = 403 mock_response._content = b"Simulated 403 error" mock_response.url = args[0] return mock_response return original_post(*args, **kwargs) return _patched_post original_post = requests.Session.post requests.Session.post = patched_post(original_post)

You're right that this is more of a workaround than a solution to the underlying problem. The 403 errors happen when the Dex authentication flow gets disrupted under load. While this fix makes the test more reliable, we should investigate why these errors occur in the first place.

The error occurs randomly and is a bit hard to reproduce. But yes, when it fails, the same error occurs — a 403 against the URL.

Shall I investigate the logs of these pods and run it again to find the root cause?

kubectl logs -n auth $(kubectl get pods -n auth -l app=dex -o name) -f
kubectl logs -n oauth2-proxy $(kubectl get pods -n oauth2-proxy -o name) -f

Yes please investigate, but i will merge it for now

juliusvonkohout · 2025-04-14T06:53:26Z

/lgtm
/approve

google-oss-prow · 2025-04-14T06:53:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [juliusvonkohout]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ubeflow#3082) * Fix: flaky Dex login test by improving authentication flow handling Signed-off-by: madmecodes <ayushguptadev1@gmail.com> * improve Dex login test reliability with retry and 403 handling Signed-off-by: madmecodes <ayushguptadev1@gmail.com> * Update: Improve test readability with descriptive variables and robust retry logic Signed-off-by: madmecodes <ayushguptadev1@gmail.com> * Update dex_oauth2-proxy_test.yaml Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com> --------- Signed-off-by: madmecodes <ayushguptadev1@gmail.com> Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com> Co-authored-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

Fix: flaky Dex login test by improving authentication flow handling

b887182

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

google-oss-prow bot requested review from juliusvonkohout and kimwnasptd April 2, 2025 05:32

google-oss-prow bot added the size/L label Apr 2, 2025

juliusvonkohout mentioned this pull request Apr 7, 2025

end-to-end integration tests #3077

Merged

2 tasks

improve Dex login test reliability with retry and 403 handling

9787e7b

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

juliusvonkohout reviewed Apr 8, 2025

View reviewed changes

madmecodes and others added 2 commits April 8, 2025 22:39

Update: Improve test readability with descriptive variables and robus…

425e134

…t retry logic Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

Update dex_oauth2-proxy_test.yaml

85af230

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

juliusvonkohout reviewed Apr 10, 2025

View reviewed changes

google-oss-prow bot assigned juliusvonkohout Apr 14, 2025

google-oss-prow bot added the lgtm label Apr 14, 2025

google-oss-prow bot added the approved label Apr 14, 2025

google-oss-prow bot merged commit 23ab69b into kubeflow:master Apr 14, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: flaky Dex login test by improving authentication flow handling #3082

Fix: flaky Dex login test by improving authentication flow handling #3082

madmecodes commented Apr 2, 2025 •

edited

Loading

madmecodes commented Apr 2, 2025

juliusvonkohout commented Apr 7, 2025

madmecodes commented Apr 7, 2025

madmecodes commented Apr 7, 2025

juliusvonkohout Apr 8, 2025

madmecodes Apr 8, 2025

juliusvonkohout Apr 8, 2025

madmecodes Apr 8, 2025

juliusvonkohout Apr 10, 2025 •

edited

Loading

juliusvonkohout Apr 10, 2025

madmecodes Apr 13, 2025

juliusvonkohout Apr 14, 2025

juliusvonkohout commented Apr 14, 2025

google-oss-prow bot commented Apr 14, 2025

Fix: flaky Dex login test by improving authentication flow handling #3082

Fix: flaky Dex login test by improving authentication flow handling #3082

Conversation

madmecodes commented Apr 2, 2025 • edited Loading

✏️ Summary of Changes

Solution:

Reproduce the issue:

📦 Dependencies

🐛 Related Issues

✅ Contributor Checklist

madmecodes commented Apr 2, 2025

juliusvonkohout commented Apr 7, 2025

madmecodes commented Apr 7, 2025

madmecodes commented Apr 7, 2025

juliusvonkohout Apr 8, 2025

Choose a reason for hiding this comment

madmecodes Apr 8, 2025

Choose a reason for hiding this comment

juliusvonkohout Apr 8, 2025

Choose a reason for hiding this comment

madmecodes Apr 8, 2025

Choose a reason for hiding this comment

juliusvonkohout Apr 10, 2025 • edited Loading

Choose a reason for hiding this comment

juliusvonkohout Apr 10, 2025

Choose a reason for hiding this comment

madmecodes Apr 13, 2025

Choose a reason for hiding this comment

juliusvonkohout Apr 14, 2025

Choose a reason for hiding this comment

juliusvonkohout commented Apr 14, 2025

google-oss-prow bot commented Apr 14, 2025

madmecodes commented Apr 2, 2025 •

edited

Loading

juliusvonkohout Apr 10, 2025 •

edited

Loading