Skip to content

Fix: flaky Dex login test by improving authentication flow handling #3082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 14, 2025

Conversation

madmecodes
Copy link
Contributor

@madmecodes madmecodes commented Apr 2, 2025

✏️ Summary of Changes

  1. Make the Dex test more robust (fix flakiness)

Solution:

403 Handling - When receiving a 403 Forbidden response during login, we attempt to restart the OAuth flow, which helps when sessions expire.

Reproduce the issue:

# Run multiple concurrent requests to increase system load
kubectl run load-test --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://localhost:8080; done"

You will get this:
Screenshot 2025-04-02 at 10 54 44 AM

After implementing 403 handling:

# Handle 403 specifically - might need to restart oauth flow
          if resp.status_code == 403:
                    # Let's try one more approach - go through the oauth2 flow again
                    oauth_url = f"{urlsplit(self._endpoint_url).scheme}://{urlsplit(self._endpoint_url).netloc}/oauth2/start"
                    resp = s.get(
                        oauth_url,
                        headers=headers,
                        allow_redirects=True,
                        verify=not self._skip_tls_verify,
                        timeout=30
                    )
                    
                    # Continue with normal flow after restart
                    if resp.status_code == 200:
                        # If we have cookies now, we're good
                        if s.cookies:
                            return "; ".join([f"{c.name}={c.value}" for c in s.cookies])

Successfully passed, after main 403 retry. (load test pod still running and if i comment out the 403 mechanism, it again throws the error)

Screenshot 2025-04-02 at 11 00 38 AM

📦 Dependencies

Asked to resolve the issue in PR 3056

🐛 Related Issues

none

✅ Contributor Checklist


You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
@madmecodes
Copy link
Contributor Author

@juliusvonkohout looking forward for the feedback

@juliusvonkohout
Copy link
Member

I see a very large git diff, are you sure that you need to change several hundred lines for that ?

@juliusvonkohout juliusvonkohout mentioned this pull request Apr 7, 2025
2 tasks
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
@madmecodes
Copy link
Contributor Author

@juliusvonkohout looking forward for the feedback

The key addition is just the 403 handling block after the login POST request, which restarts the OAuth flow. The retry logic is also minimal but effective.

@madmecodes
Copy link
Contributor Author

Retry 403 logic is working fine, i tested both the versions,
Original version: Detects 403 errors but has no recovery mechanism, so it simply fails
Improved version: Not only detects 403 errors but successfully restarts the OAuth flow to recover from them

Screenshot 2025-04-07 at 9 26 53 PM

s = requests.Session()

# GET the endpoint_url, which should redirect to Dex
resp = s.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lest use clean coding standards, so proper long pronounceable variable names. session instead of s, response instead of resp, url_object instead of url_obj

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay i will be more verbose

return "; ".join([f"{c.name}={c.value}" for c in s.cookies])

except Exception as e:
if attempt < max_retries - 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use -1 instead of <= ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used attempt < max_retries - 1 to ensure we retry only up to the second-to-last attempt (e.g., attempts 0 and 1 for max_retries = 3).

On the final attempt (attempt 2), if it fails, we fall into the else block to raise the exception without an extra retry.

Using <= max_retries would imply retrying beyond the intended limit since attempt only goes up to max_retries - 1 in the loop.

attempt <= max_retries would be incorrect since attempt never reaches max_retries in a range(max_retries)=> (0,1,2) and maxRetry=3 loop.

I will update the code, for more readability

madmecodes and others added 2 commits April 8, 2025 22:39
…t retry logic

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Comment on lines +125 to +134
# Handle 403 specifically - might need to restart oauth flow
if response.status_code == 403:
# Try one more approach - go through the oauth2 flow again
oauth_url = f"{urlsplit(self._endpoint_url).scheme}://{urlsplit(self._endpoint_url).netloc}/oauth2/start"
response = session.get(
oauth_url,
allow_redirects=True,
verify=not self._skip_tls_verify,
)
# Continue with normal flow after restart
Copy link
Member

@juliusvonkohout juliusvonkohout Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is really hard to read, but here I see a relevant change. I am worried that this could hide the underlying problem. it is more a workaround than a real solution. Why do we need to retry in the first place ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does a bit of load fail the test so easily ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That bit of load did fail the test at that time, but when I checked again the next day, it was able to handle it. So I deliberately triggered a 403 using this code block, and as seen in the output, the old version exited with the error, while the new one worked.

def patched_post(original_post):
    def _patched_post(*args, **kwargs):
        if random.random() < 0.2:
            mock_response = requests.Response()
            mock_response.status_code = 403
            mock_response._content = b"Simulated 403 error"
            mock_response.url = args[0]
            return mock_response
        return original_post(*args, **kwargs)
    return _patched_post

original_post = requests.Session.post
requests.Session.post = patched_post(original_post)
Screenshot 2025-04-13 at 5 50 33 PM

You're right that this is more of a workaround than a solution to the underlying problem. The 403 errors happen when the Dex authentication flow gets disrupted under load. While this fix makes the test more reliable, we should investigate why these errors occur in the first place.

The error occurs randomly and is a bit hard to reproduce. But yes, when it fails, the same error occurs — a 403 against the URL.

Shall I investigate the logs of these pods and run it again to find the root cause?

kubectl logs -n auth $(kubectl get pods -n auth -l app=dex -o name) -f
kubectl logs -n oauth2-proxy $(kubectl get pods -n oauth2-proxy -o name) -f

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please investigate, but i will merge it for now

@juliusvonkohout
Copy link
Member

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 23ab69b into kubeflow:master Apr 14, 2025
8 checks passed
milinddethe15 pushed a commit to milinddethe15/kf-manifests that referenced this pull request Apr 27, 2025
…ubeflow#3082)

* Fix: flaky Dex login test by improving authentication flow handling

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* improve Dex login test reliability with retry and 403 handling

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* Update: Improve test readability with descriptive variables and robust retry logic

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* Update dex_oauth2-proxy_test.yaml

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

---------

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Co-authored-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants