Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

entrypoint: in case of step command failure, write postfile #687

Merged
merged 1 commit into from
Mar 27, 2019

Conversation

vdemeester
Copy link
Member

@vdemeester vdemeester commented Mar 27, 2019

Changes

The entrypoint package wraps the step commands and execute them. This
allows use to use pods containers with some order. In a step, the
entrypoint binary will wait for the file of the previous step to be
present to execute the actual command.

Before this change, if a command failed (exit 1 or something),
entrypoint would not write a file, and thus the whole pod would be
stuck running (all the next step would wait forever).

This fixes that by always writing the post-file — and making
the waiter a bit smarter :

  • it will now look for a {postfile}.err to detect if the previous
    step failed or not.
  • if the previous steps failed, it will fail too without executing the
    step commands.

Closes #682

cc @pivotal-nader-ziada

I need to update pkg/entrypoint unit tests though 👼

Signed-off-by: Vincent Demeester vdemeest@redhat.com

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide
for more details.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 27, 2019
@tekton-robot tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 27, 2019
@vdemeester vdemeester requested a review from nader-ziada March 27, 2019 08:08
tb.Command("/bin/sh"), tb.Args("-c", "exit 1"),
),
tb.Step("world", "busybox",
tb.Command("/bin/sh"), tb.Args("-c", "sleep 20s"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to detect if this step is executed or not 😅

@vdemeester vdemeester force-pushed the 682-entrypoint-postfile branch from d86f71b to fa47d21 Compare March 27, 2019 08:36
@abayer
Copy link
Contributor

abayer commented Mar 27, 2019

Nice!

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@vdemeester
Copy link
Member Author

vdemeester commented Mar 27, 2019

/hold

Putting on hold for a review from @bobcatfish @pivotal-nader-ziada

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
@vdemeester
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

Copy link
Member

@nader-ziada nader-ziada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great PR @vdemeester 👍

some questions for my understanding


t.Logf("Waiting for TaskRun in namespace %s to fail", namespace)
if err := WaitForTaskRunState(c, "failing-taskrun", TaskRunFailed("failing-taskrun"), "TaskRunFailed"); err != nil {
t.Errorf("Error waiting for TaskRun to finish: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to check the status of the taskrun after it failed and make sure step 2 has failure and step 3 didn't run. Before this fix, it was failing eventually by waiting for the default timeout, somehow if we can check that's not the failure reason.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good point, I should validate that all steps after the exit 1 are terminated and in the same error state 👼
That raise a question : in the case of steps being "skipped" because a previous one failed, should we return another exit code than the default 1 ? (to be able to detect such cases)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, do we end up with confusion if a step exits with a non-0/1 exit code on purpose? I'd lean towards using Reason or Message rather than ExitCode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 it won't change the status of the TaskRun. It would only be useful to detect that some of the container didn't have to execute. We could also "copy" the exit code of the failed process to the later ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what the best option is, would be nice to know it didn't execute, anything here makes sense? http://tldp.org/LDP/abs/html/exitcodes.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed quiclky with @abayer on slack :

  • exit 0 on skipped containers (the one after the failure)
  • updating the TaskRun status to point to the failed step
  • updating the steps status with some skipped messages for the steps that were skip

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • exit 0 on skipped containers (the one after the failure) 👍
  • updating the TaskRun status to point to the failed step 🤔 but is that an api change to add that? It already has an array of steps status which should contain the same info
  • updating the steps status with some skipped messages for the steps that were skip 🤔 I wonder if we are over thinking this. If a step fails, that should be clear enough why the taskrun failed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll go for exit code 0 in here (and make sure the TaskRun status references the correct step as a failure) — we have time to think about skip messages & co later on, in a follow-up.

} else if !os.IsNotExist(err) {
log.Fatalf("Waiting for %q: %v", file, err)
return fmt.Errorf("Waiting for %q: %v", file, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this now going to return an error right away and make it not wait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't change the behavior as log.Fatalf would have called os.Exit (and killed the process). We just "control" where the kill happens 😅

Entrypoint: *ep,
WaitFile: *waitFile,
PostFile: *postFile,
Args: flag.Args(),
Waiter: &RealWaiter{},
Runner: &RealRunner{},
PostWriter: &RealPostWriter{},
}.Go()
}
if err := e.Go(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this block of code make sense inside the Go func?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought of puting it there — but it meant it was harder to "test" the Go function as it would call os.Exit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough :)

@vdemeester vdemeester force-pushed the 682-entrypoint-postfile branch from fa47d21 to ba0a03b Compare March 27, 2019 12:55
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@vdemeester vdemeester force-pushed the 682-entrypoint-postfile branch from ba0a03b to 9cd0521 Compare March 27, 2019 15:09
@abayer
Copy link
Contributor

abayer commented Mar 27, 2019

@vdemeester So for now at least, we're not making any changes to the TaskRunStatus.Steps, right?

@vdemeester
Copy link
Member Author

@vdemeester So for now at least, we're not making any changes to the TaskRunStatus.Steps, right?

Indeed, for now, no change in there 😉

@vdemeester
Copy link
Member Author

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
@abayer
Copy link
Contributor

abayer commented Mar 27, 2019

Excellent. =)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@nader-ziada
Copy link
Member

nice work @vdemeester 🎉

@vdemeester
Copy link
Member Author

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
@vdemeester
Copy link
Member Author

I forgot to remove a piece of code 😹

The entrypoint package wraps the step commands and execute them. This
allows use to use pods containers with some order. In a step, the
entrypoint binary will wait for the file of the previous step to be
present to execute the actual command.

Before this change, if a command failed (`exit 1` or something),
entrypoint would not write a file, and thus the whole pod would be
stuck running (all the next step would wait forever).

This fixes that by always writing the post-file — and making
the *waiter* a bit smarter :

- it will now look for a `{postfile}.err` to detect if the previous
  step failed or not.
- if the previous steps failed, it will fail too without executing the
  step commands.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
@vdemeester vdemeester force-pushed the 682-entrypoint-postfile branch from 9cd0521 to 8fa004b Compare March 27, 2019 16:11
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@vdemeester
Copy link
Member Author

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
@nader-ziada
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@vdemeester
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@tekton-robot tekton-robot merged commit 7c43fba into tektoncd:master Mar 27, 2019
@vdemeester vdemeester deleted the 682-entrypoint-postfile branch March 27, 2019 17:20
pradeepitm12 referenced this pull request in openshift/tektoncd-pipeline Jan 27, 2021
Previously, we were never closing Idle connections leading to issues described
in #687.  This commit adds a fixed 2 minute timeout for idle connections though
later we can also add other timeouts as well as allow for users to change the
timeout values.

I verified this manually by building on a base image with a shell and then
verifying that the number of open connections eventually go down unlike before.

Signed-off-by: Dibyo Mukherjee <dibyo@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If an init container in a TaskRun fails, entire TaskRun should fail
5 participants