entrypoint: in case of step command failure, write postfile #687

vdemeester · 2019-03-27T08:07:59Z

Changes

The entrypoint package wraps the step commands and executes them. This
allows us to run the pod's containers in order. In a step, the
entrypoint binary waits for the file of the previous step to be
present before executing the actual command.

Before this change, if a command failed (exit 1 or similar),
entrypoint would not write a file, and thus the whole pod would be
stuck running (all the following steps would wait forever).

This fixes that by always writing the post-file, and by making
the waiter a bit smarter:

- it now looks for a {postfile}.err to detect whether the previous
  step failed.
- if the previous step failed, it fails too, without executing the
  step commands.

Closes #682

cc @pivotal-nader-ziada

~~I need to update pkg/entrypoint unit tests though 👼~~

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
~~[ ] Includes docs (if user facing)~~
Commit messages follow commit message best practices

See the contribution guide
for more details.

tekton-robot · 2019-03-27T08:08:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vdemeester · 2019-03-27T08:10:14Z

test/taskrun_test.go

+			tb.Command("/bin/sh"), tb.Args("-c", "exit 1"),
+		),
+		tb.Step("world", "busybox",
+			tb.Command("/bin/sh"), tb.Args("-c", "sleep 20s"),


Added this to detect if this step is executed or not 😅

abayer · 2019-03-27T08:49:17Z

Nice!

/lgtm

vdemeester · 2019-03-27T08:55:01Z

/hold

Putting on hold for a review from @bobcatfish @pivotal-nader-ziada

vdemeester · 2019-03-27T11:25:37Z

/test pull-tekton-pipeline-integration-tests

nader-ziada

great PR @vdemeester 👍

some questions for my understanding

nader-ziada · 2019-03-27T12:30:12Z

test/taskrun_test.go

+
+	t.Logf("Waiting for TaskRun in namespace %s to fail", namespace)
+	if err := WaitForTaskRunState(c, "failing-taskrun", TaskRunFailed("failing-taskrun"), "TaskRunFailed"); err != nil {
+		t.Errorf("Error waiting for TaskRun to finish: %s", err)


It would be nice to check the status of the TaskRun after it failed and make sure step 2 has a failure and step 3 didn't run. Before this fix, it eventually failed by hitting the default timeout; it would be good to check that this is not the failure reason.

Right, good point, I should validate that all steps after the exit 1 are terminated and in the same error state 👼
That raises a question: in the case of steps being "skipped" because a previous one failed, should we return an exit code other than the default 1 (to be able to detect such cases)?

If we do that, do we end up with confusion if a step exits with a non-0/1 exit code on purpose? I'd lean towards using Reason or Message rather than ExitCode.

🤔 It won't change the status of the TaskRun. It would only be useful to detect that some of the containers did not have to execute. We could also "copy" the exit code of the failed process to the later ones.

Not sure what the best option is; it would be nice to know it didn't execute. Does anything here make sense? http://tldp.org/LDP/abs/html/exitcodes.html

We discussed this quickly with @abayer on Slack:

- exit 0 on skipped containers (the ones after the failure)
- updating the TaskRun status to point to the failed step
- updating the steps status with some skipped messages for the steps that were skipped

- exit 0 on skipped containers (the ones after the failure) 👍
- updating the TaskRun status to point to the failed step 🤔 but is it an API change to add that? It already has an array of step statuses, which should contain the same info
- updating the steps status with some skipped messages for the steps that were skipped 🤔 I wonder if we are overthinking this. If a step fails, that should be clear enough as to why the TaskRun failed

Ok, I'll go for exit code 0 in here (and make sure the TaskRun status references the correct step as a failure) — we have time to think about skip messages & co later on, in a follow-up.

cmd/entrypoint/main.go

nader-ziada · 2019-03-27T12:39:01Z

cmd/entrypoint/main.go

 		} else if !os.IsNotExist(err) {
-			log.Fatalf("Waiting for %q: %v", file, err)
+			return fmt.Errorf("Waiting for %q: %v", file, err)


is this now going to return an error right away and make it not wait?

It doesn't change the behavior as log.Fatalf would have called os.Exit (and killed the process). We just "control" where the kill happens 😅

nader-ziada · 2019-03-27T12:43:29Z

cmd/entrypoint/main.go

 		Entrypoint: *ep,
 		WaitFile:   *waitFile,
 		PostFile:   *postFile,
 		Args:       flag.Args(),
 		Waiter:     &RealWaiter{},
 		Runner:     &RealRunner{},
 		PostWriter: &RealPostWriter{},
-	}.Go()
+	}
+	if err := e.Go(); err != nil {


does this block of code make sense inside the Go func?

I initially thought of putting it there, but that would have made the Go function harder to test, as it would call os.Exit.

fair enough :)

abayer · 2019-03-27T15:51:01Z

@vdemeester So for now at least, we're not making any changes to the TaskRunStatus.Steps, right?

vdemeester · 2019-03-27T15:51:38Z

@vdemeester So for now at least, we're not making any changes to the TaskRunStatus.Steps, right?

Indeed, for now, no change in there 😉

vdemeester · 2019-03-27T15:51:45Z

/hold cancel

abayer · 2019-03-27T15:52:15Z

Excellent. =)

/lgtm

nader-ziada · 2019-03-27T16:06:11Z

nice work @vdemeester 🎉

vdemeester · 2019-03-27T16:10:07Z

/hold

vdemeester · 2019-03-27T16:10:19Z

I forgot to remove a piece of code 😹

vdemeester · 2019-03-27T16:11:47Z

/hold cancel

nader-ziada · 2019-03-27T16:15:49Z

/lgtm

vdemeester · 2019-03-27T16:49:31Z

/test pull-tekton-pipeline-integration-tests

Previously, we were never closing Idle connections leading to issues described in #687. This commit adds a fixed 2 minute timeout for idle connections though later we can also add other timeouts as well as allow for users to change the timeout values. I verified this manually by building on a base image with a shell and then verifying that the number of open connections eventually go down unlike before. Signed-off-by: Dibyo Mukherjee <dibyo@google.com>

tekton-robot requested review from bobcatfish and imjasonh March 27, 2019 08:08

googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 27, 2019

tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 27, 2019

vdemeester requested a review from nader-ziada March 27, 2019 08:08

vdemeester force-pushed the 682-entrypoint-postfile branch from d86f71b to fa47d21 Compare March 27, 2019 08:36

tekton-robot assigned abayer Mar 27, 2019

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019

tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019

nader-ziada reviewed Mar 27, 2019

vdemeester force-pushed the 682-entrypoint-postfile branch from fa47d21 to ba0a03b Compare March 27, 2019 12:55

tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019

vdemeester force-pushed the 682-entrypoint-postfile branch from ba0a03b to 9cd0521 Compare March 27, 2019 15:09

tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019

tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019

vdemeester force-pushed the 682-entrypoint-postfile branch from 9cd0521 to 8fa004b Compare March 27, 2019 16:11

tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019

tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019

tekton-robot assigned nader-ziada Mar 27, 2019

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019

tekton-robot merged commit 7c43fba into tektoncd:master Mar 27, 2019

vdemeester deleted the 682-entrypoint-postfile branch March 27, 2019 17:20
