Container failure reasons should be more clear and include more types of failures #24

euank · 2015-02-24T19:37:21Z

When a container unexpectedly exits, it should be more clearly communicated what went wrong.

This issue is to point out specific cases where it could be improved.

OOMKilled - If a container is killed due to memory constraints, this should be communicated.
"no such image" should instead give the error from pulling (e.g. registry auth).
Anything docker 1.5 puts in the .State.Error field (executable file not found in $PATH, etc) should be bubbled up.
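To illustrate the kind of reporting these three cases ask for, here is a minimal sketch, assuming a `State` dictionary shaped like the one `docker inspect` returns (with `OOMKilled`, `Error`, and `ExitCode` keys). The function name `describe_exit` and the reason strings are hypothetical and are not the agent's actual output.

```python
def describe_exit(state):
    """Return a short human-readable reason for a stopped container.

    `state` is assumed to look like the .State object from `docker inspect`.
    The wording of the returned strings is illustrative only.
    """
    # A memory-limit kill is the most specific signal, so check it first.
    if state.get("OOMKilled"):
        return "OOMKilled: container exceeded its memory limit"

    # Otherwise surface whatever Docker recorded in .State.Error, if anything.
    error = state.get("Error")
    if error:
        return "DockerError: " + error

    # Fall back to the bare exit code when nothing more specific is known.
    return "Exited with code {}".format(state.get("ExitCode", "unknown"))
```

For example, `describe_exit({"OOMKilled": True, "ExitCode": 137})` reports the memory kill rather than only the exit code.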

Additional suggestions are welcome.

sthomp · 2015-04-17T21:22:12Z

Sometimes when I call RunTask, the response has 'failures' with reason 'AGENT'. What does this reason mean? (I assume it is a catch-all reason that will be fixed as part of this issue?)

euank · 2015-04-17T21:42:24Z

That's actually separate from this issue. This refers to the 'reason' field that can be populated when a task moves to stop. This will show up in 'describe-tasks', not run/start task.

The run/start task 'failure' response means that the backend will not pass the request on to an agent for some reason. The reason 'AGENT' means that the container instance it tried to place the task on does not have a healthy agent running on it. The failure output also references the container instance ARN.

You can verify this by checking the 'agentConnected' field in the output of 'describe-container-instances'.
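As a rough sketch of that check, the snippet below parses the JSON printed by 'describe-container-instances' and lists the instances whose 'agentConnected' field is false. The function name `disconnected_instances` is hypothetical, and the code assumes the output has a top-level `containerInstances` list with `containerInstanceArn` and `ec2InstanceId` fields.

```python
import json


def disconnected_instances(describe_output):
    """Return (container instance ARN, EC2 instance id) pairs with no connected agent.

    `describe_output` is assumed to be the JSON text printed by
    `aws ecs describe-container-instances`.
    """
    response = json.loads(describe_output)
    return [
        (ci.get("containerInstanceArn"), ci.get("ec2InstanceId"))
        for ci in response.get("containerInstances", [])
        # Treat a missing field as "not connected" so it gets flagged too.
        if not ci.get("agentConnected", False)
    ]
```

Any pair it returns is a candidate 'ghost' worth comparing against the EC2 instance id.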

These 'ghost' container instances could exist for a few reasons:

Some of your instances/agents crashed or were unable to connect
You're running an older copy of the agent from the preview which did not persist state
You're running the agent without ECS_CHECKPOINT=true
You're changing the ECS_DATADIR between runs of the agent (e.g. stopping the agent on boot and starting it again with different configuration)
Something else 😄

Looking at the EC2 instance id in the describe output and figuring out why it's either not running or ran twice with two different container instance ARNs should help.

Good luck,
Euan

Edit:
I agree that output isn't very clear and we should improve it or document it better.

sthomp · 2015-04-17T21:47:23Z

Thanks for the detail! I have some digging to do...

This solves many problems with task transitions (again!). It also provides significantly more detailed error reasons to the backend. Relates to aws#52, aws#31, and aws#24

sthomp · 2015-04-27T18:10:30Z

@euank It's not clear to me what could be wrong, as the 'reason: AGENT' problem seems sporadic. I have an ECS Cluster with 1 ECS Instance. The ECS instance is running what I believe is the latest AMI (amzn-ami-2015.03.a-amazon-ecs-optimized (ami-ecd5e884)). I don't think this is necessarily a 'ghost' container, because if I retry RunTask a couple of times it will work. I haven't done anything custom with the agent or the container instance... Just running everything default from the AMI. So there must be something wrong with the agent itself. This problem seems to happen fairly regularly. Let me know if this issue should be moved to a separate thread.

euank · 2015-04-27T18:40:56Z

By 'ghosts', I meant container instances which appear in list-container-instances but do not have a corresponding agent. If you only have a single (or small number of) container instances per that API, then my response above was off the mark.

One possibility that occurs to me now is that the agent does a disconnect/reconnect every once in a while and it's being seen as an invalid placement target at that time.

If you want to see if this is what caused it, you can find the times that this disconnect/reconnect occurred with the following command: grep "Creating poll dialer" /var/log/ecs/ecs-agent.log | awk '{print $1}'.
You could then try to correlate the times that run-task failed and see if they were immediately before that message was printed each time.
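The correlation could be sketched roughly as below. This assumes the first whitespace-separated field of each log line is a UTC timestamp like `2015-04-27T18:10:30Z` (the actual agent log layout may differ), and that you have recorded the run-task failure times yourself. The function names and the 60-second window are hypothetical choices.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout of the first field in ecs-agent.log lines.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_reconnect_times(log_lines):
    """Collect the timestamps of lines that mention "Creating poll dialer"."""
    times = []
    for line in log_lines:
        if "Creating poll dialer" in line:
            first_field = line.split()[0]
            times.append(datetime.strptime(first_field, TS_FORMAT))
    return times


def failures_near_reconnect(failure_times, reconnect_times,
                            window=timedelta(seconds=60)):
    """Pair each failure with the first reconnect that follows it within `window`."""
    matched = []
    for failed_at in failure_times:
        for reconnect_at in reconnect_times:
            # Only count reconnects that happen at or after the failure.
            if timedelta(0) <= reconnect_at - failed_at <= window:
                matched.append((failed_at, reconnect_at))
                break
    return matched
```

If most failures end up paired with a reconnect, that would support the disconnect/reconnect theory; if none do, the cause is likely elsewhere.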

If you want to discuss this issue further, please do create a new one.

Best,
Euan

euank · 2015-05-08T20:01:55Z

The errors, as reported in the reason field, are much clearer in the v1.1.0 agent (see 7c02e04).

There's still room for improvement, but we can track those improvements separately as needed.

samehmohamed88 · 2017-07-18T17:56:29Z

What does the failure reason "ATTRIBUTE" mean? That certainly doesn't qualify as a helpful error message! Issue 535 is related to fluentd but there's nothing in this message or that message that clearly indicated it's fluentd or any other thing!

JonCubed · 2017-08-16T12:52:55Z

@aboarya A failure reason of ATTRIBUTE means that one or more requiresAttributes of the task definition failed to match the attributes of the container instance.
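A rough way to see which attributes are responsible is to compare the two lists by name, as in the sketch below. It assumes the task definition JSON has a `requiresAttributes` list and the container instance JSON has an `attributes` list, each entry carrying a `name`. The function name `missing_attributes` is hypothetical, and the real matching is done by the ECS backend and may consider more than names.

```python
def missing_attributes(task_definition, container_instance):
    """Return required attribute names that the container instance does not advertise.

    Both arguments are assumed to be dictionaries taken from the
    describe-task-definition and describe-container-instances output.
    """
    available = {attr["name"] for attr in container_instance.get("attributes", [])}
    return [
        req["name"]
        for req in task_definition.get("requiresAttributes", [])
        if req["name"] not in available
    ]
```

A non-empty result, for instance a logging-driver capability the instance lacks, would point at the attribute behind the ATTRIBUTE failure.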

taylorfturner · 2019-12-04T21:07:56Z

@JonCubed is it for sure that ATTRIBUTE is resolved through requiresAttributes that are not in the task definition?

euank added the kind/enhancement label Feb 24, 2015

euank mentioned this issue Feb 24, 2015

Agent should report dispatch failures... somewhere #10

Closed

euank pushed a commit to euank/amazon-ecs-agent that referenced this issue Feb 26, 2015

Serialize map request fields.

3a120b6

Closes aws#24.

euank added a commit that referenced this issue May 4, 2015

Merge pull request #57 from euank/EngineRefactor

cc00d67

Engine refactor, fixes for #52, #31, #24, and resource contention issues (addressing what I think really caused #33).

euank closed this as completed May 8, 2015
