Container failure reasons should be more clear and include more types of failures #24

Closed
euank opened this issue Feb 24, 2015 · 9 comments

@euank
Contributor

euank commented Feb 24, 2015

When a container unexpectedly exits, it should be more clearly communicated what went wrong.

This issue is to point out specific cases where it could be improved.

  1. OOMKilled - If a container is killed due to memory constraints, this should be communicated.

  2. "no such image" should instead give the error from pulling (e.g. registry auth).

  3. Anything Docker 1.5 puts in the .State.Error field (executable file not found in $PATH, etc.) should be bubbled up.

Additional suggestions are welcome.
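
To make the three cases above concrete, here is a minimal sketch of how a stop reason could be derived from a container's `docker inspect` State. It is only an illustration, not the agent's actual code: the function name `stop_reason`, the `pull_error` parameter, and the reason strings are all hypothetical, and it assumes the State dictionary carries `OOMKilled`, `Error`, and `ExitCode` keys.

```python
def stop_reason(state, pull_error=None):
    """Pick the most specific reason available for a stopped container.

    `state` is assumed to be the State dict from `docker inspect`.
    `pull_error` is a hypothetical string holding the error from the
    image pull, if the pull failed.
    """
    # Case 2: report the pull error instead of a bare "no such image".
    if pull_error:
        return f"image pull failed: {pull_error}"
    # Case 1: the container was killed for exceeding its memory limit.
    if state.get("OOMKilled"):
        return "container killed: out of memory"
    # Case 3: bubble up whatever Docker recorded in State.Error.
    if state.get("Error"):
        return f"docker error: {state['Error']}"
    # Fallback: nothing more specific is known.
    return f"exited with code {state.get('ExitCode')}"


print(stop_reason({"OOMKilled": True, "Error": "", "ExitCode": 137}))
print(stop_reason({"OOMKilled": False, "Error": "", "ExitCode": 1}))
```

Checking the pull error first reflects the idea that a container which never started has no meaningful State to report.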

euank pushed a commit to euank/amazon-ecs-agent that referenced this issue Feb 26, 2015
@sthomp

sthomp commented Apr 17, 2015

Sometimes when I call RunTask, the response has 'failures' with reason 'AGENT'. What does this reason mean? (I assume it is a catch-all reason that will be fixed as part of this issue?)

@euank
Contributor Author

euank commented Apr 17, 2015

That's actually separate from this issue. This issue refers to the 'reason' field that can be populated when a task moves to the stopped state. It shows up in 'describe-tasks', not in the run/start task response.

The run/start task 'failure' response means that the backend will not pass the request on to an agent for some reason. The reason 'AGENT' means that the container instance it tried to place the task on does not have a healthy agent running on it. The failure output also references the container instance ARN.

You can verify this by checking the 'agentConnected' field in the output of 'describe-container-instances'.

These 'ghost' container instances could exist for a few reasons:

  • Some of your instances/agents crashed or were unable to connect
  • You're running an older copy of the agent from the preview which did not persist state
  • You're running the agent without ECS_CHECKPOINT=true
  • You're changing the ECS_DATADIR between runs of the agent (e.g. stopping the agent on boot and starting it again with different configuration)
  • Something else 😄

Looking at the EC2 instance id in the describe output and figuring out why it's either not running or ran twice with two different container instance arns should help.
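
As a rough sketch of the check described above, the following parses the JSON output of 'describe-container-instances', lists the container instances whose agent is not connected, and groups ARNs by EC2 instance id to spot an instance that registered twice. The function names and the sample data (including the ARNs and instance ids) are made up for illustration; the field names `containerInstances`, `containerInstanceArn`, `ec2InstanceId`, and `agentConnected` are assumed to match the describe output.

```python
import json
from collections import defaultdict


def disconnected_instances(describe_output):
    """Return (ARN, EC2 instance id) pairs whose agent is not connected."""
    instances = json.loads(describe_output).get("containerInstances", [])
    return [
        (ci["containerInstanceArn"], ci.get("ec2InstanceId"))
        for ci in instances
        if not ci.get("agentConnected", False)
    ]


def duplicate_registrations(describe_output):
    """Map each EC2 instance id to its ARNs when it has more than one."""
    by_ec2 = defaultdict(list)
    for ci in json.loads(describe_output).get("containerInstances", []):
        by_ec2[ci.get("ec2InstanceId")].append(ci["containerInstanceArn"])
    return {ec2: arns for ec2, arns in by_ec2.items() if len(arns) > 1}


# Made-up sample: one EC2 instance registered twice, one copy disconnected.
ARN_PREFIX = "arn:aws:ecs:us-east-1:123456789012:container-instance/"
SAMPLE = json.dumps({
    "containerInstances": [
        {"containerInstanceArn": ARN_PREFIX + "example-a",
         "ec2InstanceId": "i-aaaa", "agentConnected": True},
        {"containerInstanceArn": ARN_PREFIX + "example-b",
         "ec2InstanceId": "i-aaaa", "agentConnected": False},
        {"containerInstanceArn": ARN_PREFIX + "example-c",
         "ec2InstanceId": "i-bbbb", "agentConnected": True},
    ]
})

print(disconnected_instances(SAMPLE))
print(duplicate_registrations(SAMPLE))
```

An EC2 instance id that maps to two ARNs, with one of them disconnected, would be consistent with the ECS_CHECKPOINT or ECS_DATADIR causes listed above.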

Good luck,
Euan

Edit:
I agree that output isn't very clear and we should improve it or document it better.

@sthomp

sthomp commented Apr 17, 2015

Thanks for the detail! I have some digging to do...

euank added a commit to euank/amazon-ecs-agent that referenced this issue Apr 24, 2015
This solves many problems with task transitions (again!). It also
provides significantly more detailed error reasons to the backend.

Relates to aws#52, aws#31, and aws#24
@sthomp

sthomp commented Apr 27, 2015

@euank It's not clear to me what could be wrong, as the 'reason: AGENT' problem seems sporadic. I have an ECS cluster with one ECS instance, running what I believe is the latest AMI (amzn-ami-2015.03.a-amazon-ecs-optimized (ami-ecd5e884)). I don't think this is necessarily a 'ghost' container, because if I retry RunTask a couple of times it will work. I haven't done anything custom with the agent or the container instance; I'm just running everything default from the AMI. So there must be something wrong with the agent itself. This problem seems to happen fairly regularly. Let me know if this issue should be moved to a separate thread.

@euank
Contributor Author

euank commented Apr 27, 2015

By 'ghosts', I meant container instances which appear in list-container-instances but do not have a corresponding agent. If you only have a single (or small number of) container instances per that API, then my response above was off the mark.

One possibility that occurs to me now is that the agent does a disconnect/reconnect every once in a while and it's being seen as an invalid placement target at that time.

If you want to see if this is what caused it, you can find the times that this disconnect/reconnect occurred with the following command: grep "Creating poll dialer" /var/log/ecs/ecs-agent.log | awk '{print $1}'.
You could then try to correlate the times that run-task failed and see if they were immediately before that message was printed each time.
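
The correlation step could be sketched roughly as follows. This assumes the first field printed by the awk command is a UTC timestamp such as 2015-04-27T18:03:12Z, which may not match your log format, and that you have recorded the times of the failed run-task calls in the same format. The function names and the 60-second window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust to whatever the agent log actually prints.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse(ts):
    """Parse one timestamp string in the assumed format."""
    return datetime.strptime(ts, TS_FORMAT)


def failures_near_reconnect(failure_times, reconnect_times, window_seconds=60):
    """Return the failures that happened at most `window_seconds` before a reconnect."""
    window = timedelta(seconds=window_seconds)
    reconnects = [parse(t) for t in reconnect_times]
    hits = []
    for failure in failure_times:
        failed_at = parse(failure)
        # A hit means some reconnect follows this failure within the window.
        if any(timedelta(0) <= r - failed_at <= window for r in reconnects):
            hits.append(failure)
    return hits


reconnects = ["2015-04-27T18:03:12Z"]
failures = ["2015-04-27T18:03:00Z", "2015-04-27T19:00:00Z"]
print(failures_near_reconnect(failures, reconnects))
```

If most failures fall inside the window, that would support the disconnect/reconnect explanation; if few do, the cause is probably elsewhere.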

If you want to discuss this issue further, please do create a new one.

Best,
Euan

euank added a commit to euank/amazon-ecs-agent that referenced this issue Apr 28, 2015
This solves many problems with task transitions (again!). It also
provides significantly more detailed error reasons to the backend.

Relates to aws#52, aws#31, and aws#24
euank added a commit to euank/amazon-ecs-agent that referenced this issue May 4, 2015
This solves many problems with task transitions (again!). It also
provides significantly more detailed error reasons to the backend.

Relates to aws#52, aws#31, and aws#24
euank added a commit that referenced this issue May 4, 2015
Engine refactor, fixes for #52, #31, #24, and resource contention issues (addressing what I think really caused #33).
@euank
Contributor Author

euank commented May 8, 2015

The errors, as reported in the reason field, are much clearer in the v1.1.0 agent (see 7c02e04).

There's still room for improvement, but we can track those improvements separately as needed.

euank closed this as completed May 8, 2015
@samehmohamed88

samehmohamed88 commented Jul 18, 2017

What does the failure reason "ATTRIBUTE" mean? That certainly doesn't qualify as a helpful error message! Issue 535 is related to fluentd, but there's nothing in this message or that message that clearly indicates it's fluentd or anything else!

@JonCubed

@aboarya a failure reason of ATTRIBUTE means that one or more of the requiresAttributes in the task definition failed to match the attributes of the container instance.
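
A rough sketch of that matching, to show how one might find which attribute is the culprit. The function name `unmet_attributes` is hypothetical, the attribute names in the sample are illustrative, and the matching rule (same name, and same value when the requirement specifies one) is an assumption rather than the documented placement logic.

```python
def unmet_attributes(requires_attributes, instance_attributes):
    """Return names of required attributes the container instance lacks.

    Both arguments are assumed to be lists of {"name": ..., "value": ...}
    dicts, with "value" optional.
    """
    have = {attr["name"]: attr.get("value") for attr in instance_attributes}
    unmet = []
    for req in requires_attributes:
        name = req["name"]
        if name not in have:
            # The instance does not advertise this attribute at all.
            unmet.append(name)
        elif req.get("value") is not None and req["value"] != have[name]:
            # The attribute exists but with a different value.
            unmet.append(name)
    return unmet


# Illustrative data: the task needs a fluentd logging capability
# that the instance does not advertise.
task_requires = [
    {"name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"},
    {"name": "com.amazonaws.ecs.capability.logging-driver.fluentd"},
]
instance_has = [
    {"name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"},
    {"name": "com.amazonaws.ecs.capability.logging-driver.json-file"},
]
print(unmet_attributes(task_requires, instance_has))
```

Comparing the two lists this way at least names the missing attribute, which the bare ATTRIBUTE reason does not.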

@taylorfturner

@JonCubed is it for sure that ATTRIBUTE is resolved through requiresAttributes that are not in the task definition?
