Container failure reasons should be more clear and include more types of failures #24
Sometimes when I run RunTask the response will have 'failures' with reason 'AGENT'. What does this reason mean? (I assume this is a catch-all reason that will be fixed as part of this issue?)
That's actually separate from this issue. This issue refers to the 'reason' field that can be populated when a task moves to a stopped state; that will show up in 'describe-tasks', not in run/start task. The run/start task 'failure' response means that the backend will not pass the request on to an agent for some reason. The reason 'AGENT' means that the container instance it tried to place on does not have a healthy agent running on it, and the failure output references that container instance ARN. You can verify this by checking the 'agentConnected' field in the output of 'describe-container-instances'. These 'ghost' container instances could exist for a few reasons.
Looking at the EC2 instance id in the describe output and figuring out why it's either not running, or registered twice with two different container instance ARNs, should help. Good luck!
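A minimal boto3 sketch of the checks described above, assuming credentials and region are already configured; the cluster name is a placeholder. It flags container instances whose agent is not connected and EC2 instances that have registered under more than one container instance ARN:

```python
# Rough sketch: spot "ghost" container instances in a cluster.
from collections import defaultdict

import boto3

ecs = boto3.client("ecs")
CLUSTER = "default"  # placeholder cluster name

arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
if not arns:
    print("No container instances registered in this cluster")
else:
    described = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]

    by_ec2_id = defaultdict(list)
    for ci in described:
        by_ec2_id[ci["ec2InstanceId"]].append(ci["containerInstanceArn"])
        if not ci["agentConnected"]:
            print(f"Agent not connected: {ci['containerInstanceArn']} "
                  f"(EC2 instance {ci['ec2InstanceId']})")

    # The same EC2 instance showing up under two container instance ARNs
    # is the other "ghost" case mentioned above.
    for ec2_id, ci_arns in by_ec2_id.items():
        if len(ci_arns) > 1:
            print(f"EC2 instance {ec2_id} is registered {len(ci_arns)} times: {ci_arns}")
```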
Thanks for the detail! I have some digging to do...
@euank It's not clear to me what could be wrong, as the 'reason: AGENT' problem seems sporadic. I have an ECS cluster with one ECS instance. The instance is running what I believe is the latest AMI (amzn-ami-2015.03.a-amazon-ecs-optimized (ami-ecd5e884)). I don't think this is necessarily a 'ghost' container instance, because if I retry RunTask a couple of times it will work. I haven't done anything custom with the agent or the container instance... just running everything default from the AMI. So there must be something wrong with the agent itself. This problem seems to happen fairly regularly. Let me know if this issue should be moved to a separate thread.
By 'ghosts', I meant container instances which appear in the 'describe-container-instances' output but no longer correspond to a running instance with a connected agent.

One possibility that occurs to me now is that the agent does a disconnect/reconnect every once in a while, and the instance is being seen as an invalid placement target at that time. If you want to see whether this is what caused it, you can check the agent's logs on the instance for when that disconnect/reconnect occurred.

If you want to discuss this issue further, please do create a new one. Best,
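Since the placement failure appears to be transient (the agent reconnects on its own), one workaround is to retry RunTask when the only reported failure reason is 'AGENT'. A minimal boto3 sketch, with placeholder cluster and task definition names:

```python
# Rough sketch: retry RunTask while the only failures are transient 'AGENT'
# placement failures (e.g. the agent is mid disconnect/reconnect).
import time

import boto3

ecs = boto3.client("ecs")

def run_task_with_retry(cluster, task_definition, attempts=3, delay=5.0):
    """Retry RunTask while the only reported failure reason is 'AGENT'."""
    for _ in range(attempts):
        resp = ecs.run_task(cluster=cluster, taskDefinition=task_definition)
        if resp.get("tasks"):
            return resp["tasks"]
        failures = resp.get("failures", [])
        if failures and all(f.get("reason") == "AGENT" for f in failures):
            # No healthy agent on the chosen container instance right now;
            # wait for the agent to reconnect and try again.
            time.sleep(delay)
            continue
        raise RuntimeError(f"RunTask failed: {failures}")
    raise RuntimeError("RunTask kept failing with reason 'AGENT'")

# Example usage (placeholder names):
# tasks = run_task_with_retry("default", "my-task-definition")
```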
The errors, as reported in the reason field, are much clearer in the v1.1.0 agent (see 7c02e04). There's still room for improvement, but we can track those improvements separately as needed.
What does the failure reason
@aboarya a failure reason of
@JonCubed is it for sure that
When a container unexpectedly exits, the reason for the failure should be communicated more clearly.
This issue is to point out specific cases where it could be improved.
OOMKilled - If a container is killed due to memory constraints, this should be communicated.
"no such image" should instead give the error from pulling (e.g. registry auth).
Anything docker 1.5 puts in the .State.Error field (executable file not found in $PATH, etc) should be bubbled up.
Additional suggestions are welcome.
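For reference, the fields called out above are all visible in Docker's inspect output for an exited container. A minimal sketch using the Docker SDK for Python (a newer client than the Docker 1.5-era API this issue refers to); the container name is a placeholder:

```python
# Rough sketch: read the exit details Docker records for a container,
# i.e. the fields this issue asks the agent to surface.
import docker

client = docker.from_env()
container = client.containers.get("example-container")  # placeholder name
container.reload()

state = container.attrs["State"]
print("ExitCode: ", state.get("ExitCode"))
print("OOMKilled:", state.get("OOMKilled"))  # True if the container was killed for exceeding its memory limit
print("Error:    ", state.get("Error"))      # e.g. "executable file not found in $PATH"
```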