VM has reported a failure when processing extension 'cse0' #1806
Comments
@IvanCaro could you paste the api model that you passed as input to acs-engine? @JackQuincy I wonder if cse has a non-idiomatic execution context for this change in some cases: b5eb43b#diff-95c1c34f292e829cdcc06906aaf5c4f1 Does it make sense that a stat failure like the above would ever short-circuit as currently implemented? |
Only reason I could see this failing is if we called set -x or something of the sort earlier in the line/command |
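For illustration, here is a minimal bash sketch of why a bare stat on the sentinel file can abort a script running under set -e. The file path comes from the error message above; the guard pattern is an assumption for illustration, not the actual acs-engine provision script.

```bash
#!/usr/bin/env bash
set -e  # any command that returns non-zero aborts the whole script

SENTINEL=/opt/azure/containers/provision.complete

# Under `set -e`, an unguarded stat on a missing file exits here with the
# same "cannot stat ... No such file or directory" seen in the cse0 error:
#   stat "$SENTINEL"

# A tolerant guard that treats a missing sentinel as "not yet provisioned":
if [ -f "$SENTINEL" ]; then
    echo "provisioning already completed, nothing to do"
    exit 0
fi

# ... provisioning steps would run here ...

touch "$SENTINEL"   # mark success only after everything above succeeded
```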
@JackQuincy I can reproduce consistently when specifying |
Hey @jackfrancis, this happened with customVnet and custom DNS servers. I created the cluster without DNS (I used the Azure provider) and changed it afterwards (that works). |
@jackfrancis @JackQuincy I have also faced this issue before; when I changed the location and redeployed, it worked. The issue is not consistent and does not happen every time. |
@IvanCaro Do you agree this is an indeterminate, ephemeral issue and we should close this? |
Same issue here using Kubernetes 1.6 as the orchestrator and customVnet. Is there a workaround for this? Update: I had to add a maxPods value to the api model, otherwise the value would be empty and prevent the kubelet from starting. |
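If you suspect the same empty maxPods problem, one way to check on a node is to inspect the flags the kubelet actually started with; a hedged sketch, since the exact config file locations vary by acs-engine version and image:

```bash
# Show the --max-pods argument of the running kubelet (empty output means
# the flag is missing or has no value)
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -- '--max-pods'

# Also look in the static kubelet config the provision script may have written
# (paths are assumptions and may differ on your image)
grep -rs 'max-pods' /etc/default/kubelet /etc/systemd/system/kubelet.service* 2>/dev/null
```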
I found that I was having this because my nodes (master and agents) were not able to reach k8s.gcr.io to download kubectl. I discovered this by logging into the master and looking at /var/log/cluster-provision.log, which ended with:
I traced this back to here - https://github.com/Azure/acs-engine/edit/master/parts/k8s/kubernetesmastercustomscript.sh#L571 This indicated it was having trouble running kubectl, so I tried invoking it from the ssh terminal. Lo and behold - command not found. That file led me to the fact that it's installed using a service called kubectl-extract. Looking at its logs using
So there was a problem downloading kubectl from k8s.gcr.io. Turns out it was a DNS problem, but that's just my network... Hope this helps someone debug a related issue. |
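The same debugging path can be scripted. A rough checklist to run on the master over SSH, assuming the log location and the kubectl-extract unit name mentioned above:

```bash
# 1. How far did provisioning get before it stopped?
sudo tail -n 50 /var/log/cluster-provision.log

# 2. Did the kubectl download/extract unit succeed?
sudo systemctl status kubectl-extract --no-pager
sudo journalctl -u kubectl-extract --no-pager | tail -n 50

# 3. Can the node resolve and reach the download endpoint at all?
nslookup k8s.gcr.io
curl -sSfI https://k8s.gcr.io/ >/dev/null && echo "reachable" || echo "NOT reachable"
```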
I ran into this as well. It failed when run in westus2. I changed to eastus and it worked |
I'm having same issues when deploying to custom VNET on WestEurope using acs-engine 0.12.4 and Kubernetes 1.9.1. |
Just to acknowledge this. I get the exact same thing with a custom VNET in West Europe, while North Europe works just fine. Update: This is flaky somehow; now I can't deploy even without a custom VNET without it getting stuck on the extension for the master node. |
I got the same error and the problem was the DNS resolution at the VNET. After I fixed my custom DNS servers to resolve internet names everything worked fine. |
I'm getting this error with or without custom VNET. And here is the build from a pull request with a potential fix that still failed, https://circleci.com/gh/Azure/acs-engine/14298?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link //Morten |
cse* errors generally are a result of the provisioning process on the host failing. I'd like to keep this ticket open to encourage folks to share (bad) experiences. We're working on (1) improving logging around this and (2) hunting down transient errors (such as lack of DNS access would incur) and try, where appropriate, to introduce add'l resilience. |
Also hit the same problem in eastus: cse0 timed out and the master VM can't be reached over SSH. |
Just identified one cause of transient cse errors (DNS availability race condition on cluster provisioning), added some retry resiliency and am hoping that eliminates that symptom. @feiskyer please try to repro using master next week and let me know if you can, thanks! |
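The retry approach is roughly of the following shape; this is a generic sketch of the idea, not the exact helper that landed in the provision scripts:

```bash
# Retry a command a few times before giving up, to ride out transient DNS or
# network failures during early boot instead of failing the whole extension.
retry() {
    local attempts=$1 wait_seconds=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        sleep "$wait_seconds"
    done
    return 1
}

# Example: tolerate a few flaky lookups before declaring DNS broken.
retry 10 5 nslookup k8s.gcr.io
```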
@jackfrancis sure |
@jackfrancis Thanks for pushing the fix. I've tested with the latest master, but unfortunately it didn't help in my case. I consistently face this issue if networking is set to In case this is useful, here's my template:
And my
Just tried with |
I've been testing a lot today with acs-engine 0.12.5 and have yet to run into this issue. Last week with 0.12.4 and 0.12.2 I got it all the time. So it seems to be much better 👍 |
@ilyalukyanov This PR also moves the ball forward: That is aimed for master today, should reduce further cse flakiness. |
I've literally deployed 20 times today without issues. Then suddenly the extension error popped up again. For the exact same generated template. This was a template generated with acs-engine 0.12.5. |
@jackfrancis thanks for prompt fixes! I'll give them a go later this week and will update this thread. |
This is still happening in West Europe. |
Is there a workaround to get the partial deployment into a healthy state? |
@idanshahar Are you able to build from master? Much of the work post v0.12 has been identifying transient issues with provision scripts (and dependencies), which is where CSE deployment errors originate. @Jarlotee Depending on the scenario you could cherry-pick through the provision script Thanks for your endurance, all. :) |
@jackfrancis and anyone else who gets bitten by this. My issue turned out to be in the SPN password which had a % in it! The password was truncated at the % which caused the subsequent failure. |
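If you suspect the same truncation, one place to compare is the cloud provider config that ends up on the node; the path and field name below are assumptions based on typical acs-engine clusters, so treat this as a sketch:

```bash
# Length of the service principal secret the node actually received;
# compare it with the length of the secret you passed to acs-engine.
sudo jq -r '.aadClientSecret' /etc/kubernetes/azure.json | wc -c
```

If the lengths differ, regenerating the SPN secret without characters that are special to shells or templates (such as %) is the simplest fix.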
@jackfrancis Yes, I can do so, but I still need a patch for a customer. When is the next version supposed to be released? BTW, another issue exists in the master branch... #2198 UPDATE After building from master, this is the error I've got:
|
@CecileRobertMichon Good news thanks. |
@khaldoune I think I have the cause. There seems to be a regression with Calico. I'm trying to find out which commit introduced the regression. In the meantime, if it's an option for you, the same apimodel will work if you remove the line |
@CecileRobertMichon I've already tried to start from 0.14.0 and replace calico 0.7 with 0.1 (I updated the tgz url); the provisioning failed. I've also seen something strange in calico's manifest: cniversion: 0.1 instead of 0.7; changing it to 0.7 did not change anything. I hope it helps. |
To clarify, the regression is not a general calico regression as deployments using |
The version of Calico being deployed in the master branch is quite old (2.6.3); see the releases. Could you check whether the latest version in my PR referenced above resolves your problem? As mentioned, the script extension will fail if the nodes are not ready. I haven't done any digging, but one suspect is that the |
Thanks @dtzar! @khaldoune can you try changing the value of "clusterSubnet" to match the value of "vnetCidr"? |
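A quick, hedged way to compare the two values before deploying, assuming a standard acs-engine apimodel layout and an apimodel file named kubernetes.json:

```bash
# Print clusterSubnet and vnetCidr side by side; they should agree when
# masters and agents share the VNET address space.
jq -r '.properties.orchestratorProfile.kubernetesConfig.clusterSubnet,
       .properties.masterProfile.vnetCidr' kubernetes.json
```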
To clarify and record, |
@khaldoune It would be good to understand what's going on with your network topology/configuration. Per your above configuration, I see: |
Hi, thanks all for your assistance, I was out of the office yesterday... @CecileRobertMichon, we need Calico because we are using it for project/namespace/network isolation. @dtzar: I had worked around issue 2202 by disabling Encryption at Rest. If my understanding is correct, we should have clusterSubnet = kubeClusterCidr = the pods' CIDR. From a design point of view, the pods' CIDR should be private (not directly addressable from outside the k8s cluster), and thus we should be able to use something other than the masters' and workers' CIDRs as the pod CIDR. That's what I'm trying to achieve. In Azure, a VNET can have several address spaces, so if we read here: https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/features.md#feat-custom-vnet "Additionally, to prevent source address NAT'ing within the VNET, we assign to the vnetCidr property in masterProfile the CIDR block that represents the usable address space in the existing VNET" I understand that I just need to add another address space (10.10.0.0/16) to my k8s VNET (198.18.184.0/22) and the magic should happen. I've just successfully deployed a K8S 1.9.6 cluster using a modified version of acs-engine 0.13.0 $ k get nodes 122/125 of the Sonobuoy tests on this cluster are passing (I will analyse the 3 failures later). Here is the complete configuration of the subnet: An excerpt: I've replaced 10.0.0.0/16 with 172.16.0.0/16 because the first one is already used. As you can see, I have 2 address spaces in my VNET. @CecileRobertMichon I've also been able to deploy a k8s 1.9.3 cluster using acs-engine 0.13.1 k get nodes I will try to provision a cluster with PR2551. I will keep you updated. |
Provisioning with PR2521 has failed. $ acs-engine version Here are the logs (provision, cloud-init, cloud-init output): Thanks |
@jackfrancis @CecileRobertMichon @dtzar The deployment fails even with acs-engine 0.14.5 and with a routable CIDR for pods (in the VNET address space): vnet_prefix: 198.18.184.0/21 Here is the cluster definition: {
@jackfrancis @idanshahar @CecileRobertMichon @dtzar My /etc/cni is empty. Where/when does acs-engine create its content? Thanks. |
@khaldoune /etc/cni should contain net.d:
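For comparison, on a healthy node you would expect something along these lines; the exact files depend on the network plugin chosen, so this is only a rough reference:

```bash
# CNI network configuration the kubelet reads at pod creation time
ls -la /etc/cni/net.d/

# CNI plugin binaries (azure, calico, loopback, etc., depending on the plugin)
ls -la /opt/cni/bin/
```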
Since you were able to deploy the same api model with two vnets in v13.1 and see ready nodes this might be a regression. I suspect it could be linked to issue #2476. Could you please open a new issue since I think we are outside the scope of this current issue for better tracking of the bug/fix? Thank you for your patience, let's get this resolved asap! cc @jackfrancis |
@CecileRobertMichon @jackfrancis Provisioning using Azure CNI instead of Calico with acs-engine 0.14.5 works fine. Provisioning with Calico and a single subnet for both masters and workers fails. I've also double-checked whether or not Encryption at Rest is enabled by default in 0.14.5; it is not. I've just created a new issue: #2607 Thanks for your help. |
I got this error yesterday using acs-engine 15.2 with the distro set to coreos. Removing this from the template and reverting to ubuntu mitigated the issue, but it means we can't provision CoreOS VMs. Marty |
I just upgraded a cluster from 1.7.5 to 1.8.10 via acs-engine 0.15.2 and ran into this issue. The cluster uses Azure CNI and Ubuntu. The resource group Deployment is still showing the Failure if more details are needed. Ignoring the error, and resuming the upgrade seems to have worked fine, but the cse0 extension on the master VM is still showing status "Provisioning failed". I don't know what the implications of this are, but as I said, everything seems to be working. |
I am seeing this same issue with the following:
The cluster is trying to use Ubuntu with Azure CNI |
@rocketraman and @BrendanThompson please share the apimodel you used to generate the template/deploy the cluster as well as the exact error message (what was the error code?). |
@CecileRobertMichon Here is my API model, with private information elided: Here is the error (operation status was "Conflict", Provisioning state is "Failed"):
Same exact error on two different clusters. |
@CecileRobertMichon I think I understand what happened in my case. Looking at [1] In my previous cluster, I was experiencing an issue in which etcd wasn't starting up because it was choking on the |
I'm facing similar issues today. 1.9 or 1.10 with ACS 0.16.1 |
Same issue with k8s 1.6.6 and acs-engine 1.16.2 |
In our case the apt package indexes in |
For everyone here, https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout has been added to help troubleshoot VM extension errors. Please follow the instructions if you encounter one of those. |
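When following that guide, the usual first step is to SSH to the failing VM and pull the provisioning logs; a short collection snippet, with the caveat that log paths differ between older and newer acs-engine versions:

```bash
# Tail of the provisioning output (one of these paths usually exists)
sudo tail -n 100 /var/log/azure/cluster-provision.log 2>/dev/null
sudo tail -n 100 /var/log/cluster-provision.log 2>/dev/null

# Extension-level logs written by the custom script extension
sudo ls -la /var/log/azure/

# Kubelet status, since a kubelet that never comes up also fails the extension
sudo journalctl -u kubelet --no-pager | tail -n 50
```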
@CecileRobertMichon I too face the #vmextensionprovisioningtimeout error all the time when I have 3 masters. |
@Navlesh please take a look at https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout if you haven't already and open a new issue with title "CSE error: exit code <INSERT_YOUR_EXIT_CODE>" and include the following in the description:
|
Is this a request for help?:
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
Version: canary
GitCommit: 8db990b
GitTreeState: clean
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=5\n[stdout]\n\n[stderr]\nstat: cannot stat '/opt/azure/containers/provision.complete': No such file or directory\n\"."
      }
    ]
  }
}
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know: