Conversation
cli/config/configuration.toml.in
Outdated
[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
- Can you explain why users would want to use this feature (to save memory, speed up boot, etc)?
- s/scrash/scratch/.
Good point! I'll add some text.
Also, we either need a new `enable_debug=` option for this section, or a comment explaining that the runtime `enable_debug` option affects factory logging?
@bergwolf @jodh-intel I think it will be useful to add the security implications of this approach as well here so that users are aware of the side-effects of enabling this option.
@jodh-intel @amshinde The purpose of the PR is to add vm factory functionality. The CLI-side change is the minimum needed to use the vm factory. I can add more CLI changes after the PR is merged.
cli/config/configuration.toml.in
Outdated
#enable_template = false

# VM caching support. When set to larger than 0, number of cache empty VMs
# will maintained alive. New VMs are created by just using the cached VMs.
- Can you explain the impact of setting this? For example, if I set this to `5`, does that mean the first container I start after a fresh boot will result in 6 instances of qemu being started (which will take longer than normal), but subsequently created containers will be fast (since they can make use of the running instances)?
- Also, what happens if those instances are running and `qemu` gets upgraded "underneath" them? Presumably, we should recommend all instances are stopped before upgrading.
- How do you stop these instances (we need to document this)?
@@ -1126,6 +1126,10 @@ type Memory struct {
	// MaxMem is the maximum amount of memory that can be made available
	// to the guest through e.g. hot pluggable memory.
	MaxMem string
There is no corresponding change to `Gopkg.toml`.
The govmm PR is not merged yet. I'll update vendor properly once that one gets merged.
@bergwolf could you please go to the govmm repo and address the comments. I'd like to see it merged before we move forward with this PR.
} else if config.Knobs.FileBackedMem == true {
	if config.Memory.Size != "" && config.Memory.Path != "" {
		dimmName := "dimm1"
		objMemParam := "memory-backend-ram,id=" + dimmName + ",size=" + config.Memory.Size + ",mem-path=" + config.Memory.Path
Has this been tested with `qemu-lite` and `qemu` v1.11 vanilla?
func New(count uint, b base.FactoryBase) base.FactoryBase {
	if count < 1 {
		return b
Should `count == 0` result in a log warning?
It's actually OK to have count == 0 in which case we still return the base factory.
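For illustration only, a self-contained sketch of that guard with a hypothetical warning added (all type names below are stand-ins, not the PR's actual code, which silently falls back to the base factory when count is 0):

```go
package main

import (
	"fmt"
	"log"
)

// FactoryBase stands in for the PR's base.FactoryBase interface.
type FactoryBase interface {
	GetBaseVM() (string, error)
	CloseFactory()
}

// direct stands in for the "create from scratch" base factory.
type direct struct{}

func (direct) GetBaseVM() (string, error) { return "fresh-vm", nil }
func (direct) CloseFactory()              {}

// cache stands in for the caching wrapper; the real one keeps a pool of
// pre-created, paused VMs alive.
type cache struct {
	FactoryBase
	count uint
}

// New wraps a base factory with a VM cache. count == 0 means "no caching":
// the caller simply gets the base factory back (optionally with a warning).
func New(count uint, b FactoryBase) FactoryBase {
	if count < 1 {
		log.Println("vm cache count is 0, using the base factory directly")
		return b
	}
	return &cache{FactoryBase: b, count: count}
}

func main() {
	f := New(0, direct{})
	fmt.Println(f.GetBaseVM()) // fresh-vm <nil>
}
```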
for {
	vm, err := b.GetBaseVM()
	if err != nil {
		//glog.Errorf("cache factory get error when allocate vm: %v", err)
Commented out code here and below.
virtcontainers/factory/factory.go
Outdated
func (f *FactoryConfig) validate() error {
	if f.Cache == 0 {
		return fmt.Errorf("invalid factory cache number %d", f.Cache)
Is it worth displaying the entire `FactoryConfig` to help debugging here?
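A minimal sketch of that suggestion, using a stand-in struct and `%+v` so the whole config lands in the error message:

```go
package main

import "fmt"

// FactoryConfig stands in for the real virtcontainers type.
type FactoryConfig struct {
	Template bool
	Cache    uint
}

// validate mirrors the guard quoted above, but prints the entire config so
// a bad configuration is easy to spot in logs.
func (f *FactoryConfig) validate() error {
	if f.Cache == 0 {
		return fmt.Errorf("invalid factory config %+v: cache number must be > 0", *f)
	}
	return nil
}

func main() {
	cfg := FactoryConfig{Template: true}
	fmt.Println(cfg.validate()) // invalid factory config {Template:true Cache:0}: cache number must be > 0
}
```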
virtcontainers/vm.go
Outdated
}

func (v *VM) assignSandbox(s *Sandbox) error {
	// TODO: add vm symlinks
Incomplete?
Removing wip and the do-not-merge tag. The PR is working and ready for review now. @sboeuf @jodh-intel @egernst @WeiZhang555 @laijs PTAL. In the meantime, I'll fix up CI and post perf numbers.
PSS Measurement: Memory inside container:
Script to run: I've changed kata config to create 128MB guests by default. w/o vm templating:
w/ vm templating:
So vm templating helped save about 9GB for 100 guests, which is around 90MB for each 128MB guest. It's clear that vm templating can be very useful for density workloads. @egernst ^^
Wow, this looks like an amazing optimization. And it's also quite a huge patch, I'll need some time to read it~
PSS Measurement: Memory inside container:
Memory savings look good :-) I presume this measurement was taken with KSM turned off. I have a feeling that some of those savings could also be found with KSM, but I suspect:
So, those memory savings look good to me :-)
@grahamwhaley Yes, these numbers were taken with KSM off. And yes, one advantage of vm templating over KSM is that it does not consume extra CPU resources. However, it does not share the host-side qemu memory either, same as KSM. All the sharing is done at the guest memory level, mostly covering the kernel, initramfs, kata-agent and other process memory like systemd if that is enabled.
I like the whole idea, and the structure to my eyes looks right.
I've added some general feedback, but we need some core VM and VC devs to review this...
@@ -1126,6 +1126,10 @@ type Memory struct {
	// MaxMem is the maximum amount of memory that can be made available
	// to the guest through e.g. hot pluggable memory.
	MaxMem string
In the patch title, `hack/vendor` made me think:
- that this was a patch in the https://github.com/kata-containers/runtime/tree/master/virtcontainers/hack directory
- and reminded me that we should have changed that subdir name from `hack` when we moved over VC - and that no PRs should probably contain the word `hack` ;-)

Can you drop the `hack` from the subject pls?
That commit will be reworked once the govmm PR gets merged. The `hack` in the commit subject is to remind myself of the rework ;)
cli/config/configuration.toml.in
Outdated
# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
#
#enable_template = false
As this is commented out, and I presume the default is therefore false/off, you may as well make the commented out line:
#enable_template = true
as that is the most likely line to be added or uncommented I think.
Good point!
var factorySubCmds = []cli.Command{
	initFactoryCommand,
	destroyFactoryCommand,
Will it be possible to have some sort of 'status' command as well - to show how many runtimes are shared, or, even more importantly once we have the cache/pool, how many are cached?
It is possible, but it needs corresponding virtcontainers-side changes to support it. At this point, I would defer further improvements to future PRs since this one is already pretty big.
virtcontainers/qemu.go
Outdated
	return q.hotplugAddMemory(slot, memMB)
}

func (q *qemu) hotplugAddMemory(slot uint32, amount int) error {
Just a thought - is this tied to the vmtemplating PR, or could it (if need be, to simplify the PRs) be pulled out into a separate PR?
Also, /cc @devimc as this feels like it might overlap or interact with some of his work?
VM templating needs cpu and memory hotplug support so I added them together.
I understand this is needed, but this is something that is not related to the whole refactoring, and I think you should submit this as a separate PR on top of which you should rebase this current PR.
This will help to split this huge PR and people will be more motivated in reviewing it ;)
It is not about refactoring. It is a dependency of the vm factory functionality, and vm factory is in fact the only consumer of the internal hotplug APIs ATM. I thought reviewers would lose the whole picture if I split it out.
As for reviewing, I thought smaller commits in a large PR would actually help. But it seems you are right that people would rather review it as a whole than look into each commit, and the PR is too huge to be reviewed as a whole (hence I got no comments on the core idea/API/refactoring part of the PR for a week :( ). I'll split some dependency commits into another PR to see if that helps.
@bergwolf Thanks :)
And don't worry, I'm gonna spend some time today reviewing this PR; this is a huge piece of work and I need to be focused to review it properly.
Also, don't hesitate to ping people that should review this PR. Usually if you don't ask, they will not spend the time on such big PRs 😃
Agreed. I plan to re-review this tomorrow AM too fwiw.
I've been planning to review this for a whole week, in the end I just started this work a few minutes ago 😆
Shame on myself :-)
virtcontainers/factory.go
Outdated
@@ -6,6 +6,6 @@
package virtcontainers

type Factory interface {
	GetVM(hypervisorType HypervisorType, hypervisorConfig HypervisorConfig) (*VM, error)
Can this whole commit be folded into the one that added factory.go in the first place? Seems this fixes something effectively 'missing' from that prior commit?
@@ -359,7 +359,7 @@ func newQemuHypervisorConfig(h hypervisor) (vc.HypervisorConfig, error) {
}
Ideally I would have liked to have seen the static check (lint, cyclo) fixes in (merged into) the original commits. Not a biggie, just noting.
Done.
PSS Measurement: Memory inside container:
Just had a brief look into the code (quite a lot of code... great job!).
Will give feedback later~
More comments later...
virtcontainers/qemu.go
Outdated
@@ -62,7 +60,9 @@ type qemu struct {

const qmpCapErrMsg = "Failed to negoatiate QMP capabilities"

const defaultConsole = "console.sock"
const qmpSocket = "qmp.sock"
I am fine with using only one socket, but you increased the number of characters in the socket name now. `controlSocket` and `monitorSocket` were shorter.
The main point is that there is no need for two qmp sockets. If socket file name length is a concern, I can just use `qmp` instead. But we've already shortened the socket parent path and `qmp.sock` survives the `BuildSocketPath` check, so I'd prefer making the naming clear such that a socket actually ends with a `.sock` suffix.
Yes, fair enough. We have shortened the parent path, and I agree I prefer a socket name ending with `.sock`.
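Background for the length concern, as a sketch: Linux UNIX domain socket paths must fit into sockaddr_un.sun_path (108 bytes), so a check along the following lines rejects over-long paths. The function name and example path below are hypothetical; the runtime's real check is the `BuildSocketPath` mentioned above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// 108-byte sun_path on Linux, minus the trailing NUL.
const maxSocketPathLen = 107

// buildSocketPath is a hypothetical helper mirroring the kind of check the
// discussion refers to.
func buildSocketPath(elems ...string) (string, error) {
	p := filepath.Join(elems...)
	if len(p) > maxSocketPathLen {
		return "", fmt.Errorf("socket path %q longer than %d bytes", p, maxSocketPathLen)
	}
	return p, nil
}

func main() {
	p, err := buildSocketPath("/run/vc/0123456789abcdef", "qmp.sock")
	fmt.Println(p, err)
}
```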
}

// SetLogger implements the VC function of the same name.
func (impl *VCImpl) SetLogger(logger logrus.FieldLogger) {
	SetLogger(logger)
}

// SetFactory implements the VC function of the same name.
func (impl *VCImpl) SetFactory(factory Factory) {
You need the mock implementation to implement this new API.
The mock implementation has `SetFactory`.
kataLog.WithField("factory", factoryConfig).Info("load vm factory")
f, err := vf.NewFactory(factoryConfig, true)
if err != nil {
	kataLog.WithError(err).Info("load vm factory failed")
I think it's worth adding a comment here explaining that if vm factory support fails, the system will "degrade" automatically to using the non-factory codepath.
In fact, I think we might want to add another variable to the config file to avoid this graceful degradation as some users only want to use VM factories:
[factory]
# If set, fail if VM factory support is not available.
## strict=true
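A self-contained sketch of the behaviour being discussed: the current PR logs and degrades to the non-factory path, while a hypothetical strict option would turn the failure into a hard error. All names below are stand-ins, not the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// factoryConfig stands in for the runtime's factory settings.
type factoryConfig struct {
	Template bool
	Strict   bool // hypothetical knob suggested in the review
}

// newFactory stands in for vf.NewFactory; here it always fails so the
// fallback path is exercised.
func newFactory(cfg factoryConfig) (interface{}, error) {
	return nil, errors.New("template VM not found")
}

func loadFactory(cfg factoryConfig) error {
	f, err := newFactory(cfg)
	if err != nil {
		if cfg.Strict {
			// Hypothetical strict mode: refuse to run without a factory.
			return fmt.Errorf("vm factory required but unavailable: %w", err)
		}
		// Behaviour in the PR: log and fall back to creating VMs without
		// a factory ("graceful degradation").
		log.Printf("load vm factory failed: %v, continuing without it", err)
		return nil
	}
	_ = f // the real CLI would hand the factory to virtcontainers here
	return nil
}

func main() {
	fmt.Println(loadFactory(factoryConfig{Template: true, Strict: true}))
}
```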
Yes, agreed. But again, I'd like to push all these UX changes to later PRs. This PR is mostly to add core vm factory functionalities.
f, err := vf.NewFactory(factoryConfig, true)
if err != nil {
	kataLog.WithError(err).Error("load vm factory failed")
	// ignore error
Why?
It happens when we destroy a factory that does not exist in the first place.
	return err
}
flags := uintptr(syscall.MS_NOSUID | syscall.MS_NODEV)
opts := fmt.Sprintf("size=%dM", t.config.HypervisorConfig.DefaultMemSz+8)
Why `+8`?
An empirical value: a vm device state file is smaller than 8MB, usually about 3MB, and we need to align it by 8.
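In other words, the tmpfs backing the template has to hold the guest memory image plus the saved device state, and 8MB is used as an empirical, aligned upper bound for that state file. A tiny sketch of the sizing:

```go
package main

import "fmt"

// Empirical headroom for the saved device state (~3MB in practice),
// rounded up to an 8MB alignment; see the discussion above.
const templateStateHeadroomMB = 8

func templateTmpfsOpts(defaultMemSzMB uint32) string {
	return fmt.Sprintf("size=%dM", defaultMemSzMB+templateStateHeadroomMB)
}

func main() {
	fmt.Println(templateTmpfsOpts(128)) // size=136M for a 128MB guest
}
```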
	fmt.Println("vm factory initialized")
} else {
	kataLog.Error("vm factory is not enabled")
	fmt.Println("vm factory is not enabled")
twice?
kataLog does not print to stdout, but I think we should tell the user whether the command succeeded. Any suggestion on how to do this properly with kataLog?
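One possible pattern (a sketch, not what the PR does) is a small helper that writes the result both to the structured log and to stdout for the person running the command; the helper name and logger setup below are assumptions:

```go
package main

import (
	"fmt"

	"github.com/sirupsen/logrus"
)

// kataLog stands in for the CLI's logrus-based logger, which goes to the
// system log rather than the terminal.
var kataLog = logrus.WithField("source", "runtime")

// announce is a hypothetical helper: one call, two audiences.
func announce(msg string) {
	kataLog.Info(msg) // for operators reading the logs
	fmt.Println(msg)  // for the user who just ran the CLI command
}

func main() {
	announce("vm factory initialized")
}
```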
		f.CloseFactory()
	}
}
fmt.Println("vm factory destroyed")
Why not use `kataLog`?
kataLog does not print to stdout.
@bergwolf The PR looks pretty good, but I don't think we need to introduce a new abstraction layer with this new VM interface. The factory should call directly into the hypervisor, there is no reason to add an extra layer here. And about the sandbox decoupling, it's not needed since we want the complexity to be hidden behind the sandbox API.
We can do the proper parallelization from Frakti by using `CreateSandbox()` with the appropriate factory, and hotplug later by creating new containers. Each container will be responsible for hotplugging the resources that we need.
That being said, the same way we're going to introduce `sandbox.AddStorage()`, we could introduce `sandbox.AddResources()` if really needed.
@@ -253,15 +250,6 @@ func (q *qemuArchBase) memoryTopology(memoryMb, hostMemoryMb uint64) govmmQemu.M
	return memory
}

func (q *qemuArchBase) append9PVolumes(devices []govmmQemu.Device, volumes []Volume) []govmmQemu.Device {
The removal of this function is not part of the decoupling between sandbox and hypervisor, is it?
Should be part of its own commit I think.
Yes, agreed.
virtcontainers/vm.go
Outdated

package virtcontainers

type VM struct {
I don't really get the need for such an additional abstraction layer. Could you explain the rationale behind this? My understanding so far is that it actually duplicates the `hypervisor` interface and includes the `agent` handler, meaning it looks a lot like what sandbox is doing.
virtcontainers/qemu.go
Outdated
@@ -340,8 +346,7 @@ func (q *qemu) createSandbox(sandboxConfig SandboxConfig) error {
	// bridge gets the first available PCI address i.e bridgePCIStartAddr
	devices = q.arch.appendBridges(devices, q.state.Bridges)

	devices = q.arch.append9PVolumes(devices, sandboxConfig.Volumes)
@bergwolf Can you add in the commit message that you are removing this method and the reason for doing this.
virtcontainers/factory.go
Outdated

package virtcontainers

type Factory interface {
@bergwolf Can you add documentation comments for `Factory` here? It will be useful to include the purpose of `Factory` and `FactoryBase` in the commit message.
@sboeuf That would push the vm config comparison and CPU/memory hotplug down to each of the factory implementations. `factory/factory.go` is not just a broker but also handles vm config validation/comparison and possible CPU/memory hotplug. It's better to do this at the factory level instead of doing the same work in each of the factory implementations.
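For readers skimming the thread, a rough sketch of that split with stand-in types (the real interfaces live in virtcontainers and virtcontainers/factory/base and use vc.VMConfig / *vc.VM):

```go
package sketch

// VMConfig and VM stand in for the virtcontainers types.
type VMConfig struct {
	VCPUs uint32
	MemMB uint32
}

type VM struct {
	Config VMConfig
}

// FactoryBase is what the direct, template and cache factories implement:
// it only knows how to hand out a paused VM built from one fixed config.
type FactoryBase interface {
	Config() VMConfig
	GetBaseVM() (*VM, error)
	CloseFactory()
}

// Factory is the consumer-facing interface. Its implementation wraps a
// FactoryBase, validates/compares the requested config against the base
// config, and hotplugs the CPU/memory difference before returning the VM.
type Factory interface {
	GetVM(config VMConfig) (*VM, error)
	CloseFactory()
}
```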
@bergwolf I had missed this. This makes sense, and it'd be great to explain this explicitly in the code with a comment. Otherwise, it's not obvious what the difference is between the `Factory` and `FactoryBase` interfaces.
virtcontainers/factory/base/base.go
Outdated

import vc "github.com/kata-containers/runtime/virtcontainers"

type FactoryBase interface {
Same as above, can you add documentation comments for `FactoryBase` and its purpose?
//
// cache implements base vm factory on top of other base vm factory.

package cache
@bergwolf Can you add a description of the `cache` and `template` factories? I have no idea what these are and how they differ from each other. I know you are planning to add the implementation later, but comments about the general flow, giving an idea of what is stored and what is constructed at runtime for each factory type, will be useful for a general understanding.
It's in the PR description.
@@ -100,6 +101,25 @@ func create(containerID, bundlePath, console, pidFilePath string, detach bool,
	return err
}

if runtimeConfig.FactoryConfig.Template {
@bergwolf Can we handle this in virtcontainers `CreateSandbox` itself rather than doing this in the CLI?
All these implementation details about the factory could be handled by virtcontainers itself, keeping the API simple.
The kata CLI is the one reading the configuration files. And for frakti, unlike the kata CLI, we would initialize the vm factory only once, instead of doing it for every `CreateSandbox`. That is the reason the factory config is not pushed into `CreateSandbox`.
	},
}

var initFactoryCommand = cli.Command{
Are these commands intended to be called by `frakti`? I am trying to understand the use case here.
No, not at all. frakti does not call any of the kata CLI commands. The CLI commands are the minimum needed for the kata CLI to use the vm factory, and they also help with testing.
@@ -103,3 +116,13 @@ func (t *template) createFromTemplateVM() (*vc.VM, error) {

	return vc.NewVM(config)
}

func (t *template) checkTemplateVM() error {
This needs some comments on what you are doing here and what those files mean.
Will do.
virtcontainers/factory/factory.go
Outdated
}

type factory struct {
	base base.FactoryBase
}

func NewFactory(config FactoryConfig) (vc.Factory, error) {
func NewFactory(config FactoryConfig, fetchOnly bool) (vc.Factory, error)
Please add comments for the `fetchOnly` flag here.
Will do.
@sboeuf The VM layer is needed because the vm factory needs to work on a vm guest abstraction without sandbox association. The VM layer sits on top of the hypervisor layer because the hypervisor is just a set of driver interfaces with no in-memory abstraction. We could add an abstraction data structure in hypervisor.go, but it would just look like the VM layer and would not be as clean as a separate VM layer. With the VM layer, we have a proper layering:
The sandbox API cannot cover the vm factory. For one thing, the vm factory needs to be initialized before any sandbox is created. For another, unlike the kata CLI, frakti would initialize the factory only once and use it as long as the daemon is running. So it needs to be a separate API that is not coupled with the sandbox APIs.
I agree that it makes sense to let each container hotplug what it needs. But doing hotplug with the vm factory still allows the factory to prepare VMs with different sizes. I would like to keep that ability. WDYT?
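A condensed, stand-in sketch of the flow described above: the factory hands out a paused base VM and the requested size is reached by hotplugging the delta, rather than booting a fresh guest and waiting for the agent. The types and numbers below are illustrative only.

```go
package main

import "fmt"

type vmConfig struct{ vcpus, memMB uint32 }

type vm struct{ cfg vmConfig }

// getBaseVM stands in for FactoryBase.GetBaseVM: it always returns the same
// small, pre-created (template or cached) VM.
func getBaseVM() *vm { return &vm{cfg: vmConfig{vcpus: 1, memMB: 128}} }

// getVM stands in for Factory.GetVM: grow the base VM via hotplug instead of
// creating a new one, which is the performance argument made above.
func getVM(want vmConfig) *vm {
	v := getBaseVM()
	if want.vcpus > v.cfg.vcpus {
		v.cfg.vcpus = want.vcpus // CPU hotplug
	}
	if want.memMB > v.cfg.memMB {
		v.cfg.memMB = want.memMB // memory hotplug
	}
	return v
}

func main() {
	fmt.Printf("%+v\n", getVM(vmConfig{vcpus: 4, memMB: 2048}))
}
```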
Config() vc.VMConfig

// GetBaseVM returns a paused VM created by the base factory.
GetBaseVM() (*vc.VM, error)
The main advantage of a cache is of course speed. However, we need to be able to identify when the cache should be invalidated, for example when a new rootfs image containing a security fix is released.
As such, I wonder if we need to add a timestamp into `vc.VM` that shows when the cache item was created (not when it was used). That will give us a basic way to look for "stale images" I think?
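A sketch of that suggestion (the field and check are hypothetical, not part of this PR): record when the cached VM was created so a consumer can drop anything older than the last component upgrade.

```go
package main

import (
	"fmt"
	"time"
)

// vmEntry stands in for a cached vc.VM.
type vmEntry struct {
	id        string
	createdAt time.Time // creation time, not last-used time
}

// staleSince reports whether the cached VM predates the given upgrade time.
func (v *vmEntry) staleSince(upgrade time.Time) bool {
	return v.createdAt.Before(upgrade)
}

func main() {
	entry := vmEntry{id: "vm-0", createdAt: time.Now().Add(-2 * time.Hour)}
	lastUpgrade := time.Now().Add(-1 * time.Hour)
	fmt.Println(entry.staleSince(lastUpgrade)) // true: should be discarded
}
```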
Generally, I think the system upgrade best practice will need to include destroying the vm factory and re-initializing it after the upgrade. This would apply to all vm factory users, including the kata CLI and frakti. For example in frakti, we would destroy the vm factory whenever the daemon quits and always initialize it upon startup.
You are right that it is possible to handle this automatically inside the vm factory, but it can be very tricky and error prone since an upgrade can happen at any time. Given that frakti is the only user of the cache vm factory, I would prefer to push the burden up to frakti, where it is already properly handled.
virtcontainers/qemu.go
Outdated
	}
}

err = q.qmpMonitorCh.qmp.ExecSetMigrateArguments(q.qmpMonitorCh.ctx, fmt.Sprintf("exec:cat>%s", q.config.DevicesStatePath))
Are there any issues around creating the image with "version X" of qemu and then trying to use it with "version Y" of qemu? I'm thinking about build support as well (`qemu-lite` compared with `qemu-vanilla` compared with distro-packaged `qemu`).
In such a case, the qmp command might fail, and I would classify it as a configuration mistake... It's just like the version mismatch when you upgrade kata-runtime but do not correspondingly upgrade kata-agent and end up with certain grpc failures.
I clearly understand the need that you have here, and this means you need access to the hypervisor interface so that you can run a VM and stop it to save its state, but I thought it would be better to make sure that we can do that through the hypervisor interface itself. And because you have removed the dependency on the sandbox from the hypervisor interface, you should be able to do that.
Ok, but if you need different sizes of VMs, why don't you simply create them with the expected amount directly? IMO the hotplug should be reserved for the containers since they are the variable in the equation (we don't know what the CRI layer will ask for in advance).
Besides accessing the hypervisor interface, we also need an in-memory data structure to represent the guest without a sandbox, so that we can create it (
For performance considerations, it is much slower to create a new VM and wait for the agent to be up than to simply hotplug the requested CPU/memory.
That's because you assume you have to wait for the agent to be up inside the VM when you're creating it from the factory. I don't think we need this, and as explained above, this is more something that should be checked in the context of the sandbox.
@amshinde @devimc @grahamwhaley @egernst it'd be nice if you could take a look ;)
It is not used and we actually cannot append multiple 9pfs volumes to a guest. Signed-off-by: Peng Tao <bergwolf@gmail.com>
A hypervisor implementation does not need to depend on a sandbox structure. Decouple them in preparation for vm factory. Signed-off-by: Peng Tao <bergwolf@gmail.com>
The boolean return value is not necessary. Signed-off-by: Peng Tao <bergwolf@gmail.com>
There are a few changes we need on kata agent to introduce vm factory support: 1. decouple agent creation from sandbox config 2. setup agent without creating a sandbox 3. expose vm storage path and share mount point Signed-off-by: Peng Tao <bergwolf@gmail.com>
1. support qemu migration save operation 2. setup vm templating parameters per hypervisor config 3. create vm storage path when it does not exist. This can happen when an empty guest is created without a sandbox. Signed-off-by: Peng Tao <bergwolf@gmail.com>
As representation of a guest without actual sandbox attached to it. This prepares for vm factory support. Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add vm factory support per design in the VM Factory plugin section. The vm factory controls how a new vm is created: 1. direct: vm is created directly 2. template: vm is created via vm template. A template vm is pre-created and saved. Later vm is just a clone of the template vm so that they readonly share a portion of initial memory (including kernel, initramfs and the kata agent). CPU and memory are hot plugged when necessary. 3. cache: vm is created via vm caches. A set of cached vm are pre-created and maintained alive. New vms are created by just picking a cached vm. CPU and memory are hot plugged when necessary. Fixes: kata-containers#303 Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add SetFactory to allow virtcontainers consumers to set a vm factory. And use it to create new VMs whenever the factory is set. Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add enable_template option to the config file. When it is set, enable the vm template factory. cache factory cannot be used by kata cli directly because it requires a running daemon to maintain the cache VMs. `kata-runtime factory init` would initialize the vm factory and `kata-runtime factory destroy` would destroy the vm factory. When configured, a vm factory is loaded before creating new sandboxes. Signed-off-by: Peng Tao <bergwolf@gmail.com>
For one thing, it is not used by any kata components. For another thing, it breaks vm factory hypervisor config check. Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add UTs to all factory components. Signed-off-by: Peng Tao <bergwolf@gmail.com>
PSS Measurement: Memory inside container:
Build failed (third-party-check pipeline) integration testing with
This is now blocked by kata-containers/tests/issues/513 and kata-containers/tests/issues/506. Restarting CI did not help... I know it all passed yesterday, but codecov was not happy so I had to update again.
recheck
Build succeeded (third-party-check pipeline).
@kata-containers/runtime PTAL, it is important to review this PR so that we can move forward with it. It's been around for a long time, and @bergwolf has reworked it many times. PTAL!
LGTM.
It seems that codecov/patch is not reliable. See kata-containers/tests#508
Hi @bergwolf - I know this is an old and merged item (the base VM templating/factory is merged already). I was wondering, are you planning at some point to run some metrics tests and maybe publish the vm factory gains for density and boot? I think having these numbers available for Kata would be a great thing. Maybe we should open an issue to track this, and then we might be able to collaborate on some sort of blog etc.?
agent: do not quit on grpc serve errors
The PR adds vm factory support to kata runtime per design in the VM Factory plugin section. The vm factory controls how a new vm is created:
To support vm factory, virtcontainers is refactored so that:
A new kata CLI subcommand `kata-runtime factory` is added. It is used to manage vm factory init and destroy. Only the direct and template factories are supported by the kata CLI, because the cache factory requires a running daemon to maintain the cache vm channels. OTOH, all factories are available at the kata API level.
VMs created by the vm factory still lack network support because right now network devices are configured before the vm is created. We need network hotplug support (#287) to enable network configuration after the vm is created. That said, this patch works otherwise and is large enough, so I'd like to get it merged and continue working on the network support when #287 is done.
is added. It is used to manage vm factory init and destroy. Only direct and template factory are supported by kata CLI, because the cache factory requires a running daemon to maintain the cache vm channels. OTOH, all factories are available to use at the kata API level.VMs created by vm factory still lack network support because right now network devices are configured before vm is created. We need network hotplug support (#287) to enable network configuration after vm is created. That said, this patch works otherwise and is large enough. So I'd like to get it merged and continue working on the network support when #287 is done.