Implement a share-based scheduling heuristic #1

AlexisMontagne · 2016-05-17T01:15:23Z

jlevesy · 2016-05-18T12:02:30Z

Sounds good to me.

… timeout As etcd's keysAPI already supports HeaderTimeoutPerRequest, fleet should not handle request timeout on its own, but just pass on the timeout value to the etcd client. For that, set timeout only in etcd.Config.HeaderTimeoutPerRequest, and get rid of internal timeout handling from both EtcdRegistry and LeaseManager. Also remove EtcdRegistry.ctx(), and just call context.Background() directly, as suggested by Jon. Suggested-by: Jonathan Boulle <jonathanboulle@gmail.com> Partially resolves coreos#1397

Support truthy values in boolean unit options(true, yes, on, 1, t). Originally written by wuqixuan <wuqixuan@huawei.com> Supersedes coreos#1271 Fixes: coreos#1108

Fleetctl {stop,unload} has been always printing out somewhat ambiguous messages like below: Unit app.service loaded on 0eb1b9b6.../coreos1 For better usability, print out a better result message, either about {stopp,unload}ing successfully, or failing to do. Partially resolves: coreos#1432

As the unit state cache is going to be obsolete, functional test does not need to run go test with unit state cache enabled. So let's remove the second run for now.

…alues fleetctl: support truthy values in boolean unit options

registry: use etcd.Config.HeaderTimeoutPerRequest instead of internal timeout

tests: test fleetd with extra configuration options

…al-message fleetctl: print out result message explicitly

- rework option parsing - move away from (most) global variables (cAPI), keep currentCommand for now... - we need to pass c (the cli context) and cAPI (the client API) around Signed-off-by: Olaf Buddenhagen <olaf@endocode.com>

Add github.com/spf13/cobra and github.com/spf13/pflag.

Let's roll back to the original global variable cAPI, to keep the entire code as simple as possible. As this is a single command-line tool, it would be harmless to simply use a global variable.

Use Cobra (github.com/spf13/cobra) instead of cli (github.com/codegangsta/cli), for better cmdline user interface. * Create a wrapper runWrapper() to be used for cobra, to wrap around a normal run*() function into a prototype for cobra.Command.Run(). It also sets a global variable cAPI for running a normal command. * remove unnecessary code for codegangsta/cli from fleetctl.go. Suggested-by: Jonathan Boulle <jonathanboulle@gmail.com> Fixes: coreos#1453 Supersedes coreos#1570

As every fleetctl command uses cobra instead of codegangsta/cli, we need to update unit tests as well.

* Do not compare output of help messages, to avoid occasional failures. As cobra sometimes prints out a different help message, it causes functional test TestClientHelpFlag to fail. * Run "fleetctl version" instead of "fleetctl --version". * Run "fleetctl help" instead of "fleetctl" to get help message.

Since CoreOS alpha (1000), coreos-cloudinit doesn't seem to watch on /media/configdrive/openstack/latest/user_data. That's why functional tests haven't been running, as systemd-nspawn container runs without any configurations. Fix it by changing into /var/lib/coreos-install/user_data. Fixes: coreos#1532

…fxtests functional: fix path to user_data to /var/lib/coreos-install

So far ReplaceMember has been checking for error, by looking up an explicit string "Success" in stderr from "systemctl -M <ID> poweroff". However, with systemd v229 or newer, systemctl poweroff does not print out "Success" even if poweroff succeeded. This causes TestKnownHostsVerification to always return error like: --- FAIL: TestKnownHostsVerification (2.66s) client_test.go:60: Failed replacing machine: poweroff failed: Fix it by changing the systemctl call to "machinectl poweroff <ID>", which should return error correctly. This would be better in the long run, than checking for messages in stderr. Fixes: coreos#1586

…ror-fxtests functional: poweroff machine using machinectl instead of systemctl

TestSingleNodeConnectivityLoss has been so far retrieving is-enabled status by calling "systemctl list-unit-files UNITS". It was expected to get "enabled" for each unit. This has worked until systemd v227. However, that's not the case any more since systemd v228, precisely since systemd/systemd@0ec0dea ("install: follow unit file symlinks in /usr, but not /etc when looking for [Install] data"). Since then, systemctl list-unit-files returns "linked-runtime" status for each fleet unit, which is actually reasonable behavior for units that are not installed persistently under /etc/systemd. With systemd v228 or newer, however, TestSingleNodeConnectivityLoss in fleet functional tests cannot make use of "systemctl list-unit-files" any more. And it would not be optimal to simply change enabled to linked-runtime, because then the test couldn't work with systemd < v228. So let's work around it, by calling "systemctl is-enabled UNIT" instead. That command has returned correct status "linked-runtime", no matter whether systemd >= 228 or not. Fixes: coreos#1588

run-in-qemu should use double hyphens for long command-line options for consistency with other tools.

Allow run-in-qemu to run with a given custom base image, e.g. coreos_developer_qemu_image.img which was built by CoreOS SDK. $ ./run-in-qemu -c alpha -i ~/.qemu/coreos_developer_qemu_image.img

…fxtests functional: allow run-in-qemu to run based on a custom base image

As systemd v230 or newer requires /etc/machine-id on a nspawn container, we need to write the container's own machine ID into /etc/machine-id. Otherwise, functional tests don't start at all like: --- FAIL: TestKnownHostsVerification (10.80s) client_test.go:36: unable to detect machine PID Fixes: coreos#1594 See also systemd/systemd#3014 See also rkt/rkt#2440

…-fxtests functional: write machine ID into nspawn container

As CoreOS requires PAM configurations under /etc/pam.d starting from developer version of CoreOS 104x, PAM configurations on the host need to copied into the nspawn container. Otherwise, every sudo command fails with messages like: --- FAIL: TestShutdown (1.99s) shutdown_test.go:43: exit status 1 Note that copying /etc/pam.d is necessary only for such setups with sys-auth/pambase installed. Though it should be also fine for systems where /etc/pam.d is empty, because then it should automatically fall back to /usr/lib64/pam.d, which belongs to sys-libs/pam. Fixes: coreos#1591 See also https://coreos.com/blog/security-brief-coreos-linux-alpha-remote-ssh-issue.html

…e-header scripts,golang: update schema-generator, bump golang to 1.7.1

So far it hasn't been possible to do "go get github.com/coreos/fleet" because there was no main.go in the top source directory. Create main.go, a simple wrapper for fleetd to make it "go-getable".

Now that fleetd/fleetd.go doesn't have main(), we need to make the default build script run go build on the top source directory, instead of ./fleetd/. Then it will end up building fleetd, as main() in main.go imports fleetd.Main() by default.

…table fleetd,main: make github.com/coreos/fleet go-getable

Apparently rkt doesn't support a single dash in its option any more. For example, "rkt -volume" doesn't work any more. Use double dash instead: "rkt --volume" Also --insecure-skip-verify doesn't exist. Use --insecure-options=image instead.

Documentation: fix unavailable rkt options in examples

To avoid confusion with ./build, we'd better have build-aci under ./scripts directory.

Various changes on build-aci. * Use acbuild command for building the ACI image. * Remove unnecessary docker command. Actually we could build ACI image without relying on docker, just like etcd does. * Remove unnecessary manifest definition, as it's automatically generated by acbuild. * Improve the cleanup path using acbuildEnd(). * Allow TCP ports 2379,4001 for etcd How to test, for example: $ ./scripts/build-aci v0.12.0-amd64 $ sudo rkt run --insecure-options=image \ --stage1-name=coreos.com/rkt/stage1-fly:1.14.0 --inherit-env \ --volume etc-fleet,kind=host,source=/etc/fleet \ --volume machine-id,kind=host,source=/etc/machine-id \ --volume dbus-socket,kind=host,source=/run/dbus/system_bus_socket \ --volume fleet-units,kind=host,source=/run/fleet/units \ ./fleetd-v0.12.0-amd64.aci

engine: remove vestigial trigger code

To be as idiomatic as possible, do so: * make tryWaitForSystemdActiveState() return error instead of int * pass getBlockAttempts(cCmd) to tryWaitForSystemdActiveState() to handle such cases of --no-block or --block-attempts. * make assertSystemdActiveState(), waitForSystemdActiveState() return just error, not with 'found', as it's unnecessary. * remove unnecessary checking against conn == nil. * compare strings with '!=' instead of strings.Compare() != 0

Now that tryWaitForSystemdActiveState() returns error instead of int, tryWaitForUnitStates() should also return error instead of int, to be as idiomatic as possible.

It's not appropriate to check systemd active states with DBUS connection, as that will return correct states only on the same machine. So we should introduce a check using cAPI.UnitStates(), to check states on other machines.

As it's basically not efficient to call cAPI.UnitStates() for each unit, let's call it only once outside goroutines for processing units, i.e. calling in waitForSystemdActiveState(), not in getSingleUnitState(). To do that, we need to share the result apiStates across all units. Also make use of goroutines in waitForSystemdActiveState(), instead of sequential iteration over each unit. Note that we still need to keep calling UnitStates() when assertSystemdActiveState() failed, because in the next attempt the old apiState is not necessarily going to be valid. Ideally in that case, we should be able to fetch the state only for a single unit. However, we cannot do that for now, because cAPI.UnitState() is not available. In the future we would need to implement cAPI.UnitState() and all dependendent parts all over the tree in fleet, e.g. schema, etcdRegistry, rpcRegistry, etc. to replace UnitStates() in this place with the new method UnitState(). In practice, calling UnitStates() here is not as badly inefficient as it looks, because it will be anyway rarely called only when the assertion failed.

…temd-states fleetctl: check systemd active state via client API

scripts: build ACI image using acbuild

When marshal() returns a non-nil error 'err', which shadows the pre-defined 'err', we should explicitly return the second error returned from marshal(). Otherwise it will always return nil. To fix other potential bugs, let's just return err explicitly. This was discovered by running go tool vet: $ go tool vet --shadow ./registry registry/job.go:316: declaration of "err" shadows declaration at registry/job.go:315

In matchLocalFileAndUnit(), if os.Stat() returns nil and getUnitFromFile() returns non-nil error, then in the end matchLocalFileAndUnit() will return (false, nil). That's not the expected result. It should have returned the second error that came from getUnitFromFile(). To avoid further potential bugs, let's follow the idiomatic coding style: first check err != nil, and fail-fast. This issue was discoverd by running go tools vet.

…vars registry,fleetctl: fix bugs regarding shadowed error variables

Fix weblink to build-aci to https://github.com/coreos/fleet/blob/master/scripts/build-aci. This was a side-effect of moving the script to the scripts directory.

…-aci Documentation: fix a broken weblink to scripts/build-aci

Run TestRegistryMuxUnitManagement() only when user's euid is 0 and systemd is avilable at run-time. Such a distinction is necessary for avoiding test failures, especially when unit tests run on travis CI, where systemd is not available. With those checks, we can re-enable error handling after calling systemd.NewSystemdUnitManager().

On Linux syscall.Setuid() is not supported in go. So running unit tests with root privileges fails when calling Setuid. Let's skip the test on Linux, if euid is 0.

…r-systemd registry/rpc: re-enable error handling after NewSystemdUnitManage

…the shares of the machine

fleetctl: print error message when units were not found in registry

b445699

AlexisMontagne force-pushed the am-share-based-scheduling branch from e87998b to 3fb314e Compare May 17, 2016 01:17

Dongsu Park and others added 26 commits May 18, 2016 16:01

fleetctl: support truthy values in boolean unit options

fabe005

Support truthy values in boolean unit options(true, yes, on, 1, t). Originally written by wuqixuan <wuqixuan@huawei.com> Supersedes coreos#1271 Fixes: coreos#1108

tests: test fleetd with extra configuration options

463e27b

functional: run go test just once, without unit state cache option

5cc9d3b

As the unit state cache is going to be obsolete, functional test does not need to run go test with unit state cache enabled. So let's remove the second run for now.

Merge pull request coreos#1574 from endocode/dongsu/fleetctl-truthy-v…

77dd54e

…alues fleetctl: support truthy values in boolean unit options

Merge pull request coreos#1580 from endocode/dongsu/etcd-request-timeout

1882d3c

registry: use etcd.Config.HeaderTimeoutPerRequest instead of internal timeout

Merge pull request coreos#1556 from endocode/kayrus/cache_func_tests

3a974af

tests: test fleetd with extra configuration options

Merge pull request coreos#1577 from endocode/dongsu/fleetctl-stop-fin…

6a92833

…al-message fleetctl: print out result message explicitly

fleetctl: move to cli

eb7f560

- rework option parsing - move away from (most) global variables (cAPI), keep currentCommand for now... - we need to pass c (the cli context) and cAPI (the client API) around Signed-off-by: Olaf Buddenhagen <olaf@endocode.com>

vendor github.com/spf13/{cobra,pflag}

606869a

Add github.com/spf13/cobra and github.com/spf13/pflag.

fleetctl: do not pass around cAPI

0779d25

Let's roll back to the original global variable cAPI, to keep the entire code as simple as possible. As this is a single command-line tool, it would be harmless to simply use a global variable.

fleetctl: use cobra also for unit tests

d4e5524

As every fleetctl command uses cobra instead of codegangsta/cli, we need to update unit tests as well.

Merge pull request coreos#1585 from endocode/dongsu/config-cloudinit-…

56b539f

…fxtests functional: fix path to user_data to /var/lib/coreos-install

Merge pull request coreos#1587 from endocode/dongsu/poweroff-check-er…

441214f

…ror-fxtests functional: poweroff machine using machinectl instead of systemctl

functional: use double hyphens for long options of run-in-qemu

2bcac49

run-in-qemu should use double hyphens for long command-line options for consistency with other tools.

functional: allow run-in-qemu to run based on a custom base image

3e421ef

Allow run-in-qemu to run with a given custom base image, e.g. coreos_developer_qemu_image.img which was built by CoreOS SDK. $ ./run-in-qemu -c alpha -i ~/.qemu/coreos_developer_qemu_image.img

Merge pull request coreos#1590 from endocode/dongsu/set-custom-image-…

5423b01

…fxtests functional: allow run-in-qemu to run based on a custom base image

Merge pull request coreos#1596 from endocode/dongsu/machine-id-nspawn…

ba58786

…-fxtests functional: write machine ID into nspawn container

AlexisMontagne force-pushed the am-share-based-scheduling branch from 3fb314e to 01d7f7a Compare May 27, 2016 13:53

Dongsu Park and others added 28 commits September 9, 2016 16:42

Merge pull request coreos#1678 from endocode/dongsu/schema-gen-licens…

b0a7b5e

…e-header scripts,golang: update schema-generator, bump golang to 1.7.1

fleetd,main: make github.com/coreos/fleet go-getable

895b435

So far it hasn't been possible to do "go get github.com/coreos/fleet" because there was no main.go in the top source directory. Create main.go, a simple wrapper for fleetd to make it "go-getable".

Merge pull request coreos#1681 from endocode/dongsu/fleetd-main-go-ge…

4a72636

…table fleetd,main: make github.com/coreos/fleet go-getable

Merge pull request coreos#1683 from endocode/dongsu/docs-fix-rkt-options

39c7d4b

Documentation: fix unavailable rkt options in examples

scripts: move build-aci to scripts

37c87c2

To avoid confusion with ./build, we'd better have build-aci under ./scripts directory.

engine: remove vestigial trigger code

e7f7f48

Merge pull request coreos#1684 from vcaputo/remove_unused_trigger

015aa0b

engine: remove vestigial trigger code

fleetctl: make tryWaitForUnitStates return error instead of int

57b583f

Now that tryWaitForSystemdActiveState() returns error instead of int, tryWaitForUnitStates() should also return error instead of int, to be as idiomatic as possible.

fleetctl: check systemd active state via cAPI.UnitStates()

cc2837f

It's not appropriate to check systemd active states with DBUS connection, as that will return correct states only on the same machine. So we should introduce a check using cAPI.UnitStates(), to check states on other machines.

Merge pull request coreos#1667 from endocode/dongsu/fleetctl-capi-sys…

49420c5

…temd-states fleetctl: check systemd active state via client API

Merge pull request coreos#1682 from endocode/dongsu/scripts-build-aci

b2e537c

scripts: build ACI image using acbuild

Merge pull request coreos#1685 from endocode/dongsu/fix-shadow-error-…

b17e08b

…vars registry,fleetctl: fix bugs regarding shadowed error variables

Documentation: fix a broken weblink to scripts/build-aci

306b84f

Fix weblink to build-aci to https://github.com/coreos/fleet/blob/master/scripts/build-aci. This was a side-effect of moving the script to the scripts directory.

Merge pull request coreos#1689 from endocode/dongsu/fix-weblink-build…

686bddf

…-aci Documentation: fix a broken weblink to scripts/build-aci

ssh: fix a bug in dropping privileges on Linux

c86e6d0

On Linux syscall.Setuid() is not supported in go. So running unit tests with root privileges fails when calling Setuid. Let's skip the test on Linux, if euid is 0.

Merge pull request coreos#1693 from endocode/dongsu/unit-test-rootuse…

a11d017

…r-systemd registry/rpc: re-enable error handling after NewSystemdUnitManage

machine: Fetch shares from machine metadata

c0823e7

unit: Fetch the weight from the unit file, default weight is 1

b417af6

job: Validate the X-Fleet parameter Weight

fa41a78

engine: Compute the score of each agent from the weight of units and …

524b896

…the shares of the machine

AlexisMontagne force-pushed the am-share-based-scheduling branch from 01d7f7a to 524b896 Compare October 28, 2016 16:02

AlexisMontagne merged commit 524b896 into master Nov 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a share-based scheduling heuristic #1

Implement a share-based scheduling heuristic #1

AlexisMontagne commented May 17, 2016

jlevesy commented May 18, 2016

Implement a share-based scheduling heuristic #1

Implement a share-based scheduling heuristic #1

Conversation

AlexisMontagne commented May 17, 2016

jlevesy commented May 18, 2016