Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a share-based scheduling heuristic #1

Merged
merged 229 commits into from
Nov 9, 2016

Conversation

AlexisMontagne
Copy link
Member

Based on coreos#945

@AlexisMontagne AlexisMontagne force-pushed the am-share-based-scheduling branch from e87998b to 3fb314e Compare May 17, 2016 01:17
@jlevesy
Copy link

jlevesy commented May 18, 2016

Sounds good to me.

Dongsu Park and others added 26 commits May 18, 2016 16:01
… timeout

As etcd's keysAPI already supports HeaderTimeoutPerRequest, fleet should
not handle request timeout on its own, but just pass on the timeout
value to the etcd client. For that, set timeout only in
etcd.Config.HeaderTimeoutPerRequest, and get rid of internal timeout
handling from both EtcdRegistry and LeaseManager.

Also remove EtcdRegistry.ctx(), and just call context.Background()
directly, as suggested by Jon.

Suggested-by: Jonathan Boulle <jonathanboulle@gmail.com>
Partially resolves coreos#1397
Support truthy values in boolean unit options(true, yes, on, 1, t).

Originally written by wuqixuan <wuqixuan@huawei.com>
Supersedes coreos#1271
Fixes: coreos#1108
Fleetctl {stop,unload} has been always printing out somewhat ambiguous
messages like below:

  Unit app.service loaded on 0eb1b9b6.../coreos1

For better usability, print out a better result message, either about
{stopp,unload}ing successfully, or failing to do.

Partially resolves: coreos#1432
As the unit state cache is going to be obsolete, functional test does
not need to run go test with unit state cache enabled. So let's remove
the second run for now.
…alues

fleetctl: support truthy values in boolean unit options
registry: use etcd.Config.HeaderTimeoutPerRequest instead of internal timeout
tests: test fleetd with extra configuration options
…al-message

fleetctl: print out result message explicitly
- rework option parsing
- move away from (most) global variables (cAPI), keep currentCommand for now...
- we need to pass c (the cli context) and cAPI (the client API) around

Signed-off-by: Olaf Buddenhagen <olaf@endocode.com>
Add github.com/spf13/cobra and github.com/spf13/pflag.
Let's roll back to the original global variable cAPI, to keep the
entire code as simple as possible. As this is a single command-line
tool, it would be harmless to simply use a global variable.
Use Cobra (github.com/spf13/cobra) instead of cli
(github.com/codegangsta/cli), for better cmdline user interface.

* Create a wrapper runWrapper() to be used for cobra, to wrap around
  a normal run*() function into a prototype for cobra.Command.Run().
  It also sets a global variable cAPI for running a normal command.
* remove unnecessary code for codegangsta/cli from fleetctl.go.

Suggested-by: Jonathan Boulle <jonathanboulle@gmail.com>
Fixes: coreos#1453
Supersedes coreos#1570
As every fleetctl command uses cobra instead of codegangsta/cli,
we need to update unit tests as well.
* Do not compare output of help messages, to avoid occasional failures.
  As cobra sometimes prints out a different help message, it causes
  functional test TestClientHelpFlag to fail.
* Run "fleetctl version" instead of "fleetctl --version".
* Run "fleetctl help" instead of "fleetctl" to get help message.
Since CoreOS alpha (1000), coreos-cloudinit doesn't seem to watch
on /media/configdrive/openstack/latest/user_data. That's why
functional tests haven't been running, as systemd-nspawn container
runs without any configurations.
Fix it by changing into /var/lib/coreos-install/user_data.

Fixes: coreos#1532
…fxtests

functional: fix path to user_data to /var/lib/coreos-install
So far ReplaceMember has been checking for error, by looking up an
explicit string "Success" in stderr from "systemctl -M <ID> poweroff".
However, with systemd v229 or newer, systemctl poweroff does not print
out "Success" even if poweroff succeeded. This causes
TestKnownHostsVerification to always return error like:

  --- FAIL: TestKnownHostsVerification (2.66s)
      client_test.go:60: Failed replacing machine: poweroff failed:

Fix it by changing the systemctl call to "machinectl poweroff <ID>",
which should return error correctly. This would be better in the long
run, than checking for messages in stderr.

Fixes: coreos#1586
…ror-fxtests

functional: poweroff machine using machinectl instead of systemctl
TestSingleNodeConnectivityLoss has been so far retrieving is-enabled
status by calling "systemctl list-unit-files UNITS". It was expected
to get "enabled" for each unit. This has worked until systemd v227.
However, that's not the case any more since systemd v228, precisely
since systemd/systemd@0ec0dea
("install: follow unit file symlinks in /usr, but not /etc when looking
for [Install] data"). Since then, systemctl list-unit-files returns
"linked-runtime" status for each fleet unit, which is actually
reasonable behavior for units that are not installed persistently
under /etc/systemd.

With systemd v228 or newer, however, TestSingleNodeConnectivityLoss
in fleet functional tests cannot make use of "systemctl list-unit-files"
any more. And it would not be optimal to simply change enabled to
linked-runtime, because then the test couldn't work with systemd < v228.

So let's work around it, by calling "systemctl is-enabled UNIT" instead.
That command has returned correct status "linked-runtime", no matter
whether systemd >= 228 or not.

Fixes: coreos#1588
run-in-qemu should use double hyphens for long command-line options
for consistency with other tools.
Allow run-in-qemu to run with a given custom base image, e.g.
coreos_developer_qemu_image.img which was built by CoreOS SDK.

  $ ./run-in-qemu -c alpha -i ~/.qemu/coreos_developer_qemu_image.img
…fxtests

functional: allow run-in-qemu to run based on a custom base image
As systemd v230 or newer requires /etc/machine-id on a nspawn container,
we need to write the container's own machine ID into /etc/machine-id.
Otherwise, functional tests don't start at all like:

  --- FAIL: TestKnownHostsVerification (10.80s)
    client_test.go:36: unable to detect machine PID

Fixes: coreos#1594
See also systemd/systemd#3014
See also rkt/rkt#2440
…-fxtests

functional: write machine ID into nspawn container
As CoreOS requires PAM configurations under /etc/pam.d starting from
developer version of CoreOS 104x, PAM configurations on the host need to
copied into the nspawn container. Otherwise, every sudo command fails
with messages like:

  --- FAIL: TestShutdown (1.99s)
      shutdown_test.go:43: exit status 1

Note that copying /etc/pam.d is necessary only for such setups with
sys-auth/pambase installed. Though it should be also fine for systems
where /etc/pam.d is empty, because then it should automatically fall
back to /usr/lib64/pam.d, which belongs to sys-libs/pam.

Fixes: coreos#1591
See also https://coreos.com/blog/security-brief-coreos-linux-alpha-remote-ssh-issue.html
@AlexisMontagne AlexisMontagne force-pushed the am-share-based-scheduling branch from 3fb314e to 01d7f7a Compare May 27, 2016 13:53
Dongsu Park and others added 28 commits September 9, 2016 16:42
…e-header

scripts,golang: update schema-generator, bump golang to 1.7.1
So far it hasn't been possible to do "go get github.com/coreos/fleet"
because there was no main.go in the top source directory.
Create main.go, a simple wrapper for fleetd to make it "go-getable".
Now that fleetd/fleetd.go doesn't have main(), we need to make the
default build script run go build on the top source directory, instead
of ./fleetd/. Then it will end up building fleetd, as main() in main.go
imports fleetd.Main() by default.
…table

fleetd,main: make github.com/coreos/fleet go-getable
Apparently rkt doesn't support a single dash in its option
any more. For example, "rkt -volume" doesn't work any more.
Use double dash instead: "rkt --volume"

Also --insecure-skip-verify doesn't exist.
Use --insecure-options=image instead.
Documentation: fix unavailable rkt options in examples
To avoid confusion with ./build, we'd better have build-aci under
./scripts directory.
Various changes on build-aci.

* Use acbuild command for building the ACI image.
* Remove unnecessary docker command. Actually we could build ACI
  image without relying on docker, just like etcd does.
* Remove unnecessary manifest definition, as it's automatically
  generated by acbuild.
* Improve the cleanup path using acbuildEnd().
* Allow TCP ports 2379,4001 for etcd

How to test, for example:

$ ./scripts/build-aci v0.12.0-amd64
$ sudo rkt run --insecure-options=image \
  --stage1-name=coreos.com/rkt/stage1-fly:1.14.0 --inherit-env \
  --volume etc-fleet,kind=host,source=/etc/fleet \
  --volume machine-id,kind=host,source=/etc/machine-id \
  --volume dbus-socket,kind=host,source=/run/dbus/system_bus_socket \
  --volume fleet-units,kind=host,source=/run/fleet/units \
  ./fleetd-v0.12.0-amd64.aci
engine: remove vestigial trigger code
To be as idiomatic as possible, do so:

  * make tryWaitForSystemdActiveState() return error instead of int
  * pass getBlockAttempts(cCmd) to tryWaitForSystemdActiveState()
    to handle such cases of --no-block or --block-attempts.
  * make assertSystemdActiveState(), waitForSystemdActiveState() return
    just error, not with 'found', as it's unnecessary.
  * remove unnecessary checking against conn == nil.
  * compare strings with '!=' instead of strings.Compare() != 0
Now that tryWaitForSystemdActiveState() returns error instead of int,
tryWaitForUnitStates() should also return error instead of int, to be
as idiomatic as possible.
It's not appropriate to check systemd active states with DBUS
connection, as that will return correct states only on the same machine.
So we should introduce a check using cAPI.UnitStates(), to check states
on other machines.
As it's basically not efficient to call cAPI.UnitStates() for each unit,
let's call it only once outside goroutines for processing units, i.e.
calling in waitForSystemdActiveState(), not in getSingleUnitState().
To do that, we need to share the result apiStates across all units.

Also make use of goroutines in waitForSystemdActiveState(), instead of
sequential iteration over each unit.

Note that we still need to keep calling UnitStates() when
assertSystemdActiveState() failed, because in the next attempt the old
apiState is not necessarily going to be valid. Ideally in that case, we
should be able to fetch the state only for a single unit. However, we
cannot do that for now, because cAPI.UnitState() is not available. In
the future we would need to implement cAPI.UnitState() and all
dependendent parts all over the tree in fleet, e.g. schema, etcdRegistry,
rpcRegistry, etc. to replace UnitStates() in this place with the new
method UnitState(). In practice, calling UnitStates() here is not as
badly inefficient as it looks, because it will be anyway rarely called
only when the assertion failed.
…temd-states

fleetctl: check systemd active state via client API
When marshal() returns a non-nil error 'err', which shadows the
pre-defined 'err', we should explicitly return the second error returned
from marshal(). Otherwise it will always return nil.
To fix other potential bugs, let's just return err explicitly.

This was discovered by running go tool vet:

$ go tool vet --shadow ./registry
registry/job.go:316: declaration of "err" shadows declaration at registry/job.go:315
In matchLocalFileAndUnit(), if os.Stat() returns nil and
getUnitFromFile() returns non-nil error, then in the end
matchLocalFileAndUnit() will return (false, nil). That's not the
expected result. It should have returned the second error that came
from getUnitFromFile().

To avoid further potential bugs, let's follow the idiomatic coding
style: first check err != nil, and fail-fast.

This issue was discoverd by running go tools vet.
…vars

registry,fleetctl: fix bugs regarding shadowed error variables
Fix weblink to build-aci to
https://github.com/coreos/fleet/blob/master/scripts/build-aci.
This was a side-effect of moving the script to the scripts directory.
…-aci

Documentation: fix a broken weblink to scripts/build-aci
Run TestRegistryMuxUnitManagement() only when user's euid is 0 and
systemd is avilable at run-time. Such a distinction is necessary for
avoiding test failures, especially when unit tests run on travis CI,
where systemd is not available.

With those checks, we can re-enable error handling after calling
systemd.NewSystemdUnitManager().
On Linux syscall.Setuid() is not supported in go. So running unit tests
with root privileges fails when calling Setuid.
Let's skip the test on Linux, if euid is 0.
…r-systemd

registry/rpc: re-enable error handling after NewSystemdUnitManage
@AlexisMontagne AlexisMontagne force-pushed the am-share-based-scheduling branch from 01d7f7a to 524b896 Compare October 28, 2016 16:02
@AlexisMontagne AlexisMontagne merged commit 524b896 into master Nov 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants