[v2] TUF integration in Nexus + update artifact fetching by sled-agent #717
Conversation
Nexus is async; tough is not async, but tough uses reqwest in blocking mode, which is built on top of the async client. From https://docs.rs/reqwest/0.11.7/reqwest/blocking/index.html:

> Conversely, the functionality in `reqwest::blocking` must not be executed within an async runtime, or it will panic when attempting to block.

This moves the relevant code out to updates.rs, since the function can't borrow `self` due to lifetime constraints.
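As a rough sketch of that pattern (the names below are placeholders, not the actual Nexus code), the blocking tough call is moved onto `tokio::task::spawn_blocking` so `reqwest::blocking` never runs on the async runtime's worker threads:

```rust
use tokio::task;

// Placeholder stand-ins for the real types; the actual code lives in updates.rs.
struct UpdatesConfig { base_url: String }
struct Artifact { name: String }

// In the real code this would drive tough, which uses reqwest::blocking internally.
fn fetch_artifacts_from_repo(_config: &UpdatesConfig) -> anyhow::Result<Vec<Artifact>> {
    Ok(Vec::new())
}

async fn refresh_metadata(config: UpdatesConfig) -> anyhow::Result<Vec<Artifact>> {
    // spawn_blocking runs the closure on a dedicated blocking thread, where
    // reqwest::blocking is allowed to block without panicking.
    task::spawn_blocking(move || fetch_artifacts_from_repo(&config))
        .await
        // The JoinError case only happens if the blocking task panicked.
        .expect("blocking task panicked")
}
```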
... since we only get a JoinError if the task we're waiting on panics.
this fixes problematic generation of an OpenAPI interface, which in turn fixes generation of non-compiling code where `()` becomes an enum with no variants.
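A hedged sketch of the shape of that change (the exact dropshot signature depends on the dropshot version in use; the names here are illustrative): returning `HttpResponseUpdatedNoContent` produces a 204 response and keeps `()` out of the generated OpenAPI document entirely.

```rust
use dropshot::{endpoint, HttpError, HttpResponseUpdatedNoContent, RequestContext};
use std::sync::Arc;

// Illustrative endpoint: a 204 No Content response type instead of
// HttpResponseOk<()>, so no empty enum is generated for `()`.
#[endpoint {
    method = POST,
    path = "/updates/refresh",
}]
async fn updates_refresh(
    _rqctx: Arc<RequestContext<()>>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // ... kick off the metadata refresh here ...
    Ok(HttpResponseUpdatedNoContent())
}
```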
test_nexus_openapi_internal fails due to a change in dropshot, which I'm debugging now.
We intend to have an updates repository that customers can mirror on their internal networks, in case they make the reasonable decision to not plug their racks into the internet.

Also: at some point we may choose to pivot the metadata URL used, to "hide" updates from earlier versions of Nexus, in case we don't invent another mechanism to do that. That is, we would publish one final Nexus update to the update metadata in /metadata/, and that update would change Nexus to use /metadata-new/. This allows us to force a two-stage update if necessary. Or, perhaps, we might have other update channels that can be configured on the rack; a user may want to say "update to beta". Having the entry point be a single URL allows us to build these abstractions without telling the user to change the URL in a specific way or requiring a specific mirroring setup.
This looks awesome - adds useful functionality, has tests up and down the stack, and provides configuration that lets folks opt in, in both dev and production environments.
Thanks for all the hard work on this! Looks good, with some comments below, but nothing that I think should block us from moving forward.
/// Updates-related configuration. Updates APIs return 400 Bad Request when this is
/// unconfigured.
#[serde(default)]
Nice, this seems like a decent tradeoff to not break folks doing development.
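A minimal sketch of the pattern (the surrounding config struct is simplified; only the field names shown in this PR's snippets are assumed):

```rust
use serde::Deserialize;
use std::path::PathBuf;

#[derive(Deserialize)]
pub struct UpdatesConfig {
    pub trusted_root: PathBuf,
    pub default_base_url: String,
}

#[derive(Deserialize)]
pub struct Config {
    // The [updates] table is optional: a missing Option field deserializes to
    // None (which #[serde(default)] makes explicit), and the updates APIs
    // return 400 Bad Request when it's None.
    #[serde(default)]
    pub updates: Option<UpdatesConfig>,
}
```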
// Fetch the artifact and write to the file in its entirety,
// replacing it if it exists.
// TODO: Would love to stream this instead.
This is totally doable! I updated the file server example in Nexus to show how this works - and also added a test to show how streaming works on both the client / server side.
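For the client side, a hedged sketch of streaming the body to disk with reqwest (requires reqwest's `stream` feature; the function and argument names are placeholders):

```rust
use futures::StreamExt;
use tokio::io::AsyncWriteExt;

// Placeholder sketch: stream an artifact from Nexus to a file instead of
// buffering the whole response body in memory.
async fn download_artifact_to_file(
    client: &reqwest::Client,
    url: &str,
    path: &std::path::Path,
) -> anyhow::Result<()> {
    let response = client.get(url).send().await?.error_for_status()?;
    let mut file = tokio::fs::File::create(path).await?;
    // bytes_stream() yields chunks as they arrive rather than one big Vec<u8>.
    let mut stream = response.bytes_stream();
    while let Some(chunk) = stream.next().await {
        file.write_all(&chunk?).await?;
    }
    file.flush().await?;
    Ok(())
}
```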
nexus/src/config.rs
/** Trusted root.json role for the TUF updates repository. */
pub trusted_root: PathBuf,
/** Default base URLs for the TUF repository. */
pub default_base_url: String,
So it seems straightforward to me that we'll host a TUF repo somewhere, and set the value according to that (for `default_base_url`), but where would the `root.json` file come from?

I saw you added it to the `.gitignore`, but if we merge this PR and I want to actually set up a TUF repo + go through this workflow, where should I get this file from? (If this is a step a developer must do manually, maybe we should add some docs?)
This is something I expect we'll add during CI. We might have a different root.json for test builds, for instance. (A generic root.json might work here, but how do you share that key, and how do you avoid shipping production builds with that role as the trusted root?)
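For reference, a hypothetical `[updates]` section in the Nexus config (key names come from the struct above; the path and URL values are placeholders):

```toml
[updates]
# Trusted TUF root role; for test builds this could be a root.json generated in CI.
trusted_root = "smf/nexus/root.json"
# Base URL of the TUF repository; metadata/ and targets/ live under this prefix.
default_base_url = "http://localhost:8080/"
```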
pub async fn update_available_artifact_hard_delete_outdated(
    &self,
    current_targets_role_version: i64,
) -> DeleteResult {
    // We use the `targets_role_version` column in the table to delete any old rows, keeping
    // the table in sync with the current copy of artifacts.json.
    use db::schema::update_available_artifact::dsl;
    diesel::delete(dsl::update_available_artifact)
This is a comment out of my own curiosity, and doesn't need to block us, but...
What are the implications for rollback here of deleting all artifacts with an old "target role version"? Does this prevent Nexus from being able to roll itself back?
artifacts.json should likely contain the list of all valid updates, including what Nexus can decide to downgrade itself to. (This might not scale well, and maybe the document format should be redesigned?)
) -> Result<(), Error> {
    match artifact.kind {
        UpdateArtifactKind::Zone => {
            let directory = PathBuf::from("/var/tmp/zones");
I think you're doing this right now, but if this is `/opt/oxide`, this will "just work" with the rest of the Zone management stuff.
Yeah, we have this in the demo branch we used yesterday. The main reason I'm keeping this as /var/tmp for now is because this will work in integration tests running on pretty much any box. Giving sled-agent some form of configuration for this value (which we can set to a `TempDir` during testing!) is probably the right way forward here.
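A small sketch of what such a knob could look like (hypothetical field name; not part of this PR):

```rust
use serde::Deserialize;
use std::path::PathBuf;

#[derive(Deserialize)]
pub struct SledAgentConfig {
    // Hypothetical knob: where downloaded artifacts are written. Defaults to
    // /var/tmp so integration tests work on any box; tests could point it at
    // a tempfile::TempDir instead.
    #[serde(default = "default_artifact_directory")]
    pub artifact_directory: PathBuf,
}

fn default_artifact_directory() -> PathBuf {
    PathBuf::from("/var/tmp")
}
```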
// FIXME: if we hit an error in any of these database calls, the available artifact table
// will be out of sync with the current artifacts.json. can we do a transaction or
// something?
The artifacts.json is "local to nexus", right? If it's in memory in Nexus, what's the potential issue?
Even if multiple Nexus instances are concurrently upserting these artifacts, the database will ultimately end up in the same final state, won't it?
artifacts.json leaves Nexus's memory after the TUF metadata is fetched, leaving the relevant contents behind in this table. So when Nexus is considering what updates to apply and how, it's going to rely solely on the data in the database. (That's the plan, at least.)
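For illustration, a minimal sketch of wrapping the refresh in a single transaction, using plain synchronous diesel 2.x and a stand-in schema (the real datastore goes through an async connection pool, so the details would differ):

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

diesel::table! {
    // Stand-in for the real schema; only the columns used here.
    update_available_artifact (name) {
        name -> Text,
        targets_role_version -> BigInt,
    }
}

/// Insert the artifacts for the new targets role version and delete outdated
/// rows in one transaction, so a failure partway through can't leave the
/// table out of sync with artifacts.json.
fn replace_artifacts(
    conn: &mut PgConnection,
    new_version: i64,
    names: &[String],
) -> Result<(), diesel::result::Error> {
    use update_available_artifact::dsl;
    conn.transaction(|conn| {
        for name in names {
            diesel::insert_into(dsl::update_available_artifact)
                .values((
                    dsl::name.eq(name.as_str()),
                    dsl::targets_role_version.eq(new_version),
                ))
                .execute(conn)?;
        }
        // Drop rows from older versions only after the new rows are in place.
        diesel::delete(
            dsl::update_available_artifact
                .filter(dsl::targets_role_version.lt(new_version)),
        )
        .execute(conn)?;
        Ok(())
    })
}
```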
for sled in self
    .db_datastore
    .sled_list(&DataPageParams {
        marker: None,
        direction: PaginationOrder::Ascending,
        limit: NonZeroU32::new(100).unwrap(),
    })
    .await?
We discussed this in-person, but I think there's a race condition here:
- We should ensure that multiple Nexii running concurrently don't accidentally downgrade sled agents
- We should ensure that multiple requests to a Sled agent to update an artifact don't actually retrigger the download, especially if a download is already in progress.
I don't want to hold up this PR, but we should document these limitations somewhere so we remember to fix 'em.
Added to the bulleted list at the top!
// Demo-quality solution could be "destroy it on boot" or something?
// (we aren't doing that yet).
If we don't fix this "removal-of-old-artifacts" stuff now, could we file an issue / link it here?
(This seems like one of those solutions that could get us pretty far, but which we probably shouldn't ship with)
Added this up top
// TODO: These artifacts could be quite large - we should figure out how to
// stream this file back instead of holding it entirely in-memory in a
// Vec<u8>.
//
// Options:
// - RFC 7233 - "Range Requests" (is this HTTP/1.1 only?)
//   https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
// - "Roll our own". See:
//   https://stackoverflow.com/questions/20969331/standard-method-for-http-partial-upload-resume-upload
This is a comment I think I left before experimenting with streaming in dropshot - the examples here: https://github.com/oxidecomputer/dropshot/blob/main/dropshot/tests/test_streaming.rs should give us a mechanism to do this (specifically, `hyper_staticfile::FileBytesStream::new` around a file).
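A hedged sketch of the server side, assuming hyper with the `stream` feature and a hyper_staticfile version that exposes `FileBytesStream` at the crate root (the dropshot endpoint wrapping this is omitted):

```rust
use hyper::{Body, Response, StatusCode};
use hyper_staticfile::FileBytesStream;

// Build a streaming Response<Body> from a file on disk instead of reading
// the whole artifact into a Vec<u8> first.
async fn streamed_artifact_response(
    path: &std::path::Path,
) -> std::io::Result<Response<Body>> {
    let file = tokio::fs::File::open(path).await?;
    // FileBytesStream yields chunks of the file as they are read.
    let stream = FileBytesStream::new(file);
    Ok(Response::builder()
        .status(StatusCode::OK)
        .body(Body::wrap_stream(stream))
        .expect("valid response"))
}
```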
Added an explicit "fix the read into memory" bullet up top
[v2] TUF integration in Nexus + update artifact fetching by sled-agent (#717)

* WIP Nexus download endpoint
* WIP writing files
* free function now method
* sorting out errs
* Add tests, fix bugs
* I guess compiling code is better than the alternative
* I EXPECTORATEd that my JSON would be more up-to-date
* initial work for using the tough client in nexus
* run tough client in tokio::task::spawn_blocking
* allow an unconfigured updates system
* woo license header
* fix hardcoded base URLs to use localhost, for now
* add smf/nexus/root.json to .gitignore
* rename columns to make more sense at first glance
* keep table in sync with artifacts.json
* make the tests happy
* add /updates/refresh to oxapi_demo
* fix examples/config-file.toml
* test [updates] in config works
* remove dead code
* .unwrap() on JoinError from spawn_blocking
* wire download_artifact up to the database
* tell all sleds to apply all updates
* fixup! wire download_artifact up to the database
* actually verify the target after download
* Keep merging
* use ..UpdatedNoContent instead of ..Ok<()>
* end-to-end updates test
* move UpdateArtifactKind into common
* use a single updates base URL
* clean up unnecessary ResourceType variants
* refactor to require full artifact descriptions
* work around dropshot Response<Body> issue
* undo this change
* combine updates integ tests into one module
* remove errant fixme
* comment nit

Co-authored-by: Sean Klein <sean@oxide.computer>
Previous PRs: #469 #457 — I do feel somewhat strange about opening a new PR but it's diverged significantly from where #469 left off and merges in #457, so a fresh PR is probably warranted.
RFD 183 recommends the use of TUF as a transport mechanism to get updates from Oxide to the rack for a few reasons, one of which is "I wrote a TUF client once already". This integrates tough into Nexus and adds code for sled-agent to download update artifacts from Nexus (ty @smklein).
There's still a lot of work to do for updates, but this is the foundation on which everything will work: Nexus fetches TUF metadata from an update server and stores a list of artifacts, decides when to apply updates and instructs sled-agent to do so, and serves those artifacts to sled-agent.
The life of an artifact
A repository has a single base URL, such as https://rack-updates.oxide.computer/v1 or https://internal-mirror.it.bigcorp.example/oxide. Under this base URL are `metadata` and `targets` prefixes as used by TUF.

In TUF parlance, "target" refers to a file in the repository, as opposed to "metadata" or "role", which refer to the TUF-level metadata. We need a different word to refer to the blobs of data we want to make it all the way to sled-agents and get applied somehow, so we choose "artifact".
Artifacts are added to the TUF repository (an example of generating this repository can be found in https://github.com/oxidecomputer/omicron/blob/updates-demo/nexus/tests/integration_tests/updates.rs), along with an `artifacts.json` target that describes what each artifact does. An example:
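A hypothetical `artifacts.json` entry for a version 1 cockroach zone (the field names are illustrative; only `"target"` and the kind/name/version triple are taken from the surrounding description):

```json
{
  "name": "cockroachdb",
  "version": "1",
  "kind": "zone",
  "target": "cockroachdb-zone-1.tar.gz"
}
```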
`"target"` in the metadata entry refers to the name of the TUF repository target to download and verify to get the version 1 cockroach zone.

Although we expect updates to eventually be automatic, the process is currently manual; the `/updates/refresh` API is used to instruct Nexus to fetch the most recent TUF metadata and list of artifacts, and to cache this list of artifacts in the database. This is the step that performs TUF repository validation. This assumes the database Nexus uses can only be modified via Nexus, as we do not cache the TUF metadata.

In the future, Nexus can use this list of targets and the list of software in use across the rack to decide on a plan to update any outdated software. Currently, Nexus tells every sled-agent to apply every update whenever this refresh occurs, via sled-agent's `/update` endpoint. sled-agent then fetches the artifact from Nexus via the internal `/artifacts/{kind}/{name}/{version}` endpoint. In the future, it will do something useful with this artifact; it currently just puts it on the filesystem.

When Nexus is instructed to fetch an artifact, it stores it in a cache directory, `/var/tmp/oxide_artifacts`. This storage is treated as volatile; no assumptions are made about whether an artifact was or was not previously stored here. An eventual TODO for me is to make Nexus re-verify the hash of a file as it is read back from the cache, and to make sled-agent verify the hash as it reads from Nexus's response.

Things to convert into issues when this is merged
- The generated OpenAPI description (and thus the generated client) doesn't handle the internal `/artifacts/{kind}/{name}/{version}` endpoint well, since it returns `Response<Body>`. This might be fixed by "improve OpenAPI description for Response<Body> endpoints" (dropshot#295). In the meantime, sled-agent directly hits this endpoint with the reqwest `Client` instead of using the client method.
- `/artifacts/{kind}/{name}/{version}` (and its consumer in sled-agent) should stream data instead of reading it all into memory, especially when we start adding things like Oxide-provided OS images through the repository. Depending on how we want to go about this (wrt the above issue), we might want to use our normal pagination, implement `Range` header support, or just stream the response body.
- `Nexus::updates_refresh_metadata` should be turned into a single datastore function that wraps it in a transaction, to keep the list of artifacts in the database consistent.
- An artifact name like `hey/can/you/look/../../../../up/the/directory/tree` should not be able to escape the artifact directory, and that should have a test written to ensure it errors.