Update System: Tracking Issue #250

smklein · 2021-09-16T19:03:24Z

This issue tracks the work mentioned in RFD 183. Resolving these issues will likely result in subsequent RFDs.

Creating and Hosting Packages

Off-rack Update Service: Host a service where software bundles can be uploaded, stored, and downloaded. Likely TUF.
- Update Hosting: Provide storage / interfaces for downloading packages.
- Update Listing: Provide interfaces for listing and querying packages.
Automated Tooling For Publishing Packages: Create tooling to automate the process by which artifacts can be added to the "Update Service".
- Packages of interest include...
  - Helios Ramdisk
  - Zone Images
  - Images for SP / RoT
- Automated Testing For Packages: Create tests to confirm/deny interoperability between differently-versioned artifacts.
- Signing Infrastructure: We need a mechanism for packages to be signed as "from Oxide", which can be validated on the rack. This scheme should be transparent and documented, such that rack owners could plausibly replace components with their own software.
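
To make the "Update Hosting" and "Update Listing" items above concrete, here is a minimal in-memory sketch of what the service would need to answer. Everything in it is an assumption for illustration: `Catalog`, `Artifact`, `ArtifactKind`, `Version`, and the example paths are hypothetical names, not anything defined in RFD 183 or Omicron. Signing and TUF metadata are deliberately left out.

```rust
use std::collections::BTreeMap;

/// Kinds of artifact named in the list above. The enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ArtifactKind {
    HeliosRamdisk,
    ZoneImage,
    SpImage,
    RotImage,
}

/// A (major, minor, patch) version. The derived ordering compares the
/// fields left to right, so Version(1, 2, 0) > Version(1, 0, 9).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub u32, pub u32, pub u32);

/// Metadata the service might keep per uploaded artifact (hypothetical fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Artifact {
    pub kind: ArtifactKind,
    pub version: Version,
    /// Download path relative to the service root.
    pub path: String,
}

/// In-memory stand-in for the off-rack update service's index.
#[derive(Debug, Default)]
pub struct Catalog {
    artifacts: BTreeMap<(ArtifactKind, Version), Artifact>,
}

impl Catalog {
    /// Upload: record an artifact, replacing any entry with the same
    /// kind and version.
    pub fn publish(&mut self, artifact: Artifact) {
        self.artifacts
            .insert((artifact.kind, artifact.version), artifact);
    }

    /// Listing: all artifacts of one kind, oldest version first.
    pub fn list(&self, kind: ArtifactKind) -> Vec<&Artifact> {
        self.artifacts.values().filter(|a| a.kind == kind).collect()
    }

    /// Querying: the newest artifact of a kind, if any has been published.
    pub fn latest(&self, kind: ArtifactKind) -> Option<&Artifact> {
        self.list(kind).into_iter().last()
    }

    /// Hosting: look up one exact artifact for download.
    pub fn get(&self, kind: ArtifactKind, version: Version) -> Option<&Artifact> {
        self.artifacts.get(&(kind, version))
    }
}
```

Keying the map on `(kind, version)` keeps listings sorted for free. A real service would also need to store and serve signatures alongside each entry.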

Getting Updates to the Rack

Communicating Update Status to the API / Console

Create APIs for inspecting versions of software, both at a component and "whole-rack" level (e.g., "what version of the API am I using")
Create APIs for requesting particular versions of software. Presumably this is a mechanism by which downgrade could be requested, but also could be utilized for avoiding updates during critical service windows.
Provide support in the Console for inspecting/requesting versions
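
As a rough illustration of the inspect/request split described above, the sketch below tracks per-component versions, derives a "whole-rack" summary from them, and lets an operator pin a version. All names (`VersionInventory`, `RackVersion`, the component strings) are hypothetical and chosen for this example; none of this is an existing Nexus API.

```rust
use std::collections::BTreeMap;

/// A (major, minor, patch) version with lexicographic ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub u32, pub u32, pub u32);

/// Whole-rack summary derived from per-component versions.
#[derive(Debug, PartialEq, Eq)]
pub enum RackVersion {
    /// No component has reported a version yet.
    Unknown,
    /// Every component reports the same version.
    Uniform(Version),
    /// Components disagree (for example, mid-update).
    Mixed { oldest: Version, newest: Version },
}

#[derive(Debug, Default)]
pub struct VersionInventory {
    /// Component name (hypothetical, e.g. "sled-0/sled-agent") to running version.
    running: BTreeMap<String, Version>,
    /// Operator-requested version; `None` means "follow the latest".
    requested: Option<Version>,
}

impl VersionInventory {
    /// Record the version a component says it is running.
    pub fn report(&mut self, component: &str, version: Version) {
        self.running.insert(component.to_string(), version);
    }

    /// Component-level inspection.
    pub fn component_version(&self, component: &str) -> Option<Version> {
        self.running.get(component).copied()
    }

    /// Whole-rack inspection.
    pub fn rack_version(&self) -> RackVersion {
        let oldest = self.running.values().min().copied();
        let newest = self.running.values().max().copied();
        match (oldest, newest) {
            (Some(o), Some(n)) if o == n => RackVersion::Uniform(o),
            (Some(o), Some(n)) => RackVersion::Mixed { oldest: o, newest: n },
            _ => RackVersion::Unknown,
        }
    }

    /// Request a particular version (a pin or a downgrade), or clear the
    /// request with `None`.
    pub fn request_version(&mut self, version: Option<Version>) {
        self.requested = version;
    }

    /// Version the rack should converge on: the requested one if set,
    /// otherwise the latest available.
    pub fn target(&self, latest_available: Version) -> Version {
        self.requested.unwrap_or(latest_available)
    }
}
```

Pinning through `request_version` covers both cases mentioned above: holding the rack at its current version during a critical window, and asking for an older version as a downgrade.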

Pushing Updates from Nexus to Everything Else

Expose APIs to Receive Updates: Within Sled Agent, SP, etc, "update targets" should expose an interface to be able to download and apply software bundles as instructed by Nexus.
Coordinating Updates: When Nexus applies updates to the rack, it should probably use an update schedule that doesn't power-cycle all sleds at once. We can certainly start with something simple here; once the initial system is wired up, we can start building systems to balance updates against live customer workload.
- Update plans #764
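
The "something simple" starting point for coordination could be a fixed-size rolling update. The sketch below is one possible shape under that assumption, not the plan tracked in #764: `plan_batches`, `run_plan`, and the `update_sled` callback are hypothetical names, and the sleds in a batch are updated one after another here rather than concurrently.

```rust
/// Split the sled list into consecutive batches of at most `max_concurrent`
/// sleds, so that only one batch is out of service at a time.
pub fn plan_batches(sleds: &[String], max_concurrent: usize) -> Vec<Vec<String>> {
    // Treat 0 as 1 so the plan always makes progress (and `chunks` never
    // sees a zero chunk size, which would panic).
    let size = max_concurrent.max(1);
    sleds.chunks(size).map(|batch| batch.to_vec()).collect()
}

/// Walk the plan batch by batch, stopping at the first failure so the
/// remaining sleds keep running the old software.
///
/// `update_sled` stands in for whatever call Nexus would make to a sled's
/// update interface. Returns the number of sleds updated on success.
pub fn run_plan<F>(plan: &[Vec<String>], mut update_sled: F) -> Result<usize, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut updated = 0;
    for batch in plan {
        for sled in batch {
            update_sled(sled.as_str()).map_err(|e| format!("{}: {}", sled, e))?;
            updated += 1;
        }
    }
    Ok(updated)
}
```

Stopping on the first error is a conservative default for a first version. Balancing against live customer workload would replace the fixed batch size with a decision made per batch.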

Validating that this Process Works

Stand up a lab system using a minimally-defined update system
Iterate on this system "without re-installing", to gather empirical evidence of the update system's utility

smklein · 2021-09-16T19:08:59Z

Hey @iliana - if it would help to break this into a different structure, I'm happy to edit.

Also, I'm happy to help with adjusting Sled Agent / Nexus interfaces + implementations to make uploading/storing/applying updates a bit easier. I'm a lot less certain where the TUF logic lives (like, where + how do we validate updates when they're received by Nexus?) but it's easy for me to prod at the DB to store any necessary metadata we need.

If sub-pieces of this would be better suited to smaller issues, we can track those too. Feel free to assign portions of this back to me; I'm happy to help.

iliana · 2021-09-16T19:11:14Z

Looks like I can edit as necessary, too. This roughly maps to what I have in my head so it's a pretty good first go at a tracking issue :)

An early implementation of the update service will just be a static HTTP service. I'm assuming we can re-use pkg.oxide.computer for this?

smklein · 2021-09-16T19:13:49Z

I think so - I don't actually own that URL, maybe @jclulow can confirm?

zephraph · 2022-08-23T16:20:55Z

@smklein, I think changelogs are missing here. I know it's likely implied, but it's worth acknowledging given that there will be explicit work required. @rcgoodfellow had added some good thoughts around this in RFD-290, which can now be found in https://github.com/oxidecomputer/rfd/issues/477. I've still got a TODO to either drive that RFD or to bribe someone else to... but again, there will be some specific work around being able to inform customers of what has changed.

One of the biggest reasons I bring this up is that writing helpful changelogs is as much a cultural thing as anything else. Given that we're not in the habit of doing so in a lot of our repositories, it'll necessitate a change in our process.

askfongjojo · 2023-04-04T01:10:55Z

Linking #2483 and #2754 here to better support sled reboots

smklein added the "control plane" and "Update System" labels Sep 16, 2021

smklein assigned iliana Sep 16, 2021

smklein mentioned this issue Sep 16, 2021: Omicron Live Migration Support #252 (Open, 8 tasks)

smklein unassigned iliana Jul 26, 2022

smklein removed the control plane label Nov 16, 2022

david-crespo mentioned this issue Jan 4, 2023: Tracking issue for System Update API (phase 1) #2107 (Closed, 18 tasks)

This was referenced Feb 6, 2023:
- Tracking: Update oxidecomputer/console#1351 (Open)
- Tracking: System Update API (phase 2) #2341 (Open)
