Update System: Tracking Issue #250

Open
3 of 27 tasks
smklein opened this issue Sep 16, 2021 · 5 comments
Labels
Update System Replacing old bits with newer, cooler bits

Comments

@smklein
Collaborator

smklein commented Sep 16, 2021

This issue tracks the work mentioned in RFD 183. Resolving these issues will likely result in subsequent RFDs.

Creating and Hosting Packages

  • Off-rack Update Service: Host a service where software bundles can be uploaded, stored, and downloaded. Likely TUF.
    • Update Hosting: Provide storage / interfaces for downloading packages.
    • Update Listing: Provide interfaces for listing and querying packages.
  • Automated Tooling For Publishing Packages: Create tooling to automate the process by which artifacts can be added to the "Update Service".
    • Packages of interest include...
      • Helios Ramdisk
      • Zone Images
      • Images for SP / RoT
    • Automated Testing For Packages: Create tests to confirm/deny interoperability between differently-versioned artifacts.
    • Signing Infrastructure: We need a mechanism for packages to be signed as "from Oxide", which can be validated on the rack. This schema should be transparent and documented, such that rack owners could plausibly replace components with their own software.
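As a rough illustration of on-rack validation, here is a minimal sketch of the integrity half of the check. It is not signature verification: it only confirms that downloaded bytes match a length and SHA-256 digest recorded in already-trusted metadata (TUF targets metadata records both for each target). The function name, the `recorded` dictionary, and the sample bytes are hypothetical and not taken from RFD 183.

```python
import hashlib


def artifact_matches_metadata(data: bytes, expected_length: int, expected_sha256: str) -> bool:
    """Return True only if the bytes match the recorded length and SHA-256 digest."""
    # Check the length first: it is cheap and rejects truncated or padded downloads.
    if len(data) != expected_length:
        return False
    # Compare hex digests case-insensitively.
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


# Hypothetical artifact and metadata entry, for illustration only.
artifact = b"example zone image bytes"
recorded = {"length": len(artifact), "sha256": hashlib.sha256(artifact).hexdigest()}
```

A real implementation would first verify the signatures on the metadata itself; this check is only meaningful once the metadata is trusted.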

Getting Updates to the Rack

  • Decide when to perform updates: Make Nexus self-sufficient, and able to decide when to update itself.
    • Version-awareness: Make Nexus able to consider "desired" versions, and update to a reasonable choice while maintaining backwards compatibility. (The current implementation updates to whatever is latest, regardless of other software on the rack.)
    • Rebalancing, liveness-awareness: Make Nexus able to rebalance workloads to enable upgrades that require service / sled reboots. This process must consider...
      • Externally-facing Resources: Namely, live migration of virtual machines
      • Internally-facing Services: Nexus, CRDB, Clickhouse, DNS servers, Oximeter, Crucible Downstairs, etc, must all maintain availability amid updates.
    • Modifying Storage: Preparing for / executing DB schema changes
      • Draining Sagas: As documented in RFD 289, ensure that sagas don't cross upgrade boundaries
    • Downgrade: Define/implement a process for downgrade.
  • Get the bundles into Nexus: Update Nexus's interface to expose an endpoint for uploading + instructing racks to update themselves (completed by [v2] TUF integration in Nexus + update artifact fetching by sled-agent #717).
    • Store the bundles on-sled: The SQL representation of software bundles will likely need to be updated to include metadata referencing downloaded software versions, but the locally-downloaded binaries themselves will likely be stored outside CockroachDB.
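To make the "version-awareness" item concrete, here is a minimal sketch of choosing an update target instead of always taking the latest. Everything in it is an assumption for illustration: the dotted `major.minor.patch` version format, the idea of a precomputed `compatible` set, and the function names are hypothetical, not how Nexus works today.

```python
from typing import List, Optional, Set, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch' into a tuple so versions compare numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def choose_update(available: List[str], desired: Optional[str], compatible: Set[str]) -> Optional[str]:
    """Pick a version to update to, or None if no acceptable version exists.

    `compatible` is the (hypothetical) set of versions known to interoperate
    with the other software currently on the rack.
    """
    candidates = [v for v in available if v in compatible]
    if desired is not None:
        # An explicitly requested version wins, but only if it is compatible.
        return desired if desired in candidates else None
    if not candidates:
        return None
    # Otherwise take the newest compatible version, compared numerically.
    return max(candidates, key=parse_version)
```

For example, with `available = ["1.2.0", "1.10.0", "2.0.0"]` and only the two 1.x versions marked compatible, this picks `1.10.0` rather than `2.0.0`.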

Communicating Update Status to the API / Console

  • Create APIs for inspecting versions of software, both at a component and "whole-rack" level (e.g., "what version of the API am I using")
  • Create APIs for requesting particular versions of software. Presumably this is a mechanism by which downgrade could be requested, but also could be utilized for avoiding updates during critical service windows.
  • Provide support in the Console for inspecting/requesting versions

Pushing Updates from Nexus to Everything Else

  • Expose APIs to Receive Updates: Within Sled Agent, SP, etc, "update targets" should expose an interface to be able to download and apply software bundles as instructed by Nexus.
  • Coordinating Updates: Nexus applying updates to the rack should probably use an update schedule that doesn't power-cycle all sleds at once. We can certainly start with something simple here; once the initial system is wired up, we can start building systems to balance updates against live customer workloads. See Update plans #764.
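The "something simple" starting point for coordinating updates could look like the sketch below: split the sleds into batches so that no more than a fixed number are offline at once. The function name, the `max_offline` parameter, and the sled identifiers are hypothetical, and this ignores workload placement entirely.

```python
from typing import List


def plan_batches(sleds: List[str], max_offline: int) -> List[List[str]]:
    """Split sleds into ordered batches of at most `max_offline` sleds each.

    Each batch would be updated and confirmed healthy before the next begins,
    so the whole rack is never power-cycled at once.
    """
    if max_offline < 1:
        raise ValueError("max_offline must be at least 1")
    return [sleds[i:i + max_offline] for i in range(0, len(sleds), max_offline)]
```

Calling `plan_batches(["sled0", "sled1", "sled2", "sled3", "sled4"], 2)` gives three batches: two pairs followed by the last sled on its own. A workload-aware scheduler would replace the fixed batch size with a decision based on what is running where.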

Validating that this Process Works

  • Stand up a lab system using a minimally-defined update system
  • Iterate on this system "without re-installing", to acquire empirical evidence of the update system's utility
@smklein smklein added ✈️ control plane Update System Replacing old bits with newer, cooler bits labels Sep 16, 2021
@smklein
Collaborator Author

smklein commented Sep 16, 2021

Hey @iliana - if it would help to break this into a different structure, I'm happy to edit.

Also, I'm happy to help with adjusting Sled Agent / Nexus interfaces + implementations to make uploading/storing/applying updates a bit easier. I'm a lot less certain where the TUF logic lives (like, where + how do we validate updates when they're received by Nexus?) but it's easy for me to prod at the DB to store any necessary metadata we need.

If sub-pieces of this would be better suited to smaller issues, we can track those too. Feel free to assign portions of this back to me; I'm happy to help.

@iliana
Contributor

iliana commented Sep 16, 2021

Looks like I can edit as necessary, too. This roughly maps to what I have in my head so it's a pretty good first go at a tracking issue :)

An early implementation of the update service will just be a static HTTP service. I'm assuming we can re-use pkg.oxide.computer for this?

@smklein
Collaborator Author

smklein commented Sep 16, 2021

I think so - I don't actually own that URL, maybe @jclulow can confirm?

@zephraph
Contributor

@smklein, I think changelogs are missing here. I know it's likely implied, but it's worth acknowledging given that there will be explicit work required. @rcgoodfellow had added some good thoughts around this in RFD-290, which can now be found in https://github.com/oxidecomputer/rfd/issues/477. I've still got a TODO to either drive that RFD or to bribe someone else to... but again, there will be some specific work around being able to inform customers what has changed.

One of the biggest reasons I bring this up is that writing helpful changelogs is as much a cultural thing as anything else. Given that we're not in the habit of doing so in a lot of our repositories, it'll necessitate a change in our process.

@askfongjojo

Linking #2483 and #2754 here to better support sled reboots

4 participants