Docs: Concepts section (#98)

AdrianoKF · nicholasjng · web-flow · commit aec0761e4aa9 · 2024-09-26T11:08:34.000+02:00
* docs(concepts): Add WIP workload lifecycle

* docs: Add card grid on landing page

* docs(concepts): Add section on jobs identifiers

* docs: Fix typo

Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>

* docs: Fix typo

Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>

* docs: Reword confusing terminology

* docs: Improve wording of job completion section

* docs: Improve landing page

* docs: Add high-level architecture overview

---------

Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>
diff --git a/client/docs/concepts/_index.md b/client/docs/concepts/_index.md
@@ -0,0 +1,9 @@
+# Concepts
+
+This section covers the basic concepts behind jobq.
+
+It can help you:
+
+-   Understand the [high-level architecture](architecture.md) of jobq.
+-   Understand how jobq [identifies jobs](identifiers.md).
+-   Understand the [lifecycle of a job](lifecycle.md), from its submission to its completion.
diff --git a/client/docs/concepts/architecture.md b/client/docs/concepts/architecture.md
@@ -0,0 +1,52 @@
+---
+title: Architecture
+---
+
+# Understanding the jobq architecture
+
+The jobq high-level architecture consists of two major components:
+
+1. The [_client-side library_](#client-side-library), which is used to declare and submit jobs to a compute cluster.
+2. The [_server-side API_](#server-side-api), which serves as the interface between the client and the compute cluster.
+
+```mermaid
+architecture-beta
+    group api[Compute Cluster]
+
+    service jobq(server)[Jobq API] in api
+    service kueue(server)[Kueue] in api
+    service kubernetes(server)[Kubernetes API] in api
+    service ray(server)[Kuberay] in api
+
+    service client(server)[jobq Client]
+
+    jobq:R --> L:kueue
+    jobq:B --> T:kubernetes
+
+    kueue:R --> L:ray
+    kueue:B --> T:kubernetes
+
+    client:R --> L:jobq
+```
+
+## Client-side library
+
+The client-side Python library provides a high-level interface for declaring and executing jobs, either locally or on a compute cluster.
+It is designed to be easy to use and to integrate with other Python libraries and frameworks.
+
+The library is responsible for:
+
+-   Providing a `@job` decorator to annotate Python functions as jobs.
+-   Configuring the container image build for a job (through a declarative configuration or explicit `Dockerfile`).
+-   Setting runtime parameters for a job (e.g., its resource requirements).
+-   Managing the lifecycle of jobs, including monitoring their status and logs through a command-line interface.
+
+The library is implemented as a Python package that can be installed using pip.
+
+## Server-side API
+
+The server-side API is a RESTful API that provides a low-level interface for managing jobs in a compute cluster.
+
+It builds on top of Kubernetes and the [Kueue framework](https://kueue.sigs.k8s.io/), which provides a high-level abstraction for managing workloads in a Kubernetes cluster (including queueing, priority-based scheduling, preemption, and resource management).
+
+Kueue itself can manage workloads of various types, including Kubernetes `Job` and Kuberay `RayJob` resources.
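To make the client-side role described above more concrete, here is a minimal sketch of how a decorator can attach runtime parameters to a plain Python function. It is an illustration only and does not reproduce jobq's implementation or signature: `toy_job`, `ToyJob`, `run_locally`, and the `cpu`/`memory` parameters are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyJob:
    """Hypothetical wrapper: a function plus its declared resource needs."""

    func: Callable[..., Any]
    cpu: str
    memory: str

    def run_locally(self, *args: Any, **kwargs: Any) -> Any:
        # Local execution simply calls the wrapped function.
        return self.func(*args, **kwargs)


def toy_job(*, cpu: str = "1", memory: str = "512Mi") -> Callable[[Callable[..., Any]], ToyJob]:
    """Hypothetical decorator factory (not the jobq API)."""

    def decorator(func: Callable[..., Any]) -> ToyJob:
        return ToyJob(func=func, cpu=cpu, memory=memory)

    return decorator


@toy_job(cpu="2", memory="1Gi")
def add(a: int, b: int) -> int:
    return a + b
```

Calling `add.run_locally(2, 3)` runs the function in the current process, while `add.cpu` and `add.memory` hold the declared requirements that a real client would translate into a cluster submission.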
diff --git a/client/docs/concepts/identifiers.md b/client/docs/concepts/identifiers.md
@@ -0,0 +1,52 @@
+# Job Identifiers
+
+## Terminology
+
+In order to understand how jobq identifies jobs, it is important to understand the conceptual components of a workload:
+A job is composed of an abstract definition of the workload (as a [Kueue `Workload`](https://kueue.sigs.k8s.io/docs/concepts/workload/) custom resource) and a set of Kubernetes resources (for example a Kubernetes `Job`, or a custom resource like the [Kuberay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) `RayJob`) that make up the executable part of the workload.
+
+At first these similar-sounding names can be confusing, so let's establish some terminology:
+
+-   A **workload** or **job** (lowercase "w"/"j") is a set of Kubernetes resources that make up the
+    executable portion of a job.
+-   A **Workload** (uppercase "W") refers to the Kueue `Workload` custom resource.
+-   A **Job** (uppercase "J") refers to the Kubernetes `batch/v1/Job` resource (one way code can be submitted through jobq).
+
+Kueue handles the `Workload` and updates its status to reflect the current state of the workload.
+
+![Kueue workload components](https://kueue.sigs.k8s.io/images/queueing-components.svg)
+
+## Identifying workloads
+
+Every Workload (as managed by Kueue) carries an automatically generated unique identifier (UID) as well as a human-readable name and namespace.
+Either of these could serve as an identifier for a Workload. However, a name/namespace combination is not guaranteed to be unique over time (for example, when a resource is deleted and recreated), whereas a UID is.
+This makes UIDs a slightly better choice for identifying a given Workload resource.
+
+The concrete workload resource has the same identifiers: a UID and a name/namespace combination.
+
+A job and its associated Workload are linked in a 1:1 fashion (through the `metadata.ownerReferences` field).
+
+In theory, this allows a job in the cluster to be identified through two different identifiers:
+
+-   the UID of the (concrete) _job_ resource.
+-   the UID of the (abstract) _Workload_ resource.
+
+In practice, jobq always uses the **UID of the concrete workload** as the identifier for a job.
+All CLI operations return and accept the UID of the concrete workload.
+
+As an example, imagine the following resources in the cluster after submitting a job:
+
+```mermaid
+graph LR
+subgraph "Namespace example"
+direction LR
+  A["`Workload **job-example** <pre>uid-1</pre>`"] --> B["`Job **example** <pre>uid-2</pre>`"] --> C[Pod]
+end
+```
+
+If we want to query the logs of the job, we can do so by calling `jobq logs` with the UID of the concrete workload:
+
+```console
+$ jobq logs uid-2
+[... log output ...]
+```
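The reasoning above (names can be reused over time, UIDs cannot) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not jobq or Kubernetes code: `Resource` and `ToyCluster` are hypothetical names, and a random UUID stands in for the UID that the cluster would assign.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a cluster resource."""

    kind: str
    namespace: str
    name: str
    # Assumption: a random UUID models the automatically generated UID.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyCluster:
    """Hypothetical in-memory store that indexes resources by UID."""

    def __init__(self) -> None:
        self._by_uid: Dict[str, Resource] = {}

    def create(self, kind: str, namespace: str, name: str) -> Resource:
        resource = Resource(kind, namespace, name)
        self._by_uid[resource.uid] = resource
        return resource

    def delete(self, uid: str) -> None:
        del self._by_uid[uid]

    def get(self, uid: str) -> Optional[Resource]:
        return self._by_uid.get(uid)


cluster = ToyCluster()
first = cluster.create("Job", "example", "example")
cluster.delete(first.uid)
# Recreating with the same name/namespace yields a different UID.
second = cluster.create("Job", "example", "example")
```

Here `first` and `second` share a name and namespace, yet a lookup by `first.uid` no longer finds anything, which is why a UID is the less ambiguous handle.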
diff --git a/client/docs/concepts/lifecycle.md b/client/docs/concepts/lifecycle.md
@@ -0,0 +1,48 @@
+# Job Lifecycle
+
+Since jobq builds on top of the [Kueue](https://kueue.sigs.k8s.io/) job queueing system for scheduling,
+the lifecycle of a job is very similar to the lifecycle of a workload in Kueue.
+
+The remainder of this document uses the terms _job_ and _workload_ interchangeably.
+
+A workload roughly goes through three phases after its submission: _queueing and scheduling_, _execution_, and _completion_.
+
+### Queueing and scheduling
+
+After its submission, a workload is in the `Submitted` state, where it competes with other workloads for available resource quotas.
+Once it is admitted to a cluster queue, it enters the `Pending` state, where Kueue will reserve a quota for it.
+Alternatively, if the selected local or cluster queue for the workload is stopped or does not exist, the workload will enter the `Inadmissible` state until this condition is resolved.
+
+### Execution
+
+After all admission checks for the workload have passed, it enters the `Admitted` state and is now eligible for execution by the cluster.
+
+### Completion
+
+When the workload terminates successfully, it enters the terminal `Succeeded` state.
+If any unrecoverable error occurs during execution, the workload enters the terminal `Failed` state. This does not necessarily happen on the first abnormal termination of a pod, depending on the type of workload and other factors (such as the retry limit in a `batch/v1/Job`).
+
+A currently executing workload may be preempted by another workload (e.g., by a newly submitted workload with a higher priority).
+In this case, Kueue will terminate any pods associated with the preempted workload and either requeue it for later execution or evict it from the cluster queue.
+
+## State Diagram
+
+```mermaid
+stateDiagram-v2
+    direction LR
+    [*] --> Submitted
+
+    Submitted --> Pending: quotaReserved
+    Submitted --> Inadmissible
+    Inadmissible --> Submitted
+
+    Pending --> Admitted: admitted
+
+    Admitted --> Succeeded: success
+    Admitted --> Failed: error
+
+    Admitted --> Submitted: evicted
+    Admitted --> Pending: requeued
+    Succeeded --> [*]
+    Failed --> [*]
+```
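The state diagram above can also be read as a transition table. The following sketch encodes exactly the edges shown in the diagram and checks whether a sequence of states is a valid path; `TRANSITIONS`, `can_transition`, and `is_valid_path` are hypothetical helper names for this illustration and are not part of jobq or Kueue.

```python
from typing import Dict, List, Set

# Edges copied from the state diagram; the helper names are illustrative only.
TRANSITIONS: Dict[str, Set[str]] = {
    "Submitted": {"Pending", "Inadmissible"},
    "Inadmissible": {"Submitted"},
    "Pending": {"Admitted"},
    "Admitted": {"Succeeded", "Failed", "Submitted", "Pending"},
    "Succeeded": set(),
    "Failed": set(),
}

# States without outgoing edges are terminal.
TERMINAL_STATES: Set[str] = {s for s, targets in TRANSITIONS.items() if not targets}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram has an edge from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_valid_path(states: List[str]) -> bool:
    """Return True if every consecutive pair of states is an edge."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))
```

For example, `is_valid_path(["Submitted", "Pending", "Admitted", "Succeeded"])` holds, whereas a direct jump from `Submitted` to `Admitted` does not, because the diagram has no such edge.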
diff --git a/client/docs/index.md b/client/docs/index.md
@@ -1,3 +1,35 @@
-# Jobq, a cluster workflow scheduling tool
+---
+title: Home
+---
 
-This documentation is work in progress.
+#
+
+!!! warning "Work in progress"
+
+    This documentation is work in progress.
+    Please excuse frequent changes and missing content.
+
+<div class="grid cards" markdown>
+-   [:material-thought-bubble:{ .lg .middle } **Concepts**](concepts/_index.md)
+
+    ***
+
+    Learn about the concepts behind jobq
+
+-   [:material-apple-keyboard-command:{ .lg .middle } **Command-line interface**](cli.md)
+
+    ***
+
+    Learn how to use the `jobq` command-line interface
+
+-   [:material-book-open-variant:{ .lg .middle } **API Reference**](reference/SUMMARY.md)
+
+    ***
+
+    Detailed documentation of the jobq Python API
+
+</div>
+
+<hr />
+
+This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0){: target="\_blank" }.
diff --git a/client/mkdocs.yml b/client/mkdocs.yml