Commit aec0761

Docs: Concepts section (#98)
* docs(concepts): Add WIP workload lifecycle
* docs: Add card grid on landing page
* docs(concepts): Add section on jobs identifiers
* docs: Fix typo
  Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>
* docs: Fix typo
  Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>
* docs: Reword confusing terminology
* docs: Improve wording of job completion section
* docs: Improve landing page
* docs: Add high-level architecture overview

---------

Co-authored-by: Nicholas Junge <n.junge@appliedai-institute.de>
1 parent fcbd814 commit aec0761

6 files changed

+303
-104
lines changed

client/docs/concepts/_index.md

+9
@@ -0,0 +1,9 @@
1+
# Concepts
2+
3+
This section covers the basic concepts behind jobq.
4+
5+
It can help you:
6+
7+
- Understand the [high-level architecture](architecture.md) of jobq.
8+
- Understand how jobq [identifies jobs](identifiers.md).
9+
- Understand the [lifecycle of a job](lifecycle.md), from its submission to its completion.

client/docs/concepts/architecture.md

+52
@@ -0,0 +1,52 @@
1+
---
2+
title: Architecture
3+
---
4+
5+
# Understanding the jobq architecture
6+
7+
The jobq high-level architecture consists of two major components:
8+
9+
1. The [_client-side library_](#client-side-library), which is used to declare and submit jobs to a compute cluster.
10+
2. The [_server-side API_](#server-side-api), which serves as the interface between the client and the compute cluster.
11+
12+
```mermaid
13+
architecture-beta
14+
group api[Compute Cluster]
15+
16+
service jobq(server)[Jobq API] in api
17+
service kueue(server)[Kueue] in api
18+
service kubernetes(server)[Kubernetes API] in api
19+
service ray(server)[Kuberay] in api
20+
21+
service client(server)[jobq Client]
22+
23+
jobq:R --> L:kueue
24+
jobq:B --> T:kubernetes
25+
26+
kueue:R --> L:ray
27+
kueue:B --> T:kubernetes
28+
29+
client:R --> L:jobq
30+
```
31+
32+
## Client-side library
33+
34+
The client-side Python library provides a high-level interface for declaring and executing jobs, either locally or on a compute cluster.
35+
It is designed to be easy to use and to integrate with other Python libraries and frameworks.
36+
37+
The library is responsible for:
38+
39+
- Providing a `@job` decorator to annotate Python functions as jobs.
40+
- Configuring the container image build for a job (through a declarative configuration or explicit `Dockerfile`).
41+
- Setting runtime parameters for a job (e.g., its resource requirements).
42+
- Managing the lifecycle of jobs, including monitoring their status and logs through a command-line interface.
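
To make the decorator idea concrete, here is a minimal, self-contained sketch of how a decorator can attach runtime parameters to a plain Python function. It is an illustration only: the `JobSpec` class, the `cpu`/`memory`/`image` parameters, and the `run_locally` method are hypothetical names chosen for this example and are not taken from jobq's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class JobSpec:
    """Hypothetical wrapper pairing a function with its runtime parameters."""

    func: Callable[..., Any]
    cpu: str = "1"
    memory: str = "1Gi"
    image: Optional[str] = None  # None stands for "build an image for this job"

    def run_locally(self, *args: Any, **kwargs: Any) -> Any:
        # Local execution simply calls the wrapped function.
        return self.func(*args, **kwargs)


def job(*, cpu: str = "1", memory: str = "1Gi", image: Optional[str] = None):
    """Hypothetical decorator factory (NOT jobq's real signature)."""

    def wrap(func: Callable[..., Any]) -> JobSpec:
        return JobSpec(func=func, cpu=cpu, memory=memory, image=image)

    return wrap


@job(cpu="2", memory="4Gi")
def add(a: int, b: int) -> int:
    return a + b


result = add.run_locally(2, 3)
```

The point of the pattern is that the function body stays ordinary Python, while the resource requirements travel with it as metadata that a submission step could later read.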
43+
44+
The library is implemented as a Python package that can be installed using pip.
45+
46+
## Server-side API
47+
48+
The server-side API is a RESTful API that provides a low-level interface for managing jobs in a compute cluster.
49+
50+
It builds on top of Kubernetes and the [Kueue framework](https://kueue.sigs.k8s.io/), which provides a high-level abstraction for managing workloads in a Kubernetes cluster (including queueing, priority-based scheduling, preemption, and resource management).
51+
52+
Kueue itself can manage workloads of various types, including Kubernetes `Jobs` and Kuberay `RayJobs`.
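
As a rough illustration of what a Kueue-managed workload looks like, the sketch below builds a Kubernetes `batch/v1` `Job` manifest as a plain Python dictionary. The `kueue.x-k8s.io/queue-name` label and the `suspend: true` field follow Kueue's documented conventions for queueing a `Job`; the helper name `build_queued_job`, the queue name `user-queue`, and the image are assumptions made for this example. How the jobq server actually constructs its manifests is not shown here.

```python
import json
from typing import Dict, List


def build_queued_job(
    name: str,
    queue: str,
    image: str,
    command: List[str],
    cpu: str = "1",
    memory: str = "1Gi",
) -> Dict:
    """Build a batch/v1 Job manifest that Kueue can pick up (illustrative helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Kueue uses this label to route the Job to a local queue.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # The Job starts suspended; Kueue unsuspends it on admission.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "command": command,
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = build_queued_job(
    name="example",
    queue="user-queue",  # assumed queue name
    image="python:3.12-slim",  # assumed image
    command=["python", "-c", "print('hello')"],
)
print(json.dumps(manifest, indent=2))
```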

client/docs/concepts/identifiers.md

+52
@@ -0,0 +1,52 @@
1+
# Job Identifiers
2+
3+
## Terminology
4+
5+
In order to understand how jobq identifies jobs, it is important to understand the conceptual components of a workload:
6+
A job is composed of an abstract definition of the workload (as a [Kueue `Workload`](https://kueue.sigs.k8s.io/docs/concepts/workload/) custom resource) and a set of Kubernetes resources (for example a Kubernetes `Job`, or a custom resource like the [Kuberay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) `RayJob`) that make up the executable part of the workload.
7+
8+
At first these similar-sounding names can be confusing, so let's establish some terminology:
9+
10+
- A **workload** or **job** (lowercase "w"/"j") is a set of Kubernetes resources that make up the
11+
executable portion of a job.
12+
- A **Workload** (uppercase "W") refers to the Kueue `Workload` custom resource.
13+
- A **Job** (uppercase "J") refers to the Kubernetes `batch/v1/Job` resource (one of the ways in which code can be submitted through jobq).
14+
15+
Kueue handles the `Workload` and updates its status to reflect the current state of the workload.
16+
17+
![Kueue workload components](https://kueue.sigs.k8s.io/images/queueing-components.svg)
18+
19+
## Identifying workloads
20+
21+
Every Workload (as managed by Kueue) carries an automatically generated unique identifier (UID) as well as a human-readable name and namespace.
22+
Either of these could serve as a unique identifier for a Workload. However, a name/namespace combination is not guaranteed to be unique over time (for example, when a resource is deleted and recreated under the same name), whereas a UID is.
23+
This makes UIDs a slightly better choice for identifying a given Workload resource.
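
The difference can be illustrated with a small in-memory model. This is a toy sketch, not Kubernetes or jobq code: `FakeCluster` and `Resource` are hypothetical names, and UIDs are simulated with Python's `uuid` module.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Resource:
    namespace: str
    name: str
    # A fresh UID is generated for every newly created resource.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class FakeCluster:
    """Toy stand-in for a cluster that enforces name/namespace uniqueness."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Resource] = {}

    def create(self, namespace: str, name: str) -> Resource:
        key = (namespace, name)
        if key in self._by_key:
            raise ValueError(f"{namespace}/{name} already exists")
        resource = Resource(namespace, name)
        self._by_key[key] = resource
        return resource

    def delete(self, namespace: str, name: str) -> None:
        del self._by_key[(namespace, name)]


cluster = FakeCluster()
first = cluster.create("example", "job-example")
cluster.delete("example", "job-example")
second = cluster.create("example", "job-example")

# Same name and namespace, but two distinct resources over time.
same_key = (first.namespace, first.name) == (second.namespace, second.name)
same_uid = first.uid == second.uid
```

Here `same_key` is true while `same_uid` is false, which is why a stored name/namespace pair can silently end up pointing at a different resource than the one originally meant.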
24+
25+
The concrete workload resource has the same kinds of identifiers: a UID and a name/namespace combination.
26+
27+
A given job is linked to its associated Workload in a 1:1 fashion (through the `metadata.ownerReferences` field).
28+
29+
In theory, this makes it possible to identify a job in the cluster through two different identifiers:
30+
31+
- the UID of the (concrete) _job_ resource.
32+
- the UID of the (abstract) _Workload_ resource.
33+
34+
In practice, jobq always uses the **UID of the concrete workload** as the identifier for a job.
35+
All CLI operations return and accept the UID of the concrete workload.
36+
37+
As an example, imagine the following resources in the cluster after submitting a job:
38+
39+
```mermaid
40+
graph LR
41+
subgraph "Namespace example"
42+
direction LR
43+
A["`Workload **job-example** <pre>uid-1</pre>`"] --> B["`Job **example** <pre>uid-2</pre>`"] --> C[Pod]
44+
end
45+
```
46+
47+
If we want to query the logs of the job, we can do so by calling `jobq logs` with the UID of the concrete workload:
48+
49+
```console
50+
$ jobq logs uid-2
51+
[... log output ...]
52+
```
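
The lookup rule described above can be sketched as follows, using the resources from the example diagram. This is a simplified model under stated assumptions: the `Resource` class, the `CONCRETE_KINDS` set, and the `find_job` function are hypothetical and do not reflect how jobq resolves UIDs internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    kind: str
    name: str
    namespace: str
    uid: str


# The resources from the example: an abstract Workload and a concrete Job.
RESOURCES: List[Resource] = [
    Resource(kind="Workload", name="job-example", namespace="example", uid="uid-1"),
    Resource(kind="Job", name="example", namespace="example", uid="uid-2"),
]

# Kinds treated as "concrete workloads" in this sketch (an assumption).
CONCRETE_KINDS = {"Job", "RayJob"}


def find_job(uid: str, resources: List[Resource] = RESOURCES) -> Optional[Resource]:
    """Return the concrete workload with the given UID, or None if there is none."""
    for resource in resources:
        if resource.kind in CONCRETE_KINDS and resource.uid == uid:
            return resource
    return None


found = find_job("uid-2")
not_found = find_job("uid-1")  # UID of the Kueue Workload, not of the job
```

In this model, passing the UID of the Kueue `Workload` (`uid-1`) does not match anything, which mirrors the rule that only the UID of the concrete workload identifies a job.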

client/docs/concepts/lifecycle.md

+48
@@ -0,0 +1,48 @@
1+
# Job Lifecycle
2+
3+
Since jobq builds on top of the [Kueue](https://kueue.sigs.k8s.io/) job queuing system for scheduling,
4+
the lifecycle of a job is very similar to the lifecycle of a workload in Kueue.
5+
6+
The remainder of this document uses the terms _job_ and _workload_ interchangeably.
7+
8+
A workload goes through roughly three phases after its submission: _queueing and scheduling_, _execution_, and _completion_.
9+
10+
### Queueing and scheduling
11+
12+
After its submission, a workload is in the `Submitted` state, where it competes with other workloads for available resource quotas.
13+
Once it is admitted to a cluster queue, it enters the `Pending` state, where Kueue will reserve a quota for it.
14+
Alternatively, if the local or cluster queue selected for the workload is stopped or does not exist, the workload enters the `Inadmissible` state until this condition is resolved.
15+
16+
### Execution
17+
18+
After all admission checks for the workload have passed, it enters the `Admitted` state and is now eligible for execution by the cluster.
19+
20+
### Completion
21+
22+
When the workload terminates successfully, it enters the terminal `Succeeded` state.
23+
If any unrecoverable error occurs during execution, the workload enters the terminal `Failed` state. This does not necessarily happen on the first abnormal termination of a pod, depending on the type of workload and other factors (such as the retry limit in a `batch/v1/Job`).
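
To illustrate why a single pod failure does not necessarily fail the workload, here is a deliberately simplified model of the retry limit of a `batch/v1/Job` (its `backoffLimit` field, which defaults to 6 in Kubernetes). The function `job_has_failed` is a hypothetical helper for this example. It assumes pods that are not restarted in place and ignores pod failure policies, deadlines, and other factors that the real Job controller takes into account.

```python
def job_has_failed(failed_pods: int, backoff_limit: int = 6) -> bool:
    """Simplified model: the Job fails once pod failures exceed the retry limit."""
    if failed_pods < 0 or backoff_limit < 0:
        raise ValueError("counts must be non-negative")
    return failed_pods > backoff_limit


# One failed pod with the default limit: the Job is retried, not failed.
after_first_failure = job_has_failed(failed_pods=1)

# With a limit of zero retries, the first failure is already final.
no_retries = job_has_failed(failed_pods=1, backoff_limit=0)
```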
24+
25+
A currently executing workload may be preempted by another workload (e.g., by a newly submitted workload with a higher priority).
26+
In this case, Kueue will terminate any pods associated with the preempted workload and either requeue it for later execution or evict it from the cluster queue.
27+
28+
## State Diagram
29+
30+
```mermaid
31+
stateDiagram-v2
32+
direction LR
33+
[*] --> Submitted
34+
35+
Submitted --> Pending: quotaReserved
36+
Submitted --> Inadmissible
37+
Inadmissible --> Submitted
38+
39+
Pending --> Admitted: admitted
40+
41+
Admitted --> Succeeded: success
42+
Admitted --> Failed: error
43+
44+
Admitted --> Submitted: evicted
45+
Admitted --> Pending: requeued
46+
Succeeded --> [*]
47+
Failed --> [*]
48+
```
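
The state diagram can also be written down as a transition table, which is handy for checking whether an observed sequence of states is consistent with it. The sketch below encodes exactly the edges drawn in the diagram; the names `TRANSITIONS` and `is_valid_path` are made up for this example and are not part of jobq.

```python
from typing import Dict, Sequence, Set

# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS: Dict[str, Set[str]] = {
    "Submitted": {"Pending", "Inadmissible"},
    "Inadmissible": {"Submitted"},
    "Pending": {"Admitted"},
    "Admitted": {"Succeeded", "Failed", "Submitted", "Pending"},
    "Succeeded": set(),  # terminal
    "Failed": set(),  # terminal
}


def is_valid_path(states: Sequence[str]) -> bool:
    """Check that a sequence starts at Submitted and only follows allowed edges."""
    if not states or states[0] != "Submitted":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


happy_path = is_valid_path(["Submitted", "Pending", "Admitted", "Succeeded"])
requeued_path = is_valid_path(
    ["Submitted", "Pending", "Admitted", "Pending", "Admitted", "Failed"]
)
skipped_admission = is_valid_path(["Submitted", "Admitted"])
```

In this encoding, `happy_path` and `requeued_path` are valid, while `skipped_admission` is not, because a workload cannot go from `Submitted` to `Admitted` without passing through `Pending`.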

client/docs/index.md

+34-2
@@ -1,3 +1,35 @@
1-
# Jobq, a cluster workflow scheduling tool
1+
---
2+
title: Home
3+
---

3-
This documentation is work in progress.
5+
#
6+
7+
!!! warning "Work in progress"
8+
9+
This documentation is work in progress.
10+
Please excuse frequent changes and missing content.
11+
12+
<div class="grid cards" markdown>
13+
- [:material-thought-bubble:{ .lg .middle } **Concepts**](concepts/_index.md)
14+
15+
***
16+
17+
Learn about the concepts behind jobq
18+
19+
- [:material-apple-keyboard-command:{ .lg .middle } **Command-line interface**](cli.md)
20+
21+
***
22+
23+
Learn how to use the `jobq` command-line interface
24+
25+
- [:material-book-open-variant:{ .lg .middle } **API Reference**](reference/SUMMARY.md)
26+
27+
***
28+
29+
Detailed documentation of the jobq Python API
30+
31+
</div>
32+
33+
<hr />
34+
35+
This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0){: target="\_blank" }.
