# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress
## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.
## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4KB. 1MB
of memory is equal to 256 pages; 1GB of memory is 262,144 pages, etc. CPUs have
a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
memory management. This results in performance issues. Since the number of TLB
entries is fixed in hardware, the only way to reduce the chance of a TLB miss
is to increase the page size.

A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, applications must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.

Managing memory is hard, and unfortunately, there is no one-size-fits-all
solution for all applications. THP can degrade performance on nodes with high
memory utilization or fragmentation because its defragmentation efforts can
lock memory pages. For this reason, some applications are designed to use (or
recommend using) pre-allocated huge pages instead of THP.
## Scope

This proposal only includes pre-allocated huge pages configured on the node by
the administrator at boot time or by manual dynamic allocation. It does not
discuss how the cluster could dynamically allocate huge pages in order to find
a fit for a pod pending scheduling. It is anticipated that operators may use a
variety of strategies to allocate huge pages, but we do not anticipate the
kubelet itself doing the allocation. Allocation of huge pages ideally happens
soon after boot time.

A simple solution via a `DaemonSet` is presented here:
https://github.com/derekwaynecarr/hugepages/tree/master/allocator

This proposal defers issues relating to NUMA.
## Use Cases

The class of applications that benefit from huge pages typically have
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options (see the example
below).
- database management systems (MySQL, PostgreSQL, MongoDB, etc.)
- packet processing systems (DPDK)
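
As an illustration of the JVM options above, a Java application might be
launched as follows; the heap size and application jar are placeholders, and
the node must have enough pre-allocated 2MB huge pages to back the heap:

```
java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -Xms2g -Xmx2g -jar app.jar
```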

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and using it as anonymous memory
- `mmap()` a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB` and using it as a shared memory segment (see
Known Issues).

1. A pod can use huge pages with any of the previously described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.
## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
components pending graduation to Beta.
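
For example, assuming the kubelet, API server, and scheduler are the pertinent
components, each would be started with the gate enabled (other flags elided):

```
kubelet --feature-gates=HugePages=true ...
kube-apiserver --feature-gates=HugePages=true ...
kube-scheduler --feature-gates=HugePages=true ...
```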
## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overridden by `default_hugepagesz=1g` in kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we expect this not to be the normal
setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in node
capacity and allocatable values.

There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to
better support quota. For example, 1 huge page with size 2MB is orders of
magnitude different than 1 huge page with size 1GB. We assume gigantic
pages are even more precious resources than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similarly to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize] -
                [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory. 1Gi of memory has been
reserved as 512 pre-allocated huge pages sized 2MB. As you can see, the
allocatable memory has been reduced to account for the amount of huge pages
reserved.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
  allocatable:
    memory: 9Gi
    hugepages.2MB: 512
...
```
## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages.<hugepagesize>` whose quantity is a non-negative number of pages. The
request and limit for `hugepages.<hugepagesize>` must match until such time as
the node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
the `BestEffort` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
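
For illustration, a minimal sketch of such a pod follows; the group ID and
image are placeholders, and the group must match the value configured in
`/proc/sys/vm/hugetlb_shm_group` on the node:

```
apiVersion: v1
kind: Pod
metadata:
  name: hugepage-shm-example
spec:
  securityContext:
    # Illustrative gid; must match /proc/sys/vm/hugetlb_shm_group on the node.
    supplementalGroups: [1001]
  containers:
  - name: app
    image: example.com/app:latest  # placeholder image
    resources:
      requests:
        hugepages.2MB: "512"
      limits:
        hugepages.2MB: "512"
```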
In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
options used with `hugetlbfs`. The `pagesize` option is used to specify the
huge page size and pool that is used for accounting. The pod must make a
request for huge pages that match this `pagesize`. The `size` option defaults
to the sum of `hugepages.<hugepagesize>` requests made by each container for
the specified `pagesize`. This keeps the behavior consistent with memory backed
`emptyDir` volumes whose usage is ultimately constrained by the pod cgroup
sandbox memory settings. The `min_size` option is omitted. It's possible we
could have extended `emptyDir` to support an additional option for
`medium=Memory` and `pageSize`, but this proposal prefers to create an
additional volume type instead. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
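
For reference, assuming a pod that requests 512 huge pages of size 2MB, the
volume would be backed by a `hugetlbfs` mount roughly equivalent to the
following (the mount point and exact invocation are illustrative; 1G equals
512 * 2MB):

```
mount -t hugetlbfs -o pagesize=2M,size=1G none /hugepages
```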
The following is a sample pod that is limited to 512 huge pages of size 2MB. It
can consume those pages using `shmget()` or via `mmap()` with the specified
volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  ...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages.2MB: "512"
      limits:
        hugepages.2MB: "512"
  volumes:
  - name: hugepage
    hugePages:
      pageSize: "2MB"
```
## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec.

See:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
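
For reference, the corresponding runtime-spec fragment under `linux.resources`
looks like the following; the limit is in bytes, and the value shown matches
512 huge pages of size 2MB:

```
"hugepageLimits": [
    {
        "pageSize": "2MB",
        "limit": 1073741824
    }
]
```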
The CRI changes are required before promoting this feature to Beta.
## Cgroup Enforcement

To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition,
the `hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.

The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
the system default huge page size. If no request is made for huge pages of
a particular size, the limit is set to 0 for all supported types on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages] * hugepagesize)
```
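
As a worked example, for the sample pod above (512 huge pages of size 2MB) on a
node that also supports 1GB huge pages, the pod level cgroup sandbox would be
configured roughly as follows:

```
pod<UID>/hugetlb.2MB.limit_in_bytes = 512 * 2MB = 1073741824
pod<UID>/hugetlb.1GB.limit_in_bytes = 0
```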
If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.
## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similarly to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similarly to `cpu` and `memory`.
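
A sketch of what these extensions could look like follows; the proposal does
not finalize the exact fields, so the object names and values below are
illustrative:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepage-quota
spec:
  hard:
    hugepages.2MB: "1024"    # at most 1024 2MB huge pages in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepage-limits
spec:
  limits:
  - type: Container
    min:
      hugepages.2MB: "128"
    max:
      hugepages.2MB: "512"
```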
## Scheduler changes

The scheduler will need to ensure that any huge page request defined in the pod
spec can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. This information will be used to determine
capacity and to calculate allocatable values on the node.
## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the
`vm.hugetlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.hugetlb_shm_group` is not namespaced.
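
A sketch of the node-level checks implied by this list; these are standard
sysctls and ulimits configured outside of Kubernetes, and the values returned
are node-specific:

```
sysctl vm.hugetlb_shm_group   # group allowed to use SHM_HUGETLB segments
sysctl kernel.shmmax          # maximum shared memory segment size, in bytes
ulimit -l                     # memlock limit for the current user
```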
### NUMA

NUMA is complicated. To support NUMA, the node must support cpu pinning,
devices, and memory locality. Extending that requirement to huge pages is not
much different. It is anticipated that the `kubelet` will provide future NUMA
locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.