# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress

## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume huge pages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.

## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4KB. 1MB
of memory is equal to 256 pages; 1GB of memory is 262,144 pages, and so on. CPUs
have a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
memory management, which hurts performance. Since the size of the TLB is fixed
in hardware, the only way to reduce the chance of a TLB miss is to increase the
page size.

A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, an application must be
written with awareness of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.

Managing memory is hard, and unfortunately, there is no one-size-fits-all
solution for all applications. THP might lead to performance degradation on
nodes with high memory utilization or fragmentation due to the defragmenting
efforts of THP, which can lock memory pages. For this reason, some applications
are designed to use (or recommend using) pre-allocated huge pages instead of
THP.

## Scope

This proposal only includes pre-allocated huge pages configured on the node by
the administrator, either at boot time or by manual dynamic allocation. It does
not discuss how the cluster could dynamically allocate huge pages to find a fit
for a pod pending scheduling. Operators are expected to use a variety of
strategies to allocate huge pages, but we do not anticipate the kubelet itself
doing the allocation. Allocation of huge pages ideally happens soon after boot,
before physical memory becomes fragmented.

A simple solution via a `DaemonSet` is presented here:
https://github.com/derekwaynecarr/hugepages/tree/master/allocator
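
For illustration only, a minimal sketch of such an allocator `DaemonSet` (not
the linked implementation; names and the page count are placeholder values)
could write the desired pool size to `/proc/sys/vm/nr_hugepages` from a
privileged container:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: hugepages-allocator
spec:
  template:
    metadata:
      labels:
        name: hugepages-allocator
    spec:
      containers:
      - name: allocator
        image: busybox
        securityContext:
          # privileged access is needed to write to the non-namespaced sysctl
          privileged: true
        command: ["sh", "-c"]
        args:
        # reserve 512 default-sized huge pages, then idle so the pod keeps running
        - "echo 512 > /proc/sys/vm/nr_hugepages && while true; do sleep 3600; done"
```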

This proposal defers issues relating to NUMA.

## Use Cases

The class of applications that benefit from huge pages typically have
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options.
- database management systems (MySQL, PostgreSQL, MongoDB, etc.)
- packet processing systems (DPDK)

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB`, using the mapping as anonymous memory
- `mmap()` on a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB`, using the result as a shared memory segment (see
Known Issues).

1. A pod can use huge pages with any of the prior described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.

## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on the
pertinent components until it graduates to Beta.

## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overridden by `default_hugepagesz=1g` in the kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we do not expect that to be the typical
setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in the node's
capacity and allocatable values.

There are a variety of huge page sizes supported across different hardware
architectures. A resource per size is preferred in order to better support
quota. For example, a single 2MB huge page and a single 1GB huge page differ in
size by a factor of 512, so they should not be drawn from the same quota. We
assume gigantic pages are an even more precious resource than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similarly to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize] -
                [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory, where 1Gi has been
reserved as 512 pre-allocated huge pages of size 2MB. As you can see, the
allocatable memory has been reduced to account for the memory reserved as huge
pages.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
  allocatable:
    memory: 9Gi
    hugepages.2MB: 512
...
```

## Pod Specification

A pod must request pre-allocated huge pages using the resource
`hugepages.<hugepagesize>`, whose quantity is a non-negative number of pages.
The request and limit for `hugepages.<hugepagesize>` must match until such time
as the node provides more reliable handling of overcommit (potentially via
eviction). A pod that consumes `hugepages.<hugepagesize>` is not in the
`BestEffort` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
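
For example, assuming the node's `/proc/sys/vm/hugetlb_shm_group` is set to gid
`3000` (an illustrative value), the pod spec fragment would declare that gid as
a supplemental group:

```yaml
spec:
  securityContext:
    # must match the gid configured in /proc/sys/vm/hugetlb_shm_group on the node
    supplementalGroups: [3000]
```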

In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
options used with `hugetlbfs`. The `pagesize` option is used to specify the
huge page size and pool that is used for accounting. The pod must make a
request for huge pages that matches this `pagesize`. The `size` option defaults
to the sum of the `hugepages.<hugepagesize>` requests made by each container for
the specified `pagesize`.
`emptyDir` volumes whose usage is ultimately constrained by the pod cgroup
sandbox memory settings. The `min_size` option is omitted. It's possible we
could have extended `emptyDir` to support an additional option for
`medium=Memory` and `pageSize`, but this proposal prefers to create an
additional volume type instead. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

The following is a sample pod that is limited to 512 huge pages of size 2MB. It
can consume those pages using `shmget()` or via `mmap()` with the specified
volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages.2MB: "512"
      limits:
        hugepages.2MB: "512"
  volumes:
  - name: hugepage
    hugePages:
      pageSize: "2MB"
```

## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec.

see:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
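
For reference, the runtime-spec expresses huge page limits as a list of page
size and byte limit pairs under `linux.resources`. The sample pod above might
map to something like the following (rendered here as YAML for readability; the
runtime-spec itself uses JSON):

```yaml
hugepageLimits:
- pageSize: "2MB"
  # 512 pages * 2MB, expressed in bytes
  limit: 1073741824
```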

The CRI changes are required before promoting this feature to Beta.

## Cgroup Enforcement

To use this feature, the kubelet's `--cgroups-per-qos` flag must be enabled. In
addition, the `hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.

The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
the system default huge page size. If no request is made for huge pages of a
particular size, the limit for that size is set to 0; this is done for every
huge page size supported on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages] * hugepagesize)
```
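
For example, the sample pod above, which is limited to 512 huge pages of size
2MB, would result in:

```
pod<UID>/hugetlb.2MB.limit_in_bytes = 512 * 2MB = 1073741824
```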

If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.

## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similar to `cpu` and `memory`.
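
The following is a sketch of how that could look, reusing the `hugepages.2MB`
resource name from the node specification above (object names and quantities
are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    # at most 1024 huge pages of size 2MB may be requested in the namespace
    hugepages.2MB: "1024"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limit-range
spec:
  limits:
  - type: Container
    min:
      hugepages.2MB: "2"
    max:
      hugepages.2MB: "512"
```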

## Scheduler changes

The scheduler will need to ensure any huge page request defined in the pod spec
can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. This information will be used to determine the
node's capacity and to calculate its allocatable values.

## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the
`vm.hugetlb_shm_group` sysctl
- The sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.hugetlb_shm_group` is not namespaced.

### NUMA

NUMA is complicated. To support it, the node must account for CPU pinning,
device locality, and memory locality. Extending that requirement to huge pages
is not much different. It is anticipated that the `kubelet` will provide future
NUMA locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.
