# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress
## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.
## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4KB. 1MB
of memory is equal to 256 pages; 1GB of memory is 262,144 pages, etc. CPUs have
a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
memory management. This results in performance issues. Since the number of TLB
entries is fixed in hardware, the only way to reduce the chance of a TLB miss
is to increase the page size.

A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, applications must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.

Managing memory is hard, and unfortunately, there is no one-size-fits-all
solution for all applications. THP can degrade performance on nodes with high
memory utilization or fragmentation because its defragmentation efforts can
lock memory pages. For this reason, some applications are designed to use (or
recommend using) pre-allocated huge pages instead of THP.
## Scope

This proposal only includes pre-allocated huge pages configured on the node by
the administrator at boot time or by manual dynamic allocation. It does not
discuss how the cluster could dynamically allocate huge pages in order to find
a fit for a pod pending scheduling. It is anticipated that operators may use a
variety of strategies to allocate huge pages, but we do not anticipate the
kubelet itself doing the allocation. Allocation of huge pages ideally happens
soon after boot time.

A simple solution via a `DaemonSet` is presented here:
https://github.com/derekwaynecarr/hugepages/tree/master/allocator

This proposal defers issues relating to NUMA.
## Use Cases

The class of applications that benefit from huge pages typically have
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- Java applications can back the heap with huge pages using the
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options (see the example
below).
- database management systems (MySQL, PostgreSQL, MongoDB, etc.)
- packet processing systems (DPDK)
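
As an illustration of the JVM options above, a Java application might be
launched as follows; the heap size and application jar are placeholders, and
the node must have enough pre-allocated 2MB huge pages to back the heap:

```
java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -Xms2g -Xmx2g -jar app.jar
```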

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and using it as anonymous memory
- `mmap()` a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB` and using it as a shared memory segment (see
Known Issues).

1. A pod can use huge pages with any of the previously described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.
## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
components pending graduation to Beta.
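
For example, assuming the kubelet, API server, and scheduler are the pertinent
components, each would be started with the gate enabled (other flags elided):

```
kubelet --feature-gates=HugePages=true ...
kube-apiserver --feature-gates=HugePages=true ...
kube-scheduler --feature-gates=HugePages=true ...
```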
## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overridden by `default_hugepagesz=1g` in kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we expect this not to be the normal
setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in node
capacity and allocatable values.

There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to
better support quota. For example, 1 huge page with size 2MB is orders of
magnitude different than 1 huge page with size 1GB. We assume gigantic
pages are even more precious resources than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similarly to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Pre-Allocated-HugePages * HugePageSize] -
                [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory. 1Gi of memory has been
reserved as 512 pre-allocated huge pages sized 2MB. As you can see, the
allocatable memory has been reduced to account for the amount of huge pages
reserved.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
  allocatable:
    memory: 9Gi
    hugepages.2MB: 512
...
```
## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages.<hugepagesize>` whose quantity is a non-negative number of pages. The
request and limit for `hugepages.<hugepagesize>` must match until such time as
the node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
the `BestEffort` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
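
For illustration, a minimal sketch of such a pod follows; the group ID and
image are placeholders, and the group must match the value configured in
`/proc/sys/vm/hugetlb_shm_group` on the node:

```
apiVersion: v1
kind: Pod
metadata:
  name: hugepage-shm-example
spec:
  securityContext:
    # Illustrative gid; must match /proc/sys/vm/hugetlb_shm_group on the node.
    supplementalGroups: [1001]
  containers:
  - name: app
    image: example.com/app:latest  # placeholder image
    resources:
      requests:
        hugepages.2MB: "512"
      limits:
        hugepages.2MB: "512"
```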
In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
options used with `hugetlbfs`. The `pagesize` option is used to specify the
huge page size and pool that is used for accounting. The pod must make a
request for huge pages that match this `pagesize`. The `size` option defaults
to the sum of `hugepages.<hugepagesize>` requests made by each container for
the specified `pagesize`. This keeps the behavior consistent with memory backed
`emptyDir` volumes whose usage is ultimately constrained by the pod cgroup
sandbox memory settings. The `min_size` option is omitted. It's possible we
could have extended `emptyDir` to support an additional option for
`medium=Memory` and `pageSize`, but this proposal prefers to create an
additional volume type instead. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
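
For reference, assuming a pod that requests 512 huge pages of size 2MB, the
volume would be backed by a `hugetlbfs` mount roughly equivalent to the
following (the mount point and exact invocation are illustrative; 1G equals
512 * 2MB):

```
mount -t hugetlbfs -o pagesize=2M,size=1G none /hugepages
```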
The following is a sample pod that is limited to 512 huge pages of size 2MB. It
can consume those pages using `shmget()` or via `mmap()` with the specified
volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  ...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages.2MB: "512"
      limits:
        hugepages.2MB: "512"
  volumes:
  - name: hugepage
    hugePages:
      pageSize: "2MB"
```
## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec.

See:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
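
For reference, the corresponding runtime-spec fragment under `linux.resources`
looks like the following; the limit is in bytes, and the value shown matches
512 huge pages of size 2MB:

```
"hugepageLimits": [
    {
        "pageSize": "2MB",
        "limit": 1073741824
    }
]
```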
The CRI changes are required before promoting this feature to Beta.
## Cgroup Enforcement

To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition,
the `hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.

The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
the system default huge page size. If no request is made for huge pages of
a particular size, the limit is set to 0 for all supported types on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages] * hugepagesize)
```
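
As a worked example, for the sample pod above (512 huge pages of size 2MB) on a
node that also supports 1GB huge pages, the pod level cgroup sandbox would be
configured roughly as follows:

```
pod<UID>/hugetlb.2MB.limit_in_bytes = 512 * 2MB = 1073741824
pod<UID>/hugetlb.1GB.limit_in_bytes = 0
```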
If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.
## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similarly to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similarly to `cpu` and `memory`.
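
A sketch of what these extensions could look like follows; the proposal does
not finalize the exact fields, so the object names and values below are
illustrative:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepage-quota
spec:
  hard:
    hugepages.2MB: "1024"    # at most 1024 2MB huge pages in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepage-limits
spec:
  limits:
  - type: Container
    min:
      hugepages.2MB: "128"
    max:
      hugepages.2MB: "512"
```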
## Scheduler changes

The scheduler will need to ensure that any huge page request defined in the pod
spec can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. This information will be used to determine
capacity and to calculate allocatable values on the node.
## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the
`vm.hugetlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.hugetlb_shm_group` is not namespaced.
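
A sketch of the node-level checks implied by this list; these are standard
sysctls and ulimits configured outside of Kubernetes, and the values returned
are node-specific:

```
sysctl vm.hugetlb_shm_group   # group allowed to use SHM_HUGETLB segments
sysctl kernel.shmmax          # maximum shared memory segment size, in bytes
ulimit -l                     # memlock limit for the current user
```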
### NUMA

NUMA is complicated. To support NUMA, the node must support cpu pinning,
devices, and memory locality. Extending that requirement to huge pages is not
much different. It is anticipated that the `kubelet` will provide future NUMA
locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.