HugePages proposal #837
Conversation
[Allocatale] = [Node Capacity] -
[Kube-Reserved] -
[System-Reserved] -
[Pre-Allocated-HugePages * HugePageSize]
I think there's a minus missing at the end of this line
I may choose to use explicit huge page sizes after more reflection. Want to experiment with a prototype some today. Difference between 2MB and 1Gi pages is significant enough that I would want to quota them separately. The empty dir topic would then become a separate volume type with pageSize option.
Few questions:
Force-pushed from da0c7e3 to 9c41fae (Compare)
mostly nits, nothing to hold up the proposal over. lgtm.
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
nit: s/memory management/address translation/
also nit: s/the hardware is fixed/the size of the TLB is fixed/
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software based
memory management. This results in performance issues. Since the hardware is
fixed, the only way to alleviate the performance impact of a TLB miss is to
s/alleviate the performance impact/reduce the chance/
A huge page is a memory page that is larger than 4KB. On x86_64 architectures,
there are two huge page sizes: 2MB and 1GB. Sizes vary on other architectures,
but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
nit: s/attempt/attempts/
but the idea is the same. In order to use huge pages, application must write
code that is aware of them. Transparent huge pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have
limitations. In particular, they are limited to 2MB page sizes.
Also, dynamic splitting of THPs into standard pages during certain conditions can result in undesirable application stalls.
A system may support multiple huge page sizes. It is assumed that most
nodes will be configured to primarily use the default huge page size as
returned via `grep Hugepagesize /proc/meminfo`. This defaults to 2MB on
most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
nit s/overriden/overridden/
most Linux systems unless overriden by `default_hugepagesz=1g` in kernel
boot parameters. This design does not limit the ability to use multiple
huge page sizes at the same time, but we expect this to not be the normal
setup. For each supported huge page size, the node will advertise a
nit: not sure if we need this sentence (starting with "This design..") i found it confusing i.e. you can, but maybe you shouldn't because... reasons?
I have been told that mixed use of 2MB and 1GB pages is possible on a node.
Don't think I've seen it. Mostly because the use-cases for 1GB pages make co-locating other apps that use 2MB pages on the same hardware architecturally undesirable (from a deployment standpoint).
For the record, yes, mixing is possible for the three ways to explicitly use huge pages (a minimal mmap sketch follows the list below).
https://gist.github.com/sjenning/b6bed5bf029c9fd6f078f76b37f0a73f#consuming-hugepages
hugetlbfs - pagesize mount option
mmap() - MAP_HUGE_2MB, MAP_HUGE_1GB
shmget() - SHM_HUGE_2MB, SHM_HUGE_1GB
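For reference, a minimal sketch of the `mmap()` route above, written in Go (the project's language), assuming Linux/amd64 and at least one pre-allocated 2MB huge page on the node. The `mapHugeShift` and 2MB-size constants are written by hand here because Go's `syscall` package does not export `MAP_HUGE_2MB`; this is an illustration, not code from the proposal.

```go
// Sketch: map one anonymous 2MB huge page via mmap(MAP_HUGETLB).
// The explicit page size is encoded as log2(size) << MAP_HUGE_SHIFT.
package main

import (
	"fmt"
	"syscall"
)

const (
	mapHugeShift = 26                  // MAP_HUGE_SHIFT from <linux/mman.h>
	mapHuge2MB   = 21 << mapHugeShift  // log2(2MiB) == 21
	size2MB      = 2 << 20
)

func main() {
	buf, err := syscall.Mmap(-1, 0, size2MB,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS|syscall.MAP_HUGETLB|mapHuge2MB)
	if err != nil {
		fmt.Println("mmap failed (are 2MB huge pages pre-allocated?):", err)
		return
	}
	defer syscall.Munmap(buf)

	buf[0] = 42 // touching the mapping faults in the huge page
	fmt.Println("mapped", len(buf), "bytes backed by a 2MB huge page")
}
```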
removed to avoid confusion.
## Pod Specification

A pod must make a request to consume pre-allocated huge pages using the resource
`hugepages.<hugepagesize>` whose quantity is a non-negative number of pages. The
s/non-negative/positive integer/
request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
the `BestEffort` QoS class.
not sure if we need this sentence. If we do need it, i'd go with
An application that consumes `hugepages.<hugepagesize>` is, by definition, not in the `BestEffort` QoS class as it specifies a request for one or more resources.
+1 for rewording
Idea at the time was that not all resources impact QoS (i.e. OIR) and placement in the cgroup hierarchy, but this one would. Will reword since it was confusing.
fixed
`pageSize` option that must be provided by users. The `pageSize` option aligns
with the `pagesize` argument in the above mount. It is used to specify the huge
page size and pool that is used for accounting. To write into this volume, the
pod must make a request for huge pages that match this `pagesize`, otherwise its
s/pod/container/ ?
volumes are per-pod
i guess i was thinking requests were per container. i can read it both ways now.
To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the
`hugetlb` cgroup must be mounted.

The QoS level cgroups are left unbounded across all huge page pool sizes.
do we want to bound at the kubepods (node) level in case system services are using hugepages?
added.
cc @PiotrProkop
setup. For each supported huge page size, the node will advertise a
resource of the form `hugepages.<hugepagesize>`. For example, if a node
supports 2MB huge pages, a resource `hugepages.2MB` will be shown in node
capacity and allocatable values.
Does allocatable == capacity in all cases for hugepages resources, or are node-allocatable enhancements also in scope? The text describes how allocatable memory is affected by hugepages, but not how the operator could set aside pre-allocated hugepages that are not available for user pods.
Good point. I will add text. I see no reason why a reservation would be a problem at first thought.
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatale] = [Node Capacity] -
s/Allocatale/Allocatable
Do proposals usually link to the feature (here)?
### NUMA

NUMA is complicated. To support NUMA, the node must support cpu pinning,
Agree with waiting to solve NUMA. Could we add a plan for how this will work as a pre-requisite for beta? I have some concerns related to non-fungibility of pages from separate NUMA nodes.
how will the plan differ from normal memory?
¯\_(ツ)_/¯
IIUIC then the smallest granularity level of the scheduler so far is the node level - Do I recall this correctly?
For NUMA however, the granularity level would be intra-node locality - thus the locality of NUMA nodes and memory.
@fabiand - we anticipate the scheduler to only maintain visibility at the node level. we anticipate the kubelet to take responsibility for intra-node locality, and that will be a factor of QoS.
Aside from small comments above, LGTM.
By the way, huge +1 for this proposal and feature.
volumes whose usage is ultimately constrained by the pod cgroup sandbox memory
settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` paramter, but this
s/paramter/parameter
Force-pushed from 9c41fae to d78d868 (Compare)
Updated based on feedback.
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
Why is this necessary? Doesn't node capacity already get reduced whenever static huge pages are created? I thought `/proc/meminfo` reduces total memory available in the presence of static huge pages.
Edit: Can Capacity be `MemTotal - (HugePages_Total * Hugepagesize)`?
it does not. `MemTotal` is fixed in `/proc/meminfo` even after allocating hugepages.
@vishh - i think adjusting [Node Capacity] as described in your edited comment introduces far more confusion when you layer in other monitoring tools. The machine's memory capacity is [Node Capacity], the operator has reserved a portion of that capacity for use in huge pages.
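To illustrate the accounting being discussed, here is a sketch (not kubelet code) under the assumption described above: capacity[memory] stays at `MemTotal`, and allocatable subtracts any kube/system reservations plus the pre-allocated huge page pool, with all values read from `/proc/meminfo`. The reservation constants are placeholders.

```go
// Sketch only: allocatable = MemTotal - kubeReserved - systemReserved
//                            - (HugePages_Total * Hugepagesize)
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// meminfo parses /proc/meminfo into a map of field name -> bytes (or count).
func meminfo() (map[string]uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	vals := map[string]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like "MemTotal: 16314480 kB" or "HugePages_Total: 512".
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		if len(fields) == 3 && fields[2] == "kB" {
			n *= 1024 // normalize to bytes
		}
		vals[strings.TrimSuffix(fields[0], ":")] = n
	}
	return vals, sc.Err()
}

func main() {
	const kubeReserved, systemReserved = 0, 0 // placeholders for operator config
	m, err := meminfo()
	if err != nil {
		panic(err)
	}
	hugeBytes := m["HugePages_Total"] * m["Hugepagesize"]
	allocatable := m["MemTotal"] - kubeReserved - systemReserved - hugeBytes
	fmt.Printf("capacity=%d allocatable=%d (hugepages reserve %d)\n",
		m["MemTotal"], allocatable, hugeBytes)
}
```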
status:
  capacity:
    memory: 10Gi
    hugepages.2MB: 512
Can we avoid exposing size in the API for now? In the future, we do plan on supporting attributes for resources. Size can be an attribute then. Until attributes are available, would node labels suffice?
What would be the semantics of the request if size were not in the name? It seems to me there are two ways to do that:
1. Use `bytes` as the resource quantity unit instead of `pages`. This may require requests to be quantized to a multiple of available page size. What if the page sizes are heterogeneous across the set of nodes?
2. Continue to use `pages` as the resource quantity unit, and you just get the node's default page size whatever that may be (if you care, influence the scheduler with node label affinity.)

(2) above seems better than (1) to me. However neither option significantly reduces changes that pod authors would have to make once resource attributes are available. Caveat: this is without having reviewed a concrete proposal for resource attributes.
(2) doesn't satisfy the quota concerns. It makes quota on pages useless. This is why I settled on the name proposed.
A better compute resource API can handle quota as well. Can the quota requirement be deferred?
Vish asked me to take a look. Nothing terribly upsetting here.
There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to better
support quota. For example, 1 huge page with size 2MB is an order of magnitude
3 orders of magnitude, actually
`hugepages.<hugepagesize>` whose quantity is a positive number of pages. The
request and limit for `hugepages.<hugepagesize>` must match until such time the
node provides more reliability in handling overcommit (potentially via
eviction). An application that consumes `hugepages.<hugepagesize>` is not in
WRT class, are you saying that using hugepages gets me an automatic priority boost? Why?
I think this is more: by definition, an application that consumes hugepages is not in the BestEffort QoS class because it must have a resource request for hugepages, making it at least Burstable (Guaranteed if all resources are specified and requests == limits)
basically what seth said.
min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends a new volume type `hugePages`. It will support a single
better as a new volume or as a twist on emptyDir, `medium=HugePages, hugePagesConfig: { pageSize: 2Mi }`?
+1 for a twist of EmptyDir. We already deal with local storage that is split between EmptyDir (volumes abstraction), container logs and overlay filesystems.
From a user perspective, probably no material difference. If EmptyDir is easier for the project to manage then 👍
I am fine with emptyDir. The rationale at the time was code complexity, but it's probably a simpler API to keep with emptyDir.
+1 for EmptyDir as the only memory backed volume in k8s.
```

The proposal recommends a new volume type `hugePages`. It will support a single
`pageSize` option that must be provided by users. The `pageSize` option aligns
does it matter that resource.Quantity is spelled `2Mi` but the resource name above is `2MB`?
@thockin - at time of writing, i had mapped pageSize to align with the OCI runtime-spec pageSize field for huge page limits, so it didn't need to be a literal resource.Quantity.
see: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
it seems more common to use decimal rather than binary units when describing the pageSize, but i think it should be fine if we want to support binary units only.
looking at this closer in runc, supported hugepage sizes are determined here:
https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/utils.go#L408
https://github.com/opencontainers/runc/blob/master/vendor/github.com/docker/go-units/size.go#L73
it appears to ultimately be parsing the representation returned from the `sys/kernel/mm/hugepages/hugepages-{size}kB` directory in kibibytes, so binary notation in the kubernetes representation should be fine...
actually, this is useful reference:
https://en.wikipedia.org/wiki/Page_%28computer_memory%29#Page_size_trade-off
the actual page sizes are binary notation. even reading 2MByte on intel chipset was 2^21 and 4MByte was 2^22 as expected. binary format is actually preferred. (see pg 109 https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf for sample reference.)
settings. The `min_size` option is omitted as its not necessary. It's possible
we could have extended `emptyDir` to support an additional option for
`medium=Memory` but extend it to support a `pageSize` parameter, but this
proposal prefers to create an additional volume type instead.
Care to explain why? I am open to discussion, but I want to know the rationale here.
If no objection, I am happy to extend emptyDir as described in the earlier comment. The life cycle requirements were the same as memory backed emptyDir.
Force-pushed from d78d868 to cd62813 (Compare)
@thockin @saad-ali @vishh @dchen1107 -- updated to respond to feedback.
The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertient
pertinent?
- name: hugepage
  emptyDir:
    medium: HugePages
    pageSize: "2Mi"
If we treat each hugepage size as a separate resource, that's one thing. But parsing the resource name seems hacky.
Can we instead have emptydir volumes use whatever hugepage resource was specified as part of the container specification and also require all containers to specify the same type of hugepage resource? For the latter it may help having a different resource name like `hugepages.kubernetes.io/2Mi`.
Or the other option is to have `medium` be the hugepage resource type itself (`hugepages.2Mi`). The `memory` medium has a corresponding `memory` compute resource.
Yes - It's probably good to tie the `EmptyDir` to the requested hugepages. Otherwise there is also the option of abusing the mechanism, by requesting one size, but then using a different `pageSize`.
Quite unlikely, but still possible, is that different hugepage sizes are requested, thus the volume needs a reference to the hugepage size. Just saying that a volume can not infer the page size automatically if there is more than one given. This in turn would mean that the volume should reference the resource (thus `hugepages.2Mi` in the example above).
@fabiand - for the initial alpha version of this feature, I am going to update the proposal to restrict the pod from consuming more than one huge page pool size. The volume will then infer the pagesize to use based on the requests.
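A hypothetical sketch of that alpha restriction; the `singleHugePageSize` helper is invented for illustration, and the `hugepages-<size>` resource-name prefix follows the spelling used later in this thread, so this is not the actual validation code.

```go
// Hypothetical check: reject pods that request more than one huge page
// pool size, so an emptyDir with medium: HugePages can infer the page
// size from the single hugepages-<size> request.
package main

import (
	"fmt"
	"strings"
)

// requests maps resource name -> quantity, e.g. "hugepages-2Mi": 512 pages.
func singleHugePageSize(requests map[string]int64) (string, error) {
	var sizes []string
	for name := range requests {
		if strings.HasPrefix(name, "hugepages-") {
			sizes = append(sizes, strings.TrimPrefix(name, "hugepages-"))
		}
	}
	switch len(sizes) {
	case 0:
		return "", fmt.Errorf("pod requests no huge pages")
	case 1:
		return sizes[0], nil
	default:
		return "", fmt.Errorf("pod requests multiple huge page sizes: %v", sizes)
	}
}

func main() {
	size, err := singleHugePageSize(map[string]int64{
		"hugepages-2Mi": 512,
		"memory":        1 << 30,
	})
	fmt.Println(size, err) // "2Mi <nil>"
}
```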
The `ResourceQuota` resource will be extended to support accounting for
`hugepages.<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
Burst ratio wouldn't apply to integer resources like hugepages right?
correct.
`vm.huge_tlb_shm_group` sysctl
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
- The user's memlock ulimits must allow the size of the shared memory segment
- `vm.huge_tlb_shm_group` is not namespaced.
How are these issues going to be simplified for users?
i want to evaluate the associated burden based on usage of the feature rather than pre-design.
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications can back the heap with huge pages using the
  `-XX:+UseLargePages` and `-XX:LagePageSizeInBytes` options.
- packet processing systems (DPDK)
Virtual Machines (qemu/kvm/libvirt) can leverage huge pages
`grep Hugepagesize /proc/meminfo`. This defaults to 2MB on most Linux systems
unless overriden by `default_hugepagesz=1g` in kernel boot parameters. For each
supported huge page size, the node will advertise a resource of the form
`hugepages.<hugepagesize>`.
Just to be clear - Resources as in #782 ? Might be worth a link then.
@fabiand - this is unrelated to resource class.
Right - so will it then be advertised as part of the node info?
Supported huge page sizes are determined by parsing the
`sys/kernel/mm/hugepages/hugepages-{size}kB` directory on the host. Kubernetes
will expose the `hugepages.<hugepagesize>` resource using binary notation form.
For example, if a node supports `hugepages-2048kB`, a resource `hugepages.2Mi`
Previously you said the pages will be advertised by using `hugepages.<hugepagesize>` but you did not define the unit of the size - in this line however you implicitly convert `kB` to `Mi` - which is okay, but it's probably worth noting the default unit, to be able to specify the right resources when creating pods.
IOW the proposal should define the unit to use, otherwise it's unclear if you have to request `2048kB` or `2Mi` units.
i will add text describing the size.
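A sketch of that discovery step under the layout quoted above (`/sys/kernel/mm/hugepages/hugepages-<size>kB`); the conversion to binary-notation resource names is illustrative and uses the `hugepages-<size>` spelling that appears later in the thread.

```go
// Sketch: discover supported huge page sizes from
// /sys/kernel/mm/hugepages/hugepages-<size>kB and render them as
// binary-notation resource names such as hugepages-2Mi or hugepages-1Gi.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func hugePageResources() ([]string, error) {
	entries, err := os.ReadDir("/sys/kernel/mm/hugepages")
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		// Directory names look like "hugepages-2048kB".
		kb := strings.TrimSuffix(strings.TrimPrefix(e.Name(), "hugepages-"), "kB")
		sizeKB, err := strconv.ParseUint(kb, 10, 64)
		if err != nil {
			continue
		}
		var quantity string
		switch {
		case sizeKB >= 1<<20: // >= 1 GiB
			quantity = fmt.Sprintf("%dGi", sizeKB>>20)
		case sizeKB >= 1<<10: // >= 1 MiB
			quantity = fmt.Sprintf("%dMi", sizeKB>>10)
		default:
			quantity = fmt.Sprintf("%dKi", sizeKB)
		}
		names = append(names, "hugepages-"+quantity)
	}
	return names, nil
}

func main() {
	names, err := hugePageResources()
	if err != nil {
		fmt.Println("no hugepages support detected:", err)
		return
	}
	fmt.Println(names) // e.g. [hugepages-2Mi hugepages-1Gi]
}
```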
cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. It will be used to determine capacity and
calculate allocatable values on the node.
This reporting is just informative, and not for scheduling, right?
As I'd expect that the resource class mechanism is used to advertise available hugepages.
the reporting is for scheduling. the resource class mechanism is later.
Force-pushed from cd62813 to 07659e2 (Compare)
@vishh @thockin @dchen1107 @saad-ali - updated based on last round of feedback. notable changes:
Force-pushed from 07659e2 to d8673b1 (Compare)
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
Based on the formula and the example you listed below, it looks like you change the semantics of NodeCapacity and NodeAllocatable, which are both declared as ResourceList. Taking your example given below, it is confusing to the user whether the node has total memory 10Gi = 9Gi (memory allocatable) + 1Gi (hugepage-2Mi allocatable) or 11Gi = 10Gi (memory capacity) + 1Gi (hugepage-2Mi).
Isn't this the first instance of a cross-resource data dependency in node status? @derekwaynecarr could you expand on how the monitoring tools are affected by this decision?
this was my thought process at the time:
- most users associate capacity[memory] with `/proc/meminfo` `MemTotal` (i.e. total physical RAM)
- pre-allocating huge pages does not change `MemTotal`, it just reduces `MemAvailable`
- pre-allocating huge pages is an implicit reservation of `memory` for use in a different mode against memory capacity
- i suspect users would be confused if the capacity value for `memory` as reported back by the kubelet differed from what is reported in cAdvisor (or other monitoring tools) and could bubble up through Prometheus, etc. i really feel people will be confused as most monitoring tools will read `MemTotal`.

i don't think users should expect to add numbers with two different units when evaluating capacity.
if we wanted the operator to explicitly account for pre-allocated hugepages in either system-reserved or kube-reserved, i suspect that causes more trouble when we actually enforce-node-allocatable for system and kube cgroups as that memory is not actually available (its been reserved for hugepage usage only).
Ok, I agreed with you on this.
Unfortunately not all kernels have MemAvailable in /proc/meminfo.
The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests `hugepages-<hugepagesize>` resource is at
minimum in the `Burstable` QoS class.
I thought we don't want to overcommit hugepages, at least for the 1.8 release. If that is true, shouldn't we only support the Guaranteed QoS class for hugepages? I understand that QoS is computed across all resources and all containers, which makes any pod requiring hugepages at least Burstable, but I want to make sure that is what you are referring to here.
we do not want to overcommit hugepages. the request must always match the limit for hugepages. i should be able to have the request != limit for cpu and memory resource requirements on any pod that consumes hugepages. i was trying to assert that hugepage requests impact qos (similar to cpu and memory).
sgtm
If a pod consumes huge pages via `shmget`, it must run with a supplemental group
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
this group is outside the scope of this specification.
Agreed that how to configure such supplemental group is outside the scope, but can you have an example on how a user pod to request huge pages via shmget?
i will add an example of a pod that runs a jvm backed by huge pages.
fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future with
community presented use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via volumes.
Can we make the node specification with multiple huge page sizes invalid at the initial node capacity discovery level?
i guess that is fine. the kubelet could fail startup if it detects multiple huge page sizes have been pre-allocated.
/lgtm We can address several comments in separate prs. Thanks!
Automatic merge from submit-queue
Automatic merge from submit-queue (batch tested with PRs 50893, 50913, 50963, 50629, 50640) bump(github.com/google/cadvisor): 27e1acbb4ef0fe1889208b21f8f4a6d0863e02f6 Bump cadvisor dep to include google/cadvisor@556c7b1. This is required for supporting huge pages for 1.8 kubernetes/community#837 kubernetes/enhancements#275 @dashpole @derekwaynecarr @eparis @jeremyeder
…ture Automatic merge from submit-queue HugePages feature **What this PR does / why we need it**: Implements HugePages support per kubernetes/community#837 Feature track issue: kubernetes/enhancements#275 **Special notes for your reviewer**: A follow-on PR is opened to add the EmptyDir support. **Release note**: ```release-note Alpha support for pre-allocated hugepages ```
Automatic merge from submit-queue HugePages proposal A pod may request a number of huge pages. The `scheduler` is able to place the pod on a node that can satisfy that request. The `kubelet` advertises an allocatable number of huge pages to support scheduling decisions. A pod may consume hugepages via `hugetlbfs` or `shmget`. Planned as Alpha for Kubernetes 1.8 release. See feature: kubernetes/enhancements#275 @kubernetes/sig-scheduling-feature-requests @kubernetes/sig-scheduling-misc @kubernetes/sig-node-proposals @kubernetes/api-approvers @kubernetes/api-reviewers