Preface
This article is excerpted from the kernel documentation: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
Time was short, so no translation is provided; you can see it is quite different from v1.
CPU
The “cpu” controller regulates distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy.
In all the above models, cycles distribution is defined only on a temporal base and it does not account for the frequency at which tasks are executed. The (optional) utilization clamping support allows hinting the schedutil cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU.
WARNING: cgroup2 doesn’t yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already have placed RT processes into nonroot cgroups during the system boot process, and these processes may need to be moved to the root cgroup before the cpu controller can be enabled.
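As a hedged illustration only: one could enumerate realtime processes with sched_getscheduler(2) and write their PIDs into the root cgroup's cgroup.procs. The mount point is the usual cgroup2 default and running as root is assumed; adjust both to your system.

```python
import os

# SCHED_FIFO and SCHED_RR are the realtime policies that block the controller.
RT_POLICIES = {os.SCHED_FIFO, os.SCHED_RR}

# Assumed cgroup2 mount point; adjust if yours differs.
ROOT_PROCS = "/sys/fs/cgroup/cgroup.procs"

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    pid = int(entry)
    try:
        if os.sched_getscheduler(pid) in RT_POLICIES:
            # Move the RT process into the root cgroup (requires root).
            with open(ROOT_PROCS, "w") as f:
                f.write(str(pid))
    except OSError:
        pass  # process exited, or the move was not permitted
```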
CPU Interface Files
All time durations are in microseconds.
-
cpu.stat
A read-only flat-keyed file. This file exists whether the controller is enabled or not.
It always reports the following three stats:
- usage_usec
- user_usec
- system_usec
and the following three when the controller is enabled:
- nr_periods
- nr_throttled
- throttled_usec
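Flat-keyed files such as cpu.stat hold one “key value” pair per line, which makes them trivial to parse. A minimal sketch; the cgroup path is a hypothetical example:

```python
def read_flat_keyed(path):
    """Parse a cgroup2 flat-keyed file into a {key: int} dict."""
    with open(path) as f:
        return {key: int(value) for key, value in (line.split() for line in f)}

# Hypothetical cgroup; adjust the path to your hierarchy.
stats = read_flat_keyed("/sys/fs/cgroup/mygroup/cpu.stat")
print(stats["usage_usec"], stats.get("nr_throttled"))  # nr_* only if enabled
```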
-
cpu.weight
A read-write single value file which exists on non-root cgroups. The default is “100”.
The weight in the range [1, 10000].
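Weights are purely relative: when siblings compete, each runnable sibling receives CPU time in proportion to its weight. A small worked example with hypothetical sibling weights:

```python
# Hypothetical sibling cgroups and their cpu.weight values.
weights = {"batch": 100, "web": 300}

total = sum(weights.values())
for name, weight in weights.items():
    # Share each sibling receives while all of them are runnable.
    print(f"{name}: {weight / total:.0%}")   # batch: 25%, web: 75%
```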
-
cpu.weight.nice
A read-write single value file which exists on non-root cgroups. The default is “0”.
The nice value is in the range [-20, 19].
This interface file is an alternative interface for “cpu.weight” and allows reading and setting weight using the same values used by nice(2). Because the range is smaller and granularity is coarser for the nice values, the read value is the closest approximation of the current weight.
-
cpu.max
A read-write two value file which exists on non-root cgroups. The default is “max 100000”.
The maximum bandwidth limit. It’s in the following format:
$MAX $PERIOD
which indicates that the group may consume up to $MAX in each $PERIOD duration. “max” for $MAX indicates no limit. If only one number is written, $MAX is updated.
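For example, “50000 100000” caps the group at half a CPU (50 ms of runtime per 100 ms period). A hedged sketch for writing the file; the cgroup name is hypothetical:

```python
def set_cpu_max(cgroup, quota, period=100000):
    """Write '$MAX $PERIOD' to cpu.max; quota may be microseconds or 'max'."""
    with open(f"/sys/fs/cgroup/{cgroup}/cpu.max", "w") as f:
        f.write(f"{quota} {period}")

set_cpu_max("mygroup", 50000)   # at most 0.5 CPU per period
set_cpu_max("mygroup", "max")   # remove the bandwidth limit
```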
-
cpu.pressure
A read-write nested-keyed file.
Shows pressure stall information for CPU. See PSI - Pressure Stall Information for details.
-
cpu.uclamp.min
A read-write single value file which exists on non-root cgroups. The default is “0”, i.e. no utilization boosting.
The requested minimum utilization (protection) as a percentage rational number, e.g. 12.34 for 12.34%.
This interface allows reading and setting minimum utilization clamp values similar to the sched_setattr(2). This minimum utilization value is used to clamp the task specific minimum utilization clamp.
The requested minimum utilization (protection) is always capped by the current value for the maximum utilization (limit), i.e. cpu.uclamp.max.
-
cpu.uclamp.max
A read-write single value file which exists on non-root cgroups. The default is “max”, i.e. no utilization capping.
The requested maximum utilization (limit) as a percentage rational number, e.g. 98.76 for 98.76%.
This interface allows reading and setting maximum utilization clamp values similar to the sched_setattr(2). This maximum utilization value is used to clamp the task specific maximum utilization clamp.
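Both clamp files accept plain percentage strings, so setting them is ordinary file I/O. A minimal sketch, assuming a hypothetical cgroup path and purely illustrative values:

```python
base = "/sys/fs/cgroup/mygroup"   # hypothetical cgroup

# Hint that at least ~20% utilization should always be provided...
with open(f"{base}/cpu.uclamp.min", "w") as f:
    f.write("20.00")

# ...and that the hinted utilization should never exceed 80%.
with open(f"{base}/cpu.uclamp.max", "w") as f:
    f.write("80.00")
```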
Memory
The “memory” controller regulates distribution of memory. Memory is stateful and implements both limit and protection models. Due to the intertwining between memory usage and reclaim pressure and the stateful nature of memory, the distribution model is relatively complex.
While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. Currently, the following types of memory usages are tracked.
-
Userland memory - page cache and anonymous memory.
-
Kernel data structures such as dentries and inodes.
-
TCP socket buffers.
The above list may expand in the future for better coverage.
Memory Interface Files
All memory amounts are in bytes. If a value which is not aligned to PAGE_SIZE is written, the value may be rounded up to the closest PAGE_SIZE multiple when read back.
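For instance, with 4096-byte pages a written value of 10000000 may read back as 10002432. A small illustration of that rounding:

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")   # typically 4096

def round_up_to_page(n):
    """Round n up to the closest PAGE_SIZE multiple."""
    return (n + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE

print(round_up_to_page(10_000_000))   # 10002432 with 4096-byte pages
```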
-
memory.current
A read-only single value file which exists on non-root cgroups.
The total amount of memory currently being used by the cgroup and its descendants.
-
memory.min
A read-write single value file which exists on non-root cgroups. The default is “0”.
Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup’s memory won’t be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent’s protection proportional to its actual memory usage below memory.min.
Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs.
If a memory cgroup is not populated with processes, its memory.min is ignored.
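The proportional split described above can be modeled as follows. This is a simplified sketch of the documented behavior, not the kernel's exact algorithm:

```python
def effective_protection(parent_min, children):
    """children maps name -> (memory_min, usage). When the children's
    protected usage overcommits parent_min, each child's protection is
    scaled in proportion to its usage below its own memory.min."""
    protected = {n: min(m, u) for n, (m, u) in children.items()}
    total = sum(protected.values())
    if total <= parent_min:
        return protected                      # no overcommit: honored in full
    return {n: parent_min * p / total for n, p in protected.items()}

# Both children request 600M of protection, but the parent only allows 600M.
print(effective_protection(600, {"a": (600, 500), "b": (600, 300)}))
# {'a': 375.0, 'b': 225.0}
```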
-
memory.low
A read-write single value file which exists on non-root cgroups. The default is “0”.
Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup’s memory won’t be reclaimed unless there is no reclaimable memory available in unprotected cgroups. Above the effective low boundary (or effective min boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent’s protection proportional to its actual memory usage below memory.low.
Putting more memory than generally available under this protection is discouraged.
-
memory.high
A read-write single value file which exists on non-root cgroups. The default is “max”.
Memory usage throttle limit. This is the main mechanism to control memory usage of a cgroup. If a cgroup’s usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure.
Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached.
-
memory.max
A read-write single value file which exists on non-root cgroups. The default is “max”.
Memory usage hard limit. This is the final protection mechanism. If a cgroup’s memory usage reaches this limit and can’t be reduced, the OOM killer is invoked in the cgroup. Under certain circumstances, the usage may go over the limit temporarily.
In the default configuration, regular 0-order allocations always succeed unless the OOM killer chooses the current task as a victim.
Some kinds of allocations don’t invoke the OOM killer. The caller could retry them differently, return -ENOMEM to userspace, or silently ignore the failure in cases like disk readahead.
This is the ultimate protection mechanism. As long as the high limit is used and monitored properly, this limit’s utility is limited to providing the final safety net.
-
memory.oom.group
A read-write single value file which exists on non-root cgroups. The default value is “0”.
Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.
Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed.
If the OOM killer is invoked in a cgroup, it’s not going to kill any tasks outside of this cgroup, regardless of the memory.oom.group values of ancestor cgroups.
-
memory.events
A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event; a polling sketch follows the memory.events.local entry below.
Note that all fields in this file are hierarchical and the file modified event can be generated due to an event down the hierarchy. For the local events at the cgroup level see memory.events.local.
-
low
The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under the low boundary. This usually indicates that the low boundary is over-committed.
-
high
The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event’s occurrences are expected.
-
max
The number of times the cgroup’s memory usage was about to go over the max boundary. If direct reclaim fails to bring it down, the cgroup goes to OOM state.
-
oom
The number of times the cgroup’s memory usage reached the limit and allocation was about to fail.
This event is not raised if the OOM killer is not considered as an option, e.g. for failed high-order allocations or if the caller asked not to retry attempts.
-
oom_kill
The number of processes belonging to this cgroup killed by any kind of OOM killer.
-
memory.events.local
Similar to memory.events but the fields in the file are local to the cgroup, i.e. not hierarchical. The file modified event generated on this file reflects only the local events.
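One way to consume the file modified events mentioned above is to poll(2) the open file; cgroup2 signals modification with POLLPRI. A hedged sketch, with a hypothetical cgroup path:

```python
import select

PATH = "/sys/fs/cgroup/mygroup/memory.events"   # hypothetical cgroup

with open(PATH) as f:
    poller = select.poll()
    # cgroup2 files raise POLLPRI (often together with POLLERR) on change.
    poller.register(f.fileno(), select.POLLPRI)
    while True:
        poller.poll()                 # block until the counters change
        f.seek(0)
        counters = dict(line.split() for line in f)
        print("oom_kill =", counters.get("oom_kill", "0"))
```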
-
memory.stat
A read-only flat-keyed file which exists on non-root cgroups.
This breaks down the cgroup’s memory footprint into different types of memory, type-specific details, and other information on the state and past events of the memory management system.
All memory amounts are in bytes.
The entries are ordered to be human readable, and new entries can show up in the middle. Don’t rely on items remaining in a fixed position; use the keys to look up specific values!
Entries that have no per-node counter are tagged ‘npn’ (non-per-node) to indicate that they will not show up in memory.numa_stat.
-
anon
Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS)
-
file
Amount of memory used to cache filesystem data, including tmpfs and shared memory.
-
kernel_stack
Amount of memory allocated to kernel stacks.
-
pagetables
Amount of memory allocated for page tables.
-
percpu (npn)
Amount of memory used for storing per-cpu kernel data structures.
-
sock (npn)
Amount of memory used in network transmission buffers
-
shmem
Amount of cached filesystem data that is swap-backed, such as tmpfs, shm segments, shared anonymous mmap()s
-
file_mapped
Amount of cached filesystem data mapped with mmap()
-
file_dirty
Amount of cached filesystem data that was modified but not yet written back to disk
-
file_writeback
Amount of cached filesystem data that was modified and is currently being written back to disk
-
swapcached
Amount of swap cached in memory. The swapcache is accounted against both memory and swap usage.
-
anon_thp
Amount of memory used in anonymous mappings backed by transparent hugepages
-
file_thp
Amount of cached filesystem data backed by transparent hugepages
-
shmem_thp
Amount of shm, tmpfs, shared anonymous mmap()s backed by transparent hugepages
-
inactive_anon, active_anon, inactive_file, active_file, unevictable
Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the page reclaim algorithm.
As these represent internal list state (eg. shmem pages are on anon memory management lists), inactive_foo + active_foo may not be equal to the value for the foo counter, since the foo counter is type-based, not list-based.
-
slab_reclaimable
Part of “slab” that might be reclaimed, such as dentries and inodes.
-
slab_unreclaimable
Part of “slab” that cannot be reclaimed on memory pressure.
-
slab (npn)
Amount of memory used for storing in-kernel data structures.
-
workingset_refault_anon
Number of refaults of previously evicted anonymous pages.
-
workingset_refault_file
Number of refaults of previously evicted file pages.
-
workingset_activate_anon
Number of refaulted anonymous pages that were immediately activated.
-
workingset_activate_file
Number of refaulted file pages that were immediately activated.
-
workingset_restore_anon
Number of restored anonymous pages which have been detected as an active workingset before they got reclaimed.
-
workingset_restore_file
Number of restored file pages which have been detected as an active workingset before they got reclaimed.
-
workingset_nodereclaim
Number of times a shadow node has been reclaimed
-
pgfault (npn)
Total number of page faults incurred
-
pgmajfault (npn)
Number of major page faults incurred
-
pgrefill (npn)
Amount of scanned pages (in an active LRU list)
-
pgscan (npn)
Amount of scanned pages (in an inactive LRU list)
-
pgsteal (npn)
Amount of reclaimed pages
-
pgactivate (npn)
Amount of pages moved to the active LRU list
-
pgdeactivate (npn)
Amount of pages moved to the inactive LRU list
-
pglazyfree (npn)
Amount of pages postponed to be freed under memory pressure
-
pglazyfreed (npn)
Amount of reclaimed lazyfree pages
-
thp_fault_alloc (npn)
Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
-
thp_collapse_alloc (npn)
Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
-
memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.
This breaks down the cgroup’s memory footprint into different types of memory, type-specific details, and other information per node on the state of the memory management system.
This is useful for providing visibility into the NUMA locality information within a memcg since the pages are allowed to be allocated from any physical node. One use case is evaluating application performance by combining this information with the application’s CPU allocation.
All memory amounts are in bytes.
The output format of memory.numa_stat is:
type N0=<bytes in node 0> N1=<bytes in node 1> ...
The entries are ordered to be human readable, and new entries can show up in the middle. Don’t rely on items remaining in a fixed position; use the keys to look up specific values!
For the meaning of each entry, refer to memory.stat above.
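Nested-keyed lines like these split naturally on whitespace and “=”. A minimal parsing sketch; the cgroup path is a hypothetical example:

```python
def read_nested_keyed(path):
    """Parse 'type N0=<bytes> N1=<bytes> ...' lines into nested dicts."""
    result = {}
    with open(path) as f:
        for line in f:
            entry, *nodes = line.split()
            result[entry] = {k: int(v) for k, v in (n.split("=") for n in nodes)}
    return result

numa = read_nested_keyed("/sys/fs/cgroup/mygroup/memory.numa_stat")
print(numa["anon"])   # e.g. {'N0': 1024, 'N1': 2048}
```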
-
memory.swap.current
A read-only single value file which exists on non-root cgroups.
The total amount of swap currently being used by the cgroup and its descendants.
-
memory.swap.high
A read-write single value file which exists on non-root cgroups. The default is “max”.
Swap usage throttle limit. If a cgroup’s swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures.
This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed.
Healthy workloads are not expected to reach this limit.
-
memory.swap.max
A read-write single value file which exists on non-root cgroups. The default is “max”.
Swap usage hard limit. If a cgroup’s swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out.
-
memory.swap.events
A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
-
high
The number of times the cgroup’s swap usage was over the high threshold.
-
max
The number of times the cgroup’s swap usage was about to go over the max boundary and swap allocation failed.
-
fail
The number of times swap allocation failed either because of running out of swap system-wide or the max limit.
When reduced under the current usage, the existing swap entries are reclaimed gradually and the swap usage may stay higher than the limit for an extended period of time. This reduces the impact on the workload and memory management.
-
memory.pressure
A read-only nested-keyed file.
Shows pressure stall information for memory. See PSI - Pressure Stall Information for details.
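Pressure files share a common layout: a “some” line (and, where supported, a “full” line) carrying avg10/avg60/avg300 percentages plus a cumulative total in microseconds. A parsing sketch with a hypothetical cgroup path:

```python
def read_pressure(path):
    """Parse a PSI file into {'some': {...}, 'full': {...}}."""
    result = {}
    with open(path) as f:
        for line in f:
            kind, *fields = line.split()
            result[kind] = {k: float(v) for k, v in (x.split("=") for x in fields)}
    return result

psi = read_pressure("/sys/fs/cgroup/mygroup/memory.pressure")
print(psi["some"]["avg10"])   # % of recent time at least one task stalled
```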
Usage Guidelines
“memory.high” is the main mechanism to control memory usage. Over-committing on high limit (sum of high limits > available memory) and letting global memory pressure distribute memory according to usage is a viable strategy.
Because breach of the high limit doesn’t trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.
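A hedged sketch of such an agent loop; the cgroup path, polling interval, and reaction policy are all illustrative assumptions:

```python
import time

CGROUP = "/sys/fs/cgroup/mygroup"   # hypothetical cgroup

def read_events():
    with open(f"{CGROUP}/memory.events") as f:
        return dict(line.split() for line in f)

last_high = int(read_events().get("high", 0))
while True:
    time.sleep(5)
    high = int(read_events().get("high", 0))
    if high > last_high:
        # The cgroup breached memory.high since the last check; a real agent
        # might raise memory.high, migrate work, or terminate the workload.
        with open(f"{CGROUP}/memory.current") as f:
            print(f"high +{high - last_high}, usage={int(f.read())} bytes")
        last_high = high
```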
Determining whether a cgroup has enough memory is not trivial as memory usage doesn’t indicate whether the workload can benefit from more memory. For example, a workload which writes data received from the network to a file can use all available memory but can also perform just as well with a small amount of memory. A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, a memory pressure monitoring mechanism isn’t implemented yet.
Memory Ownership
A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn’t move the memory usages that it instantiated while in the previous cgroup to the new cgroup.
A memory area may be used by processes belonging to different cgroups. To which cgroup the area will be charged is non-deterministic; however, over time, the memory area is likely to end up in a cgroup which has enough memory allowance to avoid high reclaim pressure.
If a cgroup sweeps a considerable amount of memory which is expected to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership.
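A sketch of that relinquishing step via posix_fadvise(2); the file path is a hypothetical example, and note that dirty pages must be written back (e.g. with fsync) before they can be dropped:

```python
import os

def drop_page_cache(path):
    """Advise the kernel to drop this file's page cache, so the next cgroup
    that touches the pages gets charged for them."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 applies the advice to the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

drop_page_cache("/var/cache/shared/dataset.bin")   # hypothetical shared file
```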