diff --git a/linux-tkg-patches/6.0/0010-lru_6.0.patch b/linux-tkg-patches/6.0/0010-lru_6.0.patch
index b97022b..0b4b5a0 100644
--- a/linux-tkg-patches/6.0/0010-lru_6.0.patch
+++ b/linux-tkg-patches/6.0/0010-lru_6.0.patch
@@ -1,403 +1,381 @@
-linux-kernel.vger.kernel.org archive mirror
-
- help / color / mirror / Atom feed
-
-* [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework
-@ 2022-09-18 7:59 Yu Zhao
- 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
- ` (14 more replies)
- 0 siblings, 15 replies; 23+ messages in thread
-From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw)
- To: Andrew Morton
- Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
- Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet,
- Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel,
- Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo,
- Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc,
- linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao
-
-What's new
-==========
-1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
- Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
-2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
- machines. The old direct reclaim backoff, which tries to enforce a
- minimum fairness among all eligible memcgs, over-swapped by about
- (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
- pulls the plug on swapping once the target is met, trades some
- fairness for curtailed latency:
- https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
-3. Fixed minor build warnings and conflicts. More comments and nits.
-
-TLDR
-====
-The current page reclaim is too expensive in terms of CPU usage and it
-often makes poor choices about what to evict. This patchset offers an
-alternative solution that is performant, versatile and
-straightforward.
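The over-swap estimate in item 2 of "What's new" can be checked with quick arithmetic. The sketch below assumes a 4 TiB machine and takes nr_to_reclaim to be SWAP_CLUSTER_MAX, as in direct reclaim; DEF_PRIORITY=12 and SWAP_CLUSTER_MAX=32 match the kernel's definitions:

```python
# Rough illustration of (total_mem>>DEF_PRIORITY)-nr_to_reclaim from the
# cover letter. The 4 TiB machine size is an assumption for illustration.
PAGE_SIZE = 4096
DEF_PRIORITY = 12        # kernel's default reclaim priority
SWAP_CLUSTER_MAX = 32    # typical nr_to_reclaim for direct reclaim, in pages

total_mem_pages = (4 << 40) // PAGE_SIZE   # 4 TiB of RAM, in pages
over_swap_pages = (total_mem_pages >> DEF_PRIORITY) - SWAP_CLUSTER_MAX

# On a TB-scale machine the old backoff could swap out roughly a gibibyte
# more than the reclaim target, which explains the long-tailed latency.
print(over_swap_pages * PAGE_SIZE / (1 << 30))
```

At 4 TiB the excess works out to just under 1 GiB of needless swapping per reclaim pass, which is why the fix matters mostly on high-memory machines.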
- -Patchset overview -================= -The design and implementation overview is in patch 14: -https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/ - -01. mm: x86, arm64: add arch_has_hw_pte_young() -02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -Take advantage of hardware features when trying to clear the accessed -bit in many PTEs. - -03. mm/vmscan.c: refactor shrink_node() -04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into - its sole caller" -Minor refactors to improve readability for the following patches. - -05. mm: multi-gen LRU: groundwork -Adds the basic data structure and the functions that insert pages to -and remove pages from the multi-gen LRU (MGLRU) lists. - -06. mm: multi-gen LRU: minimal implementation -A minimal implementation without optimizations. - -07. mm: multi-gen LRU: exploit locality in rmap -Exploits spatial locality to improve efficiency when using the rmap. - -08. mm: multi-gen LRU: support page table walks -Further exploits spatial locality by optionally scanning page tables. - -09. mm: multi-gen LRU: optimize multiple memcgs -Optimizes the overall performance for multiple memcgs running mixed -types of workloads. - -10. mm: multi-gen LRU: kill switch -Adds a kill switch to enable or disable MGLRU at runtime. - -11. mm: multi-gen LRU: thrashing prevention -12. mm: multi-gen LRU: debugfs interface -Provide userspace with features like thrashing prevention, working set -estimation and proactive reclaim. - -13. mm: multi-gen LRU: admin guide -14. mm: multi-gen LRU: design doc -Add an admin guide and a design doc. 
- -Benchmark results -================= -Independent lab results ------------------------ -Based on the popularity of searches [01] and the memory usage in -Google's public cloud, the most popular open-source memory-hungry -applications, in alphabetical order, are: - Apache Cassandra Memcached - Apache Hadoop MongoDB - Apache Spark PostgreSQL - MariaDB (MySQL) Redis - -An independent lab evaluated MGLRU with the most widely used benchmark -suites for the above applications. They posted 960 data points along -with kernel metrics and perf profiles collected over more than 500 -hours of total benchmark time. Their final reports show that, with 95% -confidence intervals (CIs), the above applications all performed -significantly better for at least part of their benchmark matrices. - -On 5.14: -1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% - less wall time to sort three billion random integers, respectively, - under the medium- and the high-concurrency conditions, when - overcommitting memory. There were no statistically significant - changes in wall time for the rest of the benchmark matrix. -2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]% - more transactions per minute (TPM), respectively, under the medium- - and the high-concurrency conditions, when overcommitting memory. - There were no statistically significant changes in TPM for the rest - of the benchmark matrix. -3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]% - and [21.59, 30.02]% more operations per second (OPS), respectively, - for sequential access, random access and Gaussian (distribution) - access, when THP=always; 95% CIs [13.85, 15.97]% and - [23.94, 29.92]% more OPS, respectively, for random access and - Gaussian access, when THP=never. There were no statistically - significant changes in OPS for the rest of the benchmark matrix. -4. 
MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and - [2.16, 3.55]% more operations per second (OPS), respectively, for - exponential (distribution) access, random access and Zipfian - (distribution) access, when underutilizing memory; 95% CIs - [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS, - respectively, for exponential access, random access and Zipfian - access, when overcommitting memory. - -On 5.15: -5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% - and [4.11, 7.50]% more operations per second (OPS), respectively, - for exponential (distribution) access, random access and Zipfian - (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, - [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for - exponential access, random access and Zipfian access, when swap was - on. -6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% - less average wall time to finish twelve parallel TeraSort jobs, - respectively, under the medium- and the high-concurrency - conditions, when swap was on. There were no statistically - significant changes in average wall time for the rest of the - benchmark matrix. -7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per - minute (TPM) under the high-concurrency condition, when swap was - off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, - respectively, under the medium- and the high-concurrency - conditions, when swap was on. There were no statistically - significant changes in TPM for the rest of the benchmark matrix. -8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and - [11.47, 19.36]% more total operations per second (OPS), - respectively, for sequential access, random access and Gaussian - (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, - [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, - for sequential access, random access and Gaussian access, when - THP=never. 
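The 95% confidence intervals quoted above come from standard statistics. The sketch below shows a t-based CI for a relative throughput change; the sample numbers are made up for illustration and are not the labs' measurements (their raw data is in the linked reports):

```python
# Sketch: 95% CI for the mean relative change between two sets of runs.
# Sample values below are hypothetical; pairing the runs is a simplification.
import statistics

baseline = [102.1, 99.8, 101.3, 100.5, 98.9]    # OPS, stock kernel
patched = [128.4, 125.9, 130.2, 127.1, 126.6]   # OPS, MGLRU kernel

deltas = [100.0 * (p - b) / b for p, b in zip(patched, baseline)]
mean = statistics.mean(deltas)
sem = statistics.stdev(deltas) / len(deltas) ** 0.5
t95 = 2.776  # two-sided t critical value for 4 degrees of freedom

lo, hi = mean - t95 * sem, mean + t95 * sem
print(f"95% CI [{lo:.2f}, {hi:.2f}]% more OPS")
```

A CI that excludes zero, as in every result quoted above, is what "statistically significant" means in this cover letter; intervals straddling zero are reported as "no statistically significant changes".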
-
-Our lab results
----------------
-To supplement the above results, we ran the following benchmark suites
-on 5.16-rc7 and found no regressions [10].
- fs_fio_bench_hdd_mq pft
- fs_lmbench pgsql-hammerdb
- fs_parallelio redis
- fs_postmark stream
- hackbench sysbenchthread
- kernbench tpcc_spark
- memcached unixbench
- multichase vm-scalability
- mutilate will-it-scale
- nginx
-
-[01] https://trends.google.com
-[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
-[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
-[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
-[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
-[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
-[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
-[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
-[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
-[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
-
-Real-world applications
-=======================
-Third-party testimonials
-------------------------
-Konstantin reported [11]:
- I have Archlinux with 8G RAM + zswap + swap. While developing, I
- have lots of apps opened such as multiple LSP-servers for different
- langs, chats, two browsers, etc... Usually, my system gets quickly
- to a point of SWAP-storms, where I have to kill LSP-servers,
- restart browsers to free memory, etc, otherwise the system lags
- heavily and is barely usable.
-
- 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
- patchset, and I started up by opening lots of apps to create memory
- pressure, and worked for a day like this. Till now I had not a
- single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
- getting to the point of 3G in SWAP before without a single
- SWAP-storm.
-
-Vaibhav from IBM reported [12]:
- In a synthetic MongoDB Benchmark, seeing an average of ~19%
- throughput improvement on POWER10(Radix MMU + 64K Page Size) with
- MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
- three different request distributions, namely, Exponential, Uniform
- and Zipfian.
-
-Shuang from U of Rochester reported [13]:
- With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
- and [9.26, 10.36]% higher throughput, respectively, for random
- access, Zipfian (distribution) access and Gaussian (distribution)
- access, when the average number of jobs per CPU is 1; 95% CIs
- [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
- throughput, respectively, for random access, Zipfian access and
- Gaussian access, when the average number of jobs per CPU is 2.
-
-Daniel from Michigan Tech reported [14]:
- With Memcached allocating ~100GB of byte-addressable Optane,
- performance improvement in terms of throughput (measured as queries
- per second) was about 10% for a series of workloads.
-
-Large-scale deployments
------------------------
-We've rolled out MGLRU to tens of millions of ChromeOS users and
-about a million Android users. Google's fleetwide profiling [15] shows
-an overall 40% decrease in kswapd CPU usage, in addition to
-improvements in other UX metrics, e.g., an 85% decrease in the number
-of low-memory kills at the 75th percentile and an 18% decrease in
-app launch time at the 50th percentile.
-
-The downstream kernels that have been using MGLRU include:
-1. Android [16]
-2. Arch Linux Zen [17]
-3. Armbian [18]
-4. ChromeOS [19]
-5. Liquorix [20]
-6. OpenWrt [21]
-7. post-factum [22]
-8. 
XanMod [23]
-
-[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
-[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
-[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
-[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
-[15] https://dl.acm.org/doi/10.1145/2749469.2750392
-[16] https://android.com
-[17] https://archlinux.org
-[18] https://armbian.com
-[19] https://chromium.org
-[20] https://liquorix.net
-[21] https://openwrt.org
-[22] https://codeberg.org/pf-kernel
-[23] https://xanmod.org
-
-Summary
-=======
-The facts are:
-1. The independent lab results and the real-world applications
- indicate substantial improvements; there are no known regressions.
-2. Thrashing prevention, working set estimation and proactive reclaim
- work out of the box; there are no equivalent solutions.
-3. There is a lot of new code; no smaller changes have
- demonstrated similar effects.
-
-Our options, accordingly, are:
-1. Given the amount of evidence, the reported improvements will likely
- materialize for a wide range of workloads.
-2. Gauging the interest from the past discussions, the new features
- will likely be put to use for both personal computers and data
- centers.
-3. Based on Google's track record, the new code will likely be well
- maintained in the long term. It'd be more difficult if not
- impossible to achieve similar effects with other approaches.
- -Yu Zhao (14): - mm: x86, arm64: add arch_has_hw_pte_young() - mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG - mm/vmscan.c: refactor shrink_node() - Revert "include/linux/mm_inline.h: fold __update_lru_size() into its - sole caller" - mm: multi-gen LRU: groundwork - mm: multi-gen LRU: minimal implementation - mm: multi-gen LRU: exploit locality in rmap - mm: multi-gen LRU: support page table walks - mm: multi-gen LRU: optimize multiple memcgs - mm: multi-gen LRU: kill switch - mm: multi-gen LRU: thrashing prevention - mm: multi-gen LRU: debugfs interface - mm: multi-gen LRU: admin guide - mm: multi-gen LRU: design doc - - Documentation/admin-guide/mm/index.rst | 1 + - Documentation/admin-guide/mm/multigen_lru.rst | 162 + - Documentation/mm/index.rst | 1 + - Documentation/mm/multigen_lru.rst | 159 + - arch/Kconfig | 8 + - arch/arm64/include/asm/pgtable.h | 15 +- - arch/x86/Kconfig | 1 + - arch/x86/include/asm/pgtable.h | 9 +- - arch/x86/mm/pgtable.c | 5 +- - fs/exec.c | 2 + - fs/fuse/dev.c | 3 +- - include/linux/cgroup.h | 15 +- - include/linux/memcontrol.h | 36 + - include/linux/mm.h | 5 + - include/linux/mm_inline.h | 231 +- - include/linux/mm_types.h | 76 + - include/linux/mmzone.h | 214 ++ - include/linux/nodemask.h | 1 + - include/linux/page-flags-layout.h | 16 +- - include/linux/page-flags.h | 4 +- - include/linux/pgtable.h | 17 +- - include/linux/sched.h | 4 + - include/linux/swap.h | 4 + - kernel/bounds.c | 7 + - kernel/cgroup/cgroup-internal.h | 1 - - kernel/exit.c | 1 + - kernel/fork.c | 9 + - kernel/sched/core.c | 1 + - mm/Kconfig | 26 + - mm/huge_memory.c | 3 +- - mm/internal.h | 1 + - mm/memcontrol.c | 28 + - mm/memory.c | 39 +- - mm/mm_init.c | 6 +- - mm/mmzone.c | 2 + - mm/rmap.c | 6 + - mm/swap.c | 54 +- - mm/vmscan.c | 2995 ++++++++++++++++- - mm/workingset.c | 110 +- - 39 files changed, 4122 insertions(+), 156 deletions(-) - create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst - create mode 100644 Documentation/mm/multigen_lru.rst - 
- -base-commit: 6cf215f1d5dac59a5a09514138ca37aed2719d0a --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao -@ 2022-09-18 7:59 ` Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao - ` (13 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Some architectures automatically set the accessed bit in PTEs, e.g., -x86 and arm64 v8.2. On architectures that do not have this capability, -clearing the accessed bit in a PTE usually triggers a page fault -following the TLB miss of this PTE (to emulate the accessed bit). - -Being aware of this capability can help make better decisions, e.g., -whether to spread the work out over a period of time to reduce bursty -page faults when trying to clear the accessed bit in many PTEs. - -Note that theoretically this capability can be unreliable, e.g., -hotplugged CPUs might be different from builtin ones. Therefore it -should not be used in architecture-independent code that involves -correctness, e.g., to determine whether TLB flushes are required (in -combination with the accessed bit). 
- -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Acked-by: Will Deacon -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - arch/arm64/include/asm/pgtable.h | 15 ++------------- - arch/x86/include/asm/pgtable.h | 6 +++--- - include/linux/pgtable.h | 13 +++++++++++++ - mm/memory.c | 14 +------------- - 4 files changed, 19 insertions(+), 29 deletions(-) - +diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst +index 1bd11118dfb1c6..d1064e0ba34a29 100644 +--- a/Documentation/admin-guide/mm/index.rst ++++ b/Documentation/admin-guide/mm/index.rst +@@ -32,6 +32,7 @@ the Linux memory management. + idle_page_tracking + ksm + memory-hotplug ++ multigen_lru + nommu-mmap + numa_memory_policy + numaperf +diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst +new file mode 100644 +index 00000000000000..33e068830497e7 +--- /dev/null ++++ b/Documentation/admin-guide/mm/multigen_lru.rst +@@ -0,0 +1,162 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++============= ++Multi-Gen LRU ++============= ++The multi-gen LRU is an alternative LRU implementation that optimizes ++page reclaim and improves performance under memory pressure. Page ++reclaim decides the kernel's caching policy and ability to overcommit ++memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++ ++Quick start ++=========== ++Build the kernel with the following configurations. ++ ++* ``CONFIG_LRU_GEN=y`` ++* ``CONFIG_LRU_GEN_ENABLED=y`` ++ ++All set! ++ ++Runtime options ++=============== ++``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the ++following subsections. 
++ ++Kill switch ++----------- ++``enabled`` accepts different values to enable or disable the ++following components. Its default value depends on ++``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled ++unless some of them have unforeseen side effects. Writing to ++``enabled`` has no effect when a component is not supported by the ++hardware, and valid values will be accepted even when the main switch ++is off. ++ ++====== =============================================================== ++Values Components ++====== =============================================================== ++0x0001 The main switch for the multi-gen LRU. ++0x0002 Clearing the accessed bit in leaf page table entries in large ++ batches, when MMU sets it (e.g., on x86). This behavior can ++ theoretically worsen lock contention (mmap_lock). If it is ++ disabled, the multi-gen LRU will suffer a minor performance ++ degradation for workloads that contiguously map hot pages, ++ whose accessed bits can be otherwise cleared by fewer larger ++ batches. ++0x0004 Clearing the accessed bit in non-leaf page table entries as ++ well, when MMU sets it (e.g., on x86). This behavior was not ++ verified on x86 varieties other than Intel and AMD. If it is ++ disabled, the multi-gen LRU will suffer a negligible ++ performance degradation. ++[yYnN] Apply to all the components above. ++====== =============================================================== ++ ++E.g., ++:: ++ ++ echo y >/sys/kernel/mm/lru_gen/enabled ++ cat /sys/kernel/mm/lru_gen/enabled ++ 0x0007 ++ echo 5 >/sys/kernel/mm/lru_gen/enabled ++ cat /sys/kernel/mm/lru_gen/enabled ++ 0x0005 ++ ++Thrashing prevention ++-------------------- ++Personal computers are more sensitive to thrashing because it can ++cause janks (lags when rendering UI) and negatively impact user ++experience. The multi-gen LRU offers thrashing prevention to the ++majority of laptop and desktop users who do not have ``oomd``. 
++
++Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
++``N`` milliseconds from getting evicted. The OOM killer is triggered
++if this working set cannot be kept in memory. In other words, this
++option works as an adjustable pressure relief valve, and when open, it
++terminates applications that are hopefully not being used.
++
++Based on the average human detectable lag (~100ms), ``N=1000`` usually
++eliminates intolerable janks due to thrashing. Larger values like
++``N=3000`` make janks less noticeable at the risk of premature OOM
++kills.
++
++The default value ``0`` means disabled.
++
++Experimental features
++=====================
++``/sys/kernel/debug/lru_gen`` accepts commands described in the
++following subsections. Multiple command lines are supported, as is
++concatenation with delimiters ``,`` and ``;``.
++
++``/sys/kernel/debug/lru_gen_full`` provides additional stats for
++debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
++evicted generations in this file.
++
++Working set estimation
++----------------------
++Working set estimation measures how much memory an application needs
++in a given time interval, and it is usually done with little impact on
++the performance of the application. E.g., data centers want to
++optimize job scheduling (bin packing) to improve memory utilization.
++When a new job comes in, the job scheduler needs to find out whether
++each server it manages can allocate a certain amount of memory for
++this new job before it can pick a candidate. To do so, the job
++scheduler needs to estimate the working sets of the existing jobs.
++
++When it is read, ``lru_gen`` returns a histogram of numbers of pages
++accessed over different time intervals for each memcg and node.
++``MAX_NR_GENS`` decides the number of bins for each histogram. The
++histograms are noncumulative.
++::
++
++ memcg memcg_id memcg_path
++ node node_id
++ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
++ ...
++ max_gen_nr age_in_ms nr_anon_pages nr_file_pages ++ ++Each bin contains an estimated number of pages that have been accessed ++within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages ++and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of ++the former is the largest and that of the latter is the smallest. ++ ++Users can write the following command to ``lru_gen`` to create a new ++generation ``max_gen_nr+1``: ++ ++ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` ++ ++``can_swap`` defaults to the swap setting and, if it is set to ``1``, ++it forces the scan of anon pages when swap is off, and vice versa. ++``force_scan`` defaults to ``1`` and, if it is set to ``0``, it ++employs heuristics to reduce the overhead, which is likely to reduce ++the coverage as well. ++ ++A typical use case is that a job scheduler runs this command at a ++certain time interval to create new generations, and it ranks the ++servers it manages based on the sizes of their cold pages defined by ++this time interval. ++ ++Proactive reclaim ++----------------- ++Proactive reclaim induces page reclaim when there is no memory ++pressure. It usually targets cold pages only. E.g., when a new job ++comes in, the job scheduler wants to proactively reclaim cold pages on ++the server it selected, to improve the chance of successfully landing ++this new job. ++ ++Users can write the following command to ``lru_gen`` to evict ++generations less than or equal to ``min_gen_nr``. ++ ++ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` ++ ++``min_gen_nr`` should be less than ``max_gen_nr-1``, since ++``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to ++the active list) and therefore cannot be evicted. ``swappiness`` ++overrides the default value in ``/proc/sys/vm/swappiness``. ++``nr_to_reclaim`` limits the number of pages to evict. 
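The aging (``+``) and eviction (``-``) commands above share one line format. The sketch below only builds the strings a job scheduler would write to ``/sys/kernel/debug/lru_gen``; the memcg and node IDs are hypothetical, and the write itself requires ``CONFIG_LRU_GEN`` and root:

```python
# Sketch of the debugfs command strings documented above. IDs are made up.

def aging_cmd(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    """'+ memcg_id node_id max_gen_nr [can_swap [force_scan]]'"""
    parts = ["+", memcg_id, node_id, max_gen_nr]
    if can_swap is not None:
        parts.append(can_swap)
        if force_scan is not None:
            parts.append(force_scan)
    return " ".join(str(p) for p in parts)

def eviction_cmd(memcg_id, node_id, min_gen_nr, swappiness=None,
                 nr_to_reclaim=None):
    """'- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'"""
    parts = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        parts.append(swappiness)
        if nr_to_reclaim is not None:
            parts.append(nr_to_reclaim)
    return " ".join(str(p) for p in parts)

# Multiple commands can be concatenated with ',' or ';':
batch = "; ".join([aging_cmd(3, 0, 7), eviction_cmd(3, 0, 5, swappiness=60)])
# with open("/sys/kernel/debug/lru_gen", "w") as f:
#     f.write(batch)
print(batch)
```

Note the ordering constraint from the text: the optional fields are positional, so ``force_scan`` and ``nr_to_reclaim`` can only be given when ``can_swap`` and ``swappiness`` are, respectively.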
++ ++A typical use case is that a job scheduler runs this command before it ++tries to land a new job on a server. If it fails to materialize enough ++cold pages because of the overestimation, it retries on the next ++server according to the ranking result obtained from the working set ++estimation step. This less forceful approach limits the impacts on the ++existing jobs. +diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst +index 575ccd40e30cfa..4aa12b8be278d3 100644 +--- a/Documentation/mm/index.rst ++++ b/Documentation/mm/index.rst +@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. + ksm + memory-model + mmu_notifier ++ multigen_lru + numa + overcommit-accounting + page_migration +diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst +new file mode 100644 +index 00000000000000..d7062c6a894646 +--- /dev/null ++++ b/Documentation/mm/multigen_lru.rst +@@ -0,0 +1,159 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++============= ++Multi-Gen LRU ++============= ++The multi-gen LRU is an alternative LRU implementation that optimizes ++page reclaim and improves performance under memory pressure. Page ++reclaim decides the kernel's caching policy and ability to overcommit ++memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++ ++Design overview ++=============== ++Objectives ++---------- ++The design objectives are: ++ ++* Good representation of access recency ++* Try to profit from spatial locality ++* Fast paths to make obvious choices ++* Simple self-correcting heuristics ++ ++The representation of access recency is at the core of all LRU ++implementations. In the multi-gen LRU, each generation represents a ++group of pages with similar access recency. Generations establish a ++(time-based) common frame of reference and therefore help make better ++choices, e.g., between different memcgs on a computer or different ++computers in a data center (for job scheduling). 
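The generation bookkeeping that underpins this common frame of reference can be modeled in a few lines. This is a toy for intuition only, assuming ``MIN_NR_GENS=2`` and ``MAX_NR_GENS=4`` as in the minimal implementation; the kernel's real structures live in ``mm/vmscan.c`` and ``include/linux/mmzone.h``:

```python
# Toy model: generations as a sliding window of lists per lruvec.
from collections import deque

MIN_NR_GENS, MAX_NR_GENS = 2, 4

class Lruvec:
    def __init__(self):
        # min_seq and max_seq only ever increase; their distance stays
        # within [MIN_NR_GENS, MAX_NR_GENS].
        self.min_seq, self.max_seq = 0, MIN_NR_GENS - 1
        # one LRU list per truncated generation number
        self.lists = [deque() for _ in range(MAX_NR_GENS)]

    def add_page(self, page):
        # new pages start in the youngest generation
        self.lists[self.max_seq % MAX_NR_GENS].append(page)

    def age(self):
        # aging: open a new, younger generation while the window allows it
        if self.max_seq - self.min_seq + 1 < MAX_NR_GENS:
            self.max_seq += 1

    def evict_one(self):
        # eviction: consume the oldest generation; advance min_seq once
        # its list empties, keeping at least MIN_NR_GENS generations
        oldest = self.lists[self.min_seq % MAX_NR_GENS]
        page = oldest.popleft() if oldest else None
        if not oldest and self.max_seq - self.min_seq + 1 > MIN_NR_GENS:
            self.min_seq += 1
        return page
```

The point of the sketch is the producer-consumer shape: aging advances ``max_seq``, eviction advances ``min_seq``, and a page's age is simply the distance of its generation from ``max_seq``.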
++ ++Exploiting spatial locality improves efficiency when gathering the ++accessed bit. A rmap walk targets a single page and does not try to ++profit from discovering a young PTE. A page table walk can sweep all ++the young PTEs in an address space, but the address space can be too ++sparse to make a profit. The key is to optimize both methods and use ++them in combination. ++ ++Fast paths reduce code complexity and runtime overhead. Unmapped pages ++do not require TLB flushes; clean pages do not require writeback. ++These facts are only helpful when other conditions, e.g., access ++recency, are similar. With generations as a common frame of reference, ++additional factors stand out. But obvious choices might not be good ++choices; thus self-correction is necessary. ++ ++The benefits of simple self-correcting heuristics are self-evident. ++Again, with generations as a common frame of reference, this becomes ++attainable. Specifically, pages in the same generation can be ++categorized based on additional factors, and a feedback loop can ++statistically compare the refault percentages across those categories ++and infer which of them are better choices. ++ ++Assumptions ++----------- ++The protection of hot pages and the selection of cold pages are based ++on page access channels and patterns. There are two access channels: ++ ++* Accesses through page tables ++* Accesses through file descriptors ++ ++The protection of the former channel is by design stronger because: ++ ++1. The uncertainty in determining the access patterns of the former ++ channel is higher due to the approximation of the accessed bit. ++2. The cost of evicting the former channel is higher due to the TLB ++ flushes required and the likelihood of encountering the dirty bit. ++3. The penalty of underprotecting the former channel is higher because ++ applications usually do not prepare themselves for major page ++ faults like they do for blocked I/O. 
E.g., GUI applications ++ commonly use dedicated I/O threads to avoid blocking rendering ++ threads. ++ ++There are also two access patterns: ++ ++* Accesses exhibiting temporal locality ++* Accesses not exhibiting temporal locality ++ ++For the reasons listed above, the former channel is assumed to follow ++the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is ++present, and the latter channel is assumed to follow the latter ++pattern unless outlying refaults have been observed. ++ ++Workflow overview ++================= ++Evictable pages are divided into multiple generations for each ++``lruvec``. The youngest generation number is stored in ++``lrugen->max_seq`` for both anon and file types as they are aged on ++an equal footing. The oldest generation numbers are stored in ++``lrugen->min_seq[]`` separately for anon and file types as clean file ++pages can be evicted regardless of swap constraints. These three ++variables are monotonically increasing. ++ ++Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` ++bits in order to fit into the gen counter in ``folio->flags``. Each ++truncated generation number is an index to ``lrugen->lists[]``. The ++sliding window technique is used to track at least ``MIN_NR_GENS`` and ++at most ``MAX_NR_GENS`` generations. The gen counter stores a value ++within ``[1, MAX_NR_GENS]`` while a page is on one of ++``lrugen->lists[]``; otherwise it stores zero. ++ ++Each generation is divided into multiple tiers. A page accessed ``N`` ++times through file descriptors is in tier ``order_base_2(N)``. Unlike ++generations, tiers do not have dedicated ``lrugen->lists[]``. In ++contrast to moving across generations, which requires the LRU lock, ++moving across tiers only involves atomic operations on ++``folio->flags`` and therefore has a negligible cost. 
A feedback loop ++modeled after the PID controller monitors refaults over all the tiers ++from anon and file types and decides which tiers from which types to ++evict or protect. ++ ++There are two conceptually independent procedures: the aging and the ++eviction. They form a closed-loop system, i.e., the page reclaim. ++ ++Aging ++----- ++The aging produces young generations. Given an ``lruvec``, it ++increments ``max_seq`` when ``max_seq-min_seq+1`` approaches ++``MIN_NR_GENS``. The aging promotes hot pages to the youngest ++generation when it finds them accessed through page tables; the ++demotion of cold pages happens consequently when it increments ++``max_seq``. The aging uses page table walks and rmap walks to find ++young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` ++and calls ``walk_page_range()`` with each ``mm_struct`` on this list ++to scan PTEs, and after each iteration, it increments ``max_seq``. For ++the latter, when the eviction walks the rmap and finds a young PTE, ++the aging scans the adjacent PTEs. For both, on finding a young PTE, ++the aging clears the accessed bit and updates the gen counter of the ++page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. ++ ++Eviction ++-------- ++The eviction consumes old generations. Given an ``lruvec``, it ++increments ``min_seq`` when ``lrugen->lists[]`` indexed by ++``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to ++evict from, it first compares ``min_seq[]`` to select the older type. ++If both types are equally old, it selects the one whose first tier has ++a lower refault percentage. The first tier contains single-use ++unmapped clean pages, which are the best bet. The eviction sorts a ++page according to its gen counter if the aging has found this page ++accessed through page tables and updated its gen counter. 
It also ++moves a page to the next generation, i.e., ``min_seq+1``, if this page ++was accessed multiple times through file descriptors and the feedback ++loop has detected outlying refaults from the tier this page is in. To ++this end, the feedback loop uses the first tier as the baseline, for ++the reason stated earlier. ++ ++Summary ++------- ++The multi-gen LRU can be disassembled into the following parts: ++ ++* Generations ++* Rmap walks ++* Page table walks ++* Bloom filters ++* PID controller ++ ++The aging and the eviction form a producer-consumer model; ++specifically, the latter drives the former by the sliding window over ++generations. Within the aging, rmap walks drive page table walks by ++inserting hot densely populated page tables to the Bloom filters. ++Within the eviction, the PID controller uses refaults as the feedback ++to select types to evict and tiers to protect. +diff --git a/arch/Kconfig b/arch/Kconfig +index 8b311e400ec140..bf19a84fffa21b 100644 +--- a/arch/Kconfig ++++ b/arch/Kconfig +@@ -1418,6 +1418,14 @@ config DYNAMIC_SIGFRAME + config HAVE_ARCH_NODE_DEV_GROUP + bool + ++config ARCH_HAS_NONLEAF_PMD_YOUNG ++ bool ++ help ++ Architectures that select this option are capable of setting the ++ accessed bit in non-leaf PMD entries when using them as part of linear ++ address translations. Page table walkers that clear the accessed bit ++ may use this capability to reduce their search space. 
++ + source "kernel/gcov/Kconfig" + + source "scripts/gcc-plugins/Kconfig" diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h -index b5df82aa99e6..71a1af42f0e8 100644 +index b5df82aa99e64b..71a1af42f0e897 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1082,24 +1082,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, @@ -413,7 +391,7 @@ index b5df82aa99e6..71a1af42f0e8 100644 -} -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young cpu_has_hw_af - + /* * Experimentally, it's cheap to set the access flag in hardware and we * benefit from prefaulting mappings as 'old' to start with. @@ -424,173 +402,11 @@ index b5df82aa99e6..71a1af42f0e8 100644 -} -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte +#define arch_wants_old_prefaulted_pte cpu_has_hw_af - + static inline bool pud_sect_supported(void) { -diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h -index 44e2d6f1dbaa..dc5f7d8ef68a 100644 ---- a/arch/x86/include/asm/pgtable.h -+++ b/arch/x86/include/asm/pgtable.h -@@ -1431,10 +1431,10 @@ static inline bool arch_has_pfn_modify_check(void) - return boot_cpu_has_bug(X86_BUG_L1TF); - } - --#define arch_faults_on_old_pte arch_faults_on_old_pte --static inline bool arch_faults_on_old_pte(void) -+#define arch_has_hw_pte_young arch_has_hw_pte_young -+static inline bool arch_has_hw_pte_young(void) - { -- return false; -+ return true; - } - - #ifdef CONFIG_PAGE_TABLE_CHECK -diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h -index d13b4f7cc5be..375e8e7e64f4 100644 ---- a/include/linux/pgtable.h -+++ b/include/linux/pgtable.h -@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, - #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - #endif - -+#ifndef arch_has_hw_pte_young -+/* -+ * Return whether the accessed bit is supported on the local CPU. 
-+ * -+ * This stub assumes accessing through an old PTE triggers a page fault. -+ * Architectures that automatically set the access bit should overwrite it. -+ */ -+static inline bool arch_has_hw_pte_young(void) -+{ -+ return false; -+} -+#endif -+ - #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR - static inline pte_t ptep_get_and_clear(struct mm_struct *mm, - unsigned long address, -diff --git a/mm/memory.c b/mm/memory.c -index e38f9245470c..3a9b00c765c2 100644 ---- a/mm/memory.c -+++ b/mm/memory.c -@@ -126,18 +126,6 @@ int randomize_va_space __read_mostly = - 2; - #endif - --#ifndef arch_faults_on_old_pte --static inline bool arch_faults_on_old_pte(void) --{ -- /* -- * Those arches which don't have hw access flag feature need to -- * implement their own helper. By default, "true" means pagefault -- * will be hit on old pte. -- */ -- return true; --} --#endif -- - #ifndef arch_wants_old_prefaulted_pte - static inline bool arch_wants_old_prefaulted_pte(void) - { -@@ -2871,7 +2859,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, - * On architectures with software "accessed" bits, we would - * take a double page fault, so mark it accessed here. 
- */ -- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { -+ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { - pte_t entry; - - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao -@ 2022-09-18 7:59 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao - ` (12 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Some architectures support the accessed bit in non-leaf PMD entries, -e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it -as part of linear address translation [1]. Page table walkers that -clear the accessed bit may use this capability to reduce their search -space. - -Note that: -1. Although an inline function is preferable, this capability is added - as a configuration option for consistency with the existing macros. -2. Due to the little interest in other varieties, this capability was - only tested on Intel and AMD CPUs. - -Thanks to the following developers for their efforts [2][3]. 
- Randy Dunlap - Stephen Rothwell - -[1]: Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3 (June 2021), section 4.8 -[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/ -[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/ - -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - arch/Kconfig | 8 ++++++++ - arch/x86/Kconfig | 1 + - arch/x86/include/asm/pgtable.h | 3 ++- - arch/x86/mm/pgtable.c | 5 ++++- - include/linux/pgtable.h | 4 ++-- - 5 files changed, 17 insertions(+), 4 deletions(-) - -diff --git a/arch/Kconfig b/arch/Kconfig -index 5dbf11a5ba4e..1c2599618eeb 100644 ---- a/arch/Kconfig -+++ b/arch/Kconfig -@@ -1415,6 +1415,14 @@ config DYNAMIC_SIGFRAME - config HAVE_ARCH_NODE_DEV_GROUP - bool - -+config ARCH_HAS_NONLEAF_PMD_YOUNG -+ bool -+ help -+ Architectures that select this option are capable of setting the -+ accessed bit in non-leaf PMD entries when using them as part of linear -+ address translations. Page table walkers that clear the accessed bit -+ may use this capability to reduce their search space. 
-+ - source "kernel/gcov/Kconfig" - - source "scripts/gcc-plugins/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig -index f9920f1341c8..674d694a665e 100644 +index f9920f1341c8d4..674d694a665ef5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -85,6 +85,7 @@ config X86 @@ -602,34 +418,48 @@ index f9920f1341c8..674d694a665e 100644 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h -index dc5f7d8ef68a..5059799bebe3 100644 +index 44e2d6f1dbaa87..5059799bebe36d 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -815,7 +815,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) - + static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } - + static inline unsigned long pages_to_mb(unsigned long npg) +@@ -1431,10 +1432,10 @@ static inline bool arch_has_pfn_modify_check(void) + return boot_cpu_has_bug(X86_BUG_L1TF); + } + +-#define arch_faults_on_old_pte arch_faults_on_old_pte +-static inline bool arch_faults_on_old_pte(void) ++#define arch_has_hw_pte_young arch_has_hw_pte_young ++static inline bool arch_has_hw_pte_young(void) + { +- return false; ++ return true; + } + + #ifdef CONFIG_PAGE_TABLE_CHECK diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c -index a932d7712d85..8525f2876fb4 100644 +index a932d7712d851d..8525f2876fb409 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } - + -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, - + return ret; 
} +#endif @@ -638,367 +468,189 @@ index a932d7712d85..8525f2876fb4 100644 int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { -diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h -index 375e8e7e64f4..a108b60a6962 100644 ---- a/include/linux/pgtable.h -+++ b/include/linux/pgtable.h -@@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, - #endif - - #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG --#ifdef CONFIG_TRANSPARENT_HUGEPAGE -+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) - static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, - unsigned long address, - pmd_t *pmdp) -@@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, - BUILD_BUG(); - return 0; +diff --git a/fs/exec.c b/fs/exec.c +index d046dbb9cbd083..c67b12f0f577fe 100644 +--- a/fs/exec.c ++++ b/fs/exec.c +@@ -1011,6 +1011,7 @@ static int exec_mmap(struct mm_struct *mm) + active_mm = tsk->active_mm; + tsk->active_mm = mm; + tsk->mm = mm; ++ lru_gen_add_mm(mm); + /* + * This prevents preemption while active_mm is being loaded and + * it and mm are being updated, which could cause problems for +@@ -1026,6 +1027,7 @@ static int exec_mmap(struct mm_struct *mm) + tsk->mm->vmacache_seqnum = 0; + vmacache_flush(tsk); + task_unlock(tsk); ++ lru_gen_use_mm(mm); + if (old_mm) { + mmap_read_unlock(old_mm); + BUG_ON(active_mm != old_mm); +diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c +index 51897427a5346e..b4a6e0a1b945aa 100644 +--- a/fs/fuse/dev.c ++++ b/fs/fuse/dev.c +@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page) + 1 << PG_active | + 1 << PG_workingset | + 1 << PG_reclaim | +- 1 << PG_waiters))) { ++ 1 << PG_waiters | ++ LRU_GEN_MASK | LRU_REFS_MASK))) { + dump_page(page, "fuse: trying to steal weird page"); + return 1; + } +diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h +index 
ac5d0515680eae..9179463c3c9f82 100644 +--- a/include/linux/cgroup.h ++++ b/include/linux/cgroup.h +@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) + css_put(&cgrp->self); } --#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ - #endif - - #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao - ` (11 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Miaohe Lin, Brian Geffon, Jan Alexander Steffens, - Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, - Daniel Byrne, Donald Carr, Holger Hoffstätte, - Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain - -This patch refactors shrink_node() to improve readability for the -upcoming changes to mm/vmscan.c. 
- -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Reviewed-by: Miaohe Lin -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - mm/vmscan.c | 198 +++++++++++++++++++++++++++------------------------- - 1 file changed, 104 insertions(+), 94 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 992ba6a0bf10..0869cee13a90 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -2728,6 +2728,109 @@ enum scan_balance { - SCAN_FILE, - }; - -+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) + ++extern struct mutex cgroup_mutex; ++ ++static inline void cgroup_lock(void) +{ -+ unsigned long file; -+ struct lruvec *target_lruvec; -+ -+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); -+ -+ /* -+ * Flush the memory cgroup stats, so that we read accurate per-memcg -+ * lruvec stats for heuristics. -+ */ -+ mem_cgroup_flush_stats(); -+ -+ /* -+ * Determine the scan balance between anon and file LRUs. -+ */ -+ spin_lock_irq(&target_lruvec->lru_lock); -+ sc->anon_cost = target_lruvec->anon_cost; -+ sc->file_cost = target_lruvec->file_cost; -+ spin_unlock_irq(&target_lruvec->lru_lock); -+ -+ /* -+ * Target desirable inactive:active list ratios for the anon -+ * and file LRU lists. -+ */ -+ if (!sc->force_deactivate) { -+ unsigned long refaults; -+ -+ /* -+ * When refaults are being observed, it means a new -+ * workingset is being established. Deactivate to get -+ * rid of any stale active pages quickly. 
-+ */ -+ refaults = lruvec_page_state(target_lruvec, -+ WORKINGSET_ACTIVATE_ANON); -+ if (refaults != target_lruvec->refaults[0] || -+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) -+ sc->may_deactivate |= DEACTIVATE_ANON; -+ else -+ sc->may_deactivate &= ~DEACTIVATE_ANON; -+ -+ refaults = lruvec_page_state(target_lruvec, -+ WORKINGSET_ACTIVATE_FILE); -+ if (refaults != target_lruvec->refaults[1] || -+ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) -+ sc->may_deactivate |= DEACTIVATE_FILE; -+ else -+ sc->may_deactivate &= ~DEACTIVATE_FILE; -+ } else -+ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; -+ -+ /* -+ * If we have plenty of inactive file pages that aren't -+ * thrashing, try to reclaim those first before touching -+ * anonymous pages. -+ */ -+ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); -+ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) -+ sc->cache_trim_mode = 1; -+ else -+ sc->cache_trim_mode = 0; -+ -+ /* -+ * Prevent the reclaimer from falling into the cache trap: as -+ * cache pages start out inactive, every cache fault will tip -+ * the scan balance towards the file LRU. And as the file LRU -+ * shrinks, so does the window for rotation from references. -+ * This means we have a runaway feedback loop where a tiny -+ * thrashing file LRU becomes infinitely more attractive than -+ * anon pages. Try to detect this based on file LRU size. 
-+ */ -+ if (!cgroup_reclaim(sc)) { -+ unsigned long total_high_wmark = 0; -+ unsigned long free, anon; -+ int z; -+ -+ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); -+ file = node_page_state(pgdat, NR_ACTIVE_FILE) + -+ node_page_state(pgdat, NR_INACTIVE_FILE); -+ -+ for (z = 0; z < MAX_NR_ZONES; z++) { -+ struct zone *zone = &pgdat->node_zones[z]; -+ -+ if (!managed_zone(zone)) -+ continue; -+ -+ total_high_wmark += high_wmark_pages(zone); -+ } -+ -+ /* -+ * Consider anon: if that's low too, this isn't a -+ * runaway file reclaim problem, but rather just -+ * extreme pressure. Reclaim as per usual then. -+ */ -+ anon = node_page_state(pgdat, NR_INACTIVE_ANON); -+ -+ sc->file_is_tiny = -+ file + free <= total_high_wmark && -+ !(sc->may_deactivate & DEACTIVATE_ANON) && -+ anon >> sc->priority; -+ } ++ mutex_lock(&cgroup_mutex); +} + - /* - * Determine how aggressively the anon and file LRU lists should be - * scanned. -@@ -3195,109 +3298,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) - unsigned long nr_reclaimed, nr_scanned; - struct lruvec *target_lruvec; - bool reclaimable = false; -- unsigned long file; - - target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); - - again: -- /* -- * Flush the memory cgroup stats, so that we read accurate per-memcg -- * lruvec stats for heuristics. -- */ -- mem_cgroup_flush_stats(); -- - memset(&sc->nr, 0, sizeof(sc->nr)); - - nr_reclaimed = sc->nr_reclaimed; - nr_scanned = sc->nr_scanned; - -- /* -- * Determine the scan balance between anon and file LRUs. -- */ -- spin_lock_irq(&target_lruvec->lru_lock); -- sc->anon_cost = target_lruvec->anon_cost; -- sc->file_cost = target_lruvec->file_cost; -- spin_unlock_irq(&target_lruvec->lru_lock); -- -- /* -- * Target desirable inactive:active list ratios for the anon -- * and file LRU lists. 
-- */ -- if (!sc->force_deactivate) { -- unsigned long refaults; -- -- refaults = lruvec_page_state(target_lruvec, -- WORKINGSET_ACTIVATE_ANON); -- if (refaults != target_lruvec->refaults[0] || -- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) -- sc->may_deactivate |= DEACTIVATE_ANON; -- else -- sc->may_deactivate &= ~DEACTIVATE_ANON; -- -- /* -- * When refaults are being observed, it means a new -- * workingset is being established. Deactivate to get -- * rid of any stale active pages quickly. -- */ -- refaults = lruvec_page_state(target_lruvec, -- WORKINGSET_ACTIVATE_FILE); -- if (refaults != target_lruvec->refaults[1] || -- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) -- sc->may_deactivate |= DEACTIVATE_FILE; -- else -- sc->may_deactivate &= ~DEACTIVATE_FILE; -- } else -- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; -- -- /* -- * If we have plenty of inactive file pages that aren't -- * thrashing, try to reclaim those first before touching -- * anonymous pages. -- */ -- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); -- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) -- sc->cache_trim_mode = 1; -- else -- sc->cache_trim_mode = 0; -- -- /* -- * Prevent the reclaimer from falling into the cache trap: as -- * cache pages start out inactive, every cache fault will tip -- * the scan balance towards the file LRU. And as the file LRU -- * shrinks, so does the window for rotation from references. -- * This means we have a runaway feedback loop where a tiny -- * thrashing file LRU becomes infinitely more attractive than -- * anon pages. Try to detect this based on file LRU size. 
-- */ -- if (!cgroup_reclaim(sc)) { -- unsigned long total_high_wmark = 0; -- unsigned long free, anon; -- int z; -- -- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); -- file = node_page_state(pgdat, NR_ACTIVE_FILE) + -- node_page_state(pgdat, NR_INACTIVE_FILE); -- -- for (z = 0; z < MAX_NR_ZONES; z++) { -- struct zone *zone = &pgdat->node_zones[z]; -- if (!managed_zone(zone)) -- continue; -- -- total_high_wmark += high_wmark_pages(zone); -- } -- -- /* -- * Consider anon: if that's low too, this isn't a -- * runaway file reclaim problem, but rather just -- * extreme pressure. Reclaim as per usual then. -- */ -- anon = node_page_state(pgdat, NR_INACTIVE_ANON); -- -- sc->file_is_tiny = -- file + free <= total_high_wmark && -- !(sc->may_deactivate & DEACTIVATE_ANON) && -- anon >> sc->priority; -- } -+ prepare_scan_count(pgdat, sc); - - shrink_node_memcgs(pgdat, sc); - --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (2 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork Yu Zhao - ` (10 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Miaohe Lin, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -This patch undoes the following refactor: -commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller") - -The upcoming changes to include/linux/mm_inline.h will reuse -__update_lru_size(). 
- -Signed-off-by: Yu Zhao -Reviewed-by: Miaohe Lin -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mm_inline.h | 9 ++++++++- - 1 file changed, 8 insertions(+), 1 deletion(-) - ++static inline void cgroup_unlock(void) ++{ ++ mutex_unlock(&cgroup_mutex); ++} ++ + /** + * task_css_set_check - obtain a task's css_set with extra access conditions + * @task: the task to obtain css_set for +@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) + * as locks used during the cgroup_subsys::attach() methods. + */ + #ifdef CONFIG_PROVE_RCU +-extern struct mutex cgroup_mutex; + extern spinlock_t css_set_lock; + #define task_css_set_check(task, __c) \ + rcu_dereference_check((task)->cgroups, \ +@@ -708,6 +719,8 @@ struct cgroup; + static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } + static inline void css_get(struct cgroup_subsys_state *css) {} + static inline void css_put(struct cgroup_subsys_state *css) {} ++static inline void cgroup_lock(void) {} ++static inline void cgroup_unlock(void) {} + static inline int cgroup_attach_task_all(struct task_struct *from, + struct task_struct *t) { return 0; } + static inline int cgroupstats_build(struct cgroupstats *stats, +diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h +index 567f12323f553e..877cbcbc6ed98a 100644 +--- a/include/linux/memcontrol.h ++++ b/include/linux/memcontrol.h +@@ -350,6 +350,11 @@ struct mem_cgroup { + struct deferred_split deferred_split_queue; + #endif + ++#ifdef CONFIG_LRU_GEN ++ /* per-memcg mm_struct list */ ++ struct lru_gen_mm_list mm_list; ++#endif ++ + struct mem_cgroup_per_node *nodeinfo[]; + }; + +@@ -444,6 +449,7 @@ static inline struct 
obj_cgroup *__folio_objcg(struct folio *folio)
+ * - LRU isolation
+ * - lock_page_memcg()
+ * - exclusive reference
++ * - mem_cgroup_trylock_pages()
+ *
+ * For a kmem folio a caller should hold an rcu read lock to protect memcg
+ * associated with a kmem folio from being released.
+@@ -505,6 +511,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
+ * - LRU isolation
+ * - lock_page_memcg()
+ * - exclusive reference
++ * - mem_cgroup_trylock_pages()
+ *
+ * For a kmem page a caller should hold an rcu read lock to protect memcg
+ * associated with a kmem page from being released.
+@@ -959,6 +966,23 @@ void unlock_page_memcg(struct page *page);
+
+ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+
++/* try to stabilize folio_memcg() for all the pages in a memcg */
++static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
++{
++ rcu_read_lock();
++
++ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
++ return true;
++
++ rcu_read_unlock();
++ return false;
++}
++
++static inline void mem_cgroup_unlock_pages(void)
++{
++ rcu_read_unlock();
++}
++
+ /* idx can be of type enum memcg_stat_item or node_stat_item */
+ static inline void mod_memcg_state(struct mem_cgroup *memcg,
+ int idx, int val)
+@@ -1433,6 +1457,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
+ {
+ }
+
++static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
++{
++ /* to match folio_memcg_rcu() */
++ rcu_read_lock();
++ return true;
++}
++
++static inline void mem_cgroup_unlock_pages(void)
++{
++ rcu_read_unlock();
++}
++
+ static inline void mem_cgroup_handle_over_high(void)
+ {
+ }
+diff --git a/include/linux/mm.h b/include/linux/mm.h
+index 21f8b27bd9fd30..88976a521ef546 100644
+--- a/include/linux/mm.h
++++ b/include/linux/mm.h
+@@ -1465,6 +1465,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
+ return page_to_pfn(&folio->page);
+ }
+
++static inline struct folio 
*pfn_folio(unsigned long pfn) ++{ ++ return page_folio(pfn_to_page(pfn)); ++} ++ + static inline atomic_t *folio_pincount_ptr(struct folio *folio) + { + return &folio_page(folio, 1)->compound_pincount; diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index 7b25b53c474a..fb8aadb81cd6 100644 +index 7b25b53c474a7f..4949eda9a9a2ab 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h -@@ -34,7 +34,7 @@ static inline int page_is_file_lru(struct page *page) +@@ -34,15 +34,25 @@ static inline int page_is_file_lru(struct page *page) return folio_is_file_lru(page_folio(page)); } - + -static __always_inline void update_lru_size(struct lruvec *lruvec, +static __always_inline void __update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, long nr_pages) { -@@ -43,6 +43,13 @@ static __always_inline void update_lru_size(struct lruvec *lruvec, + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + ++ lockdep_assert_held(&lruvec->lru_lock); ++ WARN_ON_ONCE(nr_pages != (int)nr_pages); ++ __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); @@ -1012,164 +664,27 @@ index 7b25b53c474a..fb8aadb81cd6 100644 #ifdef CONFIG_MEMCG mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages); #endif --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (3 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao - ` (9 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Evictable pages are divided into multiple generations for each lruvec. -The youngest generation number is stored in lrugen->max_seq for both -anon and file types as they are aged on an equal footing. The oldest -generation numbers are stored in lrugen->min_seq[] separately for anon -and file types as clean file pages can be evicted regardless of swap -constraints. These three variables are monotonically increasing. - -Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits -in order to fit into the gen counter in folio->flags. Each truncated -generation number is an index to lrugen->lists[]. The sliding window -technique is used to track at least MIN_NR_GENS and at most -MAX_NR_GENS generations. The gen counter stores a value within [1, -MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it -stores 0. - -There are two conceptually independent procedures: "the aging", which -produces young generations, and "the eviction", which consumes old -generations. 
They form a closed-loop system, i.e., "the page reclaim". -Both procedures can be invoked from userspace for the purposes of -working set estimation and proactive reclaim. These techniques are -commonly used to optimize job scheduling (bin packing) in data -centers [1][2]. - -To avoid confusion, the terms "hot" and "cold" will be applied to the -multi-gen LRU, as a new convention; the terms "active" and "inactive" -will be applied to the active/inactive LRU, as usual. - -The protection of hot pages and the selection of cold pages are based -on page access channels and patterns. There are two access channels: -one through page tables and the other through file descriptors. The -protection of the former channel is by design stronger because: -1. The uncertainty in determining the access patterns of the former - channel is higher due to the approximation of the accessed bit. -2. The cost of evicting the former channel is higher due to the TLB - flushes required and the likelihood of encountering the dirty bit. -3. The penalty of underprotecting the former channel is higher because - applications usually do not prepare themselves for major page - faults like they do for blocked I/O. E.g., GUI applications - commonly use dedicated I/O threads to avoid blocking rendering - threads. -There are also two access patterns: one with temporal locality and the -other without. For the reasons listed above, the former channel is -assumed to follow the former pattern unless VM_SEQ_READ or -VM_RAND_READ is present; the latter channel is assumed to follow the -latter pattern unless outlying refaults have been observed [3][4]. - -The next patch will address the "outlying refaults". Three macros, -i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are -added in this patch to make the entire patchset less diffy. - -A page is added to the youngest generation on faulting. 
The aging -needs to check the accessed bit at least twice before handing this -page over to the eviction. The first check takes care of the accessed -bit set on the initial fault; the second check makes sure this page -has not been used since then. This protocol, AKA second chance, -requires a minimum of two generations, hence MIN_NR_GENS. - -[1] https://dl.acm.org/doi/10.1145/3297858.3304053 -[2] https://dl.acm.org/doi/10.1145/3503222.3507731 -[3] https://lwn.net/Articles/495543/ -[4] https://lwn.net/Articles/815342/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - fs/fuse/dev.c | 3 +- - include/linux/mm_inline.h | 175 ++++++++++++++++++++++++++++++ - include/linux/mmzone.h | 102 +++++++++++++++++ - include/linux/page-flags-layout.h | 13 ++- - include/linux/page-flags.h | 4 +- - include/linux/sched.h | 4 + - kernel/bounds.c | 5 + - mm/Kconfig | 8 ++ - mm/huge_memory.c | 3 +- - mm/memcontrol.c | 2 + - mm/memory.c | 25 +++++ - mm/mm_init.c | 6 +- - mm/mmzone.c | 2 + - mm/swap.c | 11 +- - mm/vmscan.c | 75 +++++++++++++ - 15 files changed, 424 insertions(+), 14 deletions(-) - -diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c -index 51897427a534..b4a6e0a1b945 100644 ---- a/fs/fuse/dev.c -+++ b/fs/fuse/dev.c -@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page) - 1 << PG_active | - 1 << PG_workingset | - 1 << PG_reclaim | -- 1 << PG_waiters))) { -+ 1 << PG_waiters | -+ LRU_GEN_MASK | LRU_REFS_MASK))) { - dump_page(page, "fuse: trying to steal weird page"); - return 1; - } -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index fb8aadb81cd6..2ff703900fd0 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ 
-40,6 +40,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, - { - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - -+ lockdep_assert_held(&lruvec->lru_lock); -+ WARN_ON_ONCE(nr_pages != (int)nr_pages); -+ - __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); - __mod_zone_page_state(&pgdat->node_zones[zid], - NR_ZONE_LRU_BASE + lru, nr_pages); -@@ -101,11 +104,177 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) +@@ -94,11 +104,224 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) return lru; } - + +#ifdef CONFIG_LRU_GEN + ++#ifdef CONFIG_LRU_GEN_ENABLED +static inline bool lru_gen_enabled(void) +{ -+ return true; ++ DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); ++ ++ return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); +} ++#else ++static inline bool lru_gen_enabled(void) ++{ ++ DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); ++ ++ return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); ++} ++#endif + +static inline bool lru_gen_in_fault(void) +{ @@ -1181,6 +696,33 @@ index fb8aadb81cd6..2ff703900fd0 100644 + return seq % MAX_NR_GENS; +} + ++static inline int lru_hist_from_seq(unsigned long seq) ++{ ++ return seq % NR_HIST_GENS; ++} ++ ++static inline int lru_tier_from_refs(int refs) ++{ ++ VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); ++ ++ /* see the comment in folio_lru_refs() */ ++ return order_base_2(refs + 1); ++} ++ ++static inline int folio_lru_refs(struct folio *folio) ++{ ++ unsigned long flags = READ_ONCE(folio->flags); ++ bool workingset = flags & BIT(PG_workingset); ++ ++ /* ++ * Return the number of accesses beyond PG_referenced, i.e., N-1 if the ++ * total number of accesses is N>1, since N=0,1 both map to the first ++ * tier. lru_tier_from_refs() will account for this off-by-one. Also see ++ * the comment on MAX_NR_TIERS. 
++ */ ++ return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; ++} ++ +static inline int folio_lru_gen(struct folio *folio) +{ + unsigned long flags = READ_ONCE(folio->flags); @@ -1233,6 +775,15 @@ index fb8aadb81cd6..2ff703900fd0 100644 + __update_lru_size(lruvec, lru, zone, -delta); + return; + } ++ ++ /* promotion */ ++ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { ++ __update_lru_size(lruvec, lru, zone, -delta); ++ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); ++ } ++ ++ /* demotion requires isolation, e.g., lru_deactivate_fn() */ ++ VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); +} + +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) @@ -1246,7 +797,7 @@ index fb8aadb81cd6..2ff703900fd0 100644 + + VM_WARN_ON_ONCE_FOLIO(gen != -1, folio); + -+ if (folio_test_unevictable(folio)) ++ if (folio_test_unevictable(folio) || !lrugen->enabled) + return false; + /* + * There are three common cases for this page: @@ -1331,2665 +882,35 @@ index fb8aadb81cd6..2ff703900fd0 100644 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_add_folio(lruvec, folio, false)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); if (lru != LRU_UNEVICTABLE) -@@ -123,6 +292,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) +@@ -116,6 +339,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_add_folio(lruvec, folio, true)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); /* This is not expected to be used on LRU_UNEVICTABLE */ -@@ -140,6 +312,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) +@@ -133,6 +359,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct 
folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_del_folio(lruvec, folio, false)) + return; + if (lru != LRU_UNEVICTABLE) list_del(&folio->lru); update_lru_size(lruvec, lru, folio_zonenum(folio), -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 18cf0fc5ce67..6f4ea078d90f 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -317,6 +317,102 @@ enum lruvec_flags { - */ - }; - -+#endif /* !__GENERATING_BOUNDS_H */ -+ -+/* -+ * Evictable pages are divided into multiple generations. The youngest and the -+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing. -+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An -+ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the -+ * corresponding generation. The gen counter in folio->flags stores gen+1 while -+ * a page is on one of lrugen->lists[]. Otherwise it stores 0. -+ * -+ * A page is added to the youngest generation on faulting. The aging needs to -+ * check the accessed bit at least twice before handing this page over to the -+ * eviction. The first check takes care of the accessed bit set on the initial -+ * fault; the second check makes sure this page hasn't been used since then. -+ * This process, AKA second chance, requires a minimum of two generations, -+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive -+ * LRU, e.g., /proc/vmstat, these two generations are considered active; the -+ * rest of generations, if they exist, are considered inactive. See -+ * lru_gen_is_active(). -+ * -+ * PG_active is always cleared while a page is on one of lrugen->lists[] so that -+ * the aging needs not to worry about it. And it's set again when a page -+ * considered active is isolated for non-reclaiming purposes, e.g., migration. -+ * See lru_gen_add_folio() and lru_gen_del_folio(). 
-+ * -+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the -+ * number of categories of the active/inactive LRU when keeping track of -+ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits -+ * in folio->flags. -+ */ -+#define MIN_NR_GENS 2U -+#define MAX_NR_GENS 4U -+ -+#ifndef __GENERATING_BOUNDS_H -+ -+struct lruvec; -+ -+#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) -+#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) -+ -+#ifdef CONFIG_LRU_GEN -+ -+enum { -+ LRU_GEN_ANON, -+ LRU_GEN_FILE, -+}; -+ -+/* -+ * The youngest generation number is stored in max_seq for both anon and file -+ * types as they are aged on an equal footing. The oldest generation numbers are -+ * stored in min_seq[] separately for anon and file types as clean file pages -+ * can be evicted regardless of swap constraints. -+ * -+ * Normally anon and file min_seq are in sync. But if swapping is constrained, -+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon -+ * min_seq behind. -+ * -+ * The number of pages in each generation is eventually consistent and therefore -+ * can be transiently negative. 
-+ */ -+struct lru_gen_struct { -+ /* the aging increments the youngest generation number */ -+ unsigned long max_seq; -+ /* the eviction increments the oldest generation numbers */ -+ unsigned long min_seq[ANON_AND_FILE]; -+ /* the multi-gen LRU lists, lazily sorted on eviction */ -+ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+ /* the multi-gen LRU sizes, eventually consistent */ -+ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+}; -+ -+void lru_gen_init_lruvec(struct lruvec *lruvec); -+ -+#ifdef CONFIG_MEMCG -+void lru_gen_init_memcg(struct mem_cgroup *memcg); -+void lru_gen_exit_memcg(struct mem_cgroup *memcg); -+#endif -+ -+#else /* !CONFIG_LRU_GEN */ -+ -+static inline void lru_gen_init_lruvec(struct lruvec *lruvec) -+{ -+} -+ -+#ifdef CONFIG_MEMCG -+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) -+{ -+} -+ -+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) -+{ -+} -+#endif -+ -+#endif /* CONFIG_LRU_GEN */ -+ - struct lruvec { - struct list_head lists[NR_LRU_LISTS]; - /* per lruvec lru_lock for memcg */ -@@ -334,6 +430,10 @@ struct lruvec { - unsigned long refaults[ANON_AND_FILE]; - /* Various lruvec state flags (enum lruvec_flags) */ - unsigned long flags; -+#ifdef CONFIG_LRU_GEN -+ /* evictable pages divided into generations */ -+ struct lru_gen_struct lrugen; -+#endif - #ifdef CONFIG_MEMCG - struct pglist_data *pgdat; - #endif -@@ -749,6 +849,8 @@ static inline bool zone_is_empty(struct zone *zone) - #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) - #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) - #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) -+#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) -+#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) - - /* - * Define the bit shifts to access each section. 
For non-existent -diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h -index ef1e3e736e14..240905407a18 100644 ---- a/include/linux/page-flags-layout.h -+++ b/include/linux/page-flags-layout.h -@@ -55,7 +55,8 @@ - #define SECTIONS_WIDTH 0 - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ -+ <= BITS_PER_LONG - NR_PAGEFLAGS - #define NODES_WIDTH NODES_SHIFT - #elif defined(CONFIG_SPARSEMEM_VMEMMAP) - #error "Vmemmap: No space for nodes field in page flags" -@@ -89,8 +90,8 @@ - #define LAST_CPUPID_SHIFT 0 - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ -- <= BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ -+ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS - #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT - #else - #define LAST_CPUPID_WIDTH 0 -@@ -100,10 +101,12 @@ - #define LAST_CPUPID_NOT_IN_PAGE_FLAGS - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ -- > BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ -+ KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS - #error "Not enough bits in page flags" - #endif - -+#define LRU_REFS_WIDTH 0 -+ - #endif - #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ -diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h -index 465ff35a8c00..0b0ae5084e60 100644 ---- a/include/linux/page-flags.h -+++ b/include/linux/page-flags.h -@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) - 1UL << PG_private | 1UL << PG_private_2 | \ - 1UL << PG_writeback | 1UL << PG_reserved | \ - 1UL << PG_slab | 1UL << PG_active | \ -- 1UL << PG_unevictable | __PG_MLOCKED) -+ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) - - /* - * Flags 
checked when a page is prepped for return by the page allocator. -@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) - * alloc-free cycle to prevent from reusing the page. - */ - #define PAGE_FLAGS_CHECK_AT_PREP \ -- (PAGEFLAGS_MASK & ~__PG_HWPOISON) -+ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) - - #define PAGE_FLAGS_PRIVATE \ - (1UL << PG_private | 1UL << PG_private_2) -diff --git a/include/linux/sched.h b/include/linux/sched.h -index e7b2f8a5c711..8cc46a789193 100644 ---- a/include/linux/sched.h -+++ b/include/linux/sched.h -@@ -914,6 +914,10 @@ struct task_struct { - #ifdef CONFIG_MEMCG - unsigned in_user_fault:1; - #endif -+#ifdef CONFIG_LRU_GEN -+ /* whether the LRU algorithm may apply to this access */ -+ unsigned in_lru_fault:1; -+#endif - #ifdef CONFIG_COMPAT_BRK - unsigned brk_randomized:1; - #endif -diff --git a/kernel/bounds.c b/kernel/bounds.c -index 9795d75b09b2..5ee60777d8e4 100644 ---- a/kernel/bounds.c -+++ b/kernel/bounds.c -@@ -22,6 +22,11 @@ int main(void) - DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); - #endif - DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); -+#ifdef CONFIG_LRU_GEN -+ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); -+#else -+ DEFINE(LRU_GEN_WIDTH, 0); -+#endif - /* End of constants */ - - return 0; -diff --git a/mm/Kconfig b/mm/Kconfig -index e3fbd0788878..378306aee622 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1118,6 +1118,14 @@ config PTE_MARKER_UFFD_WP - purposes. It is required to enable userfaultfd write protection on - file-backed memory types like shmem and hugetlbfs. - -+config LRU_GEN -+ bool "Multi-Gen LRU" -+ depends on MMU -+ # make sure folio->flags has enough spare bits -+ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP -+ help -+ A high performance LRU implementation to overcommit memory. 
-+ - source "mm/damon/Kconfig" - - endmenu -diff --git a/mm/huge_memory.c b/mm/huge_memory.c -index f4a656b279b1..949d7c325133 100644 ---- a/mm/huge_memory.c -+++ b/mm/huge_memory.c -@@ -2444,7 +2444,8 @@ static void __split_huge_page_tail(struct page *head, int tail, - #ifdef CONFIG_64BIT - (1L << PG_arch_2) | - #endif -- (1L << PG_dirty))); -+ (1L << PG_dirty) | -+ LRU_GEN_MASK | LRU_REFS_MASK)); - - /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, -diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 403af5f7a2b9..937141d48221 100644 ---- a/mm/memcontrol.c -+++ b/mm/memcontrol.c -@@ -5175,6 +5175,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) - - static void mem_cgroup_free(struct mem_cgroup *memcg) - { -+ lru_gen_exit_memcg(memcg); - memcg_wb_domain_exit(memcg); - __mem_cgroup_free(memcg); - } -@@ -5233,6 +5234,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) - memcg->deferred_split_queue.split_queue_len = 0; - #endif - idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); -+ lru_gen_init_memcg(memcg); - return memcg; - fail: - mem_cgroup_id_remove(memcg); -diff --git a/mm/memory.c b/mm/memory.c -index 3a9b00c765c2..63832dab15d3 100644 ---- a/mm/memory.c -+++ b/mm/memory.c -@@ -5117,6 +5117,27 @@ static inline void mm_account_fault(struct pt_regs *regs, - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } - -+#ifdef CONFIG_LRU_GEN -+static void lru_gen_enter_fault(struct vm_area_struct *vma) -+{ -+ /* the LRU algorithm doesn't apply to sequential or random reads */ -+ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); -+} -+ -+static void lru_gen_exit_fault(void) -+{ -+ current->in_lru_fault = false; -+} -+#else -+static void lru_gen_enter_fault(struct vm_area_struct *vma) -+{ -+} -+ -+static void lru_gen_exit_fault(void) -+{ -+} -+#endif /* CONFIG_LRU_GEN */ -+ - /* - * By the time we get here, we already hold the mm semaphore - * -@@ 
-5148,11 +5169,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, - if (flags & FAULT_FLAG_USER) - mem_cgroup_enter_user_fault(); - -+ lru_gen_enter_fault(vma); -+ - if (unlikely(is_vm_hugetlb_page(vma))) - ret = hugetlb_fault(vma->vm_mm, vma, address, flags); - else - ret = __handle_mm_fault(vma, address, flags); - -+ lru_gen_exit_fault(); -+ - if (flags & FAULT_FLAG_USER) { - mem_cgroup_exit_user_fault(); - /* -diff --git a/mm/mm_init.c b/mm/mm_init.c -index 9ddaf0e1b0ab..0d7b2bd2454a 100644 ---- a/mm/mm_init.c -+++ b/mm/mm_init.c -@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) - - shift = 8 * sizeof(unsigned long); - width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH -- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; -+ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; - mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", -- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", -+ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n", - SECTIONS_WIDTH, - NODES_WIDTH, - ZONES_WIDTH, - LAST_CPUPID_WIDTH, - KASAN_TAG_WIDTH, -+ LRU_GEN_WIDTH, -+ LRU_REFS_WIDTH, - NR_PAGEFLAGS); - mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", -diff --git a/mm/mmzone.c b/mm/mmzone.c -index 0ae7571e35ab..68e1511be12d 100644 ---- a/mm/mmzone.c -+++ b/mm/mmzone.c -@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec) - * Poison its list head, so that any operations on it would crash. 
- */ - list_del(&lruvec->lists[LRU_UNEVICTABLE]); -+ -+ lru_gen_init_lruvec(lruvec); - } - - #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) -diff --git a/mm/swap.c b/mm/swap.c -index 9cee7f6a3809..0e423b7d458b 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -484,6 +484,11 @@ void folio_add_lru(struct folio *folio) - folio_test_unevictable(folio), folio); - VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - -+ /* see the comment in lru_gen_add_folio() */ -+ if (lru_gen_enabled() && !folio_test_unevictable(folio) && -+ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) -+ folio_set_active(folio); -+ - folio_get(folio); - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); -@@ -575,7 +580,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) - - static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) - { -- if (folio_test_active(folio) && !folio_test_unevictable(folio)) { -+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) { - long nr_pages = folio_nr_pages(folio); - - lruvec_del_folio(lruvec, folio); -@@ -688,8 +693,8 @@ void deactivate_page(struct page *page) - { - struct folio *folio = page_folio(page); - -- if (folio_test_lru(folio) && folio_test_active(folio) && -- !folio_test_unevictable(folio)) { -+ if (folio_test_lru(folio) && !folio_test_unevictable(folio) && -+ (folio_test_active(folio) || lru_gen_enabled())) { - struct folio_batch *fbatch; - - folio_get(folio); -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 0869cee13a90..8d41c4ef430e 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -3050,6 +3050,81 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - return can_demote(pgdat->node_id, sc); - } - -+#ifdef CONFIG_LRU_GEN -+ -+/****************************************************************************** -+ * shorthand helpers -+ ******************************************************************************/ 
-+ -+#define for_each_gen_type_zone(gen, type, zone) \ -+ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ -+ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ -+ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) -+ -+static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid) -+{ -+ struct pglist_data *pgdat = NODE_DATA(nid); -+ -+#ifdef CONFIG_MEMCG -+ if (memcg) { -+ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; -+ -+ /* for hotadd_new_pgdat() */ -+ if (!lruvec->pgdat) -+ lruvec->pgdat = pgdat; -+ -+ return lruvec; -+ } -+#endif -+ VM_WARN_ON_ONCE(!mem_cgroup_disabled()); -+ -+ return pgdat ? &pgdat->__lruvec : NULL; -+} -+ -+/****************************************************************************** -+ * initialization -+ ******************************************************************************/ -+ -+void lru_gen_init_lruvec(struct lruvec *lruvec) -+{ -+ int gen, type, zone; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ lrugen->max_seq = MIN_NR_GENS + 1; -+ -+ for_each_gen_type_zone(gen, type, zone) -+ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -+} -+ -+#ifdef CONFIG_MEMCG -+void lru_gen_init_memcg(struct mem_cgroup *memcg) -+{ -+} -+ -+void lru_gen_exit_memcg(struct mem_cgroup *memcg) -+{ -+ int nid; -+ -+ for_each_node(nid) { -+ struct lruvec *lruvec = get_lruvec(memcg, nid); -+ -+ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, -+ sizeof(lruvec->lrugen.nr_pages))); -+ } -+} -+#endif -+ -+static int __init init_lru_gen(void) -+{ -+ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); -+ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); -+ -+ return 0; -+}; -+late_initcall(init_lru_gen); -+ -+#endif /* CONFIG_LRU_GEN */ -+ - static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - { - unsigned long nr[NR_LRU_LISTS]; --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] 
Multi-Gen LRU Framework Yu Zhao - ` (4 preceding siblings ...) - 2022-09-18 8:00 ` [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao - ` (8 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -To avoid confusion, the terms "promotion" and "demotion" will be -applied to the multi-gen LRU, as a new convention; the terms -"activation" and "deactivation" will be applied to the active/inactive -LRU, as usual. - -The aging produces young generations. Given an lruvec, it increments -max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging -promotes hot pages to the youngest generation when it finds them -accessed through page tables; the demotion of cold pages happens -consequently when it increments max_seq. Promotion in the aging path -does not involve any LRU list operations, only the updates of the gen -counter and lrugen->nr_pages[]; demotion, unless as the result of the -increment of max_seq, requires LRU list operations, e.g., -lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), -since it is only interested in hot pages. - -The eviction consumes old generations. 
Given an lruvec, it increments -min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes -empty. A feedback loop modeled after the PID controller monitors -refaults over anon and file types and decides which type to evict when -both types are available from the same generation. - -The protection of pages accessed multiple times through file -descriptors takes place in the eviction path. Each generation is -divided into multiple tiers. A page accessed N times through file -descriptors is in tier order_base_2(N). Tiers do not have dedicated -lrugen->lists[], only bits in folio->flags. The aforementioned -feedback loop also monitors refaults over all tiers and decides when -to protect pages in which tiers (N>1), using the first tier (N=0,1) as -a baseline. The first tier contains single-use unmapped clean pages, -which are most likely the best choices. In contrast to promotion in -the aging path, the protection of a page in the eviction path is -achieved by moving this page to the next generation, i.e., min_seq+1, -if the feedback loop decides so. This approach has the following -advantages: -1. It removes the cost of activation in the buffered access path by - inferring whether pages accessed multiple times through file - descriptors are statistically hot and thus worth protecting in the - eviction path. -2. It takes pages accessed through page tables into account and avoids - overprotecting pages accessed multiple times through file - descriptors. (Pages accessed through page tables are in the first - tier, since N=0.) -3. More tiers provide better protection for pages accessed more than - twice through file descriptors, when under heavy buffered I/O - workloads. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): +[30, 32]% - IOPS BW - 5.19-rc1: 2673k 10.2GiB/s - patch1-6: 3491k 13.3GiB/s - - Single workload: - memcached (anon): -[4, 6]% - Ops/sec KB/sec - 5.19-rc1: 1161501.04 45177.25 - patch1-6: 1106168.46 43025.04 - - Configurations: - CPU: two Xeon 6154 - Mem: total 256G - - Node 1 was only used as a ram disk to reduce the variance in the - results. - - patch drivers/block/brd.c < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; - > page = alloc_pages_node(1, gfp_flags, 0); - EOF - - cat >>/etc/systemd/system.conf <>/etc/memcached.conf </sys/fs/cgroup/user.slice/test/memory.max - echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs - fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ - --buffered=1 --ioengine=io_uring --iodepth=128 \ - --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ - --rw=randread --random_distribution=random --norandommap \ - --time_based --ramp_time=10m --runtime=5m --group_reporting - - cat memcached.sh - modprobe brd rd_nr=1 rd_size=113246208 - swapoff -a - mkswap /dev/ram0 - swapon /dev/ram0 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ - --ratio 1:0 --pipeline 8 -d 2000 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ - --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed - -Client benchmark results: - kswapd profiles: - 5.19-rc1 - 40.33% page_vma_mapped_walk (overhead) - 21.80% lzo1x_1_do_compress (real work) - 7.53% do_raw_spin_lock - 3.95% _raw_spin_unlock_irq - 2.52% vma_interval_tree_iter_next - 2.37% folio_referenced_one - 2.28% vma_interval_tree_subtree_search - 1.97% anon_vma_interval_tree_iter_first - 1.60% ptep_clear_flush - 1.06% __zram_bvec_write - - patch1-6 - 39.03% lzo1x_1_do_compress 
(real work) - 18.47% page_vma_mapped_walk (overhead) - 6.74% _raw_spin_unlock_irq - 3.97% do_raw_spin_lock - 2.49% ptep_clear_flush - 2.48% anon_vma_interval_tree_iter_first - 1.92% folio_referenced_one - 1.88% __zram_bvec_write - 1.48% memmove - 1.31% vma_interval_tree_iter_next - - Configurations: - CPU: single Snapdragon 7c - Mem: total 4G - - ChromeOS MemoryPressure [1] - -[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mm_inline.h | 36 ++ - include/linux/mmzone.h | 41 ++ - include/linux/page-flags-layout.h | 5 +- - kernel/bounds.c | 2 + - mm/Kconfig | 11 + - mm/swap.c | 39 ++ - mm/vmscan.c | 792 +++++++++++++++++++++++++++++- - mm/workingset.c | 110 ++++- - 8 files changed, 1025 insertions(+), 11 deletions(-) - -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index 2ff703900fd0..f2b2296a42f9 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ -121,6 +121,33 @@ static inline int lru_gen_from_seq(unsigned long seq) - return seq % MAX_NR_GENS; - } - -+static inline int lru_hist_from_seq(unsigned long seq) -+{ -+ return seq % NR_HIST_GENS; -+} -+ -+static inline int lru_tier_from_refs(int refs) -+{ -+ VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); -+ -+ /* see the comment in folio_lru_refs() */ -+ return order_base_2(refs + 1); -+} -+ -+static inline int folio_lru_refs(struct folio *folio) -+{ -+ unsigned long flags = READ_ONCE(folio->flags); -+ bool workingset = flags & BIT(PG_workingset); -+ -+ /* -+ * Return the number of accesses beyond PG_referenced, i.e., N-1 if the -+ * total number of accesses is N>1, since N=0,1 
both map to the first -+ * tier. lru_tier_from_refs() will account for this off-by-one. Also see -+ * the comment on MAX_NR_TIERS. -+ */ -+ return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; -+} -+ - static inline int folio_lru_gen(struct folio *folio) - { - unsigned long flags = READ_ONCE(folio->flags); -@@ -173,6 +200,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli - __update_lru_size(lruvec, lru, zone, -delta); - return; - } -+ -+ /* promotion */ -+ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { -+ __update_lru_size(lruvec, lru, zone, -delta); -+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); -+ } -+ -+ /* demotion requires isolation, e.g., lru_deactivate_fn() */ -+ VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); - } - - static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 6f4ea078d90f..7e343420bfb1 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -350,6 +350,28 @@ enum lruvec_flags { - #define MIN_NR_GENS 2U - #define MAX_NR_GENS 4U - -+/* -+ * Each generation is divided into multiple tiers. A page accessed N times -+ * through file descriptors is in tier order_base_2(N). A page in the first tier -+ * (N=0,1) is marked by PG_referenced unless it was faulted in through page -+ * tables or read ahead. A page in any other tier (N>1) is marked by -+ * PG_referenced and PG_workingset. This implies a minimum of two tiers is -+ * supported without using additional bits in folio->flags. -+ * -+ * In contrast to moving across generations which requires the LRU lock, moving -+ * across tiers only involves atomic operations on folio->flags and therefore -+ * has a negligible cost in the buffered access path. 
In the eviction path, -+ * comparisons of refaulted/(evicted+protected) from the first tier and the -+ * rest infer whether pages accessed multiple times through file descriptors -+ * are statistically hot and thus worth protecting. -+ * -+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the -+ * number of categories of the active/inactive LRU when keeping track of -+ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in -+ * folio->flags. -+ */ -+#define MAX_NR_TIERS 4U -+ - #ifndef __GENERATING_BOUNDS_H - - struct lruvec; -@@ -364,6 +386,16 @@ enum { - LRU_GEN_FILE, - }; - -+#define MIN_LRU_BATCH BITS_PER_LONG -+#define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) -+ -+/* whether to keep historical stats from evicted generations */ -+#ifdef CONFIG_LRU_GEN_STATS -+#define NR_HIST_GENS MAX_NR_GENS -+#else -+#define NR_HIST_GENS 1U -+#endif -+ - /* - * The youngest generation number is stored in max_seq for both anon and file - * types as they are aged on an equal footing. 
The oldest generation numbers are -@@ -386,6 +418,15 @@ struct lru_gen_struct { - struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; - /* the multi-gen LRU sizes, eventually consistent */ - long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+ /* the exponential moving average of refaulted */ -+ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; -+ /* the exponential moving average of evicted+protected */ -+ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; -+ /* the first tier doesn't need protection, hence the minus one */ -+ unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; -+ /* can be modified without holding the LRU lock */ -+ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; -+ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; - }; - - void lru_gen_init_lruvec(struct lruvec *lruvec); -diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h -index 240905407a18..7d79818dc065 100644 ---- a/include/linux/page-flags-layout.h -+++ b/include/linux/page-flags-layout.h -@@ -106,7 +106,10 @@ - #error "Not enough bits in page flags" - #endif - --#define LRU_REFS_WIDTH 0 -+/* see the comment on MAX_NR_TIERS */ -+#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ -+ ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ -+ NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) - - #endif - #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ -diff --git a/kernel/bounds.c b/kernel/bounds.c -index 5ee60777d8e4..b529182e8b04 100644 ---- a/kernel/bounds.c -+++ b/kernel/bounds.c -@@ -24,8 +24,10 @@ int main(void) - DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); - #ifdef CONFIG_LRU_GEN - DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); -+ DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); - #else - DEFINE(LRU_GEN_WIDTH, 0); -+ DEFINE(__LRU_REFS_WIDTH, 0); - #endif - /* End of constants */ - -diff --git a/mm/Kconfig b/mm/Kconfig -index 378306aee622..5c5dcbdcfe34 100644 
---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1118,6 +1118,7 @@ config PTE_MARKER_UFFD_WP - purposes. It is required to enable userfaultfd write protection on - file-backed memory types like shmem and hugetlbfs. - -+# multi-gen LRU { - config LRU_GEN - bool "Multi-Gen LRU" - depends on MMU -@@ -1126,6 +1127,16 @@ config LRU_GEN - help - A high performance LRU implementation to overcommit memory. - -+config LRU_GEN_STATS -+ bool "Full stats for debugging" -+ depends on LRU_GEN -+ help -+ Do not enable this option unless you plan to look at historical stats -+ from evicted generations for debugging purpose. -+ -+ This option has a per-memcg and per-node memory overhead. -+# } -+ - source "mm/damon/Kconfig" - - endmenu -diff --git a/mm/swap.c b/mm/swap.c -index 0e423b7d458b..f74fd51fa9e1 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -428,6 +428,40 @@ static void __lru_cache_activate_folio(struct folio *folio) - local_unlock(&cpu_fbatches.lock); - } - -+#ifdef CONFIG_LRU_GEN -+static void folio_inc_refs(struct folio *folio) -+{ -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ if (folio_test_unevictable(folio)) -+ return; -+ -+ if (!folio_test_referenced(folio)) { -+ folio_set_referenced(folio); -+ return; -+ } -+ -+ if (!folio_test_workingset(folio)) { -+ folio_set_workingset(folio); -+ return; -+ } -+ -+ /* see the comment on MAX_NR_TIERS */ -+ do { -+ new_flags = old_flags & LRU_REFS_MASK; -+ if (new_flags == LRU_REFS_MASK) -+ break; -+ -+ new_flags += BIT(LRU_REFS_PGOFF); -+ new_flags |= old_flags & ~LRU_REFS_MASK; -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+} -+#else -+static void folio_inc_refs(struct folio *folio) -+{ -+} -+#endif /* CONFIG_LRU_GEN */ -+ - /* - * Mark a page as having seen activity. 
- * -@@ -440,6 +474,11 @@ static void __lru_cache_activate_folio(struct folio *folio) - */ - void folio_mark_accessed(struct folio *folio) - { -+ if (lru_gen_enabled()) { -+ folio_inc_refs(folio); -+ return; -+ } -+ - if (!folio_test_referenced(folio)) { - folio_set_referenced(folio); - } else if (folio_test_unevictable(folio)) { -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 8d41c4ef430e..d1e60feea8ab 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -1334,9 +1334,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, - - if (folio_test_swapcache(folio)) { - swp_entry_t swap = folio_swap_entry(folio); -- mem_cgroup_swapout(folio, swap); -+ -+ /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ - if (reclaimed && !mapping_exiting(mapping)) - shadow = workingset_eviction(folio, target_memcg); -+ mem_cgroup_swapout(folio, swap); - __delete_from_swap_cache(folio, swap, shadow); - xa_unlock_irq(&mapping->i_pages); - put_swap_page(&folio->page, swap); -@@ -2733,6 +2735,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) - unsigned long file; - struct lruvec *target_lruvec; - -+ if (lru_gen_enabled()) -+ return; -+ - target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); - - /* -@@ -3056,6 +3061,17 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - * shorthand helpers - ******************************************************************************/ - -+#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) -+ -+#define DEFINE_MAX_SEQ(lruvec) \ -+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) -+ -+#define DEFINE_MIN_SEQ(lruvec) \ -+ unsigned long min_seq[ANON_AND_FILE] = { \ -+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ -+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ -+ } -+ - #define for_each_gen_type_zone(gen, type, zone) \ - for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ - for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ 
-@@ -3081,6 +3097,745 @@ static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int ni - return pgdat ? &pgdat->__lruvec : NULL; - } - -+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ struct pglist_data *pgdat = lruvec_pgdat(lruvec); -+ -+ if (!can_demote(pgdat->node_id, sc) && -+ mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) -+ return 0; -+ -+ return mem_cgroup_swappiness(memcg); -+} -+ -+static int get_nr_gens(struct lruvec *lruvec, int type) -+{ -+ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; -+} -+ -+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) -+{ -+ /* see the comment on lru_gen_struct */ -+ return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && -+ get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && -+ get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; -+} -+ -+/****************************************************************************** -+ * refault feedback loop -+ ******************************************************************************/ -+ -+/* -+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller. -+ * -+ * The P term is refaulted/(evicted+protected) from a tier in the generation -+ * currently being evicted; the I term is the exponential moving average of the -+ * P term over the generations previously evicted, using the smoothing factor -+ * 1/2; the D term isn't supported. -+ * -+ * The setpoint (SP) is always the first tier of one type; the process variable -+ * (PV) is either any tier of the other type or any other tier of the same -+ * type. 
-+ *
-+ * The error is the difference between the SP and the PV; the correction is to
-+ * turn off protection when SP>PV or turn on protection when SP<PV.
-+ */
-+struct ctrl_pos {
-+ unsigned long refaulted;
-+ unsigned long total;
-+ int gain;
-+};
-+
-+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
-+ struct ctrl_pos *pos)
-+{
-+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
-+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
-+
-+ pos->refaulted = lrugen->avg_refaulted[type][tier] +
-+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-+ pos->total = lrugen->avg_total[type][tier] +
-+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
-+ if (tier)
-+ pos->total += lrugen->protected[hist][type][tier - 1];
-+ pos->gain = gain;
-+}
-+
-+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
-+{
-+ int hist, tier;
-+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
-+ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
-+ unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
-+
-+ lockdep_assert_held(&lruvec->lru_lock);
-+
-+ if (!carryover && !clear)
-+ return;
-+
-+ hist = lru_hist_from_seq(seq);
-+
-+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
-+ if (carryover) {
-+ unsigned long sum;
-+
-+ sum = lrugen->avg_refaulted[type][tier] +
-+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-+ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
-+
-+ sum = lrugen->avg_total[type][tier] +
-+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
-+ if (tier)
-+ sum += lrugen->protected[hist][type][tier - 1];
-+ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
-+ }
-+
-+ if (clear) {
-+ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
-+ atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
-+ if (tier)
-+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
-+ }
-+ }
-+}
-+
-+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
-+{
-+ /*
-+ * Return true if the PV has a limited number of refaults or a lower
-+ * refaulted/total than the SP.
-+ */ -+ return pv->refaulted < MIN_LRU_BATCH || -+ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= -+ (sp->refaulted + 1) * pv->total * pv->gain; -+} -+ -+/****************************************************************************** -+ * the aging -+ ******************************************************************************/ -+ -+/* protect pages accessed multiple times through file descriptors */ -+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) -+{ -+ int type = folio_is_file_lru(folio); -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); -+ -+ do { -+ new_gen = (old_gen + 1) % MAX_NR_GENS; -+ -+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -+ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; -+ /* for folio_end_writeback() */ -+ if (reclaiming) -+ new_flags |= BIT(PG_reclaim); -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+ -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ -+ return new_gen; -+} -+ -+static void inc_min_seq(struct lruvec *lruvec, int type) -+{ -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ reset_ctrl_pos(lruvec, type, true); -+ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); -+} -+ -+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) -+{ -+ int gen, type, zone; -+ bool success = false; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ DEFINE_MIN_SEQ(lruvec); -+ -+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); -+ -+ /* find the oldest populated generation */ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { -+ gen = lru_gen_from_seq(min_seq[type]); -+ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ if 
(!list_empty(&lrugen->lists[gen][type][zone])) -+ goto next; -+ } -+ -+ min_seq[type]++; -+ } -+next: -+ ; -+ } -+ -+ /* see the comment on lru_gen_struct */ -+ if (can_swap) { -+ min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); -+ min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); -+ } -+ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ if (min_seq[type] == lrugen->min_seq[type]) -+ continue; -+ -+ reset_ctrl_pos(lruvec, type, true); -+ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); -+ success = true; -+ } -+ -+ return success; -+} -+ -+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap) -+{ -+ int prev, next; -+ int type, zone; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); -+ -+ if (max_seq != lrugen->max_seq) -+ goto unlock; -+ -+ for (type = ANON_AND_FILE - 1; type >= 0; type--) { -+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS) -+ continue; -+ -+ VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap); -+ -+ inc_min_seq(lruvec, type); -+ } -+ -+ /* -+ * Update the active/inactive LRU sizes for compatibility. Both sides of -+ * the current max_seq need to be covered, since max_seq+1 can overlap -+ * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do -+ * overlap, cold/hot inversion happens. 
-+ */ -+ prev = lru_gen_from_seq(lrugen->max_seq - 1); -+ next = lru_gen_from_seq(lrugen->max_seq + 1); -+ -+ for (type = 0; type < ANON_AND_FILE; type++) { -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ enum lru_list lru = type * LRU_INACTIVE_FILE; -+ long delta = lrugen->nr_pages[prev][type][zone] - -+ lrugen->nr_pages[next][type][zone]; -+ -+ if (!delta) -+ continue; -+ -+ __update_lru_size(lruvec, lru, zone, delta); -+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); -+ } -+ } -+ -+ for (type = 0; type < ANON_AND_FILE; type++) -+ reset_ctrl_pos(lruvec, type, false); -+ -+ /* make sure preceding modifications appear */ -+ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); -+unlock: -+ spin_unlock_irq(&lruvec->lru_lock); -+} -+ -+static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, -+ struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan) -+{ -+ int gen, type, zone; -+ unsigned long old = 0; -+ unsigned long young = 0; -+ unsigned long total = 0; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ unsigned long seq; -+ -+ for (seq = min_seq[type]; seq <= max_seq; seq++) { -+ unsigned long size = 0; -+ -+ gen = lru_gen_from_seq(seq); -+ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) -+ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L); -+ -+ total += size; -+ if (seq == max_seq) -+ young += size; -+ else if (seq + MIN_NR_GENS == max_seq) -+ old += size; -+ } -+ } -+ -+ /* try to scrape all its memory if this memcg was deleted */ -+ *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total; -+ -+ /* -+ * The aging tries to be lazy to reduce the overhead, while the eviction -+ * stalls when the number of generations reaches MIN_NR_GENS. Hence, the -+ * ideal number of generations is MIN_NR_GENS+1. 
-+ */ -+ if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) -+ return true; -+ if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) -+ return false; -+ -+ /* -+ * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1) -+ * of the total number of pages for each generation. A reasonable range -+ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The -+ * aging cares about the upper bound of hot pages, while the eviction -+ * cares about the lower bound of cold pages. -+ */ -+ if (young * MIN_NR_GENS > total) -+ return true; -+ if (old * (MIN_NR_GENS + 2) < total) -+ return true; -+ -+ return false; -+} -+ -+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ bool need_aging; -+ unsigned long nr_to_scan; -+ int swappiness = get_swappiness(lruvec, sc); -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ DEFINE_MAX_SEQ(lruvec); -+ DEFINE_MIN_SEQ(lruvec); -+ -+ VM_WARN_ON_ONCE(sc->memcg_low_reclaim); -+ -+ mem_cgroup_calculate_protection(NULL, memcg); -+ -+ if (mem_cgroup_below_min(memcg)) -+ return; -+ -+ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); -+ if (need_aging) -+ inc_max_seq(lruvec, max_seq, swappiness); -+} -+ -+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -+{ -+ struct mem_cgroup *memcg; -+ -+ VM_WARN_ON_ONCE(!current_is_kswapd()); -+ -+ memcg = mem_cgroup_iter(NULL, NULL, NULL); -+ do { -+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ -+ age_lruvec(lruvec, sc); -+ -+ cond_resched(); -+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); -+} -+ -+/****************************************************************************** -+ * the eviction -+ ******************************************************************************/ -+ -+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) -+{ -+ bool success; -+ int gen = folio_lru_gen(folio); -+ int type = folio_is_file_lru(folio); -+ int zone = 
folio_zonenum(folio); -+ int delta = folio_nr_pages(folio); -+ int refs = folio_lru_refs(folio); -+ int tier = lru_tier_from_refs(refs); -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); -+ -+ /* unevictable */ -+ if (!folio_evictable(folio)) { -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ folio_set_unevictable(folio); -+ lruvec_add_folio(lruvec, folio); -+ __count_vm_events(UNEVICTABLE_PGCULLED, delta); -+ return true; -+ } -+ -+ /* dirty lazyfree */ -+ if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ folio_set_swapbacked(folio); -+ lruvec_add_folio_tail(lruvec, folio); -+ return true; -+ } -+ -+ /* protected */ -+ if (tier > tier_idx) { -+ int hist = lru_hist_from_seq(lrugen->min_seq[type]); -+ -+ gen = folio_inc_gen(lruvec, folio, false); -+ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); -+ -+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], -+ lrugen->protected[hist][type][tier - 1] + delta); -+ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); -+ return true; -+ } -+ -+ /* waiting for writeback */ -+ if (folio_test_locked(folio) || folio_test_writeback(folio) || -+ (type == LRU_GEN_FILE && folio_test_dirty(folio))) { -+ gen = folio_inc_gen(lruvec, folio, true); -+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); -+ return true; -+ } -+ -+ return false; -+} -+ -+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) -+{ -+ bool success; -+ -+ /* unmapping inhibited */ -+ if (!sc->may_unmap && folio_mapped(folio)) -+ return false; -+ -+ /* swapping inhibited */ -+ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && -+ (folio_test_dirty(folio) || -+ (folio_test_anon(folio) && !folio_test_swapcache(folio)))) -+ return false; -+ -+ /* raced 
with release_pages() */ -+ if (!folio_try_get(folio)) -+ return false; -+ -+ /* raced with another isolation */ -+ if (!folio_test_clear_lru(folio)) { -+ folio_put(folio); -+ return false; -+ } -+ -+ /* see the comment on MAX_NR_TIERS */ -+ if (!folio_test_referenced(folio)) -+ set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0); -+ -+ /* for shrink_page_list() */ -+ folio_clear_reclaim(folio); -+ folio_clear_referenced(folio); -+ -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ -+ return true; -+} -+ -+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, -+ int type, int tier, struct list_head *list) -+{ -+ int gen, zone; -+ enum vm_event_item item; -+ int sorted = 0; -+ int scanned = 0; -+ int isolated = 0; -+ int remaining = MAX_LRU_BATCH; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ -+ VM_WARN_ON_ONCE(!list_empty(list)); -+ -+ if (get_nr_gens(lruvec, type) == MIN_NR_GENS) -+ return 0; -+ -+ gen = lru_gen_from_seq(lrugen->min_seq[type]); -+ -+ for (zone = sc->reclaim_idx; zone >= 0; zone--) { -+ LIST_HEAD(moved); -+ int skipped = 0; -+ struct list_head *head = &lrugen->lists[gen][type][zone]; -+ -+ while (!list_empty(head)) { -+ struct folio *folio = lru_to_folio(head); -+ int delta = folio_nr_pages(folio); -+ -+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); -+ -+ scanned += delta; -+ -+ if (sort_folio(lruvec, folio, tier)) -+ sorted += delta; -+ else if (isolate_folio(lruvec, folio, sc)) { -+ list_add(&folio->lru, list); -+ isolated += delta; -+ } else { -+ list_move(&folio->lru, &moved); -+ skipped += delta; -+ } -+ -+ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) -+ break; -+ } -+ -+ if (skipped) { -+ 
list_splice(&moved, head); -+ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); -+ } -+ -+ if (!remaining || isolated >= MIN_LRU_BATCH) -+ break; -+ } -+ -+ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; -+ if (!cgroup_reclaim(sc)) { -+ __count_vm_events(item, isolated); -+ __count_vm_events(PGREFILL, sorted); -+ } -+ __count_memcg_events(memcg, item, isolated); -+ __count_memcg_events(memcg, PGREFILL, sorted); -+ __count_vm_events(PGSCAN_ANON + type, isolated); -+ -+ /* -+ * There might not be eligible pages due to reclaim_idx, may_unmap and -+ * may_writepage. Check the remaining to prevent livelock if it's not -+ * making progress. -+ */ -+ return isolated || !remaining ? scanned : 0; -+} -+ -+static int get_tier_idx(struct lruvec *lruvec, int type) -+{ -+ int tier; -+ struct ctrl_pos sp, pv; -+ -+ /* -+ * To leave a margin for fluctuations, use a larger gain factor (1:2). -+ * This value is chosen because any other tier would have at least twice -+ * as many refaults as the first tier. -+ */ -+ read_ctrl_pos(lruvec, type, 0, 1, &sp); -+ for (tier = 1; tier < MAX_NR_TIERS; tier++) { -+ read_ctrl_pos(lruvec, type, tier, 2, &pv); -+ if (!positive_ctrl_err(&sp, &pv)) -+ break; -+ } -+ -+ return tier - 1; -+} -+ -+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) -+{ -+ int type, tier; -+ struct ctrl_pos sp, pv; -+ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; -+ -+ /* -+ * Compare the first tier of anon with that of file to determine which -+ * type to scan. Also need to compare other tiers of the selected type -+ * with the first tier of the other type to determine the last tier (of -+ * the selected type) to evict. 
-+ */ -+ read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); -+ read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); -+ type = positive_ctrl_err(&sp, &pv); -+ -+ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); -+ for (tier = 1; tier < MAX_NR_TIERS; tier++) { -+ read_ctrl_pos(lruvec, type, tier, gain[type], &pv); -+ if (!positive_ctrl_err(&sp, &pv)) -+ break; -+ } -+ -+ *tier_idx = tier - 1; -+ -+ return type; -+} -+ -+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, -+ int *type_scanned, struct list_head *list) -+{ -+ int i; -+ int type; -+ int scanned; -+ int tier = -1; -+ DEFINE_MIN_SEQ(lruvec); -+ -+ /* -+ * Try to make the obvious choice first. When anon and file are both -+ * available from the same generation, interpret swappiness 1 as file -+ * first and 200 as anon first. -+ */ -+ if (!swappiness) -+ type = LRU_GEN_FILE; -+ else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) -+ type = LRU_GEN_ANON; -+ else if (swappiness == 1) -+ type = LRU_GEN_FILE; -+ else if (swappiness == 200) -+ type = LRU_GEN_ANON; -+ else -+ type = get_type_to_scan(lruvec, swappiness, &tier); -+ -+ for (i = !swappiness; i < ANON_AND_FILE; i++) { -+ if (tier < 0) -+ tier = get_tier_idx(lruvec, type); -+ -+ scanned = scan_folios(lruvec, sc, type, tier, list); -+ if (scanned) -+ break; -+ -+ type = !type; -+ tier = -1; -+ } -+ -+ *type_scanned = type; -+ -+ return scanned; -+} -+ -+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) -+{ -+ int type; -+ int scanned; -+ int reclaimed; -+ LIST_HEAD(list); -+ struct folio *folio; -+ enum vm_event_item item; -+ struct reclaim_stat stat; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ struct pglist_data *pgdat = lruvec_pgdat(lruvec); -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); -+ -+ scanned += try_to_inc_min_seq(lruvec, swappiness); -+ -+ if (get_nr_gens(lruvec, 
!swappiness) == MIN_NR_GENS) -+ scanned = 0; -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ if (list_empty(&list)) -+ return scanned; -+ -+ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); -+ -+ list_for_each_entry(folio, &list, lru) { -+ /* restore LRU_REFS_FLAGS cleared by isolate_folio() */ -+ if (folio_test_workingset(folio)) -+ folio_set_referenced(folio); -+ -+ /* don't add rejected pages to the oldest generation */ -+ if (folio_test_reclaim(folio) && -+ (folio_test_dirty(folio) || folio_test_writeback(folio))) -+ folio_clear_active(folio); -+ else -+ folio_set_active(folio); -+ } -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ move_pages_to_lru(lruvec, &list); -+ -+ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; -+ if (!cgroup_reclaim(sc)) -+ __count_vm_events(item, reclaimed); -+ __count_memcg_events(memcg, item, reclaimed); -+ __count_vm_events(PGSTEAL_ANON + type, reclaimed); -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ mem_cgroup_uncharge_list(&list); -+ free_unref_page_list(&list); -+ -+ sc->nr_reclaimed += reclaimed; -+ -+ return scanned; -+} -+ -+static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, -+ bool can_swap) -+{ -+ bool need_aging; -+ unsigned long nr_to_scan; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ DEFINE_MAX_SEQ(lruvec); -+ DEFINE_MIN_SEQ(lruvec); -+ -+ if (mem_cgroup_below_min(memcg) || -+ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) -+ return 0; -+ -+ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); -+ if (!need_aging) -+ return nr_to_scan; -+ -+ /* skip the aging path at the default priority */ -+ if (sc->priority == DEF_PRIORITY) -+ goto done; -+ -+ /* leave the work to lru_gen_age_node() */ -+ if (current_is_kswapd()) -+ return 0; -+ -+ inc_max_seq(lruvec, max_seq, can_swap); -+done: -+ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; -+} -+ -+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ struct blk_plug plug; -+ unsigned long scanned = 0; -+ -+ lru_add_drain(); -+ -+ blk_start_plug(&plug); -+ -+ while (true) { -+ int delta; -+ int swappiness; -+ unsigned long nr_to_scan; -+ -+ if (sc->may_swap) -+ swappiness = get_swappiness(lruvec, sc); -+ else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) -+ swappiness = 1; -+ else -+ swappiness = 0; -+ -+ nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); -+ if (!nr_to_scan) -+ break; -+ -+ delta = evict_folios(lruvec, sc, swappiness); -+ if (!delta) -+ break; -+ -+ scanned += delta; -+ if (scanned >= nr_to_scan) -+ break; -+ -+ cond_resched(); -+ } -+ -+ blk_finish_plug(&plug); -+} -+ - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -3123,6 +3878,16 @@ static int __init init_lru_gen(void) - }; - late_initcall(init_lru_gen); - -+#else /* !CONFIG_LRU_GEN */ -+ -+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -+{ -+} -+ -+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+} -+ - #endif /* CONFIG_LRU_GEN */ - - static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -@@ -3136,6 +3901,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - bool proportional_reclaim; - struct blk_plug plug; - -+ if (lru_gen_enabled()) { -+ lru_gen_shrink_lruvec(lruvec, sc); -+ return; -+ } -+ - get_scan_count(lruvec, sc, nr); - - /* Record the original scan target for proportional adjustments later */ -@@ -3640,6 +4410,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) - struct lruvec *target_lruvec; - unsigned long refaults; - -+ if (lru_gen_enabled()) -+ return; -+ - target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - 
refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); - target_lruvec->refaults[WORKINGSET_ANON] = refaults; -@@ -4006,12 +4779,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, - } - #endif - --static void age_active_anon(struct pglist_data *pgdat, -- struct scan_control *sc) -+static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) - { - struct mem_cgroup *memcg; - struct lruvec *lruvec; - -+ if (lru_gen_enabled()) { -+ lru_gen_age_node(pgdat, sc); -+ return; -+ } -+ - if (!can_age_anon_pages(pgdat, sc)) - return; - -@@ -4331,12 +5108,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) - sc.may_swap = !nr_boost_reclaim; - - /* -- * Do some background aging of the anon list, to give -- * pages a chance to be referenced before reclaiming. All -- * pages are rotated regardless of classzone as this is -- * about consistent aging. -+ * Do some background aging, to give pages a chance to be -+ * referenced before reclaiming. All pages are rotated -+ * regardless of classzone as this is about consistent aging. 
- */ -- age_active_anon(pgdat, &sc); -+ kswapd_age_node(pgdat, &sc); - - /* - * If we're getting trouble reclaiming, start doing writepage -diff --git a/mm/workingset.c b/mm/workingset.c -index a5e84862fc86..ae7e984b23c6 100644 ---- a/mm/workingset.c -+++ b/mm/workingset.c -@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; - static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, - bool workingset) - { -- eviction >>= bucket_order; - eviction &= EVICTION_MASK; - eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; - eviction = (eviction << NODES_SHIFT) | pgdat->node_id; -@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, - - *memcgidp = memcgid; - *pgdat = NODE_DATA(nid); -- *evictionp = entry << bucket_order; -+ *evictionp = entry; - *workingsetp = workingset; - } - -+#ifdef CONFIG_LRU_GEN -+ -+static void *lru_gen_eviction(struct folio *folio) -+{ -+ int hist; -+ unsigned long token; -+ unsigned long min_seq; -+ struct lruvec *lruvec; -+ struct lru_gen_struct *lrugen; -+ int type = folio_is_file_lru(folio); -+ int delta = folio_nr_pages(folio); -+ int refs = folio_lru_refs(folio); -+ int tier = lru_tier_from_refs(refs); -+ struct mem_cgroup *memcg = folio_memcg(folio); -+ struct pglist_data *pgdat = folio_pgdat(folio); -+ -+ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); -+ -+ lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ lrugen = &lruvec->lrugen; -+ min_seq = READ_ONCE(lrugen->min_seq[type]); -+ token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0); -+ -+ hist = lru_hist_from_seq(min_seq); -+ atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); -+ -+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs); -+} -+ -+static void lru_gen_refault(struct folio *folio, void *shadow) -+{ -+ int hist, tier, refs; -+ int memcg_id; -+ bool workingset; -+ unsigned long token; -+ unsigned long min_seq; -+ struct lruvec *lruvec; -+ struct 
lru_gen_struct *lrugen; -+ struct mem_cgroup *memcg; -+ struct pglist_data *pgdat; -+ int type = folio_is_file_lru(folio); -+ int delta = folio_nr_pages(folio); -+ -+ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); -+ -+ if (pgdat != folio_pgdat(folio)) -+ return; -+ -+ rcu_read_lock(); -+ -+ memcg = folio_memcg_rcu(folio); -+ if (memcg_id != mem_cgroup_id(memcg)) -+ goto unlock; -+ -+ lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ lrugen = &lruvec->lrugen; -+ -+ min_seq = READ_ONCE(lrugen->min_seq[type]); -+ if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) -+ goto unlock; -+ -+ hist = lru_hist_from_seq(min_seq); -+ /* see the comment in folio_lru_refs() */ -+ refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; -+ tier = lru_tier_from_refs(refs); -+ -+ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); -+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); -+ -+ /* -+ * Count the following two cases as stalls: -+ * 1. For pages accessed through page tables, hotter pages pushed out -+ * hot pages which refaulted immediately. -+ * 2. For pages accessed multiple times through file descriptors, -+ * numbers of accesses might have been out of the range. 
-+ */ -+ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { -+ folio_set_workingset(folio); -+ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); -+ } -+unlock: -+ rcu_read_unlock(); -+} -+ -+#else /* !CONFIG_LRU_GEN */ -+ -+static void *lru_gen_eviction(struct folio *folio) -+{ -+ return NULL; -+} -+ -+static void lru_gen_refault(struct folio *folio, void *shadow) -+{ -+} -+ -+#endif /* CONFIG_LRU_GEN */ -+ - /** - * workingset_age_nonresident - age non-resident entries as LRU ages - * @lruvec: the lruvec that was aged -@@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) - VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - -+ if (lru_gen_enabled()) -+ return lru_gen_eviction(folio); -+ - lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - /* XXX: target_memcg can be NULL, go through lruvec */ - memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); - eviction = atomic_long_read(&lruvec->nonresident_age); -+ eviction >>= bucket_order; - workingset_age_nonresident(lruvec, folio_nr_pages(folio)); - return pack_shadow(memcgid, pgdat, eviction, - folio_test_workingset(folio)); -@@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) - int memcgid; - long nr; - -+ if (lru_gen_enabled()) { -+ lru_gen_refault(folio, shadow); -+ return; -+ } -+ - unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); -+ eviction <<= bucket_order; - - rcu_read_lock(); - /* --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (5 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao - ` (7 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Searching the rmap for PTEs mapping each page on an LRU list (to test -and clear the accessed bit) can be expensive because pages from -different VMAs (PA space) are not cache friendly to the rmap (VA -space). For workloads mostly using mapped pages, searching the rmap -can incur the highest CPU cost in the reclaim path. - -This patch exploits spatial locality to reduce the trips into the -rmap. When shrink_page_list() walks the rmap and finds a young PTE, a -new function lru_gen_look_around() scans at most BITS_PER_LONG-1 -adjacent PTEs. On finding another young PTE, it clears the accessed -bit and updates the gen counter of the page mapped by this PTE to -(max_seq%MAX_NR_GENS)+1. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): no change - - Single workload: - memcached (anon): +[3, 5]% - Ops/sec KB/sec - patch1-6: 1106168.46 43025.04 - patch1-7: 1147696.57 44640.29 - - Configurations: - no change - -Client benchmark results: - kswapd profiles: - patch1-6 - 39.03% lzo1x_1_do_compress (real work) - 18.47% page_vma_mapped_walk (overhead) - 6.74% _raw_spin_unlock_irq - 3.97% do_raw_spin_lock - 2.49% ptep_clear_flush - 2.48% anon_vma_interval_tree_iter_first - 1.92% folio_referenced_one - 1.88% __zram_bvec_write - 1.48% memmove - 1.31% vma_interval_tree_iter_next - - patch1-7 - 48.16% lzo1x_1_do_compress (real work) - 8.20% page_vma_mapped_walk (overhead) - 7.06% _raw_spin_unlock_irq - 2.92% ptep_clear_flush - 2.53% __zram_bvec_write - 2.11% do_raw_spin_lock - 2.02% memmove - 1.93% lru_gen_look_around - 1.56% free_unref_page_list - 1.40% memset - - Configurations: - no change - -Signed-off-by: Yu Zhao -Acked-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/memcontrol.h | 31 +++++++ - include/linux/mm.h | 5 + - include/linux/mmzone.h | 6 ++ - mm/internal.h | 1 + - mm/memcontrol.c | 1 + - mm/rmap.c | 6 ++ - mm/swap.c | 4 +- - mm/vmscan.c | 184 +++++++++++++++++++++++++++++++++++++ - 8 files changed, 236 insertions(+), 2 deletions(-) - -diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h -index a2461f9a8738..9b8ab121d948 100644 ---- a/include/linux/memcontrol.h -+++ b/include/linux/memcontrol.h -@@ -445,6 +445,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - * - * 
For a kmem folio a caller should hold an rcu read lock to protect memcg - * associated with a kmem folio from being released. -@@ -506,6 +507,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - * - * For a kmem page a caller should hold an rcu read lock to protect memcg - * associated with a kmem page from being released. -@@ -960,6 +962,23 @@ void unlock_page_memcg(struct page *page); - - void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); - -+/* try to stablize folio_memcg() for all the pages in a memcg */ -+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -+{ -+ rcu_read_lock(); -+ -+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) -+ return true; -+ -+ rcu_read_unlock(); -+ return false; -+} -+ -+static inline void mem_cgroup_unlock_pages(void) -+{ -+ rcu_read_unlock(); -+} -+ - /* idx can be of type enum memcg_stat_item or node_stat_item */ - static inline void mod_memcg_state(struct mem_cgroup *memcg, - int idx, int val) -@@ -1434,6 +1453,18 @@ static inline void folio_memcg_unlock(struct folio *folio) - { - } - -+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -+{ -+ /* to match folio_memcg_rcu() */ -+ rcu_read_lock(); -+ return true; -+} -+ -+static inline void mem_cgroup_unlock_pages(void) -+{ -+ rcu_read_unlock(); -+} -+ - static inline void mem_cgroup_handle_over_high(void) - { - } -diff --git a/include/linux/mm.h b/include/linux/mm.h -index 8a5ad9d050bf..7cc9ffc19e7f 100644 ---- a/include/linux/mm.h -+++ b/include/linux/mm.h -@@ -1490,6 +1490,11 @@ static inline unsigned long folio_pfn(struct folio *folio) - return page_to_pfn(&folio->page); - } - -+static inline struct folio *pfn_folio(unsigned long pfn) -+{ -+ return page_folio(pfn_to_page(pfn)); -+} -+ - static inline atomic_t *folio_pincount_ptr(struct folio *folio) - { - return 
&folio_page(folio, 1)->compound_pincount; -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 7e343420bfb1..9ef5aa37c60c 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -375,6 +375,7 @@ enum lruvec_flags { - #ifndef __GENERATING_BOUNDS_H - - struct lruvec; -+struct page_vma_mapped_walk; - - #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) - #define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) -@@ -430,6 +431,7 @@ struct lru_gen_struct { - }; - - void lru_gen_init_lruvec(struct lruvec *lruvec); -+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); - - #ifdef CONFIG_MEMCG - void lru_gen_init_memcg(struct mem_cgroup *memcg); -@@ -442,6 +444,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) - { - } - -+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) -+{ -+} -+ - #ifdef CONFIG_MEMCG - static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) - { -diff --git a/mm/internal.h b/mm/internal.h -index 4df67b6b8cce..0082d5fdddac 100644 ---- a/mm/internal.h -+++ b/mm/internal.h -@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf); - void folio_rotate_reclaimable(struct folio *folio); - bool __folio_end_writeback(struct folio *folio); - void deactivate_file_folio(struct folio *folio); -+void folio_activate(struct folio *folio); - - void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); -diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 937141d48221..4ea49113b0dd 100644 ---- a/mm/memcontrol.c -+++ b/mm/memcontrol.c -@@ -2789,6 +2789,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - */ - folio->memcg_data = (unsigned long)memcg; - } -diff --git a/mm/rmap.c b/mm/rmap.c -index 131def40e4f0..2ff17b9aabd9 100644 ---- a/mm/rmap.c -+++ 
b/mm/rmap.c -@@ -825,6 +825,12 @@ static bool folio_referenced_one(struct folio *folio, - } - - if (pvmw.pte) { -+ if (lru_gen_enabled() && pte_young(*pvmw.pte) && -+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { -+ lru_gen_look_around(&pvmw); -+ referenced++; -+ } -+ - if (ptep_clear_flush_young_notify(vma, address, - pvmw.pte)) { - /* -diff --git a/mm/swap.c b/mm/swap.c -index f74fd51fa9e1..0a3871a70952 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -366,7 +366,7 @@ static void folio_activate_drain(int cpu) - folio_batch_move_lru(fbatch, folio_activate_fn); - } - --static void folio_activate(struct folio *folio) -+void folio_activate(struct folio *folio) - { - if (folio_test_lru(folio) && !folio_test_active(folio) && - !folio_test_unevictable(folio)) { -@@ -385,7 +385,7 @@ static inline void folio_activate_drain(int cpu) - { - } - --static void folio_activate(struct folio *folio) -+void folio_activate(struct folio *folio) - { - struct lruvec *lruvec; - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index d1e60feea8ab..33a1bdfc04bd 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -1635,6 +1635,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, - if (!sc->may_unmap && folio_mapped(folio)) - goto keep_locked; - -+ /* folio_update_gen() tried to promote this page? */ -+ if (lru_gen_enabled() && !ignore_references && -+ folio_mapped(folio) && folio_test_referenced(folio)) -+ goto keep_locked; -+ - /* - * The number of dirty pages determines if a node is marked - * reclaim_congested. 
kswapd will stall and start writing -@@ -3219,6 +3224,29 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) - * the aging - ******************************************************************************/ - -+/* promote pages accessed through page tables */ -+static int folio_update_gen(struct folio *folio, int gen) -+{ -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); -+ VM_WARN_ON_ONCE(!rcu_read_lock_held()); -+ -+ do { -+ /* lru_gen_del_folio() has isolated this page? */ -+ if (!(old_flags & LRU_GEN_MASK)) { -+ /* for shrink_page_list() */ -+ new_flags = old_flags | BIT(PG_referenced); -+ continue; -+ } -+ -+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+ -+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; -+} -+ - /* protect pages accessed multiple times through file descriptors */ - static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) - { -@@ -3230,6 +3258,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); - - do { -+ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; -+ /* folio_update_gen() has promoted this page? 
*/ -+ if (new_gen >= 0 && new_gen != old_gen) -+ return new_gen; -+ - new_gen = (old_gen + 1) % MAX_NR_GENS; - - new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -@@ -3244,6 +3277,43 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - return new_gen; - } - -+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) -+{ -+ unsigned long pfn = pte_pfn(pte); -+ -+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); -+ -+ if (!pte_present(pte) || is_zero_pfn(pfn)) -+ return -1; -+ -+ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) -+ return -1; -+ -+ if (WARN_ON_ONCE(!pfn_valid(pfn))) -+ return -1; -+ -+ return pfn; -+} -+ -+static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, -+ struct pglist_data *pgdat) -+{ -+ struct folio *folio; -+ -+ /* try to avoid unnecessary memory loads */ -+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) -+ return NULL; -+ -+ folio = pfn_folio(pfn); -+ if (folio_nid(folio) != pgdat->node_id) -+ return NULL; -+ -+ if (folio_memcg_rcu(folio) != memcg) -+ return NULL; -+ -+ return folio; -+} -+ - static void inc_min_seq(struct lruvec *lruvec, int type) - { - struct lru_gen_struct *lrugen = &lruvec->lrugen; -@@ -3443,6 +3513,114 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); - } - -+/* -+ * This function exploits spatial locality when shrink_page_list() walks the -+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. 
-+ */ -+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) -+{ -+ int i; -+ pte_t *pte; -+ unsigned long start; -+ unsigned long end; -+ unsigned long addr; -+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; -+ struct folio *folio = pfn_folio(pvmw->pfn); -+ struct mem_cgroup *memcg = folio_memcg(folio); -+ struct pglist_data *pgdat = folio_pgdat(folio); -+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ DEFINE_MAX_SEQ(lruvec); -+ int old_gen, new_gen = lru_gen_from_seq(max_seq); -+ -+ lockdep_assert_held(pvmw->ptl); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); -+ -+ if (spin_is_contended(pvmw->ptl)) -+ return; -+ -+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); -+ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; -+ -+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { -+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) -+ end = start + MIN_LRU_BATCH * PAGE_SIZE; -+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) -+ start = end - MIN_LRU_BATCH * PAGE_SIZE; -+ else { -+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; -+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; -+ } -+ } -+ -+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; -+ -+ rcu_read_lock(); -+ arch_enter_lazy_mmu_mode(); -+ -+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { -+ unsigned long pfn; -+ -+ pfn = get_pte_pfn(pte[i], pvmw->vma, addr); -+ if (pfn == -1) -+ continue; -+ -+ if (!pte_young(pte[i])) -+ continue; -+ -+ folio = get_pfn_folio(pfn, memcg, pgdat); -+ if (!folio) -+ continue; -+ -+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) -+ VM_WARN_ON_ONCE(true); -+ -+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && -+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) && -+ !folio_test_swapcache(folio))) -+ folio_mark_dirty(folio); -+ -+ old_gen = folio_lru_gen(folio); -+ if (old_gen < 0) -+ folio_set_referenced(folio); -+ else if (old_gen != new_gen) 
-+ __set_bit(i, bitmap); -+ } -+ -+ arch_leave_lazy_mmu_mode(); -+ rcu_read_unlock(); -+ -+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { -+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { -+ folio = pfn_folio(pte_pfn(pte[i])); -+ folio_activate(folio); -+ } -+ return; -+ } -+ -+ /* folio_update_gen() requires stable folio_memcg() */ -+ if (!mem_cgroup_trylock_pages(memcg)) -+ return; -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ -+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { -+ folio = pfn_folio(pte_pfn(pte[i])); -+ if (folio_memcg_rcu(folio) != memcg) -+ continue; -+ -+ old_gen = folio_update_gen(folio, new_gen); -+ if (old_gen < 0 || old_gen == new_gen) -+ continue; -+ -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ } -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ mem_cgroup_unlock_pages(); -+} -+ - /****************************************************************************** - * the eviction - ******************************************************************************/ -@@ -3479,6 +3657,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) - return true; - } - -+ /* promoted */ -+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { -+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); -+ return true; -+ } -+ - /* protected */ - if (tier > tier_idx) { - int hist = lru_hist_from_seq(lrugen->min_seq[type]); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (6 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:17 ` Yu Zhao - 2022-09-28 19:36 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao - ` (6 subsequent siblings) - 14 siblings, 2 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -To further exploit spatial locality, the aging prefers to walk page -tables to search for young PTEs and promote hot pages. A kill switch -will be added in the next patch to disable this behavior. When -disabled, the aging relies on the rmap only. - -NB: this behavior has nothing similar with the page table scanning in -the 2.4 kernel [1], which searches page tables for old PTEs, adds cold -pages to swapcache and unmaps them. - -To avoid confusion, the term "iteration" specifically means the -traversal of an entire mm_struct list; the term "walk" will be applied -to page tables and the rmap, as usual. - -An mm_struct list is maintained for each memcg, and an mm_struct -follows its owner task to the new memcg when this task is migrated. -Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls -walk_page_range() with each mm_struct on this list to promote hot -pages before it increments max_seq. 
- -When multiple page table walkers iterate the same list, each of them -gets a unique mm_struct; therefore they can run concurrently. Page -table walkers ignore any misplaced pages, e.g., if an mm_struct was -migrated, pages it left in the previous memcg will not be promoted -when its current memcg is under reclaim. Similarly, page table walkers -will not promote pages from nodes other than the one under reclaim. - -This patch uses the following optimizations when walking page tables: -1. It tracks the usage of mm_struct's between context switches so that - page table walkers can skip processes that have been sleeping since - the last iteration. -2. It uses generational Bloom filters to record populated branches so - that page table walkers can reduce their search space based on the - query results, e.g., to skip page tables containing mostly holes or - misplaced pages. -3. It takes advantage of the accessed bit in non-leaf PMD entries when - CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. -4. It does not zigzag between a PGD table and the same PMD table - spanning multiple VMAs. IOW, it finishes all the VMAs within the - range of the same PMD table before it returns to a PGD table. This - improves the cache performance for workloads that have large - numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): no change - - Single workload: - memcached (anon): +[8, 10]% - Ops/sec KB/sec - patch1-7: 1147696.57 44640.29 - patch1-8: 1245274.91 48435.66 - - Configurations: - no change - -Client benchmark results: - kswapd profiles: - patch1-7 - 48.16% lzo1x_1_do_compress (real work) - 8.20% page_vma_mapped_walk (overhead) - 7.06% _raw_spin_unlock_irq - 2.92% ptep_clear_flush - 2.53% __zram_bvec_write - 2.11% do_raw_spin_lock - 2.02% memmove - 1.93% lru_gen_look_around - 1.56% free_unref_page_list - 1.40% memset - - patch1-8 - 49.44% lzo1x_1_do_compress (real work) - 6.19% page_vma_mapped_walk (overhead) - 5.97% _raw_spin_unlock_irq - 3.13% get_pfn_folio - 2.85% ptep_clear_flush - 2.42% __zram_bvec_write - 2.08% do_raw_spin_lock - 1.92% memmove - 1.44% alloc_zspage - 1.36% memset - - Configurations: - no change - -Thanks to the following developers for their efforts [3]. - kernel test robot - -[1] https://lwn.net/Articles/23732/ -[2] https://llvm.org/docs/ScudoHardenedAllocator.html -[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - fs/exec.c | 2 + - include/linux/memcontrol.h | 5 + - include/linux/mm_types.h | 76 +++ - include/linux/mmzone.h | 56 +- - include/linux/swap.h | 4 + - kernel/exit.c | 1 + - kernel/fork.c | 9 + - kernel/sched/core.c | 1 + - mm/memcontrol.c | 25 + - mm/vmscan.c | 1010 +++++++++++++++++++++++++++++++++++- - 10 files changed, 1172 insertions(+), 17 deletions(-) - -diff --git a/fs/exec.c b/fs/exec.c -index 9a5ca7b82bfc..507a317d54db 100644 ---- a/fs/exec.c -+++ b/fs/exec.c -@@ -1014,6 +1014,7 @@ static 
int exec_mmap(struct mm_struct *mm) - active_mm = tsk->active_mm; - tsk->active_mm = mm; - tsk->mm = mm; -+ lru_gen_add_mm(mm); - /* - * This prevents preemption while active_mm is being loaded and - * it and mm are being updated, which could cause problems for -@@ -1029,6 +1030,7 @@ static int exec_mmap(struct mm_struct *mm) - tsk->mm->vmacache_seqnum = 0; - vmacache_flush(tsk); - task_unlock(tsk); -+ lru_gen_use_mm(mm); - if (old_mm) { - mmap_read_unlock(old_mm); - BUG_ON(active_mm != old_mm); -diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h -index 9b8ab121d948..344022f102c2 100644 ---- a/include/linux/memcontrol.h -+++ b/include/linux/memcontrol.h -@@ -350,6 +350,11 @@ struct mem_cgroup { - struct deferred_split deferred_split_queue; - #endif - -+#ifdef CONFIG_LRU_GEN -+ /* per-memcg mm_struct list */ -+ struct lru_gen_mm_list mm_list; -+#endif -+ - struct mem_cgroup_per_node *nodeinfo[]; - }; - diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h -index cf97f3884fda..e1797813cc2c 100644 +index cf97f3884fda20..e1797813cc2c2b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -672,6 +672,22 @@ struct mm_struct { @@ -4013,12 +934,12 @@ index cf97f3884fda..e1797813cc2c 100644 + } lru_gen; +#endif /* CONFIG_LRU_GEN */ } __randomize_layout; - + /* @@ -698,6 +714,66 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return (struct cpumask *)&mm->cpu_bitmap; } - + +#ifdef CONFIG_LRU_GEN + +struct lru_gen_mm_list { @@ -4083,22 +1004,137 @@ index cf97f3884fda..e1797813cc2c 100644 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 9ef5aa37c60c..b1635c4020dc 100644 +index e24b40c52468a8..0c502618b37bf7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h -@@ -408,7 +408,7 @@ enum { - * min_seq behind. 
- * - * The number of pages in each generation is eventually consistent and therefore -- * can be transiently negative. -+ * can be transiently negative when reset_batch_size() is pending. - */ - struct lru_gen_struct { - /* the aging increments the youngest generation number */ -@@ -430,6 +430,53 @@ struct lru_gen_struct { - atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; +@@ -314,6 +314,207 @@ enum lruvec_flags { + */ }; - + ++#endif /* !__GENERATING_BOUNDS_H */ ++ ++/* ++ * Evictable pages are divided into multiple generations. The youngest and the ++ * oldest generation numbers, max_seq and min_seq, are monotonically increasing. ++ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An ++ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the ++ * corresponding generation. The gen counter in folio->flags stores gen+1 while ++ * a page is on one of lrugen->lists[]. Otherwise it stores 0. ++ * ++ * A page is added to the youngest generation on faulting. The aging needs to ++ * check the accessed bit at least twice before handing this page over to the ++ * eviction. The first check takes care of the accessed bit set on the initial ++ * fault; the second check makes sure this page hasn't been used since then. ++ * This process, AKA second chance, requires a minimum of two generations, ++ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive ++ * LRU, e.g., /proc/vmstat, these two generations are considered active; the ++ * rest of generations, if they exist, are considered inactive. See ++ * lru_gen_is_active(). ++ * ++ * PG_active is always cleared while a page is on one of lrugen->lists[] so that ++ * the aging needs not to worry about it. And it's set again when a page ++ * considered active is isolated for non-reclaiming purposes, e.g., migration. ++ * See lru_gen_add_folio() and lru_gen_del_folio(). 
++ * ++ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the ++ * number of categories of the active/inactive LRU when keeping track of ++ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits ++ * in folio->flags. ++ */ ++#define MIN_NR_GENS 2U ++#define MAX_NR_GENS 4U ++ ++/* ++ * Each generation is divided into multiple tiers. A page accessed N times ++ * through file descriptors is in tier order_base_2(N). A page in the first tier ++ * (N=0,1) is marked by PG_referenced unless it was faulted in through page ++ * tables or read ahead. A page in any other tier (N>1) is marked by ++ * PG_referenced and PG_workingset. This implies a minimum of two tiers is ++ * supported without using additional bits in folio->flags. ++ * ++ * In contrast to moving across generations which requires the LRU lock, moving ++ * across tiers only involves atomic operations on folio->flags and therefore ++ * has a negligible cost in the buffered access path. In the eviction path, ++ * comparisons of refaulted/(evicted+protected) from the first tier and the ++ * rest infer whether pages accessed multiple times through file descriptors ++ * are statistically hot and thus worth protecting. ++ * ++ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the ++ * number of categories of the active/inactive LRU when keeping track of ++ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in ++ * folio->flags. 
++ */ ++#define MAX_NR_TIERS 4U ++ ++#ifndef __GENERATING_BOUNDS_H ++ ++struct lruvec; ++struct page_vma_mapped_walk; ++ ++#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) ++#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) ++ ++#ifdef CONFIG_LRU_GEN ++ ++enum { ++ LRU_GEN_ANON, ++ LRU_GEN_FILE, ++}; ++ ++enum { ++ LRU_GEN_CORE, ++ LRU_GEN_MM_WALK, ++ LRU_GEN_NONLEAF_YOUNG, ++ NR_LRU_GEN_CAPS ++}; ++ ++#define MIN_LRU_BATCH BITS_PER_LONG ++#define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) ++ ++/* whether to keep historical stats from evicted generations */ ++#ifdef CONFIG_LRU_GEN_STATS ++#define NR_HIST_GENS MAX_NR_GENS ++#else ++#define NR_HIST_GENS 1U ++#endif ++ ++/* ++ * The youngest generation number is stored in max_seq for both anon and file ++ * types as they are aged on an equal footing. The oldest generation numbers are ++ * stored in min_seq[] separately for anon and file types as clean file pages ++ * can be evicted regardless of swap constraints. ++ * ++ * Normally anon and file min_seq are in sync. But if swapping is constrained, ++ * e.g., out of swap space, file min_seq is allowed to advance and leave anon ++ * min_seq behind. ++ * ++ * The number of pages in each generation is eventually consistent and therefore ++ * can be transiently negative when reset_batch_size() is pending. 
++ */ ++struct lru_gen_struct { ++ /* the aging increments the youngest generation number */ ++ unsigned long max_seq; ++ /* the eviction increments the oldest generation numbers */ ++ unsigned long min_seq[ANON_AND_FILE]; ++ /* the birth time of each generation in jiffies */ ++ unsigned long timestamps[MAX_NR_GENS]; ++ /* the multi-gen LRU lists, lazily sorted on eviction */ ++ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; ++ /* the multi-gen LRU sizes, eventually consistent */ ++ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; ++ /* the exponential moving average of refaulted */ ++ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; ++ /* the exponential moving average of evicted+protected */ ++ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; ++ /* the first tier doesn't need protection, hence the minus one */ ++ unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; ++ /* can be modified without holding the LRU lock */ ++ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; ++ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; ++ /* whether the multi-gen LRU is enabled */ ++ bool enabled; ++}; ++ +enum { + MM_LEAF_TOTAL, /* total leaf entries */ + MM_LEAF_OLD, /* old leaf entries */ @@ -4146,32 +1182,209 @@ index 9ef5aa37c60c..b1635c4020dc 100644 + bool force_scan; +}; + - void lru_gen_init_lruvec(struct lruvec *lruvec); - void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); - -@@ -480,6 +527,8 @@ struct lruvec { - #ifdef CONFIG_LRU_GEN - /* evictable pages divided into generations */ - struct lru_gen_struct lrugen; ++void lru_gen_init_lruvec(struct lruvec *lruvec); ++void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); ++ ++#ifdef CONFIG_MEMCG ++void lru_gen_init_memcg(struct mem_cgroup *memcg); ++void lru_gen_exit_memcg(struct mem_cgroup *memcg); ++#endif ++ ++#else /* !CONFIG_LRU_GEN */ ++ ++static inline void lru_gen_init_lruvec(struct lruvec *lruvec) ++{ 
++} ++ ++static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) ++{ ++} ++ ++#ifdef CONFIG_MEMCG ++static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) ++{ ++} ++ ++static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) ++{ ++} ++#endif ++ ++#endif /* CONFIG_LRU_GEN */ ++ + struct lruvec { + struct list_head lists[NR_LRU_LISTS]; + /* per lruvec lru_lock for memcg */ +@@ -331,6 +532,12 @@ struct lruvec { + unsigned long refaults[ANON_AND_FILE]; + /* Various lruvec state flags (enum lruvec_flags) */ + unsigned long flags; ++#ifdef CONFIG_LRU_GEN ++ /* evictable pages divided into generations */ ++ struct lru_gen_struct lrugen; + /* to concurrently iterate lru_gen_mm_list */ + struct lru_gen_mm_state mm_state; - #endif ++#endif #ifdef CONFIG_MEMCG struct pglist_data *pgdat; -@@ -1176,6 +1225,11 @@ typedef struct pglist_data { - + #endif +@@ -746,6 +953,8 @@ static inline bool zone_is_empty(struct zone *zone) + #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) + #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) + #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) ++#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) ++#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) + + /* + * Define the bit shifts to access each section. 
For non-existent +@@ -1007,6 +1216,11 @@ typedef struct pglist_data { + unsigned long flags; - + +#ifdef CONFIG_LRU_GEN + /* kswap mm walk data */ + struct lru_gen_mm_walk mm_walk; +#endif + ZONE_PADDING(_pad2_) - + /* Per-node vmstats */ +diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h +index 4b71a96190a84c..3a0eec9f2faa75 100644 +--- a/include/linux/nodemask.h ++++ b/include/linux/nodemask.h +@@ -493,6 +493,7 @@ static inline int num_node_state(enum node_states state) + #define first_online_node 0 + #define first_memory_node 0 + #define next_online_node(nid) (MAX_NUMNODES) ++#define next_memory_node(nid) (MAX_NUMNODES) + #define nr_node_ids 1U + #define nr_online_nodes 1U + +diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h +index ef1e3e736e1483..7d79818dc06513 100644 +--- a/include/linux/page-flags-layout.h ++++ b/include/linux/page-flags-layout.h +@@ -55,7 +55,8 @@ + #define SECTIONS_WIDTH 0 + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ ++ <= BITS_PER_LONG - NR_PAGEFLAGS + #define NODES_WIDTH NODES_SHIFT + #elif defined(CONFIG_SPARSEMEM_VMEMMAP) + #error "Vmemmap: No space for nodes field in page flags" +@@ -89,8 +90,8 @@ + #define LAST_CPUPID_SHIFT 0 + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ +- <= BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ ++ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS + #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT + #else + #define LAST_CPUPID_WIDTH 0 +@@ -100,10 +101,15 @@ + #define LAST_CPUPID_NOT_IN_PAGE_FLAGS + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ +- > BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ ++ KASAN_TAG_WIDTH + 
LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS + #error "Not enough bits in page flags" + #endif + ++/* see the comment on MAX_NR_TIERS */ ++#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ ++ ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ ++ NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) ++ + #endif + #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ +diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h +index 465ff35a8c00a8..0b0ae5084e60c7 100644 +--- a/include/linux/page-flags.h ++++ b/include/linux/page-flags.h +@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) + 1UL << PG_private | 1UL << PG_private_2 | \ + 1UL << PG_writeback | 1UL << PG_reserved | \ + 1UL << PG_slab | 1UL << PG_active | \ +- 1UL << PG_unevictable | __PG_MLOCKED) ++ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) + + /* + * Flags checked when a page is prepped for return by the page allocator. +@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) + * alloc-free cycle to prevent from reusing the page. 
+ */ + #define PAGE_FLAGS_CHECK_AT_PREP \ +- (PAGEFLAGS_MASK & ~__PG_HWPOISON) ++ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) + + #define PAGE_FLAGS_PRIVATE \ + (1UL << PG_private | 1UL << PG_private_2) +diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h +index 014ee8f0fbaabc..d9095251bffd2f 100644 +--- a/include/linux/pgtable.h ++++ b/include/linux/pgtable.h +@@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, + #endif + + #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG +-#ifdef CONFIG_TRANSPARENT_HUGEPAGE ++#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) + static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +@@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, + BUILD_BUG(); + return 0; + } +-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ ++#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ + #endif + + #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH +@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + #endif + ++#ifndef arch_has_hw_pte_young ++/* ++ * Return whether the accessed bit is supported on the local CPU. ++ * ++ * This stub assumes accessing through an old PTE triggers a page fault. ++ * Architectures that automatically set the access bit should overwrite it. 
++ */ ++static inline bool arch_has_hw_pte_young(void) ++{ ++ return false; ++} ++#endif ++ + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR + static inline pte_t ptep_get_and_clear(struct mm_struct *mm, + unsigned long address, +diff --git a/include/linux/sched.h b/include/linux/sched.h +index e7b2f8a5c711c1..8cc46a789193eb 100644 +--- a/include/linux/sched.h ++++ b/include/linux/sched.h +@@ -914,6 +914,10 @@ struct task_struct { + #ifdef CONFIG_MEMCG + unsigned in_user_fault:1; + #endif ++#ifdef CONFIG_LRU_GEN ++ /* whether the LRU algorithm may apply to this access */ ++ unsigned in_lru_fault:1; ++#endif + #ifdef CONFIG_COMPAT_BRK + unsigned brk_randomized:1; + #endif diff --git a/include/linux/swap.h b/include/linux/swap.h -index 43150b9bbc5c..6308150b234a 100644 +index 43150b9bbc5caf..6308150b234a49 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -162,6 +162,10 @@ union swap_header { @@ -4183,10 +1396,40 @@ index 43150b9bbc5c..6308150b234a 100644 + struct lru_gen_mm_walk *mm_walk; +#endif }; - + #ifdef __KERNEL__ +diff --git a/kernel/bounds.c b/kernel/bounds.c +index 9795d75b09b232..b529182e8b04fc 100644 +--- a/kernel/bounds.c ++++ b/kernel/bounds.c +@@ -22,6 +22,13 @@ int main(void) + DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); + #endif + DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); ++#ifdef CONFIG_LRU_GEN ++ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); ++ DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); ++#else ++ DEFINE(LRU_GEN_WIDTH, 0); ++ DEFINE(__LRU_REFS_WIDTH, 0); ++#endif + /* End of constants */ + + return 0; +diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h +index 36b740cb3d59ef..63dc3e82be4f7f 100644 +--- a/kernel/cgroup/cgroup-internal.h ++++ b/kernel/cgroup/cgroup-internal.h +@@ -164,7 +164,6 @@ struct cgroup_mgctx { + #define DEFINE_CGROUP_MGCTX(name) \ + struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) + +-extern struct mutex cgroup_mutex; + extern spinlock_t css_set_lock; + extern struct 
cgroup_subsys *cgroup_subsys[]; + extern struct list_head cgroup_roots; diff --git a/kernel/exit.c b/kernel/exit.c -index 84021b24f79e..98a33bd7c25c 100644 +index 84021b24f79e3d..98a33bd7c25c50 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm) @@ -4198,16 +1441,16 @@ index 84021b24f79e..98a33bd7c25c 100644 put_task_struct(c); } diff --git a/kernel/fork.c b/kernel/fork.c -index 90c85b17bf69..d2da065442af 100644 +index 2b6bd511c6ed1c..2dd4ca002a368d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1152,6 +1152,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, goto fail_nocontext; - + mm->user_ns = get_user_ns(user_ns); + lru_gen_init_mm(mm); return mm; - + fail_nocontext: @@ -1194,6 +1195,7 @@ static inline void __mmput(struct mm_struct *mm) } @@ -4216,11 +1459,11 @@ index 90c85b17bf69..d2da065442af 100644 + lru_gen_del_mm(mm); mmdrop(mm); } - -@@ -2694,6 +2696,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) + +@@ -2692,6 +2694,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) get_task_struct(p); } - + + if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) { + /* lock the task to synchronize with memcg migration */ + task_lock(p); @@ -4229,28 +1472,115 @@ index 90c85b17bf69..d2da065442af 100644 + } + wake_up_new_task(p); - + /* forking complete and child started to run, tell ptracer */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c -index 8fccd8721bb8..2c605bdede47 100644 +index ee28253c9ac0c2..c48c0a19642b6c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c -@@ -5180,6 +5180,7 @@ context_switch(struct rq *rq, struct task_struct *prev, +@@ -5166,6 +5166,7 @@ context_switch(struct rq *rq, struct task_struct *prev, * finish_task_switch()'s mmdrop(). */ switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_use_mm(next->mm); - + if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). 
*/ +diff --git a/mm/Kconfig b/mm/Kconfig +index 0331f1461f81cd..96cd3ae25c6fcd 100644 +--- a/mm/Kconfig ++++ b/mm/Kconfig +@@ -1124,6 +1124,32 @@ config PTE_MARKER_UFFD_WP + purposes. It is required to enable userfaultfd write protection on + file-backed memory types like shmem and hugetlbfs. + ++# multi-gen LRU { ++config LRU_GEN ++ bool "Multi-Gen LRU" ++ depends on MMU ++ # make sure folio->flags has enough spare bits ++ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP ++ help ++ A high performance LRU implementation to overcommit memory. See ++ Documentation/admin-guide/mm/multigen_lru.rst for details. ++ ++config LRU_GEN_ENABLED ++ bool "Enable by default" ++ depends on LRU_GEN ++ help ++ This option enables the multi-gen LRU by default. ++ ++config LRU_GEN_STATS ++ bool "Full stats for debugging" ++ depends on LRU_GEN ++ help ++ Do not enable this option unless you plan to look at historical stats ++ from evicted generations for debugging purpose. ++ ++ This option has a per-memcg and per-node memory overhead. 
++# } ++ + source "mm/damon/Kconfig" + + endmenu +diff --git a/mm/huge_memory.c b/mm/huge_memory.c +index f42bb51e023a03..79e0b08b4cf93c 100644 +--- a/mm/huge_memory.c ++++ b/mm/huge_memory.c +@@ -2438,7 +2438,8 @@ static void __split_huge_page_tail(struct page *head, int tail, + #ifdef CONFIG_64BIT + (1L << PG_arch_2) | + #endif +- (1L << PG_dirty))); ++ (1L << PG_dirty) | ++ LRU_GEN_MASK | LRU_REFS_MASK)); + + /* ->mapping in first tail page is compound_mapcount */ + VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, +diff --git a/mm/internal.h b/mm/internal.h +index 785409805ed797..a1fddea6b34f41 100644 +--- a/mm/internal.h ++++ b/mm/internal.h +@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf); + void folio_rotate_reclaimable(struct folio *folio); + bool __folio_end_writeback(struct folio *folio); + void deactivate_file_folio(struct folio *folio); ++void folio_activate(struct folio *folio); + + void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, + unsigned long floor, unsigned long ceiling); diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 4ea49113b0dd..392b1fd1e8c4 100644 +index b69979c9ced5c2..1c18d7c1ce7174 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c -@@ -6204,6 +6204,30 @@ static void mem_cgroup_move_task(void) +@@ -2789,6 +2789,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) + * - LRU isolation + * - lock_page_memcg() + * - exclusive reference ++ * - mem_cgroup_trylock_pages() + */ + folio->memcg_data = (unsigned long)memcg; + } +@@ -5170,6 +5171,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) + + static void mem_cgroup_free(struct mem_cgroup *memcg) + { ++ lru_gen_exit_memcg(memcg); + memcg_wb_domain_exit(memcg); + __mem_cgroup_free(memcg); + } +@@ -5228,6 +5230,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) + memcg->deferred_split_queue.split_queue_len = 0; + #endif + idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); ++ 
lru_gen_init_memcg(memcg); + return memcg; + fail: + mem_cgroup_id_remove(memcg); +@@ -6196,6 +6199,30 @@ static void mem_cgroup_move_task(void) } #endif - + +#ifdef CONFIG_LRU_GEN +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ @@ -4278,7 +1608,7 @@ index 4ea49113b0dd..392b1fd1e8c4 100644 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { if (value == PAGE_COUNTER_MAX) -@@ -6609,6 +6633,7 @@ struct cgroup_subsys memory_cgrp_subsys = { +@@ -6601,6 +6628,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, @@ -4286,43 +1616,602 @@ index 4ea49113b0dd..392b1fd1e8c4 100644 .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, .dfl_cftypes = memory_files, +diff --git a/mm/memory.c b/mm/memory.c +index a78814413ac03e..cd1b5bfd9f3e9d 100644 +--- a/mm/memory.c ++++ b/mm/memory.c +@@ -125,18 +125,6 @@ int randomize_va_space __read_mostly = + 2; + #endif + +-#ifndef arch_faults_on_old_pte +-static inline bool arch_faults_on_old_pte(void) +-{ +- /* +- * Those arches which don't have hw access flag feature need to +- * implement their own helper. By default, "true" means pagefault +- * will be hit on old pte. +- */ +- return true; +-} +-#endif +- + #ifndef arch_wants_old_prefaulted_pte + static inline bool arch_wants_old_prefaulted_pte(void) + { +@@ -2870,7 +2858,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, + * On architectures with software "accessed" bits, we would + * take a double page fault, so mark it accessed here. 
+ */ +- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { ++ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { + pte_t entry; + + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); +@@ -5120,6 +5108,27 @@ static inline void mm_account_fault(struct pt_regs *regs, + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); + } + ++#ifdef CONFIG_LRU_GEN ++static void lru_gen_enter_fault(struct vm_area_struct *vma) ++{ ++ /* the LRU algorithm doesn't apply to sequential or random reads */ ++ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); ++} ++ ++static void lru_gen_exit_fault(void) ++{ ++ current->in_lru_fault = false; ++} ++#else ++static void lru_gen_enter_fault(struct vm_area_struct *vma) ++{ ++} ++ ++static void lru_gen_exit_fault(void) ++{ ++} ++#endif /* CONFIG_LRU_GEN */ ++ + /* + * By the time we get here, we already hold the mm semaphore + * +@@ -5151,11 +5160,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, + if (flags & FAULT_FLAG_USER) + mem_cgroup_enter_user_fault(); + ++ lru_gen_enter_fault(vma); ++ + if (unlikely(is_vm_hugetlb_page(vma))) + ret = hugetlb_fault(vma->vm_mm, vma, address, flags); + else + ret = __handle_mm_fault(vma, address, flags); + ++ lru_gen_exit_fault(); ++ + if (flags & FAULT_FLAG_USER) { + mem_cgroup_exit_user_fault(); + /* +diff --git a/mm/mm_init.c b/mm/mm_init.c +index 9ddaf0e1b0ab95..0d7b2bd2454a1f 100644 +--- a/mm/mm_init.c ++++ b/mm/mm_init.c +@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) + + shift = 8 * sizeof(unsigned long); + width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH +- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; ++ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; + mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", +- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", ++ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags 
%d\n", + SECTIONS_WIDTH, + NODES_WIDTH, + ZONES_WIDTH, + LAST_CPUPID_WIDTH, + KASAN_TAG_WIDTH, ++ LRU_GEN_WIDTH, ++ LRU_REFS_WIDTH, + NR_PAGEFLAGS); + mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", +diff --git a/mm/mmzone.c b/mm/mmzone.c +index 0ae7571e35abb0..68e1511be12de6 100644 +--- a/mm/mmzone.c ++++ b/mm/mmzone.c +@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec) + * Poison its list head, so that any operations on it would crash. + */ + list_del(&lruvec->lists[LRU_UNEVICTABLE]); ++ ++ lru_gen_init_lruvec(lruvec); + } + + #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) +diff --git a/mm/rmap.c b/mm/rmap.c +index 93d5a6f793d204..9e0ce48bca085d 100644 +--- a/mm/rmap.c ++++ b/mm/rmap.c +@@ -833,6 +833,12 @@ static bool folio_referenced_one(struct folio *folio, + } + + if (pvmw.pte) { ++ if (lru_gen_enabled() && pte_young(*pvmw.pte) && ++ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { ++ lru_gen_look_around(&pvmw); ++ referenced++; ++ } ++ + if (ptep_clear_flush_young_notify(vma, address, + pvmw.pte)) { + /* +diff --git a/mm/swap.c b/mm/swap.c +index 9cee7f6a380942..0a3871a70952f3 100644 +--- a/mm/swap.c ++++ b/mm/swap.c +@@ -366,7 +366,7 @@ static void folio_activate_drain(int cpu) + folio_batch_move_lru(fbatch, folio_activate_fn); + } + +-static void folio_activate(struct folio *folio) ++void folio_activate(struct folio *folio) + { + if (folio_test_lru(folio) && !folio_test_active(folio) && + !folio_test_unevictable(folio)) { +@@ -385,7 +385,7 @@ static inline void folio_activate_drain(int cpu) + { + } + +-static void folio_activate(struct folio *folio) ++void folio_activate(struct folio *folio) + { + struct lruvec *lruvec; + +@@ -428,6 +428,40 @@ static void __lru_cache_activate_folio(struct folio *folio) + local_unlock(&cpu_fbatches.lock); + } + ++#ifdef CONFIG_LRU_GEN ++static void folio_inc_refs(struct folio *folio) ++{ ++ unsigned long new_flags, 
old_flags = READ_ONCE(folio->flags); ++ ++ if (folio_test_unevictable(folio)) ++ return; ++ ++ if (!folio_test_referenced(folio)) { ++ folio_set_referenced(folio); ++ return; ++ } ++ ++ if (!folio_test_workingset(folio)) { ++ folio_set_workingset(folio); ++ return; ++ } ++ ++ /* see the comment on MAX_NR_TIERS */ ++ do { ++ new_flags = old_flags & LRU_REFS_MASK; ++ if (new_flags == LRU_REFS_MASK) ++ break; ++ ++ new_flags += BIT(LRU_REFS_PGOFF); ++ new_flags |= old_flags & ~LRU_REFS_MASK; ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++} ++#else ++static void folio_inc_refs(struct folio *folio) ++{ ++} ++#endif /* CONFIG_LRU_GEN */ ++ + /* + * Mark a page as having seen activity. + * +@@ -440,6 +474,11 @@ static void __lru_cache_activate_folio(struct folio *folio) + */ + void folio_mark_accessed(struct folio *folio) + { ++ if (lru_gen_enabled()) { ++ folio_inc_refs(folio); ++ return; ++ } ++ + if (!folio_test_referenced(folio)) { + folio_set_referenced(folio); + } else if (folio_test_unevictable(folio)) { +@@ -484,6 +523,11 @@ void folio_add_lru(struct folio *folio) + folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + ++ /* see the comment in lru_gen_add_folio() */ ++ if (lru_gen_enabled() && !folio_test_unevictable(folio) && ++ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) ++ folio_set_active(folio); ++ + folio_get(folio); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); +@@ -575,7 +619,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) + + static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) + { +- if (folio_test_active(folio) && !folio_test_unevictable(folio)) { ++ if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) { + long nr_pages = folio_nr_pages(folio); + + lruvec_del_folio(lruvec, folio); +@@ -688,8 +732,8 @@ void deactivate_page(struct page *page) + { + struct folio 
*folio = page_folio(page); + +- if (folio_test_lru(folio) && folio_test_active(folio) && +- !folio_test_unevictable(folio)) { ++ if (folio_test_lru(folio) && !folio_test_unevictable(folio) && ++ (folio_test_active(folio) || lru_gen_enabled())) { + struct folio_batch *fbatch; + + folio_get(folio); diff --git a/mm/vmscan.c b/mm/vmscan.c -index 33a1bdfc04bd..c579b254fec7 100644 +index 382dbe97329f33..146a54cf1bd9e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c -@@ -49,6 +49,8 @@ +@@ -49,6 +49,10 @@ #include #include #include +#include +#include - ++#include ++#include + #include #include -@@ -3082,7 +3084,7 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ - for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) - --static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid) -+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) - { - struct pglist_data *pgdat = NODE_DATA(nid); - -@@ -3127,6 +3129,371 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) - get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; +@@ -129,6 +133,12 @@ struct scan_control { + /* Always discard instead of demoting to lower tier memory */ + unsigned int no_demotion:1; + ++#ifdef CONFIG_LRU_GEN ++ /* help kswapd make better choices among multiple memcgs */ ++ unsigned int memcgs_need_aging:1; ++ unsigned long last_reclaimed; ++#endif ++ + /* Allocation order */ + s8 order; + +@@ -1334,9 +1344,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, + + if (folio_test_swapcache(folio)) { + swp_entry_t swap = folio_swap_entry(folio); +- mem_cgroup_swapout(folio, swap); ++ ++ /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ + if (reclaimed && !mapping_exiting(mapping)) + shadow = workingset_eviction(folio, target_memcg); ++ mem_cgroup_swapout(folio, swap); + __delete_from_swap_cache(folio, swap, shadow); + xa_unlock_irq(&mapping->i_pages); 
+ put_swap_page(&folio->page, swap); +@@ -1633,6 +1645,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, + if (!sc->may_unmap && folio_mapped(folio)) + goto keep_locked; + ++ /* folio_update_gen() tried to promote this page? */ ++ if (lru_gen_enabled() && !ignore_references && ++ folio_mapped(folio) && folio_test_referenced(folio)) ++ goto keep_locked; ++ + /* + * The number of dirty pages determines if a node is marked + * reclaim_congested. kswapd will stall and start writing +@@ -2728,6 +2745,112 @@ enum scan_balance { + SCAN_FILE, + }; + ++static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) ++{ ++ unsigned long file; ++ struct lruvec *target_lruvec; ++ ++ if (lru_gen_enabled()) ++ return; ++ ++ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); ++ ++ /* ++ * Flush the memory cgroup stats, so that we read accurate per-memcg ++ * lruvec stats for heuristics. ++ */ ++ mem_cgroup_flush_stats(); ++ ++ /* ++ * Determine the scan balance between anon and file LRUs. ++ */ ++ spin_lock_irq(&target_lruvec->lru_lock); ++ sc->anon_cost = target_lruvec->anon_cost; ++ sc->file_cost = target_lruvec->file_cost; ++ spin_unlock_irq(&target_lruvec->lru_lock); ++ ++ /* ++ * Target desirable inactive:active list ratios for the anon ++ * and file LRU lists. ++ */ ++ if (!sc->force_deactivate) { ++ unsigned long refaults; ++ ++ refaults = lruvec_page_state(target_lruvec, ++ WORKINGSET_ACTIVATE_ANON); ++ if (refaults != target_lruvec->refaults[0] || ++ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) ++ sc->may_deactivate |= DEACTIVATE_ANON; ++ else ++ sc->may_deactivate &= ~DEACTIVATE_ANON; ++ ++ /* ++ * When refaults are being observed, it means a new ++ * workingset is being established. Deactivate to get ++ * rid of any stale active pages quickly. 
++ */ ++ refaults = lruvec_page_state(target_lruvec, ++ WORKINGSET_ACTIVATE_FILE); ++ if (refaults != target_lruvec->refaults[1] || ++ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) ++ sc->may_deactivate |= DEACTIVATE_FILE; ++ else ++ sc->may_deactivate &= ~DEACTIVATE_FILE; ++ } else ++ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; ++ ++ /* ++ * If we have plenty of inactive file pages that aren't ++ * thrashing, try to reclaim those first before touching ++ * anonymous pages. ++ */ ++ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); ++ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) ++ sc->cache_trim_mode = 1; ++ else ++ sc->cache_trim_mode = 0; ++ ++ /* ++ * Prevent the reclaimer from falling into the cache trap: as ++ * cache pages start out inactive, every cache fault will tip ++ * the scan balance towards the file LRU. And as the file LRU ++ * shrinks, so does the window for rotation from references. ++ * This means we have a runaway feedback loop where a tiny ++ * thrashing file LRU becomes infinitely more attractive than ++ * anon pages. Try to detect this based on file LRU size. ++ */ ++ if (!cgroup_reclaim(sc)) { ++ unsigned long total_high_wmark = 0; ++ unsigned long free, anon; ++ int z; ++ ++ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); ++ file = node_page_state(pgdat, NR_ACTIVE_FILE) + ++ node_page_state(pgdat, NR_INACTIVE_FILE); ++ ++ for (z = 0; z < MAX_NR_ZONES; z++) { ++ struct zone *zone = &pgdat->node_zones[z]; ++ ++ if (!managed_zone(zone)) ++ continue; ++ ++ total_high_wmark += high_wmark_pages(zone); ++ } ++ ++ /* ++ * Consider anon: if that's low too, this isn't a ++ * runaway file reclaim problem, but rather just ++ * extreme pressure. Reclaim as per usual then. 
++ */ ++ anon = node_page_state(pgdat, NR_INACTIVE_ANON); ++ ++ sc->file_is_tiny = ++ file + free <= total_high_wmark && ++ !(sc->may_deactivate & DEACTIVATE_ANON) && ++ anon >> sc->priority; ++ } ++} ++ + /* + * Determine how aggressively the anon and file LRU lists should be + * scanned. +@@ -2947,152 +3070,2904 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, + return can_demote(pgdat->node_id, sc); } - + +-static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +-{ +- unsigned long nr[NR_LRU_LISTS]; +- unsigned long targets[NR_LRU_LISTS]; +- unsigned long nr_to_scan; +- enum lru_list lru; +- unsigned long nr_reclaimed = 0; +- unsigned long nr_to_reclaim = sc->nr_to_reclaim; +- struct blk_plug plug; +- bool scan_adjusted; ++#ifdef CONFIG_LRU_GEN + +- get_scan_count(lruvec, sc, nr); ++#ifdef CONFIG_LRU_GEN_ENABLED ++DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); ++#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) ++#else ++DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); ++#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap]) ++#endif + +- /* Record the original scan target for proportional adjustments later */ +- memcpy(targets, nr, sizeof(nr)); ++/****************************************************************************** ++ * shorthand helpers ++ ******************************************************************************/ + +- /* +- * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal +- * event that can occur when there is little memory pressure e.g. +- * multiple streaming readers/writers. Hence, we do not abort scanning +- * when the requested number of pages are reclaimed when scanning at +- * DEF_PRIORITY on the assumption that the fact we are direct +- * reclaiming implies that kswapd is not keeping up and it is best to +- * do a batch of work at once. 
For memcg reclaim one check is made to +- * abort proportional reclaim if either the file or anon lru has already +- * dropped to zero at the first pass. +- */ +- scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && +- sc->priority == DEF_PRIORITY); ++#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) + +- blk_start_plug(&plug); +- while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || +- nr[LRU_INACTIVE_FILE]) { +- unsigned long nr_anon, nr_file, percentage; +- unsigned long nr_scanned; ++#define DEFINE_MAX_SEQ(lruvec) \ ++ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) + +- for_each_evictable_lru(lru) { +- if (nr[lru]) { +- nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); +- nr[lru] -= nr_to_scan; ++#define DEFINE_MIN_SEQ(lruvec) \ ++ unsigned long min_seq[ANON_AND_FILE] = { \ ++ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ ++ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ ++ } + +- nr_reclaimed += shrink_list(lru, nr_to_scan, +- lruvec, sc); +- } +- } ++#define for_each_gen_type_zone(gen, type, zone) \ ++ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ ++ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ ++ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) + +- cond_resched(); ++static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) ++{ ++ struct pglist_data *pgdat = NODE_DATA(nid); + +- if (nr_reclaimed < nr_to_reclaim || scan_adjusted) +- continue; ++#ifdef CONFIG_MEMCG ++ if (memcg) { ++ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; + +- /* +- * For kswapd and memcg, reclaim at least the number of pages +- * requested. Ensure that the anon and file LRUs are scanned +- * proportionally what was requested by get_scan_count(). We +- * stop reclaiming one LRU and reduce the amount scanning +- * proportional to the original scan target. 
+- */ +- nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; +- nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; ++ /* for hotadd_new_pgdat() */ ++ if (!lruvec->pgdat) ++ lruvec->pgdat = pgdat; + +- /* +- * It's just vindictive to attack the larger once the smaller +- * has gone to zero. And given the way we stop scanning the +- * smaller below, this makes sure that we only make one nudge +- * towards proportionality once we've got nr_to_reclaim. +- */ +- if (!nr_file || !nr_anon) +- break; ++ return lruvec; ++ } ++#endif ++ VM_WARN_ON_ONCE(!mem_cgroup_disabled()); + +- if (nr_file > nr_anon) { +- unsigned long scan_target = targets[LRU_INACTIVE_ANON] + +- targets[LRU_ACTIVE_ANON] + 1; +- lru = LRU_BASE; +- percentage = nr_anon * 100 / scan_target; +- } else { +- unsigned long scan_target = targets[LRU_INACTIVE_FILE] + +- targets[LRU_ACTIVE_FILE] + 1; +- lru = LRU_FILE; +- percentage = nr_file * 100 / scan_target; +- } ++ return pgdat ? &pgdat->__lruvec : NULL; ++} + +- /* Stop scanning the smaller of the LRU */ +- nr[lru] = 0; +- nr[lru + LRU_ACTIVE] = 0; ++static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ struct pglist_data *pgdat = lruvec_pgdat(lruvec); + +- /* +- * Recalculate the other LRU scan count based on its original +- * scan target and the percentage scanning already complete +- */ +- lru = (lru == LRU_FILE) ? 
LRU_BASE : LRU_FILE; +- nr_scanned = targets[lru] - nr[lru]; +- nr[lru] = targets[lru] * (100 - percentage) / 100; +- nr[lru] -= min(nr[lru], nr_scanned); ++ if (!can_demote(pgdat->node_id, sc) && ++ mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) ++ return 0; + +- lru += LRU_ACTIVE; +- nr_scanned = targets[lru] - nr[lru]; +- nr[lru] = targets[lru] * (100 - percentage) / 100; +- nr[lru] -= min(nr[lru], nr_scanned); ++ return mem_cgroup_swappiness(memcg); ++} + +- scan_adjusted = true; +- } +- blk_finish_plug(&plug); +- sc->nr_reclaimed += nr_reclaimed; ++static int get_nr_gens(struct lruvec *lruvec, int type) ++{ ++ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; ++} + +- /* +- * Even if we did not try to evict anon pages at all, we want to +- * rebalance the anon lru active/inactive ratio. +- */ +- if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && +- inactive_is_low(lruvec, LRU_INACTIVE_ANON)) +- shrink_active_list(SWAP_CLUSTER_MAX, lruvec, +- sc, LRU_ACTIVE_ANON); ++static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) ++{ ++ /* see the comment on lru_gen_struct */ ++ return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && ++ get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && ++ get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; + } + +-/* Use reclaim/compaction for costly allocs or under memory pressure */ +-static bool in_reclaim_compaction(struct scan_control *sc) +/****************************************************************************** + * mm_struct list + ******************************************************************************/ + +static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg) -+{ + { +- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && +- (sc->order > PAGE_ALLOC_COSTLY_ORDER || +- sc->priority < DEF_PRIORITY - 2)) +- return true; + static struct lru_gen_mm_list mm_list = { + .fifo = LIST_HEAD_INIT(mm_list.fifo), + .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock), + }; -+ 
+ +- return false; +#ifdef CONFIG_MEMCG + if (memcg) + return &memcg->mm_list; @@ -4330,21 +2219,38 @@ index 33a1bdfc04bd..c579b254fec7 100644 + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); + + return &mm_list; -+} -+ + } + +-/* +- * Reclaim/compaction is used for high-order allocation requests. It reclaims +- * order-0 pages before compacting the zone. should_continue_reclaim() returns +- * true if more pages should be reclaimed such that when the page allocator +- * calls try_to_compact_pages() that it will have enough free pages to succeed. +- * It will give up earlier than that if there is difficulty reclaiming pages. +- */ +-static inline bool should_continue_reclaim(struct pglist_data *pgdat, +- unsigned long nr_reclaimed, +- struct scan_control *sc) +void lru_gen_add_mm(struct mm_struct *mm) -+{ + { +- unsigned long pages_for_compaction; +- unsigned long inactive_lru_pages; +- int z; + int nid; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); -+ + +- /* If not in reclaim/compaction mode, stop */ +- if (!in_reclaim_compaction(sc)) +- return false; + VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list)); +#ifdef CONFIG_MEMCG + VM_WARN_ON_ONCE(mm->lru_gen.memcg); + mm->lru_gen.memcg = memcg; +#endif + spin_lock(&mm_list->lock); -+ + +- /* + for_each_node_state(nid, N_MEMORY) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + @@ -4677,13 +2583,156 @@ index 33a1bdfc04bd..c579b254fec7 100644 + return success; +} + - /****************************************************************************** - * refault feedback loop - ******************************************************************************/ -@@ -3277,6 +3644,118 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - return new_gen; - } - ++/****************************************************************************** ++ * refault feedback loop ++ 
******************************************************************************/ ++ ++/* ++ * A feedback loop based on Proportional-Integral-Derivative (PID) controller. ++ * ++ * The P term is refaulted/(evicted+protected) from a tier in the generation ++ * currently being evicted; the I term is the exponential moving average of the ++ * P term over the generations previously evicted, using the smoothing factor ++ * 1/2; the D term isn't supported. ++ * ++ * The setpoint (SP) is always the first tier of one type; the process variable ++ * (PV) is either any tier of the other type or any other tier of the same ++ * type. ++ * ++ * The error is the difference between the SP and the PV; the correction is to ++ * turn off protection when SP>PV or turn on protection when SPlrugen; ++ int hist = lru_hist_from_seq(lrugen->min_seq[type]); ++ ++ pos->refaulted = lrugen->avg_refaulted[type][tier] + ++ atomic_long_read(&lrugen->refaulted[hist][type][tier]); ++ pos->total = lrugen->avg_total[type][tier] + ++ atomic_long_read(&lrugen->evicted[hist][type][tier]); ++ if (tier) ++ pos->total += lrugen->protected[hist][type][tier - 1]; ++ pos->gain = gain; ++} ++ ++static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover) ++{ ++ int hist, tier; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1; ++ unsigned long seq = carryover ? 
lrugen->min_seq[type] : lrugen->max_seq + 1; ++ ++ lockdep_assert_held(&lruvec->lru_lock); ++ ++ if (!carryover && !clear) ++ return; ++ ++ hist = lru_hist_from_seq(seq); ++ ++ for (tier = 0; tier < MAX_NR_TIERS; tier++) { ++ if (carryover) { ++ unsigned long sum; ++ ++ sum = lrugen->avg_refaulted[type][tier] + ++ atomic_long_read(&lrugen->refaulted[hist][type][tier]); ++ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); ++ ++ sum = lrugen->avg_total[type][tier] + ++ atomic_long_read(&lrugen->evicted[hist][type][tier]); ++ if (tier) ++ sum += lrugen->protected[hist][type][tier - 1]; ++ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); ++ } ++ ++ if (clear) { ++ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); ++ atomic_long_set(&lrugen->evicted[hist][type][tier], 0); ++ if (tier) ++ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); ++ } ++ } ++} ++ ++static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) ++{ ++ /* ++ * Return true if the PV has a limited number of refaults or a lower ++ * refaulted/total than the SP. ++ */ ++ return pv->refaulted < MIN_LRU_BATCH || ++ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= ++ (sp->refaulted + 1) * pv->total * pv->gain; ++} ++ ++/****************************************************************************** ++ * the aging ++ ******************************************************************************/ ++ ++/* promote pages accessed through page tables */ ++static int folio_update_gen(struct folio *folio, int gen) ++{ ++ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); ++ ++ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); ++ VM_WARN_ON_ONCE(!rcu_read_lock_held()); ++ ++ do { ++ /* lru_gen_del_folio() has isolated this page? 
*/ ++ if (!(old_flags & LRU_GEN_MASK)) { ++ /* for shrink_page_list() */ ++ new_flags = old_flags | BIT(PG_referenced); ++ continue; ++ } ++ ++ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); ++ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++ ++ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; ++} ++ ++/* protect pages accessed multiple times through file descriptors */ ++static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) ++{ ++ int type = folio_is_file_lru(folio); ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); ++ ++ VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); ++ ++ do { ++ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; ++ /* folio_update_gen() has promoted this page? */ ++ if (new_gen >= 0 && new_gen != old_gen) ++ return new_gen; ++ ++ new_gen = (old_gen + 1) % MAX_NR_GENS; ++ ++ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); ++ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; ++ /* for folio_end_writeback() */ ++ if (reclaiming) ++ new_flags |= BIT(PG_reclaim); ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++ ++ lru_gen_update_size(lruvec, folio, old_gen, new_gen); ++ ++ return new_gen; ++} ++ +static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio, + int old_gen, int new_gen) +{ @@ -4796,13 +2845,24 @@ index 33a1bdfc04bd..c579b254fec7 100644 + return false; +} + - static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) - { - unsigned long pfn = pte_pfn(pte); -@@ -3295,8 +3774,28 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned - return pfn; - } - ++static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned 
long addr) ++{ ++ unsigned long pfn = pte_pfn(pte); ++ ++ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); ++ ++ if (!pte_present(pte) || is_zero_pfn(pfn)) ++ return -1; ++ ++ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) ++ return -1; ++ ++ if (WARN_ON_ONCE(!pfn_valid(pfn))) ++ return -1; ++ ++ return pfn; ++} ++ +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr) +{ @@ -4823,23 +2883,29 @@ index 33a1bdfc04bd..c579b254fec7 100644 +} +#endif + - static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, -- struct pglist_data *pgdat) ++static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, + struct pglist_data *pgdat, bool can_swap) - { - struct folio *folio; - -@@ -3311,9 +3810,375 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, - if (folio_memcg_rcu(folio) != memcg) - return NULL; - ++{ ++ struct folio *folio; ++ ++ /* try to avoid unnecessary memory loads */ ++ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) ++ return NULL; ++ ++ folio = pfn_folio(pfn); ++ if (folio_nid(folio) != pgdat->node_id) ++ return NULL; ++ ++ if (folio_memcg_rcu(folio) != memcg) ++ return NULL; ++ + /* file VMAs can contain anon pages from COW */ + if (!folio_is_file_lru(folio) && !can_swap) + return NULL; + - return folio; - } - ++ return folio; ++} ++ +static bool suitable_to_scan(int total, int young) +{ + int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8); @@ -4963,7 +3029,8 @@ index 33a1bdfc04bd..c579b254fec7 100644 + goto next; + + if (!pmd_trans_huge(pmd[i])) { -+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) ++ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && ++ get_cap(LRU_GEN_NONLEAF_YOUNG)) + pmdp_test_and_clear_young(vma, addr, pmd + i); + goto next; + } @@ -5061,10 +3128,12 @@ index 33a1bdfc04bd..c579b254fec7 
100644 + walk->mm_stats[MM_NONLEAF_TOTAL]++; + +#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -+ if (!pmd_young(val)) -+ continue; ++ if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { ++ if (!pmd_young(val)) ++ continue; + -+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); ++ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); ++ } +#endif + if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) + continue; @@ -5202,39 +3271,143 @@ index 33a1bdfc04bd..c579b254fec7 100644 + kfree(walk); +} + - static void inc_min_seq(struct lruvec *lruvec, int type) - { - struct lru_gen_struct *lrugen = &lruvec->lrugen; -@@ -3365,7 +4230,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) - return success; - } - --static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap) -+static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - { - int prev, next; - int type, zone; -@@ -3375,9 +4240,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s - - VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); - -- if (max_seq != lrugen->max_seq) -- goto unlock; -- - for (type = ANON_AND_FILE - 1; type >= 0; type--) { - if (get_nr_gens(lruvec, type) != MAX_NR_GENS) - continue; -@@ -3415,10 +4277,76 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s - - /* make sure preceding modifications appear */ - smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); --unlock: ++static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) ++{ ++ int zone; ++ int remaining = MAX_LRU_BATCH; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ ++ if (type == LRU_GEN_ANON && !can_swap) ++ goto done; ++ ++ /* prevent cold/hot inversion if force_scan is true */ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ struct list_head *head = &lrugen->lists[old_gen][type][zone]; ++ ++ while 
(!list_empty(head)) { ++ struct folio *folio = lru_to_folio(head); ++ ++ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); ++ ++ new_gen = folio_inc_gen(lruvec, folio, false); ++ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]); ++ ++ if (!--remaining) ++ return false; ++ } ++ } ++done: ++ reset_ctrl_pos(lruvec, type, true); ++ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); ++ ++ return true; ++} ++ ++static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) ++{ ++ int gen, type, zone; ++ bool success = false; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ DEFINE_MIN_SEQ(lruvec); ++ ++ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); ++ ++ /* find the oldest populated generation */ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { ++ gen = lru_gen_from_seq(min_seq[type]); ++ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ if (!list_empty(&lrugen->lists[gen][type][zone])) ++ goto next; ++ } ++ ++ min_seq[type]++; ++ } ++next: ++ ; ++ } ++ ++ /* see the comment on lru_gen_struct */ ++ if (can_swap) { ++ min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); ++ min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); ++ } ++ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ if (min_seq[type] == lrugen->min_seq[type]) ++ continue; ++ ++ reset_ctrl_pos(lruvec, type, true); ++ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); ++ success = true; ++ } ++ ++ return success; ++} ++ ++static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) ++{ ++ int prev, next; ++ int type, zone; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ spin_lock_irq(&lruvec->lru_lock); ++ ++ 
VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); ++ ++ for (type = ANON_AND_FILE - 1; type >= 0; type--) { ++ if (get_nr_gens(lruvec, type) != MAX_NR_GENS) ++ continue; ++ ++ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); ++ ++ while (!inc_min_seq(lruvec, type, can_swap)) { ++ spin_unlock_irq(&lruvec->lru_lock); ++ cond_resched(); ++ spin_lock_irq(&lruvec->lru_lock); ++ } ++ } ++ ++ /* ++ * Update the active/inactive LRU sizes for compatibility. Both sides of ++ * the current max_seq need to be covered, since max_seq+1 can overlap ++ * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do ++ * overlap, cold/hot inversion happens. ++ */ ++ prev = lru_gen_from_seq(lrugen->max_seq - 1); ++ next = lru_gen_from_seq(lrugen->max_seq + 1); ++ ++ for (type = 0; type < ANON_AND_FILE; type++) { ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ enum lru_list lru = type * LRU_INACTIVE_FILE; ++ long delta = lrugen->nr_pages[prev][type][zone] - ++ lrugen->nr_pages[next][type][zone]; ++ ++ if (!delta) ++ continue; ++ ++ __update_lru_size(lruvec, lru, zone, delta); ++ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); ++ } ++ } ++ ++ for (type = 0; type < ANON_AND_FILE; type++) ++ reset_ctrl_pos(lruvec, type, false); ++ ++ WRITE_ONCE(lrugen->timestamps[next], jiffies); ++ /* make sure preceding modifications appear */ ++ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++} + - spin_unlock_irq(&lruvec->lru_lock); - } - +static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, -+ struct scan_control *sc, bool can_swap) ++ struct scan_control *sc, bool can_swap, bool force_scan) +{ + bool success; + struct lru_gen_mm_walk *walk; @@ -5255,7 +3428,7 @@ index 33a1bdfc04bd..c579b254fec7 100644 + * handful of PTEs. Spreading the work out over a period of time usually + * is less efficient, but it avoids bursty page faults. 
+ */ -+ if (!arch_has_hw_pte_young()) { ++ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } @@ -5269,7 +3442,7 @@ index 33a1bdfc04bd..c579b254fec7 100644 + walk->lruvec = lruvec; + walk->max_seq = max_seq; + walk->can_swap = can_swap; -+ walk->force_scan = false; ++ walk->force_scan = force_scan; + + do { + success = iterate_mm_list(lruvec, walk, &mm); @@ -5289,379 +3462,118 @@ index 33a1bdfc04bd..c579b254fec7 100644 + + VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq)); + -+ inc_max_seq(lruvec, can_swap); ++ inc_max_seq(lruvec, can_swap, force_scan); + /* either this sees any waiters or they will see updated max_seq */ + if (wq_has_sleeper(&lruvec->mm_state.wait)) + wake_up_all(&lruvec->mm_state.wait); + -+ wakeup_flusher_threads(WB_REASON_VMSCAN); ++ return true; ++} ++ ++static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, ++ struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan) ++{ ++ int gen, type, zone; ++ unsigned long old = 0; ++ unsigned long young = 0; ++ unsigned long total = 0; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ unsigned long seq; ++ ++ for (seq = min_seq[type]; seq <= max_seq; seq++) { ++ unsigned long size = 0; ++ ++ gen = lru_gen_from_seq(seq); ++ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) ++ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L); ++ ++ total += size; ++ if (seq == max_seq) ++ young += size; ++ else if (seq + MIN_NR_GENS == max_seq) ++ old += size; ++ } ++ } ++ ++ /* try to scrape all its memory if this memcg was deleted */ ++ *nr_to_scan = mem_cgroup_online(memcg) ? 
(total >> sc->priority) : total; ++ ++ /* ++ * The aging tries to be lazy to reduce the overhead, while the eviction ++ * stalls when the number of generations reaches MIN_NR_GENS. Hence, the ++ * ideal number of generations is MIN_NR_GENS+1. ++ */ ++ if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) ++ return true; ++ if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) ++ return false; ++ ++ /* ++ * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1) ++ * of the total number of pages for each generation. A reasonable range ++ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The ++ * aging cares about the upper bound of hot pages, while the eviction ++ * cares about the lower bound of cold pages. ++ */ ++ if (young * MIN_NR_GENS > total) ++ return true; ++ if (old * (MIN_NR_GENS + 2) < total) ++ return true; ++ ++ return false; ++} ++ ++static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long min_ttl) ++{ ++ bool need_aging; ++ unsigned long nr_to_scan; ++ int swappiness = get_swappiness(lruvec, sc); ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ DEFINE_MAX_SEQ(lruvec); ++ DEFINE_MIN_SEQ(lruvec); ++ ++ VM_WARN_ON_ONCE(sc->memcg_low_reclaim); ++ ++ mem_cgroup_calculate_protection(NULL, memcg); ++ ++ if (mem_cgroup_below_min(memcg)) ++ return false; ++ ++ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); ++ ++ if (min_ttl) { ++ int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); ++ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); ++ ++ if (time_is_after_jiffies(birth + min_ttl)) ++ return false; ++ ++ /* the size is likely too small to be helpful */ ++ if (!nr_to_scan && sc->priority != DEF_PRIORITY) ++ return false; ++ } ++ ++ if (need_aging) ++ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); + + return true; +} + - static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, - struct scan_control *sc, 
bool can_swap, unsigned long *nr_to_scan) - { -@@ -3494,7 +4422,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) - - need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); - if (need_aging) -- inc_max_seq(lruvec, max_seq, swappiness); -+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); - } - - static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -@@ -3503,6 +4431,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - VM_WARN_ON_ONCE(!current_is_kswapd()); - -+ set_mm_walk(pgdat); ++/* to protect the working set of the last N jiffies */ ++static unsigned long lru_gen_min_ttl __read_mostly; + - memcg = mem_cgroup_iter(NULL, NULL, NULL); - do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -@@ -3511,11 +4441,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - cond_resched(); - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); ++static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) ++{ ++ struct mem_cgroup *memcg; ++ bool success = false; ++ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + -+ clear_mm_walk(); - } - - /* - * This function exploits spatial locality when shrink_page_list() walks the -- * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. -+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If -+ * the scan was done cacheline efficiently, it adds the PMD entry pointing to -+ * the PTE table to the Bloom filter. This forms a feedback loop between the -+ * eviction and the aging. 
- */ - void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - { -@@ -3524,6 +4459,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - unsigned long start; - unsigned long end; - unsigned long addr; -+ struct lru_gen_mm_walk *walk; -+ int young = 0; - unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; - struct folio *folio = pfn_folio(pvmw->pfn); - struct mem_cgroup *memcg = folio_memcg(folio); -@@ -3538,6 +4475,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (spin_is_contended(pvmw->ptl)) - return; - -+ /* avoid taking the LRU lock under the PTL when possible */ -+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; ++ VM_WARN_ON_ONCE(!current_is_kswapd()); + - start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); - end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; - -@@ -3567,13 +4507,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (!pte_young(pte[i])) - continue; - -- folio = get_pfn_folio(pfn, memcg, pgdat); -+ folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap); - if (!folio) - continue; - - if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); - -+ young++; -+ - if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && - !(folio_test_anon(folio) && folio_test_swapbacked(folio) && - !folio_test_swapcache(folio))) -@@ -3589,7 +4531,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - arch_leave_lazy_mmu_mode(); - rcu_read_unlock(); - -- if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { -+ /* feedback from rmap walkers to page table walkers */ -+ if (suitable_to_scan(i, young)) -+ update_bloom_filter(lruvec, max_seq, pvmw->pmd); -+ -+ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { - for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { - folio = pfn_folio(pte_pfn(pte[i])); - folio_activate(folio); -@@ -3601,8 +4547,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk 
*pvmw) - if (!mem_cgroup_trylock_pages(memcg)) - return; - -- spin_lock_irq(&lruvec->lru_lock); -- new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ if (!walk) { -+ spin_lock_irq(&lruvec->lru_lock); -+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ } - - for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { - folio = pfn_folio(pte_pfn(pte[i])); -@@ -3613,10 +4561,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (old_gen < 0 || old_gen == new_gen) - continue; - -- lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ if (walk) -+ update_batch_size(walk, folio, old_gen, new_gen); -+ else -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); - } - -- spin_unlock_irq(&lruvec->lru_lock); -+ if (!walk) -+ spin_unlock_irq(&lruvec->lru_lock); - - mem_cgroup_unlock_pages(); - } -@@ -3899,6 +4851,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - struct folio *folio; - enum vm_event_item item; - struct reclaim_stat stat; -+ struct lru_gen_mm_walk *walk; - struct mem_cgroup *memcg = lruvec_memcg(lruvec); - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - -@@ -3935,6 +4888,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - - move_pages_to_lru(lruvec, &list); - -+ walk = current->reclaim_state->mm_walk; -+ if (walk && walk->batched) -+ reset_batch_size(lruvec, walk); -+ - item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; - if (!cgroup_reclaim(sc)) - __count_vm_events(item, reclaimed); -@@ -3951,6 +4908,11 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - return scanned; - } - -+/* -+ * For future optimizations: -+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg -+ * reclaim. 
-+ */ - static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, - bool can_swap) - { -@@ -3976,7 +4938,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - if (current_is_kswapd()) - return 0; - -- inc_max_seq(lruvec, max_seq, can_swap); -+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap)) -+ return nr_to_scan; - done: - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; - } -@@ -3990,6 +4953,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - - blk_start_plug(&plug); - -+ set_mm_walk(lruvec_pgdat(lruvec)); -+ - while (true) { - int delta; - int swappiness; -@@ -4017,6 +4982,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - cond_resched(); - } - -+ clear_mm_walk(); -+ - blk_finish_plug(&plug); - } - -@@ -4033,15 +5000,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec) - - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -+ -+ lruvec->mm_state.seq = MIN_NR_GENS; -+ init_waitqueue_head(&lruvec->mm_state.wait); - } - - #ifdef CONFIG_MEMCG - void lru_gen_init_memcg(struct mem_cgroup *memcg) - { -+ INIT_LIST_HEAD(&memcg->mm_list.fifo); -+ spin_lock_init(&memcg->mm_list.lock); - } - - void lru_gen_exit_memcg(struct mem_cgroup *memcg) - { -+ int i; - int nid; - - for_each_node(nid) { -@@ -4049,6 +5022,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg) - - VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, - sizeof(lruvec->lrugen.nr_pages))); -+ -+ for (i = 0; i < NR_BLOOM_FILTERS; i++) { -+ bitmap_free(lruvec->mm_state.filters[i]); -+ lruvec->mm_state.filters[i] = NULL; -+ } - } - } - #endif --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (7 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-28 18:46 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch Yu Zhao - ` (5 subsequent siblings) - 14 siblings, 1 reply; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -When multiple memcgs are available, it is possible to use generations -as a frame of reference to make better choices and improve overall -performance under global memory pressure. This patch adds a basic -optimization to select memcgs that can drop single-use unmapped clean -pages first. Doing so reduces the chance of going into the aging path -or swapping, which can be costly. - -A typical example that benefits from this optimization is a server -running mixed types of workloads, e.g., heavy anon workload in one -memcg and heavy buffered I/O workload in the other. - -Though this optimization can be applied to both kswapd and direct -reclaim, it is only added to kswapd to keep the patchset manageable. -Later improvements may cover the direct reclaim path. - -While ensuring certain fairness to all eligible memcgs, proportional -scans of individual memcgs also require proper backoff to avoid -overshooting their aggregate reclaim target by too much. Otherwise it -can cause high direct reclaim latency. 
The conditions for backoff are: -1. At low priorities, for direct reclaim, if aging fairness or direct - reclaim latency is at risk, i.e., aging one memcg multiple times or - swapping after the target is met. -2. At high priorities, for global reclaim, if per-zone free pages are - above respective watermarks. - -Server benchmark results: - Mixed workloads: - fio (buffered I/O): +[19, 21]% - IOPS BW - patch1-8: 1880k 7343MiB/s - patch1-9: 2252k 8796MiB/s - - memcached (anon): +[119, 123]% - Ops/sec KB/sec - patch1-8: 862768.65 33514.68 - patch1-9: 1911022.12 74234.54 - - Mixed workloads: - fio (buffered I/O): +[75, 77]% - IOPS BW - 5.19-rc1: 1279k 4996MiB/s - patch1-9: 2252k 8796MiB/s - - memcached (anon): +[13, 15]% - Ops/sec KB/sec - 5.19-rc1: 1673524.04 65008.87 - patch1-9: 1911022.12 74234.54 - - Configurations: - (changes since patch 6) - - cat mixed.sh - modprobe brd rd_nr=2 rd_size=56623104 - - swapoff -a - mkswap /dev/ram0 - swapon /dev/ram0 - - mkfs.ext4 /dev/ram1 - mount -t ext4 /dev/ram1 /mnt - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \ - --ratio 1:0 --pipeline 8 -d 2000 - - fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \ - --buffered=1 --ioengine=io_uring --iodepth=128 \ - --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ - --rw=randread --random_distribution=random --norandommap \ - --time_based --ramp_time=10m --runtime=90m --group_reporting & - pid=$! 
- - sleep 200 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \ - --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed - - kill -INT $pid - wait - -Client benchmark results: - no change (CONFIG_MEMCG=n) - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - mm/vmscan.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++----- - 1 file changed, 96 insertions(+), 9 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index c579b254fec7..3f83325fdc71 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -131,6 +131,12 @@ struct scan_control { - /* Always discard instead of demoting to lower tier memory */ - unsigned int no_demotion:1; - -+#ifdef CONFIG_LRU_GEN -+ /* help kswapd make better choices among multiple memcgs */ -+ unsigned int memcgs_need_aging:1; -+ unsigned long last_reclaimed; -+#endif -+ - /* Allocation order */ - s8 order; - -@@ -4431,6 +4437,19 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - VM_WARN_ON_ONCE(!current_is_kswapd()); - + sc->last_reclaimed = sc->nr_reclaimed; + + /* @@ -5675,55 +3587,542 @@ index c579b254fec7..3f83325fdc71 100644 + return; + } + - set_mm_walk(pgdat); - - memcg = mem_cgroup_iter(NULL, NULL, NULL); -@@ -4842,7 +4861,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw - return scanned; - } - --static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) ++ set_mm_walk(pgdat); ++ ++ memcg = mem_cgroup_iter(NULL, NULL, NULL); ++ do { ++ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ ++ 
if (age_lruvec(lruvec, sc, min_ttl)) ++ success = true; ++ ++ cond_resched(); ++ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); ++ ++ clear_mm_walk(); ++ ++ /* check the order to exclude compaction-induced reclaim */ ++ if (success || !min_ttl || sc->order) ++ return; ++ ++ /* ++ * The main goal is to OOM kill if every generation from all memcgs is ++ * younger than min_ttl. However, another possibility is all memcgs are ++ * either below min or empty. ++ */ ++ if (mutex_trylock(&oom_lock)) { ++ struct oom_control oc = { ++ .gfp_mask = sc->gfp_mask, ++ }; ++ ++ out_of_memory(&oc); ++ ++ mutex_unlock(&oom_lock); ++ } ++} ++ ++/* ++ * This function exploits spatial locality when shrink_page_list() walks the ++ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If ++ * the scan was done cacheline efficiently, it adds the PMD entry pointing to ++ * the PTE table to the Bloom filter. This forms a feedback loop between the ++ * eviction and the aging. ++ */ ++void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) ++{ ++ int i; ++ pte_t *pte; ++ unsigned long start; ++ unsigned long end; ++ unsigned long addr; ++ struct lru_gen_mm_walk *walk; ++ int young = 0; ++ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; ++ struct folio *folio = pfn_folio(pvmw->pfn); ++ struct mem_cgroup *memcg = folio_memcg(folio); ++ struct pglist_data *pgdat = folio_pgdat(folio); ++ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ DEFINE_MAX_SEQ(lruvec); ++ int old_gen, new_gen = lru_gen_from_seq(max_seq); ++ ++ lockdep_assert_held(pvmw->ptl); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); ++ ++ if (spin_is_contended(pvmw->ptl)) ++ return; ++ ++ /* avoid taking the LRU lock under the PTL when possible */ ++ walk = current->reclaim_state ? 
current->reclaim_state->mm_walk : NULL; ++ ++ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); ++ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; ++ ++ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { ++ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) ++ end = start + MIN_LRU_BATCH * PAGE_SIZE; ++ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) ++ start = end - MIN_LRU_BATCH * PAGE_SIZE; ++ else { ++ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; ++ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; ++ } ++ } ++ ++ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; ++ ++ rcu_read_lock(); ++ arch_enter_lazy_mmu_mode(); ++ ++ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { ++ unsigned long pfn; ++ ++ pfn = get_pte_pfn(pte[i], pvmw->vma, addr); ++ if (pfn == -1) ++ continue; ++ ++ if (!pte_young(pte[i])) ++ continue; ++ ++ folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap); ++ if (!folio) ++ continue; ++ ++ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) ++ VM_WARN_ON_ONCE(true); ++ ++ young++; ++ ++ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && ++ !(folio_test_anon(folio) && folio_test_swapbacked(folio) && ++ !folio_test_swapcache(folio))) ++ folio_mark_dirty(folio); ++ ++ old_gen = folio_lru_gen(folio); ++ if (old_gen < 0) ++ folio_set_referenced(folio); ++ else if (old_gen != new_gen) ++ __set_bit(i, bitmap); ++ } ++ ++ arch_leave_lazy_mmu_mode(); ++ rcu_read_unlock(); ++ ++ /* feedback from rmap walkers to page table walkers */ ++ if (suitable_to_scan(i, young)) ++ update_bloom_filter(lruvec, max_seq, pvmw->pmd); ++ ++ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { ++ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { ++ folio = pfn_folio(pte_pfn(pte[i])); ++ folio_activate(folio); ++ } ++ return; ++ } ++ ++ /* folio_update_gen() requires stable folio_memcg() */ ++ if (!mem_cgroup_trylock_pages(memcg)) ++ return; ++ ++ if (!walk) { ++ 
spin_lock_irq(&lruvec->lru_lock); ++ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); ++ } ++ ++ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { ++ folio = pfn_folio(pte_pfn(pte[i])); ++ if (folio_memcg_rcu(folio) != memcg) ++ continue; ++ ++ old_gen = folio_update_gen(folio, new_gen); ++ if (old_gen < 0 || old_gen == new_gen) ++ continue; ++ ++ if (walk) ++ update_batch_size(walk, folio, old_gen, new_gen); ++ else ++ lru_gen_update_size(lruvec, folio, old_gen, new_gen); ++ } ++ ++ if (!walk) ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ mem_cgroup_unlock_pages(); ++} ++ ++/****************************************************************************** ++ * the eviction ++ ******************************************************************************/ ++ ++static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) ++{ ++ bool success; ++ int gen = folio_lru_gen(folio); ++ int type = folio_is_file_lru(folio); ++ int zone = folio_zonenum(folio); ++ int delta = folio_nr_pages(folio); ++ int refs = folio_lru_refs(folio); ++ int tier = lru_tier_from_refs(refs); ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); ++ ++ /* unevictable */ ++ if (!folio_evictable(folio)) { ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ folio_set_unevictable(folio); ++ lruvec_add_folio(lruvec, folio); ++ __count_vm_events(UNEVICTABLE_PGCULLED, delta); ++ return true; ++ } ++ ++ /* dirty lazyfree */ ++ if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ folio_set_swapbacked(folio); ++ lruvec_add_folio_tail(lruvec, folio); ++ return true; ++ } ++ ++ /* promoted */ ++ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { ++ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); ++ return true; ++ } ++ ++ /* protected */ ++ if (tier > 
tier_idx) { ++ int hist = lru_hist_from_seq(lrugen->min_seq[type]); ++ ++ gen = folio_inc_gen(lruvec, folio, false); ++ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); ++ ++ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], ++ lrugen->protected[hist][type][tier - 1] + delta); ++ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); ++ return true; ++ } ++ ++ /* waiting for writeback */ ++ if (folio_test_locked(folio) || folio_test_writeback(folio) || ++ (type == LRU_GEN_FILE && folio_test_dirty(folio))) { ++ gen = folio_inc_gen(lruvec, folio, true); ++ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); ++ return true; ++ } ++ ++ return false; ++} ++ ++static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) ++{ ++ bool success; ++ ++ /* unmapping inhibited */ ++ if (!sc->may_unmap && folio_mapped(folio)) ++ return false; ++ ++ /* swapping inhibited */ ++ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && ++ (folio_test_dirty(folio) || ++ (folio_test_anon(folio) && !folio_test_swapcache(folio)))) ++ return false; ++ ++ /* raced with release_pages() */ ++ if (!folio_try_get(folio)) ++ return false; ++ ++ /* raced with another isolation */ ++ if (!folio_test_clear_lru(folio)) { ++ folio_put(folio); ++ return false; ++ } ++ ++ /* see the comment on MAX_NR_TIERS */ ++ if (!folio_test_referenced(folio)) ++ set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0); ++ ++ /* for shrink_page_list() */ ++ folio_clear_reclaim(folio); ++ folio_clear_referenced(folio); ++ ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ ++ return true; ++} ++ ++static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, ++ int type, int tier, struct list_head *list) ++{ ++ int gen, zone; ++ enum vm_event_item item; ++ int sorted = 0; ++ int scanned = 0; ++ int isolated = 0; ++ int remaining = MAX_LRU_BATCH; ++ struct lru_gen_struct *lrugen = 
&lruvec->lrugen; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ ++ VM_WARN_ON_ONCE(!list_empty(list)); ++ ++ if (get_nr_gens(lruvec, type) == MIN_NR_GENS) ++ return 0; ++ ++ gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ ++ for (zone = sc->reclaim_idx; zone >= 0; zone--) { ++ LIST_HEAD(moved); ++ int skipped = 0; ++ struct list_head *head = &lrugen->lists[gen][type][zone]; ++ ++ while (!list_empty(head)) { ++ struct folio *folio = lru_to_folio(head); ++ int delta = folio_nr_pages(folio); ++ ++ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); ++ ++ scanned += delta; ++ ++ if (sort_folio(lruvec, folio, tier)) ++ sorted += delta; ++ else if (isolate_folio(lruvec, folio, sc)) { ++ list_add(&folio->lru, list); ++ isolated += delta; ++ } else { ++ list_move(&folio->lru, &moved); ++ skipped += delta; ++ } ++ ++ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) ++ break; ++ } ++ ++ if (skipped) { ++ list_splice(&moved, head); ++ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); ++ } ++ ++ if (!remaining || isolated >= MIN_LRU_BATCH) ++ break; ++ } ++ ++ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; ++ if (!cgroup_reclaim(sc)) { ++ __count_vm_events(item, isolated); ++ __count_vm_events(PGREFILL, sorted); ++ } ++ __count_memcg_events(memcg, item, isolated); ++ __count_memcg_events(memcg, PGREFILL, sorted); ++ __count_vm_events(PGSCAN_ANON + type, isolated); ++ ++ /* ++ * There might not be eligible pages due to reclaim_idx, may_unmap and ++ * may_writepage. Check the remaining to prevent livelock if it's not ++ * making progress. ++ */ ++ return isolated || !remaining ? 
scanned : 0; ++} ++ ++static int get_tier_idx(struct lruvec *lruvec, int type) ++{ ++ int tier; ++ struct ctrl_pos sp, pv; ++ ++ /* ++ * To leave a margin for fluctuations, use a larger gain factor (1:2). ++ * This value is chosen because any other tier would have at least twice ++ * as many refaults as the first tier. ++ */ ++ read_ctrl_pos(lruvec, type, 0, 1, &sp); ++ for (tier = 1; tier < MAX_NR_TIERS; tier++) { ++ read_ctrl_pos(lruvec, type, tier, 2, &pv); ++ if (!positive_ctrl_err(&sp, &pv)) ++ break; ++ } ++ ++ return tier - 1; ++} ++ ++static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) ++{ ++ int type, tier; ++ struct ctrl_pos sp, pv; ++ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; ++ ++ /* ++ * Compare the first tier of anon with that of file to determine which ++ * type to scan. Also need to compare other tiers of the selected type ++ * with the first tier of the other type to determine the last tier (of ++ * the selected type) to evict. ++ */ ++ read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); ++ read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); ++ type = positive_ctrl_err(&sp, &pv); ++ ++ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); ++ for (tier = 1; tier < MAX_NR_TIERS; tier++) { ++ read_ctrl_pos(lruvec, type, tier, gain[type], &pv); ++ if (!positive_ctrl_err(&sp, &pv)) ++ break; ++ } ++ ++ *tier_idx = tier - 1; ++ ++ return type; ++} ++ ++static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, ++ int *type_scanned, struct list_head *list) ++{ ++ int i; ++ int type; ++ int scanned; ++ int tier = -1; ++ DEFINE_MIN_SEQ(lruvec); ++ ++ /* ++ * Try to make the obvious choice first. When anon and file are both ++ * available from the same generation, interpret swappiness 1 as file ++ * first and 200 as anon first. 
++ */ ++ if (!swappiness) ++ type = LRU_GEN_FILE; ++ else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) ++ type = LRU_GEN_ANON; ++ else if (swappiness == 1) ++ type = LRU_GEN_FILE; ++ else if (swappiness == 200) ++ type = LRU_GEN_ANON; ++ else ++ type = get_type_to_scan(lruvec, swappiness, &tier); ++ ++ for (i = !swappiness; i < ANON_AND_FILE; i++) { ++ if (tier < 0) ++ tier = get_tier_idx(lruvec, type); ++ ++ scanned = scan_folios(lruvec, sc, type, tier, list); ++ if (scanned) ++ break; ++ ++ type = !type; ++ tier = -1; ++ } ++ ++ *type_scanned = type; ++ ++ return scanned; ++} ++ +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + bool *need_swapping) - { - int type; - int scanned; -@@ -4905,6 +4925,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - - sc->nr_reclaimed += reclaimed; - ++{ ++ int type; ++ int scanned; ++ int reclaimed; ++ LIST_HEAD(list); ++ struct folio *folio; ++ enum vm_event_item item; ++ struct reclaim_stat stat; ++ struct lru_gen_mm_walk *walk; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ struct pglist_data *pgdat = lruvec_pgdat(lruvec); ++ ++ spin_lock_irq(&lruvec->lru_lock); ++ ++ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); ++ ++ scanned += try_to_inc_min_seq(lruvec, swappiness); ++ ++ if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS) ++ scanned = 0; ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ if (list_empty(&list)) ++ return scanned; ++ ++ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); ++ ++ list_for_each_entry(folio, &list, lru) { ++ /* restore LRU_REFS_FLAGS cleared by isolate_folio() */ ++ if (folio_test_workingset(folio)) ++ folio_set_referenced(folio); ++ ++ /* don't add rejected pages to the oldest generation */ ++ if (folio_test_reclaim(folio) && ++ (folio_test_dirty(folio) || folio_test_writeback(folio))) ++ folio_clear_active(folio); ++ else ++ folio_set_active(folio); ++ } ++ ++ 
spin_lock_irq(&lruvec->lru_lock); ++ ++ move_pages_to_lru(lruvec, &list); ++ ++ walk = current->reclaim_state->mm_walk; ++ if (walk && walk->batched) ++ reset_batch_size(lruvec, walk); ++ ++ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; ++ if (!cgroup_reclaim(sc)) ++ __count_vm_events(item, reclaimed); ++ __count_memcg_events(memcg, item, reclaimed); ++ __count_vm_events(PGSTEAL_ANON + type, reclaimed); ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ mem_cgroup_uncharge_list(&list); ++ free_unref_page_list(&list); ++ ++ sc->nr_reclaimed += reclaimed; ++ + if (need_swapping && type == LRU_GEN_ANON) + *need_swapping = true; + - return scanned; - } - -@@ -4914,9 +4937,8 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - * reclaim. - */ - static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, -- bool can_swap) ++ return scanned; ++} ++ ++/* ++ * For future optimizations: ++ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg ++ * reclaim. 
++ */ ++static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, + bool can_swap, bool *need_aging) - { -- bool need_aging; - unsigned long nr_to_scan; - struct mem_cgroup *memcg = lruvec_memcg(lruvec); - DEFINE_MAX_SEQ(lruvec); -@@ -4926,8 +4948,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) - return 0; - -- need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); -- if (!need_aging) ++{ ++ unsigned long nr_to_scan; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ DEFINE_MAX_SEQ(lruvec); ++ DEFINE_MIN_SEQ(lruvec); ++ ++ if (mem_cgroup_below_min(memcg) || ++ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) ++ return 0; ++ + *need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); + if (!*need_aging) - return nr_to_scan; - - /* skip the aging path at the default priority */ -@@ -4944,10 +4966,68 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; - } - ++ return nr_to_scan; ++ ++ /* skip the aging path at the default priority */ ++ if (sc->priority == DEF_PRIORITY) ++ goto done; ++ ++ /* leave the work to lru_gen_age_node() */ ++ if (current_is_kswapd()) ++ return 0; ++ ++ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) ++ return nr_to_scan; ++done: ++ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; ++} ++ +static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, + struct scan_control *sc, bool need_swapping) +{ @@ -5731,7 +4130,7 @@ index c579b254fec7..3f83325fdc71 100644 + DEFINE_MAX_SEQ(lruvec); + + if (!current_is_kswapd()) { -+ /* age each memcg once to ensure fairness */ ++ /* age each memcg at most once to ensure fairness */ + if (max_seq - seq > 1) + return true; + @@ -5756,10 +4155,9 @@ index c579b254fec7..3f83325fdc71 100644 + + /* + * A minimum amount of work was done under global memory pressure. For -+ * kswapd, it may be overshooting. For direct reclaim, the target isn't -+ * met, and yet the allocation may still succeed, since kswapd may have -+ * caught up. In either case, it's better to stop now, and restart if -+ * necessary. ++ * kswapd, it may be overshooting. For direct reclaim, the allocation ++ * may succeed if all suitable zones are somewhat safe. In either case, ++ * it's better to stop now, and restart later if necessary. + */ + for (i = 0; i <= sc->reclaim_idx; i++) { + unsigned long wmark; @@ -5778,332 +4176,60 @@ index c579b254fec7..3f83325fdc71 100644 + return true; +} + - static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - { - struct blk_plug plug; ++static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ struct blk_plug plug; + bool need_aging = false; + bool need_swapping = false; - unsigned long scanned = 0; ++ unsigned long scanned = 0; + unsigned long reclaimed = sc->nr_reclaimed; + DEFINE_MAX_SEQ(lruvec); - - lru_add_drain(); - -@@ -4967,21 +5047,28 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - else - swappiness = 0; - -- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); ++ ++ lru_add_drain(); ++ ++ blk_start_plug(&plug); ++ ++ set_mm_walk(lruvec_pgdat(lruvec)); ++ ++ while (true) { ++ int delta; ++ int swappiness; ++ unsigned long nr_to_scan; ++ ++ if (sc->may_swap) ++ swappiness = 
get_swappiness(lruvec, sc); ++ else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) ++ swappiness = 1; ++ else ++ swappiness = 0; ++ + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, &need_aging); - if (!nr_to_scan) -- break; ++ if (!nr_to_scan) + goto done; - -- delta = evict_folios(lruvec, sc, swappiness); ++ + delta = evict_folios(lruvec, sc, swappiness, &need_swapping); - if (!delta) -- break; ++ if (!delta) + goto done; - - scanned += delta; - if (scanned >= nr_to_scan) - break; - ++ ++ scanned += delta; ++ if (scanned >= nr_to_scan) ++ break; ++ + if (should_abort_scan(lruvec, max_seq, sc, need_swapping)) + break; + - cond_resched(); - } - ++ cond_resched(); ++ } ++ + /* see the comment in lru_gen_age_node() */ + if (sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH && !need_aging) + sc->memcgs_need_aging = false; +done: - clear_mm_walk(); - - blk_finish_plug(&plug); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (8 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao - ` (4 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that -can be disabled include: - 0x0001: the multi-gen LRU core - 0x0002: walking page table, when arch_has_hw_pte_young() returns - true - 0x0004: clearing the accessed bit in non-leaf PMD entries, when - CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y - [yYnN]: apply to all the components above -E.g., - echo y >/sys/kernel/mm/lru_gen/enabled - cat /sys/kernel/mm/lru_gen/enabled - 0x0007 - echo 5 >/sys/kernel/mm/lru_gen/enabled - cat /sys/kernel/mm/lru_gen/enabled - 0x0005 - -NB: the page table walks happen on the scale of seconds under heavy -memory pressure, in which case the mmap_lock contention is a lesser -concern, compared with the LRU lock contention and the I/O congestion. -So far the only well-known case of the mmap_lock contention happens on -Android, due to Scudo [1] which allocates several thousand VMAs for -merely a few hundred MBs. The SPF and the Maple Tree also have -provided their own assessments [2][3]. 
However, if walking page tables -does worsen the mmap_lock contention, the kill switch can be used to -disable it. In this case the multi-gen LRU will suffer a minor -performance degradation, as shown previously. - -Clearing the accessed bit in non-leaf PMD entries can also be -disabled, since this behavior was not tested on x86 varieties other -than Intel and AMD. - -[1] https://source.android.com/devices/tech/debug/scudo -[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/ -[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/cgroup.h | 15 ++- - include/linux/mm_inline.h | 15 ++- - include/linux/mmzone.h | 9 ++ - kernel/cgroup/cgroup-internal.h | 1 - - mm/Kconfig | 6 + - mm/vmscan.c | 228 +++++++++++++++++++++++++++++++- - 6 files changed, 265 insertions(+), 9 deletions(-) - -diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h -index ac5d0515680e..9179463c3c9f 100644 ---- a/include/linux/cgroup.h -+++ b/include/linux/cgroup.h -@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) - css_put(&cgrp->self); - } - -+extern struct mutex cgroup_mutex; ++ clear_mm_walk(); + -+static inline void cgroup_lock(void) -+{ -+ mutex_lock(&cgroup_mutex); ++ blk_finish_plug(&plug); +} + -+static inline void cgroup_unlock(void) -+{ -+ mutex_unlock(&cgroup_mutex); -+} -+ - /** - * task_css_set_check - obtain a task's css_set with extra access conditions - * @task: the task to obtain css_set for -@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) - * as locks used during the cgroup_subsys::attach() methods. 
- */ - #ifdef CONFIG_PROVE_RCU --extern struct mutex cgroup_mutex; - extern spinlock_t css_set_lock; - #define task_css_set_check(task, __c) \ - rcu_dereference_check((task)->cgroups, \ -@@ -708,6 +719,8 @@ struct cgroup; - static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } - static inline void css_get(struct cgroup_subsys_state *css) {} - static inline void css_put(struct cgroup_subsys_state *css) {} -+static inline void cgroup_lock(void) {} -+static inline void cgroup_unlock(void) {} - static inline int cgroup_attach_task_all(struct task_struct *from, - struct task_struct *t) { return 0; } - static inline int cgroupstats_build(struct cgroupstats *stats, -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index f2b2296a42f9..4949eda9a9a2 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ -106,10 +106,21 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) - - #ifdef CONFIG_LRU_GEN - -+#ifdef CONFIG_LRU_GEN_ENABLED - static inline bool lru_gen_enabled(void) - { -- return true; -+ DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); -+ -+ return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); - } -+#else -+static inline bool lru_gen_enabled(void) -+{ -+ DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); -+ -+ return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); -+} -+#endif - - static inline bool lru_gen_in_fault(void) - { -@@ -222,7 +233,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, - - VM_WARN_ON_ONCE_FOLIO(gen != -1, folio); - -- if (folio_test_unevictable(folio)) -+ if (folio_test_unevictable(folio) || !lrugen->enabled) - return false; - /* - * There are three common cases for this page: -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index b1635c4020dc..95c58c7fbdff 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -387,6 +387,13 @@ enum { - LRU_GEN_FILE, - }; - -+enum { -+ 
LRU_GEN_CORE, -+ LRU_GEN_MM_WALK, -+ LRU_GEN_NONLEAF_YOUNG, -+ NR_LRU_GEN_CAPS -+}; -+ - #define MIN_LRU_BATCH BITS_PER_LONG - #define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) - -@@ -428,6 +435,8 @@ struct lru_gen_struct { - /* can be modified without holding the LRU lock */ - atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; - atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; -+ /* whether the multi-gen LRU is enabled */ -+ bool enabled; - }; - - enum { -diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h -index 36b740cb3d59..63dc3e82be4f 100644 ---- a/kernel/cgroup/cgroup-internal.h -+++ b/kernel/cgroup/cgroup-internal.h -@@ -164,7 +164,6 @@ struct cgroup_mgctx { - #define DEFINE_CGROUP_MGCTX(name) \ - struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) - --extern struct mutex cgroup_mutex; - extern spinlock_t css_set_lock; - extern struct cgroup_subsys *cgroup_subsys[]; - extern struct list_head cgroup_roots; -diff --git a/mm/Kconfig b/mm/Kconfig -index 5c5dcbdcfe34..ab6ef5115eb8 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1127,6 +1127,12 @@ config LRU_GEN - help - A high performance LRU implementation to overcommit memory. - -+config LRU_GEN_ENABLED -+ bool "Enable by default" -+ depends on LRU_GEN -+ help -+ This option enables the multi-gen LRU by default. 
-+ - config LRU_GEN_STATS - bool "Full stats for debugging" - depends on LRU_GEN -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 3f83325fdc71..10f31f3c5054 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -51,6 +51,7 @@ - #include - #include - #include -+#include - - #include - #include -@@ -3070,6 +3071,14 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - - #ifdef CONFIG_LRU_GEN - -+#ifdef CONFIG_LRU_GEN_ENABLED -+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); -+#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) -+#else -+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); -+#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap]) -+#endif -+ - /****************************************************************************** - * shorthand helpers - ******************************************************************************/ -@@ -3946,7 +3955,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area - goto next; - - if (!pmd_trans_huge(pmd[i])) { -- if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) -+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && -+ get_cap(LRU_GEN_NONLEAF_YOUNG)) - pmdp_test_and_clear_young(vma, addr, pmd + i); - goto next; - } -@@ -4044,10 +4054,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, - walk->mm_stats[MM_NONLEAF_TOTAL]++; - - #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -- if (!pmd_young(val)) -- continue; -+ if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { -+ if (!pmd_young(val)) -+ continue; - -- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); -+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); -+ } - #endif - if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) - continue; -@@ -4309,7 +4321,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - * handful of PTEs. 
Spreading the work out over a period of time usually - * is less efficient, but it avoids bursty page faults. - */ -- if (!arch_has_hw_pte_young()) { -+ if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { - success = iterate_mm_list_nowalk(lruvec, max_seq); - goto done; - } -@@ -5074,6 +5086,208 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - blk_finish_plug(&plug); - } - +/****************************************************************************** + * state change + ******************************************************************************/ @@ -6249,6 +4375,29 @@ index 3f83325fdc71..10f31f3c5054 100644 + * sysfs interface + ******************************************************************************/ + ++static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) ++{ ++ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); ++} ++ ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ ++static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, ++ const char *buf, size_t len) ++{ ++ unsigned int msecs; ++ ++ if (kstrtouint(buf, 0, &msecs)) ++ return -EINVAL; ++ ++ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); ++ ++ return len; ++} ++ ++static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( ++ min_ttl_ms, 0644, show_min_ttl, store_min_ttl ++); ++ +static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + unsigned int caps = 0; @@ -6265,6 +4414,7 @@ index 3f83325fdc71..10f31f3c5054 100644 + return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ @@ -6297,6 +4447,7 @@ index 3f83325fdc71..10f31f3c5054 100644 +); + +static struct attribute *lru_gen_attrs[] = { ++ &lru_gen_min_ttl_attr.attr, + 
&lru_gen_enabled_attr.attr, + NULL +}; @@ -6306,462 +4457,6 @@ index 3f83325fdc71..10f31f3c5054 100644 + .attrs = lru_gen_attrs, +}; + - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -5084,6 +5298,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec) - struct lru_gen_struct *lrugen = &lruvec->lrugen; - - lrugen->max_seq = MIN_NR_GENS + 1; -+ lrugen->enabled = lru_gen_enabled(); - - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -@@ -5123,6 +5338,9 @@ static int __init init_lru_gen(void) - BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); - BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); - -+ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) -+ pr_err("lru_gen: failed to create sysfs group\n"); -+ - return 0; - }; - late_initcall(init_lru_gen); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (9 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao - ` (3 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as -requested by many desktop users [1]. - -When set to value N, it prevents the working set of N milliseconds -from getting evicted. The OOM killer is triggered if this working set -cannot be kept in memory. Based on the average human detectable lag -(~100ms), N=1000 usually eliminates intolerable lags due to thrashing. -Larger values like N=3000 make lags less noticeable at the risk of -premature OOM kills. - -Compared with the size-based approach [2], this time-based approach -has the following advantages: -1. It is easier to configure because it is agnostic to applications - and memory sizes. -2. It is more reliable because it is directly wired to the OOM killer. 
- -[1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/ -[2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mmzone.h | 2 ++ - mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++++++++-- - 2 files changed, 73 insertions(+), 3 deletions(-) - -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 95c58c7fbdff..87347945270b 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -422,6 +422,8 @@ struct lru_gen_struct { - unsigned long max_seq; - /* the eviction increments the oldest generation numbers */ - unsigned long min_seq[ANON_AND_FILE]; -+ /* the birth time of each generation in jiffies */ -+ unsigned long timestamps[MAX_NR_GENS]; - /* the multi-gen LRU lists, lazily sorted on eviction */ - struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; - /* the multi-gen LRU sizes, eventually consistent */ -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 10f31f3c5054..9ef2ec3d3c0c 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -4293,6 +4293,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - for (type = 0; type < ANON_AND_FILE; type++) - reset_ctrl_pos(lruvec, type, false); - -+ WRITE_ONCE(lrugen->timestamps[next], jiffies); - /* make sure preceding modifications appear */ - smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); - -@@ -4422,7 +4423,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsig - return false; - } - --static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long 
min_ttl) - { - bool need_aging; - unsigned long nr_to_scan; -@@ -4436,16 +4437,36 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) - mem_cgroup_calculate_protection(NULL, memcg); - - if (mem_cgroup_below_min(memcg)) -- return; -+ return false; - - need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); -+ -+ if (min_ttl) { -+ int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); -+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); -+ -+ if (time_is_after_jiffies(birth + min_ttl)) -+ return false; -+ -+ /* the size is likely too small to be helpful */ -+ if (!nr_to_scan && sc->priority != DEF_PRIORITY) -+ return false; -+ } -+ - if (need_aging) - try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); -+ -+ return true; - } - -+/* to protect the working set of the last N jiffies */ -+static unsigned long lru_gen_min_ttl __read_mostly; -+ - static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - { - struct mem_cgroup *memcg; -+ bool success = false; -+ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); - - VM_WARN_ON_ONCE(!current_is_kswapd()); - -@@ -4468,12 +4489,32 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - -- age_lruvec(lruvec, sc); -+ if (age_lruvec(lruvec, sc, min_ttl)) -+ success = true; - - cond_resched(); - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); - - clear_mm_walk(); -+ -+ /* check the order to exclude compaction-induced reclaim */ -+ if (success || !min_ttl || sc->order) -+ return; -+ -+ /* -+ * The main goal is to OOM kill if every generation from all memcgs is -+ * younger than min_ttl. However, another possibility is all memcgs are -+ * either below min or empty. 
-+ */ -+ if (mutex_trylock(&oom_lock)) { -+ struct oom_control oc = { -+ .gfp_mask = sc->gfp_mask, -+ }; -+ -+ out_of_memory(&oc); -+ -+ mutex_unlock(&oom_lock); -+ } - } - - /* -@@ -5231,6 +5272,28 @@ static void lru_gen_change_state(bool enabled) - * sysfs interface - ******************************************************************************/ - -+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) -+{ -+ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); -+} -+ -+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, -+ const char *buf, size_t len) -+{ -+ unsigned int msecs; -+ -+ if (kstrtouint(buf, 0, &msecs)) -+ return -EINVAL; -+ -+ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); -+ -+ return len; -+} -+ -+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( -+ min_ttl_ms, 0644, show_min_ttl, store_min_ttl -+); -+ - static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf) - { - unsigned int caps = 0; -@@ -5279,6 +5342,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR( - ); - - static struct attribute *lru_gen_attrs[] = { -+ &lru_gen_min_ttl_attr.attr, - &lru_gen_enabled_attr.attr, - NULL - }; -@@ -5294,12 +5358,16 @@ static struct attribute_group lru_gen_attr_group = { - - void lru_gen_init_lruvec(struct lruvec *lruvec) - { -+ int i; - int gen, type, zone; - struct lru_gen_struct *lrugen = &lruvec->lrugen; - - lrugen->max_seq = MIN_NR_GENS + 1; - lrugen->enabled = lru_gen_enabled(); - -+ for (i = 0; i <= MIN_NR_GENS + 1; i++) -+ lrugen->timestamps[i] = jiffies; -+ - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); - --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (10 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide Yu Zhao - ` (2 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Qi Zheng, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/debug/lru_gen for working set estimation and proactive -reclaim. These techniques are commonly used to optimize job scheduling -(bin packing) in data centers [1][2]. - -Compared with the page table-based approach and the PFN-based -approach, this lruvec-based approach has the following advantages: -1. It offers better choices because it is aware of memcgs, NUMA nodes, - shared mappings and unmapped page cache. -2. It is more scalable because it is O(nr_hot_pages), whereas the - PFN-based approach is O(nr_total_pages). - -Add /sys/kernel/debug/lru_gen_full for debugging. 
- -[1] https://dl.acm.org/doi/10.1145/3297858.3304053 -[2] https://dl.acm.org/doi/10.1145/3503222.3507731 - -Signed-off-by: Yu Zhao -Reviewed-by: Qi Zheng -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/nodemask.h | 1 + - mm/vmscan.c | 411 ++++++++++++++++++++++++++++++++++++++- - 2 files changed, 402 insertions(+), 10 deletions(-) - -diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h -index 4b71a96190a8..3a0eec9f2faa 100644 ---- a/include/linux/nodemask.h -+++ b/include/linux/nodemask.h -@@ -493,6 +493,7 @@ static inline int num_node_state(enum node_states state) - #define first_online_node 0 - #define first_memory_node 0 - #define next_online_node(nid) (MAX_NUMNODES) -+#define next_memory_node(nid) (MAX_NUMNODES) - #define nr_node_ids 1U - #define nr_online_nodes 1U - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 9ef2ec3d3c0c..7657d54c9c42 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -52,6 +52,7 @@ - #include - #include - #include -+#include - - #include - #include -@@ -4197,12 +4198,40 @@ static void clear_mm_walk(void) - kfree(walk); - } - --static void inc_min_seq(struct lruvec *lruvec, int type) -+static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) - { -+ int zone; -+ int remaining = MAX_LRU_BATCH; - struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); - -+ if (type == LRU_GEN_ANON && !can_swap) -+ goto done; -+ -+ /* prevent cold/hot inversion if force_scan is true */ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ struct list_head *head = &lrugen->lists[old_gen][type][zone]; -+ -+ while (!list_empty(head)) { -+ struct folio *folio = 
lru_to_folio(head); -+ -+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); -+ -+ new_gen = folio_inc_gen(lruvec, folio, false); -+ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]); -+ -+ if (!--remaining) -+ return false; -+ } -+ } -+done: - reset_ctrl_pos(lruvec, type, true); - WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); -+ -+ return true; - } - - static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) -@@ -4248,7 +4277,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) - return success; - } - --static void inc_max_seq(struct lruvec *lruvec, bool can_swap) -+static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) - { - int prev, next; - int type, zone; -@@ -4262,9 +4291,13 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - if (get_nr_gens(lruvec, type) != MAX_NR_GENS) - continue; - -- VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap); -+ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); - -- inc_min_seq(lruvec, type); -+ while (!inc_min_seq(lruvec, type, can_swap)) { -+ spin_unlock_irq(&lruvec->lru_lock); -+ cond_resched(); -+ spin_lock_irq(&lruvec->lru_lock); -+ } - } - - /* -@@ -4301,7 +4334,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - } - - static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, -- struct scan_control *sc, bool can_swap) -+ struct scan_control *sc, bool can_swap, bool force_scan) - { - bool success; - struct lru_gen_mm_walk *walk; -@@ -4322,7 +4355,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - * handful of PTEs. Spreading the work out over a period of time usually - * is less efficient, but it avoids bursty page faults. 
- */ -- if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { -+ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { - success = iterate_mm_list_nowalk(lruvec, max_seq); - goto done; - } -@@ -4336,7 +4369,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - walk->lruvec = lruvec; - walk->max_seq = max_seq; - walk->can_swap = can_swap; -- walk->force_scan = false; -+ walk->force_scan = force_scan; - - do { - success = iterate_mm_list(lruvec, walk, &mm); -@@ -4356,7 +4389,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - - VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq)); - -- inc_max_seq(lruvec, can_swap); -+ inc_max_seq(lruvec, can_swap, force_scan); - /* either this sees any waiters or they will see updated max_seq */ - if (wq_has_sleeper(&lruvec->mm_state.wait)) - wake_up_all(&lruvec->mm_state.wait); -@@ -4454,7 +4487,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned - } - - if (need_aging) -- try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); -+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); - - return true; - } -@@ -5013,7 +5046,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - if (current_is_kswapd()) - return 0; - -- if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap)) -+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) - return nr_to_scan; - done: - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; -@@ -5352,6 +5385,361 @@ static struct attribute_group lru_gen_attr_group = { - .attrs = lru_gen_attrs, - }; - +/****************************************************************************** + * debugfs interface + ******************************************************************************/ @@ -6867,6 +4562,7 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + seq_putc(m, '\n'); +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; @@ -7025,6 +4721,7 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + return err; +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ @@ -7117,639 +4814,540 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + .release = seq_release, +}; + - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -5409,6 +5797,9 @@ static int __init init_lru_gen(void) - if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) - pr_err("lru_gen: failed to create sysfs group\n"); - ++/****************************************************************************** ++ * initialization ++ ******************************************************************************/ ++ ++void lru_gen_init_lruvec(struct lruvec *lruvec) ++{ ++ int i; ++ int gen, type, zone; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ lrugen->max_seq = MIN_NR_GENS + 1; ++ lrugen->enabled = lru_gen_enabled(); ++ ++ for (i = 0; i <= MIN_NR_GENS + 1; i++) ++ lrugen->timestamps[i] = jiffies; ++ ++ for_each_gen_type_zone(gen, type, zone) ++ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); ++ ++ lruvec->mm_state.seq = MIN_NR_GENS; ++ init_waitqueue_head(&lruvec->mm_state.wait); ++} ++ ++#ifdef CONFIG_MEMCG ++void 
lru_gen_init_memcg(struct mem_cgroup *memcg) ++{ ++ INIT_LIST_HEAD(&memcg->mm_list.fifo); ++ spin_lock_init(&memcg->mm_list.lock); ++} ++ ++void lru_gen_exit_memcg(struct mem_cgroup *memcg) ++{ ++ int i; ++ int nid; ++ ++ for_each_node(nid) { ++ struct lruvec *lruvec = get_lruvec(memcg, nid); ++ ++ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, ++ sizeof(lruvec->lrugen.nr_pages))); ++ ++ for (i = 0; i < NR_BLOOM_FILTERS; i++) { ++ bitmap_free(lruvec->mm_state.filters[i]); ++ lruvec->mm_state.filters[i] = NULL; ++ } ++ } ++} ++#endif ++ ++static int __init init_lru_gen(void) ++{ ++ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); ++ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); ++ ++ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) ++ pr_err("lru_gen: failed to create sysfs group\n"); ++ + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + - return 0; - }; - late_initcall(init_lru_gen); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (11 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:26 ` Mike Rapoport - 2022-09-18 8:00 ` [PATCH mm-unstable v15 14/14] mm: multi-gen LRU: design doc Yu Zhao - 2022-09-19 2:08 ` [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Bagas Sanjaya - 14 siblings, 1 reply; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add an admin guide. - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - Documentation/admin-guide/mm/index.rst | 1 + - Documentation/admin-guide/mm/multigen_lru.rst | 162 ++++++++++++++++++ - mm/Kconfig | 3 +- - mm/vmscan.c | 4 + - 4 files changed, 169 insertions(+), 1 deletion(-) - create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst - -diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst -index 1bd11118dfb1..d1064e0ba34a 100644 ---- a/Documentation/admin-guide/mm/index.rst -+++ b/Documentation/admin-guide/mm/index.rst -@@ -32,6 +32,7 @@ the Linux memory management. 
- idle_page_tracking - ksm - memory-hotplug -+ multigen_lru - nommu-mmap - numa_memory_policy - numaperf -diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst -new file mode 100644 -index 000000000000..33e068830497 ---- /dev/null -+++ b/Documentation/admin-guide/mm/multigen_lru.rst -@@ -0,0 +1,162 @@ -+.. SPDX-License-Identifier: GPL-2.0 ++ return 0; ++}; ++late_initcall(init_lru_gen); + -+============= -+Multi-Gen LRU -+============= -+The multi-gen LRU is an alternative LRU implementation that optimizes -+page reclaim and improves performance under memory pressure. Page -+reclaim decides the kernel's caching policy and ability to overcommit -+memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++#else /* !CONFIG_LRU_GEN */ + -+Quick start -+=========== -+Build the kernel with the following configurations. ++static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) ++{ ++} + -+* ``CONFIG_LRU_GEN=y`` -+* ``CONFIG_LRU_GEN_ENABLED=y`` ++static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++} + -+All set! ++#endif /* CONFIG_LRU_GEN */ + -+Runtime options -+=============== -+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the -+following subsections. ++static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ unsigned long nr[NR_LRU_LISTS]; ++ unsigned long targets[NR_LRU_LISTS]; ++ unsigned long nr_to_scan; ++ enum lru_list lru; ++ unsigned long nr_reclaimed = 0; ++ unsigned long nr_to_reclaim = sc->nr_to_reclaim; ++ struct blk_plug plug; ++ bool scan_adjusted; + -+Kill switch -+----------- -+``enabled`` accepts different values to enable or disable the -+following components. Its default value depends on -+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled -+unless some of them have unforeseen side effects. 
Writing to -+``enabled`` has no effect when a component is not supported by the -+hardware, and valid values will be accepted even when the main switch -+is off. ++ if (lru_gen_enabled()) { ++ lru_gen_shrink_lruvec(lruvec, sc); ++ return; ++ } + -+====== =============================================================== -+Values Components -+====== =============================================================== -+0x0001 The main switch for the multi-gen LRU. -+0x0002 Clearing the accessed bit in leaf page table entries in large -+ batches, when MMU sets it (e.g., on x86). This behavior can -+ theoretically worsen lock contention (mmap_lock). If it is -+ disabled, the multi-gen LRU will suffer a minor performance -+ degradation for workloads that contiguously map hot pages, -+ whose accessed bits can be otherwise cleared by fewer larger -+ batches. -+0x0004 Clearing the accessed bit in non-leaf page table entries as -+ well, when MMU sets it (e.g., on x86). This behavior was not -+ verified on x86 varieties other than Intel and AMD. If it is -+ disabled, the multi-gen LRU will suffer a negligible -+ performance degradation. -+[yYnN] Apply to all the components above. -+====== =============================================================== ++ get_scan_count(lruvec, sc, nr); + -+E.g., -+:: ++ /* Record the original scan target for proportional adjustments later */ ++ memcpy(targets, nr, sizeof(nr)); + -+ echo y >/sys/kernel/mm/lru_gen/enabled -+ cat /sys/kernel/mm/lru_gen/enabled -+ 0x0007 -+ echo 5 >/sys/kernel/mm/lru_gen/enabled -+ cat /sys/kernel/mm/lru_gen/enabled -+ 0x0005 ++ /* ++ * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal ++ * event that can occur when there is little memory pressure e.g. ++ * multiple streaming readers/writers. 
Hence, we do not abort scanning ++ * when the requested number of pages are reclaimed when scanning at ++ * DEF_PRIORITY on the assumption that the fact we are direct ++ * reclaiming implies that kswapd is not keeping up and it is best to ++ * do a batch of work at once. For memcg reclaim one check is made to ++ * abort proportional reclaim if either the file or anon lru has already ++ * dropped to zero at the first pass. ++ */ ++ scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && ++ sc->priority == DEF_PRIORITY); + -+Thrashing prevention -+-------------------- -+Personal computers are more sensitive to thrashing because it can -+cause janks (lags when rendering UI) and negatively impact user -+experience. The multi-gen LRU offers thrashing prevention to the -+majority of laptop and desktop users who do not have ``oomd``. ++ blk_start_plug(&plug); ++ while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || ++ nr[LRU_INACTIVE_FILE]) { ++ unsigned long nr_anon, nr_file, percentage; ++ unsigned long nr_scanned; + -+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of -+``N`` milliseconds from getting evicted. The OOM killer is triggered -+if this working set cannot be kept in memory. In other words, this -+option works as an adjustable pressure relief valve, and when open, it -+terminates applications that are hopefully not being used. ++ for_each_evictable_lru(lru) { ++ if (nr[lru]) { ++ nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); ++ nr[lru] -= nr_to_scan; + -+Based on the average human detectable lag (~100ms), ``N=1000`` usually -+eliminates intolerable janks due to thrashing. Larger values like -+``N=3000`` make janks less noticeable at the risk of premature OOM -+kills. ++ nr_reclaimed += shrink_list(lru, nr_to_scan, ++ lruvec, sc); ++ } ++ } + -+The default value ``0`` means disabled. 
++ cond_resched(); + -+Experimental features -+===================== -+``/sys/kernel/debug/lru_gen`` accepts commands described in the -+following subsections. Multiple command lines are supported, so does -+concatenation with delimiters ``,`` and ``;``. ++ if (nr_reclaimed < nr_to_reclaim || scan_adjusted) ++ continue; + -+``/sys/kernel/debug/lru_gen_full`` provides additional stats for -+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from -+evicted generations in this file. ++ /* ++ * For kswapd and memcg, reclaim at least the number of pages ++ * requested. Ensure that the anon and file LRUs are scanned ++ * proportionally what was requested by get_scan_count(). We ++ * stop reclaiming one LRU and reduce the amount scanning ++ * proportional to the original scan target. ++ */ ++ nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; ++ nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + -+Working set estimation -+---------------------- -+Working set estimation measures how much memory an application needs -+in a given time interval, and it is usually done with little impact on -+the performance of the application. E.g., data centers want to -+optimize job scheduling (bin packing) to improve memory utilizations. -+When a new job comes in, the job scheduler needs to find out whether -+each server it manages can allocate a certain amount of memory for -+this new job before it can pick a candidate. To do so, the job -+scheduler needs to estimate the working sets of the existing jobs. ++ /* ++ * It's just vindictive to attack the larger once the smaller ++ * has gone to zero. And given the way we stop scanning the ++ * smaller below, this makes sure that we only make one nudge ++ * towards proportionality once we've got nr_to_reclaim. ++ */ ++ if (!nr_file || !nr_anon) ++ break; + -+When it is read, ``lru_gen`` returns a histogram of numbers of pages -+accessed over different time intervals for each memcg and node. 
-+``MAX_NR_GENS`` decides the number of bins for each histogram. The -+histograms are noncumulative. -+:: ++ if (nr_file > nr_anon) { ++ unsigned long scan_target = targets[LRU_INACTIVE_ANON] + ++ targets[LRU_ACTIVE_ANON] + 1; ++ lru = LRU_BASE; ++ percentage = nr_anon * 100 / scan_target; ++ } else { ++ unsigned long scan_target = targets[LRU_INACTIVE_FILE] + ++ targets[LRU_ACTIVE_FILE] + 1; ++ lru = LRU_FILE; ++ percentage = nr_file * 100 / scan_target; ++ } + -+ memcg memcg_id memcg_path -+ node node_id -+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages -+ ... -+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages ++ /* Stop scanning the smaller of the LRU */ ++ nr[lru] = 0; ++ nr[lru + LRU_ACTIVE] = 0; + -+Each bin contains an estimated number of pages that have been accessed -+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages -+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of -+the former is the largest and that of the latter is the smallest. ++ /* ++ * Recalculate the other LRU scan count based on its original ++ * scan target and the percentage scanning already complete ++ */ ++ lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; ++ nr_scanned = targets[lru] - nr[lru]; ++ nr[lru] = targets[lru] * (100 - percentage) / 100; ++ nr[lru] -= min(nr[lru], nr_scanned); + -+Users can write the following command to ``lru_gen`` to create a new -+generation ``max_gen_nr+1``: ++ lru += LRU_ACTIVE; ++ nr_scanned = targets[lru] - nr[lru]; ++ nr[lru] = targets[lru] * (100 - percentage) / 100; ++ nr[lru] -= min(nr[lru], nr_scanned); + -+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` ++ scan_adjusted = true; ++ } ++ blk_finish_plug(&plug); ++ sc->nr_reclaimed += nr_reclaimed; + -+``can_swap`` defaults to the swap setting and, if it is set to ``1``, -+it forces the scan of anon pages when swap is off, and vice versa. 
-+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it -+employs heuristics to reduce the overhead, which is likely to reduce -+the coverage as well. ++ /* ++ * Even if we did not try to evict anon pages at all, we want to ++ * rebalance the anon lru active/inactive ratio. ++ */ ++ if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && ++ inactive_is_low(lruvec, LRU_INACTIVE_ANON)) ++ shrink_active_list(SWAP_CLUSTER_MAX, lruvec, ++ sc, LRU_ACTIVE_ANON); ++} + -+A typical use case is that a job scheduler runs this command at a -+certain time interval to create new generations, and it ranks the -+servers it manages based on the sizes of their cold pages defined by -+this time interval. ++/* Use reclaim/compaction for costly allocs or under memory pressure */ ++static bool in_reclaim_compaction(struct scan_control *sc) ++{ ++ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && ++ (sc->order > PAGE_ALLOC_COSTLY_ORDER || ++ sc->priority < DEF_PRIORITY - 2)) ++ return true; + -+Proactive reclaim -+----------------- -+Proactive reclaim induces page reclaim when there is no memory -+pressure. It usually targets cold pages only. E.g., when a new job -+comes in, the job scheduler wants to proactively reclaim cold pages on -+the server it selected, to improve the chance of successfully landing -+this new job. ++ return false; ++} + -+Users can write the following command to ``lru_gen`` to evict -+generations less than or equal to ``min_gen_nr``. ++/* ++ * Reclaim/compaction is used for high-order allocation requests. It reclaims ++ * order-0 pages before compacting the zone. should_continue_reclaim() returns ++ * true if more pages should be reclaimed such that when the page allocator ++ * calls try_to_compact_pages() that it will have enough free pages to succeed. ++ * It will give up earlier than that if there is difficulty reclaiming pages. 
++ */ ++static inline bool should_continue_reclaim(struct pglist_data *pgdat, ++ unsigned long nr_reclaimed, ++ struct scan_control *sc) ++{ ++ unsigned long pages_for_compaction; ++ unsigned long inactive_lru_pages; ++ int z; + -+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` ++ /* If not in reclaim/compaction mode, stop */ ++ if (!in_reclaim_compaction(sc)) ++ return false; + -+``min_gen_nr`` should be less than ``max_gen_nr-1``, since -+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to -+the active list) and therefore cannot be evicted. ``swappiness`` -+overrides the default value in ``/proc/sys/vm/swappiness``. -+``nr_to_reclaim`` limits the number of pages to evict. -+ -+A typical use case is that a job scheduler runs this command before it -+tries to land a new job on a server. If it fails to materialize enough -+cold pages because of the overestimation, it retries on the next -+server according to the ranking result obtained from the working set -+estimation step. This less forceful approach limits the impacts on the -+existing jobs. -diff --git a/mm/Kconfig b/mm/Kconfig -index ab6ef5115eb8..ceec438c0741 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1125,7 +1125,8 @@ config LRU_GEN - # make sure folio->flags has enough spare bits - depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP - help -- A high performance LRU implementation to overcommit memory. -+ A high performance LRU implementation to overcommit memory. See -+ Documentation/admin-guide/mm/multigen_lru.rst for details. 
- - config LRU_GEN_ENABLED - bool "Enable by default" -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 7657d54c9c42..1456f133f256 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -5310,6 +5310,7 @@ static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, c - return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t len) - { -@@ -5343,6 +5344,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c - return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t len) - { -@@ -5490,6 +5492,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, - seq_putc(m, '\n'); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static int lru_gen_seq_show(struct seq_file *m, void *v) - { - unsigned long seq; -@@ -5648,6 +5651,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, - return err; - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, - size_t len, loff_t *pos) - { --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 14/14] mm: multi-gen LRU: design doc - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (12 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-19 2:08 ` [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Bagas Sanjaya - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add a design doc. - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - Documentation/mm/index.rst | 1 + - Documentation/mm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++ - 2 files changed, 160 insertions(+) - create mode 100644 Documentation/mm/multigen_lru.rst - -diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst -index 575ccd40e30c..4aa12b8be278 100644 ---- a/Documentation/mm/index.rst -+++ b/Documentation/mm/index.rst -@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. 
- ksm - memory-model - mmu_notifier -+ multigen_lru - numa - overcommit-accounting - page_migration -diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst -new file mode 100644 -index 000000000000..d7062c6a8946 ---- /dev/null -+++ b/Documentation/mm/multigen_lru.rst -@@ -0,0 +1,159 @@ -+.. SPDX-License-Identifier: GPL-2.0 -+ -+============= -+Multi-Gen LRU -+============= -+The multi-gen LRU is an alternative LRU implementation that optimizes -+page reclaim and improves performance under memory pressure. Page -+reclaim decides the kernel's caching policy and ability to overcommit -+memory. It directly impacts the kswapd CPU usage and RAM efficiency. -+ -+Design overview -+=============== -+Objectives -+---------- -+The design objectives are: -+ -+* Good representation of access recency -+* Try to profit from spatial locality -+* Fast paths to make obvious choices -+* Simple self-correcting heuristics -+ -+The representation of access recency is at the core of all LRU -+implementations. In the multi-gen LRU, each generation represents a -+group of pages with similar access recency. Generations establish a -+(time-based) common frame of reference and therefore help make better -+choices, e.g., between different memcgs on a computer or different -+computers in a data center (for job scheduling). -+ -+Exploiting spatial locality improves efficiency when gathering the -+accessed bit. A rmap walk targets a single page and does not try to -+profit from discovering a young PTE. A page table walk can sweep all -+the young PTEs in an address space, but the address space can be too -+sparse to make a profit. The key is to optimize both methods and use -+them in combination. -+ -+Fast paths reduce code complexity and runtime overhead. Unmapped pages -+do not require TLB flushes; clean pages do not require writeback. -+These facts are only helpful when other conditions, e.g., access -+recency, are similar. 
With generations as a common frame of reference, -+additional factors stand out. But obvious choices might not be good -+choices; thus self-correction is necessary. -+ -+The benefits of simple self-correcting heuristics are self-evident. -+Again, with generations as a common frame of reference, this becomes -+attainable. Specifically, pages in the same generation can be -+categorized based on additional factors, and a feedback loop can -+statistically compare the refault percentages across those categories -+and infer which of them are better choices. -+ -+Assumptions -+----------- -+The protection of hot pages and the selection of cold pages are based -+on page access channels and patterns. There are two access channels: -+ -+* Accesses through page tables -+* Accesses through file descriptors -+ -+The protection of the former channel is by design stronger because: -+ -+1. The uncertainty in determining the access patterns of the former -+ channel is higher due to the approximation of the accessed bit. -+2. The cost of evicting the former channel is higher due to the TLB -+ flushes required and the likelihood of encountering the dirty bit. -+3. The penalty of underprotecting the former channel is higher because -+ applications usually do not prepare themselves for major page -+ faults like they do for blocked I/O. E.g., GUI applications -+ commonly use dedicated I/O threads to avoid blocking rendering -+ threads. -+ -+There are also two access patterns: -+ -+* Accesses exhibiting temporal locality -+* Accesses not exhibiting temporal locality -+ -+For the reasons listed above, the former channel is assumed to follow -+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is -+present, and the latter channel is assumed to follow the latter -+pattern unless outlying refaults have been observed. -+ -+Workflow overview -+================= -+Evictable pages are divided into multiple generations for each -+``lruvec``. 
The youngest generation number is stored in -+``lrugen->max_seq`` for both anon and file types as they are aged on -+an equal footing. The oldest generation numbers are stored in -+``lrugen->min_seq[]`` separately for anon and file types as clean file -+pages can be evicted regardless of swap constraints. These three -+variables are monotonically increasing. -+ -+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` -+bits in order to fit into the gen counter in ``folio->flags``. Each -+truncated generation number is an index to ``lrugen->lists[]``. The -+sliding window technique is used to track at least ``MIN_NR_GENS`` and -+at most ``MAX_NR_GENS`` generations. The gen counter stores a value -+within ``[1, MAX_NR_GENS]`` while a page is on one of -+``lrugen->lists[]``; otherwise it stores zero. -+ -+Each generation is divided into multiple tiers. A page accessed ``N`` -+times through file descriptors is in tier ``order_base_2(N)``. Unlike -+generations, tiers do not have dedicated ``lrugen->lists[]``. In -+contrast to moving across generations, which requires the LRU lock, -+moving across tiers only involves atomic operations on -+``folio->flags`` and therefore has a negligible cost. A feedback loop -+modeled after the PID controller monitors refaults over all the tiers -+from anon and file types and decides which tiers from which types to -+evict or protect. -+ -+There are two conceptually independent procedures: the aging and the -+eviction. They form a closed-loop system, i.e., the page reclaim. -+ -+Aging -+----- -+The aging produces young generations. Given an ``lruvec``, it -+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches -+``MIN_NR_GENS``. The aging promotes hot pages to the youngest -+generation when it finds them accessed through page tables; the -+demotion of cold pages happens consequently when it increments -+``max_seq``. The aging uses page table walks and rmap walks to find -+young PTEs. 
For the former, it iterates ``lruvec_memcg()->mm_list`` -+and calls ``walk_page_range()`` with each ``mm_struct`` on this list -+to scan PTEs, and after each iteration, it increments ``max_seq``. For -+the latter, when the eviction walks the rmap and finds a young PTE, -+the aging scans the adjacent PTEs. For both, on finding a young PTE, -+the aging clears the accessed bit and updates the gen counter of the -+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. -+ -+Eviction -+-------- -+The eviction consumes old generations. Given an ``lruvec``, it -+increments ``min_seq`` when ``lrugen->lists[]`` indexed by -+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to -+evict from, it first compares ``min_seq[]`` to select the older type. -+If both types are equally old, it selects the one whose first tier has -+a lower refault percentage. The first tier contains single-use -+unmapped clean pages, which are the best bet. The eviction sorts a -+page according to its gen counter if the aging has found this page -+accessed through page tables and updated its gen counter. It also -+moves a page to the next generation, i.e., ``min_seq+1``, if this page -+was accessed multiple times through file descriptors and the feedback -+loop has detected outlying refaults from the tier this page is in. To -+this end, the feedback loop uses the first tier as the baseline, for -+the reason stated earlier. -+ -+Summary -+------- -+The multi-gen LRU can be disassembled into the following parts: -+ -+* Generations -+* Rmap walks -+* Page table walks -+* Bloom filters -+* PID controller -+ -+The aging and the eviction form a producer-consumer model; -+specifically, the latter drives the former by the sliding window over -+generations. Within the aging, rmap walks drive page table walks by -+inserting hot densely populated page tables to the Bloom filters. 
-+Within the eviction, the PID controller uses refaults as the feedback -+to select types to evict and tiers to protect. --- -2.37.3.968.ga6b4b080e4-goog - - - -* Re: [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs - 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao -@ 2022-09-28 18:46 ` Yu Zhao - 0 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-28 18:46 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Hi Andrew, - -Can you please take this fixlet? Thanks. - -Fix imprecise comments. - -Signed-off-by: Yu Zhao ---- - mm/vmscan.c | 9 ++++----- - 1 file changed, 4 insertions(+), 5 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index a8fd6300fa7e..5b565470286b 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -5078,7 +5078,7 @@ static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, - DEFINE_MAX_SEQ(lruvec); - - if (!current_is_kswapd()) { -- /* age each memcg once to ensure fairness */ -+ /* age each memcg at most once to ensure fairness */ - if (max_seq - seq > 1) - return true; - -@@ -5103,10 +5103,9 @@ static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, - - /* - * A minimum amount of work was done under global memory pressure. For -- * kswapd, it may be overshooting. 
For direct reclaim, the target isn't -- * met, and yet the allocation may still succeed, since kswapd may have -- * caught up. In either case, it's better to stop now, and restart if -- * necessary. -+ * kswapd, it may be overshooting. For direct reclaim, the allocation -+ * may succeed if all suitable zones are somewhat safe. In either case, -+ * it's better to stop now, and restart later if necessary. - */ - for (i = 0; i <= sc->reclaim_idx; i++) { - unsigned long wmark; --- -2.37.3.998.g577e59143f-goog - - - - -* Re: [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks - 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao - 2022-09-18 8:17 ` Yu Zhao -@ 2022-09-28 19:36 ` Yu Zhao - 1 sibling, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-28 19:36 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Hi Andrew, - -Can you please take another fixlet? Thanks. - -Don't sync disk for each aging cycle. - -wakeup_flusher_threads() was added under the assumption that if a -system runs out of clean cold pages, it might want to write back dirty -pages more aggressively so that they can become clean and be dropped. - -However, doing so can breach the rate limit a system wants to impose -on writeback, resulting in early SSD wearout. 
- -Reported-by: Axel Rasmussen -Signed-off-by: Yu Zhao ---- - mm/vmscan.c | 2 -- - 1 file changed, 2 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 5b565470286b..0317d4cf4884 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -4413,8 +4413,6 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - if (wq_has_sleeper(&lruvec->mm_state.wait)) - wake_up_all(&lruvec->mm_state.wait); - -- wakeup_flusher_threads(WB_REASON_VMSCAN); ++ /* + * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX + * number of pages that were scanned. This will return to the caller + * with the risk reclaim/compaction and the resulting allocation attempt +@@ -3197,109 +6072,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) + unsigned long nr_reclaimed, nr_scanned; + struct lruvec *target_lruvec; + bool reclaimable = false; +- unsigned long file; + + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + again: +- /* +- * Flush the memory cgroup stats, so that we read accurate per-memcg +- * lruvec stats for heuristics. +- */ +- mem_cgroup_flush_stats(); - - return true; + memset(&sc->nr, 0, sizeof(sc->nr)); + + nr_reclaimed = sc->nr_reclaimed; + nr_scanned = sc->nr_scanned; + +- /* +- * Determine the scan balance between anon and file LRUs. +- */ +- spin_lock_irq(&target_lruvec->lru_lock); +- sc->anon_cost = target_lruvec->anon_cost; +- sc->file_cost = target_lruvec->file_cost; +- spin_unlock_irq(&target_lruvec->lru_lock); +- +- /* +- * Target desirable inactive:active list ratios for the anon +- * and file LRU lists. 
+- */ +- if (!sc->force_deactivate) { +- unsigned long refaults; +- +- refaults = lruvec_page_state(target_lruvec, +- WORKINGSET_ACTIVATE_ANON); +- if (refaults != target_lruvec->refaults[0] || +- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) +- sc->may_deactivate |= DEACTIVATE_ANON; +- else +- sc->may_deactivate &= ~DEACTIVATE_ANON; +- +- /* +- * When refaults are being observed, it means a new +- * workingset is being established. Deactivate to get +- * rid of any stale active pages quickly. +- */ +- refaults = lruvec_page_state(target_lruvec, +- WORKINGSET_ACTIVATE_FILE); +- if (refaults != target_lruvec->refaults[1] || +- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) +- sc->may_deactivate |= DEACTIVATE_FILE; +- else +- sc->may_deactivate &= ~DEACTIVATE_FILE; +- } else +- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; +- +- /* +- * If we have plenty of inactive file pages that aren't +- * thrashing, try to reclaim those first before touching +- * anonymous pages. +- */ +- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); +- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) +- sc->cache_trim_mode = 1; +- else +- sc->cache_trim_mode = 0; +- +- /* +- * Prevent the reclaimer from falling into the cache trap: as +- * cache pages start out inactive, every cache fault will tip +- * the scan balance towards the file LRU. And as the file LRU +- * shrinks, so does the window for rotation from references. +- * This means we have a runaway feedback loop where a tiny +- * thrashing file LRU becomes infinitely more attractive than +- * anon pages. Try to detect this based on file LRU size. 
+- */ +- if (!cgroup_reclaim(sc)) { +- unsigned long total_high_wmark = 0; +- unsigned long free, anon; +- int z; +- +- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); +- file = node_page_state(pgdat, NR_ACTIVE_FILE) + +- node_page_state(pgdat, NR_INACTIVE_FILE); +- +- for (z = 0; z < MAX_NR_ZONES; z++) { +- struct zone *zone = &pgdat->node_zones[z]; +- if (!managed_zone(zone)) +- continue; +- +- total_high_wmark += high_wmark_pages(zone); +- } +- +- /* +- * Consider anon: if that's low too, this isn't a +- * runaway file reclaim problem, but rather just +- * extreme pressure. Reclaim as per usual then. +- */ +- anon = node_page_state(pgdat, NR_INACTIVE_ANON); +- +- sc->file_is_tiny = +- file + free <= total_high_wmark && +- !(sc->may_deactivate & DEACTIVATE_ANON) && +- anon >> sc->priority; +- } ++ prepare_scan_count(pgdat, sc); + + shrink_node_memcgs(pgdat, sc); + +@@ -3557,6 +6339,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) + struct lruvec *target_lruvec; + unsigned long refaults; + ++ if (lru_gen_enabled()) ++ return; ++ + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); + target_lruvec->refaults[0] = refaults; +@@ -3923,12 +6708,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, } - --- -2.37.3.998.g577e59143f-goog + #endif + +-static void age_active_anon(struct pglist_data *pgdat, +- struct scan_control *sc) ++static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) + { + struct mem_cgroup *memcg; + struct lruvec *lruvec; + ++ if (lru_gen_enabled()) { ++ lru_gen_age_node(pgdat, sc); ++ return; ++ } ++ + if (!can_age_anon_pages(pgdat, sc)) + return; + +@@ -4248,12 +7037,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) + sc.may_swap = !nr_boost_reclaim; + + /* +- * Do some background aging of the anon list, to give +- * pages a chance to be 
referenced before reclaiming. All +- * pages are rotated regardless of classzone as this is +- * about consistent aging. ++ * Do some background aging, to give pages a chance to be ++ * referenced before reclaiming. All pages are rotated ++ * regardless of classzone as this is about consistent aging. + */ +- age_active_anon(pgdat, &sc); ++ kswapd_age_node(pgdat, &sc); + + /* + * If we're getting trouble reclaiming, start doing writepage +diff --git a/mm/workingset.c b/mm/workingset.c +index a5e84862fc8688..ae7e984b23c6b0 100644 +--- a/mm/workingset.c ++++ b/mm/workingset.c +@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; + static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, + bool workingset) + { +- eviction >>= bucket_order; + eviction &= EVICTION_MASK; + eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; + eviction = (eviction << NODES_SHIFT) | pgdat->node_id; +@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, + + *memcgidp = memcgid; + *pgdat = NODE_DATA(nid); +- *evictionp = entry << bucket_order; ++ *evictionp = entry; + *workingsetp = workingset; + } + ++#ifdef CONFIG_LRU_GEN ++ ++static void *lru_gen_eviction(struct folio *folio) ++{ ++ int hist; ++ unsigned long token; ++ unsigned long min_seq; ++ struct lruvec *lruvec; ++ struct lru_gen_struct *lrugen; ++ int type = folio_is_file_lru(folio); ++ int delta = folio_nr_pages(folio); ++ int refs = folio_lru_refs(folio); ++ int tier = lru_tier_from_refs(refs); ++ struct mem_cgroup *memcg = folio_memcg(folio); ++ struct pglist_data *pgdat = folio_pgdat(folio); ++ ++ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); ++ ++ lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ lrugen = &lruvec->lrugen; ++ min_seq = READ_ONCE(lrugen->min_seq[type]); ++ token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0); ++ ++ hist = lru_hist_from_seq(min_seq); ++ atomic_long_add(delta, 
&lrugen->evicted[hist][type][tier]); ++ ++ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs); ++} ++ ++static void lru_gen_refault(struct folio *folio, void *shadow) ++{ ++ int hist, tier, refs; ++ int memcg_id; ++ bool workingset; ++ unsigned long token; ++ unsigned long min_seq; ++ struct lruvec *lruvec; ++ struct lru_gen_struct *lrugen; ++ struct mem_cgroup *memcg; ++ struct pglist_data *pgdat; ++ int type = folio_is_file_lru(folio); ++ int delta = folio_nr_pages(folio); ++ ++ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); ++ ++ if (pgdat != folio_pgdat(folio)) ++ return; ++ ++ rcu_read_lock(); ++ ++ memcg = folio_memcg_rcu(folio); ++ if (memcg_id != mem_cgroup_id(memcg)) ++ goto unlock; ++ ++ lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ lrugen = &lruvec->lrugen; ++ ++ min_seq = READ_ONCE(lrugen->min_seq[type]); ++ if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) ++ goto unlock; ++ ++ hist = lru_hist_from_seq(min_seq); ++ /* see the comment in folio_lru_refs() */ ++ refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; ++ tier = lru_tier_from_refs(refs); ++ ++ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); ++ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); ++ ++ /* ++ * Count the following two cases as stalls: ++ * 1. For pages accessed through page tables, hotter pages pushed out ++ * hot pages which refaulted immediately. ++ * 2. For pages accessed multiple times through file descriptors, ++ * numbers of accesses might have been out of the range. 
++ */ ++ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { ++ folio_set_workingset(folio); ++ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); ++ } ++unlock: ++ rcu_read_unlock(); ++} ++ ++#else /* !CONFIG_LRU_GEN */ ++ ++static void *lru_gen_eviction(struct folio *folio) ++{ ++ return NULL; ++} ++ ++static void lru_gen_refault(struct folio *folio, void *shadow) ++{ ++} ++ ++#endif /* CONFIG_LRU_GEN */ ++ + /** + * workingset_age_nonresident - age non-resident entries as LRU ages + * @lruvec: the lruvec that was aged +@@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) + VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + ++ if (lru_gen_enabled()) ++ return lru_gen_eviction(folio); ++ + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + /* XXX: target_memcg can be NULL, go through lruvec */ + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); + eviction = atomic_long_read(&lruvec->nonresident_age); ++ eviction >>= bucket_order; + workingset_age_nonresident(lruvec, folio_nr_pages(folio)); + return pack_shadow(memcgid, pgdat, eviction, + folio_test_workingset(folio)); +@@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) + int memcgid; + long nr; + ++ if (lru_gen_enabled()) { ++ lru_gen_refault(folio, shadow); ++ return; ++ } ++ + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); ++ eviction <<= bucket_order; + + rcu_read_lock(); + /*
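[Editor's note, outside the patch: the mm/workingset.c hunks above move the `bucket_order` shift out of `pack_shadow()`/`unpack_shadow()` and into the classic-LRU callers, so that MGLRU can store its own full-precision token — `(min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0)` — in the shadow entry. The packing itself is ordinary bit-field stacking, sketched below with made-up field widths (`DEMO_*`); the real kernel derives the widths from `MEM_CGROUP_ID_SHIFT`, `NODES_SHIFT` and the XArray tag bits, and the widths chosen here are sized to fit a 32-bit `unsigned long`.]

```c
#include <assert.h>

/* Assumed widths for the sketch only. */
#define DEMO_MEMCG_BITS     8
#define DEMO_NODE_BITS      4
#define DEMO_WS_BITS        1
#define DEMO_LRU_REFS_WIDTH 2

/* Stack (eviction token, memcg id, node id, workingset bit) into one word,
 * most-significant field first, exactly as pack_shadow() layers its fields. */
static unsigned long demo_pack_shadow(int memcgid, int nid,
				      unsigned long eviction, int workingset)
{
	unsigned long entry = eviction;

	entry = (entry << DEMO_MEMCG_BITS) | (unsigned long)memcgid;
	entry = (entry << DEMO_NODE_BITS) | (unsigned long)nid;
	entry = (entry << DEMO_WS_BITS) | (unsigned long)workingset;
	return entry;
}

/* Peel the fields back off in reverse order. */
static void demo_unpack_shadow(unsigned long entry, int *memcgid, int *nid,
			       unsigned long *eviction, int *workingset)
{
	*workingset = (int)(entry & ((1UL << DEMO_WS_BITS) - 1));
	entry >>= DEMO_WS_BITS;
	*nid = (int)(entry & ((1UL << DEMO_NODE_BITS) - 1));
	entry >>= DEMO_NODE_BITS;
	*memcgid = (int)(entry & ((1UL << DEMO_MEMCG_BITS) - 1));
	entry >>= DEMO_MEMCG_BITS;
	*eviction = entry;
}

/* MGLRU's eviction token, as built in lru_gen_eviction() above:
 * the oldest sequence number plus the page's reference count. */
static unsigned long demo_lru_gen_token(unsigned long min_seq, int refs)
{
	return (min_seq << DEMO_LRU_REFS_WIDTH) |
	       (unsigned long)(refs > 0 ? refs - 1 : 0);
}
```

On refault, `lru_gen_refault()` reverses the token: the high bits are compared against the current `min_seq` (a mismatch means the shadow entry is stale), and the low `LRU_REFS_WIDTH` bits recover the reference count used to pick the tier whose `refaulted[]` counter is bumped.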