diff --git a/linux-tkg-patches/6.0/0010-lru_6.0.patch b/linux-tkg-patches/6.0/0010-lru_6.0.patch
index b97022b..0b4b5a0 100644
--- a/linux-tkg-patches/6.0/0010-lru_6.0.patch
+++ b/linux-tkg-patches/6.0/0010-lru_6.0.patch
@@ -1,403 +1,381 @@
-linux-kernel.vger.kernel.org archive mirror
-
- help / color / mirror / Atom feed
-
-* [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework
-@ 2022-09-18 7:59 Yu Zhao
- 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
- ` (14 more replies)
- 0 siblings, 15 replies; 23+ messages in thread
-From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw)
- To: Andrew Morton
- Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
- Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet,
- Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel,
- Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo,
- Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc,
- linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao
-
-What's new
-==========
-1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
- Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
-2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
- machines. The old direct reclaim backoff, which tries to enforce a
- minimum fairness among all eligible memcgs, over-swapped by about
- (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
- pulls the plug on swapping once the target is met, trades some
- fairness for curtailed latency:
- https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
-3. Fixed minor build warnings and conflicts. More comments and nits.
-
-TLDR
-====
-The current page reclaim is too expensive in terms of CPU usage and it
-often makes poor choices about what to evict. This patchset offers an
-alternative solution that is performant, versatile and
-straightforward.
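The over-swap estimate in item 2 of "What's new" can be checked with quick arithmetic. The sketch below assumes a 4 TiB machine and takes nr_to_reclaim to be SWAP_CLUSTER_MAX, as in direct reclaim; DEF_PRIORITY=12 and SWAP_CLUSTER_MAX=32 match the kernel's definitions:

```python
# Rough illustration of (total_mem>>DEF_PRIORITY)-nr_to_reclaim from the
# cover letter. The 4 TiB machine size is an assumption for illustration.
PAGE_SIZE = 4096
DEF_PRIORITY = 12        # kernel's default reclaim priority
SWAP_CLUSTER_MAX = 32    # typical nr_to_reclaim for direct reclaim, in pages

total_mem_pages = (4 << 40) // PAGE_SIZE   # 4 TiB of RAM, in pages
over_swap_pages = (total_mem_pages >> DEF_PRIORITY) - SWAP_CLUSTER_MAX

# On a TB-scale machine the old backoff could swap out roughly a gibibyte
# more than the reclaim target, which explains the long-tailed latency.
print(over_swap_pages * PAGE_SIZE / (1 << 30))
```

At 4 TiB the excess works out to just under 1 GiB of needless swapping per reclaim pass, which is why the fix matters mostly on high-memory machines.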
- -Patchset overview -================= -The design and implementation overview is in patch 14: -https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/ - -01. mm: x86, arm64: add arch_has_hw_pte_young() -02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -Take advantage of hardware features when trying to clear the accessed -bit in many PTEs. - -03. mm/vmscan.c: refactor shrink_node() -04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into - its sole caller" -Minor refactors to improve readability for the following patches. - -05. mm: multi-gen LRU: groundwork -Adds the basic data structure and the functions that insert pages to -and remove pages from the multi-gen LRU (MGLRU) lists. - -06. mm: multi-gen LRU: minimal implementation -A minimal implementation without optimizations. - -07. mm: multi-gen LRU: exploit locality in rmap -Exploits spatial locality to improve efficiency when using the rmap. - -08. mm: multi-gen LRU: support page table walks -Further exploits spatial locality by optionally scanning page tables. - -09. mm: multi-gen LRU: optimize multiple memcgs -Optimizes the overall performance for multiple memcgs running mixed -types of workloads. - -10. mm: multi-gen LRU: kill switch -Adds a kill switch to enable or disable MGLRU at runtime. - -11. mm: multi-gen LRU: thrashing prevention -12. mm: multi-gen LRU: debugfs interface -Provide userspace with features like thrashing prevention, working set -estimation and proactive reclaim. - -13. mm: multi-gen LRU: admin guide -14. mm: multi-gen LRU: design doc -Add an admin guide and a design doc. 
- -Benchmark results -================= -Independent lab results ------------------------ -Based on the popularity of searches [01] and the memory usage in -Google's public cloud, the most popular open-source memory-hungry -applications, in alphabetical order, are: - Apache Cassandra Memcached - Apache Hadoop MongoDB - Apache Spark PostgreSQL - MariaDB (MySQL) Redis - -An independent lab evaluated MGLRU with the most widely used benchmark -suites for the above applications. They posted 960 data points along -with kernel metrics and perf profiles collected over more than 500 -hours of total benchmark time. Their final reports show that, with 95% -confidence intervals (CIs), the above applications all performed -significantly better for at least part of their benchmark matrices. - -On 5.14: -1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% - less wall time to sort three billion random integers, respectively, - under the medium- and the high-concurrency conditions, when - overcommitting memory. There were no statistically significant - changes in wall time for the rest of the benchmark matrix. -2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]% - more transactions per minute (TPM), respectively, under the medium- - and the high-concurrency conditions, when overcommitting memory. - There were no statistically significant changes in TPM for the rest - of the benchmark matrix. -3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]% - and [21.59, 30.02]% more operations per second (OPS), respectively, - for sequential access, random access and Gaussian (distribution) - access, when THP=always; 95% CIs [13.85, 15.97]% and - [23.94, 29.92]% more OPS, respectively, for random access and - Gaussian access, when THP=never. There were no statistically - significant changes in OPS for the rest of the benchmark matrix. -4. 
MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and - [2.16, 3.55]% more operations per second (OPS), respectively, for - exponential (distribution) access, random access and Zipfian - (distribution) access, when underutilizing memory; 95% CIs - [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS, - respectively, for exponential access, random access and Zipfian - access, when overcommitting memory. - -On 5.15: -5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% - and [4.11, 7.50]% more operations per second (OPS), respectively, - for exponential (distribution) access, random access and Zipfian - (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, - [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for - exponential access, random access and Zipfian access, when swap was - on. -6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% - less average wall time to finish twelve parallel TeraSort jobs, - respectively, under the medium- and the high-concurrency - conditions, when swap was on. There were no statistically - significant changes in average wall time for the rest of the - benchmark matrix. -7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per - minute (TPM) under the high-concurrency condition, when swap was - off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, - respectively, under the medium- and the high-concurrency - conditions, when swap was on. There were no statistically - significant changes in TPM for the rest of the benchmark matrix. -8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and - [11.47, 19.36]% more total operations per second (OPS), - respectively, for sequential access, random access and Gaussian - (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, - [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, - for sequential access, random access and Gaussian access, when - THP=never. 
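The 95% confidence intervals quoted above come from standard statistics. The sketch below shows a t-based CI for a relative throughput change; the sample numbers are made up for illustration and are not the labs' measurements (their raw data is in the linked reports):

```python
# Sketch: 95% CI for the mean relative change between two sets of runs.
# Sample values below are hypothetical; pairing the runs is a simplification.
import statistics

baseline = [102.1, 99.8, 101.3, 100.5, 98.9]    # OPS, stock kernel
patched = [128.4, 125.9, 130.2, 127.1, 126.6]   # OPS, MGLRU kernel

deltas = [100.0 * (p - b) / b for p, b in zip(patched, baseline)]
mean = statistics.mean(deltas)
sem = statistics.stdev(deltas) / len(deltas) ** 0.5
t95 = 2.776  # two-sided t critical value for 4 degrees of freedom

lo, hi = mean - t95 * sem, mean + t95 * sem
print(f"95% CI [{lo:.2f}, {hi:.2f}]% more OPS")
```

A CI that excludes zero, as in every result quoted above, is what "statistically significant" means in this cover letter; intervals straddling zero are reported as "no statistically significant changes".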
-
-Our lab results
----------------
-To supplement the above results, we ran the following benchmark suites
-on 5.16-rc7 and found no regressions [10].
- fs_fio_bench_hdd_mq pft
- fs_lmbench pgsql-hammerdb
- fs_parallelio redis
- fs_postmark stream
- hackbench sysbenchthread
- kernbench tpcc_spark
- memcached unixbench
- multichase vm-scalability
- mutilate will-it-scale
- nginx
-
-[01] https://trends.google.com
-[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
-[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
-[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
-[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
-[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
-[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
-[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
-[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
-[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
-
-Real-world applications
-=======================
-Third-party testimonials
-------------------------
-Konstantin reported [11]:
- I have Archlinux with 8G RAM + zswap + swap. While developing, I
- have lots of apps opened such as multiple LSP-servers for different
- langs, chats, two browsers, etc... Usually, my system gets quickly
- to a point of SWAP-storms, where I have to kill LSP-servers,
- restart browsers to free memory, etc, otherwise the system lags
- heavily and is barely usable.
-
- 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
- patchset, and I started up by opening lots of apps to create memory
- pressure, and worked for a day like this. Till now I had not a
- single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
- getting to the point of 3G in SWAP before without a single
- SWAP-storm.
-
-Vaibhav from IBM reported [12]:
- In a synthetic MongoDB Benchmark, seeing an average of ~19%
- throughput improvement on POWER10(Radix MMU + 64K Page Size) with
- MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
- three different request distributions, namely, Exponential, Uniform
- and Zipfian.
-
-Shuang from U of Rochester reported [13]:
- With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
- and [9.26, 10.36]% higher throughput, respectively, for random
- access, Zipfian (distribution) access and Gaussian (distribution)
- access, when the average number of jobs per CPU is 1; 95% CIs
- [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
- throughput, respectively, for random access, Zipfian access and
- Gaussian access, when the average number of jobs per CPU is 2.
-
-Daniel from Michigan Tech reported [14]:
- With Memcached allocating ~100GB of byte-addressable Optane,
- performance improvement in terms of throughput (measured as queries
- per second) was about 10% for a series of workloads.
-
-Large-scale deployments
------------------------
-We've rolled out MGLRU to tens of millions of ChromeOS users and
-about a million Android users. Google's fleetwide profiling [15] shows
-an overall 40% decrease in kswapd CPU usage, in addition to
-improvements in other UX metrics, e.g., an 85% decrease in the number
-of low-memory kills at the 75th percentile and an 18% decrease in
-app launch time at the 50th percentile.
-
-The downstream kernels that have been using MGLRU include:
-1. Android [16]
-2. Arch Linux Zen [17]
-3. Armbian [18]
-4. ChromeOS [19]
-5. Liquorix [20]
-6. OpenWrt [21]
-7. post-factum [22]
-8. 
XanMod [23]
-
-[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
-[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
-[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
-[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
-[15] https://dl.acm.org/doi/10.1145/2749469.2750392
-[16] https://android.com
-[17] https://archlinux.org
-[18] https://armbian.com
-[19] https://chromium.org
-[20] https://liquorix.net
-[21] https://openwrt.org
-[22] https://codeberg.org/pf-kernel
-[23] https://xanmod.org
-
-Summary
-=======
-The facts are:
-1. The independent lab results and the real-world applications
- indicate substantial improvements; there are no known regressions.
-2. Thrashing prevention, working set estimation and proactive reclaim
- work out of the box; there are no equivalent solutions.
-3. There is a lot of new code; no smaller changes have
- demonstrated similar effects.
-
-Our options, accordingly, are:
-1. Given the amount of evidence, the reported improvements will likely
- materialize for a wide range of workloads.
-2. Gauging the interest from the past discussions, the new features
- will likely be put to use for both personal computers and data
- centers.
-3. Based on Google's track record, the new code will likely be well
- maintained in the long term. It'd be more difficult if not
- impossible to achieve similar effects with other approaches.
- -Yu Zhao (14): - mm: x86, arm64: add arch_has_hw_pte_young() - mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG - mm/vmscan.c: refactor shrink_node() - Revert "include/linux/mm_inline.h: fold __update_lru_size() into its - sole caller" - mm: multi-gen LRU: groundwork - mm: multi-gen LRU: minimal implementation - mm: multi-gen LRU: exploit locality in rmap - mm: multi-gen LRU: support page table walks - mm: multi-gen LRU: optimize multiple memcgs - mm: multi-gen LRU: kill switch - mm: multi-gen LRU: thrashing prevention - mm: multi-gen LRU: debugfs interface - mm: multi-gen LRU: admin guide - mm: multi-gen LRU: design doc - - Documentation/admin-guide/mm/index.rst | 1 + - Documentation/admin-guide/mm/multigen_lru.rst | 162 + - Documentation/mm/index.rst | 1 + - Documentation/mm/multigen_lru.rst | 159 + - arch/Kconfig | 8 + - arch/arm64/include/asm/pgtable.h | 15 +- - arch/x86/Kconfig | 1 + - arch/x86/include/asm/pgtable.h | 9 +- - arch/x86/mm/pgtable.c | 5 +- - fs/exec.c | 2 + - fs/fuse/dev.c | 3 +- - include/linux/cgroup.h | 15 +- - include/linux/memcontrol.h | 36 + - include/linux/mm.h | 5 + - include/linux/mm_inline.h | 231 +- - include/linux/mm_types.h | 76 + - include/linux/mmzone.h | 214 ++ - include/linux/nodemask.h | 1 + - include/linux/page-flags-layout.h | 16 +- - include/linux/page-flags.h | 4 +- - include/linux/pgtable.h | 17 +- - include/linux/sched.h | 4 + - include/linux/swap.h | 4 + - kernel/bounds.c | 7 + - kernel/cgroup/cgroup-internal.h | 1 - - kernel/exit.c | 1 + - kernel/fork.c | 9 + - kernel/sched/core.c | 1 + - mm/Kconfig | 26 + - mm/huge_memory.c | 3 +- - mm/internal.h | 1 + - mm/memcontrol.c | 28 + - mm/memory.c | 39 +- - mm/mm_init.c | 6 +- - mm/mmzone.c | 2 + - mm/rmap.c | 6 + - mm/swap.c | 54 +- - mm/vmscan.c | 2995 ++++++++++++++++- - mm/workingset.c | 110 +- - 39 files changed, 4122 insertions(+), 156 deletions(-) - create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst - create mode 100644 Documentation/mm/multigen_lru.rst - 
- -base-commit: 6cf215f1d5dac59a5a09514138ca37aed2719d0a --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao -@ 2022-09-18 7:59 ` Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao - ` (13 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Some architectures automatically set the accessed bit in PTEs, e.g., -x86 and arm64 v8.2. On architectures that do not have this capability, -clearing the accessed bit in a PTE usually triggers a page fault -following the TLB miss of this PTE (to emulate the accessed bit). - -Being aware of this capability can help make better decisions, e.g., -whether to spread the work out over a period of time to reduce bursty -page faults when trying to clear the accessed bit in many PTEs. - -Note that theoretically this capability can be unreliable, e.g., -hotplugged CPUs might be different from builtin ones. Therefore it -should not be used in architecture-independent code that involves -correctness, e.g., to determine whether TLB flushes are required (in -combination with the accessed bit). 
- -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Acked-by: Will Deacon -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - arch/arm64/include/asm/pgtable.h | 15 ++------------- - arch/x86/include/asm/pgtable.h | 6 +++--- - include/linux/pgtable.h | 13 +++++++++++++ - mm/memory.c | 14 +------------- - 4 files changed, 19 insertions(+), 29 deletions(-) - +diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst +index 1bd11118dfb1c6..d1064e0ba34a29 100644 +--- a/Documentation/admin-guide/mm/index.rst ++++ b/Documentation/admin-guide/mm/index.rst +@@ -32,6 +32,7 @@ the Linux memory management. + idle_page_tracking + ksm + memory-hotplug ++ multigen_lru + nommu-mmap + numa_memory_policy + numaperf +diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst +new file mode 100644 +index 00000000000000..33e068830497e7 +--- /dev/null ++++ b/Documentation/admin-guide/mm/multigen_lru.rst +@@ -0,0 +1,162 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++============= ++Multi-Gen LRU ++============= ++The multi-gen LRU is an alternative LRU implementation that optimizes ++page reclaim and improves performance under memory pressure. Page ++reclaim decides the kernel's caching policy and ability to overcommit ++memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++ ++Quick start ++=========== ++Build the kernel with the following configurations. ++ ++* ``CONFIG_LRU_GEN=y`` ++* ``CONFIG_LRU_GEN_ENABLED=y`` ++ ++All set! ++ ++Runtime options ++=============== ++``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the ++following subsections. 
++ ++Kill switch ++----------- ++``enabled`` accepts different values to enable or disable the ++following components. Its default value depends on ++``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled ++unless some of them have unforeseen side effects. Writing to ++``enabled`` has no effect when a component is not supported by the ++hardware, and valid values will be accepted even when the main switch ++is off. ++ ++====== =============================================================== ++Values Components ++====== =============================================================== ++0x0001 The main switch for the multi-gen LRU. ++0x0002 Clearing the accessed bit in leaf page table entries in large ++ batches, when MMU sets it (e.g., on x86). This behavior can ++ theoretically worsen lock contention (mmap_lock). If it is ++ disabled, the multi-gen LRU will suffer a minor performance ++ degradation for workloads that contiguously map hot pages, ++ whose accessed bits can be otherwise cleared by fewer larger ++ batches. ++0x0004 Clearing the accessed bit in non-leaf page table entries as ++ well, when MMU sets it (e.g., on x86). This behavior was not ++ verified on x86 varieties other than Intel and AMD. If it is ++ disabled, the multi-gen LRU will suffer a negligible ++ performance degradation. ++[yYnN] Apply to all the components above. ++====== =============================================================== ++ ++E.g., ++:: ++ ++ echo y >/sys/kernel/mm/lru_gen/enabled ++ cat /sys/kernel/mm/lru_gen/enabled ++ 0x0007 ++ echo 5 >/sys/kernel/mm/lru_gen/enabled ++ cat /sys/kernel/mm/lru_gen/enabled ++ 0x0005 ++ ++Thrashing prevention ++-------------------- ++Personal computers are more sensitive to thrashing because it can ++cause janks (lags when rendering UI) and negatively impact user ++experience. The multi-gen LRU offers thrashing prevention to the ++majority of laptop and desktop users who do not have ``oomd``. 
++
++Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
++``N`` milliseconds from getting evicted. The OOM killer is triggered
++if this working set cannot be kept in memory. In other words, this
++option works as an adjustable pressure relief valve, and when open, it
++terminates applications that are hopefully not being used.
++
++Based on the average human detectable lag (~100ms), ``N=1000`` usually
++eliminates intolerable janks due to thrashing. Larger values like
++``N=3000`` make janks less noticeable at the risk of premature OOM
++kills.
++
++The default value ``0`` means disabled.
++
++Experimental features
++=====================
++``/sys/kernel/debug/lru_gen`` accepts commands described in the
++following subsections. Multiple command lines are supported, as is
++concatenation with delimiters ``,`` and ``;``.
++
++``/sys/kernel/debug/lru_gen_full`` provides additional stats for
++debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
++evicted generations in this file.
++
++Working set estimation
++----------------------
++Working set estimation measures how much memory an application needs
++in a given time interval, and it is usually done with little impact on
++the performance of the application. E.g., data centers want to
++optimize job scheduling (bin packing) to improve memory utilization.
++When a new job comes in, the job scheduler needs to find out whether
++each server it manages can allocate a certain amount of memory for
++this new job before it can pick a candidate. To do so, the job
++scheduler needs to estimate the working sets of the existing jobs.
++
++When it is read, ``lru_gen`` returns a histogram of numbers of pages
++accessed over different time intervals for each memcg and node.
++``MAX_NR_GENS`` decides the number of bins for each histogram. The
++histograms are noncumulative.
++::
++
++ memcg memcg_id memcg_path
++ node node_id
++ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
++ ...
++ max_gen_nr age_in_ms nr_anon_pages nr_file_pages ++ ++Each bin contains an estimated number of pages that have been accessed ++within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages ++and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of ++the former is the largest and that of the latter is the smallest. ++ ++Users can write the following command to ``lru_gen`` to create a new ++generation ``max_gen_nr+1``: ++ ++ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` ++ ++``can_swap`` defaults to the swap setting and, if it is set to ``1``, ++it forces the scan of anon pages when swap is off, and vice versa. ++``force_scan`` defaults to ``1`` and, if it is set to ``0``, it ++employs heuristics to reduce the overhead, which is likely to reduce ++the coverage as well. ++ ++A typical use case is that a job scheduler runs this command at a ++certain time interval to create new generations, and it ranks the ++servers it manages based on the sizes of their cold pages defined by ++this time interval. ++ ++Proactive reclaim ++----------------- ++Proactive reclaim induces page reclaim when there is no memory ++pressure. It usually targets cold pages only. E.g., when a new job ++comes in, the job scheduler wants to proactively reclaim cold pages on ++the server it selected, to improve the chance of successfully landing ++this new job. ++ ++Users can write the following command to ``lru_gen`` to evict ++generations less than or equal to ``min_gen_nr``. ++ ++ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` ++ ++``min_gen_nr`` should be less than ``max_gen_nr-1``, since ++``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to ++the active list) and therefore cannot be evicted. ``swappiness`` ++overrides the default value in ``/proc/sys/vm/swappiness``. ++``nr_to_reclaim`` limits the number of pages to evict. 
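The aging (``+``) and eviction (``-``) commands above share one line format. The sketch below only builds the strings a job scheduler would write to ``/sys/kernel/debug/lru_gen``; the memcg and node IDs are hypothetical, and the write itself requires ``CONFIG_LRU_GEN`` and root:

```python
# Sketch of the debugfs command strings documented above. IDs are made up.

def aging_cmd(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    """'+ memcg_id node_id max_gen_nr [can_swap [force_scan]]'"""
    parts = ["+", memcg_id, node_id, max_gen_nr]
    if can_swap is not None:
        parts.append(can_swap)
        if force_scan is not None:
            parts.append(force_scan)
    return " ".join(str(p) for p in parts)

def eviction_cmd(memcg_id, node_id, min_gen_nr, swappiness=None,
                 nr_to_reclaim=None):
    """'- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'"""
    parts = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        parts.append(swappiness)
        if nr_to_reclaim is not None:
            parts.append(nr_to_reclaim)
    return " ".join(str(p) for p in parts)

# Multiple commands can be concatenated with ',' or ';':
batch = "; ".join([aging_cmd(3, 0, 7), eviction_cmd(3, 0, 5, swappiness=60)])
# with open("/sys/kernel/debug/lru_gen", "w") as f:
#     f.write(batch)
print(batch)
```

Note the ordering constraint from the text: the optional fields are positional, so ``force_scan`` and ``nr_to_reclaim`` can only be given when ``can_swap`` and ``swappiness`` are, respectively.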
++ ++A typical use case is that a job scheduler runs this command before it ++tries to land a new job on a server. If it fails to materialize enough ++cold pages because of the overestimation, it retries on the next ++server according to the ranking result obtained from the working set ++estimation step. This less forceful approach limits the impacts on the ++existing jobs. +diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst +index 575ccd40e30cfa..4aa12b8be278d3 100644 +--- a/Documentation/mm/index.rst ++++ b/Documentation/mm/index.rst +@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. + ksm + memory-model + mmu_notifier ++ multigen_lru + numa + overcommit-accounting + page_migration +diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst +new file mode 100644 +index 00000000000000..d7062c6a894646 +--- /dev/null ++++ b/Documentation/mm/multigen_lru.rst +@@ -0,0 +1,159 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++============= ++Multi-Gen LRU ++============= ++The multi-gen LRU is an alternative LRU implementation that optimizes ++page reclaim and improves performance under memory pressure. Page ++reclaim decides the kernel's caching policy and ability to overcommit ++memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++ ++Design overview ++=============== ++Objectives ++---------- ++The design objectives are: ++ ++* Good representation of access recency ++* Try to profit from spatial locality ++* Fast paths to make obvious choices ++* Simple self-correcting heuristics ++ ++The representation of access recency is at the core of all LRU ++implementations. In the multi-gen LRU, each generation represents a ++group of pages with similar access recency. Generations establish a ++(time-based) common frame of reference and therefore help make better ++choices, e.g., between different memcgs on a computer or different ++computers in a data center (for job scheduling). 
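The generation bookkeeping that underpins this common frame of reference can be modeled in a few lines. This is a toy for intuition only, assuming ``MIN_NR_GENS=2`` and ``MAX_NR_GENS=4`` as in the minimal implementation; the kernel's real structures live in ``mm/vmscan.c`` and ``include/linux/mmzone.h``:

```python
# Toy model: generations as a sliding window of lists per lruvec.
from collections import deque

MIN_NR_GENS, MAX_NR_GENS = 2, 4

class Lruvec:
    def __init__(self):
        # min_seq and max_seq only ever increase; their distance stays
        # within [MIN_NR_GENS, MAX_NR_GENS].
        self.min_seq, self.max_seq = 0, MIN_NR_GENS - 1
        # one LRU list per truncated generation number
        self.lists = [deque() for _ in range(MAX_NR_GENS)]

    def add_page(self, page):
        # new pages start in the youngest generation
        self.lists[self.max_seq % MAX_NR_GENS].append(page)

    def age(self):
        # aging: open a new, younger generation while the window allows it
        if self.max_seq - self.min_seq + 1 < MAX_NR_GENS:
            self.max_seq += 1

    def evict_one(self):
        # eviction: consume the oldest generation; advance min_seq once
        # its list empties, keeping at least MIN_NR_GENS generations
        oldest = self.lists[self.min_seq % MAX_NR_GENS]
        page = oldest.popleft() if oldest else None
        if not oldest and self.max_seq - self.min_seq + 1 > MIN_NR_GENS:
            self.min_seq += 1
        return page
```

The point of the sketch is the producer-consumer shape: aging advances ``max_seq``, eviction advances ``min_seq``, and a page's age is simply the distance of its generation from ``max_seq``.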
++ ++Exploiting spatial locality improves efficiency when gathering the ++accessed bit. A rmap walk targets a single page and does not try to ++profit from discovering a young PTE. A page table walk can sweep all ++the young PTEs in an address space, but the address space can be too ++sparse to make a profit. The key is to optimize both methods and use ++them in combination. ++ ++Fast paths reduce code complexity and runtime overhead. Unmapped pages ++do not require TLB flushes; clean pages do not require writeback. ++These facts are only helpful when other conditions, e.g., access ++recency, are similar. With generations as a common frame of reference, ++additional factors stand out. But obvious choices might not be good ++choices; thus self-correction is necessary. ++ ++The benefits of simple self-correcting heuristics are self-evident. ++Again, with generations as a common frame of reference, this becomes ++attainable. Specifically, pages in the same generation can be ++categorized based on additional factors, and a feedback loop can ++statistically compare the refault percentages across those categories ++and infer which of them are better choices. ++ ++Assumptions ++----------- ++The protection of hot pages and the selection of cold pages are based ++on page access channels and patterns. There are two access channels: ++ ++* Accesses through page tables ++* Accesses through file descriptors ++ ++The protection of the former channel is by design stronger because: ++ ++1. The uncertainty in determining the access patterns of the former ++ channel is higher due to the approximation of the accessed bit. ++2. The cost of evicting the former channel is higher due to the TLB ++ flushes required and the likelihood of encountering the dirty bit. ++3. The penalty of underprotecting the former channel is higher because ++ applications usually do not prepare themselves for major page ++ faults like they do for blocked I/O. 
E.g., GUI applications ++ commonly use dedicated I/O threads to avoid blocking rendering ++ threads. ++ ++There are also two access patterns: ++ ++* Accesses exhibiting temporal locality ++* Accesses not exhibiting temporal locality ++ ++For the reasons listed above, the former channel is assumed to follow ++the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is ++present, and the latter channel is assumed to follow the latter ++pattern unless outlying refaults have been observed. ++ ++Workflow overview ++================= ++Evictable pages are divided into multiple generations for each ++``lruvec``. The youngest generation number is stored in ++``lrugen->max_seq`` for both anon and file types as they are aged on ++an equal footing. The oldest generation numbers are stored in ++``lrugen->min_seq[]`` separately for anon and file types as clean file ++pages can be evicted regardless of swap constraints. These three ++variables are monotonically increasing. ++ ++Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` ++bits in order to fit into the gen counter in ``folio->flags``. Each ++truncated generation number is an index to ``lrugen->lists[]``. The ++sliding window technique is used to track at least ``MIN_NR_GENS`` and ++at most ``MAX_NR_GENS`` generations. The gen counter stores a value ++within ``[1, MAX_NR_GENS]`` while a page is on one of ++``lrugen->lists[]``; otherwise it stores zero. ++ ++Each generation is divided into multiple tiers. A page accessed ``N`` ++times through file descriptors is in tier ``order_base_2(N)``. Unlike ++generations, tiers do not have dedicated ``lrugen->lists[]``. In ++contrast to moving across generations, which requires the LRU lock, ++moving across tiers only involves atomic operations on ++``folio->flags`` and therefore has a negligible cost. 
A feedback loop ++modeled after the PID controller monitors refaults over all the tiers ++from anon and file types and decides which tiers from which types to ++evict or protect. ++ ++There are two conceptually independent procedures: the aging and the ++eviction. They form a closed-loop system, i.e., the page reclaim. ++ ++Aging ++----- ++The aging produces young generations. Given an ``lruvec``, it ++increments ``max_seq`` when ``max_seq-min_seq+1`` approaches ++``MIN_NR_GENS``. The aging promotes hot pages to the youngest ++generation when it finds them accessed through page tables; the ++demotion of cold pages happens consequently when it increments ++``max_seq``. The aging uses page table walks and rmap walks to find ++young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` ++and calls ``walk_page_range()`` with each ``mm_struct`` on this list ++to scan PTEs, and after each iteration, it increments ``max_seq``. For ++the latter, when the eviction walks the rmap and finds a young PTE, ++the aging scans the adjacent PTEs. For both, on finding a young PTE, ++the aging clears the accessed bit and updates the gen counter of the ++page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. ++ ++Eviction ++-------- ++The eviction consumes old generations. Given an ``lruvec``, it ++increments ``min_seq`` when ``lrugen->lists[]`` indexed by ++``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to ++evict from, it first compares ``min_seq[]`` to select the older type. ++If both types are equally old, it selects the one whose first tier has ++a lower refault percentage. The first tier contains single-use ++unmapped clean pages, which are the best bet. The eviction sorts a ++page according to its gen counter if the aging has found this page ++accessed through page tables and updated its gen counter. 
It also ++moves a page to the next generation, i.e., ``min_seq+1``, if this page ++was accessed multiple times through file descriptors and the feedback ++loop has detected outlying refaults from the tier this page is in. To ++this end, the feedback loop uses the first tier as the baseline, for ++the reason stated earlier. ++ ++Summary ++------- ++The multi-gen LRU can be disassembled into the following parts: ++ ++* Generations ++* Rmap walks ++* Page table walks ++* Bloom filters ++* PID controller ++ ++The aging and the eviction form a producer-consumer model; ++specifically, the latter drives the former by the sliding window over ++generations. Within the aging, rmap walks drive page table walks by ++inserting hot densely populated page tables to the Bloom filters. ++Within the eviction, the PID controller uses refaults as the feedback ++to select types to evict and tiers to protect. +diff --git a/arch/Kconfig b/arch/Kconfig +index 8b311e400ec140..bf19a84fffa21b 100644 +--- a/arch/Kconfig ++++ b/arch/Kconfig +@@ -1418,6 +1418,14 @@ config DYNAMIC_SIGFRAME + config HAVE_ARCH_NODE_DEV_GROUP + bool + ++config ARCH_HAS_NONLEAF_PMD_YOUNG ++ bool ++ help ++ Architectures that select this option are capable of setting the ++ accessed bit in non-leaf PMD entries when using them as part of linear ++ address translations. Page table walkers that clear the accessed bit ++ may use this capability to reduce their search space. 
++ + source "kernel/gcov/Kconfig" + + source "scripts/gcc-plugins/Kconfig" diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h -index b5df82aa99e6..71a1af42f0e8 100644 +index b5df82aa99e64b..71a1af42f0e897 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1082,24 +1082,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, @@ -413,7 +391,7 @@ index b5df82aa99e6..71a1af42f0e8 100644 -} -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young cpu_has_hw_af - + /* * Experimentally, it's cheap to set the access flag in hardware and we * benefit from prefaulting mappings as 'old' to start with. @@ -424,173 +402,11 @@ index b5df82aa99e6..71a1af42f0e8 100644 -} -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte +#define arch_wants_old_prefaulted_pte cpu_has_hw_af - + static inline bool pud_sect_supported(void) { -diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h -index 44e2d6f1dbaa..dc5f7d8ef68a 100644 ---- a/arch/x86/include/asm/pgtable.h -+++ b/arch/x86/include/asm/pgtable.h -@@ -1431,10 +1431,10 @@ static inline bool arch_has_pfn_modify_check(void) - return boot_cpu_has_bug(X86_BUG_L1TF); - } - --#define arch_faults_on_old_pte arch_faults_on_old_pte --static inline bool arch_faults_on_old_pte(void) -+#define arch_has_hw_pte_young arch_has_hw_pte_young -+static inline bool arch_has_hw_pte_young(void) - { -- return false; -+ return true; - } - - #ifdef CONFIG_PAGE_TABLE_CHECK -diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h -index d13b4f7cc5be..375e8e7e64f4 100644 ---- a/include/linux/pgtable.h -+++ b/include/linux/pgtable.h -@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, - #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - #endif - -+#ifndef arch_has_hw_pte_young -+/* -+ * Return whether the accessed bit is supported on the local CPU. 
-+ * -+ * This stub assumes accessing through an old PTE triggers a page fault. -+ * Architectures that automatically set the access bit should overwrite it. -+ */ -+static inline bool arch_has_hw_pte_young(void) -+{ -+ return false; -+} -+#endif -+ - #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR - static inline pte_t ptep_get_and_clear(struct mm_struct *mm, - unsigned long address, -diff --git a/mm/memory.c b/mm/memory.c -index e38f9245470c..3a9b00c765c2 100644 ---- a/mm/memory.c -+++ b/mm/memory.c -@@ -126,18 +126,6 @@ int randomize_va_space __read_mostly = - 2; - #endif - --#ifndef arch_faults_on_old_pte --static inline bool arch_faults_on_old_pte(void) --{ -- /* -- * Those arches which don't have hw access flag feature need to -- * implement their own helper. By default, "true" means pagefault -- * will be hit on old pte. -- */ -- return true; --} --#endif -- - #ifndef arch_wants_old_prefaulted_pte - static inline bool arch_wants_old_prefaulted_pte(void) - { -@@ -2871,7 +2859,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, - * On architectures with software "accessed" bits, we would - * take a double page fault, so mark it accessed here. 
- */ -- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { -+ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { - pte_t entry; - - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao -@ 2022-09-18 7:59 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao - ` (12 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 7:59 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Some architectures support the accessed bit in non-leaf PMD entries, -e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it -as part of linear address translation [1]. Page table walkers that -clear the accessed bit may use this capability to reduce their search -space. - -Note that: -1. Although an inline function is preferable, this capability is added - as a configuration option for consistency with the existing macros. -2. Due to the little interest in other varieties, this capability was - only tested on Intel and AMD CPUs. - -Thanks to the following developers for their efforts [2][3]. 
- Randy Dunlap - Stephen Rothwell - -[1]: Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3 (June 2021), section 4.8 -[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/ -[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/ - -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - arch/Kconfig | 8 ++++++++ - arch/x86/Kconfig | 1 + - arch/x86/include/asm/pgtable.h | 3 ++- - arch/x86/mm/pgtable.c | 5 ++++- - include/linux/pgtable.h | 4 ++-- - 5 files changed, 17 insertions(+), 4 deletions(-) - -diff --git a/arch/Kconfig b/arch/Kconfig -index 5dbf11a5ba4e..1c2599618eeb 100644 ---- a/arch/Kconfig -+++ b/arch/Kconfig -@@ -1415,6 +1415,14 @@ config DYNAMIC_SIGFRAME - config HAVE_ARCH_NODE_DEV_GROUP - bool - -+config ARCH_HAS_NONLEAF_PMD_YOUNG -+ bool -+ help -+ Architectures that select this option are capable of setting the -+ accessed bit in non-leaf PMD entries when using them as part of linear -+ address translations. Page table walkers that clear the accessed bit -+ may use this capability to reduce their search space. 
-+ - source "kernel/gcov/Kconfig" - - source "scripts/gcc-plugins/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig -index f9920f1341c8..674d694a665e 100644 +index f9920f1341c8d4..674d694a665ef5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -85,6 +85,7 @@ config X86 @@ -602,34 +418,48 @@ index f9920f1341c8..674d694a665e 100644 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h -index dc5f7d8ef68a..5059799bebe3 100644 +index 44e2d6f1dbaa87..5059799bebe36d 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -815,7 +815,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) - + static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } - + static inline unsigned long pages_to_mb(unsigned long npg) +@@ -1431,10 +1432,10 @@ static inline bool arch_has_pfn_modify_check(void) + return boot_cpu_has_bug(X86_BUG_L1TF); + } + +-#define arch_faults_on_old_pte arch_faults_on_old_pte +-static inline bool arch_faults_on_old_pte(void) ++#define arch_has_hw_pte_young arch_has_hw_pte_young ++static inline bool arch_has_hw_pte_young(void) + { +- return false; ++ return true; + } + + #ifdef CONFIG_PAGE_TABLE_CHECK diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c -index a932d7712d85..8525f2876fb4 100644 +index a932d7712d851d..8525f2876fb409 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } - + -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, - + return ret; 
} +#endif @@ -638,367 +468,189 @@ index a932d7712d85..8525f2876fb4 100644 int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { -diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h -index 375e8e7e64f4..a108b60a6962 100644 ---- a/include/linux/pgtable.h -+++ b/include/linux/pgtable.h -@@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, - #endif - - #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG --#ifdef CONFIG_TRANSPARENT_HUGEPAGE -+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) - static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, - unsigned long address, - pmd_t *pmdp) -@@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, - BUILD_BUG(); - return 0; +diff --git a/fs/exec.c b/fs/exec.c +index d046dbb9cbd083..c67b12f0f577fe 100644 +--- a/fs/exec.c ++++ b/fs/exec.c +@@ -1011,6 +1011,7 @@ static int exec_mmap(struct mm_struct *mm) + active_mm = tsk->active_mm; + tsk->active_mm = mm; + tsk->mm = mm; ++ lru_gen_add_mm(mm); + /* + * This prevents preemption while active_mm is being loaded and + * it and mm are being updated, which could cause problems for +@@ -1026,6 +1027,7 @@ static int exec_mmap(struct mm_struct *mm) + tsk->mm->vmacache_seqnum = 0; + vmacache_flush(tsk); + task_unlock(tsk); ++ lru_gen_use_mm(mm); + if (old_mm) { + mmap_read_unlock(old_mm); + BUG_ON(active_mm != old_mm); +diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c +index 51897427a5346e..b4a6e0a1b945aa 100644 +--- a/fs/fuse/dev.c ++++ b/fs/fuse/dev.c +@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page) + 1 << PG_active | + 1 << PG_workingset | + 1 << PG_reclaim | +- 1 << PG_waiters))) { ++ 1 << PG_waiters | ++ LRU_GEN_MASK | LRU_REFS_MASK))) { + dump_page(page, "fuse: trying to steal weird page"); + return 1; + } +diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h +index 
ac5d0515680eae..9179463c3c9f82 100644 +--- a/include/linux/cgroup.h ++++ b/include/linux/cgroup.h +@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) + css_put(&cgrp->self); } --#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ - #endif - - #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao - 2022-09-18 7:59 ` [PATCH mm-unstable v15 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao - ` (11 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Miaohe Lin, Brian Geffon, Jan Alexander Steffens, - Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, - Daniel Byrne, Donald Carr, Holger Hoffstätte, - Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain - -This patch refactors shrink_node() to improve readability for the -upcoming changes to mm/vmscan.c. 
- -Signed-off-by: Yu Zhao -Reviewed-by: Barry Song -Reviewed-by: Miaohe Lin -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - mm/vmscan.c | 198 +++++++++++++++++++++++++++------------------------- - 1 file changed, 104 insertions(+), 94 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 992ba6a0bf10..0869cee13a90 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -2728,6 +2728,109 @@ enum scan_balance { - SCAN_FILE, - }; - -+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) + ++extern struct mutex cgroup_mutex; ++ ++static inline void cgroup_lock(void) +{ -+ unsigned long file; -+ struct lruvec *target_lruvec; -+ -+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); -+ -+ /* -+ * Flush the memory cgroup stats, so that we read accurate per-memcg -+ * lruvec stats for heuristics. -+ */ -+ mem_cgroup_flush_stats(); -+ -+ /* -+ * Determine the scan balance between anon and file LRUs. -+ */ -+ spin_lock_irq(&target_lruvec->lru_lock); -+ sc->anon_cost = target_lruvec->anon_cost; -+ sc->file_cost = target_lruvec->file_cost; -+ spin_unlock_irq(&target_lruvec->lru_lock); -+ -+ /* -+ * Target desirable inactive:active list ratios for the anon -+ * and file LRU lists. -+ */ -+ if (!sc->force_deactivate) { -+ unsigned long refaults; -+ -+ /* -+ * When refaults are being observed, it means a new -+ * workingset is being established. Deactivate to get -+ * rid of any stale active pages quickly. 
-+ */ -+ refaults = lruvec_page_state(target_lruvec, -+ WORKINGSET_ACTIVATE_ANON); -+ if (refaults != target_lruvec->refaults[0] || -+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) -+ sc->may_deactivate |= DEACTIVATE_ANON; -+ else -+ sc->may_deactivate &= ~DEACTIVATE_ANON; -+ -+ refaults = lruvec_page_state(target_lruvec, -+ WORKINGSET_ACTIVATE_FILE); -+ if (refaults != target_lruvec->refaults[1] || -+ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) -+ sc->may_deactivate |= DEACTIVATE_FILE; -+ else -+ sc->may_deactivate &= ~DEACTIVATE_FILE; -+ } else -+ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; -+ -+ /* -+ * If we have plenty of inactive file pages that aren't -+ * thrashing, try to reclaim those first before touching -+ * anonymous pages. -+ */ -+ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); -+ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) -+ sc->cache_trim_mode = 1; -+ else -+ sc->cache_trim_mode = 0; -+ -+ /* -+ * Prevent the reclaimer from falling into the cache trap: as -+ * cache pages start out inactive, every cache fault will tip -+ * the scan balance towards the file LRU. And as the file LRU -+ * shrinks, so does the window for rotation from references. -+ * This means we have a runaway feedback loop where a tiny -+ * thrashing file LRU becomes infinitely more attractive than -+ * anon pages. Try to detect this based on file LRU size. 
-+ */ -+ if (!cgroup_reclaim(sc)) { -+ unsigned long total_high_wmark = 0; -+ unsigned long free, anon; -+ int z; -+ -+ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); -+ file = node_page_state(pgdat, NR_ACTIVE_FILE) + -+ node_page_state(pgdat, NR_INACTIVE_FILE); -+ -+ for (z = 0; z < MAX_NR_ZONES; z++) { -+ struct zone *zone = &pgdat->node_zones[z]; -+ -+ if (!managed_zone(zone)) -+ continue; -+ -+ total_high_wmark += high_wmark_pages(zone); -+ } -+ -+ /* -+ * Consider anon: if that's low too, this isn't a -+ * runaway file reclaim problem, but rather just -+ * extreme pressure. Reclaim as per usual then. -+ */ -+ anon = node_page_state(pgdat, NR_INACTIVE_ANON); -+ -+ sc->file_is_tiny = -+ file + free <= total_high_wmark && -+ !(sc->may_deactivate & DEACTIVATE_ANON) && -+ anon >> sc->priority; -+ } ++ mutex_lock(&cgroup_mutex); +} + - /* - * Determine how aggressively the anon and file LRU lists should be - * scanned. -@@ -3195,109 +3298,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) - unsigned long nr_reclaimed, nr_scanned; - struct lruvec *target_lruvec; - bool reclaimable = false; -- unsigned long file; - - target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); - - again: -- /* -- * Flush the memory cgroup stats, so that we read accurate per-memcg -- * lruvec stats for heuristics. -- */ -- mem_cgroup_flush_stats(); -- - memset(&sc->nr, 0, sizeof(sc->nr)); - - nr_reclaimed = sc->nr_reclaimed; - nr_scanned = sc->nr_scanned; - -- /* -- * Determine the scan balance between anon and file LRUs. -- */ -- spin_lock_irq(&target_lruvec->lru_lock); -- sc->anon_cost = target_lruvec->anon_cost; -- sc->file_cost = target_lruvec->file_cost; -- spin_unlock_irq(&target_lruvec->lru_lock); -- -- /* -- * Target desirable inactive:active list ratios for the anon -- * and file LRU lists. 
-- */ -- if (!sc->force_deactivate) { -- unsigned long refaults; -- -- refaults = lruvec_page_state(target_lruvec, -- WORKINGSET_ACTIVATE_ANON); -- if (refaults != target_lruvec->refaults[0] || -- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) -- sc->may_deactivate |= DEACTIVATE_ANON; -- else -- sc->may_deactivate &= ~DEACTIVATE_ANON; -- -- /* -- * When refaults are being observed, it means a new -- * workingset is being established. Deactivate to get -- * rid of any stale active pages quickly. -- */ -- refaults = lruvec_page_state(target_lruvec, -- WORKINGSET_ACTIVATE_FILE); -- if (refaults != target_lruvec->refaults[1] || -- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) -- sc->may_deactivate |= DEACTIVATE_FILE; -- else -- sc->may_deactivate &= ~DEACTIVATE_FILE; -- } else -- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; -- -- /* -- * If we have plenty of inactive file pages that aren't -- * thrashing, try to reclaim those first before touching -- * anonymous pages. -- */ -- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); -- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) -- sc->cache_trim_mode = 1; -- else -- sc->cache_trim_mode = 0; -- -- /* -- * Prevent the reclaimer from falling into the cache trap: as -- * cache pages start out inactive, every cache fault will tip -- * the scan balance towards the file LRU. And as the file LRU -- * shrinks, so does the window for rotation from references. -- * This means we have a runaway feedback loop where a tiny -- * thrashing file LRU becomes infinitely more attractive than -- * anon pages. Try to detect this based on file LRU size. 
-- */ -- if (!cgroup_reclaim(sc)) { -- unsigned long total_high_wmark = 0; -- unsigned long free, anon; -- int z; -- -- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); -- file = node_page_state(pgdat, NR_ACTIVE_FILE) + -- node_page_state(pgdat, NR_INACTIVE_FILE); -- -- for (z = 0; z < MAX_NR_ZONES; z++) { -- struct zone *zone = &pgdat->node_zones[z]; -- if (!managed_zone(zone)) -- continue; -- -- total_high_wmark += high_wmark_pages(zone); -- } -- -- /* -- * Consider anon: if that's low too, this isn't a -- * runaway file reclaim problem, but rather just -- * extreme pressure. Reclaim as per usual then. -- */ -- anon = node_page_state(pgdat, NR_INACTIVE_ANON); -- -- sc->file_is_tiny = -- file + free <= total_high_wmark && -- !(sc->may_deactivate & DEACTIVATE_ANON) && -- anon >> sc->priority; -- } -+ prepare_scan_count(pgdat, sc); - - shrink_node_memcgs(pgdat, sc); - --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (2 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork Yu Zhao - ` (10 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Miaohe Lin, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -This patch undoes the following refactor: -commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller") - -The upcoming changes to include/linux/mm_inline.h will reuse -__update_lru_size(). 
- -Signed-off-by: Yu Zhao -Reviewed-by: Miaohe Lin -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mm_inline.h | 9 ++++++++- - 1 file changed, 8 insertions(+), 1 deletion(-) - ++static inline void cgroup_unlock(void) ++{ ++ mutex_unlock(&cgroup_mutex); ++} ++ + /** + * task_css_set_check - obtain a task's css_set with extra access conditions + * @task: the task to obtain css_set for +@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) + * as locks used during the cgroup_subsys::attach() methods. + */ + #ifdef CONFIG_PROVE_RCU +-extern struct mutex cgroup_mutex; + extern spinlock_t css_set_lock; + #define task_css_set_check(task, __c) \ + rcu_dereference_check((task)->cgroups, \ +@@ -708,6 +719,8 @@ struct cgroup; + static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } + static inline void css_get(struct cgroup_subsys_state *css) {} + static inline void css_put(struct cgroup_subsys_state *css) {} ++static inline void cgroup_lock(void) {} ++static inline void cgroup_unlock(void) {} + static inline int cgroup_attach_task_all(struct task_struct *from, + struct task_struct *t) { return 0; } + static inline int cgroupstats_build(struct cgroupstats *stats, +diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h +index 567f12323f553e..877cbcbc6ed98a 100644 +--- a/include/linux/memcontrol.h ++++ b/include/linux/memcontrol.h +@@ -350,6 +350,11 @@ struct mem_cgroup { + struct deferred_split deferred_split_queue; + #endif + ++#ifdef CONFIG_LRU_GEN ++ /* per-memcg mm_struct list */ ++ struct lru_gen_mm_list mm_list; ++#endif ++ + struct mem_cgroup_per_node *nodeinfo[]; + }; + +@@ -444,6 +449,7 @@ static inline struct 
obj_cgroup *__folio_objcg(struct folio *folio)
+ * - LRU isolation
+ * - lock_page_memcg()
+ * - exclusive reference
++ * - mem_cgroup_trylock_pages()
+ *
+ * For a kmem folio a caller should hold an rcu read lock to protect memcg
+ * associated with a kmem folio from being released.
+@@ -505,6 +511,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
+ * - LRU isolation
+ * - lock_page_memcg()
+ * - exclusive reference
++ * - mem_cgroup_trylock_pages()
+ *
+ * For a kmem page a caller should hold an rcu read lock to protect memcg
+ * associated with a kmem page from being released.
+@@ -959,6 +966,23 @@ void unlock_page_memcg(struct page *page);
+
+ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+
++/* try to stabilize folio_memcg() for all the pages in a memcg */
++static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
++{
++ rcu_read_lock();
++
++ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
++ return true;
++
++ rcu_read_unlock();
++ return false;
++}
++
++static inline void mem_cgroup_unlock_pages(void)
++{
++ rcu_read_unlock();
++}
++
+ /* idx can be of type enum memcg_stat_item or node_stat_item */
+ static inline void mod_memcg_state(struct mem_cgroup *memcg,
+ int idx, int val)
+@@ -1433,6 +1457,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
+ {
+ }
+
++static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
++{
++ /* to match folio_memcg_rcu() */
++ rcu_read_lock();
++ return true;
++}
++
++static inline void mem_cgroup_unlock_pages(void)
++{
++ rcu_read_unlock();
++}
++
+ static inline void mem_cgroup_handle_over_high(void)
+ {
+ }
+diff --git a/include/linux/mm.h b/include/linux/mm.h
+index 21f8b27bd9fd30..88976a521ef546 100644
+--- a/include/linux/mm.h
++++ b/include/linux/mm.h
+@@ -1465,6 +1465,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
+ return page_to_pfn(&folio->page);
+ }
+
++static inline struct folio 
*pfn_folio(unsigned long pfn) ++{ ++ return page_folio(pfn_to_page(pfn)); ++} ++ + static inline atomic_t *folio_pincount_ptr(struct folio *folio) + { + return &folio_page(folio, 1)->compound_pincount; diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index 7b25b53c474a..fb8aadb81cd6 100644 +index 7b25b53c474a7f..4949eda9a9a2ab 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h -@@ -34,7 +34,7 @@ static inline int page_is_file_lru(struct page *page) +@@ -34,15 +34,25 @@ static inline int page_is_file_lru(struct page *page) return folio_is_file_lru(page_folio(page)); } - + -static __always_inline void update_lru_size(struct lruvec *lruvec, +static __always_inline void __update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, long nr_pages) { -@@ -43,6 +43,13 @@ static __always_inline void update_lru_size(struct lruvec *lruvec, + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + ++ lockdep_assert_held(&lruvec->lru_lock); ++ WARN_ON_ONCE(nr_pages != (int)nr_pages); ++ __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); @@ -1012,164 +664,27 @@ index 7b25b53c474a..fb8aadb81cd6 100644 #ifdef CONFIG_MEMCG mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages); #endif --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (3 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao - ` (9 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Evictable pages are divided into multiple generations for each lruvec. -The youngest generation number is stored in lrugen->max_seq for both -anon and file types as they are aged on an equal footing. The oldest -generation numbers are stored in lrugen->min_seq[] separately for anon -and file types as clean file pages can be evicted regardless of swap -constraints. These three variables are monotonically increasing. - -Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits -in order to fit into the gen counter in folio->flags. Each truncated -generation number is an index to lrugen->lists[]. The sliding window -technique is used to track at least MIN_NR_GENS and at most -MAX_NR_GENS generations. The gen counter stores a value within [1, -MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it -stores 0. - -There are two conceptually independent procedures: "the aging", which -produces young generations, and "the eviction", which consumes old -generations. 
They form a closed-loop system, i.e., "the page reclaim". -Both procedures can be invoked from userspace for the purposes of -working set estimation and proactive reclaim. These techniques are -commonly used to optimize job scheduling (bin packing) in data -centers [1][2]. - -To avoid confusion, the terms "hot" and "cold" will be applied to the -multi-gen LRU, as a new convention; the terms "active" and "inactive" -will be applied to the active/inactive LRU, as usual. - -The protection of hot pages and the selection of cold pages are based -on page access channels and patterns. There are two access channels: -one through page tables and the other through file descriptors. The -protection of the former channel is by design stronger because: -1. The uncertainty in determining the access patterns of the former - channel is higher due to the approximation of the accessed bit. -2. The cost of evicting the former channel is higher due to the TLB - flushes required and the likelihood of encountering the dirty bit. -3. The penalty of underprotecting the former channel is higher because - applications usually do not prepare themselves for major page - faults like they do for blocked I/O. E.g., GUI applications - commonly use dedicated I/O threads to avoid blocking rendering - threads. -There are also two access patterns: one with temporal locality and the -other without. For the reasons listed above, the former channel is -assumed to follow the former pattern unless VM_SEQ_READ or -VM_RAND_READ is present; the latter channel is assumed to follow the -latter pattern unless outlying refaults have been observed [3][4]. - -The next patch will address the "outlying refaults". Three macros, -i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are -added in this patch to make the entire patchset less diffy. - -A page is added to the youngest generation on faulting. 
The aging -needs to check the accessed bit at least twice before handing this -page over to the eviction. The first check takes care of the accessed -bit set on the initial fault; the second check makes sure this page -has not been used since then. This protocol, AKA second chance, -requires a minimum of two generations, hence MIN_NR_GENS. - -[1] https://dl.acm.org/doi/10.1145/3297858.3304053 -[2] https://dl.acm.org/doi/10.1145/3503222.3507731 -[3] https://lwn.net/Articles/495543/ -[4] https://lwn.net/Articles/815342/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - fs/fuse/dev.c | 3 +- - include/linux/mm_inline.h | 175 ++++++++++++++++++++++++++++++ - include/linux/mmzone.h | 102 +++++++++++++++++ - include/linux/page-flags-layout.h | 13 ++- - include/linux/page-flags.h | 4 +- - include/linux/sched.h | 4 + - kernel/bounds.c | 5 + - mm/Kconfig | 8 ++ - mm/huge_memory.c | 3 +- - mm/memcontrol.c | 2 + - mm/memory.c | 25 +++++ - mm/mm_init.c | 6 +- - mm/mmzone.c | 2 + - mm/swap.c | 11 +- - mm/vmscan.c | 75 +++++++++++++ - 15 files changed, 424 insertions(+), 14 deletions(-) - -diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c -index 51897427a534..b4a6e0a1b945 100644 ---- a/fs/fuse/dev.c -+++ b/fs/fuse/dev.c -@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page) - 1 << PG_active | - 1 << PG_workingset | - 1 << PG_reclaim | -- 1 << PG_waiters))) { -+ 1 << PG_waiters | -+ LRU_GEN_MASK | LRU_REFS_MASK))) { - dump_page(page, "fuse: trying to steal weird page"); - return 1; - } -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index fb8aadb81cd6..2ff703900fd0 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ 
-40,6 +40,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, - { - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - -+ lockdep_assert_held(&lruvec->lru_lock); -+ WARN_ON_ONCE(nr_pages != (int)nr_pages); -+ - __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); - __mod_zone_page_state(&pgdat->node_zones[zid], - NR_ZONE_LRU_BASE + lru, nr_pages); -@@ -101,11 +104,177 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) +@@ -94,11 +104,224 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) return lru; } - + +#ifdef CONFIG_LRU_GEN + ++#ifdef CONFIG_LRU_GEN_ENABLED +static inline bool lru_gen_enabled(void) +{ -+ return true; ++ DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); ++ ++ return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); +} ++#else ++static inline bool lru_gen_enabled(void) ++{ ++ DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); ++ ++ return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); ++} ++#endif + +static inline bool lru_gen_in_fault(void) +{ @@ -1181,6 +696,33 @@ index fb8aadb81cd6..2ff703900fd0 100644 + return seq % MAX_NR_GENS; +} + ++static inline int lru_hist_from_seq(unsigned long seq) ++{ ++ return seq % NR_HIST_GENS; ++} ++ ++static inline int lru_tier_from_refs(int refs) ++{ ++ VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); ++ ++ /* see the comment in folio_lru_refs() */ ++ return order_base_2(refs + 1); ++} ++ ++static inline int folio_lru_refs(struct folio *folio) ++{ ++ unsigned long flags = READ_ONCE(folio->flags); ++ bool workingset = flags & BIT(PG_workingset); ++ ++ /* ++ * Return the number of accesses beyond PG_referenced, i.e., N-1 if the ++ * total number of accesses is N>1, since N=0,1 both map to the first ++ * tier. lru_tier_from_refs() will account for this off-by-one. Also see ++ * the comment on MAX_NR_TIERS. 
++ */ ++ return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; ++} ++ +static inline int folio_lru_gen(struct folio *folio) +{ + unsigned long flags = READ_ONCE(folio->flags); @@ -1233,6 +775,15 @@ index fb8aadb81cd6..2ff703900fd0 100644 + __update_lru_size(lruvec, lru, zone, -delta); + return; + } ++ ++ /* promotion */ ++ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { ++ __update_lru_size(lruvec, lru, zone, -delta); ++ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); ++ } ++ ++ /* demotion requires isolation, e.g., lru_deactivate_fn() */ ++ VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); +} + +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) @@ -1246,7 +797,7 @@ index fb8aadb81cd6..2ff703900fd0 100644 + + VM_WARN_ON_ONCE_FOLIO(gen != -1, folio); + -+ if (folio_test_unevictable(folio)) ++ if (folio_test_unevictable(folio) || !lrugen->enabled) + return false; + /* + * There are three common cases for this page: @@ -1331,2665 +882,35 @@ index fb8aadb81cd6..2ff703900fd0 100644 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_add_folio(lruvec, folio, false)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); if (lru != LRU_UNEVICTABLE) -@@ -123,6 +292,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) +@@ -116,6 +339,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_add_folio(lruvec, folio, true)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); /* This is not expected to be used on LRU_UNEVICTABLE */ -@@ -140,6 +312,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) +@@ -133,6 +359,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct 
folio *folio) { enum lru_list lru = folio_lru_list(folio); - + + if (lru_gen_del_folio(lruvec, folio, false)) + return; + if (lru != LRU_UNEVICTABLE) list_del(&folio->lru); update_lru_size(lruvec, lru, folio_zonenum(folio), -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 18cf0fc5ce67..6f4ea078d90f 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -317,6 +317,102 @@ enum lruvec_flags { - */ - }; - -+#endif /* !__GENERATING_BOUNDS_H */ -+ -+/* -+ * Evictable pages are divided into multiple generations. The youngest and the -+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing. -+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An -+ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the -+ * corresponding generation. The gen counter in folio->flags stores gen+1 while -+ * a page is on one of lrugen->lists[]. Otherwise it stores 0. -+ * -+ * A page is added to the youngest generation on faulting. The aging needs to -+ * check the accessed bit at least twice before handing this page over to the -+ * eviction. The first check takes care of the accessed bit set on the initial -+ * fault; the second check makes sure this page hasn't been used since then. -+ * This process, AKA second chance, requires a minimum of two generations, -+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive -+ * LRU, e.g., /proc/vmstat, these two generations are considered active; the -+ * rest of generations, if they exist, are considered inactive. See -+ * lru_gen_is_active(). -+ * -+ * PG_active is always cleared while a page is on one of lrugen->lists[] so that -+ * the aging needs not to worry about it. And it's set again when a page -+ * considered active is isolated for non-reclaiming purposes, e.g., migration. -+ * See lru_gen_add_folio() and lru_gen_del_folio(). 
-+ * -+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the -+ * number of categories of the active/inactive LRU when keeping track of -+ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits -+ * in folio->flags. -+ */ -+#define MIN_NR_GENS 2U -+#define MAX_NR_GENS 4U -+ -+#ifndef __GENERATING_BOUNDS_H -+ -+struct lruvec; -+ -+#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) -+#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) -+ -+#ifdef CONFIG_LRU_GEN -+ -+enum { -+ LRU_GEN_ANON, -+ LRU_GEN_FILE, -+}; -+ -+/* -+ * The youngest generation number is stored in max_seq for both anon and file -+ * types as they are aged on an equal footing. The oldest generation numbers are -+ * stored in min_seq[] separately for anon and file types as clean file pages -+ * can be evicted regardless of swap constraints. -+ * -+ * Normally anon and file min_seq are in sync. But if swapping is constrained, -+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon -+ * min_seq behind. -+ * -+ * The number of pages in each generation is eventually consistent and therefore -+ * can be transiently negative. 
-+ */ -+struct lru_gen_struct { -+ /* the aging increments the youngest generation number */ -+ unsigned long max_seq; -+ /* the eviction increments the oldest generation numbers */ -+ unsigned long min_seq[ANON_AND_FILE]; -+ /* the multi-gen LRU lists, lazily sorted on eviction */ -+ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+ /* the multi-gen LRU sizes, eventually consistent */ -+ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+}; -+ -+void lru_gen_init_lruvec(struct lruvec *lruvec); -+ -+#ifdef CONFIG_MEMCG -+void lru_gen_init_memcg(struct mem_cgroup *memcg); -+void lru_gen_exit_memcg(struct mem_cgroup *memcg); -+#endif -+ -+#else /* !CONFIG_LRU_GEN */ -+ -+static inline void lru_gen_init_lruvec(struct lruvec *lruvec) -+{ -+} -+ -+#ifdef CONFIG_MEMCG -+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) -+{ -+} -+ -+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) -+{ -+} -+#endif -+ -+#endif /* CONFIG_LRU_GEN */ -+ - struct lruvec { - struct list_head lists[NR_LRU_LISTS]; - /* per lruvec lru_lock for memcg */ -@@ -334,6 +430,10 @@ struct lruvec { - unsigned long refaults[ANON_AND_FILE]; - /* Various lruvec state flags (enum lruvec_flags) */ - unsigned long flags; -+#ifdef CONFIG_LRU_GEN -+ /* evictable pages divided into generations */ -+ struct lru_gen_struct lrugen; -+#endif - #ifdef CONFIG_MEMCG - struct pglist_data *pgdat; - #endif -@@ -749,6 +849,8 @@ static inline bool zone_is_empty(struct zone *zone) - #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) - #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) - #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) -+#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) -+#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) - - /* - * Define the bit shifts to access each section. 
For non-existent -diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h -index ef1e3e736e14..240905407a18 100644 ---- a/include/linux/page-flags-layout.h -+++ b/include/linux/page-flags-layout.h -@@ -55,7 +55,8 @@ - #define SECTIONS_WIDTH 0 - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ -+ <= BITS_PER_LONG - NR_PAGEFLAGS - #define NODES_WIDTH NODES_SHIFT - #elif defined(CONFIG_SPARSEMEM_VMEMMAP) - #error "Vmemmap: No space for nodes field in page flags" -@@ -89,8 +90,8 @@ - #define LAST_CPUPID_SHIFT 0 - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ -- <= BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ -+ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS - #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT - #else - #define LAST_CPUPID_WIDTH 0 -@@ -100,10 +101,12 @@ - #define LAST_CPUPID_NOT_IN_PAGE_FLAGS - #endif - --#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ -- > BITS_PER_LONG - NR_PAGEFLAGS -+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ -+ KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS - #error "Not enough bits in page flags" - #endif - -+#define LRU_REFS_WIDTH 0 -+ - #endif - #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ -diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h -index 465ff35a8c00..0b0ae5084e60 100644 ---- a/include/linux/page-flags.h -+++ b/include/linux/page-flags.h -@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) - 1UL << PG_private | 1UL << PG_private_2 | \ - 1UL << PG_writeback | 1UL << PG_reserved | \ - 1UL << PG_slab | 1UL << PG_active | \ -- 1UL << PG_unevictable | __PG_MLOCKED) -+ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) - - /* - * Flags 
checked when a page is prepped for return by the page allocator. -@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) - * alloc-free cycle to prevent from reusing the page. - */ - #define PAGE_FLAGS_CHECK_AT_PREP \ -- (PAGEFLAGS_MASK & ~__PG_HWPOISON) -+ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) - - #define PAGE_FLAGS_PRIVATE \ - (1UL << PG_private | 1UL << PG_private_2) -diff --git a/include/linux/sched.h b/include/linux/sched.h -index e7b2f8a5c711..8cc46a789193 100644 ---- a/include/linux/sched.h -+++ b/include/linux/sched.h -@@ -914,6 +914,10 @@ struct task_struct { - #ifdef CONFIG_MEMCG - unsigned in_user_fault:1; - #endif -+#ifdef CONFIG_LRU_GEN -+ /* whether the LRU algorithm may apply to this access */ -+ unsigned in_lru_fault:1; -+#endif - #ifdef CONFIG_COMPAT_BRK - unsigned brk_randomized:1; - #endif -diff --git a/kernel/bounds.c b/kernel/bounds.c -index 9795d75b09b2..5ee60777d8e4 100644 ---- a/kernel/bounds.c -+++ b/kernel/bounds.c -@@ -22,6 +22,11 @@ int main(void) - DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); - #endif - DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); -+#ifdef CONFIG_LRU_GEN -+ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); -+#else -+ DEFINE(LRU_GEN_WIDTH, 0); -+#endif - /* End of constants */ - - return 0; -diff --git a/mm/Kconfig b/mm/Kconfig -index e3fbd0788878..378306aee622 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1118,6 +1118,14 @@ config PTE_MARKER_UFFD_WP - purposes. It is required to enable userfaultfd write protection on - file-backed memory types like shmem and hugetlbfs. - -+config LRU_GEN -+ bool "Multi-Gen LRU" -+ depends on MMU -+ # make sure folio->flags has enough spare bits -+ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP -+ help -+ A high performance LRU implementation to overcommit memory. 
-+ - source "mm/damon/Kconfig" - - endmenu -diff --git a/mm/huge_memory.c b/mm/huge_memory.c -index f4a656b279b1..949d7c325133 100644 ---- a/mm/huge_memory.c -+++ b/mm/huge_memory.c -@@ -2444,7 +2444,8 @@ static void __split_huge_page_tail(struct page *head, int tail, - #ifdef CONFIG_64BIT - (1L << PG_arch_2) | - #endif -- (1L << PG_dirty))); -+ (1L << PG_dirty) | -+ LRU_GEN_MASK | LRU_REFS_MASK)); - - /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, -diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 403af5f7a2b9..937141d48221 100644 ---- a/mm/memcontrol.c -+++ b/mm/memcontrol.c -@@ -5175,6 +5175,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) - - static void mem_cgroup_free(struct mem_cgroup *memcg) - { -+ lru_gen_exit_memcg(memcg); - memcg_wb_domain_exit(memcg); - __mem_cgroup_free(memcg); - } -@@ -5233,6 +5234,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) - memcg->deferred_split_queue.split_queue_len = 0; - #endif - idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); -+ lru_gen_init_memcg(memcg); - return memcg; - fail: - mem_cgroup_id_remove(memcg); -diff --git a/mm/memory.c b/mm/memory.c -index 3a9b00c765c2..63832dab15d3 100644 ---- a/mm/memory.c -+++ b/mm/memory.c -@@ -5117,6 +5117,27 @@ static inline void mm_account_fault(struct pt_regs *regs, - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } - -+#ifdef CONFIG_LRU_GEN -+static void lru_gen_enter_fault(struct vm_area_struct *vma) -+{ -+ /* the LRU algorithm doesn't apply to sequential or random reads */ -+ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); -+} -+ -+static void lru_gen_exit_fault(void) -+{ -+ current->in_lru_fault = false; -+} -+#else -+static void lru_gen_enter_fault(struct vm_area_struct *vma) -+{ -+} -+ -+static void lru_gen_exit_fault(void) -+{ -+} -+#endif /* CONFIG_LRU_GEN */ -+ - /* - * By the time we get here, we already hold the mm semaphore - * -@@ 
-5148,11 +5169,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, - if (flags & FAULT_FLAG_USER) - mem_cgroup_enter_user_fault(); - -+ lru_gen_enter_fault(vma); -+ - if (unlikely(is_vm_hugetlb_page(vma))) - ret = hugetlb_fault(vma->vm_mm, vma, address, flags); - else - ret = __handle_mm_fault(vma, address, flags); - -+ lru_gen_exit_fault(); -+ - if (flags & FAULT_FLAG_USER) { - mem_cgroup_exit_user_fault(); - /* -diff --git a/mm/mm_init.c b/mm/mm_init.c -index 9ddaf0e1b0ab..0d7b2bd2454a 100644 ---- a/mm/mm_init.c -+++ b/mm/mm_init.c -@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) - - shift = 8 * sizeof(unsigned long); - width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH -- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; -+ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; - mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", -- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", -+ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n", - SECTIONS_WIDTH, - NODES_WIDTH, - ZONES_WIDTH, - LAST_CPUPID_WIDTH, - KASAN_TAG_WIDTH, -+ LRU_GEN_WIDTH, -+ LRU_REFS_WIDTH, - NR_PAGEFLAGS); - mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", -diff --git a/mm/mmzone.c b/mm/mmzone.c -index 0ae7571e35ab..68e1511be12d 100644 ---- a/mm/mmzone.c -+++ b/mm/mmzone.c -@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec) - * Poison its list head, so that any operations on it would crash. 
- */ - list_del(&lruvec->lists[LRU_UNEVICTABLE]); -+ -+ lru_gen_init_lruvec(lruvec); - } - - #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) -diff --git a/mm/swap.c b/mm/swap.c -index 9cee7f6a3809..0e423b7d458b 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -484,6 +484,11 @@ void folio_add_lru(struct folio *folio) - folio_test_unevictable(folio), folio); - VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - -+ /* see the comment in lru_gen_add_folio() */ -+ if (lru_gen_enabled() && !folio_test_unevictable(folio) && -+ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) -+ folio_set_active(folio); -+ - folio_get(folio); - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); -@@ -575,7 +580,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) - - static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) - { -- if (folio_test_active(folio) && !folio_test_unevictable(folio)) { -+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) { - long nr_pages = folio_nr_pages(folio); - - lruvec_del_folio(lruvec, folio); -@@ -688,8 +693,8 @@ void deactivate_page(struct page *page) - { - struct folio *folio = page_folio(page); - -- if (folio_test_lru(folio) && folio_test_active(folio) && -- !folio_test_unevictable(folio)) { -+ if (folio_test_lru(folio) && !folio_test_unevictable(folio) && -+ (folio_test_active(folio) || lru_gen_enabled())) { - struct folio_batch *fbatch; - - folio_get(folio); -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 0869cee13a90..8d41c4ef430e 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -3050,6 +3050,81 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - return can_demote(pgdat->node_id, sc); - } - -+#ifdef CONFIG_LRU_GEN -+ -+/****************************************************************************** -+ * shorthand helpers -+ ******************************************************************************/ 
-+ -+#define for_each_gen_type_zone(gen, type, zone) \ -+ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ -+ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ -+ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) -+ -+static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid) -+{ -+ struct pglist_data *pgdat = NODE_DATA(nid); -+ -+#ifdef CONFIG_MEMCG -+ if (memcg) { -+ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; -+ -+ /* for hotadd_new_pgdat() */ -+ if (!lruvec->pgdat) -+ lruvec->pgdat = pgdat; -+ -+ return lruvec; -+ } -+#endif -+ VM_WARN_ON_ONCE(!mem_cgroup_disabled()); -+ -+ return pgdat ? &pgdat->__lruvec : NULL; -+} -+ -+/****************************************************************************** -+ * initialization -+ ******************************************************************************/ -+ -+void lru_gen_init_lruvec(struct lruvec *lruvec) -+{ -+ int gen, type, zone; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ lrugen->max_seq = MIN_NR_GENS + 1; -+ -+ for_each_gen_type_zone(gen, type, zone) -+ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -+} -+ -+#ifdef CONFIG_MEMCG -+void lru_gen_init_memcg(struct mem_cgroup *memcg) -+{ -+} -+ -+void lru_gen_exit_memcg(struct mem_cgroup *memcg) -+{ -+ int nid; -+ -+ for_each_node(nid) { -+ struct lruvec *lruvec = get_lruvec(memcg, nid); -+ -+ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, -+ sizeof(lruvec->lrugen.nr_pages))); -+ } -+} -+#endif -+ -+static int __init init_lru_gen(void) -+{ -+ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); -+ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); -+ -+ return 0; -+}; -+late_initcall(init_lru_gen); -+ -+#endif /* CONFIG_LRU_GEN */ -+ - static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - { - unsigned long nr[NR_LRU_LISTS]; --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] 
Multi-Gen LRU Framework Yu Zhao - ` (4 preceding siblings ...) - 2022-09-18 8:00 ` [PATCH mm-unstable v15 05/14] mm: multi-gen LRU: groundwork Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao - ` (8 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -To avoid confusion, the terms "promotion" and "demotion" will be -applied to the multi-gen LRU, as a new convention; the terms -"activation" and "deactivation" will be applied to the active/inactive -LRU, as usual. - -The aging produces young generations. Given an lruvec, it increments -max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging -promotes hot pages to the youngest generation when it finds them -accessed through page tables; the demotion of cold pages happens -consequently when it increments max_seq. Promotion in the aging path -does not involve any LRU list operations, only the updates of the gen -counter and lrugen->nr_pages[]; demotion, unless as the result of the -increment of max_seq, requires LRU list operations, e.g., -lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), -since it is only interested in hot pages. - -The eviction consumes old generations. 
Given an lruvec, it increments -min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes -empty. A feedback loop modeled after the PID controller monitors -refaults over anon and file types and decides which type to evict when -both types are available from the same generation. - -The protection of pages accessed multiple times through file -descriptors takes place in the eviction path. Each generation is -divided into multiple tiers. A page accessed N times through file -descriptors is in tier order_base_2(N). Tiers do not have dedicated -lrugen->lists[], only bits in folio->flags. The aforementioned -feedback loop also monitors refaults over all tiers and decides when -to protect pages in which tiers (N>1), using the first tier (N=0,1) as -a baseline. The first tier contains single-use unmapped clean pages, -which are most likely the best choices. In contrast to promotion in -the aging path, the protection of a page in the eviction path is -achieved by moving this page to the next generation, i.e., min_seq+1, -if the feedback loop decides so. This approach has the following -advantages: -1. It removes the cost of activation in the buffered access path by - inferring whether pages accessed multiple times through file - descriptors are statistically hot and thus worth protecting in the - eviction path. -2. It takes pages accessed through page tables into account and avoids - overprotecting pages accessed multiple times through file - descriptors. (Pages accessed through page tables are in the first - tier, since N=0.) -3. More tiers provide better protection for pages accessed more than - twice through file descriptors, when under heavy buffered I/O - workloads. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): +[30, 32]% - IOPS BW - 5.19-rc1: 2673k 10.2GiB/s - patch1-6: 3491k 13.3GiB/s - - Single workload: - memcached (anon): -[4, 6]% - Ops/sec KB/sec - 5.19-rc1: 1161501.04 45177.25 - patch1-6: 1106168.46 43025.04 - - Configurations: - CPU: two Xeon 6154 - Mem: total 256G - - Node 1 was only used as a ram disk to reduce the variance in the - results. - - patch drivers/block/brd.c < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; - > page = alloc_pages_node(1, gfp_flags, 0); - EOF - - cat >>/etc/systemd/system.conf <>/etc/memcached.conf </sys/fs/cgroup/user.slice/test/memory.max - echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs - fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ - --buffered=1 --ioengine=io_uring --iodepth=128 \ - --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ - --rw=randread --random_distribution=random --norandommap \ - --time_based --ramp_time=10m --runtime=5m --group_reporting - - cat memcached.sh - modprobe brd rd_nr=1 rd_size=113246208 - swapoff -a - mkswap /dev/ram0 - swapon /dev/ram0 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ - --ratio 1:0 --pipeline 8 -d 2000 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ - --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed - -Client benchmark results: - kswapd profiles: - 5.19-rc1 - 40.33% page_vma_mapped_walk (overhead) - 21.80% lzo1x_1_do_compress (real work) - 7.53% do_raw_spin_lock - 3.95% _raw_spin_unlock_irq - 2.52% vma_interval_tree_iter_next - 2.37% folio_referenced_one - 2.28% vma_interval_tree_subtree_search - 1.97% anon_vma_interval_tree_iter_first - 1.60% ptep_clear_flush - 1.06% __zram_bvec_write - - patch1-6 - 39.03% lzo1x_1_do_compress 
(real work) - 18.47% page_vma_mapped_walk (overhead) - 6.74% _raw_spin_unlock_irq - 3.97% do_raw_spin_lock - 2.49% ptep_clear_flush - 2.48% anon_vma_interval_tree_iter_first - 1.92% folio_referenced_one - 1.88% __zram_bvec_write - 1.48% memmove - 1.31% vma_interval_tree_iter_next - - Configurations: - CPU: single Snapdragon 7c - Mem: total 4G - - ChromeOS MemoryPressure [1] - -[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mm_inline.h | 36 ++ - include/linux/mmzone.h | 41 ++ - include/linux/page-flags-layout.h | 5 +- - kernel/bounds.c | 2 + - mm/Kconfig | 11 + - mm/swap.c | 39 ++ - mm/vmscan.c | 792 +++++++++++++++++++++++++++++- - mm/workingset.c | 110 ++++- - 8 files changed, 1025 insertions(+), 11 deletions(-) - -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index 2ff703900fd0..f2b2296a42f9 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ -121,6 +121,33 @@ static inline int lru_gen_from_seq(unsigned long seq) - return seq % MAX_NR_GENS; - } - -+static inline int lru_hist_from_seq(unsigned long seq) -+{ -+ return seq % NR_HIST_GENS; -+} -+ -+static inline int lru_tier_from_refs(int refs) -+{ -+ VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); -+ -+ /* see the comment in folio_lru_refs() */ -+ return order_base_2(refs + 1); -+} -+ -+static inline int folio_lru_refs(struct folio *folio) -+{ -+ unsigned long flags = READ_ONCE(folio->flags); -+ bool workingset = flags & BIT(PG_workingset); -+ -+ /* -+ * Return the number of accesses beyond PG_referenced, i.e., N-1 if the -+ * total number of accesses is N>1, since N=0,1 
both map to the first -+ * tier. lru_tier_from_refs() will account for this off-by-one. Also see -+ * the comment on MAX_NR_TIERS. -+ */ -+ return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; -+} -+ - static inline int folio_lru_gen(struct folio *folio) - { - unsigned long flags = READ_ONCE(folio->flags); -@@ -173,6 +200,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli - __update_lru_size(lruvec, lru, zone, -delta); - return; - } -+ -+ /* promotion */ -+ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { -+ __update_lru_size(lruvec, lru, zone, -delta); -+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); -+ } -+ -+ /* demotion requires isolation, e.g., lru_deactivate_fn() */ -+ VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); - } - - static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 6f4ea078d90f..7e343420bfb1 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -350,6 +350,28 @@ enum lruvec_flags { - #define MIN_NR_GENS 2U - #define MAX_NR_GENS 4U - -+/* -+ * Each generation is divided into multiple tiers. A page accessed N times -+ * through file descriptors is in tier order_base_2(N). A page in the first tier -+ * (N=0,1) is marked by PG_referenced unless it was faulted in through page -+ * tables or read ahead. A page in any other tier (N>1) is marked by -+ * PG_referenced and PG_workingset. This implies a minimum of two tiers is -+ * supported without using additional bits in folio->flags. -+ * -+ * In contrast to moving across generations which requires the LRU lock, moving -+ * across tiers only involves atomic operations on folio->flags and therefore -+ * has a negligible cost in the buffered access path. 
In the eviction path, -+ * comparisons of refaulted/(evicted+protected) from the first tier and the -+ * rest infer whether pages accessed multiple times through file descriptors -+ * are statistically hot and thus worth protecting. -+ * -+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the -+ * number of categories of the active/inactive LRU when keeping track of -+ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in -+ * folio->flags. -+ */ -+#define MAX_NR_TIERS 4U -+ - #ifndef __GENERATING_BOUNDS_H - - struct lruvec; -@@ -364,6 +386,16 @@ enum { - LRU_GEN_FILE, - }; - -+#define MIN_LRU_BATCH BITS_PER_LONG -+#define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) -+ -+/* whether to keep historical stats from evicted generations */ -+#ifdef CONFIG_LRU_GEN_STATS -+#define NR_HIST_GENS MAX_NR_GENS -+#else -+#define NR_HIST_GENS 1U -+#endif -+ - /* - * The youngest generation number is stored in max_seq for both anon and file - * types as they are aged on an equal footing. 
The oldest generation numbers are -@@ -386,6 +418,15 @@ struct lru_gen_struct { - struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; - /* the multi-gen LRU sizes, eventually consistent */ - long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; -+ /* the exponential moving average of refaulted */ -+ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; -+ /* the exponential moving average of evicted+protected */ -+ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; -+ /* the first tier doesn't need protection, hence the minus one */ -+ unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; -+ /* can be modified without holding the LRU lock */ -+ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; -+ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; - }; - - void lru_gen_init_lruvec(struct lruvec *lruvec); -diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h -index 240905407a18..7d79818dc065 100644 ---- a/include/linux/page-flags-layout.h -+++ b/include/linux/page-flags-layout.h -@@ -106,7 +106,10 @@ - #error "Not enough bits in page flags" - #endif - --#define LRU_REFS_WIDTH 0 -+/* see the comment on MAX_NR_TIERS */ -+#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ -+ ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ -+ NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) - - #endif - #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ -diff --git a/kernel/bounds.c b/kernel/bounds.c -index 5ee60777d8e4..b529182e8b04 100644 ---- a/kernel/bounds.c -+++ b/kernel/bounds.c -@@ -24,8 +24,10 @@ int main(void) - DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); - #ifdef CONFIG_LRU_GEN - DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); -+ DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); - #else - DEFINE(LRU_GEN_WIDTH, 0); -+ DEFINE(__LRU_REFS_WIDTH, 0); - #endif - /* End of constants */ - -diff --git a/mm/Kconfig b/mm/Kconfig -index 378306aee622..5c5dcbdcfe34 100644 
---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1118,6 +1118,7 @@ config PTE_MARKER_UFFD_WP - purposes. It is required to enable userfaultfd write protection on - file-backed memory types like shmem and hugetlbfs. - -+# multi-gen LRU { - config LRU_GEN - bool "Multi-Gen LRU" - depends on MMU -@@ -1126,6 +1127,16 @@ config LRU_GEN - help - A high performance LRU implementation to overcommit memory. - -+config LRU_GEN_STATS -+ bool "Full stats for debugging" -+ depends on LRU_GEN -+ help -+ Do not enable this option unless you plan to look at historical stats -+ from evicted generations for debugging purpose. -+ -+ This option has a per-memcg and per-node memory overhead. -+# } -+ - source "mm/damon/Kconfig" - - endmenu -diff --git a/mm/swap.c b/mm/swap.c -index 0e423b7d458b..f74fd51fa9e1 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -428,6 +428,40 @@ static void __lru_cache_activate_folio(struct folio *folio) - local_unlock(&cpu_fbatches.lock); - } - -+#ifdef CONFIG_LRU_GEN -+static void folio_inc_refs(struct folio *folio) -+{ -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ if (folio_test_unevictable(folio)) -+ return; -+ -+ if (!folio_test_referenced(folio)) { -+ folio_set_referenced(folio); -+ return; -+ } -+ -+ if (!folio_test_workingset(folio)) { -+ folio_set_workingset(folio); -+ return; -+ } -+ -+ /* see the comment on MAX_NR_TIERS */ -+ do { -+ new_flags = old_flags & LRU_REFS_MASK; -+ if (new_flags == LRU_REFS_MASK) -+ break; -+ -+ new_flags += BIT(LRU_REFS_PGOFF); -+ new_flags |= old_flags & ~LRU_REFS_MASK; -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+} -+#else -+static void folio_inc_refs(struct folio *folio) -+{ -+} -+#endif /* CONFIG_LRU_GEN */ -+ - /* - * Mark a page as having seen activity. 
- * -@@ -440,6 +474,11 @@ static void __lru_cache_activate_folio(struct folio *folio) - */ - void folio_mark_accessed(struct folio *folio) - { -+ if (lru_gen_enabled()) { -+ folio_inc_refs(folio); -+ return; -+ } -+ - if (!folio_test_referenced(folio)) { - folio_set_referenced(folio); - } else if (folio_test_unevictable(folio)) { -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 8d41c4ef430e..d1e60feea8ab 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -1334,9 +1334,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, - - if (folio_test_swapcache(folio)) { - swp_entry_t swap = folio_swap_entry(folio); -- mem_cgroup_swapout(folio, swap); -+ -+ /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ - if (reclaimed && !mapping_exiting(mapping)) - shadow = workingset_eviction(folio, target_memcg); -+ mem_cgroup_swapout(folio, swap); - __delete_from_swap_cache(folio, swap, shadow); - xa_unlock_irq(&mapping->i_pages); - put_swap_page(&folio->page, swap); -@@ -2733,6 +2735,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) - unsigned long file; - struct lruvec *target_lruvec; - -+ if (lru_gen_enabled()) -+ return; -+ - target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); - - /* -@@ -3056,6 +3061,17 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - * shorthand helpers - ******************************************************************************/ - -+#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) -+ -+#define DEFINE_MAX_SEQ(lruvec) \ -+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) -+ -+#define DEFINE_MIN_SEQ(lruvec) \ -+ unsigned long min_seq[ANON_AND_FILE] = { \ -+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ -+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ -+ } -+ - #define for_each_gen_type_zone(gen, type, zone) \ - for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ - for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ 
-@@ -3081,6 +3097,745 @@ static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int ni - return pgdat ? &pgdat->__lruvec : NULL; - } - -+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ struct pglist_data *pgdat = lruvec_pgdat(lruvec); -+ -+ if (!can_demote(pgdat->node_id, sc) && -+ mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) -+ return 0; -+ -+ return mem_cgroup_swappiness(memcg); -+} -+ -+static int get_nr_gens(struct lruvec *lruvec, int type) -+{ -+ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; -+} -+ -+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) -+{ -+ /* see the comment on lru_gen_struct */ -+ return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && -+ get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && -+ get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; -+} -+ -+/****************************************************************************** -+ * refault feedback loop -+ ******************************************************************************/ -+ -+/* -+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller. -+ * -+ * The P term is refaulted/(evicted+protected) from a tier in the generation -+ * currently being evicted; the I term is the exponential moving average of the -+ * P term over the generations previously evicted, using the smoothing factor -+ * 1/2; the D term isn't supported. -+ * -+ * The setpoint (SP) is always the first tier of one type; the process variable -+ * (PV) is either any tier of the other type or any other tier of the same -+ * type. 
-+ *
-+ * The error is the difference between the SP and the PV; the correction is to
-+ * turn off protection when SP>PV or turn on protection when SP<PV.
-+ */
-+struct ctrl_pos {
-+ unsigned long refaulted;
-+ unsigned long total;
-+ int gain;
-+};
-+
-+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
-+ struct ctrl_pos *pos)
-+{
-+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
-+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
-+
-+ pos->refaulted = lrugen->avg_refaulted[type][tier] +
-+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-+ pos->total = lrugen->avg_total[type][tier] +
-+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
-+ if (tier)
-+ pos->total += lrugen->protected[hist][type][tier - 1];
-+ pos->gain = gain;
-+}
-+
-+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
-+{
-+ int hist, tier;
-+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
-+ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
-+ unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
-+
-+ lockdep_assert_held(&lruvec->lru_lock);
-+
-+ if (!carryover && !clear)
-+ return;
-+
-+ hist = lru_hist_from_seq(seq);
-+
-+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
-+ if (carryover) {
-+ unsigned long sum;
-+
-+ sum = lrugen->avg_refaulted[type][tier] +
-+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-+ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
-+
-+ sum = lrugen->avg_total[type][tier] +
-+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
-+ if (tier)
-+ sum += lrugen->protected[hist][type][tier - 1];
-+ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
-+ }
-+
-+ if (clear) {
-+ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
-+ atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
-+ if (tier)
-+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
-+ }
-+ }
-+}
-+
-+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
-+{
-+ /*
-+ * Return true if the PV has a limited number of refaults or a lower
-+ * refaulted/total than the SP.
-+ */ -+ return pv->refaulted < MIN_LRU_BATCH || -+ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= -+ (sp->refaulted + 1) * pv->total * pv->gain; -+} -+ -+/****************************************************************************** -+ * the aging -+ ******************************************************************************/ -+ -+/* protect pages accessed multiple times through file descriptors */ -+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) -+{ -+ int type = folio_is_file_lru(folio); -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); -+ -+ do { -+ new_gen = (old_gen + 1) % MAX_NR_GENS; -+ -+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -+ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; -+ /* for folio_end_writeback() */ -+ if (reclaiming) -+ new_flags |= BIT(PG_reclaim); -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+ -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ -+ return new_gen; -+} -+ -+static void inc_min_seq(struct lruvec *lruvec, int type) -+{ -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ reset_ctrl_pos(lruvec, type, true); -+ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); -+} -+ -+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) -+{ -+ int gen, type, zone; -+ bool success = false; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ DEFINE_MIN_SEQ(lruvec); -+ -+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); -+ -+ /* find the oldest populated generation */ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { -+ gen = lru_gen_from_seq(min_seq[type]); -+ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ if 
(!list_empty(&lrugen->lists[gen][type][zone])) -+ goto next; -+ } -+ -+ min_seq[type]++; -+ } -+next: -+ ; -+ } -+ -+ /* see the comment on lru_gen_struct */ -+ if (can_swap) { -+ min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); -+ min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); -+ } -+ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ if (min_seq[type] == lrugen->min_seq[type]) -+ continue; -+ -+ reset_ctrl_pos(lruvec, type, true); -+ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); -+ success = true; -+ } -+ -+ return success; -+} -+ -+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap) -+{ -+ int prev, next; -+ int type, zone; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); -+ -+ if (max_seq != lrugen->max_seq) -+ goto unlock; -+ -+ for (type = ANON_AND_FILE - 1; type >= 0; type--) { -+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS) -+ continue; -+ -+ VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap); -+ -+ inc_min_seq(lruvec, type); -+ } -+ -+ /* -+ * Update the active/inactive LRU sizes for compatibility. Both sides of -+ * the current max_seq need to be covered, since max_seq+1 can overlap -+ * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do -+ * overlap, cold/hot inversion happens. 
-+ */ -+ prev = lru_gen_from_seq(lrugen->max_seq - 1); -+ next = lru_gen_from_seq(lrugen->max_seq + 1); -+ -+ for (type = 0; type < ANON_AND_FILE; type++) { -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ enum lru_list lru = type * LRU_INACTIVE_FILE; -+ long delta = lrugen->nr_pages[prev][type][zone] - -+ lrugen->nr_pages[next][type][zone]; -+ -+ if (!delta) -+ continue; -+ -+ __update_lru_size(lruvec, lru, zone, delta); -+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); -+ } -+ } -+ -+ for (type = 0; type < ANON_AND_FILE; type++) -+ reset_ctrl_pos(lruvec, type, false); -+ -+ /* make sure preceding modifications appear */ -+ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); -+unlock: -+ spin_unlock_irq(&lruvec->lru_lock); -+} -+ -+static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, -+ struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan) -+{ -+ int gen, type, zone; -+ unsigned long old = 0; -+ unsigned long young = 0; -+ unsigned long total = 0; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ -+ for (type = !can_swap; type < ANON_AND_FILE; type++) { -+ unsigned long seq; -+ -+ for (seq = min_seq[type]; seq <= max_seq; seq++) { -+ unsigned long size = 0; -+ -+ gen = lru_gen_from_seq(seq); -+ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) -+ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L); -+ -+ total += size; -+ if (seq == max_seq) -+ young += size; -+ else if (seq + MIN_NR_GENS == max_seq) -+ old += size; -+ } -+ } -+ -+ /* try to scrape all its memory if this memcg was deleted */ -+ *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total; -+ -+ /* -+ * The aging tries to be lazy to reduce the overhead, while the eviction -+ * stalls when the number of generations reaches MIN_NR_GENS. Hence, the -+ * ideal number of generations is MIN_NR_GENS+1. 
-+ */ -+ if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) -+ return true; -+ if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) -+ return false; -+ -+ /* -+ * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1) -+ * of the total number of pages for each generation. A reasonable range -+ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The -+ * aging cares about the upper bound of hot pages, while the eviction -+ * cares about the lower bound of cold pages. -+ */ -+ if (young * MIN_NR_GENS > total) -+ return true; -+ if (old * (MIN_NR_GENS + 2) < total) -+ return true; -+ -+ return false; -+} -+ -+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ bool need_aging; -+ unsigned long nr_to_scan; -+ int swappiness = get_swappiness(lruvec, sc); -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ DEFINE_MAX_SEQ(lruvec); -+ DEFINE_MIN_SEQ(lruvec); -+ -+ VM_WARN_ON_ONCE(sc->memcg_low_reclaim); -+ -+ mem_cgroup_calculate_protection(NULL, memcg); -+ -+ if (mem_cgroup_below_min(memcg)) -+ return; -+ -+ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); -+ if (need_aging) -+ inc_max_seq(lruvec, max_seq, swappiness); -+} -+ -+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -+{ -+ struct mem_cgroup *memcg; -+ -+ VM_WARN_ON_ONCE(!current_is_kswapd()); -+ -+ memcg = mem_cgroup_iter(NULL, NULL, NULL); -+ do { -+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ -+ age_lruvec(lruvec, sc); -+ -+ cond_resched(); -+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); -+} -+ -+/****************************************************************************** -+ * the eviction -+ ******************************************************************************/ -+ -+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) -+{ -+ bool success; -+ int gen = folio_lru_gen(folio); -+ int type = folio_is_file_lru(folio); -+ int zone = 
folio_zonenum(folio); -+ int delta = folio_nr_pages(folio); -+ int refs = folio_lru_refs(folio); -+ int tier = lru_tier_from_refs(refs); -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ -+ VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); -+ -+ /* unevictable */ -+ if (!folio_evictable(folio)) { -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ folio_set_unevictable(folio); -+ lruvec_add_folio(lruvec, folio); -+ __count_vm_events(UNEVICTABLE_PGCULLED, delta); -+ return true; -+ } -+ -+ /* dirty lazyfree */ -+ if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ folio_set_swapbacked(folio); -+ lruvec_add_folio_tail(lruvec, folio); -+ return true; -+ } -+ -+ /* protected */ -+ if (tier > tier_idx) { -+ int hist = lru_hist_from_seq(lrugen->min_seq[type]); -+ -+ gen = folio_inc_gen(lruvec, folio, false); -+ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); -+ -+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], -+ lrugen->protected[hist][type][tier - 1] + delta); -+ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); -+ return true; -+ } -+ -+ /* waiting for writeback */ -+ if (folio_test_locked(folio) || folio_test_writeback(folio) || -+ (type == LRU_GEN_FILE && folio_test_dirty(folio))) { -+ gen = folio_inc_gen(lruvec, folio, true); -+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); -+ return true; -+ } -+ -+ return false; -+} -+ -+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) -+{ -+ bool success; -+ -+ /* unmapping inhibited */ -+ if (!sc->may_unmap && folio_mapped(folio)) -+ return false; -+ -+ /* swapping inhibited */ -+ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && -+ (folio_test_dirty(folio) || -+ (folio_test_anon(folio) && !folio_test_swapcache(folio)))) -+ return false; -+ -+ /* raced 
with release_pages() */ -+ if (!folio_try_get(folio)) -+ return false; -+ -+ /* raced with another isolation */ -+ if (!folio_test_clear_lru(folio)) { -+ folio_put(folio); -+ return false; -+ } -+ -+ /* see the comment on MAX_NR_TIERS */ -+ if (!folio_test_referenced(folio)) -+ set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0); -+ -+ /* for shrink_page_list() */ -+ folio_clear_reclaim(folio); -+ folio_clear_referenced(folio); -+ -+ success = lru_gen_del_folio(lruvec, folio, true); -+ VM_WARN_ON_ONCE_FOLIO(!success, folio); -+ -+ return true; -+} -+ -+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, -+ int type, int tier, struct list_head *list) -+{ -+ int gen, zone; -+ enum vm_event_item item; -+ int sorted = 0; -+ int scanned = 0; -+ int isolated = 0; -+ int remaining = MAX_LRU_BATCH; -+ struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ -+ VM_WARN_ON_ONCE(!list_empty(list)); -+ -+ if (get_nr_gens(lruvec, type) == MIN_NR_GENS) -+ return 0; -+ -+ gen = lru_gen_from_seq(lrugen->min_seq[type]); -+ -+ for (zone = sc->reclaim_idx; zone >= 0; zone--) { -+ LIST_HEAD(moved); -+ int skipped = 0; -+ struct list_head *head = &lrugen->lists[gen][type][zone]; -+ -+ while (!list_empty(head)) { -+ struct folio *folio = lru_to_folio(head); -+ int delta = folio_nr_pages(folio); -+ -+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); -+ -+ scanned += delta; -+ -+ if (sort_folio(lruvec, folio, tier)) -+ sorted += delta; -+ else if (isolate_folio(lruvec, folio, sc)) { -+ list_add(&folio->lru, list); -+ isolated += delta; -+ } else { -+ list_move(&folio->lru, &moved); -+ skipped += delta; -+ } -+ -+ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) -+ break; -+ } -+ -+ if (skipped) { -+ 
list_splice(&moved, head); -+ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); -+ } -+ -+ if (!remaining || isolated >= MIN_LRU_BATCH) -+ break; -+ } -+ -+ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; -+ if (!cgroup_reclaim(sc)) { -+ __count_vm_events(item, isolated); -+ __count_vm_events(PGREFILL, sorted); -+ } -+ __count_memcg_events(memcg, item, isolated); -+ __count_memcg_events(memcg, PGREFILL, sorted); -+ __count_vm_events(PGSCAN_ANON + type, isolated); -+ -+ /* -+ * There might not be eligible pages due to reclaim_idx, may_unmap and -+ * may_writepage. Check the remaining to prevent livelock if it's not -+ * making progress. -+ */ -+ return isolated || !remaining ? scanned : 0; -+} -+ -+static int get_tier_idx(struct lruvec *lruvec, int type) -+{ -+ int tier; -+ struct ctrl_pos sp, pv; -+ -+ /* -+ * To leave a margin for fluctuations, use a larger gain factor (1:2). -+ * This value is chosen because any other tier would have at least twice -+ * as many refaults as the first tier. -+ */ -+ read_ctrl_pos(lruvec, type, 0, 1, &sp); -+ for (tier = 1; tier < MAX_NR_TIERS; tier++) { -+ read_ctrl_pos(lruvec, type, tier, 2, &pv); -+ if (!positive_ctrl_err(&sp, &pv)) -+ break; -+ } -+ -+ return tier - 1; -+} -+ -+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) -+{ -+ int type, tier; -+ struct ctrl_pos sp, pv; -+ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; -+ -+ /* -+ * Compare the first tier of anon with that of file to determine which -+ * type to scan. Also need to compare other tiers of the selected type -+ * with the first tier of the other type to determine the last tier (of -+ * the selected type) to evict. 
-+ */ -+ read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); -+ read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); -+ type = positive_ctrl_err(&sp, &pv); -+ -+ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); -+ for (tier = 1; tier < MAX_NR_TIERS; tier++) { -+ read_ctrl_pos(lruvec, type, tier, gain[type], &pv); -+ if (!positive_ctrl_err(&sp, &pv)) -+ break; -+ } -+ -+ *tier_idx = tier - 1; -+ -+ return type; -+} -+ -+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, -+ int *type_scanned, struct list_head *list) -+{ -+ int i; -+ int type; -+ int scanned; -+ int tier = -1; -+ DEFINE_MIN_SEQ(lruvec); -+ -+ /* -+ * Try to make the obvious choice first. When anon and file are both -+ * available from the same generation, interpret swappiness 1 as file -+ * first and 200 as anon first. -+ */ -+ if (!swappiness) -+ type = LRU_GEN_FILE; -+ else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) -+ type = LRU_GEN_ANON; -+ else if (swappiness == 1) -+ type = LRU_GEN_FILE; -+ else if (swappiness == 200) -+ type = LRU_GEN_ANON; -+ else -+ type = get_type_to_scan(lruvec, swappiness, &tier); -+ -+ for (i = !swappiness; i < ANON_AND_FILE; i++) { -+ if (tier < 0) -+ tier = get_tier_idx(lruvec, type); -+ -+ scanned = scan_folios(lruvec, sc, type, tier, list); -+ if (scanned) -+ break; -+ -+ type = !type; -+ tier = -1; -+ } -+ -+ *type_scanned = type; -+ -+ return scanned; -+} -+ -+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) -+{ -+ int type; -+ int scanned; -+ int reclaimed; -+ LIST_HEAD(list); -+ struct folio *folio; -+ enum vm_event_item item; -+ struct reclaim_stat stat; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ struct pglist_data *pgdat = lruvec_pgdat(lruvec); -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); -+ -+ scanned += try_to_inc_min_seq(lruvec, swappiness); -+ -+ if (get_nr_gens(lruvec, 
!swappiness) == MIN_NR_GENS) -+ scanned = 0; -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ if (list_empty(&list)) -+ return scanned; -+ -+ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); -+ -+ list_for_each_entry(folio, &list, lru) { -+ /* restore LRU_REFS_FLAGS cleared by isolate_folio() */ -+ if (folio_test_workingset(folio)) -+ folio_set_referenced(folio); -+ -+ /* don't add rejected pages to the oldest generation */ -+ if (folio_test_reclaim(folio) && -+ (folio_test_dirty(folio) || folio_test_writeback(folio))) -+ folio_clear_active(folio); -+ else -+ folio_set_active(folio); -+ } -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ -+ move_pages_to_lru(lruvec, &list); -+ -+ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; -+ if (!cgroup_reclaim(sc)) -+ __count_vm_events(item, reclaimed); -+ __count_memcg_events(memcg, item, reclaimed); -+ __count_vm_events(PGSTEAL_ANON + type, reclaimed); -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ mem_cgroup_uncharge_list(&list); -+ free_unref_page_list(&list); -+ -+ sc->nr_reclaimed += reclaimed; -+ -+ return scanned; -+} -+ -+static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, -+ bool can_swap) -+{ -+ bool need_aging; -+ unsigned long nr_to_scan; -+ struct mem_cgroup *memcg = lruvec_memcg(lruvec); -+ DEFINE_MAX_SEQ(lruvec); -+ DEFINE_MIN_SEQ(lruvec); -+ -+ if (mem_cgroup_below_min(memcg) || -+ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) -+ return 0; -+ -+ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); -+ if (!need_aging) -+ return nr_to_scan; -+ -+ /* skip the aging path at the default priority */ -+ if (sc->priority == DEF_PRIORITY) -+ goto done; -+ -+ /* leave the work to lru_gen_age_node() */ -+ if (current_is_kswapd()) -+ return 0; -+ -+ inc_max_seq(lruvec, max_seq, can_swap); -+done: -+ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; -+} -+ -+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+ struct blk_plug plug; -+ unsigned long scanned = 0; -+ -+ lru_add_drain(); -+ -+ blk_start_plug(&plug); -+ -+ while (true) { -+ int delta; -+ int swappiness; -+ unsigned long nr_to_scan; -+ -+ if (sc->may_swap) -+ swappiness = get_swappiness(lruvec, sc); -+ else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) -+ swappiness = 1; -+ else -+ swappiness = 0; -+ -+ nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); -+ if (!nr_to_scan) -+ break; -+ -+ delta = evict_folios(lruvec, sc, swappiness); -+ if (!delta) -+ break; -+ -+ scanned += delta; -+ if (scanned >= nr_to_scan) -+ break; -+ -+ cond_resched(); -+ } -+ -+ blk_finish_plug(&plug); -+} -+ - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -3123,6 +3878,16 @@ static int __init init_lru_gen(void) - }; - late_initcall(init_lru_gen); - -+#else /* !CONFIG_LRU_GEN */ -+ -+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -+{ -+} -+ -+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+{ -+} -+ - #endif /* CONFIG_LRU_GEN */ - - static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -@@ -3136,6 +3901,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - bool proportional_reclaim; - struct blk_plug plug; - -+ if (lru_gen_enabled()) { -+ lru_gen_shrink_lruvec(lruvec, sc); -+ return; -+ } -+ - get_scan_count(lruvec, sc, nr); - - /* Record the original scan target for proportional adjustments later */ -@@ -3640,6 +4410,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) - struct lruvec *target_lruvec; - unsigned long refaults; - -+ if (lru_gen_enabled()) -+ return; -+ - target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - 
refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); - target_lruvec->refaults[WORKINGSET_ANON] = refaults; -@@ -4006,12 +4779,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, - } - #endif - --static void age_active_anon(struct pglist_data *pgdat, -- struct scan_control *sc) -+static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) - { - struct mem_cgroup *memcg; - struct lruvec *lruvec; - -+ if (lru_gen_enabled()) { -+ lru_gen_age_node(pgdat, sc); -+ return; -+ } -+ - if (!can_age_anon_pages(pgdat, sc)) - return; - -@@ -4331,12 +5108,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) - sc.may_swap = !nr_boost_reclaim; - - /* -- * Do some background aging of the anon list, to give -- * pages a chance to be referenced before reclaiming. All -- * pages are rotated regardless of classzone as this is -- * about consistent aging. -+ * Do some background aging, to give pages a chance to be -+ * referenced before reclaiming. All pages are rotated -+ * regardless of classzone as this is about consistent aging. 
- */ -- age_active_anon(pgdat, &sc); -+ kswapd_age_node(pgdat, &sc); - - /* - * If we're getting trouble reclaiming, start doing writepage -diff --git a/mm/workingset.c b/mm/workingset.c -index a5e84862fc86..ae7e984b23c6 100644 ---- a/mm/workingset.c -+++ b/mm/workingset.c -@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; - static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, - bool workingset) - { -- eviction >>= bucket_order; - eviction &= EVICTION_MASK; - eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; - eviction = (eviction << NODES_SHIFT) | pgdat->node_id; -@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, - - *memcgidp = memcgid; - *pgdat = NODE_DATA(nid); -- *evictionp = entry << bucket_order; -+ *evictionp = entry; - *workingsetp = workingset; - } - -+#ifdef CONFIG_LRU_GEN -+ -+static void *lru_gen_eviction(struct folio *folio) -+{ -+ int hist; -+ unsigned long token; -+ unsigned long min_seq; -+ struct lruvec *lruvec; -+ struct lru_gen_struct *lrugen; -+ int type = folio_is_file_lru(folio); -+ int delta = folio_nr_pages(folio); -+ int refs = folio_lru_refs(folio); -+ int tier = lru_tier_from_refs(refs); -+ struct mem_cgroup *memcg = folio_memcg(folio); -+ struct pglist_data *pgdat = folio_pgdat(folio); -+ -+ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); -+ -+ lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ lrugen = &lruvec->lrugen; -+ min_seq = READ_ONCE(lrugen->min_seq[type]); -+ token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0); -+ -+ hist = lru_hist_from_seq(min_seq); -+ atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); -+ -+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs); -+} -+ -+static void lru_gen_refault(struct folio *folio, void *shadow) -+{ -+ int hist, tier, refs; -+ int memcg_id; -+ bool workingset; -+ unsigned long token; -+ unsigned long min_seq; -+ struct lruvec *lruvec; -+ struct 
lru_gen_struct *lrugen; -+ struct mem_cgroup *memcg; -+ struct pglist_data *pgdat; -+ int type = folio_is_file_lru(folio); -+ int delta = folio_nr_pages(folio); -+ -+ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); -+ -+ if (pgdat != folio_pgdat(folio)) -+ return; -+ -+ rcu_read_lock(); -+ -+ memcg = folio_memcg_rcu(folio); -+ if (memcg_id != mem_cgroup_id(memcg)) -+ goto unlock; -+ -+ lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ lrugen = &lruvec->lrugen; -+ -+ min_seq = READ_ONCE(lrugen->min_seq[type]); -+ if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) -+ goto unlock; -+ -+ hist = lru_hist_from_seq(min_seq); -+ /* see the comment in folio_lru_refs() */ -+ refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; -+ tier = lru_tier_from_refs(refs); -+ -+ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); -+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); -+ -+ /* -+ * Count the following two cases as stalls: -+ * 1. For pages accessed through page tables, hotter pages pushed out -+ * hot pages which refaulted immediately. -+ * 2. For pages accessed multiple times through file descriptors, -+ * numbers of accesses might have been out of the range. 
-+ */ -+ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { -+ folio_set_workingset(folio); -+ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); -+ } -+unlock: -+ rcu_read_unlock(); -+} -+ -+#else /* !CONFIG_LRU_GEN */ -+ -+static void *lru_gen_eviction(struct folio *folio) -+{ -+ return NULL; -+} -+ -+static void lru_gen_refault(struct folio *folio, void *shadow) -+{ -+} -+ -+#endif /* CONFIG_LRU_GEN */ -+ - /** - * workingset_age_nonresident - age non-resident entries as LRU ages - * @lruvec: the lruvec that was aged -@@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) - VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - -+ if (lru_gen_enabled()) -+ return lru_gen_eviction(folio); -+ - lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - /* XXX: target_memcg can be NULL, go through lruvec */ - memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); - eviction = atomic_long_read(&lruvec->nonresident_age); -+ eviction >>= bucket_order; - workingset_age_nonresident(lruvec, folio_nr_pages(folio)); - return pack_shadow(memcgid, pgdat, eviction, - folio_test_workingset(folio)); -@@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) - int memcgid; - long nr; - -+ if (lru_gen_enabled()) { -+ lru_gen_refault(folio, shadow); -+ return; -+ } -+ - unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); -+ eviction <<= bucket_order; - - rcu_read_lock(); - /* --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (5 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao - ` (7 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Barry Song, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Searching the rmap for PTEs mapping each page on an LRU list (to test -and clear the accessed bit) can be expensive because pages from -different VMAs (PA space) are not cache friendly to the rmap (VA -space). For workloads mostly using mapped pages, searching the rmap -can incur the highest CPU cost in the reclaim path. - -This patch exploits spatial locality to reduce the trips into the -rmap. When shrink_page_list() walks the rmap and finds a young PTE, a -new function lru_gen_look_around() scans at most BITS_PER_LONG-1 -adjacent PTEs. On finding another young PTE, it clears the accessed -bit and updates the gen counter of the page mapped by this PTE to -(max_seq%MAX_NR_GENS)+1. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): no change - - Single workload: - memcached (anon): +[3, 5]% - Ops/sec KB/sec - patch1-6: 1106168.46 43025.04 - patch1-7: 1147696.57 44640.29 - - Configurations: - no change - -Client benchmark results: - kswapd profiles: - patch1-6 - 39.03% lzo1x_1_do_compress (real work) - 18.47% page_vma_mapped_walk (overhead) - 6.74% _raw_spin_unlock_irq - 3.97% do_raw_spin_lock - 2.49% ptep_clear_flush - 2.48% anon_vma_interval_tree_iter_first - 1.92% folio_referenced_one - 1.88% __zram_bvec_write - 1.48% memmove - 1.31% vma_interval_tree_iter_next - - patch1-7 - 48.16% lzo1x_1_do_compress (real work) - 8.20% page_vma_mapped_walk (overhead) - 7.06% _raw_spin_unlock_irq - 2.92% ptep_clear_flush - 2.53% __zram_bvec_write - 2.11% do_raw_spin_lock - 2.02% memmove - 1.93% lru_gen_look_around - 1.56% free_unref_page_list - 1.40% memset - - Configurations: - no change - -Signed-off-by: Yu Zhao -Acked-by: Barry Song -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/memcontrol.h | 31 +++++++ - include/linux/mm.h | 5 + - include/linux/mmzone.h | 6 ++ - mm/internal.h | 1 + - mm/memcontrol.c | 1 + - mm/rmap.c | 6 ++ - mm/swap.c | 4 +- - mm/vmscan.c | 184 +++++++++++++++++++++++++++++++++++++ - 8 files changed, 236 insertions(+), 2 deletions(-) - -diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h -index a2461f9a8738..9b8ab121d948 100644 ---- a/include/linux/memcontrol.h -+++ b/include/linux/memcontrol.h -@@ -445,6 +445,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - * - * 
For a kmem folio a caller should hold an rcu read lock to protect memcg - * associated with a kmem folio from being released. -@@ -506,6 +507,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - * - * For a kmem page a caller should hold an rcu read lock to protect memcg - * associated with a kmem page from being released. -@@ -960,6 +962,23 @@ void unlock_page_memcg(struct page *page); - - void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); - -+/* try to stablize folio_memcg() for all the pages in a memcg */ -+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -+{ -+ rcu_read_lock(); -+ -+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) -+ return true; -+ -+ rcu_read_unlock(); -+ return false; -+} -+ -+static inline void mem_cgroup_unlock_pages(void) -+{ -+ rcu_read_unlock(); -+} -+ - /* idx can be of type enum memcg_stat_item or node_stat_item */ - static inline void mod_memcg_state(struct mem_cgroup *memcg, - int idx, int val) -@@ -1434,6 +1453,18 @@ static inline void folio_memcg_unlock(struct folio *folio) - { - } - -+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -+{ -+ /* to match folio_memcg_rcu() */ -+ rcu_read_lock(); -+ return true; -+} -+ -+static inline void mem_cgroup_unlock_pages(void) -+{ -+ rcu_read_unlock(); -+} -+ - static inline void mem_cgroup_handle_over_high(void) - { - } -diff --git a/include/linux/mm.h b/include/linux/mm.h -index 8a5ad9d050bf..7cc9ffc19e7f 100644 ---- a/include/linux/mm.h -+++ b/include/linux/mm.h -@@ -1490,6 +1490,11 @@ static inline unsigned long folio_pfn(struct folio *folio) - return page_to_pfn(&folio->page); - } - -+static inline struct folio *pfn_folio(unsigned long pfn) -+{ -+ return page_folio(pfn_to_page(pfn)); -+} -+ - static inline atomic_t *folio_pincount_ptr(struct folio *folio) - { - return 
&folio_page(folio, 1)->compound_pincount; -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 7e343420bfb1..9ef5aa37c60c 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -375,6 +375,7 @@ enum lruvec_flags { - #ifndef __GENERATING_BOUNDS_H - - struct lruvec; -+struct page_vma_mapped_walk; - - #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) - #define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) -@@ -430,6 +431,7 @@ struct lru_gen_struct { - }; - - void lru_gen_init_lruvec(struct lruvec *lruvec); -+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); - - #ifdef CONFIG_MEMCG - void lru_gen_init_memcg(struct mem_cgroup *memcg); -@@ -442,6 +444,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) - { - } - -+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) -+{ -+} -+ - #ifdef CONFIG_MEMCG - static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) - { -diff --git a/mm/internal.h b/mm/internal.h -index 4df67b6b8cce..0082d5fdddac 100644 ---- a/mm/internal.h -+++ b/mm/internal.h -@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf); - void folio_rotate_reclaimable(struct folio *folio); - bool __folio_end_writeback(struct folio *folio); - void deactivate_file_folio(struct folio *folio); -+void folio_activate(struct folio *folio); - - void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); -diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 937141d48221..4ea49113b0dd 100644 ---- a/mm/memcontrol.c -+++ b/mm/memcontrol.c -@@ -2789,6 +2789,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) - * - LRU isolation - * - lock_page_memcg() - * - exclusive reference -+ * - mem_cgroup_trylock_pages() - */ - folio->memcg_data = (unsigned long)memcg; - } -diff --git a/mm/rmap.c b/mm/rmap.c -index 131def40e4f0..2ff17b9aabd9 100644 ---- a/mm/rmap.c -+++ 
b/mm/rmap.c -@@ -825,6 +825,12 @@ static bool folio_referenced_one(struct folio *folio, - } - - if (pvmw.pte) { -+ if (lru_gen_enabled() && pte_young(*pvmw.pte) && -+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { -+ lru_gen_look_around(&pvmw); -+ referenced++; -+ } -+ - if (ptep_clear_flush_young_notify(vma, address, - pvmw.pte)) { - /* -diff --git a/mm/swap.c b/mm/swap.c -index f74fd51fa9e1..0a3871a70952 100644 ---- a/mm/swap.c -+++ b/mm/swap.c -@@ -366,7 +366,7 @@ static void folio_activate_drain(int cpu) - folio_batch_move_lru(fbatch, folio_activate_fn); - } - --static void folio_activate(struct folio *folio) -+void folio_activate(struct folio *folio) - { - if (folio_test_lru(folio) && !folio_test_active(folio) && - !folio_test_unevictable(folio)) { -@@ -385,7 +385,7 @@ static inline void folio_activate_drain(int cpu) - { - } - --static void folio_activate(struct folio *folio) -+void folio_activate(struct folio *folio) - { - struct lruvec *lruvec; - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index d1e60feea8ab..33a1bdfc04bd 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -1635,6 +1635,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, - if (!sc->may_unmap && folio_mapped(folio)) - goto keep_locked; - -+ /* folio_update_gen() tried to promote this page? */ -+ if (lru_gen_enabled() && !ignore_references && -+ folio_mapped(folio) && folio_test_referenced(folio)) -+ goto keep_locked; -+ - /* - * The number of dirty pages determines if a node is marked - * reclaim_congested. 
kswapd will stall and start writing -@@ -3219,6 +3224,29 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) - * the aging - ******************************************************************************/ - -+/* promote pages accessed through page tables */ -+static int folio_update_gen(struct folio *folio, int gen) -+{ -+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); -+ -+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); -+ VM_WARN_ON_ONCE(!rcu_read_lock_held()); -+ -+ do { -+ /* lru_gen_del_folio() has isolated this page? */ -+ if (!(old_flags & LRU_GEN_MASK)) { -+ /* for shrink_page_list() */ -+ new_flags = old_flags | BIT(PG_referenced); -+ continue; -+ } -+ -+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; -+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); -+ -+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; -+} -+ - /* protect pages accessed multiple times through file descriptors */ - static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) - { -@@ -3230,6 +3258,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); - - do { -+ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; -+ /* folio_update_gen() has promoted this page? 
*/ -+ if (new_gen >= 0 && new_gen != old_gen) -+ return new_gen; -+ - new_gen = (old_gen + 1) % MAX_NR_GENS; - - new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); -@@ -3244,6 +3277,43 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - return new_gen; - } - -+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) -+{ -+ unsigned long pfn = pte_pfn(pte); -+ -+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); -+ -+ if (!pte_present(pte) || is_zero_pfn(pfn)) -+ return -1; -+ -+ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) -+ return -1; -+ -+ if (WARN_ON_ONCE(!pfn_valid(pfn))) -+ return -1; -+ -+ return pfn; -+} -+ -+static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, -+ struct pglist_data *pgdat) -+{ -+ struct folio *folio; -+ -+ /* try to avoid unnecessary memory loads */ -+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) -+ return NULL; -+ -+ folio = pfn_folio(pfn); -+ if (folio_nid(folio) != pgdat->node_id) -+ return NULL; -+ -+ if (folio_memcg_rcu(folio) != memcg) -+ return NULL; -+ -+ return folio; -+} -+ - static void inc_min_seq(struct lruvec *lruvec, int type) - { - struct lru_gen_struct *lrugen = &lruvec->lrugen; -@@ -3443,6 +3513,114 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); - } - -+/* -+ * This function exploits spatial locality when shrink_page_list() walks the -+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. 
-+ */ -+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) -+{ -+ int i; -+ pte_t *pte; -+ unsigned long start; -+ unsigned long end; -+ unsigned long addr; -+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; -+ struct folio *folio = pfn_folio(pvmw->pfn); -+ struct mem_cgroup *memcg = folio_memcg(folio); -+ struct pglist_data *pgdat = folio_pgdat(folio); -+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -+ DEFINE_MAX_SEQ(lruvec); -+ int old_gen, new_gen = lru_gen_from_seq(max_seq); -+ -+ lockdep_assert_held(pvmw->ptl); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); -+ -+ if (spin_is_contended(pvmw->ptl)) -+ return; -+ -+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); -+ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; -+ -+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { -+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) -+ end = start + MIN_LRU_BATCH * PAGE_SIZE; -+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) -+ start = end - MIN_LRU_BATCH * PAGE_SIZE; -+ else { -+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; -+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; -+ } -+ } -+ -+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; -+ -+ rcu_read_lock(); -+ arch_enter_lazy_mmu_mode(); -+ -+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { -+ unsigned long pfn; -+ -+ pfn = get_pte_pfn(pte[i], pvmw->vma, addr); -+ if (pfn == -1) -+ continue; -+ -+ if (!pte_young(pte[i])) -+ continue; -+ -+ folio = get_pfn_folio(pfn, memcg, pgdat); -+ if (!folio) -+ continue; -+ -+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) -+ VM_WARN_ON_ONCE(true); -+ -+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && -+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) && -+ !folio_test_swapcache(folio))) -+ folio_mark_dirty(folio); -+ -+ old_gen = folio_lru_gen(folio); -+ if (old_gen < 0) -+ folio_set_referenced(folio); -+ else if (old_gen != new_gen) 
-+ __set_bit(i, bitmap); -+ } -+ -+ arch_leave_lazy_mmu_mode(); -+ rcu_read_unlock(); -+ -+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { -+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { -+ folio = pfn_folio(pte_pfn(pte[i])); -+ folio_activate(folio); -+ } -+ return; -+ } -+ -+ /* folio_update_gen() requires stable folio_memcg() */ -+ if (!mem_cgroup_trylock_pages(memcg)) -+ return; -+ -+ spin_lock_irq(&lruvec->lru_lock); -+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ -+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { -+ folio = pfn_folio(pte_pfn(pte[i])); -+ if (folio_memcg_rcu(folio) != memcg) -+ continue; -+ -+ old_gen = folio_update_gen(folio, new_gen); -+ if (old_gen < 0 || old_gen == new_gen) -+ continue; -+ -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ } -+ -+ spin_unlock_irq(&lruvec->lru_lock); -+ -+ mem_cgroup_unlock_pages(); -+} -+ - /****************************************************************************** - * the eviction - ******************************************************************************/ -@@ -3479,6 +3657,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) - return true; - } - -+ /* promoted */ -+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { -+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); -+ return true; -+ } -+ - /* protected */ - if (tier > tier_idx) { - int hist = lru_hist_from_seq(lrugen->min_seq[type]); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (6 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:17 ` Yu Zhao - 2022-09-28 19:36 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao - ` (6 subsequent siblings) - 14 siblings, 2 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -To further exploit spatial locality, the aging prefers to walk page -tables to search for young PTEs and promote hot pages. A kill switch -will be added in the next patch to disable this behavior. When -disabled, the aging relies on the rmap only. - -NB: this behavior has nothing similar with the page table scanning in -the 2.4 kernel [1], which searches page tables for old PTEs, adds cold -pages to swapcache and unmaps them. - -To avoid confusion, the term "iteration" specifically means the -traversal of an entire mm_struct list; the term "walk" will be applied -to page tables and the rmap, as usual. - -An mm_struct list is maintained for each memcg, and an mm_struct -follows its owner task to the new memcg when this task is migrated. -Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls -walk_page_range() with each mm_struct on this list to promote hot -pages before it increments max_seq. 
- -When multiple page table walkers iterate the same list, each of them -gets a unique mm_struct; therefore they can run concurrently. Page -table walkers ignore any misplaced pages, e.g., if an mm_struct was -migrated, pages it left in the previous memcg will not be promoted -when its current memcg is under reclaim. Similarly, page table walkers -will not promote pages from nodes other than the one under reclaim. - -This patch uses the following optimizations when walking page tables: -1. It tracks the usage of mm_struct's between context switches so that - page table walkers can skip processes that have been sleeping since - the last iteration. -2. It uses generational Bloom filters to record populated branches so - that page table walkers can reduce their search space based on the - query results, e.g., to skip page tables containing mostly holes or - misplaced pages. -3. It takes advantage of the accessed bit in non-leaf PMD entries when - CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. -4. It does not zigzag between a PGD table and the same PMD table - spanning multiple VMAs. IOW, it finishes all the VMAs within the - range of the same PMD table before it returns to a PGD table. This - improves the cache performance for workloads that have large - numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. 
- -Server benchmark results: - Single workload: - fio (buffered I/O): no change - - Single workload: - memcached (anon): +[8, 10]% - Ops/sec KB/sec - patch1-7: 1147696.57 44640.29 - patch1-8: 1245274.91 48435.66 - - Configurations: - no change - -Client benchmark results: - kswapd profiles: - patch1-7 - 48.16% lzo1x_1_do_compress (real work) - 8.20% page_vma_mapped_walk (overhead) - 7.06% _raw_spin_unlock_irq - 2.92% ptep_clear_flush - 2.53% __zram_bvec_write - 2.11% do_raw_spin_lock - 2.02% memmove - 1.93% lru_gen_look_around - 1.56% free_unref_page_list - 1.40% memset - - patch1-8 - 49.44% lzo1x_1_do_compress (real work) - 6.19% page_vma_mapped_walk (overhead) - 5.97% _raw_spin_unlock_irq - 3.13% get_pfn_folio - 2.85% ptep_clear_flush - 2.42% __zram_bvec_write - 2.08% do_raw_spin_lock - 1.92% memmove - 1.44% alloc_zspage - 1.36% memset - - Configurations: - no change - -Thanks to the following developers for their efforts [3]. - kernel test robot - -[1] https://lwn.net/Articles/23732/ -[2] https://llvm.org/docs/ScudoHardenedAllocator.html -[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - fs/exec.c | 2 + - include/linux/memcontrol.h | 5 + - include/linux/mm_types.h | 76 +++ - include/linux/mmzone.h | 56 +- - include/linux/swap.h | 4 + - kernel/exit.c | 1 + - kernel/fork.c | 9 + - kernel/sched/core.c | 1 + - mm/memcontrol.c | 25 + - mm/vmscan.c | 1010 +++++++++++++++++++++++++++++++++++- - 10 files changed, 1172 insertions(+), 17 deletions(-) - -diff --git a/fs/exec.c b/fs/exec.c -index 9a5ca7b82bfc..507a317d54db 100644 ---- a/fs/exec.c -+++ b/fs/exec.c -@@ -1014,6 +1014,7 @@ static 
int exec_mmap(struct mm_struct *mm) - active_mm = tsk->active_mm; - tsk->active_mm = mm; - tsk->mm = mm; -+ lru_gen_add_mm(mm); - /* - * This prevents preemption while active_mm is being loaded and - * it and mm are being updated, which could cause problems for -@@ -1029,6 +1030,7 @@ static int exec_mmap(struct mm_struct *mm) - tsk->mm->vmacache_seqnum = 0; - vmacache_flush(tsk); - task_unlock(tsk); -+ lru_gen_use_mm(mm); - if (old_mm) { - mmap_read_unlock(old_mm); - BUG_ON(active_mm != old_mm); -diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h -index 9b8ab121d948..344022f102c2 100644 ---- a/include/linux/memcontrol.h -+++ b/include/linux/memcontrol.h -@@ -350,6 +350,11 @@ struct mem_cgroup { - struct deferred_split deferred_split_queue; - #endif - -+#ifdef CONFIG_LRU_GEN -+ /* per-memcg mm_struct list */ -+ struct lru_gen_mm_list mm_list; -+#endif -+ - struct mem_cgroup_per_node *nodeinfo[]; - }; - diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h -index cf97f3884fda..e1797813cc2c 100644 +index cf97f3884fda20..e1797813cc2c2b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -672,6 +672,22 @@ struct mm_struct { @@ -4013,12 +934,12 @@ index cf97f3884fda..e1797813cc2c 100644 + } lru_gen; +#endif /* CONFIG_LRU_GEN */ } __randomize_layout; - + /* @@ -698,6 +714,66 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return (struct cpumask *)&mm->cpu_bitmap; } - + +#ifdef CONFIG_LRU_GEN + +struct lru_gen_mm_list { @@ -4083,22 +1004,137 @@ index cf97f3884fda..e1797813cc2c 100644 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 9ef5aa37c60c..b1635c4020dc 100644 +index e24b40c52468a8..0c502618b37bf7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h -@@ -408,7 +408,7 @@ enum { - * min_seq behind. 
- * - * The number of pages in each generation is eventually consistent and therefore -- * can be transiently negative. -+ * can be transiently negative when reset_batch_size() is pending. - */ - struct lru_gen_struct { - /* the aging increments the youngest generation number */ -@@ -430,6 +430,53 @@ struct lru_gen_struct { - atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; +@@ -314,6 +314,207 @@ enum lruvec_flags { + */ }; - + ++#endif /* !__GENERATING_BOUNDS_H */ ++ ++/* ++ * Evictable pages are divided into multiple generations. The youngest and the ++ * oldest generation numbers, max_seq and min_seq, are monotonically increasing. ++ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An ++ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the ++ * corresponding generation. The gen counter in folio->flags stores gen+1 while ++ * a page is on one of lrugen->lists[]. Otherwise it stores 0. ++ * ++ * A page is added to the youngest generation on faulting. The aging needs to ++ * check the accessed bit at least twice before handing this page over to the ++ * eviction. The first check takes care of the accessed bit set on the initial ++ * fault; the second check makes sure this page hasn't been used since then. ++ * This process, AKA second chance, requires a minimum of two generations, ++ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive ++ * LRU, e.g., /proc/vmstat, these two generations are considered active; the ++ * rest of generations, if they exist, are considered inactive. See ++ * lru_gen_is_active(). ++ * ++ * PG_active is always cleared while a page is on one of lrugen->lists[] so that ++ * the aging needs not to worry about it. And it's set again when a page ++ * considered active is isolated for non-reclaiming purposes, e.g., migration. ++ * See lru_gen_add_folio() and lru_gen_del_folio(). 
++ * ++ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the ++ * number of categories of the active/inactive LRU when keeping track of ++ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits ++ * in folio->flags. ++ */ ++#define MIN_NR_GENS 2U ++#define MAX_NR_GENS 4U ++ ++/* ++ * Each generation is divided into multiple tiers. A page accessed N times ++ * through file descriptors is in tier order_base_2(N). A page in the first tier ++ * (N=0,1) is marked by PG_referenced unless it was faulted in through page ++ * tables or read ahead. A page in any other tier (N>1) is marked by ++ * PG_referenced and PG_workingset. This implies a minimum of two tiers is ++ * supported without using additional bits in folio->flags. ++ * ++ * In contrast to moving across generations which requires the LRU lock, moving ++ * across tiers only involves atomic operations on folio->flags and therefore ++ * has a negligible cost in the buffered access path. In the eviction path, ++ * comparisons of refaulted/(evicted+protected) from the first tier and the ++ * rest infer whether pages accessed multiple times through file descriptors ++ * are statistically hot and thus worth protecting. ++ * ++ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the ++ * number of categories of the active/inactive LRU when keeping track of ++ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in ++ * folio->flags. 
++ */ ++#define MAX_NR_TIERS 4U ++ ++#ifndef __GENERATING_BOUNDS_H ++ ++struct lruvec; ++struct page_vma_mapped_walk; ++ ++#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) ++#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) ++ ++#ifdef CONFIG_LRU_GEN ++ ++enum { ++ LRU_GEN_ANON, ++ LRU_GEN_FILE, ++}; ++ ++enum { ++ LRU_GEN_CORE, ++ LRU_GEN_MM_WALK, ++ LRU_GEN_NONLEAF_YOUNG, ++ NR_LRU_GEN_CAPS ++}; ++ ++#define MIN_LRU_BATCH BITS_PER_LONG ++#define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) ++ ++/* whether to keep historical stats from evicted generations */ ++#ifdef CONFIG_LRU_GEN_STATS ++#define NR_HIST_GENS MAX_NR_GENS ++#else ++#define NR_HIST_GENS 1U ++#endif ++ ++/* ++ * The youngest generation number is stored in max_seq for both anon and file ++ * types as they are aged on an equal footing. The oldest generation numbers are ++ * stored in min_seq[] separately for anon and file types as clean file pages ++ * can be evicted regardless of swap constraints. ++ * ++ * Normally anon and file min_seq are in sync. But if swapping is constrained, ++ * e.g., out of swap space, file min_seq is allowed to advance and leave anon ++ * min_seq behind. ++ * ++ * The number of pages in each generation is eventually consistent and therefore ++ * can be transiently negative when reset_batch_size() is pending. 
++ */ ++struct lru_gen_struct { ++ /* the aging increments the youngest generation number */ ++ unsigned long max_seq; ++ /* the eviction increments the oldest generation numbers */ ++ unsigned long min_seq[ANON_AND_FILE]; ++ /* the birth time of each generation in jiffies */ ++ unsigned long timestamps[MAX_NR_GENS]; ++ /* the multi-gen LRU lists, lazily sorted on eviction */ ++ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; ++ /* the multi-gen LRU sizes, eventually consistent */ ++ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; ++ /* the exponential moving average of refaulted */ ++ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; ++ /* the exponential moving average of evicted+protected */ ++ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; ++ /* the first tier doesn't need protection, hence the minus one */ ++ unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; ++ /* can be modified without holding the LRU lock */ ++ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; ++ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; ++ /* whether the multi-gen LRU is enabled */ ++ bool enabled; ++}; ++ +enum { + MM_LEAF_TOTAL, /* total leaf entries */ + MM_LEAF_OLD, /* old leaf entries */ @@ -4146,32 +1182,209 @@ index 9ef5aa37c60c..b1635c4020dc 100644 + bool force_scan; +}; + - void lru_gen_init_lruvec(struct lruvec *lruvec); - void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); - -@@ -480,6 +527,8 @@ struct lruvec { - #ifdef CONFIG_LRU_GEN - /* evictable pages divided into generations */ - struct lru_gen_struct lrugen; ++void lru_gen_init_lruvec(struct lruvec *lruvec); ++void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); ++ ++#ifdef CONFIG_MEMCG ++void lru_gen_init_memcg(struct mem_cgroup *memcg); ++void lru_gen_exit_memcg(struct mem_cgroup *memcg); ++#endif ++ ++#else /* !CONFIG_LRU_GEN */ ++ ++static inline void lru_gen_init_lruvec(struct lruvec *lruvec) ++{ 
++} ++ ++static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) ++{ ++} ++ ++#ifdef CONFIG_MEMCG ++static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) ++{ ++} ++ ++static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) ++{ ++} ++#endif ++ ++#endif /* CONFIG_LRU_GEN */ ++ + struct lruvec { + struct list_head lists[NR_LRU_LISTS]; + /* per lruvec lru_lock for memcg */ +@@ -331,6 +532,12 @@ struct lruvec { + unsigned long refaults[ANON_AND_FILE]; + /* Various lruvec state flags (enum lruvec_flags) */ + unsigned long flags; ++#ifdef CONFIG_LRU_GEN ++ /* evictable pages divided into generations */ ++ struct lru_gen_struct lrugen; + /* to concurrently iterate lru_gen_mm_list */ + struct lru_gen_mm_state mm_state; - #endif ++#endif #ifdef CONFIG_MEMCG struct pglist_data *pgdat; -@@ -1176,6 +1225,11 @@ typedef struct pglist_data { - + #endif +@@ -746,6 +953,8 @@ static inline bool zone_is_empty(struct zone *zone) + #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) + #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) + #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) ++#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) ++#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) + + /* + * Define the bit shifts to access each section. 
For non-existent +@@ -1007,6 +1216,11 @@ typedef struct pglist_data { + unsigned long flags; - + +#ifdef CONFIG_LRU_GEN + /* kswap mm walk data */ + struct lru_gen_mm_walk mm_walk; +#endif + ZONE_PADDING(_pad2_) - + /* Per-node vmstats */ +diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h +index 4b71a96190a84c..3a0eec9f2faa75 100644 +--- a/include/linux/nodemask.h ++++ b/include/linux/nodemask.h +@@ -493,6 +493,7 @@ static inline int num_node_state(enum node_states state) + #define first_online_node 0 + #define first_memory_node 0 + #define next_online_node(nid) (MAX_NUMNODES) ++#define next_memory_node(nid) (MAX_NUMNODES) + #define nr_node_ids 1U + #define nr_online_nodes 1U + +diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h +index ef1e3e736e1483..7d79818dc06513 100644 +--- a/include/linux/page-flags-layout.h ++++ b/include/linux/page-flags-layout.h +@@ -55,7 +55,8 @@ + #define SECTIONS_WIDTH 0 + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ ++ <= BITS_PER_LONG - NR_PAGEFLAGS + #define NODES_WIDTH NODES_SHIFT + #elif defined(CONFIG_SPARSEMEM_VMEMMAP) + #error "Vmemmap: No space for nodes field in page flags" +@@ -89,8 +90,8 @@ + #define LAST_CPUPID_SHIFT 0 + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ +- <= BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ ++ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS + #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT + #else + #define LAST_CPUPID_WIDTH 0 +@@ -100,10 +101,15 @@ + #define LAST_CPUPID_NOT_IN_PAGE_FLAGS + #endif + +-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ +- > BITS_PER_LONG - NR_PAGEFLAGS ++#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ ++ KASAN_TAG_WIDTH + 
LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS + #error "Not enough bits in page flags" + #endif + ++/* see the comment on MAX_NR_TIERS */ ++#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ ++ ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ ++ NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) ++ + #endif + #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ +diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h +index 465ff35a8c00a8..0b0ae5084e60c7 100644 +--- a/include/linux/page-flags.h ++++ b/include/linux/page-flags.h +@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) + 1UL << PG_private | 1UL << PG_private_2 | \ + 1UL << PG_writeback | 1UL << PG_reserved | \ + 1UL << PG_slab | 1UL << PG_active | \ +- 1UL << PG_unevictable | __PG_MLOCKED) ++ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) + + /* + * Flags checked when a page is prepped for return by the page allocator. +@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) + * alloc-free cycle to prevent from reusing the page. 
+ */ + #define PAGE_FLAGS_CHECK_AT_PREP \ +- (PAGEFLAGS_MASK & ~__PG_HWPOISON) ++ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) + + #define PAGE_FLAGS_PRIVATE \ + (1UL << PG_private | 1UL << PG_private_2) +diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h +index 014ee8f0fbaabc..d9095251bffd2f 100644 +--- a/include/linux/pgtable.h ++++ b/include/linux/pgtable.h +@@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, + #endif + + #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG +-#ifdef CONFIG_TRANSPARENT_HUGEPAGE ++#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) + static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +@@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, + BUILD_BUG(); + return 0; + } +-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ ++#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ + #endif + + #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH +@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + #endif + ++#ifndef arch_has_hw_pte_young ++/* ++ * Return whether the accessed bit is supported on the local CPU. ++ * ++ * This stub assumes accessing through an old PTE triggers a page fault. ++ * Architectures that automatically set the access bit should overwrite it. 
++ */ ++static inline bool arch_has_hw_pte_young(void) ++{ ++ return false; ++} ++#endif ++ + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR + static inline pte_t ptep_get_and_clear(struct mm_struct *mm, + unsigned long address, +diff --git a/include/linux/sched.h b/include/linux/sched.h +index e7b2f8a5c711c1..8cc46a789193eb 100644 +--- a/include/linux/sched.h ++++ b/include/linux/sched.h +@@ -914,6 +914,10 @@ struct task_struct { + #ifdef CONFIG_MEMCG + unsigned in_user_fault:1; + #endif ++#ifdef CONFIG_LRU_GEN ++ /* whether the LRU algorithm may apply to this access */ ++ unsigned in_lru_fault:1; ++#endif + #ifdef CONFIG_COMPAT_BRK + unsigned brk_randomized:1; + #endif diff --git a/include/linux/swap.h b/include/linux/swap.h -index 43150b9bbc5c..6308150b234a 100644 +index 43150b9bbc5caf..6308150b234a49 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -162,6 +162,10 @@ union swap_header { @@ -4183,10 +1396,40 @@ index 43150b9bbc5c..6308150b234a 100644 + struct lru_gen_mm_walk *mm_walk; +#endif }; - + #ifdef __KERNEL__ +diff --git a/kernel/bounds.c b/kernel/bounds.c +index 9795d75b09b232..b529182e8b04fc 100644 +--- a/kernel/bounds.c ++++ b/kernel/bounds.c +@@ -22,6 +22,13 @@ int main(void) + DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); + #endif + DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); ++#ifdef CONFIG_LRU_GEN ++ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); ++ DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); ++#else ++ DEFINE(LRU_GEN_WIDTH, 0); ++ DEFINE(__LRU_REFS_WIDTH, 0); ++#endif + /* End of constants */ + + return 0; +diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h +index 36b740cb3d59ef..63dc3e82be4f7f 100644 +--- a/kernel/cgroup/cgroup-internal.h ++++ b/kernel/cgroup/cgroup-internal.h +@@ -164,7 +164,6 @@ struct cgroup_mgctx { + #define DEFINE_CGROUP_MGCTX(name) \ + struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) + +-extern struct mutex cgroup_mutex; + extern spinlock_t css_set_lock; + extern struct 
cgroup_subsys *cgroup_subsys[]; + extern struct list_head cgroup_roots; diff --git a/kernel/exit.c b/kernel/exit.c -index 84021b24f79e..98a33bd7c25c 100644 +index 84021b24f79e3d..98a33bd7c25c50 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm) @@ -4198,16 +1441,16 @@ index 84021b24f79e..98a33bd7c25c 100644 put_task_struct(c); } diff --git a/kernel/fork.c b/kernel/fork.c -index 90c85b17bf69..d2da065442af 100644 +index 2b6bd511c6ed1c..2dd4ca002a368d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1152,6 +1152,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, goto fail_nocontext; - + mm->user_ns = get_user_ns(user_ns); + lru_gen_init_mm(mm); return mm; - + fail_nocontext: @@ -1194,6 +1195,7 @@ static inline void __mmput(struct mm_struct *mm) } @@ -4216,11 +1459,11 @@ index 90c85b17bf69..d2da065442af 100644 + lru_gen_del_mm(mm); mmdrop(mm); } - -@@ -2694,6 +2696,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) + +@@ -2692,6 +2694,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) get_task_struct(p); } - + + if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) { + /* lock the task to synchronize with memcg migration */ + task_lock(p); @@ -4229,28 +1472,115 @@ index 90c85b17bf69..d2da065442af 100644 + } + wake_up_new_task(p); - + /* forking complete and child started to run, tell ptracer */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c -index 8fccd8721bb8..2c605bdede47 100644 +index ee28253c9ac0c2..c48c0a19642b6c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c -@@ -5180,6 +5180,7 @@ context_switch(struct rq *rq, struct task_struct *prev, +@@ -5166,6 +5166,7 @@ context_switch(struct rq *rq, struct task_struct *prev, * finish_task_switch()'s mmdrop(). */ switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_use_mm(next->mm); - + if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). 
*/ +diff --git a/mm/Kconfig b/mm/Kconfig +index 0331f1461f81cd..96cd3ae25c6fcd 100644 +--- a/mm/Kconfig ++++ b/mm/Kconfig +@@ -1124,6 +1124,32 @@ config PTE_MARKER_UFFD_WP + purposes. It is required to enable userfaultfd write protection on + file-backed memory types like shmem and hugetlbfs. + ++# multi-gen LRU { ++config LRU_GEN ++ bool "Multi-Gen LRU" ++ depends on MMU ++ # make sure folio->flags has enough spare bits ++ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP ++ help ++ A high performance LRU implementation to overcommit memory. See ++ Documentation/admin-guide/mm/multigen_lru.rst for details. ++ ++config LRU_GEN_ENABLED ++ bool "Enable by default" ++ depends on LRU_GEN ++ help ++ This option enables the multi-gen LRU by default. ++ ++config LRU_GEN_STATS ++ bool "Full stats for debugging" ++ depends on LRU_GEN ++ help ++ Do not enable this option unless you plan to look at historical stats ++ from evicted generations for debugging purpose. ++ ++ This option has a per-memcg and per-node memory overhead. 
++# } ++ + source "mm/damon/Kconfig" + + endmenu +diff --git a/mm/huge_memory.c b/mm/huge_memory.c +index f42bb51e023a03..79e0b08b4cf93c 100644 +--- a/mm/huge_memory.c ++++ b/mm/huge_memory.c +@@ -2438,7 +2438,8 @@ static void __split_huge_page_tail(struct page *head, int tail, + #ifdef CONFIG_64BIT + (1L << PG_arch_2) | + #endif +- (1L << PG_dirty))); ++ (1L << PG_dirty) | ++ LRU_GEN_MASK | LRU_REFS_MASK)); + + /* ->mapping in first tail page is compound_mapcount */ + VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, +diff --git a/mm/internal.h b/mm/internal.h +index 785409805ed797..a1fddea6b34f41 100644 +--- a/mm/internal.h ++++ b/mm/internal.h +@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf); + void folio_rotate_reclaimable(struct folio *folio); + bool __folio_end_writeback(struct folio *folio); + void deactivate_file_folio(struct folio *folio); ++void folio_activate(struct folio *folio); + + void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, + unsigned long floor, unsigned long ceiling); diff --git a/mm/memcontrol.c b/mm/memcontrol.c -index 4ea49113b0dd..392b1fd1e8c4 100644 +index b69979c9ced5c2..1c18d7c1ce7174 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c -@@ -6204,6 +6204,30 @@ static void mem_cgroup_move_task(void) +@@ -2789,6 +2789,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) + * - LRU isolation + * - lock_page_memcg() + * - exclusive reference ++ * - mem_cgroup_trylock_pages() + */ + folio->memcg_data = (unsigned long)memcg; + } +@@ -5170,6 +5171,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) + + static void mem_cgroup_free(struct mem_cgroup *memcg) + { ++ lru_gen_exit_memcg(memcg); + memcg_wb_domain_exit(memcg); + __mem_cgroup_free(memcg); + } +@@ -5228,6 +5230,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) + memcg->deferred_split_queue.split_queue_len = 0; + #endif + idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); ++ 
lru_gen_init_memcg(memcg); + return memcg; + fail: + mem_cgroup_id_remove(memcg); +@@ -6196,6 +6199,30 @@ static void mem_cgroup_move_task(void) } #endif - + +#ifdef CONFIG_LRU_GEN +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ @@ -4278,7 +1608,7 @@ index 4ea49113b0dd..392b1fd1e8c4 100644 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { if (value == PAGE_COUNTER_MAX) -@@ -6609,6 +6633,7 @@ struct cgroup_subsys memory_cgrp_subsys = { +@@ -6601,6 +6628,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, @@ -4286,43 +1616,602 @@ index 4ea49113b0dd..392b1fd1e8c4 100644 .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, .dfl_cftypes = memory_files, +diff --git a/mm/memory.c b/mm/memory.c +index a78814413ac03e..cd1b5bfd9f3e9d 100644 +--- a/mm/memory.c ++++ b/mm/memory.c +@@ -125,18 +125,6 @@ int randomize_va_space __read_mostly = + 2; + #endif + +-#ifndef arch_faults_on_old_pte +-static inline bool arch_faults_on_old_pte(void) +-{ +- /* +- * Those arches which don't have hw access flag feature need to +- * implement their own helper. By default, "true" means pagefault +- * will be hit on old pte. +- */ +- return true; +-} +-#endif +- + #ifndef arch_wants_old_prefaulted_pte + static inline bool arch_wants_old_prefaulted_pte(void) + { +@@ -2870,7 +2858,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, + * On architectures with software "accessed" bits, we would + * take a double page fault, so mark it accessed here. 
+ */ +- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { ++ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { + pte_t entry; + + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); +@@ -5120,6 +5108,27 @@ static inline void mm_account_fault(struct pt_regs *regs, + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); + } + ++#ifdef CONFIG_LRU_GEN ++static void lru_gen_enter_fault(struct vm_area_struct *vma) ++{ ++ /* the LRU algorithm doesn't apply to sequential or random reads */ ++ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); ++} ++ ++static void lru_gen_exit_fault(void) ++{ ++ current->in_lru_fault = false; ++} ++#else ++static void lru_gen_enter_fault(struct vm_area_struct *vma) ++{ ++} ++ ++static void lru_gen_exit_fault(void) ++{ ++} ++#endif /* CONFIG_LRU_GEN */ ++ + /* + * By the time we get here, we already hold the mm semaphore + * +@@ -5151,11 +5160,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, + if (flags & FAULT_FLAG_USER) + mem_cgroup_enter_user_fault(); + ++ lru_gen_enter_fault(vma); ++ + if (unlikely(is_vm_hugetlb_page(vma))) + ret = hugetlb_fault(vma->vm_mm, vma, address, flags); + else + ret = __handle_mm_fault(vma, address, flags); + ++ lru_gen_exit_fault(); ++ + if (flags & FAULT_FLAG_USER) { + mem_cgroup_exit_user_fault(); + /* +diff --git a/mm/mm_init.c b/mm/mm_init.c +index 9ddaf0e1b0ab95..0d7b2bd2454a1f 100644 +--- a/mm/mm_init.c ++++ b/mm/mm_init.c +@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) + + shift = 8 * sizeof(unsigned long); + width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH +- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; ++ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; + mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", +- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", ++ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags 
%d\n", + SECTIONS_WIDTH, + NODES_WIDTH, + ZONES_WIDTH, + LAST_CPUPID_WIDTH, + KASAN_TAG_WIDTH, ++ LRU_GEN_WIDTH, ++ LRU_REFS_WIDTH, + NR_PAGEFLAGS); + mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", +diff --git a/mm/mmzone.c b/mm/mmzone.c +index 0ae7571e35abb0..68e1511be12de6 100644 +--- a/mm/mmzone.c ++++ b/mm/mmzone.c +@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec) + * Poison its list head, so that any operations on it would crash. + */ + list_del(&lruvec->lists[LRU_UNEVICTABLE]); ++ ++ lru_gen_init_lruvec(lruvec); + } + + #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) +diff --git a/mm/rmap.c b/mm/rmap.c +index 93d5a6f793d204..9e0ce48bca085d 100644 +--- a/mm/rmap.c ++++ b/mm/rmap.c +@@ -833,6 +833,12 @@ static bool folio_referenced_one(struct folio *folio, + } + + if (pvmw.pte) { ++ if (lru_gen_enabled() && pte_young(*pvmw.pte) && ++ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { ++ lru_gen_look_around(&pvmw); ++ referenced++; ++ } ++ + if (ptep_clear_flush_young_notify(vma, address, + pvmw.pte)) { + /* +diff --git a/mm/swap.c b/mm/swap.c +index 9cee7f6a380942..0a3871a70952f3 100644 +--- a/mm/swap.c ++++ b/mm/swap.c +@@ -366,7 +366,7 @@ static void folio_activate_drain(int cpu) + folio_batch_move_lru(fbatch, folio_activate_fn); + } + +-static void folio_activate(struct folio *folio) ++void folio_activate(struct folio *folio) + { + if (folio_test_lru(folio) && !folio_test_active(folio) && + !folio_test_unevictable(folio)) { +@@ -385,7 +385,7 @@ static inline void folio_activate_drain(int cpu) + { + } + +-static void folio_activate(struct folio *folio) ++void folio_activate(struct folio *folio) + { + struct lruvec *lruvec; + +@@ -428,6 +428,40 @@ static void __lru_cache_activate_folio(struct folio *folio) + local_unlock(&cpu_fbatches.lock); + } + ++#ifdef CONFIG_LRU_GEN ++static void folio_inc_refs(struct folio *folio) ++{ ++ unsigned long new_flags, 
old_flags = READ_ONCE(folio->flags); ++ ++ if (folio_test_unevictable(folio)) ++ return; ++ ++ if (!folio_test_referenced(folio)) { ++ folio_set_referenced(folio); ++ return; ++ } ++ ++ if (!folio_test_workingset(folio)) { ++ folio_set_workingset(folio); ++ return; ++ } ++ ++ /* see the comment on MAX_NR_TIERS */ ++ do { ++ new_flags = old_flags & LRU_REFS_MASK; ++ if (new_flags == LRU_REFS_MASK) ++ break; ++ ++ new_flags += BIT(LRU_REFS_PGOFF); ++ new_flags |= old_flags & ~LRU_REFS_MASK; ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++} ++#else ++static void folio_inc_refs(struct folio *folio) ++{ ++} ++#endif /* CONFIG_LRU_GEN */ ++ + /* + * Mark a page as having seen activity. + * +@@ -440,6 +474,11 @@ static void __lru_cache_activate_folio(struct folio *folio) + */ + void folio_mark_accessed(struct folio *folio) + { ++ if (lru_gen_enabled()) { ++ folio_inc_refs(folio); ++ return; ++ } ++ + if (!folio_test_referenced(folio)) { + folio_set_referenced(folio); + } else if (folio_test_unevictable(folio)) { +@@ -484,6 +523,11 @@ void folio_add_lru(struct folio *folio) + folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + ++ /* see the comment in lru_gen_add_folio() */ ++ if (lru_gen_enabled() && !folio_test_unevictable(folio) && ++ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) ++ folio_set_active(folio); ++ + folio_get(folio); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); +@@ -575,7 +619,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) + + static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) + { +- if (folio_test_active(folio) && !folio_test_unevictable(folio)) { ++ if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) { + long nr_pages = folio_nr_pages(folio); + + lruvec_del_folio(lruvec, folio); +@@ -688,8 +732,8 @@ void deactivate_page(struct page *page) + { + struct folio 
*folio = page_folio(page); + +- if (folio_test_lru(folio) && folio_test_active(folio) && +- !folio_test_unevictable(folio)) { ++ if (folio_test_lru(folio) && !folio_test_unevictable(folio) && ++ (folio_test_active(folio) || lru_gen_enabled())) { + struct folio_batch *fbatch; + + folio_get(folio); diff --git a/mm/vmscan.c b/mm/vmscan.c -index 33a1bdfc04bd..c579b254fec7 100644 +index 382dbe97329f33..146a54cf1bd9e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c -@@ -49,6 +49,8 @@ +@@ -49,6 +49,10 @@ #include #include #include +#include +#include - ++#include ++#include + #include #include -@@ -3082,7 +3084,7 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ - for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) - --static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid) -+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) - { - struct pglist_data *pgdat = NODE_DATA(nid); - -@@ -3127,6 +3129,371 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) - get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; +@@ -129,6 +133,12 @@ struct scan_control { + /* Always discard instead of demoting to lower tier memory */ + unsigned int no_demotion:1; + ++#ifdef CONFIG_LRU_GEN ++ /* help kswapd make better choices among multiple memcgs */ ++ unsigned int memcgs_need_aging:1; ++ unsigned long last_reclaimed; ++#endif ++ + /* Allocation order */ + s8 order; + +@@ -1334,9 +1344,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, + + if (folio_test_swapcache(folio)) { + swp_entry_t swap = folio_swap_entry(folio); +- mem_cgroup_swapout(folio, swap); ++ ++ /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ + if (reclaimed && !mapping_exiting(mapping)) + shadow = workingset_eviction(folio, target_memcg); ++ mem_cgroup_swapout(folio, swap); + __delete_from_swap_cache(folio, swap, shadow); + xa_unlock_irq(&mapping->i_pages); 
+ put_swap_page(&folio->page, swap); +@@ -1633,6 +1645,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, + if (!sc->may_unmap && folio_mapped(folio)) + goto keep_locked; + ++ /* folio_update_gen() tried to promote this page? */ ++ if (lru_gen_enabled() && !ignore_references && ++ folio_mapped(folio) && folio_test_referenced(folio)) ++ goto keep_locked; ++ + /* + * The number of dirty pages determines if a node is marked + * reclaim_congested. kswapd will stall and start writing +@@ -2728,6 +2745,112 @@ enum scan_balance { + SCAN_FILE, + }; + ++static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) ++{ ++ unsigned long file; ++ struct lruvec *target_lruvec; ++ ++ if (lru_gen_enabled()) ++ return; ++ ++ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); ++ ++ /* ++ * Flush the memory cgroup stats, so that we read accurate per-memcg ++ * lruvec stats for heuristics. ++ */ ++ mem_cgroup_flush_stats(); ++ ++ /* ++ * Determine the scan balance between anon and file LRUs. ++ */ ++ spin_lock_irq(&target_lruvec->lru_lock); ++ sc->anon_cost = target_lruvec->anon_cost; ++ sc->file_cost = target_lruvec->file_cost; ++ spin_unlock_irq(&target_lruvec->lru_lock); ++ ++ /* ++ * Target desirable inactive:active list ratios for the anon ++ * and file LRU lists. ++ */ ++ if (!sc->force_deactivate) { ++ unsigned long refaults; ++ ++ refaults = lruvec_page_state(target_lruvec, ++ WORKINGSET_ACTIVATE_ANON); ++ if (refaults != target_lruvec->refaults[0] || ++ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) ++ sc->may_deactivate |= DEACTIVATE_ANON; ++ else ++ sc->may_deactivate &= ~DEACTIVATE_ANON; ++ ++ /* ++ * When refaults are being observed, it means a new ++ * workingset is being established. Deactivate to get ++ * rid of any stale active pages quickly. 
++ */ ++ refaults = lruvec_page_state(target_lruvec, ++ WORKINGSET_ACTIVATE_FILE); ++ if (refaults != target_lruvec->refaults[1] || ++ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) ++ sc->may_deactivate |= DEACTIVATE_FILE; ++ else ++ sc->may_deactivate &= ~DEACTIVATE_FILE; ++ } else ++ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; ++ ++ /* ++ * If we have plenty of inactive file pages that aren't ++ * thrashing, try to reclaim those first before touching ++ * anonymous pages. ++ */ ++ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); ++ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) ++ sc->cache_trim_mode = 1; ++ else ++ sc->cache_trim_mode = 0; ++ ++ /* ++ * Prevent the reclaimer from falling into the cache trap: as ++ * cache pages start out inactive, every cache fault will tip ++ * the scan balance towards the file LRU. And as the file LRU ++ * shrinks, so does the window for rotation from references. ++ * This means we have a runaway feedback loop where a tiny ++ * thrashing file LRU becomes infinitely more attractive than ++ * anon pages. Try to detect this based on file LRU size. ++ */ ++ if (!cgroup_reclaim(sc)) { ++ unsigned long total_high_wmark = 0; ++ unsigned long free, anon; ++ int z; ++ ++ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); ++ file = node_page_state(pgdat, NR_ACTIVE_FILE) + ++ node_page_state(pgdat, NR_INACTIVE_FILE); ++ ++ for (z = 0; z < MAX_NR_ZONES; z++) { ++ struct zone *zone = &pgdat->node_zones[z]; ++ ++ if (!managed_zone(zone)) ++ continue; ++ ++ total_high_wmark += high_wmark_pages(zone); ++ } ++ ++ /* ++ * Consider anon: if that's low too, this isn't a ++ * runaway file reclaim problem, but rather just ++ * extreme pressure. Reclaim as per usual then. 
++ */ ++ anon = node_page_state(pgdat, NR_INACTIVE_ANON); ++ ++ sc->file_is_tiny = ++ file + free <= total_high_wmark && ++ !(sc->may_deactivate & DEACTIVATE_ANON) && ++ anon >> sc->priority; ++ } ++} ++ + /* + * Determine how aggressively the anon and file LRU lists should be + * scanned. +@@ -2947,152 +3070,2904 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, + return can_demote(pgdat->node_id, sc); } - + +-static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +-{ +- unsigned long nr[NR_LRU_LISTS]; +- unsigned long targets[NR_LRU_LISTS]; +- unsigned long nr_to_scan; +- enum lru_list lru; +- unsigned long nr_reclaimed = 0; +- unsigned long nr_to_reclaim = sc->nr_to_reclaim; +- struct blk_plug plug; +- bool scan_adjusted; ++#ifdef CONFIG_LRU_GEN + +- get_scan_count(lruvec, sc, nr); ++#ifdef CONFIG_LRU_GEN_ENABLED ++DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); ++#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) ++#else ++DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); ++#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap]) ++#endif + +- /* Record the original scan target for proportional adjustments later */ +- memcpy(targets, nr, sizeof(nr)); ++/****************************************************************************** ++ * shorthand helpers ++ ******************************************************************************/ + +- /* +- * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal +- * event that can occur when there is little memory pressure e.g. +- * multiple streaming readers/writers. Hence, we do not abort scanning +- * when the requested number of pages are reclaimed when scanning at +- * DEF_PRIORITY on the assumption that the fact we are direct +- * reclaiming implies that kswapd is not keeping up and it is best to +- * do a batch of work at once. 
For memcg reclaim one check is made to +- * abort proportional reclaim if either the file or anon lru has already +- * dropped to zero at the first pass. +- */ +- scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && +- sc->priority == DEF_PRIORITY); ++#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) + +- blk_start_plug(&plug); +- while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || +- nr[LRU_INACTIVE_FILE]) { +- unsigned long nr_anon, nr_file, percentage; +- unsigned long nr_scanned; ++#define DEFINE_MAX_SEQ(lruvec) \ ++ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) + +- for_each_evictable_lru(lru) { +- if (nr[lru]) { +- nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); +- nr[lru] -= nr_to_scan; ++#define DEFINE_MIN_SEQ(lruvec) \ ++ unsigned long min_seq[ANON_AND_FILE] = { \ ++ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ ++ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ ++ } + +- nr_reclaimed += shrink_list(lru, nr_to_scan, +- lruvec, sc); +- } +- } ++#define for_each_gen_type_zone(gen, type, zone) \ ++ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ ++ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ ++ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) + +- cond_resched(); ++static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) ++{ ++ struct pglist_data *pgdat = NODE_DATA(nid); + +- if (nr_reclaimed < nr_to_reclaim || scan_adjusted) +- continue; ++#ifdef CONFIG_MEMCG ++ if (memcg) { ++ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; + +- /* +- * For kswapd and memcg, reclaim at least the number of pages +- * requested. Ensure that the anon and file LRUs are scanned +- * proportionally what was requested by get_scan_count(). We +- * stop reclaiming one LRU and reduce the amount scanning +- * proportional to the original scan target. 
+- */ +- nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; +- nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; ++ /* for hotadd_new_pgdat() */ ++ if (!lruvec->pgdat) ++ lruvec->pgdat = pgdat; + +- /* +- * It's just vindictive to attack the larger once the smaller +- * has gone to zero. And given the way we stop scanning the +- * smaller below, this makes sure that we only make one nudge +- * towards proportionality once we've got nr_to_reclaim. +- */ +- if (!nr_file || !nr_anon) +- break; ++ return lruvec; ++ } ++#endif ++ VM_WARN_ON_ONCE(!mem_cgroup_disabled()); + +- if (nr_file > nr_anon) { +- unsigned long scan_target = targets[LRU_INACTIVE_ANON] + +- targets[LRU_ACTIVE_ANON] + 1; +- lru = LRU_BASE; +- percentage = nr_anon * 100 / scan_target; +- } else { +- unsigned long scan_target = targets[LRU_INACTIVE_FILE] + +- targets[LRU_ACTIVE_FILE] + 1; +- lru = LRU_FILE; +- percentage = nr_file * 100 / scan_target; +- } ++ return pgdat ? &pgdat->__lruvec : NULL; ++} + +- /* Stop scanning the smaller of the LRU */ +- nr[lru] = 0; +- nr[lru + LRU_ACTIVE] = 0; ++static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ struct pglist_data *pgdat = lruvec_pgdat(lruvec); + +- /* +- * Recalculate the other LRU scan count based on its original +- * scan target and the percentage scanning already complete +- */ +- lru = (lru == LRU_FILE) ? 
LRU_BASE : LRU_FILE; +- nr_scanned = targets[lru] - nr[lru]; +- nr[lru] = targets[lru] * (100 - percentage) / 100; +- nr[lru] -= min(nr[lru], nr_scanned); ++ if (!can_demote(pgdat->node_id, sc) && ++ mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) ++ return 0; + +- lru += LRU_ACTIVE; +- nr_scanned = targets[lru] - nr[lru]; +- nr[lru] = targets[lru] * (100 - percentage) / 100; +- nr[lru] -= min(nr[lru], nr_scanned); ++ return mem_cgroup_swappiness(memcg); ++} + +- scan_adjusted = true; +- } +- blk_finish_plug(&plug); +- sc->nr_reclaimed += nr_reclaimed; ++static int get_nr_gens(struct lruvec *lruvec, int type) ++{ ++ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; ++} + +- /* +- * Even if we did not try to evict anon pages at all, we want to +- * rebalance the anon lru active/inactive ratio. +- */ +- if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && +- inactive_is_low(lruvec, LRU_INACTIVE_ANON)) +- shrink_active_list(SWAP_CLUSTER_MAX, lruvec, +- sc, LRU_ACTIVE_ANON); ++static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) ++{ ++ /* see the comment on lru_gen_struct */ ++ return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && ++ get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && ++ get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; + } + +-/* Use reclaim/compaction for costly allocs or under memory pressure */ +-static bool in_reclaim_compaction(struct scan_control *sc) +/****************************************************************************** + * mm_struct list + ******************************************************************************/ + +static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg) -+{ + { +- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && +- (sc->order > PAGE_ALLOC_COSTLY_ORDER || +- sc->priority < DEF_PRIORITY - 2)) +- return true; + static struct lru_gen_mm_list mm_list = { + .fifo = LIST_HEAD_INIT(mm_list.fifo), + .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock), + }; -+ 
+ +- return false; +#ifdef CONFIG_MEMCG + if (memcg) + return &memcg->mm_list; @@ -4330,21 +2219,38 @@ index 33a1bdfc04bd..c579b254fec7 100644 + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); + + return &mm_list; -+} -+ + } + +-/* +- * Reclaim/compaction is used for high-order allocation requests. It reclaims +- * order-0 pages before compacting the zone. should_continue_reclaim() returns +- * true if more pages should be reclaimed such that when the page allocator +- * calls try_to_compact_pages() that it will have enough free pages to succeed. +- * It will give up earlier than that if there is difficulty reclaiming pages. +- */ +-static inline bool should_continue_reclaim(struct pglist_data *pgdat, +- unsigned long nr_reclaimed, +- struct scan_control *sc) +void lru_gen_add_mm(struct mm_struct *mm) -+{ + { +- unsigned long pages_for_compaction; +- unsigned long inactive_lru_pages; +- int z; + int nid; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); -+ + +- /* If not in reclaim/compaction mode, stop */ +- if (!in_reclaim_compaction(sc)) +- return false; + VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list)); +#ifdef CONFIG_MEMCG + VM_WARN_ON_ONCE(mm->lru_gen.memcg); + mm->lru_gen.memcg = memcg; +#endif + spin_lock(&mm_list->lock); -+ + +- /* + for_each_node_state(nid, N_MEMORY) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + @@ -4677,13 +2583,156 @@ index 33a1bdfc04bd..c579b254fec7 100644 + return success; +} + - /****************************************************************************** - * refault feedback loop - ******************************************************************************/ -@@ -3277,6 +3644,118 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai - return new_gen; - } - ++/****************************************************************************** ++ * refault feedback loop ++ 
******************************************************************************/ ++ ++/* ++ * A feedback loop based on Proportional-Integral-Derivative (PID) controller. ++ * ++ * The P term is refaulted/(evicted+protected) from a tier in the generation ++ * currently being evicted; the I term is the exponential moving average of the ++ * P term over the generations previously evicted, using the smoothing factor ++ * 1/2; the D term isn't supported. ++ * ++ * The setpoint (SP) is always the first tier of one type; the process variable ++ * (PV) is either any tier of the other type or any other tier of the same ++ * type. ++ * ++ * The error is the difference between the SP and the PV; the correction is to ++ * turn off protection when SP>PV or turn on protection when SPlrugen; ++ int hist = lru_hist_from_seq(lrugen->min_seq[type]); ++ ++ pos->refaulted = lrugen->avg_refaulted[type][tier] + ++ atomic_long_read(&lrugen->refaulted[hist][type][tier]); ++ pos->total = lrugen->avg_total[type][tier] + ++ atomic_long_read(&lrugen->evicted[hist][type][tier]); ++ if (tier) ++ pos->total += lrugen->protected[hist][type][tier - 1]; ++ pos->gain = gain; ++} ++ ++static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover) ++{ ++ int hist, tier; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1; ++ unsigned long seq = carryover ? 
lrugen->min_seq[type] : lrugen->max_seq + 1; ++ ++ lockdep_assert_held(&lruvec->lru_lock); ++ ++ if (!carryover && !clear) ++ return; ++ ++ hist = lru_hist_from_seq(seq); ++ ++ for (tier = 0; tier < MAX_NR_TIERS; tier++) { ++ if (carryover) { ++ unsigned long sum; ++ ++ sum = lrugen->avg_refaulted[type][tier] + ++ atomic_long_read(&lrugen->refaulted[hist][type][tier]); ++ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); ++ ++ sum = lrugen->avg_total[type][tier] + ++ atomic_long_read(&lrugen->evicted[hist][type][tier]); ++ if (tier) ++ sum += lrugen->protected[hist][type][tier - 1]; ++ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); ++ } ++ ++ if (clear) { ++ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); ++ atomic_long_set(&lrugen->evicted[hist][type][tier], 0); ++ if (tier) ++ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); ++ } ++ } ++} ++ ++static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) ++{ ++ /* ++ * Return true if the PV has a limited number of refaults or a lower ++ * refaulted/total than the SP. ++ */ ++ return pv->refaulted < MIN_LRU_BATCH || ++ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= ++ (sp->refaulted + 1) * pv->total * pv->gain; ++} ++ ++/****************************************************************************** ++ * the aging ++ ******************************************************************************/ ++ ++/* promote pages accessed through page tables */ ++static int folio_update_gen(struct folio *folio, int gen) ++{ ++ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); ++ ++ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); ++ VM_WARN_ON_ONCE(!rcu_read_lock_held()); ++ ++ do { ++ /* lru_gen_del_folio() has isolated this page? 
*/ ++ if (!(old_flags & LRU_GEN_MASK)) { ++ /* for shrink_page_list() */ ++ new_flags = old_flags | BIT(PG_referenced); ++ continue; ++ } ++ ++ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); ++ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++ ++ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; ++} ++ ++/* protect pages accessed multiple times through file descriptors */ ++static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) ++{ ++ int type = folio_is_file_lru(folio); ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ unsigned long new_flags, old_flags = READ_ONCE(folio->flags); ++ ++ VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); ++ ++ do { ++ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; ++ /* folio_update_gen() has promoted this page? */ ++ if (new_gen >= 0 && new_gen != old_gen) ++ return new_gen; ++ ++ new_gen = (old_gen + 1) % MAX_NR_GENS; ++ ++ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); ++ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; ++ /* for folio_end_writeback() */ ++ if (reclaiming) ++ new_flags |= BIT(PG_reclaim); ++ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); ++ ++ lru_gen_update_size(lruvec, folio, old_gen, new_gen); ++ ++ return new_gen; ++} ++ +static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio, + int old_gen, int new_gen) +{ @@ -4796,13 +2845,24 @@ index 33a1bdfc04bd..c579b254fec7 100644 + return false; +} + - static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) - { - unsigned long pfn = pte_pfn(pte); -@@ -3295,8 +3774,28 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned - return pfn; - } - ++static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned 
long addr) ++{ ++ unsigned long pfn = pte_pfn(pte); ++ ++ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); ++ ++ if (!pte_present(pte) || is_zero_pfn(pfn)) ++ return -1; ++ ++ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) ++ return -1; ++ ++ if (WARN_ON_ONCE(!pfn_valid(pfn))) ++ return -1; ++ ++ return pfn; ++} ++ +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr) +{ @@ -4823,23 +2883,29 @@ index 33a1bdfc04bd..c579b254fec7 100644 +} +#endif + - static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, -- struct pglist_data *pgdat) ++static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, + struct pglist_data *pgdat, bool can_swap) - { - struct folio *folio; - -@@ -3311,9 +3810,375 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, - if (folio_memcg_rcu(folio) != memcg) - return NULL; - ++{ ++ struct folio *folio; ++ ++ /* try to avoid unnecessary memory loads */ ++ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) ++ return NULL; ++ ++ folio = pfn_folio(pfn); ++ if (folio_nid(folio) != pgdat->node_id) ++ return NULL; ++ ++ if (folio_memcg_rcu(folio) != memcg) ++ return NULL; ++ + /* file VMAs can contain anon pages from COW */ + if (!folio_is_file_lru(folio) && !can_swap) + return NULL; + - return folio; - } - ++ return folio; ++} ++ +static bool suitable_to_scan(int total, int young) +{ + int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8); @@ -4963,7 +3029,8 @@ index 33a1bdfc04bd..c579b254fec7 100644 + goto next; + + if (!pmd_trans_huge(pmd[i])) { -+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) ++ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && ++ get_cap(LRU_GEN_NONLEAF_YOUNG)) + pmdp_test_and_clear_young(vma, addr, pmd + i); + goto next; + } @@ -5061,10 +3128,12 @@ index 33a1bdfc04bd..c579b254fec7 
100644 + walk->mm_stats[MM_NONLEAF_TOTAL]++; + +#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -+ if (!pmd_young(val)) -+ continue; ++ if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { ++ if (!pmd_young(val)) ++ continue; + -+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); ++ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); ++ } +#endif + if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) + continue; @@ -5202,39 +3271,143 @@ index 33a1bdfc04bd..c579b254fec7 100644 + kfree(walk); +} + - static void inc_min_seq(struct lruvec *lruvec, int type) - { - struct lru_gen_struct *lrugen = &lruvec->lrugen; -@@ -3365,7 +4230,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) - return success; - } - --static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap) -+static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - { - int prev, next; - int type, zone; -@@ -3375,9 +4240,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s - - VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); - -- if (max_seq != lrugen->max_seq) -- goto unlock; -- - for (type = ANON_AND_FILE - 1; type >= 0; type--) { - if (get_nr_gens(lruvec, type) != MAX_NR_GENS) - continue; -@@ -3415,10 +4277,76 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s - - /* make sure preceding modifications appear */ - smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); --unlock: ++static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) ++{ ++ int zone; ++ int remaining = MAX_LRU_BATCH; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ ++ if (type == LRU_GEN_ANON && !can_swap) ++ goto done; ++ ++ /* prevent cold/hot inversion if force_scan is true */ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ struct list_head *head = &lrugen->lists[old_gen][type][zone]; ++ ++ while 
(!list_empty(head)) { ++ struct folio *folio = lru_to_folio(head); ++ ++ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); ++ ++ new_gen = folio_inc_gen(lruvec, folio, false); ++ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]); ++ ++ if (!--remaining) ++ return false; ++ } ++ } ++done: ++ reset_ctrl_pos(lruvec, type, true); ++ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); ++ ++ return true; ++} ++ ++static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) ++{ ++ int gen, type, zone; ++ bool success = false; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ DEFINE_MIN_SEQ(lruvec); ++ ++ VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); ++ ++ /* find the oldest populated generation */ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { ++ gen = lru_gen_from_seq(min_seq[type]); ++ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ if (!list_empty(&lrugen->lists[gen][type][zone])) ++ goto next; ++ } ++ ++ min_seq[type]++; ++ } ++next: ++ ; ++ } ++ ++ /* see the comment on lru_gen_struct */ ++ if (can_swap) { ++ min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); ++ min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); ++ } ++ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ if (min_seq[type] == lrugen->min_seq[type]) ++ continue; ++ ++ reset_ctrl_pos(lruvec, type, true); ++ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); ++ success = true; ++ } ++ ++ return success; ++} ++ ++static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) ++{ ++ int prev, next; ++ int type, zone; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ spin_lock_irq(&lruvec->lru_lock); ++ ++ 
VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); ++ ++ for (type = ANON_AND_FILE - 1; type >= 0; type--) { ++ if (get_nr_gens(lruvec, type) != MAX_NR_GENS) ++ continue; ++ ++ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); ++ ++ while (!inc_min_seq(lruvec, type, can_swap)) { ++ spin_unlock_irq(&lruvec->lru_lock); ++ cond_resched(); ++ spin_lock_irq(&lruvec->lru_lock); ++ } ++ } ++ ++ /* ++ * Update the active/inactive LRU sizes for compatibility. Both sides of ++ * the current max_seq need to be covered, since max_seq+1 can overlap ++ * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do ++ * overlap, cold/hot inversion happens. ++ */ ++ prev = lru_gen_from_seq(lrugen->max_seq - 1); ++ next = lru_gen_from_seq(lrugen->max_seq + 1); ++ ++ for (type = 0; type < ANON_AND_FILE; type++) { ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) { ++ enum lru_list lru = type * LRU_INACTIVE_FILE; ++ long delta = lrugen->nr_pages[prev][type][zone] - ++ lrugen->nr_pages[next][type][zone]; ++ ++ if (!delta) ++ continue; ++ ++ __update_lru_size(lruvec, lru, zone, delta); ++ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); ++ } ++ } ++ ++ for (type = 0; type < ANON_AND_FILE; type++) ++ reset_ctrl_pos(lruvec, type, false); ++ ++ WRITE_ONCE(lrugen->timestamps[next], jiffies); ++ /* make sure preceding modifications appear */ ++ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++} + - spin_unlock_irq(&lruvec->lru_lock); - } - +static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, -+ struct scan_control *sc, bool can_swap) ++ struct scan_control *sc, bool can_swap, bool force_scan) +{ + bool success; + struct lru_gen_mm_walk *walk; @@ -5255,7 +3428,7 @@ index 33a1bdfc04bd..c579b254fec7 100644 + * handful of PTEs. Spreading the work out over a period of time usually + * is less efficient, but it avoids bursty page faults. 
+ */ -+ if (!arch_has_hw_pte_young()) { ++ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } @@ -5269,7 +3442,7 @@ index 33a1bdfc04bd..c579b254fec7 100644 + walk->lruvec = lruvec; + walk->max_seq = max_seq; + walk->can_swap = can_swap; -+ walk->force_scan = false; ++ walk->force_scan = force_scan; + + do { + success = iterate_mm_list(lruvec, walk, &mm); @@ -5289,379 +3462,118 @@ index 33a1bdfc04bd..c579b254fec7 100644 + + VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq)); + -+ inc_max_seq(lruvec, can_swap); ++ inc_max_seq(lruvec, can_swap, force_scan); + /* either this sees any waiters or they will see updated max_seq */ + if (wq_has_sleeper(&lruvec->mm_state.wait)) + wake_up_all(&lruvec->mm_state.wait); + -+ wakeup_flusher_threads(WB_REASON_VMSCAN); ++ return true; ++} ++ ++static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, ++ struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan) ++{ ++ int gen, type, zone; ++ unsigned long old = 0; ++ unsigned long young = 0; ++ unsigned long total = 0; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ ++ for (type = !can_swap; type < ANON_AND_FILE; type++) { ++ unsigned long seq; ++ ++ for (seq = min_seq[type]; seq <= max_seq; seq++) { ++ unsigned long size = 0; ++ ++ gen = lru_gen_from_seq(seq); ++ ++ for (zone = 0; zone < MAX_NR_ZONES; zone++) ++ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L); ++ ++ total += size; ++ if (seq == max_seq) ++ young += size; ++ else if (seq + MIN_NR_GENS == max_seq) ++ old += size; ++ } ++ } ++ ++ /* try to scrape all its memory if this memcg was deleted */ ++ *nr_to_scan = mem_cgroup_online(memcg) ? 
(total >> sc->priority) : total; ++ ++ /* ++ * The aging tries to be lazy to reduce the overhead, while the eviction ++ * stalls when the number of generations reaches MIN_NR_GENS. Hence, the ++ * ideal number of generations is MIN_NR_GENS+1. ++ */ ++ if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) ++ return true; ++ if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) ++ return false; ++ ++ /* ++ * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1) ++ * of the total number of pages for each generation. A reasonable range ++ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The ++ * aging cares about the upper bound of hot pages, while the eviction ++ * cares about the lower bound of cold pages. ++ */ ++ if (young * MIN_NR_GENS > total) ++ return true; ++ if (old * (MIN_NR_GENS + 2) < total) ++ return true; ++ ++ return false; ++} ++ ++static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long min_ttl) ++{ ++ bool need_aging; ++ unsigned long nr_to_scan; ++ int swappiness = get_swappiness(lruvec, sc); ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ DEFINE_MAX_SEQ(lruvec); ++ DEFINE_MIN_SEQ(lruvec); ++ ++ VM_WARN_ON_ONCE(sc->memcg_low_reclaim); ++ ++ mem_cgroup_calculate_protection(NULL, memcg); ++ ++ if (mem_cgroup_below_min(memcg)) ++ return false; ++ ++ need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); ++ ++ if (min_ttl) { ++ int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); ++ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); ++ ++ if (time_is_after_jiffies(birth + min_ttl)) ++ return false; ++ ++ /* the size is likely too small to be helpful */ ++ if (!nr_to_scan && sc->priority != DEF_PRIORITY) ++ return false; ++ } ++ ++ if (need_aging) ++ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); + + return true; +} + - static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, - struct scan_control *sc, 
bool can_swap, unsigned long *nr_to_scan) - { -@@ -3494,7 +4422,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) - - need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); - if (need_aging) -- inc_max_seq(lruvec, max_seq, swappiness); -+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); - } - - static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) -@@ -3503,6 +4431,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - VM_WARN_ON_ONCE(!current_is_kswapd()); - -+ set_mm_walk(pgdat); ++/* to protect the working set of the last N jiffies */ ++static unsigned long lru_gen_min_ttl __read_mostly; + - memcg = mem_cgroup_iter(NULL, NULL, NULL); - do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); -@@ -3511,11 +4441,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - cond_resched(); - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); ++static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) ++{ ++ struct mem_cgroup *memcg; ++ bool success = false; ++ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + -+ clear_mm_walk(); - } - - /* - * This function exploits spatial locality when shrink_page_list() walks the -- * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. -+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If -+ * the scan was done cacheline efficiently, it adds the PMD entry pointing to -+ * the PTE table to the Bloom filter. This forms a feedback loop between the -+ * eviction and the aging. 
- */ - void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - { -@@ -3524,6 +4459,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - unsigned long start; - unsigned long end; - unsigned long addr; -+ struct lru_gen_mm_walk *walk; -+ int young = 0; - unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; - struct folio *folio = pfn_folio(pvmw->pfn); - struct mem_cgroup *memcg = folio_memcg(folio); -@@ -3538,6 +4475,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (spin_is_contended(pvmw->ptl)) - return; - -+ /* avoid taking the LRU lock under the PTL when possible */ -+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; ++ VM_WARN_ON_ONCE(!current_is_kswapd()); + - start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); - end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; - -@@ -3567,13 +4507,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (!pte_young(pte[i])) - continue; - -- folio = get_pfn_folio(pfn, memcg, pgdat); -+ folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap); - if (!folio) - continue; - - if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); - -+ young++; -+ - if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && - !(folio_test_anon(folio) && folio_test_swapbacked(folio) && - !folio_test_swapcache(folio))) -@@ -3589,7 +4531,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - arch_leave_lazy_mmu_mode(); - rcu_read_unlock(); - -- if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { -+ /* feedback from rmap walkers to page table walkers */ -+ if (suitable_to_scan(i, young)) -+ update_bloom_filter(lruvec, max_seq, pvmw->pmd); -+ -+ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { - for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { - folio = pfn_folio(pte_pfn(pte[i])); - folio_activate(folio); -@@ -3601,8 +4547,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk 
*pvmw) - if (!mem_cgroup_trylock_pages(memcg)) - return; - -- spin_lock_irq(&lruvec->lru_lock); -- new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ if (!walk) { -+ spin_lock_irq(&lruvec->lru_lock); -+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); -+ } - - for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { - folio = pfn_folio(pte_pfn(pte[i])); -@@ -3613,10 +4561,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) - if (old_gen < 0 || old_gen == new_gen) - continue; - -- lru_gen_update_size(lruvec, folio, old_gen, new_gen); -+ if (walk) -+ update_batch_size(walk, folio, old_gen, new_gen); -+ else -+ lru_gen_update_size(lruvec, folio, old_gen, new_gen); - } - -- spin_unlock_irq(&lruvec->lru_lock); -+ if (!walk) -+ spin_unlock_irq(&lruvec->lru_lock); - - mem_cgroup_unlock_pages(); - } -@@ -3899,6 +4851,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - struct folio *folio; - enum vm_event_item item; - struct reclaim_stat stat; -+ struct lru_gen_mm_walk *walk; - struct mem_cgroup *memcg = lruvec_memcg(lruvec); - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - -@@ -3935,6 +4888,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - - move_pages_to_lru(lruvec, &list); - -+ walk = current->reclaim_state->mm_walk; -+ if (walk && walk->batched) -+ reset_batch_size(lruvec, walk); -+ - item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; - if (!cgroup_reclaim(sc)) - __count_vm_events(item, reclaimed); -@@ -3951,6 +4908,11 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - return scanned; - } - -+/* -+ * For future optimizations: -+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg -+ * reclaim. 
-+ */ - static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, - bool can_swap) - { -@@ -3976,7 +4938,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - if (current_is_kswapd()) - return 0; - -- inc_max_seq(lruvec, max_seq, can_swap); -+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap)) -+ return nr_to_scan; - done: - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; - } -@@ -3990,6 +4953,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - - blk_start_plug(&plug); - -+ set_mm_walk(lruvec_pgdat(lruvec)); -+ - while (true) { - int delta; - int swappiness; -@@ -4017,6 +4982,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - cond_resched(); - } - -+ clear_mm_walk(); -+ - blk_finish_plug(&plug); - } - -@@ -4033,15 +5000,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec) - - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -+ -+ lruvec->mm_state.seq = MIN_NR_GENS; -+ init_waitqueue_head(&lruvec->mm_state.wait); - } - - #ifdef CONFIG_MEMCG - void lru_gen_init_memcg(struct mem_cgroup *memcg) - { -+ INIT_LIST_HEAD(&memcg->mm_list.fifo); -+ spin_lock_init(&memcg->mm_list.lock); - } - - void lru_gen_exit_memcg(struct mem_cgroup *memcg) - { -+ int i; - int nid; - - for_each_node(nid) { -@@ -4049,6 +5022,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg) - - VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, - sizeof(lruvec->lrugen.nr_pages))); -+ -+ for (i = 0; i < NR_BLOOM_FILTERS; i++) { -+ bitmap_free(lruvec->mm_state.filters[i]); -+ lruvec->mm_state.filters[i] = NULL; -+ } - } - } - #endif --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (7 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-28 18:46 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch Yu Zhao - ` (5 subsequent siblings) - 14 siblings, 1 reply; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -When multiple memcgs are available, it is possible to use generations -as a frame of reference to make better choices and improve overall -performance under global memory pressure. This patch adds a basic -optimization to select memcgs that can drop single-use unmapped clean -pages first. Doing so reduces the chance of going into the aging path -or swapping, which can be costly. - -A typical example that benefits from this optimization is a server -running mixed types of workloads, e.g., heavy anon workload in one -memcg and heavy buffered I/O workload in the other. - -Though this optimization can be applied to both kswapd and direct -reclaim, it is only added to kswapd to keep the patchset manageable. -Later improvements may cover the direct reclaim path. - -While ensuring certain fairness to all eligible memcgs, proportional -scans of individual memcgs also require proper backoff to avoid -overshooting their aggregate reclaim target by too much. Otherwise it -can cause high direct reclaim latency. 
The conditions for backoff are: -1. At low priorities, for direct reclaim, if aging fairness or direct - reclaim latency is at risk, i.e., aging one memcg multiple times or - swapping after the target is met. -2. At high priorities, for global reclaim, if per-zone free pages are - above respective watermarks. - -Server benchmark results: - Mixed workloads: - fio (buffered I/O): +[19, 21]% - IOPS BW - patch1-8: 1880k 7343MiB/s - patch1-9: 2252k 8796MiB/s - - memcached (anon): +[119, 123]% - Ops/sec KB/sec - patch1-8: 862768.65 33514.68 - patch1-9: 1911022.12 74234.54 - - Mixed workloads: - fio (buffered I/O): +[75, 77]% - IOPS BW - 5.19-rc1: 1279k 4996MiB/s - patch1-9: 2252k 8796MiB/s - - memcached (anon): +[13, 15]% - Ops/sec KB/sec - 5.19-rc1: 1673524.04 65008.87 - patch1-9: 1911022.12 74234.54 - - Configurations: - (changes since patch 6) - - cat mixed.sh - modprobe brd rd_nr=2 rd_size=56623104 - - swapoff -a - mkswap /dev/ram0 - swapon /dev/ram0 - - mkfs.ext4 /dev/ram1 - mount -t ext4 /dev/ram1 /mnt - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \ - --ratio 1:0 --pipeline 8 -d 2000 - - fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \ - --buffered=1 --ioengine=io_uring --iodepth=128 \ - --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ - --rw=randread --random_distribution=random --norandommap \ - --time_based --ramp_time=10m --runtime=90m --group_reporting & - pid=$! 
- - sleep 200 - - memtier_benchmark -S /var/run/memcached/memcached.sock \ - -P memcache_binary -n allkeys --key-minimum=1 \ - --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \ - --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed - - kill -INT $pid - wait - -Client benchmark results: - no change (CONFIG_MEMCG=n) - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - mm/vmscan.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++----- - 1 file changed, 96 insertions(+), 9 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index c579b254fec7..3f83325fdc71 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -131,6 +131,12 @@ struct scan_control { - /* Always discard instead of demoting to lower tier memory */ - unsigned int no_demotion:1; - -+#ifdef CONFIG_LRU_GEN -+ /* help kswapd make better choices among multiple memcgs */ -+ unsigned int memcgs_need_aging:1; -+ unsigned long last_reclaimed; -+#endif -+ - /* Allocation order */ - s8 order; - -@@ -4431,6 +4437,19 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - - VM_WARN_ON_ONCE(!current_is_kswapd()); - + sc->last_reclaimed = sc->nr_reclaimed; + + /* @@ -5675,55 +3587,542 @@ index c579b254fec7..3f83325fdc71 100644 + return; + } + - set_mm_walk(pgdat); - - memcg = mem_cgroup_iter(NULL, NULL, NULL); -@@ -4842,7 +4861,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw - return scanned; - } - --static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) ++ set_mm_walk(pgdat); ++ ++ memcg = mem_cgroup_iter(NULL, NULL, NULL); ++ do { ++ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ ++ 
if (age_lruvec(lruvec, sc, min_ttl)) ++ success = true; ++ ++ cond_resched(); ++ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); ++ ++ clear_mm_walk(); ++ ++ /* check the order to exclude compaction-induced reclaim */ ++ if (success || !min_ttl || sc->order) ++ return; ++ ++ /* ++ * The main goal is to OOM kill if every generation from all memcgs is ++ * younger than min_ttl. However, another possibility is all memcgs are ++ * either below min or empty. ++ */ ++ if (mutex_trylock(&oom_lock)) { ++ struct oom_control oc = { ++ .gfp_mask = sc->gfp_mask, ++ }; ++ ++ out_of_memory(&oc); ++ ++ mutex_unlock(&oom_lock); ++ } ++} ++ ++/* ++ * This function exploits spatial locality when shrink_page_list() walks the ++ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If ++ * the scan was done cacheline efficiently, it adds the PMD entry pointing to ++ * the PTE table to the Bloom filter. This forms a feedback loop between the ++ * eviction and the aging. ++ */ ++void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) ++{ ++ int i; ++ pte_t *pte; ++ unsigned long start; ++ unsigned long end; ++ unsigned long addr; ++ struct lru_gen_mm_walk *walk; ++ int young = 0; ++ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; ++ struct folio *folio = pfn_folio(pvmw->pfn); ++ struct mem_cgroup *memcg = folio_memcg(folio); ++ struct pglist_data *pgdat = folio_pgdat(folio); ++ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ DEFINE_MAX_SEQ(lruvec); ++ int old_gen, new_gen = lru_gen_from_seq(max_seq); ++ ++ lockdep_assert_held(pvmw->ptl); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); ++ ++ if (spin_is_contended(pvmw->ptl)) ++ return; ++ ++ /* avoid taking the LRU lock under the PTL when possible */ ++ walk = current->reclaim_state ? 
current->reclaim_state->mm_walk : NULL; ++ ++ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); ++ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; ++ ++ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { ++ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) ++ end = start + MIN_LRU_BATCH * PAGE_SIZE; ++ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) ++ start = end - MIN_LRU_BATCH * PAGE_SIZE; ++ else { ++ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; ++ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; ++ } ++ } ++ ++ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; ++ ++ rcu_read_lock(); ++ arch_enter_lazy_mmu_mode(); ++ ++ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { ++ unsigned long pfn; ++ ++ pfn = get_pte_pfn(pte[i], pvmw->vma, addr); ++ if (pfn == -1) ++ continue; ++ ++ if (!pte_young(pte[i])) ++ continue; ++ ++ folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap); ++ if (!folio) ++ continue; ++ ++ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) ++ VM_WARN_ON_ONCE(true); ++ ++ young++; ++ ++ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && ++ !(folio_test_anon(folio) && folio_test_swapbacked(folio) && ++ !folio_test_swapcache(folio))) ++ folio_mark_dirty(folio); ++ ++ old_gen = folio_lru_gen(folio); ++ if (old_gen < 0) ++ folio_set_referenced(folio); ++ else if (old_gen != new_gen) ++ __set_bit(i, bitmap); ++ } ++ ++ arch_leave_lazy_mmu_mode(); ++ rcu_read_unlock(); ++ ++ /* feedback from rmap walkers to page table walkers */ ++ if (suitable_to_scan(i, young)) ++ update_bloom_filter(lruvec, max_seq, pvmw->pmd); ++ ++ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { ++ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { ++ folio = pfn_folio(pte_pfn(pte[i])); ++ folio_activate(folio); ++ } ++ return; ++ } ++ ++ /* folio_update_gen() requires stable folio_memcg() */ ++ if (!mem_cgroup_trylock_pages(memcg)) ++ return; ++ ++ if (!walk) { ++ 
spin_lock_irq(&lruvec->lru_lock); ++ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); ++ } ++ ++ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { ++ folio = pfn_folio(pte_pfn(pte[i])); ++ if (folio_memcg_rcu(folio) != memcg) ++ continue; ++ ++ old_gen = folio_update_gen(folio, new_gen); ++ if (old_gen < 0 || old_gen == new_gen) ++ continue; ++ ++ if (walk) ++ update_batch_size(walk, folio, old_gen, new_gen); ++ else ++ lru_gen_update_size(lruvec, folio, old_gen, new_gen); ++ } ++ ++ if (!walk) ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ mem_cgroup_unlock_pages(); ++} ++ ++/****************************************************************************** ++ * the eviction ++ ******************************************************************************/ ++ ++static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) ++{ ++ bool success; ++ int gen = folio_lru_gen(folio); ++ int type = folio_is_file_lru(folio); ++ int zone = folio_zonenum(folio); ++ int delta = folio_nr_pages(folio); ++ int refs = folio_lru_refs(folio); ++ int tier = lru_tier_from_refs(refs); ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); ++ ++ /* unevictable */ ++ if (!folio_evictable(folio)) { ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ folio_set_unevictable(folio); ++ lruvec_add_folio(lruvec, folio); ++ __count_vm_events(UNEVICTABLE_PGCULLED, delta); ++ return true; ++ } ++ ++ /* dirty lazyfree */ ++ if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ folio_set_swapbacked(folio); ++ lruvec_add_folio_tail(lruvec, folio); ++ return true; ++ } ++ ++ /* promoted */ ++ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { ++ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); ++ return true; ++ } ++ ++ /* protected */ ++ if (tier > 
tier_idx) { ++ int hist = lru_hist_from_seq(lrugen->min_seq[type]); ++ ++ gen = folio_inc_gen(lruvec, folio, false); ++ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); ++ ++ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], ++ lrugen->protected[hist][type][tier - 1] + delta); ++ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); ++ return true; ++ } ++ ++ /* waiting for writeback */ ++ if (folio_test_locked(folio) || folio_test_writeback(folio) || ++ (type == LRU_GEN_FILE && folio_test_dirty(folio))) { ++ gen = folio_inc_gen(lruvec, folio, true); ++ list_move(&folio->lru, &lrugen->lists[gen][type][zone]); ++ return true; ++ } ++ ++ return false; ++} ++ ++static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) ++{ ++ bool success; ++ ++ /* unmapping inhibited */ ++ if (!sc->may_unmap && folio_mapped(folio)) ++ return false; ++ ++ /* swapping inhibited */ ++ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && ++ (folio_test_dirty(folio) || ++ (folio_test_anon(folio) && !folio_test_swapcache(folio)))) ++ return false; ++ ++ /* raced with release_pages() */ ++ if (!folio_try_get(folio)) ++ return false; ++ ++ /* raced with another isolation */ ++ if (!folio_test_clear_lru(folio)) { ++ folio_put(folio); ++ return false; ++ } ++ ++ /* see the comment on MAX_NR_TIERS */ ++ if (!folio_test_referenced(folio)) ++ set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0); ++ ++ /* for shrink_page_list() */ ++ folio_clear_reclaim(folio); ++ folio_clear_referenced(folio); ++ ++ success = lru_gen_del_folio(lruvec, folio, true); ++ VM_WARN_ON_ONCE_FOLIO(!success, folio); ++ ++ return true; ++} ++ ++static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, ++ int type, int tier, struct list_head *list) ++{ ++ int gen, zone; ++ enum vm_event_item item; ++ int sorted = 0; ++ int scanned = 0; ++ int isolated = 0; ++ int remaining = MAX_LRU_BATCH; ++ struct lru_gen_struct *lrugen = 
&lruvec->lrugen; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ ++ VM_WARN_ON_ONCE(!list_empty(list)); ++ ++ if (get_nr_gens(lruvec, type) == MIN_NR_GENS) ++ return 0; ++ ++ gen = lru_gen_from_seq(lrugen->min_seq[type]); ++ ++ for (zone = sc->reclaim_idx; zone >= 0; zone--) { ++ LIST_HEAD(moved); ++ int skipped = 0; ++ struct list_head *head = &lrugen->lists[gen][type][zone]; ++ ++ while (!list_empty(head)) { ++ struct folio *folio = lru_to_folio(head); ++ int delta = folio_nr_pages(folio); ++ ++ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); ++ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); ++ ++ scanned += delta; ++ ++ if (sort_folio(lruvec, folio, tier)) ++ sorted += delta; ++ else if (isolate_folio(lruvec, folio, sc)) { ++ list_add(&folio->lru, list); ++ isolated += delta; ++ } else { ++ list_move(&folio->lru, &moved); ++ skipped += delta; ++ } ++ ++ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) ++ break; ++ } ++ ++ if (skipped) { ++ list_splice(&moved, head); ++ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); ++ } ++ ++ if (!remaining || isolated >= MIN_LRU_BATCH) ++ break; ++ } ++ ++ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; ++ if (!cgroup_reclaim(sc)) { ++ __count_vm_events(item, isolated); ++ __count_vm_events(PGREFILL, sorted); ++ } ++ __count_memcg_events(memcg, item, isolated); ++ __count_memcg_events(memcg, PGREFILL, sorted); ++ __count_vm_events(PGSCAN_ANON + type, isolated); ++ ++ /* ++ * There might not be eligible pages due to reclaim_idx, may_unmap and ++ * may_writepage. Check the remaining to prevent livelock if it's not ++ * making progress. ++ */ ++ return isolated || !remaining ? 
scanned : 0; ++} ++ ++static int get_tier_idx(struct lruvec *lruvec, int type) ++{ ++ int tier; ++ struct ctrl_pos sp, pv; ++ ++ /* ++ * To leave a margin for fluctuations, use a larger gain factor (1:2). ++ * This value is chosen because any other tier would have at least twice ++ * as many refaults as the first tier. ++ */ ++ read_ctrl_pos(lruvec, type, 0, 1, &sp); ++ for (tier = 1; tier < MAX_NR_TIERS; tier++) { ++ read_ctrl_pos(lruvec, type, tier, 2, &pv); ++ if (!positive_ctrl_err(&sp, &pv)) ++ break; ++ } ++ ++ return tier - 1; ++} ++ ++static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) ++{ ++ int type, tier; ++ struct ctrl_pos sp, pv; ++ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; ++ ++ /* ++ * Compare the first tier of anon with that of file to determine which ++ * type to scan. Also need to compare other tiers of the selected type ++ * with the first tier of the other type to determine the last tier (of ++ * the selected type) to evict. ++ */ ++ read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); ++ read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); ++ type = positive_ctrl_err(&sp, &pv); ++ ++ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); ++ for (tier = 1; tier < MAX_NR_TIERS; tier++) { ++ read_ctrl_pos(lruvec, type, tier, gain[type], &pv); ++ if (!positive_ctrl_err(&sp, &pv)) ++ break; ++ } ++ ++ *tier_idx = tier - 1; ++ ++ return type; ++} ++ ++static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, ++ int *type_scanned, struct list_head *list) ++{ ++ int i; ++ int type; ++ int scanned; ++ int tier = -1; ++ DEFINE_MIN_SEQ(lruvec); ++ ++ /* ++ * Try to make the obvious choice first. When anon and file are both ++ * available from the same generation, interpret swappiness 1 as file ++ * first and 200 as anon first. 
++ */ ++ if (!swappiness) ++ type = LRU_GEN_FILE; ++ else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) ++ type = LRU_GEN_ANON; ++ else if (swappiness == 1) ++ type = LRU_GEN_FILE; ++ else if (swappiness == 200) ++ type = LRU_GEN_ANON; ++ else ++ type = get_type_to_scan(lruvec, swappiness, &tier); ++ ++ for (i = !swappiness; i < ANON_AND_FILE; i++) { ++ if (tier < 0) ++ tier = get_tier_idx(lruvec, type); ++ ++ scanned = scan_folios(lruvec, sc, type, tier, list); ++ if (scanned) ++ break; ++ ++ type = !type; ++ tier = -1; ++ } ++ ++ *type_scanned = type; ++ ++ return scanned; ++} ++ +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + bool *need_swapping) - { - int type; - int scanned; -@@ -4905,6 +4925,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - - sc->nr_reclaimed += reclaimed; - ++{ ++ int type; ++ int scanned; ++ int reclaimed; ++ LIST_HEAD(list); ++ struct folio *folio; ++ enum vm_event_item item; ++ struct reclaim_stat stat; ++ struct lru_gen_mm_walk *walk; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ struct pglist_data *pgdat = lruvec_pgdat(lruvec); ++ ++ spin_lock_irq(&lruvec->lru_lock); ++ ++ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); ++ ++ scanned += try_to_inc_min_seq(lruvec, swappiness); ++ ++ if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS) ++ scanned = 0; ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ if (list_empty(&list)) ++ return scanned; ++ ++ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); ++ ++ list_for_each_entry(folio, &list, lru) { ++ /* restore LRU_REFS_FLAGS cleared by isolate_folio() */ ++ if (folio_test_workingset(folio)) ++ folio_set_referenced(folio); ++ ++ /* don't add rejected pages to the oldest generation */ ++ if (folio_test_reclaim(folio) && ++ (folio_test_dirty(folio) || folio_test_writeback(folio))) ++ folio_clear_active(folio); ++ else ++ folio_set_active(folio); ++ } ++ ++ 
spin_lock_irq(&lruvec->lru_lock); ++ ++ move_pages_to_lru(lruvec, &list); ++ ++ walk = current->reclaim_state->mm_walk; ++ if (walk && walk->batched) ++ reset_batch_size(lruvec, walk); ++ ++ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; ++ if (!cgroup_reclaim(sc)) ++ __count_vm_events(item, reclaimed); ++ __count_memcg_events(memcg, item, reclaimed); ++ __count_vm_events(PGSTEAL_ANON + type, reclaimed); ++ ++ spin_unlock_irq(&lruvec->lru_lock); ++ ++ mem_cgroup_uncharge_list(&list); ++ free_unref_page_list(&list); ++ ++ sc->nr_reclaimed += reclaimed; ++ + if (need_swapping && type == LRU_GEN_ANON) + *need_swapping = true; + - return scanned; - } - -@@ -4914,9 +4937,8 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap - * reclaim. - */ - static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, -- bool can_swap) ++ return scanned; ++} ++ ++/* ++ * For future optimizations: ++ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg ++ * reclaim. 
++ */ ++static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, + bool can_swap, bool *need_aging) - { -- bool need_aging; - unsigned long nr_to_scan; - struct mem_cgroup *memcg = lruvec_memcg(lruvec); - DEFINE_MAX_SEQ(lruvec); -@@ -4926,8 +4948,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) - return 0; - -- need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); -- if (!need_aging) ++{ ++ unsigned long nr_to_scan; ++ struct mem_cgroup *memcg = lruvec_memcg(lruvec); ++ DEFINE_MAX_SEQ(lruvec); ++ DEFINE_MIN_SEQ(lruvec); ++ ++ if (mem_cgroup_below_min(memcg) || ++ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) ++ return 0; ++ + *need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); + if (!*need_aging) - return nr_to_scan; - - /* skip the aging path at the default priority */ -@@ -4944,10 +4966,68 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; - } - ++ return nr_to_scan; ++ ++ /* skip the aging path at the default priority */ ++ if (sc->priority == DEF_PRIORITY) ++ goto done; ++ ++ /* leave the work to lru_gen_age_node() */ ++ if (current_is_kswapd()) ++ return 0; ++ ++ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) ++ return nr_to_scan; ++done: ++ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; ++} ++ +static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, + struct scan_control *sc, bool need_swapping) +{ @@ -5731,7 +4130,7 @@ index c579b254fec7..3f83325fdc71 100644 + DEFINE_MAX_SEQ(lruvec); + + if (!current_is_kswapd()) { -+ /* age each memcg once to ensure fairness */ ++ /* age each memcg at most once to ensure fairness */ + if (max_seq - seq > 1) + return true; + @@ -5756,10 +4155,9 @@ index c579b254fec7..3f83325fdc71 100644 + + /* + * A minimum amount of work was done under global memory pressure. For -+ * kswapd, it may be overshooting. For direct reclaim, the target isn't -+ * met, and yet the allocation may still succeed, since kswapd may have -+ * caught up. In either case, it's better to stop now, and restart if -+ * necessary. ++ * kswapd, it may be overshooting. For direct reclaim, the allocation ++ * may succeed if all suitable zones are somewhat safe. In either case, ++ * it's better to stop now, and restart later if necessary. + */ + for (i = 0; i <= sc->reclaim_idx; i++) { + unsigned long wmark; @@ -5778,332 +4176,60 @@ index c579b254fec7..3f83325fdc71 100644 + return true; +} + - static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) - { - struct blk_plug plug; ++static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ struct blk_plug plug; + bool need_aging = false; + bool need_swapping = false; - unsigned long scanned = 0; ++ unsigned long scanned = 0; + unsigned long reclaimed = sc->nr_reclaimed; + DEFINE_MAX_SEQ(lruvec); - - lru_add_drain(); - -@@ -4967,21 +5047,28 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - else - swappiness = 0; - -- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); ++ ++ lru_add_drain(); ++ ++ blk_start_plug(&plug); ++ ++ set_mm_walk(lruvec_pgdat(lruvec)); ++ ++ while (true) { ++ int delta; ++ int swappiness; ++ unsigned long nr_to_scan; ++ ++ if (sc->may_swap) ++ swappiness = 
get_swappiness(lruvec, sc); ++ else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) ++ swappiness = 1; ++ else ++ swappiness = 0; ++ + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, &need_aging); - if (!nr_to_scan) -- break; ++ if (!nr_to_scan) + goto done; - -- delta = evict_folios(lruvec, sc, swappiness); ++ + delta = evict_folios(lruvec, sc, swappiness, &need_swapping); - if (!delta) -- break; ++ if (!delta) + goto done; - - scanned += delta; - if (scanned >= nr_to_scan) - break; - ++ ++ scanned += delta; ++ if (scanned >= nr_to_scan) ++ break; ++ + if (should_abort_scan(lruvec, max_seq, sc, need_swapping)) + break; + - cond_resched(); - } - ++ cond_resched(); ++ } ++ + /* see the comment in lru_gen_age_node() */ + if (sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH && !need_aging) + sc->memcgs_need_aging = false; +done: - clear_mm_walk(); - - blk_finish_plug(&plug); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (8 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao - ` (4 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that -can be disabled include: - 0x0001: the multi-gen LRU core - 0x0002: walking page table, when arch_has_hw_pte_young() returns - true - 0x0004: clearing the accessed bit in non-leaf PMD entries, when - CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y - [yYnN]: apply to all the components above -E.g., - echo y >/sys/kernel/mm/lru_gen/enabled - cat /sys/kernel/mm/lru_gen/enabled - 0x0007 - echo 5 >/sys/kernel/mm/lru_gen/enabled - cat /sys/kernel/mm/lru_gen/enabled - 0x0005 - -NB: the page table walks happen on the scale of seconds under heavy -memory pressure, in which case the mmap_lock contention is a lesser -concern, compared with the LRU lock contention and the I/O congestion. -So far the only well-known case of the mmap_lock contention happens on -Android, due to Scudo [1] which allocates several thousand VMAs for -merely a few hundred MBs. The SPF and the Maple Tree also have -provided their own assessments [2][3]. 
However, if walking page tables -does worsen the mmap_lock contention, the kill switch can be used to -disable it. In this case the multi-gen LRU will suffer a minor -performance degradation, as shown previously. - -Clearing the accessed bit in non-leaf PMD entries can also be -disabled, since this behavior was not tested on x86 varieties other -than Intel and AMD. - -[1] https://source.android.com/devices/tech/debug/scudo -[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/ -[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/cgroup.h | 15 ++- - include/linux/mm_inline.h | 15 ++- - include/linux/mmzone.h | 9 ++ - kernel/cgroup/cgroup-internal.h | 1 - - mm/Kconfig | 6 + - mm/vmscan.c | 228 +++++++++++++++++++++++++++++++- - 6 files changed, 265 insertions(+), 9 deletions(-) - -diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h -index ac5d0515680e..9179463c3c9f 100644 ---- a/include/linux/cgroup.h -+++ b/include/linux/cgroup.h -@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) - css_put(&cgrp->self); - } - -+extern struct mutex cgroup_mutex; ++ clear_mm_walk(); + -+static inline void cgroup_lock(void) -+{ -+ mutex_lock(&cgroup_mutex); ++ blk_finish_plug(&plug); +} + -+static inline void cgroup_unlock(void) -+{ -+ mutex_unlock(&cgroup_mutex); -+} -+ - /** - * task_css_set_check - obtain a task's css_set with extra access conditions - * @task: the task to obtain css_set for -@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) - * as locks used during the cgroup_subsys::attach() methods. 
- */ - #ifdef CONFIG_PROVE_RCU --extern struct mutex cgroup_mutex; - extern spinlock_t css_set_lock; - #define task_css_set_check(task, __c) \ - rcu_dereference_check((task)->cgroups, \ -@@ -708,6 +719,8 @@ struct cgroup; - static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } - static inline void css_get(struct cgroup_subsys_state *css) {} - static inline void css_put(struct cgroup_subsys_state *css) {} -+static inline void cgroup_lock(void) {} -+static inline void cgroup_unlock(void) {} - static inline int cgroup_attach_task_all(struct task_struct *from, - struct task_struct *t) { return 0; } - static inline int cgroupstats_build(struct cgroupstats *stats, -diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h -index f2b2296a42f9..4949eda9a9a2 100644 ---- a/include/linux/mm_inline.h -+++ b/include/linux/mm_inline.h -@@ -106,10 +106,21 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) - - #ifdef CONFIG_LRU_GEN - -+#ifdef CONFIG_LRU_GEN_ENABLED - static inline bool lru_gen_enabled(void) - { -- return true; -+ DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); -+ -+ return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); - } -+#else -+static inline bool lru_gen_enabled(void) -+{ -+ DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); -+ -+ return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); -+} -+#endif - - static inline bool lru_gen_in_fault(void) - { -@@ -222,7 +233,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, - - VM_WARN_ON_ONCE_FOLIO(gen != -1, folio); - -- if (folio_test_unevictable(folio)) -+ if (folio_test_unevictable(folio) || !lrugen->enabled) - return false; - /* - * There are three common cases for this page: -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index b1635c4020dc..95c58c7fbdff 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -387,6 +387,13 @@ enum { - LRU_GEN_FILE, - }; - -+enum { -+ 
LRU_GEN_CORE, -+ LRU_GEN_MM_WALK, -+ LRU_GEN_NONLEAF_YOUNG, -+ NR_LRU_GEN_CAPS -+}; -+ - #define MIN_LRU_BATCH BITS_PER_LONG - #define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) - -@@ -428,6 +435,8 @@ struct lru_gen_struct { - /* can be modified without holding the LRU lock */ - atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; - atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; -+ /* whether the multi-gen LRU is enabled */ -+ bool enabled; - }; - - enum { -diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h -index 36b740cb3d59..63dc3e82be4f 100644 ---- a/kernel/cgroup/cgroup-internal.h -+++ b/kernel/cgroup/cgroup-internal.h -@@ -164,7 +164,6 @@ struct cgroup_mgctx { - #define DEFINE_CGROUP_MGCTX(name) \ - struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) - --extern struct mutex cgroup_mutex; - extern spinlock_t css_set_lock; - extern struct cgroup_subsys *cgroup_subsys[]; - extern struct list_head cgroup_roots; -diff --git a/mm/Kconfig b/mm/Kconfig -index 5c5dcbdcfe34..ab6ef5115eb8 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1127,6 +1127,12 @@ config LRU_GEN - help - A high performance LRU implementation to overcommit memory. - -+config LRU_GEN_ENABLED -+ bool "Enable by default" -+ depends on LRU_GEN -+ help -+ This option enables the multi-gen LRU by default. 
-+ - config LRU_GEN_STATS - bool "Full stats for debugging" - depends on LRU_GEN -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 3f83325fdc71..10f31f3c5054 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -51,6 +51,7 @@ - #include - #include - #include -+#include - - #include - #include -@@ -3070,6 +3071,14 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, - - #ifdef CONFIG_LRU_GEN - -+#ifdef CONFIG_LRU_GEN_ENABLED -+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); -+#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) -+#else -+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); -+#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap]) -+#endif -+ - /****************************************************************************** - * shorthand helpers - ******************************************************************************/ -@@ -3946,7 +3955,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area - goto next; - - if (!pmd_trans_huge(pmd[i])) { -- if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) -+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && -+ get_cap(LRU_GEN_NONLEAF_YOUNG)) - pmdp_test_and_clear_young(vma, addr, pmd + i); - goto next; - } -@@ -4044,10 +4054,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, - walk->mm_stats[MM_NONLEAF_TOTAL]++; - - #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG -- if (!pmd_young(val)) -- continue; -+ if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { -+ if (!pmd_young(val)) -+ continue; - -- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); -+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); -+ } - #endif - if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) - continue; -@@ -4309,7 +4321,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - * handful of PTEs. 
Spreading the work out over a period of time usually - * is less efficient, but it avoids bursty page faults. - */ -- if (!arch_has_hw_pte_young()) { -+ if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { - success = iterate_mm_list_nowalk(lruvec, max_seq); - goto done; - } -@@ -5074,6 +5086,208 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc - blk_finish_plug(&plug); - } - +/****************************************************************************** + * state change + ******************************************************************************/ @@ -6249,6 +4375,29 @@ index 3f83325fdc71..10f31f3c5054 100644 + * sysfs interface + ******************************************************************************/ + ++static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) ++{ ++ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); ++} ++ ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ ++static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, ++ const char *buf, size_t len) ++{ ++ unsigned int msecs; ++ ++ if (kstrtouint(buf, 0, &msecs)) ++ return -EINVAL; ++ ++ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); ++ ++ return len; ++} ++ ++static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( ++ min_ttl_ms, 0644, show_min_ttl, store_min_ttl ++); ++ +static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + unsigned int caps = 0; @@ -6265,6 +4414,7 @@ index 3f83325fdc71..10f31f3c5054 100644 + return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ @@ -6297,6 +4447,7 @@ index 3f83325fdc71..10f31f3c5054 100644 +); + +static struct attribute *lru_gen_attrs[] = { ++ &lru_gen_min_ttl_attr.attr, + 
&lru_gen_enabled_attr.attr, + NULL +}; @@ -6306,462 +4457,6 @@ index 3f83325fdc71..10f31f3c5054 100644 + .attrs = lru_gen_attrs, +}; + - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -5084,6 +5298,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec) - struct lru_gen_struct *lrugen = &lruvec->lrugen; - - lrugen->max_seq = MIN_NR_GENS + 1; -+ lrugen->enabled = lru_gen_enabled(); - - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); -@@ -5123,6 +5338,9 @@ static int __init init_lru_gen(void) - BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); - BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); - -+ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) -+ pr_err("lru_gen: failed to create sysfs group\n"); -+ - return 0; - }; - late_initcall(init_lru_gen); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (9 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 10/14] mm: multi-gen LRU: kill switch Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao - ` (3 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as -requested by many desktop users [1]. - -When set to value N, it prevents the working set of N milliseconds -from getting evicted. The OOM killer is triggered if this working set -cannot be kept in memory. Based on the average human detectable lag -(~100ms), N=1000 usually eliminates intolerable lags due to thrashing. -Larger values like N=3000 make lags less noticeable at the risk of -premature OOM kills. - -Compared with the size-based approach [2], this time-based approach -has the following advantages: -1. It is easier to configure because it is agnostic to applications - and memory sizes. -2. It is more reliable because it is directly wired to the OOM killer. 
- -[1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/ -[2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/ - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/mmzone.h | 2 ++ - mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++++++++-- - 2 files changed, 73 insertions(+), 3 deletions(-) - -diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h -index 95c58c7fbdff..87347945270b 100644 ---- a/include/linux/mmzone.h -+++ b/include/linux/mmzone.h -@@ -422,6 +422,8 @@ struct lru_gen_struct { - unsigned long max_seq; - /* the eviction increments the oldest generation numbers */ - unsigned long min_seq[ANON_AND_FILE]; -+ /* the birth time of each generation in jiffies */ -+ unsigned long timestamps[MAX_NR_GENS]; - /* the multi-gen LRU lists, lazily sorted on eviction */ - struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; - /* the multi-gen LRU sizes, eventually consistent */ -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 10f31f3c5054..9ef2ec3d3c0c 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -4293,6 +4293,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - for (type = 0; type < ANON_AND_FILE; type++) - reset_ctrl_pos(lruvec, type, false); - -+ WRITE_ONCE(lrugen->timestamps[next], jiffies); - /* make sure preceding modifications appear */ - smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); - -@@ -4422,7 +4423,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsig - return false; - } - --static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) -+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long 
min_ttl) - { - bool need_aging; - unsigned long nr_to_scan; -@@ -4436,16 +4437,36 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) - mem_cgroup_calculate_protection(NULL, memcg); - - if (mem_cgroup_below_min(memcg)) -- return; -+ return false; - - need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); -+ -+ if (min_ttl) { -+ int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); -+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); -+ -+ if (time_is_after_jiffies(birth + min_ttl)) -+ return false; -+ -+ /* the size is likely too small to be helpful */ -+ if (!nr_to_scan && sc->priority != DEF_PRIORITY) -+ return false; -+ } -+ - if (need_aging) - try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); -+ -+ return true; - } - -+/* to protect the working set of the last N jiffies */ -+static unsigned long lru_gen_min_ttl __read_mostly; -+ - static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - { - struct mem_cgroup *memcg; -+ bool success = false; -+ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); - - VM_WARN_ON_ONCE(!current_is_kswapd()); - -@@ -4468,12 +4489,32 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) - do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - -- age_lruvec(lruvec, sc); -+ if (age_lruvec(lruvec, sc, min_ttl)) -+ success = true; - - cond_resched(); - } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); - - clear_mm_walk(); -+ -+ /* check the order to exclude compaction-induced reclaim */ -+ if (success || !min_ttl || sc->order) -+ return; -+ -+ /* -+ * The main goal is to OOM kill if every generation from all memcgs is -+ * younger than min_ttl. However, another possibility is all memcgs are -+ * either below min or empty. 
-+ */ -+ if (mutex_trylock(&oom_lock)) { -+ struct oom_control oc = { -+ .gfp_mask = sc->gfp_mask, -+ }; -+ -+ out_of_memory(&oc); -+ -+ mutex_unlock(&oom_lock); -+ } - } - - /* -@@ -5231,6 +5272,28 @@ static void lru_gen_change_state(bool enabled) - * sysfs interface - ******************************************************************************/ - -+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) -+{ -+ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); -+} -+ -+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, -+ const char *buf, size_t len) -+{ -+ unsigned int msecs; -+ -+ if (kstrtouint(buf, 0, &msecs)) -+ return -EINVAL; -+ -+ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); -+ -+ return len; -+} -+ -+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( -+ min_ttl_ms, 0644, show_min_ttl, store_min_ttl -+); -+ - static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf) - { - unsigned int caps = 0; -@@ -5279,6 +5342,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR( - ); - - static struct attribute *lru_gen_attrs[] = { -+ &lru_gen_min_ttl_attr.attr, - &lru_gen_enabled_attr.attr, - NULL - }; -@@ -5294,12 +5358,16 @@ static struct attribute_group lru_gen_attr_group = { - - void lru_gen_init_lruvec(struct lruvec *lruvec) - { -+ int i; - int gen, type, zone; - struct lru_gen_struct *lrugen = &lruvec->lrugen; - - lrugen->max_seq = MIN_NR_GENS + 1; - lrugen->enabled = lru_gen_enabled(); - -+ for (i = 0; i <= MIN_NR_GENS + 1; i++) -+ lrugen->timestamps[i] = jiffies; -+ - for_each_gen_type_zone(gen, type, zone) - INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); - --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (10 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:00 ` [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide Yu Zhao - ` (2 subsequent siblings) - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Qi Zheng, - Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, - Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add /sys/kernel/debug/lru_gen for working set estimation and proactive -reclaim. These techniques are commonly used to optimize job scheduling -(bin packing) in data centers [1][2]. - -Compared with the page table-based approach and the PFN-based -approach, this lruvec-based approach has the following advantages: -1. It offers better choices because it is aware of memcgs, NUMA nodes, - shared mappings and unmapped page cache. -2. It is more scalable because it is O(nr_hot_pages), whereas the - PFN-based approach is O(nr_total_pages). - -Add /sys/kernel/debug/lru_gen_full for debugging. 
- -[1] https://dl.acm.org/doi/10.1145/3297858.3304053 -[2] https://dl.acm.org/doi/10.1145/3503222.3507731 - -Signed-off-by: Yu Zhao -Reviewed-by: Qi Zheng -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - include/linux/nodemask.h | 1 + - mm/vmscan.c | 411 ++++++++++++++++++++++++++++++++++++++- - 2 files changed, 402 insertions(+), 10 deletions(-) - -diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h -index 4b71a96190a8..3a0eec9f2faa 100644 ---- a/include/linux/nodemask.h -+++ b/include/linux/nodemask.h -@@ -493,6 +493,7 @@ static inline int num_node_state(enum node_states state) - #define first_online_node 0 - #define first_memory_node 0 - #define next_online_node(nid) (MAX_NUMNODES) -+#define next_memory_node(nid) (MAX_NUMNODES) - #define nr_node_ids 1U - #define nr_online_nodes 1U - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 9ef2ec3d3c0c..7657d54c9c42 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -52,6 +52,7 @@ - #include - #include - #include -+#include - - #include - #include -@@ -4197,12 +4198,40 @@ static void clear_mm_walk(void) - kfree(walk); - } - --static void inc_min_seq(struct lruvec *lruvec, int type) -+static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) - { -+ int zone; -+ int remaining = MAX_LRU_BATCH; - struct lru_gen_struct *lrugen = &lruvec->lrugen; -+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); - -+ if (type == LRU_GEN_ANON && !can_swap) -+ goto done; -+ -+ /* prevent cold/hot inversion if force_scan is true */ -+ for (zone = 0; zone < MAX_NR_ZONES; zone++) { -+ struct list_head *head = &lrugen->lists[old_gen][type][zone]; -+ -+ while (!list_empty(head)) { -+ struct folio *folio = 
lru_to_folio(head); -+ -+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); -+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); -+ -+ new_gen = folio_inc_gen(lruvec, folio, false); -+ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]); -+ -+ if (!--remaining) -+ return false; -+ } -+ } -+done: - reset_ctrl_pos(lruvec, type, true); - WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); -+ -+ return true; - } - - static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) -@@ -4248,7 +4277,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) - return success; - } - --static void inc_max_seq(struct lruvec *lruvec, bool can_swap) -+static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) - { - int prev, next; - int type, zone; -@@ -4262,9 +4291,13 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - if (get_nr_gens(lruvec, type) != MAX_NR_GENS) - continue; - -- VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap); -+ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); - -- inc_min_seq(lruvec, type); -+ while (!inc_min_seq(lruvec, type, can_swap)) { -+ spin_unlock_irq(&lruvec->lru_lock); -+ cond_resched(); -+ spin_lock_irq(&lruvec->lru_lock); -+ } - } - - /* -@@ -4301,7 +4334,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap) - } - - static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, -- struct scan_control *sc, bool can_swap) -+ struct scan_control *sc, bool can_swap, bool force_scan) - { - bool success; - struct lru_gen_mm_walk *walk; -@@ -4322,7 +4355,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - * handful of PTEs. Spreading the work out over a period of time usually - * is less efficient, but it avoids bursty page faults. 
- */ -- if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { -+ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { - success = iterate_mm_list_nowalk(lruvec, max_seq); - goto done; - } -@@ -4336,7 +4369,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - walk->lruvec = lruvec; - walk->max_seq = max_seq; - walk->can_swap = can_swap; -- walk->force_scan = false; -+ walk->force_scan = force_scan; - - do { - success = iterate_mm_list(lruvec, walk, &mm); -@@ -4356,7 +4389,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - - VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq)); - -- inc_max_seq(lruvec, can_swap); -+ inc_max_seq(lruvec, can_swap, force_scan); - /* either this sees any waiters or they will see updated max_seq */ - if (wq_has_sleeper(&lruvec->mm_state.wait)) - wake_up_all(&lruvec->mm_state.wait); -@@ -4454,7 +4487,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned - } - - if (need_aging) -- try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); -+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); - - return true; - } -@@ -5013,7 +5046,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control * - if (current_is_kswapd()) - return 0; - -- if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap)) -+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) - return nr_to_scan; - done: - return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? 
nr_to_scan : 0; -@@ -5352,6 +5385,361 @@ static struct attribute_group lru_gen_attr_group = { - .attrs = lru_gen_attrs, - }; - +/****************************************************************************** + * debugfs interface + ******************************************************************************/ @@ -6867,6 +4562,7 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + seq_putc(m, '\n'); +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; @@ -7025,6 +4721,7 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + return err; +} + ++/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ @@ -7117,639 +4814,540 @@ index 9ef2ec3d3c0c..7657d54c9c42 100644 + .release = seq_release, +}; + - /****************************************************************************** - * initialization - ******************************************************************************/ -@@ -5409,6 +5797,9 @@ static int __init init_lru_gen(void) - if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) - pr_err("lru_gen: failed to create sysfs group\n"); - ++/****************************************************************************** ++ * initialization ++ ******************************************************************************/ ++ ++void lru_gen_init_lruvec(struct lruvec *lruvec) ++{ ++ int i; ++ int gen, type, zone; ++ struct lru_gen_struct *lrugen = &lruvec->lrugen; ++ ++ lrugen->max_seq = MIN_NR_GENS + 1; ++ lrugen->enabled = lru_gen_enabled(); ++ ++ for (i = 0; i <= MIN_NR_GENS + 1; i++) ++ lrugen->timestamps[i] = jiffies; ++ ++ for_each_gen_type_zone(gen, type, zone) ++ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); ++ ++ lruvec->mm_state.seq = MIN_NR_GENS; ++ init_waitqueue_head(&lruvec->mm_state.wait); ++} ++ ++#ifdef CONFIG_MEMCG ++void 
lru_gen_init_memcg(struct mem_cgroup *memcg) ++{ ++ INIT_LIST_HEAD(&memcg->mm_list.fifo); ++ spin_lock_init(&memcg->mm_list.lock); ++} ++ ++void lru_gen_exit_memcg(struct mem_cgroup *memcg) ++{ ++ int i; ++ int nid; ++ ++ for_each_node(nid) { ++ struct lruvec *lruvec = get_lruvec(memcg, nid); ++ ++ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, ++ sizeof(lruvec->lrugen.nr_pages))); ++ ++ for (i = 0; i < NR_BLOOM_FILTERS; i++) { ++ bitmap_free(lruvec->mm_state.filters[i]); ++ lruvec->mm_state.filters[i] = NULL; ++ } ++ } ++} ++#endif ++ ++static int __init init_lru_gen(void) ++{ ++ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); ++ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); ++ ++ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) ++ pr_err("lru_gen: failed to create sysfs group\n"); ++ + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + - return 0; - }; - late_initcall(init_lru_gen); --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (11 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-18 8:26 ` Mike Rapoport - 2022-09-18 8:00 ` [PATCH mm-unstable v15 14/14] mm: multi-gen LRU: design doc Yu Zhao - 2022-09-19 2:08 ` [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Bagas Sanjaya - 14 siblings, 1 reply; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add an admin guide. - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - Documentation/admin-guide/mm/index.rst | 1 + - Documentation/admin-guide/mm/multigen_lru.rst | 162 ++++++++++++++++++ - mm/Kconfig | 3 +- - mm/vmscan.c | 4 + - 4 files changed, 169 insertions(+), 1 deletion(-) - create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst - -diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst -index 1bd11118dfb1..d1064e0ba34a 100644 ---- a/Documentation/admin-guide/mm/index.rst -+++ b/Documentation/admin-guide/mm/index.rst -@@ -32,6 +32,7 @@ the Linux memory management. 
- idle_page_tracking - ksm - memory-hotplug -+ multigen_lru - nommu-mmap - numa_memory_policy - numaperf -diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst -new file mode 100644 -index 000000000000..33e068830497 ---- /dev/null -+++ b/Documentation/admin-guide/mm/multigen_lru.rst -@@ -0,0 +1,162 @@ -+.. SPDX-License-Identifier: GPL-2.0 ++ return 0; ++}; ++late_initcall(init_lru_gen); + -+============= -+Multi-Gen LRU -+============= -+The multi-gen LRU is an alternative LRU implementation that optimizes -+page reclaim and improves performance under memory pressure. Page -+reclaim decides the kernel's caching policy and ability to overcommit -+memory. It directly impacts the kswapd CPU usage and RAM efficiency. ++#else /* !CONFIG_LRU_GEN */ + -+Quick start -+=========== -+Build the kernel with the following configurations. ++static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) ++{ ++} + -+* ``CONFIG_LRU_GEN=y`` -+* ``CONFIG_LRU_GEN_ENABLED=y`` ++static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++} + -+All set! ++#endif /* CONFIG_LRU_GEN */ + -+Runtime options -+=============== -+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the -+following subsections. ++static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) ++{ ++ unsigned long nr[NR_LRU_LISTS]; ++ unsigned long targets[NR_LRU_LISTS]; ++ unsigned long nr_to_scan; ++ enum lru_list lru; ++ unsigned long nr_reclaimed = 0; ++ unsigned long nr_to_reclaim = sc->nr_to_reclaim; ++ struct blk_plug plug; ++ bool scan_adjusted; + -+Kill switch -+----------- -+``enabled`` accepts different values to enable or disable the -+following components. Its default value depends on -+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled -+unless some of them have unforeseen side effects. 
Writing to -+``enabled`` has no effect when a component is not supported by the -+hardware, and valid values will be accepted even when the main switch -+is off. ++ if (lru_gen_enabled()) { ++ lru_gen_shrink_lruvec(lruvec, sc); ++ return; ++ } + -+====== =============================================================== -+Values Components -+====== =============================================================== -+0x0001 The main switch for the multi-gen LRU. -+0x0002 Clearing the accessed bit in leaf page table entries in large -+ batches, when MMU sets it (e.g., on x86). This behavior can -+ theoretically worsen lock contention (mmap_lock). If it is -+ disabled, the multi-gen LRU will suffer a minor performance -+ degradation for workloads that contiguously map hot pages, -+ whose accessed bits can be otherwise cleared by fewer larger -+ batches. -+0x0004 Clearing the accessed bit in non-leaf page table entries as -+ well, when MMU sets it (e.g., on x86). This behavior was not -+ verified on x86 varieties other than Intel and AMD. If it is -+ disabled, the multi-gen LRU will suffer a negligible -+ performance degradation. -+[yYnN] Apply to all the components above. -+====== =============================================================== ++ get_scan_count(lruvec, sc, nr); + -+E.g., -+:: ++ /* Record the original scan target for proportional adjustments later */ ++ memcpy(targets, nr, sizeof(nr)); + -+ echo y >/sys/kernel/mm/lru_gen/enabled -+ cat /sys/kernel/mm/lru_gen/enabled -+ 0x0007 -+ echo 5 >/sys/kernel/mm/lru_gen/enabled -+ cat /sys/kernel/mm/lru_gen/enabled -+ 0x0005 ++ /* ++ * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal ++ * event that can occur when there is little memory pressure e.g. ++ * multiple streaming readers/writers. 
Hence, we do not abort scanning ++ * when the requested number of pages are reclaimed when scanning at ++ * DEF_PRIORITY on the assumption that the fact we are direct ++ * reclaiming implies that kswapd is not keeping up and it is best to ++ * do a batch of work at once. For memcg reclaim one check is made to ++ * abort proportional reclaim if either the file or anon lru has already ++ * dropped to zero at the first pass. ++ */ ++ scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && ++ sc->priority == DEF_PRIORITY); + -+Thrashing prevention -+-------------------- -+Personal computers are more sensitive to thrashing because it can -+cause janks (lags when rendering UI) and negatively impact user -+experience. The multi-gen LRU offers thrashing prevention to the -+majority of laptop and desktop users who do not have ``oomd``. ++ blk_start_plug(&plug); ++ while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || ++ nr[LRU_INACTIVE_FILE]) { ++ unsigned long nr_anon, nr_file, percentage; ++ unsigned long nr_scanned; + -+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of -+``N`` milliseconds from getting evicted. The OOM killer is triggered -+if this working set cannot be kept in memory. In other words, this -+option works as an adjustable pressure relief valve, and when open, it -+terminates applications that are hopefully not being used. ++ for_each_evictable_lru(lru) { ++ if (nr[lru]) { ++ nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); ++ nr[lru] -= nr_to_scan; + -+Based on the average human detectable lag (~100ms), ``N=1000`` usually -+eliminates intolerable janks due to thrashing. Larger values like -+``N=3000`` make janks less noticeable at the risk of premature OOM -+kills. ++ nr_reclaimed += shrink_list(lru, nr_to_scan, ++ lruvec, sc); ++ } ++ } + -+The default value ``0`` means disabled. 
++ cond_resched(); + -+Experimental features -+===================== -+``/sys/kernel/debug/lru_gen`` accepts commands described in the -+following subsections. Multiple command lines are supported, so does -+concatenation with delimiters ``,`` and ``;``. ++ if (nr_reclaimed < nr_to_reclaim || scan_adjusted) ++ continue; + -+``/sys/kernel/debug/lru_gen_full`` provides additional stats for -+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from -+evicted generations in this file. ++ /* ++ * For kswapd and memcg, reclaim at least the number of pages ++ * requested. Ensure that the anon and file LRUs are scanned ++ * proportionally what was requested by get_scan_count(). We ++ * stop reclaiming one LRU and reduce the amount scanning ++ * proportional to the original scan target. ++ */ ++ nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; ++ nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + -+Working set estimation -+---------------------- -+Working set estimation measures how much memory an application needs -+in a given time interval, and it is usually done with little impact on -+the performance of the application. E.g., data centers want to -+optimize job scheduling (bin packing) to improve memory utilizations. -+When a new job comes in, the job scheduler needs to find out whether -+each server it manages can allocate a certain amount of memory for -+this new job before it can pick a candidate. To do so, the job -+scheduler needs to estimate the working sets of the existing jobs. ++ /* ++ * It's just vindictive to attack the larger once the smaller ++ * has gone to zero. And given the way we stop scanning the ++ * smaller below, this makes sure that we only make one nudge ++ * towards proportionality once we've got nr_to_reclaim. ++ */ ++ if (!nr_file || !nr_anon) ++ break; + -+When it is read, ``lru_gen`` returns a histogram of numbers of pages -+accessed over different time intervals for each memcg and node. 
-+``MAX_NR_GENS`` decides the number of bins for each histogram. The -+histograms are noncumulative. -+:: ++ if (nr_file > nr_anon) { ++ unsigned long scan_target = targets[LRU_INACTIVE_ANON] + ++ targets[LRU_ACTIVE_ANON] + 1; ++ lru = LRU_BASE; ++ percentage = nr_anon * 100 / scan_target; ++ } else { ++ unsigned long scan_target = targets[LRU_INACTIVE_FILE] + ++ targets[LRU_ACTIVE_FILE] + 1; ++ lru = LRU_FILE; ++ percentage = nr_file * 100 / scan_target; ++ } + -+ memcg memcg_id memcg_path -+ node node_id -+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages -+ ... -+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages ++ /* Stop scanning the smaller of the LRU */ ++ nr[lru] = 0; ++ nr[lru + LRU_ACTIVE] = 0; + -+Each bin contains an estimated number of pages that have been accessed -+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages -+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of -+the former is the largest and that of the latter is the smallest. ++ /* ++ * Recalculate the other LRU scan count based on its original ++ * scan target and the percentage scanning already complete ++ */ ++ lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; ++ nr_scanned = targets[lru] - nr[lru]; ++ nr[lru] = targets[lru] * (100 - percentage) / 100; ++ nr[lru] -= min(nr[lru], nr_scanned); + -+Users can write the following command to ``lru_gen`` to create a new -+generation ``max_gen_nr+1``: ++ lru += LRU_ACTIVE; ++ nr_scanned = targets[lru] - nr[lru]; ++ nr[lru] = targets[lru] * (100 - percentage) / 100; ++ nr[lru] -= min(nr[lru], nr_scanned); + -+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` ++ scan_adjusted = true; ++ } ++ blk_finish_plug(&plug); ++ sc->nr_reclaimed += nr_reclaimed; + -+``can_swap`` defaults to the swap setting and, if it is set to ``1``, -+it forces the scan of anon pages when swap is off, and vice versa. 
-+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it -+employs heuristics to reduce the overhead, which is likely to reduce -+the coverage as well. ++ /* ++ * Even if we did not try to evict anon pages at all, we want to ++ * rebalance the anon lru active/inactive ratio. ++ */ ++ if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && ++ inactive_is_low(lruvec, LRU_INACTIVE_ANON)) ++ shrink_active_list(SWAP_CLUSTER_MAX, lruvec, ++ sc, LRU_ACTIVE_ANON); ++} + -+A typical use case is that a job scheduler runs this command at a -+certain time interval to create new generations, and it ranks the -+servers it manages based on the sizes of their cold pages defined by -+this time interval. ++/* Use reclaim/compaction for costly allocs or under memory pressure */ ++static bool in_reclaim_compaction(struct scan_control *sc) ++{ ++ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && ++ (sc->order > PAGE_ALLOC_COSTLY_ORDER || ++ sc->priority < DEF_PRIORITY - 2)) ++ return true; + -+Proactive reclaim -+----------------- -+Proactive reclaim induces page reclaim when there is no memory -+pressure. It usually targets cold pages only. E.g., when a new job -+comes in, the job scheduler wants to proactively reclaim cold pages on -+the server it selected, to improve the chance of successfully landing -+this new job. ++ return false; ++} + -+Users can write the following command to ``lru_gen`` to evict -+generations less than or equal to ``min_gen_nr``. ++/* ++ * Reclaim/compaction is used for high-order allocation requests. It reclaims ++ * order-0 pages before compacting the zone. should_continue_reclaim() returns ++ * true if more pages should be reclaimed such that when the page allocator ++ * calls try_to_compact_pages() that it will have enough free pages to succeed. ++ * It will give up earlier than that if there is difficulty reclaiming pages. 
++ */ ++static inline bool should_continue_reclaim(struct pglist_data *pgdat, ++ unsigned long nr_reclaimed, ++ struct scan_control *sc) ++{ ++ unsigned long pages_for_compaction; ++ unsigned long inactive_lru_pages; ++ int z; + -+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` ++ /* If not in reclaim/compaction mode, stop */ ++ if (!in_reclaim_compaction(sc)) ++ return false; + -+``min_gen_nr`` should be less than ``max_gen_nr-1``, since -+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to -+the active list) and therefore cannot be evicted. ``swappiness`` -+overrides the default value in ``/proc/sys/vm/swappiness``. -+``nr_to_reclaim`` limits the number of pages to evict. -+ -+A typical use case is that a job scheduler runs this command before it -+tries to land a new job on a server. If it fails to materialize enough -+cold pages because of the overestimation, it retries on the next -+server according to the ranking result obtained from the working set -+estimation step. This less forceful approach limits the impacts on the -+existing jobs. -diff --git a/mm/Kconfig b/mm/Kconfig -index ab6ef5115eb8..ceec438c0741 100644 ---- a/mm/Kconfig -+++ b/mm/Kconfig -@@ -1125,7 +1125,8 @@ config LRU_GEN - # make sure folio->flags has enough spare bits - depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP - help -- A high performance LRU implementation to overcommit memory. -+ A high performance LRU implementation to overcommit memory. See -+ Documentation/admin-guide/mm/multigen_lru.rst for details. 
- - config LRU_GEN_ENABLED - bool "Enable by default" -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 7657d54c9c42..1456f133f256 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -5310,6 +5310,7 @@ static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, c - return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t len) - { -@@ -5343,6 +5344,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c - return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t len) - { -@@ -5490,6 +5492,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, - seq_putc(m, '\n'); - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static int lru_gen_seq_show(struct seq_file *m, void *v) - { - unsigned long seq; -@@ -5648,6 +5651,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, - return err; - } - -+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ - static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, - size_t len, loff_t *pos) - { --- -2.37.3.968.ga6b4b080e4-goog - - - - -* [PATCH mm-unstable v15 14/14] mm: multi-gen LRU: design doc - 2022-09-18 7:59 [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Yu Zhao - ` (12 preceding siblings ...) 
- 2022-09-18 8:00 ` [PATCH mm-unstable v15 13/14] mm: multi-gen LRU: admin guide Yu Zhao -@ 2022-09-18 8:00 ` Yu Zhao - 2022-09-19 2:08 ` [PATCH mm-unstable v15 00/14] Multi-Gen LRU Framework Bagas Sanjaya - 14 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-18 8:00 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Yu Zhao, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Add a design doc. - -Signed-off-by: Yu Zhao -Acked-by: Brian Geffon -Acked-by: Jan Alexander Steffens (heftig) -Acked-by: Oleksandr Natalenko -Acked-by: Steven Barrett -Acked-by: Suleiman Souhlal -Tested-by: Daniel Byrne -Tested-by: Donald Carr -Tested-by: Holger Hoffstätte -Tested-by: Konstantin Kharlamov -Tested-by: Shuang Zhai -Tested-by: Sofia Trinh -Tested-by: Vaibhav Jain ---- - Documentation/mm/index.rst | 1 + - Documentation/mm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++ - 2 files changed, 160 insertions(+) - create mode 100644 Documentation/mm/multigen_lru.rst - -diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst -index 575ccd40e30c..4aa12b8be278 100644 ---- a/Documentation/mm/index.rst -+++ b/Documentation/mm/index.rst -@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. 
- ksm - memory-model - mmu_notifier -+ multigen_lru - numa - overcommit-accounting - page_migration -diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst -new file mode 100644 -index 000000000000..d7062c6a8946 ---- /dev/null -+++ b/Documentation/mm/multigen_lru.rst -@@ -0,0 +1,159 @@ -+.. SPDX-License-Identifier: GPL-2.0 -+ -+============= -+Multi-Gen LRU -+============= -+The multi-gen LRU is an alternative LRU implementation that optimizes -+page reclaim and improves performance under memory pressure. Page -+reclaim decides the kernel's caching policy and ability to overcommit -+memory. It directly impacts the kswapd CPU usage and RAM efficiency. -+ -+Design overview -+=============== -+Objectives -+---------- -+The design objectives are: -+ -+* Good representation of access recency -+* Try to profit from spatial locality -+* Fast paths to make obvious choices -+* Simple self-correcting heuristics -+ -+The representation of access recency is at the core of all LRU -+implementations. In the multi-gen LRU, each generation represents a -+group of pages with similar access recency. Generations establish a -+(time-based) common frame of reference and therefore help make better -+choices, e.g., between different memcgs on a computer or different -+computers in a data center (for job scheduling). -+ -+Exploiting spatial locality improves efficiency when gathering the -+accessed bit. A rmap walk targets a single page and does not try to -+profit from discovering a young PTE. A page table walk can sweep all -+the young PTEs in an address space, but the address space can be too -+sparse to make a profit. The key is to optimize both methods and use -+them in combination. -+ -+Fast paths reduce code complexity and runtime overhead. Unmapped pages -+do not require TLB flushes; clean pages do not require writeback. -+These facts are only helpful when other conditions, e.g., access -+recency, are similar. 
With generations as a common frame of reference, -+additional factors stand out. But obvious choices might not be good -+choices; thus self-correction is necessary. -+ -+The benefits of simple self-correcting heuristics are self-evident. -+Again, with generations as a common frame of reference, this becomes -+attainable. Specifically, pages in the same generation can be -+categorized based on additional factors, and a feedback loop can -+statistically compare the refault percentages across those categories -+and infer which of them are better choices. -+ -+Assumptions -+----------- -+The protection of hot pages and the selection of cold pages are based -+on page access channels and patterns. There are two access channels: -+ -+* Accesses through page tables -+* Accesses through file descriptors -+ -+The protection of the former channel is by design stronger because: -+ -+1. The uncertainty in determining the access patterns of the former -+ channel is higher due to the approximation of the accessed bit. -+2. The cost of evicting the former channel is higher due to the TLB -+ flushes required and the likelihood of encountering the dirty bit. -+3. The penalty of underprotecting the former channel is higher because -+ applications usually do not prepare themselves for major page -+ faults like they do for blocked I/O. E.g., GUI applications -+ commonly use dedicated I/O threads to avoid blocking rendering -+ threads. -+ -+There are also two access patterns: -+ -+* Accesses exhibiting temporal locality -+* Accesses not exhibiting temporal locality -+ -+For the reasons listed above, the former channel is assumed to follow -+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is -+present, and the latter channel is assumed to follow the latter -+pattern unless outlying refaults have been observed. -+ -+Workflow overview -+================= -+Evictable pages are divided into multiple generations for each -+``lruvec``. 
The youngest generation number is stored in -+``lrugen->max_seq`` for both anon and file types as they are aged on -+an equal footing. The oldest generation numbers are stored in -+``lrugen->min_seq[]`` separately for anon and file types as clean file -+pages can be evicted regardless of swap constraints. These three -+variables are monotonically increasing. -+ -+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` -+bits in order to fit into the gen counter in ``folio->flags``. Each -+truncated generation number is an index to ``lrugen->lists[]``. The -+sliding window technique is used to track at least ``MIN_NR_GENS`` and -+at most ``MAX_NR_GENS`` generations. The gen counter stores a value -+within ``[1, MAX_NR_GENS]`` while a page is on one of -+``lrugen->lists[]``; otherwise it stores zero. -+ -+Each generation is divided into multiple tiers. A page accessed ``N`` -+times through file descriptors is in tier ``order_base_2(N)``. Unlike -+generations, tiers do not have dedicated ``lrugen->lists[]``. In -+contrast to moving across generations, which requires the LRU lock, -+moving across tiers only involves atomic operations on -+``folio->flags`` and therefore has a negligible cost. A feedback loop -+modeled after the PID controller monitors refaults over all the tiers -+from anon and file types and decides which tiers from which types to -+evict or protect. -+ -+There are two conceptually independent procedures: the aging and the -+eviction. They form a closed-loop system, i.e., the page reclaim. -+ -+Aging -+----- -+The aging produces young generations. Given an ``lruvec``, it -+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches -+``MIN_NR_GENS``. The aging promotes hot pages to the youngest -+generation when it finds them accessed through page tables; the -+demotion of cold pages happens consequently when it increments -+``max_seq``. The aging uses page table walks and rmap walks to find -+young PTEs. 
For the former, it iterates ``lruvec_memcg()->mm_list`` -+and calls ``walk_page_range()`` with each ``mm_struct`` on this list -+to scan PTEs, and after each iteration, it increments ``max_seq``. For -+the latter, when the eviction walks the rmap and finds a young PTE, -+the aging scans the adjacent PTEs. For both, on finding a young PTE, -+the aging clears the accessed bit and updates the gen counter of the -+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. -+ -+Eviction -+-------- -+The eviction consumes old generations. Given an ``lruvec``, it -+increments ``min_seq`` when ``lrugen->lists[]`` indexed by -+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to -+evict from, it first compares ``min_seq[]`` to select the older type. -+If both types are equally old, it selects the one whose first tier has -+a lower refault percentage. The first tier contains single-use -+unmapped clean pages, which are the best bet. The eviction sorts a -+page according to its gen counter if the aging has found this page -+accessed through page tables and updated its gen counter. It also -+moves a page to the next generation, i.e., ``min_seq+1``, if this page -+was accessed multiple times through file descriptors and the feedback -+loop has detected outlying refaults from the tier this page is in. To -+this end, the feedback loop uses the first tier as the baseline, for -+the reason stated earlier. -+ -+Summary -+------- -+The multi-gen LRU can be disassembled into the following parts: -+ -+* Generations -+* Rmap walks -+* Page table walks -+* Bloom filters -+* PID controller -+ -+The aging and the eviction form a producer-consumer model; -+specifically, the latter drives the former by the sliding window over -+generations. Within the aging, rmap walks drive page table walks by -+inserting hot densely populated page tables to the Bloom filters. 
-+Within the eviction, the PID controller uses refaults as the feedback -+to select types to evict and tiers to protect. --- -2.37.3.968.ga6b4b080e4-goog - - - -* Re: [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs - 2022-09-18 8:00 ` [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao -@ 2022-09-28 18:46 ` Yu Zhao - 0 siblings, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-28 18:46 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Hi Andrew, - -Can you please take this fixlet? Thanks. - -Fix imprecise comments. - -Signed-off-by: Yu Zhao ---- - mm/vmscan.c | 9 ++++----- - 1 file changed, 4 insertions(+), 5 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index a8fd6300fa7e..5b565470286b 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -5078,7 +5078,7 @@ static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, - DEFINE_MAX_SEQ(lruvec); - - if (!current_is_kswapd()) { -- /* age each memcg once to ensure fairness */ -+ /* age each memcg at most once to ensure fairness */ - if (max_seq - seq > 1) - return true; - -@@ -5103,10 +5103,9 @@ static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, - - /* - * A minimum amount of work was done under global memory pressure. For -- * kswapd, it may be overshooting. 
For direct reclaim, the target isn't -- * met, and yet the allocation may still succeed, since kswapd may have -- * caught up. In either case, it's better to stop now, and restart if -- * necessary. -+ * kswapd, it may be overshooting. For direct reclaim, the allocation -+ * may succeed if all suitable zones are somewhat safe. In either case, -+ * it's better to stop now, and restart later if necessary. - */ - for (i = 0; i <= sc->reclaim_idx; i++) { - unsigned long wmark; --- -2.37.3.998.g577e59143f-goog - - - - -* Re: [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks - 2022-09-18 8:00 ` [PATCH mm-unstable v15 08/14] mm: multi-gen LRU: support page table walks Yu Zhao - 2022-09-18 8:17 ` Yu Zhao -@ 2022-09-28 19:36 ` Yu Zhao - 1 sibling, 0 replies; 23+ messages in thread -From: Yu Zhao @ 2022-09-28 19:36 UTC (permalink / raw) - To: Andrew Morton - Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, - Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, - Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, - Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, - Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-doc, - linux-kernel, linux-mm, x86, page-reclaim, Brian Geffon, - Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, - Suleiman Souhlal, Daniel Byrne, Donald Carr, - Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, - Sofia Trinh, Vaibhav Jain - -Hi Andrew, - -Can you please take another fixlet? Thanks. - -Don't sync disk for each aging cycle. - -wakeup_flusher_threads() was added under the assumption that if a -system runs out of clean cold pages, it might want to write back dirty -pages more aggressively so that they can become clean and be dropped. - -However, doing so can breach the rate limit a system wants to impose -on writeback, resulting in early SSD wearout. 
- -Reported-by: Axel Rasmussen -Signed-off-by: Yu Zhao ---- - mm/vmscan.c | 2 -- - 1 file changed, 2 deletions(-) - -diff --git a/mm/vmscan.c b/mm/vmscan.c -index 5b565470286b..0317d4cf4884 100644 ---- a/mm/vmscan.c -+++ b/mm/vmscan.c -@@ -4413,8 +4413,6 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, - if (wq_has_sleeper(&lruvec->mm_state.wait)) - wake_up_all(&lruvec->mm_state.wait); - -- wakeup_flusher_threads(WB_REASON_VMSCAN); ++ /* + * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX + * number of pages that were scanned. This will return to the caller + * with the risk reclaim/compaction and the resulting allocation attempt +@@ -3197,109 +6072,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) + unsigned long nr_reclaimed, nr_scanned; + struct lruvec *target_lruvec; + bool reclaimable = false; +- unsigned long file; + + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + again: +- /* +- * Flush the memory cgroup stats, so that we read accurate per-memcg +- * lruvec stats for heuristics. +- */ +- mem_cgroup_flush_stats(); - - return true; + memset(&sc->nr, 0, sizeof(sc->nr)); + + nr_reclaimed = sc->nr_reclaimed; + nr_scanned = sc->nr_scanned; + +- /* +- * Determine the scan balance between anon and file LRUs. +- */ +- spin_lock_irq(&target_lruvec->lru_lock); +- sc->anon_cost = target_lruvec->anon_cost; +- sc->file_cost = target_lruvec->file_cost; +- spin_unlock_irq(&target_lruvec->lru_lock); +- +- /* +- * Target desirable inactive:active list ratios for the anon +- * and file LRU lists. 
+- */ +- if (!sc->force_deactivate) { +- unsigned long refaults; +- +- refaults = lruvec_page_state(target_lruvec, +- WORKINGSET_ACTIVATE_ANON); +- if (refaults != target_lruvec->refaults[0] || +- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) +- sc->may_deactivate |= DEACTIVATE_ANON; +- else +- sc->may_deactivate &= ~DEACTIVATE_ANON; +- +- /* +- * When refaults are being observed, it means a new +- * workingset is being established. Deactivate to get +- * rid of any stale active pages quickly. +- */ +- refaults = lruvec_page_state(target_lruvec, +- WORKINGSET_ACTIVATE_FILE); +- if (refaults != target_lruvec->refaults[1] || +- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) +- sc->may_deactivate |= DEACTIVATE_FILE; +- else +- sc->may_deactivate &= ~DEACTIVATE_FILE; +- } else +- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; +- +- /* +- * If we have plenty of inactive file pages that aren't +- * thrashing, try to reclaim those first before touching +- * anonymous pages. +- */ +- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); +- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) +- sc->cache_trim_mode = 1; +- else +- sc->cache_trim_mode = 0; +- +- /* +- * Prevent the reclaimer from falling into the cache trap: as +- * cache pages start out inactive, every cache fault will tip +- * the scan balance towards the file LRU. And as the file LRU +- * shrinks, so does the window for rotation from references. +- * This means we have a runaway feedback loop where a tiny +- * thrashing file LRU becomes infinitely more attractive than +- * anon pages. Try to detect this based on file LRU size. 
+- */ +- if (!cgroup_reclaim(sc)) { +- unsigned long total_high_wmark = 0; +- unsigned long free, anon; +- int z; +- +- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); +- file = node_page_state(pgdat, NR_ACTIVE_FILE) + +- node_page_state(pgdat, NR_INACTIVE_FILE); +- +- for (z = 0; z < MAX_NR_ZONES; z++) { +- struct zone *zone = &pgdat->node_zones[z]; +- if (!managed_zone(zone)) +- continue; +- +- total_high_wmark += high_wmark_pages(zone); +- } +- +- /* +- * Consider anon: if that's low too, this isn't a +- * runaway file reclaim problem, but rather just +- * extreme pressure. Reclaim as per usual then. +- */ +- anon = node_page_state(pgdat, NR_INACTIVE_ANON); +- +- sc->file_is_tiny = +- file + free <= total_high_wmark && +- !(sc->may_deactivate & DEACTIVATE_ANON) && +- anon >> sc->priority; +- } ++ prepare_scan_count(pgdat, sc); + + shrink_node_memcgs(pgdat, sc); + +@@ -3557,6 +6339,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) + struct lruvec *target_lruvec; + unsigned long refaults; + ++ if (lru_gen_enabled()) ++ return; ++ + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); + target_lruvec->refaults[0] = refaults; +@@ -3923,12 +6708,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, } - --- -2.37.3.998.g577e59143f-goog + #endif + +-static void age_active_anon(struct pglist_data *pgdat, +- struct scan_control *sc) ++static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) + { + struct mem_cgroup *memcg; + struct lruvec *lruvec; + ++ if (lru_gen_enabled()) { ++ lru_gen_age_node(pgdat, sc); ++ return; ++ } ++ + if (!can_age_anon_pages(pgdat, sc)) + return; + +@@ -4248,12 +7037,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) + sc.may_swap = !nr_boost_reclaim; + + /* +- * Do some background aging of the anon list, to give +- * pages a chance to be 
referenced before reclaiming. All +- * pages are rotated regardless of classzone as this is +- * about consistent aging. ++ * Do some background aging, to give pages a chance to be ++ * referenced before reclaiming. All pages are rotated ++ * regardless of classzone as this is about consistent aging. + */ +- age_active_anon(pgdat, &sc); ++ kswapd_age_node(pgdat, &sc); + + /* + * If we're getting trouble reclaiming, start doing writepage +diff --git a/mm/workingset.c b/mm/workingset.c +index a5e84862fc8688..ae7e984b23c6b0 100644 +--- a/mm/workingset.c ++++ b/mm/workingset.c +@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; + static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, + bool workingset) + { +- eviction >>= bucket_order; + eviction &= EVICTION_MASK; + eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; + eviction = (eviction << NODES_SHIFT) | pgdat->node_id; +@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, + + *memcgidp = memcgid; + *pgdat = NODE_DATA(nid); +- *evictionp = entry << bucket_order; ++ *evictionp = entry; + *workingsetp = workingset; + } + ++#ifdef CONFIG_LRU_GEN ++ ++static void *lru_gen_eviction(struct folio *folio) ++{ ++ int hist; ++ unsigned long token; ++ unsigned long min_seq; ++ struct lruvec *lruvec; ++ struct lru_gen_struct *lrugen; ++ int type = folio_is_file_lru(folio); ++ int delta = folio_nr_pages(folio); ++ int refs = folio_lru_refs(folio); ++ int tier = lru_tier_from_refs(refs); ++ struct mem_cgroup *memcg = folio_memcg(folio); ++ struct pglist_data *pgdat = folio_pgdat(folio); ++ ++ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); ++ ++ lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ lrugen = &lruvec->lrugen; ++ min_seq = READ_ONCE(lrugen->min_seq[type]); ++ token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0); ++ ++ hist = lru_hist_from_seq(min_seq); ++ atomic_long_add(delta, 
&lrugen->evicted[hist][type][tier]); ++ ++ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs); ++} ++ ++static void lru_gen_refault(struct folio *folio, void *shadow) ++{ ++ int hist, tier, refs; ++ int memcg_id; ++ bool workingset; ++ unsigned long token; ++ unsigned long min_seq; ++ struct lruvec *lruvec; ++ struct lru_gen_struct *lrugen; ++ struct mem_cgroup *memcg; ++ struct pglist_data *pgdat; ++ int type = folio_is_file_lru(folio); ++ int delta = folio_nr_pages(folio); ++ ++ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); ++ ++ if (pgdat != folio_pgdat(folio)) ++ return; ++ ++ rcu_read_lock(); ++ ++ memcg = folio_memcg_rcu(folio); ++ if (memcg_id != mem_cgroup_id(memcg)) ++ goto unlock; ++ ++ lruvec = mem_cgroup_lruvec(memcg, pgdat); ++ lrugen = &lruvec->lrugen; ++ ++ min_seq = READ_ONCE(lrugen->min_seq[type]); ++ if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) ++ goto unlock; ++ ++ hist = lru_hist_from_seq(min_seq); ++ /* see the comment in folio_lru_refs() */ ++ refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; ++ tier = lru_tier_from_refs(refs); ++ ++ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); ++ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); ++ ++ /* ++ * Count the following two cases as stalls: ++ * 1. For pages accessed through page tables, hotter pages pushed out ++ * hot pages which refaulted immediately. ++ * 2. For pages accessed multiple times through file descriptors, ++ * numbers of accesses might have been out of the range. 
++ */ ++ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { ++ folio_set_workingset(folio); ++ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); ++ } ++unlock: ++ rcu_read_unlock(); ++} ++ ++#else /* !CONFIG_LRU_GEN */ ++ ++static void *lru_gen_eviction(struct folio *folio) ++{ ++ return NULL; ++} ++ ++static void lru_gen_refault(struct folio *folio, void *shadow) ++{ ++} ++ ++#endif /* CONFIG_LRU_GEN */ ++ + /** + * workingset_age_nonresident - age non-resident entries as LRU ages + * @lruvec: the lruvec that was aged +@@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) + VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + ++ if (lru_gen_enabled()) ++ return lru_gen_eviction(folio); ++ + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + /* XXX: target_memcg can be NULL, go through lruvec */ + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); + eviction = atomic_long_read(&lruvec->nonresident_age); ++ eviction >>= bucket_order; + workingset_age_nonresident(lruvec, folio_nr_pages(folio)); + return pack_shadow(memcgid, pgdat, eviction, + folio_test_workingset(folio)); +@@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) + int memcgid; + long nr; + ++ if (lru_gen_enabled()) { ++ lru_gen_refault(folio, shadow); ++ return; ++ } ++ + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); ++ eviction <<= bucket_order; + + rcu_read_lock(); + /*
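[Editor's note, outside the patch: the mm/workingset.c hunks above move the `bucket_order` shift out of `pack_shadow()`/`unpack_shadow()` and into the classic-LRU callers, so that MGLRU can store its own full-precision token — `(min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0)` — in the shadow entry. The packing itself is ordinary bit-field stacking, sketched below with made-up field widths (`DEMO_*`); the real kernel derives the widths from `MEM_CGROUP_ID_SHIFT`, `NODES_SHIFT` and the XArray tag bits, and the widths chosen here are sized to fit a 32-bit `unsigned long`.]

```c
#include <assert.h>

/* Assumed widths for the sketch only. */
#define DEMO_MEMCG_BITS     8
#define DEMO_NODE_BITS      4
#define DEMO_WS_BITS        1
#define DEMO_LRU_REFS_WIDTH 2

/* Stack (eviction token, memcg id, node id, workingset bit) into one word,
 * most-significant field first, exactly as pack_shadow() layers its fields. */
static unsigned long demo_pack_shadow(int memcgid, int nid,
				      unsigned long eviction, int workingset)
{
	unsigned long entry = eviction;

	entry = (entry << DEMO_MEMCG_BITS) | (unsigned long)memcgid;
	entry = (entry << DEMO_NODE_BITS) | (unsigned long)nid;
	entry = (entry << DEMO_WS_BITS) | (unsigned long)workingset;
	return entry;
}

/* Peel the fields back off in reverse order. */
static void demo_unpack_shadow(unsigned long entry, int *memcgid, int *nid,
			       unsigned long *eviction, int *workingset)
{
	*workingset = (int)(entry & ((1UL << DEMO_WS_BITS) - 1));
	entry >>= DEMO_WS_BITS;
	*nid = (int)(entry & ((1UL << DEMO_NODE_BITS) - 1));
	entry >>= DEMO_NODE_BITS;
	*memcgid = (int)(entry & ((1UL << DEMO_MEMCG_BITS) - 1));
	entry >>= DEMO_MEMCG_BITS;
	*eviction = entry;
}

/* MGLRU's eviction token, as built in lru_gen_eviction() above:
 * the oldest sequence number plus the page's reference count. */
static unsigned long demo_lru_gen_token(unsigned long min_seq, int refs)
{
	return (min_seq << DEMO_LRU_REFS_WIDTH) |
	       (unsigned long)(refs > 0 ? refs - 1 : 0);
}
```

On refault, `lru_gen_refault()` reverses the token: the high bits are compared against the current `min_seq` (a mismatch means the shadow entry is stale), and the low `LRU_REFS_WIDTH` bits recover the reference count used to pick the tier whose `refaulted[]` counter is bumped.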