PDS Kernel Configuration

2022-11-24 12:52:33 -05:00
2 changed files with 268 additions and 0 deletions
--- a/linux-tkg-patches/6.0/0012-misc-additions.patch
+++ b/linux-tkg-patches/6.0/0012-misc-additions.patch
@@ -64,6 +64,140 @@ index 2c7171e0b0010..85de313ddec29 100644
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
 From 2535fbde890f14c78b750139fcf87d1143850626 Mon Sep 17 00:00:00 2001
 From: Johannes Weiner <hannes@cmpxchg.org>
 Date: Tue, 2 Aug 2022 12:28:11 -0400
 Subject: [PATCH] mm: vmscan: fix extreme overreclaim and swap floods
 During proactive reclaim, we sometimes observe severe overreclaim, with
 several thousand times more pages reclaimed than requested.
 This trace was obtained from shrink_lruvec() during such an instance:
    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]
 While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
 by swapping.  These requests take over a minute, during which the write()
 to memory.reclaim is unkillably stuck inside the kernel.
 Digging into the source, this is caused by the proportional reclaim
 bailout logic.  This code tries to resolve a fundamental conflict: to
 reclaim roughly what was requested, while also aging all LRUs fairly and
 in accordance to their size, swappiness, refault rates etc.  The way it
 attempts fairness is that once the reclaim goal has been reached, it stops
 scanning the LRUs with the smaller remaining scan targets, and adjusts the
 remainder of the bigger LRUs according to how much of the smaller LRUs was
 scanned.  It then finishes scanning that remainder regardless of the
 reclaim goal.
 This works fine if priority levels are low and the LRU lists are
 comparable in size.  However, in this instance, the cgroup that is
 targeted by proactive reclaim has almost no files left - they've already
 been squeezed out by proactive reclaim earlier - and the remaining anon
 pages are hot.  Anon rotations cause the priority level to drop to 0,
 which results in reclaim targeting all of anon (a lot) and all of file
 (almost nothing).  By the time reclaim decides to bail, it has scanned
 most or all of the file target, and therefore must also scan most or all of
 the enormous anon target.  This target is thousands of times larger than
 the reclaim goal, thus causing the overreclaim.
 The bailout code hasn't changed in years, why is this failing now?  The
 most likely explanations are two other recent changes in anon reclaim:
 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
   balancing effect of new transparent huge pages"), the VM was
   overall relatively reluctant to swap at all, even if swap was
   configured. This means the LRU balancing code didn't come into play
   as often as it does now, and mostly in high pressure situations
   where pronounced swap activity wouldn't be as surprising.
 2. For historic reasons, shrink_lruvec() loops on the scan targets of
   all LRU lists except the active anon one, meaning it would bail if
   the only remaining pages to scan were active anon - even if there
   were a lot of them.
   Before the series starting with commit ccc5dc67340c ("mm/vmscan:
   make active/inactive ratio as 1:1 for anon lru"), most anon pages
   would live on the active LRU; the inactive one would contain only a
   handful of preselected reclaim candidates. After the series, anon
   gets aged similarly to file, and the inactive list is the default
   for new anon pages as well, making it often the much bigger list.
   As a result, the VM is now more likely to actually finish large
   anon targets than before.
 Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
 larger LRU lists is made before bailing out on a met reclaim goal.
 This fixes the extreme overreclaim problem.
 Fairness is more subtle and harder to evaluate.  No obvious misbehavior
 was observed on the test workload, in any case.  Conceptually, fairness
 should primarily be a cumulative effect from regular, lower priority
 scans.  Once the VM is in trouble and needs to escalate scan targets to
 make forward progress, fairness needs to take a backseat.  This is also
 acknowledged by the myriad exceptions in get_scan_count().  This patch
 makes fairness decrease gradually, as it keeps fairness work static over
 increasing priority levels with growing scan targets.  This should make
 more sense - although we may have to re-visit the exact values.
 Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
 Reviewed-by: Rik van Riel <riel@surriel.com>
 Acked-by: Mel Gorman <mgorman@techsingularity.net>
 Cc: Hugh Dickins <hughd@google.com>
 Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
 Cc: <stable@vger.kernel.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---
 mm/vmscan.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 382dbe97329f33..266eb8cfe93a67 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
@@ -2955,8 +2955,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 +	bool proportional_reclaim;
 	struct blk_plug plug;
 -	bool scan_adjusted;
 	get_scan_count(lruvec, sc, nr);
@@ -2974,8 +2974,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	 * abort proportional reclaim if either the file or anon lru has already
 	 * dropped to zero at the first pass.
 	 */
 -	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 -			 sc->priority == DEF_PRIORITY);
 +	proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 +				sc->priority == DEF_PRIORITY);
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2995,7 +2995,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 -		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
 +		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
 			continue;
 		/*
@@ -3046,8 +3046,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr_scanned = targets[lru] - nr[lru];
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
 -
 -		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
 	sc->nr_reclaimed += nr_reclaimed;
 From 430daaab3c78de6bd82f10cfb5a0f016c6e583f6 Mon Sep 17 00:00:00 2001
 From: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
 Date: Mon, 4 Oct 2021 14:07:34 -0400
--- a/linux-tkg-patches/6.1/0012-misc-additions.patch
+++ b/linux-tkg-patches/6.1/0012-misc-additions.patch
@@ -64,6 +64,140 @@ index 2c7171e0b0010..85de313ddec29 100644
 	select CPU_FREQ_GOV_PERFORMANCE
 	help
 From 2535fbde890f14c78b750139fcf87d1143850626 Mon Sep 17 00:00:00 2001
 From: Johannes Weiner <hannes@cmpxchg.org>
 Date: Tue, 2 Aug 2022 12:28:11 -0400
 Subject: [PATCH] mm: vmscan: fix extreme overreclaim and swap floods
 During proactive reclaim, we sometimes observe severe overreclaim, with
 several thousand times more pages reclaimed than requested.
 This trace was obtained from shrink_lruvec() during such an instance:
    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]
 While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
 by swapping.  These requests take over a minute, during which the write()
 to memory.reclaim is unkillably stuck inside the kernel.
 Digging into the source, this is caused by the proportional reclaim
 bailout logic.  This code tries to resolve a fundamental conflict: to
 reclaim roughly what was requested, while also aging all LRUs fairly and
 in accordance to their size, swappiness, refault rates etc.  The way it
 attempts fairness is that once the reclaim goal has been reached, it stops
 scanning the LRUs with the smaller remaining scan targets, and adjusts the
 remainder of the bigger LRUs according to how much of the smaller LRUs was
 scanned.  It then finishes scanning that remainder regardless of the
 reclaim goal.
 This works fine if priority levels are low and the LRU lists are
 comparable in size.  However, in this instance, the cgroup that is
 targeted by proactive reclaim has almost no files left - they've already
 been squeezed out by proactive reclaim earlier - and the remaining anon
 pages are hot.  Anon rotations cause the priority level to drop to 0,
 which results in reclaim targeting all of anon (a lot) and all of file
 (almost nothing).  By the time reclaim decides to bail, it has scanned
 most or all of the file target, and therefore must also scan most or all of
 the enormous anon target.  This target is thousands of times larger than
 the reclaim goal, thus causing the overreclaim.
 The bailout code hasn't changed in years, why is this failing now?  The
 most likely explanations are two other recent changes in anon reclaim:
 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
   balancing effect of new transparent huge pages"), the VM was
   overall relatively reluctant to swap at all, even if swap was
   configured. This means the LRU balancing code didn't come into play
   as often as it does now, and mostly in high pressure situations
   where pronounced swap activity wouldn't be as surprising.
 2. For historic reasons, shrink_lruvec() loops on the scan targets of
   all LRU lists except the active anon one, meaning it would bail if
   the only remaining pages to scan were active anon - even if there
   were a lot of them.
   Before the series starting with commit ccc5dc67340c ("mm/vmscan:
   make active/inactive ratio as 1:1 for anon lru"), most anon pages
   would live on the active LRU; the inactive one would contain only a
   handful of preselected reclaim candidates. After the series, anon
   gets aged similarly to file, and the inactive list is the default
   for new anon pages as well, making it often the much bigger list.
   As a result, the VM is now more likely to actually finish large
   anon targets than before.
 Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
 larger LRU lists is made before bailing out on a met reclaim goal.
 This fixes the extreme overreclaim problem.
 Fairness is more subtle and harder to evaluate.  No obvious misbehavior
 was observed on the test workload, in any case.  Conceptually, fairness
 should primarily be a cumulative effect from regular, lower priority
 scans.  Once the VM is in trouble and needs to escalate scan targets to
 make forward progress, fairness needs to take a backseat.  This is also
 acknowledged by the myriad exceptions in get_scan_count().  This patch
 makes fairness decrease gradually, as it keeps fairness work static over
 increasing priority levels with growing scan targets.  This should make
 more sense - although we may have to re-visit the exact values.
 Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
 Reviewed-by: Rik van Riel <riel@surriel.com>
 Acked-by: Mel Gorman <mgorman@techsingularity.net>
 Cc: Hugh Dickins <hughd@google.com>
 Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
 Cc: <stable@vger.kernel.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---
 mm/vmscan.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 382dbe97329f33..266eb8cfe93a67 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
@@ -2955,8 +2955,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 +	bool proportional_reclaim;
 	struct blk_plug plug;
 -	bool scan_adjusted;
 	get_scan_count(lruvec, sc, nr);
@@ -2974,8 +2974,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	 * abort proportional reclaim if either the file or anon lru has already
 	 * dropped to zero at the first pass.
 	 */
 -	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 -			 sc->priority == DEF_PRIORITY);
 +	proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 +				sc->priority == DEF_PRIORITY);
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2995,7 +2995,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 -		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
 +		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
 			continue;
 		/*
@@ -3046,8 +3046,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr_scanned = targets[lru] - nr[lru];
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
 -
 -		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
 	sc->nr_reclaimed += nr_reclaimed;
 From 430daaab3c78de6bd82f10cfb5a0f016c6e583f6 Mon Sep 17 00:00:00 2001
 From: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
 Date: Mon, 4 Oct 2021 14:07:34 -0400