Getting Shorter Garbage Collection Pauses
We noticed that our test PeopleSoft system was very slow on occasion, such that the load balancer decided it was broken and redirected sessions to the webpage we have for when the system is in maintenance.
Since the webserver is Weblogic, it runs in a Java Virtual Machine (JVM). The first thing to check is how long the garbage collection pauses are. Fortunately I had garbage collection logging switched on, so I could see that they were over 100 seconds on occasion, which is far too long. This is what my garbage collection log parameters were set to (Java 11).
-Xlog:gc*,safepoint=info:file=gc_%p_%t.log:time:filecount=20,filesize=10M
Tuning Garbage Collection
It looks like for JDK 11 the G1 garbage collector is used by default, unless the JVM is told otherwise. So let’s work through the tuning advice for G1 on Java 11.
Maximum Pause-Time Goal
This looks like the goal we want. We don’t want the pause time to be too long, so we should set -XX:MaxGCPauseMillis=200 to cap pauses at 1/5 of a second. But according to a G1 tuning article on oracle.com, 200 ms is already the default, so it is already set and we should try the next thing.
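Even though 200 ms is the default, it does no harm to set it explicitly so the intent is visible to the next person. A sketch, assuming WebLogic picks up its JVM options from a JAVA_OPTIONS variable (the variable name here is illustrative, not the PeopleSoft-delivered configuration):

```shell
# Append the pause-time goal to the JVM options (illustrative only;
# 200 ms is already the G1 default on JDK 11).
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:MaxGCPauseMillis=200"
echo "${JAVA_OPTIONS}"
```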
Footprint
This recommends setting the maximum and minimum heap sizes to the same values. This is done by the PeopleSoft Deployment Packages already.
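For illustration, fixing both bounds to the same value looks like this (the 2G figure is made up for the example; the Deployment Packages deliver their own values):

```shell
# Identical minimum and maximum heap avoids pauses caused by heap resizing.
# The size below is illustrative, not the PeopleSoft-delivered value.
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms2048m -Xmx2048m"
echo "${JAVA_OPTIONS}"
```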
Tuning for Latency
The Garbage First Garbage Collector Tuning chapter has a section on Tuning for Latency. This is what we want. The first paragraph in that section is about Unusual System or Real-Time Usage.
So (I think!) I log all the garbage collection information and the safepoint information as well. This means I have logs like this (timestamps removed and lines wrapped for readability).
Entering safepoint region: GenCollectForAllocation
GC(86) Pause Young (Allocation Failure)
GC(87) Pause Full (Allocation Failure)
GC(87) Phase 1: Mark live objects
GC(87) Phase 1: Mark live objects 80594.661ms
GC(87) Phase 2: Compute new object addresses
GC(87) Phase 2: Compute new object addresses 12417.559ms
GC(87) Phase 3: Adjust pointers
GC(87) Phase 3: Adjust pointers 241.936ms
GC(87) Phase 4: Move objects
GC(87) Phase 4: Move objects 9406.268ms
GC(87) Pause Full (Allocation Failure) 971M->237M(989M) 102667.971ms
GC(86) DefNew: 309702K->0K(314560K)
GC(86) Tenured: 685365K->242922K(699072K)
GC(86) Metaspace: 226846K(232768K)->226846K(232768K)
NonClass: 202016K(207040K)->202016K(207040K)
Class: 24830K(25728K)->24830K(25728K)
GC(86) Pause Young (Allocation Failure) 971M->237M(989M) 102671.618ms
GC(86) User=0.76s Sys=1.60s Real=102.67s
Leaving safepoint region
Total time for which application threads were stopped: 102.6720217 seconds,
Stopping threads took: 0.0000231 seconds
So here User + Sys = 2.36s, which isn’t quick, but is nothing like the 102.67 seconds that the system was paused for. Sys is quite high compared to User, which suggests there is something happening in the environment (i.e. on the VM), but the main thing that stands out is the following line in the documentation:
Another situation to look out for is real time being a lot larger than the sum of the others; this may indicate that the VM did not get enough CPU time on a possibly overloaded machine.
This seems to be our problem. It’s not the JVM itself; for some reason the JVM didn’t get enough CPU time to do garbage collection in a reasonable timescale.
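When hunting for the worst pauses across these logs, grep saves reading every entry. A quick sketch, run here against a couple of sample lines abbreviated from the log above:

```shell
# Create a small sample log (abbreviated lines from the real GC log above).
cat > /tmp/gc_sample.log <<'EOF'
Total time for which application threads were stopped: 102.6720217 seconds, Stopping threads took: 0.0000231 seconds
Total time for which application threads were stopped: 0.0123456 seconds, Stopping threads took: 0.0000200 seconds
EOF
# Print just the stop-the-world durations, longest first.
grep -o 'stopped: [0-9.]* seconds' /tmp/gc_sample.log | awk '{print $2}' | sort -rn
```

Pointed at the real gc_*.log files, the same pipeline surfaces the 100-second pauses immediately.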
Operating System Stats
So we need to look at the operating system statistics. Looking at the SAR data (our sysadmins helpfully gather SAR statistics every minute on our systems) shows that the CPU was idle with outstanding I/O requests: %iowait was quite high during this time.
03:35:01 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
03:36:01 all 16.95 3.78 4.87 41.20 0.00 0.54 0.62 0.00 0.00 32.05
03:37:01 all 6.17 0.00 2.25 22.43 0.00 0.45 0.28 0.00 0.00 68.42
03:38:01 all 11.08 3.92 2.58 0.28 0.00 0.42 0.32 0.00 0.00 81.41
Looking at top, our Java process is using 1.5G of RAM. The server has 6G of RAM and 2G of swap, so in my opinion it should have plenty of RAM to do its work. But this is starting to feel familiar: the OS swapping out memory and slowing the system down. I seem to remember dealing with a similar swapping performance problem at the application server level. I have already altered my systemd unit files for the webserver so it will auto restart, so it seems reasonable to stop the web server JVM from swapping in the same way I did for the application server. I don’t need to limit its memory though, as I already do that using the JVM parameters.
How to Prevent the JVM from Swapping
Let’s try switching off swap for Weblogic.
We are on RedHat Enterprise Linux 8 (RHEL8). I believe this all “just works” in RHEL9, but in RHEL8 it seems we need to enable swap accounting and cgroups-v2 using kernel boot parameters from grub.
There is more information in the RHEL8 manual about enabling cgroups-v2.
I understand that the above is not compatible with Podman, Docker or Kubernetes, which is why it was not enabled by default, so this might not be appropriate for every workload.
Then we can use systemd to prevent the JVM process from using swap, as I did for the application server. We need to set MemorySwapMax=0.
It turns out that the reason I previously gave for processes on the application server using swap, even though I asked them not to, is wrong. I just didn’t have swap accounting switched on.
Checking if Swap Accounting is Available and Enabled
According to Stack Exchange, the file /boot/config-$( uname -r ) contains the kernel configuration. There are two parameters that are relevant. If CONFIG_MEMCG_SWAP is set to Y, then swap accounting is available and can be enabled using a boot parameter. If CONFIG_MEMCG_SWAP_ENABLED is also set to Y, then it is enabled by default.
# grep CONFIG_MEMCG /boot/config-$( uname -r )
CONFIG_MEMCG_SWAP=y
This shows that swap accounting is available, which is the default in RedHat, but since CONFIG_MEMCG_SWAP_ENABLED is not set, it is not enabled by default.
The Actual Fixing
So taking all the above, we can do the following:
Set the following kernel boot parameters:
swapaccount=1 systemd.unified_cgroup_hierarchy=1
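On RHEL8 the usual way to add boot parameters to every installed kernel is grubby (run as root; the change takes effect on the next reboot). A sketch:

```shell
# Add swap accounting and the unified cgroup hierarchy to all boot entries.
# Requires root; a reboot is needed afterwards.
grubby --update-kernel=ALL --args="swapaccount=1 systemd.unified_cgroup_hierarchy=1"
```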
Then we can prevent the cgroup for the web server from using swap as follows:
systemctl set-property psft-pia-peoplesoft.service MemorySwapMax=0
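To check the property stuck, systemctl can show it back, and under cgroups-v2 the value also appears as a file in the unit’s cgroup directory (the path below assumes the unit sits in system.slice):

```shell
# Confirm systemd recorded the setting for the unit...
systemctl show -p MemorySwapMax psft-pia-peoplesoft.service
# ...and that the cgroup attribute reflects it (0 means no swap allowed).
cat /sys/fs/cgroup/system.slice/psft-pia-peoplesoft.service/memory.swap.max
```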
Why Am I Looking at Stack Exchange?
Good question. Google likes Stack Exchange, so its questions and answers tend to come at the top of the results. The problem is that the answers there are not definitive and date quickly, so I would much rather link to current documentation, which will have a version number on it. Newer documentation will hopefully be in the same format, and we will be able to tell quickly whether the procedure has changed for newer versions of Linux. The problem is that RedHat doesn’t document this completely, as far as I can see, so we need to go to the Linux documentation. My kernel version is 4.18, so we will need to look at that version of the documentation. On the kernel parameters documentation page, it says the following:
swapaccount=[0|1] [KNL] Enable accounting of swap in memory resource controller if no parameter or 1 is given or disable it if 0 is given (See Documentation/cgroup-v1/memory.txt)
There is no mention of systemd.unified_cgroup_hierarchy.
There is no link to the file, but I found the latest version of the document at kernel.org/doc/Documentation/cgroup-v1/memory.txt. This is the wrong version, so looking at the Linux source tree I found cgroup-v1/memory.txt for version 4.18. But both files are the same, and say:
NOTE: This document is hopelessly outdated and it asks for a complete rewrite. It still contains a useful information so we are keeping it here but make sure to check the current code if you need a deeper understanding.
There is no mention of swapaccount.
There is no mention of systemd.unified_cgroup_hierarchy on systemd.io. It is mentioned on freedesktop.org, but only to note that it is deprecated in version 252. I am using systemd version 239, so I do still need to use this. Where is the documentation?
Looking at /usr/share/doc/systemd/NEWS on my Linux box, I found the following:
* The default control group setup mode may be selected both a boot-time via a set of kernel command line parameters (specifically: systemd.unified_cgroup_hierarchy= and systemd.legacy_systemd_cgroup_controller=), as well as a compile-time default selected on the configure command line (--with-default-hierarchy=). The upstream default is "hybrid" (i.e. the cgroups-v1 + cgroups-v2 mixture discussed above) now, but this will change in a future systemd version to be "unified" (pure cgroups-v2 mode). The third option for the compile time option is "legacy", to enter pure cgroups-v1 mode. We recommend downstream distributions to default to "hybrid" mode for release distributions, starting with v233. We recommend "unified" for development distributions (specifically: distributions such as Fedora's rawhide) as that's where things are headed in the long run. Use "legacy" for greatest stability and compatibility only.
I don’t want to read the kernel source, and I don’t fully understand what the above means, so I will have to defer to others on Stack Exchange, and also test.
I think the documentation could be much better.
Testing it
To test it, we either need to wait until the operating system decides to swap out the memory to use it for something else (maybe disc cache), or we need to allocate some memory to force the Linux kernel to swap something out. Since I am impatient, I will do the following. I open two Linux sessions. In the first one I run top and add the column for swap: press f, arrow down to SWAP, and press space or d to select it. I like to move the column next to the other memory columns, which is done by pressing the right arrow key to select it for moving, then arrowing up until the column is where it needs to be. Then press enter to confirm, and escape to get back to the main top screen.
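If you prefer not to reconfigure top, the same per-process figure is available from /proc: the VmSwap line in a process’s status file shows how much of it is swapped out. For example, for the current shell:

```shell
# Per-process swap usage without top. $$ is the current shell's PID;
# substitute the Java process's PID in practice.
grep VmSwap /proc/$$/status
```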
Now that we can see how much swap each process is using, we need to force Linux to use some swap. There is a controversial one-liner on Stack Exchange which is handy for this. My VM has 6G of RAM, and the Java process only has 1G, so I started at 2G and worked up in 500M increments until the system started to swap out processes.
head -c 3500m /dev/zero | tail | sleep 60
Also, leaving the system running for a while, I see the garbage collection pauses are much shorter. It seems this is sorted!