Why isn’t Oracle using huge pages on my Red Hat Linux server?
I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:
$ top
top - 18:04:20 up 6 days, 3:22, 7 users,  load average: 14.25, 12.61, 14.41
Tasks: 536 total,   2 running, 533 sleeping,   0 stopped,   1 zombie
Cpu(s): 12.9%us, 19.2%sy,  0.0%ni, 20.9%id, 45.0%wa,  0.1%hi,  1.9%si,  0.0%st
Mem:  16373544k total, 16334112k used,    39432k free,     4916k buffers
Swap: 16777208k total,  2970156k used, 13807052k free,  5492216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  491 root      10  -5     0    0    0 D 55.6  0.0  67:22.85 kswapd0
  492 root      10  -5     0    0    0 S 25.8  0.0  37:01.75 kswapd1
  494 root      11  -5     0    0    0 S 24.8  0.0  42:15.31 kswapd3
 8730 oracle    -2   0 8352m 3.5g 3.5g S  9.9 22.4 139:36.18 oracle
 8726 oracle    -2   0 8352m 3.5g 3.5g S  9.6 22.5 138:13.54 oracle
32643 oracle    15   0 8339m  97m  92m S  9.6  0.6   0:01.31 oracle
  493 root      11  -5     0    0    0 S  9.3  0.0  43:11.31 kswapd2
 8714 oracle    -2   0 8352m 3.5g 3.5g S  9.3 22.4 137:14.96 oracle
 8718 oracle    -2   0 8352m 3.5g 3.5g S  8.9 22.3 137:01.91 oracle
19398 oracle    15   0 8340m 547m 545m R  7.9  3.4   0:05.26 oracle
 8722 oracle    -2   0 8352m 3.5g 3.5g S  7.6 22.5 139:18.33 oracle
The kswapd process is responsible for scanning memory to locate free pages and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously near the top of the top output is a really bad sign. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match up with the values on the other nodes):
$ free
             total       used       free     shared    buffers     cached
Mem:      16373544   16268540     105004          0       1520    5465680
-/+ buffers/cache:   10801340    5572204
Swap:     16777208    2948684   13828524
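Before digging into Oracle itself, it is worth confirming that the box is actively swapping rather than just running with a low free number. One quick way to do that (a sketch, not output I captured at the time) is to watch the swap-in and swap-out columns in vmstat:

# Sample memory and swap activity every 5 seconds; sustained non-zero values
# in the "si" (swap in) and "so" (swap out) columns confirm active paging,
# which lines up with the constant kswapd activity in the top output above.
$ vmstat 5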
To start debugging the issue, I first looked at ipcs to see how much shared memory the database had allocated. In the output below, we can see that there is a 128MB and an 8GB shared memory segment allocated:
$ ipcs -a
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes       nattch     status
0x62e08f78 0          oracle     640        132120576   16
0xdd188948 32769      oracle     660        8592031744  87
The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:
$ grep Huge /proc/meminfo
HugePages_Total:  4106
HugePages_Free:   4051
HugePages_Rsvd:      8
Hugepagesize:     2048 kB
While our DBA had set vm.nr_hugepages to a sufficiently large value in /etc/sysctl.conf, the database was utilizing a very small portion of the huge pages. This meant that the database was being allocated out of non-huge-page memory (Linux dedicates memory to the huge page area, and it is wasted if nothing utilizes it), and inactive pages were being paged out to disk since the database wasn’t utilizing the huge page area we reserved for it. After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory allocated to huge pages was 8610906112 bytes:
$ grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=4106
$ bc
4106*(1024*1024*2)
8610906112
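The same number can be pulled straight out of /proc/meminfo if you would rather skip the manual math; something along these lines should work, assuming Hugepagesize is reported in kB as it is above:

# Multiply the number of huge pages by the huge page size (in kB) to get
# the total number of bytes reserved for the huge page pool.
$ awk '/HugePages_Total/ {pages=$2} /Hugepagesize/ {kb=$2} END {print pages * kb * 1024}' /proc/meminfo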
If we add the totals from the two shared memory segments above:
$ bc
8592031744+132120576
8724152320
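Subtracting the huge page pool from the combined segment size makes the shortfall obvious (113246208 bytes, or exactly 54 2MB huge pages, using the figures above):

$ bc
8724152320-8610906112
113246208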
We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both segments (a rough sketch of that calculation follows the list below), the system no longer swapped and database performance increased. This debugging adventure taught me a few things:
1. Double check system values people send you
2. Solaris does a MUCH better job of handling large page sizes (huge pages are used transparently)
3. The Linux tools for investigating huge page allocations are severely lacking
4. Oracle is able to allocate a contiguous 8GB chunk of shared memory on RHEL5, but not on RHEL4 (I need to do some research to find out why)
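For anyone hitting the same problem, here is the rough sketch I mentioned above for deriving vm.nr_hugepages from the running shared memory segments. It assumes a 2MB huge page size and that every segment owned by oracle should be backed by huge pages, so treat it as a starting point rather than a polished script:

# Sum the oracle-owned shared memory segments from ipcs -m (bytes are in
# column 5), rounding each one up to a whole number of 2MB huge pages.
# For the two segments above this prints 4160.
$ ipcs -m | awk '/oracle/ {pages += int(($5 + 2097152 - 1) / 2097152)} END {print pages}'

# Apply the new value on the fly (and update vm.nr_hugepages in /etc/sysctl.conf
# so it survives a reboot). The pool may not grow until enough contiguous memory
# is free, so a reboot may be needed for all of the pages to show up.
$ sysctl -w vm.nr_hugepages=4160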
Hopefully more work will go into the Linux huge page implementation and let me scratch the second and third items off my list. Viva la problem resolution!