Classic Java OutOfMemoryError

A few months back, a performance testing team in the hospitality industry encountered an OutOfMemoryError at the end of 2 hours during their endurance runs. They repeated the endurance run, but the end result was the same. Without a second thought, the performance testing team increased the heap from the existing 4 GB to 8 GB. This time the OutOfMemoryError occurred at the end of 4 hours.

About the application: The AUT (Application Under Test) was very simple: users log into the system, invoke a static page for 3000 iterations, and then log out.

Requirements: At any given point in time, 8000 users have to be logged into the system. There are around 100K users available in the system.

Analysis: Ideally, there should not be any problem with such a simple application. Having known this team for a while, I asked them a few questions:

  • What does the application do?
  • How do the users log in, and what do they do after logging in?
  • Does the same user log into the system every time, or does a new user log in each time?

Based on the responses provided by the team, I came to the conclusion that the session was not getting invalidated when the user logged out.

I looked at the session keep-alive time and found that this value had been configured to 24 hours.

After reducing the session keep-alive time to 30 minutes, the OutOfMemoryError was resolved without even looking at a heap dump (lots of time saved!!).
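For reference, here is a minimal sketch of what proper logout handling looks like in a servlet-based application; the class name and redirect target are illustrative, not the actual application code. The key point is that invalidate() releases the session immediately instead of letting it linger for the full keep-alive period.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Illustrative logout servlet: calling invalidate() makes the session
// (and everything cached inside it) eligible for garbage collection right away.
public class LogoutServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        HttpSession session = request.getSession(false); // do not create a new session here
        if (session != null) {
            session.invalidate(); // release the session on logout
        }
        response.sendRedirect("login.jsp"); // illustrative target page
    }
}

The reduced timeout is the safety net for sessions that are never explicitly logged out; it can be set per session with session.setMaxInactiveInterval(30 * 60) or globally in the deployment descriptor.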

Conclusion: As a performance tester, the very first thing to do is understand the application's behavior. Never change values without fully understanding the impact of the change.


A potential memory leak – Metaspace leak

For the last few weeks, I have been working on benchmarking a product that was developed using JDK 1.8u65 and deployed on WebLogic 12c.

I started the initial load tests just to see how the application behaves, and while doing so observed DB and JVM performance. I found a few issues during the load testing process.

  1. Inserts into the DB were happening one at a time
  2. A complete materialized view (MV) refresh was happening every 30 seconds
  3. Full GCs were observed

The issue with point 1 is too many network round trips to the database, and the log file sync wait event was also the top wait event. I asked the dev team to change the code so that records get committed to the database in batches, thereby reducing the number of network round trips (see the batching sketch below).
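As an illustration of the kind of change requested (the table and column names here are hypothetical, not the product's schema), JDBC batching collapses the per-row round trips and per-row commits into one round trip and one commit per batch:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertExample {

    private static final int BATCH_SIZE = 500; // assumed batch size; tune for the workload

    // Inserts all records in batches instead of one INSERT + commit per record,
    // cutting network round trips and the per-commit log file sync waits.
    public static void insertInBatches(Connection conn, List<String[]> records) throws SQLException {
        String sql = "INSERT INTO booking_events (event_id, payload) VALUES (?, ?)"; // hypothetical table
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] record : records) {
                ps.setString(1, record[0]);
                ps.setString(2, record[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit(); // one commit per batch, not per row
                }
            }
            ps.executeBatch(); // flush the remaining partial batch
            conn.commit();
        }
    }
}

With one commit per batch instead of one per row, the log file sync waits drop along with the round trips.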

The issue with point 2 is that the MV refresh processes all the records in the database every time it runs. We don't see much impact while the database size is in the range of a few MBs; the real problem arises when the database grows to a few GBs/TBs. I asked the dev team to implement an incremental refresh: a complete refresh happens only the first time, and from then on the refresh processes only the delta records.

In the meantime, I started analyzing the GC logs and the reason for the full GCs. At first glance, the full GCs seemed to be okay. On further analysis, the metaspace was also getting resized (space getting reclaimed), which is not normal. We know that the young and old regions get resized (space reclaimed) during minor and major GCs, but why was the metaspace getting resized, and that too in a steady state?

There might be two reasons for this.

  1. The load testing process (the way the requests are sent). This can be ruled out, as I do the initialization once and then run the main transactions.
  2. New instances are getting created again and again. This is the only option left to explore; at this point I could not involve the developers, as they would ask me which instances those are (see the sketch below).
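For context, here is a made-up sketch of one common pattern that produces exactly this GC-log signature: classes being defined again and again in throwaway classloaders, so Metaspace keeps growing between full GCs and gets resized (reclaimed) only when a full GC unloads the dead classes. The actual leak in this product was in its own code, but the symptom reads the same.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo: every iteration defines a fresh copy of a class in a
// brand-new classloader, so new class metadata keeps landing in Metaspace.
public class MetaspaceChurnDemo {

    // A loader that defines the class itself instead of delegating to its parent.
    static class ThrowawayLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] classBytes = readClassBytes(MetaspaceChurnDemo.class);
        for (int i = 0; i < 1_000_000; i++) {
            // A new loader and a fresh copy of the class metadata on every iteration.
            Class<?> copy = new ThrowawayLoader()
                    .define(MetaspaceChurnDemo.class.getName(), classBytes);
            // 'copy' and its loader become unreachable right here; the metadata is
            // reclaimable, but HotSpot only unloads it during a full GC, which is
            // why Metaspace shows up as "resized" in the GC log at that point.
        }
    }

    // Reads the raw .class bytes of the given class from the classpath (JDK 7/8 friendly).
    private static byte[] readClassBytes(Class<?> clazz) throws IOException {
        String resource = "/" + clazz.getName().replace('.', '/') + ".class";
        try (InputStream in = clazz.getResourceAsStream(resource);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }
}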

As a next step, I took 3 complete heap dumps with the heap occupancy at different stages (immediately after a full GC, when the heap was half full, and when the heap was about to be full GC'ed).
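For completeness, here is a sketch of one way to capture such heap dumps programmatically via the HotSpot diagnostic MXBean (the output path is illustrative); jmap -dump:live,format=b,file=<path> <pid> against the running JVM achieves the same from the command line.

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Sketch: capture a heap dump of the current JVM in HPROF format.
public class HeapDumper {

    public static void dumpHeap(String filePath) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, true); // 'true' restricts the dump to live objects only
    }

    public static void main(String[] args) throws IOException {
        dumpHeap("/tmp/app-heap-" + System.currentTimeMillis() + ".hprof"); // illustrative path
    }
}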

I analyzed all the heap dumps and found that a particular class was responsible: the number of instances of that class kept increasing across all three dumps.

Now that I had the required data, I checked with the developer whether that particular class was being instantiated in the code, and he confirmed that his part of the code was creating those instances.

Voila.. got the fix. I repeated the same process again, and this time I could see only a few instances of that class. The leak was fixed.

Hope it helps.

GC Log Configuration Issue

I was trying to configure GC logging in the WebLogic startup scripts on one of the Linux servers (OEL 7).

JDK: 1.7u80

MEM_ARGS="-Xms2048m -Xmx2048m -XX:PermSize=512m -XX:MaxPermSize=512m -Xloggc:/scratch/oracle/middleware/user_projects/domains/oamdomain/servers/oam_server2/logs/oamserver2$$-gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution"

With the above parameters, I was unable to start the WebLogic servers and got the error below.

Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Invalid file name for use with -Xloggc: Filename can only contain the characters [A-Z][a-z][0-9]-_.%[p|t] but it has been oamserver2-gc.log -XX:+PrintGC
Note %p or %t can only be used once

Solution:
MEM_ARGS was split into two parts:

export MEM_ARGS=" -Xms2048m -Xmx2048m -XX:PermSize=512m -XX:MaxPermSize=512m -Xloggc:/scratch/oracle/middleware/user_projects/domains/oamdomain/servers/oam_server2/logs/oamserver2$$-gc.log"

export JAVA_OPTIONS="-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution $JAVA_OPTIONS"

The WebLogic startup script was creating problems while starting the server when everything was part of MEM_ARGS. Maybe some sort of validation was happening somewhere 😉

Hope it helps.

IDM servers going down

Problem Statement:
All Java processes pertaining to the IDM stack were getting killed periodically, and the servers had to be restarted very often.

Software:
WebLogic 11g, IDM stack, Java 1.7.0_79

Analysis:

  • There were no thread dumps or core dumps.
  • System logs were clean.
  • Grepped the WebLogic logs for errors but could not find anything.
  • After carefully going through each and every line of the WebLogic logs, I found the messages below:
    ####<Jul 26, 2017 6:36:47 PM IST> <Notice> <WebLogicServer> <myhost.me.me.com> <AdminServer> <Thread-1> <<WLS Kernel>> <> <ae8121cbb691c6b3:-fc73acf:15d7e34bd34:-8000-00000000000002ff> <1501074407979> <BEA-000388> <JVM called WLS shutdown hook. The server will force shutdown now>
    ####<Jul 26, 2017 6:36:47 PM IST> <Alert> <WebLogicServer> <myhost.me.me.com> <AdminServer> <Thread-1> <<WLS Kernel>> <> <ae8121cbb691c6b3:-fc73acf:15d7e34bd34:-8000-00000000000002ff> <1501074407980> <BEA-000396> <Server shutdown has been requested by <WLS Kernel>>

The logs clearly indicate that the JVM itself initiated the shutdown of the servers and that the OS had nothing to do with it.

Solution:
The problem can occur due to incompatibility of software. In our case there was no customization and none of our own software was deployed; ours was a clean install.
This pointed me to the JDK version. After pointing to the correct Java, the issue was resolved.

Hope it helps.