Types of OOM Error & Java Memory Leak Causes

Types of OOM:

  • java.lang.OutOfMemoryError: Java heap space
  • java.lang.OutOfMemoryError: PermGen space
  • java.lang.OutOfMemoryError: GC overhead limit exceeded
  • java.lang.OutOfMemoryError: unable to create new native thread
  • java.lang.OutOfMemoryError: nativeGetNewTLA
  • java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  • java.lang.OutOfMemoryError: request <size> bytes for <reason>. Out of swap
  • java.lang.OutOfMemoryError: <reason> <stack trace> (Native method)
  • java.lang.OutOfMemoryError: Metaspace
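As a quick illustration, the "Requested array size exceeds VM limit" variant from the list above can be reproduced without actually exhausting the heap, because HotSpot rejects over-long array requests up front. This is a minimal sketch (the class name is illustrative, and the exact error message is JVM-specific):

```java
public class ArraySizeLimitDemo {

    static String tryHugeAllocation() {
        try {
            // Java arrays are indexed by int; a request for Integer.MAX_VALUE
            // elements exceeds the per-array limit HotSpot enforces, no matter
            // how large the heap is.
            long[] huge = new long[Integer.MAX_VALUE];
            return "allocated " + huge.length; // not reached on typical JVMs
        } catch (OutOfMemoryError e) {
            // On HotSpot the message is "Requested array size exceeds VM limit".
            return "OutOfMemoryError: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryHugeAllocation());
    }
}
```

Unlike "Java heap space", this error does not depend on -Xmx; raising the heap size will not make the allocation succeed.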

Here are the typical causes of a Java memory leak:

  • Not closing DB connections, file handles, sockets, JMS resources, and other external resources properly
  • Not closing resources properly when an exception is thrown
  • Adding objects to a cache, HashMap, Hashtable, Vector, or ArrayList without expiring the old ones
  • Not implementing hashCode() and equals() correctly for the keys of a cache
  • Session data that is too large
  • A leak in a third-party library or the application server
  • An infinite loop in application code (also a likely cause of high CPU)
  • Leaking memory in native code
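Two of the causes above, an unbounded cache and keys that do not implement equals()/hashCode(), often combine. The sketch below shows how a "look up or insert" cache silently accumulates one entry per call (class and field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLeakDemo {

    // A key class that forgets to override equals()/hashCode():
    // two keys built from the same customer id are never "the same" entry.
    static final class CustomerKey {
        final int id;
        CustomerKey(int id) { this.id = id; }
        // Missing equals() and hashCode() -- identity semantics apply.
    }

    // Unbounded, never-expired static cache: entries live forever.
    static final Map<CustomerKey, String> cache = new HashMap<>();

    public static void main(String[] args) {
        // "Look up or insert" the same logical customer 10,000 times.
        for (int i = 0; i < 10_000; i++) {
            cache.putIfAbsent(new CustomerKey(42), "customer-42-data");
        }
        // Every call inserted a new entry, because no two CustomerKey
        // instances are ever equal: the cache leaks one object per call.
        System.out.println("cache size = " + cache.size()); // prints 10000, not 1
    }
}
```

Fixing either problem alone is not enough: correct equals()/hashCode() stops the duplicate entries, but the cache still needs a size bound or expiry policy to avoid growing without limit.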

Performance engineering questions

A few questions that can help in solving performance issues:

Load balance issues

  • What type of load balancing scheme is used? (Round robin, sticky IP, least connections, subnet based?)
  • What is the timeout of the LB table?
  • Does it do any connection pooling?
  • Is it doing any content filtering?
  • Is it checking for HTTP response status?
  • Are there application dependencies associated with the LB timeout settings?
  • What failover strategies are employed?
  • What is the connection persistence timeout?
  • What are the timeouts for critical functions?

Firewall issues

  • What is the throughput capacity?
  • What is the connection capacity and rate?
  • How does the DMZ operate?
  • What are the throughput policies from a single IP?
  • What are the connection policies from a single IP?

Firewalls and multiple DMZs

  • Does the firewall do content filtering?
  • Is it sensitive to inbound and/or outbound traffic?
  • What is its upper connection limit?
  • Are there policies associated with maximum connection or throughput per IP address?
  • Are there multiple firewalls in the architecture (multiple DMZs)?
  • If it has multiple DMZs, is it sensitive to data content?

Web server issues

  • How many connections can the server handle?
  • How many open file descriptors or handles is the server configured to handle?
  • How many processes or threads is the server configured to handle?
  • Does it release and renew threads and connections correctly?
  • How large is the server’s listen queue?
  • What is the server’s “page push” capacity?
  • What type of caching is done?
  • Is there any page construction done here?
  • Is there dynamic browsing?
  • What type of server-side scripting is done? (ASP, JSP, Perl, JavaScript, PHP, etc.)
  • Are there any SSL acceleration devices in front of the web server?
  • Are there any content caching devices in front of the web server?
  • Can server extensions and their functions be validated? (ASP, JSP, PHP, Perl, CGI, servlets, ISAPI filter/app, etc.)
  • What is monitored? (Pools: threads, processes, connections, etc. Queues: ASP, sessions, etc. General: CPU, memory, I/O, context-switch rate, paging, etc.)

Application server issues

  • Is there any page construction done here?
  • How is session management done and what is the capacity?
  • Are there any clustered configurations?
  • Is there any load balancing done?
  • If there is software load balancing, which one is the load balancer?
  • What is the page construction capacity?
  • Do components have a specific interface to peripheral and external systems?

Database server issues

  • Have both small and large data sets been tested?
  • What is the connection pooling configuration?
  • What are its upper limits?

The experienced performance engineer asks questions such as:

  • Why is the application updating all these tables on an order creation?
  • Why is it calling the remote pricing call three times?
  • Why are you creating a new object for the same customer or product?
  • Why is the database connection handler making so many connections for a static number of users?
  • Did you expect your users/customers to come from a slow wireless connection? Did you test for that?
  • Did you realize the application servers were in one data center and the database in another?
  • Who set the JVM memory configuration?
  • Why are the indexes on the same volumes as the data files?
  • Why was the performance-testing database one quarter the size of the production database?
  • How many physical CPUs did you really allocate to the database server?
  • How was the peak volume determined?

“CodeCache is full. Compiler has been disabled”

The JVM's JIT compiler generates compiled code and stores it in a memory area called the CodeCache. The default maximum size of the CodeCache on most platforms is 48 MB. If an application needs to compile a large number of methods, producing a large amount of compiled code, the CodeCache may become full. When it does, the compiler is disabled to stop any further compilation of methods, and a message like the following gets logged:

Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.

Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

Code Cache  [0xffffffff77400000, 0xffffffff7a390000, 0xffffffff7a400000) total_blobs=11659 nmethods=10690 adapters=882 free_code_cache=909Kb largest_free_block=502656

When this situation occurs, the JVM may invoke sweeping and flushing of this space to make room available in the CodeCache. The JVM option UseCodeCacheFlushing controls this flushing of the CodeCache. With this option enabled, the JVM invokes an emergency flushing that discards the older half of the compiled code (nmethods) to make space available in the CodeCache. In addition, it disables the compiler until the available free space exceeds the configured CodeCacheMinimumFreeSpace. The default value of the CodeCacheMinimumFreeSpace option is 500 KB.

UseCodeCacheFlushing is set to false by default in JDK 6, and is enabled by default since JDK 7u4. This essentially means that in JDK 6, when the CodeCache becomes full, it is not swept and flushed, and further compilations are simply disabled; in JDK 7u4+, an emergency flushing is invoked when the CodeCache becomes full. Enabling this option by default made some issues related to CodeCache flushing visible in JDK 7u4+ releases. The following are two known problems in JDK 7u4+ with respect to CodeCache flushing:

1. The compiler may not get restarted even after the CodeCache occupancy drops down to almost half after the emergency flushing.
2. The emergency flushing may cause high CPU usage by the compiler threads leading to overall performance degradation.

This performance issue, and the problem of the compiler not getting re-enabled, have been addressed in JDK 8. To work around them in JDK 7u4+, we can increase the code cache size using the ReservedCodeCacheSize option, setting it to a value larger than the compiled-code footprint so that the CodeCache never becomes full. Another solution is to disable CodeCache flushing with the -XX:-UseCodeCacheFlushing JVM option.
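The workarounds above translate into JVM flags roughly as follows. This is a sketch, not a tuned recommendation: the cache size and the application jar name are placeholders to be replaced with values measured for your own workload.

```shell
# Raise the code cache cap so it never fills
# (the default is ~48 MB on most platforms before JDK 8):
java -XX:ReservedCodeCacheSize=256m -jar app.jar

# Alternatively, keep the default size but disable the
# JDK 7u4+ emergency flushing behavior:
java -XX:-UseCodeCacheFlushing -jar app.jar

# To see how much of the code cache is actually used,
# print code cache statistics on JVM exit:
java -XX:+PrintCodeCache -jar app.jar
```

Sizing ReservedCodeCacheSize from the occupancy reported by -XX:+PrintCodeCache is safer than picking a number blindly, since reserving far more than needed wastes address space.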

Reference: https://blogs.oracle.com/poonam/entry/why_do_i_get_message

Bad state of performance testing … in the so-called technology company

The code was already pushed to production without a load test. Now the performance team has started the load test. We are seeing high response times and not getting the desired TPS.

The client’s performance engineer looks at the production access logs, sees that even in production we are getting high response times, and concludes the code is bad. Neither the client’s performance engineer nor the performance team bothers to find the root cause of the high response times. Instead, to hide them, the timeout on the load generators has been set to one second: any response taking longer than 1 second is marked as failed. This performance team is going to send these results to management.

The performance team does not bother to do any code profiling. The fact is, they don’t know that code profilers exist, or that profiling should be done in such cases.

It feels very bad to work with such lousy people when they are not ready to listen. The performance team does not understand anything.

Initially I tried to educate the client and the performance team, but it was futile. At the end of one year, I realised they were not ready to listen to anything, and I stopped educating them…

Thanks, Frustrated Performance Engineer.

JMeter’s HttpRequest KeepAlive

Problem Statement: Load is not getting distributed equally among all the application servers during the load test.

Description: The client (a performance engineer with 9+ years of load testing experience, who manages a team of contractors located offshore), along with the developers, looks at the JMeter script and concludes that the keepAlive flag of JMeter's HTTPRequest is the root cause of the load not getting distributed equally. This was conveyed to the offshore performance test architect, along with a request to disable the keepAlive flag of JMeter's HTTPRequest and restart the test.

Note: JMeter's HTTPRequest is a simple POST/GET request. The response to this request is an XML document containing a maximum of 256 characters.

The offshore performance test architect does a quick Google search and tells the team that disabling keep-alive would impact the performance of the router, not of the load generator where the JMeter script is running.

Thanks, Frustrated Performance Engineer