How do you troubleshoot and monitor a Java/JEE application with performance and scalability problems? Here are techniques used on production systems.
- Perform a series of JDK thread dumps to locate the following possible problems:
- Application bottleneck: Identify application bottlenecks by locating the most common stack traces. Optimize the requests that appear most often in the stack traces.
- Bad SQLs: If most threads are waiting on JDBC calls, trace the offending SQL statements down to the DB.
- Slow DB: If many SQL statements are slow, profile the DB to locate the problem.
- DB or external system outages: Check whether many threads are waiting to establish connections to external systems.
- Concurrency issue: Check whether many stack traces are waiting on the same lock in the same code.
- Infinite loop: Verify whether threads remain runnable for minutes in the same part of the source code.
- Connectivity problem: An unexpectedly low idle thread count indicates that requests are not reaching the application server.
- Thread count misconfiguration: Increase the thread count if CPU utilization is low yet most threads are in the runnable state.
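Thread dumps can also be captured in-process. As a minimal sketch (the `ThreadDumpSnapshot` class name is my own), the JDK exposes every live thread's state and stack via `Thread.getAllStackTraces()`, producing output similar to `jstack <pid>`:

```java
import java.util.Map;

// Minimal in-process thread dump, similar to what `jstack <pid>` or
// `kill -3 <pid>` produces. Class and method names are illustrative.
public class ThreadDumpSnapshot {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

Taking several snapshots a few seconds apart and diffing the hot stack traces is what the checks above amount to in practice.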
- Monitor CPU utilization
- High CPU utilization implies design or coding inefficiency. Take a thread dump to locate the bottleneck. If no problem is found, the system may have reached full capacity.
- Low CPU utilization with abnormally high response times implies many threads are blocked. Take a thread dump to narrow down the problem.
- Monitor process health including the Java application server
- Monitor whether all web servers, application servers, middle-tier systems and DB servers are running. Configure each process as a service so it is restarted automatically if it dies unexpectedly.
- Monitor the Java Heap Utilization
- Monitor the amount of Java heap memory that can be reclaimed after a major garbage collection. If the reclaimed amount keeps dropping, the application is leaking memory; perform memory profiling to locate the leak. If no memory is leaking yet major garbage collections are frequent, tune the Java heap accordingly.
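Heap occupancy can be sampled programmatically for trending. A minimal sketch, assuming only the standard `java.lang.management` API (the `HeapMonitor` class name is illustrative); sampling this shortly after major GCs over time shows whether the reclaimable amount keeps shrinking:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Reads current heap occupancy via the standard JMX memory bean.
public class HeapMonitor {
    public static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        System.out.println("used heap bytes: " + usedHeapBytes());
    }
}
```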
- Monitor unusual exceptions in the application log & application server log
- Monitor and resolve any exceptions detected in the application and server logs. Examine the source code to ensure that all resources, in particular DB, file, socket and JMS resources, are properly closed when the application throws an exception.
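The resource-closing advice above maps directly to try-with-resources, which guarantees `close()` runs even when an exception is thrown. A minimal sketch using a `StringReader` stand-in; the same pattern applies to JDBC connections, sockets and JMS sessions since they implement `AutoCloseable`:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// try-with-resources closes the reader whether readLine() succeeds or throws.
public class SafeClose {
    public static String firstLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } // reader.close() runs here, even on exception
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLine("hello\nworld"));
    }
}
```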
- Monitor memory & paging activities
- Growing resident (native) memory implies a memory leak in native code. Possible sources include the application's non-Java native code, C code in the JVM, and third-party libraries. Also monitor paging activity closely: frequent paging indicates memory misconfiguration.
- Perform DB profiling
- Monitor the following metrics closely
- Top SQLs by logical reads, latency and execution count – rewrite or tune poorly performing SQLs or DB code.
- Top DB wait and latch events – identify bad DB code or bad DB instance or table configuration.
- Number of hard parses – identify scalability problems caused by improper DB programming.
- Hit ratios for the various buffers and caches – evidence of bad SQLs or improper buffer size configuration.
- File I/O statistics – evidence of bad SQLs, or disk misconfiguration or layout.
- Rollback ratio – identify improper application logic.
- Sorting efficiency – identify improper sort buffer configuration.
- Undo log or rollback segment performance – identify DB tuning problems.
- Number of SQL statements and transactions per second – a sudden jump reveals bad application code.
- JMS Resources
- Monitor queue lengths and resource utilization
- Poison messages: Check whether many messages remain unprocessed in the queues for a long time.
- JMS queue deadlocks: Check whether no messages can be de-queued and completed.
- JMS listener problems: Check whether no messages are being processed in a particular queue.
- Memory consumption: Ensure queues holding a large number of pending messages can be paged out of physical memory.
- JMS retry: Ensure failed messages are not re-processed immediately; otherwise poison messages may consume most of the CPU.
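The retry advice can be sketched as a simple redelivery policy: back off between attempts and dead-letter the message after a few failures so a poison message cannot monopolize the CPU. All names here are illustrative; real JMS providers expose equivalent settings (redelivery delay, maximum redeliveries):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a redelivery policy: retry a failed message a limited number of
// times with a growing delay, then divert it to a dead-letter destination.
public class RedeliveryPolicy {
    static final int MAX_ATTEMPTS = 3;
    static final long BASE_DELAY_MS = 1000;

    final List<String> deadLetters = new ArrayList<>();

    /** Returns the delay before the next attempt, or -1 if dead-lettered. */
    long onFailure(String messageId, int attempt) {
        if (attempt >= MAX_ATTEMPTS) {
            deadLetters.add(messageId);
            return -1;                    // stop retrying this message
        }
        return BASE_DELAY_MS << attempt;  // 1s, 2s, 4s, ... backoff
    }
}
```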
- Monitor file I/O performance
- Trend the I/O access and wait times. Redesign or reconfigure the disk layout if necessary for better I/O performance, in particular for the DB server.
- Monitor resource utilization, including file descriptors
- Monitor resources closely to identify any application code that depletes OS-level resources.
- Monitor HTTP access
- Monitor the top IP addresses accessing the system to detect intruders trying to steal content and data from the web site. Use the access log to trace any non-200 HTTP responses.
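Spotting non-200 responses per client IP can be scripted. A sketch assuming Apache common log format (IP as the first space-separated field, status as the ninth); the class name is my own:

```java
import java.util.HashMap;
import java.util.Map;

// Counts non-200 responses per client IP from common-log-format lines,
// a quick way to surface suspicious clients or failing endpoints.
public class AccessLogScan {
    public static Map<String, Integer> non200ByIp(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(" ");
            if (f.length < 9) continue;          // skip malformed lines
            String ip = f[0], status = f[8];
            if (!"200".equals(status)) {
                counts.merge(ip, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```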
- Monitor security access log
- Monitor the OS-level security log and the web server log to detect intrusions. They also give hints on how attackers are probing the system.
- Monitor network connectivity and TCP status
- Run netstat regularly to monitor TCP socket states.
- A high number of sockets in the TIME_WAIT state implies TCP misconfiguration.
- A high number of sockets in the SYN or FIN states implies a possible denial-of-service (DoS) attack.
What are the most common performance and scalability problems in a J2EE (Java EE) web application? Here are the most common problems and tips found in real production systems.
- Bad caching strategy: It is rare that users require absolutely real-time information. Simply caching HTML content for 60 seconds can dramatically reduce the load on the application server and, most importantly, the DB of a high-traffic web site. Cache HTML fragments for the home page and the most visited pages. Implement other caching strategies in the business service layer or the DB layer: for example, use Spring AOP to cache data returned from a business service, or configure Hibernate to cache DB query results.
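A 60-second cache like the one described can be sketched with a time-stamped map. This is an illustrative minimum, not a production cache (no size bound, no background eviction); all names are mine:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal time-based (TTL) cache: entries are served from memory until
// they are older than ttlMillis, then recomputed via the loader.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value; final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public V get(K key, Function<K, V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = map.get(key);
        if (e == null || now - e.loadedAt > ttlMillis) {
            e = new Entry<>(loader.apply(key), now);  // refresh expired entry
            map.put(key, e);
        }
        return e.value;
    }
}
```

With a 60 000 ms TTL, repeated requests for the same page hit the loader (and hence the DB) at most once a minute.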
- Missing DB indexes: After a new code push, indexes may be missing for new SQL statements. A query may be slow if the table is huge and the missing index forces a full table scan. Most development DBs have very small data sets, so the problem goes undetected. Check the DB log or profile in production for long-running SQLs and add indexes where needed.
- Bad SQLs: The second most common DB performance problem is bad SQLs. Check the DB log or profile for long-running queries. Most problems can be resolved by rewriting the SQLs. Pay attention to sub-queries and SQLs with complicated joins. Occasionally, DB table tuning may be required.
- Too many fine-grained calls to the service, data or DB layer: Developers may retrieve a list of data in an iteration loop. Each iteration may make a middle-tier call that results in multiple SQL calls. If the list is long, the total number of DB requests can be huge. Write a new service call that retrieves the whole list in a single DB call instead.
- All application server threads waiting for a DB or external system connection: A web server has a limited number of threads. When an HTTP request is processed, a thread is dedicated exclusively to that request until it completes. Hence, if an external system such as the DB is very slow, all web server threads may end up waiting. When this happens, the web server pauses all new incoming requests, and from the end user's perspective the system appears unresponsive. Add timeout logic when communicating with external systems. Increasing the thread count only delays the problem and in some cases is counter-productive.
- SQLs retrieving too many rows: Do not retrieve hundreds of rows just to display a few of them. Regularly check the DB log or profile for unexpectedly heavy SQLs that retrieve a large number of rows.
- Not using prepared statements: Always use prepared statements to avoid DB-side SQL hard parsing. SQL hard parsing causes serious DB scalability problems as the number of DB requests increases.
- Lack of, or improper, pagination of data: Implement pagination when displaying a long list of data. Do not retrieve all the data from the database and filter it in Java code; always use the database for filtering and pagination.
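DB-side pagination boils down to computing an offset from the page number and letting the database return a single page. A sketch using LIMIT/OFFSET syntax (MySQL/PostgreSQL; other databases use FETCH FIRST or ROWNUM instead); the class and column names are illustrative:

```java
// Builds a one-page query so the database, not the Java code, does the
// filtering. Page numbers are 1-based.
public class Pagination {
    public static String pageQuery(String table, int page, int pageSize) {
        if (page < 1 || pageSize < 1) throw new IllegalArgumentException("page and pageSize must be >= 1");
        int offset = (page - 1) * pageSize;
        return "SELECT * FROM " + table + " ORDER BY id LIMIT " + pageSize + " OFFSET " + offset;
    }
}
```

In real code the table name would be fixed and the limit/offset bound as prepared-statement parameters rather than concatenated.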
- Non-optimized connection pool configuration: The maximum/minimum pool size and the retention policy for idle pool connections can significantly impact application performance. Web server threads sit idle waiting for a DB connection if the pool size is too low. The retention policy matters because most DB pool creation code has very low concurrency and cannot handle a sudden surge of concurrent requests.
- Frequent garbage collection caused by a memory leak: When memory is leaking, the JVM performs frequent garbage collections (GC) even though they cannot reclaim much memory. Eventually, the web server spends most of its time executing GC rather than processing HTTP requests. Rebooting the server relieves the problem temporarily, but only stopping the leak solves it.
- Processing too much data at once: For requests involving a large amount of data, in particular batch processing, sub-divide the data set into chunks and process them separately. Otherwise, the request may deplete the Java heap or stack and crash the JVM.
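Chunking a large data set can be sketched as a simple partitioning helper (the `Chunker` name is my own); each chunk is processed and released before the next is loaded:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a large data set into fixed-size chunks so each chunk can be
// processed (and its memory released) separately instead of holding the
// entire set on the heap at once.
public class Chunker {
    public static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            out.add(new ArrayList<>(items.subList(i, Math.min(i + chunkSize, items.size()))));
        }
        return out;
    }
}
```

For true batch jobs, pair this with a DB cursor or keyset pagination so each chunk is fetched on demand rather than sliced from one giant in-memory list.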
- Concurrency problems in synchronized blocks: Write synchronized blocks carefully. Use established libraries to manage system and application resources such as the DB connection pool. On a system with concurrency problems, CPU utilization remains low even when traffic increases significantly.
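Rather than hand-rolling synchronized blocks, an established concurrency utility such as a `BlockingQueue` can manage a bounded resource pool: acquisition blocks with a timeout instead of spinning on a lock. A sketch with illustrative names; real applications should prefer a proven pool library:

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded resource pool backed by a fair BlockingQueue: no explicit
// synchronized blocks, and callers time out rather than wait forever.
public class BoundedPool<T> {
    private final BlockingQueue<T> free;

    public BoundedPool(Collection<T> resources) {
        this.free = new ArrayBlockingQueue<>(resources.size(), true, resources);
    }

    /** Returns a resource, or null if none became free within the timeout. */
    public T acquire(long timeoutMs) {
        try {
            return free.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve interrupt status
            return null;
        }
    }

    public void release(T resource) {
        free.offer(resource);
    }
}
```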
- Bad DB tuning: If DB responses are slow regardless of the SQL, DB instance tuning is needed. Monitor memory paging activity closely to identify any memory misconfiguration. Also monitor file I/O wait times and DB memory usage closely.
- Not processing data in batches: To reduce DB requests, combine them and process them in a single batch. Use SQL batching where possible instead of a large volume of small SQL requests.
- JMS or application deadlock: Avoid cyclic loops in JMS requests. A request may be sent to Queue A, which sends a message to Queue B, which sends a message back to Queue A. Under high volume, this cycle triggers deadlocks.
- Bad Java heap configuration: Configure the maximum heap size, the minimum heap size, the young generation size and the garbage collection algorithm correctly. Bigger is not necessarily better; the right values depend on the application.
- Bad application server thread configuration: Too high a thread count triggers high context-switching overhead, while too low a thread count limits concurrency. Tune it according to the application's needs and behavior, and size the connection pool according to the thread count.
- Bugs in third-party libraries or the application server: When new third-party libraries are added to the application, monitor closely for concurrency and memory leak issues.
- Out of file descriptors: If the application does not close file or network resources correctly, in particular in exception-handling paths, it may run out of file descriptors and stop processing new requests.
- Infinite loop in application code: An iteration loop may run into an infinite loop and drive CPU utilization high. The problem can be data-sensitive and affect only a small portion of the traffic. If CPU utilization remains high during low-traffic periods, monitor the threads closely.
- Wrong firewall configuration: Some firewall configurations limit the number of concurrent connections from a single IP. This is problematic when a web server connects to a DB server through a firewall. Verify the firewall configuration if the application achieves much higher concurrency when tested within a local network.
- Bad TCP tuning: Improper TCP tuning causes an unreasonably high number of sockets waiting to be closed (TIME_WAIT). Newer OS versions are usually tuned correctly for web servers; change the default TCP tuning parameters only if needed. Direct TCP programming may sometimes need special parameters for short but frequent TCP messages.
Types of OOM:
- java.lang.OutOfMemoryError: Java heap space
- java.lang.OutOfMemoryError: PermGen space
- java.lang.OutOfMemoryError: GC overhead limit exceeded
- java.lang.OutOfMemoryError: unable to create new native thread
- java.lang.OutOfMemoryError: nativeGetNewTLA
- java.lang.OutOfMemoryError: Requested array size exceeds VM limit
- java.lang.OutOfMemoryError: request <size> bytes for <reason>. Out of swap
- java.lang.OutOfMemoryError: <reason> <stack trace> (Native method)
- java.lang.OutOfMemoryError: Metaspace
Here are the typical causes of Java memory leaks:
- Not closing DB, file, socket, JMS and other external resources properly
- Not closing resources properly when an exception is thrown
- Adding objects to a cache, HashMap, Hashtable, Vector or ArrayList without expiring old entries
- Not implementing hashCode and equals correctly for the keys to a cache
- Session data that is too large
- A leak in a third-party library or the application server
- An infinite loop in application code (also a likely cause of high CPU)
- Memory leaks in native code
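The unbounded-cache leak above has a standard JDK fix: a `LinkedHashMap` in access order with `removeEldestEntry` overridden, so the cache evicts its least recently used entry instead of growing forever. A single-threaded sketch (class name is mine); production code would also want TTLs and thread safety:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache with a hard size cap: once maxEntries is exceeded, the least
// recently used entry is evicted instead of the map growing without bound.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true);          // true = access-order (LRU) iteration
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict instead of leaking
    }
}
```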
Here are a few questions that can help in solving performance issues:
Load balance issues
- What type of load balancing scheme is used? (Round robin, sticky IP, least connections, subnet based?)
- What is the timeout of LB table?
- Does it do any connection pooling?
- Is it doing any content filtering?
- Is it checking for HTTP response status?
- Are there application dependencies associated with the LB timeout settings?
- What failover strategies are employed?
- What is the connection persistence timeout?
- What are the timeouts for critical functions?
- What is the throughput capacity?
- What is the connection capacity and rate?
- What is the DMZ operation?
- What are the throughput policies from a single IP?
- What are the connection policies from a single IP?
Firewalls and multiple DMZs
- Does the firewall do content filtering?
- Is it sensitive to inbound and/or outbound traffic?
- What is its upper connection limit?
- Are there policies associated with maximum connection or throughput per IP address?
- Are there multiple firewalls in the architecture (multiple DMZs)?
- If it has multiple DMZs, is it sensitive to data content?
Web server issues
- How many connections can the server handle?
- How many open file descriptors or handles is the server configured to handle?
- How many processes or threads is the server configured to handle?
- Does it release and renew threads and connections correctly?
- How large is the server’s listen queue?
- What is the server’s “page push” capacity?
- What type of caching is done?
- Is there any page construction done here?
- Is there dynamic browsing?
- Are there any SSL acceleration devices in front of the web server?
- Are there any content caching devices in front of the web server?
- Can server extensions and their functions be validated? (ASP, JSP, PHP, Perl, CGI, servlets, ISAPI filter/app, etc.)
- Monitoring (Pools: threads, processes, connections, etc. Queues: ASP, sessions, etc. General: CPU, memory, I/O, context switch rate, paging, etc.)
Application server issues
- Is there any page construction done here?
- How is session management done and what is the capacity?
- Are there any clustered configurations?
- Is there any load balancing done?
- If there is software load balancing, which one is the load balancer?
- What is the page construction capacity?
- Do components have a specific interface to peripheral and external systems?
Database server issues
- Have both small and large data sets been tested?
- What is the connection pooling configuration?
- What are its upper limits?
The experienced performance engineer asks questions like:
- Why is the application updating all these tables on an order creation?
- Why is it calling the remote pricing call three times?
- Why are you creating a new object for the same customer or product?
- Why is the database connection handler making so many connections for a static number of users?
- Did you expect your users/customer to come from a slow wireless connection? Did you test for that?
- Did you realize the application servers were in one data center and the database was in another?
- Who set the JVM memory configuration?
- Why are the indexes on the same volumes as the files?
- Did you know the performance testing database was one quarter the size of the production database?
- How many physical CPU’s did you really allocate to the Database Server?
- How was the peak volume determined?
The JVM JIT generates compiled code and stores it in a memory area called the CodeCache. The default maximum size of the CodeCache on most platforms is 48 MB. If an application needs to compile a large number of methods, producing a huge amount of compiled code, the CodeCache may become full. When it becomes full, the compiler is disabled to stop any further compilation of methods, and a message like the following gets logged:
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
Code Cache [0xffffffff77400000, 0xffffffff7a390000, 0xffffffff7a400000) total_blobs=11659 nmethods=10690 adapters=882 free_code_cache=909Kb largest_free_block=502656
When this situation occurs, the JVM may sweep and flush this space to make room available in the CodeCache. The JVM option UseCodeCacheFlushing controls the flushing of the CodeCache. With this option enabled, the JVM invokes an emergency flushing that discards the older half of the compiled code (nmethods) to make space available in the CodeCache. In addition, it disables the compiler until the available free space exceeds the configured CodeCacheMinimumFreeSpace. The default value of the CodeCacheMinimumFreeSpace option is 500 KB.
UseCodeCacheFlushing is set to false by default in JDK 6 and is enabled by default since JDK 7u4. This means that in JDK 6, when the CodeCache becomes full, it is not swept and flushed and further compilation is disabled, whereas in JDK 7u4 and later, an emergency flushing is invoked when the CodeCache becomes full. Enabling this option by default exposed some issues related to CodeCache flushing in the JDK 7u4+ releases. The following are two known problems in JDK 7u4+ with respect to CodeCache flushing:
1. The compiler may not get restarted even after the CodeCache occupancy drops down to almost half after the emergency flushing.
2. The emergency flushing may cause high CPU usage by the compiler threads leading to overall performance degradation.
This performance issue, and the problem of the compiler not being re-enabled, have been addressed in JDK 8. To work around them in JDK 7u4+, we can increase the code cache size using the ReservedCodeCacheSize option, setting it to a value larger than the compiled-code footprint so that the CodeCache never becomes full. Another solution is to disable CodeCache flushing with the -XX:-UseCodeCacheFlushing JVM option.
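CodeCache occupancy can be trended before the warning ever appears. A sketch using the standard `MemoryPoolMXBean` API (class name is mine); note that HotSpot exposes a single "Code Cache" pool in JDK 8 but several "CodeHeap ..." pools from JDK 9 on, so the code matches both naming schemes:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Sums the used bytes of the JIT code cache pools so occupancy can be
// graphed against ReservedCodeCacheSize over time.
public class CodeCacheUsage {
    public static long usedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Code Cache") || name.startsWith("CodeHeap")) {
                used += pool.getUsage().getUsed();
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("code cache used bytes: " + usedBytes());
    }
}
```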
The code was already pushed to production without a load test. Now the performance team has started the load test. We are seeing high response times and not getting the desired TPS.
The client's performance engineer looks at the production access logs, says that even in production we are getting high response times, and concludes that the code is bad. Neither the client's performance engineer nor the performance team bothers to find the root cause of the high response times. To avoid the high response times, the timeout on the load generators has been set to one second; any response taking longer than 1 second is marked as failed. The performance team is going to send these results to management.
The performance team does not bother to do code profiling. In fact, they don't know that code profilers exist and that profiling should be done in such cases.
I feel very bad working with such lousy people when they are not ready to listen. The performance team does not understand anything.
Initially I tried to educate the client and the performance team, but it was futile. After a year, I realised these fellows were not ready to listen to anything, and I stopped educating them…
Thanks, Frustrated Performance Engineer.
Problem Statement: Load is not being distributed equally among the application servers during the load test.
Description: The client (a performance engineer with 9+ years of load testing experience, who manages a team of contractors located offshore), along with the developers, looks at the JMeter script and concludes that the keepAlive flag of JMeter's HTTPRequest is the root cause of the load not being distributed equally. This was conveyed to the offshore performance test architect, who was asked to disable the keepAlive flag of JMeter's HTTPRequest and restart the test.
Note: JMeter's HTTPRequest is a simple POST/GET request. The response to this request is an XML document of at most 256 characters.
The offshore performance test architect does a quick Google search and tells the team that disabling keep-alive will have a performance impact on the router (not on the load generator where the JMeter script is running).
Thanks, Frustrated Performance Engineer