Top tips to troubleshoot & monitor a JEE Web application

How do you troubleshoot and monitor a Java or JEE application that has performance and scalability problems? Here are the techniques used for production systems.

  1. Take a series of JDK thread dumps to locate the following possible problems:
    1. Application bottleneck: Identify application bottlenecks by locating the most common stack traces. Optimize the requests that appear most often in them.
    2. Bad SQLs: If most threads are waiting on JDBC calls, trace the bad SQLs down to the DB.
    3. Slow DB: If many SQLs are having problems, profile the DB to locate the problem.
    4. DB or external system outages: Check whether many threads are waiting to establish an external connection.
    5. Concurrency issue: Check whether many stack traces are waiting in the same code for a lock.
    6. Infinite loop: Check whether threads remain running for minutes in the same part of the source code.
    7. Connectivity problem: An unexpectedly high number of idle threads indicates that requests are not reaching the application server.
    8. Thread count misconfiguration: Increase the thread count if CPU utilization is low yet most threads are in the runnable state.
  2. Monitor CPU utilization
    1. High CPU utilization implies inefficient design or code. Take a thread dump to locate the bottleneck. If no problems are found, the system may have reached full capacity.
    2. Low CPU utilization with abnormally high response time implies that many threads are blocked. Take a thread dump to narrow down the problem.
  3. Monitor process health, including the Java application server
    1. Monitor whether all web servers, application servers, middle-tier systems and DB servers are running. Configure each process as a service so that it is restarted automatically if it dies suddenly.
  4. Monitor Java heap utilization
    1. Monitor the amount of Java heap memory that is reclaimed after each major garbage collection. If the reclaimed amount keeps dropping, the application is leaking memory; perform memory profiling to locate the leak. If no memory is leaking yet major garbage collections are frequent, tune the Java heap accordingly.
  5. Monitor unusual exceptions in the application log and the application server log
    1. Monitor and resolve any exceptions detected in the application and server logs. Examine the source code to ensure that all resources, in particular DB, file, socket and JMS resources, are properly closed when the application throws an exception.
  6. Monitor memory and paging activity
    1. Growing resident (native) memory implies a memory leak in native code. Possible sources include the application's non-Java native code, C code in the JVM and third-party libraries. Also monitor paging activity closely: frequent paging means memory misconfiguration.
  7. Perform DB profiling and monitor the following metrics closely
    1. Top SQLs by logical reads, latency and execution count – Rewrite or tune poorly performing SQLs or DB programming code.
    2. Top DB wait and latch events – Identify bad DB coding or bad DB instance or table configuration.
    3. Number of hard parses – Identify scalability problems caused by improper DB programming.
    4. Hit ratio for the different buffers and caches – Evidence of bad SQLs or improper buffer size configuration.
    5. File I/O statistics – Evidence of bad SQLs, disk misconfiguration or poor disk layout.
    6. Rollback ratio – Identify improper application logic.
    7. Sorting efficiency – Identify improper sort buffer configuration.
    8. Undo log or rollback segment performance – Identify DB tuning problems.
    9. Number of SQL statements and transactions per second – A sudden jump reveals bad application code.
  8. Monitor JMS resources
    1. Monitor the queue length and resource utilization.
    2. Poison messages: Check whether many messages remain unprocessed in the queues for a long time.
    3. JMS queue deadlocks: Check whether no messages can be dequeued and completed.
    4. JMS listener problems: Check whether no messages are being processed in a particular queue.
    5. Memory consumption: Ensure that queues holding a large number of pending messages can be paged out of physical memory.
    6. JMS retry: Ensure that failed messages are not reprocessed immediately. Otherwise, poison messages may consume most of the CPU.
  9. Monitor file I/O performance
    1. Trend the I/O access and wait times. Redesign or reconfigure the disk layout if necessary for better I/O performance, in particular for the DB server.
  10. Monitor resource utilization, including file descriptors
    1. Monitor resources closely to identify whether any application code is depleting OS-level resources.
  11. Monitor HTTP access
    1. Monitor the top IP addresses accessing the system. Detect any intruder trying to steal content and data from the web site. Use the access log to trace any non-200 HTTP responses.
  12. Monitor the security access log
    1. Monitor the OS-level security log and the web server log to detect intrusions. They also give hints on how attackers are targeting the system.
  13. Monitor network connectivity and TCP status
    1. Run netstat regularly to monitor the TCP socket states.
    2. A high number of sockets in an idle wait state implies TCP misconfiguration.
    3. A high number of sockets in a SYN or FIN state implies a possible denial of service (DoS) attack.
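
The thread dump checks in the first tip can be partly automated from inside the JVM. The following is a minimal sketch, not a replacement for a full dump taken with a tool such as jstack: it uses the standard java.lang.management API to count live threads by state and by top stack frame, which roughly approximates "locate the most common stack trace". The class name ThreadDumpSummary and its method names are hypothetical, chosen only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: summarizes a thread dump taken inside the running JVM.
class ThreadDumpSummary {

    // Counts threads per state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries for threads that no longer exist
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    // Counts threads per top stack frame ("class.method") to surface hot spots.
    static Map<String, Integer> countByTopFrame(ThreadInfo[] infos) {
        Map<String, Integer> counts = new HashMap<>();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) {
                continue; // no stack available for this thread
            }
            String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // false, false: skip lock details to keep the dump cheap.
        ThreadInfo[] infos = threads.dumpAllThreads(false, false);

        System.out.println("Threads by state: " + countByState(infos));
        System.out.println("Most common top frames:");
        countByTopFrame(infos).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(5)
                .forEach(e -> System.out.println("  " + e.getValue() + " x " + e.getKey()));
    }
}
```

Running the summary several times a few seconds apart, as the tip suggests for dumps in general, shows whether the same frames and states keep dominating.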

Top J2EE application performance problems

What are the most common performance and scalability problems for a J2EE (Java EE) Web application? Here are the most common tips and problems found in real production systems.

  1. Bad caching strategy: Users rarely require absolutely real-time information. Simply caching HTML content for 60 seconds can dramatically reduce the load on the application server and, most importantly, on the DB of a high-traffic web site. Cache HTML segments for the home page and the most visited pages. Implement other caching strategies in the business service layer or the DB layer. For example, use Spring AOP to cache data returned from a business service, or configure Hibernate to cache DB query results.
  2. Missing DB indexes: After a new code push, indexes may be missing for the new SQLs. A query can be slow if the table is huge and the missing index forces a full table scan. Most development DBs have a very small data set, so the problem goes undetected. Check the production DB log or profile for long-running SQLs and add indexes where needed.
  3. Bad SQLs: The second most common DB performance problem is bad SQLs. Check the DB log or profile for long-running queries. Most problems can be resolved by rewriting the SQLs. Pay attention to sub-queries and SQLs with complicated joins. Occasionally, DB table tuning may be required.
  4. Too many fine-grained calls to the service, data or DB layer: Developers may use a loop to retrieve a list of data. Each iteration may make a middle-tier call that results in multiple SQL calls. If the list is long, the total number of DB requests can be huge. Developers should write a new service call that retrieves the list in a single DB call.
  5. All application server threads are waiting for the DB or an external system connection: A web server has a limited number of threads. When an HTTP request is processed, a thread is dedicated exclusively to that request until it completes. Hence, if an external system such as the DB is very slow, all web server threads may end up waiting. When this happens, the web server stalls all new incoming requests, and from an end user's perspective the system appears unresponsive. Add timeout logic when communicating with external systems. Increasing the thread count only delays the problem and is in some cases counterproductive.
  6. SQLs retrieve too many rows of data: Do not retrieve hundreds of rows just to display a few of them. Check the DB log or profile regularly for unexpectedly high usage of SQLs that retrieve a lot of rows.
  7. Not using prepared statements for the DB: Always use prepared statements to avoid SQL hard parsing on the DB side. SQL hard parsing causes serious DB scalability problems as the number of DB requests increases.
  8. Missing or improper pagination of data: Implement pagination to display a long list of data. Do not retrieve all the data from the database and then filter it in Java code. Always use the database for data filtering and pagination.
  9. Non-optimized connection pool configuration: The maximum and minimum pool sizes and the retention policy for idle pooled connections can significantly affect application performance. The web server sits idle waiting for a DB connection if the pool size is too low. The retention policy is important because most code that creates pooled DB connections has very low concurrency and cannot handle a sudden surge of concurrent requests.
  10. Frequent garbage collection caused by a memory leak: When memory is leaking, the JVM performs frequent garbage collections (GC) even though they cannot reclaim much memory. Eventually, the web server spends most of its time executing the GC rather than processing HTTP requests. Rebooting the server temporarily relieves the problem, but only stopping the leak solves it.
  11. Processing a large amount of data at once: For requests involving a large amount of data, in particular batch processes, sub-divide the data set into chunks and process each chunk separately. Otherwise, the request may deplete the Java heap or stack memory and crash the JVM.
  12. Concurrency problems in synchronized blocks: Code synchronized blocks carefully. Use established libraries to manage system and application resources such as the DB connection pool. In a system with a concurrency problem, CPU utilization remains low even when traffic increases significantly.
  13. Bad DB tuning: If the DB response is slow regardless of the SQLs, DB instance tuning is needed. Monitor memory paging activity closely to identify any memory misconfiguration. Also monitor the file I/O wait time and DB memory usage closely.
  14. Not processing data in batches: To reduce the number of DB requests, combine them and process them in a single batch. Use SQL batches where appropriate instead of a large volume of small SQL requests.
  15. JMS or application deadlock: Avoid cyclic loops in JMS requests. A request may be sent to Queue A, which then sends a message to Queue B, which in turn sends one back to Queue A. Such a cycle can trigger a deadlock under a high volume of requests.
  16. Bad Java heap configuration: Configure the maximum heap size, the minimum heap size, the young generation size and the garbage collection algorithm correctly. Bigger is not always better; the right values depend on the application.
  17. Bad application server thread configuration: Too high a thread count triggers high context-switching overhead, while too low a thread count causes low concurrency. Tune it according to the application's needs and behavior. Size the connection pool according to the thread count.
  18. Internal bugs in third-party libraries or the application server: If new third-party libraries are added to the application, monitor closely for concurrency and memory leak issues.
  19. Out of file descriptors: If the application does not close file or network resources correctly, in particular within exception handling, it may run out of file descriptors and stop processing new requests.
  20. Infinite loop in the application code: A loop may never terminate and trigger high CPU utilization. It can be data-sensitive and affect only a small portion of the traffic. If CPU utilization remains high during low-traffic periods, monitor the threads closely.
  21. Wrong firewall configuration: Some firewall configurations limit the number of concurrent connections from a single IP. This can be problematic if a web server connects to a DB server through a firewall. Verify the firewall configuration if the application achieves much higher concurrency when tested within a local network.
  22. Bad TCP tuning: Improper TCP tuning causes an unreasonably high number of sockets waiting to be closed (TIME_WAIT). Recent OS versions are usually tuned correctly for web servers. Change the default TCP tuning parameters only if needed. Direct TCP programming may sometimes need special parameters for short but frequent TCP messages.
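
To illustrate the 60 second cache from the first item, here is a minimal sketch of a time-based cache in plain Java. It is not the Spring AOP or Hibernate caching mentioned above; TtlCache, its constructor parameters and the "home" key are hypothetical names used only for this example. The clock is passed in so the expiry behavior can be checked without waiting.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical time-to-live cache: a value is reloaded only after it expires.
class TtlCache<K, V> {

    // A cached value together with the time at which it stops being valid.
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, calling the loader only if the entry is missing or expired.
    // The loader runs inside compute(), so it must not access this cache itself.
    V get(K key, Supplier<V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old != null && now < old.expiresAtMillis) {
                return old; // still fresh: no call to the expensive loader
            }
            return new Entry<V>(loader.get(), now + ttlMillis);
        });
        return entry.value;
    }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<>(60_000, System::currentTimeMillis);
        String first = cache.get("home", () -> "<h1>rendered at " + System.currentTimeMillis() + "</h1>");
        // Within 60 seconds the second loader is not invoked; the first result is reused.
        String second = cache.get("home", () -> "<h1>re-rendered</h1>");
        System.out.println(first.equals(second));
    }
}
```

Note that this sketch never removes keys that are no longer requested. A production cache should also bound its size or evict expired entries, otherwise it becomes one of the memory leak causes listed later in this article.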

Types of OOM Error & Java Memory Leak Causes

Types of OOM:

  • java.lang.OutOfMemoryError: Java heap space
  • java.lang.OutOfMemoryError: PermGen space
  • java.lang.OutOfMemoryError: GC overhead limit exceeded
  • java.lang.OutOfMemoryError: unable to create new native thread
  • java.lang.OutOfMemoryError: nativeGetNewTLA
  • java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  • java.lang.OutOfMemoryError: request <size> bytes for <reason>. Out of swap
  • java.lang.OutOfMemoryError: <reason> <stack trace> (Native method)
  • java.lang.OutOfMemoryError: Metaspace

Here are the typical causes of Java memory leaks:

  • Not closing DB, file, socket, JMS and other external resources properly
  • Not closing resources properly when an exception is thrown
  • Adding objects to a cache, hash map, Hashtable, Vector or ArrayList without ever expiring the old ones
  • Not implementing the hashCode and equals methods correctly for the key of a cache
  • Session data that is too large
  • A leak in a third-party library or the application server
  • An infinite loop in the application code (also a likely cause of high CPU)
  • A memory leak in native code
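
The first two causes come down to resources that stay open on the exception path. As a minimal sketch, assuming the resource types implement AutoCloseable (as JDBC connections, statements and I/O streams do), a try-with-resources statement closes every opened resource even when the body throws. TrackedResource is a hypothetical stand-in for a real connection or stream; its counter exists only to show that nothing stays open.

```java
// Hypothetical demo: resources are closed even when processing fails.
class ResourceCleanupDemo {

    // Stand-in for a DB connection, file or socket; counts how many are open.
    static final class TrackedResource implements AutoCloseable {
        static int openCount = 0;
        private final String name;

        TrackedResource(String name) {
            this.name = name;
            openCount++;
        }

        void use(boolean fail) {
            if (fail) {
                throw new IllegalStateException("simulated failure in " + name);
            }
        }

        @Override
        public void close() {
            openCount--;
        }
    }

    // Returns true on success, false if the simulated work failed.
    static boolean process(boolean fail) {
        // Resources are closed in reverse order when the block exits,
        // whether normally or because of an exception.
        try (TrackedResource connection = new TrackedResource("connection");
             TrackedResource statement = new TrackedResource("statement")) {
            statement.use(fail);
            return true;
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time this handler runs.
            return false;
        }
    }

    public static void main(String[] args) {
        process(true);
        System.out.println("open after failure: " + TrackedResource.openCount); // prints 0
    }
}
```

The same pattern applied to real DB, file and socket handles also guards against the file descriptor exhaustion described earlier.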