A potential memory leak – Metaspace leak

For the last few weeks, I have been working on benchmarking a product that was developed on JDK 1.8.0_65 and deployed on WebLogic 12c.

Started the initial load tests just to see how the application behaves, and along the way observed DB and JVM performance. Found a few issues during the load-testing process:

  1. Inserts in the DB were happening one row at a time.
  2. A complete materialized view (MV) refresh was happening every 30 seconds.
  3. Full GCs were observed.

The issue with point 1 is too many network round trips to the database, and the log file sync wait event was also the top wait event. Informed the dev team to change the code so that records get committed to the database in batches (see the sketch below), thereby reducing the number of network round trips.
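For illustration, a minimal JDBC batching sketch; the connection URL, credentials, table, and column names here are made up, not the product's:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger")) {
            con.setAutoCommit(false); // one commit per batch instead of per row
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO orders (id, payload) VALUES (?, ?)")) {
                for (int i = 1; i <= 10000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row-" + i);
                    ps.addBatch();
                    if (i % 500 == 0) {   // one network round trip per 500 rows
                        ps.executeBatch();
                        con.commit();
                    }
                }
                ps.executeBatch(); // flush any remainder
                con.commit();
            }
        }
    }
}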

The issue with point 2 is that the MV refresh reprocesses all the records in the database every time it runs. We don't see much impact while the database is only a few MBs in size; the real problem arises when it grows to a few GBs/TBs. Asked the dev team to implement incremental refresh: a complete refresh happens only the first time, and from then on only the delta records are refreshed (a sketch follows).
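A hedged sketch of the change in Oracle syntax; the table name is made up and assumed to have a primary key. Fast (incremental) refresh requires a materialized view log on the base table:

CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;

CREATE MATERIALIZED VIEW orders_mv
  REFRESH FAST ON DEMAND
  AS SELECT * FROM orders;

-- The first refresh is complete; subsequent refreshes apply only the deltas:
EXEC DBMS_MVIEW.REFRESH(list => 'ORDERS_MV', method => 'F');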

In the meantime, started analyzing the GC logs and the reason for the Full GCs. At first glance the Full GCs seemed okay, but on further analysis the Metaspace was also getting resized (space reclaimed), which is not normal. We know the young and old regions get resized (space reclaimed) during minor and major GCs, but why was the Metaspace getting resized, and that too in steady state?

There might be two reasons for this:

  1. The load-testing process (the way the requests are sent). This can be ruled out, as the initialization is done once and then only the main transactions run.
  2. New instances are getting created again and again. This is the only option left to explore (a minimal sketch of the kind of pattern that can cause this follows the list). At this point I could not involve the developers, as they would ask me which instances those were.
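For illustration only, here is a hypothetical sketch (not the product's code) of a pattern that makes Metaspace grow and then get reclaimed at Full GC: every fresh classloader that loads a class defines new class metadata in Metaspace, and that metadata is freed only once the loader becomes unreachable and is collected.

import java.net.URL;
import java.net.URLClassLoader;

public class MetaspaceChurn {
    public static void main(String[] args) throws Exception {
        // Hypothetical classpath and class name, purely for illustration.
        URL[] cp = { new URL("file:/myworks/app-classes/") };
        while (true) {
            try (URLClassLoader loader = new URLClassLoader(cp, null)) {
                // Loading through a brand-new loader defines the class (and its
                // metadata) again in Metaspace on every iteration.
                Class<?> c = loader.loadClass("com.example.SomeClass");
                System.out.println(c.getName() + " loaded by " + loader);
            }
        }
    }
}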

As a next step, took three full heap dumps (for example with jmap -dump:live,format=b,file=heap.hprof <pid>) with the heap at different occupancy stages: immediately after a Full GC, when the heap was half full, and when the heap was about to be Full GC'ed.

Analyzed all the heap dumps and found that a particular class was responsible: the number of instances of that class kept increasing across all three dumps.

Now I had the required data. Checked with the developer whether that particular class was being instantiated in the code; he confirmed that his part of the code was creating those instances.

Voila, we got the fix. Repeated the same process, and this time only a few instances of that class showed up. The leak is fixed.

Hope it helps.


Debugging JavaDB Query Performance

-- Turn on RUNTIMESTATISTICS for the connection:
CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);

-- Indicate that statistics information should be captured into
-- database tables in the SomeSchema schema:
CALL SYSCS_UTIL.SYSCS_SET_XPLAIN_SCHEMA('SomeSchema');

-- Execute the queries:
SELECT * FROM <SomeTable> WHERE someName = 'XYZ';

-- Retrieve the runtime statistics, then turn them off:
VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);

-- Retrieve the text of the captured statements, ordered by the time
-- the statistics were captured:
SELECT stmt_text, xplain_time
FROM SomeSchema.sysxplain_statements
ORDER BY xplain_time;

-- Retrieve the text of the captured statements, longest execution time first:
SELECT s.stmt_text, st.execute_time
FROM SomeSchema.sysxplain_statements s,
     SomeSchema.sysxplain_statement_timings st
WHERE s.timing_id = st.timing_id
ORDER BY st.execute_time DESC;

-- Show the statements that were executed, together with the result sets
-- each statement required:
SELECT st.stmt_text, rs.op_identifier
FROM SomeSchema.sysxplain_statements st
JOIN SomeSchema.sysxplain_resultsets rs ON st.stmt_id = rs.stmt_id;

-- Find statements which resulted in an external sort being performed:
SELECT s.stmt_text, s.stmt_id, rs.op_identifier, srt.no_input_rows
FROM SomeSchema.sysxplain_sort_props srt,
     SomeSchema.sysxplain_resultsets rs,
     SomeSchema.sysxplain_statements s
WHERE rs.stmt_id = s.stmt_id
  AND rs.sort_rs_id = srt.sort_rs_id
  AND srt.sort_type = 'EX';

-- Find statements which resulted in a table scan:
SELECT st.stmt_text, sp.no_visited_pages, sp.no_visited_rows
FROM SomeSchema.sysxplain_scan_props sp,
     SomeSchema.sysxplain_resultsets rs,
     SomeSchema.sysxplain_statements st
WHERE st.stmt_id = rs.stmt_id
  AND rs.scan_rs_id = sp.scan_rs_id
  AND rs.op_identifier = 'TABLESCAN'
  AND sp.scan_object_name = '<SomeTable>';

Tracing Derby SQL statements

I am working on a project that has Derby DB (alias Apache Derby, alias JavaDB) as one of its components, storing the configuration (JavaDB is suitable for in-memory operations with a small number of clients). The application that accesses the Derby DB has high response times.
To troubleshoot the high response times, I enabled tracing of the SQL statements fired from the application.

  • Created a derby.properties file and added the property "derby.language.logStatementText=true" (see the snippet below).
  • Restarted the application server, including the Derby database.
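For reference, this is all the derby.properties file needs to contain; Derby picks it up from derby.system.home, which defaults to the JVM's working directory:

# derby.properties
derby.language.logStatementText=true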

After that, all SQL statements get logged into the derby.log file.

Hope it helps.

Note: tracing produces a large volume of SQL statement logging. Be careful when dealing with the logs generated by tracing.

Configuring GC logs: an issue

I was trying to configure GC logging in the WebLogic startup scripts on one of our Linux servers (OEL7).

JDK: 1.7u80

MEM_ARGS="-Xms2048m -Xmx2048m -XX:PermSize=512m -XX:MaxPermSize=512m -Xloggc:/scratch/oracle/middleware/user_projects/domains/oamdomain/servers/oam_server2/logs/oamserver2$$-gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution"

With the above parameters, I was unable to start the WebLogic servers and got the error below:

Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Invalid file name for use with -Xloggc: Filename can only contain the characters [A-Z][a-z][0-9]-_.%[p|t] but it has been oamserver2-gc.log -XX:+PrintGC
Note %p or %t can only be used once

Solution:
MEM_ARGS has been split into two parts:

export MEM_ARGS="-Xms2048m -Xmx2048m -XX:PermSize=512m -XX:MaxPermSize=512m -Xloggc:/scratch/oracle/middleware/user_projects/domains/oamdomain/servers/oam_server2/logs/oamserver2$$-gc.log"

export JAVA_OPTIONS="-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution $JAVA_OPTIONS"

The WebLogic startup script was creating problems at server start when everything was part of MEM_ARGS; most likely some validation of MEM_ARGS somewhere mangles the -Xloggc file name when more flags follow it 😉

Hope it helps.

IDM servers going down

Problem Statement:
All Java processes pertaining to the IDM stack were getting killed periodically, and the servers had to be restarted very often.

Software:
WebLogic 11g, IDM stack, Java 1.7.0_79

Analysis:

  • There were no thread dumps or core dumps.
  • System logs were clean.
  • Grepped the WebLogic logs for errors; could not find anything.
  • After carefully going through each and every line of the WebLogic logs, I found the messages below:
    ####<Jul 26, 2017 6:36:47 PM IST> <Notice> <WebLogicServer> <myhost.me.me.com> <AdminServer> <Thread-1> <<WLS Kernel>> <> <ae8121cbb691c6b3:-fc73acf:15d7e34bd34:-8000-00000000000002ff> <1501074407979> <BEA-000388> <JVM called WLS shutdown hook. The server will force shutdown now>
    ####<Jul 26, 2017 6:36:47 PM IST> <Alert> <WebLogicServer> <myhost.me.me.com> <AdminServer> <Thread-1> <<WLS Kernel>> <> <ae8121cbb691c6b3:-fc73acf:15d7e34bd34:-8000-00000000000002ff> <1501074407980> <BEA-000396> <Server shutdown has been requested by <WLS Kernel>>

The logs clearly indicate that the JVM itself initiated the shutdown of the servers; the OS had nothing to do with it.

Solution:
The problem can occur due to incompatible software versions. In our case there was no customization and none of our own software was deployed; ours was a clean install.
That pointed me to the JDK version. After pointing the domain to the correct Java, the issue was resolved.

Hope it helps.

Installing Python on Linux systems when /usr/local/bin directory is read-only

We were given Linux systems with Python 2.7.5 installed, while our application requires a Python version greater than 2.7.9. Trying to upgrade or install the latest Python fails, because the default installation needs to write an entry into /usr/local/bin/, which sits on a read-only file system. Even with root access we cannot modify the /usr/local/bin directory.


So how do we upgrade or install a new version of Python? Below are the steps (a consolidated, copy-pasteable version follows the list):

  • Download the Python version you are interested in.
  • Untar the tarball to a directory.
  • cd into the directory where it was untarred.
  • Run the configure command: ./configure --prefix=/myworks/softwares/pythonV2.7.13 --enable-shared --with-ensurepip=yes && make
  • Run the install command: make altinstall. This installs Python into the directory given by --prefix.
  • Add this line to .bash_profile: export LD_LIBRARY_PATH=/myworks/softwares/pythonV2.7.13/lib:$LD_LIBRARY_PATH
  • Reload .bash_profile. That's it.
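The same steps as a consolidated, copy-pasteable sketch (the tarball version and the --prefix path here are just examples):

tar xzf Python-2.7.13.tgz
cd Python-2.7.13
./configure --prefix=/myworks/softwares/pythonV2.7.13 --enable-shared --with-ensurepip=yes
make
make altinstall
echo 'export LD_LIBRARY_PATH=/myworks/softwares/pythonV2.7.13/lib:$LD_LIBRARY_PATH' >> ~/.bash_profile
source ~/.bash_profile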

Hope it helps.

Connecting to remote Derby Database from Windows

One of my applications uses the Derby database to store configuration information, and I just wanted to see which tables are present in the Derby DB. As I was using Derby for the first time, I did not know the basic commands to list the tables from the SQL prompt on the Linux system where the Derby DB is installed. I wanted a SQL client that could connect to the remote DB and list the available tables and the data in them. This is when I started googling for a SQL client and found a suitable installer (squirrel-sql-3.7.1-standard.jar).

Installing and configuring the SQL Client:

  • Run the command "java -jar squirrel-sql-3.7.1-standard.jar"; a GUI opens. Click Next through the installer.
  • Download Apache Derby (db-derby-10.10.2.0-bin.zip).
  • Unzip the Derby zip file to a folder.
  • Launch "squirrel-sql.bat" from the location where the client was installed; it opens the "SQuirreL SQL Client Version 3.7.1" GUI editor.
  • Under the Drivers tab on the top left of the editor, double-click on Apache Derby Client.
  • Click on the "Extra Class Path" tab in the new window and add derbyclient.jar from the folder where we unzipped the Derby package. Click OK; the driver is registered.
  • Click on Aliases, then "Add Alias", and input the required details, including the JDBC URL (see the example below). Test the connection and click OK.
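For reference, a Derby network-client JDBC URL has this shape (host and database path are examples; 1527 is the Derby Network Server's default port):

jdbc:derby://myhost:1527//path/to/mydb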

Hope it helps.

Commands to execute the SQL Tuning Advisor

set serveroutput on

DECLARE
  stmt_task VARCHAR2(64);
BEGIN
  stmt_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => '8bk0dw24d58jg');
  dbms_output.put_line(stmt_task);
END;
/

--------

-- Use the task name returned above (TASK_1342 in this run) for the next steps:
EXEC DBMS_SQLTUNE.execute_tuning_task(task_name => 'TASK_1342');

--------

SET LONG 9999999
SET PAGESIZE 1000
SET LINESIZE 32767
SELECT DBMS_SQLTUNE.report_tuning_task('TASK_1342') AS recommendations FROM dual;

Important Configuration Parameters for Tuning Apache Spark Job

Below are a few of the Spark configurations I used while tuning an Apache Spark job (a sketch showing where they are set follows the list).

  • "-XX:MetaspaceSize=100M": Before adding this parameter, Full GCs were observed due to Metaspace resizing. After adding it, no Full GCs on account of Metaspace resizing were observed.
  • "-XX:+DisableExplicitGC": In standalone mode, System.gc() was being invoked by the code every 30 minutes, causing a Full GC each time (not a good practice). After adding this parameter, no Full GCs on account of System.gc() were observed.
  • "-Xmx2G": OOM was observed with the default heap size (1G) when executing runs with more than 140K messages in the aggregation window. After the heap was increased to 2GB (the maximum this box allows), I was able to process 221K messages in the aggregation window successfully; at 250K messages we still got OOM.
  • spark.memory.fraction = 0.85: In Spark 1.6.0 the default value is 0.75. It was increased to give more memory to the storage/execution regions, again to avoid OOM.
  • Storage level changed to DISK_ONLY: Before the change, we got OOM when processing 250K messages during the 300-second aggregation window; after the change, we could process 540K messages in the window without OOM. Even though in-memory gives better performance, hardware limitations forced me to use DISK_ONLY.
  • spark.serializer set to KryoSerializer: the Java serializer has a bigger memory footprint; to avoid that and for better performance we used Kryo.
  • "-Xloggc:~/etl-gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause": these parameters are worth adding as a matter of good practice; the GC logs help diagnose problems, and their overhead in production is minimal.
  • spark.streaming.backpressure.enabled = true: lets Spark Streaming control the receive rate based on current batch scheduling delays and processing times, so the system only receives as fast as it can process.
  • spark.io.compression.codec set to org.apache.spark.io.LZFCompressionCodec: the codec used to compress internal data such as RDD partitions, broadcast variables, and shuffle outputs.
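For reference, a minimal Java sketch (Spark 1.6-era API) showing where these settings go. The app name, master, and executor-memory wiring are assumptions for illustration; the GC-logging JVM flags would normally be appended to spark.executor.extraJavaOptions or to the submit command.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class TunedJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("etl-aggregation")      // placeholder
                .setMaster("local[2]")              // placeholder; usually set by spark-submit
                .set("spark.executor.memory", "2g") // the -Xmx2G above, for executors
                .set("spark.memory.fraction", "0.85")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.streaming.backpressure.enabled", "true")
                .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
                // Executor JVM flags from the list above:
                .set("spark.executor.extraJavaOptions",
                     "-XX:MetaspaceSize=100M -XX:+DisableExplicitGC");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // DISK_ONLY storage level, as discussed above:
        sc.parallelize(Arrays.asList(1, 2, 3)).persist(StorageLevel.DISK_ONLY()).count();
        sc.stop();
    }
}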

Hope it helps.