Below are a few of the Spark configurations I have used while tuning an Apache Spark job.
- “-XX:MetaspaceSize=100M”: Before adding this parameter, full GCs were observed due to metaspace resizing. After adding it, no full GCs on account of metaspace resizing were observed.
- “-XX:+DisableExplicitGC”: In standalone mode, the code was invoking System.gc() every 30 minutes, which triggers a full GC (not a good practice). After adding this parameter, no full GCs on account of System.gc() were observed.
- “-Xmx2G”: An OOM was observed with the default heap size (1 GB) when processing more than 140K messages in the aggregation window. After the heap was increased to 2 GB (the maximum allowed on this box), I was able to process 221K messages successfully in the aggregation window. At 250K messages we still get an OOM.
- spark.memory.fraction – 0.85: In Spark 1.6.0, the default value is 0.75 for the fraction of the heap used for storage/execution memory. The value was increased to give more memory to storage/execution in order to avoid OOMs.
- Storage level has been changed to DISK_ONLY: Before the change, we were getting OOMs when processing 250K messages in an aggregation window of 300 seconds. After the change, we could process 540K messages in the aggregation window without an OOM. Even though in-memory storage gives better performance, hardware limitations forced me to use DISK_ONLY.
- spark.serializer is set to KryoSerializer: The Java serializer has a larger memory footprint; to avoid the high memory footprint and for better performance we used the Kryo serializer.
- “-Xloggc:~/etl-gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause”: These parameters should be added as part of good performance practice; the resulting GC logs are very helpful for diagnosing problems. Their overhead in production is minimal.
- spark.streaming.backpressure.enabled – true: This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives data only as fast as it can process it.
- spark.io.compression.codec set to org.apache.spark.io.LZFCompressionCodec: The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs.
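Putting the settings above together, they can be passed at submit time roughly as follows. This is a sketch, not my exact command: the application JAR, main class, and master URL are placeholders, and the GC log path should point somewhere writable on each executor host.

```shell
# Sketch of a spark-submit invocation combining the settings discussed above.
# com.example.StreamingEtl, the master URL, and app.jar are placeholders.
spark-submit \
  --class com.example.StreamingEtl \
  --master spark://master:7077 \
  --driver-memory 2G \
  --executor-memory 2G \
  --conf spark.memory.fraction=0.85 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec \
  --conf "spark.executor.extraJavaOptions=-XX:MetaspaceSize=100M \
    -XX:+DisableExplicitGC -Xloggc:/tmp/etl-gc.log -XX:+PrintGC \
    -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintTenuringDistribution -XX:+PrintGCCause" \
  app.jar
```

Note that the heap size (-Xmx) must not be placed in spark.executor.extraJavaOptions; Spark rejects that, so it is set through --executor-memory (and --driver-memory) instead. The DISK_ONLY storage level is not a submit-time property either; it is set in application code, for example by passing StorageLevel.DISK_ONLY when creating the input stream or persisting an RDD.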
Hope it helps.