Quoted from:
http://raymondtay.blogspot.com/2008/09/j2eeweblogic-performance-tuning-with.html.
One of the best posts I have ever read.
This article describes a series of investigations for a customer of mine whose environment runs a WebLogic cluster of 20 machines in round-robin mode on HP-UX to serve a global J2EE application. The application performed slowly during peak periods and occasionally hung. It was a typical 3-tier architecture in which the web tier delegates requests to the middle tier (EJBs, MQs, MDBs) and this middleware goes to the database tier (SQL inserts, updates, deletes, stored procedures etc). The application was found to be under heavy load during peak periods every day.
There were a number of issues related to poorly performing SQL statements, poorly designed middleware apps, WebLogic cluster design and runtime issues, JVM memory consumption and frequent garbage collections. Let me try to detail them a bit without giving away too much customer information. Hopefully, it can help you with investigations in your own environment.
During the peak period, the major contributing factors to the apps’ slowness were:
The heap size was 1.5GB (min and max), with 512MB for Eden and 192MB for PermGen. The minor GC kicked in frequently, releasing approximately 60MB on average; the major GC kicked in twice every minute (3-5s on average, 40s at most), releasing 400 – 500 MB each time. Reverse engineering these figures reveals that the object creation rate was roughly 800 MB – 1.0 GB per minute. GC is primarily a CPU-intensive operation (saving state, freeing memory, compacting the heap etc). The large object creation rate combined with the relatively long GC pauses suggests that the applications were creating objects in an inefficient manner. It also created problems with the cluster’s session replication mechanism: users of the system would see stale data because, due to the long GC pauses, the data in the session was not replicated *properly* across to the other servers.
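The reverse engineering above is simple arithmetic, and a small sketch makes it explicit. The class and method names are made up for illustration, and the estimate counts only the memory reclaimed by major collections, as the figures above do, so it is best read as a rough lower bound on the real allocation rate.

```java
// Back-of-the-envelope estimate of the object creation rate from GC observations.
// AllocationRateEstimate and mbPerMinute are illustrative names, not part of any tool.
class AllocationRateEstimate {

    /** MB reclaimed per minute = collections per minute * MB released per collection. */
    static double mbPerMinute(double collectionsPerMinute, double mbReleasedPerCollection) {
        return collectionsPerMinute * mbReleasedPerCollection;
    }

    public static void main(String[] args) {
        // Observed: two major GCs per minute, each releasing 400 to 500 MB.
        double low = mbPerMinute(2, 400);
        double high = mbPerMinute(2, 500);
        System.out.println("Estimated rate: " + low + " to " + high + " MB per minute");
    }
}
```

Whatever a collection frees must have been allocated since the previous one, which is why the released volume can stand in for the creation rate.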
Applications were attached to the WebLogic system classpath, which meant that the Java classes were never unloaded from memory. Combined with the fact that there are ALWAYS classloader leaks, this meant that whenever the operations team redeployed the apps (a.k.a. “hot” redeployment), the memory footprint worsened since the previous memory was never released due to this leakage. If you keep hot-deploying this stuff you will almost certainly get a java.lang.OutOfMemoryError: PermGen space.
EJBs (4,000+ EJBs deployed, in my opinion too many) were using remote interfaces when there was no need, as those apps were not doing cross-VM operations. Based on my previous experimentation, you would get a 3-fold runtime improvement when you convert the EJBs to local interfaces. This improvement comes from less object marshalling/unmarshalling via RMI, since everything is on the same JVM heap, and from lower consumption of system resources like file descriptors/sockets and memory, since a local interface implies a local/normal Java call.
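To see where the marshalling cost comes from, here is a minimal sketch that imitates the two calling conventions without any EJB container. It assumes that a remote call copies its arguments through Java serialization while a local call passes the reference; the class and method names are hypothetical, and real RMI adds network and stub overhead on top of this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustration only: no EJB or RMI classes are involved here.
class CallSemanticsDemo {

    /** Imitates the argument copy that call-by-value (remote interface) semantics impose. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T marshalRoundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("marshalling failed", e);
        }
    }

    /** Stand-in for a business method that only reads its argument. */
    static int countItems(List<String> items) {
        return items.size();
    }

    public static void main(String[] args) {
        ArrayList<String> order = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            order.add("item-" + i);
        }

        // Local interface: the callee works on the very same object.
        int viaLocal = countItems(order);

        // Remote interface: the callee works on a freshly deserialized copy.
        ArrayList<String> copy = marshalRoundTrip(order);
        int viaRemote = countItems(copy);

        System.out.println("same result: " + (viaLocal == viaRemote));
        System.out.println("same object: " + (copy == order));
    }
}
```

Every remote-style call allocates a byte buffer plus a full copy of the argument graph, which feeds straight into the object creation rate discussed earlier.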
As I mentioned previously, the apps were deployed in the cluster, and that meant that all persistent objects (e.g. session data, user preferences etc) must be serializable (i.e. persistent objects need to implement java.io.Serializable), since there would be session replication across the servers in the cluster. This further degraded the performance as the cluster needs to maintain state across all 20 machines. Source code analysis found that the apps were keeping results of database fetches in session data! You can imagine the pressure faced by the JVM memory subsystem + WebLogic cluster replication.
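A rough way to judge how much a session attribute costs to replicate is to measure its serialized size. The sketch below compares a cached result set with a small key that could be used to re-run or look up the query. The row layout, the row count and the key format are all invented for illustration, and the replication layer may add its own overhead on top of plain Java serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

// Illustration only: measures plain Java serialization, not WebLogic's replication format.
class SessionPayloadSize {

    /** Number of bytes produced when the value is written with Java serialization. */
    static int serializedSize(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical result of a database fetch kept in the HTTP session.
        ArrayList<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            rows.add(new String[] {"id-" + i, "customer name " + i, "address line " + i});
        }

        // Hypothetical alternative: keep only a key and fetch the rows again when needed.
        String queryKey = "customerSearch:page=1";

        System.out.println("cached rows: " + serializedSize(rows) + " bytes");
        System.out.println("query key:   " + serializedSize(queryKey) + " bytes");
    }
}
```

Multiply the first number by the number of active sessions to get a feel for the replication traffic and heap pressure involved.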
The WebLogic cluster was also malfunctioning during peak periods, throwing an exception message like <WorkManager> <BEA-002911> <WorkManager weblogic.kernel.System failed to schedule a request due to weblogic.utils.UnsyncCircularQueue$FullQueueException: Queue exceed maximum capacity of: ‘65536′ elements. This is a critical error thrown from the Work Manager, which replaced the traditional BEA thread pool. What this meant was that the WebLogic cluster could no longer handle users’ requests and hung. *I plan to unravel this mystery in a while to understand why this is happening*
The hardware load balancer was in “sticky” mode even though the WebLogic cluster was in round-robin mode. This negated the round-robin distribution and resulted in certain servers encountering more stress than others, and it was made worse by the long session timeout of 20+ hrs. That’s the cost of doing business….
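The effect of stickiness on load distribution can be illustrated with a toy simulation. It assumes 20 servers and 200 sessions of which a few are much heavier than the rest; the session mix, the seed and all names are invented, and the model only counts requests per server rather than modelling time or memory.

```java
import java.util.Random;

// Toy model: compares per-request round robin with session-sticky routing.
class StickyVsRoundRobin {

    /** Round robin: every request goes to the next server in turn. */
    static int[] roundRobin(int[] requestsPerSession, int servers) {
        int[] load = new int[servers];
        int next = 0;
        for (int requests : requestsPerSession) {
            for (int r = 0; r < requests; r++) {
                load[next] += 1;
                next = (next + 1) % servers;
            }
        }
        return load;
    }

    /** Sticky: a session is pinned to one server, which receives all its requests. */
    static int[] sticky(int[] requestsPerSession, int servers) {
        int[] load = new int[servers];
        for (int s = 0; s < requestsPerSession.length; s++) {
            load[s % servers] += requestsPerSession[s];
        }
        return load;
    }

    /** Difference between the busiest and the idlest server. */
    static int spread(int[] load) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int l : load) {
            min = Math.min(min, l);
            max = Math.max(max, l);
        }
        return max - min;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] sessions = new int[200];
        for (int i = 0; i < sessions.length; i++) {
            // Roughly one session in ten is a heavy user.
            sessions[i] = rnd.nextInt(10) == 0 ? 500 + rnd.nextInt(500) : 1 + rnd.nextInt(20);
        }
        System.out.println("round-robin spread: " + spread(roundRobin(sessions, 20)));
        System.out.println("sticky spread:      " + spread(sticky(sessions, 20)));
    }
}
```

With round robin the spread can never exceed one request, while with sticky routing it depends entirely on which server the heavy sessions happen to land on.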
After tracing the SQL statements’ execution times, it was found that they were causing a lot of problems: missing indexes, lack of functional indexes, and improper SQL statements that caused large database table joins. Many “select count(*)…” statements over large table joins also contributed to this object creation rate.
When I looked at these issues, the first couple of items I advised my customer to do were the following:
(1) Convert the EJBs to use local interfaces i.e. call-by-reference
(2) Tune the SQL statements via SQL reordering, indexes etc
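(3) Tune the JVM heap to use more aggressive + parallel heap collectors via -XX:+UseParallelGC or -XX:+UseConcMarkSweepGC (We are still experimenting with this portion; note that HotSpot treats these two flags as a conflicting combination, so they are alternatives to try rather than a pair, and the parallel young-generation collector that pairs with CMS is -XX:+UseParNewGC)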
(4) Do not use the system classpath to load application classes
(5) Review the source code to remove known classloader leaks
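When experimenting with collector flags as in item (3), it helps to confirm which collectors the JVM actually ended up with and how much work they are doing. A minimal sketch using the standard java.lang.management API follows; the class name is arbitrary, and the collector names printed depend on the JVM vendor, version and flags.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints the garbage collectors active in this JVM with their cumulative statistics.
class ActiveCollectors {

    static List<GarbageCollectorMXBean> collectors() {
        return ManagementFactory.getGarbageCollectorMXBeans();
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : collectors()) {
            // Count and time are cumulative since JVM start; -1 means "undefined".
            System.out.println(gc.getName()
                + ": " + gc.getCollectionCount() + " collections, "
                + gc.getCollectionTime() + " ms total");
        }
    }
}
```

Sampling these counters at the start and end of a peak period gives collection frequency and average pause cost without parsing GC logs.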
Part 2:
In my previous article, my tuning project with a customer ran into some trouble with WebLogic’s Work Manager, in particular with the Java exception weblogic.utils.UnsyncCircularQueue$FullQueueException, where the WebLogic server indicated that the queue holding the submitted requests was full. I checked the WebLogic docs and, according to them, the server will automatically resize the queue to fit the requests. What the docs didn’t mention was that the resize will fail if the size of the queue equals the capacity, which happens to be 65536, and that’s the reason why it threw the error message “Queue exceeded the maximum capacity of ‘65536′ elements”.
However, checking the code reveals something quite peculiar: the constructor suggests that only queues of sizes exceeding 1 GB will throw this error, but the default capacity is always 256, growing to a maximum of 65536 elements. By the way, it’s WebLogic 9.2, so my guess is that the source code needs to be cleaned up? If anybody has any idea why, do drop me a comment. Thanks in advance.
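The behaviour described above can be pictured with a small growable circular queue. This is not WebLogic’s code: it is a hypothetical sketch that only borrows the two numbers from the text (a default capacity of 256 and a ceiling of 65536) and assumes the queue doubles on each resize. All class names are invented.

```java
// Hypothetical exception, named differently from WebLogic's to avoid confusion.
class QueueFullException extends RuntimeException {
    QueueFullException(int max) {
        super("Queue exceeded maximum capacity of: '" + max + "' elements");
    }
}

// Circular queue that doubles its backing array until it hits a hard ceiling.
class BoundedGrowableQueue<T> {
    static final int DEFAULT_CAPACITY = 256;   // figure taken from the text
    static final int MAX_CAPACITY = 65536;     // figure taken from the text

    private Object[] slots = new Object[DEFAULT_CAPACITY];
    private int head = 0;
    private int size = 0;

    void put(T item) {
        if (size == slots.length) {
            grow();
        }
        slots[(head + size) % slots.length] = item;
        size++;
    }

    @SuppressWarnings("unchecked")
    T get() {
        if (size == 0) {
            return null;
        }
        T item = (T) slots[head];
        slots[head] = null;
        head = (head + 1) % slots.length;
        size--;
        return item;
    }

    int capacity() {
        return slots.length;
    }

    /** Doubles the array, or fails once the ceiling has been reached. */
    private void grow() {
        if (slots.length >= MAX_CAPACITY) {
            throw new QueueFullException(MAX_CAPACITY);
        }
        Object[] bigger = new Object[slots.length * 2];
        for (int i = 0; i < size; i++) {
            bigger[i] = slots[(head + i) % slots.length];
        }
        slots = bigger;
        head = 0;
    }

    public static void main(String[] args) {
        BoundedGrowableQueue<Integer> queue = new BoundedGrowableQueue<>();
        try {
            for (int i = 0; ; i++) {
                queue.put(i);   // nothing drains the queue, as on a saturated server
            }
        } catch (QueueFullException e) {
            System.out.println(e.getMessage() + " (capacity " + queue.capacity() + ")");
        }
    }
}
```

In this model the exception is a symptom rather than the cause: it appears only when requests are queued faster than they are drained for long enough to fill 65536 slots.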
Part 3: