Can WASD Adapt to The Load?

Universidad de Málaga (University of Malaga, UMA) is a center for higher education covering 4 campuses, 19 faculties, 65 undergraduate courses and postgraduate programs, with some 3800 staff and 40,000 students, located on the Costa del Sol in southern Spain.  Many thanks to UMA Administration (again) for permission to publish this data.
http://www.uma.es/
The UMA front-end system is an ES40 with four EV68A, 833MHz CPUs, 4GB memory, running OpenVMS 7.3-2 and TCP/IP Services 5.4.

Although UMA has used WASD since 2003 (and other Web technologies on VMS previously), 2006 was the first time a fully Web-based student registration system had been implemented - see Can WASD Handle The Load?  At the commencement of the 2007-2008 academic year student registration and associated staff activities were again to be performed using the platform.  It included WASD v9.2.1, PHP Version 4.3.10 (via the CSWS PHP v1.3 engine) along with a number of in-house front- and back-end applications distributed across other VMS and Linux-based systems.  UMA staff had further developed and thoroughly tested the registration application suite for functionality and behaviour under load.  Any observed issues had been analysed and addressed.  Result: high levels of confidence!

Pre-Registration Week

Registration for a subset of the student body during the week before general registration performed as expected - uneventfully.

Week One

On day one of general registration the system began to experience Web service and system performance issues: delays and timeouts during request processing, processes swapped out, lots of page faulting, a 'clogged' system.  Cursory analysis using MONITOR SYSTEM showed the CPU fully utilised, a limited free list and high page faulting.

This is reflected in the Web service activity report for day one.

UMA Day One

The dark blue line towards the bottom of the above graph (peak network connections per second) and the white line below it (peak requests-in-progress per second) show significant variability and enormous peaks and plateaus as system processing repeatedly grinds to a halt.  The peaks of 1,402 connections and 600 requests-in-progress represent server configuration limits at which HTTP 503 status is immediately returned.  The server was restarted during problem investigation.

Analysis during the initial period of that first day using WASD Statistics and WATCH, Availability Manager, T4, and the netstat utility revealed that a Web-based application suite, unrelated to registration processing per se but of particular interest to students beginning the new year, was significantly resource-intensive (memory and CPU).  Under more everyday usage this was not an issue, but the large number of instances of this application combined with the load presented by the registration suite itself resulted in resource starvation of all processing on the system.

With the primary contributing factor identified and understood a solution could be explored.  With no immediate way to increase physical memory or CPU capacity, the obvious solution for controlling the situation was to reduce the number of instances of the identified application.  This would prevent the resource starvation, in particular virtual memory demands and the associated page faulting, allowing the more important registration applications to conclude successfully in a timely manner.  This could readily be accomplished using the WASD throttle (request queuing) facility.


WASD Throttle Report

With throttling already in general use to limit concurrent PHP applications to numbers supported by available CPU resources (rule **.php*), it was a matter of introducing an additional throttle rule specifically against the problematic application to severely limit its numbers (rule /ordenac/**.php*).  A configuration file edit and a rule reload introduced the new control.
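The idea behind request throttling can be illustrated with a minimal Python sketch.  This is not WASD's implementation and does not use WASD's rule syntax: the Throttle class, the throttle_for helper, the fnmatch-style patterns, the limits of 2 and 10, the first-match-wins ordering and the file name lista.php are all assumptions made purely for illustration.  It shows only the general principle: a path-specific rule caps how many requests for the heavy application run at once, and any excess waits in a queue instead of competing for memory and CPU.

```python
from collections import deque
from fnmatch import fnmatchcase


class Throttle:
    """Admit at most `limit` requests at once; queue the rest in FIFO order."""

    def __init__(self, limit):
        self.limit = limit
        self.active = 0
        self.queue = deque()

    def arrive(self, request_id):
        """Return True if the request starts now, False if it was queued."""
        if self.active < self.limit:
            self.active += 1
            return True
        self.queue.append(request_id)
        return False

    def finish(self):
        """A request completed; hand its slot to the oldest queued request."""
        if self.queue:
            return self.queue.popleft()  # slot is reused, so `active` is unchanged
        self.active -= 1
        return None


def throttle_for(path, rules):
    """Return the throttle of the first rule whose pattern matches `path`."""
    for pattern, throttle in rules:
        if fnmatchcase(path, pattern):
            return throttle
    return None


# Hypothetical limits; the more specific rule is listed first (an assumption).
rules = [
    ("/ordenac/*.php*", Throttle(limit=2)),   # the resource-hungry application
    ("*.php*", Throttle(limit=10)),           # all other PHP requests
]

heavy = throttle_for("/ordenac/lista.php", rules)  # lista.php is a made-up name
print(heavy.arrive("r1"), heavy.arrive("r2"), heavy.arrive("r3"))  # True True False
print(heavy.finish())  # r3
```

Lowering or raising a limit in such a scheme changes only how many requests are admitted concurrently, which is why this kind of control can be tuned while the service keeps running.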

Very soon the efficacy of the new rule was obvious.  Request processing improved in latency, request timeouts dropped and system paging reduced dramatically.

A solution without a system redesign or even a required Web server restart!

Days 1 and 2 - detail

Day two conclusively demonstrated the additional throttle rule to be an effective solution.  The enlarged sections in this graphic displaying days one and two show comparable periods during peak processing.  Note the much greater stability in peak connections and requests-in-progress on day two.  Similar stability can be seen beginning in the day one graph following the introduction of the trial rule (circled).  Day two throughput was also approximately 20% greater, understandable considering the much more efficient use of system resources.  Compared to day one there were effectively zero processing issues.  Further observation allowed the new and very severe throttle rule to be relaxed somewhat, increasing the numbers of the problematic application until an effective ratio of that and the registration application suite produced a fully utilised but not over-extended system.  Again, these changes were made without needing to interrupt registration processing with server restarts or the like.

The longer-term solution will be a now-scheduled quadrupling of physical memory to 16GB, addressing the most significant problematic system behaviour identified during this period, the excessive page faulting.

Week 1 2007

The full five days of the first week of general registration show some 14.7 million requests processed and 101GB transferred.  Some 26GB of this comprised gzip-compressed HTTP/1.1 responses representing 60GB of original content (compression to 46%), indicating a total of 133GB of response data actually processed.  The extreme peak requests-in-progress and network connections during day one distort the Y-axis of this particular graphic.

Week Two

The second week of general registration proceeded without significant issue.

Week 2 2007

The five weekdays of this second week show some 13.8 million requests processed and 103GB transferred.  The Y-axis scaling of this graphic is entirely different to week 1, showing greater detail and dynamic range.  The peak number of concurrent connections was 579 and peak concurrent requests-in-progress 183, with the maximum processed in any one minute 6,633 or a little more than 110 per second.

In total the ten weekdays of registration processed some 28.5 million requests and transferred 204GB of network traffic.  Approximately 50GB of this was gzip-compressed HTTP/1.1 response data representing 117GB of original content (compression to 46%), indicating a total of 271GB of response data handled over the ten days.
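The ten-day total of response data handled follows from simple accounting, sketched here in Python using the rounded figures quoted above.  The function name and its parameters are illustrative only: the compressed bytes actually sent over the network are swapped for the original content they stood in for.

```python
def uncompressed_total_gb(transferred_gb, compressed_gb, original_gb):
    """Response data handled, counting compressed responses at original size."""
    # Remove the compressed bytes that went over the wire, then add back
    # the size of the original content those bytes represented.
    return transferred_gb - compressed_gb + original_gb


ten_day_total = uncompressed_total_gb(transferred_gb=204,
                                      compressed_gb=50,
                                      original_gb=117)
print(ten_day_total)  # 271
```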

The period described in this document is slightly different to that of 2006 - Can WASD Handle The Load?  For a comparable fourteen days, spanning two weekends and beginning the first Monday of general registration, a total of 33.7 million requests and 267GB were processed, slightly up on 2006.  Statistics related to gzip-compression were not noted in 2006.

Congratulations

... to University of Malaga IT staff after yet another very successful year of Web-based registration
and for being able to troubleshoot effectively under significant pressure!


Conclusion

Can WASD adapt to your load?  Almost certainly!

Mark Daniel
04-OCT-2007