TL;DR An x86-64 introduced to a cluster should be just that.
It wasn't, at least for a couple of WASD pioneers.
The exercise uncovered an issue in porting VMS to X86.
Recently, two sites, one running an existing Alpha and one an existing IA64,
each endeavoured to add an x86-64 (X86) member to establish new clusters. For
each, this was both an exercise in using the technology and a test-bed for new
X86 families of product.
The steps in establishing X86 clusters are described in the VSI Installation
Guide, OpenVMS x86-64 Version 9.2-2, Appendix A (choose your poison):
https://docs.vmssoftware.com/vsi-openvms-x86-64-v922-installati...
https://docs.vmssoftware.com/docs/vsi-openvms-x86-64-v922-insta...
While these steps outline how to build X86 + X86 clusters by essentially
cloning an original X86 system, and adjusting the subsequent system's network
and System Communication Service (SCS) characteristics, similar steps adapted
to an existing system plus an X86 virtual machine based system can be
employed to establish a mixed architecture cluster.
https://docs.vmssoftware.com/guidelines-for-openvms-cluster-con...
https://docs.vmssoftware.com/docs/VSI_Cluster_Guidelines.pdf
This has now been undertaken with (emulated) Alpha, IA64, and a hardware
Alpha (DS20), initially with varying WASD outcomes. All X86 clustered WASDs
are now operating as expected (currently with a small caveat).
Long(er) Story Short(er)
~~~~~~~~~~~~~~~~~~~~~~~~
Hunter Goatley of PSC (aka EISNER-meister) added a cloned V9.2-3 X86NER,
hosted using VirtualBox (VBox), to EISNER (decuserve.org) as a service to
DECUServe users and as an exercise in hosting X86.
At the same time, Jeremy Begg of VSM Software Services integrated an updated
X86 V9.2-3 system, hosted using VBox, into a cluster with his IA64 workhorse.
When WASD was started, both of these systems experienced dire failures
requiring shutdown of the X86 system. :-{ Excising the WASD startup from
each system was required, otherwise another iteration of shutdown ensued.
The symptom: "WASD:80 RWAST" when provided with a network connection.
Hmmm. Time to break out the toolbox. Firing up the (0.450kW) DS20꙳꙳
(usually only brought online for specific purposes) I used the VBox VM clone
facility to essentially undertake the equivalent machine clone described in
Appendix A above. A copy of my development V9.2-3 X86. Rather than @AUTOGEN
the cluster into existence I decided to 'hand-edit' the SYSGEN CURRENT
/CLUSTER and /SCS directives for the Alpha and X86 systems. Alpha VOTES 1,
X86 VOTES 0, EXPECTED_VOTES 1. CLUSTER_AUTHORIZE.DAT ZIPed "-V" into place.
I could not get a cluster to form! The Alpha just ignored the X86's vain plea
via its network adapters, "waiting to form or join a cluster".
꙳꙳ Generously provided by Jeremy Begg after the demise of my 20+yr PWS.
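The vote settings described above can be sanity-checked with the usual quorum
arithmetic. This is a minimal sketch, assuming the commonly documented formula
quorum = (EXPECTED_VOTES + 2) / 2 with the fraction truncated. The function and
node names are illustrative only, and the real connection manager takes more
into account than this.

```python
# Sketch of OpenVMS cluster quorum arithmetic (simplified; names are illustrative).

def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(member_votes: dict[str, int], expected_votes: int) -> bool:
    """True when the votes of the members present reach quorum."""
    return sum(member_votes.values()) >= quorum(expected_votes)


EXPECTED_VOTES = 1  # as hand-edited: Alpha VOTES 1, X86 VOTES 0

print(quorum(EXPECTED_VOTES))                               # 1
print(has_quorum({"ALPHA": 1}, EXPECTED_VOTES))             # True
print(has_quorum({"X86": 0}, EXPECTED_VOTES))               # False
print(has_quorum({"ALPHA": 1, "X86": 0}, EXPECTED_VOTES))   # True
```

Under this simplified reading the Alpha carries quorum on its own, so the vote
values by themselves would not be expected to stop the cluster forming.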
Having firmly nudged my head against the Alpha for a couple of days I decided
on another approach, an X86 + X86 cluster on the same host, eliminating a
possibly problematic physical network. Already having a clone of my
development X86 I cloned the development VM once again (insurance) and
adjusted SYSGEN /CLUSTER and /SCS parameters as a primary voting member,
rebooted, checked the SYSGEN ... looked OK. Tentatively booted the secondary
member, watched the network adapters come up, "waiting to form or join" ...
and *JOY* ... was answered by the primary, continuing with the startup.
I had not disabled WASD startup on either of those X86 systems.
WASD seemed 'all-singing, all-dancing'. That is, all aspects working as
expected. The primary X86 (development system) was again processing crawler
requests. The secondary X86, without Internet presence, was exercised using
OWASP ZAP. A relief it didn't seem to be something fundamental with WASD.
In The Meantime
~~~~~~~~~~~~~~~
The IA64 clustered X86 also had sprung into life. @AUTOGEN applied to the
X86 had bumped up a few resources, notable among them SCSBUFFCNT (and other
memory-related params), and upon reboot, process WASD:80 had commenced
processing requests from the crawlers it had served when previously standalone.
Buoyed by the success of the X86 + X86 cluster I returned to the DS20 + X86
cluster initially configured then abandoned. Eliminating the network concern
using a standalone switch and two UTP cables still produced a complete and
utter silence when "waiting ..." So, not a quirk with my LAN (wired and
powerline). Comparing the SYSGEN of the working primary X86 to the Alpha, a
likely culprit emerged: NISCS_LOAD_PEA0 0. Hand-writ strikes again!
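The comparison that exposed the culprit can be sketched as a simple difference
of two parameter sets. This is an illustration only: the dictionaries stand in
for SYSGEN listings captured from each node, the function name is hypothetical,
and every value other than NISCS_LOAD_PEA0 is made up for the example.

```python
# Sketch: report SYSGEN parameters whose values differ between two nodes.
# Values are illustrative; only the NISCS_LOAD_PEA0 pair comes from the text.

def differing_params(reference: dict[str, int], suspect: dict[str, int]) -> dict[str, tuple]:
    """Map each differing parameter name to (reference value, suspect value)."""
    names = sorted(set(reference) | set(suspect))
    return {
        name: (reference.get(name), suspect.get(name))
        for name in names
        if reference.get(name) != suspect.get(name)
    }


working_x86 = {"VOTES": 1, "EXPECTED_VOTES": 1, "NISCS_LOAD_PEA0": 1}
alpha_ds20 = {"VOTES": 1, "EXPECTED_VOTES": 1, "NISCS_LOAD_PEA0": 0}

for name, (good, bad) in differing_params(working_x86, alpha_ds20).items():
    print(f"{name}: working={good} suspect={bad}")
# NISCS_LOAD_PEA0: working=1 suspect=0
```

A parameter missing from one side shows up with None for that side, which is
also worth a look when values have been entered by hand.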
Having adjusted NISCS_LOAD_PEA0 1 and rebooted the DS20, then the secondary
X86 ... *more* JOY as it completed the transition and continued on to
conclude the full startup, including working WASDs. This configuration
essentially replicated the otherwise problematic EISNER + X86NER cluster.
Again, relief the issue didn't seem fundamental with WASD.
Neither system having Internet presence I ran ZAP crawls. SOLID!
Specialist Assistance
~~~~~~~~~~~~~~~~~~~~~
Early in the whole saga, an RMSBUG error reported by X86NER seemed
fundamental to the issue. Having never encountered these, and HELP/MESSAGE
RMS-F-BUG suggesting something seriously amiss, I posted a VSI Forum entry
https://forum.vmssoftware.com/viewtopic.php?f=42&t=9711
which as it turned out was a bit of a misdirection. However the post
garnered some necessary and welcome specialist attention in Volker Halle and
Hein van den Heuvel, whose subsequent efforts were instrumental in analysis
of X86NER system dumps and other data.
First Runner-up
~~~~~~~~~~~~~~~
WSDEFAULT and especially WSQUOTA need to be increased!
> looked at the crash and the RMS crash is indeed caused by the SS$_INSFWSL
> from the $EXPREG system service. The $EXPREG system service will only
> expand if we are below quota, but our current WSSIZE is 8226 pages and
> WSQUOTA is 3544. Also the current sysgen parameters for the WS default &
> quota are smaller than the SYSGEN default
Confirmed by VSI engineering. Read the full (and enlightening) explanation
in Volker's post of Tue Mar 24, 2026 8:22 am.
https://forum.vmssoftware.com/viewtopic.php?f=42&t=9711&start=1.
The solution was to restore some mysteriously 'way too low' SYSGEN values
using MODPARAMS.DAT directives PQL_M.. and PQL_D.. along with NPAGEDYN and
PAGEDYN then @AUTOGEN.
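The condition in the quoted analysis can be restated as a one-line check. This
is a deliberately simplified sketch of the rule as quoted ("will only expand if
we are below quota"), not the actual $EXPREG logic. The function name is
hypothetical and both values are treated as being in the same unit, as the
quote compares them directly.

```python
# Sketch of the quoted rule: working set expansion only proceeds below quota.
# Simplified model; not the real $EXPREG implementation.

def can_expand(wssize: int, wsquota: int) -> bool:
    """True while the current working set size is still below the quota."""
    return wssize < wsquota


# Figures from the quoted crash analysis.
print(can_expand(wssize=8226, wsquota=3544))  # False
```

With the current size already well above the quota, the check fails, which
matches the SS$_INSFWSL reported in the quote.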
This is described above as the 'runner-up' critical issue because, after
SYSGEN adjustment, X86NER WASD happily processed requests (using local cURL)
but the "WASD:80 RWAST" persisted when /DO=EXIT=NOW was issued :-{
And The Winner Is
~~~~~~~~~~~~~~~~~
> Uh OH.
>
> I may have to swap my volunteer hat for my VSI Engineering hat.
>
> Hein.
Which is exactly the way the issue progressed. Although WASD seemed absolved
of responsibility, it was still triggering the problem in some fashion.
Multiple ANALYZE/SYSTEM and ANALYZE/DUMP sessions identified an outstanding
DIO, with DIOCNT always one less (-1) than DIOLM, which seemed to be resulting
in the RWAST.
An extensive edit+compile+execute+repeat through the WASD startup code first
isolated the DIOCNT -1 to the DCL.C module, then with further refinement to
the DclExit() function which runs when the server is shutting down.
It seemed that a channel with an IO$_WRITEOF queued to a mailbox was not being
correctly handled during image exit. If the queued I/O was $CANCELed during
that DclExit() then no RWAST! Reminder: this has worked for thirty years
across four platforms.
A small reproducer devised by Volker based on this knowledge was the
breakthrough tool allowing investigation on VSI test benches (sans WASD :-)
This was munged in various ways over the following week, all resulting in the
process entering RWAST, and commented on with an authoritative
> 00000414 0014 SYSTEM SYSTEM RWAST 6 821E7780 89C04000 296
>
> And SDA> SHOW CALL/SUMM shows the same well-known call stack (RM$UNMAP_GBL et. al.)
>
> So this is a REAL OpenVMS x86-64 V9.2-3 problem !!!
A succession of info-WASD emails warned X86 deployed sites of the issue.
https://wasd.vsm.com.au/info-WASD/2026/0004
https://wasd.vsm.com.au/info-WASD/2026/0005
https://wasd.vsm.com.au/info-WASD/2026/0006
Later an email from Hein with a triumphant (dare I say, somewhat relieved)
> We got it! And it's a good one. ;-)
8< snip 8<
> Yesterday at the Bootcamp in Malmo we huddled together ...
> fixed it 4 hours later!
followed by ~25 lines of MACRO, with which of course I needed followup
> I'll give an extra hint on the code.
> See that " .if df alpha... .endc "
> And that " .if df ia64... .endc "
> There is no 'else' clause.
> Now guess what happens when it is neither? Nothing! There is no X86
> variant. Just a silly porting oversight!
Will be addressed in a forthcoming VMS update.
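The shape of that oversight is easy to reproduce in any language. What follows
is a Python analogue, not the MACRO in question (which is not reproduced here).
The function names and variant strings are hypothetical, and the checked
version illustrates a defensive pattern rather than the fix VSI applied.

```python
# Analogue of per-architecture conditionals with no 'else' clause.
# Names and return values are hypothetical illustrations.

def variant_unchecked(arch):
    if arch == "alpha":
        return "alpha variant"
    if arch == "ia64":
        return "ia64 variant"
    # No else clause: any other architecture silently gets nothing (None).


def variant_checked(arch):
    variants = {"alpha": "alpha variant", "ia64": "ia64 variant"}
    if arch not in variants:
        # Fail loudly so a new port cannot slip through unnoticed.
        raise NotImplementedError(f"no variant for architecture {arch!r}")
    return variants[arch]


print(variant_unchecked("x86_64"))  # None
try:
    variant_checked("x86_64")
except NotImplementedError as err:
    print(err)  # no variant for architecture 'x86_64'
```

The unchecked form does "Nothing!" for a third architecture, exactly as the
hint describes, while the checked form turns the omission into an immediate,
visible error.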
Special Thanks
~~~~~~~~~~~~~~
A *lot* of time (11 weeks plus) and effort (in total 5 persons, plus VSI) and
email (377) to try and understand *what* was going on, and *why*.
To Volker and Hein for going the second, then a third, THEN a fourth mile each.
Quotable quotes for this post:
> Isn’t this great ?!
>
> 3 volunteers working from 3 different continents and time zones…
>
> Volker.
And again three weeks later:
> Thanks for all your work and mails, while I had a good night's sleep ;-)
>
> This truely is a worldwide volunteers effort - fascinating.
And then again after the issue was discovered:
> Again thanks to all of you for working on this problem. It was a really fun time 😉
>
> Volker.
Incidental
~~~~~~~~~~
Also identified during this process was a WASD not-best-practise relating to
the SYSUAF 'context' used by various system services (e.g. $GETUAI,
$CREATE_USER_PROFILE): an individual context was used for each service within
a single program. The code now uses a single SYSUAF context.
Recommendations
~~~~~~~~~~~~~~~
• Don't hand-code SYSGEN :-|
I did and it delayed confirming the working DS20 + X86 by some days.
• Do edit MODPARAMS.DAT and @AUTOGEN (or @CLUSTER_CONFIG.COM)
• After 24 hours and some exercising of the system do @AUTOGEN.
• If using cluster-common SYSUAF remember WASD accounts and quota settings
apply to multiple systems.
• If using a common WASD tree and the hosts have differing roles then some
WASD_CONFIG_MAP, etc. may need conditional configuration.
• Don't neglect if-then-else-endif and/or called procedures in
[STARTUP]STARTUP_LOCAL.COM to differentiate server roles as necessary.
• Above all, don't be afraid to reach out, via VSI 'OpenVMS Forum' or other
means. Some problems are obvious to, or have already been experienced by,
others; still other problems require specialist input.
This item is one of a collection at
https://wasd.vsm.com.au/other/#occasional