In Three Acts -- not a tragedy, nor a comedy
~~~~~~~~~~~~~
Just on a year ago this collection added an article, "Latency and back-ends",
which discussed a latent bug (of misunderstanding) that, after years, suddenly
became a problem due to a change in behaviour at the far end.
https://wasd.vsm.com.au/info-WASD/2024/0015
This article discusses another latent coding bug, at the other end of the
processing chain.
Kudos to Process Software Corporation (PSC). Retold with permission.
Dramatis Personae
~~~~~~~~~~~~~~~~~
[redacted] client of PSC
Hunter Goatley EISNER-meister (and PSC Engineer)
John Reagan mythological compiler chimaera
Richard Whalen PSC Principal Software Engineer
narrator vox mea propria
Act I
~~~~~
Res quaedam non recte se habet.
[redacted] is having an issue with WASD on field-test x86-64 MultiNet.
To investigate, PSC deploys a VirtualBox-hosted X86 VMS V9.2-3 system with
MultiNet v6, along with a fresh WASD v12.3 source kit. After the build,
using the usual
demonstration procedure, WASD refuses to accept a request. Any request at
all. Just drops the connection.
Curiously, this is *not* the problem [redacted] is experiencing.
Normally at any point in problem solving the suggestion is WATCH.
But if you can't even connect?
> You *can* WATCH Hunter, at the command-line. It's a little messy.
> $ SPAWN HTTPD <any-other-required-parameters> /WATCH=NOSTARTUP,-1
> The -1 just enables all WATCH items.
https://wasd.vsm.com.au/wasd_root/wasdoc/features/#watchfacilit.
https://wasd.vsm.com.au/wasd_root/wasdoc/features/#commandlineu..
Thanks for the console WATCH info. Rich tried it today and got this:
8< snip 8<
|15:06:25.58 HTTPD 3368 000014 INTERNAL TIMER input 30 seconds|
|15:06:25.58 TCPIP 0768 000014 NETWORK MAXQIO 0 maxseg:1460 sndbuf:62780 rcvbuf:0 _BG864: %X00000001|
|15:06:25.58 NETIO 0717 000014 NETWORK READ 0/16384 bytes (non-blocking)|
|15:06:25.58 NETIO 0865 000014 NETWORK READ %X00000001 0 (0/16384) bytes (non-blocking)|
8< snip 8<
The big question is “Why is it doing 0 (zero) byte reads?”
A fair question. My development bench doesn't.
|21:07:18.41 WATCH 3198 000002 FILTER CLIENT adding gort.lan,63944 on http://x86vms.lan,80 (192.168.1.86)|
|21:07:18.41 TCPIP 0736 000002 NETWORK SETMODE sndbuf:62780 rcvbuf:62780 %X00000001|
|21:07:18.41 TCPIP 0768 000002 NETWORK MAXQIO 64240 maxseg:1460 sndbuf:999999 rcvbuf:999999 %X00000001|
|21:07:18.41 NETIO 0722 000002 NETWORK READ 0/16384 bytes (non-blocking)|
Er, hello, it's doing zero byte $QIOs because that's how many bytes are
reported available to the receive buffer.
|15:06:25.58 TCPIP 0768 000014 NETWORK MAXQIO 0 maxseg:1460 sndbuf:62780 rcvbuf:0 _BG864: %X00000001|
^^^^^^^^
WASD, performing network I/O using $QIO, attempts to optimise the number of
whole TCP segments that can be fitted into a single QIO.
/*****************************************************************************/
/*
The socket MSS value has been established during connection acceptance.
Calculate the maximum number of full segments that can be QIOed and set MaxQio.
*/
int TcpIpSocketMaxQio (void *vptr)
8< snip 8<
{
   qios = 65535;
   if (qios > ioptr->TcpSndBuf / 2) qios = ioptr->TcpSndBuf / 2;
   if (qios > ioptr->TcpRcvBuf / 2) qios = ioptr->TcpRcvBuf / 2;
   qios = (qios / ioptr->TcpMaxSeg) * ioptr->TcpMaxSeg;
   ioptr->TcpMaxQioSet = ioptr->TcpMaxQio = qios;
}
8< snip 8<
/*****************************************************************************/
In this way, data far exceeding single $QIO capacity is sent as groups of
whole segments, theoretically optimising transfer 'on the wire'.
Act II
~~~~~~
Fama cimex.
"Rich finally tracked down the problem. I'm sorry to report that
it's a WASD bug!"
Once he verified that the MultiNet kernel was returning proper
sizes in the SENSEMODE call, he started looking at the WASD code.
In [SRC.HTTPD]TCPIP.C, these variables are declared as /ushort/,
but they should be /int/.
216 ushort TcpIpMaxSegLength,
217 TcpIpRcvBufLength,
218 TcpIpSndBufLength;
The documentation states that they should be /int/, and when we
changed "ushort" to "int" and rebuilt, the WASD demo ran just fine.
Fair enough. This code has been running a long time across at least two
earlier CPU architectures.
Why broken on x86-64 now? The WASD x86-64 work began in late 2020 and the
essential port declared concluded twelve months later with the release of
v12.0. Further annual iterations through to v12.3 have consolidated both
WASD and the x86-64 versions of it. (In fact 2025 saw my development bench
move from my 20+ year old Alpha to X86. Even on an everyday desktop it is
blindingly fast in comparison.)
Literally millions of OWASP-ZAP generated crawl and exploit requests over
these four releases have not hinted at anything amiss.
https://wasd.vsm.com.au/wasd_root/wasdoc/config/#3.1.serverands...
My development bench is a commodity Dell, Windows 11, using Virtual Box,
hosting X86 VMS V9.2-3 and VSI TCP/IP Services V6.0.
Act III
~~~~~~~
Diabolus ex machina.
The reason [redacted] doesn't have a problem and we did is probably
because of different VM software and/or different CPUs and data
alignment. Rich said that John Reagan said that prior architectures
(Alpha, I64, and, to a lesser extent, VAX) have had significant
penalties for bad alignment, so the compilers would longword align
variables to minimize the impacts.
John said that x86_64 systems don't have the performance penalties,
so the compilers no longer pad variables to make them longword-
aligned. That and other things we can only speculate about explain
the WASD problem. On our x86_64 systems running under VirtualBox,
the /int/ values we were returning to the /ushort/ variables were
apparently overwriting each other, resulting in the 0-length receive
buffer size WASD was seeing.
That underlying CPU implementations, host operating systems, and
virtualisation tools can in combination yield differing behaviours is a
sobering thought.
There was a brief suggestion that X86 VMS data alignment remaining padded at
some level (long/quad/octa) might go some way toward mitigating these
potential alignment/size architectural issues. The conclusion was that the
X86 instruction stream is largely a product of the LLVM core, effectively
meaning issues such as data alignment are not under VSI control.
PS. Tired of folk salting their work with Latin phrases ad nauseam?
This item is one of a collection at
https://wasd.vsm.com.au/other/#occasional