soyMAIL 2.2.0 requires JavaScript
soyMAIL @ wasd.vsm.com.au
       info-WASD Mailing List 2024 

Mon 10:09:50 Message "2024 / 0013" opened.  MIME.  utf-8.  Plain (−HTML)   10 kbytes.    JavaScript

Subject:[Info-WASD] In The Weeds0013 / 0000
From:mark.daniel@wasd.vsm.com.au
Reply-to:info-wasd@vsm.com.au
Date:Thu, 29 Aug 2024 14:09:18 +0930  [29-AUG-2024 14:09]
To:info-WASD@vsm.com.au

(way)TL;DR  sometimes broad-leaf weed spray just doesn't do the job
            and you need to get to your knees and pull them by hand

A commercial enterprise, using WASD as an element in business data transfer,
has been employing it across major versions 11 and 12.  Updates basically on
an annual basis some months after the WASD release.  In production WASD is
deployed across a two node cluster, sharing processing and storage, plus an
independent test-bed system.  An appropriately rigorous regime.

Six months after the most recent WASD release, v12.2.0, that version was
given a green-light for production.  I was advised this would be happening
the following Monday.  Always a little anxious when a significant cutover is
imminent (even when only indirectly your own), looking at the version log
there didn't seem to be any of the scale of changes as between v11 and v12.
Routine almost.

Perhaps already you can sense where this anecdote is heading...

The Tuesday I receive an email.

WASD's server process log is noticed to be much larger than usual.

The request header checking code that reports malformed headers and fields
within those headers is occasionally spitting out large lumps of what appears
to be partial requests in progress.

|%HTTPD-W-NOTICED, 13-JUL-2024 19:42:07, REQUEST:3595, FIELD "D;31249604;89490200001312225496;491608959698;
↩************" char \x20 at 57 of 7676
|-NOTICED-I-SERVICE, https://*****.*****.**:443
|-NOTICED-I-CLIENT, ###.###.###.###
|-NOTICED-I-HTTP, 1
|-NOTICED-I-URI, lex (5 bytes) Mobil
|-NOTICED-I-RXTX, err:0/0 raw:7720/0 net:7676/0
|######## ######## ######## 3835373B 343B3B31 3B36303B 313B3B0A 443B3331 ************857;4;;1;60;1;;.D;31
|32343936 30343B38 39343930 32303030 30323031 33373031 3434373B 34393137 249604;89490200002013701447;4917
|33343B3B 313B3138 303B313B 3B0A443B 33313234 ######## ######## ######## 34;;1;180;1;;.D;3124************
|30303032 30313337 30313434 373B3439 31373539 30303731 33323B42 7573696E 0002013701447;491759007132;Busin
8< snip 7296 bytes 8<
|37303B31 3B313835 3334343B 31313B69 6E746572 6E65742E ######## ######## 70;1;185344;11;internet.********
|6F6D3B0A 543B3130 370A5055 54202F7E ######## ######## ######## ######## om;.T;107.PUT /*****************
|######## ######## ######## ######## ######2E 544D5020 48545450 2F312E31 *******************.TMP HTTP/1.1
|0D0A4B65 65702D41 6C697665 3A200D0A 436F6E6E 65637469 6F6E3A20 54452C20 ..Keep-Alive: ..Connection: TE,
|4B656570 2D416C69 76650D0A 54453A20 74726169 6C657273 0D0A486F 73743A20 Keep-Alive..TE: trailers..Host:
|31302E31 36382E## ####2E## ####0D0A 436F6E74 656E742D 4C656E67 74683A20 10.168.***.***..Content-Length:
|32333639 310D0A41 7574686F 72697A61 74696F6E 3A204261 73696320 5A577874 23691..Authorization: Basic ZWxt
|######## ######## ######## ######## ######## ######## 0D0A0D0A          ************************....

The above is real data.  Significantly redacted for security and privacy.
Reproduced with permission.

To me it appeared to be a request header ("PUT /") appended to the body of
what was likely a preceding "PUT /" where either (or even all) of ...

1) the request header was incorrectly terminated (i.e. not <CR><LF><CR><LF>
or <LF><LF>) and so the body was being processed as still being the header, or

2) the associated body was not correctly "Content-Length:"-ed and so the body
had been underwritten, meaning a following pipelined** HTTP/1.1 request might
begin processing in the trailing portion of the body, or

3) the body may have been overwritten, meaning a following pipelined HTTP/1.1
request might begin processing in the leading portion of the body, or

4) the request body processing code was broken between versions, or

5) something else entirely.

The site had reverted to the previous version ... v12.0.0.  Hmmmm.  I had
assumed v12.1.0.  I did know they had been busy elsewhere.  Anyhow, a careful
comparison of the REQUEST.C request processing paths revealed no significant
differences between the three minor versions.

I then realised there was a difference in the TCP MSS (maximum segment size)
calculation melded into the maximum QIO size.  The idea being where data
exceeds the MSS it would be better to transmit (and receive) chunks of data
that comprised multiple full TCP datagrams.  v12.0.0 had always read and
written as MSS chunks.

Perhaps this somehow had fed into the reading and reassembling of large
requests.  A code modification to revert the behaviour was shipped and
trial-installed by the site.  At first it looked promising :-| but then a
number of the "REQUEST:3595, FIELD" detections were reported.  The site
reverted to the known-good version (v12.0.0) once again.

Out of many thousands of other transactions and individual clients only a
single client seemed to be impacted by this issue.  Only exhibited under
v12.2.0 and not v12.0.0.  Hmmmm.  A clue?  Too little data to tell.

As all FIELD detections involved partial requests and none showed an initial
request header from where the following (apparently) broken one(s)
originated, obviously a full network stream, go to whoa, needed to be
captured and analysed to try and understand the sequence of octets leading to
this behaviour.  So back to the bench to code up a versatile (enough) octet
sequence search for the network receive stream and trigger collection
whenever a hit occurs.

After three days and 1.75GB data collection there were a handful of useful
errors flagged and the site reverted to the known-good version due to concern
for clients.  Of course sifting through 1,750,000,000 bytes of printable data
using SEARCH and some custom code involved a little garment-rending.

Nevertheless, patterns began to emerge.  Two PUTS one after another, the
leading without an Authorization: header, the next containing one, strongly
suggested a PUT was generating a 401 (authorization required) response with
the next request supplying that.  (Remember, only the received network stream
can be seen.)  Much larger 401-ed PUTs were always associated with a trailing
FIELD error.

Of course these PUTs contained request bodies from a few hundred bytes to
tens and on occasion even many hundreds of thousands of kilobytes.  And the
client app was pushing through the full PUT body come-what-may.  Generally
before the server could transmit the 401.  So the body needs to be discarded
and retransmitted with a new request containing authorisation.  Redundant
traffic but hey, that's the way the generic HTTP web works.

Recall the comment, 'to me it appeared to be a request header ("PUT /")
appended to the body of what was likely a preceding "PUT /" where either ...'
Even more so now the entire relevant network stream had been made visible.
Hmmm, partial body obvious.  It didn't take long to join the body discard
dots.  Comparing the discard code base between v12.0.0 and v12.2.0 showed
obvious gefingerpoken.***

With v12.1.0 apparently I had 'refined' the RequestDiscardBody() function so
now it was *mostly* working.  If the entire PUTed body fitted into the first
(and only) network buffer when the 401 triggered all was well with the world.
Subsequent network buffers were left untouched.  And, of course, were used as
the beginning of the next pipelined** request.  Defensibly number 4) above
but IMO more of a 5) with a comparable functionality.

"Not tested?!", I hear you exclaim.  Arguably not adequately.  Actually had
been in use by myself for 20+ months prior without obvious issue.  soyMAIL
generates POSTs (also processed by PUT.C) for all clicks, and in the case of
message composition some large ones (like this email).  But seems not large
enough to break it.  Even with attachment uploads.  Possibly the site's
particular traffic profile exacerbated the bug.

Why not multiple clients impacted?  Good question.  None other was ever
observed.  Perhaps this client was the only one with requests large or
aggressive enough to reach the buggy behaviour threshold.  Most requests
observed were in the low kilobytes.  This one client routinely was generating
many that size along with the occasional request in the hundreds of kilobytes
and even megabytes.

Thanks to the site for their patience and assistance in resolving this issue.

A positive outcome from all this (in addition to a *working* upgraded site)
is the demonstrated value of being able to trigger data collection by
examining the network stream.  So, from the next release, WATCH will be able
to be configured to examine the receive and transmit network streams for a
specified sequence of octets and when encountered retrieve configured WATCH
items into a standalone log file.  Less weeding by hand.  Or at the very
least, with a cushion beneath the knees.

** Note on HTTP/1.1 pipelined requests
   https://en.wikipedia.org/wiki/HTTP_pipelining

*** Apologies to German language speakers subscribed to this list.
    A sign at a telecommunications site from employment in the early '70s.
    https://en.wikipedia.org/wiki/Blinkenlights

This item is one of a collection at
https://wasd.vsm.com.au/other/#occasional

  ¤¤¤       
  ¤¤¤