TL;DR WASD's 'throttle' rule is one solution for regulating rampant content
harvesting, by serialising (and thereby rate-limiting) request processing.
Trawlers have been ramping up over the last year or so, often without
regard for site impacts. Lots of chatter, and complaints, about the
latest rash of LLM꙳꙳ crawlers not limiting or even serialising trawls of
sites, and some ignoring /robots.txt completely. And because they often do
not contain identifying agent strings, they are pretty much indistinguishable
from regular clients (apart from massive spikes in traffic).
꙳꙳ https://en.wikipedia.org/wiki/Large_language_model
Some impacted sites are even devising 'tar-pits' to lure AI crawlers into
very slow (and low cost) quagmires.
https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
https://www.osnews.com/story/141545/nepenthes-a-dangerous-tarpit-to-trap-llm-crawlers/
In some circumstances such concurrent accesses may result in VMS process
quota spits. This recent example shows multiple, concurrent accesses to the
'Conan The (VMS) Librarian' script exceeding a system's WASD process
quotas and forcing a WASD restart. Conan, coupled with textual
libraries (e.g. HELPLIB.HLB), provides a richly linked information repository.
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, WATCH:3850, <=30%:AST; AST:545/2000 BIO:719/2000 BYT:1511936/3574208
↩DIO:950/1000 ENQ:388/500 FIL:272/300 PGFL:335296/512000 PRC:0/100 TQ:298/300 %X00000018
|-SYSTEM-W-EXQUOTA, process quota exceeded
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, WATCH:3850, <=30%:AST; AST:458/2000 BIO:634/2000 BYT:1445312/3574208
↩DIO:949/1000 ENQ:388/500 FIL:272/300 PGFL:335296/512000 PRC:0/100 TQ:298/300 %X00000018
|-SYSTEM-W-EXQUOTA, process quota exceeded
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%HTTPD-W-NOTICED, 24-MAR-2025 14:30:10, DCL:5293, $CREMBX() %X00002A14
|-SYSTEM-F-EXBYTLM, exceeded byte count quota
|%SYSTEM-F-EXASTLM, exceeded AST quota
ptooey!
The Conan script was confirmed as culprit during WASD restart, where orphaned
scripting processes are cleaned up (the lack of ASTLM having prevented that
during normal server rundown).
|%HTTPD-I-DCL, detached process scripting
|%HTTPD-I-DCL, cleanup detached script process; 00015002 HTTP$NOBODY '/conan+5002'
|%HTTPD-I-DCL, cleanup detached script process; 0001A803 HTTP$NOBODY '/conan+A803'
|%HTTPD-I-DCL, cleanup detached script process; 001F0407 HTTP$NOBODY '/conan+0407'
|%HTTPD-I-DCL, cleanup detached script process; 0000E00B HTTP$NOBODY '/conan+E00B'
|%HTTPD-I-DCL, cleanup detached script process; 0001260F HTTP$NOBODY '/conan+260F'
|%HTTPD-I-DCL, cleanup detached script process; 001F1E15 HTTP$NOBODY '/conan+1E15'
8< 12 /conan script cleanups snipped 8<
|%HTTPD-I-DCL, cleanup detached script process; 0001223D HTTP$NOBODY '/conan+223D'
|%HTTPD-I-DCL, cleanup detached script process; 0002567F HTTP$NOBODY '/conan+567F'
|%HTTPD-I-DCL, cleanup detached script process; 00016E90 HTTP$NOBODY '/help+6E90'
|%HTTPD-I-DCL, cleanup detached script process; 000144B3 HTTP$NOBODY '/HyperRead+44B3'
|%HTTPD-I-DCL, cleanup detached script process; 001DA8B5 HTTP$NOBODY 'wuCME-active'
|%HTTPD-I-DCL, cleanup detached script process; 000074D9 HTTP$NOBODY '/conan+74D9'
|%HTTPD-I-DCL, cleanup detached script process; 0000AAEC HTTP$NOBODY '/conan+AAEC'
|%HTTPD-I-DCL, cleanup detached script process; 00020CF7 HTTP$NOBODY '/conan+0CF7'
|%HTTPD-I-DCL, cleanup detached script process; 0001FD04 HTTP$NOBODY '/soymail+FD04'
|%HTTPD-I-DCL, cleanup detached script process; 0001F119 HTTP$NOBODY '/conan+F119'
8< 12 more /conan script cleanups snipped 8<
|%HTTPD-I-DCL, cleanup detached script process; 00017329 HTTP$NOBODY '/conan+7329'
|%HTTPD-I-DCL, cleanup detached script process; 00014F2A HTTP$NOBODY '/conan+4F2A'
|%HTTPD-I-DCL, cleanup detached script process; 001F332B HTTP$NOBODY '/conan+332B'
|%HTTPD-I-DCL, persona enabled at command line
Had the Server Admin / DCL Report been checked just before the spit, those
towards the start of the cleanup list would have had multiple (probably many)
CGIplus hits, while those a little further down likely one or two, as a barrage
of requests for Conan was in progress (at least 37 concurrent). Nothing is
essentially wrong with that, but serialising access would proactively help
manage the resources, adding only minor latency to some responses.
Bumped server process quotas꙳꙳꙳ on the obvious ASTLM and BYTLM.
꙳꙳꙳ While process quota alerts are displayed on the Server Admin panel,
    and a lack of ASTLM becomes obvious when the server has no way to recover,
    server process logs can simply be searched for alert messages:
$ SEARCH WASD_SERVER_LOGS:*.LOG /WINDOW=(1,0) /SINCE=01-JAN EXQUOTA
Introduced control over selected script activations using the 'throttle'
rule. Throttle was developed way-back-when resources were far slower and
more precious than today, a quarter century ago, for a European logistics
company (per JFP and JC). The purpose is to limit the number of requests
allowed to be processed concurrently, queuing those exceeding that number,
which are then processed as each progresses to the head of the queue.
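As a mental model (a toy sketch only, not WASD's implementation), the
mechanism amounts to a concurrency limit plus a FIFO queue:

```python
from collections import deque

class Throttle:
    """Toy FIFO throttle: at most `limit` requests process concurrently;
    excess requests wait in a queue and begin in arrival order."""
    def __init__(self, limit):
        self.limit = limit
        self.active = 0          # requests currently processing
        self.queue = deque()     # requests waiting their turn
        self.queued_total = 0    # every request that has ever queued

    def arrive(self, request):
        if self.active < self.limit:
            self.active += 1
            return "processing"          # begins immediately
        self.queue.append(request)
        self.queued_total += 1
        return "queued"                  # waits at the tail of the FIFO

    def finish(self):
        # A processing slot frees; the head of the queue (if any) takes it.
        if self.queue:
            return self.queue.popleft()  # this request now begins processing
        self.active -= 1
        return None
```

With a limit of 3, a burst of five requests sees three processed immediately
and two queued; each completion then promotes the head of the queue.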
This example concerns reining in instances of the Conan script
script+ /conan* /cgi-bin/conan* throttle=3
script+ /help* /cgi-bin/conan* throttle=2
but can also be used to regulate access to non-script resources
pass /* /web/* throttle=10
pass /web/* /web/* throttle=10
And of course, if a crawler can be identified, a throttle rule can be
selectively applied
if (user-agent:*OAI-SearchBot*) set * throttle=3
The Server Admin / Throttle Report provides detailed statistics against each
rule. While preparing this article, I happened upon a period when the
throttle was active. See attachments 'throttle.png' and 'activity.png'.
Throttle Report (at ~100 hours up-time) shows Total (rule applications) of
22,311 & 14,132, with Queued Total of 12,849 & 107, Queued Cur(rent) of
★28★ & 0, Queued Max(imum) of 41 & 24, with the expected Processing
Max(imum) of 3 & 2, along with other metrics. The /conan Queued FIFO of
12,821 plus the Queued Cur(rent) of 28 equals the Queued Total (so the stats
hold together). The Activity Report corroborates the concurrent requests, as
well as a recent bump in activity.
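That cross-check is simple arithmetic (figures transcribed from the report
above for the /conan rule):

```python
# Figures from the Throttle Report at ~100 hours up-time (/conan rule).
queued_fifo = 12_821    # requests that waited in the FIFO and have progressed
queued_cur = 28         # requests still waiting at sample time
queued_total = 12_849   # every request that was ever queued

# Total queued = those that passed through the FIFO + those still in it.
assert queued_fifo + queued_cur == queued_total
```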
Instantiated /conan scripts number 3 when throttled; without, at least 28+3‼️
There are variants on the basic rule that allow busy responses rather than
further queueing, timeouts on queuing, and the like. These are explained in
detail at
https://wasd.vsm.com.au/wasd_root/wasdoc/config/#requestthrottling
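To illustrate the busy-response variant (again a toy sketch; see the linked
documentation for the actual rule parameters and syntax), a cap on queue
length turns further arrivals into immediate busy responses:

```python
from collections import deque

class ThrottleWithBusy:
    """Toy variant: at most `limit` concurrent, at most `queue_max` queued,
    anything beyond that answered immediately with a 503 busy response."""
    def __init__(self, limit, queue_max):
        self.limit = limit
        self.queue_max = queue_max
        self.active = 0
        self.queue = deque()

    def arrive(self, request):
        if self.active < self.limit:
            self.active += 1
            return "processing"
        if len(self.queue) < self.queue_max:
            self.queue.append(request)
            return "queued"
        return "503 busy"    # queue full: reject rather than queue further
```

A crawler hammering the site fills the queue and then receives only cheap
503s, while interactive users already queued still get served in order.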
PS. If the Internet-facing site is being ravaged by creepy-crawlers and there
is no real need to have all your resources available externally, the simpler
solution may be to make visible only those necessary.
if (remote-addr:192.168.1.0/24)
script+ /conan* /cgi-bin/conan*
script+ /help* /cgi-bin/conan*
endif
pass /* /web/* throttle=10
This item is one of a collection at
https://wasd.vsm.com.au/other/#occasional