The WASD Query facility provides real-time searching of plain-text and HTML documents. It performs a simple string search, not a GREP-style search, although the search string may contain the wildcard characters "*", matching zero or more characters, and "%", matching any single character. (When no wildcard characters delimit the search string it behaves as if "*" characters were present.) The search string may contain spaces. The facility is designed to provide a simple mechanism for locating documents containing a keyword, not for document analysis.
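As a rough illustration of the wildcard rules above, the following Python sketch translates a search string into a regular expression. The function name `query_matches`, the case-insensitive comparison, and the reading of "delimit" as "begins or ends with a wildcard" are assumptions made for this example, not details taken from the WASD implementation.

```python
import re

def query_matches(search: str, line: str) -> bool:
    """Return True if `line` satisfies the wildcard search string (hypothetical helper)."""
    # Assumption: an end of the search string that lacks a wildcard
    # behaves as if "*" were present there.
    if not search.startswith(("*", "%")):
        search = "*" + search
    if not search.endswith(("*", "%")):
        search = search + "*"

    parts = []
    for ch in search:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "%":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character, spaces included

    # Assumption: matching ignores case; the text does not say either way.
    return re.fullmatch("".join(parts), line, re.IGNORECASE) is not None
```

For example, `query_matches("web server", "The WASD Web Server package")` is true under these assumptions, because the implicit leading and trailing "*" turn the string into a substring match.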
Only files considered plain-text or HTML will be searched. Other files may be specified explicitly, or be selected by a wildcard file specification, but their contents will not actually be searched.
Plain-Text Search
A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first occurrence in a line is considered a hit. The line is not searched for any further occurrences.
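The one-hit-per-line rule can be sketched in a few lines of Python. This is only an illustration: `find_hits` is a hypothetical name, the keyword is treated as a plain string without wildcards, and the case-insensitive comparison is an assumption.

```python
def find_hits(lines, keyword):
    """Return (line_number, column) for the first occurrence on each matching line."""
    hits = []
    needle = keyword.lower()  # assumption: comparison ignores case
    for number, line in enumerate(lines, start=1):
        column = line.lower().find(needle)  # find() stops at the first occurrence
        if column != -1:
            # One hit per line: later occurrences on the same line are not counted.
            hits.append((number, column))
    return hits
```

A line containing the keyword twice therefore contributes a single hit to the total.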
Searches of plain text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around any actual hit). This allows the user to selectively extract a portion of a document, avoiding the need to explicitly scan through to the section of interest. The hit will be highlighted to make it more obvious.
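Partial retrieval of this kind amounts to slicing a window of lines around the hit. The sketch below is an assumption-laden illustration: the name `extract_around`, the `context` parameter, and the ">>" and "<<" highlight markers are all invented for the example and say nothing about how WASD actually marks a hit.

```python
def extract_around(lines, hit_line, keyword, context=2):
    """Return the lines within `context` lines of a hit, with the hit marked."""
    first = max(1, hit_line - context)            # clamp at the start of the file
    last = min(len(lines), hit_line + context)    # clamp at the end of the file
    excerpt = []
    for number in range(first, last + 1):
        text = lines[number - 1]
        if number == hit_line:
            start = text.find(keyword)
            if start != -1:
                end = start + len(keyword)
                # Hypothetical highlight markers around the first occurrence only.
                text = text[:start] + ">>" + text[start:end] + "<<" + text[end:]
        excerpt.append(text)
    return excerpt
```

Clamping the window keeps the excerpt valid when the hit is near the first or last line of the document.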
HTML Search
A search of an HTML file is a little more complex. As might be expected, only text presented as part of the document is searched; markup text is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment
<!-- an example HTML document --> <p> The document entitled <a href="example.html">"Example Document"</a> provides only an <i>overview</i> of the full capabilities of HTML.
only the following text would actually be searched
The document entitled "Example Document" provides only an overview of the full capabilities of HTML.
The HTML character entities "&amp;", "&gt;", "&lt;", "&nbsp;", "&quot;" and "&#nnn;" are converted to the representative character before matching.
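The extraction described above can be approximated with a small state machine that discards everything between "<" and ">" and then decodes character entities. This is a minimal sketch, not the WASD parser: `visible_text` is a hypothetical name, Python's standard `html.unescape` stands in for the entity conversion (it decodes more entities than the ones listed), and collapsing whitespace is an assumption made to keep the result tidy.

```python
import html

def visible_text(markup: str) -> str:
    """Return the text outside "<...>" constructs, with entities decoded."""
    kept = []
    inside_tag = False
    for ch in markup:
        if ch == "<":
            inside_tag = True        # start of a tag or comment
        elif ch == ">" and inside_tag:
            inside_tag = False       # end of the tag or comment
        elif not inside_tag:
            kept.append(ch)          # document text, kept for searching
    # Decode named and numeric entities such as &amp; and &#nnn;.
    text = html.unescape("".join(kept))
    # Assumption: runs of whitespace (including non-breaking spaces) are collapsed.
    return " ".join(text.split())
```

Applied to the example fragment, this yields the sentence beginning `The document entitled "Example Document" provides ...`, with the comment, the anchor, and the italic tags removed.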
The mechanism for partial document retrieval available with plain-text is not present with HTML documents. However, each hit is individually anchored when the document is returned, allowing the browser to "jump" directly to the section of the document containing the hit. Most browsers present this at the top of the window, unless the end of the document is close to the hit, in which case the hit will be somewhere between the top and bottom of the window. The hit is not highlighted in the same way as for plain-text documents.
HTML problem, unbalanced <>
Occasionally HTML search results report:
HTML problem, unbalanced <> in /whatever/path/to/file.html
This indicates that at end-of-file the search engine had encountered one more or one fewer "<" than ">" when parsing out the markup tags. The search results become a little suspect because it is then uncertain whether a given piece of text was inside a markup tag or part of the document text. The document should be investigated for errors.
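A comparable check is easy to run against a suspect document before investigating it by hand. The sketch below takes the description literally and simply compares the number of "<" and ">" characters at end-of-file; the function names are hypothetical, and the real search engine may detect the condition differently.

```python
def bracket_imbalance(markup: str) -> int:
    """Return the count of "<" minus the count of ">"; zero means balanced."""
    return markup.count("<") - markup.count(">")

def unbalanced_report(path: str, markup: str):
    """Return a message in the style shown above, or None if the counts agree."""
    if bracket_imbalance(markup) != 0:
        return f"HTML problem, unbalanced <> in {path}"
    return None
```

A positive result from `bracket_imbalance` suggests an unclosed tag, while a negative one suggests a stray ">" in the text, which should usually be written as the "&gt;" entity.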
Search Statistics
Appended to the page of search results are some statistics on how many files were searched, how many were not (if any were non-text or non-HTML), the number of hits, and some server system statistics.
135 files searched (40 not) with 79 files hit, for a total of 323 hits.
Elapsed: 00:06.95 CPU: 00:02.27 I/O: 467 Disk: 473 Records (lines): 55701