SCRUB - Simple Character String Replacer Documentation

First Release: December 1988

Copyright 1988-1993 Duxbury Systems, Inc.

This program and its documentation, that is specifically the files SCRUB.EXE, SCRUB.ERM, SCRUB.EHL and SCRUB.DOC, may be distributed freely and used for any legal purpose, provided it is not altered in any way nor its notices, including this notice, obscured from view.

Duxbury Systems, Inc. of Westford, Massachusetts, USA designs and manufactures software related to braille. We can be reached at 978-692-3000.

This documentation applies to SCRUB version 1.7, and possibly to later versions.

PURPOSE

SCRUB provides a simple means to apply a set of substitution rules throughout a file. The rules themselves reside in a file, so that similar rules may be applied to many files with speed, accuracy and convenience.

Typical applications would include converting word-processing format codes to other codings, converting specialized file formats to common ASCII text file format, and converting generic "boilerplate" text for use in a particular instance. SCRUB may also be used to add text to the beginning or end of a file.

USAGE

The user must first prepare a "substitution-list-file." This is an ordinary ASCII text file, wherein each line contains a substitution rule, usually of the simple form:

search-string~replacement-string~

For example, the file with exactly the two lines:

abc~x~

xy~product of x and y~

would cause every instance of "abc" to be turned into "x", while every "xy" would turn into "product of x and y". For example, a file with the contents:

No more abc's; let us consider xy.

xyabcd XYABCD

would be rewritten (in a separate file) as:

No more x's; let us consider product of x and y.

product of x and yxd XYABCD

Note that the substitution is always case-sensitive; that is why neither XY nor ABC in XYABCD was changed.
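
The behavior shown in this example can be imitated in a few lines of Python. This is only an illustrative sketch, not SCRUB's actual implementation: the function name apply_rules is hypothetical, rules are given as already-separated (search, replacement) pairs rather than read from a substitution-list file, and it assumes a single left-to-right pass in which longer search strings are tried first and replacement text is never re-scanned.

```python
def apply_rules(text, rules):
    """Apply (search, replacement) pairs to text in one left-to-right pass.

    Hypothetical helper for illustration only; search strings must be non-empty.
    """
    # Try longer search strings first, so the result does not depend on rule order.
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    out = []
    i = 0
    while i < len(text):
        for search, replacement in ordered:
            if text.startswith(search, i):  # case-sensitive comparison
                out.append(replacement)     # replacement text is not re-scanned
                i += len(search)
                break
        else:
            out.append(text[i])             # no rule applies here; copy the character
            i += 1
    return "".join(out)


rules = [("abc", "x"), ("xy", "product of x and y")]
print(apply_rules("No more abc's; let us consider xy.", rules))
print(apply_rules("xyabcd XYABCD", rules))
```

Run on the two input lines of the example, the sketch prints the same two output lines given above.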

SCRUB can be run in any of several ways (modes), as in the following examples:

(1) Inquiry mode:

SCRUB ?

would cause SCRUB to give a brief synopsis of usage and quit.

(2) Command line mode:

SCRUB : ~ infile.txt ~

would cause SCRUB to prompt for the name of the substitution table and for the name of an output file, and then to process INFILE.TXT to that output file according to the table's rules.

(3) Response-file mode:

SCRUB @ answers.txt

would cause SCRUB to read the lines of ANSWERS.TXT as the responses to its prompts, and process accordingly.

(4) Fully-prompted mode:

SCRUB

will cause prompting for all parameters.

The required parameters, in order, are the substitution-list file name, and the names of the input and output files.

The following details regarding rule syntax and application should be noted:

1. The replacement (produced) strings are NOT re-scanned for further possible replacements. Thus, if the second rule in the example above had been:

xy~product of abc and y~

then the output would have been:

No more x's; let us consider product of abc and y.

product of abc and yxd XYABCD

That is, the two "abc" strings arising from the "xy" substitutions would not, in turn, be turned into "x" strings.

2. Longer search strings take precedence over shorter ones. For example, under the two rules

abc~x~

a~pq~

"abcd" would come out "xd" whereas "abdc" would come out "pqbdc". In other words, the shorter search string would be used only if a longer one did not apply. This property does not depend on the order that the rules are stated; the result would be the same if the rules were given in the opposite order.

3. A nil string may not be used as a search string, but may be used for a replacement (which has the effect of deleting instances of the search string).

4. The tilde (~) character is used in rules to mark the boundary between the string to be searched for and the replacement string, and also to mark the end of the replacement string. All the printing ASCII characters, including space but excluding vertical bar (|) and tilde (~), may appear as themselves in the strings. If one of those two characters must be part of one of the strings, it may be "quoted" by a prior vertical bar, i.e. tilde may be represented by "|~" (without the quotes) and vertical bar itself by "||". Finally, any character whatsoever may be entered as the four-character sequence "|nnn", where nnn is a decimal number between 000 and 255, expressing the character code value.

For example, the rule:

|009~|~|013|010~

would cause all ASCII "tab" characters (code 9) to be converted to tilde followed by a carriage return and line feed, the latter two defining the line-end sequence typical of ASCII text files on MSDOS and certain other operating systems.

5. To add the replacement string to the beginning of the output file, use |256 as the search string. To add the replacement string to the end of the output file, use |257 as the search string.

For example, the rule:

|256~This line belongs at the top~

would place the string "This line belongs at the top" at the beginning of the output file.

6. A vertical bar followed by a space causes the rest of a rule line to be ignored, allowing for comments, e.g.:

|009~|~|013|010~ | [tab] -> ~[line end]

and in particular a line starting with vertical bar and space is treated entirely as a comment, i.e. ignored.

7. Quite a large number of rules may be given, limited only by the memory available for holding rules. At present, search and replacement strings must not be longer than 4096 characters.
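
The quoting and comment conventions of items 4 through 6 can be made concrete with a small Python sketch of a rule-line reader. It is offered only as an illustration under stated assumptions, not as a description of how SCRUB itself parses its tables: the function name parse_rule_line is hypothetical, each field is returned as a list of integer character codes (so that the special codes 256 and 257 can be represented), any text left after the final tilde is simply discarded, and a vertical bar followed by anything other than a space, a tilde, another vertical bar or three digits is treated as an error.

```python
TILDE, BAR = "~", "|"


def parse_rule_line(line):
    """Split one rule line into fields, each a list of integer character codes.

    Hypothetical helper for illustration. Returns None for a blank line or a
    line that is entirely a comment.
    """
    fields, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == BAR:
            nxt = line[i + 1:i + 2]
            digits = line[i + 1:i + 4]
            if nxt == " ":
                break                          # "| " starts a comment (item 6)
            if nxt in (BAR, TILDE):
                current.append(ord(nxt))       # "||" and "|~" quote the character
                i += 2
            elif (len(digits) == 3 and digits.isascii()
                  and digits.isdigit() and int(digits) <= 257):
                current.append(int(digits))    # "|nnn"; 256 and 257 are items 5's codes
                i += 4
            else:
                raise ValueError(f"bad escape at column {i + 1}: {line!r}")
        elif ch == TILDE:
            fields.append(current)             # an unquoted tilde ends the field
            current = []
            i += 1
        else:
            current.append(ord(ch))            # ordinary character stands for itself
            i += 1
    return fields or None


print(parse_rule_line("|009~|~|013|010~ | [tab] -> ~[line end]"))
# prints [[9], [126, 13, 10]]
```

The first field is the search string (a tab, code 9) and the second is the replacement (tilde, carriage return, line feed), matching the rule discussed under item 4.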

IMPOSING STATE CONDITIONS (ADVANCED USAGE)

Sometimes it is desirable to limit some of the substitutions to certain parts of a text, while others may apply globally. In such a case, it may be possible to think in terms of "states" that the substitution process assumes as it traverses the input text. For example, let us assume that the problem is to substitute "x-variable" for the "x" in all the comments of a C program. In such a program, comments always begin with the two-character sequence "/*" and end with "*/". To keep this example simple, we will further assume that there are no other cases where these character combinations could occur. Then, we could define the problem roughly as follows:

State 1 means "not in a comment"

State 2 means "in a comment"

In state 2 (only), "x" should be replaced by "x-variable"

The initial state is 1

When "/*" is encountered, change to state 2

When "*/" is encountered, change to state 1

To do this, we would make use of one of the more advanced rule forms, i.e. one of:

search-string~replacement-string~condition-state~

search-string~replacement-string~condition-state~next-state~

where the third and fourth fields contain either 0 (or equivalently nil) or a state number in the range 1 through 32767. The application rules for such forms are as follows:

Returning to the problem posed above, a substitution list (or "SCRUB table") that would implement our requirements would be as follows. Note how, to make the table comprehensible to human readers (including the original author at a later date), comments are used to give a name and version to the table, to describe the "meaning" of the states, and to explain the reasons for individual rules. Such practices are strongly recommended for tables that are intended for usage over an extended time:

| state 1 - initial & outside of comments

| state 2 - inside of comments

/*~/*~~2~	| start comment

*/~*/~~1~	| end comment

x~x-variable~2~	| "x"->"x-variable" in comments only
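
To show how such a table might behave, here is a Python sketch of a state-conditioned substitution pass. It rests on assumptions that go beyond what is stated here: that processing starts in state 1, that a rule with a condition-state of 0 (or nil) applies in every state, that a rule with a non-zero condition-state applies only while that state is current, and that a non-zero next-state becomes the current state once the rule has been applied. The function name apply_state_rules is hypothetical, and the rules are given as already-parsed tuples rather than read from a table file.

```python
def apply_state_rules(text, rules, state=1):
    """Apply (search, replacement, condition_state, next_state) rules.

    Illustrative only. A condition_state of 0 is assumed to mean "any state",
    and a next_state of 0 is assumed to mean "leave the state unchanged".
    """
    # Longer search strings are tried first, as for the simple rule form.
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    out = []
    i = 0
    while i < len(text):
        for search, replacement, condition, next_state in ordered:
            if condition in (0, state) and text.startswith(search, i):
                out.append(replacement)
                i += len(search)
                if next_state:
                    state = next_state      # assumed state transition
                break
        else:
            out.append(text[i])             # no applicable rule; copy the character
            i += 1
    return "".join(out)


table = [
    ("/*", "/*", 0, 2),          # start comment
    ("*/", "*/", 0, 1),          # end comment
    ("x", "x-variable", 2, 0),   # "x" -> "x-variable" in comments only
]
print(apply_state_rules("x = 1; /* set x */ x++;", table))
# prints x = 1; /* set x-variable */ x++;
```

Only the "x" inside the comment is replaced; the occurrences before and after it are met in state 1, where the third rule does not apply.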