A tagged file is an ASCII file containing commands like <xxx> (a start command) and </xxx> (an end command). The most common example of a tagged file is an HTML file. Here is an example of a tagged file:
<h1>A Sample File</h1>
<p>Here is my grocery list:
<ul>
<li>potatoes
<li>milk
<li>orange juice
</ul>
In this example <h1> means heading level 1, </h1> means end the heading, <p> means a paragraph of ordinary (body) text, <ul> means a list follows, <li> means a list item, and </ul> means the end of a list.
MegaDots imports HTML files with these kinds of tags.
One of the key concepts about tagged files is that the tags define the structure of the document. The <h1> tag does not specify a font size or any other visual attribute. That is the job of the rendering software (often the browser).
HTML is the basis for the World Wide Web. Because HTML is written informally by all kinds of users, the syntax of the commands is often sloppy. In addition, many tags are used to obtain certain visual effects when rendered in a browser, ignoring the structure implied by those tags. Both problems complicate the task of preparing braille from HTML tagged files.
SGML is a set of rules for mathematically defining a set of tags. The rules for SGML are so complex that writing software for SGML tagging systems is very difficult. Further, SGML files do not lend themselves to clarifying the structure of a document. Every SGML file has a special file called the DTD (Document Type Definition), which lists the tags and defines the legal ways they can be used.
XML is a set of rules for mathematically defining a set of tags. The rules for XML are designed to eliminate the problems of SGML. Software tools are easier to write for XML, and XML lends itself to defining structure.
Daisy and NISO are two efforts to define the tags for documents intended to be accessible to the blind. NIMAS is the current effort. NIMAS stands for National Instructional Materials Accessibility Standard. It is part of the IDEA federal Special Education legislation.
Duxbury Systems is committed to making MegaDots and
DBT import NIMAS files using our best practices. Users need to know that
this standard is new, and that we all will be learning more about how
these areas will be used in the field. Keeping up to date with these
developments is a good reason to keep your software up to date. Please
check the Duxbury Systems website from time to time to see if there are
updates to nimas.msg or msgfile.exe to improve
how NIMAS files are imported into MegaDots.
In NIMAS files, non-ASCII characters are encoded in UTF-8. MegaDots can decode the UTF-8 encodings. See below for more information on UTF-8.
MegaDots has a program called MSGFILE.EXE
to deal with many forms of tagged files. If you encounter a file with
tags radically different from those used in HTML files, you can still
import it if you are willing to learn about the tags and can
figure out what styles and attributes you want them to have in
MegaDots.
The first application of this approach has been to import SGML files used by the IRS for their internal documentation. In a short period of time I was able to figure out what the tags were, what they meant, and what MegaDots styles to change them into.
MSGFILE is a program you can run at the DOS command line. There are three parameters: the source data file, the target file, and an MSG file. An MSG file is a kind of rules file: it tells MSGFILE what to do with the tags. If you add a fourth parameter (it does not matter what it is), then unknown tags are saved as hidden text.
The output file is an ASCII file which MegaDots will recognize as one containing MegaDots commands. If you export a MegaDots file to "ASCII line" with MegaDots markup, you get a file with a layout similar to that of the output file from MSGFILE.
The MSGFILE program ignores the DTD in your file. Changing the DTD will not affect how MSGFILE reads the data.
For example, suppose you have a file called
JOHN.HTM. You can import it into MegaDots by typing
MEGA JOHN.HTM <Enter>. Or you can import it with the
following two commands (as long as you copy HTML.MSG into the
current directory):
MSGFILE JOHN.HTM JOHN.TMP HTML.MSG <Enter>
MEGA JOHN.TMP <Enter>
The first line creates a MegaDots marked-up ASCII file
called JOHN.TMP. The second line imports the temporary file.
HTML.MSG
Take a look at html.msg and/or
nimas.msg in your MegaDots directory. These files control the
importing of HTML files and NIMAS files into MegaDots. If you change the
HTML.MSG file, you will change how MegaDots imports HTML
files. HTML.MSG is an ASCII text file, and it can be examined,
edited, and changed with any ASCII text editor.
The first line in the file contains the long name of the file type. The next line usually says "Style continuation: yes". The third line is blank.
The rest of the file is in a three column format. You need at least 2 spaces to separate the columns.
The first column is the tag (case is not important). The second column gives the MegaDots command. Here case is important. The third column comes from a restricted list of phrases that describes the kind of MegaDots operation.
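The layout described above can be read with a few lines of Python. This is only a minimal sketch of the format as this document describes it, not the logic MSGFILE itself uses. The names parse_msg and MsgRule are hypothetical, and the sample's long name is invented for illustration; the one rule row is the &reallybigdash; substitution this document describes later.

```python
import re
from typing import NamedTuple


class MsgRule(NamedTuple):
    tag: str      # column 1: the tag, lowercased because case is not important
    command: str  # column 2: the MegaDots command, case preserved
    kind: str     # column 3: phrase describing the kind of operation


def parse_msg(text: str):
    """Split MSG-style text into (long_name, style_line, rules)."""
    lines = text.splitlines()
    long_name = lines[0].strip() if lines else ""
    style_line = lines[1].strip() if len(lines) > 1 else ""
    rules = []
    # The third line is blank; the three-column rules start after it.
    for line in lines[3:]:
        if not line.strip():
            continue
        # Columns are separated by runs of two or more spaces.
        columns = re.split(r" {2,}", line.strip())
        if len(columns) != 3:
            raise ValueError(f"expected 3 columns, got {len(columns)}: {line!r}")
        tag, command, kind = columns
        rules.append(MsgRule(tag.lower(), command, kind))
    return long_name, style_line, rules


sample = (
    "Sample tagged file\n"         # hypothetical long name
    "Style continuation: yes\n"
    "\n"
    "&reallybigdash;  --  text\n"  # substitution row described in this document
)
name, style, rules = parse_msg(sample)
print(name)
print(style)
print(rules)
```

Splitting on two or more spaces means a column may itself contain single spaces, which matches the "at least 2 spaces" rule above.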
Unusual MSG Commands
A long time ago, I wrote a program called TEXTCHK. It was designed to help people write and read ICADD files (crude tagged files generated by book publishers primarily to meet the requirements of the state of Texas). This program can diagnose problems with tags. It can also fix quite a few problems.
TEXTCHK is used to help import ICADD files into MegaDots. It is a virtually undocumented program included with your copy of MegaDots. It is very handy when faced with a new tagged file and you need to understand it.
Just type TEXTCHK input output
<Enter>. Here the input is a new file you need to analyze.
The output is a report about the tag usage. You get a lot of useless
information in the report. But you do get a tag census, a list of the tags
and the number of times they show up in the text. The report will divide
tags into "legal" and "illegal". Ignore this distinction. The program was
written to look at how well a file measured up to the ICADD standard. We
are interested in all the tags, we do not care if a tag is in the ICADD
list or not.
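If TEXTCHK is not at hand, a similar tag census can be sketched in Python. This is not how TEXTCHK works internally; it is a rough illustration that assumes tags look like <name ...> or </name>, and the function name tag_census is hypothetical. The sample text is the grocery list from the start of this document.

```python
import re
from collections import Counter

# A start or end tag: "<", optional "/", a name, optional attributes, ">".
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def tag_census(text: str) -> Counter:
    """Count how often each start tag and end tag appears in the text."""
    counts = Counter()
    for slash, name in TAG_PATTERN.findall(text):
        # Tag case is not important, so fold names to lowercase.
        counts[f"<{slash}{name.lower()}>"] += 1
    return counts


sample = """<h1>A Sample File</h1>
<p>Here is my grocery list:
<ul>
<li>potatoes
<li>milk
<li>orange juice
</ul>"""

for tag, count in sorted(tag_census(sample).items()):
    print(f"{tag:8} {count}")
```

The census lists every tag it finds, with no notion of "legal" or "illegal", which is all that is needed when building a new .MSG file.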
Once you have a list of tags, you can search for them
in the original tagged file to figure out what they mean. I have found
that I can analyze a half-megabyte file in an hour or two and build an
.MSG file for it.
Many files contain markup like &aacute; for "an a
acute". MSGFILE.EXE knows about this markup and can handle it
properly for MegaDots. This handling is hard coded, and you do not have any way of
adjusting it. If you have &xxx; markup which is not handled
correctly, contact David Holladay.
Let's say you have a file that contains
&reallybigdash; (a code not recognized by the MSGFILE program).
Let's say you want this to be a dash, which in MegaDots is represented by a
double hyphen. Create a line in the .MSG file that has the
following three columns, separated by at least two spaces: &reallybigdash;  --  text (the third column
indicates that the code &reallybigdash; is being replaced with the
double hyphen).
If you have a tag that you do not do anything to in
the .MSG file, it will show up in the "output" file. What
happens to that tag in MegaDots? It is either thrown away, or it is kept
as "hidden text". What is hidden text? It is text enclosed with the
emphasis of "hidden". To create some hidden text, mark text as a block,
and issue the control-F H command. This is not in the menu, so the hidden
text is a hidden command within MegaDots. Hidden text does not show up in
WYSIWYG. It does show up in show markup. It does not show up in any
inkprint or braille output. It just lives there in the MegaDots file.
There is a very obscure question in the MegaDots preferences that influences how these tagged files are imported. Go to the "File Import" preferences, "default file". In that huge screen there is a question "excess emphasis". If you say "disallow", all unknown tags will be thrown away. If you say "allow", all unknown tags will be in the MegaDots file as hidden text.
If you export a MegaDots file with "hidden" tags to HTML, all the hidden tags are reconstructed as regular tags again. If you need to import tags, work with them in MegaDots, and then export them again, you will probably want to use this "hidden" feature of MegaDots.
UTF-8 is a system for encoding Unicode files in an 8-bit system. A UTF-8 file is just like a regular ASCII file, except for its unusual way of encoding accented letters and specialized typesetting characters.
MegaDots can read UTF-8 text files as long as they start with the UTF-8 prefix of "EF BB BF" (three bytes expressed in hexadecimal). MegaDots can now read HTML/XML files that use UTF-8 character encoding as long as the file is identified in a tag as using "UTF-8" encoding.
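To check a text file for the three-byte prefix before importing it, something like the following Python sketch can help. The function names are hypothetical; codecs.BOM_UTF8 is the standard library constant for the bytes EF BB BF.

```python
import codecs


def has_utf8_prefix(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_prefix(data: bytes) -> bytes:
    """Prepend the prefix unless it is already present."""
    return data if has_utf8_prefix(data) else codecs.BOM_UTF8 + data


plain = "caf\u00e9".encode("utf-8")  # UTF-8 bytes without the prefix
print(has_utf8_prefix(plain))
print(has_utf8_prefix(add_utf8_prefix(plain)))
```

Only add the prefix to data that is already valid UTF-8; the prefix labels the encoding, it does not convert anything.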
If MegaDots finds a character that it cannot identify, it will convert the character into hexadecimal and save the character in an identifiable sequence. For example, the Unicode character "0816" is imported into MegaDots as ~[0816] (notice the added brackets and tilde). You can write a rules file to convert these into sequences that make the correct braille.
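Before writing such a rules file, it helps to know which unidentified characters a document contains. The Python sketch below scans exported text for the ~[xxxx] sequences and counts them. It assumes the sequences hold four to six hexadecimal digits, which this document does not confirm beyond the single example, and the function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Assumed form of the sequence: "~[" + 4 to 6 hex digits + "]".
UNKNOWN_CHAR = re.compile(r"~\[([0-9A-Fa-f]{4,6})\]")


def unknown_characters(text: str) -> Counter:
    """Count each ~[xxxx] sequence, keyed by its uppercase hex code."""
    return Counter(code.upper() for code in UNKNOWN_CHAR.findall(text))


def describe(code: str) -> str:
    """Look up the Unicode name for a hex code, or 'unnamed' if none is known."""
    return unicodedata.name(chr(int(code, 16)), "unnamed")


sample = "first ~[0816] then ~[2014] and again ~[0816]"
for code, count in sorted(unknown_characters(sample).items()):
    print(code, count, describe(code))
```

The resulting list tells you which codes need a rule and how often each one occurs.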
UTF-8 can be a useful intermediate file format. For example, we were e-mailed a text file written in "code page 1250". We imported the file into Notepad, and found that Notepad was able to correctly work out the accent marks. Notepad is able to export to a UTF-8 text file with the usual 3-byte identifier. MegaDots is able to import the UTF-8 file Notepad exported.
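The same conversion can be scripted when Notepad is not convenient. This Python sketch assumes you already know the source file is in code page 1250; the function name is hypothetical. Python's "utf-8-sig" codec writes the three-byte identifier at the start of the output.

```python
def cp1250_to_utf8_with_prefix(data: bytes) -> bytes:
    """Decode code page 1250 bytes and re-encode as UTF-8 with the 3-byte prefix."""
    text = data.decode("cp1250")
    return text.encode("utf-8-sig")


original = "Dvo\u0159\u00e1k".encode("cp1250")  # "Dvořák" as code page 1250 bytes
converted = cp1250_to_utf8_with_prefix(original)
print(converted[:3].hex(" "))
print(converted.decode("utf-8-sig"))
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result in binary mode.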
This is how I would approach a project:
1. Read the existing .MSG files and this documentation to get the general idea.
2. Decide whether you want to modify HTML.MSG or start from scratch. If you want to start from scratch, make use of USER.MSG (see below).
3. Write the .MSG file for your project.
4. Make a trial run.
If you create a USER.MSG file, you can
force MegaDots to make use of it. Here are the different ways of forcing
MegaDots to import a file, making use of the USER.MSG file:
USER.MSG
conversion rules. Import the file normally. No other intervention is
needed.