Word - Missing or wrong characters

WARNING: This is a highly technical topic relating to strange characters which may appear in DBT after importing a Word file. You will require an understanding of Unicode and how Windows characters are coded.

If an imported Word document contains strange characters, we advise re-importing the Word file after first setting "Unknown characters:" to "Output Unicode value" in DBT's Global: Word Importer dialog (shown below).

Global: Word Importer dialog.

After importing the file, any character that DBT does not recognize may appear as follows:

Chapter One (U:25b6) Introduction

In this example, the Word file contains a bullet-type character called "Black right-pointing triangle". With over 40,000 possible Unicode characters, in a table that is constantly being extended, it is impossible to map all of them to braille in DBT.
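To identify a placeholder such as (U:25b6), the hexadecimal value can be looked up in the Unicode character database. The following is a minimal Python sketch, not part of DBT. It assumes the placeholder always has the form (U:xxxx) with four hexadecimal digits, as in the example above; the function name describe_placeholders is hypothetical.

```python
import re
import unicodedata

# Assumed placeholder form "(U:xxxx)", taken from the single example above.
PLACEHOLDER = re.compile(r"\(U:([0-9A-Fa-f]{4})\)")


def describe_placeholders(text):
    """Return (hex value, Unicode name) for each placeholder found in text."""
    results = []
    for match in PLACEHOLDER.finditer(text):
        code_point = int(match.group(1), 16)
        # Unassigned or unnamed code points are reported as "UNKNOWN".
        name = unicodedata.name(chr(code_point), "UNKNOWN")
        results.append((match.group(1).lower(), name))
    return results


print(describe_placeholders("Chapter One (U:25b6) Introduction"))
```

The name that is reported tells you which character needs a mapping line in wrduni.txt.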

The following explains how such unusual characters may be mapped to braille characters.

WRDUNI.TXT FILE FORMAT

Duxbury Systems, Inc.
November 30, 2001

PURPOSE

The wrduni.txt file controls the mapping of Unicode and other specialized font characters into DUSCI characters when the Duxbury Braille Translator (DBT) imports a Microsoft Word file.

Unicode is an international character encoding standard; see http://www.unicode.org on the World Wide Web for details.

Within the Windows system, specially encoded single-byte fonts may also be used instead of double-byte Unicode for characters that cannot be expressed within the Windows standard single-byte (Latin-1) font.

DUSCI is the internal multi-byte character encoding standard that is used within DBT. It is based upon Unicode but the encoding method is different. Whereas Unicode characters are always two bytes in length, DUSCI characters may be 1 or 2 bytes in length (and may theoretically be extended to 3 or more bytes if necessary to accommodate more characters in the future). The complete listing of currently assigned DUSCI character codes, together with the corresponding Unicode code values, is given in the "Character List" document, under "Help" in DBT.

GENERAL CHARACTERISTICS AND ORGANIZATION

The wrduni.txt file itself is a simple "ASCII text file." WordPad, or any other editor that can edit plain-text files, can be used to edit the file. When finished, be sure to save it back as plain text, not in the format of WordPad, Word, or any other word-processor program.

The file consists of a set of "sections," each corresponding either to the first byte of the Unicode value being mapped or to a set of special font names.

In the first case, that is when mapping Unicode values, the section is headed by a line containing an asterisk and the initial byte value in hexadecimal, for example:

*1e

precedes the line(s) detailing the mappings for all Unicode values whose first byte has hexadecimal value 1e.

In the second case, that is when mapping special single-byte font values, the section is headed by a line containing "*00:" and then a list of the font names that follow the same mapping. If there is more than one font name, they are separated by vertical bars (|), for example:

*00: Afallon|Cwrwgl|Heledd|Padarn|Teifryn

would head a section detailing mappings for certain Welsh fonts -- namely Afallon, Cwrwgl, Heledd, Padarn, and Teifryn. Note that the font name(s) must be spelled exactly as they appear in the system font list, including capitalization and any punctuation that is part of the name.

Following the last section only, there should be a line containing just a single asterisk; this line marks the end of the file.
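The three kinds of header line described above (Unicode section, font section, and end-of-file marker) can be told apart mechanically. The following is a minimal Python sketch, not DBT's own code. The function name parse_header and the returned tuples are hypothetical, and trimming spaces around each font name is an assumption based on the example heading line.

```python
def parse_header(line):
    """Classify a wrduni.txt header line.

    Returns ("end", None), ("unicode", first_byte) or ("fonts", [names]).
    """
    line = line.rstrip("\r\n")
    if not line.startswith("*"):
        raise ValueError("not a header line")
    body = line[1:]
    if body.strip() == "":
        # A lone asterisk marks the end of the file.
        return ("end", None)
    if body.startswith("00:"):
        # Font section: names separated by vertical bars.
        # Assumption: spaces around each name are not significant.
        names = [name.strip() for name in body[3:].split("|")]
        return ("fonts", names)
    # Unicode section: the first byte, in hexadecimal.
    return ("unicode", int(body.strip(), 16))


print(parse_header("*1e"))
print(parse_header("*00: Afallon|Cwrwgl|Heledd|Padarn|Teifryn"))
print(parse_header("*"))
```

Note that a plain "*00" header (no colon) is treated as the Unicode section for first byte 00, while "*00:" introduces a font section.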

CHARACTER MAPPING LINES

Each line within a section gives the mapping for a single imported character. The mapping may yield one or several characters in DUSCI.

In the case of Unicode characters, the line begins with the value of the second byte of the imported character, in hexadecimal, followed by a colon and a space. Recall that the value of the first byte is given by the header line for the current section.

The mapped-to value or values then follow, either by giving the character(s) directly (if such characters are ASCII characters other than a vertical bar [|]) or by giving the code sequence expressed as three-digit decimal values each preceded by a vertical bar.
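As an illustration of the line format just described, the following Python sketch splits a mapping line into its hexadecimal code value and the DUSCI byte sequence it denotes. It is not DBT's own parser. The function name parse_mapping_line is hypothetical, and allowing direct ASCII characters and vertical-bar codes to be mixed in one value is an assumption.

```python
def parse_mapping_line(line):
    """Split a mapping line into (hex code value, DUSCI byte sequence)."""
    # Leading indentation is ignored; trailing spaces are kept, since a
    # space could in principle be part of the mapped-to value.
    prefix, sep, value = line.rstrip("\r\n").lstrip().partition(": ")
    if not sep:
        raise ValueError("expected '<hex>: <value>'")
    code = int(prefix, 16)

    out = bytearray()
    i = 0
    while i < len(value):
        if value[i] == "|":
            # A vertical bar introduces a three-digit decimal byte value.
            digits = value[i + 1:i + 4]
            if len(digits) != 3 or any(c not in "0123456789" for c in digits):
                raise ValueError("expected three decimal digits after '|'")
            number = int(digits)
            if number > 255:
                raise ValueError("byte value out of range")
            out.append(number)
            i += 4
        else:
            # Any other ASCII character stands for itself.
            if ord(value[i]) > 127:
                raise ValueError("only ASCII characters may be given directly")
            out.append(ord(value[i]))
            i += 1
    return code, bytes(out)


print(parse_mapping_line("a3: |245|035"))
```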

In the case of special single-byte font characters, the line begins with the hexadecimal code value, a colon and a space. The mapped-to value(s) are then expressed in the same manner as for Unicode characters. Note that any unmapped characters are treated the same as if they were in the Windows standard (Latin-1) font, which corresponds to the first page (i.e. section "*00") of Unicode. That means it is necessary only to map those characters that are encoded differently from the standard font.

Some examples of detail mapping lines follow:

  1. In Unicode, hexadecimal value 00c7 corresponds to the Latin capital C with cedilla. That character in DUSCI is encoded as a single byte, decimal value 128. The appropriate mapping line is
    c7: |128

    within the "*00" section.
  2. In Unicode, hexadecimal value 00a3 corresponds to the British pound-sterling currency sign. That character in DUSCI is encoded as a two-byte sequence, decimal values 245 and 35 respectively. The appropriate mapping line is
    a3: |245|035

    within the "*00" section.
  3. In Unicode, Greek capital gamma is encoded with hexadecimal value 0393. That character in DUSCI is a two-byte sequence, decimal values 226 and 67 respectively. The appropriate mapping line is
    93: |226|067

    within the "*03" section.
  4. In the Welsh "Afallon" font, small w with dieresis is encoded with hexadecimal value be. That character in DUSCI is a two-byte sequence, decimal values 185 and 53 respectively. The appropriate mapping line is
    be: |185|053

    within a "*00: ..." section listing the Afallon font (see example of heading line given above).
  5. In the Vietnamese "VNI-Times" font, the code value e4 is for a "combining" character including a circumflex accent and a dot below (tone mark). Those are two separate combining characters in DUSCI, each with two bytes -- decimal values 227 and 50 for the first, 227 and 83 for the second. The appropriate mapping line is
    e4: |227|050|227|083

    within a "*00: ..." section listing the VNI-Times font.
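Going the other way, the section header and mapping line for a Unicode character can be generated from its code value and its DUSCI bytes (taken from the "Character List" document). The following is a minimal Python sketch under stated assumptions: the function name mapping_entry is hypothetical, lowercase hexadecimal is used because the examples above use it, and every byte is written in the vertical-bar decimal form even where a direct ASCII character would also be allowed.

```python
def mapping_entry(code_point, dusci_bytes):
    """Return (section header, mapping line) for a two-byte Unicode value."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only two-byte Unicode values are covered here")
    if any(not 0 <= b <= 255 for b in dusci_bytes):
        raise ValueError("DUSCI byte value out of range")
    # The first byte selects the section; the second byte starts the line.
    first_byte, second_byte = divmod(code_point, 0x100)
    value = "".join(f"|{b:03d}" for b in dusci_bytes)
    return f"*{first_byte:02x}", f"{second_byte:02x}: {value}"


# Examples 1 to 3 above, regenerated from their code values.
print(mapping_entry(0x00C7, [128]))
print(mapping_entry(0x00A3, [245, 35]))
print(mapping_entry(0x0393, [226, 67]))
```

For special single-byte fonts, only the mapping line applies; the section header is the "*00: ..." line listing the font names.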