The Glyph Grid and the NORC Encoding

1 - The Glyph Grid

The Hershey glyphs are constructed of straight lines which connect coordinate positions on a grid. (These are implemented as zero or more "polylines" which connect a series of two or more coordinate positions.) Coordinates are decimal integers within a limited range. The grid is centered on the coordinate position (0,0). (Glyphs are also centered on this position.) In the X axis the grid runs from -49 to 49, increasing left to right. In the Y axis it runs from -49 to 49, increasing top to bottom. Two special coordinate pairs, (-49,0) and (-49,-49), may not be used in any glyph (see the explanation of the NORC encoding below for the reason). Each glyph has encoded with it, additionally, a leftmost extent and a rightmost extent on the X axis (which positions are not necessarily the same as the leftmost and rightmost X coordinates of the glyph).

2 - The NORC Encoding and Nines Complement

I've never seen a tape, or the transcription of a tape, of the Hershey glyph repertories in their original "NORC" encoding (named after the computer on which they were first produced; see the earlier chapter on Dr. Hershey's paper "Calligraphy for Computers"). The presentation here is simply a reconstruction of certain aspects of that format from descriptions in Dr. Hershey's papers. This reconstruction may well contain serious errors. The value of this reconstruction, if indeed it has any, is that it allows an understanding of the interplay between the size of the Hershey glyph grid and its original encoding. Bear in mind that this encoding dates to at least 1967 (possibly 1960 or earlier), a time when computer storage was very expensive. This in turn created a generation of talented programmers who valued the efficient representation and processing of data.

2.1 - The Original NORC Encoding on Tape

In "Calligraphy for Computers," Dr. Hershey describes the "NORC encoding" (not there given this name) in this way:

The digital data for each character are recorded in separate blocks on tape. Each block consists of 16 decimal digit words. Each word is divided into four fields of four digits each. The first word is a beginning-of-block word and the last word is an end-of-block word. Each field of digital data is divided into two digit pairs. The first digit pair of the first field gives the left edge of the character block. The second digit pair of the first field gives the right edge of the character block. Each of the remaining fields give coordinates of a point. The first digit pair give the X-coordinate and the second digit pair gives the Y-coordinate of the point. (25)

Negative coordinates are expressed by 9's complements. A vector is plotted between each successive pair of points. A field of 5000 signifies the end of a string of connected vectors. When this field is sensed, plotting is terminated at the last point and is resumed at the next point. A field of 5050 signifies the end of the character. (26)

It is important for a modern reader who may have been schooled in {GNU®/}Linux®-like operating systems and their derivatives to realize that this NORC encoding came from a record oriented model of data processing, not from a stream or sequential file oriented model.

In understanding this encoding, it may be best to start at the "outside." The encoding assumes the presence of a magnetic tape which can be divided physically (electronically/magnetically) into "blocks" separated by interblock gaps. Hardware and operating system support would have existed to read each block separately. This reliance upon the physical characteristics of magnetic tape differs from the later use of the "NORC format" in "Cartography and Typography with True BASIC" [Hershey 1995], where the blocks are, it would seem, gathered into one or more sequential files.

When Dr. Hershey says that "each block consists of 16 decimal digit words," he means that each block consists of some number of "words," each made up of 16 decimal digits (not that each block consists of 16 words made up of decimal digits - the number of words per block is unlimited). "Each word is divided into four fields of four digits each." Thus, a word would be:

aaaabbbbccccdddd

where "a", "b", "c", and "d" represent decimal digits. Each four-digit series is a "field" within the word.
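As a minimal sketch of this layout (in Python, which is simply a convenient choice here and has nothing to do with Dr. Hershey's software), a 16-digit word can be cut into its four fields as follows. The function name `split_word` and the sample digits are hypothetical; they are not taken from any tape.

```python
def split_word(word: str) -> list[str]:
    """Split one 16-digit word into its four 4-digit fields."""
    if len(word) != 16 or not (word.isascii() and word.isdigit()):
        raise ValueError("expected exactly 16 decimal digits")
    # Fields occupy digit positions 0-3, 4-7, 8-11, and 12-15.
    return [word[i:i + 4] for i in range(0, 16, 4)]


# Made-up sample word, for illustration only.
print(split_word("9504000150005050"))  # ['9504', '0001', '5000', '5050']
```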

The details of the first and last words are not clear from this description. Dr. Hershey says that "the first word is a beginning-of-block word," but does not specify the value of this word. He also specifies that the "last word is an end-of-block word," but says only that the field (not word) with a value of 5050 "signifies the end of the character." In the absence of an actual NORC encoded tape, this probably doesn't matter that much.

Before going on to examine the format of the actual data, it might be of interest (well, it was of interest to me) to look at the way the NORC encoding was used and described 30 years later.

2.2 - The NORC Format in IBM Mainframe "Files"

In "Cartography and Typography with True BASIC" [Hershey 1995], Dr. Hershey describes the "NORC format" (as used on an IBM® mainframe) in a slightly different manner. The core of the encoding of the data has not changed, but the surrounding packaging has.

[on the IBM mainframe:] ... Convenience in the digitization was achieved when complements were recorded for negative coordinates. Thus negative values ranged from 50 to 99 while positive values ranged from 00 to 50. It was not necessary to devote a digit to the sign. The X-coordinate is positive rightward and the Y-coordinate is positive downward. Each datum is a 4-digit word with the following format:

Digits    Interpretation
1 - 2     X-coordinate
3 - 4     Y-coordinate

Each character occupies a block of data in NORC format. The first 11 digits in each record give the file number, the block number, and the record number. Each block begins at the beginning of a record and continues to the end of the block. The data in each block are preceded by a beginning-of-block word and are terminated by an end-of-block word. The beginning-of-block word and the end-of-block word give the number of words in each block. However, they are bypassed, because the end-of-line datum is 5000 and the end-of-character datum is 5050. The first datum for each character gives the distance to the left edge of the character block, and the second datum gives the distance to the right edge of the character block. The remainder of the data are the corners in the polygonal simulation with origin at the centroid of the character block.

(Dr. Hershey then goes on to describe an encoding for microcomputers which resembles the Hurt encoding with a bias of 64 rather than 82. See the later chapter on The Microcomputer Encoding Used in "Cartography and Typography with True BASIC.")

The "NORC format" here seems to describe the earlier NORC encoding (a record-based encoding employing physical blocks on tape) encapsulated in a sequential file (or multiple sequential files?). The format contains metadata which encode the length of a block, in a style comfortable to a mainframe programmer, but these metadata are redundant (and ignored by Dr. Hershey's software) because the blocks are scanned for an end of glyph identifier ("5050"), in a style comfortable to programmers trained in {GNU/}Linux-like and microcomputer environments.

2.3 - The NORC Data Encoding

The core of both descriptions is the same, however: data are encoded as two-digit decimal subfields in nines complement. The first field contains two two-digit pairs which specify the left and right edges of the glyph. Each subsequent field contains two two-digit pairs which encode the X and Y coordinates of one position. Polylines are drawn between successive coordinate positions; a field of 5000 ends the current polyline, and a field of 5050 ends the glyph.
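The following is a rough sketch, in Python, of how such a sequence of fields might be decoded. It rests on several assumptions which the sources do not confirm: that the beginning-of-block and end-of-block words have already been stripped, that the list starts with the field holding the left and right edges, and that the edges are themselves in nines complement. The names `decode_pair` and `decode_glyph` and the sample data are hypothetical.

```python
END_OF_POLYLINE = "5000"
END_OF_GLYPH = "5050"


def decode_pair(pair: str) -> int:
    """Decode a two-digit nines complement pair ("50".."99" are negative)."""
    value = int(pair)
    return value if value <= 49 else value - 99


def decode_glyph(fields: list[str]):
    """Return (left, right, polylines) for one glyph's 4-digit fields.

    Assumes fields[0] holds the left and right edges and that the block's
    framing words have already been removed.
    """
    left = decode_pair(fields[0][:2])
    right = decode_pair(fields[0][2:])
    polylines, current = [], []
    for field in fields[1:]:
        if field == END_OF_GLYPH:
            break
        if field == END_OF_POLYLINE:
            if current:
                polylines.append(current)
            current = []
            continue
        current.append((decode_pair(field[:2]), decode_pair(field[2:])))
    if current:  # keep a polyline that runs up to the end of the glyph
        polylines.append(current)
    return left, right, polylines


# Invented data: two crossing strokes, not a real Hershey glyph.
sample = ["9405", "9595", "0004", "5000", "9504", "0095", "5050"]
print(decode_glyph(sample))
```

The sentinels are tested before any coordinate decoding takes place, which is why the two corresponding coordinate positions cannot appear in a glyph.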

2.4 - A Puzzle

Dr. Hershey's comments in "Cartography and Typography with True BASIC" present one puzzle, however. It is a puzzle which would be solved immediately upon the examination of either an original NORC tape (if one could still be read, or a bit-for-bit transcription of one) or a mainframe tape for the "True BASIC" distribution, or preferably both. But I don't have such tapes.

In neither the original NORC nor the "True BASIC mainframe" descriptions does Dr. Hershey specify whether a "digit" is a 4-bit binary coded decimal (BCD) digit or an 8-bit printable digit in some character code (whatever code was native to NORC in that case, and EBCDIC for the IBM® System/370 mainframe world of the "True BASIC" distribution).

In the absence of better information, it would make sense to assume BCD digits for the original NORC encoding. It might also make sense to assume this for the later encoding, particularly since the System/370 machine architecture was 32-bit, which would allow 8 BCD digits (two complete nines complement coordinate positions) per machine register.

However, when describing the microcomputer version of this distribution (p. 8), Dr. Hershey notes that its bias-64 encoding requires 2 bytes (16 bits) per coordinate pair, and thus constitutes "a two-fold compression of the data." Four digits occupy 32 bits only if each digit takes 8 bits, so this would suggest that each "digit" of the mainframe encoding was in fact being represented by an 8-bit character rather than a 4-bit BCD digit.

As noted, this puzzle could be solved trivially by examining the distribution tapes. Moreover, in the absence of such a tape (and the need to process it), the question of whether a "digit" in the mainframe distribution of Dr. Hershey's "Cartography and Typography with True BASIC" is BCD or printable EBCDIC is not relevant to an understanding of the nines complement encoding of the data using these digits.

2.5 - Nines Complement

"Nines[1] complement" is a concept which may be unfamiliar to typographers. It refers to a practice in decimal arithmetic which serves two purposes. The first is the efficient subtraction of numbers using only the (easier to implement) operations of complementing and addition; this advantage isn't necessarily relevant here. The second, of direct relevance here, is the space-efficient encoding of positive and negative decimal integers.

The "nines complement" of a single decimal digit is the number which must be added to the digit in order to equal nine. Thus, the nines complement of 2 is 7, because 2 + 7 = 9. The nines complement of a multidigit decimal integer is generated by taking the nines complement of each digit. Thus the nines complement of 123 is 876.
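The digit-by-digit rule is short enough to state as code. This is only an illustrative Python sketch; the function name `nines_complement` is hypothetical, and digits are handled as a string so that leading zeros survive.

```python
def nines_complement(digits: str) -> str:
    """Return the nines complement of a string of decimal digits."""
    # Each digit d is replaced by 9 - d; the length never changes.
    return "".join(str(9 - int(d)) for d in digits)


print(nines_complement("2"))    # 7
print(nines_complement("123"))  # 876
```

Applying the operation twice returns the original digits, which is what makes the same rule usable for both encoding and decoding.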

As noted above, nines complement notation has two uses in computer programming. The first, easy-to-implement subtraction, isn't relevant here (Dr. Hershey indicates in "Calligraphy for Computers" that the decimal data on tape were converted to binary for use in the computer itself.) The second use is to permit the expression of negative numbers without the need to devote an entire digit position to the minus sign (or its absence).

The way this is done is rather clever. Ordinary integers starting with 0 are simply encoded starting with 0. Thus, for the two decimal digits of the NORC encoding, numbers 0 through 49 are encoded as "00", "01", "02", ... "49". Numbers in the range -49 through -0 are encoded by taking the nines complement of their magnitude. Thus, (negative) 00 is represented by "99", -01 by "98", -02 by "97", and so forth, counting down, to -49, which is represented by "50". Note that nines complement notation has two representations for the number 0, ordinary "positive" 0 ("00") and a somewhat strange "negative" 0 ("99"). In summary:

nines complement    represents the number
99                  -0
98                  -1
97                  -2
...
52                  -47
51                  -48
50                  -49
49                  49
48                  48
47                  47
...
02                  2
01                  1
00                  0
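The table can be reproduced with a small pair of functions. This is a hedged Python sketch, not a reconstruction of any original program; `encode_coord` and `decode_coord` are hypothetical names. Note that an integer 0 always encodes to "00" here, so the "negative 0" form "99" is accepted on decoding but never produced.

```python
def encode_coord(n: int) -> str:
    """Encode an integer in -49..49 as a two-digit nines complement pair."""
    if not -49 <= n <= 49:
        raise ValueError("coordinate outside the range -49..49")
    # Negative n is stored as 99 - |n|, which equals 99 + n.
    return f"{n:02d}" if n >= 0 else f"{99 + n:02d}"


def decode_coord(pair: str) -> int:
    """Decode a two-digit pair; "50".."99" stand for -49..-0."""
    if len(pair) != 2 or not (pair.isascii() and pair.isdigit()):
        raise ValueError("expected exactly two decimal digits")
    value = int(pair)
    return value if value <= 49 else value - 99


for n in (-49, -2, -1, 0, 1, 49):
    print(n, encode_coord(n))
```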

2.6 - The Significance of the NORC Encoding

The Hershey glyph grid size is thus exactly that permitted by two-decimal-digit nines complement encoding of coordinates. In order to increase the size of his grid by even one unit, Dr. Hershey would have had to go to three decimal digits per coordinate (six per coordinate pair). This would have given him a 999x999 grid (coordinates from -499 to 499), which would have exceeded, significantly, the resolution of his output devices. More importantly, the resulting fifty percent increase in encoding size (from four digits to six per coordinate pair) might have been a serious consideration for the encoding of thousands of characters, given the costs of storage in the early 1960s.
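The arithmetic behind this can be checked with a few lines of Python. This is an illustrative sketch only; the function name `coordinate_range` is hypothetical.

```python
def coordinate_range(digits_per_coordinate: int) -> tuple[int, int]:
    """Range of integers expressible in nines complement with this many digits."""
    # Half of the 10**n codes are non-negative, half are negative;
    # zero is counted on both sides, so the limit is 10**n / 2 - 1.
    limit = 10 ** digits_per_coordinate // 2 - 1
    return -limit, limit


for digits in (2, 3):
    low, high = coordinate_range(digits)
    positions = high - low + 1
    print(f"{digits} digits: {low}..{high}, "
          f"{positions}x{positions} grid, {2 * digits} digits per coordinate pair")
```

Two digits give the range -49 to 49 (a 99x99 grid) at four digits per coordinate pair; three digits give -499 to 499 at six digits per pair.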

The special coordinates which represent the end of each polyline (5000) and the end of the character data in the NORC encoding (5050) are thus seen to be composed of the leftmost X coordinate (-49, "50" in nines complement) either duplicated or in conjunction with 0. This means that two coordinate positions, (-49,0) and (-49,-49), may not be used in glyphs.[2]
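Expressed as a test on a candidate point, the restriction might look like the following Python sketch. The names `RESERVED_POINTS` and `is_usable_point` are hypothetical, and nothing here is drawn from Dr. Hershey's own software.

```python
GRID_MIN, GRID_MAX = -49, 49

# Positions whose encodings collide with the two sentinel fields.
RESERVED_POINTS = {
    (-49, 0): "5000",    # would read as end of polyline
    (-49, -49): "5050",  # would read as end of glyph
}


def is_usable_point(x: int, y: int) -> bool:
    """True if (x, y) is on the grid and not mistaken for a sentinel."""
    in_grid = GRID_MIN <= x <= GRID_MAX and GRID_MIN <= y <= GRID_MAX
    return in_grid and (x, y) not in RESERVED_POINTS


print(is_usable_point(-49, 0))    # False
print(is_usable_point(-49, -49))  # False
print(is_usable_point(-49, 1))    # True
```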

The relationship of the glyph grid size to the original encoding of the glyphs is not apparent in, for example, the NBS/NTIS encoding (which permits a range from -99 to 99) or the Hurt encoding (which permits a range from -49 to 44).


[1] The word is a plural, not a possessive, and thus is written without an apostrophe.

[2] I find it curious that Dr. Hershey did not use the redundant "negative 0" ("99" in nines complement) in constructing these special values.
