Translating the Hershey Data

Topics

The Hershey Glyph Formats
   The Cell
   The NBS Encoding
   The James Hurt, Holzmann USENET Encoding
Extracting the Hershey Glyph Numbers
Translating a Single Glyph
   Requirements
   Coordinate Systems
   Parameters: hnum & mbs
   Parameter: baseline
   Encoded Value: Left and Right Edges
   Constant: Capital Height
   Complications
   Running varkhersh.awk
   Details of the Translation
Translating All of the Glyphs
   hershadjust.awk
   hershgen.awk
Assembling a Font
   Font Maps
   varkfont.awk
Assembling All of the Fonts
Summary
Notes
Bibliography
GNU® Free Documentation License (separate file)
GNU General Public License (separate file)
Legal

1 - The Hershey Glyph Formats

1.1 - The Cell

The "cell" in which a Hershey glyph is constructed is a 99 unit by 99 unit square with its origin at the center. Glyphs consist of one or more polylines connecting points in this cell. (Exception: several glyphs consist of a single point. These are identified in Wolcott & Hilsenrath as "blank.")

1.2 - The NBS Encoding

In the encoding presented in Wolcott & Hilsenrath (I do not know yet if this is identical to Hershey's original encoding) this cell is numbered from the left to the right in the X axis from -49 through 49, inclusive. In the Y axis, it is numbered from the top to the bottom from -49 through 49, inclusive. That is to say, the most negative coordinate pair, (-49,-49), is in the upper left-hand corner.

Two coordinate pairs outside of this normal range have special meaning. The coordinate pair (-64,0) indicates that the pen should be lifted at that point and that a new polyline should commence at the next coordinate. The coordinate pair (-64,-64) indicates the end of the glyph's data.
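
The two sentinel pairs can be handled with a few lines of Awk. The following is only a sketch, not anything taken from Wolcott & Hilsenrath: it assumes, purely for illustration, that the coordinates have already been reduced to one whitespace-separated "x y" pair per line (the real NBS table layout is not reproduced here), and the name nbs_polylines is hypothetical.

```shell
# Split a stream of "x y" pairs into polylines, one polyline per output line.
# Assumption: one pair per input line; this is NOT the actual NBS table layout.
nbs_polylines() {
  awk '
    $1 == -64 && $2 == -64 { exit }        # end of glyph: stop reading
    $1 == -64 && $2 == 0 {                 # pen up: close the current polyline
      if (line != "") print line
      line = ""
      next
    }
    {                                      # ordinary point: append to polyline
      pt = "(" $1 "," $2 ")"
      line = (line == "" ? pt : line " " pt)
    }
    END { if (line != "") print line }     # flush the last polyline
  '
}

# Two strokes separated by a pen-up, then the end-of-glyph marker.
printf '%s\n' '0 -5' '-4 4' '-64 0' '0 -5' '4 4' '-64 -64' | nbs_polylines
```

Each output line is one polyline. The END rule also runs after "exit", so the final polyline is not lost.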

The data are given in a tabular form as printable ASCII decimal integers plus the space and colon characters (the text indicates that a Binary Coded Decimal (BCD) version was also available). In addition to the coordinate and pen-up data, the left and right limits of the glyph are explicitly encoded. (Each glyph also has a baseline, but this is not explicitly encoded.)

1.3 - The James Hurt, Holzmann USENET Distribution Encoding

In the encoding devised by James Hurt and employed in the Holzmann USENET distribution, this 99 x 99 cell is clipped on the right to a 94 x 99 cell (X: -49 through 44) and the encoding of the cell numbers is changed. The coordinates are encoded using the printable ASCII characters, starting with decimal 33 ("!", the exclamation point; this is the first printable character after the space). In such an encoding, the letter "R" (decimal 82) signifies zero, and so this encoding is described by Holzmann or Hurt in terms of obtaining a numeric value by subtracting the ASCII value of "R" from the ASCII value of the encoding character. For example, if the encoding character is "!" (33), then the numeric value is ("!" - "R"), which is (33 - 82), or -49.

The largest value encodable in this way is 44 ("~" (tilde), which is ASCII 126): "~" - "R" = 126 - 82 = 44. This is not sufficient to represent the entire X dimension of the original, theoretical, cell for the Hershey glyphs. However, an examination of the Hershey occidental data reveals that the widest glyph is the "41-circle" at Hershey glyph number 907. This circle extends only 41 units in the positive X direction. The Hurt encoding can therefore encode all of the Hershey occidental glyphs (at least).
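
The subtraction can be tried out directly. This is a minimal sketch rather than the decoder used later in this document: the function name hurt_decode and the environment variable HURT_CHAR are hypothetical. Awk has no built-in function for obtaining a character's code, so the sketch first builds a small lookup array with sprintf("%c").

```shell
# Decode one Hurt-encoded character by subtracting the code of "R".
# The character is passed through the environment so that "\" and quote
# characters need no special escaping on the awk command line.
hurt_decode() {
  HURT_CHAR="$1" awk 'BEGIN {
    # Build a character-to-code table for space (32) through tilde (126).
    for (i = 32; i <= 126; i++) ord[sprintf("%c", i)] = i
    c = substr(ENVIRON["HURT_CHAR"], 1, 1)
    print ord[c] - ord["R"]
  }'
}

hurt_decode '!'    # lowest ordinary coordinate
hurt_decode 'R'    # zero
hurt_decode '~'    # highest encodable coordinate
```

The three calls print -49, 0, and 44, matching the arithmetic above.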

I always manage to confuse myself when thinking this through, so I've prepared the following table which lays out the correspondences of this encoding scheme without involving subtraction:

 ASCII   ASCII  coordinate
decimal   char    value
   33      !       -49
   34      "       -48
   35      #       -47
   36      $       -46
   37      %       -45
   38      &       -44
   39      '       -43
   40      (       -42
   41      )       -41
   42      *       -40
   43      +       -39
   44      ,       -38
   45      -       -37
   46      .       -36
   47      /       -35
   48      0       -34
   49      1       -33
   50      2       -32
   51      3       -31
   52      4       -30
   53      5       -29
   54      6       -28
   55      7       -27
   56      8       -26
   57      9       -25
   58      :       -24
   59      ;       -23
   60      <       -22
   61      =       -21
   62      >       -20
   63      ?       -19
   64      @       -18
   65      A       -17
   66      B       -16
   67      C       -15
   68      D       -14
   69      E       -13
   70      F       -12
   71      G       -11
   72      H       -10
   73      I        -9
   74      J        -8
   75      K        -7
   76      L        -6
   77      M        -5
   78      N        -4
   79      O        -3
   80      P        -2
   81      Q        -1
   82      R         0
   83      S         1
   84      T         2
   85      U         3
   86      V         4
   87      W         5
   88      X         6
   89      Y         7
   90      Z         8
   91      [         9
   92      \        10
   93      ]        11
   94      ^        12
   95      _        13
   96      `        14
   97      a        15
   98      b        16
   99      c        17
  100      d        18
  101      e        19
  102      f        20
  103      g        21
  104      h        22
  105      i        23
  106      j        24
  107      k        25
  108      l        26
  109      m        27
  110      n        28
  111      o        29 
  112      p        30
  113      q        31
  114      r        32
  115      s        33
  116      t        34
  117      u        35
  118      v        36
  119      w        37
  120      x        38
  121      y        39
  122      z        40
  123      {        41
  124      |        42
  125      }        43
  126      ~        44

The Hurt encoding has a number of advantages. It is compact, employing only one character per coordinate as opposed to the three used in the NBS/NTIS encoding. It uses only printable ASCII, and so may pass through communications and printing systems with less difficulty than binary data (the NBS/NTIS encoding scheme does this as well). Most importantly, it differs from the NBS/NTIS encoding, which, apparently, was a condition imposed by the NTIS on the redistribution of these US taxpayer funded data created in the public service.

In the Hurt encoding, the direction of the X and Y enumeration remains the same as the NBS/NTIS coding: left to right, top to bottom. (The point (-49, -49) is still in the upper left-hand corner.)

The (-64,0) pen-up pseudo-coordinate of the NBS/NTIS encoding is encoded as " R" ("[sp]R"). That is, it is encoded as the space character (decimal 32, which would correspond to -50 in the encoding scheme above) followed by the "R" character (which encodes the number zero).

Given this encoding scheme, each glyph is encoded in a column-sensitive way (padded with ASCII space characters as necessary) as follows:

Columns 0 through 4 (the first 5 bytes) encode the Hershey glyph number as a printable ASCII decimal integer. E.g., "[sp][sp][sp][sp]1" or "[sp]3926".

Columns 5 through 7 (the next 3 bytes) encode the length of the data to follow, as a printable ASCII decimal integer signifying the number of 16 bit words (two-byte pairs) to follow.

Column 8 encodes the left edge of the glyph (using the ASCII-based encoding scheme described above).

Column 9 encodes the right edge of the glyph.

The remaining columns encode coordinate pairs (or the "[sp]R" pen-up pseudo-coordinate pair).

In the Holzmann USENET distribution, when the encoding of a glyph exceeds 72 characters a newline is inserted (and if necessary at each successive 72 character point). This helps to ensure that the data may be handled by software or devices that can handle only 72 characters per line. It is convenient if, prior to use, these extra newlines are removed so that each glyph's encoding corresponds to exactly one line. (A script to do this is presented in the previous section, Preparing the Hershey Data.)

The NBS/NTIS encoding of an explicit glyph end (-64, -64) is omitted. Instead, each glyph encodes the length of its data (in columns 5 through 7, as described above). This allows the glyph data to be reassembled into a single line even when split over multiple lines in a distribution.
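
Because the length is stored in the record, it can also serve as a consistency check once the split lines have been rejoined. The sketch below assumes one glyph per line, as in the prepared data; the function name hershey_check_length is hypothetical. Note that Awk's substr() counts from 1, so "columns 5 through 7" become positions 6 through 8.

```shell
# For each glyph record, compare the declared length with the actual one.
# Output: glyph number, expected characters, actual characters, verdict.
hershey_check_length() {
  awk '{
    hnum     = substr($0, 1, 5) + 0     # columns 0-4: glyph number
    nwords   = substr($0, 6, 3) + 0     # columns 5-7: pair count, margins included
    expected = 2 * nwords               # two characters per pair
    actual   = length($0) - 8           # everything after the 8-byte header
    verdict  = (actual == expected) ? "ok" : "MISMATCH"
    printf "%d %d %d %s\n", hnum, expected, actual, verdict
  }'
}

printf '%s\n' '    1  9MWRMNV RRMVV RPSTS' | hershey_check_length
```

A record that was truncated, or whose continuation lines were not rejoined, shows up as a MISMATCH.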

As an example, the first Hershey glyph (a cartographic-sized capital A) is encoded as:

    1  9MWRMNV RRMVV RPSTS

Just to make sure that the spaces are apparent:

[sp][sp][sp][sp]1[sp][sp]9MWRMNV[sp]RRMVV[sp]RPSTS

This is glyph number 1, and it contains 9 double-bytes of data. (Indeed there are 18 ASCII characters following.)

The left edge of this glyph is at "M" (which by the table above, or by the subtraction of "R", can be seen to be -5). The right edge is "W" (+5).

The first polyline goes from "RM" (0,-5) to "NV" (-4, 4). Then an " R" (-50, 0) indicates that the pen should be raised. (This "polyline" has in fact only one stroke.) The line so drawn is the left side of the "A" (remember, negative X coordinates are at the left, and negative Y coordinates are to the top).

The next polyline (also of one stroke; this is a simple glyph) goes from "RM" (0, -5) to "VV" (4, 4). This is the right side of the "A". Then a " R" indicates that the pen should be raised again.

Finally, a third line goes from "PS" (-2, 1) to "TS" (2, 1). This is the "crossbar" of the "A". With this the data line is done and the glyph is complete.
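
The walk-through above can be reproduced mechanically. The following sketch is not one of the scripts distributed with this document; hershey_polylines is a hypothetical name, and it decodes by subtracting 82 (the code of "R") rather than by table lookup. It expects one complete glyph record per input line.

```shell
# Decode Hurt-encoded glyph records into margins and polylines.
hershey_polylines() {
  awk '
    BEGIN {
      # Character-to-coordinate table: code minus 82, where 82 is "R".
      for (i = 32; i <= 126; i++) val[sprintf("%c", i)] = i - 82
    }
    {
      npairs = substr($0, 6, 3) - 1          # declared pairs minus the margin pair
      printf "left=%d right=%d\n", val[substr($0, 9, 1)], val[substr($0, 10, 1)]
      line = ""
      for (k = 0; k < npairs; k++) {
        pair = substr($0, 11 + 2 * k, 2)
        if (pair == " R") {                  # pen up: finish this polyline
          if (line != "") print line
          line = ""
          continue
        }
        pt = "(" val[substr(pair, 1, 1)] "," val[substr(pair, 2, 1)] ")"
        line = (line == "" ? pt : line " " pt)
      }
      if (line != "") print line             # last polyline of the glyph
    }'
}

printf '%s\n' '    1  9MWRMNV RRMVV RPSTS' | hershey_polylines
```

For glyph 1 this prints the margins (-5 and 5) followed by three lines, one per stroke: (0,-5) (-4,4), then (0,-5) (4,4), then (-2,1) (2,1).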

2 - Extracting the Hershey Glyph Numbers

At first I thought I'd need some small assisting scripts to extract the Hershey Glyph numbers from the prepared Hershey data file hersh.occ. In the end I did not, but I've left them in here as an example of how simple Awk scripts can be.

{
   printf ("%d ", substr($0,1,5) + 0)
}

This one-line script, surrounded by 41 lines of other stuff, is at hershnums.awk.

There are three other variants of this script, each extracting a range or ranges within the glyphs (cartographic, indexical, normal). These are nearly as simple, and are at:
hershnums-cartographic.awk
hershnums-indexical.awk
and
hershnums-normal.awk

3 - Translating a Single Glyph

3.1 - Requirements

I chose to write an Awk programming language script which would extract a single glyph from the Hershey data and translate it for use with VARKON®. Given such a script, the creation of entire fonts for VARKON may then be orchestrated by repeated calls to this script from one or more shell scripts. It's an example of the power of the standard tools of GNU®/Linux® and related systems. For a philosophical discussion of this, I might mention Neal Stephenson's short book In the Beginning was the Command Line (online at the author's website: http://www.cryptonomicon.com/beginning.html).

Initially, I wished to produce output from this script in one of two forms: VARKON font glyph information (an "ASCII.NNN" file) and a VARKON MBS module which would draw the glyph. By producing a non-font MBS alternative form, I hoped to be able to view the glyphs without loading them into fonts and to assemble graphical tables of the glyphs without the font mechanism. What I discovered is that dealing with nearly 1600 MBS modules was cumbersome. Moreover, VARKON does not allow the dynamic creation of the name of a module when it is called. So, for example, if I named my MBS modules after the glyphs they encoded (h0001.MBS, h0002.MBS, etc.) I could not automatically generate the calls to these modules by arithmetic substitution to generate their names dynamically.

In the end, I abandoned the MBS form of output from the glyph translation script. I've left it in the script, however, in case it might be useful in other contexts.

3.2 - Coordinate Systems

The translation process involves four coordinate systems (or at least four uses of three systems).

The initial coordinate system is that of the distribution, which is the same (though encoded differently) for both the NBS/NTIS and USENET distributions. This coordinate system, as noted earlier, consists of either a 99 x 99 (NBS/NTIS) grid centered on its origin or a 94 x 99 (USENET) section of this grid. The lowest numbered point, (-49,-49) or ("!", "!"), is in the upper left hand corner. I'll call this the "distribution" coordinate system.

The first step in the transformation is to renumber the coordinate system, without moving the glyph, so that the origin (0,0) is at the lower left hand corner and all coordinate numbers are nonnegative. I'll call this the "renumbered" coordinate system.

The next step in the transformation is to shift the glyph within the renumbered coordinate system so that its left side is flush against the left of the grid and its baseline is at an appropriate level within the grid. For large glyphs this may result in some compromises. Although this all really takes place within the "renumbered" coordinate system, I'll term this the "shifted" coordinate system. This is the coordinate system that should be used by the MBS modules which draw inspection versions of the glyphs (because the numbers involved are small, so many glyphs may fit together easily).

The final step is to scale the shifted glyph coordinates (in the range 0 .. 98) to VARKON font coordinates (in the ranges 0 .. 10000 and 0 .. 17500) in such a way that the height of the glyph is correct, and the aspect ratio of the glyph is retained. Additionally, "move" (pen up and down) motions need to be encoded into the coordinates. I'll call this either the "scaled" coordinate system or the "VARKON" coordinate system.
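
To make the chain of coordinate systems concrete, here is a sketch that carries a single point through all of them. It uses the constants that appear in the translation script later in this section (an offset of 49, a cell baseline of 10.5, and a scale factor of 5000/10.5); the function name hershey_transform and its argument order are hypothetical, and jogs, scale adjustments, and the pen-up encoding are left out.

```shell
# Arguments: x y leftmargin baseline, all in distribution coordinates.
hershey_transform() {
  awk -v x="$1" -v y="$2" -v left="$3" -v base="$4" 'BEGIN {
    xr = x + 49                        # renumbered: origin moves to lower left
    yr = 49 - y                        # renumbered: Y axis is flipped
    xs = xr - (left + 49)              # shifted: left margin lands on X=0
    ys = yr - ((49 - base) - 10.5)     # shifted: glyph baseline meets cell baseline
    k  = 5000 / 10.5                   # scaled: shifted units to VARKON units
    printf "renumbered %d %d\n", xr, yr
    printf "shifted %g %g\n", xs, ys
    printf "scaled %d %d\n", int(xs * k), int(ys * k)
  }'
}

# Apex of the cartographic "A" (glyph 1): point (0,-5), left margin -5, baseline 4.
hershey_transform 0 -5 -5 4
```

For this point the sketch prints renumbered 49 54, shifted 5 19.5, and scaled 2380 9285.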

3.3 - Parameters: hnum & mbs

My script requirements meant that the script would have to take parameters. Awk supports "name=value" style command line parameters. The first two parameters necessary were thus "hnum" (which specified the Hershey glyph of interest by its number) and "mbs" (a boolean value, 0 for VARKON font data output, 1 for MBS module output).
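
One property of these "name=value" operands is worth seeing in action: Awk performs the assignment when it reaches the operand while working through its command line, which is after the BEGIN block has already run. The sketch below (show_params is a hypothetical name) prints the same two variables from both places.

```shell
# Print hnum and mbs from the BEGIN block and again from the main rule.
# Any name=value operands given to the function are passed on to awk.
show_params() {
  echo 'one input line' | awk '
    BEGIN { printf "BEGIN: hnum=[%s] mbs=[%s]\n", hnum, mbs }
          { printf "main:  hnum=[%s] mbs=[%s]\n", hnum, mbs }
  ' "$@"
}

show_params hnum=1 mbs=0
```

The BEGIN line shows both variables empty, while the main rule sees 1 and 0. This is what allows a script to set defaults in its BEGIN block and still have them overridden from the command line.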

These two parameters are enough to specify the Hershey glyph to translate, and together with a knowledge of the Hershey and VARKON glyph formats, to put it somewhere in the VARKON glyph cell.

Putting it in some reasonable location requires a bit more information: baseline, leftmost edge, and capital height. ¹

3.4 - Parameter: baseline

The Hershey Glyphs come in three sizes. From smallest to largest these are: "cartographic" (called "very small" in the USENET distribution), "indexical" ("small"; "smaller than normal" in the USENET distribution), and "normal." The intended baselines for these glyphs are not encoded in their data. Moreover, baselines cannot be calculated from glyph data, as they require an aesthetic judgment about the glyph (perhaps together with others in its typeface).

The USENET distribution (but not Wolcott & Hilsenrath) identifies baselines for these three ranges. The "cartographic" ("very small") glyphs (Hershey glyph numbers in the range 0 to 500) are said to have a baseline of Y=-5. The "indexical" ("small") glyphs (1001 through 2000) are said to have a baseline of Y=-6. The "normal" glyphs (all others) are said to have a baseline of Y=-9.

I'm not entirely sure of these baselines, though. First, they seem not to employ the coordinate system of the data. For example, the top of the cartographic "A" (glyph 1) is at (0,-5). Its baseline cannot also be -5. Second, they seem not to match the data. For example, in the cartographic series, the bottoms of the glyphs are at 4.

In consequence, I'll use the following baselines, derived from the data (expressed here in the distribution coordinate system):

cartographic: 4
indexical: 6
normal: 9

The baseline is supplied via the named parameter "baseline".
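
When many glyphs are processed, the baseline can be chosen from the glyph number. This sketch simply combines the number ranges quoted from the USENET distribution with the baselines adopted above; the function name hershey_baseline is hypothetical, and the ranges are only as reliable as that quotation.

```shell
# Print the baseline (distribution coordinates) for a Hershey glyph number.
hershey_baseline() {
  awk -v hnum="$1" 'BEGIN {
    if (hnum >= 0 && hnum <= 500)
      print 4                          # cartographic
    else if (hnum >= 1001 && hnum <= 2000)
      print 6                          # indexical
    else
      print 9                          # normal
  }'
}

hershey_baseline 1      # cartographic "A"
hershey_baseline 501    # normal simplex "A"
```

The result can be substituted into a command line, for example as baseline=$(hershey_baseline 1).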

3.5 - Encoded Values: Left and Right Edges

Each Hershey Glyph encodes explicitly its leftmost and rightmost edges. I'm going to let the rightmost edge fall where it will. The leftmost edge, though, should correspond to the left side of the VARKON glyph cell (X=0). Since this edge is encoded in the data, it need not be supplied as a parameter.

Note that as encoded the leftmost edge is not necessarily at the position of the leftmost glyph point. For example, glyph 1 (cartographic "A") has a leftmost point at X=-4, but an encoded left edge of X=-5.

3.6 - Constant: Capital Height

Finally, each size has a height (the height of a capital, from its baseline to a top line). The values that I will use differ from those in Wolcott & Hilsenrath (cartographic: 9; indexical: 13; normal: 21). The Holzmann USENET distribution does not give a height. Rather, it gives baseline and top line values (but with top and bottom reversed).

I find by inspection of the data, for example, that the height of the cartographic "A" (glyph 1) is 10, not 9. It runs from Y=-5 (top) to Y=4 (bottom). The difference between these is indeed 9 in absolute value, which is the Wolcott & Hilsenrath value. But as drawn the glyph occupies every unit from one to the other, -5 to 4 inclusive, and that is 10 units.

So I'll use:

Cartographic: 10
Indexical: 14
Normal: 22

These are absolute values which are valid in the distribution, renumbered, and shifted coordinate systems.

Height is used for scaling, and since all Hershey glyphs are of constant proportion to each other they all must scale by the same factor. This will be determined by the capital height of a normal sized Hershey glyph (22). Since this is a constant, it need not be supplied as a parameter to the script.

3.7 - Complications

The Hershey glyphs and the VARKON font layout were designed independently of each other, so it is to be expected that there are complications when translating from one to the other.

The Hershey glyphs encode left and right margins. These do not necessarily correspond to the leftmost and rightmost points in the glyph, though. VARKON seems designed to expect the left edge of an ordinary character to correspond to X=0. For example, the main vertical in the uppercase "B" is at X=0 (for TSLANT=0). It may be useful to adjust the glyph's position automatically so as to accomplish this.

Some Hershey glyphs extend to the left of the left margin. (E.g., the normal script D, glyph 554.) This causes problems for VARKON, which expects all points in the font space to be nonnegative. It is necessary to adjust these glyphs' positions. This can be done automatically.

In general, there are four ways to adjust ("jog") a glyph's position: left, right, up, and down. I'll encode these with four optional parameters: jogl, jogr, jogu, and jogd. The variables set by these parameters will be initialized to 0 in the BEGIN block of the Awk script. If specified, the parameters will override this initialization. The values of these parameters are nonnegative integers, in units of the renumbered or shifted coordinate systems.

Since the range of the shifted coordinate system is (0,0) to (93,98) while that of the VARKON font data is (0,0) to (10000,17500), the glyph data must be scaled (into the "scaled" coordinate system; see above). It makes sense, I think, to scale all glyphs by the same amount, unless there is some reason not to. The basic scaling factor is that which will make a Normal Simplex Taper0 Roman ² capital such as the "A" at glyph number 501 the same size as a capital in VARKON font 0.

If it is necessary to change this, a scaling adjustment will be needed. I'll specify this with the parameter "scaleadjust". A scaleadjust value of 1 indicates no adjustment at all (and the default if no adjustment is specified is 1). Scale adjustments greater than 1 increase the scale; for example, "scaleadjust=2" indicates that the scaled glyph should be twice as large as it normally would be. Similarly, scale adjustments less than 1 (down to just above 0, which probably needn't be handled) result in a decrease of the scale. A glyph scaled with "scaleadjust=0.5" would be half as large as it normally would be.

3.8 - Running varkhersh.awk

My translation script, varkhersh.awk, is thus invoked with three named parameters. For example:

awk -f hurt.awk -f varkhersh.awk < hersh.occ hnum=1 mbs=1 baseline=4 > hg1.MBS

This invocation should cause the Awk script to extract Hershey glyph number 1 ("A" in the cartographic size, baseline 4 and height 10 in untranslated units) and write out an MBS module named hg1.MBS ("Hershey Glyph 1") which will draw this glyph.

Note that varkhersh.awk requires the function hurt() (which performs the James Hurt decoding as a direct table lookup) in the file hurt.awk. This file must be specified (via a "-f" parameter) first on the command line.

3.9 - Details of the Translation

Here's where Prof. Don Knuth's Literate Programming would come in handy (especially the small, clean system "noweb" by Norman Ramsey). These techniques would allow me to discuss the translation script in the order of its most interesting bits (and automatically assemble the script out of the discussion). I'm not quite set up for this yet, though, so I'll just pull out the core chunk and let it stand on its own.

This is the function hurt() in hurt.awk

function hurt (c) {
   ascii[" "]  = -50
   ascii["!"]  = -49
   ascii["\""] = -48
   ascii["#"]  = -47
   ascii["$"]  = -46
   ascii["%"]  = -45
   ascii["&"]  = -44
   ascii["'"]  = -43
   ascii["("]  = -42
   ascii[")"]  = -41
   ascii["*"]  = -40
   ascii["+"]  = -39
   ascii[","]  = -38
   ascii["-"]  = -37
   ascii["."]  = -36
   ascii["/"]  = -35
   ascii["0"]  = -34
   ascii["1"]  = -33
   ascii["2"]  = -32
   ascii["3"]  = -31
   ascii["4"]  = -30
   ascii["5"]  = -29
   ascii["6"]  = -28
   ascii["7"]  = -27
   ascii["8"]  = -26
   ascii["9"]  = -25
   ascii[":"]  = -24
   ascii[";"]  = -23; ascii["["]  =  9; ascii["{"] = 41
   ascii["<"]  = -22; ascii["\\"] = 10; ascii["|"] = 42
   ascii["="]  = -21; ascii["]"]  = 11; ascii["}"] = 43
   ascii[">"]  = -20; ascii["^"]  = 12; ascii["~"] = 44
   ascii["?"]  = -19; ascii["_"]  = 13
   ascii["@"]  = -18; ascii["`"]  = 14
   ascii["A"]  = -17; ascii["a"]  = 15
   ascii["B"]  = -16; ascii["b"]  = 16
   ascii["C"]  = -15; ascii["c"]  = 17
   ascii["D"]  = -14; ascii["d"]  = 18
   ascii["E"]  = -13; ascii["e"]  = 19
   ascii["F"]  = -12; ascii["f"]  = 20
   ascii["G"]  = -11; ascii["g"]  = 21
   ascii["H"]  = -10; ascii["h"]  = 22
   ascii["I"]  =  -9; ascii["i"]  = 23
   ascii["J"]  =  -8; ascii["j"]  = 24
   ascii["K"]  =  -7; ascii["k"]  = 25
   ascii["L"]  =  -6; ascii["l"]  = 26
   ascii["M"]  =  -5; ascii["m"]  = 27
   ascii["N"]  =  -4; ascii["n"]  = 28
   ascii["O"]  =  -3; ascii["o"]  = 29
   ascii["P"]  =  -2; ascii["p"]  = 30
   ascii["Q"]  =  -1; ascii["q"]  = 31
   ascii["R"]  =   0; ascii["r"]  = 32 
   ascii["S"]  =   1; ascii["s"]  = 33
   ascii["T"]  =   2; ascii["t"]  = 34
   ascii["U"]  =   3; ascii["u"]  = 35
   ascii["V"]  =   4; ascii["v"]  = 36
   ascii["W"]  =   5; ascii["w"]  = 37
   ascii["X"]  =   6; ascii["x"]  = 38
   ascii["Y"]  =   7; ascii["y"]  = 39
   ascii["Z"]  =   8; ascii["z"]  = 40

   return ascii[substr(c,1,1)]
}

This is the translation code:

BEGIN {
   jogl = 0
   jogr = 0
   jogu = 0
   jogd = 0
   # Default scale adjustments (1 = no adjustment).  Like the jog values,
   # these may be overridden by command-line parameters.
   scaleadjustx = 1
   scaleadjusty = 1
}
{

   # Check to see if this is the glyph we're interested in
   # Notes: Adding 0 forces a conversion to an integer.
   #        Awk counts strings from 1, not 0.
   if ((substr($0,1,5) + 0) == hnum) {

      # printf ("%s\n", $0)

      # Obtain the number of 16-bit words of coordinate data,
      # excluding the left and right margins.
      datapairs = substr($0,6,3) - 1
      # printf ("datapairs: %d\n", datapairs)

      # See if this is a blank; if so, skip it and emit nothing
      if (datapairs == 0) {
         next
      }

      # mbs: emit module opening
      if (mbs == 1) {
         printf("global drawing module hg%d(\n", hnum)
         printf("int origin_x;\n")
         printf("int origin_y")
         printf(");\n")
         printf("beginmodule\n")
         # printf ("! %s\n", $0)
      }

      # Obtain the left margin (ignore the right margin)
      # This is in the "distribution" coordinate system.
      leftmargin_distribution = hurt(substr($0,9,1))
      # printf ("leftmargin: %d\n", leftmargin_distribution)

      # Precompute and print the number of points (vectors)
      vectorcount = 0
      if (mbs == 0) {
         pointpos = 11
         for (i = 1; i <= datapairs; i++) {
            if (substr($0,pointpos,2) != " R") {
               vectorcount = vectorcount + 1
            }
            pointpos = pointpos + 2
         }
         printf ("%d\n", vectorcount - 1);
      }

      # Extract and transform the coordinate data
      pointpos = 11
      newpolyline = 1
      x_shifted_prev = 0
      y_shifted_prev = 0
      refnum = 1
      for (i = 1; i <= datapairs; i++) {

         # Is it a real coordinate or pseudo-coordinate pair?
         if (substr($0,pointpos,2) == " R") {

            newpolyline = 1
            # printf ("pen up\n");

            # Go on to the next coordinate pair
            pointpos = pointpos + 2

         } else {

            # Get the coordinate pair
            x_encoded = substr($0,pointpos,1)
            y_encoded = substr($0,pointpos + 1,1)
            # printf ("%c%c\n", x_encoded, y_encoded)

            # Decode the coordinate pair
            x_distribution = hurt(x_encoded)
            y_distribution = hurt(y_encoded)
            # printf ("distribution %d %d\n", x_distribution, y_distribution)

            # Convert from the "distribution" coordinate system to the
            # "renumbered" coordinate system.  (Convert cell from upper left
            # at (-49,-49) to lower left at (0,0).)
            # X was     -49 to -1,  0,  1 to 49
            # X becomes   0 to 48, 49, 50 to 98
            x_renumbered = x_distribution + 49
            # Y was      49 to  1,  0, -1 to -49
            # Y becomes   0 to 48, 49, 50 to  98
            y_renumbered = (y_distribution * -1) + 49
            # printf ("renumbered X Y: %d %d\n", x_renumbered, y_renumbered)

            x_renumbered = x_renumbered - jogl
            x_renumbered = x_renumbered + jogr
            y_renumbered = y_renumbered + jogu
            y_renumbered = y_renumbered - jogd
            # printf ("renumbered & jogged X Y: %d %d\n", x_renumbered, y_renumbered)

            # The Hershey "distribution" and "renumbered" coordinate systems
            # are integer systems.  However, it will turn out that the
            # baseline for the shifted coordinate system is 10.5.
            # Rounding this creates rounding errors in the subsequently
            # scaled values (baseline and topline are off - by a small
            # but visible amount).
            # The solution is to treat the "shifted" coordinate system
            # as a real number system.  Values scaled from it can be
            # truncated to integers before use with VARKON; the error
            # after scaling should be small enough.

            # Shift glyph
            # X: left margin goes to left of cell
            leftmargin_renumbered = leftmargin_distribution + 49
            x_shifted = x_renumbered - leftmargin_renumbered
            # Y: There are two baselines which must be brought together.
            #    The glyph has its own baseline, specified by the "baseline"
            #    parameter.  I'll call this the "glyph baseline."
            #    The shifted coordinate system has a baseline,
            #    calculatable from the height of a "normal" Hershey glyph.
            #    This baseline is therefore a constant.
            #    I'll call it the "cell baseline."
            #
            #    Height is an absolute value valid in distribution,
            #    renumbered, and shifted coordinate systems.
            #    The VARKON baseline (5000) is at a distance half the
            #    height (10000) up, so the baseline in the shifted coordinate
            #    system should also be half the height up.
            #    The height of Hershey glyph 501 (simplex normal size "A")
            #    is 21 (-12 to 9 in the distribution coordinate system).
            #    The cell baseline in the shifted coordinate system must 
            #    therefore be half this, or 10.5.
            #    This cell baseline is the same for all sizes
            #    (cartographic, indexical, and normal).

            #    Glyph Y coordinates must be shifted so that the 
            #    glyph baseline is made to coincide with the cell baseline.
 
            # To calculate all of this, first change the glyph baseline
            # from the distribution coordinate system to the
            # renumbered coordinate system.
            # Example: cartographic baseline = 4  in distribution coordinates
            #                                = 45 in renumbered coordinates
            glyph_baseline_renumbered = (baseline * -1) + 49
            # Then calculate the difference between the glyph baseline
            # and the cell baseline.  E.g., (45 - 10.5) = 34.5 for cartographic
            baseline_difference = glyph_baseline_renumbered - 10.5
            # Then shift the glyph's Y coordinates down by this amount.
            y_shifted = y_renumbered - baseline_difference

            # printf ("shifted X Y: %d %d\n", x_shifted, y_shifted)

            if ((mbs == 1) && (newpolyline == 0)) {
               printf("   lin_free(#%d, vec(origin_x + %d, origin_y + %d),\
                                        vec(origin_x + %d, origin_y + %d));\n",
                                   refnum, x_shifted_prev, y_shifted_prev,
                                           x_shifted, y_shifted)
               refnum = refnum + 1
            }

            # Since all Hershey glyphs have the same relative size,
            # they must all employ the same scaling factor. 
            # The problem is that both the Hershey and VARKON
            # coordinate systems are integer, and the Hershey system
            # is relatively coarse.
            # If the Hershey coordinate for the baseline
            # (10.5 in the shifted coordinate system)
            # is divided into the VARKON baseline, the result is
            # (5000/10.5) = 476.190476..

            x_scaled = int(x_shifted * (5000/10.5) * scaleadjustx)
            y_scaled = int(y_shifted * (5000/10.5) * scaleadjusty)
            # printf ("scaled X Y: %d %d\n", x_scaled, y_scaled)

            # If this is the start of a new polyline, add 32768
            # to X to encode this fact.

            if (newpolyline == 1) {
               x_scaled = x_scaled + 32768
            }

            if (mbs==0) {
               printf ("%d %d\n", x_scaled, y_scaled)
            }

            # Go on to the next coordinate pair
            # Since this is not a "pen up" pseudo-coordinate, we cannot
            # (yet at least) be at the start of a new polyline.
            newpolyline = 0
            x_shifted_prev = x_shifted
            y_shifted_prev = y_shifted
            pointpos = pointpos + 2
         }
      }

      # mbs: emit module closing
      if (mbs == 1) {
         printf("endmodule\n")
      }

   } # end if hnum matches

}

The entire script is available as varkhersh.awk

4 - Translating All of the Glyphs

4.1 - hershadjust.awk

There are 1597 glyphs (including blanks) to process. The adjustments for most (but not all) can be determined automatically. Specifying the adjustments on all of them by hand would be too laborious. My solution is a script, hershadjust.awk, which analyzes the processed Hershey glyph data and calculates (and writes out) those adjustments that can be made automatically. This file of adjustments can be edited manually to further tweak the translation. (It is also fairly simple to allow #-comments in this file so that these tweaks may be documented.) It can then be processed by another Awk script to generate the translated VARKON font format and MBS format data.

The format of the adjustments file can be fairly simple. Either a line is a comment introduced by "#" in the initial column or it contains numeric data (printable ASCII nonnegative decimal integers and simple floats). If it contains numeric data, the first number is the Hershey glyph number. The next four numbers are the jog values (left, right, up, down, each potentially 0). The final two numbers are the x and y scaling adjustments. Each is a simple floating point value of the form "xxx.yyy" to whatever precision Awk can handle.
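
A reader for this format needs very little code. The sketch below is not the processing script itself; the function name read_adjustments is hypothetical, and the sample values fed to it are invented for illustration. It skips comment lines, turns each seven-field line into "name=value" settings, and reports anything else that is not blank.

```shell
# Parse an adjustments file on standard input.
read_adjustments() {
  awk '
    /^#/ { next }                      # comment line: "#" in the initial column
    NF == 7 {                          # glyph number, four jogs, two scale factors
      printf "hnum=%d jogl=%d jogr=%d jogu=%d jogd=%d scaleadjustx=%s scaleadjusty=%s\n",
             $1, $2, $3, $4, $5, $6, $7
      next
    }
    NF > 0 { printf "line %d: expected 7 fields, found %d\n", NR, NF }
  '
}

# Illustrative input only; these are not values computed from the real data.
printf '%s\n' '# hnum jogl jogr jogu jogd sx sy' '554 0 3 0 0 1.0 1.0' | read_adjustments
```

The settings are printed in the same "name=value" form that Awk accepts on its command line, so each output line could be appended to an invocation of the translation script.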

There is no need for this script to record the encoded left and right glyph margins, as these can be re-extracted as needed by varkhersh.awk. It is not possible to extract or calculate the baseline information, as that is external to the glyph data.

The script is invoked like this:

awk -f hurt.awk -f hershadjust.awk < hersh.occ > hersh.adj

The core of the script is this:

{
   hnum = substr($0,1,5)

   # Obtain the number of 16-bit words of coordinate data,
   # excluding the left and right margins.
   datapairs = substr($0,6,3) - 1
   # printf ("datapairs: %d\n", datapairs)

   # Obtain the left and right margins,
   # and convert them to the "renumbered" coordinate system
   marginl_distribution = hurt(substr($0,9,1))
   marginl_renumbered   = marginl_distribution + 49
   marginr_distribution = hurt(substr($0,10,1))
   marginr_renumbered   = marginr_distribution + 49

   # printf ("marginl_distribution %d\n", marginl_distribution)
   # printf ("marginr_distribution %d\n", marginr_distribution)

   # printf ("marginl_renumbered: %d\n", marginl_renumbered)
   # printf ("marginr_renumbered: %d\n", marginr_renumbered)

   # start the lowestx value out at its extreme
   lowestx_renumbered  = 93

   jogl = 0
   jogr = 0

   # Analyze the coordinate data pairs
   if (datapairs > 1) {   # skip the blanks 
      pointpos = 11
      for (i = 1; i <= datapairs; i++) {
         if (substr($0,pointpos,2) != " R") {    # skip pen-up pseudocoordinate
            x_distribution = hurt(substr($0,pointpos,1))
            x_renumbered   = x_distribution + 49
            # printf ("x_renumbered: %d\n", x_renumbered)
            if (x_renumbered < lowestx_renumbered) {
               lowestx_renumbered = x_renumbered
               # printf ("lowestx_renumbered: %d\n", lowestx_renumbered)
            }
         }
         pointpos = pointpos + 2
      }

      if (lowestx_renumbered < marginl_renumbered) {
         jogr = marginl_renumbered - lowestx_renumbered
      }
      if (lowestx_renumbered > marginl_renumbered) {
         jogl = lowestx_renumbered - marginl_renumbered
      }
   
      # Write the analysis
      printf ("%d %d %d %d %d %f %f\n", hnum, jogl, jogr, 0, 0, 1.0, 1.0)
   } 

}
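The core above relies on hurt(), which lives in hurt.awk and is not reproduced here. The following is a condensed sketch of the same left-margin analysis with a stand-in decoder, hurt_sketch(), which assumes that each character encodes its ASCII value minus that of "R" (82). That assumption, the names decode_jogs and hurt_sketch, and the sample record (an illustration laid out as the script expects, not a quotation from hersh.occ) are mine.

```shell
# Sketch only: compute the left/right jog for each record on standard input.
decode_jogs() {
   awk '
   BEGIN {
      # Awk has no ord(), so build a character-to-code lookup table.
      for (i = 32; i < 127; i++) ord[sprintf("%c", i)] = i
   }
   # Assumed stand-in for hurt(): offset from the letter "R" (ASCII 82).
   function hurt_sketch(c) { return ord[c] - 82 }
   {
      hnum      = substr($0, 1, 5) + 0
      datapairs = substr($0, 6, 3) - 1      # exclude the margin pair
      marginl   = hurt_sketch(substr($0, 9, 1)) + 49
      lowestx   = 93                        # start at the extreme
      pointpos  = 11
      for (i = 1; i <= datapairs; i++) {
         if (substr($0, pointpos, 2) != " R") {   # skip pen-up
            x = hurt_sketch(substr($0, pointpos, 1)) + 49
            if (x < lowestx) lowestx = x
         }
         pointpos += 2
      }
      jogl = (lowestx > marginl) ? lowestx - marginl : 0
      jogr = (lowestx < marginl) ? marginl - lowestx : 0
      printf("%d %d %d\n", hnum, jogl, jogr)
   }'
}

# Illustrative record: glyph 1, 9 pairs, margins "M" and "W"
printf '%s\n' '    1  9MWRMNV RRMVV RPSTS' | decode_jogs
```

In this sample the left margin decodes to -5 and the leftmost point to -4, so the glyph sits one unit to the right of its margin and the line printed is "1 1 0" (glyph 1, jog left 1, jog right 0).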

The initial (unmodified) adjustment data are in hersh.adj-orig

The hand-modified adjustment data are in hersh.adj. These adjustments represent my experiences in translating the entire Hershey occidental glyph set, as discussed in the next chapter ("Layout of the Hershey Occidental Glyphs"). They are not in any way, however, authoritative.

4.2 - hershgen.awk

This script traverses the adjustments file and for each glyph executes varkhersh.awk twice, once to generate a VARKON font character data file and once to generate a VARKON MBS module to draw the glyph. This script encodes within it the baselines for the glyph ranges.

It is invoked in this way:

awk -f hershgen.awk < hersh.adj outdir=hg

Here "hg" is simply the name of the directory in which the many output files are put. I used a subdirectory, "hg"; any relative or absolute path would do as well.

{

   if (substr($0,1,1) == "#") {
      next
   }
   hn = $1
   jl = $2
   jr = $3
   ju = $4
   jd = $5
   sx = $6
   sy = $7

   if (hn < 501) {                         # cartographic
      bl = 4
   } else
   if ((hn >= 1001) && (hn <= 2000)) {   # indexical
      bl = 6
   } else {
      bl = 9                                 # normal
   } 

   printf ("processing glyph %d   \r", hn)

   # hg*.vfd  VARKON font data file for glyph (will become ASCII.*)
   command = sprintf ("awk -f hurt.awk -f varkhersh.awk < hersh.occ hnum=%d mbs=0 baseline=%d jogl=%d jogr=%d jogu=%d jogd=%d scaleadjustx=%f scaleadjusty=%f > %s/hg%d.vfd\n", hn, bl, jl, jr, ju, jd, sx, sy, outdir, hn)
   # printf("%s", command);
   system(command)

   # hg*.MBS  VARKON MBS module to draw glyph
   command = sprintf ("awk -f hurt.awk -f varkhersh.awk < hersh.occ hnum=%d mbs=1 baseline=%d jogl=%d jogr=%d jogu=%d jogd=%d scaleadjustx=%f scaleadjusty=%f > %s/hg%d.MBS\n", hn, bl, jl, jr, ju, jd, sx, sy, outdir, hn)
   system(command)
}
END {
   printf ("\n")
}

The full script is in: hershgen.awk
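As a rough illustration of the baseline selection, the sketch below isolates the range test in a function and prints, for each adjustment line, the baseline and the *.vfd file name that would result, without running varkhersh.awk. The names show_baselines and baseline_for and the sample input are assumptions made for the example; the ranges and baseline values are those of the script above.

```shell
# Sketch only: a dry run of the baseline choice made by hershgen.awk.
# $1 is the output directory name (as with outdir=hg above).
show_baselines() {
   awk -v outdir="$1" '
   # Baseline by glyph-number range, following the script above.
   function baseline_for(hn) {
      if (hn < 501)                  return 4    # cartographic
      if (hn >= 1001 && hn <= 2000)  return 6    # indexical
      return 9                                   # normal
   }
   /^#/ { next }                                 # skip comment lines
   {
      printf("glyph %d baseline %d -> %s/hg%d.vfd\n",
             $1, baseline_for($1), outdir, $1)
   }'
}

# Illustrative adjustment lines, one from each range
printf '%s\n' '# sample' \
              '1 1 0 0 0 1.0 1.0' \
              '1001 0 0 0 0 1.0 1.0' \
              '2001 0 0 0 0 1.0 1.0' | show_baselines hg
```

Keeping the range test in one function would also make it easy to print the commands first and inspect them before letting system() execute some 1600 of them.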

The process of creating the *.vfd and *.MBS files out of the Hershey data is, as done here, not efficient. It doesn't have to be, as it only takes a minute to run on a no-longer-current 900MHz IA-32 machine, and it need be done only once.

5 - Assembling a Font

5.1 - Font Maps

A collection of glyphs is not itself a font. To make fonts out of this large collection of glyphs, mappings must be established between the fonts' organizations (the numbers 0 to 127 or 255, usually considered as the ASCII character set) and the glyphs themselves. Holzmann (and Hurt?) did this in the Holzmann USENET distribution, supplying various "*.hmp" files with such correspondences. I have chosen to establish my own mappings, however.

Certain parts of the mappings are obvious ("A" to "A"); other parts are less obvious or, indeed, questionable. These are not the only possible mappings. Given such a rich set of glyphs, the number of possible mappings is very large.

I've set up my mappings using a single mapping file for each one. The format of this file is quite simple: Lines introduced with "#" are comments and are ignored. Lines which don't start with "#" contain two printable ASCII decimal nonnegative integers. The first of these is an ASCII character number, the second is the number of the Hershey glyph mapped to it (use 0 for no mapping). The rest of the line is ignored. The separation between fields must be parsable by Awk. Each of the 128 ASCII positions from 0 to 127 must appear, in order, exactly once (this simplifies later processing).

For example, the following file (stripped of some comments) maps the Hershey Cartographic Uniplex Taper0 Lineface Latin Uppercase Letter, Arabic Numeral, and Symbol glyphs:

# com-lemur-vk-hershey-cartographic-uniplex-taper0-lineface-latin
# Version 0 by DMM
#
# Map to ASCII the 
# Hershey Cartographic Uniplex Taper0 Lineface glyphs for
# Latin uppercase letters, numerals, and symbols.
#
# Field 1: ASCII number
# Field 2: Hershey glyph number
# Both fields must be printable ASCII unsigned nonnegative integers
# Rest of line, or line beginning with "#": ignored
#
# 0 through 31 unmapped; ASCII control characters
#
0 0     # ASCII control characters, unmapped
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0    # ASCII blank, unmapped
33 214  # ASCII !
34 217  # ASCII " = Hershey second sign
35 233  # ASCII #
36 219  # ASCII $
37 0    # ASCII %
38 234  # ASCII &
39 231  # ASCII ' = Hershey right single quote
40 221  # ASCII (
41 222  # ASCII )
42 228  # ASCII *
43 225  # ASCII +
44 211  # ASCII ,
45 224  # ASCII -
46 210  # ASCII .
47 220  # ASCII /
48 200  # ASCII 0
49 201  # ASCII 1
50 202  # ASCII 2
51 203  # ASCII 3
52 204  # ASCII 4
53 205  # ASCII 5
54 206  # ASCII 6
55 207  # ASCII 7
56 208  # ASCII 8
57 209  # ASCII 9
58 212  # ASCII :
59 213  # ASCII ;
60 0    # ASCII <
61 226  # ASCII =
62 0    # ASCII >
63 215  # ASCII ?
64 0    # ASCII @
65 1    # ASCII "A"
66 2
67 3
68 4
69 5
70 6
71 7
72 8
73 9
74 10
75 11
76 12
77 13
78 14
79 15
80 16
81 17
82 18
83 19
84 20
85 21
86 22
87 23
88 24
89 25
90 26
91 0    # ASCII [
92 0    # ASCII backslash
93 0    # ASCII ]
94 0    # ASCII ^
95 0    # ASCII _
#
# The following glyphs do not have ASCII equivalents
# They're mapped here to ASCII "a" through "g", otherwise unused in this font
96 230  # ASCII ` = Hershey left single quote
97 216  # Hershey prime or minute
98 217  # Hershey second; also used for ASCII 34 (")
99 218  # Hershey degree sign;
100 227 # Hershey cross or multiplication;
101 229 # Hershey dot or multiplication
102 232 # Hershey right arrow
103 235 # Hershey lozenge
104 0   # ASCII h
105 0   #
106 0   #
107 0   #
108 0   #
109 0   #
110 0   #
111 0   #
112 0   #
113 0   #
114 0   #
115 0   #
116 0   #
117 0   #
118 0   #
119 0   #
120 0   #
121 0   #
122 0   # ASCII z
123 0   # ASCII {
124 223 # ASCII |
125 0   # ASCII }
126 0   # ASCII ~
127 0   # ASCII DEL, unmapped

This font map file is available as: com-lemur-vk-hershey-cartographic-uniplex-taper0-lineface-latin

Note that this mapping contains some curious features. Since the glyphs of this type have no lower case versions, it does not map the ASCII lower case characters (unlike the Holzmann mapping, which does). It maps several glyphs which have no obvious ASCII mapping to some of these lowercase positions.
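Since later processing depends on every position from 0 to 127 appearing in order exactly once, a map file is worth checking mechanically after hand editing. The following is a small sketch of such a check; the function name check_map and the message wording are assumptions, and it is not one of the scripts in this distribution.

```shell
# Sketch only: verify that a font map lists positions 0..127 in order.
# $1 is the name of the map file.
check_map() {
   awk '
   BEGIN { expected = 0; bad = 0 }
   /^#/  { next }                      # comments are ignored
   {
      if ($1 + 0 != expected) {
         printf("line %d: expected position %d, found %s\n",
                NR, expected, $1)
         bad = 1
      }
      expected++
   }
   END {
      if (expected != 128) {
         printf("found %d positions, expected 128\n", expected)
         bad = 1
      }
      if (!bad) print "ok"
   }' "$1"
}

# Demonstration on a generated map in which position 30 is mistyped as 20.
tmp=$(mktemp)
awk 'BEGIN { for (i = 0; i < 128; i++) printf("%d 0\n", (i == 30) ? 20 : i) }' > "$tmp"
check_map "$tmp"
rm -f "$tmp"
```

The check counts data lines rather than trusting the numbers it reads, so a single mistyped position is reported once and the following lines are not flagged.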

5.2 - varkfont.awk

The script varkfont.awk traverses a given font map file and assembles the indicated glyphs into a VARKON font file. This font file can then be copied into the $VARKON_ROOT/cnf/fnt directory and used. For simplicity, I assume that all of the glyph *.vfd files are in a single directory, which is passed to the script as glyphdir.

It is invoked in this way:

awk -f varkfont.awk glyphdir=GLYPHDIR FONTMAP FONTMAP

For example:

awk -f varkfont.awk glyphdir=hg com-lemur-hershey-cartographic-roman com-lemur-hershey-cartographic-roman > com-lemur-hershey-cartographic-roman.FNT

It's an interesting exercise in Awk programming, as it must make two passes through the input mapping file (hence the file is specified twice on the command line) and it must open (and close, or it runs out of file descriptors!) each glyph *.vfd data file twice. As traditionally befits a trickier program, it has fewer comments. It all depends upon an understanding of the Awk execution paradigm, state machines, and the Awk next, nextfile, and getline statements.

BEGIN {
   pass = 1
   totalchars  = 0
   totalpoints = 0
}
{
   if (substr($0,1,1) == "#") {
      next
   } else {
      ascii = $1
   }

   if (pass == 1) {
      # printf ("Pass 1, ASCII: %d\n", ascii)
      vfd = sprintf("%s/hg%d.vfd", glyphdir, $2)
      command = sprintf("test -e %s", vfd)
      if (system(command) == 0) {
         getline numpoints < vfd
         totalchars  = totalchars  + 1
         totalpoints = totalpoints + numpoints
         # printf ("(%d,%d): numpoints = %d\n", ascii, $2, numpoints)
         close(vfd)  # important
      }
   }

   if (pass == 2) {
      # printf ("Pass 2, %d\n", ascii)
      vfd = sprintf("%s/hg%d.vfd", glyphdir, $2)
      command = sprintf("test -e %s", vfd)
      if (system(command) == 0) {
         command = sprintf("cat %s", vfd)
         system(command)
      } else {
         printf ("0\n")
      }
   }

   if ((pass == 1) && (ascii == 127)) {
      printf ("%d\n", totalchars)
      printf ("%d\n", totalpoints)
      pass = 2
      nextfile
   }

}
END {
   # I never use non-ASCII VARKON font positions 128-255,
   # but VARKON expects them in the FNT file, so issue them here.
   for (i = 1; i <= 128; i++) {
      printf ("0\n")
   }
}
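The two-pass idea can be seen in isolation in the following sketch, which names the same input file twice so that a total gathered on the first pass can be printed ahead of the lines echoed on the second. It is a variation, not the script's own method: it detects the start of each pass with FNR instead of watching for position 127 and calling nextfile. The name two_pass and the sample data are assumptions.

```shell
# Sketch only: two passes over one file by listing it twice.
# $1 is the name of the input file.
two_pass() {
   awk '
   FNR == 1  { pass++ }               # each new input file starts a pass
   /^#/      { next }                 # skip comments in both passes
   pass == 1 { total += $2; next }    # pass 1: accumulate only
   pass == 2 {
      if (!header_done) {             # pass 2: total first, then the data
         printf("total %d\n", total)
         header_done = 1
      }
      printf("%s %d\n", $1, $2)
   }' "$1" "$1"
}

# Illustrative data
tmp=$(mktemp)
printf '%s\n' '# name count' 'a 3' 'b 4' > "$tmp"
two_pass "$tmp"
rm -f "$tmp"
```

The real script has the extra wrinkle that each pass also reads from per-glyph files, which is why it must close() each one after getline.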

6 - Assembling All of the Fonts

7 - Summary

This is just a note to myself, so that I have example command invocations in one place.

prepare the data, creating hersh.occ

create the font mapping files by hand

awk -f hurt.awk -f hershadjust.awk < hersh.occ > hersh.adj

vi hersh.adj, make changes as necessary, by hand

awk -f hershgen.awk < hersh.adj outdir=hg

For each font:

awk -f varkfont.awk glyphdir=hg com-lemur-hershey-normal-uniplex-taper0-roman-latin com-lemur-hershey-normal-uniplex-taper0-roman-latin > com-lemur-hershey-normal-uniplex-taper0-roman-latin.FNT

cp com-lemur-hershey-normal-uniplex-taper0-roman-latin.FNT $VARKON_ROOT/cnf/fnt/10.FNT

check font with "allasc.MBS"

8 - Notes

¹ In addition to the information identified here, Wolcott & Hilsenrath (4, 5) also identify the size of a printer's "em" for two of the glyph sizes: normal (em = 32) and indexical (em = 21). They indicate that this em size could be used for vertical spacing between baselines of successive lines of text. As VARKON text strings do not wrap, this information is of less use here.

The USENET distribution also identifies top lines for the three sizes (4, 7, and 12). These may also be obtained by adding the glyph height from Wolcott & Hilsenrath to the baselines. This information does not seem to be useful in the present translation.

² See the next chapter for a discussion of these terms.

9 - Bibliography

Wolcott, Norman M. and Joseph Hilsenrath. A Contribution to Computer Typesetting Techniques: Tables of Coordinates for Hershey's Repertory of Occidental Type Fonts and Graphic Symbols. Washington, D. C.: Office of Standard Reference Data, National Bureau of Standards, U.S. Department of Commerce, April 1976. NBS Special Publication 424. National Technical Information Service (NTIS) Order Number PB251845.

Legal

Copyright

The data, files, text, and programs of the Holzmann USENET Hershey Glyph Distribution may be redistributed and used freely under their original terms as specified in the Holzmann USENET Hershey Glyph Distribution Cover Statement. The distribution here complies with these terms. The data of the Hershey Glyphs as transformed for use with VARKON may be redistributed and used freely under these same terms. I assert no additional rights or conditions on the use of the transformed data. Some of the text and programs in the Holzmann USENET Hershey Font Distribution may be Copyright 1986 by Peter Holzmann and/or James Hurt. Their own terms either allow or require their redistribution with the Hershey data. The distribution of these texts, files, data, and programs here is subject to all of the disclaimers of warranty and liability noted herein.

License

Permission is granted to copy, distribute and/or modify copyrighted portions of this document (other than the portions the copyright of which is owned by Peter Holzmann and/or James Hurt, which are freely redistributable under their own terms) under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License."

Note: Those portions of this document which are in the public domain, if any, may be copied freely. The distribution of these public domain portions is subject to all of the disclaimers of warranty and liability noted herein.

This work is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Free Documentation License for more details.

You should have received a copy of the GNU Free Documentation License along with this work; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

NOTICE OF DISCLAIMER OF WARRANTY AND LIABILITY:

This work is distributed "as-is," without any warranty of any kind, expressed or implied; without even the implied warranty of merchantability or fitness for a particular purpose.

In no event will the author(s), editor(s), or publisher(s) of this work be liable to you or to any other party for damages, including but not limited to any general, special, incidental or consequential damages arising out of your use of or inability to use this work or the information contained in it, even if you have been advised of the possibility of such damages.

In no event will the author(s), editor(s), or publisher(s) of this work be liable to you or to any other party for any injury, death, disfigurement, or other personal damage arising out of your use of or inability to use this work or the information contained in it, even if you have been advised of the possibility of such injury, death, disfigurement, or other personal damage.

Trademarks

GNU is a registered trademark of the Free Software Foundation.
VARKON is or was a trademark of Microform AB (Sweden).
