Unicode and Glyph Names

[ Document version 1.1. Last updated 17 December 1998 ]


  1. Introduction
  2. Glyph naming
    a. Maximum name length and permissible characters
    b. The uni<CODE> glyph naming convention
    c. How to name glyphs
      i. Standard UV
      ii. CUS UV
      iii. Non-Unicode ligature
      iv. Non-Unicode glyphic variant
      v. All others
    d. Additional notes
  3. Extracting Unicode semantics from glyph names
    a. Algorithm
    b. Populating a Unicode space
    c. Search and copy/paste facilities
  4. Adobe Glyph List notes
    a. Character sets covered
    b. Corporate Use subarea
    c. Double-mappings
  5. Document changes


1. Introduction

This document describes Adobe(R)'s PostScript(R) glyph naming conventions in the context of Unicode. The purpose of these conventions is to attach standardized semantics to glyph names, including glyphs that represent characters without standard Unicode values (UVs), such as certain ligatures or glyphic variants.

Two perspectives are presented: that of the font developer, when deciding what to name the glyphs in a font; and that of any process that needs to extract Unicode semantics from glyph names, such as a Type 1-to-OpenType converter when creating a Unicode 'cmap', or the search facility in an application that does not use OpenType layout tables.

The 3 data files referred to in this document are:

  • "The Adobe Glyph List" (AGL). This maps approximately 1000 glyph names to standard or Corporate Use subarea (CUS) UVs. For more details, including double-mapped glyphs and industry standard character sets covered by the list, see section 4.

  • "Unicode's Corporate Use Subarea as used by Adobe." These assignments cover characters such as small capitals which are commonly used in Adobe fonts but are not part of the Unicode Standard. This data file also provides Unicode-style character decompositions for many of these assignments. For more details, see section 4.b.

  • "Zapf Dingbats Glyph Names and UVs." This list should be used only for the font Zapf Dingbats, as described in section 3.a.

The Unicode Standard 2.1 has been used in this document and related data files, except for the 4 characters mentioned in the header of the AGL data file.


2. Glyph naming

Font developers should follow these guidelines in all cases where glyph names are needed: Type 1 fonts, OpenType fonts with non-CID-keyed CFF data, and TrueType fonts and OpenType fonts with TrueType data that contain 'post' tables with implicit or explicit glyph names.

2.a. Maximum name length and permissible characters

A glyph name may be up to 31 characters in length. It must consist entirely of characters from the following set:

    A-Z
    a-z
    0-9
    . (period)
    _ (underscore)

and must not start with a digit or period. The only exception is the special glyph name ".notdef".
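The length and character rules above can be checked mechanically. The following is a minimal Python sketch, not part of the Adobe tooling; the function name `is_valid_glyph_name` and the constant names are illustrative assumptions.

```python
import re

MAX_GLYPH_NAME_LENGTH = 31

# Letters, digits, period and underscore only; the first character must not
# be a digit or a period, so it has to be a letter or an underscore.
_GLYPH_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._]*")


def is_valid_glyph_name(name: str) -> bool:
    """Return True if name satisfies the rules of section 2.a."""
    if name == ".notdef":  # the single documented exception
        return True
    if len(name) > MAX_GLYPH_NAME_LENGTH:
        return False
    return _GLYPH_NAME_RE.fullmatch(name) is not None
```

The check covers only the syntax of a name; it says nothing about whether the name carries the intended Unicode semantics.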

"twocents", "a1", and "_" are valid glyph names. "2cents" and ".twocents" are not.

2.b. The uni<CODE> glyph naming convention

In certain situations described in this document, a glyph needs to be named according to the uni<CODE> convention.

This means that the glyph name should be of the format "uni<CODE>", where <CODE> is the glyph's UV represented as a 4-digit uppercase hexadecimal number. The "uni" component must be lowercase.

For example, the uni<CODE> glyph name for U+054A, ARMENIAN CAPITAL LETTER PEH, should be "uni054A".

2.c. How to name glyphs

Each step below should be considered, in order, until a name is assigned to a particular glyph. The process should be repeated for every glyph.

In order to verify the choice of glyph name, the result of applying the algorithm in section 3.a to the glyph name can be compared with the intended meaning.

Note that font developers can implement a glyph aliasing mechanism in their production tools that could provide more descriptive glyph name aliases for uni<CODE> or any other glyph names, as long as the glyph names in the final font follow the guidelines below.

2.c.i. Standard UV

If the character that the glyph represents has a standard UV, i.e. a UV assigned by the Unicode Standard, then assign its name as follows:

If the UV is in the Adobe Glyph List, use the glyph name associated with it in AGL. For example, the glyph name for U+00C1, LATIN CAPITAL LETTER A WITH ACUTE, should be "Aacute". If the UV is one of a double-mapping in AGL, and if separate designs are desired for each UV, then one glyph should be given the AGL name and the other should be given a uni<CODE> name according to the table in section 4.c.

If the UV is not in AGL, use a uni<CODE> glyph name. For example, the glyph name for U+0AB8, GUJARATI LETTER SA, should be "uni0AB8".

If the glyph represents a Unicode surrogate character, name the glyph "uni<CODE1><CODE2>", where <CODE1> is the high-surrogate UV, and <CODE2> is the low-surrogate UV. No surrogate characters have been assigned by the Unicode Standard as of Version 2.1.

2.c.ii. CUS UV

If the character that the glyph represents is in the data file "Unicode's Corporate Use Subarea as used by Adobe," then assign its name as follows:

If the CUS UV is in AGL, use the glyph name associated with it in AGL. For example, the glyph name for CUS U+F761, LATIN SMALL CAPITAL LETTER A, should be "Asmall".

If the CUS UV is not in AGL, the glyph should be named according to the uni<CODE> convention. For example, the glyph name for CUS U+F66D, LATIN SMALL CAPITAL LETTER A WITH BREVE, should be "uniF66D".
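Steps (i) and (ii) reduce to the same lookup: use the AGL name if the UV is in AGL, otherwise fall back to a uni<CODE> name. The Python sketch below illustrates this under stated assumptions: `AGL_NAME_BY_UV` is a hypothetical reverse table holding only the two AGL entries quoted in this document, and `glyph_name_for_uv` is an illustrative name. Double-mappings and surrogates are not handled.

```python
# Hypothetical reverse lookup (UV -> AGL glyph name). Only the two entries
# quoted in this document are listed; a real tool would load the complete
# "Adobe Glyph List" data file instead.
AGL_NAME_BY_UV = {
    0x00C1: "Aacute",  # LATIN CAPITAL LETTER A WITH ACUTE
    0xF761: "Asmall",  # CUS: LATIN SMALL CAPITAL LETTER A
}


def glyph_name_for_uv(uv: int) -> str:
    """Name a glyph that has a standard or CUS UV (steps 2.c.i and 2.c.ii)."""
    if not 0 <= uv <= 0xFFFF:
        raise ValueError("<CODE> holds exactly 4 hexadecimal digits")
    if uv in AGL_NAME_BY_UV:
        return AGL_NAME_BY_UV[uv]
    # Lowercase "uni" followed by a 4-digit uppercase hexadecimal <CODE>.
    return f"uni{uv:04X}"
```

With the full AGL loaded, the same function would return "Aacute" for U+00C1 and "uni0AB8" for U+0AB8, matching the examples above.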

2.c.iii. Non-Unicode ligature

(If this point is reached, the character that the glyph represents has neither a standard nor a CUS UV.) If the character that the glyph represents is a ligature, or otherwise decomposes into standard Unicode or CUS characters, then two formats are available for its name:

  • Format 1: Descriptive

    The decomposition is expressed by joining the glyph names of the components, in order, by underscores. The glyph name of a component is an AGL or uni<CODE> name.

    For example, the "o f f i" ligature should be named "o_f_f_i".

  • Format 2: Unicode

    The glyph name is expressed as "uni" followed by two or more <CODE>s, which indicate the UVs of the components of the character, in order.

    For example, the character LATIN CAPITAL LETTER EZH WITH CIRCUMFLEX AND GRAVE, which is not in Unicode, should be named "uni01B703020300", since LATIN CAPITAL LETTER EZH is at U+01B7, COMBINING CIRCUMFLEX ACCENT is at U+0302, and COMBINING GRAVE ACCENT is at U+0300.

    A maximum of 7 components is available with this format due to glyph name length restrictions.
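Both formats, and the origin of the 7-component limit, can be sketched in a few lines of Python. The function names are illustrative assumptions. The limit follows from the 31-character maximum in section 2.a: "uni" takes 3 characters, which leaves room for 28 / 4 = 7 <CODE>s.

```python
MAX_GLYPH_NAME_LENGTH = 31


def descriptive_ligature_name(component_names):
    """Format 1: join the component glyph names with underscores."""
    name = "_".join(component_names)
    if len(name) > MAX_GLYPH_NAME_LENGTH:
        raise ValueError("glyph name exceeds 31 characters")
    return name


def unicode_ligature_name(uvs):
    """Format 2: "uni" followed by two or more 4-digit <CODE>s."""
    if len(uvs) < 2:
        raise ValueError("Format 2 needs at least two components")
    if any(not 0 <= uv <= 0xFFFF for uv in uvs):
        raise ValueError("each <CODE> holds exactly 4 hexadecimal digits")
    name = "uni" + "".join(f"{uv:04X}" for uv in uvs)
    if len(name) > MAX_GLYPH_NAME_LENGTH:  # 3 + 4 * 7 = 31, so 7 is the cap
        raise ValueError("at most 7 components fit in 31 characters")
    return name
```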

2.c.iv. Non-Unicode glyphic variant

If the glyph is a glyphic variant of a character in category (i), (ii), or (iii) above, its name is of the form:

    <base glyph name>.[<variant descriptor>]

Note the period after the base glyph name. An optional variant descriptor can follow the period. <base glyph name> is:

  • in AGL, or
  • a uni<CODE> name, or
  • a ligature or other decomposition as described in (iii) above.

Any process which determines semantics from glyph names, such as the one described in section 3.a, will ignore the descriptor, if present. The descriptor may contain periods or any other permitted characters; only the first period in the glyph name is relevant.

For example, a variant of the "T h" ligature can be named "T_h.swash". "T.swash_h" would be incorrect since this will be interpreted as a glyphic variant of "T".
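The "first period only" rule is easy to get wrong, so a short Python sketch may help. The function name `split_variant` is an illustrative assumption; it returns `None` as the descriptor when the name contains no period.

```python
def split_variant(glyph_name: str):
    """Split a glyph name at its first period.

    Returns (base name, descriptor). The descriptor is None when there is
    no period, and may be empty or contain further periods otherwise.
    """
    base, period, descriptor = glyph_name.partition(".")
    return base, (descriptor if period else None)
```

Applied to "T.swash_h", the sketch yields the base name "T" and the descriptor "swash_h", which shows why that name would be read as a variant of "T" rather than of the "T h" ligature.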

Some of Adobe's internal conventions for variant descriptors are listed below; other developers may use or ignore these additional conventions as they see fit.

  • Variant descriptors currently in use at Adobe include:

    "swash", e.g. "A.swash", for swashes.
    "begin", e.g. "p.begin", for variant beginnings of words, as in some script fonts.
    "end", e.g. "e.end", for variant endings of words.
    "alt", e.g. "a.alt", for alternates.

    These descriptors may also include a numbering scheme as described below.

  • If there are multiple variants of the same base glyph, then the variant descriptors should include zero-padded fixed length numbers so that if and when the glyph names are sorted (as in section 3.b), the intended order will be preserved. For example, if the "ampersand" glyph has 23 alternates, they would be named "ampersand.alt01" through "ampersand.alt23", rather than "ampersand.alt" along with "ampersand.alt1" through "ampersand.alt22".
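The zero-padding rule can be illustrated with a small Python sketch. The function name and its parameters are assumptions made for this example; the pad width is simply the number of digits in the highest variant number.

```python
def numbered_variant_names(base: str, descriptor: str, count: int):
    """Return variant names with zero-padded, fixed-length numbers."""
    width = len(str(count))  # e.g. 2 digits for 23 variants
    return [f"{base}.{descriptor}{i:0{width}d}" for i in range(1, count + 1)]
```

Because every number has the same length, a plain string sort of the resulting names keeps "ampersand.alt02" ahead of "ampersand.alt10".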

2.c.v. All others

If this step is reached, then the glyph has no useful semantic value. Examples of such glyphs include ornaments. Any glyph name may be assigned as long as it cannot be interpreted as a glyph name from (i) to (iv) above.

Adobe's current internal practice is to name ornaments "orn.001", "orn.002" and so on; developers may choose other naming conventions.

2.d. Additional notes

Do not use a uni<CODE> name for a glyph that is in AGL (except for one of a double-mapping), since this might produce undesirable results in pieces of software that use only the AGL glyph name to test for the presence of a particular UV in a font.

For example, the Adobe PostScript driver for Windows re-encodes a Type 1 (PFB/PFM) font to a particular Windows code page on the basis of AGL glyph names; it does not recognize uni<CODE> glyph names. When re-encoding to Windows ANSI (code page 1252) and printing code point 0xE9 (U+00E9, LATIN SMALL LETTER E WITH ACUTE), for instance, it will print the glyph "eacute" if present and ".notdef" otherwise, even if the glyph "uni00E9" is present.

Ideally, such an application would first check to see whether the font had a uni<CODE> name for a particular UV before using the AGL name (including double-mapped AGL glyph names). However, this might not be acceptable to the application since the re-encoding array could change depending on the font.

Note that Adobe Type Manager(R) for Windows NT(R) (ATM(R)/NT) tests whether a Type 1 font supports a particular Windows code page by the presence of the following UVs. If there are two UVs indicated for a code page then the presence of either one is sufficient for the code page to be considered supported.

    Windows Code Page                        UV     Glyph name
    1250: Windows Latin 2 (Central Europe)   010C   Ccaron
                                             0148   ncaron
    1251: Windows Cyrillic (Slavic)          042F   afii10049
                                             0451   afii10071
    1252: Windows Latin 1 (ANSI)             00EA   ecircumflex
    1253: Windows Greek                      03A9   Omega (or uni<CODE> override)
                                             03CB   upsilondieresis
    1254: Windows Latin 5 (Turkish)          0130   Idotaccent
    1255: Windows Hebrew                     05D0   afii57664
    1256: Windows Arabic                     0622   afii57410
    1257: Windows Baltic Rim                 0173   uogonek
    1258: Windows Vietnamese                 20AB   dong
    OEM                                      2592   shade
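A font developer can use the table above to predict which code pages a Type 1 font will appear to support. The Python sketch below is a simplification that works on glyph names only; it is not ATM/NT's actual implementation, and the names `CODE_PAGE_TEST_GLYPHS` and `supported_code_pages` are assumptions. "uni03A9" is listed for code page 1253 because the table allows a uni<CODE> override for that UV.

```python
# Glyph names whose presence marks a code page as supported, taken from the
# table above. Any one name per code page is sufficient.
CODE_PAGE_TEST_GLYPHS = {
    "1250": ("Ccaron", "ncaron"),
    "1251": ("afii10049", "afii10071"),
    "1252": ("ecircumflex",),
    "1253": ("Omega", "uni03A9", "upsilondieresis"),
    "1254": ("Idotaccent",),
    "1255": ("afii57664",),
    "1256": ("afii57410",),
    "1257": ("uogonek",),
    "1258": ("dong",),
    "OEM": ("shade",),
}


def supported_code_pages(glyph_names):
    """Return the code pages (in table order) that the font would pass."""
    present = set(glyph_names)
    return [
        code_page
        for code_page, test_glyphs in CODE_PAGE_TEST_GLYPHS.items()
        if any(name in present for name in test_glyphs)
    ]
```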


3. Extracting Unicode semantics from glyph names

The guidelines in 3.a should be followed by any application that needs to determine the meaning of a glyph from its glyph name. Sections 3.b and 3.c give examples of such applications.

3.a. Algorithm

The following pseudocode should be implemented for all fonts except Zapf Dingbats (PostScript FontName: ZapfDingbats); this has its own separate Unicode lookup table in the data file "Zapf Dingbats Glyph Names and UVs." The pseudocode does not determine the validity of a glyph name.

".notdef" is special: it is used when a glyph name in a PostScript encoding does not exist in a font. It does not have a UV and does not appear in AGL.


getGlyphNameSemantics()
    Input:  glyphName g
    Output: UV+      # One or more UVs
            isDecomp # A boolean, indicating that UV+ is a decomposition, as
                       opposed to 1 or 2 UVs (2 for an AGL double-mapping)
            isVar    # A boolean, indicating that g is a glyphic variant of UV+
    {
    isDecomp = false;
    isVar = false;

    If g contains a period:                                        # Sec. I
        isVar = true;
        g = everything before the first period in g;
        If g is empty:
            Return (UNRECOGNIZED, -, -);

    If g is in AGL:                                                # Sec. II
        Return (AGL UVs, isDecomp, isVar);                                
                                                                          
    If g is of the form uni<CODE>:                                 # Sec. III
        Return (<CODE>, isDecomp, isVar);                                 
                                                                          
    isDecomp = true;                                               # Sec. IV
    If g contains an underscore:
        Split g by underscores;
        If each component yields a UV by sections II or III above:
            Return (<UV1><UV2><UV3>..., isDecomp, isVar);
        Return (UNRECOGNIZED, -, -);

    If g is of the form "uni" with 2 or more <CODE>s following it: # Sec. V
        Return (<UV1><UV2><UV3>..., isDecomp, isVar);

    Return (UNRECOGNIZED, -, -);                                   # Sec. VI
    }

When two UVs are returned by a double-mapped glyph, and only one can be accepted, then the UV that corresponds to the first of each pair in the table in section 4.c should be used.

Some sample inputs and outputs of this function are:

    glyphName g       UV+              isDecomp  isVar  Comment
    T                 0054             false     false
    T.swash           0054             false     true
    mu                00B5,03BC        false     false  AGL double-mapping
    uni03BC           03BC             false     false
    T_h               0054,0068        true      false
    T_h.              0054,0068        true      true
    T_h.swash         0054,0068        true      true
    T.h_swash         0054             false     true
    T_uni0127         0054,0127        true      false
    uni01B703020300   01B7,0302,0300   true      false
    zerooldstyle      F730             false     false  CUS UV
    T_                UNRECOGNIZED     -         -      Empty second component
    uni03bc           UNRECOGNIZED     -         -      <CODE> not uppercase
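The pseudocode can be turned into running code fairly directly. The following Python sketch is one possible reading of it, not a reference implementation. Its assumptions: `AGL_SUBSET` holds only the handful of AGL entries needed for the sample table (a real implementation would load the full data file), UNRECOGNIZED is represented by `None`, and a double-mapped component inside an underscore-joined name contributes only its first UV.

```python
import re

# Tiny illustrative subset of AGL; values are taken from the sample table.
AGL_SUBSET = {
    "T": (0x0054,),
    "h": (0x0068,),
    "mu": (0x00B5, 0x03BC),     # AGL double-mapping
    "zerooldstyle": (0xF730,),  # CUS UV
}

_UNI_SINGLE = re.compile(r"uni([0-9A-F]{4})")
_UNI_MULTI = re.compile(r"uni((?:[0-9A-F]{4}){2,})")


def _single(g):
    """Sections II and III: AGL lookup, then a single uni<CODE> name."""
    if g in AGL_SUBSET:
        return AGL_SUBSET[g]
    match = _UNI_SINGLE.fullmatch(g)
    if match:
        return (int(match.group(1), 16),)
    return None


def get_glyph_name_semantics(g):
    """Return (uvs, is_decomp, is_var), or None if g is unrecognized."""
    is_var = False
    if "." in g:                                    # Sec. I
        is_var = True
        g = g.split(".", 1)[0]
        if not g:
            return None
    uvs = _single(g)                                # Sec. II and III
    if uvs is not None:
        return uvs, False, is_var
    if "_" in g:                                    # Sec. IV
        result = []
        for part in g.split("_"):
            part_uvs = _single(part)
            if part_uvs is None:
                return None
            result.append(part_uvs[0])  # assumption: first UV of a pair
        return tuple(result), True, is_var
    match = _UNI_MULTI.fullmatch(g)                 # Sec. V
    if match:
        digits = match.group(1)
        codes = tuple(int(digits[i:i + 4], 16) for i in range(0, len(digits), 4))
        return codes, True, is_var
    return None                                     # Sec. VI
```

For instance, `get_glyph_name_semantics("T_h.swash")` yields the UVs 0054 and 0068 with both flags set, in line with the corresponding row of the table.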

Note that the UV or UVs that this function will produce might be CUS UVs. Each such UV can be decomposed into its standard Unicode values by consulting the decompositions in the data file "Unicode's CUS as used by Adobe." In the example of "zerooldstyle" in the table above, CUS U+F730 decomposes into "<osf> 0030", a glyphic variant of DIGIT ZERO (U+0030).

Surrogate character names (described in section 2.c.i) can be easily distinguished from two-component non-Unicode ligatures (described in section 2.c.iii Format 2): the high-surrogate UV will be in the range U+D800 through U+DBFF, and the low-surrogate UV will be in the range U+DC00 through U+DFFF, as defined by the Unicode Standard.
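The range test described above can be sketched in Python as follows. The function name is an illustrative assumption. The arithmetic that combines the two halves is the standard UTF-16 surrogate-pair formula from the Unicode Standard, not something defined by this document.

```python
import re

_TWO_CODES = re.compile(r"uni([0-9A-F]{4})([0-9A-F]{4})")


def surrogate_pair_to_scalar(glyph_name: str):
    """Return the encoded code point if glyph_name is "uni<high><low>".

    Returns None for any other name, including two-component Format 2
    ligature names whose <CODE>s fall outside the surrogate ranges.
    """
    match = _TWO_CODES.fullmatch(glyph_name)
    if not match:
        return None
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return None
```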

3.b. Populating a Unicode space

If an application is interested only in extracting standard or CUS UVs (categories (i) and (ii) in section 2.c), it can simplify the algorithm by deleting sections I, IV, and V.

Examples of such applications include a Type 1-to-OpenType converter, when creating the Unicode 'cmap'; and ATM/NT, when loading Type 1 fonts, since Windows NT represents all characters in terms of Unicode.

If the uni<CODE> as well as the AGL glyph name for a particular UV are present, then the uni<CODE> glyph should take precedence at that UV.
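The precedence rule can be sketched as a two-pass construction of a UV-to-glyph map, for example when building a Unicode 'cmap'. This Python sketch makes several assumptions: `AGL_SUBSET` contains only three AGL entries quoted in this document, `build_cmap` is a hypothetical name, and glyphic variants and decompositions are skipped because sections I, IV, and V of the algorithm are omitted.

```python
import re

# Illustrative subset of AGL; a real converter would load the full data file.
AGL_SUBSET = {
    "Aacute": (0x00C1,),
    "Asmall": (0xF761,),
    "mu": (0x00B5, 0x03BC),  # AGL double-mapping
}

_UNI_SINGLE = re.compile(r"uni([0-9A-F]{4})")


def build_cmap(glyph_names):
    """Map each UV to a glyph name, letting uni<CODE> names win."""
    cmap = {}
    # Pass 1: AGL names, including both UVs of a double-mapping.
    for name in glyph_names:
        for uv in AGL_SUBSET.get(name, ()):
            cmap.setdefault(uv, name)
    # Pass 2: uni<CODE> names take precedence at their UV.
    for name in glyph_names:
        match = _UNI_SINGLE.fullmatch(name)
        if match:
            cmap[int(match.group(1), 16)] = name
    return cmap
```

In a font containing both "mu" and "uni03BC", this leaves "mu" at U+00B5 and places "uni03BC" at U+03BC.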

If an application wants to encode unrecognized glyphs in Unicode, it should do so in the End User subarea, by sorting the unrecognized glyph names in the font by increasing ASCII order, and assigning them to a contiguous run of UVs starting at U+E000, the lower end of the Private Use Area. If this run of UVs overlaps with the UV assigned to a glyph with a uni<CODE> name, the results are undefined. The Unicode Standard makes no provision for avoiding a "stack-heap collision" between the End User subarea and CUS. Furthermore, Microsoft(R) will treat the range U+F000 through U+F0FF as the definition of its symbol code page.
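The assignment of unrecognized glyphs can be sketched in Python as below. The function name and the `used_uvs` parameter are assumptions. Since the text leaves the overlap case undefined, this sketch chooses to raise an error when the run would collide with a UV that is already in use; other policies are possible.

```python
PUA_START = 0xE000  # lower end of the Private Use Area


def assign_private_use(unrecognized_names, used_uvs=frozenset()):
    """Assign sorted glyph names to a contiguous run of UVs from U+E000.

    used_uvs holds UVs already claimed, for example by uni<CODE> names.
    """
    mapping = {}
    # For ASCII-only glyph names, Python's default string ordering is
    # increasing ASCII order.
    for offset, name in enumerate(sorted(unrecognized_names)):
        uv = PUA_START + offset
        if uv in used_uvs:
            raise ValueError(f"U+{uv:04X} is already assigned")
        mapping[uv] = name
    return mapping
```

Note that uppercase letters sort before lowercase ones in ASCII, so a glyph named "Orn" would be placed ahead of "orn.001".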

3.c. Search and copy/paste facilities

If a font's glyphs have been properly named, search facilities can accurately locate all glyphic variants of the search string's characters. For example, if the user types in the letter "t", then glyphs "t", "Tsmall", "t.swash", "t.begin", and "t.end" can all be matched.

The same principle applies to copy/paste facilities. For example, glyph "ampersand.alt01", when copied and pasted from one application into another, would be known to be a glyphic variant of AMPERSAND (U+0026), and the regular ampersand could be used to display the character as a fallback strategy.
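A search facility could precompute an index from base glyph names to all of their variants. The Python sketch below shows only this name-based part; the function name is an assumption. Matching a glyph such as "Tsmall" to "t" additionally requires the CUS decomposition data, which is not modeled here.

```python
from collections import defaultdict


def index_by_base(glyph_names):
    """Group glyph names by the part before their first period."""
    index = defaultdict(list)
    for name in glyph_names:
        base = name.split(".", 1)[0]
        if base:  # skips names such as ".notdef" that have no base
            index[base].append(name)
    return dict(index)
```

The same base name serves as the fallback for copy/paste: a glyph named "ampersand.alt01" falls under "ampersand", so the regular ampersand can be substituted when the variant is unavailable.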

Note that applications that have access to a font's OpenType layout tables can also glean this information from the various features in the glyph substitution ('GSUB') table. This would be the only recourse to identify non-Unicode glyphic variants for fonts that do not have glyph names, such as OpenType fonts with CID-keyed CFF data.


4. Adobe Glyph List notes

4.a. Character sets covered

The Adobe Glyph List includes the complete character complements from:

  • the Adobe Standard Roman Character Set [PostScript Language Reference Manual (PSLRM), E.5]
  • the Adobe Standard Central European Character Set [PSLRM Supplement]
  • the Adobe Standard Cyrillic Character Set, including 4 Serbian cursive variants [Technical Note No. 5013, Adobe Developer Support]
  • the Adobe Expert Character Set [PSLRM E.8]
  • the font Symbol [PSLRM E.4]
  • Windows Glyph List (WGL) 4*
  • Windows code pages 1250-1258:
      1250: Windows Latin 2 (Central Europe)
      1251: Windows Cyrillic (Slavic)
      1252: Windows Latin 1 (ANSI)
      1253: Windows Greek
      1254: Windows Latin 5 (Turkish)
      1255: Windows Hebrew
      1256: Windows Arabic
      1257: Windows Baltic Rim
      1258: Windows Vietnamese
  • Macintosh encodings (except for the Apple logo):
      Mac OS Arabic
      Mac OS CentralEurope
      Mac OS Cyrillic
      Mac OS Greek**
      Mac OS Hebrew**
      Mac OS Icelandic
      Mac OS Roman
      Mac OS Romanian
      Mac OS Turkish**
      Mac OS Ukrainian
  • ISO 8859 Parts 1-10:
    1. Latin alphabet No. 1
    2. Latin alphabet No. 2
    3. Latin alphabet No. 3
    4. Latin alphabet No. 4
    5. Latin/Cyrillic alphabet
    6. Latin/Arabic alphabet
    7. Latin/Greek alphabet
    8. Latin/Hebrew alphabet
    9. Latin alphabet No. 5
    10. Latin alphabet No. 6

*    The fi and fl ligatures are double-mapped in WGL4 to Private Use Area UVs U+F001 and U+F002 respectively, for HP printer compatibility. AGL does not perform this double-mapping since U+F001 and U+F002 are in Microsoft's symbol code page area (see section 4.b). AGL does map these ligatures to their standard UVs, U+FB01 and U+FB02 respectively. Except for this, all of WGL4's glyph names follow this document's guidelines. (WGL4 source: http://www.microsoft.com/typography/OTSPEC/WGL4.htm as of 17 December 1998.)
**   The following Apple-defined CUS characters are not in AGL: 1 undefined code point in Mac OS Greek and Mac OS Turkish (CUS U+F8A0), and 6 obsolete or deprecated characters in Mac OS Hebrew (CUS U+F89A through U+F89F).

The glyphs in Zapf Dingbats are in a separate table and are recognized by the UV-assigning algorithm as a special case (see section 3.a).

4.b. Corporate Use subarea

The Unicode Standard states that character assignments in the CUS could be completely internal, hidden from end users, and used only for vendor-specific application support, or could be published as vendor-specific character assignments available to applications and end users.

The CUS characters in AGL fall into the latter category in that they are available to end users; however, several of them, such as the Cyrillic glyphic variants, are not vendor-specific, and would be useful to several vendors.

In fact, we envision the CUS as a collaborative effort among vendors, wherein each vendor ensures that new assignments do not overlap with existing ones. This shared approach ensures optimal use of the limited UVs available. It also avoids the obvious problems that applications would have in identifying the vendor of a font in order to determine which vendor's CUS assignments were in effect.

In addition, we regard CUS assignments as useful until OpenType features become widely available in fonts and supported by applications.

Apple has published CUS assignments in the range U+F800 through U+F8FF. Adobe uses CUS assignments in the range U+F600 through U+F7FF, as well as the same assignments for some characters in Symbol and Zapf Dingbats from the Apple-defined range. Microsoft is treating the range U+F000 through U+F0FF as the definition of its symbol code page.

Glyph names of future CUS characters should follow the uni<CODE> naming convention.

See the data file "Unicode's CUS as used by Adobe" for a description of the CUS assignments and their Unicode-style character decompositions.

4.c. Double-mappings

AGL contains certain double-mappings, i.e. glyphs that are mapped to two UVs for compatibility with legacy fonts. If a developer wishes to provide separate designs for a double-mapping (see section 2.c.i), then one of the UVs may have a uni<CODE> glyph name. AGL 1.2 contains the following double-mapped glyphs:

    Glyph name      Note*  UV    Descriptive name
    Delta           -      2206  INCREMENT
                    uni    0394  GREEK CAPITAL LETTER DELTA
    Omega           -      2126  OHM SIGN
                    uni    03A9  GREEK CAPITAL LETTER OMEGA
    Scedilla        -      015E  LATIN CAPITAL LETTER S WITH CEDILLA
                    cus    F6C1  LATIN CAPITAL LETTER S WITH CEDILLA
    Tcommaaccent    -      021A  LATIN CAPITAL LETTER T WITH COMMA BELOW
                    uni    0162  LATIN CAPITAL LETTER T WITH CEDILLA
    fraction        -      2044  FRACTION SLASH
                    uni    2215  DIVISION SLASH
    hyphen          -      002D  HYPHEN-MINUS
                    uni    00AD  SOFT HYPHEN
    macron          -      00AF  MACRON
                    uni    02C9  MODIFIER LETTER MACRON
    mu              -      00B5  MICRO SIGN
                    uni    03BC  GREEK SMALL LETTER MU
    periodcentered  -      00B7  MIDDLE DOT
                    uni    2219  BULLET OPERATOR
    scedilla        -      015F  LATIN SMALL LETTER S WITH CEDILLA
                    cus    F6C2  LATIN SMALL LETTER S WITH CEDILLA
    space           -      0020  SPACE
                    uni    00A0  NO-BREAK SPACE
    tcommaaccent    -      021B  LATIN SMALL LETTER T WITH COMMA BELOW
                    uni    0163  LATIN SMALL LETTER T WITH CEDILLA

*    Note legend:
    uni    uni<CODE> override allowed for this UV.
    -      uni<CODE> override not allowed for this UV.
    cus    uni<CODE> override not allowed for this UV. It is double-mapped to the CUS for compatibility with previous versions of AGL, so a uni<CODE> override isn't needed.

For example, if different designs are desired for MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC), then glyph "mu" should be designed as MICRO SIGN and glyph "uni03BC" should be designed as GREEK SMALL LETTER MU.
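The table and its legend can be encoded so that a production tool derives the pair of glyph names for separate designs. This Python sketch is illustrative only; `DOUBLE_MAPPINGS` transcribes the table above, and the function name `separate_design_names` is an assumption.

```python
# AGL name -> (first UV, second UV, note on the second UV), from the table.
DOUBLE_MAPPINGS = {
    "Delta": (0x2206, 0x0394, "uni"),
    "Omega": (0x2126, 0x03A9, "uni"),
    "Scedilla": (0x015E, 0xF6C1, "cus"),
    "Tcommaaccent": (0x021A, 0x0162, "uni"),
    "fraction": (0x2044, 0x2215, "uni"),
    "hyphen": (0x002D, 0x00AD, "uni"),
    "macron": (0x00AF, 0x02C9, "uni"),
    "mu": (0x00B5, 0x03BC, "uni"),
    "periodcentered": (0x00B7, 0x2219, "uni"),
    "scedilla": (0x015F, 0xF6C2, "cus"),
    "space": (0x0020, 0x00A0, "uni"),
    "tcommaaccent": (0x021B, 0x0163, "uni"),
}


def separate_design_names(agl_name: str):
    """Return {UV: glyph name} for separate designs of a double-mapping.

    Returns None when the second UV does not allow a uni<CODE> override.
    """
    first_uv, second_uv, note = DOUBLE_MAPPINGS[agl_name]
    if note != "uni":
        return None
    # The AGL name keeps the first UV; the second gets a uni<CODE> name.
    return {first_uv: agl_name, second_uv: f"uni{second_uv:04X}"}
```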

Developers should be aware, however, that the Adobe PostScript driver for Windows re-encodes a Type 1 (PFB/PFM) font to a particular Windows code page on the basis of AGL glyph names; it does not recognize uni<CODE> glyph names. For example, when the driver needs to re-encode to Windows Greek, it will use glyph "mu" for both Windows Greek code point 0xB5 (MICRO SIGN) and code point 0xEC (GREEK SMALL LETTER MU), even if the glyph "uni03BC" were present.

This limitation is not present for GDI printing with OpenType or TrueType fonts. Adobe's current plans are to provide separate designs for some double-mappings in OpenType fonts. If developers produce PFB fonts that have separate designs for double-mappings, giving both glyphs in each pair the same advance width will prevent lines from rewrapping when the driver uses the AGL-named glyph in place of the uni<CODE> glyph.


5. Document changes

v1.1    [17 December 1998] Generally revised entire document. Updated most tables and data files. Added section on selecting glyph names. Pseudocode for extracting semantics expanded to include non-Unicode ligatures and glyphic variants. Added section on providing separate designs for double-mappings. Removed section on discrepancies with WGL4 (no longer applicable; WGL4 was updated).
v1.0 [10 November 1997] First version.




Copyright © 1999 Adobe Systems Incorporated.
All rights reserved.