Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes – concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other – is causing me trouble.

Seeing how these terms get used in documents like Mathias Bynens’ “JavaScript has a Unicode problem” or Wikipedia’s piece on Han unification, I’ve gathered that these concepts are not the same thing and that it’s dangerous to conflate them, but I’m struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it’s full of “definitions” like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. …

Character. … (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. …

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. …

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly does each of these concepts differ from the others, and in what circumstances would they not have a one-to-one relationship with each other?
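
To make the question concrete, here is the kind of mismatch I think I'm looking at, as a minimal Python sketch. It assumes Python 3 strings (which are indexed by code point) and uses only the standard `unicodedata` module; the helper name `describe` is my own, purely for illustration. It only shows code points and bytes diverging from what looks like a single "character" on screen. It doesn't settle what counts as a grapheme or a glyph.

```python
import unicodedata

# The same visible "é" written two ways: one precomposed code point,
# or a plain "e" followed by a combining accent.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def describe(text):
    """Return (code point count, UTF-8 byte count, code point names)."""
    names = [unicodedata.name(ch) for ch in text]
    return len(text), len(text.encode("utf-8")), names


print(describe(precomposed))
print(describe(decomposed))

# The two strings render alike but are not equal code point by code point.
print(precomposed == decomposed)

# NFC normalization composes the pair back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

If I'm reading this right, the decomposed form is two code points that a reader would still call one character, which is exactly where the one-to-one assumption from ASCII seems to break down.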

