Accurately representing medieval texts with digital fonts

When working with medieval texts (e.g. from the 14th century) written in, say, German, you encounter characters and ligatures that most digital fonts cannot represent. Among the simpler examples are the predecessors of today’s German umlauts like „ü“. In 14th-century texts you find those written as „uͤ“. During transcription you could simply represent them with „ue“ or „u^e“. While this works, some of the characteristics of the original are altered or lost.

Unicode Blocks

Some of the simpler characters, like the example above, are part of Unicode. For others, the Medieval Unicode Font Initiative (MUFI) strives to encode characters and ligatures that are not yet part of the standard and assigns them to Private Use Areas (PUAs). There are special fonts that support MUFI subsets. For our simple example we can turn to Combining Diacritical Marks (U+0300 to U+036F), a block that is part of Unicode. A combining mark is a codepoint of its own that is rendered on top of the character preceding it. For example, „uͤ“ can be represented as the letter „u“ followed by U+0364 (combining Latin small letter e). However, mainstream digital fonts often do not support this block because it is used rather rarely.
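To make the combining mechanism concrete, here is a minimal Python sketch that uses only the standard library. It builds the „u“ plus U+0364 sequence and inspects it; the variable name is just for illustration.

```python
import unicodedata

# Base letter "u" followed by U+0364. The combining mark always comes
# after the base letter it is rendered on top of.
u_with_e = "u" + "\u0364"

# Print codepoint and official Unicode name of each part.
for ch in u_with_e:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# What looks like a single character is still two codepoints.
print(len(u_with_e))

# Unicode has no precomposed character for this pair, so NFC
# normalization leaves the sequence as it is.
print(unicodedata.normalize("NFC", u_with_e) == u_with_e)
```

Whether the result looks right on screen depends entirely on the font: it needs a glyph for U+0364 and has to position it above the base letter.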

Suitable Fonts

To find a suitable font you can turn to SIL (originally the Summer Institute of Linguistics), which is also known for its similarly named SIL Open Font License (OFL). SIL creates and maintains fonts that support a wide range of languages and characters. One of those fonts is Charis SIL, a serif font for Latin and Cyrillic scripts. I use it to represent medieval transcripts in print-ready or PDF documents, typesetting them with LaTeX on the XeTeX engine (xelatex).

For the web a serif font – at least in my opinion – works less well. A sans-serif alternative to Charis SIL is Andika (also from SIL). The text you are currently reading is typeset in that font (as of 2024). Nowadays web fonts are usually delivered in the Web Open Font Format (WOFF or WOFF2, the latter offering better compression). The files should be as small as possible so that a) (mobile) clients do not need to download a lot of data for font rendering and b) the browser loads the font fast enough to avoid a visible font change, which happens when the download takes too long and the browser has already rendered the text in a fallback font.

While download speeds are reasonably fast nowadays, it is still a good idea to strive for small font files. Unfortunately, being able to render a lot of different characters also means larger downloads. Andika Regular takes 391 KB in the WOFF and 259 KB in the WOFF2 variant. This does not sound like much, but other fonts are usually up to ten times smaller (mostly below 100 KB).

Font Subsetting

One way to mitigate this problem is to find out which characters you actually need by investigating the use cases of the font, e.g. in which language and script something will be written and which special characters will likely be used. Then you can embed only those characters in the font file and thereby reduce the file size. This process is known as subsetting.
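The first step, finding out which characters are needed, can be approximated by scanning text you already have. The following is a rough sketch using only the Python standard library. The sample string is made up for illustration and the function names are my own.

```python
from collections import Counter


def codepoints_used(text: str) -> Counter:
    """Count every non-whitespace codepoint that occurs in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def in_combining_diacritical_marks(ch: str) -> bool:
    """True if the character lies in the block U+0300 to U+036F."""
    return 0x0300 <= ord(ch) <= 0x036F


# Made-up sample, not taken from a real source.
sample = "fu\u0364r die bu\u0364rger"

used = codepoints_used(sample)
combining = sorted(ch for ch in used if in_combining_diacritical_marks(ch))

# List the combining marks the subset has to cover.
print([f"U+{ord(ch):04X}" for ch in combining])
```

Running something like this over a representative set of transcriptions gives a reasonable first guess at the blocks a subset should contain.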

So how do you subset a font in practice? My go-to site for all sorts of font-related things is Font Squirrel. They also have a web-based webfont generator which is capable of subsetting. With fonts the size of Andika it reaches its capacity limit, and strange things happen, like downloads that do not contain any font files. When you limit yourself to one font variant (either regular or italic or …) it works better. The next problem appears when choosing „Custom Subsetting“ and trying to enter a Unicode range, e.g. for combining diacritical marks. In short, I was not able to make it work for me.

Luckily, there are other options that do not require resorting to more involved methods like glyphhanger or even pyftsubset. I tried Fontie, which lets you specify Unicode blocks directly. It also chokes when processing all font variants simultaneously (timeout), but limiting yourself to one variant again works. It looks as if your block selection is not preserved between subsequent generation runs, but this seems to be only a display problem, because if you click into the multiselect field the selection is still there. To inspect the subsetted font files, FontDrop! can be used. It neatly displays the font contents, and with the „Type Yourself“ feature you can enter text or test characters to see whether the relevant glyphs are part of the font.
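For comparison, the more involved route can be scripted as well. pyftsubset is the command-line front end of the fontTools library, so the same subsetting is available from Python. The sketch below is not what I used for this site. It assumes fontTools is installed (plus the brotli package for WOFF2 output), and the file names and the chosen ranges are placeholders. The parse_ranges helper is my own and only mimics the U+XXXX-YYYY notation.

```python
def _hex(value: str) -> int:
    """Parse "U+0364" or "0364" as a hexadecimal codepoint."""
    value = value.strip().upper()
    if value.startswith("U+"):
        value = value[2:]
    return int(value, 16)


def parse_ranges(spec: str) -> list:
    """Expand e.g. "U+0020-007E,U+0364" into a sorted list of codepoints."""
    codepoints = set()
    for part in spec.split(","):
        start, _, end = part.partition("-")
        first = _hex(start)
        last = _hex(end) if end else first
        codepoints.update(range(first, last + 1))
    return sorted(codepoints)


def subset_font(src: str, dst: str, spec: str) -> None:
    """Write a WOFF2 subset of src that keeps only the codepoints in spec."""
    # Imported here so that parse_ranges works without fontTools installed.
    from fontTools import subset

    options = subset.Options()
    options.flavor = "woff2"  # needs the brotli package
    font = subset.load_font(src, options)
    subsetter = subset.Subsetter(options)
    subsetter.populate(unicodes=parse_ranges(spec))
    subsetter.subset(font)
    subset.save_font(font, dst, options)


# Assumed selection: Basic Latin, Latin-1 Supplement and
# Combining Diacritical Marks.
SPEC = "U+0020-007E,U+00A0-00FF,U+0300-036F"
print(len(parse_ranges(SPEC)), "codepoints requested")

# Placeholder file names:
# subset_font("Andika-Regular.ttf", "Andika-Regular-subset.woff2", SPEC)
```

Codepoints for which the source font has no glyph are simply not part of the output, so it is still worth inspecting the result afterwards.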

With that I was able to reduce the file size of Andika Regular to 58 KB for my subset in the WOFF2 variant. That is about 4.5 times smaller than the complete file (259 KB). Nice.