Typesetting Large Documents

For a few years now I have been working on my compilation of charters for a region in Austria. It is only available digitally, as a web page and in a print-like format. The print-like format currently has 232 pages. Here, I want to talk about what I use for typesetting and how it helps with reformatting and restructuring the document.

Let’s first clarify the term “typesetting”. Originally, it meant arranging physical type on pages for use in printing. In the digital era it evolved to mean any composition of text for publication or display. So what I’m doing here in this blog post is essentially typesetting as well. Below I distinguish the system I use for typesetting from a “word processor”, a category of typesetting software that includes LibreOffice Writer and Microsoft Word.

Intro From the Past

Let’s say you are attempting to write a large document of 50–100 or even 200 pages. I can hardly imagine how hard this must have been in the pre-digital era. Even today it is not so easy. You might have a table of contents, a bibliography, citations, maybe a list of figures, several chapters and so on. In 2003 (yes, over 20 years ago) I had to write my first larger (pre-university) document. I used Microsoft Word for this and it worked out OK but was not a smooth experience. Things “kept happening”. For example, some parts looked different after an edit that did not change their formatting. Edits also affected other parts of the document and not just the edited section. Because of this I looked for other options for my university documents.

I’m sure that word processors have evolved in the last 20 years. I’m not in a position to judge because I have not used them for larger documents since. What follows is not a comparison with word processors. I’m also not arguing that other approaches are better. I’m just showing the aspects that helped me with my document of currently 232 pages. I will also talk a bit about some of the problems that my approach has.

TeX and LaTeX

I’m using LaTeX (pronounced “La-Tech”), originally written in the 1980s by Leslie Lamport. It aims to make TeX (pronounced “Tech”), which Donald E. Knuth began developing in 1977, easier to use. Since then, contributors have continuously updated the system, but the architecture remains largely unchanged.

A quick example of what LaTeX markup looks like (taken from Wikipedia):

\documentclass{article} % Starts an article
\usepackage{amsmath} % Imports amsmath
\title{\LaTeX} % Title

\begin{document} % Begins a document
\maketitle
\LaTeX{} is a document preparation system for
the \TeX{} typesetting program. It offers
programmable desktop publishing features and
extensive facilities for automating most
aspects of typesetting and desktop publishing,
including numbering and cross-referencing,
tables and figures, page layout,
bibliographies, and much more. \LaTeX{} was
originally written in 1984 by Leslie Lamport
and has become the dominant method for using
\TeX; few people write in plain \TeX{} anymore.
The current version is \LaTeXe.

% This is a comment, not shown in final output.
% The following shows typesetting power of LaTeX:
\begin{align}
E_0 &= mc^2 \\
E &= \frac{mc^2}{\sqrt{1-\frac{v^2}{c^2}}}
\end{align}
\end{document}

It might be clear now why it is easier to learn for people with a (computer) science background. But there is also software that helps you write LaTeX markup. Computer scientists and mathematicians often use LaTeX because it excels at typesetting formulas. The wider (natural) science community uses it as well, but you find it less often in the humanities.

Being In Control

The seemingly complex way of annotating your text is what gives you the amount of control over typesetting that you often need. Word processors are usually WYSIWYG (“what you see is what you get”) programs. They hide the complexity of the markup from the user. Behind the scenes, such documents are nowadays usually stored in an XML format. The user depends on the user interface to translate correctly between the display and the markup. In contrast, LaTeX advocates often half-jokingly call it WYSIWYW (“what you see is what you want”). This amount of control comes with increased complexity.

Macros and Other Advantages

In LaTeX you can define your own markup commands that consist of other markup. Adding logic through a programming language is not easily possible. You can work around this by using additional packages, which allow some limited flow control. Defining your own logic is not a feature unique to LaTeX; word processors offer that as well. What LaTeX lets you do is explicitly define reusable markup. I don’t know if this is possible with the macro functionality of word processors.
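
To make this concrete, here is a minimal sketch of a reusable command with a bit of flow control. It assumes the etoolbox package, which provides \ifstrempty. The command name \person, its two parameters and the names in the body are made up for illustration and are not part of my compilation:

\documentclass{article}
\usepackage{etoolbox} % provides \ifstrempty{<string>}{<if empty>}{<if not empty>}

% \person{name}{role}: hypothetical example command.
% Prints the name in italics; appends the role in parentheses
% only when the second argument is not empty.
\newcommand{\person}[2]{%
  \emph{#1}%
  \ifstrempty{#2}{}{ (#2)}%
}

\begin{document}
\person{Jane Doe}{scribe} and \person{John Doe}{}
\end{document}

Every use of \person is formatted the same way, and changing the definition changes all occurrences at once.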

In my charter compilation I make extensive use of reusable markup. It helps me to better structure my data. I then use this structure to extract data to present in an HTML version of the compilation. It also ensures that the compilation consists of repeatable elements and that each element looks the same. A charter is one such element.

This is what my “charter” command, my piece of reusable markup, looks like in code:

\newcommand{\charter}[9] {
\ifstrempty{#1}{}{\defval{charter#2}{#1}}%
\normalsize\ifstrempty{#1}{\item}{\item[#1]}\phantomsection%
\iftoggle{usebackref}{%
\backreflabel{#2} \emph{#3}%
}{%
\label{#2} \emph{#3}%
}

\nopagebreak%
\ifstrempty{#4}{}{\vspace{-3pt}\small#4}%

\ifstrempty{#6}{\scriptsize{}\ifstrempty{#5}{}{#5}\ifstrempty{#7}{}{; #7}}%
{%
\ifstrempty{#5}{}{\scriptsize{}#5}

\ifstrempty{#6}{}{\small#6}

\ifstrempty{#7}{}{\scriptsize{}#7}
}%

\ifstrempty{#8}{}%
{\small{}Online-Edition: \href{#8}{\texttt{\urltransform{#8}}}%
\ifstrempty{#9}{}{ \scriptsize{}(letzter Zugriff: \DTMDate{#9})}}%
\normalsize
}

It takes nine parameters and then does things which I will not explain in detail, as they are not really relevant here. With the command I can hide the complexity of the details. I can also use the command over and over again without worrying about the formatting of each individual charter. And when I change the markup, the change affects all charters in the same way.
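
A call of the command could look roughly like the following sketch. All values are placeholders and not a real charter, and the comments only describe what the definition above does with each parameter. Because the command issues an \item, it has to be used inside a list environment:

\charter
  {1}                      % #1: item label (may be left empty)
  {charter-example}        % #2: key used for \label or \backreflabel
  {Heading text}           % #3: set in italics
  {Main text}              % #4: set in \small
  {First note}             % #5: set in \scriptsize
  {Second text block}      % #6: set in \small
  {Second note}            % #7: set in \scriptsize
  {https://example.org/1}  % #8: URL of the online edition (may be left empty)
  {2024-01-31}             % #9: date of last access, passed to \DTMDate

Each charter in the document is just another call like this one with different values.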

As described above, one can extend the system with packages. It can also produce output in several formats, although PDF is probably the most used one. Some scientific journals require authors to submit articles in LaTeX markup. Good-looking typography, exceptional bibliography and citation handling, and perfectly typeset formulas are hallmarks of the system.

Some Problems of LaTeX

But it is not without problems. One of the more obvious ones is that it has a rather steep learning curve. It is also often not intuitively accessible to all users. To be honest, I have never fully grasped how LaTeX works internally. This is a problem when you try to do things that do not work out of the box. I might not have tried hard enough. Maybe the problem is mine, and not that the system’s inner workings are quite complex and not really intuitive.

TeX and LaTeX have served the (scientific) community well for decades. The system predates most modern markup languages and its architecture feels a bit “archaic” at times. Although (La)TeX is extensible through packages, full distributions bundle most of them, so installing one can mean downloading several gigabytes of data up front. However, this might not be a significant issue anymore nowadays. The system is a bit slow when compiling documents. A single compilation run usually cannot use more than one CPU core. It also produces loads of intermediate output files, which can feel a bit clunky.
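
Where some of these intermediate files come from can be seen in a small sketch. The section title and the label name are made up for illustration. Cross-references and the table of contents are written to helper files during one run and read back in on the next, which is also why a document often has to be compiled more than once:

\documentclass{article}
\begin{document}
% Read from the .toc file written during the previous run;
% empty on the very first run.
\tableofcontents

\section{Introduction}\label{sec:intro}
% \label writes the section number to the .aux file.
% \ref reads it from there and prints ?? until a second run.
See section~\ref{sec:intro}.
\end{document}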

Summary

LaTeX has really helped me in typesetting documents for quite some time. I would not want to go back to standard word processors for more complicated documents. The results might require some tweaking to get away from the “standard” LaTeX look, which is immediately recognizable. But they always look good. I would not want to give up the control I have over all the parts of my document, and I accept the higher complexity that comes with it. At the same time the system is starting to feel a bit dated. I have also never really got a grasp of its inner workings.