Let’s talk again about my collection of charters for a region in Austria. When I started I did not have a fully laid-out plan; I just got going, so starting with a text document of some sort felt like the natural thing to do. The process is, as expected, fully digital, and the result was never intended to be printed, at least not at large scale. As the collection grew, it became harder to think about the whole in this text representation. In this post I want to focus on this aspect, which I find interesting because it touches upon digital media, media theory and theory of science.
To be fully accurate, the title would have to be “Outgrowing a Print-Like Representation”, because the result was never meant to be printed. “Print-like” better describes a specific form of digital representation of data. But that wouldn’t sound too good, would it? In a previous post I already talked about how typesetting the document works. Here I’d like to describe how the document evolved, when I had to stop thinking about it as a document and start thinking about it as a database, and what consequences this has.
Why all this?
I started this because of my dissatisfaction with casual or anecdotal references to charters like “X was first mentioned in a charter from 1351”. Right, but which charter? Where do I find it? What is its content? Of course there were charter editions like “Monumenta Historica Ducatus Carinthiæ”, and so I started with those. At first I did not have a clear picture of the criteria that would decide whether a charter should be part of my compilation. There was only a vague guideline that it should concern the region around Gmünd. It was clear from the beginning that if you focus on only one place like Gmünd, you cannot present a full picture of what was going on. So the boundaries are somewhat fuzzy, and I would need to add some charters that do not strictly fall within this region.
It was not even clear that it would be about charters. At first I wanted to compile things that happened, whether they were recorded in a charter or not. Even now I’m still adding things that are not part of a charter just because they are important to note. But I knew from the beginning that it should not become some sort of chronicle. I also found it unhelpful that authors of regional history often omitted the sources for some of their statements. Where did they get that from? Is it true? One author then built upon another, and quickly you were heading in some obscure direction. I wanted to disentangle some of the misconceptions that had become established over the years by rigorously citing the sources and by noting where authors made differing or erroneous statements.
Evolution
With that the content evolved. I had entries for events, charters and the bibliography. As the content grew, I became concerned that information might not be easily accessible in my PDF format. It had advantages over print: full text search and hypertext. References to sources or charters had links. Events were sorted chronologically and accessible by century. A big change came when I decided that I should have registers of persons and locations. This is a big advantage because persons or locations may appear in several charters, and names and spellings often differ. It is also not always clear which person or location is meant. Attributing mentioned persons or locations to identifiable persons or places is part of the research work. Reading charters can be challenging as well; I have reported on this in a previous post.
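The register idea can be illustrated with a small data structure. This is only a sketch under assumptions: the function names, the charter identifiers and the spellings are invented for illustration and are not taken from the actual collection or from any charter.

```python
from collections import defaultdict

# register[canonical name] -> list of (charter id, spelling as written).
# The canonical name stands for the identified person or place.
register: dict[str, list[tuple[str, str]]] = defaultdict(list)


def record_mention(canonical: str, charter_id: str, spelling: str) -> None:
    """Attach one mention in a charter to an identified person or place."""
    register[canonical].append((charter_id, spelling))


def charters_for(canonical: str) -> list[str]:
    """All charters that mention the entry, regardless of spelling."""
    return sorted({charter_id for charter_id, _ in register[canonical]})


def spellings_for(canonical: str) -> set[str]:
    """All spellings under which the entry appears."""
    return {spelling for _, spelling in register[canonical]}


# Invented example data: identifiers and spellings are placeholders.
record_mention("Gmünd", "C0001", "Gmund")
record_mention("Gmünd", "C0002", "Gemuend")
record_mention("Gmünd", "C0002", "Gmund")

print(charters_for("Gmünd"))
print(spellings_for("Gmünd"))
```

The point of the structure is that the attribution (which spelling belongs to which identified place) is stored once, as a research result, instead of being left to a full text search over varying spellings.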
Challenges of a print format
The print format, which we have used to convey information for centuries, has well-known structures. There is a table of contents, there are chapters and sections, footnotes and registers, and most importantly page numbers. All those parts help to make the information more accessible. The table of contents maps chapters and sections to page numbers for easy access. Additional information is placed in footnotes, which readers reach via numbers or some other identifier; footnotes are either at the bottom of the page or in a separate listing. Registers, like the person and location registers mentioned above, or a bibliography link pieces of information to pages or to another form of identifier. All this makes the serial nature of print more accessible.
A digital print-like format adds the advantages mentioned above: full text search and hyperlinks for those identifiers. With them you can immediately jump to chapters or sections, from the text to a footnote, from a mentioned person to the register entry or the other way round, and so on. Despite these advantages I faced the following challenges.
- citability – page numbers not necessarily suitable for citations
- non-specific search – only full text, no categories
- limited navigation or segmentation of content – leading to loss of overview
- limited visualization options – multimedia, interactivity
Never finished
The citability problem is rather specific to my case. It is not a general problem of print-like formats, or at least not of those that are complete upon publication. Citations with page numbers presuppose that the numbers are final upon publication. For the vast majority of published works this is the case. For digital print-like documents, however, it is easy to publish several versions as work progresses, and that is a problem for citations: to which version does a page reference point?
Pages might change between versions, so citations with page numbers become unreliable. I used two mitigation strategies. First, every document comes with a date of publication, so a citation can name the version it refers to. To keep such page references resolvable, however, I would need to keep every version ever published available, which I do not want to do. Second, I gave charters and notes a dedicated identifier that does not change, and I never reuse an identifier when I remove an element. Still, this does not solve all problems. I did not add identifiers for persons and locations. The removal of elements also remains a problem, but it would be a problem for other presentation formats as well.
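The second strategy, identifiers that never change and are never reused, can be sketched roughly as follows. This is a minimal illustration and not the mechanism actually used for the collection: the class `IdRegistry`, the `C0001`-style identifier format and the example titles are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class IdRegistry:
    """Hands out identifiers that are never reused, even after removal."""

    prefix: str
    next_number: int = 1
    active: dict[str, str] = field(default_factory=dict)
    retired: set[str] = field(default_factory=set)

    def add(self, title: str) -> str:
        # The counter only ever grows, so a removed number cannot come back.
        ident = f"{self.prefix}{self.next_number:04d}"
        self.next_number += 1
        self.active[ident] = title
        return ident

    def remove(self, ident: str) -> None:
        # Remember the identifier so old citations can still be explained.
        del self.active[ident]
        self.retired.add(ident)

    def resolve(self, ident: str) -> str:
        if ident in self.retired:
            return "removed"
        return self.active[ident]


# Hypothetical entries, for illustration only.
charters = IdRegistry(prefix="C")
first = charters.add("Charter of 1351")
second = charters.add("Charter of 1380")
charters.remove(first)
third = charters.add("Charter of 1402")

print(first, second, third)
print(charters.resolve(first))
```

Keeping a record of retired identifiers is one way to soften the removal problem: a citation to a removed element no longer resolves to content, but it at least resolves to the information that the element once existed.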
Search and navigation
While a digital document allows full text search, it cannot differentiate between categories. Other representations may let you restrict a search to charters or persons, or filter results by category or data field, which makes for a better search experience. Similarly, while hyperlinks in a digital print-like representation improve navigation, navigation still relies primarily on page numbers. Pages are also the primary unit for segmenting content in print-like representations. Other representations can display content selectively and thereby aid information retrieval and keep the user from losing the overview. If only pages are available, the options are significantly limited.
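The difference between plain full text search and a category-aware search can be shown with a few lines of code. The sketch below assumes that every entry carries a category field; the `Entry` type, the `search` function and the sample entries are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    ident: str
    category: str  # assumed categories: "charter", "person", "location"
    text: str


def search(entries: list[Entry], term: str,
           category: Optional[str] = None) -> list[Entry]:
    """Case-insensitive substring search, optionally limited to one category."""
    needle = term.lower()
    return [
        entry for entry in entries
        if needle in entry.text.lower()
        and (category is None or entry.category == category)
    ]


# Invented sample entries, for illustration only.
ENTRIES = [
    Entry("C0001", "charter", "Sale of a farm near Gmünd"),
    Entry("P0001", "person", "Judge at Gmünd"),
    Entry("L0001", "location", "Gmünd, town"),
]

# Full text search finds every entry that mentions the term ...
print([e.ident for e in search(ENTRIES, "Gmünd")])
# ... while the category filter narrows the result to charters.
print([e.ident for e in search(ENTRIES, "Gmünd", category="charter")])
```

A PDF viewer can only offer the first kind of search, because the categories exist in the layout but not as data that the search can use.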
Visualization
Print-like representations limit the options for interactivity and visualization. I can include images and diagrams, but the views are static and have to be chosen carefully. In other representations it might be possible to interact with a visualization and explore the data in more depth. A map view with data markers can be shown as a static version in a print-like representation, but the full potential of a map can only be unlocked in an HTML representation.
Plurality
A big advantage of a PDF document is that the data is available in a single container of reasonable size. I can download the PDF, store it locally, and it is available when needed. An HTML representation likely has more features but can become unavailable. Another downside of digital online content is that links are often not stable. There are mechanisms to avoid that, and I talked about them in another post about persistent identifiers. Some communities also frown upon digital content in citations, but if done right I think there is no reason for that.
Since I started with a PDF representation and it is still my main format, I wanted to keep it, but I also started to provide an HTML representation, which offers more possibilities and might be better for exploring the data. This also gives me the opportunity to make extensive use of links to individual charters, for easy use in chains of argument or deduction in hypertext-enabled publications. Find an example of this in this article.
Next steps
Currently the print-like document is still the format in which I enter the data. At the beginning this was great because it was flexible. Now I have a better idea of the necessary structure, and the amount of data has reached a level where it is necessary to take the last step towards a database rather than a document. This will also allow me to add more output formats, e.g. an XML representation for charters. I will keep the print-like representation for now, but it might become obsolete in the future.
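Once the data lives in structured records instead of a typeset document, additional output formats become small functions over the same record. The following sketch uses Python's standard `xml.etree.ElementTree` module; the record fields and the element names (`charter`, `date`, `summary`) are assumptions for illustration and do not follow any particular charter encoding standard.

```python
import xml.etree.ElementTree as ET


def charter_to_xml(charter: dict[str, str]) -> str:
    """Serialize one charter record to a small XML fragment."""
    root = ET.Element("charter", attrib={"id": charter["id"]})
    ET.SubElement(root, "date").text = charter["date"]
    ET.SubElement(root, "summary").text = charter["summary"]
    return ET.tostring(root, encoding="unicode")


def charter_to_text(charter: dict[str, str]) -> str:
    """Render the same record as one line for a print-like listing."""
    return f"{charter['id']} ({charter['date']}): {charter['summary']}"


# Hypothetical record; the field names are part of the assumption.
record = {"id": "C0001", "date": "1351", "summary": "Example summary"}

print(charter_to_xml(record))
print(charter_to_text(record))
```

Both functions read the same record, so the print-like document and the XML output cannot drift apart in content; only the presentation differs.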