What persistent content identifiers are and why they are used

How do you access content on the internet? You probably use search engines, follow links, or both. By now most people are used to the somewhat cryptic format of URLs (Uniform Resource Locators). But how often have you followed a link only to be greeted by an equally cryptic page saying that the page was not found (perhaps also mentioning the number 404)?

If I had to venture a guess, I would say: quite often. Online content tends to be transient. Websites come and go, or they are restructured or “re-launched”, which often wipes old content or moves it to a different location without pointing the old location to the new one. As a result, content that is available today might be gone tomorrow. For an information society this can be a problem, especially for science. Some online content might be dispensable, or at least some people think so (and others consider dispensable what some see as important), but its disappearance is still a loss.

What persistent identifiers are

Persistent identifiers (PIDs) do not reduce the effort needed to keep content available, but if you know the identifier it is easier to find the object it refers to. So what are persistent identifiers? They usually have three characteristics, two of which are already in the name. They:

  • are persistent, meaning long-living
  • identify a thing in a unique way
  • are somewhat opaque, meaning “cryptic”

Before going into detail about these characteristics, let’s consider some examples.

Identifier name | Example
--- | ---
ARK (Archival Resource Key) | ark:/74904/v8002r
DOI (Digital Object Identifier) | 10.7906/indecs.10.3.3
Handle | 11471/104.10.1.60
ISBN (International Standard Book Number) | 9783934223189
URN NBN (Uniform Resource Name National Bibliography Number) | urn:nbn:at:at-ubw:1-30448.87836.410759-3

Examples of persistent identifiers

Most PIDs (including all types above) can be represented as part of a URL, which makes them accessible on the web. URNs are not URLs but a separate naming scheme. However, URNs can also be embedded in URLs.

Characteristics of persistent identifiers

Let’s look at the characteristics now. How persistent an identifier is naturally depends on the behavior of whoever registers it, but everyone who registers such identifiers in theory agrees to certain retention policies or at least to a long-term commitment. Every issued PID is unique, meaning that the issuer has to ensure that the same identifier is never issued twice. Often an issuer controls only part of the PID, a namespace. The PID is then assigned to a thing, which can be physical, digital or both: a painting, a book, a document and so on. To understand the opaqueness characteristic, let’s look at a concrete example.

Take the famous painting “Mona Lisa” by Leonardo da Vinci, which is owned by the Louvre in Paris, France. It so happens that the Louvre has also assigned a PID, an ARK, to it: ark:/53355/cl010062370. It identifies a thing that is both physical and digital: the painting hanging in the Louvre and the web page about it. The ARK is unique because 53355 is the NAAN (Name Assigning Authority Number) of the Louvre, and only the Louvre assigns ARKs under this number. For the rest of the PID, cl010062370, the Louvre has to ensure uniqueness itself. The identifier is persistent by commitment of the Louvre, and it is opaque because it does not read ark:/louvre/mona-lisa but ark:/53355/cl010062370.
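To make the two parts of an ARK tangible, here is a minimal Python sketch that splits an identifier into its NAAN and the locally assigned name. The names `Ark`, `ARK_PATTERN` and `parse_ark` are made up for this illustration, and the pattern is a deliberate simplification rather than a full implementation of the ARK specification.

```python
import re
from typing import NamedTuple


class Ark(NamedTuple):
    naan: str  # Name Assigning Authority Number, e.g. "53355" for the Louvre
    name: str  # the part the authority itself has to keep unique


# Accepts "ark:/NAAN/name" and "ark:NAAN/name", and tolerates a dropped colon.
# Simplified on purpose: qualifiers and variants are not handled separately.
ARK_PATTERN = re.compile(r"^ark(?::/?|/)(?P<naan>[0-9a-z]+)/(?P<name>\S+)$", re.IGNORECASE)


def parse_ark(identifier: str) -> Ark:
    """Split an ARK into its NAAN and the locally assigned name."""
    match = ARK_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an ARK: {identifier!r}")
    return Ark(naan=match.group("naan"), name=match.group("name"))


mona_lisa = parse_ark("ark:/53355/cl010062370")
print(mona_lisa.naan)  # 53355
print(mona_lisa.name)  # cl010062370
```

Note that the code never needs to know what the name part means. That is opaqueness at work: the identifier can be processed without understanding it.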

Why is opaqueness a good thing? If the identifier were ark:/louvre/mona-lisa, it would be immediately apparent what it refers to. But what is immediately apparent today might become a legacy burden tomorrow. Non-opaque identifiers do not age or travel well. Consider the Louvre changing its name or the painting being transferred to another organization. As 53355/cl010062370 the identifier could stay as it is, but as louvre/mona-lisa it would either have to change or it would keep confusing people because it reads “louvre” and not the name of the new organization. Not to mention that louvre/mona-lisa only makes sense to people who speak certain languages anyway.

Choosing a persistent identifier system

Which of the available PID systems you choose depends on your needs, your budget, or both. Some are free of charge, while some organizations charge a fee for each issued identifier. Operating and maintaining such a system costs money, but depending on traffic volume and system architecture it is quite cheap, because most systems add little technological complexity to the existing web infrastructure (despite some operators telling you a different story).

To help you choose the PID system that suits you, there is a digital guide published by the National Archive of the Netherlands.

Since this is all about open/libre things, I have chosen ARKs, which in my opinion fit that scheme best: they are open (NAANs are centrally registered, but the criteria are quite inclusive), free of charge and distributed (meaning that resolving identifiers is delegated to registrants within their namespace). This gives a high degree of freedom, and you can assign a virtually unlimited number of identifiers within your namespace, but it is technically more involved. However, ARKs are also available “as a service”.

Referencing persistent identifiers

One of the main use cases of PIDs is referencing them and their objects. This can be done with a stable link or by writing down the identifier, which is also helpful for references and citations in scientific works. In print it is a bit less useful, because the longer identifiers are not straightforward to type out, even if they are often still shorter than the average web link.

Each PID system usually has one or more resolvers, which are systems that translate a PID into the location of the object it references. Examples are doi.org for DOIs and n2t.net for ARKs. There are also meta-resolvers that can handle several PID systems, such as n2t.net or identifiers.org. For example, both 10.7906/indecs.10.3.3 (DOI) and ark:/74904/v8002r (ARK) can be resolved by identifiers.org.
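In practice, turning a PID into a clickable link usually means appending it to the address of a resolver. The following Python sketch shows this for the two resolvers named above. The `RESOLVERS` table and the `resolver_url` function are names invented for this example, and the https:// prefix plus the "resolver address followed by identifier" layout are the conventions I assume here.

```python
from urllib.parse import quote

# Resolver base addresses for the PID systems mentioned in the text.
# Assumption: both resolvers accept the identifier directly after the host.
RESOLVERS = {
    "doi": "https://doi.org/",
    "ark": "https://n2t.net/",
}


def resolver_url(system: str, identifier: str) -> str:
    """Build a web link that asks the resolver to look up the identifier."""
    base = RESOLVERS[system.lower()]
    # Keep "/" and ":" readable, escape anything that is unsafe in a URL path.
    return base + quote(identifier, safe="/:")


print(resolver_url("doi", "10.7906/indecs.10.3.3"))
# https://doi.org/10.7906/indecs.10.3.3
print(resolver_url("ark", "ark:/74904/v8002r"))
# https://n2t.net/ark:/74904/v8002r
```

The identifier itself stays the same no matter which resolver is put in front of it, which is why it is the identifier and not the full link that should be treated as the stable reference.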

The resolvers use HTTP redirects, a mechanism built into the web. Technically, the resolver knows which system to contact for a specific PID system and redirects there. The resolver for that PID system then either knows directly how to resolve the identifier or at least knows which organization to redirect to for the specific namespace. If you use identifiers.org to resolve an ARK of this website, it takes several steps:

  1. you ask identifiers.org about ark:/74904/v8002r
  2. identifiers.org knows that it should contact n2t.net to resolve ARKs
  3. n2t.net knows that it should contact ark.gminfo.at to resolve ARKs for 74904
  4. ark.gminfo.at knows that it should contact gminfo.at to resolve ARKs starting with v8 (called a shoulder in ARK terminology)
  5. you end up at the page for the ARK 74904/v8002r
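The chain of steps above can be modeled as a small routing table in which every host only knows the next hop for a given identifier pattern. This Python sketch is a simulation, not real network code: the host names come from the steps above, while the `ROUTES` table, its regular expressions and the functions `next_hop` and `trace` are assumptions made for illustration.

```python
import re

# Illustrative routing table: (host, pattern on the identifier, next host).
# Each host only knows where to send identifiers that match its pattern.
ROUTES = [
    ("identifiers.org", re.compile(r"^ark:"), "n2t.net"),
    ("n2t.net", re.compile(r"^ark:/?74904/"), "ark.gminfo.at"),
    ("ark.gminfo.at", re.compile(r"^ark:/?74904/v8"), "gminfo.at"),
]


def next_hop(host, identifier):
    """Return the host that `host` would redirect to, or None if it is the end."""
    for rule_host, pattern, target in ROUTES:
        if rule_host == host and pattern.match(identifier):
            return target
    return None


def trace(start_host, identifier, max_hops=10):
    """Follow the simulated redirects and return the list of visited hosts."""
    hops = [start_host]
    while len(hops) <= max_hops:
        target = next_hop(hops[-1], identifier)
        if target is None:
            break
        hops.append(target)
    return hops


print(" -> ".join(trace("identifiers.org", "ark:/74904/v8002r")))
# identifiers.org -> n2t.net -> ark.gminfo.at -> gminfo.at
```

No single system needs the complete picture. Each one delegates to the next more specific one, which is what makes the resolution distributed.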

Some technical details

ark.gminfo.at uses a simple web server (nginx) that issues redirects depending on URL patterns. The systems described above use a similar technique. To create ARKs I use NOIDs (Nice Opaque Identifiers), which are not restricted to ARKs. For publishing articles here I use WordPress, so I created a plugin that associates PIDs with posts; currently it only supports ARKs. NOID was originally implemented in Perl. There was already a PHP implementation of a NOID minter (minting here means creating a PID), but I found a Ruby implementation easier to use (because it didn’t require Berkeley DB), so I ported that one to PHP and integrated it into the plugin.
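To give an idea of what minting involves, here is a heavily simplified Python sketch in the spirit of NOID. It is not the plugin's code: real NOID minters work with templates and can hand out identifiers in pseudo-random order, whereas this sketch just encodes a counter. The check character follows the algorithm as I understand it from the NOID documentation. The NAAN 99999 and the shoulder fk4 are placeholders commonly used for test ARKs, and the function names are invented for this example.

```python
# NOID's "extended digits": the digits plus consonants, without vowels, "l" or "y".
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(identifier: str) -> str:
    """Position-weighted checksum over "NAAN/name"; other characters count as 0."""
    total = sum(
        position * max(XDIGITS.find(char), 0)
        for position, char in enumerate(identifier, start=1)
    )
    return XDIGITS[total % len(XDIGITS)]


def encode(number: int, width: int) -> str:
    """Write a counter value with a fixed number of extended digits."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, len(XDIGITS))
        digits.append(XDIGITS[remainder])
    if number:
        raise ValueError("counter does not fit into the requested width")
    return "".join(reversed(digits))


def mint(naan: str, shoulder: str, counter: int, width: int = 3) -> str:
    """Create an ARK from a counter value and append a check character."""
    base = f"{naan}/{shoulder}{encode(counter, width)}"
    return f"ark:/{base}{check_char(base)}"


def is_valid(ark: str) -> bool:
    """Verify the trailing check character of an ARK minted as above."""
    body = ark.removeprefix("ark:/")
    return len(body) > 1 and check_char(body[:-1]) == body[-1]


print(check_char("13030/xf93gt2"))  # q (the example from the NOID documentation)
print(is_valid(mint("99999", "fk4", 1)))  # True
```

The check character catches most typing errors, such as a single wrong character or two swapped neighbours, which helps when an identifier is copied from print.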