
University Library, University of Illinois at Urbana-Champaign

Metadata Services: Best Practices

Acquisitions & Cataloging Services

Preservation Metadata

All projects dealing with digital content should attempt to follow the guidelines established in the most recent version of the PREMIS (PREservation Metadata: Implementation Strategies) Data Dictionary for Preservation Metadata.

The PREMIS guidelines describe in great detail the preservation metadata that would ideally be captured. Projects are only expected to meet the few minimum requirements detailed below.

PREMIS Data Model Overview

The PREMIS Data Model consists of five primary entities:

  • Intellectual Entity - content that is considered a single intellectual unit (may or may not be digital)
  • Objects - digital form(s) of an intellectual entity
  • Event - an event that impacts or involves at least one Object. Events are associated with or performed by an Agent
  • Agent - a person, organization, or software/system associated with Events that occur on an Object or Rights on an Object
  • Rights - assertions of very basic rights/permissions pertaining to an Object

The PREMIS Data Dictionary details the preservation metadata recommended for the last four entities (Objects, Events, Agents, and Rights). The Intellectual Entity is not covered by PREMIS, as it would be described using the best practices for descriptive metadata.

[Figure: PREMIS Data Model]

It's also worth noting that PREMIS deals with three types of Objects:

  • File - an actual file on an operating system (e.g. a PDF file)
  • Bitstream - a series of bytes within a File that has meaningful properties of its own (e.g. the header information within a JPEG2000 image file)
  • Representation - a set of files (including structural metadata) which are required to render a single intellectual entity (e.g. a webpage consisting of HTML, CSS, and images, all necessary to render something useable)

The PREMIS Data Model is described more completely in the Introduction of the PREMIS Data Dictionary for Preservation Metadata.
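
As a rough illustration of how the entities and object types above relate to one another, the following Python sketch models them as simple record types. The class and field names are our own assumptions for illustration and are not taken from the PREMIS Data Dictionary; Rights and Intellectual Entities are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ObjectCategory(Enum):
    """The three kinds of Object that PREMIS distinguishes."""
    REPRESENTATION = "representation"
    FILE = "file"
    BITSTREAM = "bitstream"


@dataclass
class Agent:
    identifier: str
    name: str


@dataclass
class PremisObject:
    identifier: str
    category: ObjectCategory


@dataclass
class Event:
    identifier: str
    event_type: str
    # An Event involves at least one Object ...
    object_ids: List[str]
    # ... and is associated with the Agent(s) that performed it.
    agent_ids: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.object_ids:
            raise ValueError("an Event must involve at least one Object")


# Example values reuse the sample identifiers that appear later in this guide.
pdf = PremisObject("2142/8796", ObjectCategory.FILE)
scanner = Agent("tdonohue", "Tim Donohue")
scan = Event("scan-2008-03-23", "creation", [pdf.identifier], [scanner.identifier])
```

Events refer to Objects and Agents by identifier rather than by embedding them, which mirrors the linking identifiers PREMIS uses between entities.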

Minimum Requirements

The PREMIS list of recommended preservation metadata is extensive. The following is a list of the minimal metadata which should be captured for each entity (Object, Event, Agent, Rights).

Please note that although these best practices recommend the minimal preservation metadata that should be gathered, PREMIS does not specify a metadata schema for implementation. We recommend storing this metadata in an appropriate metadata schema, based on the packaging format in use. For example, if METS is used for packaging, an existing PREMIS XML schema can be used to record it as administrative metadata within METS: http://www.loc.gov/standards/premis/schemas.html
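
To give a sense of what PREMIS inside METS can look like, here is a minimal Python sketch that wraps an object identifier in a METS administrative metadata section. It is an assumption-laden illustration, not a schema-valid record: the namespace URIs shown are those of METS and PREMIS 2.x (PREMIS 3 uses a different namespace), the function name and the "TECH1" section ID are hypothetical, and a valid PREMIS object needs further elements (for example an xsi:type and objectCharacteristics). Check the output against the schema version you actually use.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"  # assumption: PREMIS 2.x namespace

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def premis_object_in_mets(id_type, id_value, section_id="TECH1"):
    """Wrap a bare PREMIS object identifier in mets:amdSec/mets:techMD."""
    amd_sec = ET.Element(f"{{{METS_NS}}}amdSec")
    tech_md = ET.SubElement(amd_sec, f"{{{METS_NS}}}techMD", {"ID": section_id})
    # MDTYPE tells METS which kind of metadata the wrapper contains.
    md_wrap = ET.SubElement(tech_md, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "PREMIS:OBJECT"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")

    obj = ET.SubElement(xml_data, f"{{{PREMIS_NS}}}object")
    identifier = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectIdentifier")
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}objectIdentifierType").text = id_type
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}objectIdentifierValue").text = id_value
    return amd_sec


# Sample identifier taken from the Objects table in this guide.
print(ET.tostring(premis_object_in_mets("hdl", "2142/8796"), encoding="unicode"))
```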


Objects

Minimally, the following preservation metadata should be captured about an Object:

  • Object Type (in most cases, "file")
  • Identifiers
  • Fixity (checksum, etc.)
  • Size
  • Format of file
  • Relationships to other objects (especially in establishing digital provenance)

Within the PREMIS data dictionary, this information is expressed as follows.  Please note that object type abbreviations refer to:  File (F),  Representation (R), and Bitstream (B).  

| Semantic Unit/Component | Object Type | Note | Examples |
|---|---|---|---|
| objectIdentifierType | R, F, B | The type of identifier used to locate the object within the preservation system in which it is stored. | hdl (Handle) |
| objectIdentifierValue | R, F, B | The value of the object's identifier | 2142/8796 |
| objectCategory | R, F, B | The type of object being described. Controlled vocabulary: representation, file, or bitstream | file, representation, bitstream |
| preservationLevelValue | R, F | Level of preservation support attempted for this object. (We need to establish our own controlled vocabulary for these values) | Categories? 1, 2, 3, or 4? full or bit-level? |
| preservationLevelDateAssigned | R, F | The date this preservation level was assigned | 2008-03-29 |
| fixity | F, B | The information necessary to perform occasional fixity checks | |
| messageDigestAlgorithm | F, B | Algorithm used to generate the message digest | MD5 |
| messageDigest | F, B | Value of the message digest | (a checksum value) |
| size | F, B | The size (in bytes) of the file | 1024 |
| format | F, B | | |
| formatDesignation | F, B | | |
| formatName | F, B | The MIME type of the file format | application/pdf, image/jp2, text/xml |
| originalName | R, F | The original filename | 123456.pdf |
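
Several of the values above (fixity, size, format, original name) can be gathered automatically for a File object. The following Python sketch shows one way to do so using only the standard library. The function name and the dictionary layout are our own assumptions; MD5 is the default only because it is the example algorithm in the table, and a stronger algorithm such as "sha256" can be passed instead. Note that the MIME type is guessed from the file extension alone, so a dedicated format identification tool would be more reliable.

```python
import hashlib
import mimetypes
import os


def describe_file(path, algorithm="md5"):
    """Collect minimal object metadata for a single file on disk."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)

    # guess_type() looks only at the extension, not at the file contents.
    mime_type, _encoding = mimetypes.guess_type(path)

    return {
        "objectCategory": "file",
        "messageDigestAlgorithm": algorithm.upper(),
        "messageDigest": digest.hexdigest(),
        "size": os.path.getsize(path),  # size in bytes
        "formatName": mime_type or "application/octet-stream",
        "originalName": os.path.basename(path),
    }
```

Calling describe_file("123456.pdf") on an existing file returns a dictionary whose keys match the semantic units in the table. Recomputing the digest later and comparing it with the stored messageDigest is the basis of a fixity check.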

 


Events

Although it is often difficult to track and record every event on an object, we recommend recording the following types of events whenever possible:

  • File format changes - this includes both migration to alternative formats (e.g. for access), as well as normalization to common formats. This helps us to keep track of the provenance of files.
  • Ingest into a new digital system/repository - this helps us keep track of the various locations of files.
  • Modifications to files which significantly change the file itself - It is unimportant to track fixes to spelling or minor font changes. However, larger changes such as OCR of an image-based PDF, removal/addition of pages, or other major structural/content changes are important events within the historical provenance of a file.
  • Any activities resulting in a new file - Generally, we should attempt to track any activities which create a new file. When the creation of new files is outsourced (e.g. for large scale digitization), this may be more difficult to track. However, it's still worth tracking the source of the files (even if the source is generically identified as the company which created the new files).

Minimally, the following preservation metadata should be captured about an Event on an Object:

  • Event Type
  • Event Date / Time (to best of your knowledge)
  • Event Detail (human readable notes on the event that occurred)
  • References to the Object(s) affected and the Agent(s) that performed the event

Within the PREMIS data dictionary, this information is expressed as follows:

| Semantic Unit/Component | Note | Examples |
|---|---|---|
| eventIdentifierType | A controlled vocabulary representing the institution or company that performed the event. This would usually be something like "UIUC Library". | UIUC Library, OCA, etc. |
| eventIdentifierValue | An identifier which can be used to reference this event. This should likely be based on the date/time the event occurred, to ensure its uniqueness. | scan-2008-03-23, migrate-2008-04-21 |
| eventType | The type of event described. We need to establish our own controlled vocabulary of event types. PREMIS documents some suggested terms. | ingestion, creation, deletion, migration, normalization, validation (etc.) |
| eventDateTime | The date/time when the event occurred. ISO 8601 format is recommended. | 2006-07-16T19:20:30 |
| eventDetail | Detailed notes (human readable / understandable) of the event that occurred | (Description of the event: who, what, why, what software was used, etc.) |
| linkingAgentIdentifier | Provides information about which agent performed the event | |
| linkingAgentIdentifierType | References the agentIdentifierType of the Agent(s) performing the Event (see the Agents section below) | UIUC Library, OCA, etc. |
| linkingAgentIdentifierValue | References the agentIdentifierValue of the Agent(s) performing the Event (see the Agents section below) | |
| linkingObjectIdentifier | Provides information about which object(s) were affected by the event | |
| linkingObjectIdentifierType | References the objectIdentifierType of the Object(s) affected by the Event (see the Objects section above) | hdl (Handle) |
| linkingObjectIdentifierValue | References the objectIdentifierValue of the Object(s) affected by the Event (see the Objects section above) | 2142/8796 |
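
As an illustration of how these units fit together, the Python sketch below assembles a minimal event record. The function name, the dictionary layout, and the sample time and detail text are our own assumptions. The linking keys use the names from the PREMIS Data Dictionary (linkingAgentIdentifierType and so on), and the event identifier is built from the event type plus the full timestamp, since a date alone may not be unique.

```python
from datetime import datetime


def make_event(event_type, when, detail, object_id, agent_id):
    """Build a minimal event record as a plain dict.

    object_id and agent_id are (identifier type, identifier value) pairs.
    """
    timestamp = when.isoformat(timespec="seconds")  # ISO 8601, e.g. 2006-07-16T19:20:30
    return {
        "eventIdentifierType": "UIUC Library",
        # Event type plus timestamp keeps identifiers readable and hard to duplicate.
        "eventIdentifierValue": f"{event_type}-{timestamp}",
        "eventType": event_type,
        "eventDateTime": timestamp,
        "eventDetail": detail,
        "linkingAgentIdentifierType": agent_id[0],
        "linkingAgentIdentifierValue": agent_id[1],
        "linkingObjectIdentifierType": object_id[0],
        "linkingObjectIdentifierValue": object_id[1],
    }


# Hypothetical example: the time of day and the detail text are made up.
event = make_event(
    "migration",
    datetime(2008, 4, 21, 10, 15, 0),
    "Migrated the file to an alternative format for access.",
    object_id=("hdl", "2142/8796"),
    agent_id=("UIUC NetID", "tdonohue"),
)
```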


Agents

Only Agents which perform actual Events on Objects need to be tracked.  Agents may be organizations, software programs, systems or individual people.

Minimally, the following preservation metadata should be captured about an Agent which performs an Event:

  • Agent Type (person, software, etc.)
  • Agent Name

Within the PREMIS data dictionary, this information is expressed as follows:

| Semantic Unit/Component | Note | Examples |
|---|---|---|
| agentIdentifierType | A controlled vocabulary representing the type of an agent identifier. For a person, this may be represented as "UIUC NetID". | UIUC Library, UIUC NetID, Software Program |
| agentIdentifierValue | An identifier which can be used to reference this agent. | tdonohue, LSDWG, Acrobat-Pro-9.0 |
| agentType | The type of agent described. We need to establish our own controlled vocabulary of agent types. PREMIS documents some suggested terms. | person, organization, software |
| agentName | A human readable name for the agent | Tim Donohue, Large Scale Digitization Working Group, Adobe Acrobat 9.0 Pro |
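
Because Events point to Agents only by identifier type and value, it is worth checking that every such reference resolves to an Agent record. The Python sketch below shows one way to do this. The function names and the dictionary layout are our own assumptions, and the pairing of the sample identifier types with the sample values is our reading of the examples above.

```python
AGENTS = [
    {"agentIdentifierType": "UIUC NetID", "agentIdentifierValue": "tdonohue",
     "agentType": "person", "agentName": "Tim Donohue"},
    {"agentIdentifierType": "UIUC Library", "agentIdentifierValue": "LSDWG",
     "agentType": "organization", "agentName": "Large Scale Digitization Working Group"},
    {"agentIdentifierType": "Software Program", "agentIdentifierValue": "Acrobat-Pro-9.0",
     "agentType": "software", "agentName": "Adobe Acrobat 9.0 Pro"},
]


def find_agent(agents, id_type, id_value):
    """Return the agent record matching the identifier pair, or None."""
    for agent in agents:
        if (agent["agentIdentifierType"], agent["agentIdentifierValue"]) == (id_type, id_value):
            return agent
    return None


def unresolved_agent_links(events, agents):
    """Return the (type, value) pairs that events reference but no agent defines."""
    missing = []
    for event in events:
        link = (event["linkingAgentIdentifierType"], event["linkingAgentIdentifierValue"])
        if find_agent(agents, *link) is None:
            missing.append(link)
    return missing
```

An empty list from unresolved_agent_links() means every event can be traced back to a recorded Agent.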


Rights

For the purpose of tracking the simple provenance of digital files, Rights Statements are unnecessary. In PREMIS, Rights Statements tend to document the permissions a repository holds over the objects within it.

No preservation metadata is minimally required for Rights statements. However, if copyright information about individual objects is already known or easily captured, we recommend recording it in the following PREMIS data dictionary units.

copyrightInformation

  • copyrightStatus - status of the copyright (e.g. copyrighted, publicdomain, unknown)
  • copyrightJurisdiction - jurisdiction of copyright (e.g. us, de)
  • copyrightStatusDeterminationDate - date when this status was determined
  • copyrightNote - any additional notes about copyright information

Again, copyright information need not be recorded unless it is already known.