AI-ready Archives: Preparing Archival Data for the AI Era

Lê Nguyễn Tường Vân · February 12, 2026

As Artificial Intelligence (AI) is increasingly discussed in the archival field, the question is no longer “Should we use AI?” but rather “How should we prepare archival data so that AI does not undermine the foundational principles of archival practice?”


In February 2026, Prof. Giovanni Colavizza and Prof. Lise Jaillant published the AI Preparedness Guidelines for Archivists, issued by the Archives & Records Association (UK & Ireland) under a CC BY license. In this document, the authors emphasize a central point: AI is only truly useful when archival collections are carefully prepared in terms of data, metadata, structure, and evaluation mechanisms.


Below are the key elements of the guidelines that may be of particular interest to the archival community in Vietnam.


What Does “AI-Ready” Mean?

The guidelines distinguish between two main types of AI models commonly applied in archives:

1. Task-Specific AI

These models are trained to perform clearly defined tasks, such as:

  • Classifying types of records
  • Extracting names, places, and dates
  • Detecting or flagging sensitive content

2. Generative AI

These models generate language and can:

  • Summarize records
  • Suggest descriptions or keywords
  • Answer user questions based on archival data

An important approach highlighted in the guidelines is RAG (Retrieval-Augmented Generation). In this model, the AI system first retrieves relevant material from a well-prepared collection, and only then generates content based on the retrieved data. This approach helps reduce “hallucinations” (AI generating information not present in the source material) and improves accuracy.
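The retrieve-then-generate pattern can be sketched in a few lines. The corpus, the word-overlap scoring, and the record identifiers below are all illustrative stand-ins (a real system would use vector search and an LLM), but the flow is the same: retrieve first, then generate only from what was retrieved, and refuse to answer when nothing relevant is found.

```python
# Toy sketch of the RAG pattern. Corpus contents, identifiers, and the
# naive word-overlap scoring are illustrative, not a production pipeline.

def retrieve(query, corpus, top_k=2):
    """Rank archival records by simple word overlap with the query."""
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(text.lower().split())), ref, text)
        for ref, text in corpus.items()
    ]
    scored.sort(reverse=True)
    return [(ref, text) for score, ref, text in scored[:top_k] if score > 0]

def answer(query, corpus):
    """Answer grounded only in the retrieved records."""
    hits = retrieve(query, corpus)
    if not hits:
        return "No relevant records found."  # refuse rather than hallucinate
    sources = "; ".join(ref for ref, _ in hits)
    # In a real system, an LLM would be prompted with the retrieved text here.
    return f"Based on records [{sources}]: " + " ".join(t for _, t in hits)

corpus = {
    "VN/001": "Land register for Hanoi province, 1921, handwritten.",
    "VN/002": "Correspondence on rice exports, 1935.",
}
print(answer("handwritten land register", corpus))
```

The key design choice is the early return: when retrieval finds nothing, the system says so instead of generating unsupported text, which is exactly how RAG reduces hallucinations.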


Four Pillars for AI-Ready Archives


1. Completeness and Excluded Data

It is not necessary to digitize 100% of a collection in order to apply AI. However, it is essential to:

  • Clearly state whether the dataset is complete or partial
  • Explain reasons for gaps (not yet digitized, legal restrictions, physical loss, etc.)
  • Document known biases (e.g., overrepresentation of certain social groups or historical periods)


This is especially important for Generative AI, since AI can only reflect what is present in the data.
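One lightweight way to apply this pillar is to record a machine-readable "completeness statement" alongside the collection. The field names and sample values below are hypothetical, not a formal standard; the point is that gaps and biases are stated explicitly rather than left implicit.

```python
# Illustrative dataset completeness statement, readable by both humans
# and AI pipelines. All field names and values are hypothetical examples.
import json

completeness_statement = {
    "collection": "Example provincial land records, 1900-1945",
    "coverage": "partial",              # "complete" or "partial"
    "digitised_fraction": 0.42,         # share of the physical collection
    "known_gaps": [
        {"range": "1940-1945", "reason": "not yet digitised"},
        {"range": "personnel files", "reason": "legal restrictions"},
    ],
    "known_biases": [
        "urban administrative records overrepresented relative to rural ones",
    ],
}

print(json.dumps(completeness_statement, indent=2))
```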


2. Metadata and Access Conditions

AI cannot function effectively if metadata is incomplete or fragmented.

It is necessary to ensure:

  • At least minimal metadata at the item level
  • Clear preservation and representation of provenance and series structure
  • Explicit documentation of access conditions (open, restricted, closed)
  • Identification of the language(s) of both the materials and the metadata


The guidelines particularly emphasize the value of narrative metadata, such as curatorial notes, historical context, and critical analysis. These elements help AI systems better understand cultural depth, power dynamics, and layered meanings within archival materials.
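A minimal item-level record covering the points above might look like the sketch below. The schema is an illustrative assumption (not a prescribed standard such as ISAD(G) or EAD), but it captures provenance, series, access conditions, languages, and a narrative curatorial note, plus a quick completeness check before AI processing.

```python
# Sketch of minimal item-level metadata; the schema and identifier are
# illustrative assumptions, not a prescribed archival standard.
record = {
    "identifier": "EX/1923/015",
    "title": "Petition concerning irrigation works",
    "series": "District correspondence",
    "provenance": "Office of the District Administrator",
    "access": "open",                   # open | restricted | closed
    "language_material": ["vi", "fr"],  # languages of the document itself
    "language_metadata": "en",          # language of this description
    "curatorial_note": (
        "Filed by villagers without official intermediaries; one of few "
        "items in the series written from the petitioners' perspective."
    ),
}

# Quick check that minimum fields are present before AI processing:
required = {"identifier", "series", "provenance", "access", "language_material"}
missing = required - record.keys()
print("missing fields:", sorted(missing))
```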


3. Data Formats and File Structure

Preparing data for AI does not mean “cleaning” it in ways that disrupt the original archival structure.

Instead, it is important to:

  • Preserve original files and folder structures
  • Create standardized derivative copies for AI processing
  • Standardize formats (e.g., UTF-8 text or XML for documents; TIFF/JPEG for images)
  • Use clear, structured file naming conventions so that files can be located and accessed programmatically (e.g., via APIs)


This is particularly relevant for systems using IIIF, OCR pipelines, or databases integrated with vector search technologies.
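The "preserve originals, derive standardized copies" principle can be sketched as a simple directory convention: originals stay untouched under one tree, while UTF-8 text derivatives are written to a parallel tree with predictable names. The layout and naming pattern below are illustrative choices, not a mandated structure.

```python
# Minimal sketch: originals/ stays untouched; AI-ready UTF-8 text
# derivatives mirror its structure under derivatives/. The directory
# layout and naming scheme are illustrative assumptions.
from pathlib import Path

def derivative_path(original: Path, root: Path) -> Path:
    """Map root/originals/<series>/<name>.<ext> to root/derivatives/<series>/<name>.txt."""
    relative = original.relative_to(root / "originals")
    return (root / "derivatives" / relative).with_suffix(".txt")

def write_text_derivative(original: Path, root: Path, text: str) -> Path:
    """Store an extracted-text derivative without touching the original."""
    target = derivative_path(original, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")  # standardize on UTF-8
    return target

root = Path("archive")
original = root / "originals" / "series_a" / "EX_1923_015.tif"
print(derivative_path(original, root))
```

Because the derivative path is computed deterministically from the original path, an OCR pipeline or API can always find the derivative for a given original without a lookup table.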


4. Application-Specific Evaluation

Each AI application requires its own set of evaluation metrics, rather than relying on generic criteria. For example:

  • The percentage of AI-generated descriptions accepted with minor edits
  • Time saved per record
  • The rate of false positives when detecting sensitive content
  • User satisfaction with a RAG-based access system


Defining evaluation methods from the outset helps ensure that AI delivers practical value rather than functioning merely as a technological experiment.
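Two of the metrics above can be computed directly from human review outcomes. The outcome labels and the sample data below are invented for illustration; what matters is that the metrics are defined before the project starts, so they can be tracked from day one.

```python
# Toy calculation of two application-specific metrics from human review
# outcomes. Labels and sample data are invented for illustration.

review_outcomes = [
    "accepted", "accepted_minor_edits", "rejected",
    "accepted_minor_edits", "accepted", "accepted_minor_edits",
]

sensitivity_flags = [  # (flagged_by_ai, actually_sensitive)
    (True, True), (True, False), (False, False), (True, True), (False, True),
]

# Share of AI-generated descriptions usable as-is or with minor edits:
usable = sum(o in ("accepted", "accepted_minor_edits") for o in review_outcomes)
acceptance_rate = usable / len(review_outcomes)

# False positive rate among items the AI flagged as sensitive:
false_positives = sum(flagged and not sensitive
                      for flagged, sensitive in sensitivity_flags)
flagged_total = sum(flagged for flagged, _ in sensitivity_flags)
false_positive_rate = false_positives / flagged_total

print(f"descriptions accepted (incl. minor edits): {acceptance_rate:.0%}")
print(f"false positive rate on sensitive-content flags: {false_positive_rate:.0%}")
```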


Checklist Before Implementing AI

Before launching an AI project, you should be able to answer “yes” to most of the following questions:

  • Is there a clearly defined use case?
  • Do you understand the completeness (or partial nature) of the dataset?
  • Is there minimal metadata and documented provenance?
  • Are standardized derivative files available for AI processing?
  • Are there clear evaluation criteria?
  • Is there a human review mechanism in place (AI supports, but does not replace professionals)?
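The checklist above can even be kept as a small, auditable readiness check. The question keys paraphrase the bullets, and the rule "yes to most" is interpreted here, purely as an illustrative assumption, as at least five of the six.

```python
# The readiness checklist as data. Keys paraphrase the questions above;
# the "at least 5 of 6" threshold is an illustrative interpretation of
# "yes to most", not a rule from the guidelines.
checklist = {
    "defined_use_case": True,
    "completeness_understood": True,
    "minimal_metadata_and_provenance": True,
    "derivative_files_ready": False,
    "evaluation_criteria_defined": True,
    "human_review_in_place": True,
}

yes_count = sum(checklist.values())
ready = yes_count >= 5
print(f"{yes_count}/{len(checklist)} yes -> {'ready' if ready else 'not ready'}")
```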


The most important step is not deploying another tool, but investing in AI data preparedness.


Preparing data for AI is essentially an extension of the core principles of archival practice: thorough documentation, preservation of context, structural transparency, and professional accountability. When this foundation is strong, AI can become a supportive tool rather than a force that compromises the value and integrity of archival collections.