Digital Appendix

A Study on Archaeological Informatization Using Large Language Models (LLMs)

- Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -

KIM Hongyeon

DOI: 10.22755/kjchs.2025.58.3.34

Engine Source Code

PoC Source Code

Research Radar Source Code

1. heripo-engine Demo Site

2. A Look at the PoC in Action

This page is the digital appendix to the paper of the same title, published on September 30, 2025, in Heritage: History & Science, Vol. 58, No. 3. Because some aspects of the system built during the research cannot be fully conveyed on the printed page, this appendix was prepared to complement the paper.

The paper was written during April–May 2025. LLM technology has been advancing at an extraordinary pace, and models far more capable and cost-effective than those available at the time of writing have since emerged. For instance, OpenAI's GPT-4o — employed throughout the paper — has been succeeded by GPT-5, which delivers substantial performance gains at reduced cost. Given this rapid rate of progress, the performance, accuracy, and pricing figures presented in the paper and accompanying videos should be understood as reflecting the state of technology at that particular moment. Replacing the model alone would already yield noticeably improved results in the applications shown. The core contribution of this research lies not in the performance or output at any particular point in time, but in validating the concept that metadata can be extracted automatically from excavation reports by designing an automation pipeline and applying LLMs where they are most effective.

The various limitations visible in the videos, as well as the technical and practical constraints discussed in the paper, will continue to be addressed. Related materials and demonstration videos will be added to this space on an ongoing basis.

Follow-up research and development are ongoing. In January 2026, heripo-engine was released as an open-source TypeScript library that extracts structured data from excavation-report PDFs. The engine handles the core preprocessing stages of the data pipeline — PDF parsing, document structure analysis, and image/table extraction — and has succeeded in reducing the processing cost per report to under $0.50. A web platform is being built on this foundation, ~~with the aim of developing and releasing a publicly usable prototype service in the second half of 2025.~~ The release schedule has been adjusted to sometime in 2026. Methodology refinement and implementation will nonetheless continue in parallel so as to bring the release forward as far as possible.

This digital appendix is updated primarily with major announcements. For continuous updates from heripo lab, please refer to our GitHub. Those wishing to receive updates more conveniently are encouraged to subscribe to the newsletter, which delivers research trends in the cultural heritage field along with updates from heripo lab.

Research & Development Log

June 6, 2026
Presentation at Dataya Nolja 2026: from a humanities student's dream to open source
- At Dataya Nolja 2026, I gave a talk titled "From a Humanities Student's Dream to Open Source: Building an LLM Pipeline for Excavating Sleeping Archaeological Data."
- The talk explained how archaeological data buried in excavation-report PDFs can be extracted and structured through an LLM pipeline, and how that work is being extended into open-source tools such as heripo-engine and, ultimately, open datasets.
- A recording is available on YouTube.
April 17, 2026
Officially listed in the global archaeological open-source database "open-archaeo"
- This paper ("A Study on Archaeological Informatization Using Large Language Models (LLMs)") and its PoC digital appendix repository have been officially listed in open-archaeo, a global database that curates open-source resources for archaeology.
- The listed entry can be viewed here.
April 16, 2026
Official English version of the paper released
- The official English version of "A Study on Archaeological Informatization Using Large Language Models (LLMs)" has been released. It is now available via the download menu above.
- This English edition was translated and edited by the author with the permission of the National Research Institute of Cultural Heritage, Korea. The translation includes corrections to factual errors identified in the original Korean edition and supplementary translation notes to aid international readers' understanding. These modifications do not affect the core arguments, results, or interpretation of the original paper.
- The English version can also be downloaded directly from here.
February 9, 2026
Two memoranda of understanding signed with the Korean Archaeological Society
Agreements
- Newsletter Service MOU: The Korean Archaeological Society's official announcements and academic information are combined with heripo lab's automated newsletter system to publish and operate the official Korean Archaeological Society newsletter on a regular basis. The existing Research Radar has been reorganized as the official Korean Archaeological Society newsletter.
- Digital Transformation of the Archaeological Research Environment MOU: The Korean Archaeological Society's accumulated scholarly assets are combined with heripo lab's artificial intelligence (AI) and data-processing technologies to build a next-generation archaeological database and develop a researcher-friendly intelligent platform.
January 28, 2026
heripo-engine released as open source
Project overview
- heripo-engine is a TypeScript library that extracts structured data from archaeological excavation-report PDFs.
- Released under the Apache 2.0 license — free for anyone to use, modify, and redistribute.
- An online demo is available at engine-demo.heripo.org for immediate hands-on trial (limited to 3 runs per day).
Key features
- PDF parsing and OCR: deep-learning OCR based on the Docling SDK and Apple's Vision Framework (local processing, free of charge).
- Document structure analysis: automatic table-of-contents recognition using a rule-based approach with an LLM fallback, plus page-number mapping via a Vision LLM.
- Data extraction: automatic parsing of image and table captions, with automatic construction of chapter/section/subsection hierarchies.
- Extreme cost efficiency: under $0.07 for a thin report, under $0.35 for a thick one.
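The rule-first, LLM-fallback approach to table-of-contents recognition can be sketched as follows. This is a minimal illustration and not heripo-engine's actual code: the names TocEntry, parseTocByRule, and extractToc, the line pattern, and the synchronous fallback callback are all assumptions made for the example. In a real pipeline the fallback would be an asynchronous LLM call.

```typescript
// One table-of-contents entry: the title text and the page number printed next to it.
interface TocEntry {
  title: string;
  page: number;
}

// Hypothetical fallback signature; a real implementation would call an LLM here.
type TocFallback = (lines: string[]) => TocEntry[];

// Assumed line shape: a title, then dot leaders or whitespace, then a trailing page number,
// e.g. "1. Overview ..... 3".
const TOC_LINE = /^(.+?)[\s.·…]+(\d+)$/;

// Returns parsed entries, or null when the rules cannot be trusted for this input.
function parseTocByRule(lines: string[]): TocEntry[] | null {
  const entries: TocEntry[] = [];
  for (const raw of lines) {
    const line = raw.trim();
    if (line === "") continue;
    const match = TOC_LINE.exec(line);
    if (!match) return null; // one unparseable line makes the whole rule result untrusted
    entries.push({ title: match[1].trim(), page: Number(match[2]) });
  }
  // Sanity check: page numbers in a table of contents should never decrease.
  for (let i = 1; i < entries.length; i++) {
    if (entries[i].page < entries[i - 1].page) return null;
  }
  return entries.length > 0 ? entries : null;
}

// Try the cheap rule-based parser first; call the fallback only when it gives up.
function extractToc(lines: string[], fallback: TocFallback): TocEntry[] {
  return parseTocByRule(lines) ?? fallback(lines);
}
```

The point of this ordering is cost: the fallback is only invoked for layouts the rules cannot handle, so well-formed tables of contents incur no LLM charge at all.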
Tech stack
- TypeScript monorepo (pnpm workspace)
- Docling SDK (PDF parsing), Apple Vision Framework (OCR)
- Vercel AI SDK (LLM integration — supports OpenAI, Anthropic, Google, and more)
- Next.js 15 (demo application)
- System requirements: macOS (Apple Silicon or Intel), Node.js ≥ 22, Python 3.9–3.12
Long-term vision: a multi-stage data pipeline
- v0.1.x (done): PDF parsing and OCR, document-structure extraction (table of contents, chapters/sections, page mapping), image/table extraction.
- v0.2.x (planned): an immutable archaeological data ledger (general-purpose model, concept extraction).
- v0.3.x (planned): extended standardization (hierarchical standard model, normalization).
- v0.4.x (planned): ontology (semantic model, knowledge graph).
- v1.0.x (planned): production readiness (performance optimization, API stability).
Community contributions
- Development of regional archaeological standards worldwide (East Asia, Southeast Asia, South Asia, Central Asia, Middle East & Western Asia, Europe, Africa, the Americas, and Oceania).
- Building ontologies by period, site type, artifact type, and special domains.
- Expanded multilingual support (Korean, English, Chinese, Japanese, Arabic, Spanish, French, German, Hindi, Russian, and more).
- Technical improvements and performance optimization.
January 5, 2026
heripo lab, an open-source R&D group, has been founded
About the group
- heripo lab is an open-source R&D group that combines archaeological domain knowledge with software engineering to drive real improvements in research efficiency.
- Archaeology domain experts and software engineers collaborate on tasks such as designing archaeological data ontologies, defining data schemas, and validating academic consistency.
Core values
- Academic rigor: accurate data modeling grounded in archaeological domain knowledge.
- Open-source philosophy: all outputs are released under the Apache 2.0 license, so they can be used without restriction by both public projects and commercial enterprises.
- Practicality: tools and platforms designed to be usable in real research environments.
What's next
- Development of archaeological data standards and ontologies.
- Refinement of the LLM-based data extraction pipeline.
- Development and public release of the web platform.
- Building a community and a collaborative ecosystem.
December 14, 2025
Preprocessing engine status: table-of-contents extraction and page parsing complete
Stage-1 preprocessing: Docling-based PDF parsing
- When PDF reports are parsed with Docling, the output is clean enough to be converted directly into HTML. Every necessary element — photographs, drawings, page-level images, text, and more — can be extracted.
- However, the objective of this project is not PDF-to-HTML conversion. The primary objective is to load every individual feature and artifact into a relational database and establish a robust data foundation. Once this foundation is in place, the data can be repurposed for any downstream application — web platforms, HTML reports, AI datasets (RAG, training data), and more.
- Extracting data from excavation reports requires a specialized preprocessing process, so we split preprocessing into stage 1 (Docling-based) and stage 2 (an excavation-report-specific preprocessor).
- OCR is performed within Docling using Apple Vision Framework. This necessitated committing to a macOS-only environment.
- At present, Apple Vision Framework surpasses Nvidia-GPU-based OCR in both processing speed and Korean-language recognition quality, at substantially lower cost. A Mac mini in the $530 range is more than sufficient and its power consumption is negligible; this is the configuration currently in use.
- The caption-matching problem — the hardest problem in the paper — is now also solved, because Docling's stage-1 preprocessing separates captions cleanly.
Stage-2 preprocessing: pages and table-of-contents extraction complete
- Accurately extracting the actual pages and table of contents from excavation reports — which lack any standardized technical specification — constitutes the core of this system and its most challenging task.
- Without this foundation, we cannot identify targets (features, artifacts, etc.) or perform LLM-driven structuring, data extraction, and listing.
- We have completed a three-stage pipeline for table-of-contents extraction (TocFinder → MarkdownConverter → TocExtractor).
- Page-format and page-number parsing is complete (Vision-LLM based).
- A variety of paging styles (position of page 1, single/double-sided layouts, position of page numbers, etc.) and table-of-contents formats are now handled, which required a substantial volume of code.
- This — the hardest and most central task — is now complete.
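To make the paging styles above concrete, here is a small sketch of the arithmetic that maps between PDF page positions and printed page numbers once a report's paging style is known. Detecting the style is the Vision-LLM step described above and is not shown. The PagingStyle shape and both function names are hypothetical, and the "double" case assumes each PDF page holds exactly two printed pages with the lower number on the left.

```typescript
// Hypothetical description of how a report is paginated.
interface PagingStyle {
  firstNumberedPdfPage: number; // 1-based PDF page on which printed page 1 appears
  layout: "single" | "double";  // "double": one PDF page (a spread) holds two printed pages
}

// Printed page numbers visible on a given PDF page; empty for unnumbered front matter.
function printedPages(pdfPage: number, style: PagingStyle): number[] {
  const offset = pdfPage - style.firstNumberedPdfPage;
  if (offset < 0) return [];
  if (style.layout === "single") return [offset + 1];
  const left = offset * 2 + 1; // assumption: the left half carries the lower number
  return [left, left + 1];
}

// Inverse lookup: which PDF page contains a given printed page number.
function pdfPageFor(printedPage: number, style: PagingStyle): number {
  const perPdfPage = style.layout === "single" ? 1 : 2;
  return style.firstNumberedPdfPage + Math.floor((printedPage - 1) / perPdfPage);
}

// Example: printed page 1 starts on PDF page 5 and the scan is in two-page spreads.
const spreadStyle: PagingStyle = { firstNumberedPdfPage: 5, layout: "double" };
const onPdfPage6 = printedPages(6, spreadStyle); // printed pages 3 and 4
const whereIsPage4 = pdfPageFor(4, spreadStyle); // PDF page 6
```

A mapping like this is what lets a table-of-contents entry, which cites printed page numbers, be resolved to the actual PDF page to process.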
Cost optimization strategy
- A multi-model architecture was adopted: a worker (an inexpensive open-source model) paired with a supervisor (GPT-5.2, a frontier model).
- The frontier model is used only where strictly necessary, and output tokens are tightly controlled.
- Preprocessing cost: under $0.50 per report.
- Total cost is estimated at no more than a few dollars per report, with a target of under $1.35 per report.
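One way to picture the worker/supervisor split is as a router that accepts the cheap model's output whenever it passes a check and escalates only otherwise. The sketch below is an assumption about how such routing could look, not the engine's implementation: the Model type, routeToModels, escalationRate, and the idea of using a rule-based acceptance check are all hypothetical, and the models are plain synchronous functions so the example stays self-contained.

```typescript
// A model is reduced to a function from prompt to text for the purpose of this sketch.
type Model = (prompt: string) => string;

interface RoutedResult {
  output: string;
  escalated: boolean; // true when the supervisor had to be called
}

// Ask the inexpensive worker first; call the supervisor only if the draft is rejected.
function routeToModels(
  prompt: string,
  worker: Model,
  supervisor: Model,
  isAcceptable: (output: string) => boolean,
): RoutedResult {
  const draft = worker(prompt);
  if (isAcceptable(draft)) return { output: draft, escalated: false };
  return { output: supervisor(prompt), escalated: true };
}

// Share of requests that reached the supervisor; a rough proxy for frontier-model spend.
function escalationRate(results: RoutedResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.escalated).length / results.length;
}
```

Tracking the escalation rate alongside the output quality makes it possible to see whether the supervisor really is being used only where necessary.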
What's next
- With the preprocessor nearly complete, the core foundation is in place. Extracting features, artifacts, and related data is now a matter of time.
- Using the preprocessing output, we plan to design a general-purpose schema and proceed with structuring.
- Once enough data has been processed at scale, we will release it as a web platform so that anyone can use it.
- Subsequently, report-derived data will be matched against the Dictionary of Korean Archaeology.
- In the longer term, ontology models specialized for particular archaeological subfields will be selectively applied so that the data can serve as semantic data. Related research trends will be tracked and existing results incorporated.
- All outputs are released as open source, like Research Radar (under a license policy that imposes no restrictions on either public projects or commercial enterprises).
- Source code will be released alongside the public launch of the web platform, and new features — dictionary integration, ontology application, and more — will continue to be contributed to the open-source project thereafter.
November 28, 2025
Two Research Radar open-source projects released
Released projects
- As announced in October, we have split Research Radar into a general-purpose framework (@llm-newsletter-kit/core) and a cultural-heritage implementation (@heripo/research-radar) and released both.
- LLM Newsletter Kit: npm install @llm-newsletter-kit/core / GitHub
- Research Radar: npm install @heripo/research-radar @llm-newsletter-kit/core / GitHub
- Both projects are released under the Apache-2.0 license — free for anyone to use, modify, and redistribute.
LLM Newsletter Kit (general-purpose framework)
- A framework extracted from Research Radar by removing the cultural-heritage-specific logic and keeping only the pure engine.
- It provides a pipeline of web crawling → LLM analysis → content generation → storage, and every stage can be swapped out via dependency injection.
- Built in TypeScript, with 100% test coverage and validated in real production use.
- Applicable domains: not limited to cultural heritage but extensible to any specialist field — science, social science, IT, medicine, law, and more. By changing the crawling targets and analysis criteria, a newsletter for any given field can be generated automatically.
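The swappable-stage design described above can be illustrated with a short sketch. The interfaces and function below are hypothetical and are not the API of @llm-newsletter-kit/core; they only show the general shape of a crawl → analyze → generate → store pipeline whose stages are supplied by the caller. Real stages would be asynchronous; they are synchronous here to keep the example small.

```typescript
interface Article {
  url: string;
  title: string;
}

interface ScoredArticle extends Article {
  score: number;
}

// Every stage is injected, so any one of them can be replaced without touching the others.
interface PipelineDeps {
  crawl: () => Article[];
  analyze: (articles: Article[]) => ScoredArticle[];
  generate: (articles: ScoredArticle[]) => string;
  store: (content: string) => void;
}

function runNewsletterPipeline(deps: PipelineDeps): string {
  const articles = deps.crawl();
  const scored = deps.analyze(articles);
  const content = deps.generate(scored);
  deps.store(content);
  return content;
}

// Usage with in-memory stand-ins for each stage.
const stored: string[] = [];
const issue = runNewsletterPipeline({
  crawl: () => [
    { url: "https://example.org/a", title: "Excavation notice" },
    { url: "https://example.org/b", title: "Museum closure" },
  ],
  analyze: (articles) => articles.map((a, i) => ({ ...a, score: i === 0 ? 9 : 2 })),
  generate: (articles) =>
    articles.filter((a) => a.score >= 5).map((a) => `- ${a.title}`).join("\n"),
  store: (content) => {
    stored.push(content);
  },
});
```

Adapting such a pipeline to another field then amounts to supplying a different crawl function and different analysis criteria, which matches the extensibility claim above.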
Research Radar (reference implementation for cultural heritage)
- A complete example of the framework in action; this is the source code for the service currently running at heripo.app.
- Includes crawlers and parsers that automatically collect news and announcements from 62 institutions (including the National Museum of Korea, the Korea Heritage Service, and the Korean Archaeological Society).
- By copying this code and modifying only the crawling targets and analysis criteria, a newsletter for any field can be built.
- Real-world results: a cost of $0.20–$1.00 per issue, fully autonomous operation, and a 15% subscriber click-through rate.
What's next
- Ongoing maintenance and incorporation of user feedback.
- Expanded documentation and additional use cases to come.
October 30, 2025
Platform release schedule adjusted as methodology evolves; update on Research Radar core release
Methodology refinement and platform release schedule
- When the paper was written (April–May), the PoC used OpenAI's GPT-4o; in August we explored adopting open-source LLMs instead of commercial ones; and in September we further advanced the methodology by refining PDF preprocessing around a PDF → Markdown approach.
- The key is to prepare the report in as LLM-friendly a structure as possible before the LLM is involved. Beyond that, a pipeline employing open-source LLMs rather than commercial models is being pursued in order to satisfy consistency, performance, and cost requirements simultaneously.
- The next step is to fine-tune an open-source LLM on the Dictionary of Korean Archaeology as a corpus, enabling it to understand the specialized concepts and conventions of excavation reports with greater accuracy prior to data extraction and normalization.
- Academic rigor and accuracy are prioritized. A rapid release of a half-finished application may be defensible from a business standpoint, but because this project must guarantee rigor, whatever time is necessary will be invested to establish the soundness of the methodology.
- Accordingly, the public release date of the web platform announced in the conclusion of the paper has been adjusted from the second half of 2025 to sometime in 2026. Research and development will nonetheless continue to advance steadily so as to bring the release forward as far as possible.
Status of open-sourcing the Research Radar core
- As announced in the September 30 log, we are in the process of open-sourcing the core logic.
- Rather than limiting it to Korean cultural heritage, it is being rebuilt as a "general-purpose newsletter AI kit" applicable to any field, country, or language; the main features are nearly complete.
- Documentation (installation and configuration guides), example pipelines, and a minimum level of test coverage still need finishing touches, so we are adjusting the release to sometime in November.
September 30, 2025
Paper and open-source release; preview of the Research Radar core release
Scope and approach of the open-source release
- The paper was published on September 30, and at the same time the PoC source code was released as open source in the form of a snapshot.
- That repository is intended as an archival record; it does not accept external contributions and has no separate maintenance plan.
- Follow-up development is taking place in a separate repository and will be maintained as a regular open-source project. It will be released once it reaches a certain level of maturity.
Plans for releasing the Research Radar core logic
- We ~~planned to release the core pipeline (core logic) of Research Radar sometime in October.~~ This has been adjusted to sometime in November.
- The core logic has been reorganized around LangChain, and we have strengthened its stability and reusability compared with the version used at the time of writing.
- To help fellow researchers reproduce and apply the work, we will also provide runnable examples with minimal configuration and brief documentation.
September 14, 2025
Partial revision of the PDF preprocessing approach and an advanced pipeline (PDF → Markdown)
Open-source library review: Docling, pdf-document-layout-analysis
- We are considering adopting open-source libraries for the preprocessing step that converts PDFs into Markdown.
- Docling: strong in document layout analysis and structuring, able to identify document elements such as tables, tables of contents, and headings with relative precision and convert them into a structured output.
- Huridocs' pdf-document-layout-analysis: a tool that analyzes the block layout (paragraphs, images, tables, etc.) of PDF pages, useful for understanding the structure within each page.
- Both tools are optimized for general-purpose documents; given the project's specifics, we still need to test how well they handle the varied layouts and plate compositions of Korean excavation reports.
Background and expected benefits
- The core goal is to convert wildly varied report formats into a uniform Markdown format so that the downstream stages (metadata extraction, validation, storage) can be stabilized.
- If PDF → Markdown preprocessing works smoothly, a significant portion of the layout differences between reports can be absorbed by rule-based processing, improving the consistency and reliability of results.
- These libraries either provide reliable OCR themselves or integrate with it easily, making it possible to process even scanned documents (where text extraction is hard) relatively robustly. This shortens preprocessing development time while also guaranteeing a baseline of text-recognition accuracy.
- Because Markdown is text-based, LLMs can interpret its context more easily, and reprocessing and audit (trace-back) costs are lower than for PDFs.
- The extracted metadata will be stored in a relational database so that researchers, the public, and AI systems can easily query and use the structured data.
Risks and mitigations
- These libraries may not perfectly handle the special notations, tables, and plate annotations specific to Korean-language documents, especially those in archaeology and cultural heritage.
- In such cases we plan to respond by developing specialized rules and post-processing modules, informed by the internal architecture and logic of the libraries.
- We will initially validate effectiveness through a hybrid approach (library output plus custom rules and normalization).
Reverse ontology roadmap
- We are experimenting with automatically extracting and storing metadata from preprocessed Markdown using LLMs, and on that basis progressively building an ontology automatically.
- Flow: automatic extraction/storage (with partial use) → LLM-driven automatic ontology construction → automatic data reconstruction based on the ontology.
- Once preprocessing quality and schema consistency are secured, we expect meaningful results in a relatively short time.
August 23, 2025
Excavation-report metadata extraction pipeline design complete; development kicked off
Design refinement and start of development
- We have finished designing an excavation-report metadata extraction pipeline that is more sophisticated than either the PoC from the paper or Research Radar, and full-scale development has begun.
Why we chose this framework
- Unlike in the paper, the LangChain framework has been adopted. At the time of writing the paper, data-extraction development was being attempted for the first time and many unknowns remained; since the code was a one-off for the paper, maximum flexibility unconstrained by any framework was preferred. By now, however, sufficient know-how has been accumulated that, with long-term maintenance in mind, adopting a framework was judged more advantageous.
Experience and issues operating commercial models
- In the paper, GPT-4o was used as the primary LLM. It was chosen because it guaranteed above-baseline performance and speed and did not demand the complex setup or heavy additional compute required by open-source models.
- The paper highlighted cost as the main issue with commercial models, but as research and development have continued, the situation has shifted somewhat.
- The biggest issue is consistency. Even though GPT-5 is generally more capable than GPT-4o, the same prompt can produce different outputs. In hybrid pipelines that combine LLM output with rule-based code, this variability is especially damaging, because the rules depend on a predictable output format. Results from GPT-4o a few months ago also differ from those of today's GPT-4o.
- Research Radar uses GPT-5 and GPT-4.1, and while running the newsletter generation pipeline we have seen the OpenAI servers become intermittently unstable. At times of peak usage in particular, we have observed not just outright errors but a degradation in generation quality itself. This is harder to manage than a clean failure: a clear error can be caught and retried, whereas unpredictable output can pass through unnoticed.
- For everyday AI use or comparative analysis over a given dataset, commercial models are generally advantageous; however, for the task of rigorously extracting and structuring academic data from unstructured material, externally controlled commercial models (GPT, Claude, Gemini, and the like) are judged to be unsuitable.
- Taking quality control and cost into account together, open-source models were therefore concluded to be more appropriate. New models are continually appearing on the open-source side as well, but the key advantage is the ability to determine — after sufficient validation — exactly when and how to upgrade.
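The consistency concern above can be made measurable. As a minimal sketch, one could run the same prompt several times and compute the share of runs that agree with the most common output. The function name agreementRate and the choice of exact string equality after trimming are assumptions for illustration; a real evaluation would more likely compare parsed fields than raw text.

```typescript
// Fraction of outputs identical to the most frequent output (1 means fully consistent).
function agreementRate(outputs: string[]): number {
  if (outputs.length === 0) return 0;
  const counts = new Map<string, number>();
  let best = 0;
  for (const output of outputs) {
    const key = output.trim();
    const next = (counts.get(key) ?? 0) + 1;
    counts.set(key, next);
    if (next > best) best = next;
  }
  return best / outputs.length;
}

// Example: four runs of the same prompt, one of which drifted.
const rate = agreementRate([
  '{"type":"pit dwelling"}',
  '{"type":"pit dwelling"}',
  '{"type":"dwelling pit"}',
  '{"type":"pit dwelling"}',
]);
```

Recording a figure like this before and after a model change gives a concrete basis for deciding when an upgrade is safe.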
Reorganizing the data processing flow (PDF → Markdown)
- A significant change has also been made to the data-processing approach. Whereas the paper extracted metadata directly from the PDF source, Markdown has now been introduced as an intermediate stage. The PDF is first converted into Markdown, after which metadata is extracted from that file and stored in the database. The PDF-to-Markdown conversion stage does include some preprocessing to handle the varied formats of reports, but since it operates largely by rules, the output is highly reliable.
- The broad usefulness of Markdown was also taken into account. If an issue with the extracted data surfaces subsequently, inspecting a lightweight, well-structured Markdown file is far more cost-effective than reopening the original PDF. Moreover, depending on its contents, a PDF may require separate OCR, whereas Markdown consists of clean text and is therefore more amenable to LLM comprehension and analysis. It is also a better fit for implementing the Expert Interaction model discussed in the paper.
Redesigning the chunking strategy (per page → per entity)
- In the paper, coarse data extraction used 20-page chunks and detailed data extraction used 2-page chunks.
- However, because context can vary dramatically with report format and length, we now consider page-based chunking to be inadvisable.
- In the new design, we chunk the converted Markdown document per entity, which brings the following benefits.
  - The amount of context fed to the LLM is reduced, yielding relatively high-quality results with almost no factual distortion.
  - For example, we are testing the hypothesis that feeding a particular artifact description of under 100 characters and instructing the model to extract it into a fixed schema produces stable results.
  - If this proves practical, smaller-parameter models can also do the job, improving both cost and speed simultaneously.
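Per-entity chunking can be sketched as splitting the converted Markdown at the headings that introduce each entity. The example assumes that every feature or artifact starts at a heading of one known level, which will not hold for every report; the EntityChunk type and chunkByHeading function are hypothetical names for illustration.

```typescript
interface EntityChunk {
  heading: string; // e.g. the name of a feature or artifact
  body: string;    // the description that follows, up to the next entity heading
}

// Split Markdown into one chunk per heading of the given level (1 to 6).
// Text before the first such heading is ignored.
function chunkByHeading(markdown: string, level: number): EntityChunk[] {
  const headingPattern = new RegExp("^#{" + level + "} (.*)$");
  const chunks: EntityChunk[] = [];
  let heading: string | null = null;
  let bodyLines: string[] = [];

  for (const line of markdown.split("\n")) {
    const match = headingPattern.exec(line);
    if (match) {
      if (heading !== null) chunks.push({ heading, body: bodyLines.join("\n").trim() });
      heading = match[1].trim();
      bodyLines = [];
    } else if (heading !== null) {
      bodyLines.push(line);
    }
  }
  if (heading !== null) chunks.push({ heading, body: bodyLines.join("\n").trim() });
  return chunks;
}

const sampleMarkdown = [
  "# Report",
  "",
  "### Pit dwelling No. 1",
  "Oval plan.",
  "",
  "### Pit dwelling No. 2",
  "Rectangular plan.",
  "Hearth at center.",
].join("\n");
const entityChunks = chunkByHeading(sampleMarkdown, 3);
```

Each resulting chunk is small enough to be sent to the model on its own with a fixed extraction schema, which is the hypothesis being tested above.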
August 20, 2025
Selected for a poster presentation at the 49th Annual Meeting of the Korean Archaeological Society
- Presentation topic: "The Present and Future of LLM-based Archaeological Information Platforms" (an extension of the existing "A Study on Archaeological Informatization Using Large Language Models (LLMs)").
- Dates: November 7 (Fri) – 8 (Sat), 2025.
- Venue: Global Plaza and Institute for Humanities Korea, Kyungpook National University (Daegu).
August 8, 2025

Research Radar launched

The first feature of the web platform discussed in the paper is now operational. Research Radar is a service that collects web materials related to cultural heritage, employs an LLM to rank them by importance, and delivers the results as a personalized newsletter each morning at 8:30 AM.

Although the topic differs from the metadata extraction in the paper, Research Radar is likewise a hybrid pipeline that stitches several modules together. Unlike the PoC in the paper, however, it is built on LangChain. The stages run in the following order.

- Crawler (rule-based)
- Deduplication (rule-based)
- Importance scoring (LLM-based)
- Markdown document generation (LLM-based; includes factuality/copyright checks and recursive regeneration logic)
- Markdown → HTML conversion (rule-based)
- HTML template merging (rule-based)
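Two of these stages lend themselves to a short illustration: rule-based deduplication and regeneration gated by checks. The sketch below is not Research Radar's code. The function names, the URL-based duplicate key, and the attempt limit are assumptions, and the regeneration is written as a bounded loop rather than recursion so that it always terminates.

```typescript
// Rule-based deduplication: treat items as duplicates when their URLs match
// after dropping any #fragment and trailing slashes (a deliberately crude rule).
function dedupeByUrl<T extends { url: string }>(items: T[]): T[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = item.url.trim().replace(/#.*$/, "").replace(/\/+$/, "");
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Generate a draft, run the checks, and regenerate until a draft passes
// or the attempt budget is spent. The generator stands in for an LLM call.
function generateWithChecks(
  generate: (attempt: number) => string,
  passesChecks: (draft: string) => boolean,
  maxAttempts: number,
): string {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const draft = generate(attempt);
    if (passesChecks(draft)) return draft;
  }
  throw new Error(`No draft passed the checks after ${maxAttempts} attempts`);
}
```

Failing loudly once the budget is exhausted is a design choice: it keeps an unverified draft from being sent to subscribers.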
August 6, 2025

Paper accepted for publication in Heritage: History & Science, Vol. 58, No. 3
July 2025
Foundations of the heripo platform laid
- Designed and built the overall layout, design system, and related groundwork.
- Carried out the database design for the platform.
- Built the platform's account system, including sign-up, login, and profile editing.
June 2025
Development of Research Radar, a cultural-heritage newsletter service
- After submitting the paper, balancing regular work with the groundwork for the heripo platform made even keeping abreast of the latest academic trends burdensome.
- This prompted a pause from other tasks to build what began as an "automation tool for personal use."
- During development, it became apparent that the tool could be valuable not only for archaeology but for the broader cultural heritage academic and professional community, and it was accordingly refined to a level suitable for general use.
- A July launch had been planned, but because the paper was under review and anonymity needed to be maintained in this digital appendix and on the Research Radar landing page, the launch was postponed.
May 25, 2025

Submitted "A Study on Archaeological Informatization Using Large Language Models (LLMs)" to Heritage: History & Science, Vol. 58, No. 3
April – May 2025

Wrote the paper "A Study on Archaeological Informatization Using Large Language Models (LLMs)"
