Digital Appendix
This page is the digital appendix to the eponymous paper published on September 30, 2025 in Heritage: History & Science, Vol. 58, No. 3. Because certain aspects of the system built during the research cannot be fully conveyed on the printed page, this digital appendix has been prepared to complement the paper.
The paper was written during April–May 2025. LLM technology has been advancing at an extraordinary pace, and models far more capable and cost-effective than those available at the time of writing have since emerged. For instance, OpenAI's GPT-4o — employed throughout the paper — has been succeeded by GPT-5, which delivers substantial performance gains at reduced cost. Given this rapid rate of progress, the performance, accuracy, and pricing figures presented in the paper and accompanying videos should be understood as reflecting the state of technology at that particular moment. Replacing the model alone would already yield noticeably improved results in the applications shown. The core contribution of this research lies not in the performance or output at any particular point in time, but in validating the concept that metadata can be extracted automatically from excavation reports by designing an automation pipeline and applying LLMs where they are most effective.
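The two points above, that LLMs are applied only where they are most effective and that the model itself is replaceable, can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from the paper's implementation: the `Complete` type, the field names, and the stub model are assumptions made for illustration only.

```typescript
// Hypothetical sketch: none of these names come from the paper or from heripo-engine.

// A completion function stands in for any LLM; swapping models means swapping this value.
type Complete = (prompt: string) => Promise<string>;

interface ReportMetadata {
  year: number | null; // publication year, found by a plain rule
  siteName: string | null; // site name, delegated to the LLM
}

// Rule-based step: a four-digit year is cheap to find without a model.
function extractYear(text: string): number | null {
  const match = text.match(/\b(19|20)\d{2}\b/);
  return match ? Number(match[0]) : null;
}

// LLM step: used only for the field that simple rules handle poorly.
async function extractSiteName(text: string, complete: Complete): Promise<string | null> {
  const answer = (await complete(`Name the excavation site in this text:\n${text}`)).trim();
  return answer.length > 0 ? answer : null;
}

// Combine both steps; the model is a parameter, so it can be replaced freely.
async function extractMetadata(text: string, complete: Complete): Promise<ReportMetadata> {
  return {
    year: extractYear(text),
    siteName: await extractSiteName(text, complete),
  };
}

// Stub model for demonstration; a real client would call a hosted model here.
const stubModel: Complete = async () => "Example Site";

extractMetadata("Excavation report, published 2019.", stubModel).then((metadata) =>
  console.log(metadata),
);
```

Because the model is passed in as a function, moving from one model to a newer one would change only the value given to `extractMetadata`, not the pipeline around it.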
The various limitations visible in the videos, as well as the technical and practical constraints discussed in the paper, will continue to be addressed. Related materials and demonstration videos will be added to this space on an ongoing basis.
Follow-up research and development are ongoing. In January 2026, heripo-engine was released as an open-source TypeScript library that extracts structured data from excavation-report PDFs. The engine handles the core preprocessing stages of the data pipeline (PDF parsing, document structure analysis, and image/table extraction) and has reduced the processing cost per report to under $0.50. A web platform is being built on this foundation. The original aim was to develop and release a publicly usable prototype service in the second half of 2025; the release schedule has since been adjusted to sometime in 2026. Methodology refinement and implementation will nonetheless continue in parallel so as to bring the release forward as far as possible.
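As a rough illustration of how staged preprocessing of this kind can be composed, the sketch below chains three simplified stages. It is not the heripo-engine API: all type and function names are hypothetical, PDF parsing is replaced by splitting plain text on form feeds, table detection is reduced to pipe-delimited lines, and image extraction is omitted.

```typescript
// Hypothetical sketch; these types and functions are NOT the heripo-engine API.

interface Section {
  title: string;
  page: number;
}

interface Table {
  page: number;
  rows: string[][];
}

interface ReportDocument {
  pages: string[]; // text of each page
  sections: Section[]; // headings found by structure analysis
  tables: Table[]; // tables found by table extraction
}

type Stage = (doc: ReportDocument) => ReportDocument;

// Stand-in for PDF parsing: split raw text into pages on form feeds.
function parsePages(raw: string): ReportDocument {
  return { pages: raw.split("\f"), sections: [], tables: [] };
}

// Structure analysis: treat lines such as "1. Overview" as section headings.
const analyzeStructure: Stage = (doc) => {
  const sections: Section[] = [];
  doc.pages.forEach((text, index) => {
    for (const line of text.split("\n")) {
      if (/^\d+\.\s+\S/.test(line)) {
        sections.push({ title: line.trim(), page: index + 1 });
      }
    }
  });
  return { ...doc, sections };
};

// Table extraction: collect the pipe-delimited lines on a page as one table.
const extractTables: Stage = (doc) => {
  const tables: Table[] = [];
  doc.pages.forEach((text, index) => {
    const rows = text
      .split("\n")
      .filter((line) => line.indexOf("|") !== -1)
      .map((line) => line.split("|").map((cell) => cell.trim()));
    if (rows.length > 0) {
      tables.push({ page: index + 1, rows });
    }
  });
  return { ...doc, tables };
};

// Run the stages in order, each receiving the previous stage's output.
function runPipeline(raw: string, stages: Stage[]): ReportDocument {
  return stages.reduce((doc, stage) => stage(doc), parsePages(raw));
}

const result = runPipeline(
  "1. Overview\nSite survey\fTrench | Depth\nT1 | 1.2 m",
  [analyzeStructure, extractTables],
);
console.log(result.sections, result.tables);
```

Keeping each stage as a function from document to document means a stage can be replaced or reordered without touching the others.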
This digital appendix is updated primarily with major announcements. For continuous updates from heripo lab, please refer to our GitHub. Those wishing to receive updates more conveniently are encouraged to subscribe to the newsletter, which delivers research trends in the cultural heritage field along with updates from heripo lab.
The first feature of the web platform discussed in the paper is now operational. Research Radar is a service that collects web materials related to cultural heritage, employs an LLM to rank them by importance, and delivers the results as a personalized newsletter each morning at 8:30 AM.
Its code is available as a core package (@llm-newsletter-kit/core) and a cultural-heritage implementation (@heripo/research-radar); both have been released.
@llm-newsletter-kit/core: npm install @llm-newsletter-kit/core (source on GitHub)
@heripo/research-radar: npm install @heripo/research-radar @llm-newsletter-kit/core (source on GitHub)
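The ranking and the 8:30 AM delivery described for Research Radar can be sketched as follows. This is a simplified illustration, not the API of @heripo/research-radar or @llm-newsletter-kit/core: the `Article` and `ScoreFn` types and both functions are hypothetical, and a synchronous keyword scorer stands in for the LLM, whose real call would be asynchronous.

```typescript
// Hypothetical sketch; not the @heripo/research-radar or @llm-newsletter-kit/core API.

interface Article {
  title: string;
  url: string;
}

// Synchronous stand-in for the LLM: returns an importance score for one article.
type ScoreFn = (article: Article) => number;

// Score every article, sort from most to least important, and keep the top N.
function rankArticles(articles: Article[], score: ScoreFn, topN: number): Article[] {
  return articles
    .map((article) => ({ article, importance: score(article) }))
    .sort((a, b) => b.importance - a.importance)
    .slice(0, topN)
    .map((entry) => entry.article);
}

// Return the next 08:30 (local time) that is strictly after `now`.
function nextDelivery(now: Date): Date {
  const next = new Date(now.getTime());
  next.setHours(8, 30, 0, 0);
  if (next.getTime() <= now.getTime()) {
    next.setDate(next.getDate() + 1);
  }
  return next;
}

// Demonstration scorer: favors titles that mention "excavation".
const keywordScore: ScoreFn = (article) =>
  article.title.toLowerCase().indexOf("excavation") !== -1 ? 10 : 1;

const top = rankArticles(
  [
    { title: "Museum opening hours", url: "https://example.com/a" },
    { title: "New excavation report released", url: "https://example.com/b" },
  ],
  keywordScore,
  1,
);
console.log(top, nextDelivery(new Date()));
```

Passing the scorer in as a function keeps the ranking logic independent of whichever model produces the scores.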
Although its topic differs from the metadata extraction described in the paper, Research Radar is likewise a hybrid pipeline that stitches several modules together. Unlike the system in the paper, however, it is built on LangChain. The stages are as follows.