22nd AIAI 2026, 16 - 19 July 2026, Chania, Crete, Greece

Investigating Data Lakes Semantic Enrichment Using LLMs

Photiou Artemis, Papageorgiou Panagiotis, Pingos Michalis, Andreou Andreas

Abstract:

  The increasing adoption of data lakes in modern data-driven enterprises has intensified the need for effective metadata management to support data discovery, governance, and analytics. However, heterogeneous and incomplete metadata often leads to poorly organized repositories, commonly referred to as data swamps. This paper investigates whether Large Language Models (LLMs) can enhance metadata quality within data lake ingestion workflows. A containerized architecture is proposed that integrates LLM-based metadata evaluation and semantic enrichment into the ingestion process. The system is implemented in a Hadoop Distributed File System (HDFS) environment and operates on cultural heritage metadata retrieved from the Europeana API. Multiple LLMs are compared under identical prompting conditions to assess their ability to judge metadata quality and generate additional semantic attributes. Experiments examine the impact of metadata sanitization and enrichment across several configurations, measuring both metadata quality and computational overhead. The results show that moderate metadata sanitization improves evaluation consistency, while targeted LLM-based enrichment preserves semantic richness without significant performance overhead. These findings suggest that LLMs can effectively support metadata governance and reduce the risk of data lakes degrading into poorly structured data swamps.
