Forschungszentrum Jülich GmbH
Elaheh Akhoundi, Abril Azócar Guzmán*, Stefan Sandfeld
Institute for Advanced Simulations – Materials Data Science and Informatics (IAS‑9), Forschungszentrum Jülich GmbH, Germany
*a.azocar.guzman@fz-juelich.de
Large Language Models (LLMs) are becoming increasingly common in data management, transforming the way we handle, analyze, and explore information. LLM-based tools can be used to extract structured data from large volumes of unstructured content in materials science. However, a key challenge in their application is reliability, particularly their tendency to produce hallucinations, or inaccurate outputs, especially in domain-specific tasks.
To mitigate this, we optimized several stages of our data processing pipeline and integrated ontologies to provide a structured semantic backbone for knowledge extraction tasks. We analyze a text corpus of scientific papers related to crystallographic defects and extract relevant information using application- and domain-level ontologies developed within the NFDI-MatWerk framework. Specifically, we employ the Computational Material Sample Ontology, the Crystallographic Defects Ontology Suite, and the Atomistic Simulation Methods Ontology [1].
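As a rough illustration of how an ontology can constrain extraction output, the sketch below checks an extracted record against a controlled vocabulary. It is a minimal example under stated assumptions, not the pipeline described here: the field names (`material`, `defect_type`), the function `validate_record`, and the term list are hypothetical placeholders and are not taken from the ontologies cited in [1].

```python
# Hypothetical controlled vocabulary standing in for ontology class labels.
# These terms are illustrative only and are NOT read from the real ontologies.
ALLOWED_DEFECT_TYPES = {"vacancy", "dislocation", "grain boundary", "stacking fault"}

# Hypothetical required fields of one extracted record.
REQUIRED_FIELDS = ("material", "defect_type")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []

    # Structural check: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")

    # Vocabulary check: the defect type must match a known term (case-insensitive).
    defect = record.get("defect_type") or ""
    if defect and defect.strip().lower() not in ALLOWED_DEFECT_TYPES:
        problems.append(f"unknown defect type: {defect}")

    return problems


if __name__ == "__main__":
    print(validate_record({"material": "Fe", "defect_type": "Vacancy"}))
    print(validate_record({"material": "Fe", "defect_type": "unicorn"}))
```

In a real setting the vocabulary would presumably be loaded from the ontology files rather than hard-coded; a term that fails the check is a candidate hallucination to flag for review.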
To evaluate the effectiveness of this ontology-assisted approach, we compare the extracted data against a reference dataset annotated by domain experts in materials science, with a focus on crystallographic defects. We use quantitative metrics such as structural consistency checks and F1 scores, and conduct an error analysis to categorize common failure types, including entity boundary issues and ambiguous terminology. By identifying these weak points, we further refine our pipeline, moving toward more reliable structured data extraction in materials science.
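To make the F1 comparison concrete, the following sketch scores a set of extracted items against an expert-annotated reference set using exact matching. It is an assumption-laden illustration rather than the evaluation code used in this work: the function name `precision_recall_f1`, the (document, entity) tuple representation, and the sample data are all hypothetical.

```python
def precision_recall_f1(predicted: set, reference: set) -> tuple[float, float, float]:
    """Compute exact-match precision, recall, and F1 between two sets of items."""
    true_positives = len(predicted & reference)

    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical (document id, extracted entity) pairs, for illustration only.
    predicted = {
        ("paper1", "vacancy"),
        ("paper1", "dislocation"),
        ("paper2", "grain boundary"),
    }
    reference = {
        ("paper1", "vacancy"),
        ("paper2", "grain boundary"),
        ("paper2", "stacking fault"),
    }
    p, r, f1 = precision_recall_f1(predicted, reference)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching penalizes the entity boundary issues mentioned above (for example, "edge dislocation" versus "dislocation"), so a relaxed or partial-match variant may be worth reporting alongside it.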
References
[1] https://github.com/OCDO/
© 2026