Ruhr-Universität Bochum
Two primary data sources provide insight into material properties: experimental measurements and computer simulations. The results of both are often documented in publications, yet these data remain unstructured, which makes it difficult to manage and retrieve the relevant knowledge. The lack of structure stems from inconsistent reporting standards, varied terminology, and the complex interplay between experimental setups and simulation parameters. As a result, researchers cannot make full use of the wealth of information presented in publications.
To address these challenges, our strategy involves two main steps. First, we manually compile an extensive reference dataset that is used for both training and evaluating an extraction scheme based on large language models (LLMs). Second, starting from the raw data, we apply natural language processing techniques to preprocess the text, identify entities, and extract relationships. We use LLM-based methods such as fine-tuning and prompt engineering to automatically extract the relevant information from the documents, and we then validate the automatically extracted results against the manually created reference dataset. With this approach, we construct a comprehensive knowledge graph that serves as a structured framework linking individual pieces of knowledge on material properties obtained from both experiments and simulations.
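The validation step can be illustrated with a minimal sketch that scores extracted relations against a reference set using precision, recall, and F1. This is only an assumed evaluation scheme, not necessarily the one used in this work: the `Triple` type, the exact-match criterion after light normalization, and the example entries are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    """Hypothetical (subject, relation, value) record for one extracted fact."""
    subject: str
    relation: str
    value: str


def normalize(triple: Triple) -> Triple:
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as extraction errors (an assumed matching rule).
    return Triple(*(" ".join(part.lower().split()) for part in triple))


def score(extracted: Iterable[Triple], reference: Iterable[Triple]) -> dict:
    """Compare extracted triples with the manually curated reference set."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference}
    true_positives = len(ext & ref)
    precision = true_positives / len(ext) if ext else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder entries for illustration only; not data from the study.
reference = [
    Triple("material_A", "has_property", "yield strength"),
    Triple("material_A", "measured_by", "tensile test"),
]
extracted = [
    Triple("Material_A", "has_property", "Yield  strength"),  # matches after normalization
    Triple("material_A", "measured_by", "simulation"),        # does not match the reference
]
metrics = score(extracted, reference)
print(metrics)
```

Exact matching is the strictest possible criterion; a real evaluation would likely need to tolerate synonyms and unit conversions, which this sketch does not attempt.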
Our goals are twofold: first, to simplify the storage and retrieval of data; and second, to improve the use of existing materials data from the literature by combining experimental and simulated data in a single queryable knowledge graph.
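As a rough illustration of what "queryable" could mean here, the following toy in-memory store keeps each fact together with its provenance (experiment or simulation) and supports simple filtered lookups. The schema, the class and field names, and all values are hypothetical placeholders; a real system would more plausibly rely on a dedicated graph database or an RDF store.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One property statement with its provenance (hypothetical schema)."""
    material: str
    prop: str
    value: float
    unit: str
    source: str    # "experiment" or "simulation"
    document: str  # identifier of the publication the fact was extracted from


class KnowledgeGraph:
    """Toy stand-in for a knowledge graph: a list of facts with filtered lookup."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def add(self, fact: Fact) -> None:
        self._facts.append(fact)

    def query(
        self,
        material: Optional[str] = None,
        prop: Optional[str] = None,
        source: Optional[str] = None,
    ) -> List[Fact]:
        # A filter set to None acts as a wildcard.
        return [
            f for f in self._facts
            if (material is None or f.material == material)
            and (prop is None or f.prop == prop)
            and (source is None or f.source == source)
        ]


# Placeholder values for illustration only; not real measurements.
kg = KnowledgeGraph()
kg.add(Fact("material_A", "yield_strength", 1.0, "MPa", "experiment", "doc_1"))
kg.add(Fact("material_A", "yield_strength", 2.0, "MPa", "simulation", "doc_2"))
kg.add(Fact("material_B", "yield_strength", 3.0, "MPa", "experiment", "doc_3"))

# Retrieve experimental and simulated values for the same property side by side.
results = kg.query(material="material_A", prop="yield_strength")
for fact in results:
    print(fact.source, fact.value, fact.unit, fact.document)
```

Storing the source type and the originating document on every fact is what allows experimental and simulated values for the same property to be retrieved and compared in one query.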
© 2026