Nanyang Technological University
Materials discovery depends on extensive datasets to enable accurate predictions and the identification of promising materials. However, accurate data are scarce in many materials science subdomains, which leads to reliance on simulation-based data and can compromise accuracy. To address this challenge, this study explores the use of data mining techniques to extract larger datasets from textual scientific databases. Specifically, we apply these techniques to the subdomain of biosensors and early cancer detection, using optimized keyword searches within the Elsevier database. This search identified 750 relevant papers published between 2024 and 2025. A trial set of 48 papers was then processed using XMLTree and Pandas, yielding 2,134 extracted paragraphs. Each paragraph was pre-labeled with its section metadata, enabling downstream tasks such as Named Entity Recognition (NER) using GLiNER, BERT, and HuggingFace models. Early-stage results suggest that section-based segmentation enhances the performance of zero-shot NER models. This study presents a scalable approach for converting unstructured literature into structured databases, supporting data-driven materials research in interdisciplinary domains.
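The section-labeled paragraph extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tag names (`section`, `title`, `para`), the sample document, and the helper `extract_paragraphs` are hypothetical simplifications, and real publisher XML typically uses namespaced tags and deeper nesting. The sketch uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article XML; real full-text XML will differ.
SAMPLE_XML = """<article>
  <section>
    <title>Introduction</title>
    <para>Early cancer detection benefits from sensitive biosensors.</para>
    <para>Graphene-based electrodes are widely studied.</para>
  </section>
  <section>
    <title>Methods</title>
    <para>A gold nanoparticle electrode was functionalized with antibodies.</para>
  </section>
</article>"""


def extract_paragraphs(xml_text, paper_id):
    """Return one record per paragraph, labeled with its section title."""
    root = ET.fromstring(xml_text)
    rows = []
    for section in root.iter("section"):
        # Fall back to a placeholder when a section has no title element.
        title = section.findtext("title", default="Untitled").strip()
        # Only direct child paragraphs, so nested sections keep their own label.
        for para in section.findall("para"):
            # Join text from inline child elements and normalize whitespace.
            text = " ".join("".join(para.itertext()).split())
            if text:
                rows.append({"paper_id": paper_id, "section": title, "text": text})
    return rows


rows = extract_paragraphs(SAMPLE_XML, "paper-001")
for row in rows:
    print(row["section"], "|", row["text"])
```

The resulting list of records can be passed to `pandas.DataFrame(rows)` to obtain a table with one paragraph per row, and the `section` column can then be used to select which paragraphs are sent to a NER model.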
© 2025