AI MSE 2025
Lecture
19.11.2025 (CET)
Leveraging Large Language Models in Polymer Informatics Data Collection

Veronique Barthelemy (M.Sc.)

TNO Netherlands Organisation for Applied Scientific Research

Barthelemy, V. (Speaker)¹; Holzbach, N.¹; van de Bunt, N.¹; Shahmohammadi, S.¹; Boersma, A.¹; Urbanus, J.H.¹
¹TNO Netherlands Organisation for Applied Scientific Research, The Hague (Netherlands)
18 min

In polymer informatics, using artificial intelligence to design new polymers through data-driven approaches is becoming increasingly essential. This requires extensive datasets and advanced data-driven tools. We have developed a comprehensive framework that uses Large Language Models (LLMs) to retrieve polymer properties (mechanical, chemical, thermal, etc.) from scientific publications. The retrieved data are used to train machine learning models that predict polymer properties accurately and support the design of new polymers.

Our workflow begins by retrieving relevant papers from a publisher’s API and filtering them with zero-shot classification to select only those of interest. The Azure Document Intelligence tool is used to parse each PDF and extract the metadata and text, while the tables are converted to images and analyzed with GPT-4o to parse them with high accuracy. The tables are then processed to identify properties of interest and to standardize their names and units of measure, so that only tables containing relevant properties are retained. The accuracy of the parser and pre-processing steps reaches 70 to 90%.
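The table post-processing step (identify properties of interest, standardize names and units, discard irrelevant tables) can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym map, the target units, the table layout (first column holds the sample label, headers look like "Name (unit)") and the function names `parse_header` and `standardize_table` are all assumptions made for illustration.

```python
import re

# Hypothetical synonym map: raw header names -> canonical property names.
PROPERTY_SYNONYMS = {
    "tg": "glass_transition_temperature",
    "glass transition temperature": "glass_transition_temperature",
    "young's modulus": "youngs_modulus",
    "tensile modulus": "youngs_modulus",
}

# Hypothetical target unit per canonical property.
TARGET_UNITS = {
    "glass_transition_temperature": "K",
    "youngs_modulus": "GPa",
}

# (canonical property, raw unit) -> conversion into the target unit.
UNIT_CONVERTERS = {
    ("glass_transition_temperature", "°C"): lambda v: v + 273.15,
    ("glass_transition_temperature", "K"): lambda v: v,
    ("youngs_modulus", "MPa"): lambda v: v / 1000.0,
    ("youngs_modulus", "GPa"): lambda v: v,
}

# Assumed header shape: "Name (unit)" or "Name [unit]".
HEADER_RE = re.compile(r"^\s*(.+?)\s*[\(\[]\s*([^\)\]]+?)\s*[\)\]]\s*$")


def parse_header(header):
    """Return (canonical property, raw unit), or None if the header is not of interest."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    name = match.group(1).lower().replace("’", "'")
    prop = PROPERTY_SYNONYMS.get(name)
    unit = match.group(2)
    if prop is None or (prop, unit) not in UNIT_CONVERTERS:
        return None
    return prop, unit


def standardize_table(headers, rows):
    """Convert a parsed table into standardized records.

    An empty result means the table holds no property of interest
    and can be dropped.
    """
    records = []
    for col, header in enumerate(headers[1:], start=1):
        parsed = parse_header(header)
        if parsed is None:
            continue  # column is not a known property/unit pair
        prop, unit = parsed
        convert = UNIT_CONVERTERS[(prop, unit)]
        for row in rows:
            try:
                value = float(row[col])
            except (ValueError, IndexError):
                continue  # non-numeric or missing cell
            records.append({
                "label": row[0],
                "property": prop,
                "value": convert(value),
                "unit": TARGET_UNITS[prop],
            })
    return records
```

In this sketch a table whose headers match none of the known properties yields an empty list, which mirrors the idea of retaining only tables that contain relevant properties.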

The next phase uses the GPT-4o model to analyze the text and match the polymer labels used in the tables to the corresponding polymer names mentioned in the text, if available. The identified polymer names are then converted into SMILES (Simplified Molecular Input Line Entry System) representations of their chemical structures using OPSIN. This conversion remains a hybrid approach that requires manual intervention to correct and refine the outcomes. In addition, not all articles state polymer names explicitly, since polymers are often described only in schemes and figures; handling these cases will require extending the tool to molecular structure recognition in the future.
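The label-matching and hybrid conversion logic can be sketched as follows. This is a simplified stand-in, not the workflow described above: a regular expression replaces the GPT-4o matching step and only catches the pattern "polymer name (ABBREVIATION)", and the name-to-SMILES converter is passed in as a hypothetical callable `name_to_smiles` (in practice it could wrap OPSIN). The function names are assumptions for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Heuristic stand-in for the LLM step: matches "poly(...) (ABC)" or "polyxyz (ABC)".
ABBREVIATION_RE = re.compile(
    r"\b([Pp]oly\([^()]+\)|[Pp]oly[a-z]+)\s+\(([A-Z][A-Za-z0-9-]*)\)"
)


def extract_label_map(text: str) -> Dict[str, str]:
    """Map abbreviations found in the text to lower-cased polymer names."""
    return {label: name.lower() for name, label in ABBREVIATION_RE.findall(text)}


def resolve_labels(
    table_labels: List[str],
    text: str,
    name_to_smiles: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Resolve table labels to names and SMILES.

    Labels with no matching name, or whose name cannot be converted,
    are returned separately so they can be reviewed manually.
    """
    label_map = extract_label_map(text)
    resolved: Dict[str, Dict[str, str]] = {}
    needs_review: List[str] = []
    for label in table_labels:
        name = label_map.get(label)
        smiles = name_to_smiles(name) if name else None
        if smiles:
            resolved[label] = {"name": name, "smiles": smiles}
        else:
            needs_review.append(label)  # hand over to manual curation
    return resolved, needs_review
```

Keeping the converter as a parameter makes the manual-review path explicit: anything the automatic step cannot resolve ends up in `needs_review` rather than being silently dropped.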

In conclusion, we leveraged the most recent AI capabilities to build a highly automated workflow that substantially reduces the amount of manual work required. This workflow enables the collection of large amounts of relevant data with much less effort than classical approaches, thereby facilitating more robust and scalable polymer informatics investigations.

