Universidad de Concepcion
Thermodynamics is a fundamental science that describes how systems in stable, metastable, or unstable states behave as they interact with their surroundings. Thermodynamic modeling with the Calculation of Phase Diagrams (CALPHAD) method has allowed researchers to go beyond traditional equilibrium applications, enabling a quantitative understanding of materials, their improvement, and the design of novel materials. The CALPHAD method describes the Gibbs free energy of each phase in a thermodynamic system. A wide array of commercial tools for CALPHAD modeling is available, complemented by extensive databases that serve educational, research, and industrial needs. CALPHAD-based computational tools such as the Thermo-Calc software assist at the process design stage by performing thermodynamic and kinetic calculations that generate reliable materials data. These tools also reduce the need for experiments and tests, making process development more economical and practical.
The modern CALPHAD approach is fundamental for establishing materials genomic databases and is widely recognized as a foundational tool for materials-by-design. The scientific community currently uses it to enrich materials databases for machine learning (ML) applications. Studies following this approach have examined the strength properties of various steels, including high-strength, ferritic-martensitic, pearlitic, and austenitic steels, with promising results. Further research has addressed creep properties in austenitic steels and Fe-Ni-Cr-Al alloys, fatigue in low-alloy and carbon steels, and the classification of steel microstructures using ML methods, demonstrating the potential of ML for future materials development.
Various types of machine learning models are applied in materials design, each with its own advantages and applications. Among the most common are regression models (such as linear regression or random forest regression), which predict alloy properties from input variables such as chemical composition and processing temperature. These models help optimize the formulation of new alloys and forecast their behavior under different conditions. Classification models (e.g., decision trees or support vector machines) identify patterns in large datasets and assign alloys to classes such as high strength, corrosion resistance, or thermal conductivity. Clustering models (e.g., k-means clustering) identify similarities between alloys, grouping them into families with similar properties and facilitating the exploration of new materials.
Machine learning is a data-driven methodology that depends on the quantity and quality of the available data. Compared with fields like astronomy or geology, where vast volumes of data are collected every hour, the amount of data in materials science is notably limited. Consequently, ML models in materials design are prone to overfitting, which diminishes their ability to generalize. Data quality is determined by coverage and by the associated uncertainties. Coverage describes how comprehensively the data represent the different aspects of the properties under study, while uncertainties encompass measurement errors, noise, and biases that affect data reliability. Ensuring adequate coverage and addressing uncertainties are crucial for the reliability and effectiveness of ML models.
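One common way to check for overfitting when data are scarce is k-fold cross-validation, in which every sample is held out exactly once. The following is a minimal sketch in plain Python; the function names, the fold count, and the sample count are illustrative assumptions and are not taken from this study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with assumed sizes: 10 samples and 5 folds.
splits = list(train_test_splits(10, 5))
```

Averaging a model's error over the held-out folds gives a more honest estimate of generalization than the error on the training data alone.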
Data for research purposes can be sourced from available databases or published papers, offering a diverse range of information from experiments and simulations. While databases provide vast data quantities, their reproducibility may be uncertain, impacting data quality. To ensure both quantity and quality, authoritative databases are preferred. Alternatively, autonomous workflows for data generation offer convenience but may yield inferior data compared to databases. If required data are unavailable from databases or autonomous workflows, datasets can be generated through lab-scale calculations using software platforms like Materials Studio or Vienna Ab initio Simulation Package, ensuring good data reproducibility. Despite the resource-intensive nature of calculations for complex materials, ML models constructed using calculated data typically exhibit favorable evaluation metrics.
Materials scientists obtain data primarily through characterization equipment, but relying solely on new experiments to collect data for ML models is impractical. Publications that report characterization results for metallic alloys provide a viable alternative data source. Collecting these data manually is inefficient, so automation is needed for efficient and accurate work. Techniques such as data scraping can improve the collection process by gathering information from online sources systematically. However, data are reported in many different ways across publications, so scraping methods must be adapted carefully to extract the data accurately.
Collecting text data from PDF archives in materials science requires dealing with variation in data reporting formats. Scraping code must extract data from many formats, which becomes challenging when numerous publications are involved. Pre-trained natural language processing models hold promise here. Models like BERT or GPT-3 can help reorganize and summarize text from publications, streamlining extraction and making data collection more efficient.
Utilizing pre-trained artificial intelligence models in natural language processing offers an innovative approach to tackle data extraction complexities. Models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-3 (Generative Pre-trained Transformer 3) have demonstrated remarkable capabilities in understanding and generating human-like text. By fine-tuning these models on domain-specific datasets in material science, researchers can enhance their ability to comprehend technical language and extract relevant information from PDF documents accurately. In Python, these models can be accessed and utilized through libraries like Hugging Face's Transformers or TensorFlow, enabling sophisticated natural language processing techniques with relative ease and efficiency. Integrating advanced AI technology with Python programming provides a powerful toolkit for optimizing data collection and analysis in material science research.
The scientific literature covers a vast range of materials and applications. This study focuses on one specific material, 9-12%Cr steel, and investigates the kinetics of Laves phase precipitation under creep conditions at high temperatures (up to 650 °C). A known drawback of the growth or coarsening of Laves phase in high-Cr martensitic steels is that it degrades long-term creep strength, progressively reducing the allowable stress over time. When precipitate sizes are stabilized, however, Laves phase precipitation becomes one of the most effective strengthening mechanisms. This has prompted the development of ferritic steels with chromium contents above 13 wt%, in which Laves phase pins dislocations and grain boundaries. Recent research into the heterogeneous nucleation of Laves phase within creep cavities has revealed its potential to prolong the life of metals under creep conditions by hindering coarsening, agglomeration, and crack propagation. Alloys with this kind of Laves phase precipitation are regarded as self-healing, which underscores the significance and promise of Laves phase precipitation studies.
Laves phase precipitation in 9-12%Cr steel has been researched extensively for about three decades. Despite the wealth of kinetic information available, it is often presented in formats that are difficult to handle with traditional data collection methods. In this study, we gathered approximately two thousand publications from www.sciencedirect.com, ensuring compliance with the legal conditions for PDF download, and filtered them according to the needs of our investigation. The PDFs were processed with Python code. First, text was extracted from the PDFs using the PyMuPDF library. The resulting text was then tokenized, segmented, and encoded to obtain embeddings with a pretrained AI model based on BERT. Specifically, we used the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model from the Transformers library.
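The extraction and segmentation steps might look like the following minimal sketch. It assumes PyMuPDF is installed (imported as `fitz`); the helper names, the 200-word window, and the 40-word overlap are hypothetical choices, since the actual segmentation settings are not stated.

```python
def extract_pdf_text(path):
    """Return the plain text of every page in a PDF, joined by newlines."""
    # PyMuPDF is imported here so the chunking helper below has no dependency.
    import fitz

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def split_into_chunks(text, max_words=200, overlap=40):
    """Split text into overlapping word windows (sizes are assumed values)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk could then be encoded into an embedding vector, for example with the `SentenceTransformer.encode` method of the sentence-transformers package. The overlap keeps a sentence that straddles a window boundary intact in at least one chunk.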
With these embeddings, the BERT-based model searched for the text fragments that most closely matched a defined prompt, such as "What is the size, measured in nanometers, of the Laves Phase particles?" The matched text fragments were then combined. In the next step, the resulting text was formatted for the pretrained Stable LM Zephyr 3B text generation model. The model responded to prompts such as "Do not provide more than the requested data, be concise. List the alloy class, Laves phase size in nanometers, treatment temperature in degrees Celsius, and treatment time in hours from the given". The resulting data, containing the Laves phase particle size, exposure temperature, and exposure time, were then collected with a scraping function tailored to the Stable LM Zephyr 3B output format and saved as a CSV file. The alloy composition and heat treatment, on the other hand, were collected with simple scraping code, because the experimental procedure for this type of alloy is generally reported in a similar text format. Finally, the data were merged into a complete database.
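The retrieval and post-processing steps can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: embeddings are taken as precomputed lists of floats, fragments are ranked by cosine similarity, and the response layout (one `label: value unit` pair per line) is a hypothetical format, since the real Stable LM Zephyr 3B output is not shown. All function and field names are invented for the example.

```python
import csv
import io
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_fragments(query_vec, fragment_vecs, fragments, k=3):
    """Return the k fragments whose embeddings are closest to the query."""
    scored = sorted(
        zip(fragments, fragment_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [fragment for fragment, _ in scored[:k]]


# Hypothetical response layout: "label: number unit", one field per line.
NUMBER = r"(\d+(?:\.\d+)?)"
FIELD_PATTERNS = {
    "size_nm": re.compile(r"size[^:\n]*:\s*" + NUMBER + r"\s*nm", re.I),
    "temperature_c": re.compile(r"temperature[^:\n]*:\s*" + NUMBER + r"\s*°?\s*C", re.I),
    "time_h": re.compile(r"time[^:\n]*:\s*" + NUMBER + r"\s*h", re.I),
}


def parse_response(text):
    """Extract numeric fields from a model response; missing fields stay None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = float(match.group(1)) if match else None
    return record


def records_to_csv(records):
    """Serialize parsed records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Leaving unmatched fields as `None` rather than guessing a value makes incomplete responses easy to spot and filter during later data treatment.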
To compare data collection methods, we first gathered data manually, obtaining approximately 120 measurements of Laves phase particle size from around five hundred publications within a month. Scraping code without AI, designed specifically to retrieve Laves phase sizes, yielded four hundred data points, albeit with notable inconsistencies and noise. Integrating AI into the scraping code markedly improved data collection, yielding up to 1300 alloys after data treatment. The resulting database contains composition, heat treatment, Laves phase precipitate size, exposure temperature, and exposure duration. The final database agreed with the 120 manually collected data points, but the remaining entries were difficult to validate. Notably, these unvalidated data points fell within logical ranges and did not show exaggerated or improbable values.
The compiled database is a valuable resource for training machine learning models that forecast the size of Laves phase precipitates. Using the collected data on alloy composition, exposure temperature, and exposure duration together with the corresponding Laves phase sizes, we can train regression models such as linear regression or random forest regression. These models predict Laves phase size from the input variables, allowing us to analyze how changes in alloy composition, temperature, and time affect Laves phase growth and coarsening. In the next step of our research, the model predictions will be compared with calculations performed with the Thermo-Calc PRISMA module. This comparison will help validate the predictive models and improve their reliability.
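As a dependency-free illustration of the regression step, the sketch below fits a multiple linear regression by ordinary least squares. In practice a library such as scikit-learn would normally be used; the function names are hypothetical, and the numbers are synthetic, noise-free placeholders, not measurements from the database.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (A^T A) w = A^T y."""
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    n = len(A[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]  # [intercept, coef_1, ..., coef_n]


def predict(weights, row):
    """Evaluate the fitted linear model for one feature row."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Synthetic placeholder rows (NOT measured data) with two generic features.
# In the setting described above, the columns would be composition,
# temperature, and time, and y would be the Laves phase size.
X_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [0.0, 1.0]]
y_demo = [2.0 + 3.0 * a - 1.0 * b for a, b in X_demo]
weights = fit_linear_regression(X_demo, y_demo)
```

A linear fit of this kind is a useful baseline: if a random forest does not clearly outperform it on held-out data, the extra model complexity is probably not justified by the available data.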
Improving on the results of established CALPHAD-based tools with machine learning models is crucial for advancing the precision and efficiency of materials engineering. By surpassing the accuracy and predictive capability of Thermo-Calc, machine learning models offer the opportunity to better understand phase formation in materials and to design materials with optimized properties for a wide range of industrial and technological applications.
Abstract
© 2025