High-quality metal material process dataset construction method based on large language model

By using a large language model-based approach, we can automatically extract and structure the data of metal material processing from academic literature, solving the problems of low data processing efficiency and poor format adaptability in existing technologies. This approach has enabled the construction of a high-quality dataset of metal material processing data, supporting materials genome engineering and intelligent manufacturing.

CN121789818BActive Publication Date: 2026-06-19NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHEASTERN UNIV CHINA
Filing Date
2026-03-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve batch and accurate extraction and integration of metal material process data from academic papers, and existing automated tools cannot adapt to heterogeneous multi-source data formats and non-standard information processing, resulting in low data processing efficiency and high maintenance costs.

Method used

This paper adopts a large language model-based approach to extract metal material process entries from academic literature through a pre-trained model. Combined with regular expressions and a three-level duplicate detection mechanism, it realizes automated processing from academic literature to structured data, including DOI acquisition, structured parsing, data extraction, and deduplication.

Benefits of technology

It achieves efficient and accurate end-to-end extraction of standardized material entries from unstructured documents, improves data processing efficiency and deduplication accuracy, and constructs a high-quality metal material process dataset, providing a reliable data foundation for materials genome engineering and intelligent manufacturing.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121789818B_ABST
    Figure CN121789818B_ABST
Patent Text Reader

Abstract

This invention provides a method for constructing high-quality metal material process datasets based on a large language model, relating to the field of process data collection and processing technology. It introduces a chemical composition hashing (HC) mechanism, which transforms complex composition information into unique identifiers through standardized concatenation of a preset element set and MD5 hashing, enabling efficient and accurate cross-file comparison. A three-level duplicate detection system—precise duplicates → similar duplicates → LLM intelligent confirmation—effectively avoids false rejections due to differences in expression while ensuring high recall. A prompting engineering system is constructed based on a locally deployed large language model, integrating domain knowledge templates and strict output format constraints. This achieves complete reconstruction and structured expression of complex process chains without manual intervention, enabling end-to-end extraction of standardized material entries from massive amounts of unstructured documents. Furthermore, semantic-level feature hashing and LLM-assisted judgment mechanisms significantly improve deduplication accuracy and logical consistency.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of process data collection and processing technology, and in particular to a method for constructing a high-quality metal material process dataset based on a large language model. Background Technology

[0002] In today's scientific research and innovation system, academic papers (especially in the fields of materials science and engineering) contain a wealth of data on metallic materials processing. This data is the core foundation for researchers to conduct subsequent research, avoid duplication of effort, and drive technological breakthroughs, providing crucial data support for new materials development and process optimization.

[0003] Achieving batch and accurate extraction of academic paper materials, resolving format heterogeneity issues, and integrating multi-source data through intelligent deduplication technology have become critical problems that urgently need to be solved in the field of academic data processing. The successful development of an end-to-end automated processing method would greatly improve the efficiency and quality of scientific research data processing, and would be of great significance in promoting the digital transformation of scientific research.

[0004] Existing solutions lack sufficient data extraction capabilities. Traditional methods rely on manual reading of each paper and manual data entry, which is insufficient for handling large-scale literature data (such as batch processing of thousands or even tens of thousands of papers). Existing automated tools are mostly limited to single-format parsing, unable to adapt to the XML full-text data structure of academic databases (such as Elsevier), and lack the ability to accurately extract information from chapter-based and tabular materials.

[0005] Existing methods suffer from inconsistent data extraction formats. The structures of different papers vary significantly, with information scattered across various locations such as paragraphs, tables, and figure captions, and the expression methods are not standardized. The conversational language used in process descriptions makes it difficult to directly integrate and utilize the extracted data, requiring substantial additional manpower for standardization.

[0006] Existing solutions suffer from poor adaptability and scalability. Their inflexibility manifests in their technical architecture's inability to adapt to changing application scenarios and data sources, resulting in high maintenance costs. Systems based on handwritten rules or fixed templates experience a sharp performance drop once the paper's chapter structure, writing style, or data presentation changes even slightly. Each new format or expression requires manual rewriting and debugging of rules, hindering scalable application. They are also ineffective against non-standard and implicit information. When information is in a non-standard form or requires common-sense reasoning, existing methods, due to their inherent limitations, cannot effectively process and reason about it. Summary of the Invention

[0007] The technical problem to be solved by this invention is to address the shortcomings of the existing technology by providing a method for constructing a high-quality metal material process dataset based on a large language model. This method automates the entire process from the automated collection and structured parsing of academic literature data to the extraction of metal material process entries and post-processing of the metal material process dataset, thereby achieving the goal of quickly, completely, and accurately constructing a high-quality metal material process dataset.

[0008] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:

[0009] On the one hand, this invention provides a method for constructing a high-quality metal material process dataset based on a large language model, comprising the following steps:

[0010] Step 1: Automatically acquire relevant literature and convert it into unified structured data;

[0011] Step 2: Extract metal material process items from the data generated in Step 1 using a pre-trained large language model. Specific steps include:

[0012] Step 2.1: Load the structured text content of the paper to be processed, including material experimental descriptions, performance test results, and process details;

[0013] Step 2.2: Load the table information stored in JSON format in the paper, including the material name, element content, all processing technologies related to the material, and the complete process;

[0014] Step 2.3: Prepare two types of templates, including process description examples and complex process flow examples;

[0015] Step 2.4: Configure the large language model, using the locally deployed gpt-oss:20b model, which supports long text processing;

[0016] Step 2.5: Configure API: Base address: http: / / localhost:11434 / v1, which is the default address of the Ollam service; Authentication key: any string;

[0017] Step 2.6: Configure generation parameters: temperature coefficient is 0; maximum number of tokens generated is 8192, suitable for long text process descriptions;

[0018] Step 2.7: Construct structured prompts;

[0019] Step 2.8: Model Invocation and Response Processing;

[0020] The prompts are sent to the model via the `client.chat.completions.create` interface, with input including the paper text, table data, and templates, and the response includes material entries; regular expressions are used ( Extract structured entries from the model output, remove redundant characters, and retain attribute key-value pairs;

[0021] Step 2.9: Validity verification and retry mechanism;

[0022] Length validation: If the response length is ≤20 characters, it is considered invalid output and a retry is triggered;

[0023] Retry strategy: The maximum number of retries is 3, with a 0.5-second interval between each retry, to ensure that the model fully processes the input;

[0024] Step 2.10: Store the metal material process entries;

[0025] All extracted standardized material entries are saved in JSON format to the output folder, with the file names corresponding to the DOI of the input paper, to construct the initial metallic materials process dataset;

[0026] Step 3: Post-process the initial metal material process dataset extracted in Step 2. The specific steps are as follows:

[0027] Step 3.1: Traverse all JSON format files in the preset input folder, extract the DOI identifier associated with the file and the set of material entry strings stored in the file; filter out invalid entries with a length ≤ 50 characters, and retain valid entries for subsequent processing;

[0028] Step 3.2: Use regular expression matching patterns to extract fields from valid entry strings and convert them into a structured dictionary in key-value pair format;

[0029] Step 3.3: Extract the core features for duplicate detection from the structured dictionary, construct a feature dictionary, and provide data support for subsequent comparisons;

[0030] Step 3.4: Based on the extracted feature dictionary, the three-level mechanism of "precise duplication judgment → similar duplication judgment → LLM intelligent confirmation" is used to achieve accurate identification of duplicate items;

[0031] Step 3.5: Processing and outputting deduplication results;

[0032] Construct a duplicate entry mapping table to record the source DOI, original entry, and basis for duplication for each duplicate entry; traverse all entries, retain the base entry in each group of duplicate entries, prioritize the entry with the highest DOI in the source file, and remove the remaining duplicate entries;

[0033] After deduplication, the data is stored in the deduplicated_data subdirectory of the output folder according to DOI, maintaining JSON format; the source, corresponding baseline entry, duplication type and reason for judgment of all removed entries are recorded and saved as duplicate_report.json; a statistical file deduplication_stats.json is generated, which includes the total number of entries, the number of deduplicated entries, the exact number of duplicates, the number of similar duplicates, the number of duplicates confirmed by LLM, and the deduplication rate.

[0034] The formula for calculating the deduplication rate is as follows:

[0035] ;

[0036] in, The total number of duplicate entries. This represents the total number of original valid entries;

[0037] Step 3.6: Filtering for invalid and redundant zero values;

[0038] To clean up meaningless or misleading "0" values ​​that may exist in the structured dictionary, the following rule-based cleaning is performed:

[0039] If the content field of an element is "0" and is not explicitly marked in the original paper's table, i.e. it was not measured experimentally, then replace it with the empty value null to avoid misjudging it as "not containing the element";

[0040] For mechanical property fields, namely tension, yield strength, and elongation, if their values ​​are 0 or empty, but are clearly described in the context, they are marked as unreported.

[0041] Delete all entries that contain only default filler fields and have no valid process description to prevent “empty shell entries” from polluting the dataset.

[0042] Furthermore, step 1 specifically includes:

[0043] Step 1.1: Based on the preset keyword list, search scope, and field restriction parameters, construct a dynamic query statement through the academic database API interface, and use a pagination mechanism to batch retrieve relevant papers and extract their DOI identifiers; the obtained DOI set is deduplicated and saved as a CSV file as the input index for subsequent processing;

[0044] Step 1.2: Based on the DOI set, call the full-text API, and in conjunction with the configuration parameters such as API key, request timeout, and retry strategy, send HTTP GET requests to the server one by one to download the full text of the paper in XML format; during the download process, URL encoding and path security processing are performed on the DOI to ensure the legality of the request and the compatibility of file naming; the XML content of successful responses is saved to the specified directory, and error logs are recorded for failed requests for tracking purposes;

[0045] Step 1.3: Parse and transform all downloaded XML files; use the parsing tool lxml to load the document structure, and extract the chapter tree and table data using XPath; the chapter content is recursively organized according to hierarchical numbering, and the main text and subsections are merged to generate a flattened paragraph sequence predata; the table information is extracted by extracting tags, descriptions, table headers and data bodies to form a structured JSON object; finally, the three types of data, namely the chapter tree, table data and predata entries, are integrated into a standard JSON file, named and stored with the original paper's DOI, completing the transformation from unstructured literature to a structured knowledge carrier.

[0046] Furthermore, in step 2.7, the structured prompts include the following core elements:

[0047] Role definition: The model is designated as "Materials Science Information Extraction Expert";

[0048] Target attribute list: Clearly define the 47 attributes to be extracted, including: basic material information, i.e., material name; mechanical properties, including the name, value, and unit of tensile strength, yield strength, and elongation; element content, i.e., the content of 36 elements, with elements not mentioned set to 0; process description, which must include a complete process chain: previous material and process → current material;

[0049] Extraction rules: The process description must be complete, including all processing steps involving the materials, avoiding vague references; when integrating tabular data, the phrase "as in Table X" must be replaced with specific content; the output format must strictly be "entry N: [attribute 1: value 1, attribute 2: value 2, ...]", with the attribute order consistent with the target list; embed template examples from the input data preparation to guide the model to follow the format and logic.

[0050] Furthermore, step 3.2 specifically includes:

[0051] The parsed regular expression is: ;

[0052] in, This is a prefix for Python's raw strings to prevent the backslash "\" from being escaped;

[0053] “ "This is capture group 1, used to match the key portion, where the key cannot contain colons or commas; the first half of the bracket " "and the second half of the parenthesis" "Indicates the capture group, start and end;" The negation character set is indicated by "," which matches any single character except for the full-width colon ":" and the full-width comma ",". The quotation mark (") indicates a quantifier that matches the preceding character set one or more times.

[0054] Capture group 1 followed by " The colon "" is a literal match for a full-width colon and must appear once to separate the key and value.

[0055] “ "This is capture group 2, used to match values; the outer parenthesis " "and the second half of the parenthesis" "" indicates a capture group, capturing the entire value content into the value; the first " The quotation mark ("") indicates that the first occurrence of zero or more non-full-width commas (") characters will match the beginning of the value. " indicates a non-capturing group, used to handle multiple commas that may appear in a value;" "Indicates no capture, only for grouping;" " indicates that the entire group can be repeated 0 or more times;

[0056] The first " within the non-capture group" " indicates a literal match of a full-width comma, i.e., the comma inside the value;" "Indicates negative forward-looking, where " " indicates a negative sign, meaning that the following pattern cannot be matched." "" indicates matching one or more characters that are neither colons nor commas, i.e., the next possible key; " indicates that another colon should follow;

[0057] The overall meaning of the non-capturing group is: ensure that the current comma is not followed by the beginning of "new key:"; if the comma is immediately followed by "text:", it means that this is the separator of the next field and the current value should not be included; otherwise, it is a normal comma inside the value and the current value can be safely included.

[0058] The second "in the non-capture group" The colon indicates that after consuming this "safe" comma, the match continues for zero or more non-comma characters, that is, continues for the remainder of the value;

[0059] The logic for the entire value part is as follows: first, consume the non-comma content, then repeatedly consume "safe comma + subsequent non-comma content" until a real new field separator is encountered, i.e., "new key:" is followed by a comma.

[0060] The parsing logic is as follows: match the core structure of "key:value", support commas in the value field to achieve accurate splitting of multiple fields, but do not include new "key:" structures; perform space removal and trailing comma removal on the extracted keys and values ​​respectively to generate a standardized structured dictionary.

[0061] Furthermore, step 3.3 specifically includes:

[0062] The feature extraction formula is:

[0063] Features={M,T V ,Y V E V H C ,P d};

[0064] Where M is the material name; T V Y represents the tension value. V E is the yield value. V This is the elongation value; H C P is the hash value of the chemical composition. d For process flow description;

[0065] By standardizing the process and hashing, the chemical composition of the material is transformed into a unique identifier, enabling rapid grouping and comparison.

[0066] Definition of the baseline element set: A preset set of 36 common elements in materials science.

[0067] E={H,B,C,N,O,F,Na,Mg,Al,Si,P,S,Cl,Ca,Ti,V,Cr,Mn,Fe,Co,Ni,Cu,Zn,As,Y,Zr,Nb,Mo,Sn,Sb,La,Ce,Ta,W,Pb,Bi};

[0068] Content standardization: for each element Extract the corresponding content value from the structured dictionary, prioritizing the matching of "e". i If the "element content" field is not found, the element symbol 'e' will be matched directly. i If no corresponding field exists, the content value is set to "0"; the extracted content values ​​are processed to remove spaces and convert to lowercase to generate a standardized content value v. i ;

[0069] Following the fixed order of the element set E, the elements are concatenated with their standardized content to form "e". i :v i The format string, with each string separated by "|", forms the concatenated string S containing the chemical composition. C ; Using the MD5 algorithm to analyze S CPerform a hash calculation to generate a 128-bit binary hash value, which is then converted into a 32-bit hexadecimal string as the final hash. C ;

[0070] ;

[0071] in, This indicates a sequential string concatenation operation; MD5() is the standard MD5 hash function.

[0072] Furthermore, in step 3.4, the specific method for precise repetition determination is as follows:

[0073] The core basis for determining whether two items are completely identical duplicates is the dual matching of chemical composition and material name;

[0074] Chemical composition matching: H in two feature dictionaries C1 =H C2 That is, if the hash values ​​are completely identical, it means that the chemical composition is the same;

[0075] Material name matching: Perform normalization processing on the two material names M1 and M2, including lowercase conversion, removal of spaces and full-width spaces, to satisfy Norm(M1)=Norm(M2);

[0076] Process description consistency matching: Standardization processing is performed on the two process flow descriptions Pd1 and Pd2, including removing spaces, punctuation marks, special characters, and lowercase conversion. Consistency is verified by text similarity calculation; if neither item has a process description, i.e., Pd1 = ... And Pd2= The default process is the same; if one has a process description and the other does not, i.e., Pd1= Or Pd2= If the processes are inconsistent, and both entries have process descriptions, a simplified text similarity calculation is used, based on character-level edit distance normalization, and a high similarity threshold λ is set to satisfy... If the processes are consistent, then it is determined that the processes are the same. It is the similarity between the two process flows Pd1 and Pd2;

[0077] The formula for calculating character-level similarity is:

[0078] ;

[0079] in, The function is a standardized function for describing the process; the content in parentheses describes the function's functionality. `Len()` represents the character edit distance, i.e., the minimum number of operations required for insertion, deletion, and replacement; `Len()` represents the string length.

[0080] ;

[0081] in, A two-entry feature dictionary; To standardize the function, the same standardized function is used for both material names and process descriptions, adapting to different field formats; This is a process consistency determination function that returns a boolean value. If the process description consistency matching rules mentioned above are met, the boolean value is True, which means exact repetition; otherwise, the boolean value is False, which means inaccurate repetition.

[0082] The entire formula returns a Boolean value;

[0083] Judgment result: If the above three conditions of chemical composition matching, material name matching, and process description consistency matching are met simultaneously, it is judged as an exact duplicate.

[0084] Furthermore, in step 3.4, the specific method for determining similarity and repetition is as follows:

[0085] For entries with the same chemical composition but different names or process descriptions, the similarity of the process flow text is used to determine whether they are similar duplicates.

[0086] Prerequisites: That is, they have the same chemical composition;

[0087] Text similarity calculation: TF-IDF vector transformation and cosine similarity algorithm are used to calculate the similarity between two process flow descriptions. Similarity;

[0088] The formula for calculating term frequency (TF) is:

[0089] ;

[0090] Where w is A single word in For w in The number of times it appears in;

[0091] The formula for calculating Inverse Document Frequency (IDF) is as follows:

[0092] ;

[0093] Where N represents the total number of process flow descriptions participating in the comparison. The number of process flow descriptions containing the word 'w';

[0094] TF-IDF vectors are transformed into:

[0095] ;

[0096] Through this formula Transform into high-dimensional TF-IDF feature vectors ;

[0097] The formula for calculating cosine similarity is:

[0098] ;

[0099] Where · represents the vector dot product operation, It is the L2 norm of the vector;

[0100] Judgment result: Set similarity threshold ,like If so, it is judged as a suspected similar duplicate; Returns a Boolean value and its corresponding similarity value; if the Boolean value is True, it is a suspected similar duplicate, and if the Boolean value is False, it is a dissimilar duplicate.

[0101] Furthermore, in step 3.4, the specific method for LLM intelligent confirmation is as follows:

[0102] For suspected similar duplicate entries, a large language model is invoked to perform expert-level logical judgment and correct the bias of machine comparison;

[0103] Input Construction: Input two original strings, source DOI, text similarity value and preset judgment criteria into LLM. The preset judgment criteria are: identical chemical composition + consistent material reference + basically identical process are duplicates; significant process differences or different states of intermediate products are non-duplicates.

[0104] Model configuration: The locally deployed gpt-oss:20b model is used, with a temperature coefficient of 0.1 and a maximum number of generated tokens of 512.

[0105] Result analysis: Extract the "judgment result" and reason from the LLM output to finally confirm whether it is a duplicate entry.

[0106] On the other hand, this application proposes a computer-readable storage medium storing executable instructions that, when executed, cause a processor to perform the method for constructing a high-quality metal material process dataset based on a large language model.

[0107] Secondly, this application proposes a computer program product, including a computer program or instructions that, when executed by a processor, implement the method for constructing a high-quality metal material process dataset based on a large language model.

[0108] The beneficial effects of adopting the above technical solution are as follows: The method for constructing a high-quality metal material process dataset based on a large language model provided by this invention constructs a fully automated framework that integrates batch DOI acquisition, structured parsing, information extraction driven by a large language model, and multi-level intelligent deduplication. This framework not only realizes the end-to-end extraction of standardized material entries from massive unstructured documents, but also significantly improves the deduplication accuracy and logical consistency through semantic-level feature hashing and LLM-assisted judgment mechanism. The chemical composition hashing (HC) mechanism introduced in this invention transforms complex composition information into unique identifiers through standardized concatenation of a preset element set and MD5 hash operation, achieving efficient and accurate cross-file fast comparison. The three-level duplication determination system designed in this invention—"exact duplication → similar duplication → LLM intelligent confirmation"—effectively avoids false rejection due to differences in expression while ensuring high recall, and is particularly suitable for handling the state evolution description of the same material in different research stages or processing paths. Furthermore, the prompting engineering system built on a locally deployed large language model (gpt-oss:20b) integrates domain knowledge templates and strict output format constraints, achieving complete reconstruction and structured expression of complex process chains without manual intervention, significantly outperforming traditional rule matching or shallow NLP methods. Through this method, this invention successfully constructs a high-quality, non-redundant, and semantically complete metallic material process dataset, providing a reliable data foundation for materials genome engineering, intelligent manufacturing, and knowledge graph construction. It also provides an efficient, reusable, and easily scalable technical paradigm for the automated mining of tacit knowledge in academic literature, possessing significant scientific research value and industrial application prospects. Attached Figure Description

[0109] Figure 1 The flowchart illustrates the method for constructing a high-quality metal material process dataset based on a large language model, as provided in the first embodiment of the present invention. Detailed Implementation

[0110] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.

[0111] Example 1:

[0112] A method for constructing high-quality metal material processing datasets based on large language models, such as... Figure 1 As shown, the method of this embodiment is described below.

[0113] Step 1: Automatically acquire relevant academic literature and convert it into unified structured data, specifically including the following steps:

[0114] Step 1.1: Based on the preset keyword list, search scope, and domain restriction parameters, construct a dynamic query statement through the academic database API interface (such as Scopus or Crossref), and use a pagination mechanism to batch retrieve relevant papers and extract their DOI identifiers; the resulting DOI set is deduplicated and saved as a CSV file as the input index for subsequent processing.

[0115] Step 1.2: Based on the DOI set, call the full-text API (such as Elsevier Content API), and in conjunction with the configuration parameters such as API key, request timeout, and retry policy, send HTTP GET requests to the server one by one to download the full text of the paper in XML format; during the download process, URL encoding and path security processing are performed on the DOI to ensure the legality of the request and the compatibility of file naming; the XML content of successful responses is saved to the specified directory, and error logs are recorded for failed requests for tracking.

[0116] Step 1.3: Parse and transform all downloaded XML files; use the parsing tool lxml to load the document structure, and extract the section tree and table data using XPath; the section content is recursively organized according to hierarchical numbering, and the main text and subsections are merged to generate a flattened paragraph sequence predata; the table information is extracted by extracting tags, descriptions, table headers and data bodies to form a structured JSON object; finally, the three types of data, namely the section tree, table data and predata entries, are integrated into a standard JSON file, named and stored with the original paper's DOI, completing the transformation from unstructured documents to structured knowledge carriers.

[0117] Step 2: Extract metal material process items from the data generated in Step 1 using a pre-trained large language model. Specific steps include:

[0118] Step 2.1: Load the structured text content (predata field) of the paper to be processed, including material experimental descriptions, performance test results, and process details.

[0119] Step 2.2: Load the table information (tables fields) stored in JSON format in the paper, including the material name, element content, all processing technologies related to the material, and the complete process.

[0120] Step 2.3: Prepare two types of templates, including process description examples and complex process flow examples.

[0121] Step 2.4: Configure the large language model, using the locally deployed gpt-oss:20b model, which supports long text processing;

[0122] Step 2.5: Configure API: Base address (base_url): http: / / localhost:11434 / v1 (Ollama service default address); Authentication key (api_key): Any string (Ollama does not require real key verification).

[0123] Step 2.6: Configure generation parameters: Temperature coefficient (temperature) is 0 (to suppress randomness and ensure output stability); Maximum generated tokens (max_tokens) is 8192 (to adapt to long text process descriptions).

[0124] Step 2.7: Construct structured prompts, including the following core elements:

[0125] Role definition: The model is designated as "Materials Science Information Extraction Expert";

[0126] Target attribute list: Clearly define the 47 attributes to be extracted, including: basic material information, i.e., material name; mechanical properties, including the name, value, and unit of tensile strength, yield strength, and elongation; element content, i.e., the content of 36 elements such as H, B, and C, with unmentioned elements set to 0; and process description, which must include a complete process chain: previous material and process → current material.

[0127] Extraction rules: The process description must be complete, including all processing steps involving the material, avoiding vague references (such as specifying "sample" as a specific material name); when integrating tabular data, the phrase "as in Table X" must be replaced with specific content; the output format must strictly be "entry N: [attribute 1: value 1, attribute 2: value 2, ...]", with the attribute order consistent with the target list; embed the template example from the input data preparation to guide the model to follow the format and logic.

[0128] Step 2.8: Model Invocation and Response Processing.

[0129] The client.chat.completions.create interface sends prompts to the model, with input including the paper text, table data, and templates, and retrieves responses including material entries. The regular expression (r'entry\d+:(\[.*?\])') is used to extract structured entries from the model output, and redundant characters are removed while retaining attribute key-value pairs.

[0130] Step 2.9: Validity verification and retry mechanism.

[0131] Length validation: If the response length is ≤20 characters, it is considered invalid output and a retry is triggered;

[0132] Retry strategy: The maximum number of retries is 3, with a 0.5-second interval between each retry, to ensure that the model fully processes the input.

[0133] Step 2.10: Store the metal material process entry.

[0134] All extracted standardized material entries are saved in JSON format to the output folder, with the file names corresponding to the DOI of the input paper, thus constructing the initial metallic materials and processes dataset.

[0135] Step 3: Post-process the initial metal material process dataset extracted in Step 2. The specific steps are as follows:

[0136] Step 3.1: Traverse all JSON format files in the preset input folder, extract the DOI identifier associated with the file (as the unique identifier of the file) and the set of material entry strings stored in the file; filter out invalid entries with a length ≤ 50 characters, and retain valid entries for subsequent processing.

[0137] Step 3.2: Use regular expression matching patterns to extract fields from valid entry strings and convert them into a structured dictionary in key-value pair format. Specifically:

[0138] Parsing regular expressions: .

[0139] in, This is a prefix for Python's raw strings to prevent the backslash "\" from being escaped;

[0140] “ "This is capture group 1, used to match the key portion, where the key cannot contain colons or commas; the first half of the bracket " "and the second half of the parenthesis" "Indicates the capture group, start and end;" The negation character set is indicated by "," which matches any single character except for the full-width colon ":" and the full-width comma ",". The ">" indicates a quantifier, matching the preceding character set one or more times.

[0141] Capture group 1 followed by " The colon "" is a literal match for a full-width colon and must appear once to separate the key and value.

[0142] “ "This is capture group 2, used to match values; the outer parenthesis " "and the second half of the parenthesis" "" indicates a capture group, capturing the entire value content into the value; the first " The quotation mark ("") indicates that the first occurrence of zero or more non-full-width commas (") characters will match the beginning of the value. " indicates a non-capturing group, used to handle multiple commas that may appear in a value;" "Indicates no capture, only for grouping;" "" indicates that the entire group can be repeated 0 or more times.

[0143] The first " within the non-capture group" " indicates a literal match of a full-width comma, i.e., the comma inside the value;" "Indicates negative forward-looking, where " " indicates a negative sign, meaning that the following pattern cannot be matched." "" indicates matching one or more characters that are neither colons nor commas, i.e., the next possible key; " indicates that another colon should follow.

[0144] The overall meaning of the non-capturing group is: ensure that the current comma is not followed by the beginning of "new key:"; if the comma is followed by "text:", it means that this is the separator of the next field and the current value should not be included; otherwise, it is a normal comma inside the value and the current value can be safely included.

[0145] The second "in the non-capture group" The "" indicates that after consuming this "safe" comma, it continues to match zero or more non-comma characters, that is, continues the remaining part of the value.

[0146] The logic for the entire value part is as follows: first, consume the non-comma content, then repeatedly consume "safe comma + subsequent non-comma content" until a real new field separator is encountered, i.e., "new key:" is followed by a comma.

[0147] The parsing logic is as follows: match the core structure of "key:value", support commas in the value field to achieve accurate splitting of multiple fields, but do not include new "key:" structures; perform space removal and trailing comma removal on the extracted keys and values ​​respectively to generate a standardized structured dictionary.

[0148] Step 3.3: Extract core features for duplicate detection from the structured dictionary to construct a feature dictionary, providing data support for subsequent comparisons. Specifically:

[0149] The feature extraction formula is:

[0150] Features={M,T V ,Y V E V H C ,P d};

[0151] Where M is the material name; T V Y represents the tension value. V E is the yield value. V This is the elongation value; H C P is the hash value of the chemical composition. dThis describes the process flow.

[0152] Through standardization and hashing, the chemical composition of materials is transformed into a unique identifier, enabling rapid grouping and comparison.

[0153] Definition of the baseline element set: The preset set of 36 common elements in materials science is as follows:

[0154] E={H,B,C,N,O,F,Na,Mg,Al,Si,P,S,Cl,Ca,Ti,V,Cr,Mn,Fe,Co,Ni,Cu,Zn,As,Y,Zr,Nb,Mo,Sn,Sb,La,Ce,Ta,W,Pb,Bi}.

[0155] Content standardization: for each element Extract the corresponding content value from the structured dictionary, prioritizing the matching of "e". i If the "element content" field is not found, the element symbol 'e' will be matched directly. i If no corresponding field exists, the content value is set to "0". The extracted content values ​​are then processed to remove spaces and convert to lowercase, generating a standardized content value v. i .

[0156] Following the fixed order of the element set E, the elements are concatenated with their standardized content to form "e". i :v i The format string, with each string separated by "|", forms the concatenated string S containing the chemical composition. C The MD5 algorithm (Message Digest Algorithm Version 5) is used to analyze S... C Perform a hash calculation to generate a 128-bit binary hash value, which is then converted into a 32-bit hexadecimal string as the final hash. C :

[0157] ;

[0158] in, This indicates a sequential string concatenation operation; MD5() is the standard MD5 hash function.

[0159] Step 3.4: Based on the extracted feature dictionary, the system uses a three-level mechanism of "precise duplication judgment → similar duplication judgment → LLM intelligent confirmation" to achieve accurate identification of duplicate entries.

[0160] The specific method for precise duplicate detection is as follows:

[0161] The core basis for determining whether two items are completely identical duplicates is the dual matching of chemical composition and material name;

[0162] Chemical composition matching: H in two feature dictionaries C1 =H C2That is, if the hash values ​​are completely identical, it means that the chemical composition is the same;

[0163] Material name matching: Perform normalization processing on the two material names M1 and M2, including lowercase conversion, removal of spaces and full-width spaces, to satisfy Norm(M1)=Norm(M2);

[0164] Process description consistency matching: Standardization processing is performed on the two process flow descriptions Pd1 and Pd2, including removing spaces, punctuation marks, special characters, and lowercase conversion. Consistency is verified by text similarity calculation; if neither item has a process description, i.e., Pd1 = ... And Pd2= The default process is the same; if one has a process description and the other does not, i.e., Pd1= Or Pd2= If the processes are inconsistent, and both entries have process descriptions, a simplified text similarity calculation is used, based on character-level edit distance normalization, with a high similarity threshold λ (default 0.95) set to meet the requirements. If the processes are consistent, then it is determined that the processes are the same. It is the similarity between the two process flows Pd1 and Pd2;

[0165] The formula for calculating character-level similarity is:

[0166] ;

[0167] in, The function is a standardization function for the process description (removing spaces, punctuation, and lowercase conversion). The content in parentheses is the function's purpose. `Len()` represents the character edit distance, i.e., the minimum number of operations required for insertion, deletion, and replacement; `Len()` represents the string length.

[0168] ;

[0169] in, A two-entry feature dictionary; To standardize the function, the same standardized function is used for both material names and process descriptions, adapting to different field formats; This is a process consistency determination function that returns a boolean value. If the process description consistency matching rules mentioned above are met, the boolean value is True, which means exact repetition; otherwise, the boolean value is False, which means inaccurate repetition.

[0170] The entire formula returns a Boolean value;

[0171] Judgment result: If the above three conditions of chemical composition matching, material name matching, and process description consistency matching are met simultaneously, it is judged as an exact duplicate.

[0172] The specific method for determining similarity and repetition is as follows:

[0173] For entries with the same chemical composition but different names or process descriptions, the similarity of the process flow text is used to determine whether they are similar duplicates.

[0174] Prerequisites: That is, they have the same chemical composition;

[0175] Text similarity calculation: TF-IDF vector transformation and cosine similarity algorithm are used to calculate the similarity between two process flow descriptions. Similarity;

[0176] The formula for calculating term frequency (TF) is:

[0177] ;

[0178] Where w is A single word in For w in The number of times it appears in;

[0179] The formula for calculating Inverse Document Frequency (IDF) is as follows:

[0180] ;

[0181] Where N represents the total number of process flow descriptions participating in the comparison. The number of process flow descriptions containing the word 'w';

[0182] TF-IDF vectors are transformed into:

[0183] ;

[0184] Through this formula Transform into high-dimensional TF-IDF feature vectors ;

[0185] The formula for calculating cosine similarity is:

[0186] ;

[0187] Where · represents the vector dot product operation, It is the L2 norm of the vector;

[0188] Judgment result: Set similarity threshold ,like If so, it is judged as a suspected similar duplicate. Returns a Boolean value and its corresponding similarity value; if the Boolean value is True, it is a suspected similar duplicate, and if the Boolean value is False, it is a dissimilar duplicate.

[0189] The specific method for LLM intelligent verification is as follows:

[0190] For suspected similar duplicate entries, a large language model is invoked to perform expert-level logical judgment and correct the bias of machine comparison;

[0191] Input Construction: Input two original strings, source DOI, text similarity value and preset judgment criteria into LLM. The preset judgment criteria are: identical chemical composition + consistent material reference + basically identical process are duplicates; significant process differences or different states of intermediate products are non-duplicates.

[0192] Model configuration: The local gpt-oss:20b model is used, with a temperature coefficient of 0.1 (to reduce randomness) and a maximum number of generated tokens of 512.

[0193] Result analysis: Extract the "judgment result" (yes / no) and reason from the LLM output to finally confirm whether it is a duplicate entry.

[0194] Step 3.5: Processing and outputting deduplication results.

[0195] Construct a duplicate entry mapping table to record the source DOI, original entry, and basis for duplication for each duplicate entry; traverse all entries, retain the base entry in each group of duplicate entries, prioritize the entry with the highest DOI in the source file, and remove the remaining duplicate entries.

[0196] After deduplication, the data is stored in the deduplicated_data subdirectory of the output folder according to DOI, maintaining JSON format; the source, corresponding baseline entry, duplication type and reason for judgment of all removed entries are recorded and saved as duplicate_report.json; a statistical file deduplication_stats.json is generated, which includes the total number of entries, the number of deduplicated entries, the exact number of duplicates, the number of similar duplicates, the number of duplicates confirmed by LLM, and the deduplication rate.

[0197] The formula for calculating the deduplication rate is as follows:

[0198] ;

[0199] in, The total number of duplicate entries. This represents the total number of original valid entries.

[0200] Step 3.6: Filtering for invalid and redundant zero values.

[0201] To clean up meaningless or misleading "0" values ​​that may exist in the structured dictionary, the following rule-based cleaning is performed:

[0202] If the content field of an element is "0" and is not explicitly marked in the original paper's table, i.e. it was not measured experimentally, then replace it with the empty value null to avoid misjudging it as "not containing the element";

[0203] For mechanical property fields, namely tension, yield strength, and elongation, if their values ​​are 0 or empty, but are clearly described in the context, they are marked as unreported.

[0204] Delete all entries that contain only default filler fields and have no valid process description to prevent “empty shell entries” from polluting the dataset.

[0205] After all the above processing, a high-quality metal material process dataset is output.

[0206] Example 2:

[0207] This embodiment proposes a computer-readable storage medium that stores executable instructions. When these instructions are executed, if they are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.

[0208] The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method for constructing a high-quality metal material process dataset based on a large language model as described in the various embodiments of this application.

[0209] The aforementioned storage media include: flash memory, hard disks, multimedia cards, card-type memory (e.g., SD (Secure Digital Memory Card) or DX (Memory Data Register, MDR) memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic storage, disks, optical discs, servers, APP (Application) application stores, and other media capable of storing program verification codes. These media store computer programs, which, when executed by a processor, can implement the various steps of the aforementioned method for constructing high-quality metal material process datasets based on large language models.

[0210] Example 3:

[0211] This embodiment proposes a computer program product, including a computer program or instructions, which, when executed by a processor, implements the method for constructing a high-quality metal material process dataset based on a large language model.

[0212] Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a computer program product.

[0213] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the present invention.

Claims

1. A method for constructing a high-quality metal material processing dataset based on a large language model, characterized in that: Includes the following steps: Step 1: Automatically acquire relevant literature and convert it into unified structured data; Step 2: Extract metal material process items from the data generated in Step 1 using a pre-trained large language model. Specific steps include: Step 2.1: Load the structured text content of the paper to be processed, including material experimental descriptions, performance test results, and process details; Step 2.2: Load the table information stored in JSON format in the paper, including the material name, element content, all processing technologies related to the material, and the complete process; Step 2.3: Prepare two types of templates, including process description examples and complex process flow examples; Step 2.4: Configure the large language model, using the locally deployed gpt-oss:20b model, which supports long text processing; Step 2.5: Configure API: Base address: http: / / localhost:11434 / v1, which is the default address of the Ollam service; Authentication key: any string; Step 2.6: Configure generation parameters: temperature coefficient is 0; maximum number of tokens generated is 8192, suitable for long text process descriptions; Step 2.7: Construct structured prompts; Step 2.8: Model Invocation and Response Processing; The prompts are sent to the model via the client.chat.completions.create interface. The input includes the paper text, table data and template, and the response includes material entries. The structured entries are extracted from the model output using regular expressions. Redundant characters are removed and attribute key-value pairs are retained. Step 2.9: Validity verification and retry mechanism; Length validation: If the response length is ≤20 characters, it is considered invalid output and a retry is triggered; Retry strategy: The maximum number of retries is 3, with a 0.5-second interval between each retry, to ensure that the model fully processes the input; Step 2.10: Store the metal material process entries; All extracted standardized material entries are saved in JSON format to the output folder, with the file names corresponding to the DOI of the input paper, to construct the initial metallic materials process dataset; Step 3: Post-process the initial metal material process dataset extracted in Step 2. The specific steps are as follows: Step 3.1: Traverse all JSON format files in the preset input folder, extract the DOI identifier associated with the file and the set of material entry strings stored in the file; filter out invalid entries with a length ≤ 50 characters, and retain valid entries for subsequent processing; Step 3.2: Use regular expression matching patterns to extract fields from valid entry strings and convert them into a structured dictionary in key-value pair format; Step 3.3: Extract core features for duplicate detection from the structured dictionary to construct a feature dictionary, providing data support for subsequent comparisons; specifically: The feature extraction formula is: Features={M,T V ,Y V ,E V ,H C ,P d }; Where M is the material name; T V Y represents the tension value. V E is the yield value. V This is the elongation value; H C P is the hash value of the chemical composition. d For process flow description; By standardizing the process and hashing, the chemical composition of the material is transformed into a unique identifier, enabling rapid grouping and comparison. Definition of the baseline element set: A preset set of 36 common elements in materials science. E={H,B,C,N,O,F,Na,Mg,Al,Si,P,S,Cl,Ca,Ti,V,Cr,Mn,Fe,Co,Ni,Cu,Zn,As,Y,Zr,Nb,Mo,Sn,Sb,La,Ce,Ta,W,Pb,Bi}; Content standardization: for each element Extract the corresponding content value from the structured dictionary, prioritizing matching "e". i If the "element content" field is not found, the element symbol 'e' will be matched directly. i If no corresponding field exists, the content value is set to "0"; the extracted content values ​​are processed to remove spaces and convert to lowercase to generate a standardized content value v. i ; Following the fixed order of the element set E, the elements are concatenated with their standardized content to form "e". i :v i The format string, with each string separated by "|", forms the concatenated string S containing the chemical composition. C ; Using the MD5 algorithm to analyze S C Perform a hash calculation to generate a 128-bit binary hash value, which is then converted into a 32-bit hexadecimal string as the final hash. C ; ; in, This demonstrates sequential string concatenation operations; MD5() is the standard MD5 hash function. Step 3.4: Based on the extracted feature dictionary, a three-level mechanism of "precise duplication judgment → similar duplication judgment → LLM intelligent confirmation" is used to achieve accurate identification of duplicate entries; the specific method of precise duplication judgment is as follows: The core basis for determining whether two items are completely identical duplicates is the dual matching of chemical composition and material name; Chemical composition matching: H in two feature dictionaries C1 =H C2 That is, if the hash values ​​are completely identical, it means that the chemical composition is the same; Material name matching: Perform normalization processing on the two material names M1 and M2, including lowercase conversion, removal of spaces and full-width spaces, to satisfy Norm(M1)=Norm(M2); Process description consistency matching: Standardization processing is performed on the two process flow descriptions Pd1 and Pd2, including removing spaces, punctuation marks, special characters, and lowercase conversion. Consistency is verified by text similarity calculation; if neither item has a process description, i.e., Pd1 = ... And Pd2= The default process is the same; if one has a process description and the other does not, i.e., Pd1= Or Pd2= If the processes are inconsistent, and both entries have process descriptions, a simplified text similarity calculation is used, based on character-level edit distance normalization, and a high similarity threshold λ is set to satisfy... If the processes are consistent, then it is determined that the processes are the same. It is the similarity between the two process flows Pd1 and Pd2; The formula for calculating character-level similarity is: ; in, The function is a standardized function for describing the process; the content in parentheses describes the function's functionality. `Len()` represents the character edit distance, i.e., the minimum number of operations required for insertion, deletion, and replacement; `Len()` represents the string length. ; in, A two-entry feature dictionary; To standardize the function, the same standardized function is used for both material names and process descriptions, adapting to different field formats; This is a process consistency determination function that returns a boolean value. If the process description consistency matching rules mentioned above are met, the boolean value is True, which means exact repetition; otherwise, the boolean value is False, which means inaccurate repetition. The overall formula returns a Boolean value; Judgment result: If all three conditions of matching chemical composition, matching material name, and matching process description are met simultaneously, it is judged as an exact duplicate; Step 3.5: Processing and outputting deduplication results; Construct a duplicate entry mapping table to record the source DOI, original entry, and basis for duplication for each duplicate entry; traverse all entries, retain the base entry in each group of duplicate entries, prioritize the entry with the highest DOI in the source file, and remove the remaining duplicate entries; After deduplication, the data is stored in the deduplicated_data subdirectory of the output folder according to DOI, maintaining JSON format; the source, corresponding baseline entry, duplication type and reason for judgment of all removed entries are recorded and saved as duplicate_report.json; a statistical file deduplication_stats.json is generated, which includes the total number of entries, the number of deduplicated entries, the exact number of duplicates, the number of similar duplicates, the number of duplicates confirmed by LLM, and the deduplication rate. The formula for calculating the deduplication rate is as follows: ; in, The total number of duplicate entries. This represents the total number of original valid entries; Step 3.6: Filtering for invalid and redundant zero values; To clean up meaningless or misleading "0" values ​​that may exist in the structured dictionary, the following rule-based cleaning is performed: If the content field of an element is "0" and is not explicitly marked in the original paper's table, meaning it was not measured experimentally, then replace it with the empty value null to avoid misjudging it as "not containing the element"; For mechanical property fields, namely tension, yield strength, and elongation, if their values ​​are 0 or empty, but are clearly described in the context, they are marked as unreported. Delete all entries that contain only default fill fields and have no valid process description to prevent "empty shell entries" from polluting the dataset.

2. The method for constructing a high-quality metal material process dataset based on a large language model according to claim 1, characterized in that: Step 1 specifically includes: Step 1.1: Based on the preset keyword list, search scope, and field restriction parameters, construct a dynamic query statement through the academic database API interface, and use a pagination mechanism to batch retrieve relevant papers and extract their DOI identifiers; the obtained DOI set is deduplicated and saved as a CSV file as the input index for subsequent processing; Step 1.2: Based on the DOI set, call the full-text API, and in conjunction with the configuration parameters such as API key, request timeout, and retry strategy, send HTTP GET requests to the server one by one to download the full text of the paper in XML format; during the download process, URL encoding and path security processing are performed on the DOI to ensure the legality of the request and the compatibility of file naming; the XML content of successful responses is saved to the specified directory, and error logs are recorded for failed requests for tracking purposes; Step 1.3: Parse and transform all downloaded XML files; use the parsing tool lxml to load the document structure, and extract the chapter tree and table data using XPath; the chapter content is recursively organized according to hierarchical numbering, and the main text and subsections are merged to generate a flattened paragraph sequence predata; the table information is extracted by extracting tags, descriptions, table headers and data bodies to form a structured JSON object; finally, the three types of data, namely the chapter tree, table data and predata entries, are integrated into a standard JSON file, named and stored with the original paper's DOI, completing the transformation from unstructured literature to a structured knowledge carrier.

3. The method for constructing a high-quality metal material process dataset based on a large language model according to claim 1, characterized in that: In step 2.7, the structured prompts include the following core elements: Role definition: The specified model is "Materials Science Information Extraction Expert"; Target attribute list: Clearly define the 47 attributes to be extracted, including: basic material information, i.e., material name; mechanical properties, including the name, value, and unit of tensile strength, yield strength, and elongation; element content, i.e., the content of 36 elements, with elements not mentioned set to 0; process description, which must include a complete process chain: previous material and process → current material; Extraction rules: The process description must be complete, including all processing steps involving the materials, avoiding vague references; when integrating tabular data, the phrase "as shown in Table X" must be replaced with specific content; the output format must strictly be "entry N: [attribute 1: value 1, attribute 2: value 2, ...]", with the attribute order consistent with the target list; embed template examples from the input data preparation to guide the model to follow the format and logic.

4. The method for constructing a high-quality metal material process dataset based on a large language model according to claim 1, characterized in that: Step 3.2 specifically involves: The parsed regular expression is: ; in, This is a prefix for Python's raw strings to prevent the backslash "\" from being escaped; " "This is capture group 1, used to match the key portion, where the key cannot contain colons or commas; the first half of the parenthesis..." "and the second half of the parenthesis" "Indicates the capture group, start and end;" The negation character set is represented by "," which matches any single character except for the full-width colon ":" and the full-width comma ",". The quotation mark (") indicates a quantifier that matches the preceding character set one or more times. Capture group 1 after " The colon "" is a literal match for a full-width colon and must appear once to separate the key and value. " "This is capture group 2, used to match values; the outer parenthesis within it..." "and the second half of the parenthesis" "" indicates a capture group, capturing the entire value content into the value; the first " "" indicates that it first matches zero or more non-full-width commas, that is, the beginning part of the value; "Indicates a non-capturing group, used to handle multiple commas that may appear in a value;" "Indicates no capture, only for grouping;" " indicates that the entire group can be repeated 0 or more times; The first " within the non-capture group" " indicates a literal match of a full-width comma, i.e., the comma inside the value;" "Indicates negative forward-looking, in which" "Indicates a negative sign, meaning that the following pattern cannot be matched." "" indicates matching one or more characters that are neither colons nor commas, i.e., the next possible key; " indicates that another colon should follow; The overall meaning of the non-capturing group is: ensure that the current comma is not followed by the beginning of "new key:"; if the comma is immediately followed by "text:", it means that this is the separator of the next field and the current value should not be included; otherwise, it is a normal comma inside the value and the current value can be safely included. The second one in the non-capture group The colon indicates that after consuming the "safe" comma, the match continues for zero or more non-comma characters, that is, continues for the remainder of the value; The logic for the entire value part is as follows: first, consume the non-comma content, then repeatedly consume "safe comma + subsequent non-comma content" until a real new field separator is encountered, i.e., "new key:" is followed by a comma. The parsing logic is as follows: match the core structure of "key:value", support commas in the value field to achieve accurate splitting of multiple fields, but do not include new "key:" structures; perform space removal and trailing comma removal on the extracted keys and values ​​respectively to generate a standardized structured dictionary.

5. The method for constructing a high-quality metal material process dataset based on a large language model according to claim 1, characterized in that: In step 3.4, the specific method for determining similarity and repetition is as follows: For entries with the same chemical composition but different names or process descriptions, the similarity of the process flow text is used to determine whether they are similar duplicates. Prerequisites: That is, they have the same chemical composition; Text similarity calculation: TF-IDF vector transformation and cosine similarity algorithm are used to calculate the similarity between two process flow descriptions. Similarity; The formula for calculating term frequency (TF) is: ; Where w is A single word in For w in The number of times it appears in; The formula for calculating Inverse Document Frequency (IDF) is as follows: ; Where N represents the total number of process flow descriptions participating in the comparison. The number of process flow descriptions containing the word 'w'; TF-IDF vectors are transformed into: ; Through this formula Transform into high-dimensional TF-IDF feature vectors ; The formula for calculating cosine similarity is: ; Where · represents the vector dot product operation, It is the L2 norm of the vector; Judgment result: Set similarity threshold ,like If so, it is judged as a suspected similar duplicate; Returns a Boolean value and its corresponding similarity value; if the Boolean value is True, it is a suspected similar duplicate, and if the Boolean value is False, it is a dissimilar duplicate.

6. The method for constructing a high-quality metal material process dataset based on a large language model according to claim 1, characterized in that: In step 3.4, the specific method for LLM intelligent verification is as follows: For suspected similar duplicate entries, a large language model is invoked to perform expert-level logical judgment and correct the bias of machine comparison; Input Construction: Input two original strings, source DOI, text similarity value and preset judgment criteria into the LLM. The preset judgment criteria are: the same chemical composition + consistent material reference + basically the same process are duplicates; significant process differences or different states of intermediate products are non-duplicates. Model configuration: The locally deployed gpt-oss:20b model is used, with a temperature coefficient of 0.1 and a maximum number of generated tokens of 512. Result analysis: Extract the "judgment result" and reason from the LLM output to finally confirm whether it is a duplicate entry.

7. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores executable instructions that, when executed, cause a processor to perform the method for constructing a high-quality metal material process dataset based on a large language model as described in any one of claims 1-6.

8. A computer program product, characterized in that: Includes a computer program or instructions that, when executed by a processor, implement the method for constructing a high-quality metal material process dataset based on a large language model as described in any one of claims 1-6.