A method for constructing and evaluating a multimodal question-answering dataset for the manufacturing industry

By constructing a multimodal question-answering dataset for the manufacturing industry, the problems of cross-modal association and professional terminology understanding in manufacturing documents are solved, achieving efficient and accurate question-answer pair generation and model adaptability, and supporting the application of intelligent document systems in the manufacturing industry.

CN121835908B · Active · Publication Date: 2026-06-19 · BEIJING INFORMATION SCI & TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INFORMATION SCI & TECH UNIV
Filing Date
2025-12-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal question-answering models struggle to effectively handle complex cross-modal relationships and domain-specific terminology in manufacturing scenarios, leading to parameter reading errors, inaccurate unit identification, and misunderstandings of technical terms, thus failing to meet the needs of accurate question answering and complex reasoning in manufacturing documents.

Method used

A multimodal question-answering dataset for the manufacturing industry is constructed. By decoupling and separating the national standard documents into multiple modes, multimodal elements such as text, tables, formulas, and images are extracted. A dynamic context window mechanism is constructed to generate question-answer pairs that meet the requirements of the manufacturing industry. Batch question-answer pair generation and answer reasoning of multimodal models are then performed.

Benefits of technology

It has achieved the construction of an efficient and accurate multimodal question-answering dataset for the manufacturing industry, supporting the training and evaluation of large multimodal models in manufacturing scenarios, and improving the accuracy and adaptability of the model in manufacturing document understanding.


Abstract

This invention discloses a method for constructing and evaluating a multimodal question-answering dataset for the manufacturing industry, comprising: S100: constructing a corpus of standard manufacturing documents; S200: annotating metadata for each standard document in the corpus; S300: decoupling and separating multimodal elements, extracting multimodal heterogeneous elements and performing text recognition, and rearranging the recognition results according to the original standard document layout; S400: post-processing the extracted text, and inputting the post-processed text into a language model for term extraction according to prompt word templates, and constructing a global manufacturing terminology dictionary using the extracted terms; S500: constructing an initial seed set of question-answer pairs; S600: using a multimodal model to generate batch question-answer pairs and generate answer reasoning steps, thereby constructing a multimodal manufacturing standard question-answering dataset. This application can achieve efficient and accurate construction of a multimodal question-answering dataset for the manufacturing industry, providing a reliable data foundation for the subsequent training and evaluation of large multimodal models in manufacturing scenarios.

Description

Technical Field

[0001] This application belongs to the field of terminology dataset construction technology, specifically involving a method for constructing and evaluating a multimodal question-and-answer dataset for the manufacturing industry.

Background Technology

[0002] The manufacturing industry is undergoing a profound transformation from traditional digitalization to intelligentization. In product design, process planning, quality control, and safety management, companies heavily rely on various technical documents such as national standards, industry standards, equipment manuals, and inspection procedures. These documents typically include structural diagrams, flowcharts, statistical charts, structured tables, and large blocks of technical text, exhibiting high information density and complex modal formats. Currently, understanding and retrieving these documents still relies primarily on manual review, which is not only inefficient and costly but also fails to meet the real-time and automation requirements of emerging business scenarios such as intelligent quality inspection and intelligent diagnostics.

[0003] To improve document processing efficiency, various document intelligence and multimodal question-answering models have emerged in recent years. These models can jointly model images and text to achieve automatic recognition and question answering of documents such as invoices, academic papers, and presentations. These models are typically trained on general multimodal datasets and have achieved good results in natural image understanding and general document question-answering tasks. However, when these models are directly applied to manufacturing scenarios, they commonly encounter problems such as parameter reading errors, inaccurate unit identification, and misunderstandings of technical terms. This makes it difficult to support accurate question answering and complex reasoning for manufacturing documents, limiting the further application of intelligent document systems in manufacturing enterprises.

[0004] The main reasons for the above problems are as follows: First, strong modal heterogeneity. Manufacturing documents contain multiple modalities, including technical drawings, functional structure diagrams, flowcharts, tables, and paragraph text. Key information is often distributed across modalities. For example, the dimension of a certain part, assembly clearance, or inspection conditions often appear in both schematic diagram annotations and clause text. Existing general datasets, which are mainly based on single-page images or single modalities, are difficult to cover such complex cross-modal relationships. Second, prominent domain specificity. Manufacturing documents contain a large number of low-frequency technical terms such as "axial clearance," "tolerance zone," and "impactor," and these terms often require specific context for correct interpretation. Models pre-trained on general corpora have difficulty learning this kind of fine semantics.

[0005] Currently, most mainstream multimodal question-answering models adopt a training paradigm of "large-scale image-text pre-training + downstream multimodal question-answering / instruction data fine-tuning": In the pre-training stage, large-scale image-text pairing data (such as image-text descriptions, web page images and text, scene images with text, etc.) are used to learn the alignment relationship between visual features and text representations; in the downstream stage, supervised fine-tuning or instruction fine-tuning is performed on specially constructed multimodal question-answering datasets or multimodal instruction datasets to improve the model's question-answering ability and instruction compliance ability. However, to better utilize general multimodal models in manufacturing, it is essential to leverage the generalization ability of pre-trained multimodal models, inject manufacturing-specific knowledge into the model, and adapt the model to the target domain for better transfer to manufacturing scenarios.

[0006] While deep learning and general-purpose pre-trained models have achieved significant success in many fields, fine-tuning models for the manufacturing sector requires a large amount of multimodal data. However, existing general-purpose multimodal datasets suffer from several shortcomings. Firstly, in terms of image type, they are mostly drawn from everyday life or general domains, lacking structural diagrams, schematic diagrams, and process diagrams commonly found in manufacturing documents. Secondly, in terms of annotation granularity, they focus more on factual question-and-answer formats and short text responses, with insufficient annotation of high-precision information such as key parameters, units, and tolerance ranges. Thirdly, in terms of contextual modeling, most datasets use single pages or local regions as basic units, rarely considering the joint reasoning needs across pages, chapters, or even documents. Finally, in terms of content understanding, there is a general lack of specialized modeling for manufacturing terminology.

[0007] Manufacturing scenarios typically involve various physical quantities and their corresponding units, such as machining dimensions, material compressive strength, and equipment operating temperature, and numerical deviations can directly lead to production accidents or product scrap. Manufacturing scenarios are therefore extremely sensitive to "numerical value + unit" pairs, and general indicators cannot reflect this risk. Furthermore, manufacturing scenarios often involve process lists and inventory lists, which must avoid omissions and redundancies while ensuring accurate matching of elements; general indicators cannot measure this directly. Finally, general indicators can hardly determine whether terminology is used correctly in an otherwise semantically correct answer, which is essential in manufacturing scenarios with strict standardization requirements.

Summary of the Invention

[0008] The purpose of this application is to provide a method for constructing and evaluating a multimodal question-answering dataset for the manufacturing industry. This application performs multimodal decoupling and separation on manufacturing standard documents in the national standard full-text disclosure system, uses a multimodal large model to annotate question-answer pairs for complex schematic diagrams, structural diagrams, process diagrams, etc., and then checks and filters the data to construct a new dataset that meets specific requirements, providing benchmark support for intelligent document understanding in the manufacturing field.

[0009] On the one hand, this application provides a method for constructing a multimodal question-answering dataset for the manufacturing industry, including:

[0010] S100: Construct a corpus of standard documents for the manufacturing industry;

[0011] S200: Perform metadata annotation on standard documents in the corpus;

[0012] S300: Perform multimodal element decoupling and separation on standard documents in the corpus, extract multimodal heterogeneous elements such as text, tables, formulas, and images, perform text recognition on each, and rearrange the recognition results according to the original standard document layout; images include, but are not limited to, schematic diagrams, structural diagrams, and process diagrams;

[0013] S400: Post-process the extracted text and input the post-processed text into the language model to extract terms according to the prompt word template. Use the extracted terms to build a global manufacturing terminology dictionary.

[0014] S500: For document pages containing tables, formulas, or images, a dynamic context window mechanism is built based on adjacent paragraphs of the image and a global manufacturing terminology dictionary. Text fragments semantically related to the image or table are organized into supplementary context information. Then, questions of various task types are manually labeled, and the answer types are divided into numerical, list, and text types according to the target application scenario to form an initial seed set of question-answer pairs.

[0015] S600: Input the decoupled individual images or tables, supplemented contextual information through a dynamic context window mechanism, a global manufacturing terminology dictionary, and a seed set of question-answer pairs into the multimodal model, perform batch question-answer pair generation and answer reasoning step generation, and construct a multimodal manufacturing standard question-answer dataset.

[0016] Specifically, step S100 further includes:

[0017] S110: Download the manufacturing standard documents in PDF format from the National Standards Full-Text Disclosure System to obtain the original set of standard documents;

[0018] S120: Select manufacturing standard documents with high text-image mixing from the original set of standard documents;

[0019] Sub-step S120 specifically involves: performing the following steps on each manufacturing standard document: dividing the manufacturing standard document into images by page, prompting a multimodal model to filter images that simultaneously contain two or more elements from structural diagrams, schematic diagrams, statistical charts, process diagrams, tables, and text. If the proportion of such images exceeds a preset threshold, then the manufacturing standard document is a manufacturing standard document with a high degree of image-text mixing.

[0020] S130: The selected manufacturing standard documents are generated into PDF files and saved to form a corpus of manufacturing standard documents.

[0021] Furthermore, in step S120, the multimodal model Qwen2.5-VL-7B is used for screening.

[0022] Furthermore, the metadata mentioned in step S200 includes the domain, standard name, standard number, keywords, publication date, and implementation date.

[0023] Specifically, post-processing of the extracted text includes:

[0024] S410: Use regular expressions to match common garbled character features and abnormal characters, and perform preliminary filtering of obviously meaningless or noisy characters;

[0025] S420: Extract candidate characters and their preceding and following text fragments and perform semantic association analysis to determine whether the candidate characters match the industry special character dictionary. If not, delete the candidate characters. The industry special character dictionary is obtained by manually compiling national standards related to symbols and includes special symbols and units.

[0026] On the other hand, this application provides an evaluation method for a multimodal question-answering dataset in the manufacturing industry, used to perform quality checks on the constructed multimodal standard question-answering dataset for the manufacturing industry, including:

[0027] After removing the answers and reasoning processes from the samples in the multimodal manufacturing standard question-and-answer dataset, the samples are input into the multimodal model to obtain the model's predicted answers and its new reasoning processes.

[0028] The model's predicted answer and its new reasoning process are input into the language model along with the answers and reasoning processes from the multimodal manufacturing standard question-answering dataset for consistency verification.

[0029] When the language model determines that the input is consistent, it is recorded as "pass"; when it determines that the input is inconsistent, it is recorded as "fail".

[0030] Specifically, the multimodal model is the GPT-4o model.

[0031] Specifically, the language model is the Deepseek-V3 model.

[0032] Compared with the prior art, this application has the following advantages and beneficial effects:

[0033] Given the unique characteristics of the manufacturing industry, existing methods for constructing multimodal question-answering datasets are not suitable for this context. This application provides a method specifically for constructing multimodal question-answering datasets in the manufacturing industry. This method enables efficient and accurate construction of multimodal question-answering datasets for manufacturing, and systematically performs multimodal parsing, terminology extraction, question-answer pair generation, and refined quality assessment of standard manufacturing documents. This provides a reliable data foundation for the subsequent training and evaluation of large-scale multimodal models in the manufacturing industry.

Description of the Drawings

[0034] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0035] Figure 1 is a flowchart of the construction method of this application;

[0036] Figure 2 is a block diagram illustrating the principle of the evaluation method in the embodiments of this application.

Detailed Implementation

[0037] The technical solution and effects of this application will be clearly and completely described below with reference to specific embodiments. Obviously, the described specific embodiments are only a part of the specific embodiments of this application, and not all of them. Based on the specific embodiments in this application, all other specific embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0038] The method for constructing a multimodal question-answering dataset for the manufacturing industry provided in this application is shown in Figure 1. The steps include:

[0039] S100: Construct a corpus of standard documents for the manufacturing industry;

[0040] This step further includes:

[0041] S110: Download the manufacturing standard documents in PDF format from the National Standards Full-Text Disclosure System to obtain the original set of standard documents;

[0042] To ensure the industry representativeness and diversity of the corpus, standard documents from multiple sub-fields, including mechanical manufacturing, mechanical systems and general components, road vehicle engineering, shipbuilding materials, new energy, and equipment manufacturing, were selected in terms of data coverage.

[0043] S120: Select manufacturing standard documents with high text-image mixing from the original set of standard documents;

[0044] In this embodiment, the multimodal model Qwen2.5-VL-7B is used to filter the original set of standard documents, identifying PDF documents with high text-image mixing. It should be noted that the PDF documents here refer to manufacturing standard documents in PDF format from the original set of standard documents.

[0045] Specifically, for each PDF document, the following steps are performed: the PDF document is split into images by page, and a multimodal model is prompted to filter images that contain two or more of the following elements: structural diagrams, schematic diagrams, statistical charts, process diagrams, tables, and text. If the proportion of such images exceeds a preset threshold, then the PDF document is a PDF document with a high degree of image and text mixing.

[0046] The above refers to the proportion of such images among all the images segmented. Furthermore, in this embodiment, the proportion threshold is preset to 40%.
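The screening rule in S120 can be sketched as follows. This is a minimal illustration, not the patented implementation: the per-page verdicts are assumed to come from the multimodal model (Qwen2.5-VL-7B in this embodiment), which is not called here, and the names `is_high_mixing` and `page_is_mixed` are hypothetical.

```python
from typing import Sequence

# Proportion threshold stated for this embodiment (40%).
MIXED_PAGE_THRESHOLD = 0.40


def is_high_mixing(page_is_mixed: Sequence[bool],
                   threshold: float = MIXED_PAGE_THRESHOLD) -> bool:
    """Return True if the share of mixed pages exceeds the threshold.

    page_is_mixed[i] is the (hypothetical) per-page verdict of the
    multimodal model: True when page i contains two or more element
    types (structural diagram, schematic, chart, process diagram,
    table, text).
    """
    if not page_is_mixed:
        return False
    ratio = sum(page_is_mixed) / len(page_is_mixed)
    # The text says the proportion must "exceed" the threshold,
    # so a document sitting exactly at 40% is not selected.
    return ratio > threshold
```

A document with three mixed pages out of five (60%) would be selected, while one with two out of five (exactly 40%) would not.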

[0047] In addition, to increase sample diversity, some manufacturing standard documents with low text-image mixing were retained.

[0048] Manufacturing standard documents with low text-image mixing are the opposite of those with high text-image mixing: they are PDF documents in which the proportion of such images does not exceed the preset proportion threshold.

[0049] S130: The selected manufacturing standard documents are generated into PDF files and saved to form a corpus of manufacturing standard documents.

[0050] S200: Perform metadata annotation on standard documents in the corpus;

[0051] To facilitate subsequent retrieval of related context, metadata includes, but is not limited to, domain, standard name, standard number, keywords, publication date, and implementation date, in order to form a structured index of manufacturing standard documents.
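One possible shape for such a structured index is sketched below. The field names follow the metadata listed above; the class name, the `build_index` helper, and the choice of the standard number as the lookup key are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class StandardMetadata:
    """Metadata fields named in step S200."""
    domain: str
    standard_name: str
    standard_number: str
    keywords: List[str] = field(default_factory=list)
    publication_date: str = ""
    implementation_date: str = ""


def build_index(records: Iterable[StandardMetadata]) -> Dict[str, StandardMetadata]:
    """Index records by standard number (an assumed lookup key)."""
    return {r.standard_number: r for r in records}
```

Keying on the standard number is only one option; an index keyed on domain or keywords would serve context retrieval equally well.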

[0052] S300: Perform multimodal element decoupling and separation on standard documents in the corpus, extract multimodal heterogeneous elements such as text, tables, formulas, and images, and perform text recognition on each. Then, rearrange the recognition results according to the original standard document layout. Among them, images include, but are not limited to, schematic diagrams, structural diagrams, and process diagrams.

[0053] After the metadata annotation is completed, each standard PDF document is parsed page by page. This includes using a multimodal element decoupling and separation method to analyze the layout of the standard document and to detect and separate multimodal heterogeneous elements such as text, tables, formulas, and images.

[0054] In this embodiment, the specific steps of this implementation method are as follows:

[0055] S310: Based on the DocLayout-YOLO model, perform layout analysis and detection on the pages of standard documents, and locate document element areas such as images, tables, body text, titles, legends, and formulas;

[0056] S320: To address the complexity of the formula region, the Unimernet_tiny model is used to identify the formulas in the formula region and convert the identified formulas into LaTeX format;

[0057] S330: Based on the detected bounding box coordinates, crop the page image to generate table sub-images, text sub-images, formula sub-images, as well as schematic diagrams, structural diagrams, and process diagrams.

[0058] S340: Use the PaddleOCR Chinese OCR model to perform OCR recognition on text regions to obtain text content;

[0059] S350: Rearrange and organize the recognized text, headings, tables, formulas, etc., according to the original document layout, and finally generate the corresponding Markdown format document.
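The rearrangement in S350 can be illustrated with a small sketch. It assumes that layout detection and recognition (S310 to S340) have already produced elements with bounding-box origins and recognized content; the `Element` type and the simple top-to-bottom, left-to-right ordering are assumptions, and real reading-order recovery for multi-column pages would need more care. None of the models named above are called here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    kind: str      # "title", "text", "table", "formula", or "image"
    x0: float      # left edge of the detected bounding box
    y0: float      # top edge of the detected bounding box
    content: str   # OCR text, table text, LaTeX source, or image path


def to_markdown(elements: List[Element]) -> str:
    """Emit Markdown in top-to-bottom, left-to-right order."""
    ordered = sorted(elements, key=lambda e: (e.y0, e.x0))
    parts = []
    for e in ordered:
        if e.kind == "title":
            parts.append(f"# {e.content}")
        elif e.kind == "formula":
            # Formulas are assumed to arrive as LaTeX (step S320).
            parts.append(f"$$\n{e.content}\n$$")
        elif e.kind == "image":
            parts.append(f"![]({e.content})")
        else:
            parts.append(e.content)
    return "\n\n".join(parts)
```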

[0060] S400: Post-process the extracted text and input the post-processed text into the language model to extract terms according to the prompt word template. Use the extracted terms to build a global manufacturing terminology dictionary.

[0061] Specifically, the prompt word template is used to filter concepts with clear and professional meanings in the context of industrial manufacturing, including equipment or machine names, processing technology, manufacturing process, materials, parts, models, process parameters, technical indicators, quality control, testing methods, industry abbreviations, etc.; terms are extracted according to the concepts.

[0062] In natural language processing, text post-processing refers to cleaning the text to improve its quality. In this specific embodiment, the post-processing includes: using rule-based methods to remove meaningless characters such as garbled text, special characters, and redundant formatting symbols, while retaining key technical symbols.

[0063] The post-processing of the text employs a dual error correction method of "semantic association + rule matching," including:

[0064] S410: Use regular expressions to match common garbled character features and abnormal characters, and perform preliminary filtering of obviously meaningless or noisy characters;

[0065] S420: Extract candidate characters and their preceding and following text fragments and perform semantic association analysis to determine whether the candidate characters match the industry special character dictionary. If they do not match, delete the candidate characters to avoid accidental deletion of key technical symbols such as φ, ±, and ≥ by traditional rule filtering. The industry special character dictionary is obtained by manually compiling national standards related to symbols and includes special symbols and units.
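A simplified sketch of the two-stage cleaning follows. It is an approximation under stated assumptions: the semantic association analysis of S420 is reduced to a plain dictionary lookup, the `SPECIAL_CHARS` set is a hypothetical excerpt of the industry special character dictionary, and both regular expressions are illustrative rather than taken from the source.

```python
import re

# Hypothetical excerpt of the industry special character dictionary.
SPECIAL_CHARS = {"φ", "±", "≥", "≤", "°"}

# S410: replacement characters, control characters, private-use code points.
NOISE = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f\ue000-\uf8ff]")

# S420 candidates: anything that is not a word character, whitespace,
# or common Latin/Chinese punctuation.
CANDIDATE = re.compile(r"[^\w\s.,;:()\[\]+\-*/=%，。；：（）、]")


def clean_text(text: str) -> str:
    """Drop noise, then drop candidate symbols absent from the dictionary."""
    text = NOISE.sub("", text)

    def keep_if_known(match: "re.Match[str]") -> str:
        ch = match.group(0)
        return ch if ch in SPECIAL_CHARS else ""

    return CANDIDATE.sub(keep_if_known, text)
```

With these assumed patterns, `clean_text("φ20±0.5\ufffd■mm")` keeps `φ` and `±` while dropping the replacement character and the stray box glyph.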

[0066] S500: For document pages containing tables, formulas, or images, a dynamic context window mechanism is built based on adjacent paragraphs of the image and a global manufacturing terminology dictionary. Text fragments semantically related to the image or table are organized into supplementary context information. Then, questions of various task types are manually labeled, and the answer types are divided into numerical (Quant), list (List), and text (Text) according to the target application scenario to form an initial seed set of question-answer pairs.

[0067] In this embodiment, task types include DocVQA, VIE, DLA, and DCS, etc.

[0068] The above-mentioned text fragments related to the semantics of the image or table are organized into contextual supplementary information, specifically including: dividing the text into paragraphs, organizing the text fragments that refer to the table or image, and text fragments that are highly semantically similar to the table or image title into contextual supplementary information.
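The paragraph selection described above can be sketched as follows. The source does not specify the similarity measure, so token-level Jaccard overlap and the `min_sim` threshold are stand-in assumptions; the terminology dictionary is omitted for brevity, and the function names are hypothetical.

```python
import re
from typing import List, Set


def _tokens(s: str) -> Set[str]:
    return set(re.findall(r"\w+", s.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


def build_context(paragraphs: List[str], label: str, title: str,
                  min_sim: float = 0.3) -> List[str]:
    """Select paragraphs that cite the element or resemble its title.

    label: how the text refers to the element, e.g. "Figure 3".
    title: the caption or title of the image or table.
    """
    # Word boundary so that "Figure 3" does not match "Figure 30".
    ref = re.compile(re.escape(label) + r"\b", re.IGNORECASE)
    title_tokens = _tokens(title)
    selected = []
    for p in paragraphs:
        refers = ref.search(p) is not None
        similar = jaccard(_tokens(p), title_tokens) >= min_sim
        if refers or similar:
            selected.append(p)
    return selected
```

An embedding-based similarity would likely be a better fit for Chinese technical text; the lexical measure here only keeps the example self-contained.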

[0069] The schematic diagrams, structural diagrams, process diagrams, and tables obtained in step three are linked with the text content and terminology dictionary obtained in steps four and five to construct a dynamic context window, thereby enhancing the semantic interpretability of the images and tables and providing contextual support for the subsequent construction of question-and-answer pairs.

[0070] S600: Input the decoupled individual images or tables, supplemented contextual information through a dynamic context window mechanism, a global manufacturing terminology dictionary, and a seed set of question-answer pairs into the multimodal model, perform batch question-answer pair generation and answer reasoning step generation, and construct a multimodal manufacturing standard question-answer dataset.

[0071] Specifically, domain experts annotate seed question-answer pairs based on dynamic context windows and a global manufacturing terminology dictionary. Question types cover four task formats: Document Visual Question Answering (DocVQA), Key Information Extraction (VIE), Layout Analysis (DLA), and Document Content Understanding (DCS). Answer types include numerical, list, and text. Differentiated annotation specifications are developed for different question and answer types.

[0072] S700: Perform quality checks on the question-and-answer pairs generated in step S600. Specifically, a three-level verification system of "model pre-validation → consistency verification → manual review" is adopted to ensure the accuracy and reliability of the generated question-and-answer pairs.

[0073] To ensure the quality of the constructed dataset, this application establishes a data validation process based on multimodal models and language models, including:

[0074] First, the samples in the multimodal manufacturing standard question-and-answer dataset are processed by removing the answers and reasoning processes, and then input into the multimodal model to obtain the model's predicted answers and its new reasoning processes. In this embodiment, the multimodal model selected is the GPT-4o model.

[0075] Then, the model's predicted answer and its new reasoning process are input into the language model along with the answers and reasoning processes from the multimodal manufacturing standard question-and-answer dataset for consistency verification. In this embodiment, the Deepseek-V3 model is used for consistency verification.

[0076] When the language model determines that the model's answer is consistent with the reference answer, it is marked as "passed," and a portion of the samples are randomly selected for manual review; when it is determined to be "inconsistent," all corresponding samples are submitted to manual verification and modification.
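The routing logic of this verification step can be sketched as below. The language-model judge is abstracted as a callable, so no model is invoked; `review_rate` is a hypothetical parameter, since the text only says that "a portion" of passing samples is reviewed.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]


def route_samples(samples: List[Sample],
                  is_consistent: Callable[[Sample], bool],
                  review_rate: float = 0.1,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split samples into (passed, spot_check, to_fix).

    is_consistent stands in for the language model's verdict on whether
    the predicted answer and reasoning agree with the reference.
    """
    rng = random.Random(seed)
    passed, spot_check, to_fix = [], [], []
    for s in samples:
        if is_consistent(s):
            passed.append(s)
            # A random portion of passing samples goes to manual review.
            if rng.random() < review_rate:
                spot_check.append(s)
        else:
            # Every inconsistent sample is verified and modified manually.
            to_fix.append(s)
    return passed, spot_check, to_fix
```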

[0077] After the above verification process, the generated question-answer pairs are filtered and corrected to form the final high-quality dataset for multimodal document understanding in the manufacturing industry, ManuStdVQA.

Example

[0078] This embodiment uses an independent test set containing 3958 samples, with the following sample distribution for numerical, list, and text classes: 997 numerical samples, 821 list samples, and 2140 text samples.

[0079] This embodiment designs a differentiated scoring mechanism based on the characteristics of each type of answer:

[0080] (1) Numerical tasks

[0081] Given that manufacturing scenarios typically involve various physical quantities and their corresponding units, such as machining dimensions, material compressive strength, and equipment operating temperature, numerical deviations can directly lead to production accidents or product scrapping. Furthermore, general-purpose models are prone to unit confusion or misuse during processing. Therefore, in addition to employing precise numerical matching, this embodiment further introduces a unit consistency verification mechanism. The specific calculation method is as follows: [formula not reproduced in the source text].

[0082] In this formula, the unit consistency score is obtained through a rule-based matching mechanism: it is 1 when the unit matches successfully, and 0 otherwise. The numerical score represents an exact match of the numerical values; it is also calculated using a rule-based approach and takes values by the same rule as the unit consistency score. The joint term reflects the consistency between the numerical match and the unit match: it is 1 only when both are correct and 0 otherwise, thus preventing the model from receiving a non-zero score when only the unit is correct. The weight parameter is less than 1.

[0083] Under this mechanism, a perfect score is obtained only when the predicted value and the actual value are identical in both value and unit. This strict criterion may misjudge some reasonable results: for example, when the predicted value differs from the actual value only by a unit conversion or by Chinese versus English notation, it may still receive a score of 0. To address this, for samples with a score of 0, this application further introduces Deepseek-V3 for secondary verification: the model determines whether the predicted value can be converted into the actual value; if the determination result is "yes," the final score is corrected to 1; otherwise, the score remains unchanged.
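The numerical scoring can be sketched as follows. The exact formula is not available in this text, so the combination used here (a weighted sum of the value match and the joint term) is only one form consistent with the description: it yields 0 when only the unit is correct and 1 only when both are correct. The default weight of 0.5 and the `can_convert` callable, which stands in for the Deepseek-V3 secondary check, are assumptions.

```python
from typing import Callable, Optional, Tuple

Quantity = Tuple[float, str]  # (value, unit)


def numeric_score(pred: Quantity, truth: Quantity, weight: float = 0.5,
                  can_convert: Optional[Callable[[Quantity, Quantity], bool]] = None
                  ) -> float:
    """Score a numerical answer under an ASSUMED weighting scheme."""
    value_ok = 1.0 if pred[0] == truth[0] else 0.0
    unit_ok = 1.0 if pred[1].strip() == truth[1].strip() else 0.0
    joint = value_ok * unit_ok  # 1 only when value and unit both match

    # Assumed combination; the source formula is not reproduced.
    score = weight * value_ok + (1.0 - weight) * joint

    # Secondary verification: only samples scored 0 are re-checked.
    if score == 0.0 and can_convert is not None and can_convert(pred, truth):
        return 1.0
    return score
```

Under these assumptions, an answer of 0.5 cm against a reference of 5 mm first scores 0 and is then corrected to 1 if the judge confirms the conversion.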

[0084] (2) List-based tasks

[0085] For list-type data, such as assembly process lists and parts material lists, it is necessary to ensure that the list length is consistent with the standard answer to avoid omissions and redundancies, while also ensuring accurate matching of core elements. Therefore, the evaluation mechanism comprehensively considers both length consistency and element accuracy. The length consistency score is defined as 1 when the length of the predicted list equals the length of the actual list, and 0 otherwise. This setting is based on the fact that the matching degree of list length directly reflects the model's comprehensive retrieval ability of visual information and its complete control over image details. If the lengths are inconsistent, the model has a fundamental deviation in the completeness or accuracy of information extraction; therefore, this item is scored as 0.

[0086] Next, the element accuracy is calculated. It is defined as the number of correctly predicted elements, that is, elements in the predicted list that completely match elements in the actual list, divided by the total number of elements. A joint term reflects the consistency between length and accuracy, thereby preventing a model from receiving a higher rating when it performs well in only a single dimension. The weights are each less than 1. The final score is calculated as a weighted combination of these terms [formula not reproduced in the source text].
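A sketch of the list scoring follows. The final formula is not available in this text, so the weighted sum below, the default weights, the product form of the joint term, and the use of the actual list's length as the denominator are all assumptions chosen to agree with the description.

```python
from collections import Counter
from typing import List


def list_score(pred: List[str], truth: List[str],
               w_len: float = 0.3, w_acc: float = 0.3) -> float:
    """Score a list answer under an ASSUMED weighting scheme."""
    if not truth:
        return 1.0 if not pred else 0.0

    # Length consistency: 1 when both lists have the same length.
    length_ok = 1.0 if len(pred) == len(truth) else 0.0

    # Element accuracy: exact matches, counted as a multiset so that a
    # repeated prediction cannot match one reference element twice.
    n_correct = sum((Counter(pred) & Counter(truth)).values())
    accuracy = n_correct / len(truth)

    # Joint term (assumed product form): rewards length and accuracy
    # only when they hold together.
    joint = length_ok * accuracy
    return w_len * length_ok + w_acc * accuracy + (1.0 - w_len - w_acc) * joint
```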

[0087] (3) Text-based tasks

[0088] For text-based data, such as technical document explanations and troubleshooting instructions, the core is to convey accurate semantics. While inaccuracies in terminology can be corrected using industry common sense, semantic gaps directly impact usability. Therefore, the evaluation mechanism integrates semantic similarity assessment and terminology consistency verification. The specific calculation method is as follows: First, the Deepseek-V3 model is introduced for semantic similarity scoring. This model judges the matching degree between the semantic content of the true and predicted values. If they are semantically consistent, the semantic correctness score is set to 1; otherwise, it is 0. Next, a terminology consistency score is defined to verify the consistency between the terminology used in the predicted value and the terminology in the question: if there are missing terms, mismatches, or semantic discrepancies (not synonym substitutions), the score is 0; otherwise, it is 1. The final score is obtained by weighting these two scores [formula not reproduced in the source text]. The weight parameter is less than 1, and is 0 if the question contains no terms.
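The text scoring can be sketched as a two-term weighted sum. The formula itself is not available in this text; the form below assumes the weight applies to the terminology term and its complement to the semantic term, with a hypothetical default of 0.3. Both verdicts are passed in as booleans, standing in for the Deepseek-V3 judgment and the terminology check.

```python
def text_score(semantic_ok: bool, term_ok: bool,
               question_has_terms: bool, term_weight: float = 0.3) -> float:
    """Score a text answer under an ASSUMED weighting scheme."""
    # The weight drops to 0 when the question contains no terms,
    # so the score then depends on semantic correctness alone.
    w = term_weight if question_has_terms else 0.0
    return (1.0 - w) * float(semantic_ok) + w * float(term_ok)
```

With these assumed values, a semantically correct answer that misuses a term scores 0.7 when the question contains terms and 1.0 when it does not.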

[0089] During the model training phase, the experiment was conducted on a single NVIDIA A100 GPU, and the Qwen2.5-VL-7B model was fine-tuned using the LoRA parameter-efficient fine-tuning method. The training settings are as follows: the training dataset contains 35,560 samples; the total number of training epochs is 2; the batch size is 16; the LoRA rank is 16; and the learning rate starts from a configured initial value and is adjusted dynamically using a cosine learning rate decay strategy to optimize model convergence. Table 1 shows each model's evaluation score on each type of sample and its overall weighted score, averaged over three experiments.
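As a rough illustration of the training schedule, the Python sketch below computes the number of optimizer steps implied by the stated dataset size, batch size, and epoch count, and evaluates a standard cosine decay curve. The initial learning rate passed in the usage lines is a hypothetical placeholder, since its value is not given here. The sketch also assumes that the final partial batch of each epoch is kept, that no gradient accumulation or warmup is used, and that the rate decays to zero.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


def cosine_lr(step, total, initial_lr, min_lr=0.0):
    """Cosine decay from initial_lr at step 0 to min_lr at the last step."""
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


steps = total_steps(num_samples=35560, batch_size=16, epochs=2)
hypothetical_initial_lr = 1e-4  # placeholder value, not taken from the source
halfway_lr = cosine_lr(steps // 2, steps, hypothetical_initial_lr)
```

Under these assumptions each epoch takes 2,223 steps (35,560 / 16 rounded up), for 4,446 steps in total, and the learning rate at the halfway point is half of its initial value.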

[0090] Table 1. Model evaluation scores and overall weighted scores on each type of sample.

[0091]

| Model | Numerical samples | List-type samples | Text samples | Comprehensive |
|---|---|---|---|---|
| Janus-Pro-7B | 5.71 | 3.33 | 34.19 | 20.61 |
| Molmo-7B-D-0924 | 19.73 | 7.10 | 40.95 | 28.58 |
| llava-ov-qwen2-7b | 28.48 | 8.87 | 43.16 | 32.35 |
| MiniCPM-V-2.6 | 56.56 | 51.77 | 72.49 | 64.18 |
| SAIL-VL-1d6-8B | 56.18 | 54.85 | 77.66 | 67.06 |
| InternVL3-8B | 64.15 | 60.68 | 70.87 | 67.52 |
| GLM-4V-9B | 60.57 | 60.37 | 77.53 | 69.70 |
| MiniCPM-O-2.6 | 62.63 | 60.47 | 79.12 | 71.10 |
| Qwen2.5-VL-3B | 64.14 | 63.32 | 80.25 | 72.68 |
| InternVL3-9B | 68.85 | 65.76 | 79.46 | 73.95 |
| Ovis2-8B | 70.27 | 67.61 | 83.58 | 76.91 |
| Qwen2.5-VL-7B | 78.51 | 68.82 | 81.59 | 78.16 |
| Lora-Qwen2.5-VL-7B | 83.34 | 75.50 | 84.54 | 82.36 |

[0092] Lora-Qwen2.5-VL-7B is the model trained in this embodiment. As can be seen from Table 1, its comprehensive score is 4.20 points higher than that of the base Qwen2.5-VL-7B model (82.36 versus 78.16), with larger improvements on numerical tasks (+4.83) and list-type tasks (+6.68) and a smaller improvement on text-type tasks (+2.95).
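The size of the improvement can be checked directly from the Table 1 rows for Qwen2.5-VL-7B and Lora-Qwen2.5-VL-7B. The short Python sketch below computes the absolute gain per sample type and, as an additional reading that assumes the scores lie on a 0 to 100 scale, the relative gain of the comprehensive score. The dictionary keys and variable names are illustrative.

```python
# Scores copied from Table 1 (base model versus LoRA fine-tuned model).
base = {"numerical": 78.51, "list": 68.82, "text": 81.59, "comprehensive": 78.16}
lora = {"numerical": 83.34, "list": 75.50, "text": 84.54, "comprehensive": 82.36}

# Absolute gain in score points for each column.
gains = {name: round(lora[name] - base[name], 2) for name in base}

# Relative gain of the comprehensive score, in percent of the base score.
relative_gain = round(100.0 * gains["comprehensive"] / base["comprehensive"], 2)
```

The figure of 4.20 corresponds to the absolute difference between the two comprehensive scores; relative to the base score of 78.16, the same difference is about 5.37%.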

[0093] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for constructing a manufacturing multi-modal question answering dataset, the method comprising:
S100: Construct a corpus of standard documents for the manufacturing industry;
S200: Perform metadata annotation on the standard documents in the corpus;
S300: Perform multimodal element decoupling and separation on the standard documents in the corpus, extract multimodal heterogeneous elements such as text, tables, formulas, and images, perform text recognition on each, and then rearrange the recognition results according to the original standard document layout, wherein the images include one or more of the following: schematic diagrams, structural diagrams, and process diagrams;
S400: Post-process the extracted text, input the post-processed text into a language model to extract terms according to a prompt template, and use the extracted terms to build a global manufacturing terminology dictionary, wherein the prompt template is used to filter concepts that have a clear and professional meaning in the context of industrial manufacturing, including equipment or machine names, processing technology, manufacturing process, materials, parts, models, process parameters, technical indicators, quality control, testing methods, and industry abbreviations, and terms are extracted according to these concepts;
S500: For document pages containing tables, formulas, and images, construct a dynamic context window mechanism based on the paragraphs adjacent to the image and on the global manufacturing terminology dictionary, and organize text fragments semantically related to the image or table into supplementary context information, which specifically includes: segmenting the text by paragraph, and organizing text fragments that reference the table or image, together with text fragments that meet the semantic similarity criteria with the table or image title, into supplementary context information; associating the images and tables obtained in step S300 with the text content and terminology dictionary obtained in steps S400 and S500 to construct a dynamic context window, thereby enhancing the semantic interpretability of the images and tables; and subsequently manually labeling questions of various task types, with the answer types divided into numerical, list, and text types according to the target application scenario, to form an initial seed set of question-answer pairs;
S600: Input each decoupled image or table, the contextual information supplemented through the dynamic context window mechanism, the global manufacturing terminology dictionary, and the seed set of question-answer pairs into a multimodal model, perform batch generation of question-answer pairs and of answer reasoning steps, and construct a multimodal manufacturing standard question-answer dataset.

2. The construction method as described in claim 1, characterized in that: Step S100 further includes:
S110: Download manufacturing standard documents in PDF format from the National Standards Full-Text Disclosure System to obtain an original set of standard documents;
S120: Select, from the original set of standard documents, manufacturing standard documents with a high degree of image-text mixing, wherein sub-step S120 specifically involves performing the following on each manufacturing standard document: dividing the manufacturing standard document into page images, and prompting a multimodal model to filter the page images that simultaneously contain two or more of the following elements: structural diagrams, schematic diagrams, statistical charts, process diagrams, tables, and text; if the proportion of such images exceeds a preset threshold, the manufacturing standard document is a manufacturing standard document with a high degree of image-text mixing;
S130: Generate PDF files from the selected manufacturing standard documents and save them to form a corpus of manufacturing standard documents.

3. The construction method as described in claim 1, characterized in that: The metadata mentioned in step S200 includes the domain, standard name, standard number, keywords, publication date, and implementation date.

4. The construction method as described in claim 2, characterized in that: In step S120, the multimodal model Qwen2.5-VL-7B is used for screening.

5. The construction method as described in claim 1, characterized in that: The post-processing of the extracted text includes:
S410: Use regular expressions to match common garbled character features and abnormal characters, and perform preliminary filtering of meaningless or noisy characters;
S420: Extract candidate characters together with their preceding and following text fragments, and perform semantic association analysis to determine whether the candidate characters match an industry special character dictionary; if not, delete the candidate characters, wherein the industry special character dictionary is obtained by manually compiling national standards related to symbols and includes special symbols and units.

6. An evaluation method for a multimodal question-answering dataset in the manufacturing industry, characterized in that: it is used to perform quality inspection on the multimodal manufacturing standard question-answer dataset constructed according to any one of claims 1 to 5, and includes: after removing the answers and reasoning processes from the samples in the multimodal manufacturing standard question-answer dataset, inputting the samples into a multimodal model to obtain the model's predicted answers and its new reasoning processes; and inputting the model's predicted answer and its new reasoning process into a language model, along with the answer and reasoning process from the multimodal manufacturing standard question-answer dataset, for consistency verification. When the language model determines that the two are consistent, the sample is recorded as "pass"; when it determines that they are inconsistent, the sample is recorded as "fail".

7. The evaluation method as described in claim 6, characterized in that: The multimodal model is the GPT-4o model.

8. The evaluation method as described in claim 6, characterized in that: The language model is the Deepseek-V3 model.
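
The consistency verification of claims 6 to 8 can be illustrated with a minimal Python sketch. The `Sample` fields, the function names, and the two callables are hypothetical: `answer_model` stands in for a multimodal model such as GPT-4o and `consistency_judge` for a language model such as Deepseek-V3, and no real model API is called here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Sample:
    question: str
    context: str    # image or table reference plus supplementary context
    answer: str     # reference answer stored in the dataset
    reasoning: str  # reference reasoning process stored in the dataset


def verify_sample(
    sample: Sample,
    answer_model: Callable[[Dict[str, str]], Tuple[str, str]],
    consistency_judge: Callable[[str, str, str, str], bool],
) -> str:
    """Return "pass" or "fail" for one dataset sample."""
    # Remove the answer and the reasoning process before querying the model.
    stripped = {"question": sample.question, "context": sample.context}
    pred_answer, pred_reasoning = answer_model(stripped)
    # The judge compares the reference pair with the newly generated pair.
    consistent = consistency_judge(
        sample.answer, sample.reasoning, pred_answer, pred_reasoning
    )
    return "pass" if consistent else "fail"
```

Keeping both models behind plain callables makes the pass/fail bookkeeping testable without network access, and the share of samples recorded as "pass" can then serve as a simple quality indicator for the dataset.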