A layout analysis method and device based on a large model in the semiconductor field
By combining the outputs of an optical character recognition model and a page segmentation model, and using a large model in the semiconductor field to associate and merge bounding boxes, the method addresses the difficulty traditional page analysis has in recognizing multiple categories of elements and adapting to flexible layouts in the semiconductor field, thereby achieving more efficient page analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN ZHIXIAN FUTURE IND SOFTWARE CO LTD
- Filing Date
- 2024-08-09
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional layout analysis methods are unable to efficiently identify multiple categories of elements and adapt to flexible layout formats in the semiconductor field, resulting in low retrieval efficiency.
The semiconductor image document is initially processed using an optical character recognition model and a page segmentation model. The bounding boxes are associated and merged using a large model in the semiconductor field, and the page analysis is guided by preset prompts.
It improves the efficiency and accuracy of layout analysis, enabling better identification and merging of text and block elements in semiconductor image documents, and outputting more accurate layout analysis results.
[Figure: abstract drawing, CN119049071B]
Description
Technical Field
[0001] This application relates to the field of layout analysis technology, and in particular to a layout analysis method and apparatus based on a large model in the semiconductor field.
Background Technology
[0002] Page layout analysis is a method for identifying and understanding the layout of documents, images, or web pages. It can be used to automate document content processing, such as extracting structured information and identifying key elements in text or images. Its main applications include document processing, image recognition, and web page analysis.
[0003] In document processing, layout analysis technology can identify elements such as headings, paragraphs, lists, and tables, thereby better understanding the document structure and enabling automated functions such as document summarization and key information extraction. In image recognition, layout analysis technology can identify text and graphics elements in images, thereby enabling automated image understanding and processing. In web page analysis, layout analysis technology can understand the structure of web pages, thereby enabling automated web page content extraction and analysis.
[0004] In the semiconductor manufacturing process, production personnel provide a large number of documents containing various elements such as images, text, and flowcharts. Finding the specific document or information needed by the user or system from this vast amount of data is particularly difficult. Therefore, layout analysis is necessary to identify the relationships between images and text, thereby improving retrieval efficiency.
[0005] However, traditional layout analysis can only identify a limited number of categories and cannot adapt to flexible layout formats. Therefore, there is a need for an efficient and flexible layout analysis method specifically for the semiconductor industry.
Summary of the Invention
[0006] To address the aforementioned issues, this application proposes a layout analysis method, apparatus, computer-readable storage medium, and electronic device based on a large-scale model in the semiconductor field, which enables efficient and flexible layout analysis.
[0007] In a first aspect, this application provides a layout analysis method based on a large model in the semiconductor field. The method includes: using an optical character recognition (OCR) model to perform text recognition on a semiconductor image document, and outputting multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element; using a layout segmentation model to segment the semiconductor image document, and outputting multiple section sub-images, as well as the bounding box information and section category of each section; and inputting the outputs of the OCR model and the layout segmentation model, along with preset prompts, into the large model in the semiconductor field, so that it performs bounding box association and merging operations under the guidance of the preset prompts and outputs the layout analysis results corresponding to the semiconductor image document; wherein the large model in the semiconductor field is obtained through pre-training.
[0008] Therefore, this application inputs the output results of the optical character recognition model and the page segmentation model into a large semiconductor model for page analysis, which can merge the associated bounding boxes, thereby improving the efficiency and accuracy of page analysis.
[0009] In one possible implementation, the section categories include text lines, page numbers, tables, and graphics.
[0010] In one possible implementation, the bounding box information includes the position and size of the bounding box within the semiconductor image document.
[0011] In one possible implementation, the bounding box association and merging operation includes: performing position calibration on the output results of the optical character recognition model and the page segmentation model; associating the bounding boxes of two text elements whose positional relationship meets preset conditions, or the bounding boxes of a text element and a sub-image; and merging the bounding boxes of the two associated text elements, or the bounding boxes of a text element and a sub-image.
[0012] In one possible implementation, the layout segmentation model also outputs the confidence level of each segment category.
[0013] In one possible implementation, the layout analysis results include: information on multiple merged target bounding boxes, labels of the target bounding boxes, content contained in each target bounding box, and layout relationships across lines, wherein the labels of the target bounding boxes indicate the category of the content they contain.
[0014] In one possible implementation, the large model in the semiconductor field being obtained through pre-training includes: the large model in the semiconductor field is pre-trained using sample data, where the sample data is obtained by manually annotating past output results of optical character recognition models and page segmentation models in the semiconductor field. The annotation rules include annotating the bounding boxes of two text elements whose positional relationship meets preset conditions, or the bounding boxes of a text element and its corresponding section.
[0015] In a second aspect, this application provides a layout analysis device based on a large model in the semiconductor field. The device includes: a text recognition module for performing text recognition on a semiconductor image document using an optical character recognition model, and outputting multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element; a layout segmentation module for performing layout segmentation on the semiconductor image document using a layout segmentation model, and outputting multiple section sub-images, as well as the bounding box information and section category of each section; and a large model processing module for inputting the output results of the optical character recognition model and the layout segmentation model, along with preset prompts, into the large model in the semiconductor field, so that it performs bounding box association and merging operations under the guidance of the preset prompts and outputs the layout analysis results corresponding to the semiconductor image document; wherein the large model in the semiconductor field is obtained through pre-training.
[0016] In a third aspect, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method described in the first aspect or any possible implementation thereof.
[0017] In a fourth aspect, this application provides an electronic device, comprising: at least one memory for storing a program; and at least one processor for executing the program stored in the memory; wherein, when the program stored in the memory is executed, the processor is configured to execute the method described in the first aspect or any possible implementation thereof.
[0018] It is understood that the beneficial effects of the second to fourth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
Description of the Drawings
[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 shows a layout analysis system based on a large model in the semiconductor field, provided in an embodiment of this application;
[0021] Figure 2 is a flowchart of a layout analysis method based on a large model in the semiconductor field, provided in an embodiment of this application;
[0022] Figure 3 shows a semiconductor image document with multiple sections marked, provided in an embodiment of this application;
[0023] Figure 4 shows a semiconductor image document with multiple sections and text elements marked, provided in an embodiment of this application;
[0024] Figure 5 is a schematic diagram of a layout analysis device based on a large model in the semiconductor field, provided in an embodiment of this application.
Detailed Description
[0025] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application will be described below with reference to the accompanying drawings.
[0026] In the description of the embodiments of this application, the words "exemplary," "for example," or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design described as "exemplary," "for example," or "for instance" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the words "exemplary," "for example," or "for instance" is intended to present the relevant concepts in a specific manner.
[0027] In the description of the embodiments in this application, the term "and / or" is merely a description of the association relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, B existing alone, and A and B existing simultaneously. Furthermore, unless otherwise stated, the term "multiple" means two or more. For example, multiple systems refer to two or more systems, and multiple screen terminals refer to two or more screen terminals.
[0028] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and their variations all mean "including but not limited to," unless otherwise specifically emphasized.
[0029] As mentioned earlier, in the semiconductor manufacturing process, production personnel provide a large number of documents containing various elements such as images, text, and flowcharts. Traditional layout analysis can only identify a limited number of categories from these documents and cannot adapt to flexible layout formats.
[0030] In view of this, the embodiments of this application first process the semiconductor image document through an optical character recognition model and a page segmentation model, and then input the processing results into a large semiconductor model for page analysis and optimization. This can associate bounding boxes whose positional relationships meet preset conditions, thereby identifying and merging those bounding boxes that actually belong to the same semantic part but are segmented, so that the large semiconductor model outputs more accurate page analysis results.
[0031] For example, Figure 1 illustrates a layout analysis system based on a large model in the semiconductor field, as provided in an embodiment of this application. This system can be deployed on any device, apparatus, platform, server, or device cluster with computing and processing capabilities.
[0032] As shown in Figure 1, the input to the layout analysis system is a semiconductor image document, that is, a document page in image form that can be captured during the semiconductor manufacturing process. The original document format can be Word, PDF, or PPT, so its layout is more flexible than that of traditional documents. During semiconductor manufacturing, production personnel supply images, text, flowcharts, and other elements to generate this document, so it contains a wider range of element categories than traditional documents. Therefore, more effective layout analysis processing is required.
[0033] First, an optical character recognition model is used to perform text recognition on a semiconductor image document, outputting multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element.
[0034] In parallel with the above step, the semiconductor image document is segmented using a page segmentation model, which outputs multiple section sub-images, along with the bounding box information and section category of each section. The confidence level of each section category can also be output.
[0035] Next, the outputs of the optical character recognition model and the page segmentation model, along with preset prompts, are input into the large model in the semiconductor field. Guided by the preset prompts, the model performs bounding box association and merging operations and outputs the page analysis results of the semiconductor image document. The large model in the semiconductor field is pre-trained using sample data, which is obtained by manually annotating past outputs of optical character recognition and page segmentation models in the semiconductor field. The annotation rules include annotating the bounding boxes of two text elements whose positional relationship meets preset conditions, or the bounding boxes of a text element and its corresponding section. A positional relationship meets the preset conditions when the bounding boxes are close to each other, that is, when the distance between them is within a preset value.
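The proximity condition described above can be sketched in a few lines. This is an illustration only: the description does not define the distance metric or the preset value, so the Euclidean gap between box edges, the function names, and the 20-pixel threshold are all assumptions.

```python
import math

# A box is (x_min, y_min, x_max, y_max) in pixels; this tuple layout is an assumption.
Box = tuple[float, float, float, float]


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes; 0.0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)  # horizontal gap, 0 if x-ranges overlap
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)  # vertical gap, 0 if y-ranges overlap
    return math.hypot(dx, dy)


def is_associated(a: Box, b: Box, max_gap: float = 20.0) -> bool:
    """Assumed rule: two boxes are association candidates when their gap is within max_gap."""
    return box_gap(a, b) <= max_gap


# Hypothetical coordinates: a chart section, a caption just below it, and distant text.
pie_chart = (100.0, 200.0, 400.0, 500.0)
caption = (100.0, 510.0, 400.0, 540.0)
far_text = (100.0, 900.0, 400.0, 930.0)

print(is_associated(pie_chart, caption))   # caption is 10 px below the chart
print(is_associated(pie_chart, far_text))  # far_text is 400 px below the chart
```

In the described method this judgment is made by the large model under the guidance of prompts; a rule of this kind is closer to what an annotator might apply when preparing the sample data.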
[0036] In summary, optical character recognition (OCR) models, focusing on text recognition, offer higher text extraction efficiency; while page segmentation models, emphasizing document structure analysis, can handle a wider range of element categories. Therefore, by inputting the outputs of both models into a pre-trained semiconductor domain large-scale model and combining their results, more efficient and accurate page analysis can be achieved.
[0037] Figure 2 shows a flowchart of a layout analysis method based on a large model in the semiconductor field, provided in an embodiment of this application. As shown in Figure 2, the method includes the following steps:
[0038] Step S201: Use an optical character recognition model to perform text recognition on the semiconductor image document, and output multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element.
[0039] In this embodiment, the optical character recognition model is used to convert text image regions in semiconductor image documents into editable and searchable text elements. The model can be a self-developed algorithm model for the semiconductor field, or a mature publicly released tool library, such as Tesseract OCR, EasyOCR, etc., as long as the selected model can achieve the accuracy required by the business.
[0040] Optical character recognition (OCR) models are used to recognize text in semiconductor image documents. The resulting text elements are strings within the semiconductor image document, which can include text in any language, such as Chinese or English, as well as meaningful symbols, such as Arabic numerals and units of measurement. The OCR model provides a bounding box—a rectangular region—defining the position and size of each recognized text element within the semiconductor image document.
[0041] A bounding box typically has the following information: top-left corner coordinates (x_min, y_min), the pixel coordinates of the top-left corner of the bounding box, usually with the top-left corner of the image document as the origin (0, 0); bottom-right corner coordinates (x_max, y_max), the pixel coordinates of the bottom-right corner of the bounding box, which together with the top-left corner determine its width and height; width, equal to x_max minus x_min; and height, equal to y_max minus y_min.
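A bounding box with these fields could be represented as follows. This is a minimal sketch; the class name `BBox` and the sample coordinates are hypothetical, while the width and height formulas follow the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in pixels, with the origin at the image's top-left corner."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        # Width is x_max minus x_min.
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        # Height is y_max minus y_min.
        return self.y_max - self.y_min


box = BBox(x_min=40, y_min=30, x_max=240, y_max=80)  # hypothetical values
print(box.width, box.height)
```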
[0042] Step S202: Use the page segmentation model to segment the semiconductor image document, output multiple sub-images of the blocks, as well as the bounding box information and block category of each block.
[0043] In this embodiment, the page segmentation model is used to divide the semiconductor image document into different categories of blocks or regions, identify the block category of each block, and mark the regional location of each block. The block category indicates the type of elements displayed within it, such as text lines, page numbers, tables, and graphics.
[0044] Thus, by using the page segmentation model to segment semiconductor image documents, multiple sub-images of the segmented sections can be obtained, each section having a corresponding section category and corresponding bounding box information.
[0045] Optionally, the page segmentation model also outputs a confidence score for each section category, representing how certain the model is about the segmentation result for that section sub-image. The confidence score can be provided to the large model in the semiconductor field in subsequent processing to improve the accuracy of the layout analysis performed by the large model.
[0046] For example, Figure 3 illustrates a semiconductor image document with multiple sections marked, provided in an embodiment of this application. As shown in Figure 3, multiple sections are marked in the original semiconductor image document. Each section has a corresponding bounding box, which indicates the position and size of the sub-region containing that section within the image document. The position and size of each bounding box are represented by a rectangular area, and the section category, i.e., the category of the elements displayed in that section (for example, text block, table, or figure), is displayed in the upper left corner of the rectangular area. In the example of Figure 3, the upper left corner of the rectangular area also displays the confidence level of the section category, such as 100%.
[0047] The page segmentation model segments the original image document to obtain, for example, the multiple section sub-images with bounding boxes shown in Figure 3, along with the bounding box information and section category of each section, as well as the confidence level of each section category.
[0048] Step S203: Input the output results of the optical character recognition model and the layout segmentation model, as well as the preset prompt words, into the semiconductor domain big model, so that it performs bounding box association and merging operations under the guidance of the preset prompt words, and outputs the layout analysis results corresponding to the semiconductor image document; the semiconductor domain big model is obtained through pre-training.
[0049] In this embodiment, the large-scale semiconductor model is obtained through pre-training and is a multimodal model capable of processing text and images. Sample data is obtained by manually annotating the output results of past optical character recognition and page segmentation models in the semiconductor field. Annotation rules include labeling the bounding boxes of two text elements whose positional relationship meets preset conditions, or the bounding boxes of text elements and corresponding blocks. This enables the trained large-scale semiconductor model to efficiently recognize and merge two text elements, or text elements and blocks, whose positional relationship meets preset conditions for image documents in the semiconductor field.
[0050] Preset prompts can be generated using prompt technology to guide large-scale semiconductor models in two stages of document layout analysis and processing, based on the outputs of optical character recognition (OCR) and page segmentation models. These preset prompts guide the large-scale semiconductor model to complete specific tasks by providing contextual information or questions to stimulate the model's relevant capabilities. This process includes:
[0051] The first step is to instruct the large model to perform position calibration on the outputs of the optical character recognition model and the page segmentation model.
[0052] In this step, the prompt can be an explicit instruction or contextual information that guides the model to identify the correlation between the OCR results and the page segmentation results.
[0053] This step includes the following:
[0054] Provide preset prompts, such as: "Please calibrate the positions of the text content extracted by OCR below against the corresponding areas in the page segmentation result";
[0055] Provide the text elements extracted by the OCR model and their corresponding bounding box information, as well as the section sub-images, section categories, and corresponding bounding box information identified by the page segmentation model;
[0056] The system guides the large model to identify the correspondence between two text elements or between a text element and a section, ensuring that text elements are associated with the correct text elements or section parts. Specifically, if the positional relationship of the bounding boxes corresponding to a text element or section meets preset conditions—for example, if the bounding box of a text element recognized by OCR is located within the bounding box of a certain section—then the text element can be initially considered to belong to that section. Conflicts may occur during the association process, such as a text element's bounding box being located within the bounding boxes of multiple sections. In such cases, certain rules or algorithms (such as maximum overlap area, minimum edit distance, etc.) are needed to resolve the conflicts and determine the final assignment of the text element.
[0057] Thus, position calibration ensures the consistency between the layout structure of the image document and the text elements, matching the text elements with their actual positions in the semiconductor image document, providing a foundation for subsequent multimodal analysis and layout relationship optimization.
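The containment check and the "maximum overlap area" conflict rule mentioned in the first step can be sketched as below. In the described method the large model performs this association under the guidance of prompts; the rule-based version here only illustrates the geometric rule, and the function names, identifiers, and coordinates are hypothetical.

```python
from typing import Optional

# A box is (x_min, y_min, x_max, y_max) in pixels; this tuple layout is an assumption.
Box = tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def assign_text_to_section(text_box: Box, sections: dict[str, Box]) -> Optional[str]:
    """Return the id of the section with the largest overlap, or None if none overlaps."""
    best_id, best_area = None, 0.0
    for section_id, section_box in sections.items():
        area = overlap_area(text_box, section_box)
        if area > best_area:
            best_id, best_area = section_id, area
    return best_id


# Hypothetical conflict: the text box straddles two neighbouring sections.
sections = {"table_1": (0.0, 0.0, 300.0, 200.0), "figure_1": (250.0, 0.0, 600.0, 200.0)}
text_box = (240.0, 50.0, 340.0, 80.0)
print(assign_text_to_section(text_box, sections))  # figure_1 has the larger overlap
```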
[0058] The second step involves merging bounding boxes using the multimodal capabilities of the large model. In this step, `prompt` can be a further instruction that guides the large model to utilize its multimodal capabilities to identify and merge bounding boxes.
[0059] This step includes the following:
[0060] Provide preset prompts, such as: "Please use multimodal analysis capabilities to identify and merge bounding boxes that belong to the same semantic part but are separated";
[0061] Provide the association results after the first step of calibration, including the association between two text elements, or between a text element and a section;
[0062] The large model is guided to analyze multimodal features such as text flow, format consistency, and semantic coherence based on their relationships, and to identify and merge bounding boxes that actually belong to the same semantic part but are segmented (such as a complete piece of text, or an image and its corresponding explanatory text or title being split into multiple parts).
[0063] The merged bounding boxes more accurately reflect the actual layout structure of the semiconductor image document; therefore, this step optimizes the layout analysis results.
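The geometry of a merge can be illustrated with a short sketch. The description does not state how the merged target bounding box is computed, so taking the smallest rectangle that encloses all associated boxes is an assumption, as are the function name and the coordinates.

```python
# A box is (x_min, y_min, x_max, y_max) in pixels; this tuple layout is an assumption.
Box = tuple[float, float, float, float]


def merge_boxes(boxes: list[Box]) -> Box:
    """Return the smallest axis-aligned box enclosing every box in the list."""
    if not boxes:
        raise ValueError("at least one bounding box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hypothetical example: an image section and the caption text element below it.
image_section = (100.0, 200.0, 400.0, 500.0)
caption_text = (100.0, 510.0, 400.0, 540.0)
print(merge_boxes([image_section, caption_text]))
```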
[0064] The third step is to output the optimized layout analysis results.
[0065] This step includes the following:
[0066] Provide preset prompts, such as: "Please output the final layout analysis results, including accurate text content, bounding box information, bounding box labels, and layout relationships across lines."
[0067] Guide the large model to output the optimized layout analysis results.
[0068] The output of a large model can include the following parts:
[0069] Accurate text content: After bounding box calibration and merging steps, the text recognition results provided by the large model are consistent with the actual content of the document.
[0070] Bounding box label: The explicit label of each merged target bounding box, indicating the category of its contained content.
[0071] Page layout relationships across lines: Correctly identifying the relationships between lines of text and paragraphs. This includes the beginning and end of paragraphs, as well as the logical connections between different lines of text.
[0072] The output document format of the large model can be Extensible Markup Language (XML), JavaScript Object Notation (JSON), or the like. The output document describes the structured data output by the large model, which includes the various parts of the image document and their categories.
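A JSON output document of this kind might look like the sketch below. The description does not fix a schema, so every field name and coordinate is hypothetical; only the label and the chart text are taken from the Figure 4 example in this description.

```python
import json

# Hypothetical schema: field names and coordinates are illustrative assumptions.
layout_result = {
    "target_boxes": [
        {
            "label": "chart",
            "bbox": {"x_min": 100, "y_min": 200, "x_max": 400, "y_max": 540},
            "content": {
                "title": "Application Proportion of Different AI Technologies "
                         "in the Semiconductor Field",
                "description": "Deep learning 40%, machine learning 30%, "
                               "natural language processing 20%, image recognition 10%.",
            },
        }
    ]
}

# Serialize for storage (for example in a knowledge base) and read it back.
serialized = json.dumps(layout_result, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(restored["target_boxes"][0]["label"])
```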
[0073] The output documents of the large model can be saved to a knowledge base. Furthermore, after a user or system raises a question, the knowledge base is used to perform a question-related search and returns the search results.
[0074] In the steps above, applying prompt technology to the document layout analysis process can more effectively guide large multimodal models to complete complex layout analysis tasks, improving the accuracy and efficiency of the analysis. Prompt technology acts as a bridge for communication with the model, ensuring that the model understands the task requirements and provides accurate output.
[0075] The preset prompts used in the above three steps can be provided to the large model all at once, or they can be provided to the large model in two or three steps in combination with the processing results of each step. This application does not limit this.
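One way to combine a preset prompt with the two models' outputs is sketched below. The description does not specify how the input is serialized, so the JSON payload, the key names, and the function name are assumptions; the instruction text is the second-step example prompt quoted earlier.

```python
import json


def build_prompt(instruction: str, text_elements: list[dict], sections: list[dict]) -> str:
    """Concatenate a preset instruction with the OCR and segmentation outputs as JSON."""
    payload = {"ocr_text_elements": text_elements, "layout_sections": sections}
    return f"{instruction}\n\nInput:\n{json.dumps(payload, ensure_ascii=False, indent=2)}"


# Hypothetical model outputs; bbox is [x_min, y_min, x_max, y_max].
text_elements = [{"text": "Chart title", "bbox": [100, 510, 400, 540]}]
sections = [{"category": "figure", "confidence": 1.0, "bbox": [100, 200, 400, 500]}]

prompt = build_prompt(
    "Please use multimodal analysis capabilities to identify and merge bounding boxes "
    "that belong to the same semantic part but are separated",
    text_elements,
    sections,
)
print(prompt)
```

Calling `build_prompt` once per step, with the previous step's result as input, corresponds to the stepwise option; concatenating all instructions into one call corresponds to providing them all at once.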
[0076] In addition, since the large model has a length limit for the content it can receive, if the image document to be analyzed is very large and exceeds the length limit, the large model needs to be called multiple times to repeat the above three steps in order to complete the layout analysis of the entire image document.
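A simple way to stay within the length limit is to group the serialized elements into batches and call the large model once per batch. The description does not say how the document is divided, so the greedy grouping, the use of character count as a stand-in for the model's actual limit, and the function name are assumptions.

```python
def split_into_batches(items: list[str], max_chars: int) -> list[list[str]]:
    """Greedily group serialized elements so each batch stays within max_chars."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for item in items:
        if len(item) > max_chars:
            raise ValueError("a single element exceeds the length limit")
        # Start a new batch when adding this item would exceed the limit.
        if current and used + len(item) > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += len(item)
    if current:
        batches.append(current)
    return batches


# Hypothetical serialized elements of 40 characters each, with a limit of 100.
items = ["a" * 40, "b" * 40, "c" * 40]
batches = split_into_batches(items, max_chars=100)
print(len(batches))
```

Elements that belong together may land in different batches under this scheme, so a practical split would likely follow page or section boundaries instead.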
[0077] For example, Figure 4 This illustrates a semiconductor image document with multiple sections and text elements labeled, as provided in an embodiment of this application.
[0078] As shown in Figure 4, the original semiconductor image document is an image document obtained by scanning a PPT format page, and includes four parts: title, text, charts, and images.
[0079] The process of performing layout analysis on the original semiconductor image document using the method shown in Figure 2 above is as follows:
[0080] First, an optical character recognition model is used to identify text elements in an image document, converting the text image into an editable text format.
[0081] Specifically, the optical character recognition model can identify the text content of text elements in various parts of a document and pinpoint their respective regions. For example, it can identify the text element "Application of Artificial Intelligence in the Semiconductor Field" as region 11, the text element "Artificial intelligence technology, especially deep learning and machine learning, is changing the way semiconductors are manufactured" as region 12, the "Proportional Chart of Different AI Technologies in the Semiconductor Field" as region 13, and the "AI-Assisted Semiconductor Production Query Interface Chart" as region 14. Each region is marked with coordinates.
[0082] Second, the original semiconductor image document is segmented using a page segmentation model to analyze the layout of the PPT pages. The output of page segmentation includes multiple sub-images, the category of each sub-image, and the corresponding bounding box information.
[0083] For example, the sections may include the title, text, pie chart, query interface, chart title, and image title shown in Figure 4.
[0084] Based on the output results of the first and second steps, it can be seen that, in the original semiconductor image document, the chart section (i.e., the pie chart) and its text element (i.e., region 13), which belong together, as well as the image section (i.e., the query interface) and its text element (i.e., region 14), which also belong together, were each divided into two different parts.
[0085] Third, based on the output results of the optical character recognition model and the page segmentation model, the large model in the semiconductor field performs page analysis to generate a comprehensive and easy-to-understand output.
[0086] Taking the chart section of the original semiconductor image document in Figure 4 as an example, the large model performs correlation analysis and merges the bounding boxes of the "Pie Chart" section and the "Region 13" text element, outputting the target bounding box information corresponding to the "Chart" section in the original semiconductor image document. The content of this target bounding box includes:
[0087] Chart title: "Application Proportion of Different AI Technologies in the Semiconductor Field"
[0088] Chart Description: The chart shows that deep learning accounts for 40%, machine learning for 30%, natural language processing for 20%, and image recognition for 10%.
[0089] And the label for the output target bounding box: chart.
[0090] Similarly, the large model also performs similar association and merging processes on the bounding boxes of other text elements and sections, and outputs the final layout analysis results.
[0091] Therefore, this application first processes the semiconductor image document through an optical character recognition model and a page segmentation model, and then inputs the processing results into the large model in the semiconductor field for page analysis. This can merge the bounding boxes of related text elements, or of related text elements and sections, thereby outputting more accurate page analysis results.
[0092] It should be noted that although the operations of the methods of the embodiments of this application are described in a specific order in the above embodiments, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0093] Based on the methods in the above embodiments, Figure 5 exemplarily illustrates a layout analysis apparatus based on a large model in the semiconductor field, as provided in an embodiment of this application. As shown in Figure 5, the layout analysis device 500 includes:
[0094] The text recognition module 510 is used to perform text recognition on semiconductor image documents using an optical character recognition model, and outputs multiple text elements included in the semiconductor image document as well as the bounding box information corresponding to each text element.
[0095] The page segmentation module 520 is used to segment semiconductor image documents using a page segmentation model, outputting multiple sub-images of each segment, as well as the bounding box information and segment category of each segment.
[0096] The large model processing module 530 is used to input the output results of the optical character recognition model and the page segmentation model, as well as preset prompts, into the large model in the semiconductor field. Guided by the preset prompts, the model performs bounding box association and merging operations and outputs the layout analysis results corresponding to the semiconductor image document. The large model processing module is obtained through pre-training.
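The data flow between modules 510, 520, and 530 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of this application: the three models are passed in as plain callables, the stand-in functions `fake_ocr`, `fake_segmenter`, and `fake_model` are hypothetical stubs, the JSON serialization of the model input is an assumed format, and the sub-images output by the page segmentation model are omitted for brevity.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class TextElement:
    text: str
    bbox: BBox


@dataclass
class Segment:
    category: str  # e.g. "text line", "page number", "table", "graphics"
    bbox: BBox


def build_model_input(texts: List[TextElement], segments: List[Segment],
                      preset_prompt: str) -> str:
    """Combine the preset prompt with both model outputs (assumed format)."""
    payload = {
        "text_elements": [{"text": t.text, "bbox": t.bbox} for t in texts],
        "segments": [{"category": s.category, "bbox": s.bbox} for s in segments],
    }
    return preset_prompt + "\n" + json.dumps(payload, ensure_ascii=False)


def analyze_layout(image: Any,
                   ocr: Callable[[Any], List[TextElement]],
                   segmenter: Callable[[Any], List[Segment]],
                   large_model: Callable[[str], Any],
                   preset_prompt: str) -> Any:
    texts = ocr(image)           # role of text recognition module 510
    segments = segmenter(image)  # role of page segmentation module 520
    # Role of large model processing module 530: the model receives both
    # outputs plus the preset prompt and returns the layout analysis result.
    return large_model(build_model_input(texts, segments, preset_prompt))


# Hypothetical stand-ins so the sketch runs without any real model.
def fake_ocr(image: Any) -> List[TextElement]:
    return [TextElement("Region 13", (100, 400, 400, 30))]


def fake_segmenter(image: Any) -> List[Segment]:
    return [Segment("graphics", (120, 440, 360, 300))]


def fake_model(model_input: str) -> dict:
    return {"model_input": model_input}  # a real model would return the layout


result = analyze_layout(
    image=None,
    ocr=fake_ocr,
    segmenter=fake_segmenter,
    large_model=fake_model,
    preset_prompt="Associate and merge related bounding boxes.",
)
```

Passing the models as callables keeps the sketch independent of any particular optical character recognition, segmentation, or large model library.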
[0097] Based on the methods in the above embodiments, this application provides an electronic device. The electronic device may include: at least one memory for storing a program; and at least one processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor executes the methods described in the above embodiments. Exemplarily, the electronic device may be a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, server, ultra-mobile personal computer (UMPC), netbook, cellular phone, personal digital assistant (PDA), augmented reality (AR) device, virtual reality (VR) device, artificial intelligence (AI) device, wearable device, in-vehicle device, smart home device, and / or smart city device. This application does not impose any special limitations on the specific category of the electronic device.
[0098] The above embodiments can be implemented entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, they can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0099] It is understood that the various numerical designations used in the embodiments of this application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this application. It should be understood that in the embodiments of this application, the order of the process numbers does not imply the order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0100] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of this application. It should be understood that the above description is only a specific embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of this application should be included within the scope of protection of this application.
Claims
1. A layout analysis method based on a large model in the semiconductor field, characterized in that the method includes: performing text recognition on a semiconductor image document using an optical character recognition model, and outputting multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element; segmenting the semiconductor image document using a page segmentation model, and outputting multiple sub-images of each segment, along with the bounding box information and segment category of each segment; and inputting the outputs of the optical character recognition model and the page segmentation model, along with preset prompts, into the large model in the semiconductor field, so that, guided by the preset prompts, the model performs bounding box association and merging operations and outputs the layout analysis results corresponding to the semiconductor image document; wherein the large model in the semiconductor field is obtained through pre-training.
2. The method according to claim 1, characterized in that the segment categories include text lines, page numbers, tables, and graphics.
3. The method according to claim 1, characterized in that the bounding box information includes the position and size of the bounding box in the semiconductor image document.
4. The method according to claim 1, characterized in that the bounding box association and merging operations include: positionally calibrating the outputs of the optical character recognition model and the page segmentation model, and associating the bounding boxes of two text elements, or of a text element and a segment sub-image, whose positional relationship meets preset conditions; and merging the bounding boxes corresponding to the two associated text elements, or to the associated text element and sub-image.
5. The method according to claim 1, characterized in that the page segmentation model also outputs the confidence level of each segment category.
6. The method according to claim 1, characterized in that the layout analysis results include: information on multiple target bounding boxes obtained by merging, labels of the target bounding boxes, content contained in each target bounding box, and layout relationships across lines, wherein the label of each target bounding box indicates the category of the content it contains.
7. The method according to claim 1, characterized in that obtaining the large model in the semiconductor field through pre-training includes: pre-training the large model in the semiconductor field using sample data, wherein the sample data is obtained by manually annotating past output results of optical character recognition models and page segmentation models in the semiconductor field, and the annotation rules include annotating the bounding boxes of two text elements, or of a text element and its corresponding segment, whose positional relationship meets the preset conditions.
8. A layout analysis device based on a large model in the semiconductor field, characterized in that the device includes: a text recognition module, used to perform text recognition on a semiconductor image document using an optical character recognition model, and to output multiple text elements included in the semiconductor image document and the bounding box information corresponding to each text element; a page segmentation module, used to segment the semiconductor image document using a page segmentation model, and to output multiple sub-images of each segment, as well as the bounding box information and segment category of each segment; and a large model processing module, used to input the output results of the optical character recognition model and the page segmentation model, as well as preset prompts, into the large model in the semiconductor field, so that it performs bounding box association and merging operations under the guidance of the preset prompts and outputs the layout analysis results corresponding to the semiconductor image document; wherein the large model processing module is obtained through pre-training.
9. A computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-7.
10. An electronic device, characterized in that it includes: at least one memory for storing a program; and at least one processor for executing the program stored in the memory; wherein, when the program stored in the memory is executed, the processor is configured to perform the method of any one of claims 1-7.