Method and system for processing multimodal data and providing same to large language model

The method addresses the challenge of integrating multimodal data by analyzing document layout, generating descriptive and structured text from images and tables, and integrating them into a structured format for large language models, enhancing the accuracy and completeness of document processing.

WO2026135353A1 · PCT designated stage · Publication Date: 2026-06-25 · POSCO HLDG INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
POSCO HLDG INC
Filing Date
2025-12-19
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Current systems struggle to effectively process and integrate text, images, and tables within documents, leading to incomplete or misinterpreted information due to limitations in handling multimodal data.

Method used

A method that analyzes the layout of documents to identify text, image, and table areas, generates descriptive text from images and structured text from tables, and integrates these forms into a structured format for large language models while preserving the original document's mapping relationships.

Benefits of technology

Enables accurate and comprehensive understanding of document content by large language models, ensuring all information is utilized without loss of structure or meaning.

✦ Generated by Eureka AI based on patent content.


Abstract

A method for processing multimodal data in a document, according to one embodiment of the present invention, comprises the steps of: inputting an electronic document into an artificial intelligence model so as to analyze the layout of the electronic document, thereby identifying a text region including text, an image region including an image, and a table region including a table; extracting first text information from the text region; generating, from the image region, second text information describing the image; recognizing matrix coordinates of the table in the table region, and generating third text information in which lower-level text corresponding to each of matrix coordinates and at least one higher-level text corresponding to the corresponding matrix coordinates are mapped; and integrating the first text information, the second text information, and the third text information on the basis of the layout of the electronic document, and providing same to a large language model, wherein the integration step can be performed while maintaining a mapping relationship with the original document.

Description

Method and system for processing multimodal data and providing it to a large language model

[0001] The present invention relates to a method and a program for processing multimodal data from documents of various formats and converting it into a form that can be utilized by a large language model.

[0002] Generally, systems that extract and process information from documents focus on text-based processing, which limits their ability to effectively process other forms of data, such as images or tables.

[0003] While the recent emergence of RAG systems utilizing large language models has significantly improved the accuracy and reliability of document processing, these systems still focus primarily on processing text-based data. In particular, there are difficulties in extracting meaningful information from images or processing data while preserving the structural characteristics of tables.

[0004] Furthermore, current systems have limitations in integrally processing different forms of data. For example, it is difficult to process and understand text descriptions, product images, and specification tables included in product manuals within a single consistent context. As a result, not all information contained in the document can be effectively utilized, leading to issues such as the omission or misinterpretation of some information.

[0005] Therefore, there is a need for a new approach that integrally processes various forms of data contained in documents and transforms them into a form that large language models can effectively utilize.

[0006] According to the present invention, a method is provided to effectively process multimodal data, such as text, images, and tables, in documents composed of various formats.

[0007] According to the present invention, a method is provided for analyzing the layout of a document to identify the characteristics of each area and applying an appropriate processing method based thereon.

[0008] According to the present invention, a method is provided for generating descriptive text in the form of natural language from an image and converting it into structured text while preserving the structural characteristics of a table.

[0009] According to the present invention, a method is provided for integrating different forms of data and converting them into a form that is easy for a large language model to process.

[0010] According to the present invention, a method is provided for structuring data while maintaining a mapping relationship with the original document and storing it in a form that can be utilized in a RAG system.

[0011] The technical problems to be solved in this document are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art to which this invention belongs from the description below.

[0012] A method for processing multimodal data within a document according to an embodiment of the present invention comprises: a step of inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document and identifying a text area containing text, an image area containing an image, and a table area containing a table; a step of extracting first text information from the text area; a step of generating second text information describing the image from the image area; a step of recognizing matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; and a step of integrating the first text information, the second text information, and the third text information based on the layout of the electronic document and providing them to a large language model; wherein the integration step may be performed while maintaining a mapping relationship with the original document.

[0013] In the above multimodal data processing method, the layout analysis step may include: a step of dividing a document into regions; a step of classifying region types by determining the characteristics of each region; and a step of analyzing the relationships between the classified regions.

[0014] In the above multimodal data processing method, the step of generating second text information from the image region may include: a step of extracting feature points of the image by inputting the image into an identification artificial intelligence model; and a step of inputting the feature points into an image analysis model.

[0015] In the above multimodal data processing method, the step of generating the third text information may include: determining the matrix coordinates of the table by inputting the table into an identification artificial intelligence model; analyzing the relationships between data elements; and converting the analyzed structure and relationships into a structured format.

[0016] In the above multimodal data processing method, the integration step may include: a step of arranging the processing results of each region according to contextual order; a step of establishing reference relationships between regions; and a step of integrating the arranged data and reference relationships into a single document.

[0017] A computer program according to an embodiment of the present invention is stored on a computer-readable storage medium and, when executed on one or more processors of a computing device, performs steps for processing multimodal data within a document, the steps including: inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document and identifying a text area containing text, an image area containing an image, and a table area containing a table; extracting first text information from the text area; generating second text information describing the image from the image area; recognizing matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; and integrating the first text information, the second text information, and the third text information based on the layout of the electronic document and providing them to a large language model; wherein the integration step can process multimodal data within the document while maintaining a mapping relationship with the original document.

[0018] In the above computer program, the layout analysis step may include: a step of dividing a document into regions; a step of classifying region types by determining the characteristics of each region; and a step of analyzing the relationships between the classified regions.

[0019] The step of generating second text information from the image area in the above computer program may include: a step of extracting feature points of the image by inputting the image into an identification artificial intelligence model; and a step of inputting the feature points into an image analysis model.

[0020] The step of generating the third text information in the above computer program may include: determining the matrix coordinates of the table by inputting the table into an identification artificial intelligence model; analyzing the relationships between data elements; and converting the analyzed structure and relationships into a structured format.

[0021] In the above computer program, the integration step may include: a step of arranging the processing results of each region according to contextual order; a step of establishing reference relationships between regions; and a step of integrating the arranged data and reference relationships into a single document.

[0022] In a storage medium storing at least one instruction according to an embodiment of the present invention, when the at least one instruction is executed by a processor, the processor is configured to perform the following steps: inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document and identifying a text area containing text, an image area containing an image, and a table area containing a table; extracting first text information from the text area; generating second text information describing the image from the image area; recognizing matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; and integrating the first text information, the second text information, and the third text information based on the layout of the electronic document and providing them to a large language model; wherein the integration step may be performed while maintaining a mapping relationship with the original document.

[0023] According to the present invention, in the process of processing multimodal data, the layout of the document is analyzed to accurately distinguish text, image, and table areas, and by applying a processing method suitable for each area, all information of the document can be effectively utilized.

[0024] According to the present invention, during the process of converting image and table data into text, natural explanatory text can be generated while preserving the structure and meaning of the original data, thereby enabling a large language model to accurately understand and process the entire content of the document.

[0025] According to the present invention, by storing data in a structured form while maintaining a mapping relationship with the original document during the process of integrating different types of data, it is possible to provide more accurate and reliable information in a RAG system.

[0026] FIG. 1 is an overall configuration diagram of a multimodal data processing system according to one embodiment of the present invention.

[0027] FIG. 2 is a document layout analysis configuration diagram according to one embodiment of the present invention.

[0028] FIG. 3 is a flowchart illustrating the process of generating text from an image area according to an embodiment of the present invention.

[0029] FIG. 4 is a flowchart illustrating a table processing process according to one embodiment of the present invention.

[0030] FIG. 5 is a block diagram illustrating a data integration process according to an embodiment of the present invention.

[0031] FIG. 6 is a block diagram illustrating the process of generating LLM input data according to one embodiment of the present invention.

[0032] FIG. 7 is a block diagram showing the details of an integrated document generation process according to one embodiment of the present invention.

[0033] The embodiments described in this document and the configurations illustrated in the drawings are merely preferred examples of the disclosed invention, and various modifications that may replace the embodiments and drawings of this specification may exist at the time of filing this application.

[0034] The terms used in this document are for describing the embodiments and are not intended to limit or restrict the disclosed invention.

[0035] For example, in this specification, singular expressions may include plural expressions unless the context clearly indicates otherwise.

[0036] In this document, each of the phrases such as "A or B", "at least one of A and B", "at least one of A or B", "A, B or C", "at least one of A, B and C", and "at least one of A, B, or C" may include any one of the items listed together in the corresponding phrase, or all possible combinations thereof.

[0037] The term "and / or" includes a combination of multiple related described components or any of the multiple related described components. For example, "A and / or B" may include only "A," only "B," or both "A and B."

[0038] Additionally, terms such as “include” or “have” are intended to express the existence of the features, numbers, steps, actions, components, parts, or combinations thereof described in the specification, and do not exclude the additional existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0039] When it is said that one component is “connected,” “combined,” “supported,” or “in contact” with another component, this includes not only cases where the components are directly connected, combined, supported, or in contact, but also cases where they are indirectly connected, combined, supported, or in contact through a third component.

[0040] When it is said that a component is located “on” another component, this includes not only cases where one component is in contact with the other, but also cases where another component exists between the two components.

[0041] Meanwhile, terms such as “front,” “rear,” “left,” “right,” “top,” and “bottom” used in the following description are defined based on the drawings; however, the shape and position of each component are not limited by these terms. For example, the front side may be defined as the +X side and the rear side as the -X side. For example, based on the drawings, the right side may be defined as the +Y side and the left side as the -Y side. For example, based on the drawings, the top side may be defined as the +Z side and the bottom side as the -Z side.

[0042] In addition, terms including ordinal numbers, such as "first," "second," etc., are used to distinguish one component from another and do not limit the components.

[0043] In addition, terms such as "~part," "~unit," "~block," "~member," and "~module" may refer to a unit that processes at least one function or operation. For example, the terms may refer to at least one piece of hardware such as an FPGA (field-programmable gate array) or ASIC (application-specific integrated circuit), at least one piece of software stored in memory, or at least one process executed by a processor.

[0044] An embodiment of the disclosed invention is described in detail below with reference to the attached drawings. Identical reference numbers or symbols in the attached drawings may indicate parts or components that perform substantially the same function.

[0045] The operating principle and embodiments of the present invention will be described below with reference to the attached drawings.

[0046] FIG. 1 is an overall configuration diagram of a multimodal data processing system according to an embodiment of the present invention. FIG. 1 may show the entire processing process from the input of an electronic document to its provision to a large language model.

[0047] Referring to FIG. 1, when an electronic document is input into an artificial intelligence model (100), the control unit of the computing device can analyze the layout (110). For example, a product manual in PDF format, a corporate annual report in Word format, a product specification in PowerPoint format, or a scanned research paper may be input.

[0048] In the layout analysis process, the electronic document can be input into an artificial intelligence model, which analyzes the layout of the electronic document and identifies text areas containing text, image areas containing images, and table areas containing tables.

[0049] The layout analysis step may include a step of dividing the document into areas, a step of classifying area types by determining the characteristics of each area, and a step of analyzing the relationships between the classified areas.

[0050] For example, in the step of dividing a document into areas, text areas, image areas, and table areas within the document can be identified to divide the document's areas.

[0051] For example, if a product manual is entered, it can be analyzed by dividing it into a text area containing a detailed description of the product, an image area showing the product's appearance or configuration, and a table area containing detailed specifications.

[0052] The step of analyzing relationships between classified areas allows for the analysis of the relationships between each area. For example, if a product manual is entered, relationships can be analyzed between the text area containing detailed product descriptions, the image area showing the product's appearance or configuration, and the table area containing detailed specifications.
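The outcome of the layout analysis described above can be pictured as a list of typed regions with links between them. The following is a minimal sketch under stated assumptions: the names `RegionType`, `Region`, `group_by_type`, the bounding-box convention, and the sample coordinates are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from enum import Enum


class RegionType(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"


@dataclass
class Region:
    region_id: str
    region_type: RegionType
    page: int
    # (x0, y0, x1, y1) in page units; this convention is an assumption.
    bbox: tuple
    # Identifiers of other regions this region is related to.
    related_ids: list = field(default_factory=list)


def group_by_type(regions):
    """Route each region to the processing path for its type."""
    routed = {region_type: [] for region_type in RegionType}
    for region in regions:
        routed[region.region_type].append(region)
    return routed


# Hypothetical first page of a product manual: description, photo, spec table.
manual_page = [
    Region("r1", RegionType.TEXT, 1, (50, 40, 550, 200), related_ids=["r2", "r3"]),
    Region("r2", RegionType.IMAGE, 1, (50, 220, 550, 480)),
    Region("r3", RegionType.TABLE, 1, (50, 500, 550, 700)),
]
routed = group_by_type(manual_page)
```

Keeping the relationships on the region record itself is only one possible design; a separate edge list would serve equally well.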

[0053] The control unit of the computing device can perform the step of extracting first text information from a text area.

[0054] More specifically, first text information can be extracted from a text area (111). For example, text from the "product features" section of a product manual, text from the "executive summary" section of an annual report, or text from the "abstract" and "conclusion" sections of a research paper can be extracted.

[0055] The control unit of the computing device can perform the step of generating second text information describing an image from an image area.

[0056] The control unit of the computing device may generate the second text information describing the image area (112) by extracting feature points of the image and inputting the feature points into an image analysis model.

[0057] For example, if a product photo is included, descriptive text such as "The product has a rectangular design in metallic gray..." can be generated. If a circuit diagram is included, a description such as "The central processing unit and memory are connected by a bus..." can be generated, and if a graph is included, a description such as "Sales revenue has shown a continuously rising trend since the first quarter of 2023..." can be generated.

[0058] The control unit of the computing device can perform the step of recognizing the matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each matrix coordinate and at least one upper-level text corresponding to that matrix coordinate are mapped. In this step, the table area is converted into structured third text information (113).

[0059] The step of generating third text information may include the step of determining matrix coordinates of a table by inputting the table into an identification artificial intelligence model, the step of analyzing relationships between data elements, and the step of converting the analyzed structure and relationships into a structured format while preserving them.

[0060] For example, information in a table format such as "Display | 6.7 inches" in a product specification sheet can be converted into natural language text such as "The display size is 6.7 inches."

[0061] When the layout of the document is analyzed, the control unit of the computing device can perform different processing for each area. It can extract first text information from the text area (111), generate second text information from the image area (112), and generate third text information from the table area (113).

[0062] The control unit of the computing device may perform the step of integrating the first text information, the second text information, and the third text information (111, 112, 113) based on the layout of the electronic document and providing them to a large language model.

[0063] More specifically, the text information obtained from each region can be integrated by the control unit of the computing device (120) while the characteristics and context of the text for each region are preserved. For example, the integration can be performed while maintaining the layout structure of the original document and the relationships between the data.

[0064] For example, the product manual can be integrated in the following form: "This product features a rectangular design in metallic gray. Key specifications include a 6.7-inch OLED display, the latest CPU, and 8GB of RAM."

[0065] When the control unit of the computing device integrates the text information, it may convert text data of different formats into a single consistent format. Additionally, it may optimize the representation of the data so that a computer program can process it effectively.

[0066] The integration step may include a step of arranging the processing results of each area in contextual order, a step of establishing reference relationships between areas, and a step of integrating the arranged data and reference relationships into a single document. The integration step may be performed while maintaining the mapping relationship with the original document.
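The three sub-steps of the integration step can be sketched as follows. This is an illustration only: `ProcessedRegion`, `integrate`, the `top` field used as a stand-in for contextual order, and the sample data are assumptions, not names or values from the specification.

```python
from dataclasses import dataclass


@dataclass
class ProcessedRegion:
    region_id: str
    kind: str    # "text", "image", or "table"
    page: int
    top: float   # vertical position on the page; smaller means higher (assumption)
    text: str


def integrate(regions, references):
    """Order regions, attach reference links, and keep a pointer to the source."""
    # 1. Arrange results in reading order (page, then vertical position).
    ordered = sorted(regions, key=lambda r: (r.page, r.top))
    # 2. Keep a mapping back to the original document for every block.
    blocks = [
        {"text": r.text, "kind": r.kind,
         "source": {"region_id": r.region_id, "page": r.page}}
        for r in ordered
    ]
    # 3. Record reference relationships between regions.
    links = [{"from": src, "to": dst} for src, dst in references]
    return {"blocks": blocks, "references": links}


regions = [
    ProcessedRegion("tbl-1", "table", 1, 500.0, "Display > Size: 6.7 inches"),
    ProcessedRegion("txt-1", "text", 1, 40.0, "Check the specifications listed in Table 1."),
    ProcessedRegion("img-1", "image", 1, 220.0,
                    "The product has a rectangular design in metallic gray."),
]
document = integrate(regions, [("txt-1", "tbl-1")])
```

Sorting by page and vertical position is a simplification; multi-column layouts would need a more careful reading-order rule.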

[0067] The integrated text data can be input into a large language model (130). Through this process, the various forms of data in a document are integrated and converted into a form that the large language model can process. Each step can be executed sequentially, or some steps can be processed in parallel as needed.

[0068] In other words, the integrated text information can be converted into a structured format such as JSON and input into a large language model (130). At this time, the document type, description content, specification information, description of visual elements, etc., can be provided in a systematically structured manner. Through this, the large language model can understand and process the entire content of the document more accurately.
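As an illustration of the structured JSON input mentioned above, the sketch below serializes the four kinds of content the paragraph lists. The key names (`document_type`, `description`, `specifications`, `visual_elements`) and the function `build_llm_payload` are assumptions chosen for this example; the specification does not define a schema.

```python
import json


def build_llm_payload(document_type, description, specifications, visual_descriptions):
    """Serialize the integrated content as a JSON string for an LLM prompt."""
    payload = {
        "document_type": document_type,
        "description": description,
        "specifications": specifications,
        "visual_elements": visual_descriptions,
    }
    # ensure_ascii=False keeps non-ASCII document text readable in the prompt.
    return json.dumps(payload, ensure_ascii=False, indent=2)


payload_json = build_llm_payload(
    "product_manual",
    "This product is a premium smartphone equipped with the latest AI technology.",
    {
        "Display": {"Size": "6.7 inches", "Resolution": "3200x1800"},
        "Performance": {"CPU": "Latest processor", "RAM": "8GB"},
    },
    ["The product has a rectangular design in metallic gray."],
)
```

Nesting the specifications under their categories keeps the table hierarchy visible to the model rather than flattening it away.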

[0069] In particular, this processing enables the effective utilization of all forms of data contained in documents and helps large language models understand and process the content of documents more accurately.

[0070] FIG. 2 is a diagram of a document layout analysis configuration according to an embodiment of the present invention. FIG. 2 can show in detail the process of deriving the analysis and classification results of original data in the layout analysis step (110).

[0071] Referring to FIG. 2, the layout analysis may include original data (150) and a classification result (160). The original data (150) may include a text area (151), an image area (152), and a table data area (153), and each area may include data with different characteristics.

[0072] The text area (151) may refer to a part written in the form of a general sentence or paragraph within a document. For example, in a smartphone manual, product introduction phrases such as "This product is a premium smartphone equipped with the latest AI technology, maximizing the user experience" or usage instructions such as "Press the power button for 3 seconds to turn on the power" may be included.

[0073] The image area (152) may refer to a portion containing visual information included in the document. For example, this may include a product photo showing the front and back of a smartphone, an explanatory image showing the battery replacement method in order, or a screenshot capturing the main screen of a user interface.

[0074] The table data area (153) refers to a portion containing information in a table format consisting of rows and columns. For example, a product specification table such as Table 1 below may correspond to this.

[0075] Table 1

Category | Specifications
Display | Size: 6.7 inches
 | Resolution: 3200x1800
Performance | CPU: Latest processor
 | RAM: 8GB

[0076] The classification result (160) may include three types of information derived from the analysis of the original data. First, the area type classification (161) indicates the result of classifying what type of data each area within the document contains. For example, by analyzing the first page of the manual, the product name at the top can be classified as title text, the product image in the center as the main image, and the specification table at the bottom as table data. The analysis of relationships between areas (162) shows the result of analyzing how different areas are related. For example, the reference relationship between the text "Press the power button as shown in Figure 1" and the image showing the actual location of the power button, or the relationship between the text "Check the parts listed in Table 1" and the corresponding parts list table can be identified.
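One simple way to surface reference relationships such as "as shown in Figure 1" or "listed in Table 1" is pattern matching on the extracted text. The sketch below assumes English captions of the form "Figure N" and "Table N"; the pattern and the function `find_references` are illustrative and are not the analysis method claimed in the specification, which relies on an artificial intelligence model.

```python
import re

# Matches explicit mentions such as "Figure 1" or "Table 12".
REFERENCE_PATTERN = re.compile(r"\b(Figure|Table)\s+(\d+)\b")


def find_references(text):
    """Return (kind, number) pairs for every figure or table mention in text."""
    return [(kind.lower(), int(number))
            for kind, number in REFERENCE_PATTERN.findall(text)]


figure_refs = find_references("Press the power button as shown in Figure 1")
table_refs = find_references("Check the parts listed in Table 1")
```

Each returned pair could then be resolved against the identified image and table regions to create a link between the referring text and its target.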

[0077] Table structure information (163) represents the results of analyzing the structural characteristics of the table. For example, the following structural characteristics can be identified from the product specification table of Table 1 described above:

[0078] The 'Display' item is merged across two rows, and the 'Size' and 'Resolution' items below it depend on it.

[0079] The 'Performance' item is likewise merged across two rows, and the 'CPU' and 'RAM' items are organized as sub-items beneath it.

[0080] The values of each item are aligned in the right column.
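The structural characteristics listed above suggest how each value cell can be tied to its upper-level texts. The sketch below assumes Table 1 has been read as three columns (category, item, value) and that a merged category cell appears as `None` on the rows it spans; both the representation and the function names are hypothetical.

```python
def fill_merged_cells(rows):
    """Propagate a merged first-column value down to the rows it spans."""
    filled, current = [], None
    for category, item, value in rows:
        if category is not None:
            current = category
        filled.append((current, item, value))
    return filled


def map_coordinates(rows, value_column=2):
    """Map each value cell's (row, column) coordinate to its upper-level texts."""
    mapping = {}
    for row_index, (category, item, value) in enumerate(fill_merged_cells(rows)):
        mapping[(row_index, value_column)] = {"text": value, "upper": [category, item]}
    return mapping


# Table 1, with None marking the continuation of a merged category cell.
table_1 = [
    ("Display", "Size", "6.7 inches"),
    (None, "Resolution", "3200x1800"),
    ("Performance", "CPU", "Latest processor"),
    (None, "RAM", "8GB"),
]
coordinate_map = map_coordinates(table_1)
```

The coordinate key makes it possible to trace any generated text back to the exact cell it came from.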

[0081] This layout analysis allows for the accurate identification of the document's structural characteristics, which enables optimized processing tailored to the specific properties of each area during subsequent stages. In particular, the results of this analysis play a key role in understanding the document's logical structure and preserving the relationships between data.

[0082] FIG. 3 is a flowchart illustrating a process of generating text from an image area according to an embodiment of the present invention. FIG. 3 can show a detailed process of generating descriptive text by analyzing an image.

[0083] Referring to FIG. 3, image processing includes sequential steps of image (112a), visual feature extraction (112b), feature extraction (112c), and text generation (112d).

[0084] First, in the image (112a) step, an image separated from the document is input. For example, in the case of a product manual, a photo of the product's exterior, a diagram showing the names of each part, a flowchart explaining how to use it, etc., may be input. Specific examples include a photo of the front of a smartphone or a diagram showing the location of each button.

[0085] In the visual feature extraction (112b) step, key visual features are extracted from the input image. For example, visual features such as the size and shape of the display, the position and number of cameras, and the arrangement of buttons can be extracted from the product image. Additionally, the color, material, and overall design features of the product can also be extracted in this step.

[0086] In the feature extraction (112c) step, the meaning and relationships of the extracted visual features are analyzed. For example, based on the extracted features, the overall composition of the product can be identified or the function of each part can be inferred. Specifically, the characteristics of the camera system can be analyzed through the arrangement of camera modules or the characteristics of the user interface can be identified through the arrangement of buttons.

[0087] In the text generation (112d) step, the analyzed features are converted into descriptive text in the form of natural language. For example, in the case of a smartphone image, a descriptive text such as “The front of the product is equipped with a 6.7-inch display, and the front camera is located at the top center. The power button and volume control buttons are located on the right side, and the triple camera system is vertically aligned on the back” can be generated.

[0088] Through this image processing, visual information is converted into text form, which can then be integrated and processed together with other text information within the document. In particular, this process preserves the information contained in the images in text form without loss, helping large language models understand the entire content of the document more accurately.
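The staged flow from image input to text generation can be sketched as a chain of three callables. Everything below is a placeholder: `describe_image` and the `fake_*` functions are hypothetical stand-ins that return fixed values, since the specification does not name the models used at steps 112b to 112d.

```python
def describe_image(image, extract_visual_features, analyze_features, generate_text):
    """Run the image through the three processing stages in order."""
    visual_features = extract_visual_features(image)  # corresponds to 112b
    analysis = analyze_features(visual_features)      # corresponds to 112c
    return generate_text(analysis)                    # corresponds to 112d


# Placeholder stages standing in for real models; they ignore the image content.
def fake_extract(image):
    return {"display_inches": 6.7, "rear_cameras": 3}


def fake_analyze(features):
    camera = "triple camera system" if features["rear_cameras"] == 3 else "camera system"
    return {"display": f"{features['display_inches']}-inch display", "camera": camera}


def fake_generate(analysis):
    return f"The product has a {analysis['display']} and a {analysis['camera']}."


description = describe_image(b"", fake_extract, fake_analyze, fake_generate)
```

Passing the stages in as arguments keeps the flow independent of any particular model, so each stage can be swapped or tested separately.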

[0089] FIG. 4 is a flowchart illustrating a table processing process according to an embodiment of the present invention. FIG. 4 may show detailed steps for generating structured text from a table.

[0090] Referring to FIG. 4, table processing may include sequential steps of data area detection (113a), matrix structure analysis (113b), identification of hierarchical relationships (113c), and structured format conversion (113d).

[0091] The control unit of the computing device can first detect an area configured in the form of a table within the document (113a). For example, it can automatically identify a three-tiered table consisting of divisions, items, and specifications in a product specification, or a two-tiered table listing parts. At this time, the boundaries of the table can be accurately identified by analyzing the arrangement of lines or cells that make up the table.

[0092] When a table area is detected, the control unit of the computing device can analyze the matrix structure of the table (113b). At this stage, structural characteristics such as the number of rows and columns of the table, whether cells are merged, and indentation can be identified. For example, in a product specification sheet, a structure can be analyzed in which the 'display' and 'performance' items are each merged into two rows, and the detailed items below them are separated by indentation.

[0093] Once the matrix structure is identified, the control unit of the computing device can identify the hierarchical relationships between the data within the table (113c). For example, it can identify that 'size' and 'resolution' are organized as sub-items under the main category of 'display'. It can also identify the relationship between the specification values corresponding to each sub-item.

[0094] Finally, the control unit of the computing device can convert the contents of the analyzed table into a structured format (113d). This may be a process of converting the table into a form that is easy for a large language model to process, while preserving the structural characteristics of the table and the relationships between the data. For example, it can be converted into a natural sentence form such as, "The display provides a 6.7-inch screen and supports a high resolution of 3200x1800. In terms of performance, it is equipped with the latest processor and 8GB RAM to ensure smooth operation."
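The conversion of step 113d can be sketched with a simple template, assuming the table has already been reduced to (category, item, value) rows in table order. The template wording and the function name `to_structured_text` are hypothetical. An actual embodiment may instead produce more natural sentences, for example by using a language model.

```python
from itertools import groupby


def to_structured_text(rows):
    """Render (category, item, value) rows as one sentence per category.

    groupby only merges adjacent rows, so rows that share a category
    are assumed to be contiguous, as they are in a merged-cell table.
    """
    sentences = []
    for category, group in groupby(rows, key=lambda row: row[0]):
        parts = [f"{item} is {value}" for _, item, value in group]
        sentences.append(f"{category}: " + "; ".join(parts) + ".")
    return " ".join(sentences)


# Hypothetical rows taken from a product specification sheet.
rows = [
    ("display", "size", "6.7 inches"),
    ("display", "resolution", "3200x1800"),
    ("performance", "processor", "latest generation"),
    ("performance", "RAM", "8GB"),
]
text = to_structured_text(rows)
print(text)
```

Keeping the category label at the start of each sentence preserves the link between each value and its upper-level text after the table layout itself is gone.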

[0095] Through the table processing described above, structured data can be converted into text that integrates naturally into the context of the entire document while preserving its meaning and relationships. As a result, large language models can understand and process the entire content of the document more accurately.

[0096] FIG. 5 is a block diagram illustrating a data integration process according to an embodiment of the present invention. FIG. 5 can show how text information existing in different forms within a document is integrated into one.

[0097] Referring to FIG. 5, the data integration process may include the steps of receiving three types of text input (121, 122, 123), analyzing them (124), and structuring them (125).

[0098] The control unit of the computing device can first receive three types of text input. General text (121) may refer to parts written as basic descriptions or body text in a document. For example, sentences such as "The newly released product is a premium product that maximizes the user experience and actively utilizes AI technology" in a smartphone manual may fall into this category.

[0099] Text (122) extracted from an image may refer to a descriptive text generated by analyzing an image within a document. For example, by analyzing a product photo, text describing visual elements may be generated, such as, "The front of the metallic gray body is equipped with a bezel-less display, and a triple camera is vertically arranged on the rear."

[0100] Structured text (123) based on a table or chart may be a natural language conversion of information from the table or chart. For example, product specifications may be converted into a form such as “This product is equipped with a 6.7-inch AMOLED display and provides excellent performance through the latest processor and 8GB RAM.”

[0101] Texts received in these different forms can then undergo a structure/relationship analysis (124) step, which determines how the information contained in each text is related. For example, it is possible to comprehensively analyze how the AI technology mentioned in the general text relates to a specific function of the product, how the design elements described in the image connect to specific features of the product, and in what context the specification information is important.
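The text does not specify how the structure/relationship analysis (124) is carried out. As a deliberately naive stand-in, the sketch below links two inputs whenever they share key terms. The stopword list, the sample sentences, and the function names `key_terms` and `relate` are hypothetical, and an actual embodiment would likely rely on a more capable model.

```python
import re
from itertools import combinations

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "is", "a", "an", "and", "that", "this", "with", "of"}


def key_terms(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def relate(texts):
    """Return the shared key terms for every pair of inputs that overlap."""
    terms = {name: key_terms(body) for name, body in texts.items()}
    links = {}
    for left, right in combinations(terms, 2):
        shared = terms[left] & terms[right]
        if shared:
            links[(left, right)] = shared
    return links


texts = {
    "general": "The newly released product is a premium product that actively utilizes AI technology.",
    "image": "The metallic gray body is equipped with a bezel-less display and a triple camera.",
    "table": "This product is equipped with a 6.7-inch AMOLED display and 8GB RAM.",
}
related = relate(texts)
for pair, shared in related.items():
    print(pair, sorted(shared))
```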

[0102] In the final structuring (125) stage, all information can be naturally integrated based on this analysis.

[0103] For example, the output of the structuring stage (125) may integrate the information described above, such as, “The premium smartphone introduced this time has elevated the user experience to the next level with AI technology at its core. The body, in a sophisticated metallic gray color, is equipped with a 6.7-inch AMOLED display to provide an immersive screen, and the triple camera system on the rear supports various shooting functions. The combination of the latest processor and 8GB RAM ensures excellent performance in all tasks.”

[0104] Through this integration process, all information within the document is reorganized into a single text with a natural flow, which greatly helps large language models understand and process the document's content more accurately.

[0105] FIG. 6 is a block diagram illustrating the process of generating LLM input data according to an embodiment of the present invention. FIG. 6 can show how different forms of text information are processed and finally input into a large language model.

[0106] Referring to FIG. 6, the LLM input data generation process may involve receiving three different types of text and generating a single integrated document from them. The control unit of the computing device may receive general text (131), text extracted from an image (132), and structured text (133) from a table or chart, respectively.

[0107] General text (131) may be a part of the document written in the form of basic descriptions or sentences. For example, in the case of a smartphone product manual, marketing phrases or product descriptions such as “The newly released product achieves a perfect harmony of innovative AI technology and sophisticated design. The interface, newly designed to maximize user convenience, provides an intuitive and efficient user experience.” may be included here.

[0108] Text (132) extracted from an image may refer to a descriptive text generated by analyzing visual information. For example, text describing the visual features of the image in detail may be generated in the form of: “The front of the product features a 6.7-inch display with a bezel-less design, and the metal frame on the side emphasizes a premium feel. On the back, a stylish camera module is naturally integrated into the design, and a biometric sensor is located at the bottom.”

[0109] Structured text (133) based on a table or chart may be a conversion of structured data into natural sentences. For example, information from a product specification sheet may be converted into a natural descriptive text in the form of: "This product is equipped with the latest 5nm processor and 12GB LPDDR5 RAM to provide excellent performance. The 3200x1800 resolution AMOLED display boasts vivid colors and sharp image quality, and can be used for a long time with a 5000mAh large capacity battery."

[0110] These three texts can be combined into a single completed document during the workbook creation (134) stage. In this process, the context and relevance of each piece of information are naturally connected, and redundant content can be appropriately organized. The finally created workbook is input into a large language model (135), so that the model can accurately understand and process the entire content of the document.
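The combination of the three texts can be sketched as follows, assuming each text carries a layout-order index assigned during layout analysis. This sketch only orders the segments and drops sentences that repeat verbatim (ignoring case and punctuation). The organization of redundant content described above may be more sophisticated. The function names and sample segments are hypothetical.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_integrated_document(segments):
    """Join (layout_order, source, text) segments, skipping repeated sentences."""
    seen = set()
    kept = []
    for _order, _source, text in sorted(segments, key=lambda seg: seg[0]):
        for sentence in split_sentences(text):
            key = re.sub(r"\W+", " ", sentence.lower()).strip()
            if key in seen:
                continue  # redundant sentence already contributed by another source
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


# Hypothetical segments, deliberately given out of layout order.
segments = [
    (2, "table", "The display measures 6.7 inches. The battery capacity is 5000mAh."),
    (0, "text", "The product combines AI technology and sophisticated design."),
    (1, "image", "The front features a bezel-less design. The display measures 6.7 inches."),
]
document = build_integrated_document(segments)
print(document)
```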

[0111] Through this processing, information that initially existed in different forms is reorganized into a document with a single consistent context, which enables large language models to understand and utilize the content of the document more accurately.

[0112] FIG. 7 is a block diagram showing details of an integrated document generation process according to an embodiment of the present invention. FIG. 7 can show how information existing in different forms is completed into a single integrated document.

[0113] Referring to FIG. 7, the process of creating a workbook may include three stages: an input document (170), area-specific processing (180), and a workbook (190). The control unit of the computing device may process and analyze the contents of the document at each stage to finally create a single integrated document.

[0114] First, in the input document (170) step, various forms of information may be received, such as a document (171) about the 00 Phone, product images (172), and product specifications (173). For example, in the case of a smartphone product introduction document:

[0115] The 00 Phone document may include basic descriptions such as, "This product is the latest smartphone that embodies a harmony of innovative technology and sophisticated design."

[0116] Product images may include photos of the front and back of the product or diagrams explaining key features.

[0117] The product specifications may organize detailed information in a table format, such as a 6.7-inch screen size, a 3200x1800 resolution, and the latest processor.

[0118] This input information can be processed according to the characteristics of each type in the area-specific processing (180) step:

[0119] In the text extraction (181) process, the core content of the document is identified and the main explanation is extracted.

[0120] In the image description (182) process, the image can be analyzed to generate a description such as "a bezel-less display is applied to the front of the product, and a stylish camera module is placed on the back."

[0121] In the table structuring (183) process, information from the product specification sheet can be analyzed and converted into a structured description such as "a high-performance smartphone supporting a 6.7-inch display and 3200x1800 resolution."
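The area-specific processing (180) described above can be sketched as a dispatch from area type to a handler. The handlers below are placeholders: in particular, the image handler only marks where a captioning model would be called, since no specific model is named in the text. The region dictionary layout, the handler bodies, and the function name `process_regions` are hypothetical.

```python
def process_regions(regions, handlers):
    """Apply the handler registered for each region's type, in layout order."""
    results = []
    for region in regions:
        kind = region["type"]
        if kind not in handlers:
            raise ValueError(f"no handler for region type {kind!r}")
        results.append({"type": kind, "text": handlers[kind](region["content"])})
    return results


# Placeholder handlers; real ones would wrap extraction and analysis models.
handlers = {
    "text": lambda content: " ".join(content.split()),
    "image": lambda content: f"Image '{content}' (description to be generated by a captioning model).",
    "table": lambda content: "; ".join(f"{key}: {value}" for key, value in content.items()),
}

regions = [
    {"type": "text", "content": "  This product is\n the latest smartphone. "},
    {"type": "image", "content": "front_view.png"},
    {"type": "table", "content": {"screen size": "6.7 inches", "processor": "latest"}},
]
results = process_regions(regions, handlers)
for result in results:
    print(result)
```

Keeping the handlers in a mapping means a further area type could be supported by registering one more entry, without changing the loop.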

[0122] Finally, in the workbook (190, 191) step, all information processed in the previous step can be naturally integrated into one. For example:

[0123] This smartphone, embodying a harmony of the latest technology and sophisticated design, elevates the user experience to the next level. Equipped with a 6.7-inch bezel-less display on the front, it provides an immersive viewing experience, while the combination of a high resolution of 3200x1800 and the latest processor delivers optimal performance in any situation. The stylish rear camera module enhances design perfection while offering excellent shooting capabilities.

[0124] Through this integration process, information of different formats is completed into a document with a single natural context, which enables large language models to accurately understand and process the entire content.

[0125] The document created through the workbook generation process of FIG. 7 can be converted into a more natural form while maintaining the structure and order of the input document. When generating such a workbook, the control unit of the computing device can be configured to track which part of the original input document is connected to which part of the workbook.

[0126] For example, in the case of a smartphone product manual, the product introduction section of the "00 Phone document" can be mapped to the introduction section of the integrated document, the description extracted from the product image can be mapped to the section describing the product's appearance and design, and the information from the product specification sheet can be mapped to the section describing the detailed specifications.

[0127] Information mapped in this way can be converted into a form that is easy for computers to process and then stored. For example, the sentence "This product is a premium smartphone equipped with the latest AI technology" can be structured to clearly identify the context in which it was used and which images or specifications it is connected to. This enables the tracking of the source and context of each piece of information, allowing for the management of information in an integrated form while preserving the structural characteristics of the original data.
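One possible way to keep such a mapping is to record, for each segment of the integrated document, its character span and the original area it came from, and to serialize the records as JSON. The sketch below is only an illustration under that assumption. The `SourceRef` and `MappedSegment` fields, the identifiers such as `t1`, and the function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceRef:
    region_type: str  # "text", "image", or "table"
    region_id: str    # hypothetical identifier of the original area
    page: int


@dataclass
class MappedSegment:
    text: str
    start: int        # character offset in the integrated document
    end: int
    source: SourceRef


def build_with_mapping(parts):
    """Join (text, SourceRef) parts with single spaces and record each span."""
    segments, pieces, offset = [], [], 0
    for text, source in parts:
        if pieces:
            offset += 1  # account for the separating space
        segments.append(MappedSegment(text, offset, offset + len(text), source))
        pieces.append(text)
        offset += len(text)
    return " ".join(pieces), segments


def source_at(segments, position):
    """Return the original area that produced the character at `position`."""
    for segment in segments:
        if segment.start <= position < segment.end:
            return segment.source
    return None


parts = [
    ("This product is a premium smartphone equipped with the latest AI technology.",
     SourceRef("text", "t1", 1)),
    ("A bezel-less display is applied to the front of the product.",
     SourceRef("image", "i1", 1)),
    ("The screen size is 6.7 inches.", SourceRef("table", "tb1", 2)),
]
document, segments = build_with_mapping(parts)
records = json.dumps([asdict(segment) for segment in segments])
print(source_at(segments, document.index("bezel")))
```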

[0128] Data processed in this manner can be utilized as a core component of a retrieval-augmented generation (RAG) system. Based on this structured data, the RAG system can provide more accurate and contextually relevant answers to user questions.

[0129] When a user asks a question about a specific function or characteristic of a product, the system can generate a high-quality answer by comprehensively utilizing all relevant information, and if necessary, can also present the source of the original information.

[0130] This helps increase the reliability of information and provide users with richer context. As a result, the RAG system can utilize this structured data to provide more accurate and reliable information, which can significantly improve the performance of the entire system and the user experience.
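The way source information can accompany a retrieved passage may be sketched as follows. Retrieval in RAG systems is commonly based on vector embeddings, whereas this sketch swaps in a plain word-overlap score so that it stays self-contained. The chunk contents, the `source` labels, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=1):
    """Rank chunks by the number of words shared with the question."""
    question_terms = tokenize(question)
    scored = []
    for chunk in chunks:
        overlap = len(question_terms & tokenize(chunk["text"]))
        if overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk for _, chunk in scored[:top_k]]


# Hypothetical chunks of an integrated document, each with its origin.
chunks = [
    {"text": "The rear houses a vertically arranged triple camera.",
     "source": "image region, page 1"},
    {"text": "The display measures 6.7 inches with an AMOLED panel.",
     "source": "table region, page 2"},
    {"text": "The product uses AI technology to improve the user experience.",
     "source": "text region, page 1"},
]
hits = retrieve("How large is the display?", chunks)
for hit in hits:
    print(hit["text"], "| source:", hit["source"])
```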

[0131] Meanwhile, the disclosed embodiments may be implemented in the form of a recording medium that stores instructions executable by a computer. The instructions may be stored in the form of program code and, when executed by a processor, may generate a program module to perform the operation of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium.

[0132] Computer-readable recording media include all types of recording media in which instructions that can be decoded by a computer are stored. Examples include ROM (read only memory), RAM (random access memory), magnetic tape, magnetic disk, flash memory, optical data storage device, etc.

[0133] Additionally, computer-readable recording media may be provided in the form of non-transitory storage media. Here, 'non-transitory storage media' simply means that it is a tangible device and does not contain a signal (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily. For example, 'non-transitory storage media' may include a buffer in which data is stored temporarily.

[0134] According to one embodiment, the method according to the various embodiments disclosed herein may be provided as included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable recording medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be temporarily stored or temporarily created on a device-readable recording medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0135] As described above, the disclosed embodiments have been explained with reference to the attached drawings. Those skilled in the art will understand that the present invention may be practiced in forms different from the disclosed embodiments without changing the technical spirit or essential features of the invention. The disclosed embodiments are illustrative and should not be interpreted restrictively.

Claims

1. A method for processing multimodal data within a document, the method comprising: inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document; identifying, based on the analyzed layout, a text area containing text, an image area containing an image, and a table area containing a table within the electronic document; extracting first text information from the text area; generating, from the image area, second text information describing the image; recognizing the matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; integrating the first text information, the second text information, and the third text information based on the layout of the electronic document; and providing the integrated information to a large language model, wherein the integration step is performed while maintaining a mapping relationship with the original document.

2. The method of claim 1, wherein the layout analysis step comprises: dividing the document into areas; classifying area types by determining the characteristics of each area; and analyzing the relationships between the classified areas.

3. The method of claim 1, wherein the step of generating the second text information comprises: extracting feature points of the image by inputting the image into an identification artificial intelligence model; and inputting the feature points into an image analysis model.

4. The method of claim 1, wherein the step of generating the third text information comprises: determining the matrix coordinates of the table by inputting the table into an identification artificial intelligence model; analyzing relationships between data elements; and converting into a structured format while preserving the analyzed structure and relationships.

5. The method of claim 1, wherein the integration step comprises: arranging the processing results of each area in contextual order; establishing reference relationships between areas; and integrating the arranged data and the reference relationships into a single document.

6. A computer program stored on a computer-readable storage medium, wherein the computer program, when executed on one or more processors of a computing device, performs steps for processing multimodal data within a document, the steps comprising: inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document; identifying, based on the analyzed layout, a text area containing text, an image area containing an image, and a table area containing a table within the electronic document; extracting first text information from the text area; generating, from the image area, second text information describing the image; recognizing the matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; integrating the first text information, the second text information, and the third text information based on the layout of the electronic document; and providing the integrated information to a large language model, wherein the integration step is performed while maintaining a mapping relationship with the original document.

7. The computer program of claim 6, wherein the layout analysis step comprises: dividing the document into areas; classifying area types by determining the characteristics of each area; and analyzing the relationships between the classified areas.

8. The computer program of claim 6, wherein the step of generating the second text information comprises: extracting feature points of the image by inputting the image into an identification artificial intelligence model; and inputting the feature points into an image analysis model.

9. The computer program of claim 6, wherein the step of generating the third text information comprises: determining the matrix coordinates of the table by inputting the table into an identification artificial intelligence model; analyzing relationships between data elements; and converting into a structured format while preserving the analyzed structure and relationships.

10. The computer program of claim 6, wherein the integration step comprises: arranging the processing results of each area in contextual order; establishing reference relationships between areas; and integrating the arranged data and the reference relationships into a single document.

11. A storage medium storing at least one instruction, wherein the at least one instruction, when executed by a processor, causes the processor to perform: inputting an electronic document into an artificial intelligence model to analyze the layout of the electronic document; identifying, based on the analyzed layout, a text area containing text, an image area containing an image, and a table area containing a table within the electronic document; extracting first text information from the text area; generating, from the image area, second text information describing the image; recognizing the matrix coordinates of the table in the table area and generating third text information in which a sub-text corresponding to each of the matrix coordinates and at least one upper-level text corresponding to the matrix coordinates are mapped; integrating the first text information, the second text information, and the third text information based on the layout of the electronic document; and providing the integrated information to a large language model, wherein the integration step is performed while maintaining a mapping relationship with the original document.