Data processing method and apparatus therefor

By employing chunking algorithms to preprocess data into chunk metadata and embedding it in vector databases, the method enhances the accuracy and speed of Retrieval-Augmented Generation systems, addressing the limitations of conventional RAG technologies.

WO2026135243A1 · PCT designated stage · Publication Date: 2026-06-25 · POSCO HLDG INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: POSCO HLDG INC
Filing Date: 2025-12-17
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

The performance of Retrieval-Augmented Generation (RAG) technologies is hindered by the accuracy and speed of data retrieval, as conventional methods rely heavily on morphological analysis of queries, leading to inaccurate and misleading responses due to irrelevant information retrieval.

Method used

A data processing method involving chunking algorithms to divide input data into first and second chunk data, generating metadata, and embedding these in vector databases, allowing for multi-dimensional search strategies to enhance accuracy and speed.

Benefits of technology

Improves the accuracy and speed of search results by preprocessing data into chunked metadata and storing it in vector databases, enabling faster and more precise responses to user queries.

✦ Generated by Eureka AI based on patent content.

Figure KR2025022021_25062026_PF_FP_ABST

Abstract

Provided are a data processing method and apparatus for processing and storing data to be retrieved. The method comprises the steps of: when input data is received, dividing the input data into one or more pieces of first chunk data by using a first chunking algorithm; generating chunk metadata for each of the one or more pieces of first chunk data; dividing target metadata in the chunk metadata into one or more pieces of second chunk data by using a second chunking algorithm; and combining and embedding the second chunk data and the chunk metadata and storing the result in a vector database.

Description

Data processing method and apparatus

[0001] The present disclosure relates to a technology for processing and storing data to be searched.

[0002] With the rapid advancement of artificial intelligence technology, its integration into various industrial fields is taking place at a rapid pace.

[0003] Technologies are being developed that store large volumes of data in databases and combine them with natural language processing techniques to go beyond simply searching for documents or materials, providing answers tailored to requests.

[0004] For example, Retrieval-Augmented Generation (RAG) is a Natural Language Processing (NLP) technology that combines large language models (LLMs) with information retrieval techniques. In particular, RAG is used to integrate the generation and retrieval of information efficiently. While an LLM generates an answer to a user's query, RAG retrieves relevant information from external databases or document sets, thereby enhancing the accuracy and reliability of the response. RAG aims to address issues of information scarcity, lack of reliability, or lack of timeliness by combining the generative capabilities of LLMs with the accuracy of information retrieval systems.

[0005] However, the performance of RAG depends heavily on the quality of the retrieved information. If inaccurate or irrelevant information is returned during the search phase, the responses during the generation phase may also be inaccurate or misleading. Furthermore, since RAG must perform search and generation in real-time for queries, response speed is a critical issue.

[0006] Therefore, various studies are being conducted in RAG technology regarding how quickly and accurately searches can be performed.

[0007] The present disclosure aims to provide a technology for processing and storing data to be searched.

[0008] In one aspect, the present embodiments provide a method for processing data, comprising the steps of: when input data is received, dividing the input data into one or more first chunk data using a first chunking algorithm; generating chunk metadata for each of the one or more first chunk data; dividing target metadata among the chunk metadata into one or more second chunk data using a second chunking algorithm; and combining and embedding the second chunk data and chunk metadata and storing them in a vector database.

[0009] In another aspect, the present embodiments provide a data processing device comprising: a first chunking unit that divides input data into one or more first chunk data using a first chunking algorithm when input data is received; a metadata generation unit that generates chunk metadata for each of the one or more first chunk data; a second chunking unit that divides target metadata among the chunk metadata into one or more second chunk data using a second chunking algorithm; and an embedding unit that combines and embeds the second chunk data and chunk metadata and stores them in a vector database.

[0010] According to the present disclosure, it is possible to support the derivation of fast and accurate search results through data processing.

[0011] FIG. 1 is a diagram illustrating a data processing method according to one embodiment.

[0012] FIG. 2 is a diagram illustrating the operation of generating first chunk data according to one embodiment.

[0013] FIG. 3 is a diagram illustrating the operation of generating chunk metadata according to one embodiment.

[0014] FIG. 4 is a diagram illustrating chunk metadata according to one embodiment in an exemplary manner.

[0015] FIG. 5 is a diagram illustrating the operation of generating second chunk data according to one embodiment.

[0016] FIG. 6 is a diagram illustrating the operation of combining chunk metadata with second chunk data and storing it in a vector database according to one embodiment.

[0017] FIG. 7 is a diagram illustrating the operation of combining second chunk data, chunk metadata, and metadata of input data according to another embodiment and storing them in a vector database.

[0018] FIG. 8 is a diagram illustrating the operation of processing an input query according to one embodiment.

[0019] FIG. 9 is a diagram illustrating the configuration of a data processing device according to one embodiment.

[0020] Hereinafter, some embodiments of the present disclosure will be described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components may have the same reference numeral as much as possible, even if they are shown in different drawings. Furthermore, in describing the embodiments, if it is determined that a detailed description of related known components or functions may obscure the essence of the technical concept, such detailed description may be omitted. Where terms such as "comprising," "having," or "consisting of" are used in this specification, other parts may be added unless "only" is used. Where a component is expressed in the singular, it may include a plural unless otherwise specified.

[0021] Additionally, terms such as first, second, A, B, (a), (b), etc., may be used to describe the components of the present disclosure. These terms are used merely to distinguish the components from other components, and the nature, order, sequence, or number of the components are not limited by such terms.

[0022] In describing the positional relationship of components, where it is stated that two or more components are "connected," "combined," or "joined," it should be understood that while the two or more components may be directly "connected," "combined," or "joined," they may also be "connected," "combined," or "joined" with other components "intervened." Here, the other components may be included in one or more of the two or more components that are "connected," "combined," or "joined" with one another.

[0023] In describing the temporal flow relationship regarding components, methods of operation, or methods of production, for example, when the temporal or sequential relationship is described using "after," "following," "next," or "before," it may include cases where the relationship is not continuous unless "immediately" or "directly" is used.

[0024] Meanwhile, where numerical values or corresponding information regarding a component (e.g., levels, etc.) are mentioned, even without separate explicit notation, the numerical values or corresponding information may be interpreted as including a range of error that may occur due to various factors (e.g., process factors, internal or external shocks, noise, etc.).

[0025] The embodiments are described in detail below with reference to the drawings.

[0027] Recent advancements in artificial intelligence technology are driving innovation across various fields, with text generation AI receiving particular attention. However, issues regarding the accuracy and reliability of this technology remain a significant challenge for both users and developers. One technology that has emerged to address this problem is RAG (Retrieval-Augmented Generation).

[0028] Retrieval-Augmented Generation (RAG) is a method focused on mitigating two shortcomings of LLMs: the "potential for factual errors" and "limitations in contextual understanding." RAG enhances a model's generative capabilities and factual understanding by connecting external knowledge bases to the LLM.

[0029] The main components of RAG are as follows.

[0030] Query Encoder: A language model for understanding user questions. It encodes the given question into a vector form.

[0031] Knowledge Retriever: Searches for relevant information in external knowledge databases based on an encoded question. For example, it finds paragraphs or phrases related to the question within vast collections of documents, such as Wikipedia, news articles, and specialized books.

[0032] Knowledge-Augmented Generator (KAG): A language model that generates answers to questions by utilizing retrieved knowledge. While similar to existing LLMs, it can generate more accurate and richer answers by accepting retrieved knowledge as additional input.

[0033] RAG is gaining popularity across various industries for its ability to quickly obtain accurate answers to natural language questions. In particular, it helps maximize business efficiency by letting a company operate a separate search database that stores its internal documents and also makes use of external data.

[0034] However, conventional technology separates natural language-based questions (input queries) into morpheme units and embeds them to accurately understand the questions. Additionally, numerous studies are being conducted on how accurately the questioner's intent can be grasped.

[0035] However, to improve the performance of RAG, accurately locating data within the search database is a critical issue, in addition to the questions themselves. While there are technologies that perform searches by additionally utilizing tag information generated from question analysis, there are limitations to rapidly finding highly accurate answers in the search database solely through the morphological analysis of these tags and questions.

[0036] In this regard, the present disclosure aims to provide a technology that enables the provision of faster and more accurate search results by chunking information stored in a search database and combining various information.

[0037] FIG. 1 is a diagram illustrating a data processing method according to one embodiment.

[0038] Referring to FIG. 1, a method for processing data may include the step of dividing the input data into one or more first chunk data using a first chunking algorithm when input data is received (S110).

[0039] For example, a data processing method may receive input data. Upon receiving input data, the data processing method may divide the input data using a pre-configured chunking algorithm. Various known algorithms may be used as the chunking algorithm. For example, various algorithms such as Fixed Size Chunking, Content-Aware Chunking, and Recursive Chunking may be used.

[0040] Depending on the chunking operation, the input data may be divided into multiple chunk data. Depending on the chunk size and overlap size applied to the input data, at least one of the first chunk data may include content that overlaps with another first chunk data.

[0041] Accordingly, the first chunking algorithm can divide the input data into one or more first chunk data using chunk size and overlap size parameters.
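
A minimal Python sketch of such a division is shown below. It assumes fixed-size chunking measured in characters; the function name `chunk_with_overlap` and the example values are hypothetical and do not appear in the disclosure.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap_size: int) -> list[str]:
    """Split text into windows of chunk_size characters that share overlap_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    stride = chunk_size - overlap_size  # how far the window advances each step
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


first_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap_size=2)
print(first_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the overlapping content described above.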

[0042] A method for processing data may include the step of generating chunk metadata for each of one or more first chunk data (S120).

[0043] Chunk metadata can be generated for each of the first chunk data using a metadata generator. For example, the metadata generator can generate chunk metadata for each of the first chunk data.

[0044] For example, chunk metadata may be generated to include pre-configured item information based on the content of a first chunk. To this end, the metadata generator may be a device trained to generate metadata using specific information, such as an LLM module. As an example, chunk metadata is generated for each of one or more first chunk data, and may include summary information, keyword information, and title information regarding the content of the first chunk data, divided by item.

[0045] Through this, metadata for each first chunk of data can be generated.
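
The per-item structure of chunk metadata can be sketched as follows. The disclosure describes an LLM-based metadata generator; the sketch below replaces it with simple placeholder heuristics so that it runs without any model, and the names `ChunkMetadata` and `generate_chunk_metadata` are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChunkMetadata:
    summary: str          # summary information item
    keywords: list[str]   # keyword information item
    title: str            # title information item


def generate_chunk_metadata(first_chunk: str, max_keywords: int = 3) -> ChunkMetadata:
    """Placeholder for an LLM-based generator (assumption: heuristics only)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", first_chunk) if s.strip()]
    summary = sentences[0] if sentences else ""          # first sentence as a crude summary
    words = re.findall(r"[a-z]{4,}", first_chunk.lower())
    keywords = [w for w, _ in Counter(words).most_common(max_keywords)]
    title = " ".join(summary.split()[:5])                # first few words as a crude title
    return ChunkMetadata(summary=summary, keywords=keywords, title=title)


meta = generate_chunk_metadata(
    "Vector databases store embeddings. Embeddings enable similarity search."
)
print(meta.summary)
print(meta.keywords)
print(meta.title)
```

In a real system each field would be produced by prompting a language model with the first chunk data; only the shape of the result is illustrated here.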

[0046] The method for processing data may include the step of splitting target metadata among chunk metadata into one or more second chunk data using a second chunking algorithm (S130).

[0047] For example, target metadata can be pre-configured as information regarding specific items within chunk metadata. As one example, target metadata could be summary information among the chunk metadata items. As another example, target metadata could be information combining summary information and title information. In addition to this, target metadata can be configured in various ways depending on the settings or items.

[0048] The method for processing data may divide the target metadata into second chunk data using a second chunking algorithm. As previously mentioned, various known algorithms may also be applied to the second chunking algorithm. For example, the second chunking algorithm may divide the target metadata into multiple second chunk data based on chunk size and overlap size.

[0049] For example, the second chunking algorithm can chunk the target metadata based on a chunk size and an overlap size that are set based on the data length of the target metadata. The chunk size may be set to the average of the distribution of target metadata lengths, and the overlap size can be set based on the chunk size. For example, the overlap size can be set to 2/3 of the chunk size.
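
One way to derive these two parameters is sketched below. It assumes that lengths are measured in characters and that the average is rounded to an integer; both choices, and the function name `derive_second_chunk_params`, are assumptions made for illustration.

```python
from statistics import mean


def derive_second_chunk_params(target_metadata: list[str]) -> tuple[int, int]:
    """Return (chunk_size, overlap_size) from the lengths of the target metadata."""
    if not target_metadata:
        raise ValueError("at least one target metadata item is required")
    lengths = [len(item) for item in target_metadata]
    chunk_size = max(1, round(mean(lengths)))   # average length, at least 1
    overlap_size = (2 * chunk_size) // 3        # 2/3 of the chunk size, rounded down
    return chunk_size, overlap_size


summaries = ["a" * 30, "b" * 60, "c" * 90]      # three summaries of different lengths
print(derive_second_chunk_params(summaries))    # (60, 40)
```

Rounding down keeps the overlap strictly smaller than the chunk size, so a sliding window always advances.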

[0050] The input data is divided into multiple first chunk data, and the target metadata for each first chunk data can be further divided into multiple second chunk data.

[0051] The method for processing data may include the step of combining and embedding the second chunk data and chunk metadata and storing them in a vector database (S140).

[0052] To store the input data in a vector database, the method for processing data combines the second chunk data with the chunk metadata and embeds the combined data.

[0053] For example, the step of storing in a vector database can combine the second chunk data with the chunk metadata that is not included in the target metadata, and embed the combined data.

[0054] As another example, the step of storing in a vector database can embed by combining the second chunk data, chunk metadata that is not included in the target metadata, and the metadata of the input data.
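
The two combinations described above can be sketched as follows. The disclosure does not specify an embedding model or a storage format, so `toy_embed` is a hypothetical hashed bag-of-words stand-in, the vector database is a plain list, and the newline-joined text layout is an assumption.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_embedding_text(second_chunk, chunk_metadata, target_items, input_metadata=None):
    """Join a second chunk with the non-target chunk metadata and, optionally,
    the metadata of the input data."""
    parts = [second_chunk]
    for item, value in chunk_metadata.items():
        if item not in target_items:          # target metadata is already in the chunk
            parts.append(f"{item}: {value}")
    if input_metadata:
        parts.extend(f"{key}: {value}" for key, value in input_metadata.items())
    return "\n".join(parts)


chunk_metadata = {
    "summary": "Chunk metadata is re-chunked before embedding.",
    "keywords": "RAG, chunking",
    "title": "Two-stage chunking",
}
vector_db = []  # a list standing in for a vector database
combined = build_embedding_text(
    "Chunk metadata is re-chunked",           # one second chunk of the summary
    chunk_metadata,
    target_items={"summary"},
    input_metadata={"source": "example.txt"},
)
vector_db.append({"text": combined, "vector": toy_embed(combined)})
print(combined)
```

Omitting `input_metadata` gives the first combination (second chunk data plus non-target chunk metadata); passing it gives the second.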

[0055] Through this, the data processing method can improve search performance by storing the results of reprocessing the metadata of the chunked data, rather than simply chunking and storing the input data.

[0056] The first chunk data can be stored in a second vector database distinct from the vector database. That is, data combined with the aforementioned second chunk data and metadata is stored in a separate vector database, and the first chunk data can be stored in a different second vector database.

[0057] When an input query is received and a search is performed, the search system may search the vector database, the second vector database, etc., in sequence and generate an answer using the results. That is, if satisfactory data is not found through a similarity search in the aforementioned vector database, the search system may derive a result by re-searching the second vector database. Alternatively, the second vector database may be the primary search target, and the vector database may be the secondary search target.
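
The sequential search can be sketched as below. The disclosure does not define what counts as a satisfactory result, so the cosine-similarity threshold, the record layout, and the function names are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query_vec, db):
    """Return (score, record) for the most similar record, or (0.0, None) if db is empty."""
    scored = ((cosine(query_vec, rec["vector"]), rec) for rec in db)
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))


def search_in_sequence(query_vec, databases, threshold=0.8):
    """Search each database in order; stop at the first match at or above the threshold."""
    for db in databases:
        score, record = best_match(query_vec, db)
        if record is not None and score >= threshold:
            return record
    return None


metadata_db = [{"vector": [1.0, 0.0], "text": "summary chunk"}]        # the vector database
first_chunk_db = [{"vector": [0.0, 1.0], "text": "raw first chunk"}]   # the second vector database

hit = search_in_sequence([0.1, 1.0], [metadata_db, first_chunk_db])
print(hit["text"])  # raw first chunk
```

Reversing the order of the list passed as `databases` gives the alternative in which the second vector database is searched first.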

[0058] Through the above operations, it is possible to resolve the issues of accuracy and speed degradation resulting from performing chunking operations based on conventional queries and searching for input data stored in a database. That is, according to the present disclosure, by preprocessing and storing input data stored in a database, the accuracy of similarity searches with input queries can be increased, and speed can also be improved.

[0059] The operation of FIG. 1 described above is explained in more detail below with reference to the drawings. However, this is an exemplary description and the embodiments described below are not limited to the examples. Furthermore, while the following description is based on documents, the present disclosure can be applied not only to documents but also to various information such as images and tables.

[0060] FIG. 2 is a diagram illustrating the operation of generating first chunk data according to one embodiment.

[0061] Referring to FIG. 2, when input data (200) is input, the first chunking algorithm can divide the input data (200) into a plurality of first chunk data (250, 251, 252). In this case, the first chunking algorithm can generate the first chunk data (250, 251, 252) by considering the chunk size and the overlap size.

[0062] For example, the first chunking algorithm may be a Fixed Size Chunking algorithm. Fixed Size Chunking is a method of cutting a document into fixed-length units. For instance, a document of 5,000 characters is mechanically cut into chunks of 100 or 200 characters. Alternatively, the first chunking algorithm may be a Content-Aware Chunking algorithm. Content-Aware Chunking is a method of chunking that recognizes context; simply put, it cuts a document into units of sentences or paragraphs. Alternatively, the first chunking algorithm may be a Recursive Chunking algorithm. Recursive Chunking combines the Fixed Size and Content-Aware methods: it cuts a document into units of sentences or paragraphs and then, taking into account the size of each resulting unit, divides it again into fixed sizes. In addition to these, there are many other chunking algorithms, and there are no limitations on the first chunking algorithm in this disclosure. Furthermore, an overlap technique may be introduced in each of the aforementioned algorithms to further enhance accuracy.
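
The content-aware and recursive variants, as described in this paragraph, can be sketched as follows. Sentence boundaries are detected with a simple punctuation rule, which is an assumption; the function names are hypothetical.

```python
import re


def content_aware_chunks(text: str) -> list[str]:
    """Cut text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def recursive_chunks(text: str, max_size: int) -> list[str]:
    """Cut into sentences first, then cut any sentence longer than max_size
    into fixed-size pieces."""
    chunks = []
    for sentence in content_aware_chunks(text):
        if len(sentence) <= max_size:
            chunks.append(sentence)
        else:
            chunks.extend(
                sentence[i:i + max_size] for i in range(0, len(sentence), max_size)
            )
    return chunks


text = "Short one. This sentence is clearly longer."
print(content_aware_chunks(text))
# ['Short one.', 'This sentence is clearly longer.']
print(recursive_chunks(text, max_size=20))
# ['Short one.', 'This sentence is cle', 'arly longer.']
```

The second call shows the fixed-size step applied only to the sentence that exceeds the size limit.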

[0063] FIG. 3 is a diagram illustrating the operation of generating chunk metadata according to one embodiment.

[0064] Referring to FIG. 3, the description is based on the first chunk data (250) among the multiple first chunk data (250, 251, 252) into which the input data is divided. The same operation is applied to the other first chunk data as well.

[0065] Chunk metadata (300) for the first chunk data (250) is generated by a metadata generator. For example, the chunk metadata (300) may contain content generated for pre-set items (310, 320, 330) based on the content of the first chunk data (250). For example, when the first chunk data (250) is input into the metadata generator, the metadata generator may generate data for the pre-set items using LLM technology. For example, the first chunk data (250) may be summarized using LLM technology to generate first item information (310). Alternatively, keywords may be extracted from the first chunk data (250) to generate second item information (320). Alternatively, titles may be extracted from the first chunk data (250) to generate Nth item information (330). In addition to this, various item information can be generated from the first chunk data (250) according to various settings and managed as chunk metadata (300) in correspondence with the first chunk data (250).

[0066] FIG. 4 is a diagram illustrating chunk metadata according to one embodiment in an exemplary manner.

[0067] Referring to FIG. 4, chunk metadata is generated based on the first chunk data. It can be generated using LLM, etc., according to preset items.

[0068] For example, chunk metadata may include summary information items (410). Summary information items (410) may be generated by a metadata generator that generates summary information using the first chunk data as input.

[0069] Additionally, the chunk metadata may include keyword information items (420). Keyword information items (420) may be generated by a metadata generator that generates keyword information using the first chunk data as input.

[0070] Additionally, the chunk metadata may include a title information item (430). The title information item (430) may be generated by a metadata generator that generates title information using the first chunk data as input. In addition, various information may be generated as chunk metadata items depending on the settings. The aforementioned metadata generator may be included as a logical part of a data processing device. Alternatively, the metadata generator may be implemented as a separate physical or logical device.

[0071] FIG. 5 is a diagram illustrating the operation of generating second chunk data according to one embodiment.

[0072] Referring to FIG. 5, the second chunk data can be generated based on the target metadata. The target metadata can be set to the information of a preset item among the chunk metadata. Here, the case where the first item information (310) is the target metadata is described as an example. For instance, the target metadata can be set to the summary information among the aforementioned items, because the summary information is written in sentence form and carries a large amount of the information in the first chunk data.

[0073] The data processing method can generate multiple second chunk data (500, 510) by extracting target metadata and using a second chunking algorithm. Various algorithms of the first chunking algorithm described above may be used as the second chunking algorithm.

[0074] For example, the second chunking algorithm can chunk the target metadata based on a chunk size and an overlap size that are set based on the data length of the target metadata. The chunk size is a parameter that determines the size of each chunk into which the first item information (310) is divided. The overlap size is a parameter that determines how much the second chunk data (500) and the second chunk data (510) overlap. Through this, the second chunking algorithm can slide over the first item information (the target metadata) to produce the second chunk data (500, 510).

[0075] For example, the chunk size can be set to the average value of the distribution of the target metadata lengths. As another example, the chunk size can be set to a fixed value. The overlap size can be set based on the chunk size. For example, the overlap size can be set to 2/3 of the chunk size. The setting value for each parameter can be varied, and can be set to a fixed value or a value linked to other parameters.
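
As a worked illustration of these parameters, assume a target metadata item of length $L$, a chunk size $c$, and an overlap size $o$. The symbols and the numbers below are illustrative and do not come from the disclosure. A sliding window advances by a stride $s$ and produces $n$ second chunk data:

$$
s = c - o, \qquad
n =
\begin{cases}
1 & \text{if } L \le c \\
\left\lceil \dfrac{L - c}{s} \right\rceil + 1 & \text{if } L > c
\end{cases}
$$

With the example setting $o = \tfrac{2}{3}c$, the stride is $s = \tfrac{1}{3}c$. For $c = 90$ and $L = 300$ this gives $o = 60$, $s = 30$, and

$$
n = \left\lceil \frac{300 - 90}{30} \right\rceil + 1 = 7 + 1 = 8,
$$

with windows starting at positions $0, 30, \dots, 210$. Away from the two ends of the text, every position is covered by three consecutive windows.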

[0076] When the second chunk data is generated, the data processing method can store it in the vector DB by combining the generated data and performing embedding.

[0077] FIG. 6 is a diagram illustrating the operation of combining chunk metadata with second chunk data and storing it in a vector database according to one embodiment.

[0078] Referring to FIG. 6, the data (600) to be embedded may be the second chunk data (500) and the chunk metadata that is not the target metadata (320, 330). For example, since the second chunk data (500) is data chunked with the first item information (310) as the target metadata, the information (320, 330) excluding the first item information (310) among the chunk metadata may be combined with the second chunk data (500).

[0079] The combined data (600) is embedded and stored as a vector value in the vector DB (610). Likewise, the second chunk data (510) is also stored in the vector DB (610) through the same process.

[0080] FIG. 7 is a diagram illustrating the operation of combining second chunk data, chunk metadata, and metadata of input data according to another embodiment and storing them in a vector database.

[0081] Referring to FIG. 7, the data (700) to be embedded may be the second chunk data (500), data (320, 330) among the chunk metadata that is not the target metadata, and metadata (710) of the input data. For example, since the second chunk data (500) is data chunked with the first item information (310) as the target metadata, information (320, 330) among the chunk metadata excluding the first item information (310) may be combined with the second chunk data (500). In addition, the metadata (710) of the initially input data may also be included and combined as a component of the data (700) to be embedded. Through this, a vector value containing more information can be generated.

[0082] The combined data (700) is embedded and stored as a vector value in the vector DB (610). Likewise, the second chunk data (510) is also stored in the vector DB (610) through the same process.

[0083] In addition to this, the data to be embedded may be configured to have various combination relationships depending on the user's settings.

[0084] Through the above operations, input data can be stored in the DB by dividing it into various information elements, rather than simply storing it in document format or vectorized form. This provides the effect of improving search performance. Below, the search process using the DB constructed in this way is briefly explained.

[0085] FIG. 8 is a diagram illustrating the operation of processing an input query according to one embodiment.

[0086] Referring to FIG. 8, the search system receives an input query (800). In the RAG, the input query (800) can be morphologically analyzed and embedded so that related documents are searched based on similarity.

[0087] The input query (800) can be input into the generator (810). The generator (810) is a module, such as an LLM, capable of understanding and processing the user's natural language input. Additionally, the generator (810) can convert search results into natural language and provide them to the user. Although the description here uses the example of the input query (800) being input into the generator (810), the input query (800) may instead be input directly into the search device (820).

[0088] The search device (820) refers to a device that performs a search in a database based on the input query (800). The generator (810) can receive search results from the search device (820) and generate an answer. Accordingly, the generator (810) may be an artificial intelligence model based on a large language model.

[0089] According to the present disclosure, the data stored in the vector DB (610) may be a vector value in which the aforementioned second chunk data and chunk metadata (excluding the target metadata) are combined. Alternatively, the data stored in the vector DB (610) may be a vector value in which the aforementioned second chunk data, chunk metadata, and metadata of the input data are combined and embedded.

[0090] Meanwhile, the input data is divided into first chunk data, and the first chunk data can be stored in a separate second vector DB (830). When a search request is received, the search device (820) can perform a search in the vector DB (610) using a similarity search technique. If that search does not yield a satisfactory result, the search device (820) can perform a search in the second vector DB (830). Alternatively, the search may be performed in a different order. Through this, multi-dimensional and multi-DB searches can be performed, thereby increasing accuracy.

[0091] Meanwhile, a third vector DB may be separately configured for the second chunk data itself. Alternatively, when a vector value is stored in the vector DB (610), it may be stored so as to correspond to the metadata of the input data. That is, if the metadata of the input data is not included in the vector value, the vector value and the metadata of the input data may be stored in association with each other, in the form of a tag or a correspondence relationship. This operation can be performed in the same way in the second vector DB (830).

[0092] The operation of preprocessing input data for searching according to the present disclosure and storing it in a DB has been described above. Through this operation, data in the DB can be stored in a state that facilitates searching, and the performance of the entire system can be improved.

[0093] The following section focuses on the configuration of a data processing device capable of executing the aforementioned operations. The data processing device can perform all of the operations described above, and its components may be logically separated or executed by a processor. The configuration described below is a classification intended to aid understanding and can be implemented using a single processor and one or more memories.

[0094] FIG. 9 is a diagram illustrating the configuration of a data processing device according to one embodiment.

[0095] Referring to FIG. 9, the data processing device (900) may include a first chunking unit (910) that divides the input data into one or more first chunk data using a first chunking algorithm when input data is received, a metadata generation unit (920) that generates chunk metadata for each of the one or more first chunk data, a second chunking unit (930) that divides the target metadata among the chunk metadata into one or more second chunk data using a second chunking algorithm, and an embedding unit (940) that combines the second chunk data and chunk metadata and embeds them to store them in a vector database.
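
How the four units might fit together is sketched below, under several assumptions: the metadata generator and the embedding function are injected as plain callables, the vector database is a list, and the chunk and overlap sizes are arbitrary illustrative values. None of the names come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable


def split(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size sliding-window split (requires overlap < size)."""
    stride = size - overlap
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += stride
    return out


@dataclass
class DataProcessingDevice:
    generate_metadata: Callable[[str], dict[str, str]]  # role of the metadata generation unit
    embed: Callable[[str], list[float]]                 # used by the embedding unit
    target_item: str = "summary"                        # which item is the target metadata
    vector_db: list[dict] = field(default_factory=list)

    def process(self, input_data: str) -> None:
        # First chunking unit: divide the input data (illustrative sizes).
        for first_chunk in split(input_data, size=200, overlap=50):
            # Metadata generation unit: one metadata record per first chunk.
            metadata = self.generate_metadata(first_chunk)
            target = metadata[self.target_item]
            rest = [f"{k}: {v}" for k, v in metadata.items() if k != self.target_item]
            # Second chunking unit: divide the target metadata.
            for second_chunk in split(target, size=60, overlap=40):
                # Embedding unit: combine, embed, and store.
                combined = "\n".join([second_chunk, *rest])
                self.vector_db.append({"text": combined, "vector": self.embed(combined)})


# Stub callables stand in for an LLM and an embedding model.
device = DataProcessingDevice(
    generate_metadata=lambda chunk: {"summary": chunk[:90], "title": chunk[:10]},
    embed=lambda text: [float(len(text))],
)
device.process("x" * 300)
print(len(device.vector_db))  # 6
```

The 300-character input yields two first chunks, and each 90-character stub summary yields three second chunks, hence six stored records.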

[0096] For example, a data processing device (900) may receive input data. When input data is received, the first chunking unit (910) may divide the input data using a preset chunking algorithm. Various known algorithms may be used as the chunking algorithm. For example, various algorithms such as Fixed Size Chunking, Content-Aware chunking, and Recursive chunking may be used as the chunking algorithm.

[0097] Depending on the chunking operation, the input data may be divided into multiple chunk data. Depending on the chunk size and overlap size applied to the input data, at least one of the first chunk data may include content that overlaps with another first chunk data.

[0098] Accordingly, the first chunking algorithm can divide the input data into one or more first chunk data using chunk size and overlap size parameters.

[0099] Additionally, the metadata generation unit (920) can generate chunk metadata for each of the first chunk data.

[0100] For example, chunk metadata can be generated by including pre-set item information based on the content of the first chunk. To this end, the metadata generation unit (920) can be trained to generate metadata using specific information, such as an LLM module. For example, chunk metadata is generated for each of one or more first chunk data, and may include summary information, keyword information, and title information regarding the content of the first chunk data, divided by item. Through this, metadata for each first chunk data can be generated.

[0101] For example, target metadata can be pre-configured as information regarding specific items within chunk metadata. As one example, target metadata could be summary information among the chunk metadata items. As another example, target metadata could be information combining summary information and title information. In addition to this, target metadata can be configured in various ways depending on the settings or items.
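
The selection of target metadata from the chunk metadata items might be sketched as follows. The function name `select_target_metadata`, the dictionary representation of the metadata, and the newline used to join multiple items are all assumptions for illustration.

```python
def select_target_metadata(
    chunk_metadata: dict[str, str], target_items: list[str]
) -> tuple[str, dict[str, str]]:
    """Return (target text, remaining metadata) for one first chunk.

    target_items lists the preset items that form the target metadata,
    e.g. ["summary"] or ["summary", "title"].
    """
    missing = [item for item in target_items if item not in chunk_metadata]
    if missing:
        raise KeyError(f"items not present in chunk metadata: {missing}")

    # Join the selected items in the configured order (separator is an assumption).
    target_text = "\n".join(chunk_metadata[item] for item in target_items)
    # Everything not selected is kept so it can be combined again at embedding time.
    remaining = {k: v for k, v in chunk_metadata.items() if k not in target_items}
    return target_text, remaining


meta = {"summary": "Steel is strong.", "keywords": "steel, alloy", "title": "Steel"}
target, rest = select_target_metadata(meta, ["summary"])
print(target)  # Steel is strong.
print(rest)    # {'keywords': 'steel, alloy', 'title': 'Steel'}
```

Returning the non-target items separately keeps them available for the later combining step, where metadata not included in the target metadata is embedded together with the second chunk data.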

[0102] The second chunking unit (930) can divide the target metadata into second chunk data using a second chunking algorithm. As mentioned above, various known algorithms may also be applied as the second chunking algorithm. For example, the second chunking algorithm can divide the target metadata into multiple second chunk data based on the chunk size and overlap size.

[0103] For example, the second chunking algorithm can chunk the target metadata based on a chunk size and an overlap size that are set based on the data length of the target metadata. The chunk size can be set to the average value of the distribution of the lengths of the target metadata, and the overlap size can be set based on the chunk size. For example, the overlap size can be set to 2/3 of the chunk size.
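
As a worked illustration of these parameters, the sketch below derives the chunk size from the mean length of the target metadata and sets the overlap size to 2/3 of that value. The function name `derive_chunk_params`, the measurement of length in characters, and the rounding to whole numbers are assumptions; the disclosure states only the average and the 2/3 ratio.

```python
from statistics import mean

# The 2/3 ratio is the example given in the text; kept as integers to avoid
# floating-point rounding surprises.
OVERLAP_NUMERATOR = 2
OVERLAP_DENOMINATOR = 3


def derive_chunk_params(target_texts: list[str]) -> tuple[int, int]:
    """Return (chunk_size, overlap_size) derived from the target metadata lengths."""
    if not target_texts:
        raise ValueError("at least one target metadata text is required")

    lengths = [len(text) for text in target_texts]  # lengths in characters (assumption)
    chunk_size = max(1, round(mean(lengths)))        # average length, rounded
    overlap_size = (OVERLAP_NUMERATOR * chunk_size) // OVERLAP_DENOMINATOR
    return chunk_size, overlap_size


# Three summaries of 90, 120, and 150 characters: mean 120, overlap 80.
summaries = ["a" * 90, "b" * 120, "c" * 150]
print(derive_chunk_params(summaries))  # (120, 80)
```

With these values, each second chunk advances by only 40 characters relative to the previous one, so most of every second chunk is shared with its neighbor.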

[0104] The input data is divided into multiple first chunk data, and the target metadata for each first chunk data can be further divided into multiple second chunk data.

[0105] The embedding unit (940) can combine the second chunk data and the chunk metadata and embed the combined data in order to store the input data in a vector database.

[0106] For example, in the step of storing in the vector database, the data among the chunk metadata that is not included in the target metadata can be combined with the second chunk data and then embedded.

[0107] As another example, the embedding unit (940) can combine the second chunk data, the chunk metadata that is not included in the target metadata, and the metadata of the input data to perform embedding.
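
The combining step might be sketched as follows. The text format used to join the parts, the names `build_embedding_text` and `toy_embed`, and the example file name are hypothetical. In particular, `toy_embed` is a deterministic hash-based stand-in used only to keep the example self-contained; it is not a semantically meaningful embedding and is not the embedding model of the disclosure.

```python
import hashlib
from typing import Optional


def build_embedding_text(
    second_chunk: str,
    non_target_metadata: dict[str, str],
    input_metadata: Optional[dict[str, str]] = None,
) -> str:
    """Combine a second chunk with the non-target chunk metadata and,
    optionally, the metadata of the input data itself."""
    parts = [second_chunk]
    parts += [f"{key}: {value}" for key, value in non_target_metadata.items()]
    if input_metadata:
        parts += [f"{key}: {value}" for key, value in input_metadata.items()]
    return "\n".join(parts)  # "key: value" lines are an assumed format


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model (assumption): hash bytes scaled to [0, 1]."""
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 for a SHA-256 digest")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


combined = build_embedding_text(
    "Steel is strong.",
    {"title": "Steel", "keywords": "steel, alloy"},
    {"source": "report.pdf"},  # hypothetical input-data metadata
)
record = {"vector": toy_embed(combined), "text": combined, "first_chunk_id": 0}
print(record["text"])
```

Keeping a `first_chunk_id` alongside the vector is a design choice of this sketch; it allows a stored record to be traced back to the first chunk data from which its metadata was generated.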

[0108] Through this, the data processing device (900) can improve search performance by storing the result of reprocessing the metadata of the chunked data, rather than simply chunking and storing the input data.

[0109] The first chunk data can be stored in a second vector database distinct from the vector database. That is, the data combining the second chunk data and the metadata described above is stored in the vector database, and the first chunk data can be stored in a separate second vector database.
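
One possible reading of this two-database arrangement is sketched below with in-memory stores. The class `InMemoryVectorStore`, the function `sequential_search`, the `first_chunk_id` link between the stores, and the use of cosine similarity are all assumptions for illustration; the disclosure does not prescribe a particular vector database product or search procedure.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Minimal stand-in for a vector database (hypothetical)."""

    def __init__(self) -> None:
        self.records: list[tuple[list[float], dict]] = []

    def add(self, vector: list[float], payload: dict) -> None:
        self.records.append((vector, payload))

    def search(self, query: list[float], top_k: int = 1, allowed_ids=None) -> list[dict]:
        # Optionally restrict candidates to a set of first-chunk IDs.
        candidates = [
            r for r in self.records
            if allowed_ids is None or r[1]["first_chunk_id"] in allowed_ids
        ]
        ranked = sorted(candidates, key=lambda r: cosine(query, r[0]), reverse=True)
        return [payload for _, payload in ranked[:top_k]]


def sequential_search(metadata_store, first_chunk_store, query, top_k: int = 1) -> list[str]:
    """Search the metadata store first, then the first-chunk store among the hits."""
    hits = metadata_store.search(query, top_k)
    candidate_ids = {hit["first_chunk_id"] for hit in hits}
    results = first_chunk_store.search(query, top_k, allowed_ids=candidate_ids)
    return [payload["text"] for payload in results]


metadata_store = InMemoryVectorStore()     # second chunk data combined with metadata
first_chunk_store = InMemoryVectorStore()  # the distinct second vector database
metadata_store.add([1.0, 0.0], {"first_chunk_id": 0, "text": "summary A"})
metadata_store.add([0.0, 1.0], {"first_chunk_id": 1, "text": "summary B"})
first_chunk_store.add([0.9, 0.1], {"first_chunk_id": 0, "text": "full chunk A"})
first_chunk_store.add([0.1, 0.9], {"first_chunk_id": 1, "text": "full chunk B"})

print(sequential_search(metadata_store, first_chunk_store, [1.0, 0.0]))  # ['full chunk A']
```

In this sketch the compact metadata records narrow the candidates, and the first chunk data is then retrieved from the separate store for those candidates only.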

[0110] The foregoing description is merely an illustrative explanation of the technical concept of the present disclosure, and those skilled in the art to which the present disclosure pertains may make various modifications and variations without departing from the essential characteristics of the technical concept. Furthermore, since these embodiments are intended to explain rather than limit the technical concept, the scope of the technical concept is not limited by these embodiments. The scope of protection of the present disclosure shall be interpreted by the claims below, and all technical concepts within an equivalent scope shall be interpreted as being included within the scope of rights of the present disclosure.

[0112] CROSS-REFERENCE TO RELATED APPLICATION

[0113] This patent application claims priority pursuant to Section 119(a) of the U.S. Patent Act (35 USC § 119(a)) to Korean Patent Application No. 10-2024-0190684 filed on December 19, 2024, all of which are incorporated by reference into this patent application. Additionally, this patent application claims priority in countries other than the United States for the same reasons as above, all of which are incorporated by reference into this patent application.

Claims

1. A data processing method comprising: when input data is received, dividing the input data into one or more first chunk data using a first chunking algorithm; generating chunk metadata for each of the one or more first chunk data; dividing target metadata among the chunk metadata into one or more second chunk data using a second chunking algorithm; and combining and embedding the second chunk data and the chunk metadata and storing them in a vector database.

2. The data processing method of claim 1, wherein the first chunking algorithm divides the input data into the one or more first chunk data using chunk size and overlap size parameters.

3. The data processing method of claim 1, wherein the chunk metadata is generated for each of the one or more first chunk data and includes summary information, keyword information, and title information regarding the content of the first chunk data, organized by item.

4. The data processing method of claim 3, wherein the target metadata is the summary information among the chunk metadata.

5. The data processing method of claim 1, wherein the second chunking algorithm chunks the target metadata based on a chunk size and an overlap size that are set based on the data length of the target metadata.

6. The data processing method of claim 5, wherein the chunk size is set to the average value of the distribution of the lengths of the target metadata, and the overlap size is set based on the chunk size.

7. The data processing method of claim 1, wherein the storing in the vector database comprises combining data among the chunk metadata that is not included in the target metadata with the second chunk data and embedding the combined data.

8. The data processing method of claim 1, wherein the storing in the vector database comprises combining the second chunk data, the chunk metadata that is not included in the target metadata, and the metadata of the input data, and embedding the combined data.

9. The data processing method of claim 1, wherein the first chunk data is stored in a second vector database distinct from the vector database, and the vector database and the second vector database are configured to be searched sequentially.

10. A data processing device comprising: a first chunking unit that, when input data is received, divides the input data into one or more first chunk data using a first chunking algorithm; a metadata generation unit that generates chunk metadata for each of the one or more first chunk data; a second chunking unit that divides target metadata among the chunk metadata into one or more second chunk data using a second chunking algorithm; and an embedding unit that combines and embeds the second chunk data and the chunk metadata and stores them in a vector database.

11. The data processing device of claim 10, wherein the chunk metadata is generated for each of the one or more first chunk data and includes summary information, keyword information, and title information regarding the content of the first chunk data, organized by item.

12. The data processing device of claim 11, wherein the target metadata is the summary information among the chunk metadata.

13. The data processing device of claim 10, wherein the second chunking algorithm chunks the target metadata based on a chunk size and an overlap size that are set based on the data length of the target metadata.

14. The data processing device of claim 13, wherein the chunk size is set to the average value of the distribution of the lengths of the target metadata, and the overlap size is set based on the chunk size.

15. The data processing device of claim 10, wherein the embedding unit combines data among the chunk metadata that is not included in the target metadata with the second chunk data and embeds the combined data.

16. The data processing device of claim 10, wherein the embedding unit combines and embeds the second chunk data, the chunk metadata that is not included in the target metadata, and the metadata of the input data.

17. The data processing device of claim 10, wherein the embedding unit stores the first chunk data in a second vector database distinct from the vector database, and the vector database and the second vector database are configured to be searched sequentially.