Information extraction method and device, electronic equipment, storage medium and program product

By filtering commonly used text fragments and prompts, the method solves the problems of low efficiency and low accuracy in information extraction in existing technologies, and achieves efficient and accurate information extraction, adapting to the needs of various text types.

CN122309642A · Pending · Publication Date: 2026-06-30 · MASHANG CONSUMER FINANCE CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: MASHANG CONSUMER FINANCE CO LTD
Filing Date: 2024-12-30
Publication Date: 2026-06-30

Application Information

IPC: G06F16/3329

AI Tagging

Technology Topics

Engineering Data mining

Abstract

This disclosure relates to an information extraction method, apparatus, electronic device, storage medium, and program product, belonging to the field of natural language processing technology. The information extraction method includes: determining a commonly used text segment corresponding to a first element; wherein the number of times the extraction result corresponding to the first element appears in the commonly used text segment satisfies a frequency threshold; determining a first text segment from multiple text segments of the text based on the commonly used text segment; the similarity between the first text segment and the commonly used text segment satisfies a similarity condition; determining prompt information based on the first element and the first text segment; and performing information extraction on the text based on the prompt information to obtain the extraction result of the first element. This disclosure improves the accuracy of information extraction by accurately selecting the first text segment containing the element and then obtaining prompt information based on the first text segment containing the element for information extraction.

Description

Technical Field

[0001] This disclosure relates to the field of natural language processing technology, and more specifically, to an information extraction method, apparatus, electronic device, storage medium, and program product.

Background Technology

[0002] With the continuous advancement of informatization in modern society, the informatization of various industries is also rapidly progressing. In the actual work of many industries, a large number of text files are typically generated. Current information extraction methods for processing these texts usually rely on manual data entry or extracting necessary element tags based on experience.

[0003] However, current information extraction methods not only consume significant human and material resources but also often fail to guarantee high extraction accuracy, have low reusability, and are difficult to adapt to different scenarios. Therefore, there is an urgent need in this field for an information extraction method that can improve the accuracy of information extraction and reduce reuse costs.

[0004] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0005] The purpose of this disclosure is to provide an information extraction method, apparatus, electronic device, storage medium, and program product that can improve the accuracy of information extraction to a certain extent.

[0006] According to a first aspect of this disclosure, an information extraction method is provided, comprising:

[0007] Determine the commonly used text fragment corresponding to the first element;

[0008] Based on the commonly used text fragment, a first text fragment is determined from multiple text fragments of the text; the similarity between the first text fragment and the commonly used text fragment satisfies the similarity condition;

[0009] The prompt information is determined based on the first element and the first text fragment;

[0010] Based on the prompt information, information is extracted from the text to obtain the extraction result of the first element.

[0011] According to a second aspect of this disclosure, an information extraction apparatus is provided, comprising:

[0012] The determining module is used to determine the commonly used text fragment corresponding to the first element;

[0013] The determining module is further configured to determine a first text fragment from multiple text fragments of a text based on the commonly used text fragment; the similarity between the first text fragment and the commonly used text fragment satisfies a similarity condition;

[0014] The processing module is used to determine the prompt information based on the first element and the first text fragment;

[0015] The processing module is further configured to extract information from the text based on the prompt information to obtain the extraction result of the first element.

[0016] According to a third aspect of this disclosure, an electronic device is provided, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the information extraction method described in the first aspect by executing the executable instructions.

[0017] According to a fourth aspect of this disclosure, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the information extraction method described in the first aspect.

[0018] According to a fifth aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the information extraction method described in the first aspect.

[0019] The exemplary embodiments disclosed herein can have the following beneficial effects:

[0020] In the information extraction method of this exemplary embodiment, a common text fragment corresponding to the first element to be extracted is determined, and the first text fragment containing the first element is accurately selected from multiple text fragments based on the common text fragment corresponding to the first element. The similarity between the first text fragment and the common text fragment satisfies the similarity condition. Then, prompt information is obtained based on the first element and the first text fragment for information extraction. On the one hand, since the common text fragment is the text fragment that appears most frequently in the extraction results corresponding to the first element, the accuracy and efficiency of text fragment selection are improved by using the common text fragment of the first element before information extraction. On this basis, information extraction is performed based on the first text fragment containing the first element, which reduces the length of the text fragment required for information extraction, thereby reducing the time consumption of the information extraction process. On the other hand, obtaining prompt information based on the first text fragment and using the prompt information for information extraction allows for a more focused approach on the information related to the first element in the prompt information during the information extraction process, thereby improving the accuracy of information extraction.

[0021] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.

[0023] Figure 1 shows a flowchart of an information extraction method according to a related embodiment of the present disclosure;

[0024] Figure 2 shows a flowchart of an information extraction method according to an exemplary embodiment of this disclosure;

[0025] Figure 3 shows a flowchart of an information extraction method according to a specific embodiment of the present disclosure;

[0026] Figure 4 shows a block diagram of an information extraction apparatus according to an exemplary embodiment of this disclosure;

[0027] Figure 5 shows a schematic structural diagram of an electronic device according to an example embodiment of the present disclosure.

Detailed Implementation

[0028] Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this disclosure more comprehensive and complete, and to fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced with one or more of the specific details omitted, or other methods, components, apparatus, steps, etc., can be employed. In other instances, well-known technical solutions are not shown or described in detail to avoid obscuring various aspects of this disclosure.

[0029] Furthermore, the accompanying drawings are merely illustrative of this disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and therefore repeated descriptions of them will be omitted. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0030] Information extraction has a wide range of applications. For example, in contract review, companies may need to extract useful information from contracts to facilitate related work, such as finding information about the ownership of technological achievements or identifying the parties involved (Party A and Party B). Furthermore, in the legal industry, various documents are often presented in semi-structured text format.

[0031] When processing these text documents, relying on manual input or experience-based extraction of necessary legal element tags not only consumes significant human and material resources, but also often fails to guarantee high accuracy in complex situations, such as those involving multiple parties or appraisal institutions. Furthermore, extraction rules are difficult to adapt to different cases and scenarios, have low reusability, and require constant adjustment and maintenance, increasing maintenance costs.

[0032] In some related embodiments, an information extraction method based on large language model prompts can be used. As shown in Figure 1, the information extraction process is as follows:

[0033] Step S110. Combine the complete text with the elements and prompt templates to obtain the prompt for information extraction.

[0034] When extracting information, the first step is to define which elements need to be extracted from the target text. For example, in a court judgment, elements might include the defense attorney, the cause of action, and the fine. In a contract, elements might include the names of Party A and Party B, and the contract term.

[0035] Then, the complete text is combined with the elements and the prompt information template to obtain the prompt information, for example: "You are now a legal review expert. Your task is to extract the following information from the text: total price of the judgment document. Please output the result in the following format: - Fine-: The fine is x yuan. If the relevant attributes cannot be extracted, please answer 'unknown'. Text: {judgment document text}".

[0036] Step S120. Input the prompt information into the large language model.

[0037] Step S130. Obtain the information extraction results through the large language model.

[0038] The prompt information is used as input to a large language model, and the large language model is used to obtain the information extraction results.

[0039] The information extraction methods described above have advantages such as not requiring labeled training samples, low implementation cost, and high information extraction accuracy. However, these methods require multiple calls to a large language model, with each input consisting of the entire target text (potentially several thousand characters), resulting in a long extraction time per session and consequently a long overall information extraction time. Furthermore, in complex scenarios where the target text contains multiple identical element names, such as multiple plaintiffs and defendants in a court document or multiple instances of "Party A" and "Party B" in a contract, the large model cannot focus on key information within the prompts due to the input being the entire target text. It struggles to grasp the main points and accurately identify the context of the elements, leading to incorrect extraction results and low information extraction accuracy. Additionally, when the element does not exist in the target text, the large model may misinterpret it, tending to use the extracted value from the example template as the corresponding value for that element, potentially causing errors in the extraction results.

[0040] To address the aforementioned problems, this exemplary implementation first provides an information extraction method. As shown in Figure 2, the information extraction method may include the following steps:

[0041] Step S210. Determine the commonly used text fragment corresponding to the first element.

[0042] Step S220. Based on the commonly used text fragments, determine the first text fragment from multiple text fragments of the text; the similarity between the first text fragment and the commonly used text fragments satisfies the similarity condition.

[0043] Step S230. Determine the prompt information based on the first element and the first text fragment.

[0044] Step S240. Extract information from the text based on the prompt information to obtain the extraction result of the first element.

[0045] In the information extraction method of this exemplary embodiment, a common text fragment corresponding to the first element to be extracted is determined, wherein the number of times the extraction result corresponding to the first element appears in the common text fragment meets a frequency threshold. Based on the common text fragment corresponding to the first element, the first text fragment containing the first element is accurately selected from multiple text fragments of the text. The similarity between the first text fragment and the common text fragment meets a similarity condition. Then, prompt information is obtained based on the first element and the first text fragment for information extraction. On the one hand, since the common text fragment is the text fragment in which the extraction result corresponding to the first element appears more frequently, the text fragment is filtered by using the common text fragment of the first element before information extraction, which improves the accuracy and efficiency of text fragment filtering. On this basis, information extraction is performed based on the first text fragment containing the first element, which reduces the length of the text fragment required for information extraction, thereby reducing the time consumption of the information extraction process. On the other hand, obtaining prompt information based on the first text fragment and using the prompt information for information extraction can focus more on the information related to the first element in the prompt information during the information extraction process, thereby improving the accuracy of information extraction.

[0046] Furthermore, when other types of text are encountered, the commonly used text fragments corresponding to the same element are similar, so either no new commonly used text fragments need to be added for that element or only a small number need to be added to maintain the accuracy of information extraction. This significantly reduces the reuse cost of the information extraction method.

[0047] Below, the steps described above in this example implementation are explained in more detail in conjunction with Figure 3.

[0048] In step S210, the commonly used text fragment corresponding to the first element is determined, wherein the number of times the extraction result corresponding to the first element appears in the commonly used text fragment meets the frequency threshold.

[0049] In this example implementation, the first element refers to the target element to be extracted. For example, the first element to be extracted from a court judgment document could be the plaintiff, the defendant, the amount of compensation, etc., while the first element to be extracted from a contract document could be Party A, Party B, the contract name, the contract term, etc.

[0050] In this example implementation, the number of times the extraction result corresponding to the first element appears in a common text fragment meets the frequency threshold. In other words, a common text fragment is a text fragment in which the extraction result corresponding to the first element appears frequently enough to meet the frequency threshold. For example, suppose there are two text fragments, the extraction result corresponding to the first element appears twice in the first and five times in the second, and the frequency threshold is three; the second text fragment can then be considered a common text fragment corresponding to the first element. Put differently, a common text fragment corresponding to the first element is a frequently used description of the context in which the first element appeared during historical extraction, i.e., a text fragment that appears frequently during historical extraction. For example, in the historical extraction of contract text, "Party A: XXX" can be considered one of the common text fragments for the element "Party A".
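The two-fragment example above can be sketched as a simple count-and-filter. This is only an illustration under a literal reading of the paragraph: the function name `select_common_fragments`, the sample strings, and the use of substring counting to model "appears in" are assumptions, not details given in the disclosure.

```python
def select_common_fragments(fragments, extraction_result, frequency_threshold):
    """Keep fragments in which the extraction result occurs often enough.

    Assumption: "appears in" is modeled as non-overlapping substring counting.
    """
    return [
        fragment
        for fragment in fragments
        if fragment.count(extraction_result) >= frequency_threshold
    ]


# Hypothetical sample data: "ACME" appears twice in the first fragment
# and five times in the second; the threshold is three.
fragments = ["ACME, ACME", "ACME ACME ACME ACME ACME"]
common = select_common_fragments(fragments, "ACME", frequency_threshold=3)
```

With these inputs only the second fragment passes the threshold, mirroring the example in the text.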

[0051] When obtaining common text fragments for the first element, the number of common text fragments can be determined based on the different information extraction difficulties. Assuming that the number of common text fragments for the first element to be obtained is K, more common text fragments can be collected for elements that are more difficult to extract, for example, K=15, while fewer common text fragments can be collected for elements that are easier to extract, for example, K=5.

[0052] In this example implementation, determining the commonly used text fragment corresponding to the first element may specifically include the following steps:

[0053] Step S310. Obtain the extraction difficulty coefficient of the first element, and determine the number of common text fragments of the first element based on the extraction difficulty coefficient.

[0054] The extraction difficulty coefficient of the first element can be reflected by its historical extraction accuracy. Specifically, the extraction difficulty coefficient decreases as the historical extraction accuracy increases: the lower the historical extraction accuracy, the higher the extraction difficulty coefficient of the corresponding element. For example, a historical extraction accuracy of 100% corresponds to an extraction difficulty coefficient of 0, and a historical extraction accuracy of 50% corresponds to an extraction difficulty coefficient of 0.5. For elements with higher extraction difficulty coefficients, providing more common text fragments can improve the current extraction accuracy.
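Both numeric examples in this paragraph are consistent with taking the difficulty coefficient as one minus the historical accuracy. The sketch below assumes that relation; the disclosure does not state the formula explicitly, and the function name and the boolean outcome list are hypothetical.

```python
def extraction_difficulty(historical_outcomes):
    """Return 1 - accuracy for a list of past extraction outcomes.

    Assumption: the coefficient equals 1 minus the historical accuracy,
    which matches the two examples in the text (100% -> 0, 50% -> 0.5).
    Each outcome is True when the past extraction was correct.
    """
    if not historical_outcomes:
        raise ValueError("at least one historical extraction is required")
    accuracy = sum(historical_outcomes) / len(historical_outcomes)
    return 1.0 - accuracy
```

For instance, `extraction_difficulty([True, False])` gives 0.5, and a history with no errors gives 0.0.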

[0055] In one embodiment, a correspondence between the extraction difficulty coefficient and the number of text fragments can be set based on historical element extraction data. After determining the extraction difficulty coefficient of the first element, the number of commonly used text fragments of the first element can be determined based on the extraction difficulty coefficient of the first element and the correspondence.

[0056] In another embodiment, a calculation formula between the extraction difficulty coefficient and the number of text fragments can be summarized based on historical element extraction data. After determining the extraction difficulty coefficient of the first element, the number of commonly used text fragments corresponding to the first element can be calculated based on the calculation formula and the extraction difficulty coefficient of the first element.

[0057] For example, when determining the difficulty coefficient of element extraction, five commonly used text fragments can be collected for each element, and the extraction accuracy of different elements can be calculated. For elements with low extraction accuracy, the number of commonly used text fragments corresponding to that element can be increased, for example, to 10.
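The two embodiments above (a correspondence table and a calculation formula) could look roughly like the following. The breakpoints 0.2 and 0.5 and the linear interpolation are invented for illustration; only the fragment counts 5, 10 and 15 appear in the text.

```python
import bisect

# Hypothetical difficulty bands; only the counts 5, 10 and 15 come from the text.
DIFFICULTY_BREAKPOINTS = [0.2, 0.5]
FRAGMENT_COUNTS = [5, 10, 15]


def count_from_table(difficulty):
    """Correspondence-table variant: look up K by difficulty band."""
    band = bisect.bisect_right(DIFFICULTY_BREAKPOINTS, difficulty)
    return FRAGMENT_COUNTS[band]


def count_from_formula(difficulty, min_k=5, max_k=15):
    """Formula variant: interpolate K linearly between min_k and max_k."""
    clamped = min(max(difficulty, 0.0), 1.0)
    return min_k + round(clamped * (max_k - min_k))
```

In practice either mapping would be fitted to historical element extraction data, as the text describes, rather than hard-coded.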

[0058] It should be noted that the first element can correspond to multiple candidate commonly used text fragments. These candidates can be sorted according to the frequency of occurrence of the extraction results corresponding to the first element. After the number of commonly used text fragments corresponding to the first element is determined, that number of fragments is selected in descending order of frequency.
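Selecting candidates in descending order of frequency amounts to a sort followed by a slice. A minimal sketch, in which the function name, the dictionary layout and the sample counts are all hypothetical:

```python
def top_k_fragments(candidate_counts, k):
    """Return the k candidates with the highest counts, most frequent first."""
    ranked = sorted(candidate_counts.items(), key=lambda item: item[1], reverse=True)
    return [fragment for fragment, _ in ranked[:k]]


# Hypothetical counts of how often each candidate contained the extraction result.
candidates = {"Party A: XXX": 42, "Client: XXX": 3, "Party A (Seller): XXX": 17}
selected = top_k_fragments(candidates, k=2)
```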

[0059] Step S320. Obtain the common text fragments of the first element based on the number of common text fragments.

[0060] In this example implementation, the historical extraction accuracy of the first element reflects its extraction difficulty. Adjusting the number of commonly used text fragments obtained based on this extraction difficulty further improves the efficiency and accuracy of subsequent information extraction. Besides historical extraction accuracy, other parameters, such as extraction frequency, can also be used as extraction difficulty coefficients; however, this example implementation does not impose specific limitations.

[0061] The collected commonly used text fragments can serve as a knowledge base and be reused across different types of target texts. When a new type of text is encountered, the contextual description of the element can easily be added to the knowledge base, which can greatly improve the accuracy of information extraction while adding almost no extraction time.

[0062] In step S220, a first text segment is determined from multiple text segments of the text based on the commonly used text segment; the similarity between the first text segment and the commonly used text segment satisfies the similarity condition.

[0063] In this example implementation, "text" refers to the complete text to be extracted, such as complete court judgments, contracts, medical records, company annual reports, etc.

[0064] After segmenting the text, multiple text fragments can be obtained. When segmenting text, it can be done by period, resulting in multiple text fragments, potentially numbering in the hundreds. Alternatively, other segmentation methods can be used, such as segmenting by paragraph; this example implementation does not impose a specific limitation. Segmenting a long text (e.g., thousands of words) can yield multiple shorter text fragments.
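Segmentation by period can be sketched with a regular expression. This version is an assumption-laden illustration: it treats both the ASCII period and the CJK full stop as terminators, and it does not handle decimals or abbreviations, which a production splitter would need to.

```python
import re


def split_into_fragments(text):
    """Split text into sentence-like fragments, keeping the closing period.

    Assumption: "." and "。" both end a fragment; decimals and
    abbreviations are not treated specially.
    """
    pieces = re.findall(r"[^.。]+[.。]?", text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "Party A: ACME. Party B: Beta. Term: two years."
sample_fragments = split_into_fragments(sample)
```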

[0065] In this example implementation, a second text segment can be determined from multiple text segments based on commonly used text segments. The similarity between the second text segment and the commonly used text segment satisfies a similarity condition. Then, a first text segment is determined based on the second text segment. Here, the second text segment refers to the text segment similar to the commonly used text segment selected from multiple text segments based on the similarity condition. The first text segment refers to the text segment containing the first element selected from multiple second text segments, and the similarity between the first text segment and the commonly used text segment also satisfies the similarity condition. Since the commonly used text segment is the text segment whose extraction result corresponding to the first element appears most frequently, by using the commonly used text segment of the first element to filter text segments before information extraction, the first text segment containing the first element can be accurately selected from multiple text segments, improving the accuracy and efficiency of text segment filtering.

[0066] In this example implementation, determining the second text segment from multiple text segments based on commonly used text segments may specifically include the following steps:

[0067] Step S410. Determine the commonly used text fragment vector of the commonly used text fragment, and determine the text fragment vector of each text fragment in the text.

[0068] In this example implementation, multiple text fragments and multiple commonly used text fragments can be converted into corresponding text fragment vectors. For example, a pre-trained vector model, such as the GTE (General Text Embeddings) model, can be used to obtain the embedding vectors of the corresponding text fragments and the embedding vectors of the commonly used text fragments, and then the text fragment vectors corresponding to the multiple text fragments can be stored in a vector database. The pre-trained vector model is used to convert text data into corresponding vector data. Besides this, other methods can also be used for text-to-vector conversion; this example implementation does not impose specific limitations.

[0069] Step S420. For each text segment in the text, determine the vector similarity between the text segment vector and the commonly used text segment vector.

[0070] For each text segment in the vector database, the vector similarity between each text segment and each commonly used text segment vector can be calculated. Based on the vector similarity, the text segment with the highest vector similarity to each commonly used text segment is selected from the vector database. The vector similarity can be obtained using cosine similarity, or other similarity calculation methods can be used.

[0071] Step S430. The text segment corresponding to the vector similarity that meets the similarity condition is determined as the second text segment.

[0072] In this example implementation, text segments whose vector similarity meets the similarity filtering criteria can be identified as the second text segment based on the vector similarity corresponding to each text segment.

[0073] The similarity filtering criteria can be determined according to actual needs. For example, when determining the second text segment based on the similarity filtering criteria, the text segment with the highest similarity to the commonly used text segment can be determined as the second text segment, or the text segment whose vector similarity meets a certain threshold can be determined as the second text segment. This example implementation does not make specific limitations.

[0074] The number of second text fragments can also be determined according to actual needs. For example, for each commonly used text fragment, the most similar text fragment can be retrieved from the vector database as the corresponding second text fragment. Assuming the number of commonly used text fragments is K, then a total of K most similar second text fragments will be retrieved from the vector database. Alternatively, multiple corresponding second text fragments can be retrieved for each commonly used text fragment; this example implementation does not impose a specific limitation. Based on these second text fragments, a set of most similar text fragments can be obtained.
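Steps S420 and S430 can be sketched as a brute-force cosine-similarity search that returns one best match per commonly used fragment, with an optional threshold. The toy two-dimensional vectors stand in for real embeddings, and the function names and the `min_similarity` parameter are hypothetical; a real system would query the vector database instead of looping in Python.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def retrieve_second_fragments(common_vectors, fragment_vectors, min_similarity=None):
    """For each commonly used fragment vector, return the index of the most
    similar text fragment vector, dropping matches below min_similarity."""
    hits = []
    for common in common_vectors:
        scores = [cosine(common, fragment) for fragment in fragment_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if min_similarity is None or scores[best] >= min_similarity:
            hits.append(best)
    return hits


# Toy vectors: three text fragments and K = 3 commonly used fragments.
fragment_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
common_vectors = [[0.9, 0.1], [1.0, 0.2], [0.1, 1.0]]
hits = retrieve_second_fragments(common_vectors, fragment_vectors)
```

Here two of the three commonly used fragments retrieve the same text fragment (index 0), which is the kind of duplication the subsequent selection step relies on.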

[0075] In this example implementation, by using vector similarity retrieval, multiple second text segments with the highest similarity to the commonly used text segments can be accurately retrieved from the text segments of the text, laying a foundation for subsequently determining the text segment containing the first element.

[0076] In this example implementation, after determining the second text segment, the second text segment whose repetition frequency meets the quantity threshold can be used as the first text segment.

[0077] Among the multiple second text fragments filtered from several commonly used text segments, there are usually duplicate text fragments. When selecting a text fragment, the second text fragment with the highest repetition frequency can be used as the first text fragment. The final first text fragment usually only has a few dozen characters, which greatly reduces the number of characters compared to the original text.

[0078] After obtaining the set of most similar text fragments corresponding to the first element, a majority vote can be taken over the K second text fragments. For example, if K = 3 and the set of second text fragments is [a, a, b], then the vote result is second text fragment a; if K = 7 and the set of second text fragments is [a, a, b, b, a, c, d], then the vote result is second text fragment a.
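The vote in these two examples can be reproduced with `collections.Counter`. The function name is hypothetical, and this sketch deliberately ignores ties and the case where no fragment repeats.

```python
from collections import Counter


def majority_vote(second_fragments):
    """Return the most frequently retrieved second text fragment.

    Ties and the no-duplicate case are not handled in this sketch.
    """
    winner, _ = Counter(second_fragments).most_common(1)[0]
    return winner
```

Calling it on `["a", "a", "b"]` and on `["a", "a", "b", "b", "a", "c", "d"]` selects `"a"` in both cases, as in the text.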

[0079] In this example implementation, if no single second text fragment has the highest repetition frequency among the multiple second text fragments, other commonly used text fragments of the first element are retrieved, and new second text fragments are obtained based on these other commonly used text fragments.

[0080] If no duplicate second text fragments appear in the voting results, or if multiple second text fragments have the same number of repetitions, then other commonly used text fragments of the first element are retrieved, and the number of fragments K is increased to obtain new second text fragments, until the second text fragment with the most repetitions is selected. The above steps can maximize the accuracy of the final selected text fragments.
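One possible way to sketch this retry behavior is shown below. The callable `retrieve(k)`, the `step` increment, and the `max_k` stopping bound are all assumptions introduced for illustration; the source only states that K is increased until a fragment with the most repetitions is selected.

```python
from collections import Counter

def unique_winner(fragments):
    """Return the single most repeated fragment, or None if there is no clear winner."""
    counts = Counter(fragments).most_common(2)
    if not counts or counts[0][1] < 2:
        return None  # empty input, or no fragment is repeated
    if len(counts) == 2 and counts[1][1] == counts[0][1]:
        return None  # tie for first place
    return counts[0][0]

def vote_with_retry(retrieve, initial_k, step=2, max_k=20):
    """Increase K until the vote yields a unique winner or max_k is exceeded.

    `retrieve(k)` is a hypothetical callable that returns the second text
    fragments obtained from the first k commonly used text fragments.
    """
    k = initial_k
    while k <= max_k:
        winner = unique_winner(retrieve(k))
        if winner is not None:
            return winner, k
        k += step
    return None, None

# Toy retrieval results: one second text fragment per commonly used fragment.
pool = ["a", "b", "b", "a", "a", "c"]
winner, k_used = vote_with_retry(lambda k: pool[:k], initial_k=2)
```

In this toy run, K = 2 gives no repeats, K = 4 gives a tie between a and b, and K = 6 selects a.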

[0081] In this example implementation, after obtaining the first text fragment, if the first text fragment has an incomplete description, it can be extended forward and backward by one or more text fragments, and the extended text can be used as the final first text fragment of the first element, replacing the original first text fragment as the input of the information extraction model. This can prevent information extraction errors caused by an incomplete description in the text fragment.
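The forward and backward extension can be sketched as a window over the list of text fragments. The function name, the `window` parameter, and joining with a space are assumptions for illustration; for text without spaces between sentences, an empty separator may be more suitable.

```python
def extend_fragment(fragments, index, window=1):
    """Join the fragment at `index` with up to `window` neighbors on each side."""
    start = max(0, index - window)
    end = min(len(fragments), index + window + 1)
    return " ".join(fragments[start:end])

fragments = ["A.", "B.", "C.", "D."]
extended = extend_fragment(fragments, 1)
```

The bounds are clamped, so extending the first or last fragment simply uses fewer neighbors.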

[0082] In step S230, prompt information is determined based on the first element and the first text fragment.

[0083] In this example implementation, the first element and the first text fragment can be input into the prompt information template to obtain the prompt information.

[0084] Prompt learning guides large language models to generate specific types of text by providing them with specific prompts. These prompts can take various forms, such as text, images, and audio, and are used to help the model understand the user's needs and generate the corresponding text.

[0085] After obtaining the first text fragment, the element name of the first element and the first text fragment can be combined with the prompt information template to obtain the prompt information of the first element.

[0086] In this example implementation, the prompt template can be a few-shot template. Few-shot CoT (few-shot chain-of-thought) prompting decomposes a complex problem into several sub-problems within the example, thereby guiding the large language model to decompose a complex problem in the same way when answering, which enhances the reasoning ability of the large model.
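A minimal sketch of filling such a template is given below. The template wording, the worked example inside it, and the names `FEW_SHOT_TEMPLATE` and `build_prompt` are hypothetical; the source does not disclose the actual template text.

```python
# Hypothetical few-shot chain-of-thought template with one worked example.
FEW_SHOT_TEMPLATE = """You extract one element from a text fragment.

Example:
Element: loan amount
Fragment: The plaintiff lent the defendant 50,000 yuan on 1 March 2020.
Reasoning: The fragment describes a loan. The sum that was lent is 50,000 yuan.
Answer: 50,000 yuan

Now solve:
Element: {element_name}
Fragment: {fragment}
Reasoning:"""

def build_prompt(element_name, fragment):
    """Insert the element name and the first text fragment into the template."""
    return FEW_SHOT_TEMPLATE.format(element_name=element_name, fragment=fragment)

prompt = build_prompt("interest rate", "The annual interest rate is 6%.")
```

Ending the prompt at "Reasoning:" invites the model to write its intermediate steps before the answer, mirroring the worked example.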

[0087] In step S240, information is extracted from the text based on the prompt information to obtain the extraction result of the first element.

[0088] After receiving the prompt information, it can be input into the information extraction model for information extraction to obtain the extraction result of the first element. The information extraction model can be an LLM (Large Language Model) or other types of language models. The information extraction model can be trained using multiple elements and their corresponding historical extraction results.

[0089] Taking an LLM as an example, a large language model is a natural language processing model with a large number of parameters. Training this type of model typically requires significant computational resources and training data, which enables it to handle various complex natural language tasks, including language understanding, generation, and translation. Large language models are primarily based on the Transformer architecture. The self-attention computation of a Transformer grows with the square of the input text length; therefore, reducing the length of the input text can reduce the computation time of a large language model.

[0090] In this example implementation, when extracting information using a large language model, it is no longer necessary to input the entire text; only the first text fragment corresponding to the first element needs to be input. Therefore, the input length of the large language model may be reduced from thousands of characters to tens of characters. When the input length is significantly reduced, the information extraction time is also significantly reduced. At the same time, the shorter input enables the large model to understand the prompt more accurately and focus on its key information, thereby improving the accuracy of information extraction.
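To give a rough sense of scale, the sketch below applies the quadratic relationship described above to two hypothetical input lengths. It assumes a purely quadratic cost model and ignores linear terms and fixed overhead, so the ratio is an idealized estimate rather than a measured speedup.

```python
def relative_attention_cost(full_length, fragment_length):
    """Ratio of fragment cost to full-text cost under a pure quadratic cost model."""
    return (fragment_length / full_length) ** 2

# Hypothetical lengths: a 3000-unit full text versus a 60-unit first text fragment.
ratio = relative_attention_cost(3000, 60)
```

Under these assumed lengths the ratio is about 0.0004, that is, roughly a 2500-fold reduction in the quadratic term.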

[0091] Figure 3 shows a complete flowchart of an information extraction method in a specific embodiment of this disclosure. It illustrates the steps described above in this example embodiment. The specific steps of the flowchart are as follows:

[0092] Step S510. Divide the target text to be extracted into segments according to periods.

[0093] The target text to be extracted is split at each period ("."), yielding N text fragments (N may be several hundred).
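Step S510 can be sketched with a regular expression that splits after each period. Treating both the ASCII "." and the CJK "。" as delimiters is an assumption, and this naive rule would also split decimals such as "3.5", so a production splitter would need extra handling.

```python
import re

def split_into_fragments(text):
    """Split text after each full stop (ASCII '.' or CJK '。'), dropping empty pieces."""
    pieces = re.split(r"(?<=[.。])\s*", text)
    return [p.strip() for p in pieces if p.strip()]

fragments = split_into_fragments("A is 1. B is 2。C.")
```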

[0094] Step S520. Obtain the corresponding vectors of the text segments through the pre-trained vector model and put them into the vector database.

[0095] A pre-trained vector model, such as GTE, is used to obtain the embedding vector of each text segment, and the embedding vectors are stored in the vector database.

[0096] Step S530. Collect common descriptions of the context fragments corresponding to each element to be extracted and convert them into corresponding vectors.

[0097] Collect common descriptions of the context fragments corresponding to each element to be extracted, assuming a total of K descriptions. For elements that are more difficult to extract, collect more descriptions, such as K=15; for elements that are easier to extract, collect fewer descriptions, such as K=5. Process the collected context fragments corresponding to each element through the same pre-trained vector model to obtain the corresponding K vectors.
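One way to turn an extraction difficulty coefficient into a number of descriptions K is a simple linear mapping, sketched below. The [0, 1] range of the coefficient and the linear rule are assumptions; the source only gives K = 5 and K = 15 as examples for easier and harder elements.

```python
def choose_fragment_count(difficulty, k_min=5, k_max=15):
    """Map a difficulty coefficient in [0, 1] linearly onto an integer K in [k_min, k_max]."""
    clamped = min(1.0, max(0.0, difficulty))
    return k_min + round(clamped * (k_max - k_min))

k_easy = choose_fragment_count(0.0)
k_hard = choose_fragment_count(1.0)
```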

[0098] Step S540. Retrieve the set of most similar text fragments from the vector database.

[0099] For each of the K vectors, the most similar text segment is retrieved from the vector database, and the retrieved segments together form the set of most similar text segments.

[0100] Step S550. Calculate the majority vote results for the most similar text segments.

[0101] Calculate the majority vote result for K text segments and use it as the target context segment for that element.

[0102] To prevent information extraction errors due to incomplete description of the context fragment, a fragment can be extended forward and backward from the context fragment to form the final context fragment corresponding to the element.

[0103] Step S560. Input the data into the large language model to obtain the information extraction results.

[0104] The element name, the corresponding final context fragment, and the Few-shot template are combined and input into the large language model to obtain the information extraction results.

[0105] In this example implementation, common descriptions of the context fragments corresponding to each element to be extracted are first collected. Then, using vector retrieval, the set of contexts most relevant to the element in the target text is obtained. The majority vote result of the text fragment set is used as the context fragment corresponding to the element. Finally, the element name, the final context fragment corresponding to the element, and the Few-shot template are combined and input into the large language model to obtain the information extraction result.
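The flow summarized above can be sketched end to end as follows. The bag-of-words `embed` function is a toy stand-in for a pre-trained vector model such as GTE, a plain list stands in for the vector database, and `llm` is a hypothetical callable; the one-line prompt also replaces the few-shot template for brevity.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a pre-trained vector model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def extract_element(text, element_name, descriptions, llm, window=1):
    # S510: split the target text on periods.
    fragments = [f.strip() for f in text.split(".") if f.strip()]
    # S520: embed every fragment (the list plays the role of the vector database).
    fragment_vectors = [embed(f) for f in fragments]
    # S530 + S540: embed each common description and retrieve its nearest fragment.
    hits = []
    for description in descriptions:
        query = embed(description)
        best = max(range(len(fragments)), key=lambda i: cosine(query, fragment_vectors[i]))
        hits.append(best)
    # S550: majority vote, then extend one fragment forward and backward.
    winner = Counter(hits).most_common(1)[0][0]
    start, end = max(0, winner - window), min(len(fragments), winner + window + 1)
    context = ". ".join(fragments[start:end]) + "."
    # S560: combine the element name and context into a prompt and call the model.
    prompt = f"Element: {element_name}\nContext: {context}\nAnswer:"
    return llm(prompt), context

text = ("The court heard the case in May. The defendant borrowed 50000 yuan "
        "from the plaintiff. The loan was due in 2021. Costs are borne by the defendant.")
descriptions = ["borrowed yuan from the plaintiff", "the defendant borrowed money", "loan of yuan"]
answer, context = extract_element(text, "loan amount", descriptions, llm=lambda p: "50000 yuan")
```

In this toy run two of the three descriptions retrieve the second sentence, so it wins the vote, and the model receives only that sentence and its two neighbors instead of the full text.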

[0106] To reduce the time consumption of information extraction, engineering methods can also be employed, such as deploying more large model instances or using vLLM to deploy large models. vLLM is an open-source inference and serving tool for large language models that improves inference speed through techniques such as PagedAttention, continuous batching, and distributed inference support. Combining vLLM with the information extraction method of this example implementation can further reduce the time consumption of information extraction.
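The idea of deploying more large model instances can be sketched as dispatching prompts across several workers. Here each instance is a hypothetical callable that takes a prompt and returns text, and the round-robin assignment is an assumption; the sketch does not use vLLM itself.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_in_parallel(prompts, instances):
    """Send prompts round-robin to model instances and return results in prompt order."""
    def run(job):
        index, prompt = job
        return instances[index % len(instances)](prompt)

    with ThreadPoolExecutor(max_workers=len(instances)) as pool:
        return list(pool.map(run, enumerate(prompts)))

# Two stub instances that tag their output so the dispatch is visible.
instances = [lambda p: "A:" + p, lambda p: "B:" + p]
results = extract_in_parallel(["x", "y", "z"], instances)
```

Because each element has its own short prompt, the prompts for different elements are independent and can be processed concurrently.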

[0107] The information extraction method in this example implementation can be applied to scenarios that require information extraction, such as the review of court documents, or the extraction of information from contract texts, medical records, and company annual reports.

[0108] It should be noted that although the steps of the method in this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and / or a step may be broken down into multiple steps.

[0109] It should be noted that all information disclosed herein (including but not limited to user personal information) is authorized by the user or fully authorized by all parties.

[0110] Furthermore, this disclosure also provides an information extraction device. As shown in Figure 4, the information extraction device may include a determining module 401 and a processing module 402. Wherein:

[0111] The determining module 401 can be used to determine the conventional text fragment corresponding to the first element; wherein, the number of times the extraction result corresponding to the first element appears in the conventional text fragment meets the number threshold.

[0112] The determining module 401 can also be used to determine the first text segment from multiple text segments of the text based on the conventional text segment; the similarity between the first text segment and the conventional text segment satisfies the similarity condition;

[0113] Processing module 402 can be used to determine prompt information based on the first element and the first text fragment;

[0114] The processing module 402 can also be used to extract information from the text based on the prompt information to obtain the extraction result of the first element.

[0115] In some exemplary embodiments of this disclosure, the determining module 401 may include a second text fragment determining unit and a first text fragment determining unit. Wherein:

[0116] The second text segment determination unit can be used to determine the second text segment from multiple text segments of a text based on a conventional text segment; the similarity between the second text segment and the conventional text segment satisfies the similarity condition;

[0117] The first text fragment determination unit can be used to determine the first text fragment based on the second text fragment.

[0118] In some exemplary embodiments of this disclosure, the second text segment determination unit may include a text vector determination unit, a vector similarity determination unit, and a second text segment selection unit. Wherein:

[0119] The text vector determination unit can be used to determine the conventional text fragment vector of conventional text fragments, as well as to determine the text fragment vector of each text fragment in the text;

[0120] The vector similarity determination unit can be used to determine the vector similarity between the vector of a text segment and the vector of a conventional text segment for each text segment in the text.

[0121] The second text segment selection unit can be used to determine the text segment corresponding to the vector similarity that meets the similarity condition as the second text segment.

[0122] In some exemplary embodiments of this disclosure, the first text segment determination unit may specifically be used to identify a second text segment whose repetition frequency meets a quantity threshold as the first text segment.

[0123] In some exemplary embodiments of this disclosure, the determining module 401 may further include a conventional fragment quantity determining unit and a conventional text fragment acquisition unit. Wherein:

[0124] The conventional fragment quantity determining unit can be used to obtain the extraction difficulty coefficient of the first element and determine the number of conventional text fragments of the first element based on the extraction difficulty coefficient.

[0125] The conventional text fragment acquisition unit can be used to acquire the conventional text fragments of the first element based on the number of conventional text fragments.

[0126] In some exemplary embodiments of this disclosure, the processing module 402 may further include a model input unit, which can be used to input the prompt information into the information extraction model for information extraction to obtain the extraction result of the first element.

[0127] The specific details of each module / unit in the above information extraction device have been described in detail in the corresponding method embodiment section, and will not be repeated here.

[0128] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to exemplary embodiments of this disclosure, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0129] Figure 5 shows a schematic diagram of the structure of an electronic device according to an example embodiment of the present disclosure.

[0130] It should be noted that the electronic device shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.

[0131] As shown in Figure 5, the electronic device includes a processor 501 and a memory 502, which are interconnected via a bus 503. The memory 502 stores executable instructions for the processor 501, which is configured to perform method steps of various exemplary embodiments of the present disclosure by executing the executable instructions, such as the information extraction method described above.

[0132] Exemplary embodiments of this disclosure also provide a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the information extraction method described above.

[0133] In one embodiment, the computer program product can be a tangible product containing a computer program, such as a computer-readable storage medium storing the computer program. The readable storage medium can be a storage medium based on electrical, magnetic, optical, electromagnetic, infrared, or other signals, including but not limited to: random access memory (RAM), read-only memory (ROM), magnetic tape, floppy disk, flash memory, hard disk drive (HDD), solid-state drive (SSD), etc. For example, the computer program product can be implemented as a non-volatile storage medium storing the computer program, such as read-only memory, NAND flash memory, etc.

[0134] In one implementation, the computer program product can be an intangible product containing a computer program. For example, the computer program product can be implemented as a virtual digital product, such as an executable file, installation package, or other digital file storing the computer program.

[0135] Computer program code can be written in one or more programming languages. Examples of programming languages include C, Java, and C++. Program code can execute entirely on the user's computing device, partially on the user's computing device, or as a standalone software package. It can also execute partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, such as a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via an internet connection provided by a mobile network operator).

[0136] Computer programs can be carried or transmitted via signals such as electrical, magnetic, optical, electromagnetic, and infrared rays. Electronic devices can convert the signals carrying computer programs into digital signals, thereby running the computer programs. When a computer program runs on an electronic device, its code is used to cause the electronic device to execute (more specifically, to execute by the processor of the electronic device) the method steps of various exemplary embodiments of this disclosure, such as the information extraction method described above.

[0137] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0138] It should be noted that although several modules for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more modules described above can be embodied in one module. Conversely, the features and functions of one module described above can be further divided and embodied by multiple modules.

[0139] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein.

[0140] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. An information extraction method, characterized by comprising: determining a conventional text fragment corresponding to a first element; determining, based on the conventional text fragment, a first text fragment from multiple text fragments of a text, wherein the similarity between the first text fragment and the conventional text fragment satisfies a similarity condition; determining prompt information based on the first element and the first text fragment; and extracting information from the text based on the prompt information to obtain an extraction result of the first element.

2. The information extraction method according to claim 1, characterized in that determining the first text fragment from the multiple text fragments of the text based on the conventional text fragment comprises: determining, based on the conventional text fragment, a second text fragment from the multiple text fragments of the text, wherein the similarity between the second text fragment and the conventional text fragment satisfies the similarity condition; and determining the first text fragment based on the second text fragment.

3. The information extraction method according to claim 2, characterized in that determining the second text fragment from the multiple text fragments of the text based on the conventional text fragment comprises: determining a conventional text fragment vector of the conventional text fragment, and determining a text fragment vector of each text fragment in the text; for each text fragment in the text, determining the vector similarity between the text fragment vector and the conventional text fragment vector; and determining the text fragment whose vector similarity meets the similarity condition as the second text fragment.

4. The information extraction method according to claim 2, characterized in that determining the first text fragment based on the second text fragment comprises: taking a second text fragment whose repetition frequency meets a quantity threshold as the first text fragment.

5. The information extraction method according to claim 1, characterized in that determining the conventional text fragment corresponding to the first element comprises: obtaining an extraction difficulty coefficient of the first element, and determining the number of conventional text fragments of the first element based on the extraction difficulty coefficient; and obtaining the conventional text fragments of the first element based on the number of conventional text fragments.

6. The information extraction method according to claim 1, characterized in that extracting information from the text based on the prompt information to obtain the extraction result of the first element comprises: inputting the prompt information into an information extraction model for information extraction to obtain the extraction result of the first element.

7. An information extraction device, characterized by comprising: a determining module, configured to determine a conventional text fragment corresponding to a first element; the determining module being further configured to determine, based on the conventional text fragment, a first text fragment from multiple text fragments of a text, wherein the similarity between the first text fragment and the conventional text fragment satisfies a similarity condition; and a processing module, configured to determine prompt information based on the first element and the first text fragment; the processing module being further configured to extract information from the text based on the prompt information to obtain an extraction result of the first element.

8. An electronic device, characterized by comprising: a processor; and a memory for storing one or more programs which, when executed by the processor, cause the processor to implement the information extraction method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the information extraction method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the information extraction method according to any one of claims 1 to 6.