A data extraction method, system, electronic device and storage medium
By using feature vectors of entity relationship sets for filtering and mapping before document extraction, the problem of low efficiency and accuracy in document relationship extraction is solved, achieving efficient and accurate information extraction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ANHUI IFLYTEK INTELLIGENT SYST
- Filing Date
- 2024-08-02
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, document relationship extraction methods based on template matching require a lot of manual intervention and are difficult to identify effective information in messy content, resulting in low efficiency and accuracy.
By obtaining the feature vector set of each document to be detected, a preliminary screening is performed against the entity relationship set; the selected target documents are then divided into associated fragments, and relationship mapping is performed on those fragments to extract the fragment dataset of the target business.
It improves the efficiency and accuracy of document relationship extraction, reduces manual intervention, and enhances the efficiency of document processing and the accuracy of information extraction.
Figure CN119128168B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a data extraction method, system, electronic device, and storage medium.
Background Technology
[0002] With the advancement of digital infrastructure and large-scale artificial intelligence models, the rapid construction of domain-based knowledge bases by extracting entity relationships from documents is becoming increasingly important. Currently, mainstream technologies include template-matching-based methods.
[0003] However, relying solely on template matching requires substantial manual effort to enter the relevant content according to a fixed extraction template; moreover, if the content to be extracted is disorganized, the desired content may be hard to identify, resulting in low efficiency and accuracy of document relationship extraction.
Summary of the Invention
[0004] The main technical problem addressed by this application is to provide a data extraction method, system, electronic device, and storage medium that can solve the technical problems of low efficiency and accuracy in document relationship extraction.
[0005] To solve the above-mentioned technical problems, one technical solution adopted in this application is a data extraction method, the method including:
[0006] Multiple documents to be detected are obtained, and a set of entity relationships for the target business is constructed; wherein, the set of entity relationships includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
[0007] Each of the documents to be detected is subjected to feature vector extraction to obtain the first feature vector set of each of the documents to be detected.
[0008] Using the first feature vector set, the association relationship between each of the documents to be detected and the entity relationship set is obtained, and the target document of the target business is selected from the plurality of documents to be detected;
[0009] The target document is divided into multiple associated fragments. Relationship mapping is performed on the associated fragments according to each triple of the entity relationship set to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business.
[0010] Optionally, the step of using the first feature vector set to obtain the association relationship between each of the documents to be detected and the entity relationship set, and selecting the target document of the target business from the plurality of documents to be detected, includes:
[0011] Feature extraction is performed on each triple of the entity relation set to obtain the second feature vector set of the entity relation set;
[0012] The similarity between each document to be detected and the entity relationship set is obtained using the first feature vector set and the second feature vector set;
[0013] Documents with a similarity higher than a set threshold are identified as target documents.
[0014] Optionally, the step of obtaining the similarity between each document to be detected and the entity relation set using the first feature vector set and the second feature vector set includes:
[0015] The angle between the first feature vector set and the second feature vector set of the document to be detected is obtained;
[0016] The similarity between the document to be detected and the set of entity relationships is determined based on the vector angle; wherein the similarity is negatively correlated with the vector angle.
[0017] Optionally, the step of dividing the target document into multiple associated fragments, extracting relational mappings from the associated fragments according to each triplet of the entity relation set, and obtaining multiple fragment data associated with each triplet to obtain the fragment dataset of the target business includes:
[0018] The target document is divided into multiple related segments, and the segment vectors of the related segments are obtained;
[0019] The fragment vectors are mapped to each of the triples to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business.
[0020] Optionally, the step of converting the associated fragment into a fragment vector includes:
[0021] The associated fragments are segmented into multiple sentences, and each sentence is converted into a sentence vector.
[0022] The sentence vector is used to perform semantic recognition on the sentence, and the related context sentence vectors are concatenated into the fragment vector based on the semantic recognition results.
[0023] Optionally, after the step of dividing the target document into multiple related fragments, extracting relational mappings from the related fragments according to each triplet of the entity relation set to obtain multiple fragment data associated with each triplet, and obtaining the fragment dataset of the target business, the method further includes:
[0024] Obtain at least one data extraction instruction for a set target of the target service; wherein the extraction instruction carries the extraction conditions of the set target;
[0025] Using the extraction criteria, a specific dataset corresponding to the set target is extracted from the fragment dataset.
[0026] Optionally, the extraction criteria include keywords, and a frequency threshold for the occurrence of the keywords and / or a probability threshold for the probability of the keywords appearing in each of the data segments;
[0027] The step of extracting a specific dataset corresponding to a set target from the fragment dataset using the extraction conditions includes:
[0028] Select from the fragment dataset the fragment data in which the occurrence frequency of the keyword exceeds the frequency threshold, and / or the fragment data in which the probability of the keyword appearing is greater than the probability threshold, and determine the combination of the selected fragment data as the specific dataset.
[0029] Optionally, the step of obtaining the probability of the keyword appearing in the fragment dataset includes:
[0030] The fragment data is segmented into words to obtain word combinations of the fragment;
[0031] The frequency and / or probability of the keyword in the word combination are counted to obtain the frequency and / or probability of the keyword in the fragment dataset.
[0032] To solve the above-mentioned technical problems, another technical solution adopted in this application is: a data extraction system, the data extraction system comprising:
[0033] The document extraction module is used to acquire multiple documents to be detected and to construct a set of entity relationships for the target business; wherein, the set of entity relationships includes multiple triples, each triple including two entities related to the target business and the corresponding relationship between the two entities; feature vectors are extracted from each document to be detected to obtain a first feature vector set for each document to be detected;
[0034] The relationship matching module is used to obtain the association relationship between each of the documents to be detected and the entity relationship set using the first feature vector set, and select the target document of the target business from the plurality of documents to be detected;
[0035] The extraction mapping module is used to divide the target document into multiple associated fragments, extract the associated fragments by performing relation mapping according to each triple of the entity relation set, obtain multiple fragment data associated with each triple, and obtain the fragment dataset of the target business.
[0036] To solve the above-mentioned technical problems, another technical solution adopted in this application is: an electronic device, the electronic device comprising:
[0037] Memory, used to store executable program code;
[0038] A processor is configured to call and run the executable program code from the memory, causing the electronic device to perform the data extraction method as described in any of the preceding descriptions.
[0039] To solve the above-mentioned technical problems, another technical solution adopted in this application is: a computer-readable storage medium storing a computer program, which, when executed, implements the data extraction method described in any of the above-mentioned methods.
[0040] Unlike existing technologies, this application provides a data extraction method, which includes: acquiring multiple documents to be detected and constructing an entity relationship set for a target business; wherein the entity relationship set includes multiple triples, each triple including two entities related to the target business and the corresponding relationship between the two entities; extracting feature vectors from each document to be detected to obtain a first feature vector set for each document to be detected; using the first feature vector set to obtain the association relationship between each document to be detected and the entity relationship set to select the target document of the target business from the multiple documents to be detected; dividing the target document into multiple associated fragments, and performing relationship mapping extraction on the associated fragments according to each triple of the entity relationship set to obtain multiple fragment data associated with each triple, thereby obtaining the fragment dataset of the target business.
[0041] The beneficial effects of this application embodiment are: by using the entity relationship set of the target business to perform a preliminary screening of the feature vector of the document before extracting the document to be detected, the target document is determined from the messy documents, and then the target document and the entity relationship set are mapped and extracted to obtain the fragment dataset of the target business, which can effectively improve the efficiency and accuracy of document relationship extraction.
Attached Figure Description
[0042] Figure 1 is a flowchart illustrating the first embodiment of the data extraction method of this application;
[0043] Figure 2 is a flowchart illustrating the second embodiment of the data extraction method of this application;
[0044] Figure 3 is a flowchart illustrating the third embodiment of the data extraction method of this application;
[0045] Figure 4 is a flowchart illustrating the fourth embodiment of the data extraction method of this application;
[0046] Figure 5 is an exemplary structural block diagram of the data extraction system of the data extraction method of this application;
[0047] Figure 6 is another exemplary structural block diagram of the data extraction system of the data extraction method of this application;
[0048] Figure 7 is an exemplary structural block diagram of an electronic device using the data extraction method of this application;
[0049] Figure 8 is an exemplary structural block diagram of a computer-readable storage medium for the data extraction method of this application.
Detailed Implementation
[0050] The present application will now be described in detail with reference to the accompanying drawings and embodiments.
[0051] Please refer to Figure 1, which is a flowchart illustrating the first embodiment of the data extraction method of this application. It should be noted that, provided substantially the same result is obtained, the method of this application is not limited to the process sequence shown in Figure 1. As shown in Figure 1, the method includes:
[0052] Step S110: Obtain multiple documents to be detected and construct a set of entity relationships for the target business; wherein, the set of entity relationships includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
[0053] This embodiment takes into account that, with the development of digital government construction and large-scale artificial intelligence models, the rapid construction of domain-based knowledge bases is becoming increasingly important. Taking the government service of housing provident fund withdrawal as an example, to support the business system in providing this service to the public, it is necessary to first convert relevant government policy documents, service guides, etc. into structured form (e.g., item name, application conditions, application materials, processing procedures, processing time limits, fee standards, frequently asked questions, etc.), and then construct a knowledge base for the government service domain as required. However, this requires a large amount of manual work to import documents and to extract, transform, and verify their content. When the number of documents to be extracted is large and the content is mixed and disordered, extraction efficiency and service experience are relatively poor, which increases the cost of building the system and the specialized knowledge base.
[0054] Therefore, this embodiment proposes a data extraction method based on the knowledge domain. Before extracting the document to be detected, the feature vector of the document is initially screened using the entity relationship set of the target business to identify the target document from the messy documents. Then, the target document and the entity relationship set are mapped and extracted to obtain the fragment dataset of the target business, which can effectively improve the efficiency and accuracy of document relationship extraction.
[0055] Specifically, multiple documents to be detected can be a set of original documents from which relationships need to be extracted. The target business can refer to the main business or tasks that need to be processed or focused on in a specific business scenario. The target entity relationship set can contain multiple triples related to the target business, used to define the target knowledge to be identified and extracted. Each triple can consist of two entities and the relationship between them, for example, (entity A, entity relationship, entity B).
[0056] For example, the target business could be providing housing provident fund withdrawal services, and the multiple documents to be detected could be policy documents, service guides, and other documents related to housing provident fund withdrawal services. The entity relationship set can be used to define key entities related to the housing provident fund withdrawal business and the relationships between these key entities. For example, entities could include application conditions and processing procedures, and relationships could be the conditions that need to be met and / or the steps involved.
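As an illustration only, an entity relationship set of this kind could be held as a list of triples. The sketch below is a minimal one under stated assumptions: the `Triple` type, its field names, the `triple_to_text` helper, and the sample triples are hypothetical and are not taken from the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (entity A, entity relationship, entity B) item of the entity relationship set."""
    head: str
    relation: str
    tail: str


# Hypothetical triples for the housing provident fund withdrawal business.
ENTITY_RELATION_SET = [
    Triple("housing provident fund withdrawal", "conditions to be met", "application conditions"),
    Triple("housing provident fund withdrawal", "steps involved", "processing procedures"),
]


def triple_to_text(triple: Triple) -> str:
    """Flatten a triple into plain text so it can later be vectorized like a document."""
    return " ".join(triple)
```

Flattening each triple to text is one simple way to let the same feature extraction be applied to triples and to documents.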
[0057] Step S120: Extract feature vectors from each of the documents to be detected to obtain a first feature vector set for each of the documents to be detected;
[0058] Specifically, the first feature vector set can be a set of feature vectors corresponding to each document to be detected. By treating the feature vectors of each document to be detected as a vector set, and then using this first feature vector set to select target documents that are related to the entity relationship set from each document to be detected, a preliminary extraction can be performed before the document extraction, while avoiding the detection of individual feature vectors separately, thus improving the extraction efficiency.
[0059] This can involve transforming text data into a computable form, for example by using natural language processing techniques to extract keywords, syntactic structures, and other features for feature vector extraction.
[0060] Step S130: Using the first feature vector set, obtain the association relationship between each of the documents to be detected and the entity relationship set, and select the target document of the target business from the multiple documents to be detected;
[0061] Specifically, to reduce redundancy in document processing and analysis and improve extraction efficiency, at least one target document can be identified from the documents to be detected by calculating the association between the first feature vector set of each document to be detected and the entity relation set. The association can be a correspondence between a relationship among two or more entities in the document and the relationship between the entities in any triple. These relationships can be explicit, such as relationships stated directly, and / or implicit, such as relationships inferred from context.
[0062] For example, the target document most relevant to the housing provident fund withdrawal business can be selected from the documents to be detected; such a target document may describe the application materials and / or processing procedures in detail.
[0063] Step S140: Divide the target document into multiple associated fragments, perform relation mapping extraction on the associated fragments according to each triple of the entity relation set, and obtain multiple fragment data associated with each triple to obtain the fragment dataset of the target business.
[0064] Specifically, to identify relation instances in the target document that may be related to the target entity relation set, the target document can be further divided into multiple associated fragments. These fragments can be sentences or paragraphs and may contain semantic information similar to or identical to at least one triple. Then, the relations and entities corresponding to the triples are identified and extracted from each associated fragment, resulting in multiple fragment data related to the triples. This allows each fragment data to centrally describe the entity relations within the triples, thereby obtaining a fragment dataset corresponding to the target business.
[0065] For example, specific conditions and steps related to the application materials can be identified within the associated fragments, forming structured data. All paragraph data corresponding to the triples are collected to form a complete fragment dataset. This fragment dataset can contain all key information regarding housing provident fund withdrawal procedures, facilitating automated system processing, user queries, and service provision.
[0066] This embodiment performs a preliminary screening of the document's feature vector using the entity relationship set of the target business before extracting the document to be detected. This identifies the target document from the messy documents, and then performs a relationship mapping extraction between the target document and the entity relationship set to obtain the fragment dataset of the target business. This can effectively improve the efficiency and accuracy of document relationship extraction.
[0067] Please refer to Figure 2, which is a flowchart illustrating the second embodiment of the data extraction method of this application.
[0068] In this embodiment, the method includes the following steps:
[0069] Step S210: Obtain multiple documents to be detected and construct a set of entity relationships for the target business; wherein, the set of entity relationships includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
[0070] Specifically, it can be done as described in step S110 above, and will not be repeated here.
[0071] Step S220: Extract feature vectors from each of the documents to be detected to obtain a first feature vector set for each of the documents to be detected;
[0072] Specifically, feature vector extraction can employ methods such as term frequency-inverse document frequency (TF-IDF), the bag-of-words model, word embeddings, sentence vectors, deep learning models, principal component analysis, topic models, etc., to extract feature vectors from the data in the multiple documents to be detected. This embodiment does not limit the specific methods used.
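To make one of the listed options concrete, the following is a minimal TF-IDF sketch using only the Python standard library. It is not the application's implementation: the tokenizer (lowercase alphanumeric runs), the smoothed IDF variant, and the function names `tokenize` and `tfidf_vectors` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors
```

Terms that occur in every document receive the lowest IDF, so terms specific to one document weigh more in that document's vector.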
[0073] Step S230: Extract features from each triple of the entity relation set to obtain the second feature vector set of the entity relation set;
[0074] Specifically, in order to effectively extract and analyze business-related information, simplify similarity calculation, and make complex business relationship processing more intuitive and efficient, the target entity relationship set can be treated as a document to obtain the feature vectors of each triple in the target entity relationship set, thus obtaining a second feature vector set. Then, the first and second feature vector sets can be used to calculate the similarity between the target entity relationship set and the document to be detected.
[0075] Step S240: Use the first feature vector set and the second feature vector set to obtain the similarity between each document to be detected and the entity relationship set;
[0076] Specifically, to effectively enhance the extraction results, the second feature vector of the target entity relationship set can be used. By calculating the similarity between the feature vector of the document to be detected and the feature vector of the target entity relationship set, the relevance between the document and the target business can be obtained. The similarity calculation methods can include cosine similarity, Euclidean distance, etc.
[0077] Further, step S240 includes:
[0078] The similarity between the document to be detected and the entity relationship set is determined by using the vector angle between the first feature vector set and the second feature vector set of the document to be detected; wherein the similarity is negatively correlated with the vector angle.
[0079] Specifically, this embodiment uses the cosine similarity method to calculate the feature vector sets corresponding to each of the two documents to obtain the similarity between them. The similarity is negatively correlated with the angle between the vectors; that is, the smaller the angle, the higher the similarity; the larger the angle, the lower the similarity. Please refer to formula (1):
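[0080]
$$SC(d_i, d_j) = \cos(d_i, d_j) = \frac{\sum_{k} w_{ki}\, w_{kj}}{\sqrt{\sum_{k} w_{ki}^{2}}\,\sqrt{\sum_{k} w_{kj}^{2}}} \tag{1}$$
[0081] In the formula, $SC(d_i, d_j)$ represents the similarity; $\cos(d_i, d_j)$ represents the cosine of the included angle between the two vectors; $d_i$ represents the i-th document; $d_j$ represents the j-th document; $w_{ki}$ represents the weight of feature term k in document i; and $w_{kj}$ represents the weight of feature term k in document j.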
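The cosine similarity described for formula (1) can be sketched as follows. The sparse `term -> weight` dictionary representation and the convention of returning 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(vec_i: dict[str, float], vec_j: dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors (term -> weight)."""
    # Numerator: sum over shared feature terms k of w_ki * w_kj.
    dot = sum(weight * vec_j.get(term, 0.0) for term, weight in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_i * norm_j)
```

A smaller angle gives a value closer to 1, which matches the negative correlation between similarity and vector angle stated above.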
[0082] Step S250: The document to be detected with a similarity higher than a set threshold is identified as the target document.
[0083] Specifically, to select documents highly relevant to the target business, documents whose similarity exceeds a set threshold can be identified as target documents based on the similarity results from the previous step. Furthermore, the similarity between the current document to be detected and the target entity relationship set may be computed incrementally, comparing one part of the first feature vector set with one part of the second feature vector set at a time. If the similarity for such a part already meets the set threshold, the current document to be detected is identified as the target document, and the remaining feature vectors in its first feature vector set need not be calculated or compared, which improves the efficiency of the initial screening of documents to be detected.
[0084] The threshold can be a predefined boundary value used to filter out documents that are irrelevant to the target entity relationship set. The similarity scores SC, which fall in the range (0, 1), are ranked from high to low, and target documents are selected according to whether each score meets the similarity threshold. Taking a similarity threshold of D = 0.8 as an example, documents with similarity scores higher than or equal to the threshold are identified as target documents, while those lower than the threshold are ignored, which can improve the efficiency and accuracy of knowledge extraction.
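The threshold screening with early termination described above might look like the sketch below. It assumes dense vectors of equal length and treats a single vector pair reaching the threshold as sufficient; the function names and the sample data in the usage note are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_target_document(doc_vectors: Sequence[Vector],
                       relation_vectors: Sequence[Vector],
                       threshold: float = 0.8) -> bool:
    """Return True as soon as one vector pair reaches the threshold."""
    for doc_vec in doc_vectors:
        for rel_vec in relation_vectors:
            if cosine(doc_vec, rel_vec) >= threshold:
                return True  # early exit: remaining vectors are not compared
    return False


def screen_documents(documents: dict[str, list[Vector]],
                     relation_vectors: Sequence[Vector],
                     threshold: float = 0.8) -> list[str]:
    """Names of the documents kept as target documents."""
    return [name for name, vectors in documents.items()
            if is_target_document(vectors, relation_vectors, threshold)]
```

For instance, `screen_documents({"guide": [[1.0, 0.0]], "news": [[0.0, 1.0]]}, [[1.0, 0.0]])` keeps only `"guide"`, because the `"news"` vector is orthogonal to the relation vector.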
[0085] Step S260: Divide the target document into multiple related segments and obtain the segment vectors of the related segments;
[0086] Specifically, to facilitate subsequent calculations and comparisons, by vectorizing the associated fragments, the semantic information of the text can be transformed into numerical form, resulting in fragment vectors corresponding to each associated fragment. These fragment vectors can then be used to describe various characteristics and attributes of the text fragments.
[0087] Further, step S260 includes:
[0088] The target document is divided into multiple related segments, each related segment is further divided into multiple sentences, and each sentence is converted into a sentence vector. Semantic recognition is performed on the sentences to obtain semantic recognition results. Based on the semantic recognition results, the sentence vectors of related contexts are concatenated into the segment vector.
[0089] Specifically, to refine the document content and make processing and analysis more precise, related segments can be divided according to sentence boundaries, and text information can be converted into numerical representations. This allows the system to use the converted sentence vectors to identify and understand the semantic content of sentences, thereby determining the relationships between sentences. Semantic recognition can include identifying entities, relationships, and sentiments within sentences.
[0090] Then, based on the results of semantic recognition, related context sentences can be concatenated into a fragment vector. The context sentence vector refers to semantically related data that collectively describe a complete situation or event, thus preserving the complete semantics and context. This fragment vector is then concatenated and merged with the current sentence vector to form a new fragment vector.
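The sentence splitting and concatenation of related context sentence vectors can be illustrated with a toy sketch. The application does not specify how semantic recognition is performed, so this example substitutes a simple stand-in: two adjacent sentences are treated as related context when the cosine similarity of their vectors reaches `min_similarity`. The fixed `VOCABULARY`, the count-based sentence vectors, and all function names are assumptions.

```python
import math
import re

# Hypothetical fixed vocabulary used to build toy sentence vectors.
VOCABULARY = ["fund", "withdrawal", "application", "materials", "fee"]


def split_sentences(fragment: str) -> list[str]:
    """Split an associated fragment on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?;]+", fragment) if s.strip()]


def sentence_vector(sentence: str) -> list[float]:
    """Count how often each vocabulary term occurs in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [float(tokens.count(term)) for term in VOCABULARY]


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fragment_vectors(fragment: str, min_similarity: float = 0.5) -> list[list[float]]:
    """Group consecutive related sentences and concatenate their vectors."""
    groups: list[list[list[float]]] = []
    for vec in map(sentence_vector, split_sentences(fragment)):
        if groups and cosine(groups[-1][-1], vec) >= min_similarity:
            groups[-1].append(vec)  # related context: extend the current group
        else:
            groups.append([vec])    # unrelated: start a new group
    # Concatenation: flatten each group of sentence vectors into one fragment vector.
    return [[x for vec in group for x in vec] for group in groups]
```

Because vectors are concatenated, fragment vectors here vary in length with the number of sentences grouped; a real system would likely need a fixed-size representation before comparing them with triple vectors.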
[0091] Step S270: Map the fragment vectors to each of the triples to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business.
[0092] Specifically, in order to integrate scattered fragment data into more comprehensive and structured information, relational mapping can be performed between the fragment data and the triples to extract the semantic associations between fragment vectors and triples. This ensures that the extracted fragment data is related to the entity relationships of the target business, thereby forming a fragment dataset corresponding to the target business. This fragment dataset can be used to describe the various entity relationships and specific content related to the target business, thereby enriching the information in the knowledge graph or business model and improving its accuracy and coverage.
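One way to picture the relation mapping step is to attach each fragment to every triple whose vector it resembles closely enough. The sketch below is an assumption-laden illustration, not the mapping defined by the application: it presumes fragment and triple vectors share one dimensionality, uses cosine similarity with a hypothetical `min_similarity` cut-off, and represents triples as plain tuples.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_fragments_to_triples(fragments, triples, min_similarity=0.5):
    """Build the fragment dataset: triple -> list of associated fragment texts.

    fragments: list of (text, vector) pairs
    triples:   list of (triple, vector) pairs, where triple is a hashable tuple
    """
    dataset = defaultdict(list)
    for text, fragment_vec in fragments:
        for triple, triple_vec in triples:
            if cosine(fragment_vec, triple_vec) >= min_similarity:
                dataset[triple].append(text)
    return dict(dataset)
```

A fragment may be associated with more than one triple under this scheme, and triples with no matching fragment simply do not appear in the result.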
[0093] In this embodiment, by extracting features from each triple of the entity relationship set, and calculating the similarity between the feature vector of the document to be detected and the feature vector of the entity relationship set, documents related to the entity relationship set can be identified more accurately. At the same time, by calculating the feature vector set, the similarity between the document and the entity relationship set can be quickly evaluated without performing full-text analysis one by one.
[0094] Please refer to Figure 3, which is a flowchart illustrating the third embodiment of the data extraction method of this application.
[0095] In this embodiment, the method includes the following steps:
[0096] Step S310: Obtain multiple documents to be detected and construct a set of entity relationships for the target business; wherein, the set of entity relationships includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
[0097] Specifically, it can be done as described in step S110 above, which will not be repeated in this embodiment.
[0098] Step S320: Extract feature vectors from each of the documents to be detected to obtain a first feature vector set for each of the documents to be detected;
[0099] Specifically, it can be done as described in step S120 above, which will not be repeated in this embodiment.
[0100] Step S330: Using the first feature vector set, obtain the association relationship between each of the documents to be detected and the entity relationship set, and select the target document of the target business from the multiple documents to be detected;
[0101] Specifically, it can be done as described in step S130 above, which will not be repeated in this embodiment.
[0102] Step S340: Divide the target document into multiple associated fragments, perform relation mapping extraction on the associated fragments according to each triple of the entity relation set, and obtain multiple fragment data associated with each triple to obtain the fragment dataset of the target business.
[0103] Specifically, it can be done as described in step S140 above, which will not be repeated in this embodiment.
[0104] Step S350: Obtain a data extraction instruction for at least one set target of the target business; wherein the extraction instruction carries the extraction conditions of the set target.
[0105] This embodiment takes into account that different roles have different needs when handling target business. For example, when the target business is handling housing provident fund, individuals can obtain the specific documents and procedures required for the process; enterprises can understand the specific requirements and documents needed for enterprise processing; and staff can obtain the latest policies and operating procedures. Therefore, by further dividing roles through user-defined data extraction instructions, the accuracy and relevance of information can be improved.
[0106] Specifically, a specific target can be different target roles, such as individuals, businesses, or service personnel. Extraction criteria can be used to filter and select data in a dataset that matches a set target. Users can customize specific targets and corresponding extraction criteria through data extraction commands. This allows the system to retrieve the extraction criteria carried by the data extraction command when it receives a data extraction command for at least one set target of the target business, and then extract relevant data from the dataset.
[0107] Step S360: Extract a specific dataset corresponding to the set target from the fragment dataset using the extraction conditions.
[0108] Specifically, once the extraction conditions are obtained, data related to the set target can be extracted from the fragment dataset to obtain the specific dataset. For example, SQL queries can be used to filter records that meet the conditions from a database; data processing tools, such as the Pandas library in Python, can also be used to extract the corresponding data from a data frame through conditional filtering.
[0109] This embodiment allows users to customize the set target and its extraction conditions through data extraction instructions, which improves the flexibility of relationship extraction, lets different roles access the information relevant to them, and enhances the security of the information data.
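As a non-limiting illustration of the SQL-based filtering mentioned above, the following minimal sketch stores a fragment dataset in an in-memory SQLite table and selects the rows for one set target. The table schema, the `role` column, the sample rows, and the function name `extract_specific_dataset` are assumptions made for this example and are not taken from this application.

```python
import sqlite3

# Hypothetical fragment dataset: (triple, role, text). The schema and the
# sample rows are assumptions made for illustration only.
FRAGMENTS = [
    ("individual-requires-id_card", "individual", "Applicants must present an ID card."),
    ("enterprise-requires-license", "enterprise", "Enterprises must submit a business license."),
    ("staff-follows-procedure", "staff", "Staff follow the latest operating procedure."),
]


def extract_specific_dataset(rows, role):
    """Return the fragment texts whose role column matches the set target."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE fragments (triple TEXT, role TEXT, text TEXT)")
        conn.executemany("INSERT INTO fragments VALUES (?, ?, ?)", rows)
        # The extraction condition is passed as a bound parameter, not
        # interpolated into the SQL string.
        cursor = conn.execute(
            "SELECT text FROM fragments WHERE role = ? ORDER BY rowid", (role,)
        )
        return [text for (text,) in cursor]
    finally:
        conn.close()


specific = extract_specific_dataset(FRAGMENTS, "enterprise")
```

Here the extraction condition is simply an equality test on the role; richer conditions could be expressed with additional `WHERE` clauses in the same way.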
[0110] Further, please refer to Figure 4, which is a flowchart of the fourth embodiment of the data extraction method of this application. This embodiment refines steps S350 and S360 of the third embodiment described above. Step S350 includes:
[0111] Step S410: Obtain at least one data extraction instruction for a set target of the target business; wherein the data extraction instruction carries the extraction conditions of the set target, and the extraction conditions include a keyword, together with a frequency threshold for occurrences of the keyword and/or a probability threshold for the probability of the keyword appearing in each piece of fragment data.
[0112] Step S360 includes:
[0113] Step S420: Select from the fragment dataset the fragment data in which the frequency of occurrence of the keyword exceeds the frequency threshold and/or the probability of the keyword appearing in the fragment data is greater than the probability threshold, and determine the combination of the selected fragment data as the specific dataset.
[0114] Specifically, keywords can be the central terms that need attention in the fragment data, and these keywords can be related to the role level of the set target. The frequency of keyword occurrence can be the total number of times a keyword appears in a given piece of fragment data, and is used to assess the significance of the keyword in that fragment. If a keyword appears more frequently in a fragment, the fragment is more relevant to the keyword. The probability of a keyword appearing in the fragment data can be the relative frequency with which the keyword appears across all fragment data, which reflects how widespread the keyword is and how relevant the dataset is overall.
[0115] The frequency threshold can be the minimum number of times a keyword must appear in a piece of fragment data. For example, if the frequency threshold is set to 10, fragment data containing the keyword at least 10 times is selected. The probability threshold can be the minimum probability of the keyword appearing in a piece of fragment data. For example, if the probability threshold is set to 0.5, fragment data in which the probability of the keyword appearing is greater than 0.5 is selected. The selected fragment data can then be combined to form the specific dataset.
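As a non-limiting illustration, the threshold-based selection can be sketched as follows. The `FragmentStats` record and the function `select_fragments` are hypothetical names introduced for this example; the sketch assumes that the per-fragment frequency and probability have already been computed, reads "and/or" as a logical OR, and treats the frequency threshold as a minimum count, following the example with a threshold of 10.

```python
from dataclasses import dataclass


@dataclass
class FragmentStats:
    text: str
    frequency: int      # times the keyword occurs in this fragment
    probability: float  # probability of the keyword for this fragment


def select_fragments(stats, freq_threshold=None, prob_threshold=None):
    """Keep fragments that pass either configured threshold (OR semantics)."""
    selected = []
    for s in stats:
        # Assumption: the frequency threshold is a minimum count (>=),
        # while the probability must be strictly greater than its threshold.
        by_freq = freq_threshold is not None and s.frequency >= freq_threshold
        by_prob = prob_threshold is not None and s.probability > prob_threshold
        if by_freq or by_prob:
            selected.append(s.text)
    return selected
```

Leaving one threshold as `None` disables that criterion, which corresponds to an extraction condition that carries only a frequency threshold or only a probability threshold.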
[0116] Further, the step of obtaining the probability of the keyword appearing in the fragment dataset includes:
[0117] The fragment data is segmented into words to obtain word combinations of the fragment;
[0118] Specifically, word segmentation can be the process of breaking down a piece of text into individual words or phrases. For example, irrelevant characters and symbols, such as punctuation marks and special characters, can be removed; all text can also be converted to a uniform letter case so that case variants of the same word are not counted separately. The word combinations can be all the individual words or phrases extracted from the fragment data, and these word combinations can be used for further analysis.
[0119] The frequency and / or probability of the keyword in the word combination are counted to obtain the frequency and / or probability of the keyword in the fragment dataset.
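As a non-limiting illustration, word segmentation and keyword counting can be sketched as follows. The function names are hypothetical. The sketch assumes whitespace-delimited text; Chinese text would need a dedicated word segmenter in place of `str.split`. It also uses the keyword's share of all words as a simple stand-in for the probability, which is not necessarily the quantity defined by formula (2).

```python
import re
from collections import Counter


def segment_words(fragment):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", fragment.lower())
    return cleaned.split()


def keyword_statistics(fragment, keyword):
    """Return (count, relative frequency) of the keyword in one fragment."""
    words = segment_words(fragment)
    count = Counter(words)[keyword.lower()]
    # Relative frequency: keyword occurrences divided by total word count.
    share = count / len(words) if words else 0.0
    return count, share
```

For instance, `keyword_statistics("Fund, fund; the FUND rules.", "fund")` counts the three case variants as the same word.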
[0120] Specifically, when calculating the frequency of occurrence of the keyword in the fragment dataset and/or the probability of the keyword appearing in the fragment data, P(c|w;θ), the probability P(c|w;θ) must satisfy two conditions: P(c|w;θ) lies between 0 and 1; and, over the fragment dataset, the values of P(c|w;θ) sum to 1. This embodiment adopts the following formula (2):
[0121] $$P(c \mid w; \theta) = \frac{\exp(u_c^{\top} v_w)}{\sum_{c' \in U} \exp(u_{c'}^{\top} v_w)} \tag{2}$$
[0122] In the formula, $U$ represents the fragment dataset; $c$ represents a word in the context; $c'$ represents any word in the fragment dataset; $u_c$ is the vector representation of $c$; $v_w$ is the vector representation of the keyword $w$; and $u_{c'}$ is the vector representation of $c'$.
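As a non-limiting illustration, and assuming that formula (2) is a softmax over dot products of the vectors defined above, the probability can be computed as in the following minimal sketch. The function name and the toy vectors in the usage example are hypothetical. A softmax satisfies both stated conditions by construction: every value lies between 0 and 1, and the values sum to 1.

```python
import math


def keyword_context_probabilities(v_w, context_vectors):
    """Softmax over the dot products u_c . v_w for every candidate word c."""
    scores = {
        c: sum(a * b for a, b in zip(u_c, v_w))
        for c, u_c in context_vectors.items()
    }
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy two-dimensional vectors, chosen only to exercise the function.
probs = keyword_context_probabilities(
    [1.0, 0.0],
    {"fund": [2.0, 0.0], "loan": [0.0, 1.0], "tax": [2.0, 5.0]},
)
```

Subtracting the largest score before exponentiating does not change the result but avoids overflow when the dot products are large.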
[0123] This embodiment improves the efficiency and accuracy of data processing by selecting, from the fragment dataset, the fragment data in which the frequency of the keyword exceeds the frequency threshold and/or the probability of the keyword appearing is greater than the probability threshold. It also provides strong support for data analysis and business decision-making.
[0124] Please refer to Figure 5, which is an exemplary structural block diagram of the data extraction system for the data extraction method of this application. As shown in Figure 5, the data extraction system 500 of this application may include a document extraction module 501, a relationship matching module 502, and an extraction mapping module 503.
[0125] The document extraction module 501 obtains multiple documents to be detected and constructs a set of entity relationships for the target business; wherein, the set of entity relationships includes multiple triples, each triple including two entities related to the target business and the corresponding relationship between the two entities; feature vectors are extracted from each document to be detected to obtain a first feature vector set for each document to be detected;
[0126] The relationship matching module 502 uses the first feature vector set to obtain the association relationship between each of the documents to be detected and the entity relationship set, and selects the target document of the target business from the multiple documents to be detected;
[0127] The extraction mapping module 503 divides the target document into multiple associated fragments, performs relation mapping extraction on the associated fragments according to each triple of the entity relation set, and obtains multiple fragment data associated with each triple, thereby obtaining the fragment dataset of the target business.
[0128] This embodiment performs a preliminary screening of the document's feature vector using the entity relationship set of the target business before extracting the document to be detected. This identifies the target document from the messy documents, and then performs a relationship mapping extraction between the target document and the entity relationship set to obtain the fragment dataset of the target business. This can effectively improve the efficiency and accuracy of document relationship extraction.
[0129] Furthermore, the extraction mapping module 503 is also used to obtain at least one data extraction instruction for a set target of the target service; wherein the extraction instruction carries the extraction conditions of the set target; and extracts a specific dataset corresponding to the set target from the fragment dataset using the extraction conditions.
[0130] Further, please refer to Figure 6, which is another exemplary structural block diagram of the data extraction system for the data extraction method of this application. On the basis of Figure 5 above, the data extraction system 500 further includes:
[0131] The relation extraction and processing unit 600 is connected to the document extraction module 501, the relationship matching module 502, and the extraction mapping module 503, respectively.
[0132] Knowledge base 601 is connected to the document extraction module 501 and the relationship matching module 502;
[0133] Configuration library 602 is connected to the extraction mapping module 503.
[0134] Specifically, the relation extraction and processing unit 600 coordinates and manages the document extraction module 501, the relationship matching module 502, and the extraction mapping module 503, ensuring smooth information transmission between the modules and avoiding data loss or duplicate processing.
[0135] Knowledge base 601 can provide prior knowledge of documents and entity relationships, so that the document extraction module 501 can retrieve previously stored data, such as the documents to be detected and the entity relationship set, from knowledge base 601. The relationship matching module 502 is also connected to knowledge base 601, so that it can obtain entity relationship information from knowledge base 601 to calculate association relationships.
[0136] The configuration library 602 allows mapping rules to be defined for different businesses. The extraction mapping module 503 can use the mapping rules and parameters defined in the configuration library 602 to perform relation mapping extraction and obtain a detailed fragment dataset. Furthermore, users can flexibly configure set targets and their extraction conditions through the configuration library 602 to further extract role-level entity relationships from the fragment dataset.
[0137] Compared with the prior art, this embodiment coordinates and manages the knowledge base 601, configuration library 602, document extraction module 501, relationship matching module 502, and extraction mapping module 503 through the relation extraction and processing unit 600, which enables the data extraction system 500 to operate efficiently and obtain accurate relationship extraction results.
[0138] Please refer to Figure 7, which is an exemplary structural block diagram of an electronic device for the data extraction method of this application. As shown in Figure 7, the electronic device 700 of this application may include a processor 701 and a memory 702, wherein the processor 701 and the memory 702 communicate via a bus. The memory 702 stores program instructions for relation extraction which, when executed by the processor 701, implement the method steps described above in conjunction with the accompanying drawings:
[0139] Multiple documents to be detected are obtained, and a set of entity relationships for the target business is constructed; wherein, the set of entity relationships includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
[0140] Each of the documents to be detected is subjected to feature vector extraction to obtain the first feature vector set of each of the documents to be detected.
[0141] The association between each of the documents to be detected and the entity relationship set is obtained using the first feature vector set, and the target document of the target business is selected from the plurality of documents to be detected.
[0142] The target document is divided into multiple associated fragments. Relationship mapping is performed on the associated fragments according to each triple of the entity relationship set to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business.
[0143] Optionally, the processor 701 performs the step of using the first feature vector set to obtain the association relationship between each of the documents to be detected and the entity relationship set, and selecting the target document of the target business from the plurality of documents to be detected, including:
[0144] Feature extraction is performed on each triple of the entity relation set to obtain the second feature vector set of the entity relation set;
[0145] The similarity between each document to be detected and the entity relationship set is obtained using the first feature vector set and the second feature vector set;
[0146] Documents with a similarity higher than a set threshold are identified as target documents.
[0147] Optionally, the processor 701 performs the step of obtaining the similarity between each document to be detected and the entity relation set using the first feature vector set and the second feature vector set, including:
[0148] The vector angle between the first feature vector set of the document to be detected and the second feature vector set is obtained;
[0149] The similarity between the document to be detected and the set of entity relationships is determined based on the vector angle; wherein the similarity is negatively correlated with the vector angle.
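As a non-limiting illustration, one similarity measure that decreases as the vector angle grows is the cosine of that angle. The following minimal sketch assumes that each feature vector set is a list of equal-length, non-zero vectors and takes the best-matching pair as the document's similarity; both choices, and all function names, are assumptions made for this example rather than details given in this application.

```python
import math


def vector_angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp floating-point rounding
    return math.acos(cosine)


def similarity(doc_vectors, triple_vectors):
    """Best cosine over all vector pairs; a smaller angle gives a larger value."""
    return max(
        math.cos(vector_angle(d, t))
        for d in doc_vectors
        for t in triple_vectors
    )


def select_target_documents(documents, triple_vectors, threshold):
    """Return names of documents whose similarity exceeds the set threshold."""
    return [
        name
        for name, vectors in documents.items()
        if similarity(vectors, triple_vectors) > threshold
    ]
```

Because the cosine decreases monotonically on the interval from 0 to π, the similarity computed this way is negatively correlated with the vector angle.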
[0150] Optionally, the processor 701 executes the step of dividing the target document into multiple associated fragments, performing relation mapping extraction on the associated fragments according to each triple of the entity relationship set, and obtaining multiple fragment data associated with each triple to obtain the fragment dataset of the target business, including:
[0151] The target document is divided into multiple associated fragments, and the fragment vectors of the associated fragments are obtained;
[0152] The fragment vectors are mapped to each of the triples to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business.
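As a non-limiting illustration, mapping fragment vectors to triples can be sketched as grouping each fragment under every triple whose vector it resembles closely enough. The cosine criterion, the threshold value, the function names, and the toy vectors in the usage example are assumptions made for this example; this application does not specify the mapping rule at this level of detail.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_fragments_to_triples(fragment_vectors, triple_vectors, threshold=0.7):
    """Group fragment ids under every triple they resemble closely enough."""
    dataset = {triple: [] for triple in triple_vectors}
    for fragment_id, f_vec in fragment_vectors.items():
        for triple, t_vec in triple_vectors.items():
            if cosine(f_vec, t_vec) >= threshold:
                dataset[triple].append(fragment_id)
    return dataset


# Hypothetical triples (entity, relation, entity) with toy vectors.
triples = {
    ("individual", "submits", "id_card"): [1.0, 0.0],
    ("enterprise", "submits", "license"): [0.0, 1.0],
}
fragments = {"f1": [0.9, 0.1], "f2": [0.1, 0.9], "f3": [0.7, 0.7]}
fragment_dataset = map_fragments_to_triples(fragments, triples)
```

A fragment may be associated with more than one triple, as `f3` is here, which is consistent with each triple collecting multiple fragment data.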
[0153] Optionally, the processor 701 performs the step of obtaining the fragment vectors of the associated fragments, including:
[0154] The associated fragments are segmented into multiple sentences, and each sentence is converted into a sentence vector.
[0155] The sentence vector is used to perform semantic recognition on the sentence, and the related context sentence vectors are concatenated into the fragment vector based on the semantic recognition results.
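As a non-limiting illustration, the construction of fragment vectors from sentence vectors can be sketched as follows. The toy `embed` function (term counts over a small vocabulary) stands in for a real sentence encoder, and cosine similarity between adjacent sentences stands in for semantic recognition; both are assumptions made for this example, as are all function names and the threshold.

```python
import math
import re


def split_sentences(fragment):
    """Split on sentence-ending punctuation (Latin and CJK)."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", fragment)
    return [p.strip() for p in parts if p.strip()]


def embed(sentence, vocabulary):
    """Toy sentence vector: count of each vocabulary term in the sentence."""
    words = re.findall(r"\w+", sentence.lower())
    return [float(words.count(term)) for term in vocabulary]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fragment_vectors(fragment, vocabulary, threshold=0.5):
    """Concatenate vectors of adjacent sentences judged semantically related."""
    groups = []
    for sentence in split_sentences(fragment):
        vector = embed(sentence, vocabulary)
        if groups and cosine(groups[-1][-1], vector) >= threshold:
            groups[-1].append(vector)  # same context: extend the group
        else:
            groups.append([vector])    # new context: start a new group
    # Each fragment vector is the concatenation of its group's sentence vectors.
    return [[x for vector in group for x in vector] for group in groups]
```

With the vocabulary `["fund", "loan", "tax"]`, the text "The fund pays a loan. The fund loan is repaid. Tax is due." yields two fragment vectors: the first two sentences are joined because their vectors point the same way, and the third starts a new context.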
[0156] Optionally, after executing the step of dividing the target document into multiple associated fragments, performing relation mapping extraction on the associated fragments according to each triple of the entity relationship set, obtaining multiple fragment data associated with each triple, and obtaining the fragment dataset of the target business, the processor 701 further performs:
[0157] Obtain at least one data extraction instruction for a set target of the target business; wherein the data extraction instruction carries the extraction conditions of the set target;
[0158] Using the extraction criteria, a specific dataset corresponding to the set target is extracted from the fragment dataset.
[0159] Optionally, the extraction conditions include a keyword, together with a frequency threshold for occurrences of the keyword and/or a probability threshold for the probability of the keyword appearing in each piece of fragment data;
[0160] The step, performed by the processor 701, of extracting a specific dataset corresponding to the set target from the fragment dataset using the extraction conditions includes:
[0161] Select from the fragment dataset the fragment data in which the frequency of occurrence of the keyword exceeds the frequency threshold, and/or select the fragment data in which the probability of the keyword appearing is greater than the probability threshold, and determine the combination of the selected fragment data as the specific dataset.
[0162] Optionally, the processor 701 performs the step of obtaining the probability of the keyword appearing in the fragment dataset, including:
[0163] The fragment data is segmented into words to obtain word combinations of the fragment;
[0164] The frequency and / or probability of the keyword in the word combination are counted to obtain the frequency and / or probability of the keyword in the fragment dataset.
[0165] Please refer to Figure 8, which is an exemplary structural block diagram of a computer-readable storage medium for the data extraction method of this application. As shown in Figure 8, the computer-readable storage medium 800 stores a computer program 801. When the computer program 801 is run on a computer, it causes the computer to perform the related method steps described above, thereby implementing the data extraction method of the above embodiments.
[0166] The above scheme obtains multiple documents to be detected and constructs an entity relationship set for the target business. The entity relationship set includes multiple triples, each triple including two entities related to the target business and their corresponding relationship. Feature vectors are extracted from each document to be detected to obtain a first feature vector set. The association between each document and the entity relationship set is obtained using the first feature vector set, and the target document for the target business is selected from the multiple documents. The target document is divided into multiple associated fragments, and relation mapping extraction is performed on the associated fragments according to each triple of the entity relationship set to obtain multiple fragment data associated with each triple, thus obtaining the fragment dataset of the target business. By performing a preliminary screening of the document feature vectors using the entity relationship set of the target business before extracting the documents to be detected, the target document is identified from the messy documents. Then, relation mapping extraction is performed between the target document and the entity relationship set to obtain the fragment dataset of the target business, which effectively improves the efficiency and accuracy of document relationship extraction.
[0167] In the several embodiments provided in this application, it should be understood that the disclosed methods, electronic devices, and storage media can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or other forms.
[0168] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.
[0169] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0170] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0171] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A data extraction method, characterized in that the method includes:
obtaining multiple documents to be detected and constructing an entity relationship set for the target business; wherein the entity relationship set includes multiple triples, and each triple includes two entities related to the target business and the corresponding relationship between the two entities;
performing feature vector extraction on each of the documents to be detected to obtain a first feature vector set of each of the documents to be detected;
using the first feature vector set to obtain the association relationship between each of the documents to be detected and the entity relationship set, and selecting the target document of the target business from the multiple documents to be detected, including: performing feature extraction on each triple of the entity relationship set to obtain a second feature vector set of the entity relationship set; obtaining the similarity between each document to be detected and the entity relationship set using the first feature vector set and the second feature vector set; identifying documents to be detected whose similarity is higher than a set threshold as target documents; and, when the similarity between some feature vectors in the first feature vector set and some feature vectors in the second feature vector set meets the set threshold, identifying the current document to be detected as the target document and stopping the calculation and comparison of the other feature vectors of the current document to be detected, other than said some feature vectors in the first feature vector set;
dividing the target document into multiple associated fragments, and performing relation mapping extraction on the associated fragments according to each triple of the entity relationship set to obtain multiple fragment data associated with each triple, thereby obtaining the fragment dataset of the target business;
obtaining at least one data extraction instruction for a set target of the target business; wherein the data extraction instruction carries the extraction conditions of the set target; and
extracting, using the extraction conditions, a specific dataset corresponding to the set target from the fragment dataset.
2. The data extraction method according to claim 1, characterized in that, The step of obtaining the similarity between each document to be detected and the entity relationship set using the first feature vector set and the second feature vector set includes: The angle between the first feature vector set and the second feature vector set of the document to be detected is obtained; The similarity between the document to be detected and the set of entity relationships is determined based on the vector angle; wherein the similarity is negatively correlated with the vector angle.
3. The data extraction method according to any one of claims 1 to 2, characterized in that the step of dividing the target document into multiple associated fragments, performing relation mapping extraction on the associated fragments according to each triple of the entity relationship set, and obtaining multiple fragment data associated with each triple to obtain the fragment dataset of the target business includes: dividing the target document into multiple associated fragments and obtaining the fragment vectors of the associated fragments; and mapping the fragment vectors to each of the triples to obtain multiple fragment data associated with each triple, thereby obtaining the fragment dataset of the target business.
4. The data extraction method according to claim 3, characterized in that, The step of obtaining the fragment vector of the associated fragment includes: The associated fragments are segmented into multiple sentences, and each sentence is converted into a sentence vector. The sentence is semantically recognized to obtain the semantic recognition result, and the sentence vectors of the related contexts are concatenated into the fragment vector based on the semantic recognition result.
5. The data extraction method according to claim 1, characterized in that, The extraction criteria include keywords, and a frequency threshold for the occurrence of the keywords and / or a probability threshold for the probability of the keywords appearing in each of the data segments. The step of extracting a specific dataset corresponding to a set target from the fragment dataset using the extraction conditions includes: Select from the fragment dataset the fragments in which the occurrence frequency of the keyword exceeds the frequency threshold and / or select the fragments in which the probability of the keyword appearing in the fragment dataset is greater than the probability threshold, and determine the combination of the fragments as the specific dataset.
6. The data extraction method according to claim 5, characterized in that, The steps for obtaining the probability of the keyword appearing in the fragment dataset include: The fragment data is segmented into words to obtain word combinations of the fragment; The frequency and / or probability of the keyword in the word combination are counted to obtain the frequency and / or probability of the keyword in the fragment dataset.
7. A data extraction system, characterized in that the data extraction system includes:
a document extraction module, used to obtain multiple documents to be detected and to construct an entity relationship set for the target business; wherein the entity relationship set includes multiple triples, each triple including two entities related to the target business and the corresponding relationship between the two entities; and to extract feature vectors from each document to be detected to obtain a first feature vector set of each document to be detected;
a relationship matching module, used to obtain the association relationship between each of the documents to be detected and the entity relationship set using the first feature vector set, and to select the target document of the target business from the multiple documents to be detected, including: extracting features from each triple of the entity relationship set to obtain a second feature vector set of the entity relationship set; using the first feature vector set and the second feature vector set to obtain the similarity between each document to be detected and the entity relationship set; identifying documents to be detected whose similarity is higher than a set threshold as the target document; and, when the similarity between a subset of feature vectors in the first feature vector set and a subset of feature vectors in the second feature vector set meets the set threshold, identifying the current document to be detected as the target document and stopping the calculation and comparison of the other feature vectors of the current document to be detected, other than that subset of feature vectors in the first feature vector set; and
an extraction mapping module, used to divide the target document into multiple associated fragments, perform relation mapping extraction on the associated fragments according to each triple of the entity relationship set, and obtain multiple fragment data associated with each triple, thereby obtaining the fragment dataset of the target business; to obtain at least one data extraction instruction for a set target of the target business, wherein the data extraction instruction carries the extraction conditions of the set target; and to extract a specific dataset corresponding to the set target from the fragment dataset using the extraction conditions.
8. An electronic device, characterized in that the electronic device includes: a memory, used to store executable program code; and a processor, configured to call and run the executable program code from the memory, causing the electronic device to perform the data extraction method as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed, implements the data extraction method as described in any one of claims 1 to 6.