Artificial intelligence-based data retrieval method and apparatus
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN HOLOGRAPHIC FIELD CULTURE TECHNOLOGY CO LTD
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
Figure CN122240894A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data retrieval technology, and in particular to a data retrieval method and apparatus based on artificial intelligence.
Background Technology
[0002] As enterprises deepen their digital transformation, they accumulate massive amounts of multi-source, heterogeneous data. This data typically includes structured data tables in databases (such as business information and personnel lists) and unstructured data documents in document libraries (such as meeting minutes, technical solutions, and PDF contracts). Existing enterprise-level data retrieval technologies mainly suffer from the following problems:
1) Weak semantic understanding: Traditional keyword matching technology (such as an inverted index) has difficulty understanding the intent of a user's natural language query, so relevant results are easily missed when the query terms do not match the indexed terms.
[0003] 2) Difficulty in integrating multi-source data: Structured and unstructured data are often stored separately, making it difficult to perform correlation queries and uncover the hidden topological relationships between data.
[0004] 3) Single retrieval path: Existing systems often adopt a "one-size-fits-all" retrieval method that cannot distinguish between explicit entity queries (such as "Zhang San's phone number") and vague semantic queries (such as "how to solve the server overheating problem"), resulting in low retrieval efficiency and inaccurate results.
[0005] Therefore, there is an urgent need for an intelligent data retrieval method that can integrate the enterprise knowledge graph topology, possess deep semantic understanding capabilities, and dynamically select retrieval paths based on query intent.
Summary of the Invention
[0006] This invention provides a data retrieval method and apparatus based on artificial intelligence. This invention solves the problems of inaccurate retrieval results caused by weak semantic understanding, low data fusion, and a single retrieval strategy in existing technologies.
[0007] In a first aspect, embodiments of the present invention provide a data retrieval method based on artificial intelligence, the method comprising:
- collecting multi-source heterogeneous enterprise data, extracting its entities and attributes, constructing an enterprise knowledge graph, and using a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph;
- introducing a graph-guided contrastive learning loss function and using the graph node embedding vectors as regularization constraints during training, thereby constructing a semantic encoding model based on a dual-tower structure;
- deploying a query intent recognition network to analyze a user's natural language query in real time and output the probability distribution over whether the natural language query is a precise entity query or a fuzzy semantic query;
- dynamically selecting, based on the probability distribution, the retrieval path corresponding to the precise entity query or the fuzzy semantic query, and, along that retrieval path, using the semantic encoding model to retrieve data from the enterprise knowledge graph or the vector index library to obtain retrieval results;
- scoring the retrieval results by multi-feature fusion to obtain relevance scores, reordering the retrieval results by relevance score, and outputting the final reordered retrieval list.
[0008] The technical solution provided in this application has at least the following beneficial effects: By constructing an enterprise knowledge graph and combining it with a graph neural network, the model can learn the topological structure information of the data. During the training phase, a graph-guided contrastive learning loss function is introduced, using graph node embedding vectors as regularization constraints to align the semantic vector space with the graph topological space, significantly improving the accuracy of semantic understanding. A query intent recognition network is deployed to determine in real time whether a user is searching for a specific "entity" or for vague "semantics." For precise entity queries, the knowledge graph subgraph diffusion path is used; for fuzzy semantic queries, the vector similarity retrieval path is used, avoiding the limitations of a single retrieval mode and greatly improving retrieval efficiency. Structured data (graph) and unstructured data (vector index) are unified under the same framework, and a multi-feature fusion re-ranking mechanism is used to achieve high-quality retrieval across data types.
[0009] In an optional implementation, collecting multi-source heterogeneous enterprise data, extracting its entities and attributes, constructing the enterprise knowledge graph, and using a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph comprises:
- collecting multi-source heterogeneous enterprise data, which includes structured data tables and unstructured data documents, and performing data cleaning and multi-source fusion on it to obtain a data block sequence;
- using an entity recognition and relation extraction model built on natural language processing algorithms to extract named entities, their corresponding attributes, and entity relations from the data block sequence;
- constructing the enterprise knowledge graph from the named entities, their corresponding attributes, and the entity relations;
- using a graph neural network built on deep learning algorithms to aggregate neighborhood information over the enterprise knowledge graph, yielding graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph.
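The neighborhood aggregation step can be illustrated with a small sketch. It assumes a plain mean aggregator and a toy graph; the node names, feature values, and the equal weighting of a node's own vector against its neighbors' mean are all hypothetical, and a trained graph neural network would additionally apply learned weights and a nonlinearity.

```python
from typing import Dict, List

Vector = List[float]


def mean_aggregate(
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """One round of mean-pooling neighborhood aggregation.

    Each node's new vector is the average of its own vector and the
    mean of its neighbors' vectors. Learned weight matrices and
    nonlinearities are deliberately omitted from this sketch.
    """
    updated: Dict[str, Vector] = {}
    for node, own in features.items():
        nbrs = [features[n] for n in neighbors.get(node, []) if n in features]
        if not nbrs:
            updated[node] = list(own)
            continue
        dim = len(own)
        nbr_mean = [sum(v[i] for v in nbrs) / len(nbrs) for i in range(dim)]
        updated[node] = [(own[i] + nbr_mean[i]) / 2.0 for i in range(dim)]
    return updated


# Hypothetical toy graph: a project linked to a person and a department.
features = {
    "Project A": [1.0, 0.0],
    "Zhang San": [0.0, 1.0],
    "R&D Dept": [0.0, 3.0],
}
neighbors = {
    "Project A": ["Zhang San", "R&D Dept"],
    "Zhang San": ["Project A"],
    "R&D Dept": ["Project A"],
}

one_hop = mean_aggregate(features, neighbors)
two_hop = mean_aggregate(one_hop, neighbors)  # a second round mixes in 2-hop information
```

After one round, the vector for "Project A" already carries information from the person and department nodes connected to it, which is the sense in which the embedding integrates local topology.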
[0010] In an optional implementation, introducing the graph-guided contrastive learning loss function and using the graph node embedding vectors as regularization constraints during training to construct the semantic encoding model based on a dual-tower structure comprises:
- constructing an initial semantic encoding model based on a dual-tower structure, the model comprising a query encoder, a document encoder, and a base model, wherein the base model is a masked language model;
- merging the data block sequence to build a large-scale corpus, and pre-training the base model of the initial semantic encoding model on that corpus to obtain a pre-trained base model;
- designing, with the graph node embedding vectors as regularization constraints, a contrastive learning loss function for the initial semantic encoding model;
- constructing a graph-guided contrastive learning sample set and training the initial semantic encoding model on it with the contrastive learning loss function to obtain the final semantic encoding model.
[0011] In one optional implementation, the contrastive learning loss function includes the MLM loss function, the InfoNCE loss function, and the topology regularization loss function; The formula for the contrastive learning loss function is:
[0012] In the formula, the terms denote, in order: the total contrastive learning loss value; the MLM loss value of the masked language model; the InfoNCE loss value over the contrastive learning sample set; the topology regularization loss value; and the contrastive learning loss weight.
[0013] In the formula, the terms denote, in order: a pair of directly connected nodes in the enterprise knowledge graph, where u and v are named entity nodes; the subset of edges of the enterprise knowledge graph used for training; the semantic vectors that the semantic encoding model generates by encoding the context text of nodes u and v; the graph node embedding vector of node v; the semantic similarity between semantic vectors; and the topology regularization loss weight.
[0014] In an optional implementation, constructing the graph-guided contrastive learning sample set and training the initial semantic encoding model on it with the contrastive learning loss function to obtain the final semantic encoding model comprises:
- for several data blocks in the data block sequence, generating variant data blocks using data augmentation techniques and using these variant data blocks as positive samples;
- using other data blocks unrelated to the query as negative samples;
- traversing the knowledge graph, selecting entity pairs with direct connections, and searching the data block sequence for data blocks that mainly contain those entity pairs to serve as topological constraint samples;
- combining all positive samples, negative samples, and topological constraint samples into the graph-guided contrastive learning sample set;
- training the initial semantic encoding model on the contrastive learning sample set and computing the contrastive learning loss value for each training iteration with the contrastive learning loss function;
- if the contrastive learning loss value meets the requirements, obtaining the final semantic encoding model; otherwise, performing the next round of training.
[0015] In an optional implementation, deploying the query intent recognition network to analyze the user's natural language query in real time and output the probability distribution over whether the natural language query is a precise entity query or a fuzzy semantic query comprises:
- constructing, on top of a lightweight pre-trained model, a binary classifier as the initial query intent recognition network, and defining two labels for it: a first category label corresponding to precise entity queries and a second category label corresponding to fuzzy semantic queries;
- automatically labeling historical search logs using regular expressions and keyword rules, then calibrating the labels manually to obtain a classification sample set;
- training the initial query intent recognition network on the classification sample set to obtain the final query intent recognition network;
- collecting users' natural language queries in real time and inputting them into the final query intent recognition network to identify the query intent and obtain the probability distribution over the first category label and the second category label.
[0016] In an optional implementation, dynamically selecting the retrieval path corresponding to the precise entity query or the fuzzy semantic query based on the probability distribution, and using the semantic encoding model along that retrieval path to retrieve data from the enterprise knowledge graph or the vector index library to obtain the retrieval results, comprises:
- if the probability value of the first category label in the probability distribution is greater than a preset probability threshold, classifying the natural language query as a precise entity query; otherwise, classifying it as a fuzzy semantic query;
- if the natural language query is a precise entity query, selecting the precise retrieval path and performing precise data retrieval in the enterprise knowledge graph to obtain the retrieval results;
- if the natural language query is a fuzzy semantic query, selecting the fuzzy retrieval path and using the semantic encoding model to perform fuzzy data retrieval in the vector index library to obtain the retrieval results.
[0017] In an optional implementation, selecting the precise retrieval path when the natural language query is a precise entity query and performing precise data retrieval in the enterprise knowledge graph to obtain the retrieval results comprises:
- if the natural language query is a precise entity query, selecting the precise retrieval path corresponding to the precise entity query;
- executing the precise retrieval path, and using the entity recognition and relation extraction model to obtain several key entities from the natural language query;
- performing subgraph diffusion in the enterprise knowledge graph starting from the key entities to obtain several project nodes and their corresponding attribute nodes;
- scoring and sorting the project nodes and attribute nodes by attribute value or by degree of match with the query intent to obtain data result cards, which are output as the retrieval results.
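The subgraph diffusion step can be pictured as a bounded breadth-first expansion from the key entities. The sketch below is a minimal illustration under assumptions: the adjacency-list layout, the `max_hops` limit, the toy graph, and ranking by hop distance are all hypothetical stand-ins, since the text scores nodes by attribute value or intent match without giving a formula.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical adjacency list: node -> [(relation, neighbor), ...].
Graph = Dict[str, List[Tuple[str, str]]]


def diffuse_subgraph(graph: Graph, seeds: List[str], max_hops: int = 2) -> Dict[str, int]:
    """Breadth-first expansion from the seed entities.

    Returns every reached node together with its hop distance
    from the nearest seed, stopping at max_hops.
    """
    distance: Dict[str, int] = {s: 0 for s in seeds if s in graph}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Toy graph with made-up entities and relations.
graph: Graph = {
    "Zhang San": [("leads", "Project A"), ("member_of", "R&D Dept")],
    "Project A": [("owned_by", "R&D Dept"), ("status", "Active")],
    "R&D Dept": [("located_in", "Shenzhen")],
    "Shenzhen": [("part_of", "Guangdong")],
}

reached = diffuse_subgraph(graph, ["Zhang San"], max_hops=2)
# Assumed ranking: closer nodes first, ties broken alphabetically.
ranked = sorted(reached.items(), key=lambda item: (item[1], item[0]))
```

Limiting the number of hops keeps the result cards focused on nodes near the queried entity; a production system would replace the distance-based ordering with the attribute or intent-match scoring the method describes.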
[0018] In an optional implementation, selecting the fuzzy retrieval path when the natural language query is a fuzzy semantic query and using the semantic encoding model to perform fuzzy data retrieval in the vector index library to obtain the retrieval results comprises:
- if the natural language query is a fuzzy semantic query, selecting the fuzzy retrieval path corresponding to the fuzzy semantic query;
- executing the fuzzy retrieval path, using the document encoder of the semantic encoding model to encode all data blocks of the unstructured data documents into document vectors, and using the Faiss library to build an HNSW index over the document vectors to obtain the vector index library;
- using the query encoder of the semantic encoding model to transform the natural language query into a query vector;
- performing a KNN search with the query vector on the HNSW index in the vector index library to retrieve the Top-K candidate documents, which are output as the retrieval results.
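To show what the Top-K step returns, the sketch below runs an exact brute-force nearest-neighbor search with cosine similarity. This is a deliberately swapped-in technique: the method itself uses an HNSW index built with Faiss, which is approximate and far faster on large collections. The vectors and the choice of cosine similarity are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Exact Top-K search: score every document, keep the k best."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-dimensional document vectors standing in for encoder output.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
top_ids = [doc_id for doc_id, _ in results]
```

An approximate index trades a small loss in recall for sublinear query time, which is why a graph-based index such as HNSW is preferred once the collection grows beyond what a linear scan can serve.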
[0019] In a second aspect, embodiments of the present invention provide an artificial intelligence-based data retrieval device for implementing the data retrieval method, the device comprising:
- a data acquisition and graph construction unit, used to collect multi-source heterogeneous enterprise data, extract its entities and attributes, construct the enterprise knowledge graph, and use a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph;
- a semantic encoding model building unit, used to introduce the graph-guided contrastive learning loss function and use the graph node embedding vectors as regularization constraints during training, thereby constructing the semantic encoding model based on a dual-tower structure;
- a query intent recognition unit, used to deploy the query intent recognition network, analyze the user's natural language query in real time, and output the probability distribution over whether the natural language query is a precise entity query or a fuzzy semantic query;
- a data retrieval unit, used to dynamically select the retrieval path corresponding to the precise entity query or the fuzzy semantic query according to the probability distribution, and to retrieve data from the enterprise knowledge graph or the vector index library along that retrieval path using the semantic encoding model to obtain the retrieval results;
- a result re-ranking unit, used to score the retrieval results by multi-feature fusion to obtain relevance scores, re-rank the retrieval results by relevance score, and output the final re-ranked retrieval list.
[0020] A third aspect of this invention provides an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, such that the at least one processor can perform the method of the first aspect of the present invention.
[0021] A fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method as described in the first aspect of the present invention.
Attached Figure Description
[0022] Figure 1 is a schematic diagram of the structure of the electronic device in the hardware operating environment involved in the embodiments of the present invention; Figure 2 is a flowchart of the steps of the artificial intelligence-based data retrieval method provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the functional units of the artificial intelligence-based data retrieval device provided in an embodiment of the present invention.
Detailed Implementation
[0023] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0024] The present invention will be further described below with reference to the accompanying drawings.
[0025] Referring to Figure 1, Figure 1 is a schematic diagram of the structure of the electronic device in the hardware operating environment involved in the embodiments of the present invention.
[0026] As shown in Figure 1, the electronic device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen or an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk drive. The memory 1005 may also optionally be a storage device independent of the aforementioned processor 1001.
[0027] Those skilled in the art will understand that the structure shown in Figure 1 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
[0028] As shown in Figure 1, the memory 1005, which serves as a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a program for the artificial intelligence-based data retrieval device.
[0029] In the electronic device shown in Figure 1, the network interface 1004 is mainly used for data communication with the network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 of the present invention may be provided in the electronic device. Through the processor 1001, the electronic device calls the program of the artificial intelligence-based data retrieval device stored in the memory 1005 and executes the artificial intelligence-based data retrieval method provided in the embodiments of the present invention.
[0030] Referring to Figure 2, the present invention provides an artificial intelligence-based data retrieval method, the method comprising:
S201: Collect multi-source heterogeneous enterprise data, extract its entities and attributes, construct an enterprise knowledge graph, and use a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph;
S202: Introduce a graph-guided contrastive learning loss function and use the graph node embedding vectors as regularization constraints during training to construct a semantic encoding model based on a dual-tower structure;
S203: Deploy a query intent recognition network to analyze the user's natural language query in real time and output the probability distribution over whether the natural language query is a precise entity query or a fuzzy semantic query;
S204: Dynamically select the retrieval path corresponding to the precise entity query or the fuzzy semantic query based on the probability distribution, and use the semantic encoding model to retrieve data from the enterprise knowledge graph or the vector index library along that retrieval path to obtain the retrieval results;
S205: Score the retrieval results by multi-feature fusion to obtain relevance scores, reorder the retrieval results by relevance score, and output the final reordered retrieval list.
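Step S205 is stated only at a high level here. As a minimal sketch of multi-feature fusion, assume each retrieval result carries three hypothetical features (semantic similarity, keyword overlap, graph proximity) that are combined by a fixed weighted sum; the feature names, weights, and sample values are illustrative assumptions, not taken from the method.

```python
from typing import Dict, List

# Hypothetical fusion weights; a real system might learn these.
WEIGHTS: Dict[str, float] = {"semantic": 0.6, "keyword": 0.3, "graph": 0.1}


def relevance(features: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the available features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rerank(results: List[dict]) -> List[dict]:
    """Sort retrieval results by fused relevance score, best first."""
    return sorted(results, key=lambda r: relevance(r["features"]), reverse=True)


# Made-up candidate results with made-up feature values.
candidates = [
    {"id": "doc-a", "features": {"semantic": 0.9, "keyword": 0.1, "graph": 0.0}},
    {"id": "doc-b", "features": {"semantic": 0.7, "keyword": 0.8, "graph": 0.5}},
    {"id": "doc-c", "features": {"semantic": 0.2, "keyword": 0.2, "graph": 0.2}},
]

final_list = rerank(candidates)
final_ids = [r["id"] for r in final_list]
```

In this toy example a result with a slightly lower semantic score can still rank first once keyword and graph evidence are fused in, which is the point of scoring on more than one signal.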
[0031] The technical solution provided in this application has at least the following beneficial effects: By constructing an enterprise knowledge graph and combining it with a graph neural network, the model can learn the topological structure information of the data. During the training phase, a graph-guided contrastive learning loss function is introduced, using graph node embedding vectors as regularization constraints to align the semantic vector space with the graph topological space, significantly improving the accuracy of semantic understanding. A query intent recognition network is deployed to determine in real time whether a user is searching for a specific "entity" or for vague "semantics." For precise entity queries, the knowledge graph subgraph diffusion path is used; for fuzzy semantic queries, the vector similarity retrieval path is used, avoiding the limitations of a single retrieval mode and greatly improving retrieval efficiency. Structured data (graph) and unstructured data (vector index) are unified under the same framework, and a multi-feature fusion re-ranking mechanism is used to achieve high-quality retrieval across data types.
[0032] In an optional implementation, collecting multi-source heterogeneous enterprise data, extracting its entities and attributes, constructing the enterprise knowledge graph, and using a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph comprises:
S2011: Collect multi-source heterogeneous enterprise data, and perform data cleaning and multi-source fusion on it to obtain a data block sequence. The data includes structured data tables and unstructured data documents.
In this embodiment, the collection proceeds as follows.
Structured data acquisition: configure a database connection pool (for example, over Java Database Connectivity (JDBC)) to connect to the enterprise resource planning (ERP) system, the customer relationship management system, and the human resources system; execute SQL queries to obtain business table data (such as the employee table and the project table); handle data type conversion; and unify date and numerical formats.
Unstructured data collection: deploy web crawlers or file system scanners to collect the enterprise's PDF documents, Word contracts, Excel reports, and email data; use the Apache Tika or PDFBox parsers to extract text content and remove headers, footers, page numbers, and garbled characters.
Data segmentation: for long documents, a sliding window algorithm is used for segmentation. The window size is set to 512 tokens and the step size to 128 tokens, so that adjacent data blocks overlap and share context, which preserves semantic integrity. The result is the data block sequence.
S2012: Using an entity recognition and relation extraction model built on natural language processing algorithms, extract named entities, their corresponding attributes, and entity relations from the data block sequence.
In this embodiment, the entity recognition and relation extraction model is built on the BERT-BiLSTM-CRF architecture:
Bidirectional Encoder Representations from Transformers (BERT) layer: transforms data blocks into word vectors and extracts contextual semantic features;
Bidirectional Long Short-Term Memory (BiLSTM) layer: captures long-range dependencies in sequences;
Conditional Random Field (CRF) layer: learns the transition constraints between labels (e.g., the label "I-PER" may only follow "B-PER" or "I-PER") and outputs the optimal sequence of entity labels.
Extraction: for each data block, identify the set of named entities and their types (e.g., person names, organization names, place names). At the same time, extract relational triples between entities based on dependency parsing or template matching.
S2013: Construct the enterprise knowledge graph from the named entities, their corresponding attributes, and the entity relations.
In this embodiment, storage integration: the extracted entities are used as nodes, relations as edges, and attributes as node properties, and all are imported into a graph database (such as Neo4j); the same entity appearing in different data sources is aligned (for example, "ID:001" in the ERP table and "Zhang San" in an email are merged into one node by an entity alignment algorithm). Graph construction: this forms a directed attributed graph G = (V, E), where V is the set of entity nodes and E is the set of relation edges.
S2014: Using a graph neural network built on deep learning algorithms, aggregate neighborhood information over the enterprise knowledge graph to obtain graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph.
In this embodiment, graph neural network (GNN) feature extraction employs the graph sampling and aggregation (GraphSAGE) algorithm: for each node v in graph G, its 1-hop and 2-hop neighbors are randomly sampled. Neighborhood aggregation: the feature vectors of the neighboring nodes are aggregated using mean pooling or a Long Short-Term Memory (LSTM) network to update the representation of the current node. After iterating through 2 to 3 layers, the updated vector representation of each node not only includes its own attributes but also incorporates the structural features of its local subgraph (for example, the vector of "Project A" contains information about the "person in charge" and "department" connected to it).
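The sliding-window segmentation in S2011 can be sketched as follows, using the 512-token window and 128-token step given in this embodiment. Whitespace splitting stands in for a real tokenizer, and the function name and sample document are hypothetical.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], window: int = 512, step: int = 128
) -> List[List[str]]:
    """Split a token list into overlapping windows.

    With window=512 and step=128, consecutive chunks share
    window - step = 384 tokens of context.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(tokens) <= window:
        return [tokens[:]] if tokens else []
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return chunks


# Whitespace splitting is a stand-in for a real subword tokenizer.
document = " ".join(f"w{i}" for i in range(1000))
chunks = sliding_window_chunks(document.split())
```

For a 1000-token document this yields five data blocks starting at tokens 0, 128, 256, 384, and 512; the last block is shorter because it stops at the end of the document.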
[0033] In an optional implementation, introducing the graph-guided contrastive learning loss function and using the graph node embedding vectors as regularization constraints during training to construct the semantic encoding model based on a dual-tower structure comprises:
S2021: Construct an initial semantic encoding model based on a dual-tower structure. The semantic encoding model includes a query encoder, a document encoder, and a base model, wherein the base model is a masked language model.
In this embodiment, the initial semantic encoding model contains two encoders, a query encoder and a document encoder, whose parameters may be shared or independent. Both are built on the base model. A strong masked language model, such as BERT or the robustly optimized BERT approach (RoBERTa), is selected as the backbone network for feature extraction.
S2022: Merge the data block sequence to build a large-scale corpus, and use the corpus to pre-train the base model of the initial semantic encoding model to obtain the pre-trained base model.
In this embodiment, to accommodate enterprise-specific terminology, all data blocks are merged into a large-scale enterprise corpus, which is then used to pre-train the base model. The pre-training task is Masked Language Modeling (MLM): parts of a sentence are randomly masked, and the model predicts the masked words from the context. This helps the model learn the specialized terminology and language habits of the enterprise domain, resulting in the pre-trained base model.
S2023: Using the graph node embedding vectors as regularization constraints, design a contrastive learning loss function for the initial semantic encoding model based on the dual-tower structure.
S2024: Construct a graph-guided contrastive learning sample set, and train the initial semantic encoding model on it with the contrastive learning loss function to obtain the final semantic encoding model.
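The random masking used by the MLM pre-training task in S2022 can be sketched as below. The 15% masking probability, the `[MASK]` token string, and the sample sentence are assumptions for illustration; common BERT-style recipes also sometimes keep or randomly replace a selected token, which this sketch omits.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # assumed mask symbol


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly mask tokens for masked language modeling.

    Returns the masked sequence and, per position, the original
    token to predict (None where the token was left untouched).
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Made-up enterprise sentence, tokenized on whitespace for simplicity.
sentence = "the cooling unit in server room B was replaced last quarter".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
```

The model is then trained to recover each non-None label from the surrounding unmasked context, which is how it picks up domain-specific vocabulary from the merged corpus.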
[0034] In one optional implementation, the contrastive learning loss function includes the MLM loss function, the Information Noise Contrastive Estimation (InfoNCE) loss function, and the topology regularization loss function. The MLM loss function preserves the model's text generation and understanding capabilities, while the InfoNCE loss function is a standard contrastive learning loss. For positive samples (semantically similar documents) and negative samples (irrelevant documents) of the same query, it brings positive samples closer and pushes negative samples further apart. The topological regularization loss function calculates the distance between nodes (such as Euclidean distance) and minimizes this distance, thereby achieving "graph-guided" semantic alignment. The formula for the contrastive learning loss function is:
[0035] In the formula, the terms denote, in order: the total contrastive learning loss value; the MLM loss value of the masked language model; the InfoNCE loss value over the contrastive learning sample set; the topology regularization loss value; and the contrastive learning loss weight.
[0036] In the formula, the terms denote, in order: a pair of directly connected nodes in the enterprise knowledge graph, where u and v are named entity nodes; the subset of edges of the enterprise knowledge graph used for training; the semantic vectors that the semantic encoding model generates by encoding the context text of nodes u and v; the graph node embedding vector of node v; the semantic similarity between semantic vectors; and the topology regularization loss weight.
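The three loss components can be illustrated with a small numeric sketch. The expressions themselves are not shown in this text, so the concrete form below is an assumption: a weighted sum of the three terms, a dot-product InfoNCE with a temperature, and a topology regularizer taken as the mean squared Euclidean distance between the semantic vector of node u and the graph embedding of its neighbor v. The names `alpha`, `beta`, and `temperature` and all sample values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(query: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


def topology_regularizer(edges: List[Tuple[str, str]],
                         semantic: Dict[str, Vector],
                         graph_emb: Dict[str, Vector]) -> float:
    """Assumed form: mean squared distance between h_u and g_v over edges."""
    if not edges:
        return 0.0
    total = 0.0
    for u, v in edges:
        total += sum((a - b) ** 2 for a, b in zip(semantic[u], graph_emb[v]))
    return total / len(edges)


def total_loss(mlm_loss: float, nce_loss: float, topo_loss: float,
               alpha: float = 1.0, beta: float = 0.1) -> float:
    """Assumed combination: L = L_MLM + alpha * L_InfoNCE + beta * L_topo."""
    return mlm_loss + alpha * nce_loss + beta * topo_loss


# Made-up vectors; both spaces are assumed to share one dimensionality.
nce = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
semantic = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
graph_emb = {"a": [1.0, 0.0], "b": [1.0, 1.0]}
topo = topology_regularizer([("a", "b")], semantic, graph_emb)
loss = total_loss(2.0, nce, topo)
```

Minimizing the regularizer pulls the text-derived vector of one entity toward the graph-derived vector of an entity it is directly connected to, which is one way to read the "graph-guided" alignment described above.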
[0037] In an optional implementation, constructing the graph-guided contrastive learning sample set and training the initial semantic encoding model on it with the contrastive learning loss function to obtain the final semantic encoding model comprises:
S20241: For several data blocks in the data block sequence, generate variant data blocks using data augmentation techniques, and use the variant data blocks as positive samples. In this embodiment, data augmentation (such as synonym replacement and random deletion) is applied to the same document block to construct positive samples;
S20242: Use other data blocks that are unrelated to the query as negative samples;
S20243: Traverse the knowledge graph, select entity pairs with direct connections, and search the data block sequence for data blocks that mainly contain those entity pairs to serve as topological constraint samples;
S20244: Combine all positive samples, negative samples, and topological constraint samples into the graph-guided contrastive learning sample set;
S20245: Train the initial semantic encoding model on the contrastive learning sample set, and use the contrastive learning loss function to compute the contrastive learning loss value for each training iteration;
S20246: If the contrastive learning loss value meets the requirements, the final semantic encoding model is obtained; otherwise, the next round of training is carried out.
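Steps S20241 to S20244 can be sketched as follows. Random deletion is used as the augmentation, every other data block is treated as a negative, and a topological constraint sample is any block whose text contains both entities of a connected pair; these choices, the deletion probability, and the sample data are simplifying assumptions.

```python
import random
from typing import Dict, List, Tuple


def random_deletion(tokens: List[str], drop_prob: float, rng: random.Random) -> List[str]:
    """Drop each token with probability drop_prob, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(tokens)]


def build_samples(blocks: List[str], drop_prob: float = 0.2, seed: int = 0) -> List[Dict[str, object]]:
    """Pair each block with an augmented positive and the other blocks as negatives."""
    rng = random.Random(seed)
    samples: List[Dict[str, object]] = []
    for i, block in enumerate(blocks):
        positive = " ".join(random_deletion(block.split(), drop_prob, rng))
        negatives = [b for j, b in enumerate(blocks) if j != i]
        samples.append({"anchor": block, "positive": positive, "negatives": negatives})
    return samples


def topology_samples(edges: List[Tuple[str, str]], blocks: List[str]) -> List[Tuple[str, str, str]]:
    """For each connected entity pair, collect blocks mentioning both entities."""
    return [(u, v, b) for u, v in edges for b in blocks if u in b and v in b]


# Made-up data blocks and one made-up knowledge graph edge.
blocks = [
    "Zhang San leads Project A in the R&D department",
    "The server room cooling plan was approved",
    "Quarterly budget review meeting minutes",
]
edges = [("Zhang San", "Project A")]

samples = build_samples(blocks)
topo = topology_samples(edges, blocks)
sample_set = {"contrastive": samples, "topological": topo}
```

Substring matching is a crude test for "mainly contains"; a fuller implementation would rely on the entity recognition output from S2012 rather than raw text search.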
[0038] In one optional implementation, deploying a query intent recognition network to analyze the user's natural language query in real time and output the probability distribution of whether the natural language query is a precise entity query or a fuzzy semantic query includes: S2031: Based on a lightweight pre-trained model, construct a binary classifier as the initial query intent recognition network, and define two labels for the query intent recognition network: a first category label corresponding to precise entity queries and a second category label corresponding to fuzzy semantic queries; in this embodiment, a lightweight model (such as DistilBERT or TinyBERT) is selected to reduce online latency. S2032: Use regular expressions and keyword rules to automatically label historical search logs, then perform manual calibration to obtain a classification sample set; in this embodiment, rule labeling works as follows: if a natural language query contains a high-confidence entity (such as a date or a specific name) and is fewer than 15 characters long, it is labeled a "precise entity query" (Label 0); if a natural language query contains interrogative words (such as "how"), abstract nouns (such as "strategy" or "solution"), or is long, it is labeled a "fuzzy semantic query" (Label 1). In manual calibration, boundary samples (such as natural language queries that contain both entities and interrogative words) are reviewed manually and noisy labels are corrected to ensure the quality of the training set. S2033: Train the initial query intent recognition network with the classification sample set to obtain the final query intent recognition network. S2034: Collect the user's natural language query in real time and input it into the final query intent recognition network to perform query intent recognition and obtain the probability distribution over the first category label and the second category label.
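The rule labeling in step S2032 might look like the sketch below. The 15-character limit and the labels 0 and 1 come from the text above; the date pattern, the word lists, the `LONG_QUERY_CHARS` cutoff for "long" queries, and the convention of returning `None` for samples that need manual review are all assumptions made for illustration.

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumed date format
INTERROGATIVES = ("how", "why")                 # illustrative word list
ABSTRACT_NOUNS = ("strategy", "solution")       # examples named in the text
LONG_QUERY_CHARS = 30                           # hypothetical cutoff for "long"


def rule_label(query, known_names):
    """Return 0 (precise entity), 1 (fuzzy semantic), or None (manual review)."""
    text = query.lower()
    has_entity = bool(DATE_RE.search(query)) or any(
        name.lower() in text for name in known_names
    )
    words = re.findall(r"[a-z]+", text)
    has_cue = any(w in INTERROGATIVES or w in ABSTRACT_NOUNS for w in words)
    if has_entity and has_cue:
        return None  # boundary sample: both an entity and a fuzzy cue
    if has_entity and len(query) < 15:
        return 0
    if has_cue or len(query) >= LONG_QUERY_CHARS:
        return 1
    return None  # no rule fired; leave for manual calibration
```

Routing boundary samples to `None` rather than guessing a label keeps the automatically labeled set cleaner, at the cost of more manual review.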
[0039] In one optional implementation, dynamically selecting the retrieval path corresponding to the precise entity query or the fuzzy semantic query based on the probability distribution and, based on the retrieval path, using the semantic coding model to retrieve data from the enterprise knowledge graph or the vector index library to obtain the retrieval results includes: S2041: If the probability value of the first category label in the probability distribution is greater than the preset probability threshold, the natural language query is a precise entity query; otherwise, the natural language query is a fuzzy semantic query. S2042: If the natural language query is a precise entity query, select the precise retrieval path corresponding to the precise entity query and perform precise data retrieval in the enterprise knowledge graph to obtain the retrieval results. S2043: If the natural language query is a fuzzy semantic query, select the fuzzy retrieval path corresponding to the fuzzy semantic query and use the semantic coding model to perform fuzzy data retrieval in the vector index library to obtain the retrieval results.
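The threshold routing of steps S2041 to S2043 reduces to a single comparison. In this sketch the default threshold of 0.5 and the callables `graph_search` and `vector_search` are hypothetical stand-ins for the preset probability threshold and the two retrieval paths.

```python
def select_path(p_precise, threshold=0.5):
    """Choose the retrieval path from the probability of the first category label."""
    return "precise" if p_precise > threshold else "fuzzy"


def retrieve(query, p_precise, graph_search, vector_search, threshold=0.5):
    """Dispatch the query to the knowledge graph or the vector index library."""
    if select_path(p_precise, threshold) == "precise":
        return graph_search(query)
    return vector_search(query)
```

Because the comparison is strict, a probability exactly equal to the threshold falls through to the fuzzy path, matching the "greater than ... otherwise" wording of S2041.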
[0040] In one optional implementation, selecting the precise retrieval path corresponding to the precise entity query when the natural language query is a precise entity query, and performing precise data retrieval in the enterprise knowledge graph to obtain the retrieval results, includes: S20421: If the natural language query is a precise entity query, select the precise retrieval path corresponding to the precise entity query. S20422: Execute the precise retrieval path, using the entity recognition and relation extraction model to obtain several key entities from the natural language query, for example extracting "R&D Department" and "Zhang San" from "Zhang San's employee ID in R&D Department". S20423: Based on the key entities, perform subgraph diffusion in the enterprise knowledge graph to obtain several project nodes and their corresponding attribute nodes; that is, starting from the extracted key entities, query the knowledge graph and expand along its relationships (e.g., find the "Zhang San" node, follow the "belongs to" edge pointing to the "R&D Department" node, and then look up the "Employee ID" attribute node). S20424: Score and sort the project nodes and attribute nodes based on their attribute values or their degree of matching with the query intent to obtain data result cards, and output the data result cards as the retrieval results.
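Steps S20423 and S20424 can be sketched as a bounded breadth-first expansion followed by a sort. The adjacency-list representation, the `max_hops` limit, the scoring rule (requested relation first, then fewer hops), and the sample value "E1024" are assumptions for illustration only.

```python
from collections import deque


def diffuse(graph, seeds, max_hops=2):
    """Breadth-first expansion from the key entities, recording traversed edges."""
    seen = {s: 0 for s in seeds}  # node -> hop distance from the nearest seed
    queue = deque(seen)
    triples = []
    while queue:
        node = queue.popleft()
        hops = seen[node]
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, target in graph.get(node, []):
            triples.append((node, relation, target, hops + 1))
            if target not in seen:
                seen[target] = hops + 1
                queue.append(target)
    return triples


def rank_cards(triples, wanted_relation):
    """Put edges matching the requested relation first, then prefer fewer hops."""
    return sorted(triples, key=lambda t: (t[1] != wanted_relation, t[3]))


# Usage with a toy graph; the employee ID value is made up.
graph = {
    "Zhang San": [("belongs to", "R&D Department"), ("Employee ID", "E1024")],
    "R&D Department": [("located in", "Building 3")],
}
cards = rank_cards(diffuse(graph, ["Zhang San", "R&D Department"]), "Employee ID")
```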
[0041] In one optional implementation, selecting the fuzzy retrieval path corresponding to the fuzzy semantic query when the natural language query is a fuzzy semantic query, and using the semantic coding model to perform fuzzy data retrieval in the vector index library to obtain the retrieval results, includes: S20431: If the natural language query is a fuzzy semantic query, select the fuzzy retrieval path corresponding to the fuzzy semantic query. S20432: Execute the fuzzy retrieval path, use the document encoder of the semantic coding model to convert all data blocks of the unstructured data documents into document vectors, and use the Faiss library to build a Hierarchical Navigable Small World (HNSW) index over the document vectors to obtain the vector index library. S20433: Use the query encoder of the semantic coding model to convert the natural language query into a query vector. S20434: Based on the query vector, perform a K-Nearest Neighbors (KNN) search on the HNSW index in the vector index library to retrieve the Top-K candidate documents, and output the Top-K candidate documents as the retrieval results; in this embodiment, similarity (such as cosine similarity) between the query vector and the vectors in the vector index library is the ranking criterion, and the HNSW algorithm is used to quickly retrieve the Top-K candidate document fragments with the highest similarity as the retrieval results.
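The ranking criterion in step S20434 can be shown with an exact, brute-force Top-K search. This is deliberately not HNSW and does not use Faiss: it compares the query against every document vector, which an HNSW index avoids by searching a layered graph approximately. The function names and the toy vectors are hypothetical.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query_vec, doc_vecs, k):
    """Exact Top-K by cosine similarity; a stand-in for the HNSW search."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Usage: three toy document vectors and one query vector.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.1], docs, k=2)
```

The brute-force version is useful as a correctness baseline when evaluating the recall of an approximate index.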
[0042] In one optional implementation, performing refined scoring of the retrieval results by multi-feature fusion to obtain a relevance score, reordering the retrieval results based on the relevance score, and outputting the final reordered search list includes: S2051: Calculate the final relevance score for the retrieval results (whether graph cards or document fragments). The scoring features include, but are not limited to: a semantic similarity feature, namely the vector similarity score calculated in step S2043; a text matching feature, namely the traditional BM25 score, which measures the literal match of the query terms in the document; a graph structure feature, which, for graph retrieval, considers the PageRank value (importance) of a node in the graph or its distance from the query entity in the graph; and a timeliness feature, namely the document's creation time or last update time, with newer data returned first. The final relevance score is obtained by fusing the above features using a trained ranking model (such as LambdaMART) or a weighted summation formula. S2052: Reorder the retrieval results based on the relevance score and output the final reordered search list; in this embodiment, the candidate retrieval results are sorted in descending order of the calculated final relevance score, and the sorted list is used to generate a visual final search list that is output to the user's front end, which simultaneously displays structured answers (if applicable) and summaries of relevant document fragments.
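The weighted summation option in step S2051 might be sketched as follows. The feature names, the min-max normalization of each feature across the candidate set, and the weights in the usage example are assumptions; a learned ranker such as LambdaMART would replace the fixed weights.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    semantic: float  # vector similarity score
    bm25: float      # text matching score
    graph: float     # e.g. PageRank value of the node
    recency: float   # larger means more recently updated


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rerank(candidates, weights):
    """Fuse normalized features by weighted sum and sort in descending order."""
    if not candidates:
        return []
    columns = {
        name: min_max([getattr(c, name) for c in candidates]) for name in weights
    }
    scored = [
        (sum(weights[n] * columns[n][i] for n in weights), c.doc_id)
        for i, c in enumerate(candidates)
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored]


# Usage: hypothetical weights that favor semantic similarity.
weights = {"semantic": 0.4, "bm25": 0.3, "graph": 0.1, "recency": 0.2}
ranked = rerank(
    [Candidate("old", 0.9, 2.0, 0.1, 0.0), Candidate("new", 0.8, 8.0, 0.5, 1.0)],
    weights,
)
```

Normalizing per feature keeps an unbounded score such as BM25 from dominating similarity scores that already lie in a fixed range.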
[0043] This invention also provides an artificial intelligence-based data retrieval device 300; see Figure 3. The device may include the following units: The data acquisition and graph construction unit 301 is used to collect multi-source heterogeneous enterprise data, extract its entities and attributes, construct an enterprise knowledge graph, and use a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph. The semantic coding model building unit 302 is used to introduce a graph-guided contrastive learning loss function in which the graph node embedding vectors serve as a regularization constraint term for training, thereby constructing a semantic coding model based on a dual-tower structure. The query intent recognition unit 303 is used to deploy a query intent recognition network, analyze the user's natural language query in real time, and output the probability distribution of whether the natural language query is a precise entity query or a fuzzy semantic query. The data retrieval unit 304 is used to dynamically select the retrieval path corresponding to the precise entity query or the fuzzy semantic query according to the probability distribution, and to use the semantic coding model to retrieve data in the enterprise knowledge graph or the vector index library according to the retrieval path to obtain the retrieval results. The result re-ranking unit 305 is used to perform refined scoring of the retrieval results by multi-feature fusion, obtain a relevance score, re-rank the retrieval results based on the relevance score, and output the final re-ranked search list.
[0044] Based on the same inventive concept, another embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus. The memory is used to store a computer program; the processor, when executing the program stored in the memory, implements the artificial intelligence-based data retrieval method of the present invention.
[0045] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in the diagram, but this does not indicate that there is only one bus or one type of bus. The communication interface is used for communication between the aforementioned electronic device and other devices. The memory can include Random Access Memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory can also be at least one storage device located remotely from the aforementioned processor.
[0046] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0047] Furthermore, to achieve the above objectives, embodiments of the present invention also propose a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the artificial intelligence-based data retrieval method of the present invention.
[0048] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable hardware devices (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0049] The embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (apparatus), and computer program products according to embodiments of the invention. It should be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0050] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0051] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0052] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. "And / or" indicates that either one or both can be chosen. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
[0053] The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A data retrieval method based on artificial intelligence, characterized in that the method includes: collecting multi-source heterogeneous enterprise data, extracting its entities and attributes, constructing an enterprise knowledge graph, and using a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph; introducing a graph-guided contrastive learning loss function in which the graph node embedding vectors are used as a regularization constraint term for training, thereby constructing a semantic coding model based on a dual-tower structure; deploying a query intent recognition network to analyze the user's natural language query in real time and output the probability distribution of whether the natural language query is a precise entity query or a fuzzy semantic query; dynamically selecting, based on the probability distribution, the retrieval path corresponding to the precise entity query or the fuzzy semantic query, and, based on the retrieval path, using the semantic coding model to retrieve data in the enterprise knowledge graph or the vector index library to obtain the retrieval results; and performing refined scoring of the retrieval results by multi-feature fusion to obtain a relevance score, reordering the retrieval results based on the relevance score, and outputting the final reordered search list.
2. The data retrieval method based on artificial intelligence according to claim 1, characterized in that, Collect multi-source heterogeneous data from enterprises, extract their entities and attributes, construct an enterprise knowledge graph, and use graph neural networks to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph, including: Collect multi-source heterogeneous data from enterprises, and perform data cleaning and multi-source fusion on the multi-source heterogeneous data to obtain a data block sequence. The multi-source heterogeneous data from enterprises includes structured data tables and unstructured data documents. An entity recognition and relation extraction model based on natural language processing algorithms is used to extract named entities, corresponding attributes, and entity relations from data block sequences. Construct an enterprise knowledge graph based on several named entities, their corresponding attributes, and entity relationships; A graph neural network based on deep learning algorithms is used to aggregate neighborhood information of the enterprise knowledge graph, resulting in graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph.
3. The data retrieval method based on artificial intelligence according to claim 2, characterized in that, By introducing a graph-guided contrastive learning loss function, the graph node embedding vectors are used as regularization constraints for training, constructing a semantic encoding model based on a dual-tower structure, including: An initial semantic coding model based on a dual-tower structure is constructed. The semantic coding model includes a query encoder, a document encoder, and a base model, wherein the base model is a masked language model. The data blocks are merged to build a large-scale corpus, and the initial base model in the initial semantic coding model is pre-trained using the large-scale corpus to obtain the pre-trained base model. Using the graph node embedding vector as a regularization constraint, a contrastive learning loss function for the initial semantic coding model based on a dual-tower structure is designed. A graph-guided contrastive learning sample set is constructed, and the initial semantic encoding model is trained using the contrastive learning loss function to obtain the final semantic encoding model.
4. The data retrieval method based on artificial intelligence according to claim 3, characterized in that the contrastive learning loss function includes the MLM loss function, the InfoNCE loss function, and the topological regularization loss function. In the formula for the contrastive learning loss function, the terms denote, in order: the contrastive learning loss value; the MLM loss value of the masked language model; the InfoNCE loss value over the contrastive learning sample set; the topological regularization loss value; and the contrastive learning loss weight. In the formula, (u, v) denotes a pair of nodes in the enterprise knowledge graph that have a direct connection relationship, where u and v are named entity nodes in the enterprise knowledge graph; the remaining terms denote, in order: the subset of edges in the enterprise knowledge graph used for training; the semantic vectors generated by the semantic coding model after encoding the context text of nodes u and v; the graph node embedding vector of node v; the semantic similarity between semantic vectors; and the topological regularization loss weight.
5. The data retrieval method based on artificial intelligence according to claim 4, characterized in that, A graph-guided contrastive learning sample set is constructed, and based on the contrastive learning loss function, the initial semantic encoding model is trained using the contrastive learning sample set to obtain the final semantic encoding model, including: For several data blocks in the data block sequence, variant data blocks are generated using data augmentation techniques, and these variant data blocks are used as positive samples. Other data blocks unrelated to the query are used as negative samples; Traverse the knowledge graph, select entity pairs with direct connections, and search for data blocks that mainly contain the entity pairs in the data block sequence as topological constraint samples. Summarize all positive samples, negative samples, and topologically constrained samples to construct a graph-guided contrastive learning sample set; The initial semantic coding model is trained using a contrastive learning sample set, and the contrastive learning loss value is calculated for each training iteration using the contrastive learning loss function. If the contrastive learning loss value meets the requirements, the final semantic encoding model is obtained; otherwise, the next round of training is performed.
6. The data retrieval method based on artificial intelligence according to claim 5, characterized in that deploying a query intent recognition network to analyze the user's natural language query in real time and output the probability distribution of whether the natural language query is a precise entity query or a fuzzy semantic query includes: based on a lightweight pre-trained model, constructing a binary classifier as the initial query intent recognition network, and defining two labels for the query intent recognition network, namely a first category label corresponding to precise entity queries and a second category label corresponding to fuzzy semantic queries; automatically labeling historical search logs using regular expressions and keyword rules, and then performing manual calibration to obtain a classification sample set; training the initial query intent recognition network with the classification sample set to obtain the final query intent recognition network; and collecting the user's natural language query in real time and inputting it into the final query intent recognition network to perform query intent recognition and obtain the probability distribution over the first category label and the second category label.
7. The data retrieval method based on artificial intelligence according to claim 6, characterized in that, Based on the probability distribution, the retrieval path corresponding to either precise entity query or fuzzy semantic query is dynamically selected. Then, based on the retrieval path, a semantic encoding model is used to retrieve data from the enterprise knowledge graph or vector index, yielding retrieval results, including: If the probability value of the first category label in the probability distribution is greater than the preset probability threshold, then the natural language query belongs to the precise entity query; otherwise, the natural language query belongs to the fuzzy semantic query. If the natural language query is a precise entity query, then select the precise retrieval path corresponding to the precise entity query, perform precise data retrieval in the enterprise knowledge graph, and obtain the retrieval results; If the natural language query is a fuzzy semantic query, then the fuzzy retrieval path corresponding to the fuzzy semantic query is selected, and the semantic coding model is used to perform fuzzy data retrieval in the vector index library to obtain the retrieval results.
8. The data retrieval method based on artificial intelligence according to claim 7, characterized in that selecting the precise retrieval path corresponding to the precise entity query when the natural language query is a precise entity query, and performing precise data retrieval in the enterprise knowledge graph to obtain the retrieval results, includes: if the natural language query is a precise entity query, selecting the precise retrieval path corresponding to the precise entity query; executing the precise retrieval path, and using the entity recognition and relation extraction model to obtain several key entities from the natural language query; based on the key entities, performing subgraph diffusion in the enterprise knowledge graph to obtain several project nodes and their corresponding attribute nodes; and scoring and sorting the project nodes and attribute nodes based on their attribute values or their degree of matching with the query intent to obtain data result cards, which are output as the retrieval results.
9. The data retrieval method based on artificial intelligence according to claim 8, characterized in that selecting the fuzzy retrieval path corresponding to the fuzzy semantic query when the natural language query is a fuzzy semantic query, and using the semantic coding model to perform fuzzy data retrieval in the vector index library to obtain the retrieval results, includes: if the natural language query is a fuzzy semantic query, selecting the fuzzy retrieval path corresponding to the fuzzy semantic query; executing the fuzzy retrieval path, using the document encoder of the semantic coding model to convert all data blocks of the unstructured data documents into document vectors, and using the Faiss library to build an HNSW index over the document vectors to obtain the vector index library; using the query encoder of the semantic coding model to convert the natural language query into a query vector; and, based on the query vector, performing a KNN search on the HNSW index in the vector index library to retrieve the Top-K candidate documents, and outputting the Top-K candidate documents as the retrieval results.
10. A data retrieval device based on artificial intelligence, used to implement the data retrieval method as described in any one of claims 1-9, characterized in that the device includes: a data acquisition and graph construction unit, used to collect multi-source heterogeneous enterprise data, extract its entities and attributes, construct an enterprise knowledge graph, and use a graph neural network to aggregate neighborhood information to generate graph node embedding vectors that integrate the topological structure of the enterprise knowledge graph; a semantic coding model building unit, used to introduce a graph-guided contrastive learning loss function in which the graph node embedding vectors serve as a regularization constraint term for training, thereby constructing a semantic coding model based on a dual-tower structure; a query intent recognition unit, used to deploy a query intent recognition network, analyze the user's natural language query in real time, and output the probability distribution of whether the natural language query is a precise entity query or a fuzzy semantic query; a data retrieval unit, used to dynamically select the retrieval path corresponding to the precise entity query or the fuzzy semantic query according to the probability distribution, and to use the semantic coding model to retrieve data in the enterprise knowledge graph or the vector index library according to the retrieval path to obtain the retrieval results; and a result re-ranking unit, used to perform refined scoring of the retrieval results by multi-feature fusion, obtain a relevance score, re-rank the retrieval results based on the relevance score, and output the final re-ranked search list.