A biomedical literature long query content retrieval method, device and computer equipment

CN116414946B · Active · Publication Date: 2026-06-26 · CHONGQING INST OF GREEN & INTELLIGENT TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING INST OF GREEN & INTELLIGENT TECH CHINESE ACAD OF SCI
Filing Date
2023-04-14
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing long-query retrieval methods for biomedical literature struggle to preserve the logical relationships within the original long query, and they have low recall and precision. In complex medical search scenarios in particular, traditional keyword retrieval often returns either a large amount of irrelevant content or very little content.

Method used

We construct a hierarchical tree for biomedical literature, use the latent Dirichlet allocation (LDA) method for topic inference, calculate the topic relevance between the query content and the nodes layer by layer, locate a leaf node from top to bottom, and return the nearest documents through a scoring mechanism, which preserves the logical relationships of long queries and improves retrieval accuracy.

Benefits of technology

It improves recall and precision while preserving the logical relationships of the long query content, with the precision of most retrieval results exceeding 70%.



Abstract

The present application relates to biomedical literature content retrieval technology, and in particular to a long query content retrieval method, device, and computer equipment for biomedical literature. The method comprises the following steps: constructing a biomedical literature hierarchical tree, and selecting and storing document segments for each child node based on the relevance of the node's topic words to the document segments of the documents in the retrieval library; preprocessing a long query text input by a user to obtain the content to be queried; performing topic inference on the content to be queried, top-down and layer by layer on the hierarchical tree, based on the latent Dirichlet allocation method, that is, calculating the topic relevance of the content to be queried to each node other than the root node from top to bottom and, if the relevance is greater than a set threshold, continuing to the next layer of nodes under that child node; and, if the queried child node is a leaf node, ending the topic inference and finding the N documents closest to the content to be queried according to the document topic relevance distribution of the document segments on the leaf node. The present application improves the retrieval accuracy for biomedical literature.

Description

Technical Field

[0001] This invention relates to the technical fields of biomedical literature content retrieval and text classification, and in particular to a method, apparatus, and computer equipment for long query content retrieval of biomedical literature.

Background Technology

[0002] Text content retrieval technologies mainly include directory-based retrieval methods, document-based retrieval methods, and intelligent semantic retrieval methods. Directory-based retrieval methods extract metadata (including some text) from documents semi-automatically or manually, store it in a database, and retrieve documents by querying the database. Document-based retrieval methods are currently the most commonly used information retrieval technology. Represented by search engines provided by software service providers such as Google, Baidu, and Microsoft Bing, these methods preprocess documents (mainly web page content), build indexes, and use techniques such as link analysis and ranking optimization to provide users with comprehensive query services such as keyword searches. Intelligent semantic retrieval methods utilize advanced natural language processing technologies such as knowledge bases and knowledge graphs to understand the user's true intent and return content as relevant as possible to the information needs, rather than just text content containing keywords.

[0003] Currently, text content retrieval methods are relatively limited, with most search engines only offering category browsing and keyword search. When the content a user wants to query is difficult to describe with just a few keywords, long text queries (i.e., one or more complete sentences or even an entire article with certain semantic meaning) are needed. However, existing keyword searches will return a large amount of irrelevant content or very little content, resulting in a relatively low precision rate in such cases.

[0004] Biomedical literature constitutes the most important textual information resource in the biomedical field, and its volume continues to increase with the rapid development of scientific research in this area. Fully leveraging this massive amount of textual information to discover new medical knowledge is of paramount importance for life science research. Due to the complexity of medical concepts and information needs, searches in the biomedical field are often cumbersome. Attending physicians or resident physicians typically need to retrieve the latest relevant articles from a large body of literature based on the patient's symptoms and diagnostic indicators. Current methods for handling long queries primarily reduce the query content to a subset of its keywords and then apply keyword search. How to preserve the original long query content and the logical relationships between words while balancing recall and precision is a pressing issue that needs to be addressed.

Summary of the Invention

[0005] To maintain the content of the original long query and the logical relationships between words while balancing recall and precision, this invention proposes a method, apparatus, and computer device for retrieving long queries in biomedical literature. The retrieval method specifically includes the following steps:

[0006] Construct a hierarchical tree of biomedical literature, and for each child node select and store document fragments from the retrieval database according to their relevance to the node's topic words;

[0007] When a user enters a long query text, perform data cleaning, lemmatization, stop word removal, and stemming on the long query text to obtain the query content;

[0008] In the biomedical literature hierarchy tree, perform topic inference on the query content layer by layer from top to bottom based on the latent Dirichlet allocation (LDA) method. That is, the topic relevance between the query content and each node other than the root node is calculated from top to bottom. If the relevance is greater than a set threshold, the search continues to the next level of nodes under that child node;

[0009] If the child node being queried is a leaf node, the topic inference ends, and the N nearest-neighbor documents to the query content are found according to the document topic relevance distribution of the document fragments on the leaf node;

[0010] Calculate the scores of the N nearest-neighbor documents and push them to the user in descending order of score.
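The preprocessing step in [0007] can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the suffix rules, and the function names `naive_stem` and `preprocess` are all assumptions, and the naive suffix stripping merely stands in for a real lemmatizer and stemmer.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "with",
              "to", "is", "are", "was", "were", "for", "by"}

# Naive suffix rules standing in for a real lemmatizer/stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Clean, tokenize, remove stop words, and stem a long query text."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # data cleaning
    tokens = cleaned.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The patient was treated with statins."))
# ['patient', 'treat', 'statin']
```

The same function would be applied both to the documents in the retrieval database and to incoming queries, so that both sides share one vocabulary.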

[0011] Furthermore, the process of constructing a hierarchical tree of biomedical literature includes:

[0012] After data cleaning, lemmatization, stop word removal, and stemming of the full text of the biomedical literature in the retrieval database, the document set to be processed is obtained.

[0013] Using the document set to be processed as the root node, the latent Dirichlet allocation (LDA) method is used to model the topics of all documents, generating two topics.

[0014] Calculate the relevance between each document in the upper-level node and the two topics obtained. If the relevance is greater than a set threshold, the document is classified under the corresponding topic.

[0015] Documents whose relevance to both topics is no greater than the set threshold are grouped into one node, and that node is not further subdivided.

[0016] If the number of documents under a topic exceeds a set threshold, the LDA method is used to model the topics of all documents under that topic, generating two topics and dividing the documents downwards, until the number of documents in a node does not exceed the set threshold.

[0017] Furthermore, if the relevance is greater than 0.4, the search continues to the next level of nodes under that child node. Furthermore, the score of a nearest-neighbor document for the query content is calculated as follows:

[0018]

[0019] Here, the left-hand side of the formula represents the score of the nearest-neighbor document d for the long query content q; l represents the level of the nearest-neighbor document d in a leaf node relative to the root node; L is the total height of the biomedical literature hierarchy tree; x is the topic relevance of the document fragment at the leaf node; and x' is the topic proportion value obtained by topic inference of the long query content q at the leaf node.

[0020] Furthermore, each topic generated using the latent Dirichlet allocation (LDA) method consists of 20 keywords.

[0021] This invention employs a two-stage search during retrieval. First, it performs layer-by-layer topic inference on the new document content to quickly locate a specific topic leaf node; it then searches for nearest-neighbor document fragments according to the topic distribution of the document fragments at that leaf node, scores them, and returns the search results in descending order of score. In addition, it uses topic modeling to obtain a document's topic words and the co-occurrence relationships between them, which effectively reflects the central idea of long texts. As a result, after content similarity is compared, the retrieval results achieve a high precision rate, mostly exceeding 70%.

Attached Figure Description

[0022] Figure 1 is a schematic diagram of a preferred embodiment of the biomedical literature hierarchical topic tree of this invention;

[0023] Figure 2 shows a preferred generation process for the biomedical literature hierarchical topic tree in an embodiment of the present invention;

[0024] Figure 3 is a flowchart of the long query retrieval method for biomedical literature according to the present invention;

[0025] Figure 4 is a schematic diagram of the rapid location of a leaf node in one embodiment of the present invention;

[0026] Figure 5 is a schematic diagram of finding nearest-neighbor document content at a leaf node of the hierarchical topic tree constructed according to the present invention.

Detailed Implementation

[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0028] This invention proposes a method, apparatus, and computer device for long query content retrieval of biomedical literature. The retrieval method specifically includes the following steps:

[0029] Construct a hierarchical tree of biomedical literature, and for each child node select and store document fragments from the retrieval database according to their relevance to the node's topic words;

[0030] When a user enters a long query text, perform data cleaning, lemmatization, stop word removal, and stemming on the long query text to obtain the query content;

[0031] In the biomedical literature hierarchy tree, perform topic inference on the query content layer by layer from top to bottom based on the latent Dirichlet allocation (LDA) method. That is, the topic relevance between the query content and each node other than the root node is calculated from top to bottom. If the relevance is greater than a set threshold, the search continues to the next level of nodes under that child node;

[0032] If the child node being queried is a leaf node, the topic inference ends, and the N nearest-neighbor documents to the query content are found according to the document topic relevance distribution of the document fragments on the leaf node;

[0033] Calculate the scores of the N nearest-neighbor documents and push them to the user in descending order of score.
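The nearest-neighbor lookup at a leaf node ([0032]) can be illustrated with a short sketch. It assumes that "closest" means the smallest absolute difference between a document fragment's topic relevance x and the query's inferred topic proportion x'; the description does not state the distance measure, so this choice, the function name `nearest_documents`, and the sample values are all hypothetical. It also does not reproduce the scoring formula used for the final ranking.

```python
import heapq


def nearest_documents(doc_relevance: dict[str, float],
                      query_proportion: float,
                      n: int) -> list[str]:
    """Return the n document ids whose topic relevance x is closest to x'."""
    return heapq.nsmallest(
        n,
        doc_relevance,  # iterating a dict yields its keys (document ids)
        # Ties on distance are broken by document id to keep results stable.
        key=lambda doc_id: (abs(doc_relevance[doc_id] - query_proportion), doc_id),
    )


# Hypothetical topic relevance values for four fragments stored at one leaf.
leaf = {"d1": 0.9, "d2": 0.55, "d3": 0.7, "d4": 0.1}
print(nearest_documents(leaf, query_proportion=0.65, n=2))
# ['d3', 'd2']
```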

[0034] In this embodiment, each topic generated using the latent Dirichlet allocation (LDA) method consists of 20 keywords. This embodiment provides a specific implementation scheme for constructing the hierarchical tree of biomedical literature, as shown in Figure 2, which specifically includes the following steps:

[0035] After data cleaning, lemmatization, stop word removal, and stemming of the full text of the biomedical literature in the retrieval database, the document set to be processed is obtained.

[0036] Using the document set to be processed as the root node, the latent Dirichlet allocation (LDA) method is used to model the topics of all documents, generating two topics.

[0037] Calculate the relevance between each document in the upper-level node and the two topics obtained. If the relevance is greater than a set threshold, the document is classified under the corresponding topic.

[0038] Documents whose relevance to both topics is no greater than the set threshold are grouped into one node, and that node is not further subdivided.

[0039] If the number of documents under a topic exceeds a set threshold, the LDA method is used to model the topics of all documents under that topic, generating two topics and dividing the documents downwards, until the number of documents in a node does not exceed the set threshold.

[0040] In this embodiment, during the construction of the biomedical literature hierarchy tree, the relevance of a fragment to a topic can be calculated, that is, the topic distribution of the text fragment can be generated, the relevance of each document to a certain topic can be calculated, and then the documents can be classified into a topic according to the relevance value. Generally, those skilled in the art can achieve this by setting a fixed threshold. Preferably, the threshold can be set to 0.4, that is, if the relevance of a document to a topic is greater than 0.4, then the document is considered to be classified into the corresponding topic.

[0041] As shown in Figure 1, the `Top` node serves as the root node. The latent Dirichlet allocation (LDA) topic inference method is used to infer the topics of all documents, resulting in two topics: topic R and topic L. Documents in `Top` are then assigned to these two topics based on their relevance to them. If a document does not meet the relevance requirement for either topic, it is assigned to a node `M`. Regardless of the number of documents in node `M`, no further subdivision is performed. If the number of documents in topic L exceeds a set threshold, the documents under that node are further subdivided: LDA is used to infer the topics of the documents under topic L, resulting in two topics, topic LR and topic LL. Documents under topic L are then assigned to these two topics based on their relevance to them. If a document does not meet the relevance requirement, it is assigned to a node `LM`. This process continues until no further subdivision is possible, forming leaf nodes.
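The construction procedure walked through for Figure 1 can be sketched as a recursive two-way split. The sketch below is an illustration under stated assumptions, not the patented implementation: the `split` callable stands in for a two-topic LDA fit (the `toy_split` shown is not LDA), a document that exceeds the threshold for both topics is assigned to the more relevant one, and a guard stops recursion when a split fails to separate the documents. The names `Node`, `TopicSplit`, `build_tree`, and `toy_split` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical interface: maps each document id to its relevance to the two
# topics that a 2-topic model (e.g. LDA) would produce for these documents.
TopicSplit = Callable[[list[str]], dict[str, tuple[float, float]]]


@dataclass
class Node:
    name: str
    docs: list[str]
    children: dict[str, "Node"] = field(default_factory=dict)


def build_tree(docs: list[str], split: TopicSplit, max_docs: int,
               threshold: float = 0.4, name: str = "Top") -> Node:
    """Recursively divide documents into L / R topic nodes plus an M node."""
    node = Node(name, list(docs))
    if len(node.docs) <= max_docs:
        return node  # small enough: this node is a leaf
    relevance = split(node.docs)
    groups: dict[str, list[str]] = {"L": [], "R": [], "M": []}
    for doc in node.docs:
        rel_l, rel_r = relevance[doc]
        if max(rel_l, rel_r) <= threshold:
            groups["M"].append(doc)   # relevant to neither topic
        elif rel_l >= rel_r:          # assumption: the more relevant topic wins
            groups["L"].append(doc)
        else:
            groups["R"].append(doc)
    prefix = "" if name == "Top" else name
    for key, members in groups.items():
        if not members:
            continue
        child_name = prefix + key
        # M nodes are never subdivided; also stop if the split separated nothing.
        if key == "M" or len(members) == len(node.docs):
            node.children[key] = Node(child_name, members)
        else:
            node.children[key] = build_tree(members, split, max_docs,
                                            threshold, child_name)
    return node


def toy_split(docs: list[str]) -> dict[str, tuple[float, float]]:
    """Toy stand-in for a topic model: relevance depends on the id's first letter."""
    table = {"a": (0.9, 0.1), "b": (0.1, 0.9)}
    return {d: table.get(d[0], (0.3, 0.3)) for d in docs}


tree = build_tree(["a1", "a2", "a3", "b1", "b2", "u1"], toy_split, max_docs=3)
print({key: child.docs for key, child in tree.children.items()})
# {'L': ['a1', 'a2', 'a3'], 'R': ['b1', 'b2'], 'M': ['u1']}
```

Passing the topic model in as a callable keeps the tree logic independent of any particular LDA library.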

[0042] The retrieval process of this invention is shown in Figure 3. The latent Dirichlet allocation (LDA) method is used to perform topic inference on the content to be queried, layer by layer from top to bottom. Starting from the top-level node, if the topic relevance between the content to be queried and child node L of that node is greater than a set threshold (usually 0.4), topic inference continues at child node L; otherwise, if the topic relevance between the content to be queried and child node R of that node is greater than the set threshold (usually 0.4), topic inference continues at child node R; otherwise, the document search is performed directly at child node M. If the child node L or R reached in this way is a leaf node, topic inference stops and the document search is performed directly at that leaf node.
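The top-down descent described in [0042] can be sketched as a loop over tree levels. This is a hedged illustration: the `relevance` callable is a hypothetical stand-in for LDA topic inference of the query at a node, and `TopicNode`, `locate_leaf`, and the sample scores are assumed names and values. Falling back to the current node when no M child exists is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TopicNode:
    name: str
    children: dict[str, "TopicNode"] = field(default_factory=dict)


def locate_leaf(root: TopicNode, query: list[str],
                relevance: Callable[[list[str], TopicNode], float],
                threshold: float = 0.4) -> TopicNode:
    """Walk from the root to the node where the document search should run."""
    node = root
    while node.children:
        for key in ("L", "R"):  # try child L first, then child R
            child = node.children.get(key)
            if child is not None and relevance(query, child) > threshold:
                node = child
                break
        else:
            # Neither topic is relevant enough: search the M node directly.
            return node.children.get("M", node)
    return node


# A small tree shaped like the Figure 1 example, with hypothetical relevance scores.
tree = TopicNode("Top", {
    "L": TopicNode("L", {"L": TopicNode("LL"), "R": TopicNode("LR"),
                         "M": TopicNode("LM")}),
    "R": TopicNode("R"),
    "M": TopicNode("M"),
})
scores = {"L": 0.7, "R": 0.2, "LL": 0.1, "LR": 0.8}
found = locate_leaf(tree, ["statin"], lambda q, n: scores.get(n.name, 0.0))
print(found.name)
# LR
```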

[0043] After layer-by-layer topic reasoning, specific leaf nodes can be quickly located. The nearest neighbor documents are then searched according to the document topic relevance distribution at each leaf node. The found document list is scored according to the following scoring rules, and the search results are returned in descending order of score. The calculation of the nearest neighbor document score for the query content includes:

[0044]

[0045] Here, the left-hand side of the formula represents the score of the nearest-neighbor document d for the long query content q; l represents the level of the nearest-neighbor document d in a leaf node relative to the root node; L is the total height of the biomedical literature hierarchy tree; x is the topic relevance of the document fragment at the leaf node; and x' is the topic proportion value obtained by topic inference of the long query content q at the leaf node.

[0046] Taking Figure 4 as an example, the located child node is the node corresponding to topic RLL. In the figure, the distance between topic RLL and the root node is 3, which means that the nearest-neighbor document d in leaf node RLL is 3 levels away from the root node Top. In Figure 3, the total height of the biomedical literature hierarchy tree is 4.

[0047] Figure 5 is a schematic diagram of finding nearest-neighbor document content at a leaf node of the hierarchical topic tree constructed according to the present invention. The horizontal axis represents relevance to the two topics: the closer a document is to 0, the more relevant it is to topic 1, and the closer it is to 1, the more relevant it is to topic 2. The vertical axis represents the number of documents. In this embodiment, documents distributed in the range [0, 0.4] are generally considered related to topic 1, documents in the range (0.6, 1) are considered related to topic 2, and all other documents are considered to belong to neither topic.
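The interval rule described for Figure 5 can be written as a small function. This is only a sketch of the stated ranges: the function name `classify_by_position` is hypothetical, and treating the endpoint x = 1 as belonging to topic 2 is an assumption, since the text gives the open interval (0.6, 1).

```python
def classify_by_position(x: float) -> str:
    """Map a document's position on the horizontal axis to a topic label."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x <= 0.4:
        return "topic 1"   # range [0, 0.4]
    if x > 0.6:
        return "topic 2"   # range (0.6, 1), with x = 1 included by assumption
    return "neither"       # the middle band belongs to neither topic


for value in (0.1, 0.5, 0.9):
    print(value, classify_by_position(value))
```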

[0048] This invention also provides a long query content retrieval device for biomedical literature, used to implement the long query content retrieval method for biomedical literature. The device includes a server and a user terminal. The user terminal transmits the long query text to be retrieved to the server. The server performs retrieval on the received long query text and pushes the retrieval results to the user terminal. The server includes a data preprocessing module, a biomedical literature hierarchical tree, a retrieval database, and a retrieval module, wherein:

[0049] The data preprocessing module is used to perform data cleaning, lemmatization, stop word removal, and stemming on long query texts received by the server and on documents used in constructing the biomedical literature hierarchy tree;

[0050] The retrieval module is used to perform a top-down, layer-by-layer search within the constructed biomedical literature hierarchy tree to obtain the topic closest to the query content and its corresponding documents, and to fetch those documents from the retrieval database and recommend them to the user.

[0051] As shown in Figures 3-5, the execution efficiency of hLDA, HLTA, and the biomedical literature hierarchical topic tree construction method of this invention was analyzed using Medline abstract datasets of different sizes. The comparative experiment was carried out on a fat-node server with a 96-core Intel E7 Xeon processor (3.0 GHz) and 6 TB of DDR3 high-speed memory.

[0052] In the comparative experiments, the LDA estimation in the method of this invention is executed in two different ways: serial and parallel. In serial execution, each LDA estimation step is executed individually, and only one LDA estimation runs at a time. In parallel execution, LDA estimations are executed concurrently on the fat-node server at maximum concurrency. As the corpus size increases, the execution time of all methods increases; however, the execution time of hLDA and HLTA is one to two orders of magnitude longer than that of the method of this invention. As shown in Figures 3 to 5, HLTA has a shorter execution time than hLDA, with hLDA taking the most time. Furthermore, if the LDA estimation in the method of this invention is executed in parallel at all levels, the execution time of the method can be reduced by one third to one half. The execution efficiency of the method of this invention is therefore far superior to that of hLDA and HLTA, and the method is capable of meeting the needs of hierarchical topic modeling for large-scale datasets.
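The difference between serial and parallel execution in [0052] can be illustrated with a small sketch. It rests on the assumption that the estimations for the nodes on one level are independent of each other; `fit_level` and `fit` are hypothetical names, and `fit` is a placeholder for an LDA estimation on one node's documents. Threads are used only to keep the example short: in CPython, a CPU-bound estimation would need worker processes, or a library that releases the global interpreter lock, to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fit_level(nodes: Sequence[T], fit: Callable[[T], R],
              parallel: bool = True,
              max_workers: Optional[int] = None) -> list[R]:
    """Run one estimation per node, serially or concurrently, preserving order."""
    if not parallel:
        return [fit(node) for node in nodes]  # one estimation at a time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit, nodes))     # all nodes of this level at once


# Placeholder "estimation": count the documents held by each node.
level = [["d1", "d2"], ["d3"], ["d4", "d5", "d6"]]
print(fit_level(level, len, parallel=False))
print(fit_level(level, len, parallel=True))
# Both calls print [2, 1, 3]
```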

[0053] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A long query content retrieval method for biomedical literature, characterized in that it specifically includes the following steps:

constructing a hierarchical tree of biomedical literature, and selecting and storing document fragments for each child node based on the topic relevance of the node's topic to the document fragments in the retrieval database;

when a user enters a long query text, performing data cleaning, lemmatization, stop word removal, and stemming on the long query text to obtain the query content;

performing topic inference on the query content in the biomedical literature hierarchy tree, layer by layer from top to bottom, based on the latent Dirichlet allocation (LDA) method, that is, calculating the topic relevance between the query content and each child node other than the root node from top to bottom and, if the relevance is greater than a set threshold, searching the next level of child nodes of that child node;

calculating the score of a nearest-neighbor document for the query content, where the left-hand side of the scoring formula represents the score of the nearest-neighbor document d for the query content q; l represents the level of the nearest-neighbor document d in a leaf node relative to the root node; L is the total height of the biomedical literature hierarchy tree; x is the topic relevance of the document fragment in the leaf node; and x' is the topic proportion value of the query content q obtained by topic inference in the leaf node;

if the child node being queried is a leaf node, ending the topic inference and finding the N nearest-neighbor documents of the query content according to the topic relevance distribution of the document fragments on the leaf node;

calculating the scores of the N nearest-neighbor documents and pushing them to the user in descending order of score.

2. The long query content retrieval method for biomedical literature according to claim 1, characterized in that, when documents are divided during construction of the biomedical literature hierarchical tree, the division is based on the topic relevance between each document and each topic; when the topic relevance of a document to a topic is greater than a set threshold, the document is assigned to that topic; and documents that do not belong to any topic are uniformly assigned to a single node.

3. The long query content retrieval method for biomedical literature according to claim 1, characterized in that, if the relevance is greater than 0.4, the search continues to the next level of child nodes of that child node.

4. A long query content retrieval device for biomedical literature, characterized in that, to implement the long query content retrieval method for biomedical literature as described in claim 1, the device includes a server and a user terminal. The user terminal transmits the long query text to be retrieved to the server. The server performs retrieval on the received long query text and pushes the retrieval results to the user terminal. The server includes a data preprocessing module, a biomedical literature hierarchical tree, a retrieval database, and a retrieval module, wherein: the data preprocessing module is used to perform data cleaning, lemmatization, stop word removal, and stemming on long query texts received by the server and on documents used in constructing the biomedical literature hierarchy tree; and the retrieval module is used to perform a top-down, layer-by-layer retrieval in the constructed biomedical literature hierarchy tree, obtain the topic closest to the query content and its corresponding documents, fetch those documents from the retrieval database, and push them to the user terminal in descending order of their scores.

5. A computer device for long query content retrieval of biomedical literature, characterized in that the device includes a memory and a processor, wherein when the processor runs a computer program stored in the memory, it implements the long query content retrieval method for biomedical literature as described in any one of claims 1 to 3.