Chinese medical question classification system for deep encyclopedia learning

A deep learning and classification technology, applied to text database clustering/classification, unstructured text data retrieval, and other database retrieval. It addresses the low accuracy of word association, the low efficiency of medical question classification, and the poor answer hit rate, improving both accuracy and the efficiency of consultation question classification.

Pending Publication Date: 2021-09-17
李蕊男

AI Technical Summary

Problems solved by technology

However, current search engines have three main defects. First, they return too much retrieval information, some of it noise, so users cannot effectively locate what they need. Second, they do not understand the user's real search intent. Third, they consider only keyword matching and ignore the grammatical and semantic relationships among search terms, which makes it difficult to improve retrieval accuracy.
However, Chinese question classification in the prior art cannot effectively improve the hit rate and speed of retrieval, narrow the retrieval scope, or reduce retrieval time. It cannot optimize retrieval items or recommend similar question entries to users, so the recall rate of the question answering system is low. Question classification affects the accuracy of the answer, and the quality of the classification algorithm determines that accuracy. Prior-art question classification relies on a single algorithm, which is monotonous and inefficient and does not help improve the answer hit rate. Existing Chinese question classification systems cannot meet the requirements of intelligent question answering for online consultation and cannot be applied to the rigorous field of intelligent medicine;
[0008] Second, a large gap remains between Chinese and English question classification, especially for medical questions. There are three main reasons. First, Chinese questions have their own characteristics: compared with English questions, their grammatical structure is complex and their semantic information is diverse. Second, corresponding corpora and knowledge bases are lacking. Third, research on and application of Chinese question classification started relatively late, and most existing work uses rule-based methods. Some results have been achieved on standard data sets by improving the Bayesian model, extracting the main body of the question, and combining word segmentation with part-of-speech features, but accuracy depends on the accuracy of syntactic analysis and on how semantic correlation is calculated. In general, Chinese questions are short and contain few words, which leads to the curse of dimensionality and data sparsity when training question classifiers, so the efficiency and accuracy of Chinese question classification cannot meet the requirements of online medical consultation;
[0009] Third, as a key technology in online medical consultation, intelligent question answering directly affects the quality and user experience of this emerging medical service. One of its core problems is classifying questions efficiently. Medical questions, however, contain few keywords and typically consist of a disease or symptom plus an interrogative word plus a verb. Existing methods for constructing feature vectors of medical inquiries are inefficient, and the full-text index method has relatively large errors. Classifying medical questions is harder still in Chinese: constructing feature vectors for online inquiry questions is slow and prone to excessive dimensionality and data sparsity, so classification efficiency is very low. Synonyms also receive different distributed vectors, and because the corpus is limited, new Internet words are recognized poorly. As a result, both the accuracy of word association and the efficiency of medical question classification are low;
[0010] Fourth, the semantic correlation algorithm has obvious shortcomings. It does not account for differences in meaning: some words are polysemous, yet the algorithm performs only a simple concept mapping, which easily introduces noise. In addition, the algorithm must consider all the data on the search engine encyclopedia pages, so the preprocessing stage consumes considerable time and resources. Because the text vector includes every search engine encyclopedia concept, its dimension reaches 900,000, and the amount of computation is too large;
[0011] Fifth, Chinese questions carry rich semantic information. Their structure is complex, their forms are diverse, and words exhibit polysemy and synonymy. Most Chinese questions are relatively short and contain only a few keywords, which causes many problems for question classification.
The existing text representation method is the vector space model. It produces sparse vectors of very high dimension and cannot describe the semantic relationships between words well, which leads to large errors in similarity calculation and lowers test accuracy. In addition, because training corpora are lacking, similarity values are inaccurate, and some dictionaries are not rich enough to eliminate synonym errors or handle unregistered (out-of-vocabulary) words. Existing word vector construction does not consider word frequency, grammar, semantics, or context, so the resulting feature word vectors cannot meet the requirements.
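
The sparsity and synonym problems of the vector space model can be illustrated with a small sketch. It is not taken from the patent: the two questions, their segmentation, and the 94 filler vocabulary terms are hypothetical, and plain term counts stand in for whatever weighting a real system would use.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Map a tokenized question to raw term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sparsity(vector):
    """Fraction of entries that are zero."""
    return sum(1 for x in vector if x == 0) / len(vector)


# Hypothetical, pre-segmented questions asking the same thing with synonyms.
q1 = ["cerebral_hemorrhage", "treatment", "methods"]
q2 = ["brain_bleed", "therapy", "options"]

# Hypothetical vocabulary: the six question words plus 94 unrelated terms.
vocabulary = sorted(set(q1) | set(q2)) + [f"term_{i}" for i in range(94)]

v1 = bow_vector(q1, vocabulary)
v2 = bow_vector(q2, vocabulary)

print(len(vocabulary), sparsity(v1), cosine(v1, v2))
```

Because the two questions share no surface terms, their vectors are orthogonal even though they mean the same thing, and almost every entry of each vector is zero. With a vocabulary of hundreds of thousands of concepts the effect is far more pronounced.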

Examples

Embodiment

[0161] Example: "What are the treatment methods for cerebral hemorrhage?" The corpus is trained with Word2vec: words with a frequency below 5 are removed, the remaining words form a dictionary, and a word vector is generated for each dictionary word (here the vector dimension is 50). However, the trained dictionary file Vec.txt contains no vector for "cerebral hemorrhage". A 50-dimensional vector is therefore constructed from the search engine encyclopedia, the feature words are expanded with the efficient convergence method of semantic relevance, and the TF-IDF values of the expanded words are obtained. The number of feature words per question is set to n (n = 4 in the present invention); when a question contains more than n feature words, the excess must be deleted. In the deletion order, interrogative words such as "which" and "do" come second, and verbs or nouns are deleted last. The feature vector...
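
The frequency cutoff, the out-of-vocabulary check, and the cap of n = 4 feature words can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the toy corpus, the word categories, and all function names are hypothetical, and the first deletion category is not recoverable from the text above, so it is assumed here to be any word that is not an interrogative, verb, or noun. The encyclopedia-based vector construction and the TF-IDF step are not shown.

```python
from collections import Counter

MIN_COUNT = 5      # words seen fewer than 5 times are dropped (from the text)
MAX_FEATURES = 4   # n = 4 feature words per question (from the text)

# ASSUMPTION: lower rank is deleted first. Only "interrogatives second,
# verbs or nouns last" comes from the text; the "other" rank is a guess.
DELETION_RANK = {"other": 0, "interrogative": 1, "verb": 2, "noun": 2}


def build_vocabulary(tokenized_corpus, min_count=MIN_COUNT):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def out_of_vocabulary(tokens, vocabulary):
    """Words with no trained vector, which would need a fallback source."""
    return [tok for tok in tokens if tok not in vocabulary]


def trim_features(tagged_tokens, max_features=MAX_FEATURES):
    """Delete words by rank (ties: rightmost first) until max_features remain."""
    kept = list(tagged_tokens)
    while len(kept) > max_features:
        idx = min(
            range(len(kept)),
            key=lambda i: (DELETION_RANK.get(kept[i][1], 0), -i),
        )
        del kept[idx]
    return kept


# Hypothetical toy corpus: "cerebral hemorrhage" occurs only twice.
corpus = [["treatment", "methods"]] * 5 + [["cerebral hemorrhage", "treatment"]] * 2
vocabulary = build_vocabulary(corpus)
missing = out_of_vocabulary(["cerebral hemorrhage", "treatment", "methods"], vocabulary)

# Hypothetical segmentation and tagging of the example question.
question = [
    ("cerebral hemorrhage", "noun"), ("of", "other"), ("treatment", "noun"),
    ("methods", "noun"), ("have", "verb"), ("which", "interrogative"),
]
features = trim_features(question)
print(missing, [word for word, _ in features])
```

In this toy run "cerebral hemorrhage" falls below the frequency cutoff, which mirrors the missing Vec.txt entry described above, and the six tagged words are reduced to four by removing the function word and then the interrogative.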

Abstract

The Chinese medical question classification system based on deep encyclopedia learning combines the semantic structure of a Chinese search encyclopedia with deep learning to construct feature vectors more efficiently and accurately. It uses an efficient convergence method for semantic association degree, based on the semantic structure of the Chinese search encyclopedia, to construct feature vectors for online inquiry questions. Based on the features of medical questions, it improves the semantic association degree algorithm, overcomes the slow construction of feature vectors, and expands feature words by extracting word links from the Chinese search encyclopedia. Using the distributed Chinese word vector space of the CB-CBS language model, it reduces the dimensionality of inquiry question feature vectors efficiently, avoids data sparsity, and greatly improves classification efficiency. It also combines the CB-CBS model with the Chinese search encyclopedia and deep learning to construct distributed medical question word vectors and a professional medical question corpus, markedly improving the accuracy of word association degree and the efficiency of medical question classification.

Description

Technical field

[0001] The invention relates to a Chinese medical question classification system, in particular to a Chinese medical question classification system based on deep encyclopedia learning, and belongs to the technical field of Chinese question classification.

Background technique

[0002] In the era of big data, search engines have become an indispensable tool for Internet users. Through a search engine, the required material can be obtained from massive amounts of information: the user only needs to enter a keyword, and web page information related to that keyword is returned immediately. However, current search engines have three main defects. First, the search engine returns too much retrieval information, much of it noise, so the user cannot effectively locate the required information. Second, the search engine does not understand the user's real search intent; third, th...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/332, G06F16/35, G06F40/35, G06F16/951
CPC: G06F16/3329, G06F16/35, G06F40/35, G06F16/951
Inventor: 李蕊男, 王军
Owner: 李蕊男