However, there are some defects in the current search engines. There are three main problems: one is that the search engines return too much retrieval information, some of which contain some noise data, and users cannot effectively locate the required information; the other is that the search engines do not understand users. The real search intent; the third is that the search engine simply considers the matching of keywords, and does not consider the grammatical and semantic relationship of the search terms, so it is difficult to improve the accuracy of query retrieval
However, the classification of Chinese questions in the prior art cannot effectively improve the hit rate and speed of retrieval, and cannot narrow the scope of retrieval of questions and reduce the retrieval time; the classification of questions in the prior art cannot optimize the retrieval items, and cannot recommend similarity to users. Question entry, the recall rate of the question answering system is low; the question classification affects the accuracy of the question answer, and the quality of the question classification algorithm determines the accuracy of the answer. The prior art question classification single algorithm is monotonous and inefficient. It is not conducive to improving the hit rate of answers. The existing Chinese question classification system cannot meet the requirements of intelligent question and answer for online consultation, and cannot be applied to the rigorous intelligent medical field;
[0008] Second, there is still a big gap between the classification of Chinese questions and the classification of English questions, especially in the field of medical question classification. The main reason is that Chinese questions have their own characteristics. Compared with English questions, Chinese The grammatical structure of questions is complex and the semantic information is diverse; the second is the lack of corresponding corpus and knowledge base; the third is that the research and application of Chinese question classification is relatively late, and most of the existing Chinese question classification adopts rule-based classification method, achieved some results on some standard data sets, by improving the Bayesian model to classify Chinese questions, extracting the main body of questions and combining word segmentation and part-of-speech feature values to classify questions, but its accuracy is affected by the analysis of syntactic structure Accuracy impact
Affected by the calculation method of semantic correlation, in general, the problems encountered in the classification of Chinese questions include: the questions themselves are short and contain a small number of words, which makes the problem of dimensionality disaster and data sparse in the training of question classification, The efficiency and accuracy of Chinese question classification cannot meet the requirements of medical online consultation;
[0009] Third, as a key technology in online medical consultation, intelligent question answering directly affects the quality and user experience of this emerging medical service. One of the core problems of intelligent question answering is to efficiently classify questions, but the characteristic of medical questions Sentence keywords are less, composed of diseases or symptoms + interrogative words + verbs, the efficiency of the existing method of constructing the feature vector of medical inquiry is low, and the error of the full-text index method is relatively large. In the Chinese environment, the classification of medical questions is more difficult. It is obvious that the construction of network question feature vectors is slow, and it is easy to cause problems such as excessive dimensionality and sparse data when building question feature vectors. The efficiency of question classification is very low, which will cause synonyms to generate different distributed vectors. And limited by the corpus, it cannot identify new words on the Internet very well, and the accuracy of word association and the classification efficiency of medical questions are low;
[0010] Fourth, the semantic correlation algorithm has obvious shortcomings. It does not consider the difference in semantics. Some words have polysemy. The semantic correlation algorithm is just a simple concept mapping, which is easy to introduce noise data. In addition, semantic correlation The degree algorithm needs to consider all the data of the search engine encyclopedia page, and the preprocessing stage consumes more time and resources, indicating that the text vector includes all search engine encyclopedia concepts, and the dimension of the vector reaches 900,000 dimensions, and the amount of calculation is too large;
[0011] Fifth, Chinese questions contain rich semantic information. Its structure is complex, the forms of questions are diverse, and there are polysemy and synonymous dependencies between words. Most of the Chinese questions are relatively short and contain only few keywords. There are many problems in question classification
The existing text representation method is the vector space model. This representation method results in sparse vectors and too large dimensions, and cannot describe the semantic relationship between words well, resulting in large errors in the calculation of similarity and affecting the accuracy of the test. However, due to the lack of training corpus, the similarity is inaccurate, and the words in some dictionaries are not rich enough to eliminate the error of synonyms and solve the problem of unregistered words. The problem of word vector construction, without considering the frequency of words, grammar, semantics and context, the obtained feature word vectors cannot meet the requirements