Search engine selection abstract generation method and device

A search engine selected-abstract technology, applied in the field of search engines, which solves problems such as search results failing to meet user needs, balances query cost against answer accuracy, and improves user experience and search quality.

Pending Publication Date: 2021-02-02
BEIJING QIHOO TECH CO LTD


Abstract

The invention discloses a method and a device for generating a selected abstract for a search engine. The method comprises the following steps: identifying question-and-answer query words from search log data; obtaining search results corresponding to the question-and-answer query words; if the search results do not contain a document of a specified type, taking the question-and-answer query words and the corresponding search results as input data and outputting annotated candidate answers corresponding to the question-and-answer query words according to a machine reading comprehension model; and obtaining annotated answers corresponding to the annotated candidate answers based on active learning, and taking the annotated answers as the selected abstracts of the corresponding question-and-answer query words in the search engine. According to this technical scheme, for the recognized question-and-answer query words, a combination of machine reading comprehension and active learning provides the search engine with a selected abstract through which a user can view the needed content directly on the search results page; query cost and answer accuracy are both taken into account, and the user's experience and search quality are improved.

Application Domain

Digital data information retrieval; Special data processing applications

Technology Topic

Search engine results page; Annotation (+7 more)

Image

  • Search engine selection abstract generation method and device

Examples

  • Experimental program(1)

Example Embodiment

[0024]Hereinafter, exemplary embodiments of the present invention will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present invention and to fully convey the scope of the present invention to those skilled in the art.
[0025]The design idea of the present invention is to provide a "selected abstract" for question-and-answer query words, which differs from traditional search results in that users can obtain a high-accuracy answer on the search results page without clicking through to a webpage. For example, a search engine can extract a definite answer to the user's question and place it conspicuously at the top of the returned results, saving users much of the time spent obtaining answers and thereby improving user experience and search quality. Because in this process the search engine selects answers on the user's behalf, the present invention calls these results "selected abstracts".
[0026]It should be noted that although the technical solution of the present invention provides selected abstracts for search engines, the process of generating selected abstracts can be independent of the existing processes of general search engines, improving adaptability to various search engines and easing integration through loose coupling.
[0027]Figure 1 shows a schematic flowchart of a method for generating a selected abstract of a search engine according to an embodiment of the present invention. As shown in Figure 1, the method includes:
[0028]Step 110: Identify question-and-answer query words from the search log data.
[0029]Each user of a search engine generates a large amount of search log data during daily searches, recording, for example, which query terms the user searched for and which websites in the search results the user clicked after obtaining them. Statistics and research show that selected abstracts help improve the user search experience most for question-and-answer query words. Therefore, the technical solution of the present invention mainly generates corresponding selected abstracts for question-and-answer query words.
[0030]Step 120: Obtain search results corresponding to the question and answer query words. This step can be achieved by calling the existing search engine interface, etc.
[0031]Step 130: If the search result does not contain the specified type of document, the question and answer query term and the corresponding search result are used as input data, and the labeled candidate answer corresponding to the question and answer query term is output according to the machine reading comprehension model.
[0032]For a document of a specified type, content can be extracted through a template because the type is known; for other webpages, the difficulty lies in how to obtain information related to the answer required by the question-and-answer query. In this regard, the embodiment of the present invention adopts MRC (Machine Reading Comprehension): question-and-answer query words and corresponding search results serve as input data, and annotated candidate answers corresponding to the question-and-answer query words are output according to the machine reading comprehension model.
[0033]Since the accuracy of this answer may deviate from the actual needs of the user (for example, verification may find that the model's cold-start effect is poor and the accuracy of the generated answers is insufficient to go online directly), the embodiment of the present invention, through step 140, obtains annotated answers corresponding to the annotated candidate answers based on active learning and uses the annotated answers as the selected abstracts of the corresponding question-and-answer query words in the search engine. Traditional labeling improves the model's effect very slowly, and labeling selected abstracts places high demands on the labelers' knowledge of text theory, making labeling difficult; therefore, the embodiment of the present invention also adopts active learning to further improve accuracy.
[0034]It can be seen that, for the identified question-and-answer query words, the method shown in Figure 1 provides search engines, through a combination of machine reading comprehension and active learning, with a selected abstract from which users can directly view the desired content on the search results page; query cost and answer accuracy are both taken into account, and user experience and search quality are improved.
[0035]Figure 5a shows a schematic diagram of a search result page obtained for a question-and-answer query word in the prior art; Figure 5b shows a schematic diagram of a search result page obtained for a question-and-answer query word according to an embodiment of the present invention. The first search result in Figure 5b is a selected abstract: users can get most of the answer without clicking into the webpage, or they can click through and get more information from the original page.
[0036]In an embodiment of the present invention, the above method includes: performing preset types of processing on the search log data to extract query words that meet the requirements; and using those query words as input data and outputting question-and-answer query words according to a query word classification model.
[0037]Since the query terms searched by users are massive and vary in form and popularity, to maintain efficiency one can selectively filter out query terms that meet the requirements and classify them with a query word classification model obtained through machine learning, thereby identifying question-and-answer query words. For example, "Why is the sea blue" is a question-and-answer query word, while "1933" is not; it is a year query word.
[0038]In an embodiment of the present invention, the above method includes: sorting the query words in the search log data by page views (PV) and extracting several query words from high to low according to that order; and/or normalizing the query words to remove punctuation and/or spaces in them.
[0039]Because PV reflects the popularity of a query term, for less popular query terms the feedback from providing a selected abstract is slower and less obvious; therefore, for efficiency, the more popular query terms can be extracted first. In addition, because query terms are not standardized enough, normalization that removes punctuation and/or spaces facilitates subsequent processing.
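The PV-sorting and normalization steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the punctuation set and the use of a counter as a PV proxy are assumptions.

```python
import re
from collections import Counter

def normalize_query(query: str) -> str:
    """Remove punctuation and whitespace from a query word (illustrative set:
    ASCII punctuation ranges plus a few common CJK punctuation marks)."""
    return re.sub(r"[\s!-/:-@\[-`{-~，。？！、：；“”‘’（）]+", "", query)

def top_queries_by_pv(log_queries, top_n=2):
    """Count PV per normalized query and return the most popular ones first,
    so higher-heat query words are extracted before low-heat ones."""
    pv = Counter(normalize_query(q) for q in log_queries)
    return [q for q, _ in pv.most_common(top_n)]

queries = ["why is the sky blue?", "why is the sky blue ?", "1933"]
print(top_queries_by_pv(queries))
```

Note that normalization also merges near-duplicate forms of the same query (with and without trailing punctuation), which concentrates their PV counts.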
[0040]In an embodiment of the present invention, the above method further includes the following steps for preparing training data for the query word classification model: acquiring search log data generated within a preset period of time as sample data; counting, for each query word in the sample data, the total number of clicks and the number of clicks on Q&A sites, and calculating the Q&A click ratio of each query word within the preset time period; and taking query words whose Q&A click ratio is greater than a first threshold as positive examples and query words whose Q&A click ratio is lower than a second threshold as negative examples, to obtain the training data.
[0041]Here, whether a query word is a question-and-answer query word is identified according to users' click behavior, implemented through a query word classification model. The embodiment of the present invention provides an example of generating training data: two thresholds are preset, the total number of clicks and the number of Q&A-site clicks for each query word in the sample data are counted, and the Q&A click ratio of each query word within the preset time period is calculated. This is easy to understand: if a word is a question-and-answer query, the user will most likely click on a Q&A site after searching; if not, the user will often not click on a Q&A site. Positive and negative examples are generated accordingly. In a specific scenario, the first threshold may be 0.8 and the second threshold 0.1, values verified to work well.
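The threshold-based labeling described above can be sketched as follows. The 0.8 / 0.1 thresholds come from the text; the input data structure is an assumption for illustration.

```python
def build_classifier_training_data(click_stats, pos_threshold=0.8, neg_threshold=0.1):
    """click_stats: {query: (total_clicks, qa_site_clicks)} aggregated from logs.
    Returns (positives, negatives) for the query word classification model."""
    positives, negatives = [], []
    for query, (total, qa_clicks) in click_stats.items():
        if total == 0:
            continue  # no click evidence for this query word
        ratio = qa_clicks / total
        if ratio > pos_threshold:
            positives.append(query)
        elif ratio < neg_threshold:
            negatives.append(query)
        # query words in the grey zone between thresholds are left unlabeled
    return positives, negatives

stats = {"why is the sea blue": (100, 92), "1933": (50, 2), "weather tomorrow": (40, 10)}
pos, neg = build_classifier_training_data(stats)
print(pos, neg)
```

Leaving the grey zone unlabeled keeps the training data clean: only query words with strong click evidence in either direction become examples.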
[0042]In an embodiment of the present invention, the above method further includes the following step for training the query word classification model: dividing the training data into a training set, a validation set, and a test set according to a preset ratio, and training based on the textCNN model to obtain the query word classification model.
[0043]For example, the training data is divided into training, validation, and test sets at a ratio of 7:2:1 and then input into the textCNN model for training. textCNN is a deep learning model that applies a CNN to text classification and performs well on such tasks.
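The 7:2:1 split can be sketched as follows. This is only the data-partitioning step; the actual textCNN training is not shown, and the fixed seed is an assumption for reproducibility.

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle labeled samples and split them into train/validation/test
    sets in the 7:2:1 proportion described in the text."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

data = [(f"query {i}", i % 2) for i in range(100)]  # (query word, label) pairs
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 70 20 10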
[0044]In an embodiment of the present invention, the above method includes: taking query words in the sample data that contain specified words as positive examples. For example, if a query contains words such as "how" or "why", it can almost certainly be determined to be a question-and-answer query. Such queries containing specified words can therefore be used directly as positive examples; alternatively, query words containing the specified words can be excluded from the negative examples derived from the click-ratio statistics.
[0045]In an embodiment of the present invention, in the above method, Q&A sites are determined according to URL patterns. For example, for the question-and-answer query term "How to make tomato scrambled eggs", corresponding answers can be found on different Q&A sites. Taking Zhihu, 360 Q&A, and Baidu Zhidao as examples, the URL of each answer is as follows:
[0046]Zhihu URL: https://www.zhihu.com/question/19576438
[0047]Baidu Zhidao URL: https://zhidao.baidu.com/question/572659771.html
[0048]360 Q&A URL: https://wenda.so.com/q/1532241908216013
[0049]In fact, statistics show that the Q&A URLs on various Q&A sites follow certain URL patterns. For example, the URL pattern of Zhihu is https://www.zhihu.com/question/……, that of Baidu Zhidao is https://zhidao.baidu.com/question/…….html, and that of 360 Q&A is https://wenda.so.com/q/……. Therefore, Q&A sites can be determined according to URL patterns.
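The URL-pattern matching can be sketched with regular expressions. The patterns below are derived only from the example URLs in the text; a real deployment would maintain a fuller, regularly updated pattern list.

```python
import re

# Patterns derived from the example URLs above (illustrative, not exhaustive).
QA_URL_PATTERNS = [
    re.compile(r"^https://www\.zhihu\.com/question/\d+"),
    re.compile(r"^https://zhidao\.baidu\.com/question/\d+\.html"),
    re.compile(r"^https://wenda\.so\.com/q/\d+"),
]

def is_qa_site_url(url: str) -> bool:
    """Return True when a clicked URL matches a known Q&A-site pattern."""
    return any(p.match(url) for p in QA_URL_PATTERNS)

print(is_qa_site_url("https://www.zhihu.com/question/19576438"))           # True
print(is_qa_site_url("https://zhidao.baidu.com/question/572659771.html"))  # True
print(is_qa_site_url("https://www.example.com/article/123"))               # False
```

Such a predicate is what the click-ratio statistics above would use to decide whether a click landed on a Q&A site.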
[0050]In an embodiment of the present invention, the above method includes: preprocessing the identified question-and-answer query words, removing repeated question-and-answer query words, and removing question-and-answer query words whose answers have already been annotated. This deduplication reduces repeated generation of selected abstracts and improves efficiency.
[0051]In an embodiment of the present invention, the above method includes: classifying the identified question and answer type query words, and filtering out the question and answer type query words other than the preset type. This can further filter out some question-and-answer query words that are not suitable for generating selected abstracts.
[0052]In an embodiment of the present invention, the above method includes: classifying the identified question-and-answer query words according to at least one of a topic classification model, a query word classification model, and an answer classification model. As the name implies, the topic classification model can be classified according to the subject of question and answer query words, the query word classification model can be classified according to the attributes of the query words themselves, and the answer classification model can be classified according to the attributes of answers to the question and answer query words.
[0053]In an embodiment of the present invention, in the above method, the topic type includes at least one of the following: mobile phone digital, life, games, education and science, leisure and hobbies, culture and art, financial management, social and people's livelihood, sports, and region; query The word type includes: fact type and/or opinion type; answer type includes: description type and/or entity type.
[0054]The topic types listed above are those users search most frequently, and the effect of generating selected abstracts for such question-and-answer query words is most obvious. Analyzing the question-and-answer query terms themselves and the attributes of their answers, four types of user queries are the focus, since selected abstracts for these queries help improve the user search experience the most. These four types can be divided from two perspectives: by the nature of the query into fact and opinion types, and by the length characteristics of the answer into entity and description types. For example, Figure 6a shows a question-and-answer query word of the fact-description type; Figure 6b shows one of the fact-entity type; Figure 6c shows one of the opinion-description type; and Figure 6d shows one of the opinion-entity type.
[0055]In an embodiment of the present invention, the above method includes: if the answer type of the question-and-answer query word is the entity type, calling a machine reading comprehension model without the ranking algorithm to output multiple annotated candidate answers; otherwise, calling the complete machine reading comprehension model to output a single annotated candidate answer.
[0056]Since entity-type question-and-answer query words have relatively short answers, the labeling cost is low and multiple candidate answers can be output for labeling; other question-and-answer query words can invoke the ranking algorithm, i.e., the complete machine reading comprehension model, to output only the highest-scoring candidate answer, which reduces labeling cost.
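The dispatch between the two model variants can be sketched as follows. `mrc_extract` and `mrc_rank` are placeholder callables standing in for the MRC model's extraction and ranking components, which the patent does not detail.

```python
def generate_candidates(query, answer_type, mrc_extract, mrc_rank):
    """Entity-type queries get several unranked candidates (short answers are
    cheap to label); other queries run the full pipeline and keep only the
    top-ranked candidate to reduce labeling cost."""
    candidates = mrc_extract(query)
    if answer_type == "entity":
        return candidates            # label several short candidates
    ranked = mrc_rank(query, candidates)
    return ranked[:1]                # label only the highest-scoring one

# Toy stand-ins for demonstration only.
fake_extract = lambda q: ["a1", "a2", "a3"]
fake_rank = lambda q, cs: sorted(cs, reverse=True)
print(generate_candidates("q", "entity", fake_extract, fake_rank))
print(generate_candidates("q", "description", fake_extract, fake_rank))
```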
[0057]Before calling the machine reading comprehension model to generate annotated candidate answers, the data format can be adjusted: this step converts the data obtained in the previous steps into the format required for calling the machine reading comprehension model. A specific example is as follows:
[0058]Model input: The input of the model is a query and several documents, as follows:
[0059]{"Query":"Why is the sky blue",
[0060]"Document":["Sunlight is composed of seven kinds of light: red, orange, yellow, green, cyan, blue, and purple. Among the seven, cyan, blue, and purple light have shorter wavelengths and are easily scattered by air molecules and dust; the ability of atmospheric molecules and dust to scatter shorter-wavelength blue light is much higher than for longer-wavelength photons.",
[0061]"Short-wavelength light, such as purple and blue, is more likely to be absorbed by air molecules than long-wavelength light (the red, orange, and yellow bands of the spectrum). The air molecules then radiate the violet and blue light in different directions, filling the sky."
[0062]...
]
[0063]}
[0064]Model output: the interval of the answer in the document, placed in the answer_spans field.
[0065]{"Query":"Why is the sky blue",
[0066]"Document":["Sunlight is composed of seven kinds of light: red, orange, yellow, green, cyan, blue, and purple. Among the seven, cyan, blue, and purple light have shorter wavelengths and are easily scattered by air molecules and dust; the ability of atmospheric molecules and dust to scatter shorter-wavelength blue light is much higher than for longer-wavelength photons.",
[0067]"Short-wavelength light, such as purple and blue, is more likely to be absorbed by air molecules than long-wavelength light (the red, orange, and yellow bands of the spectrum). The air molecules then radiate the violet and blue light in different directions, filling the sky."
[0068]...
[0069]],
[0070]"Answer_spans": [44,77]
[0071]}
[0072]In an embodiment of the present invention, in the above method, the topic classification model is a text multi-classification model, and the method further includes the following steps for preparing training data for the topic classification model: obtaining search log data generated within a preset time period as sample data; counting the click ratios of query terms in the sample data on sites of different topic types; counting the page views of each query term in the sample data and, according to the page views, dividing out high-frequency query terms; and taking the topic type of the site with the highest click ratio for a high-frequency query term as the topic type of that query term.
[0073]For example, when users search for "How is the iPhone XS configured", they often click on mobile phone and digital sites in the search results, such as Zhongguancun Online. The topic type of the query term can therefore be determined from the click ratio. This approach works well for query words with high page views (PV), while the effect for medium-frequency and low-frequency words is somewhat worse. Therefore, the present invention also provides examples of preparing training data for medium-frequency and low-frequency words.
[0074]In an embodiment of the present invention, in the above method, the step of preparing training data for the topic classification model further includes: dividing out medium-frequency query words from the query words in the sample data according to page views; using the high-frequency query words whose topic types have been determined as a training set to train a support vector machine (SVM) model; and determining the topic types of the medium-frequency query words with the trained SVM model. An SVM is a generalized linear classifier that performs binary classification on data by supervised learning; its decision boundary is the maximum-margin hyperplane over the training samples. It has been verified that this method has higher classification accuracy for medium-frequency words.
[0075]In an embodiment of the present invention, in the above method, the step of preparing training data for the topic classification model further includes: dividing out low-frequency query words from the query words in the sample data according to page views, and determining the topic types of the low-frequency query words according to a syntactic dependency tree. Dependency parsing (DP) reveals the syntactic structure of a language unit by analyzing the dependency relationships between its components. Intuitively, dependency parsing identifies grammatical components such as subject, predicate, and object in a sentence and analyzes the relationships between them, usually forming a tree structure. It has been verified that this method has higher classification accuracy for low-frequency words.
[0076]In a specific embodiment, query words with an average daily PV greater than 50 are treated as high-frequency words, those between 5 and 50 as medium-frequency words, and those with an average daily PV less than 5 as low-frequency words.
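The frequency bucketing above, which selects the classification route for each query word, can be sketched as follows. The thresholds (>50, 5–50, <5) come directly from the text.

```python
def frequency_bucket(daily_avg_pv: float) -> str:
    """Bucket a query word by average daily page views, using the
    thresholds given in the specific embodiment."""
    if daily_avg_pv > 50:
        return "high"    # topic type taken from click-ratio statistics
    if daily_avg_pv >= 5:
        return "medium"  # topic type determined by the trained SVM model
    return "low"         # topic type determined via the dependency tree

print(frequency_bucket(120), frequency_bucket(20), frequency_bucket(2))
```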
[0077]In an embodiment of the present invention, in the above method, the query word classification model and the answer classification model are both text multi-classification models; the query word classification model is trained based on features of the query words, and the answer classification model is trained based on answer-length features. From the above examples it can be seen that fact-type query words have a relatively uniform answer while opinion-type query words have multiple answers, and entity-type query words have shorter answers while description-type query words have longer answers.
[0078]In an embodiment of the present invention, in the above method, identifying question-and-answer query words from the search log data further includes: filtering the identified question-and-answer query words along a number of specified dimensions to obtain filtered question-and-answer query words. Specifically, in an embodiment of the present invention, filtering along the specified dimensions includes: filtering each specified dimension with a corresponding text classification model, the text classification models being trained based on SVM and/or fastText. Different text classification models can be selected according to the type of dimension. The specified dimensions can cover non-compliance with relevant laws and regulations, sensitive query words, and so on, and can also filter out question-and-answer query words involving specific people or causing discomfort.
[0079]In an embodiment of the present invention, in the above method, obtaining search results corresponding to question-and-answer query words includes: calling a search engine interface to obtain, in result order, a first number of natural search results corresponding to the question-and-answer query words; and adjusting the natural search results according to a preset algorithm, selecting a second number of natural search results as the search results corresponding to the question-and-answer query words.
[0080]The technical solution of the present invention can obtain search results by calling the interface of an existing search engine. Since the search engine's ranking of results is not necessarily based on factors such as semantic relevance, several of the higher-ranked natural search results can first be filtered out and then adjusted using the preset algorithm.
[0081]In addition, many search engines rewrite query words, for example rewriting Pinyin into Chinese characters; in that case the final selected abstracts can correspond to the rewritten words.
[0082]In an embodiment of the present invention, in the above method, adjusting the natural search results according to a preset algorithm includes at least one of the following: filtering out, from the natural search results, document-type sites whose web content cannot be obtained; and promoting the ranking of sites whose trust level reaches a first preset value.
[0083]Document-type sites whose web content cannot be obtained, and whose content therefore cannot be captured in subsequent steps, are filtered out directly. In addition, the ranking of authoritative sites can be promoted.
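The two adjustments can be sketched as follows. The result-record field names (`site`, `fetchable`) and the single-position boost are assumptions for illustration.

```python
def adjust_results(results, trusted_sites, boost=1):
    """Drop results whose content cannot be fetched, then move trusted
    (authoritative) sites up by `boost` positions in the ranking."""
    kept = [r for r in results if r.get("fetchable", True)]
    for i in range(1, len(kept)):
        if kept[i]["site"] in trusted_sites:
            j = max(0, i - boost)
            kept.insert(j, kept.pop(i))
    return kept

results = [
    {"site": "blog.example.com", "fetchable": True},
    {"site": "docs.example.com", "fetchable": False},   # content not fetchable
    {"site": "authority.example.org", "fetchable": True},
]
print([r["site"] for r in adjust_results(results, {"authority.example.org"})])
```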
[0084]In an embodiment of the present invention, in the above method, obtaining the search results corresponding to the question-and-answer query words further includes: filtering the question-and-answer query words according to the natural search results. Specifically, in an embodiment of the present invention, this filtering includes at least one of the following: filtering out question-and-answer query words whose natural search results contain an application box; filtering out question-and-answer query words whose natural search result titles contain illegal words; and filtering out question-and-answer query words that lack high-quality natural search results based on semantic matching.
[0085]The application box (onebox) is a relatively mature way of displaying search results: for query terms about stocks, film and television dramas, weather, and the like it already gives ideal results, so no selected abstracts need to be generated for them. Since subsequent processing relies on the machine reading comprehension model, if high-quality natural search results based on semantic matching are lacking, the desired answer cannot be obtained even with machine reading comprehension, so such question-and-answer query words are also filtered out. Illegal words may be words that do not comply with relevant laws and regulations.
[0086]In an embodiment of the present invention, the above method further includes: if the search results contain a document of the specified type, directly extracting annotated candidate answers from that document. Specifically, in an embodiment of the present invention, the document of the specified type is an html document containing several pieces of step-description information, and directly extracting the annotated candidate answer from it includes: parsing the html document, extracting the pieces of step-description information by field matching, and obtaining the annotated candidate answer by splicing them together.
[0087]Figure 7 shows a schematic diagram of a webpage of a document of the specified type. As shown in Figure 7, it contains several steps, and extracting these steps yields the annotated candidate answers. One approach is to use the Beautiful Soup library (Beautiful Soup is a Python package that parses html and xml documents and repairs errors such as unclosed tags; such documents are often called tag soup) to parse the html into objects, match the relevant fields to find the content following each "step" in the webpage, and then splice them into the annotated candidate answer for the question-and-answer query word.
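A minimal stand-in for the field matching and splicing described above is sketched below using a regular expression instead of Beautiful Soup, to keep the example dependency-free; the `class="step"` marker is an assumption, as the patent does not specify the page markup.

```python
import re

def extract_steps(html: str) -> str:
    """Pull out elements whose (assumed) class marks them as steps and
    splice their text into one annotated candidate answer."""
    steps = re.findall(r'<p class="step">(.*?)</p>', html, flags=re.S)
    return " ".join(s.strip() for s in steps)

page = """
<html><body>
<p class="step">Step 1: Beat the eggs.</p>
<p class="step">Step 2: Chop the tomatoes.</p>
<p class="step">Step 3: Stir-fry together.</p>
</body></html>
"""
print(extract_steps(page))
```

A production version would use Beautiful Soup as the text describes, since it tolerates unclosed tags and other tag-soup errors that defeat regular expressions.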
[0088]In an embodiment of the present invention, in the above method, using the question-and-answer query words and the corresponding search results as input data includes: calculating semantic relevance scores between the question-and-answer query words and the web page titles of the search results according to a semantic matching model, and ranking the search results by these semantic relevance scores.
[0089]To further improve the efficiency and accuracy of answer generation, before the actual answer generation a semantic matching model can be used to score the question-and-answer query words against the web page titles of the candidate documents, and the candidate documents can then be re-ordered by score so that the more relevant ones are promoted to the front. In this way, the front-ranked candidate documents are as likely as possible to contain a correct answer.
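A minimal sketch of this re-ranking step follows; the word-overlap scorer is a toy stand-in for the BERT-based semantic matching model, not the actual model:

```python
def overlap_score(query: str, title: str) -> float:
    # Toy stand-in for the semantic matching model: Jaccard word overlap.
    q, t = set(query.split()), set(title.split())
    return len(q & t) / len(q | t) if q | t else 0.0

def rerank(query, titles, score_fn=overlap_score):
    # Higher-scoring (more relevant) candidate documents move to the front.
    return sorted(titles, key=lambda t: score_fn(query, t), reverse=True)

titles = ["how to cook rice", "why is the sky blue", "sky color facts"]
print(rerank("why is the sky blue", titles))
# the exact-match title is promoted to the first position
```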
[0090]In an embodiment of the present invention, the above method further includes: if the calculated semantic relevance scores are all lower than a second preset value, skipping the step of outputting annotated candidate answers for the question-and-answer query word according to the machine reading comprehension model and the subsequent steps, and instead directly grabbing selected abstracts for the question-and-answer query word from question-and-answer sites.
[0091]If all candidate documents for a question-and-answer query have low matching scores with the query, it means there is no high-quality document for that query, and an answer generated directly from these documents is likely to be of poor quality. Therefore, the subsequent answer generation step is not performed, and the process enters the "knowledge site crawling" flow instead.
[0092]In an embodiment of the present invention, the above method further includes the following steps of training the semantic matching model: obtaining pairs of question-and-answer query words and web page titles labeled as positive and negative examples, and constructing them into training data; and performing fine-tune training based on a BERT pre-training model and the training data to obtain the semantic matching model.
[0093]1) Obtain positive and negative examples separately through another labeling process, as follows:
[0094]Positive example: "why is the sky blue" and "why is the sky blue"
[0095]Negative example: "why is the sky blue" and "why is the sea salty"
[0096]2) Construct the processor dictionary, construct the data processing flow, and form the data format required by the model, as follows:
[0097]Positive example: 1\twhy is the sky blue\twhy is the sky blue
[0098]Negative example: 0\twhy is the sky blue\twhy is the sea salty
[0099]3) Perform fine-tune: run run_classifier.py to train the model.
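Step 2) above can be sketched as writing out the tab-separated `label\tquery\ttitle` lines in the format consumed by the classifier script:

```python
# Build the tab-separated training lines (label \t query \t title)
# from labeled positive/negative query-title pairs.
pairs = [
    (1, "why is the sky blue", "why is the sky blue"),   # positive example
    (0, "why is the sky blue", "why is the sea salty"),  # negative example
]
lines = ["{}\t{}\t{}".format(label, q, t) for label, q, t in pairs]
for line in lines:
    print(line)
```

In practice these lines would be written to the train/dev/test files that the BERT fine-tuning script reads.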
[0100]In an embodiment of the present invention, in the above method, the method further includes: for question-and-answer query words whose PV is lower than the third preset value, directly grabbing selected abstracts from question-and-answer sites.
[0101]Q&A query terms with low PV have fewer searchers, so grabbing selected abstracts from Q&A sites can save resource consumption.
[0102]In an embodiment of the present invention, the above method further includes: generalizing the question-and-answer query words to obtain semantically similar query words; and using the annotated answers also as the selected abstracts of those semantically similar query words in the search engine.
[0103]On the one hand, semantically similar query terms can share the same answer, so there is no need to regenerate it; on the other hand, the queries need to be generalized to improve the recall rate of selected abstracts on the search engine. This "expanded coverage" operation is based on the accumulation of a certain amount of high-quality query-answer data (a query-answer database, or Q-A database).
[0104]In an embodiment of the present invention, in the above method, generalizing the question-and-answer query words to obtain semantically similar query words includes: mining candidate query words corresponding to the question-and-answer query words based on query presentation, user click behavior, and co-click behavior; calculating the semantic relevance score between the question-and-answer query word and each candidate query word according to the semantic matching model; and taking candidate query words whose semantic relevance score is higher than a fourth preset value as the semantically similar query words.
[0105]For example, from a large volume of search logs, information such as query presentation, user click behavior, and co-click behavior is used to mine a batch of new queries semantically similar to the queries in the Q-A database as a candidate set for expanded coverage. The BERT-based semantic matching model (whose training process can refer to the foregoing embodiments) is then called to score each query-query pair; the higher the score, the more similar the two are semantically. For example, synonymous queries with a score higher than 0.9 are selected, and coverage is thereby successfully expanded.
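The threshold filter over the mined candidate set can be sketched as below; `jaccard` is a hypothetical word-overlap scorer standing in for the BERT matcher, and the 0.9 threshold matches the example in the text:

```python
def jaccard(a: str, b: str) -> float:
    # Toy stand-in for the BERT semantic matching model's pair score.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def expand_coverage(query, candidates, score_fn, threshold=0.9):
    # Keep only candidates whose pair score clears the threshold.
    return [c for c in candidates if score_fn(query, c) >= threshold]

result = expand_coverage("why is the sky blue",
                         ["why is the sky blue", "why is the sea salty"],
                         jaccard)
print(result)
# only the synonymous query survives the 0.9 cutoff
```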
[0106]In an embodiment of the present invention, in the above method, generalizing the question-and-answer query words to obtain semantically similar query words includes: representing the query words as vectors, and determining the candidate query words corresponding to the question-and-answer query word by calculating the cosine similarity between vectors; calculating the semantic relevance score between the question-and-answer query word and each candidate query word according to the semantic matching model; and, if the highest semantic relevance score is greater than a fifth preset value, taking the candidate query word with that highest score as the semantically similar query word.
[0107]This method can be understood as a purely semantics-based method, called retrieval matching. When judging whether a new query can trigger expansion, the semantically closest query must be found for it, in two stages: recall and matching. In the recall stage, the main purpose is a preliminary screening of the candidate set: for example, the sentence2vector (sen2vec) model maps each query into a vector representation, and the cosine similarity between vectors is then calculated to obtain the 200 queries from the database with the highest semantic similarity. In the matching stage, these 200 queries and the new query are formed into query-query pairs and scored with the semantic matching model, and the pair with the highest score above a certain threshold is selected as the final synonym pair.
[0108]A specific example is as follows:
[0109]recall:
[0110]1) Vector representation of query: call the sen2vec model, and express the input query as a vector;
[0111]2) Calculate cosine similarity: use the cosine of the angle between two vectors in vector space as the measure of similarity between two individuals;
[0112]3) Sort: sort according to cosine similarity, select query pairs with the top 200 similarities as the expansion trigger candidate set;
[0113]match:
[0114]1) Apply the semantic matching model to the 200 query pairs in the candidate set and select the query pair with the highest semantic score;
[0115]2) Judge whether the score of that query pair is higher than a certain threshold; if it is, it becomes the final expansion trigger result.
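The recall and match stages above can be sketched end to end. The sen2vec model and the semantic matcher are stood in for by a raw vector library and a hypothetical word-overlap scorer; only the two-stage control flow is the point here:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors as the similarity measure.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_top_k(query_vec, library, k=200):
    # library: (query_text, vector) pairs; keep the k most similar texts.
    ranked = sorted(library, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def match_best(new_query, candidates, score_fn, threshold=0.9):
    # Score each query-query pair; accept the best only above the threshold.
    best = max(candidates, key=lambda c: score_fn(new_query, c))
    return best if score_fn(new_query, best) >= threshold else None

def jaccard(a, b):  # toy stand-in for the semantic matching model
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

library = [("why is the sky blue", [1.0, 0.1]),
           ("why is the sea salty", [0.1, 1.0])]
cands = recall_top_k([0.9, 0.2], library, k=2)  # recall stage
print(match_best("why is the sky blue", cands, jaccard))  # match stage
```

In production, `k=200` candidates would come from the sen2vec vector index, and `score_fn` would be the BERT matcher trained as in the earlier embodiment.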
[0116]In an embodiment of the present invention, in the above method, obtaining the annotated answer corresponding to the annotated candidate answer based on active learning includes: providing a fine-label interface and a rough-label interface; displaying a single annotated candidate answer through the rough-label interface, receiving the returned correctness evaluation information, and determining the annotated answer based on the correctness evaluation information; and displaying multiple annotated candidate answers through the fine-label interface and receiving the returned annotated answer.
[0117]Verification showed that whether an answer is generated by extracting steps from a specified document or by the machine reading comprehension model, its accuracy falls well below 95%, which cannot meet the requirements for going online in a search engine. Active learning is therefore used to label the generated answers and improve accuracy. To maximize the benefit of labeling, reduce labeling cost, and accelerate improvement of the model, the solution adopts the idea of active learning and sets up two labeling tasks at the output: coarse labeling and fine labeling.
[0118]First, the coarse label judges whether the question-answer pair is correct, i.e., whether the answer can answer the question well, classifying it into four categories: correct answer, wrong answer, wrong query, and unjudgeable. If the annotation result is that the answer is correct, the data can go online directly. If the answer is wrong, it means the machine reading comprehension model made a mistake, and the model input data for this query is sent to the fine-label task: the annotator must circle the best answer given the query and several documents, and if no suitable answer can be selected from these documents, the annotator can directly fill in the best answer. The data labeled in this way can, on the one hand, go online directly and, on the other hand, serve as training samples for the machine reading comprehension model, correcting the model's errors and improving its effect as quickly as possible.
[0119]The above combination of coarse and fine labeling not only controls labeling cost but also improves the online output rate, while helping the model iterate and optimize quickly. The following are the labeling pages of the labeling platform for the coarse and fine labels.
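The routing of a coarse-label verdict described above can be summarized in a small sketch; the verdict strings and destination names are hypothetical labels for the four categories and three outcomes in the text:

```python
def route(verdict: str) -> str:
    """Route a coarse-label verdict to its next step in the pipeline."""
    if verdict == "correct answer":
        return "publish"       # data can go online directly
    if verdict == "wrong answer":
        return "fine label"    # send the model input to the fine-label task
    return "discard"           # "wrong query" or "unjudgeable"

for v in ["correct answer", "wrong answer", "wrong query", "unjudgeable"]:
    print(v, "->", route(v))
```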
[0120]Figure 8a shows a schematic diagram of the rough-label interface, and Figure 8b shows a schematic diagram of the fine-label interface.
[0121]In an embodiment of the present invention, in the above method, using the annotated answer as the selected abstract of the corresponding question-and-answer query in the search engine includes: saving the selected abstract in xml format. The xml data can be uploaded to a stable interface and finally configured into the search engine's database.
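A minimal sketch of the xml serialization follows; the element names (`item`, `query`, `abstract`) are hypothetical, since the source does not specify the schema:

```python
import xml.etree.ElementTree as ET

def to_xml(query: str, abstract: str) -> str:
    # Wrap one query-answer pair as a small xml fragment.
    item = ET.Element("item")
    ET.SubElement(item, "query").text = query
    ET.SubElement(item, "abstract").text = abstract
    return ET.tostring(item, encoding="unicode")

print(to_xml("why is the sky blue",
             "Sunlight is scattered by air molecules..."))
```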
[0122]Figure 2 shows a schematic structural diagram of an apparatus for generating selected abstracts of a search engine according to an embodiment of the present invention. As shown in Figure 2, the apparatus 200 for generating selected abstracts of a search engine includes:
[0123]The identification unit 210 is adapted to identify question-and-answer query words from the search log data.
[0124]Each user of the search engine can generate a large amount of search log data during daily search, for example, which query terms the user searches for, which website in the search result clicks after obtaining the search result, and so on. Through statistics and research, it is found that for question-and-answer query words, selected abstracts are most helpful to the improvement of user search experience. Therefore, the technical solution of the present invention mainly generates corresponding selected abstracts for question-and-answer query words.
[0125]The search unit 220 is adapted to obtain search results corresponding to question and answer query words. Specifically, it can be realized by calling the existing search engine interface.
[0126]The candidate unit 230 is adapted to use question-and-answer query words and corresponding search results as input data if the search result does not contain documents of the specified type, and output annotated candidate answers corresponding to the question-and-answer query words according to the machine reading comprehension model.
[0127]For a document of a specified type, since the type is known, content can be extracted through a template or the like; for other web pages, the difficulty lies in how to obtain information related to the answer required by the question-and-answer query. In this regard, the embodiment of the present invention adopts the MRC (Machine Reading Comprehension) method: the question-and-answer query words and the corresponding search results are used as input data, and the annotated candidate answers corresponding to the question-and-answer query words are output according to the machine reading comprehension model.
[0128]Since the accuracy of these answers may deviate from the actual needs of users (for example, it was found during verification that the model's cold-start effect was poor and the accuracy of the generated answers was not sufficient to go online directly), the embodiment of the present invention provides the selected abstract generation unit 240, adapted to obtain the annotated answers corresponding to the annotated candidate answers based on active learning and to use the annotated answers as the selected abstracts of the corresponding question-and-answer queries in the search engine. The traditional labeling method improves the model's effect very slowly, and at the same time the labeling of selected abstracts places high demands on the annotators' textual theory and knowledge reserves, making labeling more difficult; therefore, the embodiment of the present invention also adopts active learning to further improve accuracy.
[0129]It can be seen that, for the identified question-and-answer query words, the device shown in Figure 2 combines machine reading comprehension with active learning to provide the search engine with selected abstracts from which users can directly view the desired content on the search result page, balancing query cost against answer accuracy and improving user experience and search quality.
[0130]In an embodiment of the present invention, in the above device, the identification unit 210 is adapted to perform preset types of processing on the search log data to extract query words that meet the requirements, and to use these query words as input data and output question-and-answer query words according to a query word classification model.
[0131]In an embodiment of the present invention, in the above device, the identification unit 210 is adapted to sort the query words in the search log data by page views (PV) and extract several query words from high to low according to the sort; and/or to normalize the query words to remove punctuation and/or spaces in them.
[0132]In an embodiment of the present invention, the above device further includes a training unit adapted to: obtain sample data from search log data generated within a preset time period; count, for each query word in the sample data, the total number of clicks and the number of clicks on question-and-answer sites within the preset time period, and calculate the question-and-answer click percentage of each query word; and take query words whose question-and-answer click percentage is greater than a first threshold as positive examples and query words whose question-and-answer click percentage is below a second threshold as negative examples to obtain training data.
[0133]In an embodiment of the present invention, in the above device, the training unit is adapted to divide the training data into a training set, a validation set, and a test set according to a preset ratio, and to train based on the textCNN model to obtain the query word classification model.
[0134]In an embodiment of the present invention, in the above-mentioned device, the training unit is adapted to use query words containing specified words in the sample data as positive examples.
[0135]In an embodiment of the present invention, in the above-mentioned device, the question-and-answer site is determined according to the URL pattern.
[0136]In an embodiment of the present invention, in the above-mentioned device, the recognition unit 210 is adapted to preprocess the recognized question-and-answer query words, remove duplicate question-and-answer query words, and remove question-and-answer query words that have been marked answers.
[0137]In an embodiment of the present invention, in the above-mentioned device, the recognition unit 210 is adapted to classify the recognized question-and-answer query words, and filter the question-and-answer query words other than the preset type.
[0138]In an embodiment of the present invention, in the above-mentioned device, the recognition unit 210 is adapted to classify the identified question and answer query words according to at least one of a topic classification model, a query word classification model, and an answer classification model.
[0139]In an embodiment of the present invention, in the above device, the topic types include at least one of the following: digital devices, life, games, education and science, leisure and hobbies, culture and art, financial management, society and people's livelihood, sports, and region; the query word types include fact type and/or opinion type; the answer types include description type and/or entity type.
[0140]In an embodiment of the present invention, in the above device, the candidate unit 230 is adapted to call a machine reading comprehension model that does not include a ranking algorithm to output multiple annotated candidate answers if the answer type of the question-and-answer query is the entity type, and otherwise to call the complete machine reading comprehension model and output one annotated candidate answer.
[0141]In an embodiment of the present invention, in the above device, the topic classification model is a text multi-classification model, and the device further includes a training unit adapted to: obtain search log data generated within a preset time period as sample data; count the click percentage of each query word in the sample data on sites of different topic types; count the page views of each query word in the sample data and divide out high-frequency query words from the query words in the sample data according to the page views; and take the topic type of the site with the highest click percentage for a high-frequency query word as the topic type of that high-frequency query word.
[0142]In an embodiment of the present invention, in the above device, the training unit is adapted to divide out medium-frequency query words from the query words in the sample data according to the page views; to use the high-frequency query words with topic types as a training set to train a support vector machine (SVM) model; and to determine the topic types of the medium-frequency query words according to the trained SVM model.
[0143]In an embodiment of the present invention, in the above-mentioned device, the training unit is adapted to classify low-frequency query words from query words in sample data according to page views; and determine the topic type of low-frequency query words according to the syntactic dependency tree.
[0144]In an embodiment of the present invention, in the above device, the query word classification model and the answer classification model are both text multi-classification models; the query word classification model is trained based on the nature of the query word; the answer classification model is based on the answer Length feature training.
[0145]In an embodiment of the present invention, in the above-mentioned device, the recognition unit 210 is adapted to filter the recognized question and answer query words according to a number of specified dimensions to obtain the filtered question and answer query words.
[0146]In an embodiment of the present invention, in the above-mentioned device, the recognition unit 210 is adapted to filter each specified dimension by using a corresponding text classification model; the text classification model is obtained based on SVM and/or fastText training.
[0147]In an embodiment of the present invention, in the above device, the search unit 220 is adapted to call the search engine interface to obtain a first number of natural search results corresponding to the question-and-answer query words in search result order, to adjust the order according to a preset algorithm, and to select a second number of natural search results as the search results corresponding to the question-and-answer query words.
[0148]In an embodiment of the present invention, in the above device, the search unit 220 is adapted to filter out, from the natural search results, document-type sites from which web content cannot be obtained, and to promote the ranking of sites whose trust level is higher than a first preset value.
[0149]In an embodiment of the present invention, in the above-mentioned device, the search unit 220 is adapted to filter question-and-answer query words based on natural search results.
[0150]In an embodiment of the present invention, in the above device, the search unit 220 is adapted to filter out question-and-answer query words whose natural search results contain application boxes; and/or to filter out question-and-answer query words whose natural search result titles contain illegal words; and/or to filter out, based on semantic matching, question-and-answer query words lacking high-quality natural search results.
[0151]In an embodiment of the present invention, in the above-mentioned device, the candidate unit 230 is further adapted to directly extract annotated candidate answers from the documents of the specified type if the search result contains documents of the specified type.
[0152]In an embodiment of the present invention, in the above device, the document of the specified type is an html document containing several pieces of step description information, and the candidate unit 230 is adapted to parse the html document and extract several pieces of step description information according to field matching , And get annotated candidate answers through splicing.
[0153]In an embodiment of the present invention, in the above-mentioned device, the candidate unit 230 is adapted to calculate the semantic relevance score of the question-and-answer query term and the web page title of the search result according to the semantic matching model, and sort the search results according to the semantic relevance score .
[0154]In an embodiment of the present invention, in the above device, the candidate unit 230 is adapted to skip outputting annotated candidate answers for the question-and-answer query word according to the machine reading comprehension model, as well as the subsequent steps, if the calculated semantic relevance scores are all lower than the second preset value; and the selected abstract unit is adapted to directly grab selected abstracts for the question-and-answer query word from question-and-answer sites.
[0155]In an embodiment of the present invention, the above device further includes a training unit adapted to obtain pairs of question-and-answer query words and web page titles labeled as positive and negative examples, construct them into training data through a processor dictionary, and perform fine-tune training based on a BERT pre-training model and the training data to obtain the semantic matching model.
[0156]In an embodiment of the present invention, in the above device, the selected summary unit is adapted to directly grab the selected summary from the question and answer site for question-and-answer query words whose PV is lower than the third preset value.
[0157]In an embodiment of the present invention, in the above device, the selected abstract unit is further adapted to generalize the question-and-answer query words to obtain semantically similar query words, and to use the annotated answers also as the selected abstracts of those semantically similar query words in the search engine.
[0158]In an embodiment of the present invention, in the above device, the selected abstract unit is adapted to mine candidate query words corresponding to the question-and-answer query words based on query presentation, user click behavior, and co-click behavior; to calculate the semantic relevance score between the question-and-answer query word and each candidate query word according to the semantic matching model; and to take candidate query words whose semantic relevance score is higher than the fourth preset value as the semantically similar query words.
[0159]In an embodiment of the present invention, in the above device, the selected abstract unit is adapted to represent the query words as vectors and determine the candidate query words corresponding to the question-and-answer query word by calculating the cosine similarity between vectors; to calculate the semantic relevance score between the question-and-answer query word and each candidate query word according to the semantic matching model; and, if the highest semantic relevance score is greater than the fifth preset value, to take the candidate query word with that highest score as the semantically similar query word.
[0160]In an embodiment of the present invention, in the above device, the selected abstract unit is adapted to provide a fine-label interface and a rough-label interface; to display a single annotated candidate answer through the rough-label interface, receive the returned correctness evaluation information, and determine the annotated answer based on the correctness evaluation information; and to display multiple annotated candidate answers through the fine-label interface and receive the returned annotated answer.
[0161]In an embodiment of the present invention, in the above device, the selected summary unit is adapted to save the selected summary in an xml format.
[0162]It should be noted that the specific implementation manners of the foregoing device embodiments can be performed with reference to the specific implementation manners of the foregoing corresponding method embodiments, and details are not described herein again.
[0163]To sum up, the present invention identifies question-and-answer query words from search log data; obtains search results corresponding to the question-and-answer query words; if the search results do not contain documents of a specified type, uses the question-and-answer query words and the corresponding search results as input data and outputs annotated candidate answers corresponding to the question-and-answer query words according to the machine reading comprehension model; and, based on active learning, obtains annotated answers corresponding to the annotated candidate answers and uses them as the selected abstracts of the corresponding question-and-answer queries in the search engine. For the identified question-and-answer queries, this technical solution combines machine reading comprehension with active learning to provide the search engine with selected abstracts from which users can directly view the desired content on the search result page, balancing query cost against answer accuracy and improving user experience and search quality.
[0164]It should be noted:
[0165]The algorithms and displays provided here are not inherently related to any particular computer, virtual device, or other equipment. Various general-purpose devices can also be used with the teachings herein. From the above description, the structure required to construct this type of device is obvious. In addition, the present invention is not directed at any specific programming language. It should be understood that various programming languages can be used to implement the content of the present invention described herein, and the above description of a specific language is intended to disclose the best embodiment of the present invention.
[0166]In the instructions provided here, a lot of specific details are explained. However, it can be understood that the embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures and technologies are not shown in detail, so as not to obscure the understanding of this specification.
[0167]Similarly, it should be understood that, in order to streamline the present disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, various features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.
[0168]Those skilled in the art can understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments can be combined into one module or unit or component, and they can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
[0169]In addition, those skilled in the art can understand that although some embodiments described herein include certain features included in other embodiments but not other features, the combination of features of different embodiments means that they are within the scope of the present invention. Within and form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
[0170]The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the device for generating selected abstracts of a search engine according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from Internet websites, provided on carrier signals, or provided in any other form.
[0171]For example, Figure 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 300 includes a processor 310 and a memory 320 arranged to store computer-executable instructions (computer-readable program code). The memory 320 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. The memory 320 has a storage space 330 storing computer-readable program code 331 for executing any of the method steps in the above methods. For example, the storage space 330 may include various pieces of computer-readable program code 331 for respectively implementing the various steps in the above method. The computer-readable program code 331 may be read from or written into one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is usually a computer-readable storage medium such as that shown in Figure 4. Figure 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. The computer-readable storage medium 400 stores the computer-readable program code 331 for executing the method steps according to the present invention, which can be read by the processor 310 of the electronic device 300; when the computer-readable program code 331 is run by the electronic device 300, the electronic device 300 is caused to execute each step of the method described above. Specifically, the computer-readable program code 331 stored in the computer-readable storage medium can execute the method shown in any of the above embodiments. The computer-readable program code 331 may be compressed in an appropriate form.
[0172]It should be noted that the above-mentioned embodiments illustrate rather than limit the present invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of multiple such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
[0173]The embodiment of the present invention discloses A1, a method for generating a selected abstract of a search engine, including:
[0174]Identify question-and-answer query words from the search log data;
[0175]Obtaining search results corresponding to the question and answer query words;
[0176]If the search result does not contain the specified type of document, the question and answer query term and the corresponding search result are used as input data, and the labeled candidate answer corresponding to the question and answer query term is output according to the machine reading comprehension model;
[0177]Based on active learning, an annotated answer corresponding to the annotated candidate answer is acquired, and the annotated answer is used as a selected summary of the corresponding question-and-answer query in a search engine.
[0178]A2. The method according to A1, wherein the identifying the question-and-answer query word from the search log data includes:
[0179]Perform preset types of processing on search log data, and extract query words that meet requirements;
[0180]The query words are used as input data, and question and answer query words are output according to the query word classification model.
[0181]A3. The method as described in A2, wherein said performing preset types of processing on search log data and extracting query words that meet requirements includes:
[0182]Sort the query words in the search log data by page views (PV), and extract a number of query words in descending order of PV.
[0183]and / or,
[0184]Normalize the query term and remove the punctuation and/or spaces in the query term.
[0185]A4. The method according to A2, wherein the method further includes the following steps of making training data of the query word classification model:
[0186]Obtain search log data generated within a preset time period as sample data;
[0187]Calculate the total number of clicks for each query word in the sample data and the number of clicks on Q&A sites, and calculate the percentage of Q&A clicks for each query word in the sample data within a preset time period;
[0188]Take the query words whose question and answer click ratio is greater than the first threshold as positive examples, and the query words whose question and answer click ratio is lower than the second threshold as negative examples, to obtain training data.
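The labeling rule of A4 can be sketched as a small Python routine. This is an illustrative sketch only: the log format (query word, clicked site, click count), the threshold values, and all function names below are assumptions, not taken from the patent.

```python
# Illustrative sketch of A4: label query words by their share of clicks
# on question-and-answer sites. Data shapes and thresholds are hypothetical.
from collections import defaultdict

def build_training_data(click_log, qa_sites, pos_threshold=0.6, neg_threshold=0.1):
    total = defaultdict(int)   # total clicks per query word
    qa = defaultdict(int)      # clicks on Q&A sites per query word
    for query, site, clicks in click_log:
        total[query] += clicks
        if site in qa_sites:
            qa[query] += clicks
    data = []
    for query, n in total.items():
        ratio = qa[query] / n
        if ratio > pos_threshold:
            data.append((query, 1))   # positive: likely a question-and-answer query
        elif ratio < neg_threshold:
            data.append((query, 0))   # negative example
        # query words between the two thresholds are discarded as ambiguous
    return data
```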
[0189]A5. The method according to A4, wherein the method further includes the following steps of training to obtain the query word classification model:
[0190]The training data is divided into a training set, a validation set, and a test set according to a preset ratio, and training is performed based on the textCNN model to obtain the query word classification model.
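The split in A5 could look like the following sketch; the 8:1:1 ratio and the seed are illustrative assumptions, and the textCNN training itself is not shown.

```python
# Illustrative sketch of the dataset split in A5 (ratios and seed are assumed)
import random

def split_dataset(data, ratios=(0.8, 0.1, 0.1), seed=42):
    # Shuffle reproducibly, then cut into train / validation / test slices
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```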
[0191]A6. The method according to A4, wherein the step of preparing training data of the query word classification model further includes:
[0192]Take the query words containing the specified words in the sample data as positive examples.
[0193]A7. The method according to A4, wherein the question and answer site is determined according to a web site pattern.
[0194]A8. The method according to A1, wherein the identifying question-and-answer query words from the search log data further includes:
[0195]Preprocess the identified question-and-answer query words to remove duplicate question-and-answer query words and to remove question-and-answer query words for which labeled answers already exist.
[0196]A9. The method according to A1, wherein the identifying the question-and-answer query word from the search log data further includes:
[0197]Classify the identified Q&A query words, and filter out the Q&A query words outside the preset type.
[0198]A10. The method according to A9, wherein the classifying the identified question-and-answer query words includes:
[0199]According to at least one of the topic classification model, the query word classification model, and the answer classification model, the identified question and answer query words are classified.
[0200]A11. The method according to A10, wherein:
[0201]The topic types include at least one of the following: mobile phone digital, life, games, education and science, leisure and hobbies, culture and art, financial management, social and people's livelihood, sports, and region;
[0202]Types of query terms include: facts and/or opinions;
[0203]Answer types include: description class and/or entity class.
[0204]A12. The method according to A11, wherein the using the question and answer query words and the corresponding search results as input data, and outputting the labeled candidate answers corresponding to the question and answer query words according to the machine reading comprehension model includes:
[0205]If the answer type of the question and answer query word is an entity type, a machine reading comprehension model that does not include a ranking algorithm is called to output multiple labeled candidate answers; otherwise, a complete machine reading comprehension model is called to output a labeled candidate answer.
[0206]A13. The method according to A10, wherein the topic classification model is a text multi-classification model, and the method further includes the following steps of making training data for the topic classification model:
[0207]Obtain search log data generated within a preset time period as sample data;
[0208]Calculate the click ratio of each query term in the sample data on sites of different topic types;
[0209]Counting the page views of each query word in the sample data, and classifying high-frequency query words from the query words of the sample data according to the page views;
[0210]The topic type of the site with the highest click ratio of the high-frequency query term is used as the topic type of the corresponding high-frequency query term.
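The high-frequency labeling rule of A13 might be organized as below. The log format, the PV threshold, and all names are hypothetical assumptions used only to illustrate the claimed steps.

```python
# Illustrative sketch of A13: assign each high-frequency query word the topic
# type of the site where it received the largest share of clicks.
from collections import defaultdict

def label_topics(click_log, site_topic, pv, pv_threshold=1000):
    # click_log: (query word, site, click count); site_topic maps site -> topic type
    topic_clicks = defaultdict(lambda: defaultdict(int))
    for query, site, clicks in click_log:
        topic = site_topic.get(site)
        if topic:
            topic_clicks[query][topic] += clicks
    labels = {}
    for query, counts in topic_clicks.items():
        if pv.get(query, 0) >= pv_threshold:   # keep high-frequency query words only
            labels[query] = max(counts, key=counts.get)  # topic with the highest clicks
    return labels
```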
[0211]A14. The method according to A13, wherein the step of making training data of the topic classification model further comprises:
[0212]Dividing intermediate frequency query words from the query words of the sample data according to the page views;
[0213]The high-frequency query terms with the subject type determined are used as a training set, a support vector machine SVM model is trained, and the subject type of the intermediate-frequency query term is determined according to the trained SVM model.
[0214]A15. The method according to A13, wherein the step of preparing training data of the topic classification model further includes:
[0215]Dividing low-frequency query words from query words of the sample data according to the page views;
[0216]The topic type of the low-frequency query term is determined according to the syntax dependency tree.
[0217]A16. The method according to A10, wherein the query word classification model and the answer classification model are both text multi-classification models;
[0218]The query word classification model is trained based on the nature and characteristics of the query word;
[0219]The answer classification model is trained based on the length feature of the answer.
[0220]A17. The method according to A1, wherein the recognizing question-and-answer query words from the search log data further includes:
[0221]The identified question and answer query words are filtered according to a number of specified dimensions, and the filtered question and answer query words are obtained.
[0222]A18. The method according to A17, wherein the filtering the identified question and answer query words according to a number of specified dimensions includes:
[0223]For each specified dimension, a corresponding text classification model is used for filtering; the text classification model is obtained by training based on SVM and/or fastText respectively.
[0224]A19. The method according to A1, wherein the obtaining the search result corresponding to the question and answer query term includes:
[0225]Invoke the search engine interface to obtain the first number of natural search results corresponding to the question and answer query words according to the search result sequence;
[0226]The natural search results are adjusted according to a preset algorithm, and the second number of natural search results are selected from them as the search results corresponding to the corresponding question and answer query words.
[0227]A20. The method according to A19, wherein the adjusting the natural search result according to a preset algorithm includes at least one of the following:
[0228]Filtering out document-type sites that cannot obtain web content from the natural search results;
[0229]The order of the sites whose trust level is higher than the first preset value is increased.
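A minimal sketch of the adjustment in A20, under assumed data shapes: document-type sites whose content cannot be fetched are dropped, and sites whose trust level exceeds the first preset value are moved to the front. Field names and the threshold are illustrative assumptions.

```python
# Illustrative sketch of A20: filter blocked document-type sites, then
# promote high-trust sites while preserving relative order elsewhere.
def adjust_results(results, blocked_doc_sites, trust_scores, trust_threshold=0.8):
    # results: ranked list of dicts, each with at least a "site" key
    kept = [r for r in results if r["site"] not in blocked_doc_sites]
    trusted = [r for r in kept if trust_scores.get(r["site"], 0) > trust_threshold]
    rest = [r for r in kept if trust_scores.get(r["site"], 0) <= trust_threshold]
    return trusted + rest  # trusted sites are promoted ahead of the others
```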
[0230]A21. The method according to A19, wherein said obtaining search results corresponding to said question-and-answer query term further comprises:
[0231]Filter the question and answer query words according to the natural search results.
[0232]A22. The method according to A21, wherein the filtering of the question and answer query words according to the natural search results includes at least one of the following:
[0233]Filter out Q&A query words that contain application boxes in the natural search results;
[0234]Filter out question-and-answer query words that contain illegal words in the title of natural search results;
[0235]According to semantic matching, filter out Q&A query words that lack high-quality natural search results.
[0236]A23. The method according to A1, wherein the method further includes:
[0237]If the search result contains a specified type of document, then the labeled candidate answer is directly extracted from the specified type of document.
[0238]A24. The method according to A23, wherein the document of the specified type is an html document containing several pieces of step description information, and the direct extraction of the annotation candidate answer from the document of the specified type includes:
[0239]The html document is parsed, the several pieces of step description information are extracted according to field matching, and the labeled candidate answer is obtained through splicing.
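The parse-extract-splice flow of A24 could look like the sketch below. The `step-text` class name is a hypothetical example of the field matching, not the actual pattern used by the method.

```python
# Illustrative sketch of A24: extract step description fields from an html
# document and splice them into one labeled candidate answer.
import re

def extract_steps(html):
    # Field matching against a hypothetical <p class="step-text"> marker
    steps = re.findall(r'<p class="step-text">(.*?)</p>', html, re.S)
    steps = [re.sub(r"<[^>]+>", "", s).strip() for s in steps]  # drop inner tags
    # Splice the step descriptions into a single numbered answer string
    return " ".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
```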
[0240]A25. The method according to A1, wherein the using the question-and-answer query term and the corresponding search result as input data includes:
[0241]Calculate the semantic relevance score of the question and answer query word and the web page title of the search result according to the semantic matching model, and sort the search results according to the semantic relevance score.
[0242]A26. The method according to A25, wherein the method further comprises:
[0243]If the calculated semantic relevance scores are all lower than the second preset value, the outputting of the labeled candidate answers corresponding to the question-and-answer query words according to the machine reading comprehension model and the subsequent steps are not executed, and the selected abstracts of the question-and-answer query words are directly captured from a question-and-answer site.
[0244]A27. The method according to A25, wherein the method further includes the following step of training the semantic matching model:
[0245]Obtain pairs of question-and-answer query words and web page titles labeled as positive and negative examples, and construct them as training data through the processor dictionary;
[0246]Based on the BERT pre-training model and the training data, fine-tuning is performed to obtain the semantic matching model.
[0247]A28. The method according to A1, wherein the method further includes:
[0248]For question-and-answer query words whose PV is lower than the third preset value, select abstracts are directly grabbed from question-and-answer sites.
[0249]A29. The method according to A1, wherein the method further includes:
[0250]Generalize the question and answer query words to obtain query words with similar semantics;
[0251]The labeled answers are also used as selected abstracts in search engines of query terms with similar semantics.
[0252]A30. The method according to A29, wherein the generalizing the question-and-answer query words to obtain query words with similar semantics includes:
[0253]Based on the display of query words, user click behavior and co-click behavior, mining candidate query words corresponding to the question and answer query words;
[0254]Calculate the semantic relevance scores of the question and answer query words and each of the candidate query words according to the semantic matching model, and use candidate query words with semantic relevance scores higher than a fourth preset value as query words with similar semantics.
[0255]A31. The method according to A29, wherein the generalizing the question and answer query words to obtain query words with similar semantics includes:
[0256]The query words are represented by vectors, and the candidate query words corresponding to the question and answer query words are determined by calculating the cosine similarity of each vector;
[0257]Calculate the semantic relevance score of the question-and-answer query word and each candidate query word according to the semantic matching model, and if the highest semantic relevance score is greater than the fifth preset value, then the candidate corresponding to the highest semantic relevance score The query term is a query term with similar semantics.
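The vector-based generalization of A31 can be illustrated with plain cosine similarity. Note that this sketch also uses cosine similarity for the final threshold check, whereas the claim re-scores candidates with the semantic matching model; all names and the threshold value are assumptions.

```python
# Illustrative sketch of A31: find the candidate query word most similar to a
# question-and-answer query word by cosine similarity of their vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_similar_query(query_vec, candidates, threshold=0.9):
    # candidates: {query word: vector}; return the best match only if its
    # score clears the (fifth) preset value, otherwise None
    best, best_score = None, -1.0
    for word, vec in candidates.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best, best_score = word, score
    return best if best_score > threshold else None
```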
[0258]A32. The method according to A1, wherein the obtaining the labeled answer corresponding to the labeled candidate answer based on active learning includes:
[0259]Provide a fine-labeling interface and a coarse-labeling interface;
[0260]Display the unique labeled candidate answer through the coarse-labeling interface, receive the returned correctness evaluation information, and determine the labeled answer according to the correctness evaluation information;
[0261]and,
[0262]Display multiple labeled candidate answers through the fine-labeling interface, and receive the returned labeled answers.
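The coarse/fine routing of A32 might be organized as below; the interface names, data shapes, and helper functions are illustrative assumptions.

```python
# Illustrative sketch of A32: route a single candidate answer to the coarse
# interface (confirm/reject only) and multiple candidates to the fine interface.
def route_for_labeling(query, candidates):
    if len(candidates) == 1:
        return {"interface": "coarse", "query": query, "answer": candidates[0]}
    return {"interface": "fine", "query": query, "answers": candidates}

def resolve_coarse(task, is_correct):
    # Correctness feedback from the coarse interface yields the labeled answer
    return task["answer"] if is_correct else None
```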
[0263]A33. The method according to any one of A1-A32, wherein using the labeled answer as the selected abstract of the corresponding question-and-answer query term in a search engine includes:
[0264]Save the selected abstract in xml format.
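One possible XML layout for A33, using Python's standard library; the element names are hypothetical, since the claim only specifies that the xml format is used.

```python
# Illustrative sketch of A33: serialize a selected abstract as XML.
# The element names are assumed, not specified by the patent.
import xml.etree.ElementTree as ET

def abstract_to_xml(query, answer):
    root = ET.Element("selected_abstract")
    ET.SubElement(root, "query").text = query    # the question-and-answer query word
    ET.SubElement(root, "answer").text = answer  # the labeled answer
    return ET.tostring(root, encoding="unicode")
```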
[0265]The embodiment of the present invention also discloses B34, a device for generating selected abstracts of search engines, including:
[0266]Recognition unit, suitable for recognizing question-and-answer query words from search log data;
[0267]The search unit is adapted to obtain search results corresponding to the question and answer query words;
[0268]The candidate unit is adapted to, if the search result does not contain a specified type of document, use the question-and-answer query term and the corresponding search result as input data, and output a labeled candidate answer corresponding to the question-and-answer query term according to the machine reading comprehension model;
[0269]The selected abstract generation unit is adapted to obtain the labeled answer corresponding to the labeled candidate answer based on active learning, and use the labeled answer as the selected abstract of the corresponding question-and-answer query in the search engine.
[0270]B35. The device according to B34, wherein:
[0271]The recognition unit is adapted to perform preset types of processing on search log data to extract query words that meet requirements; use the query words as input data, and output question-and-answer query words according to the query word classification model.
[0272]B36. The device according to B35, wherein:
[0273]The recognition unit is adapted to sort the query words in the search log data by page views (PV) and extract a number of query words in descending order; and/or normalize the query words to remove punctuation and/or spaces in the query words.
[0274]B37. The device according to B35, wherein the device further comprises:
[0275]The training unit is adapted to obtain search log data generated within a preset time period as sample data; count the total number of clicks on each query word in the sample data and the number of clicks on Q&A sites, and calculate the Q&A click ratio of each query word in the sample data within the preset time period; and use the query words whose Q&A click ratio is greater than the first threshold as positive examples, and the query words whose Q&A click ratio is lower than the second threshold as negative examples, to obtain training data.
[0276]B38. The device according to B37, wherein:
[0277]The training unit is adapted to divide the training data into a training set, a validation set, and a test set according to a preset ratio, and perform training based on the textCNN model to obtain the query word classification model.
[0278]B39. The device according to B37, wherein:
[0279]The training unit is adapted to use query words containing designated words in the sample data as positive examples.
[0280]B40. The device according to B37, wherein the question-and-answer site is determined according to a website pattern.
[0281]B41. The device according to B34, wherein:
[0282]The recognition unit is adapted to preprocess the identified question-and-answer query words, remove duplicate question-and-answer query words, and remove question-and-answer query words that have been marked answers.
[0283]B42. The device according to B34, wherein:
[0284]The recognition unit is adapted to classify the recognized question and answer query words, and filter out the question and answer query words other than the preset type.
[0285]B43. The device according to B42, wherein:
[0286]The recognition unit is adapted to classify the identified question-and-answer query words according to at least one of a topic classification model, a query word classification model, and an answer classification model.
[0287]B44. The device according to B43, wherein the topic types include at least one of the following: mobile phone digital, life, games, education and science, leisure and hobbies, culture and art, financial management, social and people's livelihood, sports, and region; the query word types include: fact type and/or opinion type; and the answer types include: description type and/or entity type.
[0288]B45. The device according to B44, wherein:
[0289]The candidate unit is adapted to, if the answer type of the question-and-answer query word is an entity type, call a machine reading comprehension model that does not include a ranking algorithm to output multiple labeled candidate answers; otherwise, call a complete machine reading comprehension model to output one labeled candidate answer.
[0290]B46. The device according to B43, wherein the topic classification model is a text multi-classification model, and the device further includes: a training unit adapted to obtain search log data generated within a preset time period as sample data; calculate the click ratio of each query word in the sample data on sites of different topic types; count the page views of each query word in the sample data, and divide high-frequency query words from the query words of the sample data according to the page views; and use the topic type of the site with the highest click ratio for a high-frequency query word as the topic type of that high-frequency query word.
[0291]B47. The device according to B46, wherein:
[0292]The training unit is adapted to divide intermediate-frequency query words from the query words of the sample data according to the page views; train a support vector machine (SVM) model using the high-frequency query words with determined topic types as a training set; and determine the topic type of the intermediate-frequency query words according to the trained SVM model.
[0293]B48. The device according to B46, wherein:
[0294]The training unit is adapted to divide low-frequency query words from the query words of the sample data according to the page views, and determine the topic type of the low-frequency query words according to a syntactic dependency tree.
[0295]B49. The device according to B43, wherein the query word classification model and the answer classification model are both text multi-classification models; the query word classification model is trained based on the characteristics of the query word; the answer The classification model is trained based on the length feature of the answer.
[0296]B50. The device according to B34, wherein:
[0297]The recognition unit is adapted to filter the identified question-and-answer query words according to a number of specified dimensions to obtain filtered question-and-answer query words.
[0298]B51. The device according to B50, wherein:
[0299]The recognition unit is adapted to filter each specified dimension by using a corresponding text classification model; the text classification model is obtained by training based on SVM and/or fastText.
[0300]B52. The device according to B34, wherein:
[0301]The search unit is adapted to call a search engine interface to obtain a first number of natural search results corresponding to the question-and-answer query words according to the search result sequence; adjust the natural search results according to a preset algorithm; and select a second number of natural search results from them as the search results corresponding to the question-and-answer query words.
[0302]B53. The device according to B52, wherein:
[0303]The search unit is adapted to filter out document-type sites for which web content cannot be obtained from the natural search results; and increase the order of sites with a trust level higher than a first preset value.
[0304]B54. The device according to B52, wherein:
[0305]The search unit is adapted to filter the question and answer query words according to the natural search results.
[0306]B55. The device according to B54, wherein:
[0307]The search unit is adapted to filter out question-and-answer query words containing application boxes in the natural search results; and/or filter out question-and-answer query words containing illegal words in the titles of the natural search results; and/or filter out, according to semantic matching, question-and-answer query words that lack high-quality natural search results.
[0308]B56. The device according to B34, wherein:
[0309]The candidate unit is further adapted to directly extract annotated candidate answers from the documents of the specified type if the search result contains documents of the specified type.
[0310]B57. The device according to B56, wherein the document of the specified type is an html document containing several pieces of step description information, and the candidate unit is adapted to parse the html document, extract the several pieces of step description information according to field matching, and obtain the labeled candidate answer through splicing.
[0311]B58. The device according to B34, wherein:
[0312]The candidate unit is adapted to calculate the semantic relevance score of the question and answer query term and the web page title of the search result according to a semantic matching model, and to sort the search results according to the semantic relevance score.
[0313]B59. The device according to B58, wherein:
[0314]The candidate unit is further adapted to, if the calculated semantic relevance scores are all lower than the second preset value, skip the outputting of the labeled candidate answer corresponding to the question-and-answer query word according to the machine reading comprehension model and the subsequent steps;
[0315]The selected summary unit is adapted to directly grab the selected summary of the question and answer query words from a question and answer site.
[0316]B60. The device according to B58, wherein the device further comprises:
[0317]The training unit is adapted to obtain pairs of question-and-answer query words and webpage titles labeled as positive and negative examples, and construct them as training data through the processor dictionary; and perform fine-tuning based on the BERT pre-training model and the training data to obtain the semantic matching model.
[0318]B61. The device according to B34, wherein:
[0319]The selected summary unit is adapted to directly grab a selected summary from a question and answer site for question and answer query words whose PV is lower than the third preset value.
[0320]B62. The device according to B34, wherein:
[0321]The selected summary unit is further adapted to generalize the question and answer query words to obtain query words with similar semantics; and use the labeled answers as selected abstracts of query words with similar semantics in search engines.
[0322]B63. The device according to B62, wherein:
[0323]The selected summary unit is adapted to mine out candidate query words corresponding to the question-and-answer query words based on the display of query words, user click behavior, and co-click behavior; calculate the semantic relevance score of the question-and-answer query words and each candidate query word according to the semantic matching model; and use candidate query words with semantic relevance scores higher than a fourth preset value as query words with similar semantics.
[0324]B64. The device according to B62, wherein:
[0325]The selected summary unit is adapted to represent the query words as vectors and determine the candidate query words corresponding to the question-and-answer query words by calculating the cosine similarity of the vectors; calculate the semantic relevance score of the question-and-answer query word and each candidate query word according to the semantic matching model; and, if the highest semantic relevance score is greater than the fifth preset value, use the candidate query word corresponding to the highest semantic relevance score as a query word with similar semantics.
[0326]B65. The device according to B34, wherein:
[0327]The selected summary unit is adapted to provide a fine-labeling interface and a coarse-labeling interface; display a unique labeled candidate answer through the coarse-labeling interface, receive the returned correctness evaluation information, and determine the labeled answer according to the correctness evaluation information; and display multiple labeled candidate answers through the fine-labeling interface and receive the returned labeled answers.
[0328]B66. The device according to any one of B34-B65, wherein:
[0329]The selected abstract unit is adapted to save the selected abstract in an xml format.
[0330]The embodiment of the present invention also discloses C67, an electronic device, wherein the electronic device includes: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to execute the method described in any one of A1-A33.
[0331]The embodiment of the present invention also discloses D68, a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs, and when the one or more programs are executed by a processor, the method of any one of A1-A33 is implemented.
