Third, it can improve the user experience. The quality of a search engine's results directly determines the quality of the user experience. If the search engine merely matches webpages against the keywords the user supplies, without considering the user's search intent or the topic behind those keywords, it cannot meet the user's needs.
[0004] Chinese webpage classification is based on text classification. Although text-classification technology has been maturely applied in many fields, webpage classification is considerably more complicated than text classification because of the irregularity of webpage data structures.
For example, the data sets used in text classification come from text documents or database records; they have a highly standardized structure, so feature items are very easy to obtain. Most webpages, however, are HTML files, and HTML is a semi-structured language: the topic information of a webpage resides within HTML tags, while noise data and junk information can also appear anywhere among those tags. Such irregular webpages make it increasingly difficult to extract webpage topic information, which creates great difficulties for webpage classification.
[0005] First, the extracted webpage topic content is not accurate enough. A webpage has no fixed modules or structure, so extracting its topic content is difficult. In addition, a webpage contains not only topic content but also various advertisements, navigation bars, useless links, and other irrelevant information; because webpages are unstructured, this junk and noise data can appear anywhere on the page, which seriously affects the accuracy of webpage classification.
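The extraction problem described above can be illustrated with a minimal sketch: walk the HTML, skip text nested inside tags commonly associated with navigation and junk content, and keep the remaining visible text. The set of noise tags here is an illustrative assumption, not part of any specific prior-art method.

```python
from html.parser import HTMLParser

# Illustrative assumption: these tags usually hold navigation bars,
# scripts, and other non-topic noise rather than topic content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TopicTextExtractor(HTMLParser):
    """Collect visible text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0   # nesting depth inside noise tags
        self.chunks = []          # collected visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth_in_noise += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth_in_noise > 0:
            self.depth_in_noise -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise tag.
        if self.depth_in_noise == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_topic_text(html):
    parser = TopicTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><body><nav>Home | About</nav>"
        "<p>Webpage classification news</p>"
        "<script>var x = 1;</script></body></html>")
print(extract_topic_text(page))  # -> Webpage classification news
```

Because real pages place noise anywhere, a fixed tag list like this is only a heuristic, which is precisely why accurate topic extraction remains difficult.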
[0006] Second, the volume of webpage data is too large to meet the real-time requirements of a webpage classification system. Network data is updated constantly and its volume grows continuously, so the real-time requirements on such a system are already very demanding. Only by continuously improving the computational speed of the classification method, or by proposing a new classification method, can the accuracy and precision of the webpage classification system be improved and an efficient user experience be delivered to meet users' growing needs.
[0011] First, because content on the Internet is constantly updated and changed, and webpage structure can be set arbitrarily, webpages are presented in a wide variety of ways. They have no fixed structural template, and their content and layout styles are inconsistent, so classifying such webpages with existing methods is very inefficient and cannot meet the needs of the growing number of mobile Internet users. Although text-classification technology has been maturely applied, webpage classification is much more complicated than text classification because of the irregularity of webpage data structures. Most webpages are HTML files; the topic information of a webpage resides within HTML tags, and noise data and junk information can also appear anywhere among those tags. Such irregular webpages make it increasingly difficult to extract topic information, which creates great difficulties for webpage classification;
[0012] Second, the webpage topic content extracted by the prior art is not accurate enough. A webpage has no fixed modules or structure, so extracting its topic content is difficult. In addition, a webpage contains not only topic content but also various advertisements, navigation bars, useless links, and other irrelevant information; because webpages are unstructured, this junk and noise data can appear anywhere on the page, which seriously affects classification accuracy. Moreover, the volume of webpage data is enormous, and the prior art cannot meet the real-time requirements of a webpage classification system: network data is updated constantly, and the real-time requirements on the system are already very demanding. Only by improving the accuracy and precision of the webpage classification system can an efficient user experience be achieved;
[0013] Third, most webpage classification technologies in the prior art use existing corpora as data sets. The webpages extracted from these corpora are largely outdated and cannot reflect current hot topics, and the noise data contained in existing corpora seriously affects the accuracy of the classification model. In addition, the feature-extraction methods of the prior art do not consider the semantic correlation between feature items, which has a negative impact on the performance of the classification model. The prior art also cannot effectively remove noise data from the data set, and the resulting poor data quality degrades the accuracy and precision of the model;
[0014] Fourth, existing webpage classification algorithms based on the vector space model mainly judge a webpage document's category by computing the similarity between document feature vectors. When the number of webpage documents reaches the order of trillions, the time complexity of computing pairwise document similarity is too high. Moreover, classification and clustering results are based on keyword matching without considering semantic information, so cases of polysemy and synonymy cannot be handled, resulting in a poor user experience. Prior-art webpage topic classification algorithms based on linear algebra use SVD matrix decomposition; the decomposition is complex to solve, and many dimensions of the resulting feature vectors are not non-negative, so the semantic concept space obtained by LSI is unsatisfactory. In addition, LSI deletes feature items that strongly characterize certain categories when mapping them to the concept space, which greatly reduces webpage classification accuracy. Prior-art webpage topic classification algorithms based on probabilistic topic models suffer from overfitting when the amount of webpage data grows greatly, and their parameters grow with the amount of webpage data, leading to a significant increase in computational complexity;
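The vector-space-model scheme criticized above can be sketched minimally: documents become TF-IDF vectors, and a page is assigned the label of its most similar training document by cosine similarity. The tiny corpus, labels, and TF-IDF weighting below are illustrative assumptions, not any specific prior-art system; the sketch also makes the criticism concrete, since each query costs one comparison per training document and all-pairs similarity over n documents is O(n^2).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dictionaries."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (log(n/df) + 1) so shared terms keep some weight.
        out.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return out

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(query, train_docs, labels):
    """Assign the label of the most cosine-similar training document."""
    vecs = tfidf_vectors(train_docs + [query])
    q = vecs[-1]
    # One pass over all training vectors per query: this linear scan is
    # what blows up when the document count reaches trillions.
    sims = [cosine(q, v) for v in vecs[:-1]]
    return labels[max(range(len(sims)), key=sims.__getitem__)]

train = [["football", "match", "goal"], ["stock", "market", "price"]]
labels = ["sports", "finance"]
print(classify(["goal", "score", "football"], train, labels))  # -> sports
```

Note that the match is purely lexical: a query using a synonym that never appears in the training vocabulary contributes nothing to the dot product, which is exactly the polysemy/synonymy limitation described above.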
[0015] Fifth, most data sets used in existing Chinese webpage classification methods come from the Sogou corpus. Although the Sogou corpus extracts the topic information of webpages and labels their categories, the webpages it contains are updated slowly: they cannot reflect current social hotspots, nor handle new words and out-of-vocabulary words on the Internet, so Sogou corpus data cannot be used to address current hot topics. The webpage classification methods of the prior art depend on the quality of the training data. Topics and news hotspots are updated every day; if the data is unrepresentative, or becomes unrepresentative over time as new data is generated, the accuracy of the classification model is seriously affected. Large numbers of new words and hot words are generated continually; if an earlier classification model is used to classify webpages containing many new words, the classification effect is very poor because the trained model is not sensitive to those words.