CBL feature extraction and denoising webpage accurate classification method

A feature extraction and classification technique applied in network data retrieval, website content management, special data processing applications, and related fields. It addresses problems such as slow webpage updates, poor accuracy and precision of webpage classification models, and an unsatisfactory LSI semantic concept space, with the effect of improving quality, accuracy, and precision.

Pending Publication Date: 2021-10-19
荆门汇易佳信息科技有限公司

AI Technical Summary

Problems solved by technology

Third, it can improve the user experience. The quality of a search engine's results directly determines the quality of the user experience. If the search engine only matches webpages against the keywords supplied by the user, without considering the user's search intent or the topic behind those keywords, it cannot meet the user's needs.
[0004] Chinese webpage classification is based on text classification. Although text classification is a mature technology applied in many fields, webpage classification is much more complicated because webpage data is irregularly structured.
For example, the data sets used for text classification come from text documents or database records, which have a highly standardized structure, so their feature items are easy to obtain. Most web pages, however, are HTML files, and HTML is a semi-structured format. The topic information of a web page is held inside HTML tags, and noise data and junk information can also appear anywhere in those tags. This irregularity makes it increasingly difficult to extract topic information from web pages, which creates great difficulties for web page classification.
[0005] First, the extracted webpage topic content is not accurate enough. Webpages have no fixed modules or structure, which makes extracting their topic content difficult. Besides topic content, a webpage also contains advertisements, navigation bars, useless links, and other irrelevant information. Because webpages are unstructured, this spam and noise data can appear at any position on the page, which seriously affects the accuracy of webpage classification.
[0006] Second, the amount of webpage data is too large to meet the real-time requirements of a webpage classification system. Network information is updated constantly and the amount of data keeps growing, so the real-time requirements on a webpage classification system are already very demanding. Only by continuously improving the computation speed of the classification method, or by proposing a new classification method, can the accuracy and precision of the webpage classification system be improved and an efficient user experience be delivered to meet the growing needs of users.
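The noise problem described in [0005] can be illustrated with a minimal sketch that keeps visible text while skipping content nested inside tags that usually hold navigation, scripts, or similar non-topic material. This is not the method of the invention; the tag list `NOISE_TAGS` and the helper `extract_main_text` are assumptions chosen for illustration, and real pages often place noise inside generic tags that this simple filter would not catch.

```python
from html.parser import HTMLParser

# Assumption: these tags typically wrap non-topic content (illustrative list only).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping anything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of the page with noise-tag content removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Stock markets rose today.</p>"
    "<script>track();</script></body></html>"
)
print(extract_main_text(page))
```

A tag-based filter like this only handles noise that is marked up consistently, which is exactly the assumption the passage above says does not hold for arbitrary webpages.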
[0011] First, because content on the Internet is constantly updated and webpage structure can be set arbitrarily, webpages are presented in many different ways. They have no fixed structural template, and their content and layout styles are inconsistent. Classifying webpages by purely manual methods is very inefficient and cannot meet the needs of the growing number of mobile Internet users. Although text classification is a mature technology, webpage classification is much more complicated because webpage data is irregularly structured. Most webpages are HTML files, and their topic information is held inside HTML tags, where noise data and junk information can also appear anywhere. This irregularity makes it increasingly difficult to extract topic information from webpages, which creates great difficulties for webpage classification;
[0012] Second, the webpage topic content extracted by the prior art is not accurate enough. Webpages have no fixed modules or structure, so extracting their topic content is difficult. Besides topic content, a webpage also contains advertisements, navigation bars, useless links, and other irrelevant information. Because webpages are unstructured, this spam and noise data can appear anywhere on the page, which seriously affects the accuracy of webpage classification. In addition, the volume of webpage data is huge and network information is updated constantly, so the prior art cannot meet the already demanding real-time requirements of a webpage classification system. Only by improving the accuracy and precision of the webpage classification system can an efficient user experience be achieved;
[0013] Third, most prior-art webpage classification techniques use existing corpora as data sets. The webpages extracted from these corpora are largely outdated and cannot reflect current hot topics, and the corpora contain noise data that seriously affects the accuracy of the classification model.
In addition, prior-art feature extraction methods do not consider the semantic correlation between feature items, which degrades the performance of the classification model. The prior art also cannot effectively remove noise data from the data set, so data set quality is poor and the accuracy and precision of the model suffer;
[0014] Fourth, existing webpage classification algorithms based on the vector space model judge a webpage document's category mainly by computing the similarity of document feature vectors. When the number of webpage documents reaches the order of trillions, the time complexity of computing similarity between documents is too high. In addition, the classification or clustering results rely on keyword matching without considering semantic information, so they cannot handle synonymy and polysemy, resulting in a poor user experience. The prior-art webpage topic classification algorithm based on linear algebra uses SVD matrix decomposition. The decomposition is complex to solve, and its result is not positive in many dimensions of the feature vector, which leads to an unsatisfactory LSI semantic concept space. Moreover, under LSI, feature items that are strong indicators of certain categories are lost after being mapped into the concept space, which greatly affects webpage classification accuracy. The prior-art webpage topic classification algorithm based on a probabilistic feature topic model overfits when the amount of webpage data grows substantially, and its number of parameters grows with the amount of webpage data, resulting in a significant increase in computational complexity;
[0015] Fifth, most data sets used by existing Chinese webpage classification methods come from the Sogou corpus. Although the Sogou corpus extracts the topic information of webpages and labels their categories, its webpages are updated slowly. It cannot reflect current social hot topics or handle new and out-of-vocabulary words on the Internet, so its data cannot be used to deal with current hot topics. Prior-art webpage classification methods depend on the quality of the training data. Topics and news hotspots change every day, and if the data is unrepresentative, or becomes unrepresentative over time as new data is generated, the accuracy of the classification model is seriously affected. Large numbers of new words and hot words appear continually. If an earlier classification model is used to classify webpages containing many new words, the classification results are very poor because the trained model is not sensitive to those words.
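The keyword-matching limitation of the vector space model noted in [0014] can be seen in a small sketch. It computes cosine similarity between raw term-frequency vectors; the whitespace tokenization and the function name `cosine_similarity` are simplifying assumptions for illustration, not details taken from the invention.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two documents."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same meaning, no shared keywords: the similarity is zero.
print(cosine_similarity("buy a car", "purchase an automobile"))
# Shared keywords: the similarity is high.
print(cosine_similarity("buy a car", "buy a car today"))
```

Because the score depends only on shared surface terms, two pages that express the same topic with different words look unrelated, and every pairwise comparison adds to the cost as the collection grows.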

Embodiment Construction

[0090] The technical solution of the CBL feature extraction and denoising method for precise webpage classification provided by the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand and implement the present invention.

[0091] In the era of information explosion, network information resources continue to grow at an exponential rate. To analyze this massive amount of webpage information efficiently and in real time, webpage classification has become a hot topic in natural language processing. Most prior-art webpage classification techniques use existing corpora as data sets. The webpages extracted from these corpora are largely outdated and cannot reflect current hot topics, and the corpora contain noise data, which seriously affects the accuracy of the classification model. In addition, the feature extraction methods in ...

Abstract

The CBL feature extraction and denoising method for precise webpage classification performs feature extraction on a data set and removes noise data from it. First, features are extracted with a feature extraction method based on a CBL model: the original high-dimensional spatial features are mapped or converted into new low-dimensional spatial features, useless noise data is mapped to weak dimensions, the number of feature items from the original space is greatly reduced, and representative feature items are selected according to the relevance between feature items, achieving dimension reduction. Second, noise data is removed with a noise processing method based on a CBL model: the data set is divided into subsets according to category, a probabilistic feature topic model is constructed for each subset, and the information entropy between each webpage in the data set and the probabilistic feature topic model of its subset is calculated. If a webpage's information entropy exceeds a given critical value, the webpage is treated as noise data and removed as junk information, which greatly improves the accuracy and precision of webpage classification.
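The denoising step summarized above can be sketched as follows. The abstract does not specify the form of the probabilistic feature topic model or of the entropy calculation, so this sketch substitutes a smoothed unigram distribution per category and the average negative log-probability (cross-entropy) of a page under its category's model. The function names, the smoothing constant `alpha`, and the threshold of 3.0 are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_category_models(pages):
    """Map each category to token counts pooled over its pages.

    Stand-in for the per-subset topic model (assumption: unigram counts).
    """
    models = defaultdict(Counter)
    for category, tokens in pages:
        models[category].update(tokens)
    return models


def cross_entropy(tokens, model, vocab_size, alpha=1.0):
    """Average -log2 probability of tokens under an add-alpha smoothed model."""
    if not tokens:
        return 0.0
    total = sum(model.values())
    denom = total + alpha * vocab_size
    return -sum(math.log2((model[t] + alpha) / denom) for t in tokens) / len(tokens)


def flag_noise(pages, threshold):
    """Return one boolean per page: True if its entropy exceeds the threshold."""
    models = build_category_models(pages)
    vocab = {t for _, tokens in pages for t in tokens}
    return [
        cross_entropy(tokens, models[category], len(vocab)) > threshold
        for category, tokens in pages
    ]


# Toy data set: (category label, tokenized page). The third page is off-topic.
pages = [
    ("sports", ["match", "goal", "team", "goal"]),
    ("sports", ["team", "match", "league", "goal"]),
    ("sports", ["discount", "casino", "click", "win"]),
]
print(flag_noise(pages, threshold=3.0))
```

On this toy data the off-topic page scores higher than the other two because its words are rare within its labelled category. In practice the critical value would need to be tuned on real data.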

Description

Technical field

[0001] The invention relates to a method for precise classification of webpages, in particular to a method for precise classification of webpages using CBL feature extraction and denoising, and belongs to the technical field of webpage classification.

Background

[0002] As a scalable and distributed platform, the Internet is developing rapidly, and the information resources on it are growing at an exponential rate. The webpage is the most important carrier for disseminating and developing Internet information, and massive network resources now reach every area of production and daily life. Because content on the Internet is constantly updated and webpage structure can be set arbitrarily, webpages are presented in many different ways. They have no fixed structural template, and their content and layout styles are inconsistent. If relying purely on manual methods to classify webpages, i...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F16/958, G06F16/957, G06F16/907, G06F16/9035
CPC: G06F16/958, G06F16/9035, G06F16/907, G06F16/9577, G06F18/2411, G06F18/214
Inventors: 刘秀萍, 陈军
Owner: 荆门汇易佳信息科技有限公司