
Web page denoising method and system based on cooperative work of template and classifier

The technology combines a template and a classifier that work cooperatively, and is applied in fields such as instrumentation, computing, and electrical digital data processing. It addresses the problems that existing approaches cannot serve as a general-purpose webpage denoiser, suffer from a degraded denoising effect, and have low efficiency. It achieves wide adaptability, a good denoising effect, and fast processing.

Active Publication Date: 2022-03-22
SICHUAN UNIV
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

The rule-based method formulates heuristic rules in advance and filters out the text content that matches them. It is applicable only to simple web pages; web pages with complex structures require complex heuristic rules, which limits the approach.
The template-based method denoises quickly, but it often requires manually constructing a template suited to the pages of a specific website, so it cannot be used as a general webpage denoiser. In 2010, Li Liwen et al., in "Research", used webpage similarity calculation to classify webpages and constructed a corresponding template for each class. The template uses the location information of the main content: the nearest parent node containing the main-content nodes is selected as the template. The main-body information extracted this way may contain a lot of noise, which has a great impact on the denoising effect.
The denoising method based on visual content first divides the webpage into blocks, then uses manual annotation together with a neural network and a support vector machine to predict the importance of each block, and finally selects the most important blocks. However, the method is computationally expensive and has low efficiency.



Examples


Embodiment Construction

[0029] The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0030] As shown in Figure 1, the denoising method of the present invention comprises the following steps:

[0031] 1. Obtain the original HTML document through web crawler technology, which comprises webpage download and webpage discovery. Webpage download downloads the target webpage and stores it in the database according to the domain name address of the target webpage; webpage discovery finds new webpage addresses that meet the requirements and adds them to the list of pages to be crawled.

[0032] 2. Process the original HTML document, which comprises preprocessing and correction. Preprocessing deletes tags that contain no text content, such as comments, scripts, and styles; correction fixes correctable errors in the DOM tree,...
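The preprocessing part of step 2 can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module as the parsing tool, which the source does not name; the class `Preprocessor`, the function `preprocess`, and the set `NON_TEXT_TAGS` are hypothetical names introduced only for this illustration. The correction part of step 2 is not sketched because the source text is truncated before describing it.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only the tags the text names (scripts, styles) are treated as
# carrying no body text. Comments are handled separately below.
NON_TEXT_TAGS = {"script", "style"}


class Preprocessor(HTMLParser):
    """Rebuilds the markup while skipping comments, <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the simplified document
        self.skip_depth = 0  # greater than 0 while inside a non-text tag

    def handle_starttag(self, tag, attrs):
        if tag in NON_TEXT_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag in NON_TEXT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs are discarded.
        if self.skip_depth == 0 and data.strip():
            self.out.append(escape(data.strip()))

    # handle_comment is intentionally not overridden: the base class
    # implementation does nothing, so comments never reach the output.


def preprocess(html_text: str) -> str:
    """Return the document with comments, scripts and styles removed."""
    parser = Preprocessor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Dropping attributes keeps the sketch short; a real implementation would probably retain them, since later feature calculation on block-level nodes may need class or id values.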



Abstract

The present invention discloses a web page denoising method and system based on the cooperative work of a template and a classifier. The denoising method includes: parsing the acquired original HTML document, deleting irrelevant tag nodes, and generating a simplified DOM tree that meets the requirements; calculating the features of each block-level node in the DOM tree of the target web page to obtain the original node set; adding the original node set to the cache node set of the corresponding website, and, when the number of elements in the cache node set reaches a preset threshold, triggering the template generation algorithm to update the template node set of that website; filtering the original node set of the target web page with the template node set of the website to which the page belongs, to obtain the filtered target web page node set; and classifying the filtered node set with a trained classifier, retaining the nodes classified as main content, and extracting the main content text from them. The invention requires little manual intervention, is highly efficient, and is suitable for denoising web pages on various topics.
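The flow in the abstract (cache per website, threshold-triggered template update, template filtering, then classification) can be sketched as follows. This is an illustration under stated assumptions, not the patented algorithm: the abstract does not describe the template generation algorithm, so the sketch assumes that a node appearing on most cached pages of a site is noise; it counts cached pages rather than individual nodes against the threshold; and it represents each node as a `(DOM path, text)` pair instead of a real feature vector. The names `Denoiser`, `SiteState`, `threshold`, and `min_share` are hypothetical, and the classifier is any callable supplied by the caller.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Stand-in for a block-level node and its features: (DOM path, text).
Node = Tuple[str, str]


@dataclass
class SiteState:
    cache: List[Set[Node]] = field(default_factory=list)  # one set per page
    template: Set[Node] = field(default_factory=set)      # nodes seen as noise


class Denoiser:
    def __init__(self, classifier: Callable[[Node], bool],
                 threshold: int = 3, min_share: float = 0.8):
        self.classifier = classifier  # True means "main content"
        self.threshold = threshold    # cached pages needed to trigger an update
        self.min_share = min_share    # ASSUMPTION: share of pages marking noise
        self.sites: Dict[str, SiteState] = defaultdict(SiteState)

    def _generate_template(self, state: SiteState) -> None:
        # ASSUMPTION: nodes repeated across most cached pages are template noise.
        counts = Counter(node for page in state.cache for node in page)
        needed = self.min_share * len(state.cache)
        state.template |= {n for n, c in counts.items() if c >= needed}
        state.cache.clear()

    def denoise(self, site: str, nodes: List[Node]) -> List[str]:
        state = self.sites[site]
        state.cache.append(set(nodes))
        if len(state.cache) >= self.threshold:
            self._generate_template(state)
        # Template filtering first, then the classifier on what remains.
        filtered = [n for n in nodes if n not in state.template]
        return [text for (path, text) in filtered if self.classifier((path, text))]
```

Running the cheap set-membership filter before the classifier means the classifier only sees nodes the template could not rule out, which is one plausible reading of why the two components are combined.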

Description

Technical Field

[0001] The invention relates to the technical field of webpage denoising, in particular to a webpage denoising method and system based on the cooperative work of a template and a classifier.

Background

[0002] With the continuous development of Internet technology, the amount of information on the Internet keeps growing explosively. Massive web page information is the main embodiment of Internet information, and it is a natural data mine for many other research fields, including search engines, public opinion analysis, and natural language processing. However, in addition to their main content, web pages also contain commercial advertisements, navigation bars, copyright information, announcements, and other information unrelated to the main content. This information can be called web page noise. How to remove the noise content in web pages and extract the main content of web pages for the analysis and use of th...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535
CPC: G06F16/9535
Inventor: 王运锋, 严金承
Owner: SICHUAN UNIV