Equipment, system and method for cleaning internet web page

An Internet webpage technology, applied in special data processing applications, instruments, and electrical digital data processing, that addresses low cleaning accuracy and the limited range of webpages that can be cleaned, and achieves high cleaning accuracy.

Active Publication Date: 2008-08-27
SHENZHEN TENCENT COMP SYST CO LTD
0 Cites, 82 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the embodiments of the present invention is to provide a method for cleaning Internet webpages, which aim

Examples

Embodiment Construction

[0023] To make the objects, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the present invention and do not limit it.

[0024] Based on a webpage classification strategy, the embodiment of the present invention divides a webpage into semantically cohesive blocks of suitable granularity, analyzes and identifies each semantic block, and extracts the important blocks and their information. In this way any webpage can be cleaned with high accuracy, including text extraction from content-type webpages, content extraction from multi-block text-type webpages, automatic extraction of important blocks from index-type webpages, and content extraction from BBS/Blog-type webpages.
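To make the classification strategy concrete, the following is a minimal sketch of a rule-based page-type guess. The four page types come from the paragraph above; everything else is an assumption made for illustration: the `classify` function, the per-block `(text_chars, link_chars)` features, and both thresholds are hypothetical and are not described in the source. BBS/Blog detection is omitted because it would need cues the text does not give.

```python
from enum import Enum


class PageType(Enum):
    """The four page types named in the description."""
    CONTENT = "content"
    MULTI_BLOCK_TEXT = "multi-block text"
    INDEX = "index"
    BBS_BLOG = "BBS/Blog"  # listed for completeness; never returned below


def classify(blocks, link_ratio_limit=0.6, long_text_chars=200):
    """Guess a page type from (text_chars, link_chars) pairs, one per block.

    Hypothetical heuristic: the features and thresholds are illustrative
    assumptions, not taken from the source.
    """
    total = sum(text for text, _ in blocks)
    linked = sum(link for _, link in blocks)
    # Pages with no text, or dominated by link text, are treated as index pages.
    if total == 0 or linked / total > link_ratio_limit:
        return PageType.INDEX
    # Count blocks that hold a long run of non-link text.
    long_blocks = [1 for text, link in blocks if text - link >= long_text_chars]
    if len(long_blocks) == 1:
        return PageType.CONTENT
    if len(long_blocks) > 1:
        return PageType.MULTI_BLOCK_TEXT
    return PageType.INDEX
```

For example, under these assumed thresholds a page with one long article block between two link-heavy blocks, `classify([(50, 45), (1200, 10), (80, 70)])`, is guessed to be a content-type page.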

[0025] figur...

Abstract

The invention relates to the field of Internet information processing and provides an Internet web page cleaning method, system and device. The method comprises the following steps: an input web page is parsed; the tag content of the web page is automatically corrected; a document object model (DOM) tree is established; the HTML block element nodes in the DOM tree that have presentation content are retained, and a structural block tree corresponding to the DOM tree is generated; the input web page is classified according to the defined web page types on the basis of the structural block tree; semantic block analysis is performed according to the type to which the web page belongs, and the important blocks and their text information are extracted and output. The method can clean any web page with high accuracy and can be applied to mobile terminal browsing, search engines, subject-oriented information acquisition, automatic information extraction, vertical search and so on.
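The step that turns parsed HTML into a structural block tree can be sketched as follows, using Python's standard `html.parser`. This is an illustration under stated assumptions, not the patented implementation: the `BLOCK_TAGS` and `HIDDEN_TAGS` sets, the `Block` class, and the tag-repair rule (a closing tag implicitly closes any unclosed blocks inside it, and stray closing tags are ignored) are all hypothetical. For brevity the sketch builds the block tree directly from parser events instead of first materializing a full DOM tree.

```python
from html.parser import HTMLParser

# Assumed block-level tags; the source does not enumerate them.
BLOCK_TAGS = {"body", "div", "table", "tr", "td", "ul", "ol", "li",
              "p", "h1", "h2", "h3"}
HIDDEN_TAGS = {"script", "style"}  # their content is never displayed


class Block:
    """One node of the structural block tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []  # text directly inside this block

    def all_text(self):
        """Own text first, then the text of descendant blocks."""
        parts = list(self.texts)
        parts.extend(child.all_text() for child in self.children)
        return " ".join(p for p in parts if p)


class BlockTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Block("#root")
        self.current = self.root
        self.hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1
        elif tag in BLOCK_TAGS:
            node = Block(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in BLOCK_TAGS:
            # Find the nearest open block with this tag; blocks left
            # unclosed inside it are closed implicitly.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:  # otherwise: stray close tag, ignore
                self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.hidden_depth == 0:
            self.current.texts.append(text)


def prune(block):
    """Drop blocks whose subtree carries no visible text."""
    block.children = [c for c in block.children if prune(c)]
    return bool(block.texts or block.children)


def build_block_tree(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root
```

Inline tags such as `<b>` or `<a>` do not create blocks here, so their text is attributed to the enclosing block. A later classification or semantic analysis step could then work on the per-block text that `all_text()` exposes.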

Description

Technical field

[0001] The invention belongs to the field of Internet information processing, and in particular relates to an Internet webpage cleaning method, system and equipment.

Background technique

[0002] With the rapid development of the Internet, the Web has become the basic platform for publishing and sharing information, and the Web page in HTML format is the main information carrier. Web pages have developed from hand-edited static pages into dynamic pages generated from databases and templates, and their content is becoming more and more complex, including noise information such as copyright information.

[0003] Web page cleaning is similar to data cleaning in data mining: Web page data is cleaned and purified through Web mining and machine learning technologies, useful information is extracted, and noise information is removed. Web page cleaning can provide the basis for applications such as search engines, mobile phone...

Application Information

IPC(8): G06F17/30
Inventor: 方高林, 郑全战
Owner: SHENZHEN TENCENT COMP SYST CO LTD