Method for extracting regular noise from single record web pages

The method, which is applied in the field of network information retrieval, extracts regular noise from single-record web pages. It addresses problems of existing approaches such as missed extraction of small noise, low occurrence frequency of noise branches, and low efficiency.

Active Publication Date: 2013-04-24
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0012] (1) As web technology has developed, the DOM tree structures of different sections differ more and more, even within the same website. The SST therefore accumulates too many branch nodes during tree building. When DOM trees of different structures are unevenly distributed in number, a given noise branch occurs too infrequently in the SST, so only part of the noise can be extracted from web pages with that type of DOM structure;
[0013] (2) Even if the web pages are first classified by DOM structure and the algorithm is applied only to DOM trees of similar structure, when a layer of nodes (for example, 10 nodes) differs in just one node, the SST method builds separate child nodes for the different branches. This wastes a great deal of space and greatly reduces the efficiency of tree building;
[0014] (3) Because the SST method forms child nodes of different styles for different branches, the branch granularity easily becomes too coarse, so some small noise is missed during extraction;
[0015] (4) For single-record pages in particular, the SST method cannot locate the position of the noise relative to the main content.
[0017] In summary, for regular noise extraction from single-record pages, existing noise extraction methods suffer from missed extraction, wasted space, and low efficiency, and cannot locate the position of the regular noise relative to the main content.
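
Problem (2) can be made concrete with a small count. The sketch below is only an illustration, not the SST algorithm itself: it assumes a simplified model in which each distinct sequence of child tags keeps its own copy of all its children, and compares that with a merge that aligns the sequences and stores shared children once. The function names, the tag lists, and the use of Python's `difflib` for alignment are assumptions made for the example.

```python
from difflib import SequenceMatcher


def per_style_element_count(child_sequences):
    """Simplified model: every distinct child-tag sequence is stored in full."""
    distinct_styles = {tuple(seq) for seq in child_sequences}
    return sum(len(style) for style in distinct_styles)


def merged_element_count(child_sequences):
    """Align the sequences and store children shared by several pages only once."""
    merged = list(child_sequences[0])
    for seq in child_sequences[1:]:
        matcher = SequenceMatcher(a=merged, b=list(seq), autojunk=False)
        result = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            # Always keep what is already in the merged layer.
            result.extend(merged[i1:i2])
            # Add children from the new page only where the layers differ.
            if op != "equal":
                result.extend(seq[j1:j2])
        merged = result
    return len(merged)


# Two hypothetical layers of 10 child nodes that differ in a single node.
layer_a = ["div", "h1", "p", "ul", "table", "img", "span", "form", "a", "hr"]
layer_b = ["div", "h1", "p", "ul", "ol", "img", "span", "form", "a", "hr"]

print(per_style_element_count([layer_a, layer_b]))  # 20
print(merged_element_count([layer_a, layer_b]))     # 11
```

Under these assumptions, one differing node doubles the storage for the layer in the per-style model, while the aligned merge adds a single node.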



Examples


Embodiment Construction

[0083] The present invention will be described below with reference to the accompanying drawings and specific embodiments.

[0084] According to an embodiment of the present invention, a method for extracting regular noise from single-record web pages is provided. Based on the DOM tree structure, visual information, and text information of the web pages, a multi-template model is used to extract the noise before the text, within the text, and after the text of a single-record web page. In the extraction process, n (n>=2) web pages are first classified automatically according to their DOM tree structure. Then m (m>=2) web pages of the same category (that is, with similar page structure) are matched and merged to form the site section style tree, SBSTree. On this basis, visual and text rules are used to find the approximate positions of the text title and the text body in the site section style tree (the merged DOM tree), and then judge wh...
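
The classification step can be sketched as follows. The text does not say how structural similarity is measured, so this example makes its own assumptions: each page is reduced to its set of root-to-node tag paths, similarity is the Jaccard index of two such sets, and pages are grouped greedily against a hypothetical threshold of 0.7. The names `TagPathCollector`, `tag_paths`, and `classify_pages` are illustrative only.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify_pages(pages, threshold=0.7):
    """Greedy grouping: a page joins the first class whose representative is similar enough."""
    classes = []  # list of (representative path set, member page indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in classes:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            classes.append((paths, [index]))
    return [members for _, members in classes]


article_1 = "<html><body><div><h1>A</h1><p>one</p></div><ul><li>nav</li></ul></body></html>"
article_2 = "<html><body><div><h1>B</h1><p>two</p></div><ul><li>nav</li></ul></body></html>"
listing = "<html><body><table><tr><td>row</td></tr></table></body></html>"

print(classify_pages([article_1, article_2, listing]))  # [[0, 1], [2]]
```

Only classes with at least two pages would then be candidates for merging into a site section style tree, matching the m>=2 condition above.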



Abstract

The invention provides a method for extracting regular noise from single-record web pages. The method first converts multiple single-record web pages into document object model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees of the same class to obtain site section style trees, locates the approximate positions of the text title nodes and the text body nodes in the site section style trees, and finally extracts the regular noise before the text, within the text, and after the text according to those title and body nodes. The method reduces the space required to build the site section style trees, reduces missed extractions, and speeds up extraction. In addition, the extraction results are accurate and reliable.
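
The final step, grouping noise by its position relative to the title and body nodes, can be sketched as below. This is a simplified illustration under stated assumptions, not the patented procedure: the merged tree is flattened into a document-order list of blocks, a block counts as regular noise when its text is identical and non-empty on every merged page (a stand-in criterion chosen for the example), and the title index and body end index are taken as already known. `Block`, `is_repeated`, and `split_noise` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    path: str     # hypothetical position of the node in the merged tree
    texts: tuple  # text found at this position on each merged page


def is_repeated(block):
    """Assumed criterion: the same non-empty text appears on every page."""
    return (
        len(block.texts) > 1
        and len(set(block.texts)) == 1
        and block.texts[0].strip() != ""
    )


def split_noise(blocks, title_index, body_end_index):
    """Group repeated blocks by position relative to the title and body."""
    regions = {"before": [], "in": [], "after": []}
    for i, block in enumerate(blocks):
        if not is_repeated(block):
            continue
        if i < title_index:
            regions["before"].append(block.path)
        elif i <= body_end_index:
            regions["in"].append(block.path)
        else:
            regions["after"].append(block.path)
    return regions


blocks = [
    Block("body/div[1]", ("Home | News", "Home | News")),
    Block("body/h1", ("Title A", "Title B")),
    Block("body/div[2]/p[1]", ("Story A", "Story B")),
    Block("body/div[2]/span", ("Share this article", "Share this article")),
    Block("body/div[2]/p[2]", ("More A", "More B")),
    Block("body/div[3]", ("Copyright", "Copyright")),
]

print(split_noise(blocks, title_index=1, body_end_index=4))
# {'before': ['body/div[1]'], 'in': ['body/div[2]/span'], 'after': ['body/div[3]']}
```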

Description

Technical field

[0001] The present invention relates to the field of network information retrieval, and more particularly to a method for extracting regular noise before the text, within the text, and after the text of a single-record web page (that is, a web page of a single style that records one piece of data, where the data record refers to the area of the main part of the web page).

Background

[0002] In the information age, there are more and more ways to obtain information. As a carrier of information, the Internet holds an irreplaceable position in terms of dissemination efficiency and information capacity, and it has become an important source from which people obtain all kinds of knowledge and information. However, with the rapid development of Web technology, the massive volume of data on the Internet grows rapidly every day, and its content is all-encompassing and takes many forms. Web ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 程学旗, 李海燕, 郭岩, 万圣贤, 郭少华, 刘悦, 余智华
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI