Webpage core block determining method based on DOM (Document Object Model) node text density

A determination method based on DOM tree technology, applied in the field of webpage core block determination. It addresses problems of existing approaches, such as low-density rows being easily lost, core block content not being completely extracted, and interfering effects that cannot be completely eliminated, and it achieves good discrimination between core content and noise.

Inactive Publication Date: 2012-11-28
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

However, this effect cannot be completely eliminated
[0011] 2. Only the text information of the core block of the webpage can be extracted, and the original structural information of the webpage cannot be retained
This makes it difficult to integrate with other applications, such as structured information extraction
[0012] 3. The characteristics of noise data in web pages are not fully utilized, and the distinguishing effect is not very obvious
[0013] 4. The content of the core block cannot be completely extracted, and the rows with low density are easily lost


Embodiment Construction

[0054] The preferred embodiments of the present invention will be described in detail below with reference to the drawings.

[0055] This embodiment uses an actual page from The New York Times as an example. The page contains many pictures, text passages and links. The article contained in the page is its core content.

[0056] First, the page is parsed into a DOM tree. A fragment of the page's code is selected as an example, as follows:

[0057]-[0058] (code listing not reproduced in this copy)

[0059] The ellipses in the code stand for other node information, omitted for simplicity. The fragment is parsed into the DOM tree shown in Figure 1.

[0060] Next, the DOM tree of the entire page is traversed to compute, for each node, its text density value and the sum of the densities of its child nodes. The results are as follows:

[0061] : Chars=6094, Tags=541, LinkChars=3243, LinkTags=445, Density=4.18771, densitySum=4.18549

[0062] : Chars=6094, Tags=533, LinkChars=3243, LinkTags=444, Density=4.18549, densitySum=4.41271

[0063] : Chars=44, T...
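The per-node quantities listed above can be illustrated with a small sketch. It is not the patent's implementation: the excerpt does not give the density formula, and the listed values show that it is not a plain ratio (Chars=6094 and Tags=541 would give about 11.26, not 4.18771). The sketch therefore assumes a simplified density of characters per tag, computes `density_sum` as the sum of the child nodes' densities as described in [0060], and uses a hypothetical selection rule (largest `density_sum`) only to show how such scores could be compared. The names `StatNode`, `score` and `pick_candidate` are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class StatNode:
    """Hypothetical DOM node carrying only the counters needed here."""
    name: str
    chars: int                      # text characters under the node
    tags: int                       # tags under the node
    children: list = field(default_factory=list)
    density: float = 0.0
    density_sum: float = 0.0


def score(node):
    """Assumed density: chars per tag (tags floored at 1 to avoid dividing by zero)."""
    node.density = node.chars / max(node.tags, 1)
    for child in node.children:
        score(child)
    # Sum of the direct children's densities, as described in [0060].
    node.density_sum = sum(child.density for child in node.children)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def pick_candidate(root):
    """Hypothetical rule: the node whose children are densest overall."""
    return max(walk(root), key=lambda n: n.density_sum)


# A toy page: an article with two long paragraphs and a link-heavy menu.
article = StatNode("article", chars=900, tags=2, children=[
    StatNode("p1", chars=400, tags=0),
    StatNode("p2", chars=500, tags=0),
])
nav = StatNode("nav", chars=60, tags=6, children=[
    StatNode(f"a{i}", chars=10, tags=0) for i in range(6)
])
body = StatNode("body", chars=960, tags=10, children=[article, nav])

score(body)
print(pick_candidate(body).name)
```

In this toy tree the long paragraphs give the article node a much larger density sum than the menu, which is the kind of separation between content and noise that the method relies on.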


Abstract

The invention relates to a method for determining the core block of a webpage based on DOM (Document Object Model) node text density. The method comprises the following steps: 1, parsing an HTML (HyperText Markup Language) webpage and generating a DOM tree, so that each HTML tag corresponds to one node in the DOM tree and the text content of the webpage forms the leaf nodes of the tree; 2, adding statistical information to each node, including the number of all text characters contained in the node, the number of all tags contained in the node, the number of all hyperlink text characters contained in the node, and the number of all hyperlinks contained in the node, and defining the text density of the node according to this statistical information; and 3, determining the webpage core block according to the text density of the nodes in the DOM tree. The method extracts the core content block of a webpage completely, without being influenced by webpage coding style and while retaining the original DOM structure of the webpage.
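Steps 1 and 2 of the abstract can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the `build_tree` and `annotate` helpers, the handling of void tags, and the choice to count only descendant tags (not the node itself) are all assumptions made for the example. Script and style text is not filtered out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset of HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical tree node holding the four counters named in the abstract."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text = ""          # only used by "#text" leaf nodes
        self.chars = self.tags = self.link_chars = self.link_tags = 0


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:                # text becomes a leaf node of the tree
            leaf = Node("#text", self.current)
            leaf.text = text
            self.current.children.append(leaf)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def annotate(node, inside_link=False):
    """Fill in the counters bottom-up. Call once per tree."""
    inside_link = inside_link or node.tag == "a"
    if node.tag == "#text":
        node.chars = len(node.text)
        node.link_chars = node.chars if inside_link else 0
        return
    for child in node.children:
        annotate(child, inside_link)
        node.chars += child.chars
        node.link_chars += child.link_chars
        node.tags += child.tags
        node.link_tags += child.link_tags
        if child.tag != "#text":
            node.tags += 1
            if child.tag == "a":
                node.link_tags += 1


page = build_tree('<div><p>Hello world</p><a href="#">more</a><br></div>')
annotate(page)
div = page.children[0]
print(div.chars, div.tags, div.link_chars, div.link_tags)
```

Because the counters are stored on the nodes of the parsed tree rather than on a flattened text stream, the original structure stays available to later steps, which is the property the abstract emphasizes.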

Description

Technical field

[0001] The invention relates to an algorithm for determining the core block of a web page based on the text density of DOM nodes, and belongs to the technical field of computer applications.

Background art

[0002] With the rapid development of the Internet, the WWW has become the largest database in the world. Mining the web for useful information or knowledge has therefore gradually become an emerging research hotspot.

[0003] These studies need to collect, process and store the core content of web pages quickly and efficiently. However, the core content of a web page is often surrounded by a large amount of irrelevant information, for example navigation menus, sidebar advertisements and copyright information. Although this information enriches and beautifies the page and makes it more convenient for users to browse, it is not related to the subject of the web page and also makes it difficult for these web...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 孙飞, 宋丹丹, 廖乐健, 王晓华
Owner: BEIJING INSTITUTE OF TECHNOLOGY