Webpage core block determining method based on DOM (Document Object Model) node text density

The determination method uses DOM tree technology and is applied in the field of web page core block determination algorithms. It addresses problems such as the easy loss of low-density rows, insufficient use of noise-data characteristics, and difficult integration with other applications, and it achieves good discrimination.

Inactive Publication Date: 2011-09-14
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

However, this effect cannot be completely eliminated.
[0011] 2. Only the text of the web page's core block can be extracted; the original structural information of the web page cannot be retained. This makes integration with other applications, such as structured information extraction, difficult.
[0012] 3. The characteristics of noise data in web pages are not fully exploited, so the distinction between content and noise is not very clear.
[0013] 4. The content of the core block cannot be extracted completely, and rows with low density are easily lost.


Embodiment Construction

[0054] The preferred embodiments of the present invention will be specifically described below in conjunction with the accompanying drawings.

[0055] This embodiment uses an actual page from the New York Times as an example. The page contains many pictures, text passages and links. The article carried by the page is the core content of the web page.

[0056] First, the page is parsed into a DOM tree. An excerpt of the page's code is taken as an example, as follows:

[0057]

[0058]

[0059] The ellipses in the code stand for other node information, which is omitted to simplify the representation. Parsing the code yields the DOM tree shown in figure 1.

[0060] Then the DOM tree of the entire page is traversed to obtain, for each node, its text density value and the sum of the densities of its child nodes. The results are as follows:

[0061] : Chars=6094, Tags=541, LinkChars=3243, LinkTags=445, Density=4.18771, densitySum=4.18549

[0062] : Chars=6094, Tags=533, LinkChars=3243, LinkTags...
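The densitySum value listed for each node above is described as the density sum of its child nodes. A minimal sketch of that aggregation follows, assuming the sum runs over direct children only. The names `DensityNode` and `add_density_sums` are hypothetical, and the density values used in the example are made up for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DensityNode:
    """A DOM node that already carries a text density value (hypothetical type)."""
    name: str
    density: float
    children: List["DensityNode"] = field(default_factory=list)
    density_sum: float = 0.0


def add_density_sums(node: DensityNode) -> DensityNode:
    """Set density_sum on every node to the sum of its direct children's densities.

    Assumption: only direct children contribute; leaf nodes get 0.0.
    """
    for child in node.children:
        add_density_sums(child)
    node.density_sum = sum(child.density for child in node.children)
    return node


# Illustrative tree with made-up density values.
article = DensityNode("div.article", 2.0, [DensityNode("p", 1.25)])
sidebar = DensityNode("div.sidebar", 0.5)
page = add_density_sums(DensityNode("body", 1.5, [article, sidebar]))
```

In this toy tree the `body` node's density sum is the sum of the densities of `div.article` and `div.sidebar`, while the leaf nodes keep a density sum of zero.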


Abstract

The invention relates to a method for determining the core block of a web page based on DOM (Document Object Model) node text density, comprising the following steps: 1, parsing an HTML (HyperText Markup Language) web page and generating a DOM tree so that each HTML tag corresponds to one node in the DOM tree, with the text content of the page forming the leaf nodes of the tree; 2, adding statistical information to each node, including the number of all text characters contained in the node, the number of all tags contained in the node, the number of all hyperlink text characters contained in the node and the number of all hyperlinks contained in the node, and defining the text density of the node from this statistical information; and 3, determining the core block of the web page according to the text density of the nodes in the DOM tree. The invention extracts the core content block of a web page completely, independently of the page's coding style and while retaining the original DOM structure of the page.
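Steps 1 and 2 of the abstract can be sketched in Python with the standard library's `html.parser`. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Node`, `TreeBuilder`, `annotate`, `parse_html` and `text_density` are hypothetical, text is counted after stripping surrounding whitespace, tags are counted over descendants only, and `text_density` is a plain characters-per-tag ratio used as a placeholder because this excerpt does not reproduce the patent's exact density definition.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed list, enough for the sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element plus the four statistics named in the abstract."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""        # text found directly inside this element
        self.chars = 0        # all text characters in the subtree
        self.tags = 0         # all descendant tags
        self.link_chars = 0   # text characters that sit inside <a> elements
        self.link_tags = 0    # descendant <a> elements


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; real-world HTML needs a sturdier parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def annotate(node, inside_link=False):
    """Fill in the four counts bottom-up for every node in the tree."""
    inside = inside_link or node.tag == "a"
    node.chars = len(node.text)
    node.link_chars = len(node.text) if inside else 0
    node.tags = 0
    node.link_tags = 0
    for child in node.children:
        annotate(child, inside)
        node.chars += child.chars
        node.tags += 1 + child.tags
        node.link_chars += child.link_chars
        node.link_tags += (1 if child.tag == "a" else 0) + child.link_tags


def parse_html(html):
    """Parse an HTML string and return the annotated root node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    return builder.root


def text_density(node):
    """Placeholder density: characters per tag (an assumption, see lead-in)."""
    return node.chars / max(node.tags, 1)
```

For a fragment such as `<div><p>Hello world</p><ul><li><a href="#">Home</a></li><li><a href="#">News</a></li></ul></div>`, the `div` node ends up with 19 text characters, 6 descendant tags, 8 hyperlink characters and 2 hyperlinks. Because the tree keeps the parent and child links, a node selected by its density still carries its original structure, which is the property the abstract emphasizes.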

Description

Technical field

[0001] The invention relates to a web page core block determination algorithm based on DOM node text density, and belongs to the technical field of computer applications.

Background technique

[0002] With the rapid development of the Internet, the WWW has become the largest database in the world. Therefore, data mining on the web to obtain useful information or knowledge has gradually become a new hot research direction.

[0003] These studies need to collect, process and store core content from the web quickly and efficiently. However, the core content of a web page is often surrounded by a large amount of irrelevant information, for example navigation menus, sidebar ads and copyright information. Although this information makes the web page richer and more attractive and makes browsing easier for users, it is unrelated to the topic of the page, and it also makes such pages difficult for computer programs to parse....


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 孙飞, 宋丹丹, 廖乐健, 王晓华
Owner: BEIJING INSTITUTE OF TECHNOLOGY