
DOM tree-based page partitioning method, apparatus and device, and storage medium

A DOM tree-based page segmentation technology, applicable to database clustering/classification, special data processing applications, network data browsing optimization, and similar fields. It saves information extraction time, improves extraction speed and efficiency, and improves accuracy.

Active Publication Date: 2019-10-29
SOUTH CENTRAL UNIVERSITY FOR NATIONALITIES

AI Technical Summary

Problems solved by technology

[0004] The main purpose of the present invention is to provide a DOM tree-based page segmentation method, apparatus, device and storage medium, aiming to solve the technical problems of low extraction accuracy, poor versatility, and high information extraction cost in the prior art.


Examples


Embodiment Construction

[0046] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0047] The solution of the embodiment of the present invention is mainly as follows: the webpage to be partitioned is denoised, and a DOM tree is generated from the denoised webpage; the node path of each node on the DOM tree is obtained, the similarity between node paths is calculated, the nodes are clustered according to the similarity, and a clustering result is generated; the webpage to be partitioned is then divided into blocks according to the clustering result. This reduces the influence of noise content on webpage information extraction and improves the accuracy of page information extraction. The method can adapt to web pages with different structures, has strong versatility and adaptability, saves information extraction time, and improves the speed and efficiency of in...
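As a rough illustration of the first two steps (denoising and node-path extraction), the following Python sketch uses only the standard library's `html.parser`. The set of noise tags, the class name `PathCollector`, and the helper `node_paths` are illustrative assumptions; the text above does not specify how noise is identified or how node paths are encoded.

```python
from html.parser import HTMLParser

# Tags treated as noise in this sketch (an assumption, not the patent's list).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the root-to-node tag path of every non-noise element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags from the root down to the current node
        self.paths = []       # one "/html/body/..." path per kept element
        self.noise_depth = 0  # > 0 while inside a noise subtree

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS or self.noise_depth:
            # Skip the whole noise subtree; only track how deep we are in it.
            if tag not in VOID_TAGS:
                self.noise_depth += 1
            return
        self.paths.append("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.noise_depth:
            if tag not in VOID_TAGS:
                self.noise_depth -= 1
            return
        if tag in self.stack:
            # Pop up to and including the matching open tag,
            # which tolerates unclosed inner tags.
            while self.stack.pop() != tag:
                pass


def node_paths(html):
    """Return the tag path of each element that survives denoising."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths
```

Here the tag path stands in for the "node path" of the text; a real implementation might also encode sibling positions or attributes, which this sketch deliberately omits.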



Abstract

The invention discloses a DOM tree-based page partitioning method, apparatus, device and storage medium. The method comprises: denoising a webpage to be partitioned and generating a DOM tree from the denoised webpage; obtaining the node path of each node on the DOM tree, calculating the similarity between node paths, clustering the nodes according to the similarity, and generating a clustering result; and partitioning the webpage according to the clustering result. Because the webpage is partitioned according to the clustering result, the influence of noise content on webpage information extraction is reduced and the accuracy of extraction is improved. The method adapts to webpages of different structures, offers high universality and adaptability, saves information extraction time, improves extraction speed and efficiency, and improves user experience.
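The similarity and clustering steps could be sketched as follows. The text does not state which similarity measure or clustering algorithm is used, so this example assumes a shared-prefix ratio and a greedy threshold clustering; `path_similarity`, `cluster_paths` and the `0.75` threshold are hypothetical names and values chosen only for illustration.

```python
def path_similarity(a, b):
    """Shared leading segments divided by the longer path's segment count."""
    sa = a.strip("/").split("/")
    sb = b.strip("/").split("/")
    common = 0
    for x, y in zip(sa, sb):
        if x != y:
            break
        common += 1
    return common / max(len(sa), len(sb))


def cluster_paths(paths, threshold=0.75):
    """Greedy clustering: a path joins the first cluster whose first
    member (its representative) is at least `threshold` similar to it;
    otherwise it starts a new cluster."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            if path_similarity(cluster[0], path) >= threshold:
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return clusters


if __name__ == "__main__":
    demo = [
        "/html/body/div/ul/li",
        "/html/body/div/ul/li",
        "/html/body/div/ul/li/a",
        "/html/body/table/tr/td",
    ]
    for block in cluster_paths(demo):
        print(block)
```

Each resulting cluster groups nodes with structurally similar paths and could be treated as one candidate page block. The greedy scheme depends on input order; a hierarchical or density-based algorithm would be a reasonable substitute.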

Description

Technical field

[0001] The invention relates to the field of web page information processing, and in particular to a DOM tree-based page partitioning method, apparatus, device and storage medium.

Background

[0002] With the explosive growth and popularization of computers around the world, a large amount of data has been generated on the Internet, but because network information sources are heterogeneous, browsing and searching these huge data sets has become very difficult. For example, existing search engines have the following problems: the results retrieved by keyword are cluttered, and each linked webpage containing the keywords must be browsed separately to determine whether it meets the need; any page that contains the keywords is retrieved, which reduces the efficiency of information retrieval and hinders the user's acquisition of information; and while searching for keywords, the webpage is accompanied by a large amount of useless infor...


Application Information

IPC(8): G06F16/901, G06F16/906, G06F16/957, G06F16/958
CPC: G06F16/9027, G06F16/906, G06F16/9577, G06F16/958
Inventor 李子茂, 江如茜, 莫海芳, 刘晶, 帖军, 吴经龙, 余慧
Owner SOUTH CENTRAL UNIVERSITY FOR NATIONALITIES