
A Web Page Segmentation (Blocking) Method Based on the Semantic Structure of Web Pages

A web page segmentation and semantic structure technology, applied in the field of web page editing, that addresses the high time consumption, complex visual features, and poor performance of existing algorithms, and improves recognition accuracy.

Active Publication Date: 2019-12-17
CHINASO INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0008] However, the visual features used in the above-mentioned prior art are complex, so obtaining reliable visual feature information is a major difficulty. In addition, the VIPS algorithm needs to calculate and save the visual information of all nodes in the DOM tree, which makes it expensive in both time and memory, so its performance is poor when processing web pages with a large number of nodes.



Examples


Embodiment

[0038] This embodiment provides a webpage segmentation method based on the semantic structure of the webpage. The specific process is shown in Figure 1:

[0039] 1) Preprocessing. The obtained webpage HTML source code is preprocessed: whitespace characters are compressed, webpage tags are uniformly converted to lowercase, < symbols that do not belong to a tag are converted into entities, tag content that needs to be filtered is processed, and the webpage character set is recognized and converted. A DOM syntax tree is then established;

[0040] 2) Physical block type identification. First, the number of atomic tags of each node in the DOM syntax tree is calculated, and then the physical block type is identified;

[0041] 3) Physical block fusion. Low-quality blocks among the identified physical blocks are merged, including block text content, empty HTML tags, href addresses in a tags that link to other websites, advertising links, etc., all of which are of low qu...
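Steps 1) and 2) can be illustrated with a minimal Python sketch. It is not the patented implementation. The source does not say which tags count as atomic or which tags are filtered, so ATOMIC_TAGS, FILTERED_TAGS, and VOID_TAGS below are hypothetical choices, as are all function and class names. Character set recognition, block type identification, and block fusion are left out because the text above does not give their criteria.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets: the source does not enumerate them.
ATOMIC_TAGS = {"a", "img", "input", "button", "br", "span"}
FILTERED_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


def preprocess(html):
    """Drop filtered elements, escape stray '<', and compress whitespace."""
    for tag in FILTERED_TAGS:
        html = re.sub(rf"(?is)<{tag}\b.*?</{tag}\s*>", "", html)
    # A '<' that does not open a tag, closing tag, comment, or doctype
    # is treated as text and converted to an entity.
    html = re.sub(r"<(?![A-Za-z/!])", "&lt;", html)
    return re.sub(r"\s+", " ", html).strip()


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.atomic_count = 0  # atomic tags among this node's descendants


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; HTMLParser lowercases tag names."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def count_atomic(node):
    """Store the descendant atomic-tag count on each node.

    Returns that count plus one if the node itself is atomic."""
    node.atomic_count = sum(count_atomic(child) for child in node.children)
    return node.atomic_count + (1 if node.tag in ATOMIC_TAGS else 0)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(preprocess(html))
    builder.close()
    count_atomic(builder.root)
    return builder.root
```

Calling build_tree on a page's source returns the root node, and each node's atomic_count could then serve as one input to the block type identification in step 2). A regular-expression filter is used here only for brevity; a production preprocessor would more likely filter elements during parsing.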


Abstract

The invention provides a webpage segmentation method based on the semantic structure of webpages, and relates to the field of webpage editing. The method comprises the following steps: S1, the obtained webpage HTML source code is preprocessed and a DOM syntax tree is established; S2, physical block identification and integration are performed on the DOM tree; S3, webpage identification and monitoring are performed on the basis of the physical block types; and S4, the segmented webpage is output. With this method, the webpage type and the importance of each webpage block can be identified more accurately, and advertisement blocks and blocks with lower weights can be filtered out conveniently. The original webpage can be conveniently re-typeset and structured data can be output. Because webpage blocks are segmented according to the type of webpage, the accuracy of content extraction is improved.

Description

Technical field

[0001] The invention relates to the field of webpage editing, in particular to a method for segmenting webpages based on the semantic structure of webpages.

Background

[0002] To meet the needs of mobile phone users who browse Internet webpages, and to convert WWW webpage content into pages that can be easily browsed on mobile terminals, we propose a webpage segmentation method based on the semantic structure of webpages. The webpage is first divided into multiple blocks, and then the optimal block among them is presented to mobile users. Currently, the main solution in this application field is Vision-based Page Segmentation (VIPS).

[0003] VIPS utilizes layout features such as fonts, colors, and sizes. It represents the entire web page as an HTML DOM tree according to certain semantic association rules, and then separates the blocks corresponding to the nodes in the web page through horizontal...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/957, G06F16/951
Inventors: 肖碧松, 赵芳芳
Owner: CHINASO INFORMATION TECH