A Method of Segmenting Web Pages Based on the Semantic Structure of Web Pages
A web page segmentation and semantic structure technology, applied in the field of web page editing, that addresses the high time consumption of existing algorithms, complex visual features, and low web page performance, and improves the accuracy of block recognition.
Embodiment
[0038] This embodiment provides a web page segmentation method based on the semantic structure of the web page. The specific process is shown in Figure 1:
[0039] 1) Preprocessing. The obtained HTML source code of the web page is preprocessed: whitespace characters are compressed, tags are uniformly converted to lowercase, < symbols that are not part of a tag are converted into entities, tag content that needs to be filtered is processed, and the page's character set is recognized and converted. A DOM syntax tree is then built;
[0040] 2) Identifying the physical block types of the web page. First, the number of atomic tags is calculated for each node in the DOM syntax tree; the physical block type is then identified;
[0041] 3) Physical block fusion. Low-quality blocks among the identified physical blocks are fused. These include block text content, empty html tags, href addresses in a tags that link to other websites, advertising links, etc., all of which are of low qu...
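Steps 1) and 2) above can be illustrated with a minimal Python sketch that uses only the standard library. It is not the patented implementation. The excerpt does not define which tags count as "atomic", so the ATOMIC_TAGS set is an assumption, as is the choice to count a node itself when it is atomic. The names preprocess, TreeBuilder, count_atomic and build_tree are hypothetical. Filtering of tag content, character set conversion, block type identification and step 3) are not covered.

```python
import re
from html.parser import HTMLParser

# Assumption: the source does not list which tags are "atomic";
# this set of inline/leaf tags is purely illustrative.
ATOMIC_TAGS = {"a", "img", "span", "b", "i", "em", "strong", "input", "br"}
# Void elements never have children or a closing tag.
VOID_TAGS = {"img", "br", "input", "hr", "meta", "link"}


def preprocess(html):
    """Partial sketch of step 1: escape stray '<' and compress whitespace."""
    # A '<' not followed by a letter, '/', '!' or '?' cannot start a tag.
    html = re.sub(r"<(?![A-Za-z/!?])", "&lt;", html)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", html).strip()


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.atomic_count = 0


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; HTMLParser lowercases tag names."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if any;
        # unmatched closing tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def count_atomic(node):
    """Sketch of the first half of step 2: atomic tags per subtree.

    Assumption: a node that is itself atomic is included in its own count.
    """
    total = sum(count_atomic(child) for child in node.children)
    if node.tag in ATOMIC_TAGS:
        total += 1
    node.atomic_count = total
    return total


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(preprocess(html))
    builder.close()
    count_atomic(builder.root)
    return builder.root
```

Calling build_tree on a page's HTML returns the root node, and each node's atomic_count can then feed whatever rule is used to decide the physical block type.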


