A system and method for detecting the key content of a web page based on visual characteristics

A key content and visual feature technology, applied in the Internet field, that addresses problems such as poor identification of key content, the lack of self-learning algorithms, and overly complex implementation.

Inactive Publication Date: 2019-02-15
中共中央办公厅电子科技学院 +1

AI Technical Summary

Problems solved by technology

One existing method does not fully consider visual features during DOM feature extraction, is too complicated to implement, and uses an algorithm that lacks the ability to learn by itself.
Another existing method requires web page templates to be customized in advance; once the structure of a web page changes, the success rate of content extraction drops.
Most proposed solutions detect key content from the basic features of tags and the semantic features of content. However, with the emergence of technologies such as web page overlays and dynamic rendering, key content is hard to identify from tag features and semantic features alone.
Second, on the machine learning side, usually only one algorithm is selected for testing. With a single algorithm, whether machine learning or deep learning, it is difficult to improve overall accuracy.


Examples


Embodiment Construction

[0030] The solution of the present invention is realized as follows: first, the sample processing module obtains the DOM components from which features can be extracted; next, the feature extraction module obtains the features of all experimental samples, producing the feature table used by the machine learning module; the machine learning module then measures accuracy, selects the model with the highest accuracy, and tunes its parameters to obtain the best model; finally, the web crawler submits DOM component information, and the key content detection module returns the detection result.
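The model selection step in [0030] can be illustrated with a minimal sketch. It assumes a feature table of `(feature vector, label)` pairs and picks whichever candidate scores the highest accuracy on held-out samples. The two toy candidates (`train_nearest_neighbor`, `train_majority`), the helper names, and the data layout are hypothetical stand-ins chosen to keep the example dependency-free; they are not the patent's implementation, and the parameter tuning step is omitted.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (feature vector, label) where label 1 means "key content" (assumed encoding).
Sample = Tuple[Sequence[float], int]
Classifier = Callable[[Sequence[float]], int]
Trainer = Callable[[List[Sample]], Classifier]


def train_nearest_neighbor(train: List[Sample]) -> Classifier:
    """Toy 1-nearest-neighbor: predict the label of the closest training sample."""
    def predict(x: Sequence[float]) -> int:
        def squared_distance(sample: Sample) -> float:
            return sum((a - b) ** 2 for a, b in zip(sample[0], x))
        return min(train, key=squared_distance)[1]
    return predict


def train_majority(train: List[Sample]) -> Classifier:
    """Baseline: always predict the most common training label."""
    ones = sum(label for _, label in train)
    majority = 1 if ones * 2 >= len(train) else 0
    return lambda x: majority


def accuracy(model: Classifier, samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)


def select_best_model(
    trainers: Dict[str, Trainer],
    train: List[Sample],
    validation: List[Sample],
) -> Tuple[str, Classifier, Dict[str, float]]:
    """Train every candidate and return the one with the highest validation accuracy."""
    models = {name: trainer(train) for name, trainer in trainers.items()}
    scores = {name: accuracy(model, validation) for name, model in models.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, models[best], scores
```

In practice the candidates would be the algorithms the abstract names (decision trees, random forests, and so on) from a machine learning library; the selection loop stays the same.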

[0031] 1. The implementation process of the present invention is:

[0032] (1) Collect a web page sample library, use headless Chrome (chrome-headless) to dynamically render the HTML files, and manually label the key content in each web page to form the initial sample set of the scheme.

[0033] (2) Traverse the DOM components in the current page according to the dynamic rende...
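Step (1) can be sketched as follows. This is an assumption-laden illustration, not the patent's tooling: it uses Chrome's `--headless` and `--dump-dom` command-line flags to print the rendered DOM, the binary name `google-chrome` varies by platform, and the helper names `build_render_command` and `render_html` are hypothetical. Step (2) is truncated in the source, so it is not covered here.

```python
import subprocess
from typing import List


def build_render_command(url: str, chrome_binary: str = "google-chrome") -> List[str]:
    """Build a command line that asks headless Chrome to print the rendered DOM.

    The binary name is an assumption; it may be "chromium", "chrome", or a
    full path depending on the installation.
    """
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def render_html(url: str, chrome_binary: str = "google-chrome", timeout: int = 30) -> str:
    """Run headless Chrome and return the HTML after JavaScript has executed.

    Raises subprocess.CalledProcessError if Chrome exits with an error and
    subprocess.TimeoutExpired if rendering takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        build_render_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

The rendered HTML returned by `render_html` would then be stored alongside the manual key content labels to form the initial sample set.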


Abstract

The invention relates to a system and a method for detecting the key content of a web page based on visual characteristics. It collects a web page sample library, uses headless Chrome to dynamically render the HTML code, and analyzes the visual characteristics of the internal DOM components and the properties of their sub-components to form a multi-dimensional feature vector. Decision trees, random forests, Bayesian analysis, logistic regression, support vector machines, and the K-nearest neighbor algorithm are used for detection, outputting the probability that a component is key content and thereby achieving automatic extraction of the key content of the web page. The invention can automatically extract key content from unknown web pages and can be used by search engines to extract abstracts from large-scale collections of web pages.
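As a rough illustration of the feature vector and probability output described in the abstract, the sketch below turns one rendered DOM component into a small vector of visual features and scores it with a logistic function. The specific features, the `DomComponent` fields, and any weights passed in are hypothetical choices for illustration; the source does not list the actual feature set or trained parameters.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DomComponent:
    """Rendered geometry and simple child statistics of one DOM component (assumed fields)."""
    x: float
    y: float
    width: float
    height: float
    text_length: int
    child_count: int
    link_count: int


def visual_features(c: DomComponent, viewport_width: float, viewport_height: float) -> List[float]:
    """Build a feature vector from layout and sub-component properties."""
    area_ratio = (c.width * c.height) / (viewport_width * viewport_height)
    # Horizontal distance of the component's center from the page center, as a fraction of width.
    center_offset = abs((c.x + c.width / 2) / viewport_width - 0.5)
    top_ratio = c.y / viewport_height
    children = max(c.child_count, 1)  # avoid division by zero for leaf components
    link_density = c.link_count / children
    text_per_child = c.text_length / children
    return [area_ratio, center_offset, top_ratio, link_density, text_per_child]


def key_content_probability(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic-regression style score: probability that the component is key content."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With weights learned from labeled samples, a large, centered, text-heavy block should score higher than a thin, link-dense navigation bar; the weights themselves would come from training, not from hand tuning.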

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to an automatic detection system and method for the key content of web pages based on visual features.

Background

[0002] With the widespread application of the Internet, web pages have become an important carrier through which users obtain information. When search engines use web crawler software to crawl web pages, they need to analyze the key content, remove non-key content such as advertisements, navigation bars, and user comments, and provide users with a summary of the target web page. At the same time, as web design grows more complex and diverse and dynamic rendering becomes more widespread, much key content is added dynamically through JavaScript code, so the traditional key content detection method based on tag analysis of static HTML code has been unable to adapt to the increasingly ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/46, G06K9/62
CPC: G06V30/40, G06V10/40, G06F18/24
Inventor: 王志强, 马平川, 王兵, 张健毅, 张翼, 池亚平, 张南峰, 余泽峰, 纪曦, 王希文
Owner: 中共中央办公厅电子科技学院