A system and method for detecting the key content of a web page based on visual characteristics

A key content detection technology based on visual features, applied in the Internet field, which addresses problems such as poor identification of key content, lack of self-learning algorithms, and complex implementation.
CN109344733A · Inactive · Publication Date: 2019-02-15 · 中共中央办公厅电子科技学院 +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: 中共中央办公厅电子科技学院
Publication Date: 2019-02-15
Estimated Expiration: Not applicable · inactive patent


Abstract

The invention relates to a system and a method for detecting the key content of a web page based on visual characteristics. It collects a web page sample library, uses Chrome Headless software to dynamically render the HTML code, and analyzes the visual characteristics of the internal DOM controls and the properties of their sub-components to form a multi-dimensional feature vector. Decision trees, random forests, Bayesian analysis, logistic regression, support vector machines, and the K-nearest neighbor algorithm are then used for detection, outputting the probability that each component is key content and thereby achieving automatic extraction of the key content of the web page. The invention can automatically extract key content from unknown web pages and can be used by search engines to extract abstracts from large-scale collections of web pages.
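
The classification stage described in the abstract can be illustrated with a minimal sketch. It assumes the rendering stage has already produced one feature vector per DOM component. The four features (area ratio, text density, link density, relative font size), the sample values, and the function names are hypothetical and are not taken from the patent. Logistic regression, one of the listed methods, is written here in plain Python so the example stays self-contained.

```python
import math

# Hypothetical training samples (illustrative values, not from the patent).
# Each vector: [area_ratio, text_density, link_density, relative_font_size]
# Label 1 = key content, 0 = non-key content.
TRAIN = [
    ([0.55, 0.90, 0.05, 1.0], 1),  # article body
    ([0.40, 0.80, 0.10, 0.9], 1),  # article body
    ([0.50, 0.85, 0.02, 1.0], 1),  # article body
    ([0.05, 0.20, 0.90, 0.7], 0),  # navigation bar
    ([0.10, 0.10, 0.80, 0.6], 0),  # advertisement
    ([0.15, 0.50, 0.60, 0.7], 0),  # user comments
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, epochs=2000, lr=0.5):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def key_content_probability(model, x):
    """Return the estimated probability that a component is key content."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


if __name__ == "__main__":
    model = train_logistic(TRAIN)
    for name, vec in [("body-like", [0.50, 0.88, 0.04, 1.0]),
                      ("nav-like", [0.08, 0.15, 0.85, 0.65])]:
        print(name, round(key_content_probability(model, vec), 3))
```

In a real pipeline the vectors would come from the rendered page rather than a hand-written list, and any of the other listed classifiers could replace the logistic model as long as it returns a probability per component.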

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to an automatic detection system and method for key content of webpages based on visual features.

Background art

[0002] With the widespread application of the Internet, web pages have become an important carrier for users to obtain information. When search engines use web crawler software to crawl web pages, they need to analyze the key content, remove non-key content such as advertisements, navigation bars, and user comments, and provide users with a summary of the target web page. Meanwhile, as web design grows more complex and diverse and dynamic rendering of web pages becomes more widespread, much key content is added dynamically through JavaScript code, and the traditional key content detection method based on tag analysis of static HTML code has been unable to adapt to the increasingly ...
