Method and device for analyzing internet web page contents

A webpage content analysis method, applied in the field of Internet webpage content analysis, that addresses problems such as poor adaptability, lost information, and spam information. It reduces the interference of spam information and improves the accuracy and precision of webpage parsing.

Active Publication Date: 2010-12-15
新岸线(北京)科技集团有限公司
4 cites, 21 cited by

AI Technical Summary

Problems solved by technology

However, web pages are laid out in many different ways, and it is impossible to enumerate them exhaustively.
Existing methods therefore adapt poorly in practice: a method may work for some web pages but not for others, so the final parsing result may contain spam information or lose genuinely useful information.

Examples

Embodiment Construction

[0028] To address the defects of the prior art, the present invention provides an Internet webpage content parsing method that can analyze and process webpages in a targeted way for each website, or even for the different channel pages within a website. It can automatically determine whether a webpage is generated by a template and can automatically generate a template corresponding to the webpage, so that the most suitable template is used to parse it. The invention overcomes the disadvantages of current methods and parses only the real content portion of the webpage, thereby reducing the interference of junk information, improving the accuracy and precision of webpage parsing, and greatly improving the overall parsing effect.

[0029] Referring to Figure 1, an Internet webpage content parsing method provided by an embodiment of the present invention comprises the following steps:

[0030] S11: determining whether the webpage to be parsed is generated by ...

Abstract

The invention discloses a method for analyzing Internet web page content, comprising the following steps: determining whether a web page to be parsed is generated by a template; if the web page is generated by a template and a template matching it exists in a web page template library, parsing the content of the web page with that template; otherwise, generating a web page template corresponding to the web page, adding the generated template to the web page template library, and parsing the web page with it. The invention also provides a corresponding device. The invention can analyze and process web pages in a targeted way for each website, and even for the different channel pages of each website. It can automatically determine whether a web page is generated by a template and automatically generate a corresponding template, so that the most suitable template is used to parse the page. Because only the real content portion of the web page is parsed, interference from junk information is reduced, the accuracy and precision of web page parsing improve, and the overall parsing effect is markedly enhanced.
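
The control flow in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the excerpt does not say how template detection, matching, or generation are performed, so this sketch assumes that a page's "signature" is the set of tag paths that carry text, that two pages share a template when their signatures are equal, and that the path holding the most text is the real content. The names Template, generate_template, find_template, and parse_page are hypothetical. The "is this page template-generated?" test is folded into the library lookup, since both negative outcomes lead to the same branch (generate a new template).

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped when tracking paths.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Records the text found under each tag path, e.g. 'html/body/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.text_by_path.setdefault(path, []).append(text)


def collect(html):
    """Return {tag_path: joined_text} for one page."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return {p: " ".join(t) for p, t in collector.text_by_path.items()}


@dataclass(frozen=True)
class Template:
    signature: frozenset  # assumed: set of text-bearing tag paths
    content_path: str     # assumed: path that holds the real content


def generate_template(html):
    texts = collect(html)
    # Assumption: the path carrying the most text is the real content.
    content_path = max(texts, key=lambda p: len(texts[p]), default="")
    return Template(frozenset(texts), content_path)


def find_template(html, library):
    signature = frozenset(collect(html))
    for template in library:
        if template.signature == signature:
            return template
    return None


def parse_page(html, library):
    """Use a matching template if one exists; otherwise generate and store one."""
    template = find_template(html, library)
    if template is None:
        template = generate_template(html)
        library.append(template)
    return collect(html).get(template.content_path, "")
```

With this sketch, the first page from a given layout adds one template to the library, and later pages with the same layout reuse it. Only the text under the template's content path is returned, so navigation text and similar noise is left out.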

Description

Technical field

[0001] The invention relates to the technical fields of communication and the Internet, and in particular to a method and device for analyzing Internet web page content.

Background

[0002] In recent years, with the growing popularity of the Internet, increases in bandwidth, and the maturing of service models, search engines have gradually become a mainstream Internet application. Technically, an Internet search engine generally consists of two parts: an offline processing part and an online processing part. The offline processing part mainly includes functional modules such as web page crawling, web page parsing, and indexing. The online processing part works as follows: according to the query words submitted by the user, it looks up the corresponding documents (that is, web pages) in the index and data generated by the offline processing part, sorts the retrieved documents according to some ranking metric, and finally returns the sorted result...
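
The offline/online split described in the background can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the source does not specify the index structure or the ranking metric, so this example assumes a simple inverted index and ranks documents by summed term counts. The function names build_index and search are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Offline step: map each word to {doc_id: occurrence_count}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Online step: look up query words, score documents, return sorted ids."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Assumed ranking metric: higher summed term count first, ties by id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In this arrangement, the quality of the parsed text fed to build_index directly affects what search can return, which is why junk text left in by the parsing stage degrades results.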

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 郑清芳, 章动, 鲍东山
Owner: 新岸线(北京)科技集团有限公司