
A method and device for capturing effective webpage content

A technology for capturing webpage content, with a corresponding device, applied in website content management, network data retrieval, digital data information retrieval and related fields. It addresses problems such as long connection times, slow speed, and the inability to extract text information.

Inactive Publication Date: 2011-12-07
北京迅捷英翔网络科技有限公司
4 Cites, 60 Cited by

AI Technical Summary

Problems solved by technology

However, as a record of information, an HTML webpage contains a large number of tags that serve only to structure the information, and it may also contain a lot of useless information.
Moreover, with the rapid development of mobile terminals, the demand for accessing the Internet from them keeps growing. When a mobile terminal accesses HTML pages directly, the performance limits of the device make each connection slow to establish and each page slow to load, and the large amount of useless information inflates the data traffic, so the user pays a high cost in both time and money to obtain the webpage. Accurately and quickly extracting the useful information from HTML webpages has therefore become very important for mobile terminals.
[0003] Current text extraction techniques can only obtain the content of a specific HTML tag by using HTML tag information. For each target webpage, the HTML tag structure must be examined beforehand and an extraction template customized in advance.
For webpages whose HTML structure cannot be known in advance, text extraction is not possible.
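The limitation of template-based extraction can be illustrated with a minimal sketch. The class `TemplateExtractor`, the function `extract_with_template`, and the sample pages are hypothetical names and data chosen for illustration; they are not taken from the patent. The template here is simply a fixed (tag, id) pair, which works only when the page structure was inspected beforehand.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects the text inside any element matching a fixed (tag, id) template."""

    def __init__(self, tag, element_id):
        super().__init__()
        self.tag = tag
        self.element_id = element_id
        self.depth = 0      # nesting depth of self.tag inside a matched element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_with_template(html, tag, element_id):
    parser = TemplateExtractor(tag, element_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# A page whose structure was examined in advance, and one whose structure was not.
known_page = '<div id="nav">Home</div><div id="content"><p>Story text.</p></div>'
unknown_page = '<div id="nav">Home</div><div class="post"><p>Story text.</p></div>'

print(repr(extract_with_template(known_page, "div", "content")))
print(repr(extract_with_template(unknown_page, "div", "content")))
```

The same template returns the story for the first page and an empty string for the second, even though both carry the same text.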



Embodiment Construction

[0040] The specific embodiments of the present invention will be described in detail below. It should be noted that the embodiments described here are for illustration only, and are not intended to limit the present invention.

[0041] The present invention starts from the overall structure of the webpage whose effective content is to be extracted and examines the position information, the unique result information and the label information of the various text entities in the webpage, so that the text entities of the webpage can be extracted automatically. This is possible because a webpage file conforms to the tree structure of the HTML DOM (Document Object Model). A webpage with effective content, such as a news webpage, contains many kinds of tags, which can generally be divided logically into page function tags, advertisement tags, and news content tags. Webpage information extraction means extracting the effective content from the webpage, such as the content of the news content tags. The function...


Abstract

A method for obtaining the effective contents of a web page comprises the steps of: loading an HTML web page; converting the HTML web page into a corresponding DOM tree; finding a title label of the effective contents according to the DOM tree, and determining the text contents in the found title label as the title of the effective contents; searching sequentially for text labels in the <body> label of the DOM tree in order of label distance, from short to long, between each text label and the title label; determining a text label whose text length is larger than a predetermined length and which contains specific symbols related to the main text as the main text label; and then taking the text contents in the main text label as the main text of the effective contents. An apparatus corresponding to the method comprises corresponding modules.
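The steps in the abstract can be sketched roughly as follows, using only the Python standard library. Several choices here are assumptions made for illustration and are not stated in the source: the title label is taken to be the first `<h1>` in `<body>` (falling back to `<title>`), the "label distance" is taken to be the number of edges on the DOM-tree path between two labels, ancestors of the title label are skipped, and `TEXT_TAGS`, `min_length` and `symbols` are hypothetical parameters.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
TEXT_TAGS = {"p", "div", "td", "span"}   # assumption: tags treated as text labels
SKIPPED_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # child Node objects and raw text strings, in order

    def text(self):
        if self.tag in SKIPPED_TAGS:
            return ""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; it does not repair malformed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent          # climb to the nearest open element
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def iter_nodes(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_nodes(child)


def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain                        # node itself first, root last


def tree_distance(a, b):
    """Number of edges on the path between a and b (assumed 'label distance')."""
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def extract(html, min_length=40, symbols=",.，。"):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(iter_nodes(builder.root))
    body = next((n for n in nodes if n.tag == "body"), builder.root)
    title_node = (next((n for n in iter_nodes(body) if n.tag == "h1"), None)
                  or next((n for n in nodes if n.tag == "title"), None))
    if title_node is None:
        return None, None
    title = " ".join(title_node.text().split())
    excluded = {id(n) for n in ancestors(title_node)}   # title label and its ancestors
    candidates = [n for n in iter_nodes(body)
                  if n.tag in TEXT_TAGS and id(n) not in excluded]
    candidates.sort(key=lambda n: tree_distance(n, title_node))  # short to long
    for node in candidates:
        text = " ".join(node.text().split())
        if len(text) > min_length and any(s in text for s in symbols):
            return title, text
    return title, None


SAMPLE_HTML = """<html><head><title>Site - News</title></head>
<body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
<h1>City opens new bridge</h1>
<div class="ad">Buy now</div>
<div class="story"><p>The new bridge opened on Monday, and traffic flowed smoothly across it. Officials said the project finished on time.</p></div>
</div>
<div id="footer">Copyright, all rights reserved.</div>
</body></html>"""

title, main_text = extract(SAMPLE_HTML)
print(title)
print(main_text)
```

In this sample the advertisement and navigation blocks are rejected by the length test, so the nearest label that is long enough and contains punctuation is the story block. A production implementation would need a tolerant HTML parser and the distance definition given in the full patent text.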

Description

Technical field

[0001] The invention relates to the field of Internet information processing, in particular to a method and device for capturing effective webpage content.

Background technique

[0002] At present, the Internet holds the largest information database known to mankind, and most of that information exists in web pages in HTML (Hyper Text Mark-up Language, hypertext markup language) format. HTML is used to structure information, such as headings, paragraphs, and lists, and to richly represent text, images, and other multimedia information. With an HTML reading tool, the "browser", people can easily view the information held in the HTML structure. However, as a record of information, an HTML web page contains a large number of tags that serve only to structure the information, and it may also contain a lot of useless information. Moreover, with the vigorous development of various mobile terminals, mobile terminals have higher and higher requirements ...


Application Information

Patent Type & Authority Applications (China)
IPC IPC(8): G06F17/30
CPC G06F17/30896, G06F16/986
Inventor 贾海禄
Owner 北京迅捷英翔网络科技有限公司