A method and device for capturing effective webpage content

A web content capture technology and device, applied in website content management, network data retrieval, digital data information retrieval and related areas, which addresses the problems of long connection times, inability to extract text information, and slow speed.

Inactive Publication Date: 2011-12-07

北京迅捷英翔网络科技有限公司

4 Cites, 60 Cited by

AI Technical Summary

Problems solved by technology

However, as a record of information, an HTML web page contains a large number of tags used to structure information, and may also contain a lot of useless information.

Moreover, with the rapid development of mobile terminals, mobile terminals place ever higher demands on Internet access. When a mobile terminal accesses HTML pages directly, the performance limits of the device itself mean that each visit takes a long time to connect and runs slowly. The large amount of useless information also makes the data traffic large, so the time and cost for users to obtain a web page are high. Accurately and quickly extracting the useful information from HTML web pages has therefore become very important for mobile terminal equipment.

[0003] Current text information extraction technology can only obtain the content of a specific HTML tag by using HTML tag information. For each target web page, the HTML tag structure of the page must be examined in advance and an extraction template customized in advance.

For web pages whose HTML structure cannot be known in advance, text information cannot be extracted.
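The limitation described above can be illustrated with a minimal sketch of template-based extraction. This is not taken from the patent: the class `TemplateExtractor`, the function `extract_with_template`, and the `div` / `id="article"` template are hypothetical names chosen for illustration, and the sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects text inside the first element matching a fixed (tag, id) template."""

    def __init__(self, tag, elem_id):
        super().__init__()
        self.tag = tag
        self.elem_id = elem_id
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements with the same tag so the right closer ends the match.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("id") == self.elem_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_with_template(html, tag="div", elem_id="article"):
    """Hypothetical template: the text lives in <div id="article">."""
    parser = TemplateExtractor(tag, elem_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


known = '<html><body><div id="article"><p>Hello world.</p></div></body></html>'
unknown = '<html><body><table><tr><td class="main">Hello world.</td></tr></table></body></html>'

print(repr(extract_with_template(known)))    # the template matches this page
print(repr(extract_with_template(unknown)))  # the template finds nothing here
```

The second page carries the same text but in a structure the template does not anticipate, so the extractor returns an empty string. Each new page layout would need its own template prepared in advance.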

Embodiment Construction

[0040] The specific embodiments of the present invention will be described in detail below. It should be noted that the embodiments described here are for illustration only, and are not intended to limit the present invention.

[0041] The present invention starts from the overall structure of the web page whose effective content is to be extracted, and examines the position information of the various text entities in the web page, the unique result information and the tag information, so that text entities can be extracted from the web page automatically. This is possible because a web page file conforms to the tree structure of the HTML DOM (Document Object Model). A web page with effective content, such as a news web page, contains many types of tags, which can generally be divided logically into page function tags, advertisement tags, and news content tags. Web page information extraction means extracting the effective content from the web page, such as the content of the news content tags. The function...
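As a rough illustration of what "position information of text entities" in a DOM tree can look like, the sketch below records the tag path from the root for every piece of text in a page. It is an assumption about one possible representation, not the patent's own data structure; `TextPositionParser` and `text_positions` are hypothetical names, and only Python's standard `html.parser` is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPositionParser(HTMLParser):
    """Records (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.entities = []   # list of (path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.entities.append(("/".join(self.stack), text))


def text_positions(html):
    parser = TextPositionParser()
    parser.feed(html)
    parser.close()
    return parser.entities


page = ("<html><head><title>News</title></head><body><div>"
        "<h1>Headline</h1><p>Body text.<br>More.</p></div></body></html>")

for path, text in text_positions(page):
    print(path, "->", text)
```

Each text entity ends up paired with its location in the tree, for example `html/body/div/h1` for the headline, which is the kind of structural signal an extractor can use without a page-specific template.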

Abstract

A method for obtaining the effective contents of a web page comprises the steps of: loading an HTML web page; converting the HTML web page into a corresponding DOM tree; finding a title label of the effective contents according to the DOM tree, and determining the text contents in the found title label as the title of the effective contents; searching sequentially for text labels in the <body> label of the DOM tree, in order of label distance from short to long between the text labels and the title label; determining a text label that has a text length larger than a predetermined length and contains certain specific symbols related to the main text as the main text label; and then taking the text contents in the main text label as the main text of the effective contents. An apparatus corresponding to the method comprises corresponding modules.
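The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The abstract does not say how the title label is chosen, how label distance is measured, or which symbols count, so the sketch assumes: the title label is the first `<h1>` in `<body>` (falling back to `<title>`); label distance is the number of tree edges between two nodes; a label's text is the text directly inside it; and the "specific symbols" are common sentence punctuation. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
PUNCTUATION = set(".,;!?。，；！？")  # assumed "symbols related to the main text"


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text = ""  # text directly inside this element
        if parent is not None:
            parent.children.append(self)


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def walk(node):
    """Yields nodes in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def tag_distance(a, b):
    """Number of tree edges on the path between a and b (assumed metric)."""
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def extract_effective_content(html, min_length=20):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(walk(builder.root))
    body = next((n for n in nodes if n.tag == "body"), builder.root)
    in_body = list(walk(body))
    title_node = (next((n for n in in_body if n.tag == "h1"), None)
                  or next((n for n in nodes if n.tag == "title"), None))
    if title_node is None:
        return None, None
    title = " ".join("".join(n.text for n in walk(title_node)).split())
    candidates = [n for n in in_body
                  if n is not title_node and n.tag not in SKIP_TAGS]
    # Stable sort: labels at equal distance keep their document order.
    candidates.sort(key=lambda n: tag_distance(n, title_node))
    for node in candidates:
        text = " ".join(node.text.split())
        if len(text) > min_length and PUNCTUATION & set(text):
            return title, text
    return title, None


sample = """<html><head><title>Site</title></head><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="main"><h1>Big headline</h1>
<p class="meta">2011-12-07</p>
<p>This is the main text of the article, long enough to pass the threshold.</p>
</div>
<div id="footer"><p>Copyright notice, all rights reserved by the example site owner.</p></div>
</body></html>"""

title, main_text = extract_effective_content(sample)
print(title)
print(main_text)
```

In this sample the date line is rejected for being too short, and the footer paragraph, although long and punctuated, sits farther from the title in the tree than the article paragraph, so the article paragraph is chosen. No page-specific template is involved; the threshold `min_length=20` is an arbitrary value for the example.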

Description

Technical field

[0001] The invention relates to the field of Internet information processing, in particular to a method and device for capturing effective webpage content.

Background

[0002] At present, the Internet holds the largest information database known to mankind, and most of that information exists in web pages in HTML (HyperText Markup Language) format. HTML is used to structure information, such as headings, paragraphs, and lists, and to richly represent text, images, and other multimedia information. With a "browser", the tool for reading HTML, people can easily view the information held in the HTML structure. However, as a record of information, HTML web pages contain a large number of tags used to structure information, and may also contain a lot of useless information. Moreover, with the rapid development of mobile terminals, mobile terminals have higher and higher requirements ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F17/30896, G06F16/986

Inventor: 贾海禄

Owner: 北京迅捷英翔网络科技有限公司
