Method and device for extracting web page text

A web page text extraction technology, applied in the network field, that addresses two problems: web page text cannot be extracted in a timely and accurate manner, and users cannot maintain template rules in a timely and accurate manner.

Active Publication Date: 2009-04-15
NEW FOUNDER HLDG DEV LLC +1
0 Cites, 28 Cited by

AI Technical Summary

Problems solved by technology

[0005] The canonical extraction method requires considerable effort to maintain the template extraction rules for the pages of major websites. Because there are too many page templates on the Internet and they are updated frequently, users cannot maintain these template rules in a timely and accurate manner, and therefore cannot extract the text of these web pages promptly and accurately.



Embodiment Construction

[0021] The invention provides a method for extracting the text of a web page. The web page is divided into several page segments, and the weight of each page segment is calculated from its contents, such as non-link characters, link characters, pictures, attachments, and advertisements; the more popular the content of a page segment, the greater its weight. The page segment with the largest weight is then extracted as the web page text. Content reprinted in this way attracts a higher click rate, which helps raise the popularity of the reprinting website. With this method, even if the template of a web page changes, the text of popular web pages can still be extracted quickly and accurately, and the diversity of the website's content is maintained.
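The weighting idea in paragraph [0021] can be illustrated with a minimal Python sketch. The text names the inputs (non-link characters, link characters, pictures, attachments, advertisements) but gives no formula here, so the linear score and every coefficient below are assumptions chosen only for illustration; `SegmentStats`, `segment_weight`, and `pick_body` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Content counts for one page segment (hypothetical structure)."""
    non_link_chars: int
    link_chars: int
    images: int
    attachments: int
    ads: int


def segment_weight(s: SegmentStats) -> float:
    # ASSUMPTION: a linear score with made-up coefficients. The source only
    # says the weight depends on these content types, not how.
    return (
        1.0 * s.non_link_chars   # plain text suggests body content
        - 0.5 * s.link_chars     # link-heavy segments look like navigation
        + 20.0 * s.images
        + 10.0 * s.attachments
        - 50.0 * s.ads           # advertisements count against a segment
    )


def pick_body(segments: dict[str, SegmentStats]) -> str:
    """Return the name of the segment with the largest weight."""
    return max(segments, key=lambda name: segment_weight(segments[name]))


segments = {
    "nav": SegmentStats(40, 400, 0, 0, 0),
    "article": SegmentStats(1200, 60, 2, 0, 0),
    "sidebar": SegmentStats(100, 50, 1, 0, 3),
}
best = pick_body(segments)
```

Because the score is computed from segment content rather than from a site-specific template, a template change alters the segment boundaries but not the scoring rule.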

[0022] The technical solution in the present invention will be clearly and completely described below in conju...


Abstract

The invention discloses a method and a device for extracting the text of a web page, and relates to the technical field of networks. The method and the device extract the text of a web page rapidly and accurately.

The method comprises the following steps: obtaining a start tag and an end tag of a page segment; determining a start position and an end position of the page segment according to the start tag and the end tag; computing a weight value of the page segment; and extracting the page segment with the maximum weight value from the web page as the text of the web page.

The device comprises an acquisition module for acquiring the start tag and the end tag of the page segment; a segmentation module for determining the start position and the end position of the page segment according to the start tag and the end tag acquired by the acquisition module; a computation module for computing the weight value of the page segment determined by the segmentation module; and an extraction module for extracting, from the web page, the page segment with the maximum weight value computed by the computation module as the text of the web page.

The technical solution provided by the invention can be widely applied to network systems and devices for reproducing content.
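The four steps of the abstract (obtain tags, locate segment positions, compute weights, extract the maximum) can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: it assumes segments are non-nested spans between a literal start tag and end tag, and it uses a stand-in weight (non-link characters minus link characters) that the source does not specify. All function names are hypothetical.

```python
import re

# Link text and generic tags, matched with simple regexes (an assumption;
# a real crawler would more likely use an HTML parser).
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def find_segments(html: str, start_tag: str, end_tag: str) -> list[tuple[int, int]]:
    """Return (start, end) positions of each non-nested start_tag ... end_tag span."""
    spans = []
    pos = 0
    while True:
        start = html.find(start_tag, pos)
        if start == -1:
            break
        end = html.find(end_tag, start + len(start_tag))
        if end == -1:
            break
        end += len(end_tag)
        spans.append((start, end))
        pos = end
    return spans


def text_len(fragment: str) -> int:
    """Count visible characters: strip tags, then strip whitespace."""
    return len(re.sub(r"\s+", "", TAG_RE.sub("", fragment)))


def weight(fragment: str) -> int:
    # ASSUMPTION: stand-in weight, non-link characters minus link characters.
    link_chars = sum(text_len(inner) for inner in LINK_RE.findall(fragment))
    non_link_chars = text_len(fragment) - link_chars
    return non_link_chars - link_chars


def extract_body(html: str, start_tag: str = "<div", end_tag: str = "</div>") -> str:
    """Return the segment with the maximum weight, or '' if none is found."""
    spans = find_segments(html, start_tag, end_tag)
    if not spans:
        return ""
    start, end = max(spans, key=lambda se: weight(html[se[0]:se[1]]))
    return html[start:end]


page = (
    '<div id="nav"><a href="/">Home</a><a href="/n">News</a></div>'
    '<div id="main"><p>This is the article body with plenty of plain text.</p>'
    '<a href="/m">more</a></div>'
)
body = extract_body(page)
```

In this example the navigation segment consists entirely of link text and receives a negative weight, so the `main` segment is selected.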

Description

Technical field

[0001] The invention relates to the field of network technology, in particular to a method and device for extracting web page text.

Background technique

[0002] With increasingly fierce competition in the market, a major website that relies only on its own content will appear monotonous, which neither raises its click-through rate nor increases its popularity. To increase the click-through rate, a website needs to diversify its content and add more hot topics, which has given rise to reprinting web page content. Manual reprinting is slow to update, inefficient, and consumes a lot of manpower and money. Crawler software has therefore become the leading tool for reprinting web page content, because it can extract the text of web pages quickly and accurately.

[0003] At present, the methods for crawling software to extract web...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 张海涛
Owner: NEW FOUNDER HLDG DEV LLC