Methods and equipment for generating and maintaining web content extraction template

A technology for webpage content extraction and extraction template generation, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problem that extraction templates fail or lose accuracy, and it achieves high accuracy and improved efficiency.

Inactive Publication Date: 2011-05-25
FUJITSU LTD

AI Technical Summary

Problems solved by technology

[0005] In addition, when web content is actually extracted using web content extraction templates, the problem of "template maintenance" is often encountered: the extraction template fails, or its accuracy decreases, because the web pages have changed.



Examples


Embodiment Construction

[0028] Embodiments of the present invention will be described below with reference to the drawings. It should be noted that representation and description of components and processes that are not related to the present invention and known to those of ordinary skill in the art are omitted from the drawings and descriptions for the purpose of clarity.

[0029] Figure 1 is a block diagram showing an exemplary structure of the device 100 for generating a webpage content extraction template according to the first embodiment of the present invention. The device 100 is described below with reference to Figure 1.

[0030] As shown in Figure 1, the device 100 includes an input unit 101, a weight calculation unit 102, a maximum alignment relationship calculation unit 103, a combination unit 104, a determination unit 105 and a selection unit 106.

[0031] The...


Abstract

The invention provides methods and equipment for generating and maintaining a web content extraction template. The equipment for generating the web content extraction template comprises an input unit, a weight calculation unit, a maximum alignment relationship calculation unit, a combination unit, a determination unit and a selection unit, wherein the weight calculation unit is configured to calculate weights of nodes of each type in each input tree. The equipment for maintaining the web content extraction template comprises a similarity calculation unit, a statistic calculation unit, a statistic judgment unit and a recalculation unit. The similarity calculation unit calculates a similarity sequence; the statistic calculation unit traverses the similarity sequence using a window of a predetermined size and calculates a statistic within the window; and the statistic judgment unit judges, according to the calculated statistic, whether the web content extraction template still fits the input web pages. With these methods and this equipment, the web content extraction template can be generated automatically and efficiently, and when a web page changes so that the extraction template becomes invalid or less accurate, the template can be regenerated automatically and quickly.
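The maintenance check described in the abstract can be illustrated with a minimal Python sketch. The abstract states only that a window of predetermined size traverses the similarity sequence and that a statistic is computed in each window; the choice of the mean as that statistic, the threshold value, and the function names `window_statistics` and `template_still_fits` are all assumptions made for illustration and are not taken from the patent.

```python
from statistics import mean
from typing import List


def window_statistics(similarities: List[float], window_size: int) -> List[float]:
    """Slide a fixed-size window over the similarity sequence.

    Returns one statistic per window position. The mean is an assumed
    choice; the source does not say which statistic is used.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    last_start = len(similarities) - window_size
    return [
        mean(similarities[start:start + window_size])
        for start in range(last_start + 1)
    ]


def template_still_fits(
    similarities: List[float],
    window_size: int = 3,      # hypothetical window size
    threshold: float = 0.6,    # hypothetical decision threshold
) -> bool:
    """Judge whether the template still fits the incoming pages.

    Returns False as soon as any window statistic falls below the
    threshold, which would be the signal to regenerate the template.
    A sequence shorter than the window yields no statistics and is
    treated as fitting.
    """
    stats = window_statistics(similarities, window_size)
    return all(value >= threshold for value in stats)


if __name__ == "__main__":
    # Similarity between each incoming page and the template (made-up values).
    stable_pages = [0.9, 0.8, 0.9, 0.85]
    changed_pages = [0.9, 0.9, 0.8, 0.3, 0.2, 0.1]
    print(template_still_fits(stable_pages))
    print(template_still_fits(changed_pages))
```

Judging on a windowed statistic rather than on individual similarity values means that a single unusual page does not by itself trigger regeneration, while a sustained drop does.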

Description

Technical field

[0001] The invention belongs to the field of Internet information processing, and in particular relates to a method and equipment for generating a template for extracting webpage content and a method and equipment for maintaining the template for extracting webpage content.

Background art

[0002] With the rapid development of the Internet, the amount of information on the Internet is increasing at an alarming rate every day. Web pages in a markup language format, such as Hypertext Markup Language (HTML), are the main information carriers. Most current web pages are dynamic web pages generated from databases and templates. In addition to the main text content, a web page usually also contains information irrelevant to the text, such as advertisements, navigation information, and copyright information.

[0003] In applications such as information search, information filtering, text classification, text clustering, and summarization, it is ...


Application Information

IPC(8): G06F17/30
Inventors: 夏迎炬, 吴科, 张姝, 于浩
Owner: FUJITSU LTD