Method and device for extracting webpage content

A technology for extracting content from webpages, applied in the field of webpage content extraction devices

Inactive Publication Date: 2009-08-26
RICOH KK
3 Cites, 19 Cited by

AI Technical Summary

Problems solved by technology

Both the digital document analysis (DDA) method and the document image recognition (DIR) method have their own limitations.




Embodiment Construction

[0035] Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numerals refer to like elements throughout.

[0036] Figure 1 is a block diagram showing an exemplary structure of the webpage content extraction device 100 according to the embodiment of the present invention. According to an exemplary embodiment of the present invention, the webpage content extraction device 100 includes an input unit 110, a DDA webpage content extraction unit 120, a webpage-to-image conversion unit 130, a DIR webpage content extraction unit 140, and a DDA and DIR extraction result fusion unit 150. The input unit 110 is used to input webpages. In an exemplary embodiment of the present invention, the input webpage may be, for example, a webpage file in Hypertext Markup Language (HTML) format. The DDA webpage content extraction unit 120 performs webpage content extraction processing based ...
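The arrangement of units 110 to 150 can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `Block`, `dda_extract`, `page_to_image`, `dir_extract`, `fuse`, and `extract_webpage_content` are hypothetical, every function body is a trivial placeholder, and the assumption that the DIR unit consumes the output of the webpage-to-image conversion unit is inferred from the unit names rather than stated in this excerpt. The fusion rule shown (plain concatenation) is likewise only a stand-in, since the excerpt does not describe how the two results are merged.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A hypothetical extracted content region and the method that produced it."""
    text: str
    source: str  # "DDA" or "DIR"


def dda_extract(html: str) -> List[Block]:
    # Stand-in for unit 120: treats each non-empty line of markup as one block.
    # A real DDA unit would analyze the digital document structure instead.
    return [Block(line.strip(), "DDA") for line in html.splitlines() if line.strip()]


def page_to_image(html: str) -> bytes:
    # Stand-in for unit 130: a real unit would render the page to pixels.
    # Here the markup is only encoded so the data flow can be followed.
    return html.encode("utf-8")


def dir_extract(image: bytes) -> List[Block]:
    # Stand-in for unit 140: a real unit would run image-based recognition.
    text = image.decode("utf-8").strip()
    return [Block(text, "DIR")] if text else []


def fuse(dda_result: List[Block], dir_result: List[Block]) -> List[Block]:
    # Stand-in for unit 150: concatenation only (assumed, not from the source).
    return list(dda_result) + list(dir_result)


def extract_webpage_content(html: str) -> List[Block]:
    # Input (unit 110) -> DDA branch and image/DIR branch -> fusion.
    dda_result = dda_extract(html)
    dir_result = dir_extract(page_to_image(html))
    return fuse(dda_result, dir_result)
```

Calling `extract_webpage_content` on a small HTML string returns the DDA blocks followed by the DIR blocks, which makes the two parallel extraction paths and the single fusion step easy to see.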



Abstract

The invention provides a method and a device for extracting webpage content. The method comprises the following steps: extracting the webpage content of an input webpage based on a digital document analysis (DDA) method to generate a DDA extraction result; extracting the webpage content of the input webpage based on a document image recognition (DIR) method to generate a DIR extraction result; and merging the DDA extraction result and the DIR extraction result to generate a merged result. The method and the device can achieve better webpage extraction results than the prior art.

Description

Technical field

[0001] The present invention relates to webpage processing, and more specifically to a device and method for extracting webpage content.

Background art

[0002] Nowadays, the Internet has become the largest source of information, and people's daily lives are increasingly dependent on it. With the popularity of the Internet, webpage content extraction (also known as webpage segmentation) is being applied more and more widely.

[0003] For example, webpage content extraction can make webpage search faster and its results more accurate. Compared with traditional text documents, the content of webpages is more diverse, and different areas of the same webpage can cover different topics. Moreover, to meet the needs of browsing and publishing, webpages often contain a lot of content unrelated to their main subject, such as advertisements, navigation bars, decorations, copyright information, ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 杜成
Owner: RICOH KK