Method and device for extracting webpage content

A webpage content extraction method, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the problems of missing content and included noise in extracted content, with the effects of reducing complexity, overcoming uncertainty, and improving accuracy.

Inactive Publication Date: 2011-01-05
FUJITSU LTD

AI Technical Summary

Problems solved by technology

[0005] Traditional methods of extracting the body text of a web page do consider layout information, but the layout they use is a pseudo-layout derived from the DOM tree. The order of the nodes in the DOM tree can differ considerably from the displayed layout, so the layout obtained by the traditional method is only a rough division of the web page. This often causes part of the content to be missing from the extraction and noise to be included.



Embodiment Construction

[0023] Hereinafter, exemplary embodiments of the present invention will be described with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in the specification. However, it should be understood that many implementation-specific decisions must be made during the development of any such actual implementation in order to achieve the developer's specific goals, for example, compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. In addition, it should also be understood that although such development work may be very complicated and time-consuming, it is only a routine task for those skilled in the art who benefit from the present disclosure.

[0024] Here, it should be noted that, in order to avoid obscuring the present invention due to unnecessary details, only the device structure and / or proc...



Abstract

The invention discloses a method and a device for extracting webpage content. The webpage content extraction method comprises the following steps: performing visual layout analysis on an image of the webpage to divide it into at least one layout block; performing optical character recognition on each layout block to generate the recognized text of that block; parsing the webpage to build its document object model (DOM) tree; mapping each text node in the DOM tree to one of the layout blocks by using the correspondence between the real text of the text node and the recognized text of the layout blocks; and extracting the body text of the webpage by using at least the position information of the layout blocks within the webpage. The invention combines image layout analysis with natural language processing techniques to form a fully automatic, efficient and accurate webpage content extraction scheme.
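The mapping and extraction steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that layout analysis and OCR have already run and produced a list of blocks, and every name in it (`LayoutBlock`, `contained_fraction`, `pick_body_block`, `extract_body`) is hypothetical. The similarity measure (longest common substring from the standard `difflib` module) and the position heuristic (block area weighted by horizontal centrality) are stand-ins chosen for illustration; the source does not specify either.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class LayoutBlock:
    """Hypothetical stand-in for one block found by visual layout analysis."""
    name: str
    box: tuple      # (left, top, right, bottom) in page pixels
    ocr_text: str   # recognized text; may contain OCR errors
    nodes: list = field(default_factory=list)  # DOM text mapped to this block


def contained_fraction(node_text, block_text):
    """Share of node_text covered by its longest run that also occurs in block_text."""
    a, b = node_text.lower(), block_text.lower()
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / len(a)


def map_nodes_to_blocks(text_nodes, blocks):
    """Assign each DOM text node to the block whose OCR text matches it best."""
    for text in text_nodes:
        best = max(blocks, key=lambda blk: contained_fraction(text, blk.ocr_text))
        best.nodes.append(text)


def pick_body_block(blocks, page_width):
    """Assumed heuristic: prefer large blocks that sit near the horizontal center."""
    half = page_width / 2

    def score(blk):
        left, top, right, bottom = blk.box
        area = (right - left) * (bottom - top)
        centrality = 1 - abs((left + right) / 2 - half) / half
        return area * centrality

    return max(blocks, key=score)


def extract_body(text_nodes, blocks, page_width):
    """Map DOM text to blocks, then return the real text of the chosen body block."""
    map_nodes_to_blocks(text_nodes, blocks)
    return pick_body_block(blocks, page_width).nodes


# Illustrative data only: three blocks and five DOM text nodes in DOM order.
blocks = [
    LayoutBlock("nav", (0, 0, 1000, 60), "Home News Sports Contact"),
    LayoutBlock("main", (200, 80, 800, 900),
                "Fujitsu announces new web extraction rnethod. "
                "The method uses layout analysis."),
    LayoutBlock("footer", (0, 920, 1000, 980), "Copyright 2010 All rights reserved"),
]
dom_text_nodes = [
    "Home",
    "News",
    "Copyright 2010",
    "Fujitsu announces new web extraction method.",
    "The method uses layout analysis.",
]

body = extract_body(dom_text_nodes, blocks, page_width=1000)
print(body)
```

In the sample data the copyright node precedes the article text in DOM order, yet it is still assigned to the footer block because assignment depends on text correspondence rather than node order. The returned text comes from the DOM nodes, so the simulated OCR error ("rnethod") does not leak into the output.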

Description

Technical field

[0001] The present invention relates to the fields of Internet information processing and image processing, and in particular to a method and device for extracting webpage content based on visual layout analysis. It uses document layout analysis techniques from image processing, together with statistical techniques from natural language processing, to automatically extract the body content of a web page.

Background art

[0002] The Internet has become one of the main sources of information for existing information systems. Because the Internet is open and users publish freely, web pages contain a great deal of valuable information but also a great deal of noise that has nothing to do with the body text, such as navigation, copyright notices, and advertising. This noise makes it impossible to guarantee the data quality of subsequent information services. Extracting valuable content from web pages to ensure ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 付雷, 孟遥, 孙俊, 于浩
Owner: FUJITSU LTD