Web page information block extracting method and apparatus

An extraction method and information block technology, applied to extracting cohesive regions in web pages, that addresses problems such as the lack of versatility in existing approaches.

Status: Inactive
Publication Date: 2006-04-26
FUJITSU LTD +1
0 Cites, 32 Cited by

AI Technical Summary

Problems solved by technology

(Shian-Hua Lin 2002) method for detecting information content blocks in...




Embodiment Construction

[0027] figure 1

[0028] figure 2

[0029] .

[0030] In the repeated pattern discovery unit 203, a suffix tree of the HTML tag token stream is constructed, and all repeated patterns and corresponding occurrences are retrieved from the suffix tree.
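
The repeated pattern discovery step described in [0030] can be illustrated with a minimal sketch. It is not the patent's implementation: for brevity it builds an uncompressed suffix trie over the tag tokens rather than a compact suffix tree, and the names `tag_tokens`, `build_suffix_trie`, `repeated_patterns`, and the `min_len` parameter are hypothetical, chosen only for this example. A pattern is reported when at least two suffixes of the token stream pass through the same trie node.

```python
import re

# Assumption: a tag token is the lowercased tag name, prefixed with "/" for
# closing tags. Attributes and text content are ignored.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_tokens(html):
    """Reduce an HTML string to its stream of tag tokens."""
    return [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]


class Node:
    def __init__(self):
        self.children = {}  # token -> child Node
        self.starts = []    # start indices of the suffixes passing through


def build_suffix_trie(tokens):
    """Insert every suffix of the token stream into a trie (O(n^2) space)."""
    root = Node()
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node.children.setdefault(tok, Node())
            node.starts.append(start)
    return root


def repeated_patterns(tokens, min_len=2):
    """Map each token pattern occurring at least twice to its start indices."""
    root = build_suffix_trie(tokens)
    found = {}
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if len(path) >= min_len and len(node.starts) >= 2:
            found[path] = sorted(node.starts)
        for tok, child in node.children.items():
            # Only branches shared by two or more suffixes can repeat.
            if len(child.starts) >= 2:
                stack.append((child, path + (tok,)))
    return found
```

For a list with two structurally identical items, such as `<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li></ul>`, the sketch reports the item pattern `li, a, /a, /li` together with both positions at which it starts in the token stream. A compact suffix tree yields the same patterns in linear space; the trie is used here only to keep the example short.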

[0031] An example suffix tree with an input token stream and six token suffixes is shown in Figure 4. The suffix tree for a token stream is defined by ∑, C, E, N, S, φ, and a partial order on the nodes:

[0032] ∑ is the input token alphabet.

[0033] C is the input token sequence. Each token c ∈ C is an element of ∑.

[0034] E is the set of arcs in the suffix tree. Each arc e∈E in the suffix tree represents a token in ∑.

[0035] N is the set of internal nodes within the suffix tree.

[0036] S is the set of leaf nodes.

[0037] φ represents the dummy suffix tree root.

[0038] The partial order on N ∪ S is defined as follows: two nodes n₁ and n₂ are related if and only if n₂ is a node in the sub-suffix tree rooted at node n₁.

[0039] If two n...



Abstract

The invention provides an information extraction device and method characterized by the following steps: generating a structure information block tree; classifying and combining the structure information blocks of the web page; and producing the result. Automatic repeated-pattern discovery and semantic classification and combination form the basis of the method and device, which extends web page processing to the level of information blocks and simplifies operations on web pages.

Description

Technical field

[0001] The present invention relates to methods and devices for extracting cohesive regions in web pages. The method and device divide a web page into information blocks according to content and function, refining the processing granularity from the whole page to the information blocks within it, which makes the web page easier for machines to process.

Background art

[0002] Recently, for business purposes, the content and structure of web pages have become increasingly complex in order to be accessible and user-friendly. A web page is usually a loose combination of different themes and functions that together form a whole. Humans can easily identify the areas of a web page that have different meanings and functions, but this is very difficult for automated processing systems because HTML is designed for display, not for content description. Now, most of the existing web IR (Information Retrieval), IE (Info...


Application Information

IPC(8): G06F17/30
Inventors: 王俊, 王继成, 武港山, 津田宏
Owner: FUJITSU LTD