
A method and system for extracting object attribute value information from web pages

A technique for extracting object attributes and attribute values, applied in the fields of instruments, computing, and electrical digital data processing. It addresses two problems: reliance on domain knowledge, which is difficult and expensive to obtain, and the inability of existing methods to extract attribute hierarchies. Its stated effect is to minimize the use of domain knowledge.

Active Publication Date: 2015-08-05
RICOH KK

AI Technical Summary

Problems solved by technology

First, existing methods rely on domain knowledge, which requires human involvement and is often difficult and expensive to obtain.
Second, existing methods cannot extract attribute hierarchies.
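
To illustrate what an attribute hierarchy adds over flat attribute-value pairs, here is a minimal sketch. The product data and the `flatten` helper are hypothetical and are not taken from the patent; they only show how a higher-level attribute can disambiguate a lower-level one.

```python
# Hypothetical product data, for illustration only (not from the patent).
# As flat pairs, the two "Size" attributes are indistinguishable.
flat_pairs = [("Size", "15.6 in"), ("Size", "2 TB")]

# As a tree, the parent attribute supplies the missing context.
attribute_tree = {
    "Display": {"Size": "15.6 in"},
    "Storage": {"Size": "2 TB"},
}


def flatten(tree, path=()):
    """Yield (attribute path, value) for every leaf of a nested dict."""
    for name, child in tree.items():
        if isinstance(child, dict):
            yield from flatten(child, path + (name,))
        else:
            yield path + (name,), child


paths = list(flatten(attribute_tree))
for attribute_path, value in paths:
    print(" > ".join(attribute_path), "=", value)
```

Each leaf keeps its full attribute path, so the two sizes remain distinct when the extracted data is merged with data from other pages.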



Examples


Embodiment Construction

[0037] The present invention will be fully described below with reference to the accompanying drawings showing embodiments of the invention. However, this invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, components are exaggerated for clarity.

[0038] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in common dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant technology, and should not be interpreted in idealized or extremely formalized...



Abstract

The invention provides a method for extracting object attribute value information from a web page, comprising the following steps: a) for a given web page, obtaining the Document Object Model (DOM) tree corresponding to the page and calculating relevant information for each DOM node in the tree; b) constructing a tag-type node graph from the DOM tree and the per-node information, and calculating a score for each tag-type node; c) selecting a tag-type node tree from the resulting tag-type node graph based on the scores of the tag-type nodes; and d) constructing an attribute value tree based on the selected tag-type node tree. The method minimizes the use of domain-specific information. Such domain knowledge requires human involvement, which makes extraction of object attribute value information difficult and expensive. A further advantage of the method is that it extracts not only attribute-value pairs but also an attribute value tree. An attribute usually has an internal hierarchical structure. High-level attributes provide contextual information for low-level attribute values, and this context helps with information integration and machine understanding.
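
Step a) of the abstract can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the patented implementation. The abstract does not say which "relevant information" is computed per node, so the fields here (depth, tag path, text) are assumptions, and the names `DomNode`, `DomBuilder`, `parse_dom`, and `walk` are hypothetical. Steps b) to d) are not sketched because this excerpt does not define the scoring.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Iterator, List, Optional

# Elements that never have children or an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class DomNode:
    tag: str
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["DomNode"] = field(default=None, repr=False, compare=False)
    children: List["DomNode"] = field(default_factory=list)
    text: str = ""   # assumed per-node information: direct text content
    depth: int = 0   # assumed per-node information: distance from the root

    @property
    def tag_path(self) -> str:
        """Assumed per-node information: tags from the root down to this node."""
        parts, node = [], self
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class DomBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.root = DomNode("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, parent=self.current, depth=self.current.depth + 1)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_dom(html: str) -> DomNode:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node: DomNode) -> Iterator[DomNode]:
    """Visit nodes in document order (pre-order)."""
    yield node
    for child in node.children:
        yield from walk(child)


root = parse_dom("<table><tr><th>Display</th><td>15.6 in</td></tr></table>")
for n in walk(root):
    print(n.depth, n.tag_path, repr(n.text))
```

A production system would more likely use a full HTML5 parser, since real pages contain malformed markup that this simple stack-based builder handles only approximately.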

Description

Technical Field

[0001] The invention relates to the fields of information processing and information extraction, and in particular to a system and method capable of extracting object attribute value information from web pages.

Background Art

[0002] The prior art includes the following related technologies:

[0003] 1. US7720830 (B2), Hierarchical conditional random fields for web extraction

[0004] The method proposed in this prior art tags an information page with object information tags. After dividing the web page into chunks, hierarchical CRFs are used to label the object elements.

[0005] The differences between the above-mentioned prior art and the present application are as follows. First, the above-mentioned prior art assumes that the set of attribute names of the object class is known, whereas the method of the present application extracts the attribute names and attribute values at the same time. Second, the above prior art uses a supervised me...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 孙军, 谢宣松, 姜珊珊, 赵利军, 郑继川
Owner: RICOH KK