System and method for extracting object identifiers from web pages

A technology for identifying and extracting object identifiers from web pages. It addresses two problems: existing methods cannot meet the accuracy requirements for identifiers, and a single web page provides insufficient object identifier related information. The stated effect is improved extraction accuracy.

Active Publication Date: 2015-12-02
RICOH KK
Cites: 5; Cited by: 0

AI Technical Summary

Problems solved by technology

[0008] However, the prior art mentioned above has two main disadvantages. First, the methods for extracting web page titles or product names disclosed in the above documents rely only on features of the DOM tree and visual information, which may not satisfy the accuracy requirements for extracting object identifier related information. Second, the object identifier related information provided by a single web page is not comprehensive enough, so it is necessary to integrate the object identifier related information from multiple web pages to obtain the object identifier.
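
The need to integrate object identifier related information from multiple web pages can be illustrated with a minimal sketch. The source does not say how the integration is performed; the per-page majority vote below is a swapped-in, assumed technique, and the names `normalize` and `integrate_identifiers` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def integrate_identifiers(candidates_per_page: Iterable[List[str]]) -> Optional[str]:
    """Return the normalized candidate supported by the most pages.

    Assumption (not from the source): agreement across pages is used as the
    integration criterion. Returns None when no candidates are given.
    """
    votes: Counter = Counter()
    for page_candidates in candidates_per_page:
        # Count each candidate once per page so one verbose page cannot dominate.
        for name in {normalize(c) for c in page_candidates}:
            votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical candidates extracted from three different pages.
pages = [["Ricoh WG-5", "ricoh wg-5"], ["RICOH  WG-5"], ["WG-5 camera"]]
best = integrate_identifiers(pages)
```

A real system would need a stronger notion of equivalence between aliases than case and whitespace normalization; the sketch only shows why evidence from several pages can be more reliable than a single page.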

Examples

Embodiment Construction

[0054] Specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0055] First, the principles of the system for extracting object identifiers from webpages according to the embodiment of the present invention will be described.

[0056] As mentioned in the background, building an object database usually requires extracting object identifiers from web pages. An object identifier can be extracted from a web page describing a single product object, or the individual object identifiers of multiple objects can be extracted from a web page describing multiple product objects. The object mentioned here usually refers to a product in the real world, such as a digital camera. Figure 1 is a diagram showing an example of an input web page targeted by an embodiment of the present invention. As shown in Figure 1, the objects of the embodiment of the present invention ...

Abstract

This invention discloses a system and a method for extracting object identifiers from web pages. The system includes an identifier identification module for identifying identifier blocks in a web page, an identifier fragment extraction module connected to the identifier identification module, and an identifier unit labeling module connected to the identifier fragment extraction module. The web page contains object identifier related information, that is, information indicating the object identifier, and an identifier block is a text containing the object identifier related information. The identifier fragment extraction module removes useless information from the identifier block to obtain identifier fragments, according to at least one of the position information and the content information of each word in the identifier block recognized by the identifier identification module. The identifier unit labeling module labels the identifier fragments extracted by the identifier fragment extraction module as object identifiers suitable for constructing an object database.
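
The three-module chain described in the abstract can be sketched as follows. This is only an illustration of the data flow (identification, then fragment extraction, then unit labeling). The lexicons `KNOWN_BRANDS`, `NOISE_WORDS` and `SEPARATORS`, the label names, and every heuristic are assumptions made for the example and are not taken from the patent.

```python
import re
from typing import List, Tuple

# Hypothetical lexicons; the source does not specify any of these.
KNOWN_BRANDS = {"ricoh", "canon", "nikon"}
NOISE_WORDS = {"buy", "best", "price", "review", "online"}
SEPARATORS = {"|", "-", ":"}


def identify_blocks(texts: List[str]) -> List[str]:
    """Identifier identification module (sketch): keep text blocks that
    mention a known brand, as a stand-in for real block recognition."""
    return [t for t in texts
            if any(w.lower() in KNOWN_BRANDS for w in t.split())]


def extract_fragment(block: str) -> List[str]:
    """Identifier fragment extraction module (sketch): remove useless words
    using position (stop at the first separator once something is kept)
    and content (skip words in the noise list)."""
    kept: List[str] = []
    for word in block.split():
        if word in SEPARATORS and kept:
            break  # position cue: text after a separator is often a site name
        if word.lower() in NOISE_WORDS:
            continue  # content cue: marketing words carry no identifier
        kept.append(word)
    return kept


def label_units(fragment: List[str]) -> List[Tuple[str, str]]:
    """Identifier unit labeling module (sketch): tag each word of the fragment."""
    labeled = []
    for word in fragment:
        if word.lower() in KNOWN_BRANDS:
            labeled.append((word, "BRAND"))
        elif re.search(r"\d", word):
            labeled.append((word, "MODEL"))  # assumed: model codes contain digits
        else:
            labeled.append((word, "OTHER"))
    return labeled


def extract_object_identifiers(texts: List[str]) -> List[List[Tuple[str, str]]]:
    """Chain the three modules in the order given in the abstract."""
    return [label_units(extract_fragment(b)) for b in identify_blocks(texts)]


# Hypothetical page texts: one title-like block and one unrelated block.
result = extract_object_identifiers(
    ["Buy Ricoh WG-5 Digital Camera | Example Shop", "Contact us"]
)
```

With the hypothetical input above, only the first block passes identification, the words "Buy" and everything from "|" onward are dropped, and the remaining words are tagged as brand, model, or other. The labeled output is the kind of structured identifier that could be stored in an object database.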

Description

Technical field

[0001] The present invention relates generally to information processing and information extraction techniques, and more particularly to systems and methods for identifying and extracting object identifiers from web pages.

Background

[0002] In the current field of information processing technology, it is often necessary to build an object database, which involves providing object identifiers with a hierarchical structure for object generation and object mapping, representing objects and establishing indexes.

[0003] Here, the objects to be processed generally involve web pages on the Internet. Objects in the real world have unique object identifiers (that is, names). Other aliases or conventional abbreviations can also be used to represent object identifiers; for example, it is common for the same object to have different names in different web pages. In the same web page, the representation of the same object is usually con...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 姜珊珊, 谢宣松, 孙军, 郑继川, 赵立军
Owner: RICOH KK