Structured document retrieval device and program

Structured document and tree structure technology, applied in special data processing applications, instruments, and electrical digital data processing, that addresses the inability to run searches combining the structural conditions of XML with the structural conditions of annotations.

Inactive Publication Date: 2013-12-04
HITACHI LTD

AI Technical Summary

Problems solved by technology

[0007] As a result, neither the existing search functions for XML documents nor the existing search functions provided under UIMA can search annotated XML documents in a way that takes into account both the structural conditions of the XML and the structural conditions of the annotations.


Examples

(First embodiment)

[0044] (Summary)

[0045] This embodiment describes a structured document retrieval device that preprocesses a collection of XML documents and a collection of annotation data to generate retrieval data in advance, compares the retrieval data against a search query, and outputs the documents or elements that match the query as search results. The retrieval data is a text-shared DOM tree, which integrates the structural information of XML tags and annotation tags.
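The idea of a text-shared DOM tree can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes annotations arrive as stand-off `(name, begin, end)` character spans (as in UIMA-style annotation data), that annotation spans nest without partial overlap, and that "text sharing" means both trees refer to one copy of the text by offset. The names `Node`, `build_xml_tree`, `build_annotation_tree`, and `find` are hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    name: str    # XML tag name or annotation type
    kind: str    # "root", "xml", or "annotation"
    begin: int   # start offset into the shared text (inclusive)
    end: int     # end offset into the shared text (exclusive)
    children: list = field(default_factory=list)


def build_xml_tree(elem, buf):
    """Convert an ElementTree element to a Node, appending its text to buf."""
    node = Node(elem.tag, "xml", len(buf), len(buf))
    buf.extend(elem.text or "")
    for child in elem:
        node.children.append(build_xml_tree(child, buf))
        buf.extend(child.tail or "")
    node.end = len(buf)
    return node


def build_annotation_tree(spans):
    """Nest annotation spans by inclusion (assumes no partial overlaps)."""
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    roots, stack = [], []
    for name, begin, end in ordered:
        node = Node(name, "annotation", begin, end)
        while stack and not (stack[-1].begin <= begin and end <= stack[-1].end):
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find(root, xml_tag, annotation_type):
    """XML elements named xml_tag whose text span includes such an annotation."""
    nodes = list(iter_nodes(root))
    anns = [n for n in nodes if n.kind == "annotation" and n.name == annotation_type]
    return [n for n in nodes
            if n.kind == "xml" and n.name == xml_tag
            and any(n.begin <= a.begin and a.end <= n.end for a in anns)]


# Hypothetical input: one XML document plus stand-off annotation data.
xml_source = "<doc><p>Hitachi makes trains.</p><p>No company here.</p></doc>"
annotations = [("Sentence", 0, 21), ("Organization", 0, 7)]

buf = []
xml_root = build_xml_tree(ET.fromstring(xml_source), buf)
text = "".join(buf)  # the single text copy shared by both structures

# One common root holds the XML-derived tree and the annotation-derived tree.
shared_root = Node("#root", "root", 0, len(text),
                   [xml_root] + build_annotation_tree(annotations))

hits = [(n.begin, n.end) for n in find(shared_root, "p", "Organization")]
```

Here `find` combines an XML condition (the tag is `p`) with an annotation condition (the span includes an `Organization` annotation), which is the kind of combined query the embodiment targets.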

[0046] (Device structure)

[0047] Figure 1-1 shows a configuration example of the structured document retrieval device 400. The device is configured as a computer including a CPU (Central Processing Unit) 401, a main storage device (memory) 402, an auxiliary storage device 403A, and a user interface unit 406. The structured document retrieval device 400 is connected to an external network device via a net...

(Second embodiment)

[0129] In this embodiment, inclusion relationships are defined between elements of different types: between XML elements and annotation elements, or between annotation elements belonging to different annotation groups. To represent them, this embodiment uses a DOM DAG (Directed Acyclic Graph), which extends the text-shared DOM tree structure of the first embodiment. The basic structure of the structured document retrieval device 400 is the same as in the first embodiment, that is, the structure shown in Figure 1-1 and Figure 1-2. In this embodiment, however, the DOM DAG construction unit 422 is used instead of the text-shared DOM tree construction unit 415.
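A rough Python sketch of how cross-type inclusion yields a DAG rather than a tree follows. It is an illustration under assumptions, not the construction performed by unit 422: every element is reduced to a character span, an element's parents are its smallest strictly containing elements of any type, and equal spans are treated as unrelated to keep the graph acyclic. `Element`, `build_dag`, and `match_path` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; the graph has back references
class Element:
    name: str
    kind: str    # "xml", or the name of an annotation group
    begin: int
    end: int
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def contains(a, b):
    """True if a's span strictly includes b's span (equal spans do not count)."""
    return (a.begin <= b.begin and b.end <= a.end
            and (a.begin, a.end) != (b.begin, b.end))


def build_dag(elements):
    """Link each element to its smallest containers, whatever their type."""
    for b in elements:
        containers = [a for a in elements if contains(a, b)]
        for a in containers:
            # a is a direct parent if no other container lies inside it.
            if not any(contains(a, c) for c in containers):
                a.children.append(b)
                b.parents.append(a)


def match_path(elements, names):
    """Follow child edges by name, e.g. ["p", "Sentence", "Organization"]."""
    current = [e for e in elements if e.name == names[0]]
    for name in names[1:]:
        following = []
        for e in current:
            for c in e.children:
                if c.name == name and c not in following:
                    following.append(c)
        current = following
    return current


# Hypothetical spans: the XML <b> element and the Sentence annotation overlap
# without nesting, so Organization ends up with two parents of different types.
doc = Element("doc", "xml", 0, 60)
p = Element("p", "xml", 0, 40)
b = Element("b", "xml", 8, 25)
sentence = Element("Sentence", "syntax", 0, 21)
org = Element("Organization", "ner", 10, 17)

elements = [doc, p, b, sentence, org]
build_dag(elements)

via_sentence = match_path(elements, ["p", "Sentence", "Organization"])
via_bold = match_path(elements, ["p", "b", "Organization"])
```

Because `org` has both an XML parent and an annotation parent, a query can reach it through either type of structure, which a single tree could not express.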

[0130] (Summary of preprocessing)

[0131] As described above, in this embodiment, a structure search using a DOM DAG considered as a parent-child relationship will be described regarding the inclusion relationship of tex...

(Third embodiment)

[0151] As described above, a DOM DAG makes it possible to search using the structural relationships between tags of different types. However, when searching by location path, tracing every constructed DOM DAG from its root element is inefficient.

[0152] Therefore, this embodiment defines a path DAG, a data structure that aggregates the structures of multiple DOM DAGs. With elements of the path DAG as entries and elements of the DOM DAGs as values, an inverted index supports efficient searches that use a location path as the query.

[0153] In this embodiment, the basic structure of the structured document retrieval device 400 is the same as in the first embodiment, that is, the structure shown in Figure 1-1 and Figure 1-2. However, in the case of this embodiment, the functions of the D...
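The following Python sketch shows one way such an index could work. The excerpt does not give the exact definition of the path DAG, so the sketch makes a simplifying assumption: each path DAG entry is identified by the sequence of element names from the root, and the inverted index maps that entry to the matching DOM DAG elements across all documents. The dictionary layout and the names `index_paths` and `search` are hypothetical.

```python
from collections import defaultdict


def index_paths(doc_id, dag, root, index):
    """Record every root-to-element name path of one DOM DAG in the index.

    dag maps an element id to (name, list of child ids). An element with
    several parents is indexed once per path that reaches it.
    """
    stack = [(root, (dag[root][0],))]
    while stack:
        elem, path = stack.pop()
        index[path].add((doc_id, elem))
        for child in dag[elem][1]:
            stack.append((child, path + (dag[child][0],)))


def search(index, location_path):
    """Look up a simple child-only location path such as /doc/p/Sentence."""
    key = tuple(location_path.strip("/").split("/"))
    return sorted(index.get(key, set()))


# Hypothetical DOM DAGs for two documents. In d1, element 4 has two parents.
dag_d1 = {
    0: ("doc", [1]),
    1: ("p", [2, 3]),
    2: ("b", [4]),
    3: ("Sentence", [4]),
    4: ("Organization", []),
}
dag_d2 = {
    0: ("doc", [1]),
    1: ("p", [2]),
    2: ("Sentence", [3]),
    3: ("Organization", []),
}

index = defaultdict(set)
index_paths("d1", dag_d1, 0, index)
index_paths("d2", dag_d2, 0, index)

hits_sentence = search(index, "/doc/p/Sentence/Organization")
hits_bold = search(index, "/doc/p/b/Organization")
```

A query is then answered with a single dictionary lookup instead of a traversal of every DOM DAG from its root. The trade-off in this simplified form is index size, since an element reachable by many paths is stored once per path.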

Abstract

The invention provides a structured document retrieval device and program capable of structure retrieval that combines the structure information based on XML tags with the structure information based on annotation tags. The device comprises: a processor, which executes the program; a first storage region, which stores the program; a second storage region, which stores a structured document satisfying a tree structure condition and the annotation data added to the document; a document structure construction unit, which unifies the root elements of the DOM trees obtained individually from the inclusion relations of the tags of the structured document and of the annotation data, shares the text of the structured document, and generates a text-shared DOM tree; and a retrieval processing unit, which retrieves the elements matching the search query from the text-shared DOM tree.

Description

Technical field

[0001] The present invention relates to a retrieval device that searches documents described in a structured language (hereinafter referred to as "structured documents"), as well as structured documents to which annotation data in an arbitrary format is attached, based on the structure of tags and/or character string data, and to a program that implements the device's functions on a computer.

Background art

[0002] XML (Extensible Markup Language) is a data format that records structural information within text by means of tags, which are character strings enclosed in "<" and ">". XML expresses a hierarchical tree structure through nested tags, and the tree can be changed by adding or deleting tags. XML is therefore widely used as a format for recording financial information, recording patent specifications, data exchange in electronic co...
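The tree behaviour of XML described in paragraph [0002] can be seen in a short Python example using the standard library's `xml.etree.ElementTree`. The sample document and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Nested tags form a hierarchical tree: <patent> is the root element.
root = ET.fromstring(
    "<patent><title>Retrieval device</title>"
    "<claims><claim>A device.</claim></claims></patent>"
)
before = [child.tag for child in root]

# Adding a tag inserts a new node into the tree.
abstract = ET.SubElement(root, "abstract")
abstract.text = "Summary text."

# Deleting a tag removes that node together with its whole subtree.
root.remove(root.find("claims"))

after = [child.tag for child in root]
serialized = ET.tostring(root, encoding="unicode")
```

After the edits, the root's children change from `title, claims` to `title, abstract`, and the nested `claim` element disappears along with its parent.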


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 小岛要
Owner: HITACHI LTD