Text mining method for a semi-structured document set

A text mining technique for semi-structured documents, applied in the fields of instruments, computing, and electrical digital data processing. It addresses two problems of earlier methods, namely that they do not make full use of the information in semi-structured documents and do not form a unified mathematical model, and thereby improves mining results and is widely applicable.

Inactive Publication Date: 2003-02-26
PEKING UNIV +1

AI Technical Summary

Problems solved by technology

Existing techniques use only part of the information in a semi-structured document. They do not exploit all of that information to obtain good text mining results, and they do not form a unified mathematical model.

Examples

Embodiment Construction

[0015] The present invention is further described below with reference to the accompanying drawings. The example data are term entry documents selected from the terminology database of the China Encyclopedia; each term entry document is a semi-structured XML document.

[0016] First, as shown in Figure 1, the document is read in and its structure is analyzed, as shown in Figure 2. For each node of the document, determine whether it already exists in the structure tree. If it does not, add the node's information to the structure tree and assign the node a unique identification number, as shown in Figure 3.
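The structure-tree step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes that a node's identity is its path of tag names from the root, which the text above does not specify, and the function name `register_structure` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def register_structure(root, structure_tree=None):
    """Record every distinct element path and give it a unique ID.

    Assumption: two nodes are "the same" when they share the same
    sequence of tag names from the root.
    """
    if structure_tree is None:
        structure_tree = {}

    def visit(elem, parent_path):
        path = parent_path + (elem.tag,)
        # Add the node to the structure tree only if it is not there yet.
        if path not in structure_tree:
            structure_tree[path] = len(structure_tree) + 1
        for child in elem:
            visit(child, path)

    visit(root, ())
    return structure_tree


# Hypothetical term entry document.
doc = ET.fromstring(
    "<entry><title>XML</title>"
    "<body><para>a</para><para>b</para></body></entry>"
)
tree = register_structure(doc)
```

Passing the same `structure_tree` dictionary to later calls lets a whole document set share one structure tree, so a second document with the same layout adds no new entries.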

[0017] Second, if the node currently being analyzed contains child nodes, continue with its first child node, and so on until a data node, which contains no child nodes, is reached; if the current node is a data node, perform word segmentation on its text field, and according...
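The traversal described so far can be sketched as a depth-first walk. This is only an illustration under stated assumptions: the regex tokenizer `segment` is a stand-in for a real word segmenter (Chinese text would need a dedicated one), the names `segment` and `walk_data_nodes` are hypothetical, and the sketch stops at producing the segmented words because the remainder of the step is not shown in this text.

```python
import re
import xml.etree.ElementTree as ET


def segment(text):
    # Stand-in for word segmentation (assumption): lowercase word tokens.
    return re.findall(r"\w+", text.lower())


def walk_data_nodes(elem, path=()):
    """Depth-first walk that yields (path, words) for each data node."""
    path = path + (elem.tag,)
    children = list(elem)
    if children:
        # Not a data node: descend into the child nodes in order.
        # Text mixed in between child elements is ignored in this sketch.
        for child in children:
            yield from walk_data_nodes(child, path)
    else:
        # Data node (no children): segment its text field.
        yield path, segment(elem.text or "")


# Hypothetical term entry document.
doc = ET.fromstring(
    "<entry><title>Vector model</title>"
    "<body><para>Text mining</para></body></entry>"
)
result = list(walk_data_nodes(doc))
```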

Abstract

The present invention belongs to intelligent information processing technology and relates in particular to a text mining method for a semi-structured document set. It proposes a mining method based on a structured link vector model of semi-structured documents. The method makes comprehensive use of word information, structure information, and link information and expresses them in a unified mathematical model. Text mining of a semi-structured document set with this model can exploit the structure and link information in the documents, and thus greatly improves mining results. The method can be widely used in intelligent information processing.

Description

Technical field

[0001] The invention belongs to intelligent information processing technology, and in particular relates to a text mining method for a semi-structured document set.

Background art

[0002] With the rapid development of the Internet, a large number of semi-structured documents such as HTML and XML have appeared. Semi-structured documents differ both from unstructured plain text documents and from regularly structured data in relational databases. How to quickly and effectively obtain the documents that people need from such a large number of documents, and how to discover the hidden patterns in these documents, are the problems that people face. Analytical mining of semi-structured document sets is the method used to solve these technical problems.

[0003] At present, there are mainly two types of mining methods for semi-structured documents: one is to treat semi-structured documents as unstructured plain text documents, and use traditional text mining methods ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21, G06N7/00
Inventor: 杨建武, 陈晓鸥, 吴於茜, 万小军, 王选, 陈堃銶
Owner: PEKING UNIV