Method for classifying documents in mass document library

A document classification technology for document libraries, applied in the computer field. It addresses the problem that document classification is time-consuming and complex, and it improves efficiency by reducing the number of matching operations and simplifying the matching process.

Active Publication Date: 2013-04-17
IOL WUHAN INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0007] The present invention aims to provide a method for classifying documents in a massive document library, solving the problem that classifying documents in a reference library by term matching is complicated and time-consuming.



Examples


Embodiment Construction

[0017] Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and the embodiments. Referring to Figure 1, the steps of the embodiment include:

[0018] S11: Determine the keywords of all documents in the document library and the correspondence between each keyword and each document to which it belongs;

[0019] S12: Match the keywords one by one against the term database, and use the industry category attribute of the term matched by each keyword as that keyword's industry category attribute in each document corresponding to it;

[0020] S13: Determine, according to the correspondence, the industry category attribute that occurs most often in each document;

[0021] S14: Use that most frequent industry category attribute as the classification of each document.
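Steps S11 to S14 above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the names `TERM_BASE` and `classify_documents`, the whitespace tokenization, and the sample categories are all assumptions, and the term base is modeled as a plain dictionary. Each distinct keyword is counted once per document, and ties between categories are resolved arbitrarily; the source does not say how either case is handled.

```python
from collections import Counter, defaultdict

# Hypothetical term base: term -> industry category attribute.
TERM_BASE = {
    "bond": "finance",
    "equity": "finance",
    "yarn": "textile",
    "loom": "textile",
}


def classify_documents(documents, term_base):
    """Return {doc_id: category or None}, following steps S11-S14."""
    # S11: collect each distinct keyword and the documents it belongs to.
    # Whitespace tokenization is an assumption made for this sketch.
    keyword_to_docs = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            keyword_to_docs[word].add(doc_id)

    # S12: look each keyword up once in the term base and attach the
    # matched term's category to every document containing the keyword.
    doc_categories = defaultdict(Counter)
    for keyword, doc_ids in keyword_to_docs.items():
        category = term_base.get(keyword)
        if category is not None:
            for doc_id in doc_ids:
                doc_categories[doc_id][category] += 1

    # S13 + S14: the most frequent category becomes the classification.
    result = {}
    for doc_id in documents:
        counts = doc_categories.get(doc_id)
        result[doc_id] = counts.most_common(1)[0][0] if counts else None
    return result


if __name__ == "__main__":
    docs = {
        "d1": "bond equity yarn",
        "d2": "loom yarn",
        "d3": "hello world",
    }
    print(classify_documents(docs, TERM_BASE))
```

Because each keyword is looked up only once regardless of how many documents contain it, the number of term-base lookups depends on the vocabulary size rather than on the total text length.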

[0022] The present invention adopts a reverse-matching approach to term search over the documents in a reference library, that is, use all words in the reference libra...


Abstract

The invention provides a method for classifying documents in a mass document library. The method includes: determining the keywords of each document in the document library and the correspondence between each keyword and the documents it belongs to; matching the keywords one by one against a term base, and using the industry category attribute of the term matched by each keyword as that keyword's industry category attribute in each corresponding document; determining, according to the correspondence, the industry category attribute that occurs most often in each document; and using that most frequent industry category attribute as the category of the document. Documents in a reference library are subjected to term retrieval according to the idea of backward matching. The term base is a set with a character-sequence index structure, so string matching by binary search in the term base needs at most 1 + log2(n) matching calculations. This greatly reduces the number of matches, simplifies the matching process, and improves the efficiency of document classification.
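The bound of 1 + log2(n) matching calculations can be illustrated with a short Python sketch. It assumes that the term base can be represented as a lexicographically sorted list of n terms and that the bound is read as 1 + floor(log2 n); the function name `find_term` and the generated sample terms are hypothetical and do not come from the patent.

```python
import math


def find_term(sorted_terms, keyword):
    """Binary-search a sorted list of terms.

    Returns (index, probes): index is -1 when the keyword is absent,
    and probes is the number of terms the keyword was compared with.
    """
    lo, hi = 0, len(sorted_terms) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one matching calculation against the probed term
        term = sorted_terms[mid]
        if term == keyword:
            return mid, probes
        if term < keyword:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


if __name__ == "__main__":
    # Hypothetical term base of 1000 sorted terms.
    terms = sorted(f"term{i:03d}" for i in range(1000))
    bound = 1 + math.floor(math.log2(len(terms)))
    worst = max(find_term(terms, t)[1] for t in terms)
    print("bound:", bound, "worst observed:", worst)
```

Under these assumptions, a lookup in a term base of 1000 terms never probes more than 10 terms, whether or not the keyword is present.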

Description

Technical field

[0001] The present invention relates to the field of computers, and in particular to a method for classifying documents in a massive document library.

Background art

[0002] The translation reference library (hereinafter referred to as the reference library) is a document library holding a large number of auxiliary translation resources. Classifying it by industry, discipline, and field with a general similarity retrieval method requires text similarity matching calculations whose time and space costs are more than the system can bear.

[0003] By using a large term corpus to count the terms that occur in the documents of the reference library, the documents can be assigned attributes such as industry, discipline, and field, and the cost of string pattern matching is much lower than that of text similarity matching.

[0004] A large term corpus is a large...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 江潮
Owner: IOL WUHAN INFORMATION TECH CO LTD