Method and system for searching non-structural electronic document with obvious category classification

A category-classification technique for unstructured documents, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as the poor discrimination of IDF values, which causes IDF to lose its intended function in enterprise-level electronic document search, and it is easy to implement.

Active Publication Date: 2013-04-03
FUJIAN YIRONG INFORMATION TECH +1

AI Technical Summary

Problems solved by technology

[0031] 2) Weakened discrimination between the IDF values of similar keywords.
[0036] According to the calculation results, the two IDF values differ by a factor of only 1.3, so IDF discriminates poorly and loses its intended effect. The more pronounced the gap between categories, the more prominent this problem becomes. As with problem 1), this has relatively little impact on Internet search engines but a considerable impact on enterprise-level electronic document search.
[0037] To sum up, there are three ways to search electronic documents. Although "weighted full-text search" shows some deviations in a large-scale enterprise electronic document search environment, its overall result quality is still the best of the three.
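The weak discrimination described in [0036] can be illustrated with a small calculation. The sketch below uses the classic IDF definition log(N / df) and hypothetical document counts chosen only for illustration; they are not the figures behind the 1.3 ratio quoted above, which this excerpt does not reproduce.

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """Classic inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_term)

# Hypothetical counts for illustration only (not taken from the patent).
TOTAL_DOCS = 1_000_000
df_common = 10_000  # a keyword found in 10,000 documents
df_rare = 1_000     # a keyword found in 1,000 documents

idf_common = idf(TOTAL_DOCS, df_common)
idf_rare = idf(TOTAL_DOCS, df_rare)

# The document frequencies differ tenfold, but the logarithm compresses
# that gap into a much smaller difference between the IDF values.
ratio = idf_rare / idf_common
print(f"IDF ratio: {ratio:.2f}")
```

Under these assumed counts, a tenfold difference in document frequency shrinks to a much smaller ratio between the two IDF weights, which is the kind of compression the paragraph describes.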

Examples

Embodiment Construction

[0072] As mentioned in the background section, the TF-IDF algorithm considers neither the type of an electronic document nor the relationship between a search term and that type, which causes two problems. In severe cases, the IDF component of TF-IDF becomes almost completely ineffective, and the relevance of an electronic document to a keyword can then be determined only by how often the keyword appears in the document (the TF component).
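For reference, a minimal sketch of plain TF-IDF scoring follows. It assumes pre-tokenized documents and the unsmoothed log(N / df) form of IDF; the function name `tf_idf_scores` and the sample documents are hypothetical. Note that the document type is never an input, which is the gap [0072] points to.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, documents):
    """Score each tokenized document against the query with plain TF-IDF."""
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            if df[term] == 0 or not doc:
                continue  # unseen term or empty document adds nothing
            tf = counts[term] / len(doc)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores.append(score)
    return scores

# Hypothetical documents: "contract" occurs in every one of them.
docs = [
    ["contract", "payment", "contract"],
    ["contract", "invoice"],
    ["contract", "meeting", "minutes"],
]
scores = tf_idf_scores(["contract", "invoice"], docs)
print(scores)
```

In this toy collection, "contract" appears in every document, so its IDF is log(1) = 0 and it contributes nothing to any score; only "invoice" separates the documents.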

[0073] The present invention therefore improves the TF-IDF algorithm by taking type correlation into account. As shown in Figure 1, the system of the present invention consists of the following modules:

[0074] Document classification module: classifies the documents of a specific collection according to the relationships among their contents;

[0075] Type keyword identification module: identifies the keywords of every type;

[0076] Real-time search module: According to the searc...
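The data flow between the three modules listed above might be sketched as follows. This is only an outline under stated assumptions: category labels are taken as already known, type keywords are picked by a hypothetical share threshold (`min_share`), and the relevance function is supplied by the caller, because the patent's actual relevance formula is not reproduced in this excerpt. All function names are illustrative.

```python
from collections import Counter, defaultdict

def classify_documents(labeled_docs):
    """Document classification module (assumption: labels are already known)."""
    by_category = defaultdict(list)
    for category, tokens in labeled_docs:
        by_category[category].append(tokens)
    return dict(by_category)

def identify_type_keywords(by_category, min_share=0.8):
    """Type keyword identification module (assumed rule: a term is a type
    keyword if it occurs in at least min_share of the category's documents)."""
    keywords = {}
    for category, docs in by_category.items():
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))
        keywords[category] = {t for t, n in df.items() if n / len(docs) >= min_share}
    return keywords

def search(query_terms, by_category, type_keywords, relevance):
    """Real-time search module: return matching documents, highest relevance first."""
    hits = []
    for category, docs in by_category.items():
        for tokens in docs:
            if any(term in tokens for term in query_terms):
                score = relevance(query_terms, tokens, category, type_keywords)
                hits.append((score, category, tokens))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits

def term_count(query_terms, tokens, category, type_keywords):
    """Placeholder relevance (not the patent's formula): raw query-term count."""
    return sum(tokens.count(term) for term in query_terms)

# Hypothetical labeled collection.
labeled = [
    ("contract", ["contract", "supplier", "payment"]),
    ("contract", ["contract", "payment", "payment"]),
    ("minutes", ["meeting", "minutes", "payment"]),
]
by_category = classify_documents(labeled)
type_keywords = identify_type_keywords(by_category)
results = search(["payment"], by_category, type_keywords, term_count)
```

The first two functions correspond to the offline stage and `search` to the real-time stage. Passing `relevance` as a parameter keeps the sketch neutral about how search terms and document types are actually combined.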

Abstract

The invention provides a method and a system for searching unstructured electronic documents with obvious category classification. The method comprises a document classification and type keyword identification stage and a real-time search stage. In the first stage, document classification classifies the documents of a specific collection according to the relationships among their contents, and type keyword identification identifies the keywords of every type. In the real-time search stage, documents matching the search terms entered by a user are retrieved and returned in descending order of document relevance. The real-time search stage provides a relevance formula that introduces the relationship between search terms and document types, thereby optimizing the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. This largely solves the two problems that arise when the TF-IDF algorithm is used to search the electronic documents of a large-scale enterprise, so the method and the system are suitable for full-text search of such documents.

Description

【Technical field】

[0001] The invention relates to a method and system for retrieving unstructured electronic documents with obvious classification.

【Background】

[0002] Digital assets are among the most valuable intangible assets of a business. They can usually be divided into structured data and unstructured data. Structured data has a well-defined structure, can be easily parsed, and can be stored in a relational database. Unstructured data, by contrast, cannot conveniently be represented in the two-dimensional table structure used for standardized data. In the various business application systems of large enterprises, unstructured documents that vary in format, content, and related processes cover all aspects of the company's operation and management, yet they are clearly divided into categories.

[0003] Unstructured data is usually formed by encapsulating a ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 倪时龙宋立华余深田郑映洪顺淋
Owner: FUJIAN YIRONG INFORMATION TECH