Literature classification method and system based on trie and LCS algorithm

A document classification technique, applied in the field of document classification methods and systems based on trie and LCS algorithms. It can solve problems such as failures, and achieves the effects of reducing dependence on context, reducing labor intensity, and reducing interference.

Active Publication Date: 2019-03-29
CHINA PETROLEUM & CHEM EXPLORATION & PRODION RES INST +1
7 Cites, 3 Cited by

AI Technical Summary

Problems solved by technology

[0004] In short, dictionary matching is only applicable when the meaning of words is independent and unique. In the case of dictionaries with complex semantics, the rule-based method based on dictionary matchi

Examples

Embodiment 1

[0062] Taking the title of a document to be classified as an example, classification according to the above method proceeds as follows:

[0063] Suppose the title of the document to be classified is "On the Importance of Education to the Economic Development of a Country";

[0064] Assume that categories and some strings are preset in the initial classification dictionary, as follows:

[0065] the "Economy" category, which contains the string: "Social Economic Development";

[0066] the "Education" category, which contains the string: "Higher Education123";

[0067] the "Politics" category, which contains the string: "Arrangement of conference affairs";

[0068] Assume that some strings are preset in the initial exclusion dictionary, such as: "of", "a", "how";

[0069] Extend each character string in the initial classification dictionary to obtain extended character strings. Taking "Social Economic Development" in the "Economy" category as an example:

[0070] The extended string of "社" is: "社", "社会", "...
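The extension and filtering steps in [0069] and [0070] can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation. It assumes that "extending" a string means enumerating every contiguous substring from each starting character (which is what the "社", "社会" example suggests), and that filtering simply drops any extended string found in the exclusion dictionary. The function names `extend` and `extend_and_filter`, and the rule that the first category wins when two categories yield the same extended string, are hypothetical.

```python
def extend(s):
    """List every contiguous substring of s, grouped by starting character.

    Assumption: this is what the source calls the "extended strings".
    """
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


def extend_and_filter(classification, exclusion):
    """Return {extended string: category}, skipping excluded strings.

    classification: {category: [strings]} (the initial classification dictionary)
    exclusion: set of strings (the initial exclusion dictionary)
    Assumption: if two categories produce the same extended string,
    the first category seen keeps it.
    """
    table = {}
    for category, strings in classification.items():
        for s in strings:
            for sub in extend(s):
                if sub not in exclusion:
                    table.setdefault(sub, category)
    return table


if __name__ == "__main__":
    # The two-character fragment from [0070].
    print(extend("社会"))
    # A toy dictionary with a hypothetical excluded string "b".
    print(extend_and_filter({"Economy": ["abc"]}, {"b"}))
```

For the fragment "社会", `extend` yields "社" and "社会" for the first character and "会" for the second, matching the pattern shown in [0070].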

Abstract

The invention discloses a document classification method based on trie and LCS algorithms, comprising the following steps: step 1, pre-compiling an initial classification dictionary and an initial exclusion dictionary; step 2, extending each character string in the initial classification dictionary to obtain extended character strings, filtering the extended character strings according to the initial exclusion dictionary, and constructing a dictionary tree; step 3, calling the dictionary tree to look up all the strings that appear in each sentence of the document to be classified, taking the longest character string in the initial classification dictionary as the longest common subsequence, taking the longest common subsequence and its corresponding category as the final character string and final category of the sentence, and taking the final category that appears most frequently in a document as the category to which the document belongs. The invention also discloses a document classification system based on trie and LCS algorithms. The invention omits the word segmentation process, uses stable character strings as features, achieves high accuracy, and reduces dependence on context.
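The lookup and voting in step 3 can be illustrated with a short Python sketch. It is written under several assumptions and is not the patent's implementation: the dictionary tree is modeled as nested dicts, the "longest common subsequence" is taken to be simply the longest stored string found in a sentence (no dynamic-programming LCS is computed), and ties are broken by whichever match or category was seen first. The names `Trie`, `find_all`, `classify_sentence`, and `classify_document`, and the English sample strings, are hypothetical.

```python
from collections import Counter

END = ""  # sentinel key; it can never equal a single character of a string


class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, string, category):
        """Store string in the trie and tag its last node with category."""
        node = self.root
        for ch in string:
            node = node.setdefault(ch, {})
        node[END] = category

    def find_all(self, sentence):
        """Return every (string, category) stored in the trie that occurs in sentence."""
        found = []
        for start in range(len(sentence)):
            node = self.root
            for end in range(start, len(sentence)):
                node = node.get(sentence[end])
                if node is None:
                    break
                if END in node:
                    found.append((sentence[start:end + 1], node[END]))
        return found


def classify_sentence(trie, sentence):
    """Pick the longest matched string and its category, or None if nothing matches."""
    found = trie.find_all(sentence)
    if not found:
        return None
    return max(found, key=lambda match: len(match[0]))


def classify_document(trie, sentences):
    """Return the category that wins the most sentences, or None."""
    votes = Counter()
    for sentence in sentences:
        best = classify_sentence(trie, sentence)
        if best is not None:
            votes[best[1]] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    trie = Trie()
    for string, category in [
        ("economic development", "Economy"),
        ("economic", "Economy"),
        ("education", "Education"),
    ]:
        trie.insert(string, category)

    sentences = [
        "education and economic development",
        "higher education",
        "economic growth",
    ]
    print(classify_document(trie, sentences))
```

Because the scan walks the trie character by character from every start position, no word segmentation is needed, which is the property the abstract emphasizes.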

Description

Technical field

[0001] The invention relates to the technical field of document classification, in particular to a method and system for document classification based on trie and LCS algorithms.

Background technique

[0002] The prior art contains no record of using LCS for classification. Take as a reference a recent 2018 paper: Xue Weiming, Hou Xia, Li Ning, "A text classification method based on word2vec" [J], Journal of Beijing Information Science and Technology University, Vol. 33, No. 1, pp. 71-75, Feb. 2018. That article uses a Chinese news text classification corpus containing a total of 2615 texts divided into 9 categories, and its highest F value is 89.48%. For the reference methods given in that article, the F value of the improved KNN is 84.15% and the F value of traditional KNN classification is 74.39%. However, this classification method has no verification results on a corpus of millions of texts. The application number is 201510685864.6, and the pate...

Application Information

IPC(8): G06F16/335, G06F16/35, G06F16/9032
Inventor: 唐先明, 王晓丽, 陈新荣, 邓达康, 韩宝东, 史晓凌, 郭攀红, 张德浩, 谭培波, 张学龙
Owner: CHINA PETROLEUM & CHEM EXPLORATION & PRODION RES INST