PageRank and information entropy-based text word segmentation method for judgment document

A text word segmentation technique using information entropy, applied in special data processing applications, instruments, electrical digital data processing, etc., that can solve problems such as the difficult recognition of special names.

Inactive Publication Date: 2018-11-09
NANJING UNIV

AI Technical Summary

Problems solved by technology

Legal documents contain a large number of place names, personal names, organization names, and other special names, so identifying these special words is a difficult problem.



Embodiment Construction

[0058] The present invention mainly uses an improved PageRank algorithm to build a graph model of the inclusion relationships between potential words, calculates the Rank value of every potential word, and combines information entropy and mutual information to perform word segmentation. The present invention also adds a keyword dictionary so that it adapts well to terminology in different domains. The overall process of the word segmentation method is shown in Figure 1. The specific implementation steps are as follows:

[0059] 1. The main flow of the method is shown in the upper part of Figure 10.

[0060] Step (1): read the input text and split it, using punctuation marks, digits, and English letters as separators, to obtain all the runs of Chinese characters in the text; then filter out the strings whose length is only 1 to obtain a string list S;
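Step (1) can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function name `split_into_strings` is hypothetical, and "Chinese character" is assumed to mean the CJK Unified Ideographs range U+4E00 to U+9FFF, which the source does not specify. Under that assumption every other character (punctuation, digits, Latin letters, whitespace) acts as a separator.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
# Everything else (punctuation, digits, English letters) acts as a separator.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def split_into_strings(text, min_len=2):
    """Return the runs of consecutive Chinese characters in text,
    dropping runs shorter than min_len (i.e. runs of length 1)."""
    return [run for run in CJK_RUN.findall(text) if len(run) >= min_len]

sample = "被告人张三于2018年在南京市盗窃,价值ABC元。"
S = split_into_strings(sample)
print(S)  # ['被告人张三于', '年在南京市盗窃', '价值']
```

The single character "元" is isolated between "ABC" and the full stop, so it is filtered out as a length-1 string.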

[0061] Step (2), for each string S_i in S, a substring S_sub whose length does not exceed k (k=6) (potenti...
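The remainder of step (2) is cut off in the source, so the sketch below covers only the part that is stated: enumerating the substrings of each string whose length does not exceed k = 6. The function name `candidate_substrings`, the lower bound of 2 characters (chosen to match the length-1 filter in step (1)), and the decision to count occurrences are all assumptions made for illustration.

```python
from collections import Counter

def candidate_substrings(strings, k=6, min_len=2):
    """Count every substring of length min_len..k in each string.

    Assumption: min_len=2 and the frequency counting are illustrative
    choices; the source text only states the upper bound k=6.
    """
    counts = Counter()
    for s in strings:
        for start in range(len(s)):
            for length in range(min_len, k + 1):
                if start + length > len(s):
                    break  # longer substrings would run past the end
                counts[s[start:start + length]] += 1
    return counts

print(dict(candidate_substrings(["南京市"])))
# {'南京': 1, '南京市': 1, '京市': 1}
```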


Abstract

The invention discloses a PageRank and information entropy-based text word segmentation method for judgment documents, which belongs to Chinese word segmentation technology in the field of natural language processing. An improved PageRank algorithm, information entropy, mutual information, and a keyword dictionary are used to segment Chinese text. For judgment documents in the legal field, the word segmentation method is built on the PageRank algorithm: candidate words are segmented according to Rank vectors; the candidate words are corrected using information entropy; terms are combined according to a keyword dictionary for judgment documents; and finally the word segmentation result is output. The method can segment judgment documents more accurately. Compared with existing methods, its notable advantage is that it does not need statistics or training over a large text corpus to build a large-scale dictionary. Statistics are computed only on the input text, which is treated as the corpus for statistical mining, and word segmentation is then completed in combination with a keyword term dictionary for judgment documents.
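The abstract names two scoring ingredients: a Rank vector from PageRank and information entropy. As a rough illustration, the sketch below shows the textbook versions of both, not the patent's improved PageRank, whose details are not given in this text. The edge direction (a longer candidate links to a shorter candidate it contains), the damping factor 0.85, and the names `pagerank` and `boundary_entropy` are all assumptions.

```python
import math
from collections import Counter

def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.
    links maps each node to the list of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

def boundary_entropy(text, word, side="right"):
    """Shannon entropy (in bits) of the characters that appear directly
    to the right (or left) of each occurrence of word in text."""
    neighbours = Counter()
    start = text.find(word)
    while start != -1:
        idx = start + len(word) if side == "right" else start - 1
        if 0 <= idx < len(text):
            neighbours[text[idx]] += 1
        start = text.find(word, start + 1)
    total = sum(neighbours.values())
    entropy = 0.0
    for count in neighbours.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical inclusion graph: longer candidates point to the shorter
# candidate they contain.
ranks = pagerank({"南京市": ["南京"], "南京大学": ["南京"]})
print(max(ranks, key=ranks.get))  # 南京

text = "南京市南京大学南京路"
print(boundary_entropy(text, "南京"))          # log2(3), about 1.585
print(boundary_entropy(text, "南京", "left"))  # 1.0
```

A candidate whose neighbouring characters vary widely (high boundary entropy) behaves like a free-standing word, which is the usual intuition behind entropy-based correction of candidate words.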

Description

Technical field

[0001] The invention belongs to Chinese word segmentation technology in the technical field of natural language processing, and is a technique for performing Chinese word segmentation on legal documents.

Background

[0002] Word segmentation refers to dividing existing text into separate, independent, and meaningful units. Chinese word segmentation refers to dividing a continuous sequence of Chinese characters into separate words so that they form a semantically readable word sequence. Unlike English, Chinese has no explicit separator between words to serve as the basis for segmentation, so word segmentation is more difficult for Chinese than for other languages. The results produced by a word segmentation algorithm also directly affect upper-layer applications such as part-of-speech tagging and keyword extraction. Therefore, how to make the computer understand ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 葛季栋, 李传艺, 李振昊, 雷妙妙, 姚林霞, 周筱羽, 骆斌
Owner: NANJING UNIV