Historical classics word segmentation method based on word alignment

A word segmentation method that uses word alignment technology, applicable to special-purpose data processing and electrical digital data processing. It addresses the ineffectiveness of existing segmentation methods on historical texts and improves segmentation accuracy.

Active Publication Date: 2017-10-03
大连痛点科技有限公司

AI Technical Summary

Problems solved by technology

Ancient Chinese currently lacks both a word segmentation dictionary and a large-scale segmentation training corpus, so directly applying modern Chinese word segmentation methods to ancient Chinese does not yield satisfactory results.

Examples

Embodiment 1

[0025] This embodiment uses Eclipse as the development platform and Java as the development language. The experiment is carried out on a corpus of 4,145 sentence pairs of ancient and vernacular Chinese from "The Benji of Qin Shihuang", "The Benji of Qin", "The Benji of Xiang Yu", "The Benji of Gaozu" and "The Benji of Lu Hou". The specific process is as follows:

[0026] Step 1: Segment the modern Chinese side of the parallel corpus into words, and split the ancient Chinese side character by character. Then align the ancient Chinese with the modern Chinese using IBM Model 3.
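
The character-level split of the ancient Chinese side in Step 1 can be sketched as below. This is only an illustration, not the patent's implementation: the class name `AncientCharSplitter` and its methods are hypothetical, and the space-separated output format is an assumption about what a word aligner would take as input. The alignment with IBM Model 3 itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an ancient Chinese sentence into single characters. */
final class AncientCharSplitter {

    /** Returns each code point of the sentence as its own token, skipping whitespace. */
    static List<String> splitByCharacter(String sentence) {
        List<String> tokens = new ArrayList<>();
        // Iterate by code point so rare characters outside the BMP are not cut in half.
        sentence.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    /** Joins the tokens with single spaces (assumed aligner input format). */
    static String toAlignerLine(String sentence) {
        return String.join(" ", splitByCharacter(sentence));
    }
}
```

Each line produced by `toAlignerLine` would then be paired with the word-segmented modern Chinese translation of the same sentence before alignment.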

[0027] Step 2: Preprocess the alignment results obtained in Step 1 to eliminate the interference of punctuation marks and adverbs:

[0028] (1) Check the alignment results obtained in Step 1 one by one, and delete any result whose alignment probability is less than or equal to zero, or in which the single ancient Chinese character or its corresponding modern Chinese token is a non-Chinese character;

[0029] (2) Check the part of speech of two wor...
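
Sub-step (1) of the preprocessing can be sketched as a simple filter. The sketch below rests on assumptions: `AlignmentLink` and `AlignmentFilter` are hypothetical names, the aligner output is assumed to be available as (ancient character, modern word, probability) triples, and "non-Chinese character" is read as "any token containing a code point outside the Han script", which covers punctuation. The part-of-speech check of sub-step (2) is not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container for one link produced by the word aligner. */
final class AlignmentLink {
    final String ancient;      // one ancient Chinese character
    final String modern;       // the modern Chinese word it is aligned to
    final double probability;  // alignment probability reported by the aligner

    AlignmentLink(String ancient, String modern, double probability) {
        this.ancient = ancient;
        this.modern = modern;
        this.probability = probability;
    }
}

final class AlignmentFilter {

    /** True if the string is non-empty and every code point belongs to the Han script. */
    static boolean isAllHan(String s) {
        return !s.isEmpty() && s.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Keeps only links with positive probability whose two sides are Chinese characters. */
    static List<AlignmentLink> filter(List<AlignmentLink> links) {
        List<AlignmentLink> kept = new ArrayList<>();
        for (AlignmentLink link : links) {
            if (link.probability <= 0.0) {
                continue; // assumed rule: probability less than or equal to zero
            }
            if (!isAllHan(link.ancient) || !isAllHan(link.modern)) {
                continue; // assumed rule: punctuation or other non-Chinese tokens
            }
            kept.add(link);
        }
        return kept;
    }
}
```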

Abstract

The invention relates to the technical field of natural language processing, and specifically to a word segmentation method for historical classics based on word alignment. The method comprises the following steps: first, segment the modern Chinese in a parallel corpus into words, split the ancient Chinese character by character, and align the ancient Chinese with the modern Chinese using IBM Model 3; second, process the alignment result from the previous step to eliminate the interference of punctuation marks and adverbs; third, merge ancient Chinese characters into words according to the processed alignment result; finally, verify the words of three or more characters in the segmentation result. Without requiring an annotated ancient Chinese corpus, the method segments historical classics effectively, and its segmentation accuracy is greatly improved compared with a segmentation method trained on an annotated modern Chinese corpus.
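
One way to picture the merging step is sketched below. The abstract does not spell out the merging rule, so the rule used here is an assumption: adjacent ancient characters that are aligned to the same modern word are joined into one ancient word, and unaligned characters (index -1) stay on their own. The class `AlignmentMerger` and the `alignedWord` array are hypothetical, and the final verification of words with three or more characters is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class AlignmentMerger {

    /**
     * Merges adjacent characters that share the same aligned modern word.
     *
     * @param chars       ancient sentence, one character per element
     * @param alignedWord index of the modern word each character aligns to, or -1 if none
     */
    static List<String> merge(List<String> chars, int[] alignedWord) {
        if (chars.size() != alignedWord.length) {
            throw new IllegalArgumentException("one alignment index per character is required");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chars.size(); i++) {
            boolean sameAsPrevious = i > 0
                    && alignedWord[i] >= 0
                    && alignedWord[i] == alignedWord[i - 1];
            // Close the current word whenever the alignment target changes.
            if (!sameAsPrevious && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(chars.get(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For example, if 项 and 羽 both align to modern word 0 while 大 and 怒 align to words 1 and 2, the call returns the three words 项羽, 大 and 怒.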

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method for word segmentation of historical classics based on word alignment.

Background

[0002] Chinese word segmentation refers to the process of recombining a continuous sequence of Chinese characters into a sequence of words according to a given standard. Word segmentation is the step of natural language processing that moves from characters to words, and it underpins word-level processing such as text classification and information retrieval. The main existing word segmentation methods are rule-based methods and statistical methods. Many word segmentation methods have achieved relatively good results on modern Chinese, and most of the algorithms and their commercial implementations have reached a very high level. Compared with modern Chinese, ancient Chinese is more concise and c...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 车超, 吴晓婷
Owner: 大连痛点科技有限公司