
A Word Segmentation Method for Historical Classics Based on Word Alignment

A word segmentation method built on word alignment technology, applicable to fields such as instrumentation, computing, and electrical digital data processing. It addresses the unsatisfactory results that existing segmentation methods give on ancient Chinese and improves segmentation accuracy.

Active Publication Date: 2020-06-30
大连痛点科技有限公司

AI Technical Summary

Problems solved by technology

Ancient Chinese currently lacks both a word segmentation dictionary and a large-scale segmented training corpus, so applying modern Chinese word segmentation methods directly to ancient Chinese does not give satisfactory results.



Examples


Embodiment 1

[0025] This embodiment uses Eclipse as the development platform and Java as the development language. The experiment is carried out on 4,145 sentence pairs of ancient prose and vernacular (modern) Chinese taken from the following chapters of the "Historical Records" (Shiji): "Benji of Qin Shihuang", "Benji of Qin", "Benji of Xiang Yu", "Benji of Gaozu" and "Benji of Lühou". The specific process is as follows:

[0026] Step 1: Segment the modern Chinese side of the parallel corpus into words, and split the ancient text character by character. Then use IBM Model 3 to align the ancient Chinese characters with the modern Chinese words.
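The character-by-character split of the ancient text in Step 1 can be illustrated with a minimal Java sketch. The class and method names (`AncientCharSplitter`, `splitByCharacter`) are hypothetical and do not come from the patent; the sketch also assumes that whitespace is dropped and that punctuation is kept as its own token, to be handled later in Step 2. The IBM Model 3 alignment itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits ancient Chinese text into single-character tokens. */
final class AncientCharSplitter {
    private AncientCharSplitter() {}

    /**
     * Returns one token per Unicode code point, skipping whitespace.
     * Iterating over code points (not chars) keeps rare characters
     * outside the Basic Multilingual Plane intact.
     */
    static List<String> splitByCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }
}
```

Each resulting token would then serve as one source-side unit for the word aligner, with the segmented modern Chinese words on the other side.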

[0027] Step 2: Preprocess the alignment result obtained in step 1 to eliminate the interference of punctuation and adverbs:

[0028] (1) Check the alignment results obtained in step 1 one by one, and delete those whose alignment probability is less than or equal to zero, as well as those in which the ancient Chinese character or the corresponding modern Chinese word is not a Chinese character;

[0029] (2) Check the part of speech of the two words or characters in each...
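Sub-step (1) of the preprocessing can be sketched in Java as follows. This is only an illustration under stated assumptions: the `Alignment` class, its fields, and the `AlignmentFilter` name are hypothetical, and "non-Chinese character" is interpreted here as any token containing a code point outside the Unicode Han script, which also excludes punctuation. The part-of-speech check in sub-step (2) is not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter for sub-step (1); not the patent's actual code. */
final class AlignmentFilter {
    private AlignmentFilter() {}

    /** One aligned pair: an ancient character, a modern word, and its probability. */
    static final class Alignment {
        final String ancient;
        final String modern;
        final double probability;

        Alignment(String ancient, String modern, double probability) {
            this.ancient = ancient;
            this.modern = modern;
            this.probability = probability;
        }
    }

    /** True if the string is non-empty and every code point is a Han character. */
    static boolean isAllHan(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(
            cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Keeps only pairs with positive probability and Han text on both sides. */
    static List<Alignment> filter(List<Alignment> input) {
        List<Alignment> kept = new ArrayList<>();
        for (Alignment a : input) {
            if (a.probability > 0.0 && isAllHan(a.ancient) && isAllHan(a.modern)) {
                kept.add(a);
            }
        }
        return kept;
    }
}
```

Because full-width punctuation such as "，" and "。" belongs to the Common script rather than Han, pairs involving punctuation are dropped by the same test.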


Abstract

The invention relates to the technical field of natural language processing, and specifically to a word segmentation method for historical classics based on word alignment. The method comprises the following steps: first, segmenting the modern Chinese side of a parallel corpus into words, splitting the ancient Chinese prose character by character, and aligning the ancient Chinese prose with the modern Chinese using IBM Model 3; second, processing the alignment result from the previous step to eliminate the interference of punctuation marks and adverbs; third, merging ancient Chinese characters into words according to the processed alignment result; and finally, verifying the words of three or more characters in the segmentation result. The invention achieves effective word segmentation of historical classics even though tagged ancient Chinese corpora are lacking. Compared with a segmentation method trained on tagged modern Chinese corpora, the word-alignment-based method greatly improves segmentation accuracy.
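The merging step can be illustrated with a short Java sketch. The abstract does not spell out the merge rule, so the sketch assumes one plausible reading: consecutive ancient characters aligned to the same modern word are joined into a single ancient word, and characters with no alignment stay as single-character words. The `AlignmentMerger` class, the `merge` method, and the use of `-1` for "unaligned" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical merger: joins adjacent characters aligned to the same modern word. */
final class AlignmentMerger {
    private AlignmentMerger() {}

    /**
     * @param chars     ancient text, one character per element
     * @param alignedTo for each character, the index of the modern word it is
     *                  aligned to, or -1 if it has no alignment (assumed encoding)
     * @return the ancient text regrouped into words
     */
    static List<String> merge(List<String> chars, int[] alignedTo) {
        if (chars.size() != alignedTo.length) {
            throw new IllegalArgumentException("chars and alignedTo must have equal length");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentTarget = -1;
        for (int i = 0; i < chars.size(); i++) {
            // Extend the current word only if this character is aligned
            // and shares its target with the previous character.
            boolean sameWord = current.length() > 0
                && alignedTo[i] >= 0
                && alignedTo[i] == currentTarget;
            if (!sameWord && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(chars.get(i));
            currentTarget = alignedTo[i];
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For a made-up input, the characters 项, 羽, 者, 下, 相, 人, 也 with alignment indices `{0, 0, -1, 1, 1, 2, -1}` would be regrouped as 项羽, 者, 下相, 人, 也. These indices are invented for illustration and are not alignment output reported in the patent.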

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a word segmentation method for historical classics based on word alignment.

Background technique

[0002] Chinese word segmentation refers to the process of recombining continuous sequences of Chinese characters into word sequences according to certain specifications. Word segmentation is an important step in natural language processing that moves from characters to words, and it underpins word-based text classification and information retrieval. The main existing word segmentation methods are rule-based and statistics-based. Many word segmentation methods achieve good results on modern Chinese, and most of the algorithms and their commercial implementations have reached a very high level. Ancient Chinese is more concise and compact than modern Chinese. In addition to historical classics and names, words...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F40/289
CPC: G06F40/289
Inventor: 车超, 吴晓婷
Owner: 大连痛点科技有限公司