Word segmentation method and device for ancient traditional Chinese medicine documents

A technology for ancient traditional Chinese medicine documents, applied in text database querying, unstructured text data retrieval, special data processing applications, and similar areas, that addresses the absence of a word segmentation device for the field of traditional Chinese medicine.

Active Publication Date: 2019-08-16
UNIV OF SCI & TECH BEIJING

AI Technical Summary

Problems solved by technology

Reasonable word segmentation of ancient Chinese medicine documents is the basis for structuring Chinese medicine knowledge, but currently there is no word segmentation device...



Embodiment Construction

[0069] Embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0070] As shown in Figure 1, a word segmentation method for ancient Chinese medicine literature according to the present invention includes:

[0071] Step 101, preprocessing ancient documents in the field of traditional Chinese medicine to generate a corpus for training a language model; wherein the step of preprocessing the ancient documents includes: obtaining the original text of the ancient documents, deleting the catalog of the ancient documents from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate the cleaned text; and adding a space after each word in the cleaned text to form the corpus for training the language model.

[0072] Step 102, training on the corpus to generate a language model;

[0073] Step 103, using the language model to perform unsupervised word s...
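The preprocessing in Step 101 can be illustrated with a minimal Python sketch. It is not the patent's implementation. Several details are assumptions made only for illustration: the catalog is taken to run from a line reading 目录 to the next blank line, sentences are split on the punctuation in `SENTENCE_END`, each "word" before segmentation is taken to be a single character, and all function names are hypothetical.

```python
import re

SENTENCE_END = "。！？；"  # assumed sentence delimiters, not given in the source


def strip_catalog(text: str) -> str:
    """Drop the catalog, assumed to run from a line '目录' to the next blank line."""
    kept, in_catalog = [], False
    for line in text.splitlines():
        if line.strip() == "目录":
            in_catalog = True
            continue
        if in_catalog:
            if not line.strip():
                in_catalog = False  # blank line closes the catalog
            continue
        kept.append(line)
    return "\n".join(kept)


def split_sentences(text: str) -> list[str]:
    """Split on the assumed delimiters, keeping each delimiter with its sentence."""
    return re.findall(rf"[^{SENTENCE_END}\s]+[{SENTENCE_END}]?", text)


def utf8_ok(sentence: str) -> bool:
    """True if every character in the sentence can be encoded as UTF-8."""
    try:
        sentence.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def build_corpus(raw_text: str) -> list[str]:
    """Clean the text, then separate characters with spaces for model training."""
    cleaned = [s for s in split_sentences(strip_catalog(raw_text)) if utf8_ok(s)]
    # " ".join puts a space between characters (no trailing space).
    return [" ".join(s) for s in cleaned]
```

In Python, a string fails the UTF-8 check only when it contains lone surrogates, which typically appear when a file was decoded with a lenient error handler. A real pipeline may need a different test depending on how the source files were read.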


Abstract

The embodiment of the invention discloses a word segmentation method and device for ancient traditional Chinese medicine documents. The method comprises: preprocessing ancient documents in the field of traditional Chinese medicine to generate a corpus for training a language model; training on the corpus to generate a language model; performing unsupervised word segmentation on the ancient documents using the language model to generate a preliminary word segmentation result; summarizing the preliminary word segmentation result according to word relations, fixed sentence patterns, and linguistic knowledge, and compiling segmentation rules into a rule file; and correcting the preliminary word segmentation result a first time according to the rules in the rule file to generate a first correction result.
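The rule-based correction step can be sketched as follows. The abstract does not describe the rule file format, so this example assumes a hypothetical one: each line reads `left => right`, both sides are space-separated tokens, and both sides must spell the same text, so a rule can only merge or split tokens. The function names and the sample rules are illustrative only.

```python
def parse_rules(rule_text: str) -> list[tuple[list[str], list[str]]]:
    """Parse hypothetical 'a b => ab' rule lines into (pattern, replacement) pairs."""
    rules = []
    for line in rule_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lhs, rhs = (side.split() for side in line.split("=>"))
        if not lhs or "".join(lhs) != "".join(rhs):
            raise ValueError(f"rule must re-segment the same text: {line!r}")
        rules.append((lhs, rhs))
    # Try longer patterns first so specific rules win over general ones.
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def apply_rules(tokens: list[str], rules) -> list[str]:
    """Scan left to right, replacing the first matching pattern at each position."""
    out, i = [], 0
    while i < len(tokens):
        for lhs, rhs in rules:
            if tokens[i:i + len(lhs)] == lhs:
                out.extend(rhs)
                i += len(lhs)
                break
        else:
            out.append(tokens[i])  # no rule matched; keep the token as is
            i += 1
    return out


rules = parse_rules("桂 枝 汤 => 桂枝汤\n主之 => 主 之")
corrected = apply_rules(["桂", "枝", "汤", "主之"], rules)
```

Requiring both sides of a rule to spell the same text keeps the correction from altering the document itself; only the token boundaries change.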

Description

Technical field

[0001] The invention relates to a word segmentation method for medical documents in the field of natural language processing, in particular to a word segmentation method and device for ancient Chinese medicine documents.

Background technique

[0002] Chinese word segmentation is a basic step in Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words, so word segmentation is a key first step in Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic document indexing, and automatic abstract generation. The correctness of the segmentation results directly affects the correctness of subsequent tasks.

[0003] In the field of traditional Chinese medicine, traditional Chinese medicine, which was born from the primitive society and has been constantly developing and chang...


Application Information

IPC(8): G06F16/33, G06F17/27
CPC: G06F16/3344, G06F40/216, G06F40/289
Inventor: 谢永红, 周越, 张德政, 阿孜古丽, 栗辉, 贾麒
Owner UNIV OF SCI & TECH BEIJING