Integrated automatic lexical analysis method and system for ancient Chinese texts

An automatic lexical analysis method, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses the low accuracy, slow training speed, and difficulty of lexical analysis of ancient Chinese, and achieves simple operation, high efficiency, and a shorter training time.

Active Publication Date: 2019-05-31
NANJING NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

[0006] Purpose of the invention: In order to overcome the deficiencies in the prior art, the present invention provides an integrated automatic lexical analysis method for ancient Chinese texts, which can solve the p



Examples


Embodiment 1

[0037] As shown in Figure 1, ancient Chinese documents are written in traditional characters, and most ancient texts carry no sentence segmentation information. This makes ancient Chinese considerably harder to read and study.

[0038] Table 1 shows the text obtained by applying OCR (optical character recognition) to the scanned page in Figure 1:

[0039]

[0040] To conduct an integrated lexical analysis of this electronic document, the specific tasks are as follows:

[0041] (1) Automatically segment the text into sentences;

[0042] (2) Automatically segment the text into words;

[0043] (3) Determine the part of speech of words, such as nouns, verbs, etc.;

[0044] (4) Identify named entities such as person names and place names in ancient Chinese.
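One way to handle the tasks listed above in a single pass is to treat them as one character-level sequence labeling problem, where each character's label carries its position in the word, the word's part of speech, and whether it ends a sentence. The sketch below illustrates that idea only. The label layout (`B`/`I`/`E`/`S` position, a POS tag, and an `|EOS` suffix), the function name `encode_joint_labels`, and the tags `nr` and `v` are assumptions made for illustration and are not taken from the patent.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech tag)


def encode_joint_labels(sentences: List[List[Word]]) -> List[Tuple[str, str]]:
    """Turn segmented, POS-tagged sentences into per-character joint labels.

    Hypothetical scheme: <position>-<pos>[|EOS], where position is
    B (begin), I (inside), E (end) or S (single-character word), and
    |EOS marks the last character of a sentence.
    """
    labelled: List[Tuple[str, str]] = []
    for sentence in sentences:
        chars: List[Tuple[str, str]] = []
        for word, pos in sentence:
            if len(word) == 1:
                positions = ["S"]
            else:
                positions = ["B"] + ["I"] * (len(word) - 2) + ["E"]
            chars.extend((ch, f"{p}-{pos}") for ch, p in zip(word, positions))
        if chars:
            # Flag the sentence boundary on the final character.
            last_char, last_label = chars[-1]
            chars[-1] = (last_char, last_label + "|EOS")
        labelled.extend(chars)
    return labelled


if __name__ == "__main__":
    # Made-up example sentence with made-up tags.
    for ch, label in encode_joint_labels([[("孔子", "nr"), ("曰", "v")]]):
        print(ch, label)
```

With labels of this kind, a single tagger predicts sentence boundaries, word boundaries, and word classes together, so no intermediate output is passed from one sub-task to the next.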

[0045] The present invention uses an integrated analysis method to perform the above tasks simultaneously; the automatically annotated result is shown in Figure 2. Each word is separated by " / ", followed by the part-of-speech tag of the word, and each s...
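As a rough illustration of how annotated output of this kind could be read back into a program, the sketch below parses a line of `word/tag` tokens into pairs. It assumes tokens are separated by whitespace and that the tag follows the last `/` in each token; the exact layout in Figure 2 is not reproduced here, and the function name `parse_tagged` and the sample tags are hypothetical.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word/tag word/tag ...' into (word, tag) pairs.

    Assumed layout: whitespace between tokens, tag after the last '/'.
    """
    pairs: List[Tuple[str, str]] = []
    for token in line.split():
        # rpartition splits on the last "/", so a word that itself
        # contains "/" keeps its earlier slashes.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # Made-up example line with made-up tags.
    print(parse_tagged("孔子/nr 曰/v"))
```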


Abstract

The invention discloses an integrated automatic lexical analysis method for ancient Chinese texts. The method includes the following steps: pre-training ancient Chinese word vectors that carry semantic features using the Word2Vec model; adding information appearing in historical documents to an ancient name database to form a number of proper noun entries; adjusting each parameter of the Bi-LSTM-CRF neural network model; and preprocessing the final training corpus into a model-readable form, loading it into the neural network model, learning iteratively, and automatically evaluating the labeling result on the test corpus. The method adopts integrated tagging of sentence segmentation, word segmentation, and part-of-speech tagging, which removes the repeated tagging passes that separate lexical analysis sub-tasks require and avoids the propagation of tagging errors across multiple stages. Because a deep learning model is used, rich language features are learned automatically, so the manual design of feature templates required in traditional machine learning is no longer needed. The labeling model is accelerated with GPU hardware, which greatly shortens model training time and makes it much more efficient than a traditional machine learning model.
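The abstract does not state which metric is used to evaluate the labeling result on the test corpus. A common choice for joint word segmentation and part-of-speech tagging is span-level precision, recall, and F1, where a predicted word counts as correct only if both its boundaries and its tag match the gold annotation. The sketch below assumes that metric; the function names `to_spans` and `prf` and the sample tags are hypothetical.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech tag)


def to_spans(words: List[Word]) -> Set[Tuple[int, int, str]]:
    """Map a word sequence to (start, end, tag) character spans."""
    spans: Set[Tuple[int, int, str]] = set()
    start = 0
    for word, tag in words:
        end = start + len(word)
        spans.add((start, end, tag))
        start = end
    return spans


def prf(gold: List[Word], pred: List[Word]) -> Tuple[float, float, float]:
    """Span-level precision, recall and F1 over the same character string."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up gold and predicted analyses of the same three characters.
    gold = [("孔子", "nr"), ("曰", "v")]
    pred = [("孔", "n"), ("子", "n"), ("曰", "v")]
    print(prf(gold, pred))
```

Scoring spans rather than individual characters penalizes a segmentation error once per affected word, which matches how segmentation quality is usually reported.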

Description

Technical field

[0001] The invention relates to the technical field of text lexical analysis, in particular to an integrated automatic lexical analysis method and system for ancient Chinese texts.

Background art

[0002] Ancient book resources are abundant. Extracting more meaningful linguistic knowledge from digitized ancient texts is an important task of ancient Chinese information processing research. The basic task of ancient Chinese information processing is lexical analysis, which includes automatic sentence segmentation, automatic word segmentation, automatic part-of-speech tagging, and automatic named entity recognition. The quality of lexical analysis directly affects the performance of higher-level tasks that build on it. Unlike modern Chinese, information processing for ancient Chinese is still at an exploratory stage, and there has been comparatively little work on the automatic processing and analysis of ancient Chinese sentences and vocabulary.

[0003] The research results of Ch...


Application Information

IPC(8): G06F17/27
Inventor: 李斌, 程宁, 葛四嘉, 李成名, 郝星月, 冯敏萱, 许超
Owner: NANJING NORMAL UNIVERSITY