Lao-Chinese bilingual corpus construction method and device with Thai as the pivot language

A bilingual corpus construction technology, applied in the field of natural language processing, that addresses problems such as the scarcity of Lao-Chinese bilingual parallel resources and the difficulty of obtaining them.

Active Publication Date: 2020-01-21
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0002] Corpus construction is a prerequisite for natural language processing research. A Lao-Chinese bilingual corpus is an important data resource for Chinese-Lao machine translation and cross-language retrieval. Lao is a low-resource language among Southeast Asian languages. Lao-Chinese bilingual parallel resources are relatively scarce, and it is difficult to obtain them directly from the Internet.



Examples


Embodiment 1

[0060] Embodiment 1: As shown in Figures 1-6, a Lao-Chinese bilingual corpus construction method with Thai as the pivot includes the following steps:

[0061] Step1. Extract Thai sentences from the existing Chinese-Thai parallel corpus data and perform Thai word segmentation processing;

[0062] As a preferred solution of the present invention, the specific steps of Step1 are as follows:

[0063] Step1.1. Select Thai sentences with 20-50 characters from the existing Chinese-Thai bilingual parallel corpus;

[0064] Step1.2. Segment the selected Thai sentences into words, for example with the language information processing platform for small Southeast Asian languages developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/.
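The selection in Step1.1 can be sketched as a simple length filter. This is only a minimal illustration under stated assumptions: the function name `select_thai_sentences`, the pair layout `(chinese, thai)`, and the toy data are hypothetical, the 20-50 range is treated as inclusive, and length is counted in Unicode code points because the text does not say how characters are counted.

```python
def select_thai_sentences(pairs, min_chars=20, max_chars=50):
    """Keep (chinese, thai) pairs whose Thai side has min_chars..max_chars characters.

    Assumption: length is the number of Unicode code points after stripping
    surrounding whitespace, and both bounds are inclusive.
    """
    selected = []
    for chinese, thai in pairs:
        thai = thai.strip()
        if min_chars <= len(thai) <= max_chars:
            selected.append((chinese, thai))
    return selected


# Toy data: repeated placeholder characters stand in for real Thai sentences.
pairs = [
    ("短句", "ก" * 10),        # too short, dropped
    ("合适的句子", "ก" * 30),  # within range, kept
    ("长句", "ก" * 60),        # too long, dropped
]
selected = select_thai_sentences(pairs)
```

Counting grapheme clusters instead of code points would give different lengths for Thai text with combining vowels and tone marks, so the counting rule is worth fixing explicitly in a real pipeline.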

[0065] The present invention takes into account that Thai is written as a continuous script with no marked word boundaries, so word-based translation and use in the model cannot be per...


Abstract

The invention relates to a method and device for constructing a Lao-Chinese bilingual corpus with Thai as the pivot language, and belongs to the field of natural language processing. The method first performs Thai word segmentation on Chinese-Thai parallel corpus data. It then constructs a Lao-Thai bilingual dictionary and uses it to translate the Thai sentences word by word into Lao sentence sequences, yielding candidate Lao-Thai parallel sentence pairs. Next, it builds a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM and classifies the candidates to obtain Lao-Thai bilingual parallel sentence pairs. Finally, using Thai as the pivot language, it matches the Lao sentences with the Chinese sentences to construct a Lao-Chinese bilingual parallel corpus. The device addresses the scarcity of Lao-Chinese corpora and has theoretical significance and practical application value for the construction of Lao-Chinese bilingual corpora.
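The pipeline in the abstract can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: all function names are hypothetical, the dictionary entries are toy examples, and the bidirectional-LSTM classifier is replaced by a plain dictionary-coverage check (`coverage_filter`) so that the sketch stays self-contained.

```python
def translate_word_by_word(thai_tokens, thai_lao_dict, unknown="<unk>"):
    """Map each segmented Thai token to Lao by dictionary lookup."""
    return [thai_lao_dict.get(token, unknown) for token in thai_tokens]


def coverage_filter(lao_tokens, unknown="<unk>", min_coverage=0.8):
    """Hypothetical stand-in for the classifier: accept a candidate when
    enough of its tokens were found in the dictionary."""
    if not lao_tokens:
        return False
    known = sum(1 for token in lao_tokens if token != unknown)
    return known / len(lao_tokens) >= min_coverage


def build_lao_chinese_pairs(chinese_thai_pairs, thai_lao_dict, accept=coverage_filter):
    """Pivot on Thai: each accepted Lao sequence inherits the Chinese
    sentence aligned with the Thai sentence it was derived from."""
    lao_chinese = []
    for chinese, thai_tokens in chinese_thai_pairs:
        lao_tokens = translate_word_by_word(thai_tokens, thai_lao_dict)
        if accept(lao_tokens):
            lao_chinese.append((lao_tokens, chinese))
    return lao_chinese


# Toy Thai-to-Lao dictionary and Chinese-Thai corpus (illustrative only).
thai_lao_dict = {"ฉัน": "ຂ້ອຍ", "กิน": "ກິນ", "ข้าว": "ເຂົ້າ"}
chinese_thai_pairs = [
    ("我吃饭", ["ฉัน", "กิน", "ข้าว"]),   # fully covered, kept
    ("未登录词示例", ["xxx", "yyy"]),      # no dictionary coverage, dropped
]
lao_chinese_pairs = build_lao_chinese_pairs(chinese_thai_pairs, thai_lao_dict)
```

Passing the acceptance test as a parameter keeps the pivot step independent of how candidate pairs are judged, so a trained classifier could be substituted for `coverage_filter` without changing the rest of the sketch.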

Description

Technical field

[0001] The invention relates to a method and device for constructing a Lao-Chinese bilingual corpus with Thai as the pivot, and belongs to the technical field of natural language processing.

Background art

[0002] Corpus construction is a prerequisite for natural language processing research. A Lao-Chinese bilingual corpus is an important data resource for Chinese-Lao machine translation and cross-language retrieval. Lao is a low-resource language among Southeast Asian languages. Lao-Chinese bilingual parallel resources are relatively scarce, and it is difficult to obtain them directly from the Internet.

[0003] Both Lao and Thai belong to the Zhuang-Dai branch of the Zhuang-Dong language group of the Sino-Tibetan language family. Their basic vocabulary is nearly identical or similar, and their syntactic structures are also highly similar. The Chinese-Thai parallel corpus is relatively easy ...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F40/58, G06F40/289, G06F40/30, G06F16/33, G06F16/35
CPC: G06F16/3344, G06F16/35
Inventors: 毛存礼, 高旭, 余正涛, 高盛祥, 王振晗, 聂男
Owner KUNMING UNIV OF SCI & TECH