WeChat public platform-based Chinese-Mongolian corpus crowdsourcing construction method

A construction method based on the WeChat public platform, applied in the field of corpus resource construction. It addresses the large investment and high cost of collecting spoken-language data, and achieves simple interaction, mitigation of adverse effects, and a good user experience.

Active Publication Date: 2019-11-19
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to overcome the technical defects of existing Chinese-Mongolian corpus crowdsourcing construction methods, namely the high cost and large investment of collecting spoken language in real scenes, by providing a Chinese-Mongolian corpus crowdsourcing construction method based on the WeChat public platform.


Examples

Embodiment Construction

[0032] The method in this embodiment of the present invention is described in detail below with reference to the accompanying drawings.

[0033] The Chinese-Mongolian corpus crowdsourcing construction method based on the WeChat public platform used in this embodiment is shown in Figure 1. The specific steps are as follows:

[0034] In this embodiment, the corpus comprises both text corpora and speech corpora, including Chinese-Mongolian bilingual aligned corpora and monolingual text aligned corpora for the fields of machine translation and natural language processing.

[0035] Step A. Preprocess the original text according to Definition 1 and Definition 2. The specific preprocessing procedure varies with the translation direction; its purpose is to standardize the corpus. This embodiment does not limit the source of the corpus; the text is segmented and meaningless data is deleted. Table 1 is an example of the origi...
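The segmentation and cleaning described in Step A can be illustrated with a minimal Python sketch. Definition 1 and Definition 2 are not reproduced in this excerpt, so the rules below are assumptions made only for illustration: sentences are split at sentence-ending punctuation, and an entry counts as "meaningless" when it contains fewer than `min_cjk` Chinese characters. The function name `preprocess` and its parameter are hypothetical.

```python
import re

# Assumption: a sentence is a run of text ending at Chinese or Western
# sentence-final punctuation. The patent's own definitions are not shown here.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def preprocess(raw_text, min_cjk=2):
    """Segment raw Chinese text into sentences and drop meaningless entries.

    "Meaningless" is an illustrative rule: fewer than `min_cjk` Chinese
    characters (for example, bare numbers or stray punctuation).
    """
    cleaned = []
    for piece in _SENTENCE.findall(raw_text):
        sentence = piece.strip()
        if len(_CJK_CHAR.findall(sentence)) >= min_cjk:
            cleaned.append(sentence)
    return cleaned


if __name__ == "__main__":
    for s in preprocess("你好吗？ 12345！今天天气很好。"):
        print(s)
```

A production pipeline would replace the regular expressions with whatever segmentation rules the two definitions actually prescribe, and would branch on the translation direction as the paragraph above notes.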


Abstract

The invention discloses a Chinese-Mongolian corpus crowdsourcing construction method based on the WeChat public platform, belonging to the field of corpus resource construction. The method comprises the following steps: 1) obtaining a segmented, multi-genre, open-domain original corpus; 2) screening the users who participate in the translation task through a Mongolian proficiency test questionnaire; 3) sending crowdsourced translation tasks to users who follow the WeChat official account by means of subscription-account push messages; 4) having each WeChat client user translate one or more source sentences into Mongolian and return the translation to the back end as a voice message; 5) evaluating corpus quality by combining back-end administrator auditing with crowdsourced quality evaluation, thereby achieving corpus quality control. The method completes corpus collection online, offers simple interaction, a good user experience, and a high degree of user participation, effectively solves the problem of collecting open-domain natural spoken-language corpora in a real Mongolian language environment, and shows strong practical prospects on mobile Internet platforms.
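Steps 2 and 5 of the abstract (user screening and combined quality control) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure: the abstract does not say how the questionnaire is scored or how administrator auditing and crowd evaluation are combined, so the pass mark, the rating scale, the minimum vote count, and the rule "administrator approval AND average crowd score above a threshold" are all hypothetical, as are the names `Submission`, `screen_users`, and `accept`.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Submission:
    source_sentence: str     # Chinese source sentence pushed to the user
    audio_file: str          # hypothetical path of the Mongolian voice reply
    admin_approved: bool     # result of the back-end administrator audit
    crowd_scores: List[int]  # hypothetical crowd ratings, e.g. on a 1-5 scale


def screen_users(test_scores: Dict[str, int], pass_score: int = 60) -> List[str]:
    """Keep users whose Mongolian questionnaire score reaches the pass mark."""
    return [user for user, score in test_scores.items() if score >= pass_score]


def accept(sub: Submission, min_crowd_score: float = 3.5, min_votes: int = 2) -> bool:
    """Assumed combination rule: admin approval AND enough good crowd ratings."""
    if not sub.admin_approved:
        return False
    if len(sub.crowd_scores) < min_votes:
        return False
    return mean(sub.crowd_scores) >= min_crowd_score


if __name__ == "__main__":
    print(screen_users({"user_a": 80, "user_b": 40, "user_c": 60}))
    print(accept(Submission("你好", "reply_001.amr", True, [4, 5, 3])))
```

Requiring both signals is the strictest way to combine them; a weighted score or an administrator override would be equally plausible readings of the abstract.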

Description

Technical field

[0001] The invention relates to a crowdsourcing construction method for a Chinese-Mongolian corpus based on the WeChat public platform, and belongs to the technical field of corpus resource construction.

Background technique

[0002] Because existing Mongolian corpora are limited in type and small in size, the construction of Chinese-Mongolian spoken-language corpora has gradually become an important research topic in natural language research; in particular, research on resource construction methods affects related research on large-scale corpus resources. In addition, because the complexity of the Mongolian language itself leads to non-uniform text encoding, using a speech corpus as the entry point is a feasible way to build corpus resources. However, current Chinese-Mongolian spoken-language corpora are constructed using expert annotation methods that require a lot of manpower and material resources,...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06Q10/10, G06Q10/06
CPC: G06Q10/101, G06Q10/06395, G06Q10/06398
Inventor: 史树敏, 苏日海, 廖乐健, 黄河燕
Owner: BEIJING INSTITUTE OF TECHNOLOGY