Multilingual word segmentation method based on dictionaries and grammar analysis

A word segmentation method based on grammar analysis, applied in natural language data processing, special data processing applications, instruments, etc. It addresses garbled characters and words with little representative meaning in the text to be segmented, and reduces the storage space the text requires.

Inactive Publication Date: 2017-03-22
BEIJING SCISTOR TECH +1
3 Cites, 15 Cited by

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a multilingual word segmentation method and system based on dictionaries and grammatical analysis. It overcomes the limitation that existing tools can segment only a single language or a few individual languages, and it combines dictionary matching with grammatical analysis to segment text in different languages. The purpose of word segmentation is to ensure that text can be efficiently decomposed into representative words. Some users need the text content decomposed accurately, which means ambiguous words must be disambiguated; the present invention therefore applies grammatical analysis to disambiguate the ambiguous words produced by dictionary matching. In addition, the text to be segmented may contain garbled characters or stop words with little representative meaning. The present invention filters these out to keep the text readable and efficiently searchable while reducing the storage space it requires.
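The dictionary-matching and filtering steps described above can be sketched as follows. This is a minimal illustration, not the patented method: forward maximum matching is a common dictionary strategy swapped in here as an assumption, the function names `forward_max_match` and `filter_tokens` are hypothetical, and garbled text is approximated by the Unicode replacement character U+FFFD. The grammar-based disambiguation step is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary matching: at each position take the longest
    dictionary entry (up to max_len characters); fall back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def filter_tokens(tokens, stop_words):
    """Drop stop words, whitespace, and tokens containing U+FFFD
    (used here as a stand-in for garbled characters)."""
    return [
        t for t in tokens
        if t not in stop_words and "\ufffd" not in t and not t.isspace()
    ]


# Hypothetical dictionary and stop-word list, for illustration only.
dictionary = {"北京", "大学", "北京大学", "学生"}
stop_words = {"的"}
tokens = filter_tokens(forward_max_match("北京大学的学生", dictionary), stop_words)
```

Greedy matching of this kind is exactly where ambiguous splits arise, which is why the text pairs dictionary matching with a separate disambiguation stage.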


Examples

Embodiment Construction

[0032] In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be described in detail below in conjunction with the accompanying drawings and embodiments.

[0033] As shown in Figure 1, according to the first aspect of the present invention, a new word segmentation framework is adopted. The framework embeds a sub-tokenizer for Chinese, Japanese, Korean and Cantonese, a Chinese quantifier sub-tokenizer, and a Western-language sub-tokenizer, so that text in each type of language can be identified and segmented accurately. A built-in language-segment encoding recognition mechanism splits the text to be segmented into fragments; each fragment corresponds to one language family and is segmented by the corresponding sub-tokenizer. It contains an extended dictionary configuration management u...
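A rough sketch of the fragment-and-dispatch idea in this paragraph is given below. It is illustrative only: the excerpt does not state which code point ranges the invention uses, so the ranges here (CJK Unified Ideographs, Hiragana and Katakana, Hangul Syllables) are assumptions, and `script_of`, `split_runs` and `segment` are hypothetical names. Ideographs outside the basic block would be misclassified by this simplification.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by code point (assumed, simplified ranges)."""
    cp = ord(ch)
    if (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7A3):  # Hangul Syllables
        return "cjk"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def split_runs(text):
    """Split text into maximal runs of characters sharing one class."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


def segment(text, sub_tokenizers):
    """Send each run to the sub-tokenizer registered for its class;
    runs with no registered handler are dropped."""
    tokens = []
    for script, run in split_runs(text):
        handler = sub_tokenizers.get(script)
        if handler:
            tokens.extend(handler(run))
    return tokens


# Placeholder sub-tokenizers, standing in for the real ones.
handlers = {
    "cjk": list,                      # one token per character
    "letter": lambda s: [s.lower()],  # whole word, lower-cased
    "digit": lambda s: [s],           # keep the number intact
}
result = segment("我爱NLP 2017年", handlers)
```

Keeping the classifier separate from the handlers means a new language family can be supported by adding one range check and one handler, without touching the dispatch loop.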

Abstract

The invention discloses a multilingual word segmentation method based on dictionaries and grammar analysis. The method segments mixed texts of Chinese, Japanese, Korean, Cantonese and other languages efficiently and accurately, allows the lexicon to be expanded flexibly with words for different time periods and different professional fields, and keeps lexicon information up to date. A sub-tokenizer for Chinese, Japanese, Korean, Cantonese and related language families, a Chinese quantifier sub-tokenizer and a Western-language sub-tokenizer are embedded to segment text in each language accurately. A built-in language-segment encoding recognition mechanism splits the text to be segmented into fragments; each fragment corresponds to one language family and is segmented by the corresponding sub-tokenizer. Grammar analysis enables segmentation of Western inflectional languages and smart-mode segmentation of Chinese, Japanese, Korean and Cantonese, and texts containing Arabic numerals can also be processed. The method can likewise segment texts that mix several languages, removing the limitation that a word segmentation tool handles only a single language or a few individual languages, and ensuring the security, accuracy, efficiency, flexibility and universality of text segmentation. The method has broad application prospects in areas that depend on text segmentation, such as classification of massive text data, text information extraction and automatic summarization.

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and in particular relates to a multilingual word segmentation method based on dictionaries and grammar analysis in which languages are distinguished by their Unicode encoding.

Background technique

[0002] With the advent of the information age, more and more information can be viewed and retrieved, and the value of the search market continues to grow. More and more enterprises are looking for more powerful natural language processing tools, such as automatic summarization, automatic text retrieval and automatic text classification, and automatic word segmentation is one of the core technologies behind these tools. Word segmentation, as the name implies, means automatically segmenting text with the help of a computer so that the result correctly expresses the intended meaning without losing information. As long as it i...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/205, G06F40/279
Inventor: 王宇, 徐晓燕, 周渊, 刘庆良, 郑彩娟, 黄成, 王海平, 周游, 陈婷婷
Owner: BEIJING SCISTOR TECH