Multilingual word segmentation method based on dictionaries and grammar analysis
A word segmentation method based on grammar analysis, applied in natural language data processing, special data processing applications, instruments, etc. It addresses garbled characters and words with little representative meaning, and reduces the storage space required for the text.
Inactive Publication Date: 2017-03-22
BEIJING SCISTOR TECH +1
AI Technical Summary
Problems solved by technology
[0004] The present invention provides a multilingual word segmentation method and system based on dictionaries and grammatical analysis. It overcomes the limitation that existing tools can segment only a single language or a few individual languages, and uses dictionary matching combined with grammatical analysis to segment text in different languages. The purpose of word segmentation is to decompose text efficiently into representative words. Some users also need the text decomposed accurately, which means that ambiguous words must be disambiguated; the invention therefore applies grammatical analysis to the ambiguous words matched by the dictionary. In addition, the text to be segmented may contain garbled characters or stop words with little representative meaning. The invention filters these out, which keeps the text readable and efficiently searchable and reduces the storage space it requires.
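The dictionary-matching and filtering steps described above can be illustrated with a minimal sketch. The source does not say which matching strategy is used, so forward maximum matching is an assumption here, and the names `forward_max_match`, `is_garbled`, `filter_tokens`, `DICTIONARY` and `STOP_WORDS` are hypothetical.

```python
import unicodedata

def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (an assumed strategy, not taken from the patent).

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def is_garbled(token):
    """Treat U+FFFD, control, unassigned and private-use characters as garbled."""
    return any(
        ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Cn", "Co"}
        for ch in token
    )

def filter_tokens(tokens, stop_words):
    """Drop stop words, garbled tokens and whitespace-only tokens."""
    return [
        t for t in tokens
        if t not in stop_words and not is_garbled(t) and not t.isspace()
    ]

# Toy data, purely illustrative.
DICTIONARY = {"北京", "大学", "北京大学", "的", "学生"}
STOP_WORDS = {"的"}

tokens = forward_max_match("北京大学的学生\ufffd", DICTIONARY)
kept = filter_tokens(tokens, STOP_WORDS)
print(tokens)  # ['北京大学', '的', '学生', '\ufffd']
print(kept)    # ['北京大学', '学生']
```

Filtering after matching means fewer tokens reach the index, which is one plausible reading of the storage saving the paragraph mentions.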
Embodiment Construction
[0032] In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be described in detail below in conjunction with the accompanying drawings and embodiments.
[0033] As shown in Figure 1, according to the first aspect of the present invention, a new word segmentation framework is adopted. By embedding a sub-segmenter for Chinese, Japanese, Korean and Cantonese, a Chinese quantifier sub-segmenter and a Western-language sub-segmenter, the proposed system can segment text of each language type accurately. A built-in language-segment code recognition mechanism splits the text to be segmented into fragments so that each fragment corresponds to one language family, and the corresponding sub-segmenter is then used for word segmentation; it contains an extended dictionary configuration management u...
Abstract
The invention discloses a multilingual word segmentation method based on dictionaries and grammar analysis. It segments mixed texts of Chinese, Japanese, Korean, Cantonese and other languages efficiently and accurately, and supports flexible lexicon expansion for different time periods and different professional fields, so that lexicon information can be updated effectively. A sub-segmenter for Chinese, Japanese, Korean, Cantonese and other language families, a Chinese quantifier sub-segmenter and a Western-language sub-segmenter are embedded to segment the text of each language accurately. A built-in language-segment code recognition mechanism splits the text to be segmented so that each resulting segment corresponds to one language family, and the corresponding sub-segmenter performs the segmentation. Grammar analysis enables segmentation of Western inflectional languages and smart-mode segmentation of Chinese, Japanese, Korean and Cantonese, and texts containing Arabic numerals can also be processed. The method can also segment texts that mix several languages, which removes the limitation that a segmentation tool handles only a single language or a few individual languages and ensures the security, accuracy, efficiency, flexibility and universality of text segmentation. The method has broad application prospects in text segmentation tasks such as large-scale text classification, text information extraction and automatic summarization.
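The abstract does not spell out how the grammar analysis resolves overlapping dictionary matches, so the sketch below substitutes a simple, clearly different heuristic in its place: among all segmentations built from dictionary words and single characters, prefer the one with the fewest tokens, breaking ties by the fewest single-character tokens. The function name `disambiguate`, the toy `LEXICON` and the `max_len` parameter are hypothetical.

```python
def disambiguate(text, dictionary, max_len=4):
    """Pick a segmentation by dynamic programming.

    Heuristic stand-in (not the patent's grammar analysis): minimise the
    number of tokens, then the number of single-character tokens.
    """
    n = len(text)
    # best[i] = (cost, tokens) for the best segmentation of text[:i],
    # where cost = (token_count, single_character_token_count).
    best = [None] * (n + 1)
    best[0] = ((0, 0), [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Multi-character pieces must be dictionary words;
            # single characters are always allowed as a fallback.
            if len(piece) > 1 and piece not in dictionary:
                continue
            prev_cost, prev_tokens = best[start]
            cost = (prev_cost[0] + 1, prev_cost[1] + (1 if len(piece) == 1 else 0))
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, prev_tokens + [piece])
    return best[n][1]

# Toy lexicon with an overlap ambiguity: 研究生 ("graduate student") vs 研究 + 生命.
LEXICON = {"研究", "研究生", "生命", "起源"}

print(disambiguate("研究生命起源", LEXICON))
# ['研究', '生命', '起源']

# Lexicon expansion: adding a domain term changes the result without code changes.
print(disambiguate("研究生命起源", LEXICON | {"生命起源"}))
# ['研究', '生命起源']
```

A greedy longest match would start with 研究生 and strand the character 命; scoring whole segmentations avoids that, and the second call shows how an extended lexicon entry is picked up at once.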
Description
Technical field
[0001] The invention belongs to the field of natural language processing, and in particular relates to a multilingual word segmentation method based on dictionaries and grammar analysis, in which languages are discriminated by their Unicode encoding.
Background
[0002] With the advent of the information age, more and more information can be viewed and retrieved, and the value of the search market continues to grow. More and more enterprises are looking for more powerful natural language processing tools, such as automatic summarization, automatic text retrieval and automatic text classification, and automatic word segmentation is one of the core technologies of these tools. Word segmentation, as the name implies, uses a computer to segment text automatically so that it correctly expresses the intended meaning without losing information. As long as it i...