Processing method of unregistered words in Chinese dependency tree bank

A method for processing unregistered words, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc., that addresses problems such as coarse information granularity and sparse treebank data.

Inactive Publication Date: 2016-04-06
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

[0005] To address treebank data sparseness in dependency parsing and the coarse information granularity caused by smoothing with part-of-speech information, the present invention provides a method for processing unregistered words in a Chinese dependency treebank. The method maps unregistered words in the treebank to known words, so that a unit pair (word class, word class) can be restored to (word class, word) or (word, word class). This refines the information granularity and alleviates data sparseness without expanding the data scale, improving the performance of dependency parsing.

Examples

Embodiment Construction

[0025] The specific embodiments of the present invention will be described in further detail below in conjunction with the drawings and embodiments. The following examples are used to illustrate the present invention, but not to limit the scope of the present invention.

[0026] S10. Use the Tongyici Cilin (Synonym Forest) thesaurus to find all synonyms of each unregistered word.

[0027] Look up each unregistered word found in the dependency treebank. Following the five-level coding scheme of the Tongyici Cilin (Extended Edition), retrieve all words that share the unregistered word's five-level code and whose eighth character (the flag) is "="; these words are taken as the synonyms of the unregistered word.
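The lookup in step S10 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the thesaurus is available as text lines of the form "8-character code, then space-separated words", and the entries in SAMPLE_CILIN, the codes, and the function names load_cilin and find_synonyms are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy entries in the assumed line format; the codes and
# groupings are made up for illustration, not copied from the thesaurus.
SAMPLE_CILIN = """\
Aa01A01= 人 人物 人士
Bb02B03# 苹果 香蕉
Cc03C01= 高兴 快乐 愉快
"""


def load_cilin(text):
    """Parse thesaurus lines into (code -> words) and (word -> codes) maps."""
    groups = {}
    word_codes = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        code, words = parts[0], parts[1:]
        groups[code] = words
        for word in words:
            word_codes[word].append(code)
    return groups, word_codes


def find_synonyms(word, groups, word_codes):
    """Return words sharing the five-level code with `word` in groups
    whose eighth character (the flag) is '='."""
    synonyms = []
    for code in word_codes.get(word, []):
        # Characters 1-7 hold the five-level code; character 8 is the flag.
        if len(code) == 8 and code[7] == "=":
            for candidate in groups[code]:
                if candidate != word and candidate not in synonyms:
                    synonyms.append(candidate)
    return synonyms


groups, word_codes = load_cilin(SAMPLE_CILIN)
print(find_synonyms("高兴", groups, word_codes))
```

Groups flagged with a character other than "=" are skipped, so a word that only appears in such a group, or that is missing from the thesaurus, yields an empty list.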

[0028] S20. Calculate the character pattern (glyph) similarity between the unregistered word and each of its synonyms according to the features of the Chinese characters.

[0029] Write the words as (sw_1, sw_2, ..., sw_n); each word can be represented by a Chinese-character vector whose components are either 0 or the frequency of the corresponding character in the word. The unregistered words in the treebank are written uw_1, uw_2, ..., ...
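The character-vector representation can be illustrated with a short sketch. The source text is truncated before it gives the similarity formula, so the cosine measure below is an assumption chosen for illustration, not the patent's model; the function names and the example words are likewise hypothetical. Picking the highest-scoring synonym mirrors the selection of the optimal mapped word described in the abstract.

```python
import math
from collections import Counter


def char_vector(word):
    """Sparse character vector: each character maps to its frequency in
    the word; characters not in the word implicitly count as 0."""
    return Counter(word)


def char_similarity(word_a, word_b):
    """Cosine similarity of two character vectors.
    ASSUMPTION: cosine is used here only as a stand-in measure."""
    va, vb = char_vector(word_a), char_vector(word_b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_mapped_word(unregistered, synonyms):
    """Pick the synonym with the highest similarity, or None if the
    synonym list is empty."""
    if not synonyms:
        return None
    return max(synonyms, key=lambda s: char_similarity(unregistered, s))


# Hypothetical example words, chosen only to show shared characters.
print(best_mapped_word("计算机", ["电脑", "计算器"]))
```

Under this stand-in measure, a synonym that shares no characters with the unregistered word scores 0, so candidates with overlapping characters are preferred.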

Abstract

The invention belongs to the field of natural language processing in computational linguistics and discloses a method for processing unregistered words in Chinese dependency treebanks. The method includes the steps of: A, finding all synonyms of an unregistered word using the Tongyici Cilin (Synonym Forest) thesaurus; B, computing the character pattern similarity between the unregistered word and each of its synonyms according to the character pattern features of Chinese characters; C, when the unregistered word has a high character pattern similarity with several synonyms, extracting the information content of the mapped words and of their word classes to refine the similarity computation model; D, taking the word with the maximum character pattern similarity as the optimal mapped word of the unregistered word and using it as the interpretation of the unregistered word in the treebank. The advantage of the method is that, without further expanding the treebank, unit pairs (word class, word class) in dependency parsing can be restored to unit pairs (word class, word) or (word, word class), which refines the information granularity, alleviates data sparseness, and improves dependency parsing performance.

Description

Technical field

[0001] The invention relates to a method for processing unregistered words in a Chinese dependency grammar treebank. It interprets unregistered words by means of known words in the treebank, and belongs to the field of natural language processing in computational linguistics.

Background technique

[0002] Syntactic analysis is one of the core problems of natural language processing, and its performance directly affects the correctness and effectiveness of automatic understanding of natural language sentences. Dependency parsing is easier to handle than phrase-structure parsing and has received widespread attention in recent years. Many countries are now building and developing treebanks for their own languages. As the strong disambiguating power of words themselves is increasingly exploited, more and more statistical dependency parsing models are becoming lexicalized.

[0003] Vocabulary is the most distinguishing informati...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 吕学强, 郑略省, 王玥, 关晓炟
Owner: BEIJING INFORMATION SCI & TECH UNIV