Method for processing unknown words in Chinese-language dependency tree banks

A method for processing unregistered (unknown) words, applied in electrical digital data processing and special data processing applications, that addresses sparse treebank data and coarse information granularity.

Inactive Publication Date: 2014-03-26
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

[0005] To address treebank data sparseness in dependency parsing and the coarse information granularity caused by smoothing with part-of-speech information, the present invention provides a method for processing unregistered words in a Chinese dependency treebank. The method maps unregistered words in the treebank to known words, so that unit pairs can be restored to a finer information granularity. This alleviates data sparseness without expanding the data scale and improves the performance of dependency parsing.



Embodiment Construction

[0025] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.

[0026] S10. Use the synonym thesaurus Tongyici Cilin ("Synonyms Cilin") to find all synonyms of each unregistered word.

[0027] Find the unregistered words in the dependency treebank. Then, following the five-level coding scheme of the extended edition of Tongyici Cilin, collect all words that share the same five-level code as the unregistered word and whose code has "=" as its eighth character. These words are taken as the synonyms of the unregistered word.
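The lookup in S10 can be sketched as follows. This is a minimal illustration, assuming the thesaurus is available as text lines consisting of an 8-character code followed by the words under that code. The sample entries, the function names `load_cilin` and `find_synonyms`, and the file layout are assumptions made for this sketch and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical miniature sample in the assumed layout: code, then words.
# The entries are illustrative and not guaranteed to match the real resource.
SAMPLE_CILIN = """\
Aa01A01= 人 士 人物 人士 人氏 人选
Aa01A02= 人类 生人 全人类
Aa01B03# 良民 顺民
Aa01C07@ 居民
"""


def load_cilin(text):
    """Build code -> words and word -> codes indexes from thesaurus text."""
    code_to_words = {}
    word_to_codes = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        code, words = parts[0], parts[1:]
        code_to_words[code] = words
        for word in words:
            word_to_codes[word].append(code)
    return code_to_words, word_to_codes


def find_synonyms(word, code_to_words, word_to_codes):
    """Return words sharing a full code with `word` where the 8th character is '='."""
    synonyms = []
    for code in word_to_codes.get(word, []):
        # Only codes flagged "=" in the eighth position count as synonym groups.
        if len(code) == 8 and code[7] == "=":
            for candidate in code_to_words[code]:
                if candidate != word and candidate not in synonyms:
                    synonyms.append(candidate)
    return synonyms


code_to_words, word_to_codes = load_cilin(SAMPLE_CILIN)
synonyms_of_renwu = find_synonyms("人物", code_to_words, word_to_codes)
```

A word that appears only under a code whose eighth character is not "=" (such as 良民 in the sample) yields an empty list, so it would receive no mapping from this step.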

[0028] S20. Calculate the character-form (glyph) similarity between the unregistered word and each synonym according to the character-form features of Chinese characters.

[0029] All Chinese characters are represented by the vector (sw_1, sw_2, ..., sw_n), so that each word can be represented by a Chinese character...
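The character-vector idea in S20 can be illustrated with a small sketch. The similarity formula itself is truncated in this excerpt, so the code below does not reproduce the patent's model: it assumes a binary vector over the character inventory and uses cosine similarity purely as a stand-in. The names `char_vector`, `cosine`, and `glyph_similarity` are hypothetical.

```python
import math


def char_vector(word, inventory):
    """Binary vector: 1 where the inventory character occurs in the word."""
    return [1 if ch in word else 0 for ch in inventory]


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def glyph_similarity(word_a, word_b):
    """Stand-in similarity (an assumption, not the patent's formula).

    The inventory is restricted to the characters of the two words; all
    other dimensions would be zero in both vectors and cannot change the
    cosine value.
    """
    inventory = sorted(set(word_a) | set(word_b))
    return cosine(char_vector(word_a, inventory), char_vector(word_b, inventory))
```

Under this stand-in, two words that share one of their two characters score 0.5, identical words score 1.0, and words with no character in common score 0.0.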


Abstract

The invention belongs to the field of natural language processing in computational linguistics and discloses a method for processing unknown words in Chinese dependency treebanks. The method comprises the steps of: A, finding all synonyms of each unknown word using the synonym thesaurus Tongyici Cilin; B, computing the character-form similarity between the unknown word and each of its synonyms according to the character-form features of Chinese characters; C, when the unknown word has a high character-form similarity to several synonyms, extracting the information content of the mapped words and of their parts of speech, and using it to refine the character-form similarity model; D, selecting the word with the maximum character-form similarity as the optimal mapped word of the unknown word and using it to interpret the unknown word in the treebank. The method restores (part of speech, part of speech) unit pairs in dependency parsing to (part of speech, word) or (word, part of speech) unit pairs without expanding the treebank, which refines the information granularity, alleviates data sparseness, and improves dependency parsing performance.
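Step D amounts to an argmax over the candidate synonyms. The sketch below is one possible reading, assuming that candidates are restricted to words already known in the treebank and that ties are resolved by an optional secondary score standing in for the refinement of step C. The function names, the `tie_break` parameter, and the shared-character similarity used in the example are all assumptions for illustration and are not the patent's formulas.

```python
def best_mapped_word(unknown, synonyms, known_words, similarity, tie_break=None):
    """Pick the known synonym with the highest similarity to `unknown`.

    `tie_break` is a hypothetical secondary score used only to order
    candidates whose similarity is equal; without it, the first candidate
    with the maximum similarity wins.
    """
    candidates = [s for s in synonyms if s in known_words]
    if not candidates:
        return None  # no known synonym, so no mapping is produced

    def key(candidate):
        secondary = tie_break(candidate) if tie_break else 0.0
        return (similarity(unknown, candidate), secondary)

    return max(candidates, key=key)


def shared_char_ratio(word_a, word_b):
    """Stand-in similarity: shared characters over all distinct characters."""
    chars_a, chars_b = set(word_a), set(word_b)
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 0.0


# Example: map the unknown word 全人类 onto a word the treebank already contains.
mapped = best_mapped_word("全人类", ["人类", "生人"], {"人类", "生人"}, shared_char_ratio)
```

Once a mapped word is found, lexicalized statistics collected for that known word can be reused for the unknown word, which is how a (part of speech, part of speech) pair could be replaced by a pair containing a word.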

Description

Technical field

[0001] The invention relates to a method for processing unregistered words in a Chinese dependency grammar treebank. It interprets unregistered words through known words in the treebank, and belongs to the field of natural language processing in computational linguistics.

Background

[0002] Syntactic analysis is one of the core problems of natural language processing, and its performance directly affects the correctness and effectiveness of automatic understanding of natural language sentences. Dependency parsing is easier to handle than phrase-structure parsing and has received much attention in recent years. Many countries are now building and developing treebanks for their own languages. As the strong disambiguation ability of lexical information has gradually been recognized, more and more statistical dependency parsing models tend to be lexicalized.

[0003] Vocabulary is the most discriminative information, and l...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 吕学强, 郑略省, 王玥, 关晓炟
Owner: BEIJING INFORMATION SCI & TECH UNIV