Chinese syntax parsing method with merged semantic information

A syntactic parsing technology that incorporates semantic information, applied in the fields of instruments, computing, and electronic digital data processing. It addresses problems such as inaccurate description of language phenomena and data sparsity, and achieves improved parsing performance, efficiency, and accuracy.

Inactive Publication Date: 2009-09-02
PEKING UNIV
0 Cites 12 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, these two methods also have their own shortcomings: the introduction of lexical information in the lexicalization method has brought about certain data sp...

Abstract

The invention discloses a Chinese syntax parsing method with merged semantic information, belonging to the technical field of natural language processing. The method comprises the following steps: step 1), extracting semantic classes of words at different hierarchical levels according to the hyponymy relations in HowNet (the knowledge network), to obtain an index from words to semantic classes; step 2), using each word in a syntactic tree as the key to query the index, obtaining the semantic class of the word, and adding the semantic class to a given layer of the syntactic tree; step 3), using the syntactic trees processed in step 2) as training data to train a grammar and obtain a grammar model; step 4), using the grammar model trained in step 3) to decode the sentence to be analyzed. Compared with the prior art, the invention uses semantic information to disambiguate parsing, so that parsing performance is markedly improved.

Technology Topic

  • Syntactic tree
  • Natural language

Example Embodiment

[0031] The specific embodiment of the present invention is described in detail below with reference to the accompanying drawings. The flowchart of the method is shown in Figure 3.
[0032] 1. Build a word-semantic index
[0033] Based on the hyponymy relations between sememes defined in HowNet, semantic classes at different layers, from coarse to fine, are extracted and associated with each word, so as to build an index from words to semantic classes. The words here carry part-of-speech information.
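The index construction described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the hypernym table, the lexicon entries, and the names HYPERNYM, LEXICON, sememe_path, and build_index are all hypothetical, and the toy data is not taken from HowNet.

```python
from typing import Dict, List, Tuple

# Hypothetical toy hypernym table (child sememe -> parent sememe).
# These entries are illustrative only and are not taken from HowNet.
HYPERNYM: Dict[str, str] = {
    "human": "animate",
    "animate": "entity",
    "institution": "entity",
}

# Hypothetical lexicon: (word, part of speech) -> most specific sememe.
LEXICON: Dict[Tuple[str, str], str] = {
    ("教师", "NN"): "human",
    ("学校", "NN"): "institution",
}


def sememe_path(sememe: str) -> List[str]:
    """Return the chain from the root sememe down to `sememe` (coarse to fine)."""
    path = [sememe]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return list(reversed(path))


def build_index(depth: int) -> Dict[Tuple[str, str], str]:
    """Map each (word, POS) pair to its semantic class at a given layer.

    depth=0 selects the coarsest (topmost) class; larger values select finer
    classes, falling back to the finest available class when the path is short.
    """
    index = {}
    for key, sememe in LEXICON.items():
        path = sememe_path(sememe)
        index[key] = path[min(depth, len(path) - 1)]
    return index
```

Calling build_index(0) gives the coarsest layer for every entry, which corresponds to the "topmost semantic class" setting used in the experiments; larger depths give finer classes.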
[0034] 2. Add semantic class information to the original treebank
[0035] For the original treebank, the semantic class is looked up using the word and its part of speech as the key, and is then attached at the part-of-speech (pre-terminal) level, refining the part-of-speech tags. The resulting tags carry semantic information.
[0036] Some words have several different semantic classes. Two strategies are used in this case: select the first of the listed classes, or have an annotator select one manually according to the context.
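A rough sketch of the pre-terminal annotation step, using the "first class" strategy for ambiguous words. The tree encoding (nested tuples), the function name annotate, the "-" separator, and the sample index are assumptions made for illustration; the patent does not specify a label format.

```python
from typing import Dict, List, Tuple

# Assumed tree encoding: (label, children), where children is a word (str)
# for a pre-terminal and a list of subtrees otherwise.
Index = Dict[Tuple[str, str], List[str]]  # (word, POS) -> candidate classes


def annotate(tree, index: Index, sep: str = "-"):
    """Return a copy of `tree` whose POS tags carry a semantic class suffix."""
    label, children = tree
    if isinstance(children, str):  # pre-terminal node: (POS, word)
        classes = index.get((children, label), [])
        if classes:
            # "First class" strategy: take the first listed semantic class.
            label = f"{label}{sep}{classes[0]}"
        return (label, children)
    return (label, [annotate(child, index, sep) for child in children])


# Hypothetical index entries, for illustration only.
index = {("教师", "NN"): ["human", "occupation"]}
tree = ("NP", [("NN", "教师"), ("NN", "未知词")])
annotated = annotate(tree, index)
```

Words missing from the index keep their original tag, so the annotated treebank stays usable for training even when the semantic dictionary does not cover every word.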
[0037] 3. Train the grammar model
[0038] The treebank with added semantic class information is used as training data, and the unlexicalized (non-lexical) syntactic parsing model introduced above is used for grammar training. During training, non-terminal symbols are refined by automatic splitting and merging. To investigate whether this refinement should also be applied to pre-terminals that already carry semantic information, we conducted experiments. They show that adding coarse-grained semantic information and then applying automatic refinement works better than adding it without refinement, and also better than directly adding more discriminative fine-grained semantic classes without automatic refinement. The effectiveness analysis below describes this in detail.
[0039] 4. Perform syntactic analysis on the sentence to be analyzed
[0040] With the grammar model trained above, a sentence to be analyzed (already word-segmented) is decoded by the unlexicalized parser introduced above according to the grammar model. This yields the syntactic analysis result, which also includes semantic annotations for the sentence.
[0041] Effectiveness analysis:
[0042] In order to verify the effectiveness of the present invention, we designed a series of experiments, some of which are described below.
[0043] Experimental corpus:
[0044] The training and test corpus is the UPenn Chinese Treebank 2.0, which contains 325 news articles in total and is divided in the standard way: articles 1-25 form the development set (350 sentences), articles 26-270 form the training set (3,172 sentences), and articles 271-300 form the test set (348 sentences).
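The split above can be written down as a small lookup, shown here as a sketch. The article ranges come from the text; the names SPLITS and split_of are hypothetical, and how article numbers map to treebank files is not covered.

```python
from typing import Optional

# Article-number ranges as stated in the text (inclusive on both ends).
SPLITS = {"dev": (1, 25), "train": (26, 270), "test": (271, 300)}


def split_of(article_id: int) -> Optional[str]:
    """Return the split an article belongs to, or None if it is unassigned."""
    for name, (low, high) in SPLITS.items():
        if low <= article_id <= high:
            return name
    return None  # the text assigns no split to articles outside 1-300
```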
[0045] The semantic dictionary uses HowNet.
[0046] Baseline system:
[0047] The baseline system uses the unlexicalized (non-lexical) syntactic parsing model introduced above and refines non-terminal symbols automatically with an unsupervised splitting method. In each iteration, every symbol is split into two, the parameters of the new symbols are estimated with the EM algorithm, and the split symbols are then merged according to their contribution to the likelihood.
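To make the split step concrete, the following sketch splits every symbol of a toy binary-rule PCFG into two subsymbols and spreads each rule's probability over the new combinations with a small random perturbation. It is only an illustration under stated assumptions: the rule encoding, the function name split_grammar, and the noise parameter are hypothetical, the input grammar is assumed to be normalized per parent, and the EM re-estimation and merge steps are not shown.

```python
import random
from collections import defaultdict
from itertools import product
from typing import Dict, Tuple

Rule = Tuple[str, str, str]  # (parent, left child, right child)


def split_grammar(rules: Dict[Rule, float], noise: float = 0.01,
                  seed: int = 0) -> Dict[Rule, float]:
    """Split every symbol X into X_0 and X_1 and redistribute rule mass."""
    rng = random.Random(seed)
    split: Dict[Rule, float] = {}
    for (parent, left, right), prob in rules.items():
        for i, j, k in product(range(2), repeat=3):
            # Each parent half inherits the rule; its mass is shared by the
            # four child combinations, perturbed slightly to break symmetry
            # so that a later EM pass can pull the two halves apart.
            jitter = 1.0 + rng.uniform(-noise, noise)
            new_rule = (f"{parent}_{i}", f"{left}_{j}", f"{right}_{k}")
            split[new_rule] = prob / 4 * jitter
    # Renormalize so the rules of every new parent symbol sum to one.
    totals: Dict[str, float] = defaultdict(float)
    for (parent, _, _), prob in split.items():
        totals[parent] += prob
    return {rule: prob / totals[rule[0]] for rule, prob in split.items()}


toy = {("S", "NP", "VP"): 1.0, ("NP", "NP", "NP"): 0.4, ("NP", "DT", "NN"): 0.6}
refined = split_grammar(toy)
```

Each original rule yields eight refined rules, so the grammar grows quickly with every iteration; this is why a merge step that undoes unhelpful splits matters in practice.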
[0048] Evaluation procedure:
[0049] Evaluation uses EVALB, a widely used evaluation tool for syntactic parsing. The tool takes matching of labeled brackets as its criterion and reports precision, recall, and F-value.
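The bracket-matching criterion can be illustrated with a simplified scorer. This is a sketch of labeled bracket precision, recall, and F-value for a single sentence, not EVALB itself: the nested-tuple tree encoding and the names brackets and prf are assumptions, and EVALB's additional conventions (for example its parameter-file handling of punctuation and label equivalences) are not reproduced.

```python
from collections import Counter

# Assumed tree encoding: (label, children), where children is a word (str)
# for a pre-terminal and a list of subtrees otherwise.


def brackets(tree, start=0):
    """Return (spans, end); spans lists (label, start, end) for each phrase."""
    label, children = tree
    if isinstance(children, str):  # a pre-terminal covers one word, no bracket
        return [], start + 1
    spans, pos = [], start
    for child in children:
        child_spans, pos = brackets(child, pos)
        spans.extend(child_spans)
    spans.append((label, start, pos))
    return spans, pos


def prf(gold, test):
    """Labeled bracket precision, recall, and F-value for one sentence."""
    gold_spans = Counter(brackets(gold)[0])
    test_spans = Counter(brackets(test)[0])
    matched = sum((gold_spans & test_spans).values())
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / sum(test_spans.values())
    recall = matched / sum(gold_spans.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ("S", [("NP", [("NN", "a")]),
              ("VP", [("VV", "b"), ("NP", [("NN", "c")])])])
test = ("S", [("NP", [("NN", "a")]),
              ("VP", [("VV", "b"), ("NN", "c")])])
precision, recall, f_value = prf(gold, test)
```

In this toy pair the parser output misses one NP bracket, so precision is 3/3 and recall is 3/4. Corpus-level scores would sum matched and total brackets over all sentences before dividing.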
[0050] Experimental results and analysis:
[0051] The results of the baseline system tested on the CTB standard dataset are shown in Table 1:
[0052] Table 1: Baseline System Performance
[0053]
[0054] Here, S&M denotes the number of split-merge iterations. For example, S&M-1 denotes one split-merge iteration, and S&M-2 denotes two, that is, a further split-merge iteration performed on the grammar obtained from the first. Len denotes the sentence length, that is, the number of words in the sentence. Len
[0055] To alleviate data sparsity to some extent, we use only the topmost (coarsest) semantic class in HowNet and apply automatic refinement to all tags. The experimental results on the same dataset are shown in Table 2.
[0056] Table 2: Parsing performance with coarse-grained semantic class labels
[0057]
[0058] The table above shows that, from the fourth split-merge iteration onward, parsing with added semantic class information outperforms the baseline system. In the sixth iteration, overly fine splitting leads to overfitting and the F value drops somewhat. This trend appears in both the baseline and the improved system, but the results with semantic classes remain better than the baseline. Comparing the results at the fifth iteration, the F value rises from 80.26% to 81.63%, an absolute gain of 1.37 points, which is a substantial improvement for syntactic parsing.
[0059] In addition, when the newly released version 5.0 of the Penn Chinese Treebank (18,782 sentences in total) is used for training, the syntactic parsing performance of the present invention reaches an F value of 86.39%. The trend before and after adding semantic information is similar to that obtained on Penn Chinese Treebank 2.0 above, so the details are omitted here.
[0060] The present invention builds on an unlexicalized (non-lexical) syntactic parser and integrates semantic information into it. The semantic information helps disambiguate the syntactic analysis, significantly improves parser performance, and also yields semantic information for some of the words.

