Chinese multi-class word identification method based on conditional random field

A conditional random field identification method, applied in fields such as instrumentation, calculation, and electrical digital data processing, that addresses problems such as the uneven probability distribution caused by differing numbers of transition branches and the dependence of each observation value on its state alone.

Status: Inactive
Publication Date: 2015-07-01
EAST CHINA NORMAL UNIV

AI Technical Summary

Problems solved by technology

The disadvantage of the hidden Markov model is that, given an observation sequence, each observation value depends only on its state, so every observation element is treated as independent. In real text, however, a word is related not only to its immediately preceding and following words but also to feature information from words farther away, so the model reaches only a local optimum.
Although the maximum entropy Markov model takes into account feature information from words farther away from the current word, the probability distribution at a state transition is unbalanced across states with different numbers of branches, so the model tends to remain in certain states. This is the label bias problem.
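
The contrast between the three models can be written out with the standard textbook factorizations. These formulas are not quoted from the patent; the feature functions f_k, weights \lambda_k, and normalizers Z are conventional notation assumed here for illustration, with x the observed sequence and y the label sequence of length T.

```latex
% HMM (generative): each observation x_t depends only on its own state y_t.
P(x, y) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)

% MEMM: every step is normalized separately, conditioned on the previous state.
P(y \mid x) = \prod_{t=1}^{T}
  \frac{\exp\big(\sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\big)}
       {Z(y_{t-1}, x, t)},
\qquad
Z(y_{t-1}, x, t) = \sum_{y'} \exp\big(\sum_k \lambda_k f_k(y_{t-1}, y', x, t)\big)

% Linear-chain CRF: a single normalizer over whole label sequences.
P(y \mid x) = \frac{1}{Z(x)}
  \exp\Big(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t)\Big)
```

Because each MEMM factor must sum to one over the successors of y_{t-1}, a state with few outgoing branches passes on almost all of its probability mass whatever the observation is, which is the label bias described above. The conditional random field normalizes once over complete sequences through Z(x), so it is not subject to that constraint.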

Examples

Embodiment Construction

[0042] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the invention are common knowledge in this field, and the invention places no special limitation on them.

[0043] As shown in Figure 1, the present invention comprises the following steps:

[0044] Step 1: Search for a Chinese multi-class (concurrent) word in the e-commerce field, obtain the entries related to that word, and obtain from those entries a corpus with e-commerce characteristics;

[0045] Step 2: Segment the corpus to generate chunks, and at the same time generate the chunk feature of each character within the chunks;

[0046] Step 3: Perform part-of-speech tagging on the characters to obtain the part-of-speech feature of the...
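
Steps 2 and 3 can be illustrated with a minimal Python sketch that turns segmented, part-of-speech-tagged chunks into one feature row per character. The source text is truncated and does not give the actual tag set, so the B/I/E/S position tags, the function names `chunk_position_tags` and `build_rows`, and the sample sentence are all assumptions made for illustration, not the patent's scheme.

```python
from typing import List, Tuple


def chunk_position_tags(chunk: str) -> List[str]:
    """Assumed chunk feature: S for a one-character chunk, else B, I..., E."""
    if len(chunk) == 1:
        return ["S"]
    return ["B"] + ["I"] * (len(chunk) - 2) + ["E"]


def build_rows(chunks: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Expand (chunk, part_of_speech) pairs into per-character rows.

    Each row is (character, chunk position tag, part of speech), i.e. the
    two kinds of feature that Steps 2 and 3 attach to every character.
    """
    rows = []
    for text, pos in chunks:
        for char, tag in zip(text, chunk_position_tags(text)):
            rows.append((char, tag, pos))
    return rows


# Hypothetical segmented and tagged sentence (not taken from the patent corpus).
example = [("网上", "n"), ("购物", "v"), ("很", "d"), ("方便", "a")]
rows = build_rows(example)
for row in rows:
    print("\t".join(row))
```

Rows in this column layout are the usual input for sequence labelers that read one token per line, which is why the sketch prints them tab-separated.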

Abstract

The invention discloses a Chinese multi-class word identification method based on a conditional random field. The method comprises the following steps: entries related to multi-class words are acquired, and a corpus is obtained from the entries; the corpus is segmented to generate chunks, and the chunk features of the characters in the chunks are generated at the same time; part-of-speech tagging is performed on the characters to obtain their part-of-speech features, and the characters are labeled using the chunk features and the part-of-speech features; part of the corpus is randomly selected for training and the rest is used for testing, giving a first test result; the feature template is modified according to the characteristics of the corpus, and training and testing are repeated with the modified template, giving a second test result; the two test results are compared on performance metrics to improve the identification of multi-class words. With this method, Chinese multi-class words in the e-commerce field are identified by a conditional random field, and after the features of the original conditional random field feature template are modified, the precision, recall, and F-value of multi-class word identification increase.
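
The random split and the metric comparison described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the 80/20 split ratio, the tag-level scoring, the helper names `split_corpus` and `precision_recall_f`, and the toy label sequences are all hypothetical, and the patent may compute its figures at a different granularity.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(sentences: Sequence, train_fraction: float = 0.8,
                 seed: int = 0) -> Tuple[list, list]:
    """Randomly select part of the corpus for training; the rest is for testing."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f(gold: List[str], predicted: List[str],
                       target: str) -> Tuple[float, float, float]:
    """Precision, recall, and F-value for one target label."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    gold_pos = sum(1 for g in gold if g == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


train, test = split_corpus(range(10))

# Toy labels standing in for the first (original template) and second
# (modified template) test results; these are not experimental data.
gold = ["n", "v", "n", "v", "n"]
first_result = ["n", "n", "n", "v", "v"]
second_result = ["n", "v", "n", "v", "v"]

first_scores = precision_recall_f(gold, first_result, "v")
second_scores = precision_recall_f(gold, second_result, "v")
print("first :", first_scores)
print("second:", second_scores)
```

Comparing the two tuples element by element corresponds to the metric comparison between the first and second test results; fixing the random seed keeps the split reproducible so that both runs are scored on the same test sentences.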

Description

Technical field

[0001] The invention belongs to the field of text recognition for e-commerce products, and in particular relates to a method for recognizing Chinese multi-class (concurrent) words in the e-commerce field based on a conditional random field.

Background technique

[0002] With the development of the times and the advance of technology, ambiguous words have caused confusion in many contexts, where the same word or phrase is interpreted differently by machines or by people. (An ambiguous word is a word or phrase with two or more meanings; causes of ambiguity include unclear meaning, unfixed syntax, unclear structural levels, and unclear reference.) Whether ambiguous word recognition is accurate and efficient therefore affects the result of text information processing. Ambiguous words are roughly divided into polysyllabic words, homonyms, polysemous words, concurrent words and anti-precepts. Previous recognition rese...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 费凡, 徐文超, 杨雁峰, 刘云鹏, 汤俊, 杨艳琴
Owner: EAST CHINA NORMAL UNIV