Chinese and English-named entity identification method and system based on conditional random field (CRF)

A named entity recognition technique based on conditional random fields, applied in special-purpose data processing applications, instruments, and electrical digital data processing. It addresses entity recognition in text that is hard to recognize because of irregular grammar and randomness, and has wide application value.

Inactive Publication Date: 2013-09-18
INST OF ACOUSTICS CHINESE ACAD OF SCI +1
2 Cites, 40 Cited by

AI Technical Summary

Problems solved by technology

Entity recognition in pure English text does not require word segmentation, because English words are separated by spaces, so recognition is comparatively easy. Entity recognition in pure Chinese text is harder than in pure English text, but for spoken text that mixes Chinese and English...



Examples


Embodiment

[0073] 1. Perform Chinese and English word segmentation on the text produced by speech recognition. This part is divided into three steps: first, separate the Chinese characters from the English characters; second, use a finite state machine to recognize English word strings, that is, merge adjacent English letters, spaces, and English symbols; third, segment each English word string on spaces to obtain English words.
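The segmentation described above can be sketched as a small two-state machine. This is an illustrative sketch only, not the patent's implementation: the function names, the CJK code point range, and the set of characters treated as "English symbols" are all assumptions.

```python
CHINESE, ENGLISH = "CHINESE", "ENGLISH"  # the two machine states (names assumed)


def char_class(ch):
    """Classify one character; the exact character sets are assumptions."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "cjk"
    if ch.isascii() and (ch.isalnum() or ch in " '-&."):
        return "eng"
    return "other"


def segment_mixed(text):
    """Split mixed text into single Chinese characters and English words."""
    tokens = []
    english_run = []  # characters of the English word string being merged
    state = CHINESE
    for ch in text:
        kind = char_class(ch)
        if state == CHINESE and kind == "eng" and ch != " ":
            # A letter, digit, or symbol opens an English word string.
            state = ENGLISH
            english_run.append(ch)
        elif state == ENGLISH and kind == "eng":
            # Merge adjacent English letters, spaces, and symbols.
            english_run.append(ch)
        else:
            if state == ENGLISH:
                # Close the English word string and split it on spaces.
                tokens.extend("".join(english_run).split())
                english_run.clear()
                state = CHINESE
            if kind == "cjk":
                tokens.append(ch)  # one token per Chinese character
            # Any other character (punctuation, stray spaces) is dropped.
    if state == ENGLISH:
        tokens.extend("".join(english_run).split())
    return tokens


print(segment_mixed("我想听Michael Jackson的歌"))
```

With these assumptions, each Chinese character becomes its own token and the run "Michael Jackson" is first merged into one English word string and then split on the space into two tokens.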

[0074] 2. Construct the CRF training data. The data should cover, as far as possible, the common spoken expressions in the domain.

[0075] 3. Annotate the training data, that is, label the category of each entity noun in every query sentence.
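One common way to encode such annotations for a sequence labeler is the BIO scheme, written as one token per line. The source does not say which tagging scheme or file layout is used, so the BIO tags, the `PER` label, and the tab-separated column layout below are assumptions made for illustration.

```python
def bio_tags(tokens, entities):
    """Build BIO tags from entity spans.

    entities is a list of (start, end, label) over token indices,
    with end exclusive. Labels such as "PER" are hypothetical.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


def to_columns(tokens, tags):
    """Render one 'token<TAB>tag' line per token (assumed layout)."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))


tokens = ["我", "想", "听", "Michael", "Jackson", "的", "歌"]
tags = bio_tags(tokens, [(3, 5, "PER")])
print(to_columns(tokens, tags))
```

Here the two English tokens of the name receive `B-PER` and `I-PER`, and every other token receives `O`.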

[0076] 4. Feature extraction 1: To better extract the various entity nouns in the domain (including personal names and other nouns), and based on the word-formation characteristics of Chinese names, we built dictionaries of the characters commonly used in the surnames and given names of Chinese names Dic...
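A dictionary-based feature of this kind might look like the sketch below. The source text is truncated here, so everything beyond "a surname-character dictionary and a given-name-character dictionary" is an assumption: the dictionary contents are tiny hypothetical samples, and the context and capitalization features are added only to show how dictionary lookups would sit alongside other token features.

```python
# Hypothetical sample dictionaries; real ones would be far larger.
SURNAME_CHARS = {"张", "李", "王"}
GIVEN_NAME_CHARS = {"艳", "伟", "红"}


def token_features(tokens, i):
    """Return a feature dict for the token at position i."""
    tok = tokens[i]
    return {
        "token": tok,
        # Dictionary lookups described in the text.
        "in_surname_dict": tok in SURNAME_CHARS,
        "in_given_name_dict": tok in GIVEN_NAME_CHARS,
        # The remaining features are assumptions for illustration.
        "is_english": tok.isascii() and tok.isalpha(),
        "is_capitalized": tok[:1].isupper(),
        "prev_token": tokens[i - 1] if i > 0 else "<BOS>",
        "next_token": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


features = token_features(["找", "张", "艳"], 1)
print(features["in_surname_dict"], features["next_token"])
```

Feature dicts like this, one per token, are the usual input shape for CRF toolkits, though the toolkit used in the source is not stated.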


Abstract

The invention provides a method and system for recognizing named entities in mixed Chinese and English text, based on a conditional random field (CRF). The method comprises the following steps: (101) converting the user's spoken query into text; (102) separating the text into Chinese characters and English letters on the basis of a finite state machine; (103) extracting features from the segmented text; (104) performing entity recognition on the text with a trained CRF model according to the feature extraction result, and labeling the entity types, wherein the CRF model is a linear-chain conditional random field. Step (102) further comprises: (102-1) separating Chinese and English characters; (102-2) recognizing English word strings with the finite state machine, namely merging adjacent English letters, spaces, and English symbols; (102-3) performing word segmentation on the English word strings.
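For step (104), a linear-chain CRF assigns labels by finding the highest-scoring label sequence, which is usually done with Viterbi decoding. The source does not name the decoding algorithm, so the sketch below is an assumption; the per-token and transition scores are made-up numbers standing in for what a trained model would compute from feature weights.

```python
def viterbi(emissions, transitions, labels):
    """Return the best label path for a linear-chain model.

    emissions: one dict per position mapping label -> score.
    transitions: dict mapping (prev_label, cur_label) -> score.
    Missing entries count as 0.0.
    """
    if not emissions:
        return []
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []  # back[t-1][cur] = best previous label for cur at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t].get(cur, 0.0)
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
emissions = [{"O": 2.0}, {"B-PER": 1.5, "O": 1.0}, {"I-PER": 1.0, "O": 0.9}]
transitions = {("O", "I-PER"): -10.0, ("B-PER", "I-PER"): 1.0}
print(viterbi(emissions, transitions, labels))
```

The large negative score on the `O` to `I-PER` transition illustrates how a chain model can discourage label sequences that are locally plausible but structurally invalid.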

Description

Technical field

[0001] The invention relates to finite state machines and the conditional random field sequence labeling model. It mainly addresses user query sentences that mix Chinese and English in human-computer interaction, and proposes a method and system for recognizing named entities in such mixed Chinese-English sentences.

Background technique

[0002] In a human-computer interaction system, users issue query requests in spoken language and the system provides information services. A typical human-computer interaction system includes four components: automatic speech recognition, spoken language understanding, dialogue management, and speech synthesis. The spoken language understanding component transforms the query sentence produced by speech recognition into a corresponding semantic representation. However, with the growing integration of international information, multilingual usage can be seen everywhere, which brings difficulties to...


Application Information

IPC(8): G06F17/30
Inventor: 张艳, 李艳玲, 徐为群, 颜永红
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI