Chinese domain term recognition method based on mutual information and conditional random field model

A term recognition technology based on conditional random fields, applied in the information field, that addresses the low degree of automatic recognition, the low recognition accuracy, and the difficulty of accurately segmenting professional-field corpora.

Inactive Publication Date: 2013-04-17
SHANGHAI UNIV
3 Cites, 52 Cited by

AI Technical Summary

Problems solved by technology

However, because of the gap between professional-field terminology and common vocabulary, it is difficult to segment a professional-field corpus accurately with general-purpose word segmentation tools.



Examples


Embodiment Construction

[0053] The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0054] In this embodiment, domain term recognition for bamboo plants is taken as an example to illustrate the present invention, but it is not intended to limit the scope of the present invention.

[0055] Referring to Figure 1, the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention comprises the following steps:

[0056] (1) Collect a domain text corpus, and mark all punctuation marks, spaces, numbers, ASCII characters and other non-Chinese characters in the corpus.
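The marking in step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "Chinese character" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that each run of non-Chinese characters is replaced by a single marker. The marker symbol `#` and the function names are hypothetical.

```python
import re
from typing import List

# Assumption: "Chinese character" means the CJK Unified Ideographs block
# (U+4E00 to U+9FFF); the patent text does not name a code range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

MARK = "#"  # hypothetical marker symbol, chosen only for illustration


def mark_non_chinese(text: str) -> str:
    """Replace every run of non-Chinese characters with a single marker."""
    return NON_CHINESE.sub(MARK, text)


def chinese_segments(text: str) -> List[str]:
    """Return the pure-Chinese strings that lie between markers."""
    return [seg for seg in mark_non_chinese(text).split(MARK) if seg]


if __name__ == "__main__":
    sample = "竿高5-10米，直径3 cm。"
    print(mark_non_chinese(sample))
    print(chinese_segments(sample))
```

Marking in this way leaves only uninterrupted Chinese strings, so later statistics are never computed across punctuation or numbers.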

[0057] For example, this embodiment uses the electronic manuscript of the ninth volume, Bamboo subfamily, of "Flora of China" as the domain text corpus.

[0058] First, the corpus is randomly divided at a ratio of 4:1 into two parts, a training corpus and a test corpus;
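A random 4:1 division of this kind might look like the sketch below. The unit being shuffled (here, a list of sentences), the fixed seed, and the function name are assumptions made for illustration; the text does not say at which granularity the corpus is divided.

```python
import random


def split_corpus(sentences, train_parts=4, test_parts=1, seed=0):
    """Shuffle the sentences and split them train:test at the given ratio."""
    shuffled = list(sentences)
    # A seeded generator keeps the illustrative split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    corpus = ["sentence %d" % i for i in range(10)]
    train, test = split_corpus(corpus)
    print(len(train), len(test))
```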

[00...



Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and a conditional random field model. The method includes the following steps: (1) collecting a domain text corpus and marking all punctuation marks, spaces, numbers, ASCII (American Standard Code for Information Interchange) characters and other non-Chinese characters in the corpus; (2) setting character strings and computing the mutual information value of each character string; (3) computing the left information entropy and the right information entropy of each character string; (4) defining a character string evaluation function, setting an evaluation function threshold, computing the evaluation function value of each character string, determining whether each character string is a word, and then comparing in sequence the evaluation function value of the former character with that of the latter character in the character string and segmenting the character strings one by one; (5) using conditional random fields to train a conditional random field model and recognizing domain terms with the trained model. When the method is used to recognize terms, the data sparsity of legitimate terms is overcome, the amount of calculation of the conditional random fields is reduced, and the accuracy of Chinese domain term recognition is improved.

Description

Technical field

[0001] The invention relates to a method for recognizing Chinese domain terms based on mutual information and a conditional random field model, and belongs to the field of information technology.

Background technique

[0002] According to the national standard GB/T 15237.1-2000 "Terminology Working Vocabulary", a term is a designation of a general concept in a specific professional field: a word or phrase used in a subject area to express a concept or relationship in that subject area. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly formed according to people's living and working habits; they are not required to express concepts with strict accuracy, and their meanings are often vague. Domain terms are systematic, general descriptions of professional concepts and do not allow ambiguity; the concept expressed by each techn...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 彭琳, 刘宗田, 杨林楠, 张立敏
Owner: SHANGHAI UNIV