Method for Cambodian named entity recognition based on cross-language resources

A cross-language named entity recognition technique, applied in natural language data processing, special data processing applications, instruments, etc., that addresses the low accuracy of Cambodian named entity recognition and achieves effective recognition.

Active Publication Date: 2018-03-30
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] The invention provides a method for recognizing Cambodian named entities based on cross-language resources, which addresses the low recognition accuracy of Cambodian named entities.

Examples

Embodiment 1

[0044] Embodiment 1: As shown in Figure 1, a method for Cambodian named entity recognition based on cross-language resources comprises the following steps:

[0045] Step 1. Obtain an English-Cambodian bilingual parallel text corpus and a Cambodian monolingual text corpus;

[0046]-[0047] Step 2. Use the Word2vec tool to process the obtained Cambodian monolingual text corpus to obtain the word vector corresponding to each Cambodian word in the text;

[0048] Step 3. Compute the similarity between Cambodian monolingual words using the cosine similarity of their word vectors. Let the vectors of any two words in the Cambodian document be $w_i$ and $w_j$, where $w_i = (w_{i1}, w_{i2}, \ldots, w_{in})$ and $w_j = (w_{j1}, w_{j2}, \ldots, w_{jn})$; then the similarity between the two words is expressed as:

[0049] $$\mathrm{sim}(w_i, w_j) = \frac{\sum_{k=1}^{n} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} w_{jk}^{2}}}$$
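
The cosine similarity of Step 3 can be illustrated with a minimal Python sketch. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for this example; the source does not say how such edge cases are handled.

```python
import math
from typing import Sequence


def cosine_similarity(w_i: Sequence[float], w_j: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors of equal dimension n."""
    if len(w_i) != len(w_j):
        raise ValueError("word vectors must have the same dimension")
    # Numerator: dot product over the n components.
    dot = sum(a * b for a, b in zip(w_i, w_j))
    # Denominator: product of the two Euclidean norms.
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_i * norm_j)


if __name__ == "__main__":
    # Toy vectors, not taken from any real Word2vec model.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Parallel vectors score close to 1 and orthogonal vectors score 0, which is why the measure suits ranking word pairs by similarity.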

[0050] Step 4. Align Cambodian words with English words: use the standard word alignment technique, the IBM model ...
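
To make the idea of IBM-model word alignment concrete, here is a small sketch that assumes IBM Model 1 trained with expectation-maximization and no NULL word. The source text is truncated and does not say which IBM model or settings are used, so this is an illustration rather than the patent's procedure. The function names, the iteration count, and the placeholder tokens (`K_house`, `K_red`, `K_book` stand in for Cambodian words) are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (Cambodian tokens, English tokens) for one parallel sentence.
SentencePair = Tuple[List[str], List[str]]


def train_ibm_model1(pairs: List[SentencePair],
                     iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(k, e)] ~ P(Cambodian word k | English word e) with EM."""
    khmer_vocab = {k for khmer, _ in pairs for k in khmer}
    uniform = 1.0 / len(khmer_vocab)
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: collect expected alignment counts.
        for khmer, english in pairs:
            for k in khmer:
                z = sum(t[(k, e)] for e in english)
                for e in english:
                    delta = t[(k, e)] / z
                    count[(k, e)] += delta
                    total[e] += delta
        # M-step: renormalize counts per English word.
        t = defaultdict(float, {(k, e): c / total[e]
                                for (k, e), c in count.items()})
    return t


def align(khmer: List[str], english: List[str],
          t: Dict[Tuple[str, str], float]) -> List[Tuple[str, str]]:
    """Link each Cambodian token to its most probable English token."""
    return [(k, max(english, key=lambda e: t[(k, e)])) for k in khmer]


if __name__ == "__main__":
    # Hypothetical toy corpus with placeholder Cambodian tokens.
    corpus = [
        (["K_house", "K_red"], ["red", "house"]),
        (["K_book", "K_red"], ["red", "book"]),
        (["K_book"], ["book"]),
    ]
    table = train_ibm_model1(corpus)
    print(align(["K_house", "K_red"], ["red", "house"], table))
```

In this toy corpus the single-word pair anchors `K_book` to `book`, and EM then resolves the remaining ambiguity in the other sentences. A production system would more likely rely on an established alignment toolkit than on a hand-written loop.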

Abstract

The invention relates to a method for Cambodian named entity recognition based on cross-language resources, and belongs to the technical field of natural language processing. The method first obtains an English-Cambodian bilingual parallel text corpus and a Cambodian monolingual text corpus. Second, it uses the Word2vec tool to process the Cambodian monolingual text and obtain vector representations of Cambodian words. Third, it computes similarity values among the Cambodian words with the cosine method and, at the same time, uses an IBM model to align Cambodian words with English words. A label propagation algorithm in a bilingual graph model then processes the Cambodian-English bilingual corpus to obtain the category of each Cambodian word in the text. This category serves as a cross-language feature and is applied, together with part-of-speech features, marker features and word features for tagging person names and place names, to a machine learning model. Finally, named entities are recognized in the obtained corpora.
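
The label propagation step mentioned in the abstract can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the patent's graph construction: the graph layout, the edge weights, the label set (`PER`, `LOC`), the node names and the function names `build_graph` and `propagate_labels` are all hypothetical. Seed nodes stand for words whose category is already known, and every other node repeatedly takes the weighted average of its neighbours' label distributions.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def build_graph(edges: List[Tuple[str, str, float]]) -> Graph:
    """Build a symmetric weighted adjacency map from (a, b, weight) triples."""
    graph: Graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def propagate_labels(graph: Graph, seeds: Dict[str, str],
                     labels: List[str], iterations: int = 20) -> Dict[str, str]:
    """Spread seed labels over the graph; seed nodes stay clamped."""
    dist = {}
    for node in graph:
        if node in seeds:
            dist[node] = {l: 1.0 if l == seeds[node] else 0.0 for l in labels}
        else:
            dist[node] = {l: 1.0 / len(labels) for l in labels}
    for _ in range(iterations):
        new_dist = {}
        for node, neighbours in graph.items():
            z = sum(neighbours.values())
            if node in seeds or z == 0.0:
                new_dist[node] = dist[node]
                continue
            # Weighted average of the neighbours' current distributions.
            new_dist[node] = {
                l: sum(w * dist[m][l] for m, w in neighbours.items()) / z
                for l in labels
            }
        dist = new_dist
    return {node: max(labels, key=lambda l: dist[node][l]) for node in graph}


if __name__ == "__main__":
    # Hypothetical graph: weights could come from cosine similarities.
    g = build_graph([
        ("seed_loc", "u1", 0.9),
        ("u1", "u2", 0.8),
        ("seed_per", "u2", 0.2),
    ])
    print(propagate_labels(g, {"seed_loc": "LOC", "seed_per": "PER"},
                           ["PER", "LOC"]))
```

Clamping the seeds keeps known categories fixed while unlabeled words inherit the category of their most strongly connected neighbours. The resulting category per word is the kind of value the abstract describes feeding to the machine learning model as a cross-language feature.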

Description

Technical field

[0001] The invention relates to a method for Cambodian named entity recognition based on cross-language resources, and belongs to the field of natural language processing technology.

Background art

[0002] The main task of named entity recognition is to identify proper names such as person names, place names, and organization names in text. Named entity recognition is an essential part of many natural language processing technologies, such as information extraction, information retrieval, machine translation, and question answering systems. Viewed across the whole process of language analysis, named entity recognition belongs to unregistered (out-of-vocabulary) word recognition in lexical analysis. The structural characteristics of named entities in Cambodian are similar to those in Chinese. Except for a few abbreviations, named entities look the same as other words, but there are still some clues th...

Application Information

IPC(8): G06F17/27
CPC: G06F40/295
Inventor: 严馨, 谢俊, 郭剑毅, 余正涛, 线岩团
Owner: KUNMING UNIV OF SCI & TECH