Multilingual text data sorting treatment method

A data processing and classification technique, applied in the field of data processing, that addresses problems such as information loss, large differences in emotional expression, and unsatisfactory performance of multilingual sentiment analysis. It aims for low resource dependence, reduced information loss, and fewer errors.

Inactive Publication Date: 2014-01-01
INST OF COMPUTING TECH CHINESE ACAD OF SCI
2 Cites, 35 Cited by

AI Technical Summary

Problems solved by technology

Multilingual sentiment analysis is difficult to perform without machine translation systems or compiled bilingual dictionaries.
[0017] (2) The performance of multilingual sentiment analysis is unsatisfactory.
[0018] First, machine translation-based methods lose accuracy in cross-lingual sentiment analysis.
Secondly, mos...

Method used

Embodiment Construction

[0052] To achieve the above object, the present invention proposes a self-learning classification method for multilingual data processing, comprising the following steps.

[0053] Figure 1 is the flow chart of the sentiment classification algorithm.

[0054] Step 1: extract candidate sentiment words using the seed word "very", then filter out stop words. The stop word list is obtained automatically from the target language.

[0055] Step 2: cluster sentiment words and sentiment texts simultaneously into positive and negative (for or against) groups using the seed words "good" and "bad".

[0056] Step 3: build a sentiment classifier through semi-supervised learning. First, select confident samples from the clustering results of Step 2 to train the initial classifier; then combine the sentiment score of each text with the posterior probability of the classifier to select new samples to add to the training set.
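The sample selection in Step 3 can be illustrated with a minimal Python sketch. The source does not state how the sentiment score and the posterior probability are combined, so the weighted average in `fuse`, the `weight` and `threshold` parameters, and the function names are all assumptions made for illustration only.

```python
def fuse(sentiment_score, posterior_pos, weight=0.5):
    """Combine a clustering sentiment score with a classifier posterior.

    sentiment_score: value in [-1, 1] from the clustering step (assumed range).
    posterior_pos:   P(positive | text) in [0, 1] from the current classifier.
    The weighted average is an assumed fusion rule, not taken from the source.
    """
    score_as_prob = (sentiment_score + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
    return weight * score_as_prob + (1.0 - weight) * posterior_pos


def select_new_samples(unlabeled, threshold=0.8, weight=0.5):
    """Pick texts whose fused confidence is high enough to join the training set.

    unlabeled: list of (text_id, sentiment_score, posterior_pos) tuples.
    Returns a list of (text_id, label) pairs; uncertain texts are skipped.
    """
    selected = []
    for text_id, score, posterior in unlabeled:
        p = fuse(score, posterior, weight)
        if p >= threshold:
            selected.append((text_id, "positive"))
        elif p <= 1.0 - threshold:
            selected.append((text_id, "negative"))
    return selected


# Hypothetical pool: two confident texts and one ambiguous text.
pool = [("t1", 0.9, 0.85), ("t2", -0.8, 0.10), ("t3", 0.1, 0.55)]
new_samples = select_new_samples(pool)
print(new_samples)
```

In a full self-training loop, the selected samples would be appended to the training set, the classifier retrained, and the selection repeated until no new confident samples remain.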

[0057] Step 1 includes:

[0058] In addition to extracting English sentiment words using "very", it also inclu...
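As a rough illustration of Step 1, the Python sketch below collects the word that immediately follows the seed word and drops stop words. The source only says the stop word list is obtained automatically; treating the most frequent corpus tokens as stop words is an assumption here, as are the function names, the `top_k` parameter, and the toy corpus.

```python
from collections import Counter


def build_stop_words(sentences, top_k):
    """Assumed heuristic: the top_k most frequent tokens act as stop words."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(top_k)}


def extract_candidates(sentences, seed="very", stop_words=frozenset()):
    """Count tokens that directly follow the seed word, skipping stop words."""
    candidates = Counter()
    for sent in sentences:
        for prev, nxt in zip(sent, sent[1:]):
            if prev == seed and nxt not in stop_words:
                candidates[nxt] += 1
    return candidates


# Hypothetical tokenized corpus in the target language.
corpus = [
    "the movie was very good and the cast was very funny".split(),
    "the service was very slow and the food was very the opposite of fresh".split(),
]

stop_words = build_stop_words(corpus, top_k=3)
candidates = extract_candidates(corpus, seed="very", stop_words=stop_words)
print(sorted(candidates))
```

For another target language, only the `seed` argument would change to that language's equivalent of "very"; the tokenizer is left out of the sketch and would need to suit the language.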

Abstract

The invention discloses a self-learning classification method for multilingual data processing, comprising: extracting candidate sentiment words using a first seed word, "very" in Chinese or its foreign-language equivalent, and filtering stop words, with the stop word list obtained automatically from a language database; simultaneously clustering sentiment words and sentiment texts into supporting or opposing groups using a second seed word "good" and a third seed word "bad", or their foreign-language equivalents; and building a sentiment classifier by semi-supervised learning, in which confident samples selected from the clustering result train the initial classifier, and new samples are selected and added to the training set by fusing the sentiment scores of the texts with the posterior probability of the classifier. The method for multilingual opinion analysis is language-independent: it needs neither a machine translation system nor a large-scale bilingual dictionary, learns the sentiment classifier directly on the target language, and has minimal resource dependence. For each target language, only three seed words are needed and no other prior knowledge.
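The simultaneous clustering of sentiment words and texts from the seeds "good" and "bad" can be pictured with a small Python sketch. The abstract does not specify the clustering algorithm, so the mutual-reinforcement averaging below, the function name `co_cluster`, the `iterations` parameter, and the toy data are assumptions chosen only to show how two seed words might propagate polarity to both words and texts.

```python
def co_cluster(texts, pos_seed="good", neg_seed="bad", iterations=10):
    """Propagate polarity between words and texts, starting from two seeds.

    texts: list of token lists (candidate sentiment words per text).
    Returns (word_score, text_score); positive values lean "for",
    negative values lean "against". Seed scores stay fixed at +1 and -1.
    """
    word_score = {pos_seed: 1.0, neg_seed: -1.0}
    text_score = [0.0] * len(texts)
    for _ in range(iterations):
        # A text's score is the mean score of its already-scored words.
        for i, tokens in enumerate(texts):
            known = [word_score[t] for t in tokens if t in word_score]
            text_score[i] = sum(known) / len(known) if known else 0.0
        # A word's score is the mean score of the texts that contain it.
        totals, counts = {}, {}
        for tokens, score in zip(texts, text_score):
            for t in set(tokens):
                totals[t] = totals.get(t, 0.0) + score
                counts[t] = counts.get(t, 0) + 1
        for t in totals:
            if t not in (pos_seed, neg_seed):
                word_score[t] = totals[t] / counts[t]
    return word_score, text_score


# Hypothetical texts, each reduced to its candidate sentiment words.
texts = [["good", "tasty"], ["tasty", "fresh"], ["bad", "slow"], ["slow", "rude"]]
word_score, text_score = co_cluster(texts)
print(word_score["fresh"] > 0, word_score["rude"] < 0)
```

Texts and words can then be split by the sign of their scores, and texts with scores far from zero are natural candidates for the confident samples that train the initial classifier.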

Description

Technical field

[0001] The invention relates to the field of data processing and to the tendency analysis of massive text data, in particular to an automatic sentiment classification method for multilingual (language-independent) text.

Background technique

[0002] With the rapid development of the Internet and the acceleration of globalization, the information resources provided by the Internet have become multilingual. According to survey data from Nielsen Net Ratings, a global authority on Internet user surveys and analysis, in the nine years from 2000 to 2008 the growth rate of Internet usage in various languages around the world reached 305.5%. The multilingual nature of Internet resources, together with the gap in users' familiarity between their native and non-native languages, inevitably creates language barriers when users try to use network information.

[0003] The Internet is quietly affecting people's living habits. With the continuous emergenc...


Application Information

IPC(8): G06F17/27
Inventor: 程学旗, 林政, 张瑾, 谭松波, 徐学可
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI