Unsupervised automatic extraction method of microblog new words based on repeated word strings

A technique for automatically extracting new words from repeated word strings, applied in the fields of electrical digital data processing, natural language data processing, and special data processing applications. It maintains extraction speed while achieving high accuracy.

Inactive Publication Date: 2014-03-26
HEFEI UNIV OF TECH
4 Cites, 21 Cited by
AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to provide a method for unsupervised automatic extraction of microblog new words based on repeated word strings, which addresses the low accuracy of existing new word extraction methods and their heavy dependence on the completeness of the rule base.

Examples

Embodiment 1

[0077] Original material: Facing wave upon wave of "inverted chant" ("Daoyong") voices, Li Yong could only cry injustice on his own blog.

[0078] After text segmentation: face / one / wave / high / over / one / wave / of / " / inverted / chong / " / sound / , / Li / Yong / only / can / in / own / of / blog / guest / 中 / call for injustice / .

[0079] Word segmentation fragments: "one / wave / high / over / one / wave / of / " / inverted / chong / "", "sound", "Li / Yong / only / can / in", "of / blog / guest / 中"

[0080] New words to be identified: "one", "wave", "high", "over", "de", "yibo", "wave high", ... "down", "yong", "downward yong", "sheng", "li", "yong", "only", "can", "zai", "Li Yong", "Yongzai", "Zaizhi", ...
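
The step from fragments [0079] to new words to be identified [0080] can be pictured as enumerating every contiguous run of tokens inside a fragment. The following is a minimal sketch only: the function name `candidate_strings`, the `max_len` cap, and the space separator are assumptions for illustration, and the text shown here does not state what maximum length the invention uses.

```python
from typing import List


def candidate_strings(fragment: List[str], max_len: int = 3, sep: str = " ") -> List[str]:
    """Return every contiguous run of 1..max_len tokens in a fragment, joined by sep.

    max_len is a hypothetical cap chosen for this sketch, not a value from the patent.
    """
    candidates = []
    for start in range(len(fragment)):
        # The run may not extend past the end of the fragment.
        last_end = min(start + max_len, len(fragment))
        for end in range(start + 1, last_end + 1):
            candidates.append(sep.join(fragment[start:end]))
    return candidates


# Third fragment from [0079], using the English glosses of the tokens.
fragment = ["Li", "Yong", "only", "can", "in"]
print(candidate_strings(fragment, max_len=2))
```

For Chinese text the tokens would be joined with an empty separator (`sep=""`) so that adjacent characters form a single string.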

[0081] After the Thres value is calculated with the statistical word selection model function, the new words to be identified that satisfy Thres value >= 0 are: Thres value (one wave) = 0.13325; Thres value (Inverted chant) = 0.21123; Thres value (in only) = 0.01134; Thres value (Li Yong) = 0.10224; Thres value (blog) = 0.43562.
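
The selection rule in [0081] amounts to keeping every string whose Thres value is at least 0. The sketch below assumes the scores have already been computed; the definition of the Thres function itself is not given in this text, so it is not reproduced here. The function name `select_candidates` is hypothetical, and sorting by descending score is a presentation choice of this sketch rather than part of the described method.

```python
from typing import Dict, List


def select_candidates(scores: Dict[str, float], threshold: float = 0.0) -> List[str]:
    """Keep the strings whose score is >= threshold, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked if score >= threshold]


# Thres values as listed in [0081].
scores = {
    "one wave": 0.13325,
    "Inverted chant": 0.21123,
    "in only": 0.01134,
    "Li Yong": 0.10224,
    "blog": 0.43562,
}
print(select_candidates(scores))
```

All five strings pass the threshold here. The rule filtering stage that follows is what would be expected to discard non-words such as "in only".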

[0082] Candidate new words: "Yibo", "Daoyong", "Zaizhi", "L...

Abstract

The invention discloses an unsupervised automatic extraction method for microblog new words based on repeated word strings. First, the microblog documents to be processed are segmented with a dynamic programming word segmentation method, yielding the word strings to be recognized; the word segmentation fragments in these strings are then combined into the new words to be recognized. Candidate new words are extracted from these according to a statistical word selection model, the candidates are filtered through a rule filtering model, and the final new words are obtained. The method maintains a high accuracy rate, does not depend heavily on a rule word stock, and preserves the extraction speed of new words.
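
The stages named in the abstract can be composed as a small pipeline. This is a rough sketch under stated assumptions, not the patented implementation: the segmenter, the scoring function, and the rule filter are passed in as callables because their definitions are not given here, and `extract_new_words`, `max_len`, and the toy stand-ins in the usage lines are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def extract_new_words(
    documents: Iterable[str],
    segment: Callable[[str], List[List[str]]],       # document -> list of token fragments
    score: Callable[[str, Dict[str, int]], float],   # stand-in for the statistical model
    passes_rules: Callable[[str], bool],             # stand-in for the rule filtering model
    max_len: int = 3,                                # hypothetical cap on tokens per string
) -> List[str]:
    """Segment, build repeated-string counts, score, then rule-filter."""
    counts: Dict[str, int] = {}
    for doc in documents:
        for fragment in segment(doc):
            for start in range(len(fragment)):
                last_end = min(start + max_len, len(fragment))
                for end in range(start + 1, last_end + 1):
                    candidate = "".join(fragment[start:end])
                    counts[candidate] = counts.get(candidate, 0) + 1
    # Statistical selection: keep strings whose score is non-negative.
    selected = [c for c in counts if score(c, counts) >= 0]
    # Rule filtering: drop candidates the rules reject.
    return [c for c in selected if passes_rules(c)]


# Toy stand-ins, for illustration only.
docs = ["a b c", "a b d"]
result = extract_new_words(
    docs,
    segment=lambda doc: [doc.split()],
    score=lambda cand, counts: counts[cand] - 2,  # non-negative when seen at least twice
    passes_rules=lambda cand: len(cand) > 1,
    max_len=2,
)
print(result)
```

Passing the three models in as parameters keeps the sketch honest about what is unknown: only the order of the stages is taken from the abstract.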

Description

Technical field

[0001] The invention belongs to the technical field of new word retrieval methods, and relates to an unsupervised automatic extraction method of microblog new words based on repeated word strings.

Background technique

[0002] New word recognition is one of the main problems in the field of Chinese word segmentation, and the growth of Weibo has accelerated the emergence of new words. Unsupervised automatic recognition of new words is crucial for other natural language processing tasks. Automatic segmentation of Chinese text is important basic work in natural language processing, and the identification and processing of new words is one of the difficulties limiting further improvement in the accuracy of Chinese word segmentation systems. At present, research on new word extraction mainly focuses on the extraction of entity nouns, especially the extraction of names of people, places, and institut...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/313, G06F40/284
Inventor: 孙晓, 李承程, 叶嘉麒, 唐陈意, 任福继
Owner: HEFEI UNIV OF TECH