Chinese word segmentation based text similarity identifying method and device

A text similarity and Chinese word segmentation technique, applied in the field of text recognition. It addresses the large amount of calculation, the high time and space complexity, and the large storage requirements that reduce the efficiency of text similarity recognition, and it achieves high recognition accuracy, improved recognition efficiency, and reduced complexity.

Active Publication Date: 2013-08-21
SHENZHEN TENCENT COMP SYST CO LTD
Cites: 9; Cited by: 52

AI Technical Summary

Problems solved by technology

[0009] Although the above two algorithms can guarantee high recognition accuracy and reduce the false positives caused by the keyword extraction method, the amount of calculation they require is quite large, and their time complexity (extracting feature information The time required fo



Examples


Embodiment Construction

[0057] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and examples.

[0058] The present invention is mainly based on Chinese word segmentation. During word segmentation, text segmentation and word frequency sorting can be performed, and text semantic analysis, part-of-speech tuning, and similar techniques are then used to find the segmented words that accurately reflect the meaning of the text. A feature value is then calculated from the words found, so that any text can be represented by a unique feature value, and the similarity comparison between any two texts reduces to a comparison of their feature values. This greatly reduces the complexity of text similarity recognition and improves recognition efficiency. At the same time, this method obtains th...
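The efficiency claim in [0058] can be illustrated with a minimal sketch: once each text has a single feature value, similar texts can be grouped in one pass over the collection instead of comparing every pair. The names `group_by_feature_value` and `toy_feature` are hypothetical, and `toy_feature` is only a stand-in that assumes the text is already segmented into space-separated words; it is not the feature calculation described in the patent.

```python
from collections import defaultdict


def toy_feature(text):
    # Hypothetical stand-in feature: the sorted set of space-separated words.
    # Assumes the text has already been segmented.
    return " ".join(sorted(set(text.split())))


def group_by_feature_value(texts, feature_fn):
    """Return lists of indices of texts that share the same feature value."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        # One feature computation and one dictionary lookup per text.
        groups[feature_fn(text)].append(index)
    # Only groups with more than one member contain similar texts.
    return [ids for ids in groups.values() if len(ids) > 1]


texts = ["文本 相似 识别", "识别 文本 相似", "天气 很好"]
similar_groups = group_by_feature_value(texts, toy_feature)
```

With a dictionary keyed by feature value, the work grows roughly linearly with the number of texts, whereas direct pairwise comparison grows quadratically.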



Abstract

An embodiment of the invention discloses a text similarity recognition method based on Chinese word segmentation. The method includes: filtering unrecognizable and meaningless characters out of texts in a given encoding format to obtain preprocessed texts; segmenting the preprocessed texts into words according to a preset word segmentation mode; selecting feature words from the segmented words according to a preset policy; sorting the selected feature words to obtain a special character string, and calculating a feature value for each text from that string; and determining the similarity of the texts by comparing their feature values. The embodiment of the invention further discloses a text similarity recognition device based on Chinese word segmentation. The method and device reduce recognition complexity, improve recognition efficiency, and achieve a higher recognition accuracy.
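The steps listed in the abstract can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the toy dictionary `TOY_DICT`, forward maximum matching as the "preset word segmentation mode", top-k word frequency as the "preset policy", and MD5 as the feature value function are all choices made here for the example.

```python
import hashlib
import re
from collections import Counter

# Hypothetical toy dictionary; a real system would use a full lexicon.
TOY_DICT = {"中文", "分词", "文本", "相似", "识别", "方法"}


def preprocess(text):
    # Assumed filter: keep CJK ideographs, ASCII letters and digits only.
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)


def segment(text, dictionary=TOY_DICT):
    # Forward maximum matching: take the longest dictionary word at each
    # position, falling back to a single character.
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def select_feature_words(words, top_k=3):
    # Assumed policy: ignore single characters, keep the top_k most frequent
    # words, breaking ties by code point order so the result is deterministic.
    counts = Counter(w for w in words if len(w) > 1)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def feature_value(text, top_k=3):
    words = segment(preprocess(text))
    # Sorting makes the string independent of word order in the text.
    feature_string = "".join(sorted(select_feature_words(words, top_k)))
    return hashlib.md5(feature_string.encode("utf-8")).hexdigest()


def is_similar(text_a, text_b):
    return feature_value(text_a) == feature_value(text_b)


a = "中文分词,文本相似识别方法!"
b = "文本相似识别方法 中文分词。。。"
same = is_similar(a, b)
```

Here `a` and `b` differ in punctuation and word order but yield the same feature value. One limitation of this sketch is that texts containing no dictionary words all collapse to the same empty feature string, so a practical system would need a richer lexicon and selection policy.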

Description

Technical field

[0001] The invention relates to the technical field of text recognition, and in particular to a text similarity recognition method and device based on Chinese word segmentation.

Background

[0002] With the rapid development of Web 2.0 technology and SNS, information dissemination has become more convenient, but there is also more and more repeated information. According to public statistics, the average repetition rate of web pages is about 4. That is, when you reach an article through one link, on average 3 other links give the same or essentially similar content.

[0003] For identical or similar information, some existing websites, communities, and the like identify the text content in web pages or articles, extract feature information from the text, judge the similarity between texts based on this feature information, and process similar texts through clustering, filtering and other methods, reduce excessive...


Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 万波, 曹训志, 谢志远, 陈盛荣, 刘锐强
Owner: SHENZHEN TENCENT COMP SYST CO LTD