Similarity calculation method for blog articles

A similarity calculation technology that works on the original text and the forwarded text, applied in the field of similarity calculation for blog posts. It addresses the problem that existing methods are not fully applicable to Weibo search engine applications, and it achieves timeliness and efficiency, reduces the number of matches, and has a significant effect.

Inactive; Publication Date: 2014-03-19
北京中搜云商网络技术有限公司
6 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

[0010] The keyword-based algorithm requires the title of the web page, but a blog post has no title, so the algorithm is not fully suitable for Weibo search engine applications.

Embodiment Construction

[0037] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0038] A blog post contains both original text and forwarded text, so its similarity should be determined from both. Both parts, however, contain spam information that is only weakly related to the post, such as links, emoticons, and personal names. Therefore, the original text and the forwarded text are first preprocessed to remove such spam information. The text is then segmented into words and the weight of each word is calculated. Finally, the (n+m) words with the highest weights are extracted as the keywords of the blog post.
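The preprocessing and keyword extraction described in [0038] can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expressions for links, emoticons, and names are hypothetical, the tokenizer is a simple regex stand-in for a real word segmenter, and term frequency is used as the weight because the text shown here does not specify the weighting formula.

```python
import re
from collections import Counter

# Hypothetical noise patterns; the source only names links, emoticons, and names.
LINK_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")  # bracketed emoticons such as [sad]
MENTION_RE = re.compile(r"@[\w-]+")            # user names written as @name


def preprocess(text: str) -> str:
    """Strip links, emoticons, and user names from one part of a post."""
    for pattern in (LINK_RE, EMOTICON_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    return text


def term_weights(text: str) -> Counter:
    """Segment the cleaned text and weight each term.

    Assumption: a regex tokenizer replaces a real word segmenter, and the
    weight is plain term frequency.
    """
    tokens = re.findall(r"\w+", preprocess(text).lower())
    return Counter(tokens)


def top_keywords(weights, n: int, m: int) -> list:
    """Return the (n + m) highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[: n + m]]


post = "Flood warning issued for the river, flood levels rising http://t.cn/abc [sad] @alice"
weights = term_weights(post)
keywords = top_keywords(weights, n=1, m=2)
```

For Chinese text, the tokenizer would need to be replaced with an actual word segmentation step; the rest of the sketch would stay the same.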

[0039] The keywords are further divided into two parts: the n words with the highest weights serve as the core words of the blog post, and the remaining m keywords serve as its second-level matching words. The judgment basis is that ...
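As a minimal sketch of the split in [0039], the ranked keywords can be held in a small structure with two fields. The names `Keywords` and `split_keywords` are hypothetical, and the matching rule that uses the two groups is not implemented because the judgment basis is truncated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keywords:
    core: tuple       # the n highest-weighted words
    secondary: tuple  # the remaining m second-level matching words


def split_keywords(weights: dict, n: int, m: int) -> Keywords:
    """Rank terms by descending weight and split the top (n + m) into two groups."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = [term for term, _ in ranked[: n + m]]
    return Keywords(core=tuple(terms[:n]), secondary=tuple(terms[n:]))


example = split_keywords(
    {"flood": 3.0, "river": 2.0, "warning": 1.5, "levels": 1.0, "rising": 0.5},
    n=2,
    m=2,
)
```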

Abstract

The invention provides a similarity calculation method for blog articles. The method includes the following steps: preprocessing the blog articles; calculating the weights of all terms in the processed original article and in the processed forwarded article; combining the term weights of the original article and the forwarded article; sorting the results in descending order; and generating fingerprints. The algorithm is simple and its effect is obvious. In the fingerprint generating stage, a fingerprint is looked up directly in the fingerprint set that corresponds to the length of the blog article, that is, its number of terms. This lowers the number of matches, saves calculation time, improves the efficiency and accuracy of similarity calculation, and meets the timeliness and efficiency requirements of a microblog search engine.
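The steps named in the abstract might be sketched as below. Several details are assumptions made for illustration, not taken from the patent text shown here: the forwarded text is combined by a weighted sum with a hypothetical `forwarded_factor`, the fingerprint is a SHA-256 hash of the top `k` terms, and the fingerprint sets are keyed by the exact number of distinct terms.

```python
import hashlib
from collections import defaultdict


def combine_weights(original: dict, forwarded: dict, forwarded_factor: float = 0.5) -> dict:
    """Merge term weights from the original and forwarded text.

    Assumption: a weighted sum; the actual combination rule is not shown.
    """
    combined = dict(original)
    for term, weight in forwarded.items():
        combined[term] = combined.get(term, 0.0) + forwarded_factor * weight
    return combined


def fingerprint(combined: dict, k: int) -> str:
    """Sort terms by descending weight and hash the top k (assumed scheme)."""
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    # Sorting the selected terms makes the hash independent of their rank order.
    top_terms = sorted(term for term, _ in ranked[:k])
    return hashlib.sha256(" ".join(top_terms).encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Fingerprint sets keyed by post length (number of terms)."""

    def __init__(self) -> None:
        self._sets = defaultdict(set)

    def seen_before(self, combined: dict, k: int = 3) -> bool:
        """Look up only the set for this length; record the fingerprint if new."""
        bucket = self._sets[len(combined)]
        fp = fingerprint(combined, k)
        if fp in bucket:
            return True
        bucket.add(fp)
        return False


index = FingerprintIndex()
post_a = combine_weights({"flood": 2, "river": 1}, {"warning": 1})
post_b = combine_weights({"match": 2, "final": 1}, {"score": 1})
first = index.seen_before(post_a)
repeat = index.seen_before(post_a)
other = index.seen_before(post_b)
```

Restricting the lookup to one length bucket is what keeps the number of comparisons small; a practical system would probably also check neighbouring lengths, which this sketch leaves out.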

Description

Technical field

[0001] The invention belongs to the technical field of electronic identification, and in particular relates to a similarity calculation method for blog posts.

Background

[0002] A large number of similar blog posts (microblog content, hereinafter referred to as blog posts) exist on Weibo. They cause many problems for the microblog search application: they consume storage space and reduce the performance and user experience of the application. Therefore, if an effective similarity calculation algorithm can be used to remove the large number of similar blog posts, the burden on the microblog search application can be greatly reduced and its performance and user experience can be improved.

[0003] At present, there is no good similarity calculation method for blog posts, either in China or abroad. Several popular web page similarity calculation methods exist, including algorithms based on web page fea...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/958
Inventor: 王欢龙
Owner: 北京中搜云商网络技术有限公司