Document similarity calculation method and whole-network retrieval and tracking method for similar documents

A document similarity calculation method, applied to unstructured text data retrieval, computation, text database indexing and related fields. It addresses the lack of a good way to protect document copyright and improves retrieval efficiency.

Inactive Publication Date: 2016-11-09
HANGZHOU FANEWS TECH
10 Cites, 46 Cited by

AI Technical Summary

Problems solved by technology

[0007] With the rapid development of self-media, copyright protection for individual self-media creators has become even more important. Because self-media creators are in a weak position, they have no good way to protect the copyright of their own documents.



Detailed Description of the Embodiments

[0033] Figure 1 is a system architecture diagram of the document similarity calculation method in this embodiment. In this embodiment, the document similarity calculation method includes:

[0034] (1) Data preparation: ETL

[0035] Media data from the whole network is collected in real time, and interfering information is removed by the "ETL data cleaning system". While the data is cleaned, the news manuscripts are structured and decomposed into their smallest units, yielding a set of segmented words. This is called the data atomization process.
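The cleaning and atomization step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's "ETL data cleaning system": the function names `clean` and `atomize` are hypothetical, the "interference" is assumed to be HTML markup, and the tokenizer is a naive stand-in that treats each CJK character as one unit rather than performing real Chinese word segmentation.

```python
import html
import re

# Assumption: the interference to remove is HTML markup and entities.
TAG_RE = re.compile(r"<[^>]+>")
# Stand-in tokenizer: a run of ASCII letters/digits, or one CJK character.
# A real system would use a proper Chinese word segmenter here.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def clean(raw: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def atomize(raw: str) -> list[str]:
    """Decompose a raw manuscript into its smallest units (tokens)."""
    return [token.lower() for token in TOKEN_RE.findall(clean(raw))]


tokens = atomize("<p>Hello, <b>World</b> 2016</p>")
```

The output of `atomize` is the word set that the later weighting and similarity steps would consume.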

[0036] (2) Infrastructure construction: ElasticSearch full-text index + Chinese word segmentation

[0037] The ElasticSearch (ES) search engine is the basic component of the whole system, and the later algorithms are built on ES. ElasticSearch is a distributed, multi-user full-text search engine based on Lucene. The scalability of distributed storage can effectively solve the storage problem of mass...


Abstract

The invention relates to a document similarity calculation method and a method for retrieving and tracking similar documents across the whole network. The document similarity calculation method includes: S01, performing word segmentation on an original document and a target document to obtain their respective sets of segmented words; S02, preprocessing and feature weighting: using TF-IDF to calculate the weight of each segmented word and extract the core keywords, using Word2vec to mine the degree of correlation between different segmented words in the documents, and performing semantic analysis on each document; S03, applying a vector space model and the cosine similarity algorithm: the cosine of the angle between two vectors in the vector space is used to evaluate the similarity of the documents, where the cosine value lies between 0 and 1, and the greater the value, the higher the similarity. The methods are suitable for tracking news reprints and for dissemination statistics.
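As a rough illustration of steps S02 and S03, the sketch below weights terms with TF-IDF and compares two documents by cosine similarity, using only the Python standard library. It is a minimal sketch under assumptions, not the patented implementation: the function names `tf_idf` and `cosine` are hypothetical, the smoothed IDF formula is one common variant chosen here for convenience, the input is assumed to be already segmented, and the Word2vec correlation step is left out entirely.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Map each term of one document to TF * IDF.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that
    every weight is positive even for terms found in all documents.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-segmented documents.
original = ["news", "copyright", "reprint"]
target = ["news", "reprint", "portal"]
corpus = [original, target]
score = cosine(tf_idf(original, corpus), tf_idf(target, corpus))
```

Because TF-IDF weights are non-negative, the cosine stays in the range 0 to 1, which matches the range stated in the abstract: identical term distributions score 1 and documents with no shared terms score 0.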

Description

Technical field

[0001] The invention relates to a method for calculating document similarity and a method for retrieving and tracking similar documents across the whole network. It is applicable to tracking news reprints and to dissemination statistics.

Background

[0002] As the main producers of news, traditional media contribute more than 80% of original news. Because their own dissemination platforms are limited, however, their original documents are reprinted by a large number of portals and some new media. By reprinting these documents, new media multiply their traffic and influence and obtain good economic returns, while the authors of the original documents do not benefit. Moreover, when copyright issues are pursued through legal means, finding reprinted documents is like looking for a needle in a haystack, which requires a lot of manpower and is difficult to obtain ev...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22, G06F17/27, G06F17/30
CPC: G06F16/31, G06F16/3331, G06F40/194, G06F40/30
Inventor: 姚洲鹏
Owner: HANGZHOU FANEWS TECH