Media text similarity detection method

A text similarity detection method, applied in the field of natural language processing, that addresses the weak semantic features of text fingerprints and low retrieval efficiency, and thereby improves retrieval efficiency, accuracy, and precision.

Pending Publication Date: 2021-07-13
SOUTHEAST UNIV
0 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0004] The present invention addresses problems in the prior art and provides a method for detecting the similarity of media texts. By integrating the semantic features of the text and optimizing the weights of feature words, the technical solution overcomes the weak semantic features and low retrieval efficiency of text fingerprints generated by the traditional Simhash algorithm. It can be applied to the rapid detection of similar media texts in massive media text collections.
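
For context, the traditional Simhash baseline mentioned above can be sketched as follows. This is a minimal illustration, not the patent's code: the function names, the 64-bit width, and the use of MD5 as a stable token hash are all assumptions. Because each token's hash is effectively random, two synonyms contribute unrelated bit patterns, which is one way to read the "weak semantic features" problem.

```python
import hashlib
from typing import Dict


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token truncated to `bits` bits (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_tokens: Dict[str, float], bits: int = 64) -> int:
    """Classic Simhash: every token votes +weight or -weight on each bit position."""
    totals = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive vote total sets the bit
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts are then treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold.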

Examples

Embodiment 1

[0047] Embodiment 1: As shown in Figure 1, the present invention is a media text similarity detection method. The specific implementation steps are as follows:

[0048] Step 1, media text collection. This example crawls the source code of web pages containing self-media manuscripts from mainstream self-media platforms on the Internet, keeps the number of manuscripts of each type balanced, and then stores the web page source code in a database.

[0049] Step 2, media manuscript preprocessing. Because the crawler tool returns the source code of the web pages containing media text, the text content must be extracted from that source code.

[0050] Sub-step 2-1, manuscript web page preprocessing. The source code of a web page containing self-media text contains many tags, and the tags that wrap the manuscript text differ across self-media platforms, so it is necessary to analyze diffe...
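
The per-platform extraction described in Step 2 could look roughly like the sketch below, which uses only Python's standard `html.parser`. The platform names, the tag and class pairs in `PLATFORM_RULES`, and the function name `extract_body_text` are hypothetical; the source does not say which tags each platform uses.

```python
from html.parser import HTMLParser
from typing import List

# Hypothetical rules: the (tag, class) pair that wraps the manuscript body per platform.
PLATFORM_RULES = {
    "platform_a": ("div", "article-content"),
    "platform_b": ("section", "post-body"),
}


class BodyTextExtractor(HTMLParser):
    """Collects text found inside the element that matches a platform rule."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0   # > 0 while inside the target element (counts same-tag nesting)
        self.skip = 0    # > 0 while inside <script> or <style>
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.depth > 0 or self.css_class in classes:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip > 0:
            self.skip -= 1
        elif tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and self.skip == 0 and text:
            self.chunks.append(text)


def extract_body_text(html: str, platform: str) -> str:
    """Return the manuscript text of `html` using the rule registered for `platform`."""
    tag, css_class = PLATFORM_RULES[platform]
    parser = BodyTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Keeping the rules in a lookup table means a new platform needs only a new entry, not a new parser.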

Abstract

The invention relates to a media text similarity detection method that can find media texts similar to a target text within massive media text collections. It comprises the following steps: first, obtaining media texts from mainstream self-media platforms with a crawler tool and constructing a media text set; second, preprocessing the media text to extract the body text field, then performing word segmentation and stop-word filtering; third, training a Skip-gram model on a massive corpus to obtain word vector representations of the feature items; fourth, fusing the TF-IDF, part-of-speech, and position features of the words to construct the weights of the feature items; and finally, using the Simhash algorithm on the word vector representations and feature item weights to obtain a text fingerprint, and constructing a text fingerprint index as an inverted index. The method effectively strengthens the semantic features of the text fingerprints and improves the efficiency and accuracy of similar-text retrieval with the Simhash algorithm.
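
The last three steps of the abstract can be illustrated with the sketch below. It is one plausible reading, not the patented procedure: the fusion rule (a plain product of the three features), the values in `POS_WEIGHT` and `TITLE_BOOST`, the use of the sign of each word-vector component in place of a hash bit, and the block-based inverted index are all assumptions. Word vectors are passed in as a plain dictionary; in practice they would come from a trained Skip-gram model.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set

# Hypothetical values: the source gives neither the weights nor the fusion formula.
POS_WEIGHT = {"noun": 1.0, "verb": 0.8, "other": 0.5}
TITLE_BOOST = 2.0  # position feature: words that also occur in the title count more


def fused_weights(tokens: Sequence[str], pos_tags: Dict[str, str],
                  title_tokens: Set[str], doc_freq: Dict[str, int],
                  n_docs: int) -> Dict[str, float]:
    """Fuse TF-IDF, part-of-speech and position features by multiplication (assumption)."""
    weights = {}
    for word, count in Counter(tokens).items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1.0
        pos = POS_WEIGHT.get(pos_tags.get(word, "other"), POS_WEIGHT["other"])
        position = TITLE_BOOST if word in title_tokens else 1.0
        weights[word] = tf * idf * pos * position
    return weights


def semantic_fingerprint(weights: Dict[str, float],
                         vectors: Dict[str, Sequence[float]], bits: int) -> int:
    """Simhash-style vote where each word vector component replaces a hash bit."""
    totals = [0.0] * bits
    for word, weight in weights.items():
        vec = vectors.get(word)
        if vec is None:
            continue  # out-of-vocabulary words are skipped
        for i in range(bits):
            totals[i] += weight * vec[i]
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


class FingerprintIndex:
    """Inverted index from (block position, block value) to document ids."""

    def __init__(self, bits: int = 8, blocks: int = 4) -> None:
        self.blocks = blocks
        self.width = bits // blocks
        self.table = defaultdict(set)
        self.fingerprints: Dict[str, int] = {}

    def _keys(self, fp: int):
        mask = (1 << self.width) - 1
        return [(b, (fp >> (b * self.width)) & mask) for b in range(self.blocks)]

    def add(self, doc_id: str, fp: int) -> None:
        self.fingerprints[doc_id] = fp
        for key in self._keys(fp):
            self.table[key].add(doc_id)

    def query(self, fp: int, max_distance: int) -> List[str]:
        """Gather documents sharing any block, then verify by Hamming distance."""
        candidates = set()
        for key in self._keys(fp):
            candidates |= self.table.get(key, set())
        return sorted(d for d in candidates
                      if bin(self.fingerprints[d] ^ fp).count("1") <= max_distance)
```

The block index relies on the pigeonhole principle: if two fingerprints differ in fewer bits than there are blocks, at least one block is identical, so a query only verifies documents that share a block rather than scanning every fingerprint.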

Description

Technical field

[0001] The invention relates to a detection method, in particular to a media text similarity detection method, and belongs to the technical field of natural language processing.

Background

[0002] With the continuous upgrading of Internet technology and the vigorous promotion of intelligent terminal systems, the media industry has developed rapidly. The new media model, which entered the public eye in 2003 and is still in the ascendant, has wider social coverage. In addition, the self-media model lets people take part in public discussion of social issues, which greatly improves the timeliness of news. However, the self-media model has gradually exposed its own disadvantages in its development. Due to the lack of relevant legal protection for the copyright of self-media content, the lack of effective supervision on the manuscript publishing platform, and the low media quality of the self-media creator...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/279, G06F40/216, G06F16/951
CPC: G06F40/279, G06F40/216, G06F16/951
Inventor: 杨鹏田杨静匡晨于晓潭
Owner: SOUTHEAST UNIV