A method and system for quantifying text similarity

A text-similarity quantification technique, applied in the field of parameter collection, that addresses the lack of a highly accurate, low-false-positive similarity measure and the overestimation of similarity by existing algorithms.

Active Publication Date: 2022-05-17
福建天晴在线互动科技有限公司

AI Technical Summary

Problems solved by technology

[0009] 1. For the specific purpose of analyzing an account collection and screening out malicious accounts, there is currently no publicly available similarity measure that is both highly accurate and low in false positives, nor a set of parameters specially tuned for this task.
[0010] 2. Edit distance can serve only as an auxiliary factor: it expresses only the difference between two texts, and it becomes meaningful only when combined with other information such as text length. For example, the edit distance between a and b is 1, and the edit distance between aaaaaa and aaaaab is also 1, yet the two pairs are clearly not equally similar.
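The point about edit distance can be checked with a minimal sketch. The Levenshtein routine is the standard dynamic-programming formulation; the `normalized_similarity` helper, which divides by the longer string's length, is an assumption made here for illustration and is not the formula used by the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


for x, y in [("a", "b"), ("aaaaaa", "aaaaab")]:
    print(x, y, edit_distance(x, y), round(normalized_similarity(x, y), 3))
```

Both pairs have an edit distance of 1, but once length is taken into account the first pair shares nothing while the second pair differs in one character out of six.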
[0011] 3. The Jaro-Winkler similarity algorithm requires a reasonable value for the parameter p. If p is set too large or too small, the misjudgment rate rises sharply. For example, with p = 0.25, two texts need only share their first 4 characters to be judged 100% similar, so the strings aaaa11111 and aaaabbbbb would be judged 100% similar. In addition, testing gave Jaro-Winkler(180721a12, 15da36xiao2) = 0.58698, which clearly overestimates the similarity of the two strings. Such results are plainly unreasonable, yet they can occur.
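The effect of p can be seen in a short sketch of the textbook Jaro-Winkler definition, sim_w = sim_j + l * p * (1 - sim_j), where l is the common-prefix length capped at 4. This is a plain reference implementation written for illustration; the patent does not say which implementation produced its test value, so that value is not asserted here.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    sim_j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return sim_j + prefix * p * (1.0 - sim_j)


print(jaro_winkler("aaaa11111", "aaaabbbbb", p=0.25))  # prefix term 4 * 0.25 = 1
print(jaro_winkler("aaaa11111", "aaaabbbbb", p=0.1))
```

With p = 0.25 and a 4-character shared prefix, the term l * p equals 1, so the result collapses to 1 regardless of how different the rest of the two strings is.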
[0012] 4. When illegal users register accounts in batches, the account names are usually generated according to specific rules, so their formats are more similar to one another than those of ordinary users' accounts. Format similarity should therefore be given some weight as a factor in the similarity measure, but the Jaro-Winkler algorithm does not include such a calculation.


Embodiment Construction

[0076] The present invention will be further described below in conjunction with the accompanying drawings.

[0077] As shown in figure 1, the present invention provides a method for quantifying text similarity that is suitable for identifying illegal accounts. The method includes the following steps:

[0078] Step S1: receive a request to judge the similarity of two texts, and receive the preset one-factor weight table.

[0079] Step S2: read the strings StrA and StrB corresponding to the two texts and obtain their lengths Len_A and Len_B. Split the skeletons of StrA and StrB respectively to obtain the skeleton structures Skeleton_A and Skeleton_B, the part-length collections PartSizeList_A and PartSizeList_B, the part counts PartAmount_A and PartAmount_B, and the per-part content collections PartContentList_A and PartContentList_B.
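The quantities named in step S2 can be illustrated with a rough sketch. The excerpt does not define how the skeleton is split, so the rule used here (maximal runs of digits, letters, or other symbols, encoded as D, L, S) is purely an assumption, and `char_class` and `split_skeleton` are hypothetical helper names.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Assumed character classes: D = digit, L = letter, S = anything else."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "L"
    return "S"


def split_skeleton(text: str) -> dict:
    """Split text into maximal same-class runs and collect the S2 quantities."""
    parts = ["".join(group) for _, group in groupby(text, key=char_class)]
    return {
        "Len": len(text),
        "Skeleton": "".join(char_class(part[0]) for part in parts),
        "PartSizeList": [len(part) for part in parts],
        "PartAmount": len(parts),
        "PartContentList": parts,
    }


for name in ("180721a12", "15da36xiao2"):
    print(name, split_skeleton(name))
```

Under this assumed rule, 180721a12 splits into three parts with skeleton DLD, while 15da36xiao2 splits into five parts with skeleton DLDLD, which makes the format difference between the two strings explicit.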

[0080] Step S3, generating part length sets PartSizeStr_A, PartSizeStr_B in ch...


Abstract

The present invention provides a method for quantifying text similarity. The method is as follows: step S1, receive a request to judge the similarity of two texts, and receive a preset one-factor weight table; step S2, read the strings StrA and StrB corresponding to the two texts, obtain their lengths Len_A and Len_B, and split the skeletons of StrA and StrB respectively; step S3, from the part-length collections PartSizeList_A and PartSizeList_B produced by the skeleton split, generate the part-length collections PartSizeStr_A and PartSizeStr_B stored in character data format; step S4, calculate the similarity factors from the parameters obtained by the skeleton split; step S5, using the factor weight table, compute the weighted sum of the similarity factors to obtain the overall similarity and thereby judge whether the two texts are similar. Illegal accounts can thus be identified and then monitored and blocked.
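Step S5 can be sketched as a weighted sum over named factors. The factor names, the weight values, and the division by the total weight are all assumptions made for illustration; the excerpt gives neither the patent's actual factors nor its tuned weight table.

```python
def overall_similarity(factors: dict, weights: dict) -> float:
    """Weighted sum of similarity factors, normalized by the total weight.

    The normalization is an assumption; it keeps the result in [0, 1]
    when every factor is in [0, 1] and the weights are non-negative.
    """
    total_weight = sum(weights[name] for name in factors)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(value * weights[name] for name, value in factors.items())
    return weighted / total_weight


# Hypothetical factor values and weights, not taken from the patent.
factors = {"skeleton": 1.0, "length": 0.5}
weights = {"skeleton": 3.0, "length": 1.0}
print(overall_similarity(factors, weights))
```

The overall score would then be compared against a threshold to decide whether the two texts count as similar; the threshold is likewise not given in the excerpt.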

Description

Technical field

[0001] The invention relates to the field of computer system communication technology and the field of illegal product detection. It provides a method for quantifying text similarity that is suitable for identifying illegal accounts, together with a parameter set tuned through testing on a large amount of data. The method is especially suitable as the similarity comparison and measurement method in business scenarios where account sets are grouped and illegal accounts are screened out. Users can use the method as a basis for counting the number of similar accounts in each group of an account set, and thereby filter out illegal accounts.

Background technique

[0002] The term “illegal account” refers to accounts used for illegal purposes, such as gold-farming accounts run by game studios and troll accounts on Internet forums. To make account management easier, illegal production teams often register collections of illegal accounts with consecutive numbers in batches, such as ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/33
CPC: G06F16/35, G06F16/334
Inventor: 刘德建, 任佳伟, 陈宏展
Owner: 福建天晴在线互动科技有限公司