SimBlock algorithm for realizing high-quality text similarity calculation and implementation method

A high-quality text similarity technology, applied in computing, unstructured text data retrieval, text database clustering/classification, etc. It addresses problems such as the inability to mark similar substrings, and improves overall stability and scheduling performance.

Pending Publication Date: 2022-04-29
东方财富信息股份有限公司

AI Technical Summary

Problems solved by technology

[0019] According to recent literature reports, Google's BERT model can effectively capture positional information. It uses a multi-layer Multi-Head Attention deep learning network and relies on dedicated GPU / TPU hardware for extensive pre-training. Different network layers can represent different semantic levels, and semantic vectorization at different depths is its biggest advantage. However, it likewise cannot mark similar substrings or give a one-to-one correspondence between them.

Examples

Embodiment Construction

[0055] The present invention is further illustrated below in conjunction with a specific embodiment. It should be understood that this embodiment is only used to illustrate the present invention and is not intended to limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to it, and these equivalent forms also fall within the scope defined by the appended claims of the present application.

[0056] This embodiment discloses a SimBlock algorithm (similar block matrix algorithm) that realizes high-quality text similarity calculation. On the basis of string vectorization and cosine similarity calculation, it supplements the local ordered information of the strings. It specifically includes the following steps:

[0057] Convert string 1 to be compared and string 2 to be compared into an ordered stack of each charac...
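For context, the baseline that paragraph [0056] builds on (string vectorization followed by cosine similarity) can be sketched roughly as below. This is not the SimBlock algorithm, whose steps are truncated in this listing. The function name `char_cosine` and the choice of character-count vectors are assumptions made for illustration only.

```python
import math
from collections import Counter


def char_cosine(a: str, b: str) -> float:
    """Cosine similarity between the character-count vectors of two strings.

    Illustrative baseline only (hypothetical helper, not from the patent).
    """
    va, vb = Counter(a), Counter(b)
    # Dot product over the characters that occur in both strings.
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(n * n for n in va.values()))
    norm_b = math.sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


# Two strings with the same characters in a different order.
score_reordered = char_cosine("listen", "silent")
# Two strings with no characters in common.
score_disjoint = char_cosine("abc", "xyz")
```

Because "listen" and "silent" have identical character counts, the baseline scores them as identical (up to floating-point rounding) even though the character order differs. This is the kind of lost ordering information that the embodiment says it supplements.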

Abstract

The method targets big-data text deduplication scenarios, such as search engines, aggregation of news topic modules, identification of original versus plagiarized content, and content governance for repetitive spam posts and comments. One technical scheme provides the SimBlock algorithm (similar block matrix algorithm) for similarity calculation. It overcomes the defect of traditional similarity algorithms, in which character intersection and character-string vectorized cosine lose ordering information, by supplementing local ordered information, and it improves the efficiency of similarity calculation. The resulting similarity score is insensitive to the lengths of the two strings, and the algorithm can determine the logical containment relation, the positions of the similar substrings, and their one-to-one correspondence. Another technical scheme of the invention provides a distributed computing architecture suited to the algorithm: high-concurrency computing pressure is offloaded to a highly parallel cluster of algorithm microservices, and high-concurrency read-write pressure is offloaded to a cache cluster, so that each multi-process Source / Trans / Sink unit stays lightweight.
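To make the idea of "positions of similar substrings and their one-to-one correspondence" concrete, the sketch below shows what such an output could look like. It uses `SequenceMatcher` from Python's standard `difflib` module as a stand-in. It is not the SimBlock algorithm, and the helper name `matching_blocks` and the `min_len` parameter are assumptions for illustration.

```python
from difflib import SequenceMatcher


def matching_blocks(a: str, b: str, min_len: int = 2):
    """List (start_in_a, start_in_b, text) for each common block of at least min_len characters.

    Stand-in based on difflib, not the patented method.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length sentinel block
    ]


blocks = matching_blocks("stock prices rose today", "today stock prices fell")
for start_a, start_b, text in blocks:
    print(f"a[{start_a}] <-> b[{start_b}]: {text!r}")
```

Each tuple pairs a position in the first string with a position in the second, which is the one-to-one correspondence the abstract refers to. Note that `difflib` only reports blocks that appear in the same relative order in both strings, so the moved word "today" is not reported here. A similarity method aimed at deduplication would presumably need to handle such reordered blocks as well.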

Description

Technical field

[0001] The present invention relates to a SimBlock algorithm (similar block matrix algorithm) that realizes high-quality text similarity calculation. It can give the similarity score between two character strings, their logical containment relation, the positions of similar substrings, and their one-to-one correspondence. The present invention also relates to a technical solution that applies the algorithm to the deduplication of big-data texts and designs a distributed computing architecture for it.

Background technique

[0002] The similarity algorithm is a basic service of a search engine. It aggregates search results with the same or similar text content (at the shallow character level or the deeper semantic level) into a group and displays them as one search result. This not only saves layout space and improves search performance, but also saves users time and improves the user experience. The similarity algorithm is widely used in scenarios such as ...
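The aggregation described in [0002], where results with the same or similar text are shown as one group, can be sketched as a greedy grouping pass. This is only an illustration under stated assumptions: the names `group_similar` and `char_jaccard`, the 0.8 threshold, and the use of a character-set Jaccard score as the similarity function are all hypothetical, and any similarity function with the same signature could be plugged in.

```python
from typing import Callable, List


def char_jaccard(a: str, b: str) -> float:
    """Character-set overlap (intersection over union); placeholder similarity."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


def group_similar(
    texts: List[str],
    similarity: Callable[[str, str], float] = char_jaccard,
    threshold: float = 0.8,
) -> List[List[str]]:
    """Greedily place each text into the first group whose representative is similar enough."""
    groups: List[List[str]] = []
    for text in texts:
        for group in groups:
            # Compare against the group's first member, used as its representative.
            if similarity(group[0], text) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])  # no similar group found: start a new one
    return groups


results = [
    "Shares rise on earnings",
    "Shares rise on earnings!",
    "Weather: sunny tomorrow",
]
groups = group_similar(results)
```

With these inputs the two near-identical headlines fall into one group and the unrelated one forms its own, so a result page would show two entries instead of three. Comparing only against a group's first member keeps the pass simple, at the cost of making the grouping depend on input order.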

Application Information

IPC(8): G06F40/194, G06F16/35, G06F16/903
CPC: G06F40/194, G06F16/35, G06F16/90344
Inventor 罗伟杰
Owner 东方财富信息股份有限公司