Distributed similarity join method for data streams based on EMD distance

A technology relating to EMD distance and join methods, applied to data exchange networks, digital transmission systems, electrical components, etc. It addresses three problems: existing load balancing strategies do not suit dynamically changing data stream environments, EMD distance has high computational complexity, and existing approaches are unsuitable for distributed similarity joins. It achieves the effects of improving overall execution efficiency, reducing computing load, and enhancing filtering performance.

Active Publication Date: 2021-02-26
GUANGXI UNIV
Cites: 3; Cited by: 0

AI Technical Summary

Problems solved by technology

Because EMD distance has high computational complexity, the number of EMD computations actually performed when nodes execute similarity joins should be the focus of the join cost model. Existing research work is therefore not suited to the distributed similarity join problem based on EMD distance over data streams.
Although the EMD-MPJ algorithm and the Top-k DLPJ algorithm design similarity join schemes for distributed parallel computing environments based on the characteristics of EMD distance, both handle static data sets, and their load balancing strategies do not suit dynamically changing data stream environments.

Examples

Embodiment Construction

[0045] Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings; they do not limit the scope of protection of the claims of the present invention.

[0046] 1. Relevant definitions and technologies

[0047] 1.1. Definition of EMD distance:

[0048] Given two histogram tuples P = {p_1, ..., p_n} and Q = {q_1, ..., q_n}, each containing n data buckets, and the ground distance matrix C = [c_ij], the EMD distance between P and Q, denoted EMD(P, Q), is the minimum transport cost of transforming histogram P into histogram Q. It equals the optimal value of the following linear programming problem.

[0049]
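
Formula (1) did not survive in this copy of the text. For reference, the EMD linear program is conventionally written as follows, assuming both histograms are normalized to the same total mass. The flow variables f_ij are notation introduced here for illustration and are not taken from the patent text:

```latex
\begin{aligned}
\mathrm{EMD}(P, Q) = \min_{F = [f_{ij}]} \; & \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} f_{ij} \\
\text{s.t.}\quad & f_{ij} \ge 0, && 1 \le i, j \le n, \\
& \sum_{j=1}^{n} f_{ij} = p_i, && 1 \le i \le n, \\
& \sum_{i=1}^{n} f_{ij} = q_j, && 1 \le j \le n.
\end{aligned}
```

In this form f_ij is the amount moved from bucket i of P to bucket j of Q, so the program has one unknown per bucket pair.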

[0050] In formula (1), the known ground distance c_ij ∈ C represents the transport cost from the i-th bucket of histogram P to the j-th bucket of histogram Q. It can be seen that the linear programming problem has a total of n^2...
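
To make the cost of a single EMD evaluation concrete, the following Python sketch solves the transportation problem behind the EMD definition as a min-cost flow with successive shortest paths. It is an illustrative solver only, not the method claimed by the patent. The function name `emd` and the tolerance `eps` are assumptions made for this example, and both histograms are assumed to carry the same total mass.

```python
from typing import List


def emd(p: List[float], q: List[float], cost: List[List[float]],
        eps: float = 1e-12) -> float:
    """Exact EMD between two equal-mass histograms (illustrative sketch)."""
    n = len(p)
    if len(q) != n or abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms need equal length and equal total mass")

    inf = float("inf")
    # Node ids: 0 = source, 1..n = buckets of P, n+1..2n = buckets of Q, 2n+1 = sink.
    source, sink, num_nodes = 0, 2 * n + 1, 2 * n + 2
    edges = []  # each entry is [target, residual capacity, unit cost]
    adj = [[] for _ in range(num_nodes)]

    def add_edge(u: int, v: int, cap: float, w: float) -> None:
        # Edge k and edge k ^ 1 form a forward/backward pair.
        adj[u].append(len(edges)); edges.append([v, cap, w])
        adj[v].append(len(edges)); edges.append([u, 0.0, -w])

    for i in range(n):
        add_edge(source, 1 + i, p[i], 0.0)
        add_edge(1 + n + i, sink, q[i], 0.0)
        for j in range(n):
            add_edge(1 + i, 1 + n + j, inf, cost[i][j])

    total = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [inf] * num_nodes
        prev_edge = [-1] * num_nodes
        dist[source] = 0.0
        for _ in range(num_nodes - 1):
            changed = False
            for u in range(num_nodes):
                if dist[u] == inf:
                    continue
                for k in adj[u]:
                    v, cap, w = edges[k]
                    if cap > eps and dist[u] + w < dist[v] - eps:
                        dist[v] = dist[u] + w
                        prev_edge[v] = k
                        changed = True
            if not changed:
                break
        if dist[sink] == inf:
            break  # all mass has been routed

        # Find the bottleneck along the path, then push that much flow.
        push, v = inf, sink
        while v != source:
            k = prev_edge[v]
            push = min(push, edges[k][1])
            v = edges[k ^ 1][0]
        v = sink
        while v != source:
            k = prev_edge[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total += push * dist[sink]
    return total


if __name__ == "__main__":
    ground = [[abs(i - j) for j in range(3)] for i in range(3)]
    print(emd([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], ground))
```

Even this small solver builds a graph with one edge per bucket pair, which is why reducing the number of EMD evaluations matters for the join cost.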

Abstract

Distributed parallel computing is an effective means of improving the efficiency of similarity join processing over data streams. EMD distance quantifies the similarity between histogram tuples more accurately, but its computational complexity is up to cubic, which hinders its use in data stream similarity joins. To address this, the present invention discloses a distributed similarity join method for data streams based on EMD distance, built on the open-source distributed parallel stream computing framework Apache Storm. For data distribution, it adopts a data-locality-based stream partitioning strategy that maintains data locality on the join computing nodes. Exploiting this locality, it strengthens the join algorithm's ability to filter out EMD computations between dissimilar histogram tuple pairs, improving the execution efficiency of each join computing node. In addition, based on a cost model of the join computing nodes, it adopts a feedback-based load balancing strategy that improves the system's overall processing performance and scalability for data streams, raising processing throughput by up to 50% compared with the existing technology.
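
The filter-then-verify idea behind a single join node can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the patented algorithm and not Apache Storm code: the ground distance is assumed to be |i - j| between bucket indices (so EMD has a closed form via prefix sums), the filter is the simple bound "half the L1 distance", and the names `WindowJoin`, `emd_1d`, and `lower_bound` are hypothetical.

```python
from collections import deque
from itertools import accumulate
from typing import List, Sequence, Tuple


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD of equal-mass histograms when the ground distance is |i - j|."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def lower_bound(p: Sequence[float], q: Sequence[float]) -> float:
    """Cheap bound: the mass that must leave its bucket travels at least 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


class WindowJoin:
    """Count-based sliding-window similarity join over two streams, R and S."""

    def __init__(self, threshold: float, window_size: int) -> None:
        self.threshold = threshold
        self.windows = {"R": deque(maxlen=window_size),
                        "S": deque(maxlen=window_size)}
        self.emd_calls = 0  # exact EMD evaluations actually performed
        self.filtered = 0   # candidate pairs discarded by the cheap bound

    def insert(self, stream: str, tuple_id: str,
               hist: Sequence[float]) -> List[Tuple[str, str, float]]:
        """Probe the opposite window, then store the new tuple."""
        other = "S" if stream == "R" else "R"
        matches = []
        for other_id, other_hist in self.windows[other]:
            if lower_bound(hist, other_hist) > self.threshold:
                self.filtered += 1  # cannot be similar, skip the exact EMD
                continue
            self.emd_calls += 1
            d = emd_1d(hist, other_hist)
            if d <= self.threshold:
                matches.append((tuple_id, other_id, d))
        self.windows[stream].append((tuple_id, hist))
        return matches


if __name__ == "__main__":
    join = WindowJoin(threshold=0.5, window_size=2)
    join.insert("R", "r1", [0.5, 0.5, 0.0])
    print(join.insert("S", "s1", [0.5, 0.25, 0.25]))
    print(join.insert("S", "s2", [0.0, 0.0, 1.0]))
    print(join.emd_calls, join.filtered)
```

The `emd_calls` counter mirrors the quantity the abstract identifies as the dominant cost: the fewer exact EMD evaluations a node performs per arriving tuple, the higher its throughput.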

Description

Technical field

[0001] The invention belongs to the technical field of data stream similarity joins, and in particular relates to a distributed similarity join method for data streams based on EMD distance.

Background

[0002] With the continuous improvement of data acquisition equipment and the rapid development of data acquisition technology, how to perform high-quality analysis on continuously and rapidly generated data streams has become a common concern of industry and academia. A data stream similarity join returns similar tuple pairs across two data streams. It is an important operation for analyzing and mining data streams and is widely used in practical applications such as event detection and data deduplication.

[0003] For similarity joins between data streams composed of data tuples represented by the histogram data model (histogram tuples for short), each histogram tuple can be formally expressed as P = {p_1, ..., p_n}, where p_i ∈ [0...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): H04L12/801, H04L12/803, H04L12/807, H04L12/815, H04L12/851, H04L12/911, H04L47/22, H04L47/27
Inventors: 许嘉, 吕品
Owner: GUANGXI UNIV