A method, system and electronic device for searching similar data based on orientation encoding
A similar data search method that constructs a nearest neighbor graph and performs orientation encoding addresses the low search efficiency of nearest neighbor search in high-dimensional space, enabling efficient and accurate determination of similar data at reduced computational cost.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD
- Filing Date
- 2023-05-19
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies face the "curse of dimensionality" problem caused by nearest neighbor search in high-dimensional space when processing unstructured big data. This results in high computational costs, low search efficiency, and difficulty in efficiently and accurately identifying similar data.
A similar data search method based on orientation encoding is adopted. Multidimensional feature vectors are generated by feature vector extraction, a nearest neighbor graph is constructed and orientation encoding is performed, and the nearest neighbor data points are determined iteratively based on the vector orientation encoding.
It enables efficient and accurate identification of similar data, reduces computational costs, improves search efficiency, and solves the problem of low search efficiency in high-dimensional spaces.
Figure CN116628280B_ABST
Description
Technical Field
[0001] This invention relates to the field of big data processing technology, specifically to a method, system, and electronic device for searching similar data based on orientation coding.
Background Technology
[0002] In today's information society, big data technology is being used more and more widely. Its main component is semi-structured and unstructured data (such as text, images, audio, and video data), accounting for over 85% of the total collected data, and containing immense value. However, unstructured big data is characterized by its massive volume, heterogeneity, and complexity, posing unprecedented challenges to information storage, computing, and data processing technologies for various applications. To fully explore and utilize the value of unstructured big data, it is necessary to be able to efficiently process and analyze massive amounts of unstructured data. Similarity search, as a key fundamental problem in this process, has significant research importance.
[0003] Similarity search refers to the process of searching for data most similar to specified query data from a given dataset. Due to the structural complexity of unstructured data, it is difficult to directly calculate similarity. Some related techniques extract feature vectors to convert the data into data points in a vector space for nearest neighbor search. However, nearest neighbor search in high-dimensional space often suffers from the "curse of dimensionality," resulting in less than ideal performance. Other related techniques also often suffer from high computational costs and low search efficiency.
Summary of the Invention
[0004] In view of this, embodiments of this specification provide a method, system, and electronic device for searching similar data based on orientation coding, which can efficiently and accurately identify similar data, reduce computational costs, and improve search efficiency.
[0005] In a first aspect, embodiments of this specification provide a method for searching similar data based on orientation coding, including:
[0006] Obtain specified query data and a given dataset, wherein the given dataset includes multiple basic data items, and both the specified query data and the multiple basic data items are unstructured data;
[0007] Feature vectors are extracted from the specified query data and multiple basic data to generate corresponding multidimensional feature vectors. Based on the multidimensional vectors, query points corresponding to the specified query data and multiple data points corresponding to the multiple basic data are determined in the multidimensional vector space.
[0008] A nearest neighbor graph is constructed for multiple data points. In the nearest neighbor graph, orientation encoding is performed on the multiple neighbor points of each data point to determine the corresponding vector orientation codes of the multiple neighbor points.
[0009] Based on the nearest neighbor graph and the vector orientation encoding, a nearest neighbor search is performed on multiple data points to determine the nearest neighbor data point of the query point;
[0010] The basic data corresponding to the nearest neighbor data point is determined to be similar data corresponding to the specified query data.
[0011] This specification also provides an embodiment of a similar data search system based on orientation coding, including:
[0012] The data acquisition module is used to acquire specified query data and a given dataset, wherein the given dataset includes multiple basic data, and both the specified query data and the multiple basic data are unstructured data.
[0013] The feature vector extraction module is used to extract feature vectors from the specified query data and multiple basic data, generate corresponding multidimensional feature vectors, and determine the query point corresponding to the specified query data and multiple data points corresponding to the multiple basic data in the multidimensional vector space based on the multidimensional vectors.
[0014] The orientation coding module is used to construct a nearest neighbor graph for multiple data points, perform orientation coding on multiple neighbor points of each data point in the nearest neighbor graph, and determine the corresponding vector orientation codes of the multiple neighbor points.
[0015] The nearest neighbor search module is used to perform a nearest neighbor search among multiple data points based on the nearest neighbor graph and the vector orientation encoding, to determine the nearest neighbor data point of the query point; and
[0016] The similar data determination module is used to determine that the basic data corresponding to the nearest neighbor data point is similar data corresponding to the specified query data.
[0017] This specification also provides an electronic device for searching similar data based on orientation encoding, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the similar data search method based on orientation encoding as described in the first aspect.
[0018] As can be seen from the above, the similar data search method, system, and electronic device based on orientation coding provided in the embodiments of this specification have the following beneficial technical effects:
[0019] The described similar data search method based on azimuth encoding determines the multidimensional feature vectors of multiple basic data points in a given dataset and the specified query data. It constructs a nearest neighbor graph for multiple data points corresponding to these basic data points, and performs azimuth encoding based on the azimuth relationships between data points and their corresponding neighbors in the multidimensional vector space to determine the vector azimuth codes corresponding to the multiple neighbor points. Then, it performs a nearest neighbor search based on these vector azimuth codes. This method enables faster searching and determination of the nearest neighbor data points, thereby identifying similar data corresponding to the specified query data. This approach can efficiently and accurately determine similar data, reducing computational costs and improving search efficiency.
Attached Figure Description
[0020] The features and advantages of the invention will be more clearly understood by referring to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
[0021] Figure 1 illustrates a similar data search method based on orientation coding provided by one or more optional embodiments of this specification;
[0022] Figure 2 illustrates a method for constructing a nearest neighbor graph in the similar data search method based on orientation coding provided in one or more optional embodiments of this specification;
[0023] Figure 3 illustrates a method for orientation encoding in the similar data search method based on orientation coding provided by one or more optional embodiments of this specification;
[0024] Figure 4 illustrates a method for nearest neighbor search in the similar data search method based on orientation coding provided by one or more optional embodiments of this specification;
[0025] Figure 5 illustrates the structure of a similar data search system based on orientation coding provided by one or more optional embodiments of this specification;
[0026] Figure 6 illustrates the structure of an electronic device for searching similar data based on orientation coding provided by one or more optional embodiments of this specification.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0028] In today's information society, big data technology is being used more and more widely. Its main component is semi-structured and unstructured data (such as text, images, audio, and video data), accounting for over 85% of the total collected data, and containing immense value. However, unstructured big data is characterized by its massive volume, heterogeneity, and complexity, posing unprecedented challenges to information storage, computing, and data processing technologies for various applications. To fully explore and utilize the value of unstructured big data, it is necessary to be able to efficiently process and analyze massive amounts of unstructured data. Similarity search, as a key fundamental problem in this process, has significant research importance.
[0029] Similarity search refers to the process of searching for the most similar data to a specified query from a given dataset. Due to the structural complexity of unstructured data, it is difficult to directly calculate similarity. Some related techniques extract feature vectors to convert the data into data points in a vector space for nearest neighbor search. However, nearest neighbor search in high-dimensional space often suffers from the "curse of dimensionality," resulting in less than ideal performance. Other related techniques for nearest neighbor search in high-dimensional space also often suffer from excessive computational load, high computational cost, and low search efficiency.
[0030] To address the aforementioned issues, the purpose of this specification is to propose a method, system, and electronic device for searching similar data based on orientation encoding. This method performs orientation encoding on specified query data and data in a given dataset before conducting nearest neighbor retrieval. During the nearest neighbor search, Hamming distance is calculated based on the orientation encoding for iterative searching, effectively reducing computational complexity and improving retrieval efficiency. This allows for faster and more accurate retrieval of similar data from a given dataset.
[0031] To achieve the above objectives, this specification provides a method for searching similar data based on orientation coding.
[0032] As shown in Figure 1, one or more optional embodiments of this specification provide a similar data search method based on orientation coding, including:
[0033] S1: Obtain the specified query data and the given dataset, wherein the given dataset includes multiple basic data, and both the specified query data and the multiple basic data are unstructured data.
[0034] Similarity search aims to find the data items in a given dataset that are most similar to specified query data. The specified query data and the given dataset can be obtained through a data input interface, or the given dataset can be directly obtained from a database. The specified query data and the multiple basic data items are unstructured data, such as image data, video data, and text data.
[0035] S2: Extract feature vectors from the specified query data and multiple basic data to generate corresponding multidimensional feature vectors, and determine the query point corresponding to the specified query data and multiple data points corresponding to the multiple basic data in the multidimensional vector space based on the multidimensional vectors.
[0036] Feature vectors are extracted from the specified query data and multiple basic data, transforming the originally unstructured specified query data and basic data into multi-dimensional feature vectors for representation and description.
[0037] For different types of unstructured data, corresponding feature vector extraction methods can be employed. For example, for image and video data, there are the SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients) feature extraction methods; for audio data, there are MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) features; for text data, there are the TF-IDF and word2vec algorithms. Furthermore, deep learning models can be built for various types of unstructured data, and the high-dimensional vectors extracted by these models can be used as the high-dimensional feature vectors corresponding to the unstructured data.
[0038] By extracting feature vectors from unstructured data to generate corresponding multidimensional feature vectors, the specified query data and multiple basic data can be represented by a point in a multidimensional vector space. In the multidimensional vector space, the distance between points can be used to measure the similarity between corresponding unstructured data.
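As a minimal sketch of how distance between points can stand in for similarity between data items, the following computes the Euclidean distance between feature vectors. The three vectors are hypothetical values chosen for illustration and do not come from the specification.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional feature vectors (illustrative values only).
query = [0.0, 0.0, 0.0]
item_a = [1.0, 2.0, 2.0]
item_b = [0.0, 3.0, 4.0]

# The item with the smaller distance is treated as the more similar one.
more_similar = min([item_a, item_b], key=lambda v: euclidean_distance(query, v))
```

Here `item_a` lies closer to `query` than `item_b`, so it would be treated as the more similar item.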
[0039] S3: Construct a nearest neighbor graph for multiple data points, and perform orientation coding on multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation codes of the multiple neighbor points.
[0040] By constructing a nearest neighbor graph for multiple data points, multiple neighbor points corresponding to each data point can be determined. This allows for the subsequent use of a nearest neighbor search method to iteratively find data points closer to the query point from the neighbors of the reference point, thereby continuously moving closer to the nearest neighbor point and ultimately determining the nearest neighbor point.
[0041] In the nearest neighbor graph, the orientation relationship between the data point and its neighbors in the multidimensional vector space is determined based on the multidimensional vector features of the data point and its neighbors. The orientation relationship is then used to encode the orientation of the neighbors. The determined vector orientation code can characterize the data features of the unstructured data corresponding to the multidimensional feature vector to a certain extent and can be used to measure the similarity between data.
[0042] S4: Based on the nearest neighbor graph and the vector orientation encoding, perform nearest neighbor search on multiple data points to determine the nearest neighbor data point of the query point.
[0043] The nearest neighbor search method iteratively searches for data points closer to the query point from among the neighbors of the reference point, continuously moving closer to the nearest neighbor until the nearest neighbor is finally determined. During the iterative process of the nearest neighbor search, the Hamming distance between points can be determined based on the query point and the vector orientation codes corresponding to multiple data points, thus significantly improving computational efficiency and reducing computational costs.
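The Hamming distance between two orientation codes is the number of bit positions in which they differ, which is much cheaper to evaluate than a distance over the full feature vectors. The sketch below assumes the codes are stored as equal-length strings of "0" and "1"; the specification does not prescribe a storage format, so this representation is an assumption.

```python
def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


# Hypothetical 4-bit orientation codes (illustrative values only).
query_code = "1011"
neighbor_codes = {"n1": "1001", "n2": "0100", "n3": "1011"}

# Rank neighbors by how closely their code matches the query's code.
ranked = sorted(neighbor_codes, key=lambda n: hamming_distance(query_code, neighbor_codes[n]))
```

If the codes were packed into integers instead, the same count could be obtained with `bin(a ^ b).count("1")`.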
[0044] S5: Determine that the basic data corresponding to the nearest neighbor data point is similar data corresponding to the specified query data.
[0045] The described similar data search method based on azimuth encoding determines the multidimensional feature vectors of multiple basic data points in a given dataset and the specified query data. It constructs a nearest neighbor graph for multiple data points corresponding to these basic data points, and performs azimuth encoding based on the azimuth relationships between data points and their corresponding neighbors in the multidimensional vector space to determine the vector azimuth codes corresponding to the multiple neighbor points. Then, it performs a nearest neighbor search based on these vector azimuth codes. This method enables faster searching and determination of the nearest neighbor data points, thereby identifying similar data corresponding to the specified query data. This approach can efficiently and accurately determine similar data, reducing computational costs and improving search efficiency.
[0046] As shown in Figure 2, in the similar data search method based on orientation encoding provided in one or more optional embodiments of this specification, constructing a nearest neighbor graph for multiple data points includes:
[0047] S201: Select each of the multiple data points in turn as a vertex.
[0048] S202: Determine the distance between the vertex and the other multiple data points, and select the multiple data points with the smallest distance to the vertex as the vertex's neighbor points.
[0049] The distances between the vertex and the other data points can be sorted from smallest to largest, and the data points with the top-ranked (that is, smallest) distances can be selected as the vertex's neighbor points. The number of selected neighbor points can be set flexibly according to actual circumstances.
[0050] S203: Connect the vertex to the corresponding plurality of neighboring points using directed edges, wherein the directed edges point from the vertex to the neighboring points.
[0051] A nearest neighbor graph can be constructed for multiple data points corresponding to the given dataset using a linear scanning method. The nearest neighbor graph can more intuitively represent the distance relationship between multiple points. Using the nearest neighbor graph to assist the nearest neighbor search can greatly improve the search efficiency.
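Steps S201 to S203 can be sketched as a brute-force linear scan. The function name `build_knn_graph`, the parameter `k` (the number of neighbor points kept per vertex), and the sample points are illustrative assumptions; the graph is stored as a dictionary mapping each vertex index to the indices its directed edges point to.

```python
import math


def build_knn_graph(points, k):
    """Return {vertex: [neighbor, ...]} with directed edges vertex -> neighbor."""
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    graph = {}
    for v, pv in enumerate(points):                      # S201: each point in turn
        distances = [(math.dist(pv, pu), u)              # S202: distance to every other point
                     for u, pu in enumerate(points) if u != v]
        distances.sort()                                 # smallest distances first
        graph[v] = [u for _, u in distances[:k]]         # S203: edges v -> k closest points
    return graph


# Hypothetical 2-dimensional data points forming two small clusters.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph = build_knn_graph(points, k=1)
```

This scan computes every pairwise distance, so its cost grows quadratically with the number of data points; it is done once, before any query is answered.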
[0052] As shown in Figure 3, in the similar data search method based on orientation encoding provided in one or more optional embodiments of this specification, performing orientation encoding on the multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation codes of the multiple neighbor points includes:
[0053] S301: Perform principal component analysis on the multidimensional feature vectors corresponding to the multiple data points in the given dataset to determine the principal component projection matrix.
[0054] Principal component analysis can be used to calculate the first g′ principal components of the multidimensional feature vectors corresponding to the multiple data points to form the principal component projection matrix.
[0055] The principal components:
[0056] P = [p_1, p_2, …, p_i, …, p_g′]
[0057] where p_i represents the i-th comprehensive index component among the principal components. The principal components consist of g′ pairwise orthogonal vectors, where g′ = log₂g.
[0058] Principal component analysis (PCA) uses dimensionality reduction to transform multi-dimensional indicators into a few comprehensive indicators (i.e., principal components). The principal components together reflect most of the information of the original variables, and the information contained in each component does not overlap with that of the others.
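One way to obtain the leading principal components without external libraries is power iteration on the covariance matrix with deflation. This is a sketch under stated assumptions: the specification does not say how the analysis is computed, and the function name, iteration count, random seed, and sample vectors are all illustrative. It also assumes the data has variance in at least as many directions as components are requested.

```python
import math
import random


def principal_components(vectors, num_components, iterations=200, seed=0):
    """Approximate the leading principal components (power iteration + deflation)."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Covariance matrix (d x d) of the centered feature vectors.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(num_components):
        vec = [rng.gauss(0.0, 1.0) for _ in range(d)]    # random starting direction
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:                             # no variance left to find
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(d))
                         for a in range(d))
        components.append(vec)
        # Deflation: subtract the direction just found so the next pass finds the following one.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return components


g = 4                                   # hypothetical number of neighbor points per data point
g_prime = int(math.log2(g))             # number of principal components to keep
vectors = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
components = principal_components(vectors, g_prime)
```

In this toy dataset the variance is largest along the first axis, so the first component aligns with it and the second with the remaining axis. The sign of each component is arbitrary.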
[0059] S302: In the nearest neighbor graph, for each data point, the orientation code string of each of its neighbor points relative to that data point is calculated according to the principal component projection matrix and used as the vector orientation code of the neighbor point.
[0060] For each data point x in the given dataset, let y denote a neighbor point of x in the nearest neighbor graph. Using the principal component projection matrix, the orientation code string of y relative to x is calculated:
[0061] C_y = c_1c_2…c_i…c_g′
[0062]
[0063] where c_i represents the i-th bit of the orientation code string. It characterizes the positional relationship between the multidimensional feature vector of the neighbor point and the multidimensional feature vector of the data point: this relationship is multiplied by the corresponding comprehensive index component, and the result, which represents the directional relationship between the two, is encoded as 0 or 1. The directional relationship codes for the multiple comprehensive index components are concatenated to form the orientation code string, which is the vector orientation code corresponding to the neighbor point.
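The encoding step can be sketched as follows. The exact bit formula is not legible in the text above, so the sign convention used here is an assumption: bit i is 1 when the difference vector from the data point to its neighbor has a non-negative inner product with the i-th component, and 0 otherwise. The function name and the sample components are likewise illustrative.

```python
def orientation_code(point, neighbor, components):
    """Encode the direction from `point` to `neighbor` as a bit string.

    ASSUMPTION: bit i is "1" when (neighbor - point) projects non-negatively
    onto component i, and "0" otherwise. The source does not spell this out.
    """
    diff = [n - p for p, n in zip(point, neighbor)]
    bits = []
    for comp in components:
        projection = sum(d * c for d, c in zip(diff, comp))
        bits.append("1" if projection >= 0 else "0")
    return "".join(bits)


# Hypothetical orthogonal components (here simply the coordinate axes).
components = [[1.0, 0.0], [0.0, 1.0]]
origin = (0.0, 0.0)

code_right_down = orientation_code(origin, (1.0, -1.0), components)
code_left_up = orientation_code(origin, (-1.0, 1.0), components)
code_right_up = orientation_code(origin, (2.0, 3.0), components)
```

Each code has one bit per component, so with g′ components every neighbor point is summarized by g′ bits regardless of the dimension of the feature vectors.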
[0064] As shown in Figure 4, in the similar data search method based on orientation encoding provided in one or more optional embodiments of this specification, performing nearest neighbor search on multiple data points based on the nearest neighbor graph and the vector orientation encoding to determine the nearest neighbor data point of the query point includes:
[0065] S401: Randomly select one of the multiple data points as the nearest neighbor candidate point o.
[0066] S402: Determine multiple neighbor points of the nearest neighbor candidate point based on the nearest neighbor graph.
[0067] S403: Determine the vector orientation code of the query point relative to the nearest neighbor candidate point, and combine it with the corresponding vector orientation codes of the multiple neighbor points of the nearest neighbor candidate point to determine the Hamming distance between the query point and the multiple neighbor points.
[0068] S404: Select multiple neighboring points with the smallest Hamming distance to the query point, and calculate the original distance between these multiple neighboring points and the query point.
[0069] When selecting multiple neighbor points with the smallest Hamming distance to the query point q, the number of selected neighbor points is determined based on the number of neighbor points maintained for each data point in the nearest neighbor graph.
[0070] The number of neighbor points maintained for each data point in the nearest neighbor graph is denoted as g. The number of neighbor points selected when choosing multiple neighbor points with the smallest Hamming distance to the query point is τ.
[0071] τ = log₂g.
[0072] After selecting the τ neighboring points with the smallest Hamming distance to the query point q, the original distances between the query point and the τ neighboring points are further calculated. These original distances refer to the Euclidean distances between the neighboring points and the query point. The Euclidean distances are determined based on the multidimensional feature vectors corresponding to the neighboring points and the query point q.
[0073] The Euclidean distance between the neighbor point and the query point can be denoted as ‖q,x‖.
[0074] S405: Select the point with the smallest original distance to the query point as the undetermined point, and compare the original distance between the undetermined point and the query point with the original distance between the nearest neighbor candidate point and the query point.
[0075] The original distance between the undetermined point and the query point is denoted as ‖q,x*‖, where x* represents the undetermined point. The original distance between the nearest neighbor candidate point and the query point is denoted as ‖q,o‖, where o represents the nearest neighbor candidate point.
[0076] S406: In response to the fact that the original distance corresponding to the undetermined point is less than the original distance corresponding to the nearest neighbor candidate point, the undetermined point is selected as a new nearest neighbor candidate point to continue the search.
[0077] If the original distance corresponding to the undetermined point is less than the original distance corresponding to the nearest neighbor candidate point, i.e., ‖q,x*‖ < ‖q,o‖, it means that the undetermined point is closer to the query point. The undetermined point can be used as a new nearest neighbor candidate point. Multiple neighbor points of the new nearest neighbor candidate point are determined, and a new undetermined point is searched for and selected in the above manner for iterative search.
[0078] S407: In response to the fact that the original distance corresponding to the undetermined point is not less than the original distance corresponding to the nearest neighbor candidate point, determine the nearest neighbor candidate point as the nearest neighbor data point corresponding to the query point.
[0079] If the original distance corresponding to the undetermined point is not less than the original distance corresponding to the nearest neighbor candidate point, i.e., ‖q,x*‖ < ‖q,o‖ does not hold, it means that the current nearest neighbor candidate point is the point closest to the query point among the multiple data points, and the current nearest neighbor candidate point can be determined to be the nearest neighbor data point.
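Steps S401 to S407 can be sketched as a greedy walk over the graph. Several details are assumptions made for illustration: the bit sign convention inside `orientation_code`, the hand-written graph and points, a fixed `start` index in place of the random choice in S401 (so the example is repeatable), and a floor of one neighbor for τ when log₂g is below 1. Every vertex is assumed to have at least one neighbor point.

```python
import math


def orientation_code(origin, target, components):
    # ASSUMPTION: bit i is "1" when (target - origin) projects non-negatively onto component i.
    diff = [t - o for o, t in zip(origin, target)]
    return "".join(
        "1" if sum(d * c for d, c in zip(diff, comp)) >= 0 else "0"
        for comp in components
    )


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def greedy_search(query, points, graph, codes, components, start):
    candidate = start                                                      # S401
    while True:
        neighbors = graph[candidate]                                       # S402
        q_code = orientation_code(points[candidate], query, components)    # S403
        ranked = sorted(neighbors, key=lambda u: hamming(q_code, codes[candidate][u]))
        tau = max(1, int(math.log2(len(neighbors))))                       # S404: tau = log2(g)
        pending = min(ranked[:tau], key=lambda u: math.dist(query, points[u]))  # S405
        if math.dist(query, points[pending]) < math.dist(query, points[candidate]):
            candidate = pending                                            # S406: keep walking
        else:
            return candidate                                               # S407: stop here


# Hypothetical data: five points on a line, two neighbor points per vertex (g = 2).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
graph = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 2]}
components = [[1.0, 0.0]]               # g' = log2(2) = 1 component
# Neighbor codes are computed once, ahead of any query.
codes = {v: {u: orientation_code(points[v], points[u], components) for u in nbrs}
         for v, nbrs in graph.items()}

result = greedy_search((3.2, 0.0), points, graph, codes, components, start=0)
```

Only the τ shortlisted neighbors need a full Euclidean distance computation in each round; the rest are filtered by comparing short bit strings. Because the candidate's distance to the query strictly decreases on every move, the loop always terminates.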
[0080] It should be noted that the methods of one or more embodiments of this specification can be executed by a single device, such as a computer or server. The methods of this embodiment can also be applied in a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the methods of one or more embodiments of this specification, and the multiple devices will interact with each other to complete the method described.
[0081] It should be noted that the above description describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims may be performed in a different order than that shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0082] Based on the same concept, corresponding to any of the above embodiments, this specification also provides a similar data search system based on orientation coding.
[0083] Referring to Figure 5, the similar data search system based on orientation coding includes:
[0084] The data acquisition module is used to acquire specified query data and a given dataset, wherein the given dataset includes multiple basic data, and both the specified query data and the multiple basic data are unstructured data.
[0085] The feature vector extraction module is used to extract feature vectors from the specified query data and multiple basic data, generate corresponding multidimensional feature vectors, and determine the query point corresponding to the specified query data and multiple data points corresponding to the multiple basic data in the multidimensional vector space based on the multidimensional vectors.
[0086] The orientation coding module is used to construct a nearest neighbor graph for multiple data points, perform orientation coding on multiple neighbor points of each data point in the nearest neighbor graph, and determine the corresponding vector orientation codes of the multiple neighbor points.
[0087] The nearest neighbor search module is used to perform a nearest neighbor search among multiple data points based on the nearest neighbor graph and the vector orientation encoding, to determine the nearest neighbor data point of the query point; and
[0088] The similar data determination module is used to determine that the basic data corresponding to the nearest neighbor data point is similar data corresponding to the specified query data.
[0089] In a similar data search system based on orientation encoding provided in one or more optional embodiments of this specification, the orientation encoding module is further configured to sequentially select a plurality of data points as vertices; determine the distance between the vertex and the other plurality of data points, select the plurality of data points with the smallest distance to the vertex as the neighbor points of the vertex; and connect the vertex to the corresponding plurality of neighbor points using directed edges, wherein the directed edges point from the vertex to the neighbor points.
[0090] In a similar data search system based on azimuth coding provided in one or more optional embodiments of this specification, the azimuth coding module is further configured to perform principal component analysis on the multidimensional feature vectors corresponding to multiple data points in the given dataset to determine the principal component projection matrix; in the nearest neighbor graph, the azimuth coding string of the neighbor point corresponding to the data point relative to the data point is calculated and determined according to the principal component projection matrix as the vector azimuth coding of the neighbor point.
[0091] In one or more optional embodiments of this specification, a similar data search system based on orientation coding is provided, wherein the orientation coding module determines the principal component projection matrix using the following method:
[0092] Principal component analysis is used to calculate the first g′ principal components of the multidimensional feature vectors corresponding to the multiple data points to form the principal component projection matrix.
[0093] The principal components:
[0094] P = [p_1, p_2, …, p_i, …, p_g′]
[0095] where p_i represents the i-th comprehensive index component among the principal components. The principal components consist of g′ pairwise orthogonal vectors, where g′ = log₂g.
[0096] In a similar data search system based on azimuth coding provided in one or more optional embodiments of this specification, the azimuth coding module determines the azimuth coding string using the following method:
[0097] For each data point x in the given dataset, let y denote a neighbor point of x in the nearest neighbor graph. Using the principal component projection matrix, the orientation code string of y relative to x is calculated:
[0098] C_y = c_1c_2…c_i…c_g′
[0099]
[0100] where c_i represents the i-th bit of the orientation code string.
[0101] In a similar data search system based on orientation coding provided in one or more optional embodiments of this specification, the nearest neighbor search module is further configured to: randomly select one of the multiple data points as a nearest neighbor candidate point; determine multiple neighbor points of the nearest neighbor candidate point based on the nearest neighbor graph; determine the vector orientation code of the query point relative to the nearest neighbor candidate point, and combine it with the vector orientation codes of the multiple neighbor points of the nearest neighbor candidate point to determine the Hamming distance between the query point and the multiple neighbor points; select the multiple neighbor points with the smallest Hamming distance to the query point, and calculate the original distance between these neighbor points and the query point; select the point with the smallest original distance to the query point as the undetermined point, and compare the original distance between the undetermined point and the query point with the original distance between the nearest neighbor candidate point and the query point; when the original distance corresponding to the undetermined point is less than the original distance corresponding to the nearest neighbor candidate point, select the undetermined point as a new nearest neighbor candidate point to continue the search; when the original distance corresponding to the undetermined point is not less than the original distance corresponding to the nearest neighbor candidate point, determine the nearest neighbor candidate point as the nearest neighbor data point corresponding to the query point.
[0102] In a similar data search system based on orientation encoding provided in one or more optional embodiments of this specification, the original distance refers to the Euclidean distance between the neighbor point and the query point; the Euclidean distance is calculated and determined based on the multidimensional feature vectors corresponding to the neighbor point and the query point.
[0103] In a similar data search system based on orientation encoding provided in one or more optional embodiments of this specification, when the nearest neighbor search module selects the multiple neighbor points with the smallest Hamming distance to the query point, the number of neighbor points selected is determined according to the number of neighbor points maintained for each data point in the nearest neighbor graph.
[0104] The number of neighbor points maintained for each data point in the nearest neighbor graph is denoted as g, and the number of neighbor points selected when choosing the multiple neighbor points with the smallest Hamming distance to the query point is denoted as τ, where:
[0105] $\tau = \log_2 g$.
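As a small worked example of this relationship: with g = 16 neighbor points maintained per data point, τ = log₂ 16 = 4, so only 4 of the 16 neighbors have their original distance computed in each search step. The Python helper below is a hypothetical sketch; the text does not say how τ is obtained when g is not a power of two, so rounding down and enforcing a minimum of 1 are assumptions.

```python
def shortlist_size(g: int) -> int:
    """Number of neighbors kept after Hamming filtering, tau = log2(g).

    ASSUMPTIONS: the result is rounded down when g is not a power of two,
    and is never allowed to drop below 1.
    """
    if g < 1:
        raise ValueError("g must be a positive integer")
    # For g >= 1, g.bit_length() - 1 equals floor(log2(g)) exactly,
    # with no floating-point rounding involved.
    return max(1, g.bit_length() - 1)
```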
[0106] For ease of description, the above system is described as divided into modules by function. Of course, when implementing one or more embodiments of this specification, the functions of the modules can be implemented in one or more pieces of software and/or hardware.
[0107] The system described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0108] Figure 6 illustrates a more specific hardware structure of an electronic device provided by this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.
[0109] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.
[0110] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
[0111] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.
[0112] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate by wired means (such as USB or an Ethernet cable) or by wireless means (such as a mobile network, Wi-Fi, or Bluetooth).
[0113] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.
[0114] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.
[0115] The electronic devices described above are used to implement the corresponding methods in the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0116] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the similar data search method based on orientation encoding as described in any of the above embodiments.
[0117] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
[0118] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the similar data search method based on orientation encoding as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0119] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk drive (HDD), or solid-state drive (SSD), etc.; the storage medium can also include combinations of the above types of memory.
[0120] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0121] For ease of description, the above devices are described as divided into units by function. Of course, in implementing this application, the functions of the units can be implemented in one or more pieces of software and/or hardware.
[0122] Those skilled in the art will understand that embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0123] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0124] This application can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0125] The various embodiments in this specification are described in a progressive manner. Identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the method embodiments.
[0126] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of one or more embodiments of this specification as described above, which are not provided in detail for the sake of brevity.
[0127] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.
[0128] One or more embodiments of this specification are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the scope of protection of this disclosure.
Claims
1. A similar data search method based on orientation encoding, characterized in that the method includes:
obtaining specified query data and a given dataset, wherein the given dataset includes multiple basic data items, and both the specified query data and the multiple basic data items are unstructured data;
extracting feature vectors from the specified query data and the multiple basic data items to generate corresponding multidimensional feature vectors, and determining, based on the multidimensional feature vectors, a query point corresponding to the specified query data and multiple data points corresponding to the multiple basic data items in the multidimensional vector space;
constructing a nearest neighbor graph for the multiple data points, and performing orientation encoding on the multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation encodings of the multiple neighbor points;
wherein performing orientation encoding on the multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation encodings of the multiple neighbor points includes: performing principal component analysis on the multidimensional feature vectors corresponding to the multiple data points of the given dataset to determine a principal component projection matrix; and, in the nearest neighbor graph, calculating the orientation encoding string of each neighbor point relative to the data point based on the principal component projection matrix, as the vector orientation encoding of that neighbor point;
wherein performing principal component analysis on the multidimensional feature vectors corresponding to the multiple data points of the given dataset to determine the principal component projection matrix includes: calculating, using the principal component analysis method, the leading principal components of the multidimensional feature vectors corresponding to the multiple data points to form the principal component projection matrix, each principal component being a comprehensive index component, and the principal components being pairwise mutually orthogonal vectors;
wherein calculating the orientation encoding string of each neighbor point relative to the data point based on the principal component projection matrix includes: for each data point in the given dataset, letting y denote one of its neighbor points in the nearest neighbor graph, and calculating, using the principal component projection matrix, the orientation encoding string of y relative to that data point, $C_y = c_1 c_2 \ldots c_i \ldots c_{g'}$, where $c_i$ denotes the i-th bit of the orientation encoding string;
performing a nearest neighbor search among the multiple data points based on the nearest neighbor graph and the vector orientation encodings to determine the nearest neighbor data point of the query point;
wherein performing the nearest neighbor search among the multiple data points based on the nearest neighbor graph and the vector orientation encodings to determine the nearest neighbor data point of the query point includes: randomly selecting one of the multiple data points as a nearest neighbor candidate point; determining the multiple neighbor points of the nearest neighbor candidate point based on the nearest neighbor graph; determining the vector orientation encoding of the query point relative to the nearest neighbor candidate point, and comparing it with the corresponding vector orientation encodings of the multiple neighbor points of the nearest neighbor candidate point to determine the Hamming distance between the query point and each of those neighbor points; selecting the multiple neighbor points with the smallest Hamming distance to the query point, and calculating the original distance between each of the selected neighbor points and the query point; selecting the point with the smallest original distance to the query point as a pending point, and comparing the original distance between the pending point and the query point with the original distance between the nearest neighbor candidate point and the query point; in response to the original distance corresponding to the pending point being less than the original distance corresponding to the nearest neighbor candidate point, selecting the pending point as a new nearest neighbor candidate point and continuing the search; in response to the original distance corresponding to the pending point being not less than the original distance corresponding to the nearest neighbor candidate point, determining the nearest neighbor candidate point as the nearest neighbor data point corresponding to the query point; and
determining the basic data item corresponding to the nearest neighbor data point to be similar data to the specified query data.
2. The method according to claim 1, characterized in that constructing a nearest neighbor graph for the multiple data points includes: selecting each of the multiple data points in turn as a vertex; determining the distances between the vertex and the other data points, and selecting the multiple data points with the smallest distances to the vertex as the vertex's neighbor points; and connecting the vertex to its multiple neighbor points with directed edges, each directed edge pointing from the vertex to a neighbor point.
3. The method according to claim 1, characterized in that the original distance refers to the Euclidean distance between the neighbor point and the query point, and the Euclidean distance is calculated based on the multidimensional feature vectors corresponding to the neighbor point and the query point.
4. The method according to claim 1, characterized in that, when selecting the multiple neighbor points with the smallest Hamming distance to the query point, the number of neighbor points selected is determined according to the number of neighbor points maintained for each data point in the nearest neighbor graph; the number of neighbor points maintained for each data point in the nearest neighbor graph is denoted as g, and the number of neighbor points selected when choosing the multiple neighbor points with the smallest Hamming distance to the query point is denoted as τ, where $\tau = \log_2 g$.
5. A similar data search system based on orientation encoding, characterized in that the system includes:
a data acquisition module, used to acquire specified query data and a given dataset, wherein the given dataset includes multiple basic data items, and both the specified query data and the multiple basic data items are unstructured data;
a feature vector extraction module, used to extract feature vectors from the specified query data and the multiple basic data items, generate corresponding multidimensional feature vectors, and determine, based on the multidimensional feature vectors, a query point corresponding to the specified query data and multiple data points corresponding to the multiple basic data items in the multidimensional vector space;
an orientation encoding module, used to construct a nearest neighbor graph for the multiple data points, and to perform orientation encoding on the multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation encodings of the multiple neighbor points;
wherein the orientation encoding module performing orientation encoding on the multiple neighbor points of each data point in the nearest neighbor graph to determine the corresponding vector orientation encodings of the multiple neighbor points includes: performing principal component analysis on the multidimensional feature vectors corresponding to the multiple data points of the given dataset to determine a principal component projection matrix; and, in the nearest neighbor graph, calculating the orientation encoding string of each neighbor point relative to the data point based on the principal component projection matrix, as the vector orientation encoding of that neighbor point;
wherein the orientation encoding module performing principal component analysis on the multidimensional feature vectors corresponding to the multiple data points of the given dataset to determine the principal component projection matrix includes: calculating, using the principal component analysis method, the leading principal components of the multidimensional feature vectors corresponding to the multiple data points to form the principal component projection matrix, each principal component being a comprehensive index component, and the principal components being pairwise mutually orthogonal vectors;
wherein the orientation encoding module calculating the orientation encoding string of each neighbor point relative to the data point based on the principal component projection matrix includes: for each data point in the given dataset, letting y denote one of its neighbor points in the nearest neighbor graph, and calculating, using the principal component projection matrix, the orientation encoding string of y relative to that data point, $C_y = c_1 c_2 \ldots c_i \ldots c_{g'}$, where $c_i$ denotes the i-th bit of the orientation encoding string;
a nearest neighbor search module, used to perform a nearest neighbor search among the multiple data points based on the nearest neighbor graph and the vector orientation encodings to determine the nearest neighbor data point of the query point;
wherein the nearest neighbor search module performing the nearest neighbor search among the multiple data points based on the nearest neighbor graph and the vector orientation encodings to determine the nearest neighbor data point of the query point includes: randomly selecting one of the multiple data points as a nearest neighbor candidate point; determining the multiple neighbor points of the nearest neighbor candidate point based on the nearest neighbor graph; determining the vector orientation encoding of the query point relative to the nearest neighbor candidate point, and comparing it with the corresponding vector orientation encodings of the multiple neighbor points of the nearest neighbor candidate point to determine the Hamming distance between the query point and each of those neighbor points; selecting the multiple neighbor points with the smallest Hamming distance to the query point, and calculating the original distance between each of the selected neighbor points and the query point; selecting the point with the smallest original distance to the query point as a pending point, and comparing the original distance between the pending point and the query point with the original distance between the nearest neighbor candidate point and the query point; in response to the original distance corresponding to the pending point being less than the original distance corresponding to the nearest neighbor candidate point, selecting the pending point as a new nearest neighbor candidate point and continuing the search; in response to the original distance corresponding to the pending point being not less than the original distance corresponding to the nearest neighbor candidate point, determining the nearest neighbor candidate point as the nearest neighbor data point corresponding to the query point; and
a similar data determination module, used to determine the basic data item corresponding to the nearest neighbor data point to be similar data to the specified query data.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 4.