[0036] Embodiment One
[0037] In this embodiment of the present invention, a fixed number of words is no longer selected at each layer. Instead, the concept of path confidence is introduced, and the path confidence determines whether to select the word corresponding to a path and descend to the next layer of nodes. Specifically, as shown in Figure 2, when querying the visual vocabulary tree for each local feature of the picture, the following steps are performed starting from the first layer of the tree:
[0038] Step 201: Determine the candidate words in the first layer of the visual vocabulary tree.
[0039] In this embodiment, all words in the first layer can be taken as candidate words. Besides this approach, other selection methods may also be used, for example taking the words of one or several nodes as the candidate words.
[0040] For ease of understanding, a brief introduction to the visual vocabulary tree is given first. The visual vocabulary tree is established in advance from a large-scale image collection: visual words are extracted from the local features of those images, and the extracted large-scale visual vocabulary is clustered hierarchically to form the tree. Each child node is obtained by further clustering the words of its parent node, and each node is a set of one or more words, usually called a vocabulary in this field.
[0041] Step 202: Using the distance between the local feature and each candidate word at the current layer, together with the confidence of the path where the parent node of each candidate word is located, calculate the confidence of the path where each candidate word at the current layer is located.
[0042] When calculating the path confidence of each candidate word at the current layer, the following formula can be used:
[0043] γ_i = γ_c × Dist_min / Dist_i    (1)
[0044] Here, γ_c is the confidence of the path where the parent node of the i-th candidate word is located; for candidate words in the first layer, the parent-path confidence takes a preset initial value, for example 1. Dist_i is the distance between the local feature and the i-th candidate word, and Dist_min is the minimum of the distances between the local feature and all candidate words at the current layer.
[0045] Since the local feature is an n-dimensional vector, the vocabulary of each node in the visual vocabulary tree is also an n-dimensional vector. For example, if the scale-invariant feature transform (SIFT, Scale-Invariant Feature Transform) feature is used, n is 128. The distance between the local feature and a candidate word can therefore be calculated with any vector distance measure, including but not limited to Euclidean distance, cosine distance, and so on.
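Formula (1) combined with a Euclidean distance can be illustrated with a short sketch; the 4-dimensional vectors below are hypothetical values chosen only for illustration, not data from the embodiment:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def path_confidence(parent_gamma, dist_i, dist_min):
    """Formula (1): gamma_i = gamma_c * Dist_min / Dist_i."""
    return parent_gamma * dist_min / dist_i

# Hypothetical 4-dimensional local feature and two first-layer words
feature = [1.0, 2.0, 3.0, 4.0]
word_a = [1.0, 2.0, 3.0, 4.5]   # distance 0.5 from the feature
word_b = [0.0, 2.0, 3.0, 4.0]   # distance 1.0 from the feature

d_a = euclidean(feature, word_a)
d_b = euclidean(feature, word_b)
d_min = min(d_a, d_b)

# First layer, so the parent-path confidence is the initial value 1
print(path_confidence(1.0, d_a, d_min))  # 1.0 (the nearest word)
print(path_confidence(1.0, d_b, d_min))  # 0.5
```

Note that the nearest word at a layer always keeps its parent's confidence unchanged (Dist_min / Dist_i = 1), so whenever a parent path clears the threshold, at least one of its children's paths does as well.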
[0046] In effect, the confidence of the path where each word is located is a cumulative value: the path extends from the path of the word's parent node, its confidence accumulates from that parent path, and it reflects how far the candidate word deviates from the minimum distance between the local feature and the candidate words at the current layer.
[0047] Step 203: Select those candidate words at the current layer whose path confidence is greater than or equal to a preset confidence threshold.
[0048] After the path confidence of each candidate word at the current layer has been calculated according to step 202, the selection of visual words continues according to that confidence, with a preset confidence threshold as the selection criterion. The confidence threshold is usually an empirical value, for example 0.97.
[0049] Paths are selected by judging the confidence of the path where each word is located. The number of words selected at each layer is therefore not fixed, but depends on how well each word reflects the local feature.
[0050] Step 204: Judge whether the current layer is the last layer; if so, execute step 205; otherwise, execute step 206.
[0051] Step 205: Determine the words selected at the current layer as the visual words of the local feature, and end the visual vocabulary quantization for this local feature.
[0052] Step 206: Descend from the words selected at the current layer to the next layer, determine the child nodes of the selected words as the candidate words of the next layer, take the next layer as the current layer, and execute step 202.
[0053] Proceeding in this way, the local feature is mapped to the last-layer words on all paths whose cumulative confidence is greater than or equal to the confidence threshold. It should be noted that the last layer here may be the last layer of the visual vocabulary tree, that is, its leaf nodes. If optimal accuracy is not required, a certain layer of the visual vocabulary tree can also be set as the last layer for visual vocabulary quantization; for example, the penultimate layer of the tree can be set as the last layer, so that the local feature is mapped to the penultimate-layer words on paths whose confidence is greater than or equal to the confidence threshold.
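Steps 201 to 206 amount to a layer-by-layer traversal of the tree that prunes every path whose cumulative confidence falls below the threshold. A minimal sketch follows, assuming a hypothetical node structure in which every node stores its vocabulary vector and its children and all leaves lie at the same depth; this representation is an assumption for illustration, not the embodiment's data structure:

```python
import math

class Node:
    def __init__(self, vector, children=()):
        self.vector = vector            # the node's vocabulary (centroid) vector
        self.children = list(children)  # empty for last-layer nodes

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def quantize(feature, roots, theta, init_gamma=1.0):
    """Map one local feature to all last-layer words whose cumulative
    path confidence (formula (1)) stays >= theta.

    Returns a list of (node, path_confidence) pairs."""
    level = [(node, init_gamma) for node in roots]          # step 201
    while level:
        # step 202: distances to every candidate and the per-layer minimum
        dists = [euclidean(feature, n.vector) for n, _ in level]
        d_min = min(dists)
        # step 203: keep the words whose path confidence clears the threshold
        kept = [(n, g * d_min / d)
                for (n, g), d in zip(level, dists)
                if g * d_min / d >= theta]
        # steps 204-205: last layer reached, these are the visual words
        if not kept or not kept[0][0].children:
            return kept
        # step 206: descend into the children of every selected word
        level = [(child, g) for n, g in kept for child in n.children]
    return []
```

Each child inherits its parent's cumulative confidence as γ_c, so pruned branches are never visited again, which is where the computational saving over a fixed-fanout search comes from.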
[0054] A simple example is given to describe the flow of the method shown in Figure 2. Take the visual vocabulary tree shown in Figure 3 as an example, and assume the tree has three layers. For a certain local feature, start from the first layer: first determine that all words in the first layer, namely word 1 and word 2, are candidate words, and calculate according to formula (1) the confidence γ_1 of the path where word 1 is located and the confidence γ_2 of the path where word 2 is located, where γ_c takes the preset initial value 1. Assuming that the calculated γ_1 is greater than the preset confidence threshold θ and γ_2 is less than θ, word 1 is selected to enter the second layer, and the child nodes of word 1, namely word 3 and word 4, are determined as the candidate words.
[0055] Calculate according to formula (1) the confidence γ_3 of the path where word 3 is located and the confidence γ_4 of the path where word 4 is located, where γ_c takes the confidence of the path where word 1 is located, that is, γ_1. Assuming that the calculated γ_3 and γ_4 are both greater than the preset confidence threshold θ, word 3 and word 4 are both selected to enter the third layer, and the child nodes of word 3, namely word 7 and word 8, together with the child nodes of word 4, namely word 9 and word 10, are determined as the candidate words.
[0056] Calculate according to formula (1) the confidence γ_7 of the path where word 7 is located, the confidence γ_8 of the path where word 8 is located, the confidence γ_9 of the path where word 9 is located, and the confidence γ_10 of the path where word 10 is located, where γ_c for word 7 and word 8 takes the confidence γ_3 of the path where word 3 is located, and γ_c for word 9 and word 10 takes the confidence γ_4 of the path where word 4 is located. Assuming that the confidences of the paths where word 7 and word 9 are located are greater than θ, and since the third layer is the last layer, word 7 and word 9 are selected as the visual words of this local feature.
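The walk-through above can be checked with plain arithmetic. The distance values below are invented solely so that the comparisons come out as in the example (θ = 0.97); they are not taken from the embodiment:

```python
theta = 0.97  # preset confidence threshold

def gamma(parent_gamma, dist, dist_min):
    """Formula (1): cumulative path confidence."""
    return parent_gamma * dist_min / dist

# Layer 1: word 1 and word 2 (parent-path confidence is the initial value 1)
d1, d2 = 10.0, 14.0
g1 = gamma(1.0, d1, min(d1, d2))   # 1.0    -> selected
g2 = gamma(1.0, d2, min(d1, d2))   # ~0.714 -> pruned

# Layer 2: word 3 and word 4, children of word 1, with gamma_c = g1
d3, d4 = 8.0, 8.1
g3 = gamma(g1, d3, min(d3, d4))    # 1.0    -> selected
g4 = gamma(g1, d4, min(d3, d4))    # ~0.988 -> selected

# Layer 3: words 7, 8 (children of word 3, gamma_c = g3)
# and words 9, 10 (children of word 4, gamma_c = g4)
d7, d8, d9, d10 = 5.0, 6.0, 5.05, 7.0
d_min = min(d7, d8, d9, d10)
g7 = gamma(g3, d7, d_min)          # 1.0    -> visual word
g8 = gamma(g3, d8, d_min)          # ~0.833 -> pruned
g9 = gamma(g4, d9, d_min)          # ~0.978 -> visual word
g10 = gamma(g4, d10, d_min)        # ~0.706 -> pruned
```

As in the example, only word 7 and word 9 survive to the last layer and become the visual words of the local feature.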
[0057] The example given here is only a simple one. In an actual query, the number of layers of the visual vocabulary tree and the number of child nodes of each parent node are usually large, and the number of visual words represented by the tree is also very large. The computational savings obtained with the method provided by this embodiment are therefore still significant.
[0058] In addition, even if a picture with many local features is affected by quantization errors, enough of its local features can still be quantized to the same visual words for effective retrieval. Therefore, in order to further reduce unnecessary computational overhead (here this refers to the overhead of searching the inverted index during image retrieval, after the inverted index has been built from the visual vocabulary), an upper limit can be placed on the number of visual words of a picture. After visual vocabulary quantization has been performed for all local features of the picture, the visual words of all local features are sorted by the confidence of their paths, and the top N visual words are selected as the visual words of the picture. N is the set upper limit on the number of visual words; it can be a fixed positive integer, or it can be set according to the number of local features.
[0059] When N is set according to the number of local features: if the picture has many local features, it has stronger discriminative power during retrieval, and relatively few visual words are needed to achieve high retrieval accuracy; conversely, if the picture has few local features, it has weaker discriminative power during retrieval, and relatively more visual words are needed to achieve high retrieval accuracy. Therefore, the larger the number of local features of the picture, the smaller N can be set, and the smaller the number of local features, the larger N can be set.
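The top-N selection can be sketched as follows. The linear rule mapping the number of local features to N is an invented illustration, since the text only fixes the direction of the relationship (more features, smaller N), and the bounds `n_max` and `n_min` are hypothetical parameters:

```python
def select_picture_words(quantized, num_features, n_max=100, n_min=10):
    """quantized: (word_id, path_confidence) pairs collected over all
    local features of the picture after quantization.

    Returns the top-N pairs by path confidence, where N shrinks as the
    number of local features grows (hypothetical linear rule)."""
    # Many features -> small N, few features -> large N
    scale = max(0.0, 1.0 - num_features / 1000.0)
    n = max(n_min, int(n_min + (n_max - n_min) * scale))
    # Sort by path confidence and keep the top-N words
    ranked = sorted(quantized, key=lambda wc: wc[1], reverse=True)
    return ranked[:n]

words = [(i, 1.0 - 0.01 * i) for i in range(50)]
print(len(select_picture_words(words, num_features=1000)))  # 10
print(len(select_picture_words(words, num_features=0)))     # 50
```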