Vision model construction method, image positioning method, device and electronic equipment

By introducing image and text feature point information into the visual model, the accuracy and success rate of image localization are improved, solving the problems of low localization accuracy and low success rate in existing technologies.

CN117237963B | Active | Publication Date: 2026-06-12 | MIGU COMIC CO LTD +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: MIGU COMIC CO LTD
Filing Date: 2023-10-18
Publication Date: 2026-06-12

Application Information

Patent Timeline

18 Oct 2023: Application
12 Jun 2026: Publication, CN117237963B

IPC: G06V30/18; G06V30/19

CPC: Y02D10/00

AI Tagging

Application Domain

Energy efficient computing; Instruments

AI Technical Summary

⚠Technical Problem

In existing image localization methods, scene modeling leads to problems such as low localization accuracy and low localization success rate.

⚗Method used

In the process of constructing the visual model, image feature point information and text feature point information are extracted and utilized. Feature matching and three-dimensional coordinate calculation are used to improve the positioning accuracy and success rate.

🎯Benefits of technology

By introducing text feature point information, the accuracy and success rate of image localization are improved, solving the problems of low localization accuracy and low success rate in existing technologies.

✦ Generated by Eureka AI based on patent content.

[Figure: CN117237963B_ABST]

Abstract

The application provides a visual model construction method, an image positioning method, a device and an electronic device. The visual model construction method comprises: acquiring a plurality of sample images of a target scene; extracting image feature point information and text feature point information for the sample images; and constructing a visual model according to the image feature point information and the text feature point information, wherein the visual model is used for image positioning. The application can solve the problem of low positioning accuracy and low positioning success rate when a scene model in the prior art is applied to image positioning.

Description

Technical Field

[0001] This application relates to the field of communication technology, and in particular to a method for constructing a visual model, an image localization method, a device, and an electronic device.

Background Technology

[0002] Current image localization methods, such as those using Augmented Reality (AR) technology, primarily rely on scene modeling of the target scene. The image localization process specifically involves: when a query image is input, a 3D model of the target scene is used to find the keyframe most similar to the query image, and feature point matching is performed between the query image and the keyframe. Furthermore, the optimal camera pose can be estimated using the Perspective-n-Point (PNP) algorithm, which can further enable AR navigation and other applications.

[0003] However, current scene modeling methods are based on keyframes determined in sample images, the location information of feature points in sample images, and viewpoint information. Scene modeling based on this method suffers from low accuracy and low success rate in image localization.

Summary of the Invention

[0004] This application provides a method for constructing a visual model, an image localization method, a device, and an electronic device, which solves the problems of low localization accuracy and low localization success rate when scene models in the prior art are applied to image localization.

[0005] Embodiments of this application provide a method for constructing a visual model, including:

[0006] Acquire multiple sample images of the target scene;

[0007] For the sample image, extract image feature point information and text feature point information;

[0008] A visual model is constructed based on the image feature point information and the text feature point information; wherein, the visual model is used for image localization.

[0009] Optionally, for the sample image, text feature point information is extracted, including:

[0010] Detect the text region in the sample image;

[0011] For the text region, the feature point location information corresponding to the text content is extracted;

[0012] For the text region, the string corresponding to the text content is identified;

[0013] The text feature point information is determined based on the feature point location information corresponding to the text content and the string corresponding to the text content.

[0014] Optionally, the text feature point information is determined based on the feature point location information corresponding to the text content and the string corresponding to the text content, including:

[0015] For each text content contained in the plurality of sample images, perform the following operations respectively:

[0016] Determine multiple strings corresponding to the first text content identified from the multiple sample images; wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content;

[0017] Determine the target string corresponding to the first text content from the plurality of strings;

[0018] For a sample image containing the first text content, the target string and the feature point location information corresponding to the first text content extracted from the sample image are determined as the text feature point information of the first text content.

[0019] Optionally, the target string is determined from a plurality of strings corresponding to the first text content, including:

[0020] Match any two strings from the plurality of strings to obtain the similarity value between the two strings;

[0021] The strings are grouped according to the similarity value between any two strings; wherein two strings with a similarity value greater than or equal to a first preset threshold belong to the same group, and two strings with a similarity value less than the first preset threshold belong to different groups.

[0022] Determine the target string corresponding to the first text content from the target group; wherein, the target group is a group whose number of strings is greater than a second threshold.

[0023] Optionally, matching any two strings from the plurality of strings to obtain a similarity value between the two strings includes:

[0024] Based on the dynamic programming algorithm, each character of the first string is matched with each character of the second string, and the distance between the first string and the second string is calculated; wherein, the first string and the second string are any two strings;

[0025] The similarity value between the first string and the second string is determined based on the distance between the first string and the second string, the length of the first string, and the length of the second string.

[0026] This application provides an image localization method, including:

[0027] Obtain the query image of the target scene;

[0028] Extract multiple two-dimensional feature point information from the query image; wherein, the two-dimensional feature point information includes image feature point information and text feature point information;

[0029] Based on the pre-constructed visual model, determine the three-dimensional feature point information corresponding to the two-dimensional feature point information;

[0030] The target pose of the query image is determined by performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information.

[0031] Optionally, the step of performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image includes:

[0032] Based on the image two-dimensional feature point information in the two-dimensional feature point information and the image three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and image feature weight values of the query image, a first relation is constructed;

[0033] Based on the text two-dimensional feature point information in the two-dimensional feature point information and the text three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and text feature weight values of the query image, a second relation is constructed.

[0034] A norm function is constructed based on the first and second relations, and the minimum pose value obtained by solving the norm function is determined as the target pose.
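
The weighted combination in [0032] to [0034] can be illustrated with a small sketch. The text does not spell out the two relations, so this example assumes each relation is a reprojection residual under a pinhole camera, that the norm is a squared L2 norm, and that the minimum is taken over a short list of candidate poses rather than by a nonlinear solver. All names (`project`, `weighted_cost`, `best_pose`, `w_image`, `w_text`) are hypothetical.

```python
def project(p_w, R_cw, t_cw, fx, fy, cx, cy):
    """Project a world point into pixel coordinates (pinhole model, assumed)."""
    x, y, z = (sum(R_cw[r][c] * p_w[c] for c in range(3)) + t_cw[r] for r in range(3))
    return (fx * x / z + cx, fy * y / z + cy)


def weighted_cost(pose, intrinsics, image_pairs, text_pairs, w_image, w_text):
    """Weighted sum of squared reprojection errors for image and text points."""
    R_cw, t_cw = pose

    def residual(pairs):
        total = 0.0
        for (u, v), p_w in pairs:
            pu, pv = project(p_w, R_cw, t_cw, *intrinsics)
            total += (u - pu) ** 2 + (v - pv) ** 2
        return total

    # First relation: image feature points; second relation: text feature points.
    return w_image * residual(image_pairs) + w_text * residual(text_pairs)


def best_pose(candidates, intrinsics, image_pairs, text_pairs, w_image=1.0, w_text=2.0):
    """Return the candidate pose with the smallest weighted cost."""
    return min(
        candidates,
        key=lambda pose: weighted_cost(
            pose, intrinsics, image_pairs, text_pairs, w_image, w_text
        ),
    )


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (illustrative values)

# 2D observations paired with 3D model points, generated with the identity pose.
image_pairs = [((50.0, 50.0), (0.0, 0.0, 2.0)), ((100.0, 50.0), (1.0, 0.0, 2.0))]
text_pairs = [((50.0, 75.0), (0.0, 1.0, 4.0))]
candidates = [(IDENTITY, (0.5, 0.0, 0.0)), (IDENTITY, (0.0, 0.0, 0.0))]

chosen = best_pose(candidates, INTRINSICS, image_pairs, text_pairs)
```

Giving the text residual its own weight lets text feature points influence the pose more or less strongly than ordinary image feature points; the values 1.0 and 2.0 above are placeholders only.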

[0035] Optionally, the image localization method further includes:

[0036] Receive first information input by the user; wherein the first information includes voice information and / or text information;

[0037] Based on the text feature point information in the visual model, determine the destination information that matches the first information;

[0038] A navigation path is generated based on the target pose and the destination information.
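
As a rough illustration of [0036] to [0038], the destination lookup can be sketched as fuzzy matching between the user's text (or transcribed speech) and the strings stored with the text feature points. The sketch uses Python's standard `difflib.get_close_matches`; the function name `match_destination`, the cutoff value, and the sample strings are assumptions, not part of the application.

```python
from difflib import get_close_matches


def match_destination(user_text, model_texts, cutoff=0.6):
    """Return the model string closest to the user's input, or None."""
    lowered = {text.lower(): text for text in model_texts}
    hits = get_close_matches(user_text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# Strings assumed to be stored with text feature points in the visual model.
model_texts = ["Parking Zone B2", "North Exit", "Book Shop"]
destination = match_destination("north exit", model_texts)
```

The matched string identifies text feature points whose three-dimensional coordinates can serve as the destination, and the target pose from localization gives the starting point for path generation.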

[0039] This application provides a visual model construction apparatus, comprising:

[0040] The acquisition module is used to acquire multiple sample images of the target scene;

[0041] The extraction module is used to extract image feature point information and text feature point information from the sample image.

[0042] A construction module is used to construct the visual model based on the image feature point information and the text feature point information; wherein, the visual model is used for image localization.

[0043] This application provides an image positioning device, including:

[0044] The acquisition module is used to acquire query images of the target scene;

[0045] An extraction module is used to extract multiple two-dimensional feature point information from the query image; wherein, the two-dimensional feature point information includes image feature point information and text feature point information;

[0046] The first processing module is used to determine the three-dimensional feature point information corresponding to the multiple two-dimensional feature point information based on the pre-constructed visual model.

[0047] The second processing module is used to perform feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image.

[0048] This application provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the visual model construction method described above, or the steps of the image localization method described above.

[0049] This application provides a readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the steps of the visual model construction method described above, or when the processor executes the computer program, it implements the steps of the image localization method described above.

[0050] In embodiments of this application, image feature point information and text feature point information are extracted from multiple sample images of a target scene. A visual model is then constructed based on the image feature point information and text feature point information. That is, text feature point information is introduced when constructing the visual model. This improves the accuracy and success rate of image localization based on the visual model, thereby solving the problems of low accuracy and low success rate of current scene models when applied to image localization.

Brief Description of the Drawings

[0051] Figure 1 is a flowchart illustrating a method for constructing a visual model according to an embodiment of this application;

[0052] Figure 2 is a flowchart illustrating the image localization method according to an embodiment of this application;

[0053] Figure 3 is a block diagram illustrating a visual model construction apparatus according to an embodiment of this application;

[0054] Figure 4 is a block diagram illustrating an image positioning device according to an embodiment of this application;

[0055] Figure 5 is a block diagram illustrating an electronic device according to an embodiment of this application.

Detailed Description

[0056] To make the technical problems, technical solutions, and advantages of this application clearer, a detailed description will be provided below in conjunction with the accompanying drawings and specific embodiments. In the following description, specific details such as particular configurations and components are provided merely to aid in a comprehensive understanding of the embodiments of this application. Therefore, those skilled in the art should understand that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. Furthermore, for clarity and brevity, descriptions of known functions and structures have been omitted.

[0057] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

[0058] In the various embodiments of this application, it should be understood that the sequence numbers of the following processes do not imply a specific order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. Furthermore, the terms "system" and "network" are often used interchangeably herein.

[0059] In the embodiments provided in this application, it should be understood that "B corresponding to A" means that B is associated with A, and B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information.

[0060] The visual modeling process involved in the embodiments of this application includes steps such as feature extraction, feature matching, triangulation, and bundle adjustment. One example is a visual modeling method based on Simultaneous Localization and Mapping (SLAM) technology, although the embodiments of this application are not limited thereto.

[0061] As shown in Figure 1, this application provides a method for constructing a visual model, including the following steps:

[0062] Step 11: Obtain multiple sample images of the target scene.

[0063] Optionally, the multiple sample images can be acquired in real time in the target scene (i.e., the scene to be modeled), or they can be acquired in advance, etc., and this application embodiment is not limited thereto. For example, the target scene (i.e., the scene to be modeled) can be a shopping mall, parking lot, industrial park, etc., and this application embodiment is not limited thereto.

[0064] Step 12: Extract image feature point information and text feature point information from the sample image.

[0065] Optionally, the image feature point information includes, but is not limited to, at least one of the following: two-dimensional position information of the image feature points (e.g., two-dimensional coordinate points), three-dimensional position information of the image feature points (e.g., three-dimensional coordinate points), and related information for calculating the descriptor of the image feature points; the text feature point information includes, but is not limited to, at least one of the following: two-dimensional position information of the text feature points (e.g., two-dimensional coordinate points), three-dimensional position information of the text feature points (e.g., three-dimensional coordinate points), and text content information (e.g., the string corresponding to the text content).

[0066] Optionally, the image feature points and text feature points in this application embodiment are all feature points extracted (or detected) from the sample image; or it can be understood that when extracting (or detecting) feature points from the sample image in this application embodiment, the text content in the sample image is specially marked and its text content information is recorded for calculating the descriptors corresponding to its feature points, etc. This application embodiment is not limited thereto.
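
The two kinds of feature point information listed in [0065] can be pictured as simple records. The following is a minimal sketch, assuming Python dataclasses; the class and field names are hypothetical and the application does not prescribe a storage layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImageFeaturePoint:
    """Image feature point: 2D position, optional 3D position, descriptor data."""
    xy: Tuple[float, float]
    xyz: Optional[Tuple[float, float, float]] = None
    descriptor: Optional[bytes] = None


@dataclass
class TextFeaturePoint:
    """Text feature point: 2D position, optional 3D position, recognized string."""
    xy: Tuple[float, float]
    xyz: Optional[Tuple[float, float, float]] = None
    text: str = ""


corner = ImageFeaturePoint(xy=(120.0, 64.0), descriptor=b"\x0f\xa0")
sign = TextFeaturePoint(xy=(300.5, 88.0), text="North Exit")
```

Here the three-dimensional position is left empty at extraction time and filled in later, once it has been computed from the camera pose and camera parameters.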

[0067] Step 13: Construct a visual model based on the image feature point information and text feature point information; wherein, the visual model is used for image localization.

[0068] Specifically, in the visual modeling process of this application, additional labels for text feature points are added. Taking the visual modeling method based on SLAM technology as an example, when real-time mapping is performed using Visual-Inertial Odometry SLAM (VIO SLAM), the mapping thread and the text detection and recognition thread (i.e., step 12 above) are started simultaneously. When text content is detected in the current frame (i.e., the current sample image), the camera pose of the current frame and the coordinates of the feature points on the text content are recorded (at this point, these are two-dimensional coordinates). From these coordinates, together with the camera parameters and camera pose of the sample image in which the text content is located, the three-dimensional coordinates are calculated. Taking the three-dimensional coordinates expressed as world coordinates T(x_w, y_w, z_w) as an example, they can be calculated using the following formula:

[0069]

[0070] Here, T_wc is the camera pose corresponding to the sample image in which the feature points on the text content are located, and it is recorded during the SLAM modeling process; F_c denotes the camera parameters, which can generally be obtained in advance; and I_c is the transformation matrix from image coordinates to pixel coordinates, which can be calculated once the image's length and width are determined.

[0071] Correspondingly, the three-dimensional coordinates of other image feature points in the sample image besides text feature points can also be calculated in the same way. To avoid repetition, this will not be elaborated here.
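
To make the 2D-to-3D calculation concrete, here is a minimal back-projection sketch. It is not the application's formula: it assumes a pinhole camera, works directly in pixel coordinates (so the role of I_c is folded into the intrinsics), and assumes the depth of the feature point along the optical axis is already known, for example from triangulation. The function name `backproject` and its parameters are hypothetical; `R_wc` and `t_wc` stand for the rotation and translation parts of the camera pose T_wc, and `fx, fy, cx, cy` for the camera parameters F_c.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R_wc, t_wc):
    """Lift pixel (u, v) with known depth to world coordinates (x_w, y_w, z_w)."""
    # Pixel -> camera frame, using the pinhole intrinsics.
    x_c = (u - cx) / fx * depth
    y_c = (v - cy) / fy * depth
    p_c = (x_c, y_c, depth)
    # Camera frame -> world frame, using the camera pose (rotation + translation).
    return tuple(
        sum(R_wc[r][c] * p_c[c] for c in range(3)) + t_wc[r] for r in range(3)
    )


R_wc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera axes aligned with world
t_wc = (1.0, 2.0, 3.0)                                       # camera centre in world frame
point_w = backproject(50.0, 50.0, 2.0, 100.0, 100.0, 50.0, 50.0, R_wc, t_wc)
```

With the pixel at the principal point and a depth of 2, the point lies 2 units in front of the camera centre, which the pose then shifts into the world frame.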

[0072] In the above scheme, image feature point information and text feature point information are extracted from multiple sample images of the target scene. A visual model is then constructed based on the image feature point information and text feature point information. That is, text feature point information is introduced when constructing the visual model. This improves the accuracy and success rate of image localization when based on the visual model, thereby solving the problem of low localization accuracy and low success rate of current scene models when applied to image localization.

[0073] Optionally, for the sample image, text feature point information is extracted, including:

[0074] Detect the text region in the sample image;

[0075] For the text region, the feature point location information corresponding to the text content is extracted;

[0076] For the text region, the string corresponding to the text content is identified;

[0077] The text feature point information is determined based on the feature point location information corresponding to the text content and the string corresponding to the text content.

[0078] For example, when acquiring multiple sample images, text detection can be performed on each sample image (it should be noted that performing text detection on each sample image does not mean that each sample image contains text content). When text content is detected in a sample image, the text region in the sample image is determined. For example, for the location containing text content, the text box can be corrected into a rectangle using a four-point perspective transformation, and this rectangle is the text region.

[0079] Specifically, one implementation method is to use a segmentation network model to perform semantic segmentation on sample images. A threshold can be set to convert the probability map generated by the semantic segmentation network into a binary image, and then a pixel clustering method can be used to group the pixels into text instances, thus obtaining the text regions.
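
The thresholding and pixel-clustering step in [0079] can be sketched without any deep-learning dependency. The example below assumes the segmentation network has already produced a per-pixel probability map, and it uses 4-connected component labeling as one possible pixel clustering method; the function name `text_regions`, the threshold value, and the bounding-box output are assumptions.

```python
from collections import deque


def text_regions(prob_map, threshold=0.5):
    """Binarize a probability map and return one bounding box per pixel cluster.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    height, width = len(prob_map), len(prob_map[0])
    binary = [[p >= threshold for p in row] for row in prob_map]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first search over 4-connected text pixels.
            queue = deque([(y, x)])
            seen[y][x] = True
            ys, xs = [y], [x]
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
                        ys.append(ny)
                        xs.append(nx)
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


prob_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.7],
    [0.0, 0.0, 0.1, 0.9],
]
regions = text_regions(prob_map)
```

Each returned box corresponds to one text instance; a production pipeline would typically fit a rotated quadrilateral rather than an axis-aligned box before the perspective correction described in [0078].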

[0080] After determining the text region containing the text content in the sample image, multiple position coordinates are extracted from the text content within that region, i.e., the feature point position information corresponding to the text content is extracted. Furthermore, for each text region, the text content within that region is identified (e.g., represented as a string, i.e., the string corresponding to the text content is identified).

[0081] Specifically, as one implementation method, a network model can be built that includes convolutional layers for extracting features from sample images, a bidirectional recurrent neural network (RNN) for predicting feature sequences and outputting predicted label distributions, and a transcription (loss) layer for converting the label distributions into the final label sequence. Based on this network model, the text content in the corresponding text region is recognized to obtain the string corresponding to the text content.

[0082] Optionally, the text feature point information is determined based on the feature point location information corresponding to the text content and the string corresponding to the text content, including:

[0083] For each text content contained in the plurality of sample images, perform the following operations respectively:

[0084] Determine multiple strings corresponding to the first text content identified from the multiple sample images; wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content;

[0085] Determine the target string corresponding to the first text content from the plurality of strings;

[0086] For a sample image containing the first text content, the target string and the feature point location information corresponding to the first text content extracted from the sample image are determined as the text feature point information of the first text content.

[0087] For example, among the multiple sample images, there may be images of the same target object taken from different viewpoints (or camera poses), so these multiple sample images may contain the same text content. At the same time, due to factors such as shooting light and angle, when performing text recognition on multiple sample images containing the same text content, the recognition results (i.e., the strings of text content) may differ. Therefore, for each piece of text content, it is necessary to determine its corresponding accurate string.

[0088] For example, suppose the sample images are numbered 1 to 100. Sample images 1, 2, 5, 7, 8, 13, and 14 contain the same text content 1, sample images 2, 4, 7, 10, and 35 contain the same text content 2, and so on. For text content 1, one string can be identified from each of sample images 1, 2, 5, 7, 8, 13, and 14. Similarly, for text content 2, one string can be identified from each of sample images 2, 4, 7, 10, and 35. Thus, 7 strings are obtained for text content 1, and 5 strings are obtained for text content 2. (The number of sample images used here only illustrates the relationship between sample images, text content, and strings in this application; the actual number of sample images used in building the model is not limited to this.)

[0089] Optionally, the target string is determined from a plurality of strings corresponding to the first text content, including:

[0090] Match any two strings from the plurality of strings to obtain the similarity value between the two strings;

[0091] The strings are grouped according to the similarity value between any two strings; wherein two strings with a similarity value greater than or equal to a first preset threshold belong to the same group, and two strings with a similarity value less than the first preset threshold belong to different groups.

[0092] Determine the target string corresponding to the first text content from the target group; wherein, the target group is a group whose number of strings is greater than a second threshold.

[0093] For example, the higher the similarity score between two strings, the greater the degree of similarity between them. When grouping multiple strings corresponding to each piece of text, they can be categorized according to high-frequency accurate data, medium-frequency slightly erroneous data, and low-frequency irrelevant data.

[0094] Specifically, after calculating the similarity value between each pair of strings, they can be grouped according to the similarity value, and the data with high similarity can be clustered into a category. For example, two strings with a similarity value greater than or equal to the first preset threshold can be grouped into the same group, and two strings with a similarity value less than the first preset threshold can be grouped into different groups.

[0095] After grouping all the strings corresponding to the first text content, the groups can be further filtered. For example, groups with fewer than a second threshold number of strings can be removed (this group usually corresponds to "low-frequency irrelevant data"). For instance, the remaining groups after removing "low-frequency irrelevant data" may contain high-frequency accurate data and mid-frequency minor errors (i.e., the target group).

[0096] Furthermore, for the strings in the target group, an additionally trained strongly supervised neural network can be used to filter out the target string corresponding to the first text content (i.e., the accurate string corresponding to the first text content). Alternatively, the ChatGPT tool can be called to create a list of strings of the same category and ask the user to select the most correct string (i.e., the target string). Or, considering the text content in the target scenario (such as store names in a shopping mall scenario, garage classification numbers in a garage scenario, road signs, etc., which have universal applicability), semantic database matching and other methods can also be used to filter out the target string corresponding to the first text content from the target group. The embodiments of this application are not limited to these methods.
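
The grouping and filtering described in [0090] to [0095] can be sketched as follows. Several choices here are assumptions rather than the application's method: `difflib.SequenceMatcher.ratio()` stands in for the distance-based similarity, groups are formed as connected components (because the pairwise rule alone is not transitive), both threshold values are illustrative, and the final majority vote is only a simple stand-in for the selection methods listed in [0096].

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the application uses its own DP distance."""
    return SequenceMatcher(None, a, b).ratio()


def group_strings(strings, first_threshold):
    """Put strings whose similarity meets the first threshold in one group."""
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similarity(strings[i], strings[j]) >= first_threshold:
                parent[find(i)] = find(j)

    groups = {}
    for index, text in enumerate(strings):
        groups.setdefault(find(index), []).append(text)
    return list(groups.values())


def target_groups(groups, second_threshold):
    """Keep groups whose string count is greater than the second threshold."""
    return [group for group in groups if len(group) > second_threshold]


def pick_by_majority(group):
    """Hypothetical selection rule: the most frequent string in the group."""
    return Counter(group).most_common(1)[0][0]


# OCR results for one piece of text content across five sample images.
recognized = ["BOOKSTORE", "BOOKSTORE", "BOOKSTORF", "BOOKSTORE", "EXIT"]
groups = group_strings(recognized, first_threshold=0.8)
kept = target_groups(groups, second_threshold=1)
target = pick_by_majority(kept[0])
```

In this toy input the lone "EXIT" forms a one-element group that plays the role of low-frequency irrelevant data and is dropped, while the misread "BOOKSTORF" stays in the target group as a mid-frequency minor error.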

[0097] Optionally, matching any two strings from the plurality of strings to obtain a similarity value between the two strings includes:

[0098] Based on the dynamic programming algorithm, each character of the first string is matched with each character of the second string, and the distance between the first string and the second string is calculated; wherein, the first string and the second string are any two strings;

[0099] The similarity value between the first string and the second string is determined based on the distance between the first string and the second string, the length of the first string, and the length of the second string.

[0100] For example, to calculate the distance between two strings with the dynamic programming algorithm, a two-dimensional array D can be maintained, with D[0,:] and D[:,0] first set to 0 (these zero-index positions serve as a safety boundary). For two strings Sn and Sm, D[i,j] represents the degree of matching between the first i characters of Sn and the first j characters of Sm. During matching, i runs from 1 to n and j runs from 1 to m. If the i-th character of Sn equals the j-th character of Sm, the match is successful, and the distance is calculated by subtracting 1 from the previous matching state D[i-1,j-1] (that is, the distance decreases when the match is successful, so the smaller the distance value, the higher the similarity). If the two characters are not equal, the match fails, and the minimum value is taken from the three previous states D[i-1,j], D[i,j-1], and D[i-1,j-1]. This continues until D[n+1,m+1], which is the distance between Sn and Sm.

[0101] For example, given two strings to be compared, with lengths n and m respectively and denoted S_n and S_m, create a two-dimensional array D to store intermediate calculation results. The calculation method for D is as follows:

[0102]

[0103] The formula for calculating minV is:

[0104] minV = 1 + min(D[i-1,j], D[i,j-1], D[i-1,j-1] - k)

[0105] Here, k = 1 when the i-th character of Sn equals the j-th character of Sm; otherwise, k = 0.

[0106] Based on the above formula, the distance value D[n+1,m+1] between Sn and Sm can be calculated.
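
The recurrence for minV can be written out as a short sketch. It assumes that every interior cell D[i,j] takes the value minV and uses 0-based Python lists, so the final cell is `D[n][m]`. Because the boundary handling is only described briefly, the hypothetical `zero_boundary` flag switches between the zero boundary from [0100] and the conventional Levenshtein boundary (D[i,0] = i, D[0,j] = j); the function name `dp_distance` is also an assumption.

```python
def dp_distance(s_n, s_m, zero_boundary=True):
    """Distance between two strings using minV = 1 + min(up, left, diag - k)."""
    n, m = len(s_n), len(s_m)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    if not zero_boundary:
        # Conventional edit-distance boundary (an alternative, not from the text).
        for i in range(n + 1):
            D[i][0] = i
        for j in range(m + 1):
            D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = 1 if s_n[i - 1] == s_m[j - 1] else 0
            D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] - k)
    return D[n][m]


same = dp_distance("abc", "abc")
one_off = dp_distance("abc", "xbc")
classic = dp_distance("kitten", "sitting", zero_boundary=False)
```

With the conventional boundary the recurrence is exactly the Levenshtein edit distance. With the zero boundary, leading characters can be skipped at no cost, so the two variants can disagree on strings of different lengths.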

[0107] Optionally, determining the similarity value between the first string and the second string based on the distance between the first string and the second string, the length of the first string, and the length of the second string may include:

[0108] The similarity value between the first string and the second string is obtained by normalizing the distance between the first string and the second string, the length of the first string and the length of the second string.

[0109] For example, the following formula can be used to achieve normalization:

[0110]

[0111] Here, Sim(S_n, S_m) is the similarity value between the two strings S_n and S_m.
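
One way to normalize a distance by the two string lengths is sketched below. This is an assumed form, not the application's formula: it uses the conventional Levenshtein distance and divides by the longer length, so the similarity falls in [0, 1]. The names `edit_distance` and `normalized_similarity` are hypothetical.

```python
def edit_distance(a, b):
    """Conventional Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


sim = normalized_similarity("kitten", "sitting")
```

Normalizing by length makes one threshold usable for both short and long strings, which matters when the same first preset threshold is applied to every pair.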

[0112] As shown in Figure 2, embodiments of this application also provide an image localization method. Optionally, the image localization method uses a visual model constructed by the above-described visual model construction method to achieve image localization. The method includes the following steps:

[0113] Step 21: Obtain the query image of the target scene.

[0114] Optionally, the query image may be an image taken for positioning of the target scene, or the query image may be an image of the target scene that has been collected in advance (for example, the query image may be selected from the target scene images collected by the AR device). This application embodiment is not limited thereto.

[0115] Step 22: Extract multiple two-dimensional feature point information from the query image; wherein, the two-dimensional feature point information includes image feature point information and text feature point information.

[0116] Optionally, feature points in the image plane are selected based on the query image; these are two-dimensional feature points. The information for these two-dimensional feature points includes: the coordinates of the two-dimensional feature point and descriptors describing the relevant information of the two-dimensional feature point in the query image. For example, for text feature points, the descriptor may include a string containing the text content.
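As an illustration of the information described above, a two-dimensional feature point could be represented by a small record such as the following. All names (`FeaturePoint2D`, `xy`, `descriptor`, `text`) are hypothetical, and the descriptor is shown as a plain tuple of numbers only for the sake of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FeaturePoint2D:
    """One two-dimensional feature point of the query image (sketch)."""

    xy: Tuple[float, float]        # pixel coordinates in the image plane
    descriptor: Tuple[float, ...]  # appearance descriptor of the point
    text: Optional[str] = None     # recognized string, text feature points only

    @property
    def is_text(self) -> bool:
        # A point is a text feature point when it carries a string.
        return self.text is not None


corner = FeaturePoint2D(xy=(120.0, 48.5), descriptor=(0.1, 0.7, 0.2))
sign = FeaturePoint2D(xy=(300.0, 80.0), descriptor=(0.9, 0.1, 0.0), text="EXIT")
print(corner.is_text, sign.is_text)
```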

[0117] Step 23: Based on the pre-constructed visual model, determine the three-dimensional feature point information corresponding to the two-dimensional feature point information.

[0118] Optionally, the topology of the visual model includes, but is not limited to:

[0119] The three-dimensional and two-dimensional positional information of image feature points;

[0120] The three-dimensional and two-dimensional positional information of text feature points;

[0121] The pose and camera parameters of the keyframes;

[0122] The pose and camera parameters of sample images containing text content.
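The topology listed above could be held in a structure along the following lines. This is only a sketch under stated assumptions: the class and field names are hypothetical, poses are stored as 3x4 [R|t] matrices, and camera parameters are reduced to the four pinhole intrinsics (fx, fy, cx, cy).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Pose = Tuple[Tuple[float, float, float, float], ...]  # 3x4 [R|t], assumed layout


@dataclass
class Observation:
    frame_id: int                 # keyframe or sample image that saw the point
    xy: Tuple[float, float]       # two-dimensional position in that frame


@dataclass
class MapPoint:
    xyz: Tuple[float, float, float]                 # three-dimensional position
    observations: List[Observation] = field(default_factory=list)
    text: Optional[str] = None                      # set for text feature points


@dataclass
class Frame:
    pose_cw: Pose
    intrinsics: Tuple[float, float, float, float]   # fx, fy, cx, cy
    contains_text: bool = False


@dataclass
class VisualModel:
    points: List[MapPoint]
    frames: Dict[int, Frame]

    def text_points(self) -> List[MapPoint]:
        """Return only the map points that carry text content."""
        return [p for p in self.points if p.text is not None]


model = VisualModel(
    points=[MapPoint((0.0, 0.0, 1.0)), MapPoint((1.0, 0.0, 2.0), text="EXIT")],
    frames={},
)
print([p.text for p in model.text_points()])
```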

[0123] Thus, based on a pre-constructed visual model (such as a visual model constructed using the aforementioned visual model construction method), the three-dimensional feature point information corresponding to the two-dimensional feature point information can be determined. This three-dimensional feature point information includes: three-dimensional feature point coordinates and descriptors describing the relevant information of the three-dimensional feature point in the query image.

[0124] Step 24: Perform feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image.

[0125] Optionally, feature matching is performed based on the two-dimensional and three-dimensional feature point information, i.e., candidate keyframes are screened: the two-dimensional feature points of the query image are matched with those of the keyframes, the feature points of the query image are matched with the feature points of the visual model (e.g., based on descriptors), and the target pose is obtained by PnP (Perspective-n-Point) estimation. For example, the image localization method in this embodiment can employ AR localization technology. The key point is that the visual model in this application contains text feature point information, which improves localization accuracy and success rate. Specific AR localization methods will not be elaborated here.
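Descriptor-based matching between query feature points and model feature points can be sketched as a nearest-neighbour search. The text only states that matching may be based on descriptors; the Euclidean distance and the ratio test used below are a commonly used heuristic swapped in for illustration, and the function name `match_descriptors` and the `ratio` parameter are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def match_descriptors(
    query: Sequence[Sequence[float]],
    model: Sequence[Sequence[float]],
    ratio: float = 0.8,
) -> List[Tuple[int, int]]:
    """Return (query_index, model_index) pairs of accepted matches."""
    matches = []
    for qi, q in enumerate(query):
        # Distances from this query descriptor to every model descriptor.
        dists = sorted((math.dist(q, d), mi) for mi, d in enumerate(model))
        if not dists:
            continue
        # Accept only if the best match is clearly better than the second best.
        if len(dists) == 1 or dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches


query_desc = [(0.0, 0.0), (10.0, 10.0)]
model_desc = [(0.0, 1.0), (10.0, 9.0), (50.0, 50.0)]
print(match_descriptors(query_desc, model_desc))
```

The resulting 2D-3D correspondences are what a PnP solver would take as input; the solver itself is outside this sketch.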

[0126] In this embodiment of the application, when feature matching is performed based on the two-dimensional feature point information and the three-dimensional feature point information and the target pose of the query image is determined using the PnP algorithm, the target pose can be calculated from a constructed norm function.

[0127] Optionally, the process of constructing the norm function, that is, determining the target pose of the query image by performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information, includes:

[0128] Based on the image two-dimensional feature point information in the two-dimensional feature point information and the image three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and image feature weight values of the query image, a first relation is constructed;

[0129] Based on the text two-dimensional feature point information in the two-dimensional feature point information and the text three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and text feature weight values of the query image, a second relation is constructed.

[0130] A norm function is constructed based on the first and second relations, and the minimum pose value obtained by solving the norm function is determined as the target pose.

[0131] For example, based on the above method for constructing norm functions, the norm function can be expressed as:

[0132]

[0133] Where w(p) and w(t) are weight parameters, which can be adjusted according to the actual situation; P_p denotes the three-dimensional feature points corresponding to visual features (that is, image feature points other than text features); P_t denotes the three-dimensional feature points corresponding to the text features; p_p denotes the two-dimensional feature points of ordinary visual features; p_t denotes the two-dimensional feature points of the text features; and T_cw is the camera pose to be estimated. The camera pose T_cw is estimated by minimizing the weighted reprojection error of the three-dimensional feature points, which completes the localization request.
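The weighted reprojection error described above can be illustrated with the following sketch. The norm function itself is not reproduced in this text, so the form assumed here is w(p) times the sum of squared reprojection errors of the image feature points plus w(t) times the same sum for the text feature points, under a pinhole camera model. The function names, the 3x4 pose layout, and the default weights are hypothetical, and the sketch only evaluates the cost for a given pose; it does not perform the minimization.

```python
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Pair = Tuple[Vec2, Vec3]  # (observed 2D point, matched 3D point)


def project(pose_cw, intrinsics, point_w: Vec3) -> Vec2:
    """Pinhole projection; pose_cw is a 3x4 [R|t] matrix (assumed layout)."""
    x, y, z = (
        row[0] * point_w[0] + row[1] * point_w[1] + row[2] * point_w[2] + row[3]
        for row in pose_cw
    )
    fx, fy, cx, cy = intrinsics
    # ASSUMPTION: the point lies in front of the camera (z > 0).
    return (fx * x / z + cx, fy * y / z + cy)


def weighted_reprojection_cost(
    pose_cw,
    intrinsics,
    image_pairs: Sequence[Pair],
    text_pairs: Sequence[Pair],
    w_p: float = 1.0,
    w_t: float = 2.0,
) -> float:
    """Assumed cost: w_p * sum||p_p - proj(P_p)||^2 + w_t * sum||p_t - proj(P_t)||^2."""

    def squared_error(pairs: Sequence[Pair]) -> float:
        total = 0.0
        for p2d, p3d in pairs:
            u, v = project(pose_cw, intrinsics, p3d)
            total += (p2d[0] - u) ** 2 + (p2d[1] - v) ** 2
        return total

    return w_p * squared_error(image_pairs) + w_t * squared_error(text_pairs)


identity = ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))
K = (100.0, 100.0, 50.0, 50.0)
cost = weighted_reprojection_cost(
    identity,
    K,
    image_pairs=[((50.0, 50.0), (0.0, 0.0, 2.0))],
    text_pairs=[((101.0, 50.0), (1.0, 0.0, 2.0))],
)
print(cost)
```

A pose estimator would search for the T_cw that makes this cost smallest, for example by passing it to a nonlinear least-squares routine; raising `w_t` relative to `w_p` makes the text correspondences count more in that search.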

[0134] Optionally, the image localization method further includes:

[0135] Receive first information input by the user; wherein the first information includes voice information and / or text information;

[0136] Based on the text feature point information in the visual model, determine the destination information that matches the first information;

[0137] A navigation path is generated based on the target pose and the destination information.

[0138] For example, the first information may be voice and/or text information input by the user. The first information may indicate the destination directly, or the user may indicate the destination indirectly by asking a question. For instance, during SLAM modeling, the text information in the scene and its corresponding coordinates are registered and provided to a pre-trained question-answering model (such as the ChatGPT model). The user can then ask a question such as "Where are some good Western restaurants?", and the destination information is determined from the result of the question-answering model.
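For the simpler case in which the first information names the destination directly, the lookup against the registered text can be sketched as a fuzzy string match. This sketch deliberately replaces the question-answering model with Python's standard `difflib` matching, so it does not cover indirect questions; the function name `find_destination`, the `cutoff` value, and the sample data are all hypothetical.

```python
import difflib
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def find_destination(
    user_text: str,
    registered_text: Dict[str, Vec3],
    cutoff: float = 0.6,
) -> Optional[Tuple[str, Vec3]]:
    """Return the registered text (and its coordinates) closest to user_text."""
    # Compare case-insensitively but return the original spelling.
    lookup = {name.lower(): name for name in registered_text}
    hits = difflib.get_close_matches(
        user_text.lower(), list(lookup), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    name = lookup[hits[0]]
    return name, registered_text[name]


# Text strings registered during modeling, with their 3D coordinates.
registered = {
    "Starbucks Coffee": (1.0, 2.0, 0.0),
    "Gate B12": (30.0, 4.0, 0.0),
}
print(find_destination("starbuck coffee", registered))
```

The returned coordinates, together with the target pose of the query image, give the two endpoints needed to generate a navigation path.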

[0139] In this embodiment, users can directly or indirectly indicate destination information by inputting voice and / or text information, thereby locating the target location for AR navigation. This avoids the tedious operation of manually browsing the list of available destinations and clicking to select a destination or manually marking the destination based on the scene map when the user performs navigation and selects a destination. This saves navigation time and helps improve the user experience.

[0140] As shown in Figure 3, an embodiment of this application also provides a visual model construction apparatus 300, comprising:

[0141] The acquisition module 310 is used to acquire multiple sample images of the target scene;

[0142] The extraction module 320 is used to extract image feature point information and text feature point information from the sample image;

[0143] The construction module 330 is used to construct a visual model based on the image feature point information and the text feature point information; wherein, the visual model is used for image localization.

[0144] Optionally, the extraction module 320 includes:

[0145] A detection unit is used to detect text regions in the sample image;

[0146] The extraction unit is used to extract the feature point location information corresponding to the text content for the text region;

[0147] The recognition unit is used to identify the string corresponding to the text content for the text region;

[0148] The determining unit is used to determine the text feature point information based on the feature point location information corresponding to the text content and the string corresponding to the text content.
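The cooperation of the four units above can be sketched as a small pipeline. The detection, feature point extraction, and recognition steps are passed in as callables because the text does not name specific algorithms for them; every name in the sketch (`TextFeature`, `extract_text_features`, `detect`, `locate`, `recognize`) is hypothetical, and the stubs in the usage lines stand in for a real text detector and recognizer.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height of a text region
Point = Tuple[float, float]


@dataclass
class TextFeature:
    string: str              # output of the recognition unit
    keypoints: List[Point]   # output of the extraction unit


def extract_text_features(
    image: Any,
    detect: Callable[[Any], List[Box]],
    locate: Callable[[Any, Box], List[Point]],
    recognize: Callable[[Any, Box], str],
) -> List[TextFeature]:
    """Detect text regions, then pair each region's string with its keypoints."""
    features = []
    for box in detect(image):                       # detection unit
        keypoints = locate(image, box)              # extraction unit
        string = recognize(image, box)              # recognition unit
        features.append(TextFeature(string, keypoints))  # determining unit
    return features


# Stub callables, for illustration only.
feats = extract_text_features(
    "image-placeholder",
    detect=lambda im: [(0, 0, 10, 5)],
    locate=lambda im, b: [(b[0], b[1]), (b[0] + b[2], b[1] + b[3])],
    recognize=lambda im, b: "EXIT",
)
print(feats)
```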

[0149] Optionally, the determining unit is further configured to:

[0150] For each text content contained in the plurality of sample images, perform the following operations respectively:

[0151] Determine multiple strings corresponding to the first text content identified from the multiple sample images; wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content;

[0152] Determine the target string corresponding to the first text content from the plurality of strings;

[0153] For a sample image containing the first text content, the target string and the feature point location information corresponding to the first text content extracted from the sample image are determined as the text feature point information of the first text content.

[0154] Optionally, the determining unit is further configured to:

[0155] Match any two strings from the plurality of strings to obtain the similarity value between the two strings;

[0156] The strings are grouped according to the similarity value between any two strings; wherein two strings with a similarity value greater than or equal to a first preset threshold belong to the same group, and two strings with a similarity value less than the first preset threshold belong to different groups.

[0157] Determine the target string corresponding to the first text content from the target group; wherein, the target group is a group whose number of strings is greater than a second threshold.
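The grouping and selection steps above can be sketched as follows. Several details are assumptions made for illustration: the similarity function defaults to `difflib.SequenceMatcher.ratio` from the Python standard library rather than the distance-based similarity of this application, a string joins a group only if it is similar to every member of that group, the largest group is taken as the target group, and the most frequent string in the target group is chosen as the target string. The names `select_target_string`, `sim_threshold`, and `min_group_size` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence


def default_similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; any normalized measure could be used.
    return SequenceMatcher(None, a, b).ratio()


def select_target_string(
    strings: Sequence[str],
    sim_threshold: float = 0.8,   # first preset threshold
    min_group_size: int = 2,      # second threshold
    similarity: Callable[[str, str], float] = default_similarity,
) -> Optional[str]:
    """Group recognized strings by similarity and pick a target string."""
    groups: List[List[str]] = []
    for s in strings:
        for group in groups:
            # Join a group only if similar to every string already in it.
            if all(similarity(s, other) >= sim_threshold for other in group):
                group.append(s)
                break
        else:
            groups.append([s])
    largest = max(groups, key=len, default=[])
    # The target group must contain more strings than the second threshold.
    if len(largest) <= min_group_size:
        return None
    # ASSUMPTION: the most frequent string in the target group is the target.
    return Counter(largest).most_common(1)[0][0]


recognized = ["Starbucks", "Starbucks", "Starbuoks", "Exit", "Starbucks"]
print(select_target_string(recognized))
```

In this usage example the misrecognized "Starbuoks" falls into the same group as the correct readings, and the unrelated "Exit" forms its own group that is too small to be a target group.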

[0158] Optionally, the determining unit is further configured to:

[0159] Based on the dynamic programming algorithm, each character of the first string is matched with each character of the second string, and the distance between the first string and the second string is calculated; wherein, the first string and the second string are any two strings;

[0160] The similarity value between the first string and the second string is determined based on the distance between the first string and the second string, the length of the first string, and the length of the second string.

[0161] It should be noted that the visual model construction apparatus described in the embodiments of this application can implement various embodiments of the visual model construction method and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0162] As shown in Figure 4, an embodiment of this application provides an image positioning device 400, including:

[0163] The acquisition module 410 is used to acquire the query image of the target scene;

[0164] The extraction module 420 is used to extract multiple two-dimensional feature point information from the query image; wherein, the two-dimensional feature point information includes image feature point information and text feature point information;

[0165] The first processing module 430 is used to determine the three-dimensional feature point information corresponding to the multiple two-dimensional feature point information based on the pre-constructed visual model.

[0166] The second processing module 440 is used to perform feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image.

[0167] Optionally, the second processing module 440 includes:

[0168] The first construction unit is used to construct a first relation based on the image two-dimensional feature point information in the two-dimensional feature point information and the image three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and image feature weight values of the query image;

[0169] The second construction unit is used to construct a second relation based on the text two-dimensional feature point information in the two-dimensional feature point information and the text three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and text feature weight values of the query image;

[0170] The determining unit is used to construct a norm function based on the first relation and the second relation, and determine the minimum pose value obtained by solving the norm function as the target pose.

[0171] Optionally, the image positioning device 400 further includes:

[0172] A receiving module is configured to receive first information input by a user; wherein the first information includes voice information and / or text information;

[0173] The determination module is used to determine destination information that matches the first information based on the text feature point information in the visual model;

[0174] The generation module is used to generate a navigation path based on the target pose and the destination information.

[0175] It should be noted that the image positioning device described in the embodiments of this application can implement various embodiments of the image positioning method described above and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0176] As shown in Figure 5, an embodiment of this application also provides an electronic device, including a transceiver 53, a processor 51, a memory 52, and a computer program stored in the memory 52 and executable on the processor 51. When the processor 51 executes the computer program, it implements the steps of the above-described visual model construction method.

[0177] Specifically, the processor 51 is used for:

[0178] Acquire multiple sample images of the target scene;

[0179] For the sample image, extract image feature point information and text feature point information;

[0180] A visual model is constructed based on the image feature point information and the text feature point information; wherein, the visual model is used for image localization.

[0181] Optionally, the processor 51 is further configured to:

[0182] Detect the text region in the sample image;

[0183] For the text region, the feature point location information corresponding to the text content is extracted;

[0184] For the text region, the string corresponding to the text content is identified;

[0185] The text feature point information is determined based on the feature point location information corresponding to the text content and the string corresponding to the text content.

[0186] Optionally, the processor 51 is further configured to:

[0187] For each text content contained in the plurality of sample images, perform the following operations respectively:

[0188] Determine multiple strings corresponding to the first text content identified from the multiple sample images; wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content;

[0189] Determine the target string corresponding to the first text content from the plurality of strings;

[0190] For a sample image containing the first text content, the target string and the feature point location information corresponding to the first text content extracted from the sample image are determined as the text feature point information of the first text content.

[0191] Optionally, the processor 51 is further configured to:

[0192] Match any two strings from the plurality of strings to obtain the similarity value between the two strings;

[0193] The strings are grouped according to the similarity value between any two strings; wherein two strings with a similarity value greater than or equal to a first preset threshold belong to the same group, and two strings with a similarity value less than the first preset threshold belong to different groups.

[0194] Determine the target string corresponding to the first text content from the target group; wherein, the target group is a group whose number of strings is greater than a second threshold.

[0195] Optionally, the processor 51 is further configured to:

[0196] Based on the dynamic programming algorithm, each character of the first string is matched with each character of the second string, and the distance between the first string and the second string is calculated; wherein, the first string and the second string are any two strings;

[0197] The similarity value between the first string and the second string is determined based on the distance between the first string and the second string, the length of the first string, and the length of the second string.

[0198] Optionally, when the processor 51 executes the computer program, it can also implement the steps of the image localization method described above. It should be noted that the processor that can implement the steps of the image localization method when executing the computer program and the processor that can implement the steps of the visual model construction method when executing the computer program can be the same processor or different processors. Alternatively, the processor that can implement the steps of the image localization method when executing the computer program and the processor that can implement the steps of the visual model construction method when executing the computer program can be located in the same electronic device or in different electronic devices; this embodiment is not limited to this.

[0199] Specifically, the processor 51 is used for:

[0200] Obtain the query image of the target scene;

[0201] Extract multiple two-dimensional feature point information from the query image; wherein, the two-dimensional feature point information includes image feature point information and text feature point information;

[0202] Based on the pre-constructed visual model, determine the three-dimensional feature point information corresponding to the two-dimensional feature point information;

[0203] The target pose of the query image is determined by performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information.

[0204] Optionally, the processor 51 is further configured to:

[0205] Based on the image two-dimensional feature point information in the two-dimensional feature point information and the image three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and image feature weight values of the query image, a first relation is constructed;

[0206] Based on the text two-dimensional feature point information in the two-dimensional feature point information and the text three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and text feature weight values of the query image, a second relation is constructed.

[0207] A norm function is constructed based on the first and second relations, and the minimum pose value obtained by solving the norm function is determined as the target pose.

[0208] Optionally, the processor 51 is further configured to:

[0209] Receive first information input by the user; wherein the first information includes voice information and / or text information;

[0210] Based on the text feature point information in the visual model, determine the destination information that matches the first information;

[0211] A navigation path is generated based on the target pose and the destination information.

[0212] The bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by processor 51, and of memory, represented by memory 52. The bus architecture may also link various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 53 may comprise multiple elements, including a transmitter and a receiver, providing a unit for communicating with various other devices over a transmission medium. The processor 51 is responsible for managing the bus architecture and general processing, and the memory 52 may store data used by the processor 51 during operation.

[0213] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a computer program instructing the relevant hardware to implement them. The computer program includes instructions to perform some or all of the steps of the above methods; and the computer program can be stored in a readable storage medium, which can be any form of storage medium.

[0215] In addition, specific embodiments of this application also provide a computer-readable storage medium storing a computer program thereon. When the program is executed by a processor, it implements the steps in the above-described method for constructing a visual model, and / or, when the program is executed by a processor, it implements the steps in the above-described method for image localization, and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0216] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0217] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can be physically comprised separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0218] The integrated units implemented as software functional units described above can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions that cause a computer device (which may be a personal computer, server, or network device, etc.) to execute some of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0219] The above describes the preferred embodiments of this application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles described in this application, and these improvements and modifications are also within the protection scope of this application.

Claims

1. A method for constructing a visual model, characterized in that it comprises: acquiring multiple sample images of a target scene; extracting image feature point information and text feature point information from the sample images; and constructing a visual model based on the image feature point information and the text feature point information, wherein the visual model is used for image localization; wherein extracting the text feature point information from the sample images comprises: detecting a text region in the sample image; extracting, for the text region, feature point location information corresponding to the text content; identifying, for the text region, a string corresponding to the text content; and determining the text feature point information based on the feature point location information corresponding to the text content and the string corresponding to the text content; and wherein determining the text feature point information based on the feature point location information corresponding to the text content and the string corresponding to the text content comprises performing the following operations for each text content contained in the multiple sample images: determining multiple strings corresponding to a first text content identified from the multiple sample images, wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content; determining a target string corresponding to the first text content from the multiple strings; and, for a sample image containing the first text content, determining the target string and the feature point location information corresponding to the first text content extracted from that sample image as the text feature point information of the first text content.

2. The method for constructing a visual model according to claim 1, characterized in that determining the target string corresponding to the first text content from the multiple strings comprises: matching any two strings among the multiple strings to obtain a similarity value between the two strings; grouping the multiple strings according to the similarity value between any two strings, wherein two strings with a similarity value greater than or equal to a first preset threshold belong to the same group, and two strings with a similarity value less than the first preset threshold belong to different groups; and determining the target string corresponding to the first text content from a target group, wherein the target group is a group whose number of strings is greater than a second threshold.

3. The method for constructing a visual model according to claim 2, characterized in that matching any two strings among the multiple strings to obtain a similarity value between the two strings comprises: matching, based on a dynamic programming algorithm, each character of a first string with each character of a second string, and calculating the distance between the first string and the second string, wherein the first string and the second string are any two of the strings; and determining the similarity value between the first string and the second string based on the distance between the first string and the second string, the length of the first string, and the length of the second string.

4. An image localization method, characterized in that it uses a visual model pre-constructed by the method for constructing a visual model according to any one of claims 1 to 3, the image localization method comprising: obtaining a query image of the target scene; extracting multiple pieces of two-dimensional feature point information from the query image, wherein the two-dimensional feature point information includes image feature point information and text feature point information; determining, based on the pre-constructed visual model, three-dimensional feature point information corresponding to the two-dimensional feature point information; and performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image.

5. The image localization method according to claim 4, characterized in that performing feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image comprises: constructing a first relation based on the image two-dimensional feature point information in the two-dimensional feature point information and the image three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and image feature weight values of the query image; constructing a second relation based on the text two-dimensional feature point information in the two-dimensional feature point information and the text three-dimensional feature point information in the three-dimensional feature point information, combined with the camera parameters and text feature weight values of the query image; and constructing a norm function based on the first relation and the second relation, and determining the minimum pose value obtained by solving the norm function as the target pose.

6. The image localization method according to claim 4, characterized in that it further comprises: receiving first information input by a user, wherein the first information includes voice information and/or text information; determining, based on the text feature point information in the visual model, destination information that matches the first information; and generating a navigation path based on the target pose and the destination information.

7. A device for constructing a visual model, characterized in that it comprises: an acquisition module, used to acquire multiple sample images of a target scene; an extraction module, used to extract image feature point information and text feature point information from the sample images; and a construction module, used to construct a visual model based on the image feature point information and the text feature point information, wherein the visual model is used for image localization; wherein the extraction module includes: a detection unit, used to detect a text region in the sample image; an extraction unit, used to extract, for the text region, the feature point location information corresponding to the text content; a recognition unit, used to identify, for the text region, the string corresponding to the text content; and a determining unit, used to determine the text feature point information based on the feature point location information corresponding to the text content and the string corresponding to the text content; and wherein the determining unit is further configured to perform the following operations for each text content contained in the multiple sample images: determine multiple strings corresponding to a first text content identified from the multiple sample images, wherein the first text content is any similar text content in the multiple sample images, and one sample image corresponds to one string of the first text content; determine a target string corresponding to the first text content from the multiple strings; and, for a sample image containing the first text content, determine the target string and the feature point location information corresponding to the first text content extracted from that sample image as the text feature point information of the first text content.

8. An image positioning device, characterized in that it uses a visual model pre-constructed by the method for constructing a visual model according to any one of claims 1 to 3, the device comprising: an acquisition module, used to acquire a query image of the target scene; an extraction module, used to extract multiple pieces of two-dimensional feature point information from the query image, wherein the two-dimensional feature point information includes image feature point information and text feature point information; a first processing module, used to determine, based on the pre-constructed visual model, the three-dimensional feature point information corresponding to the multiple pieces of two-dimensional feature point information; and a second processing module, used to perform feature matching based on the two-dimensional feature point information and the three-dimensional feature point information to determine the target pose of the query image.

9. An electronic device, characterized in that it comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for constructing a visual model according to any one of claims 1 to 3, or the processor executes the computer program to implement the steps of the image localization method according to any one of claims 4 to 6.

10. A readable storage medium, characterized in that the readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method for constructing a visual model according to any one of claims 1 to 3, or, when executed by the processor, implements the steps of the image localization method according to any one of claims 4 to 6.