Remote sensing image interpretation method, device and equipment and storage medium

By using speech signal parsing and algorithm pool matching techniques, the problem of insufficient intelligence in remote sensing image interpretation has been solved, and efficient image interpretation without manual operation has been achieved.

CN122245311A · Pending · Publication Date: 2026-06-19 · BEIJING AEROSPACE TITAN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING AEROSPACE TITAN TECH CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The level of intelligence in remote sensing image interpretation is low, requiring users to manually click and interact.

Method used

Image information is obtained by parsing speech signals, and the similarity between speech signals and algorithms is calculated using a pre-built algorithm pool. The matching interpretation algorithm is then selected for image interpretation.

Benefits of technology

It enables intelligent interpretation of remote sensing images, reduces manual operation by users, and improves interpretation efficiency.



Abstract

This application provides a remote sensing image interpretation method, apparatus, and storage medium. The method is used for image interpretation via voice, including: acquiring a voice signal; parsing the voice signal; extracting the image to be interpreted from the voice signal; selecting a matching interpretation algorithm from a pre-built algorithm pool based on the voice signal; and interpreting the image based on the determined interpretation algorithm to obtain the interpretation result. This application can perform image interpretation by directly acquiring the voice signal; that is, the purpose of image interpretation can be achieved through voice control, eliminating the need for multiple manual clicks by the user and effectively improving the intelligence level of remote sensing image interpretation.

Description

Technical Field

[0001] This application relates to the field of remote sensing intelligent application technology, and in particular to a remote sensing image interpretation method, apparatus, device and storage medium.

Background Technology

[0002] In the field of remote sensing intelligent applications, image interpretation often requires manual user interaction, which results in a low level of intelligence. Therefore, improving the intelligence level of remote sensing image interpretation has become an urgent problem for those skilled in the art.

Summary of the Invention

[0003] In view of this, this application proposes a remote sensing image interpretation method, apparatus, device, and storage medium, which can effectively improve the intelligence level of remote sensing image interpretation.

[0004] According to a first aspect of this application, a remote sensing image interpretation method is provided for interpreting images via speech, comprising: acquiring a speech signal, parsing the speech signal, and extracting the image to be interpreted from the speech signal; selecting a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal; and interpreting the image based on the determined interpretation algorithm to obtain the interpretation result.

[0005] In one possible implementation, when selecting a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal, the selection is made by calculating the similarity between the speech signal and the algorithm in the algorithm pool.

[0006] In one possible implementation, selecting a matching interpretation algorithm by calculating the similarity between the speech signal and the algorithms in the algorithm pool includes: obtaining a speech recognition vector based on the speech signal, and obtaining the set of algorithm vectors corresponding to the algorithms in the algorithm pool; calculating the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set; and, based on the calculated similarity scores, selecting the algorithm with the highest similarity score as the interpretation algorithm that matches the current speech signal.

[0007] In one possible implementation, obtaining the speech recognition vector includes: converting the speech signal into text information; and vectorizing the text information to obtain the speech recognition vector.

[0008] In one possible implementation, obtaining the set of algorithm vectors includes: obtaining the task text corresponding to each algorithm; vectorizing the task text to obtain the vector of each algorithm; and combining the algorithm vectors to obtain the set of algorithm vectors.

[0009] In one possible implementation, pre-building the algorithm pool includes: obtaining the interpretation algorithms and their corresponding task texts; constructing a mapping relationship between the task texts and the interpretation algorithms; and storing the interpretation algorithms and task texts according to the mapping relationship to obtain the algorithm pool.

[0010] According to a second aspect of this application, a remote sensing image interpretation apparatus is provided for interpreting images via speech, comprising: an image extraction module, used to acquire the speech signal, parse the speech signal, and extract the image to be interpreted from the speech signal; an algorithm matching module, used to select a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal; and an image interpretation module, used to interpret the image based on the determined interpretation algorithm to obtain the interpretation result.

[0011] In one possible implementation, for selecting a matching interpretation algorithm by calculating the similarity between the speech signal and the algorithms in the algorithm pool, the algorithm matching module includes: a vector acquisition module, used to obtain a speech recognition vector based on the speech signal and to obtain the set of algorithm vectors corresponding to the algorithms in the algorithm pool; a similarity calculation module, used to calculate the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set; and an algorithm selection module, used to select, based on the calculated similarity scores, the algorithm with the highest similarity as the interpretation algorithm that matches the current speech signal.

[0012] According to a third aspect of this application, a remote sensing image interpretation apparatus is provided, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the method described in the first aspect of this application.

[0013] According to a fourth aspect of this application, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method described in the first aspect of this application.

[0014] This application discloses a remote sensing image interpretation method, characterized by its use of voice for image interpretation, comprising: acquiring a voice signal; parsing the voice signal; extracting the image to be interpreted from the voice signal; selecting a matching interpretation algorithm from a pre-built algorithm pool based on the voice signal; and interpreting the image based on the determined interpretation algorithm to obtain the interpretation result. The above method acquires a voice signal, determines the image to be interpreted and the matching interpretation algorithm based on the acquired voice signal, thereby performing image interpretation. This application enables direct remote sensing image interpretation via voice, that is, image interpretation can be achieved through voice control, effectively improving the intelligence level of remote sensing image interpretation.

[0015] Other features and aspects of this application will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Attached Figure Description

[0016] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this application together with the specification and serve to explain the principles of this application.

[0017] Figure 1 shows a flowchart of a remote sensing image interpretation method according to an embodiment of this application;
Figure 2 shows a natural language model architecture according to an embodiment of this application;
Figure 3 shows a similarity matrix diagram of bge_base_zh according to an embodiment of this application;
Figure 4 shows a similarity matrix diagram of bge_large_zh according to an embodiment of this application;
Figure 5 shows a schematic block diagram of a remote sensing image interpretation apparatus according to an embodiment of this application;
Figure 6 shows a schematic block diagram of a remote sensing image interpretation apparatus according to an embodiment of this application.

Detailed Implementation

[0018] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0019] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0020] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented without certain specific details. In some instances, methods, means, components, and circuits well-known to those skilled in the art have not been described in detail in order to highlight the main points of this application.

[0021] <Method Implementation> Figure 1 shows a flowchart of a remote sensing image interpretation method according to an embodiment of this application. As shown in Figure 1, this method is used for image interpretation via voice and includes steps S1100-S1300: S1100, acquiring a voice signal, parsing the voice signal, and extracting the image to be interpreted from the voice signal; S1200, selecting a matching interpretation algorithm from a pre-built algorithm pool based on the voice signal; S1300, interpreting the image based on the determined interpretation algorithm to obtain the interpretation result. The above method can perform image interpretation by directly acquiring the voice signal, without requiring multiple manual clicks and interactions from the user. It uses voice control to achieve image interpretation, effectively improving the intelligence level of remote sensing image interpretation.

[0022] In the image interpretation process, step S1100 is executed first: a speech signal is acquired and then parsed to extract the image to be interpreted. The speech signal is input by the user and collected by a terminal device with voice input functionality.

[0023] It should be noted that before the user inputs a voice signal, the user first clicks a button on the display interface to issue a trigger command. Upon receiving the trigger command, the system calls a pre-built image database (which stores images to be interpreted) and displays the data from the image database on the front-end display interface. At this point, the user inputs a voice signal according to the voice signal rules displayed on the front-end interface. It should be pointed out that the voice signal rules can be flexibly set according to different situations. In one implementation, the voice signal rule is that the input voice should contain image information of the image to be interpreted, and the image information includes at least one of an image index and an image type.

[0024] It should also be noted that the content displayed on the front-end interface is image data from a pre-stored image database. When the image interpretation command is triggered, the image data in the database is directly retrieved according to the command and displayed on the interface so that the user can confirm the image to be interpreted.

[0025] When displaying image data by calling an image database, the following two methods can be used.

[0026] One approach involves the front-end interface directly displaying a single image along with its image information (i.e., image index and / or image type). The user then confirms the displayed image, and once confirmed as the one to be interpreted, inputs a voice signal according to the voice signal input rules. For example, based on these rules, the input voice signal could be "segment and extract the current image" or "segment and extract this image."

[0027] Another approach involves the front-end interface directly displaying multiple images along with their image information (i.e., image index and / or image type). The user then selects the image to be interpreted from the displayed images and inputs a voice signal according to the input rules. For example, the front-end interface might display: 1-Image A, 2-Image B, 3-Image C. The voice signal rule in this case is that the voice signal must contain the index corresponding to the currently interpreted image. The input voice signal could be "Segment and extract the image with index 3".

[0028] Furthermore, in this application, the image database contains each data item and its corresponding index information. That is, each image data item is acquired, and each acquired image data item is labeled to obtain its corresponding index value. Then, the association between each image data item and its corresponding index value is established, and each data item, its corresponding index value, and the association between each image data item and its corresponding index value are stored in the image database to obtain the pre-constructed image database.

[0029] Based on a pre-built database, when extracting image information from the speech signal to determine the image to be interpreted, the image being interpreted can be directly identified from the image index contained in the extracted image information.

[0030] Furthermore, to better differentiate the various image data stored in the database, a database table can be constructed after the index is created, which makes retrieval even more convenient. See Table 1 for details.

Table 1

In constructing the image database table, each data item is configured with four keywords in addition to its index: imgPath, t1Path, t2Path, and vidoPath. The keywords are configured by first categorizing each data item and then setting the keyword values according to its category.

[0031] It should be noted that, because remote sensing images have a coordinate system, after acquiring the imaging time and imaging area for each image on the current interface, the data can be classified according to the imaging time and imaging area, whether it is video data, and whether it appears in pairs. Remote sensing image data can be divided into three categories: independent images, image pairs, and videos.

[0032] Specifically: non-video data that does not appear in pairs is independent image data, which supports only segmentation extraction and object detection. Therefore, the value of the imgPath keyword is configured to the index value corresponding to the data, while the values of the t1Path, t2Path and vidoPath keywords are all configured to be empty (i.e., None).

[0033] For non-video data that appears in pairs, the data is an image pair, which corresponds to image data that can be used for target tracking and change detection.

[0034] Video data supports segmentation extraction, object detection, tracking, and change detection. The keywords for this data are configured as follows: the values of the keywords imgPath, t1Path, and t2Path are configured to None, and the value of the keyword vidoPath is configured to the corresponding index value.
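The keyword configuration for the three data categories can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `configure_keywords` and the category labels are hypothetical, the pair layout (t1Path holds the item's own index, t2Path holds its partner's index) follows the img4/img5 example given in paragraph [0046], and leaving imgPath empty for image pairs is an assumption because the text does not state it.

```python
from typing import Dict, Optional

# Keyword names exactly as they appear in the description (including "vidoPath").
KEYWORDS = ("imgPath", "t1Path", "t2Path", "vidoPath")


def configure_keywords(index: int, category: str,
                       partner_index: Optional[int] = None) -> Dict[str, Optional[int]]:
    """Build one table row for a data item of the given category."""
    row: Dict[str, Optional[int]] = {"index": index}
    row.update({key: None for key in KEYWORDS})

    if category == "image":        # independent image
        row["imgPath"] = index
    elif category == "pair":       # member of an image pair
        if partner_index is None:
            raise ValueError("an image pair needs the index of its partner image")
        row["t1Path"] = index          # assumption: own index, as in paragraph [0046]
        row["t2Path"] = partner_index  # assumption: partner index, as in paragraph [0046]
    elif category == "video":
        row["vidoPath"] = index
    else:
        raise ValueError(f"unknown category: {category!r}")
    return row
```

For instance, `configure_keywords(4, "pair", partner_index=1)` yields a row whose t1Path is 4 and t2Path is 1, matching the img4 example.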

[0035] Therefore, the table constructed above is stored in the database, and each image data is distinguished based on this table. When searching for image data, the search is still based on the index column in the table.

[0036] Based on this, after constructing the image database in the above manner, when extracting image information for determining the image to be interpreted from the speech signal, the image to be interpreted can be directly identified based on the image index contained in the extracted image information.

[0037] For example, when a speech signal is parsed, the image information used to determine the image to be interpreted is extracted directly from the input speech signal. If the input speech signal is "segment and extract the image with index 3," the extracted image information is "the image with index 3." Based on the extracted "index 3" and the index configuration table shown in Table 1, segmentation extraction is applied to the data with index 3 (i.e., C, change image t).
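As a rough illustration of this extraction step, the image index can be pulled out of the recognized text with a pattern match. The pattern, the function name `extract_image_index`, and the English phrasing are assumptions made for this sketch; the application does not specify how the parsing is implemented.

```python
import re
from typing import Optional

# Hypothetical rule: the spoken command contains the word "index" followed by a number.
_INDEX_PATTERN = re.compile(r"\bindex\s+(\d+)\b", re.IGNORECASE)


def extract_image_index(recognized_text: str) -> Optional[int]:
    """Return the image index mentioned in the text, or None if there is none."""
    match = _INDEX_PATTERN.search(recognized_text)
    return int(match.group(1)) if match else None
```

A command such as "segment and extract the current image" contains no index, so the function returns None and the single image shown on the interface would be used instead.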

[0038] Furthermore, it should be noted that when configuring keywords according to the above keyword configuration rules, when classifying data, it is necessary to determine whether remote sensing data appear in pairs (i.e., whether they can be used for change detection). This determination can be made based on imaging time and imaging area.

[0039] First, iterate through the acquired remote sensing data, and simultaneously obtain the imaging area and imaging time of the current remote sensing data.

[0040] Then, based on the acquired imaging area, the intersection area of the current remote sensing data's imaging area with the imaging areas of other remote sensing data is calculated, thereby obtaining the overlap degree between the current remote sensing data and the imaging areas of other data. Simultaneously, the time difference between the current remote sensing data's imaging time and the imaging times of other remote sensing data is calculated.

[0041] Finally, a set of data with an overlap greater than a preset overlap threshold and an imaging time difference greater than the change detection time interval is selected as the image pair for change detection. The preset overlap threshold can be determined based on actual conditions, but is preferably set to 80%. It should be noted that the remote sensing data meeting the above requirements needs to be sorted by time. The change detection time interval T can be set according to different application scenarios. That is, different time intervals are set for different application scenarios. Therefore, when judging image pairs, image pairs for different application scenarios can be determined separately. This allows for the simultaneous acquisition of interpretation results for multiple different application scenarios during subsequent image change detection, target tracking, and other interpretation operations via voice.
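The pair search described in paragraphs [0039] to [0041] can be sketched as follows. Several points are assumptions made for illustration: imaging areas are modelled as axis-aligned bounding boxes, the overlap degree is taken as the intersection area divided by the area of the earlier image, and the names `Scene`, `overlap_ratio`, and `find_change_pairs` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass(frozen=True)
class Scene:
    index: int
    bbox: BBox
    acquired: datetime


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area of a and b divided by the area of a (assumed definition)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / ((a[2] - a[0]) * (a[3] - a[1]))


def find_change_pairs(scenes: List[Scene], interval: timedelta,
                      overlap_threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return (earlier_index, later_index) pairs that satisfy both conditions."""
    ordered = sorted(scenes, key=lambda s: s.acquired)  # sort by imaging time
    pairs = []
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            overlapping = overlap_ratio(earlier.bbox, later.bbox) > overlap_threshold
            far_apart = later.acquired - earlier.acquired > interval
            if overlapping and far_apart:
                pairs.append((earlier.index, later.index))
    return pairs
```

Calling `find_change_pairs` once per application scenario, each with its own interval T, gives the per-scenario pair lists that the description uses to fill t1Path and t2Path.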

[0042] It should be noted that different application scenarios correspond to different time intervals T. Time interval T refers to the time interval for change monitoring and is custom data. For example, different application scenarios may include at least one of the following: short-term disasters (e.g., floods, forest fires), routine remote sensing monitoring of nature reserves, short-term land resource change monitoring (land use), medium-term land resource change monitoring, medium-to-long-term land resource change monitoring, and long-term land resource change monitoring.

[0043] The specific change detection time interval varies depending on the application scenario. For example, the change detection time interval can be set to 15 days for short-term disaster (such as floods and forest fires) applications; 6 months for routine remote sensing detection of nature reserves applications; 1 year for short-term land resource change detection (land use) applications; 2 years for medium-term land resource change detection applications; 5 years for medium- and long-term land resource change detection applications; and 10 years for long-term land resource change detection applications.
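These example intervals could be kept in a simple configuration mapping, sketched here. The scenario keys and the helper `interval_for` are hypothetical names, and converting months and years to fixed day counts (182 and 365 days) is an approximation chosen for this sketch only.

```python
from datetime import timedelta

# Example change detection intervals T per application scenario (values from the text;
# day counts for months and years are approximations).
CHANGE_DETECTION_INTERVALS = {
    "short_term_disaster": timedelta(days=15),
    "nature_reserve_routine": timedelta(days=182),        # about 6 months
    "land_short_term": timedelta(days=365),               # 1 year
    "land_medium_term": timedelta(days=2 * 365),          # 2 years
    "land_medium_long_term": timedelta(days=5 * 365),     # 5 years
    "land_long_term": timedelta(days=10 * 365),           # 10 years
}


def interval_for(scenario: str) -> timedelta:
    """Look up the interval T for a scenario, failing loudly on unknown names."""
    try:
        return CHANGE_DETECTION_INTERVALS[scenario]
    except KeyError:
        raise ValueError(f"no interval configured for scenario {scenario!r}") from None
```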

[0044] It should be noted here that the time interval T for change detection can be set based on the current application scenario (i.e., the database is built in real time). The time interval is set according to the current application scenario to find the corresponding image pairs. The database table is configured based on the found image pairs. Thus, when interpreting images, the image for change detection is directly found based on the database table.

[0045] Alternatively, different time intervals can be set in advance for multiple application scenarios, and a corresponding database table established for each scenario (i.e., the pre-built database approach). Searching for image pairs under each time interval yields multiple sets of image pairs, and a database table is built from the image pairs found for each application scenario's time interval. The database then contains one table per application scenario (i.e., multiple database tables). When the pre-built database approach is used, the database first stores a main table containing only the images and their corresponding indexes, so that the image data for segmentation extraction, object detection, and video tracking can be retrieved directly through the indexes in this main table. Among the image database tables for the different application scenarios, only the values of t1Path and t2Path differ; the values of the other keywords are the same. Therefore, when change detection is performed, the database table corresponding to the current application scenario is selected, and the image data for change detection (i.e., the image pair) is retrieved from that table. Because the pre-built database approach requires the database table for the application scenario to be selected first, one option is to include the application scenario in the voice input: the user states which application scenario applies, the scenario is extracted from the voice input, and the corresponding database table is looked up to confirm the image to be interpreted. Alternatively, the current application scenario can be selected in advance, and the database table corresponding to it is called to confirm the image to be interpreted. The specific method is not limited.

[0046] More specifically, when searching for image pairs for change detection, the calculated overlap and imaging time difference between the remote sensing data items are shown in Table 2.

Table 2

Referring to Table 2, taking the search for image pairs of img1 as an example, the data pair [img1, (img4, img5)] is obtained because the overlap of the imaging areas is greater than the overlap threshold of 80% and the time difference between the imaging time of img1 and that of each paired image is greater than the time interval T. Based on this data pair, the image pairs used for change detection are [img1, img4] and [img1, img5]. When configuring the relationship table according to Table 1, the value of t1Path for img4 is configured to be 4 and the value of t2Path is configured to be 1; the value of t1Path for img5 is configured to be 5 and the value of t2Path is configured to be 1.

[0047] After configuring the index configuration table, during change detection, the image pairs input for the change detection interpretation operation are found based on the values of t1Path and t2Path configured in the table. Specifically, during the change detection interpretation operation, the required input pairs of data are found based on the image information extracted from the speech signal and the pairwise relationships configured in the index configuration table.

[0048] For example, the input voice signal is "show the changes in the image with index 4". A change detection interpretation operation is required, thus requiring an image pair. The image information extracted from the input voice signal is "the image with index 4". Based on the image information "image" and "index 4", the corresponding remote sensing data with index 4 (i.e., D) is found in Table 1. Then, according to the pairing relationship configured for D (changed image t+1.5T) in Table 1, the remote sensing data with index 3 and the current D (changed image t+1.5T) are found to be an image pair. Therefore, the remote sensing data with a value of 3 (i.e., C) is then found in the index column. The remote sensing data C and D found at this point are used as the interpretation images input for the current change detection.
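The lookup in this example can be sketched with a small in-memory table. The rows are illustrative only: the text states that the data with index 4 (D) is paired with the data with index 3 (C), while the column layout (t1Path as own index, t2Path as partner index), the remaining row contents, and the function name `resolve_change_pair` are assumptions.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, Optional[object]]

# Illustrative table keyed by index; only the C/D pairing comes from the text.
TABLE: Dict[int, Row] = {
    1: {"name": "A", "t1Path": None, "t2Path": None},
    3: {"name": "C", "t1Path": None, "t2Path": None},  # C's own pairing columns are not given in the text
    4: {"name": "D", "t1Path": 4, "t2Path": 3},
}


def resolve_change_pair(table: Dict[int, Row], index: int) -> Tuple[str, str]:
    """Return (partner name, current name) for the image with the given index."""
    row = table[index]
    partner_index = row["t2Path"]
    if row["t1Path"] is None or partner_index is None:
        raise ValueError(f"image {index} has no configured image pair")
    return table[partner_index]["name"], row["name"]
```

With this table, `resolve_change_pair(TABLE, 4)` returns the names of C and D, which are then passed to the change detection algorithm as its input pair.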

[0049] Based on the above operations, the image currently being interpreted can be determined.

[0050] After obtaining the image to be interpreted, step S1200 is executed, whereby a matching interpretation algorithm is selected from a pre-built algorithm pool based on the speech signal. The algorithm pool needs to be pre-built before selecting a matching interpretation algorithm. Specifically, when building the algorithm pool, various interpretation algorithms are first acquired. It should be noted that the interpretation algorithm includes at least one of the following: segmentation extraction algorithm, object detection algorithm, video tracking algorithm, and change detection algorithm.

[0051] Each interpretation algorithm is configured with a corresponding task text key based on its function. It should be noted that the task text key refers to the functional description of the algorithm. When configuring the task text key for an algorithm, the configuration is performed according to a pre-constructed task text configuration formula. This formula can be expressed as:

key = f"{speed_desc}{accuracy_desc}{modelName}{target}{task}algorithm"

In the formula, speed_desc represents the interpretation speed description, accuracy_desc represents the interpretation accuracy description, modelName represents the algorithm name, target represents the object, and task represents the task description.

[0052] It should be noted that when configuring the task text corresponding to the algorithm according to the formula, the algorithm name {modelName} and task {task} fields are the algorithm name and the corresponding function description of each specific algorithm. The remaining fields can be configured according to the actual situation, and there are no specific restrictions.

[0053] For example, the task text for the YOLO target detection algorithm can be configured as "Fast and high-precision YOLO target detection algorithm", "Fast and high-precision YOLO aircraft target detection algorithm", and "Fast and high-precision YOLO ship target detection algorithm". The task text for the Unet change detection algorithm can be configured as "Fast and high-precision Unet building change detection algorithm" and "Fast and high-precision Unet water change detection algorithm".
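A minimal sketch of building such task texts from the fields of the configuration formula is given here. The helper name `build_task_key` is hypothetical, and joining the fields with spaces is an adaptation for English text; the formula in the description concatenates the fields directly.

```python
def build_task_key(model_name: str, task: str, speed_desc: str = "",
                   accuracy_desc: str = "", target: str = "") -> str:
    """Compose a task text from the formula's fields, skipping empty optional fields."""
    parts = [speed_desc, accuracy_desc, model_name, target, task, "algorithm"]
    return " ".join(part for part in parts if part)


yolo_aircraft = build_task_key("YOLO", "target detection", speed_desc="Fast and",
                               accuracy_desc="high-precision", target="aircraft")
unet_building = build_task_key("Unet", "change detection", speed_desc="Fast and",
                               accuracy_desc="high-precision", target="building")
```

Only `model_name` and `task` are required here, mirroring the statement that the algorithm name and task fields are fixed per algorithm while the remaining fields are optional.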

[0054] Based on each interpretation algorithm and its corresponding task text, a mapping relationship between the task text and the interpretation algorithm is constructed. Finally, each interpretation algorithm and the task text are stored one by one according to the mapping relationship between the task text and the interpretation algorithm to obtain the algorithm pool.
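One simple way to hold this mapping is a dictionary from task text to a callable, sketched here under stated assumptions: `yolo_detect` and `unet_change_detect` are placeholder functions standing in for real interpretation algorithms, and `build_algorithm_pool` is a hypothetical helper, not a structure prescribed by the application.

```python
from typing import Callable, Dict, Iterable, Tuple


def yolo_detect(image: str) -> str:
    """Placeholder for an object detection algorithm."""
    return f"detections for {image}"


def unet_change_detect(image_pair: Tuple[str, str]) -> str:
    """Placeholder for a change detection algorithm."""
    return f"changes between {image_pair[0]} and {image_pair[1]}"


def build_algorithm_pool(entries: Iterable[Tuple[str, Callable]]) -> Dict[str, Callable]:
    """Store each algorithm under its task text, rejecting duplicate task texts."""
    pool: Dict[str, Callable] = {}
    for task_text, algorithm in entries:
        if task_text in pool:
            raise ValueError(f"duplicate task text: {task_text!r}")
        pool[task_text] = algorithm
    return pool


pool = build_algorithm_pool([
    ("Fast and high-precision YOLO target detection algorithm", yolo_detect),
    ("Fast and high-precision Unet building change detection algorithm", unet_change_detect),
])
```

Because several task texts may describe the same underlying algorithm (for example aircraft and ship variants of YOLO detection), the same callable can be registered under more than one key.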

[0055] After obtaining the constructed algorithm pool, an interpretation algorithm matching the current speech signal is selected based on the pool. Specifically, the selection of the interpretation algorithm is achieved by calculating the similarity between the speech signal and the algorithms in the pool. It should be noted that the speech signal, in addition to the image information used to determine the interpretation image, also includes specific interpretation operations (i.e., at least one of segmentation extraction, object detection, video tracking, and change detection). Therefore, the selection of a matching interpretation algorithm by calculating the similarity between the speech signal and the algorithms in the pool is achieved by calculating the similarity between the entire speech signal in text form (i.e., the text information obtained by converting the entire speech signal into text form) and the task text key configured for the aforementioned interpretation algorithm.

[0056] In one possible implementation, when selecting a matching interpretation algorithm by calculating the similarity between the speech signal and algorithms in the algorithm pool, a speech recognition vector is first obtained from the speech signal. Specifically, the speech signal is converted into a speech file. During this conversion, the speech signal is recorded using a device with speech recording capabilities, and a corresponding speech file is generated. It should be noted that the speech file format can include any of the following: binary files, WAV files, MP3 files, and other speech file formats; no specific limitation is imposed.

[0057] After obtaining the audio file, it is converted into corresponding text information. This conversion involves first converting the audio file into a Mel spectrogram to obtain information from each frame. Then, based on the Mel spectrogram information, an encoder converts it into a digital representation. Finally, a decoder decodes the digitally represented audio signal into its textual form, i.e., the text information. Through these operations, the audio file can be converted into its corresponding text information. It should be noted that a Mel spectrogram is a graph used to represent the acoustic characteristics of speech, such as pitch, intensity, and duration.

[0058] After obtaining the text information corresponding to the speech signal, it is vectorized to obtain the corresponding speech recognition vector, which can be represented as vec_a. When vectorizing the text information to obtain the speech recognition vector, a natural language processing model can be used, preferably at least one of the bge_base_zh model and the bge_large_zh model.

[0059] It should be noted that when calculating the speech recognition vector, the text information is encoded and embedded to generate the speech recognition vector. Specifically, as shown in Figure 2, the text information (the "target detection" text shown in Figure 2) is first embedded, which includes token embedding, segmentation embedding and position embedding.

[0060] In embedding text information, the first step is segmentation embedding, which involves dividing each sentence in the text information into individual Chinese characters to obtain a sequence of characters and symbols. Based on the obtained sequence of text and symbols, token embedding is performed, that is, each object in the sequence is converted into a token, and the symbol is marked as [seg_token], thus forming the final token embedding result. To ensure the semantic order of the text information, each token object in the token embedding result is first positionally encoded, and then position embedding (T+PE) is performed to preserve the semantic order of the sentence.

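[0061] It should be noted that the position encoding uses the sine and cosine method, which in its standard form can be expressed by the formula:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/D}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/D}}\right)$$

In the formula, pos represents the position index of the token, i represents the dimension index of the token, and D represents the dimension of the model.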

[0062] Each token object is embedded into the token embedding result T based on its positional encoding, ultimately resulting in T+PE (i.e., the final text embedding result). It should be noted that the text information is embedded according to the embedding order described above.
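The position embedding step (T+PE) can be sketched as follows. This is a minimal NumPy illustration that assumes the standard sinusoidal encoding with base constant 10000 and an even model dimension; the random matrix `T` is only a stand-in for a real token embedding result, and the function name is hypothetical.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model, base=10000.0):
    """Return PE with shape (seq_len, d_model); d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]        # position index of each token
    i = np.arange(d_model // 2)[None, :]     # index of each sine/cosine pair
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

# Stand-in token embedding result T for 4 tokens with model dimension D = 8.
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 8))
text_embedding = T + sinusoidal_position_encoding(4, 8)   # T + PE
```

Because PE depends only on the position and the dimension, the same token receives a different embedding at each position, which is how the order of the sentence is retained.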

[0063] The embedding result obtained above (i.e., the text transformation vector) is encoded by an encoder to obtain the context information of each word in the text. The decoder then performs a weighted sum over the obtained context information to obtain the corresponding speech recognition vector. The weighted summation of the context information is calculated using an attention mechanism, which can be expressed by the following formula:

$$\mathrm{Attention}(Q, K, V) = f\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

In the formula, f represents the normalization operation, which can be softmax; Q represents the query vector, K represents the key vector, and V represents the value vector; and d represents the vector dimension.
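The attention-based weighted summation can be sketched in NumPy as follows, taking softmax as the normalization operation and d as the vector dimension of the queries and keys. The function names are illustrative and the inputs are toy matrices, not the output of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    Returns the weighted sums and the attention weights.
    """
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # each row sums to 1
    return weights @ V, weights

# Toy usage: 3 tokens, dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
context, weights = attention(Q, K, V)
```

Each output row is a weighted sum of the rows of V, with weights determined by how closely the corresponding query matches each key.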

[0064] After obtaining the speech recognition vectors, it is also necessary to calculate the algorithm vector set for each algorithm in the algorithm pool, which can be represented as vec_set=[vec_1, vec_2,…, vec_n]. Specifically, when calculating the algorithm vector set, firstly, the task text corresponding to the algorithm is obtained according to the mapping relationship between the algorithm and the task text. Then, the task text is vectorized to obtain the algorithm vectors corresponding to each algorithm. Finally, the algorithm vectors are combined to obtain the algorithm vector set. The vectorization method used when vectorizing each task text is the same as the vectorization method for the speech recognition vectors described above, and will not be elaborated further here.
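The construction of the algorithm vector set can be sketched as follows. The task texts in the pool are hypothetical placeholders, and `toy_embed` is only a stand-in for the vectorization model (for example bge_base_zh), so that the sketch runs without external dependencies; neither is taken from the application.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical algorithm pool: algorithm name -> task text (placeholder wording).
ALGORITHM_POOL: Dict[str, str] = {
    "high-precision Unet segmentation and extraction": "segment and extract the image",
    "fast YOLO aircraft target detection": "detect aircraft targets in the image",
    "video tracking": "track the target in the video",
    "fast Unet change detection": "detect changes between two images",
}

def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Stand-in embedding: character counts folded into `dim` buckets.

    A real system would call the same model used for the speech recognition vector.
    """
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def build_vector_set(
    pool: Dict[str, str], embed: Callable[[str], List[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Return (names, vec_set), where vec_set[k] is the vector of names[k]'s task text."""
    names = list(pool)
    vec_set = [embed(pool[name]) for name in names]
    return names, vec_set

names, vec_set = build_vector_set(ALGORITHM_POOL, toy_embed)
```

Keeping `names` and `vec_set` in the same order means the index of the most similar vector identifies the matched algorithm directly.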

[0065] Based on the obtained speech recognition vector and algorithm vector set, the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set is calculated. It should be noted that the similarity calculation method used when calculating the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set can be selected according to the actual situation; the cosine similarity calculation method is preferred.

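[0066] When calculating vector similarity using cosine similarity, the formula can be expressed as:

$$\cos(\mathrm{vec\_a}, \mathrm{vec\_n}) = \frac{\mathrm{vec\_a} \cdot \mathrm{vec\_n}}{\lVert \mathrm{vec\_a} \rVert \, \lVert \mathrm{vec\_n} \rVert}, \qquad \mathrm{vec\_a} = (a_1, a_2, \ldots, a_m), \quad \mathrm{vec\_n} = (b_1, b_2, \ldots, b_m)$$

In the formula, vec_a represents the speech recognition vector, and vec_n represents the algorithm vector; vec_a · vec_n is the dot product of vectors vec_a and vec_n; and ‖vec_a‖ and ‖vec_n‖ are the L2 norms of vectors vec_a and vec_n, respectively.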

[0067] Based on the calculated similarity scores, the algorithm with the highest similarity score is selected as the interpretation algorithm to match the current speech signal.
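The similarity calculation and the selection of the highest-scoring algorithm can be sketched as follows. This is a minimal NumPy illustration with toy two-dimensional vectors; the function names are illustrative, and real vectors would come from the vectorization model.

```python
import numpy as np

def cosine_similarity(vec_a, vec_n):
    """Dot product of the two vectors divided by the product of their L2 norms."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_n = np.asarray(vec_n, dtype=float)
    return float(vec_a @ vec_n / (np.linalg.norm(vec_a) * np.linalg.norm(vec_n)))

def best_match(vec_a, vec_set):
    """Return (index of the most similar algorithm vector, all similarity scores)."""
    scores = [cosine_similarity(vec_a, v) for v in vec_set]
    return int(np.argmax(scores)), scores

# Toy usage: the second vector points in the same direction as vec_a.
index, scores = best_match([1.0, 1.0], [[1.0, 0.0], [2.0, 2.0], [0.0, 1.0]])
```

Cosine similarity ignores vector length, so `[2.0, 2.0]` matches `[1.0, 1.0]` as well as an identical vector would.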

[0068] After obtaining the interpretation algorithm that matches the current speech signal, step S1300 is executed to interpret the image based on the determined interpretation algorithm, thereby obtaining the interpretation result. Specifically, the determined interpretation algorithm is run directly on the image that has already been identified as the image to be interpreted, thus obtaining the interpretation result. It should be noted that the interpretation algorithms themselves are common knowledge in the field and will not be elaborated upon further here.

[0069] For example, if the text in the speech recognition is "What is the difference between the image with index 1 and the image with index 2?", and the algorithm matches "change detection algorithm", then the algorithm is run to detect changes in the image with index 1 and the image with index 2, and the detection result is obtained.

[0070] The interpretation results are then displayed using a display device. In one possible implementation, the interpretation effect is shown in Table 3. For example, when the speech recognition result is "segment and extract the current image", the extracted image to be interpreted is the one shown in Table 3, the algorithm matched from the speech recognition result is the segmentation and extraction algorithm, and the visualization result obtained by running the interpretation algorithm is the image shown in the visualization output of Table 3.

[0071] Furthermore, after selecting the algorithm with the highest similarity, a similarity threshold can be used to determine whether the algorithm has successfully matched. Specifically, if the highest similarity is greater than the similarity threshold, the corresponding algorithm has successfully matched, and the interpretation algorithm is run to interpret the image; otherwise, the algorithm fails to match and the speech signal is reacquired. The similarity threshold is set according to the selected natural language processing model. Preferably, when using the bge_base_zh model, the threshold is set to 0.65; when using the bge_large_zh model, the threshold is set to 0.8.

[0072] In one possible implementation, the similarity matrix obtained when calculating vectors using the bge_base_zh model is shown in Figure 3. As shown in Figure 3, the input speech signal is "segment and extract the first image". The similarities between the input speech signal and the high-precision Unet segmentation and extraction algorithm, the fast YOLO aircraft target detection algorithm, the video tracking algorithm, and the fast Unet change detection algorithm are calculated to be 0.75, 0.66, 0.66, and 0.64, respectively. Based on the calculated similarities, the high-precision Unet segmentation and extraction algorithm, which has the highest similarity of 0.75, is selected as the algorithm that matches the current speech signal. Furthermore, since the similarity threshold set for the bge_base_zh model is 0.65, and 0.75 is greater than 0.65, the high-precision Unet segmentation and extraction algorithm successfully matches the current speech signal.

[0073] Similarly, when using the bge_large_zh model to calculate vectors, the resulting similarity matrix is shown in Figure 4. As shown in Figure 4, the input speech signal is "What's in the second image?". The similarity scores between the input speech signal and the high-precision Unet segmentation and extraction algorithm, the fast YOLO aircraft target detection algorithm, the video tracking algorithm, and the fast Unet change detection algorithm are calculated to be 0.74, 0.71, 0.70, and 0.73, respectively. Based on the calculated similarity scores, the high-precision Unet segmentation and extraction algorithm, which has the highest similarity score of 0.74, is selected as the candidate algorithm for the current speech signal. However, since the similarity threshold set for the bge_large_zh model is 0.8, and 0.74 is less than 0.8, the high-precision Unet segmentation and extraction algorithm fails to match the current speech signal, and the speech signal needs to be acquired again.
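The two worked examples can be replayed with a short sketch that applies the model-specific threshold to the highest similarity score. The similarity values and thresholds are those stated in the description; the function name, the dictionary layout, and the shortened algorithm names are assumptions made for illustration.

```python
from typing import Dict, Optional

# Thresholds stated in the description for each vectorization model.
THRESHOLDS: Dict[str, float] = {"bge_base_zh": 0.65, "bge_large_zh": 0.8}

def match_with_threshold(scores: Dict[str, float], model: str) -> Optional[str]:
    """Return the highest-scoring algorithm if it exceeds the threshold, else None.

    None means the match failed and the speech signal should be reacquired.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] > THRESHOLDS[model] else None

# Similarities for the bge_base_zh example (Figure 3).
figure_3 = {
    "Unet segmentation and extraction": 0.75,
    "YOLO aircraft target detection": 0.66,
    "video tracking": 0.66,
    "Unet change detection": 0.64,
}
# Similarities for the bge_large_zh example (Figure 4).
figure_4 = {
    "Unet segmentation and extraction": 0.74,
    "YOLO aircraft target detection": 0.71,
    "video tracking": 0.70,
    "Unet change detection": 0.73,
}

matched_base = match_with_threshold(figure_3, "bge_base_zh")    # 0.75 > 0.65: match
matched_large = match_with_threshold(figure_4, "bge_large_zh")  # 0.74 < 0.8: no match
```

In the second example the scores are close together (0.70 to 0.74), so the threshold prevents an ambiguous question from triggering an algorithm.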

[0074] Furthermore, in one possible implementation, before acquiring the voice signal, the user first clicks the voice signal acquisition button, then performs the voice signal acquisition operation, and subsequently performs voice control for image interpretation. The image interpretation process is consistent with the image interpretation process described above, and will not be elaborated upon here.

[0075] This application provides a remote sensing image interpretation method, including acquiring a speech signal, parsing the speech signal, extracting the image to be interpreted from the speech signal, selecting a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal, and interpreting the image based on the determined interpretation algorithm to obtain the interpretation result. The above method acquires a speech signal, determines the image to be interpreted and the matching interpretation algorithm based on the acquired speech signal, thereby performing image interpretation. This application allows for direct remote sensing image interpretation via voice control, effectively improving the intelligence level of remote sensing image interpretation.

[0076] <Device Embodiment> Figure 5 shows a schematic block diagram of a remote sensing image interpretation apparatus according to an embodiment of this application. As shown in Figure 5, the apparatus 100 is used for image interpretation via speech, and includes: an image extraction module 110, an algorithm matching module 120, and an image interpretation module 130. The image extraction module 110 is used to acquire a speech signal, parse the speech signal, and extract the image to be interpreted from the speech signal; the algorithm matching module 120 is used to select a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal; the image interpretation module 130 is used to interpret the image based on the determined interpretation algorithm to obtain the interpretation result.
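The division of responsibilities among the three modules can be sketched as a simple composition. The class name, field names, and the stub functions in the usage example are all hypothetical; the sketch shows only how data flows between modules 110, 120, and 130, not how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class InterpretationApparatus:
    # Module 110: speech signal -> image to be interpreted.
    extract_image: Callable[[Any], Any]
    # Module 120: speech signal -> matched interpretation algorithm.
    match_algorithm: Callable[[Any], Callable[[Any], Any]]
    # Module 130: (algorithm, image) -> interpretation result.
    interpret: Callable[[Callable[[Any], Any], Any], Any]

    def run(self, speech_signal: Any) -> Any:
        image = self.extract_image(speech_signal)
        algorithm = self.match_algorithm(speech_signal)
        return self.interpret(algorithm, image)

# Usage with stub modules standing in for real speech parsing and algorithms.
apparatus = InterpretationApparatus(
    extract_image=lambda signal: "image_1",
    match_algorithm=lambda signal: (lambda image: f"segmented({image})"),
    interpret=lambda algorithm, image: algorithm(image),
)
result = apparatus.run("segment and extract the first image")
```

Both the image extraction module and the algorithm matching module take the same speech signal as input, so they are independent of each other and only the image interpretation module depends on both outputs.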

[0077] <Equipment Example> Figure 6 shows a schematic block diagram of a remote sensing image interpretation device according to an embodiment of this application. As shown in Figure 6, the remote sensing image interpretation device 200 includes a processor 210 and a memory 220 for storing executable instructions of the processor 210. The processor 210 is configured to implement any of the remote sensing image interpretation methods described above when executing the executable instructions.

[0078] It should be noted here that the number of processors 210 can be one or more. Furthermore, the remote sensing image interpretation device 200 in this embodiment may also include an input device 230 and an output device 240. The processors 210, memory 220, input device 230, and output device 240 can be connected via a bus or other means, which are not specifically limited here.

[0079] The memory 220, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and various modules, such as the program or module corresponding to the remote sensing image interpretation method in this application embodiment. The processor 210 executes various functional applications and data processing of the remote sensing image interpretation device 200 by running the software program or module stored in the memory 220.

[0080] The input device 230 can be used to receive input numbers or signals. These signals may include key signals related to user settings and function control of the device / terminal / server. The output device 240 may include a display device such as a screen.

[0081] <Storage Medium Examples> According to a fourth aspect of this application, a non-volatile computer-readable storage medium is also provided, on which computer program instructions are stored, which, when executed by processor 210, implement any of the remote sensing image interpretation methods described above.

[0082] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical applications, or technological improvements to the embodiments in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for interpreting remote sensing images, characterized in that the method is used for image interpretation via speech and includes: acquiring a speech signal, parsing the speech signal, and extracting the image to be interpreted from the speech signal; selecting a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal; and interpreting the image based on the determined interpretation algorithm to obtain the interpretation result.

2. The method of claim 1, wherein, when selecting a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal, the selection is made by calculating the similarity between the speech signal and the algorithms in the algorithm pool.

3. The method of claim 2, wherein selecting a matching interpretation algorithm by calculating the similarity between the speech signal and the algorithms in the algorithm pool includes the following steps: obtaining the speech recognition vector based on the speech signal, and obtaining the set of algorithm vectors corresponding to the algorithms in the algorithm pool; calculating the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set; and, based on the calculated similarity scores, selecting the algorithm with the highest similarity score as the interpretation algorithm that matches the current speech signal.

4. The method of claim 3, wherein obtaining the speech recognition vector includes: converting the speech signal into text information; and vectorizing the text information to obtain the speech recognition vector.

5. The method of claim 2, wherein obtaining the algorithm vector set includes: obtaining the task text corresponding to each algorithm; vectorizing the task text to obtain the algorithm vector of each algorithm; and combining the algorithm vectors to obtain the algorithm vector set.

6. The method according to any one of claims 1 to 5, characterized in that pre-building the algorithm pool includes: obtaining the interpretation algorithms and their corresponding task texts; constructing a mapping relationship between the task texts and the interpretation algorithms; and storing the interpretation algorithms and task texts according to the mapping relationship to obtain the algorithm pool.

7. A remote sensing image interpretation apparatus, characterized in that the apparatus is used for image interpretation via speech and includes: an image extraction module, used to acquire the speech signal, parse the speech signal, and extract the image to be interpreted from the speech signal; an algorithm matching module, used to select a matching interpretation algorithm from a pre-built algorithm pool based on the speech signal; and an image interpretation module, used to interpret the image based on the determined interpretation algorithm to obtain the interpretation result.

8. The apparatus of claim 7, wherein, for selecting a matching interpretation algorithm by calculating the similarity between the speech signal and the algorithms in the algorithm pool, the algorithm matching module includes: a vector acquisition module, used to obtain a speech recognition vector based on the speech signal and to obtain the set of algorithm vectors corresponding to the algorithms in the algorithm pool; a similarity calculation module, used to calculate the similarity between the speech recognition vector and each algorithm vector in the algorithm vector set; and an algorithm selection module, used to select, based on the calculated similarity scores, the algorithm with the highest similarity as the interpretation algorithm that matches the current speech signal.

9. A remote sensing image interpretation device, characterized by including: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to implement the method of any one of claims 1 to 6 when executing the executable instructions.

10. A non-volatile computer-readable storage medium storing computer program instructions thereon, characterized in that, when the computer program instructions are executed by a processor, they implement the method described in any one of claims 1 to 6.