An adaptive image cropping and matching method, device and medium for visual pointing recognition
By adaptively generating multi-scale candidate cropping boxes and using lightweight neural network feature extraction, the problem of recognition failure caused by changes in distance and size in visual pointing recognition is solved, achieving efficient and accurate target recognition.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: BEIJING HAOWANG TECHNOLOGY CO LTD
- Filing Date: 2026-03-05
- Publication Date: 2026-06-26
AI Technical Summary
In existing visual pointing recognition technologies, the fixed recognition window fails to match due to changes in pointing distance and target object size, resulting in a decrease in recognition accuracy and robustness.
By adaptively generating multi-scale candidate cropping boxes and utilizing the position information of light spots, a set of candidate boxes with appropriate spans is generated. Combined with a lightweight neural network, feature extraction and similarity calculation are performed to achieve accurate recognition of target objects.
The method improves recognition accuracy and robustness, reduces the amount of computation, and can run in real time on embedded devices across diverse use cases.
Figure CN122289757A_ABST
Description
Technical Field
[0001] This invention relates to image processing and human-computer interaction, and specifically to a preprocessing and decision-making method, device, and storage medium for visually recognizing real objects that a user points to with a laser or other pointing means, in scenarios such as intelligent control and augmented reality.
Background Technology
[0002] In scenarios such as smart home control, industrial equipment inspection, and AR-assisted operation, visually recognizing the object a user points to is an intuitive interaction method. A common technical approach is as follows: the user points at the target with a laser pointer or a device with a pointer light, a camera captures an image containing the pointer light spot, and the system locates the light spot and analyzes the region around it to identify the target object.
[0003] However, existing technologies face a significant challenge: due to the variable distance between the user and the target object, and the varying physical sizes of the target objects, there is no fixed proportional relationship between the area covered by the indicator spot in the image and the actual outline of the target object. For example, when pointing at a small switch at close range, the spot may almost cover the entire switch; while when pointing at a large television set at a distance, the spot may only cover a small portion of the screen. If a fixed-size recognition window (e.g., a rectangle with a fixed pixel radius centered on the spot) is used, when the object size does not match the window, the recognition bounding box may fail to fully encompass the target features (for large objects) or include too much interfering background (for small objects), thus severely reducing the accuracy and robustness of the recognition.
[0004] Existing solutions, such as the published patent "A Target Detection Method Based on Multi-Scale Sliding Window," employ multi-scale analysis, but the sliding windows are generated mechanically and independently of image content, resulting in high computational overhead. Furthermore, that approach does not use the strong prior information of "attention focus" (i.e., spot position) provided by the user's pointing intent. Another patent, "A Method for Screen Interaction Using a Laser Cursor," utilizes spot position, but its core function is to map spot coordinates to screen coordinates to execute clicks, without identifying the physical object covered by the spot. Therefore, there is currently no efficient and accurate image processing method that adapts to the target scale uncertainty in the specific scenario of "pointing recognition."
Summary of the Invention
[0005] (I) Technical Problem to Be Solved. The technical problem to be solved by this invention is that, in existing visual pointing recognition technology, matching with a fixed recognition window fails when the pointing distance and the target object size change. The invention provides an image cropping and matching method that adaptively generates a suitable recognition area and improves the accuracy and robustness of pointing recognition.
[0006] (II) Technical Solution. To solve the above technical problem, the present invention provides the following technical solution. In a first aspect, the present invention provides an adaptive image cropping and matching method for visual pointing recognition, which is applied to scenarios where a user points to a real-world object using a laser pointer spot.
[0007] Figure 1 is a flowchart illustrating a method according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps: S210: Acquire a target image containing the indicator light spot. An image acquisition device (such as a camera) captures the user's operation scene to obtain an image in which the bright indicator spot formed by the laser emitter is clearly distinguishable from the background.
[0008] S220: Identify the location and extent of the light spot. Using image processing techniques (such as threshold segmentation and contour detection), accurately identify the pixel-level coordinates (usually the center point) and spatial extent (such as equivalent diameter and circumscribed rectangle) of the highlighted area in the image.
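Step S220 can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and a single bright spot, and it uses plain thresholding without separate contour tracing. The function name `locate_spot` and the cutoff value 200 are illustrative assumptions, not taken from the source.

```python
import math

def locate_spot(gray, threshold=200):
    """Return center, equivalent diameter and bounding box of bright pixels.

    gray: list of rows, each a list of 0-255 intensities (assumed layout).
    threshold: hypothetical brightness cutoff; tune for the real sensor.
    """
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no highlighted area found
    area = len(xs)
    center = (sum(xs) / area, sum(ys) / area)
    # Diameter of the circle with the same pixel area as the spot.
    diameter = 2.0 * math.sqrt(area / math.pi)
    bbox = (min(xs), min(ys), max(xs), max(ys))  # inclusive corners
    return {"center": center, "diameter": diameter, "bbox": bbox}
```

A production version would likely isolate the largest connected bright component first, so that reflections elsewhere in the frame do not shift the computed center.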
[0009] S230: Adaptively generate multi-scale candidate cropping boxes. This is the core step of the invention. Based on the spot range information obtained in step S220, the algorithm adaptively calculates and generates multiple candidate cropping boxes of different sizes. The design principle is to generate a set of candidate boxes whose size span is large enough that, regardless of whether the spot is "too large" or "too small" relative to the actual target, at least one candidate box effectively frames the main body of the target object. Specifically, the generated set of candidate boxes includes at least one box smaller than the spot range (for cases where the spot coverage is too large, such as small objects nearby) and at least one box larger than the spot range (for cases where the spot coverage is too small, such as large objects in the distance).
[0010] S240: Extract image sub-regions. Based on the coordinates of each candidate cropping box generated in step S230, crop the corresponding image sub-regions from the original target image.
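The cropping in step S240 can be sketched as follows, assuming a row-major image and boxes given as (left, top, right, bottom) with exclusive right and bottom edges. The source does not say what happens when a box extends past the image border; clamping to the image bounds is an assumption of this sketch, and the names `crop_region` and `crop_all` are hypothetical.

```python
def crop_region(image, box):
    """Crop one sub-region from a row-major image (list of rows).

    box: (left, top, right, bottom) in pixels, right/bottom exclusive.
    Clamping to the image bounds is an assumption; the source does not
    say how boxes that cross the border are handled.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def crop_all(image, boxes):
    """Apply crop_region to every candidate box, preserving order."""
    return [crop_region(image, box) for box in boxes]
```

Clamping changes the aspect ratio of boxes near the border; padding the crop back to a square would be an alternative if the feature model expects square input.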
[0011] S250: Parallel Feature Extraction and Similarity Calculation. All image sub-regions obtained in step S240 are input into a lightweight neural network feature extraction model (e.g., MobileNet, ShuffleNet), deployed on a local device. The model computes a high-dimensional feature vector for each sub-region in parallel. Subsequently, these feature vectors are compared with feature vectors in a pre-defined device feature library for similarity calculation (e.g., calculating cosine similarity).
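The similarity calculation in step S250 can be sketched as below. The neural network itself is omitted: the feature vectors are assumed to have been produced already by whatever lightweight model is deployed. The function names and the dictionary layout of the feature library are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def score_regions(region_features, library):
    """Score every sub-region against every library entry.

    region_features: one feature vector per cropped sub-region.
    library: dict mapping a device name to its reference vector.
    Returns a list (one dict per region) of {device_name: similarity}.
    """
    return [
        {name: cosine_similarity(vec, ref) for name, ref in library.items()}
        for vec in region_features
    ]
```

On real hardware the per-region work would be batched on the accelerator; the loop here only shows which pairs are compared.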
[0012] S260: Decision Fusion and Output of the Recognition Result. This step fuses the results of the parallel matching. For each candidate object in the feature library, it finds the highest similarity score that the object obtained across all image sub-regions. It then compares these per-object highest scores and takes the object with the global highest score as the target that the laser spot points to. In addition, a confidence threshold can be set; if the global highest score is lower than this threshold, the recognition is considered a failure.
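The fusion rule of step S260 amounts to a maximum over boxes followed by a maximum over objects. A minimal sketch, assuming the scores arrive as one dictionary per candidate box, is shown below. The function name `fuse_decisions` is hypothetical, and the default threshold of 0.8 simply mirrors the value used in Example 1.

```python
def fuse_decisions(scores_per_region, threshold=0.8):
    """Pick the device with the globally highest per-device best score.

    scores_per_region: list of {device_name: similarity}, one per box.
    threshold: confidence cutoff; 0.8 here is just an example value.
    Returns (device_name, score), or (None, score) when below threshold.
    """
    best = {}
    for scores in scores_per_region:
        for name, score in scores.items():
            if name not in best or score > best[name]:
                best[name] = score
    if not best:
        return None, 0.0
    winner = max(best, key=best.get)
    if best[winner] < threshold:
        return None, best[winner]  # recognition failed
    return winner, best[winner]
```

Returning the score alongside the name lets the caller log near-misses when recognition fails.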
[0013] (III) Beneficial Effects. Compared with the prior art, the present invention has the following beneficial effects. Improved recognition robustness: by adaptively generating a set of multi-scale candidate regions around the light spot, the system copes with changes in the size and distance of the target object and avoids the mismatch of a single fixed window, thereby improving the first-shot recognition success rate in diverse real-world application scenarios.
[0014] Computational efficiency: compared with a traditional full-image multi-scale sliding window search, this invention limits the search range to a local area centered on the spot and determines a small number of key search scales from the spot size, which greatly reduces the amount of image data to be processed and the computation required. This allows the algorithm to run in real time on embedded devices with limited computing power (such as smart remote controls and AR glasses).
[0015] Alignment with the pointing interaction logic: this invention is not a general object detection algorithm, but a preprocessing and decision-making module designed specifically for the "point first, then recognize" interaction paradigm. It makes full use of the attention prior (spot position) provided by the user's pointing, so that the recognition process has both a clear focus and tolerance for scale uncertainty, balancing accuracy and robustness.
Description of the Drawings
[0016] Figure 1 is a flowchart of an adaptive image cropping and matching method provided in an embodiment of the present invention.
[0017] Figure 2 is a schematic diagram of "adaptive generation of multi-scale candidate cropping boxes" in one embodiment of the present invention.
[0018] Figure 3 is a block diagram illustrating the working principle of "parallel feature extraction and decision fusion" in one embodiment of the present invention.
[0019] Figure 4 is a hardware structure block diagram of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0021] Example 1: With reference to Figures 1-4, this embodiment describes the application of the method in a smart remote control.
[0022] The electronic device is a smart remote control. As shown in Figure 4, its hardware includes a main control processor, memory, a camera coaxially integrated with the laser, and a Bluetooth / Wi-Fi communication module.
[0023] When a user points the remote control at a smart light fixture in the living room, the workflow is as follows: S210: The camera captures an image containing a 650 nm red laser spot projected onto the surface of the lamp.
[0024] S220: The image processing algorithm identifies the center coordinates of the light spot as (320, 240) and calculates its equivalent pixel diameter D = 20 pixels.
[0025] S230: Based on the preset parameters α = 0.7, β = 1.5, γ = 3.0, generate three candidate cropping boxes:
- Box 1 (small): side length L1 = 0.7 * 20 = 14 pixels
- Box 2 (medium): side length L2 = 1.5 * 20 = 30 pixels
- Box 3 (large): side length L3 = 3.0 * 20 = 60 pixels
All three boxes are centered at (320, 240). As shown in Figure 2, the small box may contain only part of the lamp's texture, the medium box may better frame the entire lampshade, and the large box includes the lampshade and part of the ceiling background.
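The arithmetic of this step can be reproduced with a short sketch. The coefficients, center, and diameter are the values given in this example; the function name `generate_candidate_boxes`, the (left, top, right, bottom) box layout, and the rounding to whole pixels are assumptions made for illustration.

```python
def generate_candidate_boxes(center, diameter, coefficients=(0.7, 1.5, 3.0)):
    """Return square boxes (left, top, right, bottom) centered on the spot.

    center: (x, y) spot center in pixels.
    diameter: equivalent spot diameter D in pixels.
    coefficients: scale factors (alpha, beta, gamma) from Example 1.
    Rounding to whole pixels is an assumption of this sketch.
    """
    cx, cy = center
    boxes = []
    for k in coefficients:
        side = k * diameter
        half = side / 2.0
        boxes.append((
            round(cx - half), round(cy - half),
            round(cx + half), round(cy + half),
        ))
    return boxes


boxes = generate_candidate_boxes((320, 240), 20)
# boxes[0] is the 14-pixel box, boxes[1] the 30-pixel box,
# boxes[2] the 60-pixel box, all centered at (320, 240).
```

Because every side length is a multiple of the measured diameter D, the whole set of boxes scales automatically when the spot appears larger or smaller in the image.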
[0026] S240: Crop these three image sub-regions from the original image.
[0027] S250: The NPU processes three sub-regions in parallel, extracts feature vectors, and calculates similarity with locally stored features such as "ceiling light", "table lamp", and "air conditioner". Assume the matching scores are as follows: Box 1 - Ceiling light (0.65), Box 2 - Ceiling light (0.92), Box 3 - Ceiling light (0.88).
[0028] S260: The decision fusion module finds that the highest score for "ceiling light" among the three boxes is 0.92 (from Box 2), which is higher than the highest score of any other object and also higher than the threshold of 0.8. Therefore, the final identification result is "ceiling light". The system can then control the light.
[0029] Example 2: Stability optimization.
[0030] Before step S210, a preview frame stability detection step can be added. After detecting a pointing motion, the remote control first enters a low-power preview mode and analyzes the movement distance of the light spot center over five consecutive frames. The high-resolution capture (S210) is triggered only when the movement distance is within 3 pixels, which reduces the impact of hand shake on image quality and further improves the accuracy of subsequent steps.
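The stability gate can be sketched as a small sliding-window check. The window of five frames and the 3-pixel limit come from this example; the class name `SpotStabilityDetector` is hypothetical, and measuring "movement distance" as the largest distance between any two centers in the window is an assumption, since the text does not define it.

```python
import math
from collections import deque


class SpotStabilityDetector:
    """Track recent spot centers and report when the spot is steady.

    window: number of consecutive preview frames (5 in Example 2).
    max_shift: allowed movement in pixels (3 in Example 2).
    Measuring movement as the largest distance between any two centers
    in the window is an assumption; the source does not define it.
    """

    def __init__(self, window=5, max_shift=3.0):
        self.max_shift = max_shift
        self.centers = deque(maxlen=window)

    def update(self, center):
        """Add one preview-frame center; return True if capture may fire."""
        self.centers.append(center)
        if len(self.centers) < self.centers.maxlen:
            return False  # not enough frames yet
        pts = list(self.centers)
        spread = max(
            (math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]),
            default=0.0,
        )
        return spread <= self.max_shift
```

Each preview frame's spot center is passed to `update`, and the high-resolution capture is started the first time it returns `True`.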
[0031] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. An adaptive image cropping and matching method for visual pointing recognition, characterized in that the method is applicable to scenarios in which a real-world object pointed to by a user via a laser pointer spot is identified, and includes:
acquiring a target image captured by an image acquisition device, the target image containing a highlighted area formed by the laser pointer spot;
identifying the pixel-level position and range of the highlighted area in the target image;
based on the pixel range of the highlighted area, adaptively calculating and generating multiple candidate cropping boxes of different sizes, wherein at least one candidate cropping box is smaller than the range of the highlighted area and at least one candidate cropping box is larger than the range of the highlighted area, so as to jointly cover the uncertainty in the coverage of the highlighted area on the target object caused by changes in pointing distance and / or the physical size of the target object itself;
extracting the image sub-regions defined by each of the candidate cropping boxes from the target image;
extracting features of each image sub-region in parallel using a neural network model, and calculating their similarity to preset features in a device feature library; and
performing decision fusion based on the similarity calculation results of each image sub-region, to output the final recognition result of the real-world object pointed to by the laser pointer spot.
2. The method according to claim 1, characterized in that "adaptively calculating and generating multiple candidate cropping boxes of different sizes" specifically includes: calculating the equivalent diameter D or the side length of the circumscribed rectangle of the highlighted area; using the center of the highlighted area as a reference, generating a first candidate box with side length L1, where L1 = α * D, 0 < α < 1; and generating a second candidate box with side length L2, where L2 = β * D, β ≥ 1; where α and β are preset coefficients.
3. The method according to claim 2, characterized in that a third candidate box with side length L3 is also generated, where L3 = γ * D, γ > β.
4. The method according to claim 1, characterized in that "performing decision fusion based on the similarity calculation results of each image sub-region" includes: for each candidate device in the device feature library, obtaining its highest similarity score across the image sub-region matches; comparing the highest similarity scores of the candidate devices and determining the candidate device with the highest score as the final recognition result; and, if the highest score is lower than a preset threshold, deeming the recognition to have failed.
5. The method according to claim 1, characterized in that, before acquiring the target image, the method further includes: detecting the stability of the highlighted area in consecutive preview frames; and triggering the acquisition of the target image for cropping and matching only when the positional jitter of the highlighted area is detected to be lower than a set threshold across multiple consecutive frames.
6. An electronic device, characterized in that it includes: an image sensor for acquiring images; a laser emitter, mounted coaxially or paraxially with the image sensor, for emitting a laser that forms an indicator spot; and a processor and a memory, the memory storing a computer program, the processor being configured to execute the program to implement the method as described in any one of claims 1 to 5.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1 to 5.