Image recognition method for a robot for home environment

By combining RGB-D camera registration and preprocessing technology with an improved neural network structure, the instability and high computational load of robot recognition algorithms in home environments are solved, the detection accuracy of small targets is improved, and real-time recognition and high reliability are achieved on embedded devices.

CN121661622B · Active · Publication Date: 2026-06-23 · HOHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HOHAI UNIV
Filing Date: 2025-11-25
Publication Date: 2026-06-23

IPC: G06V20/60; G06V10/44; G06V10/764; G06V10/82; G06V10/80; G06N3/0464; G06N3/045; G06N3/048; G06T7/33; G06V10/30; G06T5/70

AI Tagging

Application Domain

Image enhancement; Image analysis

Technology Topics

Pattern recognition; Imaging processing

Technical Efficacy Phrases

Increase contrast; Suppress lighting changes

AI Technical Summary

Technical Problem

The recognition algorithms of robots used in home environments are affected by changes in lighting, occlusion, and reflection, resulting in unstable recognition results. They also have a large computational load and low accuracy in detecting small targets, which affects their reliability in health monitoring and autonomous task execution.

Method used

By employing RGB-D camera registration, preprocessing techniques such as wavelet transform, color correction, and geometric correction, combined with an improved neural network structure including residual connections, decomposed convolution kernels, and high-resolution detection branches, and using depthwise separable convolution and channel attention modules, the image recognition process is optimized.

Benefits of technology

It improves the stability and accuracy of recognition, reduces the amount of computation, is suitable for real-time operation on embedded devices, reduces the false negative rate, and meets the practical application needs in home environments.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The application provides an image recognition method for a robot used in a home environment, belonging to the technical field of image processing. The method specifically includes: S1: the robot's RGB-D camera captures images of the home environment, and the RGB-D camera images are registered; S2: the image registered in S1 is preprocessed to obtain an improved image H(x, y); S3: the format of the improved image H(x, y) is converted to obtain a normalized image; S4: the neural network is improved; S5: the normalized image is input into the improved neural network, which identifies the object. The application enables high-precision identification and positioning of targets such as people, medicines, furniture, and obstacles in complex home environments, and improves the robot's environmental perception and autonomous decision-making in indoor scenes.

Description

Technical Field

[0001] This invention belongs to the field of image recognition technology, and specifically relates to an image recognition method for robots used in home environments.

Background Technology

[0002] With the development of technology, artificial intelligence technology has achieved significant applications in the home environment.

[0003] The current recognition algorithms for home environment robots have the following problems:

[0004] 1. The recognition algorithm relies on high-quality image input, but images are affected by changes in lighting, occlusion, and reflection, leading to unstable recognition results;

[0005] 2. Existing convolutional neural networks have complex structures and require a large amount of computation, making them unsuitable for real-time operation on embedded devices such as Raspberry Pi.

[0006] 3. The detection accuracy for small targets (such as medicine boxes and thermometers) is low, and such targets are easily missed or falsely detected.

[0007] These issues result in inaccurate visual perception, high response latency, and high power consumption in robots within the home environment, affecting their reliability in health monitoring and autonomous task execution.

Summary of the Invention

[0008] This invention proposes an image recognition method for robots used in home environments to solve the above-mentioned problems.

[0009] To achieve the above objectives, the present invention proposes the following technical content:

[0010] An image recognition method for robots used in home environments includes the following steps:

[0011] S1: The robot's RGB-D camera captures images of the home environment, and the images captured by the RGB-D camera are registered;

[0012] S2: Preprocess the registered image from S1 to obtain the improved image H(x, y);

[0013] S3: Convert the format of the improved image H(x, y) to obtain a normalized image;

[0014] S4: Neural Network Improvement; specifically including the following steps:

[0015] S4.1: Improvement of the residual connection structure of the backbone network;

[0016] S4.2: Improvement of the decomposed convolutional kernel structure of the backbone network;

[0017] S4.3: Improvements to the detection network; specifically including the following steps:

[0018] S4.3.1: Add a high-resolution feature extraction channel to the Neck layer;

[0019] S4.3.2: Lightweighting of convolutional structures;

[0020] S4.3.3: Embed a channel attention module in front of the detection head;

[0021] S5: The normalized image is fed into the improved neural network, which then identifies the object.

[0022] Furthermore, step S2 specifically includes the following steps:

[0023] S2.1: Apply a wavelet transform to the registered image I(x, y);

[0024] S2.2: Apply soft-threshold processing;

[0025] S2.3: Reconstruct the image G(x, y) by inverse wavelet transform;

[0026] S2.4: Adjust the color and brightness of image G(x, y) to obtain image G'(x, y);

[0027] S2.5: Based on the white balance hypothesis, perform color correction on image G'(x, y) to obtain the color-corrected image M(x, y);

[0028] S2.6: Perform geometric correction on image M(x, y) to obtain the corrected image K(x, y);

[0029] S2.7: Apply adaptive local histogram equalization to image K(x, y) to obtain the improved image H(x, y).

[0030] Furthermore, the registration formula in step S1 is:

[0031]

[0032] In the formula, P_Depth represents a three-dimensional point in the depth sensor coordinate system; R is a rotation matrix; t is a translation vector; P_RGB represents a three-dimensional point in the color sensor coordinate system; K is the intrinsic parameter matrix of the camera.

[0033] Furthermore, in step S2.4, the formula for adjusting the brightness of image G(x, y) is:

[0034]

[0035] In the formula, G(x, y) represents the image before color and brightness adjustment; G'(x, y) represents the image obtained after color and brightness adjustment; I_min represents the minimum pixel value in image G(x, y); I_max represents the maximum pixel value in image G(x, y).

[0036] Furthermore, in step S2.5, the formula for color correction is:

[0037]

[0038] In the formula, R(x, y) represents the red channel value in image G'(x, y); R'(x, y) represents the red channel value after color correction; B(x, y) represents the blue channel value in image G'(x, y); B'(x, y) represents the blue channel value after color correction; the remaining terms represent the average brightness of the green channel, of the blue channel, and of the red channel.

[0039] Furthermore, in step S2.6, the formula for geometric correction is:

[0040]

[0041] In the formula, k_1 and k_2 represent the distortion coefficients; the two coordinate pairs represent the coordinates in the image M(x, y) before geometric correction and the corresponding coordinates after geometric correction; r represents the distance from a pixel of the uncorrected image M(x, y) to the center of the image.

[0042] Furthermore, in step S2.7, the formula for adaptive local histogram equalization is:

[0043]

[0044] In the formula, CDF_local represents the cumulative distribution function of the local histogram; I(x, y) represents the grayscale value of image K(x, y); I'(x, y) represents the grayscale value of the improved image H(x, y); L represents the number of gray levels.

[0045] Furthermore, step S3 includes the following steps:

[0046] S3.1: The pixels of the improved image H(x, y), which contain both color and depth, are fused into a multi-channel fused input image using the following formula:

[0047]

[0048] In the formula, the terms represent, in turn: the corrected red channel image of H(x, y); the corrected green channel image of H(x, y); the corrected blue channel image of H(x, y); the depth information weight; the normalized depth information; and the multi-channel fused input image;

[0049] S3.2: Normalize the multi-channel fused input image using the following formula:

[0050]

[0051] In the formula, the terms represent, in turn: the mean pixel value of the multi-channel fused input image; the standard deviation of the pixels of the multi-channel fused input image; and the normalized image;

[0052] S3.3: Convert the normalized image into a tensor form that the neural network can accept.

[0053] Furthermore, in steps S4.1 to S4.3, the formula for improving the residual connection structure is as follows:

[0054]

[0055] In the formula, x_1 represents the input feature map of the first layer of the convolutional network; W_1 represents the weight of the first layer of the convolutional network; y_1 represents the output feature map of the first layer of the convolutional network; F represents the activation function;

[0056] The formula for the improved decomposed convolution kernel structure is as follows:

[0057]

[0058] In the formula, X represents the input feature map of the convolution; the two kernel terms represent the vertical convolution kernel and the horizontal convolution kernel; Y represents the output feature map of the convolution;

[0059] The improvements to the detection network include:

[0060] Add a high-resolution feature extraction channel to the Neck layer:

[0061]

[0062] In the formula, the terms represent, in turn: the high-level feature map; the mid-level feature map; the low-level feature map; and the fusion weight;

[0063] Lightweighting of convolutional structures:

[0064]

[0065] In the formula, the two kernel terms represent the channel-wise (depthwise) convolution kernel and the 1×1 convolution kernel used for channel fusion; X_depth represents the input feature map of the depthwise separable convolution; Y_depth represents the output feature map of the depthwise separable convolution;

[0066] Embedded attention module:

[0067]

[0068] In the formula, GAP(*) represents the global average pooling operation; ReLU(*) represents the rectified linear unit; the formula also uses the Sigmoid function and a channel weighting operation; F represents the feature map input to the attention module; F' represents the output feature map after weighting by the channel attention module; W_1 and W_2 represent the weight matrices of the fully connected layers; s represents the attention weight.

[0069] The beneficial effects that can be achieved by adopting the above technologies are:

[0070] 1. By employing preprocessing techniques such as wavelet denoising, color and geometric correction, and adaptive histogram equalization, the system effectively suppresses illumination variations and noise interference, resulting in clearer and higher-contrast input images, thereby improving the stability and accuracy of recognition.

[0071] 2. An improved ResNet network is adopted, which introduces residual connections and convolution decomposition structure to reduce the amount of computation and enhance the feature transfer capability, making it suitable for real-time operation on embedded devices such as Raspberry Pi.

[0072] 3. By adding a high-resolution detection branch and using depthwise separable convolution in the YOLOv8 network, the model becomes more sensitive to the recognition of small targets, significantly reducing the false negative rate and meeting the practical application needs in home environments.

Attached Figure Description

[0073] Figure 1 is a flowchart of the method;

[0074] Figure 2 is a comparison chart of test accuracy under four experimental configurations: the baseline scheme, the preprocessing-only scheme, the network-only scheme, and the complete patented scheme;

[0075] Figure 3 shows the accuracy trends over training iterations for the baseline scheme, the preprocessing-only scheme, the network-only scheme, and the complete patented scheme;

[0076] Figure 4 shows a home scene image processed with the baseline scheme and a schematic diagram of its recognition results;

[0077] Figure 5 shows a home scene image processed with this solution and its recognition results;

[0078] Figure 6 shows unprocessed home images;

[0079] Figure 7 shows preprocessed home images.

Detailed Implementation

[0080] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0081] As shown in Figure 1, an image recognition method for robots used in home environments includes the following steps:

[0082] S1: The robot's RGB-D camera captures images of the home environment, and the images captured by the RGB-D camera are registered;

[0083] Specifically, because an RGB-D camera consists of two sensors—color (RGB) and depth (Depth)—and due to slight misalignments in their mounting positions, the pixel coordinates of the same object do not match on the color and depth maps. To ensure that the color and depth information of each pixel corresponds, an extrinsic parameter transformation is needed to align the depth and color maps. The formula is:

[0084]

[0085] In the formula, P_Depth represents a three-dimensional point in the depth sensor coordinate system; R is the rotation matrix describing the rotation between the depth sensor and the color sensor; K is the intrinsic parameter matrix of the RGB-D camera; t is the translation vector describing the translation between the depth sensor and the color sensor; P_RGB represents a three-dimensional point in the color sensor coordinate system. After this transformation, each pixel has both a color value and a corresponding depth value.
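
The registration formula itself is not reproduced in this text. As a minimal sketch, assuming the common pinhole form in which a depth-frame point is moved into the color frame by P_RGB = R·P_Depth + t and then projected to a pixel with K, the mapping could look like the following. The function names and the sample calibration values are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def depth_point_to_rgb_pixel(p_depth, R, t, K):
    """Map a 3D point from the depth frame to a pixel in the color image.

    Assumed model (not confirmed by the source):
        P_RGB = R * P_Depth + t, then (u, v, 1)^T ~ K * P_RGB.
    """
    rotated = mat_vec(R, p_depth)
    p_rgb = [rotated[i] + t[i] for i in range(3)]
    u_h, v_h, w = mat_vec(K, p_rgb)
    if w <= 0:
        raise ValueError("point is behind the color camera")
    return p_rgb, (u_h / w, v_h / w)


# Hypothetical calibration: no rotation, 25 mm offset along x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.025, 0.0, 0.0]
K = [[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]]

p_rgb, pixel = depth_point_to_rgb_pixel([0.0, 0.0, 1.0], R, t, K)
```

With these assumed values, a point one meter straight ahead of the depth sensor lands slightly to the right of the color image center, which is the pixel whose color should be paired with that depth reading.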

[0086] S2: Preprocess the registered image. This includes the following steps:

[0087] S2.1: Wavelet transform of the image. The formula is:

[0088]

[0089] In equation (1), I(x, y) represents the original image, i.e., the image registered in step S1. The remaining terms represent the low-frequency approximation component; the wavelet coefficients after the transform; the detail coefficients in the horizontal, vertical, and diagonal directions; and the wavelet function. m represents the scale of the wavelet transform; n represents the spatial displacement of the wavelet basis function; H, V, and D represent the horizontal, vertical, and diagonal directions, respectively; i represents the direction category of the detail coefficients after the wavelet transform.

[0090] S2.2: Soft thresholding. The formula is:

[0091]

[0092] In equation (2), the terms represent the detail coefficients after soft thresholding and the detail coefficients in the horizontal, vertical, and diagonal directions produced by the wavelet transform; T is the set threshold;

[0093] S2.3: Reconstruct the image by inverse wavelet transform, and denote the reconstructed image as G(x, y).
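
Steps S2.1 to S2.3 can be illustrated with a small sketch. The source does not name the wavelet basis, so the one-level Haar transform on a 1D signal used here is an assumption chosen for brevity; the soft-threshold rule itself is the standard one, which shrinks each detail coefficient toward zero by T and zeroes anything with magnitude at most T. All function names are hypothetical.

```python
import math


def soft_threshold(w, T):
    """Shrink coefficient w toward zero by T; values with |w| <= T become 0."""
    return math.copysign(max(abs(w) - T, 0.0), w)


def haar_forward(signal):
    """One-level Haar transform of an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]   # low-frequency part
    detail = [(a - b) * s for a, b in pairs]   # high-frequency part
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def denoise(signal, T):
    """Transform, soft-threshold the detail coefficients, reconstruct."""
    approx, detail = haar_forward(signal)
    detail = [soft_threshold(d, T) for d in detail]
    return haar_inverse(approx, detail)


noisy = [10.0, 10.2, 10.1, 9.9, 20.0, 0.0]
smoothed = denoise(noisy, T=0.5)
```

In this example the small fluctuations in the first four samples are flattened, while the large jump between the last two samples survives with only a slight reduction. A 2D image version would apply the same idea to the horizontal, vertical, and diagonal detail sub-bands.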

[0094] S2.4: Adjust the color and brightness of image G(x, y) to obtain the brightness-adjusted image G'(x, y). The formula is:

[0095]

[0096] In equation (3), G(x, y) represents the image that has been reconstructed by inverse wavelet transform but has not yet undergone color and brightness adjustment; G'(x, y) represents the image obtained after color and brightness adjustment; I_min represents the minimum pixel value in image G(x, y); I_max represents the maximum pixel value in image G(x, y).
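
Equation (3) is not reproduced in this text. Since it is defined in terms of the minimum and maximum pixel values, a plausible reading is a linear min-max stretch; the sketch below assumes the form G' = (G - I_min) / (I_max - I_min) × 255, where the output range of 255 and the function name are assumptions.

```python
def stretch_contrast(image, out_max=255.0):
    """Linearly rescale pixel values so that I_min -> 0 and I_max -> out_max."""
    flat = [p for row in image for p in row]
    i_min, i_max = min(flat), max(flat)
    if i_max == i_min:            # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in image]
    scale = out_max / (i_max - i_min)
    return [[(p - i_min) * scale for p in row] for row in image]


# A dim, low-contrast patch using only the range 50..200.
G = [[50, 100], [150, 200]]
G_adj = stretch_contrast(G)
```

After the stretch, the darkest pixel maps to 0 and the brightest to 255, so the patch uses the full output range.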

[0097] S2.5: Based on the white balance hypothesis, perform color correction on the brightness-adjusted image G'(x, y) to obtain the color-corrected image M(x, y). The formula is:

[0098]

[0099] In equation (4), R(x, y) represents the red channel value in image G'(x, y); R'(x, y) represents the red channel value after color correction; B(x, y) represents the blue channel value in image G'(x, y); B'(x, y) represents the blue channel value after color correction; the remaining terms represent the average brightness of the green channel, of the blue channel, and of the red channel.
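
Equation (4) is not reproduced in this text. Its definitions (corrected red and blue values, plus the mean brightness of each channel) match the usual gray-world white balance with green as the reference, so the sketch below assumes R' = R × (mean G / mean R) and B' = B × (mean G / mean B), with green left unchanged. The function name is hypothetical.

```python
def gray_world_balance(pixels):
    """Scale red and blue so their channel means match the green mean.

    pixels: list of (R, G, B) tuples; green is left unchanged.
    Assumes the red and blue channel means are nonzero.
    """
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n
    gain_r = mean_g / mean_r
    gain_b = mean_g / mean_b
    return [(r * gain_r, g, b * gain_b) for r, g, b in pixels]


# A reddish cast: red is on average twice as bright as green.
pixels = [(200, 100, 50), (100, 50, 25)]
balanced = gray_world_balance(pixels)
```

In this example every pixel becomes neutral gray after correction, because the cast was a pure per-channel gain.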

[0100] S2.6: Due to camera lens distortion and changes in viewing angle, image edges often appear curved or tilted. Using the distortion parameters obtained during the calibration phase, perform geometric correction on the color-corrected image M(x, y), and denote the corrected image as K(x, y). The formula is:

[0101]

[0102] In equation (5), k_1 and k_2 represent the distortion coefficients; the two coordinate pairs represent the coordinates in the image M(x, y) before geometric correction and the coordinates after geometric correction.
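
Equation (5) is not reproduced in this text. With two coefficients k_1 and k_2 and a radius measured from the image center, it is consistent with the usual two-term radial model, so the sketch below assumes x' = x_c + (x - x_c)(1 + k_1 r² + k_2 r⁴) and likewise for y. Whether the patent applies the factor in this direction, and in which coordinate units, is not stated; the function name and coefficients are hypothetical.

```python
def radial_correct(x, y, cx, cy, k1, k2):
    """Apply a two-coefficient radial model about the image center (cx, cy)."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy            # r^2, squared distance to the center
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return cx + dx * factor, cy + dy * factor


# Hypothetical coefficients, with coordinates normalized to roughly [-1, 1].
center_unchanged = radial_correct(0.0, 0.0, 0.0, 0.0, k1=-0.1, k2=0.01)
edge = radial_correct(1.0, 0.0, 0.0, 0.0, k1=-0.1, k2=0.01)
```

The image center is a fixed point of the model, while a point at unit radius is pulled inward by the factor 0.91 for these coefficients.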

[0103] S2.7: Apply adaptive local histogram equalization to image K(x, y) to improve contrast and detail in local areas, yielding the improved image H(x, y). The formula is:

[0104]

[0105] In equation (6), CDF_local represents the cumulative distribution function of the local histogram; I(x, y) represents the grayscale value of image K(x, y); I'(x, y) represents the grayscale value of the improved image H(x, y); L represents the number of gray levels.
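
Equation (6) is not reproduced in this text. Given its definitions, a plausible form is I'(x, y) = round(CDF_local(I(x, y)) × (L - 1)). The sketch below applies that assumed mapping to a single local tile; a full adaptive implementation would repeat it per tile and blend between neighboring tiles, which is omitted here. The function name is hypothetical.

```python
def equalize_tile(tile, levels=256):
    """Histogram-equalize one local tile of integer gray values in [0, levels)."""
    flat = [p for row in tile for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution function of the tile's histogram.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / len(flat))
    return [[round(cdf[p] * (levels - 1)) for p in row] for row in tile]


# A low-contrast tile using only gray levels 100..103.
tile = [[100, 101], [102, 103]]
equalized = equalize_tile(tile)
```

The four nearly identical gray levels are spread across the whole range, which is the local contrast gain this step is meant to provide.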

[0106] S3: Convert the format of the improved image H(x, y) into a form that the neural network can accept. This includes the following steps:

[0109] S3.1: After the registration in S1, each pixel of the improved image H(x, y) contains both color and depth, so these pixels are fused into a multi-channel fused input image. The formula is:

[0110]

[0111] In the formula, the terms represent, in turn: the red channel image of H(x, y) corrected using equation (4); the green channel image of H(x, y) corrected using equation (4); the blue channel image of H(x, y) corrected using equation (4); the depth information weight; the normalized depth information; and the multi-channel fused input image.
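
The fusion formula is not reproduced in this text. One reading consistent with its definitions is a four-channel stack in which the depth channel is normalized and scaled by the depth weight. The sketch below assumes that form; the weight of 0.5, the 4 m normalization range, and the function name are all hypothetical.

```python
def fuse_rgbd(rgb, depth, depth_weight=0.5, max_depth=4.0):
    """Stack color and weighted, normalized depth into 4-channel pixels.

    rgb:   rows of (R, G, B) tuples
    depth: rows of depth values in meters, aligned with rgb
    """
    fused = []
    for rgb_row, depth_row in zip(rgb, depth):
        row = []
        for (r, g, b), d in zip(rgb_row, depth_row):
            d_norm = min(max(d / max_depth, 0.0), 1.0)   # clamp to [0, 1]
            row.append((r, g, b, depth_weight * d_norm))
        fused.append(row)
    return fused


rgb = [[(255, 0, 0), (0, 255, 0)]]
depth = [[1.0, 8.0]]
fused = fuse_rgbd(rgb, depth)
```

The second pixel lies beyond the assumed range, so its normalized depth saturates at 1 before the weight is applied.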

[0112] S3.2: Normalize the multi-channel fused input image. The formula is:

[0113]

[0114] In the formula, the terms represent, in turn: the mean pixel value of the multi-channel fused input image; the standard deviation of the pixels of the multi-channel fused input image; and the normalized image.
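
The normalization formula is not reproduced in this text, but a mean and a standard deviation are exactly the ingredients of standard score normalization, (value - mean) / standard deviation. The sketch below assumes that form with the population standard deviation; whether the statistics are taken per channel or over the whole fused image is not stated, and the function name is hypothetical.

```python
import math


def normalize(values):
    """Zero-mean, unit-variance normalization of a flat list of pixel values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:                # constant input: avoid division by zero
        return [0.0] * n
    return [(v - mean) / std for v in values]


normalized = normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For this input the mean is 5 and the standard deviation is 2, so the result is centered on zero with unit spread.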

[0115] S3.3: Convert the normalized image into a tensor form that the neural network can accept.

[0116] S4: Neural Network Improvement. This includes the following steps:

[0117] S4.1: Improved residual connection structure of the backbone network (ResNet).

[0118] Specifically, the output definition of each layer in a traditional convolutional network is changed from its original form to a modified form with a residual connection. In these two formulas, x_1 represents the input feature map of the first layer of the convolutional network; W_1 represents the weight of the first layer of the convolutional network; y_1 represents the output feature map of the first layer of the convolutional network; F represents the activation function;
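
The two formulas are not reproduced in this text. Assuming the standard ResNet identity shortcut, and reusing the symbols defined above, the change from a plain layer to a residual layer would read:

```latex
\begin{aligned}
\text{plain layer:}    \quad & y_1 = F(x_1, W_1) \\
\text{residual layer:} \quad & y_1 = F(x_1, W_1) + x_1
\end{aligned}
```

Under this assumed form the layer only has to learn the difference between its output and its input, and the added x_1 term gives gradients a direct path back to earlier layers.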

[0119] S4.2: Improved decomposition of convolutional kernel structure in the backbone network (ResNet).

[0120] Specifically, traditional convolutional layers use 3×3 convolutional kernels. This type of convolution has a large number of operational parameters and high computational complexity, making it unsuitable for real-time operation on low-power devices like Raspberry Pi and for robots in home environments. Therefore, the 3×3 convolutional kernel is decomposed into a combination of consecutive 3×1 and 1×3 smaller convolutional kernels, as shown in the following formula:

[0121]

[0122] In the formula, X represents the input feature map of the convolution; the two kernel terms represent the vertical convolution kernel and the horizontal convolution kernel; Y represents the output feature map of the convolution.
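
The decomposition can be checked numerically. The sketch below is an illustration under stated assumptions, not the patented layer: it uses a fixed, hand-picked pair of kernels and a hypothetical helper `conv2d_valid`, and shows that applying a 3×1 kernel followed by a 1×3 kernel gives the same result as the single 3×3 kernel formed by their outer product, while using 6 weights instead of 9.

```python
def conv2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


vertical = [[1], [2], [1]]        # 3x1 kernel
horizontal = [[1, 0, -1]]         # 1x3 kernel
# Equivalent 3x3 kernel: outer product of the two small kernels.
full = [[v[0] * h for h in horizontal[0]] for v in vertical]

image = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]

two_step = conv2d_valid(conv2d_valid(image, vertical), horizontal)
one_step = conv2d_valid(image, full)

weights_full = 3 * 3              # 9 weights per input/output channel pair
weights_factored = 3 * 1 + 1 * 3  # 6 weights for the same receptive field
```

The equality is exact only for kernels that can be written as such an outer product. A learned 3×1 plus 1×3 pair therefore trades some expressiveness for the reduced weight count and computation.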

[0123] S4.3: Improvements to the detection network (YOLOv8). Specifically, this includes the following steps:

[0124] S4.3.1: Added a high-resolution detection branch.

[0125] Specifically, in the traditional YOLOv8 network structure, the backbone is responsible for extracting multi-scale features, the neck layer fuses these features through a feature pyramid structure, and finally, the head performs target recognition and classification. This approach adds a high-resolution feature extraction channel to the neck layer to enhance the model's ability to perceive small targets. The improved feature fusion output can be expressed as:

[0126]

[0127] In the formula, the terms represent, in turn: the high-level feature map; the mid-level feature map; the low-level feature map; and the fusion weight.

[0128] S4.3.2: Lightweighting of convolutional structures.

[0129] Specifically, some standard convolutions in YOLOv8 are replaced with depthwise separable convolutions, as shown in the formula:

[0130]

[0131] In the formula, the two kernel terms represent the channel-wise (depthwise) convolution kernel and the 1×1 convolution kernel used for channel fusion; X_depth represents the input feature map of the depthwise separable convolution; Y_depth represents the output feature map of the depthwise separable convolution.
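
The saving from this replacement can be made concrete by counting weights. The sketch below compares a standard k×k convolution with a depthwise convolution followed by a 1×1 channel-fusion convolution; the layer size (3×3 kernel, 64 input channels, 128 output channels) is a hypothetical example, not a figure from the patent, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size: 3x3 kernel, 64 input and 128 output channels.
std = standard_conv_params(3, 64, 128)
sep = depthwise_separable_params(3, 64, 128)
ratio = sep / std
```

For this layer the separable version needs 8,768 weights instead of 73,728, a ratio of 1/c_out + 1/k², or roughly 12%.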

[0132] S4.3.3: Embedded attention module.

[0133] Adding a channel attention module before the detection head enables the model to automatically focus on key targets (such as humans or drugs) during the detection phase, reducing false detections of the background. The formula is as follows:

[0134]

[0135] In the formula, GAP(*) represents the global average pooling operation; ReLU(*) represents the rectified linear unit; the formula also uses the Sigmoid function and a channel weighting operation; F represents the feature map input to the attention module; F' represents the output feature map after weighting by the channel attention module; W_1 and W_2 represent the weight matrices of the fully connected layers; s represents the attention weight. The improved neural network is trained on a training set and tested on a test set, each containing a large amount of data, so that it can identify objects accurately.
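
The attention formula is not reproduced in this text, but the operations it names (global average pooling, two fully connected layers, ReLU, Sigmoid, channel weighting) match the squeeze-and-excitation pattern. The sketch below assumes s = Sigmoid(W_2 · ReLU(W_1 · GAP(F))) and F' = s ⊗ F; the function name and the tiny untrained weights are hypothetical.

```python
import math


def channel_attention(feature_map, W1, W2):
    """Squeeze-and-excitation style channel attention.

    feature_map: list of C channels, each a list of rows (H x W)
    W1: (C_mid x C) weight matrix, W2: (C x C_mid) weight matrix
    Returns (s, weighted feature map).
    """
    # GAP: average every channel down to a single number.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # First fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]
    # Second fully connected layer followed by Sigmoid.
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in W2]
    # Channel weighting: scale each channel by its attention weight.
    weighted = [[[s_c * v for v in row] for row in ch]
                for s_c, ch in zip(s, feature_map)]
    return s, weighted


# Two 2x2 channels and hypothetical, untrained weights.
F = [[[1.0, 1.0], [1.0, 1.0]],
     [[4.0, 4.0], [4.0, 4.0]]]
W1 = [[-1.0, 1.0]]          # 2 channels -> 1 hidden unit
W2 = [[1.0], [-1.0]]        # 1 hidden unit -> 2 channels
s, F_out = channel_attention(F, W1, W2)
```

With these illustrative weights the first channel is kept almost unchanged and the second is strongly suppressed, which shows how the module can emphasize some channels over others before the detection head.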

[0136] S5: The normalized image from S3 is fed into the neural network improved in S4, which then identifies the objects.

[0137] Calculation example:

[0138] This algorithm selected seven image categories related to the home environment from the home interior dataset for experimentation. These seven image categories are: bathroom, bedroom, nursery, hallway, dining room, living room, and kitchen. Each image category was selected from the original dataset and used to train the model and verify its recognition performance.

[0139] Image preprocessing and network training were performed for the following experimental schemes. The results are shown in Figure 2, where the horizontal axis gives the name of each scheme and the vertical axis gives the training accuracy:

[0140] (1) Baseline approach: ResNet18 is used directly for training without applying the patent preprocessing process.

[0141] (2) Preprocessing only: Based on ResNet18, apply the complete patented preprocessing method.

[0142] (3) Network-only scheme: adopts the complete patented network structure and does not perform preprocessing operations.

[0143] (4) Complete patent scheme: Simultaneously apply patent preprocessing methods and improved network structure.

[0144] As can be seen from Figure 2, the accuracy of the baseline scheme is 77.70%, indicating low overall recognition performance. After introducing the image preprocessing module (preprocessing only), the accuracy improves to 81.29%, demonstrating that appropriate preprocessing can improve the quality of the input image and enhance feature representation. When only the improved network structure is used (network only), the accuracy further improves to 84.17%, indicating that this network structure has significant advantages in feature extraction and classification performance. When image preprocessing and the improved network structure are applied simultaneously (complete patent scheme), the overall accuracy of the system is 82.73%, an improvement of 5.03 percentage points over the baseline scheme, indicating that the combination of the two has a certain synergistic effect in optimizing the model's recognition performance.
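
To read the four reported accuracies side by side, the short calculation below lists each scheme's gain over the baseline in percentage points. It uses only the figures quoted above; the variable names are hypothetical.

```python
accuracy = {
    "baseline": 77.70,
    "preprocessing only": 81.29,
    "network only": 84.17,
    "complete scheme": 82.73,
}

# Gain over the baseline, in percentage points.
gain = {name: round(value - accuracy["baseline"], 2)
        for name, value in accuracy.items()}
best = max(accuracy, key=accuracy.get)
```

The gains are 3.59 points for preprocessing only, 6.47 points for network only, and 5.03 points for the complete scheme, so on these reported test figures the network-only configuration scores highest.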

[0145] Figure 3 shows the accuracy trends of the four schemes at different iterations during training. As can be seen from the figure, the training accuracy of all models gradually increases with the number of iterations, indicating that the network is continuously learning and gradually converging. Among them, the baseline scheme and the preprocessing-only scheme rise relatively slowly, with the final accuracy stabilizing at around 0.7, indicating that preprocessing alone brings limited improvement in feature extraction. In contrast, the network-only scheme and the complete patent scheme both converge faster in the early stages (approximately the first 10 iterations), reaching accuracies of approximately 0.91 and 0.90 respectively in the later stages of training. This demonstrates that the improved network structure can learn image features more effectively and improve the model's fitting ability.

[0146] The results above demonstrate that the algorithm proposed in this invention outperforms traditional benchmark schemes in both training accuracy and convergence speed. Compared to schemes that only use preprocessing or only improve the network structure, the complete scheme shows significant improvements in accuracy and stability, achieving higher performance levels with fewer training epochs. This indicates that the proposed algorithm has better overall performance in indoor scene classification tasks, effectively improving the model's recognition ability and reliability, and possesses high practical value.

[0147] Figure 4 shows a home scene image and its recognition results under the baseline scheme, i.e., the original image recognized with ResNet18. Figure 5 shows a home image and its recognition results under this solution, i.e., the preprocessed image recognized with the improved ResNet18. Comparing Figure 4 and Figure 5, the images from this scheme are superior to those from the baseline scheme in both color recognition and accuracy, and the neural network identifies the objects accurately.

[0148] In another comparison, Figure 6 is the original image without preprocessing, and Figure 7 is the image after preprocessing with this method. Comparing the two, the edges in Figure 7 are sharper and the color contrast is better; therefore, compared with Figure 6, Figure 7 can significantly improve the recognition accuracy of the neural network.

[0149] Based on the above-described preferred embodiments of the present invention, and through the foregoing description, those skilled in the art can make various changes and modifications without departing from the inventive concept. The technical scope of this invention is not limited to the contents of the specification, but must be determined according to the scope of the claims.

Claims

1. An image recognition method for a robot used in a home environment, characterized in that it includes the following steps:
S1: the robot's RGB-D camera captures images of the home environment, and the images captured by the RGB-D camera are registered;
S2: the registered image from S1 is preprocessed to obtain an improved image H(x,y);
S3: the improved image H(x,y) undergoes format conversion to obtain a normalized image [symbol missing];
S4: the neural network is improved, specifically including the following steps:
S4.1: improving the residual connection structure of the backbone network;
the formula for the improved residual connection structure is: [formula missing]; where x_1 represents the input feature map of the first layer of the convolutional network; W_1 represents the weight of the first layer of the convolutional network; y_1 represents the output feature map of the first layer of the convolutional network; and F denotes the activation function;
the formula for the improved decomposed convolution kernel structure is: [formula missing]; where X represents the input feature map of the convolution kernel; [symbol missing] represents the vertical convolution kernel; [symbol missing] represents the horizontal convolution kernel; and Y represents the output feature map of the convolution kernel;
S4.2: improving the decomposed convolution kernel structure of the backbone network;
S4.3: improving the detection network, the improvements to the detection network including:
adding a high-resolution feature extraction channel to the Neck layer: [formula missing]; where [symbol missing] represents the high-level feature map; [symbol missing] represents the mid-level feature map; [symbol missing] represents the low-level feature map; and [symbol missing] represents the fusion weight;
lightweighting the convolutional structure: [formula missing]; where [symbol missing] represents the channel-wise convolution kernel; [symbol missing] represents the 1×1 convolution kernel used for channel fusion; X_depth represents the input feature map of the depthwise separable convolution; and Y_depth represents the output feature map of the depthwise separable convolution;
embedding an attention module: [formula missing]; where GAP(*) denotes the global average pooling operation; [symbol missing] is the Sigmoid function; [symbol missing] denotes the channel weighting operation; F represents the feature map input to the attention module; F' represents the output feature map after weighting by the channel attention module; ReLU(*) denotes the rectified linear unit function; W_1 and W_2 represent the weight matrices of the fully connected layers; and s represents the attention weights;
step S4.3 specifically includes the following steps:
S4.3.1: adding a high-resolution feature extraction channel to the Neck layer;
S4.3.2: lightweighting the convolutional structure;
S4.3.3: embedding a channel attention module in front of the detection head;
S5: the normalized image [symbol missing] is input into the improved neural network, which then identifies the object.
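The channel attention module of claim 1 is defined through GAP, ReLU, Sigmoid, the weight matrices W_1 and W_2, and the attention weights s. The formula itself is not preserved in this text, so the following is only a minimal sketch that assumes the common squeeze-and-excitation form s = Sigmoid(W_2 · ReLU(W_1 · GAP(F))) and F' = s ⊗ F. The function name `channel_attention`, the array layout (C, H, W), and the reduction ratio are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def channel_attention(F, W1, W2):
    """Assumed SE-style channel attention.

    F  : feature map of shape (C, H, W)
    W1 : first fully connected layer, shape (C // r, C)
    W2 : second fully connected layer, shape (C, C // r)
    Returns the reweighted feature map F' and the attention weights s.
    """
    z = F.mean(axis=(1, 2))                       # GAP(F): one value per channel
    hidden = np.maximum(W1 @ z, 0.0)              # ReLU(W1 z)
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))      # Sigmoid, weights in (0, 1)
    return s[:, None, None] * F, s                # channel-wise weighting

# Hypothetical example: 4 channels, reduction ratio r = 2, random weights.
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 8, 8))
W1 = rng.normal(size=(2, 4))
W2 = rng.normal(size=(4, 2))
F_out, s = channel_attention(F, W1, W2)
```

Each channel of the output is the input channel scaled by a single weight between 0 and 1, which is why the module adds little computation in front of the detection head.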

2. The image recognition method for a robot used in a home environment according to claim 1, characterized in that step S2 specifically includes the following steps:
S2.1: applying a wavelet transform to the registered image I(x,y);
S2.2: performing soft threshold processing;
S2.3: reconstructing by inverse wavelet transform to obtain the image G(x,y);
S2.4: adjusting the color and brightness of the image G(x,y) to obtain the image G'(x,y);
S2.5: performing color correction on the image G'(x,y), based on the white balance hypothesis, to obtain the color-corrected image M(x,y);
S2.6: performing geometric correction on the image M(x,y) to obtain the corrected image K(x,y);
S2.7: applying adaptive local histogram equalization to the image K(x,y) to obtain the improved image H(x,y).
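Steps S2.1 to S2.3 (wavelet transform, soft thresholding, inverse transform) can be sketched as follows. The claim names neither the wavelet basis, the number of decomposition levels, nor the threshold, so this sketch assumes a single-level 2-D Haar transform and a user-supplied threshold `t`. All function names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(img):
    """Single-level 2-D Haar transform (image height and width must be even)."""
    a = (img[0::2, :] + img[1::2, :]) / SQRT2      # row-pair sums
    d = (img[0::2, :] - img[1::2, :]) / SQRT2      # row-pair differences
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2         # approximation band
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2         # detail bands
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return img

def soft_threshold(w, t):
    """Shrink coefficients toward zero by t; magnitudes below t become zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def wavelet_denoise(img, t):
    """S2.1 transform, S2.2 threshold the detail bands, S2.3 reconstruct G(x,y)."""
    ll, lh, hl, hh = haar2d(np.asarray(img, dtype=np.float64))
    return ihaar2d(ll, *(soft_threshold(c, t) for c in (lh, hl, hh)))

# Hypothetical example: denoise a small noisy patch.
rng = np.random.default_rng(0)
I = 100.0 + rng.normal(scale=5.0, size=(8, 8))
G = wavelet_denoise(I, t=3.0)
```

Only the detail bands are thresholded here; the approximation band is left untouched so that the overall brightness of the image is preserved.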

3. The image recognition method for a robot used in a home environment according to claim 1, characterized in that the registration formula in step S1 is: [formula missing]; where P_Depth represents a three-dimensional point in the depth sensor coordinate system; R is a rotation matrix; t is a translation vector; P_RGB represents a three-dimensional point in the color sensor coordinate system; and K is the intrinsic parameter matrix of the camera.
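The symbols in claim 3 match the usual rigid-body registration between a depth sensor and a color sensor. As a minimal sketch, and assuming the missing formula has the standard form P_RGB = R · P_Depth + t followed by a pinhole projection with K, the mapping of one depth point could look like this. The function name and the numeric extrinsics and intrinsics are made up for illustration.

```python
import numpy as np

def register_depth_point(p_depth, R, t, K):
    """Map a 3-D point from the depth frame to the color frame and project it.

    Assumed form: P_RGB = R @ P_Depth + t, then pixel = K @ P_RGB / z.
    """
    p_rgb = R @ p_depth + t                  # rigid transform between the sensors
    uvw = K @ p_rgb                          # pinhole projection
    return p_rgb, (uvw[0] / uvw[2], uvw[1] / uvw[2])

# Hypothetical calibration: sensors 25 mm apart, no relative rotation.
R = np.eye(3)
t = np.array([0.025, 0.0, 0.0])
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
p_rgb, (u, v) = register_depth_point(np.array([0.0, 0.0, 1.0]), R, t, K)
print(u, v)  # 332.625 239.5
```

A point one meter in front of the depth sensor lands about 13 pixels to the right of the color image center, which is the offset registration has to compensate for.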

4. The image recognition method for a robot used in a home environment according to claim 2, characterized in that, in step S2.4, the formula for adjusting the brightness of the image G(x,y) is: [formula missing]; where G(x,y) represents the image before color and brightness adjustment; G'(x,y) represents the image obtained after color and brightness adjustment; I_min represents the minimum pixel value in the image G(x,y); and I_max represents the maximum pixel value in the image G(x,y).
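The definitions of I_min and I_max in claim 4 suggest a min-max contrast stretch. The sketch below assumes the form G'(x,y) = (G(x,y) - I_min) / (I_max - I_min) × 255; the output scale of 255 and the function name are assumptions, since the formula is not preserved here.

```python
import numpy as np

def stretch_brightness(g, out_max=255.0):
    """Assumed min-max stretch of G(x,y) to the range [0, out_max]."""
    g = np.asarray(g, dtype=np.float64)
    i_min, i_max = g.min(), g.max()
    if i_max == i_min:                       # flat image: nothing to stretch
        return np.zeros_like(g)
    return (g - i_min) / (i_max - i_min) * out_max

g = np.array([[50, 100], [150, 200]])
g_adj = stretch_brightness(g)
print(g_adj)  # [[  0.  85.] [170. 255.]]
```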

5. The image recognition method for a robot used in a home environment according to claim 2, characterized in that, in step S2.5, the color correction formula is: [formula missing]; where R(x,y) represents the red channel value in the image G'(x,y); R'(x,y) represents the red channel value after color correction; [symbol missing] represents the average brightness of the green channel; [symbol missing] represents the average brightness of the blue channel; [symbol missing] represents the average brightness of the red channel; B(x,y) represents the blue channel value in the image G'(x,y); and B'(x,y) represents the blue channel value after color correction.
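Claim 5 corrects only the red and blue channels using the three channel averages, which is consistent with a gray-world white balance that takes green as the reference. As a sketch under that assumption, R'(x,y) = R(x,y) × (mean G / mean R) and B'(x,y) = B(x,y) × (mean G / mean B). The function name, the RGB channel order, and the clipping range are illustrative choices.

```python
import numpy as np

def gray_world_correct(img, eps=1e-8):
    """Assumed gray-world correction of an (H, W, 3) RGB image, green as reference."""
    img = np.asarray(img, dtype=np.float64)
    r_mean = max(img[..., 0].mean(), eps)    # guard against an all-zero channel
    g_mean = img[..., 1].mean()
    b_mean = max(img[..., 2].mean(), eps)
    out = img.copy()
    out[..., 0] *= g_mean / r_mean           # R' = R * mean(G) / mean(R)
    out[..., 2] *= g_mean / b_mean           # B' = B * mean(G) / mean(B)
    return np.clip(out, 0.0, 255.0)

# Hypothetical image with a strong red cast.
img = np.zeros((2, 2, 3))
img[..., 0], img[..., 1], img[..., 2] = 200.0, 100.0, 50.0
M = gray_world_correct(img)
```

After correction the three channel means coincide, which is the gray-world assumption expressed in pixel terms.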

6. The image recognition method for a robot used in a home environment according to claim 2, characterized in that, in step S2.6, the formula for geometric correction is: [formula missing]; where k_1 and k_2 represent the distortion coefficients; [symbol missing] represents the coordinates in the image M(x,y) before geometric correction; [symbol missing] represents the corresponding coordinates after geometric correction; and r represents the distance from a pixel of the uncorrected image M(x,y) to the center of the image.
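With two coefficients k_1 and k_2 and a radius r measured from the image center, claim 6 reads like the two-term polynomial radial distortion model. The sketch below assumes the form x' = x_c + (x - x_c)(1 + k_1 r² + k_2 r⁴), and likewise for y. The function name, the center argument, and the sample coefficients are hypothetical.

```python
def correct_radial(x, y, cx, cy, k1, k2):
    """Assumed two-term radial correction of one pixel coordinate.

    (cx, cy) is the image center; r is the distance of (x, y) from it.
    """
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy                   # r squared
    scale = 1.0 + k1 * r2 + k2 * r2 * r2     # 1 + k1 r^2 + k2 r^4
    return cx + dx * scale, cy + dy * scale

# Hypothetical pixel at distance r = 5 from the center.
xc, yc = correct_radial(3.0, 4.0, 0.0, 0.0, k1=0.01, k2=0.0)
print(xc, yc)  # 3.75 5.0
```

In practice r is often computed in normalized coordinates so that k_1 and k_2 stay in a convenient numeric range; the claim text does not say which convention is used.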

7. The image recognition method for a robot used in a home environment according to claim 1, characterized in that, in step S2.7, the formula for adaptive local histogram equalization is: [formula missing]; where CDF_local represents the cumulative distribution function of the local histogram; I'(x,y) represents the grayscale value of the image K(x,y); I'(x,y) represents the grayscale value of the improved image H(x,y); and L represents the number of gray levels.
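Claim 7 maps each gray value through the cumulative distribution function of a local histogram and scales by the number of gray levels L. A minimal sketch, assuming the form H(x,y) = round(CDF_local(K(x,y)) × (L - 1)) and, as a simplification, non-overlapping square tiles with no clip limit and no interpolation between tiles. The tile size and function names are assumptions.

```python
import numpy as np

def equalize_tile(tile, levels=256):
    """Equalize one tile with its own histogram: round(CDF_local * (L - 1))."""
    hist = np.bincount(tile.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / tile.size        # local cumulative distribution
    return np.round(cdf[tile] * (levels - 1)).astype(np.uint8)

def local_hist_equalize(img, tile=8, levels=256):
    """Apply equalize_tile independently to each non-overlapping tile."""
    img = np.asarray(img, dtype=np.uint8)
    out = np.empty_like(img)
    h, w = img.shape
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            block = img[y0:y0 + tile, x0:x0 + tile]
            out[y0:y0 + tile, x0:x0 + tile] = equalize_tile(block, levels)
    return out

# Hypothetical low-contrast grayscale image.
rng = np.random.default_rng(0)
K_img = rng.integers(100, 140, size=(16, 16), dtype=np.uint8)
H_img = local_hist_equalize(K_img, tile=8)
```

Because every tile is stretched with its own statistics, contrast improves in dark and bright regions alike, at the cost of visible tile seams that production implementations usually smooth by interpolation.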

8. The image recognition method for a robot used in a home environment according to claim 1, characterized in that step S3 includes the following steps:
S3.1: the pixels of the improved image H(x,y), which contain both color and depth, are fused into a multi-channel fused input image using the following formula: [formula missing]; where [symbol missing] represents the corrected red channel image of H(x,y); [symbol missing] represents the corrected green channel image of H(x,y); [symbol missing] represents the corrected blue channel image of H(x,y); [symbol missing] represents the depth information weight; [symbol missing] represents the normalized depth information; and [symbol missing] represents the multi-channel fused input image;
S3.2: the multi-channel fused input image is normalized using the following formula: [formula missing]; where [symbol missing] represents the average pixel value of the multi-channel fused input image; [symbol missing] represents the standard deviation of the pixels of the multi-channel fused input image; and [symbol missing] represents the normalized image;
S3.3: the normalized image is converted into a tensor form that can be recognized by the neural network.
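Steps S3.1 to S3.3 of claim 8 can be sketched as follows. The formulas are not preserved, so this assumes that fusion stacks the three color channels with a weighted, normalized depth channel, that normalization is the usual (value - mean) / standard deviation over the whole fused image, and that the tensor uses a batch-first, channel-first layout. The depth weight `alpha`, the min-max depth normalization, and all function names are hypothetical.

```python
import numpy as np

def fuse_rgbd(rgb, depth, alpha=0.5):
    """S3.1 (assumed): stack R, G, B and alpha * normalized depth into (H, W, 4)."""
    rgb = np.asarray(rgb, dtype=np.float32)
    depth = np.asarray(depth, dtype=np.float32)
    d_min, d_max = depth.min(), depth.max()
    if d_max > d_min:
        d_norm = (depth - d_min) / (d_max - d_min)   # depth scaled to [0, 1]
    else:
        d_norm = np.zeros_like(depth)
    return np.concatenate([rgb, alpha * d_norm[..., None]], axis=-1)

def normalize(fused, eps=1e-8):
    """S3.2 (assumed): zero mean and unit standard deviation over all pixels."""
    return (fused - fused.mean()) / (fused.std() + eps)

def to_tensor(img):
    """S3.3 (assumed): (H, W, C) -> contiguous float32 array of shape (1, C, H, W)."""
    return np.ascontiguousarray(img.transpose(2, 0, 1)[None], dtype=np.float32)

# Hypothetical 4 x 6 RGB-D frame.
rng = np.random.default_rng(0)
rgb = rng.uniform(0, 255, size=(4, 6, 3))
depth = rng.uniform(0.3, 4.0, size=(4, 6))
x = to_tensor(normalize(fuse_rgbd(rgb, depth)))
print(x.shape)  # (1, 4, 4, 6)
```

The resulting array can be handed to a deep learning framework without further copying, for example through a from-NumPy conversion, if the network expects a four-channel input.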