A smart assistive glasses system based on visual recognition and voice interaction
By employing techniques such as mosaic data augmentation and a region attention mechanism, a deep neural network model is constructed that addresses the insufficient accuracy and robustness of tactile paving obstacle detection in complex environments and achieves efficient, low-power obstacle recognition and obstacle avoidance guidance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIAMUSI UNIVERSITY
- Filing Date
- 2026-03-27
- Publication Date
- 2026-06-30
AI Technical Summary
Existing obstacle detection solutions for tactile paving suffer from missed or false detections due to changes in lighting and target occlusion in complex outdoor environments. Lightweight models lack sufficient semantic expression capabilities for features and have imperfect multi-scale feature fusion, resulting in insufficient detection accuracy and environmental robustness.
A deep neural network model is constructed by employing a mosaic data augmentation strategy, a region attention mechanism, a residual high-efficiency layer aggregation network, and a flash attention optimization strategy, combined with a neck network and a detection head, to perform multi-scale target detection and speech feedback.
It improves the model's detection accuracy under complex lighting and occlusion conditions, reduces computational complexity and power consumption, enhances the ability to identify small targets and edge objects, and provides reliable obstacle avoidance guidance.
Figure CN122308604A_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision, specifically relating to an intelligent assistive glasses system based on visual recognition and voice interaction.
Background Technology
[0002] With the continuous improvement of the social security system and the rapid development of artificial intelligence technology, ensuring the travel safety of visually impaired people has become an important part of smart city construction and social welfare. As a core infrastructure ensuring the independent travel of visually impaired individuals, the completeness and accessibility of tactile paving directly affect the quality of life and safety of this special group. Utilizing computer vision technology to achieve intelligent perception and assisted guidance in complex road environments has become a core research topic in the fields of assisted driving and assistive devices for the blind.
[0003] Among them, deep learning-based target detection technology plays a crucial role in tactile paving environment perception. By performing end-to-end feature extraction and classification regression on real-time images, it can quickly identify tactile paving textures, facility status, and various dynamic obstacles. The YOLO series of algorithms, with their good balance between detection accuracy and inference speed, are widely used in portable visual assistive terminals, aiming to provide visually impaired users with real-time and accurate path navigation and obstacle avoidance warning information.
[0004] Existing obstacle detection solutions for tactile paving still face numerous challenges in practical applications: First, drastic changes in lighting and frequent target occlusion in complex outdoor environments easily lead to missed or false detections, resulting in weak environmental adaptability. Second, existing lightweight models often sacrifice the semantic expressive power of features while reducing computational complexity, making it difficult to accurately model tactile paving textures and obstacle edges in complex backgrounds. Third, traditional neural networks suffer from significant memory bottlenecks and computational redundancy when processing high-resolution inputs, limiting the throughput and increasing power consumption of embedded devices. Finally, the multi-scale feature fusion mechanism is imperfect, lacking efficient coupling between deep semantics and shallow spatial details, resulting in insufficient model representation of distant small targets and obstacles with blurred edges. These problems collectively restrict the robustness and real-time performance of assistive navigation devices. Therefore, developing an intelligent assistive glasses system based on visual recognition and voice interaction that can balance detection accuracy, memory efficiency, and environmental robustness is particularly important.
Summary of the Invention
[0005] The purpose of this invention is to provide an intelligent assistive glasses system based on visual recognition and voice interaction, which can effectively solve the problems in the background art mentioned above.
[0006] To achieve the above objectives, the present invention provides the following technical solution: a smart assistive glasses system based on visual recognition and voice interaction, which operates according to the following steps:
S1 Acquire raw image data of the tactile paving environment: a portable visual acquisition device captures real-time images of the tactile paving area in the direction of travel of the visually impaired person, yielding a multi-resolution color image sequence containing tactile paving lines, texture features, and potential obstacles.
S2 Preprocess and enhance the raw image data: a mosaic data augmentation strategy stitches, scales, and crops a predetermined number of randomly selected original images, and the brightness, contrast, and saturation of the images are adaptively adjusted to improve the model's generalization ability in complex lighting environments.
S3 Construct and train a deep neural network model optimized for tactile paving and obstacle detection: based on a preset generation of a real-time target detection architecture, a region attention mechanism, a residual high-efficiency layer aggregation network, and a flash attention optimization strategy are introduced. Through iterative training on a labeled tactile paving obstacle dataset, a deep learning model that can accurately identify tactile paving boundaries and obstacle categories is obtained.
S4 Extract features in real time and perform multi-scale target detection: the real-time image to be detected is input into the trained deep neural network model, multi-level semantic features are extracted by the backbone network, cross-scale feature fusion is performed by the neck network, and the target category probabilities and bounding box coordinates are output by the detection head.
S5 Output detection results and generate obstacle avoidance assistance commands: based on the obstacle location, type, and confidence information output by the detection head, combined with the extension trend of the tactile paving, the threat level of each obstacle to the visually impaired person is determined and converted into voice or vibration feedback signals for real-time warning.
[0007] Preferably, in S2, the mosaic data augmentation strategy specifically includes: randomly selecting a predetermined number of images in each training batch, scaling them at random ratios, with the scaling ratio set within a preset range, then stitching the scaled images together according to multiple preset quadrants, and synchronously correcting the coordinate values of the target bounding boxes according to the size of the stitched images, thereby increasing the background complexity and the randomness of the target distribution, and significantly improving the model's ability to perceive small obstacles.
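The mosaic step of [0007] can be illustrated with a minimal sketch. It assumes four images per mosaic, a 2 x 2 quadrant layout, and the 0.5x to 1.5x scaling range given later in Example 1; images are held as nested lists so that the sketch needs only the standard library. The names `resize_nearest` and `mosaic`, the tile size, and the rule that discards boxes clipped to under one pixel are illustrative assumptions, not details taken from the specification.

```python
import random

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of an image stored as a list of pixel rows."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]

def mosaic(samples, tile=64, scale_range=(0.5, 1.5), rng=None):
    """Stitch four (image, boxes) samples into one 2x2 mosaic.

    Each box is (x1, y1, x2, y2, cls) in pixel coordinates of its own image.
    Returns a (2*tile) x (2*tile) canvas and the boxes corrected to canvas
    coordinates.
    """
    assert len(samples) == 4
    rng = rng or random.Random()
    size = 2 * tile
    canvas = [[0] * size for _ in range(size)]
    out_boxes = []
    offsets = [(0, 0), (tile, 0), (0, tile), (tile, tile)]  # (x, y) per quadrant
    for (img, boxes), (ox, oy) in zip(samples, offsets):
        h, w = len(img), len(img[0])
        s = rng.uniform(*scale_range)              # random scaling ratio
        new_h, new_w = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(img, new_h, new_w)
        # Paste the top-left crop of the scaled image into its quadrant.
        ph, pw = min(tile, new_h), min(tile, new_w)
        for r in range(ph):
            canvas[oy + r][ox:ox + pw] = scaled[r][:pw]
        sx, sy = new_w / w, new_h / h
        for x1, y1, x2, y2, cls in boxes:
            # Scale the box, clip it to the pasted region, then shift it.
            bx1, by1 = min(x1 * sx, pw), min(y1 * sy, ph)
            bx2, by2 = min(x2 * sx, pw), min(y2 * sy, ph)
            if bx2 - bx1 > 1 and by2 - by1 > 1:    # drop boxes clipped away
                out_boxes.append((bx1 + ox, by1 + oy, bx2 + ox, by2 + oy, cls))
    return canvas, out_boxes
```

Correcting the boxes in the same loop that pastes each image keeps the labels synchronized with the stitched pixels, which is the property the paragraph above relies on.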
[0008] Preferably, in S3, the deep neural network model consists of multiple parts, including an input layer, a backbone network, a neck network, and a detection head. The backbone network adopts a multi-segment structure, which achieves the extraction of deep semantic features by progressively decreasing the feature map resolution and increasing the number of channels.
[0009] Preferably, the backbone network incorporates a region attention mechanism. This mechanism divides the feature map into multiple non-overlapping local regions and independently calculates the association weights of the query vector, key vector, and value vector within each local region. This constrains the search space for attention calculation to key texture regions, effectively focusing on the continuity of tactile paving lines and the edge details of obstacles. This reduces computational complexity while suppressing interference from redundant background information.
[0010] Preferably, the region attention mechanism is specifically implemented through a region attention module. This module divides the input feature map into grid cells of a preset size, performs local self-attention calculation on the pixels in each grid cell, and combines a cross-region feature transfer mechanism to maintain the consistency of global semantics, ensuring that the model can perceive the overall guidance of the tactile paving while paying attention to the features of local obstacles.
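The local self-attention of [0009] and [0010] can be sketched as follows. This is a simplified illustration under stated assumptions: the query, key, and value vectors are taken to be the feature vectors themselves (no learned projections), regions are square and non-overlapping, and the cross-region feature transfer mechanism is omitted. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def region_attention(fmap, grid):
    """Self-attention restricted to non-overlapping grid x grid regions.

    fmap is an H x W x C nested list; grid must divide both H and W.
    Assumption: Q = K = V = input features (identity projections).
    """
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    assert H % grid == 0 and W % grid == 0
    out = [[None] * W for _ in range(H)]
    scale = 1.0 / math.sqrt(C)
    for r0 in range(0, H, grid):
        for c0 in range(0, W, grid):
            coords = [(r, c) for r in range(r0, r0 + grid)
                      for c in range(c0, c0 + grid)]
            tokens = [fmap[r][c] for r, c in coords]
            for (r, c), q in zip(coords, tokens):
                # Scores are computed only against tokens of the same region.
                scores = [scale * sum(a * b for a, b in zip(q, k))
                          for k in tokens]
                weights = softmax(scores)
                out[r][c] = [sum(w * v[ch] for w, v in zip(weights, tokens))
                             for ch in range(C)]
    return out
```

Because each pixel attends only to the pixels of its own region, a change in one region leaves the output of every other region untouched, which is why a separate cross-region path is needed to preserve global semantics.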
[0011] Preferably, the backbone network also integrates a region attention convolution high-efficiency layer aggregation module. This module combines lightweight convolution branches with the region attention module in parallel. Through a feature fusion structure, it dynamically couples the spatial features extracted by convolution with the semantic features extracted by the attention mechanism, thereby achieving bidirectional optimization of the number of model parameters and computational load while maintaining strong semantic expressive power.
[0012] Preferably, the backbone network has a predetermined number of residual high-efficiency layer aggregation networks stacked at its end. This network, by designing multi-level residual connection paths, directly transmits shallow spatial detail information to deep aggregation nodes, effectively alleviating the gradient vanishing problem in the feature transmission process of deep neural networks and enhancing the model's representation accuracy of slender tactile paving lines and irregular obstacle boundaries in complex road surface backgrounds.
[0013] Preferably, the residual high-efficiency layer aggregation network adopts a block-level residual strategy, which establishes skip connections between multiple consecutive convolutional layers to achieve feature reuse and stable gradient flow transmission. The aggregation ratio of its internal feature channels is set to a preset ratio to ensure that deep semantic features and shallow geometric features remain in balance during the fusion process.
[0014] Preferably, S3 introduces a flash attention optimization strategy. This strategy optimizes the data flow logic and memory scheduling mechanism in attention calculation, decomposes large-scale attention matrix operations into multiple small-scale block operations sized to the hardware cache, significantly reduces memory bandwidth usage under high-resolution image input, and improves computing throughput.
[0015] Preferably, the flash attention optimization strategy uses online rescaling of the normalized exponential (softmax) function to compute the weighted summation without storing the complete attention matrix, reducing memory complexity from quadratic to linear in the input sequence length and providing a hardware-friendly solution for real-time deployment on low-power embedded devices.
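The online rescaling of the normalized exponential (softmax) function in [0015] can be illustrated for a single query vector. The sketch below is a plain-Python illustration of the blockwise idea, not the hardware implementation; the function names and the block size are assumptions. It keeps only a running maximum, a running sum, and a running weighted sum, and rescales them whenever a new block raises the maximum.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialises the full row of attention scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(d)]

def blockwise_attention(q, keys, values, block=4):
    """Same result, processed block by block with online softmax rescaling."""
    scale = 1.0 / math.sqrt(len(q))
    d = len(values[0])
    m = float("-inf")     # running maximum of the scores seen so far
    l = 0.0               # running sum of exp(score - m)
    acc = [0.0] * d       # running unnormalised weighted sum of values
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in kb]
        m_new = max(m, max(s))
        corr = math.exp(m - m_new)   # rescales old statistics; 0.0 on first block
        p = [math.exp(x - m_new) for x in s]
        l = l * corr + sum(p)
        acc = [a * corr + sum(pi * v[j] for pi, v in zip(p, vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]
```

For each query, the blockwise version stores a constant amount of state regardless of the number of keys, whereas the reference version stores one score per key; over all queries this is the difference between linear and quadratic memory.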
[0016] Preferably, the neck network in S4 adopts a combination structure of feature aggregation network and path aggregation network. Through the top-down semantic enhancement path and the bottom-up localization information transmission path, it realizes efficient interaction of multi-scale features. Furthermore, the region attention convolution efficient layer aggregation module is embedded in the fusion path to further enhance the model's robustness to the detection of obstacles of different sizes.
[0017] Preferably, the detection head in S4 introduces a dynamic multilayer perceptron scaling mechanism. This mechanism can automatically adjust the parameter weights of the classification and regression branches according to the scale distribution of the input features. Combined with channel weighting and scale balancing strategies, it achieves adaptive optimization of target category determination and bounding box regression, significantly improving the detection accuracy of the model in variable scenarios.
[0018] Preferably, S3 further includes the introduction of a spatial channel self-attention mechanism, which consists of a shared multi-semantic spatial attention module and a progressive channel self-attention module. By highlighting the salience of the target area in the spatial dimension and dynamically strengthening key semantic features in the channel dimension, adaptive enhancement of the features of tactile paving obstacles is achieved.
[0019] Preferably, S3 further includes a feature fusion cascade module. This module is based on a bidirectional feature pyramid structure and introduces a channel reweighting and calibration mechanism. By adaptively weighting the feature channels at different levels, it overcomes the limitation of channel weight averaging in traditional fusion methods and effectively improves the semantic representation efficiency of small target obstacles.
[0020] Preferably, the loss function of the model in S3 is a weighted sum of multiple components, including classification loss, localization loss, and confidence loss. The localization loss adopts the complete intersection over union (CIoU) loss function, which accelerates the convergence of bounding box regression and improves localization accuracy by jointly considering the overlap area of the boxes, the distance between their center points, and the consistency of their aspect ratios.
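The three geometric terms of the localization loss in [0020] can be made concrete with a short sketch of the commonly used complete intersection over union (CIoU) formulation. It assumes axis-aligned boxes given as (x1, y1, x2, y2) with positive width and height; the function name and the small `eps` stabiliser are illustrative choices.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss = 1 - IoU + centre-distance penalty + aspect-ratio penalty."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap area term.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared centre distance, normalised by the squared diagonal of the
    # smallest box enclosing both boxes.
    rho2 = (((px1 + px2) - (tx1 + tx2)) ** 2
            + ((py1 + py2) - (ty1 + ty2)) ** 2) / 4.0
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(tw / (th + eps))
                                - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, the centre-distance term still provides a useful signal when the predicted and ground-truth boxes do not overlap at all, which is one reason this loss tends to converge faster.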
[0021] Preferably, the training parameters of the deep neural network model in S3 are set as follows: the initial learning rate is a first preset value, a cosine annealing decay strategy is adopted, the weight decay coefficient is a second preset value, the momentum factor is a third preset value, the number of training rounds is a preset number of iterations, and the input image size is uniformly adjusted to a preset resolution.
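The cosine annealing decay named in [0021] follows a simple closed form, sketched below. The specification leaves the learning rate, weight decay, momentum, and epoch count as preset values, so every number in `TRAIN_CONFIG` except the 640-pixel input size used in Example 1 is a hypothetical placeholder chosen only to make the sketch runnable.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_initial, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing decay."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_initial - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical placeholders for the "preset" training parameters.
TRAIN_CONFIG = {
    "lr_initial": 0.01,     # first preset value (assumed)
    "weight_decay": 5e-4,   # second preset value (assumed)
    "momentum": 0.9,        # third preset value (assumed)
    "epochs": 100,          # preset number of iterations (assumed)
    "image_size": 640,      # input resolution used in Example 1
}

schedule = [
    cosine_annealing_lr(e, TRAIN_CONFIG["epochs"], TRAIN_CONFIG["lr_initial"])
    for e in range(TRAIN_CONFIG["epochs"] + 1)
]
```

The schedule starts at the initial learning rate, decays slowly at first, fastest around the midpoint, and flattens out as it approaches the minimum value at the final epoch.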
[0022] Preferably, the tactile paving obstacle detection method and system based on a preset generation of a real-time target detection architecture further includes an automated evaluation module. This module monitors the overall performance of the model in real time and generates optimization suggestions by calculating multiple indicators such as accuracy, recall, mean average precision (mAP), number of parameters, number of floating-point operations per second, and frames per second.
[0023] Preferably, the frame rate can reach a preset frame rate threshold or higher on the embedded computing platform, and the mean average precision can reach a preset accuracy threshold or higher on a dataset containing a variety of common tactile paving obstacles, thus meeting the real-time and accuracy requirements of tactile paving assisted guidance tasks.
[0024] Compared with the prior art, the present invention has the following beneficial effects: This invention achieves refined modeling of tactile paving textures and obstacle edges by introducing a region attention mechanism into a pre-defined generation of real-time object detection architecture. Compared to traditional global attention mechanisms, this scheme adaptively focuses computational resources on key local regions, effectively suppressing background noise interference in complex outdoor environments. This invention achieves a preset detection accuracy under complex lighting and partial occlusion conditions, with a significantly improved average accuracy compared to the base model, enhancing the model's robustness in varying scenarios.
[0025] By integrating a flash attention optimization strategy, this invention fundamentally alleviates the memory bottleneck problem in high-resolution image processing. This strategy optimizes memory scheduling and data flow, reducing the memory complexity of attention computation to linear in the sequence length and significantly improving computational throughput. Combined with the optimization of feature propagation paths using a residual high-efficiency layer aggregation network, this invention significantly reduces the number of model parameters and computational load while maintaining high accuracy, making real-time detection possible on low-power embedded vision-assisted terminals, with a frame rate that meets the requirements of dynamic obstacle avoidance.
[0026] The neck network designed in this invention combines a feature aggregation network and a path aggregation network, and embeds a region attention convolutional high-efficiency layer aggregation module, achieving deep coupling between deep semantic information and shallow spatial details. The application of a dynamic multilayer perceptron scaling mechanism enables the model to adaptively adjust its prediction strategy according to the target size, effectively addressing the tendency to miss small obstacles at long distances. Combined with a complete intersection over union (CIoU) loss function, this invention achieves accurate obstacle bounding box regression, providing more reliable obstacle avoidance guidance information for visually impaired individuals.
[0027] This invention constructs a fully automated framework encompassing image acquisition, preprocessing, feature extraction, and obstacle avoidance command generation. The combination of mosaic data augmentation strategies and adaptive preprocessing technology enables the system to quickly adapt to different geographical environments and lighting conditions. The built-in performance evaluation module monitors the model's operational status in real time, ensuring the system's stability and reliability in practical applications and providing solid technical support for the development of a new generation of intelligent and portable visual aids for the visually impaired.
Attached Figure Description
[0028] Figure 1 is a schematic diagram of the overall technical architecture of the intelligent assistive glasses system based on visual recognition and voice interaction proposed in this invention; Figure 2 is a schematic diagram of the core principle framework of the deep neural network optimized for the detection of tactile paving and obstacles in this invention.
Detailed Implementation
[0029] To further illustrate the technical means adopted by the present invention and the effects they achieve, the specific implementations, structures, features, and effects of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
Example 1
[0030] This embodiment provides a tactile paving obstacle detection system based on an improved 12th generation real-time target detection architecture, which is deployed in a portable intelligent guide terminal worn by visually impaired individuals.
[0031] In terms of system architecture, the system consists of a visual perception hardware layer, a heterogeneous computing kernel layer, an interactive feedback execution layer, and a power management support layer. The visual perception hardware layer includes a high-resolution complementary metal-oxide-semiconductor (CMOS) image sensor located at the front end of the guide terminal. This sensor has a 120-degree ultra-wide field of view, supports progressive scan sampling at 60 frames per second, and transmits raw image data of the tactile paving environment to the back end via a Mobile Industry Processor Interface (MIPI). The heterogeneous computing kernel layer is the core of the system, containing a high-performance application processor and a specially optimized neural processing unit. The application processor is responsible for system task scheduling, image signal preprocessing, and peripheral management; the neural processing unit is configured to run the improved 12th-generation real-time object detection architecture model and integrates a dedicated tensor operation accelerator and vector processor for performing matrix convolutions, region attention calculations, and feature aggregation operations. The interactive feedback execution layer includes an audio decoding chip, a bone conduction headphone interface, and a linear resonant driver, used to convert detection results into voice broadcasts or vibration signals of different frequencies. The power management support layer consists of a high-capacity lithium polymer battery, a low-dropout linear regulator, and a power monitoring module, providing stable 3.3-volt and 1.8-volt logic voltages for the entire system.
[0032] In terms of workflow, the system runs a tactile paving obstacle detection method based on an improved 12th-generation real-time target detection architecture, as follows: Step 1: The image sensor in the visual perception hardware layer performs raw image data acquisition. After photoelectric conversion, the sensor outputs raw Bayer format data containing tactile paving lines, paving tile textures, and road obstacles. The application processor uses its built-in image signal processor to perform denoising, white balance correction, and color space conversion, generating a standard 640 x 640 pixel RGB (red-green-blue) image sequence.
[0033] Step 2: The preprocessing module in the heterogeneous computing kernel layer performs mosaic data augmentation and normalization. Within each inference cycle, this module randomly selects four historical or current frame images from the image cache and scales them non-uniformly according to a preset scaling ratio (0.5x to 1.5x). Subsequently, the preprocessing module stitches the four images into a unified coordinate system and performs random cropping, thereby constructing a composite image with higher background complexity. This process simulates the imaging characteristics of obstacles at different distances and under different lighting conditions, improving the system's adaptability to complex environments.
[0034] Step 3: The neural processing unit loads and executes a deep neural network model optimized for tactile paving and obstacle detection. This model construction process deeply integrates a region attention mechanism and a residual efficient layer aggregation structure. Specifically, the backbone network is divided into five feature extraction stages. In stages 3 to 5, the system is configured with a region attention convolutional efficient layer aggregation module. When performing feature extraction, this module first segments the feature map into 16x16 local grids using a region partitioning unit. Within each grid, the region attention mechanism independently calculates the association weights of the query, key, and value vectors. This localized attention calculation method constrains the search space from global pixels to local texture regions, enabling the system to more accurately capture subtle features at the edges of tactile paving, while maintaining long-distance semantic associations through cross-regional feature transfer links.
[0035] At the end of the backbone network, the neural processing unit runs the residual efficient layer aggregation network logic. This network concatenates the shallow geometric features of stage 2 with the deep semantic features of stage 5 through multiple skip connection paths. The aggregation ratio, defined as the ratio of the number of shallow channels to the number of deep channels, is set to 1:2, ensuring that the model retains accurate boundary localization information while recognizing object categories. To optimize memory usage, the neural processing unit introduces a flash attention optimization strategy. This strategy uses online rescaling to decompose the calculation of the normalized exponential (softmax) function of the attention scores into multiple block operations, so that intermediate variables do not need to be stored entirely in global memory but are accumulated directly in the on-chip cache. The calculation proceeds as follows: for each input block, the system updates the running maximum and the running partial sum, thereby reducing the memory access bandwidth requirement by about 60% without sacrificing accuracy and ensuring low-latency operation on embedded devices.
[0036] Step 4: The neck network module performs multi-scale feature fusion. The system achieves top-down semantic enhancement through a feature aggregation network, fusing deep abstract features with shallow features after upsampling; subsequently, it achieves bottom-up localization information transmission through a path aggregation network. At each fusion node, a region attention convolutional high-efficiency layer aggregation module is embedded to enhance the ability to discriminate obstacles at different scales. The detection head module then receives the fused feature map and, using a dynamic multilayer perceptron scaling mechanism, adaptively adjusts the computational weights of the classification and regression branches based on the response intensity of the feature map.
[0037] Step 5: The interactive feedback execution layer generates obstacle avoidance assistance instructions based on the results output by the detection head. When the system detects an obstacle (such as an illegally parked vehicle, a shared bicycle, or a road sign) with a confidence level higher than 0.85 on the tactile paving, the application processor calculates the relative offset between the center point of the obstacle and the center line of the tactile paving. If the obstacle is within 2 meters directly in front of the visually impaired person, the audio decoding chip drives the bone conduction headphones to emit a high-frequency warning sound, while the linear resonant driver generates vibrations at a specific frequency, guiding the visually impaired person to shift towards the side with fewer obstacles.
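The decision rule of Step 5 can be sketched as follows. The 0.85 confidence threshold and the 2-meter warning distance come from this embodiment; the `Detection` fields, the half-width used to decide whether an obstacle is "directly in front", and the tie-breaking rule for the steering direction are assumptions introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    distance_m: float        # estimated distance ahead of the wearer
    lateral_offset_m: float  # obstacle centre minus paving centre line; negative = left

CONF_THRESHOLD = 0.85    # from the embodiment
WARN_DISTANCE_M = 2.0    # from the embodiment
PATH_HALF_WIDTH_M = 0.5  # assumption: what counts as "directly in front"

def plan_feedback(detections):
    """Return whether to warn and which side to steer towards."""
    confident = [d for d in detections if d.confidence > CONF_THRESHOLD]
    blocking = [d for d in confident
                if d.distance_m <= WARN_DISTANCE_M
                and abs(d.lateral_offset_m) <= PATH_HALF_WIDTH_M]
    if not blocking:
        return {"warn": False, "steer": None}
    # Steer towards the side with fewer confident obstacles
    # (assumption: ties resolve to the right).
    left = sum(1 for d in confident if d.lateral_offset_m < 0)
    right = sum(1 for d in confident if d.lateral_offset_m >= 0)
    return {"warn": True, "steer": "left" if left < right else "right"}
```

For example, a single shared bicycle detected 1.5 meters ahead and slightly to the right of the paving centre line would trigger the warning and a prompt to shift left, while the same bicycle at 3 meters would not trigger anything yet.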
Example 2
[0038] Based on Embodiment 1, this embodiment further details the adaptive optimization scheme of the system in complex and ever-changing environments, focusing on the implementation details of the spatial channel self-attention mechanism and the feature fusion cascade module.
[0039] In terms of system architecture, the heterogeneous computing kernel layer in this embodiment further integrates a deep cache management unit and a dynamic frequency adjustment module. The deep cache management unit manages the temporary tensors generated by the neural network during inference, reducing fragmentation losses by pre-allocating contiguous memory space. The dynamic frequency adjustment module automatically adjusts the operating frequency of the neural processing unit according to the motion vector intensity of the current image sequence, reducing power consumption when the environment is static and increasing computing power to peak levels when moving rapidly.
[0040] In terms of methodology, this embodiment significantly expands upon the model building process in step 3: In the feature enhancement stage, the system introduces a spatial channel self-attention mechanism. This mechanism consists of a shared multi-semantic spatial attention submodule and a progressive channel self-attention submodule connected in series. When image data flows through this module, the shared multi-semantic spatial attention submodule first extracts salient features in the spatial dimension using global average pooling and global max pooling. It then generates a spatial weight map through a set of convolutional layers with shared weights and applies it to the input feature map, thereby highlighting the tactile paving area and suppressing background interference such as roadside trees and buildings. Subsequently, the progressive channel self-attention submodule dynamically enhances feature channels containing key semantic information (such as the raised texture of tactile paving and the edge contours of obstacles) and suppresses redundant noise channels by performing dimensionality reduction and dimensionality increase processing on the cross-correlation matrix between channels.
[0041] To address the challenge of small target detection, this embodiment employs a cascaded feature fusion module in step 4. This module is based on a bidirectional feature pyramid structure and introduces a channel reweighting and calibration mechanism within the cascaded path. Specifically, after concatenating high-level semantic features with low-level spatial features, the system does not directly perform convolutional reduction. Instead, it first calculates the importance score of each channel using a lightweight channel perception unit. This score, generated based on global statistical information and local contextual association, is used to weight and calibrate the concatenated feature vector. This mechanism overcomes the indiscriminate processing of channel information in traditional feature fusion, enabling the system to have higher detection sensitivity for small-sized obstacles at long distances (over 5 meters).
[0042] Furthermore, the detection head module in this embodiment employs a complete intersection over union (CIoU) loss function for parameter optimization. During model training, this loss function not only calculates the overlap area between the predicted bounding box and the ground truth bounding box but also introduces a center point distance penalty term and an aspect ratio consistency penalty term. By jointly considering these three geometric elements, the system achieves faster convergence and higher positioning accuracy when regressing obstacle bounding boxes, effectively reducing obstacle avoidance misjudgments caused by bounding box drift.
Example 3
[0043] This embodiment focuses on describing the automated evaluation and adaptive calibration process of the system in actual industrial deployment, aiming to ensure the system's performance consistency across different hardware platforms and geographical environments.
[0044] In terms of system architecture, this embodiment adds an automated evaluation workstation, which connects to multiple portable guide terminals via a gigabit Ethernet interface. The evaluation workstation includes a large-scale data storage array and a high-performance graphics processor cluster for running automated performance monitoring software. During operation, the guide terminals periodically upload sample frames with low detection confidence levels, along with system operating parameters (such as temperature, power consumption, and inference time), to the workstation.
[0045] In terms of methodology, this embodiment adds a closed-loop optimization step: Step 6: The automated evaluation module performs performance monitoring and model evolution. The system calculates six core evaluation metrics in real time: accuracy, recall, mean average precision (mAP), number of parameters, floating-point operations per second, and frame rate. Accuracy reflects the system's reliability in identifying obstacles, while recall measures its ability to cover all real obstacles. By automatically labeling the challenging samples uploaded to the workstation and applying reinforcement learning to them, the system can generate fine-tuned parameter packages for specific environments (such as rain, nighttime, and bright light).
[0046] In the data flow logic, the number of floating-point operations per second is used to evaluate the computational load of the current algorithm on the target hardware. If the load exceeds 90% of the rated capacity of the neural processing unit, the system will automatically initiate lightweight switching logic, adjusting the grid size of the region attention mechanism from 16 x 16 to 8 x 8. While maintaining the core semantic recognition capability, the system reduces redundant calculations to ensure that the frame rate is maintained at more than 30 frames per second, thus guaranteeing the real-time continuity of the guide process.
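The saving from the lightweight switching logic in [0046] can be checked with simple arithmetic. The sketch below assumes that "16 x 16" and "8 x 8" denote the side length of each attention region in feature-map cells, so that a region of side g holds g x g tokens and requires (g x g) squared pairwise attention scores; under the opposite reading (number of regions per side) the cost would move in the other direction. The function names and the 64 x 64 feature map are illustrative.

```python
def attention_pairs(height, width, grid):
    """Number of query-key score computations for region attention."""
    regions = (height // grid) * (width // grid)
    tokens_per_region = grid * grid
    return regions * tokens_per_region * tokens_per_region

def select_grid(load_ratio, normal_grid=16, light_grid=8, threshold=0.9):
    """Switch to the smaller region size when the load exceeds the threshold."""
    return light_grid if load_ratio > threshold else normal_grid

# Worked example on a hypothetical 64 x 64 feature map.
pairs_16 = attention_pairs(64, 64, 16)   # 16 regions * 256 * 256
pairs_8 = attention_pairs(64, 64, select_grid(0.95))  # 64 regions * 64 * 64
```

Since the total cost equals height x width x grid squared, halving the region side from 16 to 8 divides the number of attention score computations by four, at the price of a smaller receptive field per region.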
[0047] In terms of environmental adaptability, the system applies perspective analysis techniques from video surveillance imaging. The application processor analyzes changes in the vanishing point of the tactile paving across consecutive frames to determine the camera's current wearing angle. If an angle deviation causes the center line of the tactile paving to drift out of the preset visual area, the system automatically invokes a coordinate transformation matrix to compensate the output coordinates of the detection head in real time. This dynamic calibration mechanism ensures that even when the walking posture of a visually impaired person fluctuates significantly, the system can still accurately calculate the relative positional relationship between obstacles and the tactile paving.
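One possible form of such a coordinate transformation matrix is a 3 x 3 homogeneous matrix that rotates detection coordinates about the image center by the negative of the estimated tilt. The sketch below assumes that the deviation is a pure in-plane roll and that the roll angle has already been estimated; the function names, the sign convention, and the example values are hypothetical and are not taken from the text.

```python
import math


def roll_compensation_matrix(roll_deg: float, center: tuple[float, float]) -> list[list[float]]:
    """Build a homogeneous 2D matrix that rotates by -roll_deg about center."""
    theta = math.radians(-roll_deg)          # rotate back by the estimated roll
    c, s = math.cos(theta), math.sin(theta)
    cx, cy = center
    # Equivalent to: translate center to origin, rotate, translate back.
    return [
        [c, -s, cx - c * cx + s * cy],
        [s, c, cy - s * cx - c * cy],
        [0.0, 0.0, 1.0],
    ]


def compensate_point(point: tuple[float, float], matrix: list[list[float]]) -> tuple[float, float]:
    """Apply the homogeneous matrix to one (x, y) detection coordinate."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


# Hypothetical 640 x 480 frame with a 5 degree estimated roll.
m = roll_compensation_matrix(roll_deg=5.0, center=(320.0, 240.0))
corrected = compensate_point((400.0, 300.0), m)
```

Applying the same matrix to each bounding box corner compensates the whole box; a full perspective correction would instead need a homography derived from the vanishing point.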
[0048] Example 4 This embodiment describes a blind path obstacle detection scheme optimized for low-power scenarios, which is particularly suitable for small wearable devices with limited battery capacity.
[0049] In terms of system architecture, this solution adopts a deep learning acceleration scheme based on application-specific integrated circuits (ASICs). The visual perception hardware layer acquires images through low-power image sensors and uses subsampling technology to reduce the amount of input data without losing key textures. The heterogeneous computing kernel layer consists of an ultra-low-power microcontroller and a custom stream processor. The stream processor uses static random access memory as a level 1 cache, and the near-memory computing architecture reduces the energy consumption of data transfer between memory and computing units.
[0050] In terms of methodology, this embodiment further optimizes the attention calculation in step 3 by evolving the flash attention optimization strategy into fixed-point flash attention. The system converts floating-point operations into 16-bit or 8-bit fixed-point operations and uses a lookup table to accelerate the calculation of the normalized exponential (softmax) function. When performing region attention calculations, the stream processor uses a block-parallel strategy to divide the feature map into smaller blocks. The result of each block feeds directly into the subsequent weighted summation without being written back to external memory. This processing method keeps the overall power consumption of the system below 500 milliwatts.
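The block-wise accumulation can be illustrated with a small sketch that processes keys and values one block at a time while keeping only a running maximum, a running normalizer, and a running weighted sum. This is an illustration under simplifying assumptions: it uses floating-point arithmetic for a single query vector, and it omits the fixed-point conversion, the lookup table, and the usual scaling of scores by the square root of the dimension. All function names are hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax over all scores at once, then the weighted sum."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same output, computed block by block without the full score vector."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because each block only updates three small running quantities, the intermediate results never need to leave fast local memory, which is the property the paragraph above relies on.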
[0051] In the feedback logic of step 5, the system introduces a threat level classification and assessment mechanism. The application processor assigns a threat level from 1 to 5 based on the obstacle's motion state (static or dynamic), its distance, and the proportion of the tactile paving it obstructs. Level 1 is a potential warning, for which the system only issues a slight alert via bone conduction headphones; Level 5 is an emergency avoidance condition, for which the system triggers maximum-power vibration feedback and forcibly broadcasts a voice command. This mechanism of allocating feedback intensity on demand not only improves the user experience but also further optimizes the power consumption of the execution layer.
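The text names the three inputs and the 1 to 5 range but gives no thresholds or weighting, so the following is only a hypothetical scoring scheme showing how such inputs could be combined. Every threshold and point value in it is an illustrative assumption.

```python
def threat_level(is_moving: bool, distance_m: float, blocked_ratio: float) -> int:
    """Map obstacle attributes to a level from 1 (lowest) to 5 (highest).

    All thresholds below are illustrative assumptions, not values from the text.
    """
    if distance_m < 0:
        raise ValueError("distance_m must be non-negative")
    if not 0.0 <= blocked_ratio <= 1.0:
        raise ValueError("blocked_ratio must lie in [0, 1]")
    score = 0
    if distance_m < 1.0:        # assumed "immediate" range
        score += 2
    elif distance_m < 3.0:      # assumed "approaching" range
        score += 1
    if blocked_ratio > 0.5:     # more than half of the paving is obstructed
        score += 1
    if is_moving:               # dynamic obstacles are treated as riskier
        score += 1
    return 1 + score            # score is 0..4, so the level is 1..5


# Hypothetical cases: a distant static object and a close moving one.
low = threat_level(is_moving=False, distance_m=10.0, blocked_ratio=0.0)
high = threat_level(is_moving=True, distance_m=0.5, blocked_ratio=0.9)
```

The returned level could then select the feedback channel and intensity, from a slight audio alert at level 1 up to maximum-power vibration with a forced voice command at level 5.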
[0052] In addition, this embodiment also includes an ambient brightness adaptive module. When the light intensity is below 10 lux, the application processor automatically instructs the image sensor to activate long exposure mode and simultaneously calls a pre-trained low-light compensation model. This model compensates for contrast loss in low-light environments by introducing a gain adjustment factor in the feature extraction layer, ensuring that the system can still maintain a detection accuracy of over 85% in low-light scenes such as at night or in tunnels.
[0053] Example 5 This embodiment details the in-depth application of the present invention in multi-category obstacle recognition and complex road surface semantic understanding.
[0054] In terms of system architecture, this embodiment introduces a multi-sensor fusion unit, which integrates an ultrasonic ranging sensor and an inertial measurement unit in addition to a visual sensor. The ultrasonic ranging sensor is used to compensate for the shortcomings of visual detection in detecting extremely transparent objects (such as glass doors); the inertial measurement unit is used to sense the walking gait and body tilt angle of visually impaired persons, providing real-time attitude reference for the visual coordinate system.
[0055] In terms of methodology, this embodiment extends the detection head in step 4 with a multi-task approach. The detection head not only outputs the bounding boxes of obstacles but also the segmentation mask of the tactile paving. By introducing a lightweight semantic segmentation branch into the detection head, the system can achieve real-time assessment of the integrity of the tactile paving. If the tactile paving is damaged, broken, or extensively covered, the system will provide a voice prompt to the visually impaired person that the current tactile paving is unavailable and suggest alternative routes.
[0056] For multi-category obstacles, the system constructs a feature library containing 32 common road surface targets. During detection, a dynamic multilayer perceptron scaling mechanism performs in-depth discrimination based on the fine-grained features of the target. For example, given two rectangular targets, the system can accurately distinguish a bench from a low fence based on their texture details. This fine-grained recognition capability benefits from the effective preservation of high-frequency detail features by the residual high-efficiency layer aggregation network.
[0057] In the feature fusion path, this embodiment employs an improved bidirectional feature pyramid network. This network adds additional weight learning parameters between each level, enabling it to dynamically determine the fusion weights of features at different levels based on the global statistical features of the current scene. For example, in an open square scene, the system automatically increases the weight of deep semantic features to focus on large objects; while in a narrow corridor scene, it increases the weight of shallow spatial features to accurately avoid nearby obstacles.
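A common way to realize learnable fusion weights in a bidirectional feature pyramid is fast normalized fusion: each weight is clipped to be non-negative and divided by the sum of all weights. The sketch below assumes that scheme and takes the weights as given; how the weights are derived from the global statistics of the scene is not specified in the text and is not modeled here. The function name and the sample values are hypothetical.

```python
def fuse_features(features: list[list[float]], raw_weights: list[float],
                  eps: float = 1e-4) -> list[float]:
    """Fuse same-sized feature vectors with non-negative normalized weights."""
    if len(features) != len(raw_weights):
        raise ValueError("one weight is required per feature level")
    positive = [max(w, 0.0) for w in raw_weights]   # clip negatives to zero
    norm = sum(positive) + eps                      # eps avoids division by zero
    length = len(features[0])
    return [sum(w * f[i] for w, f in zip(positive, features)) / norm
            for i in range(length)]


# Hypothetical shallow and deep features, already resized to the same length.
shallow = [1.0, 1.0, 1.0]
deep = [3.0, 3.0, 3.0]
open_square = fuse_features([shallow, deep], raw_weights=[0.2, 0.8])   # favors deep
narrow_corridor = fuse_features([shallow, deep], raw_weights=[0.8, 0.2])  # favors shallow
```

Raising the deep-level weight pulls the fused feature toward the semantic level, and raising the shallow-level weight pulls it toward spatial detail, matching the open square and narrow corridor examples above.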
[0058] Finally, the system also possesses learning and evolution capabilities. Through its built-in incremental learning algorithm, the system can learn to identify new obstacle categories using a small number of samples without retraining the entire model. When a visually impaired person encounters an unfamiliar obstacle and manually marks it, the system will use idle computing power in the background to update the classification layer parameters of the detection head, thus providing personalized obstacle avoidance services.
[0059] In summary, this invention constructs a hardware-software collaborative system based on an improved 12th-generation real-time object detection architecture. It deeply integrates a region attention mechanism, a residual high-efficiency layer aggregation network, and a flash attention optimization strategy, achieving comprehensive optimization across multiple dimensions, including algorithm structure, computational efficiency, memory scheduling, and multi-scale fusion. While maintaining high-precision detection capabilities, the system significantly reduces hardware overhead and power consumption, effectively solving the real-time and robustness challenges of tactile paving obstacle detection in complex outdoor environments, and providing reliable technical support for the safe travel of visually impaired individuals.
[0060] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.
Claims
1. A smart assistant glasses system based on visual recognition and voice interaction, characterized in that the method comprises the following specific steps:
S1, acquiring original image data of the tactile paving environment: using a portable visual acquisition device to capture real-time images of the tactile paving area in the direction of travel of the visually impaired person, obtaining a multi-resolution color image sequence containing tactile paving lines, tile texture features, and potential obstacles;
S2, preprocessing and augmenting the original image data: using a mosaic data augmentation strategy to scale a predetermined number of randomly selected original images, with the scaling ratio set within a predetermined range, then stitching and cropping the scaled images according to a predetermined number of quadrants, and adaptively adjusting the brightness, contrast, and saturation of the images, while simultaneously correcting the coordinate values of the target bounding boxes according to the size of the stitched image;
S3, constructing and training a deep neural network model optimized for tactile paving and obstacle detection: based on a preset-generation real-time object detection architecture, introducing a region attention mechanism, a residual high-efficiency layer aggregation network, and a flash attention optimization strategy, and, by iteratively training on the labeled tactile paving obstacle dataset, obtaining a deep learning model that can accurately identify tactile paving boundaries and obstacle categories;
S4, real-time feature extraction and multi-scale target detection: inputting the real-time image to be detected into the trained deep neural network model, extracting multi-level semantic features through the backbone network, performing cross-scale feature fusion using the neck network, and outputting the target class probability and bounding box coordinates from the detection head, wherein the detection head introduces a dynamic multilayer perceptron scaling mechanism that automatically adjusts the parameter weights of the classification branch and the regression branch according to the scale distribution of the input features;
S5, outputting the detection results and generating obstacle avoidance assistance instructions: based on the obstacle position, category, and confidence information output by the detection head, combined with the extension trend of the main line of the tactile paving, determining the threat level of the obstacle to the visually impaired person, and converting it into a voice feedback signal or a vibration feedback signal for real-time warning.
2. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that, in S2, the mosaic data augmentation strategy performs the following operations: in each training batch, a predetermined number of images are randomly selected, and each is scaled by a random ratio set within a predetermined range; the scaled images are then stitched according to a predetermined number of quadrants, and the coordinate values of the target bounding boxes are corrected simultaneously according to the size of the stitched image, increasing the randomness of the background complexity and target distribution and improving the model's perception of small-sized obstacles.
3. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that, In S3, the backbone network introduces a region attention mechanism, which performs the following operations: by dividing the feature map into multiple non-overlapping local regions, the association weights of the query vector, key vector, and value vector are independently calculated within each local region, constraining the search space of attention calculation to key texture regions, focusing on the continuity of tactile paving lines and the edge details of obstacles, thereby reducing computational complexity while suppressing interference from redundant background information; the region attention mechanism is implemented through a region attention module, which divides the input feature map into grid cells of a preset size, performs local self-attention calculation on the pixels within each grid cell, and combines a cross-regional feature transfer mechanism to maintain global semantic consistency, ensuring that the model perceives the overall guidance of the tactile paving while paying attention to local obstacle features.
4. The smart assistant glasses system based on visual recognition and voice interaction according to claim 3, characterized in that, The backbone network also integrates a region attention convolution high-efficiency layer aggregation module. This module combines lightweight convolution branches with the region attention module in parallel. Through a feature fusion structure, it dynamically couples the spatial features extracted by convolution with the semantic features extracted by the attention mechanism, thereby achieving bidirectional optimization of model parameters and computational load while maintaining semantic expressive power.
5. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that, At the end of the backbone network, a predetermined number of residual high-efficiency layer aggregation networks are stacked. These residual high-efficiency layer aggregation networks, through the design of multi-level residual connection paths, directly transmit shallow spatial detail information to deep aggregation nodes, alleviating the gradient vanishing problem in the feature transmission process of deep neural networks and enhancing the model's representation accuracy of slender tactile paving lines and irregular obstacle boundaries in complex road surface backgrounds. The residual high-efficiency layer aggregation network adopts a block-level residual strategy, establishing skip connections between multiple consecutive convolutional layers to achieve feature reuse and stable gradient flow transmission. The aggregation ratio of its internal feature channels is set to a preset ratio to ensure that deep semantic features and shallow geometric features remain balanced during the fusion process.
6. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that, in S3, the flash attention optimization strategy performs the following operations: by optimizing the data flow logic and memory scheduling mechanism in attention computation, the large-scale attention matrix operation is decomposed into multiple small-scale tile operations adapted to the hardware cache, reducing memory bandwidth usage under high-resolution image input and improving computational throughput; through online rescaling of the normalized exponential (softmax) function, the flash attention optimization strategy completes the weighted summation without storing the complete attention matrix, reducing the memory complexity from quadratic to linear in the input sequence length and providing hardware-level support for real-time deployment on low-power embedded devices.
7. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that, The neck network in S4 adopts a combination structure of feature aggregation network and path aggregation network. It achieves efficient interaction of multi-scale features through top-down semantic enhancement path and bottom-up localization information transmission path. The region attention convolution efficient layer aggregation module is embedded in the fusion path to enhance the model's robustness to the detection of obstacles of different sizes. The dynamic multilayer perceptron scaling mechanism introduced by the detection head, combined with channel weighting strategy and scale balancing strategy, achieves adaptive optimization of target category determination and bounding box regression.
8. The smart assistant glasses system based on visual recognition and voice interaction according to claim 1, characterized in that S3 also includes the introduction of a spatial channel self-attention mechanism and a feature fusion cascade module; the spatial channel self-attention mechanism consists of a shared multi-semantic spatial attention module and a progressive channel self-attention module, and, by highlighting the salience of the target area in the spatial dimension and dynamically strengthening key semantic features in the channel dimension, it achieves adaptive enhancement of the features of blind path obstacles; the feature fusion cascade module is based on a bidirectional feature pyramid structure and introduces a channel reweighting and calibration mechanism, and, by adaptively weighting the feature channels at different levels, it improves the semantic representation efficiency of small target obstacles; the loss function of the model in S3 is composed of multiple weighted components, including classification loss, localization loss, and confidence loss; the localization loss adopts the complete intersection over union (CIoU) loss function, which accelerates the convergence of bounding box regression and improves localization accuracy by jointly considering the overlap area of the target boxes, the distance between their center points, and the consistency of their aspect ratios.
9. A blind path obstacle detection system based on a preset-generation real-time object detection architecture, characterized in that the system includes:
a visual perception hardware layer, used to perform raw image data acquisition using a high-resolution image sensor set at the front end of the guide terminal, to acquire raw Bayer format data containing tactile paving lines, paving textures, and road obstacles, and to generate a standard image sequence with a preset resolution through an image signal processor;
a heterogeneous computing kernel layer, including an application processor and a neural processing unit, wherein the application processor is responsible for system task scheduling, image signal preprocessing, and peripheral management, and the neural processing unit is configured to run an improved preset-generation real-time object detection architecture model and integrates a tensor operation accelerator and a vector processor to perform matrix convolution, region attention calculation, and feature aggregation operations;
an interactive feedback execution layer, used to generate obstacle avoidance assistance commands based on the obstacle position, type, and confidence information output by the detection head, and able to drive the audio decoding chip to emit voice feedback signals or drive the linear resonant driver to generate vibration feedback signals of a specific frequency;
a power management support layer, consisting of a battery, a voltage regulator, and a power monitoring module, used to provide logic voltage support for the system;
wherein the heterogeneous computing kernel layer runs the smart assistant glasses system based on visual recognition and voice interaction as described in any one of claims 1 to 8.
10. The blind path obstacle detection system based on the preset-generation real-time object detection architecture according to claim 9, wherein the system also includes an automated evaluation module and an ambient brightness adaptive module; the automated evaluation module monitors the overall performance of the model in real time and generates optimization suggestions by calculating multiple indicators such as accuracy, recall, mean average precision, number of parameters, number of floating-point operations per second, and frames per second; when the computing load exceeds a preset threshold, the system automatically activates the lightweight switching logic to adjust the grid size of the region attention mechanism; the ambient brightness adaptive module is used to instruct the image sensor to start a long exposure mode when the light intensity is lower than a preset brightness threshold, and to call a pre-trained low-light compensation model, which compensates for the contrast loss in low-light environments by introducing a gain adjustment factor in the feature extraction layer.