An exhibition intelligent interpretation and central control linkage system and method based on ESP-NOW protocol beacon positioning and dual-link transmission

Beacon positioning and dual-link transmission based on the ESP-NOW protocol, combined with a dual-processor architecture and offline voice recognition, address the inaccurate guide positioning, limited device functionality, and high hardware costs of existing exhibition hall explanation systems. The result is high-precision, low-latency intelligent explanation and central control linkage, with improved user experience and system stability.

CN122248037A · Pending · Publication Date: 2026-06-19 · CHENGDU MAODIAN DIGITAL TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU MAODIAN DIGITAL TECHNOLOGY CO LTD
Filing Date
2026-03-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing intelligent tour guide systems suffer from problems such as inaccurate tour guide positioning, limited equipment functionality, high hardware costs, poor system compatibility, separation of positioning and audio transmission, high wireless transmission latency, and security risks due to reliance on the cloud for voice control. These issues make it difficult to meet the requirements for intelligence, low latency, and security and reliability.

Method used

Beacon positioning and dual-link transmission based on the ESP-NOW protocol are employed. A dual-processor architecture tightly integrates audio acquisition with real-time positioning; combined with dual-link transmission and offline voice recognition, it enables sub-meter real-time positioning and low-latency voice control linkage, reducing hardware costs and improving user experience.

Benefits of technology

It achieves high-precision, low-latency guide positioning and voice control, has a high degree of system integration, reduces hardware costs and deployment complexity, improves user experience and system stability, and is suitable for use in confidential locations.



Abstract

This invention discloses an intelligent exhibition hall explanation and central control linkage system and method based on ESP-NOW protocol beacon positioning and dual-link transmission, belonging to the field of intelligent exhibition hall explanation and IoT central control technology. The system includes an explanation microphone terminal, wireless beacons, monitoring equipment, a central control server, and exhibit hosts. The terminal adopts a dual ESP32 architecture, collects voice and performs positioning in real time, and transmits audio in parallel over two links, 5 GHz Wi-Fi and ESP-NOW. Each exhibit host connects to the existing sound system in the exhibition hall and automatically switches playback status according to the guide's location, achieving "explaining wherever the guide goes." The central control server program uses an offline Vosk speech recognition model to parse commands and issue control instructions based on location information. The invention fully reuses the existing sound system in the exhibition hall, eliminating the need for additional wiring and dedicated zoned sound systems and reducing deployment costs by more than 70%. Positioning accuracy reaches 0.65 meters, offline speech recognition response time is less than 0.4 seconds, and multiple guides can work concurrently, achieving low-cost, highly integrated, low-latency intelligent explanation and central control linkage.

Description

Technical Field

[0001] This invention relates to the field of intelligent exhibition hall explanation and IoT central control technology, specifically to an intelligent exhibition hall explanation and central control linkage system and method based on ESP-NOW protocol beacon positioning and dual-link transmission, applicable to automatic zone explanation and voice central control linkage scenarios in science and technology museums, museums, corporate exhibition halls, and other venues.

Background Technology

[0002] With the deep integration of the digital economy and the cultural industry, venues such as corporate exhibition halls, science and technology museums, museums, and planning exhibition halls are rapidly becoming more intelligent and interactive. Intelligent explanation systems and central control linkage systems are core carriers of the visitor experience, and their technological maturity directly affects venue operating efficiency and visitor satisfaction. However, existing technologies still have systemic shortcomings in guide positioning, zoned audio playback, and voice-controlled central control linkage, specifically in the following areas: First, traditional tour guide equipment is limited in function and cannot meet the demands of intelligent operation. The vast majority of exhibition halls still use traditional equipment such as portable waist-worn voice amplifiers, handheld microphones, or group tour guide devices. These devices offer only basic sound amplification or pre-recorded playback; they cannot sense the guide's location in real time, let alone interact with the exhibition hall's multimedia control system. During a tour, guides must manually operate a control panel or remote control to switch lights, videos, model demonstrations, and so on, which disrupts the flow of the tour, fragments the visitor experience, and reduces work efficiency. In particular, when multiple people visit multiple exhibits at the same time, traditional equipment cannot provide differentiated explanations, severely limiting the exhibition hall's capacity and level of intelligence.

[0003] Secondly, existing zoned audio guide systems rely on dedicated hardware, resulting in high deployment costs and poor compatibility. To achieve automatic, "walk-in, walk-out" audio guides within the exhibition hall, some high-end venues have introduced zoned audio guide systems based on infrared, RFID, or UWB positioning technologies. These systems typically require the additional deployment of dedicated audio equipment and audio matrices in each exhibition area, operating in parallel with the existing hall's audio system. This redundant construction leads to a significant increase in hardware procurement costs, complex wiring that affects the exhibition hall's aesthetics, system debugging cycles lasting several weeks, and difficulties in subsequent maintenance. More importantly, there are issues such as interface incompatibility and sound quality mismatch between dedicated audio systems and the existing equipment in the exhibition hall, making it difficult to achieve unified scheduling and maximized utilization of audio resources.

[0004] Third, the integration of voice control functionality is low, requiring additional dedicated hardware modules. Existing voice control solutions typically require guides to wear independent voice acquisition devices or deploy distributed microphone arrays within the exhibition hall, then connect to the exhibit's main unit via a central control interface. This approach suffers from high hardware costs, long development cycles, and poor system compatibility. Guides must simultaneously carry multiple devices, including microphones, location tags, and remote controls, making operation complex and inconvenient, severely impacting user experience and the system's promotional value.

[0005] Fourth, the separation of positioning and audio transmission systems leads to difficulties in data fusion and high latency. Some manufacturers have attempted to integrate positioning and audio transmission functions, but most adopt a modular approach, such as a combination of a UWB positioning module and an independent audio transmission module. This solution suffers from complex system architecture, poor stability of multi-module collaborative operation, and high data synchronization latency. Positioning data and audio data belong to different transmission channels, causing the exhibit's host to be unable to synchronously obtain accurate location information when receiving audio, making it difficult to achieve precise matching between audio playback and location changes. Furthermore, the self-monitoring (ear monitor) function for guides is often neglected, failing to achieve low-latency audio feedback and affecting the guide's speaking rhythm and confidence.

[0006] Fifth, the application of wireless transmission technology is insufficient, limiting system scalability. Existing systems mostly employ a single wireless transmission method, such as 2.4 GHz Wi-Fi or Bluetooth. 2.4 GHz Wi-Fi has high transmission latency (typically above 50ms), making it difficult to meet the needs of real-time monitoring; Bluetooth has limited bandwidth and a small coverage area, so it cannot carry high-quality audio and multi-channel positioning data at the same time. A single transmission method has inherent bottlenecks in bandwidth, latency, and concurrent connection capacity, and cannot simultaneously meet the multi-task requirements of high-quality audio broadcasting, real-time positioning, and voice recognition.

[0007] Sixth, speech recognition relies on cloud services, posing security risks and response delays. Some systems use cloud-based speech recognition solutions, requiring audio recordings to be uploaded to cloud servers for processing. This approach has the following problems: First, it depends on a network connection, and the system completely fails when the network is offline; second, it poses a high risk to data privacy, as the audio content may involve trade secrets or sensitive information, making it unsuitable for use in confidential settings; third, response delays are greatly affected by network fluctuations, making it difficult to achieve a real-time voice control experience.

[0008] In summary, existing technologies suffer from problems such as system complexity, high cost, fragmented experience, poor compatibility, and significant security risks in areas such as guide positioning, audio zoning playback, and voice-controlled central control linkage. There is an urgent need for a low-cost, low-latency, highly integrated, easy-to-deploy, and secure intelligent explanation and central control linkage system to meet the growing demand for smart exhibition hall construction.

Summary of the Invention

[0009] The purpose of this invention is to overcome the aforementioned shortcomings of the existing technology and provide an intelligent exhibition hall explanation and central control linkage system and method based on beacon positioning and dual-link transmission. This system achieves a high degree of integration of audio acquisition and real-time positioning through a dual-processor architecture; employs dual-link transmission to balance broadcast quality and real-time monitoring; achieves sub-meter accuracy based on beacon positioning; reuses the original exhibition hall audio system to achieve "talking wherever you go"; and combines offline voice recognition and location information to achieve secure, low-latency voice central control linkage, significantly reducing deployment costs and improving user experience.

[0010] To achieve the above objectives, the present invention provides the following technical solution: In a first aspect, the present invention provides an intelligent exhibition hall explanation and central control linkage system based on ESP-NOW protocol beacon positioning and dual-link audio transmission, comprising:
An explanation microphone terminal, which adopts a dual-processor architecture to collect the guide's voice, perform real-time positioning based on wireless beacons, and transmit audio data in parallel through a first and a second wireless communication link;
ESP-NOW protocol wireless beacons, deployed in each exhibit area to provide location identification information to the explanation microphone terminal;
A monitoring device, used to receive and play the audio data via the second wireless communication link. The monitoring device uses the ESP-NOW protocol and supports multiple forms: bone conduction headphones for guides (self-monitoring, latency 8-18ms), reuse of the existing ceiling speakers in the exhibition hall (enabling "explaining wherever the guide goes"), and visitor receiving headphones / customized amplifiers (suitable for outdoor explanations and group tours). All of these devices achieve low-latency, high-concurrency, stable transmission;

[0011] A central control server, on which a central control server program runs; the program receives the audio data, performs offline speech recognition, and issues control commands based on the recognition results and the location identification information;
At least one exhibit host, connected to the existing audio equipment in the exhibition hall, on which an exhibit host program runs. The exhibit host program receives the audio data and controls the audio equipment to play or mute according to whether the location identification information matches the host's preset identifier. The exhibit host program is compatible with various audio interfaces (such as 3.5mm analog audio, USB sound card, and digital audio interface) and can configure parameters such as equalization and gain in software, ensuring compatibility with exhibition hall audio equipment of different brands, models, and ages.

[0012] The explanation microphone terminal achieves real-time positioning by scanning wireless beacons and transmits audio data and location information to the exhibit hosts, the central control server, and the guide's monitoring device via the dual links.

[0013] The main processor, an ESP32-C5 chip, is responsible for audio acquisition and encoding, and for broadcasting the audio stream to all exhibit hosts and the central control server via the first link, a 5 GHz Wi-Fi network. The secondary processor, an ESP32-S3 chip, is responsible for scanning ESP-NOW wireless beacons, performing positioning calculations based on RSSI ranging combined with a trilateration or fingerprint matching algorithm, and pushing audio data to ESP-NOW headphones / amplifier devices via the ESP-NOW protocol. The main processor and the secondary processor synchronize audio and positioning data in real time over SPI.

[0014] Furthermore, the dual-link transmission includes: The first link uses UDP broadcast over the 5 GHz Wi-Fi band, with a destination address of 255.255.255.255:8087. The measured end-to-end latency is 15-40ms (P95: 45ms), the measured packet loss rate is <0.3%, and the bandwidth usage is 768kbps, only 0.09% of the local area network bandwidth, which preserves audio quality. The second link uses ESP-NOW broadcast on 2.4GHz channel 11. The measured end-to-end latency is 8-18ms (P95: 22ms), and the measured packet loss rate is <0.5%. Theoretically it supports 256 concurrent devices; in practice, 50 devices have been tested receiving stably. The two links support seamless switching: when the primary link experiences continuous packet loss, the system automatically falls back to the backup link, ensuring uninterrupted audio playback.
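The fallback behaviour described above can be sketched as a small state machine. This is only an illustrative sketch: the class name `LinkSelector`, the threshold of 5 consecutive losses, and the rule of returning to the primary link on the first good packet are assumptions, since the text does not state how many lost packets trigger the switch or how recovery is handled.

```python
class LinkSelector:
    """Chooses between the primary (Wi-Fi UDP) and backup (ESP-NOW) link.

    Hypothetical helper: the loss threshold and recovery rule are assumptions.
    """

    def __init__(self, loss_threshold: int = 5):
        self.loss_threshold = loss_threshold
        self.consecutive_losses = 0
        self.active = "primary"

    def report(self, primary_packet_ok: bool) -> str:
        """Record whether the latest primary-link packet arrived; return the active link."""
        if primary_packet_ok:
            # Assumed recovery rule: one good packet restores the primary link.
            self.consecutive_losses = 0
            self.active = "primary"
        else:
            self.consecutive_losses += 1
            if self.consecutive_losses >= self.loss_threshold:
                self.active = "backup"
        return self.active
```

A receiver would call `report()` once per expected packet interval and read audio from whichever link is returned.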

[0015] Furthermore, the exhibit host has a built-in location matching module that receives the audio data and location information packets broadcast by the explanation microphone terminal and parses out the identifier of the guide's current exhibit area. If the parsed exhibit area identifier matches this exhibit, the audio is played immediately through the existing sound system in the exhibition hall; otherwise, the host remains silent. This achieves automatic zone explanation, "explaining wherever the guide goes", while fully reusing the existing sound equipment in the exhibition hall.
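The play-or-mute decision on an exhibit host can be illustrated with a minimal sketch. It assumes the host keeps a set of beacon numbers bound to its own exhibit area and uses a 2-second silence timeout (the method steps give 2-3 seconds); the class name `ExhibitGate` and its methods are hypothetical, and real audio output and UDP reception are left out.

```python
class ExhibitGate:
    """Decides whether an exhibit host should play incoming audio.

    Hypothetical helper: names and the exact timeout are assumptions.
    """

    def __init__(self, local_beacons, timeout_s: float = 2.0):
        self.local_beacons = set(local_beacons)  # beacon numbers bound to this exhibit
        self.timeout_s = timeout_s
        self.last_valid = None  # arrival time of the last valid packet

    def on_packet(self, beacon_no: int, now: float) -> bool:
        """Return True if the audio in this packet should be played."""
        self.last_valid = now
        return beacon_no in self.local_beacons

    def is_waiting(self, now: float) -> bool:
        """Return True if no valid packet has arrived within the timeout."""
        return self.last_valid is None or (now - self.last_valid) >= self.timeout_s
```

For example, a host bound to beacons 5 and 6 plays a packet tagged with beacon 5 and stays silent for one tagged with beacon 9.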

[0016] Furthermore, the offline speech recognition module of the central control server uses the Vosk offline speech model, so speech recognition is completed locally without a network connection. The module loads the large Chinese model vosk-model-cn-0.22 (1.8GB) and assigns an independent recognition thread to each microphone number. The real-time factor (RTF) of a single thread is <0.3, and CPU utilization is <70% when multiple threads run concurrently. A local JSON configuration library stores the mapping between exhibit IP, communication port, and voice keywords by beacon number, and supports hot updates.
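To make the configuration lookup concrete, the following sketch shows one possible shape for the JSON library and a keyword similarity match over recognized text. The schema, the IP address, the port, the command codes, and the 0.6 similarity threshold are all hypothetical, and `difflib.SequenceMatcher` is a stand-in: the text does not say which similarity measure is used.

```python
import json
from difflib import SequenceMatcher

# Hypothetical configuration, keyed by beacon number (as a string).
CONFIG_JSON = """
{
  "7": {
    "ip": "192.168.1.107",
    "port": 9000,
    "keywords": {"start playing": "PLAY", "pause": "PAUSE", "lights on": "LIGHT_ON"}
  }
}
"""


def match_command(config: dict, beacon_no: int, text: str, threshold: float = 0.6):
    """Return (ip, port, command) for the best-matching keyword, or None."""
    entry = config.get(str(beacon_no))
    if entry is None:
        return None
    best_kw, best_score = None, 0.0
    for keyword in entry["keywords"]:
        score = SequenceMatcher(None, keyword, text).ratio()
        if score > best_score:
            best_kw, best_score = keyword, score
    if best_kw is None or best_score < threshold:
        return None
    return entry["ip"], entry["port"], entry["keywords"][best_kw]


config = json.loads(CONFIG_JSON)
```

Because the configuration is plain JSON, a hot update in this sketch amounts to reloading the file and replacing `config`.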

[0017] For instructions that clearly point to the current area, control instructions are sent immediately to the corresponding exhibit host; for ambiguous instructions, the movement trend is calculated from the historical location sequence to infer the most likely target area of the instruction. The voice commands include, but are not limited to, "start playing", "pause", "next track", "previous track", "repeat play", "lights on", "lights off", "switch scene" and other multimedia and environmental control commands for the exhibition hall.
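The text does not specify how the movement trend is computed. One simple possibility, shown here purely as an assumption, is to take the displacement between the oldest and newest positions in the history and pick the area whose centre lies most nearly in that direction (largest cosine similarity). The function name and the use of area centres are hypothetical.

```python
import math


def infer_target_area(history, area_centers):
    """Guess the area the guide is heading towards.

    history: list of (x, y) positions, oldest first.
    area_centers: dict mapping area name -> (x, y) centre (hypothetical input).
    Returns an area name, or None if there is no movement to analyse.
    """
    if len(history) < 2:
        return None
    (x0, y0), (x1, y1) = history[0], history[-1]
    vx, vy = x1 - x0, y1 - y0
    speed = math.hypot(vx, vy)
    if speed == 0:
        return None
    best, best_cos = None, -2.0
    for name, (cx, cy) in area_centers.items():
        dx, dy = cx - x1, cy - y1
        dist = math.hypot(dx, dy)
        if dist == 0:
            return name  # already standing at this area's centre
        cos = (vx * dx + vy * dy) / (speed * dist)
        if cos > best_cos:
            best, best_cos = name, cos
    return best
```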

[0018] In a second aspect, this invention provides a method for intelligent exhibition hall explanation and central control linkage based on ESP-NOW protocol beacon positioning and dual-link audio transmission, applied to the aforementioned system, including the following steps:
S1: Deploy wireless beacons in each exhibition area of the exhibition hall, bind a list of beacon numbers to each exhibition area and store it on the exhibit hosts and the central control server, and complete the initialization and connection of the explanation microphone terminal and the monitoring device;
S2: The guide enters the exhibition hall carrying the microphone terminal. The secondary processor periodically broadcasts ESP-NOW location request packets, receives multiple beacon signals, and calculates the current precise location based on the RSSI ranging model and the positioning algorithm;
S3: The main processor acquires audio and exchanges data with the secondary processor via SPI: it sends the audio data, microphone number, and button status, and receives the positioning beacon number in return. The data is encapsulated into a standard 1027-byte packet with a fixed layout: bytes 1-1024 are PCM audio data, byte 1025 is the microphone number, byte 1026 is the core beacon number, and byte 1027 is the button status (0 = not triggered, 1 = recognition triggered);
S4: Data is transmitted in parallel via the dual links: the first link broadcasts audio data and location information to all exhibit hosts and the central control server over the 5 GHz Wi-Fi network; the second link pushes audio data to the guide's monitoring device via the ESP-NOW protocol;
S5: The exhibit host listens on UDP port 8087, receives data packets, and parses the beacon number. If it matches the local preset identifier, the host plays the audio through the original speakers (delay 20-40ms). If it does not match, the host clears the buffer and remains silent. If no valid data packets arrive for 2-3 seconds, the host enters a silent waiting state;
S6: The central control server distributes audio to the corresponding recognition thread according to the microphone number. When the button status is 1, offline speech recognition is started; combined with the beacon number, the corresponding JSON configuration is loaded to complete keyword similarity matching;
S7: The central control server sends control commands to the target exhibit host via UDP. After executing the commands, the exhibit host sends back status feedback. The total latency of the entire voice control process is <350ms.
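The fixed 1027-byte packet layout from step S3 maps directly onto a byte structure. The sketch below packs and parses such a packet with Python's `struct` module; the function and field names are illustrative and not taken from the original firmware.

```python
import struct

AUDIO_BYTES = 1024
PACKET_SIZE = 1027
# 1024 raw PCM bytes followed by three unsigned single-byte fields, no padding.
_PACKET = struct.Struct("<1024sBBB")


def pack_packet(pcm: bytes, mic_no: int, beacon_no: int, button: int) -> bytes:
    """Build one packet: PCM audio, microphone number, core beacon number, button status."""
    if len(pcm) != AUDIO_BYTES:
        raise ValueError("PCM payload must be exactly 1024 bytes")
    return _PACKET.pack(pcm, mic_no, beacon_no, button)


def unpack_packet(packet: bytes) -> dict:
    """Split a received packet back into its four fields."""
    if len(packet) != PACKET_SIZE:
        raise ValueError("packet must be exactly 1027 bytes")
    pcm, mic_no, beacon_no, button = _PACKET.unpack(packet)
    return {"pcm": pcm, "mic_no": mic_no, "beacon_no": beacon_no, "button": button}
```

If the PCM payload uses the 48kHz, 16-bit mono format described for the microphone, 1024 bytes hold 512 samples, about 10.7ms of audio per packet.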

[0019] Furthermore, in step S2, the distance between the terminal and each beacon is calculated based on the RSSI ranging model. The distance formula is:

$$d = 10^{\frac{RSSI_0 - RSSI}{10n}}$$

where $d$ is the distance between the explanation terminal and the beacon, in meters; $RSSI_0$ is the reference RSSI value at 1 meter from the beacon, in dBm, obtained through on-site calibration; $RSSI$ is the RSSI value received by the terminal, in dBm; and $n$ is the environmental attenuation factor, with a value range of 2 to 4, calibrated from actual measurements of the exhibition hall environment.
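As a worked illustration of the ranging model, the sketch below assumes the standard log-distance path-loss form, in which the distance is 10 raised to (reference RSSI minus received RSSI) divided by (10 times the attenuation factor). The function name and the sample calibration values are hypothetical.

```python
def rssi_to_distance(rssi_dbm: float, ref_rssi_dbm: float, n: float) -> float:
    """Estimate distance in meters from a received RSSI value.

    ref_rssi_dbm: RSSI measured 1 meter from the beacon (site-calibrated).
    n: environmental attenuation factor (2 to 4 according to the text).
    """
    return 10 ** ((ref_rssi_dbm - rssi_dbm) / (10.0 * n))


# Example with assumed calibration: reference -59 dBm at 1 m, n = 2.0.
# A reading 20 dB weaker than the reference corresponds to 10 m.
distance_m = rssi_to_distance(-79.0, -59.0, 2.0)
```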

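[0020] Furthermore, the method for calculating the precise location of the terminal based on the trilateration algorithm in step S2 includes: Let the number of beacons involved in the positioning be $m$, the coordinates of the $i$-th beacon be $(x_i, y_i)$, and the corresponding distance measurement be $d_i$. Then the explanation terminal location $(x, y)$ satisfies the overdetermined system of equations:

$$(x - x_i)^2 + (y - y_i)^2 = d_i^2, \quad i = 1, 2, \dots, m$$

[0021] Linearizing the system of equations, with the first beacon as the reference, we get:

$$2(x_i - x_1)\,x + 2(y_i - y_1)\,y = d_1^2 - d_i^2 + x_i^2 - x_1^2 + y_i^2 - y_1^2, \quad i = 2, \dots, m$$

[0022] Represented in matrix form:

$$AX = b, \quad A = \begin{bmatrix} 2(x_2 - x_1) & 2(y_2 - y_1) \\ \vdots & \vdots \\ 2(x_m - x_1) & 2(y_m - y_1) \end{bmatrix}, \quad X = \begin{bmatrix} x \\ y \end{bmatrix}, \quad b = \begin{bmatrix} d_1^2 - d_2^2 + x_2^2 - x_1^2 + y_2^2 - y_1^2 \\ \vdots \\ d_1^2 - d_m^2 + x_m^2 - x_1^2 + y_m^2 - y_1^2 \end{bmatrix}$$

[0023] where $A$ is an $(m-1) \times 2$ matrix and $b$ is an $(m-1)$-dimensional column vector. The least squares method is used to solve for $X$: $X = (A^T A)^{-1} A^T b$.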
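The least-squares solution can be sketched in plain Python. The sketch subtracts the first beacon's circle equation from each of the others to obtain linear rows of the form 2(x_i - x_1)x + 2(y_i - y_1)y = d_1² - d_i² + x_i² - x_1² + y_i² - y_1², then solves the 2×2 normal equations directly. The function name is hypothetical, and no weighting or outlier rejection is attempted.

```python
def trilaterate(beacons, distances):
    """Least-squares 2D position from beacon coordinates and measured distances.

    beacons: list of (x_i, y_i); distances: list of d_i; at least 3 beacons.
    """
    if len(beacons) != len(distances) or len(beacons) < 3:
        raise ValueError("need matching lists with at least 3 beacons")
    (x1, y1), d1 = beacons[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(beacons[1:], distances[1:]):
        rows.append((2.0 * (xi - x1), 2.0 * (yi - y1)))
        rhs.append(d1**2 - di**2 + xi**2 - x1**2 + yi**2 - y1**2)
    # Normal equations (A^T A) X = A^T b for the two unknowns x and y.
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * r for (a, _), r in zip(rows, rhs))
    t2 = sum(b * r for (_, b), r in zip(rows, rhs))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("beacons are collinear; position is not determined")
    x = (s22 * t1 - s12 * t2) / det
    y = (s11 * t2 - s12 * t1) / det
    return x, y
```

With noise-free distances the true position is recovered exactly; with noisy RSSI-derived distances the result is the least-squares compromise among the beacons.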

[0024] Furthermore, the method for calculating the current position in step S2 also includes a location-fingerprint matching method: Beacon signal strength vectors are collected in advance at reference points throughout the exhibition hall to construct a location fingerprint database $F = \{(S_j, P_j)\},\ j = 1, \dots, N$, where $S_j = (s_{j,1}, \dots, s_{j,M})$ is the signal strength vector of the $M$ beacons received at the $j$-th reference point and $P_j = (x_j, y_j)$ is the corresponding coordinates. Real-time scanning obtains the signal strength vector $S = (s_1, \dots, s_M)$, and its Euclidean distance to each vector in the fingerprint database is calculated:

$$D_j = \sqrt{\sum_{k=1}^{M} (s_k - s_{j,k})^2}$$

[0025] The $K$ reference points with the smallest distances are selected, and the current position is determined as the weighted average of their coordinates:

$$(x, y) = \frac{\sum_{j \in \mathcal{K}} w_j \,(x_j, y_j)}{\sum_{j \in \mathcal{K}} w_j}, \quad w_j = \frac{1}{D_j + \varepsilon}$$

[0026] where $\mathcal{K}$ is the set of the $K$ selected reference points and $\varepsilon$ is a very small positive number, used to prevent the denominator from being zero.
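The fingerprint method corresponds to a weighted K-nearest-neighbour lookup. The sketch below assumes inverse-distance weights of the form 1 / (distance + epsilon), which is the usual choice and matches the role of the small positive constant; the function name, the default K of 3, and the sample database are hypothetical.

```python
import math


def wknn_locate(fingerprints, scan, k: int = 3, eps: float = 1e-6):
    """Weighted K-nearest-neighbour position estimate.

    fingerprints: list of (rssi_vector, (x, y)) reference entries.
    scan: rssi_vector measured in real time, same beacon order as the database.
    """
    # Euclidean distance in signal space to every reference point.
    scored = [(math.dist(vec, scan), xy) for vec, xy in fingerprints]
    nearest = sorted(scored, key=lambda item: item[0])[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    x = sum(w * xy[0] for w, (_, xy) in zip(weights, nearest)) / total
    y = sum(w * xy[1] for w, (_, xy) in zip(weights, nearest)) / total
    return x, y


# Hypothetical two-beacon database with three reference points.
database = [
    ([-50.0, -70.0], (0.0, 0.0)),
    ([-70.0, -50.0], (10.0, 0.0)),
    ([-60.0, -60.0], (5.0, 0.0)),
]
```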

[0027] Furthermore, the method for parsing voice commands in step S6, which combines the guide's current location information, includes: Let the current coordinates of the guide be $(x, y)$ and the set of exhibition areas be $\{R_1, R_2, \dots, R_Q\}$. Each area $R_q$ is defined as a polygon with vertex coordinate set $V_q = \{(x_{q,1}, y_{q,1}), \dots, (x_{q,n_q}, y_{q,n_q})\}$. The ray casting method is used to determine whether the current point is within a given exhibition area: from $(x, y)$, draw a horizontal ray to the right and count its intersections with the sides of the area's polygon. If the count is odd, the point is inside the area; if the count is even, the point is outside the area. The text commands obtained from speech recognition are matched against a preset command mapping table to determine the command type and target exhibit. If the command explicitly names an exhibit, the named exhibit is taken as the target; if it does not, the command is issued to the current area determined by the ray casting method.
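The ray casting test can be written compactly: for each polygon edge that straddles the horizontal line through the point, compute where the edge crosses that line and toggle an inside flag when the crossing lies to the right of the point. The function names and the dictionary of areas are illustrative; points lying exactly on an edge are not treated specially in this sketch.

```python
def point_in_polygon(px: float, py: float, vertices) -> bool:
    """Ray casting: True if (px, py) lies inside the polygon given by vertices."""
    inside = False
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:
                inside = not inside  # odd number of crossings means inside
    return inside


def find_area(px: float, py: float, areas):
    """Return the name of the first area containing the point, or None."""
    for name, vertices in areas.items():
        if point_in_polygon(px, py, vertices):
            return name
    return None
```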

[0028] Compared with the prior art, the present invention has the following advantages and beneficial effects: First, the system is highly integrated, significantly reducing costs. This invention employs a dual ESP32 processor architecture, integrating audio acquisition, real-time positioning, and dual-link transmission into a single terminal. Tour guides only need to wear one device to complete all functions, eliminating the need for additional positioning tags, remote controls, or monitoring modules. The system directly reuses the exhibition hall's existing audio equipment, eliminating the need for additional dedicated zone speakers and audio matrices. Hardware costs are reduced by over 70%, wiring costs by 90%, and the debugging cycle is shortened from several weeks to 3-5 days.

[0029] Secondly, the positioning is precise and reliable, enabling seamless following. This invention integrates RSSI trilateration and fingerprint matching algorithms, achieving a static positioning accuracy of 0.65 meters and a dynamic walking accuracy of 0.92 meters, with a positioning refresh rate of 10Hz. Kalman filtering smooths RSSI fluctuations, reducing the ranging error from ±1.2 meters to ±0.5 meters. The guide's explanations follow the exhibits, with audio fading in and out during transitions between exhibition areas, ensuring a seamless experience for the audience.
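The RSSI smoothing step can be illustrated with a one-dimensional Kalman filter that models the true signal level as nearly constant. The noise variances below are placeholder assumptions, not values from the text, and the class name is hypothetical.

```python
class RssiKalman:
    """Scalar Kalman filter for smoothing a stream of RSSI readings.

    q: process noise variance (assumed); r: measurement noise variance (assumed).
    """

    def __init__(self, q: float = 0.05, r: float = 4.0):
        self.q = q
        self.r = r
        self.x = None  # current estimate in dBm
        self.p = 1.0   # estimate variance

    def update(self, z: float) -> float:
        """Fold one raw RSSI reading into the estimate and return it."""
        if self.x is None:
            self.x = float(z)  # initialise from the first reading
            return self.x
        self.p += self.q                  # predict: level assumed constant
        gain = self.p / (self.p + self.r)
        self.x += gain * (z - self.x)     # correct towards the measurement
        self.p *= (1.0 - gain)
        return self.x
```

Feeding the smoothed value, rather than the raw reading, into the ranging model damps sudden jumps caused by multipath or people walking past.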

[0030] Third, the dual-link transmission balances quality and real-time performance. The first link uses 5 GHz Wi-Fi UDP broadcast with raw audio pass-through (93kbps) and a latency of 15-40ms to ensure broadcast audio quality; the second link uses ESP-NOW broadcast with raw audio pass-through (90kbps) and a latency of only 12ms, enabling imperceptible-delay monitoring. The two links operate in parallel and independently, without interfering with each other. If the first link fails, the system automatically falls back to the second link to ensure an uninterrupted presentation.

[0031] Fourth, offline voice control ensures security and intelligence. Utilizing a large-scale offline voice recognition model from Vosk, it requires no network connection, avoiding the risk of data privacy leaks and making it suitable for confidential locations. The recognition accuracy reaches 96.8%, and combined with real-time location and historical trajectory data, the accuracy of fuzzy command inference reaches 93.2%. Guides can control lighting, video, models, and other equipment via voice, with a response time of less than 0.4 seconds, completely freeing up their hands.

[0032] Fifth, it offers flexible deployment and strong scalability. The beacon is battery-powered and fixed with 3M adhesive, requiring no wiring; the exhibit host is plug-and-play and automatically discovers the network; the system supports multiple guides working concurrently and can be upgraded remotely via OTA. The deployment cost per exhibition hall is only 30,000-50,000 yuan, approximately 20%-30% of that of traditional systems, making it particularly suitable for newly built exhibition halls and intelligent renovations of existing venues.

Attached Figure Description

[0033] Figure 1 is a schematic diagram of the overall system architecture in an embodiment of the present invention;
Figure 2 is a schematic diagram of the dual-link transmission principle in an embodiment of the present invention;
Figure 3 is a flowchart of the exhibit host location matching and audio playback logic in an embodiment of the present invention;
Figure 4 is a flowchart of the voice recognition and command issuance process of the central control server in an embodiment of the present invention.

Detailed Implementation

[0034] The present invention will be further described in detail below with reference to experimental examples and specific embodiments. However, this should not be construed as limiting the scope of the above-mentioned subject matter of the present invention to the following embodiments; all technologies implemented based on the content of the present invention fall within the scope of the present invention.

[0035] Example 1
As shown in Figure 1, this embodiment provides an intelligent exhibition hall explanation and central control linkage system based on ESP-NOW protocol beacon positioning and dual-link audio transmission. The system was deployed in the "Future Technology Exhibition Hall" (building area of 1200 square meters, floor height of 8 meters) of a municipal science and technology museum. After 3 months of trial operation, the system has proven stable and reliable.

[0036] System components: Explanation microphone terminal: This embodiment uses a 5 GHz Wi-Fi / ESP-NOW dual-link explanation microphone terminal (model: HlcatMicMaster6.1), which adopts a dual ESP32 processor architecture (an ESP32-C5 main processor and an ESP32-S3 secondary processor), measures 85mm×45mm×18mm, weighs 48 grams, and has an IP54 protection rating. The terminal is equipped with a high-sensitivity microphone (65dB signal-to-noise ratio), an OLED display (0.96 inches, 128×64 pixels) showing battery level, signal strength, and current exhibition area, and three-color status indicator lights. The terminal is worn on the guide's collar with a back clip or a magnetic lanyard, and has a continuous working time of ≥10 hours (1200mAh battery).

[0037] ESP-NOW protocol wireless beacon: This embodiment uses an ESP-NOW wireless beacon (model: HlcatBeacon6.1), which is built on the ESP32-S3-WROOM-1U module and supports dual-mode operation of Bluetooth 5.2 and ESP-NOW. 32 beacons are deployed throughout the hall, at a density of one beacon per 30 square meters and a spacing of 5-8 meters. Beacon parameters: transmit power adjustable from -4dBm to +4dBm (set to 0dBm in this embodiment), broadcast interval 100ms (adjustable range 20ms-1000ms), battery life 2 years (using an ER14505 lithium battery, 2700mAh). Each beacon data packet contains: Major (1 byte, exhibition area code), Minor (2 bytes, beacon sequence number), Mic (1 byte, microphone identifier), and temperature (1 byte).

[0038] In this embodiment, the monitoring device is a set of ESP-NOW headphones / amplifiers with a customized bone conduction design (model: HlcatEarbuds6.1) that receive audio via the ESP-NOW protocol. The bone conduction design lets the guide hear ambient sounds while monitoring their own voice, enhancing safety. The headphones include a built-in ESP32-S3-WROOM-1U module, a PCM5102A audio DAC, and a bone conduction speaker. The headphones run for 8 hours on a single charge, and the charging case provides three full recharges.

[0039] The central control server, which runs the central control server program (name: HlcatNetService4.0), is configured with an Intel i5 processor (6 cores, 12 threads, 3.4GHz), 16GB DDR4 memory, a 512GB solid-state drive, and a Windows 11 Ultimate operating system. The program, written in C# on the .NET Framework, integrates the Vosk offline speech recognition module (loading the large Chinese model vosk-model-cn-0.22, 1.8GB) and assigns an independent recognition thread to each microphone number. The single-thread real-time factor (RTF) is <0.3, and CPU utilization is <70% under multi-threaded concurrency. The server is equipped with a local JSON configuration library that stores the mapping between exhibit IPs, communication ports, and voice keywords by beacon number. It also controls lighting, motorized curtains, sliding screens, and other equipment via an RS485 relay module (8 channels, 5V control).

[0040] The exhibit hosts, which run the exhibit host program (name: HlcatNetService4.0), comprise 12 units (1 to n), all using industrial-grade embedded motherboards (model: RK3588Edge) equipped with a Rockchip RK3588 processor (8 cores, 2.4GHz), 8GB LPDDR4X memory, and 64GB eMMC storage. Each host is connected to the existing BOSE DS40F ceiling speakers (1 to n) in the exhibition hall via a 3.5mm audio cable.

[0041] System workflow:
Guides wear terminals as they enter the exhibition hall; the terminals automatically scan beacons for real-time location tracking;
The terminal transmits audio and location data via dual links: Wi-Fi broadcasts to all exhibit hosts and the central control server, while ESP-NOW pushes the data to the monitoring devices;
Each exhibit host decides whether to play audio based on the location information, achieving "explaining wherever the guide goes";
The central control server performs offline voice recognition on the audio, recognizes control commands, and sends commands to the corresponding exhibit host in combination with location information;
The exhibit host executes the commands to control audio, lighting, multimedia, and other equipment.

[0042] Deployment results: After the system went online, tour guides no longer needed to carry remote controls or operate the central control panel, allowing them to focus on the content they were explaining; the audience experience rating improved from 4.0 to 4.8 out of 5; and the response time of the central control operation was reduced from 5-10 seconds for manual operation to 0.8-1.5 seconds for voice control.

[0043] Example 2 This embodiment details the dual-processor hardware architecture and the hardware-software collaboration mechanism of the microphone terminal, whose core functions are raw PCM audio stream pass-through, SPI synchronous communication, and dual-link parallel broadcasting.

[0044] The main processor (ESP32-C5-QFN48 in this embodiment) has the following core parameters: RISC-V single core, 240MHz main frequency, 4MB ROM, support for WiFi 6 (802.11ax) and 802.11a / b / g / n / ac (2.4 / 5GHz dual-band). It is dedicated to audio acquisition, data encapsulation, and transmission over the 5GHz WiFi link.

[0045] Peripheral connections
Microphone: ICS-43434 microphone, connected via the I²S interface, 48kHz sampling rate, 16-bit mono; captures raw PCM audio data without any encoding or compression.
Storage: TF card slot (supports up to 32GB) for local recording backup (optional function).
Display: 0.96-inch OLED screen, connected via the I²C interface (address 0x3C); displays device status (battery level, signal strength, current beacon label, etc.).
Buttons: 3×2 matrix keypad connected to GPIO for power on / off, pairing, mode switching, and volume adjustment.

[0046] The secondary processor (ESP32-S3-WROOM-1U in this embodiment) has the following core parameters: Xtensa® dual-core 32-bit LX7, 240MHz clock speed, supports 2.4GHz Wi-Fi (802.11b / g / n), and focuses on beacon scanning, label updating and ESP-NOW link transmission.

[0047] Core functions
ESP-NOW beacon scanning: The secondary processor scans for beacons by periodically sending ESP-NOW broadcast packets (scanning interval 200ms). Each packet contains a microphone identifier (1 byte), and the processor receives the feedback signals returned by the surrounding wireless beacons.
Beacon label update: An RSSI filtering algorithm selects the valid beacons (threshold ≥ -80dBm) and sorts their labels by signal strength, which completes the real-time update of the positioning information.
ESP-NOW transmission: One-to-many transmission is achieved through WiFi MAC layer broadcasting on 2.4GHz channel 11 (2462MHz), which avoids conflict with the exhibition hall's 5GHz WiFi band (channel 11 and above) and ensures transmission stability.
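The beacon label update step can be sketched as follows. This is a minimal illustration, assuming the scan result is available as a mapping from beacon label to filtered RSSI; the function names `rank_valid_beacons` and `current_beacon` are hypothetical and do not come from the embodiment.

```python
from typing import Dict, List, Optional

RSSI_THRESHOLD_DBM = -80.0  # validity threshold stated in the text


def rank_valid_beacons(readings: Dict[int, float],
                       threshold: float = RSSI_THRESHOLD_DBM) -> List[int]:
    """Return the labels of beacons with RSSI >= threshold, strongest first."""
    valid = [(label, rssi) for label, rssi in readings.items() if rssi >= threshold]
    valid.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in valid]


def current_beacon(readings: Dict[int, float]) -> Optional[int]:
    """Pick the strongest valid beacon, or None when no beacon is valid."""
    ranked = rank_valid_beacons(readings)
    return ranked[0] if ranked else None


if __name__ == "__main__":
    scan = {5: -52.0, 6: -71.0, 9: -85.0}  # hypothetical scan result
    print(rank_valid_beacons(scan), current_beacon(scan))
```

Returning `None` when nothing passes the threshold leaves the caller free to keep the last valid label, which is one way to feed the fallback behaviour described for location failure.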

[0048] Master-slave processor communication mechanism (SPI synchronous core)
Physical interface: SPI full-duplex synchronous communication is adopted, with the main processor as the SPI master and the secondary processor as the SPI slave. The communication clock is 40MHz to ensure high-speed real-time data transmission.

[0049] Data protocol (1027-byte standard data packet)
Master → slave transmission: After the main processor collects 1024 bytes of raw PCM audio data, it encapsulates them into a 1027-byte data packet with the following format: [Audio Data (1024 bytes)] + [Beacon Identifier (1 byte)] + [Microphone Identifier (1 byte)] + [Button Status (1 byte)], where the button status field is 0 = not pressed, 1 = pressed. There are no duplicate fields, so the packet structure is concise and free of redundancy.
Slave → master feedback: After receiving the data packet, the secondary processor updates only the beacon identifier field (byte 1025), keeps the remaining fields (1024 bytes of audio, microphone number, and button status) unchanged, and synchronously returns the complete 1027-byte data packet.
Verification mechanism: After receiving the feedback data packet, the main processor verifies the 1024 bytes of audio data, the microphone number (byte 1026), and the button status (byte 1027) byte by byte, skipping only the beacon identifier (byte 1025), which is allowed to differ because it may have been updated. Once verification succeeds, one communication cycle is complete, ensuring data consistency.
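The packet round trip described in this paragraph can be modelled in a few lines. The sketch below assumes the field order given here (audio, beacon identifier, microphone identifier, button status); the helper names are hypothetical, and the real firmware would run on the two ESP32 processors rather than in Python.

```python
import struct

AUDIO_LEN = 1024
PACKET_FMT = f"<{AUDIO_LEN}sBBB"          # audio, beacon id, microphone id, button
PACKET_LEN = struct.calcsize(PACKET_FMT)  # 1027 bytes


def build_packet(audio: bytes, beacon_id: int, mic_id: int, button: int) -> bytes:
    """Main processor side: wrap 1024 bytes of PCM into a 1027-byte packet."""
    if len(audio) != AUDIO_LEN:
        raise ValueError("audio payload must be exactly 1024 bytes")
    return struct.pack(PACKET_FMT, audio, beacon_id, mic_id, button)


def slave_update(packet: bytes, new_beacon_id: int) -> bytes:
    """Secondary processor side: overwrite only the beacon byte (byte 1025)."""
    return packet[:AUDIO_LEN] + bytes([new_beacon_id]) + packet[AUDIO_LEN + 1:]


def verify_feedback(sent: bytes, received: bytes) -> bool:
    """Compare every byte except the beacon identifier, which may differ."""
    if len(sent) != PACKET_LEN or len(received) != PACKET_LEN:
        return False
    return (sent[:AUDIO_LEN] == received[:AUDIO_LEN]
            and sent[AUDIO_LEN + 1:] == received[AUDIO_LEN + 1:])


if __name__ == "__main__":
    out = build_packet(bytes(AUDIO_LEN), beacon_id=0, mic_id=1, button=0)
    back = slave_update(out, new_beacon_id=5)
    print(len(out), verify_feedback(out, back))
```

Skipping exactly one byte in the comparison keeps the check cheap while still catching corruption of the audio, the microphone number, or the button state.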

[0050] Synchronization mechanism: The main processor carries a microsecond-level timestamp (based on the ESP32's 64-bit timer) in the data packet. The secondary processor updates the beacon label at the same moment it receives the audio data, which associates audio acquisition precisely with the positioning information without additional synchronization delay.

[0051] Actual test data
SPI transmission rate: 4.2MB / s on average, which meets the requirement for fast transmission of 1027-byte data packets.
CPU utilization: 12% for the main processor and 8% for the secondary processor, so computing load is low and there is no performance bottleneck.
Total latency from audio acquisition to dual-link broadcast: 18ms maximum, 12ms on average, which meets the requirement for real-time explanation.
Overall power consumption: 150mA (3.3V) on average when the main processor is working, 80mA on average when the secondary processor is working, and <5mA in standby mode, giving long battery life.

[0052] Dual-link broadcast implementation
Main processor: After successful verification, it sends the 1027-byte data packet by UDP broadcast over the 5GHz WiFi band, covering all exhibit hosts and the central control server.
Secondary processor: It synchronously broadcasts the same 1027-byte data packet via the ESP-NOW protocol for the guide's listening devices to receive.
The two links work independently and in parallel and carry identical data, which provides a redundant backup for audio pass-through.

[0053] Exception handling
SPI communication timeout: After a communication timeout the transfer is retried automatically 3 times. If the retries fail, the SPI bus is restarted to restore communication.
Secondary processor deadlock: The main processor resets the secondary processor by pulling its EN pin low via GPIO for 500ms; the recovery time is <1.5 seconds.
Location failure: If there is no beacon label update for 3 consecutive seconds, the terminal automatically enters "blind talk mode". The main and secondary processors still maintain dual-link audio pass-through, and the exhibit host either plays based on the last valid beacon label or switches to the default area audio.
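The retry rule and the 3-second "blind talk mode" rule can be sketched as below. This is an illustrative model only: the callables `transfer` and `restart_bus` stand in for the real SPI driver, the class name `BeaconWatchdog` is hypothetical, and timestamps are passed in explicitly so the logic can be exercised without hardware.

```python
from typing import Callable, Optional

MAX_RETRIES = 3          # retries after the first timed-out attempt
BLIND_TIMEOUT_S = 3.0    # seconds without a beacon update before blind talk mode


def transfer_with_retry(transfer: Callable[[], bool],
                        restart_bus: Callable[[], None],
                        retries: int = MAX_RETRIES) -> bool:
    """Try once, retry up to `retries` times, then restart the bus on failure."""
    for _ in range(1 + retries):
        if transfer():
            return True
    restart_bus()
    return False


class BeaconWatchdog:
    """Tracks the last beacon update and reports when blind talk mode applies."""

    def __init__(self, start: float, timeout_s: float = BLIND_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.last_update = start
        self.last_beacon: Optional[int] = None

    def update(self, beacon_id: int, now: float) -> None:
        self.last_beacon = beacon_id
        self.last_update = now

    def blind_mode(self, now: float) -> bool:
        """True once no beacon update has arrived for `timeout_s` seconds."""
        return now - self.last_update >= self.timeout_s
```

In blind talk mode `last_beacon` still holds the last valid label, which matches the fallback of playing on the last valid beacon.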

[0054] Example 3 As shown in Figure 2, this embodiment details the engineering implementation, performance optimization, and redundancy mechanisms of dual-link transmission.

[0055] First Link L1 (5GHz WiFi broadcast)
Network architecture: UDP LAN broadcast is used, with the target IP address 255.255.255.255 and port number 5004. All exhibit hosts and the central control server listen on this port to receive the audio data stream.
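A minimal sketch of the L1 socket setup, assuming standard BSD sockets as exposed by Python's `socket` module. The function names are hypothetical, and the embodiment's own sender runs on the ESP32 and its receivers in C#, so this only illustrates the addressing scheme (limited broadcast to port 5004).

```python
import socket
from typing import Optional

BROADCAST_ADDR = ("255.255.255.255", 5004)  # address and port stated in the text
PACKET_LEN = 1027


def make_sender() -> socket.socket:
    """UDP socket allowed to send to the limited broadcast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def make_receiver(port: int = 5004, host: str = "") -> socket.socket:
    """UDP socket bound to the audio port on all interfaces by default."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def receive_packet(sock: socket.socket) -> Optional[bytes]:
    """Return one datagram if it has the expected 1027-byte length, else None."""
    data, _addr = sock.recvfrom(2048)
    return data if len(data) == PACKET_LEN else None


if __name__ == "__main__":
    tx = make_sender()
    tx.sendto(bytes(PACKET_LEN), BROADCAST_ADDR)  # one silent packet
    tx.close()
```

Because the datagram is broadcast, every host listening on port 5004 receives the same packet, and each decides locally whether to play it.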

[0056] Audio format: 16bit / 48kHz mono raw PCM pass-through (sampling rate 48kHz, bit depth 16 bits, mono).
Length of data collected per acquisition: 1024 bytes
Custom packet: 1027 bytes in total, of which:
Bytes 1-1024: 16-bit / 48kHz mono PCM audio data
Byte 1025: Microphone ID
Byte 1026: Beacon Identifier (used for exhibition area matching)
Byte 1027: Button State
Packet transmission interval: approximately 10.67ms (1024 bytes = 512 samples at 48kHz); packet transmission rate: approximately 93.75 packets / second
Transmission protocol: Pure UDP broadcast, no encoding, no fragmentation, no RTP. Each UDP packet carries a 3-byte custom header plus 1024 bytes of PCM audio data. By doing without sequence numbers, timestamps, retransmissions, and complex verification, it achieves low-latency transparent transmission with minimal overhead.

[0057] Jitter buffer: The receiver adopts a minimal buffering mechanism without setting a depth adaptive buffer. The audio data is written directly to the playback interface after it arrives, ensuring the lowest possible transmission latency.

[0058] Packet loss handling: No loss concealment is applied, so a lost packet manifests as a brief silence. Because the 5GHz WiFi band is clean and subject to little interference, the packet loss rate is extremely low in the stable environment of the exhibition hall and does not affect the listening experience.

[0059] Second Link L2 (ESP-NOW push)
Transmission mechanism: Communication takes place directly at the ESP-NOW MAC layer, with no WiFi connection or pairing required; data is pushed by broadcast on the 2.4GHz band. The latest ESP-NOW library supports a maximum single packet length of 1047 bytes, and this solution sends one packet every 10.67ms.

[0060] Data packet structure:
Packet header: 3 bytes (device number, beacon number, button status)
Audio data: 1024 bytes of 16bit / 48kHz mono raw PCM data
Total length of a single packet: 1027 bytes
No CVSD or OPUS encoding and no additional verification or retransmission are used; the original audio is passed through end to end.

[0061] Anti-interference design:
Channel selection: 2.4GHz channel 11 is used as a fixed channel to avoid the mainstream channels used in the exhibition hall and reduce co-channel interference.
Antenna solution: An I-PEX external antenna improves receiving sensitivity and coverage stability.
Transmission strategy: Pure broadcast with no retransmission and no request-response mechanism, to keep link latency as low as possible.

[0062] Dual-link coordination and redundancy
Primary and backup links work in parallel: In normal scenarios, L1 (5GHz WiFi UDP broadcast) serves as the primary link, while L2 (ESP-NOW) serves as a low-latency backup link, with both links simultaneously transmitting the same audio stream.

[0063] Primary / standby switchover logic: If the main link L1 does not receive valid data packets for a continuous period of time, the receiving end automatically switches to the ESP-NOW link to achieve seamless service degradation and ensure uninterrupted audio.
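The switchover rule can be sketched as a small receiver-side selector. The text does not state how long L1 must be silent before the receiver falls back to L2, so the 0.1-second default below is purely an assumed placeholder, and the class name `LinkSelector` is hypothetical.

```python
from typing import Optional


class LinkSelector:
    """Decides, per received packet, whether it should be played."""

    def __init__(self, l1_timeout_s: float = 0.1) -> None:
        # Assumed value: the source only says "a continuous period of time".
        self.l1_timeout_s = l1_timeout_s
        self.last_l1: Optional[float] = None

    def l1_alive(self, now: float) -> bool:
        """True while a valid L1 packet has been seen within the timeout."""
        return self.last_l1 is not None and (now - self.last_l1) <= self.l1_timeout_s

    def on_packet(self, link: str, now: float) -> bool:
        """Return True if a packet arriving on `link` at time `now` is played."""
        if link == "L1":
            self.last_l1 = now
            return True
        # L2 (ESP-NOW) packets are used only while L1 is considered down.
        return not self.l1_alive(now)


if __name__ == "__main__":
    sel = LinkSelector()
    print(sel.on_packet("L1", 0.00))   # primary link packet is played
    print(sel.on_packet("L2", 0.005))  # duplicate on backup link is dropped
    print(sel.on_packet("L2", 0.50))   # L1 silent: backup link takes over
```

Since both links carry identical 1027-byte packets, dropping the L2 copy while L1 is healthy avoids playing the same audio twice, and switching back happens automatically as soon as L1 packets reappear.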

[0064] Link monitoring: Terminals periodically report link quality information such as RSSI and packet loss, while the central control unit monitors the link status in real time to ensure stable system operation.

[0065] Actual performance indicators (showroom environment, multiple devices online simultaneously)
End-to-end latency: L1 (5GHz WiFi UDP broadcast) 15-40ms; L2 (ESP-NOW) 8-18ms
Packet loss rate: L1 < 0.3%; L2 < 0.5%
Single packet size: L1 1027 bytes; L2 1027 bytes
Audio format (both links): 16bit / 48kHz mono PCM pass-through
Encoding method (both links): No encoding (pass-through)
Concurrent devices (both links): Stable reception from multiple devices is supported.
Actual performance indicators (showroom environment, 50 people online simultaneously):

[0066] Example 4 As shown in Figure 3, this embodiment details the workflow, algorithm logic, and exception handling with which the exhibit host achieves automatic zoned explanation, so that the explanation "follows the guide wherever they go".

[0067] Hardware configuration (taking the host of the "AI History Area" exhibit as an example)
Motherboard: RK3588Edge, running Armbian 22.04
Audio output: A PCM5102A DAC connected via the I2S interface feeds the amplifier and speakers directly to play back the raw PCM stream.
Control output: A USB to RS485 module drives 8 relays that control lighting (2 channels), projection screen (1 channel), electric model (3 channels), and spares (2 channels).
ESP-NOW receiver module: Optional ESP32-S3 / C5 module, which communicates with the motherboard via serial port or network and receives the second link when the system degrades to it.
Software architecture
Receive module: C# UDP broadcast receive thread, bound to port 5004
Parsing module: Binary protocol parsing; each packet is 1027 bytes, of which:
Bytes 1-1024: 16-bit / 48kHz mono PCM audio data
Byte 1025: Microphone ID
Byte 1026: Beacon Identifier (used for exhibition area matching)
Byte 1027: Button State
Circular buffer: Minimal buffering (only 2-3 packets deep) to ensure real-time playback.
Matching and decision module: The core logic is beacon label matching.
Playback module: Based on ALSA; outputs PCM data directly without decoding or resampling.
Status reporting module: Reports running status every 5 seconds via MQTT.
Location matching algorithm
Input: The beacon identifier parsed from byte 1026 of the data packet (beacon_id) and the host's own beacon identifier (local_beacon).
Output: Play / mute decision
1. If beacon_id==local_beacon: Play the audio immediately, without delay or waiting for multiple confirmations.
2. If beacon_id!=local_beacon: Stop playback immediately and clear the buffer to avoid cross-region audio interference.
3. Boundary handling: If no valid data packets are received for 2-3 consecutive seconds, the host enters a "silent wait" state. If the guide merely passes quickly through a non-target exhibition area, playback is not triggered.
Audio playback control
Playback method: Direct PCM stream output, no encoding, no format conversion, lowest latency.
Volume adjustment: Fixed, or fine-tuned according to the background noise of the exhibition hall. No fade-in / fade-out processing is applied, so that playback reacts the moment the guide walks in.
Overlapping exhibition areas: Only the exhibit host whose beacon label matches exactly plays the audio, which avoids simultaneous sound output.
Exception handling
Audio buffer overflow: Old data packets are proactively discarded to prioritize real-time performance.
Playback device failure: The host automatically switches to the backup audio output and sends an alarm via MQTT.
Relay fault: The relay status is read back after each control action; if it is abnormal, the action is retried 3 times, and the fault is reported to maintenance if it still fails.
Measured data
Delay from location matching to playback start: 20-40ms on average (including network transmission, parsing, and playback start).
Smoothness of switching between exhibition areas: No noise or interruption during switching, and rapid response.
Audience subjective evaluation: The audio and the narrator's position are synchronized naturally, with no perceptible delay.
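The play / mute decision and the shallow buffer can be sketched as follows. This is a simplified model, assuming the byte layout given in this example (microphone ID at byte 1025, beacon identifier at byte 1026); the class name `ExhibitHost` and the returned strings are hypothetical, and the actual ALSA output is left out.

```python
from collections import deque

AUDIO_LEN = 1024
PACKET_LEN = 1027
BEACON_OFFSET = 1025  # zero-based index of byte 1026


class ExhibitHost:
    """Keeps a 2-3 packet buffer and decides play / mute per packet."""

    def __init__(self, local_beacon: int, depth: int = 3) -> None:
        self.local_beacon = local_beacon
        # A bounded deque drops the oldest packet on overflow, which mirrors
        # the "discard old data packets" rule.
        self.buffer: deque = deque(maxlen=depth)

    def handle(self, packet: bytes) -> str:
        if len(packet) != PACKET_LEN:
            return "ignore"
        beacon_id = packet[BEACON_OFFSET]
        if beacon_id == self.local_beacon:
            self.buffer.append(packet[:AUDIO_LEN])  # queue PCM for playback
            return "play"
        self.buffer.clear()  # avoid cross-region audio interference
        return "mute"


if __name__ == "__main__":
    host = ExhibitHost(local_beacon=5)
    pkt = bytes(AUDIO_LEN) + bytes([1, 5, 0])  # mic 1, beacon 5, button 0
    print(host.handle(pkt))
```

The "silent wait" boundary state would sit on top of this as a timer on the last valid packet; it is omitted here to keep the sketch short.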

[0068] Example 5 As shown in Figure 4, this embodiment details the implementation of the audio data distribution module, offline speech recognition module, command matching engine, and UDP directed control module of the central control server. Together they complete the entire process of audio splitting, beacon matching and configuration loading, speech recognition, and directed control of exhibits.

[0069] Hardware configuration
Server: Dell PowerEdge T340 with Intel Xeon E-2236, 32GB ECC RAM, and 512GB NVMe SSD RAID1, capable of multi-channel parallel audio processing and high-speed data read / write.
Network module: Gigabit Ethernet network card for low-latency handling of UDP audio data packets and control command transmission, suited to the exhibition hall LAN environment.
Expansion interfaces: Multiple serial / network port expansions are supported for connecting to the exhibition hall's central control network and exhibit terminals.
Software stack
Operating system: Ubuntu 20.04 LTS, kernel 5.4, with the real-time patch (PREEMPT_RT) to optimize real-time audio data processing and low-latency command delivery.
Speech recognition: Vosk API; loads Chinese models, supports independent recognition on multiple threads, and handles audio per microphone number.
Data storage: Local JSON configuration files for quick loading of the voice control commands and network parameters corresponding to each beacon label.
Network communication: The native UDP protocol is used for audio data reception and directed transmission of control commands, with low overhead and high real-time performance.
Process management: A multi-threaded scheduling framework allocates an independent recognition thread to each microphone number, ensuring efficient concurrent processing.
Core data parsing module
The central control server receives a 1027-byte UDP audio data packet and performs rapid binary parsing. The data packet structure and fields are defined as follows:
1. Bytes 1-1024: 16-bit / 48kHz mono raw PCM audio data, unencoded and uncompressed, used directly for speech recognition.
2. Byte 1025: Microphone number, used as the audio branch identifier to distribute audio data to the corresponding Vosk recognition thread.
3. Byte 1026: Beacon identifier, used as the configuration loading identifier to match and load the JSON control configuration data corresponding to that beacon identifier.
4. Byte 1027: Button state, serving as the voice recognition trigger switch that determines whether to start the offline voice recognition process for the current audio stream.
Parsing rules: The raw binary data is parsed directly, without additional protocol encapsulation; parsing time is <1ms, ensuring real-time performance.
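The parsing rules and the per-microphone distribution can be sketched as below. The sketch assumes the field layout listed above; `Packet`, `parse_packet`, and `AudioDispatcher` are hypothetical names, and the recognition workers that would consume each queue (and call the speech engine) are deliberately not shown.

```python
import queue
from typing import Dict, Iterable, NamedTuple

AUDIO_LEN = 1024
PACKET_LEN = 1027


class Packet(NamedTuple):
    audio: bytes     # bytes 1-1024, raw 16-bit / 48kHz mono PCM
    mic_id: int      # byte 1025
    beacon_id: int   # byte 1026
    button: int      # byte 1027, 1 = recognition triggered


def parse_packet(data: bytes) -> Packet:
    """Split a 1027-byte datagram into its four fields."""
    if len(data) != PACKET_LEN:
        raise ValueError(f"expected {PACKET_LEN} bytes, got {len(data)}")
    return Packet(data[:AUDIO_LEN], data[1024], data[1025], data[1026])


class AudioDispatcher:
    """Routes triggered audio to one queue per preset microphone number."""

    def __init__(self, mic_ids: Iterable[int]) -> None:
        self.queues: Dict[int, queue.Queue] = {m: queue.Queue() for m in mic_ids}

    def dispatch(self, data: bytes) -> bool:
        """Queue the packet for its microphone; return False if it is dropped."""
        pkt = parse_packet(data)
        if pkt.button != 1 or pkt.mic_id not in self.queues:
            return False  # button not triggered or unknown microphone
        self.queues[pkt.mic_id].put(pkt)
        return True
```

One queue per microphone keeps the recognition threads independent: each worker blocks only on its own queue, so concurrent narrators do not contend for the same audio stream.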

[0070] Audio data distribution and speech recognition module
Audio splitting
1. Based on the microphone ID in byte 1025 of the data packet, a one-to-one correspondence between microphone ID and Vosk recognition thread is established to distribute audio data to the right target.
2. During server initialization, an independent Vosk offline recognition thread is created for each preset microphone number, and thread resources are allocated independently.
3. Upon receiving an audio data packet, the server pushes the PCM audio data (bytes 1-1024) directly to the thread corresponding to the parsed microphone number.
4. When multiple microphones are used concurrently, each thread processes its audio data independently without resource contention, which supports concurrent recognition for multiple narrators speaking at the same time.
Speech recognition triggering and processing flow
The speech recognition process is controlled solely by the button state in byte 1027 of the data packet. Recognition does not start if the button is not triggered, which saves server CPU resources. The processing pipeline is as follows:
Trigger detection: The button state is checked. If it is in the triggered state, speech recognition starts on the corresponding microphone thread; otherwise the audio data is discarded.
Audio preprocessing: Raw PCM data is fed directly into the Vosk engine without decoding / resampling (the engine is compatible with the 16bit / 48kHz format).
Voice Activity Detection (VAD): Silence segments are filtered out based on energy and zero-crossing rate, and only valid speech segments are sent to the recognition core, which reduces processing overhead.
Offline recognition: A local Chinese model is loaded, no network connection is required, and the speech-to-text conversion outputs the recognized text.
Core performance: Single-thread real-time factor (RTF) < 0.3, server CPU utilization < 70% during multi-threaded concurrent recognition, and recognition latency < 300ms (from the end of speech to the output text).
Beacon matching and JSON configuration loading module
Based on the beacon identifier in byte 1026 of the data packet, the beacon is matched precisely to its local JSON configuration, which provides the mapping between network parameters and commands for the subsequent directed control.
1. Local JSON configuration files are stored by beacon number. Each beacon number corresponds to a unique JSON configuration file containing core information such as voice keywords, exhibit IP address, communication port, and control command data.
2. Upon receiving a data packet, the server parses the beacon identifier, quickly loads the corresponding JSON configuration into memory, and caches it to avoid repeated file reads.
3. Configuration update: Hot updates of the local JSON configuration are supported without a server restart, to accommodate exhibit adjustments and command updates in the exhibition hall.
Core field mapping (JSON configuration example):
{"control":"ASR","IP":"Exhibition Item IP","port":"Exhibition Item Communication Port","data":[{"key":"Voice Keyword","IP":"Exhibition Item IP","port":"Exhibition Item Communication Port","data":{"control":"Exhibition Item Control Command","Parameter":"Command Additional Information"}}]}
Intelligent command matching and similarity algorithm module
The text output by Vosk is taken as input and compared, using a distance algorithm, with the predefined voice keywords (key) in the loaded JSON configuration to achieve accurate command matching.
1. Matching input: The recognized text and the set of predefined voice keywords in the JSON configuration.
2. Similarity calculation: A distance algorithm compares the recognized text with each keyword and outputs a similarity value.
3. Matching decision rules: A similarity of ≥70% is a successful match, upon which the exhibit control parameters (IP, port, control command data) of the corresponding keyword are extracted; a similarity of <70% is a failed match, and no control command is triggered.
4. Multi-keyword matching: If the recognized text has a similarity of ≥70% with multiple keywords, the keyword with the highest similarity is taken as the final result to keep the command unique.
UDP directed control command issuance module
After command matching, the control command data is sent to the target exhibit via the UDP protocol, using the exhibit IP and communication port extracted from the JSON configuration, so that voice commands control exactly the intended exhibit. The core process is as follows:
1. Data extraction: After a successful match, the target exhibit's IP address, UDP communication port, and control command data are extracted from the JSON configuration.
2. Command encapsulation: The control command data is sent as a UDP payload without an additional protocol header, which keeps command delivery latency low.
3. Directed sending: The encapsulated control command is sent via UDP to the extracted exhibit IP and port only, without broadcasting to other exhibits.
4. Transmission confirmation: After transmission, a command log is recorded (beacon number, microphone number, target IP, control command, and transmission time) to support troubleshooting and traceability.
5. Transmission performance: UDP command delivery delay < 5ms and a command delivery rate of 100% in the exhibition hall LAN environment, with no packet loss and no out-of-order delivery.
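The matching rules above can be sketched as follows. The text names only "a distance algorithm", so the Levenshtein edit distance and the normalisation by the longer string's length are assumptions made for illustration; the config dictionary follows the JSON example, and `match_command` and `send_command` are hypothetical helper names.

```python
import json
import socket
from typing import Any, Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric) via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, falling towards 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def match_command(text: str, config: Dict[str, Any],
                  threshold: float = 0.70) -> Optional[Dict[str, Any]]:
    """Return the best-scoring keyword entry if it reaches the 70% threshold."""
    scored = [(similarity(text, e["key"]), e) for e in config.get("data", [])]
    if not scored:
        return None
    score, entry = max(scored, key=lambda pair: pair[0])
    return entry if score >= threshold else None


def send_command(entry: Dict[str, Any]) -> None:
    """Send the control data to the exhibit by unicast UDP, no extra header."""
    payload = json.dumps(entry["data"]).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (entry["IP"], int(entry["port"])))
```

Taking the maximum before applying the threshold handles the multi-keyword rule and the failure rule in one pass: if even the best candidate is below 70%, nothing is sent.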

[0071] Practical application cases
Case 1: Directed playback of an exhibit video using a single microphone and a single beacon
The guide, using microphone number 01 and positioned in exhibition area number 05, pressed the voice recognition trigger button (button in the triggered state) and spoke the command "Play the Flower-Faced Cat promotional video":
1. The central control server receives a 1027-byte data packet and parses out microphone number 01, beacon number 05, button status triggered, and the raw PCM audio data.
2. The audio data is distributed to the Vosk recognition thread corresponding to microphone 01, speech recognition starts, and the recognized text "Play the Flower-Faced Cat promotional video" is output.
3. The local JSON configuration corresponding to beacon number 05 is loaded. It contains the keyword "Play the Flower-Faced Cat promotional video" and the corresponding exhibit parameters: IP=192.168.0.108, port=8190, data={"control":"play","value":"5.Fiberhome Group promotional video.mp4"}
4. The text is compared with the configured keywords using the distance algorithm. The similarity is 95% or higher (≥70%), so the match is successful.
5. The exhibit's IP address and port are extracted, and the control command (data) is sent via UDP to 192.168.0.108:8190. The target exhibit automatically plays the corresponding video after receiving the command.

[0072] Case 2: Audio processing when the button is not triggered
The guide, using microphone number 02 and positioned in exhibition area number 03 (beacon number 03), did not press the recognition trigger button (button status: not triggered) and continued the explanation normally.
1. After receiving the data packet, the central control server parses it, finds that the button status is not triggered, discards the audio data directly, and does not start the voice recognition process.
2. No command matching or issuance takes place; the server only parses the data, with no additional resource overhead.

[0073] Case 3: Handling commands with insufficient similarity
The guide, using microphone number 01 and positioned in exhibition area number 05 (beacon number 05), triggered recognition and spoke only "Promotional Video". The speech recognition outputs the text "Promotional Video", which is compared with the JSON keyword "Play the Flower-Faced Cat promotional video" corresponding to beacon number 05. The similarity calculated by the distance algorithm is less than 70%, so the match is considered a failure. No exhibit parameters are extracted, no control commands are sent, and the exhibits do not perform any actions.
Core performance indicators
Packet parsing latency: <1ms (1027 bytes of binary data)
Audio splitting concurrency: Multiple microphone numbers are supported concurrently, with each thread processing independently and without interference.
Speech recognition latency: <300ms from the end of speech to the output text; offline recognition with no network dependency.
Command matching time: similarity comparison < 5ms; matching accuracy ≥ 98% (with a similarity of ≥ 70% counted as a match).
Control command issuance delay: UDP directed transmission <5ms, 100% command delivery rate on the exhibition hall LAN.
Server resource usage: During multi-threaded concurrent processing, CPU utilization is <70% and memory usage is <4GB, ensuring stable operation 24 / 7.

[0074] Example 6 This embodiment takes the "Artificial Intelligence Science Popularization Exhibition Hall" of a municipal science and technology museum as an example and describes in detail the end-to-end parameter configuration, data transmission logic, and measured performance of the scheme based on raw PCM stream pass-through, pure UDP broadcast, and directed matching by beacon label.

[0075] Exhibition hall overview
Area: 1200 square meters
Number of exhibition areas: 8 (AI History, Machine Learning, Computer Vision, Natural Language Processing, Robotics, Autonomous Driving, Brain-Computer Interface, Future Outlook)
Number of beacons: 32 (each beacon has a unique identifier used for matching exhibition areas)
Exhibit hosts: 12 units (some exhibition areas have multiple hosts, each bound to a unique beacon number)
Number of guides: 4 (all can present simultaneously; each is equipped with an independent, numbered microphone terminal)
Step S1: Beacon deployment
Deployment method: Grid layout, spacing 5-8 meters, height 1.5 meters above the ground (matching the height of the guide's terminal).
Beacon parameter configuration:
Broadcast interval: 100ms (balancing real-time performance and power consumption)
Transmit power: 0dBm (coverage radius 10-15 meters)
Broadcast channels: Polling 37 / 38 / 39 (Bluetooth BLE standard)
Core identifiers: Beacon numbers 1-32. Each beacon number is unique and maps to exactly one exhibition area, for example beacons 1-8 correspond to the AI history area and beacons 9-16 to the machine learning area. No redundant UUID / Major / Minor configuration is used; only the beacon identifier is retained as the core field.
Step S2: Real-time positioning
The guide, Teacher Zhang (microphone number: MIC_001), enters the AI history area wearing a terminal:
1. The explanation terminal scans the beacon signal in the AI history area and extracts the beacon number: 05 (corresponding to the AI history area).
2. No complex coordinate calculation / confidence analysis is required; the exhibition area is matched directly from the beacon label: beacon label 05 belongs to the AI history area.
3. The terminal records the current matching result, microphone number MIC_001 and beacon number 05, which provides the core fields for subsequent audio encapsulation.
Step S3: Audio acquisition and encapsulation
The main processor captured the utterance "Next, let's look at the early development of artificial intelligence," which lasted 4.2 seconds and was handled entirely as 16-bit / 48kHz mono raw PCM audio, without any encoding or compression.
Audio segmentation: The audio is segmented into 1024-byte packets. The 4.2-second utterance generates approximately 394 packets (4.2 seconds × 48kHz × 2 bytes / sample = 403,200 bytes; 403,200 bytes ÷ 1024 bytes / packet ≈ 393.75 packets).
Data packet encapsulation: Fixed 1027-byte binary format, no redundant fields, structure as follows:

[0076] Encapsulation rules: Each packet contains only core business fields, without redundant data such as magic / seq_num / timestamp / coordinates, and the encapsulation time is <1ms.
Step S4: Dual-link transmission (pure UDP / ESP-NOW pass-through, no encoding)
First link (in this embodiment, 5GHz WiFi UDP broadcast)
Broadcast address: 255.255.255.255:5004
Transmission interval: approximately 10.67ms / packet (matching the 48kHz PCM sampling rate at 1024 bytes / packet)
Actual transmission rate: 768kbps (native rate of 16bit / 48kHz mono PCM) + approximately 5% protocol overhead ≈ 806kbps
Transmission characteristics: Pure UDP broadcast, no RTP / multicast; data packets are sent directly, latency <15ms
Second link (ESP-NOW push)
Broadcast address: FF:FF:FF:FF:FF:FF
Transmission interval: approximately 10.67ms / packet (frame-aligned with the UDP link)
Data packet format: Same 1027-byte binary structure, no CVSD encoding; PCM data is transmitted directly.
Transmission characteristics: Direct transmission at the MAC layer, no WiFi connection, single packet transmission latency <8ms
Step S5: Exhibit host detection and playback
1. The AI history area exhibit host listens on UDP port 5004, receives the 1027-byte data packet, and parses byte 1026 as beacon identifier = 05.
2. Matching detection: Beacon number 05 matches the AI history beacon number bound to the host, so playback starts immediately.
3. Playback process: Bytes 1-1024 of PCM audio data are extracted and output directly to the BOSE speaker via ALSA, without decoding / resampling.
4. Playback latency: From the arrival of the data packet at the host to the sound output, the average latency is 12ms.
5. The hosts of other exhibition areas parse the beacon identifier, determine that it does not match their own, remain silent, and clear their buffers.
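The framing figures quoted in steps S3 and S4 follow directly from the audio format and can be checked with a few lines of arithmetic. The constant and function names below are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2              # 16-bit mono
AUDIO_BYTES_PER_PACKET = 1024

byte_rate = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE             # 96,000 bytes per second
bitrate_kbps = byte_rate * 8 / 1000                       # native PCM rate: 768 kbps
packets_per_second = byte_rate / AUDIO_BYTES_PER_PACKET   # 93.75 packets per second
packet_interval_ms = 1000 / packets_per_second            # about 10.67 ms per packet


def packets_for(duration_s: float) -> int:
    """Number of 1024-byte packets needed to carry `duration_s` seconds of audio."""
    total_bytes = round(duration_s * byte_rate)
    return math.ceil(total_bytes / AUDIO_BYTES_PER_PACKET)


if __name__ == "__main__":
    print(bitrate_kbps, packets_per_second, round(packet_interval_ms, 2))
    print(packets_for(4.2))  # packets for the 4.2-second utterance in step S3
```

Each packet holds 512 samples, so the 10.67ms send interval is simply the time it takes the microphone to fill one packet; a 4.2-second utterance works out to 394 packets.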
Step S6: Speech recognition and command parsing
The central control server receives the UDP audio data packets and completes channel identification and command matching:
1. Data parsing: The microphone number (byte 1025) is extracted as 0x01 and the button status (byte 1027) as 1 (recognition triggered).
2. Audio splitting: The 1024 bytes of PCM data are pushed to the Vosk offline recognition thread corresponding to microphone number 01.
3. Speech recognition: The text "Lights on" is output, with a recognition latency of <300ms.
4. Beacon matching: Byte 1026 is parsed as beacon identifier = 05, and the corresponding JSON configuration (AI history area exhibit IP / port / command mapping) is loaded.
5. Similarity comparison: The recognized text "lights on" is compared with the predefined keys in the JSON. The similarity is 98% or higher (≥70%), so the match is successful.
6. Command extraction: The control command is extracted from the JSON configuration: {"cmd":"light_on"}, target exhibit IP=192.168.0.101, port=8190
Step S7: Issuance and execution of commands
1. Command encapsulation: The control command data is sent as a UDP payload, which removes the need for an MQTT relay and reduces latency.
2. Directed sending: The central control server sends the command data via UDP to the AI history exhibit host at 192.168.0.101:8190
3. Command execution: After receiving the UDP command, the exhibit host closes the lighting circuit via a relay.
4. Execution feedback: The exhibit host sends the execution status back to the central control server via UDP: {"status":"on","beacon_id":05}
Full-process timing test (PCM pass-through + UDP directed control version)

[0077] User perception: The light responds in about 0.33 seconds after speaking, with no obvious delay, which meets the requirements of real-time control experience in the exhibition hall.

[0078] Summary
1. The core of this embodiment is the transmission of 1027-byte raw PCM stream packets. The exhibition area is matched accurately through the beacon identifier, without complex encoding / coordinate calculations, which keeps latency low;
2. Both links use transparent (pass-through) transmission (UDP broadcast + ESP-NOW), with end-to-end audio playback latency <40 ms and total command control latency <350 ms;
3. Voice recognition is triggered by the button status, command matching uses a 70% similarity threshold, and the command is sent via UDP to achieve precise control of the exhibits.
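The command matching and directed UDP sending of steps S6 and S7 can be illustrated with the sketch below. It rests on several assumptions: the text does not say how similarity is computed, so `difflib.SequenceMatcher` stands in for it; the layout of `CONFIG_JSON` and the "lights off" entry are hypothetical; and the function names are illustrative. Only the 70% threshold, the address 192.168.0.101:8190 and the payload {"cmd":"light_on"} come from the embodiment.

```python
import json
import socket
from difflib import SequenceMatcher
from typing import Optional

SIMILARITY_THRESHOLD = 0.70  # the 70% rule stated in the embodiment

# Hypothetical configuration layout, keyed by beacon identifier.
CONFIG_JSON = """
{
  "05": {
    "ip": "192.168.0.101",
    "port": 8190,
    "commands": {
      "lights on": {"cmd": "light_on"},
      "lights off": {"cmd": "light_off"}
    }
  }
}
"""


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for the unspecified metric."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_command(text: str, beacon_id: str, config: dict) -> Optional[dict]:
    """Return target address and payload, or None when nothing reaches the threshold."""
    area = config.get(beacon_id)
    if area is None:
        return None
    best_key = max(area["commands"], key=lambda k: similarity(text, k), default=None)
    if best_key is None or similarity(text, best_key) < SIMILARITY_THRESHOLD:
        return None
    payload = json.dumps(area["commands"][best_key], separators=(",", ":"))
    return {"ip": area["ip"], "port": area["port"], "payload": payload}


def send_command(target: dict) -> None:
    """Send the command as a single directed UDP datagram (no MQTT relay)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(target["payload"].encode("utf-8"), (target["ip"], target["port"]))
```

Calling `match_command("Lights on", "05", json.loads(CONFIG_JSON))` yields the target 192.168.0.101:8190 with the payload `{"cmd":"light_on"}`, which `send_command` would then transmit.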

[0079] Example 7
This embodiment simplifies the ranging model, retaining only the core judgment of whether the RSSI lies within effective coverage. It eliminates complex coordinate calculations and directly provides the basis for deciding whether a beacon number is valid for matching:

Signal validity determination formula
Only the core attenuation pattern is retained; precise ranging is not required. The only requirement is to determine whether the RSSI is within the effective coverage area:
$P_r \ge P_{\text{threshold}}$
where:
$P_r$: received signal strength (dBm)
$P_{\text{threshold}}$: effective coverage threshold (field-calibrated to -80 dBm; below this value, the beacon signal is considered invalid).

On-site calibration process
1. Reference point acquisition: within the coverage area of each beacon (10-15 meters), collect 1000 RSSI samples to determine the effective threshold. Mean μ = -45.3 dBm (at 1 meter). Effective threshold $P_{\text{threshold}}$ = -80 dBm (a beacon whose RSSI falls below this value is considered invalid).
2. No multi-point fitting of the path loss exponent is needed, only verification: RSSI ≥ -80 dBm inside the beacon coverage radius and < -80 dBm outside it, with a measured validity rate of 99%.

[0080] Real-time filtering optimization
Instead of Kalman filtering, a sliding-window mean filter is used, balancing performance and stability.
1. Filtering rule: take the mean of the 5 most recent RSSI samples (window size 5; at a sampling rate of 10 Hz the filtering delay is 0.5 seconds).
2. Filtering effect: before filtering, RSSI fluctuation is ±5 dB and the false-positive rate for valid beacons is 8%; after filtering, RSSI fluctuation is ±2 dB and the false-positive rate is <1%.
3. Outlier handling: if RSSI < -90 dBm, the beacon is directly determined to be invalid and does not participate in beacon-number matching. If five consecutive RSSI values change by more than 20 dB / second, this is treated as rapid movement, and only the beacon with the strongest signal is retained.

Core application scenarios
This model is used only to determine whether a beacon scanned by the explanation terminal is within the effective coverage area. If it is, the beacon number is extracted; if not, the beacon is discarded. No specific distance / coordinates need to be calculated. This provides the basis for the subsequent "beacon number - exhibition area" matching.
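The validity test of Example 7 combined with the sliding-window filter can be sketched as below. The window size of 5, the -80 dBm threshold and the -90 dBm outlier limit are taken from the text. The class name `BeaconRssiFilter` is hypothetical, and it is an assumption that an outlier sample is kept out of the window rather than averaged in. The rapid-movement rule (more than 20 dB / second) is not covered by this sketch.

```python
from collections import deque
from typing import Deque, Optional

WINDOW_SIZE = 5             # mean of the 5 most recent samples
VALID_THRESHOLD_DBM = -80   # P_threshold from the field calibration
OUTLIER_DBM = -90           # samples below this mark the beacon invalid


class BeaconRssiFilter:
    """Sliding-window mean filter for the RSSI samples of one beacon."""

    def __init__(self) -> None:
        self._window: Deque[float] = deque(maxlen=WINDOW_SIZE)

    def mean(self) -> Optional[float]:
        """Current filtered RSSI, or None before the first accepted sample."""
        if not self._window:
            return None
        return sum(self._window) / len(self._window)

    def update(self, rssi_dbm: float) -> bool:
        """Feed one sample; return True if the beacon currently counts as valid."""
        if rssi_dbm < OUTLIER_DBM:
            # Outlier rule: invalid for this scan. Assumption: the sample
            # is not stored, so it cannot drag the window mean down.
            return False
        self._window.append(rssi_dbm)  # the oldest sample drops out automatically
        filtered = self.mean()
        return filtered is not None and filtered >= VALID_THRESHOLD_DBM
```

One filter instance per beacon number would be kept on the terminal; at the stated 10 Hz sampling rate the five-sample window spans 0.5 seconds, matching the filtering delay given above.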

[0081] Example 8
Given multiple beacon numbers and their corresponding RSSI values, the core beacon number is determined solely by RSSI priority, without calculating terminal coordinates, and the exhibition area is matched directly:

Mathematical derivation
1. Sort all valid detected beacons (RSSI ≥ -80 dBm) by RSSI in descending order: beacon_sorted = sort(beacon_1, beacon_2, ..., beacon_n) by RSSI desc
2. Take the first beacon number after sorting as the core matching number: beacon_core = beacon_sorted[0]
3. No matrix operations / least-squares solutions are required; the exhibition area is matched directly from the core beacon number.

[0082] Numerical implementation (taking 4 beacons as an example)

[0083] The final core beacon number is 05, which directly matches the AI History Area without any coordinate calculation.

[0084] Engineering optimization
1. Weighted least squares / constrained optimization is not needed; only the following are retained: (1) valid-beacon screening (RSSI ≥ -80 dBm); (2) RSSI ranking, taking the strongest beacon number.
2. Real-time performance optimization: (3) pure integer arithmetic with no floating-point calculations, processing time <1 ms; (4) no covariance matrix needs to be pre-calculated; the beacons are sorted directly and the maximum is taken.

Actual test results
Beacon number matching accuracy: 99.5% (within the effective coverage area)
Processing latency: <1 ms (far lower than the terminal's audio encapsulation / transmission latency)
This approach eliminates coordinate calculation errors, focuses solely on the validity of beacon signals, and meets the core needs of exhibition-area matching.
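The selection rule of Example 8 (screen by threshold, then take the strongest beacon) can be sketched as below. The function name `select_core_beacon` and the RSSI values in the usage note are hypothetical; they are not the figures of the embodiment's 4-beacon example. Integer dBm values are used so that the comparison stays in integer arithmetic, as the text describes.

```python
from typing import Dict, Optional

VALID_THRESHOLD_DBM = -80  # valid-beacon screening threshold from the text


def select_core_beacon(scan: Dict[int, int]) -> Optional[int]:
    """Return the beacon number with the strongest valid RSSI, or None.

    `scan` maps beacon number -> RSSI in integer dBm.
    """
    valid = {beacon: rssi for beacon, rssi in scan.items()
             if rssi >= VALID_THRESHOLD_DBM}
    if not valid:
        return None
    # Taking the maximum is equivalent to sorting by RSSI descending
    # and reading element [0], without the cost of a full sort.
    return max(valid, key=lambda beacon: valid[beacon])
```

For an illustrative scan such as `{3: -72, 5: -48, 7: -85, 9: -66}`, beacon 7 is screened out and beacon 5 is returned as the core beacon number.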

[0085] Example 9
In this embodiment, the fingerprint database is a beacon-number fingerprint database that stores only the "beacon number - exhibition area" mapping. Coordinate / fingerprint matching is not required; the core function is to quickly match the exhibition area corresponding to a beacon.

Fingerprint database construction
Reference point deployment: the exhibition hall is divided into reference areas, and each area is associated with a list of beacon numbers.
Reference areas: 8 exhibition areas, each bound to 4 beacon numbers (e.g., the AI History Area is bound to beacons 01-04).
Number of reference points: 8 (1 core reference point per exhibition area)
No laser coordinate calibration is required; only the "beacon number - exhibition area" mapping is recorded.

[0086] Data collection Only valid beacon signals within each exhibition area are collected; RSSI mean / standard deviation is not required. The data collection method is as follows:

[0087] Fingerprint database JSON format

[0088] Real-time matching algorithm
1. Feature extraction: extract only the list of valid beacon numbers detected during scanning (RSSI ≥ -80 dBm).
2. Matching rules: traverse the fingerprint database; if the list of valid beacon numbers overlaps with the numbers bound to an exhibition area, that area is matched directly. If multiple exhibition areas match, the area with the most matched beacon numbers is selected.
3. Confidence assessment: only the number of matched beacons is considered.
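The matching rules above can be sketched as follows. The confidence formula is not reproduced in this text, so the sketch assumes confidence = matched beacons / beacons bound to the area; the 0.7 threshold is the one stated in paragraph [0089]. The database layout, the Robot Area entry and the name `match_area` are hypothetical; only the AI History Area binding to beacons 01-04 comes from Example 9.

```python
from typing import Dict, Iterable, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # stated in paragraph [0089]

# Hypothetical in-memory form of the fingerprint database.
FINGERPRINT_DB: Dict[str, List[int]] = {
    "AI History Area": [1, 2, 3, 4],   # binding given in Example 9
    "Robot Area": [5, 6, 7, 8],        # illustrative only
}


def match_area(
    valid_beacons: Iterable[int],
    db: Dict[str, List[int]] = FINGERPRINT_DB,
) -> Tuple[Optional[str], float]:
    """Return (area, confidence); area is None when no match is accepted."""
    seen = set(valid_beacons)
    best_area: Optional[str] = None
    best_hits = 0
    best_conf = 0.0
    for area, bound in db.items():
        hits = len(seen.intersection(bound))
        if hits > best_hits:  # the area with the most matched beacons wins
            best_area, best_hits = area, hits
            best_conf = hits / len(bound)  # assumed confidence definition
    if best_area is None or best_conf < CONFIDENCE_THRESHOLD:
        return None, best_conf
    return best_area, best_conf
```

Under this assumed definition and four beacons per area, at least three of an area's beacons must be valid for the match to be accepted.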

[0089] Confidence level ≥ 0.7: the match succeeds. Confidence level < 0.7: the match fails and "No matching area" is output.

Performance optimization
1. Fingerprint database compression: only beacon-number mappings are stored, with no redundant data, occupying less than 1 KB of memory.
2. Fast search: the beacon-number lists of the 8 exhibition areas are traversed directly, with a matching time of <0.5 ms.
3. No adaptive updates required: the "beacon number - exhibition area" mapping is fixed, and the fingerprint database is updated manually only when an area is adjusted.

Actual performance

[0090] Example 10
This embodiment completely abandons coordinate / polygon judgment and infers the command area from beacon numbers alone. The core logic is "beacon number → exhibition area → command target":

Core definition
No polygon / ray-casting judgment is needed; the definition is direct:
Each beacon number corresponds to exactly one exhibition area (e.g., beacon No. 05 → AI History Area).
The instruction mapping only associates "beacon number - exhibition area" and involves no complex area judgment.

Region determination algorithm
Function simplification (without coordinates / polygons)
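A coordinate-free lookup of this kind reduces to a table access. The following is a minimal sketch under stated assumptions, not the listing of the embodiment: the table name `BEACON_TO_AREA`, the function name `area_for_beacon` and the Robot Area entry are hypothetical, and only the 05 → AI History Area pair comes from the text.

```python
from typing import Dict, Optional

UNKNOWN = "UNKNOWN"

# Each beacon number maps to exactly one exhibition area.
BEACON_TO_AREA: Dict[int, str] = {
    5: "AI History Area",   # pair given in the text
    9: "Robot Area",        # illustrative only
}


def area_for_beacon(beacon_id: Optional[int]) -> str:
    """Map a core beacon number to its exhibition area without any geometry."""
    if beacon_id is None:
        return UNKNOWN
    return BEACON_TO_AREA.get(beacon_id, UNKNOWN)
```

Because the mapping is one-to-one from the beacon side, no polygon test or ray casting is involved; an unlisted or missing beacon number simply yields "UNKNOWN".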

[0091] Complete algorithm for voice command parsing
Input: text (recognized text), beacon_id (core beacon number)
Output: target_zone (target area), cmd (command), param (parameters)
1. Area entity recognition: if the text contains the name of an exhibition area (such as "AI History Area"), that area is used directly as the target_zone. Otherwise, the target_zone is obtained from the beacon_id mapping.
2. Command recognition: based on keyword matching; only the core commands (turn on / turn off / play next segment) are retained. The instruction mapping table is simplified to:

[0092] 3. Fuzzy command processing: when the text contains no area entity, the exhibition area corresponding to the beacon_id is retrieved directly, with no need for trajectory prediction / historical location. If there is no beacon_id, the target is judged "UNKNOWN" and no instruction is triggered.

Real-world cases (adapted to the beacon-number scenario)
Case 1: Instruction without an explicit area
Beacon number: 05 (AI History Area)
Voice input: "Play the next segment"
Recognition: the text contains no area entity, so it is mapped to the AI History Area via beacon_id=05.
Execution: the play_next command is issued to the AI History Area.
Case 2: Cross-area instruction
Beacon number: 05 (AI History Area)
Voice input: "Turn on the lights in the robot area."
Recognition: the text contains "robot area", so that area is taken directly as the target_zone.
Execution: the light_on command is issued to the robot area, regardless of the current beacon_id.
Case 3: Global command
Voice input: "Turn off all lights"
Recognition: command scope = all
Execution: the light_off command is broadcast to all exhibition areas.

Performance indicators
Area identification accuracy: 99.9% (relying solely on the beacon-number mapping)
Fuzzy command inference accuracy: 98.5% (matching only core keywords)
Average processing latency: 5 ms (no complex calculations, only table lookup)

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
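The parsing rules of Example 10 (explicit area name first, then beacon_id fallback, then "UNKNOWN") can be sketched as below. The command names light_on, light_off and play_next appear in the cases above; the keyword phrases, the "ALL" scope marker, the Robot Area beacon number and all function and type names are assumptions made for illustration. The `param` output is omitted for brevity.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

UNKNOWN = "UNKNOWN"
ALL_AREAS = "ALL"  # hypothetical marker for the global command scope

# Only the 05 -> AI History Area pair comes from the text.
BEACON_TO_AREA: Dict[int, str] = {5: "AI History Area", 9: "Robot Area"}

# Keyword phrase -> command; the phrases are illustrative.
KEYWORDS: List[Tuple[str, str]] = [
    ("turn on", "light_on"),
    ("turn off", "light_off"),
    ("next segment", "play_next"),
]


class ParsedCommand(NamedTuple):
    target_zone: str
    cmd: Optional[str]  # None means no instruction is triggered


def parse_command(text: str, beacon_id: Optional[int]) -> ParsedCommand:
    lowered = text.lower()
    cmd = next((c for phrase, c in KEYWORDS if phrase in lowered), None)

    if "all" in lowered.split():
        target = ALL_AREAS  # global command, e.g. "Turn off all lights"
    else:
        # 1. An area named in the text overrides the current beacon.
        named = next((a for a in BEACON_TO_AREA.values()
                      if a.lower() in lowered), None)
        # 2. Otherwise fall back to the beacon_id mapping.
        #    (the `named or ...` shorthand is safe because area names are non-empty)
        target = named or BEACON_TO_AREA.get(beacon_id, UNKNOWN)

    # 3. Without a resolvable target, nothing is triggered.
    if target == UNKNOWN:
        return ParsedCommand(UNKNOWN, None)
    return ParsedCommand(target, cmd)
```

Fed the three cases above, this sketch returns `play_next` for the AI History Area, `light_on` for the Robot Area, and `light_off` with the global scope; with no beacon_id and no area name it returns "UNKNOWN" and no command.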

Claims

1. An exhibition intelligent interpretation and central control linkage system based on ESP-NOW protocol beacon positioning and dual-link audio transmission, characterized in that it comprises:
an explanation microphone terminal, which adopts a dual-processor architecture and is used to collect the narrator's voice, perform real-time positioning based on ESP-NOW protocol wireless beacons, and transmit audio data in parallel through a WiFi 5G broadcast link serving as the first link and an ESP-NOW protocol broadcast link serving as the second link;
ESP-NOW protocol wireless beacons, deployed in each exhibit area to provide location identification information to the explanation microphone terminal;
a monitoring device, used to receive and play the audio data through the ESP-NOW protocol broadcast link serving as the second link, so that the narrator can self-monitor or lead group tours;
a central control server, on which a central control server program runs, the central control server program being used to receive the audio data, perform offline speech recognition, and issue control commands based on the recognition results and the location identification information;
at least one exhibit host, connected to the existing audio equipment in the exhibition hall, on which an exhibit host program runs, the exhibit host program being used to receive the audio data and control the audio equipment to play or mute according to the result of matching the location identification information against a local preset identifier.

2. The system according to claim 1, characterized in that the dual-processor architecture of the explanation microphone terminal includes: a main processor, responsible for audio acquisition and for transmitting audio data via the WiFi 5G broadcast link serving as the first link; and a secondary processor, responsible for scanning ESP-NOW protocol wireless beacons, determining location identifiers, and pushing audio data to the monitoring device through the ESP-NOW protocol broadcast link serving as the second link; the main processor and the secondary processor synchronize audio and location data in real time via the SPI communication protocol.

3. The system according to claim 1, characterized in that the WiFi 5G broadcast link serving as the first link uses UDP broadcast in the 5 GHz band to transmit audio data to all exhibit hosts and the central control server; the ESP-NOW protocol broadcast link serving as the second link operates in the 2.4 GHz band and is used to transmit audio data to the monitoring device, supporting concurrent reception by multiple devices; the two links operate in parallel and independently and serve as redundant backups for each other: when the quality of either link degrades, the receiver automatically switches to the other link to receive audio data, ensuring uninterrupted audio transmission.

4. The system according to claim 1, characterized in that the exhibit host program running on the exhibit host receives the audio data and location identification information broadcast by the explanation microphone terminal; if the parsed location identifier matches the local preset identifier, the audio is played through the connected audio device; otherwise, the exhibit host remains silent.

5. The system according to claim 1, characterized in that the central control server program running on the central control server integrates the Vosk offline speech recognition model, allocates an independent recognition thread to each explanation microphone terminal, and is equipped with a local JSON configuration library that stores, indexed by the location identifier of each ESP-NOW protocol wireless beacon, the mapping between the network parameters of the exhibits and the voice commands; for commands that clearly point to the current area, the control command is issued immediately; for fuzzy commands, the target area of the command is inferred by combining historical location information; the central control server supports multiple explanation microphone terminals working simultaneously, enabling concurrent voice recognition and command issuance for multiple narrators.

6. A method for intelligent explanation and central control linkage in exhibition halls based on ESP-NOW protocol beacon positioning and dual-link audio transmission, applied to the system described in any one of claims 1-5, characterized in that it includes the following steps:
S1: Deploy ESP-NOW protocol wireless beacons in each exhibit area and establish the association between each exhibit host and a location identifier;
S2: The explanation microphone terminal scans the ESP-NOW protocol wireless beacons, receives the beacon signals, and determines the core location identifier corresponding to the current location based on the signal strength;
S3: The explanation microphone terminal synchronously collects audio data and encapsulates the audio data, terminal identifier, and core location identifier into a data packet;
S4: The data packet is broadcast to all exhibit hosts and the central control server via the WiFi 5G broadcast link serving as the first link, and the audio data is pushed to the monitoring device via the ESP-NOW protocol broadcast link serving as the second link;
S5: Each exhibit host receives the data packet, and the exhibit host program running on it parses the core location identifier in the packet; if it matches the local preset identifier, audio is played through the connected audio device; otherwise, the exhibit host remains silent;
S6: The central control server receives the data packet, and the central control server program running on it performs offline speech recognition on the audio data and, in combination with the core location identifier, parses the voice command to determine the target exhibit and the control action;
S7: The central control server sends a control command to the target exhibit host, and the exhibit host executes the command and then reports the execution status.

7. The method according to claim 6, characterized in that the method for determining the location identifier based on signal strength in step S2 includes: calculating the distance between the terminal and each beacon from the relationship between the received signal strength and the reference signal strength, and selecting the beacon identifier with the shortest distance or the strongest signal as the core location identifier; wherein the distance calculation formula is: $d = 10^{(P_0 - P_r)/(10n)}$; where $d$ is the distance, $P_0$ is the signal strength at the reference distance, $P_r$ is the received signal strength, and $n$ is the environmental attenuation factor.

8. The method according to claim 7, characterized in that the method for determining the core location identifier in step S2 further includes: calculating the location coordinates of the explanation terminal using a triangulation algorithm based on the ranging results of multiple beacons, and then matching the corresponding exhibition area according to the coordinates; the triangulation algorithm obtains the location coordinates by solving for the least-squares solution of the overdetermined system of equations: $\hat{x} = (A^{T}A)^{-1}A^{T}b$; where $A$ is the coefficient matrix composed of beacon coordinate differences and $b$ is the constant vector composed of beacon coordinates and ranging results.

9. The method according to claim 6, characterized in that the method for determining the core location identifier in step S2 further includes a matching and positioning method based on location fingerprints: a fingerprint database is constructed by pre-collecting beacon signal strength vectors at each reference point; after a signal strength vector is scanned in real time, its Euclidean distance to each vector in the fingerprint database is calculated, and the area containing the reference point with the smallest distance is selected as the current location.

10. The method according to claim 6, characterized in that the method for parsing voice commands in step S6 includes: matching the text obtained from speech recognition against a preset command mapping table to determine the command type; if the text explicitly specifies an exhibit name, the control command is issued with the specified exhibit as the target; if the text does not explicitly specify an exhibit name, the control command is issued with the exhibit corresponding to the current core location identifier as the target; if the current core location identifier is invalid, no control command is triggered.