A photography robot system and method based on multimodal interaction and state collaboration
A photography robot system based on multimodal interaction and state collaboration achieves efficient, natural, and emotionally expressive interaction in complex environments, addressing the insufficient interaction fluency, intelligence, and emotional expressiveness of existing technologies and improving the user experience.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: ZHEJIANG TONGHUASHUN INTELLIGENT TECH CO LTD
- Filing Date: 2026-03-20
- Publication Date: 2026-06-19
AI Technical Summary
Existing photography robot systems are inadequate in terms of interactive fluency, intelligence, emotional level, and task management, making it difficult to achieve natural, coherent, and efficient user interaction.
The photography robot system, which adopts multimodal interaction and state collaboration, performs multimodal perception and command understanding through the interaction and decision-making central layer, prioritizes tasks through the control and scheduling layer, and dynamically switches behaviors through the execution layer. It also enhances human-like performance through the emotional expression layer, thereby realizing the intelligent and emotional interaction of the system.
It improves the naturalness and overall coherence of the interaction process, ensures the reliability and smoothness of the system under complex working conditions, enhances the robot's sense of life and friendliness, and improves the user's emotional resonance and immersive experience.
Figure CN122231925A_ABST
Description
Technical Field
[0001] This application relates to the fields of artificial intelligence and robotics, and in particular to a photographic robot system and method based on multimodal interaction and state collaboration.
Background Technology
[0002] With the development of artificial intelligence and robotics, intelligent robots designed to provide autonomous photography services have entered the application exploration stage. However, these technologies still face a series of key technological bottlenecks in achieving truly intelligent and humanized services.
[0003] In terms of interaction modes, current service robots mostly use simple state machines or scripts based on preset time / location for behavior switching, such as stopping at a fixed point and playing voice. Although some devices have follow functions, users usually need to manually turn them on or off, resulting in abrupt interaction and a disjointed experience. Robotic arms and other mechanisms only move when performing functions, remaining stationary when moving or in standby, appearing rigid and leading to a cold, instrumental interaction process that fails to establish an emotional connection with the user.
[0004] When multiple potential tasks or instructions exist simultaneously (such as being asked to take a photo while searching for an item), the lack of a clear priority decision-making and task state management mechanism can easily lead to erratic robot behavior, task failure, or user confusion, resulting in poor system behavior certainty. The robot's limbs only move when performing specific functions (such as raising its head to take a photo), remaining stationary while the robot is moving or waiting, making it appear rigid, lifeless, and lacking in approachability. The robot's limb movements are underutilized and lack expressiveness. Keyword-based interaction cannot handle complex interactions and requests that are conversational, contextual, or multimodal, limiting the robot's application scenarios and practicality, and its intent understanding ability is limited.
[0005] In summary, how to effectively address the significant shortcomings of related-art photography robot systems in terms of interactive initiative, service completeness, system determinism, emotional expression, and depth of understanding is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0006] The purpose of this application is to provide a photography robot system based on multimodal interaction and state collaboration. This system significantly improves the naturalness and overall coherence of the interaction process, ensures the reliability and smoothness of the system under complex working conditions, and enhances the intelligence, smoothness and emotionality of the overall interaction. Another purpose of this application is to provide a photography robot control method and a computer-readable storage medium based on multimodal interaction and state collaboration.
[0007] To solve the above-mentioned technical problems, this application provides the following technical solution:
A photography robot system based on multimodal interaction and state collaboration, comprising:
an interaction and decision-making central layer, used to receive user commands and collect image information of the scene where the photography robot is located; to understand the user commands and the image information by fusing speech recognition, visual perception, and multimodal large model reasoning, obtaining command understanding results and environment perception results; to match, using a retrieval-augmented generation algorithm, the standard command corresponding to each user command from a structured command library based on the command understanding results; and to send each standard command to the control and scheduling layer and the environment perception results to the execution layer;
a control and scheduling layer, used to prioritize the standard commands to obtain the command priority corresponding to each standard command, and to send each command priority to the execution layer;
an execution layer, used to execute each standard command according to its command priority, and to switch the global behavior mode of the photography robot according to the environment perception results during command execution; and
an emotion expression layer, used to obtain the system state and the interaction context of the interaction and decision-making central layer, the control and scheduling layer, and the execution layer, and, based on the system state and the interaction context, to dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen.
[0008] In one specific embodiment of this application, the control and scheduling layer is specifically used to assign the photo-taking instruction the highest priority when the standard instructions include a photo-taking instruction, and to assign priorities to the remaining standard instructions.
[0009] In one specific embodiment of this application, the execution layer is specifically used to execute the photo-taking command in collaboration with hardware and software through status flag management, and after the photo-taking command is executed, to execute other standard commands other than the photo-taking command according to the command priority.
[0010] In one specific embodiment of this application, the execution layer is specifically used to parse the photo-taking command to obtain the shooting type; set a global flag to lock the photo-taking task and adjust the chassis to the photo-taking state; continuously take photos a preset number of times according to the shooting type, and upload the captured images to the cloud for optimal image selection.
[0011] In one specific embodiment of this application, the execution layer is specifically used to execute each standard instruction according to the instruction priority by generating a comprehensive decision output that includes natural language responses and digital human actions.
[0012] In one specific embodiment of this application, the execution layer is further configured to determine the task type corresponding to each standard instruction; when it is determined that there is a long-cycle task based on each task type, the execution layer enters a task locking state when executing the standard instruction corresponding to the long-cycle task.
[0013] In one specific embodiment of this application, the interaction and decision-making central layer is specifically used to collect user instructions in voice form through a voice acquisition device and convert each user instruction from voice form to text form; and to collect image information of the scene where the photography robot is located through an image acquisition device.
[0014] In one specific embodiment of this application, the interaction and decision-making central layer is specifically used to understand the user instructions and image information by fusing speech recognition, visual perception, and multimodal large model reasoning, obtaining the instruction understanding result, the environment perception result, and each person's identity, behavioral actions, and emotional state; and to send the person's identity, behavioral actions, and emotional state to the emotion expression layer. The emotion expression layer is also used to dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen based on the system state, the interaction context, and the person's identity, behavioral actions, and emotional state.
[0015] A control method for a photography robot based on multimodal interaction and state collaboration, comprising:
receiving user instructions and collecting image information of the scene where the photography robot is located;
understanding the user instructions and the image information by fusing speech recognition, visual perception, and multimodal large model reasoning, to obtain instruction understanding results and environment perception results;
matching, using a retrieval-augmented generation algorithm, the standard instruction corresponding to each user instruction from a structured instruction library based on the instruction understanding results;
prioritizing the standard instructions to obtain the instruction priority corresponding to each standard instruction;
executing each standard instruction according to its instruction priority, and switching the global behavior mode of the photography robot according to the environment perception results during instruction execution;
obtaining the system state and the interaction context of the interaction and decision-making central layer, the control and scheduling layer, and the execution layer; and
dynamically driving, based on the system state and the interaction context, the physical movements of the robotic arm and the facial expressions of the digital human on the screen.
[0016] A computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the aforementioned control method for a photography robot based on multimodal interaction and state collaboration.
[0017] The photography robot system based on multimodal interaction and state collaboration provided in this application includes: an interaction and decision-making central layer, used to receive user commands and collect image information of the scene where the photography robot is located; by fusing speech recognition, visual perception, and multimodal large model reasoning, it understands user commands and image information to obtain command understanding results and environmental perception results; using a retrieval-enhanced generation algorithm, it matches the standard commands corresponding to each user command from a structured command library based on the command understanding results; it sends each standard command to the control and scheduling layer and sends the environmental perception results to the execution layer; the control and scheduling layer is used to prioritize each standard command to obtain the command priority corresponding to each standard command; the execution layer is used to execute each standard command according to the command priority and switch the global behavior mode of the photography robot according to the environmental perception results during the command execution process; and the emotion expression layer is used to obtain the system state and the interaction context of the interaction and decision-making central layer, control and scheduling layer, and execution layer; and dynamically drive the physical movements of the robotic arm and the digital human on the screen to output expressions according to the system state and interaction context.
[0018] As can be seen from the above technical solutions, by constructing an interaction and decision-making central layer, real-time multimodal perception is achieved. This integrates multi-dimensional information such as environmental vision, user behavior, dialogue status, and task progress, enabling dynamic and smooth switching of the robot's working modes. This allows the robot to proactively intervene or withdraw at appropriate times, significantly improving the naturalness and overall coherence of the interaction process. The control and scheduling layer can classify tasks in real-time and dynamically prioritize them, ensuring timely responses to high-priority commands and guaranteeing the system's reliability and smoothness under complex conditions. The emotional expression layer introduces emotional actions and facial expressions, allowing the photography robot to generate rich, anthropomorphic behaviors based on the current environment, user emotions, and historical interactions during breaks in core functions. These behaviors are triggered collaboratively by multimodal states and seamlessly integrated with the main task, significantly enhancing the robot's sense of life and approachability, and improving user emotional resonance and immersive experience. This improves the overall intelligence, smoothness, and emotional level of the interaction.
[0019] Accordingly, this application also provides a photography robot control method and a computer-readable storage medium based on multimodal interaction and state collaboration. These correspond to the photography robot system described above and have the same technical effects, which are not repeated here.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a structural block diagram of a photography robot system based on multimodal interaction and state collaboration in an embodiment of this application;
Figure 2 is an architectural diagram of a photography robot system according to an embodiment of this application;
Figure 3 is a flowchart of the control and scheduling layer of a photography robot system according to an embodiment of this application;
Figure 4 is a flowchart showing how the execution layer of a photography robot system in an embodiment of this application executes photo-taking commands;
Figure 5 is a flowchart of the execution layer of a photography robot system in this application generating special effects from photos;
Figure 6 is a flowchart of the interaction and decision-making central layer of a photography robot system according to an embodiment of this application;
Figure 7 is a flowchart of the emotional expression layer of a photography robot system according to an embodiment of this application;
Figure 8 is a flowchart of the implementation of a photography robot control method based on multimodal interaction and state collaboration in this application.
[0022] Reference numerals in the drawings: 1-Interaction and decision-making central layer, 2-Control and scheduling layer, 3-Execution layer, 4-Emotional expression layer.
Detailed Implementation
[0023] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0024] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0025] See Figure 1, which is a structural block diagram of a photography robot system based on multimodal interaction and state collaboration in an embodiment of this application. The system may include:
The interaction and decision-making central layer 1, used to receive user commands and collect image information of the scene where the photography robot is located; to understand the user commands and image information by fusing speech recognition, visual perception, and multimodal large model reasoning, obtaining command understanding results and environmental perception results; to match, using a retrieval-augmented generation algorithm, the standard command corresponding to each user command from the structured command library based on the command understanding results; and to send each standard command to the control and scheduling layer 2 and the environmental perception results to the execution layer 3.
The control and scheduling layer 2, used to prioritize the standard commands to obtain the command priority corresponding to each standard command, and to send each command priority to the execution layer 3.
The execution layer 3, used to execute the standard commands according to their command priority, and to switch the global behavior mode of the photography robot based on the environmental perception results during command execution.
The emotional expression layer 4, used to obtain the system state and the interaction context of the interaction and decision-making central layer 1, the control and scheduling layer 2, and the execution layer 3, and, based on the system state and the interaction context, to dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen.
[0026] The photography robot system based on multimodal interaction and state collaboration provided in this application embodiment may include an interaction and decision-making central layer 1, a control and scheduling layer 2, an execution layer 3, and an emotion expression layer 4. When a user needs to interact with the photography robot, the user sends it one or more user commands, for example as voice commands. The interaction and decision-making central layer 1 receives each user command and collects image information of the scene in which the photography robot is located, such as images of the surrounding environment captured by a camera. It then understands the user commands and image information by fusing speech recognition, visual perception, and multimodal large model reasoning, and obtains command understanding results and environmental perception results.
[0027] A structured instruction library storing standard instructions is built in advance. After obtaining the instruction understanding results and environmental perception results, the interaction and decision-making central layer 1 uses a retrieval-augmented generation (RAG) algorithm to match the standard instruction corresponding to each user instruction from the structured instruction library based on the instruction understanding results. The standard instructions are then sent to the control and scheduling layer 2, and the environmental perception results are sent to the execution layer 3.
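The matching step can be illustrated with a minimal sketch. It is not the application's implementation: the three-entry library, the names `INSTRUCTION_LIBRARY` and `match_standard_instruction`, and the 0.3 threshold are all hypothetical, and a plain token-overlap (Jaccard) score stands in for the embedding retrieval and large-model generation that a real retrieval-based pipeline would use.

```python
import re
from typing import Optional, Set

# Hypothetical structured instruction library: standard instruction -> example phrasings.
INSTRUCTION_LIBRARY = {
    "TAKE_PHOTO": ["take a photo of me", "take a picture", "shoot a portrait"],
    "FIND_OBJECT": ["find my cup", "look for my keys", "where is my bag"],
    "STOP_FOLLOWING": ["stop following me", "do not follow me"],
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_standard_instruction(understood_text: str, threshold: float = 0.3) -> Optional[str]:
    """Return the best-matching standard instruction, or None if nothing is close enough."""
    query = tokenize(understood_text)
    best_name, best_score = None, 0.0
    for name, phrasings in INSTRUCTION_LIBRARY.items():
        score = max(jaccard(query, tokenize(p)) for p in phrasings)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Returning `None` below the threshold leaves room for the caller to ask the user to rephrase rather than dispatch a poorly matched instruction.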
[0028] Control and scheduling layer 2 receives each standard instruction and prioritizes them to obtain the instruction priority corresponding to each standard instruction. For example, the photo-taking instruction can be preset as the highest priority. Control and scheduling layer 2 then sends the instruction priorities to execution layer 3.
[0029] After receiving the priorities of each instruction, Execution Layer 3 executes the standard instructions according to their priorities. During instruction execution, it switches the global behavior mode of the photography robot based on environmental perception results. For example, when an obstacle is detected, it first avoids the obstacle and then continues to execute the task corresponding to the instruction. By integrating multi-dimensional information such as environmental vision, user behavior, dialogue status, and task progress in real time, the robot's working modes (such as standby, following, interaction, shooting, and special effects processing) are dynamically and smoothly switched. This allows the photography robot to proactively intervene or withdraw at appropriate times, just like a human photographer, significantly improving the naturalness and overall coherence of the interaction process.
[0030] The emotional expression layer 4 can acquire the system status in real time and obtain the interaction context of the interaction and decision-making central layer 1, the control and scheduling layer 2, and the execution layer 3. Based on the system status and interaction context, it dynamically drives the physical movements of the robotic arm and the facial expressions of the digital human on the screen. This significantly enhances the robot's sense of life and approachability, and improves the user's emotional resonance and immersive experience. Through this design paradigm of central decision-making, layered execution, and emotional integration, each layer is constructed into a deeply coupled, real-time, bidirectional data flow organic whole. This architecture produces a significant synergistic effect.
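As a rough illustration of how a system state and an interaction context might be turned into a synchronized arm motion and screen expression, consider the lookup-table sketch below. The state names, emotion labels, motion names, and the `select_expression` function are assumptions made for the example; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    arm_motion: str  # physical movement of the robotic arm
    face: str        # expression shown by the on-screen digital human


# Hypothetical lookup table: (system state, user emotion) -> coordinated output.
EXPRESSION_TABLE = {
    ("standby", "neutral"): Expression("idle_sway", "blink"),
    ("interaction", "happy"): Expression("wave", "smile"),
    ("interaction", "sad"): Expression("slow_nod", "concerned"),
    ("shooting", "neutral"): Expression("raise_camera", "focus"),
}

# Fallback per system state when the emotion is not in the table.
DEFAULT_BY_STATE = {
    "standby": Expression("idle_sway", "blink"),
    "interaction": Expression("small_nod", "attentive"),
    "shooting": Expression("raise_camera", "focus"),
}


def select_expression(system_state: str, user_emotion: str) -> Expression:
    """Pick a synchronized arm motion and facial expression for the current context."""
    key = (system_state, user_emotion)
    if key in EXPRESSION_TABLE:
        return EXPRESSION_TABLE[key]
    # Unknown emotion: use the state default; unknown state: stay still and neutral.
    return DEFAULT_BY_STATE.get(system_state, Expression("hold", "neutral"))
```

Choosing both outputs from a single table entry keeps the arm and the screen in step, which is the coordination the paragraph above describes.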
[0031] See Figure 2, which is an architectural diagram of a photography robot system according to an embodiment of this application. The photography robot system of this embodiment adopts a collaborative design driven by core modules and executed in layers, forming an intelligent closed loop from intent understanding to action execution and incorporating anthropomorphic emotional expression.
[0032] In the interaction and decision-making central layer 1, the multimodal dialogue, instruction understanding, and decision-making control module acts as the system's brain. By integrating speech recognition, visual perception, and multimodal large model reasoning, it understands user instructions and scene context in depth. Combined with retrieval-augmented generation, it matches the corresponding standard instructions from a structured knowledge base.
[0033] In control and scheduling layer 2, the camera robot's state machine master control and navigation module serves as the system's nerve center, employing a dynamic scheduling model of "dual states (cruising / standing) + multi-priority commands." This module autonomously switches the robot's global behavior mode based on environmental perception results and performs hierarchical management and interrupt response for commands such as taking pictures, finding objects, and interacting. This ensures that high-priority tasks (such as taking pictures) can immediately seize resources, while a task locking mechanism guarantees focused execution of long-term tasks, achieving a balance between flexibility and stability in system behavior.
[0034] In execution layer 3, the intelligent photo-taking module serves as a functional module of the system, simulating the complete workflow of a photographer. From receiving instructions, collaborative preparation (chassis height adjustment, voice guidance), intelligent image acquisition, to data uploading and cloud linkage, this module transforms complex photography tasks into a reliable and automated "one-click image creation" service through precise status flag management and hardware-software collaboration.
[0035] In the emotional expression layer 4, the robotic arm and screen expression behavior coordination module serves as the system's emotional interaction system. Based on the system state and interaction context, it dynamically drives the synchronized output of the robotic arm's physical movements and the digital human's facial expressions on the screen. From random small movements during standby to various contextualized emotional responses during interaction, and then to the dedicated ritualistic process when taking photos, this module infuses the robot with a coherent sense of life and emotional affinity without interfering with the core functions.
[0036] The layers collaborate efficiently while remaining decoupled, through a unified data interface and messaging mechanism, forming an organic whole that systematically addresses the module fragmentation and inconsistent user experience of related technologies. Through deep collaboration and closed-loop data flow among the four core modules, the system achieves end-to-end intelligence, from environmental perception, intent understanding, intelligent decision-making, and precise execution to emotional interaction. This improves the photography robot's contextual understanding, multi-task scheduling reliability, image quality, and the naturalness and emotional warmth of human-computer interaction.
[0037] As can be seen from the above technical solutions, by constructing an interaction and decision-making central layer, real-time multimodal perception is achieved. This integrates multi-dimensional information such as environmental vision, user behavior, dialogue status, and task progress, enabling dynamic and smooth switching of the robot's working modes. This allows the robot to proactively intervene or withdraw at appropriate times, significantly improving the naturalness and overall coherence of the interaction process. The control and scheduling layer can classify tasks in real-time and dynamically prioritize them, ensuring timely responses to high-priority commands and guaranteeing the system's reliability and smoothness under complex conditions. The emotional expression layer introduces emotional actions and facial expressions, allowing the photography robot to generate rich, anthropomorphic behaviors based on the current environment, user emotions, and historical interactions during breaks in core functions. These behaviors are triggered collaboratively by multimodal states and seamlessly integrated with the main task, significantly enhancing the robot's sense of life and approachability, and improving user emotional resonance and immersive experience. This improves the overall intelligence, smoothness, and emotional level of the interaction.
[0038] It should be noted that, based on the above embodiments, this application also provides corresponding improvement solutions. In subsequent embodiments, steps that are the same as or corresponding to those in the above embodiments can be referred to each other, and the corresponding beneficial effects can also be referred to each other. These improvements will not be elaborated upon in the following improved embodiments.
[0039] In one specific embodiment of this application, the control and scheduling layer 2 is specifically used to assign the photo-taking instruction the highest priority when the standard instructions include a photo-taking instruction, and to assign priorities to the remaining standard instructions.
[0040] That is, whenever a photo-taking command is among the standard commands, the control and scheduling layer 2 places it first and then ranks the other standard commands. Setting the photo-taking command as the highest priority guarantees a timely response to it.
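The prioritization rule can be sketched with a small priority queue. Only the rule that photo-taking comes first is taken from the text; the other instruction names, their relative levels in `PRIORITY`, and the `DEFAULT_PRIORITY` value are hypothetical.

```python
import heapq
from itertools import count

# Hypothetical priority levels: lower number = executed first.
PRIORITY = {"TAKE_PHOTO": 0, "FIND_OBJECT": 1, "MOVE_CHASSIS": 2, "EMOTION_ACTION": 3}
DEFAULT_PRIORITY = 9


def prioritize(standard_instructions):
    """Return instructions ordered by priority, keeping arrival order within a level."""
    arrival = count()  # tie-breaker so equal priorities stay first-come, first-served
    heap = []
    for name in standard_instructions:
        heapq.heappush(heap, (PRIORITY.get(name, DEFAULT_PRIORITY), next(arrival), name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For example, `prioritize(["EMOTION_ACTION", "FIND_OBJECT", "TAKE_PHOTO"])` places `"TAKE_PHOTO"` first regardless of where it arrived in the batch.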
[0041] See Figure 3, which is a flowchart of the control and scheduling layer 2 of a photography robot system according to an embodiment of this application. Control and scheduling layer 2 is the core control unit for overall robot behavior scheduling, responsible for managing the robot's global operating mode and task execution priority. It adopts a two-layer architecture of "state mode + command mode". First, based on the contextual understanding provided by the interaction and decision-making central layer 1 (such as "recognizing a specific wake word and approach posture"), it autonomously and smoothly transitions between the two states, "cruising" and "standing". In either state, it processes input commands according to a set of dynamic rules bound to the state, giving photo-taking commands a preemptive response with the highest global priority. For object-finding commands, it enters a task-locking sub-state, filtering out new interactive requests and retaining only critical system feedback channels. This strategy integrates macro-level behavior mode management with micro-level task resource scheduling, going beyond traditional state machine or priority queue methods, and ensures high responsiveness and behavioral determinism of the system in complex dynamic environments.
[0042] The system has two main states with clearly defined functional divisions. In cruise mode, the robot moves autonomously, continuously sensing its surroundings and searching for potential interaction opportunities. The standing state is a service mode entered after detecting a user's intention to interact, supporting functions such as following, conversation, and taking photos. State switching is entirely driven by the multimodal perception system and the retrieval enhancement and command distribution module. When it detects someone proactively greeting it (e.g., a specific voice or gesture) or an "interesting target person" approaching, it automatically switches from cruise mode to the standing chat state. Conversely, if it receives a "stop following" command or fails to detect a valid interaction object within a set time, it automatically returns to cruise mode. This dynamic switching mechanism based on perception and action enables the robot to interact with users like a human photographer, significantly improving the naturalness and fluency of the photo-taking robot's interaction.
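The two-state switching described here can be sketched as a small state machine. The event names (`greeting_detected`, `target_approaching`, `stop_following`, `user_present`) and the 30-second default timeout are assumptions; the text only says that the robot returns to cruising after "a set time" without a valid interaction object.

```python
from enum import Enum


class Mode(Enum):
    CRUISING = "cruising"
    STANDING = "standing"


class ModeController:
    """Two-state controller: cruising <-> standing, driven by perception events."""

    def __init__(self, idle_timeout_s: float = 30.0):
        self.mode = Mode.CRUISING
        self.idle_timeout_s = idle_timeout_s  # assumed value; the source gives no number
        self._idle_s = 0.0

    def on_event(self, event: str) -> Mode:
        """Handle a perception or command event and return the resulting mode."""
        if self.mode is Mode.CRUISING and event in ("greeting_detected", "target_approaching"):
            self.mode = Mode.STANDING
            self._idle_s = 0.0
        elif self.mode is Mode.STANDING and event == "stop_following":
            self.mode = Mode.CRUISING
        elif self.mode is Mode.STANDING and event == "user_present":
            self._idle_s = 0.0  # a valid interaction object resets the idle timer
        return self.mode

    def tick(self, dt_s: float) -> Mode:
        """Advance time; fall back to cruising after a period with no valid user."""
        if self.mode is Mode.STANDING:
            self._idle_s += dt_s
            if self._idle_s >= self.idle_timeout_s:
                self.mode = Mode.CRUISING
        return self.mode
```

Keeping the timeout in `tick` rather than in `on_event` means the fallback to cruising happens even when no events arrive at all.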
[0043] In any state, the system can receive user commands, but different commands have different priorities and execution strategies. The photo-taking command is defined as the highest-priority, interrupt-type task: regardless of the current state or operation, once triggered, it immediately pauses the current lower-priority behavior, invokes a dedicated intelligent photo-taking process, and returns to the original state after completion. In addition, in the standing state, the robot can respond to two types of auxiliary commands. The first is chassis movement commands (basic movements such as forward, backward, and turning), used to fine-tune its position for shooting or interaction, after which it automatically returns to the standing state. The second is emotional action commands (triggered by the dialogue between the user and the robot), used to enhance human-like performance.
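The pause-and-resume behavior of the photo-taking command can be sketched with a stack of behaviors. The class name `BehaviorStack`, the behavior labels, and the log format are illustrative assumptions, not part of the application.

```python
class BehaviorStack:
    """Tracks the running behavior and lets a photo command preempt and later resume it."""

    def __init__(self, initial: str):
        self._stack = [initial]
        self.log = []

    @property
    def current(self) -> str:
        return self._stack[-1]

    def preempt_with_photo(self) -> None:
        """Pause whatever is running and start the photo workflow on top of it."""
        if self.current == "photo":
            return  # already shooting; ignore the repeated trigger
        self.log.append(f"pause:{self.current}")
        self._stack.append("photo")

    def finish_photo(self) -> None:
        """End the photo workflow and resume the behavior that was paused."""
        if self.current != "photo":
            return
        self._stack.pop()
        self.log.append(f"resume:{self.current}")
```

Because the paused behavior stays on the stack, the robot returns to exactly the state it was in before the photo was requested.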
[0044] For complex tasks requiring extended execution, such as locating objects or people (e.g., "Find my cup"), the system enters a task lock state. In this state, new user interaction requests (e.g., follow instructions, move instructions, etc.) are temporarily blocked, and only voice feedback on task progress (e.g., "Searching...") is allowed to ensure task focus and completion rate. The system exits the lock state and returns to its pre-trigger state (cruising or standing) only after the task is successfully completed or the user actively cancels it. This mechanism effectively resolves resource conflicts during multi-tasking concurrency and improves system reliability in real-world environments.
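The task lock can be sketched as a small gate object. The class name `TaskLock`, its method names, and the `"cancel"` request string are assumptions made for illustration; the application does not describe how the lock is implemented.

```python
from typing import Optional


class TaskLock:
    """Blocks new interaction requests while a long-running task is active."""

    def __init__(self) -> None:
        self.locked_task: Optional[str] = None
        self.previous_state: Optional[str] = None

    def start(self, task: str, current_state: str) -> None:
        self.locked_task = task
        self.previous_state = current_state  # remembered for the later restore

    def accept(self, request: str) -> bool:
        """While the lock is held, only a cancellation passes through."""
        if self.locked_task is None:
            return True
        return request == "cancel"

    def finish(self) -> Optional[str]:
        """Release the lock and return the state to restore."""
        restored = self.previous_state
        self.locked_task = None
        self.previous_state = None
        return restored
```

Progress announcements such as "Searching..." would be produced by the locked task itself, so they do not need to pass through `accept`.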
[0045] The control and scheduling layer 2 constructs a control framework that combines flexibility and determinism through three major mechanisms: state-aware driven switching, hierarchical dynamic scheduling of instructions, and long-term task isolation protection. It not only realizes a paradigm shift from passive response to proactive service, but also provides a unified and reliable collaborative foundation for subsystems such as photography, navigation, and anthropomorphic behavior, serving as a key technological pillar supporting the intelligent and human-centered photography experience of this invention.
[0046] In one specific embodiment of this application, the execution layer 3 is specifically used to execute the photo-taking command in coordination with hardware and software through status flag management, and to execute other standard commands other than the photo-taking command according to the command priority after the photo-taking command is executed.
[0047] Execution layer 3 is specifically used to execute photo-taking commands through status flag management and hardware / software collaboration. For example, by setting a photo-taking task lock, it prevents other tasks from interrupting the photo-taking task. After the photo-taking command is executed, other standard commands besides the photo-taking command are executed according to their priority. Through status flag management and hardware / software collaboration, high-quality and smooth execution of photo-taking tasks is ensured.
[0048] In one specific embodiment of this application, execution layer 3 is specifically used to parse the photo-taking command to obtain the shooting type; set a global flag to lock the photo-taking task and adjust the chassis to the photo-taking state; continuously take photos a preset number of times according to the shooting type, and upload the captured images to the cloud for optimal image selection.
[0049] Execution layer 3 is specifically responsible for parsing the photo-taking command, determining the shooting type, setting a global flag to lock the photo-taking task, adjusting the chassis to photo-taking mode, continuously taking a preset number of shots according to the shooting type, and uploading the captured images to the cloud for optimal image selection. Locking the photo-taking task by setting a global flag prevents interruptions from other tasks. Adjusting the chassis to photo-taking mode provides favorable physical conditions for obtaining high-quality images. Continuously taking a preset number of shots according to the shooting type and uploading the captured images to the cloud for optimal image selection ensures the quality of the selected photos.
[0050] By parsing the user's multimodal input (voice, gestures, scene references) and transforming it into structured shooting semantics, the collaborative control system is driven to automatically complete scene analysis, subject tracking, composition optimization, special effects processing, and even automatic shooting and preliminary post-processing, ultimately generating high-quality photos that meet the user's expectations, significantly reducing the technical requirements for professional photography.
[0051] See Figure 4. Figure 4 is a flowchart illustrating how execution layer 3 of the photography robot system executes a photo-taking command in this embodiment. Execution layer 3 is the core component of the entire system, realizing a closed loop from user intent to final image. Its goal is to simulate the workflow of a professional photographer, automatically producing high-quality photos that meet user expectations through multimodal collaboration and intelligent processing. This module not only completes image acquisition but also integrates capabilities such as environment and user-intent perception, pose adjustment, voice guidance, original image selection, and cloud-based intelligent processing of photo effects, making it suitable for a variety of complex scenarios, including single-person and group photos.
[0052] Upon receiving a photo-taking command, the main control state machine immediately enters the photo-taking process and sets a global flag (such as is_in_capture_process=True) to lock the current task and prevent interference from low-priority operations. Simultaneously, the system dynamically configures subsequent behaviors based on the shooting type in the command (such as "single photo" or "group photo") and clears the history cache directory to ensure that each photo capture is a clean start. This initialization process lays the foundation for subsequent highly reliable execution.
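The initialization step might look like the sketch below. Only the flag name `is_in_capture_process` comes from the text; the `CaptureController` class, the cache layout, and the returned configuration keys are hypothetical. The flag is modeled here as an attribute of a single controller object rather than a module-level global, which is an assumption made to keep the example easy to test.

```python
import shutil
import tempfile
from pathlib import Path


class CaptureController:
    """Holds the capture lock and the frame cache directory."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.is_in_capture_process = False  # flag name taken from the text

    def begin(self, shoot_type: str) -> dict:
        """Lock the capture task, clear old frames, return a per-type config."""
        if self.is_in_capture_process:
            raise RuntimeError("capture already in progress")
        self.is_in_capture_process = True
        # Clear the history cache so every capture starts clean.
        if self.cache_dir.exists():
            shutil.rmtree(self.cache_dir)
        self.cache_dir.mkdir(parents=True)
        # Hypothetical config; the source does not list per-type settings.
        return {"shoot_type": shoot_type, "frame_count": 5}

    def end(self) -> None:
        """Release the lock once the capture flow is over."""
        self.is_in_capture_process = False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ctl = CaptureController(Path(tmp) / "cache")
        print(ctl.begin("group photo"))
        ctl.end()
```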
[0053] Before the actual shooting, the system performs a series of coordinated preparatory actions. First, the chassis automatically lowers by 5 centimeters. Simultaneously, the robotic arm transitions from an emotional pose to a shooting pose to achieve a more suitable framing height, demonstrating proactive adaptation to photographic aesthetics and height. Second, the system simultaneously plays a "Ready to shoot" voice prompt and precisely controls the pace through a dual timeout detection mechanism (waiting for the voice to start + waiting for the voice to end), ensuring the user's attention is focused and their posture is ready. Chassis adjustment and voice announcements are executed in parallel through independent threads, balancing efficiency and user experience.
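The parallel preparation and the dual timeout check can be sketched with standard-library threads and events. The functions `lower_chassis` and `play_prompt` are stand-ins for the real chassis and audio calls, which the application does not name, and both timeout values are assumed.

```python
import threading
from typing import Callable


def lower_chassis(done: threading.Event) -> None:
    """Stand-in for the real chassis call (hypothetical); finishes at once."""
    done.set()


def play_prompt(started: threading.Event, ended: threading.Event) -> None:
    """Stand-in for the voice player (hypothetical)."""
    started.set()   # audio has begun
    ended.set()     # audio has finished


def prepare_shot(
    prompt: Callable[[threading.Event, threading.Event], None] = play_prompt,
    start_timeout: float = 1.0,   # assumed: wait for the voice to start
    end_timeout: float = 5.0,     # assumed: wait for the voice to end
) -> bool:
    """Run chassis adjustment and the voice prompt in parallel threads."""
    chassis_done = threading.Event()
    voice_started = threading.Event()
    voice_ended = threading.Event()
    threads = [
        threading.Thread(target=lower_chassis, args=(chassis_done,)),
        threading.Thread(target=prompt, args=(voice_started, voice_ended)),
    ]
    for t in threads:
        t.start()
    # Dual timeout: first wait for the prompt to start, then for it to end.
    ok = (
        voice_started.wait(start_timeout)
        and voice_ended.wait(end_timeout)
        and chassis_done.wait(end_timeout)
    )
    for t in threads:
        t.join(timeout=end_timeout)
    return ok
```

Splitting the wait in two lets the caller tell "the prompt never started" apart from "the prompt started but hung", which a single overall timeout would not.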
[0054] Image acquisition follows a multi-image selection strategy: after receiving the photo-taking command, the system starts a high-speed continuous shooting process, taking 5 photos in rapid succession (at intervals of about 10 milliseconds) and uploading them to the cloud, where the best original image is selected.
[0055] After the photo is taken, the robot's chassis returns to its original height, and the robotic arm exits the posing mode and transitions to an emotional / gestural state. Subsequently, all images are sent to the cloud via a unified upload service. This service automatically selects the upload interface based on the shooting type—regular photos use `type="person"`, while group photos use the dedicated `type="CESPHOTO"` identifier, and it calls the multi-image upload application programming interface (API). The upload process incorporates a robust fault tolerance mechanism, including file verification, size limits, timeout control, and compatibility with various server response formats, ensuring reliable data delivery even under abnormal conditions such as network fluctuations. In cloud processing, the best photo is first selected from the five uploaded original images using image processing algorithms. This multi-frame acquisition and selection mechanism significantly improves the success rate of capturing the "best user expression or image quality" in real-time dynamic shooting scenarios, solving the problem of missing the optimal moment with a traditional single shutter release.
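The upload preparation described here can be sketched as a validation step followed by request assembly. The `type` values `"person"` and `"CESPHOTO"` come from the text; everything else is assumed for illustration, including the 10 MB limit, the 10-second timeout, the JPEG signature check, and the request layout. No real upload API is called.

```python
from typing import Dict, List

MAX_BYTES = 10 * 1024 * 1024   # assumed size limit; the source gives no figure
JPEG_MAGIC = b"\xff\xd8\xff"   # JPEG files start with these bytes


def upload_type_for(shoot_type: str) -> str:
    """Map the shooting type to the `type` field named in the text."""
    return "CESPHOTO" if shoot_type == "group photo" else "person"


def build_upload_request(frames: Dict[str, bytes], shoot_type: str) -> Dict[str, object]:
    """Validate the burst frames and assemble one multi-image request."""
    valid: List[str] = []
    for name, data in frames.items():
        if not data.startswith(JPEG_MAGIC):
            continue              # file verification (assumes JPEG frames)
        if len(data) > MAX_BYTES:
            continue              # size limit
        valid.append(name)
    if not valid:
        raise ValueError("no valid images to upload")
    return {"type": upload_type_for(shoot_type), "files": valid, "timeout_s": 10}
```

Rejecting bad frames before the request is built means a single corrupt file does not cause the whole burst to fail.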
[0056] See Figure 5. Figure 5 is a flowchart illustrating how execution layer 3 of the photography robot system generates special effects from photos in this application. The system performs cloud-based special effects processing on single-person photos or group photos.
[0057] Finally, after a successful shooting task, the system sends a {"status": "stop"} message to the global task channel as a unified signal that the shooting task is complete. This signal triggers a robot status update, plays a success sound effect, or guides the next interaction. All temporary status flags are cleared, resources are released, and the photography robot returns to the level from which the shooting sub-task was entered, ready to respond to the next instruction. In summary, execution layer 3, through deep hardware and software collaboration, refined state management, and cloud-based intelligent linkage, transforms the complex task of "professional photography" into a user-friendly and smooth operation.
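The completion signal can be sketched with an in-process queue standing in for the global task channel. Only the message body `{"status": "stop"}` comes from the text; the queue-based channel and the functions `finish_capture` and `on_message` are assumptions.

```python
import json
import queue


def finish_capture(flags: dict, channel: "queue.Queue[str]") -> None:
    """Clear temporary flags, then publish the completion signal."""
    flags.clear()
    channel.put(json.dumps({"status": "stop"}))


def on_message(raw: str) -> str:
    """Consumer side: decide what to do with a channel message."""
    msg = json.loads(raw)
    if msg.get("status") == "stop":
        return "capture_complete"
    return "ignored"
```

A single well-known message gives every subscriber (status display, sound effects, dialogue) the same trigger without coupling them to the capture code.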
[0058] In one specific embodiment of this application, the execution layer 3 is specifically used to execute each standard instruction according to the instruction priority by generating a comprehensive decision output that includes natural language responses and digital human actions.
[0059] Execution layer 3 is specifically used to execute standard instructions according to their priority by generating a comprehensive decision output that includes natural language responses and digital human actions. By outputting natural language responses and digital human actions, the robot's sense of life and approachability is significantly enhanced, improving the user's emotional resonance and immersive experience.
[0060] In one specific embodiment of this application, the execution layer 3 is further used to determine the task type corresponding to each standard instruction; when it is determined that there is a long-cycle task based on each task type, the execution layer 3 enters a task locking state when executing the standard instruction corresponding to the long-cycle task.
[0061] Execution layer 3 is also used to determine the task type corresponding to each standard instruction. When the task types indicate a long-cycle task, such as an object- or person-locating task, the system enters a task-locking state while executing the corresponding standard instruction. The lock ensures focused execution of the long-cycle task and balances flexibility with stability in system behavior.
[0062] The photography robot system based on multimodal interaction and state collaboration provided in this application can classify tasks in real time and evaluate their dynamic priorities, ensuring that high-priority instructions (such as emergency stop and safe obstacle avoidance) are responded to in a timely manner. At the same time, it maintains state and isolates resources for long-cycle tasks, so that they can run stably in the background without being accidentally interrupted, thus ensuring the reliability and smoothness of the system under complex working conditions.
[0063] In one specific embodiment of this application, the interaction and decision-making central layer 1 is specifically used to collect user instructions in voice form through a voice acquisition device and convert each user instruction from voice form to text form; and to collect image information of the scene where the photography robot is located through an image acquisition device.
[0064] The interaction and decision-making central layer 1 is specifically used to collect user commands in voice form through a voice acquisition device, convert the user commands from voice to text form, and collect image information of the scene where the photography robot is located through an image acquisition device. By converting user commands from voice to text form, it facilitates command understanding.
[0065] By constructing a cross-modal alignment and inference network, the semantic information of voice commands can be deeply integrated with the entity, spatial, and aesthetic features of real-time visual scenes. Through collaborative encoding and inference of cross-modal information, the system can accurately parse implicit preferences, styles, and spatial relationships, generating not only accurate but also highly personalized robot responses and execution commands, achieving truly intelligent collaborative shooting.
[0066] See Figure 6. Figure 6 is a flowchart illustrating the workflow of the interaction and decision-making central layer 1 of a photography robot system according to an embodiment of this application. The interaction and decision-making central layer 1 is the interaction hub of the entire photography robot system, responsible for receiving user voice and visual input and fusing multimodal information for deep semantic understanding. This module adopts a layered processing architecture, integrating key technologies such as speech recognition, multimodal large-model reasoning, and retrieval-enhanced generation to achieve a complete closed loop from perception to decision-making to feedback.
[0067] The system first captures the user's voice through a microphone, while simultaneously using a camera to acquire visual images of the current environment. The voice signal is then processed by Voice Activity Detection (VAD) to determine valid speech segments before being sent to an automatic speech recognition module to be converted into text. Visual data is used to identify the person's identity, actions, emotional state, and key objects and spatial relationships in the surrounding environment. These two sets of information together form the basis for subsequent multimodal fusion analysis.
[0068] To further improve the accuracy and structure of command parsing, the system introduces a retrieval-enhanced generation mechanism. This mechanism uses the text produced by automatic speech recognition as the query and retrieves the most relevant structured command templates from a pre-built knowledge base. Examples include: {command: "Take a photo", parameter: {mode: "single"}}, {command: "Take a photo", parameter: {mode: "group photo"}}, or {command: "Play a song", parameter: {song title: "nocturne"}}. This mechanism not only supports standard task commands but also recommends personalized parameters based on multimodal context, significantly enhancing the system's generalization ability and practicality.
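Template retrieval of this kind can be illustrated with a deliberately simple stand-in. The sketch below ranks templates by word overlap (Jaccard similarity) between the recognized text and an example utterance; the application does not say how retrieval is scored, so the scoring method, the example utterances, and the use of `mode` as the parameter key for both photo templates are all assumptions.

```python
from typing import Dict, List

# Each template pairs a hypothetical example utterance with a structured command.
TEMPLATES: List[Dict] = [
    {"utterance": "take a photo of me",
     "command": "Take a photo", "parameter": {"mode": "single"}},
    {"utterance": "take a group photo of us",
     "command": "Take a photo", "parameter": {"mode": "group photo"}},
    {"utterance": "play the song nocturne",
     "command": "Play a song", "parameter": {"song title": "nocturne"}},
]


def tokens(text: str) -> set:
    """Lower-case whitespace tokenization; enough for this toy example."""
    return set(text.lower().split())


def retrieve(query: str) -> Dict:
    """Return the template whose example utterance best overlaps the query."""
    q = tokens(query)

    def score(t: Dict) -> float:
        u = tokens(t["utterance"])
        return len(q & u) / len(q | u)   # Jaccard similarity

    best = max(TEMPLATES, key=score)
    return {"command": best["command"], "parameter": best["parameter"]}
```

A production system would more plausibly use embedding similarity, but the interface (recognized text in, structured command out) stays the same.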
[0069] Based on the multimodal large model and retrieval enhancement, the system generates an inference output from the retrieval results. It then enters the response generation phase and selects one of two execution paths according to the instruction type. Path 1 (General Command Response): applicable to most interactive scenarios (such as taking photos, finding objects, and answering questions). The system generates speech through the Omni multimodal large model and simultaneously calls the retrieval enhancement generation module to generate matching facial expressions and robot emotional actions (such as nodding, being happy, being excited, etc.), achieving human-like feedback in which what the user sees and what the user hears are consistent.
[0070] Path 2 (Voice Priority Response): For specific voice-related function requirements (such as singing, song cloning, robot prompts, etc.), the system directly calls preset audio segments or cloud application programming interfaces through "audio selection", prioritizes them through the overall voice interaction control, and then plays the output of the large model and related voice modules in an orderly manner, improving the robustness and smoothness of the response.
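The choice between the two response paths can be sketched as a small dispatcher. The intent labels, the `preset:` and `tts:` prefixes, and the emotion names are hypothetical placeholders; the application names the two paths but not their data formats.

```python
from typing import Dict, Optional

# Intent names are hypothetical labels for the examples given in the text.
VOICE_PRIORITY_INTENTS = {"sing", "song_clone", "robot_prompt"}


def plan_response(intent: str, reply_text: str) -> Dict[str, Optional[str]]:
    """Pick Path 1 (general) or Path 2 (voice priority) for one instruction."""
    if intent in VOICE_PRIORITY_INTENTS:
        # Path 2: a preset clip or cloud audio is selected and played first.
        return {"path": "voice_priority", "audio": f"preset:{intent}", "emotion": None}
    # Path 1: generated speech plus a matching emotional action.
    emotion = "happy" if intent == "take_photo" else "nod"
    return {"path": "general", "audio": f"tts:{reply_text}", "emotion": emotion}
```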
[0071] Ultimately, this module produces three types of output: Executable commands (such as "take a picture" or "walk forward") are sent to the main control state machine module as the basis for global task scheduling, driving the robot body to perform actions. Among them, "taking a picture" is defined as the highest priority interrupt task, while other related action commands are synchronously sent to the robotic arm or chassis actuator to achieve cross-module collaboration.
[0072] Voice feedback is played through the speaker, supporting multiple tone options to enhance the personalized experience.
[0073] The robot's emotional actions and expressions are generated by a retrieval enhancement module, which triggers the generation of matching facial expressions and related commands for the robot's emotional actions (such as nodding, happiness, excitement, etc.), further strengthening the emotional connection and interactive immersion. In addition, related action commands can be simultaneously sent to the robotic arm or chassis actuators, enabling cross-module collaboration.
[0074] This architecture can not only parse complex, conversational instructions that depend on context, such as "shoot like before," but also ensure that the understanding results can accurately and structurally drive the downstream execution system, solving the problems of poor generalization ability and low accuracy of traditional keyword matching or single-modal understanding.
[0075] In summary, the interaction and decision-making central layer 1, by deeply integrating voice and visual information and combining structured knowledge retrieval with generative artificial intelligence (AI), has built an intelligent interaction engine with high semantic understanding, strong context adaptability, and rich expressiveness, laying the core foundation for the whole device to achieve an intelligent photography experience of "reading between the lines, providing proactive services, and executing precisely".
[0076] In one specific embodiment of this application, the interaction and decision-making central layer 1 is specifically used to understand user instructions and image information by fusing speech recognition, visual perception and multimodal large model reasoning, and obtain instruction understanding results, environmental perception results, character identity, behavior and emotional state; and send character identity, behavior and emotional state to the emotion expression layer 4. The emotional expression layer 4 is also used to dynamically drive the physical movements of the robotic arm and the digital human on the screen to output facial expressions based on the system status, interaction context, character identity, behavior and emotional state.
[0077] The Interaction and Decision-Making Central Layer 1 is specifically used to understand user commands and image information by integrating speech recognition, visual perception, and multimodal large-scale model reasoning, obtaining command understanding results, environmental perception results, character identity, behavioral actions, and emotional states. This character identity, behavioral actions, and emotional states are then sent to the Emotional Expression Layer 4. The Emotional Expression Layer 4 is also used to dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen based on the system state, interaction context, character identity, behavioral actions, and emotional states. By understanding the character identity, behavioral actions, and emotional states through the Interaction and Decision-Making Central Layer 1, and then dynamically driving the physical movements of the robotic arm and the facial expressions of the digital human on the screen based on these factors, the Emotional Expression Layer 4 significantly enhances the robot's sense of life and approachability, improves the user's emotional resonance and immersive experience, and enhances the overall intelligence, fluency, and emotional level of the interaction.
[0078] By introducing emotional actions and facial expressions, the robot can generate a variety of human-like behaviors during the intervals between core functions, based on the current environment, user emotions, and historical interactions. These behaviors include curiously looking around, nodding in agreement, and resting or waiting postures. These behaviors are triggered collaboratively by multimodal states and seamlessly integrated with the main task, thereby significantly enhancing the robot's sense of life and approachability, and improving the user's emotional resonance and immersive experience.
[0079] See Figure 7. Figure 7 is a flowchart illustrating the workflow of the emotional expression layer 4 of a photography robot system according to an embodiment of this application. The emotional expression layer 4 aims to enhance the robot's emotional affinity and interactive immersion through anthropomorphic emotional expression. The emotional expression layer 4 adopts a multimodal emotion-driven architecture, deeply coordinating the physical movements of the robotic arm with the facial expressions of the digital human on the screen to form a unified emotional output unit. All behaviors are dynamically triggered by system states (such as idle, walking, and taking photos) and the context of interaction with the user (such as whether someone is conversing, semantic emotions), ensuring that the robot achieves a better anthropomorphic effect without interfering with navigation, perception, and core tasks.
[0080] In unmanned interaction scenarios, the system enters an autonomous standby mode, where the robotic arm and screen randomly cycle through low-power emotion combinations, such as curiosity, effectively avoiding the dullness caused by stationary devices and maintaining the robot's activity. In scenarios with human interaction, the system accurately matches over a dozen emotional actions and facial expressions based on multimodal intent understanding results. These include attentional postures during wake-up, curiosity when asking questions, shy avoidance when receiving praise, head-shaking feedback when negating, and dynamic expressions of positive emotions such as excitement and happiness. Each emotion corresponds to a preset robotic arm movement trajectory and screen facial animation, achieving a consistent audiovisual emotional delivery.
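The pairing of each emotion with an arm trajectory and a screen animation can be sketched as a lookup table. The cue names, trajectory names, and animation names below are hypothetical labels for the examples in the paragraph; the idle pool contains only curiosity because that is the only idle emotion the text names.

```python
import random
from typing import NamedTuple, Optional


class Expression(NamedTuple):
    arm_trajectory: str      # preset robotic-arm motion (name is illustrative)
    screen_animation: str    # preset digital-human face animation


EMOTIONS = {
    "wake_up": Expression("lean_forward", "attentive"),
    "question": Expression("tilt", "curious"),
    "praise": Expression("turn_away", "shy"),
    "negation": Expression("shake", "head_shake"),
    "positive": Expression("bounce", "happy"),
}
IDLE_POOL = ["question"]  # low-power idle emotions; the text names curiosity


def select_expression(user_present: bool, cue: Optional[str],
                      rng: random.Random) -> Expression:
    """Pick one synchronized arm-plus-screen expression."""
    if not user_present:
        # Autonomous standby: cycle randomly through the idle pool.
        return EMOTIONS[rng.choice(IDLE_POOL)]
    # With a user present, match the cue; default to the attentive pose.
    return EMOTIONS.get(cue, EMOTIONS["wake_up"])
```

Because both channels are looked up from the same entry, the arm and the screen cannot drift into mismatched emotions.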
[0081] Upon receiving a high-priority photo-taking command, the system immediately interrupts its regular emotional actions and switches to a dedicated photo-taking action: the robotic arm smoothly raises to the desired composition height, the gimbal fine-tunes the focus, and a focused expression appears on the screen, clearly conveying the shooting intention. This action combines functionality and a sense of ritual, significantly improving user cooperation and interaction certainty. After the photo is taken, the system automatically returns to its original interactive state or idle mode.
[0082] Overall, the emotional expression layer 4 constructs a resource-efficient, responsive, and expressive anthropomorphic behavioral system through state-perception-driven triggering, emotion classification modeling, and dual-channel synchronous output. Its design not only strengthens the robot's role as an "intelligent photography partner," but also achieves a crucial leap from a tool-like device to an emotionally interactive entity without increasing the system's burden.
[0083] The photography robot system based on multimodal interaction and state collaboration provided in this application provides an adaptive switching mechanism for the global state of the photography robot. This mechanism defines two core main states: "cruising exploration" and "standing service," and dynamically drives seamless switching between states by real-time fusion of visual, auditory, and interaction context information. For example, if the system actively detects an "interaction initiation signal" in the cruising state, it automatically switches to the standing service state. In the standing service state, if no effective interaction is detected or an end command is received, cruising automatically resumes. This mechanism overcomes the drawbacks of traditional robots that rely on preset or manual switching, achieving natural and proactive interaction initiation and termination, and improving the smooth human-computer interaction experience.
[0084] Corresponding to the above system embodiments, this application also provides a camera robot control method based on multimodal interaction and state collaboration. The camera robot control method based on multimodal interaction and state collaboration described below can be referred to in correspondence with the camera robot system based on multimodal interaction and state collaboration described above.
[0085] See Figure 8. Figure 8 is a flowchart illustrating an implementation of a camera robot control method based on multimodal interaction and state collaboration in this application. The method may include the following steps:
S801: Receive instructions from users and collect image information of the scene where the photography robot is located.
[0086] S802: Understand the user instructions and the image information by integrating speech recognition, visual perception, and multimodal large model reasoning, obtaining instruction understanding results and environmental perception results.
[0087] S803: Using a retrieval-enhanced generation algorithm, match the standard instruction corresponding to each user instruction from the structured instruction library based on the instruction understanding results.
[0088] S804: Prioritize the standard instructions to obtain the instruction priority corresponding to each standard instruction.
[0089] S805: Execute the standard instructions according to instruction priority, and switch the global behavior mode of the photography robot based on the environmental perception results during instruction execution.
[0090] S806: Obtain the system status and the interaction context of the interaction and decision-making central layer, the control and scheduling layer, and the execution layer.
[0091] S807: Dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen based on the system status and interaction context.
[0092] As can be seen from the above technical solutions, by constructing an interaction and decision-making central layer, real-time multimodal perception is achieved. This integrates multi-dimensional information such as environmental vision, user behavior, dialogue status, and task progress, enabling dynamic and smooth switching of the robot's working modes. This allows the robot to proactively intervene or withdraw at appropriate times, significantly improving the naturalness and overall coherence of the interaction process. The control and scheduling layer can classify tasks in real-time and dynamically prioritize them, ensuring timely responses to high-priority commands and guaranteeing the system's reliability and smoothness under complex conditions. The emotional expression layer introduces emotional actions and facial expressions, allowing the photography robot to generate rich, anthropomorphic behaviors based on the current environment, user emotions, and historical interactions during breaks in core functions. These behaviors are triggered collaboratively by multimodal states and seamlessly integrated with the main task, significantly enhancing the robot's sense of life and approachability, and improving user emotional resonance and immersive experience. This improves the overall intelligence, smoothness, and emotional level of the interaction.
[0093] In one specific embodiment of this application, prioritizing standard instructions may include the following steps: When the standard instructions include a photo-taking command, the photo-taking command is assigned the highest priority, and each of the other standard instructions is assigned a lower priority level.
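This priority rule can be sketched with a standard-library heap. The numeric levels and the command strings are assumptions; the application only states that the photo-taking command ranks highest. The arrival index is included in each heap entry so that commands of equal priority keep their arrival order.

```python
import heapq
from typing import List, Tuple

PHOTO_PRIORITY = 0       # smaller number = served first
DEFAULT_PRIORITY = 10    # assumed level for every other command


def prioritize(commands: List[str]) -> List[Tuple[int, int, str]]:
    """Attach (priority, arrival index) so ties keep arrival order."""
    return [
        (PHOTO_PRIORITY if c == "take_photo" else DEFAULT_PRIORITY, i, c)
        for i, c in enumerate(commands)
    ]


def execution_order(commands: List[str]) -> List[str]:
    """Return the commands in the order they would be executed."""
    heap = prioritize(commands)
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

With this sketch, `execution_order(["forward", "nod", "take_photo", "turn"])` places `"take_photo"` first and leaves the remaining commands in arrival order.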
[0094] In one specific embodiment of this application, executing standard instructions according to instruction priority may include the following steps: Step 1: Execute the photo-taking command through status flag management and hardware / software collaboration; Step 2: After the photo-taking command is executed, execute other standard commands other than the photo-taking command according to the command priority.
[0095] In one specific embodiment of this application, executing a photographing command may include the following steps: Step 1: Parse the photo-taking command to obtain the shooting type; Step 2: Set a global flag to lock the photo-taking task and adjust the chassis to photo-taking mode; Step 3: Take a preset number of consecutive shots according to the shooting type, and upload the captured images to the cloud for optimal image selection.
[0096] In one specific embodiment of this application, executing standard instructions according to instruction priority may include the following steps: According to the instruction priority, each standard instruction is executed by generating a comprehensive decision output that includes natural language responses and digital human actions.
[0097] In one specific embodiment of this application, the method may further include the following steps: Determine the task type corresponding to each standard instruction; when the task types indicate a long-cycle task, enter a task-locking state while executing the standard instruction corresponding to the long-cycle task.
[0098] In one specific embodiment of this application, step S801 may include the following steps: Step 1: Collect user commands in voice form using a voice acquisition device, and convert each user command from voice form to text form; Step 2: Acquire image information of the scene where the photography robot is located using image acquisition equipment.
[0099] In one specific embodiment of this application, step S802 may include the following steps: Understand the user commands and the image information by integrating speech recognition, visual perception, and multimodal large-scale model reasoning, obtaining command understanding results, environmental perception results, and the person's identity, behavior, and emotional state. The method may also include the following steps: Based on the system status, the interaction context, and the person's identity, behavior, and emotional state, dynamically drive the physical movements of the robotic arm and the facial expressions of the digital human on the screen.
[0100] Corresponding to the above method embodiments, this application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can perform the following steps: The system receives user commands and collects image information of the scene in which the camera robot is located. By fusing speech recognition, visual perception, and multimodal large model reasoning, it understands user commands and image information, obtaining command understanding results and environmental perception results. Using a retrieval-enhanced generation algorithm, it matches the standard commands corresponding to each user command from a structured command library based on the command understanding results. It prioritizes each standard command, obtaining the command priority corresponding to each standard command. It executes each standard command according to the command priority, and switches the global behavior mode of the camera robot based on the environmental perception results during the execution of the commands. It obtains the system state and the interaction context of the interaction and decision-making central layer, the control and scheduling layer, and the execution layer. Based on the system state and interaction context, it dynamically drives the physical movements of the robotic arm and the facial expressions of the digital human on the screen.
[0101] The computer-readable storage medium may include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0102] For a description of the computer-readable storage medium provided in this application, please refer to the above method embodiments; further details will not be repeated here.
[0103] The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the others; for parts that are the same or similar, the embodiments may be referred to one another. Because the apparatuses, devices, and computer-readable storage media disclosed in the embodiments correspond to the methods disclosed in the embodiments, they are described only briefly; for the relevant details, refer to the description of the methods.
[0104] This document uses specific examples to illustrate the principles and implementations of this application. The descriptions of the above embodiments are intended only to help in understanding the technical solutions and core ideas of this application. It should be noted that those skilled in the art can make improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the protection scope of this application.
Claims
1. A photography robot system based on multimodal interaction and state collaboration, characterized in that it comprises:
an interaction and decision-making central layer (1), configured to receive user instructions and collect image information of the scene where the photography robot is located; to understand the user instructions and the image information by fusing speech recognition, visual perception, and multimodal large model reasoning, obtaining an instruction understanding result and an environmental perception result; to match, using a retrieval-augmented generation algorithm and according to the instruction understanding result, the standard instruction corresponding to each user instruction from a structured instruction library; and to send each standard instruction to the control and scheduling layer (2) and the environmental perception result to the execution layer (3);
a control and scheduling layer (2), configured to prioritize the standard instructions to obtain the instruction priority corresponding to each standard instruction, and to send each instruction priority to the execution layer (3);
an execution layer (3), configured to execute each standard instruction according to the instruction priority, and to switch the global behavior mode of the photography robot according to the environmental perception result during instruction execution; and
an emotional expression layer (4), configured to obtain the system state and the interaction context of the interaction and decision-making central layer (1), the control and scheduling layer (2), and the execution layer (3), and, according to the system state and the interaction context, to dynamically drive the physical movements of the robotic arm and the on-screen digital human to output expressions.
2. The photography robot system based on multimodal interaction and state collaboration according to claim 1, characterized in that the control and scheduling layer (2) is specifically configured to, when the standard instructions include a photo-taking instruction, assign the photo-taking instruction the highest priority and assign priorities to the standard instructions other than the photo-taking instruction.
3. The photography robot system based on multimodal interaction and state collaboration according to claim 2, characterized in that the execution layer (3) is specifically configured to execute the photo-taking instruction through status flag management and hardware/software collaboration, and, after the photo-taking instruction has been executed, to execute the standard instructions other than the photo-taking instruction according to the instruction priority.
4. The photography robot system based on multimodal interaction and state collaboration according to claim 3, characterized in that the execution layer (3) is specifically configured to parse the photo-taking instruction to obtain the shooting type; set a global flag to lock the photo-taking task and adjust the chassis to the shooting state; shoot continuously a preset number of times according to the shooting type; and upload the captured images to the cloud for optimal image selection.
5. The photography robot system based on multimodal interaction and state collaboration according to claim 1, characterized in that the execution layer (3) is specifically configured to execute each standard instruction according to the instruction priority by generating a comprehensive decision output that includes natural language responses and digital human actions.
6. The photography robot system based on multimodal interaction and state collaboration according to claim 1, characterized in that the execution layer (3) is further configured to determine the task type corresponding to each standard instruction, and, when the task types indicate that a long-cycle task is present, to enter a task lock state while executing the standard instruction corresponding to the long-cycle task.
7. The photography robot system based on multimodal interaction and state collaboration according to claim 1, characterized in that the interaction and decision-making central layer (1) is specifically configured to collect user instructions in voice form through a voice acquisition device and convert each user instruction from voice form to text form, and to collect image information of the scene where the photography robot is located through an image acquisition device.
8. The photography robot system based on multimodal interaction and state collaboration according to any one of claims 1 to 7, characterized in that the interaction and decision-making central layer (1) is specifically configured to understand the user instructions and the image information by fusing speech recognition, visual perception, and multimodal large model reasoning, obtaining the instruction understanding result, the environmental perception result, and the person's identity, behavior, and emotional state, and to send the person's identity, behavior, and emotional state to the emotional expression layer (4); and the emotional expression layer (4) is further configured to dynamically drive the physical movements of the robotic arm and the on-screen digital human to output facial expressions based on the system state, the interaction context, and the person's identity, behavior, and emotional state.
9. A control method for a photography robot based on multimodal interaction and state collaboration, characterized in that it comprises:
receiving user instructions and collecting image information of the scene where the photography robot is located;
understanding the user instructions and the image information by fusing speech recognition, visual perception, and multimodal large model reasoning, to obtain an instruction understanding result and an environmental perception result;
matching, using a retrieval-augmented generation algorithm and according to the instruction understanding result, the standard instruction corresponding to each user instruction from a structured instruction library;
prioritizing the standard instructions to obtain the instruction priority corresponding to each standard instruction;
executing each standard instruction according to the instruction priority, and switching the global behavior mode of the photography robot according to the environmental perception result during instruction execution;
obtaining the system state and the interaction context of the interaction and decision-making central layer (1), the control and scheduling layer (2), and the execution layer (3); and
dynamically driving, based on the system state and the interaction context, the physical movements of the robotic arm and the on-screen digital human to output facial expressions.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the control method for a photography robot based on multimodal interaction and state collaboration according to claim 9.
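For illustration only, the photo-taking flow recited in claims 3, 4, and 6 (parse the shooting type, set a global flag to lock the task, put the chassis in the shooting state, shoot a preset number of times, upload for selection) might be sketched in Python as below. The class name, the burst counts, the shooting-type labels, and the in-memory `uploaded` list that stands in for the cloud upload are all assumptions made for this sketch; none of them is specified by the claims.

```python
from contextlib import contextmanager

# Hypothetical preset number of continuous shots per shooting type.
BURST_COUNT = {"single": 3, "group": 5}


class ExecutionLayer:
    def __init__(self):
        self.task_locked = False     # global flag that locks the shooting task
        self.chassis_state = "free"  # becomes "shooting" while a task runs
        self.uploaded = []           # stand-in for images sent to the cloud

    @contextmanager
    def _task_lock(self):
        """Set the global flag for the duration of one locked task."""
        if self.task_locked:
            raise RuntimeError("another locked task is already running")
        self.task_locked = True
        try:
            yield
        finally:
            # Always release the flag, even if shooting fails midway.
            self.task_locked = False

    def _capture(self, index):
        """Stand-in for the camera driver; returns a fake frame id."""
        return f"frame-{index}"

    def take_photo(self, instruction):
        """Run one photo-taking instruction under the task lock."""
        shooting_type = instruction.get("type", "single")
        count = BURST_COUNT[shooting_type]
        with self._task_lock():
            self.chassis_state = "shooting"
            try:
                frames = [self._capture(i) for i in range(count)]
                # Optimal image selection is assumed to happen remotely.
                self.uploaded.extend(frames)
            finally:
                self.chassis_state = "free"
        return frames
```

With this sketch, `ExecutionLayer().take_photo({"type": "group"})` returns five frame identifiers and leaves both the lock flag and the chassis state released afterwards. A context manager is used so that the flag cannot remain set if an error interrupts the shot.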