A multi-user collaborative interaction method and system for a bestie machine

By acquiring users' physical state and behavioral data, the system dynamically constructs operable areas and combines behavioral data to identify the user to whom the touch operation belongs. This solves the problem of inaccurate user intent judgment in multi-user collaborative interaction of smart display devices, and achieves more efficient user identification and collaborative interaction.

CN122308699A · Pending · Publication Date: 2026-06-30 · FOSHAN CHENGYI TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: FOSHAN CHENGYI TECHNOLOGY CO LTD
Filing Date: 2026-02-12
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing smart display devices struggle to accurately determine each user's operational intent in multi-user collaborative interaction scenarios, especially when users switch interactive roles or use different input methods in combination, resulting in low user recognition accuracy and low collaborative efficiency.

Method used

The system acquires each user's body state information (body center position, limb extension range, head orientation, and face orientation) and behavioral data. An operable area is dynamically constructed for each user from the body state information, and the behavioral data is then used to identify the user to whom a touch operation belongs. Body state information is acquired using depth sensors and cameras, and behavioral data is acquired using microphone arrays and voice recognition modules. User attribution is determined by combining geometric algorithms and machine learning models.

Benefits of technology

It improves user identification accuracy and collaboration efficiency, enabling accurate identification of the user to whom the touch operation belongs in complex multi-user scenarios, thereby enhancing the intelligence of human-computer interaction and user experience.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122308699A_ABST]

Abstract

This invention discloses a multi-user collaborative interaction method and system for a bestie machine. The method includes: acquiring the body state information and behavioral data of each initial user; constructing an operable area corresponding to each initial user based on that user's body state information; in response to a touch operation, determining the region affiliation of the touch point; if the region affiliation result indicates that the touch point is within a single user's operable area, identifying the operable area where the touch point is located as the target operation area and designating the initial user corresponding to the target operation area as the user to whom the touch operation belongs; and if the region affiliation result indicates that the touch point is within an overlapping part of multiple users' operable areas, determining the user to whom the touch operation belongs based on the behavioral data. The invention combines body state information and behavioral data to identify the user to whom a touch operation belongs, thereby achieving multi-user collaborative interaction and improving user identification accuracy and collaborative efficiency.

Description

Technical Field

[0001] This invention relates to the field of intelligent terminal control technology, and in particular to a multi-user collaborative interaction method and system for a bestie machine.

Background Technology

[0002] In smart home environments, smart display devices such as the bestie machine have become the center of home entertainment and are designed to support multi-user collaborative interaction. These devices are typically equipped with sensing modules such as high-resolution cameras, depth sensors, and microphone arrays to identify users, track their location, and interpret touch, voice, or gesture commands. However, in complex scenarios involving multi-user collaboration and natural user movement, existing systems rely on attribution logic built around a single input method and struggle to determine each user's operational intent accurately. Specifically, when multiple users carry out complex collaborative tasks in front of the bestie machine using several input methods, such as touch and voice, a range of situations can arise: users switch interaction roles, mix input methods, are interrupted by the voices of non-participants, or make subtle changes in relative position or body posture. Existing systems struggle to identify these diverse inputs and assign them to the correct user. For example, in a collaborative task, User A primarily uses touch while User B primarily uses voice commands. However, User A may also issue voice commands, and User B may also perform touch operations. In this case, the system's single-input-method attribution logic fails, and it becomes difficult to determine which user the current input belongs to, resulting in low user recognition accuracy and low collaborative efficiency.

[0003] In summary, the technical problems in the related art remain to be solved.

Summary of the Invention

[0004] The main objective of this invention is to propose a multi-user collaborative interaction method and system for a bestie machine that can identify the user to whom a touch operation belongs by combining body state information and behavioral data, thereby achieving multi-user collaborative interaction and improving user identification accuracy and collaborative efficiency.

[0005] In one aspect, embodiments of the present invention provide a multi-user collaborative interaction method for a bestie machine, comprising the following steps:
acquiring the body state information and behavioral data of each initial user, the body state information including the body center position, limb extension range, head orientation, and face orientation;
constructing, based on the body state information of each initial user, an operable area corresponding to each initial user, the operable area representing the display range that the initial user can reach in the current body state;
in response to a touch operation, determining the region affiliation of the touch point of the touch operation to obtain a region affiliation result;
if the region affiliation result is that the touch point is in a single user's operable area, identifying the operable area where the touch point is located as the target operation area, and identifying the initial user corresponding to the target operation area as the user to whom the touch operation belongs;
if the region affiliation result is that the touch point is in the overlapping part of multiple users' operable areas, determining the user to whom the touch operation belongs based on the behavioral data.

[0006] In another aspect, embodiments of the present invention provide a multi-user collaborative interaction system for a bestie machine, including:
an information acquisition module, used to acquire the body state information and behavioral data of each initial user, the body state information including the body center position, limb extension range, head orientation, and face orientation;
an operable area construction module, used to construct an operable area corresponding to each initial user based on the body state information of each initial user, the operable area representing the display range that the initial user can reach in the current body state;
a region affiliation determination module, used to respond to a touch operation, determine the region affiliation of the touch point, and obtain a region affiliation result;
a single-user area analysis module, used to identify the operable area where the touch point is located as the target operation area if the region affiliation result is that the touch point is in a single user's operable area, and to identify the initial user corresponding to the target operation area as the user to whom the touch operation belongs;
a multi-user area analysis module, used to determine the user to whom the touch operation belongs based on the behavioral data if the region affiliation result is that the touch point is in the overlapping part of multiple users' operable areas.

[0007] The embodiments of this application provide at least the following beneficial effects. First, the body state information and behavioral data of each initial user are obtained. Then, based on the body state information of each initial user, an operable area corresponding to each initial user is constructed. Next, the region affiliation of the touch point of a touch operation is determined. If the touch point is in a single user's operable area, the operable area where the touch point is located is identified as the target operation area, and the initial user corresponding to the target operation area is identified as the user to whom the touch operation belongs. If the touch point is in the overlapping part of multiple users' operable areas, the user to whom the touch operation belongs is determined based on the behavioral data. In this way, the user to whom the touch operation belongs can be identified by combining body state information and behavioral data, which realizes multi-user collaborative interaction and improves the accuracy of user identification and collaborative efficiency.

[0008] Other features and advantages of the invention will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures particularly pointed out in the description and the drawings.

Brief Description of the Drawings

[0009] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below.

[0010] Figure 1 is a flowchart illustrating a multi-user collaborative interaction method for a bestie machine according to an embodiment of the present invention; Figure 2 is a schematic diagram of the structure of a multi-user collaborative interaction system for a bestie machine according to an embodiment of the present invention.

Detailed Description

[0011] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments.

[0012] In the related art, smart display devices such as the bestie machine have become the home entertainment center in smart home environments and are designed to support multi-user collaborative interaction. These devices are typically equipped with sensing modules such as high-resolution cameras, depth sensors, and microphone arrays to identify users, track their location, and interpret touch, voice, or gesture commands. However, in complex scenarios involving multi-user collaboration and natural user movement, existing systems rely on attribution logic built around a single input method and struggle to determine each user's operational intent accurately. Specifically, when multiple users perform complex collaborative tasks in front of the bestie machine using several input methods, such as touch and voice, a range of situations can arise: users switch interaction roles, mix input methods, are interrupted by the voices of people not taking part in the current session, or make subtle changes in relative position or body posture. Existing systems struggle to identify these diverse inputs and assign them to the correct user.

[0013] For example, in a typical family scenario, User A and User B are using a bestie machine together. On startup, the device identifies them using facial and voice recognition and continuously tracks their positions in space. Initially, User A watches a movie on the left side of the screen while User B reads news on the right, each operating independently without disturbing the other. The device routes each touch to the corresponding application window based on where the touch lands, so each user can use the device independently. The situation becomes more complex when the two users decide to play a strategy game together, such as a tower defense game. The device switches from an "each on their own" mode to a "collaborative" mode, and the entire screen shows the same game. The game might require User A to build and upgrade defense towers mainly by touch, while User B mainly uses voice commands to trigger special skills, such as "summon reinforcements" or "slow down time." To tell these inputs apart, the device initially attributes touch operations according to the users' approximate positions in front of the screen (for example, touches on the left belong to User A and touches on the right belong to User B) and uses voice recognition to determine who issued each voice command.

[0014] As the interaction between User A and User B becomes more varied, User A, while placing a defense tower on the touchscreen, might suddenly need a skill to deal with an emergency and blurt out "Cast Freeze." At the same time, User B, after issuing a voice command, might quickly reach out and touch an area of the screen to select a target more precisely. At this point, the device's implicit division of labor ("User A primarily touches, User B primarily speaks") and its single-input-method attribution logic break down. The device might attribute the "Cast Freeze" command to User A because the voice matches User A, but because User A is touching the screen at the same moment, the device has difficulty resolving priorities and conflicts between the input methods and might even misattribute the command to User B. Conversely, User B's touch operation might be ignored because User B mainly handles voice input, or it might be wrongly attributed to User A. This flexible switching and mixing of roles and input methods defeats the system's single-input-method attribution logic, making it difficult to determine which user the current input belongs to and resulting in low user recognition accuracy and low collaborative efficiency.

[0015] In view of this, this application acquires the body state information and behavioral data of each initial user. The body state information reflects the user's spatial position and posture in real time. The behavioral data includes various input information from the user's interaction with the device. Then, based on the body state information of each initial user, an operable area corresponding to each initial user is dynamically constructed. This operable area represents the display range that the user can reach in their current body state. By updating the operable area in real time, the system can adapt to changes in the user's position or body posture in front of the device. When the user performs a touch operation, the system determines the area to which the touch point belongs. The determination result falls into one of two cases: within a single user's operable area, or within the overlapping part of multiple users' operable areas.

[0016] If the determination result indicates that the touch point is within a single user's operable area, the operable area where the touch point is located is identified as the target operation area, and the initial user corresponding to this target operation area is designated as the user to whom the touch operation belongs. In this case, the ownership of the touch point is clear, and the system can directly assign the operation to the corresponding user. If the determination result indicates that the touch point is within the overlapping part of multiple users' operable areas, the user to whom the touch operation belongs is determined based on behavioral data. In this ambiguous scenario, the system no longer relies solely on spatial location information but also draws on user behavioral data for a combined judgment, thereby more accurately determining the user to whom the operation belongs.

[0017] The embodiments of this application are explained in detail below with reference to the accompanying drawings. Figure 1 is an optional flowchart of a multi-user collaborative interaction method for a bestie machine provided in an embodiment of this application. The method in Figure 1 may include, but is not limited to, steps S101 to S105.

[0018] Step S101: Obtain the body state information and behavioral data of each initial user. The body state information includes the body center position, limb extension range, head orientation, and face orientation.
Step S102: Based on the body state information of each initial user, construct the operable area corresponding to each initial user. The operable area represents the display range that the initial user can reach in the current body state.
Step S103: In response to a touch operation, determine the region affiliation of the touch point and obtain the region affiliation result.
Step S104: If the region affiliation result is that the touch point is in a single user's operable area, identify the operable area where the touch point is located as the target operation area, and take the initial user corresponding to the target operation area as the user to whom the touch operation belongs.
Step S105: If the region affiliation result is that the touch point is in the overlapping part of multiple users' operable areas, determine the user to whom the touch operation belongs based on the behavioral data.

[0019] Steps S101 to S105 as shown in the embodiments of this application can combine body state information and behavioral data to identify the user to whom the touch operation belongs, so as to realize multi-user collaborative interaction and improve the accuracy of user identification and collaborative efficiency.

[0020] In some embodiments of steps S101 to S105, the body state information and behavioral data of each initial user are acquired first. The body state information includes the body center position, limb extension range, head orientation, and face orientation, and it can be acquired in various ways. For example, a depth sensor (such as a ToF camera or structured light sensor) can scan the user's body in real time to generate three-dimensional point cloud data, from which the user's body center position and limb extension range are calculated. Head orientation and face orientation can be acquired using a high-resolution camera combined with facial recognition and pose estimation algorithms. Behavioral data can be collected through built-in sensors and software modules. For example, touch operation frequency can be recorded by a touchscreen controller, voice command frequency and user voice information can be acquired through a microphone array and voice recognition module, and hand skeleton data can be acquired through a camera combined with a gesture recognition algorithm. It is understood that the bestie machine is a smart display device with multi-user interaction capabilities, typically integrating high-resolution cameras, depth sensors, microphone arrays, and other sensing hardware to acquire the user's body state information and behavioral data in real time. Behavioral data refers to the various input information generated while the user interacts with the bestie machine, which may include touch operation frequency, voice command frequency, user voice information, and hand skeleton data.

[0021] Then, based on the body state information of each initial user, an operable area is constructed for each initial user. This operable area represents the display range that the initial user can reach in their current body state. Based on the user's body center position and limb extension range, a human kinematics model can be used to calculate the maximum range that the user's arm can reach in different postures, and this range is projected onto the display screen to form a two-dimensional area. Alternatively, a standard human body model can be preset and adjusted based on the acquired user body state information to simulate the arm movement trajectory of the model, thereby determining the operable area. For example, if the user's body center position moves to the left, their operable area will also shift to the left accordingly. It can be understood that the operable area is a virtual display range dynamically constructed based on the user's body state information; it represents the screen area that the user's limbs (e.g., arms) can naturally reach in their current body posture. The construction of this area considers factors such as the user's body center position and limb extension range, ensuring the accuracy and real-time nature of the area.
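The projection described above can be illustrated with a minimal Python sketch. It is not the patent's kinematics model: it assumes, purely for illustration, that reach is a sphere of radius `arm_reach` centred on a shoulder point at perpendicular distance `distance` from the screen, so the reachable screen region is a disc. The names `OperableArea` and `build_operable_area`, the metre units, and the screen dimensions are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OperableArea:
    """Reachable disc on the screen plane (illustrative model, units in metres)."""
    cx: float      # reach centre projected onto the screen, horizontal
    cy: float      # reach centre projected onto the screen, vertical
    radius: float  # radius of the reachable disc; 0.0 means out of reach

    def contains(self, x: float, y: float, screen_w: float, screen_h: float) -> bool:
        # A point is operable only if it is on the screen and inside the disc.
        on_screen = 0.0 <= x <= screen_w and 0.0 <= y <= screen_h
        in_reach = self.radius > 0.0 and math.hypot(x - self.cx, y - self.cy) <= self.radius
        return on_screen and in_reach


def build_operable_area(shoulder_x: float, shoulder_y: float,
                        distance: float, arm_reach: float) -> OperableArea:
    """Intersect a reach sphere with the screen plane (assumed simplification)."""
    if arm_reach <= distance:
        # The user stands too far away to touch the screen at all.
        return OperableArea(shoulder_x, shoulder_y, 0.0)
    radius = math.sqrt(arm_reach ** 2 - distance ** 2)
    return OperableArea(shoulder_x, shoulder_y, radius)
```

Rebuilding the area whenever the tracked shoulder position changes reproduces the behavior in the text: if the user moves left, `cx` decreases and the disc shifts left with it.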

[0022] In response to a touch operation, the system determines the region affiliation of the touch point and obtains the region affiliation result. When a user touches the screen of the device, the touchscreen controller detects the location of the touch point. The system obtains the coordinates of the touch point and compares them with the operable area constructed for each user. It determines whether the touch point falls within a single operable area or within the overlapping part of multiple operable areas. For example, a geometric algorithm can be used to determine whether the touch point lies within a polygonal region.
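As one possible geometric algorithm, the sketch below uses the standard ray-casting (even-odd) test for a point in a polygon and then classifies the touch point. The function names, the string labels, and the representation of each operable area as a list of `(x, y)` vertices are assumptions made for illustration; the "none" case, where the touch lies in no operable area, is also an addition that the text does not discuss.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def region_affiliation(point, areas):
    """Return ("single" | "overlap" | "none", list of candidate user ids).

    `areas` maps a user id to that user's operable area polygon.
    """
    owners = [uid for uid, poly in areas.items() if point_in_polygon(point, poly)]
    if not owners:
        return "none", owners
    if len(owners) == 1:
        return "single", owners
    return "overlap", owners


areas = {
    "A": [(0, 0), (6, 0), (6, 4), (0, 4)],
    "B": [(4, 0), (10, 0), (10, 4), (4, 4)],
}
print(region_affiliation((2, 2), areas))  # touch inside A only
print(region_affiliation((5, 2), areas))  # touch inside both A and B
```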

[0023] If the area attribution determination result indicates that the touch point is within a single user's operating area, then the operable area where the touch point is located is identified as the target operating area, and the initial user corresponding to the target operating area is designated as the user to whom the touch operation is assigned. This means that the touch point clearly belongs to a specific user's operable area and does not overlap with the operable areas of other users. In this case, the system can directly assign the touch operation to that single user. For example, if user A's operable area does not overlap with user B's operable area, and the touch point falls within user A's operable area, then the touch operation is assigned to user A.

[0024] If the region affiliation result indicates that the touch point is located in the overlapping part of multiple users' operable areas, the user to whom the touch operation belongs is determined based on behavioral data. When the touch point is in such an overlapping part, body state information alone cannot accurately determine the user to whom the touch operation belongs. In this case, the system needs to analyze the users' behavioral data further to assist in the determination. For example, it can analyze which user showed higher interaction activity in the period before the touch operation occurred, or which user's hand posture was closer to the touch point.
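The two branches can be sketched together as follows. This illustration resolves the overlap case with only the second cue mentioned above, the hand nearest to the touch point. It assumes a hypothetical `fingertips` mapping from user id to a tracked fingertip position already projected into screen coordinates; the function name and data shapes are not from the patent.

```python
import math


def attribute_touch(touch, owners, fingertips):
    """Pick the user a touch belongs to.

    touch      -- (x, y) touch coordinates on the screen
    owners     -- user ids whose operable areas contain the touch point
    fingertips -- hypothetical map: user id -> (x, y) tracked fingertip position
    """
    if not owners:
        return None          # touch is outside every operable area
    if len(owners) == 1:
        return owners[0]     # single-user area: attribution is unambiguous
    # Overlap: choose the candidate whose tracked hand is closest to the touch.
    return min(owners, key=lambda uid: math.dist(touch, fingertips[uid]))


fingertips = {"A": (5.1, 2.0), "B": (7.5, 1.0)}
print(attribute_touch((5.0, 2.0), ["A", "B"], fingertips))
```

A production system would likely combine this distance cue with the interaction activity measure rather than rely on either alone.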

[0025] Through the above technical solution, this embodiment first acquires the body state information and behavioral data of each initial user. This information is acquired dynamically and in real time, capturing subtle changes in the user during interaction. Based on the user's body state information, this embodiment dynamically constructs the operable area corresponding to each initial user. This means that even if the user changes position or body posture, the operable area adjusts accordingly, keeping the division of touch areas reliable. For example, when user A moves to the left, user A's operable area also shifts to the left, which avoids touch operations being incorrectly assigned to other users. Furthermore, this embodiment determines the region affiliation of the touch point when responding to a touch operation. When the touch point is clearly within a single user's operable area, the system can directly identify and assign the operation. However, when the touch point is in the overlapping part of multiple users' operable areas, this embodiment no longer relies solely on spatial location information but also determines the user to whom the touch operation belongs based on behavioral data. This strategy addresses the difficulty that traditional technologies have in determining the user in ambiguous overlapping areas. Through this mechanism, the embodiment can continuously track each user's spatial location and posture, recognize voices, understand the current interaction method, and judge behavioral intent from the combined evidence. This enables the system to handle diverse and complex multi-user interaction scenarios in the home, addressing the inaccurate judgment of operational intent in multi-user collaborative interaction and improving user experience and collaboration efficiency.

[0026] In some embodiments, in step S105, determining the user to whom the touch operation belongs based on behavioral data may include, but is not limited to, the following steps:
Step S201: Extract the touch operation frequency and voice command frequency from the behavioral data;
Step S202: Calculate the interaction activity based on the touch operation frequency and the voice command frequency;
Step S203: Compare the interaction activity of each initial user and select the initial user with the highest interaction activity as the user to whom the touch operation belongs.

[0027] In some embodiments, touch operation frequency and voice command frequency can be extracted from behavioral data first. The number of touch operations performed and the number of voice commands issued by each initial user within a specific time window can be extracted from the behavioral data to calculate the corresponding frequencies. Touch operation frequency refers to the intensity of user interaction through physical contact with the display screen, while voice command frequency reflects the activity level of user interaction through voice. This frequency data forms the basis for quantifying user interaction activity.
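A minimal sketch of the windowed frequency extraction is shown below. It assumes that behavioral data is available as a list of event timestamps in seconds for each user and input type; the function name and the 5-second default window are illustrative choices, not values fixed by the text.

```python
def event_frequency(timestamps, now, window_s=5.0):
    """Events per second within the half-open window (now - window_s, now]."""
    recent = [t for t in timestamps if now - window_s < t <= now]
    return len(recent) / window_s


# Hypothetical touch timestamps for one user; only the last three fall in the window.
touch_times = [0.5, 6.0, 7.2, 9.9]
print(event_frequency(touch_times, now=10.0))
```

Because every user is measured over the same window, ranking users by raw counts or by rates gives the same order, so either form can feed the activity calculation.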

[0028] Then, the interaction activity level is calculated based on the frequency of touch operations and voice commands. By comprehensively considering the user's touch and voice interaction behaviors, a quantitative indicator that reflects the user's current engagement and intent can be derived. For example, a weighted sum of touch operation frequency and voice command frequency can be used, or a machine learning model can be employed for comprehensive evaluation.

[0029] Next, the interaction activity levels of the initial users are compared, and the initial user with the highest interaction activity is designated as the user to whom the touch operation belongs. The interaction activity levels of all initial users who might have performed the operation within the overlapping area can be sorted, and the initial user with the highest interaction activity is selected for this touch operation. The purpose is to identify the active user most likely to have performed the operation within the overlapping area, thereby improving the accuracy of the attribution.

[0030] By extracting and analyzing touch operation frequency and voice command frequency, this embodiment can quantify the interaction activity of each initial user more comprehensively and objectively. Considering both frequencies together reflects the user's level of participation and operational intent in the current scenario more accurately. When the touch point is located in the overlapping part of multiple users' operable areas, the system no longer relies solely on area judgment but also takes into account the intensity of each user's actual interactive behavior. The initial user with the highest interaction activity is considered the user most likely to have performed the touch operation, which addresses the problem of determining the user in the overlapping area.

[0031] To illustrate this technical solution more clearly, a specific example is used below. Assume two users, A and B, are interacting with the bestie machine at the same time and their operable areas overlap. When a touch operation occurs within the overlapping area, the system extracts the touch operation frequency and voice command frequency of users A and B from the behavioral data. For example, in the last 5 seconds, user A performed 3 touch operations and 2 voice commands, while user B performed 1 touch operation and 4 voice commands. The system calculates the interaction activity level based on these frequencies. If the calculation model sets the contribution weight of touch operations to 0.6 and the contribution weight of voice commands to 0.4, then user A's interaction activity level is (3 * 0.6) + (2 * 0.4) = 1.8 + 0.8 = 2.6, and user B's interaction activity level is (1 * 0.6) + (4 * 0.4) = 0.6 + 1.6 = 2.2. Because user A's interaction activity level of 2.6 is higher than user B's 2.2, the touch operation is attributed to user A. This method distinguishes users who are more active and more likely to operate in overlapping areas, giving a more accurate attribution.
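The worked example can be reproduced with a short Python sketch. The weights 0.6 and 0.4 and the counts come from the example above; the function names and the rule that a tie goes to the first user in the mapping are assumptions for illustration.

```python
def interaction_activity(touch_freq, voice_freq, w_touch=0.6, w_voice=0.4):
    """Weighted sum of touch and voice frequencies."""
    return w_touch * touch_freq + w_voice * voice_freq


def most_active_user(freqs, w_touch=0.6, w_voice=0.4):
    """freqs maps user id -> (touch_freq, voice_freq); returns (winner, scores).

    On a tie, max() keeps the first user in insertion order (assumed policy).
    """
    scores = {
        uid: interaction_activity(touch, voice, w_touch, w_voice)
        for uid, (touch, voice) in freqs.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Counts over the last 5 seconds, taken from the example in the text.
winner, scores = most_active_user({"A": (3, 2), "B": (1, 4)})
print(winner, {uid: round(s, 2) for uid, s in scores.items()})
```

Scores are rounded only for display, since binary floating point does not represent 0.6 and 0.4 exactly.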

[0032] Through the above technical solution, this embodiment can improve the accuracy and robustness of user attribution when touch operations occur in the overlapping part of multiple users' operable areas. By quantifying the frequency of each user's touch operations and voice commands and calculating interaction activity, the system can identify the user's true operational intent more accurately in complex interaction scenarios, avoiding the ambiguity or misjudgment caused by overlapping areas. This enables the bestie machine to provide a more intelligent and personalized user experience in multi-user collaborative interaction environments, improving the efficiency of human-computer interaction and user satisfaction.

[0033] In some embodiments, in step S202, calculating the interaction activity based on the touch operation frequency and the voice command frequency may include, but is not limited to, the following steps:
Step S301: Assess the cognitive load level based on the behavioral data;
Step S302: If the cognitive load level is high, extract the user voice information from the behavioral data and obtain the user's hand skeleton data through the visual recognition module;
Step S303: Adjust the contribution weight of voice commands based on the user voice information;
Step S304: Adjust the contribution weight of touch operations based on the user's hand skeleton data;
Step S305: Based on the adjusted voice command contribution weight and touch operation contribution weight, compute the weighted sum of the touch operation frequency and voice command frequency to obtain the interaction activity.

[0034] In some embodiments, user behavioral data may be influenced by a variety of factors, such as the user's cognitive state, mood, or the complexity of the current task. Calculating interaction activity from the raw frequencies of touch operations and voice commands alone may not accurately reflect the user's true intentions and level of engagement. In particular, when the user is under high cognitive load, their behavior patterns may change, which biases the interaction activity assessment and in turn reduces the accuracy of user attribution.

[0035] Therefore, we can first assess the cognitive load level based on behavioral data. Cognitive load refers to the degree of mental effort or information processing burden experienced by a user when performing a task. The purpose of assessing cognitive load is to gain a deeper understanding of the user's current state, thereby allowing for more refined adjustments to subsequent interaction activity calculations. If the cognitive load level is high, it indicates that the user may be processing complex information or facing challenges, and their behavioral data may no longer be a direct reflection of their true intentions. In this case, user voice information can be extracted from the behavioral data, and user hand skeleton data can be obtained through a visual recognition module. This data can provide richer contextual information for judging the validity and intent of user behavior. User voice information refers to all voice content uttered by the user during interaction, including voice commands, dialogues, interjections, etc., and its purpose is to assist in judging the user's intent and level of participation through the analysis of voice content. User hand skeleton data refers to the posture and movement trajectory information of the user's hand in three-dimensional space obtained through visual recognition technology, and its purpose is to assist in judging the user's touch operation intent by analyzing the accuracy and directionality of hand movements.

[0036] Then, the contribution weight of voice commands is adjusted based on the user's voice information. The importance of voice commands in the interaction activity calculation can be dynamically adjusted according to the quality, content, and context of the voice information. For example, when a voice command is clear, unambiguous, and highly relevant to the current task, its contribution weight can be maintained or increased; conversely, if the voice is unclear, ambiguous, or irrelevant to the task, its contribution weight can be decreased.

[0037] Then, based on the user's hand skeleton data, the contribution weight of touch operations is adjusted. The importance of touch operations in the interaction activity calculation can be dynamically adjusted according to the accuracy, directionality, and matching degree of hand movements with the touch point. For example, when the hand movement is accurately directed at the touch point and is within the effective operation area, its contribution weight can be maintained or increased; conversely, if the hand movement is uncertain or does not match the touch point, its contribution weight can be reduced.

[0038] Finally, based on the adjusted contribution weights of voice commands and touch operations, the frequencies of touch operations and voice commands are weighted and summed to obtain a more representative and accurate interaction activity value. This weighted summation method ensures that the contribution of different types of interactive behaviors to interaction activity in different contexts is reasonably reflected.
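The weighted summation of step S305 can be sketched as follows. This is a minimal illustration only: the names `InteractionSample` and `interaction_activity`, the per-minute units, and the default weight of 1.0 are assumptions introduced here and are not specified in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionSample:
    """Raw behavior frequencies for one user (hypothetical structure)."""
    touch_freq: float  # touch operations per minute (assumed unit)
    voice_freq: float  # voice commands per minute (assumed unit)


def interaction_activity(sample: InteractionSample,
                         touch_weight: float = 1.0,
                         voice_weight: float = 1.0) -> float:
    """Weighted sum of touch and voice frequencies (step S305)."""
    if touch_weight < 0 or voice_weight < 0:
        raise ValueError("contribution weights must be non-negative")
    return (touch_weight * sample.touch_freq
            + voice_weight * sample.voice_freq)


# Example: a user with few touches whose voice weight was raised in S303.
user_c = InteractionSample(touch_freq=2.0, voice_freq=1.0)
activity_c = interaction_activity(user_c, touch_weight=0.5, voice_weight=2.0)
```

With both weights left at 1.0 the function reduces to the unweighted sum of the two frequencies, which corresponds to the raw-frequency baseline this embodiment is meant to improve on.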

[0039] To illustrate this technical solution more clearly, a specific example is used below. Suppose users C and D are collaborating on a task, such as editing a document together, in front of the BFF machine. The system continuously acquires behavioral data from both users and assesses their cognitive load levels. At a certain moment, by analyzing user C's facial expressions, head posture, and speech rate, the system determines that user C's cognitive load level is high, possibly because user C is thinking about a complex editing problem. Meanwhile, user D's cognitive load level is normal. At this point, both users perform touch operations and issue voice commands. If only the raw touch operation frequency and voice command frequency were considered, user C's interaction activity might appear lower because of brief pauses or hesitations while thinking.

[0040] This embodiment employs special handling for user C's high cognitive load level. The system further extracts user C's voice information and hand skeleton data. For example, if user C's voice commands, although spoken slowly, are semantically clear and highly relevant to the editing task, and their hand skeleton data shows that their hand is precisely pointing to an editing area on the screen, then even if their touch operation frequency is slightly low, the system will maintain or even increase the contribution weight of their voice commands and touch operations. Conversely, if user D, under normal cognitive load, has a high touch operation frequency but their hand movements appear random, or their voice commands are unclear, the system may reduce their corresponding contribution weight. In this way, this embodiment can more accurately calculate user C's interaction activity level. Even if user C exhibits different behavioral patterns under high cognitive load, the system can still identify their true participation and intent. Ultimately, the system will more accurately determine the user to whom the current touch operation belongs based on the interaction activity level calculated with adjusted weights, thereby avoiding misjudgments caused by cognitive load and ensuring the smoothness and accuracy of collaborative interaction.

[0041] Through the above technical solution, this embodiment overcomes the limitations of calculating interaction activity from raw behavior frequencies alone, improving the accuracy and robustness of user attribution in multi-user collaborative interaction scenarios. In particular, when a user is under high cognitive load, this embodiment adjusts the contribution weights of voice commands and touch operations, avoiding misjudgments caused by the shift in behavior patterns that accompanies a change in cognitive state. The system can therefore identify the user's true intentions and level of participation more accurately, providing a more personalized and intelligent interactive experience and improving the efficiency and user satisfaction of multi-user collaborative interaction on the BFF machine.

[0042] In some embodiments, step S301, assessing the cognitive load level based on behavioral data, may include, but is not limited to, the following steps: based on the behavioral data, changes in key facial points are analyzed using a facial feature point recognition algorithm to calculate a facial expression analysis score, where the key facial points include the eyebrows, the corners of the eyes, and the corners of the mouth; based on the behavioral data, the user's head posture in three-dimensional space is analyzed using a human skeleton recognition algorithm to calculate a head posture analysis score; based on the behavioral data, the user's gaze direction and fixation point are analyzed using an eye-tracking algorithm to calculate a gaze direction analysis score; based on the behavioral data, the speech rate and pause frequency of the user's speech are analyzed using a speech recognition algorithm to calculate a speech rate analysis score; based on the behavioral data, the lexical diversity and syntactic structure of the user's speech are analyzed using the speech recognition algorithm to calculate a speech complexity analysis score; a weighted average of the facial expression analysis score, head posture analysis score, gaze direction analysis score, speech rate analysis score, and speech complexity analysis score is computed to obtain the cognitive load score; and the cognitive load level is determined based on the cognitive load score.

[0043] In some embodiments, facial landmark recognition algorithms can be used to analyze changes in key facial features based on behavioral data to calculate a facial expression analysis score. These key facial features include eyebrows, corners of the eyes, and corners of the mouth. A facial landmark recognition algorithm is a computer vision technique that uses analysis of a user's facial image to identify and track changes in key facial features such as eyebrows, corners of the eyes, and corners of the mouth. Changes in these key features, such as furrowed eyebrows, tense corners of the eyes, or drooping corners of the mouth, are typically associated with the user's emotional state and cognitive load level. By quantifying these changes, a facial expression analysis score can be calculated, reflecting the user's cognitive engagement or stress level during the current interaction.

[0044] Then, based on the behavioral data, the user's head posture in three-dimensional space is analyzed using a human skeleton recognition algorithm to calculate a head posture analysis score. A human skeleton recognition algorithm is a technology that can detect and track key points of the human skeleton from video or depth images. By analyzing the user's head posture in three-dimensional space, such as head tilting, turning, or nodding, the user's level of attention, confusion, or fatigue can be inferred. Therefore, a head posture analysis score can be calculated to assess the user's cognitive load during task execution.

[0045] Based on behavioral data, eye-tracking algorithms are used to analyze the user's gaze direction and fixation point, calculating a gaze direction analysis score. Eye-tracking algorithms are techniques that infer a user's attention distribution and cognitive processing by capturing and analyzing eye movement data, such as gaze direction, fixation point, and saccade frequency. For example, prolonged gazing may indicate that the user is deeply thinking or encountering difficulties, while frequent saccades may indicate that the user is searching for information or is distracted. By analyzing these eye movement patterns, a gaze direction analysis score can be calculated to reflect the user's cognitive load level.

[0046] Based on behavioral data, speech rate and pause frequency are analyzed using speech recognition algorithms to calculate a speech rate analysis score. Speech recognition algorithms are technologies that convert human speech into text or analyze speech features. A slower speech rate, more frequent pauses, or the appearance of hesitation words can reflect the cognitive effort the user is spending on organizing and expressing language. A speech rate analysis score can therefore be calculated as an indicator of cognitive load.

[0047] Based on behavioral data, speech recognition algorithms are also used to analyze the lexical diversity and syntactic structure of user speech and to calculate a speech complexity analysis score. For example, under high cognitive load, users may tend to use simpler, more repetitive words and simpler syntactic structures. By quantifying these linguistic features, a speech complexity analysis score can be calculated to assess the user's cognitive load level more comprehensively.

[0048] Finally, a weighted average of the facial expression analysis score, head posture analysis score, gaze direction analysis score, speech rate analysis score, and speech complexity analysis score is calculated to obtain a comprehensive cognitive load score. The purpose of weighted averaging is to assign each score a weight that reflects its contribution to cognitive load, thereby obtaining a more accurate overall assessment. For example, in some scenarios facial expressions may reflect cognitive load more accurately than speech rate; in such cases, a higher weight can be assigned to the facial expression analysis score. The cognitive load level is then determined based on the cognitive load score: the score is compared with preset scoring thresholds to classify the current cognitive load level, for example, as low, medium, or high. The preset scoring thresholds can be set based on expert experience.
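The weighted averaging and threshold comparison described above can be sketched as follows. The function names, the assumption that every modality score is normalized to the range 0 to 1, and the threshold values 0.4 and 0.7 are illustrative placeholders; the disclosure only states that the thresholds are preset from expert experience.

```python
def cognitive_load_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-modality scores (assumed to lie in [0, 1])."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same modalities")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[k] * weights[k] for k in scores) / total_weight


def cognitive_load_level(score: float,
                         medium_threshold: float = 0.4,
                         high_threshold: float = 0.7) -> str:
    """Map a score to 'low', 'medium', or 'high' (thresholds are placeholders)."""
    if score >= high_threshold:
        return "high"
    if score >= medium_threshold:
        return "medium"
    return "low"


# Hypothetical modality scores; facial expression is weighted most heavily.
scores = {"face": 0.9, "head": 0.8, "gaze": 0.7, "rate": 0.6, "complexity": 0.5}
weights = {"face": 2.0, "head": 1.0, "gaze": 1.0, "rate": 1.0, "complexity": 1.0}
level = cognitive_load_level(cognitive_load_score(scores, weights))
```

Dividing by the sum of the weights keeps the result on the same scale as the inputs, so the thresholds do not need to change when the weights are retuned.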

[0049] To illustrate this technical solution more clearly, a specific example is used below. Suppose users E and F are interacting collaboratively in front of a smart device and their operable areas overlap. When a touch operation occurs in the overlapping area, the system must determine the user to whom it belongs based on behavioral data, and it first assesses the cognitive load level of both users using the method described above. For user E, the system analyzes facial expressions using the facial feature point recognition algorithm; for example, slightly furrowed eyebrows and slightly drooping corners of the mouth may indicate thinking or mild stress, and a facial expression analysis score is calculated accordingly. It analyzes user E's head posture using the human skeleton recognition algorithm; for example, a head that is tilted slightly forward and relatively stable may indicate high concentration, yielding a head posture analysis score. It analyzes user E's gaze direction and fixation point using the eye-tracking algorithm; for example, a gaze that lingers near the touch point for an extended period with a low saccade frequency may indicate deep processing of the relevant information, yielding a gaze direction analysis score. It analyzes user E's speech using the speech recognition algorithm; for example, a slightly slower speech rate with occasional pauses may reflect that user E is organizing language or thinking about instructions, yielding a speech rate analysis score. Finally, analysis of the lexical diversity and syntactic structure of user E's speech reveals relatively simple vocabulary and straightforward syntax, suggesting a tendency to simplify expression under high cognitive load, yielding a speech complexity analysis score. A weighted average of these five scores gives user E's cognitive load score, from which user E's cognitive load level is determined to be high. The system performs the same multimodal analysis on user F, calculating the individual scores and the cognitive load score. Assuming user F's cognitive load score is low, user F's cognitive load level is determined to be low.

[0050] After determining that user E's cognitive load level is high, the system extracts user E's voice information and hand skeleton data from the behavioral data and adjusts the contribution weights of voice commands and touch operations based on this information. For example, because user E is in a state of high cognitive load, the system may be more inclined to treat their touch operations and voice commands as well considered, even if they show some hesitation, and may assign them higher weights. Finally, using the adjusted voice command contribution weights and touch operation contribution weights, the system computes the weighted sum of the touch operation frequency and voice command frequency for each of users E and F to obtain their respective interaction activity levels. The two levels are compared, and the user with the higher interaction activity level is identified as the user to whom the touch operation belongs. In this way, even in complex situations where the operable areas of multiple users overlap and the users are under different cognitive loads, the system can determine the user's true intention more accurately, achieving more intelligent and natural collaborative interaction.
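The final comparison step can be sketched in a few lines. The function name `attribute_touch` and the dictionary layout are assumptions made for illustration; the disclosure does not say how ties are resolved, so this sketch simply keeps the first user encountered.

```python
def attribute_touch(activities: dict) -> str:
    """Return the user whose interaction activity level is highest.

    `activities` maps a user identifier to the activity level computed
    with adjusted weights. Ties go to the first user in iteration order,
    which is an arbitrary choice made for this sketch.
    """
    if not activities:
        raise ValueError("at least one candidate user is required")
    return max(activities, key=activities.get)


# Hypothetical activity levels for the two users in the example above.
owner = attribute_touch({"E": 4.5, "F": 3.0})
```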

[0051] Through the above technical solution, this embodiment achieves a more accurate and comprehensive assessment of the user's cognitive load level. By integrating multimodal information such as facial expressions, head posture, gaze direction, and voice features, this embodiment significantly improves the accuracy and robustness of cognitive load assessment. This high-precision cognitive load assessment enables the system to more accurately understand the user's behavior under different cognitive states in multi-user collaborative interaction scenarios, such as distinguishing between hesitation caused by high cognitive load and lack of operational intent. Therefore, when determining the user attributing a touch operation, the system can more reasonably adjust the contribution weights of touch operation frequency and voice command frequency based on a more reliable cognitive load level, effectively avoiding misjudgment of the attributing user due to inaccurate cognitive load assessment, and improving the smoothness and user experience of multi-user collaborative interaction.

[0052] In some embodiments, step S303, adjusting the contribution weight of voice commands based on user voice information, may include, but is not limited to, the following steps: semantic content analysis is performed on the user voice information to identify whether it contains instructional words related to the current collaborative task, and a semantic content analysis result is obtained; intonation and speech rate analysis is performed on the user voice information to identify the intensity of the voice expression; the confidence level of the voice command is calculated based on the semantic content analysis result and the intensity of the voice expression; and if the confidence level of the voice command is greater than the preset confidence threshold, the contribution weight of the voice command is maintained, otherwise the contribution weight of the voice command is reduced.

[0053] In some embodiments, semantic content analysis can be performed on the user's voice information to identify whether there are instructional words related to the current collaborative task, thus obtaining the semantic content analysis results. Natural Language Processing (NLP) technology can be used to perform deep analysis of the text content of the user's voice commands. The purpose is to accurately identify whether the voice information contains instructional words or phrases directly related to the current multi-user collaborative task. For example, in a collaborative drawing task, instructional words might include "zoom in," "zoom out," "move to the left," "select color," etc. Through this analysis, semantic content analysis results can be obtained, which can be used to determine the validity and relevance of the voice commands.

[0054] Then, the user's voice information is analyzed for intonation and speed to identify the intensity of the speech expression. Speech signal processing technology can be used to extract acoustic features of the speech, such as fundamental frequency, speech rate, and volume, to identify the intensity of the speech expression. The intensity of speech expression can reflect the firmness of the user's intention or emotional state. For example, speech that is fast-paced, high-pitched, and loud may be identified as having high intensity, indicating that the user's intention regarding the instruction is more explicit and urgent.

[0055] Then, based on the semantic content analysis results and the intensity of speech expression, the confidence score of the voice command is calculated. This confidence score is a quantitative indicator used to evaluate the reliability and effectiveness of the user's voice commands. For example, if the semantic content analysis results indicate that the voice information contains explicit instructional vocabulary, and the intonation and rate analysis shows a high intensity of expression, then the confidence score of the voice command will be calculated as a high value.

[0056] If the confidence level of a voice command is greater than a preset confidence threshold, its contribution weight is maintained; otherwise, its contribution weight is reduced. The preset confidence threshold is a configurable parameter that defines when a voice command is treated as valid, and it can be set based on expert experience. A confidence level above the threshold indicates that the voice command is considered reliable and effective, so its contribution weight is kept in order to preserve its influence on the interaction activity calculation. A confidence level that does not exceed the threshold leads to a reduced contribution weight, which limits the interference of ambiguous, irrelevant, or weakly expressed voice commands with the interaction activity calculation and thereby improves the accuracy of user attribution. The contribution weight can be reduced by multiplying it by a decay coefficient or by subtracting a fixed value.
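A minimal sketch of the confidence calculation and threshold rule is given below. The linear combination used for the confidence, the 0.6 share given to the semantic result, the threshold of 0.5, and the decay coefficient of 0.5 are all assumptions made for illustration; the disclosure does not fix a formula or any of these values.

```python
def voice_command_confidence(has_task_keyword: bool,
                             expression_intensity: float,
                             semantic_share: float = 0.6) -> float:
    """Combine the semantic result and expression intensity into [0, 1].

    `expression_intensity` is assumed to be normalized to [0, 1].
    The linear blend is a placeholder, not the disclosed formula.
    """
    if not 0.0 <= expression_intensity <= 1.0:
        raise ValueError("expression_intensity must be in [0, 1]")
    semantic_score = 1.0 if has_task_keyword else 0.0
    return (semantic_share * semantic_score
            + (1.0 - semantic_share) * expression_intensity)


def adjust_voice_weight(weight: float, confidence: float,
                        threshold: float = 0.5, decay: float = 0.5) -> float:
    """Keep the weight above the threshold, otherwise apply a decay factor."""
    if confidence > threshold:
        return weight
    return weight * decay


# A clear, task-related command keeps its weight; a vague one is decayed.
kept = adjust_voice_weight(1.0, voice_command_confidence(True, 0.8))
reduced = adjust_voice_weight(1.0, voice_command_confidence(False, 0.3))
```

Subtracting a fixed value, the other reduction option named in the text, would replace the multiplication with `max(weight - step, 0.0)` for some assumed `step`.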

[0057] Through the above technical solution, this embodiment can dynamically adjust the contribution weight of voice commands based on the semantic content and expressive intensity of user voice information. This enables the system to identify and use high-quality voice commands when evaluating user interaction activity, while suppressing interference from low-quality or irrelevant speech. Consequently, it improves the accuracy and reliability of voice commands as indicators of user intent in multi-user collaborative interaction scenarios, which in turn improves the accuracy of determining the user to whom a touch operation belongs in overlapping operable areas and enhances the overall experience of multi-user collaborative interaction on the BFF machine.

[0058] In some embodiments, in step S304, adjusting the touch operation contribution weight based on the user's hand skeleton data may include, but is not limited to, the following steps: Step S401: Obtain the operation extension area; Step S402: Based on the hand skeleton data, determine whether the user's hand is pointing at the display screen; Step S403: If the user's hand is pointing at the display screen, calculate the direction vector from the wrist to the fingertip based on the hand skeleton data; Step S404: Calculate the intersection point of the direction vector and the display screen; Step S405: Calculate the distance between the intersection point and the touch point; Step S406: If the distance between the intersection point and the touch point is less than a preset distance threshold, and the touch point is located within the operation extension area, then maintain the touch operation contribution weight; otherwise, reduce the touch operation contribution weight.

[0059] In some embodiments, relying solely on hand skeleton data may not accurately determine the true intent of a user's touch operation, especially in multi-user collaborative interaction scenarios where the user's hand may move around near the screen, but not all actions represent a clear touch operation intent. This may affect the accuracy of adjusting the touch operation contribution weight.

[0060] To address this, the operation extension area can be obtained first. Starting from the user's current operable area, the area is dynamically expanded according to the user's body movement trend to predict where the user is likely to operate next. This helps identify the user's intention to operate in advance, even before the user has touched the screen.

[0061] Then, based on the hand skeleton data, it is determined whether the user's hand is pointing towards the display screen. This can be achieved by analyzing information such as the posture and direction of the palm and fingers in the hand skeleton data. For example, when the user's palm is facing the display screen and the fingers are extended in a direction roughly pointing towards the screen, it can be determined that the hand is pointing towards the display screen. If the user's hand is pointing towards the display screen, a direction vector from the wrist to the fingertip is calculated based on the hand skeleton data. This vector represents the pointing direction of the user's hand in three-dimensional space. For example, the wrist joint can be selected as the starting point and a fingertip (such as the tip of the index finger) as the end point to construct the vector.

[0062] Next, calculate the intersection of the direction vector and the display screen. This direction vector can be extended to the point where it intersects the display screen plane. This intersection point represents the potential pointing position of the user's hand on the screen. The distance between this intersection point and the touch point is then calculated to quantify the degree of matching between the user's hand pointing and the actual touch point. The smaller the distance, the better the match between the user's hand pointing and the touch point. A preset distance threshold is a configurable parameter used to define the acceptable distance range between the hand pointing and the touch point, and can be set through expert experience.

[0063] If the distance between the intersection point and the touch point is less than the preset distance threshold, and the touch point is located within the operation extension area, the hand pointing is considered to match the touch point and the touch operation contribution weight is maintained. Otherwise, the touch operation contribution weight is reduced. The operation extension area is the area obtained by dynamically predicting and expanding the operable area according to the user's body movement trend. Its purpose is to capture the user's potential operation intentions on the screen more accurately: a touch point that falls slightly outside the strictly defined operable area, but still within the extension indicated by the user's body movement trend, should still be treated as a valid operation.
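Steps S403 to S406 can be sketched geometrically as follows. The sketch assumes a screen-aligned coordinate system in which the display lies in the plane z = 0 and the user stands at z > 0, with all lengths in meters; the function names, the 0.05 m distance threshold, and the 0.5 decay coefficient are hypothetical values chosen for illustration.

```python
import math


def pointing_intersection(wrist, fingertip):
    """Intersect the wrist-to-fingertip ray with the screen plane z = 0.

    Returns the (x, y) hit point, or None when the ray is parallel to
    the screen or points away from it.
    """
    dx, dy, dz = (f - w for f, w in zip(fingertip, wrist))
    if abs(dz) < 1e-9:
        return None  # ray parallel to the screen plane
    t = -wrist[2] / dz  # ray parameter at which z reaches 0
    if t < 0:
        return None  # screen is behind the pointing direction
    return (wrist[0] + t * dx, wrist[1] + t * dy)


def adjust_touch_weight(weight, wrist, fingertip, touch_point,
                        in_extension_area,
                        distance_threshold=0.05, decay=0.5):
    """Keep the weight only if pointing matches the touch point (S406)."""
    hit = pointing_intersection(wrist, fingertip)
    if (hit is not None and in_extension_area
            and math.dist(hit, touch_point) < distance_threshold):
        return weight
    return weight * decay


# Hand 0.5 m from the screen, pointing straight at the point (0.1, 0.2).
w = adjust_touch_weight(1.0, (0.1, 0.2, 0.5), (0.1, 0.2, 0.4),
                        touch_point=(0.11, 0.2), in_extension_area=True)
```

Whether the touch point lies inside the operation extension area is passed in as a boolean here so that the geometric check stays independent of how that area is represented.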

[0064] To illustrate this technical solution more clearly, a specific example is used below. Assume a multi-user collaborative interaction scenario in which user G and user H share the same BFF machine. The system acquires user G's hand skeleton data and, by analyzing the hand posture, determines that user G's hand is pointing towards a certain area of the display screen. At the same time, based on user G's body movement speed and direction, the system identifies that user G is moving towards the right side of the screen and expands user G's operable area to the right, obtaining the operation extension area. A touch operation then occurs on the screen, with the touch point located within user G's operation extension area. The system calculates the intersection of the direction vector from user G's wrist to fingertip with the display screen and finds that the distance between this intersection and the actual touch point is less than the preset distance threshold. Based on these judgments, the system determines that the touch operation is an intentional act by user G and therefore maintains user G's touch operation contribution weight. Conversely, if user H's hand skeleton data shows that the hand is close to the screen but not clearly pointing towards the touch point, or the touch point is not within user H's operation extension area, the system reduces user H's touch operation contribution weight to avoid misjudgment. In this way, the system can distinguish the operational intentions of different users and keep the interaction activity calculation accurate.

[0065] Through the above technical solution, this embodiment can more accurately determine the true intention of a user's touch operation, avoiding the problem of inaccurate calculation of touch operation contribution weight caused by the user's hand being near the screen but without actual operation intention or by misoperation. This makes the evaluation of interaction activity more objective and reliable, thereby improving the accuracy of user affiliation determination in multi-user collaborative interaction systems and optimizing user experience and system response accuracy.

[0066] In some embodiments, obtaining the operation extension area in step S401 may include, but is not limited to, the following steps: acquiring three-dimensional point cloud data of the body; calculating the body's movement speed and direction based on the three-dimensional point cloud data; identifying the body movement trend based on the movement speed and direction; and expanding the operable area based on the body movement trend to obtain the operation extension area.

[0067] In some embodiments, three-dimensional point cloud data of the body can be acquired first. Point cloud information of the user's body in three-dimensional space can be collected in real time using devices such as depth sensors, stereo cameras, or LiDAR. This point cloud data can accurately reflect the user's body posture, position, and the three-dimensional structure of their surrounding environment.

[0068] Then, based on the body's 3D point cloud data, the body's movement speed and direction are calculated. By comparing the body's 3D point cloud data across consecutive time frames, changes in the user's body's center position can be tracked, thereby calculating the user's instantaneous movement speed and direction in 3D space. For example, algorithms such as Kalman filtering or particle filtering can be used to process the body point cloud data to smoothly estimate the body's motion parameters.
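The frame-to-frame tracking of the body's center position can be sketched as below. Note that this sketch uses a plain finite difference between two point cloud centroids, which is a simpler stand-in for the Kalman or particle filtering mentioned above and provides no smoothing; the function names and the tuple-based point format are assumptions.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of (x, y, z) points."""
    n = len(points)
    if n == 0:
        raise ValueError("point cloud must not be empty")
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def velocity(prev_points, curr_points, dt):
    """Finite-difference velocity of the body center between two frames."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    c0, c1 = centroid(prev_points), centroid(curr_points)
    return tuple((b - a) / dt for a, b in zip(c0, c1))


def speed_and_direction(v):
    """Split a velocity vector into its magnitude and unit direction."""
    speed = math.sqrt(sum(c * c for c in v))
    if speed == 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    return speed, tuple(c / speed for c in v)


# Two toy frames 0.5 s apart: the body center moves 1 m along x.
v = velocity([(0, 0, 0), (2, 0, 0)], [(1, 0, 0), (3, 0, 0)], dt=0.5)
speed, direction = speed_and_direction(v)
```

In practice the raw difference is noisy, which is why a filter that fuses several frames would normally sit between the centroid measurements and the velocity estimate.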

[0069] Then, based on the body's movement speed and direction, the movement trend can be identified. For example, if a user's body is moving towards a certain area of the display screen at a certain speed and direction, it can be predicted that the user may soon interact with that area. The trend can be identified with a predictive model, for example one learned from historical motion data, or with simple linear extrapolation that predicts the user's likely position a short time ahead.

[0070] Finally, based on the body's movement trend, the operable area is expanded to obtain the expanded operable area. If a tendency for the user's body to move in a certain direction is detected, the current operable area can be appropriately expanded along that direction. This expansion can be dynamic, and its range and shape can be adjusted according to the confidence level of the body's movement speed and direction. For example, the faster the movement speed and the farther the predicted movement distance, the larger the expanded area may be.
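One way to sketch the directional expansion is to treat the operable area as an axis-aligned rectangle in screen coordinates and stretch it by the linearly extrapolated displacement. The `Rect` type, the 0.5 s prediction horizon, and the scalar `confidence` factor are assumptions introduced for this illustration; the disclosure leaves the shape and extent of the expansion open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned operable area in screen coordinates (assumed shape)."""
    left: float
    right: float
    bottom: float
    top: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.bottom <= y <= self.top


def expand_operable_area(area: Rect, vx: float, vy: float,
                         horizon: float = 0.5,
                         confidence: float = 1.0) -> Rect:
    """Stretch the area along the motion direction only.

    The stretch equals velocity * horizon (linear extrapolation),
    scaled by a confidence factor in [0, 1].
    """
    shift_x = vx * horizon * confidence
    shift_y = vy * horizon * confidence
    return Rect(left=area.left + min(shift_x, 0.0),
                right=area.right + max(shift_x, 0.0),
                bottom=area.bottom + min(shift_y, 0.0),
                top=area.top + max(shift_y, 0.0))


# A user moving right at 0.4 m/s gains 0.2 m of area on the right edge.
base = Rect(0.0, 1.0, 0.0, 1.0)
extended = expand_operable_area(base, vx=0.4, vy=0.0)
```

Because only the edge facing the motion is moved, faster movement yields a larger extension while the opposite edge of the operable area stays where it was.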

[0071] This embodiment acquires 3D point cloud data of the user's body to obtain precise position and posture information in 3D space. Based on this high-precision 3D data, the user's movement speed and direction can be accurately calculated, thereby identifying the user's movement trend. By anticipating the user's movement trend, the system can predict the display area the user may touch and proactively expand the currently operable area accordingly. This dynamic expansion mechanism makes the user's interactive experience smoother and more natural, avoiding situations where the user's hand movements exceed the static operable area due to body movement.

[0072] Through the above technical solution, this embodiment can more intelligently adapt to the user's body movement and dynamically adjust the range of the operable area. This significantly improves the smoothness of multi-user collaborative interaction and user experience, and reduces operation interruptions or misjudgments caused by user body movement. Users do not need to deliberately remain still to perform continuous and effective touch operations while moving, thereby improving the naturalness and efficiency of the interaction.

[0073] In some embodiments, after adjusting the contribution weight of voice commands based on user voice information in step S303, the method may also include, but is not limited to, the following steps: Step S501: Obtain the current touch operation object and the current touch operation type; Step S502: Parse the user's voice information and extract the voice command object and voice command type; Step S503: Compare the current touch operation object and the voice command object to obtain the first comparison result; Step S504: Compare the current touch operation type and the voice command type to obtain a second comparison result; Step S505: If the first comparison result is that the objects are inconsistent and the second comparison result is that the types are inconsistent, then the contribution weight of the voice command is attenuated.

[0074] In some embodiments, when there is a conflict or inconsistency between a user's touch operation and voice command, adjusting the weight based solely on voice information may not accurately reflect the user's true intention, thereby affecting the accuracy of the interaction activity calculation and potentially leading to a bias in the judgment of the user to whom the touch operation belongs.

[0075] To address this, the current touch operation object and the current touch operation type are obtained first. The current touch operation object refers to the interface element or functional module that the user has currently selected through touch, such as a button, a text box, an image, or an application icon. The current touch operation type refers to the specific behavior of the user's current touch operation, such as clicking, long pressing, swiping, or zooming.

[0076] Then, the user's voice information is parsed to extract the voice command object and voice command type. The voice command object refers to the interface element or functional module mentioned or pointed to by the user through voice commands, such as the "album" in "open album" or the "picture" in "select picture." The voice command type refers to the specific operational intent expressed by the user through voice commands, such as "open," "select," "delete," or "move."

[0077] The current touch operation object and the voice command object are then compared to obtain the first comparison result. The current touch operation type and the voice command type are then compared to obtain the second comparison result, aiming to determine whether the user's touch operation and voice command are consistent in terms of object and type.

[0078] If the first comparison result indicates an inconsistency in objects, and the second comparison result indicates an inconsistency in types, it indicates a significant conflict or inconsistency between the user's touch operations and voice commands. In this case, to avoid negatively impacting or misleading the calculation of interaction activity, it is necessary to attenuate the contribution weight of voice commands.
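Steps S501 to S505 can be sketched as below. The sketch assumes that objects and types have already been normalized to comparable labels and uses plain equality for both comparisons; a real system would probably need a semantic comparison. The names `Action` and `attenuate_on_conflict` and the multiplicative `factor` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """An operation described by its target object and its operation type."""
    obj: str
    kind: str


def attenuate_on_conflict(touch, voice, voice_weight, factor=0.4):
    """Return the voice command contribution weight after the conflict check."""
    objects_differ = touch.obj != voice.obj    # S503: first comparison result
    types_differ = touch.kind != voice.kind    # S504: second comparison result
    if objects_differ and types_differ:        # S505: both inconsistent
        return voice_weight * factor
    return voice_weight


touch = Action(obj="play button", kind="click")
voice = Action(obj="music", kind="pause")
new_weight = attenuate_on_conflict(touch, voice, voice_weight=0.5)
```

The weight is attenuated only when both comparisons report an inconsistency, which mirrors the condition stated in step S505.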

[0079] To illustrate this technical solution more clearly, a specific example is used below. Suppose that on a BFF machine, user I is dragging an image file using a touch operation (the current touch operation object is "image file," and the current touch operation type is "drag"). At the same time, user I unintentionally says "delete this file," which the system parses as the voice command object "file" and the voice command type "delete." Comparing the two, the current touch operation object "image file" and the voice command object "file" may have some semantic relationship, but the operation types "drag" and "delete" are clearly inconsistent. If the system determines that the objects are inconsistent (for example, the touch is on a specific image, while the voice command refers only to a "file") and that the types are inconsistent, it reduces the contribution weight of the voice command. For example, the contribution weight of the voice command might be reduced from 0.5 to 0.2 to lessen the impact of the conflicting voice command on user I's interaction activity calculation, ensuring that the system prioritizes user I's touch operation intent. As another example, user J clicks a "play" button (the current touch operation object is "play button," and the current touch operation type is "click") while simultaneously saying "pause music" (the voice command object is "music," and the voice command type is "pause"). In this case, the current touch operation object "play button" and the voice command object "music" may be judged inconsistent (one is a concrete button, the other an abstract concept), and the current touch operation type "click" is clearly inconsistent with the voice command type "pause." Based on this inconsistency, the system reduces the contribution weight of user J's voice command, thus more accurately reflecting the "play" intent that user J expressed through the touch operation and avoiding interference from the voice command.

[0080] Through the above technical solution, this embodiment can effectively identify and handle conflicts between touch and voice interaction modes. This avoids deviations in interaction activity calculation caused by inconsistencies between voice commands and actual touch operations, improving the accuracy of user attribution. Especially in multi-user collaborative scenarios, when multiple users interact with touch and voice simultaneously, this embodiment can more accurately assess the actual interaction contribution of each user, thereby optimizing the multi-user collaborative interaction experience and enabling the system to respond to user operations more intelligently and accurately.

[0081] In some embodiments, the attenuation of the voice command contribution weight in step S505 may include, but is not limited to, the following steps: calculating the object location distance based on the current touch operation object and the voice command object; calculating the type similarity based on the current touch operation type and the voice command type; assessing the degree of conflict deviation based on the object location distance and the type similarity; determining the weight attenuation amount based on the degree of conflict deviation; and attenuating the contribution weight of the voice command according to the weight attenuation amount.

[0082] In some embodiments, simple attenuation processing may not accurately reflect the degree of conflict between touch operations and voice commands, resulting in insufficiently refined weight adjustment and affecting the system's accurate judgment of the user's true intent. If this problem is not addressed, inappropriate weight attenuation may reduce the smoothness and accuracy of interaction in complex multi-user collaborative scenarios.

[0083] To address this, the object location distance between the current touch operation object and the voice command object is calculated first. The object location distance refers to the spatial distance between the on-screen position of the current touch operation object and the on-screen position of the object pointed to or mentioned by the voice command. This distance can be obtained by calculating the Euclidean distance between the center points of the two objects in a two-dimensional or three-dimensional coordinate system. For example, if the touch operation points to a button on the screen while the voice command mentions a text box, the distance between these two elements on the screen is calculated. The purpose is to quantify the degree of spatial inconsistency between the user's physical operation and the voice command.

[0084] Then, type similarity is calculated based on the current touch operation type and the voice command type. Type similarity refers to the degree of semantic or functional matching between the current touch operation type and the voice command type. For example, if the touch operation type is "select" and the voice command type is "delete," the type similarity is low; if the touch operation type is "zoom in" and the voice command type is "zoom out," the type similarity is also low, but there may be some correlation. Type similarity can be calculated using a predefined semantic similarity matrix, an ontology-based matching algorithm, or a machine learning model. Its purpose is to assess the differences in functional intent between the two operations.
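Of the options listed above, the predefined semantic similarity matrix is the simplest to illustrate. The sketch below stores it as a symmetric lookup table. The table name `SIMILARITY`, the function `type_similarity`, and every numeric value are hypothetical placeholders chosen only to match the qualitative statements in the text (low for "select" versus "delete," low but somewhat related for "zoom in" versus "zoom out").

```python
# Hypothetical similarity values in [0, 1]; unordered pairs make the table symmetric.
SIMILARITY = {
    frozenset({"select", "delete"}): 0.1,
    frozenset({"zoom in", "zoom out"}): 0.3,
    frozenset({"drag", "delete"}): 0.1,
    frozenset({"drag", "move"}): 0.9,
}


def type_similarity(touch_type, voice_type, default=0.0):
    """Look up how closely a touch operation type matches a voice command type."""
    if touch_type == voice_type:
        return 1.0
    # Pairs missing from the table fall back to the default (treated as unrelated).
    return SIMILARITY.get(frozenset({touch_type, voice_type}), default)
```

Keying the table by `frozenset` means the order of the two arguments does not matter, so each pair only needs to be listed once.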

[0085] Next, the degree of conflict deviation is assessed based on object location distance and type similarity. The degree of inconsistency between touch operations and voice commands is quantified by combining object location distance and type similarity, thus obtaining the degree of conflict deviation. For example, methods such as weighted averaging, fuzzy logic reasoning, or decision trees can be used to map object location distance and type similarity to a unified conflict deviation index. The higher the index value, the greater the conflict between touch operations and voice commands. The aim is to provide a unified and quantifiable standard for conflict assessment.
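A weighted-average mapping, one of the methods mentioned above, might look like the following sketch. It assumes the distance is normalized by a reference length such as the screen diagonal and that the type similarity lies in [0, 1]. The function name, `max_distance`, and the equal default weights are hypothetical.

```python
def conflict_deviation(distance, type_similarity, max_distance,
                       w_distance=0.5, w_type=0.5):
    """Map object location distance and type similarity to an index in [0, 1].

    A larger distance and a lower type similarity both raise the index.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    # Normalize the distance and clamp both inputs into [0, 1].
    norm_distance = min(max(distance / max_distance, 0.0), 1.0)
    dissimilarity = 1.0 - min(max(type_similarity, 0.0), 1.0)
    # Weighted average of the two normalized conflict signals.
    return (w_distance * norm_distance + w_type * dissimilarity) / (w_distance + w_type)
```

With this mapping, two operations on the same object with the same type yield 0, and operations on objects at opposite corners of the screen with unrelated types yield 1.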

[0086] Finally, the weight attenuation amount is determined based on the degree of conflict deviation. The weight attenuation amount refers to the specific numerical value attenuated from the weight of the voice command contribution, determined according to the assessed degree of conflict deviation. For example, an attenuation function can be preset, taking the degree of conflict deviation as input and outputting the corresponding attenuation ratio or value. The greater the degree of conflict deviation, the greater the weight attenuation amount, and vice versa. The purpose is to achieve dynamic and fine-grained adjustment of the voice command contribution weight, enabling it to accurately reflect the complexity of the user's intent. The weight attenuation amount is then used to attenuate the voice command contribution weight.
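A preset attenuation function of the kind described above could be as simple as a linear map. The sketch below assumes the degree of conflict deviation is an index in [0, 1] and that the attenuation is subtracted from the weight. The function names and the cap `max_attenuation` are hypothetical.

```python
def weight_attenuation(deviation, max_attenuation=0.5):
    """Linear attenuation function: a larger deviation gives a larger attenuation."""
    clamped = min(max(deviation, 0.0), 1.0)
    return max_attenuation * clamped


def attenuate_voice_weight(weight, deviation, max_attenuation=0.5):
    """Subtract the attenuation amount from the weight, never going below zero."""
    return max(weight - weight_attenuation(deviation, max_attenuation), 0.0)
```

For instance, under these assumptions a deviation of 0.8 gives an attenuation amount of 0.4, which would reduce a voice command contribution weight of 0.7 to approximately 0.3. A non-linear function could be substituted without changing the surrounding steps.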

[0087] To illustrate this technical solution more clearly, a specific example is used below. Suppose that on a BFF machine, user K is dragging an image object on the screen via touch, moving it from the left side to the right side. Simultaneously, user L issues a voice command, "Delete this image," but the "image" referred to is actually a text box object located in the upper left corner of the screen. The system first identifies the current touch operation object as the image object and the current touch operation type as "drag / move." By parsing user L's voice information, the system extracts the voice command object as the text box object and the voice command type as "delete." Next, the system calculates the object location distance between the image object and the text box object based on their positions on the screen. For example, if the center coordinates of the image object are (500, 300) and the center coordinates of the text box object are (100, 100), the Euclidean distance between them is calculated. The system also calculates the type similarity between the touch operation type "drag / move" and the voice command type "delete." Because "drag / move" and "delete" differ significantly in semantics and function, their type similarity is low.
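For the coordinates in this example, the Euclidean distance can be checked with the Python standard library. The variable names are illustrative, and the units are whatever screen coordinate units the center points are expressed in.

```python
import math

image_center = (500, 300)      # center of the image object
text_box_center = (100, 100)   # center of the text box object

# sqrt((500 - 100)^2 + (300 - 100)^2) = sqrt(200000)
distance = math.dist(image_center, text_box_center)
print(round(distance, 1))  # 447.2
```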

[0088] The system then assesses the current degree of conflict deviation based on the calculated object location distance and type similarity. For example, a larger object location distance and a lower type similarity result in a higher assessed degree of conflict deviation. Finally, based on the high assessed degree of conflict deviation, the system determines a larger weight attenuation amount and reduces the contribution weight of user L's voice command accordingly. For example, if the original voice command contribution weight is 0.7, the weight attenuation amount might be 0.4 given the high degree of conflict deviation, resulting in a final voice command contribution weight of 0.3. In this way, the system is more likely to accept user K's touch operation intent while reducing the weight of user L's voice command, thus avoiding the erroneous deletion of the image object due to a misunderstood voice command and ensuring the accuracy and smoothness of the interaction.

[0089] Through the above technical solution, this embodiment can more accurately identify and handle potential conflicts between touch operations and voice commands in multi-user collaborative interaction. By calculating the object location distance and type similarity, and assessing the degree of conflict deviation accordingly, the system can achieve refined and adaptive attenuation of the contribution weight of voice commands, rather than simple uniform attenuation. This refined weight adjustment enables the system to more accurately judge the user's true operational tendency when faced with complex or ambiguous user intentions, effectively avoiding misoperations or reduced interaction efficiency caused by inappropriate weight attenuation, and significantly improving the intelligence level and user satisfaction of multi-user collaborative interaction on the device.

[0090] The beneficial effects of implementing the embodiments of the present invention include: First, the embodiments of this application obtain the physical state information and behavioral data of each initial user. Then, based on the physical state information of each initial user, an operable area corresponding to each initial user is constructed. Next, the touch point of the touch operation is judged to determine the area affiliation. If the area affiliation judgment result is that it is in a single user operation area, the operable area where the touch point is located is identified as the target operation area, and the initial user corresponding to the target operation area is identified as the user to whom the touch operation belongs. If the area affiliation judgment result is that it is in the overlapping part of multiple user operation areas, the user to whom the touch operation belongs is determined based on the behavioral data. Thus, the user to whom the touch operation belongs can be identified by combining physical state information and behavioral data, so as to realize multi-user collaborative interaction and improve the accuracy of user identification and collaborative efficiency.
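The overall attribution flow summarized above can be sketched as follows. The sketch assumes rectangular operable areas and, for the overlapping case, uses a precomputed interaction activity score per user as the behavioral-data criterion, in line with the interaction activity comparison recited in claim 2. The names `OperableArea` and `attribute_touch` and the sample scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperableArea:
    """Display range reachable by one user (assumed rectangular here)."""
    user_id: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def attribute_touch(touch_point, areas, activity):
    """Return the user to whom the touch belongs, or None if no area contains it."""
    x, y = touch_point
    candidates = [a.user_id for a in areas if a.contains(x, y)]
    if not candidates:
        return None
    if len(candidates) == 1:
        # Single-user operation area: the owner of that area is the user.
        return candidates[0]
    # Overlapping areas: fall back to behavioral data (highest interaction activity).
    return max(candidates, key=lambda user: activity.get(user, 0.0))


areas = [OperableArea("A", 0, 0, 100, 100), OperableArea("B", 50, 0, 150, 100)]
activity = {"A": 0.3, "B": 0.8}
```

With these sample areas, a touch at (10, 10) falls only in user A's area, while a touch at (75, 50) falls in the overlap and is resolved by the activity scores.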

[0091] As shown in Figure 2, this embodiment of the invention also provides a multi-user collaborative interaction system for a BFF machine, including: an information acquisition module 601, used to acquire the body status information and behavioral data of each initial user, where the body status information includes the body center position, limb extension range, head orientation, and face orientation; an operable area construction module 602, used to construct an operable area corresponding to each initial user based on the body status information of each initial user, where the operable area represents the display range that the initial user can reach in the current body state; a region attribution determination module 603, used to respond to a touch operation, perform region attribution determination on the touch point of the touch operation, and obtain a region attribution determination result; a single-user area analysis module 604, used to identify the operable area where the touch point is located as the target operation area if the region attribution determination result is that the touch point is in a single-user operation area, and to identify the initial user corresponding to the target operation area as the user to whom the touch operation belongs; and a multi-user area analysis module 605, used to determine the user to whom the touch operation belongs based on the behavioral data if the region attribution determination result is that the touch point is in the overlapping part of multiple user operation areas.

[0092] The content of the above method embodiments is applicable to this system embodiment. The specific functions implemented in this system embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.

[0093] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

Claims

1. A multi-user collaborative interaction method for a BFF machine, characterized in that it includes the following steps: acquiring the body status information and behavioral data of each initial user, the body status information including the body center position, limb extension range, head orientation, and face orientation; constructing, based on the body status information of each initial user, an operable area corresponding to each initial user, the operable area representing the display range that the initial user can reach in the current body state; in response to a touch operation, performing region attribution determination on the touch point of the touch operation to obtain a region attribution determination result; if the region attribution determination result is that the touch point is in a single-user operation area, identifying the operable area where the touch point is located as the target operation area, and identifying the initial user corresponding to the target operation area as the user to whom the touch operation belongs; and if the region attribution determination result is that the touch point is in the overlapping part of multiple user operation areas, determining the user to whom the touch operation belongs based on the behavioral data.

2. The method according to claim 1, characterized in that, The step of determining the user to whom the touch operation belongs based on the behavioral data includes: Extract touch operation frequency and voice command frequency from the behavioral data; The interaction activity level is calculated based on the touch operation frequency and the voice command frequency; The interaction activity of each initial user is compared, and the initial user with the highest interaction activity is selected as the assigned user.

3. The method according to claim 2, characterized in that, The calculation of interaction activity based on the touch operation frequency and the voice command frequency includes: Based on the behavioral data, assess the level of cognitive load; If the cognitive load level is high, then user voice information is extracted from the behavioral data and user hand skeleton data is obtained through the visual recognition module; Adjust the contribution weight of voice commands based on the user's voice information; Adjust the contribution weight of touch operation based on the user's hand skeleton data; The interaction activity is calculated by weighting and summing the touch operation frequency and the voice command frequency based on the adjusted voice command contribution weight and touch operation contribution weight.

4. The method according to claim 3, characterized in that, The assessment of cognitive load level based on the behavioral data includes: Based on the behavioral data, changes in key facial points are analyzed using a facial feature point recognition algorithm to calculate a facial expression analysis score. The key facial points include eyebrows, corners of the eyes, and corners of the mouth. Based on the behavioral data, the user's head posture in three-dimensional space is analyzed using a human skeleton recognition algorithm, and a head posture analysis score is calculated. Based on the behavioral data, the eye-tracking algorithm is used to analyze the user's gaze direction and fixation point, and a gaze direction analysis score is calculated. Based on the behavioral data, the speech rate and pause frequency of the user's speech are analyzed using a speech recognition algorithm to calculate a speech rate analysis score. Based on the behavioral data, the lexical diversity and syntactic structure of the user's speech are analyzed using a speech recognition algorithm, and a speech complexity analysis score is calculated. The cognitive load score is calculated by weighting the facial expression analysis score, head posture analysis score, gaze direction analysis score, speech rate analysis score, and speech complexity analysis score. The cognitive load level is determined based on the cognitive load score.

5. The method according to claim 3, characterized in that, The step of adjusting the contribution weight of voice commands based on the user's voice information includes: Semantic content analysis is performed on the user's voice information to identify whether there are instructional words related to the current collaborative task in the user's voice information, and the semantic content analysis results are obtained. The user's voice information is analyzed for intonation and speed to identify the intensity of the voice expression; Calculate the confidence level of the voice command based on the semantic content analysis results and the voice expression intensity; If the confidence level of the voice command is greater than a preset confidence threshold, the contribution weight of the voice command is maintained; otherwise, the contribution weight of the voice command is reduced.

6. The method according to claim 3, characterized in that, The step of adjusting the touch operation contribution weight based on the user's hand skeleton data includes: Get the operation extension area; Based on the hand skeleton data, determine whether the user's hand is pointing at the display screen; If the user's hand is pointing at the display screen, then the direction vector from the wrist to the fingertip is calculated based on the hand skeleton data; Calculate the intersection point of the direction vector and the display screen; Calculate the distance between the intersection point and the touch point; If the distance between the intersection and the touch point is less than a preset distance threshold, and the touch point is located within the operation extension area, then the touch operation contribution weight is maintained; otherwise, the touch operation contribution weight is reduced.

7. The method according to claim 6, characterized in that obtaining the operation extension area includes: acquiring three-dimensional point cloud data of the body; calculating the body movement speed and body movement direction based on the three-dimensional point cloud data of the body; identifying the body movement trend based on the body movement speed and the body movement direction; and expanding the operable area based on the body movement trend to obtain the operation extension area.

8. The method according to claim 3, characterized in that, After adjusting the contribution weight of voice commands based on the user's voice information, the method further includes: Get the current touch operation object and the current touch operation type; The user's voice information is parsed to extract the voice command object and voice command type; The current touch operation object and the voice command object are compared to obtain a first comparison result; The current touch operation type and the voice command type are compared to obtain a second comparison result; If the first comparison result indicates that the objects are inconsistent, and the second comparison result indicates that the types are inconsistent, then the contribution weight of the voice command is attenuated.

9. The method according to claim 8, characterized in that attenuating the contribution weight of the voice command includes: calculating the object location distance based on the current touch operation object and the voice command object; calculating the type similarity based on the current touch operation type and the voice command type; assessing the degree of conflict deviation based on the object location distance and the type similarity; determining the weight attenuation amount based on the degree of conflict deviation; and attenuating the contribution weight of the voice command according to the weight attenuation amount.

10. A multi-user collaborative interaction system for a BFF machine, characterized in that it includes: an information acquisition module, used to acquire the body status information and behavioral data of each initial user, the body status information including the body center position, limb extension range, head orientation, and face orientation; an operable area construction module, used to construct an operable area corresponding to each initial user based on the body status information of each initial user, the operable area representing the display range that the initial user can reach in the current body state; a region attribution determination module, used to respond to a touch operation, perform region attribution determination on the touch point of the touch operation, and obtain a region attribution determination result; a single-user area analysis module, used to identify the operable area where the touch point is located as the target operation area if the region attribution determination result is that the touch point is in a single-user operation area, and to identify the initial user corresponding to the target operation area as the user to whom the touch operation belongs; and a multi-user area analysis module, used to determine the user to whom the touch operation belongs based on the behavioral data if the region attribution determination result is that the touch point is in the overlapping part of multiple user operation areas.