Virtual video communication method, apparatus, device, storage medium and program product
By integrating multi-view facial information and utilizing deep learning and the solvePNP algorithm, a 3D animation model is driven in real time, solving the problems of limited location information and poor real-time performance in existing technologies. This improves stability and accuracy and expands the application scope.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2022-07-01
- Publication Date
- 2026-06-16
Figure CN115359159B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and video communication technology, and in particular to a virtual video communication method, apparatus, device, storage medium, and program product.
Background Technology
[0002] With increasing demands for visualization, video communication, which integrates real-time voice, data, and video transmission, has become a hot topic in the communications field and is widely used in areas such as video conferencing, remote video healthcare, and remote video education. Among these applications, the technology of using real-time driven 3D animation models to replace real human facial images for virtual video communication also shows broad application prospects and market potential.
[0003] In existing technologies, 3D animation models are generally driven in the following two ways:
[0004] (1) By inputting a video of a real human face, and processing it with an algorithm, a dynamic 3D animation model video is output. Although this method has a good driving effect, it cannot meet the real-time requirements of virtual video communication. It can only obtain a driveable 3D animation model on the premise that the video is recorded in advance, and the application scenarios are greatly limited.
[0005] (2) Facial information is acquired through a monocular camera to drive a 3D animation model in real time. However, the facial position information acquired by a monocular camera is limited, so information such as the rotation angle and orientation in 3D space cannot be accurately analyzed and acquired. Furthermore, instability and inaccuracy problems such as missing points and dropped frames occur during the driving process, so this approach cannot meet market demand.
Summary of the Invention
[0006] This invention provides a virtual video communication method, apparatus, device, storage medium, and program product to address the shortcomings of limited location information and poor real-time performance in existing technologies. By using a 3D animation model that integrates multi-view facial information for virtual video communication, the robustness and stability of the 3D animation model are improved.
[0007] This invention provides a virtual video communication method, comprising:
[0008] Based on the acquired multi-view current scene images, determine the corresponding multi-view facial key point information;
[0009] Based on the world coordinates of facial key points from multiple perspectives and the facial key point information from the multiple perspectives, facial fusion information is determined. The facial fusion information is used to drive the 3D animation model in real time. The facial fusion information includes: facial key point fusion information and facial rotation angle fusion information.
[0010] The 3D animation model is streamed in real time, and the real-time video stream is used to realize virtual video communication.
[0011] According to the virtual video communication method provided by the present invention, face fusion information is determined based on the world coordinates of facial key points from multiple perspectives and the facial key point information from the multiple perspectives. The face fusion information is used to drive a 3D animation model in real time. The face fusion information includes: facial key point fusion information and face rotation angle fusion information, including:
[0012] Based on the world coordinates of the facial key points from each viewpoint, the rotation vector of the face relative to each viewpoint is determined using the solvePNP algorithm.
[0013] The rotation vectors of the face relative to each viewpoint are fused to determine the face rotation angle fusion information;
[0014] The facial key point information from each viewpoint is fused to determine the fused facial key point information.
[0015] According to the virtual video communication method provided by the present invention, face fusion information is determined based on the world coordinates of facial key points from multiple perspectives and the facial key point information from the multiple perspectives. The face fusion information is used to drive a 3D animation model in real time. The face fusion information includes: facial key point fusion information and face rotation angle fusion information, and further includes:
[0016] Determine the facial key point information of the pre-built 3D animation model;
[0017] The face fusion information is imported into the 3D animation model, and the facial key point information of the 3D animation model is matched with the face fusion information. The key point matching is used to drive the 3D animation model in real time based on the current scene image.
[0018] According to the virtual video communication method provided by the present invention, determining the corresponding multi-view facial key point information based on the acquired multi-view current scene images includes:
[0019] Based on the acquired current scene image, a deep learning algorithm is used to determine the face ROI in the current scene image;
[0020] Based on the face ROI, the facial landmark algorithm is used to determine the corresponding multi-view facial landmark information based on the acquired multi-view current scene images.
[0021] According to the virtual video communication method provided by the present invention, the facial key point information includes facial key points and image coordinates of the facial key points.
[0022] The present invention also provides a virtual video communication device, comprising:
[0023] The first determining module is used to determine the corresponding multi-view facial key point information based on the acquired multi-view current scene images;
[0024] The second determining module is used to determine face fusion information based on the world coordinates of face key points from multiple perspectives and the face key point information from the multiple perspectives. The face fusion information is used to drive the 3D animation model in real time. The face fusion information includes: face key point fusion information and face rotation angle fusion information.
[0025] A real-time video streaming module is used to stream the 3D animation model in real time, and the real-time video streaming is used to realize virtual video communication.
[0026] The virtual video communication device provided by the present invention further includes:
[0027] The key point docking module is used to determine the facial key point information of the pre-constructed 3D animation model; import the face fusion information into the 3D animation model, and dock the facial key point information of the 3D animation model with the face fusion information. The key point docking is used to drive the 3D animation model in real time based on the current scene image.
[0028] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the virtual video communication method as described above.
[0029] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the virtual video communication method as described above.
[0030] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the virtual video communication method as described above.
[0031] The virtual video communication method, apparatus, device, storage medium, and program product provided by this invention obtain face fusion information by fusing facial key point information from multiple perspectives, including facial key point fusion information and face rotation angle fusion information, thereby improving the accuracy of analyzing the three-dimensional spatial information of the face. Furthermore, by using face fusion information to drive a three-dimensional animation model in real time, the point information driving the three-dimensional animation model is significantly increased, improving the accuracy of the three-dimensional animation model and the robustness and stability of driving it. Simultaneously, virtual video communication is achieved by real-time video streaming of the three-dimensional animation model, further expanding the application scope, better meeting market demands, and enhancing the user experience.
Attached Figure Description
[0032] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0033] Figure 1 is the first flowchart of the virtual video communication method provided by the present invention;
[0034] Figure 2 is a schematic diagram illustrating the meaning of the parameters for determining the rotation vector in the virtual video communication method provided by this invention;
[0035] Figure 3 is the second flowchart of the virtual video communication method provided by the present invention;
[0036] Figure 4 is the first schematic diagram of the real-time video streaming results of the virtual video communication method provided by the present invention;
[0037] Figure 5 is the second schematic diagram of the real-time video streaming results of the virtual video communication method provided by the present invention;
[0038] Figure 6 is a schematic diagram of the structure of the virtual video communication device provided by the present invention;
[0039] Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0040] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0041] The virtual video communication method of the present invention is described below with reference to Figures 1-5.
[0042] Figure 1 is the first flowchart of the virtual video communication method provided by the present invention. As shown in Figure 1, the method includes:
[0043] Step 110: Based on the acquired multi-view current scene images, determine the corresponding multi-view facial key point information.
[0044] Specifically, the limited facial position information obtained through a monocular camera makes it impossible to accurately analyze and obtain information such as the rotation angle and orientation in three-dimensional space. This leads to instability and inaccuracy issues such as missing points and dropped frames during the driving process, which cannot meet market demands. On the other hand, driving the three-dimensional animation model with pre-recorded video cannot meet the real-time requirements.
[0045] To obtain rich facial landmark information, this invention acquires images of the current scene from multiple perspectives based on the current scene, and further determines the facial key point information under the corresponding perspective based on the current scene images from multiple perspectives. This can satisfy the accurate analysis of information such as rotation angle and orientation in three-dimensional space. Moreover, the rich facial landmark information can avoid missing points and dropped frames during the driving process, and make the facial key point information of the three-dimensional animation model that replaces the real face more accurate.
[0046] The aforementioned multi-view current scene image acquisition devices may include multi-view cameras, webcams, and devices equipped with cameras, such as smartphones, tablets, laptops, desktop computers, all-in-one computers, etc. The number of these acquisition devices is at least two, and adjacent acquisition devices are positioned at an angle, meaning that at least two acquisition devices acquire facial video images of people in the current scene from different perspectives to obtain rich facial position information. Furthermore, the acquisition devices may also include multi-view RGB cameras.
[0047] Optionally, methods for determining facial key point information from multiple perspectives include:
[0048] Based on the acquired current scene image, a deep learning algorithm is used to determine the face ROI in the current scene image;
[0049] Based on the face ROI, and using the face landmark algorithm, the corresponding multi-view face landmark information is determined based on the acquired multi-view current scene images.
[0050] Specifically, to enable virtual video communication using 3D animated models instead of real human face video images, OpenCV acquires face video images from the aforementioned acquisition device and selects face ROIs from the acquired multi-view current scene images (face video images), including the position and size of the face. Methods for selecting face ROIs can include using the face detector algorithm of the TensorFlow deep learning framework to improve robustness and selection efficiency. Furthermore, facial landmark information is determined from the selected face ROIs; this determination can be achieved using the facial landmark extraction algorithm in OpenCV.
[0051] The aforementioned face ROI (Region of Interest) refers to the face region of interest. When processing face video images, the face region to be processed is selected from the face video image to be processed. The shape of the face region includes, but is not limited to: rectangle, circle, ellipse, polygon, and irregular polygon.
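As an illustration of how a rectangular face ROI (position and size) might be handled between the detection and landmark steps, the following is a minimal sketch. The `FaceROI` type and both helper functions are hypothetical names introduced here and are not part of OpenCV or TensorFlow; the sketch assumes a rectangular ROI in pixel units and landmarks reported relative to the cropped ROI.

```python
from typing import NamedTuple, Tuple


class FaceROI(NamedTuple):
    """Rectangular face region: top-left corner plus size, in pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def clamp_roi(roi: FaceROI, image_width: int, image_height: int) -> FaceROI:
    """Clip a detector box so that it lies fully inside the image."""
    x0 = max(0, min(roi.x, image_width))
    y0 = max(0, min(roi.y, image_height))
    x1 = max(0, min(roi.x + roi.width, image_width))
    y1 = max(0, min(roi.y + roi.height, image_height))
    return FaceROI(x0, y0, x1 - x0, y1 - y0)


def roi_to_image_coords(point: Tuple[float, float], roi: FaceROI) -> Tuple[float, float]:
    """Convert a landmark located inside the ROI crop to full-image coordinates."""
    return (point[0] + roi.x, point[1] + roi.y)
```

Clamping first keeps the crop valid when the detector box extends past the frame edge, and the offset conversion returns landmark coordinates to the image coordinate system of the corresponding viewpoint.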
[0052] Optionally, the facial landmark information includes facial landmarks and their image coordinates.
[0053] The aforementioned facial key points refer to the locations of key facial regions in a facial video image, including but not limited to: eyebrows, eyes, nose, mouth, and face. The image coordinates of facial key points are obtained by establishing a two-dimensional coordinate system for the current scene image from each viewpoint after acquisition. In fact, the two-dimensional coordinate system established for the current scene image from each viewpoint is the same coordinate system as the three-dimensional coordinate system established with the acquisition device as the center point.
[0054] Step 120: Based on the world coordinates of facial key points from multiple perspectives and the facial key point information from multiple perspectives, determine the face fusion information. The face fusion information is used to drive the 3D animation model in real time. The face fusion information includes: facial key point fusion information and face rotation angle fusion information.
[0055] The world coordinates of the aforementioned facial key points are coordinates in a three-dimensional coordinate system established on the real human face. Since the human face has undulations, a three-dimensional coordinate system describes the positions of the facial key points more accurately.
[0056] To improve the accuracy of driving 3D animation images with real face video images, this invention uses the world coordinates and information of face key points from multiple perspectives to determine the fused face key point fusion information and face rotation angle fusion information, providing more information about the 3D spatial rotation angle and orientation of the real face, thereby improving the accuracy of the analysis.
[0057] Optionally, the methods for determining face fusion information include:
[0058] Based on the world coordinates of the facial key points from each viewpoint, the rotation vector of the face relative to each viewpoint is determined using the solvePNP algorithm.
[0059] The rotation vectors of the face relative to each viewpoint are fused to determine the face rotation angle fusion information;
[0060] The facial key point information from each viewpoint is fused to determine the fused facial key point information.
[0061] For example, Figure 2 is a schematic diagram illustrating the meaning of the parameters for determining the rotation vector in the virtual video communication method provided by this invention. As shown in Figure 2, taking as an example the current scene image (face video image) acquired by a multi-view RGB camera, with the left corner of the left eye, the right corner of the right eye, and the left corner of the mouth selected as facial key points, the method for determining the rotation vector of the face relative to each viewpoint using the solvePNP algorithm includes:
[0062] First, the proportions of the face under normal conditions are set; that is, in the real face coordinate system, the world coordinates of the left corner of the left eye are (X1,Y1,Z1), those of the right corner of the right eye are (X2,Y2,Z2), and those of the left corner of the mouth are (X3,Y3,Z3). The intrinsic parameters and distortion coefficients of the RGB camera at the corresponding viewpoint are obtained by checkerboard calibration. These three sets of world coordinates, together with the intrinsic parameters and distortion coefficients of the RGB camera at the corresponding viewpoint, are input into the solvePNP model. Through the projection relationship between the face coordinate system and the camera coordinate system in the solvePNP model, the camera coordinates of the three facial key points, that is, their coordinates after mapping into the camera coordinate system, are calculated. The projection relationship between the face coordinate system and the camera coordinate system is shown in Equation (1):
[0063]
[0064] Where, as shown in Figure 2, P represents the optical center of the RGB camera at the corresponding viewpoint. Points A, B, and C represent the world coordinates of the left corner of the left eye (X1, Y1, Z1), the right corner of the right eye (X2, Y2, Z2), and the left corner of the mouth (X3, Y3, Z3), respectively. a′, b′, and c′ represent the magnitudes of BC, AC, and AB, respectively. The magnitude of BC represents the distance from the right corner of the right eye to the left corner of the mouth, the magnitude of AC represents the distance from the left corner of the left eye to the left corner of the mouth, and the magnitude of AB represents the distance from the left corner of the left eye to the right corner of the right eye. x, y, and z represent the magnitudes of PA, PB, and PC, respectively. α, β, and γ represent the angles between PC and PB, PC and PA, and PA and PB, respectively.
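With the symbols defined above, the relationship among these quantities is conventionally written as the law of cosines applied to the triangles PBC, PAC, and PAB. The system below is the standard three-point perspective form and is offered as an assumption about what Equation (1) expresses, not as a verbatim reproduction of it:

$$
\begin{aligned}
y^2 + z^2 - 2yz\cos\alpha &= a'^2 \\
x^2 + z^2 - 2xz\cos\beta &= b'^2 \\
x^2 + y^2 - 2xy\cos\gamma &= c'^2
\end{aligned}
$$

Here a′, b′, and c′ are known from the preset face proportions, and the angles α, β, and γ follow from the image coordinates and the calibrated camera parameters, so the unknowns are the distances x, y, and z from the optical center to the three key points.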
[0065] Using the projection relationship between the face coordinate system and the camera coordinate system in the solvePNP model, the camera coordinates of the three facial key points mapped to the camera coordinate system are calculated as follows: the camera coordinates of the left eye's left corner in the camera coordinate system are (X′1,Y′1,Z′1), the camera coordinates of the right eye's right corner in the camera coordinate system are (X′2,Y′2,Z′2), and the camera coordinates of the left mouth corner in the camera coordinate system are (X′3,Y′3,Z′3).
[0066] Secondly, when the captured human face moves or shakes, based on the above method, the camera coordinates of the three facial key points corresponding to the position of the face after the face moves or shakes are calculated in the camera coordinate system as follows: the camera coordinates of the left eye left corner in the camera coordinate system after the shift are (X″1,Y″1,Z″1), the camera coordinates of the right eye right corner in the camera coordinate system after the shift are (X″2,Y″2,Z″2), and the camera coordinates of the left mouth corner in the camera coordinate system after the shift are (X″3,Y″3,Z″3).
[0067] Next, based on the camera coordinates in the camera coordinate system corresponding to the known facial key points, as shown in Equation (2), the rotation vector and translation vector in the corresponding camera view are calculated.
[0068] $$p'' = R_i\,p' + t \tag{2}$$
[0069] Where p′ represents the set of camera coordinates (X′1, Y′1, Z′1) for the left corner of the left eye, (X′2, Y′2, Z′2) for the right corner of the right eye, and (X′3, Y′3, Z′3) for the left corner of the mouth in the initial camera coordinate system; p″ represents the set of camera coordinates (X″1, Y″1, Z″1) for the left corner of the left eye, (X″2, Y″2, Z″2) for the right corner of the right eye, and (X″3, Y″3, Z″3) for the left corner of the mouth in the camera coordinate system after the shift; R_i represents the camera's rotation vector at viewpoint i; i indexes the different viewpoints; and t represents the camera's translation vector.
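Equation (2) can be read as a rigid motion applied to each key point. The sketch below shows only the forward direction, mapping p′ to p″ for a given rotation vector and translation; it does not solve for them. It assumes the rotation vector uses the axis-angle convention (direction is the axis, length is the angle in radians) and converts it to a matrix with Rodrigues' formula. The function names are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rodrigues(rvec: Vector) -> List[List[float]]:
    """Convert an axis-angle rotation vector to a 3x3 rotation matrix."""
    theta = math.sqrt(sum(comp * comp for comp in rvec))
    if theta < 1e-12:
        # A zero-length rotation vector means no rotation.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (comp / theta for comp in rvec)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def apply_rigid_motion(rvec: Vector, tvec: Vector, point: Vector) -> List[float]:
    """Return R * point + t, i.e. one key point of p'' computed from p'."""
    rot = rodrigues(rvec)
    return [sum(rot[r][k] * point[k] for k in range(3)) + tvec[r] for r in range(3)]
```

For instance, a rotation vector of (0, 0, π/2) turns the point (1, 0, 0) toward the y-axis before the translation is added.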
[0070] The aforementioned camera coordinate system and the coordinate system containing the image coordinates of the facial key points are the same coordinate system, both of which are constructed with the camera's optical center as the center point. The only difference is that the z-axis length of the coordinate system containing the image coordinates of the facial key points is the camera's focal length, that is, the z-axis coordinate in the image coordinates of the facial key points is the camera's focal length.
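The relationship between camera coordinates and image coordinates described above can be sketched with an ideal pinhole model. This is a simplification that ignores lens distortion and pixel scaling, both of which the calibrated intrinsic parameters would normally account for; the function name is illustrative.

```python
from typing import Sequence, Tuple


def project_to_image_plane(point_cam: Sequence[float], focal_length: float) -> Tuple[float, float, float]:
    """Project a camera-frame point onto the image plane located at z = focal_length."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Similar triangles: scale x and y by focal_length / z; the z coordinate becomes the focal length.
    return (focal_length * x / z, focal_length * y / z, focal_length)
```

The returned z coordinate always equals the focal length, which matches the statement that the two coordinate systems share the optical center as origin and differ only along the z-axis.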
[0071] Optionally, the rotation vector of the face relative to each viewpoint is calculated sequentially and then weighted and fused to determine the face rotation angle fusion information. The fusion formula of the rotation vector is shown in equation (3):
[0072]
[0073] Where R represents the face rotation angle fusion information, R_i represents the rotation vector from the corresponding viewpoint, w_i represents the weight of the corresponding viewpoint, N represents the number of acquisition viewpoints (or the number of acquisition devices when one acquisition device is placed at each viewpoint), and i indexes the different viewpoints.
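A minimal sketch of weighted fusion over per-viewpoint rotation vectors follows. It assumes a component-wise weighted average normalized by the sum of the weights; the exact normalization used in Equation (3) is not reproduced here, so this form is an assumption. Averaging rotation vectors component-wise is also only a reasonable approximation when the per-viewpoint rotations are close to one another.

```python
from typing import List, Sequence


def fuse_rotation_vectors(rotation_vectors: Sequence[Sequence[float]],
                          weights: Sequence[float]) -> List[float]:
    """Weighted average of N rotation vectors R_i with weights w_i (assumed normalization)."""
    if not rotation_vectors or len(rotation_vectors) != len(weights):
        raise ValueError("need one weight per rotation vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * r[k] for r, w in zip(rotation_vectors, weights)) / total
        for k in range(3)
    ]
```

With equal weights this reduces to the plain mean over the N viewpoints; larger weights could be given to viewpoints that see the face more frontally.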
[0074] Optionally, the image coordinates of the facial key points in the acquired facial key point information are weighted and fused to determine the facial key point fusion information. The fusion formulas are shown in equations (4) to (6):
[0075]
[0076]
[0077]
[0078] Where Leye represents the fusion information of the left corner of the left eye, Leye_i represents the image coordinates of the left corner of the left eye from a single viewpoint, Reye represents the fusion information of the right corner of the right eye, Reye_i represents the image coordinates of the right corner of the right eye from a single viewpoint, Lmouth represents the fusion information of the left corner of the mouth, and Lmouth_i represents the image coordinates of the left corner of the mouth from a single viewpoint.
[0079] Optionally, Figure 3 is the second flowchart of the virtual video communication method provided by the present invention. As shown in Figure 3, the method also includes:
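The key point fusion can be sketched in the same spirit, with one addition: a viewpoint that failed to detect a key point is skipped rather than treated as zero, which is one way multiple views could compensate for missing points. The normalization by the sum of contributing weights and the use of `None` for a missed detection are assumptions of this sketch, and the coordinates from each viewpoint are assumed to be directly comparable.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]


def fuse_keypoints(per_view: Sequence[Dict[str, Optional[Point]]],
                   weights: Sequence[float]) -> Dict[str, Point]:
    """Weighted average of image coordinates per key point, ignoring missed detections."""
    sums: Dict[str, Tuple[float, float, float]] = {}
    for view, w in zip(per_view, weights):
        for name, pt in view.items():
            if pt is None:
                continue  # this viewpoint did not detect the key point
            sx, sy, sw = sums.get(name, (0.0, 0.0, 0.0))
            sums[name] = (sx + w * pt[0], sy + w * pt[1], sw + w)
    return {name: (sx / sw, sy / sw) for name, (sx, sy, sw) in sums.items() if sw > 0}
```

A usage example with two viewpoints, where the first misses the mouth corner:

```python
views = [
    {"Leye": (100.0, 200.0), "Reye": (300.0, 200.0), "Lmouth": None},
    {"Leye": (110.0, 210.0), "Reye": (310.0, 190.0), "Lmouth": (150.0, 400.0)},
]
fused = fuse_keypoints(views, [0.5, 0.5])
```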
[0080] Determine the facial key point information of the pre-built 3D animation model;
[0081] The face fusion information is imported into the 3D animation model, and the facial key point information of the 3D animation model is matched with the face fusion information. The key point matching is used to drive the 3D animation model in real time based on the current scene image.
[0082] To make the 3D animation model more consistent with the facial proportions of a human face, and to drive the 3D animation model in real time using the current scene image (facial video image), a corresponding 3D animation model can be pre-designed according to one's own preferences, and the facial key point information of the 3D animation model can be determined. The determined facial fusion information is then matched with the facial key point information of the 3D animation model to drive the 3D animation model in real time, further enabling the 3D animation model to replace the real human face for subsequent operations.
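The matching step can be pictured as a lookup from named facial key points to the control points of the animation model. The sketch below is purely illustrative: the control-point identifiers, the name-based matching, and the function itself are hypothetical and do not describe any particular animation engine.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def dock_keypoints(model_keypoints: Dict[str, int],
                   fused: Dict[str, Point]) -> Tuple[Dict[int, Point], List[str]]:
    """Map fused key points onto model control points; report model key points left unmatched.

    model_keypoints maps a key point name to a hypothetical control-point id of the 3D model.
    """
    targets: Dict[int, Point] = {}
    unmatched: List[str] = []
    for name, control_id in model_keypoints.items():
        if name in fused:
            targets[control_id] = fused[name]
        else:
            unmatched.append(name)
    return targets, unmatched
```

Reporting unmatched key points separately lets a caller decide whether to hold the previous frame's value for those control points instead of dropping the frame.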
[0083] The selection of the aforementioned facial key points is not limited to the left corner of the left eye, the right corner of the right eye, and the left corner of the mouth; key points can be selected arbitrarily within the face ROI. Based on the selected facial key points, further face fusion information is determined, including but not limited to the information illustrated in Figure 2, such as head angle, facial features, and facial contours. Furthermore, the invention does not limit the number of facial key points that can be selected.
[0084] Step 130: Stream the 3D animation model in real time. The real-time video stream is used to realize virtual video communication.
[0085] Optionally, real-time video streaming refers to compressing and converting a series of video images using an encoder, and then transmitting the compressed images to the user's receiver in real time via the internet. In this invention, real-time driven 3D animation models can be applied to video communication platforms, such as Tencent Meeting and WeChat video calls, using OBS video streaming software.
[0086] For example, Figure 4 is the first schematic diagram of the real-time video streaming results of the virtual video communication method provided by the present invention. As shown in Figure 4, the real-time driven 3D animation model is applied to the Tencent Meeting platform, where the 3D animation model driven in real time by face video supports normal meeting communication with high stability and robustness. In particular, jointly optimizing the facial key points across the multi-view RGB cameras effectively solves the problems of missing points and dropped frames.
[0087] For example, Figure 5 is the second schematic diagram of the real-time video streaming results of the virtual video communication method provided by the present invention. As shown in Figure 5, a real-time driven 3D animation model is applied to WeChat video calls. The 3D animation model replaces the human face, enabling normal virtual video communication while vividly displaying facial expression details.
[0088] The virtual video communication method provided by this invention obtains face fusion information by fusing facial key point information from multiple perspectives, including facial key point fusion information and face rotation angle fusion information, thereby improving the accuracy of analyzing the three-dimensional spatial information of the face. Furthermore, by using face fusion information to drive a three-dimensional animation model in real time, the method significantly increases the point information driving the three-dimensional animation model, improving the accuracy, robustness, and stability of the three-dimensional animation model. Simultaneously, by performing real-time video streaming of the three-dimensional animation model, virtual video communication is achieved, further expanding the application scope, better meeting market demands, and enhancing the user experience.
[0089] The virtual video communication device provided by the present invention is described below. The virtual video communication device described below and the virtual video communication method described above can be referred to in correspondence.
[0090] The present invention also provides a virtual video communication device. Figure 6 is a schematic diagram of the structure of the virtual video communication device provided by the present invention. As shown in Figure 6, the virtual video communication device 200 includes a first determining module 201, a second determining module 202, and a real-time video streaming module 203, wherein:
[0091] The first determining module 201 is used to determine the corresponding multi-view facial key point information based on the acquired multi-view current scene images;
[0092] The second determining module 202 is used to determine face fusion information based on multi-view face key point information. The face fusion information is used to drive the 3D animation model in real time. The face fusion information includes: face key point fusion information and face rotation angle fusion information.
[0093] The real-time video streaming module 203 is used to stream 3D animation models in real time, and the real-time video streaming is used to realize virtual video communication.
[0094] Optionally, the virtual video communication device further includes: a key point docking module, used to determine the facial key point information of a pre-built 3D animation model; importing face fusion information into the 3D animation model, and docking the facial key point information of the 3D animation model with the face fusion information at key points, wherein the key point docking is used to drive the 3D animation model in real time based on the current scene image.
[0095] The virtual video communication device provided by this invention obtains face fusion information by fusing facial key point information from multiple perspectives, including facial key point fusion information and face rotation angle fusion information, thereby improving the accuracy of analyzing the three-dimensional spatial information of the face. Furthermore, by using face fusion information to drive a three-dimensional animation model in real time, the device significantly increases the point information driving the three-dimensional animation model, improving the accuracy, robustness, and stability of the three-dimensional animation model. Simultaneously, by performing real-time video streaming of the three-dimensional animation model, virtual video communication is achieved, further expanding the application scope, better meeting market demands, and enhancing the user experience.
[0096] Optionally, the first determining module 201 is specifically used for:
[0097] Based on the acquired current scene image, a deep learning algorithm is used to determine the face ROI in the current scene image;
[0098] Based on the face ROI, and using the face landmark algorithm, the corresponding multi-view face landmark information is determined based on the acquired multi-view current scene images.
[0099] Optionally, the first determining module 201 is specifically used for:
[0100] Facial landmark information includes facial landmarks and their image coordinates.
[0101] Optionally, the second determining module 202 is specifically used for:
[0102] Based on the world coordinates of the facial key points from each viewpoint, the rotation vector of the face relative to each viewpoint is determined using the solvePNP algorithm;
[0103] The rotation vectors of the face relative to each viewpoint are fused to determine the face rotation angle fusion information;
[0104] The facial key point information from each viewpoint is fused to determine the fused facial key point information.
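The fusion steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the method of this invention. It assumes that the per-view rotation vectors have already been obtained from a PnP solver (for example OpenCV's `solvePnP`) and expressed in a common reference frame, that the key points of all views are expressed in comparable coordinates, and that fusion is a normalized weighted average, in line with the weighted fusion recited in the claims. The per-view weights and the function names are hypothetical. A component-wise mean of rotation vectors is only a reasonable approximation when the vectors are close to one another.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]   # rotation vector (axis-angle form)
Point = Tuple[float, float]         # (u, v) key point coordinates


def _normalized(weights: Dict[str, float], views: List[str]) -> Dict[str, float]:
    """Scale the weights of the given views so that they sum to 1."""
    total = sum(weights[v] for v in views)
    if total <= 0:
        raise ValueError("view weights must sum to a positive value")
    return {v: weights[v] / total for v in views}


def fuse_rotation_vectors(rvecs: Dict[str, Vec3], weights: Dict[str, float]) -> Vec3:
    """Component-wise weighted mean of the per-view rotation vectors."""
    w = _normalized(weights, list(rvecs))
    x = sum(w[v] * rvecs[v][0] for v in rvecs)
    y = sum(w[v] * rvecs[v][1] for v in rvecs)
    z = sum(w[v] * rvecs[v][2] for v in rvecs)
    return (x, y, z)


def fuse_keypoints(points: Dict[str, List[Point]], weights: Dict[str, float]) -> List[Point]:
    """Weighted mean of each key point across views (same ordering in every view)."""
    views = list(points)
    count = len(points[views[0]])
    if any(len(points[v]) != count for v in views):
        raise ValueError("every view must provide the same number of key points")
    w = _normalized(weights, views)
    return [
        (
            sum(w[v] * points[v][k][0] for v in views),
            sum(w[v] * points[v][k][1] for v in views),
        )
        for k in range(count)
    ]
```

One plausible weighting scheme, which the text does not specify, is to give a larger weight to the view that sees the face most frontally, so that a missed or noisy detection in one view has less influence on the fused result.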
[0105] Optionally, the key point docking module is specifically used for:
[0106] Determine the facial key point information of the pre-built 3D animation model;
[0107] The face fusion information is imported into the 3D animation model, and the facial key point information of the 3D animation model is matched with the face fusion information. The key point matching is used to drive the 3D animation model in real time based on the current scene image.
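The key point matching described above can be pictured with the following Python sketch, which is only an illustration under assumptions. The `AnimationModel` class, the `binding` table that maps named model key points to indices in the fused key point list, and the function `dock_and_drive` are all hypothetical; the text does not describe how the 3D animation model stores its key points.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Point = Tuple[float, float]


@dataclass
class AnimationModel:
    """Toy stand-in for a pre-built 3D animation model."""
    keypoints: Dict[str, Point]              # named model key point -> position
    head_rotation: Vec3 = (0.0, 0.0, 0.0)    # current head rotation vector


def dock_and_drive(
    model: AnimationModel,
    binding: Dict[str, int],
    fused_points: List[Point],
    fused_rotation: Vec3,
) -> AnimationModel:
    """Copy each fused key point onto its matched model key point and
    apply the fused rotation to the model's head."""
    for name, index in binding.items():
        if name not in model.keypoints:
            raise KeyError(f"model has no key point named {name!r}")
        model.keypoints[name] = fused_points[index]
    model.head_rotation = fused_rotation
    return model
```

Because the binding table is fixed once the model is built, the per-frame work reduces to copying positions and one rotation, which suits real-time driving.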
[0108] Figure 7 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 7, the electronic device 300 may include: a processor 310, a communications interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communications interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 can call logical instructions in the memory 330 to execute a virtual video communication method, which includes:
[0109] Based on the acquired multi-view current scene images, determine the corresponding multi-view facial key point information;
[0110] Based on the world coordinates of the facial key points from multiple perspectives and the multi-view facial key point information, face fusion information is determined. The face fusion information is used to drive the 3D animation model in real time and includes facial key point fusion information and face rotation angle fusion information;
[0111] The 3D animation model is streamed in real time, and the real-time video stream is used to realize virtual video communication.
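The three steps of the method can be read as a per-frame pipeline. The following Python sketch shows only that control flow and is not the implementation of this invention; the four callables (`detect_keypoints`, `fuse`, `drive_model`, `stream`) are hypothetical placeholders for the key point detection, fusion, model driving, and video streaming stages.

```python
from typing import Any, Callable, Dict


def process_frame(
    images: Dict[str, Any],
    detect_keypoints: Callable[[Any], Any],
    fuse: Callable[[Dict[str, Any]], Any],
    drive_model: Callable[[Any], Any],
    stream: Callable[[Any], None],
) -> Any:
    """Run one iteration of the method on the current multi-view images."""
    # Step 1: per-view facial key point information.
    keypoints = {view: detect_keypoints(img) for view, img in images.items()}
    # Step 2: face fusion information (fused key points and rotation).
    fusion = fuse(keypoints)
    # Step 3: drive the 3D animation model and push the rendered frame
    # to the real-time video stream.
    frame = drive_model(fusion)
    stream(frame)
    return frame
```

Calling `process_frame` once per captured set of multi-view images yields the real-time video stream used for virtual video communication.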
[0112] Furthermore, the logical instructions in the aforementioned memory 330 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0113] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the virtual video communication method provided above, the method including:
[0114] Based on the acquired multi-view current scene images, determine the corresponding multi-view facial key point information;
[0115] Based on the world coordinates of the facial key points from multiple perspectives and the multi-view facial key point information, face fusion information is determined. The face fusion information is used to drive the 3D animation model in real time and includes facial key point fusion information and face rotation angle fusion information;
[0116] The 3D animation model is streamed in real time, and the real-time video stream is used to realize virtual video communication.
[0117] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the virtual video communication method provided above, the method comprising:
[0118] Based on the acquired multi-view current scene images, determine the corresponding multi-view facial key point information;
[0119] Based on the world coordinates of the facial key points from multiple perspectives and the multi-view facial key point information, face fusion information is determined. The face fusion information is used to drive the 3D animation model in real time and includes facial key point fusion information and face rotation angle fusion information;
[0120] The 3D animation model is streamed in real time, and the real-time video stream is used to realize virtual video communication.
[0121] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0122] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods of various embodiments or some parts of embodiments.
[0123] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A virtual video communication method, characterized in that it includes: determining corresponding multi-view facial key point information based on acquired multi-view current scene images; determining face fusion information based on the world coordinates of the facial key points from multiple perspectives and the facial key point information from the multiple perspectives, wherein the face fusion information is used to drive a 3D animation model in real time and includes facial key point fusion information and face rotation angle fusion information; and streaming the 3D animation model as real-time video, wherein the real-time video stream is used to realize virtual video communication; wherein determining the face fusion information based on the world coordinates of the facial key points from multiple perspectives and the facial key point information from the multiple perspectives includes: determining the rotation vector of the face relative to each viewpoint, based on the world coordinates of the facial key points from each viewpoint, using the solvePNP algorithm; weighting and fusing the rotation vectors of the face relative to each viewpoint to determine the face rotation angle fusion information; and weighting and fusing the facial key point information from each viewpoint to determine the facial key point fusion information.
2. The virtual video communication method according to claim 1, characterized in that the step of determining face fusion information based on the world coordinates of the facial key points from multiple perspectives and the facial key point information from the multiple perspectives, the face fusion information being used to drive the 3D animation model in real time and including facial key point fusion information and face rotation angle fusion information, further includes: determining the facial key point information of a pre-built 3D animation model; and importing the face fusion information into the 3D animation model and matching the facial key point information of the 3D animation model with the face fusion information, wherein the key point matching is used to drive the 3D animation model in real time based on the current scene image.
3. The virtual video communication method according to claim 1, characterized in that determining the corresponding multi-view facial key point information based on the acquired multi-view current scene images includes: using a deep learning algorithm to determine the face ROI in each acquired current scene image; and, based on the face ROI, using a facial landmark algorithm to determine the corresponding multi-view facial landmark information for the acquired multi-view current scene images.
4. The virtual video communication method according to any one of claims 1 to 3, characterized in that the facial key point information includes facial key points and their image coordinates.
5. A virtual video communication device, characterized in that it includes: a first determining module, used to determine corresponding multi-view facial key point information based on acquired multi-view current scene images; a second determining module, used to determine face fusion information based on the world coordinates of the facial key points from multiple perspectives and the facial key point information from the multiple perspectives, wherein the face fusion information is used to drive a 3D animation model in real time and includes facial key point fusion information and face rotation angle fusion information; and a real-time video streaming module, used to stream the 3D animation model as real-time video, wherein the real-time video stream is used to realize virtual video communication; wherein the second determining module is specifically used for: determining the rotation vector of the face relative to each viewpoint, based on the world coordinates of the facial key points from each viewpoint, using the solvePNP algorithm; weighting and fusing the rotation vectors of the face relative to each viewpoint to determine the face rotation angle fusion information; and weighting and fusing the facial key point information from each viewpoint to determine the facial key point fusion information.
6. The virtual video communication device according to claim 5, characterized in that it further includes a key point docking module, used to: determine the facial key point information of a pre-built 3D animation model; and import the face fusion information into the 3D animation model and match the facial key point information of the 3D animation model with the face fusion information, wherein the key point matching is used to drive the 3D animation model in real time based on the current scene image.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the virtual video communication method as described in any one of claims 1 to 4.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the virtual video communication method as described in any one of claims 1 to 4.
9. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the virtual video communication method as described in any one of claims 1 to 4.