A sound and picture synchronous playing method, device and equipment
By dividing the area into fan-shaped regions based on the audience's facial orientation and generating virtual sound images, the problem of audio-visual misalignment when the screen and speakers are bound together is solved, achieving an immersive audio-visual experience where audio and visuals are in sync.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CLOUD TECH CO LTD
- Filing Date
- 2024-12-04
- Publication Date
- 2026-06-19
AI Technical Summary
When existing screens are paired with a single speaker near the screen, the sound-image offset angle is large, causing viewers to perceive a separation between sound and image, resulting in a poor experience, especially on large screens.
By detecting and dividing the audience's field of vision into fan-shaped areas that are dense in the middle and sparse at the edges based on the audience's facial orientation information, a mapping relationship between the screen and these areas is established, and virtual sound images are dynamically generated to match the position of the sound source image. The virtual sound image technology is then used to play sound at the target location.
It achieves sufficient audio-visual immersion for viewers with limited hardware and algorithm overhead, thus enhancing the immersive experience of audio-visual content.
Figure CN119815267B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of audio and video signal processing, and in particular to a method, apparatus and device for simultaneous audio and video playback.
Background Technology
[0002] Immersive audio-visual technology is already well-established in the film and television industry. For example, in cinemas, virtual sound positioning technology combined with 3D video can give viewers the feeling of a train rushing past or a bomber flying overhead. However, in fields such as video conferencing, remote teaching, and video presentations, current products are far from providing users with a truly immersive experience.
[0003] Some existing solutions display the image of a sound source on a fixed screen that is bound to a single speaker near the screen. When the sound source displayed on the screen is confirmed to be emitting sound, the speaker bound to that screen plays the sound, giving the user in front of the screen the impression that sound and image are in sync. However, this approach produces a large angular deviation between the sound and the image, especially on large screens, so the audience still perceives the sound and image as separate.
Summary of the Invention
[0004] In view of this, it is necessary to provide a method, apparatus and device for playing sound and image in sync. The aim is to solve the problem of the existing approach, in which the screen is bound to a single speaker near the screen and that speaker plays the sound when the sound source displayed on the screen is confirmed to be emitting sound: the sound-image deviation angle is large, so the audience perceives the sound and image as separate.
[0005] To address the above problems, this invention provides a method for simultaneous audio and video playback, comprising:
[0006] Based on the audience's facial orientation information, an angle region mapping relationship is obtained; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area;
[0007] Obtain the first position information of the first sound source image on the screen;
[0008] Obtain the screen region in which the first sound source image is located, based on the first position information;
[0009] Based on the angle region mapping relationship and the screen region, obtain the angle region corresponding to the screen region, and determine that angle region as the first sound image region;
[0010] Obtain the first target location based on the first sound image region;
[0011] A virtual sound image of the first sound source image is generated at the first target location, and sound is played through the virtual sound image.
[0012] Optionally, obtaining the angle region mapping relationship based on the audience's facial orientation information specifically includes:
[0013] Real-time detection of audience facial orientation to obtain audience facial orientation information;
[0014] The vertical distance data between the viewer and the screen is obtained based on the facial orientation information;
[0015] Based on the vertical distance data, the viewer's field of vision is symmetrically divided into multiple angular regions whose angles increase from small to large;
[0016] Establish the angular region mapping relationship between the screen and multiple angular regions.
[0017] Optionally, the step of symmetrically dividing the viewer's field of vision into multiple angular regions according to the vertical distance data, from smallest to largest angle, specifically includes:
[0018] The viewer's field of vision is divided into M horizontal sector regions and N vertical sector regions in a centrally symmetrical manner; wherein M and N are both odd numbers.
[0019] The included angle of the fan-shaped area near the center is no greater than 5 degrees, and the included angle of other fan-shaped areas increases progressively from the center to the edge; the audience's field of vision area is a centrally symmetrical fan-shaped area with an included angle of 60 degrees.
[0020] Optionally, obtaining the first target location based on the first sound image region specifically includes:
[0021] When the first sound source image is within the viewer's field of vision, the intersection of the angle bisectors of the first sound image region on the screen is obtained, and the intersection is determined as the first target location.
[0022] Optionally, obtaining the first target location based on the first sound image region specifically includes:
[0023] When the first sound source image is outside the viewer's field of vision, the intersection of the boundary line of the outermost sector of the viewer's field of vision and the screen is obtained, and the intersection is determined as the first target location.
[0024] Optionally, obtaining the first target location based on the first sound image region specifically includes:
[0025] When the first sound source image is in the outermost sector of the viewer's field of vision and the angle bisector of the first sound image region does not intersect the screen, the midpoint of the boundary of the outermost sector on the screen is determined as the first target location.
[0026] Optionally, after generating a virtual sound image of the first sound source image at the first target location and playing sound through the virtual sound image, the method further includes:
[0027] Real-time detection of the position information of the first sound source image;
[0028] When the first sound source image changes from the first location information to the second location information belonging to the first sound image area, sound is played through the virtual sound image at the first target location.
[0029] Optionally, after generating a virtual sound image of the first sound source image at the first target location and playing sound through the virtual sound image, the method further includes:
[0030] Real-time detection of the position information of the first sound source image;
[0031] When the first sound source image changes from the first location information to third location information belonging to a second sound image region, the second target location is obtained according to the second sound image region;
[0032] The virtual sound image is moved from the first target location to the second target location, and sound is played through the virtual sound image at the second target location.
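The two optional update rules above (keep the virtual sound image in place while the source stays inside the same sound image region, and move it only when the source enters another region) can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions, not the patent's implementation: the class name `SoundImageTracker`, the `edges` list of region boundaries on the screen, and the per-region `targets` are all hypothetical, and positions beyond the outer edges are simply clamped to the nearest region.

```python
import bisect


class SoundImageTracker:
    """Hypothetical helper: re-positions the virtual sound image only on region change."""

    def __init__(self, edges, targets):
        # edges: sorted screen x-coordinates bounding the mapped regions (regions + 1 values)
        # targets: one pre-computed target x-coordinate per region
        if len(edges) != len(targets) + 1:
            raise ValueError("need exactly one target per region")
        self.edges = edges
        self.targets = targets
        self.region = None  # no region selected yet

    def update(self, x):
        """Return (target_x, moved) for a newly detected source position x."""
        idx = bisect.bisect_right(self.edges, x) - 1
        idx = max(0, min(idx, len(self.targets) - 1))  # clamp positions beyond the edges
        moved = idx != self.region
        self.region = idx
        return self.targets[idx], moved
```

Because the target changes only when `moved` is true, the renderer's gains need to be recomputed only on a region change, which is consistent with the overhead saving the text attributes to processing a limited number of angular areas.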
[0033] The present invention also provides a device for playing sound and image in sync, comprising: a mapping relationship construction module, a position information acquisition module, a mapping area acquisition module, a target location information acquisition module, and a virtual sound image generation module;
[0034] The mapping relationship construction module is used to obtain the angle region mapping relationship based on the audience's facial orientation information; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area; the position information acquisition module is used to acquire the first position information of the first sound source image on the screen;
[0035] The mapping area acquisition module is used to obtain the screen region in which the first sound source image is located based on the first position information, and then, based on the angle region mapping relationship and the screen region, to obtain the angle region corresponding to the screen region and determine that angle region as the first sound image region;
[0036] The target location information acquisition module is used to obtain the first target location based on the first sound image region;
[0037] The virtual sound image generation module is used to generate a virtual sound image of the first sound source image at the first target location, and to play sound through the virtual sound image.
[0038] Another technical solution of the present invention to solve the above-mentioned technical problems is an electronic device, including a memory and a processor, wherein,
[0039] The memory is used to store programs;
[0040] The processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps in a simultaneous audio-visual playback method of any of the above schemes.
[0041] With the technical solution provided in the embodiments of this application, the viewer's facial orientation and vertical distance from the screen are monitored, and the central area of the viewer's field of vision is divided non-uniformly, along the horizontal and vertical directions, into several fan-shaped areas that are dense in the middle and sparse at the edges. Each fan-shaped area is mapped onto the screen, the system dynamically determines which two fan-shaped areas' mappings intersect at the location of the sound source image on the screen, and a virtual sound image is then generated at the target location. Processing only a limited number of angular areas saves hardware and algorithm overhead, while generating the virtual sound image at the target location gives the viewer a sufficient sense of audio-visual synchronicity and an immersive audio-visual experience.
Attached Figure Description
[0042] Figure 1 is a flowchart of a method for simultaneous audio and video playback provided in an embodiment of the present invention;
[0043] Figure 2 is a flowchart of an embodiment of step S103 in Figure 1 of the present invention;
[0044] Figure 3 is an example diagram of the division of the audience's forward viewing angle when M=5 and N=1, provided in the embodiments of the present invention;
[0045] Figure 4 is a schematic diagram of the virtual sound image generation strategy provided in this embodiment of the invention when the sound source image is outside the mapping area or the angle bisector is outside the screen;
[0046] Figure 5 is a flowchart of a method for simultaneous audio and video playback provided in another embodiment of the present invention;
[0047] Figure 6 is a flowchart of a method for simultaneous audio and video playback provided in another embodiment of the present invention;
[0048] Figure 7 is a flowchart of a method for audio-visual synchronization provided in an embodiment of the present invention;
[0049] Figure 8 is a schematic diagram of the mapping of the included angle on the screen and the determination of the virtual sound image position provided in the embodiments of the present invention;
[0050] Figure 9 is a structural block diagram of a sound and image simultaneous playback device provided in an embodiment of the present invention;
[0051] Figure 10 is a hardware structure block diagram of an electronic device provided in various embodiments of the present invention.
Detailed Implementation
[0052] The principles and features of the present invention are described below with reference to the accompanying drawings. The embodiments described are only for explaining the present invention and are not intended to limit the scope of the present invention.
[0053] As shown in Figure 1, a specific embodiment of the present invention discloses a method for simultaneous audio and video playback, comprising:
[0054] S101. Based on the audience's facial orientation information, obtain the angle region mapping relationship; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area. It should be noted that, in an optional embodiment, the audience's facial orientation is monitored in real time, and the central area of the audience's forward field of vision is divided non-uniformly, from the center outwards and from small angles to large, into M horizontal and N vertical fan-shaped areas that are dense in the middle and sparse at the edges, where 10≥M≥3≥N≥1 and M and N are odd numbers. The angle region mapping relationship is established by mapping each fan-shaped area to its corresponding rectangular area on the screen. For example, the central area of the audience's forward field of vision contains an i-th horizontal fan-shaped area and an i-th vertical fan-shaped area (i≥1); mapping such a fan-shaped area onto the screen yields a long horizontal or vertical i-th rectangular plane area on the screen, and this rectangular-plane mapping is established for every fan-shaped area in the central area of the audience's forward field of vision.
[0055] S102. Obtain the first position information of the first sound source image on the screen;
[0056] S103. Obtain the screen region in which the first sound source image is located, based on the first position information;
[0057] S104. Based on the angle region mapping relationship and the screen region, obtain the angle region corresponding to the screen region, and determine that angle region as the first sound image region. It should be noted that, in an optional embodiment, image A is at screen position a, where point a lies in the i-th rectangular plane region of the screen. The i-th sector region (which may include horizontal and vertical sector regions) of the viewer's central field of vision is mapped to the i-th rectangular plane region. That is, point a represents the first position information of image A, the i-th rectangular plane region is the screen region, and the i-th sector region is the first sound image region.
[0058] S105. Obtain the first target location based on the first sound image region. It should be noted that, in an optional embodiment, the point where the angle bisector of the i-th sector region intersects the screen is taken as the first target location.
[0059] S106. Generate a virtual sound image of the first sound source image at the first target location, and play the sound through the virtual sound image.
[0060] It should be noted that, in a specific embodiment, the virtual sound image and virtual sound image localization can be understood as follows: after the sound emitted by multiple speakers propagates to both ears, the listener perceives sound sources that are not actually located at the speaker positions. These sources do not physically exist in the listener's space; they are "imagined" by the listener's auditory system, hence the term virtual sound image or phantom sound source. By adjusting the output signals of multiple speakers using amplitude panning, time delay adjustment, or hybrid methods, the listener can be made to perceive the virtual sound image at the target location. Corresponding technologies include VBAP (vector-based amplitude panning) and DBAP (distance-based amplitude panning), which can also be called virtual sound image generation. A method for generating a virtual sound image can include: calculating the volume of each speaker according to the angles between each speaker, the target location of the virtual sound image, and the target listening point; converting the volumes into gains; and then applying the gains to the audio signal to be played at the target location.
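As an illustration of the gain calculation described above, the following is a minimal sketch of standard two-speaker (2-D) VBAP. It is not taken from the patent: the function name `vbap_2d_gains` and its parameters are hypothetical, and all angles are assumed to be measured in degrees in the horizontal plane as seen from the listening point. The source direction is written as a weighted sum of the two speaker direction vectors, and the weights are normalized to constant power.

```python
import math


def vbap_2d_gains(source_deg, speaker1_deg, speaker2_deg):
    """Constant-power gains (g1, g2) placing a virtual sound image at source_deg."""
    px, py = math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg))
    ax, ay = math.cos(math.radians(speaker1_deg)), math.sin(math.radians(speaker1_deg))
    bx, by = math.cos(math.radians(speaker2_deg)), math.sin(math.radians(speaker2_deg))

    # Solve p = g1 * a + g2 * b for (g1, g2) with Cramer's rule.
    det = ax * by - ay * bx
    if abs(det) < 1e-12:
        raise ValueError("speaker directions must not be collinear")
    g1 = (px * by - py * bx) / det
    g2 = (ax * py - ay * px) / det

    # A negative gain means the source lies outside the arc spanned by the pair.
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

With speakers at +30° and -30°, a source at 0° yields two equal gains, and a source at +30° yields gains of 1 and 0; each output channel is then the audio signal multiplied by its gain.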
[0061] In a specific embodiment, audio-visual synchronization can be achieved as follows: when video is played on the screen, virtual sound image localization technology is used to place the virtual sound image at the on-screen position of the sound source image, so that the audience perceives the sound as coming from that position on the screen.
[0062] The solution of this invention is mainly applied to one-on-one video scenarios, that is, the case where there is only one viewer. The viewer's facial orientation information is obtained in real time and processed.
[0063] When this embodiment is applied to video conferencing, the facial orientation of the participant is monitored in real time. The central area of the participant's field of vision is divided non-uniformly, from the center outwards and from small angles to large, into M horizontal and N vertical fan-shaped regions that are dense in the center and sparse at the edges. A mapping relationship is established between each fan-shaped region and the corresponding rectangular region on the screen; this is the angle region mapping relationship. When the participant's facial orientation changes, the angle region mapping relationship changes accordingly. Taking a 60° fan-shaped region at the center of the participant's field of vision as an example: if the participant's face is turned to the left, the 60° fan-shaped region is established to the front left; if the face is turned to the right, it is established to the front right. Each fan-shaped region within the 60° region so determined is then mapped to its corresponding rectangular region on the screen, giving the corresponding angle region mapping relationship.
[0064] The system dynamically identifies the two sectors (one horizontal, one vertical) whose mappings intersect at the position of the sound source image on the screen; then, using virtual sound image localization technology, it generates a virtual sound image whose target location is the intersection, on the screen, of the angle bisectors of these two sectors.
[0065] When this embodiment is applied to the playback of speech videos, the facial orientation of the viewer is monitored in real time. The central area of the viewer's field of vision is divided non-uniformly, from the center outwards and from small angles to large, into M horizontal and N vertical fan-shaped regions that are dense in the center and sparse at the edges. A mapping relationship is established between each fan-shaped region and the corresponding rectangular region on the screen; this is the angle region mapping relationship. When the viewer's facial orientation changes, the angle region mapping relationship changes accordingly. Taking a 60° fan-shaped region at the center of the viewer's field of vision as an example: if the viewer's face is turned to the left, the 60° fan-shaped region is established to the front left; if the face is turned to the right, it is established to the front right. Each fan-shaped region within the 60° region so determined is then mapped to its corresponding rectangular region on the screen, giving the corresponding angle region mapping relationship.
[0066] The system dynamically identifies the two sectors whose mappings intersect at the position of the speaker's image on the screen; then, using virtual sound image localization technology, it generates a virtual sound image whose target location is the intersection, on the screen, of the angle bisectors of these two sectors.
[0067] When this embodiment is applied to live streaming, the facial orientation of the viewer is monitored in real time. The central area of the viewer's field of vision is divided non-uniformly, from the center outwards and from small angles to large, into M horizontal and N vertical fan-shaped regions that are dense in the center and sparse at the edges. A mapping relationship is established between each fan-shaped region and the corresponding rectangular region on the screen; this is the angle region mapping relationship. When the viewer's facial orientation changes, the angle region mapping relationship changes accordingly. Taking a 60° fan-shaped region at the center of the viewer's field of vision as an example: if the viewer's face is turned to the left, the 60° fan-shaped region is established to the front left; if the face is turned to the right, it is established to the front right. Each fan-shaped region within the 60° region so determined is then mapped to its corresponding rectangular region on the screen, giving the corresponding angle region mapping relationship.
[0068] The system dynamically identifies the two sectors whose mappings intersect at the position of the live streamer's image on the screen; then, using virtual sound image localization technology, it generates a virtual sound image whose target location is the intersection, on the screen, of the angle bisectors of these two sectors.
[0069] When this embodiment is applied to remote teaching, the facial orientation of the student is monitored in real time. The central area of the student's field of vision is divided non-uniformly, from the center outwards and from small angles to large, into M horizontal and N vertical fan-shaped regions that are dense in the center and sparse at the edges. A mapping relationship is established between each fan-shaped region and the corresponding rectangular region on the screen; this is the angle region mapping relationship. When the student's facial orientation changes, the angle region mapping relationship changes accordingly. Taking a 60° fan-shaped region at the center of the student's field of vision as an example: if the student's face is turned to the left, the 60° fan-shaped region is established to the front left; if the face is turned to the right, it is established to the front right. Each fan-shaped region within the 60° region so determined is then mapped to its corresponding rectangular region on the screen, giving the corresponding angle region mapping relationship.
[0070] The system dynamically identifies the two sectors whose mappings intersect at the position of the teacher's image on the screen; then, using virtual sound image localization technology, it generates a virtual sound image whose target location is the intersection, on the screen, of the angle bisectors of these two sectors. In summary, by monitoring the viewer's facial orientation and vertical distance from the screen, the central area of the viewer's field of vision is divided non-uniformly, along both the horizontal and vertical directions, into several sector regions that are dense in the center and sparse at the edges, and each sector is mapped onto the screen. The virtual sound image is generated at the bisector intersection of the two sectors in whose mapped intersection the sound source image lies, which gives the viewer sufficient audio-visual immersion with limited hardware and algorithm overhead. The intersection of two sector mappings is the area where the mappings of a horizontal sector and a vertical sector overlap on the screen. Since the mapping of a sector on the screen is a long horizontal or vertical rectangular plane, the intersections of the sector mappings form a grid of rectangles.
[0071] In an optional embodiment of the invention, as shown in Figure 2, obtaining the angle region mapping relationship based on the audience's facial orientation information specifically includes:
[0072] S201. Real-time detection of audience facial orientation to obtain audience facial orientation information;
[0073] S202. Obtain the vertical distance data between the viewer and the screen based on the facial orientation information;
[0074] S203. Based on the vertical distance data, symmetrically divide the viewer's field of vision into multiple angular regions whose angles increase from small to large;
[0075] S204. Establish the angular region mapping relationship between the screen and multiple angular regions.
[0076] By detecting the orientation of the viewer's face in real time, the mapping relationship between the viewer's field of vision and the screen area can be determined, which can maintain a good sense of audio-visual synchronization even after the viewer adjusts the viewing angle, thus improving the audio-visual experience.
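Steps S201 to S204 can be illustrated with a small geometric sketch for the horizontal direction. The geometry is an assumption made for illustration and is not spelled out in the text: the screen is treated as a plane, `distance` is the viewer's vertical (perpendicular) distance to it, screen coordinates are measured from the foot of that perpendicular, and `yaw_deg` is the facial orientation relative to the perpendicular. A boundary ray at angle θ then meets the screen at `distance * tan(θ)`. The function names are hypothetical.

```python
import math


def sector_boundaries(angles_deg):
    """Boundary angles (degrees) of sectors listed left to right, centred on 0."""
    bounds = [-sum(angles_deg) / 2.0]
    for angle in angles_deg:
        bounds.append(bounds[-1] + angle)
    return bounds


def map_sectors_to_screen(angles_deg, distance, yaw_deg=0.0):
    """Return one (x_left, x_right) screen interval per sector."""
    xs = []
    for bound in sector_boundaries(angles_deg):
        absolute = yaw_deg + bound  # boundary direction relative to the screen normal
        if abs(absolute) >= 90.0:
            # The boundary ray never reaches the screen plane.
            xs.append(math.copysign(math.inf, absolute))
        else:
            xs.append(distance * math.tan(math.radians(absolute)))
    return list(zip(xs[:-1], xs[1:]))
```

Calling `map_sectors_to_screen` again whenever the detected `yaw_deg` or `distance` changes rebuilds the mapping, which corresponds to the mapping relationship following the viewer's facial orientation.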
[0077] In an optional embodiment of the present invention, the detection frequency is determined based on the viewer's reaction time to the image combined with the audio delay time;
[0078] The facial orientation of the audience is detected in real time according to the detection frequency to obtain the audience's facial orientation information.
[0079] This embodiment claims detecting changes in the viewer's facial orientation at a detection frequency greater than 20 Hz. Ordinary viewers can perceive sound that lags the picture by 80 ms, and the latency of spatial audio renderers in the headphone field is often less than 60 ms. Using a detection frequency greater than 20 Hz therefore ensures that the viewer does not perceive a separation of sound and picture, while saving computing power.
[0080] In one embodiment, establishing the angular region mapping relationship between the screen and multiple angular regions can specifically include: establishing a mapping relationship between each sector and the corresponding rectangular region on the screen, which constitutes the angular region mapping relationship. For example, if the central area in front of the viewer has an i-th horizontal sector and an i-th vertical sector (i ≥ 1), mapping such a sector onto the screen yields a long horizontal or vertical i-th rectangular plane region; this rectangular-plane mapping is established between every sector in the central area in front of the viewer and the screen.
[0081] In an optional embodiment of the present invention, the step of symmetrically dividing the viewer's field of vision into multiple angular regions according to the vertical distance data, in ascending order of angle, specifically includes:
[0082] The viewer's field of vision is divided into M horizontal fan-shaped areas and N vertical fan-shaped areas in a centrally symmetrical manner, wherein M and N are both odd numbers, so that the sound image position differs between the middle and the two sides. In one embodiment, as shown in Figure 3 for M=5 and N=1, the included angles of the sectors are 25°, 15°, 5°, 15° and 25° respectively.
[0083] The included angle of the fan-shaped area near the center is no greater than 5 degrees, and the included angle of other fan-shaped areas increases progressively from the center to the edge; the audience's field of vision area is a centrally symmetrical fan-shaped area with an included angle of 60 degrees.
[0084] It should be noted that the human auditory system has the highest accuracy in locating sound images in the horizontal direction in front, and the accuracy decreases rapidly from the center to the sides. Relevant literature shows that the average accuracy of subjects' perception of virtual sound images in the horizontal direction directly in front is about 4°, and the localization becomes more blurred towards the sides. The human auditory system is not sensitive to sound image movement in terms of height; sometimes, sound image movement requires a 60° angular displacement at the subject's location to be perceived.
[0085] In one embodiment, considering both algorithm complexity and audience perception accuracy, when the sector regions are numbered from left to right as 1 to M, the included angle of the most central sector is ≤5°, that is:
[0086] When M is odd, the included angle of the ((M+1)/2)-th sector is ≤5°;
[0087] When M is even, the included angles of the (M/2)-th and (M/2+1)-th sectors are ≤5°;
[0088] The included angles of the other sectors increase progressively from the center to both sides in a stepwise manner.
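The division rules above can be checked mechanically. The sketch below is a hypothetical validator, not part of the patent: the function name and the default limits are assumptions (the 5° centre limit is taken from the rule above and the 90° total from the division principles in this description), and "increase progressively" is interpreted here as strictly increasing from the centre outwards.

```python
def check_sector_angles(angles_deg, max_center=5.0, max_total=90.0):
    """Return a list of rule violations; an empty list means the division passes."""
    count = len(angles_deg)
    if count == 0:
        return ["no sectors given"]
    problems = []

    # Central sector(s): one for an odd count, two for an even count.
    if count % 2 == 1:
        left_c = right_c = count // 2
    else:
        left_c, right_c = count // 2 - 1, count // 2
    if angles_deg[left_c] > max_center or angles_deg[right_c] > max_center:
        problems.append("central sector exceeds the centre limit")

    if list(angles_deg) != list(reversed(angles_deg)):
        problems.append("division is not symmetric")

    # Check growth from the centre to the right edge; symmetry covers the left side.
    right_side = angles_deg[right_c:]
    if any(b <= a for a, b in zip(right_side, right_side[1:])):
        problems.append("angles do not increase from centre to edge")

    if sum(angles_deg) > max_total:
        problems.append("total angle exceeds the limit")
    return problems
```

For the Figure 3 example, `check_sector_angles([25, 15, 5, 15, 25])` returns an empty list, while a uniform division such as `[20, 20, 20]` is rejected.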
[0089] Because the angle between the speakers near the screen and the audience, i.e., the listening angle, is smaller than the viewing angle, this patent only divides a fan-shaped area of about 60° in front of the audience's center of view, ignoring the edges of the viewing angle. Therefore, the angles of each fan-shaped area in the horizontal or vertical direction are limited to ≤90°.
[0090] It should be noted that the principles for dividing the viewer's field of vision can include:
[0091] Both the horizontal and vertical directions are divided into several sectors centered on the front. The sum of the angles of the M horizontal sectors is ≤90°, and the sum of the angles of the N vertical sectors is also ≤90°. The number of sectors, the angle of each sector, and even the sum of the angles can be different in the two directions.
[0092] This embodiment claims partitioning the audience's field of vision into sections that are denser in the center and sparser at the edges. Since the human auditory system localizes most accurately in the horizontal direction directly in front, with accuracy decreasing rapidly towards the sides, this partitioning ensures a good audience experience while saving computational resources. The angle of the central fan-shaped area is set to no more than 5 degrees because tests showed that the average accuracy with which the audience perceives a virtual sound image in the horizontal direction directly in front is approximately 4 degrees, with localization becoming increasingly blurred towards the sides. This angle setting balances algorithm complexity against audience perception accuracy.
[0093] In an optional embodiment of the present invention, obtaining the first target location based on the first sound image region specifically includes:
[0094] When the first sound source image is within the viewer's field of vision, the intersection of the angle bisectors of the first sound image region on the screen is obtained, and the intersection is determined as the first target location.
[0095] In one embodiment, the target orientation calculation method may specifically include:
[0096] The image recognition algorithm first provides an "actual" coordinate point (such as the center point of the mouth or the center point of the image, as described below). Then, the mapping of horizontal and vertical sectors onto the screen generates a series of rectangular grids. Determining which rectangular grid the "actual" coordinate point is located within identifies that grid as the target sound source region. Next, the final location of the sound source target within that rectangular grid is determined. The final step uses the angle bisectors of the two sectors to confirm the final target location of the sound source.
[0097] It should be noted that the process involves working backward from a defined rectangular grid to deduce two sectors (one horizontal and one vertical), and then returning to a point within the rectangular grid. The image recognition algorithm can be a mature, existing AI image recognition technology based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs). This embodiment of the invention uses the OpenCV library in Python to identify and calculate the coordinates of a specific icon in the image.
[0098] The intersection of the angle bisectors of the mapped area on the screen is used as the target orientation information. It determines the orientation of the virtual sound image, so that the sound position of the sound source image can be adjusted in real time to achieve sound and image synchronization.
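The calculation described in this embodiment can be sketched as below. The function names, the coordinate convention (screen coordinates measured from the point straight ahead of the viewer), and the independent treatment of the horizontal and vertical angles are assumptions made for illustration; treating the two angles independently is what produces a rectangular grid.

```python
import math
from bisect import bisect_right

def locate_sector(angle_deg, boundaries):
    """Index of the sector containing angle_deg, or None if it lies outside all sectors."""
    if angle_deg < boundaries[0] or angle_deg > boundaries[-1]:
        return None
    # Clamp so that an angle exactly on the last boundary falls in the last sector.
    return min(bisect_right(boundaries, angle_deg) - 1, len(boundaries) - 2)

def target_on_screen(point, distance, h_bounds, v_bounds):
    """Map an 'actual' screen point to the bisector intersection of its grid cell."""
    x, y = point
    h_idx = locate_sector(math.degrees(math.atan2(x, distance)), h_bounds)
    v_idx = locate_sector(math.degrees(math.atan2(y, distance)), v_bounds)
    if h_idx is None or v_idx is None:
        return None  # outside the divided field of vision; handled separately
    # Angle bisectors of the horizontal and vertical sectors that contain the point.
    h_mid = (h_bounds[h_idx] + h_bounds[h_idx + 1]) / 2
    v_mid = (v_bounds[v_idx] + v_bounds[v_idx + 1]) / 2
    # Where the two bisectors meet the screen plane.
    return (distance * math.tan(math.radians(h_mid)),
            distance * math.tan(math.radians(v_mid)))
```

Every point inside one rectangular cell yields the same target, which is why the virtual sound image only needs to move when the source crosses a cell boundary.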
[0099] In an optional embodiment of the present invention, obtaining the first target location based on the first acoustic image region specifically includes:
[0100] When the first sound source image is outside the viewer's field of vision, the intersection of the boundary line of the outermost sector of the viewer's field of vision and the screen is obtained, and the intersection is determined as the first target location.
[0101] In one embodiment, as shown in Figure 4 for the example with M=5 and N=1, when the audience faces the edge of the screen so that the coordinates of the sound source image on the screen fall outside the mapping areas of all the divided sectors, the midpoint of the boundary surface of the outermost sector in that direction on the screen is set as the target orientation for generating the virtual sound image.
[0102] That is, when the first sound source image is outside the viewer's field of vision, the location of the virtual sound image 1 is the first target orientation.
[0103] In an optional embodiment of the present invention, obtaining the first target location based on the first acoustic image region specifically includes:
[0104] When the first sound source image is in the outermost sector of the viewer's field of vision, and the angle bisector of the first mapping region has no intersection on the screen, the midpoint of the boundary of the outermost sector on the screen will be determined as the first target orientation.
[0105] In one embodiment, as shown in Figure 4 for the example with M=5 and N=1, when the coordinates of the sound source image on the screen are located in the intersection mapping area of one or two sectors but the angle bisector of the sector does not intersect the screen, the virtual sound image orientation corresponding to the sound source image is set to the midpoint of the outermost sector boundary on the screen.
[0106] That is, when the first sound source image is in the outermost sector of the viewer's field of vision, and the angle bisector of the first mapping area has no intersection on the screen, the location of the virtual sound image 2 is the first target orientation.
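The two boundary cases above can be illustrated for the horizontal direction as follows. This is only one possible reading of the text: the function name, the assumption that the viewer's straight-ahead point is the screen center, and the interpretation of "midpoint of the boundary of the outermost sector on the screen" as the midpoint of the part of that sector visible on the screen are all assumptions.

```python
import math

def outer_target_x(source_x, distance, inner_deg, outer_deg, screen_half_width):
    """Horizontal target for a source in or beyond the outermost sector on one side.

    inner_deg and outer_deg are the outermost sector's boundary angles (positive degrees).
    The caller is expected to have checked that the source is not in an inner sector.
    """
    sign = 1.0 if source_x >= 0 else -1.0
    x = abs(source_x)
    inner_x = distance * math.tan(math.radians(inner_deg))
    outer_x = distance * math.tan(math.radians(outer_deg))
    bisector_x = distance * math.tan(math.radians((inner_deg + outer_deg) / 2))
    if x > outer_x:
        # Source outside the field of vision: use the point where the
        # outermost sector boundary meets the screen.
        target = min(outer_x, screen_half_width)
    elif bisector_x > screen_half_width:
        # Bisector misses the screen: use the midpoint of the sector's
        # on-screen part (assumed interpretation).
        target = (inner_x + min(outer_x, screen_half_width)) / 2
    else:
        target = bisector_x
    return sign * target
```

In both fallback cases the virtual sound image stays on the screen, as close as the sector layout allows to the direction of the sound source image.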
[0107] In an optional embodiment of the invention, as shown in Figure 5, after a virtual sound image of the first sound source image is generated at the first target location and sound is played through the virtual sound image, the method further includes:
[0108] S501. Obtain the angle region mapping relationship based on the audience's facial orientation information; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area;
[0109] S502. Obtain the first position information of the first sound source image on the screen;
[0110] S503. Obtain the screen area based on the first location information;
[0111] S504. Based on the angle region mapping relationship and the screen region, obtain the angle region corresponding to the screen region, and determine the angle region as the first audio-visual region;
[0112] S505. Obtain the first target location based on the first acoustic image region;
[0113] S506. Generate a virtual sound image of the first sound source image at the first target location, and play sound through the virtual sound image;
[0114] S507. Real-time detection of the position information of the first sound source image;
[0115] S508. When the first sound source image changes from the first location information to the second location information belonging to the first sound image area, sound is played through the virtual sound image at the first target location.
[0116] It should be noted that, in an optional embodiment, for example, image A is at screen position a, where point a is located in the i-th rectangular planar region of the screen. The i-th sector region (which may include horizontal and vertical sector regions) has a mapping relationship with the i-th rectangular planar region and the viewer's central field of vision. That is, point a represents the first position information of image A, the i-th rectangular planar region is the screen region, and the i-th sector region is the first audio-visual region.
[0117] The first sound source image changes from the first position information to the second position information belonging to the first sound image region, indicating that the sound source image moves from point a to point a1. Both point a and point a1 belong to the i-th rectangular plane region corresponding to the i-th sector region.
[0118] In an optional embodiment of the invention, as shown in Figure 6, after generating a virtual sound image of the first sound source image at the first target location and playing sound through the virtual sound image, the method further includes:
[0119] S601. Obtain the angle region mapping relationship based on the audience's facial orientation information; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area;
[0120] S602. Obtain the first position information of the first sound source image on the screen;
[0121] S603. Obtain the screen area based on the first location information;
[0122] S604. Based on the angle region mapping relationship and the screen region, obtain the angle region corresponding to the screen region, and determine the angle region as the first audio-visual region;
[0123] S605. Obtain the first target location based on the first acoustic image region;
[0124] S606. Generate a virtual sound image of the first sound source image at the first target location, and play sound through the virtual sound image;
[0125] S607. Real-time detection of the position information of the first sound source image;
[0126] S608. When the first sound source image changes from the first position information to the third position information belonging to the second sound image region, the second target orientation is obtained according to the second sound image region. It should be noted that, in an optional embodiment, the third position information indicates that the sound source image moves from point a to point b, where points a and b are not in the same screen area: point b is in the (i-1)th rectangular plane area, which corresponds to the (i-1)th sector area. Because the sector area corresponding to point b has changed, the corresponding orientation information must be adjusted accordingly. The change in the position of the sound source image can be a change in the position of the same object, or a change in both the object and the position. For example, when sound image A moves from point a to point b, it may be that sound image A is replaced by sound image B: the object changes, and the corresponding position moves from point a to point C of image B.
[0127] S609. Move the virtual sound image from the first target location to the second target location, and play sound through the virtual sound image at the second target location.
[0128] The refreshed target orientation information is used as the initial position information when the position of the sound source image changes again. Such a change is caught promptly by real-time detection, and a virtual sound image is generated promptly at the new target orientation, so that sound and image remain at the same position after the movement.
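The update logic of steps S507 to S508 and S607 to S609 can be sketched for one axis as below. The class name, the precomputed `cell_edges_x` and `targets_x` lists, and the clamping of out-of-range positions to the nearest cell are illustrative assumptions.

```python
from bisect import bisect_right

class VirtualImageTracker:
    """Keeps the virtual sound image fixed until the source enters another screen region."""

    def __init__(self, cell_edges_x, targets_x):
        self.edges = cell_edges_x    # screen x-coordinates of the region boundaries
        self.targets = targets_x     # one target orientation (x) per region
        self.region = None
        self.target = None

    def update(self, source_x):
        """Return (target_x, moved) for the latest detected source position."""
        idx = bisect_right(self.edges, source_x) - 1
        idx = max(0, min(idx, len(self.targets) - 1))  # simplification: clamp to nearest region
        moved = idx != self.region
        if moved:
            # Region changed: refresh the target orientation and remember it
            # as the starting state for the next position change.
            self.region = idx
            self.target = self.targets[idx]
        return self.target, moved
```

A move from point a to point a1 inside the same region returns the unchanged target with `moved` set to `False`; a move to point b in a neighboring region returns the new target with `moved` set to `True`.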
[0129] In one embodiment, as shown in Figure 7, a method for audio-visual synchronization includes:
[0130] Provide the audience's initial facial orientation and vertical distance from the screen;
[0131] The viewer's frontal perspective is divided into several angular regions, from the center outwards, in a non-uniform manner, gradually increasing in size. It should be noted that the human auditory system has the highest accuracy in localizing sound images in the horizontal direction directly in front, with accuracy decreasing rapidly from the center outwards. Related literature indicates that the average accuracy of a subject's perception of a virtual sound image in the horizontal direction directly in front is approximately 4°, becoming increasingly blurred towards the sides. The human auditory system is not sensitive to vertical movement of sound images; sometimes, a 60° angular displacement of the sound image at the subject's location is required for it to be perceived.
[0132] In one embodiment, the facial orientation of the audience is monitored in real time, and the central area of the audience's field of vision in front of them is divided into angles in a non-uniform manner from the center to both sides and from small to large. This forms M horizontal fan-shaped areas and N vertical fan-shaped areas that are dense in the middle and sparse at the edges. M and N are defined as 10≥M≥3≥N≥1.
[0133] An angle region mapping relationship is established on the screen. It should be noted that, in one embodiment, a coordinate system is established for the display area of the screen. Based on the vertical distance from the viewer to the screen, the above-mentioned sector angle is mapped to the coordinate system of the screen to confirm the coordinate range of the mapping area corresponding to each sector, and to confirm the coordinates of the intersection point of each pair of horizontal and vertical sector angle bisectors in the mapping area.
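A minimal sketch of how such a mapping table might be built is given below. The function names, the dictionary layout, and the way the facing direction enters (as a yaw or pitch offset added to each sector angle, measured from the foot of the perpendicular from the viewer to the screen) are assumptions; the text specifies only that sector angles are mapped to screen coordinates using the vertical distance.

```python
import math

def map_edges(bounds_deg, distance, facing_deg=0.0, origin=0.0):
    """Screen coordinates where each angle meets the screen along one axis.

    Assumes every facing_deg + angle stays strictly within (-90, 90) degrees.
    """
    return [origin + distance * math.tan(math.radians(facing_deg + b)) for b in bounds_deg]

def build_mapping(h_bounds, v_bounds, distance, yaw_deg=0.0, pitch_deg=0.0, origin=(0.0, 0.0)):
    """Return {(i, j): {"x_range", "y_range", "target"}} for every sector pair."""
    xs = map_edges(h_bounds, distance, yaw_deg, origin[0])
    ys = map_edges(v_bounds, distance, pitch_deg, origin[1])
    # Angle bisector of each horizontal and vertical sector.
    h_mid = [(a + b) / 2 for a, b in zip(h_bounds, h_bounds[1:])]
    v_mid = [(a + b) / 2 for a, b in zip(v_bounds, v_bounds[1:])]
    tx = map_edges(h_mid, distance, yaw_deg, origin[0])
    ty = map_edges(v_mid, distance, pitch_deg, origin[1])
    return {
        (i, j): {"x_range": (xs[i], xs[i + 1]),
                 "y_range": (ys[j], ys[j + 1]),
                 "target": (tx[i], ty[j])}
        for i in range(len(h_mid)) for j in range(len(v_mid))
    }
```

Rebuilding this table whenever the detected facing direction changes keeps the lookup for each sound source position a simple range check.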
[0134] When a video image is displayed on the screen, the video signal processing module provides the coordinates of the sound source.
[0135] For an image of the speaker, provide the coordinates of the center point of the mouth;
[0136] For non-human sound sources, provide the coordinates of the midpoint of the horizontal dimension of their image.
[0137] The video signal processing module mainly refers to image processing software, whose function is to identify image content and provide corresponding coordinate points. For example, the currently popular artificial intelligence image recognition technology based on algorithms such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) is a mature existing technology for this patent proposal. In practice, this patent uses the OpenCV library in Python to identify and calculate the coordinate points of a specific icon in an image.
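The coordinate rule for speakers and non-human sound sources can be sketched as follows. The `detection` dictionary format is hypothetical and stands in for whatever the image recognition step returns; using the vertical center of the bounding box for non-human sources, and falling back to the bounding-box center when no mouth is detected, are also assumptions.

```python
def sound_source_point(detection):
    """Return the (x, y) screen coordinate to use as the sound source position.

    detection (hypothetical format):
        {"kind": "speaker" or "object",
         "bbox": (left, top, right, bottom),
         "mouth_bbox": (left, top, right, bottom) or None}
    """
    def center(box):
        left, top, right, bottom = box
        return ((left + right) / 2, (top + bottom) / 2)

    if detection["kind"] == "speaker" and detection.get("mouth_bbox"):
        # For a speaker, use the center point of the mouth.
        return center(detection["mouth_bbox"])
    # Otherwise use the horizontal midpoint of the image (vertical center assumed).
    return center(detection["bbox"])
```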
[0138] Refresh the audience's facial orientation;
[0139] Determine whether the audience's facial orientation has changed; if it has, return to the step of dividing the area in front of the audience.
[0140] If there is no change, provide the initial coordinates of the sound source image on the screen;
[0141] Determine which angular mapping region the sound source image falls within;
[0142] Virtual sound image generation is performed using the intersection of the current angle bisector on the screen as the target orientation. The generation method may include: calculating the different volumes of each speaker based on the angles between each speaker, the virtual sound image target orientation, and the target listening point, and then converting these volumes into gain-based audio signals for playback. This is based on existing publicly available virtual sound image localization algorithms such as VBAP and DBAP.
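As an illustration of the gain calculation, the sketch below applies the two-speaker, two-dimensional form of VBAP. The function name and the angle convention (degrees measured from the listener's straight-ahead direction, positive to the right) are assumptions, and the source does not state which VBAP or DBAP variant is used in practice.

```python
import math

def vbap_pair_gains(source_deg, left_deg, right_deg):
    """Gains (g_left, g_right) that place a virtual sound image between two speakers."""
    def unit(deg):
        r = math.radians(deg)
        return (math.sin(r), math.cos(r))  # (x to the right, y straight ahead)

    (l1x, l1y), (l2x, l2y) = unit(left_deg), unit(right_deg)
    px, py = unit(source_deg)
    det = l1x * l2y - l2x * l1y
    if abs(det) < 1e-12:
        raise ValueError("speaker directions must not be collinear")
    # Solve p = g1 * l1 + g2 * l2 for the unnormalized gains (Cramer's rule).
    g1 = (px * l2y - l2x * py) / det
    g2 = (l1x * py - px * l1y) / det
    # Clamp targets outside the speaker arc to the nearer speaker.
    g1, g2 = max(g1, 0.0), max(g2, 0.0)
    norm = math.hypot(g1, g2)
    if norm == 0.0:
        raise ValueError("target direction cannot be rendered by this speaker pair")
    # Normalize so that g1^2 + g2^2 = 1 (constant power).
    return (g1 / norm, g2 / norm)
```

Multiplying the audio signal sent to each speaker by its gain moves the perceived sound image toward the target orientation while keeping the overall loudness roughly constant.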
[0143] Refresh the coordinates of the sound source image on the screen;
[0144] Determine whether the sound source image has left the previous angle mapping area. If it has not, generate the virtual sound image at the current target location. In one embodiment, Figure 8 shows an example of virtual sound image generation when M=5 and N=1.
[0145] If it has left the previous area, generate a virtual sound image with the intersection of the angle bisectors on the screen as the target location, and refresh the coordinates of the sound source image on the screen.
[0146] In another embodiment of the invention, as shown in Figure 9, a playback device 900 with simultaneous audio and video includes: a mapping relationship construction module 901, a position information acquisition module 902, a mapping area acquisition module 903, a target position information acquisition module 904, and a virtual audio-visual generation module 905.
[0147] The mapping relationship construction module 901 is used to obtain the angle region mapping relationship based on the audience's facial orientation information; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area; the position information acquisition module 902 is used to acquire the first position information of the first sound source image on the screen;
[0148] The mapping area acquisition module 903 is used to obtain the screen area where it is located based on the first position information; then, based on the angle area mapping relationship and the screen area, it obtains the angle area corresponding to the screen area and determines the angle area as the first audio-visual area.
[0149] The target location information acquisition module 904 is used to obtain the first target orientation based on the first acoustic image region;
[0150] The virtual audio-visual generation module 905 is used to generate a virtual audio-visual image of the first sound source image at the first target location, and to play sound through the virtual audio-visual image.
[0151] The audio-visual simultaneous playback device provided in the above embodiments can realize the technical solution described in the above embodiment of the audio-visual simultaneous playback method. The specific implementation principle of each module or unit can be found in the corresponding content in the above embodiment of the audio-visual simultaneous playback method, which will not be repeated here.
[0152] As shown in Figure 10, the present invention also provides an electronic device 1000. The electronic device 1000 includes a processor 1001, a memory 1002, and a display 1003. Figure 10 shows only some components of the electronic device 1000; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.
[0153] In some embodiments, memory 1002 may be an internal storage unit of electronic device 1000, such as a hard disk or memory of electronic device 1000. In other embodiments, memory 1002 may also be an external storage device of electronic device 1000, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. equipped on electronic device 1000.
[0154] Furthermore, the memory 1002 may include both internal storage units of the electronic device 1000 and external storage devices. The memory 1002 is used to store application software and various types of data installed on the electronic device 1000.
[0155] In some embodiments, processor 1001 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run program code stored in memory 1002 or process data, such as a method for simultaneous audio and video playback in this invention.
[0156] In some embodiments, display 1003 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. Display 1003 is used to display information from electronic device 1000 and to display a visual user interface. Components 1001-1003 of electronic device 1000 communicate with each other via a system bus.
[0157] In some embodiments of the present invention, when the processor 1001 executes the audio-visual simultaneous playback program in the memory 1002, the following steps can be implemented:
[0158] Based on the audience's facial orientation information, an angle region mapping relationship is obtained; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area;
[0159] Obtain the first position information of the first sound source image on the screen;
[0160] The screen region where the location is located is obtained based on the first location information;
[0161] Based on the angle region mapping relationship and the screen region, an angle region corresponding to the screen region is obtained, and the angle region is determined as the first audio-visual region;
[0162] The first target location is obtained based on the first acoustic image region;
[0163] A virtual sound image of the first sound source image is generated at the first target location, and sound is played through the virtual sound image.
[0164] It should be understood that when the processor 1001 executes the audio-visual simultaneous playback program in the memory 1002, in addition to the functions mentioned above, it can also perform other functions, as detailed in the description of the corresponding method embodiments above.
[0165] Furthermore, the embodiments of the present invention do not specifically limit the type of the electronic device 1000 mentioned. The electronic device 1000 can be a mobile phone, tablet computer, personal digital assistant (PDA), wearable device, laptop computer, or other portable electronic device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The aforementioned portable electronic device can also be other portable electronic devices, such as a laptop computer with a touch-sensitive surface (e.g., a touch panel). It should also be understood that in some other embodiments of the present invention, the electronic device 1000 may not be a portable electronic device, but rather a desktop computer with a touch-sensitive surface (e.g., a touch panel).
[0166] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a method for simultaneous audio-visual playback provided by the methods described above, the method comprising:
[0167] Based on the audience's facial orientation information, an angle region mapping relationship is obtained; the angle region mapping relationship is the mapping relationship between the screen and the audience's field of vision area;
[0168] Obtain the first position information of the first sound source image on the screen;
[0169] The screen region where the location is located is obtained based on the first location information;
[0170] Based on the angle region mapping relationship and the screen region, an angle region corresponding to the screen region is obtained, and the angle region is determined as the first audio-visual region;
[0171] The first target location is obtained based on the first acoustic image region;
[0172] A virtual sound image of the first sound source image is generated at the first target location, and sound is played through the virtual sound image.
[0173] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.
[0174] The present invention has provided a detailed description of a method, apparatus, and device for simultaneous audio and video playback. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, those skilled in the art will know that there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for simultaneous audio and video playback, characterized in that it comprises: obtaining an angle region mapping relationship based on the audience's facial orientation information, the angle region mapping relationship being the mapping relationship between the screen and the viewer's field of vision; obtaining the first position information of the first sound source image on the screen; obtaining, based on the first position information, the screen region in which the first sound source image is located; obtaining, based on the angle region mapping relationship and the screen region, the angle region corresponding to the screen region, and determining the angle region as the first sound image region; obtaining the first target location based on the first sound image region; and generating a virtual sound image of the first sound source image at the first target location, and playing sound through the virtual sound image; wherein the step of obtaining the angle region mapping relationship based on the audience's facial orientation information specifically includes: detecting the audience's facial orientation in real time to obtain the audience's facial orientation information; obtaining vertical distance data between the viewer and the screen based on the facial orientation information; symmetrically dividing, based on the vertical distance data, the viewer's field of vision into multiple angular regions in order of angle from small to large; and establishing the angle region mapping relationship between the screen and the multiple angular regions.
2. The method for simultaneous audio and video playback according to claim 1, characterized in that the step of symmetrically dividing the viewer's field of vision into multiple angular regions in order of angle from small to large, based on the vertical distance data, specifically includes: dividing the viewer's field of vision into M horizontal sector regions and N vertical sector regions in a centrally symmetrical manner, wherein M and N are both odd numbers; the included angle of the fan-shaped area near the center is no greater than 5 degrees, and the included angles of the other fan-shaped areas increase progressively from the center to the edge; and the audience's field of vision area is a centrally symmetrical fan-shaped area with an included angle of 60 degrees.
3. The method for simultaneous audio and video playback according to claim 1, characterized in that, The step of obtaining the first target location based on the first acoustic image region specifically includes: When the first sound source image is within the viewer's field of vision, the intersection of the angle bisectors of the first mapping region on the screen is obtained, and the intersection is determined as the first target orientation.
4. The method for simultaneous audio and video playback according to claim 1, characterized in that, The step of obtaining the first target location based on the first acoustic image region specifically includes: When the first sound source image is outside the viewer's field of vision, the intersection of the boundary line of the outermost sector of the viewer's field of vision and the screen is obtained, and the intersection is determined as the first target location.
5. The method for simultaneous audio and video playback according to claim 1, characterized in that, The step of obtaining the first target location based on the first acoustic image region specifically includes: When the first sound source image is in the outermost sector of the viewer's field of vision, and the angle bisector of the first mapping region has no intersection on the screen, the midpoint of the boundary of the outermost sector on the screen will be determined as the first target orientation.
6. The method for simultaneous audio and video playback according to claim 1, characterized in that, After generating a virtual sound image of the first sound source image at the first target location, and playing sound through the virtual sound image, the method further includes: Real-time detection of the position information of the first sound source image; When the first sound source image changes from the first location information to the second location information belonging to the first sound image area, sound is played through the virtual sound image at the first target location.
7. The method for simultaneous audio and video playback according to claim 1, characterized in that, After generating a virtual sound image of the first sound source image at the first target location, and playing sound through the virtual sound image, the method further includes: Real-time detection of the position information of the first sound source image; When the first sound source image changes from the first location information to the third location information belonging to the second sound image region, the second target orientation is obtained according to the second sound image region; The virtual sound image is moved from the first target location to the second target location, and sound is played through the virtual sound image at the second target location.
8. A device for simultaneous audio and video playback, characterized in that it comprises: a mapping relationship construction module, a location information acquisition module, a mapping area acquisition module, a target location information acquisition module, and a virtual sound image generation module; the mapping relationship construction module is used to obtain the angle region mapping relationship based on the audience's facial orientation information, the angle region mapping relationship being the mapping relationship between the screen and the audience's field of vision area; the location information acquisition module is used to acquire the first location information of the first sound source image on the screen; the mapping area acquisition module is used to obtain, based on the first location information, the screen area in which the first sound source image is located, then to obtain, based on the angle region mapping relationship and the screen region, the angle region corresponding to the screen region, and to determine the angle region as the first sound image region; the target location information acquisition module is used to obtain the first target orientation based on the first sound image region; the virtual sound image generation module is used to generate a virtual sound image of the first sound source image at the first target location, and to play sound through the virtual sound image; the device is also used to: detect the facial orientation of the audience in real time to obtain the audience's facial orientation information; obtain vertical distance data between the viewer and the screen based on the facial orientation information; symmetrically divide, based on the vertical distance data, the viewer's field of vision into multiple angular regions in order of angle from small to large; and establish the angle region mapping relationship between the screen and the multiple angular regions.
9. An electronic device, comprising a memory and a processor, wherein: the memory is used to store a program; and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps of the method for simultaneous audio and video playback according to any one of claims 1 to 7.