Image processing system, image processing method, and program

JP2024168219A5 · Pending · Publication Date: 2026-06-26 · CANON KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
CANON KK
Filing Date
2023-05-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

When a subject emitting sound is deleted from a video with sound, the playback of the video can result in a sense of discomfort due to the audio components corresponding to the deleted subject being played back, leading to a deterioration in video quality.

Method used

An image processing system that includes subject detection means for identifying a specific subject in video frames, audio detection means for identifying corresponding audio components, and deletion means to remove the subject and its associated audio from the video and audio data, respectively, using machine learning models to maintain video quality.

Benefits of technology

The system effectively deletes a specific subject from a video with audio while minimizing the deterioration in audio quality by synchronizing the removal of the subject and its associated audio components.

✦ Generated by Eureka AI based on patent content.

Abstract

PROBLEM TO BE SOLVED: To provide a technique capable of removing a specific subject from a moving image with sound while suppressing deterioration of the quality of the moving image with sound.

SOLUTION: An image processing system comprises: subject detection means for detecting a subject of a first type in a plurality of frames of moving image data accompanied by sound data; subject removing means for removing the subject of the first type from one or more frames, of the plurality of frames, in which the subject of the first type is detected; sound detection means for detecting a sound component corresponding to the subject of the first type in the sound data; and sound removing means for removing the sound component corresponding to the subject of the first type from the sound data.

SELECTED DRAWING: Figure 2

Description

[Technical field]

[0001] The present invention relates to an image processing system, an image processing method, and a program.

[Background technology]

[0002] Nowadays, digital cameras and smartphones with a function to shoot videos with audio are widely used. In videos with audio shot by a user, a subject that the user does not want to see may appear in the video. For example, when a user wants to shoot a person, a car that the user does not want to see may appear in the video.

[0003] Currently, there is a known technique for erasing an unnecessary area in an image so that no trace of the area remains (Patent Document 1).

[Prior art documents]

[Patent documents]

[0004] [Patent Document 1] JP 2007-286734 A

[Summary of the invention]

[Problem to be solved by the invention]

[0005] Consider the case where a subject making a sound is deleted from a video with audio. In this case, when the video with audio is played back, the audio that is played back contains the sound components corresponding to the subject that is not displayed because it has been deleted, which may cause the user to feel uncomfortable. In this way, when a subject making a sound is deleted from a video with audio, the quality of the video with audio is reduced.

[0006] The present invention has been made in consideration of the above-mentioned circumstances, and aims to provide a technology that makes it possible to remove a specific subject from a video with audio while suppressing degradation in the quality of the video with audio.

[Means for solving the problem]

[0007] In order to solve the above problem, the present invention provides an image processing system comprising: a subject detection means for detecting a first type of subject in multiple frames of video data accompanied by audio data; a subject removal means for removing the first type of subject from one or more frames among the multiple frames in which the first type of subject is detected; an audio detection means for detecting audio components in the audio data corresponding to the first type of subject; and an audio removal means for removing the audio components corresponding to the first type of subject from the audio data.

[Effect of the invention]

[0008] According to the present invention, it is possible to delete a specific subject from a video with audio while suppressing deterioration in the quality of the video with audio.

[0009] Other features and advantages of the present invention will become apparent from the accompanying drawings and the following detailed description of the preferred embodiments of the present invention.

[Brief description of the drawings]

[0010]
[Figure 1A] A diagram showing a hardware configuration of an image processing system.
[Figure 1B] A diagram showing the functional configuration of an image processing system according to a first embodiment.
[Figure 2] A flowchart of image processing executed by the image processing system according to the first embodiment.
[Figure 3A] A diagram for explaining an example of removing a subject according to the first embodiment.
[Figure 3B] A diagram for explaining an example of removing a subject according to the first embodiment.
[Figure 3C] A diagram for explaining an example of removing a subject according to the first embodiment.
[Figure 4A] A diagram for explaining an example of deleting an audio component according to the first embodiment.
[Figure 4B] A diagram for explaining an example of deleting an audio component according to the first embodiment.
[Figure 5] A diagram showing the functional configuration of an image processing system according to a second embodiment.
[Figure 6] A flowchart of image processing executed by the image processing system according to the second embodiment.
[Figure 7] A diagram for explaining an example of separation of audio components according to the second embodiment.

[Detailed description of the preferred embodiments]

[0011] Hereinafter, the embodiments will be described in detail with reference to the attached drawings. Note that the following embodiments do not limit the invention according to the claims. Although the embodiments describe a number of features, not all of these features are essential to the invention, and the features may be combined in any manner. Furthermore, in the attached drawings, the same reference numbers are used for the same or similar configurations, and duplicated descriptions are omitted.

[First embodiment]

● Hardware configuration of image processing system

[0012] Fig. 1A is a diagram showing a hardware configuration of an image processing system. In Fig. 1A, an information processing device 100 is a device having a video editing function, such as a personal computer (PC) or a smartphone. The information processing device 100 has a CPU 101, a ROM 102, a RAM 103, an HDD 104, a GPU 105, a network communication unit 106, an operation input unit 108, a display unit 109, an audio output unit 110, and a data communication unit 111. These components of the information processing device 100 are connected to each other via a system bus 107.

[0013] The CPU 101 uses the RAM 103 as a work area and executes programs stored in the ROM 102 or the HDD 104 to comprehensively control the operation of the information processing device 100. The programs executed by the CPU 101 include a video editing application program. The ROM 102 is a read-only non-volatile storage medium, and stores programs such as firmware. The RAM 103 is a volatile storage medium capable of reading and writing information at high speed, and is used as a work area when the CPU 101 processes information. The HDD 104 is a non-volatile storage medium capable of reading and writing information, and stores an OS, various control programs, application programs, video data and audio data used in video editing, and the like.

[0014] The GPU 105 cooperates with the CPU 101 to execute processes for video editing, learning and inference using machine learning technology, and the like. In general, a GPU can perform more efficient calculations by processing more data in parallel than a CPU. Therefore, when the GPU 105 is used in addition to the CPU 101, inference regarding video and audio can be efficiently performed multiple times using a trained model in deep learning. Note that the inference process in the trained model described below may be performed by either the CPU 101 or the GPU 105.

[0015] The network communication unit 106 is an interface for connecting to the server 130 via the network 120. The operation input unit 108 accepts operations from a user via a keyboard, a mouse, a touch panel, or the like. The user can operate the video editing application by the operations. The display unit 109 is a monitor or a display, and displays a graphical user interface (GUI) of the information processing device 100. The display unit 109 also displays the GUI of the video editing application, and the user can edit the video by operating the GUI. The audio output unit 110 is an audio playback device such as a speaker. Alternatively, the audio output unit 110 may be an output terminal that can be connected to an audio playback device such as an earphone or a headphone. The user can hear the audio played back by the video editing application via the audio output unit 110.

[0016] The data communication unit 111 is an interface such as USB, SD, PCI Express, SATA, etc., and is capable of data communication with various recording media such as a USB memory, an SD card, and an SSD. A user can import video data and audio data obtained by shooting a video via the data communication unit 111 and store the data in the HDD 104, etc. Then, the user can edit the video data and audio data stored in the HDD 104, etc., using a video editing application. Alternatively, the user can import video data and audio data from devices such as a camera, a PC, a smartphone, etc. (not shown) via the network 120. There is no particular limitation on the method of importing the video data and audio data into the information processing device 100.

[0017] The server 130 is a server for sharing part of the processing of the information processing device 100, and is, for example, a device such as a personal computer (PC). In this embodiment, the processing shared by the server 130 is not particularly limited, but may be, for example, processing related to video editing and machine learning.

[0018] The server 130 includes a CPU 131, a ROM 132, a RAM 133, a HDD 134, a GPU 135, and a network communication unit 136. These components of the server 130 are connected to each other via a system bus 137. The functions of the CPU 131, the ROM 132, the RAM 133, the HDD 134, the GPU 135, and the network communication unit 136 are similar to those of the CPU 101, the ROM 102, the RAM 103, the HDD 104, the GPU 105, and the network communication unit 106 of the information processing device 100, respectively. However, in general, the server 130 often has hardware resources with higher functionality and larger capacity than the information processing device 100. Therefore, when the hardware resources of the information processing device 100 alone are insufficient, the hardware resources of the server 130 can be used to perform processing efficiently. However, all processing may be completed by the information processing device 100 alone. Therefore, although the image processing system illustrated in FIG. 1A includes the information processing device 100 and the server 130, the image processing system of this embodiment does not necessarily need to include the server 130.

● Functional configuration of image processing system

[0019] Fig. 1B is a diagram showing a functional configuration realized by the hardware of the image processing system shown in Fig. 1A working in cooperation with a program (software). In Fig. 1B, the image processing system includes an area selection unit 141, a subject type determination unit 142, a subject vector acquisition unit 143, a subject removal unit 144, an audio type determination unit 145, an audio vector acquisition unit 146, a type matching determination unit 147, and an audio deletion unit 148. The software of this embodiment also includes a video editing application. The video editing application operates when the CPU 101 executes a program stored in the ROM 102 or the HDD 104 using the RAM 103 as a work area.

[0020] The area selection unit 141 selects an arbitrary area within the angle of view of an arbitrary frame (area selection frame) of the moving image data displayed on the display unit 109. For example, the area selection unit 141 selects an area designated by the user in accordance with an instruction from the user via the operation input unit 108. The area selection frame is, for example, a frame designated by the user in the moving image data.

[0021] The subject type determination unit 142 determines the type of subject (e.g., person, dog, car, or other) included in the area (selected area) selected by the area selection unit 141, and outputs information indicating the determined type. The determination of the type of subject is realized, for example, by using an image of the selected area as an input and performing inference using a trained model (first machine learning model) configured to identify the type of subject included in the input image.

[0022] In this embodiment, any known technology can be used for machine learning. For example, the subject type determination unit 142 uses a trained model for images. To generate the trained model for images, an image to be identified is used as input data, and information on the subject type of that image (e.g., person, dog, car, or other) is used as teacher data, so that the resulting model outputs the type of subject corresponding to an input image. Specific machine learning algorithms include the nearest neighbor method, the naive Bayes method, decision trees, and support vector machines. Another option is deep learning, in which a neural network itself learns the features and connection weighting coefficients. Any available algorithm among these can be applied to this embodiment as appropriate.

[0023] In the inference phase, the trained model for images takes an image of the selected area as input data and outputs information indicating the type of subject contained in the image (e.g., person, dog, car, or other).
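The training and inference phases in [0022] and [0023] can be illustrated with a minimal sketch that uses the nearest neighbor method, one of the algorithms listed above. The two-dimensional feature vectors, the function names `train` and `infer_type`, and the sample values are hypothetical and are not taken from the patent; an actual system would extract features from the image of the selected area.

```python
import math

def train(teacher_samples):
    """For nearest neighbor, 'training' just stores (feature, label) pairs."""
    return list(teacher_samples)

def infer_type(model, feature):
    """Return the label of the stored sample closest to the input feature."""
    nearest = min(model, key=lambda pair: math.dist(pair[0], feature))
    return nearest[1]

# Hypothetical teacher data: (feature vector, subject type).
teacher_data = [
    ((0.4, 0.3), "person"),
    ((1.2, 0.2), "dog"),
    ((2.5, 0.6), "car"),
]
model = train(teacher_data)

# Hypothetical feature extracted from the image of the selected area.
selected_area_feature = (2.3, 0.5)
subject_type = infer_type(model, selected_area_feature)
print(subject_type)
```

The same structure would apply to any of the other listed algorithms: only the body of `train` and `infer_type` changes, while the input (an image feature) and the output (a type label) stay the same.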

[0024] In this embodiment, the hardware used for generating the trained model and for inference based on the trained model is not particularly limited, and for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135 may be used. Also, a different device (not shown) may be used.

[0025] The subject vector acquisition unit 143 calculates the velocity vector of the subject detected through the type determination by the subject type determination unit 142. For example, the subject vector acquisition unit 143 tracks the subject over several frames before and after the area selection frame, and calculates the velocity vector of the subject from the amount of its movement. The subject can be tracked by detecting it in each frame using a machine learning technique, for example, as with the subject type determination unit 142. Alternatively, the subject may be tracked by pattern matching of pixel values between frames without using a machine learning technique. The hardware used to calculate the velocity vector of the subject is not particularly limited, and for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135 may be used.
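The calculation in [0025] can be sketched as follows, assuming the tracker has already produced the center position of the subject in each frame. The function name `velocity_vector`, the track format, and the frame rate of 30 fps are assumptions made for illustration only.

```python
def velocity_vector(track, fps):
    """Average velocity in pixels per second between the first and last positions.

    track: list of (frame_index, x_center, y_center), sorted by frame_index.
    """
    (f0, x0, y0), (f1, x1, y1) = track[0], track[-1]
    elapsed = (f1 - f0) / fps  # seconds between the first and last frame
    if elapsed <= 0:
        raise ValueError("track must span at least two distinct frames")
    return ((x1 - x0) / elapsed, (y1 - y0) / elapsed)

# Hypothetical track: one frame before and after area selection frame 100,
# with the subject moving from right to left.
track = [(99, 640.0, 300.0), (100, 620.0, 300.0), (101, 600.0, 300.0)]
v = velocity_vector(track, fps=30.0)
print(v)
```

A negative x component here means movement toward the left edge of the frame.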

[0026] The subject removal unit 144 removes the subject detected by the subject type determination unit 142 from the area selection frame. Since simply removing the subject from the area selection frame would result in an unnatural video, the subject removal unit 144 complements the area from which the subject was removed so that it blends into the background. Furthermore, in frames other than the area selection frame, if the corresponding subject exists within the angle of view, the subject removal unit 144 similarly removes the subject and complements the background. There are no particular limitations on the hardware used to remove the subject, and for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135 may be used.
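The patent does not specify how the background is complemented in [0026]. Purely as an illustration of the idea, the sketch below fills the removed pixels of one image row by linear interpolation between the nearest remaining pixels on either side. The function name `fill_row` and the sample values are hypothetical, and a practical system would likely use a far more capable inpainting method.

```python
def fill_row(row, mask):
    """Fill masked pixels by interpolating between the nearest unmasked neighbors.

    row:  list of pixel intensities.
    mask: list of booleans, True where the subject was removed.
    """
    filled = list(row)
    n = len(row)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        end = i  # index of the first unmasked pixel after the run, or n
        left = filled[start - 1] if start > 0 else None
        right = filled[end] if end < n else None
        if left is None and right is None:
            continue  # the whole row is masked; nothing to interpolate from
        if left is None:
            left = right
        if right is None:
            right = left
        span = end - start + 1
        for j in range(start, end):
            weight = (j - start + 1) / span
            filled[j] = left + (right - left) * weight
    return filled

row = [10, 10, 99, 99, 99, 20]                   # 99 marks the removed subject
mask = [False, False, True, True, True, False]
print(fill_row(row, mask))
```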

[0027] The audio type determination unit 145 analyzes the audio data corresponding to the frames from which the subject has been removed by the subject removal unit 144, and outputs information indicating the types of audio included in the audio data (for example, person, dog, car, or other). The determination of the type of audio is realized, for example, by using the audio data as input and performing inference using a trained model (second machine learning model) configured to identify the type of subject corresponding to each audio component included in the input audio data.

[0028] In this embodiment, any known technique can be used for machine learning. For example, the audio type determination unit 145 uses a trained model for audio. To generate the trained model for audio, the audio to be identified is used as input data, and information on the type of subject corresponding to that audio (e.g., person, dog, car, or other) is used as teacher data, so that the resulting model outputs the type of subject corresponding to the input audio. As with the subject type determination unit 142, various algorithms can be used as specific algorithms for machine learning.

[0029] In the inference phase, the trained model for audio takes audio as input data and outputs information indicating the type of subject corresponding to each audio component contained in the audio (e.g., person, dog, car, or other).

[0030] In this embodiment, the hardware used for generating the trained model and for inference based on the trained model is not particularly limited, and for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135 may be used. Also, a different device (not shown) may be used.

[0031] The audio vector acquisition unit 146 calculates the position and velocity vector of the audio. An example of a method for calculating them is described below. For example, when the audio data is recorded using two microphones, the audio vector acquisition unit 146 identifies the position of the sound emitted by the subject from the difference in the arrival time of the sound at the two microphones. After that, the audio vector acquisition unit 146 calculates the velocity vector of the sound based on the movement of that position along the time axis of the audio data. The position and velocity vector of the sound source may be calculated more easily if the audio data is recorded using a microphone array of three or more microphones or a directional microphone. The hardware used to calculate the velocity vector of the audio is not particularly limited, and for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135 may be used. A different device (not shown) may also be used.
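The two-microphone approach in [0031] relies on the time difference of arrival. As a minimal sketch under stated assumptions, the code below estimates the delay between the two channels by brute-force cross-correlation and converts it into a direction angle under a far-field model. The function names, the microphone spacing of 0.2 m, the 48 kHz sample rate, and the synthetic impulse are all hypothetical; the patent does not prescribe this particular estimator.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def estimate_delay(left, right, max_lag):
    """Return the lag in samples that maximizes the cross-correlation.

    A positive lag means the sound reached the left microphone first.
    """
    def correlation(lag):
        return sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
    return max(range(-max_lag, max_lag + 1), key=correlation)

def arrival_angle(lag, sample_rate, mic_spacing):
    """Angle of the source from the broadside direction, in degrees."""
    ratio = SPEED_OF_SOUND * (lag / sample_rate) / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))

# Synthetic impulse that reaches the right microphone 3 samples later.
left = [0.0] * 32
right = [0.0] * 32
left[10] = 1.0
right[13] = 1.0

lag = estimate_delay(left, right, max_lag=8)
angle = arrival_angle(lag, sample_rate=48000, mic_spacing=0.2)
print(lag, angle)
```

Repeating this estimate over successive short windows of the audio data gives a sequence of directions, from which a velocity vector for the sound source could be derived in the same way as for the tracked subject.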

[0032] The type matching determination unit 147 compares the type of the subject determined by the subject type determination unit 142 with the types of audio determined by the audio type determination unit 145 to determine whether there is an audio component of a type that matches the type of the subject to be deleted (for example, a person, a dog, or a car). The type matching determination unit 147 also compares the position and velocity vector of the subject calculated by the subject vector acquisition unit 143 with the position and velocity vector of the audio calculated by the audio vector acquisition unit 146 to determine whether there is an audio velocity vector that corresponds to the velocity vector of the subject to be deleted. If there is a corresponding audio velocity vector, the type matching determination unit 147 can determine that the audio (audio component) is of the same type as the subject to be deleted. The velocity vector is used because, even when the type cannot be determined correctly due to insufficient training of the trained model for images or the trained model for audio described above, a separate function, velocity vector calculation, can still detect the audio component corresponding to the subject to be deleted. The type matching determination unit 147 may use only the type information obtained by the subject type determination unit 142 and the audio type determination unit 145, without using the velocity vectors. Alternatively, it may use only the velocity vectors, without using that type information. In this way, the method of identifying the audio component corresponding to the subject to be deleted from the video data is not particularly limited, and various methods including the one described here may be used.
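The two matching criteria in [0032] can be sketched as follows: the type is compared first, and the velocity vector is used as a fallback when no type matches. The dictionary layout, the function names, the use of cosine similarity, and the threshold of 0.9 are all assumptions made for illustration; the sketch also assumes that the subject and audio velocity vectors have already been expressed in a common coordinate system.

```python
import math

def cosine_similarity(u, v):
    """Similarity of direction between two 2-D vectors (0.0 if either is zero)."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / norm

def find_matching_component(subject, components, min_similarity=0.9):
    """Return the audio component that corresponds to the subject, or None."""
    # Criterion 1: the type of the audio component matches the subject type.
    for component in components:
        if component["type"] == subject["type"]:
            return component
    # Criterion 2 (fallback): the velocity vectors point the same way.
    for component in components:
        if cosine_similarity(component["velocity"], subject["velocity"]) >= min_similarity:
            return component
    return None

subject = {"type": "car", "velocity": (-600.0, 0.0)}
components = [
    {"type": "person", "velocity": (0.0, 0.0)},
    # Misclassified as "other", but moving like the subject.
    {"type": "other", "velocity": (-550.0, 20.0)},
]
match = find_matching_component(subject, components)
print(match)
```

Trying the type first and the velocity second is only one possible ordering; as the paragraph notes, either criterion may also be used alone.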

[0033] The audio deletion unit 148 separates the audio components that the type matching determination unit 147 has determined to correspond to the subject to be deleted from the other audio components, and deletes them. Audio components other than those corresponding to the subject to be deleted are not deleted. Any known technology can be used for separating and deleting audio components. As one example among several possible techniques, the audio deletion unit 148 determines the types of audio using a trained model for audio similar to that described for the audio type determination unit 145, and separates the audio components by type. In doing so, the audio deletion unit 148 performs a Fourier transform on the audio data to treat it as spectral information, masks the spectrum of the audio type to be deleted, and performs an inverse Fourier transform to return to audio data, thereby generating audio data from which only the specific audio type has been deleted.
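The transform, mask, and inverse transform sequence in [0033] can be illustrated with a small sketch. It uses a naive discrete Fourier transform on a short synthetic signal and masks fixed frequency bins. In the system described, the mask would instead be derived from the trained model for audio, so the bin indices, signal contents, and function names here are assumptions for illustration only.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), adequate for short examples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def delete_bins(signal, bins_to_mask):
    """Zero the given frequency bins (and their mirrored bins), then invert."""
    spectrum = dft(signal)
    n = len(spectrum)
    for k in bins_to_mask:
        spectrum[k] = 0
        spectrum[(n - k) % n] = 0  # mirrored bin of a real-valued signal
    return idft(spectrum)

n = 64
kept = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]            # stays
removed = [0.5 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]  # deleted
mixed = [a + b for a, b in zip(kept, removed)]

cleaned = delete_bins(mixed, bins_to_mask=[10])
max_error = max(abs(c - k) for c, k in zip(cleaned, kept))
print(max_error)
```

Real audio components overlap in frequency and change over time, so a practical implementation would typically apply a time-varying mask to a short-time spectrum rather than zeroing whole bins of a single transform.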

● Image processing flow

[0034] Fig. 2 is a flowchart of image processing executed by the image processing system. The target of image processing is moving image data accompanied by audio data. As described above, the audio data and moving image data are recorded, for example, in the HDD 104. When the user of the information processing device 100 selects a function for deleting a subject in the user interface of a video editing application, the processing of this flowchart starts.

[0035] The overall control of this flowchart is performed by the CPU 101. The processing of each step of this flowchart is performed by the corresponding unit shown in Fig. 1B. There are no particular limitations on the hardware that realizes the functions of each unit shown in Fig. 1B, and as long as it is technically possible, the functions may be realized by, for example, some or all of the CPU 101, the GPU 105, the CPU 131, and the GPU 135.

[0036] In S201, the area selection unit 141 selects a specific area (selection area) in a specific frame (area selection frame) among multiple frames of video data. The selection area is, for example, an area designated by the user.

[0037] In S202, the subject type determination unit 142 determines the type (first type) of the subject (target subject) included in the selection area. This allows the target subject to be detected and its type to be identified. In addition, the subject vector acquisition unit 143 may calculate (acquire) a velocity vector of the target subject across multiple frames.

[0038] In S203, the subject type determination unit 142 detects the target subject in the other frames of the video data (frames other than the area selection frame).

[0039] In S204, the subject removal unit 144 removes the target subject from each of the one or more frames (target frames) in which the target subject was detected in S202 or S203. When removing the target subject, the subject removal unit 144 complements the area of the removed subject so that it blends into the background. For example, if the target subject is captured within the angle of view of the 201st to 300th frames in video data having 500 frames, the target subject is removed and the background is complemented for the 201st to 300th frames.

[0040] In S205, the audio type determination unit 145 determines the types of sound included in the sound data (the type of subject corresponding to each sound component). In addition, the audio vector acquisition unit 146 may calculate (acquire) a velocity vector of the sound component corresponding to the target subject across multiple frames.

[0041] In S206, the type matching determination unit 147 detects the sound component corresponding to the target object in the sound data and determines whether or not the sound component corresponding to the target object exists. If the sound component corresponding to the target object exists, the process proceeds to S207, and if not, the process of this flowchart ends.

[0042] The type matching determination unit 147 detects the sound (detects the sound component corresponding to the target subject) based on the type of the target subject determined in S202 and the type of sound determined in S205. For example, consider a case where the type of the target subject is "automobile" and the types of sound are "automobile" and "person". In this case, the sound data includes a sound component corresponding to the automobile, and the sound component corresponding to the automobile is detected as the sound component corresponding to the target subject. Alternatively, the type matching determination unit 147 may detect the sound component corresponding to the target subject using the velocity vector acquired in S202 and S205 instead of or in addition to the type determined in S202 and S205. When the velocity vector is used, the type matching determination unit 147 can detect the sound component corresponding to the target subject in the sound data (each frame) by comparing the velocity vector of the target subject with the velocity vector of the sound component corresponding to the target subject.

[0043] In S207, the audio deletion unit 148 separates and deletes the audio components corresponding to the target subject from the audio data. Audio components other than the audio components corresponding to the target subject are not deleted.
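The control flow from S203 to S207 can be summarized in a toy model in which each frame is represented only by the set of subject types it shows and the set of audio component types it carries. This representation and the function name `run_deletion_flow` are hypothetical simplifications; the actual detection, removal, and separation steps are replaced by set operations.

```python
def run_deletion_flow(frames, audio, target_type):
    """Toy model of S203 to S207.

    frames: list of sets, the subject types visible in each frame.
    audio:  list of sets, the audio component types present for each frame.
    target_type: type determined for the selected area (result of S201-S202).
    """
    frames = [set(f) for f in frames]
    audio = [set(a) for a in audio]

    # S203: detect the target subject in every frame.
    target_frames = [i for i, subjects in enumerate(frames) if target_type in subjects]

    # S204: remove the target subject from the frames where it was detected.
    for i in target_frames:
        frames[i].discard(target_type)

    # S205-S206: look for audio components of the same type.
    audio_frames = [i for i, parts in enumerate(audio) if target_type in parts]
    if not audio_frames:
        return frames, audio  # no matching component: the flow ends here

    # S207: delete only the matching component; the others are kept.
    for i in audio_frames:
        audio[i].discard(target_type)
    return frames, audio

frames = [{"person"}, {"person", "car"}, {"person"}]
audio = [{"person", "car"}, {"person", "car", "dog"}, {"dog"}]
new_frames, new_audio = run_deletion_flow(frames, audio, "car")
print(new_frames, new_audio)
```

Note that the car's audio component is also deleted for the first frame, where the car itself is not visible.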

[0044] Even if the target subject is not within the angle of view during video capture, if the target subject emits a sound near the microphone of the imaging device (camera), the sound component of the target subject may be recorded in the sound data. Consequently, the sound component corresponding to the target subject may be detected in S206 even in sound data corresponding to a frame in which the target subject was not detected in S203. In S207, therefore, the audio deletion unit 148 can delete the sound component corresponding to the target subject from the sound data even for a frame that does not include the target subject, as long as the sound component exists. For example, consider a case in which the target subject is captured within the angle of view of the 201st to 300th frames in video data having 500 frames, and the sound component corresponding to the target subject exists in the sound data corresponding to the 101st to 400th frames. In this case, when the audio type determination unit 145 determines the sound type for the entire sound data in S205, the sound component corresponding to the target subject is detected from the part of the sound data corresponding to the 101st to 400th frames. Therefore, the audio deletion unit 148 can delete the sound components corresponding to the target subject from the sound data corresponding to the 101st to 400th frames, in which those sound components exist.
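To make the frame ranges in [0044] concrete, the sketch below converts a range of video frames into the corresponding range of audio samples. The frame rate of 30 fps and the sample rate of 48 kHz are assumptions chosen for the example; the patent does not state either value.

```python
def frame_range_to_samples(first_frame, last_frame, fps, sample_rate):
    """Convert a 1-based inclusive frame range to a half-open sample range."""
    start = round((first_frame - 1) * sample_rate / fps)
    end = round(last_frame * sample_rate / fps)
    return start, end

# Frames in which the target subject is visible (S204 operates on these).
video_range = frame_range_to_samples(201, 300, fps=30, sample_rate=48000)

# Frames whose sound data contains the subject's component (S207 operates on these).
audio_range = frame_range_to_samples(101, 400, fps=30, sample_rate=48000)

print(video_range, audio_range)
```

With these assumed rates, each frame spans 1600 samples, so the audio deletion range extends 160000 samples beyond the video removal range on each side.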

[0045] In addition, when determining whether a sound component corresponding to the target subject exists for a frame in which the target subject does not appear, the period during which the target subject was within the angle of view may be taken into consideration. For example, the audio type determination unit 145 may determine the type of sound for each predetermined period in the sound data before and after the period during which the target subject was within the angle of view. The predetermined period is, for example, a period of a predetermined length (for example, 10 frames) immediately before or after the period during which the target subject was within the angle of view. The audio type determination unit 145 may repeat setting the predetermined period, moving from the period closest to the period during which the target subject was within the angle of view toward periods farther from it, until the sound component corresponding to the target subject no longer exists. Alternatively, the audio type determination unit 145 may predict the frame at which the sound component corresponding to the target subject will disappear, based on the velocity vector calculated by the subject vector acquisition unit 143 or the audio vector acquisition unit 146, or on the transition of the volume of the sound corresponding to subjects of the same type as the target subject, and the sound component may be deleted up to the predicted frame.
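The repeated setting of predetermined periods in [0045] can be sketched as an outward search from the frames in which the target subject is visible. The function `extend_deletion_range`, the callback `has_component`, and the 10-frame step are illustrative assumptions; in the described system the callback would be the type determination applied to the sound data of the given frames.

```python
def extend_deletion_range(has_component, first, last, total_frames, step=10):
    """Grow the range [first, last] outward in fixed steps while the sound persists.

    has_component(a, b) returns True if the target's sound component exists
    in the sound data for frames a..b (1-based, inclusive).
    """
    start = first
    while start > 1:
        lo = max(1, start - step)
        if not has_component(lo, start - 1):
            break  # the component no longer exists in the preceding period
        start = lo

    end = last
    while end < total_frames:
        hi = min(total_frames, end + step)
        if not has_component(end + 1, hi):
            break  # the component no longer exists in the following period
        end = hi
    return start, end

# Stand-in detector: the component is present in frames 101 to 400.
def has_component(a, b):
    return a <= 400 and b >= 101

deletion_range = extend_deletion_range(has_component, 201, 300, total_frames=500)
print(deletion_range)
```

The result is only accurate to the step size: a period is included whole if the component appears anywhere within it, so a smaller step gives a tighter range at the cost of more determinations.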

● Example of removing the target subject and its corresponding audio components

[0046] An example of removing a target subject and its corresponding audio components will be described with reference to FIGS. 3A to 3C and 4A to 4B.

[0047] FIG. 3A shows an example of three consecutive frames in video data. In these three frames, a car 301 is moving from right to left. All other subjects are stationary.

[0048] Assume that the user designates area 310 while the frame in the middle row of FIG. 3A (the nth frame) is displayed on the display unit 109. The area selection unit 141 selects area 310 in response to the user's designation. This process corresponds to S201 in FIG. 2.

[0049] Although the area 310 shown in FIG. 3A is rectangular, the shape of the area 310 and the method of designating the area 310 are not particularly limited. For example, the area 310 may be circular. Alternatively, a configuration may be adopted in which the user designates the area 310 by surrounding a desired area freehand.

[0050] The subject type determination unit 142 determines that the type of subject included in the area 310 is an automobile. Then, the subject type determination unit 142 detects the automobile, which is the target subject, in other frames of the video data. As a result, the automobile 301 is also detected in the upper and lower frames of FIG. 3A. This process corresponds to S202 to S203 in FIG. 2.

[0051] Next, the subject removal unit 144 removes the automobile 301 from each frame in which it was detected, as shown in Fig. 3B. Then, the subject removal unit 144 complements the area from which the automobile 301 was removed so that it blends into the background, as shown in Fig. 3C. This process corresponds to S204 in Fig. 2.

[0052] Fig. 4A is a conceptual diagram of the audio data corresponding to the three frames shown in Fig. 3A. The audio type determination unit 145 determines the types of audio contained in the audio data of these three frames. Then, the type matching determination unit 147 detects the audio component corresponding to the car 301 in Fig. 3A (car sound 401). This process corresponds to S205 to S206 in Fig. 2.

[0053] Next, the audio deletion unit 148 deletes the audio component corresponding to the automobile 301 (the audio 401 of the automobile). As a result, as shown in FIG. 4B, the audio data corresponding to the three frames no longer includes the audio component corresponding to the automobile 301 but still includes the audio components of the dog and the person. This process corresponds to S207 in FIG. 2.

[0054] If audio components corresponding to the automobile 301 are contained in frames other than the three frames shown in FIG. 4A, the audio deletion unit 148 similarly deletes the audio components corresponding to the automobile 301 from these frames as well.
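As a minimal sketch of the result shown in FIG. 4B, the following assumes that the audio data has already been separated into per-type components of equal length and that the mixture is simply their sum. Both assumptions, and the name `remix_without`, are illustrative and are not taken from the text.

```python
from typing import Dict, List


def remix_without(components: Dict[str, List[float]], target: str) -> List[float]:
    """Rebuild the audio from every separated component except `target`.

    Assumption: all components have the same number of samples and the
    original audio equals their sample-wise sum.
    """
    if target not in components:
        raise KeyError(target)
    kept = [samples for label, samples in components.items() if label != target]
    length = len(components[target])
    # Sum the remaining components sample by sample.
    return [sum(track[i] for track in kept) for i in range(length)]


separated = {
    "automobile": [0.5, 0.5],
    "dog": [0.1, 0.0],
    "person": [0.0, 0.2],
}
print(remix_without(separated, "automobile"))  # [0.1, 0.2]
```

Only the automobile component is dropped; the dog and person components are carried over unchanged.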

[0055] ● Summary of the first embodiment
As described above, according to the first embodiment, when a specific subject (a first type of subject) is deleted from video data accompanied by audio data, the audio components corresponding to the deleted subject are also deleted from the audio data. This prevents the video from being played back with audio components of a subject that is no longer displayed because it has been deleted. Therefore, according to this embodiment, it is possible to delete a specific subject from a video with audio while suppressing deterioration in the quality of the video with audio.

[0056] The specific procedure of the image processing described above with reference to FIG. 2 is merely one example of a processing procedure for preventing playback of audio containing audio components that correspond to a deleted subject that is no longer displayed. Any configuration that deletes a specific subject from video data accompanied by audio data and deletes the audio components corresponding to the deleted subject from the audio data is included in the scope of the technical idea of this embodiment. To generalize the first embodiment, the image processing system detects a specific subject (a first type of subject) in multiple frames of video data accompanied by audio data, and deletes the first type of subject from the one or more frames, among the multiple frames, in which it is detected. The image processing system also detects audio components corresponding to the first type of subject in the audio data, and deletes those audio components from the audio data.
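The generalized configuration can be sketched as four interchangeable operations. In the sketch below, the four callables are hypothetical stand-ins for the detection and deletion means; their signatures and the name `remove_subject_with_audio` are assumptions made for illustration, not an implementation described in the text.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

F = TypeVar("F")  # frame representation
A = TypeVar("A")  # audio representation
C = TypeVar("C")  # detected audio component


def remove_subject_with_audio(
    frames: Sequence[F],
    audio: A,
    subject_type: str,
    detect_subject: Callable[[F, str], bool],
    delete_subject: Callable[[F, str], F],
    detect_audio: Callable[[A, str], Optional[C]],
    delete_audio: Callable[[A, C], A],
) -> Tuple[List[F], A]:
    """Delete a subject type from the frames and its audio component."""
    # Delete the subject only from frames in which it is detected.
    new_frames = [
        delete_subject(f, subject_type) if detect_subject(f, subject_type) else f
        for f in frames
    ]
    # Detect the corresponding audio component; delete it if one is found.
    component = detect_audio(audio, subject_type)
    new_audio = delete_audio(audio, component) if component is not None else audio
    return new_frames, new_audio


# Toy usage: a frame is a set of type labels, audio is a label-to-level map.
frames = [{"car", "dog"}, {"dog"}, {"car"}]
audio = {"car": 0.8, "dog": 0.3}
new_frames, new_audio = remove_subject_with_audio(
    frames, audio, "car",
    detect_subject=lambda f, t: t in f,
    delete_subject=lambda f, t: f - {t},
    detect_audio=lambda a, t: t if t in a else None,
    delete_audio=lambda a, c: {k: v for k, v in a.items() if k != c},
)
print(new_audio)  # {'dog': 0.3}
```

Passing the operations as parameters reflects the point that the procedure of FIG. 2 is only one of many possible configurations.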

[0057] [Second embodiment] In the first embodiment, a configuration is described in which an object to be deleted from video data is first determined, and then audio components corresponding to the object are deleted from the audio data. In contrast, in the second embodiment, a configuration is described in which an audio component to be deleted from audio data is first determined, and then the object corresponding to the audio component is deleted from the video data. In the second embodiment, the basic configuration including the hardware configuration (FIG. 1A) of the image processing system is the same as in the first embodiment. Below, differences from the first embodiment will be mainly described.

[0058] ● Functional configuration of image processing system
FIG. 5 is a diagram showing a functional configuration realized by cooperation of the hardware of the image processing system shown in FIG. 1A with a program (software). As shown in FIG. 5, the image processing system includes an audio type determination unit 501, an audio selection unit 502, an audio deletion unit 503, an object type determination unit 504, a type match determination unit 505, and an object deletion unit 506.

[0059] The audio type determination unit 501 has substantially the same functions as the audio type determination unit 145. However, the audio type determination unit 501 determines the types of audio contained in the audio data, either for a period designated by the user or for the entire period of the audio data, and outputs information indicating each type (for example, person, dog, car, or other).

[0060] The function of the audio selection unit 502 will be described later with reference to FIG. 6. The function of the audio deletion unit 503 is similar to that of the audio deletion unit 148.

[0061] The object type determination unit 504 has substantially the same functions as the object type determination unit 142. However, whereas the object type determination unit 142 determines the type of the subject included in a specific area of a specific frame, the object type determination unit 504 analyzes all frames in the period corresponding to the audio components deleted by the audio deletion unit 503. Furthermore, since the user does not specify an area, the object type determination unit 504 analyzes all pixels in each frame and outputs information including the type of each subject included in the frame (for example, person, dog, car, or other).

[0062] The function of the type match determination unit 505 is similar to that of the type match determination unit 147. The function of the object deletion unit 506 is similar to that of the object deletion unit 144.

[0063] ● Image processing flow
FIG. 6 is a flowchart of image processing executed by the image processing system. The target of the image processing is moving image data accompanied by audio data. As in the first embodiment, the audio data and the moving image data are recorded, for example, in the HDD 104. When the user of the information processing device 100 selects a function for deleting a subject in the user interface of a video editing application, the processing of this flowchart starts.

[0064] The overall control of this flowchart is performed by the CPU 101. The processing of each step of this flowchart is performed by the corresponding unit shown in FIG. 5. There are no particular limitations on the hardware that realizes the functions of each unit shown in FIG. 5; as long as it is technically possible, the functions may be realized by, for example, some or all of the CPU 101, GPU 135, CPU 131, and GPU 135.

[0065] In S601, the audio type determination unit 501 determines the types of audio contained in the audio data, separates the audio components by type, and displays the type of each audio component on the display unit 109.

[0066] An example of the processing in S601 will now be described with reference to FIG. 7. The upper part of FIG. 7 is a conceptual diagram of the audio data to be processed. "ALL" conceptually indicates audio data including all audio components, with the horizontal axis indicating time and the vertical axis indicating volume. The lower part of FIG. 7 is a conceptual diagram of each separated audio component. Audio components whose type cannot be determined are separated as "other" audio components. In the following, an example will be described in which audio data is separated into person A, person B, car A, dog A, and other audio components.
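The labeling of separated components can be sketched as follows. The text does not state how a type is judged to be undeterminable, so the confidence score, the threshold, the label set `KNOWN_TYPES`, and the function name `display_labels` are all assumptions introduced for illustration.

```python
from collections import Counter
from string import ascii_uppercase
from typing import List, Tuple

KNOWN_TYPES = ("person", "dog", "car")  # assumed label set


def display_labels(
    predictions: List[Tuple[str, float]], threshold: float = 0.5
) -> List[str]:
    """Turn (type, score) predictions into labels such as 'person A'.

    Components with an unknown type or a score below `threshold` are
    labeled 'other'. Supports up to 26 components per type.
    """
    seen: Counter = Counter()
    labels = []
    for type_name, score in predictions:
        if type_name not in KNOWN_TYPES or score < threshold:
            labels.append("other")
            continue
        # The first component of a type gets 'A', the second 'B', and so on.
        labels.append(f"{type_name} {ascii_uppercase[seen[type_name]]}")
        seen[type_name] += 1
    return labels


predictions = [("person", 0.9), ("person", 0.8), ("car", 0.7), ("dog", 0.95), ("bird", 0.9)]
print(display_labels(predictions))
# ['person A', 'person B', 'car A', 'dog A', 'other']
```

The printed labels match the example separation into person A, person B, car A, dog A, and other.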

[0067] In S602, the audio selection unit 502 selects a specific audio component corresponding to a specific type from the multiple audio components separated in S601. Here, the audio selection unit 502 may select an audio component designated by the user. In the following, an example will be described in which the user designates the audio component corresponding to car A. It is also assumed that, in video data having 500 frames, the audio component corresponding to car A is included in the audio data corresponding to the 101st to 400th frames.

[0068] In S603, the audio deletion unit 503 deletes the audio component selected in S602 (the target audio component) from the audio data. Audio components other than the selected audio component are not deleted. For example, the audio component corresponding to car A included in the audio data corresponding to the 101st to 400th frames is deleted.
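To make the correspondence between frames and audio data concrete, the sketch below maps the 101st to 400th frames to a range of audio samples and removes a separated component only within that range. The frame rate (30 fps), the sample rate (48,000 Hz), the subtraction-based deletion, and both function names are assumptions for illustration; the text specifies none of them.

```python
from typing import List, Tuple


def frame_range_to_samples(
    first_frame: int, last_frame: int, fps: int, sample_rate: int
) -> Tuple[int, int]:
    """Map a 1-indexed, inclusive frame range to a [start, end) sample range."""
    start = (first_frame - 1) * sample_rate // fps
    end = last_frame * sample_rate // fps
    return start, end


def delete_in_range(
    mixture: List[float], component: List[float], start: int, end: int
) -> List[float]:
    """Subtract `component` from `mixture` for samples in [start, end).

    Assumption: `component` is aligned with `mixture` and has the same length.
    """
    out = list(mixture)
    for i in range(max(start, 0), min(end, len(out))):
        out[i] -= component[i]
    return out


# Assumed values: 30 fps video with 48 kHz audio (1,600 samples per frame).
print(frame_range_to_samples(101, 400, fps=30, sample_rate=48000))
# (160000, 640000)
```

Samples outside the computed range are left untouched, which mirrors the statement that only the selected component in the corresponding period is deleted.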

[0069] In S604, the object type determination unit 504 determines the types of objects included in the video data. For example, the object type determination unit 504 determines the types of objects in the 101st to 400th frames, which correspond to the deleted audio component. In this embodiment, unlike the first embodiment, no area of a frame is selected by the area selection unit 141. Therefore, the object type determination unit 504 analyzes all pixels in each frame and outputs information indicating the type of each object included in the analyzed frame (for example, person, dog, car, etc.).

[0070] Note that the target of the process for determining the type of object in S604 is not limited to the frame corresponding to the deleted audio component. For example, the object type determination unit 504 may determine the type of object for all frames of the video data.

[0071] In S605, the type match determination unit 505 determines, based on the determination result in S604, whether an object corresponding to the type of the target audio component is present in the video data. For example, if the audio component of car A is the target audio component (the audio component deleted in S603), the type match determination unit 505 determines whether a car is included in the determination result in S604. If an object corresponding to the type of the target audio component is present in the video data, the process proceeds to S606; otherwise, the process of this flowchart ends.

[0072] In S606, the object deletion unit 506 deletes the object corresponding to the type of the target audio component from each frame of the video data in which such an object was detected through the processes of S604 and S605. Along with deleting the object, the object deletion unit 506 complements the area of the deleted object by blending it into the background.
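The control flow of S604 to S606 can be sketched as follows. Each frame is represented here by the set of object types found in it, which stands in for the output of the object type determination; this representation and the name `remove_matching_objects` are assumptions for illustration, and the complementing of the deleted area is omitted.

```python
from typing import List, Set, Tuple


def remove_matching_objects(
    frame_objects: List[Set[str]],
    target_type: str,
    first_frame: int,
    last_frame: int,
) -> Tuple[bool, List[Set[str]]]:
    """Return (matched, frames) after the S604 to S606 style processing.

    `first_frame` and `last_frame` are 1-indexed and inclusive.
    """
    out = [set(f) for f in frame_objects]  # work on copies
    # S604 (stand-in): restrict the analysis to the frames that correspond
    # to the deleted audio component.
    indices = range(first_frame - 1, min(last_frame, len(out)))
    # S605: is an object of the target type present in any analyzed frame?
    hits = [i for i in indices if target_type in out[i]]
    if not hits:
        return False, out  # no match: the flow ends without deletion
    # S606: delete the object from every frame in which it was detected.
    for i in hits:
        out[i].discard(target_type)
    return True, out


frames = [{"car"}, {"car", "dog"}, {"dog"}, {"car"}]
print(remove_matching_objects(frames, "car", 2, 3))
# (True, [{'car'}, {'dog'}, {'dog'}, {'car'}])
```

In this toy run only the second frame is changed, because the analysis is limited to frames 2 to 3 and only that frame contains a car.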

[0073] ● Summary of the second embodiment
As described above, according to the second embodiment, audio components corresponding to a specific subject (a first type of subject) are selected in the audio data, and the selected audio components are deleted from the audio data. In addition, the subject corresponding to the deleted audio components is deleted from the video data corresponding to the audio data. This prevents the video from being played back with audio components of a subject that is no longer displayed because it has been deleted. Therefore, according to this embodiment, it is possible to delete a specific subject from a video with audio while suppressing deterioration in the quality of the video with audio.

[0074] [Other embodiments] The present invention can also be realized by a process in which a program for implementing one or more of the functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of the system or device read and execute the program. The present invention can also be realized by a circuit (e.g., ASIC) that implements one or more of the functions.

[0075] [Summary] The above-described embodiments disclose at least the inventions shown in the following items, but are not limited to these inventions.

[Item 1] An image processing system comprising: a subject detection means for detecting a first type of subject in a plurality of frames of video data accompanied by audio data; a subject deletion means for deleting the first type of subject from one or more frames, among the plurality of frames, in which the first type of subject is detected; an audio detection means for detecting an audio component corresponding to the first type of subject in the audio data; and an audio deletion means for deleting the audio component corresponding to the first type of subject from the audio data.

[Item 2] The image processing system according to item 1, wherein the subject detection means identifies a type of subject included in a first region based on an image of the first region in a first frame of the plurality of frames, and the first type is the type of the subject included in the first region.

[Item 3] The image processing system according to item 2, wherein the subject detection means identifies the type of the subject included in the first region by performing inference using a first machine learning model with the image of the first region as an input.

[Item 4] The image processing system according to item 2 or 3, further comprising a region selection means for selecting the first region in the first frame in accordance with an instruction by a user.

[Item 5] The image processing system according to any one of items 2 to 4, further comprising: a subject vector acquisition means for acquiring a velocity vector, across the plurality of frames, of the subject included in the first region; and an audio vector acquisition means for acquiring a velocity vector, across the plurality of frames, of an audio component corresponding to the subject included in the first region, wherein the audio detection means detects the audio component corresponding to the first type of subject in the audio data by comparing the velocity vector of the subject included in the first region with the velocity vector of the audio component corresponding to the subject included in the first region.

[Item 6] The image processing system according to any one of items 1 to 4, wherein the audio detection means detects the audio component corresponding to the first type of subject in the audio data by performing inference using a second machine learning model with the audio data as an input.

[Item 7] The image processing system according to item 1, wherein the audio detection means detects a plurality of audio components in the audio data, each corresponding to a different type of subject, by performing inference using a second machine learning model with the audio data as an input, the image processing system further comprises an audio selection means for selecting one of the plurality of audio components, and the first type is the type of subject corresponding to the selected audio component among the plurality of audio components.

[Item 8] An image processing method executed by an image processing system, comprising: a subject detection step of detecting a first type of subject in a plurality of frames of video data accompanied by audio data; a subject deletion step of deleting the first type of subject from one or more frames, among the plurality of frames, in which the first type of subject is detected; an audio detection step of detecting an audio component corresponding to the first type of subject in the audio data; and an audio deletion step of deleting the audio component corresponding to the first type of subject from the audio data.

[Item 9] A program for causing a computer to function as each of the means of the image processing system according to any one of items 1 to 7.

[0076] The invention is not limited to the above-described embodiments, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, the following claims are appended to apprise the public of the scope of the invention.

[0077] [Explanation of symbols] 141: area selection unit, 142: object type determination unit, 143: object vector acquisition unit, 144: object deletion unit, 145: audio type determination unit, 146: audio vector acquisition unit, 147: type match determination unit, 148: audio deletion unit

Claims

1. An image processing system comprising: a subject detection means for detecting a subject of a first type in a plurality of frames of video data accompanied by audio data; a subject deletion means for deleting the subject of the first type from one or more frames, among the plurality of frames, in which the subject of the first type is detected; an audio detection means for detecting, in the audio data, an audio component corresponding to the subject of the first type; and an audio deletion means for deleting the audio component corresponding to the subject of the first type from the audio data.

2. The image processing system according to claim 1, wherein the subject detection means identifies the type of a subject included in a first region based on an image of the first region in a first frame among the plurality of frames, and the first type is the type of the subject included in the first region.

3. The image processing system according to claim 2, wherein the subject detection means identifies the type of the subject included in the first region by performing inference using a first machine learning model with the image of the first region as input.

4. The image processing system according to claim 2, further comprising a region selection means for selecting the first region in the first frame in accordance with a user instruction.

5. The image processing system according to claim 2, further comprising: a subject vector acquisition means for acquiring a velocity vector, over the plurality of frames, of the subject included in the first region; and an audio vector acquisition means for acquiring a velocity vector, over the plurality of frames, of the audio component corresponding to the subject included in the first region, wherein the audio detection means detects the audio component in the audio data that corresponds to the subject of the first type by comparing the velocity vector of the subject included in the first region with the velocity vector of the audio component corresponding to the subject included in the first region.

6. The image processing system according to claim 1, wherein the audio detection means detects the audio component in the audio data that corresponds to the subject of the first type by performing inference using a second machine learning model with the audio data as input.

7. The image processing system according to claim 1, wherein the audio detection means detects a plurality of audio components in the audio data, each corresponding to a different type of subject, by performing inference using a second machine learning model with the audio data as input, the image processing system further comprises an audio selection means for selecting one of the plurality of audio components, and the first type is the type of subject corresponding to the selected audio component among the plurality of audio components.

8. The image processing system according to claim 1, wherein the subject deletion means complements the region from which the subject of the first type has been deleted based on an image surrounding the region.

9. The image processing system according to claim 1, wherein the audio deletion means deletes the audio component corresponding to the subject of the first type and retains the other audio components included in the audio data.

10. The image processing system according to claim 1, wherein the audio deletion means separates the audio component corresponding to the first type of subject from a plurality of audio components contained in the audio data, and deletes the separated audio component.

11. The image processing system according to claim 1, wherein the audio deletion means deletes the audio component corresponding to the first type of subject from the audio data during the period corresponding to the frame in which the first type of subject is not detected among the plurality of frames.

12. The image processing system according to claim 1, wherein the audio detection means detects the audio component corresponding to the subject of the first type in the audio data corresponding to the periods before and after the period in which the subject of the first type was detected in the plurality of frames.

13. The image processing system according to claim 1, wherein the audio detection means determines a period in which the audio component is to be deleted based on a change in volume of the audio component corresponding to the subject of the first type.

14. An image processing method performed by an image processing system, comprising: a subject detection step of detecting a subject of a first type in a plurality of frames of video data accompanied by audio data; a subject deletion step of deleting the subject of the first type from one or more frames, among the plurality of frames, in which the subject of the first type is detected; an audio detection step of detecting, in the audio data, an audio component corresponding to the subject of the first type; and an audio deletion step of deleting the audio component corresponding to the subject of the first type from the audio data.

15. A program for causing a computer to function as each of the means of the image processing system according to any one of claims 1 to 13.