Image processing device, image processing method, and computer program
The image processing device dynamically applies effects to cropped regions based on subject and music information, addressing the lack of realism in automatic shooting techniques for live events.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- CANON KK
- Filing Date
- 2024-12-03
- Publication Date
- 2026-06-15
AI Technical Summary
Existing automatic shooting techniques for live events lack the ability to generate realistic images that convey the sense of movement and presence of performers, and require manual specification of virtual camera work.
An image processing device that detects specific subjects, crops predetermined regions, applies effects based on music-related information and subject information, and adjusts these effects dynamically.
Generates realistic images during automatic shooting at music events by enhancing the sense of movement and presence of performers, without manual intervention.
Abstract
Description
【Technical Field】 【0001】 The present invention relates to a video processing apparatus, a video processing method, a computer program, and the like. 【Background Art】 【0002】 In recent years, at the time of camera shooting and distribution during events such as music live shows and concerts, an automatic shooting technique using an IP remote camera or the like has been actively utilized for the purpose of labor saving (cost reduction). 【0003】 On the other hand, although it is possible to shoot to a certain level with conventional automatic shooting techniques, there is a problem that the expression of the sense of movement and the sense of presence of the performer, like when an actual professional cameraman is shooting, is insufficient, and there is a feeling of dissatisfaction as a video. 【0004】 For example, in the imaging device described in Patent Document 1, a method of extracting similar features from analysis of a reproduced image and reproducing a plurality of images subjected to movement deformation processing according to the rhythm, tempo, etc. included in the metadata of music data together with the music is disclosed. 【0005】 Also, in the imaging device described in Patent Document 2, a method of specifying virtual camera work, identifying a person from the recognition result and skeleton determination, and determining the trimming position and size using pre-stored composition information is disclosed. 【Prior Art Documents】 【Patent Documents】 【0006】 【Patent Document 1】 Japanese Patent Application Laid-Open No. 2010-237516 【Patent Document 2】 Japanese Patent Application Laid-Open No. 2021-119686 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0007】 However, Patent Document 1, mentioned above, is based on playback images after recording and does not target live video. While it considers the timing of image generation, it lacks flexibility in generation patterns for continuous frames like those in live video. 
Furthermore, Patent Document 2 has the problem that the user must manually specify the virtual camera work each time. 【0008】 The present invention has been made in view of the above circumstances, and one of its objectives is to provide an image processing device capable of generating realistic images during automatic shooting at music events and the like. 【Means for Solving the Problem】 【0009】 To achieve the above objective, an image processing device according to one aspect of the present invention comprises: a subject detection means that acquires video and detects a specific subject within the video; a cropping means that crops at least one predetermined region including the subject; a music-related information acquisition means that acquires music-related information; a determination means that determines subject information related to the subject; and an effect application means that applies a predetermined effect to the predetermined region cropped by the cropping means, wherein the effect application means determines the effect to be applied to the predetermined region according to the music-related information and the subject information. 【Effects of the Invention】 【0010】 According to the present invention, it is possible to provide an image processing device that can generate realistic images during automatic shooting at music events and the like. 【Brief Description of the Drawings】 【0011】 [Figure 1] It is a diagram showing an example of the hardware configuration of the camera 10 according to Embodiment 1. [Figure 2] It is a diagram showing an example configuration of the imaging unit 11 according to Embodiment 1. [Figure 3] It is a functional block diagram showing a configuration example of the image processing unit 30 of the camera 10 according to Embodiment 1. [Figure 4] It is a flowchart showing a processing example of the camera 10 according to Embodiment 1.
[Figure 5] It is a diagram showing an example of an image when a subject is photographed at a predetermined angle of view according to Embodiment 1. [Figure 6] It is a diagram showing an example of a table linking subject information detected on the image of FIG. 5. [Figure 7] (A) to (D) are diagrams showing examples of patterns for cropping a predetermined area including a person and musical instruments according to Embodiment 1. [Figure 8] (A) and (B) are diagrams showing examples of cropping positions and sizing methods according to Embodiment 1. [Figure 9] It is a diagram showing an example of music-related information according to Embodiment 1. [Figure 10] It is a diagram showing an example of data linking a part of the music-related information of FIG. 9. [Figure 11] It is a diagram showing a table of example patterns of effect application according to Embodiment 1. [Figure 12] It is a diagram showing an example of an image when a predetermined effect is applied to a cropping area according to Embodiment 1. [Figure 13] It is a diagram showing an example of a table indicating the level of effect application according to Embodiment 1. [Figure 14] (A) to (C) are diagrams showing examples of images representing changes in the level of the effect applied to the cropping area according to Embodiment 1. [Figure 15] It is a flowchart showing a processing example of a camera 10 according to Embodiment 2. [Figure 16] It is a diagram showing a configuration example of an orchestra as a shooting target according to Embodiment 2. [Figure 17] It is a diagram showing an example of picking up some subjects of an orchestra according to Embodiment 2. [Figure 18] It is a diagram showing an example of a table linking attributes and features of subjects detected on an image according to Embodiment 2. [Figure 19] It is a diagram showing an example of an image obtained by cropping a predetermined area including a person and an instrument according to Embodiment 2. 
[Figure 20] It is a diagram showing an example of music-related information according to Embodiment 2. [Figure 21] It is a diagram showing an example of a table indicating a pattern of effect application according to Embodiment 2. 【Mode for Carrying Out the Invention】 【0012】 Hereinafter, embodiments of the present invention will be described with reference to the drawings. However, the present invention is not limited to the following embodiments. In each figure, the same members or elements are given the same reference numerals, and duplicate explanations are omitted or simplified. In the following description, a case where a camera is used as the video processing device will be described. 【0013】 <Embodiment 1> Figure 1 is a diagram showing an example of the hardware configuration of the camera 10 according to Embodiment 1. The camera 10 includes, as a hardware configuration, an imaging unit 11, a CPU 12 as a computer, a memory 13 as a storage medium, an input unit 14, a display unit 15, and a communication unit 16. 【0014】 The imaging unit 11 captures a subject image. Details of the imaging unit 11 are shown in Figure 2, described later. The CPU 12 controls the entire camera 10. The memory 13 stores computer programs, data sets, images captured by the imaging unit 11, set values, and the like. 【0015】 The input unit 14 receives user selection operations, external sound, and the like, and passes them to the CPU 12. The display unit 15 displays an image or the like based on the control of the CPU 12. The communication unit 16 connects the camera 10 to a network and controls communication with other devices. 【0016】 Furthermore, the communication unit 16 may connect to the internet to acquire various information.
The CPU 12 executes processing based on the computer program stored in the memory 13, thereby realizing the software configuration of the camera 10 shown in Figure 3, which will be described later, and the flowchart processing shown in Figures 4 and 15, which will be described later. The camera 10 is an example of an image processing device. 【0017】 Figure 2 shows an example of the configuration of the imaging unit 11 according to Embodiment 1. The imaging unit 11 includes a lens 201 consisting of several lens groups, and an image sensor 202 such as a CCD or CMOS. The image sensor 202 converts the image of the subject formed through the lens 201, which serves as the imaging optical system, into an electrical signal. 【0018】 Furthermore, the imaging unit 11 is equipped with a CDS circuit 203 for noise reduction. CDS stands for Correlated Double Sampling. The CDS circuit 203 performs correlated double sampling on the electrical signal output from the image sensor 202. 【0019】 Furthermore, the imaging unit 11 is equipped with an AGC amplifier 204 that automatically controls the camera's gain. AGC stands for Automatic Gain Control. The AGC amplifier 204 performs amplification processing on the electrical signal output from the CDS circuit 203. The A / D converter 205 converts the analog signal amplified by the AGC amplifier 204 into a digital signal. The output of the A / D converter 205 is supplied to the image processing unit 30 shown in Figure 3. 【0020】 Figure 3 is a functional block diagram showing an example configuration of the image processing unit 30 of the camera 10 according to Embodiment 1. Note that some of the functional blocks shown in Figure 3 are realized by having the CPU, which acts as a computer within the camera, execute a computer program stored in the memory, which acts as a storage medium. 【0021】 However, some or all of these can be implemented in hardware. 
Hardware options include dedicated circuits (ASICs) and processors (reconfigurable processors, DSPs). 【0022】 Furthermore, the functional blocks shown in Figure 3 do not necessarily have to be housed in the same enclosure; they may be composed of separate devices connected to each other via signal paths. 【0023】 The imaging control unit 301 controls the imaging unit 11 and passes the signals obtained from the imaging unit 11 to the image generation unit 302. The image generation unit 302 generates an image based on the received signals. The detection unit 303 detects people and other specific objects in the image. 【0024】 The music information acquisition unit 304 acquires various information related to music. The determination unit 305 determines the relationships between multiple elements. The crop position setting unit 306 sets the position of the area to be cropped on the image. 【0025】 The crop size setting unit 307 sets the size of the area to be cropped on the image. The cropping unit 308 crops a predetermined area on the image. The effect application unit 309 applies various effects to a predetermined image. 【0026】 The output unit 310 outputs the entire image generated by the image generation unit 302 or the cropped image generated by the cropping unit 308. The output unit 310 may also output a continuous sequence of images in the time direction as live video and video. The output of the output unit 310 is supplied to the communication unit 16 and the memory 13. 【0027】 The imaging control unit 301 controls the imaging unit 11 and passes the luminance signal and color signal obtained from the imaging unit 11 to the image generation unit 302. The imaging control unit 301 also functions as an image acquisition means for acquiring video. The image generation unit 302 generates image signals such as RGB images and YUV images from the obtained luminance signal and color signal. 
【0028】 The detection unit 303 detects people, people's attributes, people's characteristics (state), or specific objects from the image generated by the image generation unit 302. The detection unit 303 also functions as a subject detection means for detecting specific subjects within the video. 【0029】 Here, a person's attributes include at least one of the following: age, gender, hairstyle, hair color, clothing color, and accessories. Furthermore, a person's characteristics (state) include at least one of the following: movement such as sitting or walking, specific posture such as posing, mouth movements, and facial expressions including smiles. 【0030】 In addition, specific objects other than people include audio equipment such as microphones held by people, and types of musical instruments with specific shapes (hereinafter, "musical instruments" will also include audio equipment). The objects to be detected by the detection unit 303 may be registered in advance as reference data used for determination such as pattern matching and feature matching. 【0031】 Furthermore, the detection unit 303 may determine whether a specific object exists based on values such as brightness information, color information, spatial frequency, and contrast on the image. In addition, the detection unit 303 may perform object detection by performing deep learning using an existing neural network such as a CNN (Convolutional Neural Network). 【0032】 The music information acquisition unit 304 may acquire external music from an input unit 14 such as a microphone, or it may acquire music-related information that can be acquired as metadata via the input unit 14. The music information acquisition unit 304 functions as a means for acquiring music-related information that is related to music. 
【0033】 Music-related information here includes at least one of the following: melody, tempo, time signature, genre, song structure, song title, artist name, lyrics, composer, musical form, key signature, performance time, instrumentation, instrument placement, vocal volume, sound volume, tone quality, and musical score. 【0034】 Furthermore, music-related information may be stored in memory 13 as a database beforehand, or various related information may be obtained from an external server or the like via the communication unit 16, based on some characteristics such as the song title or the melody of the music being played. 【0035】 The determination unit 305 determines the multiple elements detected by the detection unit 303. For example, it may determine the part and role of a person playing music based on a specific person detected by the detection unit 303 and a microphone, guitar, drum set, etc., detected in contact with or near that person. 【0036】 Alternatively, it may be determined whether the detected person is singing based on their mouth movements. Furthermore, the music-related information acquired by the music information acquisition unit 304 and the person's performance part may be compared or matched in chronological order. The determination unit 305 functions as a determination means for determining subject information related to the subject. 【0037】 The crop position setting unit 306 sets the cropping position on the image, including the specific person or object detected by the detection unit 303, according to the determination result of the determination unit 305. For example, the cropping position may be set considering the position on the image including the musical instrument being played by the specific person. 【0038】 The crop size setting unit 307 sets the size of the image to be cropped, including the specific person or object detected by the detection unit 303, according to the determination result of the determination unit 305. 
In this case, for example, the crop size may be set to cover the musical instrument being played by the specific person plus a predetermined margin on the image. 【0039】 The cropping unit 308 crops the image generated by the image generation unit 302 using the cropping position and cropping size set by the crop position setting unit 306 and the crop size setting unit 307. The cropping unit 308 functions as a cropping means that crops at least one predetermined area including the subject. 【0040】 The effect application unit 309 applies at least one image processing effect to a part or all of the image cropped by the cropping unit 308, such as changing the position or size of a predetermined area, blurring, panning, tilting, zooming in, or zooming out. The effect application unit 309 functions as an effect application means that applies a predetermined effect to a predetermined area cropped by the cropping means. 【0041】 To apply blur, the effect application unit 309 may, for example, use a general two-dimensional filter that replaces the value of each pixel with an average of the values of the pixels adjacent to it in the vertical or horizontal direction, weighted in the direction in which blur is to be applied. Alternatively, it may use a general three-dimensional filter that averages the values of the pixels in the time direction. 【0042】 The output unit 310 outputs an image of a predetermined area cropped by the cropping unit 308. If multiple cropping areas are set, they may be output simultaneously, or the output may be switched sequentially among the entire image and the individual cropping areas.
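The blur described in paragraph 【0041】 can be illustrated with a minimal sketch. It assumes a grayscale image held as nested lists, clamped edges, and an odd-length kernel; the function names `directional_blur` and `temporal_blur` and the kernel values are hypothetical and are not taken from the specification.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def directional_blur(image: Image, weights: List[float], horizontal: bool = True) -> Image:
    """Replace each pixel with a weighted average of its neighbours along one axis.

    `weights` is an odd-length kernel (illustrative values only); pixels beyond
    the image border are taken from the nearest edge pixel.
    """
    if len(weights) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(weights) // 2
    total = sum(weights)
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(weights):
                offset = k - half
                if horizontal:
                    sx = min(max(x + offset, 0), width - 1)
                    acc += w * image[y][sx]
                else:
                    sy = min(max(y + offset, 0), height - 1)
                    acc += w * image[sy][x]
            out[y][x] = acc / total
    return out


def temporal_blur(frames: List[Image]) -> Image:
    """Average each pixel position over consecutive frames (time direction)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]
```

A wider kernel or more frames gives a stronger blur, which is one plausible way an effect level could be mapped onto this filter.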
【0043】 Furthermore, when the processing load of the effects applied by the effect application unit 309 is high, the image or video before the effect is applied may be temporarily stored (buffered) in the memory 13, and the effect may then be applied to the buffered image or video before output. 【0044】 In other words, the effect application unit 309, as an effect application means, may temporarily store the acquired video and then apply the effect to the temporarily stored video. Furthermore, the lyrics acquired by the music information acquisition unit 304 may be superimposed on the image or video as text and output. 【0045】 Figure 4 is a flowchart showing an example of processing by camera 10 according to Embodiment 1. With reference to Figure 4, an example of processing is described in which camera 10, when filming a band performance, applies a predetermined effect to the cropped area according to the detection results on the captured image and various acquired information. 【0046】 The CPU and other components of the camera 10, acting as a computer, execute the computer program stored in memory, thereby sequentially performing the operations of each step in the flowchart shown in Figure 4. 【0047】 First, in step S101, subject information on the image is acquired. That is, the detection unit 303 acquires subject information by detecting it in the image generated via the imaging control unit 301 and the image generation unit 302. Here, an example of subject information acquisition will be explained using Figures 5 and 6. 【0048】 Figure 5 shows an example of an image taken when a subject is photographed at a predetermined angle of view according to Embodiment 1, and Figure 6 shows an example of a table linking subject information detected on the image in Figure 5. 【0049】 Image 50 in Figure 5 is an example of an image captured by camera 10 at a predetermined angle of view, showing multiple people performing live as a music band.
Image 50 may also be part of a video captured in real time or a continuous video. 【0050】 In Figure 5, 51 to 54 denote different people, while 55 to 59 are simplified representations of musical instruments or parts of musical instruments: 55 is a guitar, 56 a microphone, 57 a bass guitar, 58 a drum set, and 59 drumsticks. 【0051】 Generally, when filming a live performance of a music band, equipment such as speakers, amplifiers, cables, and lighting equipment, as well as the audience, may be captured in the image, but these are omitted in this diagram. In step S101, the detection unit 303 detects a person or a specific object other than a person that is present in the image in Figure 5. Here, step S101 functions as a subject detection step that acquires video and detects a specific subject within the video. 【0052】 As mentioned above, specific objects include audio equipment such as microphones held in a person's hand, and types of musical instruments with specific shapes (hereinafter, "musical instruments" in this embodiment also include audio equipment). 【0053】 The table in Figure 6 shows an example of subject information, which includes person information, object information, performance parts and roles, etc. Person information includes attributes and characteristics of people 51 to 54 in Figure 5, such as gender, age, hairstyle, hair color, and clothing color. 【0054】 The detection unit 303 detects the attributes and characteristics of these people and links them to their person IDs. The detection unit 303 also functions as a linking means that links various types of subject information together. The relationship between people 51 to 54 in Figure 5 and the person IDs in the table in Figure 6 is as follows: person 51 corresponds to person ID=1, person 52 to person ID=2, person 53 to person ID=3, and person 54 to person ID=4.
【0055】 Furthermore, the object information in the table in Figure 6 includes the types of musical instruments and audio equipment that overlap with or are adjacent to the person area on the image associated with each person ID. These are detected by the detection unit 303 and associated with each person ID. 【0056】 Furthermore, as shown in the table in Figure 6, a performance part / role is linked to the person ID according to the type of instrument detected by the detection unit 303. For example, if the detected instrument types are a drum set and drumsticks, drums (Dr.) may be linked as the performance part / role. 【0057】 On the other hand, although a microphone is not strictly a musical instrument, if one is detected and associated with a person, that person can be linked as a vocalist (Vo.). Multiple performance parts / roles, such as guitar (Gt.) and vocals (Vo.), can also be linked to a single person ID. 【0058】 Thus, in this embodiment, the subject information includes at least one of the following: a person, the person's attributes, the person's characteristics, the type of musical instrument, and the type of audio equipment. Furthermore, the detection unit 303, as a linking means, links the person with the type of performance part or role corresponding to the type of musical instrument, according to the person, the person's attributes, and the type of musical instrument detected at a position overlapping with or touching the area of the person. 【0059】 Next, in step S102, it is determined whether the subject information matches a predetermined feature pattern. That is, the determination unit 305 determines whether the subject information detected by the detection unit 303 matches a predetermined feature pattern. For example, if a person who does not correspond to any person ID in Figure 6 is detected, the determination in step S102 is No, and in that case the processing flow in Figure 4 is terminated.
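The linking of instruments to performance parts in paragraphs 【0055】 to 【0058】, and the Yes/No determination of step S102, might be sketched as follows. This is an illustrative sketch only: the record layout, the names `SubjectRecord`, `link_parts` and `is_registered`, and the abbreviation "Ba." for bass are assumptions, while "Dr.", "Vo." and "Gt." come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Instrument type -> performance part / role.
# "Dr.", "Vo." and "Gt." appear in the description; "Ba." is an assumed label.
PART_BY_INSTRUMENT: Dict[str, str] = {
    "drum set": "Dr.",
    "drumsticks": "Dr.",
    "microphone": "Vo.",
    "guitar": "Gt.",
    "bass": "Ba.",
}


@dataclass
class SubjectRecord:
    """One row of a Figure 6 style table (hypothetical layout)."""
    person_id: int
    attributes: Dict[str, str] = field(default_factory=dict)
    instruments: List[str] = field(default_factory=list)
    parts: List[str] = field(default_factory=list)


def link_parts(record: SubjectRecord) -> SubjectRecord:
    """Link performance parts to a person from the instruments detected nearby."""
    parts: List[str] = []
    for instrument in record.instruments:
        part = PART_BY_INSTRUMENT.get(instrument)
        if part is not None and part not in parts:
            parts.append(part)  # one person may hold several parts
    record.parts = parts
    return record


def is_registered(person_id: int, table: Iterable[SubjectRecord]) -> bool:
    """Step S102 in simplified form: Yes only if the person ID is in the table."""
    return any(row.person_id == person_id for row in table)
```

For example, a person detected with both a guitar and a microphone ends up with the parts `["Gt.", "Vo."]`, matching the multiple-role case in paragraph 【0057】.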
【0060】 On the other hand, if the result in step S102 is Yes, that is, if it is determined that the subject corresponds to a predetermined person ID based on multiple features in the subject information detected in step S101, the process proceeds to step S103. Here, step S102 functions as a determination step that determines subject information related to the subject. 【0061】 In step S103, a predetermined area containing a specific subject is cropped. That is, if it is determined that the subject information detected in step S101 corresponds to a predetermined person ID based on multiple features, a predetermined area on the image containing the person or musical instrument associated with that person ID is cropped according to the table in Figure 6. Step S103 functions as a cropping step that crops at least one predetermined area containing a subject. 【0062】 Figures 7(A) to 7(D) show examples of patterns for cropping a predetermined area including a person and a musical instrument according to Embodiment 1. Images 71 to 74 in Figures 7(A) to 7(D) are examples in which a predetermined area including a person and a musical instrument linked to a person ID in Figure 6 is cropped by the cropping unit 308 from the image in Figure 5. 【0063】 Figures 8(A) and 8(B) show examples of how to determine the crop position and size according to Embodiment 1. The crop position set by the crop position setting unit 306 can be represented by the x (horizontal) and y (vertical) coordinates in Figure 8(A). 【0064】 For example, 80 in Figure 8(A) represents an example of how to determine the center when cropping a predetermined area that includes person 53 and bass 57 in Figure 5, which are associated with person ID=3 in the table in Figure 6. As shown in Figure 8(A), the center of the cropped area may be set to the midpoint 83 of the line segment connecting the centroid 81 of person 53 and the centroid 82 of bass 57 on the image.
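The center determination of Figure 8(A) in paragraph 【0064】 reduces to two centroids and the midpoint between them. The sketch below assumes each detected subject is available as a list of (x, y) pixel coordinates; the function names `centroid` and `crop_center` are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def centroid(pixels: Iterable[Point]) -> Point:
    """Mean x and mean y of the pixels belonging to one detected subject."""
    xs, ys = zip(*pixels)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def crop_center(person_pixels: Iterable[Point],
                instrument_pixels: Iterable[Point]) -> Point:
    """Midpoint of the segment joining the person and instrument centroids."""
    px, py = centroid(person_pixels)
    ix, iy = centroid(instrument_pixels)
    return ((px + ix) / 2, (py + iy) / 2)
```

If only bounding boxes are available rather than pixel masks, the box centers could stand in for the centroids; that substitution is an assumption, not something the description states.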
【0065】 Furthermore, the size and position to be set may be a rectangular area obtained by adding a predetermined margin to the rectangular area centered on the detected central position of the person and instrument associated with the person ID to be cropped, or they may be set so as to maintain an arbitrary aspect ratio or resolution. 【0066】 In this case, when converting the cropped area to an arbitrary resolution or aspect ratio, an enlargement process using a general pixel interpolation method, such as bilinear or bicubic interpolation, may be performed. Alternatively, as shown in Figure 7(A), a masking process may be performed to superimpose or add a black mask area 75. 【0067】 Thus, in this embodiment, the image may be output externally after performing aspect ratio conversion, resolution conversion, scaling, or masking on a predetermined area cropped by the cropping means. 【0068】 Furthermore, Figure 8(B) shows an example in which the crop size setting unit 307 sets v (width) and h (height) as the crop size by adding predetermined margins of a predetermined number of pixels in the horizontal and vertical directions. 【0069】 The center set by the crop position setting unit 306 in this case may be the center 86 of the rectangular region 85 in Figure 8(B), which is bounded by the perpendicular tangent lines at the top, bottom, left, and right edges of the common area including person 53 and bass 57 in Figure 5, which are associated with person ID=3. Furthermore, the system may be configured so that the area including the person and instrument associated with a given person ID is deliberately not cropped. 【0070】 Next, in step S104, the music information acquisition unit 304 acquires music-related information. Here, step S104 functions as a music-related information acquisition step, which acquires music-related information that is relevant to music.
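The sizing method of Figure 8(B) in paragraphs 【0068】 and 【0069】 can be sketched as a common bounding rectangle that is then widened by margins. The box layout (left, top, right, bottom), the clamping to the image bounds, and the names `CropRegion` and `crop_with_margin` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class CropRegion:
    center: Tuple[float, float]  # center of the common bounding rectangle
    left: int
    top: int
    v: int  # crop width
    h: int  # crop height


def crop_with_margin(person: Box, instrument: Box,
                     margin_x: int, margin_y: int,
                     image_w: int, image_h: int) -> CropRegion:
    """Bound person and instrument together, then add margins on every side."""
    # Common rectangle touching the outermost edges of both boxes.
    left = min(person[0], instrument[0])
    top = min(person[1], instrument[1])
    right = max(person[2], instrument[2])
    bottom = max(person[3], instrument[3])
    center = ((left + right) / 2, (top + bottom) / 2)
    # Widen by the margins, without leaving the image (clamping is an assumption).
    crop_left = max(left - margin_x, 0)
    crop_top = max(top - margin_y, 0)
    crop_right = min(right + margin_x, image_w)
    crop_bottom = min(bottom + margin_y, image_h)
    return CropRegion(center, crop_left, crop_top,
                      crop_right - crop_left, crop_bottom - crop_top)
```

Because the margins are symmetric, the reported center stays at the center of the common rectangle unless clamping at an image edge shrinks one side.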
【0071】 Figure 9 shows an example of music-related information according to Embodiment 1. In Figure 9, an example of various information groups related to a song is shown, where the "Artist Name" is "Hanako Suzuki" and the "Song Title" is "Ashita e" (To Tomorrow). 【0072】 In Figure 9, the "genre" is listed as "pop," but it is also possible to associate it with various other music genres such as "jazz," "rock," and "hard rock." Furthermore, "BPM (Beats Per Minute)" in Figure 9 represents the number of beats per minute in a song. 【0073】 Furthermore, the "song structure" in Figure 9 shows examples of the various parts of the song's structure, including the "intro," "verse A (the part where the vocals come in after the intro)," "verse B (the part connecting the verse A and the chorus)," and "chorus (the highlight of the song)." 【0074】 Furthermore, sections such as "interlude (a section without vocals, such as a guitar solo)," "C-melody (a section that adds variation to the song and builds up to the second half)," and "outro (the end of the song)" are also indicated, and when the song is performed, they are defined in the order shown in the table in Figure 9. 【0075】 Furthermore, these expressions are widely used and known within the Japanese music industry, and outside of Japan, they may be replaced with expressions of similar meaning in terms of song structure. 【0076】 Furthermore, if necessary, information such as the time signature and number of measures related to the song may be obtained. Additionally, when obtaining a series of music-related information, the necessary information may be directly obtained from an external source as metadata, where attributes and content are paired, as shown in the table in Figure 9. 
【0077】 Alternatively, the music information obtained can be used to estimate the song via a song estimation service over the internet, and other music-related information associated with that song can be obtained, or the music-related information can be stored in the memory 13 of the camera 10 beforehand. 【0078】 Figure 10 shows an example of data linked to some of the music-related information in Figure 9. Waveform 100 in Figure 10 represents the melody of the song in the table in Figure 9 using a specific waveform format, and shows an example of information linked to the song's structure from start to finish. This kind of information can be digitized and obtained as music-related information. 【0079】 Next, in step S105, it is determined whether the music-related information corresponds to a specific pattern. That is, the determination unit 305 determines whether the subject information detected by the detection unit 303 and the music-related information acquired by the music information acquisition unit 304 correspond to a specific pattern. 【0080】 Figure 11 is a diagram showing a table of example effect application patterns according to Embodiment 1. For example, as shown in the table in Figure 11, the patterns to be applied are set and linked together according to the combination of subject information (role / instrument) and music-related information (song structure). That is, in this embodiment, music-related information, subject information, and effect patterns are pre-linked and stored in a storage means (e.g., memory 13). 【0081】 Then, in the combination of subject information (role / instrument) and music-related information (song structure) as shown in Figure 11, if there is a corresponding pattern for applying the effect, step S105 is determined to be Yes and the process proceeds to step S106. On the other hand, if step S105 is determined to be No, the process flow shown in Figure 4 is terminated. 
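The pattern lookup of step S105 can be pictured as a table keyed by performance part and song section. In the sketch below, the three table entries mirror examples given in this description (vocalist during the chorus and outro, drummer during the chorus); the section start times, the dictionary layout, and the names `current_section` and `select_effect` are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# (start time in seconds, section). Section names follow the song structure
# in the description; the start times are made-up placeholders.
TIMELINE: List[Tuple[float, str]] = [
    (0.0, "intro"), (15.0, "verse A"), (45.0, "verse B"), (60.0, "chorus"),
    (90.0, "interlude"), (110.0, "C-melody"), (130.0, "outro"),
]

# (part, section) -> (target of the crop, effects to apply).
EFFECT_PATTERNS: Dict[Tuple[str, str], Tuple[str, List[str]]] = {
    ("Vo.", "chorus"): ("face", ["zoom in", "blur"]),
    ("Vo.", "outro"): ("hands", ["zoom in"]),
    ("Dr.", "chorus"): ("face and drum set", ["zoom in/out", "blur"]),
}


def current_section(elapsed: float) -> str:
    """Return the song section that contains the elapsed playback time."""
    starts = [start for start, _ in TIMELINE]
    index = bisect_right(starts, elapsed) - 1
    if index < 0:
        raise ValueError("elapsed time precedes the start of the song")
    return TIMELINE[index][1]


def select_effect(part: str, section: str) -> Optional[Tuple[str, List[str]]]:
    """Step S105 in simplified form: None corresponds to a No determination."""
    return EFFECT_PATTERNS.get((part, section))
```

Returning `None` for an unlisted combination corresponds to the No branch of step S105, where no effect is applied and the flow ends.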
【0082】 In step S106, the corresponding effect from the table in Figure 11 is applied to the area cropped by the cropping unit 308. For example, if Figure 9 or Figure 10 indicates that the current part of the song is the chorus, then, according to the table in Figure 11, the vocalist's face is zoomed in on and a blur effect is applied. 【0083】 Specifically, the area cropped to capture the whole of person 52, the vocalist, is zoomed in on the person's face, so that the cropped area is corrected as shown in image 72 of Figure 7, and blur is added. 【0084】 Furthermore, as the song progresses to the outro, the cropped area that was showing a close-up of person 52's face is corrected to show a close-up of person 52's hands, as shown in the table in Figure 11, and a zoom-in effect is added. 【0085】 Here, step S106 functions as an effect application step (effect application means) that applies a predetermined effect to the predetermined area cropped in the cropping step, and determines the effect to be applied to the predetermined area according to the music-related information and the subject information. 【0086】 In this embodiment, if the current song structure is in the chorus section, then, according to the table in Figure 11, the cropped area (Figure 7(C)) that includes the face of person 54, whose role is drums, and the drum set is zoomed in and out, and blur is added. 【0087】 Figure 12 shows an example of an image when a predetermined effect is applied to the cropped area according to Embodiment 1. Here, Figure 12 shows images 120 and 121 as examples in which a zoom-in effect is applied, while adding blur radially from the center of the cropped area, to image 74, in which the person whose role in the table of Figure 6 is drums and the drum set are cropped. 【0088】 As shown in Figure 12, by repeatedly zooming in and out at a predetermined speed and for a predetermined time, it is possible to create a sense of power in the drummer's performance during the chorus.
Here, the effect application pattern in Figure 11 is one example for the case where the song in Figure 9 is performed by a band made up of the individuals in the roles shown in Figure 6. 【0089】 Alternatively, the effect pattern may be selected, using training data or the like, from multiple candidate patterns that take into account the artist, genre, BPM, etc., of the song in question. The effect pattern may also be selected arbitrarily from multiple effect patterns stored in advance in the memory 13 of the camera 10. 【0090】 Figure 13 is a diagram showing an example of a table of effect application levels according to Embodiment 1. In addition to the effect application patterns described above, the level of effect application may also be changed according to the genre of music, etc., as shown in the table in Figure 13. In other words, the level of the effect applied by the effect application means may be changed according to the music-related information. 【0091】 The table in Figure 13 shows how the effect level varies with the genre and BPM of the music. Ten effect levels are defined, numbered 1 to 10, and a higher number indicates a stronger effect. 【0092】 Figures 14(A) to (C) are examples of images showing the change in the level of the effect applied to the cropped area according to Embodiment 1. 【0093】 Image 72 in Figure 14(A) shows a close-up of the cropped area of the face of the person whose role is vocalist, and image 140 in Figure 14(B) shows an example in which a blur effect of level 2 is applied for a song of the genre "jazz" with a BPM of 120. 【0094】 Image 141 in Figure 14(C) shows an example in which a blur effect of level 6 is applied during playback of a song of the "rock" genre with a BPM of 80. Here, the effect level is changed only by the genre and BPM of the song. 
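The level selection described for Figure 13 can be sketched as a second lookup, followed by a conversion from level to effect strength. This is only an illustrative sketch: the two rows are the examples given for Figures 14(B) and 14(C), while the BPM ranges around them, the names `EFFECT_LEVELS`, `effect_level`, and `blur_radius`, and the linear mapping to a blur radius in pixels are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical excerpt of the table in Figure 13.
# Each row: (genre, min_bpm inclusive, max_bpm exclusive, effect level 1-10).
# The levels for jazz at BPM 120 and rock at BPM 80 come from the text;
# the BPM ranges bounding each row are assumed for illustration.
EFFECT_LEVELS: List[Tuple[str, int, int, int]] = [
    ("jazz", 100, 140, 2),
    ("rock", 60, 100, 6),
]


def effect_level(genre: str, bpm: float) -> Optional[int]:
    """Return the effect level for a genre/BPM pair, or None if not listed."""
    for row_genre, min_bpm, max_bpm, level in EFFECT_LEVELS:
        if row_genre == genre and min_bpm <= bpm < max_bpm:
            return level
    return None


def blur_radius(level: int, max_radius_px: float = 20.0) -> float:
    """Map a level from 1 to 10 linearly onto a blur radius (assumed mapping)."""
    if not 1 <= level <= 10:
        raise ValueError("effect level must be between 1 and 10")
    return max_radius_px * level / 10


print(effect_level("jazz", 120), effect_level("rock", 80))
```

A higher level yields a larger radius, which matches the statement that a higher number indicates a stronger effect; any monotonic mapping would serve equally well.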
【0095】 However, the effect level may be changed more precisely based on other information related to the song, such as changes in the volume of the vocals or changes in the volume or tone of the instruments played by each part. Furthermore, the effect level may be changed based on changes in a person's facial expressions, movements, or specific poses. 【0096】 As shown in Figures 14(A) to (C), by varying the intensity of the effects depending on the genre of music and BPM, it becomes possible to differentiate between relatively mellow songs and intense songs in terms of expression. 【0097】 <Embodiment 2> Figure 15 is a flowchart showing an example of processing by camera 10 according to Embodiment 2, illustrating an example of processing in which camera 10 applies a predetermined effect to the cropped area according to the detection results on the captured image and various acquired information when photographing an orchestra. 【0098】 Furthermore, the CPU and other components of the camera 10, acting as a computer, execute the computer program stored in memory, thereby sequentially performing the actions of each step in the flowchart shown in Figure 15. 【0099】 The configuration of camera 10 is the same as in Figures 1 to 3 described above, so its explanation is omitted. Also, regarding steps S101 to S106 in Figure 15, the process is the same as that described in Embodiment 1, so the explanation of the overlapping parts is omitted. 【0100】 Furthermore, Figure 16 is a diagram showing an example of the configuration of the orchestra to be filmed according to Embodiment 2, and it is assumed that the camera 10 is filming from a predetermined position so as to capture the entire orchestra within the field of view. 【0101】 Furthermore, the system may use multiple cameras with similar functions and performance, not just camera 10. 
It may also be a system where camera 10 and the other cameras film the orchestra from different positions and share subject information and music-related information via a communication means such as the communication unit 16. 【0102】 Figure 17 is a diagram showing an example of selecting some subjects from the orchestra according to Embodiment 2, in which only the "first violins" and "second violins" are selected from the orchestra arrangement in Figure 16, and the people and equipment of the other performance parts are omitted. 【0103】 In Figure 17, persons 171 to 174 each hold a violin 175, but in terms of their roles, persons 171 and 172 are first violinists, while persons 173 and 174 are second violinists. Since it is difficult to determine their roles from the instruments alone, the roles may also be determined from the known orchestra arrangement in Figure 16 together with the relative position and shooting direction of camera 10. 【0104】 Figure 18 is a diagram showing an example of a table linking the attributes and characteristics of subjects detected on an image according to Embodiment 2. In Figure 18, the table shows the subject information, including the person information of persons 171 to 174 in Figure 17, the object information, and the performance part / role of each person, linked to person IDs 11 to 14. 【0105】 In addition, the table in Figure 18 includes person information such as person ID, gender, age, hairstyle, whether or not glasses are worn, and clothing color; the object information shows an example of the type of instrument (violin); and the performance part / role shows examples of the first violin and second violin. 【0106】 Furthermore, in the table in Figure 18, person 171 in Figure 17 corresponds to person ID=11, person 172 corresponds to person ID=12, person 173 corresponds to person ID=13, and person 174 corresponds to person ID=14. 
【0107】 Because orchestra musicians typically wear black suits and vary relatively little in hairstyle, accessories such as glasses worn by individuals may also be detected in order to identify them and link them to an ID. 【0108】 Furthermore, although the table in Figure 18 lists only the subset of violinists shown in Figure 17, the instruments and individuals in the entire orchestra being photographed can be linked to person IDs in the same way. 【0109】 In step S201 of the flowchart in Figure 15, it is determined whether the similarity of the subjects is at or above a predetermined level. If the determination is Yes, the process proceeds to step S202; if the determination is No, the process proceeds to step S103, where a predetermined area containing the specific subject is cropped. After that, the process proceeds to step S104. 【0110】 In step S202, multiple subjects with high similarity are grouped together. Specifically, the determination unit 305 groups together musical instruments (for example, stringed instruments such as violins) and the people related to those instruments whose similarity of features on the image is at or above a predetermined level. 【0111】 In steps S201 and S202, the determination unit 305, acting as a determination means, determines the similarity of multiple subjects detected by the subject detection means and groups the subjects whose similarity is at or above a predetermined level. 【0112】 In step S203, a predetermined area containing the grouped subjects is cropped. Specifically, the crop position setting unit 306 and the crop size setting unit 307 determine the position and size, respectively, so that the predetermined area contains the multiple musical instruments, and the people who play them, whose similarity is at or above a predetermined level, and then the cropping unit 308 performs the cropping. 
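Steps S201 to S203 can be sketched as threshold-based grouping followed by a bounding rectangle around each group. The text does not specify how similarity is computed, so the cosine similarity over a feature vector, the greedy grouping strategy, the names `Subject`, `group_subjects`, and `crop_box`, and all coordinate and feature values below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Subject:
    person_id: int
    features: Sequence[float]  # hypothetical image feature vector
    box: Box                   # region covering the person and instrument


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_subjects(subjects: List[Subject], threshold: float) -> List[List[Subject]]:
    """S201/S202: a subject joins the first group whose first member is similar enough."""
    groups: List[List[Subject]] = []
    for subject in subjects:
        for group in groups:
            if cosine_similarity(subject.features, group[0].features) >= threshold:
                group.append(subject)
                break
        else:
            groups.append([subject])  # no similar group found: start a new one
    return groups


def crop_box(group: List[Subject]) -> Box:
    """S203: smallest rectangle that contains every subject in the group."""
    return (
        min(s.box[0] for s in group),
        min(s.box[1] for s in group),
        max(s.box[2] for s in group),
        max(s.box[3] for s in group),
    )


# Person IDs 11 to 14 follow Figure 18; ID 21 is a hypothetical non-violinist.
subjects = [
    Subject(11, (1.0, 0.1), (100, 200, 180, 400)),
    Subject(12, (0.9, 0.2), (190, 210, 270, 410)),
    Subject(13, (1.0, 0.0), (280, 205, 360, 405)),
    Subject(14, (0.95, 0.15), (370, 215, 450, 415)),
    Subject(21, (0.0, 1.0), (600, 100, 700, 300)),
]
groups = group_subjects(subjects, threshold=0.9)
print([crop_box(g) for g in groups])
```

Comparing each subject only against the first member of a group keeps the sketch short; a clustering method that compares against all members would be a reasonable alternative when features vary more within a section.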
【0113】 For example, when grouping by "violin," the position and size of the crop area are set so that all the people playing the "first violin" and "second violin" fit within the crop area. 【0114】 Figure 19 shows an example of an image cropped to include a person and a musical instrument according to Embodiment 2. That is, when grouping by "violin," if the detection unit 303 detects only the four violins 175 shown in Figure 17, then a cropped area 190 including the people playing those violins is set as shown in Figure 19. 【0115】 On the other hand, when grouping by "string instruments," the position and size may be set so that the people playing the violin, cello, viola, and double bass fit within the crop area. Furthermore, these conditions are applied in the same way to the other instruments in Figure 16 and the people playing them, so that grouping and crop area setting are performed based on similarity. 【0116】 Next, in step S104 of Figure 15, the music information acquisition unit 304 acquires the necessary music-related information. Figure 20 is a diagram showing an example of music-related information according to Embodiment 2. The music-related information acquired in this embodiment is for an example in which the orchestra described in Figure 16 performs a piece whose "title of the piece" in the table in Figure 20 is "Symphony No. 5 'Fate'" and whose "composer" is "Beethoven." 【0117】 In the music-related information table in Figure 20, for example, if the attribute "Genre" is "Symphony," the structure of the piece is known to be largely predetermined, generally consisting of 3 to 4 movements. Therefore, it is also possible to obtain the "Key," "Time Signature," and "Form" for each movement in order to estimate the style and atmosphere of the piece. 
【0118】 Figure 21 is an example of a table showing the effect application patterns according to Embodiment 2, indicating the effect patterns to be applied to each crop area of each instrument part in each movement. In Figure 21, A to D represent the type of effect, with A being no effect, B being panning or tilting, C being zoom in, and D being zoom out. 【0119】 Furthermore, the level (speed) of each effect B to D in Figure 21 can be changed. For example, B- may be made relatively slower than B, and B+ may be made relatively faster than B. 【0120】 Alternatively, as shown in the table in Figure 21, for example, in the first movement, the cropped area may be divided into a grouped "first violin" and a grouped "cello," and a predetermined effect may be applied to each cropped area for each instrument. 【0121】 On the other hand, in the third movement, a predetermined effect may be applied to the entire cropped area grouped as "string instruments." 【0122】 For example, major keys generally tend to create brighter moods and minor keys tend to create darker moods, and the piece in question, as shown in the music-related information in Figure 20, has its first movement in C minor and its fourth movement in C major. Therefore, as shown in Figure 21, contrast that matches the mood of the piece can be created by making the tempo (level) at which the effect is applied relatively slower for the first movement and relatively faster for the fourth movement. 【0123】 Furthermore, in Figure 21, the effect patterns applied by the effect application unit 309 are set for each movement in the structure of the piece, but other music-related information such as digital scores may also be acquired. Additionally, more detailed effect patterns may be set, such as zooming in on the performer of a solo part. 
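The effect codes of Figure 21 (a letter A to D, optionally followed by "-" or "+") lend themselves to a small parser. The sketch below assumes concrete speed multipliers, which the text does not give, and uses hypothetical names (`EFFECT_TYPES`, `SPEED_FACTORS`, `parse_effect_code`); the per-movement schedule at the end is likewise invented to illustrate the slower first movement and faster fourth movement, and is not the content of Figure 21.

```python
from dataclasses import dataclass

# Effect types as defined for Figure 21.
EFFECT_TYPES = {"A": "none", "B": "pan_or_tilt", "C": "zoom_in", "D": "zoom_out"}

# Relative speed for each modifier. The text only says "-" is relatively
# slower and "+" relatively faster; the numeric factors are assumptions.
SPEED_FACTORS = {"-": 0.5, "": 1.0, "+": 1.5}


@dataclass(frozen=True)
class Effect:
    kind: str
    speed: float


def parse_effect_code(code: str) -> Effect:
    """Turn a code such as 'B-' or 'C+' into an effect type and relative speed."""
    letter, modifier = code[:1], code[1:]
    if letter not in EFFECT_TYPES or modifier not in SPEED_FACTORS:
        raise ValueError(f"unknown effect code: {code!r}")
    if letter == "A" and modifier:
        # Only effects B to D have a changeable level (speed).
        raise ValueError("effect A (no effect) takes no speed modifier")
    return Effect(EFFECT_TYPES[letter], SPEED_FACTORS[modifier])


# Hypothetical schedule: slower in the first movement, faster in the fourth.
schedule = {1: "C-", 4: "C+"}
for movement, code in schedule.items():
    print(movement, parse_effect_code(code))
```

Keeping the type and the speed as separate fields lets the same zoom or pan routine be reused across movements with only the speed changed.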
【0124】 As described above, this embodiment makes it possible, even when automatically filming an orchestra, to easily generate images that capture the atmosphere of the music and suit each performer, similar to images filmed by a professional photographer. 【0125】 <Other Embodiments> The present invention can also be realized by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or storage medium, and having one or more processors in the computer of that system or device read and execute the program. 【0126】 Furthermore, it can also be realized by a circuit (for example, an ASIC) that performs one or more functions. In that case, the program and the storage medium storing the program constitute the present invention. 【0127】 Although one example of an embodiment of the present invention has been described in detail above, the present invention is not limited to this particular embodiment. For example, the camera described as a video control device in the above-described embodiment may be a digital video camera, an IP remote camera, or the like. 【0128】 Furthermore, the invention can be applied to terminal devices or client devices that acquire video from external sources via cables compliant with HDMI (registered trademark) or SDI standards. 【0129】 Furthermore, some or all of the above-described software configuration may be implemented in the device as a hardware configuration. Alternatively, a GPU (Graphics Processing Unit) may be used instead of a CPU in the hardware configuration. 【0130】 As described above, according to each of the embodiments, it is possible to easily generate realistic footage that is similar to that taken by a professional photographer during automatic shooting at music live performances, concerts, etc. 
【0131】 Various modifications and combinations of the above embodiments are possible based on the spirit of the present invention, and these are not excluded from the scope of the present invention. Furthermore, some of the above embodiments may be combined as appropriate, and the present invention includes the following combinations. 【0132】 (Configuration 1) A video processing apparatus comprising: subject detection means for acquiring video and detecting a specific subject in the video; cropping means for cropping at least one predetermined region including the subject; music-related information acquisition means for acquiring music-related information related to music; determination means for determining subject information related to the subject; and effect application means for applying a predetermined effect to the predetermined region cropped by the cropping means, wherein the effect application means determines the effect to be applied to the predetermined region according to the music-related information and the subject information. 【0133】 (Configuration 2) The video processing apparatus according to Configuration 1, characterized in that the subject information includes at least one of the following: a person, the attributes of the person, the characteristics of the person, the type of musical instrument, and the type of audio equipment. 【0134】 (Configuration 3) The video processing apparatus according to Configuration 2, comprising: a linking means for linking the person, and the type of instrument detected at a position overlapping with or touching the area of the person, with the performance part or role corresponding to the type of instrument. 
【0135】 (Configuration 4) The video processing apparatus according to any one of Configurations 1 to 3, characterized in that the music-related information includes at least one of melody, tempo, time signature, genre, song structure, song title, artist name, lyrics, composer, musical form, key of the song, performance time, instrument configuration, instrument arrangement, volume, sound quality, and musical score. 【0136】 (Configuration 5) The video processing apparatus according to any one of Configurations 1 to 4, characterized in that the effect applied by the effect application means is at least one image processing effect among changing the position or size of the predetermined region cropped by the cropping means, blurring, panning, tilting, zooming in, and zooming out. 【0137】 (Configuration 6) The video processing apparatus according to any one of Configurations 1 to 5, characterized in that the level of the effect applied by the effect application means is changed according to the music-related information. 【0138】 (Configuration 7) The video processing apparatus according to any one of Configurations 1 to 6, characterized in that it has a storage means for storing in advance the music-related information, the subject information, and the effect patterns linked together. 【0139】 (Configuration 8) The video processing apparatus according to any one of Configurations 1 to 7, characterized in that the determination means determines the similarity of a plurality of subjects detected by the subject detection means and groups the plurality of subjects whose similarity is at or above a predetermined level. 【0140】 (Configuration 9) The video processing apparatus according to any one of Configurations 1 to 8, characterized in that, after performing aspect ratio conversion, resolution conversion, enlargement processing, or mask processing on the predetermined region cropped by the cropping means, the video is output to the outside. 
【0141】 (Configuration 10) The video processing apparatus according to any one of Configurations 1 to 9, characterized in that the effect application means temporarily stores the acquired video and then applies the effect to the temporarily stored video. 【0142】 (Method) A video processing method comprising: a subject detection step of acquiring video and detecting a specific subject in the video; a cropping step of cropping at least one predetermined region including the subject; a music-related information acquisition step of acquiring music-related information related to music; a determination step of determining subject information related to the subject; and an effect application step of applying a predetermined effect to the predetermined region cropped in the cropping step, wherein the effect application step determines the effect to be applied to the predetermined region according to the music-related information and the subject information. 【0143】 (Program) A computer program for causing a computer to control each means of the video processing apparatus described in any one of Configurations 1 to 10. 【Explanation of Symbols】 【0144】 10: Camera 11: Imaging unit 12: CPU 13: Memory 14: Input unit 15: Display unit 16: Communication unit
Claims
[Claim 1] A video processing apparatus comprising: subject detection means for acquiring video and detecting a specific subject within the video; cropping means for cropping at least one predetermined region including the subject; music-related information acquisition means for acquiring music-related information; determination means for determining subject information related to the subject; and effect application means for applying a predetermined effect to the predetermined region cropped by the cropping means, wherein the effect application means determines the effect to be applied to the predetermined region according to the music-related information and the subject information. [Claim 2] The video processing apparatus according to claim 1, characterized in that the subject information includes at least one of the following: a person, the attributes of the person, the characteristics of the person, the type of musical instrument, and the type of audio equipment. [Claim 3] The video processing apparatus according to claim 2, further comprising: a linking means for linking the person and the type of instrument, according to the person, the attributes of the person, and the type of instrument detected at a position overlapping with or touching the area of the person, to a performance part or role corresponding to the type of instrument. [Claim 4] The video processing apparatus according to claim 1, characterized in that the music-related information includes at least one of melody, tempo, time signature, genre, song structure, song title, artist name, lyrics, composer, musical form, key of the song, performance time, instrument configuration, instrument placement, vocal volume, volume, sound quality, and musical score. 
[Claim 5] The video processing apparatus according to claim 1, characterized in that the effect applied by the effect application means is at least one image processing effect among changing the position or size of the predetermined region cropped by the cropping means, blurring, panning, tilting, zooming in, and zooming out. [Claim 6] The video processing apparatus according to claim 1, characterized in that the level of the effect applied by the effect application means is changed according to the music-related information. [Claim 7] The video processing apparatus according to claim 1, characterized in that it has a storage means for pre-linking and storing the music-related information, the subject information, and the effect patterns. [Claim 8] The video processing apparatus according to claim 1, characterized in that the determination means determines the similarity of a plurality of subjects detected by the subject detection means, and groups the plurality of subjects whose similarity is at or above a predetermined level. [Claim 9] The video processing apparatus according to claim 1, characterized in that, after performing aspect ratio conversion, resolution conversion, scaling, or masking on the predetermined region cropped by the cropping means, the video is output to the outside. [Claim 10] The video processing apparatus according to claim 1, characterized in that the effect application means temporarily stores the acquired video and then applies the effect to the temporarily stored video. 
[Claim 11] A video processing method comprising: a subject detection step of acquiring video and detecting a specific subject within the video; a cropping step of cropping at least one predetermined region including the subject; a music-related information acquisition step of acquiring music-related information; a determination step of determining subject information related to the subject; and an effect application step of applying a predetermined effect to the predetermined region cropped in the cropping step, wherein the effect application step determines the effect to be applied to the predetermined region according to the music-related information and the subject information. [Claim 12] A computer program for causing a computer to control each means of the video processing apparatus described in any one of claims 1 to 10.