Context-based speaker counter for speaker segmentation clustering system

By utilizing visual contextual information to generate unique descriptors, this method addresses the shortcomings of existing speaker segmentation and clustering systems in terms of accuracy and robustness in speaker number estimation. It achieves efficient, manual-label-free speaker number estimation that adapts to environmental changes.

CN115298704B · Active · Publication Date: 2026-06-16 · GOOGLE LLC


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: GOOGLE LLC
Filing Date: 2020-03-13
Publication Date: 2026-06-16

Patent Timeline

13 Mar 2020: Application
16 Jun 2026: Publication (CN115298704B)

IPC: G06V20/40; G06V40/10; G06V40/16; G06V10/50; G06V10/762

CPC: G06V40/161; G06V40/169; G06V40/173; G06V40/10; G06V20/46; G06V10/507; G06V10/762; G06V20/41

AI Tagging

Application Domain

Speech analysis; Biometric pattern recognition


AI Technical Summary

⚠Technical Problem

Existing speaker segmentation and clustering systems rely on trial and error or human curation to determine the number of unique speakers, resulting in insufficient accuracy and robustness. They are prone to errors, especially in the context of environmental changes and linguistic diversity, and may also involve privacy issues.

⚗Method used

By leveraging visual contextual information, the system detects facial and human features in videos to generate pixel-activated histograms and descriptors with wider bounding boxes. Combined with active speaker detection, it automatically estimates the number of speakers, avoids biometric identifiers, and provides prior information to the speaker segmentation and clustering system.

🎯Benefits of technology

It improves the accuracy and robustness of the speaker segmentation and clustering system, reduces reliance on human intervention and biometric data, adapts to environmental changes, and achieves efficient speaker number estimation without identifying individuals.

✦ Generated by Eureka AI based on patent content.


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining a number of speakers in a video and corresponding audio using visual context are disclosed. In one aspect, a method includes detecting a plurality of speakers within a video; for each detected speaker, determining a bounding box that includes a detected person in an image frame and objects within a threshold distance of the detected person; determining a unique descriptor for the person based in part on image information depicting the objects within the bounding box; determining a cardinality of unique speakers in the video; and providing the cardinality of unique speakers to a speaker segmentation clustering system.


Description

Technical Field

[0001] This specification pertains to the field of speaker segmentation and clustering.

Background Technology

[0002] Speaker segmentation clustering is the process of segmenting an audio stream with multiple speakers into segments associated with each individual speaker. Segmentation clustering is useful for many applications such as audio transcription and captioning.

[0003] The performance and accuracy of current speaker segmentation and clustering systems largely depend on identifying the number of unique speakers in a video or audio file. Some systems rely on trial and error to determine the number of speakers, while others require human input. The latter approach is susceptible to scale limitations, as human curation requires reviewing the audio or video and accurately counting the speakers. The curator may be unfamiliar with the speakers or may even speak a language different from the spoken language in the audio or video. Using trial and error to determine a threshold number of speakers may be more efficient in terms of time and resources, but the diversity of contextual information (especially in video) can lead to large variations in counts, resulting in error-prone estimations.

Summary of the Invention

[0004] With an increasing amount of audio data now associated with video, visual context can provide crucial prior information about the number of unique speakers. Specifically, the subject of this application relates to using visual context from video to determine the cardinality of speakers as a prior for a speaker segmentation clustering system. This specification describes a novel system and method for providing a speaker segmentation clustering system with the cardinality of unique speakers present in both the video and the corresponding audio.

[0005] In general, an innovative aspect of the subject matter described in this specification can be embodied in a method comprising the following actions: obtaining a video comprising multiple image frames and corresponding audio; detecting multiple individuals depicted in the video; for each detected individual, determining a bounding box comprising the detected individual in the image frames and objects within a threshold distance of the detected individual; determining a unique descriptor for that individual from image information included within each bounding box for each detected individual, the unique descriptor being based in part on image information depicting objects within the bounding box; determining a cardinality of the unique descriptors determined for the video; and providing at least the cardinality of the unique descriptors to a speaker segmentation clustering system that determines unique speakers of the corresponding audio of the video. Other embodiments of this aspect include corresponding systems, apparatus, and computer programs configured to perform the actions of the method and encoded on a computer storage device.

[0006] Specific embodiments of the subject matter described herein can be implemented to achieve one or more of the following advantages. The system uses visual context to determine the speaker, but the visual context does not personally identify the speaker. The context includes at least additional informative signals such as the area around the face, the detection of the person (e.g., head and torso), and the detection of active speakers. Detecting these features results in a more robust process than systems that rely solely on active speaker detection without human curation and/or systems based solely on audio processing. Therefore, acceptable accuracy can be achieved without personally identifying people in the video and without human curation.

[0007] More specifically, in contrast to audio-only processing, visual cues significantly increase the robustness of the underlying technology. For example, the voice of the same person in an indoor environment will sound very different in an outdoor environment, or more generally when environments have different acoustic properties. These differences make it difficult for audio-only processing to accurately determine speaker segmentation clusters without using biometric information. However, when visual features that change less frequently, such as clothing, are considered, the process can utilize pixel-based descriptors that are quite robust to environmental changes and do not rely on biometric information.

[0008] Details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of this subject matter will become apparent from the description, drawings, and claims.

Attached Figure Description

[0009] Figure 1A is a block diagram of a context-based speaker counting system that determines priors for a speaker segmentation clustering system.

[0010] Figure 1B is a block diagram of an example image frame with a wide bounding box.

[0011] Figure 2 is a block diagram of an example image frame that illustrates a larger bounding box.

[0012] Figure 3 is a block diagram illustrating example image frames that use multiple video signals simultaneously.

[0013] Figure 4 is a flowchart of an example process for determining the cardinality of speakers.

Detailed Implementation

[0014] Overview

[0015] Speaker segmentation and clustering systems consist of a three-step sequence: speech detection, segmentation into individual speaker segments, and clustering of these segments. The final step is particularly problematic, where automated systems must rely on trial and error to determine when to stop clustering, or on external input to provide the number of speakers. As mentioned above, this latter process involves human curation, requiring the processing of large amounts of data while accurately counting speakers. Other processes may involve using identifying information, such as facial recognition. However, privacy concerns may prevent the use of processes that could identify people.

[0016] The subject matter of this application overcomes the technical problem of accurately estimating the number of speakers in a video without using a process that personally identifies speakers, thereby eliminating the use of biometric data that raises privacy concerns. This subject matter may utilize a combination of the following features:

[0017] (1) Face detection without any associated explicit identity: For each detected face, the system considers a wider bounding box (a multiple of the size of the face bounding box) and uses information about the distribution of objects detected within the larger bounding box as a unique descriptor. Thus, the presence of a potential speaker can be detected without using biometric data.

[0018] (2) Detecting people without any associated explicit identity: For each detected person, the system uses a pixel-activated histogram as a descriptor to encode the intuition that such a distribution is largely influenced by local factors such as clothing and lighting and will remain relatively unchanged.

[0019] (3) Detect people with a larger context bounding box: Similar to (1) above, but this time use a wider bounding box around the person.

[0020] (4) Active speaker detection: The system uses a pre-trained model to detect speaking faces and encodes the intuition that a speaker continues speaking for a reasonable duration.

[0021] The task of a speaker segmentation clustering system is to separate conversations based on speakers when the number of speakers in an audio recording is unknown. Since an increasing amount of audio content also has corresponding videos, processing the video as described above can be used to predict the cardinality of unique speakers based on visual context. When determining the number of speakers, the speaker segmentation clustering system uses this predicted count as a prior.

[0022] Throughout this document, the term video will refer to both video and its corresponding audio, wherein video comprises a sequence of image frames. Furthermore, as used herein, face detection or human detection does not imply a personal identifier of the speaker; instead, these detections merely indicate the presence or absence of a human speaker. This can be accomplished without the use of biometric data.

[0023] For each video, one or more of these signals are used to detect speakers in the video, whereby each detected speaker is given a unique descriptor determined by a specific signal. When unique descriptors are determined using specific signals, the number of speakers is determined either directly or by feeding them as input to a machine learning model and outputting the predicted number of speakers in the received video. As used herein, a “unique descriptor” is a descriptor (or a set of sufficiently similar descriptors) determined from the video that indicates the same person was detected for each instance described. The descriptors are not biometric data used to personally identify people, but are derived from the visual features described above. For example, for each frame of the video, a descriptor based on regions including the head and torso of a person can be generated.

[0024] These features and additional characteristics will be described in more detail below.

[0025] Figure 1A is a block diagram 10 of a context-based speaker counting system 30 that determines the prior for the speaker segmentation clustering system 40. System 30 receives video 20 as input. Video 20 comprises multiple image frames and corresponding audio. A person detector 32 detects multiple people depicted in the video. The person detector 32 can be a process or model trained to detect faces without any explicit identity associated with those faces. For each detected face, detector 32 can also select a bounding box wider than the face (e.g., a multiple of the face bounding box size) and use information about the distribution of objects detected within the larger bounding box. Detector 32 can also detect people without any explicit identity associated with those people. For example, for each detected person, system 30 uses a histogram of pixel activations as a descriptor and encodes the intuition that such a distribution is largely influenced by local factors such as clothing and lighting and will remain relatively invariant. In yet another example, detector 32 can detect people with a larger context bounding box. Therefore, as used in this specification, a detected person can be a detected face with additional context as determined by a bounding box, a person detection, or a person detection with additional context as determined by a bounding box. In other words, for each detected person, the system determines a bounding box including the detected person in the image frame and objects within a threshold distance of the detected person.

[0026] Then, the cardinality estimator model 36 determines a unique descriptor for each detected person from the image information included within each bounding box. The unique descriptor is based in part on the image information depicting the object within the bounding box. For example, the cardinality estimator model 36 determines the histogram of pixel activations as the descriptor.

[0027] The active speaker detector 34 can also be used to identify active speakers. The identified active speakers can be matched to corresponding unique descriptors. For example, lip movements can be detected in a face within a portion of a frame, and that portion of the frame is within a bounding box, where a person is detected. This is used as additional information to determine that the person is an active speaker in at least some portion of the video.
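The matching step described above can be sketched as a simple containment test. This is only an illustration under stated assumptions: the `Box` type, the function names, and the rule of preferring the smallest enclosing person box are hypothetical choices, not details given in the specification.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical representation)."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    """Area of a box; zero for degenerate boxes."""
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def contains(outer: Box, inner: Box) -> bool:
    """Return True when `inner` lies entirely inside `outer`."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


def match_active_speaker(lip_region: Box,
                         person_boxes: Sequence[Box]) -> Optional[int]:
    """Index of the person box that encloses the lip-movement region.

    Assumption: when several boxes enclose the region (overlapping people),
    the smallest enclosing box is preferred. Returns None if no box matches.
    """
    candidates = [(area(b), i) for i, b in enumerate(person_boxes)
                  if contains(b, lip_region)]
    return min(candidates)[1] if candidates else None
```

For example, a lip region detected at `Box(200, 60, 230, 80)` would be matched to a person box spanning `Box(150, 0, 300, 200)` but not to one spanning `Box(0, 0, 100, 200)`.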

[0028] Then, the cardinality estimator model 36 determines the cardinality of the unique descriptors identified for the video. Cardinality is a prediction or estimate of the number of unique speakers in the video. In some implementations, a single cardinality value is determined. In other implementations, the cardinality estimation model can provide the speaker segmentation clustering system with a distribution of possible cardinalities for unique speakers. In such an implementation, multiple cardinalities for unique descriptors are determined, and for each unique descriptor, a confidence value for that unique descriptor is determined. For example, the system can determine that there is an 80% probability that there are four unique descriptors (e.g., four speakers) in the video, and a 20% probability that there are three unique descriptors (e.g., three speakers) in the video. These estimates are then provided to the speaker segmentation clustering system 40, which performs speech detection, uses the estimates to determine the number of speakers, and associates portions of audio with the corresponding number of speakers.

[0029] Figure 1B is a high-level block diagram illustrating an environment 100 of an example video 110 comprising multiple image frames 111-116. The video 110 has a corresponding audio signal, which includes the speech of a speaker present in the video 110. In some implementations, the video 110 may have a single speaker, while in other implementations, the video 110 may have multiple speakers. For example, the video 110 shows two humans 120 and 130 in image frame 116, each of whom is a potential speaker.

[0030] Not all speakers are shown in every frame of the video. For example, image frame 116 shows two speakers, 120 and 130, while image frame 115 may depict only one of speakers, 120 or 130.

[0031] In some implementations, the number of speakers in the video is determined using face detection methods based on computer vision or image processing techniques. In such implementations, a sequence of image frames is processed to identify features, and the detected faces within the video are determined from the identified features. For example, face detection techniques or person detection techniques as described above are used to detect speakers 120 and 130. For example, bounding boxes 122 and 132 represent the detected faces and additional context in frame 116 (e.g., each bounding box is larger than the corresponding detected face).

[0032] In this implementation, a unique descriptor for each speaker is determined based on objects detected within a threshold distance of the detected face. For example, face detection techniques detect faces within bounding box 122 to determine the presence of a potential speaker 120. When speaker 120 is detected, a wider bounding box 124 is used to encapsulate the specific speaker's face and other detectable objects within the wider bounding box 124. In this example, other detectable objects include portions of table 140 and window 145.
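As a rough illustration of how a wider bounding box such as 124 might be derived from a face bounding box such as 122, the sketch below scales the face box about its center by a multiple and clips it to the frame. The function name, the tuple layout, and the default scale of 3 are assumptions made for this example only.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def widen_box(face: BoxT, frame_width: float, frame_height: float,
              scale: float = 3.0) -> BoxT:
    """Scale a face box about its center by `scale`, clipped to the frame.

    `scale` plays the role of the "multiple of the face bounding box size";
    the value 3.0 is an arbitrary illustrative default.
    """
    left, top, right, bottom = face
    center_x = (left + right) / 2.0
    center_y = (top + bottom) / 2.0
    half_w = (right - left) * scale / 2.0
    half_h = (bottom - top) * scale / 2.0
    return (max(0.0, center_x - half_w),
            max(0.0, center_y - half_h),
            min(float(frame_width), center_x + half_w),
            min(float(frame_height), center_y + half_h))
```

Objects whose detections fall inside the returned box would then contribute to the descriptor for that face.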

[0033] In some implementations, the threshold distance for determining the maximum separation between detected faces and detected objects is determined by the system designer and can be available to the system as user input. In other implementations, the threshold distance can be determined automatically by the system to capture speaker uniqueness by adjusting the threshold value. Such implementations can determine the threshold distance through an iterative process or a machine learning model that takes image frames as input and determines the threshold based on the attributes of the image frames. For example, the iterative process can increase or decrease the threshold value in each iteration and determine the final value of the threshold based on the objects detected within the threshold distance.
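One possible reading of the iterative process is sketched below. It is an assumption-laden example: the stopping rule (grow the threshold until at least `min_objects` detected objects fall within it), the step size, and all names are hypothetical, and the sketch only increases the threshold, whereas the text also allows decreasing it.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def objects_within(face_center: Point, object_centers: Sequence[Point],
                   threshold: float) -> List[Point]:
    """Object centers no farther than `threshold` from the face center."""
    return [c for c in object_centers if math.dist(face_center, c) <= threshold]


def tune_threshold(face_center: Point, object_centers: Sequence[Point],
                   min_objects: int = 2, start: float = 10.0,
                   step: float = 10.0, max_threshold: float = 200.0) -> float:
    """Grow the threshold until enough context objects are captured.

    Assumed stopping rule: stop once `min_objects` objects lie within the
    threshold, or once `max_threshold` is reached.
    """
    threshold = start
    while (threshold < max_threshold and
           len(objects_within(face_center, object_centers, threshold)) < min_objects):
        threshold += step
    return min(threshold, max_threshold)
```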

[0034] In some implementations, a unique descriptor for a particular speaker can be determined using all detectable features within a wider bounding box. In such implementations, techniques such as convolution and pooling can be used to identify features, not necessarily specific objects. Continuing the example, image frame 116 shows another potential speaker 130. Face detection techniques detect faces within bounding box 132. Then, a wider bounding box 134 is used to determine a unique descriptor for the potential speaker 130 by encapsulating the speaker 130's face and detectable objects within a bounding box 134 that includes the portion of the lamp 150.

[0035] In some implementations, techniques other than face detection can be used to detect potential speakers in a video. For example, the presence of potential speakers can be detected by detecting human shapes or human motion patterns. In such implementations, each speaker in the video (specifically, a sequence of image frames) is encapsulated by a larger bounding box that encapsulates the detected human shape (e.g., head and torso). Similar to the wider bounding box used for faces, for each potential speaker, a unique descriptor is determined using the objects detected within the larger bounding box. Figure 2 is an example environment 200 illustrating a larger bounding box surrounding a human figure. When a potential speaker 220 is detected, a larger bounding box 260 is determined to enclose the potential speaker 220 and include other objects such as portions of table 250, window 230, and lamp 240. Alternatively, the size of the larger bounding box 260 can be varied to primarily enclose the detected human figure and omit other objects, as the detected human figure may have sufficient visual features (e.g., clothing, lighting) that distinguish it from other detected human figures but cannot be used for unique personal identification.

[0036] In some implementations, the size of the larger bounding box 260 and the threshold distance for the maximum separation between the detected potential speaker and the detected object are determined by the system designer and can be available to the system as user input. In some implementations, the threshold for determining the size of the larger bounding box can be determined automatically by the system to capture the uniqueness of the detected speaker by adjusting the value of the threshold. Such implementations can determine the threshold distance through an iterative process or a machine learning model that takes an image frame as input and determines the predicted value of the threshold based on the attributes of the image frame. Some implementations may include measures to allow maximum performance in detecting unique speakers by at least allowing the speaker's head and torso as a requirement for the larger bounding box. For example, a larger bounding box is not suitable for generating unique descriptors for a specific user if only the speaker's head is visible due to the speaker's position relative to the image frame.

[0037] In some implementations, a histogram of pixel activations is used to determine a unique descriptor for each potential speaker in the video, based on the corresponding speaker's local factors. For example, a speaker wearing a blue shirt and sitting in a brown chair under specific lighting conditions in a particular image frame will generate roughly similar histograms of pixel activations across multiple frames of the video, and will differ from the histograms of pixel activations of other speakers with different local factors. Other implementations may include information obtained from the speaker's clothing, objects within multiple bounding boxes in the image frame, and the speaker's color and depth analysis.
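A minimal sketch of such a descriptor follows, assuming that a "histogram of pixel activations" can be approximated by a normalized per-channel intensity histogram over the bounding box. That interpretation, the RGB tuple frame layout, and the bin count are assumptions for illustration; the specification does not fix them.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]          # assumed 8-bit (R, G, B) values
Frame = Sequence[Sequence[Pixel]]     # rows of pixels
IntBox = Tuple[int, int, int, int]    # (left, top, right, bottom), right/bottom exclusive


def histogram_descriptor(frame: Frame, box: IntBox, bins: int = 4) -> List[float]:
    """Per-channel intensity histogram of the pixels inside `box`.

    Returns 3 * bins values; each channel's bins sum to 1 for a non-empty box.
    """
    left, top, right, bottom = box
    counts = [0] * (3 * bins)
    pixel_count = 0
    for row in frame[top:bottom]:
        for pixel in row[left:right]:
            for channel, value in enumerate(pixel):
                # Map an 8-bit value to its bin index within this channel.
                counts[channel * bins + value * bins // 256] += 1
            pixel_count += 1
    if pixel_count == 0:
        return [0.0] * len(counts)
    return [c / pixel_count for c in counts]
```

Under this sketch, a region dominated by a blue shirt yields nearly the same vector from frame to frame, while a speaker in differently colored clothing yields a visibly different one.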

[0038] In some implementations, the histograms of pixel activations generated for all detectable potential speakers can be clustered to form groups of similar histograms. The intuition for this is that speakers with their respective local factors will generate similar histograms across all image frames in the video, and grouping them into similar clusters will represent unique speakers. In such implementations, the histogram of pixel activations for a speaker can be generated based on portions of image frames determined by a threshold distance from the detected speaker. In some implementations, the threshold can be determined through an iterative process or by a machine learning model that takes image frames as input and determines the predicted values of the threshold based on the attributes of the image frames.
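The grouping of similar histograms could, for instance, be done with a greedy threshold-based clustering, sketched below. The clustering method, the L1 distance, and the `max_distance` value are assumptions chosen for brevity; the specification does not name a particular clustering algorithm.

```python
from typing import List, Sequence

Descriptor = Sequence[float]


def l1_distance(a: Descriptor, b: Descriptor) -> float:
    """Sum of absolute differences between two equal-length descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_descriptors(descriptors: Sequence[Descriptor],
                        max_distance: float = 0.5) -> List[List[Descriptor]]:
    """Greedily assign each descriptor to the nearest cluster centroid.

    A descriptor farther than `max_distance` from every centroid starts a
    new cluster. The number of clusters is a candidate speaker count.
    """
    centroids: List[List[float]] = []
    members: List[List[Descriptor]] = []
    for d in descriptors:
        best = None
        for i, centroid in enumerate(centroids):
            dist = l1_distance(d, centroid)
            if dist <= max_distance and (best is None or dist < best[0]):
                best = (dist, i)
        if best is None:
            centroids.append(list(d))
            members.append([d])
        else:
            i = best[1]
            members[i].append(d)
            n = len(members[i])
            # Recompute the centroid as the mean of the cluster's members.
            centroids[i] = [sum(m[k] for m in members[i]) / n
                            for k in range(len(d))]
    return members
```

With this sketch, `len(cluster_descriptors(all_histograms))` gives the estimated number of unique speakers, where `all_histograms` is a hypothetical list of per-detection histograms gathered across frames.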

[0039] In some implementations, a speaker's unique descriptor can be determined when using an active speaker detection system to detect active speakers. An active speaker detection system is a system that detects (multiple) active speakers in a video. For example, given a video depicting three people talking, an active speaker detection system can detect which of the three is speaking at a given time. Supporting this approach is the assumption that visual mouth movements during speaking are highly correlated with the corresponding sounds produced during speaking. Even if, for some reason, other facial features of the speaker cannot be detected, the active speaker detection system can process both the video and the corresponding audio simultaneously to track the speaker's mouth movements. For example, there may be two speakers in the audio, but the corresponding video only shows one speaker. Suppose that a speaker not visible in the video is speaking in the audio. In such a scenario, the active speaker detection system can track the mouth movements of the speaker who is not speaking but is visible in the video and determine the presence of the other speaker.
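The assumption that mouth movement correlates with the produced sound can be illustrated with a plain Pearson correlation between a per-frame mouth-opening measure and the per-frame audio energy. This is a toy sketch, not the pre-trained model the specification refers to; the input series, the function names, and the 0.5 cutoff are all hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def is_active_speaker(mouth_opening: Sequence[float],
                      audio_energy: Sequence[float],
                      min_correlation: float = 0.5) -> bool:
    """Flag a visible person as speaking when mouth motion tracks the audio."""
    return pearson(mouth_opening, audio_energy) >= min_correlation
```

In the off-screen speaker scenario described above, the visible person's mouth-opening series stays roughly flat while the audio energy varies, so the correlation is low and the sketch reports that the visible person is not the one speaking.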

[0040] In some implementations, multiple signals can be used to determine the unique descriptor for each speaker. An example scenario for this is when the positions of multiple speakers detected within an image frame overlap. Figure 3 is an example environment 300 illustrating image frame 310. Image frame 310 shows two potential speakers 330 and 340, whose faces have been detected using any face detection technique, represented by bounding boxes 332 and 342. In this case, for each of speakers 330 and 340, the wide bounding boxes as discussed above would be such that both wide bounding boxes include the faces of both speakers 330 and 340. In such an implementation, the larger bounding box for each speaker would capture the uniqueness of the two speakers by utilizing the relative positions of the objects identified within the larger bounding box. For example, the larger bounding box 344 for speaker 340 includes the bounding box 332 of speaker 330 and other portions of speaker 330 that are different from speaker 340, the bounding box 332 including the detected face of speaker 330. In this example, the information within bounding boxes 332 and 334, which includes the detected face of speaker 330, part of speaker 340, and window 350, is used to determine the unique descriptor of speaker 330, while the detected face of speaker 340, part of speaker 330, and part of table 360 are used to determine the unique descriptor of speaker 340.

[0041] In some implementations, a cardinality estimation model can be used to determine the number of unique speakers from unique descriptors. In some implementations, the cardinality estimation model can be a machine learning model trained to predict the number of unique speakers given unique descriptors. In other implementations, the cardinality estimation model can be an algorithmic process. For example, one possible implementation is a stepwise elimination process that uses a user-defined heuristic to eliminate redundant descriptors for each speaker until the cardinality of unique speakers is determined.
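A stepwise elimination process of the kind mentioned above might look like the following sketch. The `is_redundant` callable stands in for the user-defined heuristic and is a hypothetical parameter; the specification does not define its form.

```python
from typing import Callable, Sequence, TypeVar

D = TypeVar("D")


def count_by_elimination(descriptors: Sequence[D],
                         is_redundant: Callable[[D, D], bool]) -> int:
    """Remove one redundant descriptor per step until none remain.

    `is_redundant(a, b)` is a caller-supplied heuristic that returns True
    when two descriptors are judged to describe the same speaker. The number
    of surviving descriptors is returned as the speaker cardinality.
    """
    remaining = list(descriptors)
    eliminated = True
    while eliminated:
        eliminated = False
        for i in range(len(remaining)):
            for j in range(i + 1, len(remaining)):
                if is_redundant(remaining[i], remaining[j]):
                    del remaining[j]       # drop the later duplicate
                    eliminated = True
                    break
            if eliminated:
                break
    return len(remaining)
```

For scalar toy descriptors, `count_by_elimination([0.1, 0.12, 0.9, 0.88, 0.5], lambda a, b: abs(a - b) < 0.05)` leaves three survivors.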

[0042] In some implementations, the cardinality estimation model can provide the speaker segmentation clustering system with the distribution of possible cardinality for unique speakers. In such implementations, multiple cardinality values are provided for unique descriptors, and for each unique descriptor, a confidence value for that unique descriptor is provided. All cardinality values and their confidence values can be provided to the speaker segmentation clustering system; alternatively, only the cardinality value with the highest confidence value can be provided. For example, the above features can be used to train a machine learning model to predict speaker cardinality.
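The two hand-off options (the full distribution, or only the most confident value) can be sketched as follows, using the earlier example of an 80% probability of four speakers and a 20% probability of three. The dictionary representation and function names are assumptions for illustration.

```python
from typing import Dict, Tuple


def normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Turn non-negative scores per candidate cardinality into confidences summing to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    return {cardinality: score / total for cardinality, score in scores.items()}


def most_confident(distribution: Dict[int, float]) -> Tuple[int, float]:
    """Return the (cardinality, confidence) pair with the highest confidence."""
    return max(distribution.items(), key=lambda item: item[1])


# Either the whole distribution or only the top candidate is handed on.
distribution = normalize({4: 8.0, 3: 2.0})
best_cardinality, best_confidence = most_confident(distribution)
```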

[0043] Figure 4 is a flowchart of a process 400 for determining the cardinality of unique speakers. Process 400 is implemented in a computer system including one or more computers. Process 400 receives a video (410) comprising multiple image frames and corresponding audio. Process 400 detects multiple people in the video (420). For example, for each image frame in the video, potential speakers are detected as described above. Other methods for speaker detection include tracking human motion patterns and human body shapes. In some cases, active speaker detection systems (e.g., based on lip movements) can also be used to detect the presence of speakers in a video.

[0044] After a potential speaker is detected in an image frame, process 400 determines (430) a bounding box to enclose the speaker and objects within a threshold distance. In some scenarios, for each detected face in the image frame, a wider bounding box is determined based on a threshold distance, so that the wider bounding box includes the speaker's face and the objects within that distance. For example, when speakers 120 and 130 are detected, wider bounding boxes 124 and 134 are determined based on the threshold distances to the detected faces of speakers 120 and 130. In other scenarios, when a speaker is identified in the image frame, a larger bounding box is determined to enclose each specific speaker and objects within a threshold distance. For example, when speaker 220 is identified, a larger bounding box 260 is determined to enclose the speaker and objects within a threshold distance.

[0045] When the bounding boxes of detected speakers in an image frame are determined, process 400 generates a unique descriptor (440) for each detected speaker. For example, when a wider bounding box 124, determined by a threshold distance, is determined, objects such as portions of table 140 and window 145 are detected within the wider bounding box. Then, a unique descriptor is generated for speaker 120 based on the detected face 122 and objects 140 and 145. In another example, a unique descriptor is generated based on the speaker and detected objects with a larger bounding box 260 that includes table 250, window 230, and lamp 240. When a unique descriptor is generated for each detected speaker in the video, process 400 determines the cardinality of unique speakers in the video (450). For example, the unique descriptors can be fed to a cardinality estimation model to predict the cardinality of unique speakers in the video from the unique descriptors. When the cardinality of unique speakers is determined, process 400 provides the cardinality to a speaker segmentation clustering system (460). In some implementations, multiple cardinalities can be determined for unique speakers in a video. In such implementations, all determined cardinalities and their distributions are provided to the speaker segmentation clustering system, or in some cases, the cardinality with the highest confidence is provided.
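The flow of process 400 can be summarized in a short skeleton. This is a sketch under stated assumptions: every callable passed in (`detect_people`, `bounding_box`, `describe`, `estimate_cardinality`) is a hypothetical placeholder for a component the specification describes only functionally, and the parenthesized step numbers in the comments refer to Figure 4.

```python
from typing import Any, Callable, Iterable, List


def count_speakers(frames: Iterable[Any],
                   detect_people: Callable[[Any], Iterable[Any]],
                   bounding_box: Callable[[Any, Any], Any],
                   describe: Callable[[Any, Any], Any],
                   estimate_cardinality: Callable[[List[Any]], int]) -> int:
    """Run detection, boxing, description, and cardinality estimation."""
    descriptors: List[Any] = []
    for frame in frames:                              # video received (410)
        for person in detect_people(frame):           # detect people (420)
            box = bounding_box(frame, person)         # bounding box (430)
            descriptors.append(describe(frame, box))  # descriptor (440)
    return estimate_cardinality(descriptors)          # cardinality (450)


# Toy stand-ins: each "frame" lists the people visible in it, and the
# "descriptor" is simply the person's label.
toy_frames = [["a"], ["a", "b"], ["b"]]
speaker_count = count_speakers(
    toy_frames,
    detect_people=lambda frame: frame,
    bounding_box=lambda frame, person: person,
    describe=lambda frame, box: box,
    estimate_cardinality=lambda ds: len(set(ds)),
)
```

The returned value is what would then be provided to the speaker segmentation clustering system (460).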

[0046] The embodiments of the subject matter and operation described herein can be implemented in digital electronic circuits, or in computer software, firmware, or hardware, including the structures disclosed herein and their equivalents, or in combinations thereof. Embodiments of the subject matter described herein can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a computer storage medium for execution by a data processing apparatus or for controlling the operation of a data processing apparatus.

[0047] Computer storage media can be a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination thereof, or be included in a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination thereof. Furthermore, although a computer storage medium is not a propagating signal, it can be a source or destination of computer program instructions encoded in an artificially generated propagating signal. Computer storage media can also be one or more separate physical components or media (e.g., multiple CDs, discs, or other storage devices) or be included in one or more separate physical components or media (e.g., multiple CDs, discs, or other storage devices).

[0048] The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

[0049] The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including programmable processors, computers, systems-on-a-chip, or combinations thereof. The apparatus may include special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, cross-platform runtime environments, virtual machines, or combinations thereof. The apparatus and execution environment can implement a variety of different computing model infrastructures, such as web services, distributed computing, and grid computing infrastructures.

[0050] A computer program (also known as a program, software, software application, script, or code) can be written in any programming language, including compiled or interpreted languages and declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but does not necessarily, correspond to a file in a file system. A program may be stored as part of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), as a single file dedicated to the program in question, or as multiple coordinated files (e.g., files storing one or more modules, subroutines, or code sections). A computer program can be deployed to execute on a single computer or on multiple computers that are located in one place or distributed across multiple locations and interconnected by a communication network.

[0051] The processes and logic flows described in this specification can be executed by one or more programmable processors that execute one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be executed by special-purpose logic circuitry, and the apparatus can also be implemented as special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits).

[0052] Processors suitable for executing computer programs include, for example, both general-purpose and special-purpose microprocessors, as well as any one or more processors of any kind of digital computer. Typically, a processor receives instructions and data from read-only memory or random access memory, or both. The basic components of a computer are a processor for performing actions according to instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer does not need to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, such as: semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROMs and DVD-ROMs. The processor and memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

[0053] To provide interaction with the user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device for displaying information to the user, such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, and a keyboard and pointing device (e.g., a mouse or trackball) with which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual, auditory, or tactile feedback; and input from the user can be received in any form, including sound, speech, or tactile input. Furthermore, the computer can interact with the user by sending documents to and receiving documents from a device used by the user, for example, by sending a webpage to a web browser on the user's device in response to a request received from that web browser.

[0054] Embodiments of the subject matter described herein can be implemented in computing systems that include backend components, such as data servers, or middleware components, such as application servers, or frontend components, such as user computers with graphical user interfaces or web browsers through which users can interact with implementations of the subject matter described herein, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (“LANs”) and wide area networks (“WANs”), inter-networks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

[0055] A computing system may include clients and servers. Clients and servers are typically remote from each other and usually interact via a communication network. The relationship between client and server arises from computer programs running on the respective computers and having a client-server relationship with each other. In some embodiments, the server transmits data (e.g., HTML pages) to the user device (e.g., for the purpose of displaying data to a user interacting with the user device and receiving user input from that user). Data generated at the user device (e.g., the result of user interaction) may be received at the server from the user device.

[0056] While this specification contains numerous details of specific implementations, these should not be construed as limiting the scope of any feature or of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented separately in multiple embodiments or in any suitable sub-combination. Furthermore, although features may be described above as functioning in certain combinations, and even initially claimed as such, one or more features from a claimed combination may in some cases be removed from that combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

[0057] Similarly, although the operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring these operations to be performed in the specific order or sequence shown, or requiring all illustrated operations to be performed to obtain the desired result. In some cases, multitasking and parallel processing can be advantageous. Furthermore, the separation of the various system components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0058] Therefore, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions described in the claims can be performed in a different order and the desired result can still be obtained. Furthermore, the processes depicted in the drawings do not necessarily require the specific order or sequence shown to obtain the desired result. In some implementations, multitasking and parallel processing can be advantageous.

Claims

1. A method executed by a data processing apparatus, the method comprising: obtaining a video that includes multiple image frames and corresponding audio; detecting multiple speakers depicted in the video without using biometric information to identify the speakers, wherein the detection of each person is based on the speaker’s face detected in a face detection bounding box; in response to detecting the multiple speakers depicted in the video, determining, for each detected person, a bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person, the bounding box being larger than and including the face detection bounding box; determining, for each detected person, a unique descriptor for that person from the image information included within the bounding box for that person and without using biometric data identifying that person, wherein the unique descriptor is based in part on the image information depicting the objects within the bounding box, the unique descriptor for the detected person does not include biometric data identifying that person, and determining the unique descriptor for that person includes determining a histogram of pixel activation for each of a plurality of frames, the histogram of pixel activation being determined based on local factors from the image information included within the bounding box; determining a cardinality of unique descriptors for the video, including: generating histogram clusters, wherein each histogram cluster includes only histograms that are within a threshold distance of each other; and determining the cardinality of unique descriptors as the number of histogram clusters; and providing at least the cardinality of unique descriptors to a speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video.
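
For illustration only, the histogram clustering recited in claim 1 can be sketched as below. This is a simplified sketch under stated assumptions and does not define or limit the claim: a normalized grayscale intensity histogram stands in for the "histogram of pixel activation", the L1 distance stands in for the unspecified distance measure, and the helper names (`intensity_histogram`, `cluster_histograms`, `estimate_cardinality`) are hypothetical. A histogram joins a cluster only if it lies within the threshold distance of every existing member, so each cluster contains only histograms that are within the threshold of each other.

```python
from typing import List, Sequence


def intensity_histogram(pixels: Sequence[int], bins: int = 8,
                        max_value: int = 256) -> List[float]:
    """Normalized histogram of grayscale values from one bounding-box crop.

    Assumption: grayscale intensities in [0, max_value) stand in for the
    pixel activations mentioned in the claim.
    """
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // max_value, bins - 1)] += 1
    total = len(pixels) or 1
    return [count / total for count in counts]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute bin differences (an assumed distance measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_histograms(histograms: List[Sequence[float]],
                       threshold: float) -> List[List[int]]:
    """Greedy clustering: join a cluster only if within `threshold` of ALL members."""
    clusters: List[List[int]] = []
    for index, histogram in enumerate(histograms):
        for members in clusters:
            if all(l1_distance(histogram, histograms[m]) <= threshold
                   for m in members):
                members.append(index)
                break
        else:  # no compatible cluster found, so start a new one
            clusters.append([index])
    return clusters


def estimate_cardinality(histograms: List[Sequence[float]],
                         threshold: float) -> int:
    """The cardinality of unique descriptors is the number of clusters."""
    return len(cluster_histograms(histograms, threshold))


# Two dark crops (same scene context) and one bright crop (different context).
crops = [[10, 20, 30, 15], [5, 12, 25, 31], [240, 250, 230, 245]]
histograms = [intensity_histogram(crop) for crop in crops]
cardinality = estimate_cardinality(histograms, threshold=0.5)
```

The greedy pass depends on input order and is only one way to satisfy the pairwise-threshold condition; complete-linkage hierarchical clustering with a distance cutoff would be another.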

2. The method according to claim 1, wherein: detecting multiple speakers depicted in the video includes detecting faces within the video; and for each detected person, determining the bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person includes determining a bounding box that is a multiple of a face detection bounding box, the face detection bounding box including the minimum portion of the image frame required to detect the face of the detected person.
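
As an informal illustration of claim 2, the sketch below enlarges a face detection bounding box by a scale factor around its center and clips the result to the image frame. The `Box` type, the `expand_box` function, the pixel-coordinate convention, and the choice to scale about the center are all assumptions made for this example; the claim only requires that the larger box be a multiple of the face detection bounding box.

```python
import math
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical convention)."""
    left: int
    top: int
    right: int
    bottom: int


def expand_box(face: Box, scale: float, frame_width: int, frame_height: int) -> Box:
    """Return a context box `scale` times the face box, clipped to the frame.

    With scale >= 1 the result always contains the original face box.
    """
    if scale < 1.0:
        raise ValueError("scale must be >= 1 so the face box stays inside")
    center_x = (face.left + face.right) / 2
    center_y = (face.top + face.bottom) / 2
    half_width = (face.right - face.left) * scale / 2
    half_height = (face.bottom - face.top) * scale / 2
    return Box(
        left=max(0, math.floor(center_x - half_width)),
        top=max(0, math.floor(center_y - half_height)),
        right=min(frame_width, math.ceil(center_x + half_width)),
        bottom=min(frame_height, math.ceil(center_y + half_height)),
    )


# A face in the middle of a 640x480 frame, and one near the top-right corner.
centered = expand_box(Box(100, 100, 200, 200), scale=3.0,
                      frame_width=640, frame_height=480)
near_edge = expand_box(Box(600, 10, 640, 50), scale=2.0,
                       frame_width=640, frame_height=480)
```

Near the frame border the clipped box is smaller than the nominal multiple, which is one reason a real system might prefer a different anchoring rule.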

3. The method according to claim 1, wherein: detecting multiple speakers depicted in the video includes detecting bodies within the video, including detecting the position of each person's head and torso; and for each detected person, determining the bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person includes determining a bounding box that is a multiple of a body detection bounding box, the body detection bounding box including a minimum portion of the image frame required to detect at least the head and torso of the detected person.

4. The method according to claim 1, wherein determining the cardinality of unique descriptors includes: providing the unique descriptors to a cardinality estimation model; and estimating the cardinality of unique descriptors with the cardinality estimation model.

5. The method according to claim 1, wherein determining the cardinality of unique descriptors for the video includes: determining multiple cardinalities for the unique descriptors; and for each of the multiple cardinalities, determining a confidence value indicating that the cardinality is correct.

6. The method according to claim 1, wherein providing at least the cardinality of unique descriptors to the speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video includes providing multiple cardinalities of unique descriptors and, for each cardinality of unique descriptors, providing a confidence value for that cardinality.

7. The method according to claim 1, wherein providing at least the cardinality of unique descriptors to the speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video includes providing the cardinality of unique descriptors that has the highest confidence value relative to all other cardinalities of unique descriptors.

8. A data processing system, comprising: a data processing apparatus; and a non-transitory computer-readable medium storing instructions executable by the data processing apparatus, wherein the instructions, when executed, cause the data processing apparatus to perform operations including: obtaining a video that includes multiple image frames and corresponding audio; detecting multiple speakers depicted in the video without using biometric information to identify the speakers, wherein the detection of each person is based on the speaker’s face detected in a face detection bounding box; in response to detecting the multiple speakers depicted in the video, determining, for each detected person, a bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person, the bounding box being larger than and including the face detection bounding box; determining, for each detected person, a unique descriptor for that person from the image information included within the bounding box for that person and without using biometric data identifying that person, wherein the unique descriptor is based in part on the image information depicting the objects within the bounding box, the unique descriptor for the detected person does not include biometric data identifying that person, and determining the unique descriptor for that person includes determining a histogram of pixel activation for each of a plurality of frames, the histogram of pixel activation being determined based on local factors from the image information included within the bounding box; determining a cardinality of unique descriptors for the video, including: generating histogram clusters, wherein each histogram cluster includes only histograms that are within a threshold distance of each other; and determining the cardinality of unique descriptors as the number of histogram clusters; and providing at least the cardinality of unique descriptors to a speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video.

9. The system according to claim 8, wherein: detecting multiple speakers depicted in the video includes detecting faces within the video; and for each detected person, determining the bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person includes determining a bounding box that is a multiple of a face detection bounding box, the face detection bounding box including the minimum portion of the image frame required to detect the face of the detected person.

10. The system according to claim 8, wherein: detecting multiple speakers depicted in the video includes detecting bodies within the video, including detecting the position of each person's head and torso; and for each detected person, determining the bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person includes determining a bounding box that is a multiple of a body detection bounding box, the body detection bounding box including a minimum portion of the image frame required to detect at least the head and torso of the detected person.

11. The system according to claim 8, wherein determining the cardinality of unique descriptors includes: providing the unique descriptors to a cardinality estimation model; and estimating the cardinality of unique descriptors with the cardinality estimation model.

12. The system according to claim 8, wherein determining the cardinality of unique descriptors for the video includes: determining multiple cardinalities for the unique descriptors; and for each of the multiple cardinalities, determining a confidence value indicating that the cardinality is correct.

13. The system according to claim 8, wherein providing at least the cardinality of unique descriptors to the speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video includes providing multiple cardinalities of unique descriptors and, for each cardinality of unique descriptors, providing a confidence value for that cardinality.

14. The system according to claim 8, wherein providing at least the cardinality of unique descriptors to the speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video includes providing the cardinality of unique descriptors that has the highest confidence value relative to all other cardinalities of unique descriptors.

15. The system according to claim 8, wherein the speaker segmentation clustering system performs speech detection, uses the cardinality of unique descriptors to determine the number of speakers, and associates multiple parts of the audio with speakers corresponding to the cardinality.
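
As a rough illustration of how a downstream system could use the cardinality recited in claim 15, the sketch below fixes the number of clusters to the supplied speaker count and assigns each audio segment to one of them. Everything here is an assumption for the example: the claim does not specify k-means, the audio segments are represented by made-up two-dimensional embedding vectors, and `assign_segments` is a hypothetical helper. Farthest-point initialization is used only to keep the example deterministic.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_segments(embeddings: List[Vector], num_speakers: int,
                    iterations: int = 20) -> List[int]:
    """Plain k-means with k fixed to the visually estimated speaker count."""
    if not 1 <= num_speakers <= len(embeddings):
        raise ValueError("num_speakers must be between 1 and the segment count")
    # Deterministic farthest-point initialization of the centroids.
    centroids = [list(embeddings[0])]
    while len(centroids) < num_speakers:
        farthest = max(
            embeddings,
            key=lambda e: min(squared_distance(e, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(embeddings)
    for _ in range(iterations):
        # Assignment step: each segment goes to its nearest centroid.
        labels = [
            min(range(num_speakers),
                key=lambda k: squared_distance(e, centroids[k]))
            for e in embeddings
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(num_speakers):
            members = [e for e, label in zip(embeddings, labels) if label == k]
            if members:
                centroids[k] = [sum(column) / len(members)
                                for column in zip(*members)]
    return labels


# Five audio segments with made-up 2-D embeddings and a cardinality of 2.
segment_embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (0.05, 0.05)]
speaker_labels = assign_segments(segment_embeddings, num_speakers=2)
```

Fixing the cluster count in this way removes the need for the audio clustering stage to estimate the number of speakers itself.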

16. A non-transitory computer-readable medium storing instructions executable by a data processing apparatus, wherein the instructions, when executed, cause the data processing apparatus to perform operations including: obtaining a video that includes multiple image frames and corresponding audio; detecting multiple speakers depicted in the video without using biometric information to identify the speakers, wherein the detection of each person is based on the speaker’s face detected in a face detection bounding box; in response to detecting the multiple speakers depicted in the video, determining, for each detected person, a bounding box that includes the detected person in the image frame and objects within a threshold distance of the detected person, the bounding box being larger than and including the face detection bounding box; determining, for each detected person, a unique descriptor for that person from the image information included within the bounding box for that person and without using biometric data identifying that person, wherein the unique descriptor is based in part on the image information depicting the objects within the bounding box, the unique descriptor for the detected person does not include biometric data identifying that person, and determining the unique descriptor for that person includes determining a histogram of pixel activation for each of a plurality of frames, the histogram of pixel activation being determined based on local factors from the image information included within the bounding box; determining a cardinality of unique descriptors for the video, including: generating histogram clusters, wherein each histogram cluster includes only histograms that are within a threshold distance of each other; and determining the cardinality of unique descriptors as the number of histogram clusters; and providing at least the cardinality of unique descriptors to a speaker segmentation clustering system that identifies unique speakers for the corresponding audio of the video.