Speech recognition and content de-identification

By extracting speaker feature vectors and text content, and combining natural language processing and speech conversion technologies, privacy-sensitive information in voice data is anonymized, solving the problem of identity and content privacy protection in voice data, and achieving security for data sharing and secondary use.

CN115427958B · Active · Publication Date: 2026-06-26 · INTERNATIONAL BUSINESS MACHINE CORPORATION

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: INTERNATIONAL BUSINESS MACHINE CORPORATION
Filing Date: 2021-04-26
Publication Date: 2026-06-26

Application Information

Patent Timeline

26 Apr 2021: Application
26 Jun 2026: Publication (CN115427958B)

IPC: G06F21/62; G10L21/00; H04K1/00; G10L13/033; G10L15/26; G06F40/279; G10L21/013

CPC: G10L15/26; G10L21/00; G10L2021/0135; G10L13/033; G06F21/6254; G06F40/279; H04K1/00; G06F2221/2113

AI Tagging

Technology Topics

Privacy protection; Speech sound

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively protect the speaker's identity and content privacy in voice data, especially before sharing and secondary use, as they cannot effectively anonymize sensitive information in voice data.

Method used

By extracting the speaker's feature vector and text content, natural language processing techniques are used to anonymize privacy-sensitive information in the speech data, and a synthetic speaker identity is generated to hide the real identity. Speech conversion technology is used to generate a new speech waveform that is different from the original speech.

Benefits of technology

It anonymizes both speaker identity and voice content, so that personal identity and sensitive information are not leaked when voice data is shared, and it supports the secondary use of voice data.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

One embodiment of the invention provides a method for speaker identity and content de-identification under privacy guarantees. The method includes receiving an input indicating a privacy protection level to be implemented, extracting features from speech recorded in a voice recording, identifying and extracting textual content from the speech, parsing the textual content to identify privacy-sensitive personal information about an individual, generating de-identified textual content by anonymizing the personal information to a degree that satisfies the privacy protection level and hides personal identity, and mapping the de-identified textual content to a speaker that delivered the speech. The method also includes generating a synthetic speaker identity based on other features that are dissimilar to the features to a degree that satisfies the privacy protection level, and synthesizing a new speech waveform based on the synthetic speaker identity to deliver the de-identified textual content. The new speech waveform hides the identity of the speaker.

Description

Background Technology

[0001] Embodiments of the present invention generally relate to data privacy protection, and more specifically, to methods and systems for speaker identity and content de-identification with data privacy guarantees.

Summary of the Invention

[0002] One embodiment of the present invention provides a method for speaker identity and content de-identification under data privacy guarantees. The method includes receiving input indicating at least one privacy protection level requiring speaker identity and content de-identification, and extracting features corresponding to the first speaker from first speech transmitted by a first speaker and recorded in a first voice recording. The method further includes recognizing and extracting text content from the first speech, parsing the text content to identify privacy-sensitive personal information corresponding to a first individual, and generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The method further includes mapping the de-identified text content to the first speaker, generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesizing a new speech waveform based on the synthetic speaker identity to transmit the de-identified text content. The dissimilarity between the other features corresponding to the at least one other speaker and the features corresponding to the first speaker reaches a level that satisfies the at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform hides the personal identity of the first speaker.

[0003] Another embodiment of the present invention provides a system for speaker identity and content de-identification under data privacy guarantees. The system includes at least one processor and a non-transitory processor-readable storage device storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations. The operations include receiving input indicating at least one privacy protection level requiring speaker identity and content de-identification, and extracting features corresponding to the first speaker from a first speech transmitted by a first speaker and recorded in a first voice recording. The operations also include recognizing and extracting text content from the first speech, parsing the text content to identify privacy-sensitive personal information corresponding to a first individual, and generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The operations further include mapping the de-identified text content to the first speaker, generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesizing a new speech waveform based on the synthetic speaker identity to transmit the de-identified text content. The dissimilarity between other features corresponding to the at least one other speaker and features corresponding to the first speaker reaches a level that satisfies at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform conceals the personal identity of the first speaker.

[0004] One embodiment of the present invention provides a computer program product for speaker identity and content de-identification under data privacy protection. The computer program product includes a computer-readable storage medium having program instructions. The program instructions are executable by a processor to cause the processor to receive input indicating that at least one privacy protection level needs to be implemented for speaker identity and content de-identification, and to extract features corresponding to the first speaker from a first speech transmitted by a first speaker and recorded in a first voice recording. The program instructions further cause the processor to identify and extract text content from the first speech, parse the text content to identify privacy-sensitive personal information corresponding to the first individual, and generate de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The program instructions further cause the processor to map the de-identified text content to the first speaker, generate a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesize a new speech waveform based on the synthetic speaker identity to transmit the de-identified text content. The dissimilarity between other features corresponding to the at least one other speaker and features corresponding to the first speaker reaches a level that satisfies at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform conceals the personal identity of the first speaker.

[0005] These and other aspects, features, and advantages of embodiments of the present invention will be understood with reference to the accompanying drawings and the detailed description herein, and will be realized by means of the various elements and combinations particularly pointed out in the appended claims. It should be understood that both the foregoing general description and the following brief description of the drawings and detailed description of the embodiments are exemplary and explanatory of preferred embodiments of the invention, and are not intended to limit the claimed embodiments of the invention.

Description of the Drawings

[0006] The subject matter considered to be embodiments of the invention is specifically pointed out and clearly claimed in the claims at the end of the specification. The foregoing and other objects, features, and advantages of embodiments of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, wherein:

[0007] Figure 1 shows a cloud computing environment according to an embodiment of the present invention;

[0008] Figure 2 shows abstraction model layers according to an embodiment of the present invention;

[0009] Figure 3 shows an example computing architecture for implementing speaker identity and content de-identification according to an embodiment of the present invention;

[0010] Figure 4 shows an example speaker identity and content de-identification system according to an embodiment of the present invention;

[0011] Figure 5 shows an example annotation that maps the speech waveform of a voice recording to the words spoken by the speaker to whom the recording corresponds, according to an embodiment of the present invention;

[0012] Figure 6 shows an example graphical representation of speaker feature vectors in a two-dimensional space according to an embodiment of the present invention;

[0013] Figure 7 is a flowchart of an example process for speaker identity and content de-identification according to embodiments of the present invention; and

[0014] Figure 8 is a high-level block diagram illustrating an information processing system for implementing embodiments of the present invention.

[0015] The preferred embodiments, advantages, and features of the invention are explained in detail with reference to the accompanying drawings.

Detailed Description

[0016] Embodiments of the present invention generally relate to data privacy protection, and more specifically, to methods and systems for speaker identity and content de-identification under data privacy guarantees. One embodiment of the present invention provides a method for speaker identity and content de-identification under data privacy guarantees. The method includes receiving input indicating at least one privacy protection level requiring speaker identity and content de-identification, and extracting features corresponding to the first speaker from first speech transmitted by a first speaker and recorded in a first voice recording. The method further includes recognizing and extracting text content from the first speech, parsing the text content to identify privacy-sensitive personal information corresponding to a first individual, and generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The method further includes mapping the de-identified text content to the first speaker, generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesizing a new speech waveform based on the synthetic speaker identity to deliver the de-identified text content. The dissimilarity between other features corresponding to the at least one other speaker and features corresponding to the first speaker reaches a level that satisfies at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform conceals the personal identity of the first speaker.
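The dissimilarity requirement on the synthetic speaker identity can be illustrated with a minimal sketch. It is not the patented method: the use of Euclidean distance, the averaging of the selected vectors, and the names `euclidean`, `synthetic_identity`, and `min_distance` (standing in for the privacy protection level) are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def synthetic_identity(target, others, min_distance):
    """Build a synthetic feature vector from sufficiently dissimilar speakers.

    Assumption: min_distance plays the role of the privacy protection level.
    Only other speakers at least min_distance away from the target are used,
    and their vectors are averaged. Returns None if no speaker qualifies or
    if the averaged vector itself ends up too close to the target.
    """
    far = [v for v in others if euclidean(target, v) >= min_distance]
    if not far:
        return None
    centroid = [sum(v[i] for v in far) / len(far) for i in range(len(target))]
    if euclidean(target, centroid) < min_distance:
        return None
    return centroid

# Toy two-dimensional feature vectors (hypothetical values).
first_speaker = [0.0, 0.0]
other_speakers = [[0.5, 0.5], [3.0, 4.0], [5.0, 0.0]]
pseudo = synthetic_identity(first_speaker, other_speakers, min_distance=2.0)
print(pseudo)
```

In this toy input the speaker at [0.5, 0.5] is too similar to the first speaker and is excluded, so the synthetic identity is built only from the two distant speakers.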

[0017] Another embodiment of the present invention provides a system for speaker identity and content de-identification under data privacy guarantees. The system includes at least one processor and a non-transitory processor-readable storage device storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations. The operations include receiving input indicating at least one privacy protection level requiring speaker identity and content de-identification, and extracting features corresponding to the first speaker from first speech transmitted by a first speaker and recorded in a first voice recording. The operations further include recognizing and extracting text content from the first speech, parsing the text content to identify privacy-sensitive personal information corresponding to a first individual, and generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The operations further include mapping the de-identified text content to the first speaker, generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesizing a new speech waveform based on the synthetic speaker identity to transmit the de-identified text content. The dissimilarity between other features corresponding to the at least one other speaker and features corresponding to the first speaker reaches a level that satisfies at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform conceals the personal identity of the first speaker.

[0018] One embodiment of the present invention provides a computer program product for speaker identity and content de-identification under data privacy guarantees. The computer program product includes a computer-readable storage medium having program instructions. The program instructions are executable by a processor to cause the processor to receive input indicating at least one privacy protection level to be implemented for speaker identity and content de-identification, and to extract features corresponding to the first speaker from a first speech transmitted by a first speaker and recorded in a first voice recording. The program instructions further cause the processor to identify and extract text content from the first speech, parse the text content to identify privacy-sensitive personal information corresponding to a first individual, and generate de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that at least one privacy protection level is satisfied. The de-identified text content hides the personal identity of the first individual. The program instructions further cause the processor to map the de-identified text content to the first speaker, generate a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, and synthesize a new speech waveform based on the synthetic speaker identity to transmit the de-identified text content. The dissimilarity between other features corresponding to the at least one other speaker and features corresponding to the first speaker reaches a level that satisfies at least one privacy protection level. The new speech waveform differs from the speech waveform of the first speech, and the new speech waveform conceals the personal identity of the first speaker.

[0019] For illustrative purposes, the term "de-identification" generally refers to the process of preventing the disclosure of an individual's personal identity. An individual's personal identity includes information that indicates one or more private characteristics of the individual (i.e., privacy-sensitive personal information such as, but not limited to, gender, age, health, mood, education, origin, etc.), from which a third party can infer the individual's identity if that information is disclosed to the third party.

[0020] For illustrative purposes, the term "voice data" generally refers to data comprising one or more voice recordings of speech delivered by one or more speakers.

[0021] For illustrative purposes, the terms “speaker de-identification” and “voice de-identification” generally refer to the process of applying de-identification to voice data, which includes a recording of speech delivered by a speaker, to prevent the speaker’s personal identity and voice from being disclosed.

[0022] For illustrative purposes, the terms “text content de-identification” and “content de-identification” generally refer to the process of applying de-identification to text content that includes privacy-sensitive personal information about an individual to prevent the disclosure of an individual’s personal identity from the text content.

[0023] For illustrative purposes, the term “speaker identity and content de-identification” generally refers to the process of applying both speaker de-identification and content de-identification to voice data.

[0024] For illustrative purposes, the term "direct identifier" generally refers to a data attribute, word, tag, or value that can be used alone to identify an individual. A direct identifier can uniquely correspond to an individual, such that when present in data, it reveals the identity of the corresponding individual. Examples of direct identifiers include, but are not limited to, personal names, social security numbers, national IDs, credit card numbers, phone numbers, medical record numbers, IP addresses, account numbers, etc.
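As a minimal sketch of how direct identifiers might be masked in transcribed text, the following replaces two kinds of direct identifiers with placeholder tags. The regular expressions cover only simple US-style formats, and the names `PATTERNS` and `mask_direct_identifiers` are hypothetical; the patent text does not specify a detection technique at this level of detail.

```python
import re

# Hypothetical patterns for two direct identifiers (US-style formats only).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def mask_direct_identifiers(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

transcript = "Call me at 555-867-5309, my SSN is 123-45-6789."
masked = mask_direct_identifiers(transcript)
print(masked)
```

Pattern matching of this kind only catches identifiers with a fixed surface form. Names and other free-form identifiers would need the natural language processing step described elsewhere in the document.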

[0025] For illustrative purposes, the terms "indirect identifier" or "quasi-identifier" generally refer to data attributes, words, tags, or values that cannot be used alone to identify an individual but can be used in combination with one or more other indirect / quasi-identifiers to identify that individual. The combination of indirect / quasi-identifiers corresponding to an individual can be unique or extremely rare, such that the presence of the combination in the data reveals the identity of the corresponding individual, or the combination can be linked to the identity of the corresponding individual using records in externally available datasets containing the individual's name (e.g., voter registration lists, decennial records, U.S. Census, etc.). For example, for most of the U.S. population, the combination of date of birth, gender, and a five-digit zip code is unique.
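The re-identification risk from quasi-identifier combinations can be checked by counting how often each combination occurs in a dataset. The sketch below flags records whose combination is unique; the function name `unique_combinations`, the field names, and the sample records are all illustrative assumptions.

```python
from collections import Counter

def unique_combinations(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs only once."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Hypothetical records; none of the three fields identifies a person on its own.
records = [
    {"dob": "1980-01-02", "gender": "F", "zip": "10001"},
    {"dob": "1980-01-02", "gender": "F", "zip": "10001"},
    {"dob": "1975-06-30", "gender": "M", "zip": "10001"},
]
at_risk = unique_combinations(records, ["dob", "gender", "zip"])
print(at_risk)
```

Here the zip code alone is shared by all three records, but the full combination singles out the third record, which is the situation the paragraph above describes.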

[0026] Embodiments of the present invention provide a method and system for voice de-identification and content de-identification of voice recordings, which protect the personal identity of the speaker transmitting the voice recorded in the voice recording as well as privacy-sensitive personal information included in the text content of the voice.

[0027] The speaker delivering the speech produces a human voice that carries voice signals indicating the speaker's privacy-sensitive personal information. For example, the timbre of a speaker's voice often carries most of the speaker's personally identifiable information. Since no two individuals sound the same, an individual's voice can be used as an identifier by combining one or more physiological characteristics of the speaker's vocal tract system and / or one or more behavioral characteristics of the human voice (e.g., rhythm, intonation, vocabulary, accent, pronunciation, speaking style, etc.) into a unique biometric pattern (i.e., a signature) for that individual.

[0028] With the recent ubiquitous growth of Automatic Speaker Verification (ASV) systems, effectively protecting the personal identity of speakers in voice data has become essential. Furthermore, because speech can contain inherently highly sensitive content, privacy protection measures to safeguard this content are necessary to comply with existing data privacy laws. For example, voice data that includes audio clinical data (e.g., voice recordings of clinicians' speech in electronic health records (EHRs) and recordings of clinicians' encounters with patients) contains privacy-sensitive personal information about patients, such as Protected Health Information (PHI); such data must undergo de-identification before being shared with one or more third parties for secondary use (e.g., to support medical research).

[0029] Traditional solutions for speaker de-identification utilize voice transformation (VT), a technique that modifies the original, non-linguistic properties of a spoken utterance to anonymize the speaker's speech without affecting the speech content. Specifically, VT modifies the speaker's speech through one of the following: (1) source modification involving changes to the time scale, pitch, and / or energy of the speaker's voice; (2) filter modification involving changes to the timbre (i.e., magnitude response) of the speaker's voice; or (3) a combination of source modification and filter modification.
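A toy illustration of source modification, under stated simplifications: the sketch changes the energy of a waveform with a gain factor and changes its time scale by naive linear-interpolation resampling. Naive resampling alters duration and pitch together when the result is played back at the original sample rate, so it is not an independent time-scale or pitch modification, and practical VT systems use more elaborate signal processing. The function names `change_energy` and `resample` are hypothetical.

```python
import math

def change_energy(samples, gain):
    """Scale the amplitude of every sample by a constant gain."""
    return [s * gain for s in samples]

def resample(samples, factor):
    """Naive linear-interpolation resampling.

    With factor > 1 the output is shorter; played at the original sample
    rate it is both faster and higher in pitch than the input.
    """
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

rate = 8000  # samples per second (assumed)
# One second of a 200 Hz sine tone standing in for a voiced sound.
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
modified = change_energy(resample(tone, 1.25), 0.5)
print(len(tone), len(modified))
```

With a factor of 1.25, the one-second tone shrinks to 0.8 seconds and its frequency rises from 200 Hz to about 250 Hz at the same playback rate, while the gain of 0.5 halves its peak amplitude.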

[0030] Voice conversion (VC) is a special form of VT that involves mapping the characteristics of one speaker's voice (i.e., the source speaker's voice) to the characteristics of another individual's voice (i.e., the target speaker's voice). The source speaker can use VC to mimic / simulate the target speaker's voice. VC requires both the source and target speakers to produce spoken utterances of the same text, drawn from the same corpus, for training purposes.

[0031] Embodiments of the present invention provide a method and system for speaker de-identification that use existing feature vector extraction methods from ASV to manipulate speech data and construct speaker identities for different individuals while providing data privacy guarantees. In one embodiment, privacy is protected by controlling the text content of the speech data through anonymization. Embodiments of the present invention provide a novel method that combines speaker de-identification and text content de-identification to both hide the speaker's identity and anonymize the text content of the speech, while providing established data privacy guarantees. This method can be used in a wide range of real-world applications to effectively and provably anonymize speech data and voice recordings, and it facilitates the secondary use of the resulting anonymized speech data and voice recordings.

[0032] It should be understood that although this disclosure includes a detailed description of cloud computing, the implementation of the teachings set forth herein is not limited to a cloud computing environment. Rather, embodiments of the invention can be implemented in conjunction with any other type of computing environment now known or developed hereafter.

[0033] Cloud computing is a service delivery model that enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services), which can be rapidly provisioned and released with minimal management effort or interaction with service providers. This cloud model may include at least five features, at least three service models, and at least four deployment models.

[0034] The features are as follows:

[0035] On-demand self-service: A cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, automatically as needed, without requiring human interaction with the service provider.

[0036] Broad network access: Capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin client or thick client platforms (e.g., mobile phones, laptops, and PDAs).

[0037] Resource pooling: A provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, where different physical and virtual resources are dynamically assigned and reassigned as needed. There is a sense of location independence because consumers typically do not have control or knowledge of the exact location of the resources provided, but may be able to specify the location at a higher level of abstraction (e.g., country, state, or data center).

[0038] Rapid elasticity: Capabilities can be provisioned rapidly and elastically, in some cases automatically, to scale out quickly, and released rapidly to scale in quickly. To the consumer, the capabilities available for provisioning often appear unlimited and can be purchased in any quantity at any time.

[0039] Measured service: Cloud systems automatically control and optimize resource usage by leveraging metering capabilities at a level of abstraction appropriate to the service type (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency to both service providers and consumers.

[0040] The service model is as follows:

[0041] Software as a Service (SaaS): This provides consumers with the ability to use the provider's applications running on cloud infrastructure. Applications can be accessed from different client devices via thin client interfaces such as web browsers (e.g., web-based email). Consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating system, storage, or even individual application capabilities, with possible exceptions such as limited user-specific application configuration settings.

[0042] Platform as a Service (PaaS): This provides consumers with the ability to deploy applications created or acquired by the consumer using programming languages and tools supported by the provider onto cloud infrastructure. Consumers do not manage or control the underlying cloud infrastructure, including networks, servers, operating systems, or storage, but they have control over the deployed applications and the configuration of any application hosting environment.

[0043] Infrastructure as a Service (IaaS): The capabilities offered to consumers are processing, storage, networking, and other basic computing resources that enable consumers to deploy and run arbitrary software, which may include operating systems and applications. Consumers do not manage or control the underlying cloud infrastructure, but rather have control over the operating system, storage, deployed applications, and potentially limited control over selected networking components (e.g., host firewalls).

[0044] The deployment model is as follows:

[0045] Private cloud: A cloud infrastructure that operates solely for an organization. It can be managed by the organization or a third party and can exist on-site or off-site.

[0046] Community cloud: A cloud infrastructure shared by several organizations and supporting a specific community with shared concerns (e.g., tasks, security requirements, policies, and compliance considerations). It can be managed by an organization or a third party and can exist on-site or off-site.

[0047] Public cloud: The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization that sells cloud services.

[0048] Hybrid cloud: The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).

[0049] Cloud computing environments are service-oriented, focusing on statelessness, loose coupling, modularity, and semantic interoperability. At the heart of cloud computing is the infrastructure comprising a network of interconnected nodes.

[0050] Figure 1 illustrates a cloud computing environment 50 according to an embodiment of the present invention. As shown, in one embodiment, the cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud consumers, such as a personal digital assistant (PDA) or cellular phone 54A, a desktop computer 54B, a laptop computer 54C, and / or an automotive computer system 54N, may communicate. In one embodiment, the nodes 10 communicate with each other. In one embodiment, they are grouped (not shown) physically or virtually in one or more networks, such as private, community, public, or hybrid clouds as described above, or a combination thereof. This allows the cloud computing environment 50 to offer infrastructure, platforms, and / or software as services for which cloud consumers do not need to maintain resources on their local computing devices. It should be understood that the types of computing devices 54A-N shown in Figure 1 are for illustrative purposes only, and that computing nodes 10 and cloud computing environment 50 can communicate with any type of computerized device over any type of network and / or network-addressable connection (e.g., using a web browser).

[0051] Figure 2 illustrates a set of functional abstraction layers provided by the cloud computing environment 50 according to an embodiment of the present invention. It should be understood in advance that the components, layers, and functions shown in Figure 2 are for illustrative purposes only, and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:

[0052] The hardware and software layer 60 includes hardware and software components. Examples of hardware components include: a host 61; a server 62 based on a RISC (Reduced Instruction Set Computer) architecture; a server 63; a blade server 64; a storage device 65; and a network and network components 66. In some embodiments, software components include network application server software 67 and database software 68.

[0053] In one embodiment, virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities are provided: virtual server 71; virtual storage 72; virtual network 73, including virtual private network; virtual application and operating system 74; and virtual client 75.

[0054] In one example, management layer 80 may provide the following functionalities: Resource Provisioning 81 provides dynamic procurement of computing resources and other resources used to perform tasks within the cloud computing environment. Metering and Pricing 82 provides cost tracking as resources are utilized within the cloud computing environment and bills or invoices for the consumption of these resources. In one example, these resources may include application software licenses. Security provides authentication for cloud consumers and tasks, as well as protection for data and other resources. User Portal 83 provides access to the cloud computing environment for consumers and system administrators. Service Level Management 84 provides cloud resource allocation and management to ensure that required service levels are met. Service Level Agreement (SLA) Planning and Fulfillment 85 provides pre-scheduling and procurement of cloud resources based on anticipated future needs according to the SLA.

[0055] In one embodiment, workload layer 90 provides examples of functionalities that can leverage a cloud computing environment. Examples of workloads and functionalities that may be provided from this layer in one embodiment include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics and processing 94; transaction processing 95; and speaker identification and content de-identification 96 (e.g., speaker identification and content de-identification system 330, as described in detail later herein).

[0056] Figure 3 illustrates an example computing architecture 300 for implementing speaker identity and content de-identification according to an embodiment of the present invention. In one embodiment, computing architecture 300 is a centralized computing architecture. In another embodiment, computing architecture 300 is a distributed computing architecture.

[0057] In one embodiment, computing architecture 300 includes computing resources, such as, but not limited to, one or more processor units 310 and one or more storage units 320. One or more applications may utilize the computing resources of computing architecture 300 to execute / operate on computing architecture 300. In one embodiment, applications on computing architecture 300 include, but are not limited to, a speaker identification and content de-identification system 330 configured for speaker identification and content de-identification.

[0058] As described in detail later herein, in one embodiment, system 330 is configured to receive a dataset (e.g., a collection of voice data) comprising a set of voice recordings of a set of speakers, wherein the text content of the voice recordings includes privacy-sensitive personal information about a set of individuals. System 330 is configured to apply speaker de-identification and content de-identification to at least one voice recording to conceal the identity (i.e., personal identity) of at least one speaker and anonymize the privacy-sensitive personal information about at least one individual, thereby generating at least one de-identified voice recording from which the identity of at least one speaker and the privacy-sensitive personal information about at least one individual cannot be inferred. Each generated de-identified voice recording may be shared with one or more third parties for secondary use (e.g., sharing to support medical research).

[0059] For example, in one embodiment, the dataset is audio clinical data that includes voice recordings of clinicians included in the EHR and records of clinician-patient encounters, wherein the text content of the voice recordings contains the patients' PHI. System 330 extracts a subset of the audio clinical data (i.e., extracts one or more voice recordings) and de-identifies the extracted subset (i.e., applies speaker identity and content de-identification to the extracted voice recordings) to hide the clinicians' identities and anonymize the patients' PHI. Since privacy-sensitive personal information (e.g., PHI) about the patients cannot be inferred from the resulting de-identified subset (i.e., the resulting de-identified voice recordings), the de-identified subset can be shared with one or more third parties to support medical research.

[0060] In one embodiment, system 330 is incorporated / integrated into a cloud computing environment (e.g., IBM, etc.).

[0061] In one embodiment, the speaker identification and content de-identification system 330 is configured to exchange data with one or more electronic devices 350 and / or one or more remote server devices 360 via a connection (e.g., a wireless connection such as a Wi-Fi connection or a cellular data connection, a wired connection, or a combination of both).

[0062] In one embodiment, electronic device 350 includes one or more computing resources, such as, but not limited to, one or more processor units 351 and one or more storage units 352. One or more applications can utilize the one or more computing resources of electronic device 350 to execute / operate on electronic device 350, such as, but not limited to, one or more software applications 354 loaded or downloaded to electronic device 350. Examples of software applications 354 include, but are not limited to, artificial intelligence (AI) applications, big data analytics applications, etc.

[0063] Examples of electronic devices 350 include, but are not limited to, desktop computers, mobile electronic devices (e.g., tablets, smartphones, laptops, etc.), wearable devices (e.g., smartwatches, etc.), Internet of Things (IoT) devices, smart appliances such as smart TVs, etc.

[0064] In one embodiment, electronic device 350 includes one or more input / output (I / O) units 353 integrated into or coupled to electronic device 350, such as a keyboard, keypad, touch interface, display screen, etc. A user can utilize the I / O units 353 of electronic device 350 to configure one or more user preferences, configure one or more parameters (e.g., thresholds, boundaries, etc.), provide input (e.g., requests), etc.

[0065] In one embodiment, electronic device 350 and / or remote server device 360 may be a data source providing a dataset for speaker identification and content de-identification.

[0066] In one embodiment, the speaker identification and content de-identification system 330 may be accessed or utilized by one or more online services (e.g., AI services, big data analytics services, data processing services) hosted on a remote server device 360 and / or by one or more software applications 354 (e.g., AI applications, big data analytics applications, data processing applications) operating on an electronic device 350.

[0067] Figure 4 illustrates an exemplary speaker identification and content de-identification system 330 according to an embodiment of the present invention. In one embodiment, system 330 includes an input unit 400 configured to receive a dataset comprising a set of speech data. In one embodiment, the set of speech data comprises R voice records, where R is a positive integer. The R voice records are original voice records of speech delivered by S speakers, where S is a positive integer. Specifically, for each of the S speakers, the R voice records include at least one corresponding voice record of at least one speech delivered by the speaker. The text content of the speech delivered by the S speakers includes privacy-sensitive personal information about (i.e., relating to or involving) P individuals, such as PHI or other personally identifiable information (PII), where P is a positive integer.

[0068] In one embodiment, system 330 includes a feature extraction unit 410. For each of the S speakers, the feature extraction unit 410 is configured to generate a corresponding feature vector based on at least one voice record (of the R voice records) corresponding to the speaker. Specifically, in one embodiment, the feature extraction unit 410 is configured to: (1) for each of the R voice records, extract linguistic and speaker identity features from the voice record (i.e., extract features from the speech recorded in the voice record), and (2) for each of the S speakers, generate a corresponding feature vector based on the linguistic and speaker identity features extracted from all voice records corresponding to the speaker. For example, if the R voice records include two voice records that both correspond to the same speaker, then the feature extraction unit 410 generates a feature vector corresponding to the speaker based on the linguistic and speaker identity features extracted from the speech recorded in the two voice records.
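
The per-speaker aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the features of each voice record are assumed to be already extracted as a fixed-length list of numbers, and an element-wise mean is used as the aggregation, which the description does not prescribe. The function name `speaker_feature_vectors` and the sample data are hypothetical.

```python
from collections import defaultdict


def speaker_feature_vectors(recording_features):
    """Build one feature vector per speaker.

    recording_features: list of (speaker_id, features) pairs, one pair per
    voice record; `features` is a fixed-length list of floats (assumed to be
    extracted beforehand).
    """
    grouped = defaultdict(list)
    for speaker_id, features in recording_features:
        grouped[speaker_id].append(features)

    vectors = {}
    for speaker_id, rows in grouped.items():
        # Assumption: element-wise mean over all records of the same speaker.
        vectors[speaker_id] = [sum(col) / len(rows) for col in zip(*rows)]
    return vectors


# Two records belong to the same speaker, one record to another speaker.
records = [("spk1", [1.0, 2.0]), ("spk1", [3.0, 4.0]), ("spk2", [0.5, 0.5])]
vectors = speaker_feature_vectors(records)
```

Here the two records of `spk1` contribute to a single vector, mirroring the two-record example in the paragraph above.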

[0069] Examples of linguistic and speaker identity features extracted from the speech recorded in a voice record include, but are not limited to, one or more physiological features of the speaker's vocal tract system (e.g., vocal cords, vocal tract shape, timbre, pitch, etc.) and one or more behavioral features of the speaker's voice (e.g., rhythm, intonation, vocabulary, accent, pronunciation, speaking style, etc.).

[0070] In one embodiment, the feature extraction unit 410 generates a speaker-corresponding feature vector by training a feature vector based on linguistic and speaker identity features extracted from all speech records corresponding to the speaker.

[0071] In one embodiment, the feature extraction unit 410 utilizes a feature vector extraction method, such as, but not limited to, x-vectors, i-vectors, etc. In another embodiment, the feature extraction unit 410 utilizes a Fourier transform or Fast Fourier Transform (FFT) to decompose the unique tones of the S speakers and separate the voiceprints of the S speakers. The speech of each unique speaker can then be split into a separate speech file for processing.

[0072] In one embodiment, all feature vectors generated by the feature extraction unit 410 are maintained in a database (e.g., on storage unit 320).

[0073] In one embodiment, system 330 includes a text content extraction unit 420. For each of the R voice recordings, the text content extraction unit 420 is configured to identify and extract text content (i.e., a transcript) from the voice recording (i.e., extract text content from the speech recorded in the voice recording), wherein the text content includes words (or terms) spoken or uttered by the speaker corresponding to the voice recording (i.e., the speaker who delivered the speech). In one embodiment, the text content extraction unit 420 uses a speech recognition method or engine to identify and extract text content from the voice recording, such as, but not limited to, Speech to Text, Temi, Transcribe, etc.

[0074] In one embodiment, for each of the R voice recordings, the text content extraction unit 420 is optionally configured to generate corresponding annotation text based on the text content identified and extracted from the voice recording, wherein the annotation text is an annotation that provides a mapping between the speech waveform of the voice recording and the words (or terms) spoken or uttered by the speaker corresponding to the voice recording.

[0075] In one embodiment, the text content extraction unit 420 is configured to link the speech waveform of a voice recording with segments (i.e., portions) of the text content identified and extracted from the voice recording, wherein the segments include privacy-sensitive personal information (e.g., PHI or other PII) about one of the P individuals. In one embodiment, the segments are pre-filtered to obscure them before speaker identity and content de-identification of the voice recording is completed. In one embodiment, the speech waveform is tagged or labeled to identify the voice recording as one of the R voice recordings that cannot be published or disclosed to third parties for secondary use without inspection (e.g., the voice recording must undergo speaker identity and content de-identification before publication).

[0076] In one embodiment, system 330 includes a masking and tagging unit 430. For each of the R voice records, the masking and tagging unit 430 is configured to generate corresponding processed text content by identifying and processing privacy-sensitive personal information (e.g., PHI or other PII) contained in text content identified and extracted from the voice records. Specifically, for each of the R voice records, the masking and tagging unit 430 is configured to: (1) receive text content identified and extracted from the voice records (e.g., from the text content extraction unit 420), (2) parse the text content using at least one natural language processing (NLP) annotator to identify (i.e., find) at least one direct identifier and / or at least one quasi-identifier in the text content, and (3) process each identifier identified in the text content (i.e., direct identifier and / or quasi-identifier) based on the type of the identifier, thereby generating processed text content corresponding to the voice record.

[0077] For illustrative purposes, the term "PII word" generally refers to a word (or term) in text content that is a direct identifier or quasi-identifier. For illustrative purposes, the term "non-PII word" generally refers to a word (or term) in text content that is neither a direct identifier nor a quasi-identifier. Non-PII words cannot be linked to an individual's personal identity.

[0078] In one embodiment, the masking and tagging unit 430 processes a direct identifier identified in the text content by masking (i.e., replacing) it with a masking value (i.e., a replacement value) based on the type of the direct identifier. For example, in one embodiment, if the direct identifier is a name, the masking and tagging unit 430 replaces it with a random name (e.g., drawn from a dictionary or from a publicly available dataset such as a voter registration list) or a pseudonym (e.g., “Patient1234”). Alternatively, the masking and tagging unit 430 processes a direct identifier identified in the text content by suppressing it.
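
A minimal sketch of type-based masking is given below. It assumes the direct identifiers have already been located (here by simple regular expressions standing in for an NLP annotator) and are passed in as character spans. The function name, the span format, the pseudonym numbering, and the `[SUPPRESSED]` marker are all hypothetical choices made for illustration.

```python
import re


def mask_direct_identifiers(text, spans):
    """Mask direct identifiers in `text`.

    spans: list of (start, end, id_type) tuples for non-overlapping direct
    identifiers. Names are replaced by pseudonyms; any other type is
    suppressed (an illustrative policy, not the only possible one).
    """
    pseudonyms = {}  # surface form -> pseudonym, so repeats stay consistent
    out = []
    cursor = 0
    for start, end, id_type in sorted(spans):
        surface = text[start:end]
        if id_type == "name":
            replacement = pseudonyms.setdefault(
                surface, f"Patient{len(pseudonyms) + 1:04d}"
            )
        else:
            replacement = "[SUPPRESSED]"
        out.append(text[cursor:start])
        out.append(replacement)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "John Smith called. John Smith has SSN 123-45-6789."
# Stand-in annotator: regular expressions locate the identifiers.
spans = [(m.start(), m.end(), "name") for m in re.finditer("John Smith", text)]
spans += [(m.start(), m.end(), "ssn") for m in re.finditer(r"\d{3}-\d{2}-\d{4}", text)]
masked = mask_direct_identifiers(text, spans)
# masked == "Patient0001 called. Patient0001 has SSN [SUPPRESSED]."
```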

[0079] In one embodiment, the masking and tagging unit 430 processes quasi-identifiers identified in text content by tagging them with one or more labels based on the type of the quasi-identifier (e.g., age, gender, date, postal code, etc.). For example, in one embodiment, if the quasi-identifier identified in the text content is age, the masking and tagging unit 430 is configured to tag the quasi-identifier with one or more labels indicating that the quasi-identifier is age.
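
The tagging step might look like the sketch below. The regular expressions are hypothetical stand-ins for an NLP annotator, and the start / end tag syntax (`<age>…</age>`) is an assumed convention; the description does not fix a tag format.

```python
import re

# Hypothetical stand-ins for an NLP annotator: one pattern per
# quasi-identifier type.
QUASI_PATTERNS = {
    "age": re.compile(r"\b\d{1,3}(?= years old\b)"),
    "zip": re.compile(r"\b\d{5}\b"),
}


def tag_quasi_identifiers(text):
    """Wrap each detected quasi-identifier in start and end tags naming its type."""
    for label, pattern in QUASI_PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: f"<{label}>{m.group(0)}</{label}>", text
        )
    return text


tagged = tag_quasi_identifiers("The patient is 45 years old and lives in 12345.")
# tagged == "The patient is <age>45</age> years old and lives in <zip>12345</zip>."
```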

[0080] In one embodiment, if a fragment (i.e., a portion) of the text content contains unrecognized concepts (e.g., not recognized by an NLP annotator) or incomprehensible audio, the masking and tagging unit 430 is configured to annotate or tag the fragment as “unknown”, so that the fragment is ignored for further processing by the system 330.

[0081] A speaker's choice of words (i.e., word selection) can be a characteristic of the speaker's identity and can reveal the speaker's identity. In one embodiment, system 330 includes a word replacement unit 440. For each of the R voice records, word replacement unit 440 is configured to replace some words in the corresponding processed text content with similar words to protect the word choice of the speaker corresponding to the voice record. Specifically, for each of the R voice records, word replacement unit 440 is configured to: (1) receive processed text content corresponding to the voice record (e.g., from the masking and tagging unit 430), (2) select at least one word in the processed text content, the word being a quasi-identifier word or a non-PII word, and (3) replace each selected word with a similar (i.e., synonymous) word using a dictionary, lookup table, or vocabulary database (e.g., WordNet). In one embodiment, if the processed text content includes medical terms (e.g., the voice data is audio clinical data), the word replacement unit 440 can replace the medical terms in the processed text content with the SNOMED (a systematically organized, computer-processable collection of medical terms) codes or ICD-9 (International Classification of Diseases, Ninth Revision) codes to which the medical terms are mapped.

[0082] In one embodiment, for each of the S speakers, the similar word used as a replacement is the same (i.e., global) across all processed text content corresponding to the same speaker (i.e., all processed text content corresponding to all voice recordings of the same speaker), thereby protecting the speaker's identity while also preserving the utility of all processed text content. For example, in one embodiment, word replacement unit 440 can replace every occurrence of the word "found" in all processed text content with the same similar word "discovered," and can replace every occurrence of the word "elevated" in all processed text content with the same similar word "increased" (i.e., using the same similar word across all processed text content).
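
The global (per-speaker) replacement can be illustrated with a fixed synonym map applied to every processed text of one speaker. This is a sketch only: the map would in practice come from a dictionary, lookup table, or vocabulary database, the function name is hypothetical, and matching here is case-sensitive on whole words.

```python
import re


def replace_words_globally(texts, synonyms):
    """Apply one fixed synonym map to every processed text of the same speaker."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b")
    return [pattern.sub(lambda m: synonyms[m.group(1)], t) for t in texts]


# Hypothetical synonym map chosen once for this speaker.
synonyms = {"found": "discovered", "elevated": "increased"}
texts = ["We found elevated glucose.", "Nothing was found."]
replaced = replace_words_globally(texts, synonyms)
# replaced == ["We discovered increased glucose.", "Nothing was discovered."]
```

Because the same map is reused for all texts of the speaker, each original word always maps to the same replacement, which is the property the paragraph above relies on.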

[0083] In one embodiment, system 330 includes a text document generation unit 450. For each of the P individuals, the text document generation unit 450 is configured to generate a corresponding text document by combining all processed text content corresponding to the same individual (i.e., all processed text content including privacy-sensitive personal information about the same individual). Specifically, in one embodiment, for each of the P individuals, the text document generation unit 450 is configured to: (1) receive all processed text content corresponding to the same individual (e.g., from masking and tagging unit 430 and / or word replacement unit 440), wherein all processed text content includes privacy-sensitive personal information about an individual that has been masked, replaced, concealed, and / or tagged, and (2) generate a corresponding text document by combining all processed text content into a text document.

[0084] In one embodiment, the text document generation unit 450 generates a corresponding text document for each of the P individuals, thereby producing a set of P text documents. In one embodiment, the set of P text documents is maintained in a database (e.g., on storage unit 320). Alternatively, in one embodiment, the text document generation unit 450 generates corresponding text documents only for individuals who have multiple corresponding processed text contents (i.e., only for individuals who are the subjects of multiple voice records).

[0085] In one embodiment, if an individual is the subject of multiple voice records (i.e., all processed text content corresponding to the multiple voice records includes privacy-sensitive personal information about the individual), the text document generation unit 450 is configured to combine all processed text content corresponding to the multiple voice records into a corresponding text document based on timestamp vectors or classification similarity. For example, in one embodiment, all processed text content is arranged in chronological order in the text document based on timestamp vectors indicating the multiple voice records. As another example, in one embodiment, all processed text content is arranged in order of classification similarity in the text document.
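
The chronological variant can be sketched as below, assuming each processed text content carries a single ISO 8601 timestamp for its voice record (a simplification of the timestamp vectors mentioned above). The function name and sample data are hypothetical.

```python
from datetime import datetime


def build_document(processed):
    """Combine processed text contents of one individual in chronological order.

    processed: list of (iso_timestamp, text) pairs, one per voice record.
    """
    ordered = sorted(processed, key=lambda item: datetime.fromisoformat(item[0]))
    return "\n".join(text for _, text in ordered)


doc = build_document([
    ("2021-03-02T10:00:00", "Follow-up visit."),
    ("2021-01-15T09:30:00", "Initial visit."),
])
# doc == "Initial visit.\nFollow-up visit."
```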

[0086] In one embodiment, system 330 includes a content de-identification unit 460. For each of the P individuals, the content de-identification unit 460 is configured to: (1) receive a corresponding text document (e.g., from text document generation unit 450), wherein the text document includes all processed text content corresponding to the same individual (i.e., all processed text content includes privacy-sensitive personal information about the individual that has been masked, replaced, concealed, and / or tagged), and (2) generate corresponding de-identified text content by applying content de-identification to the text document. The applied content de-identification anonymizes all processed text content included in the text document to the extent that all processed text content retains its utility and does not disclose any privacy-sensitive personal information about the individual. All de-identified text content generated by the content de-identification unit 460 is suitable for distribution to third parties for secondary use.

[0087] In one embodiment, for each of the P individuals, the content de-identification unit 460 is configured to generate corresponding de-identified text content that hides the individual among k_c other individuals of the P individuals, where k_c ≤ P. This condition provides a data privacy guarantee against potential re-identification (by a third party) of the individual's original identity. If the corresponding de-identified text content is published to a third party or intercepted by a third party (e.g., an attacker), the probability that the third party successfully re-identifies the individual (i.e., infers the individual's identity) from the de-identified text content is limited to 1 / k_c. In one embodiment, k_c is set by the data owner or a de-identification expert (e.g., via I / O unit 353). In one embodiment, k_c is a re-identification risk threshold that is provided as input by the data owner or a de-identification expert and is used to enforce the required / necessary level of privacy protection (i.e., the likelihood of re-identification).

[0088] In one embodiment, the content de-identification applied by the content de-identification unit 460 includes the following steps: First, the content de-identification unit 460 parses each of the P text documents to generate a union of the terms / tokens that appear in the text documents, excluding each identified PII word (i.e., each direct identifier and / or quasi-identifier identified via the masking and tagging unit 430). The content de-identification unit 460 maintains a frequency list, which includes, for each term / token in the union, a corresponding frequency indicating the number of times that term / token appears (i.e., occurs) in the P text documents.

[0089] Second, the content de-identification unit 460 selects one or more infrequent terms from the P text documents for deletion based on at least one blacklist / dictionary for direct identifiers (e.g., a list of names extracted from a publicly available dataset such as a voter registration list). In one embodiment, the content de-identification unit 460 utilizes the at least one blacklist / dictionary to determine the maximum frequency F associated with the direct identifiers identified in the P text documents, where the maximum frequency F is used as a threshold for selecting infrequent terms in the P text documents for deletion. For example, in one embodiment, the content de-identification unit 460 selects for deletion all terms / tokens of the union whose corresponding frequency does not exceed the threshold F, such that the remaining terms / tokens of the union not selected for deletion have a corresponding frequency exceeding the threshold F.

[0090] For each infrequent term / token selected for deletion, the content de-identification unit 460 is configured to delete (i.e., filter out) the infrequent term / token from the P text documents.
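
The frequency list, the threshold F, and the deletion step can be sketched together as follows. The sketch assumes the documents are already tokenized and stripped of the PII words found earlier, and it interprets F as the highest frequency of any blacklisted term that still appears in the documents; both the function name and that interpretation are assumptions made for illustration.

```python
from collections import Counter


def delete_infrequent(documents, blacklist):
    """Delete every token whose frequency does not exceed the threshold F.

    documents: list of token lists (one per text document).
    blacklist: set of known direct identifiers (e.g., names).
    Returns the filtered documents and the threshold F.
    """
    freq = Counter(tok for doc in documents for tok in doc)
    # Assumed reading of F: highest frequency of any blacklisted term present.
    threshold = max((freq[t] for t in blacklist if t in freq), default=0)
    kept = [[tok for tok in doc if freq[tok] > threshold] for doc in documents]
    return kept, threshold


docs = [
    ["patient", "reports", "pain", "smith"],
    ["patient", "reports", "fever"],
    ["patient", "reports", "pain", "pain"],
]
filtered, F = delete_infrequent(docs, blacklist={"smith", "jones"})
# F == 1, so "smith" and "fever" (each seen once) are deleted.
```

Note that the harmless word "fever" is removed along with the name, which is the over-deletion that a later whitelist step is meant to correct.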

[0091] Content de-identification unit 460 selects unique terms and low-frequency terms (collectively referred to as infrequent terms) appearing in P text documents for deletion. Content de-identification unit 460 initially assumes that each infrequent term selected for deletion is a PII word. However, the infrequent terms selected for deletion may actually be non-PII words that do not need to be deleted (i.e., filtered out) from the P text documents. Third, to account for infrequent terms that are actually non-PII words, content de-identification unit 460 is optionally configured to restore one or more infrequent terms selected for deletion to the P text documents based on at least one whitelist of harmless terms. Each infrequent term selected for deletion but included in at least one whitelist is identified by content de-identification unit 460 as a known non-PII word and restored to the P text documents. Examples of whitelists that content de-identification unit 460 may utilize include, but are not limited to, known whitelists used for content de-identification and vocabulary databases (e.g., WordNet).
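
A sketch of the optional whitelist step follows. It assumes the set of terms selected for deletion is already known and that the whitelist is a plain set of harmless terms; the function name and the sample data are hypothetical.

```python
def apply_deletions(documents, selected, whitelist):
    """Delete selected infrequent terms, except those found on the whitelist.

    documents: list of token lists.
    selected: set of infrequent terms initially assumed to be PII words.
    whitelist: set of known harmless (non-PII) terms.
    Returns the filtered documents and the set of restored terms.
    """
    restored = selected & whitelist   # known non-PII words are kept
    to_delete = selected - whitelist  # everything else is filtered out
    filtered = [[tok for tok in doc if tok not in to_delete] for doc in documents]
    return filtered, restored


docs = [["patient", "has", "fever", "smith"]]
filtered, restored = apply_deletions(
    docs, selected={"fever", "smith"}, whitelist={"fever", "cough"}
)
# filtered == [["patient", "has", "fever"]]; restored == {"fever"}
```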

[0092] Fourth, the content de-identification unit 460 is configured to extract each quasi-identifier identified in the P text documents and create a corresponding record of structured data (i.e., structured representation) based on a list of known quasi-identifiers. The list of known quasi-identifiers identifies one or more structured representations used to maintain one or more values of one or more known types of quasi-identifiers. For example, if the list of known quasi-identifiers is defined as {date of birth, gender, 5-digit postal code}, then the list identifies a first structured representation for maintaining the value of the known type date of birth, a second structured representation for maintaining the value of the known type gender, and a third structured representation for maintaining the value of the known type 5-digit postal code. In one embodiment, the list of known quasi-identifiers is derived based on a publicly available dataset in the domain where the text documents reside (i.e., associated with them). In another embodiment, the list of known quasi-identifiers is provided by the data owner or a de-identification expert (e.g., via I / O unit 353).

[0093] Specifically, for each known type included in the list of known quasi-identifiers, the content de-identification unit 460 is configured to: (1) locate all text documents among the P text documents that contain at least one quasi-identifier, the quasi-identifier being marked with one or more tags indicating the known type; and (2) for each located text document, create a corresponding record of structured data maintaining the value of the known type. For example, if the known type is date of birth and the quasi-identifier is "November 2, 1980", the created record includes the following structured data: date of birth = "11 / 2 / 1980". As another example, if the known type is gender and the quasi-identifier is "he", the created record includes the following structured data: gender = "M". As yet another example, if the known type is postal code and the quasi-identifier is "12345", the created record includes the following structured data: postal code = "12345".
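
The record creation can be sketched as below. The tag syntax (`<dob>…</dob>` and so on), the type names, and the pronoun-to-gender table are assumptions for illustration; the date is normalized to month/day/year as in the example above. `datetime.strptime` with `%B` expects English month names under the default locale.

```python
import re
from datetime import datetime

# Assumed tag convention: <type>value</type>.
TAG = re.compile(r"<(?P<type>\w+)>(?P<value>.*?)</(?P=type)>")
# Hypothetical mapping from pronouns to a gender code ("U" = unknown).
PRONOUN_GENDER = {"he": "M", "him": "M", "she": "F", "her": "F"}


def normalize(qtype, value):
    """Convert a tagged surface form into its structured value."""
    if qtype == "dob":
        d = datetime.strptime(value, "%B %d, %Y")
        return f"{d.month}/{d.day}/{d.year}"
    if qtype == "gender":
        return PRONOUN_GENDER.get(value.lower(), "U")
    return value


def make_record(tagged_text, known_types):
    """Create one structured record from the tagged quasi-identifiers of a document."""
    record = {}
    for m in TAG.finditer(tagged_text):
        qtype = m.group("type")
        # Keep the first value found for each known type.
        if qtype in known_types and qtype not in record:
            record[qtype] = normalize(qtype, m.group("value"))
    return record


record = make_record(
    "Born <dob>November 2, 1980</dob>, <gender>he</gender> lives in <zip>12345</zip>.",
    known_types={"dob", "gender", "zip"},
)
# record == {"dob": "11/2/1980", "gender": "M", "zip": "12345"}
```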

[0094] In one embodiment, the content de-identification unit 460 suppresses each quasi-identifier in the P text documents that is marked with one or more tags indicating a quasi-identifier type not included in the list of known quasi-identifiers.

[0095] In one embodiment, the content de-identification unit 460 supports multiple anonymization algorithms. For each created record of structured data, the content de-identification unit 460 is configured to select an appropriate syntactic anonymization method (i.e., algorithm) among the multiple anonymization algorithms and apply it to the record to anonymize at least one value maintained in the record, resulting in an anonymized record of structured data that maintains the anonymized value. In one embodiment, for a known type included in the list of known quasi-identifiers, the content de-identification unit 460 is optionally configured to apply a micro-aggregation method to all records of structured data that maintain a value of the known type, thereby producing anonymized records of structured data that maintain a random value of the known type, wherein the random value is computed over a micro-aggregate. For each of the P text documents, the content de-identification unit 460 is configured to replace each quasi-identifier identified in the text document with the anonymized / random value for the known type of the quasi-identifier, wherein the anonymized / random value is obtained from the anonymized record of structured data corresponding to the text document.
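
Micro-aggregation can be illustrated with a simple univariate sketch: values are sorted, split into groups of at least k, and each value is replaced by a value computed over its group. For a deterministic example the group mean is used here in place of a random value, which is a substitution made for this sketch. The function name, the group-forming rule, and the sample ages are hypothetical.

```python
def microaggregate(values, k):
    """Replace each value with the mean of its group of at least k sorted neighbors."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        # Fold a short tail into the last group so every group has >= k members.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


ages = [34, 36, 35, 70, 72, 71, 74]
anonymized = microaggregate(ages, k=3)
# anonymized == [35.0, 35.0, 35.0, 71.75, 71.75, 71.75, 71.75]
```

Each output value is shared by at least k records, so a single record can no longer be singled out by this value alone.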

[0096] Finally, for each of the P individuals, the content de-identification unit 460 is configured to remove from the corresponding text document each tag with which a quasi-identifier identified in the text document is marked (e.g., remove the start and end tags), thereby generating the corresponding de-identified text content.

[0097] In one embodiment, system 330 includes a mapping unit 470. For each of the P individuals, the mapping unit 470 is configured to: (1) receive corresponding de-identified text content (e.g., from content de-identification unit 460), and (2) based on the R voice recordings, map one or more segments of the de-identified text content to both one or more speakers of the S speakers and one or more speeches delivered by the one or more speakers, wherein the one or more speeches include privacy-sensitive personal information about the individual (i.e., the one or more speeches are recorded in one or more voice recordings with the individual as the subject).

[0098] In one embodiment, system 330 includes a synthesized speaker identity creation unit 480. For each of the S speakers, the synthesized speaker identity creation unit 480 is configured to apply speaker de-identification to each of the R voice recordings corresponding to that speaker. In one embodiment, for each of the S speakers, the speaker de-identification applied by the synthesized speaker identity creation unit 480 includes: (1) generating a corresponding synthesized speaker identity, and (2) for each voice recording (of the R voice recordings) corresponding to the speaker, synthesizing a new speech waveform based on the synthesized speaker identity to deliver the de-identified text content mapped to both the speaker and the speech delivered by the speaker (and recorded in the voice recording). The new speech waveform sounds very different from the speech waveform of each voice recording corresponding to the speaker. In one embodiment, if the de-identified text content includes one or more suppressed values, each suppressed value is rendered as, or converted into, a beep in the new speech waveform. The beep notifies the listener of the new speech waveform that one or more words are missing.

[0099] In one embodiment, for each of the S speakers, the synthesized speaker identity creation unit 480 is configured to generate a corresponding synthesized speaker identity that satisfies the following conditions: (1) the synthesized speaker identity protects the speaker among k_s other speakers of the S speakers, where k_s ≤ S, and (2) the synthesized speaker identity is far removed from the speaker's original speaker identity (i.e., the new speech waveform synthesized using the synthesized speaker identity sounds very different from the speech waveform corresponding to each voice recording of the speaker). These conditions provide data privacy guarantees against potential re-identification of the speaker's original speaker identity by a third party. If a de-identified voice recording, including the new speech waveform synthesized using the synthesized speaker identity, is published to a third party or intercepted by a third party, the probability that the third party (e.g., an attacker) successfully re-identifies the speaker from the de-identified voice recording (i.e., infers the speaker's original speaker identity) is bounded by 1 / k_s. In one embodiment, k_s is set by the data owner or a de-identification expert (e.g., via I / O unit 353). In one embodiment, k_s is a re-identification risk threshold provided as input by the data owner or a de-identification expert, and is used to enforce the required / necessary level of privacy protection (i.e., the likelihood of re-identification).

[0100] In one embodiment, the synthesized speaker identity creation unit 480 clusters the S speakers into multiple clusters (i.e., groups) by clustering each feature vector corresponding to each of the S speakers (extracted via the feature extraction unit 410) based on a vector similarity metric / measure (e.g., Euclidean distance or cosine similarity metric), wherein each resulting cluster includes at least k_s feature vectors corresponding to k_s similar speakers of the S speakers. For each of the S speakers, the synthesized speaker identity creation unit 480 is configured to generate a corresponding synthesized speaker identity by: (1) selecting a cluster (i.e., a target cluster) that is as far as possible from another cluster (i.e., the source cluster) that includes the feature vector corresponding to that speaker (i.e., the speech waveforms of all feature vectors included in the selected cluster are very different from the speech waveform of each voice recording corresponding to that speaker); (2) applying an aggregation function to the speaker identity features (i.e., speech waveforms) of all feature vectors included in the selected cluster; and (3) generating a synthesized speaker identity based on the resulting aggregated speaker identity features. The degree of dissimilarity between the feature vectors included in the selected cluster and the feature vector corresponding to the speaker meets the required / necessary level of privacy protection.
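The clustering constraint described above (every cluster holds at least k_s feature vectors) can be sketched in Python as follows. The text does not name a clustering algorithm, so the greedy nearest-neighbor grouping below is only one possible choice and is an assumption, as are the function names, the speaker labels, and the two-dimensional coordinates.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_min_size(vectors: Dict[str, Vector], k_s: int) -> List[List[str]]:
    """Greedily group speakers so that every cluster has at least k_s members."""
    if k_s < 1 or len(vectors) < k_s:
        raise ValueError("need 1 <= k_s <= number of speakers")
    unassigned = sorted(vectors)  # deterministic processing order
    clusters: List[List[str]] = []
    while len(unassigned) >= k_s:
        seed = unassigned[0]
        # The seed (distance 0) plus its k_s - 1 nearest unassigned neighbors.
        by_distance = sorted(
            unassigned, key=lambda s: euclidean(vectors[seed], vectors[s]))
        group = by_distance[:k_s]
        clusters.append(group)
        unassigned = [s for s in unassigned if s not in group]
    # Fewer than k_s speakers remain: attach each to the cluster that
    # contains its nearest already-assigned speaker.
    for s in unassigned:
        nearest = min(
            clusters,
            key=lambda c: min(euclidean(vectors[s], vectors[m]) for m in c))
        nearest.append(s)
    return clusters


# Hypothetical two-dimensional feature vectors for six speakers.
speakers = {
    "s1": (0.0, 0.0), "s3": (0.1, 0.2), "s4": (0.2, 0.1),
    "s2": (5.0, 5.0), "s5": (5.1, 5.2), "s6": (4.9, 5.1),
}
print(cluster_min_size(speakers, k_s=3))
# [['s1', 's3', 's4'], ['s2', 's6', 's5']]
```

Because no cluster is ever smaller than k_s, a synthesized identity derived from any one cluster is shared by at least k_s speakers, which is what underlies the 1 / k_s bound.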

[0101] In one embodiment, system 330 includes an output unit 490. For each of R original voice recordings, output unit 490 is configured to publish the corresponding de-identified voice recording to a third party for secondary use, wherein the de-identified voice recording includes a synthesized speech waveform that transmits de-identified text content mapped to both the speaker corresponding to the original voice recording and the speech transmitted by the speaker and recorded in the original voice recording. Output unit 490 only publishes de-identification information, i.e., the de-identified voice recording and the de-identified text content. Output unit 490 does not publish the original voice recordings or the original text content identified and extracted from the original voice recordings.

[0102] In an example application scenario, assuming R=10, S=6, and P=20, the R voice records include a total of 10 voice records, namely voice record 1, voice record 2, ... and voice record 10; the S speakers include a total of 6 speakers, namely speaker 1, speaker 2, ... and speaker 6; and the P individuals include a total of 20 individuals, namely individual 1, individual 2, ... and individual 20. In one embodiment, for each of the 6 speakers, the feature extraction unit 410 generates a corresponding feature vector based on all language and speaker identity features extracted from all voice records (of the ten voice records) corresponding to the speaker, and maintains the feature vector in a database (e.g., on storage unit 320).

[0103] Suppose that system 330 receives a request via input unit 400 to apply speaker de-identification and content de-identification to three (3) given voice records out of ten voice records. Suppose that the three given voice records correspond to two specific speakers out of six speakers, and that the text content identified and extracted from the three voice records includes privacy-sensitive personal information about three specific individuals out of twenty.

[0104] Table 1 below provides example text records identified and extracted by the text content extraction unit 420 from three given voice records.

[0105] Table 1

[0106]

[0107]

[0108] As shown in Table 1, the three given voice recordings include: (1) Voice Recording 1 corresponding to Speaker 1, who is Dr. Carmen Dudley, a clinician in Mary, USA, wherein the text record identified and extracted from Voice Recording 1 includes privacy-sensitive personal information (e.g., PHI or other PII) about the individual Jane Alan as a patient; (2) Voice Recording 2 corresponding to Speaker 1 (i.e., Dr. Carmen Dudley, a clinician in Mary, USA), wherein the text record identified and extracted from Voice Recording 2 includes privacy-sensitive personal information about the individual Mr. Ted Borret as a patient; and (3) Voice Recording 3 corresponding to Speaker 2, who is Dr. Veep Bob, a clinician in Mary, USA, wherein the text record identified and extracted from Voice Recording 3 includes privacy-sensitive personal information about the individual Cathie Trian as a patient.

[0109] Figure 5 shows an example annotation illustrating the mapping between the speech waveform of a voice recording and the words spoken by the speaker corresponding to the voice recording, according to an embodiment of the present invention. Specifically, the annotation provides a mapping between the speech waveform of voice recording 1 and the text record identified and extracted from voice recording 1 (see Table 1). The annotation is generated by the text content extraction unit 420.

[0110] Table 2 below provides exemplary direct identifiers and quasi-identifiers identified by the masking and marking unit 430 in the text record of Table 1. For reference, each direct identifier identified by the masking and marking unit 430 is shown in bold underlined text, and each quasi-identifier identified by the masking and marking unit 430 is shown in bold with a label indicating the type of quasi-identifier.

[0111] Table 2

[0112]

[0113] As shown in Table 2, names in the text records are identified as direct identifiers (e.g., names “Jane Alan,” “Becket,” “Ted Borret,” “Cathie Trian,” and “Boris” are shown in bold with an underline). Further as shown in Table 2, one or more words (or terms) in the text records indicating a specific age, gender, date, diagnosis, or procedure are identified as quasi-identifiers (e.g., the date “August 14, 2013” in the text record identified and extracted from voice record 3 is marked with a start tag <date> and an end tag </date>).

[0114] As further shown in Table 2, the one or more NLP annotators applied to a text record do not necessarily recognize all direct identifiers and / or all quasi-identifiers in the text. For example, the direct identifiers “ID43729” and “ID53265”, as well as quasi-identifiers such as “Crohn's disease”, are not recognized by the NLP annotators. System 330 is configured to hide direct identifiers and quasi-identifiers (e.g., “ID…”) in the text that are not recognized by the NLP annotators. For example, content de-identification unit 460 selects unique terms and low-frequency terms (e.g., “ID…”) that appear in the text records for deletion (see Table 4 below).
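The selection of unique and low-frequency terms can be sketched as follows. The minimal Python example counts, for each term, the number of text records in which it appears and flags terms that fall below a threshold. The tokenizer, the threshold parameter min_records, the function names, and the three sample records are assumptions for illustration; only the identifiers “ID43729” and “ID53265” come from the example above.

```python
import re
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def rare_terms(records: List[str], min_records: int = 2) -> Set[str]:
    """Return the terms that appear in fewer than min_records records."""
    record_freq: Counter = Counter()
    for record in records:
        record_freq.update(set(tokenize(record)))  # count each record once
    return {term for term, n in record_freq.items() if n < min_records}


# Hypothetical text records.
records = [
    "patient ID43729 came to the clinic",
    "patient ID53265 came to the clinic",
    "patient came to the ward",
]
print(sorted(rare_terms(records)))  # ['id43729', 'id53265', 'ward']
```

Note that a frequency threshold alone also flags innocuous words (here “ward”), which is why a later step that restores harmless terms is useful.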

[0115] Table 3 below provides example masking values for the masking and marking unit 430 to replace direct identifiers identified in the text records of Table 2, and also provides example similar words for the word replacement unit 440 to replace some words in the text of Table 2. For reference, each masking value is indicated by underlined bold, and each similar word is indicated by bold and italics.

[0116] Table 3

[0117]

[0118] As shown in Table 3, the names “Jane Alan,” “Becket,” “Ted Borret,” “Cathie Trian,” and “Boris,” which were identified as direct identifiers in the text records of Table 2, were replaced with the masking values “Mary Quinn,” “Capeman,” “Albert Somaya,” “Ted Burner,” and “Rott,” respectively. Further, as shown in Table 3, each occurrence of the words “presented,” “instructed,” “experienced,” and “elevated” in the text records of Table 2 was replaced with the similar words “came,” “asked,” “had,” and “increased,” respectively.
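The two substitutions described above (masking values for direct identifiers, similar words for selected words) can be sketched as a single-pass dictionary replacement. The mappings below reuse pairs named in the text; the function name replace_terms and the sample sentence are hypothetical, and how the masking values and similar words are chosen is outside the scope of this sketch.

```python
import re
from typing import Dict

# Pairs taken from the example above.
MASKING_VALUES: Dict[str, str] = {
    "Jane Alan": "Mary Quinn",
    "Ted Borret": "Albert Somaya",
}
SIMILAR_WORDS: Dict[str, str] = {
    "presented": "came",
    "instructed": "asked",
    "experienced": "had",
    "elevated": "increased",
}


def replace_terms(text: str, mapping: Dict[str, str]) -> str:
    """Replace every whole-word occurrence of each key with its value."""
    if not mapping:
        return text
    # Longest keys first so multi-word names win over shorter overlaps.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Hypothetical sentence (not taken from any table in the text).
sentence = "Jane Alan presented with elevated temperature."
combined = {**MASKING_VALUES, **SIMILAR_WORDS}
print(replace_terms(sentence, combined))
# Mary Quinn came with increased temperature.
```

Doing the replacement in one pass avoids a replacement value being substituted again by a later rule.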

[0119] Table 4 below provides an exemplary set of terms / symbols generated by the content de-identification unit 460. For reference, each uncommon term / symbol selected by the content de-identification unit 460 for deletion is shown with a strikethrough.

[0120] Table 4

[0121] Table 5 below provides exemplary uncommon terms / symbols that were selected for deletion but were subsequently identified as harmless and restored by the content de-identification unit 460. For reference, each uncommon term / symbol subsequently identified as harmless and restored by the content de-identification unit 460 is shown in bold.

[0122] Table 5

[0123] As shown in Table 5, the uncommon terms selected for deletion, such as “care,” “department,” “evaluate,” “found,” “had,” “found,” “given,” “history,” “no,” “of,” “prescription,” “came,” “went,” “were,” “who,” and “instructed” (see Table 4), were subsequently identified as harmless and restored.

[0124] Let PLQ generally represent a list of known quasi-identifiers. In one example, suppose PLQ is represented according to the following provided list (1):

[0125] PLQ = {{age, gender}, {date}, {diagnosis}}(1),

[0126] PLQ includes the following elements: (1) a first element (“PLQ element 1”) representing the first structured representation {age, gender} used to maintain the values of the quasi-identifiers of known types, which is ...

[0127] Table 6 below provides exemplary quasi-identifiers extracted from the text records in Table 2 by the content de-identification unit 460. For reference, each quasi-identifier suppressed in the text records by the content de-identification unit 460 is shown with strikethrough.

[0128] Table 6

[0129]

[0130] As shown in Table 6, each extracted quasi-identifier is labeled with a tag indicating a known type included in the PLQ. Because procedure is not a known type included in the PLQ, the quasi-identifier "laparoscopic partial nephrectomy," which is labeled with a tag indicating a procedure, is suppressed in the text record.
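The rule illustrated by Table 6 can be sketched as follows: tagged quasi-identifiers whose type belongs to the PLQ are extracted, and those of any other type are suppressed. This minimal Python example assumes the tag syntax <type> ... </type> and renders a suppressed value as an asterisk; the function name, the tag name "procedure", and the sample age and gender values are hypothetical.

```python
import re
from typing import List, Set, Tuple

# PLQ as given in list (1): {{age, gender}, {date}, {diagnosis}}.
PLQ: List[Set[str]] = [{"age", "gender"}, {"date"}, {"diagnosis"}]
KNOWN_TYPES: Set[str] = set().union(*PLQ)

# Assumption: flat, non-nested tags of the form <type>value</type>.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>")


def split_quasi_identifiers(tagged_text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the text with unknown-type values suppressed, plus the
    (type, value) pairs whose type is known to the PLQ."""
    known: List[Tuple[str, str]] = []

    def handle(match: "re.Match[str]") -> str:
        qi_type, value = match.group(1), match.group(2)
        if qi_type in KNOWN_TYPES:
            known.append((qi_type, value))
            return match.group(0)  # keep the tagged value for anonymization
        return "*"  # suppress: the type is not covered by the PLQ

    return TAGGED.sub(handle, tagged_text), known


# Hypothetical tagged sentence.
text = ("A <age>70</age>-year-old <gender>woman</gender> had a "
        "<procedure>laparoscopic partial nephrectomy</procedure>.")
new_text, extracted = split_quasi_identifiers(text)
print(new_text)   # A <age>70</age>-year-old <gender>woman</gender> had a *.
print(extracted)  # [('age', '70'), ('gender', 'woman')]
```

The extracted pairs could then be grouped by PLQ element to form one structured record per text record.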

[0131] Table 7 below provides example records of structured data created by the content de-identification unit 460 based on PLQ element 1.

[0132] Table 7

[0133]

[0134] As shown in Table 7, each created record has a corresponding identifier (ID) indicating the text record from which the quasi-identifiers were extracted, and the record maintains the original values of those quasi-identifiers. The quasi-identifiers are labeled with tags indicating the known types age and gender.

[0135] Table 8 below provides an example record of the structured data created by the content de-identification unit 460 based on PLQ element 2.

[0136] Table 8

[0137]

[0138] As shown in Table 8, each created record has a corresponding ID indicating the text record from which a quasi-identifier was extracted, and the record maintains the original value of that quasi-identifier. The quasi-identifier is labeled with a tag indicating the known type date.

[0139] Table 9 below provides an example record of the structured data created by the content de-identification unit 460 based on PLQ element 3.

[0140] Table 9

[0141]

[0142] As shown in Table 9, each created record has a corresponding ID indicating the text record from which a quasi-identifier was extracted, and the record maintains the original value of that quasi-identifier. The quasi-identifier is labeled with a tag indicating the known type diagnosis.

[0143] Assume k_c = 2. In one embodiment, for each individual (i.e., each of the patients Jane Alan, Ted Borret, and Cathie Trian), the content de-identification unit 460 is configured to generate corresponding de-identified text content, such that the probability of a third party (e.g., an attacker) successfully re-identifying the individual based on the de-identified text content is bounded by 1 / 2.
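The 1 / 2 bound follows from the usual k-anonymity argument: if every combination of quasi-identifier values is shared by at least k_c records, an attacker who knows those values cannot narrow a target down to fewer than k_c candidates. The minimal Python sketch below checks this property on a set of generalized records. The function names and the record values are hypothetical and do not reproduce any table in the text.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Sequence


def reidentification_bound(records: List[Dict[str, str]],
                           quasi_identifiers: Sequence[str]) -> Fraction:
    """Upper bound on the re-identification probability:
    1 / (size of the smallest group of indistinguishable records)."""
    if not records:
        raise ValueError("no records to check")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return Fraction(1, min(groups.values()))


def is_k_anonymous(records: List[Dict[str, str]],
                   quasi_identifiers: Sequence[str], k: int) -> bool:
    return reidentification_bound(records, quasi_identifiers) <= Fraction(1, k)


# Hypothetical generalized records; "*" stands for a suppressed value.
records = [
    {"id": "1", "age": "60-69", "gender": "*"},
    {"id": "2", "age": "60-69", "gender": "*"},
    {"id": "3", "age": "30-39", "gender": "female"},
    {"id": "4", "age": "30-39", "gender": "female"},
]
print(reidentification_bound(records, ["age", "gender"]))  # 1/2
print(is_k_anonymous(records, ["age", "gender"], k=2))     # True
```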

[0144] Table 10 below provides example anonymized records of structured data generated by the content de-identification unit 460 applying relational 2-anonymity to the records of Table 7 to anonymize the original values maintained in the records into generalized values. For reference, the original values suppressed by the content de-identification unit 460 are shown as asterisks (*).

[0145] Table 10

[0146]

[0147] Table 11 below provides example anonymized records of structured data generated by the content de-identification unit 460 applying order-preserving sequence 2-anonymization to the records of Table 8 to anonymize the original values maintained in the records into generalized values.

[0148] Table 11

[0149]

[0150] Table 12 below provides example anonymized records of structured data generated by the content de-identification unit 460 applying set-based 2-anonymization to the records of Table 9 to anonymize the original values maintained in the records into generalized values. For reference, the original values suppressed by the content de-identification unit 460 are shown as asterisks (*).

[0151] Table 12

[0152]

[0153] Table 13 below provides example anonymous records of structured data generated by the content de-identification unit 460 applying a micro-aggregation method to the records in Table 10 to obtain random values computed on the micro-aggregates. For reference, each random value is shown in parentheses.

[0154] Table 13

[0155]

[0156]

[0157] Table 14 below provides example anonymous records of structured data generated by the content de-identification unit 460 applying a micro-aggregation method to the records in Table 11 to obtain random values computed on the micro-aggregates. For reference, each random value is shown in parentheses.

[0158] Table 14

[0159]

[0160] Table 15 below provides example anonymous records of structured data generated by the content de-identification unit 460 applying a micro-aggregation method to the records in Table 12 to obtain random values computed on the micro-aggregates. For reference, each random value is shown in parentheses.

[0161] Table 15

[0162]

[0163] As shown in Tables 13-15, each random value for each record is a plausible replacement value that can be used to replace the corresponding quasi-identifier in the text record from which the quasi-identifier was extracted. Each random value is randomly selected from the generalized values generated by applying the anonymization algorithms. In the case of categorical values, the random value may be randomly selected from a set of original values or from a subtree rooted at a node with a generalized value (e.g., "head-related medical issues").
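The selection of a random replacement value can be sketched as follows. For a numeric generalized value the sketch draws uniformly from the range, and for a categorical generalized value it draws from the leaves of the subtree rooted at that value. Only the node name "head-related medical issues" comes from the text; the rest of the hierarchy, the age range, and the function names are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical generalization hierarchy (parent -> children).
HIERARCHY: Dict[str, List[str]] = {
    "head-related medical issues": ["headache disorders", "head injury"],
    "headache disorders": ["migraine", "tension headache"],
    "head injury": ["concussion"],
}


def leaves(node: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Collect the leaf values of the subtree rooted at node."""
    children = hierarchy.get(node, [])
    if not children:
        return [node]
    found: List[str] = []
    for child in children:
        found.extend(leaves(child, hierarchy))
    return found


def surrogate_from_range(low: int, high: int, rng: random.Random) -> int:
    """Pick a value uniformly from the inclusive range [low, high]."""
    return rng.randint(low, high)


def surrogate_from_category(node: str, hierarchy: Dict[str, List[str]],
                            rng: random.Random) -> str:
    """Pick a leaf value uniformly from the subtree rooted at node."""
    return rng.choice(leaves(node, hierarchy))


rng = random.Random(7)  # seeded only to make the sketch repeatable
print(surrogate_from_range(60, 69, rng))
print(surrogate_from_category("head-related medical issues", HIERARCHY, rng))
```

Because the replacement is drawn from within the generalized value, it stays consistent with the anonymized record while reading as a concrete value in the text.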

[0164] Table 16 below provides example de-identified text records generated by the content de-identification unit 460 replacing some quasi-identifiers identified in the text records of Table 2 with the replacement values from Tables 13-15 and removing the labels of the quasi-identifiers. For reference, the original values suppressed by the content de-identification unit 460 are shown as asterisks (*).

[0165] Table 16

[0166]

[0167]

[0168] As shown in Table 16, the de-identified text of voice record 1 and voice record 2 was mapped to speaker 1 (clinician Carmen Dudley, MD), and the de-identified text of voice record 3 was mapped to speaker 2 (clinician VeepBob, MD). As shown in Table 16, the de-identified text did not reveal any privacy-sensitive personal information about the individuals (i.e., patients) Jane Alan, Ted Borret, and Cathie Trian.

[0169] Figure 6 shows an example graphical representation of the speakers' feature vectors in a two-dimensional space according to an embodiment of the present invention. The synthesized speaker identity creation unit 480 obtains each feature vector corresponding to each of the six speakers from a database and clusters all the obtained feature vectors based on a vector similarity metric / measure (e.g., Euclidean distance or cosine similarity metric). Assuming k_s = 3, the synthesized speaker identity creation unit 480 clusters all the obtained feature vectors into two separate clusters, cluster X and cluster Y, where each cluster includes feature vectors corresponding to three similar speakers. As shown in Figure 6, cluster X includes feature vector S-vector-1 corresponding to speaker 1 (clinician Carmen Dudley, MD), feature vector S-vector-3 corresponding to speaker 3, and feature vector S-vector-4 corresponding to speaker 4. Cluster Y includes feature vector S-vector-2 corresponding to speaker 2 (clinician VeepBob, MD), feature vector S-vector-5 corresponding to speaker 5, and feature vector S-vector-6 corresponding to speaker 6.

[0170] For each cluster, the synthesized speaker identity creation unit 480 applies an aggregation function to the speaker identity features of all feature vectors included in the cluster, and generates a synthesized speaker identity corresponding to that cluster based on the resulting aggregated speaker identity features. In one embodiment, the synthesized speaker identity creation unit 480 constructs a synthetic vector V_X corresponding to cluster X by calculating the minimum, maximum, and average values of all feature vectors included in cluster X, and constructs a synthetic vector V_Y corresponding to cluster Y by calculating the minimum, maximum, and average values of all feature vectors included in cluster Y. In another embodiment, the synthesized speaker identity creation unit 480 constructs the synthetic vector V_X by randomly selecting the value of each speaker identity feature from all feature vectors included in cluster X, and constructs the synthetic vector V_Y by randomly selecting the value of each speaker identity feature from all feature vectors included in cluster Y. The synthetic vectors V_X and V_Y represent the synthesized speaker identities corresponding to clusters X and Y, respectively.
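The two aggregation variants described above can be sketched in Python as follows. The first function computes the per-feature minimum, maximum, and average over a cluster; the text does not state how these three statistics are combined into a single synthetic vector, so the sketch simply returns all three. The second function draws each feature value at random from the values that feature takes within the cluster. The function names and the sample feature values are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence

Vector = Sequence[float]


def aggregate_stats(cluster: List[Vector]) -> Dict[str, List[float]]:
    """Per-feature minimum, maximum, and average over the cluster."""
    columns = list(zip(*cluster))  # one tuple of values per feature
    return {
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
        "mean": [mean(col) for col in columns],
    }


def aggregate_random(cluster: List[Vector], rng: random.Random) -> List[float]:
    """For each feature, pick the value of one randomly chosen member."""
    return [rng.choice(col) for col in zip(*cluster)]


# Hypothetical feature vectors of the three speakers in one cluster.
cluster_x = [(1.0, 4.0), (2.0, 6.0), (3.0, 8.0)]
print(aggregate_stats(cluster_x))
# {'min': [1.0, 4.0], 'max': [3.0, 8.0], 'mean': [2.0, 6.0]}
print(aggregate_random(cluster_x, random.Random(0)))
```

In the random variant, different features may come from different cluster members, so the resulting vector need not coincide with any single real speaker.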

[0171] Since the feature vector S-vector-1 corresponding to speaker 1 (clinician Carmen Dudley, MD) is included in cluster X, the synthesized speaker identity creation unit 480 selects cluster Y as the cluster furthest away from cluster X, and synthesizes, based on the synthetic vector V_Y, a new speech waveform to transmit the de-identified text records of voice record 1 and voice record 2 (see Table 16 above). Since the feature vector S-vector-2 corresponding to speaker 2 (clinician VeepBob, MD) is included in cluster Y, the synthesized speaker identity creation unit 480 selects cluster X as the cluster furthest away from cluster Y, and synthesizes, based on the synthetic vector V_X, a new speech waveform to transmit the de-identified text record of voice record 3 (see Table 16 above). In summary, the synthesized speaker identity creation unit 480 creates two different synthesized speaker identities (i.e., synthetic vectors V_X and V_Y) for two different speakers (i.e., clinician Carmen Dudley, MD and clinician VeepBob, MD), and these are used to synthesize new speech waveforms to transmit the de-identified text records of three different voice records (i.e., voice record 1, voice record 2, and voice record 3). In one embodiment, for each of the two different speakers (i.e., clinician Carmen Dudley, MD and clinician VeepBob, MD), the synthesized speaker identity creation unit 480 is configured to create a corresponding synthesized speaker identity, such that the probability of a third party (e.g., an attacker) successfully re-identifying the speaker from the new speech waveform synthesized using the synthesized speaker identity is bounded by 1 / 3.
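With only two clusters the choice of target cluster is forced, but the selection generalizes to more clusters. The minimal Python sketch below picks, for a given source cluster, the cluster whose centroid is furthest away. Measuring the distance between clusters by the Euclidean distance between centroids is an assumption, as are the function names, the third cluster Z, and all coordinates.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(cluster: List[Vector]) -> List[float]:
    """Per-feature mean of the vectors in the cluster."""
    return [sum(col) / len(col) for col in zip(*cluster)]


def farthest_cluster(source: str, clusters: Dict[str, List[Vector]]) -> str:
    """Name of the cluster whose centroid is furthest from the source's."""
    src = centroid(clusters[source])
    distances = {
        name: math.dist(src, centroid(vectors))
        for name, vectors in clusters.items()
        if name != source
    }
    if not distances:
        raise ValueError("at least two clusters are required")
    return max(distances, key=distances.get)


# Hypothetical clusters of two-dimensional feature vectors.
clusters = {
    "X": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "Y": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
    "Z": [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)],
}
print(farthest_cluster("X", clusters))  # Y
print(farthest_cluster("Y", clusters))  # X
```

A speaker whose feature vector lies in the source cluster would then be assigned the synthesized speaker identity aggregated from the returned cluster.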

[0172] Table 17 below provides example de-identified voice recordings, each including a new speech waveform synthesized by the synthesized speaker identity creation unit 480 using the synthetic vectors V_X and V_Y. For reference, the de-identified text record transmitted by each new speech waveform is shown in quotation marks.

[0173] Table 17

[0174]

[0175] Output unit 490 publishes the de-identified voice recordings from Table 17 to third parties for secondary use. As shown in Table 17, the de-identified voice recordings do not reveal the identities of Speaker 1 and Speaker 2 (i.e., clinicians Carmen Dudley, MD and VeepBob, MD), and the de-identified text records do not reveal any privacy-sensitive personal information about the individuals (i.e., patients) Jane Alan, Ted Borret, and Cathie Trian.

[0176] Figure 7 is a flowchart of an example process 700 for speaker identity and content de-identification according to an embodiment of the present invention. Process block 701 includes receiving input indicating at least one privacy protection level to be implemented via speaker identity and content de-identification. Process block 702 includes extracting features corresponding to a first speaker from a first speech transmitted by the first speaker and recorded in a first voice recording. Process block 703 includes identifying and extracting text content from the first speech. Process block 704 includes parsing the text content to identify privacy-sensitive personal information corresponding to a first individual. Process block 705 includes generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to the extent that the at least one privacy protection level is satisfied, wherein the de-identified text content hides the personal identity of the first individual. Process block 706 includes mapping the de-identified text content to the first speaker. Process block 707 includes generating a synthesized speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, wherein the other features corresponding to the at least one other speaker differ from the features corresponding to the first speaker to the extent that the at least one privacy protection level is satisfied. Process block 708 includes synthesizing a new speech waveform based on the synthesized speaker identity to transmit the de-identified text content, wherein the new speech waveform is different from the speech waveform of the first speech, and the new speech waveform hides the personal identity of the first speaker.

[0177] In one embodiment, process blocks 701-708 are executed by one or more components of system 330.

[0178] Figure 8 is a high-level block diagram illustrating an information processing system 800 for implementing one embodiment of the present invention. The computer system includes one or more processors, such as processor 802. Processor 802 is connected to communication infrastructure 804 (e.g., a communication bus, switching structure, or network).

[0179] The computer system may include a display interface 806 that forwards graphics, text, and other data from the communication infrastructure 804 (or from a frame buffer, not shown) for display on a display unit 808. In one embodiment, the computer system also includes a main memory 810, preferably random access memory (RAM), and secondary memory 812. In one embodiment, secondary memory 812 includes, for example, a hard disk drive 814 and / or a removable storage drive 816, representing, for example, a floppy disk drive, magnetic tape drive, or optical disk drive. The removable storage drive 816 reads from and / or writes to the removable storage unit 818 in a manner well known to those skilled in the art. The removable storage unit 818 represents, for example, a floppy disk, compact disk, magnetic tape, or optical disk, which is read from and written to by the removable storage drive 816. As will be understood, the removable storage unit 818 includes a computer-readable medium in which computer software and / or data are stored.

[0180] In an alternative embodiment, secondary memory 812 includes other similar means for allowing computer programs or other instructions to be loaded into the computer system. Such means include, for example, removable storage unit 820 and interface 822. Examples of such means include a program package and package interface (e.g., as found in video game devices), a removable memory chip (e.g., EPROM or PROM) and associated socket, and other removable storage units 820 and interfaces 822 that allow software and data to be transferred from removable storage unit 820 to the computer system.

[0181] In one embodiment, the computer system further includes a communication interface 824. The communication interface 824 allows software and data to be transferred between the computer system and external devices. In one embodiment, examples of the communication interface 824 include a modem, a network interface (such as an Ethernet card), a communication port, or a PCMCIA slot and card. In one embodiment, the software and data transmitted via the communication interface 824 are in the form of signals, such as electronic, electromagnetic, optical, or other signals that can be received by the communication interface 824. These signals are provided to the communication interface 824 via a communication path (i.e., channel) 826. In one embodiment, the communication path 826 carries signals and is implemented using wires or cables, optical fibers, telephone lines, cellular telephone links, RF links, and / or other communication channels.

[0182] Embodiments of the present invention can be systems, methods, and / or computer program products at any possible level of technical detail integration. A computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to execute aspects of embodiments of the present invention.

[0183] A computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disk (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, a computer-readable storage medium should not be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through fiber optic cables), or electrical signals transmitted through wires.

[0184] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing / processing devices, or downloaded to an external computer or external storage device via a network (e.g., the Internet, a local area network, a wide area network, and / or a wireless network). The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing / processing device.

[0185] Computer-readable program instructions used to perform the operations of this invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages (such as Smalltalk, C++, etc.) and procedural programming languages (such as the "C" programming language or similar programming languages). The computer-readable program instructions may be executed entirely on a user's computer, partially on a user's computer, as a standalone software package, partially on a user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)) or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute computer-readable program instructions by utilizing state information from the computer-readable program instructions to personalize the electronic circuitry in order to perform aspects of this invention.

[0186] Aspects of the present invention are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0187] These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / actions specified in one or more blocks of a flowchart and / or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner, such that the computer-readable storage medium storing the instructions includes an article of manufacture containing instructions that implement aspects of the functions / actions specified in one or more blocks of a flowchart and / or block diagram.

[0188] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce computer-implemented processing, such that the instructions executed on the computer, other programmable apparatus, or other device perform the functions / actions specified in one or more blocks of a flowchart and / or block diagram.

[0189] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions comprising one or more executable instructions for implementing a specified logical function. In some alternative embodiments, the functions indicated in the blocks may occur in a different order than indicated in the figures. For example, two blocks shown consecutively may actually be implemented as a single step, executed simultaneously, substantially simultaneously, with partial or complete time overlap, or these blocks may sometimes be executed in reverse order, depending on the functions involved. It will also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified function or action or executes a combination of dedicated hardware and computer instructions.

[0190] As can be seen from the above description, embodiments of the present invention provide systems, computer program products, and methods for implementing embodiments of the present invention. Embodiments of the present invention also provide non-transitory computer-usable storage media for implementing embodiments of the present invention. The non-transitory computer-usable storage medium has a computer-readable program, wherein when processed on a computer, the program causes the computer to implement the steps of the embodiments of the present invention described herein. References to singular elements in the claims are not intended to mean "one and only one," unless expressly stated otherwise, but rather "one or more." All structural and functional equivalents of the elements of the above exemplary embodiments, currently known or hereafter known to those skilled in the art, are intended to be included in these claims. Elements of the claims herein should not be construed pursuant to paragraph 6 of Section 112 of 35 U.S.C., unless the element is expressly stated using the phrase "means for" or "step for".

[0191] The terminology used herein is for the purpose of describing particular embodiments of the invention only and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the terms “comprises” and / or “comprising,” when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and / or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof.

[0192] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the following claims are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.

[0193] Various embodiments of the invention have been described for illustrative purposes, but are not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein has been chosen to best explain the principles of the embodiments, their practical application, or improvements to existing technologies on the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for speaker identity and content de-identification while ensuring data privacy, comprising: receiving input indicating at least one privacy protection level that the speaker identity and content de-identification is required to provide; extracting features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording; recognizing and extracting text content from the first speech; parsing the text content to identify privacy-sensitive personal information corresponding to a first individual; generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one privacy protection level, wherein the de-identified text content hides the personal identity of the first individual, the generating comprising: processing the text content by replacing at least one word in the text content with at least one replacement value, wherein the at least one word includes a portion of the privacy-sensitive personal information; mapping the de-identified text content to the first speaker; generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, wherein the other features corresponding to the at least one other speaker differ from the features corresponding to the first speaker to an extent that satisfies the at least one privacy protection level; and synthesizing a new speech waveform based on the synthetic speaker identity to deliver the de-identified text content, wherein the new speech waveform is different from the speech waveform of the first speech, and the new speech waveform hides the personal identity of the first speaker.

2. The method of claim 1, wherein the other features corresponding to the at least one other speaker are extracted from at least one other speech delivered by the at least one other speaker and recorded in at least one other voice recording.

3. The method of claim 2, further comprising: receiving a plurality of voice recordings including the first voice recording and the at least one other voice recording, wherein a plurality of speeches including the first speech and the at least one other speech are recorded in the plurality of voice recordings, each of the plurality of speeches is delivered by one of a plurality of speakers, the plurality of speakers includes the first speaker and the at least one other speaker, the text content of each of the plurality of speeches includes privacy-sensitive personal information corresponding to at least one of a plurality of individuals, and the plurality of individuals includes the first individual.

4. The method of claim 3, further comprising: for each of the plurality of voice recordings, extracting features from the speech recorded in the voice recording; and for each of the plurality of speakers, generating a corresponding feature vector based on features extracted from at least one of the plurality of speeches delivered by the speaker.

5. The method of claim 4, further comprising: for each of the plurality of voice recordings: recognizing and extracting text content from the speech recorded in the voice recording; and parsing the text content by applying at least one natural language processing annotator to identify privacy-sensitive personal information in the text content corresponding to at least one of the plurality of individuals.

6. The method of claim 5, wherein each piece of privacy-sensitive personal information corresponding to each individual includes at least one of the following: a direct identifier or a quasi-identifier.

7. The method of claim 6, further comprising: for each of the plurality of voice recordings: processing the text content of the speech recorded in the voice recording by masking each direct identifier identified in the text content, marking each quasi-identifier identified in the text content, and replacing at least one word in the text content with at least one similar word, wherein each word replaced in the text content is one of the following: a quasi-identifier, or a word that is neither a direct identifier nor a quasi-identifier.

8. The method of claim 7, further comprising: for each of the plurality of individuals: combining, into a text document, the processed text content of at least one of the plurality of speeches that includes the corresponding privacy-sensitive personal information; generating corresponding de-identified text content by performing the utility-preserving content de-identification on the text document to anonymize the corresponding privacy-sensitive personal information to an extent that satisfies the at least one privacy protection level, wherein the corresponding de-identified text content does not disclose the corresponding privacy-sensitive personal information; and mapping one or more segments of the corresponding de-identified text content to at least one of the plurality of speakers, wherein the text content of at least one of the plurality of speeches delivered by the at least one speaker includes the corresponding privacy-sensitive personal information.

9. The method of claim 8, further comprising: for each of the plurality of speakers: generating a corresponding synthetic speaker identity based on at least one feature vector corresponding to at least one other speaker among the plurality of speakers, wherein the at least one feature vector corresponding to the at least one other speaker differs from the feature vector corresponding to the speaker to an extent that satisfies the at least one privacy protection level; and for each of the plurality of speeches delivered by the speaker and recorded in one of the plurality of voice recordings, synthesizing a corresponding new speech waveform based on the corresponding synthetic speaker identity to deliver a segment of de-identified text content mapped to the speaker, wherein the corresponding new speech waveform is different from the speech waveform of the speech, and the corresponding new speech waveform does not reveal the personal identity of the speaker.

10. The method of claim 9, further comprising: for each of the plurality of speakers: releasing a corresponding de-identified voice recording to a third party, wherein the corresponding de-identified voice recording includes a corresponding new speech waveform synthesized based on the corresponding synthetic speaker identity.

11. A system for speaker identity and content de-identification with privacy protection, comprising: at least one processor; and a non-transitory processor-readable memory device storing instructions which, when executed by the at least one processor, cause the at least one processor to perform an operation, the operation including: receiving input indicating at least one privacy protection level that the speaker identity and content de-identification is required to provide; extracting features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording; recognizing and extracting text content from the first speech; parsing the text content to identify privacy-sensitive personal information corresponding to a first individual; generating de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one privacy protection level, wherein the de-identified text content hides the personal identity of the first individual, the generating comprising: processing the text content by replacing at least one word in the text content with at least one replacement value, wherein the at least one word includes a portion of the privacy-sensitive personal information; mapping the de-identified text content to the first speaker; generating a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, wherein the other features corresponding to the at least one other speaker differ from the features corresponding to the first speaker to an extent that satisfies the at least one privacy protection level; and synthesizing a new speech waveform based on the synthetic speaker identity to deliver the de-identified text content, wherein the new speech waveform is different from the speech waveform of the first speech, and the new speech waveform hides the personal identity of the first speaker.

12. The system of claim 11, wherein the operation further comprises: receiving a plurality of voice recordings including the first voice recording and at least one other voice recording, wherein a plurality of speeches including the first speech and at least one other speech are recorded in the plurality of voice recordings, each of the plurality of speeches is delivered by one of a plurality of speakers, the plurality of speakers includes the first speaker and the at least one other speaker, the text content of each of the plurality of speeches includes privacy-sensitive personal information corresponding to at least one of a plurality of individuals, and the plurality of individuals includes the first individual.

13. The system of claim 12, wherein the operation further comprises: for each of the plurality of voice recordings, extracting features from the speech recorded in the voice recording; and for each of the plurality of speakers, generating a corresponding feature vector based on features extracted from at least one of the plurality of speeches delivered by the speaker.

14. The system of claim 13, wherein the operation further comprises: for each of the plurality of voice recordings: recognizing and extracting text content from the speech recorded in the voice recording; and parsing the text content by applying at least one natural language processing annotator to identify privacy-sensitive personal information in the text content corresponding to at least one of the plurality of individuals.

15. The system of claim 14, wherein each piece of privacy-sensitive personal information corresponding to each individual includes at least one of the following: a direct identifier or a quasi-identifier.

16. The system of claim 15, wherein the operation further comprises: for each of the plurality of voice recordings: processing the text content of the speech recorded in the voice recording by masking each direct identifier identified in the text content, marking each quasi-identifier identified in the text content, and replacing at least one word in the text content with at least one similar word, wherein each word replaced in the text content is one of the following: a quasi-identifier, or a word that is neither a direct identifier nor a quasi-identifier.

17. The system of claim 16, wherein the operation further comprises: for each of the plurality of individuals: combining, into a text document, the processed text content of at least one of the plurality of speeches that includes the corresponding privacy-sensitive personal information; generating corresponding de-identified text content by performing the utility-preserving content de-identification on the text document to anonymize the corresponding privacy-sensitive personal information to an extent that satisfies the at least one privacy protection level, wherein the corresponding de-identified text content does not disclose the corresponding privacy-sensitive personal information; and mapping one or more segments of the corresponding de-identified text content to at least one of the plurality of speakers, wherein the text content of at least one of the plurality of speeches delivered by the at least one speaker includes the corresponding privacy-sensitive personal information.

18. The system of claim 17, wherein the operation further comprises: for each of the plurality of speakers: generating a corresponding synthetic speaker identity based on at least one feature vector corresponding to at least one other speaker among the plurality of speakers, wherein the at least one feature vector corresponding to the at least one other speaker differs from the feature vector corresponding to the speaker to an extent that satisfies the at least one privacy protection level; and for each of the plurality of speeches delivered by the speaker and recorded in one of the plurality of voice recordings, synthesizing a corresponding new speech waveform based on the corresponding synthetic speaker identity to deliver a segment of de-identified text content mapped to the speaker, wherein the corresponding new speech waveform is different from the speech waveform of the speech, and the corresponding new speech waveform does not reveal the personal identity of the speaker.

19. A computer program product for speaker identity and content de-identification with privacy protection, the computer program product comprising program instructions executable by a processor to cause the processor to: receive input indicating at least one privacy protection level that the speaker identity and content de-identification is required to provide; extract features corresponding to a first speaker from a first speech delivered by the first speaker and recorded in a first voice recording; recognize and extract text content from the first speech; parse the text content to identify privacy-sensitive personal information corresponding to a first individual; generate de-identified text content by performing utility-preserving content de-identification on the text content to anonymize the privacy-sensitive personal information to an extent that satisfies the at least one privacy protection level, wherein the de-identified text content hides the personal identity of the first individual, the generating comprising: processing the text content by replacing at least one word in the text content with at least one replacement value, wherein the at least one word includes a portion of the privacy-sensitive personal information; map the de-identified text content to the first speaker; generate a synthetic speaker identity corresponding to the first speaker based on other features corresponding to at least one other speaker, wherein the other features corresponding to the at least one other speaker differ from the features corresponding to the first speaker to an extent that satisfies the at least one privacy protection level; and synthesize a new speech waveform based on the synthetic speaker identity to deliver the de-identified text content, wherein the new speech waveform is different from the speech waveform of the first speech, and the new speech waveform hides the personal identity of the first speaker.
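
For illustration only, two of the claimed steps can be sketched in Python: masking direct identifiers and marking quasi-identifiers in text content, and deriving a synthetic speaker identity from the feature vectors of other, sufficiently dissimilar speakers. This is a minimal sketch under stated assumptions, not the claimed implementation. The regular expressions stand in for the natural language processing annotators, the privacy protection level is modeled as a minimum cosine distance, and all names (`mask_and_mark`, `synthetic_identity`, `min_distance`) are hypothetical.

```python
import math
import re

# Assumption: simple regex patterns stand in for NLP annotators.
DIRECT_ID_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}
QUASI_ID_PATTERNS = {
    "AGE": re.compile(r"\b\d{1,3} years old\b"),
}


def mask_and_mark(text):
    """Mask direct identifiers and mark quasi-identifiers in text content."""
    for label, pattern in DIRECT_ID_PATTERNS.items():
        # A direct identifier is replaced entirely by a placeholder.
        text = pattern.sub(f"[{label}]", text)
    for label, pattern in QUASI_ID_PATTERNS.items():
        # A quasi-identifier is kept but wrapped in a marker for later processing.
        text = pattern.sub(rf"<{label}>\g<0></{label}>", text)
    return text


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def synthetic_identity(speaker, vectors, min_distance):
    """Average the feature vectors of other speakers that are dissimilar enough.

    Assumption: min_distance models the privacy protection level.
    """
    target = vectors[speaker]
    donors = [
        vec for name, vec in vectors.items()
        if name != speaker and cosine_distance(target, vec) >= min_distance
    ]
    if not donors:
        raise ValueError("no sufficiently dissimilar speaker available")
    averaged = [sum(vec[i] for vec in donors) / len(donors)
                for i in range(len(target))]
    # Averaging can move the result closer to the target, so check again.
    if cosine_distance(target, averaged) < min_distance:
        raise ValueError("averaged identity is too similar to the speaker")
    return averaged


if __name__ == "__main__":
    print(mask_and_mark("Call me at 555-123-4567, I am 42 years old."))
    speakers = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.9, 0.1]}
    print(synthetic_identity("A", speakers, min_distance=0.5))
```

In this sketch speaker C is excluded as a donor for speaker A because its feature vector is too close to A's, so only B contributes to the synthetic identity. A speech synthesizer, which is outside the scope of the sketch, would then render the de-identified text using that identity.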