A system and a method for unstructured data risk assessment and risk quantification
The system addresses the challenge of interpreting unstructured data by using a composite entity detection technique to identify and quantify risks, enhancing decision-making and compliance through contextual risk assessment and regulatory alignment.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- PRIVASAPIEN TECH PTE LTD
- Filing Date
- 2025-12-18
- Publication Date
- 2026-06-25
AI Technical Summary
Existing systems struggle to interpret complex contextual relationships within unstructured data, leading to incomplete detection of sensitive elements and inconsistent outputs, and lack adaptability to emerging data types and threat patterns, making it difficult for organizations to manage privacy risks and regulatory compliance.
A system and method for unstructured data risk assessment using a composite entity detection technique that includes pattern recognition, entity reference repositories, and trained language models to identify and categorize data entities, determine contextual sensitivity, calculate risk scores, and provide regulatory compliance recommendations.
Enables comprehensive risk quantification and proactive threat scenario simulation, providing structured summaries of risk severity and contextual relevance, supporting informed decision-making and compliance assurance.
Abstract
Description
[0001] A SYSTEM AND A METHOD FOR UNSTRUCTURED DATA RISK ASSESSMENT AND RISK QUANTIFICATION
[0002] EARLIEST PRIORITY DATE:
[0003] This Application claims priority from a Provisional patent application filed in India having Patent Application No. 202441101504, filed on December 20, 2024, and titled “A SYSTEM AND A METHOD FOR UNSTRUCTURED DATA THREAT MODELING, RISK SCORING AND GOVERNANCE”.
[0004] FIELD OF INVENTION
[0005] Embodiments of the present disclosure relate to the field of intelligent risk analysis with automated governance for complex unstructured information ecosystems, and more particularly to a system and a method for unstructured data risk assessment and risk quantification.
[0006] BACKGROUND
[0007] Unstructured data refers to information that does not follow a predefined data model or standardized format, making it difficult for conventional systems to interpret or categorize. It has grown rapidly with the widespread use of digital communication, multimedia content, and user-generated information from platforms such as emails, social media, audio recordings, videos, and free-text documents. This form of data often contains implicit, context-rich, or sensitive insights that are not directly identifiable, and its fluid, diverse, and non-linear characteristics make it particularly sensitive within organizational systems. As a result, unstructured data frequently carries high informational value while also exposing organizations to unique privacy, security, and ethical risks.
[0008] Existing systems for supervising unstructured data primarily rely on traditional data processing mechanisms including, but not limited to, keyword-based detection, metadata analysis, and basic categorization techniques. Alongside these, established practices such as threat modelling, rule-based privacy checks, statistical detection of anomalies, and predefined risk-control workflows are applied to support oversight. While these mechanisms provide some level of visibility, they were originally built for structured or semi-structured data environments and depend heavily on human-defined patterns, making it difficult to interpret deeper insights embedded in dynamic, unpredictable unstructured datasets.
[0009] Additionally, these existing approaches present several drawbacks due to inherent limitations in their design and operational scope. Firstly, many conventional systems struggle to interpret complex contextual relationships within unstructured content, leading to incomplete detection of sensitive elements. The reliance on static rules or surface-level indicators frequently results in inaccurate assessments, particularly when dealing with nuanced or domain-specific information. Additionally, many existing mechanisms lack adaptability, causing them to underperform when new data types, emerging privacy concerns, or novel threat patterns appear in unstructured datasets.
[0010] Furthermore, current tools often operate in fragmented and non-integrated ways, requiring organizations to use multiple disjointed solutions to identify potential risks, assess compliance considerations, and interpret the broader implications of unstructured data. Such fragmentation leads to inconsistent outputs, increased manual effort, and challenges in forming a holistic supervisory view. These limitations make it difficult for organizations to confidently identify privacy risks beyond basic personally identifiable information, interpret hidden sensitive attributes, and manage the rising regulatory and ethical expectations surrounding unstructured information.
[0011] Hence, there is a need for an improved system and method for unstructured data risk assessment and risk quantification which addresses the aforementioned issue(s).
OBJECTIVES OF THE INVENTION
[0012] The primary objective of the invention is to establish a comprehensive approach for understanding and evaluating hidden risks present within unstructured data, which often contains unpredictable and sensitive information. The invention, including a system, aims to interpret complex patterns that traditional structured-data techniques cannot capture. The system seeks to bring clarity and meaningful insight into diverse data formats originating from multiple sources.
[0013] Another objective of the invention is to provide a systematic mechanism for simulating potential threat scenarios across unstructured data, particularly those related to privacy and responsible digital practices. The proposed system helps organizations anticipate how different risk factors may emerge and influence sensitive information. This allows proactive understanding of vulnerabilities before they escalate into real issues.
[0014] Yet another objective of the invention is to deliver a unified approach for identifying, quantifying, and synthesizing multiple risk indicators into meaningful assessments. The system aims to convert fragmented risk cues into structured summaries that reflect severity and contextual relevance. This supports informed prioritization and enhances decision-making across organizational functions.
[0015] Yet another objective of the invention is to align detected risks in unstructured data with applicable regulatory requirements and to guide organizations on suitable safeguard measures. The system seeks to bridge the gap between risk discovery and regulatory expectations by providing structured governance oversight. This strengthens compliance assurance and supports responsible handling of sensitive information.
SUMMARY
[0016] In accordance with an embodiment of the present disclosure, a system for unstructured data risk assessment and risk quantification is disclosed. The system includes a processor and a memory coupled to the processor, wherein the memory includes instructions that, when executed by the processor, cause the processor to receive an unstructured data file comprising at least one of a document, audio, video, and textual content via an access interface to accept data from at least one of local and remote storage sources. Additionally, the processor is caused to upload the unstructured data file to a data repository through a secure communication channel, thereby ensuring data integrity and confidentiality in the course of data transfer. Furthermore, the processor is caused to detect a plurality of data entities within the uploaded unstructured data file by applying a composite entity detection technique, wherein the composite entity detection technique comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of personal, quasi-identifiable, anonymized, and pseudo-anonymized data. Moreover, the processor is caused to determine a contextual sensitivity of each detected data entity by correlating the plurality of detected entities with statistical and linguistic models configured to interpret a semantic context of surrounding data and assign a unique context-specific sensitivity score. Moreover, the processor is caused to calculate an individual risk score for each detected data entity based on the determined contextual sensitivity, wherein each detected entity is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density.
Moreover, the processor is caused to compute a document-level index by aggregating the individual risk scores and to calculate a normalized density metric indicative of a proportion of at least one of personally identifiable and quasi-identifiable data contained within the document. Moreover, the processor is caused to categorize the document based on the computed document-level index and a risk distribution, wherein the categorization distinguishes among high-risk, moderate-risk, and low-risk data segments. Moreover, the processor is caused to determine a regulatory obligation corresponding to the categorized document by mapping a plurality of detected risk attributes to one or more applicable regulatory frameworks comprising a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps associated with the plurality of detected risk attributes. Moreover, the processor is caused to provide a safeguard recommendation comprising a regulatory measure and a technical measure based on the determination of the plurality of compliance gaps and a context-specific risk severity. Moreover, the processor is caused to display a categorized result, computed scores, and the safeguard recommendations utilizing a governance dashboard, wherein the governance dashboard is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making.
[0017] In accordance with an embodiment of the present disclosure, a method for unstructured data risk assessment and risk quantification is disclosed. The method includes receiving, by an access interface, an unstructured data file comprising at least one of a document, audio, video, and textual content to accept data from at least one of local and remote storage sources. Additionally, the method includes uploading, via a secure communication channel, the unstructured data file to a data repository, thereby ensuring data integrity and confidentiality in the course of data transfer. Furthermore, the method includes detecting, by applying a composite entity detection technique, a plurality of data entities within the uploaded unstructured data file, wherein the composite entity detection technique comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of personal, quasi-identifiable, anonymized, and pseudo-anonymized data. Moreover, the method includes determining, by correlating the plurality of detected entities with statistical and linguistic models, a contextual sensitivity of each detected data entity, wherein the statistical and linguistic models are configured to interpret a semantic context of surrounding data and assign a unique context-specific sensitivity score. Moreover, the method includes calculating, based on the determined contextual sensitivity, an individual risk score for each detected data entity, wherein each detected entity is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density.
Moreover, the method includes computing, by aggregating the individual risk scores, a document-level index and calculating a normalized density metric indicative of a proportion of at least one of personally identifiable and quasi-identifiable data contained within the document. Moreover, the method includes categorizing, based on the computed document-level index and a risk distribution, the document, wherein the categorization distinguishes among high-risk, moderate-risk, and low-risk data segments. Moreover, the method includes determining, by mapping a plurality of detected risk attributes to one or more applicable regulatory frameworks, a regulatory obligation corresponding to the categorized document, wherein the one or more applicable regulatory frameworks comprise a privacy framework and a data protection framework, and wherein the mapping yields a determination of a plurality of compliance gaps associated with the plurality of detected risk attributes. Moreover, the method includes providing, based on the determination of the plurality of compliance gaps and a context-specific risk severity, a safeguard recommendation comprising a regulatory measure and a technical measure. Moreover, the method includes displaying, utilizing a governance dashboard, a categorized result, computed scores, and the safeguard recommendations, wherein the governance dashboard is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making.
[0018] To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:
[0020] FIG. 1 illustrates a network environment for a system for unstructured data risk assessment and risk quantification in accordance with an embodiment of the present disclosure;
[0021] FIG. 2 illustrates a schematic diagram of a user device of FIG. 1, in accordance with an example implementation of the present subject matter;
[0022] FIG. 3 illustrates a schematic diagram of the system for unstructured data risk assessment and risk quantification of FIG. 1, in accordance with an embodiment of the present disclosure;
[0023] FIG. 4 (a) is a flow chart representing the steps involved in a method for unstructured data risk assessment and risk quantification in accordance with an embodiment of the present disclosure; and
[0024] FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.
[0025] Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.
DETAILED DESCRIPTION
[0026] For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as would normally occur to those skilled in the art are to be construed as being within the scope of the present disclosure.
[0027] The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by "comprises... a" does not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment.
[0028] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.
[0029] In the following specification and the claims, reference will be made to a number of terms, which shall be defined to have the following meanings. The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
[0030] FIG. 1 illustrates a network environment for a system for unstructured data risk assessment and risk quantification in accordance with an embodiment of the present disclosure.
[0031] Referring to FIG. 1, a user (120) may be an individual or an organizational entity whose unstructured data file (340, fig 3) is subjected to risk assessment, analysis, and governance using the disclosed system (100). The user (120) interacts with the system (100) through a user device (125), which functions as an access interface (125) to upload, submit, or grant access to diverse unstructured data files (340, fig 3) including, but not limited to, a document, audio, video, or textual content. Through this access interface (125), the user (120) can initiate assessment processes, monitor the progress of data evaluation, and review the resulting insights. The system (100) enables the user (120) to seamlessly connect from local or remote storage sources, ensuring accessibility and flexibility in managing their data (210, fig 2). The user (120) also benefits from the categorized results, computed risk indicators, and safeguard recommendations (375, fig 3) generated by the system (100), thereby supporting informed decisions regarding data handling and compliance. Further, the user (120) may access the system (100) over a network (115). Examples of the user device (125) include, but are not limited to, a mobile phone, desktop computer, personal digital assistant (PDA), smart phone, tablet, ultra-book, netbook, laptop, multi-processor system, microprocessor-based or programmable consumer electronic system, or any other communication device that the user (120) may use. It will be appreciated that the system (100) may be presented to the user (120) on a corresponding user device (125) including the access interface as a web application accessed through a browser, through a software application on the user device (125), or, particularly for smartphones, through a mobile application installed on the smartphone.
It will be appreciated that, within the context of the disclosure herein, a web application refers to a utility implemented on a networked computing system accessible by the user device (125), including the access interface, over the Internet (e.g., through browsers), wherein the bulk of the processing takes place at the networked computing system; mobile applications refer to applications installed on smartphones that may communicate with a networked computing system; and a “software” application refers generally to applications other than web browsers installed on other types of user device (125) that may communicate with a networked computing system over the network (115).
[0032] The network (115) may be a single communication network or a combination of multiple communication networks and may use a variety of different communication protocols. The network (115) may be a wireless network, a wired network, or a combination thereof. Examples of the network (115) include, but are not limited to, a Global System for Mobile Communication (GSM) network, Universal Mobile Telecommunications System (UMTS) network, Personal Communications Service (PCS) network, Time Division Multiple Access (TDMA) network, Code Division Multiple Access (CDMA) network, Next Generation Network (NGN), and Public Switched Telephone Network (PSTN). Depending on the technology, the network (115) may include various network entities, such as gateways and routers; however, such details have been omitted for the sake of brevity of the present description.
[0033] The system (100) may have a homepage that is presented to the user (120) upon accessing a top-level web address, in the case of web applications presented in a browser, or a welcome screen, in the case of software and mobile applications. The homepage is adapted to serve as the primary data dashboard and presents users (120) with an interactive interface for bias detection and reflective evaluation using adaptive simulated interactions. The users (120) navigate this interface to upload datasets, customize views, and extract meaningful insights efficiently. The homepage may include links to a user log-in interface or general information about the system (100). It will be appreciated that the presentation of a homepage may not be necessary; for example, the user (120) may bypass initial interaction layers and directly request the system (100) to initiate processing of the unstructured data file assessment. In such cases, the user (120) can access authorized modules and trigger system-defined evaluation processes without navigating intermediate steps. This enables streamlined execution of assigned functions while remaining within authenticated operational boundaries.
[0034] A new or unregistered user (120) can access the user log-in interface, fill out the log-in information corresponding to the user's account, and indicate that the user (120) wishes to sign in. It will be appreciated that any conventional registration and log-in techniques for web applications, software applications, and mobile applications may be used, whichever is appropriate for the user (120). While registering, the user (120) may be prompted to provide a username and corresponding user credentials including, but not limited to, a password, geographical location, and contact information. Upon receipt of the foregoing information, a corresponding user profile may be created and stored on a respective database (385, fig 3) of the system (100).
[0035] Additionally, the network (115) environment includes the data repository (130) adapted to serve as a centralized storage space and designed to securely hold heterogeneous unstructured data in a standardized form. The data repository (130) maintains versioned records, ensuring consistency, traceability, and controlled accessibility throughout the assessment process. Additionally, the data repository (130) acts as the foundational reference point for storing processed content and derived analytical outputs used during risk evaluation.
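The versioned, traceable storage described for the data repository (130) can be pictured with a minimal in-memory sketch. The disclosure does not specify a storage backend or an integrity mechanism, so the class name `VersionedRepository`, its methods, and the use of a SHA-256 digest as the integrity check are all assumptions made purely for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VersionedRepository:
    """Hypothetical in-memory stand-in for the data repository (130)."""
    _records: dict = field(default_factory=dict)

    def put(self, name: str, content: bytes) -> int:
        # Store each upload as a new version together with its digest.
        versions = self._records.setdefault(name, [])
        versions.append((hashlib.sha256(content).hexdigest(), content))
        return len(versions)  # 1-based version number

    def get(self, name: str, version: int) -> bytes:
        # Re-hash on read so silent corruption is detected (assumed check).
        digest, content = self._records[name][version - 1]
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("integrity check failed")
        return content

    def history(self, name: str) -> list:
        # Digests of every stored version, oldest first, for traceability.
        return [digest for digest, _ in self._records.get(name, [])]
```

Keeping every version alongside its digest is one simple way to obtain both traceability and a repeatable integrity check; a production repository would add access control and persistent storage.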
[0036] Furthermore, the network (115) environment includes a composite entity detection technique (135). The composite entity detection technique (135) represents an integrated analytical approach that identifies diverse data elements within unstructured inputs. It combines pattern interpretation, reference-based matching, and linguistic understanding to detect meaningful entities across varied formats. In the proposed system (100) context, the composite entity detection technique (135) operates as the core analytical layer responsible for recognizing granular information units required for subsequent risk assessment.
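As a rough illustration of how the three components of the composite entity detection technique (135) might run concurrently and be merged, consider the sketch below. The regular expressions, the reference terms, and the title-plus-name heuristic standing in for the trained language model are all hypothetical; the disclosure does not fix any of them.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    source: str  # which detector produced the hit
    start: int   # character offset in the input


# Hypothetical pattern recognition logic: two illustrative regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}
# Hypothetical entity reference repository: known terms and categories.
REFERENCE_TERMS = {"asthma": "HEALTH_CONDITION", "diabetes": "HEALTH_CONDITION"}
# Stand-in for the trained language model: a title-plus-name heuristic.
NAME_HINT = re.compile(r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+")


def detect_by_pattern(text):
    return [Entity(m.group(), label, "pattern", m.start())
            for label, rx in PATTERNS.items() for m in rx.finditer(text)]


def detect_by_reference(text):
    hits = []
    for term, label in REFERENCE_TERMS.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append(Entity(m.group(), label, "reference", m.start()))
    return hits


def detect_by_model(text):
    return [Entity(m.group(), "PERSON", "model", m.start())
            for m in NAME_HINT.finditer(text)]


def composite_detect(text):
    """Run all detectors concurrently, then merge and de-duplicate hits."""
    detectors = (detect_by_pattern, detect_by_reference, detect_by_model)
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        batches = list(pool.map(lambda detect: detect(text), detectors))
    merged = {}
    for batch in batches:
        for entity in batch:
            # Keep the first detector's hit when two report the same span.
            merged.setdefault((entity.start, entity.text), entity)
    return sorted(merged.values(), key=lambda e: e.start)
```

Calling `composite_detect("Dr. Rao emailed jane@example.com about asthma.")` returns one entity per detector, ordered by position in the text. In a real deployment the `detect_by_model` stub would be replaced by an actual named-entity model.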
[0037] Moreover, the network (115) environment includes a contextual sensitivity (140) adapted to reflect the measured depth of how sensitive or impactful a detected data element becomes when interpreted within its surrounding context. The contextual sensitivity (140) considers semantic relationships, relational cues, and thematic meaning embedded in unstructured content. Additionally, the contextual sensitivity (140) facilitates nuanced understanding by differentiating between benign and sensitive elements based on their contextual influence.
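One simple way to picture a context-specific sensitivity score is to inspect the words surrounding a detected entity. The sketch below is only an assumption-laden stand-in for the statistical and linguistic models: the cue words, their weights, the window size, and the baseline value are hypothetical and are not taken from the disclosure.

```python
import re

# Hypothetical cue words and weights; a real system would learn these.
CONTEXT_CUES = {"diagnosed": 0.5, "salary": 0.4, "password": 0.6, "account": 0.3}


def contextual_sensitivity(text, start, end, window=40, base=0.2):
    """Score in [0, 1] for the entity at text[start:end], from nearby cues."""
    left = max(0, start - window)
    context = text[left:start] + " " + text[end:end + window]
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    boost = sum(weight for cue, weight in CONTEXT_CUES.items() if cue in tokens)
    return round(min(1.0, base + boost), 2)
```

With these assumed weights, the same name scores higher next to "diagnosed" than in a neutral sentence, which mirrors the distinction between benign and sensitive elements described above.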
[0038] Moreover, the network (115) environment includes an individual risk score (145) representing a quantified indicator assigned to each identified data entity (355, fig 3) to denote its potential privacy or exposure risk. It encapsulates aspects such as contextual relevance, occurrence characteristics, and inferred importance, and serves as a foundational metric that influences further aggregation and overall document-level evaluation.
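The idea of a base risk score that is adjusted by contextual analysis and occurrence density can be sketched as follows. The disclosure does not give a formula, so the base values, the weights `alpha` and `beta`, and the multiplicative form are assumptions chosen only to make the adjustment concrete.

```python
# Hypothetical base risk per entity category (not from the disclosure).
BASE_RISK = {"EMAIL": 0.5, "PERSON": 0.4, "HEALTH_CONDITION": 0.7}


def individual_risk_score(label, sensitivity, occurrences, total_tokens,
                          alpha=0.5, beta=2.0):
    """Adjust a base risk by contextual sensitivity and occurrence density.

    Assumed form: base * (1 + alpha * sensitivity + beta * density),
    capped at 1.0, where density = occurrences / total_tokens.
    """
    base = BASE_RISK.get(label, 0.1)
    density = occurrences / total_tokens if total_tokens else 0.0
    adjusted = base * (1 + alpha * sensitivity + beta * density)
    return round(min(1.0, adjusted), 3)
```

Under these assumptions an e-mail address with no sensitive context and negligible density stays at its base value of 0.5, while the same category with maximal sensitivity and a density of 0.1 rises to 0.85.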
[0039] Moreover, the network (115) environment includes a document level index (150). The document level index (150) is an aggregate metric derived from individual risk indicators to represent the overall sensitivity and exposure level of an entire unstructured data file (340, fig 3). The document level index (150) offers a consolidated view that captures both the distribution and density of critical information, thereby enabling a holistic comparison, classification, and governance of diverse documents.
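A minimal sketch of the aggregation step is given below. The disclosure states that individual scores are aggregated and that a normalized density metric is computed, but not how; the blend of maximum and mean, the token-based density, and the category thresholds are all illustrative assumptions.

```python
from statistics import mean


def document_level_index(scores):
    """Assumed aggregation: equal blend of the worst and the average score."""
    if not scores:
        return 0.0
    return round(0.5 * max(scores) + 0.5 * mean(scores), 3)


def normalized_density(sensitive_tokens, total_tokens):
    """Share of tokens that belong to identifiable or quasi-identifiable data."""
    return round(sensitive_tokens / total_tokens, 3) if total_tokens else 0.0


def categorize(index, density, high=0.7, moderate=0.4, dense=0.3):
    """Hypothetical thresholds separating high, moderate, and low risk."""
    if index >= high or density >= dense:
        return "high-risk"
    if index >= moderate:
        return "moderate-risk"
    return "low-risk"
```

Blending the maximum with the mean keeps a single highly sensitive entity from being averaged away in a long document, which is one plausible reading of capturing both distribution and density.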
[0040] Moreover, the network (115) environment includes a governance dashboard (155) adapted to act as an interactive visualization environment and designed to present consolidated analytical findings in an interpretable format. The governance dashboard (155) highlights categorized outcomes, risk indicators, and compliance-related insights derived from the evaluation process. In the proposed system (100), the governance dashboard (155) supports continuous oversight by allowing users (120) and authorized entities to monitor data sensitivity and governance status in real time.
[0041] In accordance with an embodiment of the present disclosure, a system for unstructured data risk assessment and risk quantification is disclosed. The system (100) includes a processor (105, fig 2) and a memory (110, fig 2) coupled to the processor (105, fig 2), wherein the memory (110, fig 2) includes instructions that, when executed by the processor (105, fig 2), cause the processor (105, fig 2) to receive an unstructured data file (340) comprising at least one of a document, audio, video, and textual content via an access interface (125) to accept data (210) from at least one of local and remote storage sources. Additionally, the processor (105, fig 2) is caused to upload the unstructured data file (340) to a data repository (130) through a secure communication channel, thereby ensuring data integrity (345) and confidentiality in the course of data transfer (350). Furthermore, the processor (105, fig 2) is caused to detect a plurality of data entities (355) within the uploaded unstructured data file (340) by applying a composite entity detection technique (135), wherein the composite entity detection technique (135) comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of personal, quasi-identifiable, anonymized, and pseudo-anonymized data. Moreover, the processor (105, fig 2) is caused to determine a contextual sensitivity (140) of each detected data entity (355) by correlating the plurality of detected data entities (355) with statistical and linguistic models configured to interpret a semantic context of surrounding data (360) and assign a unique context-specific sensitivity score.
Moreover, the processor (105, fig 2) is caused to calculate an individual risk score (145) for each detected data entity (355) based on the determined contextual sensitivity (140), wherein each detected data entity (355) is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density. Moreover, the processor (105, fig 2) is caused to compute a document-level index (150) by aggregating the individual risk scores (145) and to calculate a normalized density metric indicative of a proportion of at least one of personally identifiable and quasi-identifiable data contained within the document. Moreover, the processor (105, fig 2) is caused to categorize the document based on the computed document-level index (150) and a risk distribution, wherein the categorization distinguishes among high-risk, moderate-risk, and low-risk data segments. Moreover, the processor (105, fig 2) is caused to determine a regulatory obligation corresponding to the categorized document by mapping a plurality of detected risk attributes (365) to one or more applicable regulatory frameworks comprising a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps (370) associated with the plurality of detected risk attributes (365). Moreover, the processor (105, fig 2) is caused to provide a safeguard recommendation (375) comprising a regulatory measure and a technical measure based on the determination of the plurality of compliance gaps (370) and a context-specific risk severity. Moreover, the processor (105, fig 2) is caused to display a categorized result, computed scores, and the safeguard recommendations (375) utilizing a governance dashboard (155), wherein the governance dashboard (155) is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making.
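The mapping from detected risk attributes (365) to regulatory frameworks, and from the resulting compliance gaps (370) to safeguard recommendations (375), can be illustrated with a small lookup-based sketch. The attribute names, framework keys, required controls, and recommendation texts below are hypothetical placeholders; they do not represent the requirements of any actual regulation.

```python
# Hypothetical mapping: risk attribute -> {framework: required controls}.
REQUIREMENTS = {
    "health_data": {"privacy_framework": {"explicit_consent", "encryption_at_rest"}},
    "contact_data": {"data_protection_framework": {"access_control"}},
}

# Hypothetical safeguards: control -> (measure type, recommendation text).
SAFEGUARDS = {
    "explicit_consent": ("regulatory", "Record explicit consent before processing"),
    "encryption_at_rest": ("technical", "Encrypt stored files"),
    "access_control": ("technical", "Restrict access by role"),
}


def compliance_gaps(detected_attributes, controls_in_place):
    """List (attribute, framework, missing control) for every unmet requirement."""
    gaps = []
    for attribute in sorted(detected_attributes):
        frameworks = REQUIREMENTS.get(attribute, {})
        for framework, required in sorted(frameworks.items()):
            for control in sorted(required - set(controls_in_place)):
                gaps.append((attribute, framework, control))
    return gaps


def recommend(gaps):
    """Turn each gap into a regulatory or technical safeguard recommendation."""
    return [
        {"attribute": attribute, "framework": framework,
         "type": SAFEGUARDS[control][0], "measure": SAFEGUARDS[control][1]}
        for attribute, framework, control in gaps
    ]
```

For example, a document flagged with `health_data` and `contact_data`, in an environment where only access control and consent recording are in place, would yield a single gap for encryption at rest and a corresponding technical recommendation.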
[0042] It may be noted that the foregoing system (100) is an exemplary system and may be implemented as computer executable instructions in any computing or processing environment, including in digital electronic circuitry or in computer hardware, firmware, device driver, or software. As such, the system (100) is not limited to any specific hardware or software configuration.
[0043] FIG.2 illustrates a schematic diagram of a user device (125) of FIG. 1, in accordance with an example implementation of the present subject matter, wherein the user device (125) as here in the proposed invention relates to the access interface (125). Referring to FIG. 2, the user device (125) may comprise a processor(s) (105), a memory(s) (110) coupled to and accessible by the processor(s) (105), and an interface (215) coupled to the memory(s) (110). The user device (125) disclosed herein may be same as the access interface (125) described in FIG. 1. The functions of various elements shown in the figs., including any functional blocks labelled as "processor(s) (105)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor (105), the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor (105)" would not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (105). The user device (125) may further include a display (205) in addition to other components such as, but not limited to, keyboard, sensors, logic circuits etc. 
Further, the user device (125) may include data (210) which may include data (210) including unstructured data file (340, fig 3), data integrity (345, fig 3), data transfer (350, fig 3), data entities (355, fig 3), surrounding data (360, fig 3), detected risk attributes (365, fig 3), compliance gaps (370, fig 3), safeguard recommendations (375, fig 3), and sensitive data (380, fig 3) that may be stored in the database (385), utilized or generated during the operation of the user device (125). The memory(s) (110) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (110) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The user device (125) may further include an interface (215) that may allow the connection or coupling of the user device (125) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the system (100) shown in FIG. 1. The interface (215) may also enable intercommunication between different logical as well as hardware components of the user device (125) including the access interface of FIG. 1.
[0044] FIG. 3 illustrates a schematic diagram of the system for unstructured data risk assessment and risk quantification of FIG. 1, in accordance with an embodiment of the present disclosure. Referring to FIG. 3, the system (100) includes a processor(s) (105), a memory(s) (110) coupled to and accessible by the processor(s) (105), a database (385), and a user interface (390) coupled to the memory(s) (110).
[0045] The system (100) disclosed herein is the same as the system (100) described in FIG. 1. The functions of various elements shown in the figs., including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" would not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (105). The system (100) may further include other components such as, but not limited to, keyboard, sensors, logic circuits, input / output interfaces etc. Further, the system (100) may include data (not shown) which may include data that may be stored, utilized or generated during the operation of the computer implemented system (100).
[0046] The memory(s) (110) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (110) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The system (100) may further include the user interface (390) that may allow the connection or coupling of the system (100) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the user device (125) including the one or more entities as shown in FIG. 1. The user interface (390) may also enable intercommunication between different logical as well as hardware components of the system (100).
[0047] The system (100) may be provided with a database (385) to store a plurality of data (210) including unstructured data file (340), data integrity (345), data transfer (350), data entities (355), surrounding data (360), detected risk attributes (365), compliance gaps (370), safeguard recommendations (375), and sensitive data (380). In an example implementation of the system (100) including one or more servers, the databases (385) may be local to the server or may be remote to the server. It may be noted that the data (210) in the databases (385) may be stored as a table or may be pre-stored as a mapping with the other. This application is not limited thereto.
[0048] The system (100) may include module(s). The module(s) may include an intake and preprocessing module (305), a threat modelling and detection module (310), a risk scoring and categorization module (315), a regulatory obligation module (320), and a governance dashboard module (325). In one example, the module(s) may be implemented as a combination of hardware and firmware. In an example described herein, such combinations of hardware and firmware may be implemented in several different ways. For example, the firmware for module(s) may be processor (105) executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the module(s) may include a processing resource (for example, implemented as either single processor or combination of multiple processors), to execute such instructions. Further, the hardware for the module(s) may include communication apparatuses, control circuitries involving electrical and electronics components, sensors, and interface devices, which may be in communication with each other for multi-directional communication there between.
[0049] The system (100) may further include engine(s). The engine(s) include a context analysis engine (330) and a safeguard recommendation engine (335). The engine(s) may be implemented as a combination of hardware and programming, for example, programmable instructions to implement a variety of functionalities of the engine(s). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the engine(s) may be executable instructions. Such instructions may be stored on a non-transitory machine-readable storage medium which may be coupled either directly with the system or indirectly (for example, through networked means). In an example, the engine(s) may include a processing resource, for example, either a single processor (105) or a combination of multiple processors (105), to execute such instructions. In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement engine(s). In other examples, the engine(s) may be implemented as electronic circuitry.
[0050] Further, the system (100) includes data (210). The data (210) may include data (210) that is either stored or generated as a result of functions implemented by the system (100). It may be further noted that information stored and available in data (210) may be utilized for performing various functions by the system (100). In an example, data (210) may include unstructured data file (340), data integrity (345), data transfer (350), data entities (355), surrounding data (360), detected risk attributes (365), compliance gaps (370), safeguard recommendations (375), and sensitive data (380). It may be noted that such examples of the various functions are only indicative. The present approaches may be applicable to other examples without deviating from the scope of the present subject matter.
[0051] In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement the functionalities of modules(s). In such examples, the system (100) may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions. In other examples of the present subject matter, the machine-readable storage medium may be located at a different location but accessible to the system (100) and the processor(s) (105).
[0052] In operation, the intake and pre-processing module (305) is configured to receive an unstructured data file (340) comprising at least one of a document, audio, video, and textual content via an access interface (125) to accept a data (210) from at least one of a local and remote storage sources. The intake and pre-processing module (305) receives an unstructured data file (340) via the access interface (125), enabling acquisition of data (210) from local or remote storage sources. The unstructured data file (340) represents information that lacks a predefined schema and may appear as documents, audio, video, or free-form textual content, exhibiting variability in format, structure, and semantic density. Once obtained, this heterogeneous file is normalized and prepared to ensure it can be uniformly interpreted by subsequent analytical components. The intake and pre-processing module (305) thus converts an otherwise irregular and format-diverse input into a consistent data stream that the system can reliably process for further assessment and contextual analysis. Additionally, the intake and pre-processing module (305) is configured to upload the unstructured data file (340) to a data repository (130) through a secure communication channel, thereby ensuring a data integrity (345) and confidentiality in course of a data transfer (350). The intake and pre-processing module (305) uploads the unstructured data file (340) to the data repository (130) through a secure communication channel, enabling the file to be transferred without exposure to unauthorized interception or modification. The unstructured data file (340) comprises format-diverse content lacking a fixed schema, making preservation of its authenticity during transmission essential.
The secure communication channel employs at least one cryptographic protection, such as encrypted transport tunnels or certificate-based authenticated links, that safeguards the confidentiality of the unstructured data file (340) and maintains data integrity (345) throughout the data transfer (350). This controlled transmission path ensures that the repository receives an unaltered and protected version of the file, supporting reliable downstream processing.
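By way of a non-limiting illustration, the receiving side of such a transfer could confirm that the file arrived unaltered by comparing content digests. The sketch below is an assumption made for clarity only: the present disclosure does not specify a digest algorithm, and the function names `sha256_digest` and `verify_transfer` are hypothetical. Transport encryption itself is not shown here.

```python
import hashlib
import hmac


def sha256_digest(payload: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the file content."""
    return hashlib.sha256(payload).hexdigest()


def verify_transfer(received_payload: bytes, digest_from_sender: str) -> bool:
    """Check that the received bytes match the digest computed before sending.

    hmac.compare_digest performs a constant-time comparison, which avoids
    leaking how many leading characters of the digest matched.
    """
    return hmac.compare_digest(sha256_digest(received_payload), digest_from_sender)
```

In such a sketch, the sender would compute the digest before the data transfer (350) and the data repository (130) would recompute it on receipt; a mismatch would indicate that data integrity (345) was not preserved.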
[0053] In one embodiment, the intake and pre-processing module (305) is configured to convert the received document into a machine-readable data format by performing a pre-processing operations on the text, image, and audio content, thereby standardizing a heterogeneous unstructured input. The intake and pre-processing module (305) converts the received document into a machine-readable format by applying tailored pre-processing operations to text, image, and audio content, ensuring uniform handling of diverse inputs. During this pre-processing, elements such as extracted text, transcribed audio, or digitized images are cleaned, structured, and normalized into a consistent representation. This standardization allows the system (100) to interpret heterogeneous unstructured data without format-specific limitations; for example, a scanned contract and a recorded voicemail become comparable machine-processable inputs. The resulting unified data stream enables downstream modules to perform accurate and uninterrupted analysis. The data repository (130) serves as a structured and persistent storage environment that organizes incoming information in a manner that supports reliable retrieval and further computational use. It accommodates files of varying formats while maintaining their contextual attributes, allowing the system (100) to treat them as consistent reference points during subsequent processing. Additionally, data integrity (345) ensures that the stored content remains accurate, complete, and unaltered from its original state, providing a trusted foundation for downstream analytical operations.
[0054] Furthermore, the data transfer (350) represents the controlled movement of information from one system (100) component to another, typically executed through secure and monitored communication pathways. These transfer mechanisms are designed to safeguard the content from unauthorized access or corruption, enabling smooth propagation of information through different layers of the architecture. By maintaining protected transit and ensuring the content arrives in its intended form, the system (100) establishes a dependable flow that supports consistent and verifiable processing.
[0055] Additionally, in operation, the threat modelling and detection module (310) is configured to detect a plurality of data entities (355) within the uploaded unstructured data file (340) by applying a composite entity detection technique (135), wherein the composite entity detection technique (135) comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of a personal, quasi-identifiable, anonymized, and a pseudo-anonymized data. The threat modelling and detection module (310) analyses the uploaded unstructured data file (340) to identify a plurality of data entities (355) using the composite entity detection technique (135). During this operation, the threat modelling and detection module (310) evaluates the heterogeneous content and applies coordinated detection logic to recognize and classify elements embedded within the file. The composite technique operates through the combined use of pattern recognition rules, a structured entity reference repository, and a trained language model, enabling the system (100) to distinguish among personal, quasi-identifiable, anonymized, and pseudo-anonymized data. This integrated evaluation produces a structured understanding of the sensitive data (380) types present in the file, supporting risk-oriented interpretation of its content.
[0056] The composite entity detection technique (135) functions as a multi-layered analytical mechanism that correlates rule-based patterns, predefined entity libraries, and machine-learned linguistic models to isolate and categorize data within unstructured content. The pattern recognition logic filters raw text or multimedia-derived transcripts for structural cues, while the entity reference repository provides a curated set of known markers or identifiers for comparison. The trained language model interprets contextual and semantic nuances, refining the classification of detected elements. In the proposed system (100), these components operate concurrently to convert unstructured input into an entity-mapped representation, enabling precise assessment of sensitive information across diverse data formats.
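By way of a non-limiting illustration, the concurrent operation of the three components may be sketched as three detector functions run in parallel and merged. Everything named below is an assumption for illustration: the e-mail pattern, the contents of `REFERENCE_REPOSITORY`, and in particular `detect_by_language_model`, which is only a placeholder standing in for a trained language model and does not perform any learned inference.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str      # the matched span
    category: str  # e.g. personal, quasi-identifiable, pseudo-anonymized
    source: str    # which detector produced the match


# Hypothetical rule and reference data, chosen only for this example.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PSEUDONYM_PATTERN = re.compile(r"\buser_[0-9a-f]{8}\b")
REFERENCE_REPOSITORY = {"alice smith": "personal", "springfield": "quasi-identifiable"}


def detect_by_pattern(text: str) -> list[Entity]:
    """Pattern recognition logic: structural cues such as e-mail addresses."""
    return [Entity(m.group(), "personal", "pattern") for m in EMAIL_PATTERN.finditer(text)]


def detect_by_repository(text: str) -> list[Entity]:
    """Entity reference repository: lookup of known markers."""
    lowered = text.lower()
    return [Entity(term, category, "repository")
            for term, category in REFERENCE_REPOSITORY.items() if term in lowered]


def detect_by_language_model(text: str) -> list[Entity]:
    """Placeholder for a trained language model (assumption, not a real model)."""
    return [Entity(m.group(), "pseudo-anonymized", "language_model")
            for m in PSEUDONYM_PATTERN.finditer(text)]


def detect_entities(text: str) -> list[Entity]:
    """Run the three detectors concurrently and merge their results."""
    detectors = (detect_by_pattern, detect_by_repository, detect_by_language_model)
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        batches = list(pool.map(lambda detect: detect(text), detectors))
    merged: list[Entity] = []
    for batch in batches:
        for entity in batch:
            if entity not in merged:  # drop exact duplicates
                merged.append(entity)
    return merged
```

Keeping the `source` field on each entity is one possible design choice; it would allow later stages to weigh a rule-based match differently from a model-based one.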
[0057] In one embodiment, the threat modelling and detection module (310) is configured to utilize a hybrid detection model in the composite entity detection technique (135), wherein the hybrid detection model comprises a contextual inference for the detected entity extraction. This hybrid detection model combines rule-driven patterns with language model reasoning, allowing it to recognize data entities (355) even when they appear in ambiguous or irregular structures, for example, identifying a person’s address from a phrase like “lives near the central plaza” without explicit labels. The contextual inference component evaluates surrounding text to understand how an entity is being referenced, improving accuracy across diverse formats. By merging deterministic rules with semantic interpretation, the hybrid model ensures robust and adaptable entity detection across heterogeneous unstructured data files (340). In one embodiment, the threat modelling and detection module (310) is configured to validate each of the detected data entities (355) by cross-referencing them with the data repository (130), thereby ensuring an accuracy of entity categorization and classification. The threat modelling and detection module (310) optimizes the reliability of its outputs by validating each detected data entity (355) through cross-referencing with the data repository (130), ensuring accurate categorization and classification. Cross-referencing involves comparing the identified data entity (355) against stored reference patterns, previously verified entities, or known metadata, helping confirm whether the detection is correct. For example, if an extracted term resembles an employee ID, the system (100) checks it against data repository (130) records to verify its identity and type. This verification step reduces false positives and strengthens consistency across the overall detection process.
[0058] Furthermore, in operation, the context analysis engine (330) is configured to determine a contextual sensitivity (140) of each of the detected data entity (355) by correlating the plurality of detected data entities (355) with a statistical and linguistic models configured to interpret a semantic context of a surrounding data (360) and assign a unique context-specific sensitivity score. The context analysis engine (330) evaluates each detected data entity (355) by determining its contextual sensitivity (140), correlating the identified detected data entities (355) with statistical and linguistic models tailored to interpret their surrounding semantic cues. During this phase, the context analysis engine (330) examines how each detected data entity (355) appears in relation to nearby expressions, patterns, and inferred meanings within the surrounding data (360), allowing the system (100) to assign a unique context-specific sensitivity score. This enables the context analysis engine (330) to differentiate whether an entity carries higher or lower sensitivity based on its placement, relevance, and implied significance within the unstructured data file (340).
[0059] Additionally, the surrounding data (360) refers to the textual or transcribed content that appears immediately before, after, or around a detected entity (355), providing semantic clues that shape its interpretation. For example, the term “location” may be innocuous in general text but becomes sensitive when paired with phrases like “patient residence,” showing how contextual framing changes its impact. The plurality of detected data entities (355) includes multiple identifiable elements such as names, addresses, age ranges, device identifiers, or anonymized tokens recognized earlier through entity detection and now evaluated in relation to one another for more accurate sensitivity assessment.
[0060] Furthermore, the unique context-specific sensitivity score represents a quantified assessment assigned to each entity based on its contextual importance, potential risk contribution, and semantic surroundings. Its characteristics include being dynamic, context-aware, and adaptable across different data structures, allowing the system (100) to grade sensitivity on a calibrated scale rather than using fixed labels. The unique context-specific sensitivity score is calculated by combining statistical proximity measures, linguistic similarity weights, and semantic relevance indicators, producing a refined measurement that reflects how sensitive a detected entity (355) becomes within the overall narrative of the unstructured content.
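By way of a non-limiting illustration, the combination of a statistical proximity measure, a linguistic similarity weight, and a semantic relevance indicator may be sketched as a weighted average. The disclosure does not state a formula, so the inverse-distance proximity, the [0, 1] input range, and the default weights below are all assumptions, and the function names are hypothetical.

```python
def proximity(entity_index: int, cue_indices: list[int]) -> float:
    """Inverse token distance to the nearest sensitive cue word (assumed measure).

    Returns 0.0 when no cue word is present in the surrounding data.
    """
    if not cue_indices:
        return 0.0
    nearest = min(abs(entity_index - cue) for cue in cue_indices)
    return 1.0 / (1.0 + nearest)


def sensitivity_score(
    proximity_measure: float,
    similarity_weight: float,
    relevance_indicator: float,
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),  # assumed weighting
) -> float:
    """Weighted average of three signals, each expected in the range [0, 1]."""
    signals = (proximity_measure, similarity_weight, relevance_indicator)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("each signal must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)
```

Under these assumptions, an entity located one token away from a cue such as "patient" would receive a proximity of 0.5, and the resulting score would remain on a calibrated 0 to 1 scale rather than a fixed label.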
[0061] In one embodiment, the context analysis engine (330) is configured to analyse a contextual semantic of the document to determine the contextual sensitivity of each of the detected data entity (355) and adjust a base risk scores as a function of the contextual inference and a statistical occurrence of each of the detected entity (355). The context analysis engine (330) evaluates the semantic meaning of the document to determine the contextual sensitivity of each detected data entity (355), allowing the system (100) to understand how the data entity (355) is being used within its surrounding text. Contextual sensitivity refers to how the risk level of an entity changes based on its placement and meaning. For instance, the term ID number becomes more sensitive when appearing near a patient record. Using this interpretation, the context analysis engine (330) adjusts the base risk scores by considering both contextual inference and the statistical occurrence of each data entity (355). This dynamic adjustment ensures that data entities (355) are scored according to their real-world significance within the document rather than static definitions.
[0062] In one embodiment, the context analysis engine (330) is configured to detect contextual sensitivity (140) of a data entity (355) by analysing a surrounding text semantic and a relational dependency among each of the detected data entities (355). The context analysis engine (330) examines each identified data entity (355) along with the text that appears around it to understand how sensitive that entity might be. It studies the meaning of nearby words and how different detected entities (355) relate to each other to determine contextual sensitivity (140). For example, the word location may seem harmless on its own, but if it appears next to terms like patient or a medical visit, the context analysis engine (330) recognizes it as sensitive. This helps ensure that sensitivity is judged not just by the entity itself but by the context in which it appears.
[0063] Moreover, in operation, the risk scoring and categorization module (315) is configured to calculate an individual risk score (145) for each of the detected data entity (355) based on the determined contextual sensitivity (140), wherein each of the detected data entity (355) is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density. The risk scoring and categorization module (315) derives the individual risk score (145) for each detected data entity (355) by utilizing the previously determined contextual sensitivity (140) as a primary input. The individual risk score (145) represents a quantified measure assigned to each detected data entity (355) to indicate its potential privacy or security risk within the analysed content. It reflects how sensitive or impactful an entity becomes when considering its inherent nature, contextual placement, and frequency of appearance. This individual risk score (145) acts as a weighted indicator that helps the system (100) prioritize entities requiring higher attention or stricter safeguards. By translating qualitative sensitivity factors into a numeric value, the system (100) establishes a consistent and comparable framework for evaluating risk across diverse data elements. During this evaluation, every data entity (355) is first associated with a base risk score that reflects its inherent sensitivity level, allowing the system (100) to establish an initial benchmark. The risk scoring and categorization module (315) then dynamically adjusts this score by incorporating factors such as contextual interpretation and occurrence density, enabling the risk value to shift according to how often and in what semantic conditions the entity appears. This adaptive scoring process produces a refined risk profile for each entity, supporting precise categorization aligned with the characteristics of the analysed content.
[0064] Additionally, occurrence density refers to the frequency with which a particular data entity (355) appears within the unstructured data file (340) content, offering insight into its prominence and potential impact. A higher density suggests repeated presence in meaningful sections of the data (210), which may elevate its sensitivity or risk significance. In contrast, a sparsely occurring entity may carry lower contextual influence. By quantifying these repetition patterns, the system (100) uses occurrence density as a dynamic factor to refine and adjust the overall individual risk score (145) assigned to each data entity (355).
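By way of a non-limiting illustration, the dynamic adjustment of a base risk score may be sketched as follows. The disclosure does not give the adjustment function, so the multiplicative form and the definition of occurrence density as occurrences divided by total tokens are assumptions, and both function names are hypothetical.

```python
def occurrence_density(occurrences: int, total_tokens: int) -> float:
    """Share of the document's tokens taken up by one entity (assumed definition)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return occurrences / total_tokens


def individual_risk_score(
    base_score: float,
    contextual_sensitivity: float,
    density: float,
) -> float:
    """Scale the base score up with contextual sensitivity and occurrence density.

    With sensitivity 0 and density 0 the base score is returned unchanged,
    so the base score acts as the initial benchmark.
    """
    return base_score * (1.0 + contextual_sensitivity) * (1.0 + density)
```

For example, under these assumptions a base score of 10 with a contextual sensitivity of 0.5 and no repeated occurrences would yield 15, and the value would rise further as the entity recurs more densely.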
[0065] Additionally, the risk scoring and categorization module (315) is configured to compute a document-level index (150) by aggregating the individual risk scores (145) and calculate a normalized density metric indicative of a proportion of at least one of a personally identifiable and quasi-identifiable data contained within the document. The risk scoring and categorization module (315) generates a document-level index (150) by consolidating the individual risk scores (145) assigned to each detected data entity (355), allowing the system (100) to represent the overall sensitivity of the unstructured data file (340) through a single quantifiable measure. The document-level index (150) may take various forms depending on the analysis objective, including but not limited to a cumulative risk value, a weighted sensitivity grade, and a tiered categorization level, each reflecting how concentrated or diverse the sensitive elements are within the document. For example, a file containing multiple context-sensitive identifiers like patient names and diagnostic references may yield a higher index category compared to a file containing only anonymized markers.
[0066] Further, to refine this assessment, the risk scoring and categorization module (315) derives a normalized density metric that evaluates the proportion of personally identifiable and quasi-identifiable data relative to the total content volume. This metric quantifies how densely sensitive data entities (355) are distributed across the document, ensuring that the document-level index (150) does not rely solely on raw counts but also considers contextual saturation. In the proposed system (100), this normalized density is integrated into the document-level index (150), enabling the system (100) to distinguish between documents with moderate sensitivity dispersed broadly and those with concentrated clusters of high-risk information, thereby supporting more accurate categorization and subsequent handling.
[0067] Furthermore, the risk scoring and categorization module (315) is configured to categorize the document based on the computed document-level index (150) and a risk distribution, wherein the categorization distinguishes among a high-risk, moderate-risk, and a low-risk data segment. The risk scoring and categorization module (315) classifies the document by interpreting the computed document-level index (150) together with the distribution pattern of sensitive elements, enabling the system (100) to distinguish among high-risk, moderate-risk, and low-risk data segments. During this operation, the risk scoring and categorization module (315) examines how intensely sensitive entities are concentrated and how their contextual characteristics influence overall exposure. A high-risk segment may contain dense clusters of identifiable information such as full names paired with medical records, while a moderate-risk segment may include partially descriptive elements like age ranges or general locations. A low-risk segment typically consists of anonymized or non-sensitive content that poses minimal privacy implications. By evaluating these variations through structured thresholds derived from the document-level index (150) and distribution metrics, the system (100) produces a categorized output that reflects the document’s holistic risk posture. In one embodiment, the risk scoring and categorization module (315) is configured to derive the document-level index (150) by summing each of the individual risk scores (145) of the detected data entities (355) to obtain a cumulative privacy risk indicator for the document. The risk scoring and categorization module (315) calculates the document-level index (150) by summing the individual risk scores (145) of each detected data entity (355), producing a cumulative privacy risk indicator that reflects the overall sensitivity of the document.
This indicator quantifies the combined exposure posed by all sensitive elements within the unstructured data file (340), providing a single metric for risk assessment. For example, a document containing multiple names, addresses, and financial identifiers would have a higher cumulative privacy risk indicator than one containing only anonymized data. This measure enables the system (100) to prioritize documents for remediation or stricter handling based on their aggregated privacy risk.
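By way of a non-limiting illustration, the summation of individual risk scores (145) into a document-level index (150) and a threshold-based categorization may be sketched as follows. The threshold values are assumptions chosen only for the example, the function names are hypothetical, and the sketch categorizes on the summed index alone, leaving out the risk distribution for brevity.

```python
from typing import Iterable


def document_level_index(individual_scores: Iterable[float]) -> float:
    """Cumulative privacy risk indicator: the sum of the per-entity scores."""
    return sum(individual_scores)


def categorize(index: float, high: float = 50.0, moderate: float = 20.0) -> str:
    """Map the index to a tier using assumed, illustrative thresholds."""
    if index >= high:
        return "high-risk"
    if index >= moderate:
        return "moderate-risk"
    return "low-risk"
```

Under these assumed thresholds, a document whose entities score 10, 20.5, and 5 would have an index of 35.5 and fall in the moderate-risk tier.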
[0068] In one embodiment, the risk scoring and categorization module (315) is configured to compute the normalized density metric by dividing a cumulative privacy score by a document character length, thereby quantifying a concentration of a sensitive data (380) in the document. The risk scoring and categorization module (315) computes the normalized density metric by dividing the cumulative privacy risk indicator by the total character length of the document, thereby quantifying the concentration of sensitive data (380). For example, if a document has a cumulative privacy risk score of 80 and contains 1,000 characters, the normalized density metric would be 0.08, indicating a high concentration of sensitive information. In contrast, a 5,000-character document with the same score of 80 would yield a density of 0.016, showing that sensitive data is more sparsely distributed. This calculation helps the system (100) prioritize documents with densely packed sensitive content for stricter monitoring or remediation.
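By way of a non-limiting illustration, the division described above may be sketched as a single function. The function name `normalized_density` and the guard against an empty document are assumptions added for the example.

```python
def normalized_density(cumulative_privacy_score: float, character_length: int) -> float:
    """Cumulative privacy risk indicator divided by document character length."""
    if character_length <= 0:
        raise ValueError("character_length must be positive")
    return cumulative_privacy_score / character_length


# The two documents from the example above, both with a cumulative score of 80.
dense = normalized_density(80, 1000)   # 1,000-character document
sparse = normalized_density(80, 5000)  # 5,000-character document
```

Here `dense` evaluates to 0.08 and `sparse` to 0.016, so the shorter document would be ranked ahead of the longer one even though both carry the same cumulative score.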
[0069] Moreover, in operation, the regulatory obligation module (320) is configured to determine a regulatory obligation corresponding to the categorized document by mapping a plurality of detected risk attributes (365) to one or more applicable regulatory frameworks comprising a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps (370) associated with the plurality of detected risk attributes (365). The regulatory obligation module (320) evaluates the categorized document by mapping the plurality of detected risk attributes (365) to relevant regulatory frameworks, including privacy and data-protection standards. In this stage, the regulatory obligation module (320) interprets the document’s risk distribution and correlates each attribute with specific legal or compliance requirements that govern the handling of sensitive or identifiable information. The plurality of detected risk attributes (365) may include elements such as entity sensitivity levels, contextual exposure patterns, occurrence density indicators, or distribution of identifiable data, each reflecting a characteristic that influences regulatory applicability. These plurality of detected risk attributes (365) collectively form a structured profile of the document’s compliance posture.
[0070] Using this mapped profile, the regulatory obligation module (320) determines the set of compliance gaps (370) by comparing the document’s characteristics with obligations mandated by the applicable frameworks. Compliance gaps (370) refer to the specific areas where the document fails to meet the requirements established by applicable regulatory or privacy frameworks. These gaps (370) indicate missing controls, inadequate protections, or deviations from mandated standards that could lead to non-compliance. Examples include handling personal data without sufficient anonymization, lacking required consent documentation, or failing to apply appropriate access restrictions. For instance, a detected attribute indicating the presence of health-related identifiers may trigger obligations under medical privacy laws, while high occurrence density of contact details may correspond to stricter data-minimization requirements. The mapping mechanism identifies where the document’s current state diverges from regulatory expectations, enabling the system (100) to highlight missing safeguards, insufficient anonymization, or unmet consent requirements. This connected evaluation ensures that the regulatory determination is both context-aware and aligned with the specific risk attributes driving the document’s sensitivity.
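By way of a hedged sketch, the mapping of detected risk attributes to compliance gaps may be pictured as a lookup followed by a set difference. The obligation table, the attribute labels, the framework labels, and the safeguard names below are hypothetical placeholders chosen for illustration; the disclosure does not prescribe any particular table or data structure.

```python
from dataclasses import dataclass

# Hypothetical obligation table: risk attribute -> (framework label, required safeguards).
OBLIGATIONS = {
    "health_identifier": ("medical-privacy framework", {"consent_record", "anonymization"}),
    "high_contact_density": ("data-protection framework", {"data_minimization", "access_restriction"}),
}


@dataclass(frozen=True)
class ComplianceGap:
    attribute: str
    framework: str
    missing_safeguard: str


def find_compliance_gaps(detected_attributes, safeguards_in_place):
    """Return one gap per required safeguard that the document does not yet have."""
    gaps = []
    for attribute in detected_attributes:
        if attribute not in OBLIGATIONS:
            continue  # attribute carries no mapped obligation in this sketch
        framework, required = OBLIGATIONS[attribute]
        for safeguard in sorted(required - set(safeguards_in_place)):
            gaps.append(ComplianceGap(attribute, framework, safeguard))
    return gaps


gaps = find_compliance_gaps(
    ["health_identifier", "high_contact_density"],
    {"consent_record", "access_restriction"},
)
for gap in gaps:
    print(gap)
```

In this sketch a gap is simply a required safeguard that is absent, which mirrors the comparison of the document's current state against the mapped obligations.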
[0071] In one embodiment, the regulatory obligation module (320) is configured to store computed metrics, categorized data entities (355), and a compliance mapping result in a version-controlled repository, thereby enabling a traceability, auditing, and continuous compliance evaluation across a multiple document dataset. The regulatory obligation module (320) keeps all computed metrics, classified data entities (355), and the resulting compliance mappings in a version-controlled repository. This setup ensures every change is recorded, allowing full traceability and easy auditing over time. By storing each version, the system (100) can track how compliance assessments evolve. This also supports continuous evaluation across large document sets, ensuring that updates or corrections are always transparent.
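As one possible illustration of such a version-controlled repository, the following sketch keeps an append-only history per document and records a content digest for each version. The class name, the in-memory storage, and the use of a SHA-256 digest are assumptions made for illustration; an actual embodiment may rely on any suitable versioning mechanism.

```python
import hashlib
import json
from datetime import datetime, timezone


class VersionedAssessmentStore:
    """Append-only, in-memory history of assessments keyed by document identifier."""

    def __init__(self):
        self._history = {}  # document_id -> list of version records

    def commit(self, document_id, assessment):
        """Record a new version of an assessment and return its version number."""
        payload = json.dumps(assessment, sort_keys=True)
        versions = self._history.setdefault(document_id, [])
        record = {
            "version": len(versions) + 1,
            "digest": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "assessment": json.loads(payload),  # stored copy, detached from the caller's dict
        }
        versions.append(record)
        return record["version"]

    def history(self, document_id):
        """Return every recorded version, oldest first, for auditing."""
        return list(self._history.get(document_id, []))


store = VersionedAssessmentStore()
store.commit("doc-001", {"document_index": 80, "density": 0.08, "gaps": ["missing_consent"]})
store.commit("doc-001", {"document_index": 35, "density": 0.035, "gaps": []})
print([record["version"] for record in store.history("doc-001")])
```

Because earlier records are never overwritten, an auditor can compare successive versions to see how an assessment evolved.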
[0072] Moreover, in operation, the safeguard recommendation engine (335) is configured to provide a safeguard recommendation (375) comprising a regulatory measure and a technical measure based on the determination of the plurality of compliance gaps (370) and a context-specific risk severity. The safeguard recommendation engine (335) generates the safeguard recommendation (375) by evaluating the identified compliance gaps (370) together with the context-specific risk severity, enabling the system (100) to propose corrective actions tailored to the document’s sensitivity profile. During this stage, the safeguard recommendation engine (335) interprets which regulatory requirements remain unfulfilled and correlates them with appropriate protection strategies. The resulting recommendation may include regulatory measures, such as applying stricter consent requirements or enforcing data-retention limits, and technical measures, such as pseudonymization, encryption, or access-control enforcement. These safeguard recommendations (375) address the specific gaps detected. For example, if contact identifiers appear without necessary protection, the safeguard recommendation engine (335) may suggest redaction or tokenization. By aligning each recommendation with the document’s risk posture, the system (100) ensures that remediation is both targeted and compliant.
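A minimal sketch of how a recommendation could pair a regulatory measure with a technical measure for each gap is given below. The gap labels, the lookup table, the fallback pair, and the rule that high severity adds encryption are all hypothetical assumptions for illustration; only the measure names themselves are drawn from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeguardRecommendation:
    gap: str
    regulatory_measure: str
    technical_measure: str


# Hypothetical lookup: compliance gap -> (regulatory measure, technical measure).
MEASURES = {
    "missing_consent": ("apply stricter consent requirements", "access-control enforcement"),
    "insufficient_anonymization": ("enforce data-retention limits", "pseudonymization"),
    "unprotected_contact_identifiers": ("enforce data-retention limits", "tokenization"),
}


def recommend(gaps, severity):
    """Pair each gap with measures; high severity additionally layers encryption (assumed rule)."""
    recommendations = []
    for gap in gaps:
        regulatory, technical = MEASURES.get(gap, ("manual compliance review", "encryption"))
        if severity == "high" and technical != "encryption":
            technical = f"{technical} with encryption"
        recommendations.append(SafeguardRecommendation(gap, regulatory, technical))
    return recommendations


for rec in recommend(["unprotected_contact_identifiers", "missing_consent"], severity="high"):
    print(rec)
```

Keeping the regulatory and technical measures as separate fields reflects that each recommendation comprises both kinds of measure.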
[0073] Moreover, in operation, the governance dashboard module (325) is configured to display a categorized result, computed scores, and the safeguard recommendations (375) utilizing a governance dashboard (155), wherein the governance dashboard (155) is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making. The governance dashboard module (325) presents the categorized result, the computed scores, and the generated safeguard recommendations (375) through the governance dashboard (155), enabling an integrated view of the document’s risk and compliance posture. As the module consolidates outputs from preceding analytical components, the governance dashboard (155) visualizes key indicators such as sensitivity distribution, risk levels, and required remediation steps in an organized and interpretable format. The governance dashboard (155) further establishes an interface with external compliance or data protection units, allowing these stakeholders to access real-time insights for continuous oversight. Through this connected interaction, the system (100) supports informed decision-making and ongoing monitoring of regulatory alignment.
[0074] In a non-limiting example, the proposed system (100) for unstructured data risk assessment and risk quantification can be implemented in a large healthcare organization that handles extensive patient records in multiple formats, including scanned documents, audio consultations, video recordings, and textual notes from electronic health records. When a new batch of patient data (210) is received, the system (100) first ingests the unstructured data files (340) through an access interface (125), whether from local hospital servers or remote cloud storage, ensuring seamless collection across different sources. The system (100) then uploads the files to a secure data repository (130) using encrypted communication channels, maintaining the integrity and confidentiality of sensitive medical and personal information during transfer. Once stored, the composite entity detection technique (135) identifies a plurality of data entities (355) within the files (340), such as patient names, addresses, medical IDs, test results, or anonymized identifiers, using a combination of pattern recognition logic, a curated entity reference repository, and a trained language model. Each detected entity (355) is then evaluated to determine a contextual sensitivity (140) linked with a base score by examining surrounding text and relational dependencies, such as linking a patient’s location to a specific treatment or medical condition. Based on this, the system (100) calculates the individual risk score (145) for each entity (355), dynamically adjusting for occurrence density and context, for instance, assigning higher risk to frequently mentioned identifiers in sensitive reports. The system (100) then aggregates these scores (145) to produce a document-level index (150) and computes a normalized density metric, reflecting the proportion and concentration of personally identifiable or quasi-identifiable data within the document. Using these metrics, the system (100) categorizes documents into high-risk, moderate-risk, and low-risk segments, for example highlighting reports with multiple patient identifiers as high-risk, while anonymized study notes are categorized as low-risk. The system (100) maps the detected risk attributes (365) to applicable privacy and data protection frameworks, identifying compliance gaps (370) such as missing consent records or insufficient anonymization in certain files. Based on these gaps (370) and the context-specific risk severity, the system (100) suggests regulatory measures like stricter access controls and technical measures such as encryption or pseudonymization. Finally, the governance dashboard (155) presents the categorized results, computed scores, and recommended safeguards in an intuitive interface, allowing hospital compliance teams or external auditors to monitor data handling continuously and make informed decisions. By implementing this system (100), the healthcare organization ensures that sensitive patient information is rigorously assessed, prioritized for protection, and maintained in compliance with regulatory requirements, reducing the likelihood of data breaches or privacy violations.
FIG. 4 (a) is a flow chart representing the steps involved in a method for unstructured data risk assessment and risk quantification in accordance with an embodiment of the present disclosure. FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.
The method (400) includes receiving, by an access interface, an unstructured data file comprising at least one of a document, audio, video, and textual content to accept a data from at least one of a local and remote storage sources in the step (405).
[0075] Additionally, the method (400) includes uploading, via a secured communication channel, the unstructured data file to a data repository, thereby ensuring a data integrity and confidentiality in course of a data transfer in the step (410).
[0076] Furthermore, the method (400) includes detecting, by applying a composite entity detection technique, a plurality of data entities within the uploaded unstructured data file, wherein the composite entity detection technique comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of a personal, quasi-identifiable, anonymized, and a pseudo-anonymized data in the step (415).
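As a hedged sketch of step (415), three detectors may be run concurrently and their findings merged. The regular expressions, the `MRN-` identifier format, the reference entries, and the sample text are hypothetical; the language-model detector is a placeholder that returns nothing, since the disclosure does not specify a particular trained model.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Pattern recognition logic: hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "medical_id": re.compile(r"\bMRN-\d{6}\b"),
}

# Entity reference repository: hypothetical curated entries.
REFERENCE_REPOSITORY = {"Bengaluru": "location", "Dr. Rao": "person"}


def detect_by_pattern(text):
    return {(m.group(), label) for label, rx in PATTERNS.items() for m in rx.finditer(text)}


def detect_by_reference(text):
    return {(term, label) for term, label in REFERENCE_REPOSITORY.items() if term in text}


def detect_by_language_model(text):
    # Placeholder for a trained language model; a real detector would return
    # (entity text, label) pairs inferred from context.
    return set()


def detect_entities(text):
    """Run the three detectors concurrently and merge their de-duplicated results."""
    detectors = (detect_by_pattern, detect_by_reference, detect_by_language_model)
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        results = list(pool.map(lambda detector: detector(text), detectors))
    return sorted(set().union(*results))


sample = "Contact Dr. Rao at rao@example.org about MRN-123456 in Bengaluru."
print(detect_entities(sample))
```

Merging into a set means an entity reported by more than one detector appears only once in the combined result.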
[0077] Moreover, the method (400) includes determining, by correlating the plurality of detected entities with a statistical and linguistic models, a contextual sensitivity of each of the detected data entity, wherein the statistical and linguistic models are configured to interpret a semantic context of a surrounding data and assign a unique context-specific sensitivity score in the step (420).
[0078] Moreover, the method (400) includes calculating, based on the determined contextual sensitivity, an individual risk score for each of the detected data entity, wherein each of the detected entity is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density in the step (425).
Moreover, the method (400) includes computing, by aggregating the individual risk scores, a document-level index and calculating a normalized density metric indicative of a proportion of at least one of a personally identifiable and quasi-identifiable data contained within the document in the step (430).
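For illustration only, the adjustment of a base risk score and the aggregation into a document-level index may be sketched as below. The multiplicative form, the `occurrence_weight` parameter, and the sample entities are assumptions; the disclosure states only that the base score is adjusted as a function of contextual analysis and occurrence density and that the index is obtained by aggregating the individual scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedEntity:
    label: str
    base_score: float
    context_sensitivity: float  # hypothetical multiplier, 1.0 = neutral context
    occurrences: int


def individual_risk_score(entity, occurrence_weight=0.5):
    """Scale the base score by context and by how often the entity recurs (assumed formula)."""
    occurrence_factor = 1.0 + occurrence_weight * (entity.occurrences - 1)
    return entity.base_score * entity.context_sensitivity * occurrence_factor


def document_level_index(entities):
    """Aggregate by summing the individual risk scores."""
    return sum(individual_risk_score(entity) for entity in entities)


entities = [
    DetectedEntity("patient_name", base_score=10.0, context_sensitivity=2.0, occurrences=3),
    DetectedEntity("medical_id", base_score=20.0, context_sensitivity=1.5, occurrences=1),
    DetectedEntity("city", base_score=5.0, context_sensitivity=1.0, occurrences=3),
]
index = document_level_index(entities)
print(index)
```

With these assumed inputs the three entities contribute 40, 30, and 10, so the index matches the cumulative score of 80 used in the density example of paragraph [0068].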
[0079] Moreover, the method (400) includes categorizing, based on the computed document-level index and a risk distribution, the document, wherein the categorization distinguishes among a high-risk, moderate-risk, and a low-risk data segment in the step (435).
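A simple threshold rule is one way the categorization of step (435) could be realized. The threshold values below, and the use of the normalized density metric as a stand-in for the risk distribution, are assumptions made purely for illustration and are not taken from the disclosure.

```python
def categorize_document(document_index, density,
                        high_index=75.0, moderate_index=30.0, high_density=0.05):
    """Assign a risk segment from the document-level index and density (assumed thresholds)."""
    if document_index >= high_index or density >= high_density:
        return "high-risk"
    if document_index >= moderate_index:
        return "moderate-risk"
    return "low-risk"


print(categorize_document(80.0, 0.08))
print(categorize_document(40.0, 0.01))
print(categorize_document(5.0, 0.001))
```

Letting either a large index or a large density trigger the high-risk segment means that a short document packed with identifiers is not overlooked merely because its total score is modest.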
[0080] Moreover, the method (400) includes determining, by mapping a plurality of detected risk attributes to one or more applicable regulatory frameworks, a regulatory obligation corresponding to the categorized document, wherein the one or more applicable regulatory frameworks comprises a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps associated with the plurality of detected risk attributes in the step (440).
[0081] Moreover, the method (400) includes providing, by determination of the plurality of compliance gaps and a context-specific risk severity, a safeguard recommendation comprising a regulatory measure and a technical measure in the step (445).
[0082] Moreover, the method (400) includes displaying, utilizing a governance dashboard, a categorized result, computed scores, and the safeguard recommendations, wherein the governance dashboard is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making in the step (450).
[0083] Various embodiments of the system and the method for unstructured data risk assessment and risk quantification described above enable various advantages. The proposed system (100) offers significant practical advantages by enabling organizations to automatically assess risk within large volumes of unstructured data without manual review. It helps reduce compliance effort by instantly detecting sensitive or identifiable information across documents, audio, and video files using trained models. The system (100) improves accuracy by considering contextual sensitivity, ensuring that data entities (355) are not misclassified based solely on keyword matches. The system (100) further enhances data protection by generating a document-level risk index (150) and identifying high-risk segments that require immediate attention. Organizations benefit from transparent regulatory alignment, as the system (100) maps detected risk attributes (365) to relevant legal frameworks and highlights compliance gaps (370) before audits occur. The automated safeguard recommendations (375) reduce the likelihood of human error and guide teams toward appropriate technical and regulatory actions. The secure ingestion and data transfer (350) process ensures data confidentiality during all stages of handling. The governance dashboard (155) provides real-time visibility, enabling faster decision-making and continuous monitoring. Overall, the system (100) reduces operational overhead, increases regulatory readiness, and strengthens enterprise-wide data protection.
[0084] The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware, or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing subsystem” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit including hardware may also perform one or more of the techniques of this disclosure. Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various techniques described in this disclosure. In addition, any of the described units, modules, or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware, firmware, or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware, firmware, or software components, or integrated within common or separate hardware, firmware, or software components.
[0085] It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the disclosure and are not intended to be restrictive thereof.
[0086] While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.
[0087] The figures and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the order of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts need to be necessarily performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples.
Claims
WE CLAIM:
1. A system for unstructured data risk assessment and risk quantification comprising: a processor; a memory coupled to the processor, wherein the memory comprises instructions that when executed by the processor cause the processor to: receive an unstructured data file comprising at least one of a document, audio, video, and textual content via an access interface to accept a data from at least one of a local and remote storage sources; upload the unstructured data file to a data repository through a secure communication channel, thereby ensuring a data integrity and confidentiality in course of a data transfer; detect a plurality of data entities within the uploaded unstructured data file by applying a composite entity detection technique, wherein the composite entity detection technique comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of a personal, quasi-identifiable, anonymized, and a pseudo-anonymized data; determine a contextual sensitivity of each of the detected data entity by correlating the plurality of detected entities with a statistical and linguistic models configured to interpret a semantic context of a surrounding data and assign a unique context-specific sensitivity score; calculate an individual risk score for each of the detected data entity based on the determined contextual sensitivity, wherein each of the detected entity is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density; compute a document-level index by aggregating the individual risk scores and calculate a normalized density metric indicative of a proportion of at least one of a personally identifiable and quasi-identifiable data contained within the document; categorize the document based on the computed document-level index and a risk distribution, wherein the categorization distinguishes among a high-risk, moderate-risk, and a low-risk data segment; determine a regulatory obligation corresponding to the categorized document by mapping a plurality of detected risk attributes to one or more applicable regulatory frameworks comprising a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps associated with the plurality of detected risk attributes; provide a safeguard recommendation comprising a regulatory measure and a technical measure based on the determination of the plurality of compliance gaps and a context-specific risk severity; and display a categorized result, computed scores, and the safeguard recommendations utilizing a governance dashboard, wherein the governance dashboard is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making.
2. The system as claimed in claim 1, to cause the processor to convert the received document into a machine-readable data format by performing preprocessing operations on the text, image, and audio content, thereby standardizing a heterogeneous unstructured input.
3. The system as claimed in claim 1, to cause the processor to utilize a hybrid detection model in the composite entity detection technique, wherein the hybrid detection model comprises a contextual inference for the detected entity extraction.
4. The system as claimed in claim 1, to cause the processor to analyse a contextual semantic of the document to determine the contextual sensitivity of each of the detected entity and adjust a base risk score as a function of the contextual inference and a statistical occurrence of each of the detected entity.
5. The system as claimed in claim 1, to cause the processor to validate each of the detected data entities by cross-referencing them with the data repository, thereby ensuring an accuracy of entity categorization and classification.
6. The system as claimed in claim 1, to cause the processor to store computed metrics, categorized data entities, and a compliance mapping result in a version-controlled repository, thereby enabling a traceability, auditing, and continuous compliance evaluation across a multiple document dataset.
7. The system as claimed in claim 1, to cause the processor to derive the document-level index by summing each of the individual detected entity risk scores to obtain a cumulative privacy risk indicator for the document.
8. The system as claimed in claim 1, to cause the processor to compute the normalized density metric by dividing a cumulative privacy score by a document character length, thereby quantifying a concentration of a sensitive data in the document.
9. The system as claimed in claim 1, to cause the processor to detect contextual sensitivity of a data entity by analysing a surrounding text semantic and a relational dependency among each of the detected entities.
10. A method for unstructured data risk assessment and risk quantification comprising: receiving, by an access interface, an unstructured data file comprising at least one of a document, audio, video, and textual content to accept a data from at least one of a local and remote storage sources; uploading, via a secured communication channel, the unstructured data file to a data repository, thereby ensuring a data integrity and confidentiality in course of a data transfer; detecting, by applying a composite entity detection technique, a plurality of data entities within the uploaded unstructured data file, wherein the composite entity detection technique comprises a pattern recognition logic, an entity reference repository, and a trained language model adapted to operate concurrently to identify and categorize at least one of a personal, quasi-identifiable, anonymized, and a pseudo-anonymized data; determining, by correlating the plurality of detected entities with a statistical and linguistic models, a contextual sensitivity of each of the detected data entity, wherein the statistical and linguistic models are configured to interpret a semantic context of a surrounding data and assign a unique context-specific sensitivity score; calculating, based on the determined contextual sensitivity, an individual risk score for each of the detected data entity, wherein each of the detected entity is assigned a base risk score adapted to be dynamically adjusted as a function of a contextual analysis and an occurrence density; computing, by aggregating the individual risk scores, a document-level index and calculating a normalized density metric indicative of a proportion of at least one of a personally identifiable and quasi-identifiable data contained within the document; categorizing, based on the computed document-level index and a risk distribution, the document, wherein the categorization distinguishes among a high-risk, moderate-risk, and a low-risk data segment; determining, by mapping a plurality of detected risk attributes to one or more applicable regulatory frameworks, a regulatory obligation corresponding to the categorized document, wherein the one or more applicable regulatory frameworks comprises a privacy framework and a data protection framework, wherein the mapping yields a determination of a plurality of compliance gaps associated with the plurality of detected risk attributes; providing, by determination of the plurality of compliance gaps and a context-specific risk severity, a safeguard recommendation comprising a regulatory measure and a technical measure; and displaying, utilizing a governance dashboard, a categorized result, computed scores, and the safeguard recommendations, wherein the governance dashboard is configured to interface with at least one of an external compliance and a data protection unit for continuous monitoring and decision-making.