Method and apparatus for identifying target data in application, device, and product
By exploiting the centralized management of data fields in application code and analyzing the semantics of those fields, target data fields are identified, addressing the problem of incomplete data identification in existing techniques and enabling more efficient data protection.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-18
Abstract
Description
Methods, apparatus, devices, and products for identifying target data in applications
Technical Field
[0001] This disclosure relates to the field of data security, and more specifically to methods, apparatus, devices, and products for identifying target data in applications.
Background
[0002] In the field of data security, it is necessary to identify specific categories of data fields within application code. A data field refers to a unit in the code that stores and uses data. After identifying specific categories of data fields, the values of those fields can be processed accordingly to improve data security.
[0003] For example, after identifying data fields of a specific category, the values of those fields can be encrypted, masked, or deleted. Furthermore, engineers can analyze data fields of specific categories to determine whether the corresponding code contains vulnerabilities that could lead to data breaches.
Summary of the Invention
[0004] In a first aspect of the embodiments of this disclosure, a method for identifying target data in an application is provided. The method includes determining a set of data fields based on the code scope in which the data fields are stored or used in the application's code, wherein multiple data fields in the set are centrally stored or used in the code. The method also includes determining that there exist data fields in the set of data fields associated with words in a predetermined target data term set. Furthermore, the method includes identifying target data fields from data fields in the set that are not associated with words in the predetermined target data term set.
[0005] In a second aspect of the embodiments of this disclosure, an apparatus for identifying target data in an application is provided. The apparatus includes a data field set partitioning module configured to determine a data field set based on the code scope in which the data fields are stored or used in the application's code, wherein multiple data fields in the data field set are centrally stored or used in the code. The apparatus also includes a data field set determination module configured to determine that there exist data fields in the data field set associated with words in a predetermined target data term set. Furthermore, the apparatus includes a target data field identification module configured to identify target data fields from data fields in the data field set that are not associated with words in the predetermined target data term set.
[0006] In a third aspect of embodiments of this disclosure, an electronic device is provided. The electronic device includes one or more processors; and a storage device for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement a method for identifying target data in an application. The method includes determining a set of data fields based on the scope of code in which the data fields are stored or used in the application's code, wherein multiple data fields in the set are centrally stored or used in the code. The method also includes determining that there exists a data field in the set of data fields associated with a word in a predetermined target data term set. Furthermore, the method includes identifying the target data field from data fields in the set of data fields that are not associated with a word in the predetermined target data term set.
[0007] In a fourth aspect of embodiments of this disclosure, a computer program product is provided. The computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions that, when executed, cause a machine to implement a method for identifying target data in an application. The method includes determining a set of data fields based on the scope of code in which the data fields are stored or used in the application's code, wherein multiple data fields in the set are centrally stored or used in the code. The method also includes determining that there exists a data field in the set of data fields associated with a word in a predetermined target data term set. Furthermore, the method includes identifying the target data field from data fields in the set of data fields that are not associated with a word in the predetermined target data term set.
[0008] The summary section is provided to introduce a selection of concepts in a simplified form, which are further described in the detailed description below. The summary section is not intended to identify key or principal features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief Description of the Drawings
[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
[0010] Figure 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
[0011] Figure 2 shows a flowchart of a method for identifying target data in an application according to some embodiments of the present disclosure;
[0012] Figure 3 shows a schematic diagram of the architecture of an example system for identifying target data in an application according to some embodiments of the present disclosure;
[0013] Figure 4 shows a schematic diagram of an example field clustering module for determining the target data category of a candidate data field according to some embodiments of the present disclosure;
[0014] Figure 5 illustrates a schematic diagram of an example of determining the similarity between a candidate data field and keywords in a target data category cluster, according to some embodiments of the present disclosure;
[0015] Figure 6 shows a block diagram of an apparatus for identifying target data in an application according to some embodiments of the present disclosure; and
[0016] Figure 7 shows a block diagram of a device capable of implementing several embodiments of the present disclosure.
Detailed Description
[0017] It is understood that all user-related data involved in this technical solution should be obtained and used only after authorization from the user. This means that if it is necessary to use a user's personal information in this technical solution, the user's explicit consent and authorization are required before obtaining this data; otherwise, no related data collection and use will be carried out. It should also be understood that when implementing this technical solution, relevant laws and regulations should be strictly followed in the process of data collection, use, and storage, and necessary technical measures should be taken to protect user data security and ensure the secure use of data.
[0018] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0019] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects unless explicitly stated otherwise. Other explicit and implicit definitions may also be included below.
[0020] In modern life and work, mobile applications carry massive amounts of data, including data requiring special protection (referred to in this document as target data or target data fields), such as user data, platform business data, and data involving proprietary technologies. Comprehensively and accurately locating and identifying target data is a prerequisite for protecting it effectively. However, because the categories of such data are complex, diverse, and constantly evolving (for example, within platform business data, content titles and content tags belong to two different categories), existing technical solutions struggle to identify this data comprehensively, which undermines the comprehensiveness and accuracy of subsequent data protection.
[0021] When locating and identifying target data in applications, existing techniques primarily rely on semantic information to determine whether data of a specific category exists and how it should be classified. Based on the data source, these techniques fall into two categories: those using semantic information from static code and those using semantic information from dynamic network traffic. In schemes based on static code, a keyword set for the target data to be identified is constructed and the application's local code is scanned statically. Techniques such as character matching and regular expressions are then used to determine whether the code contains semantic information matching the keyword set. Once a match is found, the data is identified as target data, and its category can be further determined. In schemes based on dynamic network traffic, the application under test is run to capture the content transmitted during network communication, and character matching or similar techniques are applied to that content to identify target data.
[0022] However, these related technologies share a common limitation: they can only support data identification within a specified range. When detecting target data in a specific application, the keyword sets upon which these schemes rely may miss data fields unique to the application being tested (e.g., target data with special naming conventions). Therefore, these schemes struggle to comprehensively and accurately discover target data in an application, thereby affecting the completeness and reliability of subsequent data security analysis tasks.
[0023] Therefore, embodiments of this disclosure provide a scheme for identifying target data in applications. The embodiments of this disclosure are based on a core observation: application developers typically employ a centralized data management approach when writing application code. Since many application programming languages (e.g., Java, Kotlin, etc.) are object-oriented programming languages, target data is often managed and used in a centralized format (e.g., Java classes or key-value pairs). Compared to distributing target data throughout the application, this centralized management approach significantly simplifies the implementation of application business processes. Therefore, these centralized data carriers become the best entry point for mining target data.
[0024] In the solutions provided by the embodiments of this disclosure, a computing device can determine a set of data fields based on the code scope in which the data fields are stored or used in the application's code, wherein multiple data fields in the set are centrally stored or used in the code. The computing device can also determine that there are data fields in the set of data fields associated with words in a predetermined target data term set. Furthermore, the computing device can identify target data fields from data fields in the set that are not associated with words in the predetermined target data term set.
[0025] This approach leverages the centralized storage and usage of target data within applications to single out a set of data fields that includes target data. It then further identifies potential target data fields among the fields in that set that were not initially identified as such. This reduces the amount of target data missed in the application under test, improving the comprehensiveness and accuracy of data protection.
[0026] Figure 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. As shown in Figure 1, environment 100 includes a computing device 102, which can be any device with computing or processing capabilities. For example, computing device 102 can be a local server, cloud server, desktop computer, laptop computer, tablet computer, etc. As shown in Figure 1, in environment 100, computing device 102 can obtain application code 106 of application 104. Application code 106 can be native code of application 104 (e.g., program code written using a programming language) or data in a code format (e.g., JSON format code) sent by application 104 over a network.
[0027] In environment 100, target data is centrally stored or used in application code 106. Some parts of the target data are relatively easy to discover; for example, some data field names in application code 106 match keywords of the target data. Other parts of the target data are not easily discovered; for example, some data field names in application code 106 belong to uncommon naming conventions, or some data fields are target data with unknown categories.
[0028] In environment 100, computing device 102 can determine a set of data fields based on the code scope in which data fields are stored or used in application code 106, where multiple data fields in the set are centrally stored or used in the code. In this document, "centrally stored or used" means that multiple data fields are managed within the same code scope or data structure. For example, multiple member variables in a Java class are centrally stored or used, and computing device 102 can determine these member variables as a set of data fields. As another example, multiple keys in a key-value pair data structure (e.g., a JSON object) are centrally stored or used, and computing device 102 can determine these keys as a set of data fields. In environment 100, computing device 102 can determine data fields 108-1, 108-2, and 108-3 (collectively referred to as data fields 108) stored or used within code scope 110, which can form a set of data fields 112.
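The grouping described above can be sketched roughly as follows. This is a minimal illustration only: the record layout, the scope names, and the helper names `group_by_scope` and `json_object_fields` are assumptions made for the example and do not come from the disclosure.

```python
import json
from collections import defaultdict

# Hypothetical (scope, field name) records, as a code scanner might emit them.
records = [
    ("com.example.UserProfile", "userName"),
    ("com.example.UserProfile", "birthday"),
    ("com.example.UserProfile", "nickTag"),
    ("com.example.RenderConfig", "frameRate"),
]


def group_by_scope(records):
    """Group field names by the code scope that stores or uses them."""
    field_sets = defaultdict(list)
    for scope, field in records:
        field_sets[scope].append(field)
    return dict(field_sets)


def json_object_fields(text):
    """Treat the top-level keys of one JSON object as one data field set."""
    obj = json.loads(text)
    return list(obj.keys()) if isinstance(obj, dict) else []


field_sets = group_by_scope(records)
json_fields = json_object_fields('{"lat": 1.5, "lng": 2.5, "poi_title": "x"}')
```

Here each Java class name or JSON object plays the role of a code scope, and each resulting list plays the role of one data field set.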
[0029] In environment 100, computing device 102 can acquire a predetermined target data term set 114, which may include keywords 116-1, 116-2, ..., 116-N (collectively referred to as keywords 116). Keywords 116 are keywords of target data that appear frequently in multiple applications. Computing device 102 can determine whether each data field 108 in data field set 112 is associated with any keyword 116 in target data term set 114. For example, computing device 102 can use techniques such as keyword matching or semantic analysis to determine whether data field 108 is associated with keyword 116.
[0030] If none of the data fields 108 in the data field set 112 are associated with any keyword 116 in the target data term set 114, then the computing device 102 can determine that there is no target data in the code scope 110. If any data field 108 in the data field set 112 is associated with a keyword 116 in the target data term set 114, then the computing device 102 can determine that the data field 108 is target data. Furthermore, based on the core observation mentioned above (i.e., application developers typically use a centralized data management approach to write application code), the computing device 102 can also determine that in addition to the identified target data, there may be unidentified target data in the data field set 112.
[0031] In the example shown in Figure 1, computing device 102 can determine that data field 108-1 is associated with keyword 116-2, and therefore can determine that data field 108-1 is target data. Furthermore, computing device 102 can determine that data fields 108-2 and 108-3 are not associated with any of the keywords 116 in the target data term set 114, and therefore can place data fields 108-2 and 108-3 into the candidate data field set 118 to further identify whether the data fields in the candidate data field set 118 are target data fields.
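A minimal sketch of this screening step is given below. It assumes that "associated with" is approximated by case-insensitive substring matching; the disclosure also allows semantic analysis, which is not shown here. The function names and sample field names are hypothetical.

```python
def normalize(name):
    """Lowercase a name and drop underscores so userName and user_name compare equal."""
    return name.replace("_", "").lower()


def is_associated(field, term_set):
    """Assumed stand-in for keyword matching: any term occurs in the field name."""
    normalized = normalize(field)
    return any(normalize(term) in normalized for term in term_set)


def screen_field_set(fields, term_set):
    """Return (identified target fields, candidate fields) for one data field set.

    If no field matches any term, the whole set is treated as unrelated to
    target data and both lists are empty.
    """
    identified = [f for f in fields if is_associated(f, term_set)]
    if not identified:
        return [], []
    candidates = [f for f in fields if f not in identified]
    return identified, candidates


term_set = {"username", "birthday"}  # hypothetical target data term set
identified, candidates = screen_field_set(
    ["userName", "nickTag", "regionCode"], term_set
)
no_hit = screen_field_set(["frameRate", "vsync"], term_set)
```

In this toy run, `userName` corresponds to data field 108-1, and the two unmatched fields land in the candidate set in the way data fields 108-2 and 108-3 do. Plain substring matching can over-match short terms, so a real matcher would likely need token boundaries or semantic checks.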
[0032] In this way, the fact that target data is typically stored and used centrally in application code 106 can be exploited to single out the data field set 112 that includes target data. Then, among the data fields in the data field set 112 that were not identified as target data, data fields that may be target data can be further identified. This reduces the amount of target data that is missed in application code 106 and improves the comprehensiveness and accuracy of data protection.
[0033] Figure 2 illustrates a flowchart of a method 200 for identifying target data in an application according to some embodiments of the present disclosure. Method 200 can be performed, for example, by a computing device 102 in the environment 100 shown in Figure 1. As shown in Figure 2, at block 202, the computing device can determine a set of data fields based on the code scope in which data fields are stored or used in the application's code, where multiple data fields in the set are centrally stored or used in the code. For example, in the environment 100 shown in Figure 1, the computing device 102 can determine a set of data fields 112 located within code scope 110 based on the code scope in which data fields are stored or used in application code 106, the set of data fields 112 including data fields 108-1, 108-2, and 108-3, which are centrally stored and used within code scope 110. In some embodiments, multiple data fields in the set of data fields 112 are stored or used in the same class, or the multiple data fields in the set of data fields 112 are multiple keys in a data structure containing multiple key-value pairs. For example, code scope 110 could be a Java class in application code 106, and data fields 108-1, 108-2, and 108-3 could be member variables in that Java class; or code scope 110 could be a key-value pair data structure (e.g., a JSON object), and data fields 108-1, 108-2, and 108-3 could be multiple keys in that data structure.
[0034] At block 204, the computing device can determine that there exists a data field in the data field set that is associated with a word in a predetermined target data term set. For example, in environment 100 as shown in Figure 1, computing device 102 can acquire a predetermined target data term set 114, which may include multiple keywords 116, which are keywords of target data that appear frequently in multiple applications. Computing device 102 can determine whether each data field 108 in data field set 112 is associated with any keyword 116 in target data term set 114. For example, computing device 102 can use techniques such as keyword matching or semantic analysis to determine whether data field 108 is associated with keyword 116. In the example shown in environment 100, computing device 102 can determine that data field 108-1 is associated with keyword 116-2, and therefore determine that data field 108-1 is target data.
[0035] At block 206, the computing device can identify target data fields from data fields in the data field set that are not associated with words in the predetermined target data term set. For example, in environment 100 as shown in Figure 1, computing device 102 can determine that data fields 108-2 and 108-3 are not associated with any keywords 116 in the target data term set 114, and therefore can place data fields 108-2 and 108-3 into the candidate data field set 118. Based on the core observation mentioned above (i.e., application developers typically use a centralized data management approach to write application code), computing device 102 can determine that in addition to the identified target data, there may be unidentified target data in data field set 112. Therefore, computing device 102 can further identify whether the candidate data field set 118 includes a target data field.
[0036] This approach leverages the centralized storage and usage of target data within applications to single out a set of data fields that includes target data. It then further identifies potential target data fields among the fields in that set that were not initially identified as such. This reduces the amount of target data missed in the application under test, improving the comprehensiveness and accuracy of data protection.
[0037] In some embodiments, data fields in the data field set that are not associated with words in a predetermined target data term set include candidate data fields. When identifying a target data field from data fields in the data field set that are not associated with words in the predetermined target data term set, the computing device can acquire multiple target data category clusters, which correspond to multiple target data categories, and each of the multiple target data category clusters includes multiple target data keywords. The computing device can then determine that a candidate data field is a target data field by clustering the candidate data field into a first target data category cluster among the multiple target data category clusters. In some embodiments, when determining that a candidate data field is a target data field, the computing device can determine multiple semantic similarities between the candidate data field and multiple target data keywords in the first target data category cluster. The computing device can then cluster the candidate data field into the first target data category cluster based on the multiple semantic similarities.
[0038] Figure 3 illustrates a schematic diagram of the architecture of an example system 300 for identifying target data in an application according to some embodiments of the present disclosure. As shown in Figure 3, system 300 includes a field collection module 302, a target term set determination module 304, a candidate field determination module 306, and a field clustering module 308. In system 300, the field collection module 302 can determine, from application code 310, a superset of data fields that may contain target data; that is, it finds and delineates all locations in application code 310 where target data may exist and extracts the relevant data fields. For example, the field collection module 302 can directly extract the member variables of Java classes, as well as the data access operations performed through get() and put() methods on key-value objects such as SharedPreferences, HashMap, and JsonObject. By extracting all member variables of Java classes and the key strings used when key-value objects access data, all data fields in application code 310 can be collected. For each data field, the name of its Java class or key-value object, the data type of the field (e.g., boolean, integer, character, etc.), and the name of the field itself can be saved. Since target data is typically managed centrally, the field collection module 302 can divide the data field superset into multiple data field sets based on code scope. For example, member variables located in the same Java class can be grouped into the same data field set. Figure 3 shows one of the multiple data field sets, data field set 312, whose fields are centrally managed and used within the same data carrier.
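As a rough illustration of field collection, the sketch below pulls member variables and get/put key strings out of a small Java snippet with regular expressions. This is a deliberate simplification and an assumption of this example: the disclosure does not say how extraction is implemented, and a practical tool would more likely work on a parsed syntax tree or on bytecode. The Java snippet, the regular expressions, and `collect_fields` are all hypothetical.

```python
import re

# Hypothetical Java source used only as input for the sketch.
JAVA_SOURCE = '''
public class UserProfile {
    private String userName;
    private int age = 0;
    protected boolean vip;

    public String getUserName() { return userName; }

    void save(SharedPreferences.Editor editor, JSONObject json) {
        editor.putString("user_token", "abc");
        json.put("lat", 1.0);
    }
}
'''

# Member variables: access modifier, optional static/final, type, name, ';'.
# Types containing spaces (e.g. Map<String, Integer>) are not handled.
MEMBER_RE = re.compile(
    r'^\s*(?:private|protected|public)\s+(?:static\s+)?(?:final\s+)?'
    r'([\w<>\[\],.]+)\s+(\w+)\s*(?:=[^;]*)?;',
    re.MULTILINE,
)
CLASS_RE = re.compile(r'\bclass\s+(\w+)')
# Key-value access: receiver.getXxx("key" ...) or receiver.putXxx("key" ...).
KEY_RE = re.compile(r'(\w+)\.(?:get|put)\w*\(\s*"([^"]+)"')


def collect_fields(source):
    """Return (carrier name, data type or None, field name) records."""
    match = CLASS_RE.search(source)
    class_name = match.group(1) if match else "<unknown>"
    records = [(class_name, t, n) for t, n in MEMBER_RE.findall(source)]
    # The type of a key-value entry is not recovered in this sketch.
    records += [(carrier, None, key) for carrier, key in KEY_RE.findall(source)]
    return records


collected = collect_fields(JAVA_SOURCE)
```

Each record keeps the three pieces of information mentioned above: the carrier (class or key-value object), the data type where it is known, and the field name.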
[0039] Because the types of target data included in different applications may vary with their business logic and are often updated as the business iterates, it is difficult to identify target data comprehensively by relying on keyword matching alone. However, some common target data, such as usernames and birthdays, may appear across different applications. Based on this, system 300 can check whether the data field set corresponding to a centralized data carrier contains such common target data, which allows it to efficiently determine whether the data carrier is used to store target data and to discard noisy data that is irrelevant to the target data.
[0040] As shown in Figure 3, after determining the data field set 312, the system 300 can further utilize techniques such as keyword matching or semantic analysis to preliminarily determine whether the data field set 312 includes known common target data. If the preliminary screening determines that the data field set 312 does not include any common target data, the system 300 can identify all fields in the data field set 312 as noise fields unrelated to the target data. If the preliminary screening determines that the data field set 312 includes known target data, the system 300 can identify fields in the data field set 312 that have not been identified as target data as candidate data fields for further analysis to determine whether the candidate data fields are target data.
[0041] To determine common target data, in system 300, the target term set determination module 304 can acquire the application code 314-1, 314-2, ..., 314-N (collectively referred to as application code 314) of multiple applications. The target term set determination module 304 can also acquire multiple predetermined common target data keywords. The module can extract data carriers containing common target data keywords from the application code 314 and obtain their member fields. It can then preprocess these fields using methods such as stop word removal and word segmentation, and select the fields that occur with high frequency through frequency analysis. Finally, the module can determine which of the selected fields belong to the target data (e.g., through semantic analysis, large language models, or manual screening) to form a target data term set 316. The keywords in the target data term set 316 are thus the common target data keywords extracted from the application code 314.
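The preprocessing and frequency analysis might look roughly like the sketch below. The stop-word list, the `min_apps` cutoff, the choice to count a field at most once per application, and all sample field names are assumptions of this example. The final confirmation step (semantic analysis, a large language model, or manual screening) is intentionally left out.

```python
import re
from collections import Counter

STOP_WORDS = {"m", "the", "of", "get", "set"}  # hypothetical stop-word list


def tokenize(field_name):
    """Split camelCase or snake_case names into lowercase tokens."""
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', field_name)
    return [p.lower() for p in parts]


def normalize_field(name):
    """Segment the name, drop stop words, and rejoin with underscores."""
    return "_".join(t for t in tokenize(name) if t not in STOP_WORDS)


def frequent_fields(carriers_per_app, seed_keywords, min_apps=2):
    """Keep fields that co-occur with a seed keyword in at least min_apps apps."""
    counts = Counter()
    for carriers in carriers_per_app:
        seen = set()
        for fields in carriers:
            normalized = {normalize_field(f) for f in fields}
            normalized.discard("")
            if normalized & seed_keywords:  # carrier holds a common keyword
                seen |= normalized
        counts.update(seen)  # each field counts once per application
    return {field for field, n in counts.items() if n >= min_apps}


apps = [
    [["userName", "birthday", "mPhone"], ["frameRate", "vsync"]],
    [["user_name", "phone", "avatarUrl"]],
    [["birthday", "phone", "themeColor"]],
]
seeds = {"user_name", "birthday"}
candidate_terms = frequent_fields(apps, seeds)
```

In this toy input, `phone` never appears in the seed keywords but surfaces because it repeatedly shares a carrier with them, which is the kind of field the frequency analysis is meant to find.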
[0042] In system 300, the candidate field determination module 306 can compare the fields in the data field set 312 with the keywords in the target data term set 316. If there is no field in the data field set 312 that matches any keyword in the target data term set 316, then all fields in the data field set 312 will be identified as noisy data. If there is a field in the data field set 312 that matches any keyword in the target data term set 316, then the data field set 312 will be retained for further identification. In this case, the fields in the data field set 312 that do not match any keywords in the target data term set 316 will be identified as candidate data fields (e.g., candidate data field 318).
[0043] In different application code, data fields may differ due to different naming conventions of developers or different specific business logics. However, their core semantics always represent a certain category of target data, so these data fields will remain semantically consistent. Based on this, system 300 can determine whether a candidate data field 318 is target data and to which category of target data it belongs based on the semantics of the candidate data field 318.
[0044] In system 300, each target data category cluster (320-1, 320-2, ..., 320-N, collectively referred to as target data category clusters 320) corresponds to a category of target data, and each target data category cluster includes multiple keywords related to the corresponding category. For example, in a scenario where the target data is platform business data, target data category cluster 320-1 may correspond to the category "geographical location," and target data category cluster 320-1 may include keywords such as "location," "latitude," "longitude," and "lat," which are keywords associated with "geographical location."
[0045] In system 300, the field clustering module 308 can determine the semantic similarity between candidate data field 318 and multiple keywords in multiple target data category clusters 320, and then determine which target data category cluster to cluster the candidate data field 318 into based on the determined semantic similarity. If the candidate data field 318 can be clustered into one cluster of target data category clusters 320, the field clustering module 308 can determine that the candidate data field 318 is target data, and can determine the target data category corresponding to the target data category cluster to which the candidate data field 318 is clustered as the target data category 322 of the candidate data field 318. If the candidate data field 318 cannot be clustered into any cluster of target data category clusters 320, the field clustering module 308 can determine that the candidate data field 318 is not target data, and can add the candidate data field 318 to the non-target data term set 324. In this way, when determining whether a data field is target data, it can first be compared with the keywords in the non-target data term set 324. If the data field matches any keyword in the non-target data term set 324, it can be directly determined that it is not the target data, thus saving computing resources and time.
[0046] In some embodiments, if candidate data field 318 cannot be clustered into any of the target data category clusters 320, system 300 can place candidate data field 318 into an unknown field pool. Fields in the unknown field pool are those that have not yet been identified as target data but may still be target data. System 300 can then perform semantic clustering on the fields in the unknown field pool to group them into multiple unknown clusters. In some embodiments, system 300 can generate word embeddings for the fields in the unknown field pool and then apply a clustering algorithm such as DBSCAN to these word embeddings, thereby grouping the fields into multiple unknown clusters. After clustering, system 300 can receive user inputs for the unknown clusters. Each user input can indicate whether the fields in the corresponding unknown cluster are target data and, if so, the target data category corresponding to that unknown cluster. For an unknown cluster whose user input indicates that its fields are not target data, system 300 can place the fields of that cluster into the non-target data term set 324. In this way, based on user input, unknown clusters can be turned into new target data category clusters or merged into existing target data category clusters. This makes it possible to further determine whether fields not clustered into any of the target data category clusters 320 are target data, thereby improving the comprehensiveness of target data identification. Furthermore, since the fields in the unknown field pool are grouped into clusters, users only need to review each cluster rather than each field individually, which improves the efficiency of user review.
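The unknown-pool workflow can be sketched as follows. Several things here are assumptions made so the example stays self-contained: hard-coded 2-D vectors stand in for real word embeddings, DBSCAN is written out in plain Python with Euclidean distance, and the `eps`, `min_pts`, field names, and reviewer decisions are invented for illustration.

```python
import math

# Toy 2-D vectors standing in for word embeddings (hypothetical values).
EMBEDDINGS = {
    "orderNo": (0.0, 0.0), "orderSerial": (0.1, 0.0), "tradeId": (0.0, 0.1),
    "btnColor": (5.0, 5.0), "btnTint": (5.1, 5.0),
    "vsyncFlag": (9.0, 0.0),
}


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    labels = {}
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        near = neighbors(i)
        if len(near) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in near if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster  # noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster
            near_j = neighbors(j)
            if len(near_j) >= min_pts:  # core point: keep expanding
                queue.extend(near_j)
    return [labels[i] for i in range(len(points))]


fields = list(EMBEDDINGS)
labels = dbscan([EMBEDDINGS[f] for f in fields], eps=0.5, min_pts=2)

unknown_clusters = {}
for name, label in zip(fields, labels):
    if label != -1:
        unknown_clusters.setdefault(label, []).append(name)

# Hypothetical reviewer input: a category name, or None for "not target data".
decisions = {0: "order identifier", 1: None}
category_clusters, non_target_terms = {}, set()
for cluster_id, members in unknown_clusters.items():
    if cluster_id not in decisions:
        continue
    category = decisions[cluster_id]
    if category is None:
        non_target_terms.update(members)
    else:
        category_clusters.setdefault(category, []).extend(members)
```

The reviewer makes two decisions for five clustered fields, which is the efficiency gain the paragraph describes. Fields labelled as noise (here `vsyncFlag`) are simply left unresolved in this sketch.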
[0047] In this way, the initial screening process can quickly locate candidate data fields and filter out noisy data, thereby reducing the number of fields that need further identification and saving computational resources. Furthermore, by clustering candidate data fields into predetermined target data category clusters based on semantic similarity, the accuracy of target data identification can be improved, thus enhancing the comprehensiveness of target data identification.
[0048] In some embodiments, when clustering candidate data fields into a first target data category cluster, the computing device may calculate a first average of multiple semantic similarities as a first evaluation value, the first evaluation value indicating the probability that the candidate data field belongs to the first target data category cluster. The computing device may also calculate a second evaluation value indicating the probability that the candidate data field belongs to a second target data category cluster among the multiple target data category clusters. Then, in response to the first evaluation value being greater than the second evaluation value, the computing device may cluster the candidate data field into the first target data category cluster.
[0049] In some embodiments, in response to a first evaluation value being greater than a second evaluation value, the computing device may obtain a semantic similarity threshold for a first target data category cluster. Then, in response to the first evaluation value being greater than the second evaluation value and the first evaluation value satisfying the semantic similarity threshold, the computing device may cluster the candidate data fields into the first target data category cluster.
[0050] In some embodiments, the computing device may determine that a first evaluation value is greater than a second evaluation value and that the first evaluation value does not meet a semantic similarity threshold. The computing device may receive user input for a candidate data field, indicating that the candidate data field is not the target data field. Then, in response to receiving the user input, the computing device may add the candidate data field to the non-target data field vocabulary.
[0051] In some embodiments, the computing device may acquire a second candidate data field. In response to the second candidate data field being associated with a data field in a non-target data field vocabulary, the computing device may determine the second candidate data field as a non-target data field.
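A minimal sketch of the non-target data field vocabulary follows. The disclosure does not specify what "associated with" means here; this sketch assumes the simplest reading, an exact match after normalizing naming conventions, and the class and function names are hypothetical.

```python
import re

def normalize(field_name):
    # Split camelCase and snake_case/kebab-case names into lowercase tokens.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", field_name)
    tokens = re.split(r"[^A-Za-z0-9]+", spaced)
    return " ".join(t.lower() for t in tokens if t)

class NonTargetVocabulary:
    """Remembers fields that a user has marked as non-target data."""

    def __init__(self):
        self._entries = set()

    def add(self, field_name):
        self._entries.add(normalize(field_name))

    def contains(self, field_name):
        # Assumption: "associated" = same tokens after normalization.
        return normalize(field_name) in self._entries

vocabulary = NonTargetVocabulary()
vocabulary.add("retry_count")  # first candidate, rejected by the user

# A second candidate with a different naming style is filtered out
# without asking the user again.
second_is_non_target = vocabulary.contains("retryCount")
third_is_non_target = vocabulary.contains("passportNo")
```

A production system could replace the exact match with a semantic-similarity check, at the cost of occasionally discarding a genuine target field.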
[0052] Figure 4 illustrates a schematic diagram of an example field clustering module 400 for determining target data categories of candidate data fields according to some embodiments of the present disclosure. As shown in Figure 4, the field clustering module 400 can obtain candidate data fields 402 (e.g., candidate data field 318 in Figure 3) and target data category clusters 404-1, 404-2, ..., 404-N (collectively referred to as target data category clusters 404, e.g., target data category cluster 320 in Figure 3). Each target data category cluster 404 corresponds to a target data category and includes multiple keywords associated with that target data category. For example, target data category cluster 404-1 includes keywords 406-1, 406-2, ..., 406-N (collectively referred to as keywords 406).
[0053] For each target data category cluster 404, the field clustering module 400 can calculate the semantic similarity between the candidate data field 402 and each keyword in the target data category cluster 404. For example, for the target data category cluster 404-1, the field clustering module 400 can calculate the semantic similarity 408-1 between the candidate data field 402 and the keyword 406-1, the semantic similarity 408-2 between the candidate data field 402 and the keyword 406-2, ..., and the semantic similarity 408-N between the candidate data field 402 and the keyword 406-N (collectively referred to as semantic similarity 408). Then, the field clustering module 400 can calculate an evaluation value 410-1 for the target data category cluster 404-1 based on the semantic similarities 408-1 to 408-N. The evaluation value 410-1 indicates the probability that the candidate data field 402 can be clustered into the target data category cluster 404-1. In some embodiments, the field clustering module 400 can calculate the average of semantic similarities 408-1 to 408-N as an evaluation value 410-1. Similarly, the field clustering module 400 can also calculate an evaluation value 410-2 for candidate data field 402 to be clustered into target data category cluster 404-2, and an evaluation value 410-N for candidate data field 402 to be clustered into target data category cluster 404-N.
[0054] After determining the evaluation values 410-1 to 410-N, the field clustering module 400 can determine the maximum value among these evaluation values, namely, the maximum evaluation value 412. Then, the field clustering module 400 can determine the target data category cluster 414 corresponding to the maximum evaluation value 412. For example, if the maximum evaluation value 412 is evaluation value 410-1, then the target data category cluster 414 is target data category cluster 404-1.
[0055] In some embodiments, each target data category cluster 404 has a corresponding similarity threshold. After determining the maximum evaluation value 412 and the corresponding target data category cluster 414, the field clustering module 400 can obtain the similarity threshold 416 of the target data category cluster 414 and compare it with the maximum evaluation value 412. If the maximum evaluation value 412 meets the similarity threshold 416, the field clustering module 400 can determine that the candidate data field 402 can be clustered into the target data category cluster 414, thereby determining that the candidate data field 402 is target data and that its category is the category corresponding to the target data category cluster 414. If the maximum evaluation value 412 does not meet the similarity threshold 416, then although the target data category cluster 414 is the most likely cluster for the candidate data field 402, the candidate data field 402 is not sufficiently similar in meaning to the keywords in the target data category cluster 414, and therefore the candidate data field 402 cannot be clustered into the target data category cluster 414. In this scenario, as described above, the candidate data field 402 and other fields that cannot be clustered into any target data category cluster can be placed into an unknown field pool. Fields in the unknown field pool can be clustered into multiple unknown clusters based on their semantics. The computing device can then receive user input for the unknown clusters and, based on the received user input, identify fields in the unknown clusters as target data, or identify fields in the unknown clusters as non-target data and place them into a non-target data word set.
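The flow of Figure 4 (per-cluster average, maximum evaluation value, per-cluster threshold) can be sketched as follows. The similarity scores are a hypothetical lookup table rather than the output of a real model, the threshold values are invented, and "meets the threshold" is assumed to mean greater than or equal to.

```python
from statistics import mean

def assign_cluster(candidate, clusters, thresholds, similarity):
    """Return (category or None, maximum evaluation value)."""
    # Evaluation value per cluster: average similarity to its keywords.
    evaluations = {
        category: mean([similarity(candidate, kw) for kw in keywords])
        for category, keywords in clusters.items()
    }
    best = max(evaluations, key=evaluations.get)
    if evaluations[best] >= thresholds[best]:
        return best, evaluations[best]
    # Best cluster is still not similar enough: the caller would place
    # the candidate into the unknown field pool.
    return None, evaluations[best]

# Hypothetical semantic similarities between fields and keywords.
TOY_SCORES = {
    ("home_addr", "address"): 0.75,
    ("home_addr", "city"): 0.5,
    ("home_addr", "phone"): 0.25,
}

def toy_similarity(field, keyword):
    return TOY_SCORES.get((field, keyword), 0.0)

clusters = {"location": ["address", "city"], "contact": ["phone", "email"]}
thresholds = {"location": 0.6, "contact": 0.6}

accepted = assign_cluster("home_addr", clusters, thresholds, toy_similarity)
rejected = assign_cluster("retry_count", clusters, thresholds, toy_similarity)
```

Here `home_addr` averages 0.625 against the location keywords and is accepted, while `retry_count` scores 0.0 everywhere and falls through to the unknown field pool.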
[0056] In this way, even if the wording of the candidate data field 402 does not literally match the keywords in the target data category cluster 414, for example because of developers' differing naming habits or the diversity of natural language expressions, the field clustering module 400 can still cluster the candidate data field 402 into the target data category cluster that is most semantically similar to it, thereby identifying the target data category to which the candidate data field 402 belongs. This improves the comprehensiveness of target data identification.
[0057] In some embodiments, multiple target data keywords in a first target data category cluster include the first target data keyword, and when determining semantic similarity, the computing device can determine a relevance score between the candidate data field and the first target data keyword. The computing device can also determine a word embedding similarity score between the candidate data field and the first target data keyword. Then, the computing device can determine a first similarity between the candidate data field and the first target data keyword based on the relevance score and the word embedding similarity score.
[0058] In some embodiments, the computing device may obtain a first weight for the relevance score and a second weight for the word embedding similarity score. The computing device may then determine a first similarity based on the relevance score, the first weight, the word embedding similarity score, and the second weight.
[0059] Figure 5 illustrates a schematic diagram of an example 500 for determining the similarity between a candidate data field and keywords in a target data category cluster according to some embodiments of the present disclosure. As shown in Figure 5, example 500 includes a target data category cluster 502 and a candidate data field 506, wherein the target data category cluster 502 includes keyword 504. For simplicity, other keywords in the target data category cluster 502 are not shown in Figure 5. To determine the semantic similarity between candidate data field 506 and keyword 504, a computing device may determine a relevance score 512 between candidate data field 506 and keyword 504. The relevance score 512 indicates the semantic relevance between candidate data field 506 and keyword 504, which considers not only explicit semantic similarity (e.g., "location" is similar to "address"), but also implicit relevance (e.g., "location" is a superordinate concept of "city," therefore they are related). In some embodiments, a natural language processing model may be used to determine the relevance score 512 between candidate data field 506 and keyword 504. For example, the Relatedness feature of the ConceptNet model can quantify the relevance between two concepts and capture direct relationships in human knowledge, such as specific semantic connections like "IsA" and "UsedFor". Therefore, the computing device can use the Relatedness feature of the ConceptNet model to determine the relevance score 512 between candidate data field 506 and keyword 504.
[0060] Furthermore, in Example 500, the computing device can also generate a word embedding 508 for candidate data field 506 based on candidate data field 506, and a word embedding 510 for keyword 504 based on keyword 504. For example, the computing device can use a pre-trained word embedding model (e.g., BERT, Word2Vec, etc.) to generate word embeddings 508 and 510. The computing device can then calculate a word embedding similarity score 516 between word embedding 508 and word embedding 510. In some embodiments, the computing device can calculate the cosine similarity between word embedding 508 and word embedding 510 as the word embedding similarity score 516.
[0061] In Example 500, the computing device can obtain a weight 514 for the relevance score 512 and a weight 518 for the word embedding similarity score 516. Weights 514 and 518 can be used to control the attention given to the relevance score 512 and the word embedding similarity score 516 when calculating semantic similarity. Weights 514 and 518 can be adjusted based on experimental results or expert experience. The computing device can then calculate a semantic similarity 520 between candidate data field 506 and keyword 504 (e.g., by weighted summation) based on the relevance score 512, the weight 514 for the relevance score 512, the word embedding similarity score 516, and the weight 518 for the word embedding similarity score 516.
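Example 500 can be sketched as a cosine similarity between two embeddings followed by a weighted sum. The vectors, the relevance score, and the weights below are illustrative assumptions; in practice the embeddings would come from a pre-trained model and the relevance score from a relatedness model, neither of which is called here.

```python
import math

def cosine_similarity(u, v):
    # Assumes non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def semantic_similarity(relevance_score, embedding_score, w_rel, w_emb):
    # Weighted summation of the two scores (weights 514 and 518).
    return w_rel * relevance_score + w_emb * embedding_score

# Hypothetical inputs for one candidate field and one keyword.
field_embedding = [3.0, 4.0]     # stands in for word embedding 508
keyword_embedding = [4.0, 3.0]   # stands in for word embedding 510
relevance_score = 0.8            # stands in for relevance score 512

embedding_score = cosine_similarity(field_embedding, keyword_embedding)
similarity = semantic_similarity(relevance_score, embedding_score,
                                 w_rel=0.6, w_emb=0.4)
```

With these values the embedding score is 0.96 and the combined similarity is about 0.864. Choosing weights that sum to 1 keeps the result on the same scale as the inputs, although the disclosure does not require this.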
[0062] By combining the relevance between keywords and candidate data fields with the similarity between word embeddings, it is possible to simultaneously capture the association between concepts and semantic patterns based on vectorized representations, thereby improving the accuracy of the calculated semantic similarity between keywords and candidate data fields.
[0063] Figure 6 shows a block diagram of an apparatus 600 for identifying target data in an application according to some embodiments of the present disclosure. As shown in Figure 6, the apparatus 600 includes a data field set partitioning module 602, configured to determine a data field set based on the code scope in which the data fields are stored or used in the application's code, wherein multiple data fields in the data field set are centrally stored or used in the code. The apparatus 600 also includes a data field set determination module 604, configured to determine that there are data fields in the data field set associated with words in a predetermined target data word set. Furthermore, the apparatus 600 includes a target data field identification module 606, configured to identify target data fields from data fields in the data field set that are not associated with words in the predetermined target data word set.
[0064] In some embodiments, multiple data fields in a data field set are stored or used in the same class, or multiple data fields in a data field set are multiple keys in a data structure containing multiple key-value pairs.
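A minimal sketch of the three modules applied to Python source code follows, using the standard-library `ast` module. It assumes that a "code scope" is either a class (fields assigned through `self`) or a dictionary literal (string keys), and that a field is "associated with" a target word when one of its underscore-separated tokens appears in the word set. The sample code and the word set are hypothetical.

```python
import ast

TARGET_WORDS = {"email", "phone", "password"}  # hypothetical word set

def field_sets(source):
    """Group fields that are stored together in one class or one dict."""
    sets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            names = [
                sub.attr for sub in ast.walk(node)
                if isinstance(sub, ast.Attribute)
                and isinstance(sub.ctx, ast.Store)
                and isinstance(sub.value, ast.Name)
                and sub.value.id == "self"
            ]
            if names:
                sets.append(list(dict.fromkeys(names)))  # de-duplicate
        elif isinstance(node, ast.Dict):
            keys = [k.value for k in node.keys
                    if isinstance(k, ast.Constant) and isinstance(k.value, str)]
            if keys:
                sets.append(keys)
    return sets

def candidate_fields(field_set, target_words):
    """Unmatched fields from a set that contains at least one match."""
    def is_known(name):
        return any(tok in target_words for tok in name.lower().split("_"))
    if not any(is_known(f) for f in field_set):
        return []  # set shows no sign of holding target data
    return [f for f in field_set if not is_known(f)]

SAMPLE = '''
class UserProfile:
    def __init__(self, email, addr, nick):
        self.email = email
        self.home_addr = addr
        self.nick = nick

class RetryPolicy:
    def __init__(self):
        self.max_retries = 3

payload = {"phone": "123", "geo_loc": "x"}
'''

candidates = sorted(
    field
    for fs in field_sets(SAMPLE)
    for field in candidate_fields(fs, TARGET_WORDS)
)
```

Only `home_addr`, `nick`, and `geo_loc` become candidates, because each sits next to a known target field; `max_retries` is dropped early because nothing in its class matches the word set.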
[0065] In some embodiments, the data fields in the data field set that are not associated with words in the predetermined target data word set include candidate data fields, and the target data field identification module 606 includes: a target data category cluster acquisition module configured to acquire a plurality of target data category clusters, the plurality of target data category clusters corresponding to a plurality of target data categories, and each of the plurality of target data category clusters including a plurality of target data keywords; and a clustering module configured to determine that a candidate data field is a target data field by clustering the candidate data field into a first target data category cluster in the plurality of target data category clusters.
[0066] In some embodiments, the clustering module includes: a semantic similarity determination module configured to determine multiple semantic similarities between a candidate data field and multiple target data keywords in a first target data category cluster; and a semantic similarity usage module configured to cluster the candidate data field into the first target data category cluster based on the multiple semantic similarities.
[0067] In some embodiments, the plurality of target data keywords in the first target data category cluster include the first target data keyword, and the semantic similarity determination module includes: a relevance score determination module configured to determine the relevance score between the candidate data field and the first target data keyword; a word embedding similarity determination module configured to determine the word embedding similarity score between the candidate data field and the first target data keyword; and a relevance score usage module configured to determine the first similarity between the candidate data field and the first target data keyword based on the relevance score and the word embedding similarity score.
[0068] In some embodiments, the relevance score usage module includes: a weight acquisition module configured to acquire a first weight for the relevance score and a second weight for the word embedding similarity score; and a weight usage module configured to determine a first similarity based on the relevance score, the first weight, the word embedding similarity score, and the second weight.
[0069] In some embodiments, the semantic similarity usage module includes: a first evaluation value calculation module configured to calculate a first average of a plurality of semantic similarities as a first evaluation value, the first evaluation value indicating the probability that a candidate data field belongs to a first target data category cluster; a second evaluation value calculation module configured to calculate a second evaluation value indicating the probability that a candidate data field belongs to a second target data category cluster among a plurality of target data category clusters; and an evaluation value comparison module configured to cluster the candidate data field into the first target data category cluster in response to the first evaluation value being greater than the second evaluation value.
[0070] In some embodiments, the evaluation value comparison module includes: a similarity threshold acquisition module configured to acquire a semantic similarity threshold for a first target data category cluster; and a first similarity threshold usage module configured to cluster candidate data fields into the first target data category cluster in response to a first evaluation value being greater than a second evaluation value and the first evaluation value satisfying the semantic similarity threshold.
[0071] In some embodiments, the apparatus 600 further includes: a second similarity threshold usage module configured to determine that a first evaluation value is greater than a second evaluation value and that the first evaluation value does not meet a semantic similarity threshold; a user input receiving module configured to receive user input for a candidate data field, the user input indicating that the candidate data field is not a target data field; and a user input usage module configured to add the candidate data field to a non-target data field vocabulary in response to receiving the user input.
[0072] In some embodiments, the apparatus 600 further includes: a candidate data field acquisition module configured to acquire a second candidate data field; and a non-target data field determination module configured to determine the second candidate data field as a non-target data field in response to the association of the second candidate data field with a data field in the non-target data field vocabulary.
[0073] It is understood that by utilizing the apparatus 600 of this disclosure, at least one of the many advantages achievable by the methods or processes described above can be realized. For example, it is possible to utilize a management approach where target data is typically stored and used centrally in the application to filter out a set of data fields that include target data, and then further identify data fields that may be target data from the data fields in this set that were not identified as target data. This reduces the amount of target data missed in the application under test, improving the comprehensiveness and accuracy of data protection.
[0074] Figure 7 shows a block diagram of a device 700 capable of implementing various embodiments of the present disclosure. Device 700 may, for example, be a computing device 102 as shown in Figure 1. As shown in Figure 7, device 700 includes a central processing unit (CPU) and/or a graphics processing unit (GPU) 701, which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) 702 or loaded from storage unit 708 into random access memory (RAM) 703. Various programs and data required for the operation of device 700 may also be stored in RAM 703. The CPU/GPU 701, ROM 702, and RAM 703 are interconnected via bus 704. Input/output (I/O) interface 705 is also connected to bus 704. Although not shown in Figure 7, device 700 may also include a coprocessor.
[0075] Multiple components in device 700 are connected to I/O interface 705, including: input unit 706, such as a keyboard or mouse; output unit 707, such as various types of monitors and speakers; storage unit 708, such as a magnetic disk or optical disk; and communication unit 709, such as a network card, modem, or wireless transceiver. Communication unit 709 allows device 700 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0076] The various methods or processes described above can be executed by CPU/GPU 701. For example, in some embodiments, the methods can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program can be loaded and/or installed on device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by CPU/GPU 701, one or more steps or actions in the methods or processes described above can be performed.
[0077] In some embodiments, the methods and processes described above can be implemented as a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of this disclosure.
[0078] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination thereof. The computer-readable storage media used herein are not to be construed as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
[0079] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing/processing devices, or downloaded via a network, such as the Internet, a local area network (LAN), a wide area network (WAN), and/or a wireless network, to an external computer or external storage device. The network may include copper cables, fiber optic cables, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in the computer-readable storage media in the respective computing/processing device.
[0080] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages and conventional procedural programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry can then execute the computer-readable program instructions to implement various aspects of this disclosure.
[0081] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
[0082] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions/actions specified in one or more blocks of a flowchart and/or block diagram.
[0083] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of devices, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0084] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical applications, or technical improvements to the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A method for identifying target data in an application, comprising: determining a data field set based on a code scope in which data fields are stored or used in code of the application, wherein multiple data fields in the data field set are centrally stored or used in the code; determining that there exists a data field in the data field set that is associated with a word in a predetermined target data word set; and identifying a target data field from data fields in the data field set that are not associated with words in the predetermined target data word set.
2. The method according to claim 1, wherein the plurality of data fields in the data field set are stored or used in the same class, or the plurality of data fields in the data field set are multiple keys in a data structure containing multiple key-value pairs.
3. The method according to claim 1, wherein the data fields in the data field set that are not associated with words in the predetermined target data word set include a candidate data field, and identifying the target data field from the data fields in the data field set that are not associated with words in the predetermined target data word set comprises: obtaining a plurality of target data category clusters, the plurality of target data category clusters corresponding to a plurality of target data categories, and each of the plurality of target data category clusters including a plurality of target data keywords; and determining that the candidate data field is a target data field by clustering the candidate data field into a first target data category cluster among the plurality of target data category clusters.
4. The method according to claim 3, wherein determining that the candidate data field is a target data field by clustering the candidate data field into the first target data category cluster among the plurality of target data category clusters comprises: determining a plurality of semantic similarities between the candidate data field and the plurality of target data keywords in the first target data category cluster; and clustering the candidate data field into the first target data category cluster based on the plurality of semantic similarities.
5. The method according to claim 4, wherein the plurality of target data keywords in the first target data category cluster include a first target data keyword, and determining the plurality of semantic similarities between the candidate data field and the plurality of target data keywords in the first target data category cluster comprises: determining a relevance score between the candidate data field and the first target data keyword; determining a word embedding similarity score between the candidate data field and the first target data keyword; and determining a first similarity between the candidate data field and the first target data keyword based on the relevance score and the word embedding similarity score.
6. The method according to claim 5, wherein determining the first similarity between the candidate data field and the first target data keyword based on the relevance score and the word embedding similarity score comprises: obtaining a first weight for the relevance score and a second weight for the word embedding similarity score; and determining the first similarity based on the relevance score, the first weight, the word embedding similarity score, and the second weight.
7. The method according to claim 4, wherein clustering the candidate data field into the first target data category cluster based on the plurality of semantic similarities comprises: calculating a first average of the plurality of semantic similarities as a first evaluation value, the first evaluation value indicating the probability that the candidate data field belongs to the first target data category cluster; calculating a second evaluation value that indicates the probability that the candidate data field belongs to a second target data category cluster among the plurality of target data category clusters; and in response to the first evaluation value being greater than the second evaluation value, clustering the candidate data field into the first target data category cluster.
8. The method according to claim 7, wherein clustering the candidate data field into the first target data category cluster in response to the first evaluation value being greater than the second evaluation value comprises: obtaining a semantic similarity threshold for the first target data category cluster; and in response to the first evaluation value being greater than the second evaluation value and the first evaluation value satisfying the semantic similarity threshold, clustering the candidate data field into the first target data category cluster.
9. The method according to claim 8, further comprising: determining that the first evaluation value is greater than the second evaluation value and that the first evaluation value does not satisfy the semantic similarity threshold; receiving user input for the candidate data field, the user input indicating that the candidate data field is not a target data field; and in response to receiving the user input, adding the candidate data field to a non-target data field vocabulary.
10. The method according to claim 9, further comprising: acquiring a second candidate data field; and in response to the second candidate data field being associated with a data field in the non-target data field vocabulary, determining the second candidate data field as a non-target data field.
11. An apparatus for identifying target data in an application, comprising: a data field set partitioning module configured to determine a data field set based on a code scope in which data fields are stored or used in code of the application, wherein multiple data fields in the data field set are centrally stored or used in the code; a data field set determination module configured to determine that there exists a data field in the data field set that is associated with a word in a predetermined target data word set; and a target data field identification module configured to identify a target data field from data fields in the data field set that are not associated with words in the predetermined target data word set.
12. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1 to 10.
13. A computer program product tangibly stored on a non-transitory computer-readable medium and comprising machine-executable instructions that, when executed, cause a machine to perform the method according to any one of claims 1 to 10.