System and method for classifying software vulnerabilities
By introducing an automated software vulnerability scanning and analysis system, combined with machine learning technology, we have achieved efficient identification, classification, and remediation of software vulnerabilities. This solves the problem of low efficiency in manual analysis in existing technologies and improves the security and efficiency of software development.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ACCENTURE GLOBAL SOLUTIONS LTD
- Filing Date
- 2021-09-10
- Publication Date
- 2026-06-26
AI Technical Summary
The current process of scanning, analyzing, and remediating software vulnerabilities is slow and reliant on manual labor, resulting in wasted resources and low efficiency, and there is a shortage of cybersecurity experts.
A system and methodology are employed that utilizes components such as a scanning engine, vulnerability reporting engine, extraction engine, formatting engine, vector engine, classification engine, and output engine, combined with machine learning and automatic classification technologies, to automatically identify, classify, and remediate security vulnerabilities in software applications, providing intelligent analysis and reporting.
It improves the efficiency of software vulnerability scanning and analysis, reduces false positives and duplications, shortens analysis time, enhances security and development efficiency, and ensures the accuracy and consistency of vulnerability remediation.
Figure CN116209997B_ABST
Description
Technical Field
[0001] This disclosure relates generally to the field of software security, and in particular to methods and systems for scanning and remediating security vulnerabilities in software applications during the development of software applications.
Background Technology
[0002] In the software and application development process, the scanning, analysis, and remediation of security vulnerabilities are typically slow and manual. Basic techniques and tools in the field are known for scanning and identifying vulnerabilities. However, experts are needed to interpret the results, highlight the most relevant vulnerabilities, and recommend remediation. This often takes a significant amount of time, and such cybersecurity experts are in short supply. Software developers desire a faster process that can scale to meet demand while maintaining the quality of expert analysis. Intelligence is needed to scan software applications more effectively during the development phase.
Attached Figure Description
[0003] The foregoing and other objects, features, and advantages of embodiments of this disclosure will become apparent from the following more detailed description of the embodiments illustrated in the accompanying drawings, wherein reference numerals refer to the same parts throughout the various views. The drawings are not necessarily drawn to scale, but rather to emphasize the principles illustrating this disclosure.
[0004] Figure 1 is a block diagram illustrating an example of the architecture of an exemplary system according to certain embodiments of the present disclosure.
[0005] Figure 2 is a block diagram illustrating an embodiment of the scanning engine and vulnerability reporting engine of the exemplary system shown in Figure 1, according to certain embodiments of the present disclosure.
[0006] Figure 3 is a flowchart illustrating an example of a method for implementing an exemplary extraction engine in the system shown in Figure 1, according to certain embodiments of the present disclosure.
[0007] Figure 4 is a block diagram illustrating an embodiment of the formatting engine and vector engine of the exemplary system shown in Figure 1, according to certain embodiments of the present disclosure.
[0008] Figure 5 is a block diagram illustrating an embodiment of the vector engine, classification engine, and output engine components of the exemplary system shown in Figure 1, according to certain embodiments of the present disclosure.
[0009] Figure 6 is a block diagram illustrating embodiments of the components of various engines in the exemplary system shown in Figure 1, according to certain embodiments of the present disclosure.
[0010] Figure 7 is a diagram illustrating an example of an automatic classification method for implementing an exemplary system according to certain embodiments of the present disclosure.
[0011] Figures 8(a)-(b) are graphs illustrating examples of scan results implemented by an exemplary system according to certain embodiments of the present disclosure.
[0012] Figure 9 is a block diagram illustrating an example of a method implemented by an exemplary system according to certain embodiments of the present disclosure.
[0013] Figure 10 is a flowchart illustrating an example of a method implemented by an exemplary system according to certain embodiments of the present disclosure.
[0014] Figure 11 shows an example automatic classification strategy (ATP) rule base, along with example steps for generating an ATP and the corresponding automatic classification methods (ATMs) for the ATP.
[0015] Figure 12 shows an example mapping between ATPs and vulnerabilities.
[0016] Figure 13 shows an example process for the Improved Quality (IQ) guideline generation of Figure 11.
Detailed Implementation
[0017] Reference will now be made in detail to embodiments of this disclosure, examples of which are shown in the accompanying drawings.
[0018] This disclosure can be implemented in various forms, including systems, methods, computer-readable media, or Platform as a Service (PaaS) products for scanning and remediating security vulnerabilities in software applications. In some examples, the technical advantages of this disclosure described herein may include the identification of security vulnerabilities in software applications scanned during their development phase. Another technical advantage may be the reduction of false positives and duplications in scan results. Yet another technical advantage may be the analysis of the root causes of vulnerabilities. Another technical advantage may include providing additional information to human security analysts to narrow their analysis and thus improve their efficiency. Technical advantages may include the classification of identified security vulnerabilities, and the automatic classification based on machine learning. In some examples, technical advantages may include the translation or interpretation of scan results to determine remedies for security vulnerabilities identified by the scan. In one example, technical advantages may include presenting recommendations to software developers via a user interface or scan report to enable secure development of the software application. Therefore, exemplary benefits of this disclosure may include reducing the time security analysts spend assessing vulnerabilities and increasing confidence in the security of software applications under development. While there are inefficient techniques for providing security analysts with basic scan results of detected vulnerabilities, the technical advantages of this disclosure can include the evaluation of scan results and the identification of actual vulnerabilities and false positives.
[0019] Figure 1 illustrates an embodiment of such a system 100, which can be implemented in many different ways using various components and modules, including any combination of the circuits described herein, such as hardware, software, middleware, application programming interfaces (APIs), and/or other components for implementing the features of the circuits. System 100 may include a scanning engine 101, a vulnerability reporting engine 102, an extraction engine 103, a formatting engine 104, a vector engine 105, a classification engine 106, an output engine 107, a review engine 108, and/or a reporting engine 109. In one embodiment, the steps of the disclosed method may be implemented by these engines 101-109.
[0020] In one embodiment, system 100 may include computing device 110, which may include memory 111 and processor 112. System 100 may also include a generated user interface (UI) 113 and a Representational State Transfer (REST) API 114, shown for example in Figure 2, which can be adapted to enable communication between components, modules, and databases. As described below, users can interact with system 100 via UI 113. In some embodiments, memory 111 may include components and modules of system 100, including the aforementioned engines 101-109, UI 113, and REST API 114. System 100 may also include a source code database 115, a vulnerability report database 116, a security vulnerability database 117, a Java codebase or database 118, and/or a training model database 119. Furthermore, system 100 may include a software security server 120 and a router.
[0021] According to certain embodiments of this disclosure, computing device 110, databases 115-119, software security server 120, and the router can be logically and physically organized in many different ways. Databases 115-119 can be implemented using different types of data structures, such as linked lists, hash tables, or implicit storage mechanisms, and can include relational databases and/or object-relational databases. Databases 115-119 can be stored in the memory 111 of device 110 and/or software security server 120, or they can be distributed across multiple devices, servers, processing systems, or repositories. For example, vulnerability report database 116 can be configured to communicate with software security server 120, and vulnerability reporting engine 102 and extraction engine 103 can be configured to communicate with software security server 120. In some embodiments, computing device 110 may include communication interfaces, display circuitry, and input/output (I/O) interface circuitry that can be controlled by processor 112 to perform, via the components and modules shown in Figure 1, the process steps discussed below. As described below, the user can interact with the system 100 via the UI 113 displayed by the display circuitry.
[0022] Figure 2 shows an embodiment of a scanning engine 101 configured to scan source code 125 stored in a source code database 115. In one embodiment, the computing device 110 may include system circuitry capable of implementing any desired functionality of the system 100. As described below, in some embodiments, the scanning engine 101 may be configured to scan the source code 125 to look for security vulnerabilities 127. For example, the scanning engine 101 may be implemented on an application scanning client 128, which, as further discussed below, may be configured to communicate with the source code database 115, which stores the source code 125 to be scanned by the system 100. In one embodiment, the application scanning client 128 may include the computing device 110. Alternatively, the source code database 115 may be implemented on the computing device 110, which may be configured to communicate with the application scanning client 128 implemented on another device, which may be adapted to communicate with a display 129. In some embodiments, as shown in Figure 2, scanning engine 101 can also be configured to generate vulnerability report 130 and transmit vulnerability report 130 to vulnerability reporting engine 102.
[0023] In some embodiments, as an initial step of the disclosed method, scanning engine 101 may receive a scan request to scan source code 125. In some embodiments, this may be the initial stage of the process, where a client or user requests analysis of source code 125 to detect security vulnerabilities or threats 127 within or related to source code 125. In one example, this initial analysis may be performed by system 100 in conjunction with code analyzer 133. In some embodiments, code analyzer 133 in scanning engine 101 may be implemented using commercial packages or open-source solutions. For example, code analyzer 133 may include scanning tools such as Veracode, HCL AppScan, Checkmarx, and/or Fortify. Generally, code analyzer 133 attempts to protect the system from security flaws in business-critical software applications by using vulnerability reports 130. Code analyzer 133 may scan the source code 125 of a software product or application 135 and generate vulnerability reports 130. In some embodiments, vulnerability reporting engine 102 may generate vulnerability reports 130.
[0024] In some embodiments, the source code 125 of an application 135 selected, received, and / or identified by client 132 may be stored in a source code database 115. This may include source code 125 requested by client 132 for evaluation or analysis to determine whether source code 125 contains security vulnerabilities 127 that can be considered exploitable by security analysts. In one embodiment, source code 125 may be pushed or transmitted to application scanning client 128. Application scanning client 128 may include static application security testing software. In some embodiments, a user or client 132 may type, input, submit, or transmit the source code 125 of software application 135 to application scanning client 128.
[0025] Application scanning client 128 can generate vulnerability reports 130 corresponding to scans of source code 125. Typically, security analysts may spend considerable time reviewing such documents via application scanning client 128 to identify source code 125 that may contain security vulnerabilities/threats 127 and to identify false positives that can be ignored. Vulnerability reports 130 can be stored in software security server 120. Vulnerability reports 130 may include scan project code used by code analyzer 133, which may include a suite of tools used by security professionals to scan enterprise software for security issues. In some embodiments, vulnerability reports 130 may be stored in vulnerability report database 116, which may include a relational database service (RDS). Vulnerability reports 130 stored in vulnerability report database 116 can be transferred to software security server 120. In one embodiment, software security server 120 may be configured to transfer vulnerability reports 130 to extraction engine 103 via REST API 114, as indicated in Figure 2 by the large arrow between the vulnerability reporting engine 102 and the extraction engine 103.
[0026] Figure 3 illustrates an embodiment of a feature extraction process implemented by an extraction engine 103, which can be configured to communicate with a software security server 120. The feature extraction process of the disclosed method may include: extracting, from a vulnerability report 130 generated by a code analyzer 133, features 138 that indicate whether a portion of source code 125 may be vulnerable; and transmitting features 138 to a formatting engine 104. The process may include an initial step of receiving the vulnerability report 130 from the software security server 120 via a REST API 114 (box 301). Features 138, including different portions of a security vulnerability 127, may be retrieved (box 302). In some embodiments, such retrieved features 138 may identify a relevant threat to the security vulnerability 127 in source code 125 based on the corresponding vulnerability report 130.
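By way of a non-limiting illustration, the retrieval of features 138 from a received vulnerability report 130 (boxes 301-302) might be sketched as follows. The report layout, the field names (`findings`, `type`, `file`, `line`, `snippet`), and the `Feature` record are assumptions made for this sketch only; the disclosure does not specify a report schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    """One extracted feature record (field names are hypothetical)."""
    vuln_type: str
    file_path: str
    line: int
    snippet: str


def extract_features(report: dict) -> List[Feature]:
    """Build one Feature per finding in an already-parsed scan report."""
    features = []
    for finding in report.get("findings", []):
        features.append(
            Feature(
                vuln_type=finding["type"],
                file_path=finding["file"],
                line=int(finding["line"]),
                snippet=finding.get("snippet", ""),  # snippet is optional
            )
        )
    return features


# Hypothetical report payload, as it might arrive over the REST API.
sample_report = {
    "findings": [
        {"type": "SQL Injection", "file": "src/Dao.java", "line": "42",
         "snippet": 'stmt.execute("SELECT * FROM t WHERE id = " + id);'},
        {"type": "Hardcoded Password", "file": "src/Config.java", "line": 7},
    ]
}
features = extract_features(sample_report)
```

Each resulting record could then be handed to the formatting engine 104 for storage in a canonical format.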
[0027] The feature extraction process may also include a source code extraction step (box 303 in Figure 3). This step can be performed by the source code extractor 300, shown in Figure 2, which extracts the original source code 125 from the scanned and/or tested application 135. The extracted source code 125 may include code 125 corresponding to the retrieved feature 138. Therefore, the source code extractor 300 can be configured to communicate directly or indirectly with the source code database 115, as shown in Figure 2. Furthermore, this process may include a step of pushing or transmitting the extracted source code 125 containing security vulnerability 127 to the vulnerability database 117 (box 304 in Figure 3), which can be performed via the formatting engine 104. Therefore, all security vulnerabilities 127 detected by the code analyzer 133, together with the corresponding source code 125, can be sent to and stored in the vulnerability database 117 for further processing by the system 100.
[0028] In one embodiment, formatting engine 104 can format security vulnerabilities 127 received from source code extractor 300 of extraction engine 103 into a format configured to be received by vulnerability database 117. In one example, the received security vulnerabilities 127 may be stored in a format compatible with or usable by system 100. Formatting engine 104 can store all security vulnerabilities 127 identified by code analyzer 133 and received from extraction engine 103 in a format suitable for enabling system 100 to perform transformations of security vulnerabilities 127. This format is readable by system 100. In this format, cleaned or reformatted vulnerabilities 127 can be analyzed through analysis experiments performed by system 100. Cleaned vulnerabilities 127 stored in vulnerability database 117 may be adapted for further transformation by system 100. In some embodiments, vulnerability database 117 may be adapted to transfer cleaned security vulnerabilities 127 to vector engine 105.
[0029] Figure 4 shows an example of vector engine 105 and its interaction with components of other engines 104 and 106, as indicated by large arrows between the engines. Vector engine 105 can be configured to create feature vectors 173 for training machine learning (ML) model 141 to predict or determine whether security vulnerability 127 is actually a threat. The cleaned security vulnerability 127 can be transformed from human-readable features 138 into a format that can be processed by machine learning model 141. In some embodiments, an abstract syntax tree (AST) can be used as a method to decompose data for the cleaned security vulnerability 127 into a format that can be processed by machine learning model 141. In one embodiment, the tokenizer 155 in the vectorization process can be replaced by AST 143, as described below. Syntax tree 143 can include a tree representation of the abstract syntactic structure of source code 125 written in a programming language. Each node of tree 143 can represent a construct that appears in source code 125.
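As a non-limiting illustration of how a syntax tree decomposes source code into constructs, the following sketch uses Python's standard `ast` module on a Python snippet. This is an assumption for demonstration: the disclosure does not name a parser, and the scanned source code 125 may be written in another language (for example Java), in which case a parser for that language would be needed.

```python
import ast
from typing import List


def ast_node_types(source: str) -> List[str]:
    """Return the construct name of every node in the parsed syntax tree."""
    tree = ast.parse(source)
    # ast.walk visits the root first, then descendants in breadth-first order.
    return [type(node).__name__ for node in ast.walk(tree)]


# A string concatenation feeding a query, a pattern often flagged by scanners.
snippet = 'query = "SELECT * FROM users WHERE id = " + user_id'
node_types = ast_node_types(snippet)
```

The resulting sequence of construct names (an assignment, a binary operation, names, and so on) is one possible machine-processable form that could be fed to later vectorization.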
[0030] As shown in Figure 4, the orchestrator 147 of the vector engine 105 can receive cleaned vulnerabilities 127 from the formatting engine 104. In some embodiments, the vulnerability database 117 can be configured to transmit cleaned security vulnerabilities 127 to the orchestrator 147 via a REST API 114. The vulnerability router 148 can be configured to communicate with the orchestrator 147. The vulnerability router 148 can scan the list of cleaned vulnerabilities 127 and classify each cleaned vulnerability 127 based on its corresponding security vulnerability 127 type. Based on the type determined for the classified vulnerability 127, the classified vulnerability 127 can be routed in the system 100 according to predetermined machine learning rules or programming rules.
[0031] In some embodiments, vector engine 105 may include a grammar file 151 that defines speech-to-text words, terms, and phrases 152 that the grammar engine can recognize on user device 110. The grammar file 151 may include .py, .java, .js, .cs, and / or .xml files. In one embodiment, the terms 152 listed in the grammar file 151 may be those terms that the grammar engine searches for and compares with speech responses. When the grammar engine finds a matching term 152, it may execute an associated command or input the term 152 into a field. According to some embodiments, lexical analyzer 154 may receive grammar file 151 and vulnerability features 138, and perform tokenization via tokenizer 155 to return features 138.
[0032] Tokenizer 155 can perform lexical analysis (lexing), or tokenization. This can include a process of converting a character sequence 156 of the cleaned vulnerability 127 into a sequence of tokens 157. Tokenized vulnerability features 158 can include the vulnerability 127 stored in memory 111 in a tokenized format, which can include a sequence of such tokens 157. A repository 160 that can host the target source code 125 can be selected. In one embodiment, the repository 160 can be selected based on its size. The hosted code 125 can be transferred to tokenizer 161, which can include tools for language recognition. Tokenizer 161 can tokenize repository 160 and generate tokens 157.
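As a non-limiting illustration, the conversion of a character sequence 156 into a sequence of tokens 157 might be sketched with a small regular-expression lexer. The token classes (`STRING`, `NUMBER`, `IDENT`, `OP`) and the pattern are assumptions chosen for this sketch; the disclosure derives recognizable terms from grammar file 151 rather than from a fixed pattern.

```python
import re
from typing import List, Tuple

# Alternatives are tried left to right, so string literals are matched
# before identifiers or operators can claim their contents.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<STRING>"(?:\\.|[^"\\])*")   # double-quoted string literal
  | (?P<NUMBER>\d+(?:\.\d+)?)       # integer or decimal number
  | (?P<IDENT>[A-Za-z_]\w*)         # identifier or keyword
  | (?P<OP>[^\s\w])                 # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(chars: str) -> List[Tuple[str, str]]:
    """Convert a character sequence into (token_class, text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(chars)]


tokens = tokenize('stmt.execute("SELECT 1" + id);')
```

Whitespace matches none of the alternatives and is therefore skipped between tokens.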
[0033] In some embodiments, vector engine 105 may include a FastText creation model 162, which may include a library for learning word embeddings and text classification. FastText creation model 162 may receive tokens 157 and generate a trained embedding model 166. The trained embedding model 166 may include embeddings, which may include mappings of discrete categorical variables to vectors of continuous numbers. In some embodiments, each cleaned vulnerability 127 may be mapped to a vulnerability category 170 to generate a vulnerability ID 171 for each cleaned vulnerability 127 mapped to category 170. In some embodiments, vectorizer 172 may receive tokenized vulnerability features 158 as input and may output a single feature vector 173. Feature vector 173 may include all outputs collected from vectorizer 172. Furthermore, feature vectors may include links to a source code tree, where relevant source code can be obtained. These feature vectors 173 may be sent to classification engine 106.
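As a non-limiting illustration, one simple way a vectorizer could turn tokenized vulnerability features 158 into a single feature vector 173 is to average the embedding of each token. The averaging scheme and the hand-written `toy_embeddings` table are assumptions for this sketch; in the disclosure the embeddings would come from the trained embedding model 166, and the exact pooling method is not specified.

```python
from typing import Dict, List


def vectorize(tokens: List[str],
              embeddings: Dict[str, List[float]],
              dim: int) -> List[float]:
    """Average the embeddings of known tokens into one fixed-size vector."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # tokens without an embedding are ignored
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [value / count for value in total]


# Toy stand-in for a trained embedding model (values are made up).
toy_embeddings = {
    "execute": [1.0, 0.0],
    "select": [0.0, 1.0],
    "id": [1.0, 1.0],
}
feature_vector = vectorize(["execute", "select", "unknown"], toy_embeddings, dim=2)
```

Because every vulnerability maps to a vector of the same length regardless of how many tokens it contains, the output can be passed directly to a classifier.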
[0034] Figure 5 illustrates an embodiment of a classification engine 106 according to certain embodiments of the disclosed system 100, along with its interaction with components of other engines 105 and 107. Feature vector 173 can be used as input to a pre-trained ML model 141, predetermined programming rules 150, and/or a blanket rule 174 to determine whether a cleaned vulnerability 127 is a threat. Classification engine 106 can determine whether vulnerability 127 is a threat using at least three different methods: blanket rule 174, programming rule 150, and/or ML model 141. Blanket rule 174 and programming rule 150 can be applied in an automated classification method configured to automatically classify vulnerability 127. In some embodiments, blanket rule 174 can be applied to vulnerabilities 127 routed via vulnerability router 148, and ML model 141 may not be required. Such vulnerabilities 127 can be selected based on historical data that consistently indicates vulnerability 127 is exploitable. Therefore, it may be reasonable to automatically assume that identified vulnerability 127 is likely to be exploitable again. In some embodiments, programming rule 150 can be applied to vulnerabilities 127 transmitted from vulnerability router 148. Programming rule 150 can scan vulnerabilities 127 to detect common patterns identified as threats. In one embodiment, AST 143 may have been processed by system 100 but can be removed during transformation. Classification engine 106 can also leverage machine learning. Vulnerabilities 127 can be processed by system 100 (e.g., tokenized and vectorized), and feature vector 173 can be sent or fed into a pre-trained model 141, which may have previously analyzed such feature vectors 173. As more vulnerabilities 127 are transformed into feature vectors 173, system 100 can utilize ML model 141 more frequently because pre-trained model 141 may be more likely to have determined whether a particular vulnerability 127 is exploitable. The exemplary classification engine 106 shown in Figure 5 can determine whether vulnerability 127 is a threat. Classification engine 106 may include a deterministic classifier 175, which implements a classification algorithm whose resulting behavior can be determined by its initial state and input. In one embodiment, the deterministic classifier 175 may not be random or stochastic. Classification engine 106 may also include a probabilistic classifier 179, which may include a classifier configured to predict, based on an observation of the input, a probability distribution over a set of categories, rather than simply outputting the most probable category to which the observation might belong. Furthermore, classification engine 106 may include a training classifier 184, which can be configured to be trained based on feature vector 173. In some embodiments, training classifier 184 may be configured to train deterministic classifier 175 and/or probabilistic classifier 179. In some embodiments, training classifier 184 may be configured to train a trained model 141. Therefore, training classifier 184 may be adapted to communicate with trained model 141, which may be included in output engine 107. Rules (e.g., blanket rule 174) can be passed as a set of rules to deterministic classifier 175. For example, blanket rule 174 can be implemented if source code 125 can be identified as a threat based on historical data that consistently indicates vulnerability 127 is exploitable.
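As a non-limiting illustration of a probabilistic classifier that outputs a probability distribution over categories rather than a single label, the following sketch applies a softmax to per-category linear scores. The linear model, the hand-picked `toy_weights`, and the category names are assumptions for this sketch only; the disclosure does not specify the model family used by probabilistic classifier 179.

```python
import math
from typing import Dict, List


def predict_distribution(feature_vector: List[float],
                         weights: Dict[str, List[float]]) -> Dict[str, float]:
    """Return a probability for every category via softmax over linear scores."""
    scores = {
        label: sum(w * x for w, x in zip(label_weights, feature_vector))
        for label, label_weights in weights.items()
    }
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hand-picked weights standing in for a trained model (values are made up).
toy_weights = {
    "exploitable": [1.0, 1.0],
    "not_exploitable": [-1.0, -1.0],
}
distribution = predict_distribution([0.5, 0.5], toy_weights)
most_likely = max(distribution, key=distribution.get)
```

Keeping the full distribution, instead of only `most_likely`, lets downstream review steps see how confident the model was.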
[0035] As shown in Figure 4 and Figure 5, vulnerability router 148 can route vulnerability 127 directly to rule-based deterministic classifier 175 or via vector engine 105 to ML-based probabilistic classifier 179. The set of vulnerability types can be associated with rules 150 and 174. Vulnerability router 148 can determine the vulnerability type in the input vulnerability scan. When a rule 150 or 174 associated with the determined vulnerability type is identified, vulnerability router 148 can then route the input vulnerability scan to deterministic classifier 175 for processing according to the identified and pre-established rules. Otherwise, vulnerability router 148 can route the input vulnerability scan to probabilistic ML classifier 179. Example embodiments of classification methods for establishing various rules 150 and 174 for various types of vulnerabilities are discussed further below with reference to Figure 11.
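As a non-limiting illustration, the routing decision described above might be sketched as a lookup from vulnerability type to a pre-established rule. The `RULES` table, the rule function, and the returned destination names are hypothetical and serve only to show the control flow.

```python
from typing import Callable, Dict

Rule = Callable[[dict], bool]


def always_exploitable(vuln: dict) -> bool:
    """Stand-in for a blanket rule: this type is always treated as a threat."""
    return True


# Hypothetical association between vulnerability types and rules.
RULES: Dict[str, Rule] = {"Hardcoded Password": always_exploitable}


def route(vuln: dict, rules: Dict[str, Rule]) -> str:
    """Pick the classifier for a vulnerability based on its type."""
    if vuln["type"] in rules:
        # A rule exists for this type: use the rule-based path.
        return "deterministic_classifier_175"
    # No rule for this type: fall back to the ML-based path.
    return "probabilistic_classifier_179"


rule_path = route({"type": "Hardcoded Password"}, RULES)
ml_path = route({"type": "SQL Injection"}, RULES)
```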
[0036] In some other embodiments, vulnerability 127 may be routed to both a rule-based deterministic classifier 175 and an ML-based probabilistic classifier 179, and if the determination of whether vulnerability 127 is exploitable is inconsistent between the deterministic classifier 175 and the ML-based probabilistic classifier 179, additional arbitration may be performed to determine which classifier is more trustworthy.
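The disclosure does not state how such arbitration is performed. Purely as a hypothetical sketch, one option would be to compare the ML classifier's confidence against a fixed trust level assigned to the rule; the `rule_trust` and `threshold` parameters below are invented for illustration.

```python
from typing import Tuple


def arbitrate(rule_verdict: bool,
              ml_probability: float,
              rule_trust: float = 0.9,
              threshold: float = 0.5) -> Tuple[bool, str]:
    """Resolve a disagreement between rule-based and ML-based verdicts.

    rule_verdict:   True if the rule says "exploitable".
    ml_probability: ML-estimated probability of "exploitable".
    Returns the final verdict and which source decided it.
    """
    ml_verdict = ml_probability >= threshold
    if rule_verdict == ml_verdict:
        return rule_verdict, "agree"
    # Confidence of the ML model in its own verdict, in [0.5, 1.0].
    ml_confidence = max(ml_probability, 1.0 - ml_probability)
    if ml_confidence > rule_trust:
        return ml_verdict, "ml"
    return rule_verdict, "rule"


agreed = arbitrate(True, 0.8)
rule_wins = arbitrate(True, 0.3)
ml_wins = arbitrate(False, 0.97)
```

Other arbitration schemes, such as escalating every disagreement to a human security analyst, would be equally compatible with the paragraph above.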
[0037] Figure 5 also includes an embodiment of the output engine 107. The output from output engine 107 may include initial findings received from trained model 141 for predicting whether a labeled vulnerability 187 is a threat. Trained model 141 may be stored in training model database 119. In some embodiments, trained model 141 may be passed to probabilistic classifier 179. Classification engine 106 may generate a list of labeled vulnerabilities 187 and/or their predictions, which may be stored and later reviewed by system 100.
[0038] Figure 6 illustrates an embodiment of review engine 108, its interaction with components of other engines 104-107 and 109, and exemplary processes implemented by review engine 108. For example, review engine 108 may be implemented to include processes for output review (box 600) and processes for vulnerability review and model updating (box 601). Through these processes, review engine 108 can review vulnerabilities 127 identified as exploitable by system 100, and these vulnerabilities 127 can be used to retrain model 141 for future use. The review can be sent back to model 141 for further training of model 141.
[0039] The vulnerability review and model update process 601 may include steps of updating vulnerabilities (box 602), retraining the model (box 603), and updating rules (box 604). This process can be configured to update the vulnerability database 117 with vulnerabilities 127 identified as exploitable for blanket rule 174. The updated vulnerability 127 can be sent back to the vulnerability database 117, which can store the cleaned vulnerability 127 in a format compatible with system 100. To retrain model 141, findings can be received from security analyst (SA) review 606, data scientist (DS) review 607, and/or quality assurance (QA) review 608, and data analysis 609 can be performed. Such findings received from data analysis 609 can be transmitted to the orchestrator 147 of vector engine 105. These findings can be used to update blanket rule 174, model 141, and the list of vulnerabilities 127.
[0040] The updated blanket rule 174 may include rules updated by findings received from reviews 606-608 and data analysis 609. These reviews 606-608 may be performed by data scientists and/or security analysts. Data analysis 609 may be performed on the new data to determine the best approach for updating blanket rule 174 and retraining model 141. An automatic classification method instance 610 may be configured to automate the classification of vulnerability 127. The vulnerability review and model update process 601 may be based on a combination of review results 611 received from security analyst reviews 606, data scientist reviews 607, and/or quality assurance reviews 608. Review results 611 may be sent to reporting engine 109.
[0041] Reporting engine 109 can be configured to receive review results 611 from review engine 108. A complete report can be generated, which may include all vulnerabilities 127 that are actually threats, as analyzed by quality assurance review 608. Vulnerabilities 187 with quality assurance tags can be generated to include vulnerabilities 127 that have passed system 100 and been evaluated by quality assurance review 608. This review 608 can be performed by quality assurance experts. A final report 147 can be generated for client 132, and an HTML report 188 can be generated to report all findings in HTML format.
[0042] Final report 147 and HTML report 188 can be displayed via device 110. UI 113 can be displayed locally using display circuitry or used for remote visualization, such as HTML, JavaScript, audio, and video output from a web browser that can run on a local or remote machine. UI 113 and I/O interface circuitry may include touch-sensitive displays, voice or facial recognition inputs, buttons, switches, speakers, and other user interface elements. Additional examples of I/O interface circuitry include microphones, video and still image cameras, headphone and microphone input/output jacks, Universal Serial Bus (USB) connectors, memory card slots, and other types of inputs. I/O interface circuitry may also include magnetic or optical media interfaces (e.g., CD-ROM or DVD drives), serial and parallel bus interfaces, and keyboard and mouse interfaces.
[0043] In one embodiment, the components and modules used in the exemplary system can be divided into nine parts: scanning; storing reports; extracting features; storing all vulnerabilities in a canonical format; creating feature vectors and / or abstract syntax trees; classification; initial output; reviewing vulnerabilities; and final output and report generation. This list of divided parts does not need to be arranged in chronological order.
[0044] In one embodiment, system 100 may include steps for collecting and using different scan reports. These scan reports may be collected from multiple vendors. The scan reports may include vulnerability reports 130 received from code analyzer 133, as well as reports of various types of scans from other vendors. Automatic classification may include hybrid methods. For example, system 100 may combine various feature vector combinations using rules, filters, and machine learning. Figure 7 shows examples of automatic classification methods. For evaluation purposes, these methods can be trained and validated on a variety of datasets. Figures 8(a)-8(b) show examples of the identified problem types, their corresponding total classification time percentages, highest remediation priority, and implemented automatic classification methods.
[0045] In one embodiment, system 100 may include integration of existing toolchains with custom annotation tags / variables, enabling automated FPA files to be integrated back into the existing toolchain. For example, system 100 may integrate with extracting scan results from an application scanning tool that can be implemented in memory 111 to automatically categorize issues and push the results back to the application scanning tool. Figure 9 A system 100 according to certain embodiments is illustrated. In one embodiment, system 100 may implement a vulnerability identification prioritization and remediation (ViPR) tool in memory 111, which may include an integrated repository of data and analysis tools. System 100 may include a front-end 191 and an API 114. Front-end 191 may communicate with a user, and API 114 may communicate with a software security server 120. Furthermore, system 100 may combine and use information from both Static Application Security Testing (SAST) and Dynamic Application Security Testing (DAST) scan reports. System 100 may combine SAST and DAST classification judgments to automatically recommend remediation actions in a unified manner, for example, such that a remediation can address both SAST and DAST issues.
[0046] As shown at 150 and 174 in Figure 5, automatic classification rules for the deterministic classifier 175 can be created for each of a predetermined set of vulnerability types. An automatic classification rule base can be established for the predetermined set of vulnerability types. For example, such an automatic rule base may include an Automatic Classification Strategy (ATP) for each type of vulnerability, and thus may be referred to as an ATP rule base. Each ATP may also include one or more automatic classification methods (ATMs) in the form of various classification algorithms, which can be invoked by the deterministic classifier 175 of Figure 5 to evaluate the input vulnerability. The evaluation output of the deterministic classifier 175 can indicate whether the input vulnerability is unexploitable, exploitable, or of uncertain exploitability.
[0047] Accordingly, the arranger 147 of Figure 4 can first use the vulnerability router 148 of Figure 5 to map an input vulnerability (e.g., a data frame from the vulnerability database 117 of Figure 4) to either the deterministic classifier 175 or the ML probabilistic classifier 179. If the input vulnerability is mapped to the ML classifier 179, a feature vector creation process is triggered, whereby a feature vector is created for the input vulnerability, and the ML model is loaded and invoked to process the feature vector to classify the input vulnerability. If the input vulnerability is mapped to the deterministic classifier 175, the classification engine 106 further maps the input vulnerability to one of a predetermined set of vulnerability types and the corresponding ATP. The ATP and its ATMs are retrieved from the ATP rule base and passed to the deterministic classifier 175, along with the data frame of the input vulnerability, for classification.
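The routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `route`, `ATP_RULE_BASE`, `build_feature_vector`, and `classify_ml`, the record fields, and the placeholder decision logic are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical rule base: vulnerability type -> deterministic policy (ATP).
# Each policy here is reduced to a plain function that returns a label.
ATP_RULE_BASE: Dict[str, Callable[[dict], str]] = {
    "Resource Injection": lambda vuln: (
        "not exploitable" if vuln.get("in_test_code") else "uncertain"
    ),
}


def build_feature_vector(vuln: dict) -> list:
    """Stub feature extraction; a real system would encode far more features."""
    return [len(vuln.get("snippet", "")), int(vuln.get("in_test_code", False))]


def classify_ml(vuln: dict) -> str:
    """Stub for the probabilistic branch: build a vector, then call a model."""
    vector = build_feature_vector(vuln)
    # Placeholder threshold standing in for a trained model.
    return "exploitable" if vector[0] > 20 else "uncertain"


def route(vuln: dict) -> dict:
    """Send a vulnerability to the deterministic or the ML classifier."""
    policy = ATP_RULE_BASE.get(vuln["type"])
    if policy is not None:
        return {"engine": "deterministic", "prediction": policy(vuln)}
    return {"engine": "ml", "prediction": classify_ml(vuln)}
```

A vulnerability whose type has an entry in the rule base takes the deterministic branch; any other type falls through to the ML branch, where the feature vector is built only when needed.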
[0048] Figure 11 illustrates, at 1102, an example ATP rule base. The ATP rule base 1102 may include multiple ATPs 1104, each ATP 1104 corresponding to a type in a predetermined set of vulnerability types. Each ATP 1104 may include a set of ATMs 1106. For example, each ATM may include one or more specific algorithms for deterministic vulnerability classification. As further shown at 1102 in Figure 11, the mapping from input vulnerabilities to specific ATPs can be performed by vulnerability mappers 1108. In some implementations, vulnerability mappers 1108 may be part of the ATP rule base. Input vulnerabilities (e.g., vulnerability data frames from the vulnerability database 117 of Figure 4) can be passed to the ATP rule base 1102. As indicated by arrow 1110 in Figure 11, the ATP rule base 1102 can output an ATP and pass the output ATP to the deterministic classifier 175.
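One possible in-memory shape for such a rule base is sketched below. The class names `ATM`, `ATP`, and `ATPRuleBase`, their fields, and the identifier `ATP_0043` are assumptions made for illustration; only the relationships (a mapper from vulnerability type to ATP, and an ATP holding an ordered set of ATMs) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ATM:
    """One automatic classification method: a named, callable check."""
    name: str
    check: Callable[[dict], Optional[bool]]


@dataclass
class ATP:
    """One automatic classification strategy: an ordered list of ATMs."""
    atp_id: str
    atms: List[ATM] = field(default_factory=list)


@dataclass
class ATPRuleBase:
    """Rule base with a built-in vulnerability mapper (type -> ATP id)."""
    atps: Dict[str, ATP] = field(default_factory=dict)
    mapper: Dict[str, str] = field(default_factory=dict)

    def register(self, vuln_type: str, atp: ATP) -> None:
        self.atps[atp.atp_id] = atp
        self.mapper[vuln_type] = atp.atp_id

    def lookup(self, vuln_type: str) -> Optional[ATP]:
        """Return the ATP for a vulnerability type, or None if unmapped."""
        atp_id = self.mapper.get(vuln_type)
        return self.atps.get(atp_id) if atp_id is not None else None


rule_base = ATPRuleBase()
rule_base.register(
    "Resource Injection",
    ATP("ATP_0043", [ATM("ATM_Third_Party", lambda v: v.get("third_party"))]),
)
```

Keeping the mapper separate from the ATP table lets several vulnerability types point at the same ATP without duplicating it.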
[0049] ATPs 1104 and ATMs 1106 can be created for each of the predefined vulnerability types in a variety of ways and loaded into the ATP rule base 1102. The predefined vulnerability type set can be established based on any method. For example, the predefined vulnerability type set can be based on Fortify vulnerability categories and types determined and defined through historical Fortify vulnerability scans and analysis. Each vulnerability type can be associated with a vulnerability identifier (ID). Example process 1120 in Figure 11 shows how ATPs and ATMs are created for each of the predefined set of vulnerability types.
[0050] The ATP and ATM creation process 1120 may include a manual classification strategy (MTP) generation process and an ATP/ATM generation process for each of these vulnerability types, as shown at 1122 and 1124 in Figure 11, respectively. As shown at 1122, an MTP can be specified as a definition of the steps, forming part of an Improved Quality (IQ) guideline, that a Security Analyst (SA) must follow to classify (categorize) vulnerabilities as, for example, “not a problem,” “exploitable,” or “suspected.” For example, an MTP for a specific vulnerability type can be represented by a list of questions that the SA must examine. The list of questions can be organized as a decision tree. In other words, the order in which questions are asked is determined by the decision tree. Specifically, the next question asked depends on the answer to, and the output of, the previous question in the list. A list of questions and a decision tree can be created for each vulnerability type. Table I below shows an example list of MTP questions for the “Resource Injection” vulnerability type (example vulnerability ID 0043).
[0051] Table I
[0052]
[0053] Table I above contains both a list of questions and information about the decision tree for that list of questions. For example, if the answer to the first question in the list is "out of range," indicating that this particular vulnerability is not a problem, the decision tree ends and does not continue. However, if the answer to that question is "no" or "uncertain," the decision tree proceeds to the next question, and question "0043-2" needs to be answered, as shown in Table I. If the answer to question "0043-2" is "not a problem," the decision tree again ends. Otherwise, the decision tree proceeds to the next question, and as specified in Table I, question "0043-3" needs to be answered next. This process continues as shown in example Table I until the decision tree ends. Table I therefore specifies a conditional sequence of classification steps. Each step presents a question for the SA to answer. The answer to the question determines the next step (either the end of the decision tree or the next question). In this way, Table I provides the path to the final classification decision.
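The conditional sequence described above can be sketched as a short walk over ordered questions. The question identifiers follow the text; the question contents, the record fields (`out_of_scope`, `input_trusted`, `reaches_sink`), and the answer-to-outcome mappings are placeholders, since the contents of Table I are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

NEXT = "next"  # sentinel: continue to the following question

# Each step: (question id, answer function, mapping from answer to outcome).
Step = Tuple[str, Callable[[dict], str], Dict[str, str]]

# Placeholder questions; the real wording would come from the MTP.
STEPS: List[Step] = [
    ("0043-1", lambda v: "yes" if v.get("out_of_scope") else "no",
     {"yes": "not a problem", "no": NEXT, "uncertain": NEXT}),
    ("0043-2", lambda v: "yes" if v.get("input_trusted") else "no",
     {"yes": "not a problem", "no": NEXT, "uncertain": NEXT}),
    ("0043-3", lambda v: "yes" if v.get("reaches_sink") else "uncertain",
     {"yes": "exploitable", "uncertain": "suspected"}),
]


def walk(vuln: dict, steps: List[Step] = STEPS) -> Tuple[str, List[str]]:
    """Answer questions in order until one yields a final classification."""
    path = []
    for question_id, ask, outcomes in steps:
        path.append(question_id)
        outcome = outcomes.get(ask(vuln), NEXT)
        if outcome != NEXT:
            return outcome, path
    return "suspected", path  # no question was decisive
```

The returned path records which questions were actually asked, which mirrors how the table gives the route to the final classification decision.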
[0054] Figure 13 shows the example process 1122 of Figure 11 for generating IQ guidelines from which ATPs and ATMs can be automatically formed. Process 1122 can be used to process data sources including context data 1302, experimental data 1304, and computational data 1306 via iterative verification (1310), enhancement (1312), encoding (1314), and aggregation (1316) processes, wherein the output is processed by reaction module 1320 to generate IQ guidelines stored in database 1330. The IQ guidelines are used for the generation of ATPs and ATMs.
[0055] Returning to Figure 11, as further illustrated at 1124, once an MTP has been created for each vulnerability type, it can be further determined which parts of the MTP can be encoded to generate the Automatic Classification Methods (ATMs) for the MTP. Specifically, each question in the MTP can correspond to a Manual Classification Method (MTM), which can be converted and encoded into an ATM containing an automatic algorithm (as shown at 1126 in Figure 11). Each ATM can be encoded in a function that can be called by the classification engine 106. The Automatic Classification Strategy (ATP) corresponding to the MTP can identify the encoded ATMs. Table II below shows an example.
[0056] Table II
[0057]
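As an illustration of encoding a manual step as a callable ATM, the following sketch registers functions in a small library keyed by a unique identifier. The decorator `atm`, the dictionary `ATM_LIBRARY`, the helper `call_atm`, and the path-prefix logic are assumptions chosen for the example; the disclosure does not specify how the functions are stored or what each check computes.

```python
from typing import Callable, Dict, Optional

# Hypothetical function library: unique ATM identifier -> callable.
ATM_LIBRARY: Dict[str, Callable[[dict], Optional[bool]]] = {}


def atm(identifier: str):
    """Decorator that registers a function as an ATM under a unique identifier."""
    def register(func):
        if identifier in ATM_LIBRARY:
            raise ValueError(f"duplicate ATM identifier: {identifier}")
        ATM_LIBRARY[identifier] = func
        return func
    return register


@atm("ATM_Third_Party")
def is_third_party(vuln: dict) -> Optional[bool]:
    """Placeholder logic: does the finding sit in vendored code?"""
    path = vuln.get("file_path")
    if path is None:
        return None  # the method cannot decide without a path
    return path.startswith(("vendor/", "third_party/"))


def call_atm(identifier: str, vuln: dict) -> Optional[bool]:
    """Invoke an ATM by its identifier, as a strategy step would."""
    return ATM_LIBRARY[identifier](vuln)
```

Returning `None` for an undecidable input gives a natural three-way result (yes, no, uncertain) that a strategy can map to its next step.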
[0058] In some embodiments, as shown in the vulnerability-ATP mapping of Figure 12, the ATP library includes multiple ATPs (1202). Each ATP can be associated with a unique identifier and represents a policy, as described above. Each vulnerability type can be associated with one ATP (e.g., the mapping from 1204 to 1202 in Figure 12), and each ATP can be mapped to one or more vulnerability types (e.g., the mapping from 1202 to 1204 in Figure 12), meaning that multiple different vulnerability types can use the same ATP (with the same decision tree 1206). Each ATP further encapsulates the decision tree as described above and links it to one or more ATMs, as shown at 1206 in Figure 12. Therefore, each ATP can be implemented as an ordered container of ATMs, as shown at 1128 and 1106 in Figure 11. Each ATM corresponds to a step in the decision tree. The ATM is encoded and can include various algorithms. As a callable function, an ATM can be shared by different ATPs (e.g., the "ATM_Third_Party" and "ATM_Is-Trust" functions common to different ATPs at 1206 in Figure 12). Therefore, ATMs can be collected in a unified function library or code repository. When an ATP references an ATM at a specific step in its decision tree, the ATM can be identified by its unique function identifier in the function library or code repository, as shown at 1206 in Figure 12. Example code for an ATP invoking various ATMs using an integrated decision tree is shown below:
import pandas

# Method of an ATP class; Labels and ATP_Abstract are defined elsewhere.
def check(self, df):
    chain = self.strategy['chain']
    if self.id == 'ATP_ML':
        # The ML strategy holds a single ATM that classifies the whole frame.
        item = chain[0]
        atm_config = item['config'] if 'config' in item else {}
        atm = item['class'](**atm_config)
        result_df = atm.check(df)
        return result_df
    else:
        cols = ['vulnerabilityPrediction', 'vulnerabilityEngine', 'vulnerabilityDecisionTree']
        result_df = pandas.DataFrame(columns=cols, index=df.index)  # keep the input index
        for i, row in df.iterrows():
            prediction = Labels.NS  # default
            tree = []
            lang = row["programmingLang"]
            for item in chain:
                atm_config = item['config'] if 'config' in item else {}
                atm = item['class'](**atm_config)  # ATM instance
                # check if the item in the chain has a lang attribute
                if "language" in item:
                    # language-specific ATM: run it only for matching rows
                    if item["language"] == lang:
                        flag = atm.check(row)
                        answer = ATP_Abstract.answer(flag)
                        tree.append({
                            'name': atm.atm_name,
                            'output': {
                                'prediction': item[answer] if answer in item else Labels.NEXT,
                                'explanation': atm.explanation,
                                'confidence': 0.5
                            }
                        })
                        if answer in item:
                            prediction = item[answer]
                            break  # prediction was found
                else:
                    flag = atm.check(row)
                    answer = ATP_Abstract.answer(flag)
                    tree.append({
                        'name': atm.atm_name,
                        'output': {
                            'prediction': item[answer] if answer in item else Labels.NEXT,
                            'explanation': atm.explanation,
                            'confidence': 0.5
                        }
                    })
                    if answer in item:
                        prediction = item[answer]
                        break  # prediction was found
            # record the outcome and the decision path for this row
            result_df.at[i, 'vulnerabilityPrediction'] = prediction
            result_df.at[i, 'vulnerabilityEngine'] = self.id
            result_df.at[i, 'vulnerabilityDecisionTree'] = tree
        return result_df
[0111] In some embodiments, the output of the classification engine 106 of Figure 5 described above may include the input data frame with several additional columns. For example, one of the additional columns may include a prediction from classification engine 106. Another additional column may include an indication of the classifier used for the prediction (deterministic classifier 175 or ML probabilistic classifier 179). Yet another additional column may include information indicating the decision tree used in the deterministic classifier. The decision tree used may be identified by an ATP identifier.
[0112] A separate machine learning model can be used to automate the generation (1122 in Figure 11) of each of the manual classification strategies (MTPs) or decision trees for the predetermined set of vulnerability types. For example, a machine learning model can be trained to select a list of questions from a question library in a specific order based on the accuracy of historical vulnerability predictions.
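The shape of that output can be illustrated with plain dictionaries standing in for data frame rows. The function name `attach_results` and the index-keyed `results` mapping are hypothetical; the three column names are the ones used in the listing above.

```python
from typing import Dict, List


def attach_results(rows: List[dict], results: Dict[int, dict]) -> List[dict]:
    """Return copies of the input rows extended with the classification columns."""
    extra = (
        "vulnerabilityPrediction",
        "vulnerabilityEngine",
        "vulnerabilityDecisionTree",
    )
    merged = []
    for index, row in enumerate(rows):
        result = results.get(index, {})
        # Rows without a result still get the columns, filled with None.
        merged.append({**row, **{col: result.get(col) for col in extra}})
    return merged
```

With pandas, a comparable effect can be obtained with `df.join(result_df)` when the result frame was built on the input frame's index.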
[0113] As shown in Figure 9, the method implemented by system 100 may include the step of selecting an item via user interface 113. See box 900. Front end 191 may request an item (see box 901), and API 114 may send such an item request to software security server 120. See box 902. As a result, API 114 may receive the item. See box 903. Front end 191 may be adapted to display the received item via user interface 113. See box 904. In some embodiments, one of the displayed items may be selected via user interface 113. See box 905. In some embodiments, front end 191 may identify or determine the selected item. See box 906. API 114 may be adapted to extract features of the selected item from software security server 120. See box 907. In one embodiment, API 114 may also be adapted to: apply rules (box 908), apply filters (box 909), apply programmed filters (box 910), and/or apply machine learning models (box 911). Furthermore, according to some embodiments, API 114 may be adapted to export the results to software security server 120. See box 912.
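The processing stages of boxes 908 to 911 can be pictured as a chain of functions applied in turn to the extracted findings. The sketch below is an assumption-laden illustration: the stage functions, field names, and labels are hypothetical, the programmed-filter stage (box 910) is omitted for brevity, and the model stage is a stub rather than a trained model.

```python
from typing import Callable, List, Sequence

Stage = Callable[[List[dict]], List[dict]]


def apply_rules(findings: List[dict]) -> List[dict]:
    """Rule stage: label findings that a fixed rule can settle."""
    return [
        {**f, "label": "not a problem"} if f.get("in_test_code") else f
        for f in findings
    ]


def apply_filters(findings: List[dict]) -> List[dict]:
    """Filter stage: drop findings flagged as duplicates."""
    return [f for f in findings if not f.get("duplicate")]


def apply_model(findings: List[dict]) -> List[dict]:
    """Model stage (stub): label whatever the earlier stages left undecided."""
    return [f if "label" in f else {**f, "label": "suspected"} for f in findings]


def triage(
    findings: List[dict],
    stages: Sequence[Stage] = (apply_rules, apply_filters, apply_model),
) -> List[dict]:
    """Run each stage in order, feeding its output to the next one."""
    for stage in stages:
        findings = stage(findings)
    return findings
```

Because every stage has the same signature, stages can be added, removed, or reordered without touching the others.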
[0114] In some embodiments, the communication interface may include a wireless transmitter and receiver (referred to herein as a “transceiver”) and any antenna used by the transceiver’s transmitting and receiving circuitry. For example, the transceiver and antenna may support WiFi network communication under any version of IEEE 802.11 (e.g., 802.11n or 802.11ac), or communication under other wireless protocols such as Bluetooth, Wi-Fi, WLAN, and cellular (4G, LTE/A). The communication interface may also include a serial interface, such as Universal Serial Bus (USB), Serial ATA, IEEE 1394, Lightning port, I²C, SlimBus, or other serial interfaces. The communication interface may also include a cable transceiver that supports wired communication protocols. The cable transceiver can provide a physical layer interface for any of a variety of communication protocols, such as any type of Ethernet, Gigabit Ethernet, optical network protocols, Data Over Cable Service Interface Specification (DOCSIS), Digital Subscriber Line (DSL), Synchronous Optical Network (SONET), or other protocols.
[0115] System circuitry can include any combination of hardware, software, firmware, APIs, and / or other circuitry. For example, system circuitry can be implemented using one or more system-on-a-chip (SoC), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), microprocessor, discrete analog and digital circuitry, and other circuitry. System circuitry can implement any desired function of system 100. As an example only, system circuitry can include one or more instruction processors 112 and memory 111. Memory 111 can store, for example, control instructions for executing features of system 100. In one implementation, processor 112 can execute control instructions to perform any desired function of system 100. Control parameters can provide and specify configuration and operational options for the control instructions and other functions of system 100. System 100 can also include various databases or data sources, each accessible by system 100 to obtain data considered during any one or more processes described herein.
[0116] In one embodiment, a method or system 100 for managing software may include the following steps: scanning the source code of a software product or application 135 to detect potential vulnerabilities; and generating an electronic document report listing the detected potential vulnerabilities. The method / system may further include the following steps: extracting features for each potential vulnerability from the electronic document report; receiving policy data and business rules; comparing the extracted features with the policy data and business rules; and determining a tag based on the source code of the potential vulnerability. Furthermore, the method / system may include the following steps: determining a vector based on the extracted features of the potential vulnerability and based on the tag; and selecting one of multiple vulnerability scoring methods based on the vector. In one embodiment, the vulnerability scoring method may be a machine learning modeling method 141, a comprehensive rule-based automatic classification method 174, and / or a programming rule-based automatic classification method 150. According to some embodiments, the multiple vulnerability scoring methods may include any combination of these methods. The method / system may further include the following steps: using the selected vulnerability scoring method, determining a vulnerability accuracy score based on the vector, and displaying the vulnerability accuracy score to a user. In one embodiment, the multiple machine learning models may include a random forest machine learning model.
[0117] In some embodiments, as shown in Figure 10, a method or system 100 for managing software may include the following steps: obtaining an electronic document listing potential vulnerabilities in a software product (box 1000); extracting features from the electronic document for each potential vulnerability (box 1001); determining a vector based on the extracted features (box 1002); selecting one of a plurality of machine learning modeling methods and an automatic classification method based on the vector (box 1003); and determining a vulnerability accuracy score based on the vector using the selected method (box 1004). The method/system may also include the following steps: scanning the source code of the software product to detect potential vulnerabilities; and generating an electronic document based on the detected potential vulnerabilities. Furthermore, the method/system may include the following steps: receiving policy data or business rules; comparing the extracted features with the policy data or business rules; and determining a label corresponding to at least one of the detected potential vulnerabilities based on the scanned source code. In some embodiments, the vector may be based on the label. The method/system may also include the step of displaying the vulnerability accuracy score to a user. In one embodiment, the machine learning modeling method may include a random forest machine learning model. In some embodiments, the automatic classification method may include a comprehensive rule-based automatic classification method and/or a programming rule-based automatic classification method.
In some embodiments, a method or system for assessing software vulnerabilities may include the following steps: accessing an automatic classification rule base, the automatic classification rule base including multiple predefined automatic classification strategies corresponding to multiple predetermined vulnerability types, wherein each automatic classification strategy includes a decision tree for determining whether one of the multiple predetermined vulnerability types is exploitable; accessing a machine learning model library to probabilistically determine whether one of the multiple predetermined vulnerability types is exploitable; obtaining an electronic document listing potential vulnerability issues of the software product based on the source code of the software product; determining whether a potential vulnerability issue is associated with a predetermined vulnerability type among the multiple predetermined vulnerability types; and, when it is determined that a potential vulnerability issue is associated with a predetermined vulnerability type among the multiple predetermined vulnerability types, determining whether the software product is exploitable based on processing the electronic document using an automatic classification strategy, retrieved from the automatic classification rule base, that is associated with that predetermined vulnerability type and the corresponding decision tree, and otherwise probabilistically determining whether the software product is exploitable based on processing the electronic document using a machine learning model from the machine learning model library.
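The sequence of boxes 1000 to 1004 can be sketched end to end as below. Everything specific in the sketch is an assumption for illustration: the feature set, the vector layout, the `RULE_SCORES` table, and the arithmetic in `score_by_model`, which merely stands in for a trained model such as a random forest.

```python
from typing import Callable, List, Tuple

# Hypothetical rule table: vulnerability type -> fixed rule-based score.
RULE_SCORES = {"Resource Injection": 0.9}


def extract_features(entry: dict) -> dict:
    """Box 1001: pull a few illustrative features from one report entry."""
    return {
        "type": entry.get("type", ""),
        "severity": float(entry.get("severity", 0)),
        "snippet_len": len(entry.get("snippet", "")),
    }


def to_vector(features: dict) -> List[float]:
    """Box 1002: encode the features as a numeric vector."""
    has_rule = 1.0 if features["type"] in RULE_SCORES else 0.0
    return [has_rule, features["severity"], float(features["snippet_len"])]


def score_by_rules(features: dict, vector: List[float]) -> float:
    """Rule-based scoring: look the type up in the rule table."""
    return RULE_SCORES[features["type"]]


def score_by_model(features: dict, vector: List[float]) -> float:
    """Stand-in for a trained model; the formula is arbitrary."""
    return min(1.0, 0.1 * vector[1] + 0.001 * vector[2])


def select_method(vector: List[float]) -> Callable[[dict, List[float]], float]:
    """Box 1003: choose the scoring method from the vector."""
    return score_by_rules if vector[0] == 1.0 else score_by_model


def assess(report: List[dict]) -> List[Tuple[str, float]]:
    """Boxes 1001-1004 for every entry of an already obtained report (box 1000)."""
    scores = []
    for entry in report:
        features = extract_features(entry)
        vector = to_vector(features)
        scores.append((features["type"], select_method(vector)(features, vector)))
    return scores
```

In this sketch the first vector component records whether a rule exists for the type, so the selection in box 1003 depends only on the vector, as the text describes.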
[0118] All discussions, regardless of the specific implementation described, are exemplary in nature and not restrictive. For example, although selected aspects, features, or components of these implementations are described as being stored in memory, all or part of one or more systems may be stored on, distributed across, or read from other computer-readable storage media, such as auxiliary storage devices like hard disks, flash drives, floppy disks, and CD-ROMs. Furthermore, the various modules and screen display functions are merely one example of such functionality, and any other configuration incorporating similar functionality is possible.
[0119] Corresponding logic, software, or instructions for implementing the processes, methods, and / or techniques described above may be provided on a computer-readable storage medium. The functions, actions, or tasks shown in the figures or described herein may be performed in response to one or more sets of logic or instructions stored in or on a computer-readable medium. The functions, actions, or tasks are independent of a particular type of instruction set, storage medium, processor, or processing strategy, and may be performed individually or in combination by software, hardware, integrated circuits, firmware, microcode, etc. Similarly, processing strategies may include multiprocessing, multitasking, parallel processing, etc. In one embodiment, instructions are stored on a removable media device for reading by a local or remote system. In other embodiments, logic or instructions are stored at a remote location for transmission over a computer network or telephone line. In other embodiments, logic or instructions are stored within a given computer, central processing unit (“CPU”), graphics processing unit (“GPU”), or system.
[0120] While this disclosure has been specifically shown and described with reference to embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of this disclosure. Although some figures illustrate multiple operations performed in a particular order, operations independent of order may be reordered, and other operations may be combined or separated. While some reorderings or other groupings are specifically mentioned, other options will be apparent to those skilled in the art, and therefore an exhaustive list of alternatives is not provided.
Claims
1. A system for assessing software vulnerabilities, comprising: a memory for storing executable instructions; and a processor adapted to access the memory, the processor further adapted to execute the executable instructions stored in the memory to: access an automatic classification rule base, which includes multiple predefined automatic classification strategies corresponding to multiple predetermined vulnerability types, wherein each automatic classification strategy includes a decision tree for determining whether one of the multiple predetermined vulnerability types is exploitable, and wherein the decision tree includes a set of progressively ordered automatic classification methods; access a machine learning model library for probabilistically determining whether one of the multiple predetermined vulnerability types is exploitable; obtain, based on the source code of a software product, an electronic document listing the potential vulnerabilities of the software product; determine whether a potential vulnerability is associated with one of the multiple predetermined vulnerability types; when the potential vulnerability is determined to be associated with one of the multiple predetermined vulnerability types, determine whether the software product is exploitable based on processing the electronic document using the automatic classification strategy, retrieved from the automatic classification rule base, that is associated with that predetermined vulnerability type and the corresponding decision tree, wherein each automatic classification method in the automatic classification strategy is configured to generate a classification output when processing the electronic document; and when the potential vulnerability is not determined to be associated with one of the multiple predetermined vulnerability types, probabilistically determine whether the software product is exploitable based on processing the electronic document using a machine learning model from the machine learning model library.
2. The system of claim 1, wherein the classification output from each automatic classification method includes one of the following: a classification determination indicating that the software product is not exploitable, a classification determination indicating that the software product is exploitable, or a classification determination indicating that the exploitability of the software product is uncertain.
3. The system of claim 2, wherein the processor is adapted to determine whether the software product is exploitable based on the automatic classification strategy by: progressively invoking the automatic classification methods of the automatic classification strategy according to the decision tree when the output of an automatic classification method indicates that the software product is exploitable or that the exploitability of the software product is uncertain; and terminating the decision tree when a not-exploitable classification output is obtained.
4. The system of claim 1, wherein each automatic classification method of the automatic classification strategy includes coded versions of one or more classification algorithms for determining a classification output as a response to a predetermined classification query within a predetermined query tree of the automatic classification strategy.
5. The system of claim 4, wherein each automatic classification strategy and each automatic classification method is established based on a set of guidelines, which is derived from separate contextual data, experimental data, and computational data.
6. The system of claim 5, wherein the set of guidelines is encoded in a predefined format, and the predefined format is processed to generate the coded versions of the one or more classification algorithms.
7. The system of claim 1, wherein the processor is further adapted to: scan the source code of the software product to detect the potential vulnerabilities; and generate the electronic document based on the detected potential vulnerabilities.
8. The system of claim 7, wherein the processor is adapted to probabilistically determine whether the software product is exploitable by: extracting features for each potential vulnerability from the electronic document; determining a vector based on the extracted features; selecting, based on the vector, one of multiple vulnerability scoring models from the machine learning model library; and determining a vulnerability accuracy score based on the vector using the selected vulnerability scoring model.
9. The system of claim 8, wherein the processor is further adapted to: receive policy data or a set of business rules; compare the extracted features with the policy data or the set of business rules; and determine, based on the scanned source code, a tag corresponding to at least one of the detected potential vulnerabilities.
10. The system of claim 9, wherein the vector is based on the tag.
11. The system of claim 8, wherein the processor is further adapted to display the vulnerability accuracy score to a user.
12. The system according to any one of claims 1-11, wherein the machine learning model library comprises a plurality of random forest machine learning models.
13. A method for assessing software vulnerabilities, comprising the following steps: accessing an automatic classification rule base, which includes multiple predefined automatic classification strategies corresponding to multiple predetermined vulnerability types, wherein each automatic classification strategy includes a decision tree for determining whether one of the multiple predetermined vulnerability types is exploitable, and wherein the decision tree includes a set of progressively ordered automatic classification methods; accessing a machine learning model library for probabilistically determining whether one of the multiple predetermined vulnerability types is exploitable; obtaining, based on the source code of a software product, an electronic document listing the potential vulnerabilities of the software product; determining whether a potential vulnerability is associated with one of the multiple predetermined vulnerability types; when the potential vulnerability is determined to be associated with one of the multiple predetermined vulnerability types, determining whether the software product is exploitable based on processing the electronic document using the automatic classification strategy, retrieved from the automatic classification rule base, that is associated with that predetermined vulnerability type and the corresponding decision tree, wherein each automatic classification method in the automatic classification strategy is configured to generate a classification output when processing the electronic document; and when the potential vulnerability is not determined to be associated with one of the multiple predetermined vulnerability types, probabilistically determining whether the software product is exploitable based on processing the electronic document using a machine learning model from the machine learning model library.
14. The method of claim 13, wherein the classification output from each automatic classification method includes one of the following: a classification determination indicating that the software product is not exploitable, a classification determination indicating that the software product is exploitable, or a classification determination indicating that the exploitability of the software product is uncertain.
15. The method of claim 14, wherein determining whether the software product is exploitable based on the automatic classification strategy comprises: progressively invoking the automatic classification methods of the automatic classification strategy according to the decision tree when the output of an automatic classification method indicates that the software product is exploitable or that the exploitability of the software product is uncertain; and terminating the decision tree when a not-exploitable classification output is obtained.
16. The method of claim 13, wherein each automatic classification method of the automatic classification strategy includes coded versions of one or more classification algorithms for determining a classification output as a response to a predetermined classification query within a predetermined query tree of the automatic classification strategy.
17. The method according to any one of claims 13-16, wherein probabilistically determining whether the software product is available comprises: Extract features for each potential vulnerability from the electronic document; A vector is determined based on the extracted features; Based on the vector, one of the multiple vulnerability scoring models is selected from the machine learning model library; as well as The vulnerability accuracy score is determined based on the vector using the selected vulnerability scoring model from the plurality of vulnerability scoring models.
18. A non-transitory computer-readable medium comprising instructions configured to be executed by a processor, wherein the instructions, when executed, cause the processor to: access an automatic classification rule base that includes multiple predefined automatic classification strategies corresponding to multiple predetermined vulnerability types, wherein each automatic classification strategy includes a decision tree for determining whether one of the multiple predetermined vulnerability types is exploitable, and wherein the decision tree includes a set of progressively ordered automatic classification methods; access a machine learning model library for probabilistically determining whether one of the multiple predetermined vulnerability types is exploitable; obtain, based on the source code of a software product, an electronic document listing potential vulnerabilities of the software product; determine whether a potential vulnerability is associated with one of the multiple predetermined vulnerability types; when the potential vulnerability is determined to be associated with one of the multiple predetermined vulnerability types, determine the exploitability of the software product by processing the electronic document using the automatic classification strategy, retrieved from the automatic classification rule base, that associates the predetermined vulnerability type with the corresponding decision tree, wherein each automatic classification method in the automatic classification strategy is configured to generate a classification output when processing the electronic document; and when the potential vulnerability is not determined to be associated with one of the multiple predetermined vulnerability types, probabilistically determine the exploitability of the software product by processing the electronic document using a machine learning model from the machine learning model library.