A multi-stage filtering source code data detection method and device

A multi-stage filtering technology for source code, applied in the field of source code data detection. It addresses problems such as complex code labeling words, unsuitable detection methods, and non-standard identifier naming, with the effect of strengthening security management and control, speeding up information processing, and avoiding excessive computation.

Active Publication Date: 2019-08-16
北京明朝万达科技股份有限公司

AI Technical Summary

Problems solved by technology

[0010] (1) The method in the above patent only intercepts network data streams. It is not suitable for detecting whether locally stored files contain source code data, and it cannot determine how source code is distributed locally.
[0011] (2) In the above patent, the basis for judging whether a character stream contains source code is "preset detection strings and/or syntax analysis library functions corresponding to the programming language". The file type, lexical token attributes, and semantic content are not analyzed, so the accuracy of source code data detection is limited.
[0012] (3) In enterprises whose software development process is poorly standardized, there are many development tools, differing code styles, irregular identifier naming, and complex code labeling words. Detecting such source code with only basic techniques such as file attribute detection, keyword matching, and regular expression matching leads to inaccurate positioning (it is difficult to determine where source code fragments sit within an ordinary document). Methods such as fingerprinting and statistical learning are also insufficiently general, because sample collection is incomplete.
[0013] (4) When the advanced text detection techniques mentioned above perform semantic analysis on a long text, they process the full text, which requires a large amount of computation and delays the detection result.

Examples

Specific Embodiment 1

[0079] A developer downloads a CPP source code file from the enterprise code server to their own client computer, and the data security software on the client scans the newly added file.

[0080] When the file type detection and filtering module of the source code data detection device finds that the file suffix is ".cpp", it determines that the file is a source code file. Based on this determination, the data security software allows the file to be saved only to the secure area of the client computer.
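The suffix check in this embodiment can be illustrated with a minimal Python sketch. The suffix sets and the function name `classify_by_suffix` are assumptions made for illustration; the text only mentions ".cpp" explicitly and does not list which file types the module recognizes.

```python
from pathlib import Path

# Hypothetical suffix sets: the embodiment only names ".cpp".
SOURCE_SUFFIXES = {".c", ".h", ".cpp", ".hpp", ".java", ".py"}
# Hypothetical document formats that would be converted to text later.
DOCUMENT_SUFFIXES = {".txt", ".pdf", ".doc", ".docx"}


def classify_by_suffix(filename: str) -> str:
    """Return 'source', 'document', or 'other' based only on the suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix in SOURCE_SUFFIXES:
        return "source"
    if suffix in DOCUMENT_SUFFIXES:
        return "document"
    return "other"


if __name__ == "__main__":
    for name in ("main.cpp", "manual.pdf", "photo.png"):
        print(name, "->", classify_by_suffix(name))
```

A suffix check is cheap but easy to defeat by renaming a file, which is why Embodiment 3 relies on the later analysis stages.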

Specific Embodiment 2

[0082] While writing a user manual, a developer pastes several source code programs into a text document and saves it in PDF format. The data security software installed by the enterprise performs a full-disk scan and inputs the file into the device for detection.

[0083] The file type detection and filtering module of the source code data detection device determines that the file is not a source code file but a document format that carries text, so the file is converted to text (txt) format for subsequent processing. Lexical analysis of the converted txt file finds that the sum of the weighted scores of the lexical tokens it contains exceeds the discrimination threshold, so the file is determined to contain source code data. After obtaining the detection result, the enterprise should limit the scope of dissemination of this file.
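The weighted-score test in the lexical stage can be sketched as follows. The token weights, the threshold value, and the tokenizing pattern are all hypothetical; the text does not give the actual weights or threshold, only that a sum of weighted token scores is compared against a discrimination threshold.

```python
import re

# Hypothetical weights for a few C-like lexical tokens.
TOKEN_WEIGHTS = {
    "#include": 3.0,
    "return": 1.0,
    "int": 1.0,
    "void": 1.0,
    "for": 0.5,
    "if": 0.5,
    "{": 0.5,
    "}": 0.5,
    ";": 0.5,
}
# Hypothetical discrimination threshold.
THRESHOLD = 5.0

# Words (optionally prefixed with '#') plus a few punctuation tokens.
TOKEN_PATTERN = re.compile(r"#?\w+|[{};]")


def weighted_score(text: str) -> float:
    """Sum the weights of all recognized tokens; unknown tokens score 0."""
    return sum(TOKEN_WEIGHTS.get(tok, 0.0) for tok in TOKEN_PATTERN.findall(text))


def exceeds_threshold(text: str, threshold: float = THRESHOLD) -> bool:
    """True if the weighted token score is above the threshold."""
    return weighted_score(text) > threshold


if __name__ == "__main__":
    snippet = "#include <stdio.h>\nint main() { return 0; }"
    print(weighted_score(snippet), exceeds_threshold(snippet))
```

Ordinary prose also contains words such as "if" and "for", so in this sketch such tokens carry low weights and a single occurrence does not push a document over the threshold.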

Specific Embodiment 3

[0085] An employee who is about to resign wants to take some of the enterprise's core source code with them, so they batch-modify the suffixes and keywords in these source code files. When the files are transferred to a USB flash drive, the data security software inputs them into the device one by one for detection.

[0086] The file type detection and filtering module of the source code data detection device determines that each file is not a source code file but a document format that carries text, so the file is converted to text (txt) format for subsequent processing. Lexical analysis of the converted txt file finds that the sum of the weighted scores of the lexical tokens it contains does not exceed the discrimination threshold, so processing moves on to syntax analysis. In the syntax analysis stage, the file is found to contain multiple expressions, so it is judged to contain source code data. After the data security softwar...
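To illustrate why expression detection can still succeed after keywords have been renamed, here is a rough Python sketch. It uses regular-expression heuristics rather than a real parser, and the patterns, the function names, and the minimum count of three are assumptions for illustration, not the method described in the patent.

```python
import re

# Heuristic: "name = ...;" or "name += ...;" (but not the comparison "==").
ASSIGNMENT = re.compile(r"\b[A-Za-z_]\w*\s*(?:[+\-*/%]?=)(?!=)\s*[^;=]+;")
# Heuristic: "name(args);" with no nested parentheses.
CALL = re.compile(r"\b[A-Za-z_]\w*\s*\([^()]*\)\s*;")


def count_expressions(text: str) -> int:
    """Count statement-like expressions that survive keyword renaming."""
    return len(ASSIGNMENT.findall(text)) + len(CALL.findall(text))


def looks_like_source(text: str, min_expressions: int = 3) -> bool:
    """Flag the text when enough expression-like statements are found."""
    return count_expressions(text) >= min_expressions


if __name__ == "__main__":
    # Identifiers are obfuscated, but the expression structure is intact.
    obfuscated = "zz_total = zz_a + zz_b;\nzz_print(zz_total);\nzz_i += 1;\n"
    print(count_expressions(obfuscated), looks_like_source(obfuscated))
```

Such patterns can also fire on prose that happens to contain parentheses followed by a semicolon, so a count threshold (or a proper grammar) is needed to keep false positives down.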

Abstract

The invention discloses a multi-stage filtering source code data detection method and device. The device comprises a file type detection filtering module, a lexical analysis filtering module, a grammar analysis filtering module, a semantic analysis filtering module, a protection module for files containing source code data, and a non-source-code labeling module. The file type detection filtering module judges whether an input file is of a specified file type. The lexical analysis filtering module extracts the lexical tokens in the file, determines the corresponding weights, calculates the sum of the weighted scores, and judges whether that sum exceeds a specified threshold. The grammar analysis filtering module intercepts a text of a specified length from the file as a suspected text, extracts the grammatical phrases and expressions it contains, and judges how important those phrases and expressions are to source code. The semantic analysis filtering module extracts the semantic features of the text and performs similarity analysis between them and the semantic features of specified core source code. The protection module applies sensitive data protection to files that contain source program data, and the non-source-code labeling module labels a file as containing no source code. This scheme improves the accuracy of source code detection and strengthens the security protection of source code.
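The staged structure described in the abstract can be sketched as a cascade that runs cheaper checks first and stops at the first stage that flags the file. Everything below is a toy illustration under stated assumptions: the stage functions, the thresholds, the `CORE_FEATURES` set, and the use of Jaccard similarity as a stand-in for semantic similarity are hypothetical and are not taken from the patent.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[str, str], bool]]

# Hypothetical "core source code" features for the semantic stage.
CORE_FEATURES = {"encrypt", "key", "block"}


def file_type_hit(filename: str, text: str) -> bool:
    return filename.lower().endswith((".c", ".cpp", ".java"))


def lexical_hit(filename: str, text: str) -> bool:
    # Toy stand-in for the weighted token score.
    return text.count(";") + text.count("{") >= 4


def syntax_hit(filename: str, text: str) -> bool:
    # Toy stand-in for expression extraction.
    return "(" in text and ");" in text


def semantic_hit(filename: str, text: str) -> bool:
    # Toy stand-in: Jaccard similarity against the core feature set.
    words = set(text.lower().split())
    if not words:
        return False
    return len(CORE_FEATURES & words) / len(CORE_FEATURES | words) >= 0.5


DEFAULT_STAGES: List[Stage] = [
    ("file_type", file_type_hit),
    ("lexical", lexical_hit),
    ("syntax", syntax_hit),
    ("semantic", semantic_hit),
]


def detect(filename: str, text: str,
           stages: List[Stage] = DEFAULT_STAGES) -> Tuple[str, Optional[str]]:
    """Run the stages in order and stop at the first one that flags the file."""
    for name, check in stages:
        if check(filename, text):
            return ("protect", name)
    return ("label_non_source", None)


if __name__ == "__main__":
    print(detect("main.cpp", ""))
    print(detect("notes.txt", "x = f(y);"))
    print(detect("notes.txt", "hello world"))
```

Ordering the stages from cheapest to most expensive means the costly semantic comparison only runs on files that the earlier filters could not decide, which is in line with the stated aim of avoiding excessive computation.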

Description

Technical field

[0001] The invention relates to the technical field of source code data detection, in particular to a multi-stage filtering source code data detection method and device.

Background technique

[0002] For R&D and design enterprises, data such as design documents, drawings, and source code are core intellectual assets and the basis of the enterprise's competitiveness. Effective management and control of this core data is the top priority of enterprise information security. Because source code data exists in the form of text files or text fragments, it is easily mixed into or embedded in conventional text files, which may lead to loss, leakage, or uncontrolled diffusion and so endanger enterprise information security. Most of these data losses are caused by unintentional operations by internal personnel; a few are intentional leaks by internal personnel or malicious attacks from outside the enterprise. The occu...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F21/57
CPC: G06F21/57
Inventor: 邸宏宇, 王志海, 魏效征, 张静, 何晋昊, 喻波, 安鹏
Owner: 北京明朝万达科技股份有限公司