Code classification and clustering method based on convolution and recurrent neural network

A recurrent neural network and code classification technology, applied in the field of software engineering, that addresses the difficulty of extracting code features

Inactive Publication Date: 2021-02-02
NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

AI Technical Summary

Problems solved by technology

Many existing models are designed for program abstract syntax trees (ASTs) and can efficiently extract the structural information of code. However, when the tree is very large, extracting features directly from the entire AST inevitably makes some code features difficult to capture. In the final generated code vector



Embodiment Construction

[0034] To help those skilled in the art understand it, the present invention is further explained below with reference to the accompanying drawings; the content of the embodiments does not limit the present invention.

[0035] As shown in Figure 1, code classification includes the following steps:

[0036] (1) Cutting of abstract syntax tree

[0037] The AST of a program can be very large. To extract the features of the code fully, the AST is cut into subtrees. The pseudocode for cutting the AST is as follows:

[0038]

[0039]

[0040] Here, root is the root node of the AST, and dfs is a recursive depth-first traversal of the AST that stores the nodes, in traversal order, in the nodes array. Every node in the nodes array is then visited: if a node is of any type in the set {if, while, for}, the subtree rooted at that node is cut out and placed in the array T, and finally T except the first node...
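The cutting step described in [0040] can be illustrated with a minimal Python sketch. It is not the patent's pseudocode, which is not reproduced in this text. The `Node` class, the `kind` field, and the function name `cut_ast` are hypothetical, and two details are assumptions: the remaining tree rooted at root is stored as the first element of T, and a nested control-flow subtree is detached from its enclosing subtree rather than duplicated.

```python
from dataclasses import dataclass, field

# Node types at which the AST is cut, as named in [0040].
CUT_KINDS = {"if", "while", "for"}


@dataclass
class Node:
    """Hypothetical AST node: a type label plus child nodes."""
    kind: str
    children: list = field(default_factory=list)


def dfs(node, nodes):
    """Recursive depth-first traversal; appends nodes in visit order."""
    nodes.append(node)
    for child in node.children:
        dfs(child, nodes)


def cut_ast(root):
    """Cut every if/while/for subtree out of the AST.

    Returns T, where T[0] is what remains of the original tree
    (an assumption) and the rest are the cut subtrees in DFS order.
    """
    nodes = []
    dfs(root, nodes)
    # Map each child to its parent so a subtree can be detached.
    parent = {id(c): n for n in nodes for c in n.children}
    T = [root]
    for node in nodes:
        if node is not root and node.kind in CUT_KINDS:
            p = parent[id(node)]
            p.children = [c for c in p.children if c is not node]
            T.append(node)
    return T


if __name__ == "__main__":
    tree = Node("func", [
        Node("decl"),
        Node("for", [Node("if", [Node("return")]), Node("expr")]),
        Node("return"),
    ])
    for subtree in cut_ast(tree):
        print(subtree.kind, [c.kind for c in subtree.children])
```

In this sketch each cut subtree would then be encoded separately, in line with the subtree sequence described in the abstract.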


Abstract

The invention discloses a code classification and clustering method based on convolutional and recurrent neural networks. The experimental data set comprises 104 classes of code. The specific steps of code classification are as follows: generating the AST of the code using srcml; generating word vectors for the nodes using Word2vec from the flattened node sequence of the AST; cutting the AST into a sequence of subtrees made up of loop and conditional subtrees, where each subtree is encoded with a tree-based convolutional neural network (TBCNN); feeding the encoded vectors into a bidirectional LSTM; placing the code vector generated at each time step into a matrix; compressing the matrix with max pooling to obtain a single vector; mapping that vector through one fully connected layer to obtain a 104-dimensional code vector; and generating the label vector with one-hot encoding, with cross entropy as the loss function. For clustering, dimensionality is reduced with t-SNE and the codes are clustered with the K-means method. The classification accuracy can reach 97.8%, and the clustering accuracy can reach 83.6%.
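The last classification steps in the abstract (max pooling over the per-time-step vectors, one fully connected layer, one-hot labels, cross-entropy loss) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the weights are placeholders rather than trained values, a softmax is assumed between the fully connected layer and the loss, and 3 classes stand in for the 104 classes of the data set.

```python
import math


def max_pool(matrix):
    """Element-wise maximum over time steps: (steps x dim) -> dim."""
    return [max(column) for column in zip(*matrix)]


def dense(vec, weights, bias):
    """One fully connected layer: weights is (classes x dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax (assumed; not stated in the source)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Label vector with a 1.0 at the class index and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def cross_entropy(probs, target):
    """Cross-entropy loss between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


if __name__ == "__main__":
    # Stand-in for BiLSTM outputs: 3 time steps, 2 dimensions each.
    steps = [[1.0, 5.0], [4.0, 2.0], [3.0, 3.0]]
    pooled = max_pool(steps)
    # Placeholder (untrained) weights for 3 classes.
    weights = [[0.1, 0.0], [0.0, 0.1], [0.05, 0.05]]
    bias = [0.0, 0.0, 0.0]
    probs = softmax(dense(pooled, weights, bias))
    print(pooled, probs, cross_entropy(probs, one_hot(1, 3)))
```

Max pooling is what lets a variable number of subtrees per program collapse into one fixed-length vector before the fully connected layer.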

Description

Technical field

[0001] The invention belongs to the field of software engineering. It mainly relates to deep learning, static program analysis, big code data, and related technologies, and its main purpose is the automatic classification of labeled code and the automatic clustering of unlabeled code.

[0002] Technical background

[0003] Code classification and clustering has long been a research focus in software engineering, and this work can advance several sub-fields, such as program understanding, concept location, code plagiarism detection, vulnerability classification, and malware detection. To a certain extent, the label assigned by code classification expresses the internal semantics of the code, which can assist subsequent code modularization and refactoring and reduce the cost of software maintenance.

[0004] Unlike natural language, code has a strong logical structure. It is difficult to understand the specific function...


Application Information

IPC(8): G06F8/41, G06K9/62, G06N3/04, G06N3/08
CPC: G06F8/44, G06N3/049, G06N3/084, G06N3/045, G06F18/213, G06F18/23213, G06F18/24323
Inventors: 周宇, 史志成
Owner: NANJING UNIV OF AERONAUTICS & ASTRONAUTICS