Code programming language classification method using characterization information of each layer of CodeBert

A method for classifying the programming language of source code, applied in the field of computer technology. It addresses the insufficient performance and poor classification results of existing methods, improving classification performance and reducing the cost of manual classification.

Pending Publication Date: 2022-04-29
NANTONG UNIVERSITY

AI Technical Summary

Problems solved by technology

[0005] Most existing work builds classification models with traditional machine learning methods, such as random forest and naive Bayes classifiers, but these methods classify poorly. A few methods based on deep learning models (such as CNNs and RNNs) improve on traditional machine learning, yet they still do not achieve satisfactory performance.


Examples


Embodiment Construction

[0042] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. Of course, the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0043] Referring to Figure 1, this embodiment provides a method for classifying the programming language of code using the representation information of each CodeBert layer, comprising the following steps:

[0044] 1. Processing of corpus

[0045] 1.1 Process the original dataset. The original dataset covers 21 programming languages, including Bash, C, C#, C++ and CSS, but the data for some of these languages is too noisy. After removing those languages, the dataset used in this embodiment contains code snippets in 19 programming languages.

[0046] 1.2 Use the BPE algorithm to segment code s...
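As an illustration of the BPE segmentation named in step 1.2, the following is a minimal character-level sketch in Python. It is not the tokenizer used in the embodiment: CodeBert ships with its own pre-learned merge table, whereas this toy version learns merges from a tiny corpus only to show the mechanism. The function names `learn_bpe`, `merge_pair` and `segment` and the sample tokens are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` merge rules from a list of tokens (toy BPE)."""
    # Each distinct token starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(tok) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every token in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(token, merges):
    """Segment a token into subwords by replaying the learned merges in order."""
    symbols = tuple(token)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical sample: identifiers that share the subword "print".
sample_merges = learn_bpe(["print", "printf", "println", "int"], 5)
print(segment("sprintf", sample_merges))
```

Because merges are replayed in the order they were learned, a token unseen during learning (such as `sprintf` above) is still split into known subwords rather than mapped to an unknown-token placeholder.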


Abstract

The invention provides a CodeBert-based method for classifying the programming language of code, and belongs to the technical field of computer applications. The method comprises the following steps: (1) the original dataset is processed, noise is removed from it, and the code is segmented into subwords with the BPE method; (2) representation information is extracted from each layer of CodeBert to form a sequence of representations, and a bidirectional LSTM recurrent network (Bi-LSTM) combined with an attention mechanism focuses on the layers that provide important representation information; and (3) the constructed model is trained on the corpus to obtain a code programming language classification model. The beneficial effects are that the programming language of a piece of source code can be identified quickly, and the cost of manually classifying source code by programming language is reduced.
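The attention part of step (2) can be sketched as a softmax-weighted sum over per-layer vectors. The sketch below is a simplified illustration under stated assumptions, not the patented model: the Bi-LSTM stage is omitted, the layer vectors are hand-written stand-ins for real CodeBert outputs, and `query` is a hypothetical scoring vector that would be learned during training. The names `softmax` and `attend_over_layers` are likewise illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_layers(layer_vectors, query):
    """Pool one vector per layer into a single vector using attention weights.

    Each layer is scored by its dot product with `query` (an assumed,
    normally learned, parameter); the scores are normalized with softmax
    and used to form a weighted sum of the layer vectors.
    """
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in layer_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]
    return weights, pooled


# Hypothetical stand-ins for three layers' representations of one code snippet.
layers = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
weights, pooled = attend_over_layers(layers, query=[1.0, 0.5])
print(weights, pooled)
```

Layers whose vectors align more closely with `query` receive larger weights, which is one way a model can emphasize the layers that carry the most useful information for the classification task.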

Description

Technical field

[0001] The invention relates to the field of computer technology, and in particular to a method for classifying the programming language of code using the representation information of each CodeBert layer.

Background

[0002] During software development and maintenance, programmers often run into problems that are difficult to solve. Posting on programmers' question-and-answer websites has become a mainstream way to seek solutions. However, programmers usually use different programming languages (such as Java, Python, C# and JavaScript) for different development tasks, and in most cases problems in different programming languages require different solutions. Therefore, when a developer publishes a post on a question-and-answer website (such as Stack Overflow), marking the programming language used can help the questioner find the corresponding solution quickly, and the websi...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06K9/62, G06F40/289, G06F16/35, G06F8/41, G06N3/08
CPC: G06N3/084, G06F40/289, G06F8/436, G06F16/353, G06F18/241
Inventor: 陈翔, 刘珂, 杨光, 曲豫宾, 周彦琳, 夏鸿崚, 顾亚锋, 于池
Owner NANTONG UNIVERSITY