Transformer-based code programming language classification method

A programming language classification technology, applied in the computer field, that addresses the performance bottlenecks and poor classification results of existing methods, improving classification accuracy while remaining easy to implement.

Pending Publication Date: 2021-07-20
NANTONG UNIVERSITY
0 Cites, 4 Cited by
AI Technical Summary

Problems solved by technology

[0003] Existing work has built classification models with machine learning methods such as naive Bayes or random forest classifiers, but these traditional machine learning methods hit performance bottlenecks and classify poorly.
A few classification methods based on deep learning models (CNN, RNN) improve on traditional machine learning methods, but their results are still unsatisfactory.


Examples

Embodiment 1

[0041] As shown in Figure 1, the present invention provides a Transformer-based code programming language classification method, which specifically includes the following steps:

[0042] (1) Collect the content of question and answer posts on Stack Overflow and organize the data set into pairs of code fragments and their corresponding language types; the data set contains 224445 such pairs;

[0043] (2) Use the BPE algorithm to segment each code fragment as text: split the words and symbols in the code fragment into character sequences and append the suffix "" to the end of each, which reduces the number of "[UNK]" symbols in the training set. By segmenting code fragments in this way, the BPE algorithm effectively addresses the OOV (Out-Of-Vocabulary) problem when the model is evaluated on the test set;

[0044] (3) Split the data set into a training set and a validation set at a ratio of 4:1, giving 179556 training samples and 44889 validation samples; according to the identification of language t...
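Steps (2) and (3) can be illustrated with a minimal sketch. It is a toy version of BPE merge learning plus a shuffled 4:1 split, not the implementation used in the patent. The end-of-word marker `</w>`, the function names, the merge count, and the random seed are all assumptions made for illustration; the source does not show its suffix string or say how the split was drawn.

```python
import random
from collections import Counter

END = "</w>"  # assumed end-of-word suffix; the source does not show its own


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Each word starts as its characters followed by the end marker.
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def split_4_to_1(items, seed=0):
    """Shuffle and split into training and validation sets at a 4:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5  # integer arithmetic avoids float rounding
    return items[:cut], items[cut:]


if __name__ == "__main__":
    merges = learn_bpe(["print", "println", "printf"], num_merges=4)
    print(segment("printer", merges))  # unseen word, still no unknown token
    train, val = split_4_to_1(range(224445))
    print(len(train), len(val))
```

Because every word falls back to single characters when no merge applies, an unseen identifier is still segmented into known pieces rather than mapped to "[UNK]". Applied to 224445 samples, the integer 4:1 cut yields the 179556 and 44889 set sizes stated in step (3).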

Abstract

The invention provides a Transformer-based code programming language classification method comprising the following steps: (1) collecting question and answer posts from Stack Overflow as a data set and preprocessing the data in the original data set; (2) segmenting the data with BPE and applying word embedding to the segmented data to convert words into vectors; (3) fine-tuning the RoBERTa model on the constructed data set by feeding the generated word vectors into the RoBERTa model, learning code semantics through a double-layer Transformer encoder, and generating a semantic representation vector Xsemantic; and (4) mapping the semantic vector Xsemantic to programming language category labels through a linear layer and obtaining the corresponding programming language through the Softmax algorithm. The beneficial effect of the method is that the language of a code snippet can be identified quickly, which helps developers find solutions on question-and-answer websites faster.
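Step (4) of the abstract can be sketched as follows. This is a minimal illustration in plain Python, assuming a tiny hand-written weight matrix in place of the fine-tuned parameters; the label list, the function names, and the two-dimensional input vector are hypothetical and chosen only to keep the example readable.

```python
import math

# Illustrative label set; the languages are taken from the examples named in
# the description, not from the patent's actual label inventory.
LABELS = ["Java", "Python", "C#", "C"]


def linear(x, weights, bias):
    """Linear layer: one logit per label, logits = W x + b."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x_semantic, weights, bias, labels=LABELS):
    """Map a semantic vector to the most probable language label."""
    probs = softmax(linear(x_semantic, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    # Hypothetical 2-dimensional semantic vector and hand-picked parameters.
    x = [1.0, 0.0]
    W = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    b = [0.0, 0.0, 0.0, 0.0]
    label, probs = classify(x, W, b)
    print(label, [round(p, 3) for p in probs])
```

In a real model the semantic vector would come from the encoder and the weights from fine-tuning; only the final mapping from vector to label is shown here.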

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a Transformer-based method for classifying code programming languages.

Background

[0002] In the software development cycle, different development tasks usually use different programming languages (such as Java, Python, C#, and C). In most cases, problems in different programming languages require different solutions. During software development, programmers often encounter various problems, and posting on question-and-answer websites to seek solutions has become the mainstream approach. Therefore, when a developer asks a question on a question-and-answer website (such as Stack Overflow), the website needs to mark the language type so that the corresponding solution can be found quickly. Stack Overflow relies on the language tag of the source code in a post to match users who can provide answers. However, new users or novice develop...

Application Information

IPC(8): G06F16/35, G06F40/30, G06F40/289, G06N3/04, G06N3/08
CPC: G06F16/35, G06F40/30, G06F40/289, G06N3/08, G06N3/045, Y02D10/00
Inventor: 于池, 陈翔, 周彦琳, 杨光, 刘珂
Owner NANTONG UNIVERSITY