Biological sequence feature extraction method based on word embedding and auto-encoder fusion

Autoencoder and biological sequence technology, applied in neural learning methods, biological neural network models, instruments, and similar fields, that addresses problems such as ignoring positional dependencies between bases.

Pending Publication Date: 2021-09-14
SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Existing methods that use 6-mer frequency information as the input of a deep learning model capture only the composition of a biological sequence and ignore the positional dependence between bases.
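The limitation described above can be illustrated with a small sketch. It uses 2-mers instead of 6-mers purely to keep the example short, and the helper names `kmers` and `kmer_frequencies` are hypothetical, not taken from the patent. Two different sequences produce identical k-mer frequency tables, so a model that sees only the frequencies cannot tell them apart.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq in order (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(seq, k):
    """Count how often each k-mer occurs, discarding where it occurs."""
    return Counter(kmers(seq, k))


# Two distinct sequences built from the same four 2-mers in a different order.
seq_a = "ACAGA"  # AC, CA, AG, GA
seq_b = "AGACA"  # AG, GA, AC, CA

same_composition = kmer_frequencies(seq_a, 2) == kmer_frequencies(seq_b, 2)
same_order = kmers(seq_a, 2) == kmers(seq_b, 2)

print(same_composition, same_order)  # prints: True False
```

The frequency tables match while the ordered k-mer lists do not, which is the positional information a composition-only input discards.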

Embodiment Construction

[0022] Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangements of components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

[0023] The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way to be taken as limiting the invention, its application or uses.

[0024] Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the description.

[0025] In all examples shown and discussed herein, any specific values should be construed as exemplary only, and not as limitations. Therefore, other instances of the exemplary embodiment may have dif...

Abstract

The invention discloses a biological sequence feature extraction method based on the fusion of word embedding and an auto-encoder. The method comprises constructing a representation model and a compression model, where the representation model comprises a word embedding network and the compression model is an auto-encoder comprising an encoder and a decoder. The two models are trained jointly, with minimization of a set total loss function as the optimization target. The word embedding network takes the set of short k-mer sequences as input, masks some of the k-mers, associates each k-mer with its context in the sequence, and learns an embedding vector for each k-mer, yielding the embedding information for the k-mers that make up the sequence. The encoder of the compression model converts the embedding information into a low-dimensional feature vector, and the decoder reconstructs the k-mer embeddings of the sequence from it and outputs a reconstruction vector. The reconstruction vector is then used to classify the masked k-mers in the sequence. The method achieves an efficient characterization of biological sequences and ensures the accuracy of subsequent classification.
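The data flow in the abstract can be sketched roughly as follows. This is a minimal, untrained illustration under several assumptions that do not come from the patent: the mask token name `MASK`, the masking rate, the layer sizes, the use of single linear maps for the encoder, decoder and classifier, and a plain lookup table standing in for the word embedding network (its context-association step and the total loss are omitted because the source does not specify them here).

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the source does not name one


def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate, rng):
    """Hide a random subset of k-mers; return the masked list and the hidden targets."""
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked, targets


def random_matrix(rows, cols, rng):
    """Untrained weights, used only to show the shapes flowing through the model."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, matrix):
    """Multiply a row vector of length rows by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


rng = random.Random(0)
tokens = kmer_tokens("ACGTACGGTCA", 3)
masked, targets = mask_tokens(tokens, mask_rate=0.2, rng=rng)

vocab = sorted(set(tokens)) + [MASK]
index = {tok: i for i, tok in enumerate(vocab)}
embed_dim, code_dim = 8, 3  # assumed sizes; code_dim < embed_dim gives compression

embedding = random_matrix(len(vocab), embed_dim, rng)  # representation model stand-in
enc_w = random_matrix(embed_dim, code_dim, rng)        # encoder
dec_w = random_matrix(code_dim, embed_dim, rng)        # decoder
cls_w = random_matrix(embed_dim, len(vocab), rng)      # masked k-mer classifier

embedded = [embedding[index[t]] for t in masked]   # one vector per k-mer
codes = [matvec(e, enc_w) for e in embedded]       # low-dimensional features
recon = [matvec(c, dec_w) for c in codes]          # reconstructed embeddings
logits = {pos: matvec(recon[pos], cls_w) for pos in targets}  # scores per masked k-mer
```

In a trained system the low-dimensional `codes` would serve as the extracted sequence features, and the weights would be fitted jointly rather than drawn at random.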

Description

Technical field

[0001] The present invention relates to the technical field of computer applications, and more specifically to a biological sequence feature extraction method based on the fusion of word embedding and an autoencoder.

Background art

[0002] With the development of sequencing technology, the number of biological sequences is increasing exponentially. How to better understand and recognize biological sequences has become a hot topic in bioinformatics. A biological sequence is a kind of high-dimensional time series data, and extracting key features that represent the whole sequence is one of the most important and basic tasks in the field. In traditional biological sequence research, the common method is sequence alignment: the query sequence is compared with existing sequences in a database to obtain sequence similarity and annotation information. However, this method is slow and costly, and due to t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06N3/04, G06N3/08
CPC: G06N3/08, G06N3/045, G06F18/211, G06F18/2415
Inventor: 杨金, 蔡云鹏, 肖瑞
Owner: SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI