Methods, devices, electronic equipment, and storage media for recognizing facial expression images
By combining principal component analysis, a convolutional neural network, and Poisson-distribution pulse frequency encoding with an improved leaky integrate-and-fire (LIF) neuron model, a multilayer spiking neural network is constructed. This addresses the high energy consumption and insufficient interpretability of facial expression recognition for customer service robots, achieving low-energy, efficient emotion recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- AGRICULTURAL BANK OF CHINA
- Filing Date
- 2022-10-20
- Publication Date
- 2026-06-30
Figure CN115482578B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image recognition technology, and in particular to a method, apparatus, electronic device and storage medium for recognizing facial expression images.
Background Technology
[0002] With the development of artificial intelligence technology, various intelligent robots are gradually permeating people's work, such as customer service robots. Although customer service robots can alleviate some of the workload of human customer service representatives, they still lag far behind human customer service in terms of customer experience because they cannot effectively perceive customer emotions and provide corresponding personalized customer guidance.
[0003] Facial expression recognition can be further subdivided by data format and by expression definition type. By data format, it divides into image-based facial expression recognition and (audio-)video-based facial expression recognition. By expression definition type, it divides into facial expression recognition based on discrete labels, on continuous models, and on facial action unit systems. While deep learning has achieved good results in this field, significant limitations remain. On one hand, it lacks interpretability and flexibility, hindering knowledge transfer and autonomous learning. On the other hand, in practical applications, deep neural networks are complex, computationally intensive, and energy-intensive. Customer service robots, due to their operational nature, need to provide 24/7 service at branches, requiring sufficient battery life. Therefore, a low-energy-consumption facial expression recognition algorithm that combines interpretability and flexibility is urgently needed.
Summary of the Invention
[0004] This application provides a method, apparatus, electronic device, and storage medium for recognizing facial expression images, which can combine better interpretability and flexibility, while greatly reducing network power consumption and improving computing power, making it suitable for more intelligent small robots such as customer service robots.
[0005] Firstly, this application provides a method for recognizing facial expression images, the method comprising:
[0006] The original vector of the facial expression image is determined based on the pixel values of the facial expression image, and the target vector of the facial expression image is obtained by performing principal component analysis on the original vector.
[0007] The target vector is input into a feature extraction network to obtain the feature information of the facial expression image;
[0008] The pulse sequence of the facial expression image is obtained by pulse frequency encoding of the feature information based on the Poisson distribution;
[0009] The pulse sequence is input into the expression recognition model to obtain the recognition result of the expression image. The expression recognition model is a multilayer spiking neural network. The spiking neuron model in the multilayer spiking neural network is an improvement based on the leaky integrate-and-fire (LIF) neuron model.
[0010] Secondly, this application provides a facial expression image recognition device, the device comprising:
[0011] The target vector determination module is used to determine the original vector of the expression image based on the pixel values of the expression image, and to perform principal component analysis on the original vector to obtain the target vector of the expression image.
[0012] The feature information determination module is used to input the target vector into the feature extraction network to obtain the feature information of the expression image;
[0013] A pulse sequence determination module is used to encode the feature information based on a Poisson distribution to obtain the pulse sequence of the facial expression image.
[0014] The facial expression image recognition module is used to input the pulse sequence into the facial expression recognition model to obtain the recognition result of the facial expression image. The facial expression recognition model is a multilayer spiking neural network, in which the spiking neuron model is an improvement based on the leaky integrate-and-fire (LIF) neuron model.
[0015] Thirdly, this application provides an electronic device comprising:
[0016] At least one processor; and
[0017] A memory communicatively connected to the at least one processor; wherein,
[0018] The memory stores a computer program that can be executed by the at least one processor, which enables the at least one processor to perform the facial expression image recognition method described in any embodiment of this application.
[0019] Fourthly, this application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the facial expression image recognition method described in any embodiment of this application.
[0020] This application provides a method for recognizing facial expression images, including: determining the original vector of the expression image based on the pixel values of the expression image; performing principal component analysis on the original vector to obtain the target vector of the expression image; inputting the target vector into a feature extraction network to obtain feature information of the expression image; encoding the feature information by pulse frequency based on a Poisson distribution to obtain a pulse sequence of the expression image; and inputting the pulse sequence into an expression recognition model to obtain the recognition result of the expression image. The expression recognition model is a multilayer spiking neural network, in which the spiking neuron model is an improvement based on the leaky integrate-and-fire (LIF) neuron model. This application introduces principal component analysis and convolutional layers into the network architecture, so that the extracted features concentrate on the most critical information. In the design of the network architecture and spiking neuron model, compared with traditional artificial neural network models, the multilayer spiking neural network follows the hierarchical structure of the visual cortex of the brain, resulting in better interpretability and flexibility. Furthermore, this application improves upon the LIF model to obtain a spiking neuron model that preserves the firing pattern of real biological neurons while minimizing the computational load of the model. It achieves a good balance between biomimicry and model complexity, resulting in high real-time performance. Due to the sparse pulse firing pattern of spiking neurons, network energy consumption is greatly reduced while computing power is improved, making the method suitable for more intelligent small robots such as customer service robots.
[0021] It should be noted that the aforementioned computer instructions may be stored, in whole or in part, on a computer-readable storage medium. This computer-readable storage medium may be packaged together with the processor of the facial expression image recognition device, or it may be packaged separately from the processor of the facial expression image recognition device; this application does not impose any limitations on this.
[0022] The descriptions of the second, third, and fourth aspects in this application can be referenced to the detailed description of the first aspect; and the beneficial effects described in the second, third, and fourth aspects can be referenced to the analysis of the beneficial effects of the first aspect, which will not be repeated here.
[0023] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.
[0024] It is understood that before using the technical solutions disclosed in the various embodiments of this application, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this application in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.
Description of the Drawings
[0025] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0026] Figure 1 is a schematic diagram of the network architecture provided in the embodiments of this application;
[0027] Figure 2 is a schematic diagram of the first process of the facial expression image recognition method provided in the embodiments of this application;
[0028] Figure 3 is a schematic diagram of the second process of the facial expression image recognition method provided in the embodiments of this application;
[0029] Figure 4 is the equivalent circuit of the LIF model provided in the embodiments of this application;
[0030] Figure 5 is a schematic diagram of the structure of the facial expression image recognition device provided in the embodiments of this application;
[0031] Figure 6 is a block diagram of an electronic device used to implement the facial expression image recognition method of the embodiments of this application.
Detailed Implementation
[0032] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this application.
[0033] It should be noted that the terms "first," "second," "target," and "original," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0034] Before introducing the embodiments of this application, the network architecture for facial expression image recognition in this application is explained. Figure 1 is a schematic diagram of the network architecture provided in an embodiment of this application. The network architecture includes an input layer, a representation layer, an encoding layer, and a perception layer. From the input layer to the perception layer, information is progressively extracted, abstracted, and perceived, gradually forming a stable representation to achieve facial expression recognition. Information flows in both directions between the layers: as shown in Figure 1, in addition to the bottom-up flow, guidance passes down from the perception layer to the representation layer, supervising and guiding the lower layers. The input layer processes visual information (i.e., raw feature information), removing non-critical and redundant information to a certain extent; this can be seen as a preprocessing step, essentially data dimensionality reduction. The representation layer further abstracts to extract key features, obtaining feature maps that effectively simulate the human visual receptive field. The encoding layer encodes the feature maps output by the representation layer into a pulse sequence (i.e., spatiotemporal information) that the perception layer can process. The perception layer processes and integrates the pulse sequence to achieve emotion recognition. Furthermore, the perception layer can provide feedback to the representation layer through guidance information.
[0035] Figure 2 is a schematic diagram of the first process of a facial expression image recognition method provided in an embodiment of this application. This embodiment is applicable to a customer service scenario in which an intelligent robot recognizes a customer's facial expressions and provides better service based on the customer's emotions. The facial expression image recognition method provided in this embodiment can be executed by the facial expression image recognition device provided in the embodiments of this application. This device can be implemented through software and/or hardware and integrated into the electronic device executing the method. Preferably, the electronic device in this embodiment can be an intelligent robot, such as a customer service robot.
[0036] Referring to Figure 2, the method in this embodiment includes, but is not limited to, the following steps:
[0037] S110. Determine the original vector of the facial expression image based on the pixel values of the facial expression image, and perform principal component analysis on the original vector to obtain the target vector of the facial expression image.
[0038] In this embodiment, this step corresponds to the input layer in the network architecture. The intelligent robot acquires multiple facial expression images of the customer through an image acquisition device, forming an image set, and analyzes each expression image to determine its pixel values. For example, an image of w×h pixels is flattened into a vector of dimension m = w×h. The pixel values of the multiple expression images form the original vector X, which can be written as $X = [X_1, X_2, \ldots, X_p]$, where $X_a$ represents the vector corresponding to the a-th facial expression image in the original vector.
[0039] Furthermore, performing principal component analysis on the original vector to obtain the target vector of the facial expression image includes: preprocessing each vector in the original vector X (for example, by mean normalization) to obtain the intermediate vector $X'_a = X_a - \mu$, where μ represents the mean of the vectors in the image set, p is the number of facial expression images in the image set, and a is the index of the facial expression image in the image set. To approximate the original vector as closely as possible, principal component analysis reduces the task to finding the eigenvalues and eigenvectors of a matrix. First, the covariance matrix of the intermediate vector is determined. Then, the eigenvalues and eigenvectors of the covariance matrix are calculated; a predetermined number e of target eigenvectors that meet the target requirements (e.g., those with the largest eigenvalues) are selected according to the magnitude of their eigenvalues, denoted $d_a$, a = 1, 2, ..., e. The target eigenvectors form the projection matrix $D = [d_1, d_2, \ldots, d_e]$. Finally, the projection matrix is applied to the intermediate vector to obtain the target vector, $X'' = D^T X'$. The target vector X″ is used as the input to the representation layer.
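The PCA steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the applicant's implementation: the column-per-image layout, the function name `pca_project`, and the random 4×4 "images" are all assumptions made for the example.

```python
import numpy as np

def pca_project(X, e):
    """Project column-stacked image vectors onto the top-e principal components.

    Assumed layout: X has shape (m, p), one column per image, m = w*h pixels.
    """
    mu = X.mean(axis=1, keepdims=True)             # mean image vector, shape (m, 1)
    X_centered = X - mu                            # X'_a = X_a - mu for every column a
    cov = X_centered @ X_centered.T / X.shape[1]   # covariance matrix, shape (m, m)
    eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:e]          # indices of the e largest eigenvalues
    D = eigvecs[:, order]                          # projection matrix D = [d_1, ..., d_e]
    return D.T @ X_centered, D                     # X'' = D^T X', shape (e, p)

rng = np.random.default_rng(0)
X = rng.random((16, 10))        # ten hypothetical 4x4 images flattened to m = 16
X_proj, D = pca_project(X, e=3)
```

Here `X_proj` has one e-dimensional column per image, and the columns of `D` are orthonormal, so the projection keeps the directions of largest variance while discarding the rest.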
[0040] The principal component analysis (PCA) method in this application can fully simulate the processing of visual information by the primary visual cortex, and can eliminate unimportant and redundant information to a certain extent; it can also reduce the dimensionality of image datasets and remove correlations. PCA uses variance as a measure of information, and can achieve dimensionality reduction of input data in an unsupervised manner without guidance signals from higher layers.
[0041] Alternatively, in addition to principal component analysis for dimensionality reduction of the image dataset, t-distributed stochastic neighbor embedding (t-SNE) or an auto-encoder can also be used.
[0042] S120. Input the target vector into the feature extraction network to obtain the feature information of the facial expression image.
[0043] The feature extraction network is trained based on a convolutional neural network; the convolutional neural network includes five convolutional layers and two pooling layers; the convolutional parameters of the first two convolutional layers are different from those of the last three convolutional layers.
[0044] In this embodiment, this step is the representation layer in the network architecture. Referring to the brain's visual processing mechanism, an artificial neural network model is used to establish the representation layer in the architecture. This application uses a Convolutional Neural Network (CNN) to simulate the visual nervous system, emphasizing the overall interaction between neurons and modeling at the network structure level, which can fully simulate the human visual receptive field.
[0045] In this embodiment, the CNN includes convolutional layers and pooling layers. The convolutional layers filter the input target vector through convolutional kernel operations to extract local feature information, as shown in Equation (1):
[0046] $x_i^n = f\left(\sum_{j \in M^n} x_j^{n-1} * w_{ij}^n + b_i^n\right)$ (1)
[0047] In the formula, * denotes the convolution operation; $x_i^n$ represents the i-th feature value of the n-th layer of the convolutional neural network; $x_j^{n-1}$ represents the j-th feature value of the (n-1)-th layer; $w_{ij}^n$ represents the convolution kernel weights between the j-th feature in the (n-1)-th layer and the i-th feature in the n-th layer; $b_i^n$ represents the bias of the i-th node in the n-th layer; $M^n$ represents the receptive field corresponding to the n-th layer, that is, the region of the input image corresponding to this feature node; f(·) is a non-linear activation function, which can optionally be the rectified linear unit (ReLU), f(x) = max(0, x).
[0048] Besides the convolution kernel, the convolution operation includes three key parameters: kernel size, padding size, and stride. The feature map extracted after using a two-dimensional convolution kernel not only contains the values of the feature points but also the spatial information between them. In this invention, the convolutional neural network includes five convolutional layers with two sets of convolutional parameters. The first two layers use a 5×5 kernel with a padding size of 2 and a stride of 1. The last three layers use a 3×3 kernel with a padding size of 1 and a stride of 1. After these two convolutional operations, the size of the feature map remains unchanged. By stacking these two small-scale convolutional kernels, large-scale convolutional kernels can be replaced, further reducing the number of model parameters.
[0049] In facial expression images, the values of neighboring pixels tend to be similar, and the output obtained after convolutional layers also contains a large amount of redundant information. Therefore, the convolutional neural network in this invention also includes two pooling layers, which are placed after the first two convolutional layers. The purpose of using pooling layers is to further abstract the features to remove redundant information. Optionally, max pooling can be used to obtain the largest feature point in the receptive field, which can highlight key information. The parameters of the pooling layer are: pooling size of 2 and stride of 2, that is, after each pooling layer, the length and width of the feature map are reduced to half of their original size, which can reduce feature mapping and reduce network complexity.
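The effect of these convolution and pooling parameters on feature-map size can be checked with a short calculation. The sketch below uses the usual output-size formula, floor((size + 2·padding − kernel) / stride) + 1. It assumes one 2×2 pooling layer directly after each of the first two convolutional layers, which is one reading of the text, and a hypothetical 48×48 input.

```python
def conv_out(size, kernel, padding, stride):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after pooling with no padding."""
    return (size - kernel) // stride + 1

def trace_sizes(size):
    """Feature-map side length after each layer, for the assumed layer order."""
    sizes = []
    # Assumption: one 2x2 pooling layer directly after each of the first two conv layers.
    for _ in range(2):
        size = conv_out(size, kernel=5, padding=2, stride=1)
        sizes.append(("conv5x5", size))
        size = pool_out(size)
        sizes.append(("pool2x2", size))
    for _ in range(3):
        size = conv_out(size, kernel=3, padding=1, stride=1)
        sizes.append(("conv3x3", size))
    return sizes

layers = trace_sizes(48)   # 48x48 is a hypothetical input size
```

Under these assumptions both convolution settings leave the side length unchanged, and each pooling layer halves it, so a 48×48 input ends as a 12×12 feature map.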
[0050] S130. Based on the Poisson distribution, pulse frequency encoding is performed on the feature information to obtain the pulse sequence of the facial expression image.
[0051] In this embodiment, this step is the encoding layer in the network architecture. The feature information output by the convolutional neural network of the representation layer is a series of analog quantities, while the perception layer in step S140 below uses a sparse pulse sequence for facial expression image recognition. Therefore, it is necessary to use the encoding layer to encode the feature information of the analog quantities into a pulse sequence that can be processed by the perception layer.
[0052] In traditional artificial neural networks, the signals transmitted are all analog quantities, so the information can be viewed as encoded in the form of pulse frequencies. This encoding method is the dominant method of information encoding in artificial neural networks. Existing pulse frequency-based encoding methods linearly map the intensity of the input stimulus to a certain interval as the firing frequency of neurons, thereby encoding external stimuli as pulse sequences. However, the ideal situation for this existing encoding is to obtain a pulse sequence with a constant firing period, but the firing of neurons in the cerebral cortex is irregular.
[0053] To address this problem, this invention employs pulse frequency encoding based on Poisson coding. It treats the firing of a neuron at any given time within a certain time range as a random discharge following a Poisson distribution, so that the irregularity of neuronal firing timing can be viewed as noise. Specifically, performing pulse frequency encoding on the feature information based on a Poisson distribution to obtain the pulse sequence of a facial expression image includes: discretizing the time information in the feature information into multiple time points according to the sampling time step; determining the neuronal firing probability at each time point based on the Poisson distribution; and performing pulse frequency encoding on the feature information based on the neuronal firing probability at each time point to obtain the pulse sequence.
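The three encoding steps above can be illustrated with a small NumPy sketch. It is a sketch under stated assumptions rather than the encoding used in the application: the features are scaled linearly to [0, 1], the firing probability in a step of length dt is approximated as rate × dt, and the names `poisson_encode` and `max_rate` are hypothetical.

```python
import numpy as np

def poisson_encode(features, duration, dt, max_rate, rng):
    """Encode non-negative feature values as Poisson spike trains.

    Returns a boolean array of shape (num_steps, *features.shape).
    """
    features = np.asarray(features, dtype=float)
    peak = features.max()
    scaled = features / peak if peak > 0 else features   # assumed linear map to [0, 1]
    num_steps = int(round(duration / dt))                # discretized time points
    # For a Poisson process with rate r, P(spike in a step of length dt) is about r * dt.
    p_fire = np.clip(scaled * max_rate * dt, 0.0, 1.0)
    return rng.random((num_steps,) + features.shape) < p_fire

rng = np.random.default_rng(0)
spikes = poisson_encode([0.0, 0.5, 1.0], duration=1.0, dt=0.001, max_rate=100.0, rng=rng)
rates = spikes.mean(axis=0) / 0.001   # empirical firing rates in Hz
```

Stronger features produce denser but still irregular spike trains, and a zero feature never fires, which is the sparse behaviour the perception layer relies on.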
[0054] S140. Input the pulse sequence into the facial expression recognition model to obtain the recognition result of the facial expression image.
[0055] In this embodiment, this step is the perception layer in the network architecture. The facial expression recognition model is a multilayer spiking neural network. The multilayer spiking neural network designed in this invention is composed of feedforward connections of spiking neurons, i.e., a multilayer feedforward spiking neural network. In the multilayer spiking neural network, synapses only exist between neurons across layers; there are no connections between neurons within the same layer. They receive input from the previous layer, and their output is then passed to the next layer. The spiking neuron model in the multilayer spiking neural network is an improvement based on the Leaky Integrate & Fire (LIF) model, which can be called the iterative adaptive Leaky Integrate & Fire neuron model. The pulse sequence of the facial expression image encoded in step S130 is input into the facial expression recognition model, and the recognition result of the facial expression image is determined based on the output result of the facial expression recognition model.
[0056] The iterative adaptive leaky integrate-and-fire neuron model proposed in this step preserves the firing pattern of real biological neurons while minimizing the computational load of the model. It strikes a good balance between biomimicry and model complexity, resulting in high real-time performance. Due to the sparse pulse firing pattern of spiking neurons, network energy consumption is greatly reduced while computing power is improved, making the method suitable for more intelligent small robots such as customer service robots.
[0057] The technical solution provided in this embodiment determines the original vector of the facial expression image based on the pixel values, performs principal component analysis on the original vector to obtain the target vector of the facial expression image, inputs the target vector into a feature extraction network to obtain the feature information of the facial expression image, encodes the feature information using pulse frequency based on a Poisson distribution to obtain the pulse sequence of the facial expression image, and inputs the pulse sequence into a facial expression recognition model to obtain the recognition result of the facial expression image. The facial expression recognition model is a multilayer spiking neural network, in which the spiking neuron model is an improvement based on the leaky integrate-and-fire (LIF) neuron model. This application introduces principal component analysis and convolutional layers into the network architecture, so that the extracted features concentrate on the most critical information. In the design of the network architecture and spiking neuron model, compared with traditional artificial neural network models, the multilayer spiking neural network follows the hierarchical structure of the visual cortex of the brain, thus possessing better interpretability and flexibility. Furthermore, this application improves upon the LIF model to obtain a spiking neuron model that preserves the firing pattern of real biological neurons while minimizing the computational load of the model. It achieves a good balance between biomimicry and model complexity, resulting in high real-time performance. Due to the sparse pulse firing pattern of spiking neurons, network energy consumption is greatly reduced while computing power is improved, making the method suitable for more intelligent small robots such as customer service robots.
[0058] The facial expression image recognition method provided in the embodiments of this application is further described below. Figure 3 is a schematic diagram of the second process of the facial expression image recognition method provided in an embodiment of this application. This embodiment is an optimization based on the above embodiment; specifically, it explains in detail the design process of the expression recognition model. The expression recognition model is a multilayer spiking neural network, in which the spiking neuron model is an improvement upon the LIF model.
[0059] Referring to Figure 3, the method in this embodiment includes, but is not limited to, the following steps:
[0060] S210. Convert the membrane potential differential equation in the LIF model into a membrane potential difference equation.
[0061] The membrane potential difference equation is used to represent the relationship between the membrane potential of the current layer neuron at the current moment and the membrane potential at the previous moment.
[0062] In this embodiment, the LIF neuron model uses a first-order linear differential equation to describe the dynamic changes in neuronal membrane potential, which balances the model's neural computational characteristics against its complexity. Figure 4 shows the equivalent circuit of the LIF model, in which a biological neuron is abstracted as an RC circuit. In this step, the resting potential in Figure 4 is first assumed to be $u_{rest} = 0$ mV; then, by approximating the derivative with a difference quotient, the first-order linear differential equation of the LIF equivalent circuit (i.e., the membrane potential differential equation) is transformed into the membrane potential difference equation.
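The discretization step can be sketched as follows. This is a hedged worked derivation: the circuit equation in the first line is the textbook LIF form and is assumed here, since the equation belonging to Figure 4 is not reproduced in this text. The reset behaviour of the neuron is left out.

```latex
\begin{align}
\tau \frac{du}{dt} &= -\bigl(u - u_{rest}\bigr) + R_m I(t), \qquad \tau = R_m C_m \\
\frac{du}{dt} &\approx \frac{u(k) - u(k-1)}{\Delta t}, \qquad u_{rest} = 0 \\
u(k) &\approx \left(1 - \frac{\Delta t}{\tau}\right) u(k-1) + \frac{\Delta t}{\tau}\, R_m I(k)
\end{align}
```

Under this assumption, the previous membrane potential is scaled by $1 - \Delta t/\tau$ and the input term by $\Delta t/\tau$, with $R_m$ absorbed into the input current if the input is expressed in potential units.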
[0063] The membrane potential difference equation is expressed by the following formula (2):
[0064]
[0065] In the formula, $u_i^n(k)$ represents the membrane potential of the current layer neuron at the current moment; $u_i^n(k-1)$ represents the membrane potential of the current layer neuron at the previous moment; $o_i^n(k-1)$ represents the output current of the current layer neuron at the previous moment, i.e., the second output current in step S220; $I_i^n(k)$ represents the input current of the current layer neuron at the current moment, i.e., the second input current in step S230; $\beta = 1 - \Delta t/\tau$ is the membrane potential decay coefficient; $\lambda = \Delta t/\tau$ is the input enhancement coefficient; $\Delta t$ is the sampling time step from $u(k-1)$ to $u(k)$; the membrane potential decay time constant is $\tau = R_m C_m$, where $R_m$ is the equivalent resistance and $C_m$ is the equivalent capacitance. The superscript n is the layer index of the neuron, with n denoting the current layer; the subscript i is the index of the neuron within the n-th layer; k is the index of the moment in the pulse sequence, with k denoting the current moment and k-1 the previous moment.
[0066] As can be seen from formula (2), in a multilayer spiking neural network, for the current layer neuron, the membrane potential at the current moment is determined by the membrane potential at the previous moment, the output current at the previous moment (i.e., the second output current in step S220), and the input current at the current moment (i.e., the second input current in step S230).
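The dependence described above can be illustrated with a short simulation. Formula (2) itself is not reproduced in this text, so the sketch assumes a common hard-reset form, u(k) = β·u(k−1)·(1 − o(k−1)) + λ·I(k), in which a neuron that fired at the previous moment restarts from the resting potential of 0. The fixed threshold of 0.5, the time constants, and the function names are hypothetical.

```python
import numpy as np

def lif_step(u_prev, o_prev, i_now, dt, tau):
    """One explicit iteration of the membrane potential.

    Assumed form: u(k) = beta * u(k-1) * (1 - o(k-1)) + lam * I(k).
    """
    beta = 1.0 - dt / tau    # membrane potential decay coefficient
    lam = dt / tau           # input enhancement coefficient
    return beta * u_prev * (1.0 - o_prev) + lam * i_now

def simulate(i_in, dt=1.0, tau=10.0, u_thresh=0.5):
    """Run one neuron on an input-current sequence with a fixed, hypothetical threshold."""
    u, o = 0.0, 0.0
    potentials, outputs = [], []
    for i_now in i_in:
        u = lif_step(u, o, i_now, dt, tau)
        o = 1.0 if u >= u_thresh else 0.0   # step output: 1 when the threshold is reached
        potentials.append(u)
        outputs.append(o)
    return np.array(potentials), np.array(outputs)

potentials, outputs = simulate(np.full(50, 1.0))
```

With a constant input the potential charges toward the threshold, fires, and restarts, so the output is a regular sparse spike train; only the previous potential, the previous output, and the current input are needed at each step.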
[0067] S220. Determine the first input current of the current layer neuron at the previous moment, determine the first output current of the previous layer neuron at the current moment, and determine the second output current of the current layer neuron at the previous moment.
[0068] In this embodiment, to simplify the network model, the present invention adopts a current-based synaptic model. In a multilayer spiking neural network, for the current layer neuron, the input current at the current moment (i.e., the second input current in step S230) can be determined by the input current at the previous moment (i.e., the first input current in this step) and the output current of the previous layer neuron at the current moment (i.e., the first output current in this step). This is also an explicit iterative model. Therefore, in order to find the second input current, it is necessary to first determine the first input current and the first output current. The first input current can be directly obtained in the model.
[0069] Specifically, the process of determining the first output current includes: determining the neuron firing threshold at the current moment based on the adaptive firing threshold function; determining the neuron output function based on the neuron firing threshold at the current moment; determining the first membrane potential of the neurons in the previous layer at the current moment; and determining the first output current based on the neuron output function and the first membrane potential.
[0070] The neuron output function is represented by the following formula (3):
[0071] $g(u) = \begin{cases} 1, & u \ge u_{thresh} \\ 0, & u < u_{thresh} \end{cases}$ (3)
[0072] In the formula, g(·) represents the neuron output function; $u_{thresh}$ is the neuron firing threshold; and u is the membrane potential. As shown in formula (3), the membrane potential is reset once it reaches the threshold, so in theory a membrane potential greater than $u_{thresh}$ cannot be attained.
[0073] The output current is a function of the membrane potential and is expressed by the following formula (4):
[0074] $o_j^{n-1}(k) = g\left(u_j^{n-1}(k)\right)$ (4)
[0075] In the formula, $o_j^{n-1}(k)$ represents the output current of the neuron in the previous layer at the current moment, i.e., the first output current; $u_j^{n-1}(k)$ represents the membrane potential of the neuron in the previous layer at the current moment, i.e., the first membrane potential; the superscript n is the layer index of the neuron, so that n-1 denotes the previous layer; the subscript j is the sequence number of the neuron in layer n-1; and k is the index of the moment in the pulse sequence, with k denoting the current moment. It should be noted that the second output current is determined in the same way as the first output current, so the details are not repeated here.
[0076] Furthermore, the adaptive firing threshold function is determined as follows: a threshold decay coefficient, which describes how the firing threshold decays over time, is determined; the adaptive firing threshold function is then determined based on the neuron firing threshold at the previous moment, the second output current, and the threshold decay coefficient. The adaptive firing threshold function is expressed by the following formula (5):
[0077] $u_{thresh}(k) = \gamma\, u_{thresh}(k-1) + (1-\gamma)\, o(k-1)$ (5)
[0078] In the formula, $u_{thresh}(k)$ represents the neuron firing threshold at the current moment; $u_{thresh}(k-1)$ represents the neuron firing threshold at the previous moment; o(k-1) is the output current of the current layer neuron at the previous moment, i.e., the second output current; and γ is the decay coefficient of the firing threshold over time, which can be selected to be less than 1. The neuron firing threshold increases as the neuron fires, thereby regulating the neuron's pulse firing frequency.
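As an illustration only, the threshold update of formula (5) can be sketched in Python as follows. The function names, the initial threshold of 0.5, and γ = 0.9 are assumptions made for this sketch and are not values given in the embodiment. The step-shaped output function is likewise an assumed form, in which the neuron outputs 1 when the membrane potential reaches the threshold and 0 otherwise.

```python
def output_fn(u, u_thresh):
    """Assumed step output: 1.0 once the membrane potential reaches the threshold."""
    return 1.0 if u >= u_thresh else 0.0


def update_threshold(prev_thresh, prev_output, gamma=0.9):
    """Formula (5): u_thresh(k) = gamma * u_thresh(k-1) + (1 - gamma) * o(k-1)."""
    return gamma * prev_thresh + (1.0 - gamma) * prev_output


def threshold_trace(outputs, initial_thresh=0.5, gamma=0.9):
    """Threshold after each step, given the neuron's own past outputs."""
    thresh = initial_thresh
    trace = []
    for o in outputs:
        thresh = update_threshold(thresh, o, gamma)
        trace.append(thresh)
    return trace
```

Under these assumptions, each spike (o = 1) pulls the threshold toward 1, while a silent step lets it decay toward 0, which is one way the pulse firing frequency could be regulated.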
[0079] S230. Determine the second input current of the current layer neuron at the current moment based on the first input current, the first output current, and the synaptic weights between neurons in each layer.
[0080] In this embodiment, the second input current is represented by the following formula (6):
[0081] $I_i^n(k) = \alpha\, I_i^n(k-1) + \sum_{j=1}^{l(n-1)} w_{ij}^n\, o_j^{n-1}(k)$ (6)
[0082] In the formula, $I_i^n(k)$ represents the second input current of the neuron in the current layer at the current moment; $I_i^n(k-1)$ represents the first input current of the current layer neuron at the previous moment; $o_j^{n-1}(k)$ represents the first output current of the neuron in the previous layer at the current moment; $w_{ij}^n$ is the weight (or synaptic weight) of the synaptic connection between neuron j in layer n-1 of the convolutional neural network and neuron i in layer n; α is the attenuation coefficient of the input current; l(n-1) is the number of neurons in layer n-1; the subscript i is the sequence number of the neuron in layer n, and the subscript j is the sequence number of the neuron in layer n-1; the superscript n is the layer index of the neuron, with n denoting the current layer and n-1 the previous layer; and k is the index of the moment in the pulse sequence, with k denoting the current moment and k-1 the previous moment.
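The current-based synaptic update described for step S230 can be sketched as follows. The sketch assumes that the new input current is the previous input current scaled by α plus the weighted sum of the previous layer's outputs, as inferred from the term definitions above. The function name, the argument layout, and α = 0.8 are hypothetical.

```python
def update_input_current(prev_current, weights, prev_layer_outputs, alpha=0.8):
    """One time step of the assumed current-based synapse for layer n.

    prev_current[i]       : input current of neuron i (layer n) at step k-1
    weights[i][j]         : synaptic weight from neuron j (layer n-1) to neuron i
    prev_layer_outputs[j] : output of neuron j (layer n-1) at step k
    """
    new_current = []
    for i, row in enumerate(weights):
        # Weighted sum over the l(n-1) neurons of the previous layer.
        drive = sum(w * o for w, o in zip(row, prev_layer_outputs))
        new_current.append(alpha * prev_current[i] + drive)
    return new_current
```

For example, with `prev_current=[1.0, 0.0]`, `weights=[[0.5, 0.25], [1.0, -1.0]]`, `prev_layer_outputs=[1.0, 1.0]`, and `alpha=0.5`, the first neuron receives 0.5 + 0.75 = 1.25 and the second receives 0.0.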
[0083] S240. Substitute the second input current and the second output current back into the membrane potential difference equation to obtain the spiking neuron model.
[0084] In the embodiments of this application, substituting formulas (4) and (6) back into formula (2) yields the spiking neuron model, namely an iterative adaptive leaky integrate-and-fire (LIF) neuron model.
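To show how the pieces of steps S210 to S240 might fit together, the following is a minimal single-neuron sketch. It is not the embodiment's implementation. The membrane update used here, u(k) = β·u(k-1)·(1 - o(k-1)) + I(k), is a commonly used iterative LIF form adopted purely as an assumption and may differ from formula (2). The step output function, the names, and the parameter values β = 0.9, γ = 0.9, and an initial threshold of 0.5 are also assumptions.

```python
def simulate_neuron(input_currents, beta=0.9, gamma=0.9, initial_thresh=0.5):
    """Run one adaptive-threshold spiking neuron over a sequence of input currents."""
    u = 0.0                  # membrane potential
    o = 0.0                  # output at the previous step
    thresh = initial_thresh  # adaptive firing threshold
    spikes = []
    for current in input_currents:
        # Formula (5): threshold driven by the neuron's own previous output.
        thresh = gamma * thresh + (1.0 - gamma) * o
        # Assumed membrane update: leak, reset after a spike, add the input current.
        u = beta * u * (1.0 - o) + current
        # Assumed step output function.
        o = 1.0 if u >= thresh else 0.0
        spikes.append(int(o))
    return spikes
```

With a constant input current of 0.3 for four steps, this sketch produces the spike pattern `[0, 1, 0, 1]`: the potential integrates for one step, fires, is reset, and integrates again.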
[0085] The technical solution provided in this embodiment converts the membrane potential differential equation in the LIF model into a membrane potential difference equation; determines the first input current of the current layer neuron at the previous moment, the first output current of the previous layer neuron at the current moment, and the second output current of the current layer neuron at the previous moment; determines the second input current of the current layer neuron at the current moment based on the first input current, the first output current, and the synaptic weights between neurons in each layer; and substitutes the second input current and the second output current back into the membrane potential difference equation to obtain the spiking neuron model. This application thus improves upon the LIF model to obtain the spiking neuron model, namely the iterative adaptive leaky integrate-and-fire neuron model. This model reproduces the firing process of real biological neurons while keeping the computational load low, achieving a good balance between biological plausibility and model complexity and therefore high real-time performance. Because spiking neurons fire sparsely, network energy consumption is significantly reduced while computing capability is improved, which makes the model suitable for small intelligent robots such as customer service robots.
[0086] Figure 5 is a schematic diagram of the structure of a facial expression image recognition device provided in an embodiment of this application. As shown in Figure 5, the device 300 may include:
[0087] The target vector determination module 310 is used to determine the original vector of the expression image based on the pixel values of the expression image, and to perform principal component analysis on the original vector to obtain the target vector of the expression image;
[0088] The feature information determination module 320 is used to input the target vector into the feature extraction network to obtain the feature information of the expression image;
[0089] The pulse sequence determination module 330 is used to encode the feature information based on Poisson distribution to obtain the pulse sequence of the facial expression image;
[0090] The facial expression image recognition module 340 is used to input the pulse sequence into the facial expression recognition model to obtain the recognition result of the facial expression image. The facial expression recognition model is a multilayer spiking neural network, and the spiking neuron model in the multilayer spiking neural network is an improvement based on the leaky integrate-and-fire (LIF) neuron model.
[0091] Furthermore, the aforementioned facial expression image recognition device may also include a model determination module.
[0092] The model determination module is used to convert the membrane potential differential equation in the LIF model into a membrane potential difference equation. The membrane potential difference equation represents the relationship between the membrane potential of the current layer neuron at the current time and the membrane potential at the previous time. The module determines the first input current of the current layer neuron at the previous time, the first output current of the previous layer neuron at the current time, and the second output current of the current layer neuron at the previous time. Based on the first input current, the first output current, and the synaptic weights between neurons in each layer, the module determines the second input current of the current layer neuron at the current time. The module substitutes the second input current and the second output current back into the membrane potential difference equation to obtain the spiking neuron model.
[0093] Furthermore, the aforementioned model determination module can be specifically used to: determine the neuron firing threshold at the current moment based on the adaptive firing threshold function; determine the neuron output function based on the neuron firing threshold at the current moment; determine the first membrane potential of the neurons in the previous layer at the current moment; and determine the first output current based on the neuron output function and the first membrane potential.
[0094] Furthermore, the aforementioned model determination module can also be specifically used to: determine the threshold decay coefficient that describes how the firing threshold decays over time; and determine the adaptive firing threshold function based on the neuron firing threshold at the previous moment, the second output current, and the threshold decay coefficient.
[0095] Furthermore, the target vector determination module 310 described above can be specifically used for: preprocessing the original vector to obtain an intermediate vector, and determining the covariance matrix of the intermediate vector; performing eigenvalue decomposition on the covariance matrix to obtain eigenvectors; selecting, from the eigenvectors, a preset number of target eigenvectors that meet the target requirements, and forming a projection matrix from the target eigenvectors; and performing data processing on the projection matrix and the original vector to obtain the target vector.
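The principal component analysis steps performed by module 310 can be sketched in plain Python as follows. This is only a sketch under several assumptions: the "target requirement" is taken to mean the eigenvectors with the largest eigenvalues, power iteration with deflation is swapped in for a full eigenvalue solver, and the projection is applied to the mean-normalized vectors. The function names and the iteration count are hypothetical.

```python
import math


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def pca_project(vectors, num_components, iterations=500):
    """Return (projection rows, projected vectors) for the leading components."""
    p, d = len(vectors), len(vectors[0])
    # Mean normalization: subtract the data-set mean from each original vector.
    mean = [sum(v[c] for v in vectors) / p for c in range(d)]
    centered = [[v[c] - mean[c] for c in range(d)] for v in vectors]
    # Covariance matrix of the intermediate (mean-normalized) vectors.
    cov = [[sum(x[r] * x[c] for x in centered) / p for c in range(d)]
           for r in range(d)]

    projection = []
    for _ in range(num_components):
        # Deterministic, normalized start vector for power iteration.
        start_norm = math.sqrt(sum((c + 1) ** 2 for c in range(d)))
        vec = [(c + 1) / start_norm for c in range(d)]
        for _step in range(iterations):
            nxt = mat_vec(cov, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(cov, vec)))
        projection.append(vec)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[r][c] - eigenvalue * vec[r] * vec[c] for c in range(d)]
               for r in range(d)]

    targets = [[sum(w * x for w, x in zip(axis, row)) for axis in projection]
               for row in centered]
    return projection, targets
```

For image vectors of realistic size, a numerical library's eigenvalue routine would normally replace the power iteration used here; the loop-based version is kept only so that the sketch has no external dependencies.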
[0096] Furthermore, the aforementioned pulse sequence determination module 330 can be specifically used to: discretize the time information in the feature information into multiple time moments according to the sampling time step; determine the neuron firing probability at each time moment based on the Poisson distribution; and encode the feature information using pulse frequency based on the neuron firing probability at each time moment, thereby obtaining the pulse sequence.
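The pulse frequency encoding performed by module 330 can be sketched as follows. The sketch assumes that each feature value has already been normalized to the range [0, 1], and it approximates the Poisson process by an independent draw at each time step with firing probability rate × time step. The function name, the `max_rate` scaling, and the seed parameter are hypothetical.

```python
import random


def poisson_encode(features, duration, dt=1.0, max_rate=1.0, seed=None):
    """Encode feature values in [0, 1] as binary pulse sequences.

    Each feature is treated as a firing rate (value * max_rate). Time is
    discretized into steps of length dt, and a pulse is emitted in a step
    with probability rate * dt, clipped to [0, 1].
    """
    rng = random.Random(seed)
    num_steps = int(round(duration / dt))
    trains = []
    for value in features:
        prob = min(max(value * max_rate * dt, 0.0), 1.0)
        trains.append([1 if rng.random() < prob else 0 for _ in range(num_steps)])
    return trains
```

A feature value of 0 then never fires and a value of 1 (with `max_rate * dt = 1`) fires at every step, while intermediate values fire at a proportionate average frequency.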
[0097] Optionally, the feature extraction network is trained based on a convolutional neural network; the convolutional neural network includes five convolutional layers and two pooling layers; the convolutional parameters of the first two convolutional layers are different from those of the last three convolutional layers.
[0098] The facial expression image recognition device provided in this embodiment can be applied to the facial expression image recognition method provided in any of the above embodiments, and has corresponding functions and beneficial effects.
[0099] Figure 6 is a block diagram of an electronic device used to implement a method for recognizing facial expression images according to embodiments of this application. The electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present application described and / or claimed herein.
[0100] As shown in Figure 6, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input / output (I / O) interface 15 is also connected to the bus 14.
[0101] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0102] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as methods for recognizing facial expression images.
[0103] In some embodiments, the facial expression image recognition method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the facial expression image recognition method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the facial expression image recognition method by any other suitable means (e.g., by means of firmware).
[0104] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0105] Computer programs used to implement the methods of this application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partly on a machine, as a standalone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
[0106] In the context of this application, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. Alternatively, a computer-readable storage medium can be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0107] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0108] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0109] A computing system can include clients and servers. Clients and servers are generally remote from each other and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product in the cloud computing service system that addresses the shortcomings of traditional physical hosts and VPS services, such as difficult management and weak business scalability.
[0110] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this application can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this application can be achieved, and this is not limited herein.
[0111] The specific embodiments described above do not constitute a limitation on the scope of protection of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A method for recognizing a facial expression image, characterized in that the method includes: determining an original vector of the facial expression image based on the pixel values of the facial expression image, and performing principal component analysis on the original vector to obtain a target vector of the facial expression image; inputting the target vector into a feature extraction network to obtain feature information of the facial expression image; performing pulse frequency encoding on the feature information based on a Poisson distribution to obtain a pulse sequence of the facial expression image; and inputting the pulse sequence into a facial expression recognition model to obtain a recognition result of the facial expression image; wherein the facial expression recognition model is a multilayer spiking neural network, and the spiking neuron model in the multilayer spiking neural network is an improvement based on the leaky integrate-and-fire (LIF) neuron model; the spiking neuron model is obtained by the following improvement method: converting the membrane potential differential equation in the LIF model into a membrane potential difference equation, wherein the membrane potential difference equation represents the relationship between the membrane potential of the current layer neuron at the current moment and the membrane potential at the previous moment; determining the first input current of the current layer neuron at the previous moment, the first output current of the previous layer neuron at the current moment, and the second output current of the current layer neuron at the previous moment; determining the second input current of the current layer neuron at the current moment based on the first input current, the first output current, and the synaptic weights between neurons in each layer; and substituting the second input current and the second output current back into the membrane potential difference equation to
obtain the spiking neuron model; wherein determining the first output current of the previous layer neuron at the current moment includes: determining the neuron firing threshold at the current moment based on an adaptive firing threshold function; determining the neuron output function based on the neuron firing threshold at the current moment; determining the first membrane potential of the previous layer neuron at the current moment; and determining the first output current based on the neuron output function and the first membrane potential; and wherein performing principal component analysis on the original vector to obtain the target vector of the facial expression image includes: performing mean normalization on the original vector, and obtaining the intermediate vector using the following formulas: $\mu = \frac{1}{p}\sum_{a=1}^{p} x_a$, $\tilde{x}_a = x_a - \mu$, wherein $\tilde{x}_a$ represents the intermediate vector, $x_a$ represents the original vector of the a-th expression image, μ represents the mean of the vectors composed from the data image set, p is the number of expression images in the data image set, and a is the index number of the expression images in the data image set; determining the covariance matrix of the intermediate vector; performing eigenvalue decomposition on the covariance matrix to obtain eigenvectors; selecting, from the eigenvectors, a preset number of target eigenvectors that meet the target requirements, and forming a projection matrix from the target eigenvectors; and performing data processing on the projection matrix and the original vector to obtain the target vector.
2. The method for recognizing facial expression images according to claim 1, characterized in that the adaptive firing threshold function is determined as follows: determining the threshold decay coefficient that describes how the firing threshold decays over time; and determining the adaptive firing threshold function based on the neuron firing threshold at the previous moment, the second output current, and the threshold decay coefficient.
3. The method for recognizing facial expression images according to claim 1, characterized in that performing pulse frequency encoding on the feature information based on a Poisson distribution to obtain the pulse sequence of the facial expression image includes: discretizing the time information in the feature information into multiple time moments according to the sampling time step; determining the neuron firing probability at each time moment based on the Poisson distribution; and performing pulse frequency encoding on the feature information based on the neuron firing probability at each time moment to obtain the pulse sequence.
4. The method for recognizing facial expression images according to claim 1, characterized in that the feature extraction network is trained based on a convolutional neural network; the convolutional neural network includes five convolutional layers and two pooling layers; and the convolutional parameters of the first two convolutional layers are different from those of the last three convolutional layers.
5. A facial expression image recognition device, characterized in that the device is used to implement the facial expression image recognition method of claim 1 and includes: a target vector determination module, used to determine the original vector of the expression image based on the pixel values of the expression image, and to perform principal component analysis on the original vector to obtain the target vector of the expression image; a feature information determination module, used to input the target vector into the feature extraction network to obtain the feature information of the expression image; a pulse sequence determination module, used to encode the feature information based on a Poisson distribution to obtain the pulse sequence of the facial expression image; and a facial expression image recognition module, used to input the pulse sequence into the facial expression recognition model to obtain the recognition result of the facial expression image, wherein the facial expression recognition model is a multilayer spiking neural network in which the spiking neuron model is an improvement based on the leaky integrate-and-fire (LIF) neuron model.
6. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the facial expression image recognition method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that cause a processor to execute the facial expression image recognition method according to any one of claims 1 to 4.