Artificial intelligence acceleration hardware device and method

The AI acceleration hardware device uses a U-net-based neural network structure to address real-time noise removal: it converts audio data into image (spectrogram) data, processes the image data through neural network units, and restores the result to audio, achieving efficient noise removal and sound source separation.

WO2026142056A1 · PCT designated stage · Publication Date: 2026-07-02 · IND ACAD COOP GRP OF SEJONG UNIV

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: IND ACAD COOP GRP OF SEJONG UNIV
Filing Date: 2025-12-05
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing artificial intelligence acceleration technologies struggle to remove noise from audio and video data in real time because the task demands a large amount of computation and memory access. Noise from environmental and technical sources degrades quality and hinders speech recognition.

Method used

An AI acceleration hardware device with a U-net-based encoder-decoder neural network structure, including preprocessing and postprocessing units, on-chip memory, and a controller. The device converts audio input into image data, processes it through neural network units, and restores it to audio data, efficiently removing noise.

Benefits of technology

Enables real-time noise removal and target sound source separation by leveraging a U-net-based neural network structure, effectively handling various noise factors and preserving detailed image information.



Abstract

The present invention relates to an artificial intelligence acceleration hardware device comprising: neural network units connected according to an encoder-decoder-based neural network structure so as to implement a neural network operation; a pre-processing unit for converting audio input data into image input data and providing the image input data to an encoder neural network unit disposed at an input end from among the neural network units; a post-processing unit for receiving image output data from a decoder neural network unit disposed at an output end from among the neural network units, and outputting the image output data as audio output data; an on-chip memory including an input buffer and an output buffer for storing the image input data and the image output data, respectively; and a controller for executing pipeline operations of the neural network units through access to the on-chip memory according to the neural network operation.

Description

Artificial Intelligence Acceleration Hardware Device and Method

[0001] The present invention relates to artificial intelligence acceleration hardware technology utilizing an encoder-decoder-based neural network structure, and more specifically, to an artificial intelligence acceleration hardware device and method that convert audio input data into image data, process the image data through neural network units, and then restore it to audio data.

[0002]

[0003] In audio data, noise refers to unwanted signals that distort voice or acoustic data and degrade quality, and it is generated by various factors. First, environmental noise refers to external noise occurring in the recording environment, including wind noise, vehicle noise, and reverberation (echo) caused by reflections from walls. Additionally, technical factors include hum noise (60Hz) generated from circuit power supplies, noise caused by microphone quality during the recording process, and quantization noise generated during digital conversion. Noise can also occur due to various other causes, and audio sources damaged by noise can cause various problems, such as degrading the listening experience or reducing speech recognition rates.

[0004] Camera footage inherently contains noise, which can become severe due to weather conditions such as fog, sea fog, yellow dust, and dust. To address this, AI-based real-time ultra-high-speed noise removal technology is essential; however, real-time processing can be very difficult with existing acceleration technologies because it requires a massive amount of computation and memory references.

[0005] U-net is a representative Encoder-Decoder neural network model proposed for medical image segmentation. Because U-net’s multi-layer structure enables feature extraction from multiple scales, it can be considered a representative neural network that demonstrates excellent performance in noise removal. U-net can be composed of an encoding part containing CNN operations, a decoding part, and a skip-connection part that transmits the encoding results to the decoding part. This configuration enables the restoration of images damaged by noise by preserving detailed image information.

[0006]

[0007] [Prior Art Literature]

[0008] [Patent Literature]

[0009] Korean Registered Patent No. 10-2256288 (May 20, 2021)

[0010]

[0011] One embodiment of the present invention aims to provide an artificial intelligence acceleration hardware device and method that processes sound source data in real time to remove noise and separate a target sound source.

[0012] One embodiment of the present invention aims to provide an artificial intelligence acceleration hardware device and method that efficiently removes noise generated from various environmental and technical factors by utilizing a U-net-based neural network structure.

[0013]

[0014] Among the embodiments, the artificial intelligence acceleration hardware device comprises: neural network units that implement neural network operations by being connected according to an encoder-decoder-based neural network structure; a preprocessing unit that converts audio input data into image input data and provides the image input data to an encoder neural network unit positioned at the input end among the neural network units; a postprocessing unit that receives image output data from a decoder neural network unit positioned at the output end among the neural network units and outputs the image output data as audio output data; an on-chip memory including an input buffer and an output buffer that respectively store the image input data and the image output data; and a controller that executes the pipeline operation of the neural network units through access to the on-chip memory according to the neural network operation.

[0015] The above neural network units may include: encoding units that receive the image input data and process an encoding process that compresses it into encoding data; decoding units that remove noise from the image input data through a decoding process that restores the encoding data and output the image output data; and a bottleneck unit that connects the encoding units and the decoding units.

[0016] Each of the above encoding units and the above decoding units may include a WeightReg module connected to a dedicated weight buffer to receive weights; and a Convolution Unit module implementing at least one convolution layer that performs a convolution operation for the encoding process or the decoding process for input data based on the weights.

[0017] The above preprocessing unit may include a Fourier transform module that performs a Fourier transform on the audio input data; and a spectrogram generation module that generates a spectrogram of the result of the Fourier transform as the image input data.

[0018] The above Fourier transform module can perform the Fourier transform by dividing the audio input data into overlapping windows and performing a Short-Time Fourier Transform (STFT) operation.

[0019] The above Fourier transform module can set the size of the overlapping window according to the noise characteristics of the audio input data.

[0020] The spectrogram generation module can generate the spectrogram as a feature map for the image input data and provide it to the encoder neural network unit.

[0021] The above post-processing unit may include an inverse Fourier transform module that performs an inverse Fourier transform on the image output data; and an overlap module that outputs the audio output data through overlapping the result of the inverse Fourier transform.

[0022] The above inverse Fourier transform module can perform the inverse Fourier transform by performing an ISTFT (Inverse Short-Time Fourier Transform) operation on the above image output data.

[0023] The above overlap module can perform a convolution operation on the result of the inverse Fourier transform through the convolution unit module and overlap the result of the convolution operation to output the audio output data.

[0024] The above-described on-chip memory may be implemented with a plurality of line buffers that divide and store the input buffer in units of the image input data lines, and may further include weight buffers (WeightBuf) used in the process of the neural network operation, each connected to a corresponding neural network unit to store a corresponding weight value, and a connection buffer that provides skip connections between the encoder and decoder and supports the pipeline operation.

[0025] Each of the above weight buffers is independently connected to the neural network units and can store all weight values used in the operation process of the corresponding neural network unit.

[0026] Among the embodiments, the artificial intelligence acceleration method comprises: a step of implementing a neural network operation by connecting neural network units according to an encoder-decoder-based neural network structure; a step of converting audio input data into image input data through a preprocessing unit and providing the image input data to an encoder neural network unit positioned at the input end among the neural network units; a step of receiving image output data from a decoder neural network unit positioned at the output end among the neural network units through a postprocessing unit and outputting the image output data as audio output data; a step of storing the image input data and the image output data, respectively, through an input buffer and an output buffer of an on-chip memory; and a step of executing, through a controller, a pipeline operation of the neural network units through access to the on-chip memory according to the neural network operation.

[0027]

[0028] The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects; therefore, the scope of the rights of the disclosed technology should not be understood as being limited by this.

[0029] An artificial intelligence acceleration hardware device and method according to one embodiment of the present invention can process sound source data in real time to remove noise and separate a target sound source.

[0030] An artificial intelligence acceleration hardware device and method according to one embodiment of the present invention can efficiently remove noise generated from various environmental and technical factors by utilizing a U-net-based neural network structure.

[0031]

[0032] FIG. 1 is a drawing illustrating an artificial intelligence acceleration hardware device according to the present invention.

[0033] FIG. 2 is a diagram illustrating the structure of a U-net neural network.

[0034] FIG. 3 is a diagram illustrating the basic structure of the block diagram of each neural network unit of FIG. 1.

[0035] FIG. 4 is a drawing illustrating a block diagram regarding the hardware structure of a preprocessing unit according to the present invention.

[0036] FIG. 5 is a diagram illustrating a block diagram regarding the hardware structure of a post-processing unit according to the present invention.

[0037] FIG. 6 is a flowchart illustrating a sound source noise removal process based on sound source separation according to an embodiment of the present invention.

[0038]

[0039] The description of the present invention is merely an example for structural or functional explanation, and therefore the scope of the present invention should not be interpreted as being limited by the examples described in the text. That is, since the examples are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific example must include all of them or only such effects; therefore, the scope of the present invention should not be understood as being limited by them.

[0040] Meanwhile, the meaning of the terms described in this application should be understood as follows.

[0041] Terms such as "first," "second," etc., are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, the first component may be named the second component, and similarly, the second component may be named the first component.

[0042] When it is stated that one component is "connected" to another component, it should be understood that it may be directly connected to that other component, or that there may be other components in between. Conversely, when it is stated that one component is "directly connected" to another component, it should be understood that there are no other components in between. Meanwhile, other expressions describing the relationships between components, such as "between" and "exactly between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way.

[0043] A singular expression should be understood to include a plural expression unless the context clearly indicates otherwise, and terms such as "include" or "have" are intended to specify the existence of the implemented features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0044] In each step, identifiers (e.g., a, b, c, etc.) are used for convenience of explanation and do not describe the order of the steps; the steps may occur differently from the specified order unless a specific order is clearly indicated in the context. That is, the steps may occur in the same order as specified, may be performed substantially simultaneously, or may be performed in the reverse order.

[0045] The present invention may be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices in which data that can be read by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. Additionally, the computer-readable recording medium may be distributed across networked computer systems, so that computer-readable code can be stored and executed in a distributed manner.

[0046] Unless otherwise defined, all terms used herein have the same meaning as generally understood by those skilled in the art to which this invention pertains. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with the context of the relevant technology and should not be interpreted as having an ideal or overly formal meaning unless explicitly defined in this application.

[0047]

[0048] FIG. 1 is a drawing illustrating an artificial intelligence acceleration hardware device according to the present invention.

[0049] Referring to FIG. 1, the artificial intelligence acceleration hardware device may include a neural network unit (110), a preprocessing unit (120), a postprocessing unit (130), an on-chip memory (140), and a controller (150).

[0050] At this time, embodiments of the present invention are not required to include all of the above components simultaneously; depending on each embodiment, some of the components may be omitted, or some or all of the components may be selectively included. The operation of each component will be described in detail below.

[0051]

[0052] Neural network units (110) can be connected according to an encoder-decoder-based neural network structure to implement neural network operations. Here, neural network units (110) can be connected to each other according to a specific neural network structure, for example, according to a fully connected, convolutional, and recurrent neural network structure. Neural network units (110) can be connected to each other according to a specific neural network structure and arranged hierarchically. For example, neural network units (110) can be composed of an input layer, a hidden layer, and an output layer to determine the order for processing images between each neural network unit (110).

[0053] Additionally, each of the first and second neural network units (110) has an encoder-decoder structure and may include encoding units (111) that process an encoding process for compressing input data into encoded data. Here, the encoder-decoder structure may include two main steps: compressing input data into a reduced form (encoding) and then expanding it again (decoding) to generate an output. The neural network units (110) can process the original input data through the encoding units (111) to convert it into encoded data in a compressed form.

[0054] In one embodiment, the neural network units (110) may include decoding units (112) that process a decoding process to restore encoded data and output output data. Here, the neural network units (110) can expand the encoded data through the decoding units (112) and convert it into a final output form. For example, the neural network units (110) can perform upsampling or deconvolution operations on the encoded data through the decoding units (112) and restore the data to the resolution and dimensions of the initial input. Here, upsampling is a process of expanding the data to a higher resolution to make it close to the original size, and may include upsampling techniques such as Nearest Neighbor Upsampling, Bilinear Interpolation, and Bicubic Interpolation. Additionally, the deconvolution operation may correspond to the inverse operation of the convolution operation. Neural network units (110) can restore key features while expanding the resolution of an encoded image to a given size by performing deconvolution operations such as padding and stride adjustment based on decoding units (112).
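The nearest-neighbor upsampling named above can be illustrated with a minimal sketch. The function name `upsample_nearest` and the scale factor of 2 are assumptions made for illustration; the source does not specify which upsampling technique or factor the decoding units (112) use.

```python
import numpy as np

def upsample_nearest(feature_map: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling: repeat each pixel `factor` times per axis."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    # Repeat rows first, then columns, so each pixel becomes a factor x factor block.
    rows_repeated = np.repeat(feature_map, factor, axis=0)
    return np.repeat(rows_repeated, factor, axis=1)

# Hypothetical 2x2 feature map expanded to 4x4.
small = np.array([[1, 2],
                  [3, 4]])
large = upsample_nearest(small, factor=2)
```

Each input pixel is copied into a 2x2 block, so the output has twice the height and width of the input without introducing new values.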

[0055] In one embodiment, the neural network units (110) may include encoding units (111) that process an encoding process of receiving image input data and compressing it into encoding data, decoding units (112) that output image output data by removing noise from the image input data through a decoding process of restoring the encoding data, and a bottleneck unit (113) that connects the encoding units (111) and the decoding units (112). Here, the bottleneck unit (113) may be composed of the lowest dimension layers in the neural network and may be composed of convolution layers between the last downsampling and the first upsampling. Thus, the bottleneck unit (113) may have the characteristic of including high-level features. The neural network units (110) can perform an image noise removal process that minimizes image input data loss and removes noise from the image input data by processing the encoding process based on the encoding unit (111) and performing the decoding process through the decoding unit (112).

[0056] In one embodiment, each of the encoding unit (111) and the decoding unit (112) may include a weight register module (116) connected to a dedicated weight buffer (144) to receive weights, and a convolution unit module (118) implementing at least one convolution layer that performs a convolution operation for an encoding or decoding process for input data based on the weights. Here, the weight register module (116) may receive weights required for the convolution operation from the weight buffer (144), and the actual convolution operation of the convolution layers included in each neural network unit (110) may be performed sequentially in the convolution unit module (118). Additionally, the convolution unit module (118) may include processing elements configured to be optimized for the layer configuration. Each of the encoding unit (111) and the decoding unit (112) can receive weights from the weight register module (116) and perform a convolution operation on the input data through the convolution unit module (118). Here, each of the encoding unit (111) and the decoding unit (112) can generate a feature map by applying a filter to the window area of the input data.

[0057] The preprocessing unit (120) can convert audio input data into image input data and provide the image input data to an encoder neural network unit positioned at the input end among the neural network units (110). Here, the preprocessing unit (120) can be structured to convert audio input data into image data and provide it as input to the neural network units (110) so that learning and processing can be performed on the audio input data. Additionally, the preprocessing unit (120) can perform a Fourier transform to convert the time-domain data of the audio input data into the frequency domain.

[0058] In one embodiment, the preprocessing unit (120) may include a Fourier transform module (121) that performs a Fourier transform on audio input data and a spectrogram generation module (122) that generates a spectrogram of the result of the Fourier transform as image input data. Here, the preprocessing unit (120) can analyze the frequency and frequency intensity included in the audio input data by converting the audio input data into frequency components on the time axis through the Fourier transform module (121). Additionally, the preprocessing unit (120) can extract amplitude information from the complex data generated as a result of the Fourier transform through the spectrogram generation module (122) and generate a 2D graph (e.g., a spectrogram) with time on the x-axis and frequency on the y-axis.

[0059] In one embodiment, the Fourier transform module (121) can perform a Fourier transform by dividing the audio input data into overlapping windows and performing a Short-Time Fourier Transform (STFT) operation. Here, overlapping windows refers to dividing the audio input data into short windows of a specific length such that adjacent windows share part of their data. The STFT operation is a variation of the Fourier transform used to analyze how the frequency content of the audio input data changes over time. The Fourier transform module (121) can apply a Fourier transform to each window segment and thereby generate a spectrogram composed of a time axis and a frequency axis.
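The windowing and STFT steps described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the Hann window, the window size of 256, the hop of 128 samples (50% overlap), and the name `stft_magnitude` are all assumptions that the source does not specify.

```python
import numpy as np

def stft_magnitude(signal, window_size=256, hop=128):
    """Split `signal` into overlapping Hann-windowed frames and return |STFT|.

    The result has shape (frequency_bins, frames): frequency on the y-axis
    and time on the x-axis, like a spectrogram image.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < window_size:
        raise ValueError("signal is shorter than one window")
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    # Adjacent frames share (window_size - hop) samples.
    frames = np.stack([
        signal[i * hop : i * hop + window_size] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)   # complex, shape (frames, bins)
    return np.abs(spectrum).T                # amplitude only, (bins, frames)

# Hypothetical test input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spectrogram = stft_magnitude(tone, window_size=256, hop=128)
```

With these assumed parameters each frequency bin is 8000 / 256 = 31.25 Hz wide, so the 1 kHz tone falls in bin 32 of every frame.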

[0060] In one embodiment, the Fourier transform module (121) can set the size of the overlapping window according to the noise characteristics of the audio input data. Here, the noise characteristics of the audio input data may correspond to comprehensive attributes including the causes of unnecessary or unintended signal generation, temporal patterns, and frequency distribution, and may be classified, for example, into temporal characteristics, frequency characteristics, and intensity characteristics. Generally, as the window size becomes smaller, the temporal resolution tends to be higher and the frequency resolution tends to be lower, and as the window size becomes larger, the frequency resolution tends to be higher and the temporal resolution tends to be lower; therefore, the Fourier transform module (121) can set the size of the overlapping window according to the noise characteristics of the audio input data. For example, the Fourier transform module (121) can set the size of the overlapping window through a window size determination algorithm that dynamically determines the size of the overlapping window according to a specific noise frequency band.
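The trade-off above follows from two simple relations: a window of N samples at sample rate fs spans N / fs seconds and yields frequency bins fs / N Hz wide. The sketch below shows one hypothetical window size determination rule, which picks the smallest power-of-two window whose bin width is narrow enough for a given noise band. The rule, the function names, and the size limits are assumptions; the source does not disclose the actual algorithm.

```python
def resolutions(window_size, sample_rate):
    """Return (time span in seconds, frequency bin width in Hz) for a window."""
    return window_size / sample_rate, sample_rate / window_size

def choose_window_size(sample_rate, max_bin_width_hz, min_size=64, max_size=8192):
    """Smallest power-of-two window whose bin width is <= max_bin_width_hz.

    Hypothetical selection rule for illustration only.
    """
    size = min_size
    while size < max_size and sample_rate / size > max_bin_width_hz:
        size *= 2   # larger window: finer frequency bins, coarser time resolution
    return size

# Narrow-band noise (e.g. mains hum) needs fine frequency resolution ...
narrow_band_size = choose_window_size(16000, max_bin_width_hz=10.0)
# ... while broadband or transient noise can use a shorter window.
broad_band_size = choose_window_size(16000, max_bin_width_hz=100.0)
time_span, bin_width = resolutions(narrow_band_size, 16000)
```

At an assumed 16 kHz sample rate, the narrow-band case selects a 2048-sample window (0.128 s, 7.8125 Hz bins), while the broadband case selects a 256-sample window.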

[0061] In one embodiment, the spectrogram generation module (122) can generate a spectrogram as a feature map for image input data and provide it to the encoder neural network unit. Here, the feature map may correspond to matrix-shaped data representing features extracted from specific audio input data by the neural network unit (110), and may correspond, for example, to the result of compressing and summarizing the characteristics of the image input data (e.g., edges, texture, and color). The spectrogram generation module (122) can calculate the frequency band for each time interval based on the STFT operation result calculated by the Fourier transform module (121) and convert it into an image data form including a time axis and a frequency axis.

[0062] The post-processing unit (130) receives image output data from a decoder neural network unit positioned at the output end among the neural network units (110) and can output the image output data as audio output data. Here, the post-processing unit (130) can perform a series of processes necessary to convert the output data of the decoder neural network unit into an audio signal. For example, the post-processing unit (130) can convert the image output data into a signal in the time domain by performing an Inverse Fourier Transform operation.

[0063] In one embodiment, the post-processing unit (130) may include an inverse Fourier transform module (131) that performs an inverse Fourier transform on image output data and an overlap module (132) that outputs audio output data through overlapping the result of the inverse Fourier transform. Here, the post-processing unit (130) can convert the image output data into a time domain signal by performing an inverse Fourier transform on the image output data through the inverse Fourier transform module (131). Additionally, the post-processing unit (130) can perform overlap processing through the overlap module (132) to maintain connectivity between audio signals in the time domain signal obtained by the inverse Fourier transform and to minimize distortion. For example, the post-processing unit (130) can generate a continuous time domain signal by dividing the audio output data into fixed-size windows through the overlap module (132) and performing connection processing by overlapping adjacent windows.

[0064] In one embodiment, the inverse Fourier transform module (131) can perform an inverse Fourier transform by performing an Inverse Short-Time Fourier Transform (ISTFT) operation on the image output data. Here, the ISTFT operation may correspond to a technique that extends the inverse Fourier transform to restore frequency data divided into window units into time domain data. The inverse Fourier transform module (131) can perform an ISTFT operation on the image output data to perform an inverse Fourier transform on each time window data and connect the data between time windows to generate a continuous signal.
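The ISTFT and the joining of adjacent windows can be sketched as a weighted overlap-add. This is a software illustration under stated assumptions: it operates on the complex STFT (the source does not say how phase is carried through the image output data), and the Hann window, hop size, and function names are hypothetical.

```python
import numpy as np

def stft(signal, window, hop):
    """Forward STFT: complex spectrum of shape (frames, bins)."""
    n = len(window)
    n_frames = 1 + (len(signal) - n) // hop
    frames = np.stack([signal[i * hop : i * hop + n] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spectrum, window, hop):
    """Inverse STFT by weighted overlap-add of the per-window inverse FFTs."""
    n = len(window)
    frames = np.fft.irfft(spectrum, n=n, axis=1)   # back to time domain per window
    length = n + hop * (len(frames) - 1)
    output = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        output[start : start + n] += frame * window   # overlap adjacent windows
        norm[start : start + n] += window ** 2        # accumulated window weight
    covered = norm > 1e-8                              # skip samples with no weight
    output[covered] /= norm[covered]
    return output

# Round trip on hypothetical random data: 4096 samples, 256-sample window, 50% overlap.
rng = np.random.default_rng(0)
window = np.hanning(256)
signal = rng.standard_normal(4096)
restored = istft(stft(signal, window, hop=128), window, hop=128)
```

Dividing by the accumulated squared window removes the amplitude modulation introduced by windowing, so the interior of the signal is recovered; only the outermost samples, where the Hann window is exactly zero, are left unrecovered.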

[0065] In one embodiment, the overlap module (132) can perform a convolution operation on the result of the inverse Fourier transform through the convolution unit module (118) and overlap the result of the convolution operation to output audio output data. Here, the convolution operation can proceed by overlapping a small matrix called a filter (or kernel) on a part of the input image, calculating the matching element-wise product, and adding all the results. At this time, the calculated value can be a pixel value of a new image (or feature map).

[0066] Additionally, the convolution unit module (118) can perform the convolution operation repeatedly over the entire image while moving the filter to different parts of the image. The resulting feature map can emphasize specific features of the original image (e.g., edges, color, texture, etc.). Additionally, the filter values can be determined during the CNN training process.
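The sliding-filter computation described in the two paragraphs above can be written out directly. The sketch below is a plain software model with no padding and a stride of 1; the function name `conv2d_valid` and the example filter are assumptions, and it does not model the processing elements of the convolution unit module (118).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """CNN-style 2D convolution: slide the filter, multiply element-wise, sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r : r + kh, c : c + kw]        # part of the input under the filter
            feature_map[r, c] = np.sum(patch * kernel)   # one pixel of the feature map
    return feature_map

# Hypothetical 4x4 input whose values rise by 1 per column, and a filter
# that responds to left-to-right intensity changes.
image = np.arange(16, dtype=np.float64).reshape(4, 4)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])
feature_map = conv2d_valid(image, kernel)
```

Every 3x3 patch of this input has the same horizontal slope, so each output pixel is 3 rows x (+2) = 6.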

[0067] The on-chip memory (140) may include an input buffer (141) and an output buffer (142) that store image input data and image output data, respectively. Here, the on-chip memory (140) may correspond to a memory with a smaller capacity but faster access speed than off-chip memory and may be used, for example, to store the intermediate feature maps produced mainly during convolution operations. The on-chip memory (140) may be implemented so that the input buffer (141) and the output buffer (142) can be used without accessing the off-chip memory. For example, the input buffer (141) and the output buffer (142) may be logically defined as independent regions on a single on-chip memory (140) or implemented as multiple physically separated on-chip memories (140).

[0068] In one embodiment, the on-chip memory (140) may be implemented with a plurality of line buffers (143) that divide the input buffer (141) into units of image input data lines and store them. The on-chip memory (140) may further include weight buffers (144), which are used in the neural network operation and are each connected to a corresponding neural network unit (110) to store the corresponding weight values, and a connection buffer (145), which provides skip connections between the encoder and the decoder and supports pipeline operation. Here, the line buffer (143) may correspond to a device that provides input data for processing the assigned segmented image to the input neural network unit among the neural network units (110). Additionally, the weight buffer (144) may correspond to a device that provides at least a portion of the weights (hereinafter, weight data) to the neural network units (110). In one embodiment, the on-chip memory (140) may place a connection buffer (145) between the encoding unit (111) and the decoding unit (112) to store a concatenation feature map for skip connections. Here, the number of connection buffers (145) may be determined according to the pipelined structure, and the number of connection buffers (145) optimized for the pipeline may play an important role in accurately controlling the timing of the neural network.

[0069] For example, three connection buffers (145) may be required between the EN 0 Unit and the DE 0 Unit for pipeline processing. Conversely, one connection buffer (145) may be sufficient between the EN 1 Unit and the DE 1 Unit. If an additional encoding unit (111) is introduced, a decoding unit (112) must also be added in the same way, and two additional connection buffers (145) may be required to match the timing of the pipeline processing.
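One way to read these counts is that each skip connection needs one connection buffer per pipeline stage lying between its encoding unit and its decoding unit, so every additional inner encoder/decoder pair adds two buffers to each pair that encloses it. The sketch below encodes that reading. The formula and the function name are an interpretation, not something the source states; the source gives only the two-level example.

```python
def connection_buffers_per_level(depth):
    """Assumed number of connection buffers for each skip connection.

    Index 0 is the outermost pair (EN 0 / DE 0). The rule 1 + 2 * (inner pairs)
    is an interpretation of the example in the text, not a stated formula.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [1 + 2 * (depth - 1 - level) for level in range(depth)]

# Two encoder/decoder pairs, as in the example: EN 0/DE 0 and EN 1/DE 1.
two_levels = connection_buffers_per_level(2)
# Adding one more pair, under the same assumption.
three_levels = connection_buffers_per_level(3)
```

Under this assumption the two-level case reproduces the stated counts of three and one, and a third pair raises the outer counts by two each.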

[0070] In one embodiment, each of the weight buffers (144) is independently connected to the neural network units (110) and can store all weight values used in the operation of the corresponding neural network units (110). The weight buffer (144) can be independently connected to each neural network unit (110) and can store all weight values required for the convolution operation performed in the operation of the corresponding neural network units (110). That is, the weights can be used for neural network operation after being loaded from the on-chip memory (140) along with the input data.

[0071] The controller (150) can execute the pipeline operation of the neural network units (110) by accessing the on-chip memory (140) according to the neural network operation. For example, the controller (150) can control access to the on-chip memory (140) before the neural network operation begins to read out input data and weight values for the neural network operation, and access to the on-chip memory (140) after the neural network operation ends to store output data according to the neural network operation. Additionally, the controller (150) can control the initiation of a padding operation for the buffer data when the line-unit buffer data stored in the line buffers (143) overlaps with the edge region of the input data.
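The interaction between line buffers and edge padding can be modeled in a few lines. The sketch below streams image rows through three line buffers and substitutes zero rows at the top and bottom edges. The three-row kernel height, zero padding, and the name `stream_row_windows` are assumptions for illustration; the source does not specify the padding value or the kernel size.

```python
from collections import deque
import numpy as np

def stream_row_windows(image, kernel_rows=3):
    """Yield (kernel_rows x width) windows, zero-padding beyond the image edges."""
    pad = kernel_rows // 2
    height, width = image.shape
    zero_row = np.zeros(width, dtype=image.dtype)
    line_buffers = deque(maxlen=kernel_rows)   # the oldest line is dropped automatically
    for r in range(-pad, height + pad):
        # Outside the image, a padding row stands in for the missing line.
        row = image[r] if 0 <= r < height else zero_row
        line_buffers.append(row)
        if len(line_buffers) == kernel_rows:
            yield np.stack(list(line_buffers))

# Hypothetical 4x3 image; one window is produced per image row.
image = np.arange(1, 13).reshape(4, 3)
windows = list(stream_row_windows(image))
```

Only `kernel_rows` lines are held at any time, which mirrors why line buffers let a convolution proceed without keeping the whole image on chip.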

[0072]

[0073] FIG. 2 is a diagram illustrating the structure of a U-net neural network.

[0074] Referring to FIG. 2, the U-net can be composed of two main parts. One may correspond to a 'contracting path' or 'encoding path' that extracts features from an image, and the other may correspond to an 'expanding path' or 'decoding path' that generates an output with the same resolution as the original image based on the extracted features.

[0075] The encoder can perform downsampling for the purpose of capturing context and may include multiple convolution layers and max pooling layers similar to a general CNN. In FIG. 2, the encoder may be composed of four encoding blocks (En 0, En 1, ..., En 3), and one encoding block may be composed of three CNN layers and one downsampling step.

[0076] Additionally, the decoder can perform upsampling for fine localization purposes and may include multiple upsampling layers and CNN layers symmetrically with the encoder. In FIG. 2, the decoder may be composed of four decoding blocks (De 0, De 1, ..., De 3), and one decoding block may be composed of three CNN layers and one upsampling step.

[0077] Skip connections can serve to link each encoding block to its symmetric decoding block. Through these connections, learning efficiency can be improved by integrating information from the contracting path into the decoding process.

[0078] A bottleneck in a neural network can consist of the lowest-dimensional layers. That is, a bottleneck can be composed of the CNN layers between the last downsampling and the first upsampling. Therefore, a bottleneck can be characterized by containing high-level features.

[0079] The structure of U-Net allows for the acquisition of high-dimensional features through a deep network while effectively preserving fine-grained features (detailed and specific levels of information or structure) of images, thereby enabling pixel-level detailed predictions. U-Net's powerful feature extraction capabilities and ability to generate high-resolution outputs can be highly advantageous for restoring original images from noisy ones.
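
The encoder, bottleneck, decoder, and skip-connection layout described for FIG. 2 can be traced with a short Python sketch that follows feature-map shapes through the network. The downsampling factor of 2, the doubling of channels per encoding block, the base channel count, and the function name `unet_shapes` are all assumptions made for illustration; the text does not specify them.

```python
def unet_shapes(height, width, base_channels=16, num_blocks=4):
    """Trace (height, width, channels) through an assumed U-net layout.

    Assumptions: each encoding block halves the resolution and the next block
    doubles the channels; each decoding block doubles the resolution, merges
    the skip feature map of its paired encoding block, and restores that
    block's channel count.
    """
    if height % (2 ** num_blocks) or width % (2 ** num_blocks):
        raise ValueError("height and width must be divisible by 2**num_blocks")
    trace, skips = [], []
    h, w, c = height, width, base_channels
    for i in range(num_blocks):               # contracting (encoding) path
        skips.append((h, w, c))               # kept for the skip connection
        trace.append((f"En {i}", (h, w, c)))
        h, w, c = h // 2, w // 2, c * 2       # downsampling step
    trace.append(("Bottleneck", (h, w, c)))   # lowest-resolution layers
    for i in reversed(range(num_blocks)):     # expanding (decoding) path
        skip_h, skip_w, skip_c = skips[i]
        h, w = h * 2, w * 2                   # upsampling step
        c = skip_c                            # channels after merging the skip
        trace.append((f"De {i}", (h, w, c)))
    return trace


for name, shape in unet_shapes(256, 256):
    print(name, shape)
```

In this sketch De 0 ends at the same resolution as En 0, which corresponds to the output having the same resolution as the original image.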

[0080]

[0081] FIG. 3 is a diagram explaining the basic structure of the block diagram of each neural network unit of FIG. 1, FIG. 4 is a diagram explaining the block diagram regarding the hardware structure of the preprocessing unit according to the present invention, and FIG. 5 is a diagram explaining the block diagram regarding the hardware structure of the postprocessing unit according to the present invention.

[0082] Referring to FIG. 3, an artificial intelligence acceleration hardware device (100) may include neural network units (110) connected according to a neural network structure to implement the operation of a neural network. At this time, the neural network units (110) may include a temporary buffer (310), a host CPU (320), a weight register module (330), a shift module (340), a convolution unit module (350), and a weight buffer (144). Each unit constituting the neural network unit (110) may be configured to include basic modules for convolution operations, and the detailed configuration may vary slightly depending on the role of each unit. For example, units including downsampling, residual connection, or concatenation may be implemented with different configurations depending on the detailed function.

[0083] The basic structure commonly applied to each unit may include a temporary buffer (310), a host CPU (320), a weight register module (330), a shift module (340), a convolution unit module (350), and weight buffers (144). Specifically, the temporary buffer (310) can store intermediate feature maps generated from layers other than the last convolution layer of each unit. The host CPU (320) can transfer the data allocated to each neural network unit (110) through the controller (150) to a line buffer (143) in the form of scanlines. The weight register module (330) can receive the weights required for convolution operations from the weight buffer (144).

[0084] Additionally, the shift module (340) can perform a shift operation on the input data for convolution operations. Here, the convolution operation can be performed by sliding the filters over the input feature map one pixel at a time. The convolution unit module (350) can sequentially perform the actual convolution operations of the CNN layers included in each unit, and its processing elements can be configured to suit the layer configuration. Additionally, the weight buffer (144) can receive and store all weights used in each neural network unit (110).
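
How scanline input, line buffering, pixel-wise shifting, and edge padding fit together can be illustrated with a small software model. The sketch below is not the disclosed hardware; it assumes a single-channel image, a 3x3 filter, zero padding, and the cross-correlation form of convolution commonly used in CNNs, and the name `conv3x3_streaming` is hypothetical.

```python
from collections import deque


def conv3x3_streaming(image, kernel):
    """Same-size 3x3 convolution that consumes the image one scanline at a time.

    Three line buffers hold the most recent scanlines; zero lines and zero
    columns stand in for the padding applied at the edges of the input.
    """
    height, width = len(image), len(image[0])
    zero_line = [0.0] * (width + 2)
    # Start with one all-zero line as the top-edge padding.
    line_buffers = deque([zero_line], maxlen=3)
    output = []
    for r in range(height + 1):
        if r < height:
            # Incoming scanline with left and right padding.
            line = [0.0] + [float(v) for v in image[r]] + [0.0]
        else:
            line = zero_line  # bottom-edge padding
        line_buffers.append(line)
        if len(line_buffers) < 3:
            continue  # not enough lines buffered yet
        out_row = []
        for col in range(width):  # shift the window one pixel at a time
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += line_buffers[i][col + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


box = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(conv3x3_streaming([[1, 2], [3, 4]], box))  # [[10.0, 10.0], [10.0, 10.0]]
```

Only three scanlines are held at any time, which is the property that makes a line-buffer organization attractive when on-chip memory is limited.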

[0085] Referring to FIG. 4, the preprocessing unit (120) can convert sound source data into a spectrogram form by performing a Short-Time Fourier Transform (STFT). Here, the spectrogram generated through the STFT is suitable for sound source analysis because it represents information in both the time domain and the frequency domain. In addition, since the spectrogram can be treated as a feature map, the units following the preprocessing unit (120) can process it in the same manner as described above. The operation of the modules constituting the preprocessing unit (120) is as follows.

[0086] First, the preprocessing unit (120) can divide the input data into fixed-length windows through the window shifter module (410) and perform a shift operation on the input data. The preprocessing unit (120) can then perform a Fourier transform operation on each window received from the window shifter module (410) through the Fourier transform module (121). Here, the Fourier transform operation can be performed while sliding the window by a constant interval, and the interval can be changed according to the degree of overlap. The overlap may correspond to the interval over which the spectrograms overlap when the resulting spectrograms are aggregated by the overlap module (132) in the post-processing unit (130). The preprocessing unit (120) can convert the result of the Fourier transform operation into magnitude and phase through the spectrogram generation module (122) and represent the change in frequency components over time in the form of a spectrogram. Additionally, the preprocessing unit (120) can store the intermediate feature map generated by the spectrogram generation module (122) through the temporary buffer module (420) and output the result value when the spectrogram for the entire time domain is completed.
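
The windowing, Fourier transform, and magnitude/phase steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the disclosed hardware: the Hann window, the window size of 8, the hop of 4, the naive DFT, and the function names `hann`, `dft`, and `stft` are assumptions chosen to keep the example short.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive O(N^2) discrete Fourier transform; real designs would use an FFT."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def stft(signal, window_size=8, hop=4):
    """Return (magnitude, phase) spectrograms with one row per window position."""
    win = hann(window_size)
    magnitude, phase = [], []
    # Slide the fixed-length window over the signal by a constant interval (hop).
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = [signal[start + i] * win[i] for i in range(window_size)]
        spectrum = dft(frame)[: window_size // 2 + 1]  # non-negative frequencies
        magnitude.append([abs(c) for c in spectrum])
        phase.append([cmath.phase(c) for c in spectrum])
    return magnitude, phase


# A cosine with two cycles per 8-sample window peaks in frequency bin 2.
tone = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
mag, ph = stft(tone)
print(len(mag), len(mag[0]))  # 3 5
```

A hop of half the window size corresponds to 50% overlap; a smaller hop increases the overlap and the number of spectrogram rows.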

[0087] Referring to FIG. 5, the post-processing unit (130) can restore a spectrogram, from which noise has been removed by passing it through the neural network, into the form of sound source data. Here, since the spectrogram generated through the STFT is divided into window regions, the post-processing unit (130) can perform an inverse Fourier transform through the inverse Fourier transform module (131) to restore it into sound source data and can combine the spectrograms according to the overlap interval through the overlap module (132). In one embodiment, the operation of the modules constituting the post-processing unit (130) is as follows.

[0088] First, the post-processing unit (130) can perform a shift operation on the input data through the window shifter module (510). The post-processing unit (130) can then perform an inverse Fourier transform operation on each window received from the window shifter module (510) through the inverse Fourier transform module (131). Additionally, the post-processing unit (130) can overlap and combine the windows by the overlap interval of the spectrogram through the overlap module (132). The post-processing unit (130) is not necessarily limited to this; it can also perform a convolution operation through the convolution unit module (530) while combining the windows and can store the intermediate feature maps generated during the convolution operation in the temporary buffer (520). In one embodiment, the post-processing unit (130) can output a result value when all windows have been combined.
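
The inverse transform and overlap steps can be illustrated with a matching Python sketch. It assumes the spectrogram is given as one-sided magnitude and phase rows with an even window size, and it uses plain summation in the overlapped regions; the names `inverse_dft_real` and `istft_overlap_add` are hypothetical, and the optional convolution step mentioned above is not modeled.

```python
import cmath
import math


def inverse_dft_real(half_spectrum, n):
    """Inverse DFT of a one-sided spectrum (n even), returning n real samples."""
    # Rebuild the negative-frequency bins by conjugate symmetry.
    full = list(half_spectrum) + [
        c.conjugate() for c in reversed(half_spectrum[1:-1])
    ]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def istft_overlap_add(magnitude, phase, window_size, hop):
    """Inverse-transform each spectrogram row and sum the windows where they overlap."""
    if not magnitude:
        return []
    output = [0.0] * (hop * (len(magnitude) - 1) + window_size)
    for f, (mag_row, phase_row) in enumerate(zip(magnitude, phase)):
        half = [cmath.rect(m, p) for m, p in zip(mag_row, phase_row)]
        frame = inverse_dft_real(half, window_size)
        start = f * hop  # window position in the output signal
        for i, sample in enumerate(frame):
            output[start + i] += sample
    return output


# Two unwindowed DC-only frames of length 4 with hop 2: each restores to
# [1, 1, 1, 1], and the two overlapped samples in the middle add up to 2.
mag = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
ph = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(istft_overlap_add(mag, ph, window_size=4, hop=2))
```

Plain summation restores the original amplitude only when the analysis windows add up to a constant across the overlap, as a Hann window does at 50% overlap; with other windows a normalization step would be needed.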

[0089]

[0090] FIG. 6 is a flowchart illustrating a sound source noise removal process based on sound source separation according to an embodiment of the present invention.

[0091] In FIG. 6, the sequence of sound source noise removal processing based on sound source separation in the artificial intelligence acceleration hardware device (100) presented in FIG. 1 is described. First, the artificial intelligence acceleration hardware device (100) can receive weights from the host CPU (320) through the controller (150) and store them in the weight buffer (144) inside the neural network core (step S610). Subsequently, the artificial intelligence acceleration hardware device (100) can input sound source data into the preprocessing unit (120) through the controller (150) (step S620) and perform STFT operations on the sound source data through the preprocessing unit (120) to convert it into a spectrogram form (step S630). Here, the artificial intelligence acceleration hardware device (100) can perform data formatting and alignment operations before STFT conversion on the input sound source data through the preprocessing unit (120). Additionally, the artificial intelligence acceleration hardware device (100) can convert the sound source data into a spectrogram by performing a time-frequency conversion based on STFT operations on the sound source data and change the data structure so that frequency characteristics can be analyzed through the neural network unit (110).

[0092] In one embodiment, the artificial intelligence acceleration hardware device (100) can transmit the converted spectrogram in the form of scanlines to a line buffer (143) (step S640). The artificial intelligence acceleration hardware device (100) can receive the spectrogram as input through each neural network unit (110) and perform noise removal on the scanlines in SISD (Single Instruction Single Data) form (step S650). The artificial intelligence acceleration hardware device (100) can perform an inverse Fourier transform on the result data stored in the output buffer (142) through the post-processing unit (130) and restore it to sound source data (step S660). The artificial intelligence acceleration hardware device (100) can repeat the sound source noise removal process based on sound source separation until the entire sound source data is processed, and terminate the operation when the last sound source data is processed (step S670).
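
The order of steps S610 to S670 can be summarized as a control-flow sketch in Python. It models only the sequencing: the function `run_noise_removal` and its callable parameters `preprocess`, `denoise_line`, and `postprocess` are hypothetical placeholders for the preprocessing unit, the neural network units, and the post-processing unit, and the placeholder functions in the usage lines are not real signal processing.

```python
def run_noise_removal(audio_chunks, weights, preprocess, denoise_line, postprocess):
    """Process sound source data chunk by chunk, following steps S610 to S670."""
    weight_buffer = dict(weights)                    # S610: store weights once
    restored = []
    for chunk in audio_chunks:                       # S620: input sound source data
        spectrogram = preprocess(chunk)              # S630: STFT to spectrogram
        line_buffer = [list(line) for line in spectrogram]  # S640: scanlines
        output_buffer = [                            # S650: per-scanline noise removal
            denoise_line(line, weight_buffer) for line in line_buffer
        ]
        restored.append(postprocess(output_buffer))  # S660: restore sound source data
    return restored                                  # S670: all data processed


# Placeholder stages, used only to show the data flow.
result = run_noise_removal(
    audio_chunks=[[1.0, 2.0], [3.0, 4.0]],
    weights={"gain": 0.5},
    preprocess=lambda chunk: [chunk],                       # one-line "spectrogram"
    denoise_line=lambda line, w: [v * w["gain"] for v in line],
    postprocess=lambda buf: buf[0],
)
print(result)  # [[0.5, 1.0], [1.5, 2.0]]
```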

[0093]

[0094] Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various modifications and changes can be made to the invention without departing from the spirit and scope of the invention as described in the following claims.

[0095]

[0096] [National R&D projects that supported this invention]

[0097] [Project ID] 2710064480

[0098] [Assignment No.] 00156354

[0099] [Ministry Name] Ministry of Science and ICT

[0100] [Project Management (Specialized) Agency Name] Korea Institute of Information and Communications Technology Planning and Evaluation

[0101] [Research Project Name] Information and Communication Broadcasting Innovation Talent Development (R&D)

[0102] [Project Title] Research on Ultra-Realistic XR Technology for Real-Virtual Interconnected Metaverse

[0103] [Contribution Rate] 50%

[0104] [Name of Project Performing Organization] Sejong University Industry-Academic Cooperation Foundation

[0105] [Research Period] 20250101 ~ 20251231

[0106]

[0107] [Project ID] 2710085862

[0108] [Assignment No.] 02315892

[0109] [Ministry Name] Ministry of Science and ICT

[0110] [Name of Project Management (Specialized) Agency] Korea Institute for Science and Technology Commercialization

[0111] [Research Project Name] Promotion of University Technology Management (IP Star Scientist Support Type)

[0112] [Project Title] IP Advancement and Commercialization for the Promotion of Commercialization of Real-time Denoising AI Hardware Technology

[0113] [Contribution Rate] 50%

[0114] [Name of Project Performing Organization] Sejong University Industry-Academic Cooperation Foundation

[0115] [Research Period] April 1, 2025 ~ December 31, 2026

[0116]

[0117] [Explanation of reference numerals]

[0118] 100: Artificial Intelligence Acceleration Hardware Device

[0119] 110: Neural network unit 111: Encoding unit

[0120] 112: Decoding Unit 113: Bottleneck Unit

[0121] 120: Preprocessing unit

[0122] 121: Fourier Transform Module 122: Spectrogram Generation Module

[0123] 130: Post-processing unit 131: Inverse Fourier transform module

[0124] 132: Overlap Module

[0125] 140: On-chip memory 141: Input buffer

[0126] 142: Output buffer 143: Line buffer

[0127] 144: Weight buffer 145: Connection buffer

[0128] 150: Controller

Claims

1. An artificial intelligence acceleration hardware device comprising: neural network units connected according to an encoder-decoder-based neural network structure to implement a neural network operation; a preprocessing unit that converts audio input data into image input data and provides the image input data to an encoder neural network unit disposed at an input end among the neural network units; a post-processing unit that receives image output data from a decoder neural network unit disposed at an output end among the neural network units and outputs the image output data as audio output data; an on-chip memory including an input buffer and an output buffer that store the image input data and the image output data, respectively; and a controller that executes pipeline operations of the neural network units through access to the on-chip memory according to the neural network operation.

2. The artificial intelligence acceleration hardware device of claim 1, wherein the neural network units include: encoding units that receive the image input data and perform an encoding process of compressing it into encoded data; decoding units that output the image output data by removing noise from the image input data through a decoding process that restores the encoded data; and a bottleneck unit connecting the encoding units and the decoding units.

3. The artificial intelligence acceleration hardware device of claim 2, wherein each of the encoding units and the decoding units includes: a weight register (WeightReg) module connected to a dedicated weight buffer to receive weights; and a convolution unit module that implements at least one convolution layer for performing, based on the weights, a convolution operation on input data for the encoding process or the decoding process.

4. The artificial intelligence acceleration hardware device of claim 1, wherein the preprocessing unit includes: a Fourier transform module that performs a Fourier transform on the audio input data; and a spectrogram generation module that generates, as the image input data, a spectrogram of the result of the Fourier transform.

5. The artificial intelligence acceleration hardware device of claim 4, wherein the Fourier transform module performs the Fourier transform by dividing the audio input data into overlapping windows and performing a Short-Time Fourier Transform (STFT) operation.

6. The artificial intelligence acceleration hardware device of claim 5, wherein the Fourier transform module sets the size of the overlapping windows according to noise characteristics of the audio input data.

7. The artificial intelligence acceleration hardware device of claim 4, wherein the spectrogram generation module generates the spectrogram as a feature map for the image input data and provides it to the encoder neural network unit.

8. The artificial intelligence acceleration hardware device of claim 1, wherein the post-processing unit includes: an inverse Fourier transform module that performs an inverse Fourier transform on the image output data; and an overlap module that outputs the audio output data by overlapping the results of the inverse Fourier transform.

9. The artificial intelligence acceleration hardware device of claim 8, wherein the inverse Fourier transform module performs the inverse Fourier transform by performing an Inverse Short-Time Fourier Transform (ISTFT) operation on the image output data.

10. The artificial intelligence acceleration hardware device of claim 9, wherein the overlap module performs a convolution operation on the result of the inverse Fourier transform through a convolution unit module and overlaps the results of the convolution operation to output the audio output data.

11. The artificial intelligence acceleration hardware device of claim 1, wherein, in the on-chip memory, the input buffer is implemented with a plurality of line buffers that divide and store the image input data in units of lines, and the on-chip memory further includes weight buffers that are used in the neural network operation and are each connected to a corresponding neural network unit to store corresponding weight values, and a connection buffer that provides a skip connection between the encoder and the decoder and supports the pipeline operation.

12. The artificial intelligence acceleration hardware device of claim 11, wherein each of the weight buffers is independently connected to the neural network units and stores all weight values used in the operation of the corresponding neural network units.

13. An artificial intelligence acceleration method comprising: implementing a neural network operation by connecting neural network units according to an encoder-decoder-based neural network structure; converting, through a preprocessing unit, audio input data into image input data and providing the image input data to an encoder neural network unit disposed at an input end among the neural network units; receiving, through a post-processing unit, image output data from a decoder neural network unit disposed at an output end among the neural network units and outputting the image output data as audio output data; storing the image input data and the image output data through an input buffer and an output buffer of an on-chip memory, respectively; and executing, through a controller, pipeline operations of the neural network units through access to the on-chip memory according to the neural network operation.