The invention relates to the hardware circuit design and method of a data loading device combined with a main memory, used to accelerate deep convolutional neural network computation. The device comprises a specifically designed cache structure that includes: an input cache and control unit, in which a macro-block segmentation method is applied to input from the main memory and/or other memories to achieve regional data sharing as well as tensor data fusion and distribution; a parallel input register array that converts the segmented data pieces input into the cache; and a tensor-type data loading unit connected to the output of the input cache and the input of the parallel input register array. The design simplifies the address decoding circuit, saving area and power consumption without compromising high data bandwidth. The hardware device and data processing method provided by the invention comprise a transformation method, a macro-block segmentation method, and an addressing method for the input data, so that algorithm acceleration can be carried out with limited hardware resources and address management complexity is reduced.
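As a rough illustration of the macro-block segmentation idea, the sketch below tiles an input feature map into overlapping macro blocks so that neighbouring blocks share boundary data, as a convolution window straddling a block edge requires. The block size, halo width, and tensor layout here are illustrative assumptions, not the scheme defined by the patent:

```python
import numpy as np

def macro_block_split(fmap, block_h, block_w, halo):
    """Split an input feature map (C, H, W) into overlapping macro blocks.

    The `halo` rows/columns duplicated at each block edge let adjacent
    blocks share their boundary region, so a convolution window that
    straddles a block boundary can still be evaluated locally.
    Parameters are assumptions for illustration, not patent values.
    """
    C, H, W = fmap.shape
    blocks = []
    for y in range(0, H, block_h):
        for x in range(0, W, block_w):
            # Clamp the halo-extended window to the feature-map bounds.
            y0, y1 = max(y - halo, 0), min(y + block_h + halo, H)
            x0, x1 = max(x - halo, 0), min(x + block_w + halo, W)
            blocks.append(((y, x), fmap[:, y0:y1, x0:x1]))
    return blocks

# Example: a 3x8x8 feature map tiled into 4x4 macro blocks with a 1-pixel halo.
fmap = np.arange(3 * 8 * 8, dtype=np.int32).reshape(3, 8, 8)
blocks = macro_block_split(fmap, 4, 4, 1)
print(len(blocks))          # 4 macro blocks
print(blocks[0][1].shape)   # corner block with halo: (3, 5, 5)
```

Fixed-size, power-of-two macro blocks are what make the simplified address decoding plausible: a block's base address can then be formed by concatenating block indices rather than by full multiply-and-add address arithmetic.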