The invention belongs to the field of computer management and particularly relates to a GPU cluster service management system and method. The GPU cluster service management system comprises: a resource monitoring module for monitoring GPU cluster resources, generating cluster resource data, and sending the cluster resource data; a resource allocation module for acquiring task information and the cluster resource data and allocating task resources according to the task information and the cluster resource data; a checking module for obtaining the cluster resource data sent by the resource monitoring module, checking the GPU cluster resource state according to the cluster resource data, generating a checking result, and sending the checking result; and an isolation module for acquiring the checking result and isolating abnormal resources according to the checking result. With the GPU cluster service management system and method, the state of every resource in the GPU cluster can be monitored in real time so that resources are utilized efficiently, and abnormal resources can be automatically detected and isolated, which keeps the GPU cluster operating normally and improves its processing efficiency.
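
The interaction of the four modules can be illustrated with a minimal Python sketch. It is not the claimed implementation. All class names, fields, the temperature threshold, and the "most free memory" placement policy are assumptions made for illustration only, and the `probe` callable is a hypothetical stand-in for whatever mechanism actually collects GPU state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple

GpuKey = Tuple[str, int]  # (node name, GPU index); hypothetical identifier


@dataclass(frozen=True)
class GpuStatus:
    """One record of cluster resource data (fields are assumed)."""
    node: str
    index: int
    free_memory_mb: int
    temperature_c: float
    responsive: bool

    @property
    def key(self) -> GpuKey:
        return (self.node, self.index)


@dataclass(frozen=True)
class Task:
    """Task information reduced to a name and a memory requirement."""
    name: str
    memory_mb: int


class ResourceMonitor:
    """Resource monitoring module: produces cluster resource data."""

    def __init__(self, probe: Callable[[], Iterable[GpuStatus]]):
        self._probe = probe  # hypothetical data source

    def collect(self) -> List[GpuStatus]:
        return list(self._probe())


class Checker:
    """Checking module: derives a checking result from the resource data."""

    def __init__(self, max_temperature_c: float = 90.0):
        self.max_temperature_c = max_temperature_c  # assumed threshold

    def check(self, data: Iterable[GpuStatus]) -> Set[GpuKey]:
        # A GPU is treated as abnormal if it does not respond or runs too hot.
        return {
            s.key
            for s in data
            if not s.responsive or s.temperature_c > self.max_temperature_c
        }


class Isolator:
    """Isolation module: keeps abnormal resources out of scheduling."""

    def __init__(self) -> None:
        self.isolated: Set[GpuKey] = set()

    def isolate(self, abnormal: Set[GpuKey]) -> None:
        self.isolated |= abnormal


class Allocator:
    """Resource allocation module: places a task on a usable GPU."""

    def allocate(
        self, task: Task, data: Iterable[GpuStatus], isolated: Set[GpuKey]
    ) -> Optional[GpuKey]:
        candidates = [
            s
            for s in data
            if s.key not in isolated and s.free_memory_mb >= task.memory_mb
        ]
        if not candidates:
            return None  # no healthy GPU can hold the task
        # Assumed policy: choose the GPU with the most free memory.
        return max(candidates, key=lambda s: s.free_memory_mb).key


def demo() -> Tuple[Set[GpuKey], Optional[GpuKey]]:
    snapshot = [
        GpuStatus("node-a", 0, 16000, 65.0, True),
        GpuStatus("node-a", 1, 24000, 97.0, True),   # overheated
        GpuStatus("node-b", 0, 8000, 60.0, False),   # not responding
        GpuStatus("node-b", 1, 12000, 70.0, True),
    ]
    monitor = ResourceMonitor(lambda: snapshot)
    checker, isolator, allocator = Checker(), Isolator(), Allocator()

    data = monitor.collect()               # monitor -> resource data
    isolator.isolate(checker.check(data))  # check -> isolate
    placement = allocator.allocate(Task("train", 10000), data, isolator.isolated)
    return isolator.isolated, placement
```

In this sketch, calling `demo()` isolates the overheated and the unresponsive GPU and places the task on one of the remaining healthy GPUs. The allocator receives the isolated set explicitly, so a resource flagged by the checking module is never considered for a new task.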