The present disclosure is directed toward systems and methods for detecting an object in an input image based on a target object keyword. For example, one or more embodiments described herein generate a heat map of the input image based on the target object keyword and generate various bounding boxes based on a pixel analysis of the heat map. One or more embodiments described herein then utilize the various bounding boxes to determine scores for generated object location proposals in order to provide a highest scoring object location proposal overlaid on the input image.