The invention provides a text visual question-answering system and method based on concept interaction and associated semantics. The system comprises an object position extraction module, a first fully connected layer, a text information extraction module, a second fully connected layer, an OCR-object graph convolutional network, a graph convolutional network with a multi-step gate mechanism, a transformer network, and a Bidirectional Encoder Representations from Transformers (BERT) encoder. According to the invention, the positional relationships between objects and text in an image are modeled first; the text information and object information are then modeled jointly by the OCR-object graph convolutional network, so that rich, directional features for relationship encoding are learned through the gate mechanism; finally, the transformer network attends precisely to the objects and text in the image, thereby obtaining a more accurate answer.
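
The two graph steps described above (linking OCR text to objects by position, then passing gated messages along those links) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the invention: the function names `build_ocr_object_edges` and `gated_update`, the center-distance rule with threshold `max_dist`, and the scalar sigmoid gate parameterized by `gate_w` are all hypothetical stand-ins for the claimed networks.

```python
import math


def center(box):
    """Return the center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def build_ocr_object_edges(ocr_boxes, obj_boxes, max_dist):
    """Link OCR token i to object j when their box centers are within max_dist.

    Assumption: proximity of centers stands in for the positional
    relationship; the actual relation used by the invention is not specified.
    """
    edges = []
    for i, ocr_box in enumerate(ocr_boxes):
        cx, cy = center(ocr_box)
        for j, obj_box in enumerate(obj_boxes):
            ox, oy = center(obj_box)
            if math.hypot(cx - ox, cy - oy) <= max_dist:
                edges.append((i, j))
    return edges


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(ocr_feats, obj_feats, edges, gate_w):
    """One gated message-passing step from object nodes to OCR nodes.

    For each edge (i, j), a scalar gate in (0, 1) is computed from the
    weighted elementwise product of the two feature vectors, and the gated
    object feature is added to the OCR feature. gate_w is a hypothetical
    learned weight vector with the same length as the features.
    """
    updated = [list(f) for f in ocr_feats]
    for i, j in edges:
        score = sum(w * a * b for w, a, b in zip(gate_w, ocr_feats[i], obj_feats[j]))
        gate = sigmoid(score)
        for k in range(len(gate_w)):
            updated[i][k] += gate * obj_feats[j][k]
    return updated


# Two OCR tokens and one detected object; only the first token is near the object.
ocr_boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]
obj_boxes = [(2, 2, 12, 12)]
edges = build_ocr_object_edges(ocr_boxes, obj_boxes, max_dist=5.0)

ocr_feats = [[1.0, 0.0], [0.0, 1.0]]
obj_feats = [[1.0, 1.0]]
updated = gated_update(ocr_feats, obj_feats, edges, gate_w=[0.0, 0.0])
```

With zero gate weights the gate is exactly 0.5, so the first OCR token receives half of the object's feature and the unconnected token is left unchanged. In a trained model the gate weights would be learned, and the step would typically be repeated or stacked before the features are passed to the transformer network.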