The invention discloses an indoor scene understanding method based on a 2D-3D semantic data set. The method mainly covers the collection and organization of the data and its division into training and test sets, and comprises the following steps: first, images are captured, and the scanned area, raw color-depth (RGB-D) images and a textured 3D mesh are output; the mesh is sampled to generate point clouds; the data are semantically annotated, and each point label is projected onto the 3D mesh and into the image domain; and, because certain areas in the data set represent parts of buildings that are similar in appearance and architectural features, standard training and test splits are defined. With the disclosed semantic data set, joint and cross-modal learning models and unsupervised methods can be developed by exploiting the regularities present in large-scale indoor spaces; strong cues are provided for detecting semantics, layout, occlusion, shape and patterns; and the method is not limited in scale, diversity or quantity.
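The step of projecting each point label into the image domain can be illustrated with a minimal sketch. It assumes a pinhole camera model and points already expressed in camera coordinates (z pointing forward); the function name `project_labels`, the intrinsics `fx`, `fy`, `cx`, `cy` and the nearest-point rule for resolving occlusion are illustrative assumptions, not details taken from the disclosed method.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project_labels(
    points: Sequence[Point],
    labels: Sequence[int],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[List[Optional[int]]]:
    """Project labeled 3D points (camera coordinates) into a label image.

    Pixels that no point lands on stay None. When several points fall on
    the same pixel, the nearest one wins (a simple z-buffer), so occluded
    points do not overwrite visible ones.
    """
    label_img: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    depth_img = [[float("inf")] * width for _ in range(height)]

    for (x, y, z), label in zip(points, labels):
        if z <= 0:
            # Point lies behind (or on) the camera plane; it is not visible.
            continue
        # Pinhole projection: perspective divide, then shift to the principal point.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth_img[v][u]:
            depth_img[v][u] = z
            label_img[v][u] = label

    return label_img
```

The z-buffer is one simple way to keep labels consistent with occlusion. A point cloud sampled sparsely from the mesh will leave unlabeled pixels, so in practice rendering the labeled mesh itself may give denser label images.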