User interaction with a display is detected substantially simultaneously using at least two cameras whose intersecting fields of view (FOVs) define a three-dimensional hover zone within which user interactions can be imaged. Image data is analyzed separately and collectively to identify a relatively small number of user landmarks. A substantially unambiguous correspondence is established between the same landmark in each acquired image, and a three-dimensional reconstruction is made in a common coordinate system. Preferably the cameras are modeled as having the characteristics of pinhole cameras, enabling rectified epipolar geometric analysis to facilitate more rapid disambiguation among potential landmark points. Consequently, processing overhead is substantially reduced, as is latency.
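The role of rectification can be illustrated with a minimal sketch, which is not taken from the disclosure itself. It assumes two identical, already rectified pinhole cameras, so that a landmark seen in the left image can only appear on (nearly) the same row of the right image. The names Candidate, match_on_epipolar_line, and triangulate, the row tolerance, and all numeric values are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate landmark position in a rectified image, in pixels."""
    x: float  # column
    y: float  # row


def match_on_epipolar_line(left, right_candidates, row_tol=1.5):
    """Keep only right-image candidates that could correspond to `left`.

    In a rectified pair the epipolar line is the image row itself, so the
    search collapses from the whole image to a narrow horizontal band.
    Candidates with non-positive disparity are rejected as well.
    """
    return [
        c for c in right_candidates
        if abs(c.y - left.y) <= row_tol and (left.x - c.x) > 0
    ]


def triangulate(left, right, focal_px, baseline_m, cx, cy):
    """Reconstruct (X, Y, Z) in the left-camera frame, in metres.

    Assumes both cameras share focal length `focal_px` (pixels) and
    principal point (cx, cy), and are separated by `baseline_m` along x.
    """
    disparity = left.x - right.x
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity
    x = (left.x - cx) * z / focal_px
    y = (left.y - cy) * z / focal_px
    return (x, y, z)


# Hypothetical fingertip candidate in the left image and three in the right.
left_tip = Candidate(400.0, 240.0)
right_tips = [Candidate(350.0, 240.5), Candidate(360.0, 300.0), Candidate(420.0, 240.0)]

matches = match_on_epipolar_line(left_tip, right_tips)
point = triangulate(left_tip, matches[0], focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0)
```

In this example only one of the three right-image candidates survives the row and disparity tests, which is the sense in which rectified epipolar analysis reduces the work needed to disambiguate landmark points.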
Landmark identification and position information is convertible into a command causing the display to respond appropriately to a user gesture. Advantageously, the size of the hover zone can far exceed the size of the display, making the invention usable with smart phones as well as large entertainment TVs.
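Why the hover zone can exceed the display size can be seen from a simplified geometric sketch. It assumes, purely for illustration, two cameras with parallel optical axes, equal horizontal FOV, and a known baseline; the actual camera placement and orientation are not specified here. The function names and numeric values are hypothetical.

```python
import math


def overlap_width(z_m, baseline_m, hfov_deg):
    """Width (metres) of the region seen by both cameras at distance z_m.

    Each camera covers a width of 2 * z * tan(hfov / 2); with parallel
    axes the two footprints are offset by the baseline, so the shared
    width is that footprint minus the baseline (never below zero).
    """
    half_tan = math.tan(math.radians(hfov_deg) / 2.0)
    return max(0.0, 2.0 * z_m * half_tan - baseline_m)


def min_overlap_distance(baseline_m, hfov_deg):
    """Distance (metres) at which the two fields of view begin to intersect."""
    half_tan = math.tan(math.radians(hfov_deg) / 2.0)
    return baseline_m / (2.0 * half_tan)


# Hypothetical handheld-sized device: cameras 0.1 m apart, 90 degree FOV.
near = overlap_width(0.04, baseline_m=0.1, hfov_deg=90.0)
far = overlap_width(0.5, baseline_m=0.1, hfov_deg=90.0)
start = min_overlap_distance(baseline_m=0.1, hfov_deg=90.0)
```

Under these assumptions the shared region starts a few centimetres from the cameras and, half a metre out, is already several times wider than the baseline, which is consistent with a hover zone much larger than a small display.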