User interaction with a display is detected substantially simultaneously using at least two cameras whose intersecting fields of view (FOVs) define a three-dimensional hover zone within which user interactions can be imaged. Image data from the cameras is analyzed, separately and collectively, to identify a relatively small number of user landmarks. A substantially unambiguous correspondence is established between the same landmark in each acquired image, and a three-dimensional reconstruction is made in a common coordinate system. Preferably, the cameras are modeled as having the characteristics of pinhole cameras, enabling rectified epipolar geometric analysis that more rapidly disambiguates among potential landmark points. Consequently, processing overhead and latency are substantially reduced. Landmark identification and position information can be converted into a command that causes the display to respond appropriately to a user gesture. Advantageously, the hover zone can be far larger than the display, making the invention usable with devices ranging from smart phones to large entertainment TVs.
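
The correspondence and reconstruction steps summarized above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes two ideal pinhole cameras that have already been rectified, so that a landmark appears on the same pixel row in both images, and that both cameras share one focal length and principal point. The function names, the row tolerance, and the choice of the left camera as the origin of the common coordinate system are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple

Point2D = Tuple[float, float]          # (column, row) in pixels
Point3D = Tuple[float, float, float]   # (X, Y, Z) in the left-camera frame


def match_on_epipolar_line(left_pt: Point2D,
                           right_candidates: Iterable[Point2D],
                           row_tol: float = 1.5) -> Optional[Point2D]:
    """Pick the right-image candidate corresponding to left_pt.

    In rectified images the epipolar line of a left-image point is the
    same pixel row in the right image, so candidates on other rows can
    be discarded without any further analysis. row_tol is an assumed
    tolerance (in pixels) for calibration and detection noise.
    """
    xl, yl = left_pt
    best: Optional[Point2D] = None
    best_err = row_tol
    for xr, yr in right_candidates:
        row_err = abs(yr - yl)
        # A point in front of both cameras has positive disparity.
        if row_err <= best_err and (xl - xr) > 0:
            best, best_err = (xr, yr), row_err
    return best


def triangulate_rectified(left_pt: Point2D, right_pt: Point2D,
                          focal_px: float, baseline_m: float,
                          cx: float, cy: float) -> Point3D:
    """Reconstruct a landmark in the left-camera coordinate system.

    Uses the standard rectified-stereo relations:
        Z = f * B / d,  X = (xl - cx) * Z / f,  Y = (yl - cy) * Z / f
    where d = xl - xr is the disparity in pixels.
    """
    xl, yl = left_pt
    xr, _ = right_pt
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity
    x = (xl - cx) * z / focal_px
    y = (yl - cy) * z / focal_px
    return (x, y, z)
```

Restricting the search to a single row is what keeps the per-landmark matching cost low in this sketch: only candidates near the epipolar row are compared, and the matched pair yields a three-dimensional position from one division and two multiplications.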