Systems and methods are disclosed for packet voice conferencing. An encoding system accepts two sound field signals,
representing the same sound field sampled at two spatially-separated points. The relative delay between the two
sound field signals is detected over a given time interval. The sound field signals are combined and then encoded
as a single audio signal, e.g., by a method suitable for monophonic VoIP. The encoded audio payload and the relative
delay are placed in one or more packets and sent to a decoding device via a packet network. The decoding device uses
the relative delay to drive a playout splitter: once the encoded audio payload has been decoded, the playout splitter
creates multiple presentation channels by inserting the transmitted relative delay into the decoded
signal for one (or more) of the presentation channels. The listener thus perceives a speaker's voice as originating
from a location related to the speaker's physical position at the other end of the conference. An
advantage of these embodiments is that a pseudo-stereo conference can be conducted with virtually the same bandwidth as a monophonic conference.
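
The encode and decode path described above can be sketched roughly as follows. This is an illustrative sketch only, not the disclosed implementation: the function names, the brute-force cross-correlation search, the plain averaging used to combine the signals, the raw 16-bit PCM standing in for a monophonic VoIP codec, and the 4-byte header layout are all assumptions made for the example.

```python
import struct

def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) that maximizes sum(left[n] * right[n + lag]).

    A positive result means `right` is a delayed copy of `left`.
    Brute-force cross-correlation; an assumed detection method.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def mix_to_mono(left, right):
    """Combine the two signals into one by averaging (an assumed combiner)."""
    return [(l + r) // 2 for l, r in zip(left, right)]

def pack_packet(delay, mono):
    """Hypothetical packet: signed 16-bit delay, unsigned 16-bit sample count,
    then the samples as 16-bit PCM (a stand-in for a real mono codec)."""
    header = struct.pack("!hH", delay, len(mono))
    return header + struct.pack("!%dh" % len(mono), *mono)

def unpack_packet(packet):
    """Inverse of pack_packet: return (delay, list_of_samples)."""
    delay, count = struct.unpack_from("!hH", packet, 0)
    samples = list(struct.unpack_from("!%dh" % count, packet, 4))
    return delay, samples

def playout_split(mono, delay):
    """Create two presentation channels from one decoded signal.

    The lagging channel is the mono signal with |delay| zero samples
    inserted at the front, truncated to the original length.
    """
    d = min(abs(delay), len(mono))
    delayed = [0] * d + mono[:len(mono) - d]
    if delay >= 0:
        return list(mono), delayed   # right channel lags
    return delayed, list(mono)       # left channel lags

# Example: an impulse that reaches the second pickup point 3 samples later.
left_in = [0] * 32
right_in = [0] * 32
left_in[10] = 1000
right_in[13] = 1000

delay = estimate_delay(left_in, right_in, max_lag=8)
packet = pack_packet(delay, mix_to_mono(left_in, right_in))
rx_delay, rx_mono = unpack_packet(packet)
left_out, right_out = playout_split(rx_mono, rx_delay)
```

In this sketch the only per-packet cost beyond the monophonic payload is the 2-byte delay field, which is the sense in which the pseudo-stereo conference uses virtually the same bandwidth as a monophonic one.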