One embodiment of the present invention sets forth a technique for coalescing
memory barrier operations across multiple parallel threads.
Memory barrier requests from a given parallel thread processing unit are
coalesced to reduce their impact on the rest of the system. Additionally,
memory barrier requests may specify a level of a set of threads with respect to
which the memory transactions are committed. For example, a first type of
memory barrier instruction may commit the memory transactions to a level of a
set of cooperating threads that share an L1 (level one) cache. A second type of
memory barrier instruction may commit the memory transactions to a level of a
set of threads sharing a global memory. Finally, a third type of memory barrier
instruction may commit the memory transactions to a system level of all threads
sharing all system memories. The latency required to execute the memory barrier
instruction varies based on the type of memory barrier instruction.
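
The coalescing and the three barrier levels described above can be illustrated with a small host-side model in C++17. This is a minimal sketch under stated assumptions, not the claimed hardware mechanism: the names BarrierScope, BarrierRequest, CoalescedBarrier, and BarrierCoalescer are hypothetical, and the sketch assumes that coalescing means merging all pending requests from one processing unit into a single outgoing request, and that a wider level also satisfies requests made at a narrower level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical names for the three barrier levels in the text,
// ordered from narrowest to widest set of threads.
enum class BarrierScope {
    L1Group = 0,  // cooperating threads that share an L1 cache
    Global = 1,   // threads sharing a global memory
    System = 2    // all threads sharing all system memories
};

// One memory barrier request issued by a single thread.
struct BarrierRequest {
    int thread_id;
    BarrierScope scope;
};

// The single request forwarded to the rest of the system.
struct CoalescedBarrier {
    BarrierScope scope;        // widest level among the merged requests
    std::vector<int> waiting;  // threads released once the barrier completes
};

// Models one parallel thread processing unit collecting barrier requests.
class BarrierCoalescer {
public:
    // Record a request instead of forwarding it immediately.
    void request(int thread_id, BarrierScope scope) {
        pending_.push_back({thread_id, scope});
    }

    std::size_t pending() const { return pending_.size(); }

    // Merge every pending request into one outgoing barrier.
    // Assumption: the widest requested level covers all narrower ones.
    std::optional<CoalescedBarrier> flush() {
        if (pending_.empty()) return std::nullopt;
        CoalescedBarrier out{BarrierScope::L1Group, {}};
        for (const BarrierRequest& r : pending_) {
            out.scope = std::max(out.scope, r.scope);
            out.waiting.push_back(r.thread_id);
        }
        pending_.clear();
        assert(!out.waiting.empty());
        return out;
    }

private:
    std::vector<BarrierRequest> pending_;
};
```

In this model, three threads requesting barriers at the L1, system, and global levels produce one outgoing barrier at the system level, so the rest of the system sees one request rather than three. The model does not attempt to represent latency, which the text says varies with the type of memory barrier instruction.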