Hacker News

There are many operations on data that are relatively slow from a CPU's perspective: filling a cache line, page faulting, maintaining cache coherency, acquiring a lock, waiting on I/O to complete, and so on. In conventional software, when these events occur the CPU simply stalls, possibly triggering a context switch (which is very expensive). In many types of modern systems, these events are extremely frequent.

Latency hiding is a technique that applies when 1) the workload can be trivially decomposed into independent components that can be executed separately and 2) you can infer or directly determine when any particular operation will stall. There are many ways to execute these high-latency operations in an asynchronous, non-blocking way so that you can immediately work on some other part of the workload. The "latency-hiding" part is that the CPU is rarely stalled: whenever possible it switches to a part of the workload that is immediately runnable, so it is always doing real, constructive work. Latency hiding optimizes for throughput, maximizing utilization of the CPU, but potentially increases the latency of specific sub-operations by reordering the execution schedule to "hide" the latency of operations that would stall the processor. For many workloads, the latency of individual sub-operations doesn't matter, only the throughput of the total operation. The real advantage of latency-hiding architectures is that real software can approach the theoretical IPC of the silicon.
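A minimal sketch of this idea using Python's asyncio (the `fetch` task and its delays are hypothetical stand-ins for real I/O): while one task waits on a simulated high-latency operation, the event loop runs whichever task is immediately runnable, so total wall time approaches the longest single wait rather than the sum of all waits.

```python
import asyncio
import time

async def fetch(i: int, delay: float) -> int:
    # Simulated high-latency operation (e.g. a network or disk read).
    # `await` yields control so the event loop can run other ready tasks.
    await asyncio.sleep(delay)
    return i

async def main() -> float:
    start = time.perf_counter()
    # Three independent sub-operations, each with ~0.1 s of latency.
    await asyncio.gather(*(fetch(i, 0.1) for i in range(3)))
    # Because the waits overlap, elapsed time is ~0.1 s, not ~0.3 s.
    return time.perf_counter() - start

elapsed = asyncio.run(main())
```

The same three calls issued sequentially would take roughly the sum of the delays; the scheduler "hides" each wait behind useful work on the others.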

There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing. The Cray/Tera MTA architecture is probably the canonical example, along with the original Xeon Phi. As a class, latency-hiding CPU architectures are sometimes referred to as "barrel processors". In the case of the Cray MTA, the CPU tracks 128 separate threads of execution in hardware and automatically switches to a thread that is immediately runnable on each clock cycle, so thread coordination is effectively "free". In software, deciding when to switch between logical threads of execution relies much more on inference, but often yields huge gains in throughput. The only caveat is that you can't ignore tail latencies in the design: a theoretically optimal latency-hiding architecture may defer execution of an operation indefinitely.
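A toy illustration of the barrel-processor issue pattern, using Python generators as stand-in hardware threads (the `worker`/`barrel_schedule` names are hypothetical, and a real MTA switches every cycle in hardware, not in software like this):

```python
from collections import deque

def worker(name: str, steps: int):
    # Each yield marks a point where this "hardware thread" would
    # otherwise stall; the scheduler issues from another thread instead.
    for step in range(steps):
        yield f"{name}:{step}"

def barrel_schedule(threads):
    # Round-robin over runnable threads, one "instruction" per cycle,
    # mimicking the per-cycle thread switch of a barrel processor.
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))
            ready.append(t)       # still runnable; back of the queue
        except StopIteration:
            pass                  # thread finished; drop it
    return trace

trace = barrel_schedule([worker("A", 2), worker("B", 2)])
# Issue order interleaves the threads: A:0, B:0, A:1, B:1
```

The key property is that no single thread's stall ever blocks the pipeline; as long as any thread is runnable, a unit of work issues every "cycle".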



> There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing.

I don’t know much about supercomputers, but what you described is precisely how all modern GPUs deal with VRAM latency. Each core runs multiple threads, and the count is bounded by the resources used: the more registers and group-shared memory a shader uses, the fewer threads of that shader can be scheduled on the same core. The GPU then switches threads instead of waiting out that latency.
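A rough sketch of that resource bound (the per-SM limits below are illustrative, loosely modeled on a CUDA streaming multiprocessor; real figures vary by GPU generation):

```python
# Hypothetical per-core (per-SM) resource limits.
REGISTERS_PER_SM = 65536        # 32-bit registers
SHARED_MEM_PER_SM = 49152       # bytes of group-shared memory
MAX_THREADS_PER_SM = 2048

def resident_threads(regs_per_thread: int,
                     smem_per_block: int,
                     threads_per_block: int) -> int:
    # Each resource caps how many thread blocks fit on one core;
    # the tightest limit determines how many threads are resident.
    by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = (SHARED_MEM_PER_SM // smem_per_block
               if smem_per_block else float("inf"))
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(by_regs, by_smem, by_threads)
    return int(blocks) * threads_per_block

# A lean shader: many resident threads, lots of latency to hide behind.
lean = resident_threads(regs_per_thread=32, smem_per_block=0,
                        threads_per_block=256)   # -> 2048
# A register-hungry shader: fewer resident threads, less latency hiding.
heavy = resident_threads(regs_per_thread=128, smem_per_block=16384,
                         threads_per_block=256)  # -> 512
```

This is why reducing register pressure in a hot shader can raise throughput even when it adds instructions: more resident threads means more memory latency can be hidden.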

That’s how GPUs can saturate their RAM bandwidth, which exceeds 500 GB/second in modern high-end GPUs.


Is there some tool like ps/top/time that I can use to measure how much “constructive” work my CPU spends doing?




