Can someone explain why variable-to-variable move is not possible?
The short answer is that they were left out to improve pipeline efficiency, at essentially no performance cost.
The long answer...
Memory is a lot slower than the CPU (about 100-1000 times slower!). In order to access some quantity of bytes from memory the CPU has to perform a painfully complex sequence of steps:
- Compute the effective address. This can involve several additions, especially for array lookups (see the C sketch after this list).
- Resolve the physical address. This is a page table lookup through the translation lookaside buffer (TLB); if the required page is not in the TLB, a new entry has to be read in (more memory I/O through another path).
- Check whether the required memory is already in cache. If so, read or write through it; otherwise continue. Multiple layers of cache have to be checked, with lower-level caches being faster than higher-level ones. Higher cache levels move frequently accessed data down to lower levels for better locality. (Lower level here means the cache closest to a processing element; the highest level is usually shared between all processing elements.)
- If the page is not in memory or does not exist, generate a page fault interrupt for the OS to handle. This runs completely different code to resolve the page, possibly even going through the I/O controller to read data from backing storage such as a hard disk. Once the page is in memory, or the interrupt handler returns, execution resumes.
- Issue a memory read/write command. This involves driving the I/O pins that interface with the memory. Because memory I/O is so slow, another memory command may already be in flight, in which case the pipeline must stall.
- Wait a large number of cycles, determined by the memory access latency and the clock rate. Memory is pipelined, so other read/write commands can be issued during this time for higher throughput, but the stalled instruction still has to wait at least the full latency for its response.
- Place the data into cache. Other data may need to be evicted from cache to make room, which can require writing it back to memory.
- Use the data in cache, as in step 3. A read now lands in a register; a write lands in cache.
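To make step 1 concrete, here is a minimal C sketch (the names are mine, and whether the arithmetic folds into one addressing mode is target-dependent) of the address computation hiding inside an ordinary array lookup:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t table[16];
    size_t  i = 5;

    /* What the CPU must compute for table[i]:
     * effective address = base address + index * element size.
     * On x86 this often folds into a single addressing mode
     * (base + index*scale); on simpler cores it is real
     * shift/add work before the load can even be issued. */
    uintptr_t base      = (uintptr_t)table;
    uintptr_t effective = base + i * sizeof table[0];

    table[i] = 42; /* the store uses exactly this address */
    printf("&table[%zu] = %p, computed = %p\n",
           i, (void *)&table[i], (void *)effective);
    return 0;
}
```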
A lot of these steps require very complex hardware, so there are very few units available for pipelining the operation. It is made even more complex by instructions living in the same memory space, so some of these systems already have to be duplicated just to keep instruction fetch pipelined. The end result is that using one instruction to copy memory from one location to another, or two instructions which separately read then write, makes absolutely no speed difference: the memory I/O controller causes most of the processor stalls either way.
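To illustrate (a sketch, not a benchmark; the function names are mine), the two variants below are functionally equivalent, and both end up limited by the memory system rather than by how many instructions the copy is expressed in:

```c
#include <stddef.h>
#include <string.h>

/* Explicit read-then-write: the compiler emits a separate
 * load and store per element. */
void copy_load_store(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int tmp = src[i];   /* memory read  */
        dst[i]  = tmp;      /* memory write */
    }
}

/* Block copy: the compiler may lower this to a single
 * block-copy instruction such as rep movs on x86. */
void copy_block(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}
```

For any copy big enough to miss in cache, both spend almost all their time waiting on the memory I/O controller.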
In x86 there actually is an instruction to do this copy: movs. Combined with the rep prefix it can copy large chunks of memory from one place to another very efficiently. Simutrans uses this instruction in the GCC builds to speed up the fill rate of its software rasterizer.
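I do not have the Simutrans source to hand, but with GCC (or Clang) the instruction in question can be emitted from C roughly like this; a sketch for x86/x86-64 only, not the project's actual code:

```c
#include <stddef.h>

/* Copy n bytes with a single rep movsb instruction.
 * The "+D", "+S" and "+c" constraints pin dst, src and the
 * count to the (R)DI, (R)SI and (R)CX registers that the
 * instruction implicitly uses; "memory" tells the compiler
 * the copy touches memory it cannot see. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
}
```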
However this is nothing more than a "hack". Internally, processors use a RISC-like system known as "microcode" to manage pipelining and execution of instructions. The instruction decoder translates this highly complex instruction into some configuration of microcode, which the control circuitry then executes to carry out all the required steps. Combined with the fact that this instruction path is probably not that optimised, the end result is that the single executing instruction probably performs similarly to a sequence of functionally equivalent simpler instructions. The whole reason the instruction exists is legacy, and I believe it was de-emphasised in x86-64, which aimed for a more streamlined, RISC-style instruction set.
ARM processors can have caches, but usually only the higher-end ones such as those found in phones, where the clock rate is high enough that a cache gives a significant speed-up and power saving. Some ARM processors have additional memory access requirements, such as alignment and a minimum read size. Low-end ARM processors often avoid cache because the clock rate is so low that they interface with memory at almost native speed (especially since the memory is often on the same die, or at least in the same package, as the processor). That said, these are such low-power devices that you cannot compare them with the ARM CPUs found in phones, which use different instruction sets and architectures.
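On the alignment point: dereferencing a misaligned pointer is undefined behaviour in C and will actually fault on some ARM cores. The usual portable idiom is to read through memcpy and let the compiler pick a legal instruction sequence (a general C technique, not anything specific to one ARM toolchain):

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an arbitrary byte offset.
 * The compiler lowers the memcpy to a single load on cores
 * that allow unaligned access, and to byte loads plus shifts
 * on cores that do not. */
static uint32_t read_u32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```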
AMD is trying to change how memory is interfaced to improve memory I/O rate and latency. Since they ship integrated solutions (e.g. the Xbox One and PS4, where all components are soldered directly onto the board), they are going one step further by putting stacked memory physically near or on the CPU/GPU, giving it performance closer to cache than to conventional memory. It is obviously impossible to upgrade the memory in such a system, which is why this is only really useful for mass-produced computers and game consoles.