Can someone explain why variable-to-variable move is not possible?
The short answer is that they were left out to improve pipeline efficiency, at essentially no performance cost.
The long answer...
Memory is a lot slower than the CPU (about 100-1000 times slower!). In order to access some quantity of bytes from memory the CPU has to perform a painfully complex sequence of steps:
- Compute the effective address. This can involve several additions, especially for array lookups (see the C sketch after this list).
- Resolve the physical address. This is a page table lookup through the translation lookaside buffer (TLB); if the required page is not in the TLB, a new entry has to be read in (more memory I/O through another path).
- Check whether the required memory is already in cache. If so, read or write through it; otherwise continue. Multiple layers of cache have to be checked, with lower-level caches being faster than higher-level ones. Higher cache levels move frequently accessed data down to lower levels for better locality. (Lower level here means the cache closest to a processing element; the highest level is usually shared between all processing elements.)
- If the page is not in memory or does not exist, generate a page fault interrupt for the OS to handle. This runs completely different code to resolve the page, possibly even going through the I/O controller to read data from backing storage such as a hard disk. Once the page is in memory, or the interrupt handler returns, execution resumes.
- Issue a memory read/write command. This involves driving the I/O pins that interface with the memory. Because memory I/O is so slow, another memory command may already be in flight, in which case the pipeline must stall.
- Wait a large number of cycles, determined by the memory access latency and the clock rate. Memory is pipelined, so other read/write commands can be issued during this time for higher throughput, but the stalled instruction still has to wait at least the full latency for its response.
- Place the data into cache. Other data may need to be evicted from cache to make room, which can require writing it back to memory.
- Use the data in cache, as in step 3. A read now lands in a register; a write lands in cache.
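To make step 1 concrete, here is a minimal C sketch (the names are mine, and whether the arithmetic folds into one addressing mode is target-dependent) of the address computation hiding inside an ordinary array lookup:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t table[16];
    size_t  i = 5;

    /* What the CPU must compute for table[i]:
     * effective address = base address + index * element size.
     * On x86 this often folds into a single addressing mode
     * (base + index*scale); on simpler cores it is real
     * shift/add work before the load can even be issued. */
    uintptr_t base      = (uintptr_t)table;
    uintptr_t effective = base + i * sizeof table[0];

    table[i] = 42; /* the store uses exactly this address */
    printf("&table[%zu] = %p, computed = %p\n",
           i, (void *)&table[i], (void *)effective);
    return 0;
}
```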
A lot of these steps require very complex hardware, so there are very few units available for pipelining the operation. It is made even more complex by instructions living in the same memory space, so some of these systems already have to be duplicated just to keep instruction fetch pipelined. The end result is that using one instruction to copy memory from one location to another, or two instructions which separately read then write, makes absolutely no speed difference: the memory I/O controller causes most of the processor stalls either way.
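To illustrate (a sketch, not a benchmark; the function names are mine), the two variants below are functionally equivalent, and both end up limited by the memory system rather than by how many instructions the copy is expressed in:

```c
#include <stddef.h>
#include <string.h>

/* Explicit read-then-write: the compiler emits a separate
 * load and store per element. */
void copy_load_store(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int tmp = src[i];   /* memory read  */
        dst[i]  = tmp;      /* memory write */
    }
}

/* Block copy: the compiler may lower this to a single
 * block-copy instruction such as rep movs on x86. */
void copy_block(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}
```

For any copy big enough to miss in cache, both spend almost all their time waiting on the memory I/O controller.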
In x86 there actually is an instruction to do this copy: movs. Combined with the rep prefix it can copy large chunks of memory from one place to another very efficiently. Simutrans uses this instruction in the GCC builds to speed up the fill rate of its software rasterizer.
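I do not have the Simutrans source to hand, but with GCC (or Clang) the instruction in question can be emitted from C roughly like this; a sketch for x86/x86-64 only, not the project's actual code:

```c
#include <stddef.h>

/* Copy n bytes with a single rep movsb instruction.
 * The "+D", "+S" and "+c" constraints pin dst, src and the
 * count to the (R)DI, (R)SI and (R)CX registers that the
 * instruction implicitly uses; "memory" tells the compiler
 * the copy touches memory it cannot see. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
}
```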
However this is nothing more than a "hack". Internally, processors use a RISC-like system known as "microcode" to manage pipelining and execution of instructions. The instruction decoder translates this highly complex instruction into some configuration of microcode, which the control circuitry then executes to carry out all the required steps. Combined with the fact that this instruction path is probably not that optimised, the end result is that the single executing instruction probably performs similarly to a sequence of functionally equivalent simpler instructions. The whole reason the instruction exists is legacy, and I believe it was de-emphasised in x86-64, which aimed for a more streamlined, RISC-style instruction set.
ARM processors can have caches, but usually only the higher-end ones such as those found in phones, where the clock rate is high enough that a cache gives a significant speed-up and power saving. Some ARM processors have additional memory access requirements, such as alignment and a minimum read size. Low-end ARM processors often avoid cache because the clock rate is so low that they interface with memory at almost native speed (especially since the memory is often on the same die, or at least in the same package, as the processor). That said, these are such low-power devices that you cannot compare them with the ARM CPUs found in phones, which use different instruction sets and architectures.
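On the alignment point: dereferencing a misaligned pointer is undefined behaviour in C and will actually fault on some ARM cores. The usual portable idiom is to read through memcpy and let the compiler pick a legal instruction sequence (a general C technique, not anything specific to one ARM toolchain):

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an arbitrary byte offset.
 * The compiler lowers the memcpy to a single load on cores
 * that allow unaligned access, and to byte loads plus shifts
 * on cores that do not. */
static uint32_t read_u32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```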
AMD is trying to change how memory is interfaced to improve memory I/O rate and latency. Since they ship integrated solutions (e.g. the Xbox One and PS4, where all components are soldered directly onto the board), they are going one step further by putting stacked memory physically near or on the CPU/GPU, giving it performance closer to cache than to conventional memory. It is obviously impossible to upgrade the memory in such a system, which is why this is only really useful for mass-produced computers and game consoles.