In the comparison link above, the most obvious differences are the clock speeds, which speak for themselves, but the major factor there is the "lithography", i.e. the size of the transistors in the CPU. Smaller transistors mean a more efficient CPU, lower thermal output (TDP) and better computing performance per core. So as you go through those 65 nm, 45 nm and 22 nm CPUs, the improvements are much larger than the mere clock speed increases (though those help too).
Smaller transistors mean a more efficient CPU
Yeah, no. Smaller now brings in new problems. If you halve the feature size you do not quarter the power consumption, and in some cases, such as interconnects, you actually increase power consumption because the resistance per square goes up. The end result is that on a 20 nm process you could easily make a piece of silicon that computes at insane rates but consumes >400 watts and lasts a few seconds.
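To put a rough number on why halving size no longer quarters power (this is the standard textbook first-order model, not something from the answer above): dynamic switching power is approximately

    P_dyn ≈ α · C · V² · f

where α is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. Classic scaling assumed V shrank along with the transistors; V has largely stopped scaling, so the V² term no longer falls and shrinking C alone buys far less than it used to.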
Actual computation performance is no longer increasing just because feature size decreases. It used to, but clock speeds capped out at around 3-4 GHz, because it is not possible to reliably produce devices that run at much higher clock rates. Parasitic capacitance is just one of the problems with high clock speeds, and it results in huge power consumption and possible signal loss. Performance now comes from new designs, some of which need the extra transistor count that smaller processes give. Adding 50% more cache using the saved area might improve processor performance by >20% for some tasks even though the core itself is unchanged. Adding extra ALUs might force a slightly lower clock speed, yet could let the processor execute independent instructions faster.
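As a loose illustration of that last point (hypothetical C, not taken from anywhere in this thread): the first function below does two loads and adds with no dependency between them, so a superscalar core with spare execution units can work on both at once, while the second is a serial chain where extra ALUs and load units cannot help.

    #include <stdint.h>

    /* Independent work: the two sums have no dependency on each
     * other, so a superscalar core can execute them in parallel. */
    uint64_t independent(const uint64_t *a) {
        uint64_t x = a[0] + a[1];
        uint64_t y = a[2] + a[3];
        return x + y;
    }

    /* Dependent chain (pointer chasing): each load needs the result
     * of the previous one, so execution is fully serialized. */
    uint64_t dependent(const uint64_t *a) {
        uint64_t i = a[0];
        i = a[i];   /* must wait for a[0] */
        i = a[i];   /* must wait for the load above */
        return i;
    }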
Especially with Intel, you will find a lot of their CPUs include built-in graphics (an integrated GPU) rather than spending that silicon on making the cores more powerful. For laptops with no discrete GPU this can greatly improve performance, as CPU cores are not really suited to high speed graphics.
Also, just because Intel and AMD are on the same fab size (process node) does not make their chips the same. Intel might have the larger node at a given time, but one that supports process features AMD's smaller node does not, letting Intel do things AMD cannot, and vice versa.
There was a Pentium, I believe, that ran at 4 GHz yet didn't have nearly the power of a newer-generation chip at around 2.6 GHz, so clock rate is not a very good measure.
Not possible. The Pentium 1 was in the 100 MHz range. You must be thinking of a Pentium 4 (roughly 1.3 GHz up to 3.8 GHz).
Clock speed mainly tells you how complicated the logic between pipeline registers is. Something running at 3 GHz probably has more complex combinational logic per stage than something running at 6 GHz, because the critical path must be shorter for the 6 GHz part: each stage gets only about 167 ps to settle instead of 333 ps.
Currently designers favour the RISC approach, which drives clock speed up and combinational complexity down, and which also supports pipelining better. That is not to say CISC is bad; in the future we might have 1-2 GHz processors that are very energy efficient but can swallow entire functions' worth of work in parallel.
If cache were that important, why would people migrate to 64-bit architectures, when you can physically fit only half as much data into the same cache under a 64-bit OS compared with a 32-bit one?
Also, if it is so important, why won't you buy AMD? Even my AMD has 4 MB of L2 cache, whereas yours has only 1 MB (the shared cache is the same, 8 MB).
Memory is so slow that the processor would starve without a cache. By the time one read from main memory completes, hundreds of processor cycles have passed in which the core can do nothing but idle: a ~60 ns DRAM access at 3 GHz is roughly 180 cycles. Yes, DDR3 is a ton faster than the DDR1 of all those years ago, but it has not sped up by anywhere near the factor that processors have going from the Pentium 4 to the i7.
This is where cache plays an important part. Most programs have a very small, highly active working set inside a much larger (orders of magnitude larger) mildly active set. An example would be a loop: over 1000 clock cycles its code may be executed dozens of times. Cache is memory close to the processor that can be accessed at near-processor speeds. Whereas a read from main memory may take >100 clock cycles, a hit in cache might take anywhere from a few cycles (L1) to a few dozen (L3). Keeping that loop's code in cache saves a lot of cycles otherwise spent waiting on memory. This does not just apply to code: any data that is manipulated a lot benefits from caching. Caches also work in fixed-size lines (typically 64 bytes), so they reward data locality: accessing the next index of an array is nearly instant, because the line was likely already fetched when the previous index was read.
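A quick sketch of that locality effect (illustrative C, not from the answer): both functions sum the same million-element array, but the sequential version reuses each 64-byte cache line for 16 consecutive ints, while the strided version touches a different line on nearly every read and pays the memory latency far more often.

    #include <stddef.h>

    #define N (1 << 20)   /* 1M ints = 4 MB, larger than most L1/L2 caches */

    /* Sequential access: one cache line load serves the next 16 ints. */
    long sum_sequential(const int *a) {
        long s = 0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Stride-16 access (16 ints = 64 bytes = one cache line): nearly
     * every read lands on a different line, defeating locality. */
    long sum_strided(const int *a) {
        long s = 0;
        for (size_t j = 0; j < 16; j++)
            for (size_t i = j; i < N; i += 16)
                s += a[i];
        return s;
    }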
So why use x86-64 if memory is such a bottleneck? Well, x86-64 does not only add support for larger addresses; it adds a whole lot of useful features.
Positives
Larger address sizes allow more memory to be indexed.
Larger registers allow faster data manipulation, as less loading and unloading to memory is required.
More registers (16 general-purpose instead of 8) also cut down on loading and unloading, since more values can stay in registers.
New instructions provide advanced data manipulation using the new registers, enabling register operations that were previously impossible.
Streamlined for the way modern code actually executes.
Can still execute 32-bit compiled code with fairly simple kernel-level support.
Negatives
Larger addresses mean larger instruction encodings, so code is less memory efficient and more dependent on memory bandwidth.
Larger addresses mean more memory is used to store pointers, so data is less memory efficient too.
No 16-bit backwards compatibility (a processor booted into 64-bit mode cannot easily drop back to 16-bit instructions the way 32-bit mode can).
That said, these negatives can be diminished. Larger caches mean the extra code size matters less. And just because the architecture can address up to 64 bits does not mean it has to, since few systems (none, really) have that much memory. Most OSes therefore use something closer to a 40-bit addressing mode, avoiding the excess bloat full 64-bit addresses would bring while still supporting 2^40 bytes = 1024 gigabytes of memory.
The time and memory lost are often easily made up by the new capabilities the instruction set provides. For an x86 (32-bit) platform to manipulate a long (64-bit) integer, it needs two registers and multiple instructions depending on the desired result. Imagine adding two longs: that is suddenly four registers and multiple add instructions. In x86-64 you can load each long into a single register and perform the addition with a single instruction. Across a program this saves many instructions, which more than compensates for the time and space lost to longer addresses.
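A small sketch of that difference (illustrative C; the exact instructions emitted vary by compiler and flags):

    #include <stdint.h>

    /* Adding two 64-bit integers. Compiled for 32-bit x86, each
     * operand occupies two registers, and compilers typically emit
     * an ADD for the low halves plus an ADC (add-with-carry) for
     * the high halves. Compiled for x86-64, each operand fits in
     * one register and a single 64-bit ADD does the whole job. */
    uint64_t add_longs(uint64_t a, uint64_t b) {
        return a + b;
    }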