Re: P4 VS. Athlon
Dan Downs wrote:
> Another thing, P4's won't be "good" till they get over 2ghz, most software
> isn't optimized for it yet, and won't be for a little while, so most
> software runs faster on an Athlon.
The P4 only has four things going for it:
1. They've adopted a "quad-pumped" bus like the Alpha/Athlon EV6/7,
which runs at 400MHz (the EV7 does as well, but is not available in
any Alpha/Athlon product yet).
2. The whole chip has gone to a 256-bit wide datapath, from the L2
down to the decoders. This optimizes dataflow where other CPUs, AMD
and Intel, only have 256-bit datapaths in select portions of the chip
(usually just the L1 and a few other core areas).
3. The new SSE-2 instructions are a quantum leap in single
precision, floating point matrix operations. This allows faster
geometry translations for lossy video (like MPEG) and 3-D games that
take advantage of it (e.g., Quake 3).
[ Side note: The new Athlon 4, and newer Athlons, now support SSE-1/2
opcodes, for maximum compatibility and performance in a world
dominated by Intel optimized code. ]
4. The P4 finally expands the number of stages in its pipelines to
an average of 19. The P3 really could not be scaled past 1GHz until
this was done. AMD immediately went to 18-20 with the original
Athlon, which is why it will still scale to >1.5GHz, even though
the first part ran at 500MHz. Note: Adding stages to pipes is
usually a _performance_and_efficiency_degrading_ move!!! Don't
believe articles that tell you it is a "performance enhancement"
because it is _not_ (it is a pure design move that makes it possible
to scale up the clock and still schedule things).
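To put rough numbers on the note about adding pipeline stages, here is
a minimal sketch in Python. It assumes a simple textbook-style model
that is not from this post: a fixed amount of logic delay split evenly
across the stages, plus a fixed latch overhead per stage. The delay
figures and function names are hypothetical, picked only to show the
trend.

```python
def max_clock_mhz(logic_delay_ns, stages, latch_overhead_ns):
    """Clock ceiling when the logic delay is split evenly across stages."""
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    return 1000.0 / cycle_ns


def instruction_latency_ns(logic_delay_ns, stages, latch_overhead_ns):
    """Time one instruction needs to travel the whole pipe."""
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    return stages * cycle_ns


# Hypothetical figures, not measurements of any real CPU.
LOGIC_NS = 10.0
LATCH_NS = 0.2

for stages in (10, 20):
    clock = max_clock_mhz(LOGIC_NS, stages, LATCH_NS)
    latency = instruction_latency_ns(LOGIC_NS, stages, LATCH_NS)
    print(f"{stages} stages: ~{clock:.0f}MHz ceiling, {latency:.1f}ns per instruction")
```

Under these made-up numbers, doubling the stage count raises the clock
ceiling but also lengthens the time a single instruction spends in the
pipe, which is the sense in which deeper pipes are a scaling move and
not a speedup in themselves.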
With that said, there are several disadvantages to the P4. These
make the P4 drastically less powerful, MHz for MHz, than the P3 or
Athlon. Intel tried to correct most of them, but just ran out of
time.
A. First and foremost, branch prediction designs still _suck_ at
Intel, believed to be only sub-90% accurate. Branch mispredicts
cause a serious stall in the CPU's execution (see B), which hurts
performance "exponentially" (like collisions on an Ethernet
network). This is all due to Intel putting "all its eggs in the
'IA-64 predication' basket." Long story short, future IA-64
processors (like Itanium) don't do branch prediction -- they just
execute both branches and discard the result of the "road not
taken."
[ The best ever branch prediction unit was on the K6, about 98.5%
accurate and total overkill given the transistor count dedicated to
it. In AMD's defense, it wasn't their design but NexGen's Nx686.
The Athlon "throttled back" on the BP transistor count, and is
figured to be ~94% accurate -- not ideal in the other direction.
AMD believes they get about 96-97% in the new Athlon-4 and
"workstation" Athlons coming out, for not much additional transistor
count over the original Athlon, but a good savings over the K6. ]
B. Following on from A, the actual "performance hit" of a branch
mispredict is a _total_loss_of_everything_ in the P4, just like the
P3 before it. But the "difference" between the P4 and the P3 is
that the P4's pipelines are twice as long as the P3's, 18-20 versus
9. When a branch mispredict occurs, every executing command in all
the P3/P4's execution units must "restart" all their operations from
scratch. In CPU time, this can be several dozen clock cycles, which
can easily translate into stalls of the CPU. Since branch
mispredicts are exponentially harmful, the "stall rate" of an only
~90% accurate CPU can be 4x worse than a ~95% accurate CPU. Better
yet, AMD's Athlon handles "branch mispredicts" gracefully, only
stalling half of the chip and/or stages in the pipeline. So it's
very likely that the Athlon handles branch prediction and recovery a
good order of magnitude better than the P4.
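As a back-of-the-envelope check on how prediction accuracy and pipeline
depth compound, here is a minimal sketch in Python. It assumes a purely
linear model (expected cycles lost per branch = mispredict rate x flush
penalty) and uses the pipeline depth as a stand-in for the flush
penalty. Both are simplifying assumptions of this sketch, not something
the figures above are known to be based on, and the function name is
hypothetical.

```python
def cycles_lost_per_branch(accuracy, flush_penalty_cycles):
    """Expected cycles thrown away per branch under a linear model."""
    return (1.0 - accuracy) * flush_penalty_cycles


# Accuracies and depths are the rough figures quoted in this post;
# pairing them this way is an assumption made for illustration.
shallow = cycles_lost_per_branch(accuracy=0.95, flush_penalty_cycles=9)
deep = cycles_lost_per_branch(accuracy=0.90, flush_penalty_cycles=18)

print(f"~95% accurate,  9-stage pipe: {shallow:.2f} cycles lost per branch")
print(f"~90% accurate, 18-stage pipe: {deep:.2f} cycles lost per branch")
print(f"ratio: {deep / shallow:.1f}x")
```

Even in this simple model, twice the mispredict rate on a pipe that is
twice as deep costs about four times as many cycles per branch. Real
chips are messier than this (stalls cascade and recovery is not free),
so treat it as a floor on the intuition, not a measurement.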
C. Which brings me to what the Athlon does second best, buffering
and caching logic. The Athlon was designed to be deeply buffered
everywhere, not relying on tight timings between units. As such,
it recovers very easily from stalls and mispredicts/mis-orders.
Furthermore, AMD includes a lot of special logic for
buffers/caches. One "highlight" of this is the fact that AMD
processors do NOT duplicate data in the L1, L2 and buffers, whereas
all Intel processors _require_ the L2 cache to contain a copy of the
L1 as it must contain a copy of the buffers (hence why Intel uses
small buffers compared to the L1, and a small L1 compared to the
L2). As such, the total, effective cache of any AMD product is
greater than that of any equivalent Intel product. Even better is the
synchronization of caches between dual-Athlon processors, whereas
Intel's SMP strategy takes a serious performance hit (long story).
[ This leads into why the Duron kicks the living crap out of the
Celeron. It is not only because its L1 cache is 4x bigger than the
Celeron's L1 (which has a lower latency than L2 -- forget cache
clock and throughput, latency is _more_important_ since it makes up
for slow main memory access times), but also because the total,
effective cache in the Duron is 192KB (128KB L1 + 64KB L2), compared
to only 128KB in the Celeron (32KB L1 | 128KB L2). ]
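The cache arithmetic above can be sketched in a few lines of Python.
This is only an illustration of the "duplicate vs. don't duplicate"
bookkeeping described in C, using the sizes quoted in this post; the
function name and the `inclusive` flag are hypothetical.

```python
def effective_cache_kb(l1_kb, l2_kb, inclusive):
    """Unique data the L1 + L2 can hold between them.

    inclusive=True:  the L2 keeps a copy of everything in the L1, so
                     only the L2's capacity counts.
    inclusive=False: the L1 and L2 hold different data, so the
                     capacities add up.
    """
    if inclusive:
        if l2_kb < l1_kb:
            raise ValueError("an inclusive L2 must be at least as big as the L1")
        return l2_kb
    return l1_kb + l2_kb


# Sizes as quoted in this post.
duron = effective_cache_kb(l1_kb=128, l2_kb=64, inclusive=False)
celeron = effective_cache_kb(l1_kb=32, l2_kb=128, inclusive=True)

print(f"Duron:   {duron}KB effective")
print(f"Celeron: {celeron}KB effective")
```

Note that the inclusive case is also why the L1 has to stay small
relative to the L2: every KB added to the L1 is a KB of the L2 spent
on a duplicate.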
D. Lastly, we talk about the Athlon's ultimate strength, its FPU.
Not only do its FPU pipes crush the P3/P4 (and even the K6 did
before, long story that defies the "common thought"), but they do so
on Intel-optimized code that only takes advantage of 2 of its total 3
FPU pipes -- up to 25% faster MHz for MHz versus the P3, and almost
40% faster than the P4, MHz for MHz. Now that Athlon-targetable GNU
support has been introduced, using all 3 FPU pipes results in 40%
faster performance versus the P3, and almost 66% faster than the P4.
And what about those new SSE-2 instructions? If you take a look at
the image quality of MPEG-4 and Quake 3 rendered with SSE-2 versus
traditional floating point, the differences are obvious. SSE-2 adds
quite a bit of image quality loss -- because SSE does NOT do single
precision (32-bit)
floating point accurately on purpose (for performance reasons).
So when it comes down to it, the P4 is the design Intel thought it
wouldn't have to release because it figured IA-64 would already be
"taking over." Unfortunately, they didn't count on AMD bouncing
back by buying NexGen and putting into the Athlon what it learned
from the Nx686/K6. I'm personally looking forward to the
dual-processor (and higher) Athlons c/o the "smart" 760MP chipset
("smart" as in point-to-point nodes, instead of a "dumb" SMP that
makes the CPUs fend for themselves).
> Either get a current Athlon, or wait a month or so for the Athlon 4's(only a
> few minor changes from the thunderbird, but equally clocked a athlon 4 is
> about 5-10% faster) to start shipping.
The Athlon 4 will only be available at up to 1GHz to start, though.
It is designed for mobile devices with drastically lowered power
requirements. As such, I don't know if it is going to be any
"cheaper". You might want to look at AMD's Roadmap to see where
they are headed:
http://www.amd.com/news/virtualpress/roadmap.html
The "advantage" with AMD is that all current and future processors
use the same 200-266MHz EV-6 bus (from the Alpha 264) and the same
Socket-A/462 form-factor. This includes even the planned
dual-processor systems. The next change won't be until the x86-64
"Sledgehammer" comes out, probably adopting a new "Socket-B" (most
likely the 400MHz EV-7 from the Alpha 364).
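For anyone wondering where the "200-266MHz" and "400MHz" bus figures
come from, here is a minimal sketch in Python. The base clocks
(100/133MHz), the transfers-per-clock counts, and the 64-bit bus width
are not stated in this post; they are assumptions of the sketch, as are
the function names.

```python
def effective_mhz(base_clock_mhz, transfers_per_clock):
    """Marketing "MHz" of a bus that moves data several times per clock."""
    return base_clock_mhz * transfers_per_clock


def peak_bandwidth_mb_s(effective_rate_mhz, bus_width_bits=64):
    """Peak transfer rate in MB/s, assuming a 64-bit wide data bus."""
    return effective_rate_mhz * bus_width_bits // 8


# Assumed base clocks; "double-pumped" = 2 transfers per clock,
# "quad-pumped" = 4 transfers per clock.
buses = {
    "EV6 @ 100MHz, double-pumped": effective_mhz(100, 2),
    "EV6 @ 133MHz, double-pumped": effective_mhz(133, 2),
    "P4 @ 100MHz, quad-pumped": effective_mhz(100, 4),
}

for name, rate in buses.items():
    print(f"{name}: {rate}MHz effective, {peak_bandwidth_mb_s(rate)}MB/s peak")
```

These are peak figures only. What actually reaches the CPU depends on
the chipset and memory behind the bus.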
-- TheBS