UltraSPARC T2+ for massively parallel decision making?

I'm digging into the topic of real-time Solaris at the moment. I can't talk about the project (it's not finance, military or robotics, you wouldn't believe it), but the usage of real-time technology is more common than you might think. Of course there is the processing of sensor data, for example in chemical plants or air traffic control radar; everybody thinks of such stuff first. But financial systems use this technology as well. When you work in a business where milliseconds decide about your profits in the automated trading of stocks and derivatives, real-time technology gets essential. You want to be sure that writing something to disk doesn't interrupt the processing of your input data. You can log the stuff later, but the “Sell” order has to go out right here, right now. There are several interesting technologies in Solaris for such applications. I will write a tutorial about it soon.

After working through many documents I asked myself: an UltraSPARC T2+ dual- or quad-processor system should be a hell of a system for the massively parallel processing of sensor data. Let's imagine you have to observe the value of 128 stocks. You could run the observing code on 128 hardware threads in parallel, without context switches. Okay, subtract some threads for housekeeping, but let's ignore this for a moment. Even when the processor has just half the clock frequency, the latency of a single transaction should be much lower. The following stuff is just a quick thought game … so correct me if I got something wrong. And yes, I'm aware of the fact that this math is vastly simplified.
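To make the "one hardware thread per stock" idea a bit more concrete, here is a minimal sketch of how such an observer layout could look on Solaris. This is an assumption on my part, not the code of the project: observe_stock() is an empty placeholder, and I assume that the logical CPU IDs 0 to 127 reported by psrinfo map one-to-one to the hardware strands of a dual-socket T2+.

    /* Sketch: one observer thread per stock, each bound to its own hardware
     * strand via processor_bind(), so the scheduler never has to preempt it.
     * Assumes logical CPUs 0..127 correspond to the strands of the machine. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>

    #define NSTOCKS 128

    static void *observe_stock(void *arg)
    {
        processorid_t strand = (processorid_t)(long)arg;

        /* Bind the calling LWP to exactly one hardware strand. */
        if (processor_bind(P_LWPID, P_MYID, strand, NULL) != 0)
            perror("processor_bind");

        for (;;) {
            /* placeholder: read market data, decide, send the order */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NSTOCKS];

        for (long i = 0; i < NSTOCKS; i++)
            pthread_create(&tid[i], NULL, observe_stock, (void *)i);
        for (int i = 0; i < NSTOCKS; i++)
            pthread_join(tid[i], NULL);   /* observers run forever in this sketch */
        return 0;
    }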
Let's assume your code to get the information, process it and make a decision needs 1000 clock cycles. You've optimized the code on both systems by hand to stay in this cycle budget. Let's further assume that you need more code on the SPARC, but each instruction executes in 1 cycle (RISC), whereas several operations in the x86 code need more than one cycle (CISC).

On a normal system the application does the dispatching itself (i.e. it executes the decision code for each stock serially and starts over afterwards). On a single-socket, single-core system, which we take for easier calculation, working through 128 threads serially leads to a decision-to-decision latency of 128000 clock cycles. Even a quad-core, quad-socket system can only observe 16 inputs in parallel. So let's assume you work with such a 16-core system: in 1000 clock cycles you can go through 16 decision processes, and 128 divided by 16 is 8, so you need 8 runs of 16 parallel threads to get through all 128 stocks. Thus the time from one decision regarding a stock to the next decision is 8000 clock cycles.

Now let's calculate this for the UltraSPARC T2+. The switch from one hardware thread to another is done without a latency penalty. Each core has two integer pipelines (ALUs), so its 8 observer threads are split into two groups of 4 that share a pipeline; to process the 8 observers on a core you therefore need 4000 clock cycles. After 4000 cycles all 8 decision processes are completely executed and the next run can start. As you have additional cores, even for 128 stocks your decision-to-decision latency is 4000 clock cycles. Why 4000? Four threads share one ALU, so you need 4000 clock cycles to execute 4 threads of 1000 instructions each, even when they look as if they were scheduled onto 4 different processors. Thus an x86-based system would need twice the clock frequency to do the same work in the same time as the frequency-wise slower UltraSPARC T2+.

Considering that a 256-thread system is planned on the foundation of the T2+, the results get even more in favour of the T2+. The decision-to-decision latency of such a system would still be 4000 clock cycles, while the time between two decisions for a single stock on the x86 system would rise to 16000 clock cycles, as you would need 16 runs to get through all 256 stocks.

And this calculation doesn't factor in high-latency events like L1 or L2 cache misses leading to context switches or idling pipelines. The architecture of the memory subsystem isn't factored in either. This would give advantages to the T2+ as well, as context switching is an expensive operation in terms of clock cycles on an x86 architecture with fewer cores than software threads. On a T2+ you wouldn't have to context switch at all, as you would have enough register sets for all software threads.

It may look counterintuitive, but the more inputs you have to observe, the more viable massively multithreaded solutions get, even when they don't run at such a high clock frequency. I will think more about it to make sure that there is no error in my reasoning.
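To double-check the arithmetic, here is a tiny back-of-the-envelope program using exactly the same simplified assumptions as above (1000 cycles per decision, perfect scheduling, no cache misses, no memory subsystem effects); the "4 threads per integer pipeline" figure for the T2+ is the one from the text and nothing more.

    /* Back-of-the-envelope check of the latency numbers above. */
    #include <stdio.h>

    /* Decision-to-decision latency when 'stocks' observers are multiplexed
     * over 'parallel' decision processes running at the same time. */
    static long latency(long stocks, long parallel, long cycles_per_decision)
    {
        long runs = (stocks + parallel - 1) / parallel;   /* round up */
        return runs * cycles_per_decision;
    }

    int main(void)
    {
        long cycles = 1000;               /* cycle budget per decision            */
        long threads_per_pipeline = 4;    /* T2+: 8 strands, 2 pipelines per core */

        printf("x86, 1 core,   128 stocks: %ld cycles\n", latency(128,  1, cycles));
        printf("x86, 16 cores, 128 stocks: %ld cycles\n", latency(128, 16, cycles));
        printf("x86, 16 cores, 256 stocks: %ld cycles\n", latency(256, 16, cycles));

        /* On the T2+ every stock gets its own hardware strand, so the latency
         * is just the time to interleave the 4 threads sharing one pipeline,
         * independent of the number of stocks. */
        printf("T2+, one strand per stock: %ld cycles\n",
               threads_per_pipeline * cycles);
        return 0;
    }

The output is 128000, 8000, 16000 and 4000 clock cycles, matching the numbers above.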