Baby Steps to our Future by Ron Fenley
Hyper-Threading – Fact and Future
Job interviewers already expect their applicants to multitask; what they really want to know is how much. With a Hyper-Threading (HT) enabled computer, you can multitask far more effectively than on a comparable computer without it. So what does Hyper-Threading have to do with multitasking?
Simple: Hyper-Threading takes a baby step toward parallel processing. With Hyper-Threading, your operating system makes better use of your processor when executing multiple tasks. A Hyper-Threading system behaves as if the computer had two processors, so it can work on two tasks at once instead of one. Many computationally intensive jobs, such as re-indexing a database, rendering a video, virus scanning your system or burning a DVD, may tie up the computer, making it slow, unresponsive or unstable for other tasks until the job is complete. With Hyper-Threading, a second virtual processor is free to do other tasks while the first task churns away.
So how does Hyper-Threading make the operating system think there are 2 processors? To explain that, we need to touch lightly on how a normal processor works, and then we can see how Hyper-Threading works.
How it Works
From a very simplified view, a processor is fed a sequential stream of program code, or instructions. The processor is hardwired to respond in a specific way to each instruction. Some hardwired responses do very simple things, like fetch a data value from memory, store a data value in a holding register, add the values of two registers, or store the results in memory. Simple instructions may take only a few clock cycles to complete. Many other, more complex instructions take many clock cycles. The longest delays come when the processor is forced to wait for data: a fetch from main memory can take several hundred clock cycles, and a read from disk takes far longer still. These wasted clock cycles result in wasted time and lost performance.
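To put rough numbers on that idea, here is a minimal Python sketch. The instruction names and cycle counts are made up for illustration, not measurements of any real processor; the point is only how quickly waiting can swamp useful work.

```python
# Hypothetical instruction trace: (name, cycles executing, cycles waiting).
# All numbers are illustrative assumptions, not real processor timings.
trace = [
    ("load",  2, 200),  # waits on a slow data fetch
    ("add",   1, 0),
    ("store", 2, 0),
    ("load",  2, 200),  # another slow data fetch
    ("add",   1, 0),
]

def summarize(trace):
    """Return (busy cycles, stalled cycles, fraction of time stalled)."""
    busy = sum(work for _, work, _ in trace)
    stalled = sum(wait for _, _, wait in trace)
    return busy, stalled, stalled / (busy + stalled)

busy, stalled, fraction = summarize(trace)
print(f"busy: {busy} cycles, stalled: {stalled} cycles ({fraction:.0%} wasted)")
```

In this toy trace almost all of the elapsed time is spent waiting, which is the gap the improvements described next try to close.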
This was typically the case with early-generation processors, which processed one instruction at a time in sequential order. Over the years, several improvements have been made to reduce the wait time and to improve performance. A few of these improvements, like the ability to pre-fetch data before the instruction is executed, the ability to execute instructions out of sequential order and the ability to execute more than one instruction per clock cycle, have greatly improved the performance of processors and have led up to Hyper-Threading technology.
The following picture is a block representation of a processor executing the instructions of a long program called ‘Task 1’. Of course, this picture shows only a very short segment of the overall program, just 9 clock cycles. The clock cycles are represented by the vertical columns. The 3 blocks in each column represent the 3 different instructions that an Intel P4 processor can execute in one clock cycle. The red blocks represent actual instructions from Task 1 that are being executed during that cycle, and the white blocks represent wait states. Execution starts at the left and progresses to the right with each clock cycle. At the first clock cycle we have 2 instructions executing, one at the top and one at the bottom. The one at the top has executed and has a wait state of 3 clock cycles before the next instruction is executed. The bottom one has one wait state after it has executed. The middle instruction may be waiting on the results of the top or bottom instruction, or waiting for data to be fetched.
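For readers without the picture in front of them, the same idea can be approximated in a few lines of Python. The grid below is a hypothetical stand-in that loosely follows the description above; it is not a copy of the original diagram.

```python
# Hypothetical 3-slot by 9-cycle grid, loosely following the description above.
# "X" = an instruction from Task 1 executes, "." = wait state.
task1_grid = [
    "X...X..X.",  # top slot: executes, then waits 3 cycles
    ".X...X...",  # middle slot: idle in the first cycle
    "X.X..X..X",  # bottom slot: executes, then waits 1 cycle
]

def slot_usage(grid):
    """Return (slots used, slots left waiting) for the whole grid."""
    total = sum(len(row) for row in grid)
    used = sum(row.count("X") for row in grid)
    return used, total - used

used, waiting = slot_usage(task1_grid)
print(f"{used} instructions executed, {waiting} wait states "
      f"({waiting / (used + waiting):.0%} of slots idle)")
```

With this made-up grid, two thirds of the available execution slots sit idle, which is the kind of waste the white blocks represent.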
As you can see there is a large number of wait states in this process as represented by the white blocks. Now let’s say that at the end of the 9th clock cycle the processor received an interrupt from the operating system and it now has to execute the instructions from a program called ‘Task 2’. The instructions for Task 2 are represented by the green blocks in the following segment.
Again you can see that there is a large number of wait states that could be put to use, and this is where Hyper-Threading (HT) technology comes to bear. Intel has built into its HT enabled P4 processors an architecture that appears to the operating system as 2 independent processors. Consequently, two threads of program instructions can be executed at the same time on one processor. The following block diagram shows how an HT enabled P4 processor would execute both Task 1 and Task 2 at the same time. Any time there is an available execution resource, the processor fills it with an instruction from either Task 1 or Task 2. Note that in the 7th clock cycle both Task 1 and Task 2 were in a wait state, but overall the wait states have been drastically reduced, resulting in much better throughput and performance.
Intel first introduced HT technology with the Xeon family of P4 processors. Generally, this family of processors was intended to be operated in a multi-processor system. Typically, a Xeon system would have 2 processors, which would yield 4 threads that could be executed at one time, and the following block diagram illustrates that. The red and green colors represent program instructions from Task 1 and Task 2 executing on processor #1. The blue and lavender colors represent program instructions from Task 3 and Task 4 executing on processor #2.
This article was written on a dual Xeon system running Windows XP Pro, and because of the HT architecture the Task Manager detects and presents CPU charts for 4 processors rather than 2. The following picture illustrates this point.
Intel later introduced HT technology to the Pentium 4 family. There are only a handful of processors with this technology, and the P4 designations are a little confusing because some P4 processors do not offer HT technology. Briefly, here are the ones that do. Intel P4 processors with an 800 MHz front side bus have these speed designations: 3.20 GHz, 3.0 GHz, 2.80C GHz, 2.60C GHz, 2.40C GHz. Intel P4 processors with a 533 MHz front side bus have this speed available: 3.06 GHz.
Please note that several lower speed P4 processors were released previously that support slower front side bus speeds and do not support HT technology. New steppings (versions) of the slower speed processors that support the 800 MHz front side bus and HT technology are designated with a ‘C’ as shown above. Other steppings of slower speed processors that support the 533 MHz bus but not HT technology are designated with a ‘B’.
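The slot-filling idea can be tried out with a toy model in Python. This is only a sketch under simplifying assumptions: each task is reduced to a list saying how many instructions it is ready to issue in each of its cycles, the processor has 3 slots per cycle, and Task 1 always gets first pick. The task lists are invented, and a real HT processor is far more sophisticated.

```python
WIDTH = 3  # execution slots per clock cycle, as in the diagrams

def run_alone(demand):
    """One task owns the processor: return (cycles, idle slots)."""
    return len(demand), sum(WIDTH - n for n in demand)

def run_shared(task1, task2):
    """Two tasks share the slots; Task 2 stalls when it does not fit."""
    i = j = cycles = idle = 0
    while i < len(task1) or j < len(task2):
        free = WIDTH
        if i < len(task1) and task1[i] <= free:
            free -= task1[i]
            i += 1
        if j < len(task2) and task2[j] <= free:
            free -= task2[j]
            j += 1
        idle += free
        cycles += 1
    return cycles, idle

# Hypothetical per-cycle demand (0 means the task is in a wait state).
task1 = [2, 1, 0, 1, 2, 1, 0, 1, 1]
task2 = [1, 2, 1, 0, 2, 1, 0, 1, 1]

c1, w1 = run_alone(task1)
c2, w2 = run_alone(task2)
cs, ws = run_shared(task1, task2)
print(f"one after the other: {c1 + c2} cycles, {w1 + w2} idle slots")
print(f"sharing the slots:   {cs} cycles, {ws} idle slots")
```

In this made-up example the two tasks finish in 10 cycles when they share the slots instead of 18 when run back to back, with 12 idle slots instead of 36.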
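The arithmetic behind the four charts is simple enough to write down. In the Python sketch below, the helper function and its parameter names are illustrative assumptions; os.cpu_count() is a standard library call that reports how many logical processors the operating system sees, which on an HT system includes the virtual ones.

```python
import os

def expected_logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Logical processors the operating system should see."""
    return sockets * cores_per_socket * threads_per_core

# The dual Xeon described above: 2 chips, 1 core each, 2 HT threads per core.
print("dual HT Xeon:", expected_logical_cpus(2, 1, 2))

# What the machine running this script reports (None if it cannot tell).
reported = os.cpu_count()
print("this machine:", reported)
```

The count reported on your own machine will depend on its hardware and on whether HT is enabled.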
Not only does a system board require an HT compliant chipset, but the user must also check the system BIOS to make sure that the HT functionality has been enabled.
Finally, the last component that is required is an operating system that recognizes dual processors or HT technology. Intel indicates that Microsoft Windows 2000 and Windows XP are HT compatible operating systems.
HT technology is not true parallel computing, although it is very close. Even though the HT enabled P4 processor can execute 3 instructions at the same time and can also process two threads at the same time, both threads must share the same execution and data resources. True parallel processing would have multiple CPU cores, each with its own dedicated execution and data resources, executing segments of the running program simultaneously.
Intel’s next step toward true parallelism will be the dual core chip. Clearly, multi-core chips are a design generation beyond thread level parallelism because of the complexity. Intel will be trailing other companies into the multi-core chip market. IBM has already released the Power 4, which was the first dual core chip for servers, and is planning to expand the concept to the Cell and Power 5 chips. Sun is also working on a dual-core UltraSparc 4 chip, and Advanced Micro Devices is looking to do the same with its Hammer chip.
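A toy comparison in Python shows why sharing matters. The model is a deliberate simplification with invented workloads: each task is a list of how many of the 3 slots it wants per cycle, HT makes both tasks compete for one set of slots, and a dual core gives each task its own set.

```python
WIDTH = 3  # execution slots per core per clock cycle

def cycles_two_cores(task1, task2):
    """Each task has a core to itself, so neither waits for the other."""
    return max(len(task1), len(task2))

def cycles_hyper_threaded(task1, task2):
    """Both tasks compete for the same WIDTH slots on one core."""
    i = j = cycles = 0
    while i < len(task1) or j < len(task2):
        free = WIDTH
        if i < len(task1) and task1[i] <= free:
            free -= task1[i]
            i += 1
        if j < len(task2) and task2[j] <= free:
            j += 1
        cycles += 1
    return cycles

# Hypothetical workloads: one leaves many slots idle, one uses every slot.
stall_heavy = [1, 0, 1, 0]
saturating = [3, 3, 3, 3]

for name, task in (("stall-heavy", stall_heavy), ("saturating", saturating)):
    print(f"{name}: HT {cycles_hyper_threaded(task, task)} cycles, "
          f"two cores {cycles_two_cores(task, task)} cycles")
```

In this sketch, two stall-heavy tasks finish just as fast under HT as on two cores, while two tasks that each keep all 3 slots busy take twice as long under HT, because there are no idle slots left to share.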
Intel’s first multi-core chip will be Montecito; however, Intel has not yet decided on a brand name for the Montecito which is due out in 2005. Reportedly the Montecito will be the first Itanium core to be manufactured on a 90 nanometer process.
Intel has indicated they are pushing the multi-core concept into new directions with the potential to increase performance and reduce power consumption while at the same time managing the heat dissipation. Some have indicated that in the next decade or so, microprocessors will generate as much heat for their size as a nuclear reactor of equal size.
There are 2 basic approaches to building multi-core processors. Symmetrical multiprocessor chips like IBM’s Power 4 essentially squeeze 2 equal processors onto the same piece of silicon. Both would have the same processing power and would act like a dual processor system. 
The other type is the asymmetric multiprocessor, where the two internal cores differ from each other and perform specific functions. Conceptually, you could have a high intensity region and a low intensity region, where the high region would handle intense, high-priority number crunching and less significant tasks could be shunted off to the low region. Additionally, you could get a collection of mini-coprocessors to do various tasks that have previously been handled by software, like TCP/IP processing or encryption. Of course, the ultimate configuration will depend on the type of market in which the chip will be deployed, like desktops, high-end workstations, servers or mobile devices.
Parallel processing for the general user is still several years out. However, Hyper-Threading technology is a good solid baby step in that direction.
All ‘Baby Steps to our Future’ articles are archived at www.hal-pc.org/~seeker/future.
Ron Fenley worked as an engineer/analyst and retired in 1999. Ron moved to the country and now pursues his interest in computers, basic science and technology. Ron has been a computer enthusiast for 20 years and has been a HAL-PC member for about half that time. Ron can be reached at email@example.com