By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed. The advanced architecture and parallel processing pdf executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases, whereas the latter executes multiple instructions in parallel by using multiple execution units. 1966 is often mentioned as the first superscalar design.
CPUs developed since about 1998 are superscalar. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple execution units within each CPU thus multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as the number of units has increased.
If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design. In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. Therefore, a superscalar processor can be envisioned having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously.
In other cases they are inter-dependent: one instruction impacts either resources or results of the other. When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU’s clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates.