Great Microprocessors of the Past and Present

Version 5.0.0, last update: January 1995
By John Bayko

Table of Contents

  1. What's a "Great CPU"?
  2. Before the Great Dark Cloud
  3. Forgotten/Innovative Designs before the Great Dark Cloud
  4. The Great Dark Cloud Falls: IBM's Choice
  5. UNIX and RISC, a New Hope
  6. Born Beyond Scalar
  7. Weird and Innovative Chips
  8. Appendix A
  9. Appendix B
  10. Appendix C

What's a "Great CPU"?

This list is not intended to be an exhaustive compilation of microprocessors, but rather a description of designs that are either unique (such as the RCA 1802, Acorn ARM, or INMOS Transputer), or representative designs typical of the period (such as the 6502 or 8080, 68000, and R2000). Not necessarily the first of their kind, or the best.

A microprocessor generally means a CPU on a single silicon chip, but exceptions have been made (and are documented) when the CPU includes particularly interesting design ideas, and is generally the result of the microprocessor design philosophy. However, towards the more modern designs, designs from other fields overlap, and this criterion becomes rather fuzzy. In addition, parts that used to be separate (FPU, MMU) are now usually considered part of the CPU design.

This file is not intended as a reference work, though all attempts (well, many attempts) have been made to ensure its accuracy. It includes material from text books, magazine articles and papers, authoritative descriptions and half-remembered folklore from obscure sources (and net.people, whom I'd like to thank for their many helpful comments). As such, it has no bibliography or list of references.

Enjoy, criticize, distribute and quote from this list freely.

Before the Great Dark Cloud

The Intel 4004, the first (1971)

The first single chip CPU was the Intel 4004, a 4-bit processor meant for a calculator. It processed data in 4 bits, but its instructions were 8 bits long. Program and data memory were separate - 1K of data memory, and a 12-bit PC for 4K of program memory (the PC was the top of a 4 level internal stack, used by the CALL and RET instructions). There were also sixteen 4-bit (or eight 8-bit) general purpose registers.

The 4004 had 46 instructions. The 4040 was an enhanced version of the 4004, adding 14 instructions, a larger (8 level) stack, 8K program space, and interrupt abilities (including shadows of the first 8 registers). [for additional information, see Appendix B]

The Intel 8080

The 8080 was the successor to the 8008 (intended as a terminal controller, and similar to the 4040). While the 8008 had a 14 bit PC and addressing, the 8080 had a 16 bit address bus and an 8 bit data bus. Internally it had seven 8 bit registers (six of which could also be combined as three 16 bit registers), a 16 bit stack pointer to memory which replaced the 8 level internal stack of the 8008, and a 16 bit program counter. It also had several I/O ports - 256 of them, so I/O devices could be hooked up without taking away or interfering with the addressing space, and a signal pin that allowed the stack to occupy a separate bank of memory.

Intel updated the design with the 8085, which added two instructions for interrupts, and only required a +5V power supply.

The Zilog Z-80 - End of an 8-bit line (July 1976)

The Z-80 was intended to be an improved 8080, and it was - vastly improved. It also used 8 bit data and 16 bit addressing, and could execute all of the 8080 (but not 8085) op codes, but added 80 more instructions, including 1, 4, 8 and 16 bit operations and even block move and block I/O instructions. The register set was doubled, with two banks of registers (including A and F) that could be switched between. This allowed fast operating system or interrupt context switches. The Z-80 also added two index registers (IX and IY) and relocatable vectored interrupts (via the 8-bit I, or interrupt vector, register).

Like many processors (including the 8085), the Z-80 featured many undocumented instructions. In some cases, they were a by-product of early designs (which did not trap invalid op codes, but tried to interpret them as best they could), and in other cases chip area near the edge was used for added instructions, but fabrication made the failure rate high. Instructions that often failed were just not documented, increasing chip yield. Later fabrication made these more reliable.

But the thing that really made the Z-80 popular was actually the memory interface - the CPU generated its own dynamic RAM refresh signals, which meant easier design and lower system cost. That, along with its 8080 compatibility and CP/M (the first standard microprocessor operating system), made it the first choice for many systems.

The Z-8 (1979) was an embedded processor inspired by the Z-80, with on-chip RAM (actually a set of 124 general and 20 special purpose registers) and ROM (often a BASIC interpreter), and was available in a variety of custom configurations up to 20MHz. The Z-280 was a 16 bit version introduced about July, 1987. It also added an MMU to expand memory to 16Mb (still within a 64K memory map), features for multitasking, a 256 byte cache, and a huge number of new op codes tacked on (a total of over 2000!). The internal clock could be run at 2 or 4 times the external clock (ex. a 16MHz CPU with a 4MHz bus).

The 650x, Another Direction (1975)

Shortly after the 8080, Motorola introduced the 6800. Some of its designers then started MOS Technology, which introduced the 650x series, based on the 6800 design (not a clone, for legal reasons), including the 6502 used in early Commodores, Apples and Ataris. Steve Wozniak described it as the first chip you could get for less than a hundred dollars (actually a quarter of the 6800 price).

Unlike the 8080 and its kind, the 6502 had very few registers. It was an 8 bit processor, with a 16 bit address bus. Inside were one 8 bit data register (the accumulator), two 8 bit index registers, and an 8 bit stack pointer (the stack was fixed at addresses 256 to 511). It used these index and stack registers effectively, with more addressing modes, including a fast zero-page mode that accessed memory locations 0 to 255 with a single 8-bit address, which sped up operations (the CPU didn't have to fetch a second byte for the address).
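
As a rough illustration of why zero-page mode is faster, here is a minimal C sketch of the two fetch sequences (a toy model, not actual 6502 internals):

    #include <stdint.h>

    static uint8_t  mem[65536];  /* 64K address space */
    static uint16_t pc;          /* program counter */

    /* Absolute mode: two operand byte fetches to build a 16 bit
       address, then the data access itself. */
    static uint8_t load_absolute(void) {
        uint8_t lo = mem[pc++];             /* operand fetch 1 */
        uint8_t hi = mem[pc++];             /* operand fetch 2 */
        return mem[(uint16_t)(hi << 8 | lo)];
    }

    /* Zero-page mode: one operand byte fetch; the high byte is
       implicitly 0, so only addresses 0-255 are reachable, but a
       whole fetch cycle is saved. */
    static uint8_t load_zeropage(void) {
        uint8_t addr = mem[pc++];           /* the only operand fetch */
        return mem[addr];
    }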

Back when the 6502 was introduced, RAM was actually faster than CPUs, so it made sense to optimize for RAM access rather than increase the number of registers on a chip.

The 650x also had undocumented instructions.

The Apple II line, which actually includes the Apple I, was among the first microcomputers introduced, and became the longest running line, eventually including the Apple IIgs, based on the 65816 - compatible with the original 6502, but expanded to 16 bits internally (including the index and stack registers, plus a 16-bit direct page register) and given a 24-bit address bus.

The 6809, extending the 680x

The 6800 (1974) from Motorola was essentially the design the 6502 was derived from - the 6502 left out one of its data registers and added one index register, a minor change. But the 6809 was a major advance over both - at least relatively.

The 6809 had two 8 bit accumulators, rather than the one in the 6502, and could combine them into a single 16 bit register. It also featured two index registers and two stack pointers, which allowed for some very advanced addressing modes. The 6809 was source compatible with the 6800, even though the 6800 had 78 instructions and the 6809 only had around 59. Some instructions were replaced by more general ones which the assembler would translate, and some were even replaced by addressing modes.

Other features were one of the first multiplication instructions of the time, 16 bit arithmetic, and a special fast interrupt. But it was also highly optimized, gaining up to five times the speed of the 6800 series CPU. Like the 6800, it included the undocumented HCF (Halt and Catch Fire) bus test instruction.

The 6800 lived on as well, becoming the 6801/3, which included ROM, some RAM, a serial I/O port, and other goodies on the chip. It was meant for embedded controllers, where the part count was to be minimized. The 6803 led to the 68HC05 and 68HC11, the latter of which was extended to 16 bits as the 68HC16. It remains a popular embedded processor, and radiation hardened versions of the 68HC11 have been used in communications satellites, but the 6809 was a much faster and more flexible chip, particularly with the addition of the OS-9 operating system.

Of course, I'm a 6809 fan myself...

As a note, Hitachi produced a version called the 6309. Compatible with the 6809, it added 2 new 8-bit registers that could be combined to form a second 16 bit register, and all four 8-bit registers could form a 32 bit register. It also featured division, some 32 bit arithmetic, and was generally 30% faster in native mode. This information, surprisingly, was never published by Hitachi. I've heard that the Hitachi H-8 processor design was influenced by this series.

Advanced Micro Devices Am2901, a few bits at a time

Bit slice processors were modular processors. Mostly, they consisted of an ALU of 1, 2, 4, or 8 bits, and control lines (including carry or overflow signals usually internal to the CPU). Two 4-bit ALUs could be arranged side by side, with control lines between them, to form an ALU of 8-bits, for example. A sequencer would execute a program to provide data and control signals.

The Am2901, from Advanced Micro Devices, was a popular 4-bit-slice processor. It featured sixteen 4-bit registers and a 4-bit ALU, and operation signals to allow carry/borrow or shift operations and such to operate across any number of other 2901s. An address sequencer (such as the 2910) could provide control signals with the use of custom microcode in ROM.
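
To give the flavour of slice cascading, here is a hypothetical C sketch (not the 2901's actual microinstruction format) of two 4-bit adder slices chained through their carry lines to form an 8-bit adder:

    #include <stdint.h>

    /* One 4-bit slice: adds two nibbles plus a carry-in, returning
       the 4-bit result and passing carry-out to the next slice up. */
    static uint8_t slice_add(uint8_t a, uint8_t b, int cin, int *cout) {
        unsigned sum = (a & 0xF) + (b & 0xF) + cin;
        *cout = sum > 0xF;
        return sum & 0xF;
    }

    /* Two slices side by side form an 8-bit ALU: the low slice's
       carry-out feeds the high slice's carry-in. */
    static uint8_t add8(uint8_t a, uint8_t b, int *carry) {
        int c;
        uint8_t lo = slice_add(a, b, 0, &c);
        uint8_t hi = slice_add(a >> 4, b >> 4, c, carry);
        return (uint8_t)(hi << 4 | lo);
    }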

The Am2903 featured hardware multiply.

Since it doesn't fit anywhere else in this list, I'll mention it here...

AMD also produced what is probably the first floating point "coprocessor" for microprocessors, the AMD 9511 "arithmetic circuit" (1979), which performed 32 bit (23 + 7 bit floating point) RPN-style operations (on a 4 element stack) under CPU control - the 64-bit 9512 (1980) lacked the transcendental functions. It was based on a 16-bit ALU, performed add, subtract, multiply, and divide (plus sine and cosine), and while faster than software on microprocessors of the time (about a 4X speedup over a 4MHz Z-80), it was much slower (at 200+ cycles for a 32*32->32 bit multiply) than modern math coprocessors.
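
The RPN style of operation is easy to model in C (a minimal sketch - the real 9511 was driven by writing operands and command bytes to data/command ports): operands are pushed onto a small stack, and a command replaces the top two entries with a result the host then reads back.

    #include <stdio.h>

    /* Toy model of RPN coprocessor operation: a 4-element stack;
       an arithmetic command pops two operands and pushes a result. */
    static float stk[4];
    static int   top = -1;                /* -1 = empty */

    static void  push(float x) { stk[++top] = x; }
    static float pop(void)     { return stk[top--]; }

    static void cmd_add(void) {           /* like a floating add command */
        float b = pop(), a = pop();
        push(a + b);
    }

    int main(void) {
        push(2.0f); push(3.0f);           /* host CPU writes operands */
        cmd_add();                        /* host issues the command  */
        printf("%g\n", pop());            /* host reads back 5        */
        return 0;
    }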

It was used in some CP/M (Z-80) systems, and on a S-100 bus math card for NorthStar systems. Calculator circuits (such as the National Semiconductor MM57109 (1980), actually a 4-bit NS COP400 processor with floating point routines in ROM) were also sometimes used, with emulated keypresses sent to it and results read back, to simplify programming rather than for speed.

Microchip Technology PIC 16x, call it RISC

Like the Fairchild F8, Zilog Z-8 and variants of the 6800, the PIC 16x architecture is more of a microcontroller than a microprocessor - low cost is the main goal, including a low pin count and large on chip memory, so like the Z-8, it has a large register set (versions with 25 to 72 8-bit registers, compared to the Z-8's 144 8-bit registers). Another register, W, is used as an accumulator, the FSR register controls access to the register set (as an index, or as a bank select on versions with multiple banks), and there are port control registers A, B, and C for I/O - these and the FSR can also be used as general purpose registers if needed.

The 16x is very simple and RISC-like, with only 33 fixed length (12 bit) instructions (program memory is separate from data, which is stored in the registers), including several with a skip-on-condition flag to skip the next program instruction (a powerful mechanism for loops and conditional branches). This and the 12 bit instruction format produce the tight code important in embedded applications. It's also marginally pipelined (2 stages - fetch and execute), and has single cycle execution (except for branches - 2 cycles).
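
The skip mechanism is simple enough to model in C (a sketch of the idea, not Microchip's actual encoding): a test instruction never branches, it just conditionally bumps the program counter past the instruction that follows.

    /* Sketch of skip-on-condition, DECFSZ-style ("decrement f, skip
       if zero"): the usual loop idiom is a DECFSZ followed by a GOTO
       back to the top of the loop. */
    static unsigned char reg[72];   /* register file */
    static unsigned pc;             /* program counter, in words */

    static void decfsz(int f) {
        if (--reg[f] == 0)
            pc += 2;                /* skip the next instruction     */
        else
            pc += 1;                /* fall into it (e.g. the GOTO)  */
    }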

The PIC 16x is an interesting look at an 8 bit design made with newer design techniques than the other 8 bit CPUs in this list - it was actually designed around 1978 by General Instrument. It lost out to more popular CPUs and was later sold to Microchip Technology, which still sells it for small embedded applications (and to electronics hobbyists). An example of how small an embedded application based on this CPU can be is the BASIC Stamp, a small PC board consisting of 2 ICs - an 18-pin PIC 16C56 CPU (with a BASIC interpreter in 512 word ROM) and an 8-pin 256 byte serial EEPROM where user programs (about 80 lines of BASIC) are stored.

Forgotten/Innovative Designs before the Great Dark Cloud

RCA 1802, weirdness at its best (1974)

The RCA 1802 was an odd beast, extremely simple and fabricated in CMOS, which allowed it to run at 6.4 MHz (at 10V, but very fast for 1974) or to be suspended with the clock stopped. It was an 8 bit processor, with 16 bit addressing, but the major features were its extreme simplicity, and the flexibility of its large register set. Simplicity was the primary design goal, and in that sense it was one of the first RISC chips.

It had sixteen 16-bit registers, which could be accessed as thirty-two 8 bit registers, and an accumulator D used for arithmetic and memory access - memory to D, then D to registers, and vice versa, using one 16-bit register as an address. This led to one person describing the 1802 as having 32 bytes of RAM and 65535 I/O ports. A 4-bit control register P selected any one general register as the program counter, while control registers X and N selected registers for the I/O index and the operand of the current instruction. All instructions were 8 bits - a 4-bit op code (a total of 16 operations) and a 4-bit operand register number, stored in N.

There was no real conditional branching (there were conditional skips which could implement it, though), no subroutine support, and no actual stack, but clever use of the register set allowed these to be implemented - for example, changing P to another register allowed jump to a subroutine. Similarly, on an interrupt P and X were saved, then R1 and R2 were selected for P and X until an RTI restored them.
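
The "change P" subroutine trick can be sketched in C (a toy model - on the real chip this was the SEP instruction): since any of the sixteen registers can be selected as the program counter, a call is just selecting a register that already holds the subroutine's address.

    #include <stdint.h>

    static uint16_t r[16];     /* sixteen 16-bit general registers */
    static int p = 0;          /* 4-bit P: which register is the PC */

    /* SEP n: make register n the program counter. If R4 holds a
       subroutine's address, "SEP 4" acts as a call; the subroutine
       executes SEP with the caller's old P value to return. */
    static void sep(int n) { p = n; }

    /* Instruction fetch always goes through whichever register is
       currently selected as the PC. */
    static uint8_t fetch(const uint8_t *mem) { return mem[r[p]++]; }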

A later version, the 1805, added several Forth language primitives; Forth was commonly used in control applications.

Apart from the COSMAC microcomputer kit, the 1802 saw action in some video games from RCA and Radio Shack, and the chip is the heart of the Voyager, Viking and Galileo probes (Galileo along with some AMD Am2900-series bit slice processors). One reason for this is that a version of the 1802 used silicon on sapphire (SOS) technology, which gives radiation and static resistance, ideal for space operation.

Fairchild F8, Register windows

The F8 was an 8 bit processor. The processor itself didn't have an address bus - program and data memory access were contained in separate units, which reduced the number of pins, and the associated cost. It also featured 64 registers, accessed by the ISAR register in cells (windows) of eight, which meant external RAM wasn't always needed for small applications. In addition, the 2-chip processor didn't need support chips, unlike others which needed seven or more. The F8 inspired other similar CPUs, such as the Intel 8048.

The use of the ISAR register allowed a subroutine to be entered without saving a bunch of registers, speeding execution - the ISAR would just be changed. Special purpose registers were stored in the second cell (regs 9-15), and the first eight registers were accessed directly (globally).

The windowing concept was useful, but only the register pointed to by the ISAR could be accessed - to access other registers the ISAR was incremented or decremented through the window.
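
A rough C model of ISAR-style access (register numbering is illustrative): the low three bits of the ISAR select a register within the current window of eight, and the auto-increment/decrement forms only walk those low bits, staying inside the window.

    static unsigned char reg[64];  /* 64 registers = 8 windows of 8 */
    static unsigned isar;          /* 6-bit Indirect Scratchpad
                                      Address Register */

    /* Read via ISAR with auto-increment: only the low 3 bits change,
       so stepping past the end wraps within the same window. */
    static unsigned char load_isar_inc(void) {
        unsigned char v = reg[isar & 077];
        isar = (isar & 070) | ((isar + 1) & 007);
        return v;
    }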

SC/MP, early advanced multiprocessing (April 1976)

The National Semiconductor SC/MP (nicknamed "Scamp") was a typical 8 bit processor intended for control applications (a simple BASIC in a 2.5K ROM was added to one version). It featured 16 bit addressing, with 12 address lines and 4 lines borrowed from the data bus (it was common to borrow lines, sometimes all of them, from the data bus for addressing - however, only the lower 12 index register/PC bits were incremented (4K pages); special instructions modified the upper 4 bits). Internally, it included four index registers (P1 to P3, plus the PC/P0) and two 8 bit registers. It had no stack pointer or subroutine instructions (though they could be emulated with index registers). During interrupts, the PC and P3 were swapped. It was meant for embedded control, and many features were omitted for cost reasons. It was also bit serial internally to keep it cheap.

The unique feature was the ability to completely share a system bus with other processors. Most processors of the time assumed they were the only ones accessing memory or I/O devices. Multiple SC/MPs (as well as other intelligent devices, such as DMA controllers) could be hooked up to the bus. A control line (ENOUT (Enable Out) to ENIN) could be chained along the processors to allow cooperative processing. This was very advanced for the time, compared to other CPUs.

In addition to I/O ports like the 8080, the SC/MP also had instructions and one pin for serial input and one for output.

F100-L, a self expanding design

The Ferranti F100-L was designed by a British company for the British military. It was an 8 bit processor, with 16 bit addressing, but it could only access 32K of memory (1 bit of the address was reserved to flag indirection).

The unique feature of the F100-L was that it had a complete control bus available for a coprocessor that could be added on. Any instruction the F100-L couldn't decode was sent directly to the coprocessor for processing. Applications for coprocessors at the time were limited, but the design is still used in some modern processors, such as the National Semiconductor 320xx series (the predecessor of the Swordfish processor, described later), which included FPU, MMU, and other coprocessors that could just be added to the CPU's coprocessor bus in a chain. Other units not foreseen could be added later.

The Western Digital 3-chip CPU (June 1976)

The Western Digital MCP-1600 was probably the most flexible processor available. It consisted of at least four separate chips, including the control circuitry unit, the ALU, two or four ROM chips with microcode, and timing circuitry. It doesn't really count as a microprocessor, but neither do bit-slice processors (AMD 2901).

The ALU chip contained twenty six 8 bit registers and an 8 bit ALU, while the control unit supervised the moving of data, memory access, and other control functions. The ROM allowed the chip to function as either an 8 bit or a 16 bit chip, with clever use of the 8 bit ALU. Moreover, microcode allowed the addition of floating point routines (40 + 8 bit format), simplifying programming (and potentially producing a floating point coprocessor).

Two standard microcode ROMs were available. This flexibility was one reason it was used to implement the DEC LSI-11 processor as well as the WD Pascal Microengine.

Intersil 6100, old design in a new package

The Intersil IM6100 was a single chip version of DEC's PDP-8 minicomputer. The old PDP-8 design was very strange, and if it hadn't been popular, an awkward CPU like the 6100 would never have had a reason to exist.

The 6100 was a 12 bit processor which had exactly three registers - the PC, AC (an accumulator), and MQ. All 2 operand instructions read AC and MQ, and wrote back to AC. It had a 12 bit address bus, limiting RAM to only 4K. Memory references were a 7 bit (128 word) offset from either address 0 or the PC.

It had no stack. Subroutines stored the PC in the first word of the subroutine code itself, so recursion wasn't possible without fancy programming.

4K RAM was pretty much hopeless for general purpose use. The 6102 support chip (included on chip in the 6120) added 3 address lines, expanding memory to 32K the same way that the PDP-8/E expanded the PDP-8. Two registers, IFR and DFR, held the page for instructions and data respectively (IFR always used until a data address was detected). At the top of the 4K page, the PC wrapped back to 0, so the last instruction on a page had to load a new value into the IFR if execution was to continue.

The 6120 was used in the DECmate, DEC's original competition for the IBM PC.

NOVA, another popular adaptation

Like the PDP-8, the Data General Nova was also copied, not just in one but two implementations - the Data General MN601 and the Fairchild 9440. Luckily, the NOVA was a more mature design than the PDP-8.

The NOVA had four 16-bit accumulators, AC0 to AC3. There were also three 15-bit system registers - Stack pointer, Frame pointer, and Program Counter. AC2 and AC3 could be used for indexed addresses. Apart from the small register set, the NOVA was an ordinary CPU design.

Another CPU, the National Semiconductor PACE, was based on the NOVA design, but featured 16 bit addressing, more addressing modes, and a 10 level stack (like the 8008).

Signetics 2650, enhanced accumulator based (1978?)

Superficially similar to the PDP-8 (and IMS 6100), the Signetics 2650 was based around a set of 8 bit registers with R0 used as an accumulator, and six other registers arranged in two sets (R1A-R3A and R1B-R3B) - a status bit determined which register bank was active. The other registers were generally used for address calculations (ex. offsets) within the 15 bit address range. This kept the instruction set simple - all loads/stores to registers went through R0.

It also had a subroutine stack of eight 15 bit elements, with no provision for spilling over into memory.

Motorola MC14500B ICU, one bit at a time

Probably the lower limit in small processors was the 1 bit MC14500B from Motorola. It had 4 bit instructions, and controlled a single data read/write line, used for application control. It had no address bus - that was an external unit that could be added on. Another CPU could be used to feed control instructions to the 14500B in an application.

It had only 16 pins, fewer than a typical RAM chip, and ran at 1 MHz.

The Great Dark Cloud Falls: IBM's Choice

TMS 9900, first of the 16 bits (June 1976)

One of the first true 16 bit microprocessors was the TMS 9900, by Texas Instruments (the first were probably the National Semiconductor IMP-16 or AMD 2901 bit slice processors in a 16 bit configuration). It was designed as a single chip version of the TI 990 minicomputer series, much like the Intersil IM6100 was a single chip PDP-8, and the Fairchild 9440 and Data General MN601 were both one chip versions of Data General's Nova. Unlike the IM6100, however, the TMS 9900 had a mature, well thought out design.

It had a 15 bit address space and two internal 16 bit registers. One unique feature, though, was that all user registers were actually kept in memory - this included stack pointers and the program counter. A single workspace register pointed to the 16 register set in RAM, so when a subroutine was entered or an interrupt was processed, only the single workspace register had to be changed - unlike some CPUs which required a dozen or more register saves before acknowledging a context switch.

This was feasible at the time because RAM was often faster than the CPUs. A few modern designs, such as the INMOS Transputers, use the same idea with caches or rotating buffers, for the same reason of improved context switches. Other chips of the time, such as the 650x series, had a similar philosophy, using index registers, but the TMS 9900 went the farthest in this direction. Later versions added a write-through register buffer/cache.
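
A C sketch of the workspace idea (illustrative - the real 9900's BLWP instruction also saves the status register): the "registers" are just sixteen words of RAM located by a single workspace pointer, so a context switch swaps that pointer rather than copying registers.

    #include <stdint.h>

    static uint16_t ram[32768];    /* word-addressed memory */
    static uint16_t wp, pc;        /* workspace pointer and PC */

    /* Register n of the current context lives in RAM at WP + n. */
    static uint16_t *reg(int n) { return &ram[wp + n]; }

    /* A BLWP-style context switch: no register contents are copied -
       the new context's workspace already holds its registers. The
       old WP/PC go into the new workspace for the return path. */
    static void switch_context(uint16_t new_wp, uint16_t new_pc) {
        uint16_t old_wp = wp, old_pc = pc;
        wp = new_wp; pc = new_pc;
        *reg(13) = old_wp;
        *reg(14) = old_pc;
    }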

That wasn't the only positive feature of the chip. It had good interrupt handling features and a very good instruction set. Serial I/O was available through the address lines. In typical comparisons with the Intel 8086, the TMS 9900 had smaller and faster programs. The only disadvantages were the small address space and the need for fast RAM.

Despite the very poor support from Texas Instruments, the TMS 9900 had the potential at one point to surpass the 8086 in popularity.

Zilog Z-8000, another direct competitor

The Z-8000 was introduced not long after the 8086, but had superior features. It was basically a 16 bit processor, but could address up to 23 bits in some versions by using segment registers (to supply the upper 7 bits). There was also an unsegmented version, but both could be extended further with an additional MMU that used 64 segment registers. The Z-8070 was a memory mapped FPU.

Internally, the Z-8000 had sixteen 16 bit registers, but register size and use were exceedingly flexible - the first eight Z-8000 registers could be used as sixteen 8 bit registers (identified RH0, RL0, RH1 ...), or all sixteen could be grouped into eight 32 bit registers (RR0, RR2, RR4 ...), or four 64 bit registers. They were all general purpose registers - the stack pointer was typically register 15, with register 14 holding the stack segment (both accessed as one 32 bit register (RR14) for painless address calculations). The instruction set included 32-bit multiply (into 64 bits) and divide.
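
The register grouping can be pictured with a C union (a sketch of the overlap only - which byte RH0 actually maps to depends on byte order): the same register file storage is viewed as bytes, words, longwords, or quadwords.

    #include <stdint.h>

    /* One view of the Z-8000 register file: the same storage seen as
       8, 16, 32, or 64 bit registers. Only the first 8 words have
       byte names. */
    union zregs {
        uint8_t  rb[16];   /* RH0, RL0 ... RH7, RL7 (first 8 words) */
        uint16_t r[16];    /* R0 .. R15 */
        uint32_t rr[8];    /* RR0, RR2 ... RR14 */
        uint64_t rq[4];    /* RQ0, RQ4, RQ8, RQ12 */
    };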

The Z-8000 was one of the first to feature two modes, one for the operating system and one for user programs. The user mode prevented the user from messing about with interrupt handling and other potentially dangerous stuff (each mode had its own stack register).

Finally, like the Z-80, the Z-8000 featured automatic RAM refresh circuitry. Unfortunately the processor was somewhat slow, but the features generally made up for that.

A later version, the Z-80000, was introduced around the beginning of 1986, at about the same time as the 32 bit MC68020 and Intel 80386 CPUs, though the Z-80000 was quite a bit more advanced. It was fully expanded to 32 bits internally, including eight more 32 bit registers (for sixteen total) organised as in the Z-8000 (i.e. the first eight could be used as sixteen 16 bit registers, and so on) - the system stack remained in RR14.

In addition to the addressing modes of the Z-8000, larger 24 bit (16Mb) segment addressing was added, as well as an integrated MMU (absent in the 68020 but added later in the 68030), which included an on chip 16 line, 256-byte, fully associative write-through cache (which could be set to cache only data, instructions, or both, and could also be frozen by software once 'primed'). It also featured multiprocessor support by defining some memory pages to be exclusive and others to be shared (and non-cacheable), with separate memory signals for each (including GREQ (Global memory REQuest) and GACK lines). There was also support for coprocessors, which would monitor the data bus and identify instructions meant for them (the CPU had two coprocessor control lines (one in, one out), and would produce any needed bus transactions).

Finally, the Z-80000 was fully pipelined (six stages), while the fully pipelined 80486 and 68040 weren't introduced until 1989 and 1991, respectively.

But despite being technically advanced, the Z-8000 and Z-80000 series never met mainstream acceptance, due to initial bugs in the Z-8000 (the complex design did not use microcode) and to delays in the Z-80000. There was a radiation resistant military version, and a CMOS version of the Z-80000 (the Z-320). Zilog eventually gave up and became a second source for the AT&T WE32000 32-bit CPU instead.

Motorola 68000, a refined 16/32 bit CPU

The 68000 was actually a 32 bit architecture internally, but 16 bit externally for packaging reasons (the 68020 version in 1984 was 32 bit externally). It also had 24 bit addressing, without the use of segment registers. That meant that a single directly accessed array or structure could be larger than 64K in size. Addresses were computed as 32 bits, but the top 8 bits were cut to fit the address bus into a 64 pin package (address and data shared a bus in the 40 pin packages of the 8086 and Z-8000). The lack of segments made programming the 68000 easier than competing processors.

Looking back, it was logical, since most 8 bit processors featured direct 16 bit addressing without segments.

The 68000 had sixteen 32-bit registers, split into data and address registers. One address register was reserved for the Stack Pointer. Both types of registers could be used for any function except that only address registers could be used as the source of an address, but data registers could provide the offset from an address.

Like the Z-8000, the 68000 featured a supervisor and a user mode, each with its own Stack Pointer. The Z-8000 and 68000 were similar in capabilities, but the 68000's 32 bit internal organisation made it faster and eliminated forced segmentation. It was designed for expansion, including specifications for floating point and string operations (floating point was later implemented in the 68040 in 1991). Like many other CPUs of the time, the 68000 could fetch the next instruction during execution (a 2 stage pipeline - the 68040 was fully pipelined, with 6 stages and Harvard-style internal buses).

The 68060 (late 1994) expanded the design to a superscalar version (dual issue to two integer pipelines and one floating point pipeline), like the Intel Pentium and NS320xx (Swordfish) series before it, though the 68060 also included many innovative power-saving features (3.3V operation, execution unit pipelines could actually be shut down, reducing power consumption at the expense of slower execution, and the clock could be reduced down to zero). It also featured a branch target buffer and a decoded instruction prefetch buffer.

A variety of embedded versions were also introduced, including a version called ACE (1995), in which complex and unneeded instructions were removed from the architecture, simplifying it at the expense of only partial compatibility with the 68000 line. The embedded market has become the main market for the series after workstation vendors (and the Apple Macintosh) turned to faster RISC processors.

National Semiconductor 32032, similar but different

Like the 68000, the 320xx family consisted of a CPU which was 32-bit internally, and either 32 or 16 (and later 8) bits externally, as indicated by the last two digits. It appeared a little later than the others here, and so was not really a choice for the IBM PC, but is still representative of the era.

It was similar to the 68000 in basic features, such as byte addressing, a 24-bit address bus in the first version, memory to memory instructions, and so on (the 320xx also included string and array instructions). Unlike the 68000, the 320xx had eight rather than sixteen 32-bit registers, and they were all general purpose, not split into data and address registers. There was also a useful scaled-index addressing mode, and unlike other CPUs of the time, only a few operations affected the condition codes (as in more modern CPUs).

Also different, the PC and stack registers were separate from the general register set - they were special purpose registers, along with the interrupt stack, and several "base registers" to provide multitasking support - the base data register pointed to the working memory of the current module (or process), the interrupt base register pointed to a table of interrupt handling procedures anywhere in memory (rather than a fixed location), and the module register pointed to a table of active modules.

The 320xx also had a coprocessor bus, similar to the 8-bit Ferranti F100-L CPU, and coprocessor instructions. Coprocessors included an MMU, and a Floating Point unit which included eight 32-bit registers, which could be used as four 64-bit registers.

The series found use mainly in embedded applications, and was expanded to that end, with timers, graphics enhancements, and even a Digital Signal Processor unit in the Swordfish version (1991), among the first superscalar processors, with two 4-stage integer units, one floating point adder and one multiplier/DSP unit. The Swordfish also has dynamic bus resizing (8, 16, 32, or 64 bits, allowing 2 instructions to be fetched at once), clock doubling, 2 DMA channels, and in circuit emulation (ICE) support for debugging. It's interesting to note that in the case of the NS320xx and Z-80000, non-mainstream processors gained many advanced design features well ahead of the more mainstream processors, which presumably had more development resources available. One possible reason for this is the greater importance of compatibility in processors used for computers and workstations, which limits the freedom of the designers. Or perhaps the non-mainstream processors were just more flexible designs to begin with.

Intel 8086, IBM's choice (1978)

The Intel 8086 was based on the design of the 8080/8085 (source compatible with the 8080), with a similar register set, but was expanded to 16 bits. The Bus Interface Unit fed the instruction stream to the Execution Unit through a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining (8086 instructions varied from 1 to 6 bytes).

It featured four 16 bit general registers, which could also be accessed as eight 8 bit registers, and four 16 bit index registers (including the stack pointer). The data registers were often used implicitly by instructions, complicating register allocation for temporary values. It featured 64K 8-bit I/O (or 32K 16-bit) ports and fixed vectored interrupts. There were also four segment registers that could be set from index registers.

The segment registers allowed the CPU to access 1 meg of memory through an odd process. Rather than just supplying missing upper bits, as in most segmented processors, the 8086 actually added the segment register (times 16, i.e. shifted left 4 bits) to the 16 bit offset. As a strange result, segments overlapped, and two pointers with the same offset could point to two different memory locations (using different segments), while two pointers with completely different values could point to the same location. Most people consider this a brain damaged design.
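
The address arithmetic, and the aliasing it causes, is easy to show in C:

    #include <stdio.h>

    /* 8086 physical address: (segment * 16) + offset, 20 bits. */
    static unsigned long phys(unsigned seg, unsigned off) {
        return ((unsigned long)seg << 4) + off;
    }

    int main(void) {
        /* Two different segment:offset pairs, one physical address: */
        printf("%05lX\n", phys(0x1234, 0x0005));   /* prints 12345 */
        printf("%05lX\n", phys(0x1000, 0x2345));   /* prints 12345 */
        return 0;
    }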

Although this was largely acceptable for assembly language, where control of the segments was complete (it could even be useful then), in higher level languages it caused constant confusion (ex. near/far pointers). Even worse, this made expanding the address space to more than 1 meg difficult. A later version, the 80386, expanded the design to 32 bits, and 'fixed' the segmentation, but required extra modes (suppressing new or old features) for compatibility, and retains the awkward architecture. In fact, with the right assembler, code written for the 8008 can still be run on the most recent Pentium version.

The 80386 (1985) expanded the design to 32 bits, added an MMU and new op codes in a kludgy fashion similar to the Z-80 (and Z-280). The 80486 added full pipelines, on chip cache, integrated FPU, and clock doubling (like the Z-280). The Pentium (late 1993) was superscalar (up to two instructions at once in dual integer units and single FPU) - accomplishing this feat with such an awkward starting design required a lot of effort.

The Pentium was the name Intel gave the 80586 version because it could not legally protect the name "586" to prevent other companies from using it - and in fact, the Pentium compatible CPU from NexGen is called the Nx586. Due to its popularity, the 80x86 line has been the most widely cloned processor line, from the NEC V20 clone of the 8086 and the AMD and Cyrix clones of the 80386 and 80486, to versions of the Pentium within less than two years of its introduction.

Interestingly, the awkward and old architecture is such a barrier to improvements that most of the Pentium compatible CPUs (NexGen Nx586, AMD K5, Cyrix M1) do not clone the Pentium, but emulate it with specialized hardware decoders which convert Pentium instructions to RISC instructions, which are executed on specially designed superscalar RISC cores, actually faster than the Pentium itself. IBM is rumoured to be developing hardware to translate Pentium instructions to run on the PowerPC chip (or as part of a PowerPC CPU called the 616). Intel too, with partner Hewlett-Packard, has begun development of a next generation processor (compatible with the 80x86, probably with its own instruction translator), based on Very Long Instruction Word technology, which may let the 80x86 architecture finally fade away.

So why did IBM choose the 8086 series when most of the alternatives were so much better? Apparently IBM's own engineers wanted to use the 68000, and it was used later in the forgotten IBM Instruments 9000 Laboratory Computer, but IBM already had rights to manufacture the 8086, in exchange for giving Intel the rights to its bubble memory designs. Apparently IBM was already using 8086s in the IBM Displaywriter word processor.

Other factors were the 8-bit 8088 version, which could use existing low cost 8085-type components, and allowed the computer to be based on a modified 8085 design. 68000 components were not widely available, though it could use 6800 components to an extent. After the failure and expense of the IBM 5100, cost was a large factor in the design of the PC.

Intel bubble memory was on the market for a while, but faded away as better and cheaper memory technologies arrived.

UNIX and RISC, a New Hope

TRON, between the ages (1987)

TRON stands for The Real-time Operating system Nucleus, and was a grand scheme devised by Japanese electronics firms to design a unified architecture for computer systems from the CPU, to operating systems, to large scale networks. It was designed just as RISC architectures were set to rise, but retained the CISC design philosophies - it could be considered a last gasp, though that doesn't do justice to the intent behind the design and its part in the TRON architecture.

The basic design is scalable, from 32 to 48 and 64 bit designs, with 16 general purpose registers. It is a CISC instruction set, but an elegant one. One early design was the Mitsubishi M32 (mid 1987), which optimised the simple and often used TRON instructions, much like the 80486 and 68040 did for their instruction sets. It featured a 5 stage pipeline and dynamic branch prediction with a branch target buffer similar to that in the AMD 29K. It also featured an instruction prefetch queue, but being a prototype, had no MMU support or FPU.

Commercial versions such as the Gmicro/200 and Gmicro/300 from Fujitsu, and the Toshiba Tx1, were also introduced, but didn't catch on outside the Japanese market. In addition, many RISC designers licensed their (faster) designs freely to Japanese companies. TRON's promise of a unified architecture (when complete) was less important to companies than raw performance and immediate compatibility (Unix, MS-DOS, Macintosh), and it has not become significant in the industry.

SPARC, an extreme windowed RISC (1987)

SPARC, or the Scalable (originally Sun) Processor ARChitecture was designed by Sun Microsystems for their own use. Sun was a maker of workstations, and used standard 68000-based CPUs and a standard operating system, Unix. Research versions of RISC processors had promised a major step forward in speed [See Appendix A], but existing manufacturers were slow to introduce a RISC type processor, so Sun went ahead and developed its own (based on Berkeley's design). In keeping with their open philosophy, they licensed it to other companies, rather than manufacture it themselves.

SPARC was not the first RISC processor. The AMD 29000 (see below) came before it, as did the MIPS R2000 (based on Stanford's experimental design) and Hewlett-Packard PA-RISC CPU, among others. The SPARC design was radical at the time, even omitting multiple cycle multiply and divide instructions (like a few others did), while most RISC CPUs are more conventional.

SPARC usually contains about 128 or 144 registers (CISC designs typically had 16 or fewer). At any one time 32 registers are visible - 8 are global, the rest are allocated in a 'window' from a stack of registers. The window is moved 16 registers down the stack during a function call, so that the upper and lower 8 registers are shared between functions, to pass and return values, and 8 are local. The window is moved up on return, so registers are loaded or saved only at the top or bottom of the register stack. This allows functions to be called in as little as 1 cycle. Like most RISC processors, global register zero is wired to zero to simplify instructions, and SPARC is pipelined for performance (a new instruction can start execution before a previous one has finished). Also like previous processors, a dedicated CCR holds comparison results.
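
Here's a sketch of the windowing arithmetic in C (sizes from the description above; a real SPARC also traps to spill or fill registers when the circular stack over- or underflows):

    /* 8 globals plus a circular stack of window registers. A window
       is 24 of them: 8 "in" (shared with the caller), 8 local, and
       8 "out" (shared with the callee). A call moves the window base
       by 16, so the caller's outs become the callee's ins - argument
       passing with no memory traffic. */
    #define NWINREGS 128            /* e.g. 8 windows of 16 registers */

    static long globals[8];
    static long winregs[NWINREGS];  /* circular register stack */
    static int  cwp;                /* current window base */

    static long *reg(int n) {       /* n = 0..31 as the code sees it */
        if (n < 8) return &globals[n];
        return &winregs[(cwp + n - 8) % NWINREGS];
    }

    static void call(void) { cwp = (cwp + 16) % NWINREGS; }
    static void ret(void)  { cwp = (cwp - 16 + NWINREGS) % NWINREGS; }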

SPARC is 'scalable' mainly because the register stack can be expanded (up to 512, or 32 windows), to reduce loads and saves between functions, or scaled down to reduce interrupt or context switch time, when the entire register set has to be saved. Function calls are usually much more frequent than interrupts, so the large register set is usually a plus, but compilers now can usually produce code which uses a fixed register set as efficiently as a windowed register set across function calls.

SPARC is not a chip, but a specification, and so there are various designs of it. It has undergone revisions, and now has multiply and divide instructions. Original versions were 32 bits, but 64 bit and superscalar versions were designed and implemented (beginning with the Texas Instruments SuperSparc in 1993); however, performance lagged behind other RISC and even Intel 80x86 processors until the UltraSparc (1995) from Texas Instruments and Sun, a 64-bit superscalar processor that can issue up to four instructions at once (but not out of order) to any of two integer units, two of the five floating point/graphics units, or the branch and load/store units. The UltraSparc also added a block move instruction (up to 600 MB/sec at 200MHz) which bypasses the cache to avoid disrupting it, and specialized pixel operations which can operate in parallel on 8, 16, or 32-bit values packed in a 64-bit word (a sort of simple SIMD/vector operation, similar to earlier HP PA-RISC 7100 and Motorola 88110 graphics unit instructions).
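
The packed-pixel idea can be sketched in C: the function below adds eight 8-bit values packed in one 64-bit word, all in parallel, in the spirit (though not the actual encoding or semantics) of such instructions:

    #include <stdint.h>

    /* SIMD-within-a-register: add eight packed 8-bit lanes at once.
       The masking keeps a carry in one lane from rippling into the
       lane above it. */
    static uint64_t padd8(uint64_t a, uint64_t b) {
        uint64_t lo7 = 0x7F7F7F7F7F7F7F7FULL;       /* low 7 bits   */
        uint64_t sum = (a & lo7) + (b & lo7);       /* no lane leaks */
        return sum ^ ((a ^ b) & ~lo7);              /* fix top bits  */
    }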

AMD 29000, a flexible register set (1987?)

The AMD 29000 is another RISC CPU descended from the Berkeley RISC design. Like the SPARC design that was introduced shortly after, the 29000 has a large set of registers split into local and global sets. But though it was introduced before the SPARC, it has a more elegant method of register management.

The 29000 has 64 global registers, in comparison to the SPARC's eight. In addition, the 29000 allows variable sized windows allocated from the 128 register stack cache. The current window or stack frame is indicated by a stack pointer (a modern version of the ISAR register in the Fairchild F8 CPU), and a pointer to the caller's frame is stored in the current frame, as in an ordinary stack (directly supporting stack languages like C - a CISC-like philosophy). Spills and fills occur only at the ends of the cache, and registers are saved/loaded from the memory stack. This allows variable window sizes, from 1 to 128 registers. This flexibility, plus the large set of global registers, makes register allocation easier than in SPARC (optimised stack operations also make it ideal for stack-oriented interpreted languages such as PostScript, making it popular as a laser printer controller).

There is no special condition code register - any general register is used instead, allowing several condition codes to be retained, though this sometimes makes code more complex. An instruction prefetch buffer (using burst mode) ensures a steady instruction stream. Branches to another stream can cause a delay, so the first four new instructions are cached - next time a cached branch (up to sixteen) is taken, the cache supplies instructions during the initial memory access delay.

Registers aren't saved during interrupts, allowing the interrupt routine to determine whether the overhead is worthwhile. In addition, a form of register access control is provided. All registers can be protected, in blocks of 4, from access. These features make the 29000 useful for embedded applications, which is where most of these processors are used, allowing it at one point to claim the title of 'the most popular RISC processor'. The 29000 also includes an MMU and support for the 29027 FPU (integrated into the 29050 CPU in 1990).

Advanced Micro Devices also makes clones of Intel 80x86 processors, and much of the development of the superscalar core for a new AMD 29000 was shared with the 'K5' (1995) Pentium compatible processor (the 'K5' translates 80x86 instructions to RISC-style instructions, and dispatches up to five at once to two integer units, two FPUs, a branch and a load/store unit).

MIPS R2000, the other approach (June 1986)

The R2000 design came from the Stanford MIPS project, which stood for Microprocessor without Interlocked Pipeline Stages [See Appendix A], and was arguably the first commercial RISC processor (other candidates are the ARM and IBM ROMP used in the IBM PC/RT workstation, designed around 1981 but delayed until 1986). It was intended to simplify processor design by eliminating hardware interlocks between the five pipeline stages. This means that only single execution cycle instructions can access the thirty two 32 bit general registers, so that the compiler can schedule them to avoid conflicts. This also means that LOAD/STORE and branch instructions have a 1 cycle delay to account for. However, because of the importance of multiply and divide instructions, a special HI/LO pair of multiply/divide registers exist which do have hardware interlocks, since these take several cycles to execute and produce scheduling difficulties.
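
The load delay can be modelled in C (a toy simulator fragment, not real R2000 logic): the value fetched by a load is not visible to the instruction immediately after it, so the compiler must fill that slot with something useful (or a NOP).

    #include <stdint.h>

    static uint32_t r[32];          /* r[0] is hardwired to zero */
    static uint32_t pending_val;    /* load result still in flight */
    static int      pending_reg;    /* 0 = none (r0 can't be written) */

    /* Called at the start of each instruction: a load issued one
       cycle earlier only now lands in its destination register. */
    static void retire_load(void) {
        if (pending_reg) { r[pending_reg] = pending_val; pending_reg = 0; }
    }

    /* A load only schedules the register write - the NEXT instruction
       still sees the old value (the "load delay slot"). */
    static void load(int rt, uint32_t value_from_memory) {
        pending_reg = rt;
        pending_val = value_from_memory;
    }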

Like the AMD 29000 and DEC Alpha, the R2000 has no condition code register, which was considered a potential bottleneck. The PC is user readable. The CPU includes an MMU that can also control a cache, and the CPU was one of the first which could operate as either a big or little endian processor. An FPU, the R2010, is also specified for the processor.

Newer versions include the R3000, with improved cache control, and the R4000 (1991), which is expanded to 64 bits and is superpipelined (twice as many pipeline stages do less work at each stage, allowing a higher clock rate and twice as many instructions in the pipeline at once, at the expense of increased latency when the pipeline can't be filled, such as during a branch - and interlocks had to be added between stages for compatibility, making the original "I" in the "MIPS" acronym meaningless).

The R8000 (1994), optimised for floating point operation, abandoned superpipelines, but is superscalar, allowing two integer or load/store operations (from four integer and two load/store units) and two floating point operations to be dispatched at a time, sending them to the independent R8010 floating point coprocessor (with its own set of thirty-two 64-bit registers and load/store queues).

The R10000 version (expected 1995 or 1996) brings the FPU onto the same chip, and adds almost every advanced modern CPU feature, including superscalar execution (four instructions dispatched (possibly out of order) to any of two integer, two floating point, and one load/store units), dynamic register renaming (allowing speculative execution of predicted branches using 'false' registers which are later kept, or discarded if the branch is incorrect or an exception occurs), and a 'predecoded' on-chip instruction cache, where instructions are partially decoded when they are loaded from memory into the cache, simplifying and speeding the processor's decode (and register rename/issue) stage. This technique was first implemented in the AT&T CRISP/Hobbit CPU, described later.

Hewlett-Packard 'Spectrum' PA-RISC, a conservative RISC (Oct 1986)

A design typical of many RISC processors, the PA-RISC (Precision Architecture, originally code-named Spectrum) was designed to replace older processors in HP-3000 MPE minicomputers, and Motorola 680x0 processors in the HP-9000 HP/UX Unix minicomputers and workstations. It has an unusually large instruction set for a RISC processor (including a conditional skip instruction, similar in concept to the condition bits in the ARM processor), partly because initial design took place before RISC philosophy was popular, and partly because careful analysis showed that performance benefited from the instructions chosen - in fact, version 1.1 added new multiple operation instructions combined from frequent instruction sequences. Despite this, it's a simple design - the entire original CPU had only 115,000 transistors, less than twice as many as the much older 68000. It has a 5 stage pipeline, with hardware interlocks for instructions that take more than one cycle.

It is a load/store architecture, originally with a single instruction/data bus, later expanded to a Harvard architecture (separate instruction and data buses). It has thirty-two 32-bit integer registers (GR0 wired to constant 0, GR31 used as a link register for procedure calls) and (thirty-two?) (80?)-bit floating point registers, in an FPU (which could execute a floating point instruction simultaneously) added from the Apollo-designed Prism architecture after Hewlett-Packard acquired that company. Later versions (the PA-RISC 7200 in 1994) added a second integer unit (still dispatching only two instructions at a time to any of the three units). Addressing originally was 48 bits, and was expanded to 64 bits, using a segmented addressing scheme.

The PA-RISC 7200 also included a tightly integrated cache and MMU, a high speed 64-bit 'Runway' bus, and a fast but complex fully associative 2KB on-chip assist cache, between the simpler direct-mapped data cache and main memory, which reduces thrashing (repeatedly loading the same cache line) when two memory addresses are aliased (mapped to the same cache line). Instructions are predecoded into a separate instruction cache (like the AT&T CRISP/Hobbit).

The PA-RISC 8000 (intended to compete with the R10000, UltraSparc, and others in late 1995 or 1996) expands the architecture to 64 bits (eliminating segments), and adds aggressive superscalar design which includes issuing 4 instructions to ten functional units, out of order execution and dynamic reordering, and speculative execution of branches.

Although typically sporting fewer of the advanced (and promised) features of competing CPU designs, a simple elegant design and an effective instruction set have kept PA-RISC performance among the best of its class (of those actually available at the time) since its introduction.

In the future Hewlett-Packard plans to pursue a VLIW (Very Long Instruction Word) design in conjunction with Intel, where several concurrent operations are encoded in a single instruction by the compiler, instead of being grouped from the instruction stream by special CPU hardware. Some of the new CPUs meant to execute Intel 80x86 instructions (The AMD 'K5' and NexGen Nx586 (late 1994), for example) treat 80x86 instructions as VLIW instructions, decoding them into RISC-like instructions and executing several concurrently.

Motorola 88000, Late but elegant (mid 1988)

The Motorola 88000 (originally named the 78000) is a 32 bit processor based on a Harvard architecture. Each bus has a separate cache, so simultaneous data and instruction access doesn't conflict. It is similar to the Hewlett Packard Precision Architecture (HP/PA) in design (including many control/status registers only visible in supervisor mode), though the 88000 is more modular, has a small elegant instruction set, and lacks the segmented addressing (limiting addressing to 32 bits). The 88200 unit provides dual caches (including multiprocessor support) and MMU functions for the 88100 CPU, if needed.

The 88000 has thirty-two 32 bit user registers, with distinct internal function units - an ALU and a floating point unit in the 88100 version. Other special function units, for graphics, vector operations and so on, could be added to the design to produce custom versions for customers. The function units of the 88100 share the same register set, while the 88110, like most modern chips, has a separate set of thirty-two 80-bit registers for the FPU. Additional ALU and FPU units and instruction scheduling were added for the 88110 version of the CPU, one of the first superscalar designs (following the 320xx Swordfish), thanks to the elegant initial modular design. Despite this, it was introduced late and never became as popular in major systems as the MIPS or HP processors, so development (and performance) lagged as Motorola favoured the PowerPC CPU being coproduced with IBM.

Like most modern processors, the 88000 is pipelined, but unlike the early MIPS processors, it had register interlocks between functional units (and pipelined instructions) from the beginning. Also like most processors, results from one instruction can be forwarded to the next instruction instead of waiting to be stored in a register, and in the superscalar 88110, the result from one ALU can be fed directly into another in the next clock cycle, saving a clock cycle between instructions.

Loads and saves in the 88110 are buffered so the processor doesn't have to wait, except when loading from a memory location still waiting for a save to complete. The 88110 version can also speculatively execute conditional branches in the pipeline. If the speculation is true, there is no branch delay in the pipeline. Otherwise, the operations are rolled back from a history buffer (adding at least 1 cycle penalty, compared to 'register renaming' used by later superscalar processors), and the other fork of the branch is taken. This history buffer also allows precise interrupts, while interrupts are 'imprecise' in the 88100.

Acorn ARM, RISC for the masses (1986)

ARM (Advanced RISC Machine, originally Acorn RISC Machine) is often praised as one of the most elegant modern processors in existence. It was meant to be "MIPS for the masses", and designed as part of a family of chips (ARM - CPU, MEMC - MMU and DRAM/ROM controller, VIDC - video and DAC, IOC - I/O, timing, interrupts, etc) for the Archimedes home computer (multitasking OS, windows, etc). It's made by VLSI Technology Inc, and based partly on the Berkeley experimental RISC design. It is simple, has a short 3-stage pipeline, and can operate in big- or little-endian mode.

The original ARM (ARM1, 2 and 3) was a 32 bit CPU, but used 26 bit addressing. The newer ARM6xx spec is completely 32 bits. It has user, supervisor, and various interrupt modes (including 26 bit modes for ARM2 compatibility). The ARM architecture has sixteen registers (including user visible PC as R15) with a multiple load/save instruction, though many registers are shadowed in interrupt modes (2 in supervisor and IRQ, 7 in FIRQ) so need not be saved, for fast response.

A unique feature of ARM is that every instruction features a 4 bit condition code (including 'never execute', not officially recommended). Another bit indicates whether the instruction should set condition codes, so intervening instructions don't change them. This easily eliminates many branches and can speed execution. Another unique and useful feature is a barrel shifter which operates on the second operand of most ALU operations, allowing shifts to be combined with most operations (and index registers for addressing), effectively combining two or more instructions into one.
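
The effect of per-instruction condition codes can be sketched in C (names illustrative; real ARM defines 16 condition codes): every instruction carries a condition tested against the current flags, so short if/else sequences compile to straight-line code with no branches.

    /* Sketch of ARM-style predication. */
    enum cond { EQ, NE, LT, AL /* ...real ARM has 16 of these */ };

    struct flags { int n, z, c, v; };

    static int passes(enum cond cc, struct flags f) {
        switch (cc) {
        case EQ: return f.z;            /* execute only if Z set */
        case NE: return !f.z;
        case LT: return f.n != f.v;     /* signed less-than      */
        default: return 1;              /* AL: always execute    */
        }
    }

    /* An instruction whose condition fails is simply squashed - it
       costs a cycle but no pipeline flush. For example, max(a,b)
       with no branch:
           CMP   r0, r1       ; set flags from r0 - r1
           MOVLT r0, r1       ; executes only if r0 < r1
    */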

These features make ARM code both dense and efficient, despite the relatively low clock rate and short pipeline - it is roughly equivalent to a much more complex 80486 in speed. It was chosen for the Apple Newton handheld system because of its speed, combined with the low power consumption, low cost and customizable design (the ARM610 version used by Apple includes a custom MMU supporting object oriented protection and access to memory for the Newton's NewtOS).

The ARM series consists of the ARM6 CPU core (35,000 transistors, which can be used as the basis for a custom CPU) the ARM60 base CPU, and the ARM600 which also includes 4K cache, MMU, write buffer, and coprocessor interface (for FPU). A newer version, the ARM7 series, increases performance by optimising the multiplier, and adding DSP extensions including 32 bit and 64 bit multiply and multiply/accumulate instructions (not specified - replacing the barrel shifter with a full ALU?). It also includes embedded In Circuit Emulator (ICE) support and a faster clock rate. Although not fast in floating point, the ARM7 remains competitive in integer performance.

Born Beyond Scalar

Intel 960, Intel gets it right (1987 or 1988?)

Largely obscured by the marketing hype surrounding the Intel 80860, the 80960 was actually an overall better processor, and has since replaced the AMD 29K series as "the world's most popular RISC". The 960 was aimed at the high end embedded market (including multiprocessor and debugging support, and strong interrupt/fault handling, but lacking MMU support), while the 860 was intended to be a general purpose processor (the name 80860 echoing the popular 8086).

The 960 was designed to be superscalar, with instructions dispatched to multiple (undefined, but generally including at least one integer) execution units, which could include internal registers (such as the four 80 bit registers in the floating point unit (32, 64, and 80 bit IEEE operations)). There are sixteen 32 bit global registers and a sixteen register "cache" - similar to the SPARC register windows, but not overlapping (originally four banks). It's a RISC-based load/store Harvard architecture (32-bit flat addressing), but has some complex microcoded instructions (such as CALL/RET). There are also thirty-two 32 bit special function registers.

It's a very clean embedded architecture, not designed for high level applications, but very effective and scalable - something that can't be said for all Intel's processor designs.

Intel 860, "Cray on a Chip" (1988?)

The Intel 80860 was an impressive chip, able at top speed to perform close to 66 MFLOPS at 33 MHz in real applications, compared to a more typical 5 or 10 MFLOPS for other CPUs of the time. But much of this was marketing hype, and it never became popular, lagging behind most newer CPUs and Digital Signal Processors in performance.

The 860 has several modes, from regular scalar mode to a superscalar mode that executes two instructions per cycle, and a user visible pipelined mode. It can use its 8K data cache in a limited way as a small vector register (like those in supercomputers). Instruction and data busses are separate, addressing 4 G of memory, with segments. It also includes a Memory Management Unit for virtual storage.

The 860 has thirty two 32 bit registers and thirty two 32 bit (or sixteen 64 bit) floating point registers. It was one of the first microprocessors to contain not only an integer ALU but an on-chip FPU as well, and it also included a 3-D graphics unit (attached to the FPU) featuring line drawing, Gouraud shading, Z-buffering for hidden line removal, and other operations in conjunction with the FPU. It was also the first able to do an integer operation and a (special multiply and add) floating point instruction - the equivalent of three instructions - at the same time.

However, actually getting the chip to top speed usually requires using assembly language - using standard compilers gives it a speed closer to other processors. Because of this, it was used as a coprocessor, either for graphics or for floating point acceleration, as in add-in parallel processing units for workstations. Another problem with using the Intel 860 as a general purpose CPU is the difficulty of handling interrupts. It is extensively pipelined, having as many as four pipes operating at once, and when an interrupt occurs, the pipes can spill and lose data unless complex code is used to clean up. Delays range from 62 cycles (best case) to 50 microseconds (almost 2000 cycles).

IBM RS/6000 POWER chips (1991)

When IBM decided to become a real part of the workstation market (after its unsuccessful PC/RT based on the ROMP processor), it decided to produce a new innovative CPU, based partly on the 801 project that pioneered RISC theory. RISC normally stands for Reduced Instruction Set Computer, but IBM calls it Reduced Instruction Set Cycles, and implemented a relatively complex processor with more high level instructions than most CISC processors. What they ended up with was a CPU (POWER1) that initially contained five or seven separate chips - the branch unit, fixed point unit, floating point unit, and either two or four cache chips (separate data and instruction cache).

The branch unit is the heart of the CPU, and enables multiple instructions (up to four in the original POWER1, more commonly two or three) to be executed at once. It contains the condition code register, performs checks on this register, and performs branches. The condition code register has eight fields - in POWER1 two were reserved for the fixed and floating point units, the other six could be set separately (or combined from several instructions), and can be checked several instructions later. It also dispatches multiple instructions (out of order if possible) to available execution units (each unit has a separate instruction buffer to allow other instructions to be dispatched instead of waiting for a single execution unit to finish a complex instruction). For added speed, the branch unit contains a loop register (for decrement and branch on zero with no penalty), a type of feature found in many Digital Signal Processors.
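
For example, a simple counted loop like the (hypothetical) one below can be compiled so the counter lives in the branch unit's loop register, with the decrement-and-branch folded into a single branch unit operation (the "bdnz" instruction on PowerPC) that costs the fixed point unit nothing:

    /* The loop counter can be kept in the branch unit's count
       register, so the fixed point unit only does the real work. */
    void scale(double *x, int n, double k)
    {
        for (int i = 0; i < n; i++)
            x[i] *= k;      /* loop closed by one bdnz, no compare */
    }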

The branch unit can speculatively take branches (using a prediction bit in the POWER1 and PowerPC 601 (1993), and using dynamic prediction and a Branch History Table in the PowerPC 604 (late 1994) and newer versions), dispatching instructions and then canceling them if the branch is not taken (3 cycle maximum penalty). However it buffers the other instruction path to reduce latency. The branch unit also manages procedure calls and returns on a program counter stack, allowing effective zero-cycle calls when overlapped with other instructions. Finally, it handles interrupts (except floating point exceptions) without software intervention.

The integer unit(s) perform integer operations, as well as some complex string instructions in the POWER1 and 2 and PowerPC 630 (1995), and loads and stores in the POWER1 and PowerPC 601 (newer versions added a separate load/store unit, allowing integer operations to continue concurrently). Most versions contain thirty two 32 bit registers, while the PowerPC 620 and 630 registers are 64 bits (with appropriate new instructions). The PowerPC 630, designed for the AS/400 minicomputer series, also has decimal arithmetic and string instructions, and an interface for a matrix coprocessor for future RS/6000 workstations. All integer units can forward results needed by subsequent instructions before the write stage occurs, and some versions (PowerPC 604 and PowerPC 620) include extra registers which are renamed for a speculative or out-of-order instruction to prevent write conflicts, and make it easier to discard the results of a canceled instruction. A reorder buffer in the branch/dispatch unit tracks renamed integer and floating point registers.

The floating point unit contains thirty two 64 bit registers and performs all typical floating point operations, including multiply/accumulate instructions and array multiply and add. The registers are loaded and stored by the fixed point unit in POWER1 and PowerPC 601, by the Load/Store unit in others (because of its multichip design, the POWER2 has two dedicated floating point load/store units). Because FPU instructions are multi-cycle, the FPU provides register renaming to reduce or eliminate stalling. Like some other CPUs, floating point traps are imprecise due to execution time. For debugging, a precise trap mode prevents execution overlap, slowing execution. Normally, a trap bit is set on a floating point exception, and software can test for the condition to generate a trap - or ignore it if it's a safe operation.
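
The multiply/accumulate instruction computes a*b + c with a single rounding at the end - the same operation C99 later standardised as fma(). A sketch of the kind of loop it speeds up (the function and array names are hypothetical):

    #include <math.h>

    /* On POWER each fma() below can become one floating point
       multiply-add instruction (fmadd). */
    double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s = fma(a[i], b[i], s);
        return s;
    }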

Data buses range from 32 bits for early and low end versions to 256 bits (plus ECC bits) for the high bandwidth POWER2 multichip CPU which issues up to six instructions and four simultaneous loads or stores.

Overall the IBM POWER CPU is very powerful, reminiscent of mainframe designs, which almost qualifies it as "Weird and Innovative", and violates the RISC philosophy of simplicity and fewer instructions (at over a hundred and growing, versus only about 34 for the ARM and 52 for the Motorola 88000 (including FPU instructions)). Originally a multichip design, single chip versions designed with partners Apple and Motorola are intended as a replacement for the 68000 and 80x86 architectures. Newer multichip versions have also been designed, such as the POWER2 (late 1993, with an impressive 23 million transistors in 8 chips, including 256K data cache). The high complexity is very effective, but also limits the clock rate of the designs - an interesting tradeoff considering that a highly parallel 71.5 MHz POWER2 is faster than a 200 MHz DEC Alpha 21064.

DEC Alpha, Designed for the future (1992)

The DEC Alpha architecture is designed, according to DEC, for an operational life of 25 years. Its main innovation is PALcalls (a writable instruction set extension), but it is an elegant blend of features, selected to ensure no obvious limits to future performance - no special registers, etc. The first Alpha chip is the 21064.

Alpha is a 64 bit architecture (32 bit instructions) that doesn't support 8- or 16-bit operations, but allows conversions, so no functionality is lost (Most processors of this generation are similar, but have instructions with implicit conversions). Alpha 32-bit operations differ from 64 bit only in overflow detection. Alpha does not provide a divide instruction due to difficulty in pipelining it. It's very much like the MIPS R2000, including use of general registers to hold condition codes. However, Alpha has an interlocked pipeline, so no special multiply/divide registers are needed, and Alpha is meant to avoid the significant growth in complexity which the R2000 family experienced as it evolved into the R8000 and R10000.
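
For example, a byte load can be synthesised from an aligned 64 bit load plus an extract (Alpha's EXTBL instruction does the extraction in one step). A portable C model of the idea, assuming little-endian byte numbering and 64 bit longs:

    /* 'mem' is memory viewed as aligned 64 bit words; 'addr' is a
       byte address.  Names are illustrative. */
    unsigned char load_byte(const unsigned long *mem, unsigned long addr)
    {
        unsigned long q = mem[addr >> 3];       /* aligned 64 bit load */
        return (q >> ((addr & 7) * 8)) & 0xff;  /* extract the byte    */
    }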

One of Alpha's roles is to replace DEC's two prior architectures - the VAX minicomputers and the MIPS-based workstations. To do this, the chip provides both IEEE and VAX 32 and 64 bit floating point operations, and features Privileged Architecture Library (PAL) calls - a set of programmable (non-interruptable) macros written in the Alpha instruction set, similar to the programmable microcode of the Western Digital MCP-1600 or the AMD Am2910 CPUs. PALcalls simplify conversion from other instruction sets using a binary translator, and provide flexible support for a variety of operating systems.

Alpha was also designed for the future, with room for a 1000-fold eventual increase in performance (10 X by clock rate, 10 X by superscalar execution, and 10 X by multiprocessing). Because of this, superscalar instructions may be reordered, and trap conditions are imprecise (like in the 88100). Special instructions (memory and trap barriers) are available to synchronise both when needed (different from the POWER use of a trap condition bit which is explicitly tested by software, but similar in effect; SPARC also has a specification for similar barrier instructions). And there are no branch delay slots like in the R2000, since they produce scheduling problems in superscalar execution, and compatibility problems with extended pipelines. Instead, speculative execution (branch instructions include hint bits) and a branch cache are used.
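
To illustrate the memory barrier mentioned above, here is a sketch using C11 atomics as a portable stand-in for Alpha's MB (memory barrier) instruction - the flag and data names are hypothetical:

    #include <stdatomic.h>

    int data;
    atomic_int ready;

    /* On a weakly ordered machine like Alpha, the store to 'data'
       could become visible to another processor after the store to
       'ready' unless a barrier sits between them. */
    void publish(int value)
    {
        data = value;
        atomic_thread_fence(memory_order_release);  /* an MB on Alpha */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

(A reader on another processor needs a matching barrier between seeing 'ready' set and reading 'data'.)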

The 21064 was introduced with one integer, one floating point, and one load/store unit. The 21164 (early 1995) added one integer unit (replacing the load/store unit) and one floating point unit, and increased clock speed from 200 MHz to 300 MHz (still roughly twice that of competing CPUs). DEC's Alpha is in many ways the antithesis of IBM's POWER design: POWER gains performance from complexity, at the expense of a large transistor count, while the Alpha concentrates on the original RISC idea of simplicity and a higher clock rate - though that also has its drawbacks, in terms of very high power consumption despite a 3.3V implementation.

Weird and Innovative Chips

Intel 432, Extraordinary complexity (1980)

The Intel iAPX 432 was a complex, object oriented 32-bit processor that included high level operating system support in hardware, such as process scheduling and interprocess messaging. It was intended to be the main Intel microprocessor - the 80286 was envisioned as a step between the 8086 and the 432. The 432 actually included four chips. The GDP (processor) and IP (I/O controller) were introduced in 1980, and the BIU (Bus Interface Unit) and MCU (Memory Control Unit) were introduced in 1983 (but never widely used). The GDP complexity was split into 2 chips (decode/sequencer and execution units, like the Western Digital MCP-1600), so it wasn't really a microprocessor.

The GDP was exclusively object oriented - normal linear memory access wasn't allowed, and there was hardware support for data hiding, methods, inheritance, late binding, and access protection, and it was promoted as being ideal for the Ada programming language. To enforce this, permission checks for every memory access (via a 2 stage segmentation) slowed execution (despite cached segment tables). It supported up to 2^24 segments, each limited to 64K in size (within a 2^32 address space), but the object oriented nature of the design meant that was not a real limitation. The stack oriented design meant the GDP had no user data registers. Instructions were bit encoded, ranging from 6 bits to 321 bits long (similar to the byte encoded T-9000) and could be very complex.
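
A hint of why the bit encoding hurt: with instructions starting on arbitrary bit boundaries, even seeing the next opcode takes shift-and-mask work, where a byte-aligned design just indexes memory. A sketch (the field widths here are illustrative, not the real 432 formats):

    /* Extract 'width' bits starting at absolute bit position 'bitpos'
       from an instruction stream - one bit at a time for clarity. */
    unsigned fetch_bits(const unsigned char *mem,
                        unsigned long bitpos, int width)
    {
        unsigned v = 0;
        for (int i = 0; i < width; i++) {
            unsigned long b = bitpos + i;
            v |= ((unsigned)(mem[b >> 3] >> (b & 7)) & 1u) << i;
        }
        return v;
    }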

The BIU defined the bus, designed for multiprocessor support allowing up to 63 modules (BIU or MCU) on a bus and up to 8 independent buses (allowing memory interleaving to speed access). The MCU did automatic parity checking and ECC error correcting. The total system was designed to be fault tolerant to a large degree, and each of these parts contributes to that reliability.

Despite these advanced features, the 432 didn't catch on. The main reason was that it was slow, sometimes up to five or ten times slower than a 68000. Part of this was the lack of local data registers, or a data cache. Part of this was the fault-tolerant BIU, which defined an asynchronous clocked bus that resulted in 25% to 40% of the access time being used by wait states. The instructions weren't aligned on bytes or words, and took longer to decode. In addition, the protections imposed on the objects slowed data access. Finally, the implementation of the GDP on two chips instead of one produced a slower product. However, the fact that this complex design was produced and bug free is impressive.

Its high level architecture was similar to that of the Transputer systems, but it was implemented in a way that was much slower than other processors, while the T-414 wasn't just innovative, but much faster than other processors of the time.

The Intel 960 is sometimes considered a successor of the 432 (it has been called "RISC applied to the 432"), and does have similar hardware support for context switching. This path came about indirectly through the BiiN machine, which much more closely resembled the 432.

Rekursiv, an object oriented processor

The Rekursiv processor is actually a processor board, not a microprocessor, but is neat. It was created by a manufacturing company called Linn, to control their manufacturing system. The owner was a believer in automation, and had automated the company as much as possible with VAXes, but wasn't satisfied, so he hired software experts to design a new system, which they called LINGO. It was completely object oriented, like Smalltalk (and unlike C++, which allows object concepts, but handles them in a conventional way), but too slow on the VAXes, so Linn commissioned a processor designed for the language.

This is not the only processor designed specifically for a language that is slow on other CPUs. Several specialized LISP processors, such as the Scheme-79 LISP processor, were created, but this chip is unique in its object oriented features. It also manages to support objects without the slowness of the Intel 432.

The Rekursiv processor features a writable instruction set, and is highly parallel. It uses 40 bits for objects, and 24 bit addressing - kind of. Memory can't be addressed directly, only through the object identifiers, which are 40 bit tags. The hardware handles all objects in memory and on disk, swapping them to disk as needed. It has no real program - all data and code/methods are embedded in the objects, and loaded when a message is sent to them. There is a page table which stores the object tags and maps them into memory.
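
A loose C model of the addressing scheme (all the names here are hypothetical - the real machine does this in hardware):

    typedef unsigned long long tag40;     /* 40 bit object identifier */

    struct object {
        void     *base;    /* location in memory, or NULL if on disk */
        unsigned  size;
    };

    extern struct object *page_table_lookup(tag40 id);  /* hypothetical */
    extern void swap_in(struct object *o);              /* hypothetical */

    /* Software never sees a raw address - every access goes through
       the object tag, faulting the object in from disk if needed. */
    void *resolve(tag40 id)
    {
        struct object *o = page_table_lookup(id);
        if (o->base == NULL)
            swap_in(o);
        return o->base;
    }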

There is a 64K area, arranged as 16K 128 bit words, for microcode, allowing an instruction set to be constructed on the fly. It can change for different objects.

The CPU hardware creates, loads, saves, destroys, and manipulates objects. The manipulation is accomplished with a standard AMD 29203 CPU, but the other parts are specially designed. It executes LINGO entirely fast enough, a perfect match between language and CPU, but it can also execute more conventional languages, such as Smalltalk or C, if needed - possibly simultaneously, as separate complete objects.

TMS320C30, a popular DSP architecture (1988)

Digital Signal Processors can act as general purpose processors, but are optimised for certain types of computation (such as signal processing involving matrix computation), usually in embedded applications - resulting in designs which are both somewhat weird and innovative, compared to general purpose CPUs (although not when compared to other DSPs such as the TMS 320Cx0 - but this is a CPU list, not a DSP list, so they go in this section). There is usually little or no interrupt support, or memory management support.

The 320C30 is a 32 bit floating point DSP, based on the earlier 320C20/10 16 bit fixed point DSPs (1982). It has eight 40 bit extended precision registers R0 to R7 (32 bits plus 8 guard bits for floating, 32 bits for fixed), eight 32 bit auxiliary registers AR0 to AR7 (used for pointers) with two separate arithmetic units for address calculation, and twelve 32 bit control registers (including status, an index register, stack, interrupt mask, and repeat block loop registers).

It includes on chip memory in the form of one 4K ROM block and two 1K RAM blocks - each block has its own bus, for a total of three (compared to one instruction and one data bus in a Harvard architecture), which essentially function as programmer controlled caches. Two arguments to the ALU can be from memory or registers, and the result is written to a register, through a 4 stage pipeline. The ALU is separate from the control logic - a separation which is much clearer in the AT&T DSP32 and Motorola 56000 designs, and is even reflected in the MIPS R8000 processor FPU and the IBM POWER architecture with its branch unit loop counter. The idea is to allow the separate parts to operate as independently as possible (for example, a memory access, pointer increment, and ALU operation), for the highest throughput, so instructions accessing loop and condition registers don't take the same path as data processing instructions.
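
The kind of loop this organisation is built for is the FIR filter inner loop below - per iteration, one multiply-accumulate, two memory fetches and two pointer increments, which the separate busses and address units can overlap in a single cycle (a sketch; the names are illustrative):

    float fir(const float *x, const float *h, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += *x++ * *h++;   /* multiply, add, and two pointer
                                     increments, all in parallel */
        return acc;
    }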

The TMS320Cx0 series also includes the 320C80 (1994?), which has four DSP cores on a single chip.

Motorola DSP96002, an elegant DSP architecture

The 96002 is based on (and remained software compatible with) the earlier 56000 24 bit fixed point DSP (most fixed point DSPs are 16 bit, but 24 bits makes it ideal for audio processing, without the high cost of a floating point 32 bit DSP). A 16 bit version (the 5616) was introduced later.

Like the TMS320C30, the 96002 has a separate program memory (RAM in this case, with a bootstrap ROM used to load the initial external program) and two blocks of data RAM, each with separate data and address busses. The data blocks can also be switched to ROM blocks (such as sine and cosine tables). There's also a data bus for access to external memory. Separate units work independently, with their own registers (generally organised as three 32 bit parts of a single 96 bit register in the 96002 (where the '96' comes from), and three 24 bit registers in the 56000/1 (unrelated to the '56')).

The program control unit has 32 bit PC, status, and operating mode registers, plus 32 bit loop address and 32 bit loop counter registers (branches are 2 cycles, conditional branches are 3 cycles - with conditional execution support), and a fifteen element 64 bit stack (with a separate 6 bit stack pointer).

The address generation unit has seven 96 bit registers, divided into three 32 bit (24 in the 56000/1) registers - R0-R7 address, N0-N7 offset, and M0-M7 modify (containing increment values) registers.

The Data Unit includes ten 96 bit floating point/integer registers, which can also be divided into 32 bit registers. It was one of the first to perform fully IEEE compliant floating point operations.

The processor is not pipelined, but designed for single cycle execution within each unit. With multiple units and the large number of registers, it can perform a floating point multiply, add and subtract while loading two registers, performing a DMA transfer, and four address calculations within a two clock tick processor cycle, at peak speeds.

AT&T CRISP/Hobbit, CISC amongst the RISC (1987)

The AT&T Hobbit ATT92010 was inspired by the Bell Labs C Machine project, aimed at a design optimised for the C language. Since C is a stack based language, the processor is optimised for memory to memory stack based execution, and has no user visible registers (stack pointer is modified by special instructions, an accumulator is in the stack), with the goal of simplifying the compiler as much as possible.

Instead of registers, a two ported 32 bit stack cache is provided - similar to the stack cache of the AMD 29000, though much smaller (sixty-four 32-bit words) but easily expandable - and Hobbit has no global registers. Addresses can be memory direct or indirect (for pointers) relative to the stack pointer, without extra instructions or operand bits. The cache is not optimised for multiprocessors.
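
In effect, a compiler for Hobbit does no register allocation at all - the C locals below simply are stack cache slots, addressed by their offset from the stack pointer (the offsets and the pseudo-assembly in the comment are illustrative, not real Hobbit syntax):

    int f(int a, int b)
    {
        int t = a + b;    /* roughly: add 4(SP),8(SP) -> 0(SP), a
                             memory to memory operation served
                             entirely from the stack cache */
        return t * 2;
    }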

Hobbit has an instruction prefetch buffer (3K in 92010, 6K in the 92020), like the 8086, but decodes the variable length (1, 3 or 5 halfword (16 bit)) instructions into a thirty-two entry instruction cache. Branches are not delayed, and a prediction bit directs speculative branch execution. The decode unit folds branches into the decoded instructions (which include next and alternate next PC), so a predicted branch does not take any clock cycles. The three stage execution unit takes instructions from the decode cache. Results can be forwarded when available to any prior stage as needed.

Though CISC in philosophy, the Hobbit is greatly simplified compared to traditional CISC designs, and features some very elegant design features. AT&T prefers to call it a RISC processor, and performance is comparable to similar RISC designs such as the ARM. Its most prominent use is in the EO Personal Communicator, a competitor to Apple's Newton which uses the ARM processor.

T-9000, parallel computing (1994)

The INMOS T-9000 is the latest version of the Transputer architecture, a processor designed to be hooked up to other processors for high speed parallel processing. The previous versions were the 16 bit T-212 and 32 bit T-414 and T-800 (which included a 64 bit FPU) processors (1985). The instruction set is minimised, like a RISC design, but is based on a stack/accumulator design (similar in idea to the PDP-8), and designed around the OCCAM language. The most important feature is that each chip contains 4 serial links to connect the chips in a network.

While the transputers were faster than their contemporaries, recent RISC designs have surpassed them. The T-9000 attempts to regain the lead. It starts with the architecture of the T-800 which contains only three 32 bit integer and three 64 bit floating point registers which are used as an evaluation stack - they are not general purpose. Instead, like the TMS 9900, it uses memory, addressed relative to the workspace register. This allows very fast context switching, less than a microsecond, speeding and simplifying process scheduling enough that it is automated in hardware (supporting two priority levels and event handling (link messages and interrupts)). The Intel 432 also attempted some hardware process scheduling, but was unsuccessful.
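
Since all working 'registers' live in the workspace in memory, a process context is essentially just two words - workspace pointer and instruction pointer - which is what makes the hardware switch so cheap. A loose C model (the names are hypothetical):

    struct process {
        unsigned long   wptr;   /* workspace pointer    */
        unsigned long   iptr;   /* instruction pointer  */
        struct process *next;   /* scheduler queue link */
    };

    struct process *current, *run_queue;

    /* Roughly what the hardware does at a descheduling point: the
       old process's state is already in its workspace, so it just
       picks the next ready process - a few word moves in all. */
    void reschedule(void)
    {
        current = run_queue;          /* next ready process */
        run_queue = run_queue->next;  /* dequeue it         */
        /* execution resumes at current->iptr, with the workspace
           register set to current->wptr */
    }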

Unlike the TMS 9900, the T-9000 is far faster than memory access, so the CPU has several levels of very high speed caches and memory types. The main cache is 16K, and is designed for 3 reads and 1 write simultaneously. The workspace cache, based on 32 word rotating buffers, allows 2 reads and 1 write simultaneously.

Instructions are in bytes, consisting of a 4 bit op code and 4 bit data (usually an offset, 0 to 15 words, into the workspace), but prefix instructions can load extra data for an instruction which follows, 4 bits at a time. Less frequent instructions can be encoded with 2 (such as process start, message I/O) or more bytes (CRC calculations, floating point operations, 2D block copies and scheduler queue management). The stack architecture makes instructions very compact, but executing one instruction byte per clock can be slow for multibyte instructions, so the T-9000 has a grouper which gathers instruction bytes (up to eight) which can be executed in parallel in the 5 stage pipeline (fetching four per cycle, grouping up to 8 if multicycle instructions allow it to catch up - 2 memory loads (simple or indexed), a single ALU operation and a single store (a statement of the form a[i] = b[2] + c[3]) can be grouped, for example).
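
The nibble encoding itself is simple enough to sketch in C (the execute() helper is hypothetical; the pfix/nfix function codes are the standard Transputer ones):

    /* Each code byte: high nibble = function, low nibble = data.
       pfix/nfix build operands larger than 4 bits in the operand
       register, a nibble per byte. */
    extern void execute(unsigned fn, unsigned operand);  /* hypothetical */

    void decode(const unsigned char *code, int len)
    {
        unsigned oreg = 0;                      /* operand register */
        for (int i = 0; i < len; i++) {
            unsigned fn   = code[i] >> 4;
            unsigned data = code[i] & 0xF;
            if (fn == 0x2)                      /* pfix */
                oreg = (oreg | data) << 4;
            else if (fn == 0x6)                 /* nfix */
                oreg = (~(oreg | data)) << 4;
            else {                              /* direct function */
                execute(fn, oreg | data);
                oreg = 0;
            }
        }
    }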

The T-9000 contains 4 main internal units, the CPU, the VCP (handling the individual links of the previous chips, which needed software for communication), the PMI, which manages memory, and the Scheduler.

This processor is ideal for a model of parallel processing known as systolic arrays (a pipeline is a simple example). Even larger networks can be created with the C104 crossbar switch, which can connect 32 transputers or other C104 switches into a network hundreds of thousands of processors large. The C104 acts like an instant switch, not a network node, so the message is passed through, not stored. Communication can be at close to the speed of direct memory access.

Like many CPUs, the Transputers can adapt to a 64, 32, 16, or 8 bit bus. They can also feed off a 5 MHz clock, generating their own internal clock (up to 50 MHz for the T-9000) from this signal, and contain internal RAM, making them ideal for high performance embedded applications.

As a note, the T-800 FPU is probably the first large scale commercial device to be proven correct through formal design methods.

Appendix A

RISC and CISC definitions

RISC usually refers to a Reduced Instruction Set Computer. IBM pioneered many RISC ideas (but not the acronym) in their 801 project. RISC ideas also come from the CDC 6600 computer and projects at Berkeley (RISC I and II and SOAR) and Stanford University (the MIPS project). RISC designs call for each instruction to be a single, fixed length, and to execute in a single cycle, which is done with pipelines and no microcode (to reduce chip complexity and increase speed). Operations are performed on registers only (the only memory accesses being loads and stores). Finally, several RISC designs use a large windowed register set (or stack cache) to speed subroutine calls (see the entry on SPARC for a description).

But despite these specifications, RISC is more a philosophy than a set of design criteria, and almost everything is called RISC, even if it isn't. Pipelines are used in the 68040 and 80486 CISC processors to execute instructions in a single cycle, even though they use microcode, and windowed registers have been added to CISC designs (such as the Hitachi H16), speeding them up in a similar way. Basically, RISC asks whether hardware (for complex instructions or memory-to-memory operations) is necessary, or whether it can be replaced by software (simpler instructions or load/store architecture). Higher instruction bandwidth is usually offset by a simpler chip that can run at a higher clock speed, and more available optimisations for the compiler.

CISC refers to a Complex Instruction Set Computer. There's not really a set of design features to characterize it like there is for RISC, but small register sets, memory to memory operations, large instruction sets (with variable length instructions), and use of microcode are common. The philosophy is that if added hardware can result in an overall increase in speed, it's good - the ultimate goal of mapping every high level language statement on to a single CPU instruction. The disadvantage is that it's harder to increase the clock speed of a complex chip. Microcode is a way of simplifying processor design to this end. Even though it results in instructions that are slower, requiring multiple clock cycles, clock frequency could be increased due to the simpler design. However, most complex instructions are seldom used.

VAX: The Penultimate CISC (1978)

The VAX architecture isn't a microprocessor, since it's still usually implemented in multiple chip modules. However, it and its predecessor, the PDP-11, helped inspire design of the Motorola 68000, Zilog Z8000, and particularly the National Semiconductor 32xxx series CPUs. It was considered the most advanced CISC design, and the closest so far to the ultimate CISC goal. This is one reason that the VAX 11/780 is used as the speed benchmark for 1 MIPS (Million Instructions Per Second), though actual execution was apparently closer to 0.5 MIPS.

The VAX was a 32 bit architecture, with a 32 bit address range (split into 1G sections for process space, process specific system space, system space, and unused/reserved for future use). Each process has its own 1G process and 1G process system address space, with memory allocated in pages.

It features sixteen user visible 32 bit registers. Registers 12 to 15 are special - AP (Argument Pointer), FP (Frame Pointer), SP and PC (user, supervisor, executive, and kernel modes have separate SPs in R14, like the 68000 user and supervisor modes). All these registers can be used for data, addressing and indexing. A 64 bit PSL (Program Status Longword) keeps track of interrupt levels, program status, and condition codes.

The VAX 11 features an 8 byte instruction prefetch buffer, like the 8086, while the VAX 8600 has a full 6 stage pipeline. Instructions mimic high level language constructs, and provide dense code. For example, the CALL instruction not only handles the argument list itself, but enforces a standard procedure call for all compilers. However, the complex instructions aren't always the fastest way of doing things. For example, replacing the INDEX instruction with a sequence of simpler VAX instructions was 45% to 60% faster. This was one inspiration for the RISC philosophy.
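
Roughly what one INDEX instruction did, as a C model (the trap helper is hypothetical):

    extern void raise_subscript_trap(void);    /* hypothetical */

    /* One dimension of a bounds-checked array subscript calculation:
       trap if the subscript is out of range, then accumulate. */
    long vax_index(long sub, long low, long high, long size, long in)
    {
        if (sub < low || sub > high)
            raise_subscript_trap();
        return (in + sub) * size;
    }

When the bounds check isn't wanted, the add and multiply alone - two simple instructions - beat the microcoded INDEX.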

RISC Roots: CDC 6600 (1965)

Most RISC concepts can be traced back to the Control Data Corporation CDC 6600 'Supercomputer' designed by Seymour Cray (1964?), which emphasized a small instruction set (64 op codes) with a load/store and register-register design as a means to greater performance.

The CDC 6600 was a 60-bit machine ('bytes' were 6 bits each), with an 18-bit address range. It had eight 18 bit A (address) and eight 60 bit X (data) registers, with useful side effects - loading an address into A2, A3, A4 or A5 caused a load from memory at that address into X2, X3, X4 or X5; similarly, A6 and A7 had the same effect on X6 and X7, while loading an address into A0 or A1 had no side effects. As an example, to add two arrays into a third, the starting addresses of the sources could be loaded into A2 and A3, causing data to load into X2 and X3, the values could be added into X6, and the destination address loaded into A6, causing the result to be stored in memory. Incrementing A2, A3, and A6 (after adding) would step through the arrays. Side effects such as this are decidedly anti-RISC, but very nifty. This vector-oriented philosophy is more directly expressed in later Cray computers.
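
The array add described above, as a C model with the 6600's register side effects noted in comments:

    void vec_add(const long *a, const long *b, long *c, int n)
    {
        for (int i = 0; i < n; i++) {
            long x2 = a[i];     /* set A2 -> memory loads into X2 */
            long x3 = b[i];     /* set A3 -> memory loads into X3 */
            long x6 = x2 + x3;  /* add into X6                    */
            c[i] = x6;          /* set A6 -> X6 stores to memory  */
        }                       /* bump A2, A3, A6 to step along  */
    }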

Multiple independent functional units in the CDC 6600 could operate concurrently, but they weren't pipelined until the CDC 7600 (1969), and only one instruction could be issued at a time (a scoreboard register prevented instructions from issuing to busy functional units). Compared to the variable instruction lengths of other machines, instructions were only 15 or 30 bits, packed within 30 bit "parcels" (a 30 bit instruction could not occupy the upper 15 bits of one parcel and the lower 15 bits of the next, so the compiler would insert NOPs to align instructions) to simplify decoding (a RISC-like feature). Like the DEC Alpha, there were no byte or character operations, until later versions added a CMU (Compare and Move Unit) for character, string and block operations.

RISC Formalised: IBM 801

The first system to formalise these principles was the IBM 801 project (1975). Like the VAX, it was not a microprocessor (ECL implementation), but strongly influenced microprocessor designs. The design goal was to speed up frequently used instructions while discarding complex instructions that slowed the overall implementation. Like the CDC 6600, memory access was limited to load/store operations (which were delayed, locking the register until complete, so most execution could continue). Branches were delayed, and instructions used a three operand format common to RISC processors. Execution was pipelined, allowing 1 instruction per cycle.

The 801 had thirty two 32 bit registers, but no floating point unit/registers, and no separate user/supervisor mode, since it was an experimental system - security was enforced by the compiler. It implemented Harvard architecture with separate data and instruction caches, and had flexible addressing modes.

IBM tried to commercialise the 801 design when RISC workstations first became popular, with the ROMP CPU (Research OPD (Office Products Division) Mini Processor, 1986) in the PC/RT workstation, but it wasn't successful. Design changes to reduce cost included eliminating the caches and Harvard architecture, reducing registers to sixteen, variable length instructions (to increase instruction density), and floating point support via an adaptor to an NS32081 FPU. This allowed a small CPU, only 45,000 transistors, but an average instruction took around 3 cycles.

The 801 itself morphed into the I/O processor for the IBM 3090 mainframes.

RISC Refined: Berkeley RISC, Stanford MIPS

Some time after the 801, around 1981, projects at Berkeley (RISC I and II) and Stanford University (MIPS) further developed these concepts. The term RISC came from Berkeley's project, which was the basis for the SPARC processor. Because of this, features are similar, including a windowed register file (10 global and 22 windowed, vs 8 and 24 for SPARC) with R0 wired to 0. Branches are delayed, and like ARM, all instructions have a bit to specify if condition codes should be set, and execute in a 3 stage pipeline. In addition, next and current PC are visible to the user, and last PC is visible in supervisor mode.

The Berkeley project also produced an instruction cache with some innovative features, such as instruction line prefetch that identified jump instructions, frequently used instructions compacted in memory and expanded upon cache load, multiple cache chips support, and bits to map out defective cache lines.

The Stanford MIPS project was the basis for the MIPS R2000, and as with the Berkeley project, there are close similarities. MIPS stood for Microprocessor without Interlocked Pipeline Stages, using the compiler to eliminate register conflicts. Like the R2000, the MIPS had no condition code register, and a special HI/LO multiply and divide register pair.

Unlike the R2000, the MIPS had only 16 registers, and two delay slots for LOAD/STORE and branch instructions. The PC and last three PC values were tracked for exception handling. In addition, instructions were 'packed' (like the Berkeley RISC), in that many instructions specified two operations that were dispatched in consecutive cycles (not decoded by the cache). In this way, it was a 2 operation VLIW, but executed sequentially. User assembly language was translated to 'packed' format by the assembler.

Being experimental, there was no support for floating point operations.

Processor Classifications:

Arbitrarily assigned by me...
Complex/                                                         Simple/
CISC____________________________________________________________RISC
      |                                                         14500B*
4-bit |                                                    *Am2901
      |                                   *4004
      |                                *4040
8-bit |                                                       *1802
      |                                 *8008      SC/MP
      |                             *8080  *2650     *    *F8
      |                F100-L*    *Z-8       *6800,650x
      |                                     *NOVA        *  *PIC16x
      |          MCP1600*   *Z-80         *6809    IMS6100
16-bit|          *Z-280
      |                      *8086    *TMS9900
      |                 *Z8000          *65816
      |
      |            32016*   *68000 ACE HOBBIT               R3000
32-bit|    320C30*   96002 *68020    *   *  *      *29000     *    *ARM
      | *432      *VAX * 80486 68040      i960    *SPARC
      |          Z80000*    *  *    TRON48    PA-RISC
      |       Pentium* [1]---*-------     *    *88100
      | *      [2]--<860>-*--------            *     *88110
64-bit| Rekurs          POWER         *        CDC6600     *R4000
      |                           U-SPARC *     *R8000         *Alpha
      |                                R10000

[1] - About here, from left to right, the Swordfish and 68060.
[2] - In general, Pentium emulator 'clones' such as the 586, AMD K5,
and Cyrix M1 fit about here.

Boy, it's getting awfully crowded there!

Okay, an explanation. Since this is only a 2-dimensional graph, and I want to get a lot more across than that allows, design features pull a CPU along the RISC/CISC axis, and the complexity of the design (given the number of bits and other considerations) also tugs it - thus much of the POWER's RISC-ness is offset by its inherently complex (though effective) design. And it also depends on my mood that day - hey, it's ultimately subjective anyway.

Appendix B:

Appearing in IEEE Computer 1972:

NEW PRODUCTS

FEATURE PRODUCT

COMPUTER ON A CHIP

Intel has introduced an integrated CPU complete with a 4-bit parallel adder, sixteen 4-bit registers, an accumulator and a push-down stack on one chip. It's one of a family of four new ICs which comprise the MCS-4 micro computer system--the first system to bring the power and flexibility of a dedicated general-purpose computer at low cost in as few as two dual in-line packages.

MCS-4 systems provide complete computing and control functions for test systems, data terminals, billing machines, measuring systems, numeric control systems and process control systems.

The heart of any MCS-4 system is a Type 4004 CPU, which includes a set of 45 instructions. Adding one or more Type 4001 ROMs for program storage and data tables gives a fully functioning micro-programmed computer. Add Type 4002 RAMs for read-write memory and Type 4003 registers to expand the output ports. Using no circuitry other than ICs from this family of four, a system with 4096 8-bit bytes of ROM storage and 5120 bits of RAM storage can be created. For rapid turn-around or only a few systems, Intel's erasable and re-programmable ROM, Type 1701, may be substituted for the Type 4001 mask-programmed ROM.

MCS-4 systems interface easily with switches, keyboards, displays, teletypewriters, printers, readers, A-D converters and other popular peripherals. For further information, circle the reader service card 87 or call Intel at (408) 246-7501.

Appearing in IEEE Computer 1975:

The age of the affordable computer.

MITS announces the dawning of the Altair 8800 Computer. A lot of brain power at a price that's bound to create love and understanding. To say nothing of excitement.

The Altair 8800 uses a parallel, 8-bit processor (the Intel 8080) with a 16-bit address. It has 78 basic machine instructions with variances over 200 instructions. It can directly address up to 65K bytes of memory and it is fast. Very fast. The Altair 8800's basic instruction cycle time is 2 microseconds.

Combine this speed and power with Altair's flexibility (it can directly address 256 input and 256 output devices) and you have a computer that's competitive with most mini's on the market today.

The basic Altair 8800 Computer includes the CPU, front panel control board, front panel lights and switches, power supply (enough to power any additional cards), and expander board (with room for 3 extra cards) all enclosed in a handsome, aluminum case. Up to 16 cards can be added inside the main case.

Options now available include 4K dynamic memory cards, 1K static memory cards, parallel I/O cards, three serial I/O cards (TTL, RS232, and TTY), octal to binary computer terminal, 32 character alpha-numeric display terminal, ASCII keyboard, audio tape interface, 4 channel storage scope (for testing), and expander cards.

Options under development include a floppy disc system, CRT terminal, line printer, floating point processor, vectored interrupt (8 levels), PROM programmer, direct memory access controller and much more.

Appendix C

Bubble Memories

Certain materials (e.g. gadolinium gallium garnet) are magnetizable easily in only one direction. A film of these materials can be created so that it's magnetizable in an up-down direction. The magnetic fields tend to stick together, so you get a pattern that is kind of like air bubbles in water squished between glass, half with the north pole facing up, half with the south, floating inside the film. When a vertical magnetic field is imposed on this, the areas in opposite alignment to this field shrink to circles, or 'bubbles'.

A bubble can be formed by reversing the field in a small spot, and can be destroyed by increasing the field.

The bubbles are anchored to tiny magnetic posts arranged in lines. Usually a 'V V V' shape or a 'T T T' shape. Another magnetic field is applied across the chip, which is picked up by the posts and holds the bubble. The field is rotated 90 degrees, and the bubble is attracted to another part of the post. After four rotations, a bubble gets moved to the next post:

o                             o              o
 \/   \/       \/   \/      \/   \/      \/   \/               o
o_|_   _|_      _|_   _|_     _|_o  _|_      _|_ o _|_     _|_ o _|_
     |           o  |             |              |             |

I hope that diagram makes sense.

These bubbles move in long thin loops arranged in rows. At the end of the row, the bits to be read are copied to another loop that shifts to read and write units that create or destroy bubbles. Access time for a particular bit depends on where it is, so it's not consistent.

One of the limitations of bubble memories, and why they were superseded, was the slow access. A large bubble memory would require large loops, so accessing a bit could require cycling through a huge number of other bits first. The speed of propagation is limited by how fast the magnetic fields can be switched back and forth - a limit of about 1 MHz. On the plus side, they are non-volatile, but EEPROMs, flash memories, and ferroelectric technologies are also non-volatile and are faster.
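
A quick worked example of the access time problem (the device organisation here is assumed for illustration):

    #include <stdio.h>

    int main(void)
    {
        double shift_rate = 1e6;   /* ~1 MHz field rotation limit */
        double loop_bits  = 4096;  /* bits per loop, assumed      */
        /* on average, half the loop passes the read port before
           the bit you want arrives: */
        double avg_access = (loop_bits / 2) / shift_rate;
        printf("average access: %.1f ms\n", avg_access * 1e3); /* 2.0 */
        return 0;
    }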

Ferroelectric and Ferromagnetic (core) Memories

Ferroelectric materials are analogous to ferromagnetic materials, though neither actually needs to contain any iron. Ferromagnetic materials, used in core memories, will retain a magnetic field that's been applied to them.

Core memories consist of ferromagnetic rings strung together on tiny wires. The wires will induce magnetic fields in the rings, which can later be read back. Usually reading this memory will erase it, so once a bit is read, it is written back. This type of memory is expensive because it has to be constructed physically, but is very fast and non-volatile. Unfortunately it's also large and heavy, compared to other technologies.

Legend reports that a Swedish jet prototype (the Viggen, I believe) once crashed, but the flight recorders weren't fast enough to record the cause of the crash. The flight computers used core memory, though, so they were hooked up and read out, and they still contained the data from microseconds before the crash occurred, allowing the cause to be determined.

Ferroelectric materials retain an electric field rather than a magnetic field. Like core memories, they are fast and non-volatile, but bits have to be rewritten when read. Unlike core memories, ferroelectric memories can be fabricated on silicon chips in high density and at low cost.
