Day 01 Computer Architecture & MIPS

http://it.korea.ac.kr/

Introduction.ppt

Introduction_modified.ppt

Introduction_CA.ppt

ISA.ppt

Microarchitecture_-_0._Introduction.ppt

[Info] Intel Motherboard Chipset History: Part 2

http://www.acrofan.com/ko-kr/consumer/content/20081113/0000020001

[DAY 01-1]

Chapter 1 Introduction

Page 127

Introduction_CA.ppt

Microprocessor: a single chip processor

- Intel i7, Intel Pentium IV, AMD Athlon, SUN Ultrasparc, ARM, MIPS, ..

ISA (Instruction Set Architecture)

- Defines machine instructions and programmer visible machine states such as registers and memory

- Examples

- X86(IA32): 386 ~ Pentium III, Pentium IV

- IA64: Itanium, Itanium2

- Others: PowerPC, SPARC, MIPS, ARM

Microarchitecture

- Implementation: implement the machine hardware according to the ISA

  - Pipelining, caches, branch prediction, buffers

- Invisible to programmers

Moore's Law : the observation that microchip density (or computer performance) doubles every 18 months
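As a rough illustration of what doubling every 18 months implies, the sketch below computes the cumulative growth factor. The function name `moore_growth` and the sample time spans are assumptions chosen for illustration; they are not part of the lecture material.

```python
def moore_growth(years, doubling_period_years=1.5):
    """Growth factor after `years` if density doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


# Hypothetical time spans, chosen only to show how quickly the factor grows.
for years in (1.5, 3, 15):
    print(f"{years} years -> x{moore_growth(years):,.0f}")
```

Fifteen years of doubling every 18 months is ten doublings, which is why the factor reaches roughly a thousand.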

x86, SPARC, Power PC

x86(IA32)

What is a 32-bit processor?

It handles 32 bits of data at a time when executing basic instructions.

CISC (Complex Instruction Set Computer)

- Each instruction is complex

  - Instructions of different sizes, many instruction formats, computations allowed on memory data, …

- A large number of instructions in ISA

- Architectures until mid 80’s

  - Examples: x86, VAX

RISC (Reduced Instruction Set Computer)

- Each instruction is simple

  - Fixed size instructions, only a few instruction formats

- A small number of instructions in ISA

- Load-store architectures

  - Data must be transferred to registers before computation

  - Computations are allowed only on registers

- Most architectures built since 80’s

  - Examples: MIPS, ARM, PowerPC, Alpha, SPARC, IA64, PA-RISC, etc.

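CISC (Complex Instruction Set Computer)

A computer designed on the principle that anything that can be implemented in hardware should be left to the hardware, in order to make compilers easier to write.

RISC (Reduced Instruction Set Computer)

A computer whose instruction set is deliberately reduced: to raise execution speed, complex processing is left to software wherever possible.

The characteristics of RISC, compared with CISC, are as follows.

First, most instructions execute in one machine cycle, instruction length is fixed, and the instruction set consists of simple instructions; for example, memory access is limited to Load/Store instructions.

Second, there are few addressing modes, microprogrammed control is reduced, and hardwired logic is used heavily. In return there are many registers, which occupy the chip area that would otherwise store the microprogram.

Third, the assembly code is hard to read, and to use the pipeline effectively some instructions are not laid out in their original program order, so compiler optimization is required. Without optimization the pipeline cannot be used effectively and there is little point in using RISC.

** Most computer architectures built since the 1980s are RISC.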

Word

- Default data size for computation

  - Size of a GPR & ALU data path depends on the word size

- The word size determines if a processor is an 8b, 16b, 32b, or 64b processor

Address (or pointer)

- Points to a location in memory

- Each address points to a byte (byte addressable)

- If you have a 32b address, you can address 2^32 bytes = 4GB

- If you have a 256MB memory, you need at least a 28 bit address since 2^28 bytes = 256MB
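The two address calculations above can be checked with a few lines of Python. This is a minimal sketch assuming byte addressability; the helper names `addressable_bytes` and `address_bits_needed` are made up for illustration.

```python
MB = 2 ** 20
GB = 2 ** 30


def addressable_bytes(address_bits):
    """Number of bytes reachable with a byte-addressable address of this width."""
    return 2 ** address_bits


def address_bits_needed(memory_bytes):
    """Smallest address width n such that 2**n >= memory_bytes."""
    return (memory_bytes - 1).bit_length()


print(addressable_bytes(32) // GB, "GB")      # 32-bit address
print(address_bits_needed(256 * MB), "bits")  # 256MB memory
```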

Caches

- Faster but smaller memory close to processor

  - Fast since they are built using SRAMs, but more expensive

Word

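A word is the unit of data that a single machine instruction or operation can move from storage into a register (the basic unit of computation).

It is the unit of data handled by one instruction when moving data from memory to a register or manipulating data in the ALU.

On a common 32-bit CPU (ARM, etc.), a word is 32 bits.

It is an important factor because the processing unit must be decided first when developing a CPU; only then can hardware such as the registers and ALU be designed.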

Cache : faster but smaller than main memory

Lambda : Minimum feature size

[Microprocessors]

X86 CPU generations

- 1st generation: 80386

- 2nd generation: 486

- 3rd generation: Pentium

- 4th generation: PPro, PP2, PP3

- 5th generation: Pentium 4

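** Generations are distinguished by differences in internal structure (microarchitecture).

1st generation

80386

Non-pipelined <= one instruction completes every 5 cycles

2nd generation

486 66MHz

Pipelining added

: one instruction completes every cycle, so it is theoretically 5 times faster than the 1st generation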
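The theoretical 5x figure can be illustrated by counting cycles. The sketch below assumes an ideal 5-stage pipeline with no stalls or hazards (the stage count is taken from the 5 cycles per instruction quoted for the 1st generation); the function names are hypothetical.

```python
def cycles_non_pipelined(n_instructions, stages=5):
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages=5):
    """The first instruction needs `stages` cycles to fill the pipeline;
    after that, one instruction completes every cycle (ideal case)."""
    return stages + (n_instructions - 1)


def speedup(n_instructions, stages=5):
    return cycles_non_pipelined(n_instructions, stages) / cycles_pipelined(n_instructions, stages)


for n in (1, 10, 1_000_000):
    print(n, round(speedup(n), 3))
```

The speedup only approaches 5 for long instruction streams, and real pipelines fall short of it because of stalls and branch penalties.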

A joint effort by Motorola, Apple, and IBM => PowerPC

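3rd generation

Pentium 100MHz

Intel intended to release it as the 586, but since a number could not be protected as a trademark and competitors such as AMD could also use "586", it was named Pentium.

2-Way Superscalar, in-order Pipeline

Superscalar : unlike earlier designs that processed instructions one at a time, a technique that speeds up the processor by running multiple execution units in parallel.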

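4th generation

- Pentium Pro 200MHz

  - Memory (SRAM cache) and the CPU chip mounted together in one package; for servers

  - 3-Way Superscalar Out-of-order Pipeline

  - 32bit VA & 32bit PA

- Pentium 2 300MHz

  - Uses separate SRAM alongside the CPU chip.

- Pentium 3 600MHz

  - The external SRAM is moved inside the CPU chip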

5th generation

Intel® Pentium® 4 Processor

Technology

- 0.13 micron process, 55M transistors, 82W

- 3.2 GHz, 478pin Flip-Chip PGA2

Performance

- 1221 Ispec, 1252 Fspec (floating point; server-type applications) on SPEC 2000

- Relative performance to SUN 300MHz Ultrasparc (100)

- 40% higher clock rate, 10~20% lower IPC compared to P III

Pipeline

- 20-stage out-of-order (OOO) pipeline, hyperthreading

Cache hierarchy

- 12K micro-op trace cache/8 KB on-chip D cache

- On-chip 512KB L2 ATC (Advanced Transfer Cache)

- Optional on-die 2MB L3 Cache

800MHz system bus, 6.4GB/s bandwidth

- Compared with 1.06GB/s on P III 133MHz bus

- Implemented by quad-pumping on 200MHz system bus

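20-stage, 3-way superscalar out-of-order pipeline (3 instructions at a time pass through 20 stages)

** A computer's processing speed depends not only on clock rate but also on IPC and program size (instruction count).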
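The relation between these three factors can be written as execution time = instruction count / (IPC x clock rate). The sketch below is a minimal illustration; the function names are hypothetical, and the Pentium 4 versus Pentium III comparison simply plugs in the "40% higher clock rate, 10~20% lower IPC" figures quoted earlier, assuming the same instruction count on both.

```python
def cpu_time_seconds(instruction_count, ipc, clock_hz):
    """Execution time = instructions / (instructions per cycle * cycles per second)."""
    return instruction_count / (ipc * clock_hz)


def relative_performance(clock_ratio, ipc_ratio, instruction_ratio=1.0):
    """Speed of machine A relative to machine B, given A/B ratios of each factor."""
    return clock_ratio * ipc_ratio / instruction_ratio


# Hypothetical program: one billion instructions at IPC 1.0 on a 1 GHz clock.
print(cpu_time_seconds(1e9, 1.0, 1e9), "s")

# Pentium 4 vs Pentium III: 1.4x clock, 0.8x to 0.9x IPC.
print(round(relative_performance(1.4, 0.8), 2), "to", round(relative_performance(1.4, 0.9), 2))
```

A higher clock rate therefore does not translate one-to-one into performance: the IPC loss eats part of the 40% clock gain.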

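SPEC 2000 : a set of PC applications released in 2000, used as a benchmark suite

** CPU performance is measured with benchmark suites such as Int95 (a set of PC applications in use in 1995) defined by SPEC, a non-profit organization.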
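A SPEC 2000 score is a ratio against a reference machine (the 300MHz Sun UltraSPARC system, scored as 100 in the Pentium 4 notes above), and the per-benchmark ratios are combined with a geometric mean. The sketch below shows the arithmetic only; the function names and every timing value are hypothetical, not measured results.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Per-benchmark score: 100 means 'as fast as the reference machine'."""
    return 100.0 * reference_seconds / measured_seconds


def overall_score(ratios):
    """Geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds.
timings = [(1400, 140), (1800, 120), (1100, 100)]
ratios = [spec_ratio(ref, measured) for ref, measured in timings]
print([round(r) for r in ratios])
print(round(overall_score(ratios)))
```

Read this way, a SPECint ratio of 1221 means the machine runs the integer suite about 12 times faster than the reference system.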

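Hyperthreading : a CPU technique that makes one physical processor behave like two logical processors to raise processing speed. That is, it improves performance by assigning work from another thread to execution units that would otherwise be idle.

QDR(Quad Data Rate) or Quad-Pumping

- A signaling technique that raises the effective system bus speed 4 times

- e.g. actual clock 100MHz, effective clock 400MHz

- The actual clock speed is unchanged, but data is transferred at several points within each cycle to raise the transfer rate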
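The bandwidth figures quoted in these notes follow from clock rate x transfers per clock x bus width. The sketch below reproduces them; the 64-bit width of the Pentium 4 and Pentium III system bus is an assumption not stated in the notes, and the function name is hypothetical.

```python
def bus_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, bus_width_bits):
    """Peak bandwidth = clock rate * transfers per clock * bytes per transfer."""
    return clock_hz * transfers_per_clock * bus_width_bits // 8


# Pentium 4: 200 MHz bus, quad-pumped, assumed 64 bits wide.
p4 = bus_bandwidth_bytes_per_s(200_000_000, 4, 64)

# Pentium III: 133 MHz bus, one transfer per clock, assumed 64 bits wide.
p3 = bus_bandwidth_bytes_per_s(133_000_000, 1, 64)

# Itanium 2: 400 MHz (effective) 128-bit bus, as listed in the notes.
itanium2 = bus_bandwidth_bytes_per_s(400_000_000, 1, 128)

for name, bw in (("Pentium 4", p4), ("Pentium III", p3), ("Itanium 2", itanium2)):
    print(f"{name}: {bw / 1e9:.2f} GB/s")
```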

Intel® Itanium® 2 processor

- Manufactured by Intel, with HP taking the lead in its development

Technology

- 1.5 GHz, 130W

Performance: 1322 Ispec, 2119 Fspec

- 50% higher transaction performance compared to Sun UltraSPARC III Cu processor (4-way MP system)

EPIC architecture (Explicitly Parallel Instruction Computing)

- A technique for building high-performance processors by using parallel processing to reduce bottlenecks.

Pipeline

- 8-stage in-order pipeline (10-stage in Itanium)

- 11 issue ports (9 ports in Itanium)

- 6 INT, 4 MEM, 2 FP, 1 SIMD, 3 BR (4 INT, 2 MEM in Itanium)

Cache hierarchy

- 32KB L1 cache, 256KB L2 cache, and up to 6MB L3 Cache

Memory and System Interface

- 50b Physical Address (used only when accessing DRAM), 64b Virtual Address (program size up to 2^64 bytes)

- 400MHz 128-bit system bus, 6.4GB/s bandwidth (compared to 266MHz 64-bit system bus, 2.1GB/s in Itanium)

Intel® i7 Processor

Technology

- 32nm process, 130W, 239 mm² die

- 3.46 GHz, 64-bit 6-core 12-thread processor (instructions can be fetched from up to 12 programs, i.e. threads, at once)

- 159 Ispec, 103 Fspec on SPEC CPU 2006 (296MHz UltraSparc II processor as a reference machine)

Core microarchitecture

- Next generation multi-core microarchitecture introduced in Q1 2006 (Derived from P6 microarchitecture)

- Optimized for multi-cores and lower power consumption

  - 14-stage 4-issue out-of-order (OOO) pipeline

- 64bit Intel architecture (x86-64)

- Core i3 (entry-level), Core i5 (mainstream consumer), Core i7 (high-end consumer), Xeon (server)

256KB L2 cache/core, 12MB L3 cache

Integrated memory controller

Architecture

Integrated memory controller

- 3 Channel, 3.2GHz clock, 25.6 GB/s memory bandwidth (memory up to 24GB DDR3 SDRAM), 36 bit physical address

QuickPath Interconnect (QPI)

- Point-to-point processor interconnect, replacing the front side bus (FSB)

- 64bit data every two clock cycles, up to 25.6GB/s, which doubles the theoretical bandwidth of 1600MHz FSB

Direct Media Interface (DMI)

- The link between Intel Northbridge and Intel Southbridge, sharing many characteristics with PCI-Express

IOH (Northbridge)

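- The integrated circuit on the CPU-socket side of the motherboard, relative to its center. It generally refers to the system controller and also includes the memory interface, the AGP interface, and so on.

ICH (Southbridge)

- The integrated circuit on the PCI-slot side of the motherboard.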

Sun UltraSPARC T2 processor (“Niagara II”)

Multithreaded multicore technology

- Eight 1.4 GHz cores, 8 threads per core (each core supports 8-way multithreading) → total 64 threads

- 65nm process, 1831 pin BGA (a high pin count because it is a server part), 503M transistors (503 million), 84W power consumption

Core microarchitecture: Two issue 8-stage instruction pipelines

4MB L2 – 8 banks, 64 FB DIMMs, 60+ GB/s memory bandwidth

Sun UltraSPARC T3 processor (“Rainbow Falls”)

- 40nm process, 16 1.65GHz cores, 8 threads per core(8 x 16) → total 128 threads

Trends in Technology

Integrated circuit technology

- Transistor density: 35%/year

- Die size: 10-20%/year

- Integration overall: 40-55%/year (circuit integration density)

DRAM capacity: 25-40%/year (slowing)

Flash capacity: 50-60%/year

- 15-20X cheaper/bit than DRAM

Magnetic disk technology: 40%/year

- 15-25X cheaper/bit than Flash

- 300-500X cheaper/bit than DRAM

SRAM and DRAM are both volatile memory, so data that must persist is stored in flash memory or on a disk drive.

Page 131

Bandwidth and Latency

Bandwidth or throughput

- Total work done in a given time

- 10,000-25,000X improvement for processors

- 300-1200X improvement for memory and disks

Latency or response time

- Time between start and completion of an event

- 30-80X improvement for processors

- 6-8X improvement for memory and disks

Feature size

- Minimum size of transistor or wire in x or y dimension

- 10 microns in 1971 to .032 microns in 2011

- Transistor performance scales linearly

- Integration density scales (more than) quadratically

- However, wire delay scales poorly compared to transistor performance!

- In the past few years, both wire delay and power dissipation have become major design limitations for VLSI design

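Baseline : the improvement factors (e.g. for latency) are multiples relative to the 286, which is taken as 1.

Quad-Core 3.3GHz

3.3 GHz x 4 (instructions per cycle) x 4 (quad core) = 52.8 billion instructions per second, i.e. roughly 50 BIPS = 50,000 MIPS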
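The peak-throughput estimate above multiplies clock rate, instructions issued per cycle, and core count. The sketch below is a minimal check of that arithmetic, assuming the first factor of 4 is the per-core issue width (the notes do not label it); the function name is hypothetical.

```python
def peak_instructions_per_second(clock_hz, issue_width, cores):
    """Upper bound on throughput: every core issues `issue_width` instructions each cycle."""
    return clock_hz * issue_width * cores


peak = peak_instructions_per_second(3_300_000_000, 4, 4)
print(f"{peak / 1e9:.1f} BIPS = {peak / 1e6:,.0f} MIPS")
```

This is a ceiling rather than a measured rate: real IPC stays well below the issue width because of dependencies, cache misses, and branch mispredictions.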

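Latency : the time it takes for an event to be processed

Access Time : the time from a request to access a storage device until the data transfer begins.

Cycle Time : the minimum interval between successive memory accesses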

Dynamic Power

For CMOS chips, traditional dominant energy consumption has been in switching transistors, called dynamic power

For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy

Dropping voltage helps both, so went from 5V to 1V

Capacitive load is a function of number of transistors connected to output and technology determines capacitance of wires and transistors

To save energy & dynamic power, most CPUs now turn off clock of inactive modules (e.g. FPU)

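Ex) Power consumption at 1GHz is roughly 1/8 of that at 2GHz (dynamic power scales with V^2 x f, and the voltage can be lowered together with the frequency), so it is better to use a 1GHz dual core.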
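The factor of 8 in the example above follows from the usual dynamic power relation, P = 1/2 x C x V^2 x f, under the assumption that supply voltage can be scaled down in proportion to frequency. The sketch below uses normalized, hypothetical values for capacitance and voltage; it illustrates the scaling and is not a model of any real chip.

```python
def dynamic_power(capacitive_load, voltage, frequency_hz):
    """Dynamic power of a CMOS chip: P = 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency_hz


# Hypothetical baseline: one core at 2 GHz, normalized C = 1.0 and V = 1.0.
p_2ghz = dynamic_power(1.0, 1.0, 2e9)

# Assumption: halving the frequency allows the voltage to be halved as well.
p_1ghz = dynamic_power(1.0, 0.5, 1e9)

single_core_ratio = p_1ghz / p_2ghz    # one 1 GHz core vs one 2 GHz core
dual_core_ratio = 2 * p_1ghz / p_2ghz  # two 1 GHz cores vs one 2 GHz core

print(single_core_ratio, dual_core_ratio)
```

Under these assumptions two 1GHz cores offer the same peak cycle count per second as one 2GHz core at a quarter of the dynamic power, provided the workload can be split across both cores.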

Page 137

Exercises & Discussion

3.2GHz Pentium4 processor is reported to have SPECint ratio of 1221 and SPECfp ratio of 1252 in SPEC2000 benchmarks. What does this mean?

How much memory can you address using 38 bits of address assuming byte-addressability?

2^38 bytes = 256 GB (the maximum program size)

Classify Intel’s 32bit microprocessors in terms of processor generations from 80386 to Pentium 4. What’s the meaning of generation here?

The generations are divided because each one has a different internal structure (microarchitecture).

Assume two processors, one RISC and one CISC implemented at the same clock speed and the same IPC. Which one performs better?

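RISC usually has the advantage in practice, but with the same clock speed and the same IPC, CISC is faster because each instruction does more work and the program needs fewer instructions.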

[Glossary]

Microprocessor : a single chip processor

- Pentium IV, AMD Athlon, SUN Ultrasparc, ARM, MIPS,...

ISA (Instruction Set Architecture) <= to design a CPU, the instructions must be defined first.

- x86(IA32) : 386 ~ Pentium3, 4 // CISC

- IA64 : Itanium, Itanium2 // RISC-like (EPIC)

- Others : PowerPC, SPARC, MIPS, ARM

Micro architecture

- Implementation : implement according to the ISA

- Pipelining, caches, branch prediction, buffers

- Invisible to programmers

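Q) The basic data size is 32 bits, but recent CPUs use 64 bits. Why?

- Data operations are faster. A memory address (pointer) is also data.

- Virtual Address : 2^50 bytes, the size of a program

- Physical Address : 2^40 bytes, the size of main memory

- 2^32 bytes = 4 GB (the limit of a 32-bit address)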

Bus

The electrical pathway inside a computer that the CPU, main memory, and I/O devices share for transferring information. Buses are classified by how much data they can handle at a time into the ISA bus, EISA bus, VESA bus, PCI bus, and so on.

ILP(Instruction Level Parallelism)

TLP(Thread Level Parallelism)

SRAM(Static Random Access Memory)

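A type of RAM (random access memory) whose storage cells are flip-flops.

It keeps its contents only while power is supplied.

DRAM(Dynamic Random Access Memory)

DRAM is a type of RAM whose stored information fades over time, so it must be refreshed periodically. Its simple structure makes dense integration easy, so it is used as large-capacity temporary storage, but it is slow.

- Cannot hold its current state even while powered, unless it is refreshed.

- RAM that needs refresh

- Data is represented in units of bits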

UNIVERSAL GATE ?

A gate, such as NAND or NOR, that alone suffices to build any other gate (the JK-FF plays the same role among flip-flops).
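The universality of NAND can be demonstrated by building the other basic gates from it alone. The sketch below models gates as functions on 0/1 integers; the function names are chosen for illustration.

```python
def nand(a, b):
    """The only primitive: output is 0 only when both inputs are 1."""
    return 1 - (a & b)


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    # De Morgan: a OR b == NOT(NOT a AND NOT b)
    return nand(not_(a), not_(b))


def xor_(a, b):
    # Classic four-NAND construction of XOR.
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_(a, b), or_(a, b), xor_(a, b))
```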

[DAY 01-2]

Page 4

Compiler : the compiler can affect the computer system.

Page 5

Compiler Phases

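I.C. (Intermediate Code) :

- Internal code generated while the compiler translates a program written in the source language into object code.

- Using intermediate code lets the translation stages be organized as fine-grained modules, and the intermediate codes used at each stage generally take different forms.

Syntax : the set of rules (grammar) needed to form well-defined expressions or statements in a language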

Semantic

Machine States of Computer

- Registers :

- Memory :

Page 7

Machine Instruction

Opcode(Operation code) : a mnemonic symbol for an operation, e.g. ADD for addition or SQR for square root

Operands

Ex) Actual operation : ADD R1 <- R2, R3

Page 8

Single Cycle Implementation - Arithmetic Instruction

http://talkingaboutme.tistory.com/488

Single Cycle Implementation – Memory Access Instruction

http://talkingaboutme.tistory.com/489

MIPS registers

http://parkjunehyun.tistory.com/entry/MIPS-레지스터

MIPS instructions - R-type

http://parkjunehyun.tistory.com/106
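To connect the opcode/operand description with the MIPS R-type format referenced in the links above, the sketch below packs `add $t0, $t1, $t2` into a 32-bit word. The field layout (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6) and the register numbers are the standard MIPS32 ones; the helper name `encode_r_type` is made up for illustration.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack R-type fields into one 32-bit instruction word:
    opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# Standard MIPS register numbers for $t0, $t1, $t2.
T0, T1, T2 = 8, 9, 10
ADD_FUNCT = 0x20  # funct code of the `add` instruction

# add $t0, $t1, $t2  means  $t0 = $t1 + $t2, so rd=$t0, rs=$t1, rt=$t2.
word = encode_r_type(rs=T1, rt=T2, rd=T0, shamt=0, funct=ADD_FUNCT)
print(f"{word:#010x}")
```

Note that the destination register comes first in the assembly syntax but sits in the third register field of the encoded word.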

Page 11

Stack frame (activation record) of a procedure

- Store variables local to a procedure

  - Procedure’s saved registers (arguments, return address, saved registers, local variables)

  - Stack pointer : points to the top of the stack (topmost position)

  - Frame pointer : points to the first word of the stack frame (bottommost position)

License: Attribution, NonCommercial, NoDerivatives

돌마우스의 웨어하우스
