“Evolution of Computer Architecture”
Prof. Mateo Valero, Director
Valladolid, September 2010
Technological Achievements
● Transistor (Bell Labs, 1947)
● DEC PDP-1 (1959)
● IBM 7090 (1960)
● Integrated circuit (1958)
● IBM System/360 (1965)
● DEC PDP-8 (1965)
● Microprocessor: Intel 4004 (1971)
Pipeline (H. Ford)
Technology Trends
Power Density
[Chart: power density in W/cm² (log scale, 1–1000) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, compared with a hot plate, a nuclear reactor and a rocket nozzle]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
Technology Outlook

High Volume Manufacturing     2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)            90     65     45     32     22     16     11      8
Integration Capacity (BT)        2      4      8     16     32     64    128    256
Delay = CV/I scaling           0.7   ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling      >0.35   >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS              High Probability  →  Low Probability
Alternate, 3G etc.            Low Probability   →  High Probability
Variability                   Medium  →  High  →  Very High
ILD (K)                         ~3     <3    (reduce slowly towards 2–2.5)
RC Delay                         1      1      1      1      1      1      1      1
Metal Layers                   6-7    7-8    8-9   (0.5 to 1 layer per generation)

Shekhar Borkar, Micro37
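The two scaling rows can be read through the standard first-order expressions for gate delay and switching energy (a sketch of the textbook model, not taken from Borkar's slide):

$$\tau_{gate} \approx \frac{C\,V_{dd}}{I_{on}}, \qquad E_{op} \approx C\,V_{dd}^{2}$$

Under ideal Dennard scaling, C, V_dd and I_on all shrink by about 0.7× per generation, giving roughly 0.7× delay and 0.35× energy per operation; once V_dd can no longer be lowered (to preserve noise margins and contain leakage), both ratios degrade, which is what the ">0.7" and ">0.5" entries anticipate.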
Lower voltage, increase clock rate & transistor density
We have seen increasing numbers of gates on a chip and increasing clock speeds.
Heat is becoming an unmanageable problem; Intel processors already exceed 100 watts.
We will not see dramatic increases in clock speeds in the future.
However, the number of gates on a chip will continue to increase.
Rather than packing more gates into a single core and shrinking its cycle time, the additional transistors now go into more cores.
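The first-order CMOS power model behind this shift (a sketch; the activity factor α, switched capacitance C, supply voltage V_dd and clock frequency f are the usual textbook quantities, not figures from the slide):

$$P_{dyn} \approx \alpha\, C\, V_{dd}^{2}\, f$$

Since the achievable frequency falls roughly in proportion to V_dd, raising f also forces V_dd up and power grows roughly as f³, whereas spending the same transistors on an extra core at unchanged frequency roughly doubles throughput for roughly double the power.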
[Diagram: a single core with its cache evolves into dual-core, quad-core and clustered many-core chips sharing caches]
Increasing chip performance: Intel's Petaflop chip
ICPP-2009, September 23rd 2009
● 80 processors in a die of 300 square mm
● Terabytes per second of memory bandwidth
● Note: the Teraflops barrier was first broken by Intel in 1996, using about 10,000 Pentium Pro processors housed in more than 85 cabinets occupying 200 square meters
● This will be possible in 3 years from now
Thanks to Intel
NVIDIA Fermi Architecture
16 Streaming Multiprocessors (512 cores) execute Thread Blocks
Unified 768 KB L2 cache serves all threads
GigaThread hardware scheduler assigns Thread Blocks to SMs
Wide DRAM interface provides 12 GB/s bandwidth
620 Gigaflops
Cell Broadband Engine™: A Heterogeneous Multi-core Architecture
* Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.
Intel/UPC
Since 2002 (Roger Espasa, Toni Juan)
40 people
Microprocessor development (Larrabee x86 many-core)
Top10
Looking at the Gordon Bell Prize
● 1 GFlop/s; 1988; Cray Y-MP; 8 processors
  ● Static finite element analysis
● 1 TFlop/s; 1998; Cray T3E; 1,024 processors
  ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
● 1 PFlop/s; 2008; Cray XT5; 1.5×10⁵ processors
  ● Superconductive materials
● 1 EFlop/s; ~2018; ?; 1×10⁷ processors (10⁹ threads)
Jack Dongarra
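A quick sanity check of the pace these milestones imply (a minimal Python sketch; the 1000× steps and 10-year spacing come from the list above, the rest is plain arithmetic):

```python
import math

# Gordon Bell milestones: 1 GF/s (1988) -> 1 TF/s (1998) -> 1 PF/s (2008) -> ~1 EF/s (~2018),
# i.e. a factor of 1000 every ~10 years.
step_factor = 1_000
years_per_step = 10

annual_growth = step_factor ** (1 / years_per_step)      # ~2.0x per year
doubling_time = math.log(2) / math.log(annual_growth)    # ~1.0 year

print(f"implied growth: ~{annual_growth:.2f}x per year")
print(f"implied doubling time: ~{doubling_time:.1f} years")
```

Roughly doubling every year is faster than transistor density alone can deliver; the difference comes from parallelism, as the processor counts above grow from 8 to about 10⁷ over the same period.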
BSC-CNS and international initiatives: IESP
Build an international plan for developing the next-generation open source software for scientific high-performance computing.
Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.
1 EFlop/s “Clean Sheet of Paper” Strawman
• 4 FPUs + RegFiles / core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
• 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes / rack
• 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores
• 680 MILLION FPUs
• 3.6 PB = 0.0036 bytes/flop
• 68 MW with aggressive assumptions
Sizing done by “balancing” power budgets with achievable capabilities
Largely due to Bill Dally
Courtesy of Peter Kogge, UND
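The strawman figures compose multiplicatively, so they are easy to cross-check; a minimal Python sketch (every input below is a number quoted on the slide):

```python
# Back-of-the-envelope check of the exascale strawman sizing quoted above.
fpus_per_core   = 4
clock_ghz       = 1.5
gflops_per_core = fpus_per_core * clock_ghz                    # 6 GF/s per core

cores_per_chip  = 742
chip_tflops     = cores_per_chip * gflops_per_core / 1e3       # ~4.5 TF/s per chip (= node)

nodes_per_group = 12
groups_per_rack = 32
racks           = 583

group_tflops  = nodes_per_group * chip_tflops                  # ~54 TF/s
rack_pflops   = groups_per_rack * group_tflops / 1e3           # ~1.7 PF/s
system_eflops = racks * rack_pflops / 1e3                      # ~1.0 EF/s

total_cores  = cores_per_chip * nodes_per_group * groups_per_rack * racks  # ~166 million
total_mem_pb = 16 * 384 * racks / 1e6        # 16 GB/node, 384 nodes/rack -> ~3.6 PB
bytes_per_flop = total_mem_pb * 1e15 / (system_eflops * 1e18)  # ~0.0036

print(f"{system_eflops:.2f} EF/s, {total_cores/1e6:.0f} M cores, "
      f"{total_mem_pb:.1f} PB, {bytes_per_flop:.4f} bytes/flop")
```

The output lands on roughly 1 EF/s, 166 million cores, 3.6 PB and 0.0036 bytes/flop, matching the slide.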
Education for Parallel Programming
[Cartoon: a “multicore-based pacifier”; captions read “I … multi-core programming”, “I … many-core programming”, “We all … massive parallel prog.”, “I … games”]
Navigating the Mare Nostrum
Initial developments
Mechanical machines
1854: Boolean algebra by G. Boole
1904: Diode vacuum tube by J.A. Fleming
1938: Boolean Algebra & Electronic Switches, C. Shannon
1946: ENIAC by J.P. Eckert and J. Mauchly
1945: Stored program by J. von Neumann ??????
1947 : First transistor (Bell Labs)
1949: EDSAC by M. Wilkes
1951: UNIVAC I
1952: IBM 701
In 50 Years ...
ENIAC, Eckert & Mauchly, 1946 ... 18,000 vacuum tubes
Pentium III playing DVD, 1998 ... 24 M transistors
Technology Trends: Microprocessor Capacity
Moore's Law: 2× transistors per chip every 1.5 years
● Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
● Microprocessors have become smaller, denser, and more powerful.
● Not just processors: bandwidth, storage, etc.
Computer Architecture Achievements
• 1951 : Microprogramming (M. Wilkes)
• 1962 : Virtual Memory (Atlas, Manchester)
• 1964 : Pipeline (CDC 6600, S. Cray, 10 Mflop/s)
• 1965 : Cache memory (M. Wilkes)
• 1975 : Vector processors (S. Cray)
• 1980 : RISC architecture (IBM, Berkeley, Stanford)
• 1982 : Multiprocessors with distributed memory
• 1990 : Superscalar processors: PA-RISC (HP) and RS/6000 (IBM)
• 1991 : Multiprocessors with distributed shared memory
• 1994 : SMT (M. Nemirovsky, D. Tullsen, S. Eggers)
• 1994 : Speculative Multiprocessors (G. Sohi, Wisconsin)
• 1996 : Value Prediction (J. P. Shen and M. Lipasti, CMU)
• 2000: Multicore/Manycore Architectures
Virtual Worlds have huge potential beyond Games
Commerce & Advertising
Corporate
Education
First Responders
Government
Health
Military
Science
Community Facilitation
Social Change
Jaguar @ ORNL: 1.75 PF/s
● Cray XT5-HE system
● About 37,500 six-core AMD Opteron processors running at 2.6 GHz; 224,162 cores
● Power: 6.95 MW
● 300 terabytes of memory
● 10 petabytes of disk space
● 240 gigabytes per second of disk bandwidth
● Cray's SeaStar2+ interconnect network
Jack Dongarra
MareIncognito: Project structure
● Applications: 4 relevant apps (Materials: SIESTA; Geophysics imaging: RTM; Comp. Mechanics: ALYA; Plasma: EUTERPE) plus general kernels
● Programming models: StarSs (CellSs, SMPSs), OpenMP@Cell, OpenMP++, MPI + OpenMP/StarSs
● Performance analysis tools: automatic analysis, coarse/fine-grain prediction, sampling, clustering, integration with Peekperf
● Load balancing: coordinated scheduling (run time, process, job), power efficiency
● Interconnect: contention, collectives, overlap of computation and communication, slimmed networks, direct versus indirect networks
● Processor and node: contribution to the new Cell design, support for the programming model, load balancing and performance tools, issues for future processors
● Models and prototype
BSC-CNS: the backbone of supercomputing research in Spain
● Supercomputing and eScience
● 22 elite research groups
● More than 120 senior researchers
● More than 300 PhD students
Application scopes: Earth Sciences, Astrophysics, Engineering, Physics, Life Sciences
Research areas: compilers and tuning of application kernels; programming models and performance tuning tools; architectures and hardware technologies
High Performance Computing as key enabler
[Chart, courtesy of Airbus France: from 1980 to 2030, available computational capacity grows from Giga (10⁹) through Tera (10¹²), Peta (10¹⁵) and Exa (10¹⁸) towards Zeta (10²¹) Flop/s, while capacity in overnight loads cases run grows from 10² to 10⁶. The capability achieved during one night batch progresses from RANS low speed, RANS high speed and the HS design data set, through unsteady RANS, CFD-based loads & HQ, aero optimisation & CFD-CSM and full MDO, towards LES, CFD-based noise simulation and real-time CFD-based in-flight simulation. "Smart" use of HPC power: algorithms, data mining, knowledge.]
Design of ITER
TOKAMAK (JET, Oxford)
Supercomputing, theory and experimentation
Courtesy of IBM
Weather, Climate and Earth Sciences: Roadmap

2009: resolution 80 km
● Memory: ≈ 110 GB; storage: ≈ 8 TB
● FLOPS: ≈ 3×10¹⁴ (NEC SX-9, 48 vector processors: ≈ 40-day run)

2015: resolution 20 km
● Memory: ≈ 3.5 TB; storage: ≈ 180 TB
● FLOPS: ≈ 1×10¹⁶
● High-resolution model with complete carbon cycle model
● Challenges: data visualization and post-processing, data discovery, archiving

2020: resolution 1 km
● Memory: ≈ 4 PB; storage: ≈ 150 PB
● FLOPS: ≈ 1×10¹⁹
● Higher resolution with global cloud-resolving model
● Challenges: data sharing, transfer, memory management, I/O management
Education for Parallel Programming
[Cartoon: a “multicore-based pacifier”; captions read “I … multi-core programming”, “I … many-core programming”, “We all … massive parallel prog.”, “I … games”]
Navigating the Mare Nostrum