“Evolution of Computer Architecture”
Prof. Mateo Valero, Director
Valladolid, September 2010
Technological Achievements
● Transistor (Bell Labs, 1947)
● DEC PDP-1 (1959)
● IBM 7090 (1960)
● Integrated circuit (1958)
● IBM System/360 (1965)
● DEC PDP-8 (1965)
● Microprocessor: Intel 4004 (1971)
Pipeline (H. Ford)
Technology Trends
Power Density
[Chart: power density in W/cm² (log scale, 1–1000) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, compared with a hot plate, a nuclear reactor and a rocket nozzle]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
Technology Outlook

High Volume Manufacturing     2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)            90     65     45     32     22     16     11      8
Integration Capacity (BT)        2      4      8     16     32     64    128    256
Delay = CV/I scaling           0.7   ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling      >0.35   >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS              High Probability  →  Low Probability
Alternate, 3G etc.            Low Probability   →  High Probability
Variability                   Medium  →  High  →  Very High
ILD (K)                         ~3     <3    (reduce slowly towards 2–2.5)
RC Delay                         1      1      1      1      1      1      1      1
Metal Layers                   6-7    7-8    8-9   (0.5 to 1 layer per generation)

Shekhar Borkar, Micro37
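The two scaling rows can be read through the standard first-order expressions for gate delay and switching energy (a sketch of the textbook model, not taken from Borkar's slide):

$$\tau_{gate} \approx \frac{C\,V_{dd}}{I_{on}}, \qquad E_{op} \approx C\,V_{dd}^{2}$$

Under ideal Dennard scaling, C, V_dd and I_on all shrink by about 0.7× per generation, giving roughly 0.7× delay and 0.35× energy per operation; once V_dd can no longer be lowered (to preserve noise margins and contain leakage), both ratios degrade, which is what the ">0.7" and ">0.5" entries anticipate.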
Lower voltage, increase clock rate & transistor density
We have seen increasing numbers of gates on a chip and increasing clock speeds.
Heat is becoming an unmanageable problem; Intel processors already exceed 100 watts.
We will not see dramatic increases in clock speeds in the future.
However, the number of gates on a chip will continue to increase.
Rather than packing more gates into a single core and shrinking its cycle time, the additional transistors now go into more cores.
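The first-order CMOS power model behind this shift (a sketch; the activity factor α, switched capacitance C, supply voltage V_dd and clock frequency f are the usual textbook quantities, not figures from the slide):

$$P_{dyn} \approx \alpha\, C\, V_{dd}^{2}\, f$$

Since the achievable frequency falls roughly in proportion to V_dd, raising f also forces V_dd up and power grows roughly as f³, whereas spending the same transistors on an extra core at unchanged frequency roughly doubles throughput for roughly double the power.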
[Diagram: a single core with its cache evolves into dual-core, quad-core and clustered many-core chips sharing caches]
Increasing chip performance: Intel's Petaflop chip
ICPP-2009, September 23rd 2009
● 80 processors in a die of 300 square mm
● Terabytes per second of memory bandwidth
● Note: the Teraflops barrier was first broken by Intel in 1996, using about 10,000 Pentium Pro processors housed in more than 85 cabinets occupying 200 square meters
● This will be possible in 3 years from now
Thanks to Intel
NVIDIA Fermi Architecture
16 Streaming Multiprocessors (512 cores) execute Thread Blocks
Unified 768 KB L2 cache serves all threads
GigaThread hardware scheduler assigns Thread Blocks to SMs
Wide DRAM interface provides 12 GB/s bandwidth
620 Gigaflops
Cell Broadband Engine™: A Heterogeneous Multi-core Architecture
* Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.
Intel/UPC
Since 2002 (Roger Espasa, Toni Juan)
40 people
Microprocessor development (Larrabee x86 many-core)
Top10
Looking at the Gordon Bell Prize
● 1 GFlop/s; 1988; Cray Y-MP; 8 processors
  ● Static finite element analysis
● 1 TFlop/s; 1998; Cray T3E; 1,024 processors
  ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
● 1 PFlop/s; 2008; Cray XT5; 1.5×10⁵ processors
  ● Superconductive materials
● 1 EFlop/s; ~2018; ?; 1×10⁷ processors (10⁹ threads)
Jack Dongarra
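A quick sanity check of the pace these milestones imply (a minimal Python sketch; the 1000× steps and 10-year spacing come from the list above, the rest is plain arithmetic):

```python
import math

# Gordon Bell milestones: 1 GF/s (1988) -> 1 TF/s (1998) -> 1 PF/s (2008) -> ~1 EF/s (~2018),
# i.e. a factor of 1000 every ~10 years.
step_factor = 1_000
years_per_step = 10

annual_growth = step_factor ** (1 / years_per_step)      # ~2.0x per year
doubling_time = math.log(2) / math.log(annual_growth)    # ~1.0 year

print(f"implied growth: ~{annual_growth:.2f}x per year")
print(f"implied doubling time: ~{doubling_time:.1f} years")
```

Roughly doubling every year is faster than transistor density alone can deliver; the difference comes from parallelism, as the processor counts above grow from 8 to about 10⁷ over the same period.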
BSC-CNS and international initiatives: IESP
Build an international plan for developing the next-generation open source software for scientific high-performance computing.
Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.
1 EFlop/s “Clean Sheet of Paper” Strawman
• 4 FPUs + RegFiles / core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
• 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes / rack
• 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores
• 680 MILLION FPUs
• 3.6 PB = 0.0036 bytes/flop
• 68 MW with aggressive assumptions
Sizing done by “balancing” power budgets with achievable capabilities
Largely due to Bill Dally
Courtesy of Peter Kogge, UND
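The strawman figures compose multiplicatively, so they are easy to cross-check; a minimal Python sketch (every input below is a number quoted on the slide):

```python
# Back-of-the-envelope check of the exascale strawman sizing quoted above.
fpus_per_core   = 4
clock_ghz       = 1.5
gflops_per_core = fpus_per_core * clock_ghz                    # 6 GF/s per core

cores_per_chip  = 742
chip_tflops     = cores_per_chip * gflops_per_core / 1e3       # ~4.5 TF/s per chip (= node)

nodes_per_group = 12
groups_per_rack = 32
racks           = 583

group_tflops  = nodes_per_group * chip_tflops                  # ~54 TF/s
rack_pflops   = groups_per_rack * group_tflops / 1e3           # ~1.7 PF/s
system_eflops = racks * rack_pflops / 1e3                      # ~1.0 EF/s

total_cores  = cores_per_chip * nodes_per_group * groups_per_rack * racks  # ~166 million
total_mem_pb = 16 * 384 * racks / 1e6        # 16 GB/node, 384 nodes/rack -> ~3.6 PB
bytes_per_flop = total_mem_pb * 1e15 / (system_eflops * 1e18)  # ~0.0036

print(f"{system_eflops:.2f} EF/s, {total_cores/1e6:.0f} M cores, "
      f"{total_mem_pb:.1f} PB, {bytes_per_flop:.4f} bytes/flop")
```

The output lands on roughly 1 EF/s, 166 million cores, 3.6 PB and 0.0036 bytes/flop, matching the slide.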
Education for Parallel Programming
[Cartoon: a “multicore-based pacifier”; captions read “I … multi-core programming”, “I … many-core programming”, “We all … massive parallel prog.”, “I … games”]
Navigating the Mare Nostrum
Initial developments
Mechanical machines
1854: Boolean algebra by G. Boole
1904: Diode vacuum tube by J.A. Fleming
1938: Boolean Algebra & Electronic Switches, C. Shannon
1946: ENIAC by J.P. Eckert and J. Mauchly
1945: Stored program by J. von Neumann ??????
1947 : First transistor (Bell Labs)
1949: EDSAC by M. Wilkes
1951: UNIVAC I
1952: IBM 701
In 50 Years ...
ENIAC, Eckert & Mauchly, 1946 ... 18,000 vacuum tubes
Pentium III playing DVD, 1998 ... 24 M transistors
Technology Trends: Microprocessor Capacity
Moore's Law: 2× transistors per chip every 1.5 years
● Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
● Microprocessors have become smaller, denser, and more powerful.
● Not just processors: bandwidth, storage, etc.
Computer Architecture Achievements
• 1951 : Microprogramming (M. Wilkes)
• 1962 : Virtual Memory (Atlas, Manchester)
• 1964 : Pipeline (CDC 6600, S. Cray, 10 Mflop/s)
• 1965 : Cache memory (M. Wilkes)
• 1975 : Vector processors (S. Cray)
• 1980 : RISC architecture (IBM, Berkeley, Stanford)
• 1982 : Multiprocessors with distributed memory
• 1990 : Superscalar processors: PA-RISC (HP) and RS/6000 (IBM)
• 1991 : Multiprocessors with distributed shared memory
• 1994 : SMT (M. Nemirovsky, D. Tullsen, S. Eggers)
• 1994 : Speculative Multiprocessors (G. Sohi, Wisconsin)
• 1996 : Value Prediction (J. P. Shen and M. Lipasti, CMU)
• 2000: Multicore/Manycore Architectures
Virtual Worlds have huge potential beyond Games
Commerce & Advertising
Corporate
Education
First Responders
Government
Health
Military
Science
Community Facilitation
Social Change
Jaguar @ ORNL: 1.75 PF/s
● Cray XT5-HE system
● About 37,500 six-core AMD Opteron processors running at 2.6 GHz; 224,162 cores
● Power: 6.95 MW
● 300 terabytes of memory
● 10 petabytes of disk space
● 240 gigabytes per second of disk bandwidth
● Cray's SeaStar2+ interconnect network
Jack Dongarra
MareIncognito: Project structure
● Applications: 4 relevant apps (Materials: SIESTA; Geophysics imaging: RTM; Comp. Mechanics: ALYA; Plasma: EUTERPE) plus general kernels
● Programming models: StarSs (CellSs, SMPSs), OpenMP@Cell, OpenMP++, MPI + OpenMP/StarSs
● Performance analysis tools: automatic analysis, coarse/fine-grain prediction, sampling, clustering, integration with Peekperf
● Load balancing: coordinated scheduling (run time, process, job), power efficiency
● Interconnect: contention, collectives, overlap of computation and communication, slimmed networks, direct versus indirect networks
● Processor and node: contribution to the new Cell design, support for the programming model, load balancing and performance tools, issues for future processors
● Models and prototype
BSC-CNS: the backbone of supercomputing research in Spain
● Supercomputing and eScience
● 22 elite research groups
● More than 120 senior researchers
● More than 300 PhD students
Application scopes: Earth Sciences, Astrophysics, Engineering, Physics, Life Sciences
Research areas: compilers and tuning of application kernels; programming models and performance tuning tools; architectures and hardware technologies
High Performance Computing as key enabler
[Chart, courtesy of Airbus France: from 1980 to 2030, available computational capacity grows from Giga (10⁹) through Tera (10¹²), Peta (10¹⁵) and Exa (10¹⁸) towards Zeta (10²¹) Flop/s, while capacity in overnight loads cases run grows from 10² to 10⁶. The capability achieved during one night batch progresses from RANS low speed, RANS high speed and the HS design data set, through unsteady RANS, CFD-based loads & HQ, aero optimisation & CFD-CSM and full MDO, towards LES, CFD-based noise simulation and real-time CFD-based in-flight simulation. "Smart" use of HPC power: algorithms, data mining, knowledge.]
Design of ITER
TOKAMAK (JET, Oxford)
Supercomputing, theory and experimentation
Courtesy of IBM
Weather, Climate and Earth Sciences: Roadmap

2009: resolution 80 km
● Memory: ≈ 110 GB; storage: ≈ 8 TB
● FLOPS: ≈ 3×10¹⁴ (NEC SX-9, 48 vector processors: ≈ 40-day run)

2015: resolution 20 km
● Memory: ≈ 3.5 TB; storage: ≈ 180 TB
● FLOPS: ≈ 1×10¹⁶
● High-resolution model with complete carbon cycle model
● Challenges: data visualization and post-processing, data discovery, archiving

2020: resolution 1 km
● Memory: ≈ 4 PB; storage: ≈ 150 PB
● FLOPS: ≈ 1×10¹⁹
● Higher resolution with global cloud-resolving model
● Challenges: data sharing, transfer, memory management, I/O management
Education for Parallel Programming
[Cartoon: a “multicore-based pacifier”; captions read “I … multi-core programming”, “I … many-core programming”, “We all … massive parallel prog.”, “I … games”]
Navigating the Mare Nostrum