# **HiPEAC vision 2015**

# "The End of the World As We Know It"



# What is **HiPEAC**?

- HiPEAC is a European Network of Excellence on High Performance and Embedded Architecture and Compilation
- Created in 2004, HiPEAC gathers over 370 leading European academic and industrial computing system researchers from nearly 140 universities and 70 companies in one virtual centre of excellence of 1500 researchers.



Coordinator: Prof. Koen De Bosschere (UGent)



# **HiPEAC** mission:

HiPEAC encourages computing innovation in Europe by providing:

- Collaboration grants, internships, sabbaticals, the semi-annual computing systems week,
- The ACACES summer school, the yearly HiPEAC conference.



# The **HiPEAC** Vision 2015

- Electronic and paper version available now
- Paper version:
  - Sent to the members with the newsletter
  - Was available at HiPEAC 2015 conference

**Editors:** Marc Duranton (FR-CEA), Koen De Bosschere (BE-U Gent), Albert Cohen (FR-INRIA), Jonas Maebe (BE-U Gent), Harm Munk (NL-ASTRON)



#### **Glimpse into the HiPEAC Vision 2015**

For the first time, we have noticed that the community is really *starting to look for disruptive solutions*, and that incrementally improving current technologies is considered inadequate to address the challenges that the computing community faces:

"The End of the World As We Know It"



### **Structure of the HiPEAC vision 2015**








#### Moore's law: increase in transistor density



Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović



## The end of Dennard Scaling

| Parameter<br>(scale factor = a) | Classic<br>Scaling | Current<br>Scaling |
|---------------------------------|--------------------|--------------------|
| Dimensions                      | 1/a                | 1/a                |
| Voltage                         | 1/a                | 1                  |
| Current                         | 1/a                | 1/a                |
| Capacitance                     | 1/a                | >1/a               |
| Power/Circuit                   | 1/a²               | 1/a                |
| Power Density                   | 1                  | a                  |
| Delay/Circuit                   | 1/a                | ~1                 |

Source: Krisztián Flautner "From niche to mainstream: can critical systems make the transition?"
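The Power/Circuit and Power Density rows of the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes power per circuit is voltage times current, that circuit area scales as the square of the dimensions, and that the supply voltage stays roughly constant in the current-scaling regime. The names `classic`, `current_scaling`, and `Scaled` are hypothetical helpers, not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Scaled:
    """Relative values after shrinking by a scale factor a (1.0 = unscaled)."""
    voltage: float
    current: float
    area: float  # circuit area, proportional to dimensions squared


def classic(a: float) -> Scaled:
    # Classic (Dennard) scaling: dimensions, voltage and current all scale as 1/a.
    return Scaled(voltage=1 / a, current=1 / a, area=1 / a**2)


def current_scaling(a: float) -> Scaled:
    # Assumption: voltage no longer scales down, while dimensions and current still do.
    return Scaled(voltage=1.0, current=1 / a, area=1 / a**2)


def power_per_circuit(s: Scaled) -> float:
    return s.voltage * s.current


def power_density(s: Scaled) -> float:
    return power_per_circuit(s) / s.area


if __name__ == "__main__":
    a = 2.0
    for name, s in (("classic", classic(a)), ("current", current_scaling(a))):
        print(name, power_per_circuit(s), power_density(s))
```

With a = 2, classic scaling keeps power density constant, while under the constant-voltage assumption it doubles with each such shrink.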



#### Limited frequency increase ⇒ more cores



Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović

#### Limitation by power density and dissipation



Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović

# Why use several compute cores?

1. Using several cores is also an answer to the law of diminishing returns (Pollack's Rule):
   - Effectiveness per transistor decreases when the size of a single core is increased, due to the locality of computation
   - The cost of controlling a larger core and transporting data across it grows super-linearly
   - Smaller cores are more efficient in ops/mm<sup>2</sup>/W
2. A large part of the area of today's microprocessors is devoted to best-effort processing and to coping with unpredictability (branch prediction, reordering buffers, instructions, caches).
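Pollack's Rule is usually stated as: single-core performance grows roughly with the square root of the core's area (or complexity). The sketch below compares one large core with several smaller cores on the same silicon budget. It is a deliberately idealized model: it assumes a perfectly parallel workload and ignores interconnect and memory effects, and the function names are hypothetical.

```python
import math


def core_performance(area: float) -> float:
    """Pollack's Rule (approximation): performance ~ sqrt(area)."""
    return math.sqrt(area)


def chip_throughput(total_area: float, n_cores: int) -> float:
    """Aggregate throughput of n equal cores sharing total_area.

    Assumes the workload parallelizes perfectly across all cores.
    """
    return n_cores * core_performance(total_area / n_cores)


if __name__ == "__main__":
    budget = 16.0  # arbitrary area units, chosen for illustration
    for n in (1, 4, 16):
        print(n, chip_throughput(budget, n))
```

Under these assumptions, splitting the same area into 4 cores doubles the aggregate throughput of a single large core. Real workloads with serial sections will gain less.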



## Less than 20% of the area for execution units




Source: Dan Connors, "OpenCL and CUDA Programming for Multicore and GPU Architectures", ACACES 2011

#### **Stagnation of performance over the last few years**



# Power limits the active silicon area => more efficient specialized units





#### **Specialization leads to more efficiency**



Source: Bill Dally (NVIDIA), "Challenges for Future Computing Systems", HiPEAC conference 2015



# **Potential** other optimizations

$P_{\text{per unit}} = C V^2 f + T_{sc} V I_{peak} + V I_{leak}$

Average power, peak power, power density, energy-delay, ...
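The power expression above can be evaluated term by term to see why voltage and frequency scaling is such an effective lever. The sketch below follows the three terms as written on the slide. The parameter names and all numeric values are placeholders chosen for illustration, not measurements.

```python
def power_per_unit(c: float, v: float, f: float,
                   t_sc: float, i_peak: float, i_leak: float) -> float:
    """Evaluate P = C*V^2*f + T_sc*V*I_peak + V*I_leak."""
    dynamic = c * v**2 * f              # switching power
    short_circuit = t_sc * v * i_peak   # short-circuit term
    leakage = v * i_leak                # static leakage
    return dynamic + short_circuit + leakage


if __name__ == "__main__":
    # Hypothetical, unit-free numbers: only the ratio between the two runs matters.
    base = power_per_unit(c=1.0, v=1.0, f=1.0, t_sc=0.0, i_peak=0.0, i_leak=0.0)
    scaled = power_per_unit(c=1.0, v=0.8, f=0.8, t_sc=0.0, i_peak=0.0, i_leak=0.0)
    print(scaled / base)  # the dynamic term scales with V^2 * f
```

Lowering both voltage and frequency by 20% roughly halves the dynamic term, since it scales with the cube of the common factor (0.8³ ≈ 0.51). The leakage term only scales linearly with voltage.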

#### CIRCUITS

- Voltage scaling/islands
- Clock gating/routing: clock-tree distribution, half-swing clocks
- Redesigned latches/flip-flops: pin-ordering, gate restructuring, topology restructuring, balanced delay paths, optimized bit transactions
- Redesigned memory cells: low-power SRAM cells, reduced bit-line swing, multi-Vt, bit line/word line isolation/segmentation
- Other optimizations: transistor resizing, GALS, low-power logic

#### ARCHITECTURE

- Voltage/freq scaling
- Gating: pipeline, clock, functional units, branch prediction, data path
- Split instruction windows
- SMT thread throttling
- Bank partitioning
- Cache redesign: sequential, MRU, hash-rehash, column-associative, filter cache, subbanking, divided word line, block buffers, multi-divided module, scratch
- Low-power states
- DRAM refresh-control
- Switching control: Gray, bus-invert, address-increment
- Code compression
- Data packing/buffering

#### COMPILER, OS, APPLICATION

- Switching control: register relabeling, operand swapping, instruction scheduling
- Memory access reduction: locality optimizations, register allocation
- Power-mode control: CPU/resource schedule
- Memory/disk control: disk spinning, page allocation, memory mapping, memory bank control
- Networking: power-aware routing, proximity-based routing, balancing hop count, ...
- Distributed computing: mobile agents placement, network-driven computation
- Fidelity control
- Dynamic data types
- Power API



Source: P. Ranganathan, "System architectures for servers and datacenters"



# **Energy consumption of ICT**

#### Estimated consumption: 410 TWh in 2020, 25% for servers

BAU Scenario Annual Electricity Consumption of ICT (in TWh/a)



Source: European Commission DG INFSO, Impact of Information and Communication Technologies on Energy Efficiency, final report, 2008

# **Cost of moving data**

#### The High Cost of Data Movement

Fetching operands costs more than computing on them



Source: Bill Dally, « To ExaScale and Beyond »

www.nvidia.com/content/PDF/sc\_2010/theater/Dally\_SC10.pdf



### SRAM performance hardly improves

| Node    | 45nm              | 16nm              | 14nm              | 10nm              |
|---------|-------------------|-------------------|-------------------|-------------------|
| Density | 150 F<sup>2</sup> | 237 F<sup>2</sup> | 300 F<sup>2</sup> | 450 F<sup>2</sup> |

#### SRAM DENSITY - 16nm vs 28nm



Source: Joel Hruska, "Stop obsessing over transistor counts: It's a terrible way of comparing chips", http://www.extremetech.com



## SRAM takes more and more SoC area



Fig 3. To compress design schedule time, designers often reuse earlier design blocks and use third party IP. It is very rare that a new chip featuring billions of transistors is designed completely from scratch. Generally, most of a new design's transistors are used to form memories or functions derived from similar functions implemented in earlier designs. (Source: Semico Research Corporation, Study Number SC103-10, October 2010)



## Flash scaling also hits limits

#### **2** Questions



#### **The future will be non-volatile memories**

But they are still in development, and which technology will be the winner?


|                                      | FeRAM | RRAM  | Magnetic<br>field write<br>MRAM | PRAM             | STT<br>MRAM          |
|--------------------------------------|-------|-------|---------------------------------|------------------|----------------------|
| Non-volatile                         | Y     | Y     | Y                               | Y                | Y                    |
| Memory cell factor (F <sup>2</sup> ) | 16-32 | 4-6   | 16-32                           | 5-8              | 5-7                  |
| Read time (ns)                       | 20-50 | 10-20 | 3-20                            | 5-20             | 3-15                 |
| Write/erase time (ns)                | 50    | 20    | 10-20                           | >30              | 3-15                 |
| Number of rewrites                   | 10<sup>12</sup> | 10°   | 10<sup>15</sup> min | 10<sup>12</sup> | 10<sup>15</sup> min |
| Power consumption at write           | Low   | Low   | Somewhat<br>high                | Low              | Low                  |
| Required input voltage (V)           | 2-3   | 1.2   | 3                               | 1.5-3            | 1.5                  |








## And the development cost is increasing



SoC Development Costs have Soared from \$20 Million at 90nm to Over \$100 Million at 32 nm

Rock's law: the cost of an IC plant doubles every 4 years, reaching tens or hundreds of billions of dollars...
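Rock's law as stated here is a plain exponential with a 4-year doubling period, so projections reduce to one line of arithmetic. The sketch below is only an illustration: the starting cost is a hypothetical placeholder, not a figure from the source, and the function names are invented for this example.

```python
import math

DOUBLING_PERIOD_YEARS = 4  # from the slide: plant cost doubles every 4 years


def projected_cost(start_cost: float, years: float) -> float:
    """Cost after `years`, given a doubling every DOUBLING_PERIOD_YEARS."""
    return start_cost * 2 ** (years / DOUBLING_PERIOD_YEARS)


def years_to_reach(start_cost: float, target_cost: float) -> float:
    """Years needed for the cost to grow from start_cost to target_cost."""
    return DOUBLING_PERIOD_YEARS * math.log2(target_cost / start_cost)


if __name__ == "__main__":
    start = 1.0  # hypothetical plant cost, in billions of dollars
    print(projected_cost(start, 8))     # two doubling periods
    print(years_to_reach(start, 8.0))   # three doublings
```

Each factor of ten in cost takes a little over 13 years at this rate (4 × log2(10)).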

# "Main drivers in computing"

#### High Performance Computing



**1946** ENIAC, vacuum tube computer, 5 KOPS, 150 kW

**1965** General Electric GE-635, 4 processors, 2 MIPS

**2012** Bull B510, 10K cores, 4 TB RAM, 200 TFlops

**2015** *Tianhe-2 (MilkyWay-2): Intel Xeon E5-2692 12C 2.200 GHz, Intel Xeon Phi 31S1P, 3.12M cores,* 1,024 TB RAM, 50 PFlops, 17.8 MW
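The timeline gives both throughput and power only for ENIAC and Tianhe-2, which is enough for a rough energy-efficiency comparison. The sketch below uses just those two data points. It treats ENIAC's operations and Tianhe-2's floating-point operations as comparable, which is a simplifying assumption, so the result is an order-of-magnitude indication at best.

```python
# (operations per second, power in watts), taken from the timeline above
SYSTEMS = {
    "ENIAC (1946)": (5e3, 150e3),        # 5 KOPS, 150 kW
    "Tianhe-2 (2015)": (50e15, 17.8e6),  # 50 PFlops, 17.8 MW
}


def ops_per_watt(ops_per_second: float, watts: float) -> float:
    """Operations per second per watt, i.e. operations per joule."""
    return ops_per_second / watts


if __name__ == "__main__":
    efficiency = {name: ops_per_watt(*spec) for name, spec in SYSTEMS.items()}
    for name, value in efficiency.items():
        print(f"{name}: {value:.3g} ops/W")
    ratio = efficiency["Tianhe-2 (2015)"] / efficiency["ENIAC (1946)"]
    print(f"improvement: {ratio:.2g}x")
```

The power draw grew by about two orders of magnitude between the two machines, while throughput grew by about thirteen, so almost all of the gain comes from efficiency per watt.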








# Specialization with interposer





# Many cores: technology to reduce energy consumption and cost



\*: test and package costs are not included but considered equal for both technologies in this exercise



# Together...

# Electrons for compute

Electrons like to interact; easily moved; interaction needed for compute

# + Ions for storage

Ions like to interact; stay put; good for storage

# + Photons to communicate

Photons don't like to interact or stay put; good for long-distances

#### See the presentation on "The Machine" from HP

Courtesy: Jouppi2011



Source: P. Ranganathan, "Saving the world together, one server at a time..." ACACES 2011



## Software cost is rapidly increasing






# Parallelism and specialization are not for free...

Frequency limit → parallelism

Energy efficiency → heterogeneity

Ease of programming










#### Managing complexity...

"Nontrivial software written with threads, semaphore, and mutexes is incomprehensible by humans"

Edward A. Lee, "The future of embedded software", ARTEMIS 2006

Parallelism seems to be too complex for humans?



# **Time to think differently?**

- Approximate computing
- Probabilistic CMOS
- Neuromorphic computing
- Declarative programming
- Graphene
- Spintronics
- Quantum computing










# Time to think differently?

- Adequate computing
- Probabilistic CMOS
- Neuromorphic computing
- Declarative programming
- Graphene
- Spintronics
- Quantum computing






#### **HiPEAC Vision 2015**

HIGH PERFORMANCE AND EMBEDDED ARCHITECTURE AND COMPILATION



Editorial board: Marc Duranton, Koen De Bosschere, Albert Cohen, Jonas Maebe, Harm Munk



