VI Escola de Sistemas Embarcados (ESSE 2016)
VI Brazilian Symposium on Computing Systems Engineering

CPU/FPGA Heterogeneous Architectures

Lucas Bragança da Silva, Fredy Alves, José Nacif (Presenter), Ricardo Ferreira (Presenter)
Universidade Federal de Viçosa

Financial support: Intel Brasil, Intel Labs, CAPES, CNPq, FAPEMIG


“CGRA HARP”

VI Brazilian Symposium on Computing Systems Engineering, November 2016

Universidade Federal de Viçosa

Outline

• Motivation
• FPGA and CPU
• OpenCL and FPGA accelerators
• HARP Platform
• HARP Layers
• Demo
• HARP CGRA


Motivation

Moore's Law continues…

[Figure labels: 2005, Single Thread, Frequency, Power, multiple cores]


Motivation: IoT and Cloud Computing

• “Coherently-attached FPGA accelerator for Xeon processors in the datacenter which is estimated to have a $1B market opportunity by 2020” - Prabhat K. Gupta, General Manager of Xeon+FPGA Product at Intel Corporation

• Microsoft Catapult: layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling the FPGAs to communicate directly at datacenter scale - IEEE Micro 2016, “A Cloud-Scale Acceleration Architecture”

• Baidu, Inc. (NASDAQ: BIDU), the leading Chinese-language Internet search provider
  • Accelerators = greater throughput at low latency while retaining practical power levels.
  • 10-20X performance/watt improvement.
  • Baidu-optimized FPGA platforms are tuned for machine learning applications such as image and speech recognition.


Motivation FPGA


FPGA is scalable!


FPGAs


• Scalable
• Energy Efficiency
• Parallel and distributed computing
• Temporal and Spatial Parallelism
• From low cost embedded to high performance cloud


FPGAs and Tools

[Figure labels: Hardware Description Languages, Compilers, High Level Synthesis, General Purpose]

Specific tools for specific applications


CPU and FPGAs

• Heterogeneous applications and heterogeneous hardware

• Real World
  • HARP - Intel/Altera Platform
  • Microsoft Catapult


FPL 2016 - PK Gupta - Intel

Accelerating DataCenter Workloads


Microsoft Catapult v2

172.6K ALMs, 4 GB DDR3

Round trip: 250,000 machines in 20 microseconds


• Microsoft's FPGA translates Wikipedia in less than a tenth of a second

• FPGA network - breaking the “chicken and egg” problem
  • Accelerators cannot be added until enough applications need them, but applications will not rely upon the accelerators until they are present in the infrastructure.

• By decoupling the servers and FPGAs, software services that demand more FPGA capacity


Instructions


Temporal and Spatial Parallelism


OpenCL


OpenCL example

__attribute__((num_compute_units(4,4)))
kernel void PE() {
    ...
}

[Figure: one row of processing elements, PE 0,0 through PE 0,3]

How to build a systolic computer…

__attribute__((num_compute_units(4,4)))
kernel void PE() {
    ...
}

[Figure: 4 x 4 grid of processing elements, PE 0,0 through PE 3,3]

OpenCL example

__attribute__((num_compute_units(4,4)))
kernel void PE() {
    int row = get_compute_id(0);   // row index of this processing element
    int col = get_compute_id(1);   // column index of this processing element
    …
}

[Figure: 4 x 4 grid of processing elements, PE 0,0 through PE 3,3]

OpenCL example

channel float4 ch_bottom[4];

PE() {
    float4 a, b;
    if (row == 0)
        a = read_channel(ch_bottom[col]);
    …

[Figure: 4 x 4 grid of processing elements, PE 0,0 through PE 3,3]

OpenCL example

channel float4 ch_bottom[4];
channel float4 ch_PE_col[4][4];
…
PE() {
    float4 a, b;
    if (row == 0)
        a = read_channel(ch_bottom[col]);          // row 0 reads from the boundary channel
    else
        a = read_channel(ch_PE_col[row-1][col]);   // other rows read from the PE above
    …

[Figure: 4 x 4 grid of processing elements, PE 0,0 through PE 3,3]
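The channel wiring in these kernel fragments can be mimicked on a CPU. The following is a minimal sketch in plain C++, not OpenCL: `Channel`, `Grid`, `pe_step` and `run_column_pass` are hypothetical names, a queue of `float` stands in for a `float4` channel, and forwarding each value unchanged to the PE below is an assumption, since the slides only show the read side.

```cpp
#include <array>
#include <cassert>
#include <queue>

constexpr int N = 4;                  // 4 x 4 grid, as in num_compute_units(4,4)
using Channel = std::queue<float>;    // stand-in for an OpenCL channel

struct Grid {
    std::array<Channel, N> ch_bottom;                  // feeds row 0
    std::array<std::array<Channel, N>, N> ch_PE_col;   // ch_PE_col[r][c]: output of PE(r,c)
    std::array<std::array<float, N>, N> seen{};        // last value each PE read
};

// One step of PE(row, col): row 0 reads the boundary channel, every other row
// reads from the PE directly above it, then the value is passed down.
inline void pe_step(Grid& g, int row, int col) {
    Channel& in = (row == 0) ? g.ch_bottom[col] : g.ch_PE_col[row - 1][col];
    assert(!in.empty());              // a real channel read would block instead
    float a = in.front();
    in.pop();
    g.seen[row][col] = a;
    g.ch_PE_col[row][col].push(a);    // assumption: forward unchanged
}

// Push one value into each column and let it travel through all rows.
inline void run_column_pass(Grid& g, const std::array<float, N>& inputs) {
    for (int c = 0; c < N; ++c) g.ch_bottom[c].push(inputs[c]);
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c) pe_step(g, r, c);
}
```

Calling `run_column_pass` with one value per column leaves each value visible in every PE of that column, which is the column-wise data flow the `row == 0` / `row-1` branch sets up.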


Coarse Grained Reconfigurable Array: CGRA vs FPGA

[Figure labels: Huge bitstream (fine grained); Bitstream (word level)]

CGRA as a virtual layer

[Figure labels: Small Bitstream; FPGA]


HARP - Legal Disclaimer

Copyright (C) 2008-2016 Intel Corporation. All Rights Reserved. The source code contained or described herein and all documents related to the source code ("Material") are owned by Intel Corporation or its suppliers or licensors. Title to the Material remains with Intel Corporation or its suppliers and licensors. The Material contains trade secrets and proprietary and confidential information of Intel or its suppliers and licensors. The Material is protected by worldwide copyright and trade secret laws and treaty provisions. No part of the Material may be copied, reproduced, modified, published, uploaded, posted, transmitted, distributed, or disclosed in any way without Intel's prior express written permission.


HARP Prototype Xeon+FPGA* system disclaimer

This talk is about prototype hardware and software which has been made available to universities in the HARP program.

Details of production Xeon+FPGA systems will be made available at a later date.

Results and details in this presentation were generated using pre-production hardware and software, and may not reflect production or future systems.


HARP: Accelerating Workloads using Xeon and coherently attached FPGA in-socket

QPI (QuickPath Interconnect): 6 GB/s

Heterogeneous architecture with homogeneous platform support


HARP-1 – Development platform


• 96 GB RAM

• Xeon 10 cores

• FPGA Stratix V
  - 622K LUTs
  - 1M Registers
  - 2.5K Memory Modules (M20K)
  - 512 DSPs

[Figure labels: Stratix V, Xeon]


HARP-1 – USB Programmer


HARP HDL Programming


HARP – General Architecture


HARP - Accelerator Abstraction Layer (AAL)

• Set of software tools for the development and deployment of systems composed of asymmetric computing resources

• CPUs, GPUs, FPGAs as a server

• An application uses the server by requesting resources


• Resource manager
  • Ensures exclusive use of a resource.

• Service-oriented and object-oriented
  • Interface definitions, attributes, and the objects which implement those interfaces.


• Service-Oriented Architecture
  • Service: Encapsulation of functionality which consumes computing resources.
  • Registrar: Registers services and APIs; used to locate and acquire service interfaces.
  • Client: Executable that uses a service by acquiring its API from the Registrar.


HARP – AAL Object Communications

• AAL uses asynchronous communication: a request returns to the application while the requested service executes in parallel.
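The request-then-wait pattern behind this bullet can be sketched with standard C++ threads. This is a minimal illustration and not the AAL API: `Semaphore`, `request_service_async` and `run_client` are hypothetical names, and the "service" simply doubles its input.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Minimal counting semaphore (hypothetical helper, not the AAL class).
class Semaphore {
public:
    void post() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Starts the request on a worker thread and returns at once; the callback
// runs later on that thread. The caller must join the returned thread.
inline std::thread request_service_async(int input, std::function<void(int)> on_done) {
    return std::thread([input, on_done] { on_done(input * 2); });
}

// Client pattern: issue the request, then block until the callback fires.
inline int run_client(int input) {
    Semaphore sem;
    int result = 0;
    std::thread worker = request_service_async(input, [&](int value) {
        result = value;   // written before post(), read after wait()
        sem.post();
    });
    sem.wait();
    worker.join();
    return result;
}
```

Between the call to `request_service_async` and `sem.wait()` the client is free to do other work, which is the point of the asynchronous model.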


HARP – Services, Interfaces, Composition

• A client accesses the service through virtual interfaces published on the Registrar; these interfaces do not expose the implementation.

• Component objects implement the interface.

class IMyInterface {
public:
    virtual void doThis(void) = 0;   // return type assumed void; the slide omits it
    virtual void doThat(void) = 0;
    virtual ~IMyInterface() {}
};
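As a minimal sketch of how a component object might implement such an interface, the example below repeats the interface with `void` return types (an assumption, since the slide omits them) and adds a hypothetical `LoggingComponent` and `use_service`, neither of which comes from AAL.

```cpp
#include <string>
#include <vector>

// Pure virtual interface: all the client ever sees.
class IMyInterface {
public:
    virtual void doThis() = 0;
    virtual void doThat() = 0;
    virtual ~IMyInterface() {}
};

// Hypothetical component object implementing the interface.
class LoggingComponent : public IMyInterface {
public:
    void doThis() override { log_.push_back("this"); }
    void doThat() override { log_.push_back("that"); }
    const std::vector<std::string>& log() const { return log_; }
private:
    std::vector<std::string> log_;
};

// Client code depends only on the interface, not on the concrete class.
inline void use_service(IMyInterface& service) {
    service.doThis();
    service.doThat();
}
```

Because `use_service` takes an `IMyInterface&`, the implementation can be swapped (for example, software versus FPGA-backed) without touching client code.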


HARP – Abstraction, Resource Management

• AAL abstracts service instantiation from the application.

• Services can be created dynamically.

• When multiple implementations of a service are available in one or more compute resources, AAL returns the most suitable one.

• AAL Resource Manager controls the allocation and provisioning of compute resources to services.

• Resource management is important for precious and shared resources such as accelerators on FPGAs.


HARP – AAL Service Broker and Registrar

• Service Broker gets the information required to instantiate a service from the Registrar.

• Service libraries are loadable software such as DLLs.


• Client 1 consults Service Broker for Service Compute.

• Service Broker obtains data record describing Service Compute from Service Registrar.

• Service broker consults Resource Manager which consults implementations and computing resources.


• Resource Manager returns information to allow Broker to load service package.

• Service broker calls Service factory to instantiate it.
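The broker, registrar, resource manager and factory steps above can be sketched in a few lines of C++. This is an illustration under stated assumptions, not AAL code: `IService`, `DoublerService`, `ServiceRecord`, `ResourceManager`, `allocate` and the resource name `"fpga0"` are all hypothetical.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <set>
#include <string>

struct IService {
    virtual int compute(int x) = 0;
    virtual ~IService() {}
};

struct DoublerService : IService {
    int compute(int x) override { return 2 * x; }
};

using Factory = std::function<std::unique_ptr<IService>()>;

// Registrar: maps a service name to the record needed to instantiate it
// (here just the required resource and a factory).
struct ServiceRecord {
    std::string resource;
    Factory factory;
};
using Registrar = std::map<std::string, ServiceRecord>;

// Resource manager: grants each named resource to at most one service.
class ResourceManager {
public:
    bool acquire(const std::string& r) { return in_use_.insert(r).second; }
    void release(const std::string& r) { in_use_.erase(r); }
private:
    std::set<std::string> in_use_;
};

// Broker: look up the record, ask for the resource, then call the factory.
// Returns nullptr if the service is unknown or the resource is taken.
inline std::unique_ptr<IService> allocate(const Registrar& reg, ResourceManager& rm,
                                          const std::string& name) {
    auto it = reg.find(name);
    if (it == reg.end()) return nullptr;
    if (!rm.acquire(it->second.resource)) return nullptr;
    return it->second.factory();
}

inline Registrar make_demo_registrar() {
    Registrar reg;
    reg["Compute"] = {"fpga0", [] { return std::unique_ptr<IService>(new DoublerService()); }};
    return reg;
}
```

A second `allocate(reg, rm, "Compute")` returns `nullptr` until `rm.release("fpga0")` is called, which mirrors the exclusive-use guarantee of the resource manager.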


HARP – Core Cache Interface (CCI)

• Interface between AFU and QPI

- Read and write requests to the system coherent memory.

- Coherent memory is mapped to CPU DRAM.

• FPGA implements Intel QPI

- Processor uses QPI to access the system cache.

[Figure labels: 6 GB/s; 64 KB; Line = 64 B]


• Accelerated Function Units (AFUs)
  • An AFU accelerates an application kernel in the FPGA.

• Blue dotted box is the multiprocessor boundary.

• Red dotted box is the Cache access domain



HARP – Interface Definitions: Attach points

• QPI-FPGA implements the Caching and Configuration agents.

- The caching agent assures memory coherence.

- The Configuration Agent receives and handles read and write cycles from processor.

- System Protocol Layer (Virtual Address Translation)


• Processor-FPGA is RX.

• FPGA-Processor is TX.

• Designed to accept one read and one write per clock cycle.

• AFU with CCI-E connected via SPL2: read responses are ordered, writes may complete out of order.

• SPL2: up to 2GB pinned virtual address space to an AFU.


• AFU connected via CCI Standard (CCI-S) or CCI Extended (CCI-E).

• CCI-S uses physical addressing and out of order responses.

• CCI-E uses virtual addressing.

• Intel provides SPL2 IP to translate virtual to physical addresses.

• AFU connected to SPL2 via CCI-E.
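The virtual-to-physical translation that SPL2 performs for CCI-E can be pictured as a page-table lookup. The sketch below is a software illustration only and says nothing about how the SPL2 IP is built: `PageTable` is a hypothetical class and the 2 MiB page size is an assumption chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr std::uint64_t kPageSize = 2ull * 1024 * 1024;  // assumed page size for this sketch

class PageTable {
public:
    // Pin one virtual page to a physical page; both must be page-aligned.
    void map(std::uint64_t virt_page_base, std::uint64_t phys_page_base) {
        assert(virt_page_base % kPageSize == 0 && phys_page_base % kPageSize == 0);
        table_[virt_page_base / kPageSize] = phys_page_base;
    }
    // Translate a virtual address; returns false if the page is not mapped.
    bool translate(std::uint64_t virt, std::uint64_t& phys) const {
        auto it = table_.find(virt / kPageSize);
        if (it == table_.end()) return false;
        phys = it->second + (virt % kPageSize);   // keep the offset within the page
        return true;
    }
private:
    std::unordered_map<std::uint64_t, std::uint64_t> table_;
};
```

With a table like this in front of it, an accelerator can work with the same virtual addresses as the application, while physical addressing (as in CCI-S) would require the software to supply physical addresses itself.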


Intel HARP

“Hello World”

Example.

Are you ready???

Page 81: VI Escola de Sistemas Embarcados - INFcaco/ESSE2016/ESSE2016_Nacif.pdf · VI Escola de Sistemas Embarcados ESSE 2016 VI Brazilian Symposium on Computing Systems Engineering CPU/FPGA

VI Brazilian Symposium on Computing Systems Engineering, November 2016

Universidade Federal de Viçosa

HARP – Accelerator “Hello World”

• AFU capable of adding two CPU memory values:
- SPL2 RTL for address translation
- SW application (C++) and AFU RTL (Verilog)

• In this example we demonstrate the use of:
- AAL Runtime
- Service API
- Example AFU to build a user AFU

AAL Application code: Run method

m_runtimClient->getRuntime()->allocService(dynamic_cast<IBase *>(this), Manifest);
m_Sem.Wait();
if(0 == m_Result){
    MSG("Running Test");
    btVirtAddr pWSUsrVirt = m_pWkspcVirt; // Address of workspace
    const btWSSize WSLen = m_WkspcSize; // Length of workspace
    INFO("Allocated " << WSLen << "-byte Workspace at virtual address " << std::hex << (void *)pWSUsrVirt);
    // Number of bytes in each of the source and destination buffers (4 MiB in this case)
    btUnsigned32bitInt a_num_bytes = (btUnsigned32bitInt) ((WSLen - sizeof(VAFU2_CNTXT)) / 2);

• Allocates the service and its workspace in Device Status Memory (DSM) using allocService().

• If the service is successfully allocated, runs the test.

• Gets the address and the length of the workspace.

• Defines the size, in bytes, of the source and destination buffers.

    btUnsigned32bitInt a_num_cl = a_num_bytes / CL(1); // number of cache lines in buffer
    // VAFU Context is at the beginning of the buffer
    VAFU2_CNTXT *pVAFU2_cntxt = reinterpret_cast<VAFU2_CNTXT *>(pWSUsrVirt);
    // The source buffer is right after the VAFU Context
    btVirtAddr pSource = pWSUsrVirt + sizeof(VAFU2_CNTXT);
    // The destination buffer is right after the source buffer
    btVirtAddr pDest = pSource + a_num_bytes;

• Defines the number of cache lines in each buffer (a_num_cl).

• Gets the pointers to the AFU context (pVAFU2_cntxt), the source buffer (pSource) and the destination buffer (pDest).

• pDest is pSource plus the size of the source buffer in bytes (a_num_bytes).
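The workspace layout used by the Run method (context first, then the source buffer, then the destination buffer) can be reproduced without the AAL headers. The sketch below is a stand-alone approximation: `Context`, `Layout` and `partition` are hypothetical names, and `Context` merely stands in for `VAFU2_CNTXT`, assumed here to occupy one 64-byte cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // 512-bit cache line (CACHE_WIDTH = 512)

// Hypothetical stand-in for VAFU2_CNTXT, padded to exactly one cache line.
struct alignas(64) Context {
    std::uint8_t* pSource;
    std::uint8_t* pDest;
    std::uint32_t num_cl;
    std::uint32_t Status;
};

struct Layout {
    Context* ctx;                        // context at the start of the workspace
    std::uint8_t* source;                // source buffer, right after the context
    std::uint8_t* dest;                  // destination buffer, right after the source
    std::size_t bytes_per_buffer;
    std::size_t cache_lines_per_buffer;
};

// Splits a workspace of ws_len bytes the same way the Run method does.
Layout partition(std::uint8_t* workspace, std::size_t ws_len) {
    assert(ws_len > sizeof(Context));
    Layout l{};
    l.ctx = reinterpret_cast<Context*>(workspace);
    l.bytes_per_buffer = (ws_len - sizeof(Context)) / 2;
    l.cache_lines_per_buffer = l.bytes_per_buffer / kCacheLineBytes;
    l.source = workspace + sizeof(Context);
    l.dest = l.source + l.bytes_per_buffer;
    return l;
}
```

Keeping the context in the first cache line lets the AFU find both buffer pointers from the single workspace address it receives.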

    // Initialize the command buffer
    ::memset(pVAFU2_cntxt, 0, sizeof(VAFU2_CNTXT));
    pVAFU2_cntxt->num_cl = a_num_cl;
    pVAFU2_cntxt->pSource = pSource;
    pVAFU2_cntxt->pDest = pDest;
    INFO("Starting SPL Transaction with Workspace");
    m_Sem.Wait();
    int numa = 3;
    int numb = 2;
    int *inputs_ADD = (int*)malloc(sizeof(int)*2);
    volatile int *addIn = (int*)pSource;
    inputs_ADD[0] = numa;
    inputs_ADD[1] = numb;
    memcpy((void*)addIn, inputs_ADD, sizeof(int)*2);
    m_SPLService->StartTransactionContext(TransactionID(), pWSUsrVirt, 100);

• Initializes the AFU context and copies it to the context pointer.

• Defines two numbers (numa, numb) and copies them to the source buffer using memcpy.

• Starts the transaction with StartTransactionContext, which enables the start signal on the AFU and resets it.

• The AFU writes its AFU_ID to the DSM.

• The AFU starts running after the CPU reads the AFU_ID from the DSM.

    // Wait for SPL VAFU to finish code
    volatile bt32bitInt done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
    while (!done && --count) {
        SleepMilli( delay );
        done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
    }
    if ( !done ) {
        // must have dropped out of loop due to count -- never saw update
        ERR("AFU never signaled it was done. Timing out anyway. Results may be strange.\n");
    }
    int *pu32 = reinterpret_cast<int*>(&pDestCL[0]);
    for(int i = 0; i < results_num; i++){
        cout << *pu32 << "\n";
        ++pu32;
    }

• Reads the AFU DONE flag from the context status.

• Waits for done to be set to 1.

• If the AFU does not answer before the time limit, prints an error message.

• Gets a pointer to the destination buffer and prints the results (meaningful only if the AFU answered in time).
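The wait-for-done loop above is a bounded polling pattern: check a status word, sleep, and give up after a fixed number of attempts. A minimal stand-alone sketch follows; `kStatusDone` and `wait_for_done` are hypothetical names, and the `std::atomic` status word stands in for the `Status` field of the AFU context, which the real code reads through a volatile variable.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

constexpr std::uint32_t kStatusDone = 0x1;  // hypothetical bit, stands in for VAFU2_CNTXT_STATUS_DONE

// Polls `status` until the done bit is set or the retry budget is exhausted.
// Returns true if the done bit was observed.
bool wait_for_done(const std::atomic<std::uint32_t>& status,
                   int retries, std::chrono::milliseconds delay) {
    bool done = (status.load(std::memory_order_acquire) & kStatusDone) != 0;
    while (!done && --retries > 0) {
        std::this_thread::sleep_for(delay);
        done = (status.load(std::memory_order_acquire) & kStatusDone) != 0;
    }
    return done;
}
```

Returning the flag lets the caller decide whether the destination buffer can be trusted, instead of printing possibly stale results after a timeout.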

    // Issue Stop Transaction and wait for OnTransactionStopped
    INFO("Stopping SPL Transaction");
    m_SPLService->StopTransactionContext(TransactionID());
    m_Sem.Wait();
}
// Clean up and exit
INFO("Workspace verification complete, freeing workspace.");
m_SPLService->WorkspaceFree(m_pWkspcVirt, TransactionID());
m_Sem.Wait();
m_runtimClient->end();
// while(1){}
return m_Result;
}

• After the transaction is done, the application stops it.

• Stopping the transaction resets the AFU and sets the start signal to 0.

• Frees the workspace.

The AFU

• Based on the Sudoku example.

• SPL RTL: provided by Intel.

• AFU RTL:
- afu_user: implements the communication interface with the SPL module.

The AFU USER Interface

module afu_user #(CACHE_WIDTH = 512)
(
    input clk,
    input reset_n,
    // Read Request
    output [ADDR_LMT-1:0] rd_req_addr,
    output [MDATA-1:0] rd_req_mdata,
    output reg rd_req_en,
    input rd_req_almostfull,
    // Read Response
    input rd_rsp_valid,
    input [MDATA-1:0] rd_rsp_mdata,
    input [CACHE_WIDTH-1:0] rd_rsp_data,

• CACHE_WIDTH is the size of the cache line in bits.

• The SW application starts the transaction (reset).

• The read request signals are used to request a read from the source buffer.

• The read response signals carry the data that was read. In this case we only use rd_rsp_data, which holds the cache line for the last read request.

    // Write Request
    output [ADDR_LMT-1:0] wr_req_addr,
    output [MDATA-1:0] wr_req_mdata,
    output [CACHE_WIDTH-1:0] wr_req_data,
    output reg wr_req_en,
    input wr_req_almostfull,
    // Write Response
    input wr_rsp0_valid,
    input [MDATA-1:0] wr_rsp0_mdata,
    input wr_rsp1_valid,
    input [MDATA-1:0] wr_rsp1_mdata,
    // Start input signal
    input start,
    // Done output signal
    output reg done,
    // Control info from software
    input [511:0] afu_context);

• The write request signals mirror the read request signals, for cache line write operations.

• wr_rsp1_valid is used to identify when the writing process finishes.

• “start” is the signal set by the CPU in the AFU to start a transaction.

• “done” is the signal sent to the CPU to indicate that the transaction processing is over.

The AFU Control States

• Read data from the source buffer, process it (AFU), and write the results back to the destination buffer.

The AFU Control States: FSM_IDLE

FSM_IDLE: begin
    if(start) begin
        fsm_ns = FSM_RD_REQ;
    end
end

• Waits for the start signal to be set to one by the CPU.

• Changes to FSM_RD_REQ to start reading from the source buffer.

The AFU Control States: FSM_RD_REQ

FSM_RD_REQ: begin
    // If there's no more data to copy
    if(addr_cnt >= num_clines)
    begin
        fsm_ns = FSM_RUN_ADD;
        addr_cnt_clr = 1'b1;
    end
    // There's more data to copy
    else begin
        // Issue rd_req
        if(!rd_req_almostfull) begin
            rd_req_en = 1'b1;
            fsm_ns = FSM_RD_RSP;
        end
    end
end

• addr_cnt keeps track of which line is being read.

• If addr_cnt is greater than or equal to the number of lines to be read (num_clines), clears addr_cnt and changes to the state that runs the user AFU.

• If addr_cnt is less than the number of lines to be read and the read request queue is not almost full (rd_req_almostfull is 0), sends a read request to the SPL (rd_req_en) and changes to the state that waits for the read response.

always@(posedge clk)
begin
    if(rd_rsp_valid)
    begin
        case(addr_cnt)
            'd0:
            begin
                inputs_add <= rd_rsp_data;
            end
        endcase // case (addr_cnt)
    end // if (rd_rsp_valid)
end // always@ (posedge clk)

adder add0(
    .clk(clk),
    .start(t_start),
    .numA(inputs_add[31:0]),
    .numB(inputs_add[63:32]),
    .result(w_outGrid),
    .done(w_done)
);

• This always block waits for a read response (rd_rsp_valid is 1) and then saves the data from the source buffer (rd_rsp_data), in this case to inputs_add.

• inputs_add is connected to the inputs of the user AFU, which in this case is the adder.
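How the two integers written by the host end up on the adder inputs can be checked with a small software model. This is a sketch under assumptions: the slides do not show the inside of the adder module, so a plain 32-bit addition is assumed, and `CacheLine`, `pack_inputs` and `adder_model` are hypothetical names. The cache line is modeled as sixteen 32-bit words so that word 0 corresponds to inputs_add[31:0] and word 1 to inputs_add[63:32].

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using CacheLine = std::array<std::uint32_t, 16>;  // 16 x 32 bits = 512 bits

// Software stand-in for the adder: numA = line[31:0], numB = line[63:32].
std::uint32_t adder_model(const CacheLine& inputs_add) {
    return inputs_add[0] + inputs_add[1];
}

// Mimics the host side: copy two ints to the start of the source buffer.
CacheLine pack_inputs(std::int32_t numa, std::int32_t numb) {
    CacheLine line{};  // remaining words stay zero
    const std::int32_t inputs[2] = {numa, numb};
    std::memcpy(line.data(), inputs, sizeof(inputs));
    return line;
}
```

For the values used in the Run method, `adder_model(pack_inputs(3, 2))` yields 5.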

The AFU Control States: FSM_RD_RSP

FSM_RD_RSP:
begin
    // Receive rd_rsp, put read data into data_buf
    if(rd_rsp_valid)
    begin
        addr_cnt_inc = 1'b1;
        fsm_ns = FSM_RD_REQ;
    end
end

• Waits for the read response.

• addr_cnt_inc is set to one, which increases addr_cnt by 1 so that the next line in the source buffer will be read.

• Goes back to FSM_RD_REQ.

// --- Address counter
reg [31:0] addr_cnt;
always @ (posedge clk) begin
    if(!reset_n)
        addr_cnt <= 0;
    else
        if(addr_cnt_inc)
            addr_cnt <= addr_cnt + 1;
        else if(addr_cnt_clr)
            addr_cnt <= 'd0;
end

• This always block is responsible for controlling the changes in addr_cnt.

• When addr_cnt_inc is set to 1, addr_cnt is increased by 1 to go to the next buffer line.

• When addr_cnt_clr is set to 1, addr_cnt is cleared.
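The priority between reset, increment and clear in the address counter is easy to miss in the nested if/else. A next-state function makes it explicit; `next_addr_cnt` is a hypothetical helper that models one rising clock edge and is not part of the AFU sources.

```cpp
#include <cassert>
#include <cstdint>

// One clock edge of the addr_cnt register, mirroring the if / else-if order of the Verilog.
std::uint32_t next_addr_cnt(std::uint32_t addr_cnt, bool reset_n,
                            bool addr_cnt_inc, bool addr_cnt_clr) {
    if (!reset_n) return 0;                 // synchronous, active-low reset
    if (addr_cnt_inc) return addr_cnt + 1;  // increment takes priority over clear
    if (addr_cnt_clr) return 0;
    return addr_cnt;                        // no control signal: hold the value
}
```

In the state machine shown in these slides the two control signals are raised in different states, so the priority only matters if the FSM is modified.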

The AFU Control States: FSM_RUN_ADD

FSM_RUN_ADD:
begin
    t_start = 1'b1;
    fsm_ns = FSM_WAIT_ADD;
    n_cnt = 'd0;
end

adder add0(
    .clk(clk),
    .start(t_start),
    .numA(inputs_add[31:0]),
    .numB(inputs_add[63:32]),
    .result(w_outGrid),
    .done(w_done)
);

• Sets t_start, which is connected to the adder, to 1.

• The adder starts.

• Goes to the state that waits for the adder to finish.

The AFU Control States: FSM_WAIT_ADD

FSM_WAIT_ADD:
begin
    if(w_done | w_error)
    begin
        fsm_ns = FSM_WR_REQ;
    end
end

adder add0(
    .clk(clk),
    .start(t_start),
    .numA(inputs_add[31:0]),
    .numB(inputs_add[63:32]),
    .result(w_outGrid),
    .done(w_done)
);

• Waits for the w_done wire, connected to the done signal of the adder, to be set to one, meaning that the adder has finished processing.

• When finished, goes to the state that starts writing the results to the destination buffer.

The AFU Control States: FSM_WR_REQ

FSM_WR_REQ:
begin
    if(addr_cnt >= num_clines)
    begin
        fsm_ns = FSM_DONE;
    end
    else if(!wr_req_almostfull)
    begin
        wr_req_en = 1'b1; // issue wr_req
        fsm_ns = FSM_WR_RSP;
    end
end

• Requests a write to the destination buffer; the data to be written is the output of the adder.

The AFU Control States: FSM_WR_RSP

FSM_WR_RSP:
begin
    if(wr_rsp0_valid | wr_rsp1_valid)
    begin
        fsm_ns = FSM_WR_REQ;
        addr_cnt_inc = 1'b1; // address counter ++
    end
end

• Waits for a write response, then increments the address counter and goes back to FSM_WR_REQ.

• The data to be written to the destination buffer is the output of the adder.

The AFU Control States: FSM_DONE

FSM_DONE:
begin
    done = 1'b1;
    fsm_ns = FSM_DONE;
end

• Sets done to one, which finishes the transaction, stops the SPL and sends the done signal to the CPU.
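The control states walked through above (idle, read request/response, run and wait for the adder, write request/response, done) can be exercised as a small software model. This is a simplified sketch, not the AFU RTL: `AfuModel` and `run` are hypothetical names, responses are assumed to arrive one step after each request, the almostfull signals are assumed never to be asserted, and the adder is assumed to be a plain 32-bit addition.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class State { Idle, RdReq, RdRsp, RunAdd, WaitAdd, WrReq, WrRsp, Done };

struct AfuModel {
    State state = State::Idle;
    std::uint32_t addr_cnt = 0;
    std::uint32_t num_clines = 1;
    std::uint32_t numA = 0, numB = 0, result = 0;
    bool done = false;

    // Advances the model by one step; src and dst play the role of the two buffers.
    void step(bool start, const std::vector<std::uint32_t>& src,
              std::vector<std::uint32_t>& dst) {
        switch (state) {
        case State::Idle:
            if (start) state = State::RdReq;
            break;
        case State::RdReq:
            if (addr_cnt >= num_clines) { addr_cnt = 0; state = State::RunAdd; }
            else state = State::RdRsp;                    // read request issued
            break;
        case State::RdRsp:
            if (addr_cnt == 0) { numA = src[0]; numB = src[1]; }  // only line 0 is captured
            ++addr_cnt;
            state = State::RdReq;
            break;
        case State::RunAdd:
            state = State::WaitAdd;                       // start pulse to the adder
            break;
        case State::WaitAdd:
            result = numA + numB;                         // adder reports done
            state = State::WrReq;
            break;
        case State::WrReq:
            state = (addr_cnt >= num_clines) ? State::Done : State::WrRsp;
            break;
        case State::WrRsp:
            dst[0] = result;                              // write acknowledged
            ++addr_cnt;
            state = State::WrReq;
            break;
        case State::Done:
            done = true;
            break;
        }
    }
};

// Steps the model until it reports done or max_steps is reached; returns the steps taken.
int run(AfuModel& afu, const std::vector<std::uint32_t>& src,
        std::vector<std::uint32_t>& dst, int max_steps) {
    int steps = 0;
    while (!afu.done && steps < max_steps) { afu.step(true, src, dst); ++steps; }
    return steps;
}
```

A model like this can help sanity-check the state ordering before simulating the Verilog, for example that addr_cnt is cleared between the read phase and the write phase.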

Collision detection algorithm

• Detects collisions between rigid bodies in a space and calculates the results of these collisions.

• Used in a wide variety of applications such as games, simulations and robotics.

• Implemented in physics engines. In our case we integrate the HARP platform with ODE (Open Dynamics Engine), an open-source engine.

Case study: sphere collision detection

• Inputs: position, speed and shape of the bodies in space.

• Outputs: contact points (potential, fake and true) and collision results (new positions for the spheres).

[Figure: in-game gameplay from the game Besiege.]
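For the sphere case study, the core test is whether the distance between two centers is at most the sum of the radii. The sketch below shows that test and one common way to derive a contact point; it is an illustration only, not ODE's API and not the authors' AFU, and `Vec3`, `Sphere`, `Contact` and `collide` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Sphere { Vec3 center; double radius; };
struct Contact { Vec3 point; Vec3 normal; double depth; };

// Returns true and fills `out` if the two spheres touch or overlap.
bool collide(const Sphere& a, const Sphere& b, Contact& out) {
    const Vec3 d{b.center.x - a.center.x, b.center.y - a.center.y, b.center.z - a.center.z};
    const double dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    const double depth = a.radius + b.radius - dist;
    if (depth < 0.0) return false;  // the spheres are apart
    // Coincident centers have no defined direction: pick an arbitrary normal.
    const Vec3 n = dist > 0.0 ? Vec3{d.x / dist, d.y / dist, d.z / dist}
                              : Vec3{1.0, 0.0, 0.0};
    // Contact point on the surface of sphere a, along the line between the centers.
    out.point = Vec3{a.center.x + n.x * a.radius,
                     a.center.y + n.y * a.radius,
                     a.center.z + n.z * a.radius};
    out.normal = n;
    out.depth = depth;
    return true;
}
```

Each pair test is independent of the others, which is the kind of regular, data-parallel work that suits an FPGA accelerator.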

The system is composed of:

• The ODE application integrated with AAL.

• The FPGA with the collision detection AFU and the SPL for address translation.

• A source buffer to hold the input data for the AFU.

• A destination buffer to hold the collision detection results for the application.

For each simulation step

• The CPU sends the collision data to the source buffer.

• The CPU sends the start and reset signals to the SPL, which propagates them to the AFU.

• The AFU sends its AFU ID to the SRC buffer, where it is read by the CPU, indicating that the AFU has started.

• After it finishes processing the collisions, the AFU sends the results to the destination buffer.

• The AFU indicates to the CPU that it is done processing the transaction.

• The CPU retrieves the results from the destination buffer.

Two application examples

FPL 2016 - PK Gupta - Intel: Accelerating DataCenter Workloads

FPGA Board Evaluation - DAC, 2016

FCCM, 2016

DNA accelerator, short sequence, HARP/Intel

Inside a PE

Applications mapped on HARP

• “Runtime Parameterizable Regular Expression Operators for Databases”
- Trade-off between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL).

• “High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform”
- 2.9x and 1.9x compared with CPU-only and FPGA-only baselines
- 2.3x vs. FPGA implementation for sorting

Ongoing work

• Previous work

• Modulo scheduling

• Virtual CGRA

• High-level stream computation mapped onto HARP

“Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.” (IBM Big-Data)

Loop Unrolling - Modulo Scheduling

[Figure: sequential code (a loop with operations A to J) is turned into a parallel data flow graph fed by streams and mapped at runtime onto a physical parallel architecture, a CGRA built from functional units (FUs).]

[Figure: the data flow graph is replicated for iterations i, i+1, i+2 and i+3.] Overlap iterations!

• At the same time, all operations are executed.

• One clock cycle throughput!

• ILP = 10

[Figure: iterations i to i+3 of the data flow graph next to a physical architecture with ten FUs; operations A to J are then placed one per FU.]

[Figure: the same data flow graphs with a physical architecture of only six FUs.] 10 OPs, 6 units

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

B

A

G

F

FU

H

t0

t1

t2

t3

t4

t5

t6

t7

6 Units, t3

VI Brazilian Symposium on Computing Systems Engineering, November 2016

Loop Unrolling - Modulo Scheduling

Page 138: VI Escola de Sistemas Embarcados - INFcaco/ESSE2016/ESSE2016_Nacif.pdf · VI Escola de Sistemas Embarcados ESSE 2016 VI Brazilian Symposium on Computing Systems Engineering CPU/FPGA

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

D

C

I

E

FU

J

t0

t1

t2

t3

t4

t5

t6

t7

6 Units, t4

VI Brazilian Symposium on Computing Systems Engineering, November 2016

Loop Unrolling - Modulo Scheduling

Page 139: VI Escola de Sistemas Embarcados - INFcaco/ESSE2016/ESSE2016_Nacif.pdf · VI Escola de Sistemas Embarcados ESSE 2016 VI Brazilian Symposium on Computing Systems Engineering CPU/FPGA

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

A B

C E

F

D

G H

I J

B

A

G

F

FU

H

t0

t1

t2

t3

t4

t5

t6

t7

6 Units, t5

VI Brazilian Symposium on Computing Systems Engineering, November 2016

Loop Unrolling - Modulo Scheduling

Page 140: VI Escola de Sistemas Embarcados - INFcaco/ESSE2016/ESSE2016_Nacif.pdf · VI Escola de Sistemas Embarcados ESSE 2016 VI Brazilian Symposium on Computing Systems Engineering CPU/FPGA

[Figure: six unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t7. At the highlighted time step the six units hold C, D, E, I, J, and one unit still labeled FU.]

A new result every 2 cycles: ILP = 5 (10 operations issued per 2 cycles)

Initiation Interval (II) = 2 cycles

6 Units, t6
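
The steady state described here can be checked with a few lines of Python. This is a sketch, not the authors' scheduler. The two operation groups are read off the slides, which alternate between A, B, F, G, H and C, D, E, I, J. Which group falls on even cycles is an assumption made for illustration, and `ops_at` is a hypothetical helper.

```python
II = 2       # initiation interval from the slide, in cycles
UNITS = 6    # functional units available in every cycle

# Steady-state slot of each operation, slot = cycle mod II.
# The phase (which group runs on even cycles) is an assumption.
slots = {
    0: ["A", "B", "F", "G", "H"],
    1: ["C", "D", "E", "I", "J"],
}


def ops_at(cycle):
    """Operations issued in the given cycle once the pipeline is full."""
    return slots[cycle % II]


# No slot may need more units than the array provides.
assert all(len(ops) <= UNITS for ops in slots.values())

total_ops = sum(len(ops) for ops in slots.values())   # 10 operations
ilp = total_ops / II                                  # operations per cycle
print(f"ILP = {ilp}")

for cycle in range(4):
    print(cycle, ops_at(cycle))
```

Each slot uses five of the six units, so one unit stays idle per cycle, consistent with the single unlabeled FU in the figures.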

Loop Unrolling - Modulo Scheduling

Page 141

[Figure: six unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t7, with one time step labeled C0. Configuration C0: operations A and B are placed on two functional units; the other four FUs are unassigned.]

Step shown: placement

Loop Unrolling - Modulo Scheduling

Page 142

[Figure: six unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t7, with time steps labeled C0 and C1. Configuration C0: A and B placed, four FUs unassigned. Configuration C1: C, D, and E placed, three FUs unassigned.]

Step shown: placement and routing

Loop Unrolling - Modulo Scheduling

Page 143

[Figure: six unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t7, with successive time steps labeled C0, C1, C0. Configuration C0: A, B, F, G, H placed, one FU unassigned. Configuration C1: C, D, E placed, three FUs unassigned. Operations are marked as belonging to iteration i or iteration i+1.]

Step shown: placement and routing

Loop Unrolling - Modulo Scheduling

Page 144

[Figure: six unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t7, with successive time steps labeled C0, C1, C0, C1. Configuration C0: A, B, F, G, H placed, one FU unassigned. Configuration C1: C, D, E placed; operations I and J also appear on this slide.]

Step shown: placement and routing

Loop Unrolling - Modulo Scheduling

Page 145

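[Figure: unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t6, with successive time steps labeled C0, C1, C0 and operations marked as iteration i or iteration i+1. A configuration memory stores Configuration C0 and Configuration C1. The physical CGRA architecture is shown executing A, B, F, G, H, with one FU unassigned.]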
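
The role of the configuration memory can be sketched as a small lookup that the array cycles through, one configuration per time step. This is an illustration only. The operation sets come from the slides, but the FU positions inside each configuration and the helper name `run` are assumptions.

```python
from itertools import cycle, islice

# Each configuration assigns one operation to each of the six FUs.
# None marks an unassigned FU. FU positions are illustrative assumptions.
CONFIG_MEMORY = {
    "C0": ["A", "B", "F", "G", "H", None],
    "C1": ["C", "D", "E", "I", "J", None],
}


def run(num_cycles):
    """Yield (time step, configuration name, FU assignment) per cycle."""
    names = cycle(CONFIG_MEMORY)          # C0, C1, C0, C1, ...
    for t, name in enumerate(islice(names, num_cycles)):
        yield t, name, CONFIG_MEMORY[name]


for t, name, fus in run(4):
    busy = [op for op in fus if op is not None]
    print(f"t{t}: {name} -> {busy}")
```

Because only two configurations are needed for II = 2, the same physical units are reused every cycle, with the active configuration selecting which operation each unit performs.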

Loop Unrolling - Modulo Scheduling

Page 146

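[Figure: unrolled copies of the loop dataflow graph (operations A to J) over a time axis t0 to t6, with successive time steps labeled C0, C1, C0 and operations marked as iteration i or iteration i+1. A configuration memory stores Configuration C0 and Configuration C1. The physical CGRA architecture is shown executing C, D, E, I, J, with one FU unassigned.]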

Loop Unrolling - Modulo Scheduling

Page 147

Virtual CGRA on top of a commercial FPGA (Xilinx XC6VLX75T)

[Figure: first virtual CGRA layout, an array of FUs and register files (RF) with a global register.]

FlipFlop 2.5 %
LUTs 14.7 %
Mem Bank 16.0 %
Clock 110 MHz

[Figure: second virtual CGRA layout, a row of FUs (FU, FU, ..., FU).]

FlipFlop 2.7 %
LUTs 17.6 %
Mem Bank 4.5 %
Clock 90 MHz

Virtualization

Page 148

Universidade Federal de Viçosa

Questions?

[email protected]

[email protected]