
Lucas Bragança da Silva

Fredy Alves

José Nacif (Presenter)

Ricardo Ferreira (Presenter)

Universidade Federal de Viçosa

Financial Support: Intel Brasil, Intel Labs, CAPES, CNPq, FAPEMIG

VI Escola de Sistemas Embarcados

ESSE 2016 - VI Brazilian Symposium on Computing Systems Engineering

CPU/FPGA Heterogeneous Architectures

“CGRA HARP”


Outline

• Motivation
• FPGA and CPU
• OpenCL and FPGA accelerators
• HARP Platform
• HARP Layers
• Demo
• HARP CGRA


Motivation


Moore's Law continues…

[Figure: since about 2005, single-thread performance, frequency, and power have flattened]

…with multiple cores.


Motivation: IoT and Cloud Computing

• "Coherently-attached FPGA accelerator for Xeon processors in the datacenter which is estimated to have a $1B market opportunity by 2020" - Prabhat K. Gupta, General Manager of Xeon+FPGA Product at Intel Corporation

• Microsoft Catapult: a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling the FPGAs to communicate directly at datacenter scale - IEEE Micro, 2016, "A Cloud-Scale Acceleration Architecture"

• Baidu, Inc. (NASDAQ: BIDU), the leading Chinese-language Internet search provider:
  • Accelerators = greater throughput at low latency while retaining practical power levels.
  • 10-20X performance/watt improvement.
  • Baidu-optimized FPGA platforms are tuned for machine learning applications such as image and speech recognition.


Motivation FPGA


FPGA is scalable!


FPGAs

• Scalable
• Energy efficiency
• Parallel and distributed computing
• Temporal and spatial parallelism
• From low-cost embedded to high-performance cloud


FPGAs and Tools

• Hardware Description Languages
• Compilers
• High-Level Synthesis
• General purpose


FPGAs and Tools

Specific tools for specific applications


CPU and FPGAs

• Heterogeneous applications and heterogeneous hardware

• Real world:
  • HARP - Intel/Altera Platform
  • Microsoft Catapult


FPL 2016 - PK Gupta - Intel

Accelerating DataCenter Workloads


Microsoft Catapult v2

172.6K ALMs, 4 GB DDR3

Round trip across 250,000 machines in 20 microseconds


Microsoft Catapult v2

• Microsoft's FPGA translates Wikipedia in less than a tenth of a second.

• FPGA network - breaking the "chicken and egg" problem: accelerators cannot be added until enough applications need them, but applications will not rely upon the accelerators until they are present in the infrastructure.

• By decoupling the servers and FPGAs, software services that demand more FPGA capacity can draw on FPGAs elsewhere in the datacenter.

[Figure slides: instructions on a CPU vs. temporal and spatial parallelism on an FPGA]

OpenCL

OpenCL example

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  ...
}

[Figure: a row of replicated compute units, PE 0,0 through PE 0,3]

How to build a systolic computer…

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  ...
}

[Figure: a 4x4 grid of replicated compute units, PE 0,0 through PE 3,3]

OpenCL example

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  int row = get_compute_id(0);
  int col = get_compute_id(1);
  ...
}

[Figure: each PE in the 4x4 grid identified by its (row, col) compute id]

OpenCL example

channel float4 ch_bottom[4];

PE() {
  float4 a, b;
  if (row == 0)
    a = read_channel(ch_bottom[col]);
  ...
}

[Figure: the 4x4 PE grid, with the ch_bottom channels feeding the PEs in row 0]

OpenCL example

channel float4 ch_bottom[4];
channel float4 ch_PE_col[4][4];
…
PE() {
  float4 a, b;
  if (row == 0)
    a = read_channel(ch_bottom[col]);
  else
    a = read_channel(ch_PE_col[row-1][col]);
  // ... compute with a ...
  if (row < 3)
    write_channel(ch_PE_col[row][col], b);  // sketch: forward the result up the column
}

[Figure: each PE in the 4x4 grid reads from the PE below it, or from ch_bottom in row 0]

Coarse-Grained Reconfigurable Array: CGRA vs. FPGA

• FPGA: fine-grained (bit-level) configuration, huge bitstream.
• CGRA: word-level configuration, small bitstream.
• The CGRA runs as a virtual layer on top of the FPGA.


HARP - Legal Disclaimer

Copyright (C) 2008-2016 Intel Corporation All Rights Reserved. The source code contained or described herein and all documents related to the source code ("Material") are owned by Intel Corporation or its suppliers or licensors. Title to the Material remains with Intel Corporation or its suppliers and licensors. The Material contains trade secrets and proprietary and confidential information of Intel or its suppliers and licensors. The Material is protected by worldwide copyright and trade secret laws and treaty provisions. No part of the Material may be copied, reproduced, modified, published, uploaded, posted, transmitted, distributed, or disclosed in any way without Intel's prior express written permission.


HARP Prototype Xeon+FPGA* system disclaimer

This talk is about prototype hardware and software which has been made available to universities in the HARP program.

Details of production Xeon+FPGA systems will be made available at a later date.

Results and details in this presentation were generated using pre-production hardware and software, and may not reflect production or future systems.


HARP: Accelerating Workloads using Xeon and a coherently attached FPGA in-socket

QPI¹: 6 GB/s

Heterogeneous architecture with homogeneous platform support

¹ QuickPath Interconnect


HARP-1 – Development platform

• 96 GB RAM
• Xeon, 10 cores
• FPGA Stratix V:
  - 622K LUTs
  - 1M registers
  - 2.5K memory modules (M20K)
  - 512 DSPs

[Board photos: the Stratix V FPGA and the Xeon socket on the HARP-1 development platform]

HARP-1 – USB Programmer


HARP HDL Programming


HARP – General Architecture

[Diagram: overall HARP system architecture]

HARP - Accelerator Abstraction Layer (AAL)

• Set of software tools for development and deployment of systems composed of asymmetric computing resources

• CPUs, GPUs, FPGAs as a server

• An application uses the server by requesting resources


HARP - Accelerator Abstraction Layer (AAL)

• Resource manager: ensures exclusive use of a resource.

• Service-oriented and object-oriented: interface definitions, attributes, and the objects which implement those interfaces.


HARP - Accelerator Abstraction Layer (AAL)

• Service-Oriented Architecture
  • Service: encapsulation of functionality which consumes computing resources.
  • Registrar: registers services and APIs; used to locate and acquire service interfaces.
  • Client: executable that uses a service by acquiring its API from the Registrar.


HARP – AAL Object Communications

• AAL uses asynchronous communication: it returns to the application while the requested service executes in parallel.


HARP – Services, Interfaces, Composition

• Client accesses the service through virtual interfaces published on the Registrar which does not expose the implementation.

• Component objects implement the interface.

class IMyInterface {
public:
  virtual void doThis(void) = 0;
  virtual void doThat(void) = 0;
  virtual ~IMyInterface() {}
};


HARP – Abstraction, Resource Management

• AAL abstracts service instantiation from the application.

• Services can be created dynamically.

• When multiple implementations of a service are available in one or more compute resources, AAL returns the most suitable one.

• AAL Resource Manager controls the allocation and provisioning of compute resources to services.

• Resource management is important for precious and shared resources such as accelerators on FPGAs.


HARP – AAL Service Broker and Registrar

• Service Broker gets the information required to instantiate a service from the Registrar.

• Service libraries are loadable software such as DLLs.


HARP – AAL Service Broker and Registrar

• Client 1 consults Service Broker for Service Compute.

• Service Broker obtains data record describing Service Compute from Service Registrar.

• The Service Broker consults the Resource Manager, which matches available implementations to computing resources.


HARP – AAL Service Broker and Registrar

• Resource Manager returns information to allow Broker to load service package.

• Service broker calls Service factory to instantiate it.


HARP – Core Cache Interface (CCI)

• Interface between AFU and QPI

- Read and write requests to the system coherent memory.

- Coherent memory is mapped to CPU DRAM.

• FPGA implements Intel QPI

- Processor uses QPI to access the system cache.

[Figure: QPI link at 6 GB/s; FPGA-side cache of 64 KB with 64-byte lines]


HARP – Core Cache Interface (CCI)

• Accelerated Function Units (AFUs): accelerate an application kernel in the FPGA.

• Blue dotted box is the multiprocessor boundary.

• Red dotted box is the Cache access domain



HARP – Interface Definitions: Attach points

• QPI-FPGA implements the Caching and Configuration agents.

- The caching agent assures memory coherence.

- The Configuration Agent receives and handles read and write cycles from the processor.

- System Protocol Layer (Virtual Address Translation)


HARP – Interface Definitions: Attach points

• Processor-FPGA is RX.

• FPGA-Processor is TX.

• Designed to accept one read and write per clock cycle.

• An AFU with CCI-E connects via SPL2; read responses are ordered, writes complete out of order.

• SPL2: up to 2GB pinned virtual address space to an AFU.


HARP – Interface Definitions: Attach points

• AFU connected via CCI Standard (CCI-S) or CCI Extended (CCI-E).

• CCI-S uses physical addressing and out of order responses.

• CCI-E uses virtual addressing.

• Intel provides SPL2 IP to translate virtual to physical addresses.

• AFU connected to SPL2 via CCI-E.



Intel HARP "Hello World" Example

Are you ready???


HARP – Accelerator “Hello World”

• AFU capable of adding two CPU memory values:
  - SPL2 RTL for address translation
  - SW application (C++) & AFU RTL (Verilog)

• In this example we will demonstrate the use of:
  - AAL Runtime
  - Service API
  - Example AFU to build a user AFU

AAL Application code: Run method

m_runtimClient->getRuntime()->allocService(dynamic_cast<IBase *>(this), Manifest);
m_Sem.Wait();
if(0 == m_Result){
   MSG("Running Test");
   btVirtAddr pWSUsrVirt = m_pWkspcVirt;  // Address of workspace
   const btWSSize WSLen = m_WkspcSize;    // Length of workspace
   INFO("Allocated " << WSLen << "-byte Workspace at virtual address "
        << std::hex << (void *)pWSUsrVirt);
   // Number of bytes in each of the source and destination buffers (4 MiB in this case)
   btUnsigned32bitInt a_num_bytes = (btUnsigned32bitInt)((WSLen - sizeof(VAFU2_CNTXT)) / 2);

• Allocates the service and a workspace in Device Status Memory (DSM) using allocService().
• If the service is successfully allocated, runs the test.
• Gets the address and the length of the workspace.
• Defines the size of the source and destination buffers in bytes.

AAL Application code: Run method

btUnsigned32bitInt a_num_cl = a_num_bytes / CL(1);  // Number of cache lines in buffer
// VAFU context is at the beginning of the buffer
VAFU2_CNTXT *pVAFU2_cntxt = reinterpret_cast<VAFU2_CNTXT *>(pWSUsrVirt);
// The source buffer is right after the VAFU context
btVirtAddr pSource = pWSUsrVirt + sizeof(VAFU2_CNTXT);
// The destination buffer is right after the source buffer
btVirtAddr pDest = pSource + a_num_bytes;

• Defines the number of cache lines in each buffer in a_num_cl.
• Gets the pointers to the AFU context (pVAFU2_cntxt), to the source buffer (pSource), and to the destination buffer (pDest).
• pDest is pSource plus the size of the source buffer in bytes (a_num_bytes).

AAL Application code: Run method

// Initialize the command buffer
::memset(pVAFU2_cntxt, 0, sizeof(VAFU2_CNTXT));
pVAFU2_cntxt->num_cl  = a_num_cl;
pVAFU2_cntxt->pSource = pSource;
pVAFU2_cntxt->pDest   = pDest;
INFO("Starting SPL Transaction with Workspace");
m_Sem.Wait();

int numa = 3;
int numb = 2;
int *inputs_ADD = (int*)malloc(sizeof(int)*2);
volatile int *addIn = (int*)pSource;
inputs_ADD[0] = numa;
inputs_ADD[1] = numb;
memcpy((void*)addIn, inputs_ADD, sizeof(int)*2);

m_SPLService->StartTransactionContext(TransactionID(), pWSUsrVirt, 100);

• Initializes the AFU context and copies it to the context pointer.
• Defines two numbers (numa, numb) and copies them to the source buffer using memcpy.
• Starts the transaction using StartTransactionContext; this enables the start signal on the AFU and resets it.


AAL Application code: Run method

• The AFU writes its AFU_ID to the DSM.
• The AFU will be running after the CPU reads the AFU_ID from the DSM.

AAL Application code: Run method

// Wait for SPL VAFU to finish code
volatile bt32bitInt done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
while (!done && --count) {
   SleepMilli(delay);
   done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
}
if ( !done ) {
   // Must have dropped out of loop due to count -- never saw update
   ERR("AFU never signaled it was done. Timing out anyway. Results may be strange.\n");
}
int *pu32 = reinterpret_cast<int*>(&pDestCL[0]);
for(int i = 0; i < results_num; i++){
   cout << *pu32 << "\n";
   ++pu32;
}

• Gets a reference to the AFU DONE status bit.
• Waits for done to be set to 1.
• If the AFU does not answer before the time limit, prints an error message.
• If the AFU answers in time, gets a pointer to the destination buffer and prints the results.

AAL Application code: Run method

// Issue Stop Transaction and wait for OnTransactionStopped
INFO("Stopping SPL Transaction");
m_SPLService->StopTransactionContext(TransactionID());
m_Sem.Wait();
}

// Clean up and exit
INFO("Workspace verification complete, freeing workspace.");
m_SPLService->WorkspaceFree(m_pWkspcVirt, TransactionID());
m_Sem.Wait();
m_runtimClient->end();
return m_Result;
}

• After the transaction is done, stops the transaction.
• Stopping resets the AFU and sets the start signal to 0.
• Frees the workspace.

AAL Application code: Run method

• Based on the Sudoku example.
• SPL RTL: provided by Intel.
• AFU RTL.

The AFU

afu_user: implements the communication interface with the SPL module.

module afu_user #(CACHE_WIDTH = 512)
(
  input clk,
  input reset_n,

  // Read Request
  output [ADDR_LMT-1:0] rd_req_addr,
  output [MDATA-1:0]    rd_req_mdata,
  output reg            rd_req_en,
  input                 rd_req_almostfull,

  // Read Response
  input                   rd_rsp_valid,
  input [MDATA-1:0]       rd_rsp_mdata,
  input [CACHE_WIDTH-1:0] rd_rsp_data,

• CACHE_WIDTH is the size of the cache line in bits.
• The SW application starts the transaction (reset).
• The read request signals are used to request a read from the source buffer.
• The read response carries the read data; in this case we only use rd_rsp_data, which holds the cache line for the last read request.

The AFU USER Interface

  // Write Request
  output [ADDR_LMT-1:0]    wr_req_addr,
  output [MDATA-1:0]       wr_req_mdata,
  output [CACHE_WIDTH-1:0] wr_req_data,
  output reg               wr_req_en,
  input                    wr_req_almostfull,

  // Write Response
  input             wr_rsp0_valid,
  input [MDATA-1:0] wr_rsp0_mdata,
  input             wr_rsp1_valid,
  input [MDATA-1:0] wr_rsp1_mdata,

  // Start input signal
  input start,

  // Done output signal
  output reg done,

  // Control info from software
  input [511:0] afu_context);

• The write request signals mirror the read request signals, for cache write operations.
• wr_rsp1_valid is used to identify when the writing process finishes.
• "start" is the signal set by the CPU in the AFU to start a transaction.
• "done" is the signal sent to the CPU to indicate that the transaction processing is over.

The AFU USER Interface

• Read data from the source buffer, process (AFU), write the results back to the destination buffer.


The AFU Control States: FSM_IDLE

FSM_IDLE: begin
  if(start) begin
    fsm_ns = FSM_RD_REQ;
  end
end

• Waits for the start signal to be set to one by the CPU.
• Changes to FSM_RD_REQ to start the reading process from the source buffer.
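The FSM_* fragments on these slides are individual case items from the AFU's next-state logic; the surrounding machinery never appears. A minimal sketch of how they could be wired together, in the usual two-process style (fsm_ns comes from the slides; fsm_cs, the state encodings, and the default strobe values are assumptions):

// Sketch only: encodings and defaults are assumptions, not from the slides.
localparam FSM_IDLE    = 3'd0, FSM_RD_REQ   = 3'd1, FSM_RD_RSP = 3'd2,
           FSM_RUN_ADD = 3'd3, FSM_WAIT_ADD = 3'd4, FSM_WR_REQ = 3'd5,
           FSM_WR_RSP  = 3'd6, FSM_DONE     = 3'd7;

reg [2:0] fsm_cs, fsm_ns;  // current and next state

// Sequential part: latch the next state every clock
always @(posedge clk) begin
  if(!reset_n) fsm_cs <= FSM_IDLE;
  else         fsm_cs <= fsm_ns;
end

// Combinational part: the case items shown on these slides live here
always @(*) begin
  fsm_ns       = fsm_cs;  // default: hold the current state
  rd_req_en    = 1'b0;    // default: no read request this cycle
  wr_req_en    = 1'b0;    // default: no write request this cycle
  addr_cnt_inc = 1'b0;
  addr_cnt_clr = 1'b0;
  done         = 1'b0;
  case(fsm_cs)
    FSM_IDLE: begin
      if(start) fsm_ns = FSM_RD_REQ;
    end
    // FSM_RD_REQ ... FSM_DONE: the case items shown on the following slides.
    default: fsm_ns = FSM_IDLE;
  endcase
end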

The AFU Control States: FSM_RD_REQ

FSM_RD_REQ: begin
  // If there's no more data to copy
  if(addr_cnt >= num_clines) begin
    fsm_ns = FSM_RUN_ADD;
    addr_cnt_clr = 1'b1;
  end
  // There's more data to copy
  else begin
    // Issue rd_req
    if(!rd_req_almostfull) begin
      rd_req_en = 1'b1;
      fsm_ns = FSM_RD_RSP;
    end
  end
end

• addr_cnt keeps track of which line is being read.
• If addr_cnt reaches the number of lines to be read (num_clines), changes state to run the user AFU.
• If addr_cnt is less than the number of lines to be read and the read buffer is not full, sends a read request to the SPL (rd_req_en) and changes state to wait for the read response.

The AFU Control States: FSM_RD_REQ

always@(posedge clk) begin
  if(rd_rsp_valid) begin
    case(addr_cnt)
      'd0: begin
        inputs_add <= rd_rsp_data;
      end
    endcase // case (addr_cnt)
  end // if (rd_rsp_valid)
end // always@ (posedge clk)

adder add0(
  .clk(clk),
  .start(start),
  .numA(inputs_add[31:0]),
  .numB(inputs_add[63:32]),
  .result(w_outGrid),
  .done(w_done)
);

• This always block waits for a response (when rd_rsp_valid is 1) and then saves the data from the source buffer (rd_rsp_data), in this case into inputs_add.
• inputs_add is connected to the input of the user AFU, which in this case is the adder.

The AFU Control States: FSM_RD_RSP

FSM_RD_RSP: begin
  // Receive rd_rsp, put read data into data_buf
  if(rd_rsp_valid) begin
    addr_cnt_inc = 1'b1;
    fsm_ns = FSM_RD_REQ;
  end
end

• Waits for the response.
• addr_cnt_inc is set to one; this increases addr_cnt by 1, which means that the next line in the source buffer will be read.
• Goes back to FSM_RD_REQ.

The AFU Control States: FSM_RD_RSP

// --- Address counter
reg [31:0] addr_cnt;
always @ (posedge clk) begin
  if(!reset_n)
    addr_cnt <= 0;
  else if(addr_cnt_inc)
    addr_cnt <= addr_cnt + 1;
  else if(addr_cnt_clr)
    addr_cnt <= 'd0;
end

• This always block is responsible for controlling the changes in addr_cnt.
• When addr_cnt_inc is set to 1, increases addr_cnt by 1 to go to the next buffer line.
• When addr_cnt_clr is set to 1, clears addr_cnt.

The AFU Control States: FSM_RUN_ADD

FSM_RUN_ADD: begin
  t_start = 1'b1;
  fsm_ns = FSM_WAIT_ADD;
  n_cnt = 'd0;
end

adder add0(
  .clk(clk),
  .start(t_start),
  .numA(inputs_add[31:0]),
  .numB(inputs_add[63:32]),
  .result(w_outGrid),
  .done(w_done)
);

• Sets t_start, connected to the adder, to 1.
• The adder starts.
• Goes to the state that waits for the adder to finish.

The AFU Control States: FSM_WAIT_ADD

FSM_WAIT_ADD: begin
  if(w_done | w_error) begin
    fsm_ns = FSM_WR_REQ;
  end
end

adder add0(
  .clk(clk),
  .start(t_start),
  .numA(inputs_add[31:0]),
  .numB(inputs_add[63:32]),
  .result(w_outGrid),
  .done(w_done)
);

• Waits for the w_done wire, connected to the done signal of the adder, to be set to one, meaning that the adder finished processing.
• When finished, goes to the state that starts writing results to the destination buffer.
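The adder module itself never appears on the slides, only its instantiation. A minimal sketch of a user AFU with this port list, assuming a single-cycle registered add and 32-bit operands (the real example, derived from the Sudoku AFU, may differ):

// Sketch only: the slides show the instantiation, not the body.
module adder(
  input             clk,
  input             start,
  input      [31:0] numA,
  input      [31:0] numB,
  output reg [31:0] result,
  output reg        done
);
  // Register the sum and raise done one cycle after start is asserted
  always @(posedge clk) begin
    if(start) begin
      result <= numA + numB;  // the actual "Hello World" computation
      done   <= 1'b1;         // tells the control FSM the result is valid
    end else begin
      done   <= 1'b0;
    end
  end
endmodule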

The AFU Control States: FSM_WR_REQ

FSM_WR_REQ: begin
  if(addr_cnt >= num_clines) begin
    fsm_ns = FSM_DONE;
  end
  else if(!wr_req_almostfull) begin
    wr_req_en = 1'b1;  // issue wr_req
    fsm_ns = FSM_WR_RSP;
  end
end

• Requests a write: the data to be written to the destination buffer is the output of the adder.
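The slides never show how the write request lines themselves are fed. One plausible hookup, assuming the adder result is zero-padded into a cache line and the destination region starts right after the num_clines source lines (the address arithmetic and padding are guesses for illustration):

// Sketch only: address arithmetic and padding are assumptions.
assign wr_req_addr  = num_clines + addr_cnt;                  // write past the source region
assign wr_req_data  = {{(CACHE_WIDTH-32){1'b0}}, w_outGrid};  // result in the low 32 bits
assign wr_req_mdata = {MDATA{1'b0}};                          // metadata unused in this example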

The AFU Control States: FSM_WR_RSP

FSM_WR_RSP: begin
  if(wr_rsp0_valid | wr_rsp1_valid) begin
    fsm_ns = FSM_WR_REQ;
    addr_cnt_inc = 1'b1;  // address counter ++
  end
end

• Waits for the write response, then increments the address counter and returns to FSM_WR_REQ.

The AFU Control States: FSM_DONE

FSM_DONE: begin
  done = 1'b1;
  fsm_ns = FSM_DONE;
end

• Sets done to one, which finishes the transaction, stops the SPL, and sends the done signal to the CPU.

Collision detection algorithm

• Detects collisions between rigid bodies in a space and calculates the results of these collisions.
• Used in a wide variety of applications such as games, simulations, and robotics.
• Implemented in engines. In our case we integrate the HARP platform with ODE (Open Dynamics Engine), an open-source engine.


Case study: spheres collision detection

• Inputs: position, speed, and form of bodies in space.
• Outputs: contact points (potential, fake, and true); collision results (new positions for the spheres).

[Figure: in-game gameplay from the game Besiege]

The system is composed of:

• The ODE application integrated with AAL.
• The FPGA with the collision detection AFU and the SPL for address translation.
• A source buffer to hold input data for the AFU.
• A destination buffer to hold the collision detection results for the application.

For each simulation step:

1. The CPU sends the collision data to the source buffer.
2. The CPU sends the start and reset signals to the SPL, which propagates them to the AFU.
3. The AFU sends its AFU ID to the source buffer; the CPU reads it, indicating that the AFU started.
4. After it finishes processing the collisions, the AFU sends the results to the destination buffer.
5. The AFU indicates to the CPU that it is done processing the transaction.
6. The CPU retrieves the results from the destination buffer.


FPL 2016 - PK Gupta - Intel
Accelerating DataCenter Workloads

Two application examples:

• FPGA Board Evaluation - DAC, 2016
• DNA accelerator for short sequences on HARP/Intel - FCCM, 2016

[Figure: inside a PE of the DNA accelerator - FCCM, 2016]

Applications mapped on HARP

• "Runtime Parameterizable Regular Expression Operators for Databases"
  • Trade-off between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL).

• "High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform"
  • 2.9x and 1.9x speedups compared with CPU-only and FPGA-only baselines.
  • 2.3x vs. an FPGA implementation for sorting.


Ongoing Work

• Previous work:
  • Modulo Scheduling
  • Virtual CGRA
• High-level stream computation mapped onto HARP

"Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone." - IBM Big Data

[Figure: a sequential loop over operations A-J becomes a parallel dataflow graph connected by streams, which the runtime maps onto a physical parallel architecture of functional units (FUs): the CGRA]

Loop Unrolling - Modulo Scheduling

[Figure sequence: the dataflow graph of one iteration (operations A-J) is replicated for iterations i, i+1, i+2, and i+3, and the iterations are overlapped.]

Overlap iterations! At steady state, all ten operations execute at the same time: a throughput of one result per clock cycle, ILP = 10, on a physical architecture with 10 FUs.

With only 6 FUs for the 10 operations, the schedule is time-multiplexed: at t3 the units execute B, A, G, F, and H; at t4 they execute D, C, I, E, and J; and the pattern repeats at t5, t6, and so on. A new result is produced every 2 cycles: ILP = 5, with an Initiation Interval (II) of 2 cycles.

Placement and routing encode this schedule as two configurations, C0 and C1, stored in the CGRA's configuration memory. The physical architecture alternates C0, C1, C0, C1, …, overlapping iterations i and i+1.
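The throughput above is the resource-constrained lower bound on the initiation interval used in modulo scheduling; a quick check with the slide's numbers (10 operations, 6 FUs):

\mathrm{ResMII} = \left\lceil \frac{N_\mathrm{ops}}{N_\mathrm{FU}} \right\rceil = \left\lceil \frac{10}{6} \right\rceil = 2 \text{ cycles}, \qquad \mathrm{ILP} = \frac{N_\mathrm{ops}}{\mathrm{II}} = \frac{10}{2} = 5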

Virtual CGRA on top of a commercial FPGA

[Figure: two virtual CGRA layouts built from FUs with local register files (RF) and a global register file]

Resource usage on a Xilinx XC6VLX75T:

• Layout 1: FlipFlops 2.5%, LUTs 14.7%, Mem Banks 16.0%, Clock 110 MHz
• Layout 2: FlipFlops 2.7%, LUTs 17.6%, Mem Banks 4.5%, Clock 90 MHz


Virtualization


Questions?

ricardo@ufv.br

jnacif@ufv.br