GACOP JACCA: Jornadas de Arquitectura para el Cálculo y Comunicaciones Avanzadas, February 2004
email: [email protected]
Optimization of the Wavelet Transform for Uniprocessor Architectures
Gregorio Bernabé García
Depto. Ingeniería y Tecnología de Computadores
Universidad de Murcia
30071 Murcia (SPAIN)
The 3D-FWT Encoder
Source video data -> 3D-FWT -> Thresholding -> Quantizer -> Entropy encoder -> Compressed video data
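The slide names the thresholding and quantizer stages without giving their details. As a rough illustration only, the sketch below assumes hard thresholding followed by a uniform quantizer; the function name `threshold_and_quantize` and the parameters `threshold` and `step` are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the talk): hard-threshold the wavelet
 * coefficients, then quantize the survivors with a uniform step.
 * `threshold` and `step` are illustrative parameters. */
static void threshold_and_quantize(const float *coef, int *q, size_t n,
                                   float threshold, float step)
{
    assert(step > 0.0f);
    for (size_t i = 0; i < n; i++) {
        float v = coef[i];
        float mag = v < 0.0f ? -v : v;
        if (mag < threshold) {
            q[i] = 0;                              /* small coefficient: discarded */
        } else {
            int level = (int)(mag / step + 0.5f);  /* round to the nearest step */
            q[i] = v < 0.0f ? -level : level;      /* restore the sign */
        }
    }
}
```

Long runs of zeros produced by the thresholding step are what a run-length or entropy coder would then exploit.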
Proposal (I)
Objectives
– Increase the compression rate
– Maintain the video quality
Several improvements in the quantization and the entropy encoder
– 3D-conscious run-length
– Hexadecimal coding
– Arithmetic coding
Introduction
Proposal (II)
Memory-conscious 3D-FWT exploiting the memory hierarchy
Blocking techniques
Rectangular overlapped partitioning
Advantages
– Spatial locality of memory references
– Reuse of floating point operations
Disadvantages
– L1 and L2 cache misses too high
– Number of floating point operations executed too large
Proposal (III)
Optimize the rectangular overlapped approach
– Reduce the number of FP instructions
– Reduce the pressure on the memory subsystem
Enhancements
– Take advantage of SSE efficiently (Intel C/C++ Compiler)
– Data prefetching and loop unrolling
– Columns vectorization
SSE vectorization by hand
1D-FWT algorithm (n pixels) with Daub-4 as the mother wavelet
/* p must provide two samples beyond index n-1: the last iteration reads p[n+1] */
for (i = 0, j = 0; i < n; i += 2, j++) {
    low[j]  = c0*p[i] + c1*p[i+1] + c2*p[i+2] + c3*p[i+3];
    high[j] = c3*p[i] - c2*p[i+1] + c1*p[i+2] - c0*p[i+3];
}
low[0] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3];
Each wavelet coefficient needs four pixels, four multiplications and three additions.
Four consecutive coefficients:
low[0] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3];
low[1] = c0*p[2] + c1*p[3] + c2*p[4] + c3*p[5];
low[2] = c0*p[4] + c1*p[5] + c2*p[6] + c3*p[7];
low[3] = c0*p[6] + c1*p[7] + c2*p[8] + c3*p[9];
Four coefficients: 16 FP multiplications and 12 FP additions.
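For reference, the scalar loop can be written as a complete C function. This is a minimal sketch: the function name `fwt1d_daub4` is hypothetical, the tap values are the standard orthonormal Daub-4 coefficients (the slides do not state which normalization was used), and the input is assumed to carry two extra border samples so that `p[i+3]` stays in bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Standard orthonormal Daub-4 taps (assumed normalization). */
static const float C0 = 0.4829629131f;
static const float C1 = 0.8365163037f;
static const float C2 = 0.2241438680f;
static const float C3 = -0.1294095226f;

/* Hypothetical helper: one level of the 1D FWT over n pixels.
 * p must hold n + 2 samples (two border samples after the signal);
 * low and high each receive n / 2 coefficients. */
static void fwt1d_daub4(const float *p, size_t n, float *low, float *high)
{
    assert(n % 2 == 0);
    for (size_t i = 0, j = 0; i < n; i += 2, j++) {
        low[j]  = C0*p[i] + C1*p[i+1] + C2*p[i+2] + C3*p[i+3];
        high[j] = C3*p[i] - C2*p[i+1] + C1*p[i+2] - C0*p[i+3];
    }
}
```

With these taps a constant signal yields low-pass coefficients equal to the constant times the square root of 2 and high-pass coefficients of zero, which is a convenient sanity check.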
SSE vectorization by hand
Each 128-bit XMM register holds four single-precision products (bits 0-31, 32-63, 64-95, 96-127):
xmm0 = [ c0*p[0] | c0*p[2] | c0*p[4] | c0*p[6] ]
xmm1 = [ c1*p[1] | c1*p[3] | c1*p[5] | c1*p[7] ]
xmm2 = [ c2*p[2] | c2*p[4] | c2*p[6] | c2*p[8] ]
xmm3 = [ c3*p[3] | c3*p[5] | c3*p[7] | c3*p[9] ]
addps xmm0, xmm1 gives xmm0 = [ c0*p[0]+c1*p[1] | c0*p[2]+c1*p[3] | c0*p[4]+c1*p[5] | c0*p[6]+c1*p[7] ]
addps xmm0, xmm2 gives xmm0 = [ c0*p[0]+c1*p[1]+c2*p[2] | c0*p[2]+c1*p[3]+c2*p[4] | c0*p[4]+c1*p[5]+c2*p[6] | c0*p[6]+c1*p[7]+c2*p[8] ]
addps xmm0, xmm3 gives xmm0 = [ low[0] | low[1] | low[2] | low[3] ]
low[0] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3]
low[1] = c0*p[2] + c1*p[3] + c2*p[4] + c3*p[5]
low[2] = c0*p[4] + c1*p[5] + c2*p[6] + c3*p[7]
low[3] = c0*p[6] + c1*p[7] + c2*p[8] + c3*p[9]
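The register layout above can be reproduced with SSE intrinsics from `<xmmintrin.h>` instead of hand-written assembly. The sketch below is one possible way to do it, not necessarily the author's code: the function name `daub4_low4_sse` is hypothetical, and the even/odd pixels are gathered with `_mm_shuffle_ps`, which is an assumption about how the registers were filled.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; requires an x86 target */

/* Hypothetical helper: computes low[0..3] from p[0..9] with taps c[0..3]. */
static void daub4_low4_sse(const float *p, const float c[4], float *low)
{
    assert(p != NULL && low != NULL);

    __m128 a0 = _mm_loadu_ps(p);      /* p[0] p[1] p[2] p[3] */
    __m128 a1 = _mm_loadu_ps(p + 4);  /* p[4] p[5] p[6] p[7] */
    __m128 b0 = _mm_loadu_ps(p + 2);  /* p[2] p[3] p[4] p[5] */
    __m128 b1 = _mm_loadu_ps(p + 6);  /* p[6] p[7] p[8] p[9] */

    /* Pick every other pixel so each register matches one row of the diagram. */
    __m128 x0 = _mm_shuffle_ps(a0, a1, _MM_SHUFFLE(2, 0, 2, 0)); /* p0 p2 p4 p6 */
    __m128 x1 = _mm_shuffle_ps(a0, a1, _MM_SHUFFLE(3, 1, 3, 1)); /* p1 p3 p5 p7 */
    __m128 x2 = _mm_shuffle_ps(b0, b1, _MM_SHUFFLE(2, 0, 2, 0)); /* p2 p4 p6 p8 */
    __m128 x3 = _mm_shuffle_ps(b0, b1, _MM_SHUFFLE(3, 1, 3, 1)); /* p3 p5 p7 p9 */

    /* Four packed multiplications replace 16 scalar ones. */
    x0 = _mm_mul_ps(x0, _mm_set1_ps(c[0]));
    x1 = _mm_mul_ps(x1, _mm_set1_ps(c[1]));
    x2 = _mm_mul_ps(x2, _mm_set1_ps(c[2]));
    x3 = _mm_mul_ps(x3, _mm_set1_ps(c[3]));

    /* Three packed additions (addps) replace 12 scalar ones. */
    x0 = _mm_add_ps(x0, x1);
    x0 = _mm_add_ps(x0, x2);
    x0 = _mm_add_ps(x0, x3);

    _mm_storeu_ps(low, x0);           /* low[0] low[1] low[2] low[3] */
}
```

Unaligned loads and stores are used so the sketch works on any buffer; aligned variants would be the natural choice when the pixel rows are 16-byte aligned.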
Data prefetching
[Figure: processor (control unit, register file) connected to a memory holding instructions and data.]
While four wavelet coefficients are being calculated:
low[0] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3]
low[1] = c0*p[2] + c1*p[3] + c2*p[4] + c3*p[5]
low[2] = c0*p[4] + c1*p[5] + c2*p[6] + c3*p[7]
low[3] = c0*p[6] + c1*p[7] + c2*p[8] + c3*p[9]
the pixels needed for the next four coefficients are being prefetched:
low[4] = c0*p[8] + c1*p[9] + c2*p[10] + c3*p[11]
low[5] = c0*p[10] + c1*p[11] + c2*p[12] + c3*p[13]
low[6] = c0*p[12] + c1*p[13] + c2*p[14] + c3*p[15]
low[7] = c0*p[14] + c1*p[15] + c2*p[16] + c3*p[17]
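The overlap between computing one group of coefficients and fetching the pixels for the next can be sketched with the `_mm_prefetch` intrinsic. This is an illustration under assumptions: the function name `daub4_low_prefetch` is hypothetical, the prefetch distance of eight pixels simply mirrors the slide, and a single prefetch per group only requests the cache line containing `p[i + 8]`.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0; requires an x86 target */

/* Hypothetical helper: low-pass Daub-4 over n pixels, four coefficients per
 * outer iteration. p must hold n + 2 samples; low receives n / 2 values. */
static void daub4_low_prefetch(const float *p, size_t n, const float c[4],
                               float *low)
{
    assert(n % 8 == 0);
    for (size_t i = 0, j = 0; i < n; i += 8, j += 4) {
        /* Request the pixels of the next group while this one is computed. */
        if (i + 8 < n)
            _mm_prefetch((const char *)(p + i + 8), _MM_HINT_T0);

        /* Four coefficients per outer iteration. */
        for (size_t k = 0; k < 4; k++) {
            const float *q = p + i + 2 * k;
            low[j + k] = c[0]*q[0] + c[1]*q[1] + c[2]*q[2] + c[3]*q[3];
        }
    }
}
```

A prefetch is only a hint, so the results are identical with or without it; whether it pays off depends on the cache line size and on how far ahead the request is issued.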
Columns Vectorization
[Figure: frame grid with columns 0-11 and rows 0-5. The X-wavelet is applied along each row; the Y transform is then applied by columns, first row group ("First Row by Columns"), then the second ("Second Row by Columns").]
– Effective way to apply the transform in the Y dimension
– Locality of references
– The transform was already applied in the X dimension
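One plausible reading of columns vectorization is that the Y-dimension filter is applied to four adjacent columns at once, so that every SSE load touches contiguous pixels of a row. The sketch below follows that reading; the function name `daub4_low_columns`, the row-major layout and the two extra border rows are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; requires an x86 target */

/* Hypothetical helper: Y-dimension Daub-4 low-pass on a row-major image.
 * img holds (rows + 2) x width pixels; low receives (rows / 2) x width. */
static void daub4_low_columns(const float *img, size_t rows, size_t width,
                              const float c[4], float *low)
{
    assert(rows % 2 == 0 && width % 4 == 0);
    const __m128 k0 = _mm_set1_ps(c[0]);
    const __m128 k1 = _mm_set1_ps(c[1]);
    const __m128 k2 = _mm_set1_ps(c[2]);
    const __m128 k3 = _mm_set1_ps(c[3]);

    for (size_t y = 0, j = 0; y < rows; y += 2, j++) {
        for (size_t x = 0; x < width; x += 4) {
            const float *r = img + y * width + x;
            /* Each load reads four neighbouring columns of one row. */
            __m128 acc = _mm_mul_ps(k0, _mm_loadu_ps(r));
            acc = _mm_add_ps(acc, _mm_mul_ps(k1, _mm_loadu_ps(r + width)));
            acc = _mm_add_ps(acc, _mm_mul_ps(k2, _mm_loadu_ps(r + 2 * width)));
            acc = _mm_add_ps(acc, _mm_mul_ps(k3, _mm_loadu_ps(r + 3 * width)));
            _mm_storeu_ps(low + j * width + x, acc);
        }
    }
}
```

Unlike the row direction, no shuffling is needed here: the four lanes belong to four independent columns, and the filter walks down the rows.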
Hyperthreading
[Figure: one set of processor execution resources shared by two architectural states, one per logical processor.]
[Figure: hyperthreaded pipeline with a pair of queues, one per logical processor, between stages: Fetch, Decode, Trace Cache / Microcode ROM, Rename/Allocate, Out-of-Order Schedule/Execute, Retirement. Each logical processor has its own APIC and architectural state; the physical registers are shared.]
Data Domain Decomposition
– Thread-1: video sequence 1
– Thread-2: video sequence 2
[Figure: Thread-1 and Thread-2 mapped onto the two logical processors of the hyperthreaded pipeline.]
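Data domain decomposition, with each thread working on its own video sequence, can be sketched with POSIX threads. This is an illustration under assumptions: the names `job_t`, `encode_sequence` and `encode_two_sequences` are hypothetical, the threading API used in the talk is not stated, and a 1D Daub-4 low-pass with the standard orthonormal taps stands in for the full encoder.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job description: one independent input per thread. */
typedef struct {
    const float *p;   /* n + 2 input samples */
    size_t n;         /* number of pixels, must be even */
    float *low;       /* n / 2 output coefficients */
} job_t;

/* Stand-in for the encoder: 1D Daub-4 low-pass (standard taps assumed). */
static void *encode_sequence(void *arg)
{
    job_t *job = arg;
    const float c0 = 0.4829629131f, c1 = 0.8365163037f;
    const float c2 = 0.2241438680f, c3 = -0.1294095226f;
    assert(job->n % 2 == 0);
    for (size_t i = 0, j = 0; i < job->n; i += 2, j++) {
        const float *q = job->p + i;
        job->low[j] = c0*q[0] + c1*q[1] + c2*q[2] + c3*q[3];
    }
    return NULL;
}

/* Runs both jobs concurrently; returns 0 on success, -1 on failure. */
static int encode_two_sequences(job_t *seq1, job_t *seq2)
{
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, encode_sequence, seq1) != 0)
        return -1;
    if (pthread_create(&t2, NULL, encode_sequence, seq2) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Because the two sequences share no data, the threads need no synchronization beyond the final joins; on a hyperthreaded processor they still compete for the shared execution resources and caches.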
Functional Decomposition
– Thread-1: memory prefetch; Thread-2: 3D-FWT
– Thread-1: 3D-FWT; Thread-2: quantization
– Thread-1: 3D-FWT and Q. Low; Thread-2: Q. High
[Figure: in each case Thread-1 and Thread-2 are mapped onto the two logical processors of the hyperthreaded pipeline.]
3D Wavelet Transform on Uniprocessor Architectures
Gregorio Bernabé García
Depto. Ingeniería y Tecnología de Computadores
Universidad de Murcia
30071 Murcia (SPAIN)