[IEEE 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques...

High Throughput Pipelined 2D Discrete Cosine Transform for Video Compression

Ekta Aggrawal Electronics & Telecommunication Engineering

SVKM's NMIMS Shirpur, Maharashtra, India [email protected]

Abstract- This paper proposes an architecture and Verilog design of a fast pipelined two-dimensional discrete cosine transform (2D DCT) with quantization on an FPGA, which can be used as a core in video compression hardware. The design uses highly parallel and heavily pipelined circuits to increase throughput and to remain platform independent, whether an implementation uses an FPGA or an ASIC. The scheme incorporates a dual-redundant input image memory, 45 stages of pipelining, and an optimized controller design, yielding a throughput of one coefficient per clock cycle at 100 MHz. A speed improvement of 30 percent has been achieved, and hardware resources are saved by reducing the number of arithmetic operators. The design targets the Xilinx Spartan 3E XC3S1500E FPGA.

Keywords: Video compression, 2D-DCT, quantization, FPGA, pipelining.

Nishant Kumar Electronics & Communication Engineering

Bansal Institute of Science and Technology, Bhopal, Madhya Pradesh, India [email protected]

978-1-4799-2900-9/14/$31.00 ©2014 IEEE

I. INTRODUCTION

Much research has been done on the transmission of video streams, since video has very demanding quality of service (QoS) requirements. A video stream is compressed by a video encoding mechanism, as specified in standards such as MPEG2, MPEG4, H.264, and JPEG2000 [1], before it enters the transmitter module. After compression, the video bit rate is drastically reduced. Effective high-throughput video compression algorithms are therefore always in great demand, and a large body of work on image and video compression already exists. The design presented in this paper demonstrates experimentally an efficient high-throughput compression technique for high-quality video. Without a pipelined architecture, only 3 frames per second can be processed, whereas with the 45-stage pipeline 40 frames per second can be processed, as the timing waveforms relative to the clock show.

This type of design can be used where High-Definition (HD) signals demand high data rates, for example in wireless High-Definition Multimedia Interface (HDMI) links and in video post-processing of HDMI signals. HD signals carry a large amount of data per frame, and sending them uncompressed over a wireless link is difficult: the maximum rate of IEEE 802.11a/n is too limited for this kind of application, and 5 GHz and 60 GHz technologies still cannot support uncompressed HD video, because the bit rate is very high and low latency is required. Tools for video compression are therefore required for those applications. A pipelined and parallel DCT processor that performs compression is proposed here; its latency of 0.45 µs is very low, and it sustains a high frame rate of 30 to 40 frames per second.

The rest of the paper is organized as follows. Section II presents the quantized DCT and the high-level architecture. Section III describes the proposed system architecture: the modified pipelined and parallel signed adder and signed multiplier, the dual RAM, and the ROM. Section IV compares the results of the proposed design with previous work, and Section V concludes.

II. DISCRETE COSINE TRANSFORM AND QUANTIZATION

A. 2D Discrete Cosine Transform and Quantization (DCTQ)

A discrete cosine transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. Signal information is mostly concentrated in a few low-frequency components of the DCT. Over the years, a considerable amount of research has gone into proposing new algorithms for the DCT [2,3] and implementing them on general-purpose computers, DSPs, and ASICs. The direct 2-D approach [4] results in less parallelism, whereas the separable row-column 1-D approach [5] yields a faster algorithm. Fast algorithms [6] with a minimum number of multiplications are often realized by flexible software approaches on DSPs. A high-speed DSP can meet the speed requirement, but at a high hardware cost because of the inherent complexity of its multipliers. A linear, highly pipelined, parallel algorithm and architecture for 2D-DCT and quantization on FPGAs [15] has been proposed and implemented in [7,8]. This architecture eliminates or minimizes the limitations cited in the earlier references. The 2D-DCT of an 8 x 8 pixel block of an image is defined by (1).


$$
\mathrm{DCT}(u,v) = \frac{1}{4}\,c(u)\,c(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \tag{1}
$$

where $f(x, y)$ is the pixel intensity, $c(u) = 1/\sqrt{2}$ for $u = 0$ and $c(u) = 1$ for $u = 1$ to $7$, and $c(v)$ is defined in the same way. The DCT can be expressed conveniently in matrix form as $\mathrm{DCT} = C X C^{T}$, where $X$ is the input image matrix, $C$ is the cosine coefficient matrix, and $C^{T}$ is its transpose, with the constants $(1/2)c(u)$ and $(1/2)c(v)$ absorbed in the $C$ and $C^{T}$ matrices respectively. For a clear understanding, the DCT may be expressed in an expanded form as shown in (2).

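$$
\begin{aligned}
\mathrm{DCT} &=
\begin{bmatrix}
c_{00} & c_{01} & \cdots & c_{07} \\
c_{10} & c_{11} & \cdots & c_{17} \\
\vdots & \vdots & \ddots & \vdots \\
c_{70} & c_{71} & \cdots & c_{77}
\end{bmatrix}
\begin{bmatrix}
x_{00} & x_{01} & \cdots & x_{07} \\
x_{10} & x_{11} & \cdots & x_{17} \\
\vdots & \vdots & \ddots & \vdots \\
x_{70} & x_{71} & \cdots & x_{77}
\end{bmatrix}
\begin{bmatrix}
c_{00} & c_{10} & \cdots & c_{70} \\
c_{01} & c_{11} & \cdots & c_{71} \\
\vdots & \vdots & \ddots & \vdots \\
c_{07} & c_{17} & \cdots & c_{77}
\end{bmatrix} \\
&=
\begin{bmatrix}
p_{00} & p_{01} & \cdots & p_{07} \\
p_{10} & p_{11} & \cdots & p_{17} \\
\vdots & \vdots & \ddots & \vdots \\
p_{70} & p_{71} & \cdots & p_{77}
\end{bmatrix}
\begin{bmatrix}
c_{00} & c_{10} & \cdots & c_{70} \\
c_{01} & c_{11} & \cdots & c_{71} \\
\vdots & \vdots & \ddots & \vdots \\
c_{07} & c_{17} & \cdots & c_{77}
\end{bmatrix}
\end{aligned}
\tag{2}
$$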
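As a quick check on the matrix form, the following Python sketch evaluates (1) literally and compares it with the two-step product of (2), first P = C X and then P C^T. It is a floating-point software model written for illustration, not the paper's Verilog; the helper names and the ramp test block are assumptions introduced here.

```python
import math

N = 8  # block size used in the paper


def c(k):
    """Normalisation factor c(k) from equation (1)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct_direct(f):
    """Equation (1) evaluated literally: a double sum per coefficient."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (f[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def cosine_matrix():
    """Matrix C with the constant (1/2)c(u) absorbed into row u."""
    return [[0.5 * c(u) * math.cos((2 * x + 1) * u * math.pi / 16)
             for x in range(N)] for u in range(N)]


def matmul(a, b):
    """Plain 8 x 8 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_matrix(x):
    """Equation (2): P = C X, then DCT = P C^T."""
    cm = cosine_matrix()
    p = matmul(cm, x)
    return matmul(p, transpose(cm))


# Hypothetical test block (a simple ramp), not data from the paper.
block = [[x * 8 + y for y in range(N)] for x in range(N)]
direct = dct_direct(block)
via_matrix = dct_matrix(block)
max_diff = max(abs(direct[u][v] - via_matrix[u][v])
               for u in range(N) for v in range(N))
```

The two routes should agree to within floating-point rounding. The matrix route needs two 8 x 8 products instead of a 64-term sum per coefficient, which is what makes the separable form attractive for hardware.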

Quantized outputs can be obtained by dividing each of the 64 DCT coefficients by the corresponding quantization table values given in the standards [14] as per the expression (3):

DCTQ (u,v) = DCT (u,v) / q (u,v); u, v = 0 to 7 (3)
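Expression (3) can be sketched in a few lines of Python. The quantization table and the coefficient block below are hypothetical placeholders, not the table from the cited standard and not results from the paper, and rounding to the nearest integer is an assumption, since the text only states the division.

```python
def quantize(dct, q):
    """Expression (3): divide each coefficient by its table entry.

    Rounding to the nearest integer is an assumption; the text only
    states the division.
    """
    return [[round(dct[u][v] / q[u][v]) for v in range(8)] for u in range(8)]


# Hypothetical quantization table whose step grows with frequency.
# It is a placeholder, not the table given in the standard.
q_table = [[8 + 2 * (u + v) for v in range(8)] for u in range(8)]

# Hypothetical DCT coefficients with energy concentrated at low frequencies.
dct_block = [[240.0 / (1 + u + v) ** 2 for v in range(8)] for u in range(8)]

dctq = quantize(dct_block, q_table)
nonzero = sum(1 for row in dctq for value in row if value != 0)
```

With inputs of this shape, most high-frequency outputs quantize to zero, which is the property the later entropy-coding stages of a codec rely on.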

These stages can be pipelined in such a way that one DCTQ output is generated every clock cycle. In this paper, massive pipelining and parallel processing are employed to reach a high frame rate. The total latency obtained when the design is implemented on Spartan 3E hardware is less than 1 µs (0.44 µs), and it is compared with that of a traditional DCTQ processor.

B. High Level System Architecture

The entire system architecture to be implemented in the FPGA is shown in Fig. 1. Input data is inserted into the system sequentially, 8 bits at a time. Many DCT designs insert the input to the DCT in parallel [9],[10],[11]. This is ideal for DCT computation because it takes only one clock cycle to load the data into the 1D-DCT unit. In the sequential manner, it takes 8 clock cycles to insert a set of data (8 points) into the DCT unit. The sequential architecture is chosen to save I/O ports on the FPGA chip. The 2D-DCT architecture with quantization used in this paper is shown in Fig. 2. The 2D-DCT module is modified from [12], which also feeds the data sequentially into the module. The 2D-DCT architecture is thus divided into two 1D-DCT modules and one transpose buffer. The same 1D-DCT module is used twice. The transpose buffer acts as a temporal barrier between the first and the second 1D-DCT. It is built from static RAM with two sets of data and address buses, one for the read process and the other for the write process, which is why it is referred to as dual.

Fig. 1 System architecture (block diagram: 1D-DCT, transpose buffer, 1D-DCT, quantizer, and zig-zag buffer in sequence over 12-bit data paths, with a controller supplying the buffer addresses)
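The row-column decomposition described above can be modeled in software as two passes of the same 1D routine separated by a transpose. The sketch below is a floating-point illustration under that reading; the function names are hypothetical, and it ignores the fixed-point word lengths and timing of the actual hardware.

```python
import math

N = 8


def dct_1d(samples):
    """8-point 1D DCT with the (1/2)c(u) scaling of the matrix form."""
    out = []
    for u in range(N):
        scale = 0.5 * (1 / math.sqrt(2) if u == 0 else 1.0)
        out.append(scale * sum(s * math.cos((2 * x + 1) * u * math.pi / 16)
                               for x, s in enumerate(samples)))
    return out


def dct_2d_row_column(block):
    """First 1D pass, transpose buffer, second 1D pass."""
    # First pass: transform each row of the input block.
    first = [dct_1d(row) for row in block]
    # Transpose buffer: written row by row, read column by column.
    buffered = [list(col) for col in zip(*first)]
    # Second pass reuses the same 1D routine on the buffered data.
    second = [dct_1d(row) for row in buffered]
    # second[v][u] holds DCT(u, v); transpose once more to index as [u][v].
    return [list(col) for col in zip(*second)]


# Hypothetical flat block: only the DC coefficient should be non-zero.
flat = [[16] * N for _ in range(N)]
flat_dct = dct_2d_row_column(flat)
```

Because both passes call the same `dct_1d`, the model mirrors the statement that the same 1D-DCT module is used twice, with the buffer supplying the reordering between them.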

III. PROPOSED SYSTEM ARCHITECTURE

A. System Architecture

The image to be processed is input block by block from a host computer, such as one with a Pentium processor, into the DCTQ processor, where the discrete cosine transform is performed, followed by quantization [16]. The application is to receive a burst of image/video data and apply a transform such as the DCT followed by quantization in order to compress a picture. Fig. 2 presents the block diagram of the proposed high-level system design. The DCTQ processor can be viewed as a black box with inputs and outputs defined to suit the application requirements. Based on the emerging details, specifications are formulated. The blocks used in the DCTQ design are examined next.

B. Modified Adder

The parallel signed adder shown in Fig. 3 has a simple algorithm. It has been proposed for use in the DCTQ application, where processing speed has the highest priority. The signed addition can be realized with seven 2-input adders and five pipeline stages. In the first stage, four 12-bit two's complement adders add the 8 input numbers in pairs. They work concurrently, thereby speeding up the process.

Fig. 2 Proposed high-level system architecture (labels recovered from the diagram: input source file (8 x 8 image block), dual RAM, and blocks with five-, six-, seven-, and eight-stage pipelining)


Fig. 3 Modified adder (inputs n0[11:0] to n7[11:0]; first and second stages with LSB and MSB registers clocked by clk(1), clk(2) and clk(3), clk(4); third stage with LSB register)

The adders have internal pipeline registers. The clock inputs are marked clk(1), clk(2), etc., and correspond to the internal pipeline registers. The LSBs are added at the first clock pulse, clk(1), and the MSBs are added at the next clock pulse, clk(2), along with the carry generated by the LSB addition. In the second stage, the four 13-bit outputs of the first stage are added using two 2-input adders. Their LSBs and MSBs are added on the arrival of clock pulses clk(3) and clk(4) respectively. In the third stage, the LSBs of the two 14-bit inputs are added on the arrival of clock pulse clk(5). Subsequently, the MSBs are added along with the carry generated while adding the LSBs, producing the 15-bit final result [18].
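A behavioural Python sketch of this adder tree may help in following the bit widths (12, 13, 14, then 15 bits). It models each 2-input addition as an LSB step followed by an MSB step that consumes the carry. The split point between the LSB and MSB halves is not given in the text, so the half-width split used here is an assumption, as are all function names.

```python
def to_unsigned(value, width):
    """Two's complement encoding of a signed integer in `width` bits."""
    return value & ((1 << width) - 1)


def to_signed(bits, width):
    """Decode a `width`-bit two's complement pattern."""
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits


def split_add(a, b, width):
    """Add two signed `width`-bit numbers, giving a (width + 1)-bit sum.

    The low half is added first, then the high half plus the carry.
    The half-width split point is an assumption.
    """
    out_width = width + 1
    # Sign-extend both operands to the output width.
    ua, ub = to_unsigned(a, out_width), to_unsigned(b, out_width)
    low_bits = out_width // 2
    low_mask = (1 << low_bits) - 1
    low_sum = (ua & low_mask) + (ub & low_mask)              # LSB step
    carry = low_sum >> low_bits
    high_sum = (ua >> low_bits) + (ub >> low_bits) + carry   # MSB step
    bits = (high_sum << low_bits) | (low_sum & low_mask)
    return to_signed(bits & ((1 << out_width) - 1), out_width)


def adder_tree(values):
    """Sum eight 12-bit inputs with 4 + 2 + 1 = 7 two-input adders."""
    if len(values) != 8:
        raise ValueError("expected eight inputs")
    width = 12
    level = list(values)
    while len(level) > 1:
        # One tree stage: adjacent pairs are added concurrently in hardware.
        level = [split_add(level[i], level[i + 1], width)
                 for i in range(0, len(level), 2)]
        width += 1
    return level[0]


# Hypothetical inputs within the 12-bit signed range.
total = adder_tree([100, -200, 300, -400, 500, -600, 700, -800])
```

Each tree level grows the word by one bit, so the extreme cases (eight copies of the largest or smallest 12-bit value) still fit in the 15-bit result.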

C. Modified Multiplier

This is a new algorithm developed for implementing DCTQ on an FPGA or as an ASIC, with an eye on achieving as high a throughput as possible. The multiplier design presented in Fig. 4 incorporates a high degree of parallelism and eight levels of pipelining. As an example, the multiplier multiplies two signed numbers n1 and n2, one of 11 bits and the other of 8 bits. The result is 19 bits wide, in two's complement. The multiplication is done primarily on the magnitudes of the two numbers; therefore, the sign and the magnitude are first separated, and only the magnitude is then processed. The sign is dealt with separately using an exclusive-OR gate [18].
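The sign-and-magnitude approach can be illustrated with a short behavioural model in Python. It is a sketch of the idea only: the shift-and-add loop stands in for whatever partial-product structure the hardware uses, the eight pipeline levels are not modeled, and the function names are hypothetical.

```python
def sign_magnitude_multiply(n1, n2):
    """Multiply an 11-bit and an 8-bit signed number.

    Returns the product as a 19-bit two's complement bit pattern.
    """
    if not -(1 << 10) <= n1 < (1 << 10):
        raise ValueError("n1 must fit in 11 bits, two's complement")
    if not -(1 << 7) <= n2 < (1 << 7):
        raise ValueError("n2 must fit in 8 bits, two's complement")
    # Separate sign and magnitude of each operand.
    sign1, mag1 = n1 < 0, abs(n1)
    sign2, mag2 = n2 < 0, abs(n2)
    # Unsigned shift-and-add multiply on the magnitudes only.
    product = 0
    for bit in range(8):
        if (mag2 >> bit) & 1:
            product += mag1 << bit
    # The sign of the result is the exclusive OR of the operand signs.
    negative = sign1 ^ sign2
    signed_product = -product if negative else product
    # Encode as a 19-bit two's complement pattern.
    return signed_product & ((1 << 19) - 1)


def decode_19bit(bits):
    """Interpret a 19-bit pattern as a signed integer."""
    return bits - (1 << 19) if bits & (1 << 18) else bits


# Hypothetical operands chosen for illustration.
pattern = sign_magnitude_multiply(-5, 7)
value = decode_19bit(pattern)
```

Working on magnitudes keeps the partial products unsigned, so no sign extension is needed inside the array; the single XOR and a final negation restore the sign.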

D. Dual Address ROM and Dual RAM

In DCTQ applications, the cosine transform coefficients must be computed from a block of data, the cosine terms (C), and their transpose (C^T). Instead of using two separate single-address ROMs to store the cosine values and their transpose, a better approach in terms of chip area is to use a single ROM with dual addressing, since the contents of C and C^T are precisely the same [17].
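One way to picture the dual-address ROM is a single 64-entry table with two read ports, where the transpose port swaps row and column before indexing. The Python sketch below assumes a row-major address of the form row * 8 + col and stores floating-point values; the address layout, the class name, and the number format are illustrative assumptions, not details from the paper.

```python
import math


class DualAddressRom:
    """One stored copy of C, read through two independent addresses."""

    def __init__(self):
        # Cell row * 8 + col holds C[row][col], with (1/2)c(u) absorbed.
        self._cells = []
        for u in range(8):
            cu = 1 / math.sqrt(2) if u == 0 else 1.0
            for x in range(8):
                self._cells.append(
                    0.5 * cu * math.cos((2 * x + 1) * u * math.pi / 16))

    def read(self, addr_c, addr_ct):
        """Return (C[r][c], CT[r][c]) for two 6-bit addresses r * 8 + c.

        The transpose port swaps row and column and then indexes the
        same storage, so no second table is needed.
        """
        row, col = divmod(addr_ct, 8)
        return self._cells[addr_c], self._cells[col * 8 + row]


rom = DualAddressRom()
# C[1][3] and CT[3][1] refer to the same stored cell.
c_value, ct_value = rom.read(1 * 8 + 3, 3 * 8 + 1)
```

The saving is the second table: both operands of the matrix products come from one memory, at the cost of a second address port.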

Fig. 4 Modified multiplier (inputs n1 and n2; first, second, and third stages with LSB, MSB, and sign pipeline registers)

The need for a dual RAM with a particular memory organization arises from the design application, DCTQ. The dual RAM consists of two RAMs, each of which stores image information. Initially, one of the two memory buffers, RAM 1, is filled; once it is full, the incoming image information is written to the second RAM. While the second memory, RAM 2, is being written, RAM 1 is read concurrently to compute the DCTQ coefficients. To compute DCTQ, a block of image data consisting of 64 pixels must be written. Eight clock cycles are required to write a block, since eight bytes can be written per cycle. One pixel occupies one byte for a monochrome image and three bytes for a color image. This design can process monochrome pictures or color motion pictures.
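The alternating write and read roles of the two RAMs can be sketched as a small Python model. It assumes monochrome data (one byte per pixel) and treats each call to `write` as one clock cycle carrying eight bytes; the class and method names are hypothetical and the model does not represent the actual memory timing.

```python
class DualRam:
    """Two 64-byte banks: one is written while the other is read."""

    BLOCK_BYTES = 64
    BYTES_PER_WRITE = 8  # eight bytes per clock cycle, as in the text

    def __init__(self):
        self._banks = [[0] * self.BLOCK_BYTES, [0] * self.BLOCK_BYTES]
        self._write_bank = 0
        self._write_offset = 0
        self.write_cycles = 0

    def write(self, eight_bytes):
        """Write one 8-byte group; return True when a block completes."""
        if len(eight_bytes) != self.BYTES_PER_WRITE:
            raise ValueError("expected exactly eight bytes per write")
        bank = self._banks[self._write_bank]
        start = self._write_offset
        bank[start:start + self.BYTES_PER_WRITE] = eight_bytes
        self._write_offset += self.BYTES_PER_WRITE
        self.write_cycles += 1
        if self._write_offset == self.BLOCK_BYTES:
            # Bank is full: swap roles so it can be read while the
            # other bank receives the next block.
            self._write_bank ^= 1
            self._write_offset = 0
            return True
        return False

    def read_block(self):
        """Return a copy of the bank that is not being written."""
        return list(self._banks[self._write_bank ^ 1])


# Hypothetical 8 x 8 monochrome block, flattened to 64 bytes.
ram = DualRam()
block_a = list(range(64))
for i in range(0, 64, 8):
    ram.write(block_a[i:i + 8])
stored = ram.read_block()
```

After the eighth write the banks swap, so the completed block stays readable while the next one streams into the other bank.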

IV. IMPLEMENTATION RESULTS

The architecture of the 2-D DCTQ processor for image compression has been described in Verilog. All the modules used in the DCTQ processor have been designed and coded successfully. The design has been synthesized and implemented on a Xilinx Spartan 3E chip, XC3S1500E. First, the algorithms and concepts were verified using the MATLAB computational tool, and then the algorithms were converted into architectures that can be coded in HDL. The PSNR value obtained is very close to the MATLAB result.

The synthesis results for the Spartan-3 FPGA are listed in Table 1. The comparison between this work (2D-DCT) and the pure 2D-DCT designed in [9] is presented in Table 2.

TABLE 1 - DEVICE UTILIZATION USING XILINX SPARTAN-3

Logic Unit                 Used    Available   Utilization
No. of slices              4409    14752       29%
No. of slice flip-flops    5322    29504       18%
No. of 4-input LUTs        4971    29504       16%
No. of bonded IOBs         98      250         39%
No. of GCLKs               2       24          8%


TABLE 2 - DEVICE UTILIZATION COMPARISON BETWEEN THIS WORK AND 2D-DCT DESIGNED IN [9]

Logic Unit                 Present Paper   Presented in [9]
No. of slices              4409            7260
No. of slice FFs           5322            9644
No. of 4-input LUTs        4971            11194
No. of bonded IOBs         98              101

The total number of pipeline stages in the DCTQ processor is 45, and thus the latency of the 2D-DCT system is 44 clock cycles. By comparison, the 2D-DCT designed in [13] has a latency of 94 clock cycles, and the system in [11] has a latency of 160 clock cycles to compute the 2D-DCT.
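The cycle counts translate into time once a clock frequency is fixed. The short Python sketch below converts them at 100 MHz, the clock rate quoted for this design; applying the same clock to the designs in [13] and [11] is an assumption made only to put the three figures on a common scale, since their actual clock rates are not given here.

```python
def latency_us(cycles, clock_mhz):
    """Latency in microseconds for a cycle count at a given clock."""
    return cycles / clock_mhz


# Cycle counts quoted in the text. Using one common 100 MHz clock for
# all three designs is an assumption made for comparison only.
CLOCK_MHZ = 100.0
latencies = {
    "this work": latency_us(44, CLOCK_MHZ),
    "[13]": latency_us(94, CLOCK_MHZ),
    "[11]": latency_us(160, CLOCK_MHZ),
}
```

At 100 MHz each cycle lasts 10 ns, so 44 cycles correspond to the 0.44 µs latency reported for this design.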

V. CONCLUSION

A linear, highly pipelined, parallel algorithm and architecture has been proposed and implemented for 2D-DCT and quantization on FPGAs. The architectures for the various stages are based on efficient, high-performance designs suited to VLSI implementation. The algorithms and concepts were verified using the MATLAB computational tool, and the implementation was tested for functional correctness in Verilog with the Xilinx tool. Pipelining introduces latency in the system. The maximum frequency achieved by this system is 101.1 MHz. The design uses fewer device resources and is suitable for FPGAs such as the Xilinx XC3S1500E. The latency of the design is lower than that of previous works. Overall, it is a balanced architecture compared with previous works.

REFERENCES

[1] Kim SeongSoo, "Adaptive multi-beam transmission of uncompressed video over 60GHz wireless systems", International Journal of Future Generation Communication and Networking, 12/2007.

[2] P. Lee and F.Y. Huang, "An efficient prime-factor algorithm for the discrete cosine transform and its hardware implementations", IEEE Trans. Signal Process., 42, pp. 1996-2005, 1994.

[3] C.L. Wang and C.Y. Chen, "High throughput VLSI architectures for the 1-D and 2-D discrete cosine transforms", IEEE Trans. Circuits Syst. Video Technol., 5, pp. 31-40, 1995.

[4] Yung-Pin Lee, Thou-Ho Chen, Liang-Gee Chen, Mei-Juan Chen and Chung-Wei Ku, "A cost-effective architecture for 8 x 8 2-D DCT/IDCT using direct method", IEEE Trans. Circuits Syst. Video Technol., 7, 1997.

[5] Yi-Shin Tung, Chia-Chiang Ho and Ja-Lung Wu, "MMX-based DCT and MC Algorithms for real-time pure software MPEG decoding", IEEE Computer Society Circuits and Systems, Signal Processing, 1, Florence, Italy, pp. 357-362, 1999.

[6] Y.P. Lee, T.H. Chen, L.G. Chen, M.J. Chen and C.W. Ku, "A cost-effective architecture for 8 x 8 2D-DCT/IDCT using direct method", IEEE Trans. Circuits Syst. Video Technol., 7, pp. 459-467, 1997.

[7] D.V.R. Murthy, S. Ramachandran and S. Srinivasan, "Parallel implementation of 2D-discrete cosine transform using EPLDs", International Conference on VLSI Design, Goa, January 1999.

[8] S. Ramachandran, S. Srinivasan and R. Chen, "EPLD-based Architecture of Real Time 2D-Discrete Cosine Transform and Quantization for Image Compression", IEEE International Symposium on Circuits and Systems (ISCAS '99), Orlando, Florida, May-June 1999.

[9] Trang T.T. Do, Binh P. Nguyen, "A High-Accuracy and High-Speed 2-D 8x8 Discrete Cosine Transform Design", Proceedings of ICGCRCICT 2010, vol. 1, 2010, pp. 135-138.

[10] L. Basri, B. Sutopo, "Implementation 1D-DCT Algoritma Feig-Winograd di FPGA Spartan-3E (Indonesian)", Proceedings of CITEE 2009, vol. 1, 2009, pp. 198-203.

[11] L. Agostini, S. Bampi, "Pipelined Fast 2-D DCT Architecture for JPEG Image Compression", Proceedings of the 14th Annual Symposium on Integrated Circuits and Systems Design, Pirenopolis, Brazil, IEEE Computer Society, 2001, pp. 226-231.

[12] Sun, M., Ting C., and Albert M., "VLSI Implementation of a 16 x 16 Discrete Cosine Transform", IEEE Transactions on Circuits and Systems, Vol. 36, No. 4, April 1989.

[13] Enas Dhuhri Kusuma, Thomas Sri Widodo, "FPGA Implementation of Pipelined 2D-DCT and Quantization Architecture for JPEG Image Compression", IEEE, 2010.

[14] ISO/IEC MPEG 2 standards for generic coding of moving pictures: part 2, Video, 1988.

[15] S. Ramachandran, "Development of Algorithms and Verification Using High Level Languages" in "Digital VLSI Systems Design", Springer, 2007.

[16] S. Ramachandran, "Architectural Design" and "Project Design" in "Digital VLSI Systems Design", Springer, 2007.

[17] S. Ramachandran, "Design of Memories" in "Digital VLSI Systems Design", Springer, 2007.

[18] S. Ramachandran, "Arithmetic Circuit Designs" in "Digital VLSI Systems Design", Springer, 2007.
