
Hough Based Deinterlacer

Date post: 09-Jan-2016
Upload: abdulaziz-azman
Description:
This document describes a hardware design for a deinterlacer that specifically targets the low-angled artifacts generated by the widely used line-averaging method. The design was initially proposed by Jon Harris. Note that the document does not reveal much detail of the hardware design; rather, it provides an overall description of the Hough-based deinterlacing method.

Hough-Based Deinterlacing

Altera Industrial Placement

Summer 2014

Abdulaziz Azman

CID: 00680225

Table of Contents

1 Summary
2 Introduction
3 Altera Overview and Project Scope
4 The Deinterlacer Project
  4.2 Project Expectations
5 Team Management and Organizational Tools
  5.1 Video IP Team Management
  5.2 Project Management Tools
6 Implementing and Developing the Algorithm
  6.1 The Deinterlacing Challenge
    6.1.1 Deinterlacing Background
    6.1.2 The Low-Angled Problem
    6.1.3 Project Motivation
  6.2 The Hough-Based Deinterlacer Design
    6.2.1 General Method
    6.2.2 Hough Transform
    6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology
  6.3 Research and Improvements
    6.3.1 X-Biased Sobel
    6.3.2 The Proximity Hough
    6.3.3 Post-Processing Block
  6.4 Conclusion and Personal Reflection
7 Altera OpenCL and High-Level Synthesis
  7.1 The OpenCL Tool Flow
  7.2 Compromises and Optimizations
  7.3 Conclusion and Personal Reflection
8 Offsite and Extra Activities
9 Conclusion
10 Appendix
  I Pseudo Code of Conventional and Proximity Hough Transform
  II Post-Processing Block
  III IBC Paper Submission

1 Summary

Altera is a semiconductor company based in Silicon Valley. Altera manufactures FPGAs, PLDs and ASICs, and has a large business unit portfolio covering sectors such as broadcast, automotive, industrial and communications. The company has sites across the globe, with the European headquarters located in High Wycombe. Altera Europe focuses on the business unit side of the company rather than research and manufacturing. The business unit I was assigned to is Broadcast, which aims to provide IP solutions and software tools for broadcast-related components. The current focus of the Altera Broadcast business unit is to meet customer demand for 4k (UHD) processing by making IP cores that can handle 4k video bandwidth and data.

One challenge identified in 4k displays arises when lower-resolution interlaced sources are shown on a UHD television. To display an interlaced video source, the input video undergoes a decompression process known as deinterlacing, which converts the interlaced video to a progressive video suitable for digital displays. This decompression often generates image artifacts, which are further magnified by the up-scaling required to match the 4k display resolution.

There were two goals in my project. The first was to develop a deinterlacing method proposed by my supervisor that aims to eliminate these image artifacts. This phase of the project involved modifying the method and employing additional image processing techniques to render the solution feasible. A feasible deinterlacer design is one that consistently removes image artifacts and can be fitted into a single FPGA chip. Several novel methods had to be incorporated to generate significantly improved deinterlaced video outputs. The second goal was to implement the finalized algorithm in hardware using Altera high-level synthesis tools.

The main challenge encountered during the project was to create a deinterlacer design that produced consistent results. This is difficult due to the vast amount of feature variation in video sequences. The approach to tackling this problem was to research different computer vision and image processing methods that could be exploited. The final result of the research is a novel adaptation of an established feature detection algorithm and the inclusion of a smoothing post-processing technique. This is a personal achievement given the novelty of the algorithm; arriving at the solution required a good understanding of the deinterlacing problem and a low-level familiarity with the feature detection algorithm. Using the high-level synthesis tool was also a challenge because it was my first exposure to it. Courses like Digital System Design and VHDL in my third year were relevant in this phase, as I was already accustomed to hardware performance measures, loop unrolling, memory interfaces and hardware timing requirements.

Throughout the placement I learned to communicate better with colleagues and managers, namely when giving progress updates and discussing project resource allocation. I realized the importance of giving your manager an accurate impression of your ability to complete a specific task: a false impression results in your manager over-expecting results and overloading you with more tasks. I attended many Altera Broadcast meetings and video conferences, which provided insight into the development of IP cores and the management of a team of engineers. Overall I learned about both the technical side and the productization of an engineering solution.

The deinterlacer design produced at the end of the placement has potential to be developed further. Incorporating suitable digital filters could drastically reduce execution time, and inferring more hardware parallelism by increasing the number of data paths would enable the deinterlacer to perform 4k processing at 60 frames per second.

    2 Introduction

In recent years there has been a demand for higher-resolution displays. Broadcast and digital television companies around the world are seeking to satisfy the growing market demand for high-definition (1080i and 1080p) video broadcast and television resolution. This market trend will inevitably require improved video processing components such as scalers, codecs and deinterlacers. This industrial project focuses on designing a deinterlacer on an FPGA that would meet these high-quality deinterlaced video demands.

The project is divided into two main tasks: the algorithm development phase and the hardware implementation phase. The algorithm development phase was meant to identify caveats in the general method proposed and to explore aspects of the design that could be improved. This involved initially writing the method in C++ and modifying it based on feedback from the video outputs generated. The proposed deinterlacing method introduces numerous image artifacts, hence a large portion of the placement was dedicated to researching and devising new methods to improve these video outputs. The hardware implementation phase was to validate an Altera high-level synthesis (HLS) tool flow. The HLS tool flow aims to widen the range of Altera customers by enabling easy translation from a widely used C-based programming language to a register-transfer level hardware description. This abstraction from hardware is attractive to software engineers and allows them to exploit the FPGA architecture without having to learn hardware description languages. This report will center on the research and modifications made to the method, an outline of the translation process from C to OpenCL, and an evaluation of the performance of the hardware generated.

The report also includes an overview of Altera and the engineering team I was assigned to. A basic description of the management practices and tools used, along with a brief account of company-related offsite activities, is also included.

3 Altera Overview and Project Scope

Altera is a manufacturer of Programmable Logic Devices (PLDs). Its headquarters is located in Silicon Valley, but there are many Altera sites and offices around the world. The company portfolio includes research and manufacturing of PLDs as well as FPGA-related tools and solutions for specific sectors. These sectors are identified as business units, and the main Altera business units (BUs) are Industrial, Automotive, Communications and Broadcast. A hierarchy of the BUs can be seen below in figure 1, along with the teams under the Broadcast BU.

Figure 1 Diagram showing the Business Units inside Altera and teams under the Broadcast BU.

The teams in Altera are constantly changing based on customer demands, because Altera adopts a dynamic allocation of resources: engineers and funds are constantly reassigned and reallocated to different teams to meet the immediate demands of customers. This management approach is practical and strategic as it achieves optimal resource allocation. Altera engineers are also not restricted to a single BU.

I was assigned to the Video IP team under the Broadcast business unit. The Broadcast BU aims to meet broadcast customer demands by providing intellectual property cores, design software and tools. The Video IP team under Altera Broadcast focuses on meeting customer demands for 4k video displays and processing by providing solutions such as 4k IP cores and design tools. Altera invests in creating IPs readily available to customers to encourage the use of Altera's FPGAs. Examples of video IPs available are scalers, deinterlacers, SDI and HDMI interfaces and chroma resamplers. The project assigned to me focuses on developing a deinterlacer component. Deinterlacers are included in digital video systems to enable interlaced video to be displayed; a deinterlacer is therefore an important component of a digital display system.


4 The Deinterlacer Project

There are three phases in the Deinterlacer Project. The first is to implement, develop and conduct a feasibility study on the deinterlacer method. This method is described in an internal Altera document authored by my supervisor, Jon Harris, and is explained in section 6. The second is to restructure the deinterlacer algorithm to suit a hardware implementation. The third phase is the hardware synthesis of the deinterlacer design. While the primary objective of the third phase is to implement the deinterlacer algorithm in hardware, the secondary objective is to validate an Altera high-level synthesis tool. The tool is still in its infancy; it is available to the public, though some features are only available to Altera employees. The task division of the project was managed in Jira to enable easy monitoring. A breakdown of the project plan is presented in the table of figure 2.

| Phase | Task | Description | Time Allocated (weeks) | Time Completed (weeks) |
|---|---|---|---|---|
| Algorithm Development | Develop deinterlacer method in C++ | Implement the method described in the patent draft and use OpenCV as an interface to generate image outputs from the C++ functions. | 11 | 14 |
| Algorithm Translation | Translate C++ code to OpenCL | Rewrite sections of code in a pixel-streaming manner. Restructure the code so that the OpenCL compiler is able to compile it and extract parallelism. | 2 | 4 |
| Hardware Implementation | Use the OpenCL tool flow to create the first OpenCL-based program; perform necessary optimizations and compromises | Use resource estimation to ensure the design will fit into the card, and verify frame outputs from hardware using emulation. Optimizations include inferring pipelines wherever possible, relaxing loops by reducing data dependency and inferring shift registers wherever possible. Compromises include reduction of the intercept range and delta range (benchmark is 1080i). | 6 | 2 |
| Hardware Synthesis | Synthesize design to FPGA | | 3 | 2 |

Figure 2: Table showing the breakdown of the project plan proposed by the placement supervisor.

The time allocation was decided by my supervisor, who based the estimates (in weeks) on the fact that the algorithm required refinement and that I was unfamiliar with the high-level synthesis tool. The algorithm development phase was allocated 11 weeks but took 13 weeks to complete. The prolonged period was due to the difficulty of making the deinterlacer produce consistent outputs across a set of industry-standard interlaced video sequences. A large research effort was put into making the deinterlacer design a feasible solution to the deinterlacing problem (elaborated in section 6). Note that the term feasible is used to describe a solution that consistently removes targeted artifacts and can be synthesized to fit on a single readily available FPGA chip. The algorithm translation phase took longer than expected because the C code was written with no hardware considerations. The initial code used C++ classes and random accesses, which are more difficult to translate into hardware than an algorithm written in a pixel-streaming fashion. Hence the restructuring of the C code involved not only a translation of the algorithm but also major changes to render it suitable for pixel-streaming. The modified algorithm had to achieve the same effect, or at least a similar effect, as the pure C implementation. The hardware implementation phase was completed much faster than expected: of the allocated 6 weeks, it was completed in 2. The acceleration was due to the robustness of the high-level synthesis tools used (elaborated in section 7) and the close communication between me, my supervisor and an Altera employee involved in the development of the tools.
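To illustrate the pixel-streaming restructuring described above, the sketch below shows the classic line-buffer pattern that replaces random frame accesses with a sliding window over a stream of pixels; the class name and interface are assumptions for illustration, not the project's actual code. This style maps naturally onto FPGA shift registers and on-chip RAM, which is why the HLS compiler favors it.

```cpp
#include <deque>
#include <vector>
#include <cstddef>

// Illustrative line buffer for a 3-line vertical window. Pixels arrive one
// per iteration (row-major), and the buffer retains exactly the previous
// two lines so that a vertical 3-tap window is available at every column.
class LineBuffer3 {
    std::size_t width_;
    std::deque<int> buf_;  // last two full lines, plus the pixel just pushed
public:
    explicit LineBuffer3(std::size_t width) : width_(width), buf_(2 * width, 0) {}

    // Push one incoming pixel; returns the 3-tap vertical window
    // {two lines up, one line up, current} aligned at the same column.
    std::vector<int> push(int pixel) {
        buf_.push_back(pixel);
        std::vector<int> window = {buf_[0], buf_[width_], buf_[2 * width_]};
        buf_.pop_front();  // discard the oldest pixel; buffer stays 2 lines deep
        return window;
    }
};
```

Streaming a 3-line-tall image through this buffer yields, at each pixel, the column-aligned pixels from the two lines above it, which is all a line-averaging style interpolator needs.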

4.2 Project Expectations

The proposed method of deinterlacing is the first of its kind in that the Hough Transform is used to deinterlace a frame. Due to its novelty and the limited placement time, the performance of the algorithm is not expected to be extremely robust and efficient. It may produce additional image artifacts while targeting certain sections of the image for refinement. It is expected, though, that the algorithm should work very well in situations where its parameters are fine-tuned to a particular video sequence. Such inconsistency in video deinterlacers is not surprising, as even mature and well-established deinterlacing algorithms do not produce consistent results for all video sequences.

The hardware generated from the HLS tool is not expected to be optimized and may take up a large amount of resources. The target board for this design is the Nallatech PCIe-385N FPGA accelerator card, which carries a Stratix V. The Stratix family is among the higher-end FPGAs available on the market today and hence provides one of the highest FPGA resource counts. Despite this, the design is expected to use up most of the Stratix V and is therefore very expensive in terms of FPGA resources. The design is thus not yet suitable for incorporation into an FPGA-based display system, which often includes other video components such as scalers, memory and port interfaces, and video encoders and decoders. The hope is that as FPGA resources increase and the HLS tool matures, the deinterlacer design will become a practical solution for better deinterlaced video output.

5 Team Management and Organizational Tools

A positive aspect of the placement was the constant exposure to the meetings and organizational tools used in the Video IP team. My manager and supervisor included me, whenever relevant, in video conferences and group discussions. As a result I familiarized myself with the dynamics of the Video IP team management and the management tools they use for IP core development.

5.1 Video IP Team Management

Benjamin Cope, who is based in Altera High Wycombe, manages the VIP group. The group consists of 11 engineers, 8 of whom are based in Altera Penang. The engineers in the VIP group work closely together but are not all based at the same Altera site, so the team uses video conferences and online management tools to monitor everyone's progress. The video conferences between Altera Europe and Penang are held once a month to report individual progress and to highlight group milestones. I attended these meetings despite not contributing to the team project, though on occasion I provided updates on the progress of my deinterlacer project.

The VIP team organizes its work using the Agile methodology, with Jira as its online management tool. These are used to keep track of work progress and to generate performance measures. Though my industrial placement project was not part of the VIP team's project, I was included in their management system to gain insight into how engineers organize themselves when developing software and IP cores.

Agile is a method of organizing a software development process, especially when the team consists of individuals with different functional expertise. The specializations in the VIP team range from members adept in specific IP cores, such as deinterlacers, to members who focus on verification. Agile introduces management techniques and measures such as Sprints, Story Points, Issues and Scrums. These provide an iterative and incremental framework for IP or software development. The short time span allocated to a task (i.e. weeks) allows the team to be very responsive to changes in customer demands. The placement was an individual work project, hence my tasks were divided over periods of months instead of weeks. There are many online tools that adopt the Agile methodology to make it more versatile and accessible.

Jira is an online tool that adopts the Agile methodology. Jira provides bug tracking and software development monitoring, and allows easy issue allocation between team members with a simple drag-and-drop action. Issues can be prioritized and labeled for easy tracking; for example, an issue may be labeled as a bug fix or a new feature. End-of-sprint reports can also be generated through Jira. These reports consist of a burn-down chart of the number of story points achieved versus the time taken to complete them, with the expected trajectory also displayed. Burn-down chart comparisons give the team an idea of how well they meet expected targets and provide feedback to managers on how many tasks to assign to the team in the future. An example burn-down chart for the deinterlacer project is shown in figure 3.

Figure 3 Comparison of the expected and actual progress in the burn-down chart of the deinterlacer project.

Exposure to the monthly video conferences with Penang and to the software management tools familiarized me with the IP core development process and the dynamics of engineers who are located far apart.

5.2 Project Management Tools

Throughout the internship, meetings and project tools were used to help the progress of the project. One-on-one meetings between my supervisor and me occurred almost daily, depending on my progress and whether I required assistance.

The third phase of the project involved using the Altera OpenCL tool flow, which many engineers in Altera Europe are unfamiliar with. Andrei Hagiescu is an Altera employee based in Toronto who is involved in developing the Altera OpenCL tool flow and creating OpenCL example designs. Andrei was helpful in many aspects of the latter phase of my project. He provided constant feedback on the latest deinterlacer code revisions and advised on a suitable FPGA development kit for my project application. The code feedback consisted of Andrei evaluating the resource estimation and optimization reports generated by the AOCL compiler and suggesting ways of improving and optimizing the hardware being generated. Andrei's main role was to guide the second phase of the project in the right general direction.

Code revisions in the hardware implementation phase were exchanged between me and Andrei through email. Errors and bugs I encountered in the high-level synthesis tool were reported to Andrei using Fogbugz, a bug tracking system where bug finds are reported as incidents that can be assigned to one or more teammates to help resolve. Fogbugz also retains a thread of correspondence between teammates for easier tracking. I identified several bugs in the high-level synthesis tool while using it, and workarounds and suggestions were almost always immediately provided by the Altera employees involved with the HLS compiler.

Having one-on-one meetings with my supervisor and video conferences with Andrei definitely helped me develop my communication skills. Supervisors and managers who do not have an idea of the amount of effort required to solve a problem may impose unrealistic goals. It is therefore paramount to convey the right impression about your capabilities and the estimated amount of time and effort you would require to complete a task; a false impression results in your supervisor expecting more results than are possible.

[Figure 3: burn-down chart of the deinterlacer project, plotting outstanding tasks (0-5) against weeks (0-40) for expected versus actual progress, across the phases: C++ implementation, OpenCL kernel translation, OpenCL kernel optimisation, Hardware Synthesis.]

6 Implementing and Developing the Algorithm

At the onset of the placement, the proposed deinterlacing method had never been implemented. The method described was general and required many additional image processing techniques to render it a feasible solution to the deinterlacing problem. Due to the lack of validation and research on the proposed method, more than half of the placement time was allocated to making the algorithm feasible.

6.1 The Deinterlacing Challenge

Deinterlacing has been an ongoing problem for decades. Interlacing, a primitive form of video compression, blindly removes half of the image information, so reconstruction is a challenge and decompression artifacts are difficult to avoid. Despite the tendency to produce such artifacts, the absence of complex codecs and compression algorithms makes interlacing an attractive technique for broadcast companies, as they can avoid the extra cost of specialized broadcast equipment for the codecs. The cost of compensating for the simple compression technique is incurred on the receiving end of the transmission, where better deinterlacers are required to either totally eliminate or mitigate the decompression artifacts. The widespread use of interlaced video broadcasting has driven a large amount of research in the deinterlacing area. This section of the report delves deeper into the fundamentals of deinterlacing, the targeted decompression artifact and the project motivation.

6.1.1 Deinterlacing Background

Deinterlacing is a method of converting interlaced scan video to progressive scan video. Interlaced scan video captures only the even or only the odd horizontal lines of a video frame. A video frame containing only the odd or only the even lines is known as a sub-field; this terminology will be used throughout this report. Transmitting and storing video in interlaced format is advantageous because the frame rate is increased for the same amount of bandwidth or memory. This improvement in temporal resolution reduces image flickering in analogue television, and as a result the interlaced format is commonly used in analogue broadcast systems such as PAL and NTSC. Progressive video, on the other hand, captures and displays all lines in a video frame and is the format used by almost all digital video display devices. Digital devices therefore require a deinterlacing component in the system to allow interlaced scan video to be displayed. A simple illustration of the conversion between interlaced and progressive video can be seen in figure 4.

Figure 4 An image created during the industrial placement illustrating the capturing, storing and conversion of interlaced scan fields to progressive video.
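The reconstruction of missing lines from a sub-field can be made concrete with the simplest intra-field scheme, line averaging: captured lines are copied into the output frame, and each missing line is the average of the captured lines above and below it. This is a minimal sketch with assumed names (`Image`, `lineAverage`), not code from the placement project.

```cpp
#include <vector>
#include <cstddef>

using Image = std::vector<std::vector<int>>;  // grayscale rows of pixels

// Reconstruct a full progressive frame from one sub-field by line averaging.
// 'top' selects whether the sub-field holds the even (top) or odd (bottom)
// lines of the original frame.
Image lineAverage(const Image& field, bool top) {
    std::size_t h = field.size() * 2;
    std::size_t w = field[0].size();
    std::size_t phase = top ? 0 : 1;
    Image frame(h, std::vector<int>(w, 0));
    // Copy the lines that the sub-field actually captured.
    for (std::size_t y = phase; y < h; y += 2)
        frame[y] = field[y / 2];
    // Interpolate each missing line from its captured neighbours,
    // clamping at the top and bottom of the frame.
    for (std::size_t y = 1 - phase; y < h; y += 2) {
        for (std::size_t x = 0; x < w; ++x) {
            int above = (y > 0)     ? frame[y - 1][x] : frame[y + 1][x];
            int below = (y + 1 < h) ? frame[y + 1][x] : frame[y - 1][x];
            frame[y][x] = (above + below) / 2;
        }
    }
    return frame;
}
```

Because the average is purely vertical, any edge that crosses the missing line at a shallow angle is blurred into a staircase, which is exactly the low-angled artifact discussed in the next section.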

Deinterlacing is a widely researched area due to its vast applications in television and video storage. Deinterlacing methods can be crudely divided into four techniques: motion compensation, motion adaptation, directional interpolation and non-adaptive (temporal and spatial) interpolation. Deinterlacing methods can also be divided into intra-field and inter-field methods: intra-field methods use only one sub-field to generate a full progressive frame, while inter-field methods use multiple sub-fields. Directional and non-adaptive interpolation techniques tend to be less algorithmically and computationally complex than their motion-detection based counterparts; they therefore produce lower-quality deinterlaced video output and are more prone to generating image artifacts. A common image artifact that has yet to be fully resolved is the low-angled image artifact, which occurs with many directional and (spatially) non-adaptive interpolation techniques.

6.1.2 The Low-Angled Problem

As the name suggests, low-angled image artifacts occur along straight, low-angled lines. The term low-angled keeps the exact angle of the line arbitrary, as the artifacts are subjective and are more apparent on some straight edges than others. For the sake of definition, we will take low-angled lines to be lines at less than 45 degrees to the horizontal. Figure 5 shows the low-angled artifacts generated by spatially non-adaptive deinterlacing (i.e. LD and LA) and edge-dependent deinterlacing (i.e. ELA) techniques.

(a) (b) (c)

Figure 5 Low-angled image artifacts of Line Doubling (LD), Line Averaging (LA) and Edge-dependent Line Averaging (ELA). Images were generated using C++ and OpenCV.

The main reason these artifacts occur at low angles is the inherent lack of vertical resolution in interlaced video. Recall that a sub-field captures only the odd or only the even horizontal lines of a full frame, so half of the horizontal frame lines are not captured. Deinterlacing techniques essentially reconstruct these missing frame lines using information contained within the sub-field. The closer a straight line is to the horizontal, the less information about that line exists in the interlaced frame. This is easy to see if you imagine a perfectly horizontal line in an image: assuming the line is one pixel wide, there is a chance that the entire line is not captured in the interlaced format, since some horizontal frame lines are discarded. In contrast, a perfectly vertical line has half of its information available regardless of whether the odd or even sub-field is captured. To proceed further into the problem, straight lines can be crudely classified; the relevant types for this problem are highly directionally textured lines and macro-lines. The distinction lies in the thickness of, and spacing between, consecutive lines: textured lines are a single pixel wide and closely packed, while macro-lines are multiple pixels wide and far apart from other lines. This is illustrated in figure 6 using the 1080p NTSC Slices sequence.

Figure 6 Regions of highly textured, textured and macro-lines in the Slices test sequence created by NTSC.

Existing edge-dependent deinterlacing methods are insufficient due to the locality of the pixels they analyze: only neighborhoods of at most 20 pixels across are processed before performing the directional interpolation. This locality makes most existing edge-dependent deinterlacers suitable for the highly directionally textured regions of a frame, shown in yellow in figure 6, but it also means they are unable to reconstruct regions with macro-lines, shown in blue in figure 6, which require a larger neighborhood of pixels to be processed. Existing schemes can at best approximate the direction of these macro-lines. This approximation and its drawbacks can be seen in figure 7, which shows the deinterlacing of the 1080i (interlaced) NTSC Slices sequence using the Fine-Directional Deinterlacing (FDD) method, a form of edge-dependent deinterlacing.

(a) (b)

Figure 7 The progressive 1080p frame (a) and the FDD output (b) of the 1080i NTSC Slices test sequence.

The solution to reconstructing macro-lines is to analyze the entire sub-field rather than a local pixel neighborhood. This image processing is done using a Hough Transform (HT), which will be further elaborated in section 6.2.2. It is crucial to note that analyzing a delocalized set of pixels and reconstructing a single pixel from these pixel data carries a higher chance of generating other image artifacts, because the pixel to be interpolated will be affected by pixel values that are relatively distant. Distant pixels, relative to the pixel to be interpolated, tend not to represent the same image feature in a frame. This is precisely why most existing deinterlacers do not venture further than about 10 pixels into the pixel domain. Even deinterlacers that analyze a neighborhood of 5 pixels across generate artifacts due to inaccurate and erroneous interpolation based on distant pixels.
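The edge-dependent interpolation discussed above can be illustrated with a textbook ELA kernel: for a missing pixel, compare pixel differences along a few candidate directions between the captured lines above and below, and average along the direction with the smallest difference. This is a hedged sketch of the common three-direction formulation (`elaPixel` is an assumed name); real implementations use wider windows and tie-breaking rules.

```cpp
#include <vector>
#include <cstdlib>
#include <initializer_list>

// Interpolate one missing pixel at column x from the captured line above
// and the captured line below, choosing the direction (45, 90 or 135
// degrees) whose endpoints differ least.
int elaPixel(const std::vector<int>& above, const std::vector<int>& below, int x) {
    int w = static_cast<int>(above.size());
    int bestDiff = std::abs(above[x] - below[x]);   // vertical candidate
    int best = (above[x] + below[x]) / 2;
    for (int d : {-1, 1}) {                          // the two diagonal candidates
        int xa = x + d, xb = x - d;
        if (xa < 0 || xa >= w || xb < 0 || xb >= w) continue;
        int diff = std::abs(above[xa] - below[xb]);
        if (diff < bestDiff) {                       // a stronger edge direction
            bestDiff = diff;
            best = (above[xa] + below[xb]) / 2;
        }
    }
    return best;
}
```

With only a 3-pixel window, the kernel can follow steep edges but, as the text notes, has no way to track a macro-line whose matching endpoints lie tens of pixels away.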

To remove secondary image artifacts produced by pixel interpolation, we can apply post-processing methods that double-check the existence of an edge and perform smoothing based on these edge confirmations. A large section of the initial phase of this project was dedicated to these post-processing methods. Despite the drawbacks of analyzing a large pixel domain, there are several motivations for pursuing a more delocalized pixel-domain analysis technique for deinterlacing.

    6.1.3 Project Motivation

    The increase in demand for HD (1080p) and UHD (2160p) television displays requires the conversion of existing lower-resolution videos to higher resolutions. Image artifacts in deinterlaced video inevitably become more apparent after scaling up, so previously tolerable, minute low-angled artifacts become more discernible to the human eye. An improved deinterlacer would provide a competitive edge to companies whose business models are predicated on making high-resolution digital televisions. It is important to note that many interlaced video sources are still in use today, such as in broadcast and video storage, despite interlacing being a primitive form of video compression. The ubiquity of the interlaced format makes the deinterlacer a crucial component of a digital television that directly affects the user experience.

    FPGA area and resources are progressing at a much faster rate than the demand for television resolution. This slower growth in video resolution leaves more FPGA resources available, so it is reasonable to invest them in better deinterlacers to improve video quality.

    The expected complexity and resource usage of the proposed deinterlacer design is less than that of motion-compensated deinterlacer designs, because motion compensation requires computationally heavy motion detectors that compute motion vectors at different pixel-domain levels to estimate motion. The expected output quality and robustness of the proposed design should nevertheless be comparable to that of motion-compensated deinterlacers, despite the lower resource usage and complexity. This point is revisited at the end of this report once the deinterlacer design has been explained, implemented and verified.

    6.2 The Hough-Based Deinterlacer Design

    My industrial placement supervisor, Jon Harris, proposed the Hough-Based Deinterlacer design in a patent draft (yet to be submitted). The novelty the patent claims is the application of the Hough Transform to deinterlacing video. The document describes the method as well as a digital hardware realization of the deinterlacer. This section of the report introduces the general method as described in the patent draft.

    6.2.1 General Method

    The Hough-Based deinterlacer analyses an entire sub-field for edges, in contrast to existing deinterlacers where only local pixel domains are processed. It aims to extract edge information to dictate the directional interpolation of the pixels. Edge information may include variables such as the intercept, gradient, start coordinates and end coordinates of a line. The scheme extracts edge information via the image processing flow shown in figure 8.

    Figure 8 shows the functional flow of the deinterlacer and the image output of each processing stage, panels (a)-(d). The image is the 200th field of the Table Tennis sequence.

    The flow consists of an RGB-to-luma conversion, an edge detection process, a line detection process, an offset mask generation process and an interpolation process. Referring to figure 8, the interpolator uses the processed sub-fields from the luma, Sobel and offset mask generation blocks to perform directional interpolation and produce the deinterlaced output. The scheme targets the luminance (i.e. brightness information) of the sub-field while ignoring its chrominance (i.e. color information), because the human eye is more sensitive to variation in brightness than to variation in color. This higher tolerance for changes in color information is precisely why the chroma of an image or video is usually sub-sampled. Whether an RGB conversion is required depends on the format of the video input to the design. No conversion is required if the input already uses the YUV color space, encoded as YCbCr: Y contains the luminance information and Cb and Cr contain the chrominance information. Equation (1) shows how to convert from digital RGB to digital luminance (i.e. Y).

    Y = ((66·R + 129·G + 25·B + 128) / 2^8) + 16    (1)

    The red, green, blue and luminance values are each represented using an 8-bit binary number and therefore range from 0 to 255.

    The Sobel process is used to detect boundaries in the sub-field. Boundaries in images are characterized by a change in luminance or chrominance, which can be computed using image kernels containing gradient operators. There are numerous gradient operators, but the Sobel was chosen due to the higher weighting given to the pixels vertically adjacent to the center pixel. The Sobel also has smoothing properties owing to the coefficient 2 in its kernels. Figure 9 shows common edge operators including the Sobel operators.

    Roberts:
    [ 0  1]    [1  0]
    [-1  0]    [0 -1]

    Prewitt:
    [-1 0 1]    [-1 -1 -1]
    [-1 0 1]    [ 0  0  0]
    [-1 0 1]    [ 1  1  1]

    Sobel:
    [-1 0 1]    [-1 -2 -1]
    [-2 0 2]    [ 0  0  0]
    [-1 0 1]    [ 1  2  1]

    (a) (b) (c)

    Figure 9 The Roberts, Prewitt and Sobel edge operators.

    Applying a Sobel threshold sets the minimum Sobel output value that we consider an edge, so the image generated after thresholding is a binary image. Increasing the threshold reduces the chance of detecting erroneous edges but simultaneously discards edge information. A compromise based on the user's preference is therefore necessary to dictate the tolerance in gradient. The effect of varying the Sobel threshold is illustrated in figure 10; note the disappearance of the straight edge of the roof eave as the threshold increases.

    Figure 10 Shows the output of the Sobel Transform, panels (a)-(d), with increasing Sobel threshold.
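A minimal sketch of the thresholding step described above (names are hypothetical; the gradient-magnitude image is assumed to be precomputed):

```c
#include <stdint.h>
#include <stddef.h>

/* Thresholds a Sobel gradient-magnitude image into a binary edge map.
   Pixels at or above the threshold become 1 (edge), the rest 0.
   A higher threshold rejects noise but also discards weak edges. */
void sobel_threshold(const uint16_t *grad, uint8_t *edges,
                     size_t n_pixels, uint16_t threshold)
{
    for (size_t i = 0; i < n_pixels; i++)
        edges[i] = (grad[i] >= threshold) ? 1 : 0;
}
```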

    The Hough Transform uses the binary edge information (figure 10 (c)) to detect lines in an image through a voting system; an explanation of the Hough Transform is given in section 6.2.2. Detected lines are used to generate an offset mask (shown in figure 10 (d)). An offset mask is an intermediate image that indicates the location of each pixel requiring directional interpolation, along with its offset value. A simple illustration of the function of the offset mask is shown in figure

    The offset value is used to directionally interpolate a pixel. Directional interpolation derives from the characteristic that pixels along an edge tend to have the same RGB or luminance value. Hence, to reconstruct the edge, the pixel to be deinterlaced takes the average of the top and bottom pixels at offsets determined by the angle of the edge itself. Consider the illustration in figure 11.

    Figure 11 Shows the directional interpolation method along with a comparison of the results.
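The directional interpolation described above can be sketched as follows (a simplified single-channel version with hypothetical names; the design's exact boundary handling is not given):

```c
#include <stdint.h>

/* Directional interpolation for one missing pixel at column x.
   `top` and `bottom` are the field lines above and below the line being
   reconstructed; dx is the signed offset implied by the edge angle.
   Offsets are clamped to the line width (an assumed boundary policy). */
static uint8_t interp_directional(const uint8_t *top, const uint8_t *bottom,
                                  int width, int x, int dx)
{
    int xt = x + dx, xb = x - dx;   /* opposite offsets along the edge */
    if (xt < 0) xt = 0;
    if (xt >= width) xt = width - 1;
    if (xb < 0) xb = 0;
    if (xb >= width) xb = width - 1;
    return (uint8_t)((top[xt] + bottom[xb] + 1) / 2);  /* rounded average */
}
```

With dx = 0 this degenerates to plain vertical line averaging; non-zero dx follows the detected edge direction.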

    6.2.2 Hough Transform

    The Hough Transform (HT) was first introduced in 1962 in a patent published by Paul V. C. Hough. The patent describes a method of extracting image features, such as lines and ellipses, that can be mathematically parameterized. As an example the patent uses straight lines, which are easily represented by an intercept value and a gradient value. The basic idea of the HT is that for each high binary edge pixel, we compute the parameters of every possible feature orientation and accumulate these values in a parameter space.

    Say we wish to detect straight lines parameterized by an intercept parameter, c, and a gradient parameter, m. The size of the parameter space (also known as the Hough Space) would then be Nc x Nm; the ranges of c and m are arbitrary and will be discussed further in the next section. Consider a high binary edge pixel at the center of a 100x100 image, shown in figure 12 (a). To find every possible line orientation for that particular pixel we simply compute the intercept for each of the Nm gradients. This set of Nm (m, c) pairs is then accumulated in the Hough Space shown in the right plot of figure 12. The Hough Space is represented using a color map, with a color bar to the right of the figure indicating the value of a specific element in the Hough Space.

    (a) Edge Image (b) Hough Space

    Figure 12 showing an edge image (left) with a single high binary pixel at its center and the corresponding set of accumulation points in a color map representation of the Hough Space.

    Referring to figure 12 (b), notice that the maximum element value in the Hough Space is 1, as there is only one voting pixel. We can therefore expect a similar Hough Space profile per pixel as we extend to more high binary edge pixels. Performing the above iteration for an image containing a line angled at 45°, as shown in figure 13 (a), generates the Hough Space accumulation pattern shown in figure 13 (b).

    (a) Edge Image (b) Hough Space

    Figure 13 shows a 45 line in an edge image (left) and the corresponding Hough space generated.

    The Hough Space has a maximum accumulation of 20 hits, indicating 20 pixels voting for a specific line at c = 0 and the gradient corresponding to 45°, as expected. This point appears in the Hough Space as the region in red. To extract maxima in the Hough Space we apply a threshold; the Hough Space coordinates that satisfy this threshold are the detected lines.
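The voting scheme can be sketched as follows (an illustrative Cartesian formulation with assumed parameter-space sizes, not the design's exact implementation):

```c
#include <stdint.h>

/* Conventional Cartesian Hough voting (illustrative sketch).  For each
   high edge pixel, every candidate gradient index votes for the
   intercept it implies: c = y - m*x.  Detected lines are the
   accumulator cells whose count exceeds a threshold. */
#define N_M 16   /* number of discrete gradients (assumed)  */
#define N_C 128  /* number of discrete intercepts (assumed) */

void hough_vote(const uint8_t *edges, int w, int h,
                const float *gradients,        /* N_M candidate slopes */
                uint16_t acc[N_M][N_C])
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (!edges[y * w + x]) continue;
            for (int mi = 0; mi < N_M; mi++) {
                int c = (int)(y - gradients[mi] * x + 0.5f);
                if (c >= 0 && c < N_C)
                    acc[mi][c]++;              /* one vote per pixel */
            }
        }
}
```

For the 45° line of figure 13, every pixel on the diagonal votes for the same (m, c) cell, producing the single strong maximum described above.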

    Since the Hough Transform was invented, a wide variety of versions have been introduced to suit different applications. In the context of the deinterlacer project, where straight and low-angled lines are concerned, several features were added to better serve our purpose. There are two major modifications made to the conventional Hough Transform. The first is that a Cartesian coordinate system is used rather than a polar coordinate system. The second is that the set of angles a line can take is restricted and bounded. These modifications reduce the generality of the Hough Transform in that many lines are coarsely approximated. The following paragraphs explain why.

    To represent a straight line, the parameters may be expressed either in terms of the angle and distance to the origin (i.e. polar coordinates) or in terms of the gradient and intercept (i.e. Cartesian coordinates). This design adopts the Cartesian system because the offset value (used for directional interpolation) is easily derived from the gradient of a line. The main drawback of a Cartesian-based method is that it cannot represent a vertical line; this disadvantage is ignored, as we are not interested in representing vertical lines.

    The term bounded-offset refers to the manner in which the line gradients are discretized. It would be more natural to let line angles take discrete values at regular intervals, with the interval dictating the angle resolution of the detected lines. The key modification in this design is that the lines are instead bound to the angles determined by the offset values. The detectable lines are therefore restricted to the angles illustrated in figure 14.

    Figure 14 shows the angle discretization created by neighboring pixels.

    These offset values are denoted Δx, as they are inherently the change in the x direction given that the change in the y direction is always 1; this holds in our case because the interpolation always happens between the current pixel and the pixel directly above it. Using Δx as a parameter rather than angles in degrees inevitably changes the parameter space.

    A key feature of using Δx is the improvement in resolution at very low angles. The angle resolution is variable, unlike most parameterizations where the resolution remains constant throughout. The resolution is given by the derivative of the equation relating Δx to angle; this derivative is shown in (3).

    θ = tan^-1(1/Δx),  Δx ∈ Z+    (2)

    dθ/dΔx = -1/(Δx^2 + 1),  Δx ∈ Z+    (3)

    Note that the derivative is always negative, as expected since the angle decreases as Δx increases. It is also important to note that as Δx increases, the magnitude of the derivative decreases, implying that the angle resolution rapidly improves with Δx. This advantage is also a drawback for larger angles: above 18° the angle step exceeds 5°, so such lines are either approximated or remain undetected.

    6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology

    The patent draft describes a method of the deinterlacer design with a demonstration in principle that the design should

    resolve low-angled edges. As a proof of concept, the initial phase of the project was dedicated to creating a working

    prototype of the deinterlacer.

    The metric used for verification is the PSNR, a standard measure used by researchers in deinterlacing and in video compression and reconstruction. PSNR stands for peak signal-to-noise ratio and can be derived from a single frame or multiple frames. It is the ratio between the peak signal value and the noise: the peak value for an image with a single 8-bit color channel is 255, and the noise is the mean squared difference between the original and reconstructed images. Hence, for deinterlaced video, the PSNR can only be calculated when the progressive version of the video is available. The PSNR is defined mathematically in (4), where I(i,j) is the progressive frame and K(i,j) is the deinterlaced frame.

    PSNR = 20·log10(255 / √MSE)    (4)

    MSE = (1/(m·n)) · Σ_{i=0..m-1} Σ_{j=0..n-1} [I(i,j) − K(i,j)]^2    (5)

    The PSNR measure was a means to reconfirm quantitatively that the deinterlacer scheme reconstructs low-angled edges without introducing new image artifacts. Throughout most of the algorithm development phase, however, the frame outputs of the deinterlacer were visually assessed rather than quantitatively measured: the end users of the video outputs are people, so it is crucial that deinterlaced videos are visually satisfactory.

    Using these verification methods, various aspects of the algorithm were found lacking. A thorough description of the caveats discovered and the improvements made in response is presented in the next section.

    6.3 Research and Improvements

    This section of the report describes several modifications and additions to the deinterlacer algorithm that were necessary to render satisfactory video outputs. My supervisor occasionally suggested improvements, but most of the modifications are my own. A large part of the industrial placement was dedicated to this research and improvement of the algorithm. The research method employed was trial-and-error based and is by no means systematic or exhaustive. A wide variety of image processing techniques were experimented with to improve the design. The challenge was that, owing to its novelty, few published deinterlacer papers share a similar algorithm with the Hough-Based deinterlacer. As a result, I turned to image processing techniques not usually used in deinterlacing, such as connected-component labeling and line correlation techniques. The unconventionality of these techniques meant that I constantly referred my ideas to my industrial supervisor for feedback on whether a solution was realistic and worth pursuing. A detailed account of the research is not included in this report; the following sections only describe the modifications made to the final revision of the deinterlacer design.

    Caveats in Algorithm and Modifications made in Response:

    Caveat: Insufficient line resolution for detection.
    Response: No solution provided; instead the algorithm relies on lines being sufficiently thick to allow line approximation. Increasing the line resolution would improve line detection, but at the cost of more hardware resources.

    Caveat: Over-detection of lines due to high pixel luma variation or a high density of lines.
    Response: I introduced a discriminatory process in the Hough that takes into account the distance between voting pixels. This reduces the clutter produced by regions with high pixel luminance variation and regions with high line density. This modification was purely my own.

    Caveat: Image artifacts produced by directional interpolation are very apparent and difficult to contain.
    Response: Introduced a processing block after the Hough Transform to filter out false-positive lines by consolidating detected edges with luma, edge and offset information. Most of the consolidation methods were proposed by my supervisor and some by me.

    Caveat: Edge detection needs to prioritize horizontal edges.
    Response: Removed the y-direction Sobel kernel and rely only on the x-direction Sobel kernel for line detection. This was suggested by me.

    Figure 15 Table showing the caveats in the Hough-based deinterlacer algorithm along with the modifications made in response.

    6.3.1 X-biased Sobel

    The proposed edge detection technique was a Sobel transform. To better capture vertical gradients while ignoring horizontal gradients, the y-direction Sobel kernel was removed. The result is an edge detection method that accounts less for straight lines closer to the vertical while accentuating straight edges closer to the horizontal. This simplification both improves the detection rate of low-angled lines (as fewer erroneous high pixels are derived from horizontal gradient changes) and reduces the complexity of the edge detection process, though this reduction in complexity is insignificant next to the increase in complexity brought about by other processes. The difference in the output binary edge image, and the clear improvement, is shown below in figure 16.

    (a) (b) (c)

    Figure 16 Shows (a) the original grayscale image, (b) the conventional Sobel Transform and (c) the X-biased Sobel Transform.

    6.3.2 The Proximity Hough

    The Proximity Hough (PH) method is the name I gave to a discriminatory process I incorporated into the traditional Hough method. This new Hough method is my own invention and forms the bulk of the research value added to the deinterlacer algorithm. Not only does the line-detection rate drastically improve, but the method is also very resistant to noise. Noise in this context means high binary pixels that do not form a straight edge, whether jagged or curvy edges or simply regions of high pixel luma variation.

    The objective of introducing the PH method was to tackle the problem of over-detection. Over-detection occurs when the Hough Space gets too cluttered due to regions of the image with high luma variation or a high straight-line count. The Hough Spaces in figure 17 clearly demonstrate this effect: the Hough Transform is performed on an original image with high binary pixel clusters and on the same image with those clusters artificially removed.

    (a) (b) (c) (d)

    Figure 17 shows the difference in Hough Space produced by images with and without high binary pixel clusters.

    The PH transform works by storing additional information about each voting high binary edge pixel inside memory blocks. We could think of these memory blocks as bins, though in reality they do not accumulate anything and are simply updated with new values. The information stored in these bins is the coordinates (x, y) of the previous voting pixel and the x start point of the line, alongside the usual accumulation bin. These bins have exactly the same dimensions as the accumulation bin, which is essentially the parameter space, so the stored coordinates and x start point are unique to a particular line. Recall that each element in the parameter space represents a unique line that can be drawn on the image.

    The general method of the PH transform is that once there is a hit for a particular line, the accumulation bin only increments if the (x, y) coordinates of the previously voted pixel for that line are within a tolerable range. This tolerance is named the proximity threshold: the maximum allowed distance between consecutive voting pixels. Hence, with a proximity threshold of 0, voting pixels must be adjacent or diagonally adjacent to one another. Where this distance requirement is not met, the accumulation bin is not incremented and the x start point bin remains unchanged. The x start point is only updated when a line first gets a hit; it therefore stores the x coordinate of the first pixel that voted for the respective line, which corresponds to the start x coordinate of the line. The y coordinate is neglected, as it can be derived mathematically since we know the intercept and Δx value.
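The proximity check described above might be sketched as follows (the bin layout and field names are assumptions based on the description; the exact update rule for rejected votes is not specified here):

```c
#include <stdlib.h>
#include <stdint.h>

/* Per-line bins of the Proximity Hough (sketch).  A vote only
   increments the accumulator when the new voting pixel lies within
   the proximity threshold of the previously recorded voter; a
   threshold of 0 therefore requires (diagonal) adjacency. */
typedef struct {
    uint16_t votes;      /* accumulation bin                  */
    int last_x, last_y;  /* coordinates of the previous voter */
    int start_x;         /* x coordinate of the first voter   */
} ph_bin;

void ph_vote(ph_bin *bin, int x, int y, int prox_thresh)
{
    if (bin->votes == 0) {           /* first hit: record line start */
        bin->start_x = x;
        bin->votes = 1;
    } else if (abs(x - bin->last_x) <= prox_thresh + 1 &&
               abs(y - bin->last_y) <= prox_thresh + 1) {
        bin->votes++;                /* voter close enough: accept   */
    }                                /* otherwise the vote is dropped */
    bin->last_x = x;                 /* assumed: coordinates always   */
    bin->last_y = y;                 /* track the latest voter        */
}
```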

    Note that the PH transform uses 3 additional bins compared to the conventional Hough Transform, and it requires more arithmetic computations and operations. Pseudo code for the conventional and the Proximity Hough Transform is included in Appendix I. This increase in complexity requires more FPGA hardware resources and increases latency. The throughput of the generated hardware remains one pixel per clock cycle, as much of the algorithm is easily pipelined. In terms of memory, it is key to note that the Proximity Hough requires constant reading and writing of all bins, whereas the conventional Hough only requires a write, as no feedback is needed to validate a particular line hit. A comparison of resource usage between the Proximity Hough and the conventional Hough is shown in figure 18. The resource estimation was generated using the Altera Offline OpenCL compiler, which estimates the amount of resources a design would occupy on a given development board or acceleration card; in our case, these values were generated for the Nallatech pcie385n A7 accelerator card.

    Figure 18 shows a bar graph of the resource usage, in percent, of a Stratix V A7 FPGA.

    The performance gain is due to reduced clustering of the Hough Space. Referring to figure 19, we can observe that the Hough Space produced using the PH (figure 19 (b)) is less cluttered than the Hough Space produced using the conventional Hough (figure 19 (a)). The discriminatory process of the PH reduces the clustering of line hits and hence avoids over-detection.

    (a) (b) (c) (d)

    Figure 19 shows the Hough Space of the conventional Hough (a) and the Proximity Hough transform (b), along with the binary edge image inputs containing high binary pixel clusters.

    [Bar chart: resource usage (%) of the conventional and Proximity Hough transforms across logic utilization, dedicated logic registers, memory blocks and DSP blocks.]

    The PH improves robustness to regions of high pixel luma variation and the rate of lines detected, though it increases hardware resource usage. Despite this increase, the video quality output of the Hough-Based Deinterlacer is greatly dependent on the performance of the line detection process, so it is justified to invest substantial research and development time, and hardware resources, in that process.

    An interesting improvement that could reduce the resource usage and complexity of the Hough Transform is to map the discriminatory process onto a form of Hough Space filtering. Parameter space filtering of the Hough Transform is not novel and has been used in other applications. This improvement would have taken more research time and so was set aside in favor of hardware implementation and other research areas. The next major research area is the post-processing method, which is comparable in resource usage to the line detection process.

    6.3.3 Post-Processing Block

    The PH transform reduces the image artifacts generated by decreasing the chance of detecting an erroneous line. Despite this improvement, image artifacts still appear in certain regions of the image, namely along the periphery of edges, at intersections of edges and at edge endpoints. The objective of the post-processing block is to further reduce the probability of generating image artifacts once line detection has completed. Examples of these image artifacts are shown in figure 20.

    (a) (b)

    Figure 20 (a) shows image artifacts occurring along the periphery of edges; (b) shows the refined output from the post-processing block.

    The post-processing block eliminates the above artifacts by conducting luminance and offset checks. These checks exploit the characteristic that similar luminance occurs along edges, and consolidate the offset mask with the edge image; their effectiveness was visually verified. The post-processing block also smooths deinterlaced pixels via a blend between the base interpolation and the offset interpolation, which reduces any drastic pixel luminance change introduced by offset interpolation. Detailed descriptions of the checks conducted are included in the appendix under the post-processing section (Appendix II).

    The introduction of the X-biased Sobel, the Proximity Hough and the post-processing block was crucial in making the initially proposed method feasible. I found this section of the internship particularly challenging due to the elaborate solutions I had to invent. Several other solutions, such as connected-component labeling and straight-line correlation techniques, were initially pursued but later discarded due to the high additional complexity they introduced and their poor robustness across video sequences. The final algorithm still has room for improvement, namely for video sequences with highly textured regions, though it produces impressive edge reconstruction when macro-lines are present in an image. The next major section of the placement is the implementation of the algorithm in hardware using Altera's High-Level Synthesis tool.

    The final post-processing algorithm was translated into hardware. During this translation, several key considerations had to be made to achieve a throughput of 1 pixel every clock cycle. This was a challenge because the post-processing algorithm consists of several blocks (see the appendix for detailed descriptions) that are inherently dependent on the outputs of previous post-processing blocks. To solve this, intermediate shift-buffers were introduced between these blocks. These buffers are read multiple times throughout the main loop but written only once. By introducing latency, data dependencies in the main loop of the post-processing can be eliminated entirely. A graphical representation of the post-processing algorithm, showing the internal blocks and shift-buffers, can be found in the appendix. Theoretically, with these shift-buffers, the final algorithm should translate easily into hardware with a throughput of 1 pixel every clock cycle. The challenge of the hardware implementation phase is to write the post-processing algorithm in a fashion from which the high-level synthesis compiler generates the intended hardware.

    6.4 Conclusion and Personal Reflection

    The conclusion of this phase of the placement is that the proposed deinterlacer method is a feasible solution to the low-angled problem. The algorithm works particularly well at refining image artifacts at edges with a high luma difference. This is key, as regions of high luma difference are the most apparent to the human eye; a good example of the result is along the ping-pong table edges shown in figure 21. An unexpected improvement in performance was also discovered that further consolidates the feasibility of the design: slightly curved edges that exhibit similar image artifacts are also detected and refined by the final algorithm.

    Though development was challenging due to the fine-tuning of design parameters across several video sequences, I was motivated by the fact that the algorithm I was developing tackles a real engineering problem in the interlaced video broadcast industry. The algorithm development was very rewarding, as many of the additions, such as the discriminatory Hough process and the post-processing block, are novel methods. I have therefore not only contributed to the deinterlacer design by implementing it in C but also added 4 months of research value to the algorithm, and I have experienced the difficulty of designing a robust image processing algorithm that produces consistent results across the many variations in video sequences. At the end of this phase the algorithm was finalized and ready to be implemented in hardware.

    (a) (b)

    Figure 21 shows the output of the FDD method (a) compared with the final result of the Hough-Based deinterlacer design (b), taking as input the 187th frame of the Table Tennis sequence.

    7 Altera OpenCL and High-Level Synthesis

    The second phase of the project involves the implementation of the deinterlacer design in hardware using Altera's high-level synthesis tools. A high-level synthesis tool is a program that generates RTL or logic from a higher-level programming language such as Python or C. The target tool for the deinterlacer design is the Altera high-level synthesis (HLS) tool. The Altera HLS is in its infancy, and the objective of this phase of the project is to validate the tool by recording the development time, identifying bugs and providing user feedback. The Altera HLS compiler uses the same high-level synthesis process as a more mature tool, the Altera OpenCL compiler. The Altera OpenCL tool flow has useful debug tools such as resource estimation and hardware emulation (elaborated in the next section). This provides an easier C-to-hardware translation while still validating the high-level synthesis process working at the back end of the Altera OpenCL compiler. Despite this similarity, there are several differences between Altera OpenCL and HLS.

    The distinction between the Altera OpenCL and HLS compilers lies in their target users. The increase in available FPGA resources, brought about by better manufacturing technology and FPGA architecture, has made FPGAs an attractive solution for high-performance computing. The FPGA accelerates computations with few data dependencies that are easily parallelized. But to leverage the FPGA, the user needs a background in hardware design and familiarity with hardware development tools. To widen the range of users, Altera provides the Altera OpenCL compiler (AOCL) as a solution. The AOCL is based on a parallel-programming standard called OpenCL, which supports a multitude of platforms such as DSPs, CPUs, GPUs and FPGAs; hence it is possible to create a high-performance computing system with OpenCL that allows offloading to a variety of platforms. The AOCL is an elegant tool for software engineers who simply wish to implement their software on an FPGA, due to the abstraction from hardware interfaces and the complete generation of the final hardware system. On the other hand, the lack of control over the generated hardware interfaces is unattractive to hardware designers. Altera provides another compiler, Altera HLS, to target these hardware designers. The objective of the Altera HLS is mainly to generate single IP cores rather than an entire hardware system; the generated IP core can then be instantiated and incorporated in a hardware system via Qsys.

    The HLS compiler would both provide internal and external benefit to Altera. Internally, the HLS compiler could be used

    to accelerate hardware prototyping. This is demonstrated in the hardware translation of the Hough-Based Deinterlacer

    design. It is estimated that implementing the design into hardware would require 1 man-year, whilst the hardware translation

    phase of this project was completed in under a month. Fast prototyping would help generate quick outputs to assess the

    feasibility of an algorithm. The only drawback the HLS compiler would have is that it will not provide the hardware designer

    register level control of the hardware. Externally, the HLS tool will provide customers an easy tool to modify or adapt

    existing reference designs or Altera IP to suit their application. They would be able to tweak sections of C/C++ code to

    include new components or change design parameters.

    7.1 The OpenCL tool flow

The Altera OpenCL tool flow is designed to speed up hardware development and design by introducing an emulator and optimization reports generated from OpenCL kernels. OpenCL kernels are C functions with restrictions and extensions imposed by OpenCL. These restrictions and extensions provide a framework for data-parallel programming. The kernels are written in a single-threaded, task-based fashion, and hardware parallelism is inferred by unrolling loops and pipelining computations. Hardware parallelism can also be inferred by replicating (i.e. vectorizing) kernels. Kernels are launched via an OpenCL host program, which is an executable. In addition to launching kernels, the host program allocates, reads and writes the target device's global memory. The host program is built with any standard C compiler, which gives the flexibility to include any readily available C library. For this application, the OpenCV library is included to read and write images to and from the target hardware.
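To make this kernel style concrete, the sketch below mimics a single-work-item OpenCL kernel as plain C so it compiles anywhere; the 3-tap filter and its coefficients are illustrative placeholders, not taken from the deinterlacer design. In an actual Altera OpenCL kernel, a #pragma unroll on the inner loop would replicate the multiply-accumulate hardware.

```c
#include <stddef.h>

/* Single-work-item style kernel: one sequential outer loop that the
   compiler pipelines, plus a small fixed-trip inner loop it can fully
   unroll. Hypothetical 3-tap smoothing filter; the taps sum to 4 so
   the final divide is a cheap shift. */
void filter_1d(const int *in, int *out, size_t n) {
    static const int taps[3] = {1, 2, 1};
    for (size_t x = 1; x + 1 < n; ++x) {    /* pipelined in hardware */
        int acc = 0;
        /* #pragma unroll */                 /* unrolled in hardware  */
        for (int k = 0; k < 3; ++k)
            acc += taps[k] * in[x - 1 + k];
        out[x] = acc >> 2;                   /* divide by 4 via shift */
    }
}
```

The host program, built with an ordinary C compiler, would allocate the in/out buffers in the device's global memory and launch this kernel.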

The Altera OpenCL tool flow mimics a software-like debug flow thanks to the relatively short compile times of emulation and optimization-report generation. Emulation and optimization reports take seconds to compile, in stark contrast to hardware compiles, which usually take hours to complete. The complete OpenCL tool flow is shown in figure 21.

    Figure 21 Flow chart of Altera OpenCL tool flow.

The emulation feature allows functional debugging of the design without any hardware generation. The tool produces a binary file (.aocx) containing program objects that target the FPGA. The binary file can be executed on any x86 processor to simulate the generated hardware. Hardware performance figures are not provided by emulation, as no actual hardware is running. Once the design generates output that is verified to be correct, the next stage is to improve the efficiency of the generated hardware and obtain a resource estimate. This feedback is provided by the optimization report.

The optimization report gives the engineer an idea of how efficiently the compiler has generated the hardware. The report consists of a list of successfully pipelined code sections, serially executed code sections, data dependencies and a resource estimation of the design. This stage is crucial for final hardware performance; most of the development time is spent reducing the data dependencies flagged by the report. A snippet from an optimization report is shown below in figure 22.

    Figure 22 Snippet from an optimization report showing sections of pipelined code, serially executed code, data

    dependencies and a resource estimation.

7.2 Compromises and Optimizations

The hardware implementation phase consisted of four main kernel code revisions: serial hardware execution, hardware parallelism introduced, data dependencies removed, and optimized hardware. The code was modified based on the information provided by the optimization report. The goal was essentially to remove data dependencies where possible to enable parallelism, to infer shift registers for the pixel buffers, and to remove conditional loops and conditional memory accesses. The progress in terms of resource usage and execution time across the four code revisions is displayed in the bar charts of figures 23 and 24.

    Figure 23 Bar chart showing FPGA resource usage for all hardware revisions.

(Series: Memory Blocks, Logic Utilization, DSP Blocks and Logic Registers for each revision.)

Figure 24 Bar chart showing kernel execution time of the Sobel, Hough, line drawer, interpolator and frame generator kernels for all hardware revisions, running at 150 MHz.

The first code revision (serial hardware execution) was meant to get the design to fit on a Stratix V A7 FPGA by compromising design performance and reusing hardware blocks. To decrease the deinterlacer's resource usage, the range of y-intercept values was reduced, which compromised the final video quality. Note that because the angles of interest are relatively small (i.e. less than 45°), this compromise did not significantly affect the final video output, making it a reasonable tradeoff.

The second code revision (hardware parallelism introduced) aimed to generate efficient pixel buffers to reduce resource usage. Pixel buffers were initially included in the algorithm translation phase to remove data dependencies by introducing latency. A pixel buffer stores a preset number of pixels (depending on the process): the first element of the buffer is updated with the incoming pixel value and the final element is discarded. An efficient way to implement pixel buffers is with shift registers. A shift register mimics an array that has all its elements shifted by one index while the first element is updated with a new value. For a shift-register interpretation of a pixel buffer, the access indices have to be known at compile time. The shift registers form a delay line with signal taps at the respective access indices, so the number of accessed indices also had to be kept to a minimum to limit the hardware signal taps generated. Pixel buffers with dynamic indices were written in C as circular buffers, which infer a memory block with constant reads and writes.
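The shift-register form of a pixel buffer can be sketched as below; the depth is illustrative, not that of the deinterlacer. Because every access index is a compile-time constant, an HLS compiler can map the array to a register delay line with taps rather than a RAM block.

```c
#define DEPTH 8   /* buffer length, fixed at compile time */

/* One cycle of the buffer: every element moves up one index and the
   incoming pixel enters at index 0. Each assignment reads only the
   previous cycle's values, so the loop can be fully unrolled into
   wires between registers. */
void shift_in(int buf[DEPTH], int pixel) {
    for (int i = DEPTH - 1; i > 0; --i)
        buf[i] = buf[i - 1];
    buf[0] = pixel;
}
```

A dynamically indexed buffer (an offset computed at run time) cannot be expressed this way, which is why such buffers were written as circular buffers and left in memory blocks.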

The goal of the third code revision (data dependencies removed) was to remove data dependencies in the post-processing kernel. As mentioned in Section 6.3.3, the post-processing algorithm inherently has data dependencies, which were removed by introducing shift buffers. The ideal result is a throughput of one pixel per clock cycle. Though the C code was written to generate hardware that removes these data dependencies, the compiler does not necessarily generate the intended hardware. To achieve this performance, three kinds of constructs need to be avoided: conditional reads from memory, conditional for loops, and buffer indices that depend on a value from the same buffer. Conditional reads from memory and conditional for loops were removed by always performing the read and the loop iteration and using a boolean variable to validate the final assignment. Indices that depend on a value from the buffer itself were handled by duplicating the buffer and deriving the index from the cloned buffer. These amendments to the C code successfully generated post-processing hardware with no data dependencies in the main loop. The removal of data dependencies in the main loop also allows the compiler to infer efficiently pipelined computations. The efficiency of a pipelined loop is measured by the number of clock cycles between successively launched iterations. At the end of this process, all serially executed sections had a pipeline efficiency of 50% (i.e. 2 clock cycles between iterations). Kernel-level pipelines were also


introduced to enable a new set of inputs to be processed while the previous set of inputs was sent to the next kernel. The final improvement made during this hardware revision was task parallelization through loop unrolling. For loops in the post-processing blocks were unrolled; this unrolling increases resource usage but gives a faster execution time.
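The predicated-read pattern described above can be sketched as follows; the function and data are hypothetical, chosen only to show the transformation. Rather than guarding the memory read and the loop with an if, the read and the iteration always happen and a boolean validates the final assignment, so the compiler sees a fixed-trip-count loop with a select instead of a branch.

```c
/* Find the largest value above a threshold. The read buf[i] and the
   loop trip count are unconditional; "valid" gates only the final
   assignment, which maps to a multiplexer in hardware rather than a
   stalling conditional load. */
int predicated_max(const int *buf, int n, int threshold) {
    int best = 0;
    for (int i = 0; i < n; ++i) {
        int v = buf[i];                         /* always read        */
        int valid = (v > threshold);            /* guard as a boolean */
        best = (valid && v > best) ? v : best;  /* select, not branch */
    }
    return best;
}
```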

The final hardware revision aims to produce optimized hardware. This was achieved by removing division-by-constant, division and modulo operations, which are expensive: division in particular costs far more in hardware than multiplication. A division by a constant can be approximated by a multiplication with a constant followed by a division by a power of 2, which in hardware is a shift operation and hence relatively cheap. The constant of interest is the width of a video frame; a division by a width of 720, for example, can be approximated by a multiplication by 91 followed by a right shift by 16 bits. Division by a variable is translated into a ROM lookup table. Note that the denominator and numerator of this particular division operation are bounded to a limited set of discrete values (i.e. -64 to 64 for the denominator and 0 to the width of the video frame for the numerator), so a lookup table is apt. These alterations to the C code reduced both resource usage and execution time.
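The two division replacements can be sketched as below. The width-720 constants (multiply by 91, shift right by 16) are the ones quoted above; note that this multiply-shift form is only an approximation and can be off by one near exact multiples of 720. The lookup table here uses a rounded-up fixed-point reciprocal, a detail assumed for this sketch, which makes the result exact for the bounded ranges involved (numerator up to the 720-pixel frame width, denominator 1 to 64).

```c
/* Division by the constant frame width 720, approximated as a
   multiply and a shift: x/720 ~= (x*91) >> 16. Cheap in hardware,
   but can undershoot by one near exact multiples of 720. */
int div720_approx(int x) { return (x * 91) >> 16; }

/* Division by a bounded variable via a reciprocal ROM.
   recip[d] = ceil(2^16 / d); rounding the reciprocal up keeps the
   quotient exact for 0 <= num <= 720 and 1 <= den <= 64. */
#define RECIP_SHIFT 16
int recip[65];

void init_recip(void) {
    for (int d = 1; d <= 64; ++d)
        recip[d] = ((1 << RECIP_SHIFT) + d - 1) / d;   /* ceil */
}

int div_lut(int num, int den) {
    return (num * recip[den]) >> RECIP_SHIFT;  /* multiply + shift */
}
```

Negative denominators (the document's range is -64 to 64) would be handled by taking absolute values and reapplying the sign outside the table.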

The improvements in kernel execution time and FPGA resource usage brought about by the four code revisions are apparent, and they depend heavily on writing the kernel in a manner from which the compiler can extract parallelism. The key technique is to reduce data dependencies, conditional reads and conditional for loops. The hardware can be further optimized by reducing division operations and translating bounded functions into lookup tables.

7.3 Conclusion and Personal Reflection

An early introduction to the OpenCL and HLS tools at the onset of the placement would have produced a better final deinterlacer design. There was a lack of hardware consideration during the development of the algorithm: considerations like memory access patterns (i.e. random access or bit-streaming), task parallelism and pipelining were not taken into account. Therefore, sections of the algorithm that could not easily be translated into a bit-streaming structure had to be rewritten in a bit-streaming fashion to achieve the same output, which often resulted in a degraded version of the design. About two weeks were dedicated to rewriting the algorithm in a bit-streaming fashion. Experimenting with the high-level synthesis tools and understanding how to generate efficient hardware for video processing would have reduced the time spent in the hardware translation phase.

Overall this phase took about a month. The weekly video conference calls with Andrei Hagiescu in Toronto were especially helpful; he provided crucial guidance and feedback on the direction and improvement of the design. The final code revision generates hardware that is able to process 720i video input at 30 frames per second. This performance is not up to par with the market demand of deinterlacing 1080i video at 60 frames per second, but I am confident that, given more time developing the hardware in OpenCL or perhaps translating it to Verilog, the 1080i at 60 frames per second benchmark could be achieved.

8 Offsite and Extra Activities

On the 6th of August I visited the Altera site in Penang to meet the Malaysian team managed by Benjamin Cope. Benjamin Cope is the manager of the Altera team I am in (i.e. the Video IP team) and he suggested the visit. The aim of the visit was to observe how Altera in Malaysia operates. The site is located on the island of Penang, off the west coast of the Malaysian Peninsula. The island has a specially designated industrial zone that hosts many manufacturing, research and electronics companies such as Intel, AMD, Motorola, Agilent and Altera. I was shown around the site by Ivan Teh, the manager of the Malaysian VIP team, and conversed with the team about work-life balance and the benefits Altera provides. On another note, I was impressed by the sheer size of Altera in Malaysia, which has about 1000 employees. The visit was eye-opening, as it made me realize the scale at which Altera operates internationally and its strong presence in the growing Southeast Asian region. I would definitely consider working at an Altera site closer to home.

The IBC (International Broadcast Conference) is an annual event held at the RAI Amsterdam that runs for 5 days. The conference attracts companies and organizations involved in future solutions for electronic media and technology. Altera operates a booth, and I was given the privilege of submitting a poster describing the Hough-based deinterlacer; the poster is included in Appendix III. It was displayed in the IBC Future Zone, a fairly new exhibition within the conference whose purpose is to showcase interesting ideas and projects from research and development labs and universities. During the exhibition, the organizer of the Future Zone was hoping to create an online archive where submitted posters and papers could be stored and accessed. I went to Amsterdam on the 12th of September and showcased my design for the entire day along with my placement supervisor, Jon Harris. During the exhibition, several people approached us expressing their interest in the solution; some requested a copy of the poster and some wanted to implement the design on GPUs. The interaction with people from the broadcast industry reassured me that the deinterlacer project addresses a real problem in the deinterlacing world.

There were customers who expressed interest in the design and requested video samples from the deinterlacer to evaluate. These output video sequences were generated and given to the customers for evaluation. A quotation for the IP core will follow if the customers are pleased with the video sequences. The identity of the customers is confidential, but their request for video output sequences is testament to the fact that the decompression artifacts generated by most deinterlacers are a genuine problem in industry. The experience was both interesting and rewarding in that I was able to work on a real engineering problem and had the opportunity to interact with potential customers.

The offsite activities provided me with a platform to interact with both Altera employees in Penang and the potential customers and researchers who attended the IBC conference. This exposure has helped me understand how products, software and IP cores are developed internally and how they are then presented and marketed to customers. These activities have definitely been insightful in terms of engineering product inception, development and marketing.

    9 Conclusion

The project aims were to develop a deinterlacing algorithm that targets low-angled artifacts and to synthesize the design into hardware using an Altera high-level synthesis compiler. I personally feel that these two objectives were met, as the proposed method has been developed into a feasible solution and has been successfully implemented as working deinterlacer hardware. The feasibility of the deinterlacer is assessed in terms of whether it consistently removes the targeted decompression artifacts and whether it fits onto a single FPGA chip. Though the final deinterlacer hardware does not meet the market performance benchmark, this benchmark is expected to be achievable given further development of the design in either OpenCL or Verilog. I am confident in the robustness and success of the deinterlacer IP core and that it would make a good addition to the set of deinterlacers available in the Altera Video IP suite licensed to customers. This confidence comes from the consistent video outputs generated by the deinterlacer algorithm across 10 common interlaced test video sequences. The robustness of the algorithm is attributed to the key incorporation of image processing algorithms, namely the proximity Hough transform and the post-processing block. These two processes formed the central focus of the research done for the deinterlacer and have definitely improved the line detection rate and edge refinement ability of the design.

A summary of the industrial placement achievements is given in the bulleted list below:

• Invented a discriminatory process for the Hough transform to improve the line detection rate and robustness to high pixel luminance variation.
• Invented a post-processing algorithm that consolidates detected edges to eliminate artifacts generated by directional interpolation.
• Successfully implemented the deinterlacer on an FPGA in under a month using the Altera OpenCL compiler, and verified that the hardware generates correct output.
• Achieved below-benchmark video performance in a limited amount of hardware development time, with confidence that further hardware optimization can achieve industry-standard video deinterlacing performance.
• Completed algorithm development and hardware implementation of the design in 6 months.
• Received positive feedback from customers who compared the Hough-based deinterlacer design to their existing deinterlacer (the identity of the company cannot be disclosed).

10 Appendix

    I Pseudo Code of Conventional and Proximity Hough Transform

II Post-Processing Block

Post-processing functional flow diagram

Post-processing block descriptions:

Edge Image Generator module (sub-kernel SK0, edge image generator)
  EG_0  Generates the binary edge image

OM Generator (offset mask) module (sub-kernel SK1, edge consolidation)
  Om_0  Consolidates the offset mask and the binary edge image
  Om_1  Offset mask check: checks the presence of the offset mask along suggested pixels
  Om_2  Offset mask expander: expands the offset mask to include more peripheral pixels
  Om_3  Offset mask estimator: estimates the true offset mask (the detected offset mask is an approximation)

WM Generator (weight mask) module
  Sub-kernel SK2:
    Wm_1  Offset mask raw: generates a weight mask from an offset mask
    Wm_2  Post offset mask end roll-off: introduces a post roll-off weight to smoothen the final output
    Wm_3  Pre offset mask end roll-off: introduces a pre roll-off weight to smoothen the final output
  Sub-kernel SK3:
    Wm_4  Luminance check: checks that the top and bottom luminance has a significant variation, which implies an edge
    Wm_5  Top and bottom check: checks that the weight mask exists at the top and bottom of the targeted pixel
    Wm_6  Offset check: checks that the luminance along the edge does not vary beyond a tolerance level
    Wm_7  Average weight: averages the weight mask to allow a smoother transition between edge and non-edge regions

Interpolator module (sub-kernel SK4)
  Interpolate  Performs linear or directional interpolation based on the offset and weight masks

III IBC paper submission

    Using a Bounded Offset Hough Transform For Edge-Dependent Deinterlacing

    Jon Harris (Altera) and Abdulaziz Azman (Imperial College London)

    Abstract This paper presents an edge-dependent deinterlacer scheme which uses

    a Bounded Offset Hough Transform. The scheme aims to resolve low-angled

    artefacts which occur in images produced by most intra-field deinterlacing

    methods. Existing edge-dependent deinterlacers analyze neighboring pixels to

    recover edge information. While this is sufficient for edges closer to the vertical,

    low-angled edge information is rarely sufficiently recovered. This is due to the

    inherent lack of vertical edge information in low-angled edges. The scheme

    proposes the use of a variant of the Hough Transform to analyze non-localized

    pixels to extract more edge information which will be used for edge-dependent de-

    interlacing. Comparing the output of the scheme with several known deinterlacing

    schemes suggests significant improvement in image quality output.

    I. Introduction

    Interlaced scan signals were initially used in analogue CRT television

    to improve the video frame rate without requiring additional bandwidth

    and is achieved by sampling only the horizontally odd or even lines of an

    image. The sampled image from one time instance in an interlaced video

    is called a sub-field. Though the vertical resolution is effectively halved,

    the temporal resolution is doubled which reduces image flicker in CRT

    television. Interlaced video eventually formed the basis of Analogue

    broadcast systems such as PAL and NTSC. Modern digital displays such

    as LCD and plasma screens use a progressive video format which captures

    and displays all horizontal lines at the same instance. To display interlaced

    video on modern digital displays, a deinterlacer is required. An ideal

    deinterlacer algorithm will be able to fully recover the missing horizontal

information in interlaced video. Full reconstruction is not an easy task and is theoretically impossible according to the Nyquist sampling theorem. Visual

    artefacts from de-interlacing are therefore hard to avoid. Further

    information regarding de-interlacing can be found in [1].

Existing deinterlacers can be crudely categorized into inter-field and

    intra-field. Inter-field deinterlacers mainly extract motion information and

perform deinterlacing accordingly. The main drawbacks of inter-field deinterlacing are its high complexity and its reliance on the motion

    detection algorithm. Motion detection fails when large displacements are

involved, which often results in poor video output. In contrast, intra-field

    deinterlacers only process a single sub-field and are therefore

    algorithmically less complex than their inter-field counterparts.

    Existing Intra-field deinterlacer schemes process and i

