
Hough Based Deinterlacer

Date post: 09-Jan-2016
Upload: abdulaziz-azman
Description:
This document describes a hardware design for a deinterlacer that specifically targets the low-angled artifacts generated by the widely used line-averaging method. The design was initially proposed by Jon Harris. Note that the document does not reveal much detail of the hardware design; rather, it provides an overall description of the Hough-based deinterlacing method.

Hough-Based Deinterlacing

Altera Industrial Placement

Summer 2014

Abdulaziz Azman

CID: 00680225

Table of Contents

1 Summary
2 Introduction
3 Altera Overview and Project Scope
4 The Deinterlacer Project
  4.2 Project Expectations
5 Team Management and Organizational Tools
  5.1 Video IP Team Management
  5.2 Project Management Tools
6 Implementing and Developing the Algorithm
  6.1 The Deinterlacing Challenge
    6.1.1 Deinterlacing Background
    6.1.2 The Low-Angled Problem
    6.1.3 Project Motivation
  6.2 The Hough-Based Deinterlacer Design
    6.2.1 General Method
    6.2.2 Hough Transform
    6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology
  6.3 Research and Improvements
    6.3.1 X-Biased Sobel
    6.3.2 The Proximity Hough
    6.3.3 Post-Processing Block
  6.4 Conclusion and Personal Reflection
7 Altera OpenCL and High-Level Synthesis
  7.1 The OpenCL Tool Flow
  7.2 Compromises and Optimizations
  7.3 Conclusion and Personal Reflection
8 Offsite and Extra Activities
9 Conclusion
10 Appendix
  I Pseudo Code of Conventional and Proximity Hough Transform
  II Post-Processing Block
  III IBC Paper Submission

1 Summary

Altera is a semiconductor company based in Silicon Valley. Altera manufactures FPGAs, PLDs and ASICs, and has a large business unit portfolio covering sectors such as broadcast, automotive, industrial and communications. The company has sites across the globe, with the European headquarters located in High Wycombe. Altera Europe focuses on the business unit side of the company rather than research and manufacturing. The business unit I was assigned to is Broadcast, which aims to provide IP solutions and software tools for broadcast-related components. The current focus of the Altera Broadcast business unit is to meet customer demand for 4k (UHD) processing by making IP cores that can handle 4k video bandwidth and data.

One challenge identified in 4k displays arises when lower-resolution interlaced sources are shown on a UHD television. To display an interlaced video source, the input video undergoes a decompression process known as deinterlacing, which converts the interlaced video to a progressive video suitable for digital displays. This decompression often generates image artifacts, which are further magnified by the up-scaling required to match the 4k display resolution.

There were two goals in my project. The first was to develop a deinterlacing method proposed by my supervisor that aims to eliminate these image artifacts. This phase of the project involved modifying the method and employing additional image processing techniques to render the solution feasible. A feasible deinterlacer design is one that consistently removes image artifacts and can be fitted into a single FPGA chip. Several novel methods had to be incorporated to generate significantly improved deinterlaced video outputs. The second goal was to implement the finalized algorithm in hardware using Altera high-level synthesis tools.

The main challenge encountered during the project was to create a deinterlacer design that produced consistent results. This is difficult due to the vast amount of feature variation in video sequences. The approach to tackling this problem was to research different computer vision and image processing methods that could be exploited. The final result of the research is a novel adaptation of an established feature detection algorithm and the inclusion of a smoothing post-processing technique. This is a personal achievement given the novelty of the algorithm; arriving at the solution required a good understanding of the deinterlacing problem and a low-level familiarity with the feature detection algorithm. Using the high-level synthesis tool was also a challenge because it was my first exposure to it. Courses like Digital System Design and VHDL in my third year were relevant in this phase, as I was already accustomed to hardware performance measures, loop unrolling, memory interfaces and hardware timing requirements.

Throughout the placement I learned to communicate better with colleagues and managers, namely when giving progress updates and discussing project resource allocation. I realized the importance of giving your manager an accurate impression of your ability to complete a specific task: a false impression results in your manager over-expecting results and overloading you with more tasks. I attended many Altera Broadcast meetings and video conferences, which provided insight into the development of IP cores and the management of a team of engineers. Overall I learned about both the technical side and the productization of an engineering solution.

The deinterlacer design produced at the end of the placement has potential to be developed further. Incorporating suitable digital filters could drastically reduce execution time, and inferring more hardware parallelism by increasing the number of data paths would enable the deinterlacer to perform 4k processing at 60 frames per second.

    2 Introduction

In recent years there has been a demand for higher-resolution displays. Broadcast and digital television companies around the world are seeking to satisfy the growing market demand for high-definition (1080i and 1080p) video broadcast and television resolution. This market trend will inevitably require improved video processing components such as scalers, codecs and deinterlacers. This industrial project focuses on designing a deinterlacer on an FPGA that would meet these high-quality deinterlaced video demands.

The project is divided into two main tasks: the algorithm development phase and the hardware implementation phase. The algorithm development phase was meant to identify caveats in the general method proposed and to explore aspects of the design that could be improved. This involved initially writing the method in C++ and modifying it based on feedback from the video outputs generated. The proposed deinterlacing method introduces numerous image artifacts, hence a large portion of the placement was dedicated to researching and devising new methods to improve these video outputs. The hardware implementation phase was to validate an Altera high-level synthesis (HLS) tool flow. The HLS tool flow aims to widen the range of Altera customers by enabling easy translation from a widely used C-based programming language to a register-transfer level hardware description. This abstraction from hardware is attractive to software engineers and allows them to exploit the FPGA architecture without having to learn hardware description languages. This report will center on the research and modifications made to the method, an outline of the translation process from C to OpenCL, and an evaluation of the performance of the hardware generated.

The report also includes an overview of Altera and the engineering team I was assigned to. A basic description of the management practices and tools used, along with a brief account of company-related offsite activities, is also included.

3 Altera Overview and Project Scope

Altera is a manufacturer of Programmable Logic Devices (PLDs). Its headquarters is located in Silicon Valley, but there are many Altera sites and offices around the world. The company portfolio includes research and manufacturing of PLDs as well as FPGA-related tools and solutions for specific sectors. These sectors are identified as business units, and the main Altera business units (BUs) are Industrial, Automotive, Communications and Broadcast. A hierarchy of the BUs can be seen below in figure 1, along with the teams under the Broadcast BU.

Figure 1 Diagram showing the Business Units inside Altera and teams under the Broadcast BU.

The teams in Altera are constantly changing based on customer demands, because Altera adopts a dynamic allocation of resources: engineers and funds are constantly reassigned and reallocated to different teams to meet the immediate demands of customers. This management approach is practical and strategic as it achieves optimal resource allocation. Altera engineers are also not restricted to a single BU.

I was assigned to the Video IP team under the Broadcast business unit. The Broadcast BU aims to meet broadcast customer demands by providing intellectual property cores, design software and tools. The Video IP team under Altera Broadcast focuses on meeting customer demands for 4k video displays and processing by providing solutions such as 4k IP cores and design tools. Altera invests in creating IPs readily available to customers to encourage the use of Altera's FPGAs. Examples of video IPs available are scalers, deinterlacers, SDI and HDMI interfaces and chroma resamplers. The project assigned to me focuses on developing a deinterlacer component. Deinterlacers are included in digital video systems to enable interlaced video to be displayed; a deinterlacer is therefore an important component of a digital display system.


4 The Deinterlacer Project

There are three phases in the Deinterlacer Project. The first is to implement, develop and conduct a feasibility study on the deinterlacer method. This method is described in an internal Altera document authored by my supervisor, Jon Harris, and is explained in section 6. The second is to restructure the deinterlacer algorithm to suit a hardware implementation. The third phase is the hardware synthesis of the deinterlacer design. While the primary objective of the third phase is to implement the deinterlacer algorithm in hardware, the secondary objective is to validate an Altera high-level synthesis tool. The tool is still in its infancy; it is available to the public, though some features are only available to Altera employees. The task division of the project was managed in Jira to enable easy monitoring. A breakdown of the project plan is presented in the table of figure 2.

| Phase | Task | Description | Time Allocated (weeks) | Time Completed (weeks) |
|---|---|---|---|---|
| Algorithm Development | Develop deinterlacer method in C++ | Implement the method described in the patent draft and use OpenCV as an interface to generate image outputs from the C++ functions. | 11 | 14 |
| Algorithm Translation | Translate C++ code to OpenCL | Rewrite sections of code in a pixel-streaming manner. Restructure the code so that the OpenCL compiler is able to compile it and extract parallelism. | 2 | 4 |
| Hardware Implementation | Use the OpenCL tool flow to create the first OpenCL-based program; perform necessary optimizations and compromises | Use resource estimation to ensure the design will fit into the card, and verify frame outputs from hardware using emulation. Optimizations include inferring pipelines wherever possible, relaxing loops by reducing data dependency and inferring shift registers wherever possible. Compromises include reduction of the intercept range and delta range (benchmark is 1080i). | 6 | 2 |
| Hardware Synthesis | Synthesize design to FPGA | | 3 | 2 |

Figure 2: Table showing the breakdown of the project plan proposed by the placement supervisor.

The time allocation was decided by my supervisor, who based the estimates (in weeks) on the fact that the algorithm required refinement and that I was unfamiliar with the high-level synthesis tool. The algorithm development phase was allocated 11 weeks but took 13 weeks to complete. The prolonged period was due to the difficulty of making the deinterlacer produce consistent outputs across a set of industry-standard interlaced video sequences. A large research effort was put into making the deinterlacer design a feasible solution to the deinterlacing problem (elaborated in section 6). Note that the term feasible is used to describe a solution that consistently removes targeted artifacts and can be synthesized to fit on a single readily available FPGA chip. The algorithm translation phase took longer than expected because the C code was written with no hardware considerations. The initial code used C++ classes and random accesses, which are more difficult to translate into hardware than an algorithm written in a pixel-streaming fashion. Hence the restructuring of the C code involved not only a translation of the algorithm but also major changes to render it suitable for pixel-streaming. The modified algorithm had to achieve the same effect, or at least a similar effect, as the pure C implementation. The hardware implementation phase was completed much faster than expected: of the allocated 6 weeks, it was completed in 2. The acceleration was due to the robustness of the high-level synthesis tools used (elaborated in section 7) and the close communication between me, my supervisor and an Altera employee involved in the development of the tools.
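To illustrate the pixel-streaming restructuring described above, the sketch below shows the classic line-buffer pattern that replaces random frame accesses with a sliding window over a stream of pixels; the class name and interface are assumptions for illustration, not the project's actual code. This style maps naturally onto FPGA shift registers and on-chip RAM, which is why the HLS compiler favors it.

```cpp
#include <deque>
#include <vector>
#include <cstddef>

// Illustrative line buffer for a 3-line vertical window. Pixels arrive one
// per iteration (row-major), and the buffer retains exactly the previous
// two lines so that a vertical 3-tap window is available at every column.
class LineBuffer3 {
    std::size_t width_;
    std::deque<int> buf_;  // last two full lines, plus the pixel just pushed
public:
    explicit LineBuffer3(std::size_t width) : width_(width), buf_(2 * width, 0) {}

    // Push one incoming pixel; returns the 3-tap vertical window
    // {two lines up, one line up, current} aligned at the same column.
    std::vector<int> push(int pixel) {
        buf_.push_back(pixel);
        std::vector<int> window = {buf_[0], buf_[width_], buf_[2 * width_]};
        buf_.pop_front();  // discard the oldest pixel; buffer stays 2 lines deep
        return window;
    }
};
```

Streaming a 3-line-tall image through this buffer yields, at each pixel, the column-aligned pixels from the two lines above it, which is all a line-averaging style interpolator needs.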

4.2 Project Expectations

The proposed method of deinterlacing is the first of its kind in that the Hough Transform is used to deinterlace a frame. Due to its novelty and the limited placement time, the performance of the algorithm is not expected to be extremely robust and efficient. It may produce additional image artifacts while targeting certain sections of the image for refinement. It is expected, though, that the algorithm should work very well in situations where its parameters are fine-tuned to a particular video sequence. Such inconsistency in video deinterlacers is not surprising, as even mature and well-established deinterlacing algorithms do not produce consistent results for all video sequences.

The hardware generated from the HLS tool is not expected to be optimized and may take up a large amount of resources. The target board for this design is the Nallatech PCIe-385N FPGA accelerator card, which carries a Stratix V. The Stratix family is among the higher-end FPGAs available on the market today and hence provides one of the highest FPGA resource counts. Despite this, the design is expected to use up most of the Stratix V and is therefore very expensive in terms of FPGA resources. The design is thus not yet suitable for incorporation into an FPGA-based display system, which often includes other video components such as scalers, memory and port interfaces, and video encoders and decoders. The hope is that as FPGA resources increase and the HLS tool matures, the deinterlacer design will become a practical solution for better deinterlaced video output.

5 Team Management and Organizational Tools

A positive aspect of the placement was the constant exposure to the meetings and organizational tools used in the Video IP team. My manager and supervisor included me, whenever relevant, in video conferences and group discussions. As a result I familiarized myself with the dynamics of the Video IP team management and the management tools they use for IP core development.

5.1 Video IP Team Management

Benjamin Cope, who is based in Altera High Wycombe, manages the VIP group. The group consists of 11 engineers, 8 of whom are based in Altera Penang. The engineers in the VIP group work closely together but are not all based at the same Altera site, so the team uses video conferences and online management tools to monitor everyone's progress. The video conferences between Altera Europe and Penang are held once a month to report individual progress and to highlight group milestones. I attended these meetings despite not contributing to the team project, though on occasion I provided updates on the progress of my deinterlacer project.

The VIP team organizes its work using the Agile methodology, with Jira as its online management tool. These are used to keep track of work progress and to generate performance measures. Though my industrial placement project was not part of the VIP team's project, I was included in their management system to gain insight into how engineers organize themselves when developing software and IP cores.

Agile is a method of organizing a software development process, especially when the team consists of individuals with different functional expertise. The specializations in the VIP team range from members adept in specific IP cores, such as deinterlacers, to members who focus on verification. Agile introduces management techniques and measures such as Sprints, Story Points, Issues and Scrums. These provide an iterative and incremental framework for IP or software development. The short time span allocated to a task (i.e. weeks) allows the team to be very responsive to changes in customer demands. The placement was an individual work project, hence my tasks were divided over periods of months instead of weeks. There are many online tools that adopt the Agile methodology to make it more versatile and accessible.

Jira is an online tool that adopts the Agile methodology. Jira provides bug tracking and software development monitoring, and allows easy issue allocation between team members with a simple drag-and-drop action. Issues can be prioritized and labeled for easy tracking; for example, an issue may be labeled as a bug fix or a new feature. End-of-sprint reports can also be generated through Jira. These reports consist of a burn-down chart of the number of story points achieved versus the time taken to complete them, with the expected trajectory also displayed. Burn-down chart comparisons give the team an idea of how well they meet expected targets and provide feedback to managers on how many tasks to assign to the team in the future. An example burn-down chart for the deinterlacer project is shown in figure 3.

Figure 3 Comparison of the expected and actual progress in the burn-down chart of the deinterlacer project.

Exposure to the monthly video conferences with Penang and to the software management tools familiarized me with the IP core development process and the dynamics of engineers who are located far apart.

5.2 Project Management Tools

Throughout the internship, meetings and project tools were used to help the progress of the project. One-on-one meetings between my supervisor and me occurred almost daily, depending on my progress and whether I required assistance.

The third phase of the project involved using the Altera OpenCL tool flow, which many engineers in Altera Europe are unfamiliar with. Andrei Hagiescu is an Altera employee based in Toronto who is involved in developing the Altera OpenCL tool flow and creating OpenCL example designs. Andrei was helpful in many aspects of the latter phase of my project. He provided constant feedback on the latest deinterlacer code revisions and advised on a suitable FPGA development kit for my project application. The code feedback consisted of Andrei evaluating the resource estimation and optimization reports generated by the AOCL compiler and suggesting ways of improving and optimizing the hardware being generated. Andrei's main role was to guide the second phase of the project in the right general direction.

Code revisions in the hardware implementation phase were exchanged between me and Andrei through email. Errors and bugs I encountered in the high-level synthesis tool were reported to Andrei using Fogbugz, a bug tracking system where bug finds are reported as incidents that can be assigned to one or more teammates to help resolve. Fogbugz also retains a thread of correspondence between teammates for easier tracking. I identified several bugs in the high-level synthesis tool while using it, and workarounds and suggestions were almost always immediately provided by the Altera employees involved with the HLS compiler.

Having one-on-one meetings with my supervisor and video conferences with Andrei definitely helped me develop my communication skills. Supervisors and managers who do not have an idea of the amount of effort required to solve a problem may impose unrealistic goals. It is therefore paramount to convey the right impression about your capabilities and the estimated amount of time and effort you would require to complete a task; a false impression results in your supervisor expecting more results than are possible.

[Figure 3: burn-down chart of the deinterlacer project, plotting outstanding tasks (0-5) against weeks (0-40) for expected versus actual progress, across the phases: C++ implementation, OpenCL kernel translation, OpenCL kernel optimisation, Hardware Synthesis.]

6 Implementing and Developing the Algorithm

At the onset of the placement, the proposed deinterlacing method had never been implemented. The method described was general and required many additional image processing techniques to render it a feasible solution to the deinterlacing problem. Due to the lack of validation and research on the proposed method, more than half of the placement time was allocated to making the algorithm feasible.

6.1 The Deinterlacing Challenge

Deinterlacing has been an ongoing problem for decades. Interlacing, a primitive form of video compression, blindly removes half of the image information, so reconstruction is a challenge and decompression artifacts are difficult to avoid. Despite the tendency to produce such artifacts, the absence of complex codecs and compression algorithms makes interlacing an attractive technique for broadcast companies, as they can avoid the extra cost of specialized broadcast equipment for the codecs. The cost of compensating for the simple compression technique is incurred on the receiving end of the transmission, where better deinterlacers are required to either totally eliminate or mitigate the decompression artifacts. The widespread use of interlaced video broadcasting has driven a large amount of research in the deinterlacing area. This section of the report delves deeper into the fundamentals of deinterlacing, the targeted decompression artifact and the project motivation.

6.1.1 Deinterlacing Background

Deinterlacing is a method of converting interlaced scan video to progressive scan video. Interlaced scan video captures only the even or only the odd horizontal lines of a video frame. A video frame containing only the odd or only the even lines is known as a sub-field; this terminology will be used throughout this report. Transmitting and storing video in interlaced format is advantageous because the frame rate is increased for the same amount of bandwidth or memory. This improvement in temporal resolution reduces image flickering in analogue television, and as a result the interlaced format is commonly used in analogue broadcast systems such as PAL and NTSC. Progressive video, on the other hand, captures and displays all lines in a video frame and is the format used by almost all digital video display devices. Digital devices therefore require a deinterlacing component in the system to allow interlaced scan video to be displayed. A simple illustration of the conversion between interlaced and progressive video can be seen in figure 4.

Figure 4 An image created during the industrial placement illustrating the capturing, storing and conversion of interlaced scan fields to progressive video.
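The reconstruction of missing lines from a sub-field can be made concrete with the simplest intra-field scheme, line averaging: captured lines are copied into the output frame, and each missing line is the average of the captured lines above and below it. This is a minimal sketch with assumed names (`Image`, `lineAverage`), not code from the placement project.

```cpp
#include <vector>
#include <cstddef>

using Image = std::vector<std::vector<int>>;  // grayscale rows of pixels

// Reconstruct a full progressive frame from one sub-field by line averaging.
// 'top' selects whether the sub-field holds the even (top) or odd (bottom)
// lines of the original frame.
Image lineAverage(const Image& field, bool top) {
    std::size_t h = field.size() * 2;
    std::size_t w = field[0].size();
    std::size_t phase = top ? 0 : 1;
    Image frame(h, std::vector<int>(w, 0));
    // Copy the lines that the sub-field actually captured.
    for (std::size_t y = phase; y < h; y += 2)
        frame[y] = field[y / 2];
    // Interpolate each missing line from its captured neighbours,
    // clamping at the top and bottom of the frame.
    for (std::size_t y = 1 - phase; y < h; y += 2) {
        for (std::size_t x = 0; x < w; ++x) {
            int above = (y > 0)     ? frame[y - 1][x] : frame[y + 1][x];
            int below = (y + 1 < h) ? frame[y + 1][x] : frame[y - 1][x];
            frame[y][x] = (above + below) / 2;
        }
    }
    return frame;
}
```

Because the average is purely vertical, any edge that crosses the missing line at a shallow angle is blurred into a staircase, which is exactly the low-angled artifact discussed in the next section.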

Deinterlacing is a widely researched area due to its vast applications in television and video storage. Deinterlacing methods can be crudely divided into four techniques: motion compensation, motion adaptation, directional interpolation and non-adaptive (temporal and spatial) interpolation. Deinterlacing methods can also be divided into intra-field and inter-field methods: intra-field methods use only one sub-field to generate a full progressive frame, while inter-field methods use multiple sub-fields. Directional and non-adaptive interpolation techniques tend to be less algorithmically and computationally complex than their motion-detection based counterparts; they therefore produce lower-quality deinterlaced video output and are more prone to generating image artifacts. A common image artifact that has yet to be fully resolved is the low-angled image artifact, which occurs with many directional and (spatially) non-adaptive interpolation techniques.

6.1.2 The Low-Angled Problem

As the name suggests, low-angled image artifacts occur along straight, low-angled lines. The term low-angled keeps the exact angle of the line arbitrary, as the artifacts are subjective and are more apparent on some straight edges than others. For the sake of definition, we will take low-angled lines to be lines at less than 45 degrees to the horizontal. Figure 5 shows the low-angled artifacts generated by spatially non-adaptive deinterlacing (i.e. LD and LA) and edge-dependent deinterlacing (i.e. ELA) techniques.

(a) (b) (c)

Figure 5 Low-angled image artifacts of Line Doubling (LD), Line Averaging (LA) and Edge-dependent Line Averaging (ELA). Images were generated using C++ and OpenCV.

The main reason these artifacts occur at low angles is the inherent lack of vertical resolution in interlaced video. Recall that a sub-field captures only the odd or only the even horizontal lines of a full frame, so half of the horizontal frame lines are not captured. Deinterlacing techniques essentially reconstruct these missing frame lines using information contained within the sub-field. The closer a straight line is to the horizontal, the less information about that line exists in the interlaced frame. This is easy to see if you imagine a perfectly horizontal line in an image: assuming the line is one pixel wide, there is a chance that the entire line is not captured in the interlaced format, since some horizontal frame lines are discarded. In contrast, a perfectly vertical line has half of its information available regardless of whether the odd or even sub-field is captured. To proceed further into the problem, straight lines can be crudely classified; the relevant types for this problem are highly directionally textured lines and macro-lines. The distinction lies in the thickness of, and spacing between, consecutive lines: textured lines are a single pixel wide and closely packed, while macro-lines are multiple pixels wide and far apart from other lines. This is illustrated in figure 6 using the 1080p NTSC Slices sequence.

Figure 6 Regions of highly textured, textured and macro-lines in the Slices test sequence created by NTSC.

Existing edge-dependent deinterlacing methods are insufficient due to the locality of the pixels they analyze: only neighborhoods of at most 20 pixels across are processed before performing the directional interpolation. This locality makes most existing edge-dependent deinterlacers suitable for the highly directionally textured regions of a frame, shown in yellow in figure 6, but it also means they are unable to reconstruct regions with macro-lines, shown in blue in figure 6, which require a larger neighborhood of pixels to be processed. Existing schemes can at best approximate the direction of these macro-lines. This approximation and its drawbacks can be seen in figure 7, which shows the deinterlacing of the 1080i (interlaced) NTSC Slices sequence using the Fine-Directional Deinterlacing (FDD) method, a form of edge-dependent deinterlacing.

(a) (b)

Figure 7 The progressive 1080p frame (a) and the FDD output (b) of the 1080i NTSC Slices test sequence.

The solution to reconstructing macro-lines is to analyze the entire sub-field rather than a local pixel neighborhood. This image processing is done using a Hough Transform (HT), which will be further elaborated in section 6.2.2. It is crucial to note that analyzing a delocalized set of pixels and reconstructing a single pixel from these pixel data carries a higher chance of generating other image artifacts, because the pixel to be interpolated will be affected by pixel values that are relatively distant. Distant pixels, relative to the pixel to be interpolated, tend not to represent the same image feature in a frame. This is precisely why most existing deinterlacers do not venture further than about 10 pixels into the pixel domain. Even deinterlacers that analyze a neighborhood of 5 pixels across generate artifacts due to inaccurate and erroneous interpolation based on distant pixels.
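The edge-dependent interpolation discussed above can be illustrated with a textbook ELA kernel: for a missing pixel, compare pixel differences along a few candidate directions between the captured lines above and below, and average along the direction with the smallest difference. This is a hedged sketch of the common three-direction formulation (`elaPixel` is an assumed name); real implementations use wider windows and tie-breaking rules.

```cpp
#include <vector>
#include <cstdlib>
#include <initializer_list>

// Interpolate one missing pixel at column x from the captured line above
// and the captured line below, choosing the direction (45, 90 or 135
// degrees) whose endpoints differ least.
int elaPixel(const std::vector<int>& above, const std::vector<int>& below, int x) {
    int w = static_cast<int>(above.size());
    int bestDiff = std::abs(above[x] - below[x]);   // vertical candidate
    int best = (above[x] + below[x]) / 2;
    for (int d : {-1, 1}) {                          // the two diagonal candidates
        int xa = x + d, xb = x - d;
        if (xa < 0 || xa >= w || xb < 0 || xb >= w) continue;
        int diff = std::abs(above[xa] - below[xb]);
        if (diff < bestDiff) {                       // a stronger edge direction
            bestDiff = diff;
            best = (above[xa] + below[xb]) / 2;
        }
    }
    return best;
}
```

With only a 3-pixel window, the kernel can follow steep edges but, as the text notes, has no way to track a macro-line whose matching endpoints lie tens of pixels away.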

To remove secondary image artifacts produced by pixel interpolation, we can apply post-processing methods that double-check the existence of an edge and perform smoothing based on these edge confirmations. A large section of the initial phase of this project was dedicated to these post-processing methods. Despite the drawbacks of analyzing a large pixel domain, there are several motivations for pursuing a more delocalized pixel-domain analysis technique for deinterlacing.

    6.1.3 Project Motivation

    The increase in demand for HD (1080p) and UHD (2160p) television displays requires the conversion of existing lower-resolution videos to higher resolutions. Image artifacts in deinterlaced video inevitably become more apparent after scaling up, so previously tolerable, minute low-angled artifacts become more discernible to the human eye. An improved deinterlacer would provide a competitive edge to companies whose business models are predicated on making high-resolution digital televisions. It is important to note that many interlaced video sources are still in use today, such as in broadcast and video storage, despite interlacing being a primitive form of video compression. The ubiquity of the interlaced format makes the deinterlacer a crucial component of a digital television that directly affects the user experience.

    FPGA area and resources are progressing at a much faster rate than the demand for television resolution. This slower growth in video resolution leaves more FPGA resources available, so it is reasonable to invest them in better deinterlacers to improve video quality.

    The expected complexity and resource usage of the proposed deinterlacer design is less than that of motion-compensated deinterlacer designs, because motion compensation requires computationally heavy motion detectors that compute motion vectors at different pixel-domain levels to estimate motion. The expected output quality and robustness of the proposed design should nevertheless be comparable to that of motion-compensated deinterlacers, despite the lower resource usage and complexity. This point is revisited at the end of this report once the deinterlacer design has been explained, implemented and verified.

    6.2 The Hough-Based Deinterlacer Design

    My industrial placement supervisor, Jon Harris, proposed the Hough-Based Deinterlacer design in a patent draft (yet to be submitted). The novelty the patent claims is the application of the Hough Transform to deinterlacing video. The document describes the method as well as a digital hardware realization of the deinterlacer. This section of the report introduces the general method as described in the patent draft.

    6.2.1 General Method

    The Hough-Based deinterlacer analyses an entire sub-field for edges, in contrast to existing deinterlacers where only local pixel domains are processed. It aims to extract edge information to dictate the directional interpolation of the pixels. Edge information may include variables such as the intercept, gradient, start coordinates and end coordinates of a line. The scheme extracts edge information via the image processing flow shown in figure 8.

    Figure 8 shows the functional flow of the deinterlacer and the image output of each processing stage, panels (a)-(d). The image is the 200th field of the Table Tennis sequence.

    The flow consists of an RGB-to-luma conversion, an edge detection process, a line detection process, an offset mask generation process and an interpolation process. Referring to figure 8, the interpolator uses the processed sub-fields from the luma, Sobel and offset mask generation blocks to perform directional interpolation and produce the deinterlaced output. The scheme targets the luminance (i.e. brightness information) of the sub-field while ignoring its chrominance (i.e. color information), because the human eye is more sensitive to variation in brightness than to variation in color. This higher tolerance for changes in color information is precisely why the chroma of an image or video is usually sub-sampled. Whether an RGB conversion is required depends on the format of the video input to the design. No conversion is required if the input already uses the YUV color space, encoded as YCbCr: Y contains the luminance information and Cb and Cr contain the chrominance information. Equation (1) shows how to convert from digital RGB to digital luminance (i.e. Y).

    Y = ((66·R + 129·G + 25·B + 128) / 2^8) + 16    (1)

    The red, green, blue and luminance values are each represented using an 8-bit binary number and therefore range from 0 to 255.

    The Sobel process is used to detect boundaries in the sub-field. Boundaries in images are characterized by a change in luminance or chrominance, which can be computed using image kernels containing gradient operators. There are numerous gradient operators, but the Sobel was chosen due to the higher weighting given to the pixels vertically adjacent to the center pixel. The Sobel also has smoothing properties owing to the coefficient 2 in its kernels. Figure 9 shows common edge operators including the Sobel operators.

    Roberts:
    [ 0  1]    [1  0]
    [-1  0]    [0 -1]

    Prewitt:
    [-1 0 1]    [-1 -1 -1]
    [-1 0 1]    [ 0  0  0]
    [-1 0 1]    [ 1  1  1]

    Sobel:
    [-1 0 1]    [-1 -2 -1]
    [-2 0 2]    [ 0  0  0]
    [-1 0 1]    [ 1  2  1]

    (a) (b) (c)

    Figure 9 The Roberts, Prewitt and Sobel edge operators.

    Applying a Sobel threshold sets the minimum Sobel output value that we consider an edge, so the image generated after thresholding is a binary image. Increasing the threshold reduces the chance of detecting erroneous edges but simultaneously discards edge information. A compromise based on the user's preference is therefore necessary to dictate the tolerance in gradient. The effect of varying the Sobel threshold is illustrated in figure 10; note the disappearance of the straight edge of the roof eave as the threshold increases.

    Figure 10 Shows the output of the Sobel Transform, panels (a)-(d), with increasing Sobel threshold.
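A minimal sketch of the thresholding step described above (names are hypothetical; the gradient-magnitude image is assumed to be precomputed):

```c
#include <stdint.h>
#include <stddef.h>

/* Thresholds a Sobel gradient-magnitude image into a binary edge map.
   Pixels at or above the threshold become 1 (edge), the rest 0.
   A higher threshold rejects noise but also discards weak edges. */
void sobel_threshold(const uint16_t *grad, uint8_t *edges,
                     size_t n_pixels, uint16_t threshold)
{
    for (size_t i = 0; i < n_pixels; i++)
        edges[i] = (grad[i] >= threshold) ? 1 : 0;
}
```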

    The Hough Transform uses the binary edge information (figure 10 (c)) to detect lines in an image through a voting system; an explanation of the Hough Transform is given in section 6.2.2. Detected lines are used to generate an offset mask (shown in figure 10 (d)). An offset mask is an intermediate image that indicates the location of each pixel requiring directional interpolation, along with its offset value. A simple illustration of the function of the offset mask is shown in figure

    The offset value is used to directionally interpolate a pixel. Directional interpolation derives from the characteristic that pixels along an edge tend to have the same RGB or luminance value. Hence, to reconstruct the edge, the pixel to be deinterlaced takes the average of the top and bottom pixels at offsets determined by the angle of the edge itself. Consider the illustration in figure 11.

    Figure 11 Shows the directional interpolation method along with a comparison of the results.
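The directional interpolation described above can be sketched as follows (a simplified single-channel version with hypothetical names; the design's exact boundary handling is not given):

```c
#include <stdint.h>

/* Directional interpolation for one missing pixel at column x.
   `top` and `bottom` are the field lines above and below the line being
   reconstructed; dx is the signed offset implied by the edge angle.
   Offsets are clamped to the line width (an assumed boundary policy). */
static uint8_t interp_directional(const uint8_t *top, const uint8_t *bottom,
                                  int width, int x, int dx)
{
    int xt = x + dx, xb = x - dx;   /* opposite offsets along the edge */
    if (xt < 0) xt = 0;
    if (xt >= width) xt = width - 1;
    if (xb < 0) xb = 0;
    if (xb >= width) xb = width - 1;
    return (uint8_t)((top[xt] + bottom[xb] + 1) / 2);  /* rounded average */
}
```

With dx = 0 this degenerates to plain vertical line averaging; non-zero dx follows the detected edge direction.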

    6.2.2 Hough Transform

    The Hough Transform (HT) was first introduced in 1962 in a patent published by Paul V. C. Hough. The patent describes a method of extracting image features, such as lines and ellipses, that can be mathematically parameterized. As an example the patent uses straight lines, which are easily represented by an intercept value and a gradient value. The basic idea of the HT is that for each high binary edge pixel, we compute the parameters of every possible feature orientation and accumulate these values in a parameter space.

    Say we wish to detect straight lines parameterized by an intercept parameter, c, and a gradient parameter, m. The size of the parameter space (also known as the Hough Space) would then be Nc x Nm; the ranges of c and m are arbitrary and will be discussed further in the next section. Consider a high binary edge pixel at the center of a 100x100 image, shown in figure 12 (a). To find every possible line orientation for that particular pixel we simply compute the intercept for each of the Nm gradients. This set of Nm (m, c) pairs is then accumulated in the Hough Space shown in the right plot of figure 12. The Hough Space is represented using a color map, with a color bar to the right of the figure indicating the value of a specific element in the Hough Space.

    (a) Edge Image (b) Hough Space

    Figure 12 showing an edge image (left) with a single high binary pixel at its center and the corresponding set of accumulation points in a color map representation of the Hough Space.

    Referring to figure 12 (b), notice that the maximum element value in the Hough Space is 1, as there is only one voting pixel. We can therefore expect a similar Hough Space profile per pixel as we extend to more high binary edge pixels. Performing the above iteration for an image containing a line angled at 45°, as shown in figure 13 (a), generates the Hough Space accumulation pattern shown in figure 13 (b).

    (a) Edge Image (b) Hough Space

    Figure 13 shows a 45 line in an edge image (left) and the corresponding Hough space generated.

    The Hough Space has a maximum accumulation of 20 hits, indicating 20 pixels voting for a specific line at c = 0 and the gradient corresponding to 45°, as expected. This point appears in the Hough Space as the region in red. To extract maxima in the Hough Space we apply a threshold; the Hough Space coordinates that satisfy this threshold are the detected lines.
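The voting scheme can be sketched as follows (an illustrative Cartesian formulation with assumed parameter-space sizes, not the design's exact implementation):

```c
#include <stdint.h>

/* Conventional Cartesian Hough voting (illustrative sketch).  For each
   high edge pixel, every candidate gradient index votes for the
   intercept it implies: c = y - m*x.  Detected lines are the
   accumulator cells whose count exceeds a threshold. */
#define N_M 16   /* number of discrete gradients (assumed)  */
#define N_C 128  /* number of discrete intercepts (assumed) */

void hough_vote(const uint8_t *edges, int w, int h,
                const float *gradients,        /* N_M candidate slopes */
                uint16_t acc[N_M][N_C])
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (!edges[y * w + x]) continue;
            for (int mi = 0; mi < N_M; mi++) {
                int c = (int)(y - gradients[mi] * x + 0.5f);
                if (c >= 0 && c < N_C)
                    acc[mi][c]++;              /* one vote per pixel */
            }
        }
}
```

For the 45° line of figure 13, every pixel on the diagonal votes for the same (m, c) cell, producing the single strong maximum described above.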

    Since the Hough Transform was invented, a wide variety of versions have been introduced to suit different applications. In the context of the deinterlacer project, where straight and low-angled lines are concerned, several features were added to better serve our purpose. There are two major modifications made to the conventional Hough Transform. The first is that a Cartesian coordinate system is used rather than a polar coordinate system. The second is that the set of angles a line can take is restricted and bounded. These modifications reduce the generality of the Hough Transform in that many lines are coarsely approximated. The following paragraphs explain why.

    To represent a straight line, the parameters may be expressed either in terms of the angle and distance to the origin (i.e. polar coordinates) or in terms of the gradient and intercept (i.e. Cartesian coordinates). This design adopts the Cartesian system because the offset value (used for directional interpolation) is easily derived from the gradient of a line. The main drawback of a Cartesian-based method is that it cannot represent a vertical line; this disadvantage is ignored, as we are not interested in representing vertical lines.

    The term bounded-offset refers to the manner in which the line gradients are discretized. It would be more natural to let line angles take discrete values at regular intervals, with the interval dictating the angle resolution of the detected lines. The key modification in this design is that the lines are instead bound to the angles determined by the offset values. The detectable lines are therefore restricted to the angles illustrated in figure 14.

    Figure 14 shows the angle discretization created by neighboring pixels.

    These offset values are denoted Δx, as they are inherently the change in the x direction given that the change in the y direction is always 1; this holds in our case because the interpolation always happens between the current pixel and the pixel directly above it. Using Δx as a parameter rather than angles in degrees inevitably changes the parameter space.

    A key feature of using Δx is the improvement in resolution at very low angles. The angle resolution is variable, unlike most parameterizations where the resolution remains constant throughout. The resolution is given by the derivative of the equation relating Δx to angle; this derivative is shown in (3).

    θ = tan^-1(1/Δx),  Δx ∈ Z+    (2)

    dθ/dΔx = -1/(Δx^2 + 1),  Δx ∈ Z+    (3)

    Note that the derivative is always negative, as expected since the angle decreases as Δx increases. It is also important to note that as Δx increases, the magnitude of the derivative decreases, implying that the angle resolution rapidly improves with Δx. This advantage is also a drawback for larger angles: above 18° the angle step exceeds 5°, so such lines are either approximated or remain undetected.

    6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology

    The patent draft describes a method of the deinterlacer design with a demonstration in principle that the design should

    resolve low-angled edges. As a proof of concept, the initial phase of the project was dedicated to creating a working

    prototype of the deinterlacer.

    The metric used for verification is the PSNR, a standard measure used by researchers in deinterlacing and in video compression and reconstruction. PSNR stands for peak signal-to-noise ratio and can be derived from a single frame or multiple frames. It is the ratio between the peak signal value and the noise: the peak value for an image with a single 8-bit color channel is 255, and the noise is the mean squared difference between the original and reconstructed images. Hence, for deinterlaced video, the PSNR can only be calculated when the progressive version of the video is available. The PSNR is defined mathematically in (4), where I(i,j) is the progressive frame and K(i,j) is the deinterlaced frame.

    PSNR = 20·log10(255 / √MSE)    (4)

    MSE = (1/(m·n)) · Σ_{i=0..m-1} Σ_{j=0..n-1} [I(i,j) − K(i,j)]^2    (5)

    The PSNR measure was a means to reconfirm quantitatively that the deinterlacer scheme reconstructs low-angled edges without introducing new image artifacts. Throughout most of the algorithm development phase, however, the frame outputs of the deinterlacer were visually assessed rather than quantitatively measured: the end users of the video outputs are people, so it is crucial that deinterlaced videos are visually satisfactory.

    Using these verification methods, various aspects of the algorithm were found lacking. A thorough description of the caveats discovered and the improvements made in response is presented in the next section.

    6.3 Research and Improvements

    This section of the report describes several modifications and additions to the deinterlacer algorithm that were necessary to render satisfactory video outputs. My supervisor occasionally suggested improvements, but most of the modifications are my own. A large part of the industrial placement was dedicated to this research and improvement of the algorithm. The research method employed was trial-and-error based and is by no means systematic or exhaustive. A wide variety of image processing techniques were experimented with to improve the design. The challenge was that, owing to its novelty, few published deinterlacer papers share a similar algorithm with the Hough-Based deinterlacer. As a result, I turned to image processing techniques not usually used in deinterlacing, such as connected-component labeling and line correlation techniques. The unconventionality of these techniques meant that I constantly referred my ideas to my industrial supervisor for feedback on whether a solution was realistic and worth pursuing. A detailed account of the research is not included in this report; the following sections only describe the modifications made to the final revision of the deinterlacer design.

    Caveats in Algorithm and Modifications made in Response:

    Caveat: Insufficient line resolution for detection.
    Response: No solution provided; instead the algorithm relies on lines being sufficiently thick to allow line approximation. Increasing the line resolution would improve line detection, but at the cost of more hardware resources.

    Caveat: Over-detection of lines due to high pixel luma variation or a high density of lines.
    Response: I introduced a discriminatory process in the Hough that takes into account the distance between voting pixels. This reduces the clutter produced by regions with high pixel luminance variation and regions with high line density. This modification was purely my own.

    Caveat: Image artifacts produced by directional interpolation are very apparent and difficult to contain.
    Response: Introduced a processing block after the Hough Transform to filter out false-positive lines by consolidating detected edges with luma, edge and offset information. Most of the consolidation methods were proposed by my supervisor and some by me.

    Caveat: Edge detection needs to prioritize horizontal edges.
    Response: Removed the y-direction Sobel kernel and rely only on the x-direction Sobel kernel for line detection. This was suggested by me.

    Figure 15 Table showing the caveats in the Hough-based deinterlacer algorithm along with the modifications made in response.

    6.3.1 X-biased Sobel

    The proposed edge detection technique was a Sobel transform. To better capture vertical gradients while ignoring horizontal gradients, the y-direction Sobel kernel was removed. The result is an edge detection method that accounts less for straight lines closer to the vertical while accentuating straight edges closer to the horizontal. This simplification both improves the detection rate of low-angled lines (as fewer erroneous high pixels are derived from horizontal gradient changes) and reduces the complexity of the edge detection process, though this reduction in complexity is insignificant next to the increase in complexity brought about by other processes. The difference in the output binary edge image, and the clear improvement, is shown below in figure 16.

    (a) (b) (c)

    Figure 16 Shows (a) the original grayscale image, (b) the conventional Sobel Transform and (c) the X-biased Sobel Transform.

    6.3.2 The Proximity Hough

    The Proximity Hough (PH) method is the name I gave to a discriminatory process I incorporated into the traditional Hough method. This new Hough method is my own invention and forms the bulk of the research value added to the deinterlacer algorithm. Not only does the line-detection rate drastically improve, but the method is also very resistant to noise. Noise in this context means high binary pixels that do not form a straight edge, whether jagged or curvy edges or simply regions of high pixel luma variation.

    The objective of introducing the PH method was to tackle the problem of over-detection. Over-detection occurs when the Hough Space gets too cluttered due to regions of the image with high luma variation or a high straight-line count. The Hough Spaces in figure 17 clearly demonstrate this effect: the Hough Transform is performed on an original image with high binary pixel clusters and on the same image with those clusters artificially removed.

    (a) (b) (c) (d)

    Figure 17 shows the difference in Hough Space produced by images with and without high binary pixel clusters.

    The PH transform works by storing additional information about each voting high binary edge pixel inside memory blocks. We could think of these memory blocks as bins, though in reality they do not accumulate anything and are simply updated with new values. The information stored in these bins is the coordinates (x, y) of the previous voting pixel and the x start point of the line, alongside the usual accumulation bin. These bins have exactly the same dimensions as the accumulation bin, which is essentially the parameter space, so the stored coordinates and x start point are unique to a particular line. Recall that each element in the parameter space represents a unique line that can be drawn on the image.

    The general method of the PH transform is that once there is a hit for a particular line, the accumulation bin only increments if the (x, y) coordinates of the previously voted pixel for that line are within a tolerable range. This tolerance is named the proximity threshold: the maximum allowed distance between consecutive voting pixels. Hence, with a proximity threshold of 0, voting pixels must be adjacent or diagonally adjacent to one another. Where this distance requirement is not met, the accumulation bin is not incremented and the x start point bin remains unchanged. The x start point is only updated when a line first gets a hit; it therefore stores the x coordinate of the first pixel that voted for the respective line, which corresponds to the start x coordinate of the line. The y coordinate is neglected, as it can be derived mathematically since we know the intercept and Δx value.
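The proximity check described above might be sketched as follows (the bin layout and field names are assumptions based on the description; the exact update rule for rejected votes is not specified here):

```c
#include <stdlib.h>
#include <stdint.h>

/* Per-line bins of the Proximity Hough (sketch).  A vote only
   increments the accumulator when the new voting pixel lies within
   the proximity threshold of the previously recorded voter; a
   threshold of 0 therefore requires (diagonal) adjacency. */
typedef struct {
    uint16_t votes;      /* accumulation bin                  */
    int last_x, last_y;  /* coordinates of the previous voter */
    int start_x;         /* x coordinate of the first voter   */
} ph_bin;

void ph_vote(ph_bin *bin, int x, int y, int prox_thresh)
{
    if (bin->votes == 0) {           /* first hit: record line start */
        bin->start_x = x;
        bin->votes = 1;
    } else if (abs(x - bin->last_x) <= prox_thresh + 1 &&
               abs(y - bin->last_y) <= prox_thresh + 1) {
        bin->votes++;                /* voter close enough: accept   */
    }                                /* otherwise the vote is dropped */
    bin->last_x = x;                 /* assumed: coordinates always   */
    bin->last_y = y;                 /* track the latest voter        */
}
```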

    Note that the PH transform uses 3 additional bins compared to the conventional Hough Transform, and it requires more arithmetic computations and operations. Pseudo code for the conventional and the Proximity Hough Transform is included in Appendix I. This increase in complexity requires more FPGA hardware resources and increases latency. The throughput of the generated hardware remains one pixel per clock cycle, as much of the algorithm is easily pipelined. In terms of memory, it is key to note that the Proximity Hough requires constant reading and writing of all bins, whereas the conventional Hough only requires a write, as no feedback is needed to validate a particular line hit. A comparison of resource usage between the Proximity Hough and the conventional Hough is shown in figure 18. The resource estimation was generated using the Altera Offline OpenCL compiler, which estimates the amount of resources a design would occupy on a given development board or acceleration card; in our case, these values were generated for the Nallatech pcie385n A7 accelerator card.

    Figure 18 shows a bar graph of the resource usage, in percent, of a Stratix V A7 FPGA.

    The performance gain is due to reduced clustering of the Hough Space. Referring to figure 19, we can observe that the Hough Space produced using the PH (figure 19 (b)) is less cluttered than the Hough Space produced using the conventional Hough (figure 19 (a)). The discriminatory process of the PH reduces the clustering of line hits and hence avoids over-detection.

    (a) (b) (c) (d)

    Figure 19 shows the Hough Space of the conventional Hough (a) and the Proximity Hough transform (b), along with the binary edge image inputs containing high binary pixel clusters.

    [Bar chart: resource usage (%) of the conventional and Proximity Hough transforms across logic utilization, dedicated logic registers, memory blocks and DSP blocks.]

    The PH improves robustness to regions of high pixel luma variation and the rate of lines detected, though it increases hardware resource usage. Despite this increase, the video quality output of the Hough-Based Deinterlacer is greatly dependent on the performance of the line detection process, so it is justified to invest substantial research and development time, and hardware resources, in that process.

    An interesting improvement that could reduce the resource usage and complexity of the Hough Transform is to map the discriminatory process onto a form of Hough Space filtering. Parameter space filtering of the Hough Transform is not novel and has been used in other applications. This improvement would have taken more research time and so was set aside in favor of hardware implementation and other research areas. The next major research area is the post-processing method, which is comparable in resource usage to the line detection process.

    6.3.3 Post-Processing Block

    The PH transform reduces the image artifacts generated by decreasing the chance of detecting an erroneous line. Despite this improvement, image artifacts still appear in certain regions of the image, namely along the periphery of edges, at intersections of edges and at edge endpoints. The objective of the post-processing block is to further reduce the probability of generating image artifacts once line detection has completed. Examples of these image artifacts are shown in figure 20.

    (a) (b)

    Figure 20 (a) shows image artifacts occurring along the periphery of edges; (b) shows the refined output from the post-processing block.

    The post-processing block eliminates the above artifacts by conducting luminance and offset checks. These checks exploit the characteristic that similar luminance occurs along edges, and consolidate the offset mask with the edge image; their effectiveness was visually verified. The post-processing block also smooths deinterlaced pixels via a blend between the base interpolation and the offset interpolation, which reduces any drastic pixel luminance change introduced by offset interpolation. Detailed descriptions of the checks conducted are included in the appendix under the post-processing section (Appendix II).

    The introduction of the X-biased Sobel, the Proximity Hough and the post-processing block was crucial in making the initially proposed method feasible. I found this section of the internship particularly challenging due to the elaborate solutions I had to invent. Several other solutions, such as connected-component labeling and straight-line correlation techniques, were initially pursued but later discarded due to the high additional complexity they introduced and their poor robustness across video sequences. The final algorithm still has room for improvement, namely for video sequences with highly textured regions, though it produces impressive edge reconstruction when macro-lines are present in an image. The next major section of the placement is the implementation of the algorithm in hardware using Altera's High-Level Synthesis tool.

    The final post-processing algorithm was translated into hardware. During this translation, several key considerations had to be made to achieve a throughput of 1 pixel every clock cycle. This was a challenge because the post-processing algorithm consists of several blocks (see the appendix for detailed descriptions) that are inherently dependent on the outputs of previous post-processing blocks. To solve this, intermediate shift-buffers were introduced between these blocks. These buffers are read multiple times throughout the main loop but written only once. By introducing latency, data dependencies in the main loop of the post-processing can be eliminated entirely. A graphical representation of the post-processing algorithm, showing the internal blocks and shift-buffers, can be found in the appendix. Theoretically, with these shift-buffers, the final algorithm should translate easily into hardware with a throughput of 1 pixel every clock cycle. The challenge of the hardware implementation phase is to write the post-processing algorithm in a fashion from which the high-level synthesis compiler generates the intended hardware.

    6.4 Conclusion and Personal Reflection

    The conclusion of this phase of the placement is that the proposed deinterlacer method is a feasible solution to the low-angled problem. The algorithm works particularly well at refining image artifacts at edges with a high luma difference. This is key, as regions of high luma difference are the most apparent to the human eye; a good example of the result is along the ping-pong table edges shown in figure 21. An unexpected improvement in performance was also discovered that further consolidates the feasibility of the design: slightly curved edges that exhibit similar image artifacts are also detected and refined by the final algorithm.

    Though development was challenging due to the fine-tuning of design parameters across several video sequences, I was motivated by the fact that the algorithm I was developing tackles a real engineering problem in the interlaced video broadcast industry. The algorithm development was very rewarding, as many of the additions, such as the discriminatory Hough process and the post-processing block, are novel methods. I have therefore not only contributed to the deinterlacer design by implementing it in C but also added 4 months of research value to the algorithm, and I have experienced the difficulty of designing a robust image processing algorithm that produces consistent results across the many variations in video sequences. At the end of this phase the algorithm was finalized and ready to be implemented in hardware.

    (a) (b)

    Figure 21 shows the output of the FDD method (a) compared with the final result of the Hough-Based deinterlacer design (b), taking as input the 187th frame of the Table Tennis sequence.

    7 Altera OpenCL and High-Level Synthesis

    The second phase of the project involves the implementation of the deinterlacer design in hardware using Altera's high-level synthesis tools. A high-level synthesis tool is a program that generates RTL or logic from a higher-level programming language such as Python or C. The target tool for the deinterlacer design is the Altera high-level synthesis (HLS) tool. The Altera HLS is in its infancy, and the objective of this phase of the project is to validate the tool by recording the development time, identifying bugs and providing user feedback. The Altera HLS compiler uses the same high-level synthesis process as a more mature tool, the Altera OpenCL compiler. The Altera OpenCL tool flow has useful debug tools such as resource estimation and hardware emulation (elaborated in the next section). This provides an easier C-to-hardware translation while still validating the high-level synthesis process working at the back end of the Altera OpenCL compiler. Despite this similarity, there are several differences between Altera OpenCL and HLS.

    The distinction between the Altera OpenCL and HLS compilers lies in their target users. The increase in available FPGA resources, brought about by better manufacturing technology and FPGA architecture, has made FPGAs an attractive solution for high-performance computing. The FPGA accelerates computations with few data dependencies that are easily parallelized. But to leverage the FPGA, the user needs a background in hardware design and familiarity with hardware development tools. To widen the range of users, Altera provides the Altera OpenCL compiler (AOCL) as a solution. The AOCL is based on a parallel-programming standard called OpenCL, which supports a multitude of platforms such as DSPs, CPUs, GPUs and FPGAs; hence it is possible to create a high-performance computing system with OpenCL that allows offloading to a variety of platforms. The AOCL is an elegant tool for software engineers who simply wish to implement their software on an FPGA, due to the abstraction from hardware interfaces and the complete generation of the final hardware system. On the other hand, the lack of control over the generated hardware interfaces is unattractive to hardware designers. Altera provides another compiler, Altera HLS, to target these hardware designers. The objective of the Altera HLS is mainly to generate single IP cores rather than an entire hardware system; the generated IP core can then be instantiated and incorporated in a hardware system via Qsys.

    The HLS compiler would both provide internal and external benefit to Altera. Internally, the HLS compiler could be used

    to accelerate hardware prototyping. This is demonstrated in the hardware translation of the Hough-Based Deinterlacer

    design. It is estimated that implementing the design into hardware would require 1 man-year, whilst the hardware translation

    phase of this project was completed in under a month. Fast prototyping would help generate quick outputs to assess the

    feasibility of an algorithm. The only drawback the HLS compiler would have is that it will not provide the hardware designer

    register level control of the hardware. Externally, the HLS tool will provide customers an easy tool to modify or adapt

    existing reference designs or Altera IP to suit their application. They would be able to tweak sections of C/C++ code to

    include new components or change design parameters.

    7.1 The OpenCL tool flow

The Altera OpenCL tool flow is designed to speed up hardware development and design by introducing an emulator and optimization reports generated from OpenCL kernels. OpenCL kernels are C functions with restrictions and extensions imposed by OpenCL. These restrictions and extensions provide a framework for data-parallel programming. The kernels are written in a single-threaded, task-based fashion, and hardware parallelism is inferred by unrolling loops and pipelining computations. Hardware parallelism can also be inferred by replicating (i.e. vectorizing) kernels. Kernels are launched via an OpenCL host program, which is an executable. In addition to launching kernels, the host program allocates, reads and writes the target device's global memory. The host program is built with any standard C compiler, which gives the flexibility to include any readily available C library. For this application, the OpenCV library is included to read and write images to and from the target hardware.
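To make this kernel style concrete, the sketch below mimics a single-work-item OpenCL kernel as plain C so it compiles anywhere; the 3-tap filter and its coefficients are illustrative placeholders, not taken from the deinterlacer design. In an actual Altera OpenCL kernel, a #pragma unroll on the inner loop would replicate the multiply-accumulate hardware.

```c
#include <stddef.h>

/* Single-work-item style kernel: one sequential outer loop that the
   compiler pipelines, plus a small fixed-trip inner loop it can fully
   unroll. Hypothetical 3-tap smoothing filter; the taps sum to 4 so
   the final divide is a cheap shift. */
void filter_1d(const int *in, int *out, size_t n) {
    static const int taps[3] = {1, 2, 1};
    for (size_t x = 1; x + 1 < n; ++x) {    /* pipelined in hardware */
        int acc = 0;
        /* #pragma unroll */                 /* unrolled in hardware  */
        for (int k = 0; k < 3; ++k)
            acc += taps[k] * in[x - 1 + k];
        out[x] = acc >> 2;                   /* divide by 4 via shift */
    }
}
```

The host program, built with an ordinary C compiler, would allocate the in/out buffers in the device's global memory and launch this kernel.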

The Altera OpenCL tool flow mimics a software-like debug flow thanks to the relatively short compile times of emulation and optimization-report generation. Emulation and optimization reports take seconds to compile, in stark contrast to hardware compiles, which usually take hours to complete. The complete OpenCL tool flow is shown in figure 21.

    Figure 21 Flow chart of Altera OpenCL tool flow.

The emulation feature allows functional debugging of the design without any hardware generation. The tool produces a binary file (.aocx) containing program objects that target the FPGA. The binary file can be executed on any x86 processor to simulate the generated hardware. Hardware performance figures are not provided by emulation, as no actual hardware is running. Once the design generates output that is verified to be correct, the next stage is to improve the efficiency of the generated hardware and obtain a resource estimate. This feedback is provided by the optimization report.

The optimization report gives the engineer an idea of how efficiently the compiler has generated the hardware. The report consists of a list of successfully pipelined code sections, serially executed code sections, data dependencies and a resource estimation of the design. This stage is crucial for final hardware performance; most of the development time is spent reducing the data dependencies flagged by the report. A snippet from an optimization report is shown below in figure 22.

    Figure 22 Snippet from an optimization report showing sections of pipelined code, serially executed code, data

    dependencies and a resource estimation.

7.2 Compromises and Optimizations

The hardware implementation phase consisted of four main kernel code revisions: serial hardware execution, hardware parallelism introduced, data dependencies removed, and optimized hardware. The code was modified based on the information provided by the optimization report. The goal was essentially to remove data dependencies where possible to enable parallelism, to infer shift registers for the pixel buffers, and to remove conditional loops and conditional memory accesses. The progress in terms of resource usage and execution time across the four code revisions is displayed in the bar charts of figures 23 and 24.

    Figure 23 Bar chart showing FPGA resource usage for all hardware revisions.

(Series: Memory Blocks, Logic Utilization, DSP Blocks and Logic Registers for each revision.)

Figure 24 Bar chart showing kernel execution time of the Sobel, Hough, line drawer, interpolator and frame generator kernels for all hardware revisions, running at 150 MHz.

The first code revision (serial hardware execution) was meant to get the design to fit on a Stratix V A7 FPGA by compromising design performance and reusing hardware blocks. To decrease the deinterlacer's resource usage, the range of y-intercept values was reduced, which compromised the final video quality. Note that because the angles of interest are relatively small (i.e. less than 45°), this compromise did not significantly affect the final video output, making it a reasonable tradeoff.

The second code revision (hardware parallelism introduced) aimed to generate efficient pixel buffers to reduce resource usage. Pixel buffers were initially included in the algorithm translation phase to remove data dependencies by introducing latency. A pixel buffer stores a preset number of pixels (depending on the process): the first element of the buffer is updated with the incoming pixel value and the final element is discarded. An efficient way to implement pixel buffers is with shift registers. A shift register mimics an array that has all its elements shifted by one index while the first element is updated with a new value. For a shift-register interpretation of a pixel buffer, the access indices have to be known at compile time. The shift registers form a delay line with signal taps at the respective access indices, so the number of accessed indices also had to be kept to a minimum to limit the hardware signal taps generated. Pixel buffers with dynamic indices were written in C as circular buffers, which infer a memory block with constant reads and writes.
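The shift-register form of a pixel buffer can be sketched as below; the depth is illustrative, not that of the deinterlacer. Because every access index is a compile-time constant, an HLS compiler can map the array to a register delay line with taps rather than a RAM block.

```c
#define DEPTH 8   /* buffer length, fixed at compile time */

/* One cycle of the buffer: every element moves up one index and the
   incoming pixel enters at index 0. Each assignment reads only the
   previous cycle's values, so the loop can be fully unrolled into
   wires between registers. */
void shift_in(int buf[DEPTH], int pixel) {
    for (int i = DEPTH - 1; i > 0; --i)
        buf[i] = buf[i - 1];
    buf[0] = pixel;
}
```

A dynamically indexed buffer (an offset computed at run time) cannot be expressed this way, which is why such buffers were written as circular buffers and left in memory blocks.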

The goal of the third code revision (data dependencies removed) was to remove data dependencies in the post-processing kernel. As mentioned in Section 6.3.3, the post-processing algorithm inherently has data dependencies, which were removed by introducing shift buffers. The ideal result is a throughput of one pixel per clock cycle. Though the C code was written to generate hardware that removes these data dependencies, the compiler does not necessarily generate the intended hardware. To achieve this performance, three kinds of constructs need to be avoided: conditional reads from memory, conditional for loops, and buffer indices that depend on a value from the same buffer. Conditional reads from memory and conditional for loops were removed by always performing the read and the loop iteration and using a boolean variable to validate the final assignment. Indices that depend on a value from the buffer itself were handled by duplicating the buffer and deriving the index from the cloned buffer. These amendments to the C code successfully generated post-processing hardware with no data dependencies in the main loop. The removal of data dependencies in the main loop also allows the compiler to infer efficiently pipelined computations. The efficiency of a pipelined loop is measured by the number of clock cycles between successively launched iterations. At the end of this process, all serially executed sections had a pipeline efficiency of 50% (i.e. 2 clock cycles between iterations). Kernel-level pipelines were also


introduced to enable a new set of inputs to be processed while the previous set of inputs was sent to the next kernel. The final improvement made during this hardware revision was task parallelization through loop unrolling. For loops in the post-processing blocks were unrolled; this unrolling increases resource usage but gives a faster execution time.
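The predicated-read pattern described above can be sketched as follows; the function and data are hypothetical, chosen only to show the transformation. Rather than guarding the memory read and the loop with an if, the read and the iteration always happen and a boolean validates the final assignment, so the compiler sees a fixed-trip-count loop with a select instead of a branch.

```c
/* Find the largest value above a threshold. The read buf[i] and the
   loop trip count are unconditional; "valid" gates only the final
   assignment, which maps to a multiplexer in hardware rather than a
   stalling conditional load. */
int predicated_max(const int *buf, int n, int threshold) {
    int best = 0;
    for (int i = 0; i < n; ++i) {
        int v = buf[i];                         /* always read        */
        int valid = (v > threshold);            /* guard as a boolean */
        best = (valid && v > best) ? v : best;  /* select, not branch */
    }
    return best;
}
```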

The final hardware revision aims to produce optimized hardware. This was achieved by removing division-by-constant, division and modulo operations, which are expensive: division in particular costs far more in hardware than multiplication. A division by a constant can be approximated by a multiplication with a constant followed by a division by a power of 2, which in hardware is a shift operation and hence relatively cheap. The constant of interest is the width of a video frame; a division by a width of 720, for example, can be approximated by a multiplication by 91 followed by a right shift by 16 bits. Division by a variable is translated into a ROM lookup table. Note that the denominator and numerator of this particular division operation are bounded to a limited set of discrete values (i.e. -64 to 64 for the denominator and 0 to the width of the video frame for the numerator), so a lookup table is apt. These alterations to the C code reduced both resource usage and execution time.
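The two division replacements can be sketched as below. The width-720 constants (multiply by 91, shift right by 16) are the ones quoted above; note that this multiply-shift form is only an approximation and can be off by one near exact multiples of 720. The lookup table here uses a rounded-up fixed-point reciprocal, a detail assumed for this sketch, which makes the result exact for the bounded ranges involved (numerator up to the 720-pixel frame width, denominator 1 to 64).

```c
/* Division by the constant frame width 720, approximated as a
   multiply and a shift: x/720 ~= (x*91) >> 16. Cheap in hardware,
   but can undershoot by one near exact multiples of 720. */
int div720_approx(int x) { return (x * 91) >> 16; }

/* Division by a bounded variable via a reciprocal ROM.
   recip[d] = ceil(2^16 / d); rounding the reciprocal up keeps the
   quotient exact for 0 <= num <= 720 and 1 <= den <= 64. */
#define RECIP_SHIFT 16
int recip[65];

void init_recip(void) {
    for (int d = 1; d <= 64; ++d)
        recip[d] = ((1 << RECIP_SHIFT) + d - 1) / d;   /* ceil */
}

int div_lut(int num, int den) {
    return (num * recip[den]) >> RECIP_SHIFT;  /* multiply + shift */
}
```

Negative denominators (the document's range is -64 to 64) would be handled by taking absolute values and reapplying the sign outside the table.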

The improvements in kernel execution time and FPGA resource usage brought about by the four code revisions are apparent, and they depend heavily on writing the kernel in a manner from which the compiler can extract parallelism. The key technique is to reduce data dependencies, conditional reads and conditional for loops. The hardware can be further optimized by reducing division operations and translating bounded functions into lookup tables.

7.3 Conclusion and Personal Reflection

An early introduction to the OpenCL and HLS tools at the onset of the placement would have produced a better final deinterlacer design. There was a lack of hardware consideration during the development of the algorithm: considerations like memory access patterns (i.e. random access or bit-streaming), task parallelism and pipelining were not taken into account. Therefore, sections of the algorithm that could not easily be translated into a bit-streaming structure had to be rewritten in a bit-streaming fashion to achieve the same output, which often resulted in a degraded version of the design. About two weeks were dedicated to rewriting the algorithm in a bit-streaming fashion. Experimenting with the high-level synthesis tools and understanding how to generate efficient hardware for video processing would have reduced the time spent in the hardware translation phase.

Overall this phase took about a month. The weekly video conference calls with Andrei Hagiescu in Toronto were especially helpful; he provided crucial guidance and feedback on the direction and improvement of the design. The final code revision generates hardware that is able to process 720i video input at 30 frames per second. This performance is not up to par with the market demand of deinterlacing 1080i video at 60 frames per second, but I am confident that, given more time developing the hardware in OpenCL or perhaps translating it to Verilog, the 1080i at 60 frames per second benchmark could be achieved.

8 Offsite and Extra Activities

On the 6th of August I visited the Altera site in Penang to meet the Malaysian team managed by Benjamin Cope. Benjamin Cope is the manager of the Altera team I am in (i.e. the Video IP team) and he suggested the visit. The aim of the visit was to observe how Altera in Malaysia operates. The site is located on the island of Penang, off the west coast of the Malaysian Peninsula. The island has a specially designated industrial zone that hosts many manufacturing, research and electronics companies such as Intel, AMD, Motorola, Agilent and Altera. I was shown around the site by Ivan Teh, the manager of the Malaysian VIP team, and conversed with the team about work-life balance and the benefits Altera provides. On another note, I was impressed by the sheer size of Altera in Malaysia, which has about 1000 employees. The visit was eye-opening, as it made me realize the scale at which Altera operates internationally and its strong presence in the growing Southeast Asian region. I would definitely consider working at an Altera site closer to home.

The IBC (International Broadcast Conference) is an annual event held at the RAI Amsterdam that runs for 5 days. The conference attracts companies and organizations involved in future solutions for electronic media and technology. Altera operates a booth, and I was given the privilege of submitting a poster describing the Hough-based deinterlacer; the poster is included in Appendix III. It was displayed in the IBC Future Zone, a fairly new exhibition within the conference whose purpose is to showcase interesting ideas and projects from research and development labs and universities. During the exhibition, the organizer of the Future Zone was hoping to create an online archive where submitted posters and papers could be stored and accessed. I went to Amsterdam on the 12th of September and showcased my design for the entire day along with my placement supervisor, Jon Harris. During the exhibition, several people approached us expressing their interest in the solution; some requested a copy of the poster and some wanted to implement the design on GPUs. The interaction with people from the broadcast industry reassured me that the deinterlacer project addresses a real problem in the deinterlacing world.

There were customers who expressed interest in the design and requested video samples from the deinterlacer to evaluate. These output video sequences were generated and given to the customers for evaluation. A quotation for the IP core will follow if the customers are pleased with the video sequences. The identity of the customers is confidential, but their request for video output sequences is testament to the fact that the decompression artifacts generated by most deinterlacers are a genuine problem in industry. The experience was both interesting and rewarding in that I was able to work on a real engineering problem and had the opportunity to interact with potential customers.

The offsite activities provided me with a platform to interact with both Altera employees in Penang and the potential customers and researchers who attended the IBC conference. This exposure has helped me understand how products, software and IP cores are developed internally and how they are then presented and marketed to customers. These activities have definitely been insightful in terms of engineering product inception, development and marketing.

    9 Conclusion

The project aims were to develop a deinterlacing algorithm that targets low-angled artifacts and to synthesize the design into hardware using an Altera high-level synthesis compiler. I personally feel that these two objectives were met, as the proposed method has been developed into a feasible solution and has been successfully implemented as working deinterlacer hardware. The feasibility of the deinterlacer is assessed in terms of whether it consistently removes the targeted decompression artifacts and whether it fits onto a single FPGA chip. Though the final deinterlacer hardware does not meet the market performance benchmark, this benchmark is expected to be achievable given further development of the design in either OpenCL or Verilog. I am confident in the robustness and success of the deinterlacer IP core and that it would make a good addition to the set of deinterlacers available in the Altera Video IP suite licensed to customers. This confidence comes from the consistent video outputs generated by the deinterlacer algorithm across 10 common interlaced test video sequences. The robustness of the algorithm is attributed to the key incorporation of image processing algorithms, namely the proximity Hough transform and the post-processing block. These two processes formed the central focus of the research done for the deinterlacer and have definitely improved the line detection rate and edge refinement ability of the design.

A summary of the industrial placement achievements is given in the bulleted list below:

• Invented a discriminatory process for the Hough transform to improve the line detection rate and robustness to high pixel luminance variation.
• Invented a post-processing algorithm that consolidates detected edges to eliminate artifacts generated by directional interpolation.
• Successfully implemented the deinterlacer on an FPGA in under a month using the Altera OpenCL compiler, and verified that the hardware generates correct output.
• Achieved below-benchmark video performance in a limited amount of hardware development time, with confidence that further hardware optimization can achieve industry-standard video deinterlacing performance.
• Completed algorithm development and hardware implementation of the design in 6 months.
• Received positive feedback from customers who compared the Hough-based deinterlacer design to their existing deinterlacer (the identity of the company cannot be disclosed).

10 Appendix

    I Pseudo Code of Conventional and Proximity Hough Transform

II Post-Processing Block

Post-processing functional flow diagram

Post-processing block descriptions:

Edge Image Generator module (sub-kernel SK0, edge image generator)
  EG_0  Generates the binary edge image

OM Generator (offset mask) module (sub-kernel SK1, edge consolidation)
  Om_0  Consolidates the offset mask and the binary edge image
  Om_1  Offset mask check: checks the presence of the offset mask along suggested pixels
  Om_2  Offset mask expander: expands the offset mask to include more peripheral pixels
  Om_3  Offset mask estimator: estimates the true offset mask (the detected offset mask is an approximation)

WM Generator (weight mask) module
  Sub-kernel SK2:
    Wm_1  Offset mask raw: generates a weight mask from an offset mask
    Wm_2  Post offset mask end roll-off: introduces a post roll-off weight to smoothen the final output
    Wm_3  Pre offset mask end roll-off: introduces a pre roll-off weight to smoothen the final output
  Sub-kernel SK3:
    Wm_4  Luminance check: checks that the top and bottom luminance has a significant variation, which implies an edge
    Wm_5  Top and bottom check: checks that the weight mask exists at the top and bottom of the targeted pixel
    Wm_6  Offset check: checks that the luminance along the edge does not vary beyond a tolerance level
    Wm_7  Average weight: averages the weight mask to allow a smoother transition between edge and non-edge regions

Interpolator module (sub-kernel SK4)
  Interpolate  Performs linear or directional interpolation based on the offset and weight masks

III IBC paper submission

    Using a Bounded Offset Hough Transform For Edge-Dependent Deinterlacing

    Jon Harris (Altera) and Abdulaziz Azman (Imperial College London)

    Abstract This paper presents an edge-dependent deinterlacer scheme which uses

    a Bounded Offset Hough Transform. The scheme aims to resolve low-angled

    artefacts which occur in images produced by most intra-field deinterlacing

    methods. Existing edge-dependent deinterlacers analyze neighboring pixels to

    recover edge information. While this is sufficient for edges closer to the vertical,

    low-angled edge information is rarely sufficiently recovered. This is due to the

    inherent lack of vertical edge information in low-angled edges. The scheme

    proposes the use of a variant of the Hough Transform to analyze non-localized

    pixels to extract more edge information which will be used for edge-dependent de-

    interlacing. Comparing the output of the scheme with several known deinterlacing

    schemes suggests significant improvement in image quality output.

    I. Introduction

    Interlaced scan signals were initially used in analogue CRT television

    to improve the video frame rate without requiring additional bandwidth

    and is achieved by sampling only the horizontally odd or even lines of an

    image. The sampled image from one time instance in an interlaced video

    is called a sub-field. Though the vertical resolution is effectively halved,

    the temporal resolution is doubled which reduces image flicker in CRT

    television. Interlaced video eventually formed the basis of Analogue

    broadcast systems such as PAL and NTSC. Modern digital displays such

    as LCD and plasma screens use a progressive video format which captures

    and displays all horizontal lines at the same instance. To display interlaced

    video on modern digital displays, a deinterlacer is required. An ideal

    deinterlacer algorithm will be able to fully recover the missing horizontal

information in interlaced video. Full reconstruction is not an easy task and is theoretically impossible according to the Nyquist sampling theorem. Visual

    artefacts from de-interlacing are therefore hard to avoid. Further

    information regarding de-interlacing can be found in [1].

Existing deinterlacers can be crudely categorized into inter-field and

    intra-field. Inter-field deinterlacers mainly extract motion information and

perform deinterlacing accordingly. The main drawbacks of inter-field deinterlacing are its high complexity and its reliance on the motion

    detection algorithm. Motion detection fails when large displacements are

involved, which often results in poor video output. In contrast, intra-field

    deinterlacers only process a single sub-field and are therefore

    algorithmically less complex than their inter-field counterparts.

    Existing Intra-field deinterlacer schemes process and i

