
ForntPAGE.doc

Date post: 03-Jun-2018
Upload: rohitkota
  • 8/12/2019 ForntPAGE.doc

    1/29

    A

    SEMINAR REPORT

    ON

    THE INTEL MMX TECHNOLOGY

    Submitted in partial fulfillment of the requirements for the award of the

    Degree of Bachelor of Technology

    in

    Electronics & Communication Engineering

    Submitted by:               Submitted to:

    Renu Kanwar                 Mr. Yogendra Aboti (H.O.D.)

    B. Tech. (8th Sem.)         Department of ECE

    DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

    MARWAR ENGINEERING COLLEGE & RESEARCH CENTRE, JODHPUR

    (RAJASTHAN)

    RAJASTHAN TECHNICAL UNIVERSITY, KOTA (RAJASTHAN)

    (2010-2011)


    Introduction:

    Intel® MMX™ technology [1, 2] is an extension to the basic Intel Architecture (IA) designed to improve the performance of multimedia and communication algorithms. The technology includes new instructions and data types, which achieve new levels of performance for these algorithms on host processors.

    MMX technology exploits the parallelism inherent in many of these algorithms. Many of these algorithms exhibit the property of "fixed" computation on a large data set.

    The definition of MMX technology evolved from earlier work on the i860™ architecture [3]. The i860 architecture was the industry's first general-purpose processor to provide support for graphics rendering. The i860 processor provided instructions that operated on multiple adjacent data operands in parallel, for example, four adjacent pixels of an image.

    After the introduction of the i860 processor, Intel explored extending the i860 architecture in order to deliver high performance for other media applications, for example, image processing, texture mapping, and audio and video decompression. Several of these algorithms naturally lent themselves to SIMD processing. This effort laid the foundation for similar support in Intel's mainstream general-purpose architecture, IA.

    The MMX technology extension was the first major addition to the instruction set since the Intel386™ architecture. Given the large installed software base for the IA, a significant extension to the architecture required special attention to backward compatibility and design issues.

    MMX technology provides benefits to the end user by improving the performance of multimedia-rich applications by a factor of 1.[…]


    This paper provides insight into the process and considerations used to define the MMX technology. It also provides specifics on the MMX instructions that were added to the IA, as well as the approach taken to add this significant capability without adding new software-visible architectural state.

    The paper also presents application examples that show the usage and benefits of MMX instructions. Data showing the performance benefits for the applications is also presented.


    ACKNOWLEDGEMENT

    Theoretical knowledge is improved through seminar preparation, as it contributes significantly to the student's understanding and gives the student first-hand knowledge of the complexities of the engineering arena.

    First, I would like to thank the Almighty and my parents, who gave me their valuable support and blessings to complete this minor project. I would also like to thank Er. V. K. Bhansali (Director, M.E.C.R.C., Jodhpur) & Er. Yogendra Aboti (Head of Department, Electronics & Communication Engineering) for their encouragement & appreciation.

    An accomplishment of any significance depends on the synergy and cooperation of resources, both material and human. I express my heartfelt gratitude to all those who have contributed directly or indirectly to this endeavor.

    My first and foremost regards are for my family members, who patiently and painstakingly helped me out in every way they could.

    INDEX

    S. No.    Content

    01        Definition process

    02        Basic concepts

    03        Packed data format

    04        Conditional execution

    05        Saturating arithmetic

    06        Fixed-point arithmetic

    07        Repositioning of data elements within packed data format

    08        Data alignment

    09        Features

    […] for example, for a motion estimation algorithm, data is naturally organized in 16 rows, with each row containing only 16 bytes of data. In this case, operating on more than 16 data elements at a time will require reformatting the input data. Design considerations involve issues such as the practical width of the data path and how many times functional units will be replicated.

    Given that current Intel processors already have 64-bit data paths (for example, floating-point data paths, as well as a data path between the integer register file and the memory subsystem due to the dual load/store capability in the Pentium processor), we chose the width of MMX data types to be 64 bits.

    Conditional Execution:

    Operating on multiple data operands using a single instruction presents an interesting issue: what happens when a computation is only done if the operand value passes some conditional check? For example, in an absolute value calculation, only if the number is already negative do we perform a 2's complement on it:

    for i = 1, 100

        if a[i] < 0 then b[i] = -a[i] else b[i] = a[i]

    ; Absolute value calculation

    There are different approaches possible, and some are simpler than others. Using a branch approach does not work well for two reasons: first, a branch-based solution is slower because of the inherent branch misprediction penalty, and second, because of the need to convert packed data types to scalars.

    Direct conditional execution support does not work well for the IA, since it requires three independent operands (source, source/destination, and predicate vector). Keeping with the philosophy of performance and simplicity, we chose a simpler solution. The basic idea was to convert a conditional execution into a conditional assignment.

    Conditional assignment, in turn, can be implemented through different approaches. One approach would be to provide the flexibility of specifying a dynamically generated mask with an assignment instruction. Such an approach would have required defining instructions with three operands (source, source/destination, and mask). Here also, we adopted a solution that is more amenable to higher performance designs.

    Compare operations in MMX technology result in a bit mask corresponding to the length of the operands. For example, a compare operation operating on packed byte operands produces byte-wide masks. These masks can then be used in conjunction with logical operations to achieve conditional assignment.

    Consider the following example:

    If True

        Ra := Rb else Ra := Rc

    Let us say register Rx contains all 1's if the condition is true and all 0's if the condition is false. Then we can compute Ra with the following logical expression:

    Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)

    This approach works for operations with a register as the destination. Conditional assignment to memory can be implemented as a sequence of load, conditional assignment, and store. We rejected more efficient support for conditional stores for two reasons: first, the support requires three source operands, which does not map well to high-performance architectures, and second, the benefit of such support is dependent on support from the platform for efficient partial transfers.


    The MMX instruction set contains a packed compare instruction that generates a bit mask, enabling data-dependent calculations to be executed without branch instructions and to be executed on several data elements in parallel. The bit mask result of the packed compare instruction has all 1's in elements where the relation tested for is true and all 0's otherwise (see Figure 1).

    Saturating Arithmetic:

    Operand sizes typically used in multimedia are small (for example, 8 bits for representing a color component). An 8-bit number allows only 2[…]


    There may be cases where an application wants to examine the occurrence of an overflow in a computation. Providing a flag to indicate this (i.e., indicating whether or not the value was saturated) would have been desirable. However, we decided against providing this flag, since we did not want to add any new state to the architecture, in order to preserve backward compatibility. Our analysis also showed that it was not critical to provide this information in most applications. If needed, an application can determine whether saturation was encountered by comparing the result of a computation with the maximum and minimum values; typically, saturation is the correct behavior.
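As an illustration of the difference between wraparound and saturation, here is a minimal Python sketch. It assumes unsigned 8-bit elements held in plain lists, and the function names are hypothetical. It also shows the after-the-fact saturation check described above, done by comparing the result with the maximum value.

```python
BYTE_MAX = 0xFF

def packed_add_wrap(a, b):
    """Element-wise add with wraparound: the carry out of 8 bits is lost."""
    return [(x + y) & BYTE_MAX for x, y in zip(a, b)]

def packed_add_saturate(a, b):
    """Element-wise unsigned add that clamps at 255 instead of wrapping."""
    return [min(x + y, BYTE_MAX) for x, y in zip(a, b)]

def may_have_saturated(result):
    """No flag is available, so compare each result with the maximum value."""
    return [r == BYTE_MAX for r in result]
```

With wraparound, adding two bright color components (200 + 100) gives 44, a dark value. With saturation the result stays at 255. Note that the check cannot distinguish a clamped result from a sum that is exactly 255.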

    Fixed-Point Arithmetic:

    Media applications involve working on fraction values, for example, the use of a weighting coefficient in filtering, averaging, etc. One way to support operations on fraction values is to provide SIMD operations for floating-point operands. However, floating-point units are hardware intensive. Also, for several media applications, even a precision of 10 to 12 binary bits and a dynamic range of 4 to 6 bits are sufficient. Industry-standard floating-point (IEEE FP) requires a minimum of 23 bits of precision. Looking at application requirements and the trade-off of performance and design complexity leads to the use of a fixed-point arithmetic paradigm for several media applications. Note that some of the computations may still require the dynamic range and the precision supported by IEEE floating-point, for example, geometry transformation for state-of-the-art 3D applications.

    In fixed-point computation, from the point of view of the processor architecture, computations are done on integer values, but programmers/applications interpret the integer values as fraction values. Some number of leading bits (determined by the application) are interpreted as an integer, while the remaining bits of the value are interpreted as a fraction. It is the application's responsibility to perform appropriate shifts in order to scale the number.
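A small Python sketch of this convention follows. The choice of 8 fraction bits and the helper names are assumptions made for illustration; the text above leaves the position of the binary point to the application.

```python
FRAC_BITS = 8  # assumed split: the low 8 bits hold the fraction

def to_fixed(x, frac_bits=FRAC_BITS):
    """Scale a real number to its integer fixed-point representation."""
    return int(round(x * (1 << frac_bits)))

def from_fixed(n, frac_bits=FRAC_BITS):
    """Interpret an integer as a fixed-point fraction."""
    return n / (1 << frac_bits)

def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Integer multiply followed by the shift that restores the scale."""
    return (a * b) >> frac_bits
```

The processor only ever sees the integers 384 and 64. The application knows they stand for 1.5 and 0.25, and shifts the product right by 8 bits so that it is again a value with 8 fraction bits.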

    Repositioning of Data Elements Within Packed Data Format:

    The packed data format presents one other issue. There are several cases where elements of packed data may be required to be repositioned within the packed data, or the elements of two packed data operands may need to be merged. There are cases where either the input or the desired output representation of the data may not be ideal for maximizing computation throughput. For example, it may be preferable to compute on color components of a pixel in "planar format" while the input may be in "packed format."


    There are also situations where one needs to perform intermediate computations in a wider format (perhaps packed word format) while the result is presented in packed byte format.

    In the above cases, there is a need to extract some elements of a packed data type and write them into a different position in the packed result.

    One general solution to this issue is to provide an instruction that takes two packed data operands and allows merging of their bytes in any arbitrary order into the destination packed data operand. However, such a general solution is expensive to implement. This solution essentially will require a full crossbar connection.

    In the MMX technology architecture, we defined an instruction that requires a relatively easy swizzle network and yet allows the efficient repositioning and combining of elements from packed data operands in most cases.

    The instruction unpack takes two packed data operands and merges them as shown in Figure 2.

    The unpack instruction can be used for a variety of efficient repositioning of data elements, including data replication, within packed data. For example, consider converting a color representation from packed form (i.e., for each pixel, four consecutive bytes represent R, G, B, and Alpha values) to planar format (i.e., four consecutive bytes represent the red component of four consecutive pixels).
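The packed-to-planar conversion can be sketched in Python as follows. Since Figure 2 is not reproduced here, the sketch assumes the interleaving behavior of the x86 byte-unpack instructions (low halves and high halves of the two operands interleaved byte by byte). Each operand is modeled as a list of 8 bytes with element 0 first, and the function names are hypothetical.

```python
def unpack_low_bytes(a, b):
    """Interleave the low four bytes of a and b: a0, b0, a1, b1, a2, b2, a3, b3."""
    out = []
    for x, y in zip(a[:4], b[:4]):
        out += [x, y]
    return out

def unpack_high_bytes(a, b):
    """Interleave the high four bytes of a and b: a4, b4, a5, b5, a6, b6, a7, b7."""
    out = []
    for x, y in zip(a[4:], b[4:]):
        out += [x, y]
    return out

def packed_to_planar(p01, p23):
    """Convert four RGBA pixels (two per operand) into R, G, B, A planes."""
    t0 = unpack_low_bytes(p01, p23)    # R0 R2 G0 G2 B0 B2 A0 A2
    t1 = unpack_high_bytes(p01, p23)   # R1 R3 G1 G3 B1 B3 A1 A3
    rg = unpack_low_bytes(t0, t1)      # R0 R1 R2 R3 G0 G1 G2 G3
    ba = unpack_high_bytes(t0, t1)     # B0 B1 B2 B3 A0 A1 A2 A3
    return rg[:4], rg[4:], ba[:4], ba[4:]
```

Two rounds of unpacking are enough to gather the four red bytes next to each other, without any arbitrary byte permutation.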

    Data Alignment:

    Use of packed data also presents data alignment issues. In some cases, the data may be aligned on its natural boundary and not on the size of the packed data operand. For example, in a motion estimation routine, the 16x16 block is aligned at an arbitrary byte boundary and not at a 64-bit boundary. Therefore, in some cases there is a need to support efficient access of unaligned data for media applications. One approach is to support unaligned accesses directly in hardware, which generally does not work well with high-performance cache designs. Alternatively, one can limit memory accesses to aligned data and extract the desired data from the accessed data using explicit instructions.

    MMX technology includes logical shift-left and shift-right operations on 64 bits. These instructions enable using a sequence of Shift left, Shift right, and Or operations to assemble the desired bytes from the aligned data that encompasses the desired bytes.
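The shift-and-OR sequence can be sketched in Python like this. The sketch assumes little-endian byte order (as on Intel processors) and models memory as a bytes object; load_aligned and load_unaligned are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1

def load_aligned(mem, addr):
    """Load 64 bits from an address that is a multiple of 8 (little-endian)."""
    assert addr % 8 == 0
    return int.from_bytes(mem[addr:addr + 8], "little")

def load_unaligned(mem, addr):
    """Assemble 64 bits at any byte address from two aligned loads."""
    base = addr & ~7
    offset = addr - base
    low = load_aligned(mem, base)
    if offset == 0:
        return low
    high = load_aligned(mem, base + 8)
    # Shift right to drop the unwanted low bytes, shift left to bring in
    # the bytes from the next aligned quadword, then OR the two pieces.
    return ((low >> (8 * offset)) | (high << (64 - 8 * offset))) & MASK64
```

Only aligned memory accesses are issued. The cost is two loads plus three register operations for each unaligned quadword.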

    Features:

    MMX technology features include:

    New data types built by packing independent data elements together into one register.

    An enhanced instruction set that operates on all independent data elements in a register, using a parallel SIMD fashion.

    New 64-bit MMX registers that are mapped on the IA floating-point registers.

    Full IA compatibility.

    New Data Types:

    MMX technology introduces four new data types: three packed data types and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the place of the fixed point within the elements, because it is the user's responsibility to control its place within each element throughout the calculation. This adds a burden on the user, but it also leaves a large amount of flexibility to choose and change the precision of fixed-point numbers during the course of the application in order to fully control the dynamic range of values.

    The following four data types are defined (see Figure 3):

    Packed byte: 8 bytes packed into 64 bits

    Packed word: 4 words packed into 64 bits

    Packed doubleword: 2 doublewords packed into 64 bits

    Packed quadword: 64 bits
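The four data types are different views of the same 64 bits. A minimal Python sketch, assuming element 0 occupies the lowest bits and using a hypothetical helper named split, makes this concrete:

```python
def split(quadword, width):
    """View a 64-bit value as 64 // width unsigned elements of `width` bits each."""
    element_mask = (1 << width) - 1
    return [(quadword >> (width * i)) & element_mask for i in range(64 // width)]
```

The same value 0x0807060504030201 is read as eight bytes, four words, two doublewords, or one quadword, depending only on the instruction that operates on it.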

    Enhanced Instruction Set:

    MMX technology defines a rich set of instructions that perform parallel operations on multiple data elements packed into 64 bits (8x8-bit, 4x16-bit, or 2x32-bit fixed-point integer data elements). We view the MMX technology instruction set as an extension of the basic operations one would perform on a single datum in the SIMD domain. Instructions that operate on packed bytes were defined to support frequent image operations that involve 8-bit pixels or one of the 8-bit color components of 24/32-bit pixels (Red, Green, Blue, Alpha channel). We defined full support for the packed word (16-bit) data type. This is because we found 16-bit data to be a frequent data type in many multimedia algorithms (e.g., MODEM, Audio), and it serves as the higher precision backup for operations on byte data. A basic instruction set is provided for packed doubleword data types to support operations that need intermediate precision higher than 16 bits and a variety of 3D graphics algorithms. Because MMX technology is a 64-bit capability, new instructions to support 64 bits were added, such as 64-bit memory moves or 64-bit logical operations.

    Overall […]

    Table 1 summarizes the instructions introduced by MMX technology.

    As the MMX registers are mapped over the floating-point registers, applications that use MMX technology have 16 registers to use. Eight are the MMX registers, each 64 bits in size, that hold packed data, and eight are integer registers, which can be used for different operations like addressing, loop control, or any other data manipulation.

    MMX data values reside in the low-order 64 bits (the mantissa) of the IA 80-bit floating-point registers (see Figure 4).

    The exponent field of the corresponding floating-point register (bits 64-78) and the sign bit (bit 79) are set to ones (1's), making the value in the register a NaN (Not a Number) or infinity when viewed as a floating-point value. This helps to reduce confusion by ensuring that an MMX data value will not look like a valid floating-point value. MMX instructions only access the low-order 64 bits of the floating-point registers and are not affected by the fact that they operate on invalid floating-point values.
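The register layout described above can be modeled with a few lines of Python. This is only a bit-level sketch of the 80-bit image; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def fp_register_image(mmx_value):
    """80-bit image after an MMX write: bits 64-79 all ones, low 64 bits = data."""
    return (0xFFFF << 64) | (mmx_value & MASK64)

def mmx_view(register80):
    """MMX instructions only see the low-order 64 bits."""
    return register80 & MASK64

def exponent_field(register80):
    """Bits 64-78 of the 80-bit register."""
    return (register80 >> 64) & 0x7FFF

def sign_bit(register80):
    """Bit 79 of the 80-bit register."""
    return (register80 >> 79) & 1
```

An exponent field of all ones is the encoding reserved for NaN and infinity, so any packed value stored this way is never mistaken for an ordinary floating-point number.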

    The dual usage of the floating-point registers does not preclude applications from using both MMX code and floating-point code. Inside the application, the MMX code and floating-point code should be encapsulated in separate code sequences. After one sequence completes, the floating-point state is reset and the next sequence can start. The need to use floating-point data and MMX (fixed-point integer) data at the same time is infrequent.

    At a given time in an application, the data being operated upon is usually of one type. This enabled us to use the floating-point registers to store the MMX technology values and achieve our full backward compatibility goal.

    Preserving Full Backward Compatibility:

    One of the important requirements for MMX technology was to enable the use of MMX instructions in applications without requiring any changes in the IA system software. An additional requirement was that an application should be able to utilize the performance benefits of MMX technology in a seamless fashion, i.e., it should be able to employ MMX instructions in part of the application without requiring the whole of the application to be MMX technology-aware.

    Primary backward compatibility requirements and their implications are:

    Applications using MMX instructions should work on all existing multitasking and non-multitasking operating systems. This requires that MMX technology should not add any new architecturally visible states or events (exceptions).

    Existing applications that do not use MMX instructions should run unchanged. This requires that MMX technology should not redefine the behavior of any existing IA 32-bit instructions. Only those undefined opcodes that are not relied on for causing illegal exceptions by existing software should be used to define MMX instructions. Also, MMX instructions should only affect the IA 32-bit state when in use.

    Existing applications should be able to utilize MMX technology without being required to make the whole application MMX technology-aware. It should be possible to employ MMX instructions within a procedure in an existing application without requiring any changes in the rest of the application. This requires that MMX instructions work well within the context of existing IA calling conventions for procedure calls.

    It should be possible to run an application even on an older generation of processors that does not support MMX technology. Using dynamically linked libraries (DLLs) for MMX and non-MMX technology processors is an easy way to do this.

    MMX instructions should be semantically compatible with other IA instructions, i.e., it should be easy to support new MMX instructions in existing assemblers. They should also have minimal impact on the instruction decoder. Another aspect of this is that MMX instructions should not require programmers to think in new ways regarding the basic behavior of instructions. For example, addressing modes and the availability of operations with memory should conceptually work the same.

    No New State:

    The MMX technology state overlaps with the floating-point state. Overlapping the MMX state with the FP stack presented an interesting challenge. For performance reasons, as well as for ease of implementation for some microarchitectures, we wanted to allow the accessing of the MMX registers in a flat register model. We needed to enable overlapping MMX registers with the FP stack while still allowing a flat register access model for MMX instructions. This was accomplished by enforcing a fixed relationship between the logical and physical registers for the FP stack when accessed via MMX instructions. Additionally, every MMX instruction makes the whole MMX register file valid. This is different from the floating-point stack model, where new stack entries are made valid only if the instruction specifies a "push" operation.

    MMX instructions themselves do not update FP instruction state registers (for example, FP opcode (FOP), FP data selector (FDS), FP IP (FIP), etc.). The FP instruction state is used only by FP exception handlers. Since MMX instructions do not create any computation exceptions, this state is really not meaningful for MMX instructions. Additionally, not updating this state eliminates the complexity of maintaining it in MMX technology implementations. Therefore, we made a decision to let the FP instruction state registers point to the last FP instruction executed, even though future MMX instructions will update the FP stack and TAG register. Eventually, when an FP instruction is executed, all of the FP instruction state gets updated. Therefore, FP exception handlers always see consistent FP instruction state.

    No New Exceptions:

    MMX instructions can be viewed as new non-IEEE floating-point instructions that do not generate computation exceptions. However, similar to FP instructions, they do report any pending FP exceptions. For compatibility with existing software, it is critical that any pending FP exception is reported to the software prior to execution of any MMX instruction which could update the FP state.

    At the point of raising the pending FP exception, the FP exception state still points to the last FP instruction creating the FP condition. Therefore, the fact that the exception gets reported by an MMX instruction instead of an FP instruction is transparent to the FP exception handler.

    Additional exceptions that are pertinent to MMX technology are memory exceptions, device-not-available (DNA - INT7) exceptions, and FP emulation exceptions.

    Handling of memory exceptions in general does not depend on the opcode of the instruction causing the exception. Therefore, MMX technology exceptions do not cause a malfunction of any memory access-related exception handler. Our extensive compatibility verification validated this further.

    A DNA exception is caused when the TS bit in CR0 is set and any instruction that could modify the FP state is issued. This includes execution of an MMX instruction when the TS bit is set. In this case, similar to the FP case, a DNA exception is invoked. The response to this exception is to save the FP state and free it up for use by future FP/MMX instructions. This exception handler also does not have a use for the opcode of the instruction causing this exception.

    When the CR0.EM bit is set, a floating-point instruction causes an FP emulation exception. In this case, instead of using FP hardware, FP functionality is supported via software emulation. Since the MMX technology architecture state overlaps with the FP architecture state, the issue arises as to the correct behavior for MMX instructions when the CR0.EM bit is set.

    Causing an emulation exception for MMX instructions when CR0.EM is set is not the right behavior, since the existing FP emulator does not know about MMX instructions. Therefore, the first natural choice seemed to be to ignore CR0.EM for MMX technology. However, this choice has a problem. Ignoring CR0.EM for MMX instructions would result in two separate contexts for the FP stack and TAG words: one context in the emulator memory for FP and one context in the hardware for MMX instructions. This leads to an architectural inconsistency between the cases when CR0.EM is set and when it is not set.

    We had to find some other logical way to deal with this without defining any new exceptions. We chose to define the CR0.EM = 1 case to result in an illegal opcode exception. Thus, essentially, when CR0.EM is set, the MMX technology architecture extension is disabled.

    Choice of Opcodes for MMX Instructions:

    The MMX instruction opcodes were chosen after extensive analysis of the undefined opcode map. We had to make sure that the available opcodes were really unused. This required ensuring that no software was relying on the illegal opcode fault behavior of these opcodes. Intel was already working with software vendors to ensure that they relied only on one specific encoding, 0FFF, to cause an illegal opcode fault. Other encodings may cause an illegal exception fault in future implementations.

    Except for a few cases, we found that software was using only the prescribed encoding for causing a program-controlled invalid opcode fault.

    Only address prefixes are defined to be meaningful for MMX instructions. Use of a Repeat, Lock, or Data prefix is illegal for MMX instructions. The address prefix has the same behavior as for any other instruction.

    Use of FP DLL Model for MMX Code:

    To enable common multimedia applications for processors with and without MMX technology, we chose to promote the Dynamic Linked Library (DLL) model as the primary model to support MMX instructions.

    In the DLL model, depending upon whether the processor provides MMX technology support in hardware (the processor CPUID provides this information), the appropriate version of the media library function is linked dynamically.
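The dispatch idea behind the DLL model can be sketched in Python. This is a hypothetical illustration, not the actual library mechanism: the two scale functions stand in for the MMX and non-MMX builds of a media routine, and the caller is assumed to supply the EDX feature word returned by CPUID function 1, in which bit 23 reports MMX support.

```python
MMX_FEATURE_BIT = 23  # CPUID function 1 reports MMX support in EDX bit 23

def has_mmx(cpuid_edx):
    """Check the MMX feature flag in a CPUID feature word."""
    return bool((cpuid_edx >> MMX_FEATURE_BIT) & 1)

def scale_scalar(samples, gain):
    """Stand-in for the non-MMX library version: one element per step."""
    return [s * gain for s in samples]

def scale_packed(samples, gain):
    """Stand-in for the MMX library version: four elements per step."""
    out = []
    for i in range(0, len(samples), 4):
        out.extend(s * gain for s in samples[i:i + 4])
    return out

def load_media_library(cpuid_edx):
    """Pick the implementation once, when the library is bound."""
    return scale_packed if has_mmx(cpuid_edx) else scale_scalar
```

The application calls whatever function was selected and does not need to know which version it received, so the same program runs on processors with and without MMX technology.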

    MMX technology DLLs suggest the same guidelines as those of FP DLLs. The primary guidelines are:

    At the end of a DLL, leave the floating-point registers in the correct state for the calling procedure. This generally means leaving the floating-point stack empty unless a procedure has a return value. This also means that the caller should check for and handle any FP exceptions that it might have generated.

    Do not assume that the floating-point state remains the same across procedures. The callee can typically assume that, at entry, the FP stack is empty unless there is some set convention for parameter passing. Note that nothing in the MMX technology architecture depends on these guidelines for functional correctness.

    MMX technology can be used in any other usage models. MMX technology provides an instruction to clear all of the FP state with a single instruction (the EMMS instruction). If some DLL is written to return with the FP stack only partially empty, one needs to use a combination of EMMS and floating-point loads to create the correct FP stack state.

    Clean the state of MMX with the EMMS instruction.

    Performance Advantage:

    We will analyze the performance enhancement due to MMX technology through an example of a matrix-vector multiplication very much like the one in Figure […]

    […] multimedia and communications applications, used in basic mathematical primitives like matrix multiply and filters.

    A multiply-accumulate operation (MAC) is defined as the product of two operands added to a third operand (the accumulator). This operation requires two loads (operands of the multiplication operation), a multiply, and an add (to the accumulator). MMX technology does not support three-operand instructions; therefore, it does not have a full MAC capability. On the other hand, the packed multiply-add instruction (PMADDWD) is defined, which computes four 16-bit x 16-bit multiplies generating four 32-bit products and does two 32-bit adds (out of the four needed). A separate packed add doubleword (PADDD) adds the two 32-bit results of the packed multiply-add to another MMX register, which is used as an accumulator.

    For this performance example, we will assume both input vectors to be of length 16 elements, each element in the vectors being signed 16 bits. Accumulation will be performed in 32-bit precision. The Pentium processor, for example, would have to process each of the operations one at a time in a sequential fashion. This amounts to 32 loads, 16 multiplies, and 15 additions, a total of 63 instructions. Assuming we perform 4 MACs (out of the 16) per iteration, we need to add 12 instructions for loop control (3 instructions per iteration: increment, compare, branch) and one instruction for storing the result. The total is 76 instructions. Assuming all data and instructions are in the on-chip caches and that exiting the loop will incur one branch misprediction, the integer assembly-optimized version of this code (utilizing both pipelines) takes just over 200 cycles on a Pentium processor microarchitecture. The cycle count is dominated by the integer multiply being a non-pipelined 11-cycle operation. Under the same conditions, but assuming the data is in a floating-point format, the floating-point optimized assembly version executes in 74 cycles. The floating-point version is faster (assuming the data is in floating-point format), since the floating-point multiply takes three cycles to execute and is a pipelined unit.


    MMX technology, on the other hand, computes four elements at a time. This reduces the instruction count to eight loads, four PMADDWD instructions, three PADDD instructions, one store instruction, and three additional instructions (overhead due to packed data types), totaling 19 instructions. Performing loop unrolling of four PMADDWD instructions eliminates the need to insert any loop control instructions. This is because four PMADDWDs already perform all the 16 required MACs. The MMX instruction count is four times less than when using integer or floating-point operations! With the same assumptions as above, on a Pentium processor with MMX technology, an MMX technology-optimized assembly version of the code, utilizing both pipelines, will execute in only 12 cycles.
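The way PMADDWD and PADDD combine into a 16-element dot product can be modeled in Python. This is a behavioral sketch only: the functions are named after the instructions they imitate, operands are plain lists of signed integers, and the final addition of the two accumulator halves is assumed to be part of the packed-data overhead mentioned above.

```python
def wrap32(x):
    """Reduce an integer to signed 32-bit range, as a 32-bit register would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def pmaddwd(a, b):
    """Four 16x16-bit multiplies, adjacent products added into two 32-bit sums."""
    return [wrap32(a[0] * b[0] + a[1] * b[1]),
            wrap32(a[2] * b[2] + a[3] * b[3])]

def paddd(x, y):
    """Packed add of two 32-bit doublewords."""
    return [wrap32(x[0] + y[0]), wrap32(x[1] + y[1])]

def dot16(u, v):
    """16-element dot product: four PMADDWD steps and three PADDD steps."""
    acc = pmaddwd(u[0:4], v[0:4])
    for i in (4, 8, 12):
        acc = paddd(acc, pmaddwd(u[i:i + 4], v[i:i + 4]))
    return wrap32(acc[0] + acc[1])  # combine the two partial sums
```

Each PMADDWD step covers four of the 16 multiply-accumulates, which is why the unrolled sequence needs no loop control.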

    Continuing the above example, assume a 16x16 matrix is multiplied by a 16-element vector. This operation is built of 16 Vector-Dot-Products (VDP) of length 16. Repeating the same exercise as before, and assuming a loop unrolling that performs four VDPs each iteration, the regular Pentium processor code will total 4*(4*76+3) = 1228 instructions. Using MMX technology will require 4*(4*19+3) = 316 instructions. The MMX instruction count is 3.9 times less than when using regular operations. The best regular code implementation (floating-point optimized version) takes just under 1200 cycles to complete, in comparison to 207 cycles for the MMX code version.
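The instruction counts quoted in this section can be re-derived with a few lines of Python. The variable names are illustrative; the individual counts are the ones stated in the text, and the three extra instructions per iteration are taken to be the loop control already described.

```python
# One 16-element vector dot product (VDP) on the regular Pentium processor
scalar_vdp = 32 + 16 + 15 + 12 + 1   # loads, multiplies, adds, loop control, store

# One 16-element VDP with MMX technology
mmx_vdp = 8 + 4 + 3 + 1 + 3          # loads, PMADDWD, PADDD, store, packing overhead

LOOP_CONTROL = 3                     # increment, compare, branch
ITERATIONS = 4                       # 16 VDPs, four per iteration
VDPS_PER_ITERATION = 4

scalar_matrix = ITERATIONS * (VDPS_PER_ITERATION * scalar_vdp + LOOP_CONTROL)
mmx_matrix = ITERATIONS * (VDPS_PER_ITERATION * mmx_vdp + LOOP_CONTROL)
ratio = round(scalar_matrix / mmx_matrix, 1)
```

The per-VDP ratio is exactly 4, and it drops slightly to about 3.9 for the full matrix because the loop control cost is the same in both versions.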

    Intel has introduced two processor families with MMX technology: the Pentium processor with MMX technology and the Pentium II processor. The performance of both processors was compared on the Intel Media Benchmark (IMB) […]. Figure 6 and Table 2 compare the Pentium processor with MMX technology and the Pentium II processor against the Pentium processor and the Pentium Pro processor.


    The floating point registers:

    1. Floating point is processed by eight 80-bit registers, ST(0), ST(1), …, ST(7), in the floating point unit.

    2. When doing floating point arithmetic, these registers are organized in a stack.

    3. Programming floating point is quite different from programming integer arithmetic.

    4. Floating point calculations are done using 80 bits, even when the program specifies storing 32 or 64 bit data values.

    Advantages of using the floating point registers in MMX:

    1. The registers already exist. Only logic had to be added to the chip.

    2. The operating system already knows about the floating point registers.

    3. When a computer is switched from one program to another, the state (registers) of the current program must be saved so that the state can be restored when the program becomes the active program once again.

    4. The floating point registers are automatically saved as part of the state of a program.


    Conclusion:

    MMX technology implements a high-performance technique that enhances the performance of Intel Architecture microprocessors for media applications. The core algorithms in these applications are compute intensive. These algorithms perform operations on a large amount of data, use small data types, and provide many opportunities for parallelism. These algorithms are a natural fit for SIMD architecture. MMX technology defines a general-purpose and easy-to-implement set of primitives to operate on packed data types.

    MMX technology, while delivering a performance boost to media applications, is fully compatible with the existing application and operating system base.

    MMX technology is general by design and can be applied to a variety of software media problems. Some examples of this variety were described in this paper. Future media-related software technologies for use on the Intranet and Internet should benefit from MMX technology.

    Pentium processors with MMX technology provide a significant performance boost (approximately 4x for some of the kernels) for media applications. Performance gains from the technology will scale well with an increased processor operating frequency and future microarchitectures.


    References:

    [1] A. Peleg, U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, Vol. 16, No. 4, August 1996, pp. 42-