Suyash Thesis Presentation

Technische Universität München

Efficient Parallelization of Robustness Validation for Digital Circuits

Suyash Shukla

Institute for Electronic Design Automation
Univ.-Prof. Dr.-Ing. Ulf Schlichtmann


Agenda

- Basic concepts of parallel computing
- OpenMP Interface
- Existing serial code for evaluating robustness
- Ideology
- Changes made in the existing code
- Results of parallel computing
- Further improvements
- Conclusion

04.01.2013 Zuverlässigkeit von CMOS-Schaltungen 2


Basic concepts of parallel computing

- A block of code is executed concurrently by multiple threads.
- This makes use of the multiple processor cores of a single machine.
- Large functions are broken into smaller discrete parts which are executed concurrently (here: iterate_d()).
- Instructions from each part execute simultaneously on different processors, depending on the number of threads.
- Each thread has its own copy of the input data tg.
- The OpenMP interface was chosen for parallel computing.

Note: tg is a pointer to a dynamically allocated object of class TG.



OpenMP Interface

OpenMP (Open Multi-Processing) is an Application Programming Interface that supports shared-memory parallel programming in C, C++ and Fortran.

It consists of its own:
1. Directives
2. Execution environment routines
3. Timing routines
4. Environment variables

A parallel region is created by the directive #pragma omp parallel. Code within this region is executed by multiple threads simultaneously, but in no guaranteed thread order.

The number of threads is set with the runtime routine void omp_set_num_threads (int num_threads). At the start of a parallel region, the master thread “forks” into a team of that many threads.

Each thread has a thread ID, which can be queried with int omp_get_thread_num (void). At the end of the parallel region, the threads “join” back into the one master thread.

Note: In my project, the number of threads is defined by NUM_THREADS which can be set in the command line by --numthreads <int NUM_THREADS>
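The fork-join behaviour described above can be sketched as follows. This is only an illustration, not code from the project: the function name forkJoinDemo and the vector seen are hypothetical, and the #ifdef _OPENMP guards are there so that the sketch also builds (serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical demo: fork a team of threads, let every thread record its own
// ID, then join. Slot i of the result holds i if thread i took part, else -1.
std::vector<int> forkJoinDemo(int numThreads) {
    assert(numThreads > 0);
    std::vector<int> seen(static_cast<std::size_t>(numThreads), -1);
#ifdef _OPENMP
    omp_set_num_threads(numThreads);  // upper limit for the next parallel region
#endif
    #pragma omp parallel  // "fork": the master thread creates the team
    {
        int id = 0;       // serial build: only the master thread, ID 0
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        seen[static_cast<std::size_t>(id)] = id;  // each thread writes only its own slot
    }                     // "join": implicit barrier, only the master continues
    return seen;
}
```

The runtime may provide fewer threads than requested, so only slot 0 (the master thread) is guaranteed to be filled.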


[Figure: fork-join model. The master thread (thread ID = 0) “forks” into several threads that execute the code in the parallel region and then “join” back into the master thread.]


Existing serial code for evaluating Robustness

Robustness is calculated by:
1. Verifying the specification points (verifySpec(), verifyPoint()).
2. Checking whether any of the 4 points is violated (SpecViolated).
3. Iterating (iterate_d()) from these 4 points and calculating a validPoint.


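[Figure: specification region in the temperature-voltage plane (axes Temp and Volt), bounded by tempReq.first, tempReq.second, voltReq.first and voltReq.second; the iteration directions are labelled ‘t’ and ‘v’.]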


4. The 8 valid points are stored in a vector ‘m_validPoints’.
5. The function probs() iterates through the vector m_validPoints and returns a doublePair value. This value is passed to the function robustness(), which returns a double, “robustnessprobValue”.

The area is calculated by ‘dichotomy’ (bisection). The iterations in the 8 directions take place serially, one after the other, which is very time consuming.

The iterate_d() function calls verifyPoint() and verifyTolerance(), which use *tg to access the useProfile, updateArrivalTime() and getSinkArrivalTime().
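The internals of iterate_d() are not shown on the slides. As a rough illustration of a dichotomy search along one direction, here is a one-dimensional sketch under stated assumptions: bisectValidPoint and the isValid predicate (a stand-in for verifyPoint()) are hypothetical names, and the start point is assumed to satisfy the specification.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Hypothetical 1-D stand-in for one iterate_d() direction: bisect between a
// point known to satisfy the specification and a boundary point until the
// bracket is no wider than intervalLimit, then return the last valid point.
double bisectValidPoint(double validStart, double boundary,
                        const std::function<bool(double)>& isValid,
                        double intervalLimit) {
    assert(intervalLimit > 0.0);
    if (isValid(boundary)) {
        return boundary;      // the whole range up to the boundary is valid
    }
    double lastValid = validStart;   // invariant: satisfies the specification
    double violated = boundary;      // invariant: violates the specification
    while (std::fabs(violated - lastValid) > intervalLimit) {
        const double mid = 0.5 * (lastValid + violated);
        if (isValid(mid)) {
            lastValid = mid;  // move the valid end towards the boundary
        } else {
            violated = mid;   // move the violated end towards the start
        }
    }
    return lastValid;
}
```

Every call to isValid is independent of the searches in the other directions, which is what makes the 8 directions candidates for running in parallel.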



Ideology

My main aim was to speed up the process of calculating robustness, especially for larger circuits.

To make this possible, I had to run the iterate_d ( ) function in parallel. This would help me to compute all the 8 validPoints concurrently.


[Figure: the validPoints plotted in the T-V plane (axis values shown: 1.176, 1.212, 65, -20).]


If I parallelize the function iterate_d(), multiple threads also execute the functions it calls, such as verifyPoint() and verifyTolerance(), simultaneously.

Each of these function calls needs its own copy of *tg.

Hence I create as many copies of ‘tg’ as there are threads (NUM_THREADS) and store them in a vector named ‘TGVec’.

Each instance of tg can then be accessed by TGVec[thread ID]; tg0 is stored in TGVec[0], tg1 is stored in TGVec[1] and so on.



Changes made in the existing code

1. TG *tg; tg = new TG;

Instead of declaring just one tg, I now create several copies of tg, depending on the number of threads, NUM_THREADS, and store them in a vector.

std::vector<TG*> TGVec;
for (int j = 0; j < NUM_THREADS; j++)
{
    TG *tg = new TG;
    TGVec.push_back(tg);
}



2. tg -> loadTimingLib (xmllib);

Since there are several copies of tg now, every setup function has to be called on each copy:

for (std::size_t i = 0; i < TGVec.size(); i++)
{
    TGVec[i]->loadTimingLib(xmllib);
}

The same applies to:
- TGVec[i]->loadOA(oalib, oadesign, oaview);
- TGVec[i]->loadConstraintsLib(xmlconstrlib);
- TGVec[i]->set_useProfile(prof1);
- TGVec[i]->getSourceNodes();
- TGVec[i]->getSinkNodes();
- TGVec[i]->getNodes();



3. useProfile *oldProf = m_timer -> get_useProfile(); useProfile newProf;

m_timer was declared as ‘static TG *m_timer;’. Since this code now executes in parallel inside verifyPoint() and verifyTolerance(), a single shared m_timer is no longer enough: each thread needs its own copy of ‘m_timer’ (its own tg), selected by its thread ID:

useProfile *oldProf = m_TGVec[omp_get_thread_num()] -> get_useProfile();
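The idea of one object per thread, selected by thread ID, can be sketched in isolation. The struct Timer below is a hypothetical stand-in for class TG, and currentThread() and runWithPerThreadCopies() are illustrative names, not project code; the #ifdef _OPENMP guards let the sketch build without OpenMP as well.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for class TG: it has mutable state, so sharing one
// instance between threads would be a data race.
struct Timer {
    long calls = 0;
    void verify() { ++calls; }
};

// Thread ID inside a parallel region; always 0 in a serial build.
int currentThread() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Creates one Timer per thread and lets every loop iteration use only the
// copy that belongs to the executing thread. Returns the total call count.
long runWithPerThreadCopies(int numThreads, int iterations) {
    assert(numThreads > 0);
    std::vector<std::unique_ptr<Timer>> timers;
    for (int j = 0; j < numThreads; ++j) {
        timers.emplace_back(new Timer());
    }

    #pragma omp parallel for num_threads(numThreads)
    for (int i = 0; i < iterations; ++i) {
        timers[static_cast<std::size_t>(currentThread())]->verify();
    }

    long total = 0;
    for (const auto& t : timers) {
        total += t->calls;
    }
    return total;
}
```

Using std::unique_ptr instead of raw new is a choice made for this sketch so that the copies are released automatically; the slides use raw pointers.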


[Figure: calculateRobustness() calls iterate_d(), which calls verifyPoint() and verifyTolerance(); the threads work on their own copies tg 0, tg 1, tg 2 and tg 3.]


4. startPoint = doublePair (m_tempReq.first, m_voltReq.first);
   boundary = doublePair (m_tempReq.first, m_tempLimit.first);
   iterate_d (startPoint, boundary, 't', m_intervalLimit);

The function iterate_d() is no longer a void function. It now returns a doublePair value (start.first, start.second).

The 8 iterations run in parallel sections. At the barrier directive, each thread waits for the rest to finish their computation.

The 8 values are then stored in the vector ‘m_validPoints’. It is important to maintain the order of the iteration values, starting with the iteration from the point (tempReq.first, voltReq.first) along the temperature axis.



// private(...) gives each thread its own startPoint and boundary, so that several
// sections do not overwrite each other's values (assumes both are local variables)
#pragma omp parallel private(startPoint, boundary) // Parallel region begins here
{
    #pragma omp sections // The sections are distributed over the threads
    {
        #pragma omp section
        {
            startPoint = doublePair (m_tempReq.first, m_voltReq.first);
            boundary = doublePair (m_tempReq.first, m_tempLimit.first);
            temp1 = iterate_d (startPoint, boundary, 't', m_intervalLimit);
            printf ("Iteration from 1st point thread %d\n", omp_get_thread_num());
        }
    }
    #pragma omp barrier // All threads wait here for each other (the end of "sections" already implies a barrier)
} // Parallel region ends here
m_validPoints.push_back (doublePair (temp1.first, temp1.second));
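A self-contained sketch of the same pattern with two sections shows how the order of the results is kept: each section writes to its own variable, and the values are appended to the vector only after the parallel region has ended. iterateStub() is a hypothetical placeholder for iterate_d() and the coordinates are made-up numbers.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using doublePair = std::pair<double, double>;

// Hypothetical placeholder for iterate_d(): returns the midpoint between the
// start point and the boundary instead of running a real search.
doublePair iterateStub(const doublePair& start, const doublePair& boundary) {
    return doublePair(0.5 * (start.first + boundary.first),
                      0.5 * (start.second + boundary.second));
}

std::vector<doublePair> collectValidPoints() {
    doublePair temp1;  // written only by the first section
    doublePair temp2;  // written only by the second section

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            temp1 = iterateStub(doublePair(0.0, 0.0), doublePair(2.0, 4.0));
        }
        #pragma omp section
        {
            temp2 = iterateStub(doublePair(1.0, 1.0), doublePair(3.0, 5.0));
        }
    }  // implicit barrier: both results are complete here

    // Appending after the region keeps the order fixed, no matter which
    // thread finished first.
    std::vector<doublePair> validPoints;
    validPoints.push_back(temp1);
    validPoints.push_back(temp2);
    return validPoints;
}
```

Without OpenMP enabled the pragmas are ignored and the two calls simply run one after the other, with the same result.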




[Figure: flow inside #pragma omp parallel. Each #pragma omp section calls iterate_d() and returns a doublePair (temp1, ...); after #pragma omp barrier the results are appended with m_validPoints.push_back. In the vector “m_validPoints”, temp1 is stored first and temp5 is stored at the end.]



5. To analyze the results, I used an OpenMP timing routine, double omp_get_wtime(void);

It returns the wall-clock time in seconds; the difference between two calls gives the real time elapsed for the computations in between.

double OmpStart;
double OmpEnd;
OmpStart = omp_get_wtime();
{ … calculateRobustness ( ) … }
OmpEnd = omp_get_wtime();
std::cout << "Robustness at " << age << " years calculated in ";
std::cout << static_cast<double> (OmpEnd - OmpStart) << " OpenMP real time seconds! ";
std::cout << std::endl;
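To compare real time with CPU time, both can be measured around the same call. The sketch below is an assumption about how such a measurement could look, not the project's code: Timing, wallClockNow() and measure() are hypothetical names, and std::chrono is used as a fallback when OpenMP is not enabled.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#ifdef _OPENMP
#include <omp.h>
#endif

struct Timing {
    double wallSeconds;  // real (elapsed) time
    double cpuSeconds;   // processor time consumed by the process
};

// Wall-clock time stamp in seconds; only differences are meaningful.
double wallClockNow() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using Clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(Clock::now().time_since_epoch()).count();
#endif
}

// Runs work() once and reports both the elapsed time and the CPU time.
template <typename Work>
Timing measure(Work work) {
    const double wallStart = wallClockNow();
    const std::clock_t cpuStart = std::clock();
    work();
    const std::clock_t cpuEnd = std::clock();
    const double wallEnd = wallClockNow();

    Timing t;
    t.wallSeconds = wallEnd - wallStart;
    t.cpuSeconds = static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return t;
}
```

On Linux, std::clock() typically accumulates the CPU time of all threads of the process, which is why CPU time can stay roughly constant while the real time drops as threads are added.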


6. To set the number of threads for parallel computing, I could hard-code it, for example ‘#define NUM_THREADS 2’ (or any other int value such as 4 or 8). Instead, I declared it as a command line argument.

The number of threads can now be set with --numthreads <int> on the command line. The advantage is that the user need not recompile the program every time the number of threads changes.

TCLAP::ValueArg<int> numThreadsArg("", "numthreads", "Override Number of threads usage", false, 1, "int", cmd);

If nothing is given on the command line, the number of threads defaults to 1.
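The project uses TCLAP for this. To show the intended behaviour without that library, here is a plain C++ stand-in: parseNumThreads() is a hypothetical helper, and the upper limit of 1024 is an arbitrary assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical TCLAP-free stand-in: looks for "--numthreads <int>" in argv
// and returns the value, or the default of 1 if the option is missing or
// its value is not a positive integer.
int parseNumThreads(int argc, const char* const argv[]) {
    assert(argc >= 0);
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--numthreads") == 0) {
            char* end = nullptr;
            const long value = std::strtol(argv[i + 1], &end, 10);
            const bool parsedWholeArgument = (end != argv[i + 1]) && (*end == '\0');
            if (parsedWholeArgument && value >= 1 && value <= 1024) {
                return static_cast<int>(value);
            }
        }
    }
    return 1;  // default, as with the TCLAP argument above
}
```

A call such as parseNumThreads(argc, argv) at the start of main() would then provide the value for NUM_THREADS.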



Results of parallel computing

The program was compiled with OpenMP (#include <omp.h>) and run with multiple threads.

As a result, the computation time for robustness was much shorter: the calculation was sped up by almost a factor of 2.

The program utilizes all the available resources. The real computation time decreases as the number of threads increases, even though the CPU computation time increases.

On the Rechner machines (4-core processors), 8 threads gave the best results, and 12 threads were observed to be the maximum.



The program utilizes both processor cores for running the program ‘Robustness’.

The CPU usage is 157%, which implies that both CPUs are busy with the same program.

With more processors, the %CPU usage increases too.



Results for:
– Nangate design: cell c1908_i89 (NUM_THREADS = 2)
– Dimensions: 3D
– Machine: Rechner2


Results for:
– Nangate design: cell c1908_i89
– Dimension: 2D (Age 10 years)
– Machine: Rein (2 processors)

Increasing NUM_THREADS reduces the run time, while the robustness value remains unchanged. With 2 processors, most of the gain is already reached at 2 threads.


Num_threads   Robustness value   CPU Time (s)   OpenMP Real Time (s)
1             0.317204           47.15          47.1668
2             0.317204           48.6           28.8759
4             0.317204           48.29          28.1237
8             0.317204           48.47          27.5828


Results for:
– Nangate design: cell c1908_i89
– Dimension: 2D (Age 10 years)
– Machine: Rechner3 (4 processors)

The results here are as expected too: the real time drops up to 8 threads and levels off at 16 threads.


Num_threads   Robustness value   CPU Time (s)   OpenMP Real Time (s)
1             0.317204           43.55          43.6704
2             0.317204           43.53          25.3362
4             0.317204           44.19          16.4133
8             0.317204           47.97          11.563
16            0.317204           50.06          11.7313
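The gain in the tables can be expressed as a speedup, the real time with 1 thread divided by the real time with N threads. The helper names below are hypothetical and only illustrate the arithmetic.

```cpp
#include <cassert>

// Speedup: how many times faster the parallel run is than the 1-thread run.
double speedup(double serialSeconds, double parallelSeconds) {
    assert(parallelSeconds > 0.0);
    return serialSeconds / parallelSeconds;
}

// Parallel efficiency: speedup divided by the number of workers used.
double efficiency(double serialSeconds, double parallelSeconds, int workers) {
    assert(workers > 0);
    return speedup(serialSeconds, parallelSeconds) / workers;
}
```

With the measured real times, speedup(43.6704, 11.563) is about 3.78 for 8 threads on Rechner3, and speedup(47.1668, 27.5828) is about 1.71 for 8 threads on Rein.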


Graphical representation of OpenMP real time

Based on the values from the previous slides.

The elapsed computation time decreases as the number of OpenMP threads increases.



Further Improvements

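The number of OpenMP threads can also be set as an environment variable in the Linux terminal with the command

export OMP_NUM_THREADS=<int #>

int NUM_THREADS could then be defined as

int NUM_THREADS = omp_get_max_threads( );

omp_get_max_threads( ) returns the number of threads a parallel region would use, which reflects OMP_NUM_THREADS. omp_get_num_threads( ) would return 1 at this point, because it is called outside a parallel region.

With this working, I wouldn’t need the command line input for the number of threads.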
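A minimal sketch of reading the thread count from the OpenMP runtime is given below. Note that omp_get_num_threads() reports the size of the current team, which is 1 outside a parallel region, whereas omp_get_max_threads() reports how many threads the next parallel region may use and therefore honours OMP_NUM_THREADS. The two function names are hypothetical, and the serial fallback of 1 is an assumption of this sketch.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads the next parallel region may use (follows OMP_NUM_THREADS).
int defaultThreadCount() {
#ifdef _OPENMP
    const int n = omp_get_max_threads();
#else
    const int n = 1;  // serial build: only the master thread exists
#endif
    assert(n >= 1);
    return n;
}

// Size of the current team when called from serial code: always 1.
int teamSizeOutsideParallelRegion() {
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}
```

A value from defaultThreadCount() could then size TGVec in place of the command line argument.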



Conclusion

Parallelism has been employed for many years, mainly in high-performance computing, but with multi-core computers being so common these days, interest in it has grown massively.

Parallel programs are more difficult to write than sequential programs, as they require more planning and more skill to troubleshoot software bugs such as race conditions.

The objective of this bachelor thesis, however, has been met. Parts of the robustness validation program are now efficiently parallelized, which speeds up the whole process of robustness calculation.



