  • DATA MANAGEMENT PRINCIPLES IN APPLICATION DEVELOPMENT

    Database Development, Data Structures and Sorting Algorithms

    Tanasorn (Mimi) Chindasook, Jacobs University, M.Sc. Data Engineering

    Student ID: 30002281, [email protected]

  • Acknowledgements

    I respectfully acknowledge Dr. Bendick Mahleko and Nilabhra Roy Chowdhury for their support, advice

    and input on the project; Prateek K. Choudhary and Shengchen Dong for their support during the course

    and critical review of the paper.

    Abstract

    Data management is a crucial skill that every data engineer should possess in order to effectively

    implement and maintain database systems within an organisation. A relational database is an efficient

    way to store data to perform queries that can be used in application development. Data structures and

    sorting algorithms are also a crucial part in application development as both can be used to optimise

    performance when correctly implemented.

    Introduction

    With the increasing volume of data being generated each day, database management systems (DBMS) exist to assist users in maintaining and utilising large collections of data. Without proper management, collected data cannot be utilised to its full potential, as information retrieval and analysis would be incredibly difficult tasks to accomplish. This paper explores the various factors in

    DBMS creation through an implementation of a DBMS for a start-up company looking to create an

    employees database. There are many factors to consider in order to correctly implement an effective

    DBMS. Therefore, it is imperative that research be done on the system that is to be modelled, and that the requirements be correctly and thoroughly collected, as it is difficult to change the structure of a DBMS

    once implemented.

    Furthermore, to be able to effectively develop applications, research on the appropriate data structures

    and sorting algorithms must also be conducted so that they can be appropriately implemented.

    Therefore, this paper also aims to introduce different data structures by comparing their usage across various applications, and to provide an overview of sorting algorithms by evaluating their efficiency using time-complexity comparisons.

    The rest of the paper is structured as follows: Section 1 provides a brief literature review for data

    management principles, data structures and sorting algorithms, Section 2 details the creation of an

    example DBMS system for employees in a start-up company along with some query examples, Section 3

    explores key data structures used in data management, Section 4 compares various naïve and efficient

    sorting algorithms through a time complexity analysis and describes their applications, and finally,

    Section 5 concludes the paper.

  • 1. Literature Review

    In 1960, Charles Bachman designed the IDS (Integrated Data Store) to improve performance, which significantly influenced the development of other DBMS systems. Later, in 1966, IBM released the IMS (Information Management System), which was based on the hierarchical model and was intended for storing large bills of material for aerospace projects such as the Apollo space vehicle (Shagufta, 2017).

    The relational data model was then proposed by E.F. Codd during his time at IBM in 1970. The idea was

    that the data would be represented through table form and thus would allow the possibility of

    incorporating many-to-many joins, unlike the hierarchical data model. He released several papers after his initial theoretical work which detailed aspects of the relational model, such as relational algebra (Shagufta, 2017).

    The ER Model was then developed by Peter Chen in 1976 (Chen, 1976). This model represents the world

    in terms of entities and relationships, and is the model that is used in abstraction to assist in database

    design. This led to the relational model being adopted as the standard approach for DBMS in the 1980s, along with the development of SQL as a query language and its adoption by ANSI and ISO. Several relational DBMS were developed, such as Informix and Oracle.

    Another significant aspect of database management and application development is data structures. Data structures can be characterised into primitive and non-primitive types. Primitive data structures are used in data management to define, using DDL, the type of data that should be stored. Non-primitive data structures are crucial in application development, as the differing characteristics of each data structure allow for various implementations.

    Data structures such as queues have always existed as part of the fundamental logic in batch processing.

    However, stacks were developed in 1946 in Turing’s computer design and linked lists were developed by

    Newell, Shaw and Simon for RAND Corporation’s Information Processing Language.

    Sorting algorithms also play an important role in application development and data management as a

    data pre-processing step. Efficient sorting algorithms differ in stability and memory usage and are

    implemented in several DBMS user interfaces. Efficient sorting algorithms appeared early: Merge sort was conceptualised by John von Neumann in 1945 (Knuth, 1998), and its more widely used counterpart, Quick sort, was developed by Tony Hoare in 1959 (Hoare, 1961).

    2. Relational Database Concepts and Database Design

    A database is defined as a structured collection of data that describes the different components of one

    or more related organisations, and can be stored or accessed in various ways (Ramakrishnan, 2003).

    Relational databases are a type of database that typically utilises the ANSI-SPARC architecture in data

    management which was first proposed in 1975 (Brodie, 1975). The ANSI-SPARC architecture is defined at

    three levels of abstraction which enable the end user to achieve logical and physical data independence.

    Logical data independence protects users from alterations in the logical structure of the data, whilst

    physical data independence refers to end user protection from changes at the physical storage as the

    modifications are transformed through mapping techniques in the conceptual schema.

  • Fig 1: Structure of a DBMS (Mahleko, 2018)

    Before beginning any database design, a requirements analysis must be carried out to ensure that all

    data is represented in the appropriate format in the database. For example, the upper management at a

    small start-up company would like to implement a relational database to store its employee

    information. The requirements analysis for this particular case will be answered in the following

    manner:

    What data must be stored in the database?

    All data pertaining to information on an employee in relation to the company, along with some personal

    information must be stored in the database (e.g. name, birthdate, start date, end date, salary, email,

    department). As employees can be promoted or change departments, records of how long an employee

    has worked in which department at which position must also be kept. The company also provides an

    extended health insurance policy to dependents of the employee.

    Who will use this database and what do these users want from the database?

    Upper management and HR will be the primary users of this database. The primary use of this database

    is to be able to easily retrieve information on each employee in the company when issuing monthly

    payments, preparing for company audits and providing an overview of the employees.

    What operations are to be performed on the database? Which of these operations are

    frequently performed?

    Operations that will be mostly performed on the database are:

    1.) Viewing of a set of employees and their corresponding information

    2.) Updating information when an employee changes departments, gets promoted, or leaves

    3.) Addition of new employees

    Once the data requirements have been thoroughly analysed, the relational database design can then

    commence.

  • Fig 1.1: Overview of ANSI-SPARC architecture (Abidin, 2010)

    Conceptual Schema

    The conceptual schema (or Data Modelling) is the first level of abstraction in ANSI-SPARC architecture

    and consists of the definition of the data's logical structure. In the conceptual schema, database designers define the tables that the database will be established upon, along with the entities that should be included in those tables, their datatypes, the chosen relations between those entities and any

    constraints on the data. The process of representing the data as a set of tables is denoted as the

    conceptual database design.

    Conceptual Design

    The conceptual design aspect focuses on describing the data and customer intention. Entity Relationship

    Models (ERM) are a high-level abstraction that represents the world in terms of relationships and

    entities (Chilson, 1983). ER diagrams are a semiformal way of representing the data using the ERM

    concept. Although ER diagrams cannot be immediately translated into database format, they are an effective way to visualise the relationships between entities. The following ER diagram depicts the

    Employees database in terms of the ER Model:

  • Fig 2: ER Diagram for the employees database

    Entity: An object that can be distinctly identified. In the diagram, an employee is considered to be an entity. A department is also considered to be an entity (Chen, 1976).

    Weak entity: An entity that can be identified only by considering the primary key of its owner (Chen, 1976). In this ER diagram a dependent is considered to be a weak entity because a dependent is only related to the company through the employee that works there.

    Attribute: A descriptive fact about an entity. In this ER diagram, birthdate, email, name and salary are all descriptive attributes of an employee. Department name is a descriptive attribute of a department (Chen, 1976).

    Primary key: The unique identifier for an entity. No two entities can have the same primary key (Chen, 1976). In the ER diagram, the primary keys are eid (employee ID) and did (department ID).

    Weak entity identifier: Weak entities can only be identified by considering the primary key of another related entity (Chen, 1976). As represented in the ER diagram, dependents is a weak entity that can only be identified by the employee ID through the company health insurance policy. In this case, the weak identifier for dependents is the employee ID.

    Relationship: Denotes a relationship between two entities. In this ER diagram, the “works for” relationship represents the relation between employees and departments (Chen, 1976).

    Identifying relationship: Denotes an identifying relationship. In the case of the ER diagram, Manages is an identifying relationship, as given a department, its manager can be uniquely identified. Policy is also an identifying relationship, as given a dependent, its related employee can be uniquely identified.

  • Key Constraint. The arrow points to the direction that is constrained. In the case of the employee database, only one employee can manage a department at any given time. Therefore, the arrow is pointed in the direction of the employee.

    One-to-Many Relationship. As represented in the ER diagram, one employee can have many dependents.

    Many-to-Many Relationship. As represented in the ER diagram, one employee can work for many departments, and one department can have many employees.

    An aspect that is not included in this design but should be mentioned is the participation constraint. An example of a participation constraint is that all employees must work for at least one department, or all departments must have at least one employee. An example of a one-sided participation constraint is demonstrated in the Manages relationship, where each department must have a manager, but not every employee must have a managing role. Because companies change structure constantly (especially at the start-up stage, as in this case), participation constraints are not enforced in this database.

    Logical Design

    The logical model is constructed based on the conceptual model with the addition of the datatypes

    (Abidin, 2010). The logical design focuses on the abstract and disregards the implementation. For the

    Employees database, the logical design is as follows:

    Employees(eid:integer, name:string, email:string, birthdate:date, salary:real)

    Department(did:integer, dname:string, managerid:integer)

    Dependent(eid:integer, dependent_name:string)

    Works(eid:integer, did:integer, start_date:date, end_date:date)

    Manages(did:integer, eid:integer, start_date:date, end_date:date)

    Policy(pid:integer, eid:integer)

    Physical Schema

    After the conceptual schema has been created, the physical schema should then be considered. The

    physical schema is where the files and indexes used are defined. This step denotes how the data, such as the relations defined in the conceptual schema, will be represented and stored on secondary storage by systems such as Oracle, Postgres, SQL Server and MySQL. These systems all utilise Structured Query Language (SQL) as a means of interaction but differ in syntax. The statements that are referenced in the body of this report and its appendix are in MySQL syntax.

    Physical Design

    The physical design and DDL of the employees database can be found in Appendix [1.1].

    External Schema

    Following the definition of the physical schema is the external schema. The external schema represents

    the different views of the data that can be seen by the end-user (Ramakrishnan, 2003; Brodie, 1975). For

  • example, students at a university should not be allowed to view the salaries of the professors.

    Therefore, permissions and roles of each user must be defined at this level. An example command in

    DDL to grant SELECT access on all tables in the employees database to user tdaoruang is:

    GRANT SELECT ON employees_n.* TO 'tdaoruang';

    In Fig. 4, the user mchindasook holds the role of Database Administrator and has permission for all

    aspects of the database. In comparison, the user lhaller is the top manager in the HR department and

    can view employee information, add new employees and update an employee’s information through an

    application. Finally, the user tdaoruang is an employee in the HR department and can only view

    information through an application. It is important to note that HR will only view the database through

    an application as HR will not have direct access to the employees database in reality, but only through a

    user-friendly interface. The external schema is used in application development as the external view is

    not stored, but rather computed as it is accessed (Ramakrishnan, 2003).

    Fig. 4: External schema for employees database

    Languages that can be used to implement the conceptual and physical schema are data definition

    language (DDL) and data manipulation language (DML).

    DDL: used to define conceptual and external schemas

    o CREATE

    DML: used to perform operations on the data

    o INSERT, UPDATE, DELETE

    For more examples of using DDL and DML to create a database and insert, update and delete data, along

    with some basic queries that can be performed on databases, please refer to Appendix [1.2] and

    Appendix [1.3].

    Database Querying

    Relational Algebra and Database Queries

    The rudimentary operations of relational algebra are projection, selection, set union, set intersection,

    set difference and Cartesian product (Ramakrishnan, 2003). Relational algebra is primarily used in data

    modelling and database querying. The types of joins that are most used in database querying are: inner

    join, left outer join, right outer join and full-outer join.

    The natural join is one of the most essential operations in relational algebra, as it is the relational equivalent of the logical AND: it returns the set of all combinations of tuples that agree on their common attributes.

  • Fig 5: Example of a natural join followed by a query

    Fig 5 depicts the process of finding the employee with the highest salary through the use of a join. This

    join can be achieved in two ways:

    Query A:

    SELECT e.name, d.dname, e.salary
    FROM employees e, departments d, works w
    WHERE e.eid = w.eid
      AND d.did = w.did
      AND w.end_date IS NOT NULL
      AND e.salary = (SELECT MAX(salary) FROM employees);

    Query B:

    SELECT e.name, d.dname, e.salary
    FROM employees e
    LEFT JOIN works w ON w.eid = e.eid AND w.end_date IS NOT NULL
    LEFT JOIN departments d ON d.did = w.did
    ORDER BY e.salary DESC
    LIMIT 1;

    It should be noted that, in this case, there are other ways to produce identical results, each with variations in efficiency. An example of a highly inefficient query to achieve the result above is to perform a Cartesian join and then find the tuple that satisfies the conditions. The example queries above depict two different ways to join the tables in the database. In this instance, Query A is less efficient than Query B, as it selects the highest salary with the use of a subquery. Queries that use subqueries in this manner are often subject to slower performance, as the subquery has to finish running before the outer query can begin. Another notable difference is that Query B's LEFT JOIN keeps every employee regardless of whether they are assigned to a department: if an employee is not assigned to a department, the joined tuple contains the selected information from the base table (employees) and NULL everywhere else.

    For more examples of database SELECT queries, please refer to Appendix [2.1] and Appendix [2.2].

    The advantages of employing a DBMS include improvements in data integrity and security, as it enforces integrity constraints on data that is accessed or input and enforces access control for its users.

    Furthermore, a DBMS also protects its users from being affected by system failure through crash

    recovery mechanisms.

    3. Data Structures

    Data structures are vital components in data management and application development, as they pertain to storing data in an effective manner (Shaffer, 2009). Data structures can be characterised into primitive and non-primitive types; primitive types refer to datatypes such as Boolean or Integer, and non-primitive types refer to structures such as arrays, stacks or queues, where data is referenced through the structure rather than stored directly (Shaffer, 2009). The various non-primitive data structures differ in how data is inserted, deleted and queried, leading to diverse applications in data management.

    Linear Data Structures

    Stacks

    A stack utilises the last-in-first-out (LIFO) principle and allows only two operations: the push of an item

    onto the stack, and the pop of an item from the stack (Shaffer, 2009). A stack is considered a limited

    access data structure, as items can only be added and removed from the top of the stack. It is also a recursive data structure, as it is either empty or has a top element and the rest, which is itself a stack (Shaffer, 2009).

    Fig 6: Visualisation of a stack (Techspirited.com, 2018)

    Applications of stacks in data management include backtracking or undoing and runtime memory

    management. Backtracking refers to the undo mechanism in text editors; this is accomplished by storing all of the text changes in a stack. When a user presses undo, the stack pops off the top element, and the remaining stack represents the document minus the last change.
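The undo mechanism described above can be sketched with a standard stack. This is a minimal, illustrative sketch: the class and method names are assumptions, and `java.util.ArrayDeque` stands in for the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a text editor's undo history backed by a stack (LIFO).
// Class and method names are illustrative; ArrayDeque provides push/pop.
class UndoHistory {
    private final Deque<String> changes = new ArrayDeque<>();

    // push: record an edit on top of the stack
    void record(String change) {
        changes.push(change);
    }

    // pop: remove and return the most recent edit, or null if there is nothing to undo
    String undo() {
        return changes.isEmpty() ? null : changes.pop();
    }
}
```

Because only the top of the stack is ever touched, each undo is a constant-time operation.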

    Queue

    A queue is a vital data structure in data management that follows the first-in-first-out (FIFO) principle

    (Barnett, 2008). The item that is stored in the front of the queue can be removed and insertion can

    occur only at the back of the queue. A traditional queue is allowed three operations: enqueue inserts an

    item at the back of the queue, dequeue removes an item from the front of the queue, and peek allows

    the user to view the item at the front of the queue without actually removing it (Barnett, 2008).

  • Fig 7: Visualisation of a Queue Data Structure (Techspirited.com, 2018)

    Queues are effective in situations where data is transferred between processes. Typical data

    management applications are data transmission and disk scheduling. One significant application of

    queues that is commonly seen in typical web applications is online ticket purchasing. Queues are often used to determine the order in which customers are allowed to purchase tickets; this is applied across various industries, such as airline tickets, concert tickets and limited-edition footwear releases.
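The ticket-purchasing scenario above maps directly onto the three queue operations. A minimal sketch (names are illustrative; `java.util.ArrayDeque` supplies the FIFO behaviour):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of a FIFO ticket-purchase queue.
class TicketQueue {
    private final Queue<String> waiting = new ArrayDeque<>();

    void enqueue(String customer) { waiting.add(customer); } // join at the back
    String peek() { return waiting.peek(); }                 // view the front without removing
    String dequeue() { return waiting.poll(); }              // serve the customer at the front
}
```

Customers are served strictly in arrival order, which is exactly the fairness property ticketing systems rely on.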

    Linked Lists

    Linked lists are a collection of nodes that are linearly linked to each other through pointers (Barnett,

    2008). The first node in the list is referred to as the head. A characteristic feature of linked lists is that

    each node is made out of two (or three for doubly linked lists) components: the data that is stored in the

    node and the memory address(es) of the node(s) that it points to (Barnett, 2008). The node addresses need not be contiguous in memory. The two types of linked lists are singly linked lists and doubly linked lists, the only differentiating factor being that nodes in a singly linked list point only to the next node, whereas nodes in a doubly linked list hold pointers to both the previous node and the next node.

    Fig 8: Visualisation of a Singly Linked List (Techspirited.com, 2018)

    Applications of linked lists in data management can be found in the history section of web browsers and in collision resolution by chaining in hash tables. The history section of web browsers employs doubly linked lists to allow users to traverse through and fetch data of previously visited sites. When a user presses

    the back button, the previous node’s data is returned; similarly, when the forward button is pressed, the

    next node’s data is returned. In hash tables, linked lists are used for resolving collisions when one bucket

    has more than one data point allocated to it. The collision will be resolved by the bucket referencing a

    linked list that contains all the elements that have been assigned to the specific bucket.
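The browser-history use described above can be sketched as a doubly linked list, where back and forward simply follow the `prev` and `next` pointers. This is an illustrative sketch; the class and field names are assumptions.

```java
// Minimal sketch of browser history as a doubly linked list (names illustrative).
class History {
    static class Node {
        String url;
        Node prev, next;   // pointers to the neighbouring nodes
        Node(String url) { this.url = url; }
    }

    private Node current;

    void visit(String url) {                  // append a new node after the current one
        Node n = new Node(url);
        if (current != null) { current.next = n; n.prev = current; }
        current = n;
    }

    String back() {                           // follow the prev pointer (back button)
        if (current != null && current.prev != null) current = current.prev;
        return current == null ? null : current.url;
    }

    String forward() {                        // follow the next pointer (forward button)
        if (current != null && current.next != null) current = current.next;
        return current == null ? null : current.url;
    }
}
```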

    Tree Data Structures

    Heap

    A heap is a simple tree data structure where all the nodes in the tree are arranged in a specific order; the data structure is commonly represented as an array. There are two types of heaps, the max heap and the min heap (Cormen, 1989). Min heaps are typically used for queueing jobs in the CPU; the heap data structure is essential in the implementation of priority queues for operating systems. Max heaps and min heaps follow similar approaches, where each node has at most a left child and a right child. In a max heap, the root of the heap is the first item in the array; a parent node and its children are related by the following rules:

    Parent: A[⌊i/2⌋] ≥ A[i]

    Left child of A[i]: A[2i]

    Right child of A[i]: A[2i + 1]

    Fig 9: Heap data structure (Hackerearth.com, 2018)
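The CPU job-queueing use of a min heap mentioned above can be sketched with `java.util.PriorityQueue`, which is heap-based. The priority scheme (lower number = more urgent) and the names are illustrative assumptions:

```java
import java.util.PriorityQueue;

// Minimal sketch of a CPU job queue built on a binary min heap.
class JobQueue {
    // each entry is {priority, jobId}; the comparator orders by priority
    private final PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));

    void submit(int priority, int jobId) { heap.add(new int[]{priority, jobId}); }

    // poll removes the root of the min heap: the job with the smallest priority value
    int next() { return heap.poll()[1]; }
}
```

Insertion and removal both cost O(log n), which is why heaps are the standard backing structure for priority queues.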

    Another type of tree data structure that should be mentioned is the B-Tree. A B-Tree is an approximately balanced rooted tree with O(log n) height. This data structure is typically used to index external storage by storing multiple keys per node based on some criteria. Data in a B-Tree is stored in the leaf nodes, which makes it efficient for insertion and searching, leading to its primary use for caching objects (Cormen, 1989).

    Hash tables

    A hash table is a special type of data structure that implements a hash function to map keys to actual

    values (Larson, 1988). Hashing can be implemented via the division or multiplication method. The division method assigns keys using a hash function that takes the remainder of dividing the key k by the number of available slots in the table m, i.e. h(k) = k mod m. For an effective hash function, m should be a large prime number so that fewer keys share the same remainder, thus reducing collisions. The best case search time for an element in a hash table is O(1), and the worst case is O(n) (Cormen, 1989). The largest problem that hashing faces is collisions, which occur when two or more keys hash to the same slot. The two approaches to resolving collisions are:

    1.) Chaining

    The chaining method handles collision resolution by putting all of the elements that collide into a linked list (Cormen, 1989). When implemented correctly, the hash function should not assign all of the elements to the same slot; mapping all elements to the same slot causes the hash table to degenerate into a linked list, and in this worst case scenario the search time for the hash table is O(n). Chaining has the advantage that the hash table's capacity is not limited. In general, chaining is preferred over open addressing for this reason.
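The division method and chaining described above can be combined into one small sketch. The table size m = 13 (a prime) and the names are illustrative assumptions:

```java
import java.util.LinkedList;

// Minimal sketch of a hash table using the division method, h(k) = k mod m,
// with collisions resolved by chaining colliding entries in linked lists.
class ChainedHashTable {
    private final int m = 13;                         // prime table size
    @SuppressWarnings("unchecked")
    private final LinkedList<int[]>[] buckets = new LinkedList[m];

    private int hash(int key) { return Math.floorMod(key, m); }  // division method

    void put(int key, int value) {
        int i = hash(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (int[] e : buckets[i]) if (e[0] == key) { e[1] = value; return; }
        buckets[i].add(new int[]{key, value});        // colliding keys share one chain
    }

    Integer get(int key) {
        LinkedList<int[]> chain = buckets[hash(key)];
        if (chain != null) for (int[] e : chain) if (e[0] == key) return e[1];
        return null;                                  // key not present
    }
}
```

Keys 7 and 20 both hash to slot 7 (20 mod 13 = 7), so they end up in the same chain yet remain individually retrievable.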

    2.) Open Addressing

    Open addressing deals with collisions by continuously searching the array, incrementing the index until a free slot is found (Cormen, 1989). This searching method is called probing. Probing can be linear, quadratic, or based on double hashing. The advantage of using open addressing is that no additional data structures are required; however, an inefficient hash function increases the possibility of keys clustering, which subsequently increases the required search time.

    Hashing has many imperative implementations in application programming as it can be used to protect

    or verify information. The most universal example of hashing is in password storage. When

    programming an application, password storage is essential in allowing users access to their account.

    However, a password cannot simply be stored as the string that was input. Instead, once the user chooses a password, the string is hashed and the hash is stored in the system to prevent security vulnerabilities. When the user logs in, the entered password is hashed again and compared against the stored hash.

    4. Sorting Algorithms

    Sorting algorithms are another essential part of application development and data management, as they are commonly used in the processing of data. In DML, the statement that invokes a sorting mechanism is ORDER BY. The efficiency of sorting algorithms is an important aspect of data management, as choosing the best sort is imperative when sorting extremely large datasets. The efficiency of sorting algorithms can be evaluated using asymptotic notation, and performance can be represented graphically by a time complexity comparison graph to see how a sort fares with larger sets of data. The worst case asymptotic notation is typically used as the basis for efficiency comparison. Sorting algorithms can be categorised into two major groups: naïve and efficient algorithms.

    Naïve Sorting Algorithms

    Naïve sorting algorithms encompass Bubble Sort, Selection Sort and Insertion Sort. These algorithms are considered naïve as they sort each element by searching for its position amongst the other sorted elements (Wirth, 1986). The distinguishing differences between the three sorts are as follows:

    Bubble sort compares neighbouring items in the array and swaps them when A[i] < A[i-1].

    Selection sort finds the smallest value in the unsorted part of the array and swaps it with the item in the first unsorted position.

    Insertion sort takes each element from the array and inserts it into its correct position among the already-sorted elements.

    All three naïve sorting algorithms exhibit quadratic worst-case behaviour: Bubble sort, Selection sort and Insertion sort are all O(n²) in the worst case.

    For examples of the implementation of the sorts in Java, please refer to Appendix [3.1] for Bubble sort,

    Appendix [3.2] for Selection sort and Appendix [3.3] for Insertion sort.
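As an illustration of the naïve approach, insertion sort can be sketched in a few lines (a minimal in-place sketch, separate from the appendix implementations):

```java
// Minimal sketch of insertion sort: grow a sorted prefix in place by inserting
// each element into its correct position among the already-sorted elements.
class InsertionSort {
    static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {  // shift larger sorted elements right
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;                 // insert into the sorted prefix
        }
    }
}
```

The nested loops make the quadratic worst case visible: on a reverse-sorted array every element is shifted past the whole sorted prefix.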

  • Fig 10: Time complexity comparison for Naïve sorting algorithms using the same dataset.

    Fig 10 shows that out of the three sorts, insertion sort is the best performer, followed by selection sort, then bubble sort. All three sorts exhibit quadratic behaviour, in line with their worst-case performance. In bubble sort, there is not much evident difference between the worst case and the best case. Insertion sort exhibits the most variation between the best case and the worst case, as its performance is highly dependent on whether the array is already partially sorted. Bubble sort performs the worst because it performs the same number of comparisons on every unsorted value, regardless of whether the array is somewhat pre-sorted, as it does not take into account the order of the remaining items. Insertion sort performs the best here as it divides the array into sorted and unsorted elements, therefore making comparisons only on the sorted values.

    These naïve sorts are rarely implemented in real-life applications, as more efficient sorts have replaced them. However, they served as the foundation upon which these efficient algorithms were developed. For the Java implementations of Bubble sort, Selection sort and Insertion sort used to obtain the data for Fig 10, please refer to Appendix [3.4], Appendix [3.5] and Appendix [3.6] respectively.

    Efficient Sorting Algorithms

    Many efficient sorting algorithms solve problems by recursion. A recursive function is defined as having a base case, and a recursive case that will eventually resolve to the base case when called with smaller arguments. The recursive algorithm works in three stages:

    1.) Divide the problem into smaller sub-problems

    2.) Solve the sub-problems through recursion, if the problem is small enough, return a value

    3.) Combine the solutions to the sub-problems

    This approach is called the divide and conquer principle and is applied in efficient sorting algorithms

    such as Merge sort and Quick sort.

    Recursive Algorithm Efficiency Calculation

    The efficiency of a recursive algorithm can be evaluated using two methods: the recursion tree and the master theorem.

    The recursion tree method represents the recurrence as a tree with nodes that represent the sub-problem cost. The overall efficiency of the recursive algorithm is then calculated by aggregating the costs at all levels. The recursion tree is a highly effective method for visualising how a recursive algorithm works; however, the limitation of this method is that other methods, such as substitution, must be used to verify its solutions.

The Master theorem evaluates the efficiency of recursive sorting algorithms using three evaluation cases (see Appendix [4.1]). The solution is determined by the larger of f(n) and n^(log_b a), where a and b are constants that satisfy the conditions a ≥ 1, b > 1, and f(n) > 0 (Cormen, 1989). The limitation of the master theorem is that it does not cover all cases, but its intuitive reasoning makes it an easy method for evaluating algorithm efficiency.

Examples of using the master theorem to find the asymptotic notation for recursive algorithms are as follows (Cormen, 1989):

Case 1 example: T(n) = 16T(n/4) + n ⇒ T(n) = Θ(n²)

Case 2 example: T(n) = 4T(n/2) + n² ⇒ T(n) = Θ(n² log n)

Case 3 example: T(n) = T(n/2) + 2^n ⇒ T(n) = Θ(2^n)

Unsolvable example: T(n) = 0.5T(n/2) + 1/n ⇒ does not apply (a < 1)

    Merge Sort

Merge sort is a divide and conquer algorithm that divides the unsorted array into n sub-arrays of one element each (Knuth, 1998). The sub-arrays are then repeatedly merged to produce new sorted sub-arrays until only one sorted array remains. Merge sort is extremely consistent, with the worst case, average case and best case all equal to O(n log n). However, the drawback of this sort is that it requires O(n) additional memory in order to duplicate the elements that must be sorted as sub-arrays. As it is an out-of-place sort, the memory required grows with the dataset, which can lead to memory allocation issues for large datasets.

For an example implementation of Merge sort, please refer to Appendix [4.2].

    Quick Sort

As an alternative that combats Merge sort's memory allocation disadvantage, Quick sort can be implemented. Quick sort is also a divide and conquer sorting algorithm: it divides an array into smaller sub-arrays around a partition value and recursively sorts those sub-arrays (Hoare, 1961). It can be more efficient than Merge sort when correctly implemented; the choice of partition value is key to the efficiency of the Quick sort algorithm. Quick sort has a best case and average case of O(n log n), whilst its worst case is O(n²). The advantage of Quick sort is that it only uses O(log n) additional memory, so it can efficiently sort large datasets without causing memory allocation issues.

Due to Quick sort's advantages, it is the algorithm of choice for many practical applications. For example, Java's primary system sort for primitive arrays, the Arrays.sort() method, uses a tuned Quick sort variant (a dual-pivot Quick sort since Java 7).
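A brief usage example of the built-in system sort (illustrative values only, added here for clarity):

```java
import java.util.Arrays;

// Using Java's built-in system sort on a primitive array.
public class SystemSortDemo {
    public static void main(String[] args) {
        int[] values = {42, 7, 19, 3, 88};
        Arrays.sort(values);                         // tuned quicksort for primitives
        System.out.println(Arrays.toString(values)); // prints [3, 7, 19, 42, 88]
    }
}
```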

    For an example implementation of Quick sort, please refer to Appendix [4.3]

Fig 11: Time complexity comparisons for efficient sorting algorithms using the same dataset.

Fig 11 shows that Merge sort is the more stable algorithm in terms of running time. Many of Quick sort's data points still lie in the same range as Merge sort's, meaning the two sorts are generally comparable in efficiency. However, stability of running time is not the only factor to consider, and a limitation of the time-complexity comparison graph is that it does not show the memory allocation used. The trade-off in choosing Quick sort over Merge sort is more efficient memory usage at the cost of guaranteed worst-case performance. Therefore, when memory allocation and practicality are factored in, Quick sort is typically the better-performing sort.

    For the Java implementation of Merge sort and Quick sort on a large dataset that was used to obtain

    data for Fig 11, please refer to Appendix [4.2] and Appendix [4.3].

    For a table that compares the worst case, average case and best case time complexity comparison for

    naïve sorting algorithms and efficient sorting algorithms, please refer to Appendix [4.4].

    5. Conclusions

    The methods detailed in the report provide a basis for the foundations of data management and

    application development and should be studied extensively before implementation. The factors that

    should be considered before any DBMS implementation are the requirements analysis, conceptual

    schema, physical schema and external schema. Once a database is developed, the data in the database

can be accessed through different views using SQL queries. Queries can be written in different formats with varying efficiency; it is therefore essential that best practices be studied.

Apart from databases, data structures also play a crucial role in data management and application development, as they determine how data is stored. Sorting algorithms must likewise be evaluated by their performance efficiency on large datasets. Ultimately, more sophisticated methods can be applied to this topic; however, it is imperative that all data engineers understand these fundamental concepts in order to establish a solid foundation for future research.

Worst case: Quick sort – O(n²); Merge sort – O(n log n)

  • References

    [1] Mahleko, B. (2018). “MMM010-340163 Data Management for Graduate Students – Lecture 02”.

    Jacobs University. pp 20.

[2] Abidin, Siti & Ahmad, Suzana & M S Yafooz, Wael. (2010). A new system architecture for flexible

database conversion. WSEAS Transactions on Computers. 9.

    [3] Chilson, D., & Kudlac, M. (1983). Database design: a survey of logical and physical design techniques.

    ACM SIGMIS Database, 15(1), pp.13

    [4] Chen, P. (1976). The entity-relationship model—toward a unified view of data. ACM Transactions on

    Database Systems (TODS), 1(1),

    [5] Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems (pp. 3-50). 3rd edition. New

    York: McGraw-Hill.

    [6] Brodie, M. & Schmidt, J. (1975) ANSI/X3/SPARC Study Group on Data Base Management

    Systems. Interim Report. FDT, ACM SIGMOD bulletin. Volume 7, No. 2

    [7] Barnett, G. & Del Tongo, L. (2008). Data Structures and Algorithms: Annotated Reference with

    Examples. First Edition Copyright.

    [8] Shaffer, C. (2009). A Practical Introduction to Data Structures and Algorithm Analysis Third Edition

    (Java). Department of Computer Science. Virginia Tech Blacksburg, VA 24061.

    [9] Wirth, Niklaus (1986), Algorithms & Data Structures, Upper Saddle River, NJ: Prentice-Hall, pp. 76–

    77, ISBN 0130220051

    [10] Shagufta, P. & Chandra, U. & Wani, A. (2017). A Literature Review on Evolving Database.

    International Journal of Computer Applications (0975 – 8887). Volume 162, No 9

    [11] Knuth, D. (1998). "Section 5.2.4: Sorting by Merging". Sorting and Searching. The Art of Computer Programming. 3 (2nd ed.). Addison-Wesley. pp. 158–168. ISBN 0-201-89685-0.

    [12] Hoare, C. A. R. (1961). "Algorithm 64: Quicksort". Comm. ACM. 4 (7): 321. doi:10.1145/366622.366644.

    [13] Larson, P. (1988). Dynamic Hash Tables. Commun. ACM, 31, 446-457.

    [14] Cormen, T & Leiserson, C. & Rivest, R. & Stein, C. (1989). Introduction to Algorithms Third Edition.

    pp 151 – 484

    [15] Heaps/Priority Queues Tutorials & Notes | Data Structures. (2018). Retrieved from

    https://www.hackerearth.com/practice/data-structures/trees/heapspriority-queues/tutorial. Accessed

    on 22 November 2018

    [16] Types of Data Structures in Computer Science and Their Applications. (2018).

    https://techspirited.com/types-of-data-structures-in-computer-science-their-applications. Accessed on

    22 November 2018


  • APPENDIX

    Appendix [1.1]

    CREATE TABLE employees (eid INT PRIMARY KEY NOT NULL auto_increment,

    ename VARCHAR(128),

    email VARCHAR(128),

    birthdate DATE,

    salary FLOAT(25));

    CREATE TABLE departments (did INT PRIMARY KEY NOT NULL auto_increment,

    dname VARCHAR(128));

    CREATE TABLE dependents (eid INT NOT NULL,

    dependent_name VARCHAR(256));

    CREATE TABLE manages (eid INT NOT NULL,

    did INT NOT NULL,

    start_date DATE,

    end_date DATE,

    PRIMARY KEY(did),

    FOREIGN KEY(eid) REFERENCES employees(eid),

    FOREIGN KEY(did) REFERENCES departments(did));

CREATE TABLE works (eid INT NOT NULL,

did INT NOT NULL,

start_date DATE,

end_date DATE,

FOREIGN KEY(eid) REFERENCES employees(eid),

FOREIGN KEY(did) REFERENCES departments(did));

    CREATE TABLE policy (eid INT,

    pid VARCHAR(128),

    FOREIGN KEY(eid) REFERENCES employees(eid));

  • Appendix [1.2]

    Consider the following relational schema. An employee can work in more than one department; the

    pct time field of the Works relation shows the percentage of time that a given employee works in a

    given department.

Emp(eid: integer, ename: string, age: integer, salary: real)

Works(eid: integer, did: integer, pcttime: integer)

Dept(did: integer, dname: string, budget: real, managerid: integer)

    Create a database based on the above schema.

    SHOW DATABASES;

    CREATE DATABASE employees_new;

    SHOW DATABASES;

    USE employees_new;

    CREATE TABLE emp (eid INT PRIMARY KEY NOT NULL auto_increment,

    ename VARCHAR(128),

    age INT,

    salary FLOAT(25));

    CREATE TABLE dept (did INT PRIMARY KEY NOT NULL auto_increment,

    dname VARCHAR(128),

    budget FLOAT(25),

    managerid INT,

    FOREIGN KEY(managerid) REFERENCES emp(eid));

    CREATE TABLE works (eid INT,

    did INT,

    pcctime INT,

    FOREIGN KEY(eid) REFERENCES emp(eid),

    FOREIGN KEY(did) REFERENCES dept(did));

    SHOW TABLES;

  • DESC emp;

    +-----------+------------------+-------+------+----------+------------------------+

    | Field | Type | Null | Key | Default | Extra |

    +-----------+------------------+-------+------+----------+------------------------+

    | eid | int(11) | NO | PRI | NULL | auto_increment |

    | ename | varchar(128) | YES | | NULL | |

    | age | int(11) | YES | | NULL | |

    | salary | double | YES | | NULL | |

    +-----------+------------------+-------+------+-----------+-----------------------+

    INSERT INTO emp (ename,age,salary) VALUES

    ("Mimi",23,250000),("Akeem",35,21816),("Alexis",58,17439),("Jin",35,26836),("Clare",61,42221786),("El

    eanor",27,5758651),("Murphy",65,232610),("Shad",61,1580),("Tobias",46,454323.50),("Randall",40,422

    71.21),("Gray",31,12368.60);

    SELECT * FROM emp;

    +------+-----------+------+--------------+

    | eid | ename | age | salary |

    +------+-----------+------+--------------+

    | 1 | Mimi | 23 | 250000 |

    | 2 | Akeem | 35 | 21816 |

    | 3 | Alexis | 58 | 17439 |

    | 4 | Jin | 35 | 26836 |

    | 5 | Clare | 61 | 42221786 |

    | 6 | Eleanor | 27 | 5758651 |

    | 7 | Murphy | 65 | 232610 |

    | 8 | Shad | 61 | 1580 |

    | 9 | Tobias | 46 | 454323.5 |

    | 10 | Randall | 40 | 42271.21 |

    | 11 | Gray | 31 | 12368.6 |

    +------+-----------+------+--------------+

    DESC dept;

  • +----------------+-----------------+--------+-------+-----------+--------------------+

    | Field | Type | Null | Key | Default | Extra |

    +----------------+-----------------+--------+-------+-----------+---------------------+

    | did | int(11) | NO | PRI | NULL | auto_increment |

    | dname | varchar(128) | YES | | NULL | |

    | budget | double | YES | | NULL | |

    | managerid | int(11) | YES | MUL | NULL | |

    +---------------+------------------+-------+--------+-----------+----------------------+

    INSERT INTO dept (dname,budget,managerid) VALUES ("Software",60000,1),("Hardware",

    10000000,4),("HR",5000,7),("Marketing",70000,2);

    SELECT * FROM dept;

    +-----+---------------+--------------+----------------+

    | did | dname | budget | managerid |

    +-----+---------------+--------------+----------------+

    | 1 | Software | 60000 | 1 |

    | 2 | Hardware | 10000000 | 4 |

    | 3 | HR | 5000 | 7 |

    | 4 | Marketing | 70000 | 2 |

    +-----+---------------+--------------+----------------+

  • DESC works;

    +------------+----------+-------+--------+-----------+---------+

    | Field | Type | Null | Key | Default | Extra |

    +------------+----------+-------+--------+-----------+---------+

    | eid | int(11) | YES | MUL | NULL | |

    | did | int(11) | YES | MUL | NULL | |

    | pcctime | int(11) | YES | | NULL | |

    +------------+----------+-------+-------+-----------+----------+

    INSERT INTO works (eid,did,pcctime) VALUES

    (1,1,100),(2,1,50),(2,2,50),(3,4,100),(4,3,90),(4,4,10),(5,1,75),(5,2,25),(6,3,100),(7,4,100),(8,2,60),(8,1,10)

    ,(8,3,30),(9,1,25),(9,2,25),(9,3,25),(9,4,25),(10,4,100),(11,2,100);

    SELECT * FROM works;

    +------+------+---------+

    | eid | did | pcctime |

    +------+------+---------+

    | 1 | 1 | 100 |

    | 2 | 1 | 50 |

    | 2 | 2 | 50 |

    | 3 | 4 | 100 |

    | 4 | 3 | 90 |

    | 4 | 4 | 10 |

    | 5 | 1 | 75 |

    | 5 | 2 | 25 |

    | 6 | 3 | 100 |

    | 7 | 4 | 100 |

    | 8 | 2 | 60 |

    | 8 | 1 | 10 |

    | 8 | 3 | 30 |

  • | 9 | 1 | 25 |

    | 9 | 2 | 25 |

    | 9 | 3 | 25 |

    | 9 | 4 | 25 |

    | 10 | 4 | 100 |

    | 11 | 2 | 100 |

    +------+------+---------+

    Write the following queries in SQL:

    a. Print the names and ages of each employee who works in both the Hardware department and

    the Software department

    SELECT e.ename, e.age FROM emp e WHERE e.eid in (SELECT eid FROM works WHERE did = 1) AND e.eid

    in (SELECT eid FROM works WHERE did = 2);

    +-----------+------+

    | ename | age |

    +-----------+------+

    | Akeem | 35 |

    | Clare | 61 |

    | Shad | 61 |

    | Tobias | 46 |

    +-----------+-------+

    b. Find the managerids of managers who manage only departments with budgets greater than

    $1 million

SELECT managerid FROM dept GROUP BY managerid HAVING MIN(budget) > 1000000;

    +-----------+

    | managerid |

    +-----------+

    | 4 |

    +-----------+

  • c. Find the enames of managers who manage the departments with the largest budgets.

    SELECT ename FROM emp WHERE eid = (SELECT managerid FROM dept WHERE budget = (SELECT

    max(budget) FROM dept));

    +-------+

    | ename |

    +-------+

    | Jin |

    +-------+

  • Appendix [1.3]

1. Find the number of employees hired in the year 2000.

SELECT COUNT(*) FROM employees WHERE YEAR(hire_date) = 2000;

    +----------+

    | COUNT(*) |

    +----------+

    | 13 |

    +----------+

    2. Find the average age (in years) of employees who were hired in the year 2000

    SELECT AVG(TIMESTAMPDIFF(year,birth_date,curdate())) AS avg FROM employees e WHERE YEAR(hire_date) = "2000";

    +-----------------+

    | avg |

    +------------------+

    | 60.7692 |

    +-------------------+

3. Create a table called millennial_hires consisting of the following fields:

a. id (auto increment, unsigned int(6), not null, primary key)

b. first_name (varchar(30))

c. dob (date)

Describe the table you just created and validate that the description matches the specification.

CREATE TABLE millennial_hires (id INT(6) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,

first_name VARCHAR(30),

dob DATE);

    DESC millennial_hires;

    +------------+---------------------+-------+------+-----------+--------------------+

    | Field | Type | Null | Key | Default | Extra |

    +------------+---------------------+-------+------+-----------+--------------------+

    | id | int(6) unsigned | NO | PRI | NULL | auto_increment |

  • | first_name | varchar(30) | YES | | NULL | |

    | dob | date | YES | | NULL | |

    +---------------+-----------------+-------+------+-----------+------------------+

4. Insert the first name and birth date of all the people hired in the year 2000 into the table created in the last task. Add your details to the table. Print out all the values.

INSERT INTO millennial_hires(first_name, dob)

(SELECT first_name, birth_date FROM employees WHERE YEAR(hire_date) = 2000);

INSERT INTO millennial_hires(first_name, dob) VALUES ("Mimi", "1995-02-07");

    SELECT * FROM millennial_hires;

    +----+-------------+----------------+

    | id | first_name | dob |

    +----+-------------+----------------+

    | 1 | Ulf | 1960-09-09 |

    | 2 | Seshu | 1964-04-21 |

    | 3 | Randi | 1953-02-09 |

    | 4 | Mariangiola | 1955-04-14 |

    | 5 | Ennio | 1960-09-12 |

    | 6 | Volkmar | 1959-08-07 |

    | 7 | Xuejun | 1958-06-10 |

    | 8 | Shahab | 1954-11-17 |

    | 9 | Jaana | 1953-04-09 |

    | 10 | Jeong | 1953-04-27 |

    | 11 | Yucai | 1957-05-09 |

    | 12 | Bikash | 1964-06-12 |

    | 13 | Hideyuki | 1954-05-06 |

    | 16 | Mimi | 1995-02-07 |

    +----+---------------+----------------+

    SELECT COUNT(*) FROM millennial_hires;

  • +----------+

    | COUNT(*) |

    +----------+

    | 14 |

    +----------+

5. From the table, delete the entries of employees who were born in or after the year 1960. Find the number of records in the table after deletion.

DELETE FROM millennial_hires WHERE YEAR(dob) >= 1960;

SELECT COUNT(*) FROM millennial_hires;

    +----------+

    | COUNT(*) |

    +----------+

    | 9 |

    +----------+

6. Add a new column called birth_year to the table. Put the birth year of each person as the values in this column and delete the dob column. Print the resulting table.

ALTER TABLE millennial_hires ADD COLUMN birth_year INT(4);

UPDATE millennial_hires SET birth_year = YEAR(dob);

ALTER TABLE millennial_hires DROP COLUMN dob;

    SELECT * FROM millennial_hires;

    +----+-----------------+------------+

    | id | first_name | birth_year |

    +----+-----------------+------------+

    | 3 | Randi | 1953 |

    | 4 | Mariangiola | 1955 |

    | 6 | Volkmar | 1959 |

    | 7 | Xuejun | 1958 |

    | 8 | Shahab | 1954 |

    | 9 | Jaana | 1953 |

    | 10| Jeong | 1953 |

    | 11| Yucai | 1957 |

    | 13| Hideyuki | 1954 |

    +----+----------------+------------+

  • Appendix 2.1

    1. Print all the tables in the database.

    mysql> SHOW TABLES;

    +--------------------------------+

    | Tables_in_employees |

    +--------------------------------+

    | current_dept_emp |

    | departments |

    | dept_emp |

    | dept_emp_latest_date |

    | dept_manager |

    | employees |

    | salaries |

    | titles |

    +--------------------------------+

    8 rows in set (0.00 sec)

    2. Find and understand the schemas of all the tables.

    mysql> DESC current_dept_emp; DESC departments; DESC dept_emp; DESC dept_emp_latest_date;

    DESC dept_manager; DESC employees; DESC salaries; DESC titles;

    +----------------+---------+------+-----+------------+-------+

    | Field | Type | Null | Key | Default | Extra |

    +----------------+---------+------+-----+-------------+-------+

    | emp_no | int(11) | NO | | NULL | |

    | dept_no | char(4) | NO | | NULL | |

    | from_date | date | YES | | NULL | |

    | to_date | date | YES | | NULL | |

    +----------------+-----------+------+----+----------+-------+

    4 rows in set (0.00 sec)

  • +----------------+----------------+-------+------+-----------+-------+

    | Field | Type | Null | Key | Default | Extra |

    +----------------+-----------------+------+------+------------+-------+

    | dept_no | char(4) | NO | PRI | NULL | |

    | dept_name | varchar(40) | NO | UNI | NULL | |

    +----------------+-----------------+-------+-------+----------+-------+

    2 rows in set (0.00 sec)

    +---------------+-----------+-------+-----+-----------+-----------+

    | Field | Type | Null | Key | Default | Extra |

    +---------------+-----------+-------+------+----------+-----------+

    | emp_no | int(11) | NO | PRI | NULL | |

    | dept_no | char(4) | NO | PRI | NULL | |

    | from_date | date | NO | | NULL | |

    | to_date | date | NO | | NULL | |

    +----------------+-----------+-------+-----+----------+-----------+

    4 rows in set (0.00 sec)

    +---------------+----------+-------+-----+----------+----------+

    | Field | Type | Null | Key | Default | Extra |

    +---------------+----------+-------+-----+----------+----------+

    | emp_no | int(11) | NO | | NULL | |

    | from_date | date | YES | | NULL | |

    | to_date | date | YES | | NULL | |

    +----------------+---------+-------+-----+----------+----------+

    3 rows in set (0.00 sec)

  • +----------------+----------+-------+------+-----------+--------+

    | Field | Type | Null | Key | Default | Extra |

    +---------------+-----------+-------+------+-----------+--------+

    | emp_no | int(11) | NO | PRI | NULL | |

    | dept_no | char(4) | NO | PRI | NULL | |

    | from_date | date | NO | | NULL | |

    | to_date | date | NO | | NULL | |

    +---------------+-----------+-------+------+----------+--------+

    4 rows in set (0.00 sec)

    +---------------+-------------------+--------+-----+-----------+---------+

    | Field | Type | Null | Key | Default | Extra |

    +---------------+-------------------+--------+-----+-----------+---------+

    | emp_no | int(11) | NO | PRI | NULL | |

    | birth_date | date | NO | | NULL | |

    | first_name | varchar(14) | NO | | NULL | |

    | last_name | varchar(16) | NO | | NULL | |

    | gender | enum('M','F') | NO | | NULL | |

    | hire_date | date | NO | | NULL | |

    +---------------+-------------------+-------+-----+-----------+----------+

    6 rows in set (0.00 sec)

    +----------------+---------+--------+-----+------------+-------+

    | Field | Type | Null | Key | Default | Extra |

    +----------------+---------+--------+-----+------------+-------+

    | emp_no | int(11) | NO | PRI | NULL | |

    | salary | int(11) | NO | | NULL | |

    | from_date | date | NO | PRI | NULL | |

  • | to_date | date | NO | | NULL | |

    +-------------+---------+--------+-----+------------+-------+

    4 rows in set (0.00 sec)

    +-----------+-------------+------+------+-----------+-------+

    | Field | Type | Null | Key | Default | Extra |

    +-----------+-------------+------+------+-----------+-------+

    | emp_no | int(11) | NO | PRI | NULL | |

    | title | varchar(50) | NO | PRI | NULL | |

    | from_date | date | NO | PRI | NULL | |

    | to_date | date | YES | | NULL | |

    +-----------+-------------+------+------+-----------+-------+

    4 rows in set (0.00 sec)

    3. Find the number of employees in the database.

    mysql> SELECT COUNT(*) FROM employees;

    +----------+

    | COUNT(*) |

    +----------+

    | 300024 |

    +----------+

    1 row in set (0.19 sec)

    4. List all the departments and their number.

    mysql> SELECT DISTINCT dept_name, dept_no FROM departments ORDER BY 2;

    +--------------------+---------+

    | dept_name | dept_no |

    +--------------------+---------+

    | Marketing | d001 |

    | Finance | d002 |

    | Human Resources | d003 |

  • | Production | d004 |

    | Development | d005 |

    | Quality Management | d006 |

    | Sales | d007 |

    | Research | d008 |

    | Customer Service | d009 |

    +--------------------+---------+

    9 rows in set (0.01 sec)

    5. Find the number of female employees.

    mysql> SELECT COUNT(*) FROM employees e where e.gender LIKE 'F';

    +----------+

    | COUNT(*) |

    +----------+

    | 120051 |

    +----------+

    1 row in set (0.16 sec)

6. Print the maximum and the minimum salary.

    mysql> SELECT MAX(salary),MIN(salary) FROM salaries;

    +-------------+-------------+

    | MAX(salary) | MIN(salary) |

    +-------------+-------------+

    | 158220 | 38623 |

    +-------------+-------------+

    1 row in set (1.44 sec)

    7. Print the department number and the corresponding number of employees who have ever worked

    there.

    mysql> SELECT d.dept_no, COUNT(d.emp_no) FROM dept_emp d GROUP BY 1 ORDER BY 1;

    +---------+-----------------+

    | dept_no | COUNT(d.emp_no) |

  • +---------+-----------------+

    | d001 | 20211 |

    | d002 | 17346 |

    | d003 | 17786 |

    | d004 | 73485 |

    | d005 | 85707 |

    | d006 | 20117 |

    | d007 | 52245 |

    | d008 | 21126 |

    | d009 | 23580 |

    +---------+-----------------+

    9 rows in set (2.94 sec)

    8. Print the department name and the corresponding number of employees who have ever worked

    there.

    mysql> SELECT dp.dept_name, COUNT(d.emp_no) FROM dept_emp d, departments dp WHERE

    d.dept_no = dp.dept_no GROUP BY 1 ORDER BY 1;

    +-------------------------+-----------------+

    | dept_name | COUNT(d.emp_no) |

    +--------------------------+-----------------+

    | Customer Service | 23580 |

    | Development | 85707 |

    | Finance | 17346 |

    | Human Resources | 17786 |

    | Marketing | 20211 |

    | Production | 73485 |

    | Quality Management | 20117 |

    | Research | 21126 |

    | Sales | 52245 |

    +-------------------------+-----------------+

  • 9 rows in set (0.63 sec)

    9. Print the department names and their corresponding average salaries.

    mysql> SELECT dp.dept_name, AVG(s.salary) FROM dept_emp d, departments dp, salaries s WHERE

    d.dept_no = dp.dept_no AND s.emp_no = d.emp_no GROUP BY 1 ORDER BY 1;

    +-------------------------+--------------------+

    | dept_name | AVG(s.salary) |

    +-------------------------+-------------------+

    | Customer Service | 58770.3665 |

    | Development | 59478.9012 |

    | Finance | 70489.3649 |

    | Human Resources | 55574.8794 |

    | Marketing | 71913.2000 |

    | Production | 59605.4825 |

    | Quality Management | 57251.2719 |

    | Research | 59665.1817 |

    | Sales | 80667.6058 |

    +------------------------+-------------------+

    9 rows in set (11.60 sec)

    10. Print the employee name, employee id and the maximum salary earned by him or her. Only report

    for the employees with the top 5 highest salaries.

    mysql> SELECT CONCAT(e.first_name," ",e.last_name) AS full_name, e.emp_no, MAX(s.salary) FROM

    employees e, salaries s WHERE e.emp_no = s.emp_no GROUP BY 2 ORDER BY 3 DESC LIMIT 5;

    +-------------------------+-----------+---------------+

    | full_name | emp_no | MAX(s.salary) |

    +-------------------------+-----------+---------------+

    | Tokuyasu Pesch | 43624 | 158220 |

    | Honesty Mukaidono | 254466 | 156286 |

    | Xiahua Whitcomb | 47978 | 155709 |

    | Sanjai Luders | 253939 | 155513 |

  • | Tsutomu Alameldin | 109334 | 155377 |

    +---------------------------+------------+---------------+

    5 rows in set (4.18 sec)

  • Appendix [2.2] Consider the following schema:

    Suppliers(sid: integer, sname: string, address: string) Parts(pid: integer, pname: string, color: string)

    Catalog(sid: integer, pid: integer, cost: real)

    The Catalog relation lists the prices charged for parts by Suppliers. Create a database based on the

    above schema.

    CREATE TABLE suppliers (sid INT PRIMARY KEY NOT NULL AUTO_INCREMENT, sname VARCHAR(128),

    address VARCHAR(256));

    CREATE TABLE parts(pid INT PRIMARY KEY NOT NULL AUTO_INCREMENT, pname VARCHAR(128), colour

    VARCHAR(56));

    CREATE TABLE catalog(sid INT,pid INT,cost FLOAT(25),FOREIGN KEY(sid) REFERENCES

    suppliers(sid),FOREIGN KEY(pid) REFERENCES parts(pid));

    INSERT INTO suppliers (sname,address) VALUES ("Martena","P.O. Box 872, 8417 Tellus.

    St."),("Tatum","P.O. Box 216, 7552 Lacus, St."),("Gillian","4443 Donec Rd."),("Dylan","266-4108 Eu,

    St."),("Austin","1114 Imperdiet St."),("Alice","P.O. Box 926, 5519 Feugiat. Avenue"),("Irma","P.O. Box

    902, 5166 Pulvinar Rd."),("Gloria","P.O. Box 518, 5152 Tortor Av."),("Russell","Ap #428-7669 Sed

    St."),("Amanda","1868 Orci. Ave");

    INSERT INTO parts(pname,colour) VALUES ("Q6M-1V2","green"),("A8B-1P0","green"),("D7B-

    6D3","blue"),("D1N-2E4","violet"),("F6B-8L7","orange"),("N2O-7V1","indigo"),("F8T-

    1V3","indigo"),("T6F-0R9","indigo"),("T1Y-9M4","indigo"),("R0N-6N2","red");

    INSERT INTO catalog (sid,pid,cost) VALUES

    (7,7,"9220.76"),(2,7,"9833.77"),(1,9,"6641.91"),(8,10,"2505.70"),(4,5,"1601.36"),(3,10,"3887.97"),(10,9,"

    2324.85"),(6,3,"6497.71"),(4,10,"2088.35"),(3,9,"9718.23"),

    (5,5,"2442.83"),(6,6,"204.87"),(1,4,"716.85"),(8,10,"5489.20"),(7,1,"148.81"),(2,5,"713.07"),(5,6,"4232.0

    9");

    Write the following queries in SQL:

    a.) Find the pnames of parts for which there is some supplier

SELECT pname FROM parts WHERE pid IN (SELECT pid FROM catalog WHERE sid IS NOT NULL);

    +--------------+

    | pname |

    +--------------+

    | Q6M-1V2 |

    | D7B-6D3 |

    | D1N-2E4 |

  • | F6B-8L7 |

    | N2O-7V1 |

    | F8T-1V3 |

    | T1Y-9M4 |

    | R0N-6N2 |

    +-------------+

b.) Find the snames of suppliers who supply every red part.

SELECT sname FROM suppliers s WHERE NOT EXISTS (SELECT * FROM parts p WHERE p.colour LIKE 'red'

AND NOT EXISTS (SELECT * FROM catalog c WHERE c.sid = s.sid AND c.pid = p.pid));

    +------------+

    | sname |

    +------------+

    | Gillian |

    | Dylan |

    | Gloria |

    +------------+

    c.) Find the sids of suppliers who supply only red parts.

    SELECT DISTINCT sid FROM (SELECT sid,colour from catalog c, parts p where c.pid = p.pid) t1 WHERE NOT

    EXISTS (SELECT * FROM (SELECT sid, colour from catalog c, parts p where c.pid = p.pid) t2 where t1.sid =

    t2.sid and t2.colour!='red');

    +------+

    | sid |

    +------+

    | 8 |

    +------+

    d.) Find the sids of suppliers who supply a red part and a green part.

    SELECT sid FROM (SELECT sid,colour FROM catalog c, parts p WHERE p.pid = c.pid AND colour LIKE 'red')

    t1 WHERE sid in (SELECT sid FROM catalog c, parts p WHERE p.pid = c.pid AND colour LIKE 'green');

  • Empty set (0.00 sec)

    e.) Find the sids of suppliers who supply a red part or a green part.

    SELECT DISTINCT sid FROM (SELECT sid,colour FROM catalog c, parts p WHERE p.pid = c.pid AND (colour

    LIKE 'red' OR colour LIKE 'green')) t1;

    +------+

    | sid |

    +------+

    | 7 |

    | 8 |

    | 3 |

    | 4 |

    +------+

  • Appendix [3.1]

    Bubble Sort Using Java
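The original listing for this appendix is not present in the extracted text; as a stand-in, a minimal Bubble sort sketch in the same spirit (class and method names are illustrative, not the original code):

```java
// Bubble sort: repeatedly swap adjacent out-of-order pairs until sorted.
public class BubbleSort {
    static void sort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            for (int j = 0; j < a.length - 1 - i; j++) {
                if (a[j] > a[j + 1]) {          // swap the out-of-order pair
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 4, 2, 8};
        sort(data);
        System.out.println(java.util.Arrays.toString(data)); // prints [1, 2, 4, 5, 8]
    }
}
```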

    Appendix [3.2]

    Selection Sort Using Java
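The original listing for this appendix is not present in the extracted text; as a stand-in, a minimal Selection sort sketch in the same spirit (class and method names are illustrative, not the original code):

```java
// Selection sort: repeatedly select the smallest unsorted value
// and move it to the front of the unsorted portion.
public class SelectionSort {
    static void sort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int min = i;                         // index of smallest unsorted value
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) min = j;
            }
            int tmp = a[i];                      // move it into its final place
            a[i] = a[min];
            a[min] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] data = {64, 25, 12, 22, 11};
        sort(data);
        System.out.println(java.util.Arrays.toString(data)); // prints [11, 12, 22, 25, 64]
    }
}
```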

    Appendix [3.3]

    Insertion Sort Using Java
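The original listing for this appendix is not present in the extracted text; as a stand-in, a minimal Insertion sort sketch in the same spirit (class and method names are illustrative, not the original code):

```java
// Insertion sort: grow a sorted prefix by inserting each new value
// into its correct position among the already-sorted elements.
public class InsertionSort {
    static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];                      // next unsorted value
            int j = i - 1;
            while (j >= 0 && a[j] > key) {       // shift larger sorted values right
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;                      // insert into the sorted prefix
        }
    }

    public static void main(String[] args) {
        int[] data = {12, 11, 13, 5, 6};
        sort(data);
        System.out.println(java.util.Arrays.toString(data)); // prints [5, 6, 11, 12, 13]
    }
}
```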

  • Appendix [3.4]

    Bubble Sort Implementation on the INPUT dataset

  • Appendix [3.5]

    Selection Sort Implementation on the INPUT dataset

  • Appendix [3.6]

    Insertion Sort Implementation on the INPUT dataset

  • Appendix [4.1]*

*Taken from: Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. (2001) Introduction to Algorithms, Second edition. Cambridge, Massachusetts: MIT Press. Chapter 4.
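The referenced table is reproduced from Cormen et al. as an image and is not present in the extracted text; as a stand-in, the three cases of the master theorem for a recurrence T(n) = aT(n/b) + f(n), with a ≥ 1 and b > 1, can be stated as:

```latex
\[
T(n) =
\begin{cases}
\Theta\!\left(n^{\log_b a}\right)
  & \text{if } f(n) = O\!\left(n^{\log_b a - \varepsilon}\right)
    \text{ for some } \varepsilon > 0,\\[4pt]
\Theta\!\left(n^{\log_b a} \log n\right)
  & \text{if } f(n) = \Theta\!\left(n^{\log_b a}\right),\\[4pt]
\Theta\!\left(f(n)\right)
  & \text{if } f(n) = \Omega\!\left(n^{\log_b a + \varepsilon}\right)
    \text{ for some } \varepsilon > 0 \text{ and } a\,f(n/b) \le c\,f(n)
    \text{ for some } c < 1.
\end{cases}
\]
```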

  • Appendix [4.2]

    Merge Sort Implementation on INPUT dataset

    (Completed by Tanasorn Chindasook, Prateek Choudhary and Shengchen Dong)
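The group's original listing is not present in the extracted text; as a stand-in, a minimal top-down Merge sort sketch matching the description in the paper (class and method names are illustrative, not the original code):

```java
import java.util.Arrays;

// Merge sort: recursively split the array in half, sort each half,
// then merge the two sorted halves (uses O(n) auxiliary memory).
public class MergeSort {
    static void sort(int[] a) {
        if (a.length < 2) return;                        // base case
        int mid = a.length / 2;
        int[] left = Arrays.copyOfRange(a, 0, mid);      // O(n) auxiliary copies
        int[] right = Arrays.copyOfRange(a, mid, a.length);
        sort(left);
        sort(right);
        merge(a, left, right);                           // combine step
    }

    static void merge(int[] a, int[] left, int[] right) {
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)      // take the smaller head
            a[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length) a[k++] = left[i++];      // drain leftovers
        while (j < right.length) a[k++] = right[j++];
    }

    public static void main(String[] args) {
        int[] data = {38, 27, 43, 3, 9, 82, 10};
        sort(data);
        System.out.println(Arrays.toString(data)); // prints [3, 9, 10, 27, 38, 43, 82]
    }
}
```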

  • Appendix [4.3]

    Quick Sort Implementation on INPUT dataset

    (Completed by Tanasorn Chindasook, Prateek Choudhary and Shengchen Dong)
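The group's original listing is not present in the extracted text; as a stand-in, a minimal Quick sort sketch using a Lomuto-style partition (the original may have used a different partition scheme; names are illustrative):

```java
// Quick sort: partition around a pivot, then recursively sort the
// sub-arrays on either side of it (O(log n) expected stack depth).
public class QuickSort {
    static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;               // base case: 0 or 1 elements
        int p = partition(a, lo, hi);       // pivot lands at its final index p
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    // Lomuto partition: uses the last element as the pivot.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;                         // boundary of the "smaller" region
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;   // place pivot at the boundary
        return i;
    }

    public static void main(String[] args) {
        int[] data = {10, 80, 30, 90, 40, 50, 70};
        sort(data);
        System.out.println(java.util.Arrays.toString(data)); // prints [10, 30, 40, 50, 70, 80, 90]
    }
}
```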

  • Appendix [4.4]*

    *Taken from "Know Thy Complexities!" Big-O Algorithm Complexity Cheat Sheet (Know Thy

    Complexities!) @ericdrowell. Accessed December 09, 2018. http://bigocheatsheet.com/.

