
Datamining With Big Data_Siva

Date post: 27-Feb-2018
Upload: venkatesh-gardas

Transcript
  • 7/25/2019 Datamining With Big Data_Siva

    1/69

    ABSTRACT

    Big Data concern large-volume, complex, growing data sets with multiple,
    autonomous sources. With the fast development of networking, data storage,
    and the data collection capacity, Big Data are now rapidly expanding in all
    science and engineering domains, including physical, biological and
    biomedical sciences. This paper presents a HACE theorem that characterizes
    the features of the Big Data revolution, and proposes a Big Data processing
    model, from the data mining perspective. This data-driven model involves
    demand-driven aggregation of information sources, mining and analysis, user
    interest modelling, and security and privacy considerations. We analyse the
    challenging issues in the data-driven model and also in the Big Data
    revolution.

    1. INTRODUCTION

    Introduction Data Mining


    Structure of Data Mining

    Generally, data mining (sometimes called data or knowledge discovery) is the
    process of analyzing data from different perspectives and summarizing it
    into useful information: information that can be used to increase revenue,
    cut costs, or both. Data mining software is one of a number of analytical
    tools for analyzing data. It allows users to analyze data from many
    different dimensions or angles, categorize it, and summarize the
    relationships identified. Technically, data mining is the process of finding
    correlations or patterns among dozens of fields in large relational
    databases.

    Data Mining Works

    While large-scale information technology has been evolving separate
    transaction and analytical systems, data mining provides the link between
    the two. Data mining software analyzes relationships and patterns in stored
    transaction data based on open-ended user queries. Several types of
    analytical software are available: statistical, machine learning, and neural
    networks.

    Generally, any of four types of relationships are sought:

    Classes: Stored data is used to locate data in predetermined groups. For
    example, a restaurant chain could mine customer purchase data to determine
    when customers visit and what they typically order. This information could
    be used to increase traffic by having daily specials.

    Clusters: Data items are grouped according to logical relationships or
    consumer preferences. For example, data can be mined to identify market
    segments or consumer affinities.

    Associations: Data can be mined to identify associations. The beer-diaper
    example is an example of associative mining.

    Sequential patterns: Data is mined to anticipate behavior patterns and
    trends. For example, an outdoor equipment retailer could predict the
    likelihood of a backpack being purchased based on a consumer's purchase of
    sleeping bags and hiking shoes.
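    Association mining of the beer-diaper kind is commonly quantified with two
    ratios, support and confidence. The Java sketch below is illustrative only:
    the class name AssociationSketch, the helper names, and the basket contents
    are assumptions made for this example and are not part of the project.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationSketch {
    // Support: fraction of baskets that contain every item in `items`.
    static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // Confidence of the rule lhs -> rhs: support(lhs and rhs together) / support(lhs).
    // Returns NaN when no basket contains lhs.
    static double confidence(List<Set<String>> baskets, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return support(baskets, both) / support(baskets, lhs);
    }

    // Hypothetical purchase data: four baskets, two of which hold beer and diapers.
    static List<Set<String>> sampleBaskets() {
        return List.of(
            Set.of("beer", "diapers", "chips"),
            Set.of("diapers", "milk"),
            Set.of("beer", "diapers"),
            Set.of("milk", "bread"));
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = sampleBaskets();
        // 2 of 4 baskets contain both items.
        System.out.println(support(baskets, Set.of("beer", "diapers")));
        // 2 of the 3 diaper baskets also contain beer (about 0.67).
        System.out.println(confidence(baskets, Set.of("diapers"), Set.of("beer")));
    }
}
```

    A rule such as "diapers -> beer" is normally kept only when both ratios
    exceed thresholds chosen by the analyst; the sketch leaves that choice out.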

    Data mining consists of five major elements:

    1) Extract, transform, and load transaction data onto the data warehouse
       system.
    2) Store and manage the data in a multidimensional database system.
    3) Provide data access to business analysts and information technology
       professionals.
    4) Analyze the data by application software.
    5) Present the data in a useful format, such as a graph or table.

    Different levels of analysis are available:

    Artificial neural networks: Non-linear predictive models that learn through
    training and resemble biological neural networks in structure.

    Genetic algorithms: Optimization techniques that use processes such as
    genetic combination, mutation, and natural selection in a design based on
    the concepts of natural evolution.

    Decision trees: Tree-shaped structures that represent sets of decisions.
    These decisions generate rules for the classification of a dataset. Specific
    decision tree methods include Classification and Regression Trees (CART) and
    Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are
    decision tree techniques used for classification of a dataset. They provide
    a set of rules that you can apply to a new (unclassified) dataset to predict
    which records will have a given outcome. CART segments a dataset by creating
    2-way splits while CHAID segments using chi square tests to create multi-way
    splits. CART typically requires less data preparation than CHAID.

    Nearest neighbor method: A technique that classifies each record in a
    dataset based on a combination of the classes of the k record(s) most
    similar to it in a historical dataset (where k ≥ 1). Sometimes called the
    k-nearest neighbor technique.

    Rule induction: The extraction of useful if-then rules from data based on
    statistical significance.

    Data visualization: The visual interpretation of complex relationships in
    multidimensional data. Graphics tools are used to illustrate data
    relationships.
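    The nearest neighbor method described above can be sketched in a few lines
    of Java. This is a minimal illustration under stated assumptions, not the
    project's implementation: the class KnnSketch, the Labeled record type, the
    Euclidean distance measure, and the sample points are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KnnSketch {
    // One labelled record from the historical dataset (hypothetical structure).
    static final class Labeled {
        final double[] features;
        final String label;

        Labeled(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    // Euclidean distance between two equal-length feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Majority vote among the k historical records closest to the query.
    // Assumes k >= 1 and a non-empty history; ties are broken arbitrarily.
    static String classify(List<Labeled> history, double[] query, int k) {
        List<Labeled> sorted = new ArrayList<>(history);
        sorted.sort(Comparator.comparingDouble((Labeled r) -> distance(r.features, query)));
        Map<String, Integer> votes = new HashMap<>();
        for (Labeled r : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(r.label, 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Hypothetical historical dataset with two well-separated classes.
    static List<Labeled> sample() {
        List<Labeled> history = new ArrayList<>();
        history.add(new Labeled(new double[] {1, 1}, "A"));
        history.add(new Labeled(new double[] {1, 2}, "A"));
        history.add(new Labeled(new double[] {2, 1}, "A"));
        history.add(new Labeled(new double[] {5, 5}, "B"));
        history.add(new Labeled(new double[] {6, 5}, "B"));
        return history;
    }

    public static void main(String[] args) {
        System.out.println(classify(sample(), new double[] {1.5, 1.5}, 3));
        System.out.println(classify(sample(), new double[] {5.5, 5.0}, 3));
    }
}
```

    Sorting the whole history keeps the sketch short; a real system working on
    large volumes would more likely use a spatial index or a bounded heap.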

    Characteristics of Data Mining:

    Large quantities of data: The volume of data is so great that it has to be
    analyzed by automated techniques, e.g. satellite information, credit card
    transactions, etc.

    Noisy, incomplete data: Imprecise data is characteristic of all data
    collection.

    Complex data structure: Conventional statistical analysis is not possible.

    Heterogeneous data stored in legacy systems.

    Benefits of Data Mining:

    1) It's one of the most effective services that are available today. With
       the help of data mining, one can discover precious information about
       customers and their behavior for a specific set of products, and
       evaluate, analyze, store, mine and load data related to them.
    2) An analytical CRM model and strategic business-related decisions can be
       made with the help of data mining, as it helps in providing a complete
       synopsis of customers.
    3) A great many organizations have installed data mining projects, and it
       has helped them see their own companies make an unprecedented
       improvement in their marketing strategies (campaigns).
    4) Data mining is generally used by organizations with a solid customer
       focus. Because it is flexible in its applicability, it is widely used in
       applications to foresee crucial data, including industry analysis and
       consumer buying behaviors.
    5) Fast-paced and prompt access to data, along with economic processing
       techniques, have made data mining one of the most suitable services that
       a company can seek.


    Advantages of Data Mining:

    1. Marketing / Retail:

    Data mining helps marketing companies build models based on historical data
    to predict who will respond to new marketing campaigns such as direct mail,
    online marketing campaigns, etc. Through the results, marketers will have an
    appropriate approach to selling profitable products to targeted customers.

    Data mining brings a lot of benefits to retail companies in the same way as
    marketing. Through market basket analysis, a store can arrange its products
    so that customers can conveniently buy frequently purchased products
    together. In addition, it also helps retail companies offer certain
    discounts for particular products that will attract more customers.

    2. Finance / Banking

    Data mining gives financial institutions information about loans and credit
    reporting. By building a model from historical customer data, banks and
    financial institutions can distinguish good and bad loans. In addition, data
    mining helps banks detect fraudulent credit card transactions to protect the
    credit card's owner.

    3. Manufacturing

    By applying data mining to operational engineering data, manufacturers can
    detect faulty equipment and determine optimal control parameters. For
    example, semiconductor manufacturers face the challenge that even when the
    conditions of manufacturing environments at different wafer production
    plants are similar, the quality of the wafers is not the same, and some
    wafers have defects for unknown reasons. Data mining has been applied to
    determine the ranges of control parameters that lead to the production of
    golden wafers. Those optimal control parameters are then used to manufacture
    wafers of the desired quality.

    4. Governments

    Data mining helps government agencies by digging through and analyzing
    records of financial transactions to build patterns that can detect money
    laundering or criminal activities.

    5. Law enforcement:

    Data mining can aid law enforcers in identifying criminal suspects as well
    as apprehending these criminals by examining trends in location, crime type,
    habit, and other patterns of behavior.

    6. Researchers:

    Data mining can assist researchers by speeding up their data analysis
    process, thus allowing them more time to work on other projects.

    LITERATURE SURVEY

    1) Algorithms for Mining the Evolution of Conserved Relational States in
    Dynamic Networks

    AUTHORS: R. Ahmed and G. Karypis

    Dynamic networks have recently been recognized as a powerful abstraction to
    model and represent the temporal changes and dynamic aspects of the data
    underlying many complex systems. Significant insights regarding the stable
    relational patterns among the entities can be gained by analyzing temporal
    evolution of the complex entity relations. This can help identify the
    transitions from one conserved state to the next and may provide evidence to
    the existence of external factors that are responsible for changing the
    stable relational patterns in these networks. This paper presents a new data
    mining method that analyzes the time-persistent relations or states between
    the entities of the dynamic networks and captures all maximal non-redundant
    evolution paths of the stable relational states. Experimental results based
    on multiple datasets from real-world applications show that the method is
    efficient and scalable.

    2) Novel Approaches to Crawling Important Pages Early

    AUTHORS: M.H. Alam, J.W. Ha, and S.K. Lee

    Web crawlers are essential to many Web applications, such as Web search
    engines, Web archives, and Web directories, which maintain Web pages in
    their local repositories. In this paper, we study the problem of crawl
    scheduling that biases crawl ordering toward important pages. We propose a
    set of crawling algorithms for effective and efficient crawl ordering by
    prioritizing important pages with the well-known PageRank as the importance
    metric. In order to score URLs, the proposed algorithms utilize various
    features, including partial link structure, inter-host links, page titles,
    and topic relevance. We conduct a large-scale experiment using publicly
    available data sets to examine the effect of each feature on crawl ordering
    and evaluate the performance of many algorithms. The experimental results
    verify the efficacy of our schemes. In particular, compared with the
    representative RankMass crawler, the FPR-title-host algorithm reduces
    computational overhead by a factor as great as three in running time while
    improving effectiveness by 5% in cumulative PageRank.

    3) Identifying Influential and Susceptible Members of Social Networks

    AUTHORS: S. Aral and D. Walker

    Identifying social influence in networks is critical to understanding how
    behaviors spread. We present a method that uses in vivo randomized
    experimentation to identify influence and susceptibility in networks while
    avoiding the biases inherent in traditional estimates of social contagion.
    Estimation in a representative sample of 1.3 million Facebook users showed
    that younger users are more susceptible to influence than older users, men
    are more influential than women, women influence men more than they
    influence other women, and married individuals are the least susceptible to
    influence in the decision to adopt the product offered. Analysis of
    influence and susceptibility together with network structure revealed that
    influential individuals are less susceptible to influence than
    noninfluential individuals and that they cluster in the network while
    susceptible individuals do not, which suggests that influential people with
    influential friends may be instrumental in the spread of this product in the
    network.

    4) Big Privacy: Protecting Confidentiality in Big Data

    AUTHORS: A. Machanavajjhala and J.P. Reiter

    A tremendous amount of data about individuals, e.g., demographic
    information, internet activity, energy usage, communication patterns and
    social interactions, is being collected and analyzed by many national
    statistical agencies, survey organizations, medical centers, and Web and
    social networking companies. Wide dissemination of microdata (data at the
    granularity of individuals) facilitates advances in science and public
    policy, helps citizens to learn about their societies, and enables students
    to develop skills at data analysis. Often, however, data producers cannot
    release microdata as collected, because doing so could reveal data subjects'
    identities or values of sensitive attributes. Failing to protect
    confidentiality (when promised) is unethical and can cause harm to data
    subjects and the data provider. It even may be illegal, especially in
    government and research settings. For example, if one reveals confidential
    data covered by the U.S. Confidential Information Protection and Statistical
    Efficiency Act, one is subject to a maximum of $250,000 in fines and a five
    year prison term.

    5) Analyzing Collective Behavior from Blogs Using Swarm Intelligence

    AUTHORS: S. Banerjee and N. Agarwal

    With the rapid growth of the availability and popularity of interpersonal
    and behavior-rich resources such as blogs and other social media avenues,
    emerging opportunities and challenges arise as people now can, and do,
    actively use computational intelligence to seek out and understand the
    opinions of others. The study of collective behavior of individuals has
    implications to business intelligence, predictive analytics, customer
    relationship management, and examining online collective action as
    manifested by various flash mobs, the Arab Spring (2011) and other such
    events. In this article, we introduce a nature-inspired theory to model
    collective behavior from the observed data on blogs using swarm
    intelligence, where the goal is to accurately model and predict the future
    behavior of a large population after observing their interactions during a
    training phase. Specifically, an ant colony optimization model is trained
    with behavioral trend from the blog data and is tested over real-world
    blogs. Promising results were obtained in trend prediction using ant colony
    based pheromone classifier and CHI statistical measure. We provide empirical
    guidelines for selecting suitable parameters for the model, conclude with
    interesting observations, and envision future research directions.

    2. SYSTEM STUDY

    FEASIBILITY STUDY

    The feasibility of the project is analyzed in this phase, and a business
    proposal is put forth with a very general plan for the project and some cost
    estimates. During system analysis the feasibility study of the proposed
    system is to be carried out. This is to ensure that the proposed system is
    not a burden to the company. For feasibility analysis, some understanding of
    the major requirements for the system is essential.

    Three key considerations involved in the feasibility analysis are:

    ECONOMICAL FEASIBILITY
    TECHNICAL FEASIBILITY
    SOCIAL FEASIBILITY

    ECONOMICAL FEASIBILITY

    This study is carried out to check the economic impact that the system will
    have on the organization. The amount of funds that the company can pour into
    the research and development of the system is limited. The expenditures must
    be justified. The developed system is well within the budget, and this was
    achieved because most of the technologies used are freely available. Only
    the customized products had to be purchased.

    TECHNICAL FEASIBILITY

    This study is carried out to check the technical feasibility, that is, the
    technical requirements of the system. Any system developed must not place a
    high demand on the available technical resources, since that would lead to
    high demands being placed on the client. The developed system must have
    modest requirements, as only minimal or no changes are required for
    implementing this system.

    SOCIAL FEASIBILITY

    This aspect of the study is to check the level of acceptance of the system
    by the user. This includes the process of training the user to use the
    system efficiently. The user must not feel threatened by the system, but
    must instead accept it as a necessity. The level of acceptance by the users
    solely depends on the methods that are employed to educate the user about
    the system and to make him familiar with it. His level of confidence must be
    raised so that he is also able to make some constructive criticism, which is
    welcomed, as he is the final user of the system.

    3. SYSTEM REQUIREMENTS

    Monitor : 15 VGA Colour.
    Mouse : Logitech.
    RAM : 512 MB.

    SOFTWARE REQUIREMENTS

    Object oriented
    Portable
    Distributed
    High performance
    Interpreted
    Multithreaded
    Robust
    Dynamic
    Secure

    With most programming languages, you either compile or interpret a program
    so that you can run it on your computer. The Java programming language is
    unusual in that a program is both compiled and interpreted. With the
    compiler, first you translate a program into an intermediate language called
    Java byte codes, the platform-independent codes interpreted by the
    interpreter on the Java platform. The interpreter parses and runs each Java
    byte code instruction on the computer. Compilation happens just once;
    interpretation occurs each time the program is executed. The following
    figure illustrates how this works.

    You can think of Java byte codes as the machine code instructions for the
    Java Virtual Machine (Java VM). Every Java interpreter, whether it's a
    development tool or a Web browser that can run applets, is an implementation
    of the Java VM. Java byte codes help make "write once, run anywhere"
    possible. You can compile your program into byte codes on any platform that
    has a Java compiler. The byte codes can then be run on any implementation of
    the Java VM. That means that as long as a computer has a Java VM, the same
    program written in the Java programming language can run on Windows 2000, a
    Solaris workstation, or on an iMac.

    The Java Platform

    A platform is the hardware or software environment in which a program runs.
    We've already mentioned some of the most popular platforms like Windows
    2000, Linux, Solaris, and MacOS. Most platforms can be described as a
    combination of the operating system and hardware. The Java platform differs
    from most other platforms in that it's a software-only platform that runs on
    top of other hardware-based platforms.

    The Java platform has two components:

    The Java Virtual Machine (Java VM)
    The Java Application Programming Interface (Java API)

    You've already been introduced to the Java VM. It's the base for the Java
    platform and is ported onto various hardware-based platforms.

    The Java API is a large collection of ready-made software components that
    provide many useful capabilities, such as graphical user interface (GUI)
    widgets. The Java API is grouped into libraries of related classes and
    interfaces; these libraries are known as packages. The next section, What
    Can Java Technology Do?, highlights what functionality some of the packages
    in the Java API provide.

    The following figure depicts a program that's running on the Java platform.
    As the figure shows, the Java API and the virtual machine insulate the
    program from the hardware.

    Native code is code that, once compiled, runs on a specific hardware
    platform. As a platform-independent environment, the Java platform can be a
    bit slower than native code. However, smart compilers, well-tuned
    interpreters, and just-in-time byte code compilers can bring performance
    close to that of native code without threatening portability.

    What Can Java Technology Do?

    The most common types of programs written in the Java programming language
    are applets and applications. If you've surfed the Web, you're probably
    already familiar with applets. An applet is a program that adheres to
    certain conventions that allow it to run within a Java-enabled browser.

    However, the Java programming language is not just for writing cute,
    entertaining applets for the Web. The general-purpose, high-level Java
    programming language is also a powerful software platform. Using the
    generous API, you can write many types of programs.

    An application is a standalone program that runs directly on the Java
    platform. A special kind of application known as a server serves and
    supports clients on a network. Examples of servers are Web servers, proxy
    servers, mail servers, and print servers. Another specialized program is a
    servlet. A servlet can almost be thought of as an applet that runs on the
    server side. Java Servlets are a popular choice for building interactive web
    applications, replacing the use of CGI scripts. Servlets are similar to
    applets in that they are runtime extensions of applications. Instead of
    working in browsers, though, servlets run within Java Web servers,
    configuring or tailoring the server.

    How does the API support all these kinds of programs? It does so with
    packages of software components that provide a wide range of functionality.
    Every full implementation of the Java platform gives you the following
    features:

    The essentials: Objects, strings, threads, numbers, input and output, data
    structures, system properties, date and time, and so on.

    Applets: The set of conventions used by applets.

    Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram
    Protocol) sockets, and IP (Internet Protocol) addresses.

    Internationalization: Help for writing programs that can be localized for
    users worldwide. Programs can automatically adapt to specific locales and be
    displayed in the appropriate language.

    Security: Both low level and high level, including electronic signatures,
    public and private key management, access control, and certificates.

    Software components: Known as JavaBeans(TM), can plug into existing
    component architectures.

    Object serialization: Allows lightweight persistence and communication via
    Remote Method Invocation (RMI).

    Java Database Connectivity (JDBC(TM)): Provides uniform access to a wide
    range of relational databases.
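    As one concrete instance of the features listed above, object serialization
    can be tried with nothing but the java.io package. The sketch below is
    illustrative only: the class SerializationSketch and its nested Customer
    type are hypothetical names for this example, and the object is written to
    an in-memory byte array rather than to a file or an RMI call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // Hypothetical value object; implementing Serializable opts it in.
    static final class Customer implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int visits;

        Customer(String name, int visits) {
            this.name = name;
            this.visits = visits;
        }
    }

    // Writes the object graph into a byte array.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Rebuilds an object from bytes produced by toBytes.
    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Convenience wrapper: serialize, deserialize, and return the copy.
    static Customer roundTrip(Customer original) {
        try {
            return (Customer) fromBytes(toBytes(original));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Customer copy = roundTrip(new Customer("Asha", 3));
        System.out.println(copy.name + " visited " + copy.visits + " times");
    }
}
```

    The copy is a distinct object with the same field values, which is the
    "lightweight persistence" the list refers to.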

    The Java platform also has APIs for 2D and 3D graphics, accessibility,
    servers, collaboration, telephony, speech, animation, and more. The
    following figure depicts what is included in the Java 2 SDK.

    How Will Java Technology Change My Life?

    We can't promise you fame, fortune, or even a job if you learn the Java
    programming language. Still, it is likely to make your programs better and
    to require less effort than other languages. We believe that Java technology
    will help you do the following:

    Get started quickly: Although the Java programming language is a powerful
    object-oriented language, it's easy to learn, especially for programmers
    already familiar with C or C++.

    Write less code: Comparisons of program metrics (class counts, method
    counts, and so on) suggest that a program written in the Java programming
    language can be four times smaller than the same program in C++.

    Write better code: The Java programming language encourages good coding
    practices, and its garbage collection helps you avoid memory leaks. Its
    object orientation, its JavaBeans component architecture, and its
    wide-ranging, easily extendible API let you reuse other people's tested code
    and introduce fewer bugs.

    Develop programs more quickly: Your development time may be as much as twice
    as fast versus writing the same program in C++. Why? You write fewer lines
    of code and it is a simpler programming language than C++.

    Avoid platform dependencies with 100% Pure Java: You can keep your program
    portable by avoiding the use of libraries written in other languages. The
    100% Pure Java(TM) Product Certification Program has a repository of
    historical process manuals, white papers, brochures, and similar materials
    online.

    Write once, run anywhere: Because 100% Pure Java programs are compiled into
    machine-independent byte codes, they run consistently on any Java platform.

    Distribute software more easily: You can upgrade applets easily from a
    central server. Applets take advantage of the feature of allowing new
    classes to be loaded "on the fly," without recompiling the entire program.

    ODBC

    Microsoft Open Database Connectivity (ODBC) is a standard programming
    interface for application developers and database systems providers. Before
    ODBC became a de facto standard for Windows programs to interface with
    database systems, programmers had to use proprietary languages for each
    database they wanted to connect to. Now, ODBC has made the choice of the
    database system almost irrelevant from a coding perspective, which is as it
    should be. Application developers have much more important things to worry
    about than the syntax that is needed to port their program from one database
    to another when business needs suddenly change.

    Through the ODBC Administrator in Control Panel, you can specify the
    particular database that is associated with a data source that an ODBC
    application program is written to use. Think of an ODBC data source as a
    door with a name on it. Each door will lead you to a particular database.
    For example, the data source named Sales Figures might be a SQL Server
    database, whereas the Accounts Payable data source could refer to an Access
    database. The physical database referred to by a data source can reside
    anywhere on the LAN.

    The ODBC system files are not installed on your system by Windows 95.
    Rather, they are installed when you set up a separate database application,
    such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is
    installed in Control Panel, it uses a file called ODBCINST.DLL. It is also
    possible to administer your ODBC data sources through a stand-alone program
    called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program
    and each maintains a separate list of ODBC data sources.

    From a programming perspective, the beauty of ODBC is that the application
    can be written to use the same set of function calls to interface with any
    data source, regardless of the database vendor. The source code of the
    application doesn't change whether it talks to Oracle or SQL Server. We only
    mention these two as an example. There are ODBC drivers available for
    several dozen popular database systems. Even Excel spreadsheets and plain
    text files can be turned into data sources. The operating system uses the
    Registry information written by ODBC Administrator to determine which
    low-level ODBC drivers are needed to talk to the data source (such as the
    interface to Oracle or SQL Server). The loading of the ODBC drivers is
    transparent to the ODBC application program. In a client/server environment,
    the ODBC API even handles many of the network issues for the application
    programmer.

    The advantages of this scheme are so numerous that you are probably thinking
    there must be some catch. The only disadvantage of ODBC is that it isn't as
    efficient as talking directly to the native database interface. ODBC has had
    many detractors make the charge that it is too slow. Microsoft has always
    claimed that the critical factor in performance is the quality of the driver
    software that is used. In our humble opinion, this is true. The availability
    of good ODBC drivers has improved a great deal recently. And anyway, the
    criticism about performance is somewhat analogous to those who said that
    compilers would never match the speed of pure assembly language. Maybe not,
    but the compiler (or ODBC) gives you the opportunity to write cleaner
    programs, which means you finish sooner. Meanwhile, computers get faster
    every year.

    JDBC

    In an effort to set an independent database standard API for Java, Sun
    Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a
    generic SQL database access mechanism that provides a consistent interface
    to a variety of RDBMSs. This consistent interface is achieved through the
    use of "plug-in" database connectivity modules, or drivers. If a database
    vendor wishes to have JDBC support, he or she must provide the driver for
    each platform that the database and Java run on.
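    The driver-based design described above means application code is written
    against the java.sql interfaces and only the connection URL selects a
    vendor's driver. The following Java sketch illustrates that idea under
    stated assumptions: the URL jdbc:hypothetical:sales, the customers table,
    its city column, and the class JdbcSketch are all invented for this example,
    and no driver for that URL exists, so running it exercises the error path.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class JdbcSketch {
    // Vendor-neutral query code: it only touches java.sql interfaces.
    // The table and column names are hypothetical.
    static int countCustomersIn(Connection conn, String city) throws SQLException {
        String sql = "SELECT COUNT(*) FROM customers WHERE city = ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, city);
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }

    // DriverManager picks whichever registered driver accepts the URL.
    // If none does, getConnection throws an SQLException, reported here as text.
    static String describe(String url) {
        try (Connection conn = DriverManager.getConnection(url)) {
            return "customers in Chennai: " + countCustomersIn(conn, "Chennai");
        } catch (SQLException e) {
            return "could not query " + url + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the URL documented by a real driver.
        System.out.println(describe("jdbc:hypothetical:sales"));
    }
}
```

    Pointing the same code at another database would, in principle, require only
    a different URL and the matching driver on the classpath.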

    To gain a wider acceptance of JDBC, Sun based JDBC's framework on ODBC. As
    you discovered earlier in this chapter, ODBC has widespread support on a
    variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC
    drivers to market much faster than developing a completely new connectivity
    solution.

    JDBC was announced in March of 1996. It was released for a 90 day public
    review that ended June 8, 1996. Because of user input, the final JDBC v1.0
    specification was released soon after.

    The remainder of this section will cover enough information about >DB" for

    you to know what it is about and how to use it effectively. This is by no means a

    complete overview of >DB". That would fill an entire book.

    JDBC Goals

    Few software packages are designed without goals in mind. JDBC is one that,
    because of its many goals, drove the development of the API. These goals, in
    conjunction with early reviewer feedback, have finalized the JDBC class
    library into a solid framework for building database applications in Java.

    The goals that were set for JDBC are important. They will give you some
    insight as to why certain classes and functionalities behave the way they do.
    The design goals for JDBC are as follows:

    1. SQL Level API

    The designers felt that their main goal was to define a SQL interface for
    Java. Although not the lowest database interface level possible, it is at a
    low enough level for higher-level tools and APIs to be created. Conversely,
    it is at a high enough level for application programmers to use it
    confidently. Attaining this goal allows for future tool vendors to
    "generate" JDBC code and to hide many of JDBC's complexities from the end
    user.

    2. SQL Conformance

    SQL syntax varies as you move from database vendor to database vendor. In an
    effort to support a wide variety of vendors, JDBC will allow any query
    statement to be passed through it to the underlying database driver. This
    allows the connectivity module to handle non-standard functionality in a
    manner that is suitable for its users.

    3. JDBC must be implementable on top of common database interfaces

    The JDBC SQL API must "sit" on top of other common SQL level APIs. This goal
    allows JDBC to use existing ODBC level drivers by the use of a software
    interface. This interface would translate JDBC calls to ODBC and vice versa.

    4. Provide a Java interface that is consistent with the rest of the Java system

    Because of Java's acceptance in the user community thus far, the designers
    feel that they should not stray from the current design of the core Java
    system.

    5. Keep it simple

    This goal probably appears in all software design goal listings. JDBC is no
    exception. Sun felt that the design of JDBC should be very simple, allowing
    for only one method of completing a task per mechanism. Allowing duplicate
    functionality only serves to confuse the users of the API.

    6. Use strong, static typing wherever possible

    Strong typing allows for more error checking to be done at compile time
    [...] JDBC. However, more complex SQL statements should also be possible.
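    The goals above (an SQL-level API, pass-through of query text, strong
    typing) can be seen in a few lines of standard java.sql code. The following
    is only a sketch: the table accounts, its columns, and the class name
    JdbcSketch are hypothetical and do not come from this project, and no
    particular database driver is assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class JdbcSketch {
    // Hypothetical query: the table and column names are illustrative only.
    static final String QUERY = "SELECT id, name FROM accounts WHERE location = ?";

    /** Runs QUERY with one bound parameter and returns each row as "id:name". */
    static List<String> findAccounts(Connection conn, String location) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setString(1, location);            // typed setter for the parameter
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Typed getters: the column types are stated in the Java code.
                    rows.add(rs.getInt("id") + ":" + rs.getString("name"));
                }
            }
        }
        return rows;
    }

    /** Asks DriverManager for a connection; the driver that accepts the URL is chosen at run time. */
    static String tryConnect(String jdbcUrl) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            return "connected";
        } catch (SQLException e) {
            return "failed: " + e.getMessage();
        }
    }
}
```

    The SQL text is handed to the driver unchanged, so vendor-specific syntax
    remains the driver's concern, while the application code only depends on the
    java.sql interfaces.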

    Java is two things: a programming language and a platform. Java is a
    high-level programming language that is all of the following:

    Simple              Architecture-neutral
    Object-oriented     Portable
    Distributed         High-performance
    Interpreted         Multithreaded
    Robust              Dynamic
    Secure

    Java is also unusual in that each Java program is both compiled and
    interpreted. With a compiler you translate a Java program into an
    intermediate language called Java byte codes, the platform-independent
    codes that the interpreter reads. Each byte code instruction is then parsed
    and run on the computer. Compilation happens just once; interpretation
    occurs each time the program is executed. The figure illustrates how this
    works.

    [Figure: a Java program passes through the compiler and then the
    interpreter (labels: Java Program, Compiler, Interpreter, My Program).]

    You can think of Java byte codes as the machine code instructions for the
    Java Virtual Machine (Java VM). Every Java interpreter, whether it's a Java
    development tool or a Web browser that can run Java applets, is an
    implementation of the Java VM. The Java VM can also be implemented in
    hardware.

    Java byte codes help make "write once, run anywhere" possible. You can
    compile your Java program into byte codes on any platform that has a Java
    compiler. The byte codes can then be run on any implementation of the Java
    VM. For example, the same Java program can run on Windows NT, Solaris, and
    Macintosh.

    Networking

    TCP/IP stack

    The TCP/IP stack is shorter than the OSI one.

    TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a
    connectionless protocol.

    IP datagrams

    The IP layer provides a connectionless and unreliable delivery system. It
    considers each datagram independently of the others. Any association
    between datagrams must be supplied by the higher layers. The IP layer
    supplies a checksum that covers its own header. The header includes the
    source and destination addresses. The IP layer handles routing through an
    Internet. It is also responsible for breaking up large datagrams into
    smaller ones for transmission and reassembling them at the other end.
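    The header checksum mentioned above is the standard Internet checksum: the
    one's-complement sum of the header's 16-bit words, inverted. A minimal
    sketch follows; the class name Ipv4Checksum and the helper hex are
    illustrative, and the header bytes used in the usage note are sample
    values, not traffic captured from this project.

```java
final class Ipv4Checksum {
    /** One's-complement sum of 16-bit big-endian words, folded to 16 bits and inverted. */
    static int checksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < header.length; i += 2) {
            int hi = header[i] & 0xFF;
            int lo = (i + 1 < header.length) ? header[i + 1] & 0xFF : 0;
            sum += (hi << 8) | lo;
        }
        // Fold any carries above bit 15 back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    /** Converts a hex string such as "4500 0073" into bytes (spaces are ignored). */
    static byte[] hex(String text) {
        String clean = text.replace(" ", "");
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

    A sender computes the value with the checksum field set to zero and stores
    the result in that field. A receiver runs the same routine over the header
    as received, and a result of zero means the header arrived intact. For the
    sample header 4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7 the routine
    yields 0xB861.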

    UDP

    UDP is also connectionless and unreliable. What it adds to IP is a checksum
    for the contents of the datagram and port numbers. These are used to give a
    client/server model - see later.
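    As a rough illustration of the connectionless style, the sketch below sends
    one datagram to a port on the loopback interface and reads it back. The
    class and method names are hypothetical, and the two-second timeout is an
    arbitrary choice for the example; no connection is set up before sending.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

final class UdpSketch {
    /** Sends one datagram to a receiver bound on the loopback interface and returns what arrived. */
    static String sendAndReceive(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback); // port 0: any free port
             DatagramSocket sender = new DatagramSocket()) {
            receiver.setSoTimeout(2000); // UDP gives no delivery guarantee, so do not wait forever

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            // The destination address and port travel with every datagram.
            sender.send(new DatagramPacket(payload, payload.length, loopback, receiver.getLocalPort()));

            byte[] buffer = new byte[1024];
            DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
            receiver.receive(packet);
            return new String(packet.getData(), packet.getOffset(), packet.getLength(),
                    StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```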

    TCP

    TCP supplies logic to give a reliable connection-oriented protocol above
    IP. It provides a virtual circuit that two processes can use to
    communicate.
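    The virtual circuit can be sketched with the standard java.net classes: one
    side listens and accepts, the other connects, and both then read and write
    the same stream. This is an illustrative example only; the names
    TcpEchoSketch and echoOnce are assumptions, and the server handles a single
    client on the loopback interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class TcpEchoSketch {
    /** Starts a one-shot echo server on a free loopback port, sends one line, returns the reply. */
    static String echoOnce(String message) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> serveOne(server));
            serverThread.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 PrintWriter out = writer(client);
                 BufferedReader in = reader(client)) {
                out.println(message);          // travels over the established connection
                String reply = in.readLine();
                serverThread.join();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Accepts one connection and answers one line with an "echo: " prefix. */
    private static void serveOne(ServerSocket server) {
        try (Socket peer = server.accept();
             BufferedReader in = reader(peer);
             PrintWriter out = writer(peer)) {
            out.println("echo: " + in.readLine());
        } catch (IOException e) {
            // A real server would log this; the sketch just lets the thread end.
        }
    }

    private static BufferedReader reader(Socket socket) throws IOException {
        return new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket socket) throws IOException {
        return new PrintWriter(new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
    }
}
```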

    Internet addresses

    In order to use a service, you must be able to find it. The Internet uses
    an address scheme for machines so that they can be located. The address is
    a 32 bit integer which gives the IP address. This encodes a network ID and
    more addressing. The network ID falls into various classes according to the
    size of the network address.

    Network address

    Class A uses 8 bits for the network address with 24 bits left over for
    other addressing. Class B uses 16 bit network addressing. Class C uses 24
    bit network addressing and class D uses all 32.

    Subnet address

    Internally, the UNIX network is divided into sub networks. Building 11 is
    currently on one sub network and uses 10-bit addressing, allowing 1024
    different hosts.

    Host address

    8 bits are finally used for host addresses within our subnet. This places a
    limit of 256 machines that can be on the subnet.

    Total address

    The 32 bit address is usually written as 4 integers separated by dots.
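    The arithmetic in this section is small enough to check in code. The sketch
    below formats a 32 bit address as four dotted integers, reads the classful
    network class from the leading bits, and counts the values a given number
    of host bits can encode. The class and method names are illustrative, not
    part of any library.

```java
final class Ipv4Address {
    /** Formats a 32-bit address as four dot-separated decimal integers. */
    static String toDottedQuad(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    /** Classful network class, decided by the leading bits of the first octet. */
    static char addressClass(int address) {
        int first = (address >>> 24) & 0xFF;
        if (first < 128) return 'A';   // leading bit 0: 8-bit network part
        if (first < 192) return 'B';   // leading bits 10: 16-bit network part
        if (first < 224) return 'C';   // leading bits 110: 24-bit network part
        if (first < 240) return 'D';   // leading bits 1110
        return 'E';
    }

    /** Number of distinct values that hostBits bits can encode, e.g. 8 bits give 256. */
    static long hostCount(int hostBits) {
        return 1L << hostBits;
    }
}
```

    With these helpers, the address 0xC0A80001 is written 192.168.0.1 and falls
    in class C, 8 host bits give the 256 addresses quoted above, and 10 bits
    give 1024.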

    Port addresses

    A service exists on a host, and is identified by its port. This is a 16 bit
    number. To send a message to a server, you send it to the port for that
    service of the host that it is running on. This is not location
    transparency! Certain of these ports are "well known".

    Sockets

    A socket is a data structure maintained by the system to handle network
    connections. A socket is created using the call socket. It returns an
    integer that is like a file descriptor. In fact, under Windows, this handle
    can be used with the ReadFile and WriteFile functions.

    #include <sys/types.h>
    #include <sys/socket.h>
    int socket(int family, int type, int protocol);

    JFreeChart is a free 100% Java chart library that makes it easy for
    developers to display professional quality charts in their applications.
    JFreeChart's extensive feature set includes:

    A consistent and well-documented API, supporting a wide range of chart
    types [...]

    JavaScript is an object-oriented scripting language primarily used in
    client-side interfaces for web applications. Ajax (Asynchronous JavaScript
    and XML) is a Web 2.0 technique that allows changes to occur in a web page
    without the need to perform a page refresh. JavaScript toolkits can be
    leveraged to implement Ajax-enabled components and functionality in web
    pages.

    Web Server and Client

    A web server is software that can process the client request and send the
    response back to the client. For example, Apache is one of the most widely
    used web servers. A web server runs on a physical machine and listens for
    client requests on a specific port.

    A web client is software that helps in communicating with the server. Some
    of the most widely used web clients are Firefox, Google Chrome, Safari,
    etc. When we request something from a server (through a URL), the web
    client takes care of creating a request, sending it to the server, parsing
    the server response and presenting it to the user.

    HTML and HTTP

    The web server and the web client are two separate pieces of software, so
    there should be some common language for communication. HTML is the common
    language between server and client and stands for HyperText Markup
    Language.

    Web server and client also need a common communication protocol. HTTP
    (HyperText Transfer Protocol) is the communication protocol between server
    and client. HTTP runs on top of the TCP/IP communication protocol.

    Some of the important parts of an HTTP Request are:

    HTTP Method - action to be performed, usually GET, POST, PUT etc.
    URL - Page to access
    Form Parameters - similar to arguments in a Java method, for example user
    and password details from a login page.

    Sample HTTP Request:

    GET /FirstServletProject/jsps/hello.jsp HTTP/1.1
    Host: localhost:8080
    Cache-Control: no-cache

    Some of the important parts of an HTTP Response are:

    Status Code - an integer to indicate whether the request was a success or
    not. Some of the well known status codes are 200 for success, 404 for Not
    Found and 403 for Access Forbidden.
    Content Type - text, html, image, pdf etc. Also known as MIME type.
    Content - actual data that is rendered by the client and shown to the user.

    MIME Type or Content Type: An HTTP response header contains the tag
    "Content-Type". It's also called the MIME type, and the server sends it to
    the client to let it know the kind of data it's sending. It helps the
    client in rendering the data for the user. Some of the most used MIME types
    are text/html, text/xml, application/xml etc.

    Understanding URL

    URL is the acronym of Uniform Resource Locator and it's used to locate the
    server and resource. Every resource on the web has its own unique address.
    Let's see the parts of a URL with an example.

    http://localhost:8080/FirstServletProject/jsps/hello.jsp

    http:// - This is the first part of the URL and provides the communication
    protocol to be used in server-client communication.

    localhost - The unique address of the server; most of the time it's the
    hostname of the server that maps to a unique IP address. Sometimes multiple
    hostnames point to the same IP address and the web server virtual host
    takes care of sending the request to the particular server instance.

    8080 - This is the port on which the server is listening. It's optional,
    and if we don't provide it in the URL then the request goes to the default
    port of the protocol. Port numbers 0 to 1023 are reserved for well known
    services, for example 80 for HTTP and 443 for HTTPS.
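    The same breakdown can be obtained from the standard java.net.URI class. In
    the sketch below, the helper class UrlParts and its output format are
    illustrative choices; getPort() returns -1 when the URL carries no explicit
    port, so the default for the scheme is filled in by hand.

```java
import java.net.URI;

final class UrlParts {
    /** Returns "scheme | host | port | path" for the given URL. */
    static String describe(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort();              // -1 when no port is written in the URL
        if (port == -1) {
            port = defaultPort(uri.getScheme());
        }
        return uri.getScheme() + " | " + uri.getHost() + " | " + port + " | " + uri.getPath();
    }

    /** Default ports for the two protocols named in the text; -1 for anything else. */
    static int defaultPort(String scheme) {
        if ("http".equalsIgnoreCase(scheme)) return 80;
        if ("https".equalsIgnoreCase(scheme)) return 443;
        return -1;
    }
}
```

    For the example URL above, describe returns
    "http | localhost | 8080 | /FirstServletProject/jsps/hello.jsp".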

    Web servers are good for static HTML pages, but they don't know how to
    generate dynamic content or how to save data into databases, so we need
    another tool that we can use to generate dynamic content. There are several
    programming languages and frameworks for dynamic content, such as PHP,
    Python, Ruby on Rails, Java Servlets and JSPs.

    Java Servlets and JSPs are server-side technologies that extend the
    capability of web servers by providing support for dynamic response and
    data persistence.


    Web Container

    Tomcat is a web container. When a request is made from the client to the
    web server, the server passes the request to the web container, and it is
    the web container's job to find the correct resource to handle the request
    (servlet or JSP), then use the response from the resource to generate the
    response and provide it to the web server. The web server then sends the
    response back to the client.

    When the web container gets a request for a servlet, the container creates
    two objects, HttpServletRequest and HttpServletResponse. Then it finds the
    correct servlet based on the URL and creates a thread for the request. Then
    it invokes the servlet's service() method, and based on the HTTP method,
    service() invokes the doGet() or doPost() method. The servlet methods
    generate the dynamic page and write it to the response. Once the servlet
    thread is complete, the container converts the response to an HTTP response
    and sends it back to the client.
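    The dispatch path described above (URL to servlet, service() to doGet() or
    doPost()) can be modelled in plain Java. The sketch below is a toy model
    under stated assumptions: MiniContainer, MiniServlet and HelloServlet are
    invented names, the real Servlet API classes are not used, and the
    per-request thread and the request and response objects are left out to
    keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

final class MiniContainer {
    /** Stand-in for a servlet: service() dispatches on the HTTP method. */
    abstract static class MiniServlet {
        String service(String httpMethod, Map<String, String> params) {
            switch (httpMethod) {
                case "GET":  return doGet(params);
                case "POST": return doPost(params);
                default:     return "405 Method Not Allowed";
            }
        }

        // Subclasses override only the methods they support.
        String doGet(Map<String, String> params)  { return "405 Method Not Allowed"; }
        String doPost(Map<String, String> params) { return "405 Method Not Allowed"; }
    }

    /** Example servlet that answers GET requests only. */
    static final class HelloServlet extends MiniServlet {
        @Override
        String doGet(Map<String, String> params) {
            return "Hello, " + params.getOrDefault("user", "guest");
        }
    }

    private final Map<String, MiniServlet> mappings = new HashMap<>();

    /** Plays the role of a servlet mapping in the deployment descriptor. */
    void register(String urlPattern, MiniServlet servlet) {
        mappings.put(urlPattern, servlet);
    }

    /** Finds the servlet mapped to the path and lets it build the response body. */
    String handle(String httpMethod, String path, Map<String, String> params) {
        MiniServlet servlet = mappings.get(path);
        if (servlet == null) {
            return "404 Not Found";
        }
        return servlet.service(httpMethod, params);
    }
}
```

    Registering HelloServlet under "/hello" and calling handle("GET", "/hello",
    params) reaches doGet(); a POST to the same path falls back to the default
    doPost(), and an unmapped path is answered by the container itself.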

    Some of the important work done by the web container:

    Communication Support - The container provides an easy way of communication
    between the web server and the servlets and JSPs. Because of the container,
    we don't need to build a server socket to listen for any request from the
    web server, parse the request and generate the response. All these
    important and complex tasks are done by the container, and all we need to
    focus on is the business logic of our applications.

    Lifecycle and Resource Management - The container takes care of managing
    the life cycle of a servlet: loading the servlets into memory, initializing
    them, invoking servlet methods and destroying them. The container also
    provides utilities like JNDI for resource pooling and management.

    Multithreading Support - The container creates a new thread for every
    request to the servlet, and when the request is processed the thread dies.
    So servlets are not initialized for each request, which saves time and
    memory.

    JSP Support - JSPs don't look like normal Java classes, and the web
    container provides support for JSP. Every JSP in the application is
    compiled by the container and converted to a servlet, and then the
    container manages it like other servlets.

    Miscellaneous Tasks - The web container manages the resource pool, does
    memory optimizations, runs the garbage collector, and provides security
    configurations, support for multiple applications, hot deployment and
    several other tasks behind the scenes that make our life easier.

    Web Application Directory Structure

    Java web applications are packaged as a Web Archive (WAR), which has a
    defined structure. You can export the above dynamic web project as a WAR
    file and unzip it to check the hierarchy. It will be something like the
    image below.

    Deployment Descriptor

    The web.xml file is the deployment descriptor of the web application and
    contains mappings for servlets (prior to 3.0), welcome pages, security
    configurations, session timeout settings etc.

    That's all for the Java web application startup tutorial; we will explore
    Servlets and JSPs more in future posts.

    MySQL

    A relational database stores data in separate tables rather than putting
    all the data in one big storeroom. The database structures are organized
    into physical files optimized for speed. The logical model, with objects
    such as databases, tables, views, rows, and columns, offers a flexible
    programming environment. You set up rules governing the relationships
    between different data fields, such as one-to-one, one-to-many, unique,
    required or optional, and "pointers" between different tables. The database
    enforces these rules, so that with a well-designed database, your
    application never sees inconsistent, duplicate, orphan, out-of-date, or
    missing data.

    The SQL part of "MySQL" stands for "Structured Query Language". SQL is the
    most common standardized language used to access databases. Depending on
    your programming environment, you might enter SQL directly (for example, to
    generate reports), embed SQL statements into code written in another
    language, or use a language-specific API that hides the SQL syntax.

    SQL is defined by the ANSI/ISO SQL Standard. The SQL standard has been
    evolving since 1986 and several versions exist. In this manual, "SQL-92"
    refers to the standard released in 1992 and "SQL:1999" refers to the
    standard released in 1999.

    [...] and use it without paying anything. If you wish, you may study the
    source code and change it to suit your needs. The MySQL software uses the
    GPL (GNU General Public License), http://www.fsf.org/licenses/, to define
    what you may and may not do with the software in different situations. If
    you feel uncomfortable with the GPL or need to embed MySQL code into a
    commercial application, you can buy a commercially licensed version from
    us. See the MySQL Licensing Overview for more information
    (http://www.mysql.com/company/legal/licensing/).

    The MySQL Database Server is very fast, reliable, scalable, and easy to
    use.

    If that is what you are looking for, you should give it a try. MySQL Server
    can run comfortably on a desktop or laptop, alongside your other
    applications, web servers, and so on, requiring little or no attention. If
    you dedicate an entire machine to MySQL, you can adjust the settings to
    take advantage of all the memory, CPU power, and I/O capacity available.
    MySQL can also scale up to clusters of machines, networked together.

    You can find a performance comparison of MySQL Server with other database
    managers on our benchmark page.

    MySQL Server was originally developed to handle large databases much faster
    than existing solutions and has been successfully used in highly demanding
    production environments for several years. Although under constant
    development, MySQL Server today offers a rich and useful set of functions.
    Its connectivity, speed, and security make MySQL Server highly suited for
    accessing databases on the Internet.

    MySQL Server works in client/server or embedded systems.

    The MySQL Database Software is a client/server system that consists of a
    multi-threaded SQL server that supports different backends, several
    different client programs and libraries, administrative tools, and a wide
    range of application programming interfaces (APIs).

    We also provide MySQL Server as an embedded multi-threaded library that you
    can link into your application to get a smaller, faster, easier-to-manage
    standalone product.

    A large amount of contributed MySQL software is available.

    MySQL Server has a practical set of features developed in close cooperation
    with our users. It is very likely that your favorite application or
    language supports the MySQL Database Server.

    The official way to pronounce "MySQL" is "My Ess Que Ell" (not "my
    sequel"), but we do not mind if you pronounce it as "my sequel" or in some
    other localized way.

    SYSTEM DESIGN DEVELOPMENT

    SYSTEM ARCHITECTURE:

    DATA FLOW DIAGRAM:

    1. The DFD is also called a bubble chart. It is a simple graphical
    formalism that can be used to represent a system in terms of the input data
    to the system, the various processing carried out on this data, and the
    output data generated by this system.

    2. The data flow diagram (DFD) is one of the most important modeling tools.
    It is used to model the system components. These components are the system
    process, the data used by the process, an external entity that interacts
    with the system and the information flows in the system.

    3. The DFD shows how the information moves through the system and how it is
    modified by a series of transformations. It is a graphical technique that
    depicts information flow and the transformations that are applied as data
    moves from input to output.

    4. A DFD may be used to represent a system at any level of abstraction. A
    DFD may be partitioned into levels that represent increasing information
    flow and functional detail.

    [Figure: data flow diagram. Recoverable labels: Manager, Master, tweeter
    account, Verify account, Location, Hash tags, Tweet count, Forming cluster
    reports, Location info; process numbers 1.2.1 to 1.2.4 and 2.3.]

    UML DIAGRAMS

    UML stands for Unified Modeling Language. UML is a standardized
    general-purpose modeling language in the field of object-oriented software
    engineering. The standard is managed, and was created by, the Object
    Management Group.

    The goal is for UML to become a common language for creating models of
    object-oriented computer software. In its current form UML comprises two
    major components: a meta-model and a notation. In the future, some form of
    method or process may also be added to, or associated with, UML.

    The Unified Modeling Language is a standard language for specifying,
    visualizing, constructing and documenting the artifacts of a software
    system, as well as for business modeling and other non-software systems.

    The UML represents a collection of best engineering practices that have
    proven successful in the modeling of large and complex systems.

    The UML is a very important part of developing object-oriented software and
    the software development process. The UML uses mostly graphical notations
    to express the design of software projects.

    GOALS:

    The primary goals in the design of the UML are as follows:

    1. Provide users a ready-to-use, expressive visual modeling language so
    that they can develop and exchange meaningful models.
    2. Provide extendibility and specialization mechanisms to extend the core
    concepts.
    3. Be independent of particular programming languages and development
    processes.
    4. Provide a formal basis for understanding the modeling language.
    5. Encourage the growth of the OO tools market.
    6. Support higher level development concepts such as collaborations,
    frameworks, patterns and components.
    7. Integrate best practices.

    USE CASE DIAGRAM:

    A use case diagram in the Unified Modeling Language (UML) is a type of
    behavioral diagram defined by and created from a use-case analysis. Its
    purpose is to present a graphical overview of the functionality provided by
    a system in terms of actors, their goals (represented as use cases), and
    any dependencies between those use cases. The main purpose of a use case
    diagram is to show what system functions are performed for which actor.
    Roles of the actors in the system can be depicted.

    CLASS DIAGRAM:

    In software engineering, a class diagram in the Unified Modeling Language
    (UML) is a type of static structure diagram that describes the structure of
    a system by showing the system's classes, their attributes, operations (or
    methods), and the relationships among the classes. It explains which class
    contains information.

    SEQUENCE DIAGRAM:

    ACTIVITY DIAGRAM:

    Activity diagrams are graphical representations of workflows of stepwise
    activities and actions with support for choice, iteration and concurrency.
    In the Unified Modeling Language, activity diagrams can be used to describe
    the business and operational step-by-step workflows of components in a
    system. An activity diagram shows the overall flow of control.

    5. SCREEN LAYOUT

    [Figures: application screen layouts.]

    6. SYSTEM TESTING

    The purpose of testing is to discover errors. Testing is the process of
    trying to discover every conceivable fault or weakness in a work product.
    It provides a way to check the functionality of components, sub-assemblies,
    assemblies and/or a finished product. It is the process of exercising
    software with the intent of ensuring that the software system meets its
    requirements and user expectations and does not fail in an unacceptable
    manner. There are various types of test. Each test type addresses a
    specific testing requirement.

    TYPES OF TESTS

    Unit testing

    Unit testing involves the design of test cases that validate that the
    internal program logic is functioning properly, and that program inputs
    produce valid outputs. All decision branches and internal code flow should
    be validated. It is the testing of individual software units of the
    application. It is done after the completion of an individual unit, before
    integration. This is structural testing that relies on knowledge of the
    unit's construction and is invasive. Unit tests perform basic tests at
    component level and test a specific business process, application, and/or
    system configuration. Unit tests ensure that each unique path of a business
    process performs accurately to the documented specifications and contains
    clearly defined inputs and expected results.

    Integration testing

    Integration tests are designed to test integrated software components to
    determine if they actually run as one program. Testing is event driven and
    is more concerned with the basic outcome of screens or fields. Integration
    tests demonstrate that although the components were individually
    satisfactory, as shown by successful unit testing, the combination of
    components is correct and consistent. Integration testing is specifically
    aimed at exposing the problems that arise from the combination of
    components.

    Functional test

    Functional tests provide systematic demonstrations that functions tested
    are available as specified by the business and technical requirements,
    system documentation, and user manuals.

    Functional testing is centered on the following items:

    Valid Input: identified classes of valid input must be accepted.
    Invalid Input: identified classes of invalid input must be rejected.
    Functions: identified functions must be exercised.
    Output: identified classes of application outputs must be exercised.
    Systems/Procedures: interfacing systems or procedures must be invoked.

    Organization and preparation of functional tests is focused on
    requirements, key functions, or special test cases. In addition, systematic
    coverage pertaining to identified business process flows, data fields,
    predefined processes, and successive processes must be considered for
    testing. Before functional testing is complete, additional tests are
    identified and the effective value of current tests is determined.

    System Test

    System testing ensures that the entire integrated software system meets
    requirements. It tests a configuration to ensure known and predictable
    results. An example of system testing is the configuration-oriented system
    integration test. System testing is based on process descriptions and
    flows, emphasizing pre-driven process links and integration points.

    White Box Testing

    White box testing is testing in which the software tester has knowledge of
    the inner workings, structure and language of the software, or at least its
    purpose. It is used to test areas that cannot be reached from a black box
    level.

    Black Box Testing

    Black box testing is testing the software without any knowledge of the
    inner workings, structure or language of the module being tested. Black box
    tests, as most other kinds of tests, must be written from a definitive
    source document, such as a specification or requirements document. It is
    testing in which the software under test is treated as a black box: you
    cannot "see" into it. The test provides inputs and responds to outputs
    without considering how the software works.

    6.1 Unit Testing:

    Unit testing is usually conducted as part of a combined code and unit test
    phase of the software lifecycle, although it is not uncommon for coding and
    unit testing to be conducted as two distinct phases.

    Test strategy and approach

    Field testing will be performed manually and functional tests will be
    written in detail.

    Test objectives

    All field entries must work properly.
    Pages must be activated from the identified link.
    The entry screen, messages and responses must not be delayed.

    Features to be tested

    Verify that the entries are of the correct format.
    No duplicate entries should be allowed.
    All links should take the user to the correct page.
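    The first two features to be tested (correct format, no duplicates) lend
    themselves to automated unit tests. The following sketch is illustrative
    only: the class EntryValidator and the hashtag format rule are assumptions
    made for this example and are not taken from the project's code.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class EntryValidator {
    // Assumed format for a hashtag entry: '#' followed by 1 to 50 letters, digits or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#[A-Za-z0-9_]{1,50}");

    private final Set<String> seen = new LinkedHashSet<>();

    /** Format check: the whole entry must match the assumed hashtag pattern. */
    static boolean hasCorrectFormat(String entry) {
        return entry != null && HASHTAG.matcher(entry).matches();
    }

    /** Accepts an entry only if it is well formed and not already stored (case-insensitive). */
    boolean add(String entry) {
        if (!hasCorrectFormat(entry)) {
            return false;
        }
        return seen.add(entry.toLowerCase(Locale.ROOT));
    }

    /** Number of accepted entries. */
    int size() {
        return seen.size();
    }
}
```

    A unit test then feeds one valid entry, one duplicate and one malformed
    entry, and checks that only the first is accepted. This mirrors the valid
    and invalid input classes described under functional testing.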


    6.2 Integration Testing

    Software integration testing is the incremental integration testing of two
    or more integrated software components on a single platform to produce
    failures caused by interface defects.

    The task of the integration test is to check that components or software
    applications, e.g. components in a software system or, one step up,
    software applications at the company level, interact without error.

    Test Results: All the test cases mentioned above passed successfully. No
    defects encountered.

    6.3 Acceptance Testing

    User Acceptance Testing is a critical phase of any project and requires
    significant participation by the end user. It also ensures that the system
    meets the functional requirements.

    Test Results: All the test cases mentioned above passed successfully. No
    defects encountered.

    Test Case Report 1

    (Use one template for each test case)

    GENERAL INFORMATION

    Test Stage: Unit / Functionality / Interface / Performance / Acceptance

    Test Date: 09/09/2010

    System Date, if applicable: 09/03/2015

    Tester: Janardhan

    Test Case Number: 1

    Test Case Description: Unit testing focuses verification effort on the smallest unit of software, the module. The local data structure is examined to ensure that the data stored temporarily maintains its integrity during all steps in the algorithm's execution. Boundary conditions are tested to ensure that the module operates properly at boundaries established to limit or restrict processing.

    Results: Pass (OK) / Fail

    INTRODUCTION

    Requirement(s) to be tested: Getting Twitter Account

    Roles and Responsibilities: Gathering the requirements of the project, designing, and testing.

    Set Up Procedures: By installing Eclipse.

    ENVIRONMENTAL NEEDS

    Hardware: PC with minimum 20 GB hard disk and 1 GB RAM.

    Software: Windows XP/2000, Oracle, Eclipse.

    TEST

    Test Items and Features: LocationID and Retweet Count.

    Procedural Steps: If the user enters the Location ID, the user is redirected to the appropriate page, which confirms that the test is accepted.

    Expected Results of Case: If the page is redirected, the test case has succeeded.
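
    The boundary-condition idea in the test case description can be sketched as an automated unit test. Everything named here is an assumption made for illustration: the function page_for_location, the valid range 1 to MAX_LOCATION_ID, and the path format are hypothetical and do not come from the project.

```python
import unittest

MAX_LOCATION_ID = 9999  # hypothetical upper bound, chosen for illustration


def page_for_location(location_id):
    """Return the page path for a valid location ID (hypothetical rule)."""
    if not isinstance(location_id, int) or isinstance(location_id, bool):
        raise TypeError("location_id must be an integer")
    if location_id < 1 or location_id > MAX_LOCATION_ID:
        raise ValueError("location_id out of range")
    return f"/locations/{location_id}"


class PageForLocationTest(unittest.TestCase):
    def test_valid_id_redirects(self):
        self.assertEqual(page_for_location(42), "/locations/42")

    def test_boundaries_are_accepted(self):
        # The smallest and largest valid IDs must both work.
        self.assertEqual(page_for_location(1), "/locations/1")
        self.assertEqual(
            page_for_location(MAX_LOCATION_ID),
            f"/locations/{MAX_LOCATION_ID}",
        )

    def test_values_just_outside_boundaries_are_rejected(self):
        # One step below and one step above the valid range.
        for bad in (0, MAX_LOCATION_ID + 1):
            with self.assertRaises(ValueError):
                page_for_location(bad)
```

    Saved as a module, the tests can be run with python -m unittest. Checking the values on and just outside each boundary is what distinguishes a boundary test from a single happy-path check.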

    Test Case Report 2

    (Use one template for each test case)

    GENERAL INFORMATION

    Test Stage: Unit / Functionality / Interface / Performance / Acceptance

    Test Date: 09/09/2010

    System Date, if applicable: 09/02/2015

    Tester: Janardhan

    Test Case Number: 2

    Test Case Description: Unit testing focuses verification effort on the smallest unit of software, the module. The local data structure is examined to ensure that the data stored temporarily maintains its integrity during all steps in the algorithm's execution. Boundary conditions are tested to ensure that the module operates properly at boundaries established to limit or restrict processing.

    Results: Pass (OK) / Fail

    INTRODUCTION

    Requirement(s) to be tested: After Getting

    characteristics of the Big Data are 1) huge with heterogeneous and diverse data sources, 2) autonomous with distributed and decentralized control, and 3) complex and evolving in data and knowledge associations. Such combined characteristics suggest that Big Data require a "big mind" to consolidate data for maximum values [27].

    To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.

    At the data level, the autonomous information sources and the variety of the data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies. Developing a safe and sound information sharing protocol is a major challenge.

    At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites, and to fuse decisions from multiple sources to gain the best model out of the Big Data.

    At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes with time and other possible factors. A system needs to be carefully designed so that unstructured data can be linked through their complex relationships to form useful patterns, and the growth of data volumes and item relationships should help form legitimate patterns to predict the trend and future.
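
    The model-level idea of combining locally discovered patterns into a global view can be illustrated with a minimal sketch. It assumes each site reports a support value in [0, 1] for every pattern it found and that sites are weighted, for example by their share of the records. The weighted average used here is one simple fusion rule chosen for illustration, not the method of any cited work; the function name, site names, and patterns are hypothetical.

```python
from collections import defaultdict


def fuse_local_patterns(site_patterns, site_weights, min_support=0.5):
    """Combine per-site pattern supports into a weighted global view.

    site_patterns: {site: {pattern: local support in [0, 1]}}
    site_weights:  {site: non-negative weight, e.g. share of records}
    A pattern that a site did not report counts as support 0 at that site.
    """
    total_weight = sum(site_weights[site] for site in site_patterns)
    if total_weight <= 0:
        raise ValueError("site weights must sum to a positive value")

    global_support = defaultdict(float)
    for site, patterns in site_patterns.items():
        share = site_weights[site] / total_weight
        for pattern, support in patterns.items():
            # Each site contributes in proportion to its weight.
            global_support[pattern] += share * support

    # Keep only patterns whose fused support clears the threshold.
    return {p: s for p, s in global_support.items() if s >= min_support}


# Hypothetical inputs: three sites, two candidate rules.
example_patterns = {
    "site_a": {"x->y": 0.8, "y->z": 0.2},
    "site_b": {"x->y": 0.6},
    "site_c": {"x->y": 0.4, "y->z": 0.9},
}
example_weights = {"site_a": 2, "site_b": 1, "site_c": 1}
fused = fuse_local_patterns(example_patterns, example_weights)
```

    With these inputs only the rule "x->y" survives: it is reported at every site, whereas "y->z" is strong at one small site only. No raw records leave the sites, which is the point of fusing at the pattern level.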

    We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time. We can further stimulate the participation of the public audiences in the data production circle for societal and economical events. The era of Big Data has arrived.

    BIBLIOGRAPHY

    [1] R. Ahmed and G. Karypis, "Algorithms for Mining the Evolution of Conserved Relational States in Dynamic Networks," Knowledge and Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.

    [2] M.H. Alam, J.W. Ha, and S.K. Lee, "Novel Approaches to Crawling Important Pages Early," Knowledge and Information Systems, vol. 33, no. 3, pp. 707-734, Dec. 2012.

    [3] S. Aral and D. Walker, "Identifying Influential and Susceptible Members of Social Networks," Science, vol. 337, pp. 337-341, 2012.

    [4] A. Machanavajjhala and J.P. Reiter, "Big Privacy: Protecting Confidentiality in Big Data," ACM Crossroads, vol. 19, no. 1, pp. 20-23, 2012.

    [5] S. Banerjee and N. Agarwal, "Analyzing Collective Behavior from Blogs Using Swarm Intelligence," Knowledge and Information Systems, vol. 33, no. 3, pp. 523-547, Dec. 2012.

    [6] E. Birney, "The Making of ENCODE: Lessons for Big-Data Projects," Nature, vol. 489, pp. 49-51, 2012.

    [7] J. Bollen, H. Mao, and X. Zeng, "Twitter Mood Predicts the Stock Market," J. Computational Science, vol. 2, no. 1, pp. 1-8, 2011.

    [8] S. Borgatti, A. Mehra, D. Brass, and G. Labianca, "Network Analysis in the Social Sciences," Science, vol. 323, pp. 892-895, 2009.

    [9] J. Bughin, M. Chui, and J. Manyika, Clouds, Big Data, and Smart Assets: Ten Tech-Enabled Business Trends to Watch. McKinsey Quarterly, 2010.

    [10] D. Centola, "The Spread of Behavior in an Online Social Network Experiment," Science, vol. 329, pp. 1194-1197, 2010.

    [11] E.Y. Chang, H. Bai, and K. Zhu, "Parallel Algorithms for Mining Large-Scale Rich-Media Data," Proc. 17th ACM Int'l Conf. Multimedia (MM '09), pp. 917-918, 2009.

    [12] R. Chen, K. Sivakumar, and H. Kargupta, "Collective Mining of Bayesian Networks from Distributed Heterogeneous Data," Knowledge and Information Systems, vol. 6, no. 2, pp. 164-187, 2004.

    [13] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, "Efficient Algorithms for Influence Maximization in Social Networks," Knowledge and Information Systems, vol. 33, no. 3, pp. 577-601, Dec. 2012.

    [14] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.

    [15] G. Cormode and D. Srivastava, "Anonymized Data: Generation, Models, Usage," Proc. ACM SIGMOD Int'l Conf. Management Data, pp. 1015-1018, 2009.

    [16] S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson, "Ricardo: Integrating R and Hadoop," Proc. ACM SIGMOD Int'l Conf. Management Data (SIGMOD '10), pp. 987-998, 2010.

    [17] P. Dewdney, P. Hall, R. Schilizzi, and J. Lazio, "The Square Kilometre Array," Proc. IEEE, vol. 97, no. 8, pp. 1482-1496, Aug. 2009.

    [18] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.

    [19] G. Duncan, "Privacy by Design," Science, vol. 317, pp. 1178-1179, 2007.

    [20] B. Efron, "Missing Data, Imputation, and the Bootstrap," J. Am. Statistical Assoc., vol. 89, no. 426, pp. 463-475, 1994.

    [21] A. Ghoting and E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics," Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.

    [22] D. Gillick, A. Faria, and J. DeNero, MapReduce: Distributed Computing for Machine Learning, Berkeley, Dec. 2006.

    [23] M. Helft, "Google Uses Searches to Track Flu's Spread," The New York Times, http://www.nytimes.com/2008/11/12/technology/internet/12flu.html, 2008.

    [24] D. Howe et al., "Big Data: The Future of Biocuration," Nature, vol. 455, pp. 47-50, Sept. 2008.

    [25] B. Huberman, "Sociology of Science: Big Data Deserve a Bigger Audience," Nature, vol. 482, p. 308, 2012.

    [26] "IBM What Is Big Data: Bring Big Data to the Enterprise," http://www-01.ibm.com/software/data/bigdata/, IBM, 2012.

    [27] A. Jacobs, "The Pathologies of Big Data," Comm. ACM, vol. 52, no. 8, pp. 36-44, 2009.

    [28] I. Kopanas, N. Avouris, and S. Daskalaki, "The Role of Domain Knowledge in a Large Scale Data Mining Project," Proc. Second Hellenic Conf. AI: Methods and Applications of Artificial Intelligence, I.P. Vlahavas, C.D. Spyropoulos, eds., pp. 288-299, 2002.

    [29] A. Labrinidis and H. Jagadish, "Challenges and Opportunities with Big Data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.

    [30] Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," J. Cryptology, vol. 15, no. 3, pp. 177-206, 2002.

    [31] W. Liu and T. Wang, "Online Active Multi-Field Learning for Efficient Email Spam Filtering," Knowledge and Information Systems, vol. 33, no. 1, pp. 117-136, Oct. 2012.

    [32] J. Lorch, B. Parno, J. Mickens, M. Raykova, and J. Schiffman, "Shroud: Ensuring Private Access to Large-Scale Data in the Data Center," Proc. 11th USENIX Conf. File and Storage Technologies (FAST '13), 2013.

    [33] D. Luo, C. Ding, and H. Huang, "Parallelization with Multiplicative Algorithms for Big Data Mining," Proc. IEEE 12th Int'l Conf. Data Mining, pp. 489-498, 2012.

    [34] J. Mervis, "U.S. Science Policy: Agencies Rally to Tackle Big Data," Science, vol. 336, no. 6077, p. 22, 2012.

    [35] F. Michel, "How Many Photos Are Uploaded to Flickr Every Day and Month?" http://www.flickr.com/photos/franckmichel/6855169886/, 2012.

    [36] T. Mitchell, "Mining our Reality," Science, vol. 326, pp. 1644-1645, 2009.

    [37] Nature Editorial, "Community Cleverness Required," Nature, vol. 455, no. 7209, p. 1, Sept. 2008.

    [38] S. Papadimitriou and J. Sun, "Disco: Distributed Co-Clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining," Proc. IEEE Eighth Int'l Conf. Data Mining (ICDM '08), pp. 512-521, 2008.

    [39] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-Core and Multiprocessor Systems," Proc. IEEE 13th Int'l Symp. High Performance Computer Architecture (HPCA '07), pp. 13-24, 2007.

    [40] A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.

    [41] C. Reed, D. Thompson, W. Majid, and K. Wagstaff, "Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision," Proc. Int'l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.

    [42] E. Schadt, "The Changing Privacy Landscape in the Era of Big Data," Molecular Systems, vol. 8, article 612, 2012.

    [43] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd VLDB Conf., 1996.

    [44] A. da Silva, R. Chiky, and G. Hébrail, "A Clustering Approach for Sampling Data Streams in Sensor Networks," Knowledge and Information Systems, vol. 32, no. 1, pp. 1-23, July 2012.

    [45] K. Su, H. Huang, X. Wu, and S. Zhang, "A Logical Framework for Identifying Quality Knowledge from Different Data Sources," Decision Support Systems, vol. 42, no. 3, pp. 1673-1683, 2006.

    [46] "Twitter Blog, Dispatch from the Denver Debate," http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.

    [47] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, "Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters," Proc. Int'l Conf. Data Mining Workshops (ICDMW '09), pp. 296-301, 2009.

    [48] C. Wang, S.S.M. Chow, Q. Wang, K. Ren, and W. Lou, "Privacy-Preserving Public Auditing for Secure Cloud Storage," IEEE Trans. Computers, vol. 62, no. 2, pp. 362-375, Feb. 2013.

    [49] X. Wu and X. Zhu, "Mining with Noise Knowledge: Error-Aware Data Mining," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 38, no. 4, pp. 917-932, July 2008.

    [50] X. Wu and S. Zhang, "Synthesizing High-Frequency Rules from Different Data Sources," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2, pp. 353-367, Mar./Apr. 2003.

    [51] X. Wu, C. Zhang, and S. Zhang, "Database Classification for Multi-Database Mining," Information Systems, vol. 30, no. 1, pp. 71-88, 2005.

    [52] X. Wu, "Building Intelligent Learning Database Systems," AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.

    [53] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, "Online Feature Selection with Streaming Features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178-1192, May 2013.

    [54] A. Yao, "How to Generate and Exchange Secrets," Proc. 27th Ann. Symp. Foundations Computer Science (FOCS) Conf., pp. 162-167, 1986.

    [55] M. Ye, X. Wu, X. Hu, and D. Hu, "Anonymizing Classification Data Using Rough Set Theory," Knowledge-Based Systems, vol. 43, pp. 82-94, 2013.

    [56] J. Zhao, J. Wu, X. Feng, H. Xiong, and K. Xu, "Information Propagation in Online Social Networks: A Tie-Strength Perspective," Knowledge and Information Systems, vol. 32, no. 3, pp. 589-608, Sept. 2012.

    [57] X. Zhu, P. Zhang, X. Lin, and Y. Shi, "Active Learning From Stream Data Using Optimal Weight Classifier Ensemble," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 40, no. 6, pp. 1607-1621, Dec. 2010.

