R handbook - from Installation to Text Analytics

Copyright (c) 2014 Redwood Associates Business Solutions Private Limited. All rights reserved.

How to use R

From Installation to Text Analytics

Table of Contents

From the Team
Copyright
Something helpful

Section 1

PART 1 – WHAT IS R
- Introduction
- What is R?
- Why R?
- Summary

PART 2 – INSTALLATION
- Introduction
- How can R be installed?
- Overview of the R GUI
- Installation of R Studio
- Overview of R Studio GUI
- Summary

Section 2

PART 1 – DATA TYPES
- Introduction
- Why are DATA TYPES important?
- Data types
- Creating data types in R
- Summary

PART 2 – DATA STRUCTURES & VECTORS
- Introduction
- What is a data structure
- What is the difference between data structure and data type
- Type of data structure – Vectors
- How to create a Vector in R
- Mixing up data types in a Vector
- Replacing the contents of a Vector
- Arithmetic functions between Vectors
- Identifying elements in a Vector
- Replacing contents in a Vector
- Using a function to Index
- Speeding up the task with Operators
- Summary

PART 3 – DATAFRAMES
- Introduction
- What is a Dataframe
- Creating a Dataframe in R
- Functions that can be carried out with Dataframes
- Summary

PART 4 – LIST & MATRIX
- Introduction
- What is a List
- What is a Matrix
- Creating a List in R
- Creating a Matrix in R
- Creating a Matrix out of a Dataframe
- Summary

PART 5 – FACTORS
- Introduction
- What is a factor
- How to create a factor in R
- Summary

Section 3

PART 1 – PACKAGES
- Introduction
- What is a Package
- Installing and Loading a Package
- Importing an Excel file into R
- Importing a CSV file into R
- Summary

PART 2 – Exporting and Reading data in R
- Introduction
- Exporting to Excel
- Exporting to CSV
- Reading a file in R
- Summary

Section 4

PART 1 – Logical operators and If condition
- Introduction
- What is a logical operator
- How to execute a logical operator in R
- What is IF Condition in R
- Summary

PART 2 – Merging data
- Introduction
- What is merging of data
- What are the different ways to merge data
- How to carry out a merge in R
- Summary

Section 5

PART 1 – Understanding text analytics
- Introduction
- What is text analytics
- How is Text Analytics useful
- Important terms in Text Analytics
- What is needed to create a Word Cloud
- Summary

PART 2 – Understanding text analytics
- Introduction
- How to create a Word Cloud
- Step 1: Creating a dataframe
- Step 2: Installing the tm package
- Step 3: Understanding TDM
- Step 4: Creating a corpus
- Step 5: Cleaning the corpus
- Step 6: Creating the term document matrix
- Step 7: Calculating frequencies
- Step 8: Creating the word cloud
- Summary




From the Team

A note from the Online Education team

ANALYTICS TRAINING INSTITUTE

Hello,

Thanks to the many students who have signed up for our courses, we are delighted to

offer all our online lectures as downloadable material. We know that learning should

be continuous, so through this material we hope that you will take your time within

your busy schedule to really understand the concepts and techniques of this

fascinating open source tool – R!

We have tried to make your learning easy by highlighting key takeaways and screen

grabs of the tool so that you can continue your learning offline as well.

We welcome you to post any comments or questions on this material in the

Discussion forum as many of you have been doing or just reach out to us at

[email protected].

Enjoy the learning experience and thank you for choosing us to be your partner in

your journey of discovery!

The team at ATI





Copyright

(c) 2014 Redwood Associates Business Solutions Private Limited

All rights reserved. Without limiting rights under the copyright reserved above, no

part of this publication may be reproduced, stored, introduced into a retrieval

system, distributed or transmitted in any form or by any means, including without

limitation photocopying, recording, or other electronic or mechanical methods,

without the prior written permission of the publisher, except in the case of brief

quotations embodied in critical reviews and other noncommercial uses permitted by

copyright law. The scanning, uploading, and/or distribution of this document via the

internet or via any other means without the permission of the publisher is illegal and

punishable by law. Please do not participate in or encourage electronic piracy of

copyrightable materials.

For permission requests, email [email protected]





SOMETHING HELPFUL

Here are a few things that you would probably find helpful before you begin.

Sections:

There are 5 sections in this material starting with the installation of R and R Studio

right up to generating a Word Cloud in R which showcases the text mining capabilities

of R.

Online videos:

This material is a supplement to the online videos available on

https://www.udemy.com/analyticstraining/?dtcode=VQRaQsx1KWR2

This material corresponds to the section on R.

Material:

The online class format supports downloadable material for each section. So perhaps

it would be a good idea to check each section for additional downloadable material

like case studies or sample data to work on and so forth.

Ready to begin?

https://www.udemy.com/analyticstraining/?dtcode=VQRaQsx1KWR2




Section 1 Overview and Installation




PART 1 – WHAT IS R

INTRODUCTION

R is an open source statistical tool that not only manages data but also carries out sophisticated analytical processes. Before looking at how R works, it is important to get a good overview of R. So, here’s what will be covered in this tutorial:

- What is R?
- Why R?

WHAT IS R?

So, to begin let’s start with a very basic question. What is R?

1. As we already know, R is a statistical tool that is on par with other tools such as SAS, SPSS and Python in terms of what it can do.

2. R can manage and analyse data. It can execute a wide range of statistical techniques, such as linear regression, logistic regression, forecasting and decision trees.

WHY R?

So what makes R stand out when compared to other statistical tools? Let us break it down.

1. Firstly, R can work with any type of data and can handle data of any size. So whether the data you are working with is small or really big, R will be able to handle it.

2. R can work with data received in a wide range of file formats, whether text, CSV, SAS and so on.

3. R offers really great visualization of data. It can connect with Google Maps and motion charts.

4. Next, and this is what makes R so much more powerful than other statistical tools, it is open source. Open source does not just mean that it can be used for free, but that anyone can contribute to it as well.




5. R does not use much code, even if it is handling large volumes of data or carrying out complicated analytical techniques.

6. As was mentioned earlier, R being open source means anyone can contribute to it. This is why R has a huge community of contributors who keep adding functionality to it almost daily. This is the reason why even the most complicated techniques can be executed in R by just calling a function. So, when using R, we as users do not need to worry about how to perform a linear regression or a logistic regression. The code to execute this and many other advanced analytical functions is already built in and refined by those in the R community on a regular basis.

7. R is used by a lot of big corporations like Facebook, Google, Mozilla, Lloyds and Merck, among others. This goes a long way in validating the capability of R and adds to its credibility.

SUMMARY

In this material, we covered the reasons which make R a powerful statistical tool. To summarize,

R is an open source statistical tool that can be used freely by anyone.

It is improved upon every day by a large community of contributors who keep adding new code and functions to it.

R can work with big or small data, and also with the different formats in which data is usually presented.

It does not use much code and offers great data visualization, making it a popular statistical tool in many global corporations.




PART 2 – INSTALLATION

INTRODUCTION

To start leveraging the power of R, it first needs to be installed. So, here’s what will be covered in this tutorial:

- Installation of R
- Overview of a typical GUI or Graphic User Interface of R
- Installation of R Studio
- Overview of the GUI of R Studio

HOW CAN R BE INSTALLED?

To begin, let’s look at how to install R. To install R, click on the link displayed: http://cran.r-project.org/bin/windows/base/old/3.0.2 On opening this link, different options to download R are available based on system configuration and operating system, such as R for a 32-bit system, R for a 64-bit system or R for Windows. Download the version of R that is best suited to the operating system being used.

When you update your version of R, the earlier version is NOT automatically uninstalled. Further, R Studio allows you to run multiple versions of R (though not in the same session). Therefore, in R Studio, find out which version of R is running by typing R.Version(). The default version of R that R Studio runs can be changed from Tools > Options > R General.

Before proceeding with the rest of this tutorial, we suggest that you download R if you haven’t already.

http://cran.r-project.org/
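As a quick check after installation, the running version can be inspected from the Console. The sketch below uses only base R; the variable name info is purely illustrative and not part of the original material.

```r
# Print a one-line description of the running R version
print(R.version.string)

# R.Version() returns a list with more detail about the installation
info <- R.Version()
print(info$major)  # major version number, stored as text
print(info$minor)  # minor version number, stored as text
```

The exact output depends on which version of R is installed on your machine.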




OVERVIEW OF THE R GUI

Shown here is a typical R GUI or Graphic User Interface.

At first glance, all that is visible on the R GUI is a single screen, which is known as the Console. The Console can be used to input data as well as view output, but we recommend using the Console only to view output. Commands or inputs in R are referred to as Scripts. To write a script, go to File and select New Script.




A new blank window called R Editor opens.

Think of a Script in R as code or syntax that is written in order to tell R what it needs to execute. For example, let’s enter a = 1, which means R is being told to create a variable “a” and store a value of “1” against it.

To execute this script or code, press Control + Enter. As shown in the image below, the command is executed and the output is displayed in the Console.




The output a=1 is displayed in red in the Console. So, essentially input is specified in R Editor and the output is displayed in the Console.
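The example above can be written as a short script. This is a minimal sketch: the assignment is taken from the text, while the print call is added here only to show the stored value in the Console.

```r
# Create a variable "a" and store the value 1 against it
a = 1

# Show the stored value in the Console
print(a)
```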

INSTALLATION OF R STUDIO

A more user friendly option available to users is R Studio. It has a better GUI and comes with more options. To take a more detailed look at R Studio, let us first install it. To download R Studio, click on the link displayed on the screen: http://www.rstudio.com/ide/download/

We recommend completing the installation of R Studio before proceeding with this tutorial.

http://www.rstudio.com/ide/download/




OVERVIEW OF R STUDIO GUI

As shown in the image, there are 4 sections in R Studio.

Let’s briefly go through each section.



Section 1: Editor

The first section is the Editor section where the script or code that R needs to execute is written. To add more than one script, use the plus sign on the top left hand corner. It is possible to add as many scripts as required using this option.

Using the example looked at earlier, the script or code a = 1 is entered. To execute this code, press Control + Enter.




Section 2: Console

The output appears in the Console window which can be found right below the Editor window. When values appear in the Console section it means that the script or the code has been executed.

Section 3: Workspace

To the right of the Editor section is the Workspace section, where the data being worked on can be viewed.

This includes even data that has been imported from an external source. In the example used, a new variable “a” with a value of “1” was created. Since this is the data currently being worked on, both “a” and “1” are displayed in the Workspace section.



Section 4: Files, Plots, Packages and Help

The remaining tabs in R Studio are Files, Plots, Packages and Help. The Help tab helps in locating functions in R. In the Search field, type what is being searched for and press Enter. For example, if plot is entered in Search, everything related to it is displayed just below.

The other tabs are Packages (which will be covered later), Plots, and Files, which displays all the files that are currently being worked on.

For the rest of the series of tutorials on R, we will be working with R Studio as it has a better GUI than R Editor.




SUMMARY

We are now ready to start using R – to manage data and carry out other types of actions on data. To summarize:

The links to install both R and R Studio have been provided.

An overview of the typical GUI of R has been looked at. Since individual screens need to be opened, a better option is R Studio.

R Studio is more user friendly as all the relevant sections are available at a single glance removing the need to have multiple screens open at a time.

There are 4 sections in R Studio.

a. The first section is the Editor section which is used to enter scripts or codes for R to execute.

b. The second section is the Console where output is displayed.

c. The data that is generated or being worked upon can be found in the Workspace section.

d. Files, Packages and Help make up the fourth section.




Section 2 Data Types and Data Structures




PART 1 – DATA TYPES

INTRODUCTION

What is Analytics without data? Likewise, how can you leverage the amazing capabilities of R without understanding data? This tutorial is an in-depth look at types of data or data types. So, here’s what will be covered in this tutorial:

- Understanding why data types are important
- Different data types
- Creating data types in R

WHY ARE DATA TYPES IMPORTANT?

To begin, it is important to understand why data types are useful and why it is necessary to be able to distinguish between different types of data. Suppose you have been asked to evaluate five different brands of cars; let us call them Brand A, Brand B, Brand C, Brand D and Brand E. If you were asked to calculate the mean of these five cars, how would you go about it? It would be an impossible operation to carry out because all you have is the name or the brand of these cars, and you cannot calculate the mean of names! Now, the situation would have been different if you had some numeric data about these cars. This emphasizes the need to understand the type of data you have to work with, because certain functions can be carried out only on certain types of data. For example, calculating a mean is not possible with character data types like names or brands.
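The point can be illustrated with a short sketch. The brand names come from the example above; the prices are made-up values used only for illustration, and the c() function used to group values is introduced later in this material.

```r
# Brand names are character data, so a mean cannot be computed.
# R issues a warning and returns NA; the warning is suppressed here.
brands <- c("Brand A", "Brand B", "Brand C", "Brand D", "Brand E")
brand_mean <- suppressWarnings(mean(brands))
print(brand_mean)

# Hypothetical prices (illustrative values only) are numeric,
# so the mean can be calculated.
prices <- c(10, 12, 9, 15, 14)
price_mean <- mean(prices)
print(price_mean)
```

The first result is NA because R has nothing numeric to average, while the second gives the average of the five prices.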




DATA TYPES

Data can be of different types. The different types of data one would commonly come across are:

Numeric: Refers to any number or numeric value, including decimals. Eg: 1.2, 2.1

Integer: Refers to any number without a fractional part. Eg: 1, 2, 3

Logical: Refers to any value which is either True or False. Eg: if x = 1 and y = 2, then x being greater than y is False

Character: Refers to textual data. Eg: learning, education

Factor: Refers to data in categories. Eg: City, Gender

Each data type will now be discussed in some detail.




1.1 Numeric data types

Numeric data type is any number or numeric value like 2.1, 1.2 and so on. It could be an integer or a decimal value. In R Studio, to create a numeric data type the syntax y<-3.1 (or y is equal to 3.1) is used. This means that a variable y is being created against which a numeric value of 3.1 is being stored.

To assign a value to a variable, we can use either the symbol <- or =.
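As a small sketch of the two assignment forms, the lines below store the same value using each symbol. The variable name z is introduced here only for comparison and is not part of the original example.

```r
# Both forms assign the value 3.1 to a variable
y <- 3.1
z = 3.1

print(identical(y, z))  # TRUE: both hold the same numeric value
print(class(y))         # "numeric"
```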

1.2 Integer data types

Integer data type indicates any data which stores integer values. In R Studio, numeric data types can be converted to integer data types by using the following syntax: as.integer(numeric value). Eg: as.integer(3.1)

1.3 Logical data types

Logical data type indicates any data where the value is either True or False, but never both. In R Studio, a logical value is typically produced by a comparison. For example, if x <- 1 and y <- 2, then x > y is FALSE (x is assigned 1, y is assigned 2, so x being greater than y is false).
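The comparison described above can be tried directly. In this minimal sketch the variable name result is an assumption added for illustration.

```r
x <- 1
y <- 2

result <- x > y       # a comparison produces a logical value
print(result)         # FALSE
print(class(result))  # "logical"
```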

1.4 Character data types

Character data type stores characters or strings. In R Studio, they have to be written within quotes, and double quotes are the usual choice. For example, the text learning would be written as "learning".

1.5 Factor data types

Factor data type refers to categorical types of data like gender or cities. This data type will be covered in more detail in a separate tutorial.
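As a brief preview ahead of the separate tutorial, a factor can be sketched as follows. The gender values are made up for illustration, and factor() and levels() are base R functions.

```r
# Hypothetical gender values (illustrative only)
gender <- factor(c("Male", "Female", "Female", "Male"))

print(class(gender))   # "factor"
print(levels(gender))  # the distinct categories: "Female" "Male"
```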




CREATING DATA TYPES IN R

How to create a numeric data type in R

In R Studio, let’s create a numeric data type with a variable name of num1 and a numeric value of 3.1 stored against it. Enter these values using the code num1 <- 3.1 and press Control + Enter to execute this statement. The output is displayed in the Console area in blue indicating that the code has been executed. Simultaneously, the values num1 and 3.1 are displayed in the Workspace section.

In order to identify the data type of the variable num1, use a function called class. Type the word class followed by the name of the variable in brackets, as in class(num1), and press Control + Enter to execute it. In the Console area numeric is displayed, indicating that the data type of num1 is numeric.
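The two steps above can be sketched together. The is.numeric() check is an addition not mentioned in the text; it is a base R function that answers the same question with TRUE or FALSE.

```r
num1 <- 3.1

print(class(num1))       # "numeric"
print(is.numeric(num1))  # TRUE
```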




How to create an integer data type in R

A numeric data type can be converted into an integer data type in R. In the example used above, the number 3.1 when converted to an integer gives a value of 3. To convert this numeric data type to an integer data type in R Studio, the function as.integer(numeric variable) is used. R is case-sensitive, so the function name must be written in lowercase. Let us create a new variable num3 and store the integer value against this variable. Enter the code num3 <- as.integer(num1) and press Control + Enter. The values will be displayed in the Workspace section when the code is executed.

In order to determine the data type of num3, the following function will be used: class(num3)




Press Control + Enter to execute this statement. The output in this case would be integer as shown in the Console section.
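A minimal sketch of the conversion, using the variable names from the text. Note that the function is as.integer in lowercase, and that the original variable is left unchanged.

```r
num1 <- 3.1
num3 <- as.integer(num1)  # the fractional part is dropped, leaving 3

print(num3)         # 3
print(class(num3))  # "integer"
print(class(num1))  # num1 itself is unchanged: "numeric"
```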

How to print the contents of a variable

To print the value of any variable like num1 or num3, enter the variable name, say num3, and press Enter. The value will be displayed in the Console; in the case of num3, the value 3 is displayed. Alternatively, use the code print(num3) (print and the variable name within brackets) and execute this.
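Both ways of printing can be sketched as follows. The comments about where each form is useful are general observations about R rather than statements from the original material.

```r
num3 <- as.integer(3.1)

num3         # typing the name prints the value when run at the Console
print(num3)  # explicit print; also works inside functions and loops
```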




How to create a character data type in R

To create a character data type in R, let us create a variable char1 and store a value of "hello" against it. The code to execute this is char1 <- "hello" or char1 = "hello"

Remember that assignment can also be written using the equal to sign.

When this code is executed, in the Workspace section a variable char1 has been created and a value “hello” stored against it.

To find out the data type of this variable, use the class function discussed earlier. Enter the code class(char1) and press Enter. The value character is displayed in the Console area.



An important fact to remember about character data types is that these values are always written within quotes. So any time a value is entered within quotes, R will recognize it as a character data type.
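The effect of quotes can be seen in a short sketch. The variable quoted_number is a hypothetical name added here to show that even a number becomes text once it is quoted.

```r
char1 <- "hello"
print(class(char1))  # "character"

# A number written inside quotes is also treated as text
quoted_number <- "3.1"
print(class(quoted_number))  # "character"
```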

Logical and factor data types will be discussed in more depth in a later section.




SUMMARY

We have covered important data types in this tutorial. Understanding these data types will help in managing and working with data in R. To summarize:

It is important to understand data types in order to determine what type of actions can be carried out with a specific type of data.

Different data types are available: numeric, integer, character, logical and factor.

Different data types can be created in R using the proper syntax. Eg num1 <- 3.1, as.integer(3.1), char1 <- "hello"

The function ‘class’ is used to determine the data type of a variable.

The function ‘print’ is used to print the contents of a variable.




PART 2 – DATA STRUCTURES & VECTORS

INTRODUCTION

The data that you are working on needs to work for you. In other words, it has to be arranged in a way that helps you manage, store and analyze it better. This tutorial will deal with data structures. So, here’s what will be covered in this tutorial:

- Understanding data structures
- Vectors – a type of data structure
- Creating vectors in R

WHAT IS A DATA STRUCTURE

A data structure in simple terms is a way of storing and organizing data. Let us understand this better with the help of an example. Shown here is a table with different types of information stored in it.

When storing information of different types, it will need to be stored across more than one variable. For example, if the data to be stored relates to employee records, then the variables across which this data would be stored would be Name, Age, Address, Nationality, Assessment scores and so on. This collection of information displayed across different variables is referred to as a data structure.



WHAT IS THE DIFFERENCE BETWEEN DATA STRUCTURE AND DATA TYPE

A data structure is different from data type because of the number of values stored. Let’s look at this with the help of an example. If a variable “Name” has been created, and a value “Bob” stored against it, it will result in the creation of a character data type. In a data type only one value is stored. But when different information related to Bob apart from his name, is stored, like his age, address, nationality and assessment score then it results in the creation of a data structure. A data structure stores more than one value. A simple way to look at a data structure is to think of an Excel sheet with rows and columns where the columns are made up of different data types. In the example used, the Name column will store character data types, the Age column will store integer data types, the Score column will store numeric data types and so on.
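The contrast can be sketched in R. This example assumes a data frame as the "Excel sheet" style structure (data frames are covered later in this material), and Bob's age and score are made-up values for illustration.

```r
# A single value stored against a variable
name <- "Bob"
print(class(name))  # "character"

# Several pieces of information about Bob held together
# (Age and Score are hypothetical values)
bob <- data.frame(
  Name  = "Bob",
  Age   = 30L,
  Score = 82.5,
  stringsAsFactors = FALSE
)

# Each column keeps its own data type
print(sapply(bob, class))
```

Here Name is character, Age is integer and Score is numeric, matching the column description above.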




TYPE OF DATA STRUCTURE – VECTORS

The first type of data structure that will be discussed is referred to as Vectors. A Vector is like a column in an Excel sheet. Going back to the example used earlier, Vectors would be Name, Age, Address, Nationality and so on. In Vectors, all the elements within a Vector should be of the same data type.

Vectors cannot have a combination of data types!

So, if Age is a Vector, then all the elements under Age should be of the same data type, for example integer. This Vector cannot contain elements of any other data type, such as character, nor can it hold a combination of data types.

A Vector is therefore a data structure which contains elements of the same data type. Visualize a single column in an Excel sheet which contains values of the same data type.




HOW TO CREATE A VECTOR IN R

In R Studio, a Vector can be created with the function c, short for concatenate. So, let’s create a Vector called vector1 and store 4 values in it. This vector will contain elements of the numeric data type. To create this vector, enter the code vector1<-c(9,8,2,7) and execute this code.

Two events will take place. First, the Console will display vector1 with its corresponding values.

Second, in the Workspace section the variable vector 1 will be displayed along with the data type of its values - which is numeric - and the number of values which is 4.




To print the contents of vector1, write the name of the Vector and press Control + Enter. In the Console the values 9, 8, 2 and 7 will be displayed. The [1] shown at the start of the output line is the position (index) of the first value printed on that line; it is not part of the data.
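The steps above can be sketched as a short script. The calls to class and length are an addition for illustration; they report the same details that the Workspace section shows.

```r
# Create a numeric vector with c() (concatenate)
vector1 <- c(9, 8, 2, 7)

# Print its contents; typing vector1 on its own does the same
print(vector1)

# Data type of the elements and number of elements,
# as also shown in the Workspace section
class(vector1)
length(vector1)
```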




MIXING UP DATA TYPES IN A VECTOR

Now let us look at something interesting. As discussed, a Vector can only contain elements of the same data type. There can be no mixing of data types within a Vector. So what happens if a second Vector is created and, along with numeric data types, a character data type is inserted into it? Shown here is the code to create a new Vector called vector 2 with some values. Inserted into these values is a character value “bob”.




When this code is executed, the output is displayed in the Console. But the syntax used includes elements of different data types, whereas we know that vectors can only store elements of the same data type. So why is no error being displayed?

When the contents of vector 2 are printed, all values in the Vector are displayed in the Console in quotes. This indicates that by default R has converted all numeric data types in the Vector to character data types by adding quotes to all the numbers. This is why R does not display any error on executing this code!

R recognizes the rule of common data types and converts uncommon values to a single data type.
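A minimal sketch of this conversion is given here. The values are hypothetical, since the handbook's own vector 2 appears only in a screenshot; the vector name mixed is also an assumption.

```r
# Hypothetical values: numbers mixed with one character value
mixed <- c(1, 2, "bob", 4)

# Every element is printed in quotes
print(mixed)

# The whole vector has become character
class(mixed)
```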




REPLACING THE CONTENTS OF A VECTOR

Values in a Vector can also be overwritten, and the new values may be of a different data type from the old ones. In the example we looked at earlier, vector 2 contains 11 values, all of which are character data types. Suppose we want to replace these 11 values with 4 values of numeric data type. These 4 values are 1, 2, 3 and 8. Let us enter the code vector2<-c(1,2,3,8) and press Control + Enter. In the Workspace the data type of vector 2 has now changed to numeric and it has 4 values stored against it.




ARITHMETIC FUNCTIONS BETWEEN VECTORS

It is also possible to carry out arithmetic functions between Vectors like addition, subtraction, multiplication and division. For the results to line up element by element, the Vectors should be of equal length. As you can see in the Workspace, both vector 1 and vector 2 are of numeric data type and have 4 values each, which means they are both of the same length.

It is possible to carry out any type of arithmetic function on these 2 vectors such as vector 1 + vector 2 or vector 1 – vector 2 and so on. Let us enter the code vector1 + vector2 and press Control + Enter. The output is displayed in the Console.

Let’s cross check these values. Vector 2 comprises the values 1, 2, 3 and 8. To check the values of vector 1, enter vector 1 and press Control + Enter. The values displayed in the console are 9, 8, 2 and 7.




So, when the statement is executed addition is carried out by adding element 1 of vector 1 to element 1 of vector 2, element 2 of vector 1 to element 2 of vector 2 and so on. So, when 9, the first element of vector 1 is added to 1, the first element of vector 2 the result is 10 which is shown in the Console.

You can cross check the rest of the results as well!

In this example the vectors were both of equal length. Let’s look at what happens when the vectors are of unequal length. Vector 1 has 4 elements. Let us add to it a new vector, c(1,2,3), which has the 3 elements 1, 2 and 3. Let us enter the code vector1 + c(1,2,3) and press Control + Enter. On executing this code, a warning message is displayed in the Console but the addition has still been carried out. How?




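The first three elements of vector 1 have been added to the three elements of c(1,2,3). For the fourth element of vector 1 there is no corresponding fourth element, so R recycles the shorter vector and starts again from its first element: 7 is added to 1, giving 8. The warning message points out that the length of the longer vector is not a multiple of the length of the shorter one. Therefore, for accurate results it is better to add vectors of the same length.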
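The following sketch reproduces the arithmetic discussed in this part, using the values of vector1 and vector2 given in the text. The names total, difference and recycled are introduced only for illustration.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 8)

# Equal lengths: element 1 pairs with element 1, and so on
total <- vector1 + vector2        # 10 10 5 15
difference <- vector1 - vector2   # 8 6 -1 -1

# Unequal lengths: R reuses the shorter vector from its start
# and displays a warning, because 4 is not a multiple of 3
recycled <- vector1 + c(1, 2, 3)  # 10 10 5 8

print(total)
print(difference)
print(recycled)
```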




IDENTIFYING ELEMENTS IN A VECTOR

Another interesting feature in Vectors is referred to as indexing. This feature allows a particular element in a Vector to be accessed. For example, we know that vector1 contains 4 elements: 9, 8, 2 and 7. Let us suppose that we want to find out the third element in vector1, which is 2. Let us enter the code vector1[3] and press Control + Enter. Entering 3 indicates that we want to access the third element of vector1. We can see a value of 2 displayed in the Console, which as we know is the third element in vector1.

So to index a Vector, next to the name of the Vector enter within square brackets the number of the element that needs to be accessed. For example, vector1[3].

Indexing helps in identifying values in a vector based on their position.
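As a small illustration of indexing by position, assuming vector1 holds the values used throughout this tutorial (the names third and first are hypothetical):

```r
vector1 <- c(9, 8, 2, 7)

# Square brackets select an element by its position
third <- vector1[3]
first <- vector1[1]

print(third)   # 2
print(first)   # 9
```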




REPLACING CONTENTS IN A VECTOR

Now, let us suppose that we want to create a new Vector called new_vector. In this new Vector we want to populate the same elements as vector 1 but without the second element. So in new_vector we only want to store the first, third and fourth elements of vector 1. Let us enter the code new_vector<-vector1[-2] and press Control + Enter. Entering minus next to 2 indicates that we want to exclude the second element of vector 1 in new_vector. When the code is executed we can see in the Workspace section that the vector new_vector has been created with three values of numeric data type.

To view the contents of new_vector, enter the name of the vector and press Control + Enter. In the Console, 9, 2 and 7 are displayed. 8 is not displayed as it is the second element in vector 1 and hence has been excluded.

If a Vector has only three elements but a value of 10 is entered in square brackets, then we are trying to index an element beyond those actually present in the Vector. This situation is referred to as an Index out of Boundary. In R this does not raise an error; the result displayed is NA, meaning that no value is available at that position.

If there are only 3 elements in a Vector, then how can you locate the 10th element?! Hence the term Index out of Boundary.
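Both ideas, excluding an element with a minus sign and indexing beyond the end of a Vector, can be sketched as follows. The name out_of_range is an assumption made for this example.

```r
vector1 <- c(9, 8, 2, 7)

# A negative index drops that position and keeps the rest
new_vector <- vector1[-2]
print(new_vector)          # 9 2 7

# new_vector has only 3 elements, so position 10 does not exist;
# R returns NA instead of stopping with an error
out_of_range <- new_vector[10]
print(out_of_range)
```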

USING A FUNCTION TO INDEX

Indexing in Vectors can also be done with the help of logical functions. Here’s how. Let us create a new Vector called vector2 with the 4 elements 1, 2, 3 and 4. Enter the code vector2=c(1,2,3,4) and press Control + Enter. We already have a Vector called vector1 which has the elements 9, 8, 2 and 7. Let us now use a logical function to find the third element in vector1. Enter the logical function vector1[vector2==3]

By entering vector2==3, we are trying to locate the position of the value 3 in vector2. The value 3 is the third element in vector2. So, when the code is executed, the third element in vector1 should be displayed in the Console. Since the third element in vector1 is 2, we should be able to see this number in the Console.




Press Control + Enter. On executing this code we can see in the Console the value 2.

Using the position of 3 in vector 2, the logical function tries to find the equivalent position in vector 1.
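It may help to look at the intermediate result of the comparison. In this sketch the name mask is hypothetical; it holds one TRUE or FALSE per element of vector2, and only the positions marked TRUE are picked from vector1.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 4)

# The comparison gives one logical value per element of vector2
mask <- vector2 == 3
print(mask)               # FALSE FALSE TRUE FALSE

# Only the position marked TRUE is returned from vector1
result <- vector1[mask]
print(result)             # 2
```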




SPEEDING UP THE TASK WITH OPERATORS

Operators help in executing certain types of tasks quickly and more efficiently. Let us understand this better with the help of an example.

Let us assume that a Vector called age needs to be created to store the first 100 natural numbers, i.e., the numbers from 1 to 100. One way to do this is to write the code age<-c(1,2,3…..) and so on, mentioning all numbers till 100. This obviously is not a feasible option. Sometimes numbers could run till 100, at other times even till 1000! In these types of situations, a good option is to use Operators. Here are a few common Operators that are used in R.

Colon Operator

The Colon Operator can be used to create Vectors like the age Vector quite easily by using the code age<-1:100. To execute this code, press Control + Enter.

To view the contents of the Vector, enter age and press Control + Enter. In the Console values from 1 to 100 are displayed.




With the Colon Operator, the value before the colon is the first value in the series and the value after the colon is the last value in the series. So, when 500:505 is entered, the series will begin from 500, then move to 501, 502, 503, 504 and end with 505.

Colon operators remove the necessity of writing a long series of numbers!

Sequence Operator

Let us suppose that a Vector is to be created with some numbers which are not continuous but follow some sort of order. An example would be 1, 3, 5, 7, 9 and so on. To create this Vector, the Sequence Operator can be used, which in R is the function seq. Let’s create a Vector called age and populate it with the values 1, 3, 5, 7, 9 and so on till 101. To do this, enter the code age<-seq(1,101,2) and press Control + Enter.




To view the contents of this Vector, enter age and press Control + Enter. In the Console the entire series is displayed.

In the code entered, 1 represents the start point, 101 represents the end point and 2 represents how the numbers should increment.

Sequence operators can populate vectors with data that follow a logical sequence.

So, now we know that vectors can be created in 3 different ways – firstly through c or concatenate, secondly with a colon operator and lastly with a sequence operator.
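The three ways of creating a Vector can be compared side by side. The names first_five, series and odd_age are assumptions made for this sketch so that the three results do not overwrite each other.

```r
# 1. c (concatenate): every value is typed out
first_five <- c(1, 2, 3, 4, 5)

# 2. Colon operator: start:end in steps of 1
age <- 1:100
series <- 500:505

# 3. seq: start point, end point and increment
odd_age <- seq(1, 101, 2)

length(age)        # 100
print(series)      # 500 501 502 503 504 505
head(odd_age)      # 1 3 5 7 9 11
length(odd_age)    # 51
```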




SUMMARY

In this tutorial we have looked at the importance of having a proper structure to store and organize one’s data. One such structure is called a Vector. To summarize:

Data structures are needed to store and organize data.

A data structure comprises different data types like characters, numbers, integers and so on which are displayed in the data structure as variables like age or name.

A Vector is a data structure which can store values of a single data type – like only characters or only numbers or only integers.

Vectors can be created in three ways - using concatenate, Colon Operator or Sequence Operator.

Arithmetic functions like addition, subtraction and so on can be carried out between Vectors provided they are of equal length.

Indexing is a function which can locate a particular element within a Vector.

It is also possible to create a new Vector by copying or modifying the contents of another Vector.




PART 3 – DATAFRAMES

INTRODUCTION

Vectors, the first type of data structure that was looked at, are closely linked to the next type of data structure that will be discussed. This tutorial deals with Dataframes. Here is what will be covered in this tutorial:

- Understanding Dataframes
- Creating Dataframes in R
- Different functions related to Dataframes

WHAT IS A DATAFRAME

Shown here is a table with columns like Name, Age and Score.

Each column is in fact a Vector. So, Name constitutes one Vector, Age another and Score another. So, a Dataframe is nothing but a collection of Vectors of equal length.




CREATING A DATAFRAME IN R

Here is a table with some data. This data needs to be converted into a Dataframe called Records.

The table has 4 columns which individually become 4 Vectors in the Dataframe. So, the first step in creating the Dataframe is to create the four Vectors.

To create the four Vectors in R Studio, viz, Name, Gender, Age and Income, enter the code
Name<-c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender<-c("M", "M", "F", "M", "M", "F")
Age<-c(20,21,24,26,26,23)
Income<-c(20000,30000,35000,40000,41000,50000)
Select the code entered and press Control + Enter.




The next step is to actually create a Dataframe around this called Records. Enter the code Records<- data.frame(Name, Gender, Age,Income) and press Control + Enter.

The order in which the Vectors are entered is important. If Name is entered first, it will be the first Vector displayed in the Dataframe. Likewise if Gender is entered first it will be the first Vector displayed in the Dataframe. On executing this statement, the Console shows that the code has been executed.

In the Workspace the name of the Dataframe Records is displayed together with its Vectors, their data types and the number of values in each. When we double click on Records we can see the entire Dataframe displayed.
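The whole sequence can be run as one script. The calls to str and dim are an addition for illustration; they print in the Console roughly what the Workspace section shows. The argument stringsAsFactors = FALSE is also an assumption, added so that the text columns stay as character on every version of R.

```r
# Step 1: create the four vectors
Name <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# Step 2: combine them; the order given here is the column order
Records <- data.frame(Name, Gender, Age, Income,
                      stringsAsFactors = FALSE)

# Structure of the data frame: columns, data types, first values
str(Records)

# Number of rows (observations) and columns (variables)
dim(Records)
```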




FUNCTIONS THAT CAN BE CARRIED OUT WITH DATAFRAMES

How to print the names of variables

To find out the names of the variables in a Dataframe, use the function names with the name of the Dataframe in brackets.

So, to find out the names of the variables in the Dataframe Records, enter the code names(Records) and press Control + Enter. In the Console all the variables of the Dataframe Records are displayed, namely Name, Gender, Age and Income.

This is a useful function when working with a Dataframe that contains a large number of variables. In this tutorial, a simple Dataframe with just four variables has been created. There could be a situation where a large Excel table with lots of variables is imported and the names of the variables used in this table need to be determined.

How to find a particular element

To find a particular element in a Dataframe, a function called Indexing is used. When the Dataframe Records is opened from the Workspace section, we can see that it has 4 columns with 6 rows. In the Workspace section, 6 observations indicates 6 rows and 4 variables means 4 columns.




Let us assume that we want to find out the gender of a particular person, say Gopal, who is listed in the table. The value Gopal can be found in the second row and Gender is the second column in the Dataframe Records. Enter the code Records[2,2] and press Control + Enter.

In the code, the first 2 indicates the second row in the Dataframe where the value Gopal is displayed. The second 2 indicates the column Gender.

When this code is executed, ‘M’ is displayed in the Console, indicating that the gender of Gopal is male.

How to view the elements in a row

It is also possible to view all the elements in any row of a Dataframe. For example, to view the elements of only the first row in the Dataframe Records, enter the code Records[1, ] and press Control + Enter.

The space left after the comma indicates that all the elements in the row need to be fetched. When this code is executed, all the elements of row 1 of Records are displayed in the Console.




To view the elements of more than one row, for example the first 4 rows of the Dataframe Records, enter the code Records[c(1:4),] and press Control + Enter. On executing this code the elements of the 4 rows are displayed in the Console. Given below is also the code to access only rows 3 and 4, and the resulting output in the Console.
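The indexing calls covered so far are collected in the sketch below. The line for rows 3 and 4 is one plausible way to write that selection, since the handbook's own code for it appears only in a screenshot. The argument stringsAsFactors = FALSE and the result names are assumptions made for this example.

```r
Records <- data.frame(
  Name = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE
)

# Names of the variables (columns)
names(Records)

# Single element: row 2, column 2
gopal_gender <- Records[2, 2]
print(gopal_gender)              # "M"

# Whole rows: leave the column position empty
first_row <- Records[1, ]
first_four <- Records[c(1:4), ]
rows_3_4 <- Records[c(3, 4), ]   # assumed form for rows 3 and 4
print(rows_3_4)
```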

How to view the elements in a column

There are three ways to find out the content/s of a particular column in a Dataframe. Let us look at each of these with the help of an example. Let us find out the contents of the column “Name” in the Dataframe Records.




1. Enter the code Records$Name and press Control + Enter.

On executing this code we can see all the values under the Name column displayed.

2. Enter the code Records[,"Name"] and press Control + Enter.

Here the first field is empty because it relates to rows and the second field is the name of the column whose contents are to be retrieved. On executing this code the contents of the column are displayed in the Console.

3. Enter the code Records[,1] and press Control + Enter.

This method works if the column number is known beforehand. In this case we know that Name is the first column in Records. On executing this code, the contents of Name are displayed in the Console.

Of the three ways to find the values in a column, two work with the name or the label of the column and the third requires the number of the column.
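A quick check that the three forms return the same values, using a trimmed two-column copy of Records. The trimmed data frame, the stringsAsFactors = FALSE argument and the result names are assumptions made to keep the sketch short.

```r
# Trimmed copy of Records with only two of its columns
Records <- data.frame(
  Name = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

by_dollar <- Records$Name        # by label, using $
by_label <- Records[, "Name"]    # by label, in square brackets
by_number <- Records[, 1]        # by column number

# All three give the same vector of names
same <- identical(by_dollar, by_label) && identical(by_label, by_number)
print(same)
```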




How to add columns to a Dataframe

To add a column to a Dataframe, the only prerequisite is that the new column should be the same size as the other columns in the Dataframe. In the Dataframe Records, there are 6 rows and 4 columns. To add a fifth column to this Dataframe, suppose we enter the code Records$newc<-(100:106) and press Control + Enter. Entering 100:106 asks for a new column with the values 100, 101, 102, 103, 104, 105 and 106. When this code is executed, an error is displayed in the Console, as 100:106 contains seven values and the Dataframe has only six rows.

So, the code needs to be changed to Records$newc<-(100:105). Enter this code and press Control + Enter. On executing this code, the number of columns in Records shown in the Workspace is now 5.

Also, when Records is opened, the column newc is displayed with values from 100 to 105.




How to remove a column from a Dataframe

It is also possible to remove a column from a Dataframe. For example, to remove the column newc from the Dataframe Records, enter the code Records$newc<-NULL and press Control + Enter.

When this code is executed the data in the Workspace is updated to show only 4 columns.

When Records is opened, the column newc is no longer visible.
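Adding and removing a column can be sketched as follows. The use of tryCatch is an addition so that the failed attempt with seven values does not stop the script; the names err and cols_after_add are hypothetical, as is the stringsAsFactors = FALSE argument.

```r
Records <- data.frame(
  Name = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE
)

# Seven values do not fit six rows: capture the error message
err <- tryCatch({
  Records$newc <- 100:106
  NULL
}, error = function(e) conditionMessage(e))
print(err)

# Six values fit: the data frame now has a fifth column
Records$newc <- 100:105
cols_after_add <- ncol(Records)

# Assigning NULL removes the column again
Records$newc <- NULL
ncol(Records)
```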




SUMMARY

In this tutorial we have looked at another important data structure called Dataframe. To summarize:

A Dataframe is a data structure made up of vectors of equal length.

To create a Dataframe in R, first vectors need to be created.

There are various types of functions that can be carried out with Dataframes. These are:
a. Printing the contents of a Dataframe
b. Indexing or locating specific values in a Dataframe
c. Finding out the values of a row in a Dataframe
d. Finding out the values of a column in a Dataframe
e. Adding a column with values to a Dataframe
f. Removing a column from a Dataframe




PART 4 – LIST & MATRIX

INTRODUCTION

Data structures can also be in the form of a list or a matrix. This tutorial will deal with two more types of data structures – List and Matrix. So, here’s what will be covered in this tutorial:

- Understanding Lists
- Understanding a Matrix
- Creating Lists in R
- Creating a Matrix in R
- Different functions related to List and Matrix

WHAT IS A LIST

Just like a Dataframe, a List is also made up of Vectors. But unlike the Dataframe, the Vectors in a List can be of equal or unequal length. However, each Vector in a List should itself comprise elements of a single data type. For example, in the table below, n, s and b are 3 Vectors of different data types: numeric, character and logical. They are not all of the same length, as n has 3 elements while s and b have 5 each. A combination of these Vectors can make up a List.

n    s     b
2    aa    TRUE
3    bb    FALSE
5    cc    TRUE
     dd    FALSE
     ee    FALSE




WHAT IS A MATRIX

A Matrix is a collection of data elements arranged in a 2 dimensional rectangular layout. To create a Matrix in R of 10 elements arranged in 5 columns and 2 rows, the syntax to be used is shown below. In this example, a matrix called my.matrix will be created. So, enter the code my.matrix<-matrix(c(1:10), ncol=5, nrow=2) and press Control + Enter. The result is shown below.

Unlike a Dataframe where each column stores different elements like Name or Age, in a Matrix all the columns need to have the same type of elements – either only numbers or only characters and so on.

A Matrix cannot have one column with character data types, one column with integers and so on.




CREATING A LIST IN R

There are two ways in which a List can be created in R.

1. By creating the Vectors in the List

To understand how to create a List in R, the table shown earlier will be converted into a List. In that table, column n has only numeric data, s has only character data and b contains logical data (only TRUE or FALSE). Each of these columns is a Vector. To create the Vectors, enter the code
n = c(2,3,5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
and press Control + Enter. The output is displayed in the Console.

In the Workspace section, each Vector n, s and b is displayed together with its data type and the number of values in it.




2. By creating a separate List around Vectors

To create a List X with the three Vectors n, s and b and a fourth element, the number 3, enter the code X = list(n,s,b,3) and press Control + Enter. On executing this statement, X is visible in the Workspace section with a value List against it. It also indicates that the List has 4 elements.

To view the contents of the List, select the name of the List and press Control + Enter. The values in X will be displayed in the Console.
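A minimal sketch of the List built above follows. The double square brackets used to pull out one element are not covered in this tutorial and are shown here only as an illustration; the names second and fourth are hypothetical.

```r
# Vectors of different data types and different lengths
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Combine them, plus the single number 3, into one list
X <- list(n, s, b, 3)

length(X)         # 4 elements
print(X)

# Double square brackets return one element of the list
second <- X[[2]]  # the character vector s
fourth <- X[[4]]  # the number 3
```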




CREATING A MATRIX IN R

Let us create a matrix in R called my.matrix with 5 columns and 2 rows. This Matrix needs to store 10 elements. To create this Matrix, enter the code my.matrix<-matrix(c(1:10), ncol=5, nrow=2) and press Control + Enter. Here, the first argument supplies the elements to be stored in the Matrix, the second argument sets the number of columns in the Matrix and the third argument sets the number of rows in the Matrix. On executing this code the Workspace indicates that a Matrix has been created.

On double clicking the name of the Matrix, a 2x5 matrix with 10 elements is visible.
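The matrix can also be inspected from the Console. One detail worth noting, which the text does not state, is that matrix fills the values column by column by default, so 1 and 2 form the first column. The name element is an assumption made for this sketch.

```r
# 10 elements arranged in 2 rows and 5 columns
my.matrix <- matrix(c(1:10), ncol = 5, nrow = 2)
print(my.matrix)

# Rows and columns
dim(my.matrix)              # 2 5

# Values are filled column by column, so row 2 of column 3 holds 6
element <- my.matrix[2, 3]
print(element)
```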

CREATING A MATRIX OUT OF A DATAFRAME

A Dataframe can also be converted into a Matrix. Let us understand this with the help of an example. First the Dataframe needs to be created, with some sample elements. To create a Dataframe called data_frame, enter the code data_frame<-data.frame(a=c(1,2,"3"), b=c(1,2,3)) and press Control + Enter. To convert this Dataframe to a Matrix (let us call it next.matrix), use the function next.matrix<-as.matrix(data_frame) and press Control + Enter.




The output is displayed in the Console and the details of the Matrix in the Workspace section. On opening the Matrix, a matrix with 3 rows and 2 columns is displayed. The column names a and b are carried over from the Dataframe.

Now let us find out the data type of the second column in the Dataframe data_frame. The data type of the second column is numeric, and this can be confirmed by using the code class(data_frame$b). On executing this statement, numeric is displayed in the Console.

Now let us find out the data type of the second column in the Matrix next.matrix. The second column in next.matrix is b. To find out the data type of b, enter the code class(next.matrix[,2]) and press Control + Enter. In the Console, character is displayed.

So, if the same elements in the Dataframe were used to create the Matrix, why does the data type of the column differ? Column a, the first column in the Dataframe that was used to create next.matrix, has elements of the character data type. As a Matrix needs to have elements of the same data type, every element in the Matrix, including the elements in the second column b, has been converted to the character data type. This is why the data type of the second column of the Matrix is character. This underlines the key difference between a Dataframe and a Matrix, i.e., all the elements in a Matrix need to be of the same data type.
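The comparison described above can be run as one script. The names df_type and matrix_type are assumptions made for this sketch.

```r
# Column a becomes character because of the quoted "3"
data_frame <- data.frame(a = c(1, 2, "3"), b = c(1, 2, 3))
next.matrix <- as.matrix(data_frame)

# In the data frame, column b keeps its own data type
df_type <- class(data_frame$b)
print(df_type)               # "numeric"

# In the matrix, every element shares one data type
matrix_type <- class(next.matrix[, 2])
print(matrix_type)           # "character"

# 3 rows, 2 columns, column names taken from the data frame
dim(next.matrix)
colnames(next.matrix)
```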




SUMMARY

In this tutorial we have covered two more data structures – List and Matrix. To summarize:

A List is a combination of vectors of either equal or unequal length.

A Matrix is a collection of data elements, where all the elements need to be of the same data type.

There are two ways in which a List can be created in R. The first is by generating the Vectors in the List individually. The second is by combining the Vectors into a consolidated List.

A Matrix can be created using code where the number of elements, columns and rows is specified.

A Dataframe can also be converted to a Matrix. If the Dataframe has different data types, only one data type will be stored when converted to a Matrix.




PART 5 – FACTORS

INTRODUCTION

An important data type used in data structures is the factor. Factor, as already mentioned, refers to data of a categorical nature. Here is what will be covered in this tutorial:

- Understanding factors
- Creating a factor in R

WHAT IS A FACTOR

In R, let us assume that a Vector called fac_list has been created with names of cities like city1, city2, city3 etc.
fac_list<-c("city1", "city2", "city3", "city4")
The names of these cities are categories in themselves. So each city, which is originally a character data type, can be converted into a factor or a separate category in R. Let us take another example. In a Vector like gender, there are invariably two values, male and female, each of which is a category in its own right. So, the utility of the data type factor is to convert values into categories.

Vectors are the base from which factors are generated.




HOW TO CREATE A FACTOR IN R

To use factors to create categories out of values, let us assume that the values in the Vector fac_list are to be converted to categories or factors. First create the Vector fac_list using the code mentioned above. Now enter the code fact1<-as.factor(fac_list) and press Control + Enter. In the Workspace section, fact1 is visible as a factor along with its number of levels, i.e., distinct categories.

To find out the data type of fact1, use the code class(fact1) and press Control + Enter. On executing this code, factor is displayed in the console area.

To view the values in fact1, enter the code summary(fact1) and press Control + Enter.




In the Console area each of the values in fac_list is displayed as a category.

The values under each indicate the number of times they appear in the Vector fac_list. For example, a city that appears only once shows the value 1, while a city that appears twice shows the value 2. Likewise the number of times the other categories appear is also indicated.
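The sketch below runs these steps on the four-city vector given earlier, and then on a second, hypothetical vector in which city2 is repeated, to show how the counts change. The names fac_repeat and counts are assumptions made for this example.

```r
fac_list <- c("city1", "city2", "city3", "city4")
fact1 <- as.factor(fac_list)

class(fact1)      # "factor"
levels(fact1)     # the distinct categories
summary(fact1)    # each city appears once here

# Hypothetical vector with a repeated value
fac_repeat <- as.factor(c("city1", "city2", "city2", "city3"))
counts <- summary(fac_repeat)
print(counts)     # city2 is counted twice
```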




SUMMARY

In this tutorial the data type factors was explained in some detail. To summarize:

Factors are a data type which converts values into categories. For eg, names of cities to city, male and female to gender and so on.

In R, factors can be created out of values in a vector.

It is also possible to view the number of times a category appears in a Vector.




Section 3 Data Handling




PART 1 – PACKAGES

INTRODUCTION

One section of the R Studio GUI is devoted to Packages. They allow for many important functions to be carried out. So, here is what will be covered in this tutorial:

- Understanding packages

- Installing and loading packages

- Importing data into R

- Exporting data from R

WHAT IS A PACKAGE

Packages are collections of R functions, data and compiled code put together in a well defined format. They can be thought of as prepared routines that are available in R.

Packages are like a bundle of everything that is needed to carry out a specific function in R.

Let us understand the importance of packages through an example. Suppose we want to carry out a linear regression in R to create a linear model. One way to do this is to write all the logic and code to carry out a linear regression and then execute it. Another way is to access a linear regression function from an external file, pass your data through it and execute it. Pre-made functions of this kind are what packages in R provide.




As we have already discussed R has a huge community of contributors. These contributors create these premade functions or packages in R which can then be used by all users of R. So, if a user needs to forecast something in R, all they need to do is look for the forecasting package in R and use it. Packages definitely make R a user friendly tool.

By using the right package in R, one can save time and effort in carrying out a particular function.

INSTALLING AND LOADING A PACKAGE

When talking about packages there are two common terminologies that are used. The first is installing a package and the second is loading a package. To understand these terminologies, let us look at an example. Using this example will not just show us how to use a package, but also demonstrate how to import data into R. Let us assume that in one of the drives in the system being used, an Excel file called Excel_import is to be imported into R. In R the code to import an excel file is read.xlsx.

But if we were to execute this code, it would not work. This is because read.xlsx is a function present within the package xlsx, and it will only work if this package is installed and loaded. So certain functions in R are linked to packages and will only work if those packages are installed in R.

To execute certain functions, it is important to install and load a package in R.




Installing a Package

Let us now look at how to install a package. The option Packages is available in R Studio on the right hand side.

Click on Packages, then Install packages.

In the field “Packages” enter the name of the package that needs to be installed. In the example being used, the package to be installed is called xlsx. So, enter xlsx.

Make sure that when installing a package like xlsx you are connected to the internet, as R will need to download the package from a server. In the case of xlsx, it will be downloaded from the CRAN repository.

After entering the name of the package to be installed, click on the Install button.




Once the package has been installed look for it by entering its name in the search field in the Packages section. As it appears in the search results, we know that the package has already been installed.

To find out if a package is already installed in R, look to see if it comes up in a Search.

Loading a package

Installing a package adds it to your system, but after that the package needs to be loaded. Loading means making the package available in the current R session so that its functions can be executed. To load a package in R, the function library is used, followed by the name of the package within brackets. So, enter the code library(xlsx) and press Control + Enter. In the Console the text in red indicates that the package has been loaded in R.
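The same check can be made from code instead of the Packages pane. This is a minimal sketch using the base function requireNamespace, which reports whether a package can be found and loaded. The variable names are hypothetical, and the script only prints advice; it does not install or attach anything itself.

```r
pkg <- "xlsx"

# TRUE if the package is installed and its namespace can be loaded
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is available; load it with library(", pkg, ")")
} else {
  message(pkg, " is not available; run install.packages(\"", pkg, "\") first")
}
```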




IMPORTING AN EXCEL FILE INTO R

Now let us import an Excel file into R, as the package to import the file is installed and loaded in R. To do this the code to import the Excel file needs to be entered. A breakdown of this code is mentioned here: read.xlsx(file="file path.xlsx", sheetIndex=1)

- file path.xlsx is the file path or the location of the file to be imported
- sheetIndex=1 indicates that only the first sheet in the Excel file needs to be imported (So, if 2 is entered instead of 1, then the second sheet will be imported from the Excel file)

The path of the file to be imported can be found under Properties.

Let us assume that the name of the sample Excel file to be imported is Excel_import. To find out the file path of this Excel file, right click on the file and look under Properties. In the space left for the file path in the code, paste the file path of the Excel file. When pasting or writing the file path, make sure that each back slash is entered twice. After the file path enter the name of the file, which is Excel_import. Then the sheet to be imported needs to be mentioned. We can either enter 1 or sheetIndex=1. The imported Excel sheet needs to be stored in a Dataframe in R. So we will create a new Dataframe data1. Let us execute this code by pressing Control + Enter.




In the Console the presence of the red dot means that R is still busy importing the data.

In the Workspace section, a new dataframe data1 is created which has 99 observations or rows and 4 variables or columns. On opening data1 we can see the data that has been imported. Check to see if the correct data has been imported by opening the Excel file that has been imported.




IMPORTING A CSV FILE INTO R

To import a comma separated value file or a CSV file, the code read.csv is to be used. Importing a CSV file does not require any package to be installed as this function is built into R. The code to import a CSV file is shown here: read.csv(file = "file path.csv", sep = ",")

- file path.csv is the file path or the location of the file to be imported
- sep = "," indicates that the file to be imported is a comma separated value file

Assume that a CSV file called CSV_Import is to be imported. Copy the file path of this file, which can be found under Properties. In R, enter the code to import the file by first entering the name of the Dataframe where the imported file will be stored, which is data2, followed by the assignment arrow <-. Then enter the code read.csv followed by the file path which was copied earlier. Then enter the name of the file to be imported, which is CSV_Import, followed by the file extension .csv. Remember to double each back slash in the file path just as we did in the case of the Excel file import. The last part of the code is the separator, which is a comma. Press Control + Enter to import the file.

In Workspace, data2 has been created with 99 observations and 4 variables.
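The steps above can be tried without the sample file. The sketch below writes a tiny CSV file to a temporary location (the three lines of content are made up for illustration and are not the contents of CSV_Import) and then imports it with read.csv.

```r
# Create a small CSV file in a temporary folder to stand in for CSV_Import.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Gender,Age,Income",
             "Aryan,M,20,20000",
             "Anita,F,23,50000"), csv_path)

# Import the file into the Dataframe data2.
data2 <- read.csv(file = csv_path, sep = ",")

# Number of observations (rows) and variables (columns).
dim(data2)
str(data2)
```

The first line of the file is treated as the column names, which is the default behaviour of read.csv.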




An easier way to import a CSV file is to create a Dataframe (say, a) and use the code: a <- read.csv(file.choose()) The function file.choose() opens a dialog box from which the file can be selected. This is a menu driven option and removes the need to copy and paste the file path in the code. After pressing Control + Enter, the Select File window appears; look for the CSV file to be imported (which in our case is CSV_Import).

On selection, a Dataframe “a” is created in the Workspace which has the same number of observations and variables as the Dataframe created earlier.




SUMMARY

In this tutorial the utility of Packages was explained in some detail. It also covered the importing of Excel and CSV data into R. To summarize:

Packages are bundles of predefined functions. They help in executing certain processes in R with ease.

Certain tasks in R can only be carried out through packages.

To use a package in R, first install it and then load it.

To import an Excel file in R, install and load the package xlsx.

To import a CSV file in R, use the code read.csv.




PART 2 – EXPORTING AND READING DATA IN R

INTRODUCTION

Just like data can be imported into R – whether in Excel or CSV format – it can also be exported from R. So, here is what will be covered in this tutorial:

- Exporting data from R to Excel

- Exporting data from R to CSV

- Reading a file in R




EXPORTING TO EXCEL

In an earlier section, a Dataframe called “a” has already been created when CSV files were imported to R. Let us assume that the contents of this Dataframe will now be exported to an Excel sheet.

To do this, enter the following code: write.xlsx(data, file = "file path") So, if the contents of Dataframe “a” are to be exported to a sample Excel file called abc, the code to be entered is shown below:

Let us deconstruct this code. The data to be exported is specified as “a” and the location to which it is to be exported is mentioned after file. Also mentioned is the name of the Excel file where the data is to be stored which in this case is abc. When this code is executed, the data is exported to the location specified. You can always check this by going to the location where the Excel file is saved, and checking its contents.
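As a hedged illustration of the export, the sketch below writes a small Dataframe to a file called abc.xlsx in R's temporary folder. The contents of “a” and the folder are assumptions made for this example, and the export runs only if the xlsx package can be loaded.

```r
# A small stand-in for the Dataframe "a" (made-up values).
a <- data.frame(Name = c("Aryan", "Anita"), Age = c(20, 23))

# Export to abc.xlsx in the temporary folder; use your own folder in practice.
xlsx_path <- file.path(tempdir(), "abc.xlsx")

if (requireNamespace("xlsx", quietly = TRUE)) {
  # row.names = FALSE leaves out the row numbers that R adds by default.
  xlsx::write.xlsx(a, file = xlsx_path, row.names = FALSE)
  file.exists(xlsx_path)
}
```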




EXPORTING TO CSV

Exporting to a CSV file is similar to exporting to an Excel file. The code to carry out this function is shown here:

In the code shown, a is the Dataframe whose contents are to be exported, filepath is the location where the CSV file is to be saved and the comma against the separator (sep) indicates that the data has to be exported in CSV format. After executing the code, go to the location where the CSV file has been stored. Verify that the contents of the Dataframe “a” have been exported in CSV format.
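The description above matches a call to write.table with a comma as the separator. The sketch below assumes that function; the Dataframe contents and the temporary folder are made up for illustration.

```r
# A small stand-in for the Dataframe "a" (made-up values).
a <- data.frame(Name = c("Aryan", "Anita"), Age = c(20, 23))

# Export with a comma as the separator, without the row numbers.
csv_out <- file.path(tempdir(), "abc.csv")
write.table(a, file = csv_out, sep = ",", row.names = FALSE)

# Read the file back as plain text to verify what was written.
exported_lines <- readLines(csv_out)
exported_lines
```

R also offers write.csv(a, file = csv_out, row.names = FALSE) as a shortcut in which the separator is already fixed to a comma.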




READING A FILE IN R

Like in the case of exporting data, reading data is also carried out with the help of code, which in this case is read.table. Shown here is the code to read a sample text file in R.

Assume that on the desktop of your system, a text file called “Consultants” is available whose contents are to be read through R. Assume that this file contains a set of email ids all separated by commas. When this data is read in R, we want to make sure that each email id is an element in itself. Let us now deconstruct the code to read data. The dataframe where the contents of the text file are to be displayed is mentioned which in the code displayed is “a”. The location of the text file is given next. Comma is written against separator as all values in the text file “Consultants” are separated by commas. When we execute this code, in the Console a red dot appears indicating that the data is being compiled.

In the Workspace section, a Dataframe called a is visible. On opening this Dataframe, all the email ids in the file “Consultants” are stored as separate elements.




Assume that on the desktop of your system there is another text file called “Backup codes”. Here the elements are separated by a space. To read the contents of this file, using the same code used earlier, replace the comma with a space.

On executing this code, a Dataframe b is created in the Workspace section. On opening this Dataframe, the contents of the text file are displayed as separate elements. So as the contents were separated in the text file with a space, a space was used in the code as a separator.
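Both cases can be tried with small temporary files. The file contents below are invented for illustration and stand in for “Consultants” and “Backup codes”; only the value of sep changes between the two calls.

```r
# A text file whose values are separated by commas.
comma_file <- tempfile(fileext = ".txt")
writeLines("a@example.com,b@example.com,c@example.com", comma_file)
a <- read.table(file = comma_file, sep = ",")

# A text file whose values are separated by spaces.
space_file <- tempfile(fileext = ".txt")
writeLines("1234 5678 9012", space_file)
b <- read.table(file = space_file, sep = " ")

# Each value ends up in its own column (named V1, V2, V3 by default).
dim(a)
dim(b)
```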




SUMMARY

In this section, exporting data from R – whether in Excel or CSV format – was covered. The code to read text files in R was also looked at.

To summarize:

Data can be exported in Excel format using the code write.xlsx.

Data can also be exported in CSV format.

To read a text file in R, use the code read.table.

To read elements separated by a comma, use the separator “,”.

To read elements separated by a space, use a space “ ” as the separator.




Section 4 Logical Operations and Conditions




PART 1 – LOGICAL OPERATORS AND IF CONDITION

INTRODUCTION

Locating values in R is fairly simple with the use of logical operators and conditions. So, here is what will be covered in this tutorial:

- Understanding logical operators

- Common types of logical operators

- Executing logical operators in R

- Understanding IF condition

- Executing IF condition in R

WHAT IS A LOGICAL OPERATOR

Logical operators are used to locate specific elements in a data structure. Here are examples of logical operators in R – Greater than, Less than, Equal to, And, OR. An example will be used to understand each of these terms better. Assume there is a table, that lists a few names along with certain particulars related to those names like gender, age and income. The utility of a logical operator in a table such as this or in a Dataframe in R is to identify or isolate a specific element or elements, or certain row or rows. So, if in this sample table, one wants to identify all those names where the age is Greater than 23, then the logical operator Greater than is used. Here is the result of using this operator on the sample table. Looking at the table we can identify 3 names where the age is greater than 23.

Name Gender Age Income

Aryan M 20 20000

Gopal M 21 30000

Zubin F 24 35000

Ravi M 26 40000

Umesh M 26 41000

Anita F 23 50000

Age Greater than 23

Let us take a look at another example. Suppose we want to identify all those names whose gender is male and whose income is greater than 40000. Here we need to use 3 logical operators to identify these names. These are gender Equal to male, followed by the logical operator AND, followed by income Greater than 40000. Here is the result of applying these logical operators. Only one name, Umesh, meets both conditions, as the income against Ravi is exactly 40000 and not greater than 40000.


Aryan M 20 20000

Gopal M 21 30000

Zubin F 24 35000

Ravi M 26 40000

Umesh M 26 41000

Anita F 23 50000

Gender Equal to male AND income Greater than 40000

So from these examples we can see that logical operators are very useful in extracting particular information from a Dataframe or a table.
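The two examples above can be checked quickly in R. The sketch below types the columns of the sample table in as plain vectors (the lower case vector names are chosen only for this illustration) and applies the same logical operators.

```r
# Columns of the sample table, typed in as plain vectors.
name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
gender <- c("M", "M", "F", "M", "M", "F")
age    <- c(20, 21, 24, 26, 26, 23)
income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# A comparison returns TRUE or FALSE for every element.
older_than_23 <- age > 23
name[older_than_23]

# Two conditions joined with AND (&) must both be TRUE.
male_high_income <- gender == "M" & income > 40000
name[male_high_income]
```

Placing the TRUE/FALSE result inside square brackets keeps only the elements where the condition holds.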




HOW TO EXECUTE A LOGICAL OPERATOR IN R

We will now look at how to work with these logical operators in R through a simple exercise. In an earlier section, a Dataframe called Records was created using the information mentioned in the sample table above. But for purposes of this exercise, we will create this Dataframe again.

To create the Dataframe “Records” again, first create the vectors “Name”, “Gender”, “Age” and “Income” before creating “Records”.

After the Dataframe has been created, the following three tasks will be carried out:

1. The vectors in “Records” will be printed
2. The elements where the age is less than 23 will be identified
3. The elements where gender is male and age greater than 21 will be identified

Finding out the rows where the age is less than 23

Let us begin with finding out the elements or rows where the age is less than 23. From the table, we know that there are 2 rows where the age is less than 23. These can be found against the names Aryan and Gopal.


Aryan M 20 20000

Gopal M 21 30000

Zubin F 24 35000

Ravi M 26 40000

Umesh M 26 41000

Anita F 23 50000

Age Less than 23

So how do we get the same result in R? Here is the code to create the Dataframe “Records”.

Press Control plus Enter.
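One way the Dataframe could be created from the sample table is sketched below. The values are taken from the table; the use of stringsAsFactors = FALSE is our own choice so that names and gender are stored as plain text.

```r
# Create one vector per column of the sample table.
Name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age    <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# Combine the vectors into the Dataframe Records.
Records <- data.frame(Name, Gender, Age, Income, stringsAsFactors = FALSE)

# Print the Dataframe to the Console.
print(Records)
```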




The Dataframe is created and information relating to this is displayed under Workspace. On double clicking this Dataframe “Records”, the information shown in the table is present in the Dataframe.

Let us now find out the elements or rows in this Dataframe where the age is less than 23. When discussing data structures we touched upon the code to find out the number of rows. The first rule to remember is to use square brackets after the name of the Dataframe, and the second rule is that the first argument within the bracket relates to rows and the second argument relates to columns. So as we need to find out the rows where the age is less than 23, the logical statement is mentioned in the first argument and the second argument has been left blank – as in nothing is mentioned after the comma. The code to find the rows where the age is less than 23 is: Records[Records$Age < 23, ]

So now let us deconstruct this code. First mention the Dataframe name which is “Records” followed by the dollar sign ($) and the name of the column from where data needs to be identified. In our example this would be Age. Then enter the logical operator less than ( < ) followed by 23. On pressing Control plus Enter, we can see in the console section two rows. The rows displayed tally with the results that we arrived at when we looked at the data displayed in the table.

Remember: to identify rows, enter only the first argument in the code and leave the space after the comma blank, as the second argument relates to columns.




Let us now assume that we want to count the number of rows where the age is less than 23. The details of the rows where the age is less than 23 are displayed, but we now need a count of these rows. First we need to create a dataset data1 and assign to it the result of the code we used earlier: data1 <- Records[Records$Age < 23, ] As you recall from earlier sections, this has the effect of storing the results of the code in the dataset data1. So the two rows that we saw in the Console now belong to the dataset data1. Now to find out the number of rows in data1, enter the code nrow(data1) On pressing Control plus Enter, 2 is displayed in the Console.
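Put together, filtering and counting could look like the sketch below. The Dataframe is recreated from the sample table so that the example runs on its own.

```r
# Recreate the Dataframe Records from the sample table.
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE
)

# Rows where the age is less than 23; nothing after the comma means all columns.
data1 <- Records[Records$Age < 23, ]
data1

# Count the rows that were selected.
nrow(data1)
```

The comma inside the square brackets matters: without it R would treat the condition as a selection of columns instead of rows.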




Finding out the rows where gender is male and age is greater than 21

Going back to the table, we can see that there are two records which meet the conditions of gender being male and age being over 21. These are found against the names Ravi and Umesh.


Aryan M 20 20000

Gopal M 21 30000

Zubin F 24 35000

Ravi M 26 40000

Umesh M 26 41000

Anita F 23 50000

Gender Equal to male AND age over 21

So how do we get the same results in R? In R, enter the code Records[Records$Gender == "M" & Records$Age > 21, ] So what we have effectively stated in this code is to find in the Dataframe Records, all rows with gender Equal to M and with age Greater than 21. On pressing Control plus Enter, the rows with Ravi and Umesh are displayed in the Console. This, as we have seen, exactly matches the requirement of all rows with gender male and age greater than 21.

Remember that when entering character data types in R, the values need to be entered within double quotes.
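A self-contained version of this step is sketched below, with the Dataframe recreated from the sample table. The name males_over_21 is chosen only for this illustration.

```r
# Recreate the Dataframe Records from the sample table.
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE
)

# Rows where gender is "M" AND age is greater than 21.
# == tests for equality; a single = would not compare the values.
males_over_21 <- Records[Records$Gender == "M" & Records$Age > 21, ]
males_over_21
```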




WHAT IS IF CONDITION IN R

To understand the conditional statement IF in R, let us use an example. Let us begin by opening the Dataframe Records. In this Dataframe, let us assume we want to add another variable called Gender_dummy. The values to be displayed against this variable are 1 against all those rows where M (male) is displayed and 0 against all those rows where F (female) is displayed. The code to execute this is shown here: Records$Gender_dummy <- ifelse(Records$Gender == "M", 1, 0) ifelse indicates that IF the value in the column Gender is M display 1, ELSE display 0.

Remember when entering the code to precede the name of the column with a dollar symbol.

Press Control plus Enter. In the Workspace section, the number of variables has increased to 5 (where it was earlier 4). On opening the Dataframe we can see that a new variable Gender_dummy has been created. In this column 1 has been added against all those elements where the gender is M or male and 0 against all those elements where the gender is F or female.

So let us run through the code one more time. IF the statement gender is equal to male is true, display 1, Else display 0 (which means that if the statement gender is equal to male is not true, then display 0)
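The same step as a self-contained sketch, with the Dataframe recreated from the sample table:

```r
# Recreate the Dataframe Records from the sample table.
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE
)

# ifelse(test, value if TRUE, value if FALSE) is applied to every row.
Records$Gender_dummy <- ifelse(Records$Gender == "M", 1, 0)

# The Dataframe now has a fifth column.
Records
```

ifelse works on the whole column at once, so no loop over the rows is needed.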




SUMMARY

In this section, the use of logical operators like Greater than, Less than, Equal to, AND has been covered in some detail. The use of the conditional statement ifelse has also been covered.

To summarize:

Logical operators are used to identify certain elements in a data structure. Examples are greater than, less than, equal to etc.

When executing a logical operator in R, mention the name of the data structure and the name of the column which contains the desired variables.

Symbols are used in R for logical operators like ==, <, >, &

If condition looks for the presence of certain conditions before carrying out a specific function




PART 2 – MERGING DATA

INTRODUCTION

Data in different tables can be merged in R.

So, here’s what will be covered in this tutorial:

- Understanding merging of data

- Different ways in which to merge data

- Executing a merge in R

WHAT IS MERGING OF DATA

Let us assume that an organization has prepared a database of its employee information. One table which we will refer to as Table 1 stores details related to Employee ID (shown as EmpID), Name and Income. Employee ID is a unique key or identity.

A second table which we will refer to as Table 2, stores Employee ID, Address and Nationality.

Assume that the organization wants to combine the information in these two tables into a single table. To do this, Merge will be used. So, Merge is an operation which helps in combining data which are present in different tables.




WHAT ARE THE DIFFERENT WAYS TO MERGE DATA

A merge can be carried out in different ways, or in simple terms there is more than one way to merge data. Again let’s look at an example. Shown here are two data sets. The first data set has three columns, k1, k2 and data.

The second dataset has 2 columns k1 and k3.

In order to merge these 2 datasets, an important condition needs to be met – there needs to be at least one column in common between the two. In our example, the column which is common between the two datasets is k1. So it is possible to merge these two datasets as k1 is common between the 2.

Full merge

The first type of merge possible is called the Full merge. In our example, we have three columns in the first dataset and two in the next dataset. Of these, one column k1 is common. After a full merge one dataset with four columns will be created – k1, k2, k3 and data. So a full merge is a concatenated table with all the unique columns and data present in the tables that were merged. So let us look at the merged table that has been created after a full merge of the 2 datasets.




As you can see, in the column k1 seven elements are visible. 1 is the common element in k1 in both datasets. All other unique values in k1 and the other columns are present in the new merged dataset. Where a row has no counterpart in the other dataset, the missing values are shown as NA.

Inner merge

The second type of merge is called Inner merge. In this type of merge only the rows with matching elements in the common column of the datasets to be merged are brought together. In our example, the only column that is common between the two datasets is k1. Within k1, the only common element between the 2 columns is the number 1. So when an Inner merge is carried out only the row which has the common element is merged. The figure shown on your screen indicates the result of an Inner merge between the two datasets.

Left outer & Right outer merge

The third type of merge is called the left outer merge. In this type of merge a consolidated table is created which keeps every row of the dataset on the left, and adds the columns of the dataset on the right only where the common column matches. Rows of the left dataset with no match get missing values (NA) in the columns that come from the right. The figure shown on your screen displays the results of a left outer merge.

The inverse of a left outer merge would be a right outer merge, where every row of the dataset on the right is kept and the columns of the left dataset are added only where the common column matches.




HOW TO CARRY OUT A MERGE IN R

The first thing we will do is create two dataframes x and y which will contain the elements of the datasets that we used as an example earlier. The dataset x will comprise the columns k1, k2 and data, whereas the dataset y will comprise the columns k1 and k3. Shown here is the code to create the dataframes x and y.
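A possible way of creating the two dataframes is sketched below. The values are assumptions made for this illustration and are not the values from the original figures; they are chosen so that k1 has exactly one element in common (the number 1) and seven distinct elements overall, as in the description above.

```r
# Dataset x: columns k1, k2 and data (made-up values).
x <- data.frame(k1   = c(1, 2, 3, 4),
                k2   = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40),
                stringsAsFactors = FALSE)

# Dataset y: columns k1 and k3 (made-up values).
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"),
                stringsAsFactors = FALSE)

# The column the two dataframes share, and the values they share in it.
common_column <- intersect(names(x), names(y))
common_values <- intersect(x$k1, y$k1)
common_column
common_values
```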

The dataframes are created by pressing Control plus Enter.

Carrying out a full merge

Let us now look at the syntax or code to carry out a full merge of both the datasets x and y. Enter the code: merge(x, y, by.x = "k1", by.y = "k1", all = TRUE) x and y are the datasets that are going to be merged. k1 is the column that is common between the datasets x and y. To indicate that a full merge needs to be carried out, all = TRUE is specified. On pressing Control plus Enter, a fully merged dataset is displayed in the Console.




Carrying out an inner merge

To carry out an inner merge, enter the code: merge(x, y, by.x = "k1", by.y = "k1") Press Control plus Enter. In the Console, the results of an inner merge are shown, wherein only the rows with common elements in the common column are merged.

Carrying out a left outer merge

To carry out a left outer merge, enter the code: merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE) all.x is mentioned as the dataset x is to the left. On pressing Control plus Enter, the results are shown in the Console.




Carrying out a right outer merge

The code to carry out a right outer merge is shown here: merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE) Here all.y is specified, as y is the dataset to the right. On pressing Control plus Enter, the results of the right outer merge are shown in the Console.

For datasets to be merged there has to be at least one column in common between them.
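The four types of merge can be compared side by side. In the sketch below the dataframes x and y hold made-up values (not the values from the original figures) in which k1 has only the number 1 in common.

```r
# Made-up example data with one common value (1) in the column k1.
x <- data.frame(k1 = c(1, 2, 3, 4), k2 = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40), stringsAsFactors = FALSE)
y <- data.frame(k1 = c(1, 5, 6, 7), k3 = c("p", "q", "r", "s"),
                stringsAsFactors = FALSE)

full_merge  <- merge(x, y, by.x = "k1", by.y = "k1", all = TRUE)    # all rows
inner_merge <- merge(x, y, by.x = "k1", by.y = "k1")                # matches only
left_merge  <- merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE)  # all rows of x
right_merge <- merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE)  # all rows of y

# Number of rows produced by each type of merge.
sapply(list(full = full_merge, inner = inner_merge,
            left = left_merge, right = right_merge), nrow)

# Unmatched rows are filled with NA, for example in the left outer merge.
left_merge
```

With this data the full merge has seven rows, the inner merge one row, and the left and right outer merges four rows each.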

SUMMARY

In this section, the different ways two data structures can be merged have been looked at in some detail.

To summarize:

The merge operation brings together elements of different datasets or tables into a single consolidated table.

To carry out a merge, at least one of the columns in each of the datasets to be merged must be common.

There is more than one way to merge data.

A full merge combines all of the elements in the datasets into a consolidated dataset.

An inner merge combines only the rows which have elements in common (within the common column).

A left outer merge keeps all the rows of the table or dataset to the left. A right outer merge keeps all the rows of the table or dataset to the right.




Section 5 Text Analytics and Word Cloud




PART 1 – UNDERSTANDING TEXT ANALYTICS

INTRODUCTION

Analyzing text is extremely powerful and is an integral part of our social media and web activity. So, here is what will be covered in this tutorial:

- Understanding textual or text analytics

- Importance of text analytics for organizations

- Common terms in text analytics

- Understanding the framework to create a Word Cloud




WHAT IS TEXT ANALYTICS

To understand this let us look at the type of information that is available to organizations today. In today’s competitive environment information is power. A lot of this information or data is present on the web in the form of text or videos. Very rarely is this information available to organizations in a structured format which can be stored in a database. In fact organizations need to take data that is available out there and structure it so that it is useful for them. But this can be a daunting task especially where most of the information is in text format. This is where textual or text analytics plays a key role.

So if we were to define text analytics we could say that it is the process of deriving high quality information from unstructured text. Simply put, it is making sense or giving structure to data or information which is not structured.




HOW IS TEXT ANALYTICS USEFUL

Let us suppose that you have been searching on the web for anything related to computer games. On your search results page you will often find ads and recommended pages related to computer games. A lot of what you see is dependent on your search history or the keywords that you have been using.

Likewise, when you are on the Newsfeed page of Facebook, you can see posts on Suggested pages or Ads displayed on the right hand side of your page. Maybe you have been looking for something specific on Facebook or have been spending time on a certain company page. Those suggested pages or ads could be very similar to the pages that you have been looking for or spending time on in Facebook.

If you have a Gmail account, you would find in your Spam folder a lot of mail that you yourself did not actually send to Spam. Well, all of these examples that we have cited are a result of using text analytics. Take the example of Spam filtering. There have been instances when you have flagged mail from a certain sender as Spam. Your mail service provider will now automatically look for the words used in that mail and send any mail with that text to Spam. Likewise in Facebook what you search for or write is being analysed to come up with suggested pages and display ads.

Text analytics is an exciting and useful part of analytics. To understand this concept better, in the sections ahead we are going to focus on two aspects:

1. Understanding the common terms used in text analytics; and
2. Completing a text analytics project using data from a popular social medium - Twitter. The project will focus on the framework to create a Word Cloud out of a set of tweets on Big data, R and analytics.

You will need to execute this project in R using the concepts that will be discussed in this tutorial. We will of course be guiding you along the way.




IMPORTANT TERMS IN TEXT ANALYTICS

Corpus

The first term that we will look at is called Corpus. Corpus is the data structure that is used to manage the text that is being analyzed. A simple way to look at a corpus is to think of a dictionary.

It is a data structure of the relevant words in a piece of text. Let us assume that we are analyzing a blog on democracy. The corpus will be a list of all relevant words in the blog related to democracy stored in a structured format. So like in a dictionary when you look up the term democracy you will find all words associated with it listed in one place, the corpus will list all the relevant words from the blog in a single place. An important point to remember about a Corpus is that just like in the case of a dataset, it needs to be cleaned up.

Cleaning up a Corpus

Stopwords

So what do we typically clean from a Corpus? Firstly, words which do not carry much meaning by themselves need to be removed. For eg, if the blog that we are analyzing uses the words “the”, “or”, “of”, “am”, “is”, “are”, “was” quite frequently – these words have little or no value and hence need to be removed from the Corpus. These types of words are referred to as stopwords. There are around 196 stopwords that have been identified. You need not worry about identifying these words by yourself, because in R we will be using a Text Mining or TM package which will help you in identifying and removing stopwords from your Corpus. In addition to the 196 stopwords identified, you can also add your own stopwords based on what you think is useful or not. For example, if you think that names of people in the text you are analyzing are not useful, these can be added to the list of stopwords to be cleaned from the Corpus.

Numbers

Secondly, we can also remove numbers from the Corpus. So, if numbers have been used to demarcate points like 1, 2, 3 and so on, these can be removed from the Corpus as they have no meaning by themselves.

Punctuation

Thirdly we can also remove punctuation like commas, semi colons, colons, full stops etc from the Corpus.

Treatment of case

Fourthly, we can decide whether the same words used in a text need to begin with upper case or lower case. For eg, if democracy is spelt in one place with lower case but in another sentence begins with upper case, then we need to decide if in the Corpus democracy should always start with upper case or lower case.

Stemming

The next type of clean up that can be done is through a process called Stemming. To understand Stemming, let us assume that in the blog we are analyzing, the word “participate” has been mentioned in different ways like “participated” “participating” “participatory” etc across the blog. All these words relate to the same root word which is “participate”. The process of Stemming will ensure that all these words eventually add up to that one root no matter the tense used. Another example would be a verb like “fly” which can be represented in an article as “flew” “flying” “flown” etc. The aim of Stemming is that in the end these are all represented by the one word “fly”, although a simple stemmer that only strips word endings may not catch irregular forms such as “flew”.




Framework

So, the framework to start analyzing text begins with creating a Corpus which is a data structure to store text. This is then followed by the process of cleaning up the Corpus wherein the following is carried out:

- Stopwords are removed
- Numbers are removed
- Punctuation is removed
- Treatment of case is decided
- Stemming is done

Another important term is Tokenize. In this process a sentence is broken down into individual tokens so that each word in that sentence is a separate entity. So the sentence “Parliament is the seat of democracy”, when tokenized would be: Parliament, is, the, seat, of, democracy. This method is also used in search engines like Google when they look at keywords. For eg, if the keywords “analytics jobs” are entered, they would first be broken down into 2 tokens “analytics” and “jobs”.
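To get a feel for these steps before using the TM package, here is a rough sketch that uses only functions built into R. The sample sentence and the three-word stopword list are made up for illustration, and Stemming is left out because it needs an additional package.

```r
# A made-up piece of text to clean.
text <- "1. The Parliament is the seat of Democracy; democracy needs participation!"

clean <- tolower(text)                      # treatment of case: all lower case
clean <- gsub("[[:digit:]]+", "", clean)    # remove numbers
clean <- gsub("[[:punct:]]+", "", clean)    # remove punctuation

# Tokenize: split the cleaned text into separate words.
tokens <- strsplit(trimws(clean), "[[:space:]]+")[[1]]

# Remove stopwords, using a tiny illustrative list.
my_stopwords <- c("the", "is", "of")
tokens <- tokens[!tokens %in% my_stopwords]
tokens
```

After cleaning, the two spellings “Democracy” and “democracy” have become the same token, which is why the treatment of case matters when words are counted later.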




WHAT IS NEEDED TO CREATE A WORD CLOUD

TDM

Having arrived at a clean Corpus, we now need to decide what to do with it. Remember that a Corpus in itself is not an output, but a dictionary to be used to create something else. So if our final objective is to create a Word Cloud out of a Corpus, the Corpus needs to be converted into a format which enables a Word Cloud to be created from it. To understand this better, we need to know what is required to create a Word Cloud. Two very simple components make up a Word Cloud – words and the number of times or frequency with which those words appear. For example, look at the table shown below:

Words Frequency

People 20

Democracy 35

Freedom 40

The numbers next to those words represent the number of times they appear in a piece of text like a blog or an article. When a Word Cloud is created, the frequency will determine the size of the word within the Word Cloud. For example in the image shown below, the larger the size of the words, the more frequently they would have recurred in the content or the text from which this Word Cloud was created.




So, the structure which has been described above is referred to as a Term Document Matrix or TDM. So to take a Corpus and make it Word Cloud ready we need to create a TDM. A TDM is made up of rows and columns. The rows represent the words (terms), the columns represent the documents, and each cell holds the number of times that word occurs in that document.

So let’s stop for a while and ask ourselves a question. Hey, I have a blog and I want to create a Word Cloud out of it. How can I do it? Well, everything we have discussed so far should answer our question. Quite simply:

1. Create a Corpus
2. Clean up the Corpus
3. Create a TDM or Term Document Matrix out of it
4. Create your Word Cloud
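The idea behind step 3 can be sketched with built-in R functions alone. The two short documents below are invented for illustration and are assumed to be cleaned already; the TM package builds the same kind of table from a real Corpus.

```r
# Two tiny, already cleaned documents (made-up text).
doc_names <- c("doc1", "doc2")
docs <- c("freedom democracy freedom", "people democracy")

# Tokenize each document into words.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Count how often each term appears in each document:
# rows are terms, columns are documents.
tdm <- table(term = unlist(tokens),
             doc  = rep(doc_names, lengths(tokens)))
tdm

# Total frequency of each word across all documents,
# which is what decides the size of the word in a Word Cloud.
word_freq <- sort(rowSums(tdm), decreasing = TRUE)
word_freq
```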




Installing the TM package in R

Creating a Word Cloud in R is possible through a package called the TM or Text Mining package. For example, to help with Step 2 which is cleaning up of the Corpus – the TM package uses a function called tm_map. To carry out a particular process, like removing Stopwords, the function that performs that process needs to be passed to tm_map as an argument after the Corpus.

The TM package comes with some really good documentation which you need to go through to understand how to execute each of the steps we have talked about. Remember to also use the Help feature in R for specific queries.
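As a starting point for reading that documentation, here is a minimal sketch of how the cleaning steps might be written with tm_map. The two sample sentences are made up, the code runs only if the tm package is installed, and the exact behaviour can differ between versions of the package, so treat it as an outline and not as the definitive code for the project.

```r
tdm <- NULL  # stays NULL when the tm package is not available

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Two made-up documents standing in for real text.
  docs <- c("Democracy needs the participation of people.",
            "People participated in 2 elections!")

  corpus <- VCorpus(VectorSource(docs))                        # create the Corpus
  corpus <- tm_map(corpus, content_transformer(tolower))       # treatment of case
  corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
  corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stopwords
  corpus <- tm_map(corpus, stripWhitespace)                    # tidy extra spaces

  tdm <- TermDocumentMatrix(corpus)                            # create the TDM
  inspect(tdm)
}
```

Stemming could be added with tm_map(corpus, stemDocument), which relies on a separate stemming package being installed.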

Before we move onto our project of creating a Word cloud out of a set of tweets, let’s make sure we do the following.

1. Download the file comprising the tweets that we need to convert into a Word Cloud. You can find this in the Download section of this tutorial. When opening this file, remember to right click and select Open with R studio.

2. When the file is opened in R Studio, it will be visible in the Workspace section with the number of tweets in it, which is 320.

3. Install and load the list of packages that are mentioned in the Download section of this tutorial. The packages are:
a) twitteR: This package is needed to read the tweets that have been downloaded
b) wordcloud: This is required to create the Word Cloud
c) tm: As mentioned earlier, the TM or Text mining package is needed to create the Corpus, clean up the Corpus and create the TDM
d) Snowball: This package is required to enable Stemming to be carried out.




SUMMARY

In this section, the meaning and importance of Text Analytics was covered. Some important terms in Text Analytics and the framework to create a Word Cloud have also been explained.

To summarize:

Text Analytics gives structure to unstructured textual data.

In R, Text Analytics is done with the help of the TM or Text Mining package.

In Text Analytics, the first step is to create the Corpus.

The next step is to clean the Corpus, by removing Stopwords, numbers and punctuation, deciding the treatment of case, stemming etc.

To convert a Corpus to a Word Cloud, a TDM or Term Document Matrix needs to be created.




PART 2 – UNDERSTANDING TEXT ANALYTICS

INTRODUCTION

Word Clouds are a product of text analytics. They are not so difficult to create. So, here is what will be covered in this tutorial:

- Understanding how to create a Word Cloud from twitter data

- The syntax used to carry out some important steps

- Understanding how to use a few packages in R

HOW TO CREATE A WORD CLOUD

STEP 1: CREATING A DATAFRAME

Open the dataset of sample tweets in R Studio.

Before we begin, you should have downloaded the sample set of tweets and opened it in R Studio. You should also have imported and installed the list of packages that were specified in the previous section.

As you can see in the workspace section, a list of 320 tweets is displayed.




Double click on this list and you will see a list displayed in Notepad.

From this list it is pretty evident that there are no details of the actual content of the tweets. So what we have is essentially unstructured text. To convert this unstructured text into a Dataframe, enter the code shown below:

library(twitteR)
df <- do.call("rbind", lapply(tweets, as.data.frame))

Let us now deconstruct this code.

df = the twitter data is going to be stored in a dataframe called df

do.call = a function which calls another function, passing it a list of arguments. In our code, the function being called is rbind or row bind, and the arguments are the individual tweets after each one has been converted into a one-row dataframe.

For a detailed explanation of do.call, go to the Help section in R Studio. Type do.call in the Search field and press Enter. A detailed explanation of the do.call function is shown in Help. You can go through this explanation to understand the function better.




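rbind = an action to bind or combine together the 320 rows of tweets

lapply = a function which applies another function to every element of a list. Here it applies as.data.frame to each of the tweets, so that each tweet becomes a one-row dataframe ready to be combined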
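To see how do.call, rbind and lapply work together, here is a small sketch that runs without the twitteR package. It assumes a made-up list of three tweets, each stored as a plain list with a text field and a retweetCount field; the real tweet objects carry more fields.

```r
# Stand-in for the downloaded tweets: a list with one element per tweet.
# (Plain lists are used here so the example needs no extra packages.)
tweets <- list(
  list(text = "Learning R is fun", retweetCount = 2),
  list(text = "Text mining with tm", retweetCount = 0),
  list(text = "Word clouds in R", retweetCount = 5)
)

# lapply() applies as.data.frame() to every element, giving a list of
# one-row dataframes.
rows <- lapply(tweets, as.data.frame, stringsAsFactors = FALSE)

# do.call() calls rbind() once, with all the one-row dataframes as its
# arguments, stacking them into a single dataframe.
df <- do.call("rbind", rows)

print(df)
```

The result has one row per tweet and one column per field, which is the same shape as the 320-row dataframe built from the real data.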

Let us now execute the code mentioned above. Press Control plus Enter. As you can see in the Workspace section, a dataframe df with 320 observations is visible. These 320 observations are nothing but the twitter data which has been converted into a dataframe.

Let us open the dataframe. As you can see each row is numbered, with the first row relating to the first tweet, the second row to the second tweet and so on. This dataframe runs to 320 rows, which correspond to the 320 tweets in our original data structure.




The dataframe has 14 variables. For the purposes of our exercise we will focus on the text column of the dataframe.

In order to view the dimensions of the dataframe, enter the code

dim(df)

Press Control plus Enter and you can see the numbers 320 and 14 displayed in the Console.
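The same check can be tried on a small made-up dataframe. The sketch below assumes three rows and two columns purely for illustration, and also shows a few related base R functions.

```r
# A small stand-in dataframe with 3 rows (tweets) and 2 columns (variables).
df <- data.frame(
  text = c("Learning R is fun", "Text mining with tm", "Word clouds in R"),
  retweetCount = c(2, 0, 5),
  stringsAsFactors = FALSE
)

dim(df)    # number of rows, then number of columns
nrow(df)   # rows only
ncol(df)   # columns only
names(df)  # the column (variable) names
```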




STEP 2: INSTALLING THE TM PACKAGE

Once the tm package has been loaded, we can find out how to use it from the Help section. Go to Help and enter tm in the Search field. Press Enter. Click on the link tm.

You can see two links shown here – a Description file and Overview of user guides and package vignettes.

We will click on the second option which is the Overview of user guides and package vignettes.
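The same documentation can also be reached from the Console. This is a small sketch that assumes the tm package is already installed; the variable name tm_vignettes is just a label chosen here.

```r
# Load the package (it must already be installed).
library(tm)

# List the vignettes (user guides) that ship with the tm package.
tm_vignettes <- vignette(package = "tm")
print(tm_vignettes$results[, c("Item", "Title"), drop = FALSE])

# Open the package help index; in R Studio it appears in the Help pane.
# help(package = "tm")
```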




Once we do that, a PDF file on the Introduction to the tm package will open.

As we scroll through this document, you will see the tasks that are possible with the tm package listed. It covers everything from eliminating stop words and carrying out stemming to creating a Term Document Matrix. The Term Document Matrix or TDM, as we know, is essential to creating a Word Cloud as it lists the set of words along with their frequency, that is, the number of times they appear in a given text.




STEP 3: UNDERSTANDING TDM

There are actually two types of matrix that can be created out of a Corpus. The first is a Term Document Matrix, where the terms or words are rows. The second is a Document Term Matrix, where the documents are rows. The image below shows an example of a Document Term Matrix, in which the documents are listed in rows.

Let us interpret this matrix. Listed under Docs are the names or numbers of the text that have been analysed. Listed as columns are words which appear in these documents. So, if against Doc 127 and against the word “able” the number 10 was mentioned, it would mean that the word “able” has appeared 10 times in the document 127.
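The difference between the two matrices can be tried out on a tiny example. The sketch below assumes two made-up documents and the default settings of the tm package; it builds both matrices and looks up the count for the word "able" in the first document.

```r
library(tm)

# Two tiny "documents" as a stand-in for real text.
docs <- c("able minds stay able", "minds wander")
corpus <- Corpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)  # terms are rows, documents are columns
dtm <- DocumentTermMatrix(corpus)  # documents are rows, terms are columns

m_tdm <- as.matrix(tdm)
m_dtm <- as.matrix(dtm)

print(m_tdm)
print(m_dtm)

# The word "able" appears twice in document 1; both matrices agree.
m_tdm["able", "1"]
m_dtm["1", "able"]
```

Both lookups return the same count because one matrix is simply the other with rows and columns swapped.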

STEP 4: CREATING A CORPUS

The next step in our project is to convert the dataframe df that we have created into a Corpus. A Corpus, if you recall, is the data structure that stores all the text that will be analyzed.




In order to do this, use the syntax shown below.

myCorpus <- Corpus(VectorSource(df$text))

Let us deconstruct this syntax. The name of the Corpus to be created is myCorpus. As we know Vector is a column so the term “VectorSource” refers to the column in the dataframe whose data is to be copied to the Corpus. Since we are only interested in the text portion of the dataframe we need to indicate in the syntax that only the text column needs to be added to the Corpus. If we open the dataframe df you can see that the column which contains the contents of the tweets is referred to as text.

So in the syntax we mention df$text. Press Control plus Enter to create the Corpus.
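Here is a small sketch of this step on a made-up dataframe of three tweets. It assumes a version of tm that provides the content function for reading the text of a document back out of the Corpus.

```r
library(tm)

# Stand-in for the dataframe of tweets; only the text column matters here.
df <- data.frame(
  text = c("Learning R is fun", "Text mining with tm", "Word clouds in R"),
  retweetCount = c(2, 0, 5),
  stringsAsFactors = FALSE
)

# VectorSource() treats each element of the vector as one document.
myCorpus <- Corpus(VectorSource(df$text))

length(myCorpus)        # one document per tweet
content(myCorpus[[2]])  # text of the second document
```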




STEP 5: CLEANING THE CORPUS

Now that we have created a Corpus called myCorpus, which contains all the text to be converted to a Word Cloud, we need to proceed to the next step, which is the cleaning up of the Corpus. The function in the tm package which will help in the clean up of the Corpus is tm_map. If you refer to the documentation on the tm package, which we looked at earlier, all the information needed to transform the Corpus has been specified in detail. Cleaning up the Corpus is part of this transformation.

Within the document, you will find the code required to carry out various processes, such as eliminating white space or blank space from the Corpus, converting to lower case and removing stop words.

In the Help document, in the code shown to transform the Corpus, a sample Corpus named “reuters” has been used. For the purposes of our project we need to use the same code but replace “reuters” with the name myCorpus.

Shown here in R Studio is a list of codes to clean up myCorpus.
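A minimal sketch of what such a list of clean-up commands might look like is given here, covering conversion to lower case and the removal of punctuation and numbers. It assumes a recent version of tm, in which base R functions such as tolower need to be wrapped in content_transformer. The two sample sentences are made up for illustration, and stripWhitespace is added only to tidy up the gaps left behind.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c(
  "Learning R in 2014: it's FUN!",
  "Top 10 tips, tricks & more..."
)))

# Base R functions such as tolower() are wrapped in content_transformer()
# so that the result is still a tm document.
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stripWhitespace)

# Pull the cleaned text back out to inspect it.
cleaned <- sapply(seq_along(myCorpus), function(i) content(myCorpus[[i]]))
print(cleaned)
```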

In R Studio we will place the cursor on each line of code in turn and hit Control plus Enter to start cleaning up the Corpus. We will first convert to lower case, then remove punctuation and then remove numbers from the Corpus.

Removing URLs

It is also possible to remove URLs from the Corpus with the help of a user defined function which is shown below.

Let us deconstruct this function. The name of the function is removeurl. By indicating http in the code followed by alnum followed by two double quotes we are stating that where any url starting with http is present, it needs to be blanked out. X in the code is the placeholder for the name of the Corpus which in our case is myCorpus. In the next line we can see the code which calls the removeurl function.




To find out the meaning of gsub which is used in the function, type out the words gsub in the search field of the Help section. As you can see from the text displayed gsub is a function which is used to carry out any kind of replacement.

So in the removeurl function gsub is replacing any text starting with http with a blank.
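Based on the description above, the function might be sketched as follows. The pattern matches http followed by any run of letters and digits, and the sample strings are made up for illustration.

```r
# Replace "http" and any letters or digits that follow it with an empty string.
removeurl <- function(x) gsub("http[[:alnum:]]*", "", x)

# After punctuation has been removed, a link such as http://t.co/abc123
# has already become "httptcoabc123", so the whole link is blanked out.
removeurl("new slides httptcoabc123 online")

# If punctuation is still present, only the leading "http" is matched,
# because ":" is not a letter or digit.
removeurl("see http://t.co/abc123")
```

This is why the order of the clean-up steps matters: the function works as intended only after punctuation has been removed. In recent versions of tm, a function like this would be applied to the Corpus with tm_map(myCorpus, content_transformer(removeurl)).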

Removing stop words

We will now look at how to remove stop words from the Corpus. As you can see in the Console, there are a number of words which by themselves carry little meaning. Some of these words are once, why, each, in and to.

There are around 196 Stop Words that have been identified, but you can include more as well.

In addition, we also want to include some other words which for the purposes of our project are of no value or utility. These words are English, available and via.




Now using the code shown below we will go ahead and remove the stop words including the ones we have added from the Corpus.

Press Control plus Enter.
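A sketch of how this step might look is given below. It assumes the built-in English stop word list of the tm package, extended with the three extra words mentioned earlier in lower case; the name myStopwords and the two sample sentences are chosen here for illustration.

```r
library(tm)

# Built-in English stop words plus the three extra words from the text.
myStopwords <- c(stopwords("english"), "english", "available", "via")

myCorpus <- Corpus(VectorSource(c(
  "slides are available via the course page",
  "why do we use r for text mining"
)))

# removeWords() deletes every word in the list from each document.
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)

cleaned <- sapply(seq_along(myCorpus), function(i) content(myCorpus[[i]]))
print(cleaned)
```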

Stemming

Now that we have removed stop words, we will move on to another important process in the clean up of the Corpus, which is called Stemming. To do this we first create a copy of the Corpus by using the code shown.

Stemming reduces related forms of a word, such as eating and eats, to one root word, eat.

In order to carry out stemming we need to install a package in R called SnowballC. To do this, go to Packages, Install Packages and write out the name of the package which is SnowballC.

Since we have already installed this package we will click on Cancel, but in case you have not then click on Install instead.




To carry out stemming, we need to use a function called stemDocument. This function is part of the tm package and relies on the SnowballC package to do the stemming.

Press Control plus Enter to start stemming.
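The stemming step might be sketched as follows. The example assumes two made-up documents and uses the name myCorpusCopy for the unstemmed copy; wordStem from SnowballC is shown first on single words to make the effect easy to see.

```r
library(tm)
library(SnowballC)

# wordStem() from SnowballC reduces individual words to their stem.
stems <- wordStem(c("connected", "connecting", "connection"), language = "english")
print(stems)

myCorpus <- Corpus(VectorSource(c("connected devices", "connecting people")))

# Keep an unstemmed copy of the Corpus before stemming.
myCorpusCopy <- myCorpus

# stemDocument() stems every word in every document.
myCorpus <- tm_map(myCorpus, stemDocument)

stemmed <- sapply(seq_along(myCorpus), function(i) content(myCorpus[[i]]))
print(stemmed)
```

Note that a stem is not always a dictionary word, which is one reason for keeping the unstemmed copy.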

STEP 6: CREATING THE TERM DOCUMENT MATRIX

Let us pause for a while and try to recollect the next step in the framework to convert a Corpus to a Word Cloud. After cleaning up the Corpus, the next step is to convert the Corpus into a Term Document Matrix. Shown here is the code to convert the Corpus into a Term Document Matrix.

Let us deconstruct this code. The code indicates that any word with a frequency from one to infinity needs to be added to the Term Document Matrix.

This need not be mentioned in the code, because by default words with all types of frequencies will be added to the term document matrix.

Press Control plus Enter. The Term Document Matrix has been created.
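One plausible form of this call is sketched below; it is an assumption rather than the exact code from the screen. It uses the control option wordLengths with the range one to infinity. Note that in tm, wordLengths refers to the number of characters in a term rather than to its frequency, and that limits on frequency are set with a separate option, bounds. With the default settings, terms shorter than three characters are dropped, so a setting of this kind matters if a single-letter term such as r is to be kept.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c(
  "r makes text mining easy",
  "text mining in r"
)))

# Default settings.
tdmDefault <- TermDocumentMatrix(myCorpus)

# Assumed control option: keep terms of any length, from 1 character upwards.
myTdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

Terms(tdmDefault)  # terms kept with the default settings
Terms(myTdm)       # terms kept when the minimum length is 1
```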




A Term Document Matrix is the transpose of the Document Term Matrix: the terms are rows and the documents are columns. The frequency indicates the number of times the word has appeared in a document. If we look at the Console we can see the term Sparsity followed by 99%. This means that 99% of the cells in the matrix are zero; in other words, most words do not appear in most documents.

To view the contents of the Term Document Matrix, go to the Workspace section where you can see the value myTdm displayed. However, the Term Document Matrix is in the form of a List, whereas we would like to see it in the form of a matrix.

In order to do this, we create a matrix called m and convert the List into this matrix using the code shown below.

In the Workspace section, a matrix m is displayed. Double click on this and our Term Document Matrix opens up! So let us break this down.




The first column “row names” indicates the words that are contained in the 320 tweets (remember all stop words have been removed, so these are the actual usable words). The columns which are numbered 1, 2, 3 and so on are the tweets, which we know will run to 320. The numbers indicate the number of times these words appear in each of these 320 tweets. In most cases the number is zero, indicating that the word has not appeared in that tweet. To find out the frequency or the cumulative number of times a word appears across the 320 tweets we will need to look at the sum of each row. So for example to find out the frequency of the word “big”, we will need to add up all the numbers under each of the 320 columns against the row “big”.
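The conversion and the layout of the matrix can be tried out on a tiny example. The sketch below assumes three made-up documents in place of the 320 tweets. It converts the Term Document Matrix with as.matrix, picks out the row for the word "big", and works out the share of zero cells, which is what Sparsity measures.

```r
library(tm)

myCorpus <- Corpus(VectorSource(c(
  "big data big ideas",
  "text mining ideas",
  "big text"
)))
myTdm <- TermDocumentMatrix(myCorpus)

# Convert the Term Document Matrix into an ordinary matrix:
# one row per term, one column per document.
m <- as.matrix(myTdm)
print(m)

# Counts for the word "big" in each document.
m["big", ]

# Sparsity is the share of cells that are zero.
mean(m == 0)
```

Here 7 of the 15 cells are zero. With 320 short tweets and many distinct words, the share of zero cells is far higher, which is why a figure such as 99% is shown.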




STEP 7: CALCULATING FREQUENCIES

In order to create a word cloud we will need to plot the word against its frequency. The code to calculate the frequency of words and sort it in descending order is shown here.

Let us deconstruct this code. The term rowSums with m within brackets indicates that each row in the Term Document Matrix m will be summed. decreasing = TRUE means that the sums will be arranged in descending order. Press Control plus Enter. The result will be stored in wordfreq.

To view the frequencies that have been calculated, select wordfreq and press Control plus Enter. The results are displayed in the form of a List. So an easier alternative would be to convert the List “wordfreq” into a matrix “wordfreq1” using the code that is shown on the screen.

In the Workspace double click on the matrix “wordfreq1”. Shown on the screen is a matrix of all the words in the Corpus myCorpus along with their frequencies or the cumulative number of times they appear in the 320 tweets. Also, the frequencies have been arranged in descending order from the highest to the lowest.
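The frequency calculation described in this step can be sketched in base R as follows. The small matrix m is made up by hand so that the example runs on its own; the names wordfreq and wordfreq1 follow the text.

```r
# A small hand-made term document matrix: terms in rows, tweets in columns.
m <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("big", "data", "r"), Docs = c("1", "2", "3"))
)

# Add up each row to get the total count per word, highest first.
wordfreq <- sort(rowSums(m), decreasing = TRUE)
wordfreq

# One-column matrix version, which is easier to browse in R Studio.
wordfreq1 <- as.matrix(wordfreq)
wordfreq1
```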




STEP 8: CREATING THE WORD CLOUD

Now all that is left to be done is to generate the Word Cloud. Let us go to the Help section in R Studio and enter the words Word Cloud. Click on the link which appears. As you can see the arguments necessary to create a Word Cloud will be listed.

The first requirement shown is words, followed by frequencies. There are many other options listed so that one can create a Word Cloud based on different conditions. But a Word Cloud can be generated with just 2 pieces of information – words and their frequencies. The matrix of word frequencies that we will be using to generate the Word Cloud is called wordfreq1. To generate the Word Cloud enter the following code:
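A minimal sketch of such a call is shown here. The frequencies are made-up stand-ins rather than the real counts from the tweets. The sketch also sets min.freq to 1, on the assumption that every word should be shown at first, because the wordcloud function otherwise leaves out words that appear fewer than 3 times.

```r
library(wordcloud)

# Stand-in word frequencies, highest first.
wordfreq <- c(r = 12, analysis = 9, research = 7, example = 6, data = 3, big = 1)
wordfreq1 <- as.matrix(wordfreq)

# The words are the row names and the frequencies are the first column.
words <- rownames(wordfreq1)
freq <- wordfreq1[, 1]

pdf(NULL)  # draw to a throwaway device; omit this line in R Studio
wordcloud(words = words, freq = freq, min.freq = 1)
invisible(dev.off())
```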




Press Control plus Enter. The Word Cloud is being created in the Plots section of R studio.

The Word Cloud creates the words with the highest frequencies first. So, words like r, analysis, research and example have high frequencies and hence are displayed quite prominently in the Word Cloud. In the matrix, there were many words with a frequency of 1. We can choose not to show those words in the Word Cloud. To exclude these words from the Word Cloud, enter in the code an option to include only those words with a frequency of say 5 and above. The code to execute this is shown below:
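One way to write this is with the min.freq argument of the wordcloud function. The sketch below uses made-up frequencies and first lists the words that meet the threshold, so that the effect can be checked without looking at the plot.

```r
library(wordcloud)

# Stand-in word frequencies, highest first.
wordfreq <- c(r = 12, analysis = 9, research = 7, example = 6, data = 3, big = 1)

# Words that meet the threshold of 5; these are the ones that will be drawn.
kept <- names(wordfreq)[wordfreq >= 5]
print(kept)

pdf(NULL)  # draw to a throwaway device; omit this line in R Studio
wordcloud(words = names(wordfreq), freq = wordfreq, min.freq = 5)
invisible(dev.off())
```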

This code can also be found in the Help section.




Press Control plus Enter. You can see that fewer words are being added to the Word Cloud.

Another thing to remember is that each time a Word Cloud is generated the position of the words will change. As we can see, r which was earlier vertical is now horizontal and is located in a different place. In order to ensure that the position of a word does not change each time the Word Cloud is generated, we can use the function “set.seed”.

In order to limit the number of words to be shown in the Word Cloud, we can use the argument max.words. We can also determine the colour of the Word Cloud by using the argument colors, for example colors = "red".
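These three options can be combined in one sketch. The frequencies are made up, and the seed value 123 and the limit of 4 words are arbitrary choices for illustration.

```r
library(wordcloud)

# Stand-in word frequencies, highest first.
wordfreq <- c(r = 12, analysis = 9, research = 7, example = 6, data = 3, big = 1)

# set.seed() fixes the random numbers R uses, so repeated runs give the
# same result.
set.seed(123); first <- runif(3)
set.seed(123); second <- runif(3)
identical(first, second)

pdf(NULL)  # draw to a throwaway device; omit this line in R Studio
set.seed(123)
wordcloud(
  words = names(wordfreq), freq = wordfreq,
  min.freq = 1,
  max.words = 4,   # show at most the 4 most frequent words
  colors = "red"   # draw the words in red
)
invisible(dev.off())
```

Calling set.seed with the same value just before each wordcloud call is what keeps the layout from changing between runs.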




As you can see the words are now being displayed in red.

So we have completed the objective of this project which was to generate a Word Cloud. To generate a Word Cloud all that is needed are words and their frequencies. Other parameters can also be defined. Modifications can also be done on the Word Cloud like minimum frequency, maximum number of words to be displayed and colour.

Do try out the other parameters available by referring to the content in the Help section under Word Cloud.




SUMMARY

Creating a Word Cloud in R is a function of using the right package with the right set of text or words.

To summarize:

- Unstructured text can be converted into a structured format like a Dataframe in R using the correct syntax.

- The tm package in R, which is needed to carry out text analysis, comes with detailed documentation.

- While converting a Dataframe to a Corpus, the name of the vector/column which contains the text needs to be indicated.

- The tm package document displays the code to clean up a Corpus.

- Apart from cleaning up stop words, numbers and punctuation, URLs can also be removed through a user defined function.

- A TDM is first created as a List and then converted to a matrix in R.

- To calculate the frequency of words in a TDM, the row against each word in the matrix needs to be summed up.

- A Word Cloud can be created once words and their frequencies are mapped out.

- Modifications to a Word Cloud include defining minimum frequency, maximum number of words to be displayed and colour.




By downloading this material and signing up for our course, you have supported us in our mission to help individuals and organizations take smarter decisions every day.

We hope to keep upgrading this material by focusing on improving quality and providing additional lectures and material on this subject.

To send us feedback on how to improve this course, do write to us at [email protected] with the subject line “R Handbook”.


Date post:	12-Jul-2015
Category:	Education
Upload:	ati2205