Copyright (c) 2014 • Redwood Associates Business Solutions Private Limited
• All rights reserved
How to use R
From Installation to Text Analytics
Table of Contents
From the Team
Copyright
Something helpful
Section 1
PART 1 – WHAT IS R
  Introduction
  What is R?
  Why R?
  Summary
PART 2 – INSTALLATION
  Introduction
  How can R be installed?
  Overview of the R GUI
  Installation of R Studio
  Overview of R Studio GUI
  Summary
Section 2
PART 1 – DATA TYPES
  Introduction
  Why are DATA TYPES important?
  Data types
  Creating data types in R
  Summary
PART 2 – DATA STRUCTURES & VECTORS
  Introduction
  What is a data structure
  What is the difference between data structure and data type
  Type of data structure – Vectors
  How to create a Vector in R
  Mixing up data types in a Vector
  Replacing the contents of a Vector
  Arithmetic functions between Vectors
  Identifying elements in a Vector
  Replacing contents in a Vector
  Using a function to Index
  Speeding up the task with Operators
  Summary
PART 3 – DATAFRAMES
  Introduction
  What is a Dataframe
  Creating a Dataframe in R
  Functions that can be carried out with Dataframes
  Summary
PART 4 – LIST & MATRIX
  Introduction
  What is a List
  What is a Matrix
  Creating a List in R
  Creating a Matrix in R
  Creating a Matrix out of a Dataframe
  Summary
PART 5 – FACTORS
  Introduction
  What is a factor
  How to create a factor in R
  Summary
Section 3
PART 1 – PACKAGES
  Introduction
  What is a Package
  Installing and Loading a Package
  Importing an Excel file into R
  Importing a CSV file into R
  Summary
PART 2 – Exporting and Reading data in R
  Introduction
  Exporting to Excel
  Exporting to CSV
  Reading a file in R
  Summary
Section 4
PART 1 – Logical operators and If condition
  Introduction
  What is a logical operator
  How to execute a logical operator in R
  What is IF Condition in R
  Summary
PART 2 – Merging data
  Introduction
  What is merging of data
  What are the different ways to merge data
  How to carry out a merge in R
  Summary
Section 5
PART 1 – Understanding text analytics
  Introduction
  What is text analytics
  How is Text Analytics useful
  Important terms in Text Analytics
  What is needed to create a Word Cloud
  Summary
PART 2 – Understanding text analytics
  Introduction
  How to create a Word Cloud
  Step 1: Creating a dataframe
  Step 2: Installing the tm package
  Step 3: Understanding TDM
  Step 4: Creating a corpus
  Step 5: Cleaning the corpus
  Step 6: Creating the term document matrix
  Step 7: Calculating frequencies
  Step 8: Creating the word cloud
  Summary
From the Team
A note from the Online Education team
ANALYTICS TRAINING INSTITUTE
Hello,
Thanks to the many students who have signed up for our courses, we are delighted to
offer all our online lectures as downloadable material. We know that learning should
be continuous, so through this material we hope that you will take your time within
your busy schedule to really understand the concepts and techniques of this
fascinating open source tool – R!
We have tried to make your learning easy by highlighting key takeaways and screen
grabs of the tool so that you can continue your learning offline as well.
We welcome you to post any comments or questions on this material in the
Discussion forum as many of you have been doing or just reach out to us at
Enjoy the learning experience and thank you for choosing us to be your partner in
your journey of discovery!
The team at ATI
Copyright
(c) 2014 Redwood Associates Business Solutions Private Limited
All rights reserved. Without limiting rights under the copyright reserved above, no
part of this publication may be reproduced, stored, introduced into a retrieval
system, distributed or transmitted in any form or by any means, including without
limitation photocopying, recording, or other electronic or mechanical methods,
without the prior written permission of the publisher, except in the case of brief
quotations embodied in critical reviews and other noncommercial uses permitted by
copyright law. The scanning, uploading, and/or distribution of this document via the
internet or via any other means without the permission of the publisher is illegal and
punishable by law. Please do not participate in or encourage electronic piracy of
copyrightable materials.
For permission requests, email [email protected]
SOMETHING HELPFUL
Here are a few things that you would probably find helpful before you begin.
Sections:
There are 5 sections in this material starting with the installation of R and R Studio
right up to generating a Word Cloud in R which showcases the text mining capabilities
of R.
Online videos:
This material is a supplement to the online videos available on
https://www.udemy.com/analyticstraining/?dtcode=VQRaQsx1KWR2
This material corresponds to the section on R.
Material:
The online class format supports downloadable material for each section. So perhaps
it would be a good idea to check each section for additional downloadable material
like case studies or sample data to work on and so forth.
Ready to begin?
Section 1 Overview and Installation
PART 1 – WHAT IS R
INTRODUCTION
R is an open source statistical tool that not only manages data but also carries out many sophisticated analytical processes. Before looking at how R works, it is important to get a good overview of R. So, here’s what will be covered in this tutorial:
- What is R?
- Why R?
WHAT IS R?
So, to begin, let’s start with a very basic question. What is R?
1. R, as we already know, is a statistical tool that is on par with other statistical tools like SAS, SPSS and Python in terms of what it can do.
2. R can manage and analyse data. It can execute a wide range of statistical techniques such as linear regression, logistic regression, forecasting and decision trees, among many others.
WHY R?
So what makes R stand out when compared to other statistical tools? Let us break it down.
1. Firstly, R can work with many types of data and with data sets of widely varying size. Whether the data you are working with is small or really big, R can usually handle it, although R holds data in memory by default, so very large data sets may need more memory or additional packages.
2. R can work with data received in many file formats, such as text, CSV and SAS files.
3. R offers really great visualization of data. It can connect with Google Maps and motion charts.
4. Next – and this is what makes R so much more powerful than other statistical tools – it is open source. Open source does not just mean that it can be used for free, but that anyone can contribute to it as well.
5. R needs relatively little code, even when it is handling large volumes of data or carrying out complicated analytical techniques.
6. As was mentioned earlier, R being open source means anyone can contribute to it. This is why R has a huge community of contributors who keep adding functionality to it almost daily. This is the reason why even the most complicated techniques can be executed in R by just calling a function. So, when using R, we as users do not need to worry about how to perform a linear regression or a logistic regression. The code to execute this and many other advanced analytical functions is already built in and is refined by the R community on a regular basis.
7. R is used by a lot of big corporations like Facebook, Google, Mozilla, Lloyds and Merck, among others. This goes a long way in validating the capability of R and adds to its credibility.
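As a rough illustration of point 6, the sketch below fits a linear regression with a single function call. It uses lm() from base R and the mtcars data set that ships with R. The choice of variables (mpg explained by wt) is only an assumption made for this example and is not taken from the course material.

```r
# Fit a linear regression of fuel efficiency (mpg) on weight (wt)
# using the mtcars data set that is bundled with R.
fit <- lm(mpg ~ wt, data = mtcars)

# Show the estimated intercept and slope.
coef(fit)
```

The estimation itself is handled inside lm(); the user only describes the model and names the data.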
SUMMARY
In this material, we covered the reasons which make R a powerful statistical tool. To summarize,
R is an open source statistical tool that can be used freely by anyone.
It is improved upon every day by a large community of contributors who keep adding new code and functions to it.
R can work with big or small data, and also with the different formats in which data is usually presented.
It does not use much code and offers great data visualization, making it a popular statistical tool in many global corporations.
PART 2 – INSTALLATION
INTRODUCTION
To start leveraging the power of R, it first needs to be installed. So, here’s what will be covered in this tutorial:
- Installation of R
- Overview of a typical GUI or Graphic User Interface of R
- Installation of R Studio
- Overview of the GUI of R Studio
HOW CAN R BE INSTALLED?
To begin, let’s look at how to install R. To install R, open the following link: http://cran.r-project.org/bin/windows/base/old/3.0.2
This page offers different downloads of R based on system configuration and operating system, such as R for a 32-bit system, R for a 64-bit system or R for Windows. Download the version of R that is best suited to the operating system being used.
When you update your version of R, the earlier version is NOT automatically uninstalled. Further, R Studio allows you to run multiple versions of R (though not in the same session). Therefore, in R Studio, find out which version of R is running by typing R.Version(). The default version of R that R Studio runs can be changed from Tools > Options > R General.
Before proceeding with the rest of this tutorial, we suggest that you download R if you haven’t already.
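Once R is installed, the version check mentioned above can be tried out as follows. This is a small sketch using only base R; the variable name ver is chosen here for illustration.

```r
# R.Version() returns a list describing the running R installation.
ver <- R.Version()

# The full version string (release number and date).
ver$version.string

# The same string is also available as a ready-made variable.
R.version.string
```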
OVERVIEW OF THE R GUI
Shown here is a typical R GUI or Graphic User Interface.
At first glance, all that is visible on the R GUI is a single screen, which is known as the Console. The Console can be used to input data as well as view output. However, we recommend that the Console be used only to view output. Commands or inputs in R are referred to as Scripts. To write a script, go to File and select New Script.
A new blank window called R Editor opens.
Think of a Script in R as code or syntax that is written in order to tell R what it needs to execute. For example, let’s enter a = 1, which means R is being told to create a variable “a” and store a value of “1” against it.
To execute this script or code, press Control + Enter. As shown in the image below, the command is executed and the output is displayed in the Console.
The output a=1 is displayed in red in the Console. So, essentially input is specified in R Editor and the output is displayed in the Console.
INSTALLATION OF R STUDIO
A more user-friendly option available to users is R Studio. It has a better GUI and comes with more options. To take a more detailed look at R Studio, let us first install it. To download R Studio, use the following link: http://www.rstudio.com/ide/download/
We recommend that the installation of R Studio be completed before proceeding with this tutorial.
OVERVIEW OF R STUDIO GUI
As shown in the image, there are 4 sections in R Studio.
Let’s briefly go through each section.
Section 1: Editor
The first section is the Editor section, where the script or code that R needs to execute is written. To add more than one script, use the plus sign in the top left-hand corner. It is possible to add as many scripts as required using this option.
Using the example looked at earlier, the script or code a = 1 is entered. To execute this code, press Control + Enter.
Section 2: Console
The output appears in the Console window which can be found right below the Editor window. When values appear in the Console section it means that the script or the code has been executed.
Section 3: Workspace
To the right of the Editor section is the Workspace section, where the data being worked on can be viewed.
This includes data that has been imported from an external source. In the example used, a new variable “a” with a value of “1” was created. Since this is the data currently being worked on, both “a” and “1” are displayed in the Workspace section.
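The contents of the Workspace section can also be listed from code. The sketch below assumes the variable a from the example and uses the base R function ls(); the name objs is only illustrative.

```r
# Create the variable from the example.
a <- 1

# ls() returns the names of the objects currently in the workspace.
objs <- ls()
objs
```

Every name returned by ls() corresponds to an entry shown in the Workspace section.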
Section 4: File, Plots, Packages and Help
The other sections in R Studio are Files, Plots, Packages and the Help section. The Help section helps in locating functions in R Studio. In the Search field, type what is being searched for and press Enter. For example, if plot is entered in Search, everything related to it is displayed just below.
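A similar search can be run from code. The sketch below is one possible approach using the base R function apropos(); it is offered as a complement to the Search field, not as the method used in the videos.

```r
# apropos() returns the names of available objects that contain "plot".
matches <- apropos("plot")
head(matches)

# help("plot") (or the shortcut ?plot) shows the documentation for plot.
# In R Studio the page appears in the Help tab.
```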
The other tabs available are Packages (which will be covered later), Plots, and Files, which displays all the files that are currently being worked on.
For the rest of the series of tutorials on R, we will be working with R Studio as it has a better GUI than R Editor.
SUMMARY
We are now ready to start using R – to manage data and carry out other types of actions on data. To summarize:
The links to install both R and R Studio have been provided.
An overview of the typical GUI of R has been looked at. Since individual screens need to be opened, a better option is R Studio.
R Studio is more user friendly as all the relevant sections are available at a single glance removing the need to have multiple screens open at a time.
There are 4 sections in R Studio.
a. The first section is the Editor section, which is used to enter scripts or code for R to execute.
b. The second section is the Console, where output is displayed.
c. The data that is generated or being worked upon can be found in the Workspace section.
d. Files, Plots, Packages and Help make up the fourth section.
Section 2 Data Types and Data Structures
PART 1 – DATA TYPES
INTRODUCTION
What is Analytics without data? Likewise, how can you leverage the amazing capabilities of R without understanding data? This tutorial is an in-depth look at types of data or data types. So, here’s what will be covered in this tutorial:
- Understanding why data types are important
- Different data types
- Creating data types in R
WHY ARE DATA TYPES IMPORTANT?
To begin, it is important to understand why data types are useful and why it is necessary to be able to distinguish between different types of data. Suppose you have been asked to evaluate five different brands of cars – let us call them Brand A, Brand B, Brand C, Brand D and Brand E. If you were asked to calculate the mean of these five cars, how would you go about it? It would be an impossible operation to carry out because all you have is the name or the brand of these cars and, as you know, you cannot calculate the mean of names! The situation would be different if you had some numeric data about these cars. This emphasizes the need to understand the type of data you have to work with, because certain functions can be carried out only on certain types of data. For example, calculating a mean is not possible with character data such as names or brands.
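The car example can be tried out directly in R. In the sketch below the prices are hypothetical values invented for illustration; only the brand names come from the text.

```r
# Brand names are character data, so a mean cannot be computed.
brands <- c("Brand A", "Brand B", "Brand C", "Brand D", "Brand E")
brand_mean <- suppressWarnings(mean(brands))  # R warns and returns NA
brand_mean

# Hypothetical prices (illustrative values only) are numeric,
# so the mean is well defined.
prices <- c(21.5, 18.0, 25.2, 19.9, 23.4)
price_mean <- mean(prices)
price_mean
```

Without suppressWarnings(), R also prints a warning that the argument is not numeric or logical.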
DATA TYPES
Data can be of different types. The different types of data one would commonly come across are:
Numeric: Refers to any number or numeric value, including decimals. Eg: 1.2, 2.1 etc
Integer: Refers to any number without a fractional part. Eg: 1, 2, 3…
Logical: Refers to any value which is either True or False. Eg: if x = 1, y = 2, then x being greater than y is False
Character: Refers to textual data. Eg: learning, education…
Factor: Refers to data in categories. Eg: City, Gender
Each data type will now be discussed in some detail.
1.1 Numeric data types
Numeric data type is any number or numeric value like 2.1, 1.2 and so on. It could be an integer or a decimal value. In R Studio, to create a numeric data type the syntax y <- 3.1 (or y is equal to 3.1) is used. This means that a variable y is being created against which a numeric value of 3.1 is being stored.
To assign a value, we can use either the symbol <- or =.
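A short sketch of the two assignment forms, using the value 3.1 from the text. The second variable name z is introduced here only for comparison.

```r
# Both assignment forms store the same numeric value.
y <- 3.1
z = 3.1

identical(y, z)   # TRUE
class(y)          # "numeric"
```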
1.2 Integer data types
Integer data type indicates any data which stores integer values. In R Studio, numeric data types can be converted to integer data types by using the following syntax: as.integer(numeric value). Eg: as.integer(3.1), which returns 3.
1.3 Logical data types
Logical data type indicates any data where the value is either TRUE or FALSE, but never both. In R Studio, a logical value is typically produced by a comparison. For example, after x <- 1 and y <- 2 (x is equal to 1, y is equal to 2), the expression x > y returns FALSE because x is not greater than y.
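The comparison described above can be written as the following sketch. The names result and flag are illustrative additions.

```r
x <- 1
y <- 2

# The comparison itself produces a logical value.
result <- x > y
result          # FALSE
class(result)   # "logical"

# TRUE and FALSE can also be assigned directly.
flag <- TRUE
```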
1.4 Character data types
Character data type stores characters or strings. In R Studio, they have to be written within quotes, usually double quotes. For example, the text learning would be written as "learning".
1.5 Factor data types
Factor data type refers to categorical types of data like gender or cities. This data type will be covered in more detail in a separate tutorial.
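As a brief preview ahead of the dedicated tutorial, a factor can be sketched as below. The city names are hypothetical sample values, and factor(), levels() and table() are base R functions.

```r
# Hypothetical city names stored as a factor (categorical data).
cities <- factor(c("Chennai", "Mumbai", "Chennai", "Delhi"))

class(cities)    # "factor"
levels(cities)   # the distinct categories, sorted alphabetically
table(cities)    # how many times each category occurs
```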
CREATING DATA TYPES IN R
How to create a numeric data type in R
In R Studio, let’s create a numeric data type with a variable name of num1 and a numeric value of 3.1 stored against it. Enter these values using the code num1 <- 3.1 and press Control + Enter to execute this statement. The output is displayed in the Console area in blue, indicating that the code has been executed. Simultaneously, the values num1 and 3.1 are displayed in the Workspace section.
In order to identify the data type of the variable num1, use a function called class. Type class followed by the name of the variable in brackets, that is class(num1), and press Control + Enter to execute it. In the Console area numeric is displayed, indicating that the data type of num1 is numeric.
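The steps above can be sketched as follows. The extra check with is.numeric() and the bare number 3 are additions made here to show that whole numbers are also numeric unless converted.

```r
num1 <- 3.1

class(num1)        # "numeric"
is.numeric(num1)   # TRUE

# A whole number typed without a decimal point is still numeric by default.
class(3)           # "numeric"
```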
How to create an integer data type in R
A numeric data type can be converted into an integer data type in R. In the example used above, the number 3.1 when converted to an integer gives a value of 3. To convert this numeric data type to an integer data type in R Studio, the function as.integer(numeric variable) is used. Note that R is case-sensitive, so the function name must be written in lower case. Let us create a new variable num3 and store the integer value against this variable. Enter the code num3 <- as.integer(num1) and press Control + Enter. The values will be displayed in the Workspace section when the code is executed.
In order to determine the data type of num3, the following function will be used: class(num3)
Press Control + Enter to execute this statement. The output in this case would be integer as shown in the Console section.
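Put together, the conversion looks like the sketch below. The extra call with 3.9 is an illustrative addition showing that the fractional part is dropped, not rounded.

```r
num1 <- 3.1

# as.integer() drops the fractional part (it does not round).
num3 <- as.integer(num1)
num3               # 3
class(num3)        # "integer"

as.integer(3.9)    # 3, not 4
```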
How to print the contents of a variable
To print the value of any variable like num1 or num3, enter the variable name, say num3, and execute it. The value will be displayed in the Console, as in the case of num3, where the value 3 is displayed. Alternatively, use the code print(num3) (print and the variable name within brackets) and execute this.
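Both ways of printing are shown in the sketch below. The loop is an added illustration of a case where print() is needed because values are not displayed automatically.

```r
num3 <- as.integer(3.1)

# Typing the name at the top level prints the value automatically.
num3

# print() gives the same output and also works where automatic
# printing does not happen, such as inside a loop.
for (value in c(1.5, 2.5)) {
  print(value)
}
```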
How to create a character data type in R
To create a character data type in R, let us create a variable char1 and store a value of "hello" against it. The code to execute this is char1 <- "hello" or char1 = "hello"
Remember that assignment can also be written using the equal to sign.
When this code is executed, a variable char1 is created in the Workspace section with the value "hello" stored against it.
To find out the data type of this variable, use the class function discussed earlier. Enter the code class(char1) and press Control + Enter. The value character is displayed in the Console area.
An important fact to remember about character data types is that these values are always written within quotes, usually double quotes (R also accepts single quotes). So anytime a value is entered within quotes, R will recognize it as a character data type.
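The effect of the quotes can be seen in the sketch below. The variable char2 and the conversion with as.numeric() are additions made for illustration.

```r
char1 <- "hello"
class(char1)    # "character"
nchar(char1)    # 5 characters

# A number written inside quotes is stored as text, not as a number.
char2 <- "3.1"
class(char2)    # "character"

# as.numeric() converts such text back to a number.
as.numeric(char2) + 1   # 4.1
```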
Logical and factor data types will be discussed in more depth in a later section.
SUMMARY
We have covered important data types in this tutorial. Understanding these data types will help in managing and working with data in R. To summarize:
It is important to understand data types in order to determine what type of actions can be carried out with a specific type of data.
Different data types are available: numeric, integer, character, logical and factor.
Different data types can be created in R using the proper syntax. Eg, num1 <- 3.1, as.integer(3.1), char1 <- "hello"
The function ‘class’ is used to determine the data type of a variable.
The function ‘print’ is used to print the contents of a variable.
PART 2 – DATA STRUCTURES & VECTORS
INTRODUCTION
The data that you are working on needs to work for you. In other words, it has to be arranged in a way that helps you manage, store and analyze it better. This tutorial will deal with data structures. So, here’s what will be covered in this tutorial:
- Understanding data structures
- Vectors – a type of data structure
- Creating vectors in R
WHAT IS A DATA STRUCTURE
A data structure in simple terms is a way of storing and organizing data. Let us understand this better with the help of an example. Shown here is a table with different types of information stored in it.
When storing information of different types, it will need to be stored across more than one variable. For eg, if the data to be stored relates to employee records, then the variables across which this data would be stored would be Name, Age, Address, Nationality, Assessment scores and so on. This collection of information displayed across different variables is referred to as a data structure.
WHAT IS THE DIFFERENCE BETWEEN DATA STRUCTURE AND DATA TYPE
A data structure is different from data type because of the number of values stored. Let’s look at this with the help of an example. If a variable “Name” has been created, and a value “Bob” stored against it, it will result in the creation of a character data type. In a data type only one value is stored. But when different information related to Bob apart from his name, is stored, like his age, address, nationality and assessment score then it results in the creation of a data structure. A data structure stores more than one value. A simple way to look at a data structure is to think of an Excel sheet with rows and columns where the columns are made up of different data types. In the example used, the Name column will store character data types, the Age column will store integer data types, the Score column will store numeric data types and so on.
TYPE OF DATA STRUCTURE – VECTORS
The first type of data structure that will be discussed is referred to as Vectors. A Vector is like a column in an Excel sheet. Going back to the example used earlier, Vectors would be Name, Age, Address, Nationality and so on. In Vectors, all the elements within a Vector should be of the same data type.
Vectors cannot have a combination of data types!
So, if Age is a Vector, then all the elements under Age should be of the data type integer. This Vector cannot have any other data type within it, like character or numeric, nor can it contain a combination of data types.
[Figure: an example of a Vector beside an example that is not a Vector]
A Vector is therefore a data structure which contains elements of the same data type. Visualize a single column in an Excel sheet which contains values of the same data type.
HOW TO CREATE A VECTOR IN R
In R Studio, a Vector can be created through a function known as c, or concatenate. So, let’s create a Vector called vector1, and store 4 values in it. This Vector will contain elements of the numeric data type. To create this Vector enter the code vector1 <- c(9,8,2,7) and execute this code.
Two events will take place. First, the Console will echo the statement that was run, showing vector1 and the values assigned to it.
Second, in the Workspace section the variable vector1 will be displayed along with the data type of its values, which is numeric, and the number of values, which is 4.
To print the contents of vector1, write the name of the Vector and press Control + Enter. In the Console the values 9, 8, 2 and 7 will be displayed. The [1] shown at the start of the output line is the position of the first element printed on that line; it is not a column number.
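As a quick sketch, the Vector from this section can be created and inspected as follows. The length() call is an addition for illustration and is not used in the text above.

```r
vector1 <- c(9, 8, 2, 7)   # c() combines the four values into one Vector

vector1                    # [1] 9 8 2 7
class(vector1)             # "numeric"
length(vector1)            # 4, the number of elements
```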
MIXING UP DATA TYPES IN A VECTOR
Now let us look at something interesting. As discussed, a Vector can only contain elements of the same data type. There can be no mixing of data types within a Vector. So what happens if a second Vector is created and along with numeric data types, a character data type is inserted into it? Shown here, is the code to create a new Vector called vector 2 with some values. Inserted into these values is a character value “bob”.
When this code is executed, the output is displayed in the Console. But the syntax used includes elements of different data types, whereas we know that vectors can only store elements of the same data type. So why is no error being displayed?
When the contents of vector 2 are printed, all values in the Vector are displayed in the Console in quotes. This indicates that R has automatically converted all the numeric values in the Vector to the character data type, which it shows by putting quotes around the numbers. This is why R does not display any error on executing this code!
R enforces the rule of a single data type by converting (coercing) all the values in a Vector to one common data type.
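The conversion described above can be reproduced with a small sketch. The Vector mixed_vector and its four values are hypothetical and shorter than the vector 2 used in the text; only the behaviour is the same.

```r
# Hypothetical Vector mixing numbers with one character value
mixed_vector <- c(1, 2, "bob", 4)

mixed_vector          # [1] "1"   "2"   "bob" "4"
class(mixed_vector)   # "character": every number was converted to text
```

Because the numbers are now text, arithmetic on such a Vector no longer works until the values are converted back.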
REPLACING THE CONTENTS OF A VECTOR
Values in a Vector can also be overwritten. Assigning new values to an existing Vector name replaces its entire contents, so the data type of the Vector can change as well. In the example we looked at earlier, vector 2 contains 11 values, all of which are character data types. Suppose we want to replace these 11 values with 4 values of numeric data type. These 4 values are 1, 2, 3 and 8. Let us enter the code vector2 <- c(1,2,3,8) and press Control + Enter. Take care with capital letters, as R treats Vector2 and vector2 as different names. In the Workspace the data type of vector2 has now changed to numeric and it has 4 values stored against it.
ARITHMETIC FUNCTIONS BETWEEN VECTORS
It is also possible to carry out arithmetic functions between Vectors like addition, subtraction, multiplication and division. The main pre requisite to execute these functions is that the Vectors should be of equal length. As you can see in the Workspace, both vector 1 and vector 2 are of numeric data type and have 4 values each, which means they are both of the same length.
It is possible to carry out any type of arithmetic function on these 2 vectors such as vector 1 + vector 2 or vector 1 – vector 2 and so on. Let us enter the code vector1 + vector2 and press Control + Enter. The output is displayed in the Console.
Let’s cross check these values. Vector 2 comprises the values 1, 2, 3 and 8. To check the values of vector 1, enter vector 1 and press Control + Enter. The values displayed in the console are 9, 8, 2 and 7.
So, when the statement is executed addition is carried out by adding element 1 of vector 1 to element 1 of vector 2, element 2 of vector 1 to element 2 of vector 2 and so on. So, when 9, the first element of vector 1 is added to 1, the first element of vector 2 the result is 10 which is shown in the Console.
You can cross check the rest of the results as well!
In this example the vectors were both of equal length. Let’s look at what happens in the event the vectors are of unequal length. Vector 1 has 4 elements. Let us add to it a new vector, c(1,2,3), which has 3 elements: 1, 2 and 3. Let us enter the code vector1 + c(1,2,3) and press Control + Enter. On executing this code, a warning message is displayed in the Console but the addition function has still been executed. How?
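The first three elements of vector 1 have been added to the three elements of the shorter vector. For the fourth element of vector 1 there is no corresponding fourth element, so R goes back to the start of the shorter vector and reuses its first element. This behaviour is known as recycling. The result is therefore 10, 10, 5 and 8, where the last value is 7 + 1. The warning is displayed because the length of the longer vector is not a multiple of the length of the shorter one. Therefore for accurate results it is better to add vectors of the same length.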
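The following sketch repeats the arithmetic from this section using the values from the text. The names total, difference and recycled are hypothetical and are used only to hold the results.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 8)

total <- vector1 + vector2         # element by element: 10 10 5 15
difference <- vector1 - vector2    # 8 6 -1 -1

# Unequal lengths: R reuses 1, 2, 3 from the start and shows a warning
recycled <- vector1 + c(1, 2, 3)   # 10 10 5 8  (the last value is 7 + 1)

total
difference
recycled
```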
IDENTIFYING ELEMENTS IN A VECTOR
Another interesting feature in Vectors is referred to as indexing. This feature allows a particular element in a Vector to be accessed. For eg, we know that vector 1 contains 4 elements, 9, 8, 2 and 7. Let us suppose that we want to find out the third element in vector 1 which is 2. Let us enter the code vector1 [3] and press Control + Enter. Entering 3 indicates that we want to access the third element of vector 1. We can see a value of 2 displayed in the console which as we know is the third element in vector 1.
So to index a Vector, next to the name of the Vector enter within square brackets the number of the element that needs to be accessed. Eg, vector1[3]
Indexing helps in identifying values in a vector based on their position.
REPLACING CONTENTS IN A VECTOR
Now, let us suppose that we want to create a new Vector called new_vector. In this new Vector we want to populate the same elements as vector 1 but without the second element. So in new_vector we only want to store the first, third and fourth elements of vector 1. Let us enter the code new_vector<-vector1[-2] and press Control + Enter. Entering minus next to 2 indicates that we want to exclude the second element of vector 1 in new_vector. When the code is executed we can see in the Workspace section that the vector new_vector has been created with three values of numeric data type.
To view the contents of new_vector, enter the name of the vector and press Control + Enter. In the console, 9,2 and 7 are displayed. 8 is not displayed as it is the second element in vector 1 and hence has been excluded.
If a Vector has only three elements but a value of 10 is entered in square brackets, then we are trying to index an element beyond those actually present in the Vector. This situation is referred to as an Index out of Boundary. R does not stop with an error in this case; it returns NA (not available) for the missing element.
If there are only 3 elements in a Vector, then how can you locate the 10th element?! Hence the term Index out of Boundary.
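The indexing steps described above can be sketched as follows, using vector1 and new_vector from the text. The name missing_value is hypothetical.

```r
vector1 <- c(9, 8, 2, 7)

vector1[3]                     # 2, the third element

new_vector <- vector1[-2]      # a minus sign excludes the second element
new_vector                     # 9 2 7

missing_value <- new_vector[10]   # there is no tenth element
missing_value                     # NA
```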
USING A FUNCTION TO INDEX
Indexing in Vectors can also be done with the help of logical conditions. Here’s how. Let us create a new Vector called vector2 with 4 elements in it: 1, 2, 3, 4. Enter the code vector2=c(1,2,3,4) and press Control + Enter. We already have a Vector called vector1 which has the elements 9, 8, 2, 7. Let us now use a logical condition to find the third element in vector1. Enter the code vector1[vector2==3]
By entering vector2==3, we are locating the position of the value 3 in vector2. The comparison gives TRUE for the position where the value is 3 and FALSE for every other position. The value 3 is the third element in vector2. So, when the code is executed, the third element in vector1 is displayed in the Console. Since the third element in vector1 is 2, we should be able to see this number in the Console.
Press Control + Enter. On executing this code we can see in the Console the value 2.
Using the position of 3 in vector 2, the logical function tries to find the equivalent position in vector 1.
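A short sketch of the logical indexing described above. The names position_test and large_values, and the last example with the condition vector1 > 5, are additions for illustration only.

```r
vector1 <- c(9, 8, 2, 7)
vector2 <- c(1, 2, 3, 4)

position_test <- vector2 == 3    # FALSE FALSE TRUE FALSE
position_test

vector1[position_test]           # 2, same as vector1[vector2 == 3]

# The same idea with a condition on vector1 itself
large_values <- vector1[vector1 > 5]
large_values                     # 9 8 7
```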
SPEEDING UP THE TASK WITH OPERATORS
Operators help in executing certain types of tasks quickly and more efficiently. Let us understand this better with the help of an example.
Let us assume that a Vector called age needs to be created which needs to store the first 100 natural numbers i.e., numbers from 1 to 100. One way to execute this is to write the code age <- c(1,2,3…..) and so on, mentioning all numbers till 100. This obviously is not a feasible option. Sometimes numbers could run till 100, at other times till even 1000! In these types of situations, a good option would be to use Operators. Here are a few common Operators that are used in R.
Colon Operator
The Colon Operator can be used to create Vectors like the age Vector quite easily by using the code age <- 1:100. To execute this code press Control + Enter.
To view the contents of the Vector, enter age and press Control + Enter. In the Console values from 1 to 100 are displayed.
In a Colon Operator the value before the colon is the first value in the series and the value after the colon is the last value in the series. So, when 500:505 is entered, the series will begin from 500, then move to 501, 502, 503,504 and end with 505.
Colon operators remove the necessity of writing a long series of numbers!
Sequence Operator
Let us suppose that a Vector is to be created with some numbers which are not continuous but have some sort of order to them. An example would be 1,3,5,7,9 and so on. To create this Vector, the Sequence Operator, which is written as the function seq, can be used. Let’s create a Vector called age and populate it with the values 1,3,5,7,9 and so on till 101. To do this, enter the code age <- seq(1,101,2) and press Control + Enter.
To view the contents of this Vector, enter age and press Control +Enter. In the Console the entire series is displayed.
In the code entered, 1 represents the start point, 101 represents the end point and 2 represents how the numbers should increment.
Sequence operators can populate vectors with data that follow a logical sequence.
So, now we know that vectors can be created in 3 different ways – firstly through c or concatenate, secondly with a colon operator and lastly with a sequence operator.
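The two operators can be compared in a short sketch. The text stores both series in age; here the hypothetical names short_series and odd_age are used so that the results do not overwrite each other.

```r
age <- 1:100                  # colon operator: every whole number from 1 to 100
length(age)                   # 100

short_series <- 500:505       # 500 501 502 503 504 505
short_series

odd_age <- seq(1, 101, 2)     # start at 1, end at 101, move up in steps of 2
head(odd_age)                 # 1 3 5 7 9 11
length(odd_age)               # 51
```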
SUMMARY
In this tutorial we have looked at the importance of having a proper structure to store and organize one’s data. One such structure is called a Vector. To summarize:
Data structures are needed to store and organize data.
A data structure comprises different data types like characters, numbers, integers and so on which are displayed in the data structure as variables like age or name.
A Vector is a data structure which can store values of a single data type – like only characters or only numbers or only integers.
Vectors can be created in three ways - using concatenate, Colon Operator or Sequence Operator.
Arithmetic functions like addition, subtraction and so on can be carried out between Vectors provided they are of equal length.
Indexing is a feature which can locate a particular element within a Vector.
It is also possible to create a new Vector by copying or modifying the contents of another Vector.
PART 3 – DATAFRAMES
INTRODUCTION
Vectors, the first type of data structure that was looked at, are actually quite closely linked to the next type of data structure that will be discussed. This tutorial will deal with Dataframes. So, here’s what will be covered in this tutorial:
- Understanding Dataframes
- Creating Dataframes in R
- Different functions related to Dataframes
WHAT IS A DATAFRAME
Shown here is a table with columns like Name, Age and Score.
Each column is in fact a Vector. So, Name constitutes one Vector, Age another and Score another. So, a Dataframe is nothing but a collection of Vectors of equal length.
CREATING A DATAFRAME IN R
Here is a table with some data. This data needs to be converted into a Dataframe called Records.
The table has 4 columns which individually become 4 Vectors in the Dataframe. So, the first step in creating the Dataframe is to create the four Vectors.
To create the four Vectors in R Studio, viz, Name, Gender, Age and Income, enter the code
Name <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age <- c(20,21,24,26,26,23)
Income <- c(20000,30000,35000,40000,41000,50000)
Select the code entered and press Control + Enter.
The next step is to actually create a Dataframe around this called Records. Enter the code Records<- data.frame(Name, Gender, Age,Income) and press Control + Enter.
The order in which the Vectors are entered is important. If Name is entered first, it will be the first Vector displayed in the Dataframe. Likewise if Gender is entered first it will be the first Vector displayed in the Dataframe. On executing this statement, the Console shows that the code has been executed.
In the Workspace the name of the Dataframe Records is displayed together with its Vectors, their data types and the number of values in each. When we double click on Records we can see the entire Dataframe displayed.
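The two steps above, creating the Vectors and then the Dataframe, can be run together as the sketch below. The calls to dim() and str() are additions for illustration; they show in the Console the same details that the Workspace displays.

```r
# Step 1: one Vector per column, all with 6 values
Name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age    <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# Step 2: combine the Vectors; columns appear in the order given here
Records <- data.frame(Name, Gender, Age, Income)

dim(Records)   # 6 4  (6 rows, 4 columns)
str(Records)   # data type and first values of every column
Records        # prints the whole Dataframe
```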
FUNCTIONS THAT CAN BE CARRIED OUT WITH DATAFRAMES
How to print the names of variables
To find out the names of the variables in a Dataframe, enter the code ‘names’ followed by the name of the Dataframe whose variables need to be determined.
So, to find out the names of the variables in the Dataframe Records, enter the code names(Records) and press Control + Enter. The Console displays all the variables of the Dataframe Records, which are Name, Gender, Age and Income.
This is a useful function when working with a Dataframe that contains a large number of variables. In this tutorial, a simple Dataframe with just four variables has been created. There could be a situation where a large Excel table with lots of variables is imported and the names of the variables used in this table need to be determined.
How to find a particular element
To find a particular element in a Dataframe, a function called Indexing is used. When the Dataframe Records is opened from the Workspace section, we can see that it has 4 columns with 6 rows. In the Workspace section, 6 observations indicates 6 rows and 4 variables means 4 columns.
Let us assume that we want to find out the gender of a particular person, say Gopal, who is listed in the table. The value Gopal can be found in the second row and Gender is the second column in the Dataframe Records. Enter the code Records[2,2] and press Control + Enter
In the code, the first 2 indicates the second row in the Dataframe where the value Gopal is displayed. The second 2 indicates the column Gender.
When this code is executed, ‘M’ is displayed in the Console indicating that the gender of Gopal is male.
How to view the elements in a row
It is also possible to view all the elements in any row of a Dataframe. For eg, to view the elements of only the first row in the Dataframe Records, enter the code Records[1, ] and press Control + Enter.
The space left after the comma indicates that all the elements in the row need to be fetched. When this code is executed, all the elements of row 1 of Records are displayed in the Console.
To view the elements of more than one row, say for eg, the first 4 rows of the Dataframe Records, enter the code Records[c(1:4),] and press Control + Enter. On executing this code the elements of the 4 rows are displayed in the Console. Given below is also the code to access only rows 3 and 4, and the resulting output in the Console.
How to view the elements in a column
There are three ways to find out the content/s of a particular column in a Dataframe. Let us look at each of these with the help of an example. Let us find out the contents of the column “Name” in the Dataframe Records.
1. Enter the code
Records$Name
and press Control + Enter.
On executing this code we can see all the values under the Name column displayed.
2. Enter the code Records[,"Name"] and press Control + Enter.
Here the first field is empty because it relates to rows and the second field is the name of the column whose contents are to be retrieved. On executing this code the contents of the column are displayed in the Console.
3. Enter the code
Records[,1] and press Control + Enter.
This method works if the column number is known beforehand. In this case we know that Name is the first column in Records. On executing this code, the contents of Name are displayed in the Console.
Of the three ways to find the values in a column, two work with the name or the label of the column and the third requires the number of the column.
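The indexing functions covered so far can be sketched in one place. The Dataframe is rebuilt here so that the example runs on its own. The option stringsAsFactors = FALSE is an assumption added for this sketch: it keeps text columns as character in every version of R, so the output shows plain quoted values.

```r
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000),
  stringsAsFactors = FALSE   # assumption: keep text columns as character
)

names(Records)       # "Name" "Gender" "Age" "Income"

Records[2, 2]        # row 2, column 2: "M" (the gender of Gopal)
Records[1, ]         # every column of row 1
Records[c(1:4), ]    # rows 1 to 4
Records[3:4, ]       # rows 3 and 4 only

# Three ways to fetch the Name column
Records$Name
Records[, "Name"]
Records[, 1]
```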
How to add columns to a Dataframe
To add a column to a Dataframe, the only pre requisite is that the new column should have the same number of rows as the other columns in the Dataframe. In the Dataframe Records, there are 6 rows and 4 columns. To add a fifth column called new to this Dataframe enter the code Records$new <- c(100:106) and press Control + Enter. Entering 100:106 generates the values 100, 101, 102, 103, 104, 105 and 106. When this code is executed, an error is displayed in the Console as 100:106 adds up to seven values and not six.
So, the code needs to be changed as follows: Records$new <- c(100:105). Press Control + Enter. On executing this code, the Workspace shows that the number of columns in Records is now 5.
Also, when Records is opened, the column new is displayed with values from 100 to 105.
How to remove a column from a Dataframe
It is also possible to remove a column from a Dataframe. For eg, to remove the column new from the Dataframe Records enter the code Records$new <- NULL and press Control + Enter.
When this code is executed the data in the Workspace is updated to show only 4 columns.
When Records is opened, the column new is no longer visible.
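Adding and removing a column can be sketched as below. The Dataframe is rebuilt so that the example runs on its own. The names attempt and columns_after_adding are hypothetical, and tryCatch() is used only so that the expected error is captured as text instead of stopping the script.

```r
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# Seven values do not fit into six rows, so R raises an error
attempt <- tryCatch({
  Records$new <- c(100:106)
  "column added"
}, error = function(e) conditionMessage(e))
attempt                       # the error message, not "column added"

Records$new <- c(100:105)     # six values: the column is added
columns_after_adding <- ncol(Records)
columns_after_adding          # 5

Records$new <- NULL           # assigning NULL removes the column
ncol(Records)                 # 4
```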
SUMMARY
In this tutorial we have looked at another important data structure called Dataframe. To summarize:
A Dataframe is a data structure made up of vectors of equal length.
To create a Dataframe in R, first vectors need to be created.
There are various types of functions that can be carried out with Dataframes. These are:
a. Printing the names of the variables in a Dataframe
b. Indexing or locating specific values in a Dataframe
c. Finding out the values of a row in a Dataframe
d. Finding out the values of a column in a Dataframe
e. Adding a column with values to a Dataframe
f. Removing a column from a Dataframe
PART 4 – LIST & MATRIX
INTRODUCTION
Data structures can also be in the form of a list or a matrix. This tutorial will deal with two more types of data structures – List and Matrix. So, here’s what will be covered in this tutorial:
- Understanding Lists
- Understanding a Matrix
- Creating Lists in R
- Creating a Matrix in R
- Different functions related to List and Matrix
WHAT IS A LIST
Just like a Dataframe, a List is also made up of Vectors. But unlike the Dataframe, the Vectors in a List can be of equal or unequal length. However, each Vector in a List should itself comprise elements of a single data type. For eg, in the table below, n, s and b are 3 Vectors of different data types: numeric, character and logical. They are not all of the same length, as n has 3 values while s and b have 5 each. A combination of these Vectors can make up a List.
n    s    b
2    aa   TRUE
3    bb   FALSE
5    cc   TRUE
     dd   FALSE
     ee   FALSE
WHAT IS A MATRIX
A Matrix is a collection of data elements arranged in a 2 dimensional rectangular layout. To create a Matrix in R of 10 elements arranged in 5 columns and 2 rows, the syntax to be used is shown below. In this example, a matrix called my.matrix will be created. So, enter the code my.matrix<-matrix(c(1:10), ncol=5, nrow=2) and press Control + Enter. The result is shown below.
Unlike a Dataframe where each column stores different elements like Name or Age, in a Matrix all the columns need to have the same type of elements – either only numbers or only characters and so on.
A Matrix cannot have one column with character data types, one column with integers and so on.
CREATING A LIST IN R
There are two ways in which a List can be created in R.
1. By creating the Vectors in the List
To understand how to create a List in R, the table shown earlier will be converted into a List. In that table, column n has only numeric data, s has only character data and b contains logical data (only TRUE or FALSE). Each of these columns is a Vector. To create the Vectors, enter the code
n = c(2,3,5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
and press Control + Enter. The output is displayed in the Console.
In the Workspace section, each Vector n, s and b is displayed together with its data type and the number of values in it.
2. By creating a separate List around Vectors
To create a List X with the three Vectors n, s and b and a fourth element, the number 3, enter the code X = list(n,s,b,3) and press Control + Enter. On executing this statement, in the Workspace section X with a value List against it is visible. It also indicates that the List has 4 elements.
To view the contents of the List, enter the name of the List, X, and press Control + Enter. The values in X will be displayed in the Console.
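Both steps can be run together as the sketch below. The double square brackets used to pull out one element, and the names element_lengths and second_element, are additions for illustration and are not covered in the text above.

```r
# Step 1: the three Vectors from the table
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Step 2: wrap them, together with the number 3, in a List
X <- list(n, s, b, 3)

class(X)                      # "list"
length(X)                     # 4 elements

element_lengths <- sapply(X, length)
element_lengths               # 3 5 5 1: the lengths do not have to match

second_element <- X[[2]]      # the Vector s
second_element
X                             # prints every element of the List
```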
CREATING A MATRIX IN R
Let us create a matrix in R, called my.matrix, with 5 columns and 2 rows. This Matrix needs to store 10 elements. To create this Matrix, enter the code my.matrix<-matrix(c(1:10), ncol=5, nrow=2) and press Control + Enter. Here, the first argument supplies the elements to be stored in the Matrix (the numbers 1 to 10), the second argument relates to the number of columns in the Matrix and the third argument relates to the number of rows in the Matrix. On executing this code the Workspace indicates that a Matrix has been created.
On double clicking the name of the Matrix, a 2x5 matrix with 10 elements is visible.
CREATING A MATRIX OUT OF A DATAFRAME
A Dataframe can also be converted into a Matrix. Let us understand this with the help of an example. First the Dataframe needs to be created, with some sample elements. To create a Dataframe called data_frame, enter the code data_frame<-data.frame(a=c(1,2,"3"), b=c(1,2,3)) and press Control + Enter. Because the value "3" is in quotes, column a is stored as text. To convert this Dataframe to a Matrix (let us call it next.matrix), use the function next.matrix<-as.matrix(data_frame) and press Control + Enter.
The output is displayed in the Console and the details of the Matrix in the Workspace section. On opening the Matrix, a matrix with 3 rows and 2 columns is displayed. Here the column names displayed are a and b, as these names were given to the columns when the Dataframe was created.
Now let us find out the data type of the second column in the Dataframe data_frame. The data type of the second column is numeric, and this can be confirmed by using the code class(data_frame$b). On executing this statement, numeric is displayed in the Console.
Now let us find out the data type of the second column in the Matrix next.matrix. The second column in next.matrix is b. To find out the data type of b, enter the code class(next.matrix[,2]) and press Control + Enter. In the Console, character is displayed.
So, if the same elements in the Dataframe were used to create the Matrix, why does the data type of the column differ? Column a, the first column in the Dataframe that was used to create next.matrix, has elements that are not numeric. So as a Matrix needs to have elements of the same data type, every element in the Matrix, including the elements in the second column b, has been converted to the character data type. This is why the data type of the second column of the Matrix is character. This underlines the key difference between a Dataframe and a Matrix i.e, all the elements in a Matrix need to be of the same data type.
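The Matrix examples from this part can be run together as the sketch below, using the names from the text. The dim() calls and the name first_row are additions for illustration.

```r
# A 2 x 5 Matrix holding the numbers 1 to 10, filled column by column
my.matrix <- matrix(c(1:10), ncol = 5, nrow = 2)
my.matrix
dim(my.matrix)                # 2 5
first_row <- my.matrix[1, ]
first_row                     # 1 3 5 7 9

# A Dataframe whose first column contains a quoted value
data_frame <- data.frame(a = c(1, 2, "3"), b = c(1, 2, 3))
next.matrix <- as.matrix(data_frame)

dim(next.matrix)              # 3 2
class(data_frame$b)           # "numeric" inside the Dataframe
class(next.matrix[, 2])       # "character" inside the Matrix
```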
SUMMARY
In this tutorial we have covered two more data structures – List and Matrix. To summarize:
A List is a combination of vectors, either of equal or unequal length.
A Matrix is a collection of data elements, where all the elements need to be of the same data type.
There are two ways in which a List can be created in R. The first is by generating the Vectors in the List individually. The second is by combining the Vectors into a consolidated List.
A Matrix can be created using code where the number of elements, columns and rows is specified.
A Dataframe can also be converted to a Matrix. If the Dataframe has different data types, only one data type will be stored when converted to a Matrix.
PART 5 – FACTORS
INTRODUCTION
An important data type used in data structures is factor. Factor as already mentioned refers to data types of categorical nature. So, here’s what will be covered in this tutorial:
- Understanding factors
- Creating a factor in R
WHAT IS A FACTOR
In R, let us assume that a Vector called fac_list has been created with names of cities like city1, city2, city3 etc.
fac_list<-c("city1", "city2", "city3", "city4")
The names of these cities are categories in themselves. So each city, which is originally a character data type, can be converted into a factor or a separate category in R. Let us take another example. In a Vector like gender, there are invariably two values, male and female, each of which is a category in its own right. So, the utility of the data type factor is to convert values into categories.
Vectors are the base from which factors are generated.
HOW TO CREATE A FACTOR IN R
To use factors to create categories out of values, let us assume that the values in the Vector fac_list are to be converted to categories or factors. First create the Vector fac_list using the code mentioned above. Now enter the code fact1<-as.factor(fac_list) and press Control + Enter. In the Workspace section, fact1 is visible as a factor together with the number of categories (levels) it contains.
To find out the data type of fact1, use the code class(fact1) and press Control + Enter. On executing this code, factor is displayed in the console area.
To view the values in fact1, enter the code summary(fact1) and press Control + Enter.
In the Console area, each of the values in fac_list is displayed as a category.
The value under each category indicates the number of times it appears in the Vector fac_list. For example, city1 appears only once, hence the value 1; a city that was entered twice in the Vector would show the value 2. Likewise, the number of times the other categories appear is also indicated.
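The steps above can be tried end to end with a short sketch. It assumes, purely for illustration, that city2 is entered twice in the Vector so that one count differs from the others; that extra element is not part of the Vector defined earlier.

```r
# Vector of city names; the repeated "city2" is an assumption for illustration
fac_list <- c("city1", "city2", "city3", "city4", "city2")

# Convert the character Vector into a factor
fact1 <- as.factor(fac_list)

# Data type of fact1
class(fact1)

# The distinct categories (levels) found in the Vector
levels(fact1)

# Number of times each category appears
summary(fact1)
```

Here levels() lists each category once, while summary() counts how often each one occurs.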
SUMMARY
In this tutorial the data type factors was explained in some detail. To summarize:
Factors are a data type which converts values into categories. For example, names of cities become categories of city, and male and female become categories of gender.
In R, factors can be created out of values in a vector.
It is also possible to view the number of times a category appears in a Vector.
Section 3 Data Handling
PART 1 – PACKAGES
INTRODUCTION
One section of the R Studio GUI is the Packages section. Packages allow many important functions to be carried out. So, here is what will be covered in this tutorial:
- Understanding packages
- Installing and loading packages
- Importing data into R
- Exporting data from R
WHAT IS A PACKAGE
Packages are collections of R functions, data and compiled code put together in a well defined format. They can be thought of as prepared routines that are available in R.
Packages are like a bundle of everything that is needed to carry out a specific function in R.
Let us understand the importance of packages through an example. Suppose we want to carry out a linear regression in R to create a linear model. One way to do this is to write all the logic and code to carry out a linear regression and then execute it. Another way is to access a ready-made linear regression function from an external source, pass your data through it and execute it. A collection of such pre-made functions is what is referred to as a package in R.
As we have already discussed, R has a huge community of contributors. These contributors create pre-made functions and bundle them as packages, which can then be used by all users of R. So, if a user needs to forecast something in R, all they need to do is look for a forecasting package and use it. Packages make R a far more user friendly tool.
By using the right package in R, one can save time and effort in carrying out a particular function.
INSTALLING AND LOADING A PACKAGE
When talking about packages, two terms are commonly used. The first is installing a package and the second is loading a package. To understand these terms, let us look at an example. This example will not just show us how to use a package, but also demonstrate how to import data into R. Let us assume that an Excel file called Excel_import, stored in one of the drives of the system being used, is to be imported into R. In R, the code to import an Excel file is read.xlsx.
But if we were to execute this code straight away, it would not work. This is because read.xlsx is a function present within a package called xlsx, and it will only work if this package is installed and loaded. So certain functions in R are linked to packages and will only work if those packages are installed in R.
To execute certain functions, it is important to install and load a package in R.
Installing a Package
Let us now look at how to install a package. The option Packages is available in R Studio on the right hand side.
Click on Packages, then Install packages.
In the field “Packages” enter the name of the package that needs to be installed. In the example being used, the package to be installed is called xlsx. So, enter xlsx.
Make sure that you are connected to the internet when installing a package like xlsx, as R will need to download the package from a server. In the case of xlsx, it will be downloaded from the Repository (CRAN) server.
After entering the name of the package to be installed, click on the Install button.
Once the package has been installed look for it by entering its name in the search field in the Packages section. As it appears in the search results, we know that the package has already been installed.
To find out if a package is already installed in R, look to see if it comes up in a Search.
Loading a package
Installing a package adds it to your system, but after that the package still needs to be loaded. Loading means making the package available in the current R session so that its functions can be executed. To load a package in R, the code used is library followed by the name of the package within brackets. So, enter the code library(xlsx) and press Control + Enter. In the console, the text in red indicates that the package has been loaded in R.
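The same install-then-load sequence can also be written as code rather than through the menus. The sketch below is only an illustration: it checks whether xlsx is already installed, and it uses require(), a variant of library() that returns TRUE or FALSE instead of stopping with an error. The install line is left as a comment because it needs an internet connection.

```r
# Name of the package used in this tutorial
pkg <- "xlsx"

# Check whether the package is already installed on this system
is_installed <- pkg %in% rownames(installed.packages())

if (!is_installed) {
  # One-time step, run while connected to the internet:
  # install.packages("xlsx")
  message("Package '", pkg, "' is not installed yet.")
} else {
  # Load the package for the current session (same purpose as library(xlsx))
  is_loaded <- require(pkg, character.only = TRUE)
}
```

Installation is needed only once per system, whereas loading has to be repeated in every new R session.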
IMPORTING AN EXCEL FILE INTO R
Now let us import an Excel file into R, as the package to import the file is installed and loaded in R. To do this, the code to import the Excel file needs to be entered. A breakdown of this code is mentioned here:
read.xlsx(file = "file path.xlsx", sheetIndex = 1)
- file path.xlsx is the file path or the location of the file to be imported
- sheetIndex = 1 indicates that only the first sheet in the Excel file needs to be imported (So, if 2 is entered instead of 1, then the second sheet will be imported from the Excel file)
The path of the file to be imported can be found under Properties.
Let us assume that the name of the sample Excel file to be imported is Excel_import. To find out the file path of this Excel file, right click on the file and look under Properties. In the space left for the file path in the code, paste the file path of the Excel file. When pasting or writing the file path, make sure that each back slash is entered twice, as a single back slash has a special meaning inside quotes in R. After the file path, enter the name of the file, which is Excel_import, followed by the file type, which is xlsx. Then the sheet to be imported needs to be mentioned. We can either enter 1 as the second argument or write sheetIndex = 1. The imported Excel sheet needs to be stored in a Dataframe in R, so we will store the result in a new Dataframe called data1. Let us execute this code by pressing Control + Enter.
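Put together, the import might look like the sketch below. The folder in the path is hypothetical and should be replaced with the location shown under Properties. The read is wrapped in a check so that the code does nothing if the file or the xlsx package is not available.

```r
# Hypothetical location of the file; each back slash is typed twice
excel_path <- "C:\\Users\\me\\Documents\\Excel_import.xlsx"

# The same location written with forward slashes, which R also accepts
same_path <- "C:/Users/me/Documents/Excel_import.xlsx"

# Import the first sheet only if the file and the xlsx package are available
if (file.exists(excel_path) && requireNamespace("xlsx", quietly = TRUE)) {
  data1 <- xlsx::read.xlsx(file = excel_path, sheetIndex = 1)
  str(data1)  # shows the number of observations and variables imported
}
```

Writing xlsx::read.xlsx calls the function directly from the package; after library(xlsx) has been run, plain read.xlsx works the same way.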
In the Console, the presence of the red dot means that the code is still being executed.
In the Workspace section, a new Dataframe data1 is created which has 99 observations or rows and 4 variables or columns. On opening data1 we can see the data that has been imported. Check that the correct data has been imported by opening the original Excel file and comparing it with data1.
IMPORTING A CSV FILE INTO R
To import a comma separated value file or a CSV file, the code read.csv is to be used. Importing a CSV file does not require any package to be installed, as this function is built into R. The code to import a CSV file is shown here:
read.csv(file = "file path.csv", sep = ",")
- file path.csv is the file path or the location of the file to be imported
- sep = "," indicates that the file to be imported is a comma separated value file
Assume that a CSV file called CSV_Import is to be imported. Copy the file path of this file, which can be found under Properties. In R, enter the code to import the file by first entering the name of the Dataframe where the imported file will be stored, which is data2. Then enter the code read.csv followed by the file path which was copied earlier. Then enter the name of the file to be imported, which is CSV_Import, followed by the file type, which is csv. Remember to enter each back slash in the file path twice, just as we did in the case of the Excel file import. The last part of the code is the separator, which is a comma. Press Control + Enter to import the file.
In Workspace, data2 has been created with 99 observations and 4 variables.
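To try read.csv without the tutorial's sample file, a small CSV file can be created first and then imported. In the sketch below, the temporary folder and the two rows of data are assumptions made only so that the example runs on any system.

```r
# Create a small sample CSV file in a temporary folder (stand-in for CSV_Import)
csv_path <- file.path(tempdir(), "CSV_Import.csv")
writeLines(c("Name,Gender,Age,Income",
             "Aryan,M,20,20000",
             "Gopal,M,21,30000"),
           csv_path)

# Import the CSV file into the Dataframe data2
data2 <- read.csv(file = csv_path, sep = ",")

# Number of observations (rows) and variables (columns)
dim(data2)
```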
An easier way to import a CSV file is to create a Dataframe (like a) and use the code:
a<-read.csv(file.choose())
file.choose() opens a window from which the file can be selected. This option is menu driven and removes the need to copy and paste the file path in the code. After pressing Control + Enter, in the Select File window which appears, look for the CSV file to be imported (which in our case is CSV_Import).
On selection, a Dataframe “a” is created in the Workspace, which has the same number of observations and variables as the Dataframe created earlier.
SUMMARY
In this tutorial the utility of Packages was explained in some detail. It also covered the importing of Excel and CSV data into R. To summarize:
Packages are a bundle of pre defined functions. They help in executing certain processes in R with ease.
Certain functions in R are available only through packages.
To use a package in R, first install it and then load it.
To import an Excel file in R, install and load the package xlsx.
To import a CSV file in R, use the code read.csv.
PART 2 – EXPORTING AND READING DATA IN R
INTRODUCTION
Just like data can be imported into R – whether in Excel or CSV format – it can also be exported from R. So, here is what will be covered in this tutorial:
- Exporting data from R to Excel
- Exporting data from R to CSV
- Reading a file in R
EXPORTING TO EXCEL
In an earlier section, a Dataframe called “a” has already been created when CSV files were imported to R. Let us assume that the contents of this Dataframe will now be exported to an Excel sheet.
To do this, enter the following code:
write.xlsx(data, file = "file path")
So, if the contents of Dataframe “a” are to be exported to a sample Excel file called abc, a is entered in place of data, and the file path ends with the name of the Excel file, abc.
Let us deconstruct this code. The data to be exported is specified as “a” and the location to which it is to be exported is mentioned after file. Also mentioned is the name of the Excel file where the data is to be stored, which in this case is abc. When this code is executed, the data is exported to the location specified. You can always check this by going to the location where the Excel file is saved, and checking its contents.
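As a rough sketch of the export, the code below builds a small stand-in for the Dataframe “a” and writes it to a file called abc.xlsx. The sample values and the temporary folder are assumptions, and the export only runs if the xlsx package is available.

```r
# Small stand-in for the Dataframe "a" (sample values are assumptions)
a <- data.frame(Name = c("Aryan", "Gopal"), Age = c(20, 21))

# Location of the Excel file to be created; a temporary folder is used here
xlsx_path <- file.path(tempdir(), "abc.xlsx")

# Export the Dataframe only if the xlsx package is available
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx::write.xlsx(a, file = xlsx_path, row.names = FALSE)
  file.exists(xlsx_path)  # TRUE once the file has been written
}
```

row.names = FALSE leaves out the column of row numbers that write.xlsx would otherwise add.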
EXPORTING TO CSV
Exporting to a CSV file is similar to exporting to an Excel file. The code to carry out this function is shown here:
In the code shown, a is the Dataframe whose contents are to be exported, filepath is the location where the CSV file is to be saved, and the comma against the separator (sep) indicates that the data has to be exported in CSV format. After executing the code, go to the location where the CSV file has been stored (in this example, the desktop of your system). Verify that the contents of the Dataframe “a” have been exported in CSV format.
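The original listing is not reproduced in this text. A minimal sketch of one way to carry out the export is given below, assuming the base R function write.table (write.csv would be a close alternative), a small stand-in for the Dataframe “a”, and a temporary folder in place of the real file path.

```r
# Small stand-in for the Dataframe "a" (sample values are assumptions)
a <- data.frame(Name = c("Aryan", "Gopal"), Age = c(20, 21))

# Location of the CSV file to be created; a temporary folder is used here
csv_out <- file.path(tempdir(), "abc.csv")

# Export the Dataframe with a comma as the separator
write.table(a, file = csv_out, sep = ",", row.names = FALSE)

# View the raw contents of the exported file
readLines(csv_out)
```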
READING A FILE IN R
Like in the case of exporting data, reading data is also carried out with the help of code, which in this case is read.table. Shown here is the code to read a sample text file in R.
Assume that on the desktop of your system, a text file called “Consultants” is available whose contents are to be read through R. Assume that this file contains a set of email ids, all separated by commas. When this data is read in R, we want to make sure that each email id is an element in itself. Let us now deconstruct the code to read data. First, the Dataframe where the contents of the text file are to be stored is mentioned, which in this code is “a”. The location of the text file is given next. A comma is written against the separator (sep), as all values in the text file “Consultants” are separated by commas. When we execute this code, a red dot appears in the Console, indicating that the code is still being executed.
In the Workspace section, a Dataframe called a is visible. On opening this Dataframe, all the email ids in the file “Consultants” are stored as separate elements.
Assume that on the desktop of your system there is another text file called “Backup codes”. Here the elements are separated by a space. To read the contents of this file, use the same code as earlier, but replace the comma with a space.
On executing this code, a Dataframe b is created in the Workspace section. On opening this Dataframe, the contents of the text file are displayed as separate elements. So as the contents were separated in the text file with a space, a space was used in the code as a separator.
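Both reads can be sketched with read.table as follows. The two text files are created in a temporary folder so that the example is self-contained; their names mirror the tutorial's files, but the email ids and codes inside them are made up for illustration.

```r
# Hypothetical stand-ins for the "Consultants" and "Backup codes" text files
consultants_file <- file.path(tempdir(), "Consultants.txt")
writeLines("asha@example.com,ben@example.com,chen@example.com", consultants_file)

backup_file <- file.path(tempdir(), "Backup codes.txt")
writeLines("1111 2222 3333 4444", backup_file)

# Values separated by commas: use sep = ","
a <- read.table(file = consultants_file, sep = ",")

# Values separated by spaces: use sep = " "
b <- read.table(file = backup_file, sep = " ")

a
b
```

Each value ends up in its own column of the Dataframe (named V1, V2 and so on by default), which is what makes every email id or code a separate element.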
SUMMARY
In this section, exporting data from R – whether in Excel or CSV format – was covered. The code to read text files in R was also looked at.
To summarize:
Data can be exported in Excel format using the code write.xlsx.
Data can also be exported in CSV format.
To read a text file in R, use the code read.table.
To read elements separated by a comma, use the separator “,”.
To read elements separated by a space, use a space “ ” as the separator.
Section 4 Logical Operations and Conditions
PART 1 – LOGICAL OPERATORS AND IF CONDITION
INTRODUCTION
Locating values in R is fairly simple with the use of logical operators and conditions. So, here is what will be covered in this tutorial:
- Understanding logical operators
- Common types of logical operators
- Executing logical operators in R
- Understanding IF condition
- Executing IF condition in R
WHAT IS A LOGICAL OPERATOR
Logical operators are used to locate specific elements in a data structure. Here are examples of logical operators in R – Greater than, Less than, Equal to, AND, OR. An example will be used to understand each of these terms better. Assume there is a table that lists a few names along with certain particulars related to those names, like gender, age and income. The utility of a logical operator in a table such as this, or in a Dataframe in R, is to identify or isolate a specific element or elements, or a certain row or rows. So, if in this sample table one wants to identify all those names where the age is Greater than 23, then the logical operator Greater than is used. Here is the result of using this operator on the sample table. Looking at the table we can identify 3 names where the age is greater than 23.
Name Gender Age Income
Aryan M 20 20000
Gopal M 21 30000
Zubin F 24 35000
Ravi M 26 40000
Umesh M 26 41000
Anita F 23 50000
Age Greater than 23
Let us take a look at another example. Suppose we want to identify all those names whose gender is male and whose income is greater than 40000. Here we need to use 3 logical operators to identify these names. These are gender Equal to male, followed by the logical operator AND, followed by income Greater than 40000. Here is the result of applying these logical operators.
Name Gender Age Income
Aryan M 20 20000
Gopal M 21 30000
Zubin F 24 35000
Ravi M 26 40000
Umesh M 26 41000
Anita F 23 50000
Gender Equal to male AND income Greater than 40000
So from these examples we can see that logical operators are very useful in extracting particular information from a Dataframe or a table.
HOW TO EXECUTE A LOGICAL OPERATOR IN R
We will now look at how to work with these logical operators in R through a simple exercise. In an earlier section, a Dataframe called Records was created using the information mentioned in the sample table above. But for the purposes of this exercise, we will create this Dataframe again.
To create the Dataframe “Records” again, first create the vectors “Name”, “Gender”, “Age” and “Income” before creating “Records”.
After the Dataframe has been created, the following three tasks will be carried out:
1. The vectors in “Records” will be printed
2. The elements where the age is less than 23 will be identified
3. The elements where gender is male and age greater than 21 will be identified
Finding out the rows where the age is less than 23
Let us begin with finding out the elements or rows where the age is less than 23. From the table, we know that there are 2 rows where the age is less than 23. These can be found against the names Aryan and Gopal.
Name Gender Age Income
Aryan M 20 20000
Gopal M 21 30000
Zubin F 24 35000
Ravi M 26 40000
Umesh M 26 41000
Anita F 23 50000
Age Less than 23
So how do we get the same result in R? Here is the code to create the Dataframe “Records”.
Press Control plus Enter.
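The original listing is not reproduced in this text. A sketch that is consistent with the sample table is given below; it assumes the column names Name, Gender, Age and Income, which match the names used in the code later in this part.

```r
# One Vector per column of the sample table
Name   <- c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita")
Gender <- c("M", "M", "F", "M", "M", "F")
Age    <- c(20, 21, 24, 26, 26, 23)
Income <- c(20000, 30000, 35000, 40000, 41000, 50000)

# Combine the Vectors into the Dataframe Records
Records <- data.frame(Name, Gender, Age, Income)

# Print the Dataframe
Records
```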
The Dataframe is created and information relating to this is displayed under Workspace. On double clicking this Dataframe “Records”, the information shown in the table is present in the Dataframe.
Let us now find out the elements or rows in this Dataframe where the age is less than 23. When discussing data structures we touched upon the code to find out the number of rows. The first rule to remember is to use square brackets after the name of the Dataframe, and the second rule is that the first argument within the brackets relates to rows and the second argument relates to columns. So as we need to find out the rows where the age is less than 23, the logical statement is mentioned in the first argument and the second argument is left blank – as in nothing is mentioned after the comma. The code to find the rows where the age is less than 23 is:
Records[Records$Age<23,]
So now let us deconstruct this code. First mention the Dataframe name, which is “Records”, and open the square brackets. Inside the brackets, mention the Dataframe name again, followed by the dollar sign ($) and the name of the column from where data needs to be identified. In our example this would be Age. Then enter the logical operator less than ( < ) followed by 23, and then the comma. On pressing Control plus Enter, we can see two rows in the console section. The rows displayed tally with the results that we arrived at when we looked at the data displayed in the table.
Remember, to identify rows, enter only the first argument in the code and keep the comma after it, as the second argument relates to columns.
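A self-contained sketch of this step is shown below. The Dataframe is rebuilt from the sample table so that the example can be run on its own.

```r
# Rebuild the Dataframe Records from the sample table
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# The logical statement on its own gives TRUE or FALSE for each row
Records$Age < 23

# Used as the first argument, it keeps only the rows where it is TRUE
young <- Records[Records$Age < 23, ]
young
```

The name young is only an illustrative choice for holding the result.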
Let us now assume that we want to count the number of rows where the age is less than 23. The details of the rows where the age is less than 23 are displayed, but we now need a count of these rows. First we need to create a dataset data1 and assign the results of the code we used earlier to it.
data1<-Records[Records$Age<23,]
As you recall from earlier sections, this has the effect of storing the results of the code in the dataset data1. So the two rows that we saw in the Console now belong to the dataset data1. Now to find out the number of rows in data1, enter the code
nrow(data1)
On pressing Control plus Enter, 2 is displayed in the Console.
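The counting step can be sketched as follows, again rebuilding Records so that the example runs on its own. The last line shows an alternative that is not part of the tutorial: adding up the TRUE values of the logical statement gives the same count without creating data1.

```r
# Rebuild the Dataframe Records from the sample table
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# Store the matching rows in data1; note the comma before the closing bracket
data1 <- Records[Records$Age < 23, ]

# Count the rows in data1
row_count <- nrow(data1)
row_count

# Alternative: count the TRUE values directly
sum(Records$Age < 23)
```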
Finding out the rows where gender is male and age is greater than 21
Going back to the table, we can see that there are two records which meet the conditions of gender being male and age being over 21. These are found against the names Ravi and Umesh.
Name Gender Age Income
Aryan M 20 20000
Gopal M 21 30000
Zubin F 24 35000
Ravi M 26 40000
Umesh M 26 41000
Anita F 23 50000
Gender Equal to male AND age over 21
So how do we get the same results in R? In R, enter the code
Records[Records$Gender=="M"&Records$Age>21,]
So what we have effectively stated in this code is to find, in the Dataframe Records, all rows with gender Equal to M and with age Greater than 21. On pressing Control plus Enter, the rows with Ravi and Umesh are displayed in the Console. This, as we have seen, exactly matches the requirement of all rows with gender male and age greater than 21.
Remember that when entering character data types in R, the values need to be entered within double quotes.
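A runnable sketch of this combined condition is given below, with Records rebuilt from the sample table. Note that Equal to is written with two equals signs (==) in R, and that AND is written as &.

```r
# Rebuild the Dataframe Records from the sample table
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# Rows where gender is Equal to "M" AND age is Greater than 21
males_over_21 <- Records[Records$Gender == "M" & Records$Age > 21, ]
males_over_21
```

The name males_over_21 is only an illustrative choice for holding the result.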
WHAT IS IF CONDITION IN R
To understand the conditional statement IF in R, let us use an example. Let us begin by opening the Dataframe Records. In this Dataframe, let us assume we want to add another variable called Gender_dummy. The values to be displayed against this variable are 1 against all those rows where M (male) is displayed and 0 against all those rows where F (female) is displayed. The code to execute this is shown here:
Records$Gender_dummy<-ifelse(Records$Gender=="M",1,0)
ifelse indicates that IF the value in the column Gender is M, display 1, ELSE display 0.
Remember when entering the code to precede the name of the column with a dollar symbol.
Press Control plus Enter. In the Workspace section, the number of variables has increased to 5 (where it was earlier 4). On opening the Dataframe we can see that a new variable Gender_dummy has been created. In this column, 1 has been added against all those elements where the gender is M or male and 0 against all those elements where the gender is F or female.
So let us run through the code one more time. IF the statement gender is equal to male is true, display 1, Else display 0 (which means that if the statement gender is equal to male is not true, then display 0)
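The ifelse step can be tried with the sketch below, which rebuilds Records from the sample table so that it runs on its own.

```r
# Rebuild the Dataframe Records from the sample table
Records <- data.frame(
  Name   = c("Aryan", "Gopal", "Zubin", "Ravi", "Umesh", "Anita"),
  Gender = c("M", "M", "F", "M", "M", "F"),
  Age    = c(20, 21, 24, 26, 26, 23),
  Income = c(20000, 30000, 35000, 40000, 41000, 50000)
)

# IF Gender is "M" store 1, ELSE store 0, in the new column Gender_dummy
Records$Gender_dummy <- ifelse(Records$Gender == "M", 1, 0)

# View the original column next to the new one
Records[, c("Name", "Gender", "Gender_dummy")]
```

ifelse checks the condition for every row at once, so no loop is needed to fill the new column.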
SUMMARY
In this section, the use of logical operators like Greater than, Less than, Equal to and AND has been covered in some detail. The use of the conditional statement ifelse has also been covered.
To summarize:
Logical operators are used to identify certain elements in a data structure. Examples are greater than, less than, equal to etc.
When executing a logical operator in R, mention the name of the data structure and the name of the column which contains the desired variables.
Symbols are used in R for logical operators, like ==, <, >, &
The IF condition checks whether a certain condition is met before carrying out a specific function
PART 2 – MERGING DATA
INTRODUCTION
Data in different tables can be merged in R.
So, here’s what will be covered in this tutorial:
- Understanding merging of data
- Different ways in which to merge data
- Executing a merge in R
WHAT IS MERGING OF DATA
Let us assume that an organization has prepared a database of its employee information. One table which we will refer to as Table 1 stores details related to Employee ID (shown as EmpID), Name and Income. Employee ID is a unique key or identity.
A second table which we will refer to as Table 2, stores Employee ID, Address and Nationality.
Assume that the organization wants to combine the information in these two tables into a single table. To do this, Merge will be used. So, Merge is an operation which helps in combining data which are present in different tables.
WHAT ARE THE DIFFERENT WAYS TO MERGE DATA
A merge can be carried out in different ways, or in simple terms there is more than one way to merge data. Again let’s look at an example. Shown here are two data sets. The first data set has three columns, k1, k2 and data.
The second dataset has 2 columns k1 and k3.
In order to merge these 2 datasets, an important condition needs to be met – there needs to be at least one column in common between the two. In our example, the column which is common between the two datasets is k1. So it is possible to merge these two datasets as k1 is common between the 2.
Full merge
The first type of merge possible is called the Full merge. In our example, we have three columns in the first dataset and two in the second dataset. Of these, one column, k1, is common. After a full merge, one dataset with four columns will be created – k1, k2, data and k3. So a full merge produces a combined table with all the unique columns and all the rows present in the tables that were merged. So let us look at the merged table that has been created after a full merge of the 2 datasets.
As you can see, in the column k1 seven elements are visible. 1 is the common element in k1 in both datasets. All other unique values in k1 and the other columns are present in the new merged dataset.
Inner merge
The second type of merge is called Inner merge. In this type of merge, only the rows with matching elements in the common column of the datasets to be merged are brought together. In our example, the only column that is common between the two datasets is k1. Within k1, the only common element between the 2 columns is the number 1. So when an Inner merge is carried out, only the row which has the common element is merged. The figure shows the result of an Inner merge between the two datasets.
Left outer & Right outer merge
The third type of merge is called the left outer merge. In this type of merge a consolidated table is created which keeps all the rows of the dataset to the left, and adds the columns of the dataset to the right only where the elements in the common column match. Where there is no match, the added columns are left empty (shown as NA in R). The figure displays the results of a left outer merge.
The inverse of a left outer merge is a right outer merge, where all the rows of the dataset to the right are kept and the columns of the dataset to the left are added only where the elements in the common column match.
HOW TO CARRY OUT A MERGE IN R
The first thing we will do is create two dataframes x and y which will contain the elements of the datasets that we used as an example earlier. The dataset x will comprise the columns k1, k2 and data, whereas the dataset y will comprise the columns k1 and k3. Shown here is the code to create the dataframes x and y.
The dataframes are created by pressing Control plus Enter.
Carrying out a full merge
Let us now look at the syntax or code to carry out a full merge of both the datasets x and y. Enter the code:
merge(x, y, by.x = "k1", by.y = "k1", all = TRUE)
x and y are the datasets that are going to be merged. k1 is the column that is common between the datasets x and y. To indicate that a full merge needs to be carried out, all = TRUE is specified. On pressing Control plus Enter, a fully merged dataset is displayed in the Console.
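The original values of x and y are not reproduced in this text, so the sketch below uses made-up values. They are chosen so that, as in the description above, 1 is the only element of k1 common to both datasets and the full merge has seven rows.

```r
# Dataset x with the columns k1, k2 and data (values are assumptions)
x <- data.frame(k1   = c(1, 2, 3, 4),
                k2   = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40))

# Dataset y with the columns k1 and k3 (values are assumptions)
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"))

# Full merge: keep every row from both datasets
full <- merge(x, y, by.x = "k1", by.y = "k1", all = TRUE)
full
```

Rows that exist in only one dataset show NA in the columns that come from the other dataset.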
Carrying out an inner merge
To carry out an inner merge, enter the code:
merge(x, y, by.x = "k1", by.y = "k1")
Press Control plus Enter. In the Console, the results of an inner merge are shown, wherein only the rows with common elements in the common column are merged.
Carrying out a left outer merge
To carry out a left outer merge, enter the code:
merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE)
all.x is mentioned as the dataset x is to the left. On pressing Control plus Enter, the results are shown in the Console.
Carrying out a right outer merge
The code to carry out a right outer merge is shown here:
merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE)
Here all.y is specified, as y is the dataset to the right. On pressing Control plus Enter, the results of the right outer merge are shown in the Console.
For datasets to be merged there has to be at least one column in common between them.
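The inner, left outer and right outer merges can be compared side by side with the sketch below. The values in x and y are made up for illustration, with 1 as the only element of k1 common to both datasets.

```r
# Sample datasets (values are assumptions)
x <- data.frame(k1   = c(1, 2, 3, 4),
                k2   = c("a", "b", "c", "d"),
                data = c(10, 20, 30, 40))
y <- data.frame(k1 = c(1, 5, 6, 7),
                k3 = c("p", "q", "r", "s"))

# Inner merge: only rows whose k1 value appears in both datasets
inner <- merge(x, y, by.x = "k1", by.y = "k1")

# Left outer merge: all rows of x, plus k3 where k1 matches
left <- merge(x, y, by.x = "k1", by.y = "k1", all.x = TRUE)

# Right outer merge: all rows of y, plus k2 and data where k1 matches
right <- merge(x, y, by.x = "k1", by.y = "k1", all.y = TRUE)

inner
left
right
```

With these values the inner merge has a single row, while the left and right outer merges each have four rows, one for every row of x and y respectively.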
SUMMARY
In this section, the different ways in which two data structures can be merged have been looked at in some detail.
To summarize:
The merge operation brings together elements of different datasets or tables into a single consolidated table.
To carry out a merge, at least one column must be common to the datasets to be merged.
There is more than one way to merge data.
A full merge combines all of the elements in the datasets into a consolidated dataset.
An inner merge combines only the rows which have elements in common (within the common column)
A left outer merge keeps all the rows of the table or dataset to the left and adds matching elements from the right. A right outer merge keeps all the rows of the table or dataset to the right and adds matching elements from the left.
Section 5 Text Analytics and Word Cloud
PART 1 – UNDERSTANDING TEXT ANALYTICS
INTRODUCTION
Analyzing text is extremely powerful and is an integral part of our social media and web activity. So, here is what will be covered in this tutorial:
- Understanding textual or text analytics
- Importance of text analytics for organizations
- Common terms in text analytics
- Understanding the framework to create a Word Cloud
WHAT IS TEXT ANALYTICS
To understand this let us look at the type of information that is available to organizations today. In today’s competitive environment information is power. A lot of this information or data is present on the web in the form of text or videos. Very rarely is this information available to organizations in a structured format which can be stored in a database. In fact organizations need to take data that is available out there and structure it so that it is useful for them. But this can be a daunting task especially where most of the information is in text format. This is where textual or text analytics plays a key role.
So if we were to define text analytics we could say that it is the process of deriving high quality information from unstructured text. Simply put, it is making sense or giving structure to data or information which is not structured.
HOW IS TEXT ANALYTICS USEFUL
Let us suppose that you have been searching on the web for anything related to computer games. On your search results page you will often find ads and recommended pages related to computer games. A lot of what you see is dependent on your search history or the keywords that you have been using.
Likewise, when you are on the Newsfeed page of Facebook, you can see posts on Suggested pages or Ads displayed on the right hand side of your page. Maybe you have been looking for something specific on Facebook or have been spending time on a certain company page. Those suggested pages or ads could be very similar to the pages that you have been looking for or spending time on in Facebook.
If you have a Gmail account, you will find in your Spam folder a lot of mail that you did not send to Spam yourself. All of the examples that we have cited are the result of using text analytics. Take the example of Spam filtering. There have been instances when you have flagged mail from a certain sender as Spam. Your mail service provider will now automatically look for the words used in that mail and send any mail containing that text to Spam. Likewise, on Facebook, what you search for or write is analysed to come up with suggested pages and display ads.
Text analytics is an exciting and useful part of analytics. To understand this concept better, in the sections ahead we are going to focus on two aspects:
1. Understanding the common terms used in text analytics; and
2. Completing a text analytics project using data from a popular social medium, Twitter. The project will focus on the framework to create a Word Cloud out of a set of tweets on Big data, R and analytics.
You will need to execute this project in R using the concepts that will be discussed in this tutorial. We will of course be guiding you along the way.
IMPORTANT TERMS IN TEXT ANALYTICS
Corpus
The first term that we will look at is called Corpus. Corpus is the data structure that is used to manage the text that is being analyzed. A simple way to look at a corpus is to think of a dictionary.
It is a data structure of the relevant words in a piece of text. Let us assume that we are analyzing a blog on democracy. The corpus will be a list of all relevant words in the blog related to democracy, stored in a structured format. Just as a dictionary lists all the words associated with a term such as democracy in one place, the corpus lists all the relevant words from the blog in a single place. An important point to remember about a Corpus is that, just like a dataset, it needs to be cleaned up.
Cleaning up a Corpus
Stopwords
So what do we typically clean from a Corpus? Firstly, words which carry little meaning by themselves need to be removed. For example, if the blog that we are analyzing uses the words “the”, “or”, “of”, “am”, “is”, “are” and “was” quite frequently, these words have little or no value and hence need to be removed from the Corpus. These types of words are referred to as stopwords. There are around 196 stopwords that have been identified. You need not worry about identifying these words by yourself, because in R we will be using the Text Mining or tm package, which will help you in identifying and removing stopwords from your Corpus. In addition to the 196 stopwords identified, you can also add your own stopwords based on what you think is useful or not. For example, if you think that the names of people in the text you are analyzing are not useful, these can be added to the list of stopwords to be cleaned from the Corpus.
Numbers
Secondly, we can also remove numbers from the Corpus. So, if numbers have been used to demarcate points like 1, 2, 3 and so on, these can be removed from the Corpus as they have no meaning by themselves.
Punctuation
Thirdly, we can also remove punctuation such as commas, semicolons, colons and full stops from the Corpus.
Treatment of case
Fourthly, we can decide whether the same word should appear in upper case or lower case. For example, if democracy is spelt in one place with lower case but in another sentence begins with upper case, then we need to decide whether in the Corpus democracy should always start with upper case or lower case, so that both forms are counted as the same word.
Stemming
The next type of clean up is a process called Stemming. To understand Stemming, let us assume that the blog we are analyzing uses the word “participate” in different forms like “participated”, “participating” and “participatory”. All these words relate to the same root word, which is “participate”. Stemming aims to reduce such forms to a common stem so that they are counted together, no matter the tense used. Another example would be a verb like “fly”, which can be represented in an article as “flew”, “flying” or “flown”. Note that a stemmer works by stripping word endings, so regular forms such as “flying” are handled well, while irregular forms such as “flew” or “flown” may be left unchanged.
Framework
So, the framework to start analyzing text begins with creating a Corpus, which is a data structure to store text. This is then followed by the process of cleaning up the Corpus, wherein the following is carried out:
- Stopwords are removed
- Numbers are removed
- Punctuation is removed
- Treatment of case is decided
- Stemming is done
Another important term is Tokenize. In this process a sentence is broken down into individual tokens so that each word in that sentence is a separate entity. So the sentence “Parliament is the seat of democracy”, when tokenized, would be: Parliament, is, the, seat, of, democracy. This method is also used in search engines like Google when they look at keywords. For example, if the keywords “analytics jobs” are entered, they would first be broken down into 2 tokens, “analytics” and “jobs”.
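To make the clean-up steps and tokenizing a little more concrete, here is a minimal sketch in base R only. The sentence, the variable names and the tiny stopword list are made up for illustration; the tm package introduced later does the same work with its own functions and a much longer stopword list.

```r
# Hypothetical sentence; my_stopwords is a tiny hand-picked sample,
# not the full list that ships with the tm package.
sentence <- "Parliament is the seat of democracy, since 1950."
my_stopwords <- c("is", "the", "of", "since")

cleaned <- tolower(sentence)                  # treatment of case
cleaned <- gsub("[[:punct:]]", "", cleaned)   # remove punctuation
cleaned <- gsub("[[:digit:]]", "", cleaned)   # remove numbers

# Tokenize: split on runs of white space so each word is a separate entity
tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]

# Remove the stopwords
tokens <- tokens[!tokens %in% my_stopwords]
print(tokens)
```

Stemming is left out of this sketch because base R has no built-in stemmer.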
WHAT IS NEEDED TO CREATE A WORD CLOUD
TDM
Having arrived at a clean Corpus, we now need to decide what to do with it. Remember that a Corpus in itself is not an output, but a dictionary to be used to create something else. So if our final objective is to create a Word Cloud out of a Corpus, the Corpus needs to be converted into a format which enables a Word Cloud to be created from it. To understand this better, we need to know what is required to create a Word Cloud. Two very simple components make up a Word Cloud – words and the number of times or frequency with which those words appear. For example, look at the table shown below:
Words Frequency
People 20
Democracy 35
Freedom 40
The numbers next to those words represent the number of times they appear in a piece of text like a blog or an article. When a Word Cloud is created, the frequency will determine the size of the word within the Word Cloud. For example in the image shown below, the larger the size of the words, the more frequently they would have recurred in the content or the text from which this Word Cloud was created.
The structure which has been described above is referred to as a Term Document Matrix or TDM. So, to take a Corpus and make it Word Cloud ready, we need to create a TDM. A TDM is made up of rows and columns. The rows represent the words (terms), the columns represent the documents, and each cell holds the number of times a word occurs in a document.
So let’s stop for a while and ask ourselves a question: I have a blog and I want to create a Word Cloud out of it. How can I do it? Well, everything we have discussed so far should answer our question. Quite simply:
1. Create a Corpus
2. Clean up the Corpus
3. Create a TDM or Term Document Matrix out of it
4. Create your Word Cloud
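The two components of a Word Cloud, words and their frequencies, can be illustrated with a few lines of base R. This is only a sketch for a single, already cleaned piece of text; the text and the variable names are hypothetical.

```r
# Hypothetical, already cleaned text
text <- "freedom democracy people freedom democracy freedom"

# Split the text into words and count how often each one occurs
words <- strsplit(text, " ")[[1]]
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The word with the largest count would be drawn largest in the Word Cloud.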
Installing the TM package in R
Creating a Word Cloud in R is made possible by a package called tm, the Text Mining package. For example, to help with Step 2, which is cleaning up the Corpus, the tm package provides a function called tm_map. To carry out various processes such as removing stopwords, the appropriate transformation needs to be passed as an argument to tm_map.
The TM package comes with some really good documentation which you need to go through to understand how to execute each of the steps we have talked about. Remember to also use the Help feature in R for specific queries.
Before we move onto our project of creating a Word cloud out of a set of tweets, let’s make sure we do the following.
1. Download the file comprising the tweets that we need to convert into a Word Cloud. You can find this in the Download section of this tutorial. When opening this file, remember to right click and select Open with R Studio.
2. When the file is opened in R Studio, it will be visible in the Workspace section along with the number of tweets in it, which is 320.
3. Install and load the packages that are mentioned in the Download section of this tutorial. The packages are:
a) twitteR: This package is needed to read the tweets that have been downloaded
b) wordcloud: This is required to create the Word Cloud
c) tm: As mentioned earlier, the tm or Text Mining package is needed to create the Corpus, clean up the Corpus and create the TDM
d) SnowballC: This package is required to enable Stemming to be carried out.
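One possible way to check which of these packages still need to be installed is sketched below. The package names are spelled as they are assumed to appear on CRAN, and the install and load lines are left commented out so that nothing is downloaded unless you choose to run them.

```r
# Package names as assumed to be spelled on CRAN
pkgs <- c("twitteR", "wordcloud", "tm", "SnowballC")

# Find the packages that are not yet installed
installed <- rownames(installed.packages())
to_install <- setdiff(pkgs, installed)
print(to_install)

# Uncomment to install anything that is missing (needs an internet connection)
# install.packages(to_install)

# Uncomment to load the packages once they are installed
# for (p in pkgs) library(p, character.only = TRUE)
```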
SUMMARY
In this section, the meaning and importance of Text Analytics were covered. Some important terms in Text Analytics and the framework to create a Word Cloud have also been explained.
To summarize:
- Text Analytics gives structure to unstructured textual data.
- In R, Text Analytics is done with the help of the tm or Text Mining package.
- In Text Analytics, the first step is to create the Corpus.
- The next step is to clean the Corpus by removing stopwords, numbers and punctuation, deciding the treatment of case, and stemming.
- To convert a Corpus to a Word Cloud, a TDM or Term Document Matrix needs to be created.
PART 2 – UNDERSTANDING TEXT ANALYTICS
INTRODUCTION
Word Clouds are a product of text analytics. They are not so difficult to create. So, here is what will be covered in this tutorial:
- Understanding how to create a Word Cloud from twitter data
- The syntax used to carry out some important steps
- Understanding how to use a few packages in R
HOW TO CREATE A WORD CLOUD
STEP 1: CREATING A DATAFRAME
Open the dataset of sample tweets in R Studio.
Before we begin, you should have downloaded the sample set of tweets and opened it in R Studio. You should also have installed and loaded the list of packages that were specified in the previous section.
As you can see in the workspace section, a list of 320 tweets is displayed.
Double click on this list and you will see a list displayed in Notepad.
From this list it is pretty evident that there are no details of the actual content of the tweets. So what we have is essentially unstructured text. To convert this unstructured text into a Dataframe, enter the code shown below:
    library(twitteR)
    df <- do.call("rbind", lapply(tweets, as.data.frame))
Let us now deconstruct this code.
df = the Twitter data is going to be stored in a dataframe called df
do.call = a function which calls another function, passing it a list of arguments. In the case of our code, the function that is being called is rbind or row bind, and its arguments are the 320 one-row dataframes produced by lapply.
For a detailed explanation of do.call, go to the Help section in R Studio. Type do.call in the Search field and press Enter. A detailed explanation of the function is shown in Help, which you can go through to understand this function better.
rbind = an action to bind or combine together the 320 rows of tweets
lapply = a function which applies another function, here as.data.frame, to each of the tweets, so that every tweet becomes a one-row dataframe ready to be combined
Let us now execute the code mentioned above. Press Control plus Enter. As you can see in the Workspace section, a dataframe df with 320 observations is visible. These 320 observations are nothing but the Twitter data, which has been converted into a dataframe.
Let us open the dataframe. As you can see, each row is numbered, with the first row relating to the first tweet, the second row to the second tweet and so on. This dataframe runs into 320 rows, which correspond to the 320 tweets in our original data structure.
The dataframe has 14 variables. For the purposes of our exercise we will focus on the text column of the dataframe.
In order to view the dimensions of the dataframe, enter the code
    dim(df)
Press Control plus Enter and you can see in the Console the numbers 320 and 14 displayed.
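To see how do.call, rbind and lapply work together without needing the Twitter data, here is a small sketch in base R. The list below is a hypothetical stand-in for the tweets object; the real tweets are objects from the twitteR package with many more fields.

```r
# Hypothetical stand-in for the tweets list: three small lists, each with a
# text field and a retweet count
tweets <- list(
  list(text = "Big data is everywhere", retweetCount = 2),
  list(text = "Learning R for analytics", retweetCount = 5),
  list(text = "Text mining with R", retweetCount = 0)
)

# lapply turns every element into a one-row dataframe
rows <- lapply(tweets, function(tw) as.data.frame(tw, stringsAsFactors = FALSE))

# do.call passes the whole list of dataframes to a single rbind call
df <- do.call("rbind", rows)

print(dim(df))    # number of rows, then number of columns
print(df$text)    # the text column
```

With the real data the same pattern gives 320 rows, one per tweet.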
STEP 2: INSTALLING THE TM PACKAGE
Once the tm package has been loaded, we can find out how to use it from the Help section. Go to Help and enter tm in the Search field. Press Enter. Click on the link tm.
You can see two links shown here – a Description file and Overview of user guides and package vignettes.
We will click on the second option which is the Overview of user guides and package vignettes.
Once we do that, a PDF file on the Introduction to the tm package will open.
As we scroll through this document, you will see the tasks that are possible with the tm package. It covers everything from eliminating stopwords and carrying out stemming to creating a Term Document Matrix. A Term Document Matrix or TDM, as we know, is essential to creating a Word Cloud, as it lists the set of words along with the frequency or the number of times they appear in a given text.
STEP 3: UNDERSTANDING TDM
There are actually two types of matrix that can be created out of a Corpus. The first is a Term Document Matrix, where the terms or words are rows. The second is a Document Term Matrix, where the documents are rows. The image below shows an example of a Document Term Matrix, in which the documents appear as rows.
Let us interpret this matrix. Listed under Docs are the names or numbers of the texts that have been analysed. Listed as columns are the words which appear in these documents. So, if the number 10 appeared against Doc 127 and the word “able”, it would mean that the word “able” appears 10 times in document 127.
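The difference between the two matrices can be sketched in base R with three made-up documents. This is only an illustration of the layout; in the project itself the tm package builds the matrix for us.

```r
# Three hypothetical documents, already cleaned and in lower case
docs <- c(doc1 = "big data analytics",
          doc2 = "data mining with r",
          doc3 = "big data big ideas")

tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Document Term Matrix: one row per document, one column per term
dtm <- sapply(terms, function(term) {
  sapply(tokens, function(tk) sum(tk == term))
})

# Term Document Matrix: the transpose, one row per term
tdm <- t(dtm)

print(dtm)
print(tdm["big", ])   # how often "big" appears in each document
```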
STEP 4: CREATING A CORPUS
The next step in our project is to convert the dataframe df that we have created into a Corpus. A Corpus if you recall is the data structure to store all the text that will be analyzed.
In order to do this, use the syntax shown below.
    myCorpus <- Corpus(VectorSource(df$text))
Let us deconstruct this syntax. The name of the Corpus to be created is myCorpus. A column of a dataframe is a vector, so the term “VectorSource” tells R to treat each element of that vector as one document to be copied to the Corpus. Since we are only interested in the text portion of the dataframe, we need to indicate in the syntax that only the text column needs to be added to the Corpus. If we open the dataframe df, you can see that the column which contains the contents of the tweets is referred to as text.
So in the syntax we mention df$text. Press Control plus Enter to create the Corpus.
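If you would like to try this step before loading the real tweets, the sketch below builds a Corpus from a small made-up dataframe. It assumes the tm package is installed; the dataframe contents are hypothetical.

```r
library(tm)

# Hypothetical dataframe standing in for df; only the text column matters here
df <- data.frame(text = c("Big data is everywhere",
                          "Learning R for analytics",
                          "Text mining with R"),
                 stringsAsFactors = FALSE)

# Each element of df$text becomes one document in the Corpus
myCorpus <- Corpus(VectorSource(df$text))

print(length(myCorpus))              # number of documents
print(as.character(myCorpus[[1]]))   # text of the first document
```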
STEP 5: CLEANING THE CORPUS
Now that we have created a Corpus called myCorpus, which contains all the text to be converted to a Word Cloud, we need to proceed to the next step, which is cleaning up the Corpus. The function in the tm package which helps in the clean up of the Corpus is tm_map. If you refer to the documentation on the tm package, which we looked at earlier, all the information needed to transform the Corpus has been specified in detail. Cleaning up the Corpus is one part of this transformation.
Within the document, you will find the code required to carry out various processes, from eliminating white space or blank space from the Corpus, to conversion to lower case, to removal of stop words.
In the Help document, in the code shown to transform the Corpus, a sample Corpus name “reuters” has been used. For the purposes of our project we need to use the same code but replace “reuters” with the name myCorpus.
Shown here in R Studio is the code to clean up myCorpus.
In R Studio, we will hit Control plus Enter against each line of code and start cleaning up the Corpus. We will first convert to lower case, then remove punctuation and then remove numbers from the Corpus.
Removing URLs
It is also possible to remove URLs from the Corpus with the help of a user defined function, which is shown below.
Let us deconstruct this function. The name of the function is removeurl. By indicating http in the code, followed by alnum, followed by two double quotes, we are stating that wherever any URL starting with http is present, it needs to be blanked out. x in the code is the placeholder for the text that is passed to the function, which in our case comes from myCorpus. In the next line we can see the code which calls the removeurl function.
To find out the meaning of gsub which is used in the function, type out the words gsub in the search field of the Help section. As you can see from the text displayed gsub is a function which is used to carry out any kind of replacement.
So in the removeurl function gsub is replacing any text starting with http with a blank.
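A function of this kind might look like the sketch below. The exact pattern is an assumption based on the description above: http followed by any run of letters and digits. Because the pattern stops at the first character that is not a letter or digit, it relies on punctuation having been removed beforehand, so that a link has already collapsed into one unbroken run such as httpexamplecomr101.

```r
# Sketch of a URL-removal helper; the pattern is assumed from the description:
# "http" followed by any run of letters and digits, replaced by an empty string
removeurl <- function(x) gsub("http[[:alnum:]]*", "", x)

# Hypothetical tweet text after lower-casing and punctuation removal
tweet <- "great r tutorial at httpexamplecomr101 today"
print(removeurl(tweet))

# With the tm package loaded, the function could be applied to the Corpus as:
# myCorpus <- tm_map(myCorpus, content_transformer(removeurl))
```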
Removing stop words
We will now look at how to remove stop words from the Corpus. As you can see in the Console, there are a number of words which by themselves carry little meaning. Some of these words are once, why, each, in and to.
There are around 196 stop words that have been identified, but you can include more as well.
In addition, we also want to include some other words which, for the purposes of our project, are of no value or utility. These words are English, available and via.
Now using the code shown below we will go ahead and remove the stop words including the ones we have added from the Corpus.
Press Control plus Enter.
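The clean up steps covered so far (lower case, punctuation, numbers and stop words) could be put together as in the sketch below. It assumes the tm package is installed and uses a made-up two-document Corpus in place of the real tweets. Note that the extra stop words are written in lower case, because the text has already been converted to lower case by the time they are removed.

```r
library(tm)

# Hypothetical two-document Corpus standing in for myCorpus
myCorpus <- Corpus(VectorSource(c(
  "Big Data is available via 3 English blogs!",
  "Why R? Each analysis, once done, is reusable."
)))

myCorpus <- tm_map(myCorpus, content_transformer(tolower))  # lower case
myCorpus <- tm_map(myCorpus, removePunctuation)             # punctuation
myCorpus <- tm_map(myCorpus, removeNumbers)                 # numbers

# Built-in English stop words plus the three project-specific words
myStopwords <- c(stopwords("english"), "english", "available", "via")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# Tidy up the extra blanks left behind by the removals
myCorpus <- tm_map(myCorpus, stripWhitespace)

print(as.character(myCorpus[[1]]))
```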
Stemming
Now that we have removed stop words, we will move onto another important process in the clean up of the Corpus, which is called Stemming. To do this we first create a copy of the Corpus by using the code shown.
Stemming will convert words like eating and eats to one root word, eat.
In order to carry out stemming we need to install a package in R called SnowballC. To do this, go to Packages, Install Packages and write out the name of the package which is SnowballC.
Since we have already installed this package we will click on Cancel, but in case you have not then click on Install instead.
To carry out stemming, we need to use a function called stemDocument, which is provided by the tm package and relies on the SnowballC package to do the actual stemming.
Press Control plus Enter to start stemming.
STEP 6: CREATING THE TERM DOCUMENT MATRIX
Let us pause for a while and try to recollect the next step in the framework to convert a Corpus to a Word Cloud. After cleaning up the Corpus, the next step is to convert the Corpus into a Term Document Matrix. Shown here is the code to convert the Corpus into a Term Document Matrix.
Let us deconstruct this code. The code indicates that any word with a frequency from one to infinity needs to be added to the Term Document Matrix.
This need not be mentioned in the code, because by default words with all types of frequencies will be added to the term document matrix.
Press Control plus Enter. The Term Document Matrix has been created.
A Term Document Matrix is the transpose of the Document Term Matrix, wherein the terms are rows and the documents are columns. Each number indicates the number of times the word has appeared in a document. If we look at the Console we can see the term Sparsity followed by 99%. This means that 99% of the cells in the matrix are zero, that is, most of the words do not appear in most of the documents.
To view the contents of the Term Document Matrix, go to the Workspace section where you can see the value myTdm displayed. However, the Term Document Matrix is in the form of a List, whereas we would like to see it in the form of a matrix.
In order to do this, we create a matrix called m and convert the List into this matrix using the code shown below.
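A minimal sketch of these two steps is given below, using a made-up Corpus and assuming the tm package is installed. The control setting shown is an assumption about the code used here: in tm, wordLengths sets the minimum and maximum length of the words that are kept, and the package default keeps only words of three or more characters, so a one-letter word such as r would otherwise be dropped.

```r
library(tm)

# Hypothetical cleaned Corpus standing in for myCorpus
myCorpus <- Corpus(VectorSource(c("big data r", "r analytics", "big ideas")))

# Assumed control setting: keep words of any length, from 1 character upwards
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(wordLengths = c(1, Inf)))
print(myTdm)    # summary of the matrix, including its sparsity

# Convert to an ordinary matrix: terms as rows, documents as columns
m <- as.matrix(myTdm)
print(dim(m))
```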
In the Workspace section, a matrix m is displayed. Double click on this and our Term Document Matrix opens up! So let us break this down.
- The first column, “row names”, indicates the words that are contained in the 320 tweets (remember that all stop words have been removed, so these are the actual usable words).
- The columns, which are numbered 1, 2, 3 and so on, are the tweets, which we know run up to 320.
- The numbers indicate the number of times these words appear in each of these 320 tweets. In most cases the number is zero, indicating that the word has not appeared in that tweet.
To find out the frequency or the cumulative number of times a word appears across the 320 tweets, we will need to look at the sum of each row. So for example, to find out the frequency of the word “big”, we will need to add up all the numbers under each of the 320 columns against the row “big”.
STEP 7: CALCULATING FREQUENCIES
In order to create a word cloud we will need to plot the word against its frequency. The code to calculate the frequency of words and sort it in descending order is shown here.
Let us deconstruct this code. The term rowSums, with m within brackets, indicates that each row in the Term Document Matrix m will be summed. decreasing = TRUE means that the sums will be arranged in descending order. Press Control plus Enter. The result will be stored in wordfreq.
To view the frequencies that have been calculated, select wordfreq and press Control plus Enter. The results are displayed in the form of a List. So an easier alternative would be to convert the List “wordfreq” into a matrix “wordfreq1” using the code that is shown on the screen.
In the Workspace double click on the matrix “wordfreq1”. Shown on the screen is a matrix of all the words in the Corpus myCorpus along with their frequencies or the cumulative number of times they appear in the 320 tweets. Also, the frequencies have been arranged in descending order from the highest to the lowest.
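The frequency calculation can be tried out on a small made-up matrix, as in the sketch below. The matrix m here is hypothetical and much smaller than the real one, but the two steps are the same: sum each row, then sort from highest to lowest.

```r
# Hypothetical Term Document Matrix: 3 terms (rows) across 4 tweets (columns)
m <- matrix(c(1, 0, 2, 0,
              0, 1, 0, 0,
              1, 1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("big", "analytics", "r"),
                            as.character(1:4)))

# Sum each row to get a word's total frequency, then sort in descending order
wordfreq <- sort(rowSums(m), decreasing = TRUE)

# One-column matrix, which is easier to browse than a named vector
wordfreq1 <- as.matrix(wordfreq)
print(wordfreq1)
```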
STEP 8: CREATING THE WORD CLOUD
Now all that is left to be done is to generate the Word Cloud. Let us go to the Help section in R Studio and enter the words Word Cloud. Click on the link which appears. As you can see the arguments necessary to create a Word Cloud will be listed.
The first requirement shown is words, followed by frequencies. There are many other options listed so that one can create a Word Cloud based on different conditions. But a Word Cloud can be generated with just 2 pieces of information: words and their frequencies. The matrix of word frequencies that we will be using to generate the Word Cloud is called wordfreq1. To generate the Word Cloud enter the following code:
Press Control plus Enter. The Word Cloud is being created in the Plots section of R studio.
The Word Cloud creates the words with the highest frequencies first. So, words like r, analysis, research and example have high frequencies and hence are displayed quite prominently in the Word Cloud. In the matrix, there were many words with a frequency of 1. We can choose not to show those words in the Word Cloud. To exclude these words from the Word Cloud, enter in the code an option to include only those words with a frequency of say 5 and above. The code to execute this is shown below:
This code can also be found in the Help section.
Press Control plus Enter. You can see that fewer words are being added to the Word Cloud.
Another thing to remember is that each time a Word Cloud is generated the position of the words will change. As we can see, r which was earlier vertical is now horizontal and is located in a different place. In order to ensure that the position of a word does not change each time the Word Cloud is generated, we can use the function “set.seed”.
In order to limit the number of words to be shown in the Word Cloud, we can use the argument max.words. We can also determine the colour of the Word Cloud by using the argument colors, for example by setting colors to "red" within the brackets.
As you can see the words are now being displayed in red.
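Putting these options together, a call to the wordcloud function might look like the sketch below. It assumes the wordcloud package is installed; the words and frequencies are made up, and the seed value is arbitrary.

```r
library(wordcloud)

# Hypothetical words and frequencies standing in for wordfreq1
words <- c("r", "analysis", "research", "example", "data", "mining")
freq  <- c(40, 25, 18, 12, 6, 1)

set.seed(123)   # any fixed number keeps the layout the same on every run

wordcloud(words = words, freq = freq,
          min.freq = 5,      # leave out words that occur fewer than 5 times
          max.words = 100,   # upper limit on the number of words drawn
          colors = "red")    # draw all words in red
```

With the real data, the words would come from the row names of wordfreq1 and the frequencies from its single column.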
So we have completed the objective of this project, which was to generate a Word Cloud. To generate a Word Cloud all that is needed are words and their frequencies. Other parameters can also be defined, such as the minimum frequency, the maximum number of words to be displayed and the colour.
Do try out the other parameters available by referring to the content in the Help section under Word Cloud.
SUMMARY
Creating a Word Cloud in R is a matter of using the right package with the right set of text or words.
To summarize:
- Unstructured text can be converted into a structured format like a Dataframe in R using the correct syntax.
- The tm package in R, which is needed to carry out text analysis, comes with detailed documentation.
- While converting a Dataframe to a Corpus, the name of the vector/column which contains the text needs to be indicated.
- The tm package document displays the code to clean up a Corpus.
- Apart from cleaning up stop words, numbers and punctuation, URLs can also be removed through a user defined function.
- A TDM is first created as a List and then converted to a matrix in R.
- To calculate the frequency of words in a TDM, the row against each word in the matrix needs to be summed up.
- A Word Cloud can be created once words and their frequencies are mapped out.
- Modifications to a Word Cloud include defining minimum frequency, maximum number of words to be displayed and colour.
By downloading this material and signing up for our course, you have supported us in
our mission to help individuals and organizations take smarter decisions every day.
We hope to keep upgrading this material by focusing on improving quality and
providing additional lectures and material on this subject.
To send us feedback on how to improve this course, do write to us at
[email protected] with the subject line “R Handbook”.