Modeling Techniques in Predictive Analytics

Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R

TEXT ANALYTICS

Objective of Case Study

To analyze trends in movies released over the years, and how those trends differ from decade to decade, using text analytics tools and methods.

Methodology

We have data on movies released over the last 100 years in a single text file. We read all of the text from that file and store it as a text corpus. We then format and clean the text, keeping only the information relevant to our analysis. We use the R programming language for the statistical analysis.
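The corpus and text-formatting steps described above can be sketched with the tm package. This is a minimal illustration rather than the study's actual script: the three taglines are made up, and the particular choice of cleaning steps (lower-casing, removing punctuation, numbers and English stop words) is an assumption.

```r
library(tm)  # text-mining package listed under "Packages Used"

# Made-up taglines standing in for lines read from the file (illustrative only)
taglines <- c("The Truth Is Out There!",
              "A story of love, loss and 2 worlds...",
              "Life finds a way.")

# Store the raw text as a corpus, one document per tagline
corpus <- VCorpus(VectorSource(taglines))

# Text formatting (assumed steps): lower-case, then drop punctuation,
# numbers, English stop words and surplus whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(trimws(vapply(corpus,
                                function(doc) paste(content(doc), collapse = " "),
                                character(1))))
print(cleaned)
```

Which words count as "relevant" depends on the stop-word list, so the list passed to removeWords is the main thing to adjust for a real corpus.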

The Internet Movie Database (IMDb.com) is a good source of information about movies and is freely available on the Internet. We downloaded the information as a text file. For our example, we chose one of the smaller text files from IMDb, the tagline file.

Text analytics, like predictive analytics, is a numbers game, but with words rather than numbers as the raw input. We turn words into numbers for analysis.
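One common way to turn words into numbers is a document-term matrix, in which each row is a document, each column is a word, and each cell is a count. The sketch below builds one in base R from three made-up, already-cleaned taglines; tm's DocumentTermMatrix() produces the same kind of table from a corpus.

```r
# Made-up, already-cleaned taglines (illustrative only)
docs <- c(d1 = "love story love", d2 = "true story", d3 = "world of love")

# Split each document into words
tokens <- strsplit(docs, "\\s+")

# Vocabulary: every distinct word, sorted
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
dtm <- t(vapply(tokens,
                function(words) tabulate(match(words, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)
```

Once the text is in this form, ordinary numeric methods such as distances, clustering and time-series summaries can be applied to it.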

Data Preprocessing

This is how the unstructured text file looks.

We must process and clean this text before we can understand what it says.

We use this formatting to parse the tagline file for entry into a text database.
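A parsing step of this kind might look like the sketch below. The file layout is an assumption, since the raw file is not reproduced here: each movie is taken to start with a header line of the form "# Title (Year)", followed by one or more indented tagline lines. The sample lines and the column names are made up for illustration.

```r
# Made-up lines in the ASSUMED layout: "# Title (Year)", then indented taglines
raw <- c("# Example Movie (1999)",
         "\tA made-up tagline.",
         "\tAnother made-up tagline.",
         "",
         "# Second Example (2005)",
         "\tOne more invented line.")

# Header lines start a new movie; cumsum() numbers the movies in file order
is_header <- grepl("^# ", raw)
movie_id  <- cumsum(is_header)

# Tagline lines are the non-empty, non-header lines that follow a header
is_tag <- !is_header & nzchar(trimws(raw)) & movie_id > 0

# Split each header into title and four-digit year
headers <- raw[is_header]
pattern <- "^# (.*) \\(([0-9]{4})[^)]*\\).*$"
title   <- sub(pattern, "\\1", headers)
year    <- as.integer(sub(pattern, "\\2", headers))

# One row per tagline, carrying the title and year of its movie
movies <- data.frame(title   = title[movie_id[is_tag]],
                     year    = year[movie_id[is_tag]],
                     tagline = trimws(raw[is_tag]),
                     stringsAsFactors = FALSE)
print(movies)
```

Headers that do not match the assumed pattern, for example those without a four-digit year, would need separate handling before the year column can be trusted.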

This is how the structured data looks:

Packages Used

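library(tm)            # text mining: corpora and document-term matrices
library(stringr)       # string manipulation
library(grid)          # low-level graphics layout
library(ggplot2)       # plotting
library(latticeExtra)  # horizon plots
library(cluster)       # cluster analysis
library(proxy)         # distance and similarity measures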

Visualization using HISTOGRAM

To determine the range of years to consider in our study, we look at the distribution of release dates in the movie taglines data. The histogram below shows more than one hundred movies a year from the mid-1970s through 2013, and more than one thousand movies a year from 2003 to 2013.
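The idea of using the release-year distribution to choose a study window can be sketched as follows. The years are randomly generated stand-ins, not the IMDb data, and the threshold of 100 movies per year is used only because it matches the figure quoted above; base R's hist() is used here, though ggplot2's geom_histogram() would serve equally well.

```r
# Randomly generated release years, one entry per movie (NOT the real data)
set.seed(1)
all_years <- 1920:2013
years <- sample(all_years, size = 5000, replace = TRUE,
                prob = seq_along(all_years)^3)  # more movies in later years

# Histogram with one bin per release year
h <- hist(years, breaks = seq(1919.5, 2013.5, by = 1),
          main = "Movies per release year (made-up data)",
          xlab = "Year of release", ylab = "Number of movies")

# Years with more than `threshold` movies suggest where the study window lies
threshold  <- 100
counts     <- table(years)
busy_years <- as.integer(names(counts)[counts > threshold])
print(range(busy_years))
```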

Understanding the Trend Using Plot

We use a horizon plot to visualize the text measures over time.
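A horizon plot of this kind can be drawn with horizonplot() from latticeExtra. The sketch below uses random numbers in place of the real yearly scores, labels the series with the five text measures named in this section, and assumes a 1974 to 2013 study window; none of the values are results from the study.

```r
library(latticeExtra)  # provides horizonplot(); listed under "Packages Used"

# Random yearly scores for the five text measures (NOT real results)
set.seed(42)
measures <- c("LOVED", "WORLDS", "TRUTH", "LIFE", "STORY")
scores <- matrix(rnorm(40 * 5), ncol = 5, dimnames = list(NULL, measures))

# Yearly multivariate time series over an assumed 1974-2013 window
scores_ts <- ts(scores, start = 1974, frequency = 1)

# One panel per measure; colour bands show how far each year sits from zero
p <- horizonplot(scores_ts, origin = 0, horizonscale = 1,
                 colorkey = TRUE, layout = c(1, 5),
                 xlab = "Year of release")
print(p)
```

Standardizing each measure before plotting keeps the colour bands comparable across panels, since horizonscale is shared by all of them.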

We identify five common groups, or clusters, of words. These define the text measures that we call LOVED, WORLDS, TRUTH, LIFE, and STORY.
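Grouping words into clusters can be sketched as follows. This is one possible approach under stated assumptions, not necessarily the method used in the study: the term counts are made up, cosine distance and average-linkage hierarchical clustering are illustrative choices, and only base R is used. The proxy and cluster packages loaded earlier offer alternatives such as proxy::dist() and cluster::pam().

```r
# Made-up term-by-document counts: rows are words, columns are documents
m <- rbind(love  = c(5, 4, 0, 0, 0, 0),
           heart = c(4, 5, 0, 1, 0, 0),
           truth = c(0, 0, 5, 4, 0, 0),
           real  = c(0, 1, 4, 5, 0, 0),
           world = c(0, 0, 0, 0, 5, 3),
           earth = c(0, 0, 1, 0, 4, 5))

# Cosine similarity between words, converted to a distance
norms  <- sqrt(rowSums(m^2))
cosine <- (m %*% t(m)) / (norms %o% norms)
d      <- as.dist(1 - cosine)

# Hierarchical clustering, cut into three groups of words
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)
print(split(names(groups), groups))
```

Words that tend to appear in the same documents end up in the same group, and each group can then be given a label and scored per year as a text measure.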

INTERPRETATION/EVALUATION

Story-based movies have been produced in growing numbers, with fluctuations.

Autobiographical movies have been produced more since 2000, with a prominent increase from 2000 to 2010.

Non-fiction movies have been produced more, with a prominent increase from 1998 to 2010.

Movies about natural geography, wildlife, and the world have been produced more, with fluctuations: somewhat up and down until 1980, up in 1989, down until 2002, and increasing by 2010.

Movies with a love-story subject have fluctuated considerably: up until 1979, down in 1980-81, high in 1982, low until 1986, then high, mostly low until 2003, high until 2005, then low, and trending high again in 2010.

CONCLUSION

Based on our analysis, the current trend is toward non-fiction movies: the “Truth” category is markedly high, and overall movie production has also increased since 2000.

If we assume that movie production is directly proportional to revenue, producers can expect higher revenue by investing in non-fiction movies.

Date post:	15-Apr-2017
Upload:	piya-chauhan