
Lecture Unit 2: Graphical and Numerical Summaries of Data

Date post: 31-Mar-2015
Upload: elisha-pickeral
Lecture Unit 2 Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: 1) Construct graphs that appropriately describe data 2) Calculate and interpret numerical summaries of a data set. 3) Combine numerical methods with graphical methods to analyze a data set. 4) Apply graphical methods of summarizing data to choose appropriate numerical summaries. 5) Apply software and/or calculators to automate graphical and numerical summary procedures.
Transcript
  • Slide 1

Slide 2
Lecture Unit 2: Graphical and Numerical Summaries of Data
UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
  1) Construct graphs that appropriately describe data.
  2) Calculate and interpret numerical summaries of a data set.
  3) Combine numerical methods with graphical methods to analyze a data set.
  4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
  5) Apply software and/or calculators to automate graphical and numerical summary procedures.

Slide 3
Displaying Qualitative Data (Section 2.1)
"Sometimes you can see a lot just by looking." Yogi Berra, Hall of Fame Catcher, NY Yankees

Slide 4
The three rules of data analysis won't be difficult to remember:
  1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.
  2. Make a picture: it shows important features of and patterns in the data. You may also see things that you did not expect: extraordinary (possibly wrong) data values or unexpected patterns.
  3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Slide 5
Bar Charts: show counts or relative frequency for each category
  • Example: Titanic passenger/crew distribution

Slide 6
Pie Charts: show proportions of the whole in each category
  • Example: Titanic passenger/crew distribution

Slide 7
Example: Top 10 causes of death in the United States, 2001

Rank  Cause of death         Count     % of top 10  % of total deaths
1     Heart disease          700,142   37%          28%
2     Cancer                 553,768   29%          22%
3     Cerebrovascular        163,538   9%           6%
4     Chronic respiratory    123,013   6%           5%
5     Accidents              101,537   5%           4%
6     Diabetes mellitus      71,372    4%           3%
7     Flu and pneumonia      62,034    3%           2%
8     Alzheimer's disease    53,852    3%           2%
9     Kidney disorders       39,480    2%           2%
10    Septicemia             32,238    2%           1%
      All other causes       629,967                25%

For each individual who died in the United States in 2001, we record the cause of death. The table above is a summary of that information.
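The two percentage columns in the Slide 7 table use different denominators: the top-10 total versus all deaths. A minimal sketch of that arithmetic, using only the counts from the table (the function and variable names are illustrative, not part of the lecture):

```python
# Counts copied from the Slide 7 table (deaths in the United States, 2001).
top10 = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}
other_causes = 629_967


def percent_shares(counts, other_count):
    """Return {cause: (% of top-10 total, % of all deaths)}."""
    top10_total = sum(counts.values())
    grand_total = top10_total + other_count
    return {
        cause: (100 * n / top10_total, 100 * n / grand_total)
        for cause, n in counts.items()
    }


shares = percent_shares(top10, other_causes)
for cause, (pct_top10, pct_all) in shares.items():
    print(f"{cause:<22}{pct_top10:5.0f}%{pct_all:6.0f}%")
```

Only the first column adds up to 100% across the ten causes; the second reaches 100% only once "All other causes" is included, which is the labeling issue Slide 12 warns about.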
Slide 8
Top 10 causes of death in the United States, 2001: bar graph
Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category. The number of individuals who died of an accident in 2001 is approximately 100,000.

Slide 9
Top 10 causes of death in the United States, 2001
  • Bar graph sorted by rank: easy to analyze
  • Sorted alphabetically: much less useful

Slide 10
   1. United States $158       1. United States $137.9
   2. China $64.4              2. Japan $23.4
   3. Japan $54                3. Germany $20
   4. Germany $24.4            4. Britain $16.8
   5. Britain $23.5            5. France $12.6
   6. France $19.3             6. Canada $7.3
   7. Brazil $14.2             7. Italy $6.3
   8. Italy $13.1              8. China $5.4
   9. Australia $12.8          9. Netherlands $5.4
  10. India $11.9             10. Australia $4.8
Software Sales 2009 ($ billions); Computer Hardware Sales 2009 ($ billions). Source: NY Times

Slide 11
Top 10 causes of death: pie chart
Percent of people dying from top 10 causes of death in the United States in 2001
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

Slide 12
Percent of deaths from top 10 causes; percent of deaths from all causes
Make sure your labels match the data. Make sure all percents add up to 100.

Slide 13

Slide 14
Student Debt, North Carolina Schools

Slide 15
Child poverty before and after government intervention (UNICEF, 1996)
What does this chart tell you?
  • The United States has the highest rate of child poverty among developed nations (22% of under 18).
  • Its government does almost the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).
Could you transform this bar graph to fit in one pie chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
Slide 16

Slide 17
Unnecessary dimension in a pie chart

Slide 18
Contingency Tables: Categories for Two Variables
  • Example: Survival and class on the Titanic
Marginal distributions
  Marginal distribution of survival: 710/2201 = 32.3%, 1491/2201 = 67.7%
  Marginal distribution of class: 885/2201 = 40.2%, 325/2201 = 14.8%, 285/2201 = 12.9%, 706/2201 = 32.1%

Slide 19
Marginal distribution of class: bar chart

Slide 20
Marginal distribution of class: pie chart

Slide 21
Contingency Tables: Categories for Two Variables (cont.)
  • Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

Slide 22
Conditional distributions: segmented bar chart

Slide 23
Contingency Tables: Categories for Two Variables (cont.)
Questions:
  • What fraction of survivors were in first class? 202/710
  • What fraction of passengers were in first class and survivors? 202/2201
  • What fraction of the first class passengers survived? 202/325

Slide 24
TV viewers during the Super Bowl in 2007. What is the marginal distribution of those who watched the commercials only?
  1. 8.0%
  2. 23.5%
  3. 58.2%
  4. 27.7%

Slide 25
TV viewers during the Super Bowl in 2007. What percentage watched the game and were female?
  1. 41.8%
  2. 38.8%
  3. 51.2%
  4. 19.8%

Slide 26
TV viewers during the Super Bowl in 2007. Given that a viewer did not watch the Super Bowl, what percentage were male?
  1. 45.2%
  2. 48.8%
  3. 26.8%
  4. 27.7%

Slide 27
3-Way Tables
  • Example: Georgia death-sentence data

Slide 28
UC Berkeley Lawsuit

Slide 29
Lawsuit (cont.)

Slide 30
Simpson's Paradox
  • The reversal of the direction of a comparison or association when data from several groups are combined to form a single group.

Slide 31
Fly Alaska Airlines, the on-time airline!

Slide 32
America West Wins! You're a Hero!
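The three Titanic questions on Slide 23 share the same numerator (202 first-class survivors) and differ only in the denominator, which is what separates a conditional from a joint proportion. A minimal sketch using only the counts quoted on Slides 18 and 23 (variable names are illustrative):

```python
# Counts quoted on Slides 18 and 23.
total_aboard = 2201
survivors = 710
died = 1491
first_class = 325
first_class_survivors = 202


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Marginal distribution of survival: each count divided by the grand total.
marginal_survival = {
    "survived": pct(survivors, total_aboard),
    "died": pct(died, total_aboard),
}

# Same numerator, three different denominators.
of_survivors_in_first = pct(first_class_survivors, survivors)       # conditional on surviving
first_and_survived = pct(first_class_survivors, total_aboard)       # joint proportion
of_first_class_survived = pct(first_class_survivors, first_class)   # conditional on class

print(marginal_survival)
print(of_survivors_in_first, first_and_survived, of_first_class_survived)
```

Reading the question for the phrase that names the reference group ("of survivors", "of passengers", "of first class passengers") tells you which total goes in the denominator.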
Slide 33
Section 2.2 Displaying Quantitative Data
  • Histograms
  • Stem and Leaf Displays

Slide 34
Relative Frequency Histogram of Exam Grades
[Figure: histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Slide 35
Frequency Histograms

Slide 36
A histogram shows three general types of information:
  • It provides a visual indication of where the approximate center of the data is.
  • We can gain an understanding of the degree of spread, or variation, in the data.
  • We can observe the shape of the distribution.

Slide 37
All 200 m Races 20.2 secs or less

Slide 38
Histograms Showing Different Centers

Slide 39
Histograms: Same Center, Different Spread

Slide 40
Frequency and Relative Frequency Histograms
  • Identify the smallest and largest values in the data set.
  • Divide the interval between the largest and smallest values into between 5 and 20 subintervals called classes.
    * each data value falls in one and only one class
    * no data value is on a boundary

Slide 41
How Many Classes?

Slide 42
Histogram Construction (cont.)
    * compute the frequency or relative frequency of observations in each class
    * x-axis: class boundaries; y-axis: frequency or relative frequency scale
    * over each class draw a rectangle with height corresponding to the frequency or relative frequency in that class

Slide 43
Example: number of daily employee absences from work
  • 106 observations; approximate number of classes:
      (2 × 106)^(1/3) = 212^(1/3) ≈ 5.96
      1 + log(106)/log(2) = 1 + 6.73 = 7.73
  • There is no single correct answer for the number of classes.
  • For example, you can choose 6, 7, 8, or 9 classes; don't choose 15 classes.

Slide 44
EXCEL Histogram

Slide 45
Absences from Work (cont.)
  • 6 classes
  • class width: (158 - 121)/6 = 37/6 = 6.17, rounded up to 7
  • 6 classes, each of width 7; classes span 6(7) = 42 units
  • data spans 158 - 121 = 37 units
  • classes extend beyond the span of the actual data values by 42 - 37 = 5 units
  • lower boundary of 1st class: (1/2)(5) units below 121 = 121 - 2.5 = 118.5

Slide 46
EXCEL histogram

Slide 47
Grades on a statistics exam
Data: 75 66 77 66 64 73 91 65 59 86 61 86 61 58 70 77 80 58 94 78 62 79 83 54 52 45 82 48 67 55

Slide 48
Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30

Slide 49
Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067

Slide 50
Relative Frequency Histogram of Grades
[Figure: histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Slide 51
Based on the histogram, about what percent of the values are between 47.5 and 52.5?
  1. 50%
  2. 5%
  3. 17%
  4. 30%

Slide 52
Stem and leaf displays
  • Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Slide 53
Stem and Leaf Displays
  • Partition each number in the data into a stem and a leaf.
  • Constructing a stem and leaf display:
    1) determine the stem and leaf partition (5-20 stems)
    2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
    3) use only 1 digit in the leaves; drop digits or round off
    4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps

Slide 54
Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39; stem: 10s digit; leaf: 1s digit
  • 18: stem = 1; leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Slide 55
Suppose a 95-year-old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Slide 56
Number of TD passes by NFL teams: 2012-2013 season (stems are 10s digit)

stem | leaf
   4 | 0 3
   3 | 2 4 7
   2 | 6 6 7 7 7 8 9
   2 | 0 1 2 2 2 2 3 3 4 4 4
   1 | 1 3 4 6 7 8 8 9
   0 | 8

Slide 57
Pulse Rates, n = 138

Slide 58
Advantages/Disadvantages of Stem-and-Leaf Displays
  • Advantages
    1) each measurement is displayed
    2) ascending order in each stem row
    3) relatively simple (data set not too large)
  • Disadvantages
    the display becomes unwieldy for large data sets

Slide 59
Population of 185 US cities with between 100,000 and 500,000 people
  • Multiply stems by 100,000

Slide 60
Back-to-back stem-and-leaf displays. TD passes by NFL teams: 1999-2000, 2012-13 (multiply stems by 10)

            1999-2000 |   | 2012-13
                    2 | 4 | 0 3
                    6 | 3 | 7
                    2 | 3 | 2 4
              6 6 5 5 | 2 | 6 6 7 7 7 8 9
4 3 3 2 2 2 2 1 1 0 0 | 2 | 0 1 2 2 2 2 3 3 4 4 4
  9 9 9 8 8 8 7 6 6 6 | 1 | 6 7 8 8 9
                4 2 1 | 1 | 1 3 4
                      | 0 | 8

Slide 61
Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? Stems are 10s digits.
  1. 4
  2. 6
  3. 8
  4. 10
  5. 12

Slide 62
Interpreting Graphical Displays: Shape
  • Symmetric distribution: a distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
  • Skewed distribution: a distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
  • Complex, multimodal distribution: not all distributions have a simple overall shape, especially when there are few observations.

Slide 63
Shape (cont.): Female heart attack patients in New York state
  • Age: left-skewed
  • Cost: right-skewed

Slide 64
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.

Slide 65
Center: typical value of frozen personal pizza? ~$2.65

Slide 66
Spread: fuel efficiency, 4 and 8 cylinders
  • 4 cylinders: more spread
  • 8 cylinders: less spread

Slide 67
Other Graphical Methods for Economic Data
  • Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis
    ** Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Slide 68
Unemployment Rate, by Educational Attainment

Slide 69
Water Use During Super Bowl

Slide 70
Winning Times 100 M Dash

Slide 71
Annual Mean Temperature

Slide 72
End of Section 2.2
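As a recap of the construction steps in Section 2.2, the sketch below reproduces three of the worked examples: the two rules of thumb for the number of classes (Slide 43), the frequency distribution of the exam grades (Slides 47 to 49), and the stem-and-leaf display of employee ages (Slides 54 and 55). It uses only the Python standard library; the function names are illustrative and not part of the lecture, and the lecture itself works in Excel.

```python
import math
from collections import defaultdict


def suggested_class_counts(n):
    """Two rules of thumb from Slide 43: (2n)^(1/3) and 1 + log2(n)."""
    return (2 * n) ** (1 / 3), 1 + math.log2(n)


def frequency_distribution(data, lower, width, k):
    """Count values in k classes; class i covers lower + i*width up to (not including) the next boundary."""
    counts = [0] * k
    for x in data:
        i = int((x - lower) // width)
        if not 0 <= i < k:
            raise ValueError(f"{x} falls outside the classes")
        counts[i] += 1
    return counts


def stem_and_leaf(data):
    """Rows of 'stem | leaves' with stem = 10s digit, leaf = 1s digit, all stems in range listed."""
    rows = defaultdict(list)
    for x in sorted(data):
        rows[x // 10].append(x % 10)
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Slide 43: 106 observations.
cube_root_rule, log_rule = suggested_class_counts(106)
print(round(cube_root_rule, 2), round(log_rule, 2))

# Slides 47-49: exam grades in classes "40 up to 50", ..., "90 up to 100".
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61, 58, 70,
          77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45, 82, 48, 67, 55]
counts = frequency_distribution(grades, lower=40, width=10, k=6)
relative = [round(c / len(grades), 3) for c in counts]
print(counts, relative)

# Slides 54-55: employee ages, then the same display after a 95-year-old is hired.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages)))
print("\n".join(stem_and_leaf(ages + [95])))
```

Listing every stem between the smallest and largest, even the empty ones, is what makes the gap before 95 visible in the second display.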

