CDS 151 — Spring 2013 — Data Ethics in an Information Society
Lecture 2: Introduction to Data Ethics
Outline
Reading Assignment & Class Assignments
Introduction to Data & Data Ethics
Statistics: Use, Abuse, and Misuse
Reading Assignment
Weekly reading assignments are posted online at http://mymason.gmu.edu/
How to Lie with Statistics (D. Huff)
Last week’s assignment: Introduction and Chapter 1
This week: Chapters 2 and 3
Visual & Statistical Thinking: Displays of Evidence for Decision Making (E. R. Tufte) [no assignment this week]
On Being a Scientist: Responsible Conduct in Research (National Academy of Sciences) [free] [no assignment this week]
Class Assignments
1. Assignments in Blackboard – don’t forget!
2. Ongoing assignment (due April 17) – complete one of the two trainings below. After passing the exams, submit a copy of your Training Completion Report for up to 25% of the course grade.
a) RCR (Responsible Conduct of Research) training: http://oria.gmu.edu/ethical-conduct-of-research/responsible-conduct-of-research-education/responsible-conduct-of-research-training-plan/
b) HSR (human subjects research) training: http://oria.gmu.edu/research-with-humans-or-animals/institutional-review-board/human-subjects-training/
The Data Tsunami
We are now facing a huge problem!
The Data Flood is Everywhere!
Huge quantities of data are being generated in all business, government, and research domains:
Banking, retail, marketing, telecommunications, health, homeland security, computer networks, social networks, business transactions, scientific data (genomics, astronomy, physics, etc.), Web, text, and e-commerce
How much data are there?
There are a lot!
Data volume doubles every year!
So, how do we measure them?
Note: “data” is the plural form (many items), and “datum” is the singular (one item).
Measuring Data Quantities
Unit       Decimal size   Binary size (bytes)   Roughly equivalent to
Byte       8 bits         1                     one character (A, B, C...)
Kilobyte   1000 bytes     2^10                  half a page of text
Megabyte   10^6 bytes     2^20                  small digital photo, or small book, or 3.5-inch diskette
Gigabyte   10^9 bytes     2^30                  DVD with broadcast quality movie, or 2 CDs
Terabyte   10^12 bytes    2^40                  50,000 trees made into paper and printed into text
Petabyte   10^15 bytes    2^50                  all U.S. academic research libraries
Exabyte    10^18 bytes    2^60                  all words ever spoken by human beings throughout all of history
Note: one bit = 0/1 or Y/N or T/F
… followed by Zettabytes, Yottabytes, Brontobytes … http://www.whatsabyte.com/
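The table lists each unit both as a power of 10 and as a power of 2. The two are close but not equal, and the gap widens with each step up. The short Python sketch below shows the comparison; the names UNITS, decimal_size, binary_size, and gap_percent are illustrative choices and do not come from the lecture.

```python
# Decimal sizes count in powers of 1000; binary sizes count in powers of 1024 (2**10).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def decimal_size(unit: str) -> int:
    """Bytes in one unit, using powers of 1000."""
    return 1000 ** UNITS.index(unit)


def binary_size(unit: str) -> int:
    """Bytes in one unit, using powers of 1024."""
    return 1024 ** UNITS.index(unit)


def gap_percent(unit: str) -> float:
    """How much larger the binary size is than the decimal size, in percent."""
    return 100.0 * (binary_size(unit) / decimal_size(unit) - 1.0)


for step, unit in enumerate(UNITS):
    print(f"{unit:>10}: 10^{3 * step:<2} vs 2^{10 * step:<2} "
          f"(binary is {gap_percent(unit):.1f}% larger)")
```

For a kilobyte the difference is only 2.4%, but by the exabyte row it has grown to roughly 15%, which is why it helps to say which definition is meant.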
UC Berkeley 2003 estimate: 5 exabytes* created in 2002
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
Updated … 2008 estimate by IDC.com: 1800 exabytes* (1.8 zettabytes) estimated for 2011
http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
* 1 exabyte = 1000 petabytes = 1 million terabytes = 1 billion gigabytes!
So … how much data are there??
>2 Zettabytes, and growing http://www.networkworld.com/community/blog/volume-data-darn-near-indescribable-without-i
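As a rough consistency check on the "doubles every year" claim, the two estimates quoted above (5 exabytes for 2002 and 1800 exabytes for 2011) can be turned into an implied doubling time. This is only a sketch: it assumes steady exponential growth and treats the two studies as measuring the same quantity, which they may not. The function name doubling_time_years is an illustrative choice.

```python
import math


def doubling_time_years(start_amount: float, end_amount: float, years: float) -> float:
    """Years needed for one doubling, assuming steady exponential growth."""
    doublings = math.log2(end_amount / start_amount)
    return years / doublings


# Figures quoted in the slide: 5 exabytes (2002) and 1800 exabytes (2011 estimate).
implied = doubling_time_years(5, 1800, 2011 - 2002)
print(f"Implied doubling time: {implied:.2f} years")
```

Under those assumptions the result is a little over one year per doubling, which is in line with the slide's claim.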
How much is that?
2 zettabytes = about 4 trillion CDs of data
4 trillion CDs are hard to imagine …
So, try to visualize 1/7,000,000th of that amount …
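The CD figure depends on how much one disc is assumed to hold, and the slide does not state a capacity. The sketch below works the arithmetic both ways: with an assumed 700 MB disc, and with the 500 MB per disc that the "4 trillion" figure implies. It also computes 1/7,000,000th of 4 trillion. All names and both disc capacities are assumptions made for illustration.

```python
ZETTABYTE = 10 ** 21            # bytes, decimal definition
TOTAL_BYTES = 2 * ZETTABYTE     # the "2 zettabytes" figure from the slide
MEGABYTE = 10 ** 6


def discs_needed(total_bytes: int, disc_capacity_bytes: int) -> float:
    """Number of discs required to hold total_bytes."""
    return total_bytes / disc_capacity_bytes


cds_at_700mb = discs_needed(TOTAL_BYTES, 700 * MEGABYTE)   # assumed typical CD size
cds_at_500mb = discs_needed(TOTAL_BYTES, 500 * MEGABYTE)   # capacity implied by "4 trillion"

# 1/7,000,000th of the slide's 4 trillion CDs.
small_share = 4 * 10 ** 12 / 7_000_000

print(f"CDs at 700 MB each: {cds_at_700mb:.2e}")
print(f"CDs at 500 MB each: {cds_at_500mb:.2e}")
print(f"1/7,000,000th of 4 trillion CDs: about {small_share:,.0f} CDs")
```

Either way the answer is in the trillions, and even one seven-millionth of the slide's figure is still more than half a million discs.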
Data Ethics in an Information Society
With so much data and information out there, it is imperative for each one of us to …
Protect the rights of the owners of the data and information (infodata)
Protect your infodata from thieves
Protect the integrity of your infodata from corruption (intentional or accidental)
Deter criminals who would steal your infodata
Use infodata correctly (do not abuse or misuse your infodata)
… act in an ethical manner at all times …
Data Ethics in upcoming lectures
We will define ethics and look at how it appears in human society:
Principles, Policies, Regulations, and Laws
The Belmont Report … FERPA, HIPAA, ….
We will investigate some simple life examples where things can go wrong with data:
Data privacy – who owns my data anyway?
Information security – protecting your data from computer criminals (and others)
Misunderstanding statistics
Telling lies with statistics: “There are 3 types of lies – lies, damned lies, and statistics!”
Quote from H.G. Wells (1903) …
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
Well, that day is here now!
Famous & Infamous Quotes
“There are three kinds of lies: lies, damned lies, and statistics.” (Benjamin Disraeli)
“It is now beyond any doubt that cigarettes are the biggest cause of statistics.”
“If your experiment needs statistics, you ought to have done a better experiment.” (attributed to Ernest Rutherford)
“The Lottery is a tax on people who are bad at math.”
“Statistics in the hands of an engineer are like a lamppost to a drunk – they're used more for support than for illumination.”
Other Quotes – Abusing Statistics
“42.7% of all statistics are made up on the spot.” (Steven Wright, comedian)
“Say you were standing with one foot in the oven and one foot in an ice bucket. Then according to the percentage people, you should be perfectly comfortable.”
“Then there is the man who drowned crossing a stream that had an average depth of six inches.”
“A man may have 21 meals on Sunday and no meals for the rest of the week, making a perfect average of three meals per day, but that is not a good way to live.”
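The meals quote can be checked directly: the average really is three meals per day, yet the typical day has none. A minimal Python sketch, using only the numbers in the quote (the function name summarize is an illustrative choice):

```python
from statistics import mean, median


def summarize(values):
    """Return the mean alongside statistics that reveal the spread."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# 21 meals on Sunday, none on the other six days, as in the quote.
meals_per_day = [21, 0, 0, 0, 0, 0, 0]
summary = summarize(meals_per_day)
print(summary)
```

The mean is 3, but the median is 0 and the range runs from 0 to 21, so reporting the mean alone hides what the week actually looked like.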
“Say what?...”
“Global warming, earthquakes, hurricanes, and other natural disasters are a direct effect of the shrinking numbers of pirates since the 1800s.”
Correlation does not imply causation!
“The leading cause of divorce must be marriage, since we find that 100% of divorced couples were married first.”
“Our education system must be really bad since half of the students in this country scored below average on their SAT tests.”
Small Group Exercise: Answer this…
Suppose that a survey finds that 10% of people believe that product X is bad for you.
After a national advertising campaign to inform society of the dangers of product X, another survey is taken.
The national media report the survey result:
Following the national advertising campaign, the number of people who now believe that product X is bad for you has increased by 90%.
Your question: What percentage of people now believe that product X is bad for you?
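One way to work the exercise is to notice that "increased by 90%" can be read in two ways: as a relative increase in the number of believers, or (loosely) as an increase of 90 percentage points. The sketch below computes both readings, assuming the surveyed population stays the same size. The function names are illustrative, and which reading the media report intended is exactly what the exercise asks you to think about.

```python
def after_relative_increase(before_pct: float, increase_pct: float) -> float:
    """New percentage if the number of believers grows by increase_pct percent."""
    return before_pct * (100 + increase_pct) / 100


def after_point_increase(before_pct: float, points: float) -> float:
    """New percentage if the share rises by a number of percentage points."""
    return before_pct + points


relative_reading = after_relative_increase(10, 90)   # 10% grows by 90% of itself
point_reading = after_point_increase(10, 90)         # 10% plus 90 percentage points

print(f"Relative reading: {relative_reading:.0f}% of people")
print(f"Percentage-point reading: {point_reading:.0f}% of people")
```

Taken literally, the report describes a relative increase, which gives 19%. A careless reader may hear something much closer to the second figure.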
Statistical Concepts – Ethical Concerns
1. Biased sample
2. Insufficient sample
3. Correlation does not imply causation
4. Confounding factors (or Lurking Variables)
5. Subjective inference vs. Objective inference from data
Reference: http://www.lsat-center.com/lsatc4s3b.htm
We will briefly examine these concerns for now, but we will give detailed examples in a future lecture.
1. Biased sample – was the sample chosen fairly, so that every possible outcome had a real chance of appearing?
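To illustrate the biased-sample concern, the sketch below builds a made-up population in which one small group holds an opinion far more often than everyone else, then compares a random sample of the whole population with a sample drawn only from that group. The group sizes, the percentages, and all names are hypothetical and chosen purely for illustration.

```python
import random


def build_population():
    """Hypothetical population: group A (2,000 people, 80% say yes)
    and group B (8,000 people, 20% say yes)."""
    group_a = [("A", i < 1600) for i in range(2000)]
    group_b = [("B", i < 1600) for i in range(8000)]
    return group_a + group_b


def share_yes(people):
    """Fraction of people in the list who answered yes."""
    return sum(1 for _, says_yes in people if says_yes) / len(people)


rng = random.Random(151)        # fixed seed so the run is repeatable
population = build_population()
true_share = share_yes(population)

# Fair sample: everyone has the same chance of being picked.
fair_sample = rng.sample(population, 1000)

# Biased sample: only members of group A are ever asked.
group_a_only = [person for person in population if person[0] == "A"]
biased_sample = rng.sample(group_a_only, 1000)

print(f"True share saying yes:  {true_share:.2f}")
print(f"Fair sample estimate:   {share_yes(fair_sample):.2f}")
print(f"Biased sample estimate: {share_yes(biased_sample):.2f}")
```

Both samples contain 1,000 people, so the problem with the second one is not its size but who was eligible to be in it.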
2. Insufficient sample – was the sample large enough to justify a statistically significant conclusion?
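A rough way to see why sample size matters is the standard textbook approximation for the margin of error of a surveyed proportion, z * sqrt(p(1 - p) / n), with z = 1.96 for about 95% confidence. The sketch below assumes a simple random sample and a sample large enough for the normal approximation to hold. The function name margin_of_error is an illustrative choice.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


for n in (10, 100, 1000, 10000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>5}: 50% plus or minus {100 * moe:.1f} percentage points")
```

Because the margin shrinks with the square root of n, a sample must be about 100 times larger to make the margin 10 times smaller.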
3. Correlation does not imply causation – two things can rise and fall together without one causing the other, so implying causation can be very misleading.
4. Confounding factors (or Lurking Variables) – These are extraneous (usually unknown, ignored, or invisible) factors that affect the outcome of a survey.
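A small simulation can show how a lurking variable produces a strong correlation between two quantities that have no effect on each other. In the sketch below, a hidden factor drives both x and y. The data are synthetic, the noise levels are arbitrary, and the adjustment step assumes the lurking variable is known and measured, which in practice it often is not. All names are illustrative.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2013)       # fixed seed so the run is repeatable
n = 5000

# The lurking variable influences both x and y; x and y never influence each other.
lurking = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 0.5) for z in lurking]
y = [z + rng.gauss(0, 0.5) for z in lurking]

r_raw = pearson_r(x, y)

# Remove the lurking variable's contribution and correlate what is left.
x_rest = [xi - z for xi, z in zip(x, lurking)]
y_rest = [yi - z for yi, z in zip(y, lurking)]
r_adjusted = pearson_r(x_rest, y_rest)

print(f"Correlation of x and y:              {r_raw:.2f}")
print(f"After removing the lurking variable: {r_adjusted:.2f}")
```

The raw correlation is strong, but it nearly vanishes once the shared factor is accounted for, so the relationship between x and y was never causal.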
5. Subjective inference vs. Objective inference from data – is the authors' conclusion biased, or is it clearly supported by the data?
Final Comments:
Complete your assignments on MyMason.gmu.edu
Complete your Reading Assignment