The data age
Learning paradigms
◦ Learning as inference
◦ Bayesian learning, full Bayesian inference, Bayesian model averaging
◦ Model identification, maximum likelihood learning
Probably Approximately Correct learning
February 20, 2018, A.I., slide 2
The "data" age: can we automate analysis/learning?
Universal (statistical) predictor?
Universal learning architectures?
Self-improving super-intelligence?
Phases of AI: expert, supervised, autonomous
Serendipity-driven ("experimental") science
◦ 1000 years ago
◦ description of natural phenomena
Hypothesis-driven ("analytical") science
◦ last few hundred years
◦ falsification, Newton's Laws, Maxwell's Equations
Computation-driven ("computational") science
◦ last few decades
◦ simulation of complex phenomena
Data-driven ("hypothesis-free") research
◦ must move from data to information to knowledge
Jim Gray: Evolution of science: 4 paradigms
Hey, Tony, Stewart Tansley, and Kristin M. Tolle, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research, 2009.
Factors:
◦ Moore's law (computational hardware): storage and computation (grids, GPUs, quantum chips, ...)
◦ Semantic web technologies: linked open data
◦ Artificial intelligence methods: learning, reasoning, decision
1965, Gordon Moore, founder of Intel:
"The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years" ... "for at least ten years"
•10 µm – 1971
•6 µm – 1974
•3 µm – 1977
•1.5 µm – 1982
•1 µm – 1985
•800 nm – 1989
•600 nm – 1994
•350 nm – 1995
•250 nm – 1997
•180 nm – 1999
•130 nm – 2001
•90 nm – 2004
•65 nm – 2006
•45 nm – 2008
•32 nm – 2010
•22 nm – 2012
•14 nm – 2014
•10 nm – 2017
•7 nm – ~2019
•5 nm – ~2021
2012: single-atom transistor (~0.1 nm = 1 Å)
Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard
Cyganiak. http://lod-cloud.net/
"Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control." I. J. Good (1965)
Artificial Narrow Intelligence: N-AI-1, N-AI-2, ..., N-AI-k
Artificial General Intelligence (AGI)
Narrow AI (present) → General AI → Super AI / strong AI
1950, Turing: "Computing Machinery and Intelligence" (learning)
The possibility of learning is an empirical observation.
"The most incomprehensible thing about the world is that it is at all comprehensible."
Albert Einstein

"No theory of knowledge should attempt to explain why we are successful in our attempt to explain things."
K. R. Popper: Objective Knowledge, 1972
Epicurus' (342? B.C. – 270 B.C.) principle of multiple explanations states that one should keep all hypotheses that are consistent with the data.
The principle of Occam's razor (William of Occam, 1285–1349, sometimes spelt Ockham) states that when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability and more complex ones lower probability.
What is the probability that the sun will rise tomorrow?
◦ P[{sun rises tomorrow} | {it has risen k times previously}] = (k+1)/(k+2)
◦ (k: ...Laplace inferred the number of days by saying that the universe was created about 6000 years ago, based on a young-earth creationist reading of the Bible...)
◦ https://en.wikipedia.org/wiki/Sunrise_problem
Rule of succession
◦ https://en.wikipedia.org/wiki/Rule_of_succession
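The rule of succession above can be sketched in a few lines; the function name and the 6000-year day count are illustrative, not from the slide:

```python
from fractions import Fraction

def rule_of_succession(k, n):
    # Laplace's rule of succession: probability of success on the next trial,
    # given k successes in n trials, under a uniform prior
    return Fraction(k + 1, n + 2)

# Sunrise problem: the sun has risen on every one of k observed days (n = k),
# so the posterior predictive probability is (k+1)/(k+2)
k = 6000 * 365  # rough day count from Laplace's ~6000-year figure (illustrative)
print(float(rule_of_succession(k, k)))  # very close to, but strictly below, 1
```

Using exact rational arithmetic (`Fraction`) avoids rounding the near-1 probability to exactly 1.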
Russell & Norvig: Artificial Intelligence, ch. 20
[Figure: sequential likelihood of the given data under hypotheses h1–h5, plotted (0–1) over 12 observations]
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
(This is a highly simplified model of real learning:
◦ ignores prior knowledge
◦ assumes examples are given)
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting:
Ockham’s razor: prefer the simplest hypothesis consistent with data
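Ockham's razor in curve fitting can be made concrete: a straight line and a degree-(n−1) interpolating polynomial can both fit the training set (the interpolant exactly, the line almost), yet extrapolate very differently. A minimal sketch with illustrative data and function names:

```python
def fit_line(xs, ys):
    # least-squares straight line: the "simple" hypothesis
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return lambda x: slope * x + intercept

def lagrange(xs, ys):
    # degree n-1 interpolating polynomial: consistent with every example
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return h

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]   # roughly y = 2x + 1 with small noise

simple, complex_h = fit_line(xs, ys), lagrange(xs, ys)
print(simple(10.0), complex_h(10.0))  # near 21 vs. a wild extrapolation
```

Both hypotheses are (nearly) consistent with the data; Ockham's razor says to prefer the line, and its far better extrapolation here illustrates why.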
How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples (use the same distribution over the example space as for the training set)
Learning curve = % correct on the test set as a function of training set size
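A learning curve can be sketched on a toy concept-learning task, a threshold on [0, 1]; all names and parameters below are illustrative:

```python
import random

random.seed(0)

def f(x):
    # the unknown target concept: a threshold classifier at 0.5
    return x >= 0.5

def learn_threshold(train):
    # hypothesis space H: threshold classifiers; pick the candidate threshold
    # (0, 1, or any training x) with the fewest mistakes on the training set
    candidates = [0.0, 1.0] + [x for x, _ in train]
    return min(candidates, key=lambda t: sum((x >= t) != y for x, y in train))

test_xs = [random.random() for _ in range(1000)]
for m in (1, 5, 20, 100):  # growing training-set sizes
    train = [(x, f(x)) for x in (random.random() for _ in range(m))]
    h = learn_threshold(train)
    accuracy = sum((x >= h) == f(x) for x in test_xs) / len(test_xs)
    print(m, round(accuracy, 3))
```

Test accuracy is measured on fresh samples from the same distribution as the training set; it tends toward 1 as m grows.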
Probably Approximately Correct (PAC) learning

Example from concept learning:
◦ X: i.i.d. samples
◦ n: sample size
◦ H: hypothesis space

A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors for all hypotheses uniformly well?
Assume that the true hypothesis f is an element of the hypothesis space H.
Define the error of a hypothesis h as its misclassification rate:
error(h) = p(h(x) ≠ f(x))
Hypothesis h is approximately correct if error(h) < ε (ε is the "accuracy").
H can thus be separated into the approximately correct hypotheses (error(h) < ε) and the bad ones, Hbad = {h ∈ H : error(h) > ε}.
By definition, any hb ∈ Hbad errs on a single sample with probability larger than ε, so it agrees with f on a single sample with probability less than (1 − ε). Thus for n independent samples Dn:
p(Dn: hb(x) = f(x) on all n samples) ≤ (1 − ε)^n
The probability that some bad hypothesis survives all n samples can be bounded as
p(Dn: ∃hb ∈ Hbad consistent with Dn) ≤ |Hbad| (1 − ε)^n ≤ |H| (1 − ε)^n
To make this failure probability at most δ, i.e., to be approximately correct with probability at least 1 − δ:
|H| (1 − ε)^n ≤ δ
Expressing the sample size as a function of the accuracy ε and the confidence δ gives a bound on the sample complexity:
n ≥ (1/ε) (ln|H| + ln(1/δ))
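The sample-complexity bound is easy to evaluate numerically; the helper name and the example numbers below are illustrative:

```python
import math

def pac_sample_size(h_size, eps, delta):
    # smallest n satisfying n >= (1/eps) * (ln|H| + ln(1/delta)),
    # which guarantees |H| * (1 - eps)^n <= delta
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2^16 hypotheses, accuracy eps = 0.1, confidence delta = 0.05
n = pac_sample_size(h_size=2**16, eps=0.1, delta=0.05)
print(n)
```

Note the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.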
How many distinct concepts/decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 distinct functions
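The count follows directly from the truth-table argument; a quick check (function name is illustrative):

```python
def num_boolean_functions(n):
    # a truth table over n Boolean attributes has 2**n rows,
    # and each row's output can be 0 or 1 independently -> 2**(2**n) functions
    return 2 ** (2 ** n)

print(num_boolean_functions(6))  # 18446744073709551616, the slide's figure
```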
In practice, the target typically is not inside the hypothesis space: the total real error can be decomposed into "bias + variance"
"bias": expected error/modelling error
"variance": estimation/empirical selection error
For a given sample size the error decomposes:
[Figure: modeling error, statistical (model selection) error, and total error as functions of model complexity]
Normative predictive probabilistic inference
◦ performs Bayesian model averaging
◦ implements learning through model posteriors
◦ avoids model identification(!)
Model identification is hard:
◦ Probably Approximately Correct learning
◦ bias-variance dilemma