Probability and Amount of Data

Corrado Enrico Agnes
Polytechnic PhD School of Engineering, Turin, Italy

Abstract

Taking for granted the need to introduce the teaching of probability, and Shannon's measure for the amount of data, as part of the essential knowledge school must impart to the citizen, this paper considers the vicious circle put into action when implementing the task. Accounting for and building on the results of cognitive science about probability and decision-making, the suggested introduction to probability is almost traditional in substance, but the choice of words and analogies intended to make it friendlier substantially changes the conceptual approach. In the organization of the paper this is step one, developed in the two initial sections. A different approach is used for the amount of data, which is introduced without the traditional analysis of signs and messages, mainly to highlight the features that most parallel statistical physics. This is step two, developed in the last two sections of the paper.

Keywords: Probability, Information, Data, Cognitive Biases

1. The problem of a citizen-oriented teaching of probability.

At the Braga GIREP Conference [Agnes, 1994] I defended the need to introduce into science education the quantity invented by Shannon as a measure for the amount of data. Following the flow of data through everyday-life examples could have been the way of understanding the "true logic of this world", with the side benefit of learning "the calculus of probability", as advocated by Maxwell in the famous quote. I believe the main thesis remains current and valid, but a vicious circle was hidden in this hope, discovered long ago by mathematicians [Ramsey, 1926], [de Finetti, 1931] and later confirmed by cognitive scientists. Oversimplifying: the mathematical theory of probability needs to take human opinions into account, and probability judgments in daily life are deeply biased by human psychology. Thus the essential tool for the teaching of data, the calculus of probability, is made unavailable.

The following example, adapted from medicine [Gigerenzer, 2008], convinced me, and I hope will convince the reader, of the necessity of real changes in the teaching of probability for everybody who is not going to become a specialist. Let 0.001 be the probability of being severely ill, so that 0.999 is the probability of being healthy. There is a test for this illness, which is positive with probability 0.999 if you are ill and, of course, with probability 0.001 if you are healthy. Having tested positive, what is the probability of being ill? This is an elementary exercise on Bayes' theorem, and the answer is?

Let's follow the first advice of the cognitive psychologists: to find 1 ill person we have to scrutinize 1,000 people accurately, so a reasonable sample for the problem is 1,000,000. Discourage the use of decimals (and percentages!), because the process of identifying one among many, looking for the needle in the haystack, is less meaningful than counting the trials needed to get one (on average): how many patients have to be treated to cure one is what the doctor never says this way. The second advice is to encourage tree diagrams, which make the representation of the problem part of the solution: Fig. 1a shows the splitting of the ensemble made by the illness, together with the splitting made by the test. Now it is easy to count the size of each group and evaluate the probability as about 0.5 (!).

The same calculation applies to any large sample split according to the same proportions, now from law: a suspect has only a 50% probability of being guilty, notwithstanding that he matches the perpetrator of a crime in gender and race (p = 0.001) and in a DNA test (p = 0.999).
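As a minimal sketch (my own illustration, not part of the paper; the function and parameter names are hypothetical), the natural-frequency count can be written out in a few lines of Python:

```python
# Natural-frequency computation of the Gigerenzer example:
# P(ill) = 0.001, P(positive | ill) = 0.999, P(positive | healthy) = 0.001.

def posterior_by_counting(sample_size=1_000_000,
                          p_ill=0.001,
                          p_pos_given_ill=0.999,
                          p_pos_given_healthy=0.001):
    """Count people instead of multiplying decimals, as the text advises."""
    ill = round(sample_size * p_ill)                        # 1,000 ill people
    healthy = sample_size - ill                             # 999,000 healthy people
    true_positives = round(ill * p_pos_given_ill)           # 999
    false_positives = round(healthy * p_pos_given_healthy)  # 999
    # Among all who test positive, how many are actually ill?
    return true_positives / (true_positives + false_positives)

print(posterior_by_counting())  # 0.5, far from the intuitive 0.999
```

Counting 999 true positives against 999 false positives makes the surprising posterior of one half immediate.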

Figure 1. a) Illness Test Calculated Risk b) Sum and Product Rules for Tree Diagrams

This is very different from what people actually answer, due to the "psychological effect" of the 999/1,000 positive result. The drive to jump to a conclusion, the discomfort with uncertainty, is characteristic of the automatic way of working of the human mind, System One or "fast thinking", as opposed to the "slow thinking" of System Two [Kahneman, 2011]. The hope of large gains makes people seek risk and reject a favourable settlement (possibility effect): they buy lottery tickets because hope, however little, is important. The fear of disappointment makes people risk averse and ready to accept an unfavourable settlement (certainty effect): do you prefer to take $1,000 for sure, or to bet on $1,500 with probability 0.9?¹

Nothing of the traditional teaching is lost, because the procedure is quite general and suitable for any probability problem. Namely, the axioms we traditionally put at the beginning of probability theory are visible properties of the tree in Fig. 1b: each horizontal line at each level sums up to the total number, and the path from any terminal point of the graph to the vertex reproduces the splitting factors, giving the compound probability as a product of probabilities.
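Both rules can be checked mechanically on a small counting tree. The sketch below (my own, with a hypothetical dictionary layout) encodes the sample of Fig. 1a, asserts the sum rule at every split, and applies the product rule along every path:

```python
# A tiny counting tree: each node holds a count; children partition the parent.
# Sum rule: counts at each level add up to the parent's total.
# Product rule: a leaf's probability is the product of the splitting factors
# along its path from the root.

tree = {
    "count": 1_000_000,
    "children": {
        "ill":     {"count": 1_000,   "children": {"pos": {"count": 999},
                                                   "neg": {"count": 1}}},
        "healthy": {"count": 999_000, "children": {"pos": {"count": 999},
                                                   "neg": {"count": 998_001}}},
    },
}

def leaf_probabilities(node, path_prob=1.0):
    """Yield the path probability of each leaf (product rule along the path)."""
    children = node.get("children")
    if not children:
        yield path_prob
        return
    total = node["count"]
    assert sum(c["count"] for c in children.values()) == total  # sum rule
    for child in children.values():
        yield from leaf_probabilities(child, path_prob * child["count"] / total)

assert abs(sum(leaf_probabilities(tree)) - 1.0) < 1e-12  # probabilities sum to 1
```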

¹ In the Kahneman-Tversky prospect theory, these are two examples out of the so-called "fourfold pattern".

Figure 2. Damien Hirst: Controlled Substances, 1994, Tate Gallery, London

2. New words and old ideas for teaching probabilities.

The just-summarized results of cognitive science lead one to think twice before introducing probability in the usual abstract way, with events and fractions (a priori probabilities) and (a posteriori) frequencies, and favour a friendlier approach: counting individual items from groups with integer numbers. And to give probabilities a concrete physical meaning, I'll use the analogy with the composition of a mixture. These are words used to speak about a few particles at a time, a paradoxical situation for chemistry and the physics of gases, but one not forbidden by the current laws; and they are words I'll use too for the spot paintings by Damien Hirst of Fig. 2, the example I chose to find support for the communication in the original gaze of an artist. On one side they have titles inspired by an imaginary chemistry; on the other they recall the urns full of coloured marbles of the exercises on probability, and of course the particles of statistical physics. I believe these pictures are useful to introduce the problem of distinguishable, indistinguishable, identical and diverse elements of a set, to which I'll extend the idea that otherwise "identical particles" could be considered diverse chemical substances if they differ in the values of a physical quantity [Einstein, 1914], [Job, 2007]. In our example that quantity is the colour of the marbles, so that the ratios of the number Ni of individuals of species i to the total number N measure the composition of the set. The same ratios Ni/N evidently represent the relative concentrations when the mixtures are physical, and probabilities in the case of mathematical urns. It is not the first time the same quantity has been invented to deal with similar problems in different fields: by chemists, when the substances are accessible and can be analyzed quantitatively and qualitatively; by mathematicians, when the set is accessible only one item at a time, and only in the future!
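For illustration (a sketch of my own, with hypothetical colour names), the ratios Ni/N can be read off any concrete "urn" by simple counting:

```python
from collections import Counter

# Composition of an "urn" of coloured marbles: the ratios N_i / N serve as
# relative concentrations (chemistry) or as probabilities (mathematical urns).
urn = ["red"] * 3 + ["blue"] * 2 + ["green"] * 5

counts = Counter(urn)
N = len(urn)
composition = {colour: n / N for colour, n in counts.items()}
print(composition)  # {'red': 0.3, 'blue': 0.2, 'green': 0.5}
```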

So "concentration of a mixture" becomes "possibility of choice". These are words full of arbitrariness, which take us nearer to our subject; but I want to show that some arbitrariness is also hidden in the description of a mixture. It is well known that chemists represent mixtures with variables very similar to coordinates: not position in space, but composition in the space of substances. What is less well known is that in substance space too we can arbitrarily choose the "axes", to the extent of taking "gedanken substances" as reference substances. The substance reference system is not given by nature, but follows and adjusts to newly discovered substances². The point of the comment is that the arbitrariness embodied in the judgment of whether two elements are identical or diverse is of the same type as the choice of the substance coordinate system, so that in the examples from the paintings the judgment of identical or diverse colours is subjective, but contradicts neither the rules of the calculus of probability nor probability distributions based on opinions.

² This notion well exposes the rhetoric of popularization about "the ultimate building blocks of matter".

The terms uncertainty and indeterminacy pollute the usual verbal environment of probability, with fatal consequences for teaching, but they are in no way related to probabilities. Probabilities may be unknown, or arbitrary in the sense of the choice of the coordinate system, or imprecise in the usual physical sense, but they are never uncertain nor indeterminate.

The basic idea is to introduce probability before, and independently of, the amount of data: we go from step one, where probability is introduced starting from the composition of a set, to step two, where a new property of the set, based on probabilities, is defined.

To summarize the main differences from the traditional teaching of probability: the concept and terminology of "events" is replaced with "substance portions" and their ratios, and the calculus of probabilities is not performed by repeated application of the two theorems of Fig. 1b to real numbers between 0 and 1. Instead, a subset with the desired characteristics is chosen from the tree representation of the total set, and its elements are counted with integers, what cognitive scientists call "natural frequencies".

Avoiding the use of "probable and possible events" postpones the didactic and psychological problems connected with uncertainty and indeterminacy. They are of course unavoidable, but they belong to another physical quantity, the one built on probabilities exactly with the aim of measuring the uncertainty-indeterminacy quantitatively!

I believe a better term for dealing with these problems is "doubt". A set of diverse objects offers the possibility of choice, and this creates the embarrassment of choice: the doubt.

It is important to be aware from the beginning of a special feature of the coming physical quantity: if the number of choices is kept fixed, it increases and is maximized when the choices are equally attractive, or without any preference at all, or full of indifference; as we learnt from Hamlet, Jules et Jim³ and equilibrium statistical physics!

3. Words and Ideas for teaching Data

How do we measure doubt? But before that, what physical system has this doubt as its property? Of course the person who is ready to draw a marble; but from the point of view of physics it is a property of the urn, just as the number of choices is a property of the urn. Here is where human psychology makes things difficult! The "doubt" is in the urn and is not a decision of the blindfolded goddess, whom System One of human thinking holds responsible, in its action-oriented search for a cause.

Let's have an urn, a box with only one marble, or two or three or more... but identical: this box cannot create any doubt because it does not offer any choice, so the measure of doubt we are looking for has the value zero, whatever the unit of measure we don't yet have! Let's have a second box full of the same number of identical marbles, so that the value for the second box is also zero, because identity has been defined locally and the boxes are separated. Now we make a gedanken experiment: we put both sets in the same box, separated by a wall that is then removed. Like the famous gedanken experiment from statistical physics⁴, this one too has different outcomes: either the marbles are all identical, so that the "doubt" remains zero, or the marbles, through the process of mixing, "recognize their diversity". In this case the diversity creates the possibility of choice, and this choice is independent of the actual number of marbles; it depends only on the composition, so it is the same for a box containing the minimum number of marbles that guarantees the same choice and the same doubt: only two marbles! We choose this "little bit" as the physical system that represents the unit of the quantity used to compare the amount of data of any other physical system.
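A tiny sketch of my own may make the gedanken experiment concrete: doubt appears only when a box offers a choice, irrespective of how many marbles it holds.

```python
def offers_choice(box):
    """A box offers choice (hence doubt) only if it holds diverse marbles."""
    return len(set(box)) > 1

print(offers_choice(["white"] * 5))                  # False: no choice, doubt = 0
print(offers_choice(["white"] * 5 + ["white"] * 5))  # False: mixing identicals, doubt stays 0
print(offers_choice(["white"] * 5 + ["black"] * 5))  # True: same choice as...
print(offers_choice(["white", "black"]))             # ...the minimal "little bit": two marbles
```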

Before leaving this analogy between composition and probability, let me point to a correspondence which may seem of no use, if not to remind us that analogies are not right or wrong but useful or useless. And when an analogy points to some unlikeness, it may as well be useful to stress the diversity, like the well-known didactic weapon of the counterexample.

The mathematical union of two sets is obviously analogous to the mixture, but the direct product of sets does not seem to have any useful meaning related to substances; in probability theory and in the computation of the amount of data, however, it becomes the key concept. Think of the multiple-choice problem: the direct product of the sets represents the combined independent possibilities. In the tree representation this amounts to splitting the same set according to diverse criteria, diverse probability distributions, as in Fig. 3b. In the language of sets, each level of the "tree of hands⁵" is the direct product of the set at the upper level and the set of the new choices. Repeated trials lead immediately to the "power" sets, and one example is the binary tree of Fig. 3a.

³ French film directed by François Truffaut in 1962.
⁴ The Gibbs paradox.
⁵ Because fingers are for choosing! The expression came out of an experimental session with children.

From now on let's use standard terminology: bit for the unit, data for the quantity, H for the symbol⁶, and bit/s for the unit of the current, the rate of transfer of data, with symbol I_H; trying to avoid commercial terminology about power or velocity of data transfer, and also limiting the use of words like choice and doubt, which I consider useful for understanding but which, like all words, inevitably carry more meanings, while we want to be restricted to the precise physical meaning. The idea for evaluating the data amount of a generic set is operative, and comes directly from the freedom of choice of the substance reference system. We build the set from "nothing": that is, after having "identified" all the objects, we begin with a set of identical objects, which we know has data amount equal to zero. Then we gain knowledge about the set, "diversifying" it step by step, until we reproduce the actual set. Let's go back to the tree representation in the very special case in which the choice at each step is binary, as in Fig. 3a. Beginning with a set of N = 2^n identical objects, which diversifies and halves at every branch, we observe that the amount of data increases by 1 unit at each level, so that the quantity is simply evaluated by the exponent of 2, that is, the base-2 logarithm: H = n = ld N bit. To be convinced the result is valid for any N, we consider a pure multiplicative tree, that is, a tree in which all the paths from the vertex to the base consist of the same factors, as in the example in Fig. 3b: N = 3·2·5·... The number of choices from one level to the next gets multiplied while the amount of data adds, because additivity is the "must" requirement of a measure: exactly the functional definition of the logarithm, H(3·2·5) = H(3) + H(2) + H(5).
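Both observations are easy to verify numerically. A short sketch of my own, taking "ld" to denote the base-2 logarithm, as in the text:

```python
from math import log2

def ld(x):
    """Base-2 logarithm, 'ld' (logarithmus dualis); H is measured in bits."""
    return log2(x)

# Binary tree: N = 2**n identical objects, halved at each of the n levels.
n = 10
N = 2**n
assert ld(N) == n  # H = n = ld N bit

# Pure multiplicative tree: the choices multiply while the data adds.
assert abs(ld(3 * 2 * 5) - (ld(3) + ld(2) + ld(5))) < 1e-12
```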

Figure 3. a) Binary Tree b) Pure Multiplicative Tree

To obtain the general formula we look at a generic terminal of the tree, reached by a branch carrying the compound probability of the path, which we calculated in Fig. 1b; its contribution to the total amount of data will be ld(N/Ni), weighed by the factor Ni/N due to the sum rule, and the final result is the Shannon formula:

$$ H = -\sum_i p_i \,\mathrm{ld}\, p_i = \sum_i \frac{N_i}{N}\,\mathrm{ld}\,\frac{N}{N_i}, \qquad p_i = \frac{N_i}{N} $$
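As a check of my own (not from the paper), both forms of the formula can be computed from the group sizes Ni, confirming that H depends only on the composition Ni/N and that the "two-marble" unit carries exactly 1 bit:

```python
from math import log2

def shannon_bits(counts):
    """Shannon amount of data H in bits, from the group sizes N_i."""
    N = sum(counts)
    return sum((n / N) * log2(N / n) for n in counts if n > 0)

print(shannon_bits([1, 1]))      # 1.0: the two-marble unit, exactly 1 bit
print(shannon_bits([500, 500]))  # 1.0: H depends only on the ratios N_i/N

# The -sum p ld p form agrees with the counting form:
counts = [3, 2, 5]
N = sum(counts)
alt = -sum((n / N) * log2(n / N) for n in counts)
assert abs(shannon_bits(counts) - alt) < 1e-12
```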

A final comment is important: the winning ticket among the N tickets of a lottery, together with all the ones which gave us hope and lost, carries the amount of data ld N bit only before the extraction. After the choice has been physically made, all of them become ascertained events, which carry H = 0 bit, exactly the same value as the impossible event: the extraction from the lottery urn of something which is not there.

Now we can connect the mathematical theory with the practice of communication [Shannon, 1949]: the physical system which prepares the message, choosing the signs one by one, the "Emitter", behaves as we did preparing the microstates of the system in statistical physics. But for the physical system "Receiver", the message carries the data corresponding to the choices made by the sender, because it does not yet know them.

The reason for leaving aside messages and signs during the introduction of the data measure may come from the "germicidal quotation marks" that Shannon himself put in the phrase "the 'meaning' of a message is irrelevant" [Gleick, 2011];

⁶ From Boltzmann's H theorem; maybe with the Greek capital E of entropy?!

but the very reason is to pave the way for the recognition of the Shannon measure as a legitimate physical quantity, as we'll show in the last section. The main difference from the traditional teaching is to build any amount of data directly from H = 0 bit, through a process of differentiation and identification, instead of finding an additive measure, that is, the logarithm of the total number of different messages. Consequently the unit bit is not defined starting with the set of two signs, which is the simplest, but, let me say, with the "elementary quantum of diversity". Of course it is only the reverse of the traditional formulation, but here too the conceptual approach is substantially changed.

4. Brand New and Used Up Physical Quantities

To bring the quantities measured in bit and bit/s from the virtual computer world to the physical world, I need first to persuade teachers of science that data is a legitimate physical quantity. Let's take the following simple but theoretically sound definition: a physical quantity is a relation between physical systems which becomes a property of a physical system, as shown by the prototypical example of length: the relation of things to the standard rod in Paris has become the property length [Falk, 1990]. This is exactly what we did when the message, which is the relation with the set of signs, became its data, and the relation between the emitter and the receiver became the current of data carried by the physical connection of the two.

Thus far I have consistently used "amount of data", or simply "data", and carefully avoided the term "information", to take advantage of the simplicity of representing the data as "contained in" and "transferred to" physical systems, and to minimize undesired extra meanings and unnecessary obstacles to the understanding process. In the words of Shannon [Gleick, 2011], his 'Information', although related to the everyday meaning of the word, should not be confused with it. The confusion is due to the "value" of the information, and to the increasing amount of unavoidable data, both well expressed by the pregnant new word Dataclysm [Rudder, 2014]. Our continuing infatuation with Big Data [Mayer-Schönberger, Cukier, 2013] comes from the harmful idea that an increasing quantity of data could become a quality in itself: the mirage of possessing "All the Data" can be dissolved only by familiarity with this physical quantity and its conceptual content. Of course the question about the "value" of data is important, but up to now it has no answer according to physics. Whenever a physical quantity appears to have a diverse "value", like for example one cubic meter of water in the sea or in a dam, the physical answer is to look for another physical quantity which can tell the diversity. Thermodynamics tells us that in the water example energy plays that role, and the gravitational potential is the right quantity to measure this very "value". The amount of data is not a primary energy carrier [Herrmann, Schmälzle, 1987]; its transfer is convective, data being carried together with other physical quantities. A specific quantity carried with data, which could be useful in relation to the value of information, has not yet been invented, but is probably on the way. Maybe a final observation on the "value of information" can be useful, on the peculiar aspect of data in relation to the "true" and "false" dichotomy. Suppose we make a copy of each object (no data added) and then label each twin pair true and false. The computation of the data increase gives 1 bit, only because the number N of objects is doubled: H(2N) = H(N) + 1 bit.
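The one-line verification, reading H(N) = ld N for a set of N mutually diverse objects:

$$ H(2N) = \mathrm{ld}(2N) = \mathrm{ld}\,2 + \mathrm{ld}\,N = 1\ \text{bit} + H(N). $$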

Everything points to the fact that data is a new physical quantity, but I have scattered hints from statistical physics that it has a close relation with entropy, and Shannon clearly wrote it [Shannon, 1949, §6-7]. With our definition of physical quantity we can easily understand why new physical quantities are sometimes not diverse from the old ones used in diverse contexts: between two physical systems we can have different relations, each one becoming a different property of the system. The analogy I used to introduce probability also points to this: I could say probability has been discovered / invented three times. Once by proto-chemists as mixing ratios, later by mathematicians, and I like to add the contribution of statistical physicists as independent, not as applied mathematics [Maxwell, 1860], [Boltzmann, 1872], [Gibbs, 1902].

And from Maxwell and his famous quote I take the conclusion, recalling his dedication to popularization as a form of teaching, hoping that, once the weapon of probability is made more handy, the amount of data, embedded into everyday life, will be fit to quantify the least quantifiable properties of the real world. To confine it within specialized boundaries is collateral damage for both education and culture.

References

Agnes, C.E. (1994). The Missing Quantity in Physics Education. In Pereira, L.C., Ferreira, J.A., Lopez, H.E. (Eds.), GIREP 1993: Proceedings of the Conference on Light and Information (pp. 260-264). Braga: Universidade do Minho.

Boltzmann, L. (1872). Further Studies on the Thermal Equilibrium of Gas Molecules ("Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen"). In Sitzungsberichte der Akademie der Wissenschaften, Mathematisch-Naturwissenschaftliche Klasse (pp. 275-370), Bd. 66, Dritte Heft, Zweite Abteilung. Vienna: Gerold.

de Finetti, B. (1931). The True Subjective Probability Problem. http://link.springer.com/chapter/10.1007/978-94-010-2288-0_2#page-1

Einstein, A. (1914). Beiträge zur Quantentheorie. Verh. Deut. Phys. Gesell. XVI, 820-827 (see also: Contributions to Quantum Theory, in The Collected Papers of Albert Einstein, 1997, Vol. 6, Princeton University Press, 20-26).

Falk, G. (1990). Zahl und Realität. Birkhäuser, Basel, Switzerland.

Gibbs, J.W. (1902). Elementary Principles in Statistical Mechanics: Developed with Especial Reference to the Rational Foundation of Thermodynamics. Nabu Public Domain Reprints.

Gigerenzer, G. (2008). Calculated Risks: How to Know When Numbers Deceive You. Simon & Schuster, New York, USA.

Gleick, J. (2011). The Information. Random House, New York, USA.

Herrmann, F., Schmälzle, P. (1987). Daten und Energie. J.B. Metzler + B.G. Teubner, Stuttgart, Germany.

Job, G. (2007). An Elementary Approach to Quantum Statistical Problems. Eduard-Job Foundation preprint (the paper can be requested at [email protected]).

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, USA.

Mayer-Schönberger, V., Cukier, K. (2013). Big Data. Houghton Mifflin Harcourt, New York, USA.

Maxwell, J.C. (1860). Illustrations of the Dynamical Theory of Gases. Part I. On the Motions and Collisions of Perfectly Elastic Spheres. Philosophical Magazine, 4th series, 19: 19-32. Part II. On the Process of Diffusion of Two or More Kinds of Moving Particles Among One Another. Philosophical Magazine, 4th series, 20: 21-37.

Ramsey, F.P. (1926). Truth and Probability. http://fitelson.org/coherence/ramsey.pdf

Rudder, C. (2014). Dataclysm. Crown Publishers, New York, USA.

Shannon, C.E., Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana, USA. http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf

Corrado Enrico Agnes
Polytechnic PhD School of Engineering, Turin, Italy
Corso Duca degli Abruzzi 24, 10129 Turin, Italy
e-mail: [email protected]

