Overview
Viewing Distributions of Use
Descriptive rather than Analytical
Effect of Uncertainties
Contexts for Library Application
The Types of Distributions
Slide 3
Viewing Distributions of Use There are two ways to view
distributions and related J-shaped curves:
(1) in sequence of increasing frequency of uses, in which the number
of items is the dependent variable;
(2) in sequence of increasing numbers of items, in which the
frequency of use is the dependent variable.
Slide 4
Sequence of Increasing Uses To illustrate, consider the
following distribution: Note that the listing is in increasing
order of the frequency of use. For example, there are 500 items
that are used only once and 1 item that is used 10 times. The graph
of this distribution looks as follows:
Slide 5
Sequence of Increasing Uses
Slide 6
Example of Increasing Frequency This way of viewing might
typically be used when looking at statistics on circulation of
library materials in which the number of items circulating once
would be followed by the number circulating two times, etc.
Slide 7
Sequence of Increasing Items The alternative picture, for the
same data: Effectively, the data are now arranged in order of
decreasing frequency of use. The graphical picture is quite
different:
Slide 8
Sequence of Increasing Items
Slide 9
Example of Decreasing Frequency This means of viewing the data
is typically used in applying laws such as Zipf's law, in which
words are listed in decreasing order of frequency of use.
Similarly, in the original formulation of Bradford's law, journals
are sequenced in order of decreasing productivity for a subject
field and then grouped into zones of equal productivity (the zones
containing successively greater numbers of journals).
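The two views described above can be produced from one table of counts. The following is a minimal sketch, not the data behind the original slides: only the 500 single-use items and the one item used 10 times come from the text, and the remaining counts and the function names are hypothetical.

```python
# uses -> number of items with that many uses.
# Only the entries for 1 use (500 items) and 10 uses (1 item) come from
# the text; the other counts are invented for illustration.
items_by_uses = {1: 500, 2: 250, 3: 120, 5: 40, 10: 1}

def by_increasing_uses(dist):
    """View 1: (uses, number of items), in order of increasing uses."""
    return sorted(dist.items())

def by_increasing_items(dist):
    """View 2: (cumulative number of items, uses), most-used items first."""
    rows, cumulative = [], 0
    for uses, count in sorted(dist.items(), reverse=True):
        cumulative += count
        rows.append((cumulative, uses))
    return rows

view1 = by_increasing_uses(items_by_uses)
view2 = by_increasing_items(items_by_uses)
print(view1)
print(view2)
```

In the first view the number of items is the dependent variable; in the second, the frequency of use is plotted against the running count of items, which is the rank-ordered form used by Zipf and Bradford.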
Slide 10
Distributions of Use Each of the distributions that will be
presented here is intended to represent situations in which a few
items (journals, scientists, users, volumes, etc.) account for the
many (articles, citations, uses, circulations, etc.). These models
have value as means for assessing the effects of the patterns upon
various kinds of decisions. In the library, those decisions might
relate to acquisitions, to alternative means for storage of
holdings, or to staffing for services.
Slide 11
Descriptive rather than Analytical It is important to recognize
that, with one exception, these models are essentially descriptive
of empirical data. That is, most of them do not provide explanation
for the behavior exhibited in the data; they merely represent that
behavior in a mathematical form. Furthermore, they do not represent
cause and effect relationships. The one exception is the "mixture
of Poissons" which does provide an explanation for the behavior,
deriving it from the assumption of a heterogeneous (i.e., mixed)
population and random processes around the average for each of the
components of the population.
Slide 12
Effect of Uncertainties It is also important to recognize that
the empirical data, in any real situation, are themselves
uncertain, subject to variation as a result of many factors: errors
in observation, changes from one time period to another, changes in
the mixtures of populations involved, changes in the context of
observation. As a result, whatever model may best fit needs in
analysis is the one to be used, since any of the models is likely
to be as accurate as any other. Furthermore, unlike physical
phenomena, patterns of usage reflect not underlying laws of nature
but the effects of individual decisions or, in many cases, large
scale policy decisions.
Slide 13
Contexts for Library Applications For the library, there are
four contexts in which these kinds of distributions seem relevant:
(1) the context of the users,
(2) the context of the use of materials,
(3) the interaction between users and materials,
(4) the context of bibliometric analysis.
The first usually shows a distribution of uses across the set of
users that exhibits a J-shaped curve. The second usually shows a
distribution of uses across the set of materials that also exhibits
a J-shaped curve. The third helps to identify the nature of uses.
The fourth helps in assessing contributions of journals to
publication and use of the articles they contain.
Slide 14
Library User Patterns Library users differ in their relative
frequency of use. For example, in academic libraries, faculty will,
on average, use the library much more frequently than will
students. And among students, graduate students will, on average,
do so more frequently than undergraduates. The following shows
relative use at the UCLA library:
Slide 15
Library Collection Use Patterns Turning to the second
context, the use of materials, again the evidence is that in the
library the extent to which materials are used varies greatly.
Considering circulation data as a measure of use, some library
materials are heavily circulated each year and some are virtually
never circulated. Leaving aside for the moment differences for
specific items, there are categories of items that almost by
definition will vary in their circulation. There are materials that
are put "on reserve" precisely because they are expected to be
heavily circulated. There are rare books that will never be
circulated and even will rarely be used at all. There are current
"best sellers" that will be heavily used, and there are "dusty old
volumes" that will almost never be used.
Slide 16
Library Collection Use Patterns Beyond that, though, are the
differences among items, independent of identified categories. Some
of those differences relate to date of publication or acquisition,
some to the subject matter, some to the changeable role as assigned
readings. Despite excellent efforts (thinking especially of that by
Fussler and Simon) to identify reasons for such differences, there
are no easy criteria for a priori identification of which items
will be heavily used and which rarely so. The differences therefore
usually need to be identified from actual experience, as
exemplified in circulation records.
Slide 17
Relationship between Users & Materials Beyond the separate
distributions for users and materials, there are also important
distributions that reflect the relationships between the two. To
illustrate, the following shows the relative use of two categories
of materials (items with one use and all other items) by categories
of users at UCLA: Note the relatively greater use by faculty of One
Use items, especially in comparison with All Others.
Slide 18
Bibliometric Patterns A number of models (such as Bradford's
law) are used to describe characteristics of the literature. For
example, how is the literature on a particular subject scattered or
distributed in the journals? For libraries, the significance lies
in the fact that, in a bibliography on any subject, there is always
a small group of core journals that account for a substantial
percentage (say 1/3) of the articles on that subject. Then there is
a second, larger group of journals that account for another third
while a much larger group of journals picks up the last third.
Slide 19
Bibliometric Patterns Distribution frequencies of the Bradford
type are also evident in other bibliometric phenomena. Lotka's law,
for example, describes the productivity of scientists within a
given population. Productivity is defined here as the number of
papers a scientist publishes within a given time. Underlying the
J-shaped curve is an assumption that, if an individual (scientist
or journal) is successful (writes or publishes an article) on one
attempt, the probability of success on subsequent attempts
increases. This has been called "cumulative advantage," which is
equivalent to "success breeds success."
Slide 20
The Types of Distributions
Negative Exponential Distributions
  - Bradford's Law - 1
Negative Power or Harmonic Distributions
  - Zipf's Law
  - Bradford's Law - 2
  - Lotka's Law
  - Pareto Law
  - Cumulative Advantage Processes
Mixture of Poisson Distributions
Negative Binomial Distributions
Logistic Distributions
Linear Distributions
Slide 21
Negative Exponential Distributions The negative exponential is
represented by the equation $F(k) = N \cdot 2^{-A k}$. There are two
characterizing parameters: N and A. The base for the exponential can
be other than 2. It could be e or 10 or any other positive number.
The choice of base simply affects the value of A. For example, if
the base were 10,
$F(k) = N \cdot 2^{-A k} = N \cdot 10^{\log_{10}(2) \cdot (-A k)} = N \cdot 10^{-(A \log_{10} 2)\, k}$,
so A is replaced by $A \log_{10} 2$.
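The change of base can be checked numerically. This is a small sketch under the definitions above; the function names are hypothetical, and N = 1000, A = 1 are the values quoted for the accompanying graph.

```python
import math

def f_base2(k, N, A):
    """Negative exponential with base 2: F(k) = N * 2**(-A*k)."""
    return N * 2.0 ** (-A * k)

def f_base10(k, N, A):
    """The same curve written in base 10: A is rescaled by log10(2)."""
    return N * 10.0 ** (-(A * math.log10(2)) * k)

N, A = 1000, 1
for k in range(6):
    # The two columns should agree up to floating-point rounding.
    print(k, f_base2(k, N, A), round(f_base10(k, N, A), 6))
```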
Slide 22
Negative Exponential Distributions Graphically, it looks as
follows, for N = 1000 and A = 1:
Slide 23
Bradford's Law - 1 Samuel C. Bradford first formulated his law
in 1934. But it did not receive wide attention until publication of
his book, Documentation, in 1948. Bradford called it the law of
scattering, since it describes how the literature on a particular
subject is scattered or distributed in the journals. In information
science, Bradford's law is perhaps the best known of all the
bibliometric laws. A huge body of literature has been written on
it. Bradford's law, as originally defined, is a negative
exponential distribution.
Slide 24
Initiation of Bradford's Law In Documentation, Bradford
analyzed a four-year bibliography of references to articles in
applied geophysics. He listed journals containing references to
that field in descending order of productivity. He then divided the
list into three zones, each containing roughly the same number of
references. Bradford observed that the number of journals
contributing references to each zone increased by a multiple of
about five. Specifically, the first zone contained nine journals
which contributed 429 references. The second contained 59 journals
producing 499 references. In the third zone 258 journals provided
404 references.
Slide 25
Bradford S C. Documentation. Washington, DC: Public Affairs
Press, 1950.
Slide 26
Qualitative Form of Bradford's Law On the basis of these
observations, Bradford wrote, "the numbers of periodicals in the
nucleus and succeeding zones will be as 1, n, n^2, ..." (p. 116). For
applied geophysics, then, the number of journals in each zone was
proportionate to 1, 5, 25, ... Given that, the average frequency of
use for journals across the zones is represented by a negative
exponential distribution: 1, 1/n, 1/n^2, ... Later, we will derive
this negative exponential distribution from an underlying negative
power distribution.
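As a worked check, the zone figures quoted for applied geophysics (9, 59, and 258 journals contributing 429, 499, and 404 references) can be turned into zone multipliers and average references per journal. This is only an arithmetic sketch; the variable names are hypothetical.

```python
# (journals, references) for the three zones of Bradford's applied
# geophysics bibliography, as quoted in the text.
zones = [(9, 429), (59, 499), (258, 404)]

journals = [j for j, _ in zones]
avg_refs = [refs / j for j, refs in zones]

# Multiplier n between the journal counts of successive zones.
multipliers = [journals[i + 1] / journals[i] for i in range(len(journals) - 1)]
# Factor by which average productivity per journal falls from zone to zone.
falloff = [avg_refs[i] / avg_refs[i + 1] for i in range(len(avg_refs) - 1)]

print("multipliers:", [round(m, 2) for m in multipliers])
print("average references per journal:", [round(a, 2) for a in avg_refs])
print("falloff:", [round(f, 2) for f in falloff])
```

Both the multipliers and the falloff factors come out in the neighborhood of five, which is the sense in which the zones follow 1, n, n^2 and the averages follow 1, 1/n, 1/n^2.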
Slide 27
Log-linear Form Given that the negative exponential is
represented by the equation $F(k) = N \cdot 2^{-A k}$, note that
$\log_2 F(k) = \log_2 N - A k$. This log-linear form is useful for
plotting the values of $\log_2 F(k)$ as a function of k or for
estimating the values for N and A by regression. The following graph
shows the log-linear form.
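Estimating N and A by regression can be sketched as an ordinary least-squares fit of log2 F(k) against k. The function name and the synthetic data (generated from N = 1000, A = 1) are assumptions for illustration; real circulation counts would be noisier and would need zero counts removed before taking logarithms.

```python
import math

def fit_negative_exponential(ks, fs):
    """Least-squares fit of log2 F(k) = log2 N - A*k; returns (N, A)."""
    ys = [math.log2(f) for f in fs]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    # The intercept is log2 N and the slope is -A.
    return 2.0 ** intercept, -slope

ks = list(range(8))
fs = [1000 * 2.0 ** (-k) for k in ks]  # synthetic, noise-free data
N_hat, A_hat = fit_negative_exponential(ks, fs)
print(round(N_hat, 6), round(A_hat, 6))
```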
Slide 28
Log-linear Form Graphically the log-linear form looks as
follows:
Slide 29
Negative Power or Harmonic Distributions The negative power or
harmonic distributions derive from the harmonic series: 1, 1/2,
1/3, ..., 1/n, ... That basic series is augmented with two
parameters, A and B, in the following formula:
$P(x) = (A/x) \cdot (B/x)^{A}$, defined over the interval
$0 < B < x$. Note that P(x) is expressed as a negative power of the
value x. Hence, "negative power distribution."
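A short sketch of the formula, assuming the reading P(x) = (A/x)(B/x)^A given above, with A = 1.2 and B = 0.7 taken from the accompanying graph. The function name is hypothetical. Because P(x) is proportional to x^-(A+1), the slope between any two points on log-log axes should be -(A + 1).

```python
import math

def negative_power(x, A, B):
    """P(x) = (A/x) * (B/x)**A, defined for 0 < B < x."""
    if not 0 < B < x:
        raise ValueError("requires 0 < B < x")
    return (A / x) * (B / x) ** A

A, B = 1.2, 0.7
xs = [1, 2, 4, 8, 16]
ps = [negative_power(x, A, B) for x in xs]

# Slope between successive points on log-log axes.
slopes = [
    (math.log(ps[i + 1]) - math.log(ps[i])) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print([round(p, 5) for p in ps])
print([round(s, 6) for s in slopes])
```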
Slide 30
Harmonic or Negative Power Distributions Graphically it looks
as follows, for A = 1.2 and B = 0.7:
Slide 31
Zipf's Law In his book Human Behavior and the Principle of
Least Effort, George K. Zipf treated the frequency with which words
occur in a given piece of literature. Zipf arranged the 29,899
different words found in Joyce's Ulysses in descending order of
their frequency of occurrence. Then to each word he assigned a
rank, from r = 1 (most frequently occurring word) to r = 29,899
(least frequently occurring). He found that by multiplying the
numerical value of each rank r by its corresponding frequency F, he
obtained a product, C, which was approximately constant throughout
the list of words. The formula for Zipf's law is thus F(r) = C/r, so
it is a harmonic distribution.
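The rank-times-frequency procedure can be sketched as follows. The word list is a toy input constructed so that the product is exactly constant; it is not Zipf's data, and real text satisfies the relation only approximately. The function name is hypothetical.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

# Toy input with frequencies 6, 3, 2, so that r * F = 6 at every rank.
words = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
rows = rank_frequency(words)
for row in rows:
    print(row)
```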
Slide 32
Bradford's Law - 2 At about the same time that Zipf published
his book, Bradford wrote Documentation. We have already discussed
the negative exponential distribution represented by the original
formulation of Bradford's law. We will now look at the underlying
harmonic distribution and derive the exponential one from it.
Slide 33
Frequency of Use of Journals Underlying Bradford's law is the
frequency of use of journals, as exemplified by their occurrence in
a bibliography for a subject field. Let P(n) be the frequency of
use of journal n, with journals listed in decreasing order of that
frequency of use, so that P(n) > P(n+1). The empirical facts appear
to be that, overall and with varying degrees of accuracy, the
frequency of use of journals fits an harmonic distribution. Thus,
P(n) = A/n (more or less).
Slide 34
Frequency for Groups of Journals Suppose that we now group the
journals, the first group containing the most frequently used
journals, the second the next most frequently used, and so on. Let
$G_k$ be the number of journals in group k. Consider the frequency
of use of the journals in each of the several groups (taking A = 1,
so that P(n) = 1/n):
$F(1) = P(1) + P(2) + \cdots + P(G_1) = 1/1 + 1/2 + \cdots + 1/G_1$
$F(2) = P(G_1+1) + P(G_1+2) + \cdots + P(G_1+G_2) = 1/(G_1+1) + \cdots + 1/(G_1+G_2)$
$F(3) = P(G_1+G_2+1) + \cdots + P(G_1+G_2+G_3) = 1/(G_1+G_2+1) + \cdots + 1/(G_1+G_2+G_3)$
and so on.
Slide 35
Sums of the Harmonic Series There is not a closed form for
evaluation of the several sums of an harmonic series, but we can
compare the total for the areas of the rectangles with the integral
of the function 1/x, shown in red in the following graph:
Slide 36
Harmonic Series & Natural Logarithm The sum from 1/(A+1) to
1/B can be approximated by the integral of 1/x from (A + 1 - 0.5)
to (B + 0.5). The integral of 1/x is ln(x), so the sum from 1/(A+1)
to 1/B would be approximately $\ln((B + 0.5)/(A + 0.5))$.
Use that approximation and let $T_k = \sum_{i=1}^{k} G_i$, $T_0 = 0$, so that
$F(k+1) = \ln((T_{k+1} + 0.5)/(T_k + 0.5))$.
In Bradford's description of the law, the successive groups of
journals were chosen to have about the same number of citations, so
F(1) = F(2) = F(3) = F(4), etc. Hence,
$(T_{k+1} + 0.5)/(T_k + 0.5) = (T_k + 0.5)/(T_{k-1} + 0.5)$
$(T_k + 0.5)^2 = (T_{k+1} + 0.5)(T_{k-1} + 0.5)$
$= (T_k + 0.5 + G_{k+1})(T_k + 0.5 - G_k)$
$= (T_k + 0.5)^2 + G_{k+1}(T_{k-1} + 0.5) - G_k(T_k + 0.5)$
Slide 37
Harmonic Series & Natural Logarithm From that equation,
$G_{k+1} = G_k (T_k + 0.5)/(T_{k-1} + 0.5)$.
For k = 1, $G_2 = G_1 (G_1 + 0.5)/(0 + 0.5) = G_1 (2G_1 + 1)$.
By induction, we prove that $G_{k+1} = G_1 (2G_1 + 1)^k$:
First, if $G_i = G_1 (2G_1 + 1)^{i-1}$ for all i < k + 1, then
$T_k = G_1 \sum_{i=0}^{k-1} (2G_1 + 1)^i = G_1 ((2G_1 + 1)^k - 1)/(2G_1 + 1 - 1) = ((2G_1 + 1)^k - 1)/2 = (2G_1 + 1)^k/2 - 0.5$
Hence, $T_k + 0.5 = (2G_1 + 1)^k/2$ and $T_{k-1} + 0.5 = (2G_1 + 1)^{k-1}/2$.
Hence, since $G_{k+1} = G_k (T_k + 0.5)/(T_{k-1} + 0.5)$,
$G_{k+1} = G_k ((2G_1 + 1)^k/2)/((2G_1 + 1)^{k-1}/2) = G_k (2G_1 + 1)$
$G_{k+1} = G_1 (2G_1 + 1)^k$
Q.E.D.
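The induction result can be checked mechanically by running the recurrence G(k+1) = G(k)(T(k) + 0.5)/(T(k-1) + 0.5) in exact rational arithmetic and comparing it with the closed form. This is a sketch only; the function names and the starting sizes tried are hypothetical.

```python
from fractions import Fraction

def group_sizes_by_recurrence(g1, groups):
    """G_{k+1} = G_k * (T_k + 1/2) / (T_{k-1} + 1/2), with T_0 = 0."""
    half = Fraction(1, 2)
    sizes = [Fraction(g1)]
    t_prev, t_curr = Fraction(0), Fraction(g1)  # T_0 and T_1
    while len(sizes) < groups:
        g_next = sizes[-1] * (t_curr + half) / (t_prev + half)
        sizes.append(g_next)
        t_prev, t_curr = t_curr, t_curr + g_next
    return sizes

def group_sizes_closed_form(g1, groups):
    """G_{k+1} = G_1 * (2*G_1 + 1)**k for k = 0, 1, 2, ..."""
    return [Fraction(g1) * (2 * g1 + 1) ** k for k in range(groups)]

for g1 in (1, 2, 9):
    by_recurrence = group_sizes_by_recurrence(g1, 5)
    closed = group_sizes_closed_form(g1, 5)
    print(g1, [int(g) for g in closed], by_recurrence == closed)
```

With a first group of 2 journals the multiplier 2G_1 + 1 is 5, giving groups of 2, 10, 50, and so on.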
Slide 38
Harmonic Series & Natural Logarithm As a result, the number
of journals in group k is an exponential function of k. Given the
equal number of citations for each group, the frequency
distribution is negative exponential. However, it is important to
note that the approximation of the summation of 1/n by ln is
significantly in error at the start. Specifically, 1/1 = 1 but
ln(1.5/0.5) = ln 3, which is about 1.1. The result is an
overestimate at the start of about 10%. This is at least a partial
explanation of the difference between empirical data and the
exponential model in the region called the "core journals," which
will be illustrated next.
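The size of the error can be seen by comparing exact harmonic sums with the logarithmic approximation for successive groups. The sketch below assumes a first group of one journal (so group boundaries fall at 1, 4, 13, 40); the function names are hypothetical.

```python
import math

def harmonic_sum(a, b):
    """Exact sum 1/(a+1) + ... + 1/b."""
    return sum(1.0 / n for n in range(a + 1, b + 1))

def log_approximation(a, b):
    """ln((b + 0.5)/(a + 0.5)): the integral of 1/x from a + 0.5 to b + 0.5."""
    return math.log((b + 0.5) / (a + 0.5))

g1 = 1
bounds = [0]  # cumulative journal counts T_0, T_1, T_2, ...
for k in range(4):
    bounds.append(bounds[-1] + g1 * (2 * g1 + 1) ** k)

errors = []
for k in range(1, len(bounds)):
    exact = harmonic_sum(bounds[k - 1], bounds[k])
    approx = log_approximation(bounds[k - 1], bounds[k])
    errors.append((approx - exact) / exact)
    print(k, round(exact, 4), round(approx, 4), f"{errors[-1]:+.1%}")
```

The relative error is largest for the first group and shrinks quickly for later groups, consistent with the discrepancy being concentrated in the core.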
Slide 39
Graphical Form of Bradford's Law The following graph
illustrates Bradford's law for articles on tropical and subtropical
agriculture found in Tropical Abstracts during 1970. Note that the
x-axis is the logarithm of the number of journals and the y-axis is
the number of citations. The data-points on the graph are equally
spaced on the y-axis and logarithmically spaced on the x-axis. In
preparing graphs related to the prior discussion, the x and y axes
would be reversed so as to represent the log (number of journals)
as a function of k, the number of groups of journals.
Slide 40
Lawani, S. M. Bradford's law and the literature of agriculture.
Int. Lib. Rev. 5:341-50, 1973.
Slide 41
Anomalies in Bradford's Law Notice that the empirical data
initially appears as an upward curve before it becomes linear. This
is typical of Bradford graphs. The area represented by the curving
line is usually regarded as the nuclear zone, or journal core.
Notice also that the empirical data begins to droop at about the
250th journal. The droop consistently appears among many different
sets of empirical data. One theory is that including more journals
would maintain the linearity. Another theory is that the droop is
an integral part of article scatter. Later, when we look at the
logistic distribution, we will consider a third explanation.
Slide 42
An Interpretation of Bradford's Law In 1967, Ferdinand F.
Leimkuhler, Purdue University, proposed an equation for
representing Bradford's law: $F(x) = \ln(1 + bx)/\ln(1 + b)$, where
x denotes the fraction of documents in a collection which are most
productive, 0 < x
Pareto Distribution The Pareto distribution is represented by
the formula: $f(x) = (a/b)\,[b/(b + x)]^{a-1}$, $x > b$. The Pareto
distribution is named after Vilfredo Pareto, an Italian economist,
who around 1900 determined that the majority of the world's wealth
was held by a minority of the people. This is not news to us today,
but was a revelation then. The format in which Pareto presented his
data was a bar graph sequenced in descending order of wealth, and
it was a J-shaped distribution.
Slide 49
Pareto Distribution In the 1920's, the Pareto distribution was
applied to quality control to show the frequency with which each
cause of problems had occurred. The result again was a J-shaped
curve, implying that most problems in quality result from a small
number of causes. It is valuable as a tool in determining the most
frequent causes of a particular problem and deciding where to focus
efforts for maximum effectiveness.
Slide 50
Pareto Distribution The Pareto distribution became known as the
80-20 rule: 80% of whatever may be involved is related to 20% of
the potential sources. In practice, the percentages may not
always be exactly 80/20, but there usually are "the vital few and the
trivial many." A Pareto chart combines a bar graph with a
cumulative line graph. The bar graph shows the values in the
descending order from left to right, with bar height reflecting the
frequency or impact of problems. The cumulative sum line shows the
percent contribution of all preceding bars.
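The data behind a Pareto chart (bars in descending order plus a cumulative percentage line) can be computed as in the following sketch. The problem tallies and the function name are hypothetical, chosen only to illustrate the calculation.

```python
def pareto_table(counts):
    """Sort causes by descending count and attach cumulative percentages."""
    total = sum(counts.values())
    rows, running = [], 0
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        # Cumulative share of all problems up to and including this bar.
        rows.append((cause, count, 100.0 * running / total))
    return rows

# Hypothetical tallies of problem causes.
problems = {"misshelved": 45, "damaged": 10, "mislabeled": 30, "missing": 10, "other": 5}
rows = pareto_table(problems)
for cause, count, cumulative in rows:
    print(f"{cause:12s} {count:4d} {cumulative:6.1f}%")
```

In this made-up example the two largest causes, 40% of the categories, account for 75% of the problems, which is the "vital few" pattern in miniature.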
Slide 51
Pareto Distribution Pareto analysis is typically carried out by
starting with a high-level overview and identifying the aspects
with the greatest effect. Those significant effects are then
analyzed into root causes and, if necessary, a further Pareto
analysis is applied to the sub-causes. This approach is necessary
when dealing with complex processes, enabling you to properly
prioritize and focus on the right issues.
Slide 52
Mixture of Poisson Distributions A "mixture of Poisson
distributions" is characterized by a set of parameters {n(j), m(j)},
where n(j) is the number of items in component j, and m(j) is the
"a priori expected frequency" with which an item in component j
will circulate during a given time period. This mixture leads to a
frequency distribution based on the following formulation:
$F_j(k) = n(j)\, e^{-m(j)}\, m(j)^k / k!$, k = 0, 1, 2, ...
$F(k) = \sum_{j=0}^{P} F_j(k)$
where P+1 is the number of components, and F(k) is the number of
volumes that the model predicts will circulate exactly k times in
the given time period.
Slide 53
Mixture of Poisson Distributions An equivalent formulation is
$F_j(0) = n(j)\, e^{-m(j)}$
$F_j(k) = F_j(k-1)\, m(j)/k$, k = 1, 2, ...
$F(k) = \sum_{j=0}^{P} F_j(k)$
which is useful since it avoids the problems of factorials for large
values of k.
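The two formulations can be compared directly. The sketch below assumes a two-component collection with made-up parameters (800 items expected to circulate 0.5 times and 200 items expected to circulate 4 times); the function names are hypothetical.

```python
import math

def mixture_direct(n, m, k_max):
    """F(k) = sum_j n[j] * exp(-m[j]) * m[j]**k / k!, for k = 0..k_max."""
    return [
        sum(nj * math.exp(-mj) * mj ** k / math.factorial(k) for nj, mj in zip(n, m))
        for k in range(k_max + 1)
    ]

def mixture_recursive(n, m, k_max):
    """Same F(k), using F_j(k) = F_j(k-1) * m[j] / k to avoid factorials."""
    totals = [0.0] * (k_max + 1)
    for nj, mj in zip(n, m):
        term = nj * math.exp(-mj)  # F_j(0)
        totals[0] += term
        for k in range(1, k_max + 1):
            term = term * mj / k
            totals[k] += term
    return totals

# Hypothetical parameters for illustration only.
n = [800, 200]
m = [0.5, 4.0]
direct = mixture_direct(n, m, 40)
recursive = mixture_recursive(n, m, 40)
for k in range(6):
    print(k, round(direct[k], 3), round(recursive[k], 3))
```

Summed over all k, the predicted counts add up to the total number of items, here 1000.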
Slide 54
Illustration of a Mixture of Poissons
Slide 55
Algorithm for Estimating a Mixture Let D(k) be a distribution,
where D(k) is the number of items, out of a total of N, that occur
exactly k times, k varying from 0 to L. Let {N(j), M(j)}, j = 0 to
P, be a mixture of Poisson distributions, with M(j) < M(j+1). That
is,
$F_j(0) = N(j)\, e^{-M(j)}$
$F_j(k) = F_j(k-1)\, M(j)/k$, k = 1, 2, ...
$F(k) = \sum_{j=0}^{P} F_j(k)$
Slide 56
Algorithm for Estimating a Mixture Calculate, summing over all k
for which D(k) is not suppressed (D(k) being suppressed when it is
unknown or, perhaps, when k = 0):
$N_j' = N_j + \sum_{k=0}^{L} F_j(k)\,(D(k) - F(k))/D(k)$
$R_j' = R_j + \sum_{k=0}^{L} F_j(k)\,k\,(D(k) - F(k))/D(k)$
$M_j' = R_j'/N_j'$
Iterate, replacing $\{N_j, M_j\}$ by $\{N_j', M_j'\}$, until a desired
degree of convergence is reached.
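One pass of the update can be sketched as below. This is a tentative reading of the formulas above, not a verified implementation of the original procedure: the text does not define R_j, so the sketch assumes R_j = N_j * M_j (the expected total uses in component j), and it also skips zero counts to avoid dividing by zero. The function names and test parameters are hypothetical, and no claim is made here about convergence behavior.

```python
import math

def component_counts(N, M, L):
    """F_j(k) for each component j and k = 0..L, via the recursive form."""
    table = []
    for nj, mj in zip(N, M):
        row = [nj * math.exp(-mj)]
        for k in range(1, L + 1):
            row.append(row[-1] * mj / k)
        table.append(row)
    return table

def update_step(D, N, M):
    """One pass of the update; D[k] is None where the value is suppressed.

    Assumption (not stated in the text): R_j starts as N_j * M_j.
    """
    L = len(D) - 1
    Fj = component_counts(N, M, L)
    F = [sum(row[k] for row in Fj) for k in range(L + 1)]
    new_N, new_M = [], []
    for j, row in enumerate(Fj):
        n_new, r_new = N[j], N[j] * M[j]
        for k in range(L + 1):
            if D[k] is None or D[k] == 0:
                continue  # suppressed (or zero) observations are skipped
            weight = row[k] * (D[k] - F[k]) / D[k]
            n_new += weight
            r_new += weight * k
        new_N.append(n_new)
        new_M.append(r_new / n_new)
    return new_N, new_M

# Sanity check: if the data already equal the model, nothing should change.
N0, M0 = [800.0, 200.0], [0.5, 4.0]
model = component_counts(N0, M0, 15)
D = [sum(row[k] for row in model) for k in range(16)]
N1, M1 = update_step(D, N0, M0)
print(N1, M1)
```

In practice the step would be repeated, feeding each result back in, until the changes in N_j and M_j fall below a chosen tolerance.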
Slide 57
Negative Binomial Distribution The negative binomial
distribution is represented by the following equation:
$P(k) = \frac{(k-1)!}{(r-1)!\,(k-r)!}\; p^{r}\,(1-p)^{k-r}$
The following is a graphical illustration:
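As a numerical companion to the formula, the sketch below evaluates P(k) for k = r, r+1, ... and checks that the probabilities sum to one. In this form P(k) is the probability that the r-th success occurs on trial k. The parameter values r = 3 and p = 0.4 and the function name are hypothetical.

```python
import math

def negative_binomial(k, r, p):
    """P(k) = C(k-1, r-1) * p**r * (1-p)**(k-r), for k = r, r+1, ..."""
    if k < r:
        return 0.0
    # (k-1)! / ((r-1)! (k-r)!) is the binomial coefficient C(k-1, r-1).
    return math.comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)

r, p = 3, 0.4
probs = [negative_binomial(k, r, p) for k in range(r, 200)]
for k in range(r, r + 5):
    print(k, round(negative_binomial(k, r, p), 5))
print("total:", round(sum(probs), 9))
```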
Slide 58
Logistic Distribution The behavior exhibited by empirical data
when applying the Bradford distribution (with exponential growth at
the beginning, followed by linear growth, followed by a
leveling-off) strongly suggests that the logistic distribution
might be applicable. The logistic distribution arises when there is
basic exponential growth which eventually is inhibited by the
effects of an upper limit. Such an upper limit clearly is present
for journals in the fact that there is a limit to the number of
journals that are published.
Slide 59
Derivation of Closed Form - 2 The standard closed form for the
logistic equation in the continuous case is the following:
$P(t) = K/(1 + e^{a + b t})$
The following is an illustrative graph for the logistic
distribution:
Slide 60
Illustrative Logistic Difference Growth As this shows, the
curve produced by the logistic difference equation is S-shaped.
Initially there is an exponential growth phase, but as growth gets
closer to the carrying capacity (more or less at time step 37 in
this case), the growth slows down and the population asymptotically
approaches capacity.
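The S-shape can also be reproduced from the continuous closed form P(t) = K/(1 + e^(a + b t)). The sketch below uses hypothetical parameters (K = 1000, a = 6, b = -0.3); b must be negative for P(t) to grow toward K, and the curve passes through K/2 at t = -a/b.

```python
import math

def logistic(t, K, a, b):
    """P(t) = K / (1 + exp(a + b*t)); growth toward K requires b < 0."""
    return K / (1.0 + math.exp(a + b * t))

K, a, b = 1000.0, 6.0, -0.3  # hypothetical values for illustration
midpoint = -a / b            # time at which P(t) = K / 2
curve = [logistic(t, K, a, b) for t in range(0, 61, 5)]
for t, value in zip(range(0, 61, 5), curve):
    print(t, round(value, 1))
```

Early values grow roughly exponentially, the middle of the curve is close to linear, and the later values level off just below K.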
Slide 61
Linear Growth - 1 Note that, qualitatively, there
are three main sections of the logistic curve. The first has
exponential growth and the third has asymptotic growth to the
limit. But between those two is the second segment, in which the
growth is virtually linear.
Slide 62
Linear Growth - 2 Unlimited linear growth is represented by the
equation $p_{t+1} = p_t + C$, but that, like the exponential model,
grows to exceed any identifiable limits. It is therefore valuable
to consider a limited linear growth represented, perhaps, by the
equation $p_{t+1} = p_t + ((K - p_t)/K) \cdot C$
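The difference between the two growth equations shows up quickly in a short simulation. This sketch uses hypothetical values (starting value 0, C = 50, K = 1000) and hypothetical function names. For the limited form, the gap to the limit shrinks by a factor (1 - C/K) at each step, so p_t = K - (K - p_0)(1 - C/K)^t.

```python
def unlimited_linear(p0, C, steps):
    """p_{t+1} = p_t + C."""
    values = [p0]
    for _ in range(steps):
        values.append(values[-1] + C)
    return values

def limited_linear(p0, C, K, steps):
    """p_{t+1} = p_t + ((K - p_t) / K) * C, which approaches K from below."""
    values = [p0]
    for _ in range(steps):
        p = values[-1]
        values.append(p + ((K - p) / K) * C)
    return values

p0, C, K, steps = 0.0, 50.0, 1000.0, 100  # hypothetical values
free = unlimited_linear(p0, C, steps)
capped = limited_linear(p0, C, K, steps)
for t in (0, 1, 10, 50, 100):
    print(t, free[t], round(capped[t], 2))
```

The limited form starts with increments close to C and slows as p_t nears K, while the unlimited form passes K after 20 steps and keeps going.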