SEBASTIAN DUNAT
Akademia Techniczno-Humanistyczna w Bielsku-Białej
Vocabulary analysis: A corpus based study of “analyze” clusters
and collocates in academic and spoken discourse
Key words: corpus linguistics, academic discourse, spoken discourse, clusters, collocates,
quantitative statistical surveys
Słowa klucze: językoznawstwo korpusowe, dyskurs akademicki, dyskurs mówiony, zbitki wy-
razowe, kolokaty, ilościowe badania statystyczne
Introduction
Corpora are analyzed in various ways to uncover linguistic information, such as the
frequency of certain keywords or collocations, either with a corpus search engine, if one
is available online, or with programs designed for this purpose. As Jane Sunderland states:
A corpus is a representative, substantial body of semantically collected and recorded data,
spoken or written, which is normally electronically stored as text on a PC.1 A corpus might be
labeled not only with syntactic or lexical features, but also with speaker or text features. There
are various corpora available online which constitute a great library of examples, and their data
can be analyzed with the help of linguistic tools; some of the corpora also provide
a search engine for easier data acquisition.
Corpus linguistics obtains and analyzes large quantities of data and tries to answer
research questions which may concern words or grammatical structures, the frequency
of their use, how they link with other words or structures, and their range of possible
meanings.2 According to Biber, Conrad and Reppen,3 the characteristics of corpus-based analysis are:
1 Sunderland, J., Language and Gender: An advanced resource book. London: Routledge, 2006, p. 56.
2 Ibid.
3 Biber, D., Conrad, S., Reppen, R., Corpus Linguistics: Investigating language structure and use. Cambridge: CUP, 1998, p. 4.
▪ Empirical analysis of patterns of language use in natural texts
▪ Development of a corpus, i.e. a large collection of natural texts
▪ Use of computers for a wide range of analysis techniques (e.g. automatic or interactive)
▪ Reliance on both quantitative and qualitative analysis
Method
Computational linguistics practical tasks
The development of communication between humans and computers in all areas of linguistic
analysis led Hausser4 to present several practical tasks of computational linguistics,
although the list is not complete and remains open to discussion.
▪ Indexing and retrieval in textual databases
▪ Machine translation
▪ Automatic text production
▪ Automatic text checking
▪ Automatic content analysis
▪ Automatic tutoring
▪ Automatic dialog and information systems
The first of the practical tasks concerns textual databases, which consist of various kinds of
electronically stored data (texts, sentences, word frequencies). Their ease of access makes such
databases a great tool for researchers interested in any type of text or passage relevant to their
specific analysis. The biggest freely available database is the World Wide Web, but its unstructured
form might pose some difficulties in obtaining precise data.5 Second, machine translation
has remarkable potential for making research easier through the automatic or semi-automatic
translation of research articles from around the world. Third, precise linguistic knowledge
might improve automatic text production and help to create various
forms of highly flexible and interactive systems.6 It might be applied, for example, to updating
maintenance manuals or product descriptions for new lines of products. Automatic text
checking, the fourth of the tasks, serves in a variety of computer applications, for example
simple spelling auto-correction. Moreover, there are word form recognition programs or
4 Hausser, R., Computational linguistics: Human-Computer Communication in Natural Language (3rd Ed.). Springer, 2014, p. 30.
5 Ibid.
6 Ibid.
syntax error checking applications based on syntactic parsers.7 The fifth practical task,
automatic content analysis, may provide summaries of literature, even in specialized fields such
as science or economics. Automatic content analysis is a precondition for concept-based
indexing, needed for accurate retrieval from textual databases, as well as for adequate machine
translation.8 The sixth of the above-mentioned tasks, automatic tutoring, can provide interactive
online systems for foreign language teaching and practice. Furthermore, such systems could
provide data on students' errors or the amount of time needed to complete various exercises.9
Automatic tutoring opens a new field of research in which textbooks are replaced by an
electronic medium that assists in teaching and learning.10 The last of the tasks uses information
systems to provide automatic information services, for example bus and train schedules, tax
consulting, or medical databases.11
Discourse definition
According to Mary Bucholtz, there are two definitions of discourse. The first, formal
definition derives from the organization of linguistic units and is similar to the definitions
of morphology and syntax: discourse is the linguistic level in which sentences are combined into
larger units.12 The alternative definition, on the other hand, focuses on discourse as language
used in context: language as it is put to use in social situations, not the more idealized and
abstracted linguistic forms that are the central concern of much linguistic theory.13
On the basis of these definitions, Bucholtz defines discourse analysis as
a collection of perspectives on situated language use that involve a general shared theoretical
orientation and a broadly methodological approach.14
Present Study
The aim of this study was to examine the variety of collocates and clusters used with the
verb analyze in two discourses, academic and spoken, between the years 2010 and 2015. The
material used for the research comes from the Corpus of Contemporary American English
(COCA), available online. Three hundred examples were collected in total, 150 for each of the
7 Ibid.
8 Ibid., p. 31.
9 Ibid.
10 Ibid.
11 Ibid.
12 Bucholtz, M., “Theories of Discourse as Theories of Gender: Discourse Analysis in Language and Gender Studies.” The handbook of Language and Gender. Blackwell Publishing Ltd., 2003, p. 43.
13 Ibid.
14 Ibid., p. 45.
fields, categorized by the corpus as spoken discourse and academic
discourse. The study described the corpus data and classified it in terms of the following
selected categories:
▪ Year
▪ 2-word cluster
▪ 3-word cluster
▪ Collocates
The researcher would like to check whether:
▪ The difference in the quantitative use of the collocates of the studied verb will be evident
in the researched discourses, with a significant predominance of one of the collocate types
(in the number of tokens used).
▪ The variation in the use of tokens with a classified type of cluster will be evident for
the presented years and fields.
▪ There will be a quantitative difference in the diachronic distribution of sentences containing
the studied types of clusters or collocates, with a predominance of
one cluster/collocate type in at least one of the studied years.
Additionally, Pearson’s chi-squared tests are carried out to confirm the following
hypotheses:
▪ One of the clustered parts of speech used with the tested verb will have a higher frequency
of use in at least one of the researched fields.
▪ The distribution of specific collocates will be greater for at least one of the studied fields.
Moreover, this will be supported by a statistically significant result (p-value less than 0.05).
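As a minimal sketch of how such a test could be run (not necessarily the procedure used in this study), the following Python snippet computes a Pearson chi-squared goodness-of-fit statistic for one collocate across the two subcorpora. It assumes that, because both subcorpora contain the same number of examples (150 each), the collocate would be expected to occur equally often in each under the null hypothesis. The function name is hypothetical, and the example counts are the frequencies of and given in Table 3, used purely for illustration.

```python
import math


def chi_squared_two_groups(observed_a, observed_b):
    """Pearson chi-squared goodness-of-fit test for two equally sized groups.

    Returns (statistic, p_value). With two categories there is one degree
    of freedom, for which the p-value has the closed form erfc(sqrt(x / 2)).
    """
    total = observed_a + observed_b
    # Assumption: equal expected frequency in both subcorpora.
    expected = total / 2
    statistic = ((observed_a - expected) ** 2
                 + (observed_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Illustrative counts: collocate "and" on the spoken (58) and academic (96) lists.
statistic, p_value = chi_squared_two_groups(58, 96)
print(f"chi2 = {statistic:.3f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```

Only the standard library is used here so that the arithmetic stays visible; a statistics package would normally be preferred for tables with more than two cells.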
Research
Corpus Description
The present research divided the corpus into two discourse fields: spoken and academic.
Additionally, it divided the data diachronically into six years; each combination of field
and year has the same number of examples (25). Graph 1 visualizes the distribution of the corpus
data.
Graph 1 Corpus data distribution
The following tables present the corpus clusters and collocates obtained with the AntConc
software for the purpose of this research.
Table 1 2-word clusters for spoken and academic discourse
Rank
Spoken Academic
2010 2011 2010 2011
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 5 analyze the 10 analyze the 4 analyze the 11 analyze the
2 2 analyze - critically 2 analyze these 3 analyze their 2 analyze a
3 2 analyze history 2 analyze. ! rep 2 analyze data 2 analyze and
4 2 analyze what 1 analyze all 1 analyze and 2 analyze menu
5 2 analyze. (begin 1 analyze for 1 analyze arguments 2 analyze them
6 1 analyze a 1 analyze old 1 analyze as 1 analyze duration
7 1 analyze and 1 analyze president 1 analyze commonalities 1 analyze family
8 1 analyze both 2 analyze that 1 analyze concrete 1 analyze pending
9 1 analyze her 1 analyze today 1 analyze differences 1 analyze specific
10 1 analyze his 1 analyze volcanic 1 analyze excavated 1 analyze through
11 1 analyze it 1 analyze wedding 1 analyze how 1 analyze variation
12 1 analyze seafood 1 analyze. ! bill 1 analyze human 1 analyze your
13 1 analyze that 1 analyze. joy 1 analyze if 1 analyze. some
14 1 analyze their 1 analyze its
15 1 analyze these 1 analyze public
16 1 analyze this 1 analyze such
17 1 analyze those 1 analyze validly
18 1 analyze with 1 analyze why
19 1 analyze, fox 1 analyze, diagram
Rank
2012 2013 2012 2013
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 10 analyze the 7 analyze it 8 analyze the 5 analyze the
2 3 analyze it 4 analyze the 3 analyze their 2 analyze how
3 1 analyze and 2 analyze and 3 analyze and 1 analyze a
4 1 analyze attorney 2 analyze that 2 analyze data 1 analyze and
5 1 analyze brains 2 analyze them 1 analyze bigger 1 analyze as
6 1 analyze by 2 analyze what 1 analyze complete 1 analyze canadian
7 1 analyze from 1 analyze - dee 1 analyze image 1 analyze each
8 1 analyze his 1 analyze details 1 analyze metrics 1 analyze factors
9 1 analyze what 1 analyze for 1 analyze moderators 1 analyze hiv
10 1 analyze why 1 analyze how 1 analyze my 1 analyze laboratory
11 1 analyze yourself 1 analyze their 1 analyze operations 1 analyze meaningful
12 1 analyze, attorneys 1 analyze your 1 analyze postcard 1 analyze multiple
13 1 analyze, we 1 analyze student 1 analyze performance
14 1 analyze. pinsky 1 analyze specific
15 1 analyze text
16 1 analyze their
17 1 analyze them
18 1 analyze what
19 1 analyze, explain
20 1 analyze. beyond
Rank
2014 2015 2014 2015
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 6 analyze the 7 analyze the 10 analyze the 8 analyze the
2 2 analyze, fox 2 analyze it 1 analyze a 2 analyze data
3 1 analyze (ph 2 analyze this 1 analyze and 2 analyze teachers
4 1 analyze all 2 analyze what 1 analyze data 2 analyze and
5 1 analyze both 1 analyze -- rachel 1 analyze each 1 analyze classroom
6 1 analyze doha 1 analyze all 1 analyze hi 1 analyze content
7 1 analyze gregory 1 analyze anything 1 analyze how 1 analyze film
8 1 analyze him 1 analyze exactly 1 analyze possible 1 analyze hypotheses
9 1 analyze in 1 analyze his 1 analyze research 1 analyze information
10 1 analyze it 1 analyze human 1 analyze risk 1 analyze relationships
11 1 analyze that 1 analyze my 1 analyze students 1 analyze surveillance
12 1 analyze this 1 analyze probabilities 1 analyze their 1 analyze their
13 1 analyze to 1 analyze these 1 analyze treatment 1 analyze this
14 1 analyze what 1 analyze where 1 analyze, figured 1 analyze unknown
15 1 analyze when 1 analyze why 1 analyze, interpret 1 analyze, critically
16 1 analyze where 1 analyze, experts 1 analyze; at
17 1 analyze whether 1 analyze. (begin
18 1 analyze, does 1 analyze. army
19 1 analyze. caution
Table 1 presents 2-word clusters for spoken and academic discourse in the years between
2010 and 2015. All of the most frequent clusters combine analyze with the definite article (the),
except for 2013 in the spoken discourse, where analyze with the pronoun it is the most frequent
2-word cluster. The indefinite article (a) occurs only once in spoken discourse (2010),
whereas it appears in academic discourse in the years 2011, 2013 and 2014. The pronouns
his/her occur in spoken discourse in 2010, 2012 and 2015, and him occurs in 2014. Analyze it
occurs in spoken discourse in 2010, 2012, 2013, 2014 and 2015. Moreover, demonstratives
are common in 2-word clusters with analyze. They occur more frequently in spoken
discourse; in academic discourse only this is listed, in 2015. All four demonstratives are
visible in the 2-word spoken discourse clusters in 2010. The years 2011, 2014 and 2015 use two
demonstratives each. The year 2012 has none, and in 2013 only the demonstrative that is used.
Analyze and is visible on both the spoken and the academic lists, but it occurs more frequently
on the academic list. The years 2011, 2014 and 2015 of the spoken discourse do not use and in any
of the 2-word clusters. Furthermore, the pronouns used in the 2-word clusters in spoken discourse
are: their in 2010, yourself and we in 2012, them, their and your in 2013, him in 2014, and my in
2015, while for the academic discourse list the following occur: in 2010 (their, its), in 2011
(them, your), in 2012 (their, my), in 2013 (their, them), in 2014 and in 2015 (their). Spoken
discourse used wh-words more often than academic discourse. What, when, where and whether
are used in 2014; what, where and why in 2015; what and why in 2012; what in 2010 and
2013. In academic discourse only what and why are used, in 2013 and 2010 respectively. By,
from, for, with, in and to are on the list of spoken discourse and none of them occurs on the
academic list. As, if, how and at are more typical of academic discourse clusters; with the
exception of how, which occurs on the 2013 spoken list, none of them is present on the spoken
discourse lists in the researched years. Analyze each occurs on the academic list (2013, 2014),
while analyze both occurs on the spoken discourse list (2010, 2014). Analyze data (2010,
2012, 2014, 2015) is visible only on the academic list, and analyze all (2011, 2014, 2015) can be
seen only on the spoken discourse list. Some and such are present on the academic list, in 2011
and 2010 respectively; they do not occur on the spoken discourse list.
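The cluster lists discussed above were produced with AntConc. The same idea can be sketched in a few lines of Python: collect every sequence of n tokens that begins with the node word and count how often each sequence occurs. The function name, the tokenization rule and the sample sentences below are assumptions made for illustration only. In particular, this sketch discards punctuation, whereas the clusters in Table 1 retain it (e.g. analyze, fox), so the counts it produces would not match the tables exactly.

```python
import re
from collections import Counter


def clusters(text, node="analyze", size=2):
    """Count clusters of `size` tokens that start with the node word."""
    # Simplified tokenization: lowercase words, digits and apostrophes only.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node and i + size <= len(tokens):
            counts[" ".join(tokens[i:i + size])] += 1
    return counts


# Invented sample text, not taken from COCA.
sample = ("We analyze the data and then analyze the results. "
          "They analyze it carefully before they analyze the data again.")

for cluster, freq in clusters(sample, size=2).most_common():
    print(freq, cluster)
```

Calling clusters(sample, size=3) yields the corresponding 3-word clusters.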
Table 2 3-word clusters for spoken and academic discourse
Rank
Spoken Academic
2010 2011 2010 2011
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 2 analyze - critically analyze 2 analyze the week 1 analyze and solve 2 analyze menu) to
2 2 analyze history from 2 analyze the weeks 1 analyze arguments and 2 analyze the recorded
3 2 analyze. (begin-video 2 analyze these things 1 analyze as many 1 analyze a 2-stage
4 1 analyze a political 1 analyze all this 1 analyze commonalities in 1 analyze a data
5 1 analyze and dissect 1 analyze for me 1 analyze concrete historical 1 analyze and implement
6 1 analyze both sides 1 analyze old data 1 analyze data regarding 1 analyze and organize
7 1 analyze her appearances 1 analyze president obama 1 analyze data. results 1 analyze duration. somewhat
8 1 analyze his diet 1 analyze that speech 1 analyze differences between 1 analyze family homelessness
9 1 analyze it to 1 analyze the contenders 1 analyze excavated artifacts 1 analyze pending tax
10 1 analyze seafood samples 1 analyze the cost 1 analyze how housing 1 analyze specific gas
11 1 analyze that car 1 analyze the entire 1 analyze human remains 1 analyze the case
12 1 analyze the -- the 1 analyze the records 1 analyze if and 1 analyze the country
13 1 analyze the motivation 1 analyze the situation 1 analyze its broader 1 analyze the magazines
14 1 analyze the overnight 1 analyze the very 1 analyze public spheres 1 analyze the music
15 1 analyze the press 1 analyze today's 1 analyze such hybridity 1 analyze the narrative
16 1 analyze the weeks 1 analyze volcanic moon 1 analyze the data 1 analyze the pairwise
17 1 analyze their expressions 1 analyze wedding rituals 1 analyze the implications 1 analyze the planning
18 1 analyze these numbers 1 analyze. ! bill-maher 1 analyze the interaction 1 analyze the results
19 1 analyze this flow 1 analyze. ! rep-nancy 1 analyze the transcript 1 analyze the value
20 1 analyze those reports 1 analyze. ! rep-peter 1 analyze their business 1 analyze them as
21 1 analyze what the 1 analyze. joy-behar 1 analyze their long 1 analyze them to
22 1 analyze what's 1 analyze. that would 1 analyze their own 1 analyze through perform
23 1 analyze with karl 1 analyze validly. in 1 analyze variation among
24 1 analyze, fox news 1 analyze why, after 1 analyze your motivation
25 1 analyze, diagram, and 1 analyze. some brief
Rank
2012 2013 2012 2013
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 5 analyze the week 2 analyze it. so 1 analyze and improve 2 analyze the link
2 1 analyze and investigate 1 analyze - dee dee 1 analyze and interpret 1 analyze a large
3 1 analyze attorney general 1 analyze and debate 1 analyze bigger and 1 analyze and evaluate
4 1 analyze brains in 1 analyze and write 1 analyze complete genomes 1 analyze as well
5 1 analyze by the 1 analyze details of 1 analyze data and 1 analyze canadian identity
6 1 analyze from the 1 analyze for you 1 analyze data; and 1 analyze each clause
7 1 analyze his disturbed 1 analyze how to 1 analyze image-space 1 analyze factors that
8 1 analyze it because 1 analyze it solely 1 analyze metrics, update 1 analyze hiv/aids
9 1 analyze it. (begin 1 analyze it with 1 analyze moderators for 1 analyze how stereotypes
10 1 analyze it? do 1 analyze it, not 1 analyze my data 1 analyze how their
11 1 analyze the legality 1 analyze it, to 1 analyze operations, track 1 analyze laboratory and
12 1 analyze the loose 1 analyze it. a 1 analyze postcard representation 1 analyze meaningful data
13 1 analyze the romney 1 analyze that at 1 analyze student learning 1 analyze multiple data
14 1 analyze the tapes 1 analyze that correctly 1 analyze the bacteriology 1 analyze performance within
15 1 analyze the whole 1 analyze the mission 1 analyze the data 1 analyze specific risk
16 1 analyze what these 1 analyze the political 1 analyze the difference 1 analyze text structures
17 1 analyze why they 1 analyze the situation 1 analyze the openurls 1 analyze the data
18 1 analyze yourself and 1 analyze the vocals 1 analyze the orf 1 analyze the resulting
19 1 analyze, attorneys kimberly 1 analyze their components 1 analyze the production 1 analyze the time
20 1 analyze, we create 1 analyze them with 1 analyze the statistical 1 analyze their data
21 1 analyze. pinsky# well 1 analyze them. but 1 analyze the types 1 analyze them, has
22 1 analyze what you 1 analyze their associations 1 analyze what these
23 1 analyze what's 1 analyze their classroom 1 analyze, explain, or
24 1 analyze your 23andme 1 analyze their own 1 analyze. beyond the
25 1 analyze, and generate
Rank
2014 2015 2014 2015
Freq. Cluster Freq. Cluster Freq. Cluster Freq. Cluster
1 2 analyze, fox news 4 analyze the week 1 analyze a column 2 analyze the data
2 1 analyze (ph) my 2 analyze this, may 1 analyze and evaluate 1 analyze and interpret
3 1 analyze all the 1 analyze -- rachel campos 1 analyze data to 1 analyze classroom data
4 1 analyze both jewel 1 analyze all of 1 analyze each participant 1 analyze data and
5 1 analyze doha's 1 analyze anything. you 1 analyze hi in 1 analyze data, read
6 1 analyze gregory, but 1 analyze exactly what 1 analyze how and 1 analyze film music
7 1 analyze him or 1 analyze his donor 1 analyze possible teacher 1 analyze hypotheses six
8 1 analyze in terms 1 analyze human behavior 1 analyze research studies 1 analyze information to
9 1 analyze it on 1 analyze it all 1 analyze risk factors 1 analyze relationships between
10 1 analyze that and 1 analyze it faster 1 analyze students' self 1 analyze surveillance data
11 1 analyze the candidates 1 analyze my interview 1 analyze the alignment 1 analyze teachers' pck
12 1 analyze the flavoring 1 analyze probabilities and 1 analyze the app 1 analyze teachers' use
13 1 analyze the game 1 analyze the causes 1 analyze the apps 1 analyze the first
14 1 analyze the media 1 analyze the news 1 analyze the data 1 analyze the images
15 1 analyze the situation 1 analyze the nominee 1 analyze the demographic 1 analyze the melody
16 1 analyze the week 1 analyze these bills 1 analyze the gain 1 analyze the performances
17 1 analyze this story 1 analyze what 11 1 analyze the literary 1 analyze the sources
18 1 analyze to further 1 analyze what he 1 analyze the records 1 analyze the word
19 1 analyze what happened 1 analyze where the 1 analyze the svs 1 analyze their e
20 1 analyze when the 1 analyze why, why 1 analyze the tool 1 analyze this dataset
21 1 analyze where the 1 analyze, experts everywhere 1 analyze their own 1 analyze unknown science
22 1 analyze whether they 1 analyze. (begin-video 1 analyze treatment effects 1 analyze, and report
23 1 analyze, does this 1 analyze. army sergeant 1 analyze, figured most 1 analyze, critically examine
24 1 analyze. caution, you 1 analyze, interpret, and 1 analyze; at times
25 1 analyze; at times
The most frequent 3-word clusters for spoken and academic discourse are presented in Table
2. The most frequent cluster in spoken discourse is analyze the week/s. It is
visible on the lists of all the researched years except 2013, and it is used 15 times in total
throughout the corpus. Analyze the situation occurs three times on the lists of spoken discourse
(2011, 2013 and 2014). For academic discourse the most frequent 3-word cluster is analyze the
data, which appears on the lists for five of the six years. It is followed by analyze the link
(2 instances) and analyze the time (1 instance). Furthermore, the researcher
would like to underline the differences between the 3-word clusters concerning the
use of determiners in both discourses. In spoken discourse: the motivation, the overnight,
the press, the weeks (2010); the week, the weeks, the contenders, the cost, the entire, the
records, the situation, the very (2011); the week, the loose, the romney, the tapes, the whole
(2012); the mission, the political, the situation, the vocals (2013); the flavoring, the game, the
media, the situation, the week (2014); the causes, the news, the nominee (2015) and a political
(2010). For academic discourse the 3-word clusters with determiners are: the
implications, the interaction, the transcript (2010); the case, the country, the magazines, the
music, the narrative, the pairwise, the planning, the results, the value (2011); the bacteriology,
the data, the difference, the openurls, the orf, the production, the statistical, the types
(2012); the link, the data, the resulting, the time (2013); the alignment, the app, the apps, the
data, the demographic, the gain, the literary, the records, the svs, the tool (2014); the first, the
images, the melody, the performances, the sources, the word (2015) and a 2-stage, a data
(2011); a large (2013); a column (2014). Note that the only two clusters which are
present on both lists are analyze the records and analyze what these.
The clusters used in the corpus might suggest that spoken discourse uses more wh-clusters:
what these, why they (2012); how to (2013); what happened, when the, where
the, whether they (2014); where the (2015). On the other hand, academic discourse uses more
how clusters: how housing (2010); how stereotypes, how their, what these (2013); how and
(2014).
Clusters of analyze and followed by another verb differ between the two subcorpora. For
spoken discourse, one may find: analyze and dissect (2010); analyze and investigate (2012);
analyze and debate, analyze and write (2013), while academic discourse uses: analyze and solve
(2010); analyze and implement, analyze and organize (2011); analyze and improve, analyze and
interpret (2012); analyze and evaluate (2013); analyze and evaluate (2014); analyze and
interpret, analyze and report (2015).
Furthermore, academic discourse uses the pronoun their more frequently than spoken
discourse; one may find the following examples: their business, their long, their own (2010);
them as, them to, your motivation (2011); my data, their associations, their classroom, their
own (2012); how their, their data (2013); their own (2014). Spoken discourse, on the
other hand, uses other pronouns: her appearances, his diet (2010); for you (2011);
their components, them with, it solely, it with, it not, it to (2013); it on (2014); anything you, it
all, his donor, it faster (2015).
Among the various examples of 3-word clusters which could be of interest on account of
the nouns they combine with, those which use demonstratives should be underlined.
In the spoken discourse: history from, seafood samples, both sides, this flow, those reports
(2010); these things, all this, old data, volcanic moon, wedding rituals (2011); attorney general,
brains in (2012); details of (2013); all the, both jewel, in terms (2014); all of, human
behavior, probabilities and, these bills, experts everywhere, and army sergeant (2015) are
used. Correspondingly, in academic discourse: commonalities in, differences between, excavated
artifacts, human remains (2010); family homelessness, pending tax, variation among, specific
gas (2011); complete genomes, image-space, moderators for, postcard representation, student
learning (2012); as well, Canadian identity, each clause, factors that, meaningful data, multiple
data, performance within, specific risk, text structures (2013); data to, each participant,
possible teacher, research studies, risk factors, students' self (2014); classroom data, film
music, information to, relationships between, surveillance data, teachers' pck, unknown
science, and this dataset (2015) are listed. The researcher may conclude that the demonstratives
this and those are more frequently used in the spoken discourse, while the demonstrative that is
more frequent in the academic discourse.
Table 3 First twenty academic and spoken collocates of the verb analyze (window span of 5L and 5R)
Rank
Spoken Academic
Freq. Freq. left Freq. right Statistic Collocate Freq. Freq. left Freq. right Statistic Collocate
1 89 78 11 3.98188 to 118 104 14 4.09570 to
2 77 13 64 3.57089 the 96 46 50 3.68879 and
3 58 30 28 3.58809 and 94 25 69 3.46859 the
4 33 26 7 3.78556 we 33 4 29 4.17859 data
5 29 18 11 3.47177 you 28 8 20 3.02041 of
6 29 6 23 3.66732 it 22 8 14 3.03969 in
7 26 5 21 3.31423 s 18 6 12 3.33515 for
8 25 24 1 4.12199 will 16 15 1 4.29647 used
9 22 2 20 4.46808 news 15 3 12 4.01072 their
10 17 9 8 3.12308 i 13 9 4 3.48234 students
11 17 8 9 2.85100 a 12 4 8 3.18629 as
12 16 15 1 4.44161 here 12 3 9 2.54583 a
13 15 3 12 3.67043 what 11 10 1 3.34086 were
14 15 5 10 2.67043 that 11 10 1 4.06076 can
15 15 7 8 3.02657 of 10 8 2 3.92325 was
16 14 2 12 4.57089 week 10 8 2 3.01072 that
17 14 7 7 3.71291 they 9 5 4 3.46639 we
18 14 13 1 4.76354 brooks 9 5 4 4.05136 use
19 13 8 5 2.94041 is 9 4 5 3.68879 how
20 13 7 6 3.03771 in 8 6 2 3.78190 they
The most frequent collocate in both subcorpora is to, with 89 instances in spoken discourse
and 118 in academic discourse. And is second on the academic list, while it is third on the spoken
list, with 96 and 58 examples respectively. The, third on the academic list with 94 instances, is
second for spoken discourse with 77 instances. The determiner a is more frequent on the spoken
list than on the academic one, with 17 and 12 instances respectively. Data, fourth on the academic
list with 33 examples, does not occur on the spoken discourse list. It is visible
that spoken discourse uses we and they more frequently, with 33 and 14 examples. Their is not
present on the spoken discourse list; it is in 9th position on the academic list with a frequency of
15. The same holds for use and how, with 9 examples each on the academic list. The present
form of the verb to be (is) occurs on the spoken list, while its past counterparts (was, were) occur
on the academic one. The first is in 19th position (13 examples), the second in 15th position
(10 examples), and the third in 13th (11 examples). The demonstrative that occurs on both lists,
with 15 examples for spoken and 10 examples for academic discourse (14th and 16th
position respectively). The last words shared by both lists are the prepositions in and of. The
first is 6th on the academic list and 20th on the spoken list, with 22 and 13 examples
respectively. The second is listed as 15th and 5th on the spoken and academic lists respectively,
with frequencies of 15 and 28. News and week are listed on the spoken discourse list only. The
first is in 9th position (22), and the second is 16th (14). Used and students, listed as 8th and
10th, are present only on the academic list. It is worth noting that analyze rarely collocates
with modal verbs. Only will and can are visible on the lists, the first for spoken and the second
for academic discourse. Will, in eighth position, has a frequency of 25. Spoken discourse uses
the personal pronouns I, you and it, with frequencies of 17, 29 and 29 instances respectively,
as visible in Table 3. In the same category of pronoun collocates, academic discourse uses the
possessive pronoun their, which is listed as 9th and has a frequency of 15 instances. Lastly,
the spoken discourse list includes the wh-word what in 13th position with a frequency of 15
examples in the corpus; it is not present among the first 20 most frequent collocates on the
academic list.
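The left and right frequencies reported in Table 3 rest on a simple windowing procedure, which can be sketched as follows: for every occurrence of the node word, count the tokens that fall within five positions to its left and five to its right. The function name, the tokenization rule and the sample sentence are assumptions made for illustration. The sketch only reproduces the Freq., Freq. left and Freq. right columns; it does not attempt to compute the Statistic column, whose formula is not stated in this paper.

```python
import re
from collections import Counter


def collocates(text, node="analyze", span=5):
    """Count words within `span` tokens to the left and right of the node word."""
    # Simplified tokenization: lowercase words, digits and apostrophes only.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])
    return left, right


# Invented sample sentence, not taken from COCA.
sample = ("We will analyze the data here, and students can analyze "
          "the results to learn.")
left, right = collocates(sample)

# Print collocate, total frequency, left frequency, right frequency.
for word in sorted(set(left) | set(right), key=lambda w: -(left[w] + right[w])):
    print(word, left[word] + right[word], left[word], right[word])
```

A window is simply truncated when the node word occurs closer than five tokens to the beginning or end of the text.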
Table 4 Parts of speech used in the 2-word clusters
Rank
Spoken Academic
2010 2011 2012 2010 2011 2012
Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq
1 dt 7 Nn 6 pp 4 nns 4 nn 4 nns 4
2 pp 4 dt 3 in 2 nn 3 dt 2 nn 3
3 nn 3 jj 2 nns 2 wrb 2 pp 2 pp 2
4 in 1 in 1 np 2 rb 2 in 1 dt 1
5 that 1 that 1 dt 1 pp 2 jj 1 jjr 1
6 rb 1 rb 1 wp 1 jj 2 rb 1 np 1
7 wp 1 wrb 1 vvn 1 np 1 vv 1
8 np 1 nn 1 np 1 vvg 1
9 vv 1 in 1
10 dt 1
Rank
2013 2014 2015 2013 2014 2015
Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq
1 pp 4 nn 5 dt 3 nn 5 dt 3 nn 6
2 dt 2 dt 4 pp 3 dt 3 nn 3 nns 4
3 in 1 in 3 nn 3 jj 2 nns 2 dt 2
4 that 1 pp 2 rb 2 pp 2 in 1 pp 1
5 wp 1 to 2 wrb 2 rb 2 jj 1 rb 1
6 wrb 1 wrb 2 nns 2 np 2 pp 1 np 1
7 nn 1 that 1 jj 1 wp 1 uh 1 vv 1
8 nns 1 rb 1 wp 1 wrb 1 wrb 1
9 np 1 wp 1 vv 1 nns 1 np 1
10 vv 1 vvz 1 vv 1 vvd 1
As Table 4 shows, determiners (dt) are more frequently used in spoken discourse: 7
(2010), 3 (2011), 1 (2012), 2 (2013), 4 (2014), 3 (2015), while singular nouns (nn): 3 (2010),
4 (2011), 3 (2012), 5 (2013), 3 (2014), 6 (2015) and plural nouns (nns): 4 (2010), 4 (2012),
1 (2013), 2 (2014), 4 (2015) are used more often in academic discourse. Interestingly,
no plural nouns occurred in the 2011 part of the academic subcorpus. Proper nouns (np) do
not occur in the 2-word clusters in spoken discourse. They are present on the academic list
with frequencies of 1 (2010), 1 (2012), 2 (2013), 1 (2014), 1 (2015). Furthermore, personal
pronouns (pp) are present and more frequent on the list of spoken discourse. Wh-adverbs (wrb)
occur on both lists (4 instances each), but in different years: for spoken in 2012 (1), 2013
(1) and 2014 (2), and for academic in 2010 (2), 2013 (1) and 2014 (1). On the other hand,
wh-pronouns (wp) are used more frequently on the spoken list. Their frequency in 2010, 2012,
2013, 2014 and 2015 is one example per year. For academic discourse wh-pronouns are
visible on the list for 2013 and 2014, with 2 and 1 examples respectively. Adjectives (jj)
and comparatives (jjr) are more common in academic discourse: 2 (2010), 1 (2011), 1 (2012),
2 (2013), 1 (2014). Their distribution in spoken discourse is 2 (2011), 1 (2015). Furthermore,
spoken discourse uses prepositions (in) more often: 1 (2010), 1 (2011), 2 (2012), 1 (2013),
3 (2014). In academic discourse, their use is distributed as follows: 1 (2010), 1 (2011),
1 (2014). Base-form verbs (vv), past tense verbs (vvd), participle/gerund verbs (vvg), and
present 3rd person singular verbs (vvz) are not common but are present on both lists. There are
4 instances of their use in spoken discourse and 6 instances in academic discourse. It is worth
noting that spoken discourse uses only the base form and the 3rd person singular form of verbs,
while academic discourse uses the base form, past tense, past participle, and gerund/participle.
Table 5 Parts of speech used in the 3-word clusters
Rank
Spoken Academic
2010 2011 2012 2010 2011 2012
Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq
1 dt 13 nn 14 dt 9 nns 11 nn 15 nn 12
2 nn 10 dt 12 pp 7 jj 7 dt 13 nns 11
3 nns 7 nns 6 nn 7 nn 7 to 4 dt 8
4 pp 4 that 2 in 5 in 5 jj 3 cc 6
5 in 2 jj 2 nns 3 dt 4 pp 3 pp 4
6 jj 2 in 1 vv 2 pp 4 rb 3 vv 3
7 to 2 md 1 cc 2 rb 4 nns 3 jj 3
8 vv 2 pdt 1 jj 2 cc 2 vv 2 in 1
9 wp 2 pos 1 rb 2 wrb 2 cc 2 jjr 1
10 cc 1 pp 1 vvp 2 np 2 cd 1
11 np 1 rb 1 wp 1 vv 1 in 1
12 pos 1 wrb 1 jjr 1 vvg 1
13 rb 1 vvg 1 vvn 1
14 that 1 vvn 1 vvp 1
Rank
2013 2014 2015 2013 2014 2015
Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq Keyword Freq
1 pp 12 dt 12 nn 10 dt 8 dt 12 nn 14
2 in 7 nn 11 dt 9 nn 8 nn 10 nns 13
3 dt 5 pp 5 pp 6 nns 7 nns 10 dt 8
4 rb 5 in 4 nns 5 jj 5 jj 4 cc 3
5 nn 5 nns 3 wp 3 cc 3 in 3 cd 3
6 that 4 cc 2 wrb 3 pp 3 vv 2 jj 3
7 to 4 that 2 rb 2 rb 3 cc 2 vv 2
8 nns 3 to 2 cc 1 vv 2 to 2 in 2
9 vv 2 wrb 2 cd 1 in 2 np 2 pos 2
10 cc 2 jjr 1 in 1 wrb 2 jjs 1 sym 2
11 wp 2 pdt 1 jj 1 that 1 pos 1 to 2
12 jj 1 pos 1 md 1 sym 1 pp 1 pp 1
13 pos 1 rb 1 rbr 1 wp 1 uh 1 rb 1
14 wrb 1 rp 1 np 1 np 1 wrb 1 vvd 1
15 wp 1 vvg 1 vvd 1
16 np 1 vvz 1
17 vvn 1 vhz 1
18 vvz 1
Table 5 presents the parts of speech used in the 3-word clusters of the researched material.
Academic discourse uses nouns (nn, nns), adjectives (jj) and determiners (dt) most frequently.
Nouns, the most frequent part of speech in academic discourse, occur with the following
frequencies: 18 (2010), 18 (2011), 23 (2012), 15 (2013), 20 (2014) and 27 (2015). Adjectives
are listed with frequencies of 7 (2010), 3 (2011), 3 (2012), 5 (2013), 4 (2014) and 3 (2015),
while determiners have frequencies of 4 (2010), 13 (2011), 8 (2012), 8 (2013), 12 (2014) and
8 (2015). The most frequent parts of speech in spoken discourse are determiners (dt), nouns
(nn, nns) and personal pronouns (pp). Determiners have frequencies of 13 (2010), 12 (2011),
9 (2012), 5 (2013), 12 (2014) and 9 (2015). Nouns are listed with the following numbers of
examples: 17 (2010), 20 (2011), 10 (2012), 8 (2013), 14 (2014) and 15 (2015). Personal
pronouns are used 4, 1, 7, 12, 5 and 6 times, chronologically across the researched years. In
academic discourse personal pronouns are used less frequently: chronologically 4, 3, 4, 3, 1
and 1 times. Adjectives in spoken discourse are scarce but fairly constant: chronologically 2,
2, 2, 1, 1 and 1. It is worth noting that in 2014 there is one example of a comparative
adjective (jjr) in spoken discourse, while in academic discourse comparatives are used in
2010 (1) and 2012 (1), and one superlative (jjs) in 2014. Prepositions and subordinating
conjunctions (in) are more frequent in academic discourse: 5 (2010), 1 (2011), 1 (2012),
2 (2013), 3 (2014) and 2 (2015); for spoken discourse there are only 1 (2011), 7 (2013),
4 (2014) and 1 (2015). On the other hand, only spoken discourse uses the complementizer that:
1 (2011), 2 (2012), 4 (2013) and 2 (2014) instances. Next, the preposition to is used in
spoken discourse 2 (2010), 4 (2013) and 2 (2014) times, and in academic discourse only twice,
in 2014. Coordinating conjunctions (cc) are present on both lists. Spoken discourse uses them
1, 0, 2, 2, 2 and 1 times, chronologically across the researched years, while academic
discourse lists them more frequently: 2, 2, 6, 3, 2 and 3 instances, chronologically.
Wh-pronouns (wp) and wh-adverbs (wrb) are used more frequently in spoken discourse. The
frequencies for wh-pronouns are 2 (2010), 1 (2012), 2 (2013) and 1 (2014); in academic
discourse there is only one instance, in 2013. Wh-adverbs have the following frequencies on
the spoken list: 1 (2012), 1 (2013), 2 (2014) and 3 (2015); and on the academic list:
2 (2010), 2 (2013) and 1 (2014). Other adverbs (rb) are more frequently used in academic
discourse, with 4 (2010), 3 (2011), 3 (2013) and 1 (2015), against 1 (2010), 1 (2011),
2 (2012) and 5 (2013) in spoken discourse. Next, verb forms (vv, vvg, vvd, vvz, vvn, vvp,
vhz) are used more frequently in academic discourse, with 4 (2010), 5 (2011), 3 (2012),
5 (2013), 3 (2014) and 3 (2015), than in spoken discourse, with 2 (2010), 4 (2012), 2 (2013)
and 2 (2014). It is worth noting that the 3rd person singular present of have (vhz), past
tense verbs (vvd) and gerund/participle verbs (vvg) are used only on the academic list.
Lastly, the possessive ending (pos) is present on both lists, but is more frequent in spoken
discourse, with 1 (2010), 1 (2011), 1 (2013) and 1 (2014); on the academic list it occurs
with frequencies of 1 (2014) and 2 (2015).
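The noun totals quoted above combine the singular (nn) and plural (nns) rows of Table 5. The following minimal Python sketch reproduces that aggregation; the counts are transcribed from Table 5, while the names YEARS, ACADEMIC, SPOKEN and noun_totals are hypothetical and introduced only for illustration.

```python
# Tag counts transcribed from Table 5 (3-word clusters), ordered 2010..2015.
YEARS = [2010, 2011, 2012, 2013, 2014, 2015]
ACADEMIC = {"nn": [7, 15, 12, 8, 10, 14], "nns": [11, 3, 11, 7, 10, 13]}
SPOKEN = {"nn": [10, 14, 7, 5, 11, 10], "nns": [7, 6, 3, 3, 3, 5]}


def noun_totals(tag_counts):
    """Add singular (nn) and plural (nns) noun counts year by year."""
    return [nn + nns for nn, nns in zip(tag_counts["nn"], tag_counts["nns"])]


if __name__ == "__main__":
    for label, counts in (("academic", ACADEMIC), ("spoken", SPOKEN)):
        # Pair each yearly total with its year for easier reading.
        print(label, dict(zip(YEARS, noun_totals(counts))))
```

The same pattern could be applied to any other group of tags, for example the verb tags, as long as the per-year counts are read off the table in the same column order.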
Conclusions
It may be concluded that the difference in the quantitative use of the studied noun collocates
is evident in the researched discourses, with some collocate types having an advantage in the
number of tokens used. The cluster types used vary across both research fields and across the
presented years. There is a quantitative difference in the distribution of sentences
containing the studied types of clusters across the diachronic spectrum. Additionally, some
cluster types are more frequent than others in the studied years. It is worth noting that the
verb analyze rarely collocates with modal verbs. Only will and can are visible on the lists:
the former for spoken, the latter for academic discourse. This trend may constitute a basis
for further research.
Table 6 Pearson’s Chi-squared test for significance of the parts of speech used within researched clusters
Cluster/part of speech                    Academic freq.   Spoken freq.
coordinating conjunctions                 18               8
determiners                               65               80
preposition/subordinating conjunctions    17               28
adjectives                                31               11
singular/mass noun                        90               76
plural noun                               70               32
proper noun                               12               7
personal pronouns                         26               52
adverbs                                   17               17
to                                        8                10
wh-adverbs                                9                13
Pearson's Chi-squared test: X-squared = 42.748, df = 10, p-value = 5.517e-06
Cluster/part of speech                    Academic freq.   Spoken freq.
singular/mass noun                        90               76
plural noun                               70               32
proper noun                               12               7
Pearson's Chi-squared test: X-squared = 5.5518, df = 2, p-value = 0.06229
Cluster/part of speech                    Academic freq.   Spoken freq.
adjectives                                31               11
Pearson's Chi-squared test: X-squared = 9.5238, df = 1, p-value = 0.002028
Cluster/part of speech                    Academic freq.   Spoken freq.
coordinating conjunctions                 18               8
preposition/subordinating conjunctions    17               28
to                                        8                10
Pearson's Chi-squared test: X-squared = 6.6637, df = 2, p-value = 0.03573
Cluster/part of speech                    Academic freq.   Spoken freq.
adverbs                                   17               17
wh-adverbs                                9                13
Pearson's Chi-squared test: X-squared = 0.15357, df = 1, p-value = 0.6951
Cluster/part of speech                    Academic freq.   Spoken freq.
determiners                               65               80
Pearson's Chi-squared test: X-squared = 1.5517, df = 1, p-value = 0.2129
Cluster/part of speech                    Academic freq.   Spoken freq.
personal pronouns                         26               52
Pearson's Chi-squared test: X-squared = 8.6667, df = 1, p-value = 0.003241
Pearson’s chi-squared tests for significance (see Table 6) show that noun clusters are more
frequent in academic discourse, with a p-value of 0.062. However, this difference cannot be
treated as significant, since the p-value should be less than 0.05. Next, adjectives in the
clusters are significantly more frequent in academic discourse, with a p-value of 0.002.
Furthermore, coordinating conjunctions are more frequent in academic discourse, while
subordinating conjunctions and prepositions are more frequent in spoken discourse. The
Pearson’s chi-squared test gives a p-value of 0.036, so the differences in the use of
conjunctions are significant. Moreover, the statistical test for adverbs does not show that
either of the researched discourses uses them more frequently; the p-value equals 0.695.
Determiners scored a p-value of 0.213, so the difference is not significant, although the
number of determiners used in spoken discourse is greater than in academic discourse. Finally,
personal pronouns are significantly more frequent in spoken discourse, with a p-value of
0.003. The overall p-value of the surveyed data is approximately 0.000006 (5.517e-06), which
suggests that cluster variation across discourses can serve as a basis for interesting further
research.
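The single-row results in Table 6 (adjectives, determiners, personal pronouns) appear consistent with a goodness-of-fit test against equal expected frequencies, and the adverbs/wh-adverbs result with a 2x2 test using a continuity correction. The Python sketch below illustrates both calculations under that assumption; it is not the procedure used in the study, which cites R, and the function names are hypothetical.

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def equal_split_test(academic, spoken):
    """Goodness-of-fit test of two counts against equal expected frequencies."""
    expected = (academic + spoken) / 2.0
    stat = ((academic - expected) ** 2 + (spoken - expected) ** 2) / expected
    return stat, chi2_sf_df1(stat)


def yates_2x2_test(a, b, c, d):
    """2x2 independence test with continuity correction; rows are (a, b) and (c, d)."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2, never below zero (the continuity correction).
    corrected = max(0.0, abs(a * d - b * c) - n / 2.0)
    stat = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_sf_df1(stat)


if __name__ == "__main__":
    print("adjectives", equal_split_test(31, 11))
    print("personal pronouns", equal_split_test(26, 52))
    print("adverbs vs wh-adverbs", yates_2x2_test(17, 17, 9, 13))
```

A one-row test of this kind only asks whether the two counts depart from an even split; it does not adjust for any difference in the total size of the academic and spoken samples.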
Table 7 Pearson’s chi-squared test for significance of the words used as collocates in 5L to 5R window span
Collocate   Academic freq.   Spoken freq.
a           12               17
and         96               58
in          22               13
of          28               15
that        10               15
the         94               77
they        8                14
to          118              89
we          9                33
Pearson's Chi-squared test: X-squared = 32.873, df = 8, p-value = 6.491e-05
Collocate   Academic freq.   Spoken freq.   Pearson's Chi-squared test
a           12               17             X-squared = 0.86207, df = 1, p-value = 0.3532
and         96               58             X-squared = 9.3766, df = 1, p-value = 0.002198
in          22               13             X-squared = 2.3143, df = 1, p-value = 0.1282
of          28               15             X-squared = 3.9302, df = 1, p-value = 0.04743
that        10               15             X-squared = 1, df = 1, p-value = 0.3173
the         94               77             X-squared = 1.6901, df = 1, p-value = 0.1936
they        8                14             X-squared = 1.6364, df = 1, p-value = 0.2008
to          118              89             X-squared = 4.0628, df = 1, p-value = 0.04384
we          9                33             X-squared = 13.714, df = 1, p-value = 0.0002128
Firstly, the indefinite article a is used more frequently in spoken discourse, but the
difference between the discourses is slight (see Table 7). The definite article, on the other
hand, is used more frequently in academic discourse. Neither difference can be treated as
significant, however, since both p-values are greater than 0.05. Secondly, and is more
frequent in academic discourse: the difference in frequency between the discourses is 38, and
the p-value of 0.002 is significant. Thirdly, in is used 22 times in academic discourse, which
is 9 instances more than in spoken discourse; however, the p-value is 0.13, so the difference
is not significant. Next, the p-value for that equals 0.32, which means the difference is not
significant. The same holds for they, where the p-value is 0.20. On the other hand, of, to and
we have p-values small enough to be treated as significant: 0.047, 0.044 and 0.0002
respectively. The first two are used more frequently in academic discourse, while the third is
significantly more frequent in spoken discourse. As the Pearson’s chi-squared test shows, the
collocates in Table 7 yield an overall p-value of approximately 0.00006 (6.491e-05). This
suggests that research on collocates in the discourse setting might provide a good basis for
further study.
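The overall figure for Table 7 can be illustrated with a Pearson chi-squared test of independence on the 9 x 2 table of collocate counts. The Python sketch below is a minimal illustration, not the procedure used in the study; the names TABLE_7, independence_test and chi2_sf_even_df are hypothetical, and the p-value helper relies on a closed form that holds only for an even number of degrees of freedom (here df = 8).

```python
import math

# (academic, spoken) frequencies transcribed from Table 7.
TABLE_7 = {
    "a": (12, 17), "and": (96, 58), "in": (22, 13),
    "of": (28, 15), "that": (10, 15), "the": (94, 77),
    "they": (8, 14), "to": (118, 89), "we": (9, 33),
}


def chi2_sf_even_df(stat, df):
    """Upper-tail chi-squared probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (stat/2)^k / k!
        total += term
    return math.exp(-half) * total


def independence_test(rows):
    """Pearson chi-squared statistic and df for an r x 2 table of counts."""
    col_totals = [sum(r[0] for r in rows), sum(r[1] for r in rows)]
    grand = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(rows) - 1


if __name__ == "__main__":
    stat, df = independence_test(list(TABLE_7.values()))
    print(stat, df, chi2_sf_even_df(stat, df))
```

Unlike the per-collocate rows, this test compares each collocate's share within its discourse, so it takes the different totals of the academic and spoken columns into account.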
Bibliography
BIBER D., CONRAD S., REPPEN R., Corpus Linguistics: Investigating language structure and
use, Cambridge, Cambridge University Press, 1998.
BUCHOLTZ M., Theories of Discourse as Theories of Gender: Discourse Analysis in Language
and Gender Studies, The Handbook of Language and Gender, Blackwell Publishing Ltd., 2003.
Corpus of Contemporary American English. N.p., n.d. Accessed 14.11.2019, 15.11.2019.
Available online: https://corpus.byu.edu/coca/
HAUSSER R., Computational linguistics: Human-Computer Communication in Natural Language
(3rd ed.), Springer, 2014.
SUNDERLAND J., Language and Gender: An advanced resource book, Routledge, London 2006.
The R Foundation for Statistical Computing. R x64, version 3.4.1, 30 Jun. 2017. Freeware
software. Available online: <http://cran.r-project.org/>
ANTHONY L., TagAnt x64, version 1.2.0, 15 Sep. 2015. Freeware software. Available online:
<http://www.laurenceanthony.net/software/tagant>
ANTHONY L., ProtAnt x64, version 0.1, 21 Mar. 2017. Freeware software. Available online:
<http://www.laurenceanthony.net/software/protant>
ANTHONY L., AntConc 3.5.7, 30 Sept. 2018. Freeware software. Available online
Vocabulary analysis: A corpus-based study of clusters and collocates of the verb “analyze” in
academic and spoken discourse.
There are many corpora available online that can be freely annotated with various linguistic features.
Many of them constitute an excellent library of examples, and the data they contain can be used for analysis
with any linguistic tool. The aim of this study was to examine the variety of collocates and clusters used
with the verb “analyze” in two discourses: academic and spoken.
The paper presents a description of corpus data (300 examples), previously classified according to selected
research categories. Tools of computational linguistics were used here to carry out statistical tests on the
corpus data. Pearson’s chi-squared tests demonstrated the significance of the use of some clusters and
collocates in quantitative comparison within the studied material.
In conclusion, the variety in the use of clusters and collocates within the studied discourses may form
a basis for further, interesting research.