
SEBASTIAN DUNAT

Akademia Techniczno-Humanistyczna w Bielsku-Białej

Vocabulary analysis: A corpus based study of “analyze” clusters

and collocates in academic and spoken discourse

Key words: corpus linguistics, academic discourse, spoken discourse, clusters, collocates,

quantitative statistical surveys

Słowa klucze: językoznawstwo korpusowe, dyskurs akademicki, dyskurs mówiony, zbitki wy-

razowe, kolokaty, ilościowe badania statystyczne

Introduction

Corpora are analyzed in various ways to uncover linguistic information, such as the frequency of use of certain keywords or collocations. This can be done with a corpus search engine, if one is available online, or with specific programs designed for this purpose. As Jane Sunderland states: A corpus is a representative, substantial body of semantically collected and recorded data, spoken or written, which is normally electronically stored as text on a PC.1 A corpus might be labeled not only with syntactic or lexical features, but also with speaker or text features. Various corpora are available online; they constitute a great library of examples, and their data can be analyzed with the help of linguistic tools. Some corpora provide a search engine for easier data acquisition.

Corpus linguistics obtains and analyzes large quantities of data and tries to provide answers to research questions, which may concern words or grammatical structures, the frequency of their use, how they link with other words or structures, and their range of possible meanings.2 According to Biber, Conrad and Reppen,3 the characteristics of corpus-based analysis are:

1 Sunderland, J., Language and Gender: An advanced resource book. London: Routledge, 2006, p. 56.
2 Ibid.
3 Biber, D., Conrad, S., Reppen, R., Corpus Linguistics: Investigating language structure and use. Cambridge: CUP, 1998, p. 4.


▪ Empirical analysis of patterns of language use in natural texts

▪ Development of a corpus, i.e. a large collection of natural texts

▪ Use of computers for a wide range of analysis techniques (e.g. automatic or interactive)

▪ Reliance on both quantitative and qualitative analysis

Method

Computational linguistics practical tasks

The development of communication between humans and computers in all areas of linguistic analysis prompted Hausser4 to present several practical tasks of computational linguistics, although the list is not complete and remains open to discussion.

▪ Indexing and retrieval in textual databases

▪ Machine translation

▪ Automatic text production

▪ Automatic text checking

▪ Automatic content analysis

▪ Automatic tutoring

▪ Automatic dialog and information systems

The first of the practical tasks concerns textual databases, which consist of various kinds of electronically stored data (texts, sentences, word frequencies). Ease of access makes such databases a great tool for researchers interested in any type of text or passage relevant to their specific analysis. The biggest freely available database is the World Wide Web, but its unstructured form might pose some difficulties in obtaining precise data.5 Second, machine translation has remarkable potential to make research easier through the automatic or semi-automatic translation of research articles from around the world. Third, precise linguistic knowledge might influence and improve automatic text production and help to create various forms of highly flexible and interactive systems.6 It might be applied to modifying maintenance manuals for new lines of products, or to product descriptions. Automatic text checking, the fourth of the tasks, serves in a variety of computer applications, for example simple automatic spelling correction. Moreover, there are word form recognition programs or

4 Hausser, R., Computational linguistics: Human-Computer Communication in Natural Language (3rd ed.). Springer, 2014, p. 30.
5 Ibid.
6 Ibid.


syntax error checking applications based on syntactic parsers.7 The fifth practical task, automatic content analysis, may provide summaries of literature, even in specialized fields such as science or economics. Automatic content analysis is a precondition for concept-based indexing, needed for accurate retrieval from textual databases, as well as for adequate machine translation.8 The sixth of the above-mentioned tasks, automatic tutoring, can provide interactive online systems for foreign language teaching and practice. Furthermore, such systems could provide data on students' errors or the amount of time needed to complete various exercises.9 Automatic tutoring opens a new field of research in which textbooks are replaced by an electronic medium that assists in teaching and learning.10 The last of the tasks uses information systems to provide automatic information services, for example bus and train schedules, tax consulting, or medical databases.11

Discourse definition

According to Mary Bucholtz, there are two definitions of discourse. The first, formal definition derives from the organization of linguistic units; like the definitions of morphology and syntax, it treats discourse as the linguistic level at which sentences are combined into larger units.12 The alternative definition, on the other hand, focuses on discourse as language used in context: language as it is put to use in social situations, not the more idealized and abstracted linguistic forms that are the central concern of much linguistic theory.13

On the basis of these definitions, Bucholtz defines discourse analysis as a collection of perspectives on situated language use that involve a general shared theoretical orientation and a broadly methodological approach.14

Present Study

The aim of this study was to examine the variety of collocates and clusters used with the verb analyze in two discourses, academic and spoken, between the years 2010 and 2015. The material used for the research comes from the Corpus of Contemporary American English (COCA), available online. Three hundred examples were collected in total: 150 examples for each of the

7 Ibid.
8 Ibid., p. 31.
9 Ibid.
10 Ibid.
11 Ibid.
12 Bucholtz, M., “Theories of Discourse as Theories of Gender: Discourse Analysis in Language and Gender Studies.” The Handbook of Language and Gender. Blackwell Publishing Ltd., 2003, p. 43.
13 Ibid.
14 Ibid., p. 45.


fields, categorized by the corpus as examples selected from spoken discourse and academic discourse. The study described the corpus data and classified it in terms of the following selected categories:

▪ Year

▪ 2-word cluster

▪ 3-word cluster

▪ Collocates

The study examines whether:

▪ The difference in the quantitative use of the collocates of the studied verb will be evident in the researched discourses, with a significant advantage of one of the collocate types (in the number of tokens used).

▪ The variation in the use of tokens with a classified type of cluster will be evident for the presented years and fields.

▪ There will be a quantitative difference in the diachronic distribution of sentences containing the studied types of clusters or collocates, with a predominance of one cluster or collocate type in at least one of the studied years.

Additionally, Pearson’s chi-squared tests are carried out to verify the following hypotheses:

▪ One of the parts of speech clustered with the tested verb will have a higher frequency of use in at least one of the researched fields.

▪ The distribution of specific collocates will be greater for at least one of the studied fields.

Moreover, each hypothesis is to be supported by a statistically significant test result (p-value less than 0.05).
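
The chi-squared test named above can be illustrated with a minimal sketch. The counts used here are hypothetical and do not come from the corpus: they stand for 150 spoken and 150 academic examples, split by whether a given cluster type is present. The function names (chi_squared, p_value) are illustrative assumptions as well, and no continuity correction is applied.

```python
import math


def chi_squared(table):
    """Return Pearson's chi-squared statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def p_value(statistic, dof):
    """Upper-tail probability of the chi-squared distribution.
    Only the closed forms for 1 and 2 degrees of freedom are covered."""
    if dof == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if dof == 2:
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# Hypothetical counts (NOT taken from the study):
# rows = discourse (spoken, academic), columns = (cluster type present, absent).
hypothetical = [[40, 110],
                [60, 90]]

stat, dof = chi_squared(hypothetical)
p = p_value(stat, dof)
print(f"chi2 = {stat:.3f}, df = {dof}, p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

With these made-up counts the statistic equals 6.0 with one degree of freedom, so the p-value falls below the 0.05 threshold stated above. For larger tables a statistics library would normally be used instead of the two closed forms.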

Research

Corpus Description

The present research divided the corpus into two discourse fields: spoken and academic. Additionally, it divided the data diachronically into six yearly categories; each combination of field and year has the same number of examples (25). Graph 1 visualizes the distribution of the corpus data.


Graph 1 Corpus data distribution

The following tables present the corpus clusters and collocates obtained with the AntConc software for the purpose of this research.

Table 1 2-word clusters for spoken and academic discourse

Rank | Freq. | Cluster (Spoken 2010) | Freq. | Cluster (Spoken 2011) | Freq. | Cluster (Academic 2010) | Freq. | Cluster (Academic 2011)
1 | 5 | analyze the | 10 | analyze the | 4 | analyze the | 11 | analyze the
2 | 2 | analyze - critically | 2 | analyze these | 3 | analyze their | 2 | analyze a
3 | 2 | analyze history | 2 | analyze. ! rep | 2 | analyze data | 2 | analyze and
4 | 2 | analyze what | 1 | analyze all | 1 | analyze and | 2 | analyze menu
5 | 2 | analyze. (begin | 1 | analyze for | 1 | analyze arguments | 2 | analyze them
6 | 1 | analyze a | 1 | analyze old | 1 | analyze as | 1 | analyze duration
7 | 1 | analyze and | 1 | analyze president | 1 | analyze commonalities | 1 | analyze family
8 | 1 | analyze both | 2 | analyze that | 1 | analyze concrete | 1 | analyze pending
9 | 1 | analyze her | 1 | analyze today | 1 | analyze differences | 1 | analyze specific
10 | 1 | analyze his | 1 | analyze volcanic | 1 | analyze excavated | 1 | analyze through
11 | 1 | analyze it | 1 | analyze wedding | 1 | analyze how | 1 | analyze variation
12 | 1 | analyze seafood | 1 | analyze. ! bill | 1 | analyze human | 1 | analyze your
13 | 1 | analyze that | 1 | analyze. joy | 1 | analyze if | 1 | analyze. some
14 | 1 | analyze their | | | 1 | analyze its | |
15 | 1 | analyze these | | | 1 | analyze public | |
16 | 1 | analyze this | | | 1 | analyze such | |
17 | 1 | analyze those | | | 1 | analyze validly | |
18 | 1 | analyze with | | | 1 | analyze why | |
19 | 1 | analyze, fox | | | 1 | analyze, diagram | |

Rank | Freq. | Cluster (Spoken 2012) | Freq. | Cluster (Spoken 2013) | Freq. | Cluster (Academic 2012) | Freq. | Cluster (Academic 2013)
1 | 10 | analyze the | 7 | analyze it | 8 | analyze the | 5 | analyze the
2 | 3 | analyze it | 4 | analyze the | 3 | analyze their | 2 | analyze how
3 | 1 | analyze and | 2 | analyze and | 3 | analyze and | 1 | analyze a
4 | 1 | analyze attorney | 2 | analyze that | 2 | analyze data | 1 | analyze and
5 | 1 | analyze brains | 2 | analyze them | 1 | analyze bigger | 1 | analyze as
6 | 1 | analyze by | 2 | analyze what | 1 | analyze complete | 1 | analyze canadian
7 | 1 | analyze from | 1 | analyze - dee | 1 | analyze image | 1 | analyze each
8 | 1 | analyze his | 1 | analyze details | 1 | analyze metrics | 1 | analyze factors
9 | 1 | analyze what | 1 | analyze for | 1 | analyze moderators | 1 | analyze hiv
10 | 1 | analyze why | 1 | analyze how | 1 | analyze my | 1 | analyze laboratory
11 | 1 | analyze yourself | 1 | analyze their | 1 | analyze operations | 1 | analyze meaningful
12 | 1 | analyze, attorneys | 1 | analyze your | 1 | analyze postcard | 1 | analyze multiple
13 | 1 | analyze, we | | | 1 | analyze student | 1 | analyze performance
14 | 1 | analyze. pinsky | | | | | 1 | analyze specific
15 | | | | | | | 1 | analyze text
16 | | | | | | | 1 | analyze their
17 | | | | | | | 1 | analyze them
18 | | | | | | | 1 | analyze what
19 | | | | | | | 1 | analyze, explain
20 | | | | | | | 1 | analyze. beyond

Rank | Freq. | Cluster (Spoken 2014) | Freq. | Cluster (Spoken 2015) | Freq. | Cluster (Academic 2014) | Freq. | Cluster (Academic 2015)
1 | 6 | analyze the | 7 | analyze the | 10 | analyze the | 8 | analyze the
2 | 2 | analyze, fox | 2 | analyze it | 1 | analyze a | 2 | analyze data
3 | 1 | analyze (ph | 2 | analyze this | 1 | analyze and | 2 | analyze teachers
4 | 1 | analyze all | 2 | analyze what | 1 | analyze data | 2 | analyze and
5 | 1 | analyze both | 1 | analyze -- rachel | 1 | analyze each | 1 | analyze classroom
6 | 1 | analyze doha | 1 | analyze all | 1 | analyze hi | 1 | analyze content
7 | 1 | analyze gregory | 1 | analyze anything | 1 | analyze how | 1 | analyze film
8 | 1 | analyze him | 1 | analyze exactly | 1 | analyze possible | 1 | analyze hypotheses
9 | 1 | analyze in | 1 | analyze his | 1 | analyze research | 1 | analyze information
10 | 1 | analyze it | 1 | analyze human | 1 | analyze risk | 1 | analyze relationships
11 | 1 | analyze that | 1 | analyze my | 1 | analyze students | 1 | analyze surveillance
12 | 1 | analyze this | 1 | analyze probabilities | 1 | analyze their | 1 | analyze their
13 | 1 | analyze to | 1 | analyze these | 1 | analyze treatment | 1 | analyze this
14 | 1 | analyze what | 1 | analyze where | 1 | analyze, figured | 1 | analyze unknown
15 | 1 | analyze when | 1 | analyze why | 1 | analyze, interpret | 1 | analyze, critically
16 | 1 | analyze where | 1 | analyze, experts | 1 | analyze; at | |
17 | 1 | analyze whether | 1 | analyze. (begin | | | |
18 | 1 | analyze, does | 1 | analyze. army | | | |
19 | 1 | analyze. caution | | | | | |

Table 1 presents 2-word clusters for spoken and academic discourse in the years 2010 to 2015. In every year the most frequent cluster combines analyze with the definite article (the), except for 2013 in spoken discourse, where analyze followed by the pronoun it is the most frequent 2-word cluster. The indefinite article (a) occurs only once in spoken discourse (2010), whereas in academic discourse it is visible in the years 2011, 2013 and 2014. The possessive pronouns his and her occur in spoken discourse in 2010, 2012 and 2015, and him occurs in 2014. Analyze it occurs in spoken discourse in 2010, 2012, 2013, 2014 and 2015. Moreover, demonstratives are common in 2-word clusters with analyze. They occur more frequently in spoken discourse; in academic discourse, this is listed in 2015. All demonstratives are visible in the 2-word spoken discourse clusters of 2010. The years 2011, 2014 and 2015 use two demonstratives each. The year 2012 has none, and in 2013 only the demonstrative that is used. Analyze and is visible on both lists, spoken and academic, but it occurs more frequently on the academic list. The years 2011, 2014 and 2015 of spoken discourse do not use and in any of the 2-word clusters. Furthermore, the pronouns used in the 2-word clusters in spoken discourse are: their in 2010; yourself and we in 2012; them, their and your in 2013; him in 2014; and my in 2015. On the academic discourse list, the following pronouns occur: their and its in 2010; them and your in 2011; their and my in 2012; their and them in 2013; and their in 2014 and 2015. Spoken discourse uses wh-words more often than academic discourse: what, when, where and whether are used in 2014; what, where and why in 2015; what and why in 2012; and what in 2010 and 2013. In academic discourse only what and why are used, in 2013 and 2010 respectively. By, from, for, with, in and to are on the spoken discourse list, and none of them occurs on the academic list. As, if, how and at are more frequent in academic discourse clusters; with the exception of how, which occurs on the 2013 spoken list, none of them is present on the spoken discourse lists in the researched years. Analyze each occurs on the academic list (2013, 2014), while analyze both occurs on the spoken discourse list (2010, 2014). Analyze data (2010, 2012, 2014, 2015) is visible only on the academic list, and analyze all (2011, 2014, 2015) only on the spoken discourse list. Some and such are present on the academic list, in 2011 and 2010 respectively; they do not occur on the spoken discourse list.
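
The way such cluster lists are counted can be sketched as follows. This is a simplified illustration, not AntConc's own procedure: the sample sentence, the tokenizer and the function names (tokenize, clusters) are assumptions made for the example. Unlike the lists in Table 1, this sketch discards punctuation, so entries such as "analyze. joy" would not appear.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def clusters(tokens, keyword, size):
    """Count word sequences of the given size that start with the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        # Only keep windows that fit entirely inside the token list.
        if token == keyword and i + size <= len(tokens):
            counts[" ".join(tokens[i:i + size])] += 1
    return counts


# Hypothetical sample text (not taken from COCA).
sample = "We analyze the data. They analyze the results and analyze it again."
tokens = tokenize(sample)

two_word = clusters(tokens, "analyze", 2)
three_word = clusters(tokens, "analyze", 3)

for cluster, freq in two_word.most_common():
    print(freq, cluster)
for cluster, freq in three_word.most_common():
    print(freq, cluster)
```

Sorting the counts by frequency, as most_common() does, yields ranked lists of the same shape as the tables in this section.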

Table 2 3-word clusters for spoken and academic discourse

Rank | Freq. | Cluster (Spoken 2010) | Freq. | Cluster (Spoken 2011) | Freq. | Cluster (Academic 2010) | Freq. | Cluster (Academic 2011)
1 | 2 | analyze - critically analyze | 2 | analyze the week | 1 | analyze and solve | 2 | analyze menu) to
2 | 2 | analyze history from | 2 | analyze the weeks | 1 | analyze arguments and | 2 | analyze the recorded
3 | 2 | analyze. (begin-video | 2 | analyze these things | 1 | analyze as many | 1 | analyze a 2-stage
4 | 1 | analyze a political | 1 | analyze all this | 1 | analyze commonalities in | 1 | analyze a data
5 | 1 | analyze and dissect | 1 | analyze for me | 1 | analyze concrete historical | 1 | analyze and implement
6 | 1 | analyze both sides | 1 | analyze old data | 1 | analyze data regarding | 1 | analyze and organize
7 | 1 | analyze her appearances | 1 | analyze president obama | 1 | analyze data. results | 1 | analyze duration. somewhat
8 | 1 | analyze his diet | 1 | analyze that speech | 1 | analyze differences between | 1 | analyze family homelessness
9 | 1 | analyze it to | 1 | analyze the contenders | 1 | analyze excavated artifacts | 1 | analyze pending tax
10 | 1 | analyze seafood samples | 1 | analyze the cost | 1 | analyze how housing | 1 | analyze specific gas
11 | 1 | analyze that car | 1 | analyze the entire | 1 | analyze human remains | 1 | analyze the case
12 | 1 | analyze the -- the | 1 | analyze the records | 1 | analyze if and | 1 | analyze the country
13 | 1 | analyze the motivation | 1 | analyze the situation | 1 | analyze its broader | 1 | analyze the magazines
14 | 1 | analyze the overnight | 1 | analyze the very | 1 | analyze public spheres | 1 | analyze the music
15 | 1 | analyze the press | 1 | analyze today's | 1 | analyze such hybridity | 1 | analyze the narrative
16 | 1 | analyze the weeks | 1 | analyze volcanic moon | 1 | analyze the data | 1 | analyze the pairwise
17 | 1 | analyze their expressions | 1 | analyze wedding rituals | 1 | analyze the implications | 1 | analyze the planning
18 | 1 | analyze these numbers | 1 | analyze. ! bill-maher | 1 | analyze the interaction | 1 | analyze the results
19 | 1 | analyze this flow | 1 | analyze. ! rep-nancy | 1 | analyze the transcript | 1 | analyze the value
20 | 1 | analyze those reports | 1 | analyze. ! rep-peter | 1 | analyze their business | 1 | analyze them as
21 | 1 | analyze what the | 1 | analyze. joy-behar | 1 | analyze their long | 1 | analyze them to
22 | 1 | analyze what's | 1 | analyze. that would | 1 | analyze their own | 1 | analyze through perform
23 | 1 | analyze with karl | | | 1 | analyze validly. in | 1 | analyze variation among
24 | 1 | analyze, fox news | | | 1 | analyze why, after | 1 | analyze your motivation
25 | | | | | 1 | analyze, diagram, and | 1 | analyze. some brief

Rank | Freq. | Cluster (Spoken 2012) | Freq. | Cluster (Spoken 2013) | Freq. | Cluster (Academic 2012) | Freq. | Cluster (Academic 2013)
1 | 5 | analyze the week | 2 | analyze it. so | 1 | analyze and improve | 2 | analyze the link
2 | 1 | analyze and investigate | 1 | analyze - dee dee | 1 | analyze and interpret | 1 | analyze a large
3 | 1 | analyze attorney general | 1 | analyze and debate | 1 | analyze bigger and | 1 | analyze and evaluate
4 | 1 | analyze brains in | 1 | analyze and write | 1 | analyze complete genomes | 1 | analyze as well
5 | 1 | analyze by the | 1 | analyze details of | 1 | analyze data and | 1 | analyze canadian identity
6 | 1 | analyze from the | 1 | analyze for you | 1 | analyze data; and | 1 | analyze each clause
7 | 1 | analyze his disturbed | 1 | analyze how to | 1 | analyze image-space | 1 | analyze factors that
8 | 1 | analyze it because | 1 | analyze it solely | 1 | analyze metrics, update | 1 | analyze hiv/aids
9 | 1 | analyze it. (begin | 1 | analyze it with | 1 | analyze moderators for | 1 | analyze how stereotypes
10 | 1 | analyze it? do | 1 | analyze it, not | 1 | analyze my data | 1 | analyze how their
11 | 1 | analyze the legality | 1 | analyze it, to | 1 | analyze operations, track | 1 | analyze laboratory and
12 | 1 | analyze the loose | 1 | analyze it. a | 1 | analyze postcard representation | 1 | analyze meaningful data
13 | 1 | analyze the romney | 1 | analyze that at | 1 | analyze student learning | 1 | analyze multiple data
14 | 1 | analyze the tapes | 1 | analyze that correctly | 1 | analyze the bacteriology | 1 | analyze performance within
15 | 1 | analyze the whole | 1 | analyze the mission | 1 | analyze the data | 1 | analyze specific risk
16 | 1 | analyze what these | 1 | analyze the political | 1 | analyze the difference | 1 | analyze text structures
17 | 1 | analyze why they | 1 | analyze the situation | 1 | analyze the openurls | 1 | analyze the data
18 | 1 | analyze yourself and | 1 | analyze the vocals | 1 | analyze the orf | 1 | analyze the resulting
19 | 1 | analyze, attorneys kimberly | 1 | analyze their components | 1 | analyze the production | 1 | analyze the time
20 | 1 | analyze, we create | 1 | analyze them with | 1 | analyze the statistical | 1 | analyze their data
21 | 1 | analyze. pinsky# well | 1 | analyze them. but | 1 | analyze the types | 1 | analyze them, has
22 | | | 1 | analyze what you | 1 | analyze their associations | 1 | analyze what these
23 | | | 1 | analyze what's | 1 | analyze their classroom | 1 | analyze, explain, or
24 | | | 1 | analyze your 23andme | 1 | analyze their own | 1 | analyze. beyond the
25 | | | | | 1 | analyze, and generate | |

Rank | Freq. | Cluster (Spoken 2014) | Freq. | Cluster (Spoken 2015) | Freq. | Cluster (Academic 2014) | Freq. | Cluster (Academic 2015)
1 | 2 | analyze, fox news | 4 | analyze the week | 1 | analyze a column | 2 | analyze the data
2 | 1 | analyze (ph) my | 2 | analyze this, may | 1 | analyze and evaluate | 1 | analyze and interpret
3 | 1 | analyze all the | 1 | analyze -- rachel campos | 1 | analyze data to | 1 | analyze classroom data
4 | 1 | analyze both jewel | 1 | analyze all of | 1 | analyze each participant | 1 | analyze data and
5 | 1 | analyze doha's | 1 | analyze anything. you | 1 | analyze hi in | 1 | analyze data, read
6 | 1 | analyze gregory, but | 1 | analyze exactly what | 1 | analyze how and | 1 | analyze film music
7 | 1 | analyze him or | 1 | analyze his donor | 1 | analyze possible teacher | 1 | analyze hypotheses six
8 | 1 | analyze in terms | 1 | analyze human behavior | 1 | analyze research studies | 1 | analyze information to
9 | 1 | analyze it on | 1 | analyze it all | 1 | analyze risk factors | 1 | analyze relationships between
10 | 1 | analyze that and | 1 | analyze it faster | 1 | analyze students' self | 1 | analyze surveillance data
11 | 1 | analyze the candidates | 1 | analyze my interview | 1 | analyze the alignment | 1 | analyze teachers' pck
12 | 1 | analyze the flavoring | 1 | analyze probabilities and | 1 | analyze the app | 1 | analyze teachers' use
13 | 1 | analyze the game | 1 | analyze the causes | 1 | analyze the apps | 1 | analyze the first
14 | 1 | analyze the media | 1 | analyze the news | 1 | analyze the data | 1 | analyze the images
15 | 1 | analyze the situation | 1 | analyze the nominee | 1 | analyze the demographic | 1 | analyze the melody
16 | 1 | analyze the week | 1 | analyze these bills | 1 | analyze the gain | 1 | analyze the performances
17 | 1 | analyze this story | 1 | analyze what 11 | 1 | analyze the literary | 1 | analyze the sources
18 | 1 | analyze to further | 1 | analyze what he | 1 | analyze the records | 1 | analyze the word
19 | 1 | analyze what happened | 1 | analyze where the | 1 | analyze the svs | 1 | analyze their e
20 | 1 | analyze when the | 1 | analyze why, why | 1 | analyze the tool | 1 | analyze this dataset
21 | 1 | analyze where the | 1 | analyze, experts everywhere | 1 | analyze their own | 1 | analyze unknown science
22 | 1 | analyze whether they | 1 | analyze. (begin-video | 1 | analyze treatment effects | 1 | analyze, and report
23 | 1 | analyze, does this | 1 | analyze. army sergeant | 1 | analyze, figured most | 1 | analyze, critically examine
24 | 1 | analyze. caution, you | | | 1 | analyze, interpret, and | 1 | analyze; at times
25 | | | | | 1 | analyze; at times | |

The most frequent 3-word clusters for spoken and academic discourse are presented in Table 2. The most frequent cluster in spoken discourse is analyze the week/weeks. It is visible on the lists of all the researched years except 2013, and its total number of uses throughout the corpus is 15. Analyze the situation occurs twice on the spoken discourse list. For academic discourse the most frequent 3-word cluster is analyze the data, which occurs 5 times in total, followed by analyze the link (4 instances) and analyze the time (2 instances). Furthermore, the researcher would like to underline the differences between the two discourses in 3-word clusters containing determiners. In spoken discourse: the motivation, the overnight, the press, the weeks (2010); the week, the weeks, the contenders, the cost, the entire, the records, the situation, the very (2011); the week, the loose, the romney, the tapes, the whole (2012); the mission, the political, the situation, the vocals (2013); the flavoring, the game, the media, the situation, the week (2014); the causes, the news, the nominee (2015) and a political (2010). In academic discourse the 3-word clusters containing determiners are: the implications, the interaction, the transcript (2010); the case, the county, the magazines, the music, the narrative, the pairwise, the planning, the results, the value (2011); the bacteriology, the data, the difference, the openurls, the orf, the production, the statistical, the types (2012); the link, the data, the resulting, the time (2013); the alignment, the app, the apps, the data, the demographic, the gain, the literary, the records, the svs, the tool (2014); the first, the images, the melody, the performances, the sources, the word (2015) and a 2-stage, a data (2011); a large (2013); a column (2014). Please note that the only two clusters present on both lists are analyze the records and analyze what these.

The clusters used in the corpus might suggest that spoken discourse uses more what and where clusters: what these, why they (2012); how to (2013); what happened, when the, where the, whether they (2014); where the (2015). Academic discourse, on the other hand, uses more how clusters: how housing (2010); how stereotypes, how their, what these (2013); how and (2014).

Clusters of analyze plus a coordinated verb differ between the two subcorpora. In spoken discourse one may find: analyze and dissect (2010); analyze and investigate (2012); analyze and debate, analyze and write (2013). Academic discourse uses: analyze and solve (2010); analyze and implement, analyze and organize (2011); analyze and improve, analyze and interpret (2012); analyze and evaluate (2013); analyze and evaluate (2014); analyze and interpret, analyze and report (2015).

Furthermore, academic discourse uses the pronoun their more frequently than spoken discourse; one may find the following examples: their business, their long, their own (2010); them as, them to, your motivation (2011); my data, their associations, their classroom, their own (2012); how their, their data (2013); their own (2014). Spoken discourse, on the other hand, uses other pronouns: her appearances, his diet (2010); for you, their components, them with, it solely, it with, it not, it to (2013); it on (2014); anything you, it all, his donor, it faster (2015).

Among the various 3-word clusters that could be of interest because of the nouns they combine with, those which use demonstratives should be underlined. In spoken discourse the following are used: history from, seafood samples, both sides, this flow, those reports (2010); these things, all this, old data, volcanic moon, wedding rituals (2011); attorney general, brains in (2012); details of (2013); all the, both jewel, in terms (2014); all of, human behavior, probabilities and, these bills, experts everywhere, and army sergeant (2015). In academic discourse, respectively, the following are listed: commonalities in, differences between, excavated artifacts, human remains (2010); family homelessness, pending tax, variation among, specific gas (2011); complete genomes, image-space, moderators for, postcard representation, student learning (2012); as well, Canadian identity, each clause, factors that, meaningful data, multiple data, performance within, specific risk, text structures (2013); data to, each participant, possible teacher, research studies, risk factors, students' self (2014); classroom data, film music, information to, relationships between, surveillance data, teachers' pck, unknown science, and this dataset (2015). The researcher may conclude that the demonstratives this and those are more frequently used in spoken discourse, while the demonstrative that is more frequent in academic discourse.

Table 3 First twenty Academic and Spoken collocates for the verb analyze (window span of 5L and 5R)

Rank | Spoken: Freq. | Freq. left | Freq. right | Statistic | Collocate | Academic: Freq. | Freq. left | Freq. right | Statistic | Collocate
1 | 89 | 78 | 11 | 3.98188 | to | 118 | 104 | 14 | 4.09570 | to
2 | 77 | 13 | 64 | 3.57089 | the | 96 | 46 | 50 | 3.68879 | and
3 | 58 | 30 | 28 | 3.58809 | and | 94 | 25 | 69 | 3.46859 | the
4 | 33 | 26 | 7 | 3.78556 | we | 33 | 4 | 29 | 4.17859 | data
5 | 29 | 18 | 11 | 3.47177 | you | 28 | 8 | 20 | 3.02041 | of
6 | 29 | 6 | 23 | 3.66732 | it | 22 | 8 | 14 | 3.03969 | in
7 | 26 | 5 | 21 | 3.31423 | s | 18 | 6 | 12 | 3.33515 | for
8 | 25 | 24 | 1 | 4.12199 | will | 16 | 15 | 1 | 4.29647 | used
9 | 22 | 2 | 20 | 4.46808 | news | 15 | 3 | 12 | 4.01072 | their
10 | 17 | 9 | 8 | 3.12308 | i | 13 | 9 | 4 | 3.48234 | students
11 | 17 | 8 | 9 | 2.85100 | a | 12 | 4 | 8 | 3.18629 | as
12 | 16 | 15 | 1 | 4.44161 | here | 12 | 3 | 9 | 2.54583 | a
13 | 15 | 3 | 12 | 3.67043 | what | 11 | 10 | 1 | 3.34086 | were
14 | 15 | 5 | 10 | 2.67043 | that | 11 | 10 | 1 | 4.06076 | can
15 | 15 | 7 | 8 | 3.02657 | of | 10 | 8 | 2 | 3.92325 | was
16 | 14 | 2 | 12 | 4.57089 | week | 10 | 8 | 2 | 3.01072 | that
17 | 14 | 7 | 7 | 3.71291 | they | 9 | 5 | 4 | 3.46639 | we
18 | 14 | 13 | 1 | 4.76354 | brooks | 9 | 5 | 4 | 4.05136 | use
19 | 13 | 8 | 5 | 2.94041 | is | 9 | 4 | 5 | 3.68879 | how
20 | 13 | 7 | 6 | 3.03771 | in | 8 | 6 | 2 | 3.78190 | they

The most frequent collocate in both subcorpora is to, with 89 instances in spoken discourse and 118 in academic discourse. And is second on the academic list and third on the spoken list, with 96 and 58 examples respectively. The, third on the academic list with 94 instances, is second for spoken discourse with 77 instances. The determiner a is more frequent on the spoken list than on the academic one, with 17 and 12 instances respectively. Data, fourth on the academic list with 33 examples, does not occur on the spoken discourse list. Spoken discourse visibly uses we and they more frequently, with 33 and 14 examples. Their is not present on the spoken discourse list; it is in 9th position on the academic list with a frequency of 15. The same holds for use and how, with 9 examples each on the academic list. The present form of the verb to be (is) occurs on the spoken list, while its past counterparts (was, were) occur on the academic one: is has 19th position (13 examples), was has 15th position (10 examples), and were is 13th (11 examples). The demonstrative that occurs on both lists, with 15 examples for spoken and 10 examples for academic discourse (14th and 16th position respectively). The last words shared by both lists are the prepositions in and of. In is 6th on the academic list and 20th on the spoken list, with 22 and 13 examples respectively. Of is listed as 15th on the spoken list and 5th on the academic list, with frequencies of 15 and 28. News and week are listed on the spoken discourse list only; the former has 9th position (22) and the latter 16th (14). Used and students, listed as 8th and 10th, are present only on the academic list. It is worth noting that analyze rarely collocates with modal verbs. Only will and can are visible on the lists, the first for spoken and the second for academic discourse; will, in eighth position, has a frequency of 25. Spoken discourse uses the personal pronouns I, you and it, whose frequencies are 17, 29 and 29 instances respectively, as visible in Table 3. In the same category of pronoun collocates, academic discourse uses the possessive pronoun their, which is listed as 9th with a frequency of 15 instances. Lastly, spoken discourse lists a wh-adverb (what) in 13th position with a frequency of 15 examples in the corpus; adverbs are not present among the 20 most frequent collocates on the academic list.
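
How the left and right frequencies of a collocate arise from a 5L/5R window can be sketched as follows. The token list and the function name (collocates) are hypothetical, and the sketch only reproduces the frequency columns; the Statistic column is not computed, since the measure behind it is not specified in the text.

```python
from collections import Counter


def collocates(tokens, keyword, span=5):
    """Count the words found up to `span` positions to the left and to the
    right of every occurrence of the keyword."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        left.update(tokens[max(0, i - span):i])      # window to the left
        right.update(tokens[i + 1:i + 1 + span])     # window to the right
    return left, right


# Hypothetical token list (not taken from COCA).
tokens = "we want to analyze the data and to analyze the results".split()
left, right = collocates(tokens, "analyze", span=5)

# Total frequency of a collocate is the sum of its left and right counts.
for word in ("to", "the", "data"):
    total = left[word] + right[word]
    print(word, total, left[word], right[word])
```

In this toy example to appears twice to the left and once to the right of analyze, which mirrors the strongly left-skewed distribution of to in Table 3.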

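Table 4 Parts of speech used in the 2-word clusters

Rank | Keyword (Spoken 2010) | Freq. | Keyword (Spoken 2011) | Freq. | Keyword (Spoken 2012) | Freq. | Keyword (Academic 2010) | Freq. | Keyword (Academic 2011) | Freq. | Keyword (Academic 2012) | Freq.
1 | dt | 7 | nn | 6 | pp | 4 | nns | 4 | nn | 4 | nns | 4
2 | pp | 4 | dt | 3 | in | 2 | nn | 3 | dt | 2 | nn | 3
3 | nn | 3 | jj | 2 | nns | 2 | wrb | 2 | pp | 2 | pp | 2
4 | in | 1 | in | 1 | np | 2 | rb | 2 | in | 1 | dt | 1
5 | that | 1 | that | 1 | dt | 1 | pp | 2 | jj | 1 | jjr | 1
6 | rb | 1 | rb | 1 | wp | 1 | jj | 2 | rb | 1 | np | 1
7 | wp | 1 | | | wrb | 1 | vvn | 1 | np | 1 | vv | 1
8 | np | 1 | | | nn | 1 | np | 1 | vvg | 1 | |
9 | vv | 1 | | | | | in | 1 | | | |
10 | | | | | | | dt | 1 | | | |

Rank | Keyword (Spoken 2013) | Freq. | Keyword (Spoken 2014) | Freq. | Keyword (Spoken 2015) | Freq. | Keyword (Academic 2013) | Freq. | Keyword (Academic 2014) | Freq. | Keyword (Academic 2015) | Freq.
1 | pp | 4 | nn | 5 | dt | 3 | nn | 5 | dt | 3 | nn | 6
2 | dt | 2 | dt | 4 | pp | 3 | dt | 3 | nn | 3 | nns | 4
3 | in | 1 | in | 3 | nn | 3 | jj | 2 | nns | 2 | dt | 2
4 | that | 1 | pp | 2 | rb | 2 | pp | 2 | in | 1 | pp | 1
5 | wp | 1 | to | 2 | wrb | 2 | rb | 2 | jj | 1 | rb | 1
6 | wrb | 1 | wrb | 2 | nns | 2 | np | 2 | pp | 1 | np | 1
7 | nn | 1 | that | 1 | jj | 1 | wp | 1 | uh | 1 | vv | 1
8 | nns | 1 | rb | 1 | wp | 1 | wrb | 1 | wrb | 1 | |
9 | np | 1 | wp | 1 | vv | 1 | nns | 1 | np | 1 | |
10 | vv | 1 | vvz | 1 | | | vv | 1 | vvd | 1 | |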

As Table 4 shows, determiners (dt) are more frequent in spoken discourse: 7 (2010), 3 (2011), 1 (2012), 2 (2013), 4 (2014), 3 (2015), whereas singular nouns (nn): 3 (2010), 4 (2011), 4 (2012), 5 (2013), 3 (2014), 6 (2015), and plural nouns (nns): 4 (2010), 4 (2012), 1 (2013), 2 (2014), 4 (2015), are used more often in academic discourse. Interestingly, no plural nouns occurred in the 2011 part of the subcorpora. Proper nouns (np) do not occur in the 2-word clusters in spoken discourse; they are present on the academic list with frequencies of 1 (2010), 1 (2012), 2 (2013), 1 (2014), 1 (2015). Furthermore, personal pronouns (pp) are more frequent on the spoken list. Wh-adverbs (wrb) occur on both lists (4 instances each), but in different years: for spoken discourse in 2012 (1), 2013 (1), and 2014 (2), and for academic discourse in 2010 (2), 2013 (1), and 2014 (1). Wh-pronouns (wp), on the other hand, are more frequent on the spoken list, with one example in each of the years 2010, 2012, 2013, 2014 and 2015. In academic discourse, wh-pronouns appear on the list for 2013 and 2014, with 2 and 1 examples respectively. Adjectives (jj) and comparatives (jjr) are more common in academic discourse: 2 (2010), 1 (2011), 1 (2012), 2 (2013), 1 (2014); their distribution in spoken discourse is 2 (2011), 1 (2015). Furthermore, spoken discourse uses prepositions (in) more often: 1 (2010), 1 (2011), 2 (2012), 1 (2013), 3 (2014); in academic discourse, their use is distributed as follows: 1 (2010), 1 (2011), 1 (2014). Verbs (vv), past tense verbs (vvd), gerund/participle verbs (vvg), and 3rd person singular present verbs (vvz) are not common but are present on both lists. There are 4 instances of their use in spoken discourse and 6 in academic discourse. It is worth noting that spoken discourse uses only base forms and 3rd person singular forms of verbs, while academic discourse uses base forms, past tense forms, past participles, and gerunds/participles.
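A frequency list of this kind can be obtained by counting the tags of the words that accompany the node verb in each tagged cluster. The Python sketch below illustrates the idea only: it assumes clusters stored as space-separated `word_tag` tokens with lowercase tags like those in Table 4, and the sample clusters and the names `SAMPLE_CLUSTERS`, `NODE_FORMS` and `count_cluster_tags` are invented for illustration, not taken from the study's data or tools.

```python
from collections import Counter

# Hypothetical tagged 2-word clusters in "word_tag" form (illustrative only,
# not taken from the study's data).
SAMPLE_CLUSTERS = [
    "analyze_vv the_dt",
    "analyze_vv data_nns",
    "we_pp analyze_vv",
    "analyze_vv how_wrb",
    "analyze_vv the_dt",
]

# Forms of the node verb that should not be counted as collocating tags.
NODE_FORMS = {"analyze", "analyzes", "analyzed", "analyzing"}


def count_cluster_tags(clusters, node_forms=NODE_FORMS):
    """Count the POS tags of every word in the clusters except the node verb."""
    counts = Counter()
    for cluster in clusters:
        for token in cluster.split():
            # Split on the last underscore so words containing "_" still work.
            word, _, tag = token.rpartition("_")
            if word.lower() not in node_forms:
                counts[tag.lower()] += 1
    return counts


tag_counts = count_cluster_tags(SAMPLE_CLUSTERS)
# Print a ranked list in the same "rank, keyword, frequency" shape as the table.
for rank, (tag, freq) in enumerate(tag_counts.most_common(), start=1):
    print(rank, tag, freq)
```

Running the same count separately for each year and each discourse would give one column of the table per run.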


Table 5 Parts of speech used in the 3-word clusters

Rank

Spoken Academic

2010 2011 2012 2010 2011 2012


1 dt 13 nn 14 dt 9 nns 11 nn 15 nn 12

2 nn 10 dt 12 pp 7 jj 7 dt 13 nns 11

3 nns 7 nns 6 nn 7 nn 7 to 4 dt 8

4 pp 4 that 2 in 5 in 5 jj 3 cc 6

5 in 2 jj 2 nns 3 dt 4 pp 3 pp 4

6 jj 2 in 1 vv 2 pp 4 rb 3 vv 3

7 to 2 md 1 cc 2 rb 4 nns 3 jj 3

8 vv 2 pdt 1 jj 2 cc 2 vv 2 in 1

9 wp 2 pos 1 rb 2 wrb 2 cc 2 jjr 1

10 cc 1 pp 1 vvp 2 np 2 cd 1

11 np 1 rb 1 wp 1 vv 1 in 1

12 pos 1 wrb 1 jjr 1 vvg 1

13 rb 1 vvg 1 vvn 1

14 that 1 vvn 1 vvp 1

Rank

2013 2014 2015 2013 2014 2015


1 pp 12 dt 12 nn 10 dt 8 dt 12 nn 14

2 in 7 nn 11 dt 9 nn 8 nn 10 nns 13

3 dt 5 pp 5 pp 6 nns 7 nns 10 dt 8

4 rb 5 in 4 nns 5 jj 5 jj 4 cc 3

5 nn 5 nns 3 wp 3 cc 3 in 3 cd 3

6 that 4 cc 2 wrb 3 pp 3 vv 2 jj 3

7 to 4 that 2 rb 2 rb 3 cc 2 vv 2

8 nns 3 to 2 cc 1 vv 2 to 2 in 2

9 vv 2 wrb 2 cd 1 in 2 np 2 pos 2

10 cc 2 jjr 1 in 1 wrb 2 jjs 1 sym 2

11 wp 2 pdt 1 jj 1 that 1 pos 1 to 2

12 jj 1 pos 1 md 1 sym 1 pp 1 pp 1

13 pos 1 rb 1 rbr 1 wp 1 uh 1 rb 1

14 wrb 1 rp 1 np 1 np 1 wrb 1 vvd 1

15 wp 1 vvg 1 vvd 1

16 np 1 vvz 1

17 vvn 1 vhz 1

18 vvz 1

Table 5 presents the parts of speech used in the 3-word clusters of the researched material. Academic discourse uses nouns (nn, nns), adjectives (jj) and determiners (dt) most frequently. Nouns, the most frequent part of speech in academic discourse, occur with the following frequencies: 18 (2010), 18 (2011), 23 (2012), 15 (2013), 20 (2014), 27 (2015). Adjectives are listed with frequencies of 7 (2010), 3 (2011), 3 (2012), 5 (2013), 4 (2014), 3 (2015), while determiners have frequencies of 4 (2010), 13 (2011), 8 (2012), 8 (2013), 12 (2014), 8 (2015). The most frequent parts of speech in spoken discourse are determiners (dt), nouns (nn, nns) and personal pronouns (pp). Determiners have frequencies of 13 (2010), 12 (2011), 9 (2012), 5 (2013), 12 (2014), 9 (2015). Nouns are listed with the following numbers of examples: 17 (2010), 20 (2011), 10 (2012), 8 (2013), 14 (2014), 15 (2015). Personal pronouns are used 4, 1, 7, 12, 5, and 6 times, chronologically across the researched years. In academic discourse personal pronouns are used less frequently; chronologically: 4, 3, 4, 3, 1, and 1 example. Adjectives in spoken discourse are scarce but fairly constant; chronologically: 2, 2, 2, 1, 1, and 1. It is worth noting that in 2014 there is one example of a comparative adjective (jjr) in spoken discourse, while in academic discourse comparatives are used in 2010 (1) and 2012 (1), and one superlative (jjs) in 2014. Prepositions and subordinating conjunctions (in) are more frequent in academic discourse: 5 (2010), 1 (2011), 1 (2012), 2 (2013), 3 (2014), 2 (2015); in spoken discourse there are only 1 (2011), 7 (2013), 4 (2014), 1 (2015). On the other hand, only spoken discourse uses the complementizer “that”: 1 (2011), 2 (2012), 4 (2013) and 2 (2014) instances. Next, the preposition “to” is used in spoken discourse: 2 (2010), 4 (2013), and 2 (2014), and in academic discourse only twice, in 2014. Coordinating conjunctions (cc) are present on both lists. Spoken discourse uses them 1, 0, 2, 2, 2, and 1 times, chronologically across the researched years, while academic discourse uses them more frequently: 2, 2, 6, 3, 2, and 3 instances, chronologically. Wh-pronouns (wp) and wh-adverbs (wrb) are used more frequently in spoken discourse. The frequencies for wh-pronouns are 2 (2010), 1 (2012), 2 (2013), and 1 (2014); in academic discourse there is only one instance, in 2013. Wh-adverbs have the following frequencies on the spoken list: 1 (2012), 1 (2013), 2 (2014), 3 (2015); and on the academic list: 2 (2010), 2 (2013), 1 (2014). Other adverbs (rb) are more frequently used in academic discourse: 4 (2010), 3 (2011), 3 (2013), 1 (2015), compared with 1 (2010), 1 (2011), 2 (2012), 5 (2013) in spoken discourse. Next, verb clusters (vv, vvg, vvd, vvz, vvn, vvp, vhz) are used more frequently in academic discourse: 4 (2010), 5 (2011), 3 (2012), 5 (2013), 3 (2014), and 3 (2015), than in spoken discourse: 2 (2010), 4 (2012), 2 (2013), 2 (2014). It is worth noting that the 3rd person singular present form of “have” (vhz), past tense verbs (vvd) and gerund/participle verbs (vvg) appear only on the academic list. Lastly, the possessive ending (pos) is present on both lists, but it is more frequent in spoken discourse: 1 (2010), 1 (2011), 1 (2013), and 1 (2014); in academic discourse it occurs with frequencies of 1 (2014) and 2 (2015).
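The noun totals quoted for the 3-word clusters are the sums of the singular (nn) and plural (nns) frequencies listed for each year in Table 5. A minimal Python sketch of that arithmetic follows; the figures are copied from Table 5, while the names `TABLE5_NOUNS` and `noun_totals` are chosen here for illustration.

```python
# Singular (nn) and plural (nns) noun frequencies per year, copied from Table 5.
TABLE5_NOUNS = {
    "academic": {
        2010: {"nn": 7, "nns": 11},
        2011: {"nn": 15, "nns": 3},
        2012: {"nn": 12, "nns": 11},
        2013: {"nn": 8, "nns": 7},
        2014: {"nn": 10, "nns": 10},
        2015: {"nn": 14, "nns": 13},
    },
    "spoken": {
        2010: {"nn": 10, "nns": 7},
        2011: {"nn": 14, "nns": 6},
        2012: {"nn": 7, "nns": 3},
        2013: {"nn": 5, "nns": 3},
        2014: {"nn": 11, "nns": 3},
        2015: {"nn": 10, "nns": 5},
    },
}


def noun_totals(per_year):
    """Add the singular and plural noun counts for each year."""
    return {year: tags["nn"] + tags["nns"] for year, tags in per_year.items()}


for discourse, per_year in TABLE5_NOUNS.items():
    totals = noun_totals(per_year)
    print(discourse, totals, "all years:", sum(totals.values()))
```

The yearly sums match the noun frequencies reported in the paragraph for both discourses.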


Conclusions

It may be concluded that there is an evident quantitative difference in the use of the studied noun collocates between the researched discourses, with some collocate types showing a higher number of tokens. The cluster types vary across both research fields and across the years presented. The distribution of sentences containing the studied cluster types also differs quantitatively from a diachronic perspective. Additionally, some cluster types are more frequent than others in the studied years. It is worth noting that the verb “analyze” rarely collocates with modal verbs: only “will” and “can” appear on the lists, the former for spoken and the latter for academic discourse. This trend may constitute a basis for further research.

Table 6 Pearson’s chi-squared test for significance of the parts of speech used within researched clusters

Cluster/part of speech                    Academic freq.   Spoken freq.
coordinating conjunctions                 18               8
determiners                               65               80
preposition/subordinating conjunctions    17               28
adjectives                                31               11
singular/mass noun                        90               76
plural noun                               70               32
proper noun                               12               7
personal pronouns                         26               52
adverbs                                   17               17
to                                        8                10
wh-adverbs                                9                13
Pearson’s chi-squared test (all rows): X-squared = 42.748, df = 10, p-value = 5.517e-06

Cluster/part of speech                    Academic freq.   Spoken freq.
singular/mass noun                        90               76
plural noun                               70               32
proper noun                               12               7
Pearson’s chi-squared test: X-squared = 5.5518, df = 2, p-value = 0.06229

adjectives                                31               11
Pearson’s chi-squared test: X-squared = 9.5238, df = 1, p-value = 0.002028

coordinating conjunctions                 18               8
preposition/subordinating conjunctions    17               28
to                                        8                10
Pearson’s chi-squared test: X-squared = 6.6637, df = 2, p-value = 0.03573

adverbs                                   17               17
wh-adverbs                                9                13
Pearson’s chi-squared test: X-squared = 0.15357, df = 1, p-value = 0.6951

determiners                               65               80
Pearson’s chi-squared test: X-squared = 1.5517, df = 1, p-value = 0.2129

personal pronouns                         26               52
Pearson’s chi-squared test: X-squared = 8.6667, df = 1, p-value = 0.003241

Pearson’s chi-squared tests for significance (see Table 6) revealed that noun clusters are more frequent in academic discourse, with a p-value of 0.062. This difference, however, cannot be regarded as significant, since the p-value should be less than 0.05. Next, adjectives in the clusters are significantly more frequent in academic discourse (p-value of 0.002). Furthermore, coordinating conjunctions are more frequent in academic discourse, while subordinating conjunctions and prepositions are more frequent in spoken discourse. The Pearson’s chi-squared test yields a p-value of 0.036; accordingly, the differences in the use of conjunctions are significant. Moreover, the statistical test for adverbs does not show that either of the researched discourses uses them more frequently; the p-value equals 0.695. The use of determiners yields a p-value of 0.213 and is therefore not significant, even though the number of determiners used in spoken discourse is greater than in academic discourse. Finally, the difference for personal pronouns is significant: they are more frequent in spoken discourse, with a p-value of 0.003. The overall p-value for the surveyed data is approximately 0.0000055 (5.517e-06), which indicates that the distribution of parts of speech in the clusters differs between the discourses and may serve as a basis for further research.
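The single-row and 2x2 results in Table 6 can be reproduced with a short calculation. The sketch below is written in Python rather than in R, which the bibliography lists as the statistical software. It rests on two assumptions that happen to reproduce the X-squared values printed in the table: a single-row test compares the two frequencies against an even split, and the adverb test is a 2x2 test of independence with Yates' continuity correction (the default of R's `chisq.test` for 2x2 tables). The function names are illustrative.

```python
import math


def chi2_upper_tail_df1(x2):
    """P(X >= x2) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


def even_split_test(academic, spoken):
    """Goodness-of-fit test of two counts against equal expected frequencies."""
    expected = (academic + spoken) / 2.0
    x2 = ((academic - expected) ** 2 + (spoken - expected) ** 2) / expected
    return x2, chi2_upper_tail_df1(x2)


def yates_2x2_test(table):
    """Test of independence for a 2x2 table with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Shrink each deviation by 0.5, but never below zero.
            x2 += max(abs(observed - expected) - 0.5, 0.0) ** 2 / expected
    return x2, chi2_upper_tail_df1(x2)


# (academic, spoken) frequencies copied from Table 6.
SINGLE_ROWS = {
    "adjectives": (31, 11),
    "determiners": (65, 80),
    "personal pronouns": (26, 52),
}
ADVERBS = [[17, 17], [9, 13]]  # rows: adverbs, wh-adverbs

results = {name: even_split_test(*pair) for name, pair in SINGLE_ROWS.items()}
results["adverbs + wh-adverbs"] = yates_2x2_test(ADVERBS)

for name, (x2, p) in results.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: X-squared = {x2:.4f}, p-value = {p:.4g} ({verdict})")
```

With the 0.05 threshold used in the text, only the adjective and personal pronoun rows come out as significant among these four tests.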

Table 7 Pearson’s chi-squared test for significance of the words used as collocates in the 5L to 5R window span

Collocate   Academic freq.   Spoken freq.
a           12               17
and         96               58
in          22               13
of          28               15
that        10               15
the         94               77
they        8                14
to          118              89
we          9                33
Pearson’s chi-squared test (all collocates): X-squared = 32.873, df = 8, p-value = 6.491e-05

Collocate   Academic freq.   Spoken freq.   Pearson’s chi-squared test
a           12               17             X-squared = 0.86207, df = 1, p-value = 0.3532
and         96               58             X-squared = 9.3766, df = 1, p-value = 0.002198
in          22               13             X-squared = 2.3143, df = 1, p-value = 0.1282
of          28               15             X-squared = 3.9302, df = 1, p-value = 0.04743
that        10               15             X-squared = 1, df = 1, p-value = 0.3173
the         94               77             X-squared = 1.6901, df = 1, p-value = 0.1936
they        8                14             X-squared = 1.6364, df = 1, p-value = 0.2008
to          118              89             X-squared = 4.0628, df = 1, p-value = 0.04384
we          9                33             X-squared = 13.714, df = 1, p-value = 0.0002128

Firstly, the indefinite article “a” is used more frequently in spoken discourse, but the difference between the discourses is slight (see Table 7). The definite article, on the other hand, is used more frequently in academic discourse. Neither difference, however, can be regarded as significant, since both p-values are greater than 0.05. Secondly, “and” is more frequent in academic discourse: the difference in frequency between the discourses is 38, and the p-value of 0.002 is significant. Thirdly, “in” is used 22 times in academic discourse, which is 9 instances more than in spoken discourse; however, the p-value is 0.128, so the difference is not significant. Next, the p-value for “that” equals 0.317, so this difference is not significant either. The same holds for “they”, where the p-value is 0.201. On the other hand, “of”, “to” and “we” have p-values small enough to be regarded as significant: 0.047, 0.044 and 0.0002 respectively. The first two are used more frequently in academic discourse, while the third is significantly more frequent in spoken discourse. As the Pearson’s chi-squared test shows, the collocates in Table 7 yield an overall p-value of approximately 0.000065 (6.491e-05). This suggests that research on collocates across discourse settings might provide a good basis for further study.
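The overall result for Table 7 corresponds to a Pearson's chi-squared test of independence on the 9 x 2 table of collocate frequencies. The Python sketch below recomputes the statistic and also reports how much each collocate contributes to it. It is an illustration, not the R procedure used in the study: the function names are chosen here, and the p-value helper relies on a closed form that is valid only for an even number of degrees of freedom (df = 8 in this case).

```python
import math

# (academic, spoken) collocate frequencies copied from Table 7.
TABLE7 = {
    "a": (12, 17),
    "and": (96, 58),
    "in": (22, 13),
    "of": (28, 15),
    "that": (10, 15),
    "the": (94, 77),
    "they": (8, 14),
    "to": (118, 89),
    "we": (9, 33),
}


def chi2_upper_tail_even_df(x2, df):
    """P(X >= x2) for a chi-squared variable; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x2 / 2.0
    term = 1.0
    total = 1.0
    # Sum (x2/2)^k / k! for k = 0 .. df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def independence_test(rows):
    """Pearson's chi-squared test of independence for an r x 2 table."""
    col_totals = [sum(counts[j] for counts in rows.values()) for j in range(2)]
    n = sum(col_totals)
    contributions = {}
    for label, counts in rows.items():
        row_total = sum(counts)
        contributions[label] = sum(
            (observed - row_total * col_totals[j] / n) ** 2
            / (row_total * col_totals[j] / n)
            for j, observed in enumerate(counts)
        )
    x2 = sum(contributions.values())
    df = len(rows) - 1  # (rows - 1) * (columns - 1) with two columns
    return x2, df, contributions


x2, df, contributions = independence_test(TABLE7)
p_value = chi2_upper_tail_even_df(x2, df)
largest = max(contributions, key=contributions.get)
print(f"X-squared = {x2:.3f}, df = {df}, p-value = {p_value:.4g}")
print(f"largest contribution: {largest} ({contributions[largest]:.2f})")
```

The per-collocate contributions show that “we” accounts for the largest share of the overall statistic, which agrees with its being the most strongly significant row in Table 7.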

Bibliography

BIBER D., CONRAD S., REPPEN R., Corpus Linguistics: Investigating language structure and use, Cambridge, Cambridge University Press, 1998.

BUCHOLTZ M., Theories of Discourse as Theories of Gender: Discourse Analysis in Language and Gender Studies, in: The Handbook of Language and Gender, Blackwell Publishing Ltd., 2003.

Corpus of Contemporary American English. N.p., n.d. Accessed 14.11.2019, 15.11.2019. Available online: https://corpus.byu.edu/coca/

HAUSSER R., Computational linguistics: Human-Computer Communication in Natural Language (3rd ed.), Springer, 2014.


SUNDERLAND J., Language and Gender: An advanced resource book, London, Routledge, 2006.

The R Foundation for Statistical Computing, R x64, version 3.4.1, 30 Jun. 2017. Freeware software. Available online: http://cran.r-project.org/

ANTHONY L., TagAnt x64, version 1.2.0, 15 Sep. 2015. Freeware software. Available online: http://www.laurenceanthony.net/software/tagant

ANTHONY L., ProtAnt x64, version 0.1, 21 Mar. 2017. Freeware software. Available online: http://www.laurenceanthony.net/software/protant

ANTHONY L., AntConc 3.5.7, 30 Sep. 2018. Freeware software. Available online

Vocabulary analysis: A corpus-based study of clusters and collocates of the verb “analyze” in academic and spoken discourse.

There are many corpora available online which can be freely annotated with various linguistic features. Many of them constitute an excellent library of examples, and the data they contain can be used for analysis with any linguistic tool. The aim of this study was to examine the variety of collocates and clusters used with the verb “analyze” in two discourses: academic and spoken. The paper presents a description of corpus data (300 examples), previously classified according to selected research categories. Tools of computational linguistics were used here to carry out statistical analyses of the corpus data. Pearson’s chi-squared tests demonstrated the significance of the use of some clusters and collocates in a quantitative comparison within the studied material. In conclusion, the variety in the use of clusters and collocates within the studied discourses may form the basis for further, interesting research.