A Comparable Corpus Driven, Multivariate Approach to Light Verb
Variations in World Chineses
Jingxia LIN², Menghan JIANG¹, and Chu-Ren HUANG¹
¹ The Hong Kong Polytechnic University, ² Nanyang Technological University
Light verbs in Chinese
Similar to English light verbs: take rest, give advice, give description
•Semantically bleached: containing no eventive information
• The predicative content mainly comes from the complement it takes
進行討論 jin4xing2 tao3lun4 ‘have a discussion’
•Being semantically bleached, they do not strongly select their objects
• They can take a wide range of objects, including deverbal nouns, eventive nouns, and sometimes concrete numbers with eventive meaning
• They are sometimes interchangeable with the same nominal object
Underspecified Selectional Restriction of Chinese Light Verbs
•從事 cong2shi4, 搞 gao3, 加以 jia1yi3, 進行 jin4xing2, 做 zuo4 are among the most frequently used (also most typical) light verbs in Modern Chinese
•These five light verbs are sometimes interchangeable
•從事 / 搞 / 加以 / 進行 / 做研究
•cong2shi4/gao3/jia1yi3/jin4xing2/zuo4 yan2jiu1
• “to do research”
Underspecified Selectional Restriction of Chinese Light Verbs II
•Collocation constraints are sometimes found with these light verbs
• e.g., 進行 /* 加以 /* 從事 / 搞 /* 做比賽
jin4xing2/*jia1yi3/*cong2shi4/gao3/*zuo4 bi3sai4
“play a game”
• * 進行 / 加以 /* 從事 /* 搞 /* 做考慮
*jin4xing2/jia1yi3/*cong2shi4/*gao3/*zuo4 kao3lv4
“give consideration”
Variations of Light Verb Usages in Mainland and Taiwan Mandarin Variants
• Even with the very limited collocation constraints, variations still exist: Taiwan light verbs tend to take more types of NPs, and even VPs, as their complements
•進行感恩之旅 / 君子之爭
jin4xing2 gan3en1zhi1lv3/jun1zi3zhi1zheng1
“to proceed with a ‘thanksgiving trip’/‘gentlemen’s dispute’”
•進行抹黑 / 開票
jin4xing2 mo3hei1/kai1piao4
“to proceed with ‘mud-slinging’/‘ballot counting’”
-------(Huang et al. 2013)
Theoretical Challenges for Corpus-based Studies of Chinese Light Verbs
•Can distribution-based statistical analysis identify the differences among different Chinese light verbs?
• The contrasts among the light verbs are often tendencies rather than grammaticality dichotomies; hence the distributional patterns are less prominent and harder to characterize
•Can the subtle light verb variations between different variants of Chinese be identified through statistical analysis based on comparable corpora (cf. Huang et al. 2013)?
Main Research Questions
Facing the above challenges, we try to resolve the following four research questions:
•Can light verbs be differentiated from each other by statistical methods?
•Can the grammatical differences between variants of the same language be empirically verified by distributional features?
•Are these differences statistically significant?
•If the answers to both questions are yes, how do they differ statistically from each other?
• That is, which distributional difference is more prominent: that between two different light verbs, or that between two variants of the same light verb?
Methodology
•A comparable-corpus-driven statistical approach
•加以 jia1yi3, 進行 jin4xing2, 從事 cong2shi4, 搞 gao3, 做 zuo4 in Mainland Mandarin and Taiwan Mandarin
•Statistical methods and tools
• Univariate analysis + multivariate analysis
•Polytomous package in R (Arppe 2008)
Data
•Chinese Gigaword corpus (over 1.1 billion Chinese words)
• Central News Agency (Taiwan, about 700 million characters)
• Xinhua News Agency (Mainland China, about 400 million characters)
•Random sample: 200 sentences for each of the five light verbs in Mainland and Taiwan corpora
• 1,000 in total for Mainland Chinese
• 1,000 in total for Taiwan Chinese
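The sampling step can be sketched as follows. This is a minimal Python illustration, not the authors' procedure: the function name `sample_sentences`, the fixed seed, and the toy hit lists are assumptions made for the example.

```python
import random

LIGHT_VERBS = ["congshi", "gao", "jiayi", "jinxing", "zuo"]

def sample_sentences(hits_by_verb, per_verb=200, seed=0):
    """Draw `per_verb` sentences without replacement for each light verb."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw repeatable
    sample = {}
    for verb in LIGHT_VERBS:
        hits = hits_by_verb[verb]
        if len(hits) < per_verb:
            raise ValueError(f"only {len(hits)} hits for {verb}")
        sample[verb] = rng.sample(hits, per_verb)
    return sample

# Hypothetical stand-in for the concordance hits of one sub-corpus
toy_hits = {verb: [f"{verb}-sentence-{i}" for i in range(500)] for verb in LIGHT_VERBS}
ml_sample = sample_sentences(toy_hits)
total = sum(len(sentences) for sentences in ml_sample.values())
print(total)  # 1000
```

Running the same function once per sub-corpus gives the 1,000 + 1,000 design described above.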
•12 factors (e.g. Zhu 1985, Zhou 1987, Cai 1982, Huang et al. 1995, among others), each given with its code, an example, and its value levels
• OTHERLV: co-occurs with other light verbs. 開始進行比賽 kai1shi3/jin4xing2/bi3sai4 “start the game”. Values: yes, no
• ASP: takes an aspectual marker (著, 了, 過). 昨天進行了比賽 zuo2tian1/jin4xing2/le0/bi3sai4 “played the game yesterday”. Values: no, le, zhe, guo
• EVECOMP: event complement is in subject position. 比賽在學校進行 bi3sai4/zai4/xue2xiao4/jin4xing2 “play the game at school”. Values: yes, no
• POS: part of speech. 進行比賽 (N) jin4xing2/bi3sai4 “play the game”; 進行戰鬥 (V) jin4xing2/zhan4dou4 “fight the battle”. Values: N, V
• ARGSTR: argument structure. 進行調查 (two) jin4xing2/diao4cha2 “carry on investigation”. Values: one, two, zero
• VOCOMP: VO compound as argument. 進行投票 jin4xing2/tou2piao4 “carry on voting”. Values: yes, no
• SPONTEVT: spontaneous/controllable event. 進行投票 jin4xing2/tou2piao4 “carry on voting”. Values: yes, no
• DUREVT: durative event. 進行比賽 jin4xing2/bi3sai4 “play a game”. Values: yes, no
• FOREVT: formal event. 進行訪問 jin4xing2/fang3wen4 “pay an official visit”. Values: yes, no
• PSYEVT: psychological activity. 加以考慮 jia1yi3/kao3lv4 “give consideration”. Values: yes, no
• INTEREVT: event involving interaction of agent and patient. 進行溝通 jin4xing2/gou1tong1 (proceed/communicate) “do communication”. Values: yes, no
• ACCOMPEVT: accomplishment complement. 進行修正 jin4xing2/xiu1zheng4 (proceed/correct) “make corrections/amendments”. Values: yes, no
Mainland Chinese-An overall look of the factors
> str(MLLV3)
'data.frame': 1000 obs. of 13 variables:
 $ LV       : Factor w/ 5 levels "congshi","gao",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ POS      : Factor w/ 2 levels "N","V": 2 2 2 2 1 1 2 2 2 2 ...
 $ ARGSTR   : Factor w/ 3 levels "one","two","zero": 1 1 2 1 3 3 2 1 1 1 ...
 $ VOCOMP   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ EVECOMP  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ OTHERLV  : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1 ...
 $ ASP      : Factor w/ 4 levels "guo","le","no",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SPONTEVT : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ DUREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ FOREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ PSYEVT   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ INTEREVT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ ACCOMPEVT: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
•Among the 12 independent variables, two have only one level
• OTHERLV: co-occurrence of the dependent variable (the light verb) with another light verb
• None of the five light verbs (1,000 sentences) co-occurs with another light verb
• SPONTEVT: spontaneous events as the complement of the light verb
• All five light verbs (1,000 sentences) take spontaneous events as their complements
•These two factors cannot distinguish the five light verbs and are thus excluded from further statistical analysis
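The exclusion of single-level factors can be illustrated with a short Python sketch. The helper name `constant_factors` and the three toy rows are hypothetical; the study itself worked on an R data frame with 1,000 rows.

```python
def constant_factors(rows, factor_names):
    """Return the factors that take only one value across all rows."""
    return [name for name in factor_names
            if len({row[name] for row in rows}) <= 1]

# Hypothetical toy rows standing in for the annotated sentences
rows = [
    {"LV": "congshi", "POS": "V", "OTHERLV": "no", "SPONTEVT": "yes"},
    {"LV": "gao", "POS": "N", "OTHERLV": "no", "SPONTEVT": "yes"},
    {"LV": "jiayi", "POS": "V", "OTHERLV": "no", "SPONTEVT": "yes"},
]
factors = ["POS", "OTHERLV", "SPONTEVT"]

dropped = constant_factors(rows, factors)
kept = [name for name in factors if name not in dropped]
print(dropped)  # ['OTHERLV', 'SPONTEVT']
print(kept)     # ['POS']
```

A factor with a single level carries no information about which light verb occurs, which is why it is removed before any test or model is fitted.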
Univariate analysis of Chinese light verbs
•Chi-squared tests for the significance of the co-occurrence of each factor with individual light verbs
•The chisq.posthoc() function in the Polytomous package automatically transforms the results (standardized Pearson residuals eij (Agresti 2002)) into signs
• “+”: eij > 2, statistically significant overuse of the light verb with the factor
• “-”: eij < -2, statistically significant underuse of the light verb with the factor
• “0”: eij in [-2, 2], lack of statistical significance
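The sign mapping can be sketched in plain Python. This is not the Polytomous implementation: it uses the textbook definition of the standardized Pearson residual, (observed - expected) / sqrt(expected × (1 - row share) × (1 - column share)), and the 2×2 counts are invented for illustration. It assumes no row or column total is zero or equal to the grand total.

```python
import math

def standardized_residuals(table):
    """Standardized Pearson residuals for a two-way contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            denom = math.sqrt(expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            out.append((observed - expected) / denom)
        result.append(out)
    return result

def sign(e, cutoff=2.0):
    """Map a residual to "+", "-", or "0" using the cutoff of 2."""
    if e > cutoff:
        return "+"
    if e < -cutoff:
        return "-"
    return "0"

# Hypothetical counts: rows = two light verbs, columns = factor value yes / no
toy = [[30, 170], [60, 140]]
signs = [[sign(e) for e in row] for row in standardized_residuals(toy)]
print(signs)  # [['-', '+'], ['+', '-']]
```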
Mainland Chinese – a univariate analysis
Four features show no significance (at the p < 0.05 level) in distinguishing the five light verbs.
The table also shows that each light verb has a significant preference for certain factors.
Polytomous Logistic Regression
•加以 / 進行 / 從事 / 搞 / 做 研究
•Jia1yi3/jin4xing2/cong2shi4/gao3/zuo4 yan2jiu1
• “to do research”
•Five light verbs as the possible outcomes
• Estimate the probability of occurrence of each of the potential light verbs
•Polytomous logistic regression
• An extension of standard logistic regression
• Allows for simultaneous estimation of the probability of multiple outcomes (light verbs in the current study)
Main Results of Polytomous for Mainland Chinese
               congshi    gao        jiayi      jinxing    zuo
(Intercept)    (1/Inf)    0.02271    (1/Inf)    (1/Inf)    (1/Inf)
ACCOMPEVTyes   (1/Inf)    0.09863    56.25      0.1849     (1/Inf)
ARGSTRtwo      0.2652     2.895      76.47      (1.481)    0.2177
ARGSTRzero     (1.097)    3.584      (1/Inf)    (1.179)    0.245
ASPle          (0.7487)   (0.1767)   (0.8257)   (0.9196)   (1.853)
ASPno          (Inf)      (1.499)    (Inf)      (0.2307)   (0.2389)
ASPzhe         (1.603)    (1/Inf)    (0.4571)   (Inf)      (1/Inf)
DUREVTyes      (Inf)      (2.958)    (1/Inf)    (Inf)      (Inf)
EVECOMPyes     (1/Inf)    (1.726)    (1/Inf)    3.975      (1.772)
FOREVTyes      (2.744)    (1.227)    (Inf)      (0.7457)   0.2679
INTEREVTyes    0.03255    (0.5281)   (0.5432)   18.67      0.08902
PSYEVTyes      (1/Inf)    (1/Inf)    19.87      (1/Inf)    (0.9619)
VOCOMPyes      (0.1346)   (3.043)    23.54      (1.086)    (0.5344)
• odds>1: the chance of the occurrence of a light verb is significantly increased by the feature (marked in orange)
• odds<1: the chance of the occurrence of a light verb is significantly decreased by the feature (marked in blue)
• Non-significant odds (p-value > 0.05) are given in parentheses
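To make the odds concrete, the arithmetic below turns two of the significant Mainland values for gao into probabilities. It is a rough Python sketch under stated assumptions: gao is treated as the outcome of a single binary logistic model (gao vs. the rest), only significant odds are applied, and every other factor is held at its default level. The function name `context_probability` is introduced here for illustration only.

```python
def context_probability(intercept_odds, feature_odds):
    """Multiply the intercept odds by each feature's odds, then convert to a probability."""
    odds = intercept_odds
    for value in feature_odds:
        odds *= value
    return odds / (1.0 + odds)

# Significant odds for gao, copied from the Mainland table
GAO_INTERCEPT = 0.02271
GAO_ARGSTR_ZERO = 3.584

baseline = context_probability(GAO_INTERCEPT, [])
with_zero_args = context_probability(GAO_INTERCEPT, [GAO_ARGSTR_ZERO])
print(round(baseline, 4), round(with_zero_args, 4))
```

Under these assumptions, odds above 1 raise the estimated probability of gao above its baseline, while odds below 1 would lower it.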
Distributional Contrasts Can Differentiate Light Verb Pairs
Most pairs of light verbs can be effectively differentiated by one or more factors (i.e. those where they have contrasting positive/negative tendencies to appear)
congshi/gao: ARGSTRtwo
congshi/jiayi: ARGSTRtwo
congshi/jinxing: INTEREVTyes
gao/jiayi: ACCOMPEVTyes
gao/zuo: ARGSTRtwo/ARGSTRzero
jiayi/jinxing: ACCOMPEVTyes
jiayi/zuo: ARGSTRtwo
jinxing/zuo: INTEREVTyes
Only two pairs are without contrasting significant features
congshi/zuo
gao/jinxing
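The pairwise contrasts listed above can be recomputed from the significant odds in the Mainland table. The Python sketch below copies only the non-parenthesised values (intercept omitted); the helper `contrasting_features` is a name introduced for this example, and "contrast" is taken to mean that one verb's odds are above 1 while the other's are below 1 for the same feature.

```python
from itertools import combinations

# Significant odds only, copied from the Mainland table (intercept left out)
SIGNIFICANT_ODDS = {
    "congshi": {"ARGSTRtwo": 0.2652, "INTEREVTyes": 0.03255},
    "gao": {"ACCOMPEVTyes": 0.09863, "ARGSTRtwo": 2.895, "ARGSTRzero": 3.584},
    "jiayi": {"ACCOMPEVTyes": 56.25, "ARGSTRtwo": 76.47,
              "PSYEVTyes": 19.87, "VOCOMPyes": 23.54},
    "jinxing": {"ACCOMPEVTyes": 0.1849, "EVECOMPyes": 3.975, "INTEREVTyes": 18.67},
    "zuo": {"ARGSTRtwo": 0.2177, "ARGSTRzero": 0.245,
            "FOREVTyes": 0.2679, "INTEREVTyes": 0.08902},
}

def contrasting_features(a, b):
    """Features significant for both verbs but pointing in opposite directions."""
    shared = SIGNIFICANT_ODDS[a].keys() & SIGNIFICANT_ODDS[b].keys()
    return sorted(f for f in shared
                  if (SIGNIFICANT_ODDS[a][f] > 1) != (SIGNIFICANT_ODDS[b][f] > 1))

contrasts = {pair: contrasting_features(*pair)
             for pair in combinations(SIGNIFICANT_ODDS, 2)}
no_contrast = [pair for pair, feats in contrasts.items() if not feats]

for pair, feats in contrasts.items():
    print("/".join(pair), feats)
print(no_contrast)  # [('congshi', 'zuo'), ('gao', 'jinxing')]
```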
•A probability model is adopted to predict the identity of light verb at its position of occurrence.
•The overall performance of the model is good
• the most frequently predicted light verb of each column corresponds to the light verb that actually occurs in the data (see the red figures)
observed \ predicted   congshi   gao   jiayi   jinxing   zuo
congshi                131       1     62      1         5
gao                    69        16    86      16        13
jiayi                  1         1     192     6         0
jinxing                31        9     47      62        51
zuo                    50        5     44      4         97
PROBABILITY OF OCCURRENCE OF LIGHT VERBS
F-score of Automatic Identification of Five Light Verbs Based on Mainland Mandarin Data
          recall   precision   F-score
congshi   0.655    0.4645      0.5436
gao       0.08     0.5         0.1379
jiayi     0.96     0.4455      0.6086
jinxing   0.31     0.6966      0.4291
zuo       0.485    0.5843      0.5300
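The recall, precision, and F-score values follow directly from the observed-by-predicted counts reported earlier for the Mainland data. A minimal Python sketch, assuming rows are observed light verbs and columns are predicted ones, and that every light verb is predicted at least once:

```python
VERBS = ["congshi", "gao", "jiayi", "jinxing", "zuo"]

# Rows: observed light verb; columns: predicted light verb (Mainland data)
CONFUSION = [
    [131, 1, 62, 1, 5],
    [69, 16, 86, 16, 13],
    [1, 1, 192, 6, 0],
    [31, 9, 47, 62, 51],
    [50, 5, 44, 4, 97],
]

def scores(matrix, labels):
    """Return {label: (recall, precision, f_score)} from a confusion matrix."""
    result = {}
    for k, label in enumerate(labels):
        hits = matrix[k][k]
        observed = sum(matrix[k])                 # row total
        predicted = sum(row[k] for row in matrix)  # column total
        recall = hits / observed
        precision = hits / predicted
        f_score = 2 * precision * recall / (precision + recall)
        result[label] = (recall, precision, f_score)
    return result

table = scores(CONFUSION, VERBS)
for verb, (recall, precision, f_score) in table.items():
    print(f"{verb:8s} {recall:.3f} {precision:.4f} {f_score:.4f}")
```

Small differences in the last digit can arise depending on whether the F-score is computed from rounded or unrounded precision and recall.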
•Each light verb can be successfully identified with a better F-score than chance (0.2), with the exception of 搞 gao3, although the performance varies from light verb to light verb
• 加以 jia1yi3 > 從事 cong2shi4 / 做 zuo4 > 進行 jin4xing2 > 搞 gao3
•加以 jia1yi3 is the only light verb with effective differentiating factors against all other light verbs. All four of its significant factors are positive (i.e. direct evidence for its occurrence).
•從事 cong2shi4 / 做 zuo4: both have only one type of significant factor, and these are negative (i.e. indirect evidence).
•搞 gao3 and 進行 jin4xing2 have both positive and negative factors, which may have cancelled each other out. The significance of their factors is also relatively weak.
•Note that the low F-score of 搞 gao3 is consistent with the linguistic observation that it is rarely used as a light verb in Mainland Mandarin (ML).
Analysis of Outcome (ML)
F-score of Automatic Identification of Five Light Verbs Based on Taiwan Mandarin Data
          recall   precision   F-score
congshi   0.32     0.5614      0.4076
gao       0.695    0.5036      0.5840
jiayi     0.95     0.4139      0.5766
jinxing   0.335    0.5929      0.4281
zuo       0.16     0.8421      0.2689
•Each light verb can be successfully identified with a better F-score than chance (0.2), but the performance varies from light verb to light verb
• 搞 gao3 / 加以 jia1yi3 > 進行 jin4xing2 / 從事 cong2shi4 > 做 zuo4
•搞 gao3 and 加以 jia1yi3 each have only positive significant factors (i.e. direct evidence for their occurrence).
•從事 cong2shi4 has only negative significant factors (i.e. indirect evidence). 進行 jin4xing2 has more positive than negative significant factors.
•做 zuo4 has both types of significant factors, but negative ones outnumber positive ones.
•Linguistically,
Analysis of Outcome (TW)
Key results:
•ML and TW 做 zuo4 show opposite usage tendencies for the feature ARGSTR.two
•ML and TW 進行 jin4xing2 show opposite usage tendencies for the features ASP.le and ASP.no
•But the difference is between a significant and a non-significant feature, rather than between a significant positive and a significant negative feature
Comparison of Mainland and Taiwan light verbs -univariate analysis
Probability estimates of Mainland and Taiwan light verbs by Polytomous
•In both ML and TW, the overall performance of the model is good:
• the most frequently predicted light verb of each column corresponds to the light verb that actually occurs in the data (see the red figures)
•The results also show that, while one light verb has the highest probability given a particular context (a set of factors), other light verbs also have a chance to occur. This is why, empirically, more than one light verb can occur in the same context.
               congshi              gao                  jiayi                jinxing              zuo
               ML        TW         ML        TW         ML        TW         ML        TW         ML        TW
(Intercept)    (1/Inf)   (1/Inf)    0.02271   (1/Inf)    (1/Inf)   (1/Inf)    (1/Inf)   (1/Inf)    (1/Inf)   (1/Inf)
ACCOMPEVTyes   (1/Inf)   (0.3419)   0.09863   (1/Inf)    56.25     11.33      0.1849    (0.1607)   (1/Inf)   0.2272
ARGSTRtwo      0.2652    0.1283     2.895     (0.7615)   76.47     (Inf)      (1.481)   (0.7062)   0.2177    (1.217)
ARGSTRzero     (1.097)   (0.6251)   3.584     7.177      (1/Inf)   (4.382)    (1.179)   0.5393     0.245     0.2075
ASPle          (0.7487)  (1/Inf)    (0.1767)  (1/Inf)    (0.8257)  (0.3027)   (0.9196)  (Inf)      (1.853)   32.98
ASPno          (Inf)     (0.9291)   (1.499)   (0.6946)   (Inf)     (Inf)      (0.2307)  (Inf)      (0.2389)  (0.2386)
ASPzhe         (1.603)   n/a        (1/Inf)   n/a        (0.4571)  n/a        (Inf)     n/a        (1/Inf)   n/a
DUREVTyes      (Inf)     (Inf)      (2.958)   (Inf)      (1/Inf)   (1/Inf)    (Inf)     (0.9575)   (Inf)     (Inf)
EVECOMPyes     (1/Inf)   (1/Inf)    (1.726)   (0.8534)   (1/Inf)   (1/Inf)    3.975     8.115      (1.772)   (0.5016)
FOREVTyes      (2.744)   0.08674    (1.227)   (Inf)      (Inf)     (Inf)      (0.7457)  (1.441)    0.2679    (1.467)
INTEREVTyes    0.03255   0.1896     (0.5281)  (1/Inf)    (0.5432)  (0.951)    18.67     10.46      0.08902   (0.3979)
PSYEVTyes      (1/Inf)   (1/Inf)    (1/Inf)   (1/Inf)    19.87     (1.395)    (1/Inf)   (1/Inf)    (0.9619)  (3.323)
SPONTEVTyes    n/a       (Inf)      n/a       (1/Inf)    n/a       (1/Inf)    n/a       (Inf)      n/a       (Inf)
VOCOMPyes      (0.1346)  0.18       (3.043)   (2.351)    23.54     (Inf)      (1.086)   3.16       (0.5344)  (0.5956)
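The ML/TW comparison can be checked mechanically from the table above. The Python sketch below copies every (light verb, feature) cell in which at least one variety has a significant (non-parenthesised) value, with `None` standing for a non-significant or absent value; the intercept is left out. The three category labels are wording chosen for this example, not the authors' terminology.

```python
from collections import Counter

# (ML odds, TW odds); None = not significant or not in that variety's model
PAIRS = {
    "congshi": {"ARGSTRtwo": (0.2652, 0.1283), "FOREVTyes": (None, 0.08674),
                "INTEREVTyes": (0.03255, 0.1896), "VOCOMPyes": (None, 0.18)},
    "gao": {"ACCOMPEVTyes": (0.09863, None), "ARGSTRtwo": (2.895, None),
            "ARGSTRzero": (3.584, 7.177)},
    "jiayi": {"ACCOMPEVTyes": (56.25, 11.33), "ARGSTRtwo": (76.47, None),
              "PSYEVTyes": (19.87, None), "VOCOMPyes": (23.54, None)},
    "jinxing": {"ACCOMPEVTyes": (0.1849, None), "ARGSTRzero": (None, 0.5393),
                "EVECOMPyes": (3.975, 8.115), "INTEREVTyes": (18.67, 10.46),
                "VOCOMPyes": (None, 3.16)},
    "zuo": {"ACCOMPEVTyes": (None, 0.2272), "ARGSTRtwo": (0.2177, None),
            "ARGSTRzero": (0.245, 0.2075), "ASPle": (None, 32.98),
            "FOREVTyes": (0.2679, None), "INTEREVTyes": (0.08902, None)},
}

def classify(ml, tw):
    """Compare the direction of a feature's effect in the two varieties."""
    if ml is None or tw is None:
        return "significant in one variety only"
    return "same direction" if (ml > 1) == (tw > 1) else "opposite direction"

counts = Counter(classify(ml, tw)
                 for feats in PAIRS.values()
                 for ml, tw in feats.values())
print(counts)
```

On these values no feature is significant in opposite directions in the two varieties, which matches the observation that ML/TW differences are between significant and non-significant features.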
Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
congshi
               ML         TW
(Intercept)    (1/Inf)    (1/Inf)
ACCOMPEVTyes   (1/Inf)    (0.3419)
ARGSTRtwo      0.2652     0.1283
ARGSTRzero     (1.097)    (0.6251)
ASPle          (0.7487)   (1/Inf)
ASPno          (Inf)      (0.9291)
ASPzhe         (1.603)    n/a
DUREVTyes      (Inf)      (Inf)
EVECOMPyes     (1/Inf)    (1/Inf)
FOREVTyes      (2.744)    0.08674
INTEREVTyes    0.03255    0.1896
PSYEVTyes      (1/Inf)    (1/Inf)
SPONTEVTyes    n/a        (Inf)
VOCOMPyes      (0.1346)   0.18
Both have similar, non-contradictory distributional patterns.
They differ only in that TW 從事 cong2shi4 is less likely to take formal events as arguments (FOREVTyes). This is consistent with the intuition that 進行 jin4xing2 is preferred in this context in TW.
gao
               ML         TW
(Intercept)    0.02271    (1/Inf)
ACCOMPEVTyes   0.09863    (1/Inf)
ARGSTRtwo      2.895      (0.7615)
ARGSTRzero     3.584      7.177
ASPle          (0.1767)   (1/Inf)
ASPno          (1.499)    (0.6946)
ASPzhe         (1/Inf)    n/a
DUREVTyes      (2.958)    (Inf)
EVECOMPyes     (1.726)    (0.8534)
FOREVTyes      (1.227)    (Inf)
INTEREVTyes    (0.5281)   (1/Inf)
PSYEVTyes      (1/Inf)    (1/Inf)
SPONTEVTyes    n/a        (1/Inf)
VOCOMPyes      (3.043)    (2.351)
Both have similar, non-contradictory distributional patterns. Both ML and TW 搞 gao3 are significantly favored by
ML 搞 gao3 is less likely to occur with accomplishment objects (ACCOMPEVTyes). This, together with the fact that it is unlikely to occur with the aggregate of default variable values (the intercept), suggests that it is unlikely to be used as a light verb in ML.
jiayi
               ML         TW
(Intercept)    (1/Inf)    (1/Inf)
ACCOMPEVTyes   56.25      11.33
ARGSTRtwo      76.47      (Inf)
ARGSTRzero     (1/Inf)    (4.382)
ASPle          (0.8257)   (0.3027)
ASPno          (Inf)      (Inf)
ASPzhe         (0.4571)   n/a
DUREVTyes      (1/Inf)    (1/Inf)
EVECOMPyes     (1/Inf)    (1/Inf)
FOREVTyes      (Inf)      (Inf)
INTEREVTyes    (0.5432)   (0.951)
PSYEVTyes      19.87      (1.395)
SPONTEVTyes    n/a        (1/Inf)
VOCOMPyes      23.54      (Inf)
Both have similar, non-contradictory distributional patterns.
ML 加以 jia1yi3 is more likely to occur with two arguments (ARGSTRtwo), as well as to take VO compounds or psychological events as objects (VOCOMPyes and PSYEVTyes), which confirms the intuition that it is more frequently used in ML.
jinxing
               ML         TW
(Intercept)    (1/Inf)    (1/Inf)
ACCOMPEVTyes   0.1849     (0.1607)
ARGSTRtwo      (1.481)    (0.7062)
ARGSTRzero     (1.179)    0.5393
ASPle          (0.9196)   (Inf)
ASPno          (0.2307)   (Inf)
ASPzhe         (Inf)      n/a
DUREVTyes      (Inf)      (0.9575)
EVECOMPyes     3.975      8.115
FOREVTyes      (0.7457)   (1.441)
INTEREVTyes    18.67      10.46
PSYEVTyes      (1/Inf)    (1/Inf)
SPONTEVTyes    n/a        (Inf)
VOCOMPyes      (1.086)    3.16
Both have similar, non-contradictory distributional patterns.
ML 進行 jin4xing2 is not likely to take accomplishment objects (ACCOMPEVTyes), while TW 進行 jin4xing2 is very likely to take VO compound objects (VOCOMPyes), consistent with Huang et al. (2013).
zuo
               ML         TW
(Intercept)    (1/Inf)    (1/Inf)
ACCOMPEVTyes   (1/Inf)    0.2272
ARGSTRtwo      0.2177     (1.217)
ARGSTRzero     0.245      0.2075
ASPle          (1.853)    32.98
ASPno          (0.2389)   (0.2386)
ASPzhe         (1/Inf)    n/a
DUREVTyes      (Inf)      (Inf)
EVECOMPyes     (1.772)    (0.5016)
FOREVTyes      0.2679     (1.467)
INTEREVTyes    0.08902    (0.3979)
PSYEVTyes      (0.9619)   (3.323)
SPONTEVTyes    n/a        (Inf)
VOCOMPyes      (0.5344)   (0.5956)
Both have similar, non-contradictory distributional patterns.
Their distributional patterns are consistent with the analysis of 做 zuo4 as the most bleached of the Mandarin light verbs. (The attachment of perfect aspect –le is known to be a shared grammatical potential of all light verbs.)
Conclusion
•This study compares the usage tendencies of Chinese light verbs
• (1) Among five different light verbs
• (2) Between Mainland and Taiwan Mandarin usage of the same light verb
•The comparable-corpus-driven statistical analysis is able to generalize about the similarities and differences among light verbs with different factors
• The contrast between different light verb pairs can be anchored by statistically significant positive vs. statistically significant negative pairs
• The difference between two Chinese varieties for the same light verb, however, is between statistically significant vs. non-significant pairs
•The above result allows us to hypothesize that
• Different light verbs, even with their weak selectional features, can be identified and differentiated by contrasting distributional tendencies
• Variants of the same language, however, do not show contrasting tendencies but can be differentiated by the existence (i.e. significant vs. non-significant) of some distributional tendencies
References
•Arppe, A. (2008) Univariate, bivariate and multivariate methods in corpus-based lexicography – a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3.
•Arppe, A. (2009) Linguistic choices vs. probabilities – how much and what can linguistic theory explain? In: Featherston, S. & S. Winkler (eds.) The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 1–24.
•Arppe, A. (in prep.) Solutions for fixed and mixed effects modeling of polytomous outcome settings.
•Han, Weifeng, Arppe, Antti & Newman, John (2013). Topic marking in a Shanghainese corpus: from observation to prediction. Corpus Linguistics and Linguistic Theory (preprint).
•Butt, M., & Geuder, W. (2001). On the (semi) lexical status of light verbs. Semi-lexical Categories, 323-370.
•Cattell, R. (1984). Composite Predicates in English. Syntax and Semantics Volume 17. Sydney: Academic Press Australia.
•Cai, Wenlan. (1982). Issues on the Complement of ‘jinxing’ (“ 進行”帶賓問題 ). Chinese Language Learning ( 漢語學習 ) (3), 7-11.
•Huang, Chu-Ren and Jingxia Lin. (2013). The ordering of Mandarin Chinese light verbs. Proceedings of the 13th Chinese Lexical Semantics Workshop. D. Ji and G. Xiao (Eds.): CLSW 2012, LNAI 7717, pp. 728-735. Heidelberg: Springer.
•Huang Chu-Ren, Jingxia Lin, and Huarui Zhang (2013). World Chineses based on comparable corpus: The case of grammatical variations of jinxing. 《澳门语言文化研究》 , 397-414.
•Jespersen, O. (1965). A Modern English Grammar on Historical Principles. Part VI, Morphology. London: George Allen and Unwin Ltd.
•Zhou, Gang. (1987a). Subdivision of Dummy Verbs ( 形式動詞的次分類 ). Chinese Language Learning ( 漢語學習 ), 1, 11-14.
•Zhou, Xiaobing. (1987b). Sentence Pattern Comparison of ‘jinxing’ and ‘jiayi’ (“ 進行”“加以”句型比較 ). Chinese Language Learning ( 漢語學習 ), 6, 1-5.
•Zhu, Dexi. (1985). Dummy Verbs and NV in Modern Chinese ( 現代書面漢語里的虛化動詞和名動詞 ). Journal of Peking University (Humanities and Social Sciences) ( 北京大學學報 ( 哲學社會科學版 )), 5, 1-6.
Thank you