Conversational Network in the Chinese Buddhist Canon
Tak-sum Wong and John Sie Yuen Lee
City University of Hong Kong
Conference on Digital Humanities 2015
Application of a corpus
• Linguistic annotation is commonly applied to a corpus to study its language.
• Can we also use dependency relations to analyze the characters in a literary text?
Outline
• What is a treebank?
• Construction of our treebank
• Conversational network of Buddhist text
– Goddess of Mercy
– Who spoke most?
– Mahāyāna vs Hīnayāna
Syntactic tree
Treebanks: What?
• Many types of parse trees
– Example: Stanford dependency parse tree
Treebanks: What?
• A treebank is a collection of syntactically analyzed sentences
– Typically in the form of parse trees
Treebanks: What?
• A dependency tree represents grammatical relations between words
‘Bills’ is the child of ‘submitted’ in the relation nsubjpass
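A dependency tree can be stored as a list of labeled arcs. The sketch below is only an illustration: the `Arc` type, the `children_of` helper, and the fragment "Bills were submitted" are assumptions, and only the nsubjpass arc is taken from the slide.

```python
from collections import namedtuple

# One dependency arc: the child word depends on the head word via a labeled relation.
Arc = namedtuple("Arc", ["child", "relation", "head"])

# Toy tree for the assumed fragment "Bills were submitted".
# Only the nsubjpass arc comes from the slide; the auxpass arc is an illustrative addition.
tree = [
    Arc(child="Bills", relation="nsubjpass", head="submitted"),
    Arc(child="were", relation="auxpass", head="submitted"),
]

def children_of(tree, head):
    """Return (child, relation) pairs for every word attached to `head`."""
    return [(arc.child, arc.relation) for arc in tree if arc.head == head]

print(children_of(tree, "submitted"))
```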
Treebanks: What?
• A tree also includes part-of-speech tags
– Critical for Chinese, since it has no inflectional morphology
Dependency relation: ‘monk’ is a direct object of ‘see’
‘not’ ‘hear’ ‘Buddha’ ‘sutra’ ‘and’ ‘not’ ‘see’ ‘monk’
[He] has neither heard about the Buddhist scriptures nor seen any monk.
Part-of-speech tag: ‘monk’ is a noun
Treebanks: Why?
• Quickly find examples to support linguistic research
– E.g., in the passive structure 為…所…, 為 is sometimes dropped in Buddhist Chinese text
• A feature of Buddhist Chinese
– Easy to search for passive sentences in a treebank
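The kind of query described above might look like the following sketch. Everything in it is hypothetical: the toy sentences use placeholder tokens (X, Y, V), the dictionary layout is invented for illustration, and the use of an `nsubjpass` label to mark passives is an assumption rather than the treebank's actual scheme.

```python
# Toy treebank: each sentence lists its tokens and the dependency labels it contains.
# Tokens X, Y, V are placeholders; none of this is data from the actual treebank.
treebank = [
    {"id": "s1", "tokens": ["X", "為", "Y", "所", "V"], "relations": ["nsubjpass"]},
    {"id": "s2", "tokens": ["X", "Y", "所", "V"], "relations": ["nsubjpass"]},
    {"id": "s3", "tokens": ["X", "V", "Y"], "relations": ["nsubj", "dobj"]},
]

def passives_without_wei(treebank, passive_label="nsubjpass"):
    """Return ids of sentences annotated as passive that do not contain the token 為."""
    return [
        sent["id"]
        for sent in treebank
        if passive_label in sent["relations"] and "為" not in sent["tokens"]
    ]

print(passives_without_wei(treebank))
```

Filtering on the annotation rather than on the surface string is the point of the sketch: a plain text search for 所 cannot tell passive uses from other uses of the same word.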
Treebanks: Why?
• Characterize the “profile” of a word
What can you pray ‘for’, and who can you pray ‘to’?
Word Sketch [Kilgarriff et al. 2004]
Treebanks: Why?
• Sketch differences between ‘clever’ and ‘intelligent’
Compare the meaning of “clever” and “intelligent” by looking at adjectives that collocate with them
Word Sketch [Kilgarriff et al. 2004]
Treebank development
• Training data
– Small-scale treebank created by Lee & Kong (2014)
– 50k characters
– Finely tagged by Buddhist specialists
– POS tag set: adapted from the Penn Chinese Treebank
– Dependency labels: largely follow Stanford Dependencies for Chinese, plus 5 new relations for MC
Treebank development
• Pre-processing:
– Transferred punctuation from the Taishō edition 大正藏 to the Tripiṭaka Koreana 高麗藏
• No parser for Classical Chinese
– Word segmentation, POS-tagging by CRF++
– Dependency parsing by MST parser
– External dictionaries
• Soothill-Hodous Dictionary of Chinese Buddhist Terms
• Person and Place Authority Databases from DDBC
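Word segmentation with a sequence labeler such as CRF++ is usually framed as tagging each character. The sketch below converts a segmented sentence into CRF++-style training rows (one token per line, whitespace-separated columns, label in the last column, blank line between sentences). The B/I/E/S tag scheme, the `to_crfpp_rows` helper, and the toy sentence are assumptions; the slides do not say which scheme or features were used.

```python
def to_crfpp_rows(words):
    """Turn a list of segmented words into 'character<TAB>tag' rows.

    Tags (an assumed scheme): B = word-initial, I = word-internal,
    E = word-final, S = single-character word.
    """
    rows = []
    for word in words:
        if len(word) == 1:
            tags = ["S"]
        else:
            tags = ["B"] + ["I"] * (len(word) - 2) + ["E"]
        rows.extend(f"{ch}\t{tag}" for ch, tag in zip(word, tags))
    return rows

# Toy segmentation, assembled from words that appear elsewhere in the slides.
words = ["釋迦牟尼佛", "告", "阿難", "言"]
rows = to_crfpp_rows(words)

# A trailing blank line marks the end of the sentence in the training file.
print("\n".join(rows) + "\n")
```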
Interesting problems
• How close are the characters in the Buddhist world?
• We aim to answer this question by examining the conversations in Buddhist texts.
Most frequent ‘say’ verbs
• 言 yán ‘to say’ (10979)
• 告 gào ‘to tell; to announce to’ (5401)
• 白……言 bái… yán ‘to address … and say’ (5015)
• 答曰 dáyuē ‘to reply and say’ (2157)
• 曰 yuē ‘to say’ (2126)
• 問 wèn ‘to inquire’ (2091)
• 告……言 gào… yán ‘to tell … and say’ (737)
• 白 bái ‘to address’ (475)
• 語 yù ‘to say’ (453)
17
Extraction of speaker and listener
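One way such an extraction could work on a dependency parse is sketched below. The slides do not spell out the rules, so this is an assumption: the speaker is taken to be the `nsubj` of a say verb and the listener its `dobj`. The `extract_utterance` helper, the label names, and the toy parse are hypothetical.

```python
# Say verbs listed on the previous slide (single-verb forms only).
SAY_VERBS = {"言", "告", "白", "曰", "問", "語", "答曰"}

def extract_utterance(arcs):
    """arcs: list of (child, relation, head) triples for one sentence.

    Returns (speaker, verb, listener), with listener None when no object
    is attached to the say verb, or None when no say verb has a subject.
    """
    for child, relation, head in arcs:
        if relation == "nsubj" and head in SAY_VERBS:
            speaker, verb = child, head
            listener = next(
                (c for c, r, h in arcs if r == "dobj" and h == verb), None
            )
            return speaker, verb, listener
    return None

# Toy parse of 釋迦牟尼佛 告 阿難 言; the labels are illustrative assumptions.
arcs = [
    ("釋迦牟尼佛", "nsubj", "告"),
    ("阿難", "dobj", "告"),
    ("言", "conj", "告"),
]
print(extract_utterance(arcs))
```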
The case of Goddess of Mercy
Kwun Yam, Gwan-eum, Kannon, Guānyīn, and Quan Âm 觀音
Avalokiteśvara
The case of Goddess of Mercy
Distribution of listeners of the Goddess of Mercy (N=195)
• Buddha: 92%
• bodhisattva: 6%
• others: 2%
Distribution of type of saying verbs, the Goddess of Mercy as speaker (N=195)
• address (白): 86%
• unmarked: 11%
• reply: 2%
• tell: 1%
The case of Goddess of Mercy
Distribution of speakers speaking to the Goddess of Mercy (N=143)
• Buddha: 90%
• bodhisattva: 9%
• others: 1%
Distribution of type of saying verbs, the Goddess of Mercy as listener (N=143)
• tell (告): 55%
• unmarked: 41%
• ask/reply: 3%
• address: 1%
Visualization of conversational network
Conversational network
Conversational network of the CBC, showing edges with 200 utterances or more
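A network of this kind can be built by counting utterances per (speaker, listener) pair and keeping only the edges above a threshold. The sketch below is a minimal illustration with toy data; the `build_network` helper and the tiny threshold are assumptions, not the authors' implementation.

```python
from collections import Counter

def build_network(utterances, min_utterances):
    """Count utterances per directed (speaker, listener) edge and
    keep only edges with at least `min_utterances` utterances."""
    edge_counts = Counter(utterances)
    return {edge: n for edge, n in edge_counts.items() if n >= min_utterances}

# Toy (speaker, listener) pairs; real input would come from the parsed canon.
utterances = (
    [("Buddha", "Ānanda")] * 3
    + [("Ānanda", "Buddha")] * 2
    + [("Buddha", "Subhūti")]
)

# The slide uses a threshold of 200; 2 is used here only because the toy data is tiny.
network = build_network(utterances, min_utterances=2)
print(network)
```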
Protagonists
Who talked the most?
• Subhūti
• Maudgalyāyana
• Avalokiteśvara
• Śākyamuni Buddha
Protagonists
Interlocutors of the protagonists
Speak to Listen Ratio
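The slide does not define the ratio, so the sketch below assumes the simplest reading: utterances a character speaks divided by utterances addressed to that character. The `speak_listen_ratio` helper and the toy data are hypothetical.

```python
from collections import Counter

def speak_listen_ratio(utterances):
    """Map each character to (utterances spoken) / (utterances heard).

    Characters who speak but are never addressed are skipped to avoid
    division by zero.
    """
    spoken = Counter(speaker for speaker, _ in utterances)
    heard = Counter(listener for _, listener in utterances)
    return {c: spoken[c] / heard[c] for c in spoken if heard[c] > 0}

# Toy (speaker, listener) pairs.
utterances = (
    [("Buddha", "Ānanda")] * 3
    + [("Ānanda", "Buddha")] * 2
    + [("Buddha", "Subhūti")]
)

ratios = speak_listen_ratio(utterances)
print(ratios)
```

A ratio above 1 marks a character who mostly speaks; a ratio below 1 marks one who mostly listens.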
Buddhist network without Buddha
Mahāyāna and Hīnayāna
Mahāyāna section
Absolute fundamental reality
Perfection of wisdom
Theory of wisdom endowed with insight into emptiness
Conversational network of the Mahāyāna section of the Buddhist Canon, showing edges with more than 280 utterances.
Hīnayāna section
Conversational network of the Hīnayāna section of the Chinese Buddhist Canon, showing edges with 100 or more utterances
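Mahāyāna and Hīnayāna
Mahāyāna section:
• 釋迦牟尼佛 (Śākyamuni Buddha): 17457
• 須菩提 (Subhūti): 5013
• 文殊菩薩 (Mañjuśrī Bodhisattva): 3553
• 舍利弗 (Śāriputra): 3316
• 阿難 (Ānanda): 2702
• 菩薩 (bodhisattva): 1772
• 天子 (devaputra): 1333
• 比丘 (bhikṣu): 924
• 帝釋天 (Śakra): 878
• 彌勒菩薩 (Maitreya Bodhisattva): 690
Hīnayāna section:
• 釋迦牟尼佛 (Śākyamuni Buddha): 15154
• 比丘 (bhikṣu): 10096
• 阿難 (Ānanda): 2273
• 人 (person): 833
• 比丘尼 (bhikṣuṇī): 682
• 舍利弗 (Śāriputra): 612
• 王 (king): 564
• 婆羅門 (brahmin): 457
• 優波離 (Upāli): 456
• 摩訶迦葉 (Mahākāśyapa): 365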
Conclusion
• Built the first corpus of the Chinese Buddhist Canon (CBC), 46 million characters, semi-automatically from only 50k characters of manually annotated data
• Demonstrated how linguistic annotations, specifically dependency relations, can be used to analyze the characters in a large-scale Chinese literary text
– Studied the conversational network
– Statistics, e.g. protagonists, interlocutors of protagonists
– Mahāyāna versus Hīnayāna
Thank you!
Q&A