Buddha Ngram Viewer: an N-gram Visualization Tool of Chinese Buddhist Translations
Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint Meetings 2013, Kyoto University, DEC 10-12, 2013
Building the Digital Research Platform for Chinese Buddhist Literature
Achievements of Digitized Chinese Buddhist Texts
CBETA (Chinese Buddhist Electronic Text Association) was founded in 1998.
In the last 15 years, CBETA has converted a substantial number of Chinese Buddhist scriptures to digital format.
The CBETA 2011 DVD contains more than 160 million Chinese characters.
Statistics of CBETA Digitized Content
Time | Name of Collection | Works (部) | Fascicles | Characters
1998-2003 | Taishō Tripiṭaka (大正藏) | 2,373 | 8,982 | 78,770,000
2004-2007 | Shinsan Zokuzōkyō (卍續藏) | 1,229 | 5,066 | 71,220,000
2008-2011 | Passages concerning Buddhist activities from the Official History (正史佛教資料類編) | 1 | 10 | 333,000
2008-2011 | Buddhist texts not contained in the Tripiṭaka (藏外佛教文獻) | 77 | 136 | 1,663,000
2008-2011 | Selection of stone rubbings from Northern Dynasties (北朝佛教石刻拓片百品) | 100 | 100 | 74,000
2008-2011 | Supplement from other editions of Tripiṭaka (歷代藏經補輯) | 385 | 2,631 | 24,193,000
2012-2013 | Chinese Translations of Pali Canon (Based on Yuan Heng Temple Edition) | 36 | - | ca. 7,500,000
2012-2013 | Selections from the Taiwan National Central Library Buddhist Rare Book Collection | 64 | - | ca. 5,500,000
Total (總計) | | 4,265 | 16,925 | ca. 189,253,000
The Chance and Challenge with “BIG DATA” (I)
The rapid growth of digital resources lets scholars acquire more relevant materials in less time.
However, most digital resources are not integrated. Scholars have to find a more efficient way to master the large amount of data so that they do not drown in the data ocean.
The Chance and Challenge with “BIG DATA” (II)
We also believe that this large amount of digital resources will not only provide a convenient research environment but also help scholars gain new insights.
One very promising approach is to perform text analysis on the Buddhist electronic text corpus to uncover hidden patterns in the texts.
However, this sounds like a very difficult task for Buddhist scholars.
Digital Research Platform for Chinese Buddhist Literature
Main Mission of the Digital Research Platform:
1. Data Providing: Provide complete, integrated reference data in an easily accessible way.
2. Data Organizing: Provide customization tools for users to organize materials into knowledge.
3. Data Analyzing: Provide digital analysis tools for discovering hidden patterns.
Project Information
A two-year project funded by the National Science Council (Digital Humanities Project). It consists of three sub-projects:
Sub-project 1: responsible for digitizing new resources to support this platform (directed by Aming TU).
Sub-project 2: responsible for developing new methodology for analyzing the digital corpus, focusing especially on phonology materials (directed by Chien-Kang Huang).
Sub-project 3: responsible for integrating project results, developing quantitative text analysis tools, and establishing the platform.
Plan for the First Year
Target 1: Build the platform for integrating resources
• Design a good way to integrate digital resources.
• Includes: CBETA full text, catalogue, dictionaries, phonology materials, and other digital resources created by DDBC.
Target 2: Implement text analysis functions
• Build data sets for text analysis.
• Create tools. Example: Buddha Ngram Viewer, which visualizes the occurrences over time of input phrases in Chinese Buddhist texts.
Target 1: Building the Digital Research Platform
Idea of the Research Platform
Our experience: in the last decade, we have carried out more than 20 digital archive projects.
Every database has its own archive content, design principles, and media types.
The only overlap is perhaps the sutra text.
To integrate these resources, we decided to build a feature-rich sutra reading interface and bind other related information to the text.
[Diagram labels: Sutra Reading; Text Analysis Tools; Phonology Materials; Dictionaries; Word Segmentation Tools; Tripitaka Catalogue]
Main Idea of Integration
Catalogue Data
Basic Information
Embeds only the critical apparatus and gaiji information.
Information from the sutra catalogue; clicking here leads to our catalogue project.
N-gram Information
婆羅, 727; 如是, 705; 比丘, 694; 羅門, 693; 沙門, 614
世尊, 477; 如來, 469; 云何, 428; 眾生, 388; 由旬, 387
爾時, 384; 復有, 358; 是為, 346; 阿難, 317; 無有, 313
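The counts listed above read like overlapping character bigrams (婆羅 and 羅門 both appear, as they would if 婆羅門 were split into two-character windows). The slides do not describe how the counts were produced, so the following is only a minimal sketch of one way to compute such a list. The function names and the sample lines are assumptions made for illustration, and a real run would also need to strip punctuation and markup first.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_ngrams(lines, n=2, k=5):
    """Count n-grams line by line, so no n-gram spans a line break,
    and return the k most frequent as (ngram, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(char_ngrams(line, n))
    return counts.most_common(k)


# Made-up sample lines for illustration; not taken from the corpus.
sample = ["婆羅門白佛言", "沙門婆羅門", "爾時世尊告諸比丘"]
print(top_ngrams(sample, n=2, k=3))
```

Counting per line rather than over one concatenated string is a design choice: it avoids creating bigrams from characters that only meet across a line break.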
Other Related Sutra
• Other Parallel Translation
• List of Commentary
• Related Research
Extra Information for Selected Terms
婆羅
Dictionary Lookup
Occurrences of 婆羅 in different time periods
婆羅《丁福保佛學大辭典》【職位】 Vihārapāla ,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」
Information from our glossaries project; clicking here leads to the glossaries project website.
Word Segmentation Tools
Place Name, Person Name, Calendar Lookup
This information is from Buddha Ngram Viewer.
Target 2: Implement Text Analysis Functions
What Is Text Analysis?
Text analysis: using computer software to analyze the textual content of a large corpus, e.g., CBETA. The objective is to discover hidden patterns and derive new insights from them.
The patterns could be:
• Words that are frequently used in one place but never show up anywhere else.
• High-frequency collocations in a group of documents.
• Special usage patterns of commonly used words.
• Other possible and meaningful patterns ……
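As an illustration of the first kind of pattern (words frequent in one place but absent elsewhere), here is a minimal sketch. It assumes the texts have already been segmented into terms and grouped; the function name, the group labels, the threshold, and the sample data are all hypothetical and not part of the project.

```python
from collections import Counter


def distinctive_terms(groups, min_count=2):
    """For each group, list the terms that occur at least min_count times
    in that group and never occur in any other group."""
    totals = {name: Counter(terms) for name, terms in groups.items()}
    result = {}
    for name, counts in totals.items():
        # Collect every term seen in the other groups.
        others = set()
        for other_name, other_counts in totals.items():
            if other_name != name:
                others.update(other_counts)
        result[name] = sorted(
            term for term, c in counts.items()
            if c >= min_count and term not in others
        )
    return result


# Hypothetical, already-segmented term lists for two groups of texts.
groups = {
    "group_a": ["泥洹", "泥洹", "世尊", "比丘"],
    "group_b": ["涅槃", "涅槃", "世尊", "世尊", "比丘"],
}
print(distinctive_terms(groups))
```

In this toy data 世尊 is frequent in group_b but also appears in group_a, so it is not reported; only terms exclusive to one group pass the filter.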
Difficulties in applying text analysis to the CBETA corpus
The data is too complex:
• The textual content and structure of Buddhist works are highly complex.
Analysis tools are very difficult to learn:
• Using general-purpose text analysis tools requires some computer programming skill and advanced statistical knowledge.
How can we help more (humanities) scholars adopt text analysis techniques to address their research questions?
We create easy-to-use tools.
Buddha Ngram Viewer (under construction): http://dev.ddbc.edu.tw/BuddhaNgramViewer/
A tool that lets users visualize how often input phrases occur over time in Chinese Buddhist texts.
Click any point in the chart to start. http://dev.ddbc.edu.tw/BuddhaNgramViewer/
Idea of Buddha Ngram Viewer
Combine search results with sutra translation dates from the Tripitaka catalogue.
Search result in CBReader:
Sutra No. | Sutra Name | Dynasty
T01n0001 | 長阿含經 | 後秦
T01n0005 | 佛般泥洹經 | 西晉
T01n0023 | 大樓炭經 | 西晉
+ Dynasty dates: 後秦 = C.E. 410, 西晉 = C.E. 314
= Number of occurrences of the search term in different time periods.
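The combination described on this slide can be sketched as a small join between per-sutra hit counts and a dynasty-to-year table. This is only an illustration under stated assumptions: the catalogue rows and the two dynasty years are the ones shown on the slide, while the function name, the data layout, and the hit counts are hypothetical and do not reflect the viewer's actual implementation or real search results.

```python
from collections import defaultdict

# Dynasty-to-year values as shown on the slide.
DYNASTY_YEAR = {"後秦": 410, "西晉": 314}

# (sutra number, sutra name, dynasty) rows as shown on the slide.
CATALOGUE = [
    ("T01n0001", "長阿含經", "後秦"),
    ("T01n0005", "佛般泥洹經", "西晉"),
    ("T01n0023", "大樓炭經", "西晉"),
]


def occurrences_by_year(hits, catalogue=CATALOGUE, dynasty_year=DYNASTY_YEAR):
    """Sum per-sutra hit counts into per-year totals, sorted by year."""
    dynasty_of = {number: dynasty for number, _name, dynasty in catalogue}
    totals = defaultdict(int)
    for sutra_no, count in hits.items():
        year = dynasty_year[dynasty_of[sutra_no]]
        totals[year] += count
    return dict(sorted(totals.items()))


# Hypothetical hit counts for one search term; not real search results.
hits = {"T01n0001": 12, "T01n0005": 30, "T01n0023": 4}
print(occurrences_by_year(hits))
```

The resulting year-to-count mapping is the kind of series a chart of occurrences over time would plot, one series per search term.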
[Chart: occurrences of the search terms 泥洹 and 涅槃 over time. Axis labels: Number of occurrences; Western Years; Chinese Dynasties]
Click this point to see the details of C.E. 401
The occurrences of 泥洹 and 涅槃 in the sutras translated in C.E. 401
Scroll down for more sutras
The occurrences in the 22 fascicles of T1 ( 長阿含經 ).
A quick way to understand the frequencies of selected terms in texts.
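A per-fascicle view like this can be approximated by counting each term in each fascicle's text. The sketch below is an assumption-laden illustration: the function name and the shortened fascicle strings are hypothetical, and plain substring counting stands in for whatever matching the viewer actually uses.

```python
def count_per_fascicle(fascicles, terms):
    """Return {term: [count in fascicle 1, count in fascicle 2, ...]}
    using non-overlapping substring counts."""
    return {term: [text.count(term) for text in fascicles] for term in terms}


# Hypothetical fascicle texts, shortened for illustration.
fascicles = ["佛般泥洹後", "入於涅槃,般涅槃", "如是我聞"]
print(count_per_fascicle(fascicles, ["泥洹", "涅槃"]))
```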
Click this point to see the details of the third fascicle of T1
Shows where 泥洹 and 涅槃 match in the third fascicle of T1
Click here to display only matches of 泥洹
Displays only matches of 泥洹 in the third fascicle of T1
Click to view this line in the CBETA text
CBETA Full text of the selected line
Integrating Buddha Ngram Viewer into the Research Platform
婆羅
Dictionary Lookup
Occurrences of 婆羅 over time
婆羅《丁福保佛學大辭典》【職位】 Vihārapāla ,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」
Word Segmentation Tools
Place Name, Person Name, Calendar Lookup
This information is from Buddha Ngram Viewer.
Future Work
Keep adding temporal and spatial information about sutras, drawing on:
• Taishō Shinshū Daizōkyō, Shōwa Hōbō Sōmokuroku.
• The Korean Buddhist Canon: A Descriptive Catalogue by Dr. Lewis R. Lancaster, 1979.
Complete the sutra reading interface and continue to integrate more related information to the platform.
Keep bringing new ideas to the platform.
Ex
Thank you for listening.
Q & A !!