Master's Thesis
Evaluation of
Crowdsourcing Translation Processes
Supervisor: Professor Toru Ishida
Department of Social Informatics
Graduate School of Informatics
Kyoto University
Jun MATSUNO
April 2010 admission
March 2012 completion
Evaluation of Crowdsourcing Translation Processes
Jun MATSUNO
Abstract
Recently, the need for translation has increased with globalization. Companies moving operations overseas need their internal information translated: in Japan, the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates when translation is requested. Requesting translations from professionals yields very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, however, the quality of crowdsourcing translation is not guaranteed. We expect that quality improves when multiple translators work on a translation cooperatively. Crowdsourcing translation by multiple translators is needed for jobs that must cost less than a professional translator while still guaranteeing quality.
However, it is not known how multiple translators should cooperate. In this study, we addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for the evaluation of crowdsourcing
translation processes
The experiment environment is built on a crowdsourcing market. In that market, it must be possible to form cooperative processes involving multiple users and to process translation tasks through those processes. Because no existing crowdsourcing market meets these requirements, we need to build the experiment environment by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
The process in which multiple translators improve a translation in turn does not always yield the best translation. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their two results. Because evaluating every possible translation process is unrealistic, translation processes have to be evaluated efficiently.
We built the experiment environment for evaluating crowdsourcing translation processes using Amazon Mechanical Turk as the crowdsourcing market, and evaluated crowdsourcing translation processes in it. Our contributions to the above challenges are as follows.
1. Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment for crowdsourcing translation on Amazon Mechanical Turk. A variety of tasks, including translation tasks, can be requested there, and APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
2. Efficient evaluation of crowdsourcing translation processes using Amazon
Mechanical Turk
In the experiment environment, we evaluated the parallel and iterative processes by translating 20 Chinese sentences into English. The parallel and iterative processes are the fundamental cooperative processes proposed by a previous study. In addition, using their evaluation results, we approximately evaluated processes that combine the parallel and iterative processes. With three translators, the evaluation values of the parallel process, the iterative process, and the combined process were 4.22, 4.15 and 4.27 respectively; the combined process was the best. By evaluating translation processes approximately, we realized an evaluation of crowdsourcing translation processes that saves both time and cost.
Evaluation of Crowdsourcing Translation Processes
Jun MATSUNO
Summary
In recent years, the need for translation has grown with globalization. For example, when a company moves operations overseas, its internal information must be translated. In Japan the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates for such translation jobs. Requesting a job from a professional translator yields very high quality, but entails high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can also participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, crowdsourcing translation has the problem that translation quality is not sufficiently guaranteed. Quality is expected to improve when multiple translators translate cooperatively. The need for crowdsourcing translation by multiple translators lies in jobs that must cost less than a professional translator while guaranteeing quality.
However, it is not known how multiple translators should cooperate. This study therefore addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for evaluating crowdsourcing translation processes
The experiment environment is built on a crowdsourcing market. That market must allow cooperative processes involving multiple users to be formed and translation tasks to be processed through those processes. However, no such crowdsourcing market exists, so the experiment environment must be built by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
In crowdsourcing translation, having multiple translators revise a translation in turn does not always yield the best result. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their results. Because evaluating every translation process is unrealistic, processes must be evaluated efficiently.
Using Amazon Mechanical Turk as the crowdsourcing market, we built an experiment environment for evaluating crowdsourcing translation processes, and evaluated the processes in that environment. Our contributions to the above challenges are the following two.
1. Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment for crowdsourcing translation on Amazon Mechanical Turk, where a variety of tasks, including translation tasks, can be requested, and where APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
2. Efficient evaluation of crowdsourcing translation processes
In the experiment environment, we evaluated the parallel process and the iterative process by translating 20 Chinese sentences into English. These are the fundamental cooperative processes on Amazon Mechanical Turk proposed by a previous study. Furthermore, using their evaluation results, we approximately evaluated the translation processes that can be formed by combining the two. The evaluation values of translations by the parallel process, the iterative process, and the combined process were 4.22, 4.15 and 4.27 respectively, showing that the combined process was the best. By evaluating translation processes approximately, we realized an evaluation of crowdsourcing translation processes that saves both time and cost.
Contents
Chapter 1 Introduction 1
Chapter 2 Amazon Mechanical Turk 4
2.1 Requester ··············································································· 5
2.2 Worker ·················································································· 8
Chapter 3 Related Works 10
3.1 Translation using Amazon Mechanical Turk ······································ 10
3.2 Increase in Quality of Tasks in Amazon Mechanical Turk ······················ 11
3.3 Relation between Result and Reward of Tasks ··································· 13
3.4 Task Processing with Cooperation ·················································· 16
Chapter 4 Crowdsourcing Translation 20
4.1 Increase in Demand of Crowdsourcing Translation ······························ 20
4.2 Example of Crowdsourcing Translation ··········································· 21
4.3 Increase in Quality of Translations by Cooperative Processes ·················· 23
Chapter 5 Establishment of Experiment Environment 24
5.1 Process of Tasks by Cooperative Processes ······································· 24
5.2 Request of Translation Task ··························································· 25
5.3 Request of Vote Task ································································· 25
5.4 Screening of Workers ································································ 27
Chapter 6 Experiment and Evaluation 30
6.1 Experiment ············································································ 30
6.2 Evaluation ············································································· 32
6.3 Discussion ············································································· 35
6.4 Lessons Learned ······································································ 42
Chapter 7 Conclusion 43
Acknowledgements 45
References 46
Chapter 1 Introduction
Recently, the need for translation has increased with globalization. Companies moving operations overseas need their internal information translated: in Japan, the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates when translation is requested. Requesting translations from professionals yields very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, however, the quality of crowdsourcing translation is not guaranteed. We expect that quality improves when multiple translators work on a translation cooperatively. Crowdsourcing translation by multiple translators is needed for jobs that must cost less than a professional translator while still guaranteeing quality.
However, it is not known how multiple translators should cooperate. In this study, we addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for the evaluation of crowdsourcing
translation processes
The experiment environment is built on a crowdsourcing market. In that market, it must be possible to form cooperative processes involving multiple users and to process translation tasks through those processes. Because no existing crowdsourcing market meets these requirements, we need to build the experiment environment by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
The process in which multiple translators improve a translation in turn does not always yield the best translation. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their two results. Because evaluating every possible translation process is unrealistic, translation processes have to be evaluated efficiently.
We built the experiment environment for evaluating crowdsourcing translation processes using Amazon Mechanical Turk as the crowdsourcing market, and evaluated crowdsourcing translation processes in it. A variety of tasks, including translation tasks, can be requested on Amazon Mechanical Turk, and APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
The processing of translation tasks on Amazon Mechanical Turk has already been proposed [1, 2]. Those studies showed that Amazon Mechanical Turk is usable for translation tasks, but in them translators create or improve translations independently. In our study, translation tasks are processed by multiple translators cooperating on Amazon Mechanical Turk. Compared with the previous studies, we build a better experiment environment for translation tasks by screening translators; screening is necessary because many users on Amazon Mechanical Turk do not process tasks seriously [3].
We used the parallel and iterative processes [4] as the processes for translation by multiple translators. The parallel process suits tasks that improve as more time is spent on them, while the iterative process suits tasks whose purpose is to propose a unique idea. We evaluated the results of applying these processes to translation tasks and, in addition, approximately evaluated processes that combine the parallel and iterative processes.
This paper is organized as follows. Chapter 2 explains Amazon Mechanical Turk, which is used for the evaluation of crowdsourcing processes. Chapter 3 introduces previous studies relevant to this research. Chapter 4 describes a real crowdsourcing translation service on the Web and crowdsourcing translation using cooperative processes. Chapter 5 explains the establishment of the experiment environment for evaluating crowdsourcing translation processes on Amazon Mechanical Turk. Chapter 6 reports the experiment and its evaluation, and Chapter 7 presents the conclusion.
Chapter 2 Amazon Mechanical Turk
Amazon Mechanical Turk1 is a web service operated by Amazon. It allows tasks that humans can solve easily to be processed by many users around the world; typically, large numbers of micro-tasks requiring human intelligence are requested. On Amazon Mechanical Turk a task is called a Human Intelligence Task (HIT), and the users who request and process tasks are called Requesters and Workers respectively.
Figure 2.1 shows the screen for browsing HITs in Amazon Mechanical Turk.
Figure 2.1: Screen for browsing HITs in Amazon Mechanical Turk
We use the first HIT in Figure 2.1 to explain the elements of a HIT. "Verify Businesses' Websites 1" is the title of the HIT, and "Requester: Dolores Labs" is the name of the user requesting it. A requester can set "HIT Expiration Date", "Reward", "Time Allotted" and "HITs Available". "Time Allotted" is the time allowed for processing one task, and "HITs Available" is the number of HITs that the browsing worker can still process. "Description", "Keywords" and "Qualification Required" give the explanation of the HIT, the keywords relating to it, and the qualification required to process it; these appear when the title of the HIT is clicked.

1 https://www.mturk.com/mturk/welcome
2.1 Requester
Only users who can register an address in the United States can become requesters. A requester can request a variety of HITs and freely set their information. Rewards vary from HIT to HIT, but a typical reward is between $0.01 and $0.10.
Amazon provides three mechanisms for guaranteeing the results of HITs on Amazon Mechanical Turk. The first is having multiple workers process the same HIT, which lets a requester select the best of several results. The second allows a requester to set qualifications required to process a HIT; requesters often require a minimum approval rate for a worker's past HITs or a particular country of residence, and can also grant a qualification to workers who pass a requester-created test. The third allows a requester to reject the results of HITs and withhold the reward, although the requester must give the workers a legitimate reason for the rejection.
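The first mechanism, assigning the same HIT to several workers, is typically exploited by taking a majority vote over the redundant results. A minimal sketch of that selection step (the helper name and return shape are our illustration, not part of Mechanical Turk):

```python
from collections import Counter

def majority_result(results):
    """Pick the most frequent answer among redundant HIT results.

    Returns the winning answer and the fraction of workers who gave it.
    Ties are broken by order of first appearance (Counter.most_common).
    """
    if not results:
        raise ValueError("no results submitted")
    answer, count = Counter(results).most_common(1)[0]
    return answer, count / len(results)

# Three workers answered the same verification HIT:
answer, agreement = majority_result(["valid", "valid", "invalid"])
```

For free-text results such as translations, a vote among workers over the candidate outputs plays the same role as this exact-match majority.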
There are two methods for requesting HITs. The first is the GUI provided by Amazon. To request HITs and acquire their results through the GUI, you go through three steps: designing HITs, publishing HITs and managing HITs.
- The design of HITs
A variety of templates is prepared for designing HITs, so you can create the HITs you want from a template. Figure 2.2 shows an example of the screen for designing a HIT; in this example, a HIT for translation from Chinese to English is being designed. HITs have to be written in HTML, and you also input information such as the title and reward at this stage.
- The publication of HITs
In this step, the designed HITs are submitted to Amazon Mechanical Turk. If a HIT needs images, you upload them and check the preview of the HIT. The HITs are then published on Amazon Mechanical Turk once the required cost can be paid. The cost consists of the reward paid to workers and Amazon's commission, which is 10 % of the reward.

Figure 2.2: Example of the screen for the design of a HIT
- The management of HITs
Published HITs are managed in this step. You can check how many HITs have been processed and inspect their results. Figure 2.3 shows the screen for managing HITs; on it, you can approve or reject the results after checking them, and download a CSV file containing the results.
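The publication cost described above is a simple calculation: worker rewards plus Amazon's 10 % commission. A sketch (the 10 % rate is from this chapter; the helper itself is our illustration):

```python
def total_cost(reward_per_hit, num_hits, assignments_per_hit=1, fee_rate=0.10):
    """Total payment for publishing HITs: worker rewards plus Amazon's fee.

    assignments_per_hit covers the case where one HIT is processed
    redundantly by several workers.
    """
    rewards = reward_per_hit * num_hits * assignments_per_hit
    return rewards + rewards * fee_rate

# 20 translation HITs at $0.10 each, 3 workers per HIT: about $6.60 total
cost = total_cost(0.10, 20, assignments_per_hit=3)
```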
The second method for requesting HITs is the Requester API provided by Amazon. The API can be used from a variety of languages such as Java, PHP and Ruby, and tools for Amazon Mechanical Turk have recently been developed and published. Tools make it possible to request HITs more flexibly: although workers normally process HITs independently, with tools it becomes possible to automatically request HITs that use the results of HITs processed by other workers. With the API you can request HITs without the GUI's cumbersome operations and output the results of processed HITs to the program console or to files. HITs requested through the API are also reflected in the screen for managing HITs.
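Requesting a HIT through the API amounts to submitting the same fields the GUI collects. The sketch below only assembles the parameter set, using field names in the style of later MTurk client libraries (the helper and its defaults are our assumption; no request is actually sent):

```python
def build_hit_params(title, description, reward, question_xml,
                     assignments=3, duration_s=3600, lifetime_s=86400):
    """Assemble the fields of a CreateHIT-style request (nothing is sent)."""
    return {
        "Title": title,
        "Description": description,
        "Reward": f"{reward:.2f}",                 # APIs take the reward as a string
        "Question": question_xml,                  # HIT body, e.g. a QuestionForm
        "MaxAssignments": assignments,             # redundant workers per HIT
        "AssignmentDurationInSeconds": duration_s, # "Time Allotted"
        "LifetimeInSeconds": lifetime_s,           # until "HIT Expiration Date"
    }

params = build_hit_params(
    "Translate one Chinese sentence to English",
    "Translate the sentence faithfully; do not use machine translation.",
    0.10,
    "<QuestionForm>...</QuestionForm>",
)
```

A real client would pass these fields to the CreateHIT operation and then poll for submitted assignments.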
Figure 2.3: Screen for the management of HITs
2.2 Worker
Anyone in the world can become a worker. Workers earn rewards by processing tasks, and select HITs based on their content and reward. Amazon announced that, as of spring 2007, there were more than 100,000 workers from more than 100 countries on Amazon Mechanical Turk. A statistical analysis reported that 76 % of workers were American and 8 % Indian as of March 2008, but 56 % American and 36 % Indian as of November 2009 [5]. Even now, most workers are probably American or Indian.
American workers do not feel strongly tied to Amazon Mechanical Turk, whereas Indian workers do. Many American workers may treat processing HITs as a hobby, while many Indian workers expect it to improve their livelihood. It can therefore be expected that many Indian workers, trying to earn rewards efficiently, do not process tasks seriously. In fact, when we requested translation tasks on Amazon Mechanical Turk, we confirmed that many Indian workers submitted machine translations or sloppy translations. Translation tasks are technical, so they carry a higher reward than typical HITs, yet they are processed less appropriately than simpler tasks. Screening workers is therefore necessary to acquire good translation results when translation tasks are requested on Amazon Mechanical Turk; without screening, a requester receives low-quality translations and has a great deal of trouble managing the HITs.
Figure 2.4 shows the screen of a HIT as a worker actually processes it. This HIT requires workers to input information about restaurants found on a designated web site. A worker can see the content of a HIT by clicking the link "View a HIT in this group" on the browsing screen, and can start processing it by pushing the Accept HIT button; pushing Skip HIT instead shows the content of another HIT. If the requester approves the result of a HIT, the worker receives its reward. If the requester rejects it, the worker receives no reward and the worker's approval rate decreases; a worker with a low approval rate cannot process the HITs that requesters restrict to reliable workers.
Figure 2.4: Screen of a HIT
Chapter 3 Related Works
3.1 Translation using Amazon Mechanical Turk
Amazon Mechanical Turk is used in various fields of study. Its most popular use is requesting large numbers of micro HITs, which can save a great deal of cost and time. Studies using Amazon Mechanical Turk in this way include annotation of image data [6], evaluation of visual design [7] and collection of audio data [8]; they have shown that HITs can be completed at low cost with guaranteed quality.
Translation is a technical process that not everyone can perform, so it is generally not a typical HIT on Amazon Mechanical Turk. In fact, few translation-related HITs are requested there, but there have been a few studies of translation using Amazon Mechanical Turk.
One study requested translations of 50 English sentences into French, German, Spanish, Chinese and Urdu on Amazon Mechanical Turk [1]. The Multiple-Translation Chinese Corpus in the LDC Catalog1 and the NIST MT Eval 2008 Urdu-English Test Set, which are commonly used to test the performance of machine translation, supplied the source sentences. The HIT screen carried a notice saying "you must not use machine translations", but many workers ignored it and submitted machine translations to the requester. Those translations were removed by requesting an additional review task on Amazon Mechanical Turk; at least 30 % of the submitted translations turned out to be machine translations. The translations were evaluated with BLEU, a method for the automatic evaluation of machine translation. Figure 3.1 shows the evaluation results. In every language, the translations by workers scored lower than those by professional translators but significantly higher than machine translations. The evaluation would improve further by removing the machine translations from the workers' output, and screening workers in advance is expected to raise the results still more. As for rewards, the HIT for translating one sentence paid $0.10, and the HIT for checking whether a translation was machine-made paid $0.06. The completion times were less than 4 hours, 20 hours, 22.5 hours, 2 days and 4 days for Spanish, French, German, Chinese and Urdu respectively.

1 http://www.ldc.upenn.edu/Catalog/

Figure 3.1: Evaluation results of translations by workers (blue: professional translators, green: workers, orange: machine translation)
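BLEU scores a candidate translation by its clipped n-gram overlap with a reference translation, discounted by a brevity penalty. A simplified single-reference, sentence-level sketch (real BLEU evaluations use multiple references and corpus-level statistics; the add-one smoothing here is our simplification):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, with add-one
    smoothing so that short sentences do not score zero outright."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An exact match scores 1.0, and the score falls as n-gram overlap with the reference decreases.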
3.2 Increase in Quality of Tasks in Amazon Mechanical Turk
Many workers on Amazon Mechanical Turk do not process HITs seriously, which lowers the quality of the results. One approach to avoiding this is screening workers, and there is a study of screening approaches on Amazon Mechanical Turk [3]. In that study, workers were asked demographic questions (age, sex and occupation) and two questions about an e-mail message; only workers who answered correctly obtained the right to process highly paid HITs. The experiment showed the following.
- The proportion of workers who answered the demographic and e-mail questions correctly was 61 %
- Women answered the questions more correctly than men
- The older a worker was, the more correctly the worker answered the questions
Asking workers demographic and simple comprehension questions thus proved very effective for screening. Screening based on processing time was also considered, but processing time did not differ significantly between high-quality and low-quality results.
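The screening in [3] amounts to gating access to the well-paid HITs on a short questionnaire. A schematic version (the question set, answer keys and pass rule here are our illustration):

```python
def passes_screening(answers, gold):
    """Grant the qualification only if every graded screening question is
    answered exactly as expected. Demographic answers may be present in
    `answers` but are recorded, not graded."""
    return all(answers.get(q) == a for q, a in gold.items())

# Hypothetical answer key for the two e-mail comprehension questions:
gold = {"email_q1": "b", "email_q2": "d"}
qualified = passes_screening(
    {"email_q1": "b", "email_q2": "d", "age": "34"}, gold
)
```

A requester would attach the resulting pass/fail decision to a custom qualification, which the well-paid HITs then require.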
Another approach to increasing the quality of results is improving the HIT design. One study showed that quality increased after a HIT was redesigned [9]. The HIT asked workers to evaluate Wikipedia articles. Before the redesign, workers rated the articles on a seven-point Likert scale and wrote how the articles should be improved. After the redesign, workers were additionally asked simple questions about the articles, so that answering them required reading the articles closely. Table 3.1 shows the experiment results.
Table 3.1: Increase in the quality of HITs by the improvement of HIT design

                                  Invalid results   Processing time   Processed within 1 minute
Without the improved HIT design        48.6%              1:30                30.5%
With the improved HIT design            2.5%              4:06                 6.5%
Invalid results are answers that do not state how the Wikipedia articles should be improved. The proportion of results processed within 1 minute was measured because, before the redesign, many results processed within 1 minute were invalid. Table 3.1 indicates that the quality of results increased with the improved HIT design.
3.3 Relation between Result and Reward of Tasks
Crowdsourcing models such as Amazon Mechanical Turk raise a question that has long interested economists and psychologists: how does the reward affect the quality of the results? Traditional economic theory holds that the higher the reward, the better the work. Contrary to that theory, however, many studies have shown that a high reward undermines the intrinsic motivation of enjoying a task and thereby lowers its quality.
On Amazon Mechanical Turk, many HITs are processed quickly and at low cost by many workers, but how correctly they are processed depends on how well the requester can motivate the workers; the reward is one of the extrinsic motivators. Two experiments have been conducted to survey the relation between the reward and the performance of HITs [10]. In these experiments the quality (accuracy) and quantity (number) of results were measured quantitatively. The HITs used did not depend on the workers' ability, and the harder workers worked, the more results they could produce.
The first experiment used a HIT in which workers rearranged pictures, taken at two-second intervals by a traffic camera, into chronological order. The base reward was $0.10, with an additional reward paid according to the workers' effort. After a worker accepted the HIT and provided personal information, the base reward was paid; then a difficulty level (easy: 2 images, medium: 3 images, hard: 4 images) and a reward (low: $0.01, medium: $0.05, high: $0.10 per HIT) were selected at random, and the worker processed the main HITs. The experiment continued until the worker stopped processing HITs or all HITs were processed.
Figure 3.2 shows the relation between the reward and the number of processed HITs, and Figure 3.3 shows the relation between the reward and the accuracy of processed HITs. Figure 3.2 indicates that the higher the reward was, the more HITs were processed, regardless of their difficulty: more of the workers paid $0.10 per HIT processed all of the HITs than of those paid $0.01, and many of the workers paid $0.01 processed fewer than 10 HITs. This result is consistent with the standard economic prediction that higher rewards elicit more work. Figure 3.3, in contrast, indicates that there was no relation between the reward and the accuracy of the processed HITs.

Figure 3.2: Relation between the reward and the number of processed HITs

Figure 3.3: Relation between the reward and the accuracy of processed HITs

Figure 3.4: Relation between the real reward and the reward expected by workers

The different effects of the reward on quantity and quality can be attributed to the anchoring effect, the phenomenon in which a strong first impression (the anchor) influences subsequent judgments. Figure 3.4 shows that the reward workers expected was higher than the real reward; hence differences in the reward did not affect the accuracy of the processed HITs. Nevertheless, workers regarded higher-paid HITs as more valuable, and the higher the reward was, the more HITs were processed.
The second experiment used a HIT in which workers searched for words hidden in a puzzle; workers did not know how many words each puzzle contained. The quantity and quality of processed HITs were measured by the number of completed puzzles and the number of words found, respectively. There were two payment schemes: a quota scheme, which paid a reward each time a puzzle was correctly completed, and a piece-rate scheme, which paid a reward each time a word was found. Four reward levels, including non-payment, were considered under each scheme, giving seven experimental conditions in all. The number of words found, the number of hidden words and the reward paid so far were presented to workers.
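The two payment schemes differ only in what unit triggers a payment: a completed puzzle or a found word. A sketch of the payouts (the per-unit rates below are illustrative, not the ones used in [10]):

```python
def quota_payout(puzzles_fully_solved, reward_per_puzzle):
    """Quota scheme: pay only when every word in a puzzle is found."""
    return puzzles_fully_solved * reward_per_puzzle

def piece_rate_payout(words_found, reward_per_word):
    """Piece-rate scheme: pay for each word found, whether or not the
    puzzle is completed."""
    return words_found * reward_per_word

# A worker finds 18 of 20 words spread over 2 puzzles, completing only 1:
quota = quota_payout(1, 0.50)        # only the completed puzzle pays
piece = piece_rate_payout(18, 0.05)  # every found word pays
```

The all-or-nothing character of the quota scheme is what pushes workers toward finishing each puzzle, which matches the higher quality observed under it.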
Under both the quota scheme and the piece-rate scheme, more tasks were processed and more words were found when a reward was paid than when it was not. The notable difference from the first experiment was that there was no relation between the level of the reward and the quantity of processed tasks; instead, there was a strong relation between how much workers enjoyed the HITs and how many they processed. For example, one worker under the non-payment condition spent five hours completing all the HITs and found all but two of the words. As in the first experiment, there was also no relation between the level of the reward and the quality of the results.
The reward per word was lower in the quota scheme, which paid a reward when all the words in a puzzle were found, than in the piece-rate scheme, which paid a reward for each word found. Nevertheless, the quality of processed HITs was higher under the quota scheme than under the piece-rate scheme, and higher still under non-payment than under the quota scheme. In the quota scheme, no reward was paid unless workers found all the words, so workers tried to find more words per puzzle than under the piece-rate scheme, which raised the quality of processed HITs. Workers were also asked about the value of a puzzle and of a word. The results showed that when no financial reward was paid, workers made an effort to process HITs motivated by rewards other than money. Hence, even if there is a great difference between the reward expected by workers and the actual reward, the accuracy of processed HITs is high under non-payment. On the other hand, when a reward was paid, workers decided whether to process HITs by comparing the actual reward with the reward they expected. The quantity of processed HITs did not increase simply because the reward increased. These experimental results indicate the following.
- The quantity of processed HITs is larger when a reward is paid than when it is not
- The way a reward is paid has a larger effect on the quantity and quality of processed HITs than the amount of the reward
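The per-word cost argument above can be made concrete with a small sketch. The reward amounts and puzzle size below are hypothetical illustrations, not the values from the cited experiment.

```python
# Sketch: effective per-word reward under the two payment schemes.
# The concrete amounts here are hypothetical, not from the study.

def per_word_reward_quota(reward_per_puzzle, words_in_puzzle):
    """Quota scheme: a fixed reward is paid only when the whole puzzle
    is completed, so the effective per-word reward is the puzzle reward
    spread over all hidden words."""
    return reward_per_puzzle / words_in_puzzle

def per_word_reward_piece_rate(reward_per_word):
    """Piece-rate scheme: each found word is paid for directly."""
    return reward_per_word

# A puzzle hiding 10 words, quota reward 0.05$, piece rate 0.01$ per word:
quota = per_word_reward_quota(0.05, 10)   # 0.005$ per word
piece = per_word_reward_piece_rate(0.01)  # 0.010$ per word
assert quota < piece  # quota pays less per word when all words are found
```

Despite the lower per-word cost, the quota scheme produced the higher quality, as described above.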
16
3.4 Task Processing with Cooperation
Some tasks must be processed more efficiently through the cooperation of multiple workers. The creation of Wikipedia articles is one such task. Wikipedia is a collaboration-type crowdsourcing site that promotes collaboration on the web and has users create content cooperatively. In Wikipedia, even users without an account can edit articles. The relation between the increase in the number of editors and the quality of articles in the creation of Wikipedia articles has been surveyed [11]. Editing Wikipedia articles is a task with high interdependency among users, and the cost of cooperation between users can therefore be high. For example, correcting grammar or spelling errors in an article is a task requiring little cooperation, but modifying the structure of an article or resolving disagreement over its content requires a high degree of cooperation because a unified opinion has to be established. That study formulated the hypothesis that an increase in the number of editors benefits not the tasks requiring little cooperation but the tasks requiring a high degree of cooperation. The verification results showed that the hypothesis was correct.
In Amazon Mechanical Turk, an experiment verifying the effectiveness of cooperative processes has been conducted [4]. The parallel and iterative processes represented by Figures 3.5 and 3.6, respectively, were used in this experiment. In these processes, voters participate in a vote task that decides the better processed HIT by majority vote. In the parallel process, the best result is decided after the same HIT has been processed by several workers. In the iterative process, each worker can see the HIT processed by the previous worker. HITs of writing an image description and of suggesting a new company name were used to survey what kinds of HITs these processes are useful for. The numbers of workers and votes were 6 and 5, respectively, and the number of voters per vote was also 5.
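The two process structures can be sketched as a minimal simulation. This is our own illustrative Python, not the code of the cited experiment; the worker and judge functions are stand-ins for human workers and voters.

```python
def majority_vote(a, b, voter_choices):
    """Decide between results a and b given each voter's choice
    ('a' or 'b'); ties go to a, the incumbent result."""
    return b if voter_choices.count('b') > len(voter_choices) / 2 else a

def parallel_process(task, workers, judge):
    """Every worker processes the same task independently; the best
    result is kept through repeated pairwise votes. judge(a, b)
    returns one simulated voter's choice, 'a' or 'b'."""
    results = [w(task) for w in workers]
    best = results[0]
    for r in results[1:]:
        best = majority_vote(best, r, [judge(best, r) for _ in range(3)])
    return best

def iterative_process(task, workers, judge):
    """Each worker sees (and may improve) the previous result; a
    vote decides whether the new version replaces the old one."""
    current = workers[0](task)
    for w in workers[1:]:
        improved = w(current)
        current = majority_vote(current, improved,
                                [judge(current, improved) for _ in range(3)])
    return current
```

For instance, with workers that each append text and a judge preferring longer results, the parallel process keeps the single best independent result, while the iterative process accumulates the successive improvements.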
In the HIT of writing the image description, 0.02$ and 0.01$ were respectively paid
to the worker writing the image description and a voter. The workers inputted the
sentence representing the content of the presented image, and the voters evaluated the
two sentences representing the content of the presented image with the values from 1 to
10. The sentences representing the contents of 30 images were acquired using the
parallel and iterative processes. Figure 3.7 shows the relation between the number of
17
Figure 3.5: Parallel process (The number of workers is n)
Figure 3.6: Iterative process (The number of workers is n)
workers and the evaluation values for the two processes. The table indicated that the
iterative process was useful to the task of writing the image description than the parallel
process. This was because that the longer a sentence was, the better the evaluation value
was.
In the HIT of suggesting a new company name, 0.02$ and 0.01$ were paid to the worker suggesting the name and to each voter, respectively. The voters evaluated two company names with values from 1 to 10. New names for 30 companies were acquired using the parallel and iterative processes. The best evaluation values were 7.3 and 8.3 for the iterative and parallel processes, respectively. It was difficult to obtain a high evaluation value with the iterative process because each worker was influenced by the idea of the previous worker. However, although the best evaluation was obtained with the parallel process, the average values were 6.4 and 6.2 for the iterative and parallel processes, respectively. Figure 3.8 shows the relation between the number of workers and the evaluation values for the iterative process. In the iterative process, the more workers there were, the better the evaluation value was.
The results of these experiments showed that there is a trade-off between the average quality and the best quality of processed tasks. Although the average quality of processed tasks increased in the iterative process, the iterative process decreased the variety of results, which is very important for obtaining the best result. This was because the next worker often used the previous worker's result as a reference in the iterative process.
Figure 3.7: Relation between the number of workers and the evaluation values
(blue: iterative process, red: parallel process)
Figure 3.8: Relation between the number of workers and the evaluation values
(blue: the evaluation values by the iterative process, red: the average
evaluation value by the parallel process)
Chapter 4 Crowdsourcing Translation
4.1 Increase in Demand for Crowdsourcing Translation
The number of professional translators is limited, and much time and cost are needed to request translations from professional translators. Low-cost and quick translations can be realized by crowdsourcing translation because non-professional translators can also participate in it.
Crowdsourcing translation is mainly used for the globalization and overseas transfer of companies. The demand for crowdsourcing translation is expected to increase further given Japan's foreign direct investment in recent years. Japan's foreign direct investment is direct investment in foreign companies by Japanese companies. The more the amount of foreign direct investment increases, the higher the possibility of overseas transfer becomes. Table 4.1 shows the amount of Japanese foreign direct investment in ASEAN (2008-2010) published by the Bank of Japan. Table 4.2 shows a comparison between the amounts of Japanese foreign direct investment in ASEAN in 2010 and 2011, also published by the Bank of Japan. Tables 4.1 and 4.2 indicate that the overseas transfer of Japanese companies to ASEAN has increased recently, mainly because the yen is appreciating. The media reported in 2011 that local governments supported the overseas transfer of smaller businesses. For example, the Ota ward of Tokyo provides consultation on overseas development and supports the translation of foreign documents. The demand for crowdsourcing translation by companies is increasing not only in Japan. Table 4.3 shows the amount of Chinese foreign direct investment (2008-2010) published by the Japan External Trade Organization (JETRO). Table 4.3 indicates that more Chinese companies are transferring overseas.
Crowdsourcing translation can be used not only by companies but also by individuals. Individuals use crowdsourcing translation to send and receive information on the web. Examples of sending information are the translation of a home page, a blog, or the explanation of a created application. An example of receiving information is the translation of news articles from foreign countries. These translations do not have to be perfect; there is no problem as long as their meaning is correct, and the time and cost required for them are expected to be low. Therefore, the translations given here are suitable as translation issues for crowdsourcing translation.
Table 4.1: Amount of foreign direct investment of Japan for ASEAN in recent years
                                    2008   2009   2010
Amount of investment (billion yen)  6,518  6,587  7,711
Table 4.2: Comparison between the amounts of the foreign direct investments of Japan
for ASEAN in 2010 and 2011
                        Amount of investment to ASEAN (billion yen)
First quarter of 2010     666
First quarter of 2011   1,016
Second quarter of 2010  1,867
Second quarter of 2011  2,768
Table 4.3: Amount of the foreign direct investment of China in recent years
                                        2008    2009    2010
Amount of investment (million dollars)  41,859  47,800  59,000
4.2 Example of Crowdsourcing Translation
myGengo1 is given as an example of crowdsourcing translation. myGengo is a service created in Japan, and its purpose is to support the globalization of business. The customers of myGengo include major Japanese companies, and myGengo is one of the proven crowdsourcing translation services. The flow of using myGengo is as follows.
1. A translation issue is ordered through the web site or API of myGengo.
2. Translators registered at myGengo start to process the translation issue.
3. Comments can be exchanged between the requester and the translators during the translation process. Necessary modifications are also performed at no charge after the requester checks the translation.
4. Delivery is notified by e-mail. The translation is sent to the requester automatically if the API is used.

1 http://ja.mygengo.com/
Translators who pass the qualification test can register at myGengo. Translators are classified into the standard and pro levels of translation, and the requester can select the level of a translation. As of spring 2011, more than 1,600 translators were registered at myGengo. The languages handled by myGengo are Japanese, English, Chinese, French, German, Italian and Spanish.
Figure 4.1 shows the screen of a translation request in myGengo. Requesting translations from professional translators generally takes a lot of trouble, but in myGengo anyone can request translations very easily. Actual examples of translations requested through myGengo are the translation of the explanation of an application, of a company's press release and of a company's manual.
Figure 4.1: Screen of the translation request in myGengo
4.3 Increase in Quality of Translations by Cooperative Processes
The quality of a translation is not sufficiently guaranteed because one translator generally takes responsibility for one translation issue in crowdsourcing translation. The quality of a translation can be expected to increase if multiple translators process the translation cooperatively. In this case, the translation is divided among multiple translators appropriately, and the reward is divided among them according to each translator's achievement. How to decide each translator's contribution and pay the reward accordingly is very important, but we do not consider it here. In this study, we focus on the best way to increase the quality of a translation by multiple translators.
myGengo is a controlled crowdsourcing service designed to process translations appropriately. Amazon Mechanical Turk is also a crowdsourcing service, but it is not controlled for translations. Highly technical tasks such as translation are not well suited to HITs on Amazon Mechanical Turk because workers and requesters can process and request HITs very freely there. However, there is no other crowdsourcing service in which translation tasks can be requested and processed through cooperative processes. Therefore, we decided to use Amazon Mechanical Turk.
Chapter 5 Establishment of Experiment Environment
We built an experiment environment for translation from Chinese to English in this study.
5.1 Process of Tasks by Cooperative Processes
We formed crowdsourcing translation processes using the parallel and iterative processes represented by Figures 3.5 and 3.6, respectively. We used TurKit [12] to realize the parallel and iterative processes in Amazon Mechanical Turk. TurKit is a tool for executing iteratively processed HITs in Amazon Mechanical Turk, and it makes it possible to process HITs following a process described as a JavaScript program. Figure 5.1 shows the execution screen of TurKit.

Figure 5.1: Execution screen of TurKit

The process and the content of a HIT are input in the area labeled 1 in Figure 5.1. The results of the HITs processed by workers and voters are output in the area labeled 2 in Figure 5.1. The links to the pages of the HITs requested on Amazon Mechanical Turk are displayed in the area labeled 3 in Figure 5.1. The areas labeled 2 and 3 are updated as the process progresses.
5.2 Request of Translation Task
There are the following two translation tasks in this experiment.
(a) The translation of a source sentence into the target language
(b) The improvement of a translated sentence based on the source sentence
Translation task (a) is processed by the workers in the parallel process and by the first worker in the iterative process, and translation task (b) is processed by the workers other than the first worker in the iterative process.
Figures 5.2 and 5.3 show the request screens of translation tasks (a) and (b) from Chinese to English, respectively. These screens carry the notice "The result of a HIT is rejected if a worker uses machine translation or the quality of a translation is very low." In translation task (b), the translation input by the previous worker is provided, and the worker can also create the translation by modifying the provided one. The notice in translation task (b) says "You can use the provided translation, but you can start over if you don't want to use it." The reward and processing time of translation task (a) are 0.2$ and 60 minutes, respectively, and those of translation task (b) are 0.1$ and 30 minutes. The processing time is long because we want workers to create better translations; for example, we hope that workers look up the translations of technical words in dictionaries on the web.
5.3 Request of Vote Task
Figure 5.4 shows the request screen of the vote task. In this task, workers select the better English translation by comparing two translations of a Chinese sentence. Workers can see the Chinese sentence while processing the vote task. The reward and processing time of the vote task are 0.03$ and 10 minutes, respectively. The processing time is long because we want workers to make a thoughtful choice. The vote task is important because the translators' effort is wasted if the vote does not function correctly.
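The vote task amounts to a simple majority decision between two candidate translations. A minimal sketch follows; the function and the vote encoding are our own illustration, not part of Amazon Mechanical Turk.

```python
from collections import Counter

def decide_better_translation(candidate_a, candidate_b, votes):
    """Return the translation chosen by the majority of voters.
    `votes` holds each voter's choice, 'a' or 'b'; with an odd
    number of voters (e.g. 3) a tie cannot occur."""
    tally = Counter(votes)
    return candidate_a if tally['a'] > tally['b'] else candidate_b

# Three voters compare two English translations of one Chinese sentence:
chosen = decide_better_translation(
    "The driver abandoned the vehicle and fled.",
    "The driver then left the car and ran away.",
    votes=['a', 'b', 'a'])
# two of the three voters chose 'a', so the first candidate wins
```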
Figure 5.2: Request screen of the translation task from Chinese to English
Figure 5.3: Request screen of the improvement of the translation from Chinese to
English
Figure 5.4: Request screen of the vote task
5.4 Screening of Workers
The previous studies [3] and [9] indicate that many workers in Amazon Mechanical Turk do not process HITs seriously. In particular, the translation task requires technical ability, and it is difficult to determine whether a worker processed vote tasks seriously. Thus, the screening of workers is necessary in this experiment.
There are two methods of screening workers in Amazon Mechanical Turk. The first method is to create and publish a formal Qualification test and have workers solve it. However, a formal Qualification test cannot be created and published through the GUI; a program written in Java, Ruby or Perl has to be executed. The correct answer of the Qualification test is prepared in advance, and a Qualification can be given to a worker automatically by comparing the worker's answer with the correct answer. The second method is to request a HIT as a Qualification test. In this case, the title or description of the HIT has to state that it is a Qualification test. Such a Qualification test can be created and published through the GUI, though this method may not be official, and whether a worker's answer is correct can be checked manually. We adopted the second method because translations should be checked manually in a Qualification test.
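The automatic grading in the first method amounts to comparing each worker's answer with a prepared answer key. A rough sketch follows; the normalization rules are our own illustration.

```python
def normalize(answer):
    """Collapse case and extra whitespace so that trivial formatting
    differences do not fail an otherwise correct worker."""
    return " ".join(answer.lower().split())

def grade_qualification(worker_answer, answer_key):
    """Grant the Qualification when the worker's answer matches one
    of the prepared correct answers after normalization."""
    return normalize(worker_answer) in {normalize(a) for a in answer_key}

key = ["I'm sorry, but we don't have such a person here."]
assert grade_qualification("i'm sorry,  but we don't have such a person here. ", key)
assert not grade_qualification("Sorry, no idea.", key)
```

Exact matching of this kind is precisely why the first method is a poor fit for translation: many different English sentences are acceptable, so we check answers manually instead.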
Figure 5.5 shows the request screen of the Qualification test for Chinese-English
translation.
Figure 5.5: Request screen of the Qualification test for the Chinese-English translation
The guideline of the Qualification test says "a worker can get the Qualification for many other translation tasks if the worker translates the Chinese sentence into English correctly" and "a worker must not use machine translation". After solving the translation test, a worker has to answer his or her mother language, country of origin and country of residence; we used this questionnaire to screen workers easily. In the Qualification test, the source sentence is "对不起,我们这里没有这个人。", and an example of a correct translation is "I'm sorry, but we don't have such a person here." We used a very simple test because many workers do not take a test if it is difficult. It is possible to see the number of workers passing a formal Qualification test published in Amazon Mechanical Turk; the number of workers passing a Qualification test for Chinese-English translation published by another requester was around 30, which is very low, although the translation test used in that Qualification test was more difficult than ours. The reward and processing time of our Qualification test are 0.01$ and 5 minutes, respectively. We considered setting the reward of our Qualification test to 0$, but we expected that many workers would not gather if we did so. The processing time is short because the test is so easy that workers do not need to use dictionaries.
Chapter 6 Experiment and Evaluation
6.1 Experiment
We conducted an experiment on translations using the parallel and iterative processes. We aimed to clarify the relation between the number of translators and the quality of translations in the parallel and iterative processes, and the same relation in translation processes combining the parallel and iterative processes.
We published the Qualification test for Chinese-English translation on Amazon Mechanical Turk to gather workers to participate in the experiment. The number of workers passing the Qualification test was 30 one week after the test was published. We set the number of translators and the number of voters per vote to 3 because the number of workers passing the Qualification test was low. The parallel and iterative processes used in our experiment are shown in Figures 6.1 and 6.2, respectively. The experiment was launched one week after the Qualification test was published. The Qualification test remained published while the experiment was being conducted, and workers passing the test were given the Qualification at any time.
The purpose, procedure and hypotheses of the experiment are as follows.
Purpose of the experiment
The evaluation of crowdsourcing translation processes combining the parallel and iterative processes (the number of translators is 3)
Procedure of the experiment
Experiments 1 and 2 are conducted in order.
Experiment 1. The experiment on the relation between the number of translators and the quality of translations in the parallel process
We translate Chinese sentences into English using the parallel process in Amazon Mechanical Turk. We acquire three translations for each Chinese sentence, and the three translations are regarded as the translations by translators 1, 2 and 3 according to the order of acquisition. We acquire the best translation by two votes.
Experiment 2. The experiment on the relation between the number of translators and the quality of translations in the iterative process
Chinese sentences are translated into English using the iterative process, where the translations by translator 1 in Experiment 1 are used as the translations by translator 1.
Figure 6.1: Parallel process (The number of translators is 3)
Figure 6.2: Iterative process (The number of translators is 3)
Hypotheses of the experiment
Hypothesis 1. A better translation is acquired by the iterative process than by the parallel process
Reason for Hypothesis 1. It is considered that the quality of translations increases through iterative modifications (improvements of grammar and spelling errors)
Hypothesis 2. A better translation is acquired by the process combining the parallel and iterative processes (Figure 6.3) than by the iterative process
Reason for Hypothesis 2. The first translation to be improved is expected to have a strong effect on the translators improving it. Thus, it should be possible to increase the quality of the final translation by acquiring the first translation for improvement from multiple translators.
The source sentences and rewards are as follows.
- Source sentences
Five articles each are randomly selected from the sports, society, economy and culture categories of 京报网 (one of the leading news sites in China). The source sentences are 20 sentences, the first sentence of each article.

Figure 6.3: Process with combination of the parallel and iterative processes
(The number of translators is 3)

- Rewards
[Rewards of the translations using the parallel process]
Reward of a translation task: 0.2$
Reward of a vote task: 0.09$ per task (0.03$ per worker)
Reward for one sentence: 0.2×3+0.09×2=0.78$
Reward for 20 sentences: 0.78×20=15.6$
[Rewards of the translations using the iterative process]
Reward of a translation task: 0.1$
Reward of a vote task: 0.09$ per task (0.03$ per worker)
Reward for one sentence: 0.1×2+0.09×2=0.38$
Reward for 20 sentences: 0.38×20=7.6$
[Total reward]
15.6+7.6=23.2$
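The reward arithmetic above can be checked mechanically; the following sketch reproduces the calculation using the values stated in the text.

```python
# Reward arithmetic for the experiment (values from the text).
TRANSLATION_PARALLEL = 0.2   # $ per translation task in the parallel process
TRANSLATION_ITERATIVE = 0.1  # $ per improvement task in the iterative process
VOTE_TASK = 0.03 * 3         # $ per vote task: 3 voters at 0.03$ each
SENTENCES = 20

# Parallel process: 3 translations and 2 votes per sentence.
parallel_per_sentence = TRANSLATION_PARALLEL * 3 + VOTE_TASK * 2    # 0.78$
# Iterative process: translator 1's result is reused from the parallel
# process, so only 2 improvement tasks and 2 votes are paid per sentence.
iterative_per_sentence = TRANSLATION_ITERATIVE * 2 + VOTE_TASK * 2  # 0.38$

total = (parallel_per_sentence + iterative_per_sentence) * SENTENCES  # 23.2$
```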
6.2 Evaluation
We created reference translations by having a Chinese-English bilingual translate the source sentences and a native English speaker modify the English translations. We had 3 English speakers evaluate the translation results on a scale of 1 to 5 (All, Most, Much, Little and None) by comparing them with the reference translations. The evaluation value of a translation is the average of the 3 evaluation values.
Figure 6.4 shows the results of Experiments 1 and 2. The processes shown in Figures 6.3 and 6.5 can also be considered processes combining the parallel and iterative processes. The process shown in Figure 6.3 contains the combination of the parallel and iterative processes using 2 translators, so it is possible to estimate the relation between the number of translators and the quality of translations in this process.
Evaluation value of the translation from vote 1 in the process shown in Figure 6.3
The evaluation value of the translation from vote 1 in the process shown in Figure 6.3 is approximated by the evaluation value of the translation from the parallel process using 2 translators. This evaluation value is 4.27.
Evaluation value of the translation from vote 2 in the process shown in Figure 6.3
The third translator improves the translation in the process shown in Figure 6.3. The degree of improvement by the third translator must differ according to the quality of the translation from vote 1. Here, we compute the degree of improvement for each quality level of the first translation in the iterative process using 2 translators. First, we classify the translations from translators 1 and 2 in the iterative process into 4 categories (1.00-1.99, 2.00-2.99, 3.00-3.99 and 4.00-5.00) according to their evaluation values. We compute the average of the evaluation values for each category. We also compute the average of the evaluation values of the translations chosen by the votes (vote 1 for the translation from translator 1, or vote 2 for the translation from translator 2) for each category. As a result, we obtain the degree of improvement for each quality level of the first translation in the iterative process using 2 translators (Figure 6.6). The evaluation value from vote 2 is the product of the evaluation value of the translation from vote 1 and the degree of improvement corresponding to that value. For example, the evaluation value of the translation from vote 2 is 3.5×3.67/3.33=3.86 if the evaluation value of the translation from vote 1 is 3.5. The evaluation value of the translation from vote 2 in the process shown in Figure 6.3 is thus approximated by applying the corresponding degree of improvement to the value 4.27.
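The estimation procedure can be written out as a short sketch. Only the improvement ratio for the 3.00-3.99 category (3.67/3.33) appears in the worked example above; the other ratios below are placeholders for illustration.

```python
def improvement_ratio(value, ratios):
    """Look up the improvement ratio for the category containing
    `value`. `ratios` maps (low, high) category bounds to the ratio
    of the average post-vote value to the average pre-vote value."""
    for (low, high), r in ratios.items():
        if low <= value <= high:
            return r
    raise ValueError("value outside the 1-5 evaluation scale")

def estimate_vote2_value(vote1_value, ratios):
    """Estimate the evaluation value after the third translator's
    improvement as vote1_value times its category's ratio."""
    return vote1_value * improvement_ratio(vote1_value, ratios)

# Only the 3.00-3.99 ratio (3.67/3.33) is given in the text; the
# other ratios here are placeholders for illustration.
ratios = {(1.00, 1.99): 1.0, (2.00, 2.99): 1.0,
          (3.00, 3.99): 3.67 / 3.33, (4.00, 5.00): 1.0}

round(estimate_vote2_value(3.5, ratios), 2)  # 3.86, as in the text
```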
Figure 6.4 also shows the relation between the number of translators and the quality of translations in the process shown in Figure 6.3. Figure 6.4 indicates that the process shown in Figure 6.3 gives the best translation when 3 translators are used.
It is impossible to approximate the relation between the number of translators and the quality of translations for the process shown in Figure 6.5 in the same way, but in that process the first translation for improvement is acquired from only translator 1. The quality of the first translation for improvement increases when it is acquired from two translators. Because the first translation for improvement has a strong effect on the translators improving it, it is unlikely that the process shown in Figure 6.5 is better than that shown in Figure 6.3.
Figure 6.4: Relation between the number of translators and the quality of translations
Figure 6.5: Process with combination of the iterative and parallel processes
(The number of translators is 3)
Figure 6.6: Difference of the improvement of the translation for each quality of the first
translation in the iterative process
6.3 Discussion
58 workers passed the Qualification test during the 25 days from its publication until the end of the experiment. We had workers answer their mother language, country of origin and country of residence when they took the test. Tables 6.1, 6.2 and 6.3 show the results. Most of the workers' mother languages were Chinese and most of their countries of origin were China, whereas most of their countries of residence were the United States.

Table 6.1: Mother languages of workers
         Chinese  English
numbers  47       11

Table 6.2: Countries of origin of workers
         China  United States  Singapore  Malaysia  other
numbers  31     10             6          6         5
Table 6.3: Countries of residence of workers
         United States  China  Singapore  Malaysia  Vietnam
numbers  36             9      6          6         1
Experiments 1 and 2 were completed in 11 and 7 days, respectively. Experiment 2 was completed earlier than Experiment 1 because the translation by translator 1 in Experiment 1 was reused as the translation by translator 1 in Experiment 2. In the previous study [1], the translation of 50 English sentences into Chinese was completed in 2 days; this was because no Qualification test was used. There must be a trade-off between the speed and quality of translations.
The experiment results show that Hypothesis 1 is not correct. This may be because the number of translators was low in this experiment: it was 3, whereas it was 6 in the previous study [4]. It can be expected that the iterative process gives a better translation than the parallel process if the number of translators increases. The experiment results show that Hypothesis 2 is correct. It can be expected that the parallel process using 2 translators gives the best translation among the parallel processes, because the parallel process using 3 translators is worse than that using 2 translators in Figure 6.4. Thus, a process combining the parallel process using 2 translators with the iterative process would be the best. For example, if the number of translators is 5, the combination of a parallel process using 2 translators and an iterative process using 3 translators would be the best.
However, the numbers of translators in a process and of source sentences were too low to verify the hypotheses, and the number of experiment runs was also low. We consider here the time required to verify the hypotheses. We needed around 18 days to conduct the experiment with 3 translators, so at least 36 days would be needed if the number of translators were 6. The experiment would have to be conducted five times if we followed the previous study [4]. As a result, at least 6 months would be needed for a solid experiment, and the time for the analysis and evaluation of the results would also be needed.
Table 6.4 shows an example of a good translation by the iterative process. In this example, the translation by each translator was better than that by the previous translator, and the votes were performed correctly. Translator 2 created a translation based on the translation by translator 1, but translator 3 did not use most of the translation by translator 2. Translator 3 probably started over because the translation ability of translator 3 was much better than that of translators 1 and 2. Table 6.5 shows an example of a bad translation by the iterative process. The translation by translator 1 was the best in this example, but it could not be selected by vote 2 because the translation by translator 2, which was worse than the translation by translator 1, had been selected by vote 1. The iterative process did not work well when the votes were not performed correctly. The iterative process also did not work well when translator 2 and later translators did not change the translation by the previous translator.
Table 6.4: Example of a good translation by the iterative process

Source Chinese sentence (evaluation value: -):
今晚,北京金隅男篮将在主场迎战浙江广厦。对于已经两连败的北京队来说,能不能重新找回信心,停止连败,这场比赛非常关键。

Correct English translation (evaluation value: -):
Tonight, Beijing Jinyu will hold a game at home against Zhejiang Guangsha. This game is very critical for Beijing to regain confidence from a two-game losing streak.

English translation from translator 1 (evaluation value: 1.67):
Tonight, BBMG against Zhejiang guangsha men's basketball team in the arena. Has been for two season for the Beijing team, can confidence in myself again, stop the season, the match is crucial.

English translation from translator 2 (evaluation value: 2.33):
Tonight, bbmg against Zhejiang guangsha men's basketball team in the arena. Already had two even defeated the Beijing team, can regain confidence, stop continuous defeat, the game is critical.

English translation from translator 3 (evaluation value: 4.33):
Tongiht, BBMG will do battle against Zhejiang Guangsha men's basketball team. The match will be extremely critical for deciding if BBMG, which has lost already two matches in a row, could regain its confidence and stop its string of continuous defeats.

English translation from vote 1 (evaluation value: 2.33):
<English translation from translator 2>

English translation from vote 2 (evaluation value: 4.33):
<English translation from translator 3>
Table 6.5: Example of a bad translation by the iterative process

Source Chinese sentence (evaluation value: -):
前天清晨5点左右,一辆奔驰汽车在工体北门附近将一名青年男子撞伤,司机随即弃车逃逸。两名路过的女大学生见状,和其他3名年轻人一起将伤者送往医院抢救。

Correct English translation (evaluation value: -):
At 5 o'clock the day before yesterday, a Mercedes-Benz car hit a young man near the north gate of Worker Gymnasium. The driver abandoned the vehicle and fled. Two passing-by female students and three other young men sent the injured person to the hospital.

English translation from translator 1 (evaluation value: 4.67):
At 5 o'clock the day before yesterday, a Mercedes-Benz car hit a young man near the Worker's North Gate, the driver abandoned the vehicle and fled. Two female college students reported the incident and took the injured to the hospital with three other young people.

English translation from translator 2 (evaluation value: 2.67):
The day before yesterday at about 5 o'clock in the morning, a Mercedes car near the North Gate of the workers will be a young man was injured, the driver then abandoned the vehicle and escaped. Two female students passing by seeing this, along with other 3 young men wounded and rushed to hospital for treatment.

English translation from translator 3 (evaluation value: 3.33):
At 5 o'clock the day before yesterday, the driver of a Mercedes-Benz abandoned his vehicle and fled after hitting a young man near the Worker's North Gate. Two female college students passing witnessed the incident and took the injured to the hospital with the help of three other young people.

English translation from vote 1 (evaluation value: 2.67):
<English translation from translator 2>

English translation from vote 2 (evaluation value: 3.33):
<English translation from translator 3>
Table 6.6 shows the proportion of correct votes in each process.

Table 6.6: Proportion of correct votes in each process

Process              Proportion of correct votes
Parallel process     67.5% (27/40)
Iterative process    77.5% (31/40)
Total                72.5% (58/80)
Table 6.6 indicates that 72.5% of all votes were correct. Because a vote only
required selecting the better of two translations, some workers selected a
translation at random, and such workers are very difficult to screen automatically.
One method of screening them is to ask workers why they selected a translation, but
the reasons then have to be checked manually, and workers will not write reasons
unless the reward for a vote is increased.
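Even with some random voters, the 58/80 correct votes in Table 6.6 can be checked against pure chance (each voter picking either translation with probability 0.5) using a one-sided binomial test. A minimal sketch, not part of the original experiment:

```python
from math import comb

def binom_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of observing at
    least k correct votes out of n if every vote were random."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Observed vote accuracy from Table 6.6: 58 correct out of 80 votes.
p_value = binom_tail(58, 80)
print(f"P(X >= 58 | n=80, p=0.5) = {p_value:.2e}")
```

The resulting p-value is far below conventional significance levels, so the voters as a group clearly performed better than random selection, even though individual random voters cannot be identified this way.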
Figure 6.7 shows the relation between the number of translators and the quality of
translations in the parallel and iterative processes for the case of a perfect vote
(under the assumption that a translator's improved translation stays the same even
when the translation selected by a perfect vote differs from the one actually
selected in the experiment). If all votes are performed correctly, the quality of
translations in both the parallel and iterative processes increases with the number
of translators. Figure 6.8 shows, for the case of a perfect vote, how much one step
of the iterative process improves a translation for each quality level of the first
translation. The relation between the number of translators and the quality of
translations in the process shown in Figure 6.3 can be approximated by a computation
using Figure 6.8. When all votes are performed correctly, the evaluation value from
vote 1 in the process shown in Figure 6.3 is approximated by the evaluation value of
the parallel process with 2 translators, which is 4.4. The evaluation value from
vote 2 in the process shown in Figure 6.3 is then approximated as 4.54. Figure 6.9
shows the relation between the number of translators and the quality of translations
in the process shown in Figure 6.3. Figures 6.7 and 6.9 indicate that the process
shown in Figure 6.3 is the best.

Figure 6.7: Relation between the number of translators and the quality of translations
in the parallel and iterative processes for the case of a perfect vote
Figure 6.8: Difference of the improvement of the translation for each quality of
the first translation in the iterative process for the case of a perfect vote
Figure 6.9: Relation between the number of translators and quality of translations in the
process with combination of the parallel and iterative processes for the case
of a perfect vote
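The approximation described above (start from the parallel-process value, then add per-step gains read off Figure 6.8) can be sketched as follows. The improvement function here is a hypothetical placeholder, since the numerical data behind Figure 6.8 are not reproduced in the text; the real values would be read off that figure.

```python
# Sketch of the combined-process approximation: a parallel step picks the
# best of the initial translations by vote, then iterative steps improve
# the winner. The gain of one iterative step depends on the quality of the
# translation being improved (the relation plotted in Figure 6.8).

def improvement(quality: float) -> float:
    """HYPOTHETICAL per-step gain for a given input quality, on the 1-5
    evaluation scale: better inputs leave less room for improvement."""
    return max(0.0, 0.7 * (5.0 - quality) / 5.0)

def combined_process_quality(parallel_quality: float, steps: int) -> float:
    """Estimate quality after `steps` iterative improvements applied to
    the winner of the parallel step (perfect votes assumed)."""
    q = parallel_quality
    for _ in range(steps):
        q += improvement(q)
    return q

# The thesis's figure for vote 1 (parallel process, 2 translators) is 4.4;
# one further iterative step then approximates the value from vote 2.
print(combined_process_quality(4.4, steps=1))
```

With the real Figure 6.8 values substituted for `improvement`, this computation yields the 4.54 estimate quoted in the text.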
6.4 Lessons Learned
Amazon Mechanical Turk is a crowdsourcing market for processing a large number of
simple micro tasks. The translation tasks used in this experiment are considered
ill-suited for Amazon Mechanical Turk because translation tasks require high skills.
We found that most of the translation tasks requested on Amazon Mechanical Turk
were processed using machine translation. It was possible to screen malicious workers
by using a simple translation test in this experiment, but a lot of time was required
to complete the experiment because very few workers passed the test. This is because
translation tasks require high skills and considerable effort. Workers on Amazon
Mechanical Turk seemed to prefer simple tasks such as tagging images, which let them
acquire rewards efficiently.
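One simple, automatable complement to a translation test is to flag submissions that are nearly identical to a machine translation of the same source sentence, since most machine-translated submissions are pasted verbatim. A minimal sketch; the token-overlap measure and the threshold value are assumptions for illustration, not part of the experiment:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def looks_machine_translated(submission: str, mt_output: str,
                             threshold: float = 0.9) -> bool:
    """Flag a submission that is nearly identical to a known machine
    translation output. The threshold is a hypothetical tuning value."""
    return jaccard_similarity(submission, mt_output) >= threshold
```

A check like this only catches verbatim pasting; lightly edited machine translations would need a lower threshold or a more robust measure, at the cost of false positives.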
The set a higher reward of a translation task, use of simpler source sentence and
change of source and translated languages are given to complete a translation
experiment more quickly. A more difficult translation test is needed to acquire better
translations. However, the time required to complete a translation experiment increases
because the number of workers passing the test decreases if a more difficult test is used.
Therefore, it is considered that there is a relation of trade-off between the quality and
time of translations.
Chapter 7 Conclusion
A crowdsourcing translation, in which translators voluntarily take on translation
tasks, has attracted attention recently. Crowdsourcing translation can realize
low-cost and rapid translations because non-professional translators can participate.
However, the quality of translations is not guaranteed for the same reason. We
therefore have to get multiple translators to participate in a translation process
and consider how they should translate cooperatively.
In this study, we proposed a method for evaluating crowdsourcing translation
processes using Amazon Mechanical Turk, one of the crowdsourcing markets. On Amazon
Mechanical Turk, a variety of tasks including translation tasks can be requested,
and translation processes involving multiple translators can be freely formed using
its APIs and tools. We built an experiment environment on Amazon Mechanical Turk and
evaluated crowdsourcing translation processes.
The contributions of this study are as follows.
Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment of a crowdsourcing translation using Amazon
Mechanical Turk. A variety of tasks, including translation tasks, can be requested
on Amazon Mechanical Turk, and APIs and tools for forming task-processing workflows
on it have been released. Amazon Mechanical Turk is therefore a suitable
crowdsourcing market on which to build the experiment environment.
Efficient evaluation of crowdsourcing translation processes using Amazon
Mechanical Turk
We evaluated the parallel and iterative processes by translating 20 Chinese
sentences into English in the built experiment environment. The parallel and
iterative processes are the fundamental cooperative processes proposed by a
previous study. In addition, we approximately evaluated the process combining
the parallel and iterative processes, using the evaluation results of the
parallel and iterative processes. With 3 translators, the evaluation values of
the parallel process, the iterative process, and the combined process were
4.22, 4.15, and 4.27, respectively. The evaluation results showed that the
combined process was the best. By evaluating translation processes
approximately, we realized the evaluation of crowdsourcing translation
processes while cutting time and cost.
However, more experiments are needed to make the results of this experiment more
convincing. Crowdsourcing translation processes can be evaluated with the same
method used in this experiment if more translators participate in them.
Acknowledgments
The author would like to express sincere gratitude to the supervisor, Professor Toru
Ishida at Kyoto University, for his continuous guidance, valuable advice, and helpful
discussions.
The author would like to tender his acknowledgments to Associate Professor
Shigeo Matsubara and Assistant Professor Hiromitsu Hattori at Kyoto University
for their technical and constructive advice.
The author would like to express his appreciation to the advisers, Associate
Professor Keishi Tajima at Kyoto University and Researcher Yohei Murakami at the
National Institute of Information and Communications Technology, for their
valuable advice.
This research was partially supported by the Strategic Information and
Communications R&D Promotion Programme (SCOPE) of the Ministry of Internal
Affairs and Communications of Japan and by a Grant-in-Aid for Scientific Research
(A) (21240014, 2009-2011) from the Japan Society for the Promotion of Science (JSPS).
Finally, the author would like to thank all members of Ishida and Matsubara
laboratory for their various supports and discussions.