Master's Thesis
Evaluation of
Crowdsourcing Translation Processes
Supervisor: Professor Toru Ishida
Department of Social Informatics
Graduate School of Informatics
Kyoto University
Jun MATSUNO
April 2010 admission
March 2012 completion
Evaluation of Crowdsourcing Translation Processes
Jun MATSUNO
Abstract
Recently, the need for translation has increased with globalization. Companies moving operations overseas need their internal information translated: in Japan, the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates when translation is requested. Requesting translations from professionals yields very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, however, the quality of crowdsourcing translation is not guaranteed. We expect that quality improves when multiple translators work on a translation cooperatively. Crowdsourcing translation by multiple translators is needed for jobs that must cost less than a professional translator while still guaranteeing quality.
However, it is not known how multiple translators should cooperate. In this study, we addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for the evaluation of crowdsourcing
translation processes
The experiment environment is built on a crowdsourcing market. In that market, it must be possible to form cooperative processes involving multiple users and to process translation tasks through those processes. Because no existing crowdsourcing market meets these requirements, we need to build the experiment environment by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
The process in which multiple translators improve a translation in turn does not always yield the best translation. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their two results. Because evaluating every possible translation process is unrealistic, translation processes have to be evaluated efficiently.
We built the experiment environment for evaluating crowdsourcing translation processes using Amazon Mechanical Turk as the crowdsourcing market, and evaluated crowdsourcing translation processes in it. Our contributions to the above challenges are as follows.
1. Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment for crowdsourcing translation on Amazon Mechanical Turk. A variety of tasks, including translation tasks, can be requested there, and APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
2. Efficient evaluation of crowdsourcing translation processes using Amazon
Mechanical Turk
In the experiment environment, we evaluated the parallel and iterative processes by translating 20 Chinese sentences into English. The parallel and iterative processes are the fundamental cooperative processes proposed by a previous study. In addition, using their evaluation results, we approximately evaluated processes that combine the parallel and iterative processes. With three translators, the evaluation values of the parallel process, the iterative process, and the combined process were 4.22, 4.15 and 4.27 respectively; the combined process was the best. By evaluating translation processes approximately, we realized an evaluation of crowdsourcing translation processes that saves both time and cost.
Evaluation of Crowdsourcing Translation Processes
Jun MATSUNO
Summary
In recent years, the need for translation has grown with globalization. For example, when a company moves operations overseas, its internal information must be translated. In Japan the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates for such translation jobs. Requesting a job from a professional translator yields very high quality, but entails high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can also participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, crowdsourcing translation has the problem that translation quality is not sufficiently guaranteed. Quality is expected to improve when multiple translators translate cooperatively. The need for crowdsourcing translation by multiple translators lies in jobs that must cost less than a professional translator while guaranteeing quality.
However, it is not known how multiple translators should cooperate. This study therefore addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for evaluating crowdsourcing translation processes
The experiment environment is built on a crowdsourcing market. That market must allow cooperative processes involving multiple users to be formed and translation tasks to be processed through those processes. However, no such crowdsourcing market exists, so the experiment environment must be built by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
In crowdsourcing translation, having multiple translators revise a translation in turn does not always yield the best result. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their results. Because evaluating every translation process is unrealistic, processes must be evaluated efficiently.
Using Amazon Mechanical Turk as the crowdsourcing market, we built an experiment environment for evaluating crowdsourcing translation processes, and evaluated the processes in that environment. Our contributions to the above challenges are the following two.
1. Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment for crowdsourcing translation on Amazon Mechanical Turk, where a variety of tasks, including translation tasks, can be requested, and where APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
2. Efficient evaluation of crowdsourcing translation processes
In the experiment environment, we evaluated the parallel process and the iterative process by translating 20 Chinese sentences into English. These are the fundamental cooperative processes on Amazon Mechanical Turk proposed by a previous study. Furthermore, using their evaluation results, we approximately evaluated the translation processes that can be formed by combining the two. The evaluation values of translations by the parallel process, the iterative process, and the combined process were 4.22, 4.15 and 4.27 respectively, showing that the combined process was the best. By evaluating translation processes approximately, we realized an evaluation of crowdsourcing translation processes that saves both time and cost.
Contents
Chapter 1 Introduction 1
Chapter 2 Amazon Mechanical Turk 4
2.1 Requester ··············································································· 5
2.2 Worker ·················································································· 8
Chapter 3 Related Works 10
3.1 Translation using Amazon Mechanical Turk ······································ 10
3.2 Increase in Quality of Tasks in Amazon Mechanical Turk ······················ 11
3.3 Relation between Result and Reward of Tasks ··································· 13
3.4 Task Processing with Cooperation ·················································· 16
Chapter 4 Crowdsourcing Translation 20
4.1 Increase in Demand of Crowdsourcing Translation ······························ 20
4.2 Example of Crowdsourcing Translation ··········································· 21
4.3 Increase in Quality of Translations by Cooperative Processes ·················· 23
Chapter 5 Establishment of Experiment Environment 24
5.1 Process of Tasks by Cooperative Processes ······································· 24
5.2 Request of Translation Task ··························································· 25
5.3 Request of Vote Task ································································· 25
5.4 Screening of Workers ································································ 27
Chapter 6 Experiment and Evaluation 30
6.1 Experiment ············································································ 30
6.2 Evaluation ············································································· 32
6.3 Discussion ············································································· 35
6.4 Lessons Learned ······································································ 42
Chapter 7 Conclusion 43
Acknowledgements 45
References 46
Chapter 1 Introduction
Recently, the need for translation has increased with globalization. Companies moving operations overseas need their internal information translated: in Japan, the strong yen is accelerating the overseas relocation of small businesses, while in China rapid economic growth is accelerating the relocation of companies. At the individual level, translation is needed to publish and obtain Web information written in languages other than one's mother tongue or English. Professional translators are the first candidates when translation is requested. Requesting translations from professionals yields very high quality, but at high cost. Crowdsourcing translation, in which translators volunteer for translation jobs, has therefore been attracting attention. Because non-professional translators can participate, crowdsourcing translation can be inexpensive and fast.
Because non-professional translators can participate, however, the quality of crowdsourcing translation is not guaranteed. We expect that quality improves when multiple translators work on a translation cooperatively. Crowdsourcing translation by multiple translators is needed for jobs that must cost less than a professional translator while still guaranteeing quality.
However, it is not known how multiple translators should cooperate. In this study, we addressed the following two challenges in order to propose a method for evaluating crowdsourcing translation processes.
1. Establishment of an experiment environment for the evaluation of crowdsourcing
translation processes
The experiment environment is built on a crowdsourcing market. In that market, it must be possible to form cooperative processes involving multiple users and to process translation tasks through those processes. Because no existing crowdsourcing market meets these requirements, we need to build the experiment environment by extending the functions of an existing market.
2. Evaluation of crowdsourcing translation processes
The process in which multiple translators improve a translation in turn does not always yield the best translation. For example, with three translators, a better translation may be produced when two translators first translate independently and a third then translates based on their two results. Because evaluating every possible translation process is unrealistic, translation processes have to be evaluated efficiently.
We built the experiment environment for evaluating crowdsourcing translation processes using Amazon Mechanical Turk as the crowdsourcing market, and evaluated crowdsourcing translation processes in it. A variety of tasks, including translation tasks, can be requested on Amazon Mechanical Turk, and APIs and tools for forming processing workflows have been released. Amazon Mechanical Turk is therefore a suitable crowdsourcing market on which to build the experiment environment.
The processing of translation tasks on Amazon Mechanical Turk has already been proposed [1, 2]. Those studies showed that Amazon Mechanical Turk is usable for translation tasks, but in them translators create or improve translations independently. In our study, translation tasks are processed by multiple translators cooperating on Amazon Mechanical Turk. Compared with the previous studies, we build a better experiment environment for translation tasks by screening translators; screening is necessary because many users on Amazon Mechanical Turk do not process tasks seriously [3].
We used the parallel and iterative processes [4] as the processes for translation by multiple translators. The parallel process suits tasks that improve as more time is spent on them, while the iterative process suits tasks whose purpose is to propose a unique idea. We evaluated the results of applying these processes to translation tasks and, in addition, approximately evaluated processes that combine the parallel and iterative processes.
This paper is organized as follows. Chapter 2 explains Amazon Mechanical Turk, which is used for the evaluation of crowdsourcing processes. Chapter 3 introduces previous studies relevant to this research. Chapter 4 describes a real crowdsourcing translation service on the Web and crowdsourcing translation using cooperative processes. Chapter 5 explains the establishment of the experiment environment for evaluating crowdsourcing translation processes on Amazon Mechanical Turk. Chapter 6 reports the experiment and its evaluation, and Chapter 7 presents the conclusion.
Chapter 2 Amazon Mechanical Turk
Amazon Mechanical Turk1 is a web service operated by Amazon. It allows tasks that humans can solve easily to be processed by many users around the world; typically, large numbers of micro-tasks requiring human intelligence are requested. On Amazon Mechanical Turk a task is called a Human Intelligence Task (HIT), and the users who request and process tasks are called Requesters and Workers respectively.
Figure 2.1 shows the screen for browsing HITs in Amazon Mechanical Turk.
Figure 2.1: Screen for browsing HITs in Amazon Mechanical Turk
We use the first HIT in Figure 2.1 to explain the elements of a HIT. "Verify Businesses' Websites 1" is the title of the HIT, and "Requester: Dolores Labs" is the name of the user requesting it. A requester can set "HIT Expiration Date", "Reward", "Time Allotted" and "HITs Available". "Time Allotted" is the time allowed for processing one task, and "HITs Available" is the number of HITs that the browsing worker can still process. "Description", "Keywords" and "Qualification Required" give the explanation of the HIT, the keywords relating to it, and the qualification required to process it; these appear when the title of the HIT is clicked.

1 https://www.mturk.com/mturk/welcome
2.1 Requester
Only users who can register an address in the United States can become requesters. A requester can request a variety of HITs and freely set their information. Rewards vary from HIT to HIT, but a typical reward is between $0.01 and $0.10.
Amazon provides three mechanisms for guaranteeing the results of HITs on Amazon Mechanical Turk. The first is having multiple workers process the same HIT, which lets a requester select the best of several results. The second allows a requester to set qualifications required to process a HIT; requesters often require a minimum approval rate for a worker's past HITs or a particular country of residence, and can also grant a qualification to workers who pass a requester-created test. The third allows a requester to reject the results of HITs and withhold the reward, although the requester must give the workers a legitimate reason for the rejection.
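The first mechanism, assigning the same HIT to several workers, is typically exploited by taking a majority vote over the redundant results. A minimal sketch of that selection step (the helper name and return shape are our illustration, not part of Mechanical Turk):

```python
from collections import Counter

def majority_result(results):
    """Pick the most frequent answer among redundant HIT results.

    Returns the winning answer and the fraction of workers who gave it.
    Ties are broken by order of first appearance (Counter.most_common).
    """
    if not results:
        raise ValueError("no results submitted")
    answer, count = Counter(results).most_common(1)[0]
    return answer, count / len(results)

# Three workers answered the same verification HIT:
answer, agreement = majority_result(["valid", "valid", "invalid"])
```

For free-text results such as translations, a vote among workers over the candidate outputs plays the same role as this exact-match majority.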
There are two methods for requesting HITs. The first is the GUI provided by Amazon. To request HITs and acquire their results through the GUI, you go through three steps: designing HITs, publishing HITs and managing HITs.
- The design of HITs
A variety of templates is prepared for designing HITs, so you can create the HITs you want from a template. Figure 2.2 shows an example of the screen for designing a HIT; in this example, a HIT for translation from Chinese to English is being designed. HITs have to be written in HTML, and you also input information such as the title and reward at this stage.
- The publication of HITs
In this step, the designed HITs are submitted to Amazon Mechanical Turk. If a HIT needs images, you upload them and check the preview of the HIT. The HITs are then published on Amazon Mechanical Turk once the required cost can be paid. The cost consists of the reward paid to workers and Amazon's commission, which is 10 % of the reward.

Figure 2.2: Example of the screen for the design of a HIT
- The management of HITs
Published HITs are managed in this step. You can check how many HITs have been processed and inspect their results. Figure 2.3 shows the screen for managing HITs; on it, you can approve or reject the results after checking them, and download a CSV file containing the results.
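The publication cost described above is a simple calculation: worker rewards plus Amazon's 10 % commission. A sketch (the 10 % rate is from this chapter; the helper itself is our illustration):

```python
def total_cost(reward_per_hit, num_hits, assignments_per_hit=1, fee_rate=0.10):
    """Total payment for publishing HITs: worker rewards plus Amazon's fee.

    assignments_per_hit covers the case where one HIT is processed
    redundantly by several workers.
    """
    rewards = reward_per_hit * num_hits * assignments_per_hit
    return rewards + rewards * fee_rate

# 20 translation HITs at $0.10 each, 3 workers per HIT: about $6.60 total
cost = total_cost(0.10, 20, assignments_per_hit=3)
```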
The second method for requesting HITs is the Requester API provided by Amazon. The API can be used from a variety of languages such as Java, PHP and Ruby, and tools for Amazon Mechanical Turk have recently been developed and published. Tools make it possible to request HITs more flexibly: although workers normally process HITs independently, with tools it becomes possible to automatically request HITs that use the results of HITs processed by other workers. With the API you can request HITs without the GUI's cumbersome operations and output the results of processed HITs to the program console or to files. HITs requested through the API are also reflected in the screen for managing HITs.
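Requesting a HIT through the API amounts to submitting the same fields the GUI collects. The sketch below only assembles the parameter set, using field names in the style of later MTurk client libraries (the helper and its defaults are our assumption; no request is actually sent):

```python
def build_hit_params(title, description, reward, question_xml,
                     assignments=3, duration_s=3600, lifetime_s=86400):
    """Assemble the fields of a CreateHIT-style request (nothing is sent)."""
    return {
        "Title": title,
        "Description": description,
        "Reward": f"{reward:.2f}",                 # APIs take the reward as a string
        "Question": question_xml,                  # HIT body, e.g. a QuestionForm
        "MaxAssignments": assignments,             # redundant workers per HIT
        "AssignmentDurationInSeconds": duration_s, # "Time Allotted"
        "LifetimeInSeconds": lifetime_s,           # until "HIT Expiration Date"
    }

params = build_hit_params(
    "Translate one Chinese sentence to English",
    "Translate the sentence faithfully; do not use machine translation.",
    0.10,
    "<QuestionForm>...</QuestionForm>",
)
```

A real client would pass these fields to the CreateHIT operation and then poll for submitted assignments.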
Figure 2.3: Screen for the management of HITs
2.2 Worker
Anyone in the world can become a worker. Workers earn rewards by processing tasks, and select HITs based on their content and reward. Amazon announced that, as of spring 2007, there were more than 100,000 workers from more than 100 countries on Amazon Mechanical Turk. A statistical analysis reported that 76 % of workers were American and 8 % Indian as of March 2008, but 56 % American and 36 % Indian as of November 2009 [5]. Even now, most workers are probably American or Indian.
American workers do not feel strongly tied to Amazon Mechanical Turk, whereas Indian workers do. Many American workers may treat processing HITs as a hobby, while many Indian workers expect it to improve their livelihood. It can therefore be expected that many Indian workers, trying to earn rewards efficiently, do not process tasks seriously. In fact, when we requested translation tasks on Amazon Mechanical Turk, we confirmed that many Indian workers submitted machine translations or sloppy translations. Translation tasks are technical, so they carry a higher reward than typical HITs, yet they are processed less appropriately than simpler tasks. Screening workers is therefore necessary to acquire good translation results when translation tasks are requested on Amazon Mechanical Turk; without screening, a requester receives low-quality translations and has a great deal of trouble managing the HITs.
Figure 2.4 shows the screen of a HIT as a worker actually processes it. This HIT requires workers to input information about restaurants found on a designated web site. A worker can see the content of a HIT by clicking the link "View a HIT in this group" on the browsing screen, and can start processing it by pushing the Accept HIT button; pushing Skip HIT instead shows the content of another HIT. If the requester approves the result of a HIT, the worker receives its reward. If the requester rejects it, the worker receives no reward and the worker's approval rate decreases; a worker with a low approval rate cannot process the HITs that requesters restrict to reliable workers.
Figure 2.4: Screen of a HIT
Chapter 3 Related Works
3.1 Translation using Amazon Mechanical Turk
Amazon Mechanical Turk is used in various fields of study. Its most popular use is requesting large numbers of micro HITs, which can save a great deal of cost and time. Studies using Amazon Mechanical Turk in this way include annotation of image data [6], evaluation of visual design [7] and collection of audio data [8]; they have shown that HITs can be completed at low cost with guaranteed quality.
Translation is a technical process that not everyone can perform, so it is generally not a typical HIT on Amazon Mechanical Turk. In fact, few translation-related HITs are requested there, but there have been a few studies of translation using Amazon Mechanical Turk.
One study requested translations of 50 English sentences into French, German, Spanish, Chinese and Urdu on Amazon Mechanical Turk [1]. The Multiple-Translation Chinese Corpus in the LDC Catalog1 and the NIST MT Eval 2008 Urdu-English Test Set, which are commonly used to test the performance of machine translation, supplied the source sentences. The HIT screen carried a notice saying "you must not use machine translations", but many workers ignored it and submitted machine translations to the requester. Those translations were removed by requesting an additional review task on Amazon Mechanical Turk; at least 30 % of the submitted translations turned out to be machine translations. The translations were evaluated with BLEU, a method for the automatic evaluation of machine translation. Figure 3.1 shows the evaluation results. In every language, the translations by workers scored lower than those by professional translators but significantly higher than machine translations. The evaluation would improve further by removing the machine translations from the workers' output, and screening workers in advance is expected to raise the results still more. As for rewards, the HIT for translating one sentence paid $0.10, and the HIT for checking whether a translation was machine-made paid $0.06. The completion times were less than 4 hours, 20 hours, 22.5 hours, 2 days and 4 days for Spanish, French, German, Chinese and Urdu respectively.

1 http://www.ldc.upenn.edu/Catalog/

Figure 3.1: Evaluation results of translations by workers (blue: professional translators, green: workers, orange: machine translation)
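BLEU scores a candidate translation by its clipped n-gram overlap with a reference translation, discounted by a brevity penalty. A simplified single-reference, sentence-level sketch (real BLEU evaluations use multiple references and corpus-level statistics; the add-one smoothing here is our simplification):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, with add-one
    smoothing so that short sentences do not score zero outright."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An exact match scores 1.0, and the score falls as n-gram overlap with the reference decreases.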
3.2 Increase in Quality of Tasks in Amazon Mechanical Turk
Many workers on Amazon Mechanical Turk do not process HITs seriously, which lowers the quality of the results. One approach to avoiding this is screening workers, and there is a study of screening approaches on Amazon Mechanical Turk [3]. In that study, workers were asked demographic questions (age, sex and occupation) and two questions about an e-mail message; only workers who answered correctly obtained the right to process highly paid HITs. The experiment showed the following.
- The proportion of workers who answered the demographic and e-mail questions correctly was 61 %
- Women answered the questions more correctly than men
- The older a worker was, the more correctly the worker answered the questions
Asking workers demographic and simple comprehension questions thus proved very effective for screening. Screening based on processing time was also considered, but processing time did not differ significantly between high-quality and low-quality results.
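The screening in [3] amounts to gating access to the well-paid HITs on a short questionnaire. A schematic version (the question set, answer keys and pass rule here are our illustration):

```python
def passes_screening(answers, gold):
    """Grant the qualification only if every graded screening question is
    answered exactly as expected. Demographic answers may be present in
    `answers` but are recorded, not graded."""
    return all(answers.get(q) == a for q, a in gold.items())

# Hypothetical answer key for the two e-mail comprehension questions:
gold = {"email_q1": "b", "email_q2": "d"}
qualified = passes_screening(
    {"email_q1": "b", "email_q2": "d", "age": "34"}, gold
)
```

A requester would attach the resulting pass/fail decision to a custom qualification, which the well-paid HITs then require.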
Another approach to increasing the quality of results is improving the HIT design. One study showed that quality increased after a HIT was redesigned [9]. The HIT asked workers to evaluate Wikipedia articles. Before the redesign, workers rated the articles on a seven-point Likert scale and wrote how the articles should be improved. After the redesign, workers were additionally asked simple questions about the articles, so that answering them required reading the articles closely. Table 3.1 shows the experiment results.
Table 3.1: Increase in the quality of HITs by the improvement of HIT design

                                  Invalid results   Processing time   Processed within 1 minute
Without the improved HIT design        48.6%              1:30                30.5%
With the improved HIT design            2.5%              4:06                 6.5%
Invalid results are answers that do not state how the Wikipedia articles should be improved. The proportion of results processed within 1 minute was measured because, before the redesign, many results processed within 1 minute were invalid. Table 3.1 indicates that the quality of results increased with the improved HIT design.
3.3 Relation between Result and Reward of Tasks
Crowdsourcing models such as Amazon Mechanical Turk raise a question that has long interested economists and psychologists: how does the reward affect the quality of the results? Traditional economic theory holds that the higher the reward, the better the work. Contrary to that theory, however, many studies have shown that a high reward undermines the intrinsic motivation of enjoying a task and thereby lowers its quality.
On Amazon Mechanical Turk, many HITs are processed quickly and at low cost by many workers, but how correctly they are processed depends on how well the requester can motivate the workers; the reward is one of the extrinsic motivators. Two experiments have been conducted to survey the relation between the reward and the performance of HITs [10]. In these experiments the quality (accuracy) and quantity (number) of results were measured quantitatively. The HITs used did not depend on the workers' ability, and the harder workers worked, the more results they could produce.
The first experiment used a HIT in which workers rearranged pictures, taken at two-second intervals by a traffic camera, into chronological order. The base reward was $0.10, with an additional reward paid according to the workers' effort. After a worker accepted the HIT and provided personal information, the base reward was paid; then a difficulty level (easy: 2 images, medium: 3 images, hard: 4 images) and a reward (low: $0.01, medium: $0.05, high: $0.10 per HIT) were selected at random, and the worker processed the main HITs. The experiment continued until the worker stopped processing HITs or all HITs were processed.
Figure 3.2 shows the relation between the reward and the number of processed HITs, and Figure 3.3 shows the relation between the reward and the accuracy of processed HITs. Figure 3.2 indicates that the higher the reward was, the more HITs were processed, regardless of their difficulty: more of the workers paid $0.10 per HIT processed all of the HITs than of those paid $0.01, and many of the workers paid $0.01 processed fewer than 10 HITs. This result is consistent with the standard economic prediction that higher rewards elicit more work. Figure 3.3, in contrast, indicates that there was no relation between the reward and the accuracy of the processed HITs.

Figure 3.2: Relation between the reward and the number of processed HITs

Figure 3.3: Relation between the reward and the accuracy of processed HITs

Figure 3.4: Relation between the real reward and the reward expected by workers

The different effects of the reward on quantity and quality can be attributed to the anchoring effect, the phenomenon in which a strong first impression (the anchor) influences subsequent judgments. Figure 3.4 shows that the reward workers expected was higher than the real reward; hence differences in the reward did not affect the accuracy of the processed HITs. Nevertheless, workers regarded higher-paid HITs as more valuable, and the higher the reward was, the more HITs were processed.
The second experiment used a HIT in which workers searched for words hidden in a puzzle; workers did not know how many words each puzzle contained. The quantity and quality of processed HITs were measured by the number of completed puzzles and the number of words found, respectively. There were two payment schemes: a quota scheme, which paid a reward each time a puzzle was correctly completed, and a piece-rate scheme, which paid a reward each time a word was found. Four reward levels, including non-payment, were considered under each scheme, giving seven experimental conditions in all. The number of words found, the number of hidden words and the reward paid so far were presented to workers.
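The two payment schemes differ only in what unit triggers a payment: a completed puzzle or a found word. A sketch of the payouts (the per-unit rates below are illustrative, not the ones used in [10]):

```python
def quota_payout(puzzles_fully_solved, reward_per_puzzle):
    """Quota scheme: pay only when every word in a puzzle is found."""
    return puzzles_fully_solved * reward_per_puzzle

def piece_rate_payout(words_found, reward_per_word):
    """Piece-rate scheme: pay for each word found, whether or not the
    puzzle is completed."""
    return words_found * reward_per_word

# A worker finds 18 of 20 words spread over 2 puzzles, completing only 1:
quota = quota_payout(1, 0.50)        # only the completed puzzle pays
piece = piece_rate_payout(18, 0.05)  # every found word pays
```

The all-or-nothing character of the quota scheme is what pushes workers toward finishing each puzzle, which matches the higher quality observed under it.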
Under both the quota scheme and the piece-rate scheme, more tasks were processed and more words were found when a reward was paid than when it was not. The notable difference from the first experiment was that there was no relation between the level of the reward and the quantity of processed tasks; instead, there was a strong relation between how much workers enjoyed the HITs and how many they processed. For example, one worker under the non-payment condition spent five hours completing all the HITs and found all but two of the words. As in the first experiment, there was also no relation between the level of the reward and the quality of the results.
The reward per word was lower in the quota scheme, which paid a reward when all the words in a puzzle were found, than in the piece-rate scheme, which paid a reward for each word found. Nevertheless, the quality of processed HITs was higher under the quota scheme than under the piece-rate scheme, and higher still under non-payment than under the quota scheme. In the quota scheme, no reward was paid unless workers found all the words, so workers tried to find more words per puzzle than under the piece-rate scheme, which raised the quality of processed HITs. Workers were also asked about the value of a puzzle and of a word. The results showed that when no financial reward was paid, workers made an effort to process HITs motivated by rewards other than money. Hence, even if there is a great difference between the reward expected by workers and the actual reward, the accuracy of processed HITs is high under non-payment. On the other hand, when a reward was paid, workers decided whether to process HITs by comparing the actual reward with the reward they expected. The quantity of processed HITs did not increase simply because the reward increased. These experimental results indicate the following.
- The quantity of processed HITs is larger when a reward is paid than when it is not
- The way a reward is paid has a larger effect on the quantity and quality of processed HITs than the amount of the reward
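The per-word cost argument above can be made concrete with a small sketch. The reward amounts and puzzle size below are hypothetical illustrations, not the values from the cited experiment.

```python
# Sketch: effective per-word reward under the two payment schemes.
# The concrete amounts here are hypothetical, not from the study.

def per_word_reward_quota(reward_per_puzzle, words_in_puzzle):
    """Quota scheme: a fixed reward is paid only when the whole puzzle
    is completed, so the effective per-word reward is the puzzle reward
    spread over all hidden words."""
    return reward_per_puzzle / words_in_puzzle

def per_word_reward_piece_rate(reward_per_word):
    """Piece-rate scheme: each found word is paid for directly."""
    return reward_per_word

# A puzzle hiding 10 words, quota reward 0.05$, piece rate 0.01$ per word:
quota = per_word_reward_quota(0.05, 10)   # 0.005$ per word
piece = per_word_reward_piece_rate(0.01)  # 0.010$ per word
assert quota < piece  # quota pays less per word when all words are found
```

Despite the lower per-word cost, the quota scheme produced the higher quality, as described above.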
16
3.4 Task Processing with Cooperation
Some tasks must be processed more efficiently through the cooperation of multiple workers. The creation of Wikipedia articles is one such task. Wikipedia is a collaboration-type crowdsourcing site that promotes collaboration on the web and has users create content cooperatively. In Wikipedia, even users without an account can edit articles. The relation between the increase in the number of editors and the quality of articles in the creation of Wikipedia articles has been surveyed [11]. Editing Wikipedia articles is a task with high interdependency among users, and the cost of cooperation between users can therefore be high. For example, correcting grammar or spelling errors in an article is a task requiring little cooperation, but modifying the structure of an article or resolving disagreement over its content requires a high degree of cooperation because a unified opinion has to be established. That study formulated the hypothesis that an increase in the number of editors benefits not the tasks requiring little cooperation but the tasks requiring a high degree of cooperation. The verification results showed that the hypothesis was correct.
In Amazon Mechanical Turk, an experiment verifying the effectiveness of cooperative processes has been conducted [4]. The parallel and iterative processes represented by Figures 3.5 and 3.6, respectively, were used in this experiment. In these processes, voters participate in a vote task that decides the better processed HIT by majority vote. In the parallel process, the best result is decided after the same HIT has been processed by several workers. In the iterative process, each worker can see the HIT processed by the previous worker. HITs of writing an image description and of suggesting a new company name were used to survey what kinds of HITs these processes are useful for. The numbers of workers and votes were 6 and 5, respectively, and the number of voters per vote was also 5.
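The two process structures can be sketched as a minimal simulation. This is our own illustrative Python, not the code of the cited experiment; the worker and judge functions are stand-ins for human workers and voters.

```python
def majority_vote(a, b, voter_choices):
    """Decide between results a and b given each voter's choice
    ('a' or 'b'); ties go to a, the incumbent result."""
    return b if voter_choices.count('b') > len(voter_choices) / 2 else a

def parallel_process(task, workers, judge):
    """Every worker processes the same task independently; the best
    result is kept through repeated pairwise votes. judge(a, b)
    returns one simulated voter's choice, 'a' or 'b'."""
    results = [w(task) for w in workers]
    best = results[0]
    for r in results[1:]:
        best = majority_vote(best, r, [judge(best, r) for _ in range(3)])
    return best

def iterative_process(task, workers, judge):
    """Each worker sees (and may improve) the previous result; a
    vote decides whether the new version replaces the old one."""
    current = workers[0](task)
    for w in workers[1:]:
        improved = w(current)
        current = majority_vote(current, improved,
                                [judge(current, improved) for _ in range(3)])
    return current
```

For instance, with workers that each append text and a judge preferring longer results, the parallel process keeps the single best independent result, while the iterative process accumulates the successive improvements.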
In the HIT of writing the image description, 0.02$ and 0.01$ were respectively paid
to the worker writing the image description and a voter. The workers inputted the
sentence representing the content of the presented image, and the voters evaluated the
two sentences representing the content of the presented image with the values from 1 to
10. The sentences representing the contents of 30 images were acquired using the
parallel and iterative processes. Figure 3.7 shows the relation between the number of
17
Figure 3.5: Parallel process (The number of workers is n)
Figure 3.6: Iterative process (The number of workers is n)
workers and the evaluation values for the two processes. The table indicated that the
iterative process was useful to the task of writing the image description than the parallel
process. This was because that the longer a sentence was, the better the evaluation value
was.
In the HIT of suggesting a new company name, 0.02$ and 0.01$ were paid to the worker suggesting the name and to each voter, respectively. The voters evaluated two company names with values from 1 to 10. New names for 30 companies were acquired using the parallel and iterative processes. The best evaluation values were 7.3 and 8.3 for the iterative and parallel processes, respectively. It was difficult to obtain a high evaluation value with the iterative process because each worker was influenced by the idea of the previous worker. However, although the best evaluation was obtained with the parallel process, the average values were 6.4 and 6.2 for the iterative and parallel processes, respectively. Figure 3.8 shows the relation between the number of workers and the evaluation values for the iterative process. In the iterative process, the more workers there were, the better the evaluation value was.
The results of these experiments showed that there is a trade-off between the average quality and the best quality of processed tasks. Although the average quality of processed tasks increased in the iterative process, the iterative process decreased the variety of results, which is very important for obtaining the best result. This was because the next worker often used the previous worker's result as a reference in the iterative process.
Figure 3.7: Relation between the number of workers and the evaluation values
(blue: iterative process, red: parallel process)
Figure 3.8: Relation between the number of workers and the evaluation values
(blue: the evaluation values by the iterative process, red: the average
evaluation value by the parallel process)
Chapter 4 Crowdsourcing Translation
4.1 Increase in Demand for Crowdsourcing Translation
The number of professional translators is limited, and much time and cost are needed to request translations from professional translators. Low-cost and quick translations can be realized by crowdsourcing translation because non-professional translators can also participate in it.
Crowdsourcing translation is mainly used for the globalization and overseas transfer of companies. The demand for crowdsourcing translation is expected to increase further given Japan's foreign direct investment in recent years. Japan's foreign direct investment is direct investment in foreign companies by Japanese companies. The more the amount of foreign direct investment increases, the higher the possibility of overseas transfer becomes. Table 4.1 shows the amount of Japanese foreign direct investment in ASEAN (2008-2010) published by the Bank of Japan. Table 4.2 shows a comparison between the amounts of Japanese foreign direct investment in ASEAN in 2010 and 2011, also published by the Bank of Japan. Tables 4.1 and 4.2 indicate that the overseas transfer of Japanese companies to ASEAN has increased recently, mainly because the yen is appreciating. The media reported in 2011 that local governments supported the overseas transfer of smaller businesses. For example, the Ota ward of Tokyo provides consultation on overseas development and supports the translation of foreign documents. The demand for crowdsourcing translation by companies is increasing not only in Japan. Table 4.3 shows the amount of Chinese foreign direct investment (2008-2010) published by the Japan External Trade Organization (JETRO). Table 4.3 indicates that more Chinese companies are transferring overseas.
Crowdsourcing translation can be used not only by companies but also by individuals. Individuals use crowdsourcing translation to send and receive information on the web. Examples of sending information are the translation of a home page, a blog, or the explanation of a created application. An example of receiving information is the translation of news articles from foreign countries. These translations do not have to be perfect; there is no problem as long as their meaning is correct, and the time and cost required for them are expected to be low. Therefore, the translations given here are suitable as translation issues for crowdsourcing translation.
Table 4.1: Amount of foreign direct investment of Japan for ASEAN in recent years
                                    2008   2009   2010
Amount of investment (billion yen)  6,518  6,587  7,711
Table 4.2: Comparison between the amounts of the foreign direct investments of Japan
for ASEAN in 2010 and 2011
                        Amount of investment to ASEAN (billion yen)
First quarter of 2010     666
First quarter of 2011   1,016
Second quarter of 2010  1,867
Second quarter of 2011  2,768
Table 4.3: Amount of the foreign direct investment of China in recent years
                                        2008    2009    2010
Amount of investment (million dollars)  41,859  47,800  59,000
4.2 Example of Crowdsourcing Translation
myGengo1 is given as an example of crowdsourcing translation. myGengo is a service created in Japan, and its purpose is to support the globalization of business. The customers of myGengo include major Japanese companies, and myGengo is one of the proven crowdsourcing translation services. The flow of using myGengo is as follows.
1. A translation issue is ordered through the web site or API of myGengo.
2. Translators registered at myGengo start to process the translation issue.
3. Comments can be exchanged between the requester and the translators during the translation process. Necessary modifications are also performed at no charge after the requester checks the translation.
4. Delivery is notified by e-mail. The translation is sent to the requester automatically if the API is used.

1 http://ja.mygengo.com/
Translators who pass the qualification test can register at myGengo. Translators are classified into the standard and pro levels of translation, and the requester can select the level of a translation. As of spring 2011, more than 1,600 translators were registered at myGengo. The languages handled by myGengo are Japanese, English, Chinese, French, German, Italian and Spanish.
Figure 4.1 shows the screen of a translation request in myGengo. Requesting translations from professional translators generally takes a lot of trouble, but in myGengo anyone can request translations very easily. Actual examples of translations requested through myGengo are the translation of the explanation of an application, of a company's press release and of a company's manual.
Figure 4.1: Screen of the translation request in myGengo
4.3 Increase in Quality of Translations by Cooperative Processes
The quality of a translation is not sufficiently guaranteed because one translator generally takes responsibility for one translation issue in crowdsourcing translation. The quality of a translation can be expected to increase if multiple translators process the translation cooperatively. In this case, the translation is divided among multiple translators appropriately, and the reward is divided among them according to each translator's achievement. How to decide each translator's contribution and pay the reward accordingly is very important, but we do not consider it here. In this study, we focus on the best way to increase the quality of a translation by multiple translators.
myGengo is a controlled crowdsourcing service designed to process translations appropriately. Amazon Mechanical Turk is also a crowdsourcing service, but it is not controlled for translations. Highly technical tasks such as translation are not well suited to HITs on Amazon Mechanical Turk because workers and requesters can process and request HITs very freely there. However, there is no other crowdsourcing service in which translation tasks can be requested and processed through cooperative processes. Therefore, we decided to use Amazon Mechanical Turk.
Chapter 5 Establishment of Experiment Environment
We built an experiment environment for translation from Chinese to English in this study.
5.1 Process of Tasks by Cooperative Processes
We formed crowdsourcing translation processes using the parallel and iterative processes represented by Figures 3.5 and 3.6, respectively. We used TurKit [12] to realize the parallel and iterative processes in Amazon Mechanical Turk. TurKit is a tool for executing iteratively processed HITs in Amazon Mechanical Turk, and it makes it possible to process HITs following a process described as a JavaScript program. Figure 5.1 shows the execution screen of TurKit.

Figure 5.1: Execution screen of TurKit

The process and the content of a HIT are input in the area labeled 1 in Figure 5.1. The results of the HITs processed by workers and voters are output in the area labeled 2 in Figure 5.1. The links to the pages of the HITs requested on Amazon Mechanical Turk are displayed in the area labeled 3 in Figure 5.1. The areas labeled 2 and 3 are updated as the process progresses.
5.2 Request of Translation Task
There are the following two translation tasks in this experiment.
(a) The translation of a source sentence into the target language
(b) The improvement of a translated sentence based on the source sentence
Translation task (a) is processed by the workers in the parallel process and by the first worker in the iterative process, and translation task (b) is processed by the workers other than the first worker in the iterative process.
Figures 5.2 and 5.3 show the request screens of translation tasks (a) and (b) from Chinese to English, respectively. These screens carry the notice "The result of a HIT is rejected if a worker uses machine translation or the quality of a translation is very low." In translation task (b), the translation input by the previous worker is provided, and the worker can also create the translation by modifying the provided one. The notice in translation task (b) says "You can use the provided translation, but you can start over if you don't want to use it." The reward and processing time of translation task (a) are 0.2$ and 60 minutes, respectively, and those of translation task (b) are 0.1$ and 30 minutes. The processing time is long because we want workers to create better translations; for example, we hope that workers look up the translations of technical words in dictionaries on the web.
5.3 Request of Vote Task
Figure 5.4 shows the request screen of the vote task. In this task, workers select the better English translation by comparing two translations of a Chinese sentence. Workers can see the Chinese sentence while processing the vote task. The reward and processing time of the vote task are 0.03$ and 10 minutes, respectively. The processing time is long because we want workers to make a thoughtful choice. The vote task is important because the translators' effort is wasted if the vote does not function correctly.
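The vote task amounts to a simple majority decision between two candidate translations. A minimal sketch follows; the function and the vote encoding are our own illustration, not part of Amazon Mechanical Turk.

```python
from collections import Counter

def decide_better_translation(candidate_a, candidate_b, votes):
    """Return the translation chosen by the majority of voters.
    `votes` holds each voter's choice, 'a' or 'b'; with an odd
    number of voters (e.g. 3) a tie cannot occur."""
    tally = Counter(votes)
    return candidate_a if tally['a'] > tally['b'] else candidate_b

# Three voters compare two English translations of one Chinese sentence:
chosen = decide_better_translation(
    "The driver abandoned the vehicle and fled.",
    "The driver then left the car and ran away.",
    votes=['a', 'b', 'a'])
# two of the three voters chose 'a', so the first candidate wins
```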
Figure 5.2: Request screen of the translation task from Chinese to English
Figure 5.3: Request screen of the improvement of the translation from Chinese to
English
Figure 5.4: Request screen of the vote task
5.4 Screening of Workers
The previous studies [3] and [9] indicate that many workers in Amazon Mechanical Turk do not process HITs seriously. In particular, the translation task requires technical ability, and it is difficult to determine whether a worker processed vote tasks seriously. Thus, the screening of workers is necessary in this experiment.
There are two methods of screening workers in Amazon Mechanical Turk. The first method is to create and publish a formal Qualification test and have workers solve it. However, a formal Qualification test cannot be created and published through the GUI; a program written in Java, Ruby or Perl has to be executed. The correct answer of the Qualification test is prepared in advance, and a Qualification can be given to a worker automatically by comparing the worker's answer with the correct answer. The second method is to request a HIT as a Qualification test. In this case, the title or description of the HIT has to state that it is a Qualification test. Such a Qualification test can be created and published through the GUI, though this method may not be official, and whether a worker's answer is correct can be checked manually. We adopted the second method because translations should be checked manually in a Qualification test.
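The automatic grading in the first method amounts to comparing each worker's answer with a prepared answer key. A rough sketch follows; the normalization rules are our own illustration.

```python
def normalize(answer):
    """Collapse case and extra whitespace so that trivial formatting
    differences do not fail an otherwise correct worker."""
    return " ".join(answer.lower().split())

def grade_qualification(worker_answer, answer_key):
    """Grant the Qualification when the worker's answer matches one
    of the prepared correct answers after normalization."""
    return normalize(worker_answer) in {normalize(a) for a in answer_key}

key = ["I'm sorry, but we don't have such a person here."]
assert grade_qualification("i'm sorry,  but we don't have such a person here. ", key)
assert not grade_qualification("Sorry, no idea.", key)
```

Exact matching of this kind is precisely why the first method is a poor fit for translation: many different English sentences are acceptable, so we check answers manually instead.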
Figure 5.5 shows the request screen of the Qualification test for Chinese-English
translation.
Figure 5.5: Request screen of the Qualification test for the Chinese-English translation
The guideline of the Qualification test says "a worker can get the Qualification for many other translation tasks if the worker translates the Chinese sentence into English correctly" and "a worker must not use machine translation". After solving the translation test, a worker has to answer his or her mother language, country of origin and country of residence; we used this questionnaire to screen workers easily. In the Qualification test, the source sentence is "对不起,我们这里没有这个人。", and an example of a correct translation is "I'm sorry, but we don't have such a person here." We used a very simple test because many workers do not take a test if it is difficult. It is possible to see the number of workers passing a formal Qualification test published in Amazon Mechanical Turk; the number of workers passing a Qualification test for Chinese-English translation published by another requester was around 30, which is very low, although the translation test used in that Qualification test was more difficult than ours. The reward and processing time of our Qualification test are 0.01$ and 5 minutes, respectively. We considered setting the reward of our Qualification test to 0$, but we expected that many workers would not gather if we did so. The processing time is short because the test is so easy that workers do not need to use dictionaries.
Chapter 6 Experiment and Evaluation
6.1 Experiment
We conducted an experiment on translations using the parallel and iterative processes. We aimed to clarify the relation between the number of translators and the quality of translations in the parallel and iterative processes, and the same relation in translation processes combining the parallel and iterative processes.
We published the Qualification test for Chinese-English translation on Amazon Mechanical Turk to gather workers to participate in the experiment. The number of workers passing the Qualification test was 30 one week after the test was published. We set the number of translators and the number of voters per vote to 3 because the number of workers passing the Qualification test was low. The parallel and iterative processes used in our experiment are shown in Figures 6.1 and 6.2, respectively. The experiment was launched one week after the Qualification test was published. The Qualification test remained published while the experiment was being conducted, and workers passing the test were given the Qualification at any time.
The purpose, procedure and hypotheses of the experiment are as follows.
Purpose of the experiment
The evaluation of crowdsourcing translation processes combining the parallel and iterative processes (the number of translators is 3)
Procedure of the experiment
Experiments 1 and 2 are conducted in order.
Experiment 1. The experiment on the relation between the number of translators and the quality of translations in the parallel process
We translate Chinese sentences into English using the parallel process in Amazon Mechanical Turk. We acquire three translations for each Chinese sentence, and the three translations are regarded as the translations by translators 1, 2 and 3 according to the order of acquisition. We acquire the best translation by two votes.
Experiment 2. The experiment on the relation between the number of translators and the quality of translations in the iterative process
Chinese sentences are translated into English using the iterative process, where the translations by translator 1 in Experiment 1 are used as the translations by translator 1.
Figure 6.1: Parallel process (The number of translators is 3)
Figure 6.2: Iterative process (The number of translators is 3)
Hypotheses of the experiment
Hypothesis 1. A better translation is acquired by the iterative process than by the parallel process
Reason for Hypothesis 1. It is considered that the quality of translations increases through iterative modifications (improvements of grammar and spelling errors)
Hypothesis 2. A better translation is acquired by the process combining the parallel and iterative processes (Figure 6.3) than by the iterative process
Reason for Hypothesis 2. The first translation to be improved is expected to have a strong effect on the translators improving it. Thus, it should be possible to increase the quality of the final translation by acquiring the first translation for improvement from multiple translators.
The source sentences and rewards are as follows.
- Source sentences
Five articles each are randomly selected from the sports, society, economy and culture categories of 京报网 (one of the leading news sites in China). The source sentences are 20 sentences, the first sentence of each article.

Figure 6.3: Process with combination of the parallel and iterative processes
(The number of translators is 3)

- Rewards
[Rewards of the translations using the parallel process]
Reward of a translation task: 0.2$
Reward of a vote task: 0.09$ per task (0.03$ per worker)
Reward for one sentence: 0.2×3+0.09×2=0.78$
Reward for 20 sentences: 0.78×20=15.6$
[Rewards of the translations using the iterative process]
Reward of a translation task: 0.1$
Reward of a vote task: 0.09$ per task (0.03$ per worker)
Reward for one sentence: 0.1×2+0.09×2=0.38$
Reward for 20 sentences: 0.38×20=7.6$
[Total reward]
15.6+7.6=23.2$
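The reward arithmetic above can be checked mechanically; the following sketch reproduces the calculation using the values stated in the text.

```python
# Reward arithmetic for the experiment (values from the text).
TRANSLATION_PARALLEL = 0.2   # $ per translation task in the parallel process
TRANSLATION_ITERATIVE = 0.1  # $ per improvement task in the iterative process
VOTE_TASK = 0.03 * 3         # $ per vote task: 3 voters at 0.03$ each
SENTENCES = 20

# Parallel process: 3 translations and 2 votes per sentence.
parallel_per_sentence = TRANSLATION_PARALLEL * 3 + VOTE_TASK * 2    # 0.78$
# Iterative process: translator 1's result is reused from the parallel
# process, so only 2 improvement tasks and 2 votes are paid per sentence.
iterative_per_sentence = TRANSLATION_ITERATIVE * 2 + VOTE_TASK * 2  # 0.38$

total = (parallel_per_sentence + iterative_per_sentence) * SENTENCES  # 23.2$
```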
6.2 Evaluation
We created reference translations by having a Chinese-English bilingual translate the source sentences and a native English speaker modify the English translations. We had 3 English speakers evaluate the translation results on a scale of 1 to 5 (All, Most, Much, Little and None) by comparing them with the reference translations. The evaluation value of a translation is the average of the 3 evaluation values.
Figure 6.4 shows the results of Experiments 1 and 2. The processes shown in Figures 6.3 and 6.5 can also be considered processes combining the parallel and iterative processes. The process shown in Figure 6.3 contains the combination of the parallel and iterative processes using 2 translators, so it is possible to estimate the relation between the number of translators and the quality of translations in this process.
Evaluation value of the translation from vote 1 in the process shown in Figure 6.3
The evaluation value of the translation from vote 1 in the process shown in Figure 6.3 is approximated by the evaluation value of the translation from the parallel process using 2 translators. This evaluation value is 4.27.
Evaluation value of the translation from vote 2 in the process shown in Figure 6.3
The third translator improves the translation in the process shown in Figure 6.3. The degree of improvement by the third translator must differ according to the quality of the translation from vote 1. Here, we compute the degree of improvement for each quality level of the first translation in the iterative process using 2 translators. First, we classify the translations from translators 1 and 2 in the iterative process into 4 categories (1.00-1.99, 2.00-2.99, 3.00-3.99 and 4.00-5.00) according to their evaluation values. We compute the average of the evaluation values for each category. We also compute the average of the evaluation values of the translations chosen by the votes (vote 1 for the translation from translator 1, or vote 2 for the translation from translator 2) for each category. As a result, we obtain the degree of improvement for each quality level of the first translation in the iterative process using 2 translators (Figure 6.6). The evaluation value from vote 2 is the product of the evaluation value of the translation from vote 1 and the degree of improvement corresponding to that value. For example, the evaluation value of the translation from vote 2 is 3.5×3.67/3.33=3.86 if the evaluation value of the translation from vote 1 is 3.5. The evaluation value of the translation from vote 2 in the process shown in Figure 6.3 is thus approximated by applying the corresponding degree of improvement to the value 4.27.
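The estimation procedure can be written out as a short sketch. Only the improvement ratio for the 3.00-3.99 category (3.67/3.33) appears in the worked example above; the other ratios below are placeholders for illustration.

```python
def improvement_ratio(value, ratios):
    """Look up the improvement ratio for the category containing
    `value`. `ratios` maps (low, high) category bounds to the ratio
    of the average post-vote value to the average pre-vote value."""
    for (low, high), r in ratios.items():
        if low <= value <= high:
            return r
    raise ValueError("value outside the 1-5 evaluation scale")

def estimate_vote2_value(vote1_value, ratios):
    """Estimate the evaluation value after the third translator's
    improvement as vote1_value times its category's ratio."""
    return vote1_value * improvement_ratio(vote1_value, ratios)

# Only the 3.00-3.99 ratio (3.67/3.33) is given in the text; the
# other ratios here are placeholders for illustration.
ratios = {(1.00, 1.99): 1.0, (2.00, 2.99): 1.0,
          (3.00, 3.99): 3.67 / 3.33, (4.00, 5.00): 1.0}

round(estimate_vote2_value(3.5, ratios), 2)  # 3.86, as in the text
```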
Figure 6.4 also shows the relation between the number of translators and the quality of translations in the process shown in Figure 6.3. Figure 6.4 indicates that the process shown in Figure 6.3 gives the best translation when 3 translators are used.
It is impossible to approximate the relation between the number of translators and the quality of translations for the process shown in Figure 6.5 in the same way, but in that process the first translation for improvement is acquired from only translator 1. The quality of the first translation for improvement increases when it is acquired from two translators. Because the first translation for improvement has a strong effect on the translators improving it, it is unlikely that the process shown in Figure 6.5 is better than that shown in Figure 6.3.
Figure 6.4: Relation between the number of translators and the quality of translations
Figure 6.5: Process with combination of the iterative and parallel processes
(The number of translators is 3)
Figure 6.6: Difference of the improvement of the translation for each quality of the first
translation in the iterative process
6.3 Discussion
58 workers passed the Qualification test during the 25 days from its publication until the end of the experiment. We had workers answer their mother language, country of origin and country of residence when they took the test. Tables 6.1, 6.2 and 6.3 show the results. Most of the workers' mother languages were Chinese and most of their countries of origin were China, whereas most of their countries of residence were the United States.

Table 6.1: Mother languages of workers
         Chinese  English
numbers  47       11

Table 6.2: Countries of origin of workers
         China  United States  Singapore  Malaysia  other
numbers  31     10             6          6         5
Table 6.3: Countries of residence of workers
         United States  China  Singapore  Malaysia  Vietnam
numbers  36             9      6          6         1
Experiments 1 and 2 were completed in 11 and 7 days, respectively. Experiment 2 was completed earlier than Experiment 1 because the translation by translator 1 in Experiment 1 was reused as the translation by translator 1 in Experiment 2. In the previous study [1], the translation of 50 English sentences into Chinese was completed in 2 days; this was because no Qualification test was used. There must be a trade-off between the speed and quality of translations.
The experiment results show that Hypothesis 1 is not correct. This may be because the number of translators was low in this experiment: it was 3, whereas it was 6 in the previous study [4]. It can be expected that the iterative process gives a better translation than the parallel process if the number of translators increases. The experiment results show that Hypothesis 2 is correct. It can be expected that the parallel process using 2 translators gives the best translation among the parallel processes, because the parallel process using 3 translators is worse than that using 2 translators in Figure 6.4. Thus, a process combining the parallel process using 2 translators with the iterative process would be the best. For example, if the number of translators is 5, the combination of a parallel process using 2 translators and an iterative process using 3 translators would be the best.
However, the numbers of translators in a process and of source sentences were too low to verify the hypotheses, and the number of experiment runs was also low. We consider here the time required to verify the hypotheses. We needed around 18 days to conduct the experiment with 3 translators, so at least 36 days would be needed if the number of translators were 6. The experiment would have to be conducted five times if we followed the previous study [4]. As a result, at least 6 months would be needed for a solid experiment, and the time for the analysis and evaluation of the results would also be needed.
Table 6.4 shows an example of a good translation by the iterative process. In this example, the translation by each translator was better than that by the previous translator, and the votes were performed correctly. Translator 2 created a translation based on the translation by translator 1, but translator 3 did not use most of the translation by translator 2. Translator 3 probably started over because the translation ability of translator 3 was much better than that of translators 1 and 2. Table 6.5 shows an example of a bad translation by the iterative process. The translation by translator 1 was the best in this example, but it could not be selected by vote 2 because the translation by translator 2, which was worse than the translation by translator 1, had been selected by vote 1. The iterative process did not work well when the votes were not performed correctly. The iterative process also did not work well when translator 2 and later translators did not change the translation by the previous translator.
Table 6.4: Example of a good translation by the iterative process

Source Chinese sentence (evaluation value: -):
今晚,北京金隅男篮将在主场迎战浙江广厦。对于已经两连败的北京队来说,能不能重新找回信心,停止连败,这场比赛非常关键。

Correct English translation (evaluation value: -):
Tonight, Beijing Jinyu will hold a game at home against Zhejiang Guangsha. This game is very critical for Beijing to regain confidence from a two-game losing streak.

English translation from translator 1 (evaluation value: 1.67):
Tonight, BBMG against Zhejiang guangsha men's basketball team in the arena. Has been for two season for the Beijing team, can confidence in myself again, stop the season, the match is crucial.

English translation from translator 2 (evaluation value: 2.33):
Tonight, bbmg against Zhejiang guangsha men's basketball team in the arena. Already had two even defeated the Beijing team, can regain confidence, stop continuous defeat, the game is critical.

English translation from translator 3 (evaluation value: 4.33):
Tongiht, BBMG will do battle against Zhejiang Guangsha men's basketball team. The match will be extremely critical for deciding if BBMG, which has lost already two matches in a row, could regain its confidence and stop its string of continuous defeats.

English translation from vote 1 (evaluation value: 2.33):
<English translation from translator 2>

English translation from vote 2 (evaluation value: 4.33):
<English translation from translator 3>
Table 6.5: Example of a bad translation by the iterative process

Source Chinese sentence (evaluation value: -):
前天清晨5点左右,一辆奔驰汽车在工体北门附近将一名青年男子撞伤,司机随即弃车逃逸。两名路过的女大学生见状,和其他3名年轻人一起将伤者送往医院抢救。

Correct English translation (evaluation value: -):
At 5 o'clock the day before yesterday, a Mercedes-Benz car hit a young man near the north gate of Worker Gymnasium. The driver abandoned the vehicle and fled. Two passing-by female students and three other young men sent the injured person to the hospital.

English translation from translator 1 (evaluation value: 4.67):
At 5 o'clock the day before yesterday, a Mercedes-Benz car hit a young man near the Worker's North Gate, the driver abandoned the vehicle and fled. Two female college students reported the incident and took the injured to the hospital with three other young people.

English translation from translator 2 (evaluation value: 2.67):
The day before yesterday at about 5 o'clock in the morning, a Mercedes car near the North Gate of the workers will be a young man was injured, the driver then abandoned the vehicle and escaped. Two female students passing by seeing this, along with other 3 young men wounded and rushed to hospital for treatment.

English translation from translator 3 (evaluation value: 3.33):
At 5 o'clock the day before yesterday, the driver of a Mercedes-Benz abandoned his vehicle and fled after hitting a young man near the Worker's North Gate. Two female college students passing witnessed the incident and took the injured to the hospital with the help of three other young people.

English translation from vote 1 (evaluation value: 2.67):
<English translation from translator 2>

English translation from vote 2 (evaluation value: 3.33):
<English translation from translator 3>
Table 6.6 shows the proportion of correct votes in each process.

Table 6.6: Proportion of correct votes in each process

Process              Proportion of correct votes
Parallel process     67.5% (27/40)
Iterative process    77.5% (31/40)
Total                72.5% (58/80)
Table 6.6 indicates that 72.5% of all votes were correct. Because a vote only
required selecting the better of two translations, some workers selected a
translation at random, and such workers are very difficult to screen automatically.
One method of screening them is to ask workers why they selected a translation, but
the reasons then have to be checked manually, and workers will not write reasons
unless the reward for a vote is increased.
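Even with some random voters, the 58/80 correct votes in Table 6.6 can be checked against pure chance (each voter picking either translation with probability 0.5) using a one-sided binomial test. A minimal sketch, not part of the original experiment:

```python
from math import comb

def binom_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of observing at
    least k correct votes out of n if every vote were random."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Observed vote accuracy from Table 6.6: 58 correct out of 80 votes.
p_value = binom_tail(58, 80)
print(f"P(X >= 58 | n=80, p=0.5) = {p_value:.2e}")
```

The resulting p-value is far below conventional significance levels, so the voters as a group clearly performed better than random selection, even though individual random voters cannot be identified this way.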
Figure 6.7 shows the relation between the number of translators and the quality of
translations in the parallel and iterative processes for the case of a perfect vote
(under the assumption that a translator's improved translation stays the same even
when the translation selected by a perfect vote differs from the one actually
selected in the experiment). If all votes are performed correctly, the quality of
translations in both the parallel and iterative processes increases with the number
of translators. Figure 6.8 shows, for the case of a perfect vote, how much one step
of the iterative process improves a translation for each quality level of the first
translation. The relation between the number of translators and the quality of
translations in the process shown in Figure 6.3 can be approximated by a computation
using Figure 6.8. When all votes are performed correctly, the evaluation value from
vote 1 in the process shown in Figure 6.3 is approximated by the evaluation value of
the parallel process with 2 translators, which is 4.4. The evaluation value from
vote 2 in the process shown in Figure 6.3 is then approximated as 4.54. Figure 6.9
shows the relation between the number of translators and the quality of translations
in the process shown in Figure 6.3. Figures 6.7 and 6.9 indicate that the process
shown in Figure 6.3 is the best.

Figure 6.7: Relation between the number of translators and the quality of translations
in the parallel and iterative processes for the case of a perfect vote
Figure 6.8: Difference of the improvement of the translation for each quality of
the first translation in the iterative process for the case of a perfect vote
Figure 6.9: Relation between the number of translators and quality of translations in the
process with combination of the parallel and iterative processes for the case
of a perfect vote
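The approximation described above (start from the parallel-process value, then add per-step gains read off Figure 6.8) can be sketched as follows. The improvement function here is a hypothetical placeholder, since the numerical data behind Figure 6.8 are not reproduced in the text; the real values would be read off that figure.

```python
# Sketch of the combined-process approximation: a parallel step picks the
# best of the initial translations by vote, then iterative steps improve
# the winner. The gain of one iterative step depends on the quality of the
# translation being improved (the relation plotted in Figure 6.8).

def improvement(quality: float) -> float:
    """HYPOTHETICAL per-step gain for a given input quality, on the 1-5
    evaluation scale: better inputs leave less room for improvement."""
    return max(0.0, 0.7 * (5.0 - quality) / 5.0)

def combined_process_quality(parallel_quality: float, steps: int) -> float:
    """Estimate quality after `steps` iterative improvements applied to
    the winner of the parallel step (perfect votes assumed)."""
    q = parallel_quality
    for _ in range(steps):
        q += improvement(q)
    return q

# The thesis's figure for vote 1 (parallel process, 2 translators) is 4.4;
# one further iterative step then approximates the value from vote 2.
print(combined_process_quality(4.4, steps=1))
```

With the real Figure 6.8 values substituted for `improvement`, this computation yields the 4.54 estimate quoted in the text.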
6.4 Lessons Learned
Amazon Mechanical Turk is a crowdsourcing market for processing a large number of
simple micro tasks. The translation tasks used in this experiment are considered
ill-suited for Amazon Mechanical Turk because translation tasks require high skills.
We found that most of the translation tasks requested on Amazon Mechanical Turk
were processed using machine translation. It was possible to screen malicious workers
by using a simple translation test in this experiment, but a lot of time was required
to complete the experiment because very few workers passed the test. This is because
translation tasks require high skills and considerable effort. Workers on Amazon
Mechanical Turk seemed to prefer simple tasks such as tagging images, which let them
acquire rewards efficiently.
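One simple, automatable complement to a translation test is to flag submissions that are nearly identical to a machine translation of the same source sentence, since most machine-translated submissions are pasted verbatim. A minimal sketch; the token-overlap measure and the threshold value are assumptions for illustration, not part of the experiment:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def looks_machine_translated(submission: str, mt_output: str,
                             threshold: float = 0.9) -> bool:
    """Flag a submission that is nearly identical to a known machine
    translation output. The threshold is a hypothetical tuning value."""
    return jaccard_similarity(submission, mt_output) >= threshold
```

A check like this only catches verbatim pasting; lightly edited machine translations would need a lower threshold or a more robust measure, at the cost of false positives.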
The set a higher reward of a translation task, use of simpler source sentence and
change of source and translated languages are given to complete a translation
experiment more quickly. A more difficult translation test is needed to acquire better
translations. However, the time required to complete a translation experiment increases
because the number of workers passing the test decreases if a more difficult test is used.
Therefore, it is considered that there is a relation of trade-off between the quality and
time of translations.
Chapter 7 Conclusion
A crowdsourcing translation, in which translators voluntarily take on translation
tasks, has attracted attention recently. Crowdsourcing translation can realize
low-cost and rapid translations because non-professional translators can participate.
However, the quality of translations is not guaranteed for the same reason. We
therefore have to get multiple translators to participate in a translation process
and consider how they should translate cooperatively.
In this study, we proposed a method for evaluating crowdsourcing translation
processes using Amazon Mechanical Turk, one of the crowdsourcing markets. On Amazon
Mechanical Turk, a variety of tasks including translation tasks can be requested,
and translation processes involving multiple translators can be freely formed using
its APIs and tools. We built an experiment environment on Amazon Mechanical Turk and
evaluated crowdsourcing translation processes.
The contributions of this study are as follows.
Establishment of the experiment environment using Amazon Mechanical Turk
We built the experiment environment of a crowdsourcing translation using Amazon
Mechanical Turk. A variety of tasks, including translation tasks, can be requested
on Amazon Mechanical Turk, and APIs and tools for forming task-processing workflows
on it have been released. Amazon Mechanical Turk is therefore a suitable
crowdsourcing market on which to build the experiment environment.
Efficient evaluation of crowdsourcing translation processes using Amazon
Mechanical Turk
We evaluated the parallel and iterative processes by translating 20 Chinese
sentences into English in the built experiment environment. The parallel and
iterative processes are the fundamental cooperative processes proposed by a
previous study. In addition, we approximately evaluated the process combining
the parallel and iterative processes, using the evaluation results of the
parallel and iterative processes. With 3 translators, the evaluation values of
the parallel process, the iterative process, and the combined process were
4.22, 4.15, and 4.27, respectively. The evaluation results showed that the
combined process was the best. By evaluating translation processes
approximately, we realized the evaluation of crowdsourcing translation
processes while cutting time and cost.
However, more experiments are needed to make the results of this experiment more
convincing. Crowdsourcing translation processes can be evaluated with the same
method used in this experiment if more translators participate in them.
Acknowledgments
The author would like to express sincere gratitude to the supervisor, Professor Toru
Ishida at Kyoto University, for his continuous guidance, valuable advice, and helpful
discussions.
The author would like to tender his acknowledgments to Associate Professor
Shigeo Matsubara and Assistant Professor Hiromitsu Hattori at Kyoto University
for their technical and constructive advice.
The author would like to express his appreciation to the advisers, Associate
Professor Keishi Tajima at Kyoto University and Researcher Yohei Murakami at the
National Institute of Information and Communications Technology, for their
valuable advice.
This research was partially supported by the Strategic Information and
Communications R&D Promotion Programme (SCOPE) of the Ministry of Internal
Affairs and Communications of Japan and by a Grant-in-Aid for Scientific Research
(A) (21240014, 2009-2011) from the Japan Society for the Promotion of Science (JSPS).
Finally, the author would like to thank all members of Ishida and Matsubara
laboratory for their various supports and discussions.