Hacettepe University Graduate School of Social Sciences Department of Translation and Interpretation AUTOMATIC SPEECH RECOGNITION IN CONSECUTIVE INTERPRETER WORKSTATION: COMPUTER-AIDED INTERPRETING TOOL ‘SIGHT-TERP’ Cihan ÜNLÜ Master’s Thesis Ankara, 2023 AUTOMATIC SPEECH RECOGNITION IN CONSECUTIVE INTERPRETER WORKSTATION: COMPUTER-AIDED INTERPRETING TOOL ‘SIGHT-TERP’ Cihan ÜNLÜ Hacettepe University Graduate School of Social Sciences Department of Translation and Interpretation Master’s Thesis Ankara, 2023 KABUL VE ONAY Cihan ÜNLÜ tarafından hazırlanan “Automatic Speech Recognition in Consecutive Interpreter Workstation: Computer-Aided Interpreting Tool ‘Sight-Terp’” (Otomatik Konuşma Tanıma Sistemlerinin Ardıl Çeviride Kullanılması: Sight-Terp) başlıklı bu çalışma, 15.06.2023 tarihinde yapılan savunma sınavı sonucunda başarılı bulunarak jürimiz tarafından Yüksek Lisans Tezi olarak kabul edilmiştir. Dr. Öğr. Üyesi Alper KUMCU (Başkan) Prof. Dr. Aymil DOĞAN (Danışman) Doç. Dr. Gökçen HASTÜRKOĞLU (Üye) Yukarıdaki imzaların adı geçen öğretim üyelerine ait olduğunu onaylarım. Prof. Dr. Uğur ÖMÜRGÖNÜLŞEN Enstitü Müdürü YAYIMLAMA VE FİKRİ MÜLKİYET HAKLARI BEYANI Enstitü tarafından onaylanan lisansüstü tezimin/raporumun tamamını veya herhangi bir kısmını, basılı (kağıt) ve elektronik formatta arşivleme ve aşağıda verilen koşullarla kullanıma açma iznini Hacettepe Üniversitesine verdiğimi bildiririm. Bu izinle Üniversiteye verilen kullanım hakları dışındaki tüm fikri mülkiyet haklarım bende kalacak, tezimin tamamının ya da bir bölümünün gelecekteki çalışmalarda (makale, kitap, lisans ve patent vb.) kullanım hakları bana ait olacaktır. Tezin kendi orijinal çalışmam olduğunu, başkalarının haklarını ihlal etmediğimi ve tezimin tek yetkili sahibi olduğumu beyan ve taahhüt ederim. 
Tezimde yer alan telif hakkı bulunan ve sahiplerinden yazılı izin alınarak kullanılması zorunlu metinlerin yazılı izin alınarak kullandığımı ve istenildiğinde suretlerini Üniversiteye teslim etmeyi taahhüt ederim. Yükseköğretim Kurulu tarafından yayınlanan “Lisansüstü Tezlerin Elektronik Ortamda Toplanması, Düzenlenmesi ve Erişime Açılmasına İlişkin Yönerge” kapsamında tezim aşağıda belirtilen koşullar haricince YÖK Ulusal Tez Merkezi / H.Ü. Kütüphaneleri Açık Erişim Sisteminde erişime açılır. Enstitü / Fakülte yönetim kurulu kararı ile tezimin erişime açılması mezuniyet tarihimden itibaren 2 yıl ertelenmiştir. (1) Enstitü / Fakülte yönetim kurulunun gerekçeli kararı ile tezimin erişime açılması mezuniyet tarihimden itibaren ... ay ertelenmiştir. (2) Tezimle ilgili gizlilik kararı verilmiştir. (3) 21/06/2023 Cihan ÜNLÜ “Lisansüstü Tezlerin Elektronik Ortamda Toplanması, Düzenlenmesi ve Erişime Açılmasına İlişkin Yönerge” (1) Madde 6. 1. Lisansüstü tezle ilgili patent başvurusu yapılması veya patent alma sürecinin devam etmesi durumunda, tez danışmanının önerisi ve enstitü anabilim dalının uygun görüşü üzerine enstitü veya fakülte yönetim kurulu iki yıl süre ile tezin erişime açılmasının ertelenmesine karar verebilir. (2) Madde 6. 2. Yeni teknik, materyal ve metotların kullanıldığı, henüz makaleye dönüşmemiş veya patent gibi yöntemlerle korunmamış ve internetten paylaşılması durumunda 3. şahıslara veya kurumlara haksız kazanç imkanı oluşturabilecek bilgi ve bulguları içeren tezler hakkında tez danışmanının önerisi ve enstitü anabilim dalının uygun görüşü üzerine enstitü veya fakülte yönetim kurulunun gerekçeli kararı ile altı ayı aşmamak üzere tezin erişime açılması engellenebilir. (3) Madde 7. 1. Ulusal çıkarları veya güvenliği ilgilendiren, emniyet, istihbarat, savunma ve güvenlik, sağlık vb. konulara ilişkin lisansüstü tezlerle ilgili gizlilik kararı, tezin yapıldığı kurum tarafından verilir *. 
Kurum ve kuruluşlarla yapılan işbirliği protokolü çerçevesinde hazırlanan lisansüstü tezlere ilişkin gizlilik kararı ise, ilgili kurum ve kuruluşun önerisi ile enstitü veya fakültenin uygun görüşü üzerine üniversite yönetim kurulu tarafından verilir. Gizlilik kararı verilen tezler Yükseköğretim Kuruluna bildirilir. Madde 7.2. Gizlilik kararı verilen tezler gizlilik süresince enstitü veya fakülte tarafından gizlilik kuralları çerçevesinde muhafaza edilir, gizlilik kararının kaldırılması halinde Tez Otomasyon Sistemine yüklenir * Tez danışmanının önerisi ve enstitü anabilim dalının uygun görüşü üzerine enstitü veya fakülte yönetim kurulu tarafından karar verilir. ETİK BEYAN Bu çalışmadaki bütün bilgi ve belgeleri akademik kurallar çerçevesinde elde ettiğimi, görsel, işitsel ve yazılı tüm bilgi ve sonuçları bilimsel ahlak kurallarına uygun olarak sunduğumu, kullandığım verilerde herhangi bir tahrifat yapmadığımı, yararlandığım kaynaklara bilimsel normlara uygun olarak atıfta bulunduğumu, tezimin kaynak gösterilen durumlar dışında özgün olduğunu, Prof. Dr. Aymil DOĞAN danışmanlığında tarafımdan üretildiğini ve Hacettepe Üniversitesi Sosyal Bilimler Enstitüsü Tez Yazım Yönergesine göre yazıldığını beyan ederim. Cihan ÜNLÜ iv ACKNOWLEDGEMENTS My first sincere thanks go to my advisor Prof. Dr. Aymil Doğan. I have nothing but admiration for her wisdom, attention and efforts for her students. I would also like to thank the participants of this study who generously gave their time and energy to take part in this research. Their willingness to share their experiences and insights has been invaluable and deeply appreciated. I would like to express my deepest gratitude to Assoc. Prof. Dr. Didem Tuna and Asst. Prof. Dr. Javid Aliyev for their invaluable academic guidance, understanding, and sincere help throughout both my undergraduate and graduate education. I would also like to extend my thanks to my friend Ebru Kürkcü for her unwavering moral support. 
I am grateful to Sebahat Gören, Dr. Pınar Uysal Cantürk, Aslı Yolcu and Büşra Ceren Tangül for their kind help in the statistical assessment and evaluation. Lastly, I would like to thank Prof. Didem Tuna and Prof. Alev Bulut for their invaluable expert opinions on the methodology used in this work. I owe a debt of gratitude to my family, colleagues at Istanbul Yeni Yüzyıl University and friends who provided me with much-needed encouragement, motivation, and support during this challenging time. Their faith in me has been a constant source of motivation and inspiration. v ÖZET ÜNLÜ, Cihan. Otomatik Konuşma Tanıma Sistemlerinin Ardıl Çeviride Kullanılması: Sight-Terp, Yüksek Lisans Tezi, Ankara, 2023. Bu deneysel çalışma, bilgisayar destekli sözlü çeviri (BDS) aracı olan "Sight-Terp" kullanımının ardıl çeviri sürecine etkisini araştırmaktadır. Bu çalışmanın yazarı tarafından tasarlanan ve geliştirilen Sight-Terp, dijital not defteri, otomatik konuşma tanıma (OKT), gerçek zamanlı konuşma çevirisi, adlandırılmış varlık tanıma ve vurgulama ve otomatik segmentasyon işlevlerine sahiptir. Çalışma, katılımcıların performanslarını iki koşulda (Sight-Terp'li ve Sight-Terp'siz) test etmek ve performanslarını doğruluk ve akıcılık kriterlerine göre analiz etmek için grup içi tekrarlı ölçümler tasarımı kullanmıştır. İki farklı koşuldaki doğruluk oranları arasındaki farkı analiz etmek için doğruluk değişkeni, anlamsal olarak eşdeğer bir şekilde aktarılan anlam birimlerinin sayısının ortalaması ile ölçülmüştür (Seleskovitch, 1989). Akıcılık ise, her bir performans için yanlış başlangıçlar, dolgulu duraksamaların sıklığı, sessiz duraksamalar, tüm sözcük tekrarları, bozuk sözcükler ve tamamlanmamış tümceler gibi akıcısızlık göstergelerinin toplam sayısı hesaplanarak ölçülmüştür. Ek olarak, katılımcıların araç kullanımına ilişkin algılarını analiz etmek için deney sonrası anket uygulanmıştır. 
Elde edilen bulgular, OKT ile entegre edilmiş BDS aracı Sight-Terp'ten yararlanmanın katılımcıların çevirilerinin doğruluğunda bir artışa yol açtığını göstermektedir. Ancak Sight-Terp kullandıklarında katılımcılarda daha fazla akıcısızlık belirteci meydana gelmiş ve çeviri için harcadıkları süre görece uzamıştır. Kullanıcılar aracı kullanırken herhangi bir zorluk veya yabancılık hissetmeseler de çalışma sonuçları yazılımın faydasını daha da artırabilecek potansiyel iyileştirme ve değişiklik alanlarını da ortaya koymaktadır. Bu çalışma, OKT teknolojisini sözlü çeviri sürecine dahil etmenin faydalarını ve zorluklarını vurgulayarak sözlü çeviri eğitimini ve pratiğini bilgilendirmeyi ve sözlü çevirmenler için BDS araçlarının gelecekteki gelişimi için pratik öneriler sunmayı amaçlamaktadır. Anahtar Sözcükler: bilgisayar destekli sözlü çeviri, otomatik konuşma tanıma, sözlü çeviri teknolojileri, ardıl çeviri, not alma, tablet destekli sözlü çeviri vi ABSTRACT ÜNLÜ, Cihan. Automatic Speech Recognition in Consecutive Interpreter Workstation: Computer-Aided Interpreting Tool ‘Sight-Terp’, Master’s Thesis, Ankara, 2023. This experimental study investigates the effect of using an automatic speech recognition (ASR)-enhanced computer-assisted interpreting (CAI) tool “Sight-Terp” on the performances of a group of participants in consecutive interpreting tasks. Sight-Terp, which was designed and developed by the author of this study, provides a digital note-pad, automatic speech recognition, real-time speech translation, named entity recognition and highlighting, and automatic segmentation of a speech. The study employs a within-subjects repeated measures design to test participants' performances in two conditions (with and without Sight-Terp) and analyses their performances based on the criteria of accuracy and fluency. 
To identify a significant difference between the accuracy ratios in the two conditions, accuracy was measured by the average number of accurately conveyed units of meaning (Seleskovitch, 1989). Fluency, on the other hand, was measured by calculating the total number of occurrences of disfluency markers such as false starts, filled pauses, filler words, whole-word repetitions, broken words, and incomplete phrases for each performance. Additionally, a follow-up qualitative survey was conducted to obtain participants' comparative responses and perceptions of the tool usage. The analysis and quantitative results of the study indicate that leveraging the ASR-integrated CAI tool Sight-Terp led to an enhancement in the accuracy of the participants' interpretations. However, this also resulted in a higher occurrence of disfluencies and elongated durations of interpretations. While the users experienced little difficulty while using the tool, the study outcomes also suggest potential areas of improvement and modifications that could further enhance the utility of the tool. The study aims to inform interpreting education and practice by highlighting the benefits and challenges of incorporating ASR technology in the interpreting process and offers practical suggestions for the future development of CAI tools for interpreters. Keywords: computer-assisted interpreting, automatic speech recognition, interpreting technology, consecutive interpreting, note-taking, tablet interpreting vii TABLE OF CONTENTS KABUL VE ONAY .......................................................................................................... i YAYIMLAMA VE FİKRİ MÜLKİYET HAKLARI BEYANI ................................ ii ETİK BEYAN ................................................................................................................. iii ACKNOWLEDGEMENTS ........................................................................................... 
iv ÖZET ............................................................................................................................... v ABSTRACT .................................................................................................................... vi TABLE OF CONTENTS .............................................................................................. vii LIST OF ABBREVIATIONS ....................................................................................... xi LIST OF TABLES .................................................................................................... xiii LIST OF FIGURES ................................................................................................... xiv LIST OF CHARTS ..................................................................................................... xv INTRODUCTION ........................................................................................................... 1 CHAPTER ONE: SCOPE OF THE STUDY ............................................................... 5 1.1. AIM OF THIS STUDY ..................................................................................... 5 1.2. SIGNIFICANCE OF THIS STUDY ............................................................... 5 1.3. RESEARCH QUESTION(S) ........................................................................... 6 1.4. LIMITATIONS ................................................................................................. 7 1.5. ASSUMPTIONS ................................................................................................ 7 1.6. DEFINITIONS ................................................................................................. 8 CHAPTER TWO: THEORETICAL BACKGROUND .............................................. 9 2.1. INTERPRETING: AN OVERVIEW .............................................................. 9 2.1.1. 
Defining Interpreting ................................................................................. 10 2.1.2. History of Interpreting ............................................................................... 11 viii 2.1.3. Interpreting in Modern Times ................................................................... 13 2.1.4. Modes and Settings of Interpreting ........................................................... 15 2.1.4.1. Consecutive Interpreting .................................................................... 18 2.1.4.2. Simultaneous Interpreting .................................................................. 20 2.1.4.3. Sight Interpreting................................................................................ 21 2.1.4.4. Whispering (Chuchotage) .................................................................. 22 2.1.4.5. Sign Language Interpreting ................................................................ 23 2.2. EFFORT MODELS IN INTERPRETING ................................................... 23 2.2.1. Effort Models in Consecutive Interpreting................................................ 25 2.2.2. Effort Models in Human-Machine Interaction .......................................... 26 2.3. TECHNOLOGY AND INTERPRETING .................................................... 29 2.3.1. The Emergence of Information Technologies in Interpreting ................... 29 2.3.1.1. Categorization of Technologies in Interpreting .................................. 30 2.3.2. Computer-Assisted Interpreting Tools ...................................................... 37 2.3.2.1. InterpretBank ...................................................................................... 41 2.3.2.2. Kudo Interpreter Assist ...................................................................... 44 2.3.2.3. SmarTerp ............................................................................................ 46 2.3.3. 
Speech Technologies and Automatic Speech Recognition ....................... 48 2.3.3.1. ASR Integration into Translation ....................................................... 53 2.3.3.2. ASR Integration into Interpreting ...................................................... 56 2.3.4. Technology and Consecutive Interpreting ................................................ 61 2.3.4.1. Sim-Consec ......................................................................................... 62 2.3.4.2. Tablet Interpreting .............................................................................. 63 2.4. SIGHT-TERP ......................................................................................................... 65 2.4.1. General Features ........................................................................................ 66 2.4.1.1. Automatic Speech Recognition and Speech Translation ................... 67 ix 2.4.1.2. Automatic Text Segmentation ............................................................ 69 2.4.1.3. Named Entity Recognition and Highlighting ..................................... 71 2.4.1.4. Digital Notepad .................................................................................. 73 CHAPTER THREE: METHODOLOGY ................................................................... 76 3.1. DESIGN OF THE STUDY ............................................................................. 76 3.2. DATA COLLECTION INSTRUMENTS ..................................................... 77 3.2.1. Speeches .................................................................................................... 78 3.2.2. Questionnaires ........................................................................................... 81 3.3. PARTICIPANTS ............................................................................................. 82 3.4. PROCEDURE ................................................................................................. 
83 3.4.1. Training ..................................................................................................... 86 3.4.2. Preliminary Test ......................................................................................... 87 3.5. DATA ANALYSIS TECHNIQUES .............................................................. 89 CHAPTER FOUR: FINDINGS AND DISCUSSION ................................................ 91 4.1. FINDINGS AND DISCUSSION RELATED TO THE ACCURACY DIFFERENCES ..................................................................................................... 91 4.2. FINDINGS AND DISCUSSION RELATED TO THE FLUENCY DIFFERENCES ..................................................................................................... 93 4.3. POST-EXPERIMENT QUESTIONNAIRE RESULTS.............................. 96 CONCLUSION AND RECOMMENDATIONS ...................................................... 105 BIBLIOGRAPHY ....................................................................................................... 110 APPENDIX 1. SPEECH MATERIALS .................................................................... 120 APPENDIX 2. TABLE OF ICT TOOLS AND PLATFORMS RELATED TO INTERPRETING TECHNOLOGY .......................................................................... 124 x APPENDIX 3. ETHICS COMMITTEE APPROVAL ........................................... 126 APPENDIX 4. THESIS/DISSERTATION ORIGINALITY REPORT ................. 
127 xi LIST OF ABBREVIATIONS AI : Artificial Intelligence AIIC : International Association of Conference Interpreters API : Application Programming Interface AR : Augmented Reality ARI : The Automated Readability Index ASR : Automatic Speech Recognition CAI : Computer-Assisted Interpreting CI : Consecutive Interpreting EM : Effort Model ER : External Resources ESIT : École Supérieure d’Interprètes et de Traducteurs (School for Interpreters and Translators in Paris, France) ETI : École de Traduction et d'Interprétation (School of Translation and Interpreting in Geneva, Switzerland) EVS : Ear-Voice Span HMI : Human-Machine Interaction ICT : Information and Communications Technology LLM : Large Language Model MFD : Mean Fixation Duration MI : Machine Interpreting MT : Machine Translation NER : Named Entity Recognition NLP : Natural Language Processing PE : Post-Editing RI : Remote Interpreting RSI : Remote Simultaneous Interpreting S2ST : Speech-to-Speech Translation SCI : Sight-Consecutive Interpreting SI : Simultaneous Interpreting SMOG : Simple Measure of Gobbledygook ST : Speech Translation xii TD : Translation Dictation TIS : Translation and Interpreting Studies UI : User Interface VR : Virtual Reality WER : Word Error Rate xiii LIST OF TABLES Table 1: ICT Tools and Platforms Related to Interpreting Technology Table 2: Advantages and Disadvantages of Tablet Interpreting (Goldsmith, 2018, p. 357) Table 3: Readability Index Results and Lexical Density Ratios of Speech Materials Table 4: Detailed Descriptions of Speech Materials (Duration, Length, Units of Meaning) Table 5: Word-Error-Rate Results and Precision of ASR in Named Entity Recognition Table 6: Distribution of Speech Materials per Participant Table 7: Instances of Disfluency Markers per Participant xiv LIST OF FIGURES Figure 1. The conceptual spectrum of interpreting drafted by Pöchhacker Figure 2. Glossary creation and editing in InterpretBank Figure 3. The memory feature of InterpretBank Figure 4. 
The main interface of InterpretBank ASR Figure 5. Glossary management page of Interpreter Assist Figure 6. ASR Feature in KUDO Interpreter Assist Figure 7. The user interface of SmarTerp Figure 8. The workflow of the ASR-CAI integration in the case of InterpretBank Figure 9. The functionalities of Livescribe™ Echo® Smartpen Figure 10. The main layout of Sight-Terp (Tablet View) Figure 11. A segmented text on the interface of Sight-Terp Figure 12. Named entities highlighted in Sight-Terp interface Figure 13. Digital Notepad feature of Sight-Terp Figure 14. The comparable results of the preliminary test: complete renditions of meaning units in % Figure 15. The comparable results of the main test: complete renditions of units of meaning in %. Figure 16. The durations of the performances (in minutes and seconds) Figure 17. The answers to the question “How would you evaluate your experience with the Sight-Terp tool?” Figure 18. The answers to the Likert item “I think the Sight-Terp tool is easy to use.” Figure 19. The answers to the Likert item “Using automatic speech recognition during the consecutive interpreting task negatively affected my performance.” Figure 20. The answers to the Likert item “I think the features in Sight-Terp contributed to my consecutive interpreting performance.” Figure 21. The answers to the question “Do you think the automatic speech recognition function in Sight-Terp is accurate and reliable?” Figure 22. The answers to the question “Which automatically generated output did you use for support during consecutive interpreting?” Figure 23. Answers to the question “Would you use the Sight-Terp tool in your future professional life?” xv LIST OF CHARTS Chart 1. The procedure followed in the study 1 INTRODUCTION The key role of information and communication technologies (ICT) in interpreting is inarguably prominent considering recent tailor-made technological solutions for interpreters. 
Remote interpreting (RI) solutions have changed the way interpreters work and created a digital identity along with its problems and contributions. Machine interpreting (MI), on the other hand, though far from human parity, has the potential to create thought-provoking debates on user perception, multilingualism, and communicative perspective. The advancement of technology has brought about a plethora of tools and solutions to enhance the accuracy and efficiency of interpreters. With the use of computer-assisted interpreting (CAI) tools and natural language processing (NLP) applications, interpreters now have access to a whole new world of linguistic and technical possibilities, which can revolutionize the way they approach their work. A computer-assisted interpreting tool is defined as software ‘specifically designed and developed to assist interpreters in at least one of the different sub-processes of interpreting’ (Fantinuoli, 2018b, p. 12). CAI tools emerged to fulfil a common objective: helping interpreters in a wide range of productivity- and quality-related tasks, from easing cognitive load to conference preparation and terminology organization. As a matter of fact, technological trends in the field of interpreting have shifted with new developments in natural language processing, speech technologies, general artificial intelligence and the changing role of interpreters with the rise of remote simultaneous interpreting (RSI); the so-called technologization process, or ‘technological turn’ (Fantinuoli, 2018b), has changed the way “computer-assisted interpreting” is perceived. Automatic speech recognition technology is a game-changer for the new generation of CAI tools. The quality of ASR systems has been incrementally improved thanks to new advancements in deep learning1, which brought about the question of whether CAI tools and ASR can be integrated. 
ASR-integrated CAI tools have been proposed and designed to alleviate the cognitive strain on interpreters during the interpreting process, while simultaneously augmenting their processing capabilities. (Footnote 1: Deep learning is a subset of artificial intelligence (AI) that focuses on teaching machines to learn and process data in ways that resemble human learning.) The aim in principle is to automate the querying system in real time in simultaneous interpreting and to automatically display a reliable transcript of the source speech within a time frame that fits into interpreters’ ear-voice span (EVS). These new-generation ASR-enhanced CAI tools have recently gained traction thanks to tools (or projects) such as InterpretBank (Fantinuoli, 2016), SmarTerp (Rodríguez et al., 2021), VIP (Corpas-Pastor, 2021) and KUDO Interpreter Assist (Fantinuoli et al., 2022). ASR, with “considerable potential for changing the way interpreting is practiced” (Pöchhacker, 2016, p. 188), has a pivotal role in shaping the concept of human-machine interaction in the context of interpreting. Several empirical studies have questioned possible ASR implementation as an automated querying system (Hansen-Schirra, 2012; Fantinuoli, 2017), investigated the feasibility of ASR-enhanced CAI tools in the context of problem triggers (Ricci, 2020; Van Cauwenberghe, 2020; Defrancq & Fantinuoli, 2021; Rodríguez et al., 2021; Pisani & Fantinuoli, 2021; Montecchio, 2021; Prandi, 2023), used ASR for meeting the preparatory needs of interpreters (Gaber et al., 2020) and implemented ASR for supporting interpreters with the transcription of the source speech (Cheung & Tianyun, 2018; Wang & Wang, 2019). In order to enhance the depth of empirical research on CAI tools, this study deviates from the earlier studies that primarily investigated the use of ASR in simultaneous interpreting, instead focusing on the usage of an ASR and MT-enhanced CAI tool in consecutive mode. 
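The automated querying loop described above, in which each incoming ASR segment is matched against a prepared glossary so that hits can be displayed within the interpreter's ear-voice span, can be pictured in a few lines. The following is a deliberately simplified stdlib sketch: the glossary entries, the function name and the plain substring matching are hypothetical illustrations, not the actual implementation of Sight-Terp, InterpretBank or any other tool mentioned.

```python
# Toy glossary: source term -> target term (hypothetical entries).
GLOSSARY = {
    "greenhouse gas": "sera gazı",
    "supply chain": "tedarik zinciri",
}

def query_segment(segment, glossary=GLOSSARY):
    """Return (source, target) glossary hits found in one ASR transcript segment."""
    text = segment.lower()
    return [(src, tgt) for src, tgt in glossary.items() if src in text]

# Simulated stream of ASR partial results; in a real tool each hit would be
# pushed to the interpreter's screen before the rendition begins.
stream = ["Greenhouse gas emissions rose sharply", "across the global supply chain."]
hits = [hit for seg in stream for hit in query_segment(seg)]
```

A production system would of course use fuzzy matching and lemmatization rather than exact substrings, since ASR output rarely matches glossary entries verbatim.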
The study2 attempts to fill a gap in the available literature on computer-assisted interpreting tools by proposing a prototype of an ASR-enhanced digital application and providing insights into the effectiveness of ASR and technology usage in enhancing interpreter performance in consecutive interpreting (CI), which could help shape the creation of more sophisticated CAI tools that cater to the specific needs of interpreters. The study aims at exploring whether there is a significant difference in the performances of a group of participants in CI tasks when using the ASR-enhanced CAI tool “Sight-Terp” (see section 2.4.), which is developed by the author within the scope of this thesis. (Footnote 2: The scope and the results of the preliminary test of this thesis were presented under the title “Investigating the usage of ASR and speech translation in consecutive interpreter workstation: A pilot study on ASR-enhanced CAI tool prototype ‘Sight-Terp’” at the TC44 Translating and the Computer Conference organized in Luxembourg on 22-25 November 2022.) Sight-Terp3 is a prototype of a CAI tool that initiates continuous speech recognition and provides real-time speech translation, named entity recognition and automatic segmentation of a speech. The named entity recognition (NER) function allows users to easily detect the named entities, such as numerals and proper names, in the automated texts to improve their lookup mechanism. Participants' performances were tested and analyzed for accuracy and fluency using a repeated measures design. Accuracy was measured by calculating the percentage of the accurately rendered “units of meaning” (Seleskovitch, 1989) in each performance. A non-parametric statistical test (Wilcoxon Signed-Rank) was used to compare performances without technological aid and with Sight-Terp. A follow-up qualitative survey was given to the participants to obtain comparative responses and perceptions on the tool usage. 
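The two quantitative steps just described, computing the percentage of accurately rendered units of meaning per performance and then comparing the paired conditions with a Wilcoxon signed-rank test, can be sketched as follows. This is a simplified stdlib illustration under the assumption that units of meaning have already been annotated manually; the function names and sample figures are hypothetical, and in practice a statistics package would also report the p-value.

```python
def accuracy_percentage(rendered_units, total_units):
    """Percentage of accurately rendered units of meaning in one performance."""
    return 100 * rendered_units / total_units

def signed_rank_statistic(before, after):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired scores.
    Zero differences are dropped; tied absolute differences share their
    average rank, as in the standard procedure."""
    diffs = [b - a for a, b in zip(before, after) if b != a]
    abs_sorted = sorted(abs(d) for d in diffs)
    def avg_rank(value):
        first = abs_sorted.index(value) + 1                  # lowest rank of tie group
        last = len(abs_sorted) - abs_sorted[::-1].index(value)  # highest rank
        return (first + last) / 2
    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return min(w_plus, w_minus)
```

The statistic would then be compared against the critical value for the given number of non-zero pairs to decide significance.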
The study has the potential to inform interpreting education and practice by highlighting the benefits and challenges of incorporating ASR technology in the interpreting process. By doing so, the study can also offer important practical suggestions for the future development of CAI tools for interpreters. The first chapter of the study serves as the introduction and outlines the aim, significance, research questions, limitations, assumptions, and research definitions. The second chapter focuses on the background and the literature review of this study, starting with historical and etymological aspects of interpreting (2.1.) and cognitive dimensions of interpreting with a focus on Effort Models by Daniel Gile (2.2.). Chapter two also touches upon technology in interpreting by providing a classification of ICT tools and platforms (2.3.1). Further, in section 2.3.2., CAI tools are defined, with three examples of ASR-enhanced CAI tools available on the market. Section 2.3.3. then explains speech technologies in general, coupled with qualitative and quantitative data from various studies on ASR integration into interpreting and translation. Section 2.3.4 mentions the usage of technological solutions for consecutive interpreting. Finally, the last section of chapter two gives a detailed description of the proposed CAI tool Sight-Terp (2.4.). (Footnote 3: Sight-Terp is publicly available at: https://www.sightterp.net.) Chapter three outlines the methodology of the study, including its design, data collection instruments, participants, and procedure. Chapter four presents the findings and discussions related to the accuracy and fluency differences in interpreting performance as well as comprehensive feedback from the users. Finally, the concluding chapter dwells on the conclusion reached at the end of the study and provides recommendations for future research. 5 CHAPTER ONE SCOPE OF THE STUDY 1.1. 
AIM OF THIS STUDY The purpose of this study is to evaluate the effectiveness of the ASR-enhanced CAI tool Sight-Terp (https://www.sightterp.net), developed within the scope of this thesis, in enhancing the performance of consecutive interpreters by facilitating real-time speech translation, named entity recognition and automatic segmentation. As its primary objective, the research aims to investigate whether the use of Sight-Terp improves the accuracy and fluency of CI. By means of a within-participants repeated measures design, the study seeks to empirically test the performance of a group of interpreters who will use Sight-Terp during the post-test phase. Furthermore, the research attempts to collect qualitative feedback from the participants through a follow-up survey, which will offer insights into their experiences and perspectives on using the tool. The contribution of this study to the field of interpreting will be to provide evidence on the effectiveness of ASR-based CAI tools in improving interpreters' performance by testing whether there is a significant difference in participants' performance. 1.2. SIGNIFICANCE OF THIS STUDY Process-oriented translation and interpreting research in experimental settings has gained traction in recent decades. Within the scope of interpreting research, research trends investigating the impact of technology-enabled interpreting tools on interpreters’ tasks have mostly centred upon simultaneous interpreting. Recognizing the need to expand empirical research on computer-assisted interpreting (CAI) tools, this study diverges from previous investigations that primarily focused on automatic speech recognition (ASR) in simultaneous interpreting. Instead, it examines the utilization of an ASR and machine translation (MT) augmented CAI tool in consecutive interpreting. By proposing a prototype digital application, the study aims to address a gap in the current body of literature pertaining to CAI tools. 
The empirical research conducted in this study can provide valuable insights into the effectiveness of these tools in improving interpreter performance and can inform the development of more advanced CAI tools. This thesis distinguishes itself by employing the English-Turkish language pair, whereas similar studies investigating ASR in interpreting have predominantly focused on high-resource European languages or Chinese. In addition to addressing the research questions posed by the methodology, this study compiles a comprehensive table in the literature review section that highlights the various tools and platforms associated with information and communication technologies in interpreting. By doing so, it aims to provide an extensive overview of the resources that influence, either partially or entirely, the practice of interpreting. Moreover, the methodology employed in this empirical study may raise new questions as to whether new methodological designs are needed for product-oriented CAI tool research to achieve better generalizability, particularly in technology-assisted CI. The results of this study can have practical implications for professional interpreters, interpreter training programmes and speech technology developers, as they can inform the development and integration of more efficient and effective interpreting technology tools, particularly those enhanced with AI and automatic speech recognition. Finally, the results of this study can lead to a better understanding of the potential of human-machine interaction in interpreting and contribute to ongoing efforts to improve the quality of interpreting services.

1.3. RESEARCH QUESTIONS

1.
Does the use of the CAI tool Sight-Terp in consecutive interpreting, which provides both a source transcription and a machine translation output, lead to a significant improvement in interpreting accuracy compared to interpreters' performance without technological aid?

2. Are there significant differences in the number of disfluencies (pauses, hesitations, repetitions, stuttering, false starts) between pre-test performances without CAI support and post-test performances with Sight-Terp support?

3. How do users interact with Sight-Terp? Do its interface design and ergonomic features meet the required standards for efficient and effective interpretation?

1.4. LIMITATIONS

The results obtained from our empirical evaluation must be interpreted in a nuanced manner, as they are subject to certain limitations. One such limitation is that the experiment was conducted with student/novice interpreters. Second, the language pair used in this study is Turkish and English, and the interpreting task requested from the participants is in the direction from English into Turkish. Directionality is another phenomenon that may introduce additional factors and interfere with the accuracy and completeness of the interpreting performance, particularly in technology-mediated interpreting scenarios. Third, the use of pre-recorded speeches may not reflect the challenges and demands of live interpreting, which could limit the generalizability of the results. It is also critical to acknowledge that variables such as specific domains, speech characteristics, and accents, among other factors, are highly relevant and may significantly affect the tool's performance and usability. As a fourth limitation, the ASR system that Sight-Terp relies on is the Microsoft Azure Speech Recognition API.
At the time of writing this thesis, the Microsoft Speech Recognition API is considered one of the best ASR systems compared to other equivalent software. However, this limitation should still be taken into account when evaluating the proposed software's overall performance and effectiveness.

1.5. ASSUMPTIONS

1. All participants are presumed to possess comparable skills and levels of expertise in consecutive interpretation.

2. The participants are assumed to perform to the best of their ability and to be motivated to achieve high levels of accuracy and fluency in their interpretation, regardless of the presence of technological aids.

3. The participants are assumed to be honest and sincere in their self-assessment of their performance and to provide accurate responses in the questionnaires.

4. The reliability indices for the materials used in the research are presumed to be adequate and valid for assessing the performances, and the pre-test and post-test speeches are assumed to have similar levels of difficulty and content familiarity.

5. The laboratory conditions are assumed to affect all subjects in a similar manner, and all subjects are assumed to participate in the tasks with their utmost focus and concentration.

1.6. DEFINITIONS

Automatic Speech Recognition (ASR): ASR is a subfield of natural language processing and artificial intelligence (AI) that focuses on the development of algorithms and models to convert spoken language into written text.

Speech Translation (ST): Speech translation is a machine learning task that employs a variety of techniques and models to translate spoken language from one language into another.

Computer-Assisted Interpreting (CAI) tools: CAI tools refer to a wide range of computer programs that have been developed with the primary purpose of supporting interpreters in one or more of the diverse sub-processes of interpreting.
CAI tools provide human interpreters with real-time support in the form of speech recognition, translation, and other aids to enhance their interpreting performance.

Named Entity Recognition (NER): NER is a natural language processing task (or technique) used to identify and extract important entities such as names, locations, and organizations from a text, providing a more comprehensive understanding of the information being processed.

CHAPTER TWO

THEORETICAL BACKGROUND

This chapter delineates the theoretical background of this thesis and provides a broad literature review of the core concepts linked with the professional, academic and technological aspects of interpreting. In the first section, after a historical and etymological overview of interpreting per se, the modes and settings of interpreting are defined in their respective subsections. The second section outlines the main principles of the cognitive dimension of interpreting, with a particular focus on Daniel Gile's Effort Models, which are closely associated with the cognitive aspects of interpreting. Section three elaborates on information and communication technologies in interpreting and classifies technology-relevant interpreting tools and platforms in a single frame. Further, speech technologies, including ASR integration into interpreting and translation, are briefly described, along with relevant data from qualitative and quantitative studies. Moreover, a subsection is devoted to the use of technology in CI, drawing on the few articles published so far for a better understanding of recent approaches. Finally, the last section introduces the computer-assisted interpreting tool Sight-Terp, on which this thesis is grounded, and provides an elaborate description of its features.

2.1.
INTERPRETING: AN OVERVIEW

Throughout history, interpreting has been required in any cross-linguistic communicative event in which communication must pass across barriers of culture and language. It has been used for centuries to facilitate communication between individuals or groups who speak different languages. The use of interpreters has continued to evolve and expand throughout history. With the rise of globalization, communication between countries has increased, and so has the demand for interpreters. This has caused the interpreting industry to become more professional and standardized, creating professional groups and introducing interpreter training and certification programs. As a result, the ancient human practice of interpreting has undergone many social, cultural and, most importantly, professional phases up until now. This section begins with a brief introduction to the concept of interpreting and its definition, along with its history. It then goes on to explain the ramifications of the practice in terms of different modes and settings.

2.1.1. Defining Interpreting

Briefly defined, interpreting is the act of transferring a message from one language (signed or oral) into another language form. Different conceptual approaches are observable in defining interpreting in a broad manner. In the Routledge Encyclopedia of Translation Studies, interpreting scholar Daniel Gile defines interpreting as "the oral or signed translation of oral or signed discourse, as opposed to oral translation of written texts" (2009, p. 51). Many languages have a corresponding equivalent word for interpreter and interpreting that is distinct from the words used for (written) translation. Etymologically, the first trace comes from the Akkadian word targumannu and its corresponding Aramaic form turgemana, whose semantic component is 'to explain' (Pöchhacker, 2015, p. 198).
The word finds its correspondence as tarjuman/targuman in Arabic, dragoumanos in Middle Greek, dragumannus in Medieval Latin, dragomanno in Italian, drugemen/drogman in French, tercüman in Turkish, and tolmács in Hungarian. The semantic inference of 'explaining' in these words also has a root in the Greek word hermeneus (cf. hermeneutics), referring to the Greek god Hermes, who interpreted the ethereal communiqués of the gods into the language of mortals for the sake of humanity. The English term "interpreting" has its origins in the Latin words interpres and interpretari. These words travelled through Old French and Anglo-French before finally being incorporated into modern English, accommodating diverse dialects and linguistic norms. As a result, the term has taken on different meanings in different contexts, with some restricting it to the act of facilitating communication between multilingual speakers and others embracing a more expansive interpretation that includes any kind of translation, whether written or spoken. Apart from the etymological origin, it is also possible to draw a line of distinction between translation and interpreting in that interpreting is performed 'here and now', and its feature of 'immediacy' distinguishes the word 'interpreting' from other translational activities (Pöchhacker, 2016, p. 10). This denomination allows for the incorporation of other manifestations, such as signed language interpreting, and avoids dichotomies of oral vs written translation by moving away from the common definition of "the oral translation of an oral discourse" (Gile, 1998, p. 40; 2004). Otto Kade characterizes interpreting as a form of translation in which "the source-language text is presented only once and thus cannot be reviewed or replayed, and the target-language text is produced under time pressure, with little chance for correction and revision" (1968, as cited in Pöchhacker, 2016).
This definition clearly articulates the feature of immediacy, as the interpreter has limited potential to access the source text (which can be substituted with "acts of discourse" and/or "utterances") in its "one-time presentation" (p. 10). All in all, every definition features interpreting as an in-the-moment activity that focuses on facilitating oral communication.

2.1.2. History of Interpreting

Throughout history, mediation, reciprocity, connectivity, and interconnectedness have always been at the heart of the engagement of civilizations, countries, tribes and the like. At the basis of this engagement, and of all cultural interactions, were wealth, reputation, invasion, and the struggle for sovereignty. Whether in conflict or not, peace-making has also been a matter of talking and therefore of language. Older than the invention of writing, interpreting has played an inevitable and crucial role in war, peace, trade, and administration, in addition to its undeniable role in peace negotiations, the social interactions of civilizations, and the spread of religions across many periods. Historically, records about interpreting are not abundant, for some presumable reasons, particularly prior to the Middle Ages. First, interpreting might have been considered a daily, common activity. Secondly, those in power who shaped history writing did not consider the interpreter's name worth mentioning, which resulted in a lack of historical documentation (Roland, 1982, p. 4). Another possible reason is the merit of invisibility as an integral ethical principle upheld by interpreters. As such, they were not considered worth recording in official minutes and administrative documents. The earliest known evidence of interpreting comes from historical documents inscribing or mentioning interpreters engaged in the practice, such as the hieroglyph from ancient Egypt depicting a communicative action between parties (Delisle & Woodsworth, 2012, p.
248) or a handful of documentary evidence on the role of interpreters in the Roman Empire (Giambruno, 2008, p. 28). Interpreters escorted conquerors as they marched into foreign lands, assumed important roles in diplomacy and government in Ancient Egypt and in the Ottoman Empire, enjoyed social privileges in many societies (Diriker, 2005, p. 88), and constituted a recognized occupational group in Rome (Hermann, 2002). In ancient times, they mostly consisted of people with multiple ethnic backgrounds, slaves, or prisoners (Roditi, 1982). Correspondingly, the motivation for embarking on an expedition was not limited to religion but also included trade, power and the annexation of new territories. Conquerors selected their interpreters from the conquered lands, taking them back to their native country to teach them their language (Andres, 2012, p. 3). Ottoman interpreters, the dragomans, who were mostly in charge of the embassies and consulates of European states in cities under Ottoman rule, came from the non-Muslim Christian communities of the Fener and Pera districts, who were knowledgeable about Western culture and languages (Hitzel, 1995; Abbasbeyli, 2015). A similar pattern was seen among the ancient Greeks, who were not eager to learn new languages, as they considered their own language superior, and recruited interpreters from among bilingual foreign people whom they called "barbarians" (Wiotte-Franz, 2001). Profession-wise, it is also possible to trace early codes of ethics stipulated for interpreters. Mexican interpreters called "Nahuatlatos" were actively used during the Spanish influx into Central and South America. In this specific historical context, the striking point is that partly comprehensive legislation on interpreters was drafted by the Spanish authorities, enshrining the training, accreditation, and definition of interpreters in a code of ethics (Baigorri-Jalón, 2015, p. 16). Overall, the origins of interpreting hark back to ancient civilizations.
However, it was not until the 20th century that interpreting became a globally recognised profession, influenced by the convergence of significant political, technological, economic and social advances that played a crucial role in its development and growth.

2.1.3. Interpreting in Modern Times

The oldest and, at the same time, one of the most modern professions, interpreting has undergone many transitions on its way to institutionalization and to becoming a full profession as well as an academic discipline. In the past 100 years, interpreting has experienced new transformations and ramifications, with new modes emerging in new settings, mostly driven by economic, political, and social developments. The widespread adoption of multilingualism in international conferences became possible after the emergence of official French-English bilingualism at the League of Nations in the early 20th century. This was a remarkable turning point in that it ensured multilingualism at international conferences and solidified the role of interpreters in facilitating communication between diverse linguistic backgrounds. Before the end of the First World War and the Paris Peace Conference of 1919, the prevalence of French in diplomatic proceedings was such that the demand for interpreters was minimal, as most participants were fluent in the language. On the rare occasion that a delegate was unable to speak French, they were assisted by a personal secretary or interpreter. Nevertheless, given that the need for interpreting was much smaller than in today's era of globalisation, conference interpreting was not considered a profession in its own right at the time. In this period, CI was the mode mostly used for meetings, even though it would double their duration. SI with equipment was not seriously considered until the 1920s. Chronologically, in 1925, Edward Filene, a businessman, philanthropist and entrepreneur, came up with the idea of simultaneous interpreting.
He then appealed to Gordon Finlay, a staff member of the ILO, to conceive of a technique that could provide delegates with a method to listen to speeches via telephone. This system, called 'the Filene-Finlay simultaneous translator' (later renamed the "International Translator System" by IBM in 1945), was operational using the available telephone equipment. It is known that, on June 4, 1927, the first meeting with simultaneous interpretation took place at the International Labour Conference in Geneva (Gaiba, 1998, p. 3; Taylor-Bouladon, 2011). However, there is uncertainty about the exact date and meeting at which SI with equipment was first used. While Western scholars indicate that the ILO was the first venue, Soviet historiography mentions that SI was used for the first time at the VI Congress of the Comintern held in 1928 (Flerov, 2013). Another SI system, invented by Siemens & Halske, was used at the International Conference on Energy in Berlin in 1930 (Gaiba, 1998). Between 1920 and 1940, SI was used in some international conferences across Europe (Taylor-Bouladon, 2011), but CI was still used quite often, especially in parliamentary meetings of the ILO and the League of Nations. The rich and storied history of conference interpreting as a formal and respected profession dates to the successful deployment of SI at the infamous Nuremberg Trials of 1945-1946, which is considered a crucial milestone in the development of the profession. During these trials, interpreters were tasked with interpreting the speeches of Nazi war criminals, defendants, prosecutors and judges in English, French, Russian and German. Colonel Léon Dostert, General Eisenhower's interpreter, was entrusted with organizing the language mediation process of the trials.
John Tusa and Ann Tusa, in their book "The Nuremberg Trial" (1983), describe the event as follows:

"Colonel Dostert, the head of the translation section, had grouped his simultaneous translators into three teams of twelve: one team had to sit in court and work a shift of one and a half hours; another to sit in a separate room, relatively relaxed, but still wearing headphones and following the proceedings closely so as to ensure continuity and standard vocabulary when they took over; the third having a well-earned half-day off. The work was exacting. It needed great linguistic skills and total concentration. For many of those involved the subject matter imposed a further emotional strain. Working conditions were uncomfortable: the translators were cramped in their booths, which were even hotter than the courtroom. They spoke through a lip microphone to try to dampen their sound (the booth was not enclosed at the top) but not even the use of the microphone nor the huge headphones they wore could deaden the noise made by their colleagues. As they worked they had to fight the distractions of other versions and other languages."

The time-saving feasibility of SI and its organized application over the long run of the trials was another sign of its future usability. It was not until the 1950s that simultaneous interpretation was widely implemented at the United Nations in New York. At this time, the interpreters who worked in the English booth for the Security Council gained nationwide acclaim as their interpretations were broadcast over the radio (Taylor-Bouladon, 2011, p. 29). In later years, the system became operational using wired and wireless/infrared systems. The International Association of Conference Interpreters (AIIC) was founded in 1953. This occasion marked a turning point in the history of interpreting as we know it.
This is because AIIC adopted a code of ethics and professional standards to regulate the working conditions of interpreters and to raise the profession's profile on the global stage, which was a great success. From its birth, AIIC also established complex administrative structures that continue to exist to this very day, with a highly centralized professional organization currently operating in Geneva. Today, modern technology has revolutionised the field of interpreting, with interpreters relying on cutting-edge tools such as soundproof booths, wireless headsets and computer-assisted translation software to enhance their work. The industry has also become more nuanced, with specialist interpreters serving specific sectors such as finance, law and healthcare. Moreover, because of the extensive use of simultaneous interpreting that started with the Nuremberg trials, a great need for trained interpreters has arisen, leading to the creation of numerous degree courses around the world. Formal education started with the foundation of the École de traduction et d'interprétation (ETI) in Geneva and, subsequently, the HEC School of Interpreting in Paris, which was later replaced by the Sorbonne School of Interpreting and Translating (ESIT). Over time, many courses and programmes have been established, offering bachelor's, master's and PhD degrees to prospective interpreters and helping to professionalize the field. The world has undergone significant changes since the early days of interpreting, and new modes and settings have emerged due to advances in technology. These changes have transformed the field of interpreting, and the next section will delve into the details of these various modes and settings.

2.1.4.
Modes and Settings of Interpreting

It is possible to draw a conceptual map of interpreting with different settings and constellations, such as inter-social and intra-social settings and the situational constellation of interaction (i.e., conference interpreting vs dialogue interpreting) (Pöchhacker, 2016). Among many classifications, a distinct division can be drawn based on methods and contexts, as the practice of interpreting takes place in a number of different modes and settings, each of which presents unique challenges and opportunities for the interpreter. The literature draws no precise lines when it comes to classifying interpreting based on the settings in which the action takes place and the modes, which denote the temporal relationship between the interpretation and the source message. Researchers tend to use different criteria when explaining the settings and modes of interpreting. According to Diriker (2018), interpreting can generally be classified based on the languages used in the communicational context (spoken language interpreting and sign language interpreting), the form of the interpretation (simultaneous interpreting, consecutive interpreting, whispering, sight interpreting), and the context in which the translation is performed (conference interpreting and community interpreting). Doğan (2022) adopts a particular classification. She initially outlines types of interpreting based on the method of execution, with a particular focus on the consecutive and simultaneous modes, and then delineates another classification based on the settings where spoken and sign language mediation is needed. Accordingly, CI has subtypes such as classic consecutive, liaison interpreting, dialogue interpreting, and over-the-phone interpreting (p. 50), while simultaneous interpreting encompasses subtypes such as TV interpreting, whispering, video-conference interpreting, sight interpreting, conference interpreting and sign language interpreting.
The settings, namely the contexts of interpreting, include community interpreting, court interpreting, police interpreting, disaster interpreting (Disaster Relief Interpreters, ARÇ in short), sports interpreting, healthcare interpreting, and conflict interpreting. Interpreting scholar Franz Pöchhacker (2016) takes another step and creates a broader systematic typological map of interpreting based on language modality (spoken vs signed), working mode (simultaneous, consecutive, sight interpreting, etc.), directionality, technology use, and professional status (professional vs non-professional). The historical prevalence of professional interpreting at international conferences and meetings has led to the belief that conference interpreting is carried out exclusively through consecutive and simultaneous interpreting. These two modes have become synonymous with conference interpreting, and it is often assumed that they are the only methods used in this type of interpreting, being 'misconstrued in a taxonomic sense' (Pöchhacker, 2015). Pöchhacker sets forth the following interpretation of the topic:

Aside from the modality of the language(s) involved, which serves to contrast spoken language with sign language interpreting, the most common distinction is made in terms of the temporal relationship between the interpretation (target text) and the source text, which yields consecutive interpreting and simultaneous interpreting as the two main modes of interpreting. In a looser sense, different 'modes' can also be identified with reference to the directness of the interpreting process (relay interpreting) and the use of technology to deliver the interpretation, as in the case of remote interpreting provided in 'distance mode'. Much more relevant, however, are conceptual distinctions with reference to the settings in which interpreter-mediated social contacts take place.
On the broadest level, inter-social (or inter-national) scenarios, involving diplomats, politicians, scientists, business leaders or other types of representatives of comparable standing, can be viewed as different from intra-social (community-based) ones, in which one of the interacting parties is an individual speaking on his or her own behalf. The latter, subsumed under the broad heading of community interpreting, allow multiple interpreting subdivisions in terms of different institutional contexts, including legal interpreting, healthcare interpreting and educational interpreting, with numerous institution-related subtypes. (Pöchhacker, 2015, p. 199)

Moreover, by combining all of these distinctions, Pöchhacker details different formats of interaction in the scheme shown in Figure 1.

Figure 1. The conceptual spectrum of interpreting drafted by Pöchhacker (2016, p. 17)

In this subsection of the study, the main modes of interpreting, namely simultaneous interpreting, consecutive interpreting, whispering, sight interpreting and sign language interpreting, will be briefly addressed, with their specificities explored in a succinct manner.

2.1.4.1. Consecutive Interpreting

Consecutive interpreting (CI), the main mode and practice used in the experiment phase of this study, involves listening to the speaker's message in one language, with or without the use of electronic equipment, taking notes, and then delivering a full and immediate consecutive interpretation in another language. The interpreter in this mode waits until the speaker has finished a segment of speech before beginning to interpret it. In other words, the speaker and the interpreter take turns speaking.
This mode of interpreting requires the interpreter to possess a distinctive set of skills and abilities, including good memory retention, unwavering attention to detail, and note-taking dexterity, all while possessing a profound grasp of the languages in question. CI can be practised for any duration as long as the original act of discourse continues, since the length of the speech to be interpreted is not predetermined. It involves the interpretation of both short utterances and extended speeches and thus "can be conceived of as a continuum which ranges from the rendition of utterances as short as one word to the handling of entire speeches, or more or less lengthy portions thereof, 'in one go'" (Pöchhacker, 2016). There are factors that affect the CI process, such as the interpreter's working style, memory and situational factors. To cope with longer speeches, the note-taking technique, i.e. taking notes that represent ideas and concepts rather than words, is used; it was first introduced by pioneer conference interpreters in the early 1900s. Note-taking serves as a memory jogger for the interpreter. There are numerous methods and approaches to note-taking for CI, each with its own unique nuances and subtleties, such as mind-mapping, sentence condensation, and jotting down symbols, abbreviations, bullet points and keywords that trigger the memory of the speech content. Whether or not a note-taking technique is used divides CI into two classes: classic consecutive, where note-taking is most commonly used, and short consecutive, where the duration of the speech is less than two or three minutes and does not require the interpreter to take notes.
The term 'consecutive interpreting' emerged in the 1920s to designate what had until then been, so to speak, the standard or default mode of interpreting and to distinguish it from the new method known as 'telephonic' or simultaneous interpreting (Baigorri-Jalón, 2014; Andres, 2015), which later paved the way for the birth of the profession of conference interpreting (see section 2.1.3). Subsequent to the effective deployment of the technique of simultaneous interpreting at the Nuremberg Trials and its later adoption by the United Nations, the use of CI became less widespread. Simultaneous interpretation is commonly utilized for meetings with many languages and a large number of participants, whereas consecutive interpretation is more suited to smaller sessions with technical or confidential content, as well as ministerial negotiations. Additionally, CI is more flexible than simultaneous interpreting in that it allows the interpreter to communicate and clarify with participants, regulate the dialogical discourse, and observe the physical circumstances of the participants and their surroundings (Russell and Takeda, 2015). Simultaneous interpreting is widely regarded as a more advanced form of interpreting and more cognitively challenging than CI. According to Gile (2001a), it is often recommended that students begin their interpreting training with CI, since it serves as a foundation for the more complex task of simultaneous interpreting. Though this argument is still debated (e.g. Seleskovitch & Lederer, 1989; Russell et al., 2010), this approach can be observed in the curricula of schools of translation and interpreting around the world. Before moving on to the simultaneous mode, CI is included as one of the basic practices, along with courses in sight translation and note-taking (Niska, 2005, p. 49).

2.1.4.2.
Simultaneous Interpreting

Simultaneous interpreting (SI) is a type of interpreting that involves interpreting spoken language in real time while the speaker is still speaking. It is typically used in situations where a large group of people need to understand a speaker of another language, such as at international conferences or meetings. Interpreters working in the simultaneous mode are expected to produce a logically coherent output that is consistent with the source. The simultaneity of the act makes simultaneous interpreting a cognitively demanding process requiring a high level of language management. During simultaneous interpreting, the interpreter listens to the speaker through headphones and at the same time speaks into a microphone, allowing the audience to hear the interpretation via headphones or loudspeakers. This dynamic demands an exceptional degree of cognitive dexterity and linguistic prowess: the interpreter must keep pace with the speaker's delivery, interpreting as they listen, a feat requiring extraordinary mental agility. Conference interpreters in this mode usually work in a booth where they can concentrate on the interpretation without distractions. Collaboration is a key aspect of simultaneous interpreting because interpreters rarely work alone. Instead, they work in pairs or even trios, each taking a 20-to-30-minute shift. This tag-team approach allows one interpreter to take a short break while the other does the heavy lifting, interpreting in real time for the audience. In this situation, teamwork is essential, with each interpreter assisting the other as needed, for example with difficult terminology. To be successful, interpreters must have an in-depth knowledge of their working languages and cultures, as well as exceptional short-term memory.
Adequate preparation is also essential, including prior research into the subject(s) of the event, which can cover a wide range of areas such as finance, medicine, law and science. 2.1.4.3. Sight Interpreting Sight interpreting is an interpreting modality that requires the interpreter to render written materials orally in real time. It can be considered a hybrid mode in which a written source text is turned into oral output in another language. This mode has become a critical aspect of various industries such as law, medicine, and professional services, where the immediate verbalization of written documents or letters is imperative for the recipient's comprehension. In simultaneous interpreting, the tempo of the interpreter's delivery is largely dictated by the pace of the speaker's speech, whereas the pace of interpreting from a written text is entirely under the interpreter's control. While the other modes of interpreting depend mostly on auditory input, sight interpreting frees up the interpreter's memory but also poses an added challenge: part of the processing capacity must be allocated to the visual channel. The complexity of the modality has also been scrutinized within the framework of translation process research. Dragsted and Gorm Hansen (2007) showed that, in sight interpreting tasks, interpreters differ from translators in temporal variables and translational approach. Eye-tracking studies also show that the visual presence of the source text entails greater cognitive effort and visual interference, requiring additional resources to cope with the lexical and syntactic complexity of a written text (Shreve et al., 2010). In education, similar to the utilization of CI, sight interpreting has long been used to assess a candidate's aptitude, that is, the ability to swiftly comprehend and articulate the core essence of a given text.
This mode of testing is widely considered a benchmark for determining an individual's competency in the field of interpreting (Russo, 2011). It is also commonly believed that sight interpreting can help students navigate a text in a non-linear manner and identify key information (Čeňková, 2015). Sight interpreting has the potential to elevate the practice of simultaneous interpreting to an even greater level of accuracy and precision. This is because conference speeches are often written beforehand, allowing interpreters not only to listen but also to read the text in front of them. The skills gained from practising simultaneous interpreting with written material can also be extended to scenarios involving mixed media, such as presentations using PowerPoint and, most notably, presentations with real-time subtitles displayed on screens (Setton, 2015). This blend of listening and text-based interpretation results in what Gile (1995) terms "simultaneous interpreting with text". The SI-with-text modality is often considered advantageous; however, it is also highly complex. The written text, dense in information and language, often lacks the fluidity and prosody of spontaneous speech. This raises the issue of balancing the two acts: relying too heavily on the written text, which might result in lagging behind, and relying solely on auditory input, which can be too fast to process. Studies on this modality indicate some benefits as well. Lambert (2004) found that providing text materials to student interpreters affects their simultaneous interpreting performance: the results indicated a substantial improvement when participants were given ten minutes to prepare with the text made available to them.
Likewise, Lamberger-Felber (2001, 2003) examined the impact of simultaneous interpreting with and without text on target-text accuracy and omissions. A remarkable difference was observed in the proportion of correctly translated proper names and numbers when interpreters had access to the written text, in contrast to the figures obtained in its absence: accuracy reached 98% with time to prepare and 92% without. It is important to note that ASR aid in consecutive interpreting, which this study partly aims to investigate with a product-oriented methodology, shows similarities with SI with text, since the text is available as a reference during the execution of the interpreting task. Therefore, ASR-aided consecutive interpreting might be termed 'consecutive interpreting with text' or a 'sight-consecutive' modality. 2.1.4.4. Whispering (Chuchotage) Whispering, also known as chuchotage, is a variant of simultaneous interpreting. Known for its use in intimate, close-quartered situations such as business dealings or guided tours, this mode provides a one-on-one, personal experience for the listener. The interpreter, physically near the listener, whispers the interpretation, ensuring unobtrusive communication without disrupting the pace of the meeting or event. 2.1.4.5. Sign Language Interpreting Sign language interpreting involves the interpretation of verbal communication into sign language and vice versa, providing access and understanding for deaf and hard-of-hearing people. The role of a sign language interpreter requires a remarkable level of fluency in both sign language and spoken language, coupled with a comprehensive understanding of deaf culture and appropriate deaf etiquette, in order to effectively interpret the intended message. 2.2.
EFFORT MODELS IN INTERPRETING The cognitive dimension of interpreting has garnered extensive interest from experts in various disciplines, including neurology, psychology, linguistics, and cognitive science. This rich intellectual landscape has spurred numerous investigations into the fundamental cognitive processes involved in interpreting, including the pivotal aspects of listening and comprehension, production, and delivery. This section delves into how these crucial components of interpreting are explained by the interpreting scholar Daniel Gile's Effort Models (1997/2002). The cornerstone of the interpreting process lies in the crucial stage of listening and analysis. It is here that the interpreter must attentively perceive the source text produced by the speaker and embark on the initial step of speech analysis. This involves delving into the source text to decipher its message and, subsequently, finding its equivalent in the target language. The speech is then produced in a series of processes ranging from the initial formation of the message in the mind to speech planning and implementation. In interpreting studies, the first attempt to model the translational process was put forward by Danica Seleskovitch (1962) and later developed by Lederer (1981). Seleskovitch's contribution to the cognitive analysis of interpreting is widely known, particularly her triangular process model of interpreting. In this model, 'sense' is seen as the culmination of the process, rather than mere linguistic transcoding. More specifically, it is the interpreter's ability to grasp and convey the underlying 'sense' of a message, rather than fixed linguistic correspondences, that is the essential component of interpreting. 'Sense', according to Seleskovitch, is a deliberate cognitive addition to linguistic meaning, with the added characteristic of being non-verbal.
In general, the main idea in the interpretive theory is that 'deverbalised' meaning is more important in translation than linguistic conversion processes. Later, more comprehensive multi-phase models were created, focusing particularly on 'processing difficulties' (Pöchhacker, 2016). In this regard, Daniel Gile's 'Effort Models' (1985, 1997/2002) are based on the idea that, in situations where cognitive decisions are necessary to complete tasks, the issue of multiple-task performance arises, as the combined cognitive demands may surpass the individual's capacity limit for processing. Gile's effort model (EM) posits that there is a finite amount of cognitive 'effort', with three basic processes competing for this resource. These processes, 'listening and analysis' (L), 'production' (P) and 'memory' (M), are essential components of the interpreting process. According to the model, all efforts require processing capacity, and the sum of the three efforts must not exceed the interpreter's processing capacity, suggesting that successful interpreting requires careful management of cognitive resources. Gile introduced this model in 1985, and it remains a fundamental framework for understanding the cognitive demands of interpreting. The constraint can thus be stated as L + P + M ≤ Capacity. In his later work, Gile expanded the model by adding a 'Coordination Effort' (C) (management effort) and modelled simultaneous interpreting as SI = L(istening) + P(roduction) + M(emory) + C(oordination). The following set of formulas (Gile, 1997/2002) explains the relationship between the components, where the suffix R denotes the processing capacity required by an effort and the suffix A the capacity available for it; the overall processing capacity requirement is the sum of the individual requirements (Pöchhacker, 2016, p. 91): TR (total processing capacity requirements) = LR + MR + PR + CR; LA ≥ LR; MA ≥ MR; PA ≥ PR; CA ≥ CR; TA ≥ TR. Gile consequently assumes that the entire available capacity must be equal to or greater than the total requirements.
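Restated compactly in Gile's (1997/2002) notation, with the suffix R for the capacity required by an effort and A for the capacity available to it, the system of constraints reads:

```latex
\begin{aligned}
TR &= LR + MR + PR + CR \\
LA &\ge LR, \quad MA \ge MR, \quad PA \ge PR, \quad CA \ge CR \\
TA &\ge TR
\end{aligned}
```

That is, interpreting proceeds smoothly only if every individual effort, and the process as a whole, stays within the capacity available to it.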
His contribution demonstrates, on the basis of these formulas, that during the interpreting process an interpreter operates within the limits of their own capacity. For the interpreting process to proceed smoothly, the available capacity for each effort must be greater than or equal to the capacity required by the relevant task. If an effort is not performed adequately, errors, omissions and infelicities may occur, such as incomplete comprehension, incorrect target reformulation or incomplete retrieval of information. The EMs in SI, devised by Gile, are underpinned by the Tightrope Hypothesis (Gile, 1999). In essence, this hypothesis posits that interpreters, much like tightrope walkers, operate on the brink of cognitive saturation (Gile, 2009, p. 198). This precarious balancing act is a constant challenge, as they must coordinate various sub-tasks. Gile's analysis reveals that when the interpreter's cognitive capacity reaches its limit, errors, omissions and infelicities (EOIs) occur. These missteps stem from an inability to effectively deal with "problem triggers" (Gile, 1999, p. 157), such as specialized terms, proper names, and numerical data, which demand heightened cognitive resources. Gile has created other models that represent the distinct challenges and efforts associated with various interpreting modalities, such as simultaneous interpreting with text, consecutive interpreting, sign language interpreting, and even remote interpreting. Due to their relevance to this study, I will briefly focus below on the EMs in CI and the EMs for human-machine interaction (HMI). 2.2.1. Effort Models in Consecutive Interpreting Different from SI, in CI (with notes) the model includes additional operations, since different tasks are involved. To be more precise, during the listening phase the listening effort is the same as in SI, but another production effort is executed when the notes are manually produced for memory-jogging.
Additionally, during the listening phase, a short-term memory effort is required to store the information until it is noted (Gile, 2001). During the reformulation phase, three efforts are required: the note-reading effort (deciphering), the long-term memory effort, which entails retrieving information from long-term memory and reconstructing the speech content, and finally the production effort, the operation of generating the target-language speech. Ultimately, for CI with notes, the following model can be drafted (Gile, 2023): Listening and Comprehension phase: L + M + NP + C (NP: note production); Reformulation phase: NR + SR + P + C (NR: note reading; SR: speech reconstruction from memory). Based on this model, it is worth noting that the interpreter is able to dedicate a greater degree of attention to monitoring their output during the speech, compared to the simultaneous interpreting process, where such monitoring may be more difficult to accomplish due to the demands of real-time production. Similarly, since SI involves the simultaneous processing of two languages in working memory (Gile, 2001), interpreters devote some attention to inhibiting the influence of the source language to avoid 'linguistic interference' (p. 2), making it a more challenging task. Conversely, in CI, the effort of inhibiting the source-language influence might be much weaker or even non-existent, since the notes taken are shorter, more summarized and organized. From this point of view, note-taking during comprehension imposes greater cognitive demands, whereas the cognitive pressure during the reformulation phase in CI with notes is relatively lower. However, this balance may shift when technological aids like Sight-Terp are incorporated into the interpreting process: the equilibrium could alter depending on which subtask the technology helps to reduce the cognitive load for. 2.2.2.
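Set beside the basic SI model (SI = L + P + M + C), the two phases of Gile's CI-with-notes model can be written out as:

```latex
\begin{aligned}
\text{CI, listening and comprehension phase:} \quad & L + M + NP + C \\
\text{CI, reformulation phase:} \quad & NR + SR + P + C
\end{aligned}
```

The split into two consecutive phases is what distinguishes CI from SI: each phase carries its own bundle of efforts, rather than all efforts competing at once.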
Effort Models in Human-Machine Interaction Daniel Gile suggests in his keynote speech (2020) that the Effort Models could give rise to new versions if researchers and teachers discover novel functions connected to significant attentional resource requirements in interpreting. A potential situation could arise if interpreters were required to direct significant attention to interaction with additional screens, interfaces, and technological tools. A recent development may serve as an example. During the COVID pandemic, remote interpreting platforms grew in number, and over the last three years there has been an increasing volume of demand for interpreters working remotely. When team members are not in the same location, communication between boothmates must take place through video-conference platforms which, though sharing some essential similarities, have different interfaces and functions. In a similar vein, CAI tools (see section 2.3.2), especially those designed specifically for in-booth scenarios, have certain functionalities that require familiarity and additional cognitive resources. In this respect, Gile (2020; 2023) postulates the following model (for SI), taking into account the changing technology and working environments: SI: R + M + P + HMI + C. Here, 'R' stands for reception, 'which can be both auditive and visual' (Gile, 2020), while HMI stands for human-machine interaction. HMI is a broad concept which might comprise different efforts. In the example of remote interpreting, Gile adds the turn-taking effort (TT). Turn-taking in remote interpreting can be more complex and challenging than in traditional settings, due to factors such as latency, audio quality, and coordination with other participants. More generally, many combined efforts are required to manage and troubleshoot technology-related issues, such as connectivity problems, audio and video settings, and platform-specific features.
In the main study of this thesis, the software Sight-Terp uses ASR to generate a transcript of the speech, from which the interpreter can deliver the interpretation. Since there is no need for note-taking5, the interpreter can allocate their attention to every detail in the speech and focus on formulating the interpretation in their mind, without the extra cognitive pressure of note-taking. Though this means less cognitive pressure during the comprehension phase, the constant visual presence of the auto-generated text may induce more cognitive pressure during the reconstruction phase, requiring the interpreter to constantly reformulate and adjust their interpretation. The additional features of Sight-Terp (named-entity highlighting, automatic segmentation), which are detailed in the following sections (section 2.4), are deployed in order to mitigate the linguistic interference more generally associated with sight interpreting (Agrifoglio, 2004). [5: Sight-Terp, in fact, allows for digital note-taking with a stylus (such as the Apple Pencil). Though this digital note-taking feature is described in the study, note-taking is excluded from the main study and the participants were instructed to use only the ASR function.] Based on Gile's Effort Models, a formula for an effort model specific to sight-consecutive interpreting can be drafted. In sight-consecutive interpreting, the interpreter relies on a text-based reference generated by an ASR system, which reduces the cognitive load associated with listening and memory to some extent. Consequently, the effort model for sight-consecutive interpreting might place more emphasis on the analysis of the text, the production of the target language, and the coordination of these efforts.
In light of these restrictions, mitigations and possible cognitive requirements brought by Sight-Terp, the following model can be drafted to encompass the sight-consecutive (SCI) modality: SCI: Listening and comprehension phase: L + M + NV + C (L: listening; M: memory; NV: note verification; C: coordination). NP is replaced with NV (note verification), denoting the interpreter's effort to monitor the accuracy of the ASR output, make corrections, and adopt strategies and coping mechanisms accordingly. The cognitive demands of using the tool will likely vary depending on the quality of the ASR output. Reformulation phase: BR + SR + P + C (BR: bilingual note reading; SR: speech reconstruction; P: production; C: coordination). In the reformulation phase, BR (bilingual note reading) is included to manage the bilingual display combining the MT output and the auto-generated source transcript, SR (speech reconstruction) to reconstruct the meaning of the source text, P (production) to produce the interpretation, and C (coordination) to manage the use of the tool. Strong C and P efforts are needed in this phase because of the linguistic interference potentially resulting from the bilingual display of the MT output and the auto-generated source transcript together (see 2.4.1.1.). 2.3. TECHNOLOGY AND INTERPRETING As technological advances overhaul the interpreting sphere, they cause a shift in the traditional practices of interpreters. The proliferation of large language models (LLMs), machine translation, speech recognition technologies and other cutting-edge tools has the potential to transform the interpreting process and demands a change in the way interpreters approach their work. The impact of technology on interpreting is multifaceted. On the one hand, technological innovations have streamlined information access and work management for interpreters, leading to an increase in productivity.
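Comparing the proposed SCI model with Gile's CI-with-notes model makes the substitutions explicit: note production (NP) becomes note verification (NV) in the first phase, and note reading (NR) becomes bilingual note reading (BR) in the second, while the remaining efforts carry over:

```latex
\begin{aligned}
\text{CI:}  \quad & L + M + NP + C &\;\longrightarrow\;& \; NR + SR + P + C \\
\text{SCI:} \quad & L + M + NV + C &\;\longrightarrow\;& \; BR + SR + P + C
\end{aligned}
```

The arrow separates the listening-and-comprehension phase from the reformulation phase in each modality.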
On the other hand, the emergence of new technologies has disrupted the demand for interpretation services in the marketplace and overhauled the entire landscape of the industry. The following section explores the proliferation of technology and its impact on interpreting, along with the latest technological developments and concepts, with a particular focus on ASR-enhanced CAI tools. I will delve into speech technologies and their impact on both written translation and interpreting, and examine ASR-enhanced computer-assisted interpreting tools. As technology continues to shape the landscape, I will explore the ways in which it affects consecutive interpreting, highlighting innovative methods and techniques. Finally, I will focus on the proposed tool 'Sight-Terp' and provide an insight into its features and capabilities. 2.3.1. The Emergence of Information Technologies in Interpreting Information and communication technology (ICT) tools have been a driving force in the pursuit of improved quality and productivity in both translation and interpreting over the last two decades. Interpreting has not experienced an impact as significant as the transformative effects ICT has had on translation. Nevertheless, there have been crucial technological advances in the field of interpreting. When discussing the evolution of interpreting in light of the emergence of information technologies, it is worth highlighting some key breakthroughs in the field. One such example, as stated in section 2.1.3, is the advent of simultaneous interpreting. SI stands out as the first game-changing innovation, which took place in the 1920s when IBM developed a hardwired system for instantaneous speech transmission. Gaining popularity at several international conferences, the wired system eventually made its mark in history by becoming an irreplaceable asset during the Nuremberg trials.
Needless to say, this breakthrough changed the way interpretation is facilitated on a daily basis and elevated the social status of interpreters. The second, and most important, breakthrough is the introduction of the World Wide Web, which has revolutionized the way interpreters access and share information, opening up new avenues for research and collaboration. The significance of the internet lies in the crucial need for preparation for interpreting assignments: conference interpreters constantly engage with different "specific terms, semantic background knowledge and context knowledge" in each assignment they take on (Rütten, 2016). The web offers a powerful advantage: the ability to gather information quickly from a multitude of sources. By streamlining the information management process, it has increased the efficiency of interpreters' preparation. Today, the landscape of interpreting technology is vast and varied, characterised by a wide range of technological solutions that have played a significant role in ushering in a 'technological turn' (Fantinuoli, 2018b) in the profession and in creating bespoke and non-bespoke computer-assisted interpreting tools. Recent technologies in today's interpreting sphere can be categorized along the lines of the purposes and functions of such tools. Given that interpreting technology is a vast umbrella term, classification is indeed essential for a thorough understanding. 2.3.1.1. Categorization of Technologies in Interpreting There are several approaches to the classification of ICT tools in interpreting. Fantinuoli (2018a) suggests two classes: setting-oriented technologies and process-oriented technologies. Setting-oriented technologies "primarily influence the external conditions in which interpreting is performed" (2018a, p. 155).
On the other hand, process-oriented technologies include a variety of tools, such as "terminology management systems, knowledge extraction software, and corpus analysis tools" (p. 155), all of which aim to assist interpreters in different sub-processes and various phases of an assignment. In parallel with this approach, Braun (2019) categorizes interpreting technology into three classes. The first category is "technology-mediated interpreting", which encompasses all technologies employed to expand the reach and effectiveness of interpreting services, including remote simultaneous interpreting (RSI) equipment. In broad terms, technologies mediating interpreting entail distance interpreting technologies, which cover "a whole range of technologically different setups" (Ziegler & Gigliobianco, 2018, p. 121). Remote interpreting can be defined as the utilization of various ICT instruments to enable interpreter-mediated communication from a physically removed location. During the COVID-19 pandemic, remote interpreting served as the catalyst for the development of a fresh generation of conference interpreter profiles, a location-independent alternative to traditional conference settings. Moreover, the proliferation of video-conference platforms (e.g., Zoom, Interactio, KUDO, and Interprefy) during the pandemic paved the way for computer-assisted interpreting tools explicitly developed for incorporation in RSI scenarios (see Interpreter Assist in section 2.3.2.2.). The incorporation of cutting-edge augmented reality (AR) innovations, including the deployment of advanced virtual reality goggles, could be the next evolutionary leap in remote interpreting, mitigating "the feeling of isolation" (Ziegler & Gigliobianco, 2018, p. 136) and/or integrating CAI tool interfaces on the virtual reality screen worn by the interpreter6 (Gieshoff, 2022).
The second category is technology-generated interpreting, which refers to machine interpreting (MI) or speech-to-speech translation. MI can be characterized as a technological advancement enabling the conversion of spoken language into another language through computer programming (speech technologies)7. MI involves a multi-step approach that generates an audible version of the translated text by creating synthetic speech in the target language. In cascade systems, the steps are as follows: ASR transcribes oral speech into written text; this is followed by machine translation; and finally, text-to-speech synthesis is used to generate an audible version of the translated text. The third category is "technology-supported interpreting", which entails all technologies that can be used to augment or facilitate interpreters' preparation, performance, and workflow. In this context, technologies supporting interpreting can be considered a wide group of technological applications and hardware used before, during and after the interpreting process, thereby affecting the cognitive processes behind the actual task of interpreting. CAI tools (see 2.3.3.) and other technologies that aim to enhance the performance of the task can be listed under technology-supported interpreting. [6: At the time of writing, a group of three scholars at Zurich University of Applied Sciences is examining whether augmented reality technology can assist interpreters with the additional effort of having to consult terms. In other words, the research focuses on integrating an ASR-enhanced CAI tool interface into an augmented reality display, postulating that instead of switching between different types of visual information and redirecting their visual attention to the CAI output, interpreters can view the output directly on an augmented reality interface worn as a headset.] [7: Section 2.3.3. briefly covers the aforementioned speech technologies.]
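The cascade MI architecture (ASR, then MT, then TTS) described above can be sketched schematically. The three stage functions below are toy stand-ins, not a real speech API: the "ASR" and "TTS" stages merely convert between bytes and text, and the "MT" stage is a word-level glossary lookup, so that only the chaining of the stages is illustrated.

```python
from dataclasses import dataclass

@dataclass
class CascadePipeline:
    """Schematic cascade machine-interpreting pipeline: ASR -> MT -> TTS.

    All three stages are illustrative stand-ins; a real system would call
    actual speech-recognition, machine-translation and speech-synthesis
    engines at each step.
    """
    glossary: dict  # toy word-level EN->TR "machine translation" table

    def asr(self, audio: bytes) -> str:
        # Stand-in ASR: pretend the audio decodes directly to its transcript.
        return audio.decode("utf-8")

    def mt(self, source_text: str) -> str:
        # Stand-in MT: word-by-word glossary lookup, unknown words kept as-is.
        return " ".join(self.glossary.get(w, w) for w in source_text.split())

    def tts(self, target_text: str) -> bytes:
        # Stand-in TTS: "synthesize" the target text back into audio bytes.
        return target_text.encode("utf-8")

    def interpret(self, audio: bytes) -> bytes:
        # The cascade proper: transcribe, then translate, then synthesize.
        return self.tts(self.mt(self.asr(audio)))

pipeline = CascadePipeline(glossary={"good": "iyi", "morning": "sabahlar"})
out = pipeline.interpret(b"good morning")
print(out.decode("utf-8"))  # -> iyi sabahlar
```

A practical consequence of this design, relevant to the error discussion elsewhere in this thesis, is visible even in the sketch: an error introduced at the ASR stage propagates unchecked through MT and TTS, since each stage consumes only the previous stage's output.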
The CAI tools falling under the technology-supported interpreting class are themselves classified into 'generations' depending on their purpose, features and release date, as described in 2.3.3. Drawing inspiration from Ortiz and Cavallo's list of ICT tools for interpreting (2018, p. 17), which categorizes tools by their function, specificity, and update date, I have expanded the list to include new categories such as Speech Bank, Audio and Video Conference platforms, Machine Interpreting and Real-time Speech Translation. In Table 3 below, the tools have been matched according to their specificities, purposes, modalities, and features to provide a comprehensive overview of the range of tools available to interpreters as of January 2023. The categories are training platform, speech bank, glossary management, corpora building, terminology extraction, speech recognition, note-taking, virtual booth service, audio and video conference, machine interpreting, and real-time speech translation. The tools under the categories of 'interpreter training' and/or 'speech bank' comprise various platforms and software that facilitate lexical and terminological searches for both novice and expert interpreters. These tools aim to help interpreters hone their interpreting skills and strengthen their grasp of both their native and foreign languages by allowing them to conduct deliberate practice using speeches and other materials. Glossary management, corpora building and term extraction tools (regardless of their specificity for interpreters) indicate the resources that can aid interpreters during preparation, allowing them to delve deeper into the primary topic they will be interpreting. Additionally, interpreters can develop and reference personalized glossaries throughout the interpretation process, while also familiarizing themselves with the speakers' accents and backgrounds by watching videos and scouring online sources.
The categories of Speech Recognition, Real-time Speech Translation, Note-taking and Virtual Booth Service include tools that are utilized during the interpreting process itself. The tools in this class are ASR-enhanced CAI tools for SI, speech translation solutions for various purposes, and note-taking applications that can be used in interpreting scenarios. Therefore, this class of categories, as well as the categories related to preparation and terminology, can be listed under the division of technology-supported interpreting. Platforms where remote simultaneous interpreting can be carried out8 (corresponding to technology-mediated interpreting) are listed in the category of 'Audio and Video Conference'. Finally, tools under the category of 'Machine Interpreting' (speech-to-speech interpreting) are specified as 'replacement' (corresponding to technology-generated interpreting), referring to full automation of the interpreting process and hence a complete replacement of human interpreters. In this category, devices and tools are added based on their availability on the market. The columns of the table show specificity (whether a tool is designed for interpreters), purpose (main aim of usage), modality (simultaneous and/or consecutive interpreting), and feature (remote interpreting platform, ASR-enhanced or fully ASR-powered, replacement by MI).