AMI – Assessment Metrics for Interpreting – Dolmetscher-wissen-alles.de

AMI (Assessment Metrics for Interpreting) is a framework for the qualitative and quantitative analysis of both human and machine interpreting performance. Based on a detailed analysis by human experts, a statistical overview is created, which is then condensed into one single benchmark indicator, the deviation coefficient (see below).

You can download the latest version of the AMI template here: AMI_V7.

Introduction

The AMI assessment template is divided into a quantitative and a qualitative part.

The purpose of AMI is to provide a framework for detailed and balanced quality assessment, performed by human expert raters.

Based on this expert rating, AMI automatically calculates comparable, easy-to-read yet meaningful statistics, and finally condenses these numbers into one overall benchmark indicator, the deviation coefficient.

Optionally, an automatic quality estimation score such as COMET can be inserted. This can help to sort the segments to be assessed by their likelihood of containing (substantive) deviations.

For a distraction-free view while doing the quality assessment, it may be more convenient to hide the statistics panel.

Templates are available both for assessing a single interpreting performance and for a comparative analysis of two performances.

In the comparative analysis template, one assessment template is in blue font (the one on the left) and the other is in black font (on the right) for visual orientation.

The quality assessment – source and target text

You simply copy and paste your source and target text into the respective columns.

Alignment

Source and target text segments don't need to be perfectly aligned unless you want to run an automatic quality estimation.

The rating process

Classification – the types of deviations (“mistakes” or “improvements”)

Each segment (= each spreadsheet line) is checked for deviations in a series of criteria. The criteria are grouped in the categories of Terminology, Content, Form, Style, Context, Delivery.

You check for positive and negative deviations. Negative deviations are "mistakes"; positive deviations are instances where the target is better than the source, for example corrected grammar or the smoothing out of false starts and self-corrections.

Remember that, unlike written texts and translations, spoken text and interpretation are never "perfect" (whatever that may mean), as there is no such thing as editing, proofreading, or post-editing.

For each segment, you can note down deviations in as many categories as you like.

The quality grading – severity, letter codes, and colour codes

Furthermore, for each deviation, you distinguish between different degrees of severity.

If the deviation is negative, you enter zzz for a critical deviation, zz for a major and z for a minor deviation.

For a positive deviation, the codes you enter are aaa for a substantive, aa for a major and a for a minor positive deviation.

As soon as you enter these codes into the respective cells, the background colours will change to different shades of red (negative) or yellow (positive).

Overview – criteria, deviations, severity

There is a dedicated sheet where the different criteria and their meanings and degrees of severity are explained.

Filters

Once the grading is done, the filter function can be very useful.

For example, you can filter for all segments containing deviations in a certain category, or filter for all segments containing a specific word or expression.

Statistics explained

For each criterion and degree of severity, the total sums are calculated.

However, critical deviations should normally count more than minor mistakes. This is why a second, weighted sum is calculated. For this purpose the weights for the different degrees of severity can be entered. By default, the weights are 1 for severe/critical deviations, 0.5 for major and 0.25 for minor deviations. The weights can of course be changed according to your preferences.

Finally, these weighted totals are set in relation to the amount of text: for each criterion, the weighted number of occurrences is divided by the number of words and multiplied by 100.

These weighted quotients can then be read as follows:

For example, -1.0 means that there is 1 critical deviation (or 2 major, or 4 minor, or 1 major + 2 minor) per 100 words. -0.25 means that there is one minor deviation per 100 words.

Because it is normalised by text length, this value is better suited than the raw totals for comparing interpretations of different source texts.
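The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's actual formulas: the function name `weighted_quotient`, the dictionary layout and the sign convention (negative deviations lower the value, positive ones raise it) are assumptions made for this example. The default weights are the ones given above.

```python
# Default severity weights as stated in the text; adjust to your preferences.
SEVERITY_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25}


def weighted_quotient(negative, positive, word_count, weights=SEVERITY_WEIGHTS):
    """Weighted deviations per 100 words for one criterion.

    `negative` and `positive` map severity names to raw counts.
    Negative deviations lower the result, positive ones raise it
    (an assumed sign convention for this sketch).
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    weighted_neg = sum(weights[sev] * n for sev, n in negative.items())
    weighted_pos = sum(weights[sev] * n for sev, n in positive.items())
    return 100 * (weighted_pos - weighted_neg) / word_count


# One major and two minor negative deviations in a 100-word text
print(weighted_quotient({"major": 1, "minor": 2}, {}, 100))  # -1.0

# One minor negative deviation in a 100-word text
print(weighted_quotient({"minor": 1}, {}, 100))  # -0.25
```

Dividing by the word count is what makes values from texts of different lengths comparable.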

The deviation coefficient

And finally, to obtain one single "benchmark" indicator, the deviation coefficient, the weighted quotients for the different categories (terminology, content etc.) are summed up and multiplied by 100. In this final step, the categories can be assigned different weights to take into account that some of them (usually content) are considered more important than others. By default, content has a weight of 1 (i.e. it is counted in full), and the other five categories each have a weight of 0.2, which together add up to 1. This way content is as important as the sum of all other categories. These weights can of course also be adjusted according to individual preferences.

This deviation coefficient is the value calculated at the very top of the spreadsheet. It indicates the relative number of positive or negative deviations of a target text in relation to the source per 100 words. Here are some examples:

Deviation coefficient = 0 means that there are no deviations.

Deviation coefficient = -8 means that there are 8 critical negative deviations per 100 words.

Deviation coefficient = -0.25 means that there is one minor negative deviation per 100 words.

Bear in mind that positive and negative deviations offset each other: if a text has, for example, 10 minor positive and 10 minor negative deviations, the coefficient will be 0. So for a detailed analysis, it will always be better to look at the AMI statistics chart. There, for example, if you wanted a deviation coefficient that only reflects the negative deviations (mistakes), you could set the weight of the positive ones to zero.
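As a rough illustration of how the category weights and the offsetting of positive and negative deviations interact, here is a minimal Python sketch. It is not the spreadsheet's formula: the function name, the input layout and the `positive_weight` parameter are hypothetical. It also assumes that the per-category quotients passed in are already expressed per 100 words, so no further scaling is applied.

```python
# Default category weights as described in the text: content counts in
# full, each of the five other categories counts 0.2.
CATEGORY_WEIGHTS = {
    "terminology": 0.2,
    "content": 1.0,
    "form": 0.2,
    "style": 0.2,
    "context": 0.2,
    "delivery": 0.2,
}


def deviation_coefficient(negative, positive,
                          category_weights=CATEGORY_WEIGHTS,
                          positive_weight=1.0):
    """Combine per-category quotients into one single value.

    `negative` and `positive` map category names to non-negative
    quotients (assumed to be per 100 words already). Setting
    `positive_weight=0.0` yields a coefficient that reflects mistakes only.
    """
    total = 0.0
    for category, weight in category_weights.items():
        gain = positive_weight * positive.get(category, 0.0)
        loss = negative.get(category, 0.0)
        total += weight * (gain - loss)
    return total


# Eight critical negative content deviations per 100 words, nothing else
print(deviation_coefficient({"content": 8.0}, {}))  # -8.0

# Equal positive and negative content quotients offset each other
print(deviation_coefficient({"content": 2.5}, {"content": 2.5}))  # 0.0

# Same data, but ignoring positive deviations
print(deviation_coefficient({"content": 2.5}, {"content": 2.5},
                            positive_weight=0.0))  # -2.5
```

The last two calls show why a coefficient of 0 does not necessarily mean a text without deviations.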

Using COMET

If you want to use an automatic quality estimation like COMET, you can paste the scores into the respective column.

This way you can compare the automatic scores with the human quality assessment, or sort the segments by their scores in order to concentrate (first) on those segments with lower scores, which might present more (positive and negative) deviations.
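Outside the spreadsheet, the same sorting step could look like the following Python sketch. The segment layout, the key name `comet` and the scores are invented for illustration. The sketch does not compute COMET scores itself; it only orders segments that already carry a score.

```python
def sort_by_score(segments, score_key="comet"):
    """Return segments ordered from lowest to highest score.

    Segments without a score (missing key or None) are placed at the end,
    so they are not mistaken for low-quality segments.
    """
    scored = [s for s in segments if s.get(score_key) is not None]
    unscored = [s for s in segments if s.get(score_key) is None]
    return sorted(scored, key=lambda s: s[score_key]) + unscored


# Hypothetical segments with made-up scores
segments = [
    {"id": 1, "comet": 0.81},
    {"id": 2, "comet": 0.42},
    {"id": 3, "comet": 0.67},
    {"id": 4, "comet": None},  # not scored, e.g. because it was not aligned
]

for segment in sort_by_score(segments):
    print(segment["id"], segment["comet"])
```

Reviewing the list from the top then starts with the segments the automatic score considers most problematic.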

Questions? Suggestions? AMI is in a process of constant improvement. Feedback is very much appreciated!

Need more detailed instructions or support in an assessment project? I am glad to assist!

Reference:

Can automatic quality estimation help to pre-assess the quality of human simultaneous interpretation and automatic speech translation? An exploratory case study based on MQM-inspired Assessment Metrics for Interpreting (AMI) and COMET

Lebende Sprachen

2026-04-22 | Journal article

DOI: 10.1515/les-2025-0046

About the author:

Anja Rütten has specialised in tech, information and terminology management since the mid-1990s. She holds a professorship in interpreting studies and Computer-Aided Interpreting at the Cologne University of Applied Sciences.

Copyright:

This work is licensed under CC BY-NC-SA 4.0

Disclaimer:

Views or opinions expressed are solely my own and do not express the views or opinions of my employer.
