Why we measure performance of our translation A.I. in practice and not in the lab

Written by tolqcom | Published 2016/11/07


A.I. Powered Auto-Complete

We see many research reports being published about new approaches to applying artificial intelligence to language and translation. Sometimes that even leads to a mini-hype, as recently when Google released research on improvements in Neural Machine Translation for some language combinations.

It’s easy to forget that applying these technologies in practice is what it’s really all about; without that, research results remain mainly theoretical. Since we work for real clients with real challenges to solve, we looked for a better way of measuring success.

**Why is measuring with real data so important?**

The success of any A.I. model depends heavily on how closely its training data matches the scenarios in which the model will be queried and applied. This is why self-driving Teslas drive off a cliff once in a while: that scenario wasn’t in the model yet.

Within our context at Tolq (we provide high-volume human translation services assisted by A.I.), building successful models means dealing with a broad range of variable inputs.

Each client might have:

  • Different types of content, each with its own specific usage of language
  • A larger or smaller translation memory to use as training data
  • Different language pairs (or the same pair in a different direction)
  • Constantly changing data sets, as translations are added on an ongoing basis

All these factors have a significant impact on the results of the model. On top of that, when the model is applied during translation, there are high-impact variables at play too:

  • New content to be translated may vary greatly each time
  • Translator input can go in any direction

That’s quite a wide and dynamic range of situations for any A.I. model to succeed in.

**Time for an A.I. performance KPI Dashboard!**

So we asked ourselves: how do we measure the impact of our A.I. in this situation?

  • It needs to be measured continuously
  • It needs to be very specific, down to the individual client level
  • It needs an easy interface for selecting all the different models in the different scenarios (client TMs, placed orders, language pairs)
  • It should focus on one main score of model effectiveness, for which we chose A.I. auto-complete matches (a minimal sketch of this metric follows below)
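
To make that score concrete, here is a minimal sketch in Python of what counts as an auto-complete match. The `Suggestion` record and its field names are assumptions for illustration, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """One A.I. auto-complete suggestion shown to a translator (hypothetical schema)."""
    client_id: str
    language_pair: str     # e.g. "en-nl"
    suggested_text: str    # what the A.I. proposed
    accepted_text: str     # what the translator finally submitted

def is_match(s: Suggestion) -> bool:
    """A suggestion counts as a match when the translator accepts it verbatim."""
    return s.suggested_text.strip() == s.accepted_text.strip()
```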

We decided to build these A.I. success metrics right into the core of our product, so we can query anything in real time at any moment and view it in a dashboard like this:

Tolq A.I. Performance Dashboard
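
Under the hood, a dashboard row of this kind can be computed by grouping logged suggestions per client and language pair. Again a hedged sketch, building on the hypothetical `Suggestion` record above:

```python
from collections import defaultdict

def dashboard_rows(suggestions: list[Suggestion]) -> dict[tuple[str, str], float]:
    """Auto-complete match rate per (client_id, language_pair) combination."""
    shown: dict[tuple[str, str], int] = defaultdict(int)
    matched: dict[tuple[str, str], int] = defaultdict(int)
    for s in suggestions:
        key = (s.client_id, s.language_pair)
        shown[key] += 1
        if is_match(s):
            matched[key] += 1
    # One dashboard row per client/language-pair combination
    return {key: matched[key] / shown[key] for key in shown}
```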

We have direct insight into how our A.I. performs where it matters most: in practice, with all its complexities and nuances.

This information is directly related to the core of our business, so we make it available for everyone on the team to learn from.

Interested in learning more?

We’d love to take your existing translation memory data and process it for you to give you an idea of how our A.I. can help your translators work more efficiently. Contact us at [email protected].

