Overview

The Asia Online platform is based on a concept known as statistical machine translation
(SMT), combined with tools that enable the platform to learn from humans in real time.
Statistical machine translation is very different from the rule-based machine translation
solutions of old. Instead of attempting to create fixed sets of grammar rules and dictionaries
of words, as rule-based solutions do, statistical machine translation works by analyzing
matched (or paired) sentences in two or more different languages and finding linguistic
patterns. The more sentence pairs analyzed, the more accurately it can translate.
Asia Online has taken statistical machine translation technology to its pinnacle:
- More intelligence: Asia Online has created proprietary tools and processes that
  enable our statistical machine translation platform to be loaded with many more
  sentence pairs than ever before. The result is much more intelligent and accurate
  translations.
- More languages: Asia Online currently supports 200 language pairs, making
  it the most extensive translation solution in the market. By the end of 2008, we plan
  to support over 400 language pairs. Languages that are currently supported include:
- Easier to train: Asia Online allows users to quickly add training data,
  including specialized terms, cultural nuances and technical jargon. As more training
  data is added, the solution becomes more accurate, not just for individual users,
  but for everyone who uses the translation platform.
- Learns in real time: Asia Online has an integrated set of proofreading tools
  that not only allow subject matter experts and human translators to correct early
  translations quickly and efficiently, but also feed corrections back into the platform
  so it learns not to make the same mistake twice. Over time, the Asia Online platform
  becomes steadily more accurate.

A Machine that Learns in Real-Time

Asia Online’s statistical machine translation platform has the unique ability to
learn in real time and never make the same mistake twice. The platform combines
statistical machine translation with an environment for collaborative and continuous
improvement. Put simply, this means that the machine constantly learns from human
proofreaders and subject matter experts and constantly improves.
Part of every translation process is to have human subject matter experts proofread
early translations and make corrections as needed. The Asia Online statistical machine
translation platform assists with this proofreading task and then performs a comprehensive
analysis of the corrections, from which the platform learns. The result is that
the platform does not repeat a corrected mistake and subsequent translations
are of higher quality. In addition to improving overall translation quality, this approach
lets the system learn idioms, cultural nuances and industry jargon.

Tools for Large-Scale Translation Projects

Asia Online’s platform includes a wide range of tools that allow users to streamline
and manage many different aspects of the translation process. In addition to the
ability to submit large numbers of documents in TMX file format for translation,
the platform gives users the ability to:
- Review the translation queue to check progress, alter job priorities, and download
  finished documents.
- Manage translation data, including glossaries, datasets, dictionaries and termbases.
- Link multiple data sources for higher quality and specialized translations.
- Manage and optimize the statistical machine translation engine for specific requirements.
- Proofread translations, with corrections being passed back to the SMT platform to
  improve future translations.
- Extract text from different file formats, clean data, and split TMX files into
  single-language documents.
- Monitor and manage translation accounts.
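
The TMX-splitting task in the list above can be illustrated with a minimal sketch. It
assumes a TMX file in which each `<tu>` translation unit holds one `<tuv>` element per
language, identified by an `xml:lang` attribute (with a plain `lang` attribute as a fallback).
The function name `split_tmx` and the sample data are hypothetical, and this is not
Asia Online's actual tool.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_tmx(tmx_text):
    """Return {language code: [segment, ...]} for a TMX document string."""
    root = ET.fromstring(tmx_text)
    segments = defaultdict(list)
    for unit in root.iter("tu"):          # one translation unit per sentence pair
        for variant in unit.iter("tuv"):  # one variant per language
            lang = variant.get(XML_LANG) or variant.get("lang")
            seg = variant.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang].append("".join(seg.itertext()).strip())
    return dict(segments)


# Deliberately minimal sample; real TMX headers carry more attributes.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for lang, lines in split_tmx(SAMPLE_TMX).items():
        print(lang, lines)
```

Each list in the returned dictionary could then be written to its own single-language
file, with line order preserving the pairing between languages.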

SMT Explained

Asia Online uses a technique for software translation called Statistical Machine
Translation (SMT). There has been significant research on SMT for many years, but
until recently SMT had not been commercialized into fully automated language translation
software. Two developments have made SMT practical to adopt as a machine translation
technology: the recent availability, in digital form, of a large amount of previously
translated content covering a wide variety of topics (used as training material), and
low-cost computing power (used to periodically train the software engines on each
language pair).
SMT breaks away from the traditional "rules-based" approach used in machine translation
and uses statistical techniques from cryptography, utilizing learning algorithms
that learn to translate automatically from existing human translations. SMT is a
form of artificial intelligence that uses pattern recognition and probability statistics
to train software to convert text input from one pattern (or language) to another.
SMT engines produce better results as the training material grows, that is, as more previously
translated matching sentence pairs in the two languages (called a parallel corpus) are added.
With a few thousand sentence pairs the process and engine technology can be validated,
but translation output quality is very poor. With approximately 1.5 million sentence
pairs as training material, the results start to become impressive, and with 10 million
or more sentence pairs the results are usually excellent.
What is learned by the SMT software is contemporary, appropriate and idiomatic,
because it is learned directly from human translations as training material. The
software can be customized to any subject area or style, and unlike translation
systems such as translation memory, Asia Online's SMT products can do full translations
on previously unseen text.
The statistical translation and language models used in the translation process
yield not only better quality results than have been available before in computerized
translation, but also more natural, readable output.

SMT versus Rules-Based Translation

Machine translation is widely considered among the most difficult tasks in natural
language processing, and in artificial intelligence in general, because accurate
translation seems to be impossible without a comprehension of the text to be translated.
The path to developing a working machine translation system requires decades of
hand-crafting grammar, dictionaries, and translation rules in close consultation
with human translation experts. This has been the basis for the rules-based approach
taken by most machine translation systems to date. However, the reality is that a
human translator can't possibly set down in sufficient detail the "algorithm" he or she
applies when translating a document.
Every language in the world has its own distinct grammatical rules that address
phonetics, phonology, morphology, syntax and semantics. Most state-of-the-art commercial
machine translation systems in use today have been developed using a rules-based
approach and require a lot of work by linguists to define vocabularies and grammar.
The problems with the rules-based approach are:
- There are too many exceptions to the rules in practically every sentence in every language.
- Individual words in every language can have multiple or ambiguous meanings.
- Language changes over time, which affects the assumptions made in the rules engine.
- Many idioms are not an exact word-for-word translation between languages but are
  entirely different phrases that convey the same meaning.
As a result, the translation output from rules-based systems
is often disappointing, despite decades of work. And for commercialization purposes,
rules-based systems often take years to develop and, once a certain point is reached,
are difficult to improve upon.
SMT suffers from none of these constraints and produces far higher quality translated
output using software that can be trained on any language pair so long as there
is a sufficient quantity and quality of previously translated material available as
training material. Asia Online has amassed a huge collection of matched sentence pairs
for Asian and European languages that is unique in the world, and has ensured that
the sources of the translated sentence pairs are of as high a standard as possible
to ensure the quality and accuracy of the software that is trained from this material.

Machine Translation Quality

It is expensive and time-consuming to use humans to evaluate the quality of machine
translation, and difficult to sustain any consistency in the process. Over the past
several years, a number of automated means of measuring translation quality have
been used. One of the most popular is called BLEU (Bilingual Evaluation Understudy),
which was developed by a team at IBM’s Thomas J. Watson Research Center.
The BLEU system awards a score between 0 and 1: the closer a machine translation output
is to that produced by a professional human translator, the higher the score. The highest
scores attainable (equal to a highly proficient human translator, such as those that work for
the UN) are around 0.85, while a very good translation would give a score of about
0.65. Most current rules-based systems in use globally rarely score above 0.15
on the BLEU index. Only those rules-based systems that have been worked on for many
years for extremely popular language pairs (such as English<->French) or that are
between two very closely related languages (such as Spanish<->Italian) produce BLEU
scores between 0.15 and 0.30.
In contrast, even basic SMT translation software (that has not been optimized for
the context or domain of the material being translated) would normally score anywhere
between 0.10 and 0.30 on the BLEU scale as long as the software has been trained
on a corpus of reasonable quality.
If the SMT software is given information about the context or domain of the material
it is about to translate, it can alter the way it processes its internal probability
tables accordingly and produce translation output that can range between 0.35 and
0.70 on the BLEU scale. If a good quantity of Translation Memory (see below) is
added to the SMT software, output can range between 0.40 and 0.85 using the BLEU
metric.
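
To make the 0-to-1 scale concrete, here is a minimal sketch of a sentence-level BLEU
calculation. It is a simplification that assumes a single reference translation, whitespace
tokenization and no smoothing; the published metric is normally computed over a whole
test corpus, often with several references. The function names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))
    print(bleu("the cat sat on a mat", reference))
```

A candidate identical to the reference scores 1, a candidate sharing no words scores 0,
and a single wrong word in a six-word sentence already costs a large share of the score,
which is why scores for real systems sit well below 1.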

Multiple Domain Support Explained

A domain is a description of the context of a document, such as ‘Sports’ or ‘Architecture’
or ‘Politics’. This context can be detected automatically by Asia Online’s software
as a first step to processing any material for translation. Knowing this context
greatly assists SMT software to skew the way it processes its internal probability
tables to produce a more accurate translation output. For example, knowing that
a document is related to ‘Fishing’ can help the SMT software determine that the
most appropriate translation for the word ‘bank’ found in the document is more likely
to mean ‘river-bank’ than any other possible meaning for that word.
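
The 'bank' example can be sketched in a few lines. This is a toy illustration under
stated assumptions, not Asia Online's method: the probability tables, keyword lists,
the French target words and the `domain_weight` parameter are all hypothetical, and
domain detection is reduced to simple keyword overlap.

```python
import re

# Hypothetical probabilities for translating English "bank" into French.
GENERAL = {"banque": 0.8, "rive": 0.2}
DOMAIN_TABLES = {
    "fishing": {"banque": 0.1, "rive": 0.9},
    "finance": {"banque": 0.95, "rive": 0.05},
}
# Hypothetical keyword lists used to guess the domain of a text.
DOMAIN_KEYWORDS = {
    "fishing": {"river", "trout", "rod", "bait"},
    "finance": {"loan", "interest", "account", "deposit"},
}


def detect_domain(text):
    """Return the domain with the most keyword hits, or None if there are none."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {d: len(words & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def translate_bank(text, domain_weight=0.7):
    """Pick a translation for 'bank', skewing the general table toward the domain."""
    domain = detect_domain(text)
    table = dict(GENERAL)
    if domain is not None:
        for target, p in GENERAL.items():
            # Linear interpolation between domain-specific and general estimates.
            table[target] = (domain_weight * DOMAIN_TABLES[domain][target]
                             + (1 - domain_weight) * p)
    return max(table, key=table.get)


if __name__ == "__main__":
    print(translate_bank("He cast his rod from the bank of the river"))
    print(translate_bank("The bank approved the loan"))
```

With no domain evidence the general table decides; once fishing vocabulary is detected,
the interpolated probabilities favor the river-bank sense.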
Asia Online not only supports domains (not all SMT software does) but is also the only SMT
engine globally that currently supports multiple domains. As an example, the Asia
Online SMT software might discern that a document is not only about ‘Automotive’
but is actually about ‘Toyota’ and, more specifically, about the ‘Prius’
model of Toyota car. As a result, the software can adjust its processing accordingly
to produce translated output of extremely high quality.

Translation Memory Explained

One further step that the Asia Online SMT software takes to ensure as high a quality
of translated output as possible is to use extensive Translation Memory support
features. Translation Memory is a database of previously translated phrases that
can be used to override the interim output from the SMT engine before the final
output is produced. A translation memory consists of text segments in a source language
and their translations into one or more target languages. These segments can be
blocks, paragraphs, sentences, or phrases.
For every language pair, a huge number of common phrases and expressions, as well as
idioms, can be stored in Translation Memory.
For example, the English idiom “as difficult as herding cats” can be translated
word for word into Thai, but the meaning would not be conveyed. In Thailand, the
expression “as hard as putting crabs in a bucket” conveys the same meaning, so the
Translation Memory database would have this idiom matched in both languages, and the
match would be used to override any translated output from the SMT software to produce
a translation that captures as closely as possible the intent of the original text.
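
The override behavior described above can be sketched as a lookup that runs before the
engine's output is accepted. This is a minimal illustration assuming exact matching on
whole segments; `toy_engine` is a stand-in for a real SMT engine, and the Thai rendering
is represented by an English gloss with a `[TH]` tag rather than real Thai text.

```python
# Hypothetical translation memory: normalized source segment -> stored translation.
TRANSLATION_MEMORY = {
    "as difficult as herding cats": "[TH] as hard as putting crabs in a bucket",
}


def toy_engine(text):
    """Stand-in for an SMT engine: tags the text instead of translating it."""
    return f"[MT] {text}"


def normalize(text):
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def translate(text, memory=TRANSLATION_MEMORY, engine=toy_engine):
    """Return the stored translation when one exists, else the engine output."""
    stored = memory.get(normalize(text))
    if stored is not None:
        return stored  # the memory entry overrides the machine translation
    return engine(text)


if __name__ == "__main__":
    print(translate("As difficult as herding cats"))
    print(translate("The meeting starts at noon"))
```

A production system would also match phrases inside longer sentences and tolerate
near matches; exact segment lookup is used here only to keep the idea visible.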

Statistical Machine Translation

From Wikipedia, the free encyclopedia
Statistical machine translation (SMT) is a machine translation
paradigm where translations are generated on the basis of statistical models whose
parameters are derived from the analysis of bilingual text corpora. The statistical
approach contrasts with the rule-based approaches to machine translation as well
as with example-based machine translation.
The first ideas of statistical machine translation were introduced by Warren Weaver
in 1949, including the ideas of applying Claude Shannon's information theory.
Statistical machine translation was re-introduced in 1991 by researchers at IBM's
Thomas J. Watson Research Center and has contributed to the significant resurgence
in interest in machine translation in recent years. As of 2006, it is by far the
most widely studied machine translation paradigm.
Benefits
The benefits of statistical machine translation over traditional paradigms
that are most often cited are the following:
Better use of resources
- There is a great deal of natural language in machine-readable format.
- Generally, SMT systems are not tailored to any specific pair of languages.
- Rule-based translation systems require the manual development of linguistic rules,
  which can be costly, and which often do not generalize to other languages.
More natural translations
The ideas behind statistical machine translation come out of information theory.
Essentially, the document is translated on the basis of the probability
$p(e \mid f)$ that a string $e$ in the native language (for example, English) is the
translation of a string $f$ in the foreign language (for example, French). Generally,
these probabilities are estimated using techniques of parameter estimation.
Bayes' theorem is applied to $p(e \mid f)$, the probability that the foreign string
produces the native string, to get $p(e \mid f) \propto p(f \mid e)\, p(e)$, where the
translation model $p(f \mid e)$ is the probability that the foreign string is the
translation of the native string, and the language model $p(e)$ is the probability of
seeing that native string. Mathematically speaking, finding the best translation
$\tilde{e}$ is done by picking the one that gives the highest probability:
$$\tilde{e} = \arg\max_{e \in e^*} p(e \mid f) = \arg\max_{e \in e^*} p(f \mid e)\, p(e)$$
For a rigorous implementation of this, one would have to perform an exhaustive search
by going through all strings e* in the native language. Performing the search efficiently
is the work of a machine translation decoder, which uses the foreign string, heuristics
and other methods to limit the search space while keeping acceptable
quality. This trade-off between quality and time usage can also be found in speech
recognition.
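
The selection rule described above, choosing the native string with the highest product of
translation-model and language-model probability, can be sketched over a tiny, fixed
candidate list. All probabilities below are invented for illustration, and a real decoder
searches a vastly larger space with heuristics instead of enumerating candidates.

```python
import math

# Hypothetical scores for English candidates translating French "la maison bleue".
# "tm" stands for p(f | e) and "lm" for p(e); the numbers are made up.
CANDIDATES = {
    "the blue house": {"tm": 0.30, "lm": 0.020},
    "the house blue": {"tm": 0.35, "lm": 0.001},
    "blue the house": {"tm": 0.30, "lm": 0.0005},
}


def score(entry):
    """Log of p(f | e) * p(e); logs avoid underflow on long sentences."""
    return math.log(entry["tm"]) + math.log(entry["lm"])


def decode(candidates):
    """Return the candidate e that maximizes p(f | e) * p(e)."""
    return max(candidates, key=lambda e: score(candidates[e]))


if __name__ == "__main__":
    print(decode(CANDIDATES))
```

Note that "the house blue" has the highest translation-model score because it mirrors the
French word order, but the language model penalizes it enough that the fluent candidate wins.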
As the translation systems are not able to store all native strings
and their translations, a document is typically translated sentence by sentence,
but even this is not enough. Language models are typically approximated by smoothed
n-gram models, and similar approaches have been applied to translation models, but
there is additional complexity due to different sentence lengths and word orders
in the languages.
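
A smoothed n-gram language model can be sketched as follows. This is a minimal bigram
model with add-one (Laplace) smoothing, chosen here for brevity; production systems
typically use higher-order n-grams and more refined smoothing. The class name and the
three-sentence training set are illustrative only.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each word appears as a left context
        self.bigrams = Counter()   # how often each (previous, word) pair appears
        self.vocab = {"</s>"}      # words the model can predict
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """Smoothed p(word | prev); unseen pairs get a small non-zero value."""
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + len(self.vocab))

    def sentence_prob(self, sentence):
        """Probability of a whole sentence as a product of bigram probabilities."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p


if __name__ == "__main__":
    model = BigramModel(["the cat sat", "the cat slept", "the dog sat"])
    print(model.sentence_prob("the cat sat"))
    print(model.sentence_prob("cat the sat"))
```

Smoothing is what keeps the scrambled sentence from scoring exactly zero: it is judged
unlikely but still possible.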
The statistical translation models were initially word-based
(Models 1-5 from IBM), but significant advances were made with the introduction
of phrase-based models.
Recent work has incorporated syntax or quasi-syntactic structures.
Word-based translation
In word-based translation, the translated elements
are words. Typically, the number of words in translated sentences is different
due to compound words, morphology and idioms. The ratio of the lengths of sequences
of translated words is called fertility, which tells how many foreign words each
native word produces. Simple word-based translation is not able to translate language
pairs with fertility rates different from one. To let word-based translation systems
handle, for instance, high fertility rates, the system could map a single
word to multiple words, but not vice versa.
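
How word-level translation probabilities can be learned from sentence pairs alone is
sketched below, in the style of IBM Model 1 trained with expectation-maximization.
This is a simplified illustration: it omits the NULL word and any fertility modeling,
and the three-pair English-German corpus and function names are assumptions made for
the example.

```python
from collections import defaultdict

# Toy parallel corpus: (native sentence e, foreign sentence f).
PAIRS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]


def train_ibm1(pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM."""
    corpus = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm  # share of f explained by e
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]  # re-estimate from expected counts
    return dict(t)


def most_likely(t, e):
    """Return the foreign word with the highest t(f | e) for native word e."""
    options = [f for (f, e2) in t if e2 == e]
    return max(options, key=lambda f: t[(f, e)])


if __name__ == "__main__":
    table = train_ibm1(PAIRS)
    for word in ("the", "house", "book", "a"):
        print(word, "->", most_likely(table, word))
```

No dictionary is supplied: the pairing of "book" with "buch" emerges only because the two
words keep appearing together across sentence pairs.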
Phrase-based translation
In phrase-based translation, the aim is to reduce the restrictions
of word-based translation by translating sequences of
words to sequences of words, where the lengths can differ. The sequences of words
are called, for instance, blocks or phrases, but typically are not linguistic phrases
but phrases found using statistical methods from the corpus. Restricting the phrases
to linguistic phrases has been shown to decrease translation quality.
Challenges with statistical machine translation
Problems that statistical machine translation
has to deal with include compound
words, idioms, morphology, syntax, out-of-vocabulary words and different word orders.
For example, word order differs between languages. Some classification can be done by naming the typical
order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for
instance, of SVO or VSO languages. There are also additional differences in word
order, for instance, in where modifiers for nouns are located.
In speech recognition,
the speech signal and the corresponding textual representation can be mapped to
each other in blocks in order. This is not always the case with the same text in
two languages. For SMT, the translation model is only able to translate small sequences
of words, and word order has to be taken into account somehow. A typical solution has
been re-ordering models, where a distribution of location changes for each item
of translation is approximated from aligned bi-text. Different location changes
can be ranked with the help of the language model and the best can be selected.
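
Ranking location changes with a language model can be sketched by brute force. The
example below takes a word-by-word gloss of French "le chat noir", tries every ordering,
and scores each with a bigram table multiplied by a distortion penalty for how far words
moved. The probabilities, the penalty value and the names are hypothetical, and real
systems restrict the orderings considered because the number of permutations grows
factorially.

```python
from itertools import permutations

# Hypothetical bigram probabilities; any pair not listed counts as UNSEEN.
BIGRAM_PROB = {
    ("<s>", "the"): 0.5,
    ("the", "black"): 0.2,
    ("black", "cat"): 0.3,
    ("cat", "</s>"): 0.4,
    ("the", "cat"): 0.3,
    ("cat", "black"): 0.01,
    ("black", "</s>"): 0.05,
}
UNSEEN = 0.001
DISTORTION = 0.5  # factor applied once per position a word is moved


def lm_score(words):
    """Product of bigram probabilities over the sentence with boundary markers."""
    tokens = ["<s>", *words, "</s>"]
    score = 1.0
    for pair in zip(tokens, tokens[1:]):
        score *= BIGRAM_PROB.get(pair, UNSEEN)
    return score


def reorder(words):
    """Return the ordering with the best language-model times distortion score."""
    best, best_score = None, -1.0
    for order in permutations(range(len(words))):
        # Total displacement: how far each word moved from its original slot.
        moved = sum(abs(new - old) for new, old in enumerate(order))
        candidate = [words[i] for i in order]
        score = lm_score(candidate) * DISTORTION ** moved
        if score > best_score:
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    print(" ".join(reorder(["the", "cat", "black"])))
```

The distortion penalty favors leaving words in place, so a reordering is chosen only when
the language model's preference is strong enough to outweigh it.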