Overview

A Machine that Learns in Real-Time

Tools for Large-Scale Translation Projects


SMT Explained

SMT versus Rules Based Translation

Machine Translation Quality

Multiple Domain Support Explained

Translation Memory Explained

Statistical Machine Translation
Overview
  The Asia Online platform is based on a concept known as statistical machine translation (SMT), combined with tools that enable the platform to learn from humans in real time.

Statistical machine translation is very different from the rule-based machine translation solutions of old. Instead of attempting to create fixed sets of grammar and dictionaries of words like rule-based solutions, statistical machine translation works by analyzing matched (or paired) sentences in two or more different languages and finding linguistic patterns. The more sentence pairs analyzed, the more accurately it can translate.

Asia Online has extended statistical machine translation technology in four ways:

  • More intelligence: Asia Online has created proprietary tools and processes that enable our statistical machine translation platform to be loaded with many more sentence pairs than ever before. The result is more intelligent and accurate translation.
  • More languages: Asia Online currently supports 200 language pairs, making it the most extensive translation solution in the market. By year end 2008, we plan to support over 400 language pairs. Languages that are currently supported include:
  • Easier to train: Asia Online allows users to quickly add training data – including specialized terms, cultural nuances and technical jargon. As more training data is added, the solution becomes more accurate, not just for individual users, but for everyone who needs the translation platform.
  • Learns in real-time: Asia Online has an integrated set of proofreading tools that not only allow subject matter experts and human translators to correct early translations quickly and efficiently, but also feed corrections back into the platform so it learns not to make the same mistake twice. Over time, the Asia Online platform becomes steadily more accurate.
    A Machine that Learns in Real-Time
      Asia Online’s statistical machine translation platform has the unique ability to learn in real time and never make the same mistake twice. The platform combines statistical machine translation with an environment for collaborative and continuous improvement. Put simply, this means that the machine constantly learns from human proofreaders and subject matter experts and constantly improves.

    Part of every translation process is to have human subject matter experts proofread early translations and make corrections as needed. The Asia Online statistical machine translation platform assists with this proofreading task and then performs a comprehensive analysis of the corrections, from which the platform learns. The result is that the platform does not repeat a corrected mistake, and subsequent translations are of higher quality. In addition to improving overall translation quality, this approach lets the system learn idioms, cultural nuances and industry jargon.

    Tools for Large-Scale Translation Projects
      Asia Online’s platform includes a wide range of tools that allow users to streamline and manage many different aspects of the translation process. In addition to the ability to submit large numbers of documents in TMX file format for translation, the platform gives users the ability to:

  • Review the translation queue to check progress, alter job priorities, and download finished documents.
  • Manage translation data, including glossaries, datasets, dictionaries and termbases.
  • Link multiple data sources for higher quality and specialized translations.
  • Manage and optimize the statistical machine translation engine for specific requirements.
  • Proofread translations, with corrections being passed back to the SMT platform to improve future translations.
  • Extract text from different file formats, clean data, and split TMX files into single-language documents.
  • Monitor and manage translation accounts.



    SMT Explained
      Asia Online uses a technique for software translation called Statistical Machine Translation (SMT). There has been significant research on SMT for many years, but until recently SMT had not been commercialized into fully automated language translation software. SMT has become practical as a machine translation technology only with the recent availability of two things: a large amount of previously translated content in digital form covering a wide variety of topics (used as training material), and powerful computing at low cost (used to periodically train the software engines on each language pair).

    SMT breaks away from the traditional "rules-based" approach used in machine translation and uses statistical techniques from cryptography, utilizing learning algorithms that learn to translate automatically from existing human translations. SMT is a form of artificial intelligence that uses pattern recognition and probability statistics to train software how to convert text input from one pattern (or language) to another.

    SMT engines produce better results as the amount of training material increases. The training material consists of previously translated, matching sentence pairs in any two languages (called a parallel corpus). With a few thousand sentence pairs the process and engine technology can be validated, but translation output quality is very poor. With approximately 1.5 million sentence pairs as training material, the results start to become impressive, and with 10 million or more sentence pairs the results are usually excellent.

    What is learned by the SMT software is contemporary, appropriate and idiomatic, because it is learned directly from human translations as training material. The software can be customized to any subject area or style, and unlike translation systems such as translation memory, Asia Online's SMT products can do full translations on previously unseen text.

    The statistical translation and language models used in the translation process yield not only better quality results than have been available before in computerized translation, but also more natural, readable output.

    SMT versus Rules Based Translation
    Machine translation is widely considered among the most difficult tasks in natural language processing, and in artificial intelligence in general, because accurate translation seems to be impossible without a comprehension of the text to be translated.

    The path to developing a working machine translation system has traditionally required decades of hand-crafting grammar, dictionaries, and translation rules in close consultation with human translation experts. This has been the basis for the rules-based approach taken by most machine translation systems to date. However, the reality is that a human translator cannot possibly set down in sufficient detail the "algorithm" applied when translating a document.

    Every language in the world has its own distinct grammatical rules that address phonetics, phonology, morphology, syntax and semantics. Most state-of-the-art commercial machine translation systems in use today have been developed using a rules-based approach and require a lot of work by linguists to define vocabularies and grammar.

    The problems with the rules-based approach are: there are too many exceptions to the rules in practically every sentence in every language; individual words in every language can have multiple or ambiguous meanings; language changes over time, which affects the assumptions made in the rules engine; and many idioms do not translate word for word between languages but correspond to entirely different phrases that convey the same meaning. As a result, the translation output from rules-based systems is often disappointing, despite decades of work. And for commercialization purposes, rules-based systems often take years to develop and, once a certain point is reached, are difficult to improve upon.

    SMT suffers from none of these constraints and produces far higher quality translated output using software that can be trained on any language pair, so long as there is a sufficient quantity and quality of previously translated material available as training material. Asia Online has amassed a huge collection of matched sentence pairs for Asian and European languages that is unique in the world, and ensures that the sources of the translated sentence pairs are of as high a standard as possible, to safeguard the quality and accuracy of the software that is trained from this material.

    Machine Translation Quality
    It is expensive and time-consuming to use humans to evaluate the quality of machine translation, and difficult to sustain any consistency in the process. Over the past several years, a number of automated means of measuring translation quality have been used. One of the most popular is called BLEU (BiLingual Evaluation Understudy), which was developed by a team at IBM’s Thomas J. Watson Research Center.

    The BLEU system awards a score between 0 and 1: the closer a machine translation output is to that produced by a professional human translator, the higher the score. The highest scores attainable (equal to a highly proficient human translator, such as those that work for the UN) are around 0.85, while a very good translation would give a score of about 0.65. Most current rules-based systems in use globally rarely score above 0.15 on the BLEU index. Only those rules-based systems that have been worked on for many years for extremely popular language pairs (such as English<->French) or that are between two very closely related languages (such as Spanish<->Italian) produce BLEU scores between 0.15 and 0.30.

    In contrast, even basic SMT translation software (that has not been optimized for the context or domain of the material being translated) would normally score anywhere between 0.10 and 0.30 on the BLEU scale, as long as the software has been trained on a corpus of reasonable quality.

    If the SMT software is given information about the context or domain of the material it is about to translate, it can alter the way it processes its internal probability tables accordingly and produce translation output that can range between 0.35 and 0.70 on the BLEU scale. If a good quantity of Translation Memory (see below) is added to the SMT software, output can range between 0.40 and 0.85 using the BLEU metric.
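    To make the metric concrete, the following is a minimal sketch of a simplified, single-reference, sentence-level BLEU score. It is an illustration only and not the tool used to produce the figures quoted above: the function names are chosen here, no smoothing is applied, and production BLEU is normally computed over a whole test corpus, often with several reference translations per sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))  # identical output scores 1.0
print(bleu("the cat sat on a mat", reference))    # one wrong word lowers the score
```

    Because the score is a geometric mean over 1-gram to 4-gram precision, a single wrong word in a short sentence lowers it sharply, which is one reason scores well below 1.0 can still correspond to usable translations.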

    Multiple Domain Support Explained
      A domain is a description of the context of a document, such as ‘Sports’ or ‘Architecture’ or ‘Politics’. This context can be detected automatically by Asia Online’s software as a first step to processing any material for translation. Knowing this context greatly assists SMT software to skew the way it processes its internal probability tables to produce a more accurate translation output. For example, knowing that a document is related to ‘Fishing’ can help the SMT software determine that the most appropriate translation for the word ‘bank’ found in the document is more likely to mean ‘river-bank’ than any other possible meaning for that word.

    Asia Online not only supports domains (not all SMT software does), but is also the only SMT engine globally that currently supports multiple domains. As an example, the Asia Online SMT software might discern that a document is not only about ‘Automotive’ but is actually about ‘Toyota’ and, more specifically, about the ‘Prius’ model of Toyota car. As a result, the software can adjust its processing accordingly to produce an extremely high quality of translated output.
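    One simple way to picture how domain information can skew a probability table is to keep one table per domain and mix them by how strongly each domain was detected. The sketch below does this for the word ‘bank’. It is a toy illustration and not a description of Asia Online's engine: the table `SENSE_PROBS`, its numbers, and the function `pick_sense` are all hypothetical.

```python
# Hypothetical sense probabilities for the English word "bank", per domain.
# The numbers are invented for illustration only.
SENSE_PROBS = {
    "General": {"financial institution": 0.8, "river-bank": 0.2},
    "Fishing": {"financial institution": 0.1, "river-bank": 0.9},
}


def pick_sense(domain_weights):
    """Mix the per-domain tables by weight and return the most probable sense."""
    mixed = {}
    for domain, weight in domain_weights.items():
        for sense, p in SENSE_PROBS[domain].items():
            mixed[sense] = mixed.get(sense, 0.0) + weight * p
    return max(mixed, key=mixed.get)


# No domain detected: the general-purpose reading wins.
print(pick_sense({"General": 1.0}))                  # financial institution
# Document detected as mostly about fishing: the reading flips.
print(pick_sense({"General": 0.3, "Fishing": 0.7}))  # river-bank
```

    Supporting several domains at once then amounts to supplying more than one non-zero weight, as in the second call.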

    Translation Memory Explained
    One further step that the Asia Online SMT software takes to ensure as high a quality of translated output as possible is to use extensive Translation Memory support features. Translation Memory is a database of previously translated phrases that can be used to override the interim output from the SMT engine before the final output is produced. A translation memory consists of text segments in a source language and their translations into one or more target languages. These segments can be blocks, paragraphs, sentences, or phrases.

    For every language pair, a huge number of common phrases and expressions can be stored in Translation Memory, as well as idioms.

    For example, the English idiom “as difficult as herding cats” can be translated word for word into Thai, but the meaning would not be conveyed. In Thailand, the expression “as hard as putting crabs in a bucket” conveys the same meaning. The Translation Memory database would therefore hold this idiom matched in both languages, and the match would be used to override any translated output from the SMT software, producing a translation that captures as closely as possible the intent of the original text.
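    The override step can be sketched as a lookup that runs before falling back to the SMT engine. This is a minimal sketch under stated assumptions: the function and variable names are hypothetical, only exact segment matches are handled (real translation memories usually also support fuzzy matching), and the target entry is an English gloss standing in for the actual Thai text.

```python
def normalize(segment):
    """Lower-case a segment and collapse whitespace so lookups are stable."""
    return " ".join(segment.lower().split())


def translate_with_tm(source, translation_memory, smt_translate):
    """Return the TM entry on an exact segment match, else the SMT output."""
    key = normalize(source)
    if key in translation_memory:
        return translation_memory[key]
    return smt_translate(source)


# Hypothetical English-to-Thai memory; the value is an English gloss
# used as a placeholder for the Thai expression.
tm = {"as difficult as herding cats": "as hard as putting crabs in a bucket"}


def fake_smt(text):
    """Stand-in for the SMT engine, for demonstration only."""
    return "<SMT output for: " + text + ">"


print(translate_with_tm("As difficult as  herding cats", tm, fake_smt))
print(translate_with_tm("The meeting starts at noon", tm, fake_smt))
```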

    Statistical Machine Translation
    From Wikipedia, the free encyclopedia

    Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

    The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in 1991 by researchers at IBM's Thomas J. Watson Research Center and has contributed to the significant resurgence in interest in machine translation in recent years. As of 2006, it is by far the most widely studied machine translation paradigm.

    Benefits

    The benefits of statistical machine translation over traditional paradigms that are most often cited are the following:
  • Better use of resources
    • There is a great deal of natural language in machine-readable format.
    • Generally, SMT systems are not tailored to any specific pair of languages.
    • Rule-based translation systems require the manual development of linguistic rules, which can be costly, and which often do not generalize to other languages.
  • More natural translations

    The ideas behind statistical machine translation come out of information theory. Essentially, a document is translated according to the probability $p(e \mid f)$ that a string $e$ in the native language (for example, English) is the translation of a string $f$ in the foreign language (for example, French). Generally, these probabilities are estimated using techniques of parameter estimation.

    Bayes' theorem is applied to $p(e \mid f)$, the probability that the foreign string produces the native string, to get $p(e \mid f) \propto p(f \mid e)\, p(e)$, where the translation model $p(f \mid e)$ is the probability that the foreign string is the translation of the native string, and the language model $p(e)$ is the probability of seeing that native string. Mathematically speaking, finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:

           $$\tilde{e} = \arg\max_{e \in e^*} p(e \mid f) = \arg\max_{e \in e^*} p(f \mid e)\, p(e)$$

    For a rigorous implementation of this, one would have to perform an exhaustive search through all strings $e^*$ in the native language. Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.
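    The selection rule above can be sketched in a few lines when the search space is reduced to a short list of candidate strings. The sketch assumes toy probability tables (`TRANSLATION_MODEL` and `LANGUAGE_MODEL`, with invented values) and a hypothetical function `best_translation`; a real decoder generates and prunes candidates rather than receiving them ready-made.

```python
import math

# Toy tables with invented values: p(f | e) and p(e).
TRANSLATION_MODEL = {
    ("la maison bleue", "the house blue"): 0.5,
    ("la maison bleue", "the blue house"): 0.4,
}
LANGUAGE_MODEL = {
    "the house blue": 0.001,
    "the blue house": 0.02,
}


def best_translation(foreign, candidates):
    """Return the candidate e that maximizes p(f | e) * p(e)."""
    def log_score(e):
        # Work in log space so long products do not underflow.
        return (math.log(TRANSLATION_MODEL[(foreign, e)])
                + math.log(LANGUAGE_MODEL[e]))
    return max(candidates, key=log_score)


print(best_translation("la maison bleue",
                       ["the house blue", "the blue house"]))
```

    The word-for-word candidate has the higher translation-model score here, but the language model strongly prefers the fluent English word order, so the product selects "the blue house".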

    As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence, but even this is not enough. Language models are typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages.
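    As an illustration of a smoothed n-gram model, the sketch below builds a bigram language model with add-one (Laplace) smoothing. The class name and the tiny training set are assumptions made for the example, and add-one smoothing is chosen only because it is the simplest scheme; practical systems use more refined smoothing methods and far larger corpora.

```python
from collections import Counter


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        """Smoothed estimate of P(word | prev); never zero."""
        return ((self.bigrams[(prev, word)] + 1)
                / (self.unigrams[prev] + self.vocab_size))

    def sentence_prob(self, sentence):
        """Probability of a whole sentence as a product of bigram terms."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p


lm = BigramLM(["the blue house", "the red house"])
print(lm.sentence_prob("the blue house"))  # word order seen in training
print(lm.sentence_prob("house blue the"))  # unseen order, much less likely
```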

    The statistical translation models were initially word-based (Models 1-5 from IBM), but significant advances were made with the introduction of phrase-based models. Recent work has incorporated syntax or quasi-syntactic structures.

    Word-based translation
    In word-based translation, the translated elements are words. Typically, the number of words in translated sentences differs due to compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Simple word-based translation is not able to translate language pairs with fertility rates different from one. To handle, for instance, high fertility rates, a word-based translation system could map a single word to multiple words, but not vice versa.
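    The simplest of the word-based IBM models, Model 1, learns word translation probabilities $t(f \mid e)$ from sentence pairs alone using expectation-maximization. The following is a minimal sketch of that training loop; the function name and the three-sentence German-English corpus are assumptions for illustration, and the sketch omits the NULL word as well as the fertility and distortion components that the later IBM models add.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) from (foreign, native) sentence pairs with EM."""
    pairs = [(f.split(), e.split()) for f, e in pairs]
    f_vocab = {w for f_sent, _ in pairs for w in f_sent}
    # Start from a uniform distribution over foreign words.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        # E-step: distribute each foreign word over the native words.
        for f_sent, e_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-normalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


corpus = [("das haus", "the house"),
          ("das buch", "the book"),
          ("ein buch", "a book")]
t = train_ibm_model1(corpus)
for f, e in [("das", "the"), ("haus", "house"), ("das", "house")]:
    print(f, e, round(t[(f, e)], 3))
```

    No dictionary is supplied: because "das" co-occurs with "the" in two sentences, the probability mass for "haus" is pushed toward "house", which is the pattern-finding behaviour described above.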

    Phrase-based translation
    In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating sequences of words into sequences of words, where the lengths can differ. The sequences of words are called, for instance, blocks or phrases, but typically they are not linguistic phrases; they are phrases found in the corpus using statistical methods. Restricting the phrases to linguistic phrases has been shown to decrease translation quality.
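    A phrase table lookup can be sketched as follows. This is a deliberately simplified, greedy and monotone version under stated assumptions: the phrase table is hand-written and hypothetical, each source phrase has a single translation with no probability attached, and no reordering or language model is involved, whereas a real phrase-based decoder scores many competing segmentations.

```python
def translate_phrases(sentence, phrase_table, max_len=3):
    """Greedy, monotone longest-match translation using a phrase table."""
    words = sentence.split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest source phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            source_phrase = " ".join(words[i:i + n])
            if source_phrase in phrase_table:
                output.append(phrase_table[source_phrase])
                i += n
                break
        else:
            # No phrase matched: pass the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


# Hypothetical French-to-English phrase table.
phrase_table = {
    "merci beaucoup": "thank you very much",
    "madame": "madam",
}
print(translate_phrases("merci beaucoup madame", phrase_table))
```

    The two-word source phrase maps to a four-word target phrase, which is exactly the length mismatch that a one-to-one word model cannot express.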

    Challenges with statistical machine translation

    Problems that statistical machine translation has to deal with include compound words, idioms, morphology, syntax, out-of-vocabulary words and different word orders.

    For example, word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, in where modifiers for nouns are located.

    In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks, in order. This is not always the case with the same text in two languages. For SMT, the translation model is only able to translate small sequences of words, and word order has to be taken into account somehow. A typical solution has been re-ordering models, where a distribution of location changes for each item of translation is approximated from aligned bi-text. Different location changes can be ranked with the help of the language model, and the best can be selected.
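    The ranking step can be illustrated by scoring every ordering of a few translated items with a language model and keeping the best one. The sketch below is a toy under stated assumptions: the bigram counts and function names are hypothetical, the learned distribution of location changes is left out entirely, and enumerating all permutations is only feasible for very short sequences.

```python
from itertools import permutations

# Hypothetical bigram counts standing in for a target language model.
BIGRAM_COUNTS = {("the", "blue"): 2, ("blue", "house"): 3}


def order_score(order):
    """Sum the bigram counts of adjacent items; unseen bigrams score 0."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(order, order[1:]))


def best_order(items):
    """Return the permutation of items with the highest language-model score."""
    return max(permutations(items), key=order_score)


# Items in source-language order (noun before adjective).
print(" ".join(best_order(["the", "house", "blue"])))
```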