IMPORTANT

Asia Online maintains that clean data matters. With clean data you only need a fraction of the data that other SMT tools require to produce the same or better results.

To fix errors, other SMT vendors will ask you for more data, but cannot tell you what data to give, only that more is needed.

The theory with dirty data that some subscribe to is that good data patterns will occur more frequently than bad data patterns, so good data will rise to the top and bad data will fall to the bottom as statistically irrelevant. This is fine in theory, but Asia Online has been able to match and better the quality of many of these dirty data systems with just 1/100th of the data because all our data is clean.

More Data Is Not Always Better...
A Smaller Amount of Clean Data Is Better

Asia Online's SMT platform also requires more data to improve. But a major differentiating factor is that Asia Online is able to tell you exactly what data is needed and often it is as little as 1-10 sentences to fix a problem so that it will never occur again.

Lost & found in translation

Did you know that Thailand has the world's longest place name?

A total of 157 letters make up the official name of the city more commonly known as Bangkok. The full name is:

Krung-thep-maha-nakorn-boworn-ratana-kosin-mahintar-ayudhya-amaha-dilok-pop-nopa-ratana-rajthani-burirom-udom-rajniwes-mahasat-arn-amorn-pimarn-avatar-satit-sakattiya-visanukam-prasit.

Overview of the Asia Online Custom Translation Engine Build Process

This overview is a step-by-step guide to building a custom engine foundation. Some steps can be performed in parallel, while others must be performed in sequence. While not identical for every customer, this overview captures the essential and common processes and steps. Some steps require manual tasks, while others are completely or partially automated.

Building a custom engine is a detailed process that has been broken into a series of clear and specific steps that result in the highest quality translated output possible.

The diagram below shows each step that is required to build a custom engine.

Step Index:
  • Build Data Catalog
  • Prepare Phrase Pairs
  • Prepare Dictionary and Glossary
  • Prepare Tuning Set and Test Set
  • Prepare Language Model
  • Select Baseline to Build From
  • Prepare Capitalization Data
  • Automated Training
  • Publish to Production

Build Data Catalog
Data comes in different forms and needs to be pre-processed into a format that is useful to the Asia Online tools.

The Data Catalog is a spreadsheet that tracks the life of the data from start to finish through each of the steps. This includes the first step of receiving the data into the system, processing, cleaning, generating new data from the processed data, training and fine tuning.

This data includes the following:

  • Phrase Pairs - Phrase Pair data consists of full sentences or partial sentences that make up a phrase, in two or more languages.
  • Dictionary and Glossary - Dictionary and Glossary data is a list of specific words and their mappings. It can also include idioms to ensure that important concepts are translated correctly.
  • Language Model - Language Model data tells the translation platform how to use words and phrases in the output language. It can be used to stylize and refine grammar.
  • Tuning Set - Tuning Set data is a set of “gold-standard” translation examples that the translation platform uses to determine the optimal settings for high-quality translations.
  • Test Set - The Test Set is an additional “gold-standard” set of translation examples that is used to test the finished translation engine and give it an appropriate quality rating.

Prepare Phrase Pairs
Phrase Pair data arrives in a variety of formats, ranging from simple text files to more complex XML formats such as TMX or XLIFF. Asia Online has developed tools that convert the data from all the most common formats into a simple and clean format that other tools in the preparation process can then work with. An additional set of powerful tools has also been developed that enables Asia Online to extract and reformat data from nearly any proprietary format. Files will typically arrive in two forms:

  • Unaligned: Raw data that has never been paired at the sentence or paragraph level. In some cases, even documents are not aligned.
  • Aligned: Text is aligned at the paragraph or sentence level.

Asia Online has a variety of tools that automatically align text from documents such as HTML, Microsoft Word and other formats. These tools align data first by paragraph and then by sentence within a paragraph. This is a complex process and can take some time, as the text alignment applications perform significant analysis of the text.
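The paragraph-then-sentence order described above can be illustrated with a deliberately naive sketch. It is not the Asia Online alignment tooling: the function names are hypothetical, sentences are split on simple end punctuation, and paragraphs and sentences are paired purely by position, which a real aligner would replace with much deeper analysis.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def align(source_text, target_text):
    """Pair paragraphs by position, then pair sentences inside each paragraph."""
    aligned = []
    paragraph_pairs = zip(split_paragraphs(source_text), split_paragraphs(target_text))
    for source_paragraph, target_paragraph in paragraph_pairs:
        source_sentences = split_sentences(source_paragraph)
        target_sentences = split_sentences(target_paragraph)
        if len(source_sentences) == len(target_sentences):
            # Same number of sentences: assume they correspond one to one.
            aligned.extend(zip(source_sentences, target_sentences))
        else:
            # Counts differ: keep the whole paragraph as one unit for human review.
            aligned.append((source_paragraph, target_paragraph))
    return aligned

source = "Hello. How are you?\n\nGoodbye."
target = "Bonjour. Ça va?\n\nAu revoir."
aligned_pairs = align(source, target)
```

Falling back to paragraph-level pairs when sentence counts differ keeps doubtful material out of the sentence-level data until a person has looked at it.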

The output from the preceding process is a UTF-8 encoded text file with one sentence pair per line and two columns (one per language) separated by a tab character. At this point, the data is in a format ready to be processed by the AOValidateTextPairs tools.
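As a rough illustration of the two-column format, the following sketch reads tab-separated lines and flags lines that do not fit. It is an assumption about what a basic format check might look like, not a description of AOValidateTextPairs; the function name and the problem messages are hypothetical.

```python
import io

def read_phrase_pairs(lines):
    """Split tab-separated lines into (source, target) pairs and collect problems."""
    pairs, problems = [], []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 2:
            problems.append((number, "expected exactly two tab-separated columns"))
            continue
        source, target = (column.strip() for column in columns)
        if not source or not target:
            problems.append((number, "empty source or target"))
            continue
        pairs.append((source, target))
    return pairs, problems

# In practice the lines would come from open(path, encoding="utf-8").
sample = io.StringIO("Hello world.\tBonjour le monde.\nMissing target\n\tOnly target\n")
pairs, problems = read_phrase_pairs(sample)
```

Lines that fail the check are reported with their line number rather than silently dropped, so they can be passed to the human review step.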

The data is then checked by a human and adjusted as necessary. Frequently even very clean data has formatting problems, or the data has been exported from tools that allow variables to be used in place of some words. This data must be cleaned and edited before use.

Once all the data is cleaned and validated, one final step is performed using the AOTextExtractKnowledge tools. This process analyzes the text and can extract a large amount of information from the input data. This data is used to further improve the quality of translations.

The following tools are used to prepare Phrase Pairs:

  • AOTagAlign: A set of tools that examine documents based on their tags (HTML, XML and Word). Output is a simple aligned set of data with two languages that is linked by original positioning matches between two documents.
  • AOTextFilter: A set of tools that can manipulate and process almost any text or data format.
  • AOExtractTextPairs: A set of tools that convert data from industry standard formats such as XLIFF and TMX to a simple format ready for use with Asia Online tools.
  • AOSentenceSegmentText: A set of tools that split text into sentence units. This is a complex task and differs from language to language.
  • AOAlignTextPairs: A set of tools that align flat text between two languages.
  • AOValidateTextPairs: A set of tools that check the data between two languages for quality related and formatting issues.

Prepare Dictionary and Glossary
Dictionaries and glossaries are used to cover specific language that may not be available in the Phrase Pair data. They are also used to bridge gaps in vocabulary within the Phrase Pair data.

Dictionary and glossary preparation is a simple task. The data is first extracted from its original form, converted to UTF-8 and then cleaned. Some human effort is usually required to validate the cleanliness of the data.

The output is a flat, tab-delimited file with the source term on the left and the target term on the right.

The following tools are used to prepare Dictionary and Glossary resources:

  • AODictionaryExtract: A set of tools that extract dictionary data from a variety of formats into a simple form usable by Asia Online.
  • AODictionaryClean: A set of tools that analyze the dictionary text and clean out the formatting information so that only clean data remains.

Prepare Tuning Set and Test Set
The Tuning Set data is critical to overall quality. The Asia Online translation platform has a large number of parameters that can be adjusted to improve the quality and performance of the system.

Preparing the Tuning Set is a largely manual step that relies on humans translating between 4,000 and 6,000 sentences of text. The data MUST NOT also appear in the Phrase Pairs, as this will distort the tuning. The data should be very high-quality human-translated text that represents the type of translations that you will be performing. The Test Set preparation process is identical to the Tuning Set preparation process. The Test Set is used at a later stage to assess the overall translation quality of the final translation system.
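The two constraints above (a size of 4,000 to 6,000 sentences and no overlap with the Phrase Pairs) can be checked mechanically. The sketch below is only an assumption about how such a check could look; it does not describe AOTuningSetCheck, and the function name and the normalization rule (lower-casing and collapsing whitespace) are hypothetical.

```python
def check_tuning_set(tuning_pairs, phrase_pairs, min_size=4000, max_size=6000):
    """Return tuning pairs whose source also appears in the training data,
    plus a flag saying whether the tuning set size is within range."""
    def normalize(text):
        # Ignore case and spacing differences when comparing sentences.
        return " ".join(text.lower().split())

    training_sources = {normalize(source) for source, _ in phrase_pairs}
    overlapping = [pair for pair in tuning_pairs if normalize(pair[0]) in training_sources]
    size_ok = min_size <= len(tuning_pairs) <= max_size
    return overlapping, size_ok

phrase_pairs = [("Open the file.", "Ouvrez le fichier.")]
tuning_pairs = [
    ("open  the file.", "Ouvrez le fichier."),   # duplicate of a training sentence
    ("Close the window.", "Fermez la fenêtre."),
]
overlapping, size_ok = check_tuning_set(tuning_pairs, phrase_pairs)
```

Any overlapping pair should be removed from either the Tuning Set or the Phrase Pairs before training. The same check applies to the Test Set.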

The following tools are used to prepare the Tuning Set:

  • AOTuningSetCheck: A set of tools that analyze the tuning set for a variety of issues that could affect the result of the tuning process.

Prepare Language Model
The Language Model is very important to the translation process because it guides the translation platform on the use of grammar and vocabulary.

Language Model data can also come from other sources, including plain monolingual text in the target language or websites in the target language.
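The page does not say how the Language Model is represented. As an illustration only, the sketch below assumes a simple bigram model of the kind commonly used in statistical machine translation, estimated from monolingual target-language text. The function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in monolingual text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that can start a bigram
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    if unigrams[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / unigrams[previous]

corpus = ["the engine translates text", "the engine is tuned"]
unigrams, bigrams = train_bigram_model(corpus)
p_engine = bigram_probability(unigrams, bigrams, "the", "engine")
p_translates = bigram_probability(unigrams, bigrams, "engine", "translates")
```

Word sequences that are frequent in the target-language text receive higher probabilities, which is how such a model can steer output toward natural grammar and vocabulary.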

The following tools are used to prepare the Language Model:

  • AOTextAnalyze: A set of tools that analyze text for words and phrases for which Language Model data should be generated.
  • AOGenerateLanguageModel: A set of tools that generate Language Model data from a variety of data sources and convert it to a format ready to be used by the translation platform.
  • AOWebTools: A set of tools that extract and download data from the web.
  • AOExtractLanguageModel: A set of tools that extract Language Model data from various formats including Word, XML, text files and HTML.
  • AOCleanLanguageModel: A set of tools that clean the language model.
  • AOSentenceSegment: A set of tools that split text into sentence units. This is a complex task and differs from language to language.

Select Baseline to Build From
Asia Online has prepared a number of baseline language builds which are constantly updated. Baseline data is used as the foundation for an engine. Very few customers have the large amounts of data required to build a translation engine with just their own data. Asia Online has built a number of baselines for each language pair. These cover dictionary and vocabulary used in each language pair for general translations.

By adding customer data to the baseline, you get the general vocabulary plus the customer's specific vocabulary. Without the baseline, there would be large holes in vocabulary and the translation would be of very low quality, as any words or phrases that were not in the customer-provided data would be unknown to the engine.

Asia Online offers a number of baselines to choose from. Select the baseline that matches the language pair you require and is most appropriate to the nature of your translations.

Scenario 1: Baseline + Custom Data


Suitable when:
  • Customer has limited training data (less than 2 million segments)
  • Translation covers a range of topics or vocabulary
Benefits:
  • Covers greater vocabulary
  • Addresses gaps in data that are not covered in customer training data
  • A custom engine can be built with as little as 20,000 translation units.
 
Scenario 2: Custom Data (No Baseline)


Suitable when:
  • Customer has large volumes of training data (greater than 2 million segments)
  • Translation covers a defined topic with limited vocabulary
Benefits:
  • A highly focused engine on a narrow domain or topic will always give higher-quality translation when the translated material is within the domain. Out-of-domain material will translate poorly.
  • Very specific quality and vocabulary control
  • No chance for non-customer vocabulary to be statistically more relevant than customer data
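The choice between the two scenarios can be written as a small decision rule. The thresholds below come from the text (2 million segments, and a minimum of 20,000 translation units). Treating "segments" and "translation units" as the same count, and falling back to Scenario 1 when a large data set covers a broad range of topics, are assumptions of this sketch, and the function name is hypothetical.

```python
BASELINE_THRESHOLD = 2_000_000   # segments, from the scenario descriptions
MINIMUM_UNITS = 20_000           # smallest data set a custom engine can be built from

def choose_scenario(segment_count, narrow_domain):
    """Suggest a build scenario from the amount of data and the breadth of the domain."""
    if segment_count < MINIMUM_UNITS:
        return "Not enough data for a custom engine"
    if segment_count > BASELINE_THRESHOLD and narrow_domain:
        return "Scenario 2: Custom Data (No Baseline)"
    # Limited data, or broad vocabulary: rely on the baseline to fill the gaps.
    # Exactly 2 million segments is not covered by the text and lands here.
    return "Scenario 1: Baseline + Custom Data"

small_customer = choose_scenario(150_000, narrow_domain=True)
large_customer = choose_scenario(5_000_000, narrow_domain=True)
```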

Prepare Capitalization Data
Not all languages need capitalization. Languages such as Chinese, Japanese and Thai are not written in Roman script and have no upper or lower case, so this step may be bypassed for such languages.

All translations are performed in lower case to ensure that all variations of text in upper and lower case are treated the same and to improve accuracy. The downside to this approach is that the text must be re-capitalized after translation. Large amounts of target language data are statistically analyzed to ensure the right capitalization is applied. This ensures that a phrase like “Welcome to the White House” is not capitalized as “Welcome to the white house”, and that sentences like “The mountain near my house was covered in white snow” or “John lived in a beautiful white house” are not capitalized as “The mountain near my House was covered in White snow”.
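A minimal sketch of statistical re-capitalization, assuming a very simple model: each word's casing is chosen from how it was most often cased in the training text after the same preceding word. This is an illustration of the idea, not the Asia Online capitalization system, and the function names and the four-sentence training corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count cased forms of each word, alone and given the previous cased word."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in sentences:
        previous = "<s>"
        for index, token in enumerate(sentence.split()):
            lowered = token.lower()
            bigram[(previous, lowered)][token] += 1
            if index > 0:  # sentence-initial capitals say little about usual casing
                unigram[lowered][token] += 1
            previous = token
    return unigram, bigram

def recapitalize(lowercased, unigram, bigram):
    """Restore casing left to right, preferring evidence that includes context."""
    output = []
    previous = "<s>"
    for lowered in lowercased.split():
        options = bigram.get((previous, lowered)) or unigram.get(lowered)
        cased = options.most_common(1)[0][0] if options else lowered
        if not output:
            cased = cased[:1].upper() + cased[1:]  # always capitalize the first word
        output.append(cased)
        previous = cased
    return " ".join(output)

training_text = [
    "Welcome to the White House",
    "The president spoke at the White House",
    "John lived in a beautiful white house",
    "The mountain near my house was covered in white snow",
]
unigram, bigram = train_truecaser(training_text)
```

Conditioning on the previous cased word is what separates “White House” from “white house” here. With so little training text the model only handles contexts it has seen, which is why large amounts of target language data are needed.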

AOWebTools are used to gather data from high-quality websites that represent the vocabulary of the same domain that is being translated.

The following tools are used to prepare Capitalization data:

  • AOExtractCapitalizationText: A set of tools that analyze existing data and extract text relevant to capitalization.
  • AOWebTools: A set of tools that extract and download data from the web.
  • AOPrepareCapitalizationText: A set of tools that prepare the text in the format required to train the capitalization system.

Automated Training
The next steps are all fully automated, with the output being a completed custom translation engine:

  • Train Language Model
  • Train Phrase Pairs
  • Train Capitalization
  • Tune Engine
  • Binarize Data

Publish to Production
Other than initial configuration of the training job, automated training requires no human involvement. It uses large amounts of processing resources and is spread across many servers. All training tasks can be run concurrently or independently.

Once both the Phrase Pair and the Language Model training have completed, the engine can be tuned. This process can take several days depending on how many machines are dedicated to the tasks.

Once all training and tuning is complete, the data is converted to a binary format that is small enough to be loaded into memory in a production environment. A systems administrator then publishes it to the production environment.
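The ordering described above (training tasks run concurrently, tuning waits for the Phrase Pair and Language Model training, binarization waits for everything, publication comes last) can be expressed as a dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) and is an illustration of the ordering only; the dependency table and function name are assumptions, not Asia Online's scheduler.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must finish before it can start.
DEPENDS_ON = {
    "Train Language Model": set(),
    "Train Phrase Pairs": set(),
    "Train Capitalization": set(),
    "Tune Engine": {"Train Language Model", "Train Phrase Pairs"},
    "Binarize Data": {"Train Capitalization", "Tune Engine"},
    "Publish to Production": {"Binarize Data"},
}

def execution_waves(depends_on):
    """Group steps into waves; steps in the same wave can run concurrently."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(DEPENDS_ON)
for number, steps in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(steps)}")
```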

NOW WE HAVE A WORKING TRANSLATION ENGINE
BUT REAL QUALITY COMES FROM FINE TUNING
Overview of the Asia Online Custom Translation Engine Fine Tuning Process

Once an engine has been built, it is ready and available for translation. Many of the translations will be accurate, but some will still have errors. Fine tuning removes errors and constantly refines the quality of translations.

Fine Tuning a custom engine is a detailed process that has been broken into a series of clear and specific steps that result in the highest quality translated output possible.

The diagram below shows each step that is required to fine tune an engine once it is built.

Translate Trial Data Set
Translating a trial data set involves feeding in sample documents and generating the output. This can be done using the Asia Online translation portal website.

Proof Read Trial Output
Asia Online provides a series of tools in the translation portal for post-editing. The output from these tools can then be fed back into the translation system as additional training data.

Build Corrective Data for Problems
Asia Online’s translation portal provides tools to better understand why errors were made. The example below shows the statistical patterns that were available as choices for the engine to select from.

Often the simplest reason for a poor translation is that there is not enough data in the phrase pairs or dictionaries that matches the new source material. The context may not have been seen in any of the other sentences provided, so the system does not know how to use the phrase correctly. Once such a gap has been identified, just a few sentences can be created to address the problem in the next training of the system.

Asia Online provides the tools that allow users to provide feedback and corrective patterns that completely eliminate many of the most prominent issues.

Extract Unknown Words
Another very useful tool is the Extract Unknown Words tool. This tool analyzes a translated document for unknown words.
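As a rough sketch of the idea, the following function lists the words in a document that are missing from a known vocabulary. How the actual Extract Unknown Words tool decides that a word is unknown is not described here, so the explicit vocabulary set, the function name and the letter-based tokenization are all assumptions.

```python
import re

def extract_unknown_words(text, known_vocabulary):
    """Return words not found in the vocabulary, in order of first appearance."""
    known = {word.lower() for word in known_vocabulary}
    unknown, seen = [], set()
    # [^\W\d_]+ matches runs of letters in any script.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in known and word not in seen:
            seen.add(word)
            unknown.append(word)
    return unknown

vocabulary = {"the", "exceeds", "rating"}
document = "The gearbox torque exceeds the flange rating."
unknown_words = extract_unknown_words(document, vocabulary)
```

The resulting list is the kind of input the next step works from when building dictionary and glossary entries.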

Translate Unknown Words
Once an unknown word list is available, the words in it can be translated and used to generate dictionary, glossary and spelling correction data.

Repeat Same Steps as in Custom Engine Foundation
The remaining steps in Fine Tuning are identical to the steps of the same name in the Custom Engine Build section.

The Fine Tuning process can be repeated as many times as required. Each time, the quality of translation will improve. The more effort you put into proofreading and fine tuning for your specific type of translation, the higher the quality of the output delivered for that type of translation.
