United Nations, Department for General Assembly and Conference Management

Download the UN Parallel Corpus

XML files

All documents are organized into folders by language, publication year, and publication symbols. Corresponding documents are placed in parallel folder structures, and a document's translation in any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
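As a rough illustration of the parallel folder layout, the sketch below swaps the language component of a document path to locate the same document in another language. It assumes, without confirmation from the archives themselves, that the language appears as a standalone path component equal to its two-letter code (the codes used in the archive names on this page). The function name and the sample path are hypothetical.

```python
from pathlib import PurePosixPath

# Two-letter codes taken from the archive names on this page.
LANGUAGES = {"ar", "en", "es", "fr", "ru", "zh"}


def parallel_path(path, src_lang, tgt_lang):
    """Return `path` with the first `src_lang` folder replaced by `tgt_lang`.

    Assumption: the language is a standalone path component,
    e.g. "en/2000/doc.xml" (a made-up example path).
    """
    if src_lang not in LANGUAGES or tgt_lang not in LANGUAGES:
        raise ValueError("unknown language code")
    parts = list(PurePosixPath(path).parts)
    if src_lang not in parts:
        raise ValueError(f"no '{src_lang}' folder in {path}")
    # Replace only the first matching component and keep the rest unchanged.
    parts[parts.index(src_lang)] = tgt_lang
    return str(PurePosixPath(*parts))
```

A translation exists only if the resulting path is actually present in the target-language archive, so the result should be checked before it is opened.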

Documents that exist in more than one language have a corresponding link file for each available language pair (at most 15). Each link file contains the alignment link type, the ids of the linked sentences, and the alignment quality score.

English:
UNv1.0-TEI.en.tar.gz.00
UNv1.0-TEI.en.tar.gz.01

French:
UNv1.0-TEI.fr.tar.gz.00
UNv1.0-TEI.fr.tar.gz.01

Spanish:
UNv1.0-TEI.es.tar.gz.00
UNv1.0-TEI.es.tar.gz.01

Russian:
UNv1.0-TEI.ru.tar.gz.00
UNv1.0-TEI.ru.tar.gz.01

Chinese:
UNv1.0-TEI.zh.tar.gz.00

Arabic:
UNv1.0-TEI.ar.tar.gz.00
UNv1.0-TEI.ar.tar.gz.01

Links:
UNv1.0-TEI.links.tar.gz.00
UNv1.0-TEI.links.tar.gz.01
UNv1.0-TEI.links.tar.gz.02
UNv1.0-TEI.links.tar.gz.03

Plain-text bitexts

We also make available plain-text bitexts that span all documents for a specific language pair and can be used more readily with SMT training pipelines. Each language-pair archive consists of one plain-text file per language and one file with ids.
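A minimal sketch of how such a bitext might be read is shown below. It assumes that the three files are UTF-8 encoded and line-aligned, so that line n of each file refers to the same sentence pair; this is the usual convention for SMT bitexts but is not stated on this page. The function name is hypothetical, and the file paths are passed in by the caller because the names inside the archive are not listed here.

```python
from itertools import zip_longest


def read_bitext(src_path, tgt_path, ids_path):
    """Yield (id_line, source_sentence, target_sentence) triples.

    Assumption: the three files are UTF-8 and line-aligned.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt, \
            open(ids_path, encoding="utf-8") as ids:
        # zip_longest pads the shorter inputs with None, which exposes
        # files that do not have the same number of lines.
        for row in zip_longest(ids, src, tgt):
            if None in row:
                raise ValueError("files have different numbers of lines")
            yield tuple(field.rstrip("\n") for field in row)
```

Because the function is a generator, a full bitext can be iterated over without loading it into memory.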

English-French:
UNv1.0.en-fr.tar.gz.00
UNv1.0.en-fr.tar.gz.01
UNv1.0.en-fr.tar.gz.02

English-Spanish:
UNv1.0.en-es.tar.gz.00
UNv1.0.en-es.tar.gz.01

English-Russian:
UNv1.0.en-ru.tar.gz.00
UNv1.0.en-ru.tar.gz.01
UNv1.0.en-ru.tar.gz.02

English-Chinese:
UNv1.0.en-zh.tar.gz.00
UNv1.0.en-zh.tar.gz.01

English-Arabic:
UNv1.0.ar-en.tar.gz.00
UNv1.0.ar-en.tar.gz.01

French-Spanish:
UNv1.0.es-fr.tar.gz.00
UNv1.0.es-fr.tar.gz.01
UNv1.0.es-fr.tar.gz.02

French-Russian:
UNv1.0.fr-ru.tar.gz.00
UNv1.0.fr-ru.tar.gz.01
UNv1.0.fr-ru.tar.gz.02

French-Chinese:
UNv1.0.fr-zh.tar.gz.00
UNv1.0.fr-zh.tar.gz.01

French-Arabic:
UNv1.0.ar-fr.tar.gz.00
UNv1.0.ar-fr.tar.gz.01

Spanish-Russian:
UNv1.0.es-ru.tar.gz.00
UNv1.0.es-ru.tar.gz.01
UNv1.0.es-ru.tar.gz.02

Spanish-Chinese:
UNv1.0.es-zh.tar.gz.00
UNv1.0.es-zh.tar.gz.01

Spanish-Arabic:
UNv1.0.ar-es.tar.gz.00
UNv1.0.ar-es.tar.gz.01

Russian-Chinese:
UNv1.0.ru-zh.tar.gz.00
UNv1.0.ru-zh.tar.gz.01

Russian-Arabic:
UNv1.0.ar-ru.tar.gz.00
UNv1.0.ar-ru.tar.gz.01
UNv1.0.ar-ru.tar.gz.02

Chinese-Arabic:
UNv1.0.ar-zh.tar.gz.00
UNv1.0.ar-zh.tar.gz.01
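The archive names in the list above follow a visible pattern: the two language codes of a pair appear in alphabetical order (ar-en rather than en-ar, es-fr rather than fr-es). The sketch below, whose function name is hypothetical, enumerates the 15 unordered pairs of the six languages and builds the corresponding archive prefixes. The number of chunk suffixes (.00, .01, ...) differs from pair to pair and is not derived here.

```python
from itertools import combinations

# Language codes as they appear in the archive names on this page.
LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def bitext_archive_prefixes(languages=LANGUAGES):
    """Return one archive prefix per unordered language pair.

    The two codes are sorted alphabetically, matching names such as
    UNv1.0.ar-en and UNv1.0.es-fr in the list above.
    """
    return [f"UNv1.0.{a}-{b}" for a, b in combinations(sorted(languages), 2)]
```

Six languages give 6 * 5 / 2 = 15 pairs, which matches the number of bitext archives listed.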

Fully aligned subcorpus

Fully aligned plain subcorpus in the six official UN languages

All languages
UNv1.0.6way.tar.gz.00
UNv1.0.6way.tar.gz.01
UNv1.0.6way.tar.gz.02
UNv1.0.6way.tar.gz.03

Test and Development Sets

Documents released in 2015 (excluded from the current corpus) were used to create official development and test sets for machine translation tasks. Development data was randomly selected from documents that were released in the first quarter of 2015 and test data was selected from the second quarter. To avoid repetitions, we only chose translation tuples for which the English sentence was unique. We also skewed the distribution of sentence lengths slightly by requiring that half of the sentences not be chosen if their length was below 50 characters and not imposing any restrictions on the other half. This was done to reduce the occurrence of formulaic and less informative sentences.
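The selection procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to build the official sets: "unique" is interpreted as "the English sentence occurs exactly once in the candidate pool", length is measured in characters of the English side, and the function name, the seed parameter, and the tuple layout (English sentence first) are hypothetical.

```python
import random
from collections import Counter


def select_tuples(tuples, n, min_chars=50, seed=0):
    """Pick n translation tuples; tuples[i][0] is the English sentence.

    Half of the picks must have an English side of at least `min_chars`
    characters; the other half is drawn without a length restriction.
    """
    rng = random.Random(seed)
    counts = Counter(t[0] for t in tuples)
    # Assumed reading of "unique": the English sentence occurs exactly once.
    unique_idx = [i for i, t in enumerate(tuples) if counts[t[0]] == 1]
    long_idx = [i for i in unique_idx if len(tuples[i][0]) >= min_chars]
    n_long = n // 2
    if len(long_idx) < n_long or len(unique_idx) < n:
        raise ValueError("not enough candidate tuples")
    # Length-restricted half first, then the unrestricted half from the rest.
    picked = set(rng.sample(long_idx, n_long))
    rest = [i for i in unique_idx if i not in picked]
    picked.update(rng.sample(rest, n - n_long))
    return [tuples[i] for i in sorted(picked)]
```

Drawing one half only from longer sentences shifts the length distribution upward, which is the stated purpose of the skew: fewer short, formulaic sentences in the final sets.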

Each set comprises 4,000 sentences that are 1-1 alignments across all six official languages. As with the fully aligned subcorpus, any translation direction can be evaluated.

Test / Development Sets
UNv1.0.testsets.tar.gz

Decompression

Due to the size of the files, they have been split into chunks of at most 1GB each.

For instance, the 6-way corpus consists of the four files listed above, UNv1.0.6way.tar.gz.00 through UNv1.0.6way.tar.gz.03.

To reverse the splitting process, you can either recreate the original tar.gz archive and then decompress it:

cat UNv1.0.6way.tar.gz.* > UNv1.0.6way.tar.gz
tar -xzf UNv1.0.6way.tar.gz

Or decompress on the fly, without creating the intermediate archive, using the following command:

cat UNv1.0.6way.tar.gz.* | tar -xzf -

Any other archive is decompressed analogously. Make sure to include all parts of each archive, in order, as otherwise the data will be corrupted.
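Where the cat and tar commands are not available, the same on-the-fly extraction can be sketched in Python using only the standard library. The class and function names below are hypothetical. The sketch assumes the parts sit next to each other and are numbered .00, .01, ... without gaps; it can detect a missing part in the middle of the sequence, but it cannot tell whether the last part is missing.

```python
import glob
import tarfile


class ChainedReader:
    """Present several files as one continuous binary stream."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._current = None

    def read(self, size=-1):
        while True:
            if self._current is None:
                if not self._paths:
                    return b""  # all parts consumed
                self._current = open(self._paths.pop(0), "rb")
            data = self._current.read(size)
            if data:
                return data
            # Current part is exhausted: move on to the next one.
            self._current.close()
            self._current = None

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def find_parts(archive):
    """Return the sorted parts of `archive`, e.g. archive.00, archive.01."""
    parts = sorted(glob.glob(glob.escape(archive) + ".[0-9][0-9]"))
    expected = [f"{archive}.{i:02d}" for i in range(len(parts))]
    if not parts or parts != expected:
        raise FileNotFoundError(f"missing or non-contiguous parts for {archive}")
    return parts


def extract_split_archive(archive, dest="."):
    """Extract a split tar.gz without writing the joined archive to disk."""
    reader = ChainedReader(find_parts(archive))
    try:
        # "r|gz" reads the gzip-compressed tar strictly as a forward stream.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            tar.extractall(dest)
    finally:
        reader.close()
```

For example, extract_split_archive("UNv1.0.6way.tar.gz") would play the role of the second shell command above, assuming all parts are in the current directory.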