United Nations Parallel Corpus

Introduction

The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.

The corpus was created as part of the United Nations commitment to multilingualism and in response to the growing importance of statistical machine translation (SMT), and of the United Nations SMT system Tapta4UN, within the translation services of the Department for General Assembly and Conference Management (DGACM).

The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.

When using the United Nations Parallel Corpus, the user must acknowledge the United Nations as the source of the information. When making reference to the United Nations Parallel Corpus, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

For further enquiries, please contact gtext-support@unov.org.

Corpus statistics

Statistics for pair-wise aligned documents:


	ar	en	es	fr	ru	zh
ar	–	111,241 18,539,207	113,065 18,578,118	112,605 18,281,635	111,896 18,863,363	91,345 15,595,948
en	456,552,223 512,087,009	–	123,844 21,911,121	149,741 25,805,088	133,089 23,239,280	91,028 15,886,041
es	459,383,823 593,671,507	590,672,799 678,778,068	–	125,098 21,915,504	115,921 19,993,922	91,704 15,428,381
fr	452,833,187 597,651,233	668,518,779 782,912,487	674,477,239 688,418,806	–	133,510 22,381,416	91,613 15,206,689
ru	462,021,954 491,166,055	601,002,317 569,888,234	623,230,646 513,100,827	691,062,370 557,143,420	–	92,337 16,038,721
zh	387,968,412 387,931,939	425,562,909 381,371,583	493,338,256 382,052,741	498,007,502 377,884,885	417,366,738 392,372,764	–

Each cell above the diagonal contains the number of documents followed by the number of lines for that language pair. Each cell below the diagonal contains two token counts for the language pair: the first refers to the language in the column title, the second to the language in the row title. Tokens were counted after processing with the Moses tokenizer. For Chinese, Jieba was used before applying the Moses tokenizer with default settings.
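As a reading aid, the following minimal sketch shows how a two-number cell of the table can be turned into integers and used for simple ratios. The helper name `parse_cell` is an assumption introduced here for illustration, and the sketch assumes the two numbers of a cell are listed in the order described above.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a two-number table cell such as '111,241 18,539,207' into integers."""
    first, second = cell.split()
    return int(first.replace(",", "")), int(second.replace(",", ""))


# Above the diagonal (row ar, column en): documents, then lines.
documents, lines = parse_cell("111,241 18,539,207")
lines_per_document = lines / documents

# Below the diagonal (row en, column ar): tokens of the column language (ar),
# then tokens of the row language (en).
ar_tokens, en_tokens = parse_cell("456,552,223 512,087,009")
en_tokens_per_line = en_tokens / lines
```

With the figures from the table, the Arabic-English pair averages somewhat under 170 lines per document and somewhat under 30 English tokens per line.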

Document statistics


Total documents	Aligned document pairs
799,276	1,727,539

Fully aligned subcorpus statistics


Documents	Lines	English tokens
86,307	11,365,709	334,953,817

Disclaimer and terms of use

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

File organization and format

All documents are organized into folders by language, publication year, and publication symbol. Corresponding documents are placed in parallel folder structures, and a document's translation into any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
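The parallel folder layout means that a translation can be located by swapping only the language component of a path. The sketch below illustrates this under stated assumptions: the language folder is taken to be the first path component, the two-letter codes are those used in the statistics table, and the example path is hypothetical rather than a real file in the corpus.

```python
from pathlib import PurePosixPath

# Language codes as used in the corpus statistics table.
LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def translation_path(path: str, target_lang: str) -> PurePosixPath:
    """Return the parallel path of a document in another language folder.

    Assumes (hypothetically) that the language code is the first path component.
    """
    parts = PurePosixPath(path).parts
    if not parts or parts[0] not in LANGUAGES:
        raise ValueError(f"path does not start with a language folder: {path!r}")
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {target_lang!r}")
    return PurePosixPath(target_lang, *parts[1:])


# Hypothetical example path: language / year / symbol-based folders / file.
french_version = translation_path("en/2010/a/65/1.xml", "fr")
```

Because a translation exists only for some documents, a caller would still need to check that the resulting path is present on disk.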

Individual documents follow the TEI-based format of the JRC-Acquis parallel corpus. Documents retain the original paragraph structure, and sentence splits have been added automatically. Documents for which multiple language versions exist have corresponding linked files for each of the available language pairs, of which there are at most 15 (one for each unordered pair of the six official languages).
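The upper bound of 15 language pairs follows from choosing 2 of the 6 official languages without regard to order. A short sketch that enumerates them, using the language codes from the statistics table:

```python
from itertools import combinations

LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")

# Every unordered pair of distinct languages, e.g. ("ar", "en"), ("ar", "es"), ...
language_pairs = list(combinations(LANGUAGES, 2))
pair_count = len(language_pairs)  # 6 * 5 / 2 = 15
```

A document available in fewer than six languages has correspondingly fewer pairs; for example, a document in four languages has 6.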

In addition to the one-file-per-document type of distribution, we also make available plain-text bi-texts that span all documents for a specific language pair and can be used more readily with SMT training pipelines.
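A bi-text of this kind is typically consumed by reading both sides in lockstep. The sketch below assumes, without confirmation from this page, that each side of a bi-text is a UTF-8 plain-text file with one segment per line and that line i of one file is aligned with line i of the other. The function name and the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator


def read_bitext(source_path: str, target_path: str) -> Iterator[tuple[str, str]]:
    """Yield aligned (source, target) segments from two line-aligned files.

    Raises ValueError if the two files do not have the same number of lines.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Usage (hypothetical file names):
# for english, french in read_bitext("bitext.en", "bitext.fr"):
#     ...
```

Failing loudly on a length mismatch is a deliberate choice here, since silently truncating to the shorter file would hide a misaligned download.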

For further details about the preparation process of the Corpus, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

Test and development sets

Documents released in 2015 were set aside and used to create official development and test sets covering all language pairs. Of these documents, 100 were randomly selected: 50 for the development set and 50 for the test set. As with the fully aligned subcorpus, every development and test set sentence is available in all six official languages, so any translation direction can be evaluated.
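The selection described above can be illustrated with a small sketch of a random, disjoint 50/50 split. This is not the procedure actually used to build the official sets: the function name, the seed, and the document identifiers are all assumptions made for illustration.

```python
import random


def split_dev_test(document_ids, per_set=50, seed=0):
    """Randomly pick 2 * per_set documents and split them into two disjoint sets."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    selected = rng.sample(sorted(document_ids), 2 * per_set)
    return selected[:per_set], selected[per_set:]


# Hypothetical identifiers standing in for the documents released in 2015.
held_out_documents = [f"doc-{i:04d}" for i in range(1000)]
dev_documents, test_documents = split_dev_test(held_out_documents)
```

Sampling once and cutting the sample in half guarantees that no document appears in both sets.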

For machine translation baselines, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

Document metadata

Every document in XML format carries the following embedded metadata:

Symbol: Each United Nations document has a unique symbol. All language versions of a document have the same symbol. Symbols include both letters and numbers. Some elements of the symbol have meaning, while others do not. In general, the symbol does not necessarily indicate the topic of the document.
Translation job number: This is a unique, language-specific document identifier.
Publication date: This is the original publication date for a document by symbol, which applies to all language versions. This date does not necessarily correspond to the release date of each individual document.
Processing place: Possible locations are New York, Geneva and Vienna.
Keywords: These include any number of subjects covered by the document, according to the ODS subject lexicon, which is based on the United Nations Bibliographic Information System Thesaurus.
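For downstream processing, the five metadata fields listed above could be held in a small record type. The sketch below is one possible representation; the class and field names are assumptions and do not reflect the element names used in the XML files, and the example values are placeholders rather than a real document.

```python
from dataclasses import dataclass
from datetime import date

# The three processing places named in the metadata description.
PROCESSING_PLACES = ("New York", "Geneva", "Vienna")


@dataclass(frozen=True)
class DocumentMetadata:
    symbol: str                     # shared by all language versions
    job_number: str                 # unique per language version
    publication_date: date          # original publication date by symbol
    processing_place: str           # one of PROCESSING_PLACES
    keywords: tuple[str, ...] = ()  # any number of subject keywords

    def __post_init__(self):
        if self.processing_place not in PROCESSING_PLACES:
            raise ValueError(f"unknown processing place: {self.processing_place!r}")


# Placeholder values for illustration only.
example = DocumentMetadata(
    symbol="A/00/000",
    job_number="JOB-0000000",
    publication_date=date(2000, 1, 1),
    processing_place="Geneva",
    keywords=("EXAMPLE KEYWORD",),
)
```

Since the symbol is shared across language versions while the job number is language-specific, the symbol is the natural key for grouping translations of the same document.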

United Nations, Department for General Assembly and Conference Management