United Nations, Department for General Assembly and Conference Management

United Nations Parallel Corpus

Introduction

The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.

The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the Department for General Assembly and Conference Management (DGACM) translation services and the United Nations SMT system, Tapta4UN.

The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.

When using the United Nations Parallel Corpus, the user must acknowledge the United Nations as the source of the information. When making reference to the United Nations Parallel Corpus, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

For further enquiries, please contact gtext-support@unov.org.

Please provide your contact details and affiliation, and the purpose of your United Nations Corpus usage.


Corpus statistics

Statistics for pair-wise aligned documents:

               ar           en           es           fr           ru           zh
ar              -      111,241      113,065      112,605      111,896       91,345
                -   18,539,207   18,578,118   18,281,635   18,863,363   15,595,948
en    456,552,223            -      123,844      149,741      133,089       91,028
      512,087,009            -   21,911,121   25,805,088   23,239,280   15,886,041
es    459,383,823  590,672,799            -      125,098      115,921       91,704
      593,671,507  678,778,068            -   21,915,504   19,993,922   15,428,381
fr    452,833,187  668,518,779  674,477,239            -      133,510       91,613
      597,651,233  782,912,487  688,418,806            -   22,381,416   15,206,689
ru    462,021,954  601,002,317  623,230,646  691,062,370            -       92,337
      491,166,055  569,888,234  513,100,827  557,143,420            -   16,038,721
zh    387,968,412  425,562,909  493,338,256  498,007,502  417,366,738            -
      387,931,939  381,371,583  382,052,741  377,884,885  392,372,764            -

The cells above the diagonal contain the number of documents (upper number) and the number of lines (lower number) per language pair. The cells below the diagonal contain the number of tokens in a language pair: the upper number refers to the language in the column title, the lower number to the language in the row title. Tokens were counted after processing with the Moses tokenizer. For Chinese, Jieba was used before applying the Moses tokenizer with default settings.

Document statistics

Total documents    Aligned document pairs
799,276            1,727,539

Fully aligned subcorpus statistics

Documents    Lines         English tokens
86,307       11,365,709    334,953,817

Disclaimer and terms of use

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

  • The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
  • Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
  • When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
  • Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

File organization and format

All documents are organized into folders by language, publication year, and publication symbol. Corresponding documents are placed in parallel folder structures, and a document's translation into any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
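
The lookup described above can be sketched in Python. This is only an illustration: it assumes a hypothetical layout of the form <root>/<language>/<year>/<symbol folders>/<file>, and the folder names, file name and extension used in the example are invented; the actual distribution may differ.

```python
from pathlib import PurePosixPath

# The six official languages, using the codes from the statistics table.
LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def translation_path(doc_path: str, root: str, target_lang: str) -> str:
    """Return the path at which a translation of doc_path would be expected.

    Assumption (hypothetical layout): the first folder below `root` is the
    language code, and the remainder of the path is identical across languages.
    """
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {target_lang}")
    relative = PurePosixPath(doc_path).relative_to(root)
    source_lang, *rest = relative.parts
    if source_lang not in LANGUAGES:
        raise ValueError(f"no language folder at the start of: {doc_path}")
    # Swap only the language folder; year and symbol folders stay the same.
    return str(PurePosixPath(root, target_lang, *rest))


# Invented example path, for illustration only.
print(translation_path("corpus/en/2010/a/64/1.xml", "corpus", "fr"))
# prints: corpus/fr/2010/a/64/1.xml
```

Because a translation exists only for some documents, a caller would still need to check that the returned path is actually present on disk.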

For individual documents, it was decided to follow the TEI-based format of the JRC-Acquis parallel corpus. Documents retain the original paragraph structure, and sentence splits have been added automatically. Documents for which multiple language versions exist have corresponding linked files for each of the language pairs, of which there are 15 at most.
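
The maximum of 15 follows from counting the unordered pairs of the six official languages: 6 x 5 / 2 = 15. A short Python check, using the language codes from the statistics table (the helper name pairs_for is illustrative only):

```python
from itertools import combinations

LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def pairs_for(available):
    """List the language pairs that can exist for a document available
    in the given languages (order follows LANGUAGES)."""
    present = [lang for lang in LANGUAGES if lang in set(available)]
    return list(combinations(present, 2))


# A document available in all six languages has the maximum number of pairs.
print(len(pairs_for(LANGUAGES)))        # 15
# A document available in only three languages has three pairs.
print(pairs_for(["fr", "en", "ru"]))    # [('en', 'fr'), ('en', 'ru'), ('fr', 'ru')]
```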

In addition to the one-file-per-document type of distribution, we also make available plain-text bi-texts that span all documents for a specific language pair and can be used more readily with SMT training pipelines.
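
As a rough sketch of how such a bi-text might be consumed, the Python function below iterates over sentence pairs. It assumes that a bi-text is stored as two plain-text UTF-8 files with one sentence per line, where line i of one file is the translation of line i of the other; the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_bitext(source_path: str, target_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which signals
            # that the two sides are not aligned line by line.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Usage with hypothetical file names:
# for en, fr in read_bitext("bitext.en", "bitext.fr"):
#     print(en, "|||", fr)
```

Reading lazily in this way avoids loading a bi-text of tens of millions of lines into memory at once.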

For further details about the preparation process of the Corpus, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

Test and development sets

Data from documents released in 2015 were set aside, and official development and test sets were created across all language pairs. Of these documents, 100 were randomly selected: 50 for the development set and 50 for the test set. As in the case of the fully aligned subcorpus, all development and test set sentences are available for all official languages, and any translation direction can be evaluated.
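
A random selection of this kind can be sketched in Python as follows. This is not the procedure that was actually used to build the official sets; the document identifiers, the function name and the seed are all invented for illustration.

```python
import random


def split_dev_test(doc_ids, n_per_set=50, seed=0):
    """Randomly pick 2 * n_per_set documents and split them into two sets."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is repeatable
    # sample() returns the chosen items in random order, so slicing the
    # result in half gives two disjoint random subsets.
    chosen = rng.sample(list(doc_ids), 2 * n_per_set)
    return chosen[:n_per_set], chosen[n_per_set:]


# Invented identifiers standing in for the documents released in 2015.
docs_2015 = [f"doc-{i:04d}" for i in range(1000)]
dev_docs, test_docs = split_dev_test(docs_2015)
print(len(dev_docs), len(test_docs))  # 50 50
```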

For machine translation baselines, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.

Document metadata

Every document in XML file format has embedded meta-information:

Symbol
Each United Nations document has a unique symbol. All language versions of a document have the same symbol. Symbols include both letters and numbers. Some elements of the symbol have meaning, while others do not. In general, the symbol does not necessarily indicate the topic of the document.
Translation job number
This is a unique, language-specific document identifier.
Publication date
This is the original publication date for a document by symbol, which applies to all language versions. This date does not necessarily correspond to the release date of each individual document.
Processing place
Possible locations are New York, Geneva and Vienna.
Keywords
These include any number of subjects covered by the document, according to the ODS subject lexicon, which is based on the United Nations Bibliographic Information System Thesaurus.
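
The fields above can be modelled as a small Python record. This is a sketch of the information content only, not of the XML encoding: the class and attribute names are assumptions, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

# Processing places named in the description above.
PROCESSING_PLACES = {"New York", "Geneva", "Vienna"}


@dataclass(frozen=True)
class DocumentMetadata:
    symbol: str              # shared by all language versions of a document
    job_number: str          # unique to one language version
    publication_date: date   # original publication date for the symbol
    processing_place: str
    keywords: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.processing_place not in PROCESSING_PLACES:
            raise ValueError(f"unexpected processing place: {self.processing_place}")


def same_document(a: DocumentMetadata, b: DocumentMetadata) -> bool:
    """Two language versions belong together when their symbols match."""
    return a.symbol == b.symbol


# Made-up sample values, for illustration only.
en = DocumentMetadata("A/RES/69/1", "JOB-EN-0001", date(2014, 1, 1), "New York")
fr = DocumentMetadata("A/RES/69/1", "JOB-FR-0001", date(2014, 1, 1), "New York")
print(same_document(en, fr))             # True
print(en.job_number == fr.job_number)    # False
```

Matching on the symbol rather than on the job number reflects the distinction drawn above: the symbol identifies the document, while the job number identifies a single language version.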