The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.
The corpus was created as part of the United Nations' commitment to multilingualism and in response to the growing importance of statistical machine translation (SMT) within the translation services of the Department for General Assembly and Conference Management (DGACM) and of the United Nations SMT system, Tapta4UN.
The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.
When using the United Nations Parallel Corpus, the user must acknowledge the United Nations as the source of the information. When making reference to the United Nations Parallel Corpus, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
For further enquiries, please contact gtext-support@unov.org.
Statistics for pair-wise aligned documents:
| | ar | en | es | fr | ru | zh |
---|---|---|---|---|---|---|
ar | – | 111,241 18,539,207 | 113,065 18,578,118 | 112,605 18,281,635 | 111,896 18,863,363 | 91,345 15,595,948 |
en | 456,552,223 512,087,009 | – | 123,844 21,911,121 | 149,741 25,805,088 | 133,089 23,239,280 | 91,028 15,886,041 |
es | 459,383,823 593,671,507 | 590,672,799 678,778,068 | – | 125,098 21,915,504 | 115,921 19,993,922 | 91,704 15,428,381 |
fr | 452,833,187 597,651,233 | 668,518,779 782,912,487 | 674,477,239 688,418,806 | – | 133,510 22,381,416 | 91,613 15,206,689 |
ru | 462,021,954 491,166,055 | 601,002,317 569,888,234 | 623,230,646 513,100,827 | 691,062,370 557,143,420 | – | 92,337 16,038,721 |
zh | 387,968,412 387,931,939 | 425,562,909 381,371,583 | 493,338,256 382,052,741 | 498,007,502 377,884,885 | 417,366,738 392,372,764 | – |
Cells above the diagonal give the number of documents followed by the number of lines for each language pair. Cells below the diagonal give the token counts for the pair: the first (upper) number refers to the language in the column title, the second (lower) number to the language in the row title. Tokens were counted after processing with the Moses tokenizer. For Chinese, Jieba was used for word segmentation before applying the Moses tokenizer with default settings.
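As a reading aid for the table, the following minimal sketch combines the line count and the token counts of one language pair (English and French) into average tokens per line. The constants are copied from the en/fr cells above. The helper name `tokens_per_line` is only illustrative and is not part of the corpus distribution.

```python
# Values copied from the pair-wise statistics table (en/fr cells).
EN_FR_LINES = 25_805_088  # row "en", column "fr" (above the diagonal): second number = lines
EN_TOKENS = 668_518_779   # row "fr", column "en" (below the diagonal): upper number = column language (en)
FR_TOKENS = 782_912_487   # same cell: lower number = row language (fr)


def tokens_per_line(tokens: int, lines: int) -> float:
    """Average number of tokens per aligned line."""
    return tokens / lines


en_avg = tokens_per_line(EN_TOKENS, EN_FR_LINES)
fr_avg = tokens_per_line(FR_TOKENS, EN_FR_LINES)
print(f"en: {en_avg:.1f} tokens/line, fr: {fr_avg:.1f} tokens/line")
```

For this pair the averages come out at roughly 26 English and 30 French tokens per line.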
Document statistics
Total documents | Aligned document pairs |
---|---|
799,276 | 1,727,539 |
Fully aligned subcorpus statistics
Documents | Lines | English tokens |
---|---|---|
86,307 | 11,365,709 | 334,953,817 |
The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):
All documents are organized into folders by language, publication year, and publication symbol. Corresponding documents are placed in parallel folder structures, and a document's translation into any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
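The parallel folder layout described above suggests that a translation can be located by swapping the language folder in a document's path. The sketch below illustrates this under stated assumptions: the language folder is taken to be the first path component that matches one of the six language codes, and the example path `en/2010/a/65/1.xml` is hypothetical, not an actual corpus file. The function name `translation_path` is likewise invented for illustration.

```python
from pathlib import PurePosixPath

# The six official languages, using the codes from the statistics table.
LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def translation_path(path: str, target_lang: str) -> PurePosixPath:
    """Return the same path with its language folder replaced by target_lang.

    Assumption: the language folder is the first path component that is
    exactly one of the six language codes.
    """
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {target_lang!r}")
    parts = list(PurePosixPath(path).parts)
    for i, part in enumerate(parts):
        if part in LANGUAGES:
            parts[i] = target_lang
            return PurePosixPath(*parts)
    raise ValueError(f"no language folder found in {path!r}")


# Hypothetical example path, shown only to illustrate the swap.
print(translation_path("en/2010/a/65/1.xml", "fr"))
```

Whether the resulting file exists still has to be checked, since a translation into a given language is not guaranteed for every document.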
Individual documents follow the TEI-based format of the JRC-Acquis parallel corpus. Documents retain the original paragraph structure, and sentence splits have been added automatically. Documents for which multiple language versions exist have corresponding linked files for each available language pair, of which there are at most 15.
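The upper bound of 15 is simply the number of unordered pairs that can be formed from six languages. A short check, using the language codes from the statistics table:

```python
from itertools import combinations

LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]

# Every unordered pair of distinct languages: "6 choose 2".
pairs = list(combinations(LANGUAGES, 2))

print(len(pairs))
for first, second in pairs:
    print(f"{first}-{second}")
```

A document available in fewer than six languages has correspondingly fewer pairs; for example, four language versions yield six pairs.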
In addition to the one-file-per-document type of distribution, we also make available plain-text bi-texts that span all documents for a specific language pair and can be used more readily with SMT training pipelines.
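A plain-text bi-text is typically consumed by reading the two language sides in lockstep. The sketch below assumes a layout the text above does not spell out: two UTF-8 files with one sentence per line, where line *n* of one file is the translation of line *n* of the other. The function names and any file names passed to them are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def aligned_pairs(src_lines: Iterable[str],
                  tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Raises ValueError if one side runs out of lines before the other,
    which would indicate that the two sides are not aligned.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            raise ValueError("source and target have different line counts")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_bitext(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Stream sentence pairs from two files without loading them into memory."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        yield from aligned_pairs(src, tgt)
```

Streaming line by line matters here because the bi-texts run to tens of millions of lines per language pair.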
For further details about the preparation process of the Corpus, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
Data from documents released in 2015 were set aside, and official development and test sets were created across all language pairs. Of these documents, 100 were randomly selected: 50 for the development set and 50 for the test set. As in the case of the fully aligned subcorpus, all development and test set sentences are available in all official languages, so any translation direction can be evaluated.
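The selection described above can be illustrated with a small sketch. This is not the procedure actually used to build the official sets. It only shows one way to draw 100 documents at random and divide them evenly, and the function name, the seed, and the document identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dev_test(doc_ids: Sequence[str],
                   n_total: int = 100,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick n_total documents and split them into two equal halves."""
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    chosen = rng.sample(list(doc_ids), n_total)  # sampling without replacement
    half = n_total // 2
    return chosen[:half], chosen[half:]


# Hypothetical identifiers standing in for the held-out 2015 documents.
held_out = [f"doc-{i:04d}" for i in range(1000)]
dev_docs, test_docs = split_dev_test(held_out)
print(len(dev_docs), len(test_docs))
```

Sampling without replacement guarantees that no document ends up in both sets.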
For machine translation baselines, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
Every document in XML file format has embedded meta-information: