EU Translation Memories
Now includes documents up to the end of 2015!
This is the world's largest downloadable multilingual corpus of European Union texts, in aligned sentence pairs. The corpus covers all the official languages of the EU except for Croatian.
Intended users: translators, whether they use computer-assisted translation (CAT) software or not, machine translation system developers and linguistics researchers.
The corpus is provided in various formats:
- TMX - for translators who use a CAT tool such as Trados, MemoQ, Wordfast etc.
- Tab delimited txt - for use with TMLookup or Xbench if you don't use a CAT, and for those who wish to process the data with their own software tools
- Excel - on request, in case you wish to search manually and read two or more texts in parallel
- Sdltm - available on request for Trados Studio users, so you don't have to run the (very slow) TMX import yourself
- Monolingual HTML - available on request, in case you wish to read the source texts in full (Note: do NOT rely on these texts for legal information. Only European Union legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.)
- Trados Studio AutoSuggest dictionaries (.bpl) - on request
- Custom formats (XML, multilingual HTML, SQLite database etc.) are also available on request
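If you want to process the tab delimited txt files with your own tools, a few lines of Python are enough for a simple lookup. The sketch below is only an illustration: it assumes one segment pair per line, with the source text in the first column and the target text in the second, and UTF-8 encoding. The actual column layout of the files (including any metadata columns) may differ, so check the sample first and adjust `src_col` and `tgt_col`. The function name `search_tab_file` is made up for this example.

```python
import csv
from typing import Iterator, Tuple


def search_tab_file(
    path: str, term: str, src_col: int = 0, tgt_col: int = 1
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs whose source text contains `term`.

    The column positions are assumptions, not the documented file layout.
    The match is a case-insensitive substring search.
    """
    needle = term.casefold()
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: quotation marks in the text are ordinary characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= max(src_col, tgt_col):
                continue  # skip short or malformed lines
            if needle in row[src_col].casefold():
                yield row[src_col], row[tgt_col]
```

Because the file is read line by line, this works on files that are too large to load into memory, although each search scans the whole file. For repeated lookups, an indexed tool such as TMLookup is a better fit.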
Rich metadata: CELEX numbers, document titles, subject areas. The whole corpus is broken down into 22 subsets by subject area. When you are translating a text on agriculture, you can choose to only use the agriculture subcorpus, which is much more manageable in size than the whole corpus. Your lookups will be much quicker and the results you get will be much more relevant to your text. When you get a hit in your CAT tool, the title of the document the hit comes from will also be shown, giving you a better idea of context. You will also be shown the document's unique identifier, the CELEX number, so you can look up the full text of the document in question.
Available document sets:
- CELEX collection (€75): The CELEX collection contains many document types; a full list is available in the statistics table. Some examples of the document types in this TM: Legislation (directives, regulations etc.): these are the documents most likely to provide 100% TM hits if you translate documents connected to EU legislation. Commission proposals: draft legislation and other texts from the European Commission, largely generated as part of the legislative process. They contain a lot of fresh material that has not yet made it into final legislation. They are very useful when translating documents connected to ongoing legislative work or newly issued legislation, and for interpreters working for EU institutions. Documents issued by the European Court of Justice: judgements, orders, opinions etc. There is a significant overlap between the CELEX collection and the DGT-TM. To sum up the difference in one sentence, the CELEX collection contains many more document types than the DGT-TM, making it three times larger, and it comes with more metadata (document titles and subject areas), filtering and sorting. Everything that is in the DGT-TM is also in the CELEX collection.
- EP collection (€30): plenary session transcripts and EP reports. The plenary session transcripts are largely the same texts you'll find in the Europarl corpus; the aligned texts of the EP reports are only available here.
- Press release collection (€20): press releases issued by the Commission, the Court of Justice, the European Economic and Social Committee, the Committee of the Regions, the Court of Auditors, the Data Protection Supervisor, the EIB, the European Ombudsman and OLAF. This corpus contains short journalistic texts aimed at the general public on a wide variety of EU-related topics. The language is less legalistic than the language of the CELEX collection. To the best of my knowledge, this data is not available from any other source (except for the EU press release website, where monolingual documents can be accessed one by one).
- Single documents can also be purchased. If you need one particular directive, regulation etc. in two, three or twenty-three languages, you can order it on its own.
- Special arrangements are also possible. If you need TMs made from all the EU documents that contain the term "agricultural subsidy" or the Greek transcripts of everything Daniel Cohn-Bendit ever said in the European Parliament, contact me and I will see what I can do.
Subject areas of the CELEX collection:
- 01. General, financial and institutional matters (shortened name: General_financial_institutional)
- 02. Customs Union and free movement of goods (shortened name: Customs)
- 03. Agriculture (shortened name: Agriculture)
- 04. Fisheries (shortened name: Fishery)
- 05. Freedom of movement for workers and social policy (shortened name: Social_policy)
- 06. Right of establishment and freedom to provide services (shortened name: Right_of_establishment)
- 07. Transport policy (shortened name: Transport)
- 08. Competition policy (shortened name: Competition)
- 09. Taxation (shortened name: Taxation)
- 10. Economic and monetary policy and free movement of capital (shortened name: Economic_policy)
- 11. External relations (shortened name: External_relations)
- 12. Energy (shortened name: Energy)
- 13. Industrial policy and internal market (shortened name: Industry_internal_market)
- 14. Regional policy and coordination of structural instruments (shortened name: Regional_policy)
- 15. Environment, consumers and health protection (shortened name: Environment_consumers_health)
- 16. Science, information, education and culture (shortened name: Science_education_culture)
- 17. Law relating to undertakings (shortened name: Companies)
- 18. Common Foreign and Security Policy (shortened name: Foreign_security_policy)
- 19. Area of freedom, security and justice (shortened name: Freedom_security_justice)
- 20. People's Europe (shortened name: Peoples_Europe)
- 21. Documents of the European Court of Justice (shortened name: Court)
- 22. Other/unclassified (shortened name: Other)
Note that many documents belong to more than one subject area (e.g. agreements on the customs tariffs of agricultural products may belong to three subject areas: "Agriculture", "Customs Union and free movement of goods" and "External relations").
The texts were autoaligned with very limited manual checking. (The alignments were done using custom software with Hunalign as the alignment engine and using extensive dictionary data to improve alignment quality.) It's just not feasible to manually review this much text (well in excess of 100 million sentences in all languages combined), so only spot checks were done. Some errors in the texts or in the alignment are unavoidable, but they only affect a small percentage of segments - scroll down for information on samples and see for yourself.
Manually reviewed and corrected alignments of individual documents or document sets can be ordered at the prices listed on the Alignment page.
The following segments are filtered out of bilingual TMX files: duplicates (including segments where only numbers differ), segments where the text is the same in both languages (these are usually proper names or text that was left untranslated), segments where there is text in one language only, and segments where one language only contains non-word characters (numbers, whitespace, punctuation etc.). The following are filtered out of multilingual TMX files: duplicates, segments where the text is the same in all languages, and segments that only contain non-word characters (numbers, whitespace, punctuation etc.). This aggressive filtering halves the number of segments, making the TMs much more manageable in size without removing anything of significant value. (Unfiltered files are provided in tab delimited TXT format, and custom filterings are available on request.)
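To make the bilingual filtering rules above concrete, here is a minimal Python sketch of them. It is not the code used to build the TMs. It is one possible reading of the rules: "only numbers differ" is approximated by replacing every run of digits with a placeholder before comparing, and "non-word characters" is taken to mean anything other than a letter. The function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

_DIGITS = re.compile(r"\d+")
_LETTER = re.compile(r"[^\W\d_]")  # matches any Unicode letter


def _has_words(text: str) -> bool:
    """True if the text contains at least one letter."""
    return bool(_LETTER.search(text))


def _dedup_key(text: str) -> str:
    """Collapse digit runs so segments differing only in numbers compare equal."""
    return _DIGITS.sub("0", text)


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the segment pairs that survive the bilingual filters."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # text in one language only
        if src == tgt:
            continue  # same text in both languages (names, untranslated text)
        if not _has_words(src) or not _has_words(tgt):
            continue  # one side has only numbers, whitespace, punctuation
        key = (_dedup_key(src), _dedup_key(tgt))
        if key in seen:
            continue  # duplicate, possibly differing only in numbers
        seen.add(key)
        yield src, tgt
```

Applied to the unfiltered tab delimited files, a function like this is a starting point for a custom filtering: relax or tighten individual rules by removing or editing the corresponding check.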
How to handle the size
These datasets are very large, containing up to 15 million sentence pairs in any given language combination (see the Statistics section below). As a rule, only tools that use some kind of indexed database format (Trados, MemoQ etc.) can handle the entire dataset at the same time. Tools that load everything into memory (ApSIC Xbench etc.) can only handle a smaller subset of the data. How far you can push things depends on your hardware configuration (a fast SSD and plenty of RAM help a lot), the efficiency of the tool you're using and how long you are willing to wait for search results.
As a general rule, TM imports and lookups tend to be faster and more reliable if you use several somewhat smaller TMs instead of a single gigantic one. This is part of the reason why separate TMX files are provided in each of the 22 subject areas of the CELEX collection. Furthermore, larger TMs are shipped in chunks of 500,000 segments. (E.g. the English-French Agriculture TM contains more than one million segments, so it is shipped in 3 TMX files.) The recommended procedure is to import each TMX file into a separate TM and always use whichever TM or set of TMs is most appropriate for the translation project at hand.
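The chunking arithmetic can be sketched in a few lines of Python. This assumes that every file except possibly the last holds exactly 500,000 segments, which is the simplest reading of the description above, not a confirmed detail. The 1,200,000 figure in the usage note is a made-up example, not the real size of any TM.

```python
from typing import List, Tuple

CHUNK_SIZE = 500_000  # segments per TMX file, as stated above


def tmx_file_count(segments: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of TMX files needed for a TM with `segments` segments."""
    if segments < 0:
        raise ValueError("segments must not be negative")
    # Integer ceiling division: a partly filled last chunk still needs a file.
    return (segments + chunk_size - 1) // chunk_size


def chunk_bounds(segments: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Half-open (start, end) segment ranges, one per TMX file."""
    return [
        (start, min(start + chunk_size, segments))
        for start in range(0, segments, chunk_size)
    ]
```

For a hypothetical TM of 1,200,000 segments, `tmx_file_count(1_200_000)` gives 3 and `chunk_bounds(1_200_000)` gives two full chunks followed by one of 200,000 segments. Any size from 1,000,001 to 1,500,000 segments results in 3 files.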
If you don't use a CAT tool such as Trados, MemoQ or Wordfast, you can still benefit from this dataset. There are a number of free and paid lookup tools that can help you leverage TMs. TMLookup, my custom free and open source lookup tool, can search the whole dataset, and it can even handle multilingual databases. On request, I can send you a copy of TMLookup with your database already imported and ready to use. You don't need to install, set up or import anything: just launch the program and type in your first search. (TMLookup is provided 'as is', without any warranty or formal support agreement.) ApSIC Xbench is another option, although it can only handle a subset of the data at a time. New versions of Xbench are paid, but you can still download the older free version from ApSIC.
Statistics
The CELEX corpus contains more than 1.8 million documents in all languages combined. There are about 140,000 documents (about 16 million sentences) in each of the languages of the old member states (English, German, French, Spanish, Italian, Dutch, Danish and Portuguese), about 100,000 documents (about 13 million sentences) each in Finnish and Swedish, and 30,000 to 50,000 documents (6 to 8 million sentences) in each of the languages of the newer member states (Polish, Hungarian etc.). There are about 800 documents in Irish and none in Croatian at the moment.
In the EP corpus, there are about 2.7 million sentences in the languages of old member states and about 2 million sentences in the languages of the EU10 countries. The press release corpus contains about 1 million sentences in English, German and French, and significantly fewer in other languages.
This xls file contains detailed statistics on the number of documents of all types in the CELEX and EP collections.
Data on the number of sentence pairs in a specific language combination before and after filtering is available on request.
A sample is provided so that you can assess the quality of the autoalignment, have a look at the available file formats and make sure that your CAT tool can import my TMX files correctly.
The sample includes tab-delimited txt, xls and tmx versions of the same document in 22 languages (all official languages of the EU except Irish and Croatian). You can import the TMX into a TM in any language combination as normal (the import might be slower than usual due to the large number of languages in the TMX).
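If you would rather pull a single language pair out of the multilingual TMX yourself, for example to check what your CAT tool should end up with after the import, the following Python sketch shows one way to do it with the standard library. It relies only on the general TMX structure (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child). How the language codes are written in these particular files (e.g. `EN-GB` or `en`) is an assumption, so the sketch compares only the part before the hyphen, ignoring case. `extract_pair` is a name made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pair(
    tmx_path: str, src_lang: str, tgt_lang: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) texts for one language pair from a TMX file."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, so large TMX files need little memory.
    for _event, elem in ET.iterparse(tmx_path, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps the text around any inline markup.
                texts[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if texts.get(src_lang) and texts.get(tgt_lang):
            yield texts[src_lang], texts[tgt_lang]
        elem.clear()  # free the translation unit that was just processed
```

Translation units that lack either of the two languages are skipped, which mirrors what a CAT tool does when importing a multilingual TMX into a bilingual TM.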
Randomly selected samples taken from across larger document sets in your specific language combination are available on request.
- CELEX collection in two languages: €75 (add €25 for each additional language)
- EP collection in two languages: €30 (add €10 for each additional language)
- Press release collection in two languages: €20 (add €10 for each additional language)
- Up to five specific documents in two languages: €15; 25 documents: €50
- Multilingual collections are available at reduced rates, and custom datasets (including TMs made from documents that contain the terms specified by you) are available on special request.
Note: the above pricing applies to freelance translators and other individual users. Corporate users (translation agencies, research groups, companies developing MT systems etc.) should contact me for corporate pricing.
Copyright of the source texts: © European Parliament and © European Union, http://eur-lex.europa.eu/
Only European Union legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.
Copyright of the alignments: © FarkasTranslations.com
Translation memories and other resources are sold for use by the purchaser alone. Any unauthorized dissemination is prohibited. Unique identifiers are hidden in each file, and the purchaser is liable for any damages caused by unauthorized dissemination.