Glossaries

This is a collection of glossaries, mostly compiled by automated means from large public datasets. They were created with translators' needs in mind: they are provided in a variety of formats suitable for importing into CAT tools, making access to the data as convenient as possible. If you don't use a CAT tool, you can search them in Excel or TMLookup. Custom formats are available on request.
If you have a large amount of bilingual or multilingual text (public or confidential) and you would like me to extract terminological data from it to create specialized termbases/glossaries like these, get in touch.


EU legislation dictionary - NOW AVAILABLE!

This is a dictionary prepared from the "Definitions" sections of the many documents that make up EU law. As such, this is a miscellaneous collection of about 12,000 terms as they are used in practice in EU regulations, directives etc. The dictionary includes terms in a variety of languages and their definitions in English. After automated collection, the data went through a full manual review, which means it is highly reliable (the occasional error notwithstanding). However, this dataset is not a curated dictionary in the traditional sense; many terms mean different things (and thus have different translations) in different contexts, and only one of them - often one associated with a rather specific context - is listed in this dictionary. As with any terminology resource, you are advised to use your own professional judgement when translating a specific term in a specific context.
The dictionary currently covers English, French, German, Spanish, Italian, Portuguese, Dutch and Hungarian. Other EU languages are available on request. The number of entries depends on the language pair requested.
A sample is available here (xls). The sample file is an English-French-German dictionary, containing the terms that start with the letter A. Make sure to have a good look at the sample file to get an idea of what the data is like.
Formats: tab delimited txt, xls and tmx. Other formats (tbx, xml etc.) available on request. I recommend using this data as a termbase, not a TM (i.e. import it into MultiTerm or the terminology module of your CAT of choice).
Price: EUR 35 for a bilingual glossary, plus EUR 15 for each additional language.
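If you don't use a CAT tool, the tab-delimited txt version can also be searched with a few lines of script. The following is only a minimal sketch: the column order (source term, target term, English definition) and the two sample rows are assumptions made for illustration, so check the actual layout against the sample file first.

```python
import csv
import io

def load_glossary(lines, source_col=0, target_col=1):
    """Build a case-insensitive source -> [targets] lookup from tab-delimited rows.

    The column positions are assumptions; adjust them to the real file layout.
    """
    lookup = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows that are too short to hold both columns
        if len(row) <= max(source_col, target_col):
            continue
        source = row[source_col].strip()
        target = row[target_col].strip()
        if source and target:
            # A term may have several translations, so collect them in a list
            lookup.setdefault(source.casefold(), []).append(target)
    return lookup

# Illustrative rows only (not taken from the dictionary itself)
sample = io.StringIO(
    "manufacturer\tfabricant\t(definition text)\n"
    "placing on the market\tmise sur le marché\t(definition text)\n"
)
glossary = load_glossary(sample)
print(glossary.get("Manufacturer".casefold()))
```

To read a real file instead of the inline sample, pass an open file object, e.g. `open(path, encoding="utf-8", newline="")`, where `path` is wherever you saved the txt file.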


EU acronym collection - NOW AVAILABLE!

Like the glossary, the acronyms were harvested from aligned document sets using custom software tools made for this purpose, with a limited amount of manual correction. The acronym collection is available as a bilingual or multilingual glossary. In bilingual versions, each entry contains four fields: the acronym in language 1, the full expression in language 1, the acronym in language 2 and the full expression in language 2. E.g. ETO / European telecommunications office / BET / Bureau européen des télécommunications. In some entries, certain fields (full expression in languages other than English) are empty. In most cases, this is because the language in question uses the English acronym, so the letters of the acronym don't match the full form, which prevents automated recognition. Every English acronym is listed along with the corresponding full English expression, and detailed statistics on other languages are available on request. Due to the method of collection, some acronyms are listed multiple times with slight differences in the full form. Some typos made by the people publishing EU legislation are also present.
The acronym collection covers a vast range of areas, with entries ranging from the British Aluminium Foil Rollers Association (BAFRA) to the International Plant Protection Convention (IPPC, French: CIPV, Convention internationale pour la protection des végétaux). There are about 8,000 entries in all, with the potential to save you untold hours of tedious research. The acronym collection covers all EU languages except Croatian and Irish. The number of entries depends on the language pair requested.
A sample is available here (xls). The sample file contains the English, French and German versions of all the acronyms that start with the letter A. Make sure to have a good look at the sample file to get an idea of what the data is like.
Formats: tab delimited txt, xls and tmx. Other formats (tbx, xml etc.) available on request. I recommend using this data as a termbase, not a TM (i.e. import it into MultiTerm or the terminology module of your CAT of choice). If your terminology software can't handle synonyms (e.g. two English columns and two French columns in the same table), let me know and I will create a special two-column version that allows both the acronyms and the full forms to be all imported into the same database.
Price: EUR 25 for a bilingual glossary, plus EUR 10 for each additional language.
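To give an idea of what a two-column version might look like, here is a rough sketch that flattens four-field entries into plain source/target pairs. The pairing rule (acronym with acronym, full form with full form, empty fields skipped) is an assumption for illustration and not necessarily how the special version is built; the second sample row is hypothetical.

```python
def to_two_column(rows):
    """Flatten (acronym1, full1, acronym2, full2) rows into (source, target) pairs."""
    pairs = []
    for acr1, full1, acr2, full2 in rows:
        # Pair acronym with acronym, as long as both are present
        if acr1 and acr2:
            pairs.append((acr1, acr2))
        # Pair full form with full form; skipped when either field is empty
        if full1 and full2:
            pairs.append((full1, full2))
    return pairs

rows = [
    ("ETO", "European telecommunications office",
     "BET", "Bureau européen des télécommunications"),
    # Hypothetical entry whose language-2 full form is empty
    ("XYZ", "Example full form", "XYZ", ""),
]

pairs = to_two_column(rows)
for source, target in pairs:
    print(source + "\t" + target)
```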


CELEX title database

The CELEX title database contains the titles of the documents in the CELEX collection in all languages. The collection includes the titles of directives, regulations, communications and so on. Court documents are not in this database as their titles tend to be along the lines of "Decision issued 6 January 2003". This resource can be used as a TM, but perhaps it's even more useful as a termbase. If you translate texts that cite or mention EU regulations, directives or other EU documents, having the names of (almost) all legislative acts at your fingertips can save a lot of research.
The database contains 98,000 titles in the languages of old member states (English, German, French etc.). There are fewer entries in other languages; detailed statistics are available on request. The list was compiled by automated means, with partial review/correction. It is not perfect (some titles contain extraneous material at the end, such as "Only the English version is authentic", and a handful of them are incomplete). By my estimate, these errors affect about 1% of entries.
A random sample can be downloaded from this link. The sample file contains the title of every 1000th document in English, French and German, in txt, tmx and xls formats.
The CELEX title collection is provided in tab-separated txt and tmx formats and costs EUR 20 for a language pair, plus EUR 5 for each additional language.
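If the extraneous trailing notes mentioned above get in your way, they can be stripped before import. This is only a sketch under assumptions: the list of notes contains just the one example quoted above (extend it with whatever you find in your language pair), and the sample titles are made up.

```python
import re

# Known trailing notes; only the first comes from the description above
TRAILING_NOTES = ["Only the English version is authentic"]

# Match a note at the very end of a title, with optional brackets and punctuation
_NOTE_AT_END = re.compile(
    r"[\s(\[]*(?:" + "|".join(re.escape(n) for n in TRAILING_NOTES) + r")[\s)\].]*$",
    re.IGNORECASE,
)

def clean_title(title):
    """Remove a known trailing note from a title; other titles pass through unchanged."""
    return _NOTE_AT_END.sub("", title).rstrip()

# Made-up titles for illustration
print(clean_title("Commission Decision on an example matter (Only the English version is authentic)"))
print(clean_title("Council Directive on an example matter"))
```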


Wikipedia glossary

This glossary was generated from Wikipedia's interlanguage links. It is made available free of charge under Wikipedia's licensing terms (co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License and the GNU Free Documentation License).
Due to the nature of Wikipedia, the glossary contains a lot of proper names (people, places, organisations etc.), scientific and technological terms (theories, phenomena, plant and animal species) and generally everything that tends to have its own article on Wikipedia.
The glossary is rather massive: there are 400,000 term pairs in the English-French language pair - and that is after filtering out all the entries that are the same in both languages (mostly proper names).
Currently available languages: English, French, German, Afrikaans, Czech, Spanish, Hungarian, Italian, Latvian, Dutch, Russian, Ukrainian. If you would like me to add your language, PayPal me a donation of 10 to 50 EUR to the email address at the bottom of the page and I will add it.
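The filtering step mentioned above (dropping entries that are the same in both languages) can be sketched as follows. The exact comparison used for the published glossary is not documented here; this version assumes a plain string comparison after trimming whitespace, and the sample pairs are illustrative.

```python
def drop_identical(pairs):
    """Keep only term pairs whose source and target differ after trimming whitespace."""
    return [(s, t) for s, t in pairs if s.strip() != t.strip()]

# Illustrative English-French pairs
pairs = [
    ("Paris", "Paris"),                      # identical in both languages, dropped
    ("London", "Londres"),
    ("Photosynthesis", "Photosynthèse"),
]

filtered = drop_identical(pairs)
for source, target in filtered:
    print(source + "\t" + target)
```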

Wikipedia glossary download link