HTML2TMX – A tool to grab bilingual content from the web and import into your CAT tool

Recently, a user on the OmegaT mailing list described a problem that others may have faced before. Imagine you have come across a website that offers sentences in a bilingual table format, such as the search results provided by Linguee.


Or you might have some legacy content in an HTML file that you would like to use in your favourite CAT tool.

Well, after a bit of research, this problem turned into a little side project of mine and has led to a tool which I have called “HTML2TMX” (please do let me know if you have a better name).

I have published the files for “HTML2TMX” here
and here

and the source code (under LGPL) here:

The tool is written in Java, which means it runs on any platform (Mac, Linux, Windows…). It turns any HTML table into a TMX file in a two-step process:

- First, you need to tell the tool where to find the table; this can be a URL or a file on your local file system. HTML2TMX will then extract the header information from this table, so that you can select which column of the table maps to which language.

- Second, you run the tool again, this time providing the link to the table, the mapping information and the name of the TMX file to be written.
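
To give a rough idea of what these two steps involve, here is a minimal sketch in Java. It is not the actual HTML2TMX source code: the class name `TableToTmxSketch`, its methods and the simple regex-based parsing are all assumptions made for illustration, and the sketch assumes a single, non-nested table whose first row holds the column headers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the real HTML2TMX implementation.
public class TableToTmxSketch {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);

    // Step 1: read all rows; row 0 is assumed to hold the column headers.
    public static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Strip inline markup, then decode the four basic entities.
                String text = cellMatcher.group(1).replaceAll("<[^>]+>", "").trim();
                cells.add(decodeBasicEntities(text));
            }
            rows.add(cells);
        }
        return rows;
    }

    // Only a handful of entities are handled; "&amp;" must be decoded last.
    private static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
    }

    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Step 2: map two columns to languages and build a TMX 1.4 document.
    public static String toTmx(List<List<String>> rows, int srcCol, int tgtCol,
                               String srcLang, String tgtLang) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<tmx version=\"1.4\">\n");
        sb.append("  <header creationtool=\"TableToTmxSketch\" creationtoolversion=\"0.1\"")
          .append(" segtype=\"sentence\" o-tmf=\"html\" adminlang=\"en\"")
          .append(" srclang=\"").append(srcLang).append("\" datatype=\"plaintext\"/>\n");
        sb.append("  <body>\n");
        for (int i = 1; i < rows.size(); i++) { // skip the header row
            List<String> row = rows.get(i);
            if (row.size() <= Math.max(srcCol, tgtCol)) {
                continue; // ignore rows that lack one of the mapped columns
            }
            sb.append("    <tu>\n");
            sb.append("      <tuv xml:lang=\"").append(srcLang).append("\"><seg>")
              .append(escapeXml(row.get(srcCol))).append("</seg></tuv>\n");
            sb.append("      <tuv xml:lang=\"").append(tgtLang).append("\"><seg>")
              .append(escapeXml(row.get(tgtCol))).append("</seg></tuv>\n");
            sb.append("    </tu>\n");
        }
        sb.append("  </body>\n</tmx>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>English</th><th>German</th></tr>"
                + "<tr><td>Good morning</td><td>Guten Morgen</td></tr></table>";
        List<List<String>> rows = extractRows(html);
        System.out.println("Columns: " + rows.get(0));      // step 1: show the headers
        System.out.println(toTmx(rows, 0, 1, "en", "de"));  // step 2: write the TMX
    }
}
```

A real tool would more likely use a proper HTML parser than regular expressions, which break easily on nested tables or malformed markup; the sketch is only meant to show how the header row, the column mapping and the TMX output relate to each other.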

Once you have created the TMX file, you just need to import it into your preferred CAT tool. At the moment, HTML2TMX is limited to a command-line interface. The advantage is that you can use the tool in scripts of your own creation, but the downside is, of course, that many translators would probably prefer a user-friendly GUI. If there is enough interest (which you can express by sending me an email: martin AT wunderlich DOT com), I am more than happy to create a GUI as well and perhaps even include this functionality as a new feature in OmegaT.