Where have I been?

April 17th, 2016

It has been quite a while since the last post on this blog – and there is a very good reason for it. In September 2015 I completed my two-year Master of Science degree in Natural Language Processing (with Computer Science) at the University of Munich (LMU) / Centre of Information and Language Processing. This was done as a side project, next to work and family. So I have been busy, and instead of writing blog posts, my time went into projects such as these:

  • Automated text classification with neural networks using the Enron email corpus (details); technologies and methods: Java, SNIPE (NN library), language modeling, supervised learning with cross-validation
  • Model-theoretic semantics; technologies and methods: Prolog, Montague grammar, generation of syntax trees and transformation into discourse semantics (DRT)
  • Implementation of various algorithms; technologies and methods: Java, finite state machines/automata, tries, Hidden Markov models/Viterbi algorithm etc.
  • Functional programming; technologies and methods: Scala, Akka, Play web framework
  • Building a fulltext search engine server from scratch, based on inverted positional index; technologies and methods: Java, Jetty web server, REST API
  • Discourse analysis of violent conflicts; technologies and methods: Python, SKLearn, supervised learning with SVMs, MaxEnt and Naive Bayes, feature selection with Chi Square and Mutual Information, newspaper data from LexisNexis
  • Deterministic machine translation of Old English; technologies and methods: Grammatical Framework
  • Using Solr for fulltext indexing in a digital humanities project (Wittenstein corpus/Wittfind); technologies and methods: Solr, Lucene, XML-TEI, XSLT
  • Word sense disambiguation in Old English; technologies and methods: Maven machine learning framework, Java, supervised learning. Published together with Alexander Fraser and Paul Sander Langeslag in the proceedings of GSCL 2015 under the title “God Wat Þæt Ic Eom God – An Exploratory Investigation Into Word Sense Disambiguation in Old English” (PDF, BIB)
  • Master thesis: Argumentation mining and automated discourse analysis; technologies and methods: UIMA, DKPro, Java, Maven, supervised learning, extensive feature engineering.
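To give a flavour of the algorithms mentioned in the third item, here is a minimal Viterbi decoder sketch for a discrete two-state HMM. The states, observations and probabilities are invented for the example (they follow the classic rainy/sunny textbook setup) and are not taken from any of the projects above:

```java
// Minimal Viterbi decoder for a discrete HMM, working in log space
// to avoid numerical underflow on longer observation sequences.
public class Viterbi {
    public static int[] decode(double[] start, double[][] trans,
                               double[][] emit, int[] obs) {
        int n = start.length, t = obs.length;
        double[][] v = new double[t][n]; // best log-prob of any path ending in state s
        int[][] back = new int[t][n];    // backpointers for path recovery
        for (int s = 0; s < n; s++)
            v[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        for (int i = 1; i < t; i++) {
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < n; p++) {
                    double score = v[i - 1][p] + Math.log(trans[p][s]);
                    if (score > best) { best = score; arg = p; }
                }
                v[i][s] = best + Math.log(emit[s][obs[i]]);
                back[i][s] = arg;
            }
        }
        // Pick the best final state and trace the backpointers.
        int[] path = new int[t];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < n; s++)
            if (v[t - 1][s] > best) { best = v[t - 1][s]; path[t - 1] = s; }
        for (int i = t - 1; i > 0; i--)
            path[i - 1] = back[i][path[i]];
        return path;
    }

    public static void main(String[] args) {
        // Two hidden states (0 = Rainy, 1 = Sunny), three observation symbols
        // (0 = walk, 1 = shop, 2 = clean).
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.1, 0.4, 0.5}, {0.6, 0.3, 0.1}};
        int[] path = decode(start, trans, emit, new int[]{0, 1, 2});
        System.out.println(java.util.Arrays.toString(path)); // [1, 0, 0]
    }
}
```

In a POS-tagging setting the states would be tags and the observations words, but the decoding step is exactly the same.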




The true power of open-source software

November 17th, 2013

Sometimes you need to be truly stuck in order to appreciate the capabilities of your vehicle. On a rough gravel bed, you will quickly find out how good the suspension and how sturdy the wheels of your bike really are. And likewise, when working on an urgent software development project, you will find out about the true power of open-source software when running into a library compatibility issue.

Let me explain, expand, expound and exposit:

Recently, I was working on adding SRX-based segmentation capabilities to an XLIFF converter. Rather than trying to reinvent the wheel, I decided to use the fantastic Okapi localization framework, which provides an SRX segmenter class (amongst many, many other useful things). It was easy enough to set up: simply include the Okapi library jar in your Eclipse project, read through the excellent developer guide and off you go. Until that strange bug hits you when trying to actually read in the first SRX file:

“ClassCastException: Cannot cast java.util.ArrayList (id=77) to org.w3c.dom.NodeList”.

“Very interesting”, you think to yourself, keeping the non-quotable exclamations well restrained inside your little developer brain. And then – the true power of open-source kicks in: Instead of trying to get hold of some tech support person and – if lucky – waiting for an updated release, I simply include the actual Okapi source code in my project. I can inspect the code, identify the problematic line, and modify it to make my own code work again.

In the end, it turned out that the XML parser library we were using was a bit out of date and that the return type of an XPath evaluation method was an ArrayList instead of the expected NodeList. Having written to the Okapi mailing list, I quickly got assistance from the ever so helpful Yves Savourel. In the meantime, however, I was able to continue with my project after modifying the Okapi code – thanks to the openness of open-source software.
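For the technically curious, the standard JAXP pattern at the heart of the issue looks roughly like this. The stand-in document and query are invented for illustration (the real code lives in Okapi’s SRX segmenter and deals with far more complex files):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {
    // Run an XPath query over an XML string and return the number of hits.
    public static int count(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Requesting NODESET makes the expected return type explicit; with a
        // mismatched XPath implementation on the classpath, the runtime type
        // can still differ, which is exactly what the ClassCastException showed.
        NodeList hits = (NodeList) xpath.evaluate(expr, doc,
                XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // A tiny stand-in document; real SRX files are far more complex.
        String xml = "<rules><rule id=\"a\"/><rule id=\"b\"/></rules>";
        System.out.println(count(xml, "//rule")); // 2
    }
}
```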

Sweet, isn’t it? Imagine what the situation might have been with a proprietary library…


Integrating OmegaT with MyMemory

August 29th, 2013

MyMemory (created by the Italian company Translated.net) provides an open translation memory (TM) server that can be accessed for free by everyone. There are some limitations on the number of queries a user can run, but for the average freelance translator it should be sufficient. In addition to manual searches, MyMemory offers an open API that can be used to retrieve TM contents. These results are returned in either JSON or TMX format and they also include a machine-translation match from Google.

I have recently integrated the MyMemory TM server into OmegaT as a machine translation plugin. Initially, I went for the JSON format, which is easier to parse – but alas, I noticed too late that the JSON library I was using wasn’t compatible with OmegaT’s GPL v2 license (which has been updated to v3 in the meantime). So I went for TMX instead, which requires some XML parsing and XPath rules to extract the right contents.
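To give an idea of what the TMX parsing involves, here is a minimal sketch. The TMX snippet and the method name are invented for illustration and are much simpler than an actual MyMemory response:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TmxMatch {
    // Extract the first target segment for the given language from a TMX string.
    public static String firstTarget(String tmx, String targetLang) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(tmx)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Each <tu> holds one translation unit with one <tuv> per language.
        NodeList tuvs = (NodeList) xpath.evaluate("//tu/tuv", doc,
                XPathConstants.NODESET);
        for (int i = 0; i < tuvs.getLength(); i++) {
            Element tuv = (Element) tuvs.item(i);
            if (targetLang.equals(tuv.getAttribute("xml:lang"))) {
                return xpath.evaluate("seg", tuv); // text of the <seg> child
            }
        }
        return null; // no match for that language
    }

    public static void main(String[] args) throws Exception {
        // A made-up stand-in for an actual MyMemory TMX response.
        String tmx = "<tmx><body><tu>"
                + "<tuv xml:lang=\"en\"><seg>Good morning</seg></tuv>"
                + "<tuv xml:lang=\"de\"><seg>Guten Morgen</seg></tuv>"
                + "</tu></body></tmx>";
        System.out.println(firstTarget(tmx, "de")); // Guten Morgen
    }
}
```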

During testing we noticed that more often than not the results provided by the TM server were not that helpful. This is not too surprising, since retrieving meaningful matches this way is only possible if someone else has translated a very similar sentence for this particular language combination before and uploaded it to MyMemory. Therefore, I separated the plugin into an MT plugin and a TM plugin; both can be used independently. The TM contents could still be useful, of course, for instance in team projects. In any case, users should take care not to violate any copyrights or non-disclosure agreements when sending content to MyMemory. After all, anything you send to MyMemory will be available to the rest of the world.

The update is available in OmegaT v3.0.4 Update 2. Have fun playing around with the plugin and if you have any comments, send me a note: martin (at) wunderlich DOT com. Thank you.


HTML2TMX – A tool to grab bilingual content from the web and import into your CAT tool

April 16th, 2013

Recently, on the OmegaT mailing list a user was describing a problem that others might have faced before. Imagine you have come across a website that offers sentences in a bilingual table format, such as the search results provided by Linguee:


Or you might have some legacy content in an HTML file that you would like to use in your favourite CAT tool.

Well, after a bit of research, this problem turned into a little side project of mine and has led to a tool which I called “HTML2TMX” (please do let me know if you have a better name).

I have published the files for “HTML2TMX” here and here, and the source code (under LGPL) here.

The tool is written in Java, which means it runs on any platform (Mac, Linux, Windows…). It turns any HTML table into a TMX file in a two-step process:

- First, you need to tell the tool where to find the table; this can be a URL or a file on your local file system. HTML2TMX will then extract the header information from this table, so that you can select which column of the table maps to which language.

- Second, you run the tool again, this time providing the link to the table, the mapping information and the filename for the TMX file.
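The TMX-writing part of the second step can be sketched roughly as follows. This is not the actual HTML2TMX code, just a minimal illustration using the JDK’s built-in XML APIs, with invented row data and language codes:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TableToTmx {
    // Build a minimal TMX document from aligned segment pairs.
    public static String toTmx(String srcLang, String tgtLang,
                               String[][] pairs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element tmx = doc.createElement("tmx");
        tmx.setAttribute("version", "1.4");
        doc.appendChild(tmx);
        Element header = doc.createElement("header");
        header.setAttribute("srclang", srcLang);
        header.setAttribute("segtype", "sentence");
        tmx.appendChild(header);
        Element body = doc.createElement("body");
        tmx.appendChild(body);
        for (String[] pair : pairs) {
            // One translation unit per table row, one <tuv> per language.
            Element tu = doc.createElement("tu");
            String[] langs = {srcLang, tgtLang};
            for (int i = 0; i < 2; i++) {
                Element tuv = doc.createElement("tuv");
                tuv.setAttribute("xml:lang", langs[i]);
                Element seg = doc.createElement("seg");
                seg.setTextContent(pair[i]); // Transformer escapes markup for us
                tuv.appendChild(seg);
                tu.appendChild(tuv);
            }
            body.appendChild(tu);
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String[][] rows = {{"Good morning", "Guten Morgen"}};
        System.out.println(toTmx("en", "de", rows));
    }
}
```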

Once you have the TMX created, you just need to import it into your preferred CAT tool. At the moment, HTML2TMX has the limitation of providing a command-line interface only. The advantage is that you can use the tool in scripts of your own creation, but the downside is, of course, that many translators would probably prefer a user-friendly GUI. If there is enough interest (which you can express by sending me an email: martin AT wunderlich DOT com), I am more than happy to also create a GUI and perhaps even include this functionality as a new feature in OmegaT.



Visualizing recipes – a novel approach by Sascha Wahlbrink

October 28th, 2012

(this post is about _cooking_ recipes; not design patterns or stuff like that :)

I recently came across a cookbook by Sascha Wahlbrink that caught my attention due to its amazing new approach to recipes. Instead of the traditional textual approach with a simple list of steps to follow, it uses a diagram type that somewhat resembles UML activity diagrams. This new approach has a number of advantages:

  • You can see at a glance what utensils you will need during the preparation.
  • The complexity or simplicity of the recipe becomes obvious due to the diagram structure.
  • Stuff that is meant to happen in parallel in different places (or different pots and pans) is easily visible.
  • And finally, time gaps become obvious (e.g. you don’t want to see an instruction like “Now freeze for 4 hours in your freezer” just 10 minutes before your guests arrive).


I have given the approach a try by encoding a mildly complex pasta recipe found on the German “Chefkoch” community. The tool I used was “Dia”, an open-source alternative to M$ Visio. Here are the results (available under the Creative Commons license CC-BY-SA):


I have also tried creating a small library of the symbols in Dia, but this didn’t work, because Dia’s extension mechanism didn’t allow for dynamic text in custom shapes. Too bad. But you can easily copy and paste from the source file provided above to replicate this approach and create your own recipes.



WordPress commenting disabled, due to spam

October 9th, 2012

I have had to switch off the commenting function here, due to the incredible amount of commenting and trackback spam that I have been getting. If you would like to comment on any post, please send an email to martin ät wunderlich dot com. Thanks a lot.


Un-conferencing again – remember the date: Friday, 19th of Oct., at Localization World, Seattle

September 19th, 2012

OK, I am biased on this one. As one of the co-organisers of the first two European localization un-conferences in Dublin (in 2009 and 2010), I can only say: go to this event if you are in the area! It is a phenomenal opportunity to put those valuable coffee-break conversations centre stage. No sales talk, no PowerPoints, just straightforward, down-to-earth exchanges with your peers. Have fun and let me know how it went.


A new blog for localization geeks – espell labs blog

September 19th, 2012

The world of language is full of technology – at least so in the area of localization and business translation. And, consequently, it is a fantastic playground for the geekily inclined. Most translation service providers maintain a small zoo of CATs and larger ones even employ professional CAT herders. Internal processes need to be automated and re-engineered to stay on top of the competition in an era of grossly underpaid translation services and not-so-generous profit margins.
In come the language-loving nerds to save the world. And some even write blogs, such as the newly started espell Labs blog. Have a look here.


German GUI localization of OmegaT updated

September 13th, 2012

I have brought the German version of OmegaT up to date. A lot has happened since the last update of the German GUI, so there were 200+ new segments to translate (out of ca. 750 in total) – which by itself speaks for the pace of the tool’s development.

In case you haven’t heard of it, OmegaT is probably the most popular open-source CAT tool (or TEnT, for translation environment tool, as Jost Zetzsche calls them). It is a Java application and therefore runs on all platforms – Windows, Mac, Linux, etc. OmegaT has been designed with openness in mind and therefore supports many formats that other tools have been ignoring, such as PO or OpenOffice. Open standards also play a big role in the architecture (TMX, XLIFF, SRX…), and the TM created by the tool is directly accessible in TMX form in the project folder.

Part of this openness means that everyone can contribute their localised version of the GUI and the documentation. If you would also like to join the OmegaT translation team, have a look at the website and/or join the mailing list.


How to fix a cracked iPhone screen

August 5th, 2012

I recently dropped my iPhone in the street – really bad, really hard. The screen was cracked and looked somewhat like this:

The phone was still working, but with the cracked screen, it wasn’t in a satisfactory state. So, rather than buying a new iPhone (which would be a waste of money and resources, besides being very un-ecological), I looked around a bit and found some great instructions on how to replace the screen:
On Vimeo.
And on Ifixit.

On eBay you can easily find repair sets, either with tools or just the required glass.

Here are some lessons learnt:
- When buying the repair set, make sure to get the right one for your iPhone model. Some cable plugs differ, for instance, from the 3G to the 3GS.
- You don’t necessarily need to buy a set with tools, if you have a very small Phillips screwdriver at home.
- If you do buy the glass only, make sure it comes with the required double-sided adhesive tape, too. It was missing in my case and I used double-sided carpet tape instead. It works, but doesn’t hold the glass as tightly and the surface isn’t fully flush with the frame.

All in all, the repair didn’t take too long and it will take a lot less time the next time around. The cost was around 9 Euros for the glass – a lot less than a new iPhone. Plus you have the satisfaction of having fixed the iPhone yourself!