Welcome! My name is Wessel Stoop. I work with language & machine learning, and make videogames in my free time. This website aims to be a collection of the software and prose I write, and what the media subsequently write about these. Furthermore, this website contains some pictures of me to suggest that I totally do not spend all of my waking hours staring at a computer screen.
W. Stoop, F. Kunneman, A. van den Bosch & B. Miller (2019). Detecting harassment in real-time as conversations develop.
We developed a machine-learning-based method to detect video game players who harass teammates or opponents in chat earlier in the conversation. In this paper we visualize what the classifier is doing at each point in the conversation, and show that the confidence threshold above which a player should be considered toxic should start very low at the beginning of a conversation and then slowly increase.
Proceedings of the 3rd Workshop on Abusive Language Online
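To give an idea of what a threshold that rises over the course of a conversation could look like, here is a minimal sketch. It is not the method from the paper: the linear ramp, the function names and all numbers (`start`, `end`, `ramp_messages`) are assumptions chosen purely for illustration.

```python
def threshold_at(message_index, start=0.2, end=0.8, ramp_messages=20):
    """Hypothetical schedule: raise the threshold linearly from `start`
    to `end` over the first `ramp_messages` messages, then keep it flat."""
    progress = min(message_index / ramp_messages, 1.0)
    return start + (end - start) * progress


def first_flag(confidences, **schedule):
    """Return the index of the first message at which the classifier's
    confidence exceeds the threshold for that point, or None if it never does."""
    for i, confidence in enumerate(confidences):
        if confidence > threshold_at(i, **schedule):
            return i
    return None


# Made-up classifier confidences for one player, one value per message.
confidences = [0.1, 0.3, 0.25, 0.9]
print(first_flag(confidences))
```

With these made-up numbers the player is flagged at the second message, whereas a fixed threshold of 0.8 would only trigger at the fourth.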
D. Foster, S. Aalberse & W. Stoop (2019). Examining Twitter as a source for address research using Colombian Spanish.
In Colombian Spanish, you choose between tu, usted and vos when you want to say you. This paper shows you can use Twitter to figure out which one is used when; we see clear regional preferences, but also social and emotional effects.
B. Kluge & M. I. Moyna (eds.): It’s not all about you. New perspectives on address research
Signbank is basically a sign dictionary with a huge set of extra tools for sign language researchers; this is a proxy paper for all the software development that has been done by me and colleagues. The exciting part of the Signbank project, which I'm part of, is its ambition to bring many datasets from different sign languages together in one database, allowing groundbreaking new research.
An important way to research dreams is to study written descriptions of dreams; huge collections of dream reports have been created to facilitate this approach. This analysis, however, has so far only been done by hand. In this paper, we show what various natural language processing tools can add to this research field.
W. Stoop, I. Hendrickx & T. van Ees (2017). PaperClip: automated dossier reorganizing.
This paper was produced at fintech company Davinci. One of Davinci's main products does information extraction for the financial sector, on individual documents. Unfortunately, customers often deliver documents in large files containing multiple documents, sometimes even with the pages shuffled. Our proposed solution PaperClip solves this by analyzing the content on the pages of such a dossier, trying to guess the document type and the page number, and returns single document files with pages in the correct order.
Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, pages 471-478
For digital language researchers, Twitter has been the main resource so far. Facebook, which has a much larger audience, is unfortunately not open for research. We describe a way to get access to Facebook language (or other data) by creating a so-called 'Facebook app' that greatly facilitates voluntary data donation.
In P. Sojka et al. (eds.), Text, Speech and Dialogue: 19th International Conference, TSD 2016, LNAI 9924 (pp. 249-258). Springer.
A linguistic paper about a theory of why the form 'du' has disappeared from Dutch, while almost all related languages still have it. We introduce a more computational approach to text style for this particular research question.
Journal of Pragmatics, 88, 190-201. doi: 10.1016/j.pragma.2015.07.003
We describe how you can improve text prediction for a person by training the predictor on his/her own written texts (an idea we call 'idiolect', taken from sociolinguistics). As an example, we predict tweets of a person by training on the earlier tweets of that same person. Interestingly, friends talk so much alike that you can also train on the tweets of friends and get even better predictions. Download the data.
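The idea of training on someone's own texts can be illustrated with a toy next-word predictor. This is only a sketch under simplifying assumptions, not the system from the paper: it uses plain bigram counts, and the function names and example tweets are made up.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for every word, which words follow it in the given texts."""
    model = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for previous, following in zip(words, words[1:]):
            model[previous][following] += 1
    return model


def predict_next(model, previous_word):
    """Return the most frequent follower of `previous_word`, or None if unseen."""
    followers = model.get(previous_word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up earlier tweets of one person; tweets of friends could simply be
# appended to this list to train on a larger, similar-sounding sample.
own_tweets = [
    "off to the gym again",
    "back from the gym",
    "the gym was packed",
]
model = train_bigrams(own_tweets)
print(predict_next(model, "the"))
```

Because this person always writes 'gym' after 'the', the model predicts 'gym'; a model trained on general text would most likely predict something else.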
Another paper about text prediction, this time focusing on how it can be used to help people with impaired language and/or speech: even if no linguistic material is available for a patient, a useful language model can be trained using written language by people close to this person.
Dutch Journal of Applied Linguistics, 3:2, pp. 136-153.
Saying Ik geloof niet dat hij komt (I don't believe that he will come) while meaning Ik geloof dat hij niet komt (I believe that he will not come) is a well-described phenomenon in linguistics called neg raising: the negation 'moves' from the subclause to the main clause. In this short paper, we describe that this can also happen with pragmatic markers in Dutch. For example, you can say Ik geloof toch dat hij komt (I believe anyway that he comes) while meaning Ik geloof dat hij toch komt (I believe that he comes anyway).
A linguistic paper about the Dutch construction Wessel die bouwt een website. It was previously believed that this construction was used to indicate contrast. I show that it is used in a contrastive environment in only a small part of the cases, and offer some alternative hypotheses.
In Dutch, the word zij (they) is slowly but steadily being replaced by hun (them), a phenomenon that has been described various times in the study of language change. This change is extremely controversial and despised by language purists, but this does not seem to halt its rise. In this paper, we describe a potential explanation for its popularity: unlike zij, hun can only refer to humans, which would make it more efficient in conversation.
Olvand is a small multiplayer role playing game. It encourages players to build towns together, and contains various minigames. At its peak, it had an active fanbase of around a hundred people, running servers and creating secondary material like tutorials, wikis and YouTube videos.
Tic Tac Team is a two-player puzzle game for the Apple iPad. It contains almost no language, which forces the two players to figure out the game mechanics together, and then create a small communication system to achieve their goals. Together with Jop van Heesch.
Vowel Space Travel is a simple tool that asks the player to identify differences between vowels in English, hoping that this will improve their language skills. My goal was to build an app that makes this repetitive task more attractive by placing it in a futuristic setting.
I'm a big proponent of explaining complicated things with interactive visualizations: being able to 'play with an idea' is not only a much more efficient way to understand something, I believe it will also make the understanding much deeper.
Taal voorspellen: a description of Soothsayer in Vaktaal, aimed at Neerlandicists around the world.
I've written a number of popular science pieces about linguistics in Dutch, aimed at students and researchers of linguistics. Highlights:
Waarom het woord "snotneus" geen toeval is: I show that in languages around the world the word for 'nose' contains a nasal sound (a sound that uses the nasal cavity) far more often than you would expect based on chance.
Hoe ik een familie-adjectief afpakte: my parents, speaking Brabantian dialect, refer to me differently than to my sister for unclear reasons. A linguistic theory for why this is the case.
Taaie reepjes: an anecdote about how low-educated workers subconsciously came up with an etymological theory for the word 'tie wrap'.
You are what you tweet is a webdemo displaying the power of language technology and machine learning: it imports all tweets of a particular Twitter timeline, and then performs text prediction and term profiling on it, as well as text classification. For that last task, it has language models for gender, age, aggression and sarcasm. Together with Florian Kunneman.
Stemming 2017 is a webdemo displaying research by Eric Sanders on whether you can predict elections based on language analysis on Twitter. With over 10,000 visitors, this is my most popular project to date. Unfortunately, the predictions were further off than we'd hoped (but still a lot better than chance). Together with Eric Sanders.
Soothsayer is my master's thesis project about text prediction. In the thesis, I showed that text prediction improves when (1) using language models based on text written by the user and (2) adding text written by friends of the user. In the demo, you can test Soothsayer with various language models. The thesis led to various publications, media attention, and the Radboud University 2013 thesis prize. I explain my project in the video below:
Robot Nao colleague guessing game
Robot Nao visited the department where I work. As a fun side project, I wrote a game for it where it asks questions that should be answered with yes or no. With this information, it can guess which colleague you had in mind (apologies for the vertical video ;) ):
Catmull-Clark subdivision is a smoothing algorithm and a basic tool in 3D modeling software. It was, however, not yet available for the Unity game engine. I created an implementation of it, which is available at the Unity Asset Store.
Fowlt is the English version of Valkuil, a context-sensitive spelling corrector using machine learning. It recognizes errors by comparing all incoming text to the many examples of correct text it has seen. If it finds something that is nearly identical, but not completely identical, to a frequent pattern, it marks it as an error. This way, it is able to mark errors where other spelling correctors typically fail, like the difference between to, two and too, or the difference between there, their and they're.
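To show the general idea of spotting a confusable word from its context, here is a small sketch. It is not how Fowlt itself is implemented: it swaps in simple trigram counts for the machine learning, and the confusable sets, function names and example sentences are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical sets of words that are easily confused with each other.
CONFUSABLES = [{"to", "two", "too"}, {"there", "their", "they're"}]


def count_trigrams(sentences):
    """Count every (left, word, right) triple in a list of correct sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for trigram in zip(words, words[1:], words[2:]):
            counts[trigram] += 1
    return counts


def suggest(counts, sentence):
    """Return (position, written word, suggested word) for every confusable
    word whose alternative was seen more often in the same context."""
    words = sentence.lower().split()
    suggestions = []
    for i in range(1, len(words) - 1):
        for group in CONFUSABLES:
            if words[i] not in group:
                continue
            left, right = words[i - 1], words[i + 1]
            best = max(sorted(group), key=lambda w: counts[(left, w, right)])
            if counts[(left, best, right)] > counts[(left, words[i], right)]:
                suggestions.append((i, words[i], best))
    return suggestions


# A tiny made-up collection of correct text.
correct_text = [
    "they lost their keys again",
    "i want to go home",
    "we saw two birds there",
]
counts = count_trigrams(correct_text)
print(suggest(counts, "they lost there keys"))
```

The sentence contains no misspelled word, so a dictionary-based checker would accept it, but the context 'lost ... keys' has only been seen with 'their', so 'there' is flagged.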