Preliminary Work Towards a Xitsonga Spellchecker

by Way of Reverse-Engineering the Tsonga Dictionary

Last updated on: June 19, 2016 by Musa Kurhula Baloyi

Contents of this article

  1. Introduction
  2. Xitsonga spelling rules
  3. Put me to the test
  4. Input data and results
  5. Bringing it all together
  6. Shortcomings and future works
  7. Further reading

Introduction

LibreOffice claims to have a Xitsonga spellchecker implemented for its office programs. If that were so, "dog" would have to be a Tsonga word. Being a Tsonga speaker, I know that this is not true. I expect "Dog" and "dog" to be underlined with a squiggly red line, since LibreOffice is set to Tsonga (as you can see at the bottom center in the image below). I therefore set out to implement a spellchecker myself, with the hope of incorporating it into any text editor that does not currently have this functionality. In my experience, no text editor or word processor, open-source or commercial, has it.


My most potent weapon in this task is my in-depth understanding of the Tsonga language. Through the school system, subsequent curiosity, and extensive reading of Xitsonga books of all kinds, I came up with some rules of thumb that let me tell instantly whether a word could possibly be Tsonga. If a word follows the correct spelling structure, I can then check whether it has a known meaning. I took these rules and wrote a Python function for each, as I show in the next sections.

Xitsonga spelling rules

  • Every word in Xitsonga ends in a vowel
  • There are certain alphabets that never follow each other
  • A consonant is never followed by a hyphen
  • Only the letter 'n' can be followed by an apostrophe
  • Xitsonga words never start with an apostrophe
  • Two vowels which are not the same never come after each other

Every word in Xitsonga ends in a vowel

This is perhaps one of the most obvious rules. There are, however, some Tsonga surnames (such as Lowan and Marolen) that are spelt without a vowel at the end. This is just a result of habit and preference. It is assumed that an "i" is appended at the end of these surnames, and in their daily use as words the "i" does appear. That said, our focus in this project at this stage is on words and not the names of people. Our code says: if you come across a word which does not end in a vowel, assume it cannot be Tsonga.
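The rule amounts to a one-line check. Below is a minimal sketch, assuming the five vowels a, e, i, o and u. The name `ends_in_vowel` is illustrative and not necessarily the one used in the project.

```python
VOWELS = frozenset("aeiou")  # assumed vowel set


def ends_in_vowel(word: str) -> bool:
    """Return True if the last character of `word` is a vowel."""
    # An empty string has no last character, so it is rejected outright.
    return bool(word) and word[-1].lower() in VOWELS


print(ends_in_vowel("n'anga"))  # a word quoted later in this article
print(ends_in_vowel("dog"))
```

With this check, surnames written as Lowan or Marolen are rejected, which matches the decision above to focus on words rather than names.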

There are certain alphabets that never follow each other

Two letters that follow one another must make a valid sound. For example, "ai" and "pl" do not make valid Tsonga sounds. They may appear in other languages, or in words borrowed from other languages, but never intrinsically in Xitsonga. Common valid sounds include "ch", "dz" and "mb".
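One way to sketch this rule is to slide a two-letter window over the word and look each pair up in a set of forbidden pairs. The set below holds only the two examples named above and the name `has_invalid_pair` is hypothetical; the real list of invalid combinations is much longer.

```python
# Only the two pairs mentioned in the text; the full list is an input file.
INVALID_PAIRS = frozenset({"ai", "pl"})


def has_invalid_pair(word: str, invalid_pairs=INVALID_PAIRS) -> bool:
    """Return True if any two adjacent characters form a forbidden pair."""
    lowered = word.lower()
    return any(
        lowered[i:i + 2] in invalid_pairs
        for i in range(len(lowered) - 1)
    )


print(has_invalid_pair("plata"))  # contains "pl"
print(has_invalid_pair("chela"))  # none of its pairs is in the set
```

Keeping the pairs in a set makes each lookup constant-time, so the cost grows only with the length of the word.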

A consonant is never followed by a hyphen

The reason for this is that hyphens are used to join two words (the result is called marito-nkatsano). It follows from the first rule that, since no Tsonga word can end in a consonant, a hyphen joining one word to another can never come directly after a consonant.
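A minimal sketch of this check looks at each hyphen and inspects the character just before it. The function name and the vowel set are assumptions made for illustration.

```python
VOWELS = frozenset("aeiou")  # assumed vowel set


def has_consonant_before_hyphen(word: str) -> bool:
    """Return True if any hyphen is directly preceded by a consonant."""
    lowered = word.lower()
    for previous, current in zip(lowered, lowered[1:]):
        # A consonant here means "a letter that is not a vowel".
        if current == "-" and previous.isalpha() and previous not in VOWELS:
            return True
    return False


print(has_consonant_before_hyphen("marito-nkatsano"))  # hyphen follows "o"
print(has_consonant_before_hyphen("marit-nkatsano"))   # hyphen follows "t"
```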

Only the letter 'n' can be followed by an apostrophe

"n'" is a particular sound found in words such as n'anga, n'eni, n'wheti, n'weti, and many others. There is no instance in which another letter appears in this form.

Xitsonga words never start with an apostrophe

This is self-explanatory. I would add that apostrophes are used where a word changes form. For example, n'anga above comes from munanga. The apostrophe marks the place where a syllable has been swallowed up so that the word can take its new form.
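The two apostrophe rules can be checked together, because a word that starts with an apostrophe has no "n" in front of it. The sketch below uses the hypothetical name `apostrophes_follow_n`.

```python
def apostrophes_follow_n(word: str) -> bool:
    """Return True if every apostrophe comes directly after the letter 'n'."""
    lowered = word.lower()
    return all(
        index > 0 and lowered[index - 1] == "n"
        for index, char in enumerate(lowered)
        if char == "'"
    )


for candidate in ("n'anga", "n'weti", "m'anga", "'anga"):
    print(candidate, apostrophes_follow_n(candidate))
```

A word without any apostrophe passes trivially, since there is nothing to check.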

Two vowels which are not the same never come after each other

The only vowels that may sit next to each other are identical ones. A repeated vowel is common in exclamations. Words such as dzwii, hatii, ponomonoo, etc., repeat the same vowel in order to emphasize the sound.
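This rule can be sketched by comparing each character with its neighbour and flagging two vowels that differ. The function name and vowel set are illustrative assumptions.

```python
VOWELS = frozenset("aeiou")  # assumed vowel set


def has_mixed_vowel_pair(word: str) -> bool:
    """Return True if two different vowels appear side by side."""
    lowered = word.lower()
    return any(
        first in VOWELS and second in VOWELS and first != second
        for first, second in zip(lowered, lowered[1:])
    )


for candidate in ("dzwii", "hatii", "ponomonoo", "ai"):
    print(candidate, has_mixed_vowel_pair(candidate))
```

The exclamations listed above pass because their doubled vowels are identical, while "ai" is flagged.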

Put me to the test

How will we know that our program works as intended? For that I implemented a few integration tests to ensure that all our functions do what they are supposed to do. Writing unit tests for every function did not seem easy or appealing, because I eliminated all intermediate results, in which we have no interest. These integration tests will break should any of the functions fail, and it will be clear which one has broken, since we can see how the output differs from our expectations.
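The project's actual tests are not reproduced here. As a rough sketch of the idea, the following uses Python's `unittest` module against a hypothetical stand-in, `is_possibly_xitsonga`, that applies only two of the rules; both the stand-in and the test class name are assumptions.

```python
import unittest

VOWELS = frozenset("aeiou")  # assumed vowel set


def is_possibly_xitsonga(word: str) -> bool:
    """Hypothetical stand-in for the full rule pipeline (two rules only)."""
    lowered = word.lower()
    if not lowered or lowered[-1] not in VOWELS:
        return False  # every word must end in a vowel
    if lowered.startswith("'"):
        return False  # words never start with an apostrophe
    return True


class SpellingRulesIntegrationTest(unittest.TestCase):
    def test_words_from_the_article_are_accepted(self):
        for word in ("n'anga", "n'weti", "dzwii"):
            with self.subTest(word=word):
                self.assertTrue(is_possibly_xitsonga(word))

    def test_non_tsonga_strings_are_rejected(self):
        for word in ("dog", "Dog", "'anga"):
            with self.subTest(word=word):
                self.assertFalse(is_possibly_xitsonga(word))


def run_tests() -> unittest.TestResult:
    """Load and run the test case, returning the result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(SpellingRulesIntegrationTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` prints a summary, and `subTest` reports the exact word that failed, which is what makes a broken rule easy to spot.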

Input data and results

I created a file containing all the invalid combinations of letters known to me. This file is used as input for the second rule in the above list. The end product is a set of files, one for each number of letters used to generate it. For example, xitsonga_dictionary_1.txt would contain just one-letter words, xitsonga_dictionary_2.txt would contain one-syllable words (mimpfumawulo), whereas xitsonga_dictionary_26.txt would contain all the letters, with or without special characters, depending on when in the sequence they were introduced.
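The generation step can be sketched with `itertools.product`. The sketch assumes that file number n is built from the first n letters of the alphabet and that candidates are capped at a maximum length. The cap, the helper names and the `is_valid` callback are illustrative and not taken from the project.

```python
from itertools import product
from pathlib import Path


def candidate_words(letters, max_length):
    """Yield every string of 1..max_length characters drawn from `letters`."""
    for length in range(1, max_length + 1):
        for combination in product(letters, repeat=length):
            yield "".join(combination)


def write_dictionaries(alphabet, out_dir, max_length, is_valid):
    """Write one file per alphabet prefix, keeping only valid candidates."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for count in range(1, len(alphabet) + 1):
        letters = alphabet[:count]  # the first `count` letters
        words = [w for w in candidate_words(letters, max_length) if is_valid(w)]
        path = out_dir / f"xitsonga_dictionary_{count}.txt"
        path.write_text("\n".join(words) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The number of candidates grows exponentially with the maximum length, which is why eliminating invalid strings as early as possible matters.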

Bringing it all together

I have a create_all_xitsonga_words function which takes in the alphabet and the CreateAllWords class reference. This function can call all the other functions in whatever order the developer prefers. For example, the developer may apply the "A consonant is never followed by a hyphen" rule before the "Every word in Xitsonga ends in a vowel" rule if they judge that this makes the program run faster. This particular program runs faster (and still correctly) when more eliminations are made early on, without the risk of removing valid words.
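The internals of create_all_xitsonga_words and CreateAllWords are not shown in this article, so the sketch below does not reproduce them. It only illustrates the ordering idea, using a hypothetical `apply_rules_in_order` helper and two simplified rule functions.

```python
VOWELS = frozenset("aeiou")  # assumed vowel set


def ends_in_vowel(word):
    return bool(word) and word[-1] in VOWELS


def no_consonant_before_hyphen(word):
    return not any(
        prev.isalpha() and prev not in VOWELS and cur == "-"
        for prev, cur in zip(word, word[1:])
    )


def apply_rules_in_order(candidates, rules):
    """Yield the candidates that pass every rule, checked in the given order."""
    for word in candidates:
        # all() stops at the first failing rule, so rules that eliminate
        # the most candidates are best placed first.
        if all(rule(word) for rule in rules):
            yield word


candidates = ["n'anga", "dog", "marit-nkatsano", "marito-nkatsano"]
rules = [ends_in_vowel, no_consonant_before_hyphen]
print(list(apply_rules_in_order(candidates, rules)))
```

Because every rule only removes candidates, reordering the list changes the running time but not the surviving words.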

One advantage of this program worth pointing out is that the alphabet is flexible. A developer can substitute another alphabet and the program will still work. Another advantage is that the rules can be extended, modified or removed depending on the language being implemented. Thus, even though this program is geared towards the Xitsonga language, with little effort a spellchecker can be developed for any African or other language.

Shortcomings and future works

I will need to use a comprehensive corpus which contains all the words available in the Xitsonga language to be able to eliminate the words generated through reverse engineering that follow the structure but are not valid Xitsonga words.
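Once such a corpus exists, the filtering step could be as simple as a set lookup. This is a sketch under that assumption: the corpus contents shown are placeholders and the function name is hypothetical.

```python
def keep_attested_words(candidates, corpus_words):
    """Return the candidates that also occur in the corpus, in input order."""
    attested = {word.lower() for word in corpus_words}
    return [word for word in candidates if word.lower() in attested]


# Placeholder data: a real corpus would hold every known Xitsonga word.
corpus = ["n'anga", "n'weti"]
generated = ["n'anga", "n'inga", "n'weti"]
print(keep_attested_words(generated, corpus))
```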

The Mozambican and Zimbabwean orthographies are still to be incorporated. The Mozambican orthography is greatly influenced by Portuguese, whereas the Zimbabwean orthography is greatly influenced by Shona. For example, "c-", "nh-" and "ss-" in the Mozambican Tsonga languages are "k-", "ny-" and "s-" in the South African standard. A glaring difference between the South African and Zimbabwean orthographies is the use of "ch-" in Zimbabwe in place of "x-" in South Africa.
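One possible way to incorporate these orthographies is to normalise a word to the South African standard before checking it. The sketch below reads the trailing hyphen in "c-", "nh-" and so on as "at the start of a word", which is an assumption. The mapping tables contain only the pairs listed above, and the input strings are illustrative rather than attested words.

```python
# Only the correspondences mentioned in the text.
MOZAMBICAN_TO_SA = {"ss": "s", "nh": "ny", "c": "k"}
ZIMBABWEAN_TO_SA = {"ch": "x"}


def normalise_prefix(word: str, mapping: dict) -> str:
    """Rewrite a word-initial spelling using `mapping`, longest match first."""
    lowered = word.lower()
    for source in sorted(mapping, key=len, reverse=True):
        if lowered.startswith(source):
            return mapping[source] + lowered[len(source):]
    return lowered  # no prefix matched, so the word is left as it is


print(normalise_prefix("nhaka", MOZAMBICAN_TO_SA))
print(normalise_prefix("chava", ZIMBABWEAN_TO_SA))
```

The source orthography has to be known in advance: applying the Mozambican table to a Zimbabwean spelling that starts with "ch" would wrongly rewrite its "c".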

Another hurdle still waiting to be conquered is spelling suggestions. After we've determined that a word cannot possibly exist in the Xitsonga language, we must suggest to the writer other, correctly spelt words that are close to the misspelt one.
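Suggestions are not implemented yet. As a starting point only, Python's standard `difflib.get_close_matches` can rank dictionary words by similarity to the misspelt word. This is a swapped-in technique, not the approach from the article in Further reading. The tiny dictionary below is a placeholder built from words quoted earlier.

```python
from difflib import get_close_matches


def suggest(word: str, dictionary, limit: int = 3):
    """Return up to `limit` dictionary words that resemble `word`."""
    # cutoff=0.6 is difflib's default similarity threshold (0.0 to 1.0).
    return get_close_matches(word.lower(), dictionary, n=limit, cutoff=0.6)


dictionary = ["n'weti", "n'anga", "dzwii"]  # placeholder word list
print(suggest("n'wetii", dictionary))
```

A dedicated edit-distance approach would probably give better results for a real word list, but this shows the shape of the interface: a misspelt word goes in, a short ranked list comes out.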

Further reading

  1. How to Write a Spelling Corrector by Peter Norvig