Inferring Phylogenetic Graphs for Natural Languages using MML

Languages, like everything around us, evolve and change over time. The aim of this report is to model the evolution that occurs between natural languages. We introduce the idea of inferring phylogenetic (or evolutionary) models for natural languages using the Minimum Message Length (MML) principle. Phylogenetic models show the evolutionary interrelationships among various species or other entities. MML is an inductive inference method that measures the goodness of a model, and we use it to infer phylogenetic graphs, including the mutation probabilities along their arcs, for artificial languages as well as for some European languages (English, French, Spanish and German). In a phylogenetic tree, each child node has exactly one parent node, so each child language can inherit from only one parent language. In the real world, such a situation is unlikely to occur. Hence, we extend phylogenetic trees to phylogenetic graphs, in which a child node can inherit features from more than one parent node, to model the fact that a language can be influenced by more than one other language. The first part of our modelling assumes only copy and change operations on characters and is based on words that have the same length in all the languages considered. The subsequent part uses string alignment techniques to model words of different lengths and allows copy, change, insert and delete operations on characters. All methods have been verified by testing them on artificial languages for which the evolutionary order is known; in these tests, the phylogenetic model inferred by MML reflects the correct evolutionary order.
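To make the copy/change idea concrete, the following is a minimal sketch of how the cost of transmitting a child word given a parent word of the same length might be measured in bits. It is an illustration under stated assumptions, not the encoding used in this report: the function names `word_cost_bits` and `best_parent`, the 26-letter alphabet, and the single change probability `p_change` per arc are all hypothetical. It covers only the data-given-model part of a two-part message; the cost of stating the graph structure and the probabilities themselves is omitted.

```python
import math


def word_cost_bits(parent, child, p_change, alphabet_size=26):
    """Bits needed to encode `child` given `parent` under a copy/change model.

    Assumptions (illustrative only): both words have the same length, each
    character is changed independently with probability `p_change`, and a
    changed character is uniform over the other `alphabet_size - 1` symbols.
    """
    if len(parent) != len(child):
        raise ValueError("the copy/change model needs words of equal length")
    if not 0.0 < p_change < 1.0:
        raise ValueError("p_change must lie strictly between 0 and 1")
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")

    # Cost of stating "this character was copied".
    copy_bits = -math.log2(1.0 - p_change)
    # Cost of stating "this character was changed" plus which symbol replaced it.
    change_bits = -math.log2(p_change) + math.log2(alphabet_size - 1)

    return sum(copy_bits if a == b else change_bits
               for a, b in zip(parent, child))


def best_parent(child, candidates, p_change):
    """Name of the candidate parent word that gives the shortest encoding."""
    return min(candidates,
               key=lambda name: word_cost_bits(candidates[name], child, p_change))
```

Under these assumptions, a shorter total message over a whole vocabulary indicates a more plausible parent language. A fuller treatment would also charge for the model itself, which is what lets MML trade a more complex graph against a better fit.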


