
Language Evolution in Humans

A 15-300 research project

Proposal

As far as researchers have observed, humans are unique in their ability to learn language and communicate orally. While several other species have vocal learning, a necessary trait for language, no other species demonstrates the ability to acquire speech with such complex meaning as humans. As language is essential to the human experience and an interesting trait in its own right, it is natural to be curious about how it has evolved. In addition, studying the evolution of language may provide answers to causes of language disorders ranging from dyslexia to communication impairments. To pursue this question, I will be working with Professor Andreas Pfenning and Dr. Morgan Wirthlin to investigate the genetic origins of language.

The general procedure for identifying the genetic changes behind a trait is to compare the genes of organisms with the trait against those of organisms without it. Unfortunately, as humans are unique in this ability to communicate, this comparison is harder to perform. However, due to a wealth of research in this area, there is, in addition to an abundance of data on the human genome, a large selection of high-quality primate genomes and genomes from our close, extinct cousins, such as Neanderthals and Denisovans. The latter two are particularly useful, as it has been conjectured that they had at least some rudimentary form of language. They may therefore represent milestones along the evolutionary path to language as expressed in humans.

Immediately, this gives us the ability to line up all the genomes we have and identify, at the level of individual bases, which changes may have occurred. One tactic, given that primates share a great deal of DNA with us, would be to collect all the alleles where other primates differ from us. This is a fairly large dataset, so we can refine it by considering only those genetic regions that are believed to be implicated in language. Then, with the Neanderthal and Denisovan genomes, we can prioritize alleles where we share a base with Neanderthals and Denisovans. For example, if at a specific location humans, Neanderthals, and Denisovans all have an A while other primates have a different base, this is likely significant.
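The comparison described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the genomes are taken to be already aligned, each site is represented as a dictionary from a species label to a single base, and the names is_candidate and candidate_positions are hypothetical helpers rather than part of any existing tool.

```python
from typing import Dict, List

# Assumed toy representation: one aligned site maps a species label to its base.
Site = Dict[str, str]

# Labels for the three hominin genomes (hypothetical keys chosen for this sketch).
HOMININS = ("human", "neanderthal", "denisovan")


def is_candidate(site: Site, primates: List[str]) -> bool:
    """Return True if all hominins share one base and no listed primate has it."""
    hominin_bases = {site[h] for h in HOMININS}
    if len(hominin_bases) != 1:
        # The hominins disagree with each other, so the site is not prioritized.
        return False
    shared = next(iter(hominin_bases))
    return all(site[p] != shared for p in primates)


def candidate_positions(sites: Dict[int, Site], primates: List[str]) -> List[int]:
    """Return the sorted positions whose sites pass the is_candidate test."""
    return sorted(pos for pos, site in sites.items() if is_candidate(site, primates))
```

A real pipeline would also have to handle missing or low-quality bases, which this sketch ignores.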

This is still quite noisy, so we can employ other techniques to improve the quality of our data. One area of consideration is that there is variance within humans, even in the genetic underpinnings of traits as universal as language. However, since language is so universal, we do not expect this variance to be significant. Accordingly, we are going to use the 1000 Genomes dataset to exclude all genetic variants that the project identified as varying among humans. Instead of comparing to all primates, we will infer the shared ancestral allele of humans, hominins like Neanderthals and Denisovans, and other primates, so that we can compare against that allele instead (presumably, our shared ancestors did not have language). This will allow us to make a stricter comparison. However, the methodology for this approach is more complicated. Since primate DNA is not likely to be perfectly aligned to human DNA, it will be necessary to adapt tools that can take primate DNA and port it to human coordinates so that the two can be compared. Then, we will need to use tools to infer the ancestral allele. Currently, we are planning on using Ortheus to do this.
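As a rough sketch of this stricter comparison, the filter below drops any locus known to vary among humans and keeps only loci where the human base differs from the inferred ancestral base. It assumes the ancestral alleles have already been inferred (for example by Ortheus) and loaded into a dictionary, and that the 1000 Genomes variant positions have been loaded into a set. The function name fixed_derived_sites and the data layout are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# (chromosome, position) in human coordinates; layout assumed for this sketch.
Locus = Tuple[str, int]

VALID_BASES = {"A", "C", "G", "T"}


def fixed_derived_sites(
    human: Dict[Locus, str],
    ancestral: Dict[Locus, str],
    polymorphic: Set[Locus],
) -> List[Locus]:
    """Return loci that do not vary in humans and differ from the ancestral base."""
    kept = []
    for locus, base in human.items():
        if locus in polymorphic:
            # Known human variation (e.g. listed in 1000 Genomes): excluded.
            continue
        anc = ancestral.get(locus)
        if anc is None or anc.upper() not in VALID_BASES:
            # No confident ancestral call at this locus: skipped.
            continue
        if base.upper() != anc.upper():
            kept.append(locus)
    return sorted(kept)
```

Skipping loci without a confident ancestral call is a conservative choice for this sketch; the actual analysis may treat such loci differently.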

Not all alleles that are different will prove significant. To evaluate the effect, we will run machine learning models on the identified changes which will predict the differences in gene expression. If there is likely a significant change, we will investigate further, potentially by causing that mutation in a test organism to see if the expression change really manifests.

This may be called the 100% goal. Since we do not know whether or not this methodology will yield anything of significance, the 75% goal is to simply have these alleles identified and compiled. The 125% goal is to refine this search. Perhaps instead of a hard boundary for whether to include an allele or not, we might have intervals of inclusion, allowing for alleles with more variance to also be considered, if not as strongly.
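One way the intervals of inclusion in the 125% goal might look is a weight that falls off gradually as an allele becomes more variable in humans. The sketch below assumes each allele comes with the frequency of the derived base in humans. The function name inclusion_weight, the thresholds full and zero, and the linear ramp between them are all illustrative assumptions, not values we have settled on.

```python
def inclusion_weight(derived_freq: float, full: float = 0.99, zero: float = 0.90) -> float:
    """Map a derived-allele frequency to a weight between 0 and 1.

    At or above `full` the allele counts fully, at or below `zero` it is
    dropped, and in between the weight rises linearly.
    """
    if not 0.0 <= derived_freq <= 1.0:
        raise ValueError("derived_freq must be between 0 and 1")
    if derived_freq >= full:
        return 1.0
    if derived_freq <= zero:
        return 0.0
    return (derived_freq - zero) / (full - zero)
```

Setting full and zero to the same value is not supported by this sketch; a hard boundary would simply be the original inclusion rule.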

Milestones:

1st technical milestone: By the last day of the semester, I would like to have compiled a list of interesting alleles, as discussed above.

Biweekly milestones:

January 27th – error check the alleles provided, add masking for quality

February 10th – identify some models to infer the expression change

February 24th – run models, take preliminary results

March 16th – based on the results, add in regions of the DNA or focus the analysis on certain portions

March 30th – identify a list of alleles to test

April 13th – test these alleles in the lab

April 27th – have an analysis of the results observed

A great deal of existing literature precedes this search, much of it compiled by my mentor. In particular, she has compiled datasets of language genes and high-quality primate genomes, which she has provided to me. These are taken from other papers, such as Auton et al., but there is otherwise little to say about them.

Our particular direction is influenced strongly by Professor Pfenning’s previous research. In a previous paper comparing vocal learning in songbirds and humans, it was found that there were convergent regulatory features in songbirds and humans compared to birds without vocal learning and a closely related primate (also without vocal learning). Due to this, we find regulatory regions particularly interesting for language learning and are exclusively looking for changes in those regions.

A study of a similar nature, recently published, was undertaken by Kuhlwilm and Boeckx. This paper compared Neanderthals and Denisovans with humans to identify and analyze changes, thus providing a summary of the genetic changes that set us apart from our close relatives. Many of their methods are relevant to us, and we intend to use their processed data to compare against ours for sanity checking.

These are the two papers with the greatest influence on the method design of this study. There are many other approaches we can take for this investigation, however, which we may explore over the course of the research.

We will need tools to manipulate genome data given in various common formats, as well as the tools mentioned above to map primate DNA to human coordinates and infer the ancestral allele. We have these tools, but still need to get them working on our data.