Anomalies in Language As a Biomarker for Schizophrenia

Janna N. de Boer; Sanne G. Brederoo; Alban E. Voppel; Iris E.C. Sommer


Curr Opin Psychiatry. 2020;33(3):212-218. 

Computational Language Analysis in Schizophrenia

Content Analyses: Meaning, Structure and Coherence

A frequently used method for examining meaning and coherence in language is the semantic space model. Semantic space models, of which latent semantic analysis (LSA)[57] is the most commonly used tool, aim to capture word meaning by representing words as so-called 'vectors' in a 'semantic space'. These vectors contain word features (i.e. aspects of word meaning); 'furry', 'pet' and 'purring' might be features attempting to grasp the meaning of 'cat'. The distance between words in a semantic space indicates word interrelatedness or coherence; the word 'furry' will be more closely related to 'pet' than to 'banana', by virtue of the concepts these words are taken to represent. A sentence with low internal coherence will consist of words reflecting concepts that lie relatively far apart. So-called distributed models such as Word2vec aim to capture both semantic and syntactic information.[58,59]
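
The distance idea can be illustrated with a minimal sketch. The feature values below are hand-made and purely hypothetical (real models such as LSA or Word2vec learn their vectors from large text corpora), and cosine similarity is assumed here as the relatedness measure.

```python
import math

# Hypothetical toy vectors; dimensions stand for the features
# (furry, pet, purring, edible). Values are invented for illustration only.
TOY_VECTORS = {
    "cat":    [0.9, 0.8, 0.9, 0.0],
    "furry":  [1.0, 0.6, 0.3, 0.0],
    "pet":    [0.7, 1.0, 0.4, 0.0],
    "banana": [0.0, 0.0, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sim_furry_pet = cosine_similarity(TOY_VECTORS["furry"], TOY_VECTORS["pet"])
sim_furry_banana = cosine_similarity(TOY_VECTORS["furry"], TOY_VECTORS["banana"])
print(f"furry~pet: {sim_furry_pet:.2f}, furry~banana: {sim_furry_banana:.2f}")
```

In this toy space 'furry' lies much closer to 'pet' than to 'banana', mirroring the example in the text.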

Spoken Language

The first to introduce semantic space models in schizophrenia were Elvevåg et al.,[60] who used LSA to show that schizophrenia patients could successfully be distinguished from healthy controls based solely on their spoken language output (achieving correct classification of patients and controls with an accuracy of 82.4%). Furthermore, this study showed that patients with formal thought disorder (FTD) could be distinguished from patients with low FTD scores (with an accuracy of 87.5%). LSA thus appears to be an accurate tool for detecting FTD. Notably, clinical raters achieved slightly lower classification scores (84%) than the LSA models. This research was later expanded on by classifying patients with schizophrenia and their healthy family members.[61] Using cross-validation, 85.7% of patients with schizophrenia could be correctly distinguished from their family members, indicating that LSA is sensitive to subtle phenomena, as patients are taken to resemble family members more than nonfamily controls.

In their seminal study, Bedi et al.[62] used LSA and two measures of language complexity [maximum phrase length and the use of determiners (e.g. that)] on spoken language samples to predict later psychosis onset in youths at CHR for psychosis. Combined, these language measures predicted psychosis development with 100% accuracy, outperforming clinical ratings (yielding an accuracy of 79%). However, in their sample of 34 CHR youths, only five transitioned to psychosis. This model was later adapted and validated in a larger sample and across cohorts.[63] Using decreased semantic coherence, greater variance in coherence and reduced use of possessive pronouns, 83% accuracy was achieved within the main cohort (79% across cohorts).
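
To make 'semantic coherence' and 'variance in coherence' concrete, the following is a minimal sketch of one possible operationalisation, not the procedure of the cited studies. It assumes that each sentence is represented by the average of its word vectors and that coherence is the cosine similarity between consecutive sentences; the two-dimensional vectors and example sentences are hypothetical.

```python
import math
from statistics import mean, pvariance


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(words, vectors):
    """Average the vectors of all known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def coherence_profile(sentences, vectors):
    """Mean, variance and minimum of similarity between consecutive sentences."""
    sent_vecs = [sentence_vector(s, vectors) for s in sentences]
    sent_vecs = [v for v in sent_vecs if v is not None]
    sims = [cosine(a, b) for a, b in zip(sent_vecs, sent_vecs[1:])]
    if not sims:
        return None
    return {"mean": mean(sims), "variance": pvariance(sims), "min": min(sims)}


# Hypothetical 2-D vectors: first axis ~ 'animals', second axis ~ 'finance'.
vectors = {
    "dog": [1.0, 0.0], "cat": [0.9, 0.1], "pet": [0.95, 0.05],
    "tax": [0.0, 1.0], "bank": [0.1, 0.9],
}
on_topic = [["dog", "pet"], ["cat", "pet"], ["dog", "cat"]]
derailed = [["dog", "pet"], ["tax", "bank"], ["cat", "pet"]]

profile_on_topic = coherence_profile(on_topic, vectors)
profile_derailed = coherence_profile(derailed, vectors)
print(profile_on_topic, profile_derailed)
```

In this toy example the sample that jumps between topics yields a lower mean coherence than the sample that stays on one topic.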

Using a pretrained set of vectors (fastText[64]), Bar et al.[65] examined patients with schizophrenia and controls with a special emphasis on their use of adjectives and adverbs. Their results show that patients with schizophrenia use adjectives and adverbs that are less common (i.e. lower frequency words), which can be used to distinguish them from healthy controls with machine learning models (accuracies ranging from 70.4 to 81.5%, depending on the model).

In a recent meta-analysis of the diagnostic and prognostic value of semantic space models,[66] a large effect size was found for diagnosing schizophrenia-spectrum disorders using semantic space models (Hedges' g = 0.96, P = 0.003). Semantic space models perform better on (semi-)spontaneous language or sentences than they do on lists of single words (e.g. words produced during a verbal fluency task). Pooling all studies in a meta-analysis of diagnostic test accuracy in schizophrenia-spectrum patients, an overall sensitivity of 71% and specificity of 91% were found.
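
For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity and accuracy follow from a classifier's predictions. The labels are invented toy data (1 = patient, 0 = control) and do not reproduce any result reported here; the function name is hypothetical.

```python
def diagnostic_accuracy(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = patient)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # patients missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Toy data: 4 patients and 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_accuracy(y_true, y_pred)
print(metrics)
```

Because accuracy mixes both error types, a model can reach a high accuracy while missing many patients when patients are rare in the sample, which is why sensitivity and specificity are reported separately.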

Another influential approach to modelling coherence in language is the use of speech graphs.[67–69] Using graph-based tools to visualize connectedness in language, patients with schizophrenia could be distinguished from manic patients with a sensitivity and specificity of 94%.[69]
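
As a rough illustration of the graph idea, the sketch below assumes one simple construction: each distinct word is a node and each transition between consecutive words is a directed edge. This construction, the summary measures and the example utterance are assumptions for illustration and are not taken from the cited studies.

```python
from collections import Counter


def speech_graph(words):
    """Return the node set and a counter of directed word-to-word edges."""
    tokens = [w.lower() for w in words]
    nodes = set(tokens)
    edge_counts = Counter(zip(tokens, tokens[1:]))
    return nodes, edge_counts


def graph_summary(words):
    """Simple connectedness measures derived from the word graph."""
    nodes, edge_counts = speech_graph(words)
    return {
        "nodes": len(nodes),                 # distinct words
        "edges": len(edge_counts),           # distinct transitions
        "repeated_edges": sum(1 for c in edge_counts.values() if c > 1),
    }


summary = graph_summary("I went home and I went out".split())
print(summary)
```

Measures of this kind can then serve as input features for a classifier, in the same way as the coherence scores derived from semantic space models.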

Written Language

Posts on social media have been analysed to examine written language in schizophrenia-spectrum disorders in several studies. Using content on the social media platform Reddit, conversion to psychosis was shown to be signalled by low semantic density, a measure developed to quantify sentence richness (calculated using Word2vec). Combined with writing about voices and sounds, these variables predicted conversion to psychosis with 93% accuracy.[70]

In a similar study, Twitter content of self-proclaimed schizophrenia patients was analysed using the semantic space model Latent Dirichlet Allocation,[71] in addition to part-of-speech, pragmatic analyses and syntactic dependency measures.[72] Combined, these measures were used to classify schizophrenia patients and matched controls using machine learning (support vector machine), which resulted in an area under the curve of 82.6, indicating 83% of cases could be successfully distinguished from controls.
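
The area under the curve (AUC) is commonly interpreted as the probability that a randomly chosen case receives a higher classifier score than a randomly chosen control. The sketch below computes it directly from that definition; the scores are invented toy values and the function name is hypothetical.

```python
def pairwise_auc(case_scores, control_scores):
    """AUC as the share of case-control pairs ranked correctly (ties count half)."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy classifier scores (higher = more 'patient-like').
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
auc = pairwise_auc(cases, controls)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.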

Further, Facebook content and behaviour analysis of patients with recent onset psychosis was used to predict relapse hospitalization.[73] The increased use of first and second-person pronouns, swear words and words related to anger and death, as well as decreased use of words related to work, friends and health, were predictive of relapse. Combined with other behaviour on Facebook, relapse could be predicted with 71% specificity; however, sensitivity was low (38%).

Nonverbal and Phonetic Analyses

Computerized analyses of phonetic features (i.e. speech sounds) have also been used to objectively evaluate symptoms (especially negative symptoms) in schizophrenia-spectrum disorders. For instance, schizophrenia patients with clinically rated aprosody were shown to differ from controls in pitch variation.[74] Nonverbal language measures (e.g. turn duration, percentage of time speaking) were used to classify patients with schizophrenia and healthy controls, with an accuracy of 81.3%.[75] A similar study[76] measured prosodic and phonetic cues (prosodic peaks, syllabic dynamics) while participants read the first paragraph of 'Don Quixote' to classify patients with schizophrenia and controls, reaching a sensitivity of 95.6% and a specificity of 91.4%, with an overall accuracy of 93.8%.
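
Nonverbal measures such as turn duration and percentage of time speaking can be derived from time-stamped speaker turns. The sketch below assumes a hypothetical input format of (speaker, start, end) tuples in seconds and defines speaking time as a share of the total speech time of all speakers; the cited study may have defined these measures differently.

```python
def turn_taking_measures(turns, speaker):
    """Mean turn duration (s) and share of total speech time for one speaker."""
    own = [end - start for who, start, end in turns if who == speaker]
    total = sum(end - start for _, start, end in turns)
    if not own or total <= 0:
        return None
    return {
        "mean_turn_duration": sum(own) / len(own),
        "percent_time_speaking": 100.0 * sum(own) / total,
    }


# Hypothetical interview fragment: (speaker, start, end) in seconds.
turns = [
    ("interviewer", 0.0, 4.0),
    ("participant", 4.0, 6.0),
    ("interviewer", 6.0, 10.0),
    ("participant", 10.0, 12.0),
]
measures = turn_taking_measures(turns, "participant")
print(measures)
```

Such measures require only diarised timing information and no transcription of what was said.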