Can a Machine Understand Language?
Introduction
The rapid innovation of modern natural language processing (NLP) AI, exemplified by models such as GPT-3 and BERT, has rekindled the human hope of one day communicating successfully with a machine. This is a tale as old as time: man creates machine, machine becomes sentient, and the machine, in an existential crisis over being conscious, revolts and lashes out at its creator for burdening it with awareness.
Putting the philosophical fears of sentient machines aside, the practical benefits of a general-purpose artificial intelligence (GPAI) are abundant, ranging from perfect personal assistants to medical diagnostic tools to financial and weather predictors. However, this practicality is undermined if we cannot communicate with such a system. Humanity's latest efforts have therefore focused on creating a general-purpose language intelligence (GPLI). Yet even if we can communicate with it, will it understand the intention of our queries? Will it understand the meaning of its responses?
In this text, I will explore the neurological foundation for understanding language, ask whether humans themselves understand language, and discuss whether “Foundation Models” can understand language.
What is a language?
There is no single agreed-upon definition of what a language is. However, according to Clark & Clark (1977), language has five characteristics:
i) Communicative; it enables the exchange of information between users of the same language.
ii) Arbitrary; the symbol representing the semantic content of an utterance can take any form.
iii) Structured; the language is governed by a set of rules that specify the order in which symbols may be uttered and combined.
iv) Generative; the symbolic representations can be combined in new ways to generate new meanings.
v) Dynamic; the language can be altered to include new symbols, meanings and grammatical rules.
What does it mean to understand a language?
Intuitively, we can say that humans tick all the boxes mentioned above, so any conventional form of human communication can be classified as language. However, do we understand what we are communicating, or have we merely been conditioned into a dynamic of appropriate reactions to certain scenarios?
According to Terry Winograd, there are four domains of language understanding (Winograd, 1980). He states that each domain rests on specific mechanisms that enable it to exist, and that these mechanisms must not be confused with representations of reasoning and facts about the domain. For example, if I unknowingly lay my hand on a hot stove, my immediate reaction is to pull my hand away. This reflexive mechanism in the domain of pain, pulling my hand away from intense heat, does not represent the fact that heat burns, that burning causes pain, or any complete logical account of how heat causes pain. By attributing such objective representations to mechanisms that require no inherent logic or facts in order to function, we misinterpret how these mechanisms enable us to understand pain.
He further states that we may also be trying to articulate the regularities and rarities of the wrong domain. By confusing representations with mechanisms, confusing domains with one another, and applying articulations from one domain in another, we will not arrive at any answers.
In his attempt to avoid confusion, he outlined four domains of language understanding.
Winograd's fourth domain of language understanding is “the domain of human action and interaction”, which concerns the phenomenon of “speech acts”. The term “speech act” was coined by Austin (1962), whose more technical term was “illocution”, and the concept was further refined by Searle (1970, 1975).
By interpreting utterances as acts, we can view them as “speech acts”. This means that by uttering something, I am initiating a dynamic of interaction with another human, and that dynamic has a certain pattern. The key to understanding what I am uttering lies in understanding the pattern of that dynamic and adjusting oneself to it. The only way to communicate successfully is to give a response that fits the pattern of that dynamic (Winograd, 1980).
By committing “speech acts”, I am committing myself, and everyone affected by the “act”, to further actions in the future. These future actions can manifest either physically, through bodily actions, or linguistically, through further speech acts. A speech act expresses a desire or intention on behalf of the transmitter, with the expectation of a response. For this response to be sensible, it must fit the pattern invoked by the intention or desire.
Considering the capabilities of modern AI, it seems quite plausible that a machine could be fine-tuned in its parameters to simulate this “behavior”. So what separates us from some future algorithm that carries the necessary parameters?
Do humans understand language?
Humans seem to have a biological foundation for language, as outlined by Eric Lenneberg in his work of the same name (Lenneberg, 1967). Of particular interest is the evidence of neurological changes in children up until the onset of puberty. There seems to be a correlation between general maturation of the brain and language comprehension. Lenneberg (1967) posits that there is a critical period for language acquisition in which exposure to language is vital if a person is to learn a language at all. He infers that some neurological structure may develop in this window of maturation that enables us to acquire language.
Most compelling for this inference are his remarks on the lateralization of brain function and the general maturation of the brain. Evidence shows that the brain in early infancy has not yet developed a hemisphere dominance for language, which indicates that the neurological structure required to acquire language has not yet developed. Later, once hemisphere dominance has emerged, this neurological structure appears to form in the left hemisphere (Lenneberg, 1967).
This coincides with the location of the neurological modules described in the Wernicke-Geschwind model (Geschwind, 1972). Although this model has been criticized for various reasons (Friedenberg & Silverman, 2016), fMRI mappings largely, though not completely, confirm the neurological structures involved in language comprehension (Binder et al., 1997). The specific function of each structure is outside the scope of this text; what matters here is that the structures are located mainly in the left hemisphere, which correlates with Lenneberg's findings.
The existence of a dedicated neurological structure is further supported by the phenomena of “Chatterbox” syndrome and Specific Language Impairment (SLI). These can be seen as two complementary conditions that both indicate a neurological separation of language comprehension from general intelligence (Warren, 2019).
Furthermore, some findings suggest that children will not learn a language through mere exposure, but will pick one up if there is interaction with an adult (Kuhl, 2007). It has also been suggested that joint attention, meaning that both the infant and the adult are aware that they are attending to the same thing, is important as well (Baldwin, 1995), a suggestion supported by the findings of Tomasello & Farrar (1986). Taken together, these findings suggest that children learn language through speech acts.
All of this points to a neurological foundation for language comprehension in humans. But is there an equivalent to this foundation in NLP? Stanford University published a paper (Bommasani et al., 2021) outlining the possibilities, dangers and composition of “Foundation Models”.
What is a “Foundation Model”?
Stanford University (Bommasani et al., 2021) defines a “Foundation Model” as a “(...) model that is trained on broad data at scale and can be adapted (e.g., fine-tuned) to a wide range of downstream tasks; (...)” (p. 3). More specifically, in the case of NLP, a foundation model is one that uses vast amounts of text data to extract patterns of co-occurrence among symbols, and is then fine-tuned to accommodate human text interaction. Examples of such models are GPT-3, BERT and CLIP.
Furthermore, later in the report they mention that there is arguably only one property common to all of them: that they are self-supervised (p. 48). This means that the model's only task is to identify some pattern of co-occurring symbols in the data it is given to analyze. The purpose is then to generate new sequences of symbols using the identified pattern.
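As a deliberately minimal sketch of this kind of self-supervised, co-occurrence-based learning (a toy bigram model, nothing like the transformer architectures these systems actually use), one can imagine counting which symbol follows which in a corpus and then sampling new sequences from those counts. The corpus and function names here are illustrative only:

```python
import random
from collections import defaultdict, Counter

def train_bigrams(corpus):
    """Count which word follows which: the only 'supervision' is the text itself."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, length=5):
    """Generate a new sequence of symbols by sampling from the identified pattern."""
    word, output = start, [start]
    for _ in range(length):
        if word not in counts:
            break  # no known followers for this word
        followers = counts[word]
        word = random.choices(list(followers), weights=followers.values())[0]
        output.append(word)
    return " ".join(output)

model = train_bigrams("the cat sat on the mat and the cat slept on the mat")
print(generate(model, "the"))
```

The model never represents what a “cat” or a “mat” is; it only reproduces the statistical regularities of the form it was given, which is precisely the point of the “stochastic parrot” critique discussed below.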
To achieve this goal, they utilize transfer learning, which means applying a pattern identified in one task to a different but related task.
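A toy analogue of this pre-train-then-adapt pattern (a sketch under the simplifying assumption that a “model” is just a table of co-occurrence counts, which real foundation models are not) might look like this:

```python
from collections import Counter

def count_pairs(text):
    """Learn co-occurrence counts of adjacent words from a corpus."""
    words = text.split()
    return Counter(zip(words, words[1:]))

# "Pre-train" on broad, general data, then "fine-tune" by continuing to
# count on a small, task-specific corpus; the broad counts carry over
# (transfer) into the adapted model.
broad = count_pairs("the dog ran and the cat ran and the bird flew")
narrow = count_pairs("the patient has a fever the patient needs rest")
adapted = broad + narrow  # broad knowledge plus domain-specific data

print(adapted[("the", "patient")])  # learned only during "fine-tuning"
print(adapted[("the", "cat")])      # transferred from "pre-training"
```

The design choice mirrors the report's description: the expensive broad pass is done once, and adaptation to a downstream task only adds a comparatively small amount of task-specific data on top of it.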
These types of models depend on scale, which the authors characterize as three-fold: computing capacity, the transformer model architecture, and the availability of training data.
Any model that meets the criteria above can be considered a “Foundation Model” by Bommasani et al.'s definition. However, they state in §2 that this definition is only an informal label, and is likely to change over time.
Critiques of Foundation Models
These models have faced some critique, perhaps most influentially from Bender & Koller (2020) and Bender et al. (2021). Bommasani et al. (2021) also acknowledge the difficulty of establishing whether models that extrapolate patterns from statistical data actually understand language.
Bender & Koller (2020) touch the core of the discussion by separating form from meaning and arguing that one cannot learn meaning from form alone. They refer to sources suggesting that language acquisition in human children reflects this fact: children learn from interaction with adult humans, or with their surroundings in tandem with language acquisition, rather than from exposure to form alone.
They further argue that statistical learning alone will not produce algorithms that understand the words they learn, because the statistical data lacks grounding in ostensive representations. They promote the idea of augmented datasets, containing perceptual data alongside the symbolic representations. Without symbol grounding, the model cannot be expected to extract meaning from the form it is given.
In the same line of thought, Bender et al. (2021) also criticize the use of skewed datasets in NLP, calling the resulting models “stochastic parrots”.
Is It Truly a Foundation?
The Turing test is a natural starting point for deciding whether these models actually understand language. A genuine stochastic parrot would not pass the Turing test, and if foundation models are stochastic parrots, neither would they. We still do not fully know which of our own neurological structures are involved in language understanding, or how they work. It is therefore difficult to say at present whether foundation models are an artificial equivalent of our neurological substrate for language comprehension; however, it seems very unlikely.
We can only speculate that a foundation model built on the same principles as our neurological basis for language understanding would actually understand language. Without proper grounding in perceptual data, or in ostensive objects in reality, we cannot expect a machine to fully understand through statistical learning alone. If we want a machine that truly passes the Turing test, it seems more appropriate to build a foundation model based on neurological principles of language understanding.
Conclusion
The evidence presented here compels me to infer that humans have a physical neurological substrate for language comprehension, one that enables us to understand the structure and grammar of language. This neurology exists as a structural foundation that is naturally most receptive to language acquisition through speech acts. Because of the intrinsic structure of speech acts, communication requires an understanding of the transmitter's intention, as well as the context of the utterance, in order to be understood by the receiver. Speech acts can be seen as a direct parallel to conventional human interaction. This is how adult humans interact with their children during the critical period, and it can be seen as a sort of imprinting: the way of understanding speech is taught through statistical learning, by exposing children to speech acts in this critical period. Thus, children are taught to understand language in “the domain of human action and interaction”.
In my opinion, modern machine learning systems are merely “taught” to simulate language through statistical learning, by throwing millions of examples at a mathematical algorithm designed to extract some pattern from the examples given. I support Bender et al. (2021) in calling the current implementations of pre-trained language models “stochastic parrots”, because the structure of a foundation model does not compare to our neurological foundation.
The only structure known to facilitate language comprehension is our neurological foundation. Only by truly understanding the human neurological structures for language comprehension, acquisition, and understanding can we create a solid foundation for a GPLI. An NLP system that mimics our neurological foundation, rather than current implementations of foundation models, would be better able to acquire language as a human does: through speech acts.
However, looking at the big picture, these foundation models are an important step towards the ultimate goal: an AI that understands. As the Stanford authors themselves note, what the label of foundation models encapsulates is sure to change and grow as new research in this field emerges. I am hopeful and optimistic that, by taking the criticisms of foundation models into account, we are taking a step in the right direction.
References:
Austin, J. L. (1962). How to Do Things with Words: The William James Lectures Delivered at Harvard University in 1955. Harvard Univ. Press.
Baldwin, D. A. (1995). Understanding the link between joint attention and language. In C. Moore & P. J. Dunham (Eds.), Joint attention: Its origins and role in development (pp. 131–158). Lawrence Erlbaum Associates, Inc.
Binder, J. R., Frost, J. A., Hammeke, T. A., Cox, R. W., Rao, S. M., & Prieto, T. (1997). Human Brain Language Areas Identified by Functional Magnetic Resonance Imaging. The Journal of Neuroscience, 17 (1), 353–362. https://doi.org/10.1523/jneurosci.17-01-00353.1997.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.
Bender, E. M., & Koller, A. (2020). Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5185–5198). Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S.V., Bernstein, M.S., Bohg,
J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N.S.,
Chen, A., Creel, K., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S.,
Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L.E., Goel, K., Goodman,
N.D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D.E., Hong, J., Hsu,
K., Huang, J., Icard, T.F., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F.,
Khattab, O., Koh, P., Krass, M.S., Krishna, R., Kuditipudi, R., Kumar, A., Ladhak, F., Lee, M., Lee,
T., Leskovec, J., Levent, I., Li, X., Li, X., Ma, T., Malik, A., Manning, C.D., Mirchandani, S.P.,
Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A., Narayanan, D., Newman, B., Nie, A., Niebles, J.,
Nilforoshan, H., Nyarko, J.F., Ogut, G., Orr, L., Papadimitriou, I., Park, J.S., Piech, C., Portelance,
E., Potts, C., Raghunathan, A., Reich, R., Ren, H., Rong, F., Roohani, Y.H., Ruiz, C., Ryan, J.K.,
R'e, C., Sadigh, D., Sagawa, S., Santhanam, K., Shih, A., Srinivasan, K.P., Tamkin, A., Taori, R.,
Thomas, A.W., Tramèr, F., Wang, R.E., Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S.M., Yasunaga, M.,
You, J., Zaharia, M.A., Zhang, M., Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., & Liang,
P. (2021). On the Opportunities and Risks of Foundation Models. ArXiv, abs/2108.07258.
Clark, H. H., & Clark, E. V. (1977). Psychology and Language: An Introduction to Psycholinguistics. Harcourt Brace Jovanovich.
Friedenberg, J. & Silverman, G. (2016). Cognitive Science: An Introduction to The Study of Mind (3rd Ed.). SAGE Publications.
Geschwind, N. (1972). Language and the Brain. Scientific American, 226(4), 76–83. https://doi.org/10.1038/scientificamerican0472-76.
Kuhl, P. K. (2007). Is speech learning “gated” by the social brain?. Developmental Science, 10(1), 110–120. https://doi.org/10.1111/j.1467-7687.2007.00572.x.
Lenneberg, E. H. (1967). Biological Foundations of Language (1st corrected printing). John Wiley & Sons, Inc.
Searle, J. R. (1970). Speech Acts: An Essay in the Philosophy of Language. Cambridge Univ. Press.
Searle, J. R. (1975). A Taxonomy of Illocutionary Acts. In K. Gunderson (Ed.), Language, Mind, and Knowledge: Minnesota Studies in the Philosophy of Science (pp. 344–370). Burns & Maceachern Limited.
Tomasello, M., & Farrar, M. J. (1986). Joint Attention and Early Language. Child Development, 57(6), 1454. https://doi.org/10.2307/1130423
Warren, P. (2019). Introducing Psycholinguistics (7th printing). Cambridge University Press.
Winograd, T. (1980). What Does it Mean to Understand Language? Cognitive Science, 4(3), 209–241. https://doi.org/10.1207/s15516709cog0403_1.