Evolution of Large Language Models

From ELIZA to BERT to ChatGPT, Large Language Models have come a long way from conception in the 1960s to GenAI mania in the 2020s.

By Gina Gin

Gina Gin is an aspiring microbiologist, author and blogger who covers the growing AI industry.


“Large Language Model” is a recent buzzword, but the idea dates back to the mid-twentieth century, though not in the refined and advanced form it takes today. A large language model, or LLM, is a type of artificial intelligence designed to understand, generate, and manipulate human language. These models can perform a wide range of language-related tasks and are built using vast amounts of textual data and advanced machine learning techniques. They rely on neural networks, which are loosely inspired by the way neurons in our brains enable us to learn and solve problems. Neural networks play a similar role in generative AI: they make these models capable of everything from answering a simple question to understanding a person’s voice.

Nowadays, LLMs use an advanced form of neural network known as the transformer architecture. An LLM is trained on a variety of data sources, including books, articles, websites, and other textual content, allowing it to learn the nuances of language, context, grammar, and even some elements of common sense and world knowledge.

LLMs have become more sophisticated at Natural Language Processing, or NLP, over time, leading to better outputs. Extensive training gives modern LLMs the ability to generate contextually relevant responses, complete sentences, translate languages, summarize texts, and create original content, all extensions of NLP.

From ELIZA in the 1960s to GPT-4o in the 2020s, it has been an interesting and speedy journey so far.

ELIZA

ELIZA was released in 1966 by Joseph Weizenbaum, a German-American computer scientist at MIT. It could simulate conversation with a user through pattern matching. A script was fed to the system, and ELIZA used pattern matching to select the relevant parts of the input to respond to. What ELIZA did followed a logical process: after pattern matching, it applied transformation rules to generate a response. These rules often involved rephrasing the user’s input and reflecting it back as a question. If the user typed “I’m very happy,” ELIZA might reply “Why are you happy?” It also picked out certain keywords from an input sentence and used them to generate replies.

Here’s a simple example of a conversation with ELIZA:

User: I feel unhappy.

ELIZA: Why do you feel unhappy?

User: My mother doesn’t understand me.

ELIZA: Who else in your family doesn’t understand you?

User: I don’t know.

ELIZA: Can you elaborate on that?
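
Under the hood, that behavior boils down to a handful of keyword patterns and canned response templates. Here is a minimal Python sketch of the idea, a hypothetical re-creation rather than Weizenbaum’s original program, with a few rules chosen to match the conversation above:

    import re

    # Reflections swap first- and second-person words so input can be echoed back.
    REFLECTIONS = {"i": "you", "i'm": "you're", "my": "your", "am": "are", "me": "you"}

    # Each rule pairs a keyword pattern with a response template (a transformation rule).
    RULES = [
        (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
        (re.compile(r"my (.*) doesn't understand me", re.I),
         "Who else in your family doesn't understand you?"),
        (re.compile(r"i'?m (.*)", re.I), "Why are you {0}?"),
    ]

    def reflect(text):
        # Swap pronouns word by word, e.g. "my" becomes "your".
        return " ".join(REFLECTIONS.get(word.lower(), word) for word in text.split())

    def respond(user_input):
        for pattern, template in RULES:
            match = pattern.search(user_input)
            if match:
                return template.format(reflect(match.group(1).rstrip(".!")))
        return "Can you elaborate on that?"  # fallback when no keyword matches

    print(respond("I feel unhappy."))                   # Why do you feel unhappy?
    print(respond("My mother doesn't understand me."))  # Who else in your family doesn't understand you?
    print(respond("I don't know."))                     # Can you elaborate on that?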

ELIZA had limitations: its responses were purely syntactic and did not involve any understanding of the content. It was purely a pattern-matching system with no comprehension of semantics. Although present-day generative AI has greatly improved, understanding words and their meanings is still a problem. While a modern LLM like ChatGPT can simulate an understanding of semantics and produce responses that appear meaningful, its “understanding” is based on statistical patterns rather than true comprehension.

Word Embeddings

Word embeddings are a key advancement in natural language processing: each word is encoded as a vector and embedded in a continuous vector space. The technique came along in the 1990s, and embeddings later made encoder models like BERT possible, another step in the evolution of large language models.

Early word embedding methods used matrix factorization techniques to capture word meanings based on how often words co-occur in documents. That’s a real mouthful! What it means is that words appearing in similar contexts end up near one another, so words with similar meanings cluster together. Imagine taking each word and putting it somewhere on a big piece of paper. The paper represents the vector space. Words with similar meanings are placed closer together on the paper, and words with dissimilar meanings farther apart.

So if the context is “pets,” “cat” and “dog” will be drawn close to each other, while “car” might be a bit farther away because it’s not a pet. Words like “leash,” “chew toy,” and “water bowl” would be closer to “cat” and “dog” than to “car.” These distance measurements in vector space help computers understand the meanings of words and how they’re related. This was a huge breakthrough in the NLP world, and it led to many other innovations, including modern generative AI.
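
Here’s a toy sketch of that “closer together on the paper” idea in Python. The three-dimensional vectors below are made up for illustration; real embeddings are learned from data and typically have hundreds of dimensions, with cosine similarity as a common measure of closeness:

    import numpy as np

    # Hypothetical 3-dimensional embeddings; a real model learns these from co-occurrence data.
    vectors = {
        "cat":   np.array([0.90, 0.80, 0.10]),
        "dog":   np.array([0.85, 0.75, 0.15]),
        "leash": np.array([0.70, 0.60, 0.20]),
        "car":   np.array([0.10, 0.20, 0.90]),
    }

    def cosine_similarity(a, b):
        # Close to 1.0 means the words point the same way in the space; near 0 means unrelated.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(vectors["cat"], vectors["dog"]))    # high: the pets sit together
    print(cosine_similarity(vectors["cat"], vectors["leash"]))  # fairly high: related context
    print(cosine_similarity(vectors["cat"], vectors["car"]))    # low: a different corner of the space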

The embedding approach also had its limitations. For one, it could easily be poisoned by biased or low-quality training data. Embeddings aren’t able to capture words with multiple meanings well. And any words not present in the training data are not represented in the embedding space, leading to issues when dealing with rare or newly coined terms. Some models handle this better than others, but it remains a challenge.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) were conceptualized earlier than embeddings; the foundational ideas were developed in the 1980s. What made RNNs a huge step beyond word embeddings was the hidden state they carry from one step in a sequence to the next, enabling them to maintain a “memory” of previous inputs. This makes RNNs particularly well suited for tasks involving sequential data, such as time series analysis, natural language processing, and speech recognition.
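
Here is a minimal sketch of that recurrence in Python, with random weights standing in for what a trained RNN would learn. At every step the new hidden state mixes the current input with the previous hidden state, which is how information from earlier in the sequence is carried forward:

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 4, 3

    W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights (untrained)
    W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights, the "memory" path
    b = np.zeros(hidden_size)

    def rnn_step(x_t, h_prev):
        # The new hidden state depends on both the current input and the previous state.
        return np.tanh(W_x @ x_t + W_h @ h_prev + b)

    sequence = [rng.normal(size=input_size) for _ in range(5)]  # stand-in for five word vectors
    h = np.zeros(hidden_size)                                   # initial hidden state
    for x_t in sequence:
        h = rnn_step(x_t, h)  # earlier inputs keep influencing h at every later step

    print(h)  # the final hidden state summarizes the whole sequence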

Seq2Seq

The Seq2Seq (Sequence-to-Sequence) model was introduced by Ilya Sutskever (a co-founder of OpenAI), Oriol Vinyals, and Quoc V. Le in a paper published in 2014.

Sequence-to-Sequence (Seq2Seq) is a type of model used in machine learning for tasks that involve transforming one sequence into another. It is particularly popular in natural language processing systems such as machine translation, text summarization, and conversational agents. A Seq2Seq model consists of two main components: an encoder and a decoder.

In a Seq2Seq model for translating “I am a student” to French, the encoder reads the English sentence and converts it into a context vector summarizing its meaning. The decoder then takes this context vector to generate the French translation step-by-step.

It starts with a start token, produces “Je”, then “suis”, and finally “étudiant”, resulting in the complete translated sentence “Je suis étudiant”. An attention mechanism can help the decoder focus on specific parts of the input sentence while translating. As revolutionary as Seq2Seq was in its time, it struggled with long input sequences because the fixed-length context vector may not capture all the necessary details, leading to loss of information.
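
The encoder-decoder split can be sketched in a few dozen lines. The sketch below uses PyTorch with random, untrained weights, so the output it prints is meaningless; it only illustrates the mechanics described above: the encoder compresses the input token ids into a single context vector, and the decoder unrolls that vector one token at a time starting from a start token. The vocabulary sizes and token ids are all hypothetical:

    import torch
    import torch.nn as nn

    SRC_VOCAB, TGT_VOCAB, HIDDEN, START, END = 1000, 1000, 64, 1, 2  # hypothetical sizes and ids

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(SRC_VOCAB, HIDDEN)
            self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

        def forward(self, src_ids):
            _, context = self.rnn(self.embed(src_ids))
            return context  # a fixed-length summary of the whole input sentence

    class Decoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(TGT_VOCAB, HIDDEN)
            self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
            self.out = nn.Linear(HIDDEN, TGT_VOCAB)

        def forward(self, token_id, hidden):
            output, hidden = self.rnn(self.embed(token_id), hidden)
            return self.out(output), hidden  # scores for the next token, plus updated state

    encoder, decoder = Encoder(), Decoder()
    src = torch.tensor([[5, 17, 42, 7]])  # stand-in token ids for "I am a student"
    hidden = encoder(src)                 # the context vector

    token = torch.tensor([[START]])       # decoding begins from the start token
    for _ in range(10):                   # greedy decoding, one token per step
        scores, hidden = decoder(token, hidden)
        token = scores.argmax(dim=-1)     # pick the most likely next token
        if token.item() == END:
            break
        print(token.item())               # with trained weights these would be "Je", "suis", ...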

Transformer Models

A transformer model is a type of neural network architecture introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017. Unlike traditional Recurrent Neural Networks (RNNs), transformers do not process a sequence step by step. Instead, they use a mechanism called self-attention to process and generate sequences. Transformers have become the foundation for many state-of-the-art models in natural language processing, including BERT, GPT (the “T” stands for transformer), and T5.
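
Self-attention itself is a short computation. Below is a minimal NumPy sketch of scaled dot-product attention over a few made-up token vectors; a real transformer learns the query, key, and value projection matrices and runs many such attention “heads” in parallel:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        # Project every token into query, key, and value vectors.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # Every token scores every other token; scaling keeps the softmax well-behaved.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = softmax(scores, axis=-1)  # how strongly each token attends to the others
        return weights @ V                  # each output is a weighted mix of all the values

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8                  # e.g. a 4-token sentence, 8-dimensional embeddings
    X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8): one updated vector per token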

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model introduced by Google in 2018. BERT revolutionized language models by understanding context from both directions in a sentence. Unlike previous models that read text in a single direction, BERT’s bidirectional approach significantly improved comprehension.
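
That bidirectional context is easy to see with a masked-word demo. The snippet below assumes the Hugging Face transformers library is installed (it downloads the pretrained weights on first run); BERT reads the words on both sides of the [MASK] token before filling it in:

    from transformers import pipeline

    # Loads the pretrained bert-base-uncased model and its tokenizer.
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    # BERT uses the context on both sides of [MASK] to rank candidate words.
    for prediction in unmasker("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))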

BERT’s ability to be fine-tuned on a wide range of specific tasks with relatively little data has made it a versatile tool for AI and machine learning.

The GPT Era

Now we have the modern models made famous by the release of ChatGPT, which launched on GPT-3.5. These advanced models are also transformer-based and use attention to understand text. They are trained on vastly more data than previous models, nearly the whole of the text on the internet.
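
The same library mentioned above for BERT can be used to try autoregressive generation locally, assuming it is installed. The snippet below uses the small, freely available GPT-2 model as a stand-in for its much larger successors; like them, it generates text one token at a time, each prediction conditioned on everything that came before it:

    from transformers import pipeline

    # GPT-2 is an early, small GPT-family model, but it generates text the same way
    # its successors do: predicting the next token over and over.
    generator = pipeline("text-generation", model="gpt2")

    print(generator("Large language models are", max_new_tokens=30)[0]["generated_text"])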

GPT and its generation are incredibly capable, but they have introduced new problems: they hallucinate, or make up incorrect answers; they carry biases from their creators and the data sets they were trained on; and they have gotten so good that they need guardrails to prevent people from using them for harm. How the next evolution of LLMs will improve on them, we shall see.