Overfitting: A Conversation Starter for the Next Summer Party
Overfitting, where AI models over-specialize in their training data, exemplifies a core challenge of Transformer-based LLMs, which memorize extensive details of that data. This raises significant legal questions around data protection and copyright, and addressing them requires interdisciplinary collaboration to navigate the complexities of AI within the legal framework.
Imagine you are at a lakeside summer party, looking for a more engaging topic than your job as a lawyer. Here is a suggestion: ask the person next to you how they tackle overfitting in their AI models. Don't worry, overfitting has nothing to do with the fit of clothing, making it a safe yet fascinating conversation starter that doesn't immediately scream "legal professional." Overfitting is a term from the realms of algorithms, mathematics and machine learning—the very foundations of all contemporary AI.
To make meaningful predictions, a machine learning system must be meticulously trained. If the goal is to distinguish between bridges and traffic lights to crack captchas automatically, the classification model must be shown numerous clear images of bridges and traffic lights. This way, it can identify the defining features and extract them as distinct patterns. However, during training, the model might learn the training data too precisely, picking up on irrelevant details like the color of the sky or the shapes of shadows and car types, instead of focusing solely on the relevant patterns of "bridge" or "traffic light." When tested on new images, such an overtrained model typically fails to recognize bridges and traffic lights accurately. It becomes too closely adapted to its training data and struggles to make general predictions beyond it. Voilà, overfitting.
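For anyone who would rather demonstrate overfitting than merely describe it, here is a minimal sketch, assuming Python with scikit-learn and a synthetic dataset standing in for the bridge and traffic-light images: an unconstrained decision tree fits its noisy training data almost perfectly yet scores noticeably worse on data it has never seen.

```python
# A minimal, illustrative sketch of overfitting with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y adds label noise).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

overfit = DecisionTreeClassifier(random_state=0)               # no depth limit: memorizes noise
regular = DecisionTreeClassifier(max_depth=3, random_state=0)  # constrained: learns general patterns

for name, model in [("unconstrained", overfit), ("depth-limited", regular)]:
    model.fit(X_train, y_train)
    print(name,
          "train accuracy:", round(model.score(X_train, y_train), 2),
          "test accuracy:",  round(model.score(X_test, y_test), 2))
```

The unconstrained tree typically shows a large gap between training and test accuracy, which is exactly the symptom the term overfitting describes.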
Sure, you could have just asked ChatGPT or any of the many AI systems currently turning vast amounts of water and electricity into dry words. What, then, is the intrinsic value of this article? Large language models (LLMs) didn't just appear out of nowhere in 2022; they are the result of decades of research. Today's LLMs are based on the Transformer architecture, a deep learning innovation published in 2017 under the title "Attention is All You Need." This legendary paper is widely discussed but seldom read.
Transformers introduced key innovations such as self-attention and multi-head attention. Self-attention lets an LLM weigh every token in the context against every other, so that each next word is chosen in light of the entire preceding sequence. Multi-head attention means the LLM performs several such attention calculations in parallel, each capturing different relationships within the input sequence and thus covering more of the many dimensions of language. Voilà, attention understood.
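For the technically curious party guest, the core of self-attention fits in a few lines. The sketch below is a deliberate simplification, with random matrices in place of learned parameters and a single head rather than many: each token's query is compared against every key, and the resulting weights blend the values.

```python
# A toy, single-head version of scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project each token into query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly does each token attend to each other token?
    weights = softmax(scores)                  # attention weights, one row per token, summing to 1
    return weights @ V                         # each output is a weighted mix of all token values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 tokens, each an 8-dimensional embedding
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8): one attended vector per token
```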
But why can an LLM now answer questions about overfitting and attention? Buckle up; we're going in circles: it's also due to overfitting. Neural networks are inherently prone to this undesirable phenomenon. We want systems that don't merely "stochastically parrot" the data processed during training but apply learned patterns to new, unseen situations, something we might call human-like intelligence. Unfortunately, current LLMs struggle with this, largely because of their sheer size: large neural networks with billions of parameters are prone to memorizing even the tiniest details of their training data.
Using Jane Austen's "Pride and Prejudice" as an example (also great for small talk), scientists explain how LLMs memorize their training data verbatim and reproduce it word for word. This effect intensifies with the increasing size of LLMs.
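A memorization probe of this kind can be reproduced at home. The sketch below uses the Hugging Face transformers library with GPT-2 as a small, openly available stand-in; whether the completion matches Austen word for word depends on the model, its size, and the decoding settings.

```python
# A minimal memorization probe: feed a model the opening of a famous text
# and see how it continues. GPT-2 is used only as an openly available example.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "It is a truth universally acknowledged, that a single man in possession"
result = generator(prompt, max_new_tokens=15, do_sample=False)  # greedy decoding, no sampling
print(result[0]["generated_text"])
```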
A well-known complaint from the US music industry alleges that Anthropic's model "Claude" accurately reproduced the lyrics of Katy Perry's "Roar," and output Don McLean's "American Pie" when prompted to write a song about the death of Buddy Holly. Why? Let's dive deeper.
Deep learning models can't process text in its raw literary form; they need numbers to work with. So the text is converted into numbers, specifically into information-rich, dense vectors. A whole paragraph of text can be reduced to a single vector. Vectors can capture semantic similarity, hence the term vector embeddings. These embeddings are high-dimensional: every conceivable association with or perspective on the embedded text is represented in numerical values, which makes them quite complex, with thousands of dimensions (GPT-3 works with 12,288 dimensions). High-dimensional vector spaces for language can be visualized as word clouds in which some words sit closer together and others farther apart, depending on their semantic similarity.
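To get a feel for how "closer" and "farther apart" are measured, here is a toy illustration; the four-dimensional vectors are invented for the example, whereas real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
# Toy embeddings and cosine similarity as a distance measure in the vector space.
import numpy as np

embeddings = {
    "bridge":        np.array([0.9, 0.1, 0.3, 0.0]),   # hand-invented values for illustration
    "traffic light": np.array([0.8, 0.2, 0.4, 0.1]),
    "sonnet":        np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point in nearly the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["bridge"], embeddings["traffic light"]))  # high: semantically related
print(cosine(embeddings["bridge"], embeddings["sonnet"]))         # lower: semantically distant
```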
For clarity, it's important to note that LLMs do not embed text word by word; the text is first broken into tokens—small word components—which are then converted into numerical token IDs that the neural network can process. Tokens enable the LLM to handle compounds, neologisms, and linguistic nuances better. What do tokens look like? Ciphertext. Try it out with the Tiktokenizer.
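The same effect the Tiktokenizer shows in the browser can be reproduced with OpenAI's open-source tiktoken library; the encoding name used below, "cl100k_base", is the one behind GPT-3.5 and GPT-4 models.

```python
# Turning text into tokens and numerical token IDs with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # tokenizer used by GPT-3.5/GPT-4 models
ids = enc.encode("Overfitting is a great conversation starter.")
print(ids)                                     # the numerical token IDs the model actually sees
print([enc.decode([i]) for i in ids])          # the word fragments behind those IDs
```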
The precision of the embeddings means many texts processed during model training quickly nest in a niche in the high-dimensional space and remain reconstructable there. Why exactly this happens, and whether overfitting is the only reason, is still under research. Machine learning models, especially deep neural networks, have long been known to be prone to memorization. LLMs have inadvertently demonstrated their capacity to function as vast associative databases, housing fragments of the world's knowledge. And where else should this knowledge come from if not from the training data?
Let's take an example: we use any LLM and ask it to complete a famous quote from Yoda, "Do. Or do not. There is no try," providing the model only the first four words. The LLM now searches associatively for a fitting answer and calculates it word by word (better: token by token) based on the probability values learned during pre-training and fine-tuning. And behold, the LLM responds with the full quote. This is no small feat: visualizing the generation process reveals an overwhelming variety of possible continuations. Using "beam search," the path of the model's answer can be traced. The probability of a token sequence is calculated as the product of the probabilities of its individual tokens. For a perfectly completed famous quote, this shows that the probability of the quote's output is significantly higher than that of all other possible variants. Voilà, memorization and overfitting demonstrated.
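How such a path can be scored is easy to sketch. The snippet below again uses GPT-2 from the transformers library as a stand-in and sums log-probabilities instead of multiplying raw probabilities, which is mathematically equivalent but numerically stable; if a quote is memorized, its continuation should receive a far higher score than an arbitrary alternative.

```python
# Scoring a continuation token by token: the sequence probability is the
# product of conditional token probabilities (summed here as log-probabilities).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_logprob(prompt, continuation):
    """Log-probability of `continuation` given `prompt`, summed token by token."""
    prompt_ids = tokenizer.encode(prompt)
    full_ids = tokenizer.encode(prompt + continuation)
    with torch.no_grad():
        logits = model(torch.tensor([full_ids])).logits      # next-token scores at every position
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(len(prompt_ids), len(full_ids)):
        total += log_probs[0, pos - 1, full_ids[pos]].item()  # P(token | all preceding tokens)
    return total

print(sequence_logprob("Do. Or do not.", " There is no try."))
print(sequence_logprob("Do. Or do not.", " There is no banana."))
```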
There is no secret ingredient in LLMs, no internal world model that conveys knowledge. LLMs can approximately reproduce their training data. Training texts, from books to chat messages, can be extracted by the gigabyte from open-source models like LLaMA and Falcon. That not all training data is memorized can be explained by the efficiency-driven design of LLMs: they cannot be made arbitrarily large during training, because the cost of operating them would become astronomical.
The fact that LLMs now run on commercially available smartphones is the result of reducing model complexity, for example by pruning and quantizing parameters and layers that contribute little. Due to the efficient embedding of training data, comparing LLMs to compression algorithms like MP3, JPEG, or ZIP seems logical. Each compression method has a fundamental limit based on the entropy of the dataset to be compressed, and that entropy can also be read as a measure of the dataset's predictability (or unpredictability). LLMs are trained to output the probabilities of all possible next tokens given the preceding tokens, i.e. a conditional probability distribution. In other words, an LLM provides us with exactly the probability information needed to achieve optimal compression.
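The connection between prediction and compression can be made concrete without any deep learning at all. In the self-contained sketch below, a character-level bigram model stands in for the LLM's conditional probability distribution: the better the next symbol is predicted, the fewer bits Shannon's bound says are needed to encode it.

```python
# Prediction as compression: code length under a model is -log2 of the
# probability it assigns to each symbol (Shannon's bound).
import math
from collections import Counter, defaultdict

text = "do or do not there is no try " * 20

# Estimate the conditional distribution P(next character | current character).
pair_counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    pair_counts[a][b] += 1

def bits_needed(s):
    total = 0.0
    for a, b in zip(s, s[1:]):
        counts = pair_counts[a]
        p = counts[b] / sum(counts.values())   # model's probability for the next character
        total += -math.log2(p)                 # optimal code length for that character
    return total

alphabet_size = len(set(text))
print("fixed-length encoding:", round(len(text) * math.log2(alphabet_size)), "bits")
print("model-based lower bound:", round(bits_needed(text)), "bits")
```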
The confirmation that training data is memorized bears directly on the applicability of data protection, copyright, and personality rights laws. In the legal realm, the form in which personal information is stored is not determinative. Whether customer data is stored in SQL or NoSQL databases is as irrelevant as whether text is stored as ones and zeros, as magnetic values on hard drives, or as electrical charge states on SSDs. That LLMs store texts as numerical sequences within high-dimensional vector embeddings is certainly a fascinating process, but it does not change the legal analysis. Even the re-identification of disparate or compressed data has long been researched and described. LLMs function as associative networks and databases. In the data protection debate, this is the prelude to an earnest discussion, as LLMs offer substantive material both de lege lata and de lege ferenda.
Returning to our lakeside scenario: To engage in stimulating conversations that extend beyond the knowledge curve shaped by one's profession, you need not only occasional fine-tuning with newly discovered training data from contracts and court briefs but also a continuous influx of intellectual stimuli through interaction with other scientific disciplines. How wonderful it is that we can keep our professional training data sufficiently diverse, ensuring that we lawyers remain as sharp and insightful as ever—if not a touch more humble.
Article provided by INPLP member: Peter Hense (Spirit Legal, Germany)