histcat

A Speech about ChatGPT

Source#

Originally, I was supposed to give this speech in class, but it was cancelled because of the final exams QAQ
Just leaving it here as a memento (

Main Text#

Some say that 2023 was the year of artificial intelligence. That is because OpenAI publicly released ChatGPT on November 30, 2022, and since then people have been able to experience first-hand, simply by chatting with ChatGPT, the convenience that artificial intelligence brings us. Now that it is already 2024, I believe everyone has some understanding of ChatGPT, whether from online media reports, from Chinese essays, or from English readings. But would anyone like to understand it more deeply? Today, let’s start from ChatGPT, understand the principles behind it, and further explore the basic workings of the neural networks underneath.

Let’s start with the name itself. ChatGPT stands for Chat Generative Pre-trained Transformer. "Generative" means that ChatGPT is a model that generates new text. "Pre-trained" means the model has already gone through a learning process on a large amount of data. The key term is the last word, Transformer — what is that? It is a special kind of deep learning model, a neural network, proposed in the paper "Attention Is All You Need" published by Google in 2017. ChatGPT is built on a modified version of the Transformer model.

So, what does deep learning mean? Deep learning is a branch of machine learning, and machine learning uses a data-driven approach to feed corrections back into a model’s parameters and thereby shape its behavior. That may be a bit abstract. Here, the "model" is a function in the broad sense, that is, a mapping; the neural network is the internal implementation of that function, which I will introduce shortly. Back to machine learning: consider, for example, a function f(x) that labels images, where the input x is an image and the output f(x) is its label. The core idea of machine learning is not to define the behavior in code, but to construct a flexible function with adjustable parameters, and then use a large number of examples to let the machine tune those parameters until the function mimics the desired behavior. It is somewhat like the method of undetermined coefficients in mathematics.
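To make the "adjustable parameters" idea concrete, here is a minimal sketch (my own illustration, not from the speech): instead of hard-coding the rule y = 2x + 1, we give the machine a flexible function y = w·x + b and let examples nudge w and b toward the right values.

```python
# Fit the flexible function y = w*x + b to examples via plain gradient descent.
# The machine never sees the rule "y = 2x + 1"; it only sees (x, y) pairs.

def fit_line(examples, steps=2000, lr=0.01):
    """Tune parameters w and b so that w*x + b matches the examples."""
    w, b = 0.0, 0.0  # start from arbitrary (here zero) parameters
    for _ in range(steps):
        # Average gradient of the squared error over all examples
        gw = sum(2 * (w * x + b - y) * x for x, y in examples) / len(examples)
        gb = sum(2 * (w * x + b - y) for x, y in examples) / len(examples)
        w -= lr * gw  # move against the gradient ("advice" on how to improve)
        b -= lr * gb
    return w, b

examples = [(x, 2 * x + 1) for x in range(-5, 6)]  # data generated by y = 2x + 1
w, b = fit_line(examples)
print(round(w, 2), round(b, 2))  # recovers values close to 2 and 1
```

This is exactly the method of undetermined coefficients, done numerically: the form of the function is fixed in advance, and only the coefficients are learned from data.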

With that background, let’s take a quick look at how ChatGPT works. In one sentence: based on all of the preceding text, it predicts which word is most likely to come next, and generates text this way, word by word. It is a bit like the autocomplete feature of a search engine; every time we type a word, the input box starts predicting the text that follows, with more probable completions ranked higher. But how does the model determine the probability of each word?
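A toy version of this "predict the next word" idea can be built with nothing but counting (my own illustration; real models like ChatGPT condition on the entire preceding text with a Transformer, not on a single previous word):

```python
from collections import Counter, defaultdict

# Count, in a tiny corpus, which word most often follows each word, then
# rank candidates by relative frequency -- like a crude autocomplete.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return candidate next words with their probabilities, best first."""
    total = sum(follows[word].values())
    return [(w, c / total) for w, c in follows[word].most_common()]

print(predict_next("the"))  # "cat" ranks highest (it follows "the" twice)
```

Here the "model" is just a frequency table; the question the article turns to next is what replaces this table in ChatGPT.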

This brings us back to the Transformer architecture mentioned above. As stated, a Transformer is essentially a function. Taking f(x) as an example, its input x is all of the preceding text, and its output is the probability of each possible next word. We all know that a function such as the quadratic f(x) = ax^2 + bx + c has parameters a, b, and c — three parameters in total. So we naturally want to ask: how many parameters does ChatGPT’s modified Transformer architecture have? Take a guess. The answer is 175 billion.

Why are there so many parameters, and how does the machine adjust them? Next, let’s use a simpler example, digit recognition, to understand the neural network — the internal implementation mechanism of the function.

First, we must acknowledge a fact: recognizing digits is actually very hard for a computer. Your 3 and my 3 can look very different. To solve this problem, scientists invented something magical called the neural network.

As the name suggests, neural networks are inspired by the structure of the human brain. In a neural network, the most important components are the neurons themselves and the connections between them. A neuron can be understood as a container holding a number, its activation value, which ranges from 0 to 1. For the task of digit recognition, each input neuron corresponds to the brightness of one pixel in the image. Taking a 28x28 pixel image as an example, there are 784 input neurons, forming the first layer of the network. Jumping to the last layer, there are ten neurons representing the digits 0 to 9, with activation values also ranging from 0 to 1; these values represent the network’s confidence that the input image is each digit. Between them are hidden layers that do the actual work of recognizing the digit.
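The layer layout described above can be sketched in a few lines. Note the hidden-layer sizes (two layers of 16, as in the classic MNIST example) are an illustrative assumption of mine; only 784 and 10 come from the text, and real networks also add a bias parameter per neuron, which the text does not mention.

```python
# Layer sizes: 784 input pixels -> hidden -> hidden -> 10 digit scores.
# Hidden sizes of 16 are an assumed example, not fixed by the article.
layers = [784, 16, 16, 10]

# Each neuron in a layer gets one weight per neuron in the previous layer,
# so connecting two consecutive layers costs n_prev * n_next weights.
n_weights = sum(a * b for a, b in zip(layers, layers[1:]))
print(n_weights)  # 784*16 + 16*16 + 16*10 = 12960
```

Even this tiny network has about thirteen thousand weights, which gives a first sense of why a model over all of language needs billions.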

When the neural network runs, the activation values of one layer determine the activation values of the next. The core question, therefore, is how one layer’s activations are computed from the previous layer’s. In fact, neural networks were designed to mimic the biological nervous system, where the activation of certain neurons triggers the activation of others.

The activations of the first layer determine those of the second, and so on, until the last layer, where the brightest neuron represents the network’s choice. So why have layers at all? We hope that each layer learns to recognize certain features. How each layer computes the next layer’s activations depends on "parameters." For example, if the activation values of the first layer’s 784 neurons are a1, a2, a3, …, a784, and the second layer has 16 neurons b1, …, b16, then computing each bi from the first layer requires 784 weights, such as b1 = w1a1 + w2a2 + … + w784a784.

Initially, these weights are all random, and we cannot expect the computer to recognize anything. What we do is present a large number of labelled digit images, "feeding" them to the computer one batch at a time, measuring how "badly" it performs — this is the cost function — and giving it "advice" on how to modify the parameters — the negative gradient of the cost function. Adjusted iteratively in this way, the parameters settle into values that let the network handle images it has never seen before.
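The layer-to-layer computation above (b1 = w1a1 + … + w784a784) can be sketched directly. One detail the article glosses over: a plain weighted sum can land outside 0..1, so real networks squash it with an activation function; I use a sigmoid here as an assumed example.

```python
import math
import random

# One layer of a neural network: each second-layer neuron b_i is a weighted
# sum of the 784 first-layer activations, squashed back into the 0..1 range.
# The sigmoid squashing function is my assumption; the article only gives
# the weighted sum. Weights start out random, exactly as described.
random.seed(0)
n_in, n_out = 784, 16
weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def sigmoid(x):
    """Map any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

def next_layer(a):
    """Compute the 16 second-layer activations from 784 input activations."""
    return [sigmoid(sum(w * x for w, x in zip(row, a))) for row in weights]

a1 = [random.random() for _ in range(n_in)]  # stand-in for pixel brightnesses
b = next_layer(a1)
print(len(b), all(0 <= v <= 1 for v in b))
```

Training then consists of repeating this forward computation on labelled images and nudging every entry of `weights` along the negative gradient of the cost function, as the paragraph above describes.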

In fact, isn’t this just like our own learning? Timely evaluation of our performance, plus finding the most "efficient" way to improve, combined with plenty of practice, will surely make us better learners. (More to be added later QAQ
