Why are temperature and nucleus sampling in language models named this way, when the terms seem to have little to do with the rest of the field? In physics, temperature increases the kinetic energy of particles, leading to greater randomness in motion and the breaking of chemical bonds; under certain circumstances, particles form new products and substances mix together faster. Likewise, in LLMs, increasing the temperature flattens the probability distribution over possible tokens, so that tokens that previously had low probabilities now have a higher chance of being selected, ultimately producing more creative output.
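Here is a minimal sketch of that flattening effect, assuming some hypothetical raw logits over three tokens:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to token probabilities, scaled by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three tokens
print(softmax_with_temperature(logits, temperature=0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, temperature=1.0))  # the unscaled distribution
print(softmax_with_temperature(logits, temperature=2.0))  # flatter: low-probability tokens gain mass
```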
How about nucleus sampling? Nucleus, or top-p, sampling is a decoding technique that selects the smallest set of tokens whose cumulative probability reaches or exceeds a threshold p, starting from the most probable token. A higher value of p results in a more diverse selection of tokens. I think this cumulative probability is quite similar to the mass distribution in an atomic model, where the nucleus, consisting of protons and neutrons, accounts for almost all of the atom's mass while taking up only a tiny core of the electron cloud.
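A small sketch of the idea, assuming a hypothetical five-token distribution:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches the threshold p (the 'nucleus')."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]              # most probable tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_sample(probs, p=0.85))  # samples only from the top tokens covering 85% of the mass
```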
Similarity measurements for embeddings, including Euclidean distance, cosine similarity, and dot product, derive from the Pythagorean theorem and the law of cosines. While the Pythagorean theorem gives us the relationship between the three sides of a right triangle, the law of cosines shows how the cosine of an angle relates to the sides around it. Beyond their algebraic meanings, these three measurements all have corresponding geometric meanings, and I think this is one of the reasons why linear algebra has been so widely used in many fields of engineering.
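For concreteness, a sketch of all three measurements on two toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

# Euclidean distance: straight-line distance, rooted in the Pythagorean theorem
euclidean = np.linalg.norm(a - b)

# Dot product: sum of element-wise products, large when vectors point the same way
dot = np.dot(a, b)

# Cosine similarity: the dot product normalized by the vectors' lengths,
# i.e. cos(theta) of the angle between them, as in the law of cosines
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, dot, cosine)
```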
Information in reality, for example any object we can see with the naked eye or any sound we can hear with our ears, is not directly interpretable by computers. Fortunately, once converted to digital audio, images, video, and text, these types of information become understandable and processable by computers.
Audio in analog form is actually waves of air molecules moving back and forth; the wave travels through the ear canal and reaches the eardrum, causing it to vibrate so that we can hear. Unlike the continuous nature of analog audio, digital audio is divided into samples represented as numbers. This is done by measuring the amplitude at fixed intervals of time, and the sample rate in kHz tells us how many samples are recorded per second. Each sample is stored in a fixed number of bits, known as the bit depth; a higher bit depth yields subtler and more precise quality. You might see 128 kbps, 320 kbps, or FLAC every day while listening to music on digital devices; the first two are bitrates (FLAC is a lossless format), indicating the quality of the audio reproducible by electronic devices. For uncompressed audio the bitrate follows from the formula: bitrate = samples per second × bit depth × number of channels; compressed formats such as MP3 at 128 or 320 kbps discard some of this information in exchange for smaller files. Either way, the higher the bitrate, the better the audio quality.
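As a quick sanity check of the formula, here is a sketch computing the bitrate of uncompressed CD-quality stereo audio, assuming the standard CD parameters:

```python
sample_rate = 44_100   # samples per second (44.1 kHz)
bit_depth = 16         # bits per sample
channels = 2           # stereo

bitrate = sample_rate * bit_depth * channels  # bits per second
print(bitrate / 1000, "kbps")  # 1411.2 kbps, far above a 320 kbps MP3
```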
An image can be thought of as a projection of an object in the 3D world onto a 2D plane, and it is composed of pixels. Each pixel, for example in the RGB color model, is a combination of red, green, and blue intensities. Similar to the conversion from analog to digital audio, the motion of an object over a period of time can be recorded as many images composed into a new form of information: a video.
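A minimal sketch of this pixel representation as an array, with hypothetical 8-bit channel values:

```python
import numpy as np

# A tiny 2x2 RGB image: height x width x 3 color channels,
# each channel an 8-bit intensity from 0 to 255
image = np.array([
    [[255, 0, 0],   [0, 255, 0]],     # red pixel,  green pixel
    [[0, 0, 255],   [255, 255, 255]], # blue pixel, white pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3)
print(image[0, 0])  # [255 0 0] -> the red pixel
```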
In reality, text is visible to humans as physical objects, and it has been storable and processable for decades in the form of bytes. In traditional manipulation, text in any language is processed to perform exact and approximate pattern matching, sorting, as well as statistical analysis.
In summary, given any information in reality, there is always a method to convert it into a representation that is consumable by computers. In machine learning, information becomes even more powerful when it is transformed into a numerical representation, or embedding, that still retains phonetic, semantic, or visual characteristics, enabling machines to “understand”, learn, and perform various advanced tasks such as translation, voice recognition, object detection, question answering, text generation, and so on.
If a neural network of multiple layers has only linear relationships between them, it can be rewritten as a single layer. This single layer may need many neurons in order to represent the patterns in the training data, but what makes a neural network truly powerful is the non-linearity of its activation functions, which allows the network to capture more complex relationships between input and output. Math is about relationships: if we simply think of this relationship as a function, then the more complex it is, the higher its degree. Many architectures, such as RNNs, CNNs, LSTMs, and Transformers, were developed to capture even more complex relationships, and they have been used in a wide range of applications.
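A quick numerical check of that collapse, using random weight matrices as stand-ins for trained layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between...
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
two_layers = W2 @ (W1 @ x)

# ...collapse into a single linear layer with weights W2 @ W1
W = W2 @ W1
one_layer = W @ x
print(np.allclose(two_layers, one_layer))  # True

# A non-linearity such as ReLU between the layers breaks this equivalence
relu = lambda v: np.maximum(v, 0)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))  # False in general
```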
Backpropagation is one of the most fundamental algorithms when working with neural networks, especially in the context of supervised learning. The purpose of the algorithm is to determine how parameters should be updated during the training process to minimize the error calculated by the cost function.
Calculus lies at the heart of the algorithm. This branch of mathematics has two main concepts: the derivative, which measures the rate of change (also known as rise over run) of a function with respect to its arguments at a given point, and the integral, which measures the accumulation of quantities and is usually interpreted as the area under a curve. These concepts remind me of Leibniz and Newton, both of whom are credited with the invention of calculus and known for their work in mathematics and physics. Since backpropagation relies on derivatives, taking an example from calculus-based physics would not be a bad idea.
Imagine someone traveling from A to B on a street, with the displacement over time given by a curve function x(t) with its ups and downs. Taking the derivative of x(t) gives v(t) = dx/dt, the velocity at which he is moving at a given time. Similarly, the derivative of v(t) is the acceleration a(t) = dv/dt, which indicates how he is accelerating or decelerating at a given time. Notice that the values of v(t) and a(t) at a given time t describe not only the rate of change but also the direction, positive or negative, via their sign.
The concept of the derivative is crucial for computing gradients and updating parameters accordingly in backpropagation. After the error between the actual and expected values is calculated by the cost function, the parameters in the last layer are adjusted to reduce that error, with the help of hyperparameters such as the learning rate, and this process continues backward through the hidden layers to the input layer using the chain rule of differentiation.
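A minimal sketch of that backward pass, assuming a tiny two-layer network and a single made-up training point (learning y = 2x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a single point from y = 2x
x, y_true = np.array([[1.0]]), np.array([[2.0]])

# One hidden layer with a tanh activation
W1, W2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 4))
lr = 0.1  # learning rate, a hyperparameter

for step in range(100):
    # Forward pass
    h = np.tanh(W1 @ x)  # hidden activations
    y_pred = W2 @ h      # network output
    loss = 0.5 * ((y_pred - y_true) ** 2).sum()

    # Backward pass: chain rule from the output layer back toward the input
    d_y = y_pred - y_true              # dL/dy_pred
    d_W2 = d_y @ h.T                   # dL/dW2
    d_h = W2.T @ d_y                   # dL/dh
    d_W1 = (d_h * (1 - h ** 2)) @ x.T  # dL/dW1, via tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update
    W2 -= lr * d_W2
    W1 -= lr * d_W1

print(loss)  # should be near zero after training
```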
The idea of taking the derivative of x(t) originally arose to answer Newton's question: what is the exact velocity at a specific time t? Given a difference in time h, the derivative of x(t) is v(t) = dx/dt = (x(t + h) − x(t)) / h as h approaches 0. This solves problems of non-linear motion that the average-velocity equation of algebra-based physics, v = d/t, with t required to be a measurable interval, does not.
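The same definition can be sketched numerically by picking a small h, here for a hypothetical motion x(t) = t²:

```python
# Approximating v(t) = dx/dt with a small but finite h
def x(t):
    return t ** 2  # hypothetical displacement curve

def velocity(t, h=1e-6):
    return (x(t + h) - x(t)) / h  # finite-difference approximation

print(velocity(3.0))  # about 6.0, matching the exact derivative dx/dt = 2t
```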
By nature, objects and particles tend to move toward the lowest energy state of a system in order to stay stable. This principle is fundamental in physics and can be found in many natural processes. For example, a ball rolling downhill has its gravitational potential energy converted to kinetic energy and vice versa, losing energy to friction with the surface and to air resistance over time, until it eventually reaches a state of rest. Another example is in the atomic model: an electron tends to occupy a specific energy level around the nucleus, becoming excited when it absorbs a photon and quickly returning to the ground state by emitting a photon of the same energy. In thermodynamics, systems evolve toward states of maximum entropy (randomness) and minimum energy.
These typical examples, when represented as an energy landscape, are quite similar to the loss landscape of the cost function used in gradient descent, where the algorithm follows the derivative to gradually guide the adjustment of parameters, minimizing the error toward a minimum.
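A sketch of the ball-rolling-downhill analogy on a made-up one-dimensional landscape f(w) = (w − 3)²:

```python
# Gradient descent on a 1D "energy landscape" f(w) = (w - 3)^2,
# whose minimum at w = 3 plays the role of the ball's resting point
def grad(w):
    return 2 * (w - 3)  # df/dw

w, lr = 0.0, 0.1  # starting position and learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)  # step downhill, against the gradient

print(w)  # approximately 3.0, the bottom of the valley
```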
Diffusion models are inspired by the reverse of the diffusion process in physics, in effect going back in time (denoising) by starting from random noise. The forward process can be pictured as particles moving from a region of high concentration, colliding with each other and gradually spreading (adding noise) into regions of lower concentration until equilibrium, the lowest energy state, is reached (pure noise). Temperature, nucleus, diffusion, and equilibrium are all related to physics. But which branch of physics exactly? Three out of four suggest that it probably relates to the study of thermodynamics and statistical mechanics.
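A minimal sketch of the forward (noising) process, loosely following the DDPM-style update with an assumed constant noise schedule beta:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)  # a hypothetical "clean" sample

# Forward diffusion: repeatedly mix in a little Gaussian noise.
# beta controls how much noise each step adds; the square-root scaling
# keeps the variance roughly constant across steps.
beta = 0.05
for t in range(200):
    noise = rng.normal(size=x.shape)
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise

print(x)  # after enough steps, x is close to pure Gaussian noise ("equilibrium")
```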
Thermodynamics deals with heat, work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed by the four laws of thermodynamics, which convey a quantitative description using measurable macroscopic physical quantities, but may be explained in terms of microscopic constituents by statistical mechanics. Thermodynamics plays a role in a wide variety of topics in science and engineering. - Wikipedia
In physics, statistical mechanics is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. Sometimes called statistical physics or statistical thermodynamics, its applications include many problems in the fields of physics, biology, chemistry, neuroscience, computer science, information theory and sociology. Its main purpose is to clarify the properties of matter in aggregate, in terms of physical laws governing atomic motion. - Wikipedia
Perception is added to robots and vehicles to allow them to sense, analyze, and respond to the surrounding physical environment. Inputs from the environment are perceived through sensors: cameras for capturing images and recording video, RADAR, which uses radio waves of different wavelengths, LiDAR for mapping 3D terrain at high resolution using lasers, ultrasonic sensors for measuring distances to obstacles using reflected waves, GNSS for navigation via satellites, and odometry for estimating changes in position over time by measuring motion. Many of these sensors are easily found on commercial airplanes, rockets, and spacecraft for vision, communication, and navigation assistance. As an alternative to LiDAR, I have also come across Monocular Depth Estimation, which covers similar functionality.
While robotics takes advantage of reinforcement learning to learn through trial and error and to improve the outcome of each decision over time, autonomous vehicles are safety-critical and require serious training, planning, and decision-making to deal with traffic rules and the detection and avoidance of other vehicles and obstacles. Compared to an airplane, the autopilot of an autonomous vehicle might work harder to respond proactively to its environment with a small margin of error. An airplane cruising at high speed, on the other hand, cannot change its course of motion immediately, as it may lose speed and altitude and ultimately stall. Besides, a sudden change in motion is rarely necessary for an airplane, since its environment does not usually include obstacles at close distances.
The physical world is perceived by animals, humans, and machines with the help of massive and massless particles. A human is aware of an object only if the object emits or reflects light to the eyes. Bats use ultrasonic waves to detect obstacles in their flight path. Humans can hear sound in a certain range of frequencies thanks to the waves and vibrations of air molecules. Since electromagnetic waves of different wavelengths are composed of photons, massless particles traveling at the constant speed c in a vacuum, they are a means for electronic devices to capture information about reality. Likewise, waves of massive particles, as in analog sound, can be digitized and interpreted by devices such as voice recorders or speakers.
These are a few random ML notes that I found interesting for their connections to physics. Even though they are more thoughts than detailed write-ups, they help me organize and navigate what I have learned earlier this year. I may update or add more sections to this post in the future.