AI technology has advanced remarkably in recent years. AI can now understand human speech, draw impressive pictures, and even write text on its own. However, the more capable an AI model is, the more electricity and storage space it requires.

For example, the AI model GPT-3 has about 175 billion parameters. Parameters can be thought of as the many small switches an AI uses when making decisions. Running a model this large requires a great deal of memory and powerful computers, so it is difficult to run properly on the small devices we commonly use, such as smartphones and laptops.

To solve these problems, scientists and engineers began looking for new methods: ways to reduce a model's size and computation while preserving as much of its performance as possible. These technologies are broadly called 'Model Compression' and 'Cost Optimization.' Model compression is, literally, technology for making large AI models small. The principle is similar to compressing a large photo file so that it takes up less space while the quality stays nearly the same. Cost optimization is technology for making models more efficient so that they use fewer resources such as electricity. Both have recently become particularly important and are being intensively researched at many companies and research institutions.

Model compression matters because small devices with limited processing capability, such as smartphones, smartwatches and other wearables, and IoT (Internet of Things) devices, are becoming ever more numerous. It is difficult to run large AI models as-is on such devices. For example, recognizing voice commands instantly on a smartphone or quickly finding objects in photos requires small, efficient models.

Model compression is also essential in services where fast response is crucial, such as chatbots, voice recognition, and image processing. If answers come too slowly when conversing with an AI, users quickly become frustrated, so technology that makes models smaller and faster is indispensable. It is equally important for cloud servers, which store and process data over the internet, because it saves electricity and reduces costs.

Finally, model compression and cost optimization make AI research easier and faster. Research that always uses very large models takes a great deal of time and money, but when experiments can be run quickly and repeatedly with small, lightweight models, good results can be found more efficiently.

Now let's examine, one by one, how these 'model compression' and 'cost optimization' technologies work and how they are used in our lives.

Method of Compressing and Transmitting Knowledge: Knowledge Distillation

As AI models grow more capable, they also need more electricity and storage space. Various technologies have been developed to reduce model size and computation while preserving performance. One representative technology is Knowledge Distillation.

The name knowledge distillation comes from the distillation process used in making whiskey or perfume. Just as whiskey makers heat the raw materials and collect only the liquid in which alcohol and aroma are concentrated, knowledge distillation extracts only the important information from a complex AI model and transfers it to a smaller model.

The process resembles a capable teacher explaining difficult content in a way students can easily understand. The AI in the teaching role is called the Teacher Model, and the AI in the learning role is called the Student Model. A student model is smaller and simpler than its teacher, but if it learns well it can come close to the teacher's performance.

The core of knowledge distillation is not simply teaching the correct answers but conveying to the student model how confident the teacher was in each possible answer. For example, if the teacher model looks at a photo and predicts an 80% probability of a cat and a 15% probability of a dog, the student model is trained to reproduce similar proportions. These probabilities over all the options, passed along together, are called Soft Targets.

The same principle appears in everyday life. If an art teacher tells a student "paint the parts of the tree touched by sunlight brighter, and the shadowed parts darker" rather than simply saying "this is a tree," the student better understands the overall principle of drawing. Knowledge distillation similarly helps the student model learn the context behind a judgment.

One concept to know here is Logits. Logits are the raw numbers that express how strongly the AI responded to each candidate before it makes a decision. Because these numbers are hard to interpret directly, a calculation called Softmax converts them into probabilities that people can read easily. The student model refers to these probabilities and tries to make judgments similar to the teacher's.
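
The path from logits to soft targets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logit values and the three class names are made up, and the temperature parameter (which softens the probabilities) and the KL-divergence loss are common choices in distillation rather than details given in this chapter.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the student's predictions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current guess
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits for the classes (cat, dog, fox).
teacher_logits = [4.0, 2.3, 0.5]
student_logits = [3.0, 2.5, 1.0]

print("teacher probabilities:", softmax(teacher_logits))
print("distillation loss:", distillation_loss(teacher_logits, student_logits))
```

Training the student means adjusting its weights so that this loss shrinks, which pulls its probability ratios toward the teacher's.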

A student model trained this way is small, computes quickly, and can perform similarly to its teacher. In practice, knowledge distillation is also used to compress a complex system that combines multiple large models into a single simple student model. Such small models work well even in environments with limited computing resources, such as smartphones, web services, and automotive systems.

Smaller models also consume less power and generate less heat, which is better for the environment. Knowledge distillation is therefore an important method that makes AI not just smaller but more efficient and smarter.

Method That Reduces Computation While Helping Generalization: Dropout and Sparse Structures

In general, the more data an AI model learns from, the better it performs. In some cases, however, a model is very accurate on the data used for training but weak on new, unfamiliar data. This phenomenon is called Overfitting: the model has fitted the training data so closely that it adapts poorly to new situations.

One way to address this problem is a technique called Dropout. During learning, dropout intentionally and temporarily turns off some of the computing units inside the AI, which are called neurons. Turning off parts of a model on purpose may sound strange at first. But randomly turning neurons off keeps the model from depending too heavily on a few specific neurons and makes it practice finding answers through various paths.
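
As a rough sketch of the idea, the function below applies dropout to a list of neuron outputs. It assumes the widely used "inverted dropout" variant, in which the surviving values are scaled up during training so that nothing needs to change at use time. The activation values and the 50% drop rate are hypothetical.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero each unit with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)  # every neuron is active when the model is in use
    keep_prob = 1.0 - drop_prob
    # Scale surviving units by 1/keep_prob so the expected output stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]  # hypothetical neuron outputs

train_out = dropout(hidden, 0.5, True, rng)
eval_out = dropout(hidden, 0.5, False, rng)
print("training:", train_out)
print("in use:  ", eval_out)
```

Each training step draws a fresh random mask, so the model cannot rely on any single neuron always being present.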

A similar situation arises in everyday life. A student who prepares for an exam by repeatedly solving only certain types of problems becomes familiar with those problems but may struggle with slightly modified ones, so practicing with a variety of problem types is often more helpful. Dropout likewise gives the model opportunities to think on its own under varied conditions.

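Dropout is sometimes also described as saving computation, since the neurons that are switched off can be skipped during training. In practice the saving is usually small, because common implementations still compute every neuron and then zero out some of the outputs, and all neurons are active again once training is finished. The main benefit of dropout is better generalization, while lasting reductions in computation come from the sparse structures described next.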

A related concept is the Sparse Structure. Deep learning models ordinarily use a Dense Structure in which most neurons are interconnected. In practice, though, not all of these connections are needed, and in many cases good judgments can be made with only some of them.

For example, when ten friends discuss a topic, a conclusion can often be reached after hearing only two or three opinions. In the same way, a neural network can produce good results with only the necessary connections remaining. Sparse structures reduce or eliminate unnecessary connections, which lowers both overall computation and memory usage.
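
To see where the savings come from, here is a minimal sketch that stores only the nonzero connections of a small layer and computes with those alone. The weight values are hypothetical, and the list-of-triples format is just one simple way to represent a sparse layer, chosen here for readability.

```python
def to_sparse(dense):
    """Keep only the nonzero weights as (row, column, value) triples."""
    return [(i, j, w) for i, row in enumerate(dense) for j, w in enumerate(row) if w != 0.0]

def sparse_matvec(triples, x, n_rows):
    """Multiply using only the stored connections."""
    y = [0.0] * n_rows
    for i, j, w in triples:
        y[i] += w * x[j]
    return y

def dense_matvec(dense, x):
    """Ordinary multiplication that visits every connection, including the zeros."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in dense]

# Hypothetical 3x4 layer in which only 3 of the 12 connections are in use.
dense = [
    [0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, -2.0],
    [0.7, 0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

triples = to_sparse(dense)
print("stored connections:", len(triples), "of", len(dense) * len(dense[0]))
print("dense result: ", dense_matvec(dense, x))
print("sparse result:", sparse_matvec(triples, x, len(dense)))
```

Both versions give the same answer, but the sparse one stores and multiplies a quarter as many numbers in this toy case.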

Models built this way are smaller and run faster, so they can be used efficiently even on small devices like smartphones. Dropout and sparse structures thus play important roles in making models lighter, more robust, and more practical.

Technology for Cutting Away Unnecessary Parts: Model Pruning

Another way to compress a model is to remove the less important parts of a neural network that has already been trained. This method is called 'Model Pruning.' The name comes from pruning a tree, that is, cutting away its unnecessary branches. In AI models, pruning simplifies the network by cutting away neurons or connections (weights) that have little effect on the results.

The core of pruning is evaluating how important each neuron or connection is to the results and gradually removing those of lowest importance first. It is like clearing unneeded objects off a desk: the tangled connections inside the model are tidied up. This reduces the model's overall size and, with it, the computation and memory it uses.

Early methods mainly eliminated connections whose numerical values were small or whose influence was negligible. More recently, more efficient and precise methods have come into use. For example, low-importance connections are trimmed a little at a time, the model is retrained in between, and performance is re-optimized at the end. This improves efficiency while preserving model performance after pruning.
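
The simplest version of this idea, removing the connections with the smallest values, might look like the sketch below. The weights and the 50% pruning ratio are hypothetical, and the retraining between pruning rounds mentioned above is left out to keep the example short.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    n_prune = int(len(weights) * sparsity)
    # Order the positions from least to most important, judged by magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_remove = set(order[:n_prune])
    return [0.0 if i in to_remove else w for i, w in enumerate(weights)]

# Hypothetical connection weights from a trained layer.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.001]

pruned = magnitude_prune(weights, 0.5)
print(pruned)
```

In a gradual scheme, a function like this would be called several times with a slowly increasing ratio, with some training after each call so the remaining connections can compensate.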

Beyond simply making models smaller, pruning is closely tied to hardware efficiency and processing speed. A model with fewer unnecessary connections can run quickly even on smartphones and other small devices, so pruning is now frequently used to build small, fast AI models that are better suited to practical use.

In this way, model pruning has established itself as an important technology for making AI practical in real environments by improving computational efficiency and saving memory.

Selective Structure for Improving Computation Efficiency: Mixture of Experts

As AI models grow ever larger and more complex, computational efficiency has become a very important challenge. In large-scale language models in particular, heavy computation means long training times and substantial resources even at the point of use. One structure that emerged to address this is Mixture of Experts, abbreviated MoE.

Mixture of Experts is like forming a team of specialists, each excellent in a specific field, and entrusting each piece of work only to the most suitable one. Preparing a school festival, for example, calls for various specialists: an audio specialist installs the sound equipment, a lighting specialist handles the lighting, and an art specialist decorates the stage. Because each specialist does only the work suited to them rather than everyone doing everything, overall efficiency improves and the results are better. A Mixture of Experts model likewise contains small specialist models prepared in advance, and when a new sentence is input, it looks at the sentence's characteristics and entrusts processing only to the most suitable specialists.

The 'Expert' mentioned here is a specific computation module inside the AI. The MoE concept is applied in particular to the Feed-Forward Network (FFN) component of Transformers. Where an ordinary Transformer has a single FFN per layer, a Mixture of Experts model prepares multiple FFNs in advance and selects only the ones it needs each time.

Then who decides which specialist to select? That role belongs to a small module called the 'Router.' The router looks at the characteristics of the input sentence or word and judges which experts can process it best. For example, it might select only two of 64 specialists to take part in the calculation. So even though the overall model is very large, the computation actually performed at any one time is limited, which makes it much more efficient.
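
The router's selection step can be sketched as follows. Everything here is a simplified assumption: the four "experts" are tiny stand-in functions rather than real FFNs, and the router scores are given directly, whereas in a real model a small learned layer computes them from the input. Weighting the chosen experts by a softmax over their scores is one common design, not the only one.

```python
import math

def softmax(scores):
    """Turn scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_scores, k):
    """Return (expert index, gate weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_scores[i] for i in chosen])  # renormalize over the chosen experts
    return list(zip(chosen, gates))

def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_scores, k))

# Stand-in experts: each simple function takes the place of one FFN.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 1.0]  # hypothetical scores for one input

print("selected:", top_k_route(router_scores, 2))
print("output:  ", moe_layer(1.0, experts, router_scores, k=2))
```

Only two of the four expert functions are ever called for this input, which is where the computational saving comes from.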

A familiar comparison is a general hospital with specialists in various fields. Patients are not all seen by every doctor at once. A dermatologist sees some patients and an orthopedic surgeon sees others. AI models similarly select only the suitable specialists for each input and compute with them.

The advantages of this structure go beyond reducing computation. Because different specialists are selected for different inputs, each specialist naturally develops its own specialty. One might become strong at numerical calculation, for example, while another becomes good at changing sentence styles. Multiple specialists with different capabilities work together in harmony inside one large model.

Mixture of Experts structures are used in a number of large-scale AI models. Because only the suitable specialists are selected for each situation instead of running every computation, the structure helps make models faster and smarter.

As AI models continue to grow larger and more diverse, structures that selectively perform only the necessary calculations are expected to become even more important. Just as knowing who is good at what and dividing roles accordingly is effective in human society, the time has come for AI to apply the same wisdom.

Method That Fixes Only Necessary Parts: Parameter-Efficient Fine-tuning

AI models generally first acquire basic capabilities by training on large amounts of data. This is called pre-training. A pre-trained model can later be used for many purposes, but actually using it requires a stage of slight adjustment to suit each task. This process is called fine-tuning. For example, a language model first learns general writing ability and is then adapted to specific tasks such as document summarization, translation, or question answering.

However, recent large language models have tens of billions of parameters or more (parameters are the numbers adjusted during learning). Retraining all of them takes a great deal of time and money, which makes it very difficult for individuals and small organizations to use such large models.

Parameter-Efficient Fine-Tuning emerged to solve exactly this problem. The core idea is to adjust only the necessary parts rather than retrain the entire model. It is like revising only a few sentences to adapt a book for a different purpose instead of rewriting the whole book.

One such technique is Prompt Tuning. This method leaves the model's internals unchanged and instead attaches a small set of hints, called a prompt, to the input. It resembles giving additional instructions or explanations along with a question, except that the hints themselves are learned through training.

Another method is Adapters. It inserts small, simple additional modules between the layers of the existing model and trains only these added modules. The body of the model stays as it is, and different tasks can be handled by swapping or adding small adapters.

LoRA (Low-Rank Adaptation) keeps the existing model's weights fixed and adds small low-rank matrices alongside them, training only those additions. It is like changing the style of clothes you already own by adding a few small decorations or buttons instead of buying new clothes. Such methods are very efficient, because good performance can be obtained with little computation and storage space.
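
A minimal sketch of the LoRA idea follows. The original weight matrix W is left untouched, and two small matrices, here called A and B, provide a trainable correction. The matrix values are hypothetical, and starting B at zero (so the adapted model initially behaves exactly like the original) is a common convention assumed here rather than something stated in this chapter.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, scale=1.0):
    """Compute W x + scale * B (A x): W stays frozen, only A and B would be trained."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # small trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]

def param_counts(d_out, d_in, rank):
    """Parameters in the full matrix versus in the two low-rank matrices."""
    return d_out * d_in, rank * (d_in + d_out)

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.5]]  # frozen 3x3 weights
A = [[0.1, 0.2, 0.3]]                                      # rank-1 matrix, 1x3
B = [[0.0], [0.0], [0.0]]                                  # 3x1 matrix, zero at the start
x = [1.0, 2.0, 3.0]

print("output before any training:", lora_forward(x, W, A, B))
full, low_rank = param_counts(4096, 4096, 8)
print("full matrix:", full, "parameters; LoRA matrices:", low_rank, "parameters")
```

For a 4096-by-4096 layer with rank 8, the two added matrices hold well under one percent of the parameters in the original matrix, which is why so little storage is needed per task.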

Parameter-efficient fine-tuning is particularly practical when one model must handle various tasks such as summarization, translation, and question answering. Rather than creating a new model each time, different purposes can be served by adjusting only small parts. As a result, not only large companies and research institutions but also individuals and small teams can use large models flexibly and efficiently.

These technologies are expected to be used even more widely. LoRA and related techniques are being rapidly adopted in real environments and play an important role in making AI lighter and more practical to use.

Reducing Precision Makes AI Lighter: Quantization

While operating, AI models perform calculations on a great many numbers. These are not simple integers but values expressed very precisely, down to many decimal places. The number 70, for example, might be handled as 70.123456 to account for very small differences. Such precision makes results accurate, but it also makes calculations more complex and requires more storage space. Usually 32 bits, that is, 32 binary digits of information, are used to express one number.

Processing so many high-precision numbers takes a long time and consumes a lot of electricity. In environments with limited performance, such as smartphones, it is difficult to use such models as-is. The technology that emerged to solve this problem is Quantization.

Quantization simplifies calculations by reducing the precision of numbers. For example, reducing numbers from 32 bits to 8 bits or 4 bits speeds up calculations and reduces memory usage. Because the numbers are expressed more simply, the circuits needed to process them are simpler too, so power usage falls as well.
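
One simple way to carry out such a reduction is sketched below, assuming a symmetric 8-bit scheme in which all values share a single scale factor. The weight values are hypothetical, and real systems often use more elaborate schemes (for example, separate scales for different groups of values).

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # size of one integer step
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weights that would normally be stored as 32-bit floats.
weights = [0.82, -1.27, 0.003, 0.5, -0.31]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print("integers:", q)
print("restored:", restored)
print("largest error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each value now fits in one byte instead of four, and the error introduced is at most about half of one step, which is the "70.135 kg versus about 70 kg" trade-off in numerical form.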

The same idea is common in everyday life. A scale may measure weight precisely as 70.135 kg, but in daily life about 70 kg is usually sufficient. The small difference rarely matters.

Research is now actively pursuing even fewer bits per number. Where 8 bits was once considered sufficient, various attempts are being made to use 4 bits or even 2 bits. This makes calculation still faster and greatly reduces memory usage and power consumption.

However, if numbers are simplified too much, calculation accuracy may drop. Quantization-Aware Training emerged to address this. The method exposes the model to a low-precision environment while it is being trained, so that it maintains good performance even when precision is later reduced. It is like practicing recognizing objects in blurry photos.
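
A toy version of this training idea is sketched below for a model with a single weight. The forward pass always uses a rounded ("fake-quantized") copy of the weight, while the updates are applied to a full-precision copy. Treating the rounding step as if it were not there when computing the update is known as the straight-through estimator. That choice, the step size of 0.25, and the tiny dataset are all assumptions made for illustration.

```python
def fake_quantize(w, step):
    """Round w to the nearest multiple of step, keeping it as a float."""
    return round(w / step) * step

def train_qat(xs, ys, step, lr=0.05, epochs=50):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision weight that actually receives the updates
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            wq = fake_quantize(w, step)  # the model experiences low precision here
            error = wq * x - y
            # Straight-through estimator: pass the gradient through the rounding unchanged.
            grad = 2 * error * x
            w -= lr * grad
    return w, fake_quantize(w, step)

# Hypothetical data generated by the rule y = 0.5 * x.
xs = [1.0, 2.0, 3.0]
ys = [0.5, 1.0, 1.5]

latent, quantized = train_qat(xs, ys, step=0.25)
print("full-precision weight:", latent)
print("weight the model actually uses:", quantized)
```

Training stops changing the weight once the rounded value fits the data, so the model ends up tuned for the low-precision number it will really run with.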

Quantization is already used in many products and services. It is useful wherever fast response and low power consumption matter, for example in voice recognition on smartphones and in camera photo filters. Using a simple quantized model instead of a complex one enables fast, efficient operation.

Ultimately, quantization greatly improves computational efficiency at the cost of slightly reduced numerical precision. Combined with appropriate training methods, it lets models stay small and lightweight while maintaining sufficient performance. Quantization is therefore an important technology for making AI more practical and effective.

Methods of Using Trained Models in Practice: Deployment Frameworks

Once an AI model has been made small and fast, the next important task is making it work well in real environments. No matter how outstanding a model's performance, its value is greatly reduced if it cannot run properly where it is actually used, such as on smartphones, in web browsers, and on cloud servers. A model that cannot be used in real life never leaves the laboratory.

For this reason, tools are needed that help AI models run easily on various devices and platforms. These tools are called Deployment Frameworks. A deployment framework optimizes a model and performs the necessary conversions so that it runs well when it moves from the training environment to the actual service environment.

Representative deployment frameworks include TensorFlow Lite, ONNX Runtime, and TorchScript. These tools convert deep learning models to suit environments such as smartphones, web browsers, and servers. In the process they also remove unnecessary computation and reorder calculations efficiently to improve speed and stability.

Compiler-based deployment tools that provide even finer optimization are also under active development. Tools such as TensorRT, TVM, and Glow analyze a model's internal calculations in detail and generate a calculation order optimized for the target device, or restructure the work so that multiple operations are processed at once. As a result, the same model can run faster with less power and fewer resources.

Ultimately, deploying a model is not simply moving a trained model to another device. It is a process of redesigning and refining the model with its actual operating environment in mind. Models that have gone through deployment leave the laboratory and meet people in everyday life, which is where AI technology begins to show its true value in the real world.

Movement of Technology Toward Computation Efficiency and Practicality

In this chapter we examined various technologies for making AI models small and fast. We covered knowledge distillation, which compresses and passes on a complex model's judgment process; dropout and sparse structures, which improve generalization and reduce unnecessary computation; and model pruning, which keeps only the necessary connections. We also covered Mixture of Experts models, which perform computation selectively according to the input; parameter-efficient fine-tuning, which adjusts only part of a model rather than all of it; quantization, which lowers numerical precision; and deployment frameworks, which prepare trained models for real environments.

These technologies were born not just to reduce the amount of computation but from efforts to let more people use AI easily in real life. Devices with limited performance and battery life, such as smartphones, need small and efficient models, yet performance cannot simply be sacrificed. And on cloud servers, delivering the same performance with less energy and cost is far more efficient.

Ultimately, making models small and efficient is not merely a technical problem. It is closely tied to making technology more accessible, strengthening sustainability, and improving ease of use in real life. For AI to be used more widely and to settle naturally into everyday life, it must keep developing in directions that maintain performance while saving resources and letting anyone use it easily.