GPT-4 Omni Is Fast! Here’s Why

Debanjan Saha
9 min read · May 18, 2024


I have been experimenting with the various Large Language Models (LLMs) released recently, but since OpenAI released GPT-4 Omni (GPT-4o), I have been captivated by its capabilities. Here is a deep dive into what Omni brings to the table.

Capabilities

While GPT-4 was already popular and powerful, GPT-4o packs quite a punch. Here is what has improved:

  1. Generation of more coherent and detailed images from text
  2. Browsing the web for current information. Gone are the days when ChatGPT used to say that its knowledge was limited to the data available at its last training cutoff.
  3. Running code snippets inside OpenAI runtime containers to ensure the accuracy of the code. Gone are the days when developers had to spend hours trying to debug the code provided by ChatGPT.
  4. Improved data analytics, wherein ChatGPT can run code or create visualisations by itself to determine which model or visual works best.
  5. Improved memory management, wherein ChatGPT can remember key information like birthdays, anniversaries, meetings, flight schedules, etc. that the user shares with it.

While some of these features were already present in GPT-4, OpenAI has been constantly experimenting with newer features for Omni and rolling them out. I recall that a couple of weeks ago I asked ChatGPT about an incident that had taken place very recently, and it described the incident accurately. This was most likely OpenAI canary-testing their new features.

If all these advanced features don’t make ChatGPT the GOAT, it has one final trick: SPEED. Yes, you heard that right. GPT-4o is one of the fastest LLMs on the market right now. Anyone who has spent some time working with LLMs knows that latency is a very big problem, even for models 1/1000th the size of ChatGPT, but OpenAI has nailed this to perfection.

Latency

To be really honest, I was quite impressed by the inference speed of GPT-4o, and to understand this engineering marvel I spent some time digging into the “closed” model, trying to figure out what OpenAI could possibly have done. Here are my findings:

  1. Optimized Model Architecture: They’ve fine-tuned the model architecture to balance performance and speed, employing advanced techniques like transformer optimizations and efficient layer implementations.
  2. Hardware Acceleration: Leveraging state-of-the-art hardware, including GPUs and TPUs, helps accelerate the computation required for model inference. This specialized hardware is designed to handle the parallel processing needs of large models efficiently.
  3. Distributed Computing: By distributing the computation across multiple nodes and using high-speed interconnects, the system can handle large-scale data processing more efficiently.
  4. Software Optimizations: Improvements in the software stack, including better algorithms for data handling, memory management, and parallel processing, contribute to faster inferencing times.
  5. Caching and Preprocessing: Implementing smart caching mechanisms and efficient data preprocessing steps helps reduce the overall time required for inferencing.

All of these look promising, but we won’t understand exactly what has changed until we really dive into each of them. Be advised, this might just blow your mind.

Optimized Model Architecture

Sparse Attention Mechanisms

Those who have been working with attention mechanisms for a while know that computing self-attention is expensive. Self-attention has quadratic complexity in the sequence length, which can be written as O(dn^2), where n is the sequence length and d is the embedding dimension, because attention scores are computed for every pair of tokens. Now imagine the computation required for multi-head attention, wherein several such self-attention heads (including masking) are calculated in parallel.

Instead of computing attention scores for every pair of tokens, sparse attention mechanisms focus only on the most relevant tokens, reducing computational complexity. A paper published at NeurIPS 2023 introduced the concept of Dynamic Sparse Flash Attention, which seems like a key ingredient OpenAI could be using. More avid readers are encouraged to read the paper [1].

Image from Dynamic Sparse Flash Attention [1], Pagliardini et al. 2023, showing how computing a sparse subset of Flash Attention improves computation
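To make the contrast concrete, here is a minimal PyTorch sketch (purely illustrative, not the algorithm from [1] and certainly not whatever OpenAI actually runs): a standard dense scaled dot-product attention next to a toy top-k sparse variant that keeps only the highest-scoring keys for each query. The toy version still materialises the full score matrix; the point of [1] is to fuse the dynamic sparsity pattern into a FlashAttention-style kernel so the dropped pairs are never computed at all.

```python
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    # Standard scaled dot-product attention: the (n x n) score matrix makes it
    # O(n^2 * d) in both time and memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def topk_sparse_attention(q, k, v, topk=32):
    # Keep only the `topk` highest-scoring keys per query and mask out the rest.
    # Toy illustration only: the full score matrix is still built here, whereas
    # Dynamic Sparse Flash Attention [1] fuses the sparsity into the kernel so
    # dropped pairs are never computed.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    kth_best = scores.topk(topk, dim=-1).values[..., -1:]   # per-query threshold
    scores = scores.masked_fill(scores < kth_best, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(dense_attention(q, k, v).shape, topk_sparse_attention(q, k, v).shape)
```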

Layer Pruning

Pruning is not a new concept in the Artificial Intelligence (AI) world; it has been around for a long time (think decision-tree pruning). Pruning in the context of Deep Learning (DL) refers to removing less critical neurons, or even entire layers, that contribute minimally to the final output, which helps reduce the overall model size and inference time without significantly impacting performance.

After the success of the Transformer architecture (Vaswani et al., 2017), many researchers worked on pruning the attention matrix [B x H x T x T]. One line of work pruned attention heads, shrinking it to [B x H' x T x T] (Li et al., 2021 [2]), while another (Goyal et al., 2020 [3]) pruned tokens (dropping entire word vectors), altering the attention matrix to [B x H x T_Q x T_KV]. Both approaches dramatically increase inference speed without compromising too much on output accuracy.

Image from PoWER-BERT [3], Goyal et al. 2020, showing how consecutive encoder layers remove word vectors with lower significance

In my opinion, OpenAI has probably used some framework that combines both of these pruning mechanisms to reduce inference complexity without compromising on accuracy.
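As a rough illustration of those two ideas (and only that), here is a small PyTorch sketch that drops low-scoring attention heads, in the spirit of [2], and low-significance tokens, in the spirit of PoWER-BERT [3]. The scoring rules here are deliberately simplistic stand-ins for the learned or statistics-based criteria the papers actually use.

```python
import torch

def prune_heads(attn_weights, head_scores, keep_heads):
    # attn_weights: (B, H, T, T). Keep only the `keep_heads` highest-scoring heads,
    # in the spirit of differentiable subset pruning of heads [2].
    idx = head_scores.topk(keep_heads).indices                  # (H',)
    return attn_weights[:, idx]                                 # (B, H', T, T)

def prune_tokens(hidden, attn_weights, keep_tokens):
    # hidden: (B, T, D). Score each token by the total attention it receives,
    # as in PoWER-BERT [3], and keep only the top `keep_tokens` of them.
    significance = attn_weights.sum(dim=(1, 2))                 # (B, T)
    idx = significance.topk(keep_tokens, dim=-1).indices        # (B, T')
    return torch.gather(hidden, 1,
                        idx.unsqueeze(-1).expand(-1, -1, hidden.shape[-1]))

B, H, T, D = 2, 12, 128, 768
attn = torch.rand(B, H, T, T).softmax(dim=-1)
hidden = torch.randn(B, T, D)
attn_pruned = prune_heads(attn, head_scores=torch.rand(H), keep_heads=8)
hidden_pruned = prune_tokens(hidden, attn, keep_tokens=64)
print(attn_pruned.shape, hidden_pruned.shape)  # (2, 8, 128, 128), (2, 64, 768)
```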

Efficient Transformers

The elephant in the room was definitely the quadratic complexity of self-attention, as I mentioned above, so that had to be addressed. Kitaev et al., 2020, introduced the Reformer [5], which reduced the computation to O(n log n) using Locality-Sensitive Hashing (LSH) and deserves a special mention, although this technique was found to pay off only when sequence lengths exceed roughly 2048 tokens.

Image from Reformer [5] showing how the LSH technique reduces computational complexity
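To give a flavour of how LSH restricts which tokens may attend to each other, here is a simplified PyTorch sketch. For clarity it builds a full n x n mask, so the demo itself remains quadratic; the actual Reformer sorts tokens by bucket and attends within fixed-size chunks to reach O(n log n).

```python
import torch
import torch.nn.functional as F

def lsh_bucket_attention(q, k, v, n_buckets=16):
    # Angular LSH as in Reformer [5]: random rotations followed by argmax.
    # Tokens are only allowed to attend to tokens in the same hash bucket.
    d = q.shape[-1]
    rotations = torch.randn(d, n_buckets // 2)
    def bucket(x):
        proj = x @ rotations                                     # (n, n_buckets // 2)
        return torch.cat([proj, -proj], dim=-1).argmax(dim=-1)   # (n,) bucket ids
    same_bucket = bucket(q)[:, None] == bucket(k)[None, :]       # (n, n) mask
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~same_bucket, float("-inf"))
    # This demo materialises the full mask; Reformer instead sorts tokens by
    # bucket and attends within fixed-size chunks to avoid the n x n cost.
    return F.softmax(scores, dim=-1) @ v

n, d = 512, 64
q = k = torch.randn(n, d)        # Reformer shares queries and keys
v = torch.randn(n, d)
print(lsh_bucket_attention(q, k, v).shape)   # (512, 64)
```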

Fortunately, Facebook AI Research (FAIR) had already done some of the hard work: Wang et al., 2020 introduced Linformer [4], which reduces the quadratic complexity to linear by exploiting the low-rank structure of the attention matrix, projecting the keys and values down to a fixed length with learned linear projections so that the scaled dot-product attention operates on much smaller matrices.

Image from Linformer [4] showing the Transformer architecture on the left, Linformer performance on the right, and the low-rank projection involved
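Here is a minimal single-head sketch of the Linformer idea in PyTorch, assuming a fixed maximum sequence length; the layer sizes and the projection length k are illustrative choices, not values taken from the paper or from any production system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    """Minimal single-head Linformer-style attention [4]: keys and values are
    projected from sequence length n down to a fixed length k, so the attention
    matrix is (n x k) instead of (n x n)."""
    def __init__(self, d_model, seq_len, k=256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        # Shared projection along the sequence dimension for keys and values.
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, x):                       # x: (B, n, d_model)
        queries = self.q(x)
        keys, values = self.kv(x).chunk(2, dim=-1)
        keys = self.E @ keys                    # (B, k, d_model)
        values = self.E @ values                # (B, k, d_model)
        attn = F.softmax(queries @ keys.transpose(-2, -1) * self.scale, dim=-1)  # (B, n, k)
        return attn @ values                    # (B, n, d_model)

x = torch.randn(2, 1024, 512)
print(LinformerSelfAttention(d_model=512, seq_len=1024, k=256)(x).shape)  # (2, 1024, 512)
```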

Following that, researchers at Google and DeepMind introduced the Performer [6] architecture (Choromanski et al., 2020), which estimates full-rank softmax attention using kernel feature maps built from positive orthogonal random features (FAVOR+), achieving linear time and space complexity without relying on sparsity or low-rankness, and it comes with strong mathematical guarantees.
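The sketch below approximates softmax attention with positive random features in the spirit of FAVOR+. For simplicity it draws i.i.d. Gaussian features rather than the orthogonal ones the paper recommends and it omits causal masking, so treat it as a conceptual illustration rather than a faithful Performer implementation.

```python
import torch

def performer_attention(q, k, v, n_features=256):
    # Positive random features approximating softmax attention [6]:
    # phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), rows of W ~ N(0, I).
    # (The real FAVOR+ uses orthogonal random features for lower variance.)
    d = q.shape[-1]
    q, k = q / d ** 0.25, k / d ** 0.25          # fold in the usual 1/sqrt(d) scaling
    W = torch.randn(n_features, d)
    def phi(x):
        return torch.exp(x @ W.T - (x ** 2).sum(-1, keepdim=True) / 2) / n_features ** 0.5
    q_p, k_p = phi(q), phi(k)                    # (n, m) each
    # Linear-time reordering: build phi(K)^T V and the normaliser first,
    # so the n x n attention matrix is never formed.
    kv = k_p.T @ v                               # (m, d)
    z = 1.0 / (q_p @ k_p.sum(dim=0))             # (n,)
    return (q_p @ kv) * z[:, None]               # (n, d)

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(performer_attention(q, k, v).shape)        # (1024, 64)
```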

OpenAI most likely used a combination of these techniques, as presented in Linformer or Performer, in conjunction with their customized attention kernels, which approximate the full attention mechanism with lower computational overhead, allowing faster processing while maintaining accuracy.

Mixed Precision Training

Previous research (Micikevicius et al., 2017 [7]) had already demonstrated how using lower precision (e.g., FP16) for certain parts of the model during training and inference can speed up computation significantly without sacrificing much accuracy. More recently, research [8] has shown that this can be pushed further by quantizing weight and activation tensors down to 4 bits using advanced techniques like adaptive Gradient Scaling (GradScale), promising up to a 7x boost over standard FP16 systems without giving up much accuracy. Since most of these techniques involve some kind of framework for keeping accuracy intact at low precision, OpenAI is likely using its own quantization framework, most probably a combination of FP4, FP8, and FP16.
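For reference, this is roughly what plain FP16/BF16 mixed-precision inference looks like with PyTorch’s autocast; it is only a baseline sketch, and the 4-bit techniques from [8] require custom kernels and scaling machinery well beyond this.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(8, 512)

# Mixed-precision inference: matmul-heavy ops run in half precision under
# autocast, while numerically sensitive ops are kept in higher precision.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, x = model.to(device), x.to(device)
dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=dtype):
    y = model(x)
print(y.dtype)   # torch.float16 on GPU, torch.bfloat16 on CPU
```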

Layer Normalization Improvements

A Transformer traditionally needs a carefully designed learning-rate warm-up stage and carefully placed layer normalization. Placing layer normalization between the residual blocks (Post-LN) is known to produce very large gradients near the output layers at initialization, which is undesirable and makes training unstable. Placing it inside the residual blocks (Pre-LN), on the other hand, yields much more manageable gradients. Research [9] has shown that Pre-LN Transformers can be trained without the warm-up stage at all, significantly reducing training and fine-tuning time without sacrificing much accuracy, and optimizing the implementation of layer normalization itself further reduces the computational cost of this step.
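A minimal Pre-LN block in PyTorch looks like the following; the layer sizes are arbitrary, and the only point is to show where the LayerNorm sits relative to the residual connections.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN Transformer block: LayerNorm sits inside the residual branch
    (before attention / MLP), as analysed in Xiong et al. [9]."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.ln2(x))                      # residual around MLP
        # A Post-LN block would instead compute x = ln(x + sublayer(x)),
        # which needs learning-rate warm-up to train stably.
        return x

x = torch.randn(2, 128, 512)
print(PreLNBlock()(x).shape)   # (2, 128, 512)
```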

Gradient Checkpointing

Gradient checkpointing [10], originally introduced by Chen et al., 2016, has been a well-known technique in the DL world for a while now. It analyses the computational graph and discards selected intermediate feature maps, which is essentially a trade-off between memory consumption and computation cost: memory is saved during training by recomputing some of the intermediate activations instead of storing them, and similar memory-aware graph management can be applied to make inference more efficient. Since GPT-4o is generally available as both a mobile and a desktop application, it is crucial that it leaves behind a minimal memory footprint. In my opinion, OpenAI has done a great job here and manages computation graphs really well during real-time inference.
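In PyTorch, the standard way to apply this idea is torch.utils.checkpoint. The sketch below checkpoints a toy stack of layers during training; note that checkpointing itself is a training-time memory trick, since inference does not need to keep activations around for a backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing [10]: activations of the checkpointed segments are not
# stored during the forward pass; they are recomputed during backward, trading
# extra compute for a much smaller memory footprint.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                         for _ in range(16)])
x = torch.randn(32, 1024, requires_grad=True)

out = checkpoint_sequential(layers, 4, x, use_reentrant=False)  # 4 segments (PyTorch >= 2.0)
out.sum().backward()   # intermediate activations are recomputed here, not cached
print(x.grad.shape)    # (32, 1024)
```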

These were some of the key ingredients of the technological marvel behind GPT-4o. However, even with these state-of-the-art techniques serving users at massive scale in real time, one might note that many of them are really trade-offs between one thing and another, particularly when the company is optimizing the product for speed. There seems to be a strong focus on what to keep and what can be omitted to get a really fast model, but the push for speed should not come at the cost of accuracy, especially when pruning plays such an important role.

In order to maintain the model’s accuracy even after pruning, OpenAI likely leverages techniques such as:

  1. Adaptive Pruning: The model can use dynamic techniques to decide which parts of the network to prune based on the specific context and importance of the information. This ensures that less critical pathways are pruned while retaining key information.
  2. Contextual Memory Mechanisms: Implementing memory mechanisms that retain important context across interactions can help. For example, transformers with enhanced memory components can store and retrieve relevant context even if certain pathways are pruned.
  3. Attention Mechanisms: Sophisticated attention mechanisms can help the model focus on the most relevant parts of the input dynamically. Even if some paths are pruned, the attention mechanism can still highlight important context from the remaining pathways.
  4. Regularization Techniques: These techniques can help maintain the balance between pruning and retaining essential information. They ensure that the pruning process doesn’t disproportionately affect critical pathways.
  5. Post-Pruning Fine-Tuning: After pruning, the model can be fine-tuned on relevant tasks to adjust for any potential loss of information, ensuring it still performs well in understanding and generating accurate responses (see the sketch after this list).
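As a concrete (and deliberately simplified) example of the last point, here is a magnitude-pruning-then-fine-tuning loop built on torch.nn.utils.prune. It is not adaptive pruning and says nothing about how OpenAI actually does this, but it shows why a short fine-tuning pass after pruning helps the remaining weights recover the lost accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# 1. Prune: zero out the 30% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# 2. Post-pruning fine-tuning: train briefly so the remaining weights compensate.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, y = torch.randn(64, 512), torch.randint(0, 10, (64,))   # toy stand-in data
for _ in range(10):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3. Make the pruning permanent by removing the re-parametrization hooks.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
print(loss.item())
```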

By doing all that, OpenAI has created a model that is both very accurate and super fast in real time, giving birth to GPT-4o and taking the LLM competition to the next level.

Everything I have mentioned in this article is based on my understanding of Large Language Models and LLM optimization. The opinions are solely my own and do not represent those of my employer or anyone else. While the actual details are intellectual property reserved at the sole discretion of OpenAI, I have tried to explain how using techniques like these might lead to the creation of GPT-4o.

If you liked this post, don’t forget to clap and share it with anyone else who might find it beneficial. Feel free to check out my other articles, where I have talked about various other concepts related to Data Science, Machine Learning, Natural Language Processing, and more. If you think I have done a good job, do follow me on Medium; it motivates me to write more such articles. Take care, until next time!

References

  1. Pagliardini, M., Paliotta, D., Jaggi, M., & Fleuret, F. (2023). Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention. Neural Information Processing Systems.
  2. Li, J., Cotterell, R., & Sachan, M. (2021). Differentiable Subset Pruning of Transformer Heads. Transactions of the Association for Computational Linguistics, 9, 1442–1459.
  3. Goyal, S., Choudhury, A.R., Raje, S., Chakaravarthy, V.T., Sabharwal, Y., & Verma, A. (2020). PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. International Conference on Machine Learning.
  4. Wang, S., Li, B.Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-Attention with Linear Complexity. ArXiv, abs/2006.04768.
  5. Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The Efficient Transformer. ArXiv, abs/2001.04451.
  6. Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L.J., & Weller, A. (2020). Rethinking Attention with Performers. ArXiv, abs/2009.14794.
  7. Micikevicius, P., Narang, S., Alben, J., Diamos, G.F., Elsen, E., García, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., & Wu, H. (2017). Mixed Precision Training. ArXiv, abs/1710.03740.
  8. Sun, X., Wang, N., Chen, C., Ni, J., Agrawal, A., Cui, X., Venkataramani, S., Maghraoui, K.E., Srinivasan, V., & Gopalakrishnan, K. (2020). Ultra-Low Precision 4-bit Training of Deep Neural Networks. Neural Information Processing Systems.
  9. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T. (2020). On Layer Normalization in the Transformer Architecture. ArXiv, abs/2002.04745.
  10. Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). Training Deep Nets with Sublinear Memory Cost. ArXiv, abs/1604.06174.
