Apple Intelligence
Imagine a world where your Apple devices not only understand you but anticipate your needs, seamlessly integrating into your daily life to make every task effortless. Welcome to the forefront of artificial intelligence with Apple Intelligence, where cutting-edge generative models and meticulous optimization techniques converge to revolutionize user experience. By prioritizing human-centric evaluation and leveraging advanced fine-tuning methods, Apple is not just keeping pace with technological advancements but setting new standards for efficiency, safety, and usability. Dive into the world of Apple Intelligence and discover how your favorite devices are becoming smarter, more responsive, and incredibly intuitive.
“Walk a mile to avoid a Fight, but when one starts, don’t hold back”
Apple has always been known for its innovation and strategic planning. While other tech giants were openly competing for dominance in artificial intelligence (AI), Apple was quietly preparing behind the scenes. Over the past decade, Apple recognized that the future of technology would rely heavily on hardware performance. This led to the development and release of its own custom chips, known as “Apple Silicon,” starting with the Apple M1 chip in 2020. These chips were designed to handle demanding tasks efficiently and optimize memory usage.
However, the technology landscape continued to evolve rapidly. In late 2022, OpenAI released ChatGPT, a chatbot built on its GPT-3.5 model. It quickly became a sensation, revolutionizing how people interacted with technology and growing into one of the most widely used online tools.
Recognizing the potential of Large Language Models (LLMs) like ChatGPT, Apple decided to invest heavily in generative AI (GenAI), backed by the resources of a company with a roughly $3 trillion market capitalization. Despite these efforts, Apple faced challenges around hardware-software integration and reducing latency (the time it takes for data to travel between devices and servers). The race to optimize these language models and enhance their performance has been intense, with Apple continuously striving to lead the way.
Yesterday (June 10, 2024), at the Worldwide Developers Conference (WWDC), Apple announced a major partnership with OpenAI. The collaboration aims to integrate ChatGPT’s advanced AI capabilities into Apple’s ecosystem, making it easier for users to access powerful AI tools directly on their devices. This partnership marks a significant step in Apple’s journey towards making AI more accessible and integrated into everyday life. Apple Intelligence is supported on iOS 18, iPadOS 18, and macOS Sequoia.
What is Apple Intelligence?
Apple Intelligence is an advanced AI-powered orchestration tool designed to help users seamlessly manage, search, and navigate their data within the Apple ecosystem. Using simple command-like instructions from the user, the agent can carry out Call-to-Action (CTA) capabilities such as prioritizing messages, optimizing travel routes, retrieving files, even generating images, and much more. On top of that, Apple Intelligence is deeply integrated into Siri and, in turn, the entire device, making Siri more powerful than ever. The tool leverages the capabilities of LLMs while maintaining Apple’s strong emphasis on security and privacy, one of the company’s core principles.
Apple Intelligence operates through two main components:
- On-Device LLM: For smaller tasks, Apple Intelligence uses a 100% local LLM with 3 billion parameters. This on-device model ensures that many operations can be performed quickly and securely without needing to connect to external servers.
- Private Cloud Compute (PCC): For more complex tasks, Apple Intelligence taps into a larger LLM housed in what Apple calls Private Cloud Compute (PCC), built on Apple Silicon servers. This setup allows for intensive computations while ensuring user data remains private. Apple assures that even they do not have access to the data processed in these private containers, and claims that “PCC will cryptographically refuse to talk to any iPhone, iPad or Mac unless its software has been publicly logged for inspection”.
Both components integrate seamlessly with the data used by apps across the Apple ecosystem: a semantic index organizes this data in a graph-like structure so it can be referenced directly and fed to the LLM as context. Apple Intelligence decides which part of a computation can be done on device and sends only the rest to the PCC. This has been made possible by years of hardware-software optimization at Apple. By combining these two engines, Apple Intelligence provides powerful AI capabilities directly on users’ devices, enhancing functionality without compromising privacy.
Breakdown
Let’s try to understand the innovation in Apple Intelligence in greater depth.
Pre-Training
If you’re new to the world of LLMs, these models are typically trained in two main phases: pre-training and fine-tuning.
Pre-training is the initial stage where the model learns from a large and diverse dataset to develop a general understanding of language. Apple Intelligence’s foundation models are pre-trained with Apple’s open-source AXLearn framework, which is built on top of Google’s JAX (a library for high-performance numerical computing) and XLA (Accelerated Linear Algebra, a compiler that speeds up machine learning workloads).
Here’s a breakdown of the techniques used during pre-training to optimize compute and enable massively parallel processing:
- Data Parallelism — The model is replicated on every device while the training data is split into batches that are processed simultaneously, with gradients averaged across devices, speeding up the training process.
- Tensor Parallelism (TP) — This model parallelism technique divides the computation of a model across multiple GPUs by splitting tensors into separate, non-overlapping pieces.
- Sequence Parallelism (SP) — The sequence dimension of input tensors is divided across multiple GPUs, allowing for parallel processing of sequence data.
- Fully Sharded Data Parallel (FSDP) — This method shards a model’s parameters, gradients, and optimizer states across all available GPUs, enabling efficient use of computational resources.
These advanced techniques ensure that the pre-training of Apple’s models is highly optimized for performance, utilizing massively parallel processing (MPP) to achieve efficient and effective training.
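Apple has not published its AXLearn training code, but since AXLearn is built on JAX, a minimal data-parallel training step might look like the sketch below. The toy linear model, learning rate, and batch shapes are made up for illustration; a real setup would layer tensor, sequence, and FSDP sharding on top of this pattern.

```python
import functools

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Hypothetical toy model: a single linear layer, not Apple's architecture.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # The all-reduce at the heart of data parallelism: average gradients across devices.
    grads = jax.lax.pmean(grads, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
params = jax.device_put_replicated(params, jax.local_devices())  # one replica per device
x = jnp.ones((n_dev, 8, 4))   # per-device batches: 8 examples, 4 features each
y = jnp.ones((n_dev, 8, 1))
params, loss = train_step(params, x, y)
```

Tensor and sequence parallelism instead shard the computation within a layer, and FSDP shards the parameters themselves, but this replicate-compute-all-reduce pattern is the basic building block they extend.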
Training Data
“Data is the new Oil” — Clive Humby, 2006
Apple has made it clear that the training data for their AI models comes from several sources:
- Licensed Data: Data that Apple has legally obtained and has permission to use.
- Functional Data: Data that helps enhance specific features of their products.
- Public Data: Information from public websites, collected by AppleBot, Apple’s in-house web crawler.
Apple allows any public website it indexes to opt out of having its data used for training purposes.
Importantly, Apple explicitly states: “We never use our users’ private personal data or user interactions when training our foundation models”. This commitment to privacy sets Apple apart from many other tech companies.
Additionally, Apple takes extra steps to ensure the quality and safety of their training data by:
- Removing Personally Identifiable Information (PII): This includes sensitive information like social security numbers and credit card numbers that may have been accidentally leaked online.
- Filtering Out Profanity and Low-Quality Content: To prevent the model from learning anything “toxic” or inappropriate.
- Data Enrichment: Apple filters, transforms, and extracts features from the data, using machine learning models to identify and utilize high-quality content.
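Apple has not described its data-cleaning pipeline in detail, but a PII-scrubbing pass conceptually resembles the toy sketch below. The regex patterns and placeholder tokens are illustrative assumptions; a production pipeline would need far broader coverage plus ML-based detectors.

```python
import re

# Toy PII scrubber; Apple's actual filtering pipeline is not public.
# Patterns are deliberately simplistic (US-style SSNs, 16-digit card numbers, emails).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scrub_pii(text: str) -> str:
    """Replace anything that looks like PII with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(scrub_pii("Contact me at jane@example.com, SSN 123-45-6789."))
# -> "Contact me at [EMAIL], SSN [SSN]."
```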
Garbage-In, Garbage-Out (GIGO)
The principle of “Garbage In, Garbage Out” (GIGO) underscores Apple’s rigorous approach to data quality. By using high-quality, carefully vetted data, Apple ensures that their models are trained effectively, resulting in a robust and reliable AI system.
Kudos to Apple for going the extra mile to ensure their AI training data is both ethical and high-quality, contributing to their reputation as a trusted and user-centric brand.
Post-Training
After completing the pre-training phase, Apple further refines their models using a combination of human-annotated and synthetic data. This process, known as fine-tuning, enhances the model’s performance and accuracy.
Apple uses two innovative algorithms during post-training:
- Rejection Sampling Fine-Tuning (RFT) with a Teacher Committee: Rejection sampling fine-tuning, introduced by Yuan et al. in 2023, improves the model’s mathematical reasoning by sampling many candidate reasoning paths and keeping only the good ones as additional fine-tuning data. Apple pairs it with a committee of “teacher” models that evaluate the sampled outputs and guide the “learner” model towards better performance.
- Reinforcement Learning from Human Feedback (RLHF): This technique builds on preference-based reinforcement learning (PbRL), introduced by Cheng et al. in 2011, and uses mirror descent policy optimization (MDPO) together with a leave-one-out advantage estimator. It lets the model learn from human feedback, optimizing its responses to align with human preferences.
Apple reports that using these advanced algorithms significantly boosts their model’s performance compared to standard methods. By incorporating these sophisticated techniques, Apple ensures their AI models are not only powerful but also aligned with user expectations, providing more accurate and reliable results.
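Apple does not publish the details of its RFT implementation, but the core rejection-sampling-with-committee loop can be sketched as follows. The function names, acceptance threshold, and stand-in models are hypothetical.

```python
import statistics
from typing import Callable, List, Tuple

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],               # the "learner" model being fine-tuned
    teachers: List[Callable[[str, str], float]],  # committee of scoring models
    num_samples: int = 8,
    threshold: float = 0.7,
) -> List[Tuple[str, str]]:
    """Sample several candidate responses and keep only those the committee approves.

    The surviving (prompt, response) pairs become supervised fine-tuning data.
    """
    accepted = []
    for _ in range(num_samples):
        response = generate(prompt)
        # Each teacher scores the response; average the committee's opinion.
        score = statistics.mean(t(prompt, response) for t in teachers)
        if score >= threshold:
            accepted.append((prompt, response))
    return accepted

# Hypothetical usage with stand-in models:
data = rejection_sample(
    "What is 17 * 6?",
    generate=lambda p: "17 * 6 = 102",
    teachers=[lambda p, r: 1.0 if "102" in r else 0.0],
)
```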
Optimization
Apple employs a range of innovative techniques to optimize their generative models for both on-device and server environments, ensuring speed and efficiency.
On-Device and Server Model Optimizations
- Grouped-Query Attention (GQA) — Both on-device and server models use this technique, in which groups of query heads share key and value heads, improving the efficiency of the attention mechanism that is central to processing and generating language.
- Shared Input and Output Vocab Embedding Tables — By mapping embedding tensors without duplication, this technique reduces memory requirements and inference costs.
— On-device model vocab size: 49K tokens.
— Server model vocab size: 100K tokens (includes additional language and technical tokens).
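To illustrate what grouped-query attention buys, here is a small NumPy sketch in which 8 query heads share 2 key/value heads, shrinking the KV state relative to full multi-head attention. The head counts and dimensions are made up and do not reflect Apple’s models.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    # q: (seq, num_q_heads, head_dim); k, v: (seq, num_kv_heads, head_dim)
    seq, _, head_dim = q.shape
    group_size = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group_size  # which shared KV head this query head reads from
        scores = q[:, h, :] @ k[:, kv, :].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h, :] = weights @ v[:, kv, :]
    return out

# Toy usage: 16 tokens, 8 query heads sharing 2 KV heads, head_dim 64.
q = np.random.randn(16, 8, 64)
k = np.random.randn(16, 2, 64)
v = np.random.randn(16, 2, 64)
print(grouped_query_attention(q, k, v).shape)  # (16, 8, 64)
```

Because only 2 KV heads are stored instead of 8, the KV cache kept around during generation is four times smaller in this toy configuration.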
On-Device Inference Optimizations
- Low-Bit Palettization
— Achieves the necessary memory, power, and performance requirements by reducing the bit precision of model weights.
— Maintains model quality using a framework with LoRA adapters and a mixed 2-bit and 4-bit configuration, averaging 3.5 bits per weight.
- Talaria Tool — An interactive tool for model latency and power analysis, guiding the selection of optimal bit rates for each operation.
- Activation Quantization — Reduces the precision of activation values to save memory and increase inference speed, without significantly impacting accuracy.
- Embedding Quantization — Lowers the precision of embedding vectors to reduce memory usage and computation costs.
- Efficient Key-Value (KV) Cache Update — Optimizes the update process of the KV cache on neural engines to enhance inference speed and reduce latency.
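Palettization replaces full-precision weights with a small lookup table (a “palette”) plus per-weight indices. The sketch below clusters weights into a 16-entry palette (4 bits per weight) using plain 1-D k-means; Apple’s production scheme (mixed 2-bit/4-bit, averaging ~3.5 bits per weight, with accuracy-recovery adapters) is considerably more sophisticated.

```python
import numpy as np

def palettize(weights: np.ndarray, bits: int = 4, iters: int = 20):
    """Cluster weights into a 2**bits palette and return (palette, index map)."""
    n_colors = 2 ** bits
    flat = weights.ravel()
    # Initialize the palette from evenly spaced quantiles of the weight distribution.
    palette = np.quantile(flat, np.linspace(0.0, 1.0, n_colors))
    for _ in range(iters):  # plain 1-D k-means
        idx = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
        for c in range(n_colors):
            members = flat[idx == c]
            if members.size:
                palette[c] = members.mean()
    idx = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
    return palette, idx.reshape(weights.shape).astype(np.uint8)

def depalettize(palette: np.ndarray, idx: np.ndarray) -> np.ndarray:
    # At inference time, weights are reconstructed by a simple table lookup.
    return palette[idx]

w = np.random.randn(256, 256).astype(np.float32)
palette, idx = palettize(w)
print("mean reconstruction error:", np.abs(w - depalettize(palette, idx)).mean())
```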
Performance Metrics
- Time-to-First-Token Latency: Achieved latency of about 0.6 milliseconds per prompt token on iPhone 15 Pro.
- Generation Rate: Achieved a generation rate of 30 tokens per second.
- Token Speculation Techniques: Further enhancements in token generation rate observed when employing these techniques.
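To put those figures in perspective, here is a quick back-of-the-envelope calculation. Only the 0.6 ms/token and 30 tokens/s figures come from Apple; the prompt and response lengths are made up for the example.

```python
# Reported figures (iPhone 15 Pro, on-device model):
PROMPT_LATENCY_MS_PER_TOKEN = 0.6   # time-to-first-token cost per prompt token
GENERATION_TOKENS_PER_SECOND = 30

prompt_tokens, response_tokens = 750, 120  # hypothetical request
time_to_first_token_s = prompt_tokens * PROMPT_LATENCY_MS_PER_TOKEN / 1000
generation_time_s = response_tokens / GENERATION_TOKENS_PER_SECOND

print(f"time to first token: {time_to_first_token_s:.2f} s")  # ~0.45 s
print(f"generation time:     {generation_time_s:.2f} s")      # ~4.0 s
```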
These optimizations collectively ensure that Apple’s generative models are not only powerful but also highly efficient, both on-device and on their private cloud.
Model Adaptation
Apple’s foundation models are fine-tuned to handle everyday activities and can adapt dynamically to specific tasks. Here’s how this process works:
Dynamic Specialization with Adapters:
- Apple uses adapters — small neural network modules — to fine-tune their models for specific tasks. These adapters can be plugged into different layers of the pre-trained model.
Targeted Fine-Tuning:
- The attention matrices, attention projection matrix, and fully connected layers in the transformer architecture’s decoding layers are adjusted using these adapters.
- By fine-tuning only the adapter layers, the core parameters of the base pre-trained model remain unchanged. This approach maintains the model’s general knowledge while tailoring it to specific tasks.
Efficient Memory Usage:
- Adapter parameters are stored using 16 bits.
- For the approximately 3 billion parameter on-device model, the parameters for a rank 16 adapter usually require only tens of megabytes.
- Adapters can be dynamically loaded, temporarily stored in memory, and swapped as needed. This allows the model to specialize in real-time while efficiently managing memory and ensuring the system remains responsive.
Rapid Training and Deployment:
- Apple has developed an efficient infrastructure for quickly retraining, testing, and deploying adapters whenever the base model or training data is updated.
- Adapter parameters are initialized using an accuracy-recovery method, ensuring they start off effectively.
This approach ensures that Apple’s AI models are not only powerful but also highly adaptable, providing optimal performance for a wide range of tasks while maintaining system efficiency and responsiveness.
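Apple has not released its adapter code, but a rank-16 LoRA-style adapter on a single projection matrix can be sketched as below. The 4096x4096 dimension is a hypothetical stand-in, and since the accuracy-recovery initialization Apple mentions is not public, the adapter here is simply zero-initialized so it starts as a no-op.

```python
import numpy as np

class LoRALinear:
    """A frozen base linear layer plus a small trainable low-rank update."""

    def __init__(self, weight: np.ndarray, rank: int = 16, alpha: float = 16.0):
        self.weight = weight                         # frozen base weights (d_out, d_in)
        d_out, d_in = weight.shape
        self.A = np.random.randn(rank, d_in) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))             # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Effective weight is W + scale * (B @ A); only A and B are ever updated.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# Rough size of a rank-16 adapter for one hypothetical 4096x4096 projection, stored in 16 bits:
d, r = 4096, 16
adapter_bytes = (r * d + d * r) * 2  # A and B, 2 bytes per parameter
print(f"{adapter_bytes / 1e6:.2f} MB per adapted matrix")  # ~0.26 MB
```

Summed over the dozens of attention and feed-forward matrices adapted in a roughly 3-billion-parameter model, adapters of this size land in the tens of megabytes, which is roughly consistent with the figure Apple quotes.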
Performance and Evaluation
Apple’s generative models enhance user communication and productivity by leveraging advanced AI techniques. Human evaluation ensures these models provide a high-quality user experience.
For specific tasks like summarizing emails, Apple fine-tunes accuracy-recovery low-rank adaptation (LoRA) adapters, training on high-quality synthetic data. Evaluations with diverse, real-world datasets demonstrate significant improvements in handling various inputs, showing superior performance over standard models.
Both on-device and server models are tested on a wide range of tasks, from brainstorming to coding. Apple’s models consistently outperform competitors in human evaluations, proving more efficient and effective. The on-device model surpasses larger models in various benchmarks.
Safety and reliability are prioritized through continuous adversarial testing to mitigate risks. Instruction-following capabilities are also benchmarked, showing better performance than similarly sized models. Internal tests confirm the models’ strong writing abilities, even without specific adapters.
Apple’s approach combines advanced techniques, targeted fine-tuning, and rigorous evaluations, delivering powerful, efficient, and user-friendly AI solutions while maintaining high standards of safety and reliability.
Connections with OpenAI
While Apple did not explicitly discuss the depth of its relationship with OpenAI, it did mention the integration of ChatGPT into Siri for improved real-time language translation and more complex tasks. Rumour has it that the partnership with OpenAI is temporary while Apple gears up its own foundation models to refine its AI capabilities, which resonates with Apple’s commitment to security and privacy. We eagerly await a press release in which Apple commits to switching entirely to its in-house models; until then, the controversies and memes will keep flowing.
Disclaimer
All information in this article has been sourced from Apple’s announcement at WWDC 2024: https://machinelearning.apple.com/research/introducing-apple-foundation-models
You can also watch the WWDC keynote, streamed live on YouTube: https://www.youtube.com/live/RXeOiIDNNek?si=DCzdX3yiPKCUo_IE