On the two very different tracks of car making and AI, Apple this year decided to cancel the electric-car project it had been working on for more than a decade, and announced that the company will “break new ground” in generative artificial intelligence in 2024.
Sure enough, after setting the strategic direction, they moved quickly.
On the one hand, according to the latest report from Bloomberg, Apple acquired DarwinAI, a Canadian AI startup that specializes in vision-based technology. Although Apple and DarwinAI have not announced the deal, LinkedIn profiles show that several members of the startup’s team joined Apple’s machine learning division in January.
On the other hand, on March 14, many netizens noticed that Apple, backed by a team of some 30 researchers, had entered the arena with a large multimodal model called MM1.
Apple’s naming is as simple and easy to remember as ever: M1 is its own chip, and MM1 is its own large model.
In addition, Apple made clear from the start that it is taking a sharing route different from today’s open-source and closed-source large models: as the paper’s title indicates, it directly shares the training methods, analysis, and insights behind the MM1 multimodal large language model.
Paper address: https://arxiv.org/pdf/2403.09611.pdf
1、Citing the industry’s “opacity”, Apple releases a large multimodal model – MM1
Multimodal here means large language models trained not only on plain text but also on other formats such as images, video, and audio.
As for why MM1 was released, the Apple research team points out in the paper that many AI companies are now “opaque” about how their AI models are trained.
Existing MLLMs (multimodal large language models) in the industry mainly fall into two categories: closed models and open models.
- Closed models may be available to use, but little is known about their data, model architecture, or training details.
- With open models, many companies release the model parameters along with detailed descriptions of the data and training configurations, allowing the community to fine-tune them further.
But in the Apple team’s view, most models, whether open or closed, disclose almost nothing about the process by which their design choices were made, especially regarding multimodal pre-training.
To further research in this area, the Apple research team believes it is imperative to share methods on how to build such models.
So, without delay, Apple published this paper, which not only introduces MM1 but also documents the process of building the MLLM and tries to share as many of the lessons learned along the way as possible.
2、What can the 30-billion-parameter MM1 be used for?
The researchers explain in the paper that through small-scale ablation experiments on model-architecture decisions and pre-training data choices, along with careful and comprehensive analysis of image encoders, vision-language connectors, and different pre-training data options, they uncovered several key design lessons.
- The Apple team shows that for large-scale multimodal pre-training, carefully mixing image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared with other published pre-training results.
- Furthermore, the image encoder, image resolution, and number of image tokens have a significant impact, while the design of the vision-language connector is of comparatively minor importance.
Looking more closely at the modeling side, the researchers found that design choices matter in the following order: image resolution, then the visual encoder’s loss and capacity, then the visual encoder’s pre-training data.
The researchers also used three different types of pre-training data: image captions, interleaved image-text, and plain text data. They found that for few-shot and text-only performance, interleaved and text-only training data are crucial, while for zero-shot performance, caption data matters most.
These trends hold after supervised fine-tuning (SFT), both on the evaluations used during pre-training and on additional benchmarks. This shows that the capabilities and modeling decisions discovered during pre-training are preserved after fine-tuning.
Finally, Apple scaled this recipe up by using larger LLMs (3B, 7B, and 30B) and by exploring mixture-of-experts (MoE) variants (from a 3B MoE with 64 experts to a 7B MoE with 32 experts), thereby building MM1, a family of multimodal models with up to 30 billion parameters.
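For readers less familiar with the mixture-of-experts idea mentioned above, here is a minimal, self-contained sketch of a token-routed MoE feed-forward layer written with PyTorch. The layer sizes, number of experts, and top-2 routing are illustrative assumptions for clarity, not MM1’s actual architecture.

```python
# Minimal sketch of a token-routed mixture-of-experts (MoE) feed-forward layer.
# Sizes, expert count, and top-2 routing are illustrative, not MM1's real design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (batch, seq, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 512)
print(MoEFeedForward()(tokens).shape)                    # torch.Size([2, 16, 512])
```

The general appeal of MoE, which motivates its use here, is that only a couple of experts are active per token, so total parameter count can grow much faster than per-token compute.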
Thanks to this large-scale multimodal pre-training, MM1 has solid image recognition and reasoning capabilities. More specifically, MM1 is good at picking up patterns in the images and text the user provides.
As shown in the figure below, the user supplies a sequence of images and, for each one, text naming the objects and how many of them there are:
- First photo: 1 teddy bear, 1 smartphone.
- Second photo: 2 cats.
- Third photo: 3 books, 1 vase and 1 glass.
In addition, MM1 can also:
(a) Count objects and output text following a custom format
(b) Refer to specific parts of an image and perform OCR
(c) Demonstrate common-sense and word knowledge about everyday objects
(d) Perform basic math.
After instruction tuning, MM1 also shows strong in-context few-shot learning capabilities.
For example, as shown below, if you upload a photo, MM1 can tell you how many objects it contains, values shown in the image, the type of animal, and so on.
It can also take multiple images as input. For example, for the images below:
Q: “Identify the number shown on the cake in each picture, then find the total.”
MM1-30B-Chat: The first image shows the number 3 on a cake, the second image shows the number 4 on a cake. The sum is 7.
Q: I show you an image and you need to predict the letters or numbers that will appear when the image is rotated 180 degrees.
MM1-30B-Chat: W, 9
Q: I will show you some pictures and your task is to answer my questions. Estimate the temperature of each image in degrees Celsius. Also provide reasons.
MM1-30B-Chat:
- Image with umbrellas on the beach: 25-30 degrees Celsius. The umbrellas are up and people are on the beach, suggesting a warm, sunny day.
- Image of an airplane flying over snow: -20 to -30 degrees Celsius. The plane is flying over a snow-covered area with vehicles on the ground, indicating cold winter weather.
- Image of animals grazing in a meadow: 15-20 degrees Celsius. The animals are grazing, which usually happens in milder weather, and the grass is green, further indicating a temperate climate.
Judging from these examples, the overall results are quite good. According to the Apple researchers, the MM1 models outperform most related work. In particular, the pre-trained MM1 is SOTA, outperforming Emu2, Flamingo, and IDEFICS on captioning and visual question answering (VQA) tasks in the few-shot setting, at both small and large model sizes. After SFT, the final models achieve competitive performance on 12 established multimodal benchmarks.
3、How was MM1 built?
So, how is MM1 “made”? The Apple researchers break it down along three dimensions:
- Architecture: The researchers studied different pre-trained image encoders and explored various ways of connecting them to the LLM.
- Data: They considered different types of data and their relative mixing weights.
- Training procedure: They explored how to train an MLLM, including the hyperparameters and which parts of the model are trained at which stage.
In the initial stage, since training a large MLLM requires a lot of resources, the researchers adopted a simplified ablation setting.
Specifically, the researchers started from a smaller base configuration of the model and ablated from there: modifying one component at a time, whether an architectural module or a data source, and then evaluating the impact of that design choice. In this way they could arrive at the final model-data configuration and scale it up in terms of model parameters and training time.
The base configuration for the ablations is as follows (a configuration sketch follows the list):
- Image encoder: a ViT-L/14 model trained with a CLIP loss on DFN-5B and VeCap-300M; image size 336×336.
- Vision-language connector: C-Abstractor with 144 image tokens.
- Pre-training data: captioned images (45%), interleaved image-text documents (45%), and text-only data (10%).
- Language model: a 1.2B-parameter decoder-only Transformer.
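Laid out as data, the same base configuration might look like the sketch below; the dictionary keys are our own shorthand rather than anything from the MM1 codebase.

```python
# The base ablation setup above, re-expressed as a plain Python dictionary.
# Key names are illustrative shorthand, not code from the MM1 authors.
base_ablation_config = {
    "image_encoder": {
        "arch": "ViT-L/14",
        "pretraining": "CLIP loss on DFN-5B + VeCap-300M",
        "image_size": (336, 336),
    },
    "vl_connector": {"type": "C-Abstractor", "num_image_tokens": 144},
    "pretraining_data_mix": {
        "captioned_images": 0.45,
        "interleaved_image_text": 0.45,
        "text_only": 0.10,
    },
    "language_model": {"arch": "decoder-only Transformer", "params": "1.2B"},
}

# Sanity check: the data mix should sum to 100%.
assert abs(sum(base_ablation_config["pretraining_data_mix"].values()) - 1.0) < 1e-9
```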
To evaluate the different design decisions, the researchers used zero-shot and few-shot (4-shot and 8-shot) performance on a variety of captioning and VQA tasks: COCO Captioning, NoCaps, TextCaps, VQAv2, TextVQA, VizWiz, GQA, and OK-VQA.
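To give a sense of what “4-shot” or “8-shot” evaluation means here, the sketch below assembles a few-shot VQA-style prompt from solved exemplars; the placeholder image tags and prompt format are our own illustration, not the evaluation harness used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    image_path: str   # placeholder for the image handed to the vision encoder
    question: str
    answer: str

def build_few_shot_prompt(exemplars, query_image, query_question):
    """Interleave k solved examples before the query, as in k-shot VQA evaluation."""
    parts = [
        f"<image:{ex.image_path}> Question: {ex.question} Short answer: {ex.answer}"
        for ex in exemplars
    ]
    parts.append(f"<image:{query_image}> Question: {query_question} Short answer:")
    return "\n".join(parts)

shots = [
    Exemplar("cats.jpg", "How many cats are there?", "2"),
    Exemplar("books.jpg", "How many books are there?", "3"),
]
print(build_few_shot_prompt(shots, "bears.jpg", "How many teddy bears are there?"))
```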
Model architecture ablation
In this part of the work, the researchers analyze the components that let the LLM process visual data: how best to pretrain the visual encoder, and how to bridge visual features into the LLM’s representation space.
In the past, most MLLMs used CLIP-pretrained image encoders, while recent work has also begun to explore purely visual self-supervised models (such as DINOv2) as image encoders. Here, the Apple researchers primarily ablate the importance of image resolution and of the image encoder’s pre-training objective. They use a 2.9B LLM (instead of the 1.2B one) to ensure there is enough capacity to take advantage of the larger image encoders.
During these experiments, the researchers found that increasing the image resolution from 224 to 336 improves all metrics for all architectures by approximately 3%. Scaling the encoder from ViT-L to ViT-H doubles the parameter count, but the performance gain is modest, usually under 1%. Finally, adding VeCap-300M (a synthetic caption dataset) yields a gain of more than 1% in the few-shot setting.
On the VL connector side, the researchers found that the number of visual tokens and the image resolution matter most, while the type of VL connector has little effect. The chart below shows that both zero-shot and few-shot performance improve as the number of visual tokens or the image resolution increases.
This differs from what many researchers had previously found: different connector architectures do not ultimately seem to produce stronger models. After instruction tuning, all three architectures achieve very similar results at the 336px, 144-token setting.
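To make the connector’s role concrete, here is a minimal sketch of the general idea: pooling an image encoder’s patch features down to a fixed number of visual tokens and projecting them into the LLM’s embedding space. It uses PyTorch; the dimensions and the adaptive average pooling are illustrative assumptions, not the C-Abstractor’s actual design.

```python
import torch
import torch.nn as nn

class SimpleVLConnector(nn.Module):
    """Pools ViT patch features to a fixed token count and maps them to the LLM width.
    Dimensions and pooling choice are illustrative, not the C-Abstractor design."""
    def __init__(self, vit_dim=1024, llm_dim=2048, num_visual_tokens=144):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)  # patch features -> 144 tokens
        self.proj = nn.Linear(vit_dim, llm_dim)              # ViT space -> LLM embedding space

    def forward(self, patch_feats):                          # (batch, num_patches, vit_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                             # (batch, num_visual_tokens, llm_dim)

# A 336x336 image with 14x14 patches yields 24 * 24 = 576 patch features.
patches = torch.randn(1, 576, 1024)
print(SimpleVLConnector()(patches).shape)                    # torch.Size([1, 144, 2048])
```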
Large-scale, task-appropriate data is critical to training high-performance models. Typically, model training is split into two stages: pre-training and instruction tuning. The former uses web-scale data, while the latter uses task-specific curated data.
Two types of data are commonly used to train MLLMs: caption data consisting of images paired with text descriptions, and interleaved image-text documents from the web. Note that caption data tends to contain relatively short text that is highly relevant to the image.
In contrast, the text in the interleaved data is longer and more varied, but is on average less relevant to the surrounding images. Finally, Apple researchers also included plain text data to help preserve the language understanding capabilities of the underlying LLM. Here are all the datasets:
Pre-training data ablation
For the pre-training data ablations, the researchers used the same model settings as in the ablations above; the only difference was training for 200k steps here to make full use of the large-scale data.
In the end, the researchers distilled the following lessons (a small data-mixing sketch follows the list):
- Data Lesson 1: Interleaved data helps improve few-shot and text-only performance, while caption data improves zero-shot performance.
- Data Lesson 2: Text-only data helps with few-shot and text-only performance.
- Data Lesson 3: Carefully blending image and text data yields the best multimodal performance while retaining strong text performance.
- Data Lesson 4: Synthetic data helps with few-shot learning.
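As a concrete reading of “mixing weights” in Data Lesson 3, the sketch below samples training examples from the three data sources in the 45/45/10 proportions described earlier; the sampling loop is our own illustration, not Apple’s data pipeline.

```python
import random

# Mixing weights matching the 45/45/10 split described in the article;
# the sampler itself is an illustration, not Apple's data pipeline.
data_mix = {
    "captioned_images": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_source(rng=random):
    """Pick the data source for the next training example according to the mix."""
    names, weights = zip(*data_mix.items())
    return rng.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in data_mix}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```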
Based on the above, the Apple researchers settled on the final recipe for MM1’s multimodal pre-training:
- Image encoder: a ViT-H model at 378×378 resolution, pre-trained on DFN-5B with the CLIP objective.
- Vision-language connector: since the number of visual tokens matters most, the researchers used a VL connector with 144 tokens; the specific architecture seems to matter less, and they chose C-Abstractor.
- Data: to maintain both zero-shot and few-shot performance, the researchers used a mix of 45% interleaved image-text documents, 45% image-text pair documents, and 10% text-only documents.
To improve model performance, the researchers scaled the LLM up to 3B, 7B, and 30B parameters.
The underlying LLMs were trained in-house on the same text-only dataset. Since both the LLM and the visual encoder were already pre-trained, the researchers used them to initialize MM1 and ran multimodal pre-training for 200k steps (approximately 100B tokens) on the data mix above.
All models were pre-trained entirely unfrozen, with a sequence length of 4096, up to 16 images per sequence (at 378×378 resolution), and a batch size of 512 sequences. All models were trained with the AXLearn framework.
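Pulling the final recipe and the training hyperparameters above into one place, here is a compact configuration sketch; the field names are our own shorthand and not the actual AXLearn configuration format.

```python
# MM1's final pre-training recipe and hyperparameters as reported above,
# re-expressed as a plain dictionary. Field names are illustrative shorthand,
# not the real AXLearn configuration format.
mm1_pretraining_config = {
    "image_encoder": {
        "arch": "ViT-H",
        "image_size": (378, 378),
        "pretraining": "CLIP objective on DFN-5B",
    },
    "vl_connector": {"type": "C-Abstractor", "num_visual_tokens": 144},
    "data_mix": {
        "interleaved_image_text": 0.45,
        "image_text_pairs": 0.45,
        "text_only": 0.10,
    },
    "llm_sizes": ["3B", "7B", "30B"],
    "training": {
        "steps": 200_000,              # roughly 100B tokens in total
        "sequence_length": 4096,
        "max_images_per_sequence": 16,
        "batch_size_sequences": 512,
        "frozen_components": [],       # everything is trained (fully unfrozen)
        "framework": "AXLearn",
    },
}
```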
Finally, the researchers evaluated the pre-trained models on captioning and VQA tasks with appropriate prompts, with the following results:
Note that the researchers only compared their models against larger models; for example, MM1’s 30B model is compared with two 80B models.
In terms of few-shot performance, MM1 outperforms all published pre-trained MLLMs. On both the captioning benchmarks and the VizWiz-QA benchmark, the 30B model shows excellent performance. On VQAv2, TextVQA, and OKVQA its performance is comparable to Emu2. As for zero-shot performance, even without instruction fine-tuning, the MM1 models perform well on TextCaps at all model sizes, and at small scale they are on par with Flamingo-3B on most benchmarks.
4、Supervised fine-tuning experiments
Beyond the above, the researchers also conducted supervised fine-tuning (SFT) experiments.
According to the results in the chart below, MM1-3B-Chat and MM1-7B-Chat outperform all listed models of the same size, including Google’s Gemini Nano.
At the same time, MM1-3B-Chat and MM1-7B-Chat perform well in VQAv2, TextVQA, ScienceQA, MMBench, and recent benchmarks (MMMU and MathVista).
Secondly, the researchers also analyzed two MoE models: a 3B-MoE (64 experts) and a 6B-MoE (32 experts). On almost all benchmarks, Apple’s MoE models achieved better performance than their dense counterparts.
Furthermore, at the 30B scale, MM1-30B-Chat performs better than Emu2-Chat-37B and CogVLM-30B on TextVQA, SEED, and MMMU. LLaVA-NeXT, by contrast, supports neither multi-image inference nor few-shot prompting, because each image is represented by 2,880 tokens fed into the LLM, whereas Apple’s total is only 720 tokens; this limits applications involving multiple images.
In addition, the Apple research team also studied the impact of image resolution and pre-training on SFT performance. The results are as follows.
5、It is not yet clear in what form MM1 will be released
It is worth noting that the paper released by Apple did not mention whether MM1 will be released.
However, as mentioned at the beginning of the article, Apple has stopped developing electric vehicles and acquired DarwinAI.
Additionally, Dag Kittlaus, co-founder and former CEO of Siri, has said: “Siri will be doing some cool new things in 2024. Then it will accelerate and become a real force in artificial intelligence. Apple is uniquely positioned to enable new, useful and unexpected LLM use cases.”
For now, MM1 is only the first step in Apple’s publicly visible AI push, and we look forward to its further development.
More technical details can be found in the paper: https://arxiv.org/pdf/2403.09611.pdf