Mistral: Mixtral of Experts

Mistral AI continues its mission to deliver the best open models to the developer community. Moving forward in AI requires taking new technological turns beyond reusing well-known architectures and training paradigms. Most importantly, it requires making the community benefit from original models to foster new inventions and usages.

Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture-of-experts model (SMoE) with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.

Mixtral has the following capabilities.

It gracefully handles a context of 32k tokens.
It handles English, French, Italian, German and Spanish.
It shows strong performance in code generation.
It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

Pushing the frontier of open models with sparse architectures

Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combines their outputs additively.
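The routing step described above can be illustrated with a minimal sketch for a single token at a single layer. This is not Mixtral's implementation: the function name `top2_moe_layer`, the toy experts built by `make_expert`, the random router weights, and the choice to weight the two outputs by a softmax over the two selected router logits are all assumptions made for illustration.

```python
import math
import random

def top2_moe_layer(x, router_weights, experts):
    """Route one token vector x through its 2 highest-scoring experts."""
    # Router: one logit per expert (dot product of x with that expert's row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the two largest logits.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Gate weights: softmax over the two selected logits (assumption of this sketch).
    m = max(logits[i] for i in top2)
    exps = [math.exp(logits[i] - m) for i in top2]
    gates = [e / sum(exps) for e in exps]
    # Additive combination: weighted sum of the two selected experts' outputs.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top2):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top2, gates

def make_expert(scale):
    # Hypothetical stand-in for a feedforward block: it only scales its input.
    return lambda x: [scale * xi for xi in x]

random.seed(0)
dim, num_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(float(i + 1)) for i in range(num_experts)]
token = [0.5, -1.0, 2.0, 0.1]

output, chosen, gates = top2_moe_layer(token, router_weights, experts)
print("experts used:", chosen, "gate weights:", gates)
```

Only the two selected experts are evaluated for the token; the other six are never called, which is where the compute saving comes from.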

This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token.
Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.
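As a rough back-of-the-envelope check, the two figures above can be split into parameters shared by every token and parameters that belong to the experts. The sketch assumes a simple model that the post does not state: total = shared + 8 × expert and active = shared + 2 × expert, where "expert" means all parameters of one expert summed across layers. Because the inputs are rounded, the results are only indicative.

```python
# Figures quoted in the text, in billions of parameters.
TOTAL_PARAMS_B = 45.0
ACTIVE_PARAMS_B = 12.0
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

# Assumed model (not from the post):
#   total  = shared + NUM_EXPERTS * expert
#   active = shared + EXPERTS_PER_TOKEN * expert
# Subtracting the two equations isolates the per-expert size.
expert_b = (TOTAL_PARAMS_B - ACTIVE_PARAMS_B) / (NUM_EXPERTS - EXPERTS_PER_TOKEN)
shared_b = ACTIVE_PARAMS_B - EXPERTS_PER_TOKEN * expert_b
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"per-expert parameters (estimate): {expert_b:.1f}B")
print(f"shared parameters (estimate):     {shared_b:.1f}B")
print(f"fraction of parameters used per token: {active_fraction:.1%}")
```

Under these assumptions, most of the parameters sit in the experts, and each token touches a little over a quarter of the model.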

Mixtral is pre-trained on data extracted from the open Web – we train experts and routers simultaneously.

Performance

We compare Mixtral to the Llama 2 family and the GPT3.5 base model. Mixtral matches or outperforms Llama 2 70B, as well as GPT3.5, on most benchmarks.

Performance overview

In the following figure, we measure the quality versus inference budget trade-off. Mistral 7B and Mixtral 8x7B belong to a family of highly efficient models compared to Llama 2 models.

Scaling of performances

The following table gives detailed results for the figure above.

Detailed benchmarks

Hallucination and biases. To identify possible flaws to be corrected by fine-tuning or preference modelling, we measure the base model performance on TruthfulQA, BBQ, and BOLD.

BBQ BOLD benchmarks

Compared to Llama 2, Mixtral is more truthful (73.9% vs 50.2% on the TruthfulQA benchmark) and presents less bias on the BBQ benchmark.
Overall, Mixtral displays more positive sentiments than Llama 2 on BOLD, with similar variances within each dimension.

Language. Mixtral 8x7B masters French, German, Spanish, Italian, and English.

Multilingual benchmarks

Instructed models

We release Mixtral 8x7B Instruct alongside Mixtral 8x7B. This model has been optimised through supervised fine-tuning and direct preference optimisation (DPO) for careful instruction following. On MT-Bench, it reaches a score of 8.30, making it the best open-source model, with a performance comparable to GPT3.5.
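For readers unfamiliar with DPO, the per-example objective can be sketched as follows. This is the standard DPO loss for one preference pair, not a description of how Mixtral 8x7B Instruct was actually trained: the function name `dpo_loss`, the example log-probabilities, and the value `beta=0.1` are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta is a hypothetical setting.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behaviour preference optimisation is meant to encourage.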

Note: Mixtral can be prompted to ban some outputs, which is useful when building applications that require a strong level of moderation, as exemplified here. Proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will simply follow whatever instructions it is given.

Deploy Mixtral with an open-source deployment stack

To enable the community to run Mixtral with a fully open-source stack, we have submitted changes to the vLLM project, which integrates Megablocks CUDA kernels for efficient inference.

Skypilot allows the deployment of vLLM endpoints on any instance in the cloud.

Use Mixtral on our platform.

We’re currently using Mixtral 8x7B behind our endpoint mistral-small, which is available in beta. Register to get early access to all generative and embedding endpoints.

Acknowledgement

We thank the CoreWeave and Scaleway teams for technical support as we trained our models.
