Rıfkı-V3 Technical Report
1. Overview
Rıfkı-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. Inspired by the DeepSeek-V3 architecture, Rıfkı adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture to achieve high-performance inference while keeping training costs economical. FP8 mixed-precision training further reduces memory usage and accelerates training.
2. Architecture Summary
The Rıfkı model architecture is built upon the Transformer framework, with key optimizations for efficiency at scale.
Multi-head Latent Attention (MLA)
Traditional Key-Value (KV) caching in Transformers consumes significant memory during generation. Rıfkı uses MLA to compress keys and values into a compact latent representation, significantly reducing memory overhead during generation and allowing for context windows of up to 128K tokens.
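Conceptually, the compression projects each token's hidden state down into a small latent vector, and only that latent is cached; full keys and values are reconstructed from it at attention time. The PyTorch sketch below illustrates this low-rank KV compression under assumed dimensions and projection names; it is not Rıfkı's actual configuration.

```python
# Minimal sketch of MLA-style KV-cache compression (illustrative only;
# all dimensions and projection names are assumptions).
import torch
import torch.nn as nn

d_model = 7168        # hidden size (assumed)
d_latent = 512        # compressed KV latent dimension (assumed)
n_heads = 16          # number of attention heads (assumed)
d_head = 128          # per-head dimension (assumed)

# Down-projection: each token's hidden state is compressed into a small
# latent vector, and only this latent is stored in the KV cache.
w_down_kv = nn.Linear(d_model, d_latent, bias=False)

# Up-projections: keys and values are reconstructed from the latent at attention time.
w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

hidden = torch.randn(1, 1024, d_model)        # (batch, seq_len, d_model)
kv_latent = w_down_kv(hidden)                 # (batch, seq_len, d_latent) -> this is what gets cached
k = w_up_k(kv_latent).view(1, 1024, n_heads, d_head)
v = w_up_v(kv_latent).view(1, 1024, n_heads, d_head)

# Cache savings: store d_latent floats per token instead of 2 * n_heads * d_head.
print(d_latent / (2 * n_heads * d_head))      # 0.125 of a standard KV cache in this toy config
```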
DeepSeekMoE Mixture-of-Experts
Instead of activating all parameters for every token, Rıfkı uses an MoE router to select only the most relevant experts for each token. As a result, only 37B of the 671B total parameters (roughly 5.5%) are active per token, dramatically reducing per-token compute compared with a dense model of the same size, as sketched below.
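The following PyTorch sketch shows the basic top-k routing idea: score every expert per token, keep only the top-k, and combine their outputs with renormalized gate weights. The expert count, hidden sizes, and k value are illustrative assumptions, not Rıfkı's actual configuration.

```python
# Minimal sketch of top-k MoE routing (illustrative; sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 1024, 64, 6       # assumed sizes
router = nn.Linear(d_model, n_experts, bias=False)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
])

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    # x: (tokens, d_model). Score every expert, but run only the top_k per token.
    scores = F.softmax(router(x), dim=-1)                  # (tokens, n_experts)
    weights, indices = scores.topk(top_k, dim=-1)          # (tokens, top_k)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
    out = torch.zeros_like(x)
    for tok in range(x.size(0)):
        for slot in range(top_k):
            e = indices[tok, slot].item()
            out[tok] += weights[tok, slot] * experts[e](x[tok])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)          # torch.Size([4, 1024])
```

Production implementations dispatch tokens to experts in batches rather than looping per token; the loop here is only for readability.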
3. Benchmarks
Rıfkı-V3 has been rigorously tested across standard benchmarks including MMLU, GSM8K, and HumanEval.
4. Usage
API Integration
Rıfkı provides an OpenAI-compatible API endpoint.
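Because the endpoint follows the OpenAI API schema, the official openai Python client can be pointed at it by overriding the base URL. The URL, API key, and model identifier below are placeholders, not documented Rıfkı values.

```python
# Hedged example: calling an OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",   # replace with the actual Rıfkı endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="rifki-v3",                        # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize MLA in one sentence."}],
)
print(response.choices[0].message.content)
```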
Local Run
You can run Rıfkı locally using standard inference engines like vLLM or SGLang.
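Below is a minimal offline-inference sketch using vLLM's Python API. The model path is a placeholder, and a checkpoint of this size would additionally require multi-GPU tensor/pipeline parallelism, which this sketch omits.

```python
# Minimal vLLM offline-inference sketch (model path is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/rifki-v3")          # placeholder local path or HF repo id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain Mixture-of-Experts briefly."], params)
print(outputs[0].outputs[0].text)
```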
5. Citation
If you use Rıfkı in your research, please cite: