Training LLMs: Building Your Own Local Rig

My journey into Large Language Models (LLMs) began with the excitement of seeing ChatGPT in action. I started by exploring diffusion models, drawn to their ability to create beautiful visuals. However, working on an M1 chip had its limitations, which motivated me to build a custom rig with an NVIDIA 4090 GPU. As I continued to explore LLMs and experimented with multi-agent systems, I came to realize the importance of mastering the fundamentals. This realization led me to focus on training LLMs from scratch—not just to use them but to deeply understand how they function and evolve.

Rig Evolution

Rig with 2x NVIDIA 4090 GPUs

Note: This setup can train models with up to ~1 billion parameters; however, it performs best with ~500-million-parameter models, which achieve higher Model FLOPs Utilization (MFU).
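To make the MFU note concrete, here is a rough back-of-the-envelope calculator. The ~6 FLOPs per parameter per token rule of thumb and the per-GPU peak figure are my assumptions, not measurements from this rig:

```python
# Rough MFU (Model FLOPs Utilization) estimate for dense transformer
# training. Assumptions (mine, not from this post): ~6 FLOPs per
# parameter per token for forward + backward, and an RTX 4090 peak of
# ~165 TFLOPS of dense bfloat16 per GPU.

def estimate_mfu(n_params: float, tokens_per_sec: float,
                 n_gpus: int = 4, peak_tflops_per_gpu: float = 165.0) -> float:
    """Return MFU as a fraction of aggregate peak bf16 throughput."""
    achieved_flops = 6.0 * n_params * tokens_per_sec   # FLOPs/s actually done
    peak_flops = n_gpus * peak_tflops_per_gpu * 1e12   # aggregate peak FLOPs/s
    return achieved_flops / peak_flops

# Example: a 500M-parameter model pushing 100k tokens/s across 4 GPUs.
mfu = estimate_mfu(500e6, 100_000)
print(f"MFU ~ {mfu:.1%}")  # roughly 45% of peak
```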

  • Initial Build: Rig with 2x NVIDIA 4090 GPUs.
  • Upgraded Build: Rig with 4x NVIDIA 4090 GPUs.

Here’s a comprehensive guide to building a custom rig tailored for LLM training.

Total Cost: This entire setup cost approximately $12,000 USD. It may not be cost-effective, but I wanted to get hands-on and do more experimentation.

You can always go with the cloud for a fraction of the cost. Please check out https://lambdalabs.com/, https://hyperbolic.xyz/, and several others.

Next Step: My journey into LLMs


1. Planning Your Build

  • Define Your Objectives: Determine the scale and type of models you want to train. Smaller models may work with limited resources, but larger architectures demand higher computational power.

  • Budgeting: Set a realistic budget to balance performance with cost. Note that high-end components, especially GPUs, can be costly.


2. Selecting Hardware Components

  • Motherboard: Choose a server or workstation board, primarily for the number of PCIe lanes and compatibility with multiple GPUs. I recommend the SuperMicro M12SWA-TF. While it’s an excellent board, its noisy chipset fan may need replacement with a beefier heatsink and a Noctua fan.

  • CPU: Opt for a robust processor like the AMD Threadripper PRO 5955WX. The primary reason for choosing this CPU is its 128 PCIe lanes, allowing you to connect multiple GPUs without bandwidth constraints.

  • Memory (RAM): Ensure compatibility between your RAM and motherboard. A setup with 128 GB memory is recommended for large datasets and computational tasks.

  • GPUs: NVIDIA 4090 GPUs are ideal for LLM training due to their advanced Ada architecture. Key benefits include:

    • 24 GB VRAM: Sufficient for handling large models and datasets.
    • BFloat16 Performance: Fourth-generation Tensor Cores deliver exceptional performance, up to ~330 TFLOPS of bfloat16 compute (with sparsity; roughly half that for dense workloads), ensuring efficient computation for AI workloads.
    • CUDA Cores: 16,384 CUDA cores ensure unparalleled parallel processing capabilities.
    • Architecture Advantages: Enhanced ray tracing, Shader Execution Reordering, and DLSS 3 improve efficiency. These gaming-oriented features don't directly benefit training, but because this is a current-generation consumer card, the same silicon brings support for more relevant capabilities such as bfloat16 and even FP8 training, along with the sheer number of CUDA cores.
    • Flash Attention: Previous generations such as the 3090 don't support the latest FlashAttention optimizations.

    A setup with 4x NVIDIA 4090s, connected using riser cables like this one, offers top-notch performance for training LLMs. Several people discourage using bare riser cables due to potential PCIe errors, but in my experience they worked flawlessly.
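As a sanity check on the 24 GB VRAM point, here is a rough estimate of the training-state footprint for mixed-precision Adam. The byte counts per parameter are my assumptions (bf16 weights and gradients, fp32 master weights and Adam moments), and activations are ignored, so treat this as a lower bound:

```python
# Back-of-the-envelope VRAM estimate for mixed-precision Adam training.
# Assumed layout (illustrative, not from the post): bf16 weights + grads
# (2 + 2 bytes/param) plus fp32 master weights and two Adam moments
# (4 + 4 + 4 bytes/param). Activations and batches are not counted.

def training_bytes_per_param() -> int:
    return 2 + 2 + 4 + 4 + 4   # weights, grads, master copy, Adam m and v

def min_vram_gb(n_params: float) -> float:
    return n_params * training_bytes_per_param() / 1024**3

# A 500M model's states fit comfortably in a 4090's 24 GB; a 1B model
# leaves much less headroom for activations, which matches the note
# above about ~500M models reaching better utilization.
print(f"500M: {min_vram_gb(500e6):.1f} GB, 1B: {min_vram_gb(1e9):.1f} GB")
```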

  • Storage: Invest in high-capacity storage solutions. My setup includes 6 TB of NVMe SSDs for blazing-fast access and 8 TB of HDD storage for archiving.

  • Power Supply: Dual-PSU setups are often necessary for high-power builds. I used 2x 1500 W Be Quiet! PSUs (Amazon link). Each PSU powers two GPUs, and one also powers the motherboard and CPU. Each GPU drew around 450 W while training a 500M-parameter model with DDP for ~10 days.

  • Case/Frame: For mounting, I recommend this case, which accommodates multiple GPUs and robust cooling.

  • Cooling System: Replace noisy chipset fans with heatsinks like this one for quieter and more efficient cooling.

  • Motherboard Baseboard: Use a baseboard like this one for proper fitting in the case.


3. Assembling the Rig

  • Dual PSU Setup: When using two power supplies, ensure one powers the motherboard and CPU while each PSU powers two GPUs. Specialized adapters can synchronize their power-on sequence. This build needs a 30 A circuit; you might be able to split the load across two different breakers, but this is not recommended.
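The 30 A requirement follows from simple arithmetic. A quick sketch, using my own numbers: ~450 W per GPU as measured above, plus a rough ~500 W allowance for CPU, motherboard, drives, and PSU losses, on a 120 V US circuit:

```python
# Quick sanity check of wall-power draw (illustrative assumptions):
# 4 GPUs at ~450 W plus ~500 W for the rest of the system at 120 V.

def wall_amps(n_gpus: int = 4, gpu_watts: float = 450.0,
              system_watts: float = 500.0, volts: float = 120.0) -> float:
    total_watts = n_gpus * gpu_watts + system_watts
    return total_watts / volts

amps = wall_amps()
# ~19.2 A continuous, above the ~16 A safe continuous limit of a
# standard 20 A breaker, hence the 30 A circuit.
print(f"~{amps:.1f} A")
```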

  • Compatibility Check: Ensure all components are compatible to avoid assembly issues.

  • Physical Assembly: Carefully install components, paying special attention to GPU placement and spacing for optimal airflow.

  • Cable Management: Organize cables neatly to improve airflow and simplify maintenance.


4. Software Configuration

  • Operating System: Use a Linux-based OS (e.g., Ubuntu), known for its stability and suitability for machine learning tasks.

  • Drivers and Dependencies: Install the latest GPU drivers, CUDA, and cuDNN libraries to maximize GPU performance.

  • Machine Learning Frameworks: Set up frameworks like PyTorch or TensorFlow, essential for model training.

  • Custom Kernel: I used a custom kernel from Tinygrad to enable P2P communication between GPUs, further enhancing performance.


5. Training Large Language Models

  • Data Preparation: Curate, clean, and preprocess datasets to ensure high-quality inputs for training.
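As a minimal illustration of the cleaning step, here is a sketch that normalizes whitespace, drops very short lines, and removes exact duplicates. Real pipelines add language filtering, fuzzy deduplication, and quality scoring; everything here is illustrative:

```python
# Minimal text-cleaning sketch: collapse whitespace, drop short lines,
# and deduplicate exactly. Thresholds are illustrative.
import re

def clean_corpus(lines: list[str], min_chars: int = 20) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # normalize whitespace
        if len(text) < min_chars or text in seen:
            continue                              # too short or duplicate
        seen.add(text)
        out.append(text)
    return out

raw = ["Hello   world, this is a training document.",
       "Hello world, this is a training document.",  # duplicate after cleanup
       "too short"]
print(clean_corpus(raw))  # one surviving line
```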

  • Model Selection: Choose architectures like Llama2 or GPT, tailored to your hardware and training goals.

  • Training Process: Initiate training, monitor resource utilization, and adjust configurations as needed for optimal results.
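When planning a run, a rough time estimate helps. This sketch uses the Chinchilla-style ~20 tokens-per-parameter heuristic and an assumed aggregate throughput; both numbers are my assumptions, not measurements from this rig:

```python
# Rough training-time estimate. Assumptions (illustrative): a
# Chinchilla-style token budget of ~20 tokens per parameter, and a
# given aggregate throughput in tokens/sec across all GPUs.

def training_days(n_params: float, tokens_per_sec: float,
                  tokens_per_param: float = 20.0) -> float:
    total_tokens = tokens_per_param * n_params
    return total_tokens / tokens_per_sec / 86_400   # seconds per day

# Example: 500M params at an assumed 60k tokens/s aggregate.
print(f"~{training_days(500e6, 60_000):.1f} days")
```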


6. Optimization and Scaling

  • Multi-GPU Training: Use distributed training techniques such as Distributed Data Parallel (DDP) or ZeRO to fully utilize multiple GPUs.
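One practical detail when scaling out with DDP is tracking the effective global batch size per optimizer step, since it grows with both GPU count and gradient accumulation. A small helper (the example numbers are illustrative):

```python
# Effective (global) batch size under DDP with gradient accumulation:
# per-GPU micro-batch x sequence length x accumulation steps x GPUs.
# Useful when matching a target token budget per optimizer step.

def global_batch_tokens(micro_batch: int, seq_len: int,
                        accum_steps: int, n_gpus: int) -> int:
    return micro_batch * seq_len * accum_steps * n_gpus

# e.g. micro-batch 8, sequence length 1024, 8 accumulation steps, 4 GPUs
print(global_batch_tokens(8, 1024, 8, 4))  # 262144 tokens per step
```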

  • George’s Hack: Leverage the kernel patch by George Hotz to enable peer-to-peer (P2P) communication for NVIDIA 4xxx GPUs, overcoming the lack of official support.

  • Performance Tuning: Optimize hyperparameters, batch sizes, and learning rates to achieve better convergence and efficiency.
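For learning-rate tuning, a common choice in LLM pretraining is linear warmup followed by cosine decay. A minimal sketch with illustrative defaults (peak LR, warmup length, and total steps are assumptions, not values from this rig):

```python
# Linear-warmup + cosine-decay learning-rate schedule, a common
# pretraining default. All hyperparameters here are illustrative.
import math

def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup: int = 2000, total: int = 100_000) -> float:
    if step < warmup:
        return max_lr * (step + 1) / warmup           # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine        # cosine decay to min_lr

print(f"{lr_at(0):.2e}, {lr_at(2000):.2e}, {lr_at(100_000):.2e}")
```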


7. Maintenance and Monitoring

  • Regular Updates: Keep your system and software updated to leverage the latest optimizations and security patches.

  • System Monitoring: Use tools like NVIDIA’s nvidia-smi or Prometheus to track system health, utilization, and temperature.


Key Insights and Tips

  • Hardware Alternatives: While GPUs like the A100 or H100 provide higher VRAM, consumer GPUs such as the 4090 offer excellent performance for cost-conscious setups.

  • Cloud Considerations: On-premise rigs are ideal for long-term projects and experimentation, but cloud solutions offer flexibility for short-term tasks.

  • Community Resources: Explore tutorials from experts like Andrej Karpathy and guides from Hugging Face for additional insights.

Building a rig for LLM training is a challenging but rewarding endeavor that opens up opportunities to push the boundaries of AI development. With careful planning and execution, your custom setup can become a powerful tool for exploring the vast landscape of machine learning.

Rig with 4x NVIDIA 4090 GPUs