
Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on a 1/32 Activation-Ratio MoE Architecture

A team of researchers from China has launched AntAngelMed, a massive open-source medical language model that the team describes as the largest and most capable of its kind currently available.

What Is AntAngelMed?

AntAngelMed is a medical-domain language model with 103 billion total parameters, but it doesn't activate all of those parameters during inference. Instead, it uses a Mixture-of-Experts (MoE) architecture with a 1/32 activation ratio, meaning only 6.1 billion parameters are active at any given time when processing a query.

It helps to understand how MoE architectures work. In a standard dense model, every parameter participates in processing every token. In an MoE model, the network is split into many "expert" sub-networks, and a routing mechanism selects only a small subset of them to handle each input. This lets you have a very large total parameter count, which generally correlates with strong knowledge capacity, while keeping the actual compute cost of inference proportional to the smaller active parameter count.
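To make the routing concrete, here is a minimal, illustrative sketch of top-k expert routing at a 1/32 activation ratio. The "experts" are stand-in matrices rather than real FFN sub-networks, and AntAngelMed's actual router differs in its details (it uses sigmoid scoring, for example); this only shows why compute scales with the k selected experts rather than all n.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 32, 1          # 1 of 32 experts active: a 1/32 activation ratio

def moe_forward(x, experts, router_w, k):
    """Route one token through the top-k of n experts (minimal sketch)."""
    logits = router_w @ x                         # one routing score per expert
    topk = np.argsort(logits)[-k:]                # indices of the k best-scoring experts
    g = np.exp(logits[topk] - logits[topk].max())
    g /= g.sum()                                  # gate weights over the chosen k only
    # Only the k selected expert networks execute; the other n-k cost nothing.
    return sum(w * (experts[i] @ x) for w, i in zip(g, topk))

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # stand-in FFNs
router_w = rng.standard_normal((n_experts, d))
x = rng.standard_normal(d)
y = moe_forward(x, experts, router_w, k)
```

With k=1 the output is exactly the single best expert's output, yet the layer still holds the knowledge capacity of all 32 experts.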

AntAngelMed inherits this design from Ling-flash-2.0, a base model developed by inclusionAI and guided by what the team calls Ling Scaling Laws. The specific optimizations layered on top include refined expert granularity, a tuned shared-expert ratio, attention balance mechanisms, sigmoid routing without an auxiliary loss, an MTP (Multi-Token Prediction) layer, QK-Norm, and Partial-RoPE (Rotary Position Embedding applied to a subset of attention heads rather than all of them). According to the research team, these design choices collectively allow small-activation MoE models to deliver up to 7× efficiency compared to similarly sized dense architectures, meaning that with only 6.1B activated parameters, AntAngelMed can match roughly 40B dense-model performance. Separately, as output length grows during inference, the relative speed advantage can also reach 7× or more over dense models of comparable size.
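The "sigmoid routing without auxiliary loss" item deserves a quick illustration. A softmax router makes expert scores compete with each other, which in practice usually requires an auxiliary load-balancing loss; scoring each expert independently with a sigmoid removes that competition. The sketch below is only a schematic of the idea, not the exact mechanism used in Ling-flash-2.0:

```python
import numpy as np

def sigmoid_router_scores(logits, k=2):
    """Score each expert independently in (0, 1), then pick the top k.

    Unlike a softmax, raising one expert's score does not force the
    others' scores down, so no auxiliary balancing loss is needed to
    keep the router from collapsing onto a few experts.
    """
    scores = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    topk = np.argsort(scores)[-k:]
    gates = scores[topk] / scores[topk].sum()   # renormalize over the chosen k
    return topk, gates

topk, gates = sigmoid_router_scores([2.0, -1.0, 0.5, 3.0], k=2)
```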

https://modelscope.cn/models/MedAIBase/AntAngelMed

Training Pipeline

AntAngelMed uses a three-stage training process designed to layer general language understanding on top of deep medical domain adaptation.

The first stage is continual pre-training on large-scale medical corpora, including encyclopedias, web text, and academic publications. This phase is built on top of the Ling-flash-2.0 checkpoint, giving the model a strong general reasoning foundation before medical specialization begins.

The second stage is Supervised Fine-Tuning (SFT), where the model is trained on a multi-source instruction dataset. This dataset mixes general reasoning tasks (math, programming, logic) to preserve chain-of-thought capabilities with medical scenarios such as doctor–patient Q&A, diagnostic reasoning, and safety and ethics cases.

The third stage is Reinforcement Learning using the GRPO (Group Relative Policy Optimization) algorithm, combined with task-specific reward models. GRPO, originally introduced in the DeepSeekMath paper, is a variant of PPO that estimates baselines from group scores rather than a separate critic model, making it computationally lighter. Here, reward signals are designed to shape model behavior toward empathy, structured medical responses, safety boundaries, and evidence-based reasoning, all with the goal of reducing hallucinations on medical questions.
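The group-relative baseline at the heart of GRPO can be sketched in a few lines. This is an illustration of the published formula, not AntAngelMed's actual training code:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage estimate used by GRPO (DeepSeekMath).

    For each prompt, sample a group of G responses and score them with a
    reward model. Instead of a learned critic, the baseline is the group
    mean, and rewards are normalized by the group's standard deviation:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. four sampled answers to one medical question, scored by a reward model
adv = grpo_advantages([0.2, 0.9, 0.5, 0.4])
```

Responses scoring above the group mean get positive advantages and are reinforced; below-mean responses are suppressed, with no critic network to train or store.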

Inference Performance

On H20 hardware, AntAngelMed exceeds 200 tokens per second, which the research team reports is roughly 3× faster than a 36-billion-parameter dense model. With YaRN (Yet Another RoPE extensioN) extrapolation, it supports a 128K context length, long enough to handle full medical documents, extended patient histories, or multi-turn clinical dialogues.
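For readers who want to see what a YaRN extension looks like in practice, the snippet below follows the Hugging Face `rope_scaling` config convention. The base pre-training context length (32K) and the exact parameter names here are assumptions for illustration, not values published for AntAngelMed:

```python
# Hypothetical sketch: declaring a YaRN context extension in a model config.
original_ctx = 32_768            # assumed pre-training context, not a published value
target_ctx = 131_072             # the reported 128K context window

rope_scaling = {
    "rope_type": "yarn",
    "factor": target_ctx / original_ctx,              # position-interpolation factor
    "original_max_position_embeddings": original_ctx,
}
```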

The research team has also released an FP8-quantized version of the model. When this quantization is combined with EAGLE3 speculative decoding, inference throughput at a concurrency of 32 improves significantly over FP8 alone: 71% on HumanEval, 45% on GSM8K, and 94% on Math-500. These benchmarks measure coding and math reasoning tasks, not medical tasks directly, but serve as proxies for the model's general throughput stability across output types.

Benchmark Results

On HealthBench, the open-source medical evaluation benchmark from OpenAI that uses simulated multi-turn medical dialogues to measure real-world clinical performance, AntAngelMed ranks first among all open-source models and surpasses a range of top proprietary models as well, with a particularly significant advantage on the HealthBench-Hard subset.

On MedAIBench, an evaluation system maintained by China's National Artificial Intelligence Medical Industry Pilot Facility, AntAngelMed ranks at the top level, with particularly strong scores in the medical knowledge Q&A and medical ethics and safety categories.

On MedBench, a benchmark for Chinese healthcare LLMs covering 36 independently curated datasets and roughly 700,000 samples across five dimensions (medical knowledge question answering, medical language understanding, medical language generation, complex medical reasoning, and safety and ethics), AntAngelMed ranks first overall.

Marktechpost’s Visual Explainer

Technical Guide: AntAngelMed


01 — Overview
What Is AntAngelMed?
Jointly developed by Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen’er Medical AI Technology Co., Ltd.

103B Total Params
6.1B Active at Inference
128K Context Length

AntAngelMed is a medical-domain LLM built on a 1/32 activation-ratio MoE architecture. With 103B total parameters and only 6.1B active at inference time, it matches the performance of roughly 40B dense models at a fraction of the compute cost.

Model weights are released under Apache 2.0. The code repository is licensed under MIT.

02 — Architecture
MoE Architecture & Base Model
Built on Ling-flash-2.0 by inclusionAI, guided by Ling Scaling Laws.

AntAngelMed uses a 1/32 activation-ratio MoE with optimizations across all core components. These choices enable small-activation MoE models to deliver up to 7× efficiency over similarly sized dense architectures, and as output length grows, relative speedups can reach 7× or more.

Key architectural components:

Expert Granularity
Shared Expert Ratio
Sigmoid Routing
No Auxiliary Loss
MTP Layer
QK-Norm
Partial-RoPE
YaRN Extrapolation
Attention Balance

03 — Training
Three-Stage Training Pipeline
Designed to layer general language understanding on top of deep medical domain adaptation.

Stage 01
Continual Pre-Training
Built on Ling-flash-2.0, trained on large-scale medical corpora (encyclopedias, web text, and academic publications) to inject deep domain and world knowledge.
Stage 02
Supervised Fine-Tuning (SFT)
Multi-source instruction data mixing general tasks (math, programming, logic) for chain-of-thought, plus medical scenarios (doctor–patient Q&A, diagnostic reasoning, safety/ethics) for medical adaptation.
Stage 03
Reinforcement Learning via GRPO
Group Relative Policy Optimization with task-specific reward models. Shapes model behavior toward empathy, structural clarity, safety boundaries, and evidence-based reasoning to reduce hallucinations.

04 — Inference
Inference Performance
Hardware benchmarks on H20 and throughput improvements from FP8 + EAGLE3 optimization.

>200 tok/s
On H20 hardware. Approximately 3× faster than a comparable 36B dense model.
7× efficiency
MoE vs. dense at equal size. Speedup increases further as output length grows.
+71% / +45% / +94%
FP8 + EAGLE3 throughput gains over FP8 alone on HumanEval / GSM8K / Math-500 at concurrency 32.
128K context
Supported via YaRN extrapolation. Handles full medical documents and extended multi-turn dialogues.

05 — Benchmarks
Benchmark Results
Evaluated across three authoritative medical LLM benchmarks.

Benchmark | Scope | Result
HealthBench (OpenAI) | Simulated multi-turn medical dialogues for real-world clinical performance. | #1 open-source; surpasses several proprietary models. Largest lead on HealthBench-Hard.
MedAIBench (Nat'l AI Medical Pilot Facility) | Chinese authority benchmark covering knowledge Q&A and medical ethics/safety. | Top level. Strongest in knowledge Q&A and medical ethics/safety.
MedBench (Chinese healthcare domain) | 36 datasets, ~700K samples across five medical dimensions. | #1 overall across all five dimensions.

06 — Quickstart
Run with Hugging Face Transformers
Requires trust_remote_code=True for the MoE routing code.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer; trust_remote_code is required for the MoE routing code.
model = AutoModelForCausalLM.from_pretrained(
    "MedAIBase/AntAngelMed",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("MedAIBase/AntAngelMed")

messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user",   "content": "What should I do if I have a headache?"},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt",
    return_token_type_ids=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=16384)
# Strip the prompt tokens so only the generated continuation is decoded.
out = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Also supports: vLLM v0.11.0 (4-GPU tensor parallel), SGLang with FlashAttention-3, and vLLM-Ascend for Huawei Ascend 910B NPUs.

07 — Access
Resources & Links
Model weights: Apache 2.0. Code repository: MIT. FP8 quantized variant available separately.

Developed by Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen’er Medical AI Technology Co., Ltd.
Coverage by Marktechpost — marktechpost.com


Key Takeaways

  • AntAngelMed is a 103B-parameter open-source medical LLM that activates only 6.1B parameters at inference time, using a 1/32 activation-ratio MoE architecture inherited from Ling-flash-2.0.
  • It uses a three-stage training pipeline: continual pre-training on medical corpora, SFT with mixed general and medical instruction data, and GRPO-based reinforcement learning for safety and diagnostic reasoning.
  • On H20 hardware, the model exceeds 200 tokens/s and supports a 128K context length via YaRN extrapolation; it is roughly 3× faster than a comparable 36B dense model.
  • AntAngelMed ranks first among open-source models on OpenAI's HealthBench, surpasses several proprietary models, and tops both the MedAIBench and MedBench leaderboards.
  • The model is available on Hugging Face, ModelScope, and GitHub; model weights are Apache 2.0, code is MIT, and an FP8 quantized version is also released.

Check out the model weights on Hugging Face, the GitHub repo, and the technical details.
