Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P] — Trendlair