{"id":23766,"date":"2026-04-30T23:35:36","date_gmt":"2026-04-30T23:35:36","guid":{"rendered":"https:\/\/www.europesays.com\/ai\/23766\/"},"modified":"2026-04-30T23:35:36","modified_gmt":"2026-04-30T23:35:36","slug":"automating-gpu-kernel-translation-with-ai-agents-cutile-python-to-cutile-jl","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ai\/23766\/","title":{"rendered":"Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl"},"content":{"rendered":"<p><a href=\"https:\/\/developer.nvidia.com\/cuda\/tile\" data-wpel-link=\"internal\" target=\"_self\" rel=\"follow nofollow noopener\">NVIDIA CUDA Tile<\/a> (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations\u2014loads, stores, and matrix multiply-accumulate\u2014rather than manually coordinating threads, warps, and shared memory.<\/p>\n<p><a href=\"https:\/\/github.com\/JuliaGPU\/cuTile.jl\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow nofollow noopener\">cuTile.jl<\/a> brings the same <a href=\"https:\/\/developer.nvidia.com\/blog\/cutile-jl-brings-nvidia-cuda-tile-based-programming-to-julia\/\" data-wpel-link=\"internal\" target=\"_self\" rel=\"follow nofollow noopener\">tile-based approach to the dynamic programming language Julia<\/a>. Users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in <a href=\"https:\/\/juliahub.com\/products\/julia\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow nofollow noopener\">Julia\u2019s<\/a> scientific computing ecosystem\u2014 spanning differential equations, probabilistic programming, and physics simulations.\u00a0<\/p>\n<p><a href=\"https:\/\/docs.nvidia.com\/cuda\/cutile-python\/\" data-wpel-link=\"internal\" target=\"_self\" rel=\"follow nofollow noopener\">cuTile Python<\/a> has a growing library of optimized kernels for GPU acceleration. 
The ability to translate those kernels to cuTile.jl gives the Julia ecosystem immediate access to battle-tested implementations, rather than requiring each one to be rewritten from scratch.<\/p>\n<p>This post covers cross-domain-specific-language (DSL) GPU kernel translation: porting cuTile Python kernels to <a href=\"https:\/\/developer.nvidia.com\/blog\/cutile-jl-brings-nvidia-cuda-tile-based-programming-to-julia\/\" data-wpel-link=\"internal\" target=\"_self\" rel=\"follow nofollow noopener\">cuTile.jl<\/a> (Julia). It shows how to:<\/p>\n<p>Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.<\/p>\n<p>Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs\u2014and silent mismatches produce wrong results, not compiler errors.<\/p>\n<p>Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in <a href=\"https:\/\/github.com\/NVIDIA\/TileGym\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow nofollow noopener\">TileGym<\/a> that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.<\/p>\n<p>Cross-DSL GPU kernel translation<a href=\"#cross-dsl_gpu_kernel_translation\" aria-label=\"Scroll to Cross-DSL GPU kernel translation section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>Both the cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. 
However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.<\/p>\n<table><thead><tr><th>Category<\/th><th>Python (cuTile)<\/th><th>Julia (cuTile.jl)<\/th><\/tr><\/thead><tbody><tr><td>Indexing<\/td><td>0-based (ct.bid(0))<\/td><td>1-based (ct.bid(1))<\/td><\/tr><tr><td>Broadcasting<\/td><td>Implicit (a + b)<\/td><td>Explicit dot syntax (a .+ b)<\/td><\/tr><tr><td>Memory layout<\/td><td>Row-major<\/td><td>Column-major<\/td><\/tr><tr><td>Kernel definition<\/td><td>@ct.kernel decorator<\/td><td>Plain function &#8230; end<\/td><\/tr><tr><td>Constants<\/td><td>param: ct.Constant[int] in signature<\/td><td>param::Int in signature, ct.Constant(val) at launch<\/td><\/tr><tr><td>Type conversion<\/td><td>tile.astype(ct.float32)<\/td><td>convert(ct.Tile{Float32}, tile)<\/td><\/tr><tr><td>Matrix multiply<\/td><td>ct.mma(a, b, acc=acc)<\/td><td>muladd(a, b, acc)<\/td><\/tr><\/tbody><\/table>\n<p>Table 1. High-level differences between writing tile code in Python versus Julia<\/p>\n<p>None of these translations are conceptually difficult, but miss one ct.bid(0) that should be ct.bid(1), and you get silent data corruption. Use * instead of .* for element-wise multiply, and Julia silently does a matrix multiply instead. These are the kinds of bugs that waste hours.<\/p>\n<p>A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow\u2014if the model is taught what to watch out for.<\/p>\n<p>Translating cuTile Python to cuTile.jl<a href=\"#translating_cutile_python_to_cutilejl\" aria-label=\"Scroll to Translating cuTile Python to cuTile.jl section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.<\/p>\n<p>Matrix multiplication example<a href=\"#matrix_multiplication_example\" aria-label=\"Scroll to Matrix multiplication example section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>The running example uses matmul, which is complex enough to show key translation challenges. 
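<\/p>
<p>Before the full kernels, one trap from Table 1 is worth seeing concretely. The following is a plain NumPy sketch (illustrative only, not cuTile code) of why a leftover * in translated Julia silently changes the math:<\/p>

```python
import numpy as np

# Sketch of the "*" versus ".*" trap: Julia's * on matrices is a matrix
# multiply (NumPy @), while .* is element-wise (NumPy *). A translation
# that leaves * unchanged silently computes something different.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = a * b   # what a * b means in cuTile Python
matmul = a @ b        # what a * b means in Julia

# Same expression, different results, and no error from either language.
assert not np.array_equal(elementwise, matmul)
```

<p>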
Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.<\/p>\n<p>cuTile Python:<\/p>\n<p>@ct.kernel<br \/>\ndef matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int],<br \/>\n                  tk: ct.Constant[int]):<br \/>\n    bid_m = ct.bid(0)<br \/>\n    bid_n = ct.bid(1)<\/p>\n<p>    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))<br \/>\n    acc = ct.full((tm, tn), 0, dtype=ct.float32)<\/p>\n<p>    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype<\/p>\n<p>    for k in range(num_k):<br \/>\n        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),<br \/>\n                    padding_mode=ct.PaddingMode.ZERO)<br \/>\n        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),<br \/>\n                    padding_mode=ct.PaddingMode.ZERO)<br \/>\n        a = a.astype(dtype)<br \/>\n        b = b.astype(dtype)<br \/>\n        acc = ct.mma(a, b, acc)<\/p>\n<p>    acc = ct.astype(acc, C.dtype)<br \/>\n    ct.store(C, index=(bid_m, bid_n), tile=acc)<\/p>\n<p>cuTile.jl (Julia):<\/p>\n<p>function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},<br \/>\n                      tm::Int, tn::Int, tk::Int) where {T}<br \/>\n    bid_m = ct.bid(1)<br \/>\n    bid_n = ct.bid(2)<\/p>\n<p>    num_k = ct.num_tiles(A, 2, (tm, tk))<br \/>\n    acc = zeros(Float32, tm, tn)<\/p>\n<p>    U = T === Float32 ? 
ct.TFloat32 : T<\/p>\n<p>    for k in Int32(1):num_k<br \/>\n        a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)<br \/>\n        b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)<br \/>\n        a = convert(ct.Tile{U}, a)<br \/>\n        b = convert(ct.Tile{U}, b)<br \/>\n        acc = muladd(a, b, acc)<br \/>\n    end<\/p>\n<p>    acc = convert(ct.Tile{T}, acc)<br \/>\n    ct.store(C; index=(bid_m, bid_n), tile=acc)<br \/>\n    return<br \/>\nend<\/p>\n<p>Beyond the basic syntax changes, note the following:<\/p>\n<p>The layout flips: The Python row-major A(M,K) becomes column-major A_jl(K,M) in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong\u2014say (TM, TN) instead of (TN, TM)\u2014and you get wrong results with no compiler warning.<\/p>\n<p>ct.mma \u2192 muladd: cuTile.jl maps matrix multiply-accumulate to the Julia standard muladd, and ct.PaddingMode.ZERO becomes ct.PaddingMode.Zero (PascalCase).<\/p>\n<p>Softmax example<a href=\"#softmax_example\" aria-label=\"Scroll to Softmax example section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>Softmax pushes things further. Three strategies were implemented in Julia\u2014tensor memory accelerator (TMA) single-tile, online, and chunked\u2014to handle different tensor sizes. 
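<\/p>
<p>The core bookkeeping of the online strategy can be sketched in plain NumPy (illustrative only; the actual cuTile.jl kernel operates on tiles and processes rows in parallel):<\/p>

```python
import numpy as np

# Illustrative sketch of an online (streaming) softmax: process the input
# in chunks while maintaining a running max m and a running sum s of
# exp(x - m). When m grows, the old sum is rescaled by exp(m_old - m_new).
def online_softmax(x, chunk=4):
    m = -np.inf   # running max
    s = 0.0       # running sum of exp(x - m)
    for start in range(0, len(x), chunk):
        c = x[start:start + chunk]
        m_new = max(m, c.max())
        s = s * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    return np.exp(x - m) / s

x = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 4.0, -3.0, 1.5])
m_ref = x.max()
ref = np.exp(x - m_ref) / np.exp(x - m_ref).sum()
assert np.allclose(online_softmax(x), ref)
```

<p>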
On top of the matmul patterns, the softmax function brings in broadcast dot syntax (ct.exp(ct.sub(a, b)) \u2192 exp.(a .- b)), renamed reductions (ct.max \u2192 maximum, ct.sum \u2192 sum, axis +1), and element-wise ct.maximum(a, b) \u2192 max.(a, b).\u00a0<\/p>\n<p>But the real challenge isn\u2019t syntax\u2014it\u2019s maintaining correct running max\/sum statistics through the translation.<\/p>\n<p>Workflow generation with agent skills<a href=\"#workflow_generation_with_agent_skills\" aria-label=\"Scroll to Workflow generation with agent skills section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>The primary outcome of this project wasn\u2019t the translated kernels\u2014it was the skill built to produce them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1334\" height=\"394\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on-async--click=\"actions.showLightbox\" data-wp-on-async--load=\"callbacks.setButtonStyles\" data-wp-on-async-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/www.europesays.com\/ai\/wp-content\/uploads\/2026\/04\/6-Step-Workflow.webp\" alt=\"A six-step workflow laid out left to right for producing a reusable GPU kernel, with labeled stages: Analyze Source Kernel, Load Rules and API Mappings, Reference Worked Examples, Generate CuTile.jl Kernel, Validate and Test, and Produce Reusable Output.\" class=\"lazyload wp-image-116190\"  data-\/><\/p>\n<p>\t\tFigure 1. The conversion skill packages translation rules, API mappings, examples, validation, and tests into a single reusable workflow<\/p>\n<p>A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. 
The path to this particular skill is:.claude\/skills\/converting-cutile-to-julia\/.<\/p>\n<p>.claude\/skills\/converting-cutile-to-julia\/<br \/>\n\u251c\u2500\u2500 SKILL.md                           # Entry point: workflow overview, top pitfalls<br \/>\n\u251c\u2500\u2500 translations\/<br \/>\n\u2502   \u2514\u2500\u2500 workflow.md                    # Step-by-step conversion with checklists<br \/>\n\u251c\u2500\u2500 references\/<br \/>\n\u2502   \u251c\u2500\u2500 api-mapping.md                 # Bidirectional Python\u2194Julia API table<br \/>\n\u2502   \u251c\u2500\u2500 critical-rules.md              # 17 rules (indexing, broadcasting, loops, &#8230;)<br \/>\n\u2502   \u251c\u2500\u2500 debugging.md                   # Error diagnosis for MethodError, IRError, etc.<br \/>\n\u2502   \u2514\u2500\u2500 testing.md                     # Test patterns, tolerances per dtype<br \/>\n\u251c\u2500\u2500 scripts\/<br \/>\n\u2502   \u2514\u2500\u2500 validate_cutile_jl.py          # Static checker for common anti-patterns<br \/>\n\u2514\u2500\u2500 examples\/<br \/>\n    \u251c\u2500\u2500 01_add\/                        # Python\u2192Julia for vector addition<br \/>\n    \u251c\u2500\u2500 02_matmul\/                     # Python\u2192Julia for matrix multiply<br \/>\n    \u2514\u2500\u2500 03_softmax\/                    # Python\u2192Julia for softmax (3 strategies)<\/p>\n<p>The critical-rules.md alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.<\/p>\n<p>#PitfallFix1max(a, b) on tiles \u2192 IRErrorUse max.(a, b) (broadcast dot)2ct.load with order \u2014 index positions wrongorder remaps BOTH shape AND indexTable 2. Pitfalls and associated fixes for some of the more common issues encountered<\/p>\n<p>There\u2019s also a static validator script that catches things like leftover ct.bid(0), for loops inside kernels, and Python-style type names\u2014before running on the GPU. 
With all of this in place, the model doesn\u2019t have to rediscover the conversion rules each time. It reads the skill, follows the checklist, and applies the rules.<\/p>\n<p>The AI agent skill in TileGym<a href=\"#the_ai_agent_skill_in_tilegym\" aria-label=\"Scroll to The AI agent skill in TileGym section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>The concrete deliverable is a Julia subproject under julia\/ in TileGym, which is open source:<\/p>\n<p>julia\/<br \/>\n\u251c\u2500\u2500 Project.toml                # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test<br \/>\n\u251c\u2500\u2500 kernels\/<br \/>\n\u2502   \u251c\u2500\u2500 add.jl                  # 1D element-wise with alpha scaling<br \/>\n\u2502   \u251c\u2500\u2500 matmul.jl               # 2D tiled MMA with column-major layout<br \/>\n\u2502   \u2514\u2500\u2500 softmax.jl              # 3 strategies: TMA, online, chunked<br \/>\n\u2514\u2500\u2500 test\/<br \/>\n    \u251c\u2500\u2500 runtests.jl             # Test runner<br \/>\n    \u251c\u2500\u2500 test_add.jl<br \/>\n    \u251c\u2500\u2500 test_matmul.jl<br \/>\n    \u2514\u2500\u2500 test_softmax.jl<\/p>\n<p>These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip. Softmax introduces multipass algorithms with invariants that have to survive translation. 
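<\/p>
<p>A reference-comparison pattern of this kind can be sketched as follows, as a NumPy stand-in; the tolerance values here are illustrative, and the project\u2019s actual tests are written in Julia:<\/p>

```python
import numpy as np

# Illustrative sketch of checking a kernel result against a CPU reference
# with per-dtype relative tolerances. The tolerance values are examples,
# not the ones the TileGym test suite uses.
RTOL = {np.float32: 1e-5, np.float16: 1e-2}

def check(result, reference):
    rtol = RTOL[result.dtype.type]
    assert np.allclose(result, reference, rtol=rtol), "kernel mismatch"

ref = np.random.rand(8, 8).astype(np.float32)
noisy = ref * (1 + 1e-7)   # emulate benign rounding differences on the GPU
check(noisy, ref)          # passes within the float32 tolerance
```

<p>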
Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions don\u2019t align to tile sizes.<\/p>\n<p>Results and lessons learned<a href=\"#results_and_lessons_learned\" aria-label=\"Scroll to Results and lessons learned section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>With the skill in place, the workflow for each kernel looked like this:<\/p>\n<p>Pre-flight: Scan the source for patterns that require special handling (for loops, ct.mma, order=, and so on).<\/p>\n<p>Convert: Apply the API mapping and critical rules.<\/p>\n<p>Validate: Run the static checker.<\/p>\n<p>Test: Run Julia tests against reference implementations.<\/p>\n<p>Fix: If something fails, use the debugging guide, fix, and rerun.<\/p>\n<p>For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and ~78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.<\/p>\n<p>Table 3 lists the pitfalls that caused bugs during ports, all of which are now handled automatically in the skill.<\/p>\n<table><thead><tr><th>Pitfall<\/th><th>Symptom<\/th><th>Root cause<\/th><\/tr><\/thead><tbody><tr><td>ct.bid(0) left unchanged<\/td><td>Wrong tile loaded, silent corruption<\/td><td>0-based versus 1-based indexing<\/td><\/tr><tr><td>a * b for element-wise multiply<\/td><td>Matrix multiply instead of element-wise<\/td><td>Julia * is matmul; need .*<\/td><\/tr><tr><td>Accumulator shape (TM, TN)<\/td><td>Wrong results in matmul<\/td><td>Column-major needs (TN, TM)<\/td><\/tr><tr><td>ct.PaddingMode.ZERO<\/td><td>UndefVarError<\/td><td>Julia uses PascalCase: .Zero<\/td><\/tr><\/tbody><\/table>\n<p>Table 3. Common pitfalls, symptoms, and root causes encountered when porting tile code from Python to Julia<\/p>\n<p>The takeaway isn\u2019t that AI wrote the code. It\u2019s the ability to capture what was learned into something the model can reuse next time. 
A prompt can say, \u201cBe careful with indexing.\u201d A skill can say, \u201cHere are the 17 specific things that go wrong, here\u2019s how to check for them, and here\u2019s a script that catches them automatically.\u201d<\/p>\n<p>Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide. Each one takes less effort than the last.<\/p>\n<p>A broader takeaway is that the challenge in using AI for systems work isn\u2019t code generation\u2014it\u2019s producing correct code in domains where the compiler won\u2019t catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.<\/p>\n<p>Get started using agent skills to translate Python kernels to Julia<a href=\"#get_started_using_agent_skills_to_translate_python_kernels_to_julia\" aria-label=\"Scroll to Get started using agent skills to translate Python kernels to Julia section\" class=\"heading-anchor-link\"><\/a><\/p>\n<p>Use the following code to try the Julia subproject and the conversion skill:<\/p>\n<p>cd TileGym<\/p>\n<p># Explore the Julia kernels<br \/>\nls julia\/kernels\/     # add.jl, matmul.jl, softmax.jl<\/p>\n<p># Explore the conversion skill<br \/>\nls .claude\/skills\/converting-cutile-to-julia\/<\/p>\n<p># Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)<br \/>\njulia --project=julia\/ -e 'using Pkg; Pkg.instantiate()'<\/p>\n<p># Run the Julia kernel tests<br \/>\njulia --project=julia\/ julia\/test\/runtests.jl<\/p>\n<p>Requirements:<\/p>\n<p>Julia 1.12+ and NVIDIA CUDA 13.1+ driver<\/p>\n<p>NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)<\/p>\n<p>An LLM agent with file system access (for example, <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/claude-code\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow nofollow noopener\">Claude Code<\/a>). 
To use the conversion skill for your own kernels, point your LLM agent at .claude\/skills\/converting-cutile-to-julia\/SKILL.md, provide a cuTile Python kernel as input, and start translating Python kernels to Julia.<\/p>\n","protected":false},"excerpt":{"rendered":"NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms&hellip;\n","protected":false},"author":2,"featured_media":23767,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[405,7537],"class_list":{"0":"post-23766","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-agentic-ai","8":"tag-ai-agents","9":"tag-artificial-intelligence-agents"},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/23766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/comments?post=23766"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/23766\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media\/23767"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media?parent=23766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/categories?post=23766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/tags?post=23766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}