Prototype: gotreesitter-backed syntax highlighting (refs #36765) by odvcencio · Pull Request #36791 · go-gitea/gitea

odvcencio · 2026-03-01T12:59:12Z

Authorship attribution: prepared by @odvcencio with Codex assistance.

Benchmark Methodology (up front)

Benchmarks were generated on March 1, 2026 with this exact command:

GOMAXPROCS=1 go test ./modules/highlight -run '^$' -bench 'BenchmarkRenderCode(TreeSitter|Chroma)Go|BenchmarkRenderFullFile(TreeSitter|Chroma)Go' -benchmem -count=10 -benchtime=750ms

Summarized via:

benchstat /tmp/gitea_highlight_bench_v060.txt

Environment captured by bench output:

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) Ultra 9 285

Benchmark Results

From benchstat on /tmp/gitea_highlight_bench_v060.txt:

RenderCodeTreeSitterGo: 30.87ms vs RenderCodeChromaGo: 49.38ms (~1.60x faster)
RenderFullFileTreeSitterGo: 31.63ms vs RenderFullFileChromaGo: 53.88ms (~1.70x faster)
Memory:
- RenderCode: 4.252MiB vs 33.05MiB (~7.8x lower)
- RenderFullFile: 4.432MiB vs 34.52MiB (~7.8x lower)
Allocations:
- RenderCode: 20.58k vs 585.7k (~28.5x lower)
- RenderFullFile: 20.58k vs 640.7k (~31.1x lower)

Summary

Adds a new modules/highlight renderer path that prefers github.com/odvcencio/gotreesitter and falls back to Chroma when grammar/query detection is unavailable.
Integrates this path into file view, diff line rendering/full-file highlighting, code search snippets, and orgmode code blocks.
Keeps Chroma-compatible CSS token classes so existing theme styles continue to apply.
Adds benchmark coverage comparing gotreesitter vs Chroma in modules/highlight.
Pins github.com/odvcencio/gotreesitter to v0.6.0.

Safety Gate

This branch now includes an explicit feature flag:

[highlight] ENABLE_GOTREESITTER = true

If set to false, highlighting uses Chroma only. This provides a low-risk rollout/rollback lever while evaluating parity and production behavior.
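A minimal sketch of how such a gate could sit in front of the two engines. Only the setting name ENABLE_GOTREESITTER comes from the description above; `renderFunc`, `renderWithGate`, and the stub renderers are hypothetical names used for illustration and are not taken from the PR.

```go
package main

import "fmt"

// renderFunc is a hypothetical renderer signature: it returns highlighted
// HTML and reports whether it could handle the input at all.
type renderFunc func(fileName, code string) (html string, ok bool)

// renderWithGate tries the preferred renderer only when the flag is on and
// uses the fallback when the flag is off or the preferred renderer declines.
func renderWithGate(enableTreeSitter bool, treeSitter, chroma renderFunc, fileName, code string) string {
	if enableTreeSitter {
		if out, ok := treeSitter(fileName, code); ok {
			return out
		}
	}
	out, _ := chroma(fileName, code)
	return out
}

func main() {
	// Stub renderers standing in for the real engines.
	ts := func(name, code string) (string, bool) {
		if name != "main.go" {
			return "", false // pretend no grammar is available
		}
		return "<ts>" + code + "</ts>", true
	}
	ch := func(name, code string) (string, bool) {
		return "<chroma>" + code + "</chroma>", true
	}

	fmt.Println(renderWithGate(true, ts, ch, "main.go", "x"))   // flag on, grammar found
	fmt.Println(renderWithGate(true, ts, ch, "notes.xyz", "x")) // flag on, falls back
	fmt.Println(renderWithGate(false, ts, ch, "main.go", "x"))  // flag off, Chroma only
}
```

With this shape, turning the flag off never touches the tree-sitter path, which is what makes it usable as a rollback lever.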

Validation

make fmt
make lint-go
make tidy
go test -tags 'sqlite sqlite_unlock_notify' ./modules/highlight ./modules/markup/orgmode ./modules/indexer/code ./services/gitdiff -count=1

Notes

This is an exploratory port intended to gather practical feedback for issue #36765 (Consider replacing chroma with gotreesitter).
Capture-to-token-class mapping is intentionally conservative and can be refined by language as needed.

wxiaoguang · 2026-03-01T13:06:19Z

modules/highlight/treesitter.go

+		return registry
+	})
+
+	treeSitterDetectCache   sync.Map // map[string]*tsgrammars.LangEntry


For proof-of-concept or for production?

If for production, we should be careful about "cache". Here if I understand correctly, all file names will go into the cache, it will bloat infinitely?
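One way to address the unbounded-growth concern, sketched under assumptions: a small LRU cache with a fixed capacity instead of a bare sync.Map. The type `boundedCache` and its methods are hypothetical and do not appear in the PR; string values stand in for whatever the detection cache would actually store.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// cacheEntry is the payload stored in each list element.
type cacheEntry struct {
	key   string
	value string
}

// boundedCache is a hypothetical LRU cache that never holds more than max entries.
type boundedCache struct {
	mu    sync.Mutex
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding *cacheEntry
}

func newBoundedCache(max int) *boundedCache {
	return &boundedCache{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached value and marks the key as recently used.
func (c *boundedCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*cacheEntry).value, true
}

// Put stores a value and evicts the least recently used entry when over capacity.
func (c *boundedCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*cacheEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&cacheEntry{key: key, value: value})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).key)
	}
}

// Len reports the current number of entries.
func (c *boundedCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.order.Len()
}

func main() {
	cache := newBoundedCache(2)
	cache.Put("a.go", "go")
	cache.Put("b.py", "python")
	cache.Get("a.go") // touch a.go so b.py becomes the least recently used
	cache.Put("c.rs", "rust")
	_, ok := cache.Get("b.py")
	fmt.Println(cache.Len(), ok) // prints: 2 false
}
```

Keying by file extension or language name rather than by full file name would be another way to keep the key space small.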

silverwind · 2026-03-01T13:13:34Z

RenderFullFileTreeSitterGo: 31.63ms vs RenderFullFileChromaGo: 53.88ms (~1.70x faster)

That's less than I would have expected. Do you perhaps also have benchmark results comparing gotreesitter vs. https://github.com/tree-sitter/tree-sitter? Ideally they should perform roughly the same, with Go being a bit slower due to GC.

wxiaoguang · 2026-03-01T13:15:09Z

RenderCodeTreeSitterGo: 30.87ms vs RenderCodeChromaGo: 49.38ms (~1.60x faster)

I think when benchmarking a large SQL file (just duplicate a long SQL 4000 times), then the difference might be huge.
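A small sketch of how such an input could be built for a benchmark. The helper name `buildRepeatedInput` and the sample statement are made up for illustration; the repeat count of 4000 is the one suggested above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildRepeatedInput returns stmt repeated n times, one copy per line.
func buildRepeatedInput(stmt string, n int) string {
	return strings.Repeat(stmt+"\n", n)
}

func main() {
	stmt := "SELECT id, name FROM users WHERE created_at > '2026-01-01' ORDER BY id;"
	input := buildRepeatedInput(stmt, 4000)
	// The resulting string can be fed to both renderers inside a benchmark loop.
	fmt.Println(len(input), strings.Count(input, "\n"))
}
```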

…ting
- Add ENABLE_GOTREESITTER configuration option to control syntax highlighting engine
- Make all TreeSitter rendering paths conditional based on the new setting
- Default to true for backward compatibility
- Chroma remains fallback when disabled

wxiaoguang · 2026-03-01T13:18:16Z

custom/conf/app.example.ini

+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;; Whether to prefer gotreesitter for syntax highlighting. If false, Gitea uses Chroma only.
+;ENABLE_GOTREESITTER = true


For production, I think we will just choose one, just the good enough default one, then no more option, because most end users don't know these options .....

TheFox0x7 · 2026-03-01T13:42:53Z

Tried it out of curiosity and, as is, it regresses.
treesitter:

chroma:

silverwind · 2026-03-01T13:50:58Z

A good test would be to have an AI check whether the syntax highlighting of all the 200+ supported languages is structurally identical to the same highlighting on github.com. That would confirm that gotreesitter parses the same as the original tree-sitter.

odvcencio · 2026-03-01T20:32:33Z

i didnt anticipate the speedy reviews, thanks so much for bearing with me 😅 !

i did a first pass to validate the conceit of the issue and didnt delve too deeply into the performance/profiling which im doing right now. uncovered some interesting stuff and im expanding the parity against c's tree-sitter runtime (my parity harness compares to cgo and c) as well as top 50 language perf profiles to determine if there are hand tuned grammar optimizations or edges to cover.

all that work should change the profiles significantly and early picture says that there are still yet optimizations to be made for this use case specifically.... thanks again for the thoughtful reviews

…arsing and caching
- Replace Highlighter wrapper with native Parser/Query for finer control
- Add intelligent render cache with primary/alternate slots for code variations
- Implement incremental parsing via single-edit computation for fast re-renders
- Add line-by-line render method avoiding expensive HTML splitting
- Remove treeSitterDetectCache; prefer filename-based detection over language name
- Add capture class caching and range normalization to reduce allocations
- Introduce comprehensive benchmark suite comparing tree-sitter vs Chroma
- Add overlap resolution for nested highlight ranges using sweep-line algorithm

- Update gotreesitter from v0.6.0 to v0.6.1-0.20260302173816-cf6d6a44ace7
- Pull in latest fixes for tree-sitter syntax highlighting

…esitter default
- Remove ENABLE_GOTREESITTER configuration option
- gotreesitter is now always enabled by default
- Simplify tryRenderCodeByTreeSitter signature by removing trimTrailingNewline parameter
- Rewrite resolveHighlightOverlaps with stack-based algorithm replacing event-based approach
- Add writeEscapedBytes optimization to avoid unnecessary HTML escaping for safe content
- Remove redundant range sorting in queryNormalizedRanges
- Add visual parity tests to compare TreeSitter and Chroma highlighting output
- Add cache behavior tests for TreeSitter renderer
- Add cold-start benchmarks for TreeSitter rendering performance

BREAKING CHANGE: remove ENABLE_GOTREESITTER config and make gotreesitter default
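The body of resolveHighlightOverlaps is not shown in this thread. The following is only a sketch of one stack-based way to flatten nested highlight ranges so that the innermost range wins; the names `span` and `flattenSpans` are hypothetical, and clipping partially overlapping ranges to their parent is a simplification chosen here.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a highlight range over byte offsets; End is exclusive.
type span struct {
	Start, End int
	Class      string
}

// flattenSpans turns nested spans into non-overlapping segments in which the
// innermost open span decides the class of each segment.
func flattenSpans(in []span) []span {
	sorted := append([]span(nil), in...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Start != sorted[j].Start {
			return sorted[i].Start < sorted[j].Start
		}
		return sorted[i].End > sorted[j].End // outer spans first
	})

	var out, stack []span
	pos := 0
	// emit writes [pos, upTo) using the innermost open span, then advances pos.
	emit := func(upTo int) {
		if upTo <= pos {
			return
		}
		if len(stack) > 0 {
			out = append(out, span{pos, upTo, stack[len(stack)-1].Class})
		}
		pos = upTo
	}
	// closeThrough pops every open span that ends at or before limit.
	closeThrough := func(limit int) {
		for len(stack) > 0 && stack[len(stack)-1].End <= limit {
			emit(stack[len(stack)-1].End)
			stack = stack[:len(stack)-1]
		}
	}

	for _, s := range sorted {
		if s.End <= s.Start {
			continue // skip empty or inverted spans
		}
		closeThrough(s.Start)
		emit(s.Start)
		if len(stack) > 0 && s.End > stack[len(stack)-1].End {
			s.End = stack[len(stack)-1].End // simplification: clip partial overlaps
		}
		stack = append(stack, s)
	}
	for len(stack) > 0 {
		emit(stack[len(stack)-1].End)
		stack = stack[:len(stack)-1]
	}
	return out
}

func main() {
	nested := []span{{0, 10, "keyword"}, {2, 5, "string"}, {7, 9, "number"}}
	for _, s := range flattenSpans(nested) {
		fmt.Println(s.Start, s.End, s.Class)
	}
}
```

For the sample input the outer "keyword" span is split around the two inner spans, giving five consecutive segments that cover offsets 0 through 10.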

- Move treeSitterRenderCache and incremental parsing logic to treesitter_incremental.go
- Move render methods (render, renderLines) and highlight query logic to treesitter_render.go
- Remove unused imports (bytes, sort) from main treesitter.go
- Split ~600 lines of code into focused modules for better maintainability
- Keep capture-to-class conversion and lexer resolution in original file
- Add copyright headers to new 2026 files per project guidelines

odvcencio · 2026-03-02T19:08:22Z

Benchmark Results

Machine: Intel Core Ultra 9 285, Linux, Go 1.26.0

RenderCode (snippet highlighting, hot cache)

Language	gotreesitter	Chroma	Speedup	Chroma allocs
Go (46KB)	674 ns	49.9 ms	74,000x	585,679
Python (55KB)	955 ns	98.7 ms	103,000x	607,578
JavaScript (56KB)	929 ns	67.5 ms	72,600x	604,776
SQL (504KB)	8.66 us	1.15 s	132,000x	5,164,704

RenderFullFile (line-by-line, hot cache)

Language	gotreesitter	Chroma	Speedup	TS allocs	Chroma allocs
Go (46KB)	8.7 us	48.5 ms	5,500x	1	640,673
Python (55KB)	6.3 us	154.4 ms	24,500x	1	650,973
JavaScript (56KB)	5.6 us	50.1 ms	8,900x	1	635,571
SQL (504KB)	55.2 us	1.17 s	21,200x	1	5,484,695

Cold start (Go, no cache)

Metric	gotreesitter	Chroma
RenderCode cold	818 ns	50.8 ms
RenderFullFile cold	8.8 us	64.9 ms

Key observations

Zero allocations on hot-cache code rendering (cache hit returns pre-rendered HTML)
1 allocation on full-file rendering (line slice)
Cold start is ~same speed as hot for gotreesitter (incremental parse detects no-edit)
Chroma allocates 500K-5M objects per render due to tokenizer/formatter pipeline
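To illustrate why a cache hit can be this cheap, here is a minimal single-slot sketch that returns stored HTML when the source is unchanged. It is an assumption-laden illustration, not the PR's treeSitterRenderCache (which is described as having primary/alternate slots); `lastRenderCache` and its `Render` method are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lastRenderCache remembers the most recent source and its rendered HTML.
type lastRenderCache struct {
	mu     sync.Mutex
	source string
	html   string
	valid  bool
}

// Render returns the cached HTML when code equals the cached source; otherwise
// it calls render, stores the result, and returns it. hit reports which path ran.
func (c *lastRenderCache) Render(code string, render func(string) string) (out string, hit bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.valid && c.source == code {
		return c.html, true
	}
	c.source, c.html, c.valid = code, render(code), true
	return c.html, false
}

func main() {
	calls := 0
	slow := func(code string) string {
		calls++ // counts how often the expensive path runs
		return "<span>" + code + "</span>"
	}
	var cache lastRenderCache
	_, hit1 := cache.Render("package main", slow)
	_, hit2 := cache.Render("package main", slow)
	fmt.Println(hit1, hit2, calls) // prints: false true 1
}
```

A benchmark that renders the same input in a loop mostly measures the hit path of a cache like this, so hot-cache numbers and parse-time numbers answer different questions.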

- Update test assertions to match corrected output from highlightCodeLinesForDiffFile()
- Remove incorrect trailing span wrapping from expected newline characters

- Update gotreesitter dependency to latest revision for syntax highlighting improvements

silverwind · 2026-03-02T20:07:30Z

One topic we have to consider before fully getting rid of chroma is their language metadata, which IIRC gitea uses in conjunction with https://github.com/go-enry/go-enry, which sources it from https://github.com/github-linguist/linguist.

I think linguist's languages.yml is probably the most complete language metadata on the Internet and I wonder how these treesitters handle that topic, e.g. how they match a given filename to a parser. Could it be that c-treesitter relies on linguist data or does it maintain its own language metadata?

odvcencio · 2026-03-02T20:22:59Z

c's tree-sitter expects the caller to provide the language parser explicitly, and filename/language detection is a host app responsibility... enry stays as the source of truth for language metadata/detection, im mapping the (linguist through enry) languages to gotreesitter grammars so this can slot in easily. just added a patch to prefer enry file language over extension.

- Prioritize explicit language metadata (enry/gitattributes) before falling back to filename-based detection
- Fixes wrong grammar selection on ambiguous extensions like ".h" where metadata disambiguates C vs Objective-C vs C++
- Add comprehensive tests for metadata preference and filename fallback scenarios
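The priority order described here can be sketched as follows. The function `pickLanguage` and the `extensionDefaults` table are hypothetical; in the PR the metadata is said to come from enry or gitattributes, which this sketch replaces with a plain string argument.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// extensionDefaults is a made-up extension table; ".h" is ambiguous and
// defaults to C here.
var extensionDefaults = map[string]string{
	".h":  "C",
	".go": "Go",
	".py": "Python",
}

// pickLanguage prefers explicit metadata and only falls back to the file
// extension when no metadata is available. It returns "" when nothing matches.
func pickLanguage(metadataLanguage, fileName string) string {
	if metadataLanguage != "" {
		return metadataLanguage
	}
	return extensionDefaults[strings.ToLower(path.Ext(fileName))]
}

func main() {
	fmt.Println(pickLanguage("Objective-C", "AppDelegate.h")) // metadata wins
	fmt.Println(pickLanguage("", "AppDelegate.h"))            // extension fallback
	fmt.Println(pickLanguage("", "README.unknown") == "")     // no match
}
```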

silverwind · 2026-03-02T20:44:36Z

Ok, from what I gather, Github does it like this:

Linguist determines language based on its heuristics
Language is passed into TreeLights, a closed source service which takes a language and spits out a tree-sitter grammar. If there is no grammar, it falls back to PrettyLights, another closed-source service which then does regex-based highlighting.
Tree-sitter then produces the AST for syntax highlighting and other features

So I guess that mapping is something we will have to implement and maintain, as opposed to Chroma where this was built-in.

Maybe we will also keep Chroma as a fallback in case we can not determine a tree-sitter grammar, so tree-sitter provides a fast path for the most common languages.
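A rough sketch of what such a mapping with a fallback could look like. The table contents, the type `engineChoice`, and the function `chooseEngine` are illustrative assumptions; they are not the mapping used by the PR or by GitHub.

```go
package main

import "fmt"

// grammarByLanguage maps a detected language name to a grammar name.
// The entries are examples only.
var grammarByLanguage = map[string]string{
	"Go":     "go",
	"Python": "python",
	"Shell":  "bash",
}

// engineChoice records which highlighter should handle a file.
type engineChoice struct {
	Engine  string // "tree-sitter" or "chroma"
	Grammar string // empty when Chroma is used
}

// chooseEngine takes the fast path when a grammar is mapped and falls back
// to Chroma for every other language.
func chooseEngine(language string) engineChoice {
	if grammar, ok := grammarByLanguage[language]; ok {
		return engineChoice{Engine: "tree-sitter", Grammar: grammar}
	}
	return engineChoice{Engine: "chroma"}
}

func main() {
	fmt.Println(chooseEngine("Shell"))
	fmt.Println(chooseEngine("COBOL"))
}
```

The maintenance cost mentioned above lives entirely in the table: a language missing from it still renders, just through the slower path.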

odvcencio · 2026-03-02T21:17:04Z

Chroma fallback is good business. i will get that in.

- Update odvcencio/gotreesitter to latest commit 1fdab5f3cc1c
- Pulls in latest improvements for tree-sitter syntax highlighting

- Update gotreesitter dependency to latest commit with improved APIs
- Replace custom registry and canonical key logic with tsgrammars.DetectLanguageByName()
- Replace custom DisplayName() with tsgrammars.DisplayName() from library
- Remove treeSitterLanguageAliases map - library now handles linguist name mappings
- Use tsgrammars.DetectLanguage() for extension/filename-based detection
- Simplify lookupTreeSitterEntryByLanguageName() by leveraging native library functions

- Remove duplicate gotreesitter entries from go.sum
- Remove unnecessary tc variable capture in test loop (no t.Parallel used)
- Remove trailing whitespace from test file

…er grammars
- Use chroma lexer names and aliases to resolve tree-sitter grammars
- Add `lookupTreeSitterEntryByChromaLexer` to match chroma lexers against tree-sitter languages
- Integrate fallback in `resolveTreeSitterEntry` and `resolveTreeSitterEntryWithAnalyze` when primary detection fails
- Add tests for resolving "ksh" (chroma alias) to "bash" tree-sitter grammar

- Return boolean success indicator from parsing and highlighting functions
- Invalidate cache on parse failures to prevent reuse of stale output
- Enable fallback to Chroma highlighting when tree-sitter parsing fails
- Remove synthetic tree creation on parse errors in favor of explicit failure

…API with metrics
- Update gotreesitter library to v0.6.1-0.20260313093557 for improved Highlighter API
- Replace manual parser/query management with unified Highlighter abstraction
- Add comprehensive metrics collection for render operations and fallback reasons
- Implement compatibility modes for Haskell (module prefix injection) and Nginx (source normalization)
- Add detailed render attempt tracking with specific fallback reason codes
- Treat .txt files as plaintext unless explicit language metadata provided
- Remove obsolete benchmark files superseded by new testing infrastructure
- Add visual parity samples and rollout tests for production validation
- Clean up empty [highlight] section from app.example.ini configuration

- Use named return values and defer for consistent panic recovery and mutex unlocking
- Rename 'ok' to 'rangesOK' to prevent shadowing named return value
- Add bounds checking for highlight ranges in renderLines to prevent out-of-bounds access
- Simplify cache hit returns by removing intermediate variables
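The named-return-plus-defer pattern mentioned in this commit can be sketched like this. The function `safeRender` and the package-level mutex `renderMu` are hypothetical stand-ins; the PR's actual functions are not shown in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// renderMu stands in for whatever lock guards the shared highlighter state.
var renderMu sync.Mutex

// safeRender runs render under the lock. Because out and ok are named return
// values, the deferred recover can overwrite them after a panic, and the
// deferred Unlock runs on every exit path.
func safeRender(code string, render func(string) string) (out string, ok bool) {
	renderMu.Lock()
	defer renderMu.Unlock()
	defer func() {
		if r := recover(); r != nil {
			out, ok = "", false // report failure so the caller can fall back
		}
	}()
	return render(code), true
}

func main() {
	broken := func(string) string { panic("grammar bug") }
	working := func(code string) string { return "<span>" + code + "</span>" }

	_, ok := safeRender("x", broken)
	fmt.Println(ok) // prints: false

	// This second call would deadlock if the lock had not been released above.
	out, ok := safeRender("x", working)
	fmt.Println(out, ok) // prints: <span>x</span> true
}
```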

- Add size limit check to RenderCode and RenderCodeByLexer to return escaped plaintext for oversized inputs, preventing performance issues
- Use singleflight.Group to deduplicate concurrent NewHighlighter calls for the same language, avoiding redundant work and race conditions
- Add highlightFallbackNone constant to represent successful tree-sitter renders without fallback
- Fix prometheus registration to ignore already-registered errors in test binaries
- Remove deprecated // +build comment in favor of go:build directive
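A sketch of the size-limit idea from the first item, using only the standard library. The function name `renderOrPlain` and the `maxBytes` parameter are assumptions; the real limit and the real RenderCode signature are not given in this thread.

```go
package main

import (
	"fmt"
	"html"
)

// renderOrPlain skips highlighting for oversized inputs and returns the
// source as HTML-escaped plaintext instead.
func renderOrPlain(code string, maxBytes int, highlight func(string) string) string {
	if len(code) > maxBytes {
		return html.EscapeString(code)
	}
	return highlight(code)
}

func main() {
	highlight := func(code string) string { return "<span class=\"k\">" + code + "</span>" }

	fmt.Println(renderOrPlain("if", 16, highlight))             // small input: highlighted
	fmt.Println(renderOrPlain("<b>big & unsafe</b>", 4, highlight)) // large input: escaped only
}
```

Escaping is still required on the plaintext path because the result is inserted into HTML either way.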

- Remove highlightLexer and highlightRender fields from DiffSection and DiffFile structs
- Remove chroma/v2 import since lexer caching is no longer needed
- Update highlight diff tests to use exact string assertions for mixed backend output

- Update expected HTML output to reflect new tree-sitter based syntax highlighting
- Change CSS classes from 'gh' to 'p' and 'nx' for heading elements

- Update editor diff preview assertion to include full line content in added-code span
- Update blob excerpt assertion for new tree-sitter token class structure

- Update gotreesitter dependency from pre-release version v0.6.1 to stable release v0.7.1

- Add comprehensive doc comments to tree-sitter compatibility functions
- Explain the reshape-highlight-project strategy for grammars requiring well-formed files
- Document Haskell grammar workaround requiring module declarations
- Detail nginx config normalization pipeline and position mapping

- Update gotreesitter dependency from v0.7.1 to v0.7.2

- Add comprehensive benchmark tests comparing Chroma lexer and TreeSitter renderer performance
- Cover multiple languages: Go, Python, JavaScript, TypeScript, C, Rust, Java, Ruby, and CSS
- Benchmark both cold cache (unique inputs) and warm cache (repeated renders) scenarios
- Remove Swift from known TreeSitter fallbacks map indicating improved support

- Modernize benchmark test code to use Go 1.22+ range-over-integer syntax
- Replace legacy for-loop patterns with cleaner `for i := range n` style

- Update github.com/odvcencio/gotreesitter dependency to latest patch version
- Includes go.sum checksum updates for the new version

odvcencio · 2026-03-16T08:29:49Z

gotreesitter v0.7.3 has been released and with it, we have lots of correctness parity and benchmarking guarantees, the perf picture has shifted a bit but we are still winning against chroma. the fallback makes it so if gotreesitter doesnt cover it, we fall back to chroma. nothing lost, only gained. let me know if i need to pull anything out like the benchmark stuff to get this to land.

latest benchies say:

Benchmarks — gotreesitter v0.7.3 vs Chroma

Machine: Intel Core Ultra 9 285, 20 threads | Input: ~200 functions per language (~8-12KB) | Runs: 5

Cold Render (first parse, no cache)

Language	Chroma (ms)	Tree-sitter (ms)	Speedup	TS memory	Chroma memory	TS allocs	Chroma allocs
Go	12.1	7.1	1.7x	1.4 MB	7.9 MB	4,086	136,493
Python	39.3	16.0	2.5x	4.2 MB	16.9 MB	14,828	269,865
JavaScript	17.2	13.1	1.3x	4.9 MB	17.4 MB	19,333	272,032
TypeScript	15.9	11.9	1.3x	1.1 MB	14.5 MB	4,497	235,510
C	25.5	17.6	1.4x	4.2 MB	19.5 MB	13,705	312,886
Rust	29.7	27.5	1.1x	4.4 MB	22.1 MB	15,107	366,497
Java	31.9	13.9	2.3x	4.0 MB	23.8 MB	15,150	398,482
Ruby	50.7	37.4	1.4x	5.7 MB	14.2 MB	20,515	233,248
CSS	28.5	10.7	2.7x	3.7 MB	11.2 MB	8,297	186,471

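Tree-sitter is faster on all 9 languages, using roughly 2.5x–13x less memory and 11x–52x fewer allocations (ratios computed from the table above; Ruby is the low end of both ranges, TypeScript the high end).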

Cached Render (same source, cache hit — the diff view path)

Language	Tree-sitter cached (ns)	Chroma cold (ms)	Speedup
Go	77	12.1	157,000x
Python	341	39.3	115,000x
JavaScript	169	17.2	102,000x
TypeScript	152	15.9	105,000x
C	296	25.5	86,000x
Rust	329	29.7	90,000x
Java	419	31.9	76,000x
Ruby	170	50.7	298,000x
CSS	145	28.5	197,000x

TheFox0x7 · 2026-03-16T09:30:27Z

Tested on a few files and the gitea repo in general.

Regression in assets/emoji.json

yours:

main branch:

Difference in yaml files

It does inline true and false better than chroma.

yours:

main:

I guess it'll end up personal preference pick between the two.
Also keep in mind that this adds 20MB.

As for the code I haven't looked deep but can you explain the added metrics? Why are they defined and useful? What do you expect to see?

feat(highlight): prototype gotreesitter-backed rendering

171e1a7

GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Mar 1, 2026

github-actions bot added modifies/go Pull requests that update Go code modifies/dependencies labels Mar 1, 2026

chore(deps): bump gotreesitter to v0.6.0

8681723

wxiaoguang reviewed Mar 1, 2026

github-actions bot added the docs-update-needed The document needs to be updated synchronously label Mar 1, 2026

wxiaoguang reviewed Mar 1, 2026

odvcencio added 4 commits March 1, 2026 19:33

bump(deps): update gotreesitter to latest

7a20119

odvcencio added 2 commits March 2, 2026 11:50

fix(gitdiff): update test expectations for code line highlighting

48eabd4

bump(deps): update gotreesitter to latest revision

ab81fa2

odvcencio added 4 commits March 2, 2026 13:40

bump(deps): gotreesitter to latest version

201d721

refactor(highlight): clean up test and go.sum duplicates

15862ed

odvcencio added 2 commits March 2, 2026 21:51

github-actions bot removed the docs-update-needed The document needs to be updated synchronously label Mar 14, 2026

odvcencio added 11 commits March 14, 2026 03:22

test(markup): update codepreview tests for new syntax highlighting

46a5424

update(tests): match tree-sitter highlight output

9df2037

bump: gotreesitter to v0.7.1

ea85f3d

bump(deps): gotreesitter to v0.7.2

fcc7088

refactor(highlight): update loops to use range-over-int syntax

c18212d

bump: update gotreesitter from v0.7.2 to v0.7.3

376bc38

odvcencio marked this pull request as ready for review March 16, 2026 08:19

5 participants
