Prototype: gotreesitter-backed syntax highlighting (refs #36765) #36791
odvcencio wants to merge 27 commits into go-gitea:main
modules/highlight/treesitter.go
	return registry
})

treeSitterDetectCache sync.Map // map[string]*tsgrammars.LangEntry
Is this for proof-of-concept or for production?
If for production, we should be careful about the "cache". If I understand correctly, every file name goes into the cache, so it will grow without bound?
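One common way to address this concern is to bound the cache so old entries are evicted. The following is a minimal sketch of a fixed-capacity LRU cache, not the PR's code: the type `boundedCache`, its methods, and the string values are hypothetical stand-ins (the cache under review stores `*tsgrammars.LangEntry` in a `sync.Map`).

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is one key/value pair kept in the LRU list.
type entry struct {
	key   string
	value string
}

// boundedCache is a hypothetical LRU cache that never holds more than
// capacity entries, unlike a plain sync.Map keyed by file name.
type boundedCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func newBoundedCache(capacity int) *boundedCache {
	return &boundedCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks the key as recently used.
func (c *boundedCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put stores a value and evicts the least recently used entry when full.
func (c *boundedCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports the number of cached entries.
func (c *boundedCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.order.Len()
}

func main() {
	c := newBoundedCache(2)
	c.Put("a.go", "go")
	c.Put("b.py", "python")
	c.Put("c.rs", "rust") // a.go is the least recently used entry and is evicted
	_, ok := c.Get("a.go")
	fmt.Println(ok, c.Len())
}
```

With a bound like this, memory use depends on the configured capacity rather than on how many distinct file names a server has seen.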
That's less than I would have expected. Do you perhaps also have benchmark results comparing
I think when benchmarking a large SQL file (just duplicate a long SQL 4000 times), then the difference might be huge.
…ting
- Add ENABLE_GOTREESITTER configuration option to control syntax highlighting engine
- Make all TreeSitter rendering paths conditional based on the new setting
- Default to true for backward compatibility
- Chroma remains fallback when disabled
custom/conf/app.example.ini
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Whether to prefer gotreesitter for syntax highlighting. If false, Gitea uses Chroma only.
;ENABLE_GOTREESITTER = true
For production, I think we will just choose one, just the good enough default one, then no more option, because most end users don't know these options .....
A good test would be to have an AI check whether the syntax highlighting of all 200+ supported languages is structurally identical to the highlighting on github.com. That would confirm that
i didnt anticipate the speedy reviews, thanks so much for bearing with me 😅 ! i did a first pass to validate the conceit of the issue and didnt delve too deeply into the performance/profiling which im doing right now. uncovered some interesting stuff and im expanding the parity against c's tree-sitter runtime (my parity harness compares to cgo and c) as well as top 50 language perf profiles to determine if there are hand tuned grammar optimizations or edges to cover. all that work should change the profiles significantly and early picture says that there are still yet optimizations to be made for this use case specifically.... thanks again for the thoughtful reviews
…arsing and caching
- Replace Highlighter wrapper with native Parser/Query for finer control
- Add intelligent render cache with primary/alternate slots for code variations
- Implement incremental parsing via single-edit computation for fast re-renders
- Add line-by-line render method avoiding expensive HTML splitting
- Remove treeSitterDetectCache; prefer filename-based detection over language name
- Add capture class caching and range normalization to reduce allocations
- Introduce comprehensive benchmark suite comparing tree-sitter vs Chroma
- Add overlap resolution for nested highlight ranges using sweep-line algorithm
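To illustrate what a "single-edit computation" could look like, here is a minimal sketch under stated assumptions. It derives one contiguous edit from the old and new source by trimming their common prefix and suffix. The type `singleEdit` and the function `computeSingleEdit` are hypothetical names, not the PR's code, and the sketch works on byte offsets only (tree-sitter edits also carry row/column points, which are omitted here).

```go
package main

import "fmt"

// singleEdit describes one contiguous replaced region as byte offsets.
// Hypothetical type for illustration only.
type singleEdit struct {
	Start  int // first byte that differs
	OldEnd int // end of the replaced region in the old source (exclusive)
	NewEnd int // end of the replacement in the new source (exclusive)
}

// computeSingleEdit returns the smallest single edit that turns oldSrc into
// newSrc. The boolean is false when the two inputs are identical.
func computeSingleEdit(oldSrc, newSrc []byte) (singleEdit, bool) {
	// Length of the common prefix.
	prefix := 0
	for prefix < len(oldSrc) && prefix < len(newSrc) && oldSrc[prefix] == newSrc[prefix] {
		prefix++
	}
	if prefix == len(oldSrc) && prefix == len(newSrc) {
		return singleEdit{}, false
	}
	// Length of the common suffix, never overlapping the prefix.
	suffix := 0
	for suffix < len(oldSrc)-prefix && suffix < len(newSrc)-prefix &&
		oldSrc[len(oldSrc)-1-suffix] == newSrc[len(newSrc)-1-suffix] {
		suffix++
	}
	return singleEdit{
		Start:  prefix,
		OldEnd: len(oldSrc) - suffix,
		NewEnd: len(newSrc) - suffix,
	}, true
}

func main() {
	edit, changed := computeSingleEdit([]byte("x := 1\n"), []byte("x := 42\n"))
	fmt.Println(edit, changed)
}
```

An incremental parser can use such an edit to reuse the unchanged parts of the previous syntax tree instead of reparsing the whole file.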
- Update gotreesitter from v0.6.0 to v0.6.1-0.20260302173816-cf6d6a44ace7
- Pull in latest fixes for tree-sitter syntax highlighting

…esitter default
- Remove ENABLE_GOTREESITTER configuration option
- gotreesitter is now always enabled by default
- Simplify tryRenderCodeByTreeSitter signature by removing trimTrailingNewline parameter
- Rewrite resolveHighlightOverlaps with stack-based algorithm replacing event-based approach
- Add writeEscapedBytes optimization to avoid unnecessary HTML escaping for safe content
- Remove redundant range sorting in queryNormalizedRanges
- Add visual parity tests to compare TreeSitter and Chroma highlighting output
- Add cache behavior tests for TreeSitter renderer
- Add cold-start benchmarks for TreeSitter rendering performance

BREAKING CHANGE: remove ENABLE_GOTREESITTER config and make gotreesitter default
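The escaping optimization mentioned in this commit can be pictured as a fast path that skips HTML escaping when the input contains nothing to escape. The sketch below is one plausible shape of that idea, not the PR's `writeEscapedBytes`: the names `needsEscape`, `writeEscapedSketch`, and `escapeToString` are hypothetical, and the slow path simply delegates to the standard library's `html.EscapeString`.

```go
package main

import (
	"bytes"
	"fmt"
	"html"
)

// needsEscape reports whether b contains any of the five characters that
// html.EscapeString rewrites: & < > " '
func needsEscape(b []byte) bool {
	return bytes.ContainsAny(b, "&<>\"'")
}

// writeEscapedSketch writes b to buf, escaping only when required.
// Hypothetical helper; the PR's real function may differ.
func writeEscapedSketch(buf *bytes.Buffer, b []byte) {
	if !needsEscape(b) {
		buf.Write(b) // fast path: no string conversion, no escaping pass
		return
	}
	buf.WriteString(html.EscapeString(string(b)))
}

// escapeToString is a small convenience wrapper for demonstration.
func escapeToString(b []byte) string {
	var buf bytes.Buffer
	writeEscapedSketch(&buf, b)
	return buf.String()
}

func main() {
	fmt.Println(escapeToString([]byte("func main")))
	fmt.Println(escapeToString([]byte("a < b && c")))
}
```

Most source tokens (identifiers, keywords, whitespace) contain no special characters, so the fast path is the common case.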
- Move treeSitterRenderCache and incremental parsing logic to treesitter_incremental.go
- Move render methods (render, renderLines) and highlight query logic to treesitter_render.go
- Remove unused imports (bytes, sort) from main treesitter.go
- Split ~600 lines of code into focused modules for better maintainability
- Keep capture-to-class conversion and lexer resolution in original file
- Add copyright headers to new 2026 files per project guidelines
Benchmark Results
Machine: Intel Core Ultra 9 285, Linux, Go 1.26.0
RenderCode (snippet highlighting, hot cache)
RenderFullFile (line-by-line, hot cache)
Cold start (Go, no cache)
Key observations
- Update test assertions to match corrected output from highlightCodeLinesForDiffFile()
- Remove incorrect trailing span wrapping from expected newline characters
- Update gotreesitter dependency to latest revision for syntax highlighting improvements
One topic we have to consider before fully getting rid of chroma is their language metadata, which IIRC gitea uses in conjunction with https://github.com/go-enry/go-enry, which sources it from https://github.com/github-linguist/linguist. I think linguist's languages.yml is probably the most complete language metadata on the Internet and I wonder how these treesitters handle that topic, e.g. how they match a given filename to a parser. Could it be that c-treesitter relies on linguist data or does it maintain its own language metadata?
c's tree-sitter expects caller to provide language parser explicitly and filename/language detection is a host app responsibility... enry stays as the source of truth for language metadata/detection, im mapping the (linguist through enry) languages to gotreesitter grammars so this can slot in easily. just added a patch to prefer enry file language over extension.
- Prioritize explicit language metadata (enry/gitattributes) before falling back to filename-based detection
- Fixes wrong grammar selection on ambiguous extensions like ".h" where metadata disambiguates C vs Objective-C vs C++
- Add comprehensive tests for metadata preference and filename fallback scenarios
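The detection order described in this commit (explicit metadata first, file name second) can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the lookup tables, the grammar names, and `resolveGrammar` are hypothetical, and in the PR the language name comes from enry or gitattributes rather than from a hard-coded map.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Hypothetical lookup tables for illustration only.
var grammarByLanguage = map[string]string{
	"c":           "c",
	"c++":         "cpp",
	"objective-c": "objc",
}

var grammarByExtension = map[string]string{
	".c":   "c",
	".h":   "c", // ambiguous: could also be C++ or Objective-C
	".cpp": "cpp",
}

// resolveGrammar prefers explicit language metadata and only then falls back
// to the file extension. ok=false means no grammar was found, which is the
// point where a caller would hand the file to another highlighter.
func resolveGrammar(fileName, language string) (grammar string, ok bool) {
	if language != "" {
		if g, found := grammarByLanguage[strings.ToLower(language)]; found {
			return g, true
		}
	}
	g, found := grammarByExtension[strings.ToLower(filepath.Ext(fileName))]
	return g, found
}

func main() {
	fmt.Println(resolveGrammar("widget.h", "C++")) // metadata wins over ".h"
	fmt.Println(resolveGrammar("widget.h", ""))    // extension fallback
	fmt.Println(resolveGrammar("notes.xyz", ""))   // nothing matched
}
```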
Ok, from what I gather, GitHub does it like this:
So I guess that mapping is something we will have to implement and maintain, as opposed to Chroma where this was built-in. Maybe we will also keep Chroma as a fallback in case we cannot determine a tree-sitter grammar, so tree-sitter provides a fast path for the most common languages.
Chroma fallback is good business. i will get that in. |
- Update odvcencio/gotreesitter to latest commit 1fdab5f3cc1c
- Pulls in latest improvements for tree-sitter syntax highlighting

- Update gotreesitter dependency to latest commit with improved APIs
- Replace custom registry and canonical key logic with tsgrammars.DetectLanguageByName()
- Replace custom DisplayName() with tsgrammars.DisplayName() from library
- Remove treeSitterLanguageAliases map - library now handles linguist name mappings
- Use tsgrammars.DetectLanguage() for extension/filename-based detection
- Simplify lookupTreeSitterEntryByLanguageName() by leveraging native library functions

- Remove duplicate gotreesitter entries from go.sum
- Remove unnecessary tc variable capture in test loop (no t.Parallel used)
- Remove trailing whitespace from test file

…er grammars
- Use chroma lexer names and aliases to resolve tree-sitter grammars
- Add `lookupTreeSitterEntryByChromaLexer` to match chroma lexers against tree-sitter languages
- Integrate fallback in `resolveTreeSitterEntry` and `resolveTreeSitterEntryWithAnalyze` when primary detection fails
- Add tests for resolving "ksh" (chroma alias) to "bash" tree-sitter grammar

- Return boolean success indicator from parsing and highlighting functions
- Invalidate cache on parse failures to prevent reuse of stale output
- Enable fallback to Chroma highlighting when tree-sitter parsing fails
- Remove synthetic tree creation on parse errors in favor of explicit failure

…API with metrics
- Update gotreesitter library to v0.6.1-0.20260313093557 for improved Highlighter API
- Replace manual parser/query management with unified Highlighter abstraction
- Add comprehensive metrics collection for render operations and fallback reasons
- Implement compatibility modes for Haskell (module prefix injection) and Nginx (source normalization)
- Add detailed render attempt tracking with specific fallback reason codes
- Treat .txt files as plaintext unless explicit language metadata provided
- Remove obsolete benchmark files superseded by new testing infrastructure
- Add visual parity samples and rollout tests for production validation
- Clean up empty [highlight] section from app.example.ini configuration
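The combination of explicit failure, Chroma fallback, and per-reason tracking described in the two commits above might look roughly like the sketch below. Everything here is an assumption for illustration: the reason names, `renderMetrics`, and `renderWithFallback` are hypothetical, and a plain in-memory counter stands in for whatever metrics backend the PR actually uses.

```go
package main

import (
	"fmt"
	"sync"
)

// fallbackReason labels why a render did not use tree-sitter.
// All reason names are hypothetical.
type fallbackReason string

const (
	reasonNone        fallbackReason = "none" // tree-sitter rendered successfully
	reasonNoGrammar   fallbackReason = "no_grammar"
	reasonParseFailed fallbackReason = "parse_failed"
)

// renderMetrics counts render attempts per fallback reason.
type renderMetrics struct {
	mu     sync.Mutex
	counts map[fallbackReason]int
}

func newRenderMetrics() *renderMetrics {
	return &renderMetrics{counts: make(map[fallbackReason]int)}
}

func (m *renderMetrics) record(r fallbackReason) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[r]++
}

func (m *renderMetrics) count(r fallbackReason) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[r]
}

// renderWithFallback tries the primary renderer first. Any reason other than
// reasonNone is recorded and the fallback renderer produces the output.
func renderWithFallback(
	m *renderMetrics,
	code string,
	primary func(string) (string, fallbackReason),
	fallback func(string) string,
) string {
	out, reason := primary(code)
	m.record(reason)
	if reason != reasonNone {
		return fallback(code)
	}
	return out
}

func main() {
	m := newRenderMetrics()
	failing := func(string) (string, fallbackReason) { return "", reasonParseFailed }
	chromaLike := func(s string) string { return "<fallback>" + s }
	fmt.Println(renderWithFallback(m, "x", failing, chromaLike), m.count(reasonParseFailed))
}
```

Counters of this kind show how often the fallback path is taken and for which reason, which is one way to judge grammar coverage during a rollout.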
- Use named return values and defer for consistent panic recovery and mutex unlocking
- Rename 'ok' to 'rangesOK' to prevent shadowing named return value
- Add bounds checking for highlight ranges in renderLines to prevent out-of-bounds access
- Simplify cache hit returns by removing intermediate variables
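The first bullet refers to a standard Go idiom: named results let a deferred function overwrite the return values after a panic, and a deferred unlock releases the mutex on every exit path. A minimal sketch, with a hypothetical `renderer` type and `render` method that are not taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// renderer is a hypothetical type guarding shared state with a mutex.
type renderer struct {
	mu sync.Mutex
}

// render returns ok=false instead of crashing when highlight panics.
func (r *renderer) render(src string, highlight func(string) string) (out string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock() // runs on every exit path, including panics
	defer func() {
		if rec := recover(); rec != nil {
			// Named results let the deferred function set what the caller sees.
			out, ok = "", false
		}
	}()
	return highlight(src), true
}

func main() {
	r := &renderer{}
	_, ok := r.render("x", func(string) string { panic("grammar bug") })
	fmt.Println(ok)
	out, ok := r.render("x", func(s string) string { return "<span>" + s + "</span>" })
	fmt.Println(out, ok)
}
```

The second call would deadlock if the mutex had not been released after the panic, so it doubles as a check that the deferred unlock ran.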
- Add size limit check to RenderCode and RenderCodeByLexer to return escaped plaintext for oversized inputs, preventing performance issues
- Use singleflight.Group to deduplicate concurrent NewHighlighter calls for the same language, avoiding redundant work and race conditions
- Add highlightFallbackNone constant to represent successful tree-sitter renders without fallback
- Fix prometheus registration to ignore already-registered errors in test binaries
- Remove deprecated // +build comment in favor of go:build directive
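The size guard in the first bullet can be sketched as below. The function `renderWithLimit` and the constant `maxHighlightBytes` are hypothetical; the threshold the PR actually uses is not shown in this conversation, so the value here is only a placeholder.

```go
package main

import (
	"fmt"
	"html"
)

// maxHighlightBytes is a placeholder limit chosen for illustration.
const maxHighlightBytes = 1 << 20

// renderWithLimit returns HTML-escaped plaintext for oversized input and
// only runs the (potentially expensive) highlighter below the limit.
func renderWithLimit(code string, limit int, highlight func(string) string) string {
	if len(code) > limit {
		return html.EscapeString(code)
	}
	return highlight(code)
}

func main() {
	highlight := func(s string) string {
		return `<span class="k">` + html.EscapeString(s) + `</span>`
	}
	fmt.Println(renderWithLimit("if a < b", maxHighlightBytes, highlight))
	fmt.Println(renderWithLimit("if a < b", 4, highlight)) // over the limit: plaintext
}
```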
- Remove highlightLexer and highlightRender fields from DiffSection and DiffFile structs
- Remove chroma/v2 import since lexer caching is no longer needed
- Update highlight diff tests to use exact string assertions for mixed backend output

- Update expected HTML output to reflect new tree-sitter based syntax highlighting
- Change CSS classes from 'gh' to 'p' and 'nx' for heading elements

- Update editor diff preview assertion to include full line content in added-code span
- Update blob excerpt assertion for new tree-sitter token class structure

- Update gotreesitter dependency from pre-release version v0.6.1 to stable release v0.7.1

- Add comprehensive doc comments to tree-sitter compatibility functions
- Explain the reshape-highlight-project strategy for grammars requiring well-formed files
- Document Haskell grammar workaround requiring module declarations
- Detail nginx config normalization pipeline and position mapping

- Update gotreesitter dependency from v0.7.1 to v0.7.2

- Add comprehensive benchmark tests comparing Chroma lexer and TreeSitter renderer performance
- Cover multiple languages: Go, Python, JavaScript, TypeScript, C, Rust, Java, Ruby, and CSS
- Benchmark both cold cache (unique inputs) and warm cache (repeated renders) scenarios
- Remove Swift from known TreeSitter fallbacks map indicating improved support

- Modernize benchmark test code to use Go 1.22+ range-over-integer syntax
- Replace legacy for-loop patterns with cleaner `for i := range n` style

- Update github.com/odvcencio/gotreesitter dependency to latest patch version
- Includes go.sum checksum updates for the new version
gotreesitter v0.7.3 has been released and with it, we have lots of correctness parity and benchmarking guarantees, the perf picture has shifted a bit but we are still winning against chroma. the fallback makes it so if gotreesitter doesnt cover it, we fall back to chroma. nothing lost, only gained. let me know if i need to pull anything out like the benchmark stuff to get this to land. latest benchies say:
Benchmarks: gotreesitter v0.7.3 vs Chroma
Machine: Intel Core Ultra 9 285, 20 threads | Input: ~200 functions per language (~8-12KB) | Runs: 5
Cold Render (first parse, no cache)
Tree-sitter is faster on all 9 languages, using 3x–13x less memory and 18x–53x fewer allocations.
Cached Render (same source, cache hit, the diff view path)
Tested on a few files and the gitea repo in general. I guess it'll end up being a personal-preference pick between the two. As for the code, I haven't looked deep, but can you explain the added metrics? Why are they defined and useful? What do you expect to see?






Authorship attribution: prepared by @odvcencio with Codex assistance.
Refs #36765
Benchmark Methodology (up front)
Benchmarks were generated on March 1, 2026 with this exact command:
GOMAXPROCS=1 go test ./modules/highlight -run '^$' -bench 'BenchmarkRenderCode(TreeSitter|Chroma)Go|BenchmarkRenderFullFile(TreeSitter|Chroma)Go' -benchmem -count=10 -benchtime=750ms
Summarized via:
benchstat /tmp/gitea_highlight_bench_v060.txt
Environment captured by bench output:
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) Ultra 9 285
Benchmark Results
From benchstat on /tmp/gitea_highlight_bench_v060.txt:
- RenderCodeTreeSitterGo: 30.87ms vs RenderCodeChromaGo: 49.38ms (~1.60x faster)
- RenderFullFileTreeSitterGo: 31.63ms vs RenderFullFileChromaGo: 53.88ms (~1.70x faster)
- RenderCode memory: 4.252MiB vs 33.05MiB (~7.8x lower)
- RenderFullFile memory: 4.432MiB vs 34.52MiB (~7.8x lower)
- RenderCode allocations: 20.58k vs 585.7k (~28.5x lower)
- RenderFullFile allocations: 20.58k vs 640.7k (~31.1x lower)
Summary
- … modules/highlight renderer path that prefers github.com/odvcencio/gotreesitter and falls back to Chroma when grammar/query detection is unavailable.
- … modules/highlight.
- … github.com/odvcencio/gotreesitter to v0.6.0.
Safety Gate
This branch now includes an explicit feature flag:
[highlight]
ENABLE_GOTREESITTER = true
If set to false, highlighting uses Chroma only. This provides a low-risk rollout/rollback lever while evaluating parity and production behavior.
Validation
- make fmt
- make lint-go
- make tidy
- go test -tags 'sqlite sqlite_unlock_notify' ./modules/highlight ./modules/markup/orgmode ./modules/indexer/code ./services/gitdiff -count=1
Notes
- … chroma with gotreesitter #36765.
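As a cross-check, the "x faster" and "x lower" figures quoted under Benchmark Results can be re-derived by dividing each Chroma value by the matching tree-sitter value. The short sketch below does only that arithmetic; the helper `ratio` is a hypothetical name, and the inputs are the numbers copied from the benchstat summary in this description.

```go
package main

import "fmt"

// ratio returns how many times larger baseline is than candidate.
func ratio(baseline, candidate float64) float64 {
	return baseline / candidate
}

func main() {
	// Chroma value first, tree-sitter value second, as quoted in the PR description.
	fmt.Printf("RenderCode time:         %.2fx\n", ratio(49.38, 30.87))
	fmt.Printf("RenderFullFile time:     %.2fx\n", ratio(53.88, 31.63))
	fmt.Printf("RenderCode memory:       %.1fx\n", ratio(33.05, 4.252))
	fmt.Printf("RenderFullFile memory:   %.1fx\n", ratio(34.52, 4.432))
	fmt.Printf("RenderCode allocs:       %.1fx\n", ratio(585.7, 20.58))
	fmt.Printf("RenderFullFile allocs:   %.1fx\n", ratio(640.7, 20.58))
}
```

The printed ratios agree with the rounded figures in the description, so the summary is consistent with its own raw numbers.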