Prototype: gotreesitter-backed syntax highlighting (refs #36765) #36791

Open
odvcencio wants to merge 27 commits into go-gitea:main from odvcencio:feat/gotreesitter-highlight-36765

Conversation

@odvcencio

@odvcencio odvcencio commented Mar 1, 2026

Authorship attribution: prepared by @odvcencio with Codex assistance.

Refs #36765

Benchmark Methodology (up front)

Benchmarks were generated on March 1, 2026 with this exact command:

GOMAXPROCS=1 go test ./modules/highlight -run '^$' -bench 'BenchmarkRenderCode(TreeSitter|Chroma)Go|BenchmarkRenderFullFile(TreeSitter|Chroma)Go' -benchmem -count=10 -benchtime=750ms

Summarized via:

benchstat /tmp/gitea_highlight_bench_v060.txt

Environment captured by bench output:

  • goos: linux
  • goarch: amd64
  • cpu: Intel(R) Core(TM) Ultra 9 285

Benchmark Results

From benchstat on /tmp/gitea_highlight_bench_v060.txt:

  • RenderCodeTreeSitterGo: 30.87ms vs RenderCodeChromaGo: 49.38ms (~1.60x faster)
  • RenderFullFileTreeSitterGo: 31.63ms vs RenderFullFileChromaGo: 53.88ms (~1.70x faster)
  • Memory:
    • RenderCode: 4.252MiB vs 33.05MiB (~7.8x lower)
    • RenderFullFile: 4.432MiB vs 34.52MiB (~7.8x lower)
  • Allocations:
    • RenderCode: 20.58k vs 585.7k (~28.5x lower)
    • RenderFullFile: 20.58k vs 640.7k (~31.1x lower)

Summary

  • Adds a new modules/highlight renderer path that prefers github.com/odvcencio/gotreesitter and falls back to Chroma when grammar/query detection is unavailable.
  • Integrates this path into file view, diff line rendering/full-file highlighting, code search snippets, and orgmode code blocks.
  • Keeps Chroma-compatible CSS token classes so existing theme styles continue to apply.
  • Adds benchmark coverage comparing gotreesitter vs Chroma in modules/highlight.
  • Pins github.com/odvcencio/gotreesitter to v0.6.0.

Safety Gate

This branch now includes an explicit feature flag:

  • [highlight] ENABLE_GOTREESITTER = true

If set to false, highlighting uses Chroma only. This provides a low-risk rollout/rollback lever while evaluating parity and production behavior.
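As a rough illustration of how such a gate could behave, here is a minimal Go sketch. It is not the code in this branch: `renderFunc`, `renderWithFallback`, and the stub renderers are hypothetical names, and the real renderers take different arguments.

```go
package main

import (
	"fmt"
	"html"
)

// renderFunc is a hypothetical renderer signature: it returns highlighted HTML
// and reports whether it was able to handle the input.
type renderFunc func(lang, code string) (string, bool)

// renderWithFallback tries the preferred renderer only when the flag is on,
// then the fallback renderer, and finally returns escaped plaintext.
func renderWithFallback(enablePreferred bool, preferred, fallback renderFunc, lang, code string) string {
	if enablePreferred {
		if out, ok := preferred(lang, code); ok {
			return out
		}
	}
	if out, ok := fallback(lang, code); ok {
		return out
	}
	return html.EscapeString(code)
}

func main() {
	// Stub renderers standing in for the tree-sitter and Chroma paths.
	treeSitter := func(lang, code string) (string, bool) {
		if lang != "go" { // pretend only a Go grammar is available
			return "", false
		}
		return "<span class=\"ts\">" + html.EscapeString(code) + "</span>", true
	}
	chroma := func(lang, code string) (string, bool) {
		return "<span class=\"chroma\">" + html.EscapeString(code) + "</span>", true
	}

	fmt.Println(renderWithFallback(true, treeSitter, chroma, "go", "x := 1"))
	fmt.Println(renderWithFallback(true, treeSitter, chroma, "sql", "SELECT 1"))
	fmt.Println(renderWithFallback(false, treeSitter, chroma, "go", "x := 1"))
}
```

With the flag off, the preferred renderer is never called, which is what makes the setting usable as a rollback lever.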

Validation

  • make fmt
  • make lint-go
  • make tidy
  • go test -tags 'sqlite sqlite_unlock_notify' ./modules/highlight ./modules/markup/orgmode ./modules/indexer/code ./services/gitdiff -count=1

Notes

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Mar 1, 2026
@github-actions github-actions bot added modifies/go Pull requests that update Go code modifies/dependencies labels Mar 1, 2026
return registry
})

treeSitterDetectCache sync.Map // map[string]*tsgrammars.LangEntry
Contributor

@wxiaoguang wxiaoguang Mar 1, 2026

For proof-of-concept or for production?

If for production, we should be careful about "cache". Here if I understand correctly, all file names will go into the cache, it will bloat infinitely?
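One common way to address this kind of concern is to bound the cache. The following is a minimal sketch of a fixed-capacity LRU cache in Go using only the standard library. It is an illustration, not the code in this PR: `boundedCache` and its methods are hypothetical names, and the values are plain strings rather than `*tsgrammars.LangEntry`.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is what each list element stores.
type entry struct {
	key   string
	value string
}

// boundedCache is a small LRU cache: once limit entries are stored, the least
// recently used one is evicted, so memory use stays bounded.
type boundedCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element in order
}

func newBoundedCache(limit int) *boundedCache {
	return &boundedCache{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached value and marks the key as recently used.
func (c *boundedCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put stores a value and evicts the least recently used entry when over limit.
func (c *boundedCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports the number of cached entries.
func (c *boundedCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.order.Len()
}

func main() {
	c := newBoundedCache(2)
	c.Put("main.go", "go")
	c.Put("app.py", "python")
	c.Get("main.go")         // main.go becomes most recently used
	c.Put("lib.rs", "rust")  // evicts app.py
	_, ok := c.Get("app.py") // ok is false after eviction
	fmt.Println(c.Len(), ok)
}
```

Another option, assuming detection depends only on the extension or language name, would be to key the cache on that instead of the full file name, which bounds the key space naturally.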

@silverwind
Member

silverwind commented Mar 1, 2026

RenderFullFileTreeSitterGo: 31.63ms vs RenderFullFileChromaGo: 53.88ms (~1.70x faster)

That's less than I would have expected. Do you perhaps also have benchmark results comparing gotreesitter vs. https://github.com/tree-sitter/tree-sitter? Ideally they should perform roughly the same, with Go being a bit slower due to GC.

@wxiaoguang
Contributor

RenderCodeTreeSitterGo: 30.87ms vs RenderCodeChromaGo: 49.38ms (~1.60x faster)

I think the difference might be huge when benchmarking a large SQL file (for example, a long SQL statement duplicated 4000 times).

…ting

- Add ENABLE_GOTREESITTER configuration option to control syntax highlighting engine
- Make all TreeSitter rendering paths conditional based on the new setting
- Default to true for backward compatibility - Chroma remains fallback when disabled
@github-actions github-actions bot added the docs-update-needed The document needs to be updated synchronously label Mar 1, 2026
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Whether to prefer gotreesitter for syntax highlighting. If false, Gitea uses Chroma only.
;ENABLE_GOTREESITTER = true
Contributor

For production, I think we will just choose one engine, the good-enough default, and then drop the option, because most end users don't know about these options.

@TheFox0x7
Contributor

Tried it out of curiosity and, as is, it regresses.
treesitter:
image
chroma:
image

@silverwind
Member

silverwind commented Mar 1, 2026

A good test would be to have an AI check whether the syntax highlighting of all 200+ supported languages is structurally identical to the highlighting on github.com. That would confirm that gotreesitter parses the same as the original tree-sitter.

@odvcencio
Author

i didn't anticipate the speedy reviews, thanks so much for bearing with me 😅 !

i did a first pass to validate the conceit of the issue and didn't delve too deeply into the performance/profiling, which i'm doing right now. that uncovered some interesting stuff, and i'm expanding the parity checks against c's tree-sitter runtime (my parity harness compares to cgo and c) as well as perf profiles for the top 50 languages, to determine whether there are hand-tuned grammar optimizations or edge cases to cover.

all that work should change the profiles significantly, and the early picture says there are still optimizations to be made for this use case specifically. thanks again for the thoughtful reviews

…arsing and caching

- Replace Highlighter wrapper with native Parser/Query for finer control
- Add intelligent render cache with primary/alternate slots for code variations
- Implement incremental parsing via single-edit computation for fast re-renders
- Add line-by-line render method avoiding expensive HTML splitting
- Remove treeSitterDetectCache; prefer filename-based detection over language name
- Add capture class caching and range normalization to reduce allocations
- Introduce comprehensive benchmark suite comparing tree-sitter vs Chroma
- Add overlap resolution for nested highlight ranges using sweep-line algorithm
- Update gotreesitter from v0.6.0 to v0.6.1-0.20260302173816-cf6d6a44ace7
- Pull in latest fixes for tree-sitter syntax highlighting
…esitter default

- Remove ENABLE_GOTREESITTER configuration option - gotreesitter is now always enabled by default
- Simplify tryRenderCodeByTreeSitter signature by removing trimTrailingNewline parameter
- Rewrite resolveHighlightOverlaps with stack-based algorithm replacing event-based approach
- Add writeEscapedBytes optimization to avoid unnecessary HTML escaping for safe content
- Remove redundant range sorting in queryNormalizedRanges
- Add visual parity tests to compare TreeSitter and Chroma highlighting output
- Add cache behavior tests for TreeSitter renderer
- Add cold-start benchmarks for TreeSitter rendering performance

BREAKING CHANGE: remove ENABLE_GOTREESITTER config and make gotreesitter default
- Move treeSitterRenderCache and incremental parsing logic to treesitter_incremental.go
- Move render methods (render, renderLines) and highlight query logic to treesitter_render.go
- Remove unused imports (bytes, sort) from main treesitter.go
- Split ~600 lines of code into focused modules for better maintainability
- Keep capture-to-class conversion and lexer resolution in original file
- Add copyright headers to new 2026 files per project guidelines
@odvcencio
Author

Benchmark Results

Machine: Intel Core Ultra 9 285, Linux, Go 1.26.0

RenderCode (snippet highlighting, hot cache)

| Language | gotreesitter | Chroma | Speedup | TS allocs | Chroma allocs |
|---|---|---|---|---|---|
| Go (46KB) | 674 ns | 49.9 ms | 74,000x | 0 | 585,679 |
| Python (55KB) | 955 ns | 98.7 ms | 103,000x | 0 | 607,578 |
| JavaScript (56KB) | 929 ns | 67.5 ms | 72,600x | 0 | 604,776 |
| SQL (504KB) | 8.66 us | 1.15 s | 132,000x | 0 | 5,164,704 |

RenderFullFile (line-by-line, hot cache)

| Language | gotreesitter | Chroma | Speedup | TS allocs | Chroma allocs |
|---|---|---|---|---|---|
| Go (46KB) | 8.7 us | 48.5 ms | 5,500x | 1 | 640,673 |
| Python (55KB) | 6.3 us | 154.4 ms | 24,500x | 1 | 650,973 |
| JavaScript (56KB) | 5.6 us | 50.1 ms | 8,900x | 1 | 635,571 |
| SQL (504KB) | 55.2 us | 1.17 s | 21,200x | 1 | 5,484,695 |

Cold start (Go, no cache)

| Metric | gotreesitter | Chroma |
|---|---|---|
| RenderCode cold | 818 ns | 50.8 ms |
| RenderFullFile cold | 8.8 us | 64.9 ms |

Key observations

  • Zero allocations on hot-cache code rendering (cache hit returns pre-rendered HTML)
  • 1 allocation on full-file rendering (line slice)
  • Cold start is ~same speed as hot for gotreesitter (incremental parse detects no-edit)
  • Chroma allocates 500K-5M objects per render due to tokenizer/formatter pipeline

- Update test assertions to match corrected output from highlightCodeLinesForDiffFile()
- Remove incorrect trailing span wrapping from expected newline characters
- Update gotreesitter dependency to latest revision for syntax highlighting improvements
@silverwind
Member

silverwind commented Mar 2, 2026

One topic we have to consider before fully getting rid of chroma is their language metadata, which IIRC gitea uses in conjunction with https://github.com/go-enry/go-enry, which sources it from https://github.com/github-linguist/linguist.

I think linguist's languages.yml is probably the most complete language metadata on the Internet and I wonder how these treesitters handle that topic, e.g. how they match a given filename to a parser. Could it be that c-treesitter relies on linguist data or does it maintain its own language metadata?

@odvcencio
Author

c's tree-sitter expects the caller to provide the language parser explicitly, and filename/language detection is a host app responsibility... enry stays as the source of truth for language metadata/detection; i'm mapping the (linguist through enry) languages to gotreesitter grammars so this can slot in easily. just added a patch to prefer the enry file language over the extension.

- Prioritize explicit language metadata (enry/gitattributes) before falling back to filename-based detection
- Fixes wrong grammar selection on ambiguous extensions like ".h" where metadata disambiguates C vs Objective-C vs C++
- Add comprehensive tests for metadata preference and filename fallback scenarios
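The resolution order described in these commit notes (explicit language metadata first, filename-based detection second) can be sketched in Go as follows. This is only an illustration: the lookup tables and `resolveGrammar` are hypothetical stand-ins, and real detection would come from enry and the grammar library rather than hard-coded maps.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Hypothetical lookup tables standing in for the real language and
// extension registries.
var grammarByLanguage = map[string]string{
	"c":           "c",
	"c++":         "cpp",
	"objective-c": "objc",
}

var grammarByExt = map[string]string{
	".c":   "c",
	".h":   "c", // ambiguous extension: only a default guess
	".cpp": "cpp",
}

// resolveGrammar prefers explicit language metadata and only falls back to
// the file extension when the metadata is missing or unknown.
func resolveGrammar(metadataLang, filename string) (string, bool) {
	if metadataLang != "" {
		if g, ok := grammarByLanguage[strings.ToLower(metadataLang)]; ok {
			return g, true
		}
	}
	g, ok := grammarByExt[strings.ToLower(filepath.Ext(filename))]
	return g, ok
}

func main() {
	fmt.Println(resolveGrammar("C++", "vec.h"))   // metadata disambiguates .h
	fmt.Println(resolveGrammar("", "vec.h"))      // extension default
	fmt.Println(resolveGrammar("", "README.txt")) // no grammar: caller falls back
}
```

When no grammar is found, the second return value is false and the caller can fall back to another highlighter or to plaintext.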
@silverwind
Member

silverwind commented Mar 2, 2026

Ok, from what I gather, Github does it like this:

  1. Linguist determines language based on its heuristics
  2. Language is passed into TreeLights, a closed source service which takes a language and spits out a tree-sitter grammar. If there is no grammar, it falls back to PrettyLights, another closed-source service which then does regex-based highlighting.
  3. Tree-sitter then produces the AST for syntax highlighting and other features

So I guess that mapping is something we will have to implement and maintain, as opposed to Chroma where this was built-in.

Maybe we will also keep Chroma as a fallback in case we can not determine a tree-sitter grammar, so tree-sitter provides a fast path for the most common languages.

@odvcencio
Author

Chroma fallback is good business. i will get that in.

- Update odvcencio/gotreesitter to latest commit 1fdab5f3cc1c
- Pulls in latest improvements for tree-sitter syntax highlighting
- Update gotreesitter dependency to latest commit with improved APIs
- Replace custom registry and canonical key logic with tsgrammars.DetectLanguageByName()
- Replace custom DisplayName() with tsgrammars.DisplayName() from library
- Remove treeSitterLanguageAliases map - library now handles linguist name mappings
- Use tsgrammars.DetectLanguage() for extension/filename-based detection
- Simplify lookupTreeSitterEntryByLanguageName() by leveraging native library functions
- Remove duplicate gotreesitter entries from go.sum
- Remove unnecessary tc variable capture in test loop (no t.Parallel used)
- Remove trailing whitespace from test file
…er grammars

- Use chroma lexer names and aliases to resolve tree-sitter grammars
- Add `lookupTreeSitterEntryByChromaLexer` to match chroma lexers against tree-sitter languages
- Integrate fallback in `resolveTreeSitterEntry` and `resolveTreeSitterEntryWithAnalyze` when primary detection fails
- Add tests for resolving "ksh" (chroma alias) to "bash" tree-sitter grammar
- Return boolean success indicator from parsing and highlighting functions
- Invalidate cache on parse failures to prevent reuse of stale output
- Enable fallback to Chroma highlighting when tree-sitter parsing fails
- Remove synthetic tree creation on parse errors in favor of explicit failure
…API with metrics

- Update gotreesitter library to v0.6.1-0.20260313093557 for improved Highlighter API
- Replace manual parser/query management with unified Highlighter abstraction
- Add comprehensive metrics collection for render operations and fallback reasons
- Implement compatibility modes for Haskell (module prefix injection) and Nginx (source normalization)
- Add detailed render attempt tracking with specific fallback reason codes
- Treat .txt files as plaintext unless explicit language metadata provided
- Remove obsolete benchmark files superseded by new testing infrastructure
- Add visual parity samples and rollout tests for production validation
- Clean up empty [highlight] section from app.example.ini configuration
@github-actions github-actions bot removed the docs-update-needed The document needs to be updated synchronously label Mar 14, 2026
- Use named return values and defer for consistent panic recovery and mutex unlocking
- Rename 'ok' to 'rangesOK' to prevent shadowing named return value
- Add bounds checking for highlight ranges in renderLines to prevent out-of-bounds access
- Simplify cache hit returns by removing intermediate variables
- Add size limit check to RenderCode and RenderCodeByLexer to return escaped plaintext for oversized inputs, preventing performance issues
- Use singleflight.Group to deduplicate concurrent NewHighlighter calls for the same language, avoiding redundant work and race conditions
- Add highlightFallbackNone constant to represent successful tree-sitter renders without fallback
- Fix prometheus registration to ignore already-registered errors in test binaries
- Remove deprecated // +build comment in favor of go:build directive
- Remove highlightLexer and highlightRender fields from DiffSection and DiffFile structs
- Remove chroma/v2 import since lexer caching is no longer needed
- Update highlight diff tests to use exact string assertions for mixed backend output
- Update expected HTML output to reflect new tree-sitter based syntax highlighting
- Change CSS classes from 'gh' to 'p' and 'nx' for heading elements
- Update editor diff preview assertion to include full line content in added-code span
- Update blob excerpt assertion for new tree-sitter token class structure
- Update gotreesitter dependency from pre-release version v0.6.1 to stable release v0.7.1
- Add comprehensive doc comments to tree-sitter compatibility functions
- Explain the reshape-highlight-project strategy for grammars requiring well-formed files
- Document Haskell grammar workaround requiring module declarations
- Detail nginx config normalization pipeline and position mapping
- Update gotreesitter dependency from v0.7.1 to v0.7.2
- Add comprehensive benchmark tests comparing Chroma lexer and TreeSitter renderer performance
- Cover multiple languages: Go, Python, JavaScript, TypeScript, C, Rust, Java, Ruby, and CSS
- Benchmark both cold cache (unique inputs) and warm cache (repeated renders) scenarios
- Remove Swift from known TreeSitter fallbacks map indicating improved support
- Modernize benchmark test code to use Go 1.22+ range-over-integer syntax
- Replace legacy for-loop patterns with cleaner `for i := range n` style
- Update github.com/odvcencio/gotreesitter dependency to latest patch version
- Includes go.sum checksum updates for the new version
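One of the commit notes above describes returning escaped plaintext for oversized inputs. A minimal Go sketch of that idea follows, assuming a simple byte-length limit. `renderOrEscape`, the limit value, and the stub highlighter are hypothetical and not the PR's actual `RenderCode` logic.

```go
package main

import (
	"fmt"
	"html"
)

// maxHighlightBytes is a hypothetical limit chosen only for this illustration.
const maxHighlightBytes = 1 << 20

// renderOrEscape skips highlighting for inputs above the limit and returns
// HTML-escaped plaintext instead, so large files cannot stall rendering.
func renderOrEscape(code string, limit int, highlight func(string) string) string {
	if len(code) > limit {
		return html.EscapeString(code)
	}
	return highlight(code)
}

func main() {
	// Stub highlighter standing in for the real renderer.
	highlight := func(code string) string {
		return "<span class=\"hl\">" + html.EscapeString(code) + "</span>"
	}
	fmt.Println(renderOrEscape("a < b", maxHighlightBytes, highlight))
	fmt.Println(renderOrEscape("a < b", 3, highlight)) // over the limit: escaped only
}
```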
@odvcencio odvcencio marked this pull request as ready for review March 16, 2026 08:19
@odvcencio
Author

odvcencio commented Mar 16, 2026

gotreesitter v0.7.3 has been released and with it, we have lots of correctness parity and benchmarking guarantees, the perf picture has shifted a bit but we are still winning against chroma. the fallback makes it so if gotreesitter doesnt cover it, we fall back to chroma. nothing lost, only gained. let me know if i need to pull anything out like the benchmark stuff to get this to land.

latest benchies say:

Benchmarks — gotreesitter v0.7.3 vs Chroma

Machine: Intel Core Ultra 9 285, 20 threads | Input: ~200 functions per language (~8-12KB) | Runs: 5

Cold Render (first parse, no cache)

| Language | Chroma (ms) | Tree-sitter (ms) | Speedup | TS memory | Chroma memory | TS allocs | Chroma allocs |
|---|---|---|---|---|---|---|---|
| Go | 12.1 | 7.1 | 1.7x | 1.4 MB | 7.9 MB | 4,086 | 136,493 |
| Python | 39.3 | 16.0 | 2.5x | 4.2 MB | 16.9 MB | 14,828 | 269,865 |
| JavaScript | 17.2 | 13.1 | 1.3x | 4.9 MB | 17.4 MB | 19,333 | 272,032 |
| TypeScript | 15.9 | 11.9 | 1.3x | 1.1 MB | 14.5 MB | 4,497 | 235,510 |
| C | 25.5 | 17.6 | 1.4x | 4.2 MB | 19.5 MB | 13,705 | 312,886 |
| Rust | 29.7 | 27.5 | 1.1x | 4.4 MB | 22.1 MB | 15,107 | 366,497 |
| Java | 31.9 | 13.9 | 2.3x | 4.0 MB | 23.8 MB | 15,150 | 398,482 |
| Ruby | 50.7 | 37.4 | 1.4x | 5.7 MB | 14.2 MB | 20,515 | 233,248 |
| CSS | 28.5 | 10.7 | 2.7x | 3.7 MB | 11.2 MB | 8,297 | 186,471 |

Tree-sitter is faster on all 9 languages, using roughly 2.5x–13x less memory and 11x–52x fewer allocations.

Cached Render (same source, cache hit — the diff view path)

| Language | Tree-sitter cached (ns) | Chroma cold (ms) | Speedup | Allocations |
|---|---|---|---|---|
| Go | 77 | 12.1 | 157,000x | 0 |
| Python | 341 | 39.3 | 115,000x | 0 |
| JavaScript | 169 | 17.2 | 102,000x | 0 |
| TypeScript | 152 | 15.9 | 105,000x | 0 |
| C | 296 | 25.5 | 86,000x | 0 |
| Rust | 329 | 29.7 | 90,000x | 0 |
| Java | 419 | 31.9 | 76,000x | 0 |
| Ruby | 170 | 50.7 | 298,000x | 0 |
| CSS | 145 | 28.5 | 197,000x | 0 |

@TheFox0x7
Contributor

Tested on a few files and the Gitea repo in general.

Regression in assets/emoji.json

yours:
image
main branch:
image

Difference in yaml files

It does inline true and false better than chroma.

yours:
image

main:
image

I guess it'll end up being a matter of personal preference between the two.
Also keep in mind that this adds 20MB.

As for the code, I haven't looked deeply, but can you explain the added metrics? Why are they defined, and how are they useful? What do you expect to see?

Labels

lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. modifies/dependencies modifies/go Pull requests that update Go code

5 participants