v3 Development: Improve reliability of decoder by treating only trial decodes and text validation as authoritative.#3

Merged
emcd merged 12 commits into master from decode-refactor on Feb 14, 2026

emcd commented Feb 13, 2026

No description provided.

emcd and others added 4 commits February 12, 2026 03:59
Add comprehensive documentation for confidence scoring approach:
- Size-based scaling rationale and formula
- Detector-specific strategies (intrinsic vs constant confidence)
- Base confidence values for magic (0.95/0.75) and charset-normalizer (0.85)
- Examples and interaction with behavior thresholds
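
The commit message mentions a size-based scaling formula but does not reproduce it in this conversation. As a rough illustration only, the sketch below assumes a linear ramp up to a reference size. The names `scale_confidence` and `REFERENCE_SIZE` and the ramp itself are hypothetical; only the base value 0.85 for charset-normalizer comes from the commit message.

```python
# Hypothetical sketch: the formula documented in the project may differ.
REFERENCE_SIZE = 1024  # assumed byte count at which the full base confidence applies


def scale_confidence(base: float, size: int, reference_size: int = REFERENCE_SIZE) -> float:
    """Scale a detector's base confidence by how much content was sampled."""
    if size <= 0:
        return 0.0
    # Linear ramp, capped at 1.0 so large inputs never exceed the base value.
    return base * min(1.0, size / reference_size)


# Base value for charset-normalizer taken from the commit message.
print(scale_confidence(0.85, 256))   # small sample: reduced confidence
print(scale_confidence(0.85, 4096))  # large sample: full base confidence
```

The intent of such a scheme would be that a detection made from a handful of bytes is trusted less than one made from a larger sample, which is what lets behavior thresholds reject weak guesses.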

Add analysis of text validation and confidence threshold:
- text_validate_confidence is effectively unused (always 0.0 in main path)
- Validation checks textuality, not detection confidence (orthogonal concerns)
- Recommend removing confidence threshold, keeping tristate control

Fix docstring in is_permissive_charset() to correctly reflect that CP1252
is not permissive (has 5 undefined bytes).
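
The CP1252 claim can be checked against Python's own codec tables. The helper below is an illustrative sketch and not code from this pull request; `undefined_bytes` is a hypothetical name, and the check only makes sense for single-byte codecs.

```python
def undefined_bytes(codec: str) -> list[int]:
    """Return the byte values that a single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


print([hex(b) for b in undefined_bytes("cp1252")])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(undefined_bytes("latin-1"))
# []
```

Because Latin-1 maps every byte to a character, a successful Latin-1 trial decode says nothing about the content, whereas CP1252 can still fail on those five bytes.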

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Refactor decoders to use charset trial decoding with validator hooks.

Update default trial codec order to prefer UTF-8 before OS defaults and keep inference confidence gating.

Adjust docs and tests for BOM-aware charset normalization and decode behavior.

Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.openai.com>
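
The general shape of trial decoding with a validator hook can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in this pull request: `trial_decode` and `looks_textual` are hypothetical names, and the validator shown is a toy check.

```python
from typing import Callable, Iterable, Optional


def trial_decode(
    content: bytes,
    codecs: Iterable[str],
    validator: Optional[Callable[[str], bool]] = None,
) -> Optional[tuple[str, str]]:
    """Return (codec, text) for the first codec that decodes and validates."""
    for codec in codecs:
        try:
            text = content.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable bytes or unknown codec name: try the next one
        if validator is not None and not validator(text):
            continue  # decoded, but the result does not look like text
        return codec, text
    return None


def looks_textual(text: str) -> bool:
    """Toy validator: reject control characters other than tab, LF, and CR."""
    return not any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in text)


# UTF-8 is tried first, as in the updated default order; CP1252 stands in
# for an OS default here.
print(trial_decode("café".encode("cp1252"), ("utf-8", "cp1252"), looks_textual))
# ('cp1252', 'café')
```

Trying UTF-8 first is a reasonable default because it is strict: bytes from most legacy encodings fail to decode as UTF-8, so a success is strong evidence, while the reverse order would let a permissive legacy codec accept UTF-8 input as mojibake.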

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 452af933e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

emcd and others added 5 commits February 12, 2026 21:09
Remove resolved Windows encoding investigation notes and keep active research notes focused on current v3 decisions.

Update ideas scope to post-v3.0+ and retain CP1252 historical finding in decode refactor notes.

Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.openai.com>
Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.openai.com>
Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.openai.com>
Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.github.com>
Co-Authored-By: GPT-5 Codex <gpt-5-codex@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2fbad9168d

emcd and others added 3 commits February 13, 2026 18:46
Propagate decode-attempt confidence into text validation gating.

Add tests for both above-threshold skip and below-threshold validation behavior.

Co-Authored-By: Codex <codex@users.noreply.openai.com>

Update architecture summary and validation decision documentation for current v3 behavior.

Restore conservative decode-attempt text validation confidence handling and remove threshold-gating tests.

Co-Authored-By: Codex <codex@users.noreply.openai.com>

Treat supplied HTTP Content-Type as authoritative parse input in inference paths.

Convert charset and MIME detection toggles to booleans and validate them via BehaviorsInvalidity.

Update tests and architecture notes for the v3 behavior model.

Co-Authored-By: Codex <codex@users.noreply.openai.com>
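
How the project parses a supplied Content-Type header is not shown in this conversation. As a hedged illustration of what the header yields as parse input, the sketch below uses only the Python standard library; `parse_content_type` is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional


def parse_content_type(header: str) -> tuple[str, Optional[str]]:
    """Split an HTTP Content-Type header value into (mimetype, charset)."""
    message = Message()
    message["Content-Type"] = header
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
# ('text/html', 'utf-8')
print(parse_content_type("application/json"))
# ('application/json', None)
```

Under the behavior the commit describes, values obtained this way would be used as given in inference paths instead of being treated as one hint among several.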
@emcd emcd merged commit 9f490ae into master Feb 14, 2026
22 checks passed