New (alternative) message entity parsers#2556

Closed
Badiboy wants to merge 6 commits into eternnoir:master from Badiboy:master

Conversation

@Badiboy
Collaborator

Badiboy commented Feb 14, 2026

Description

I have now run into the problem of parsing messages with complex entity structures myself, so the time has come to rewrite the apply_html_entities processor. I could not write it by hand for lack of time, so I vibe-coded the replacements. I added 3 test versions of the message entity parser (apply_html_entities processor):

apply_html_entities_ds
apply_html_entities_gm
apply_html_entities_cg

They can be selected by setting the formatting.ENTITY_PASER_MODE variable.

Describe your tests

I ran some draft tests, and all three parsers appear to work. I'll take them to prod and see how they behave. If any of them survives, I'll propose it as a replacement for the original one.

Added 3 test versions of message entity parsers (apply_html_entities processor).
@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

@coder2020official Your opinion?

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds three alternative implementations of the message entity parser (apply_html_entities) function, generated by AI coding assistants (DeepSeek, Gemini, and ChatGPT). The implementations can be selected via the ENTITY_PASER_MODE configuration variable. The PR aims to provide alternative approaches to handling hard-structured message entity parsing, which will be tested in production to determine if any should replace the original implementation.

Changes:

  • Added global configuration variable ENTITY_PASER_MODE to select parser implementation
  • Added DeepSeek implementation (apply_html_entities_ds) with EntityProcessor class using recursive entity processing
  • Added Gemini implementation (apply_html_entities_gm) using marker-based tag insertion with intersection handling
  • Added ChatGPT implementation (apply_html_entities_cg) using tree-based entity rendering
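
One way such a mode switch could be wired is a lookup table with a fallback. The sketch below is only an illustration: the four `parse_*` functions and `select_parser` are hypothetical stand-ins rather than code from this PR, and only the mode strings ("deepseek", "gemini", "chatgpt") come from the PR's configuration comment.

```python
from typing import Callable, Dict, Optional


# Hypothetical stand-ins for the original parser and the three alternatives.
def parse_original(text: str) -> str:
    return f"original:{text}"


def parse_ds(text: str) -> str:
    return f"deepseek:{text}"


def parse_gm(text: str) -> str:
    return f"gemini:{text}"


def parse_cg(text: str) -> str:
    return f"chatgpt:{text}"


# Mode strings as listed in the PR's configuration comment.
PARSERS: Dict[str, Callable[[str], str]] = {
    "deepseek": parse_ds,
    "gemini": parse_gm,
    "chatgpt": parse_cg,
}


def select_parser(mode: Optional[str]) -> Callable[[str], str]:
    """Return the parser for `mode`, falling back to the original parser."""
    if isinstance(mode, str):
        return PARSERS.get(mode, parse_original)
    return parse_original
```

With a table like this, `None` and any unrecognized value both resolve to the original parser, which matches the "other values - original version" behavior described in the PR.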

Badiboy and others added 2 commits February 14, 2026 19:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@coder2020official
Collaborator

🥶

Are you sure this is a good idea? I've known about the issue for some time, and I even made a PR like a year ago (was closed); I could take a look at this issue sometime later, but if this is an emergency--

I still don't like the idea of choice here

@coder2020official
Collaborator

Couldn't find the PR, but had the code somewhere

@coder2020official
Collaborator

Found my code, will see how it works

@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

> Are you sure this is a good idea?

The current version is obviously not fully compliant. There are various ways to solve it; I did what I could for now because I need it now )

You have an alternative approach, so let's check it.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 22 comments.


Comment on lines +935 to +955
def apply_template(entity, inner: str, custom_subs: Optional[Dict[str, str]]) -> str:
    t = entity.type
    if t in ENTITY_TEMPLATES_CG:
        tpl = ENTITY_TEMPLATES_CG[t]
    elif custom_subs and t in custom_subs:
        tpl = custom_subs[t]
    else:
        return inner

    data = {"text": inner}

    if t == "text_link":
        data["url"] = getattr(entity, "url", "")
    elif t == "text_mention":
        data["url"] = f"tg://user?id={getattr(entity, 'user', {}).id if getattr(entity, 'user', None) else ''}"
    elif t == "custom_emoji":
        data["custom_emoji_id"] = getattr(entity, "custom_emoji_id", "")
    elif (t == "pre") and getattr(entity, "language", None):
        data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>'

    return tpl.format(**data)

Copilot AI Feb 14, 2026

The ChatGPT implementation has a critical bug in the template handling. When entity type is "pre" with a language, it sets data["text"] to include closing </pre> tag (line 953), but then the template at line 910 already has {text} within <pre>{text}</pre>. This will result in <pre><code class="language-X">content</code></pre></pre> with a double closing tag.
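
The effect is easy to reproduce in isolation. In this minimal sketch, `PRE_TEMPLATE` follows the `<pre>{text}</pre>` shape described in the comment above, while the two `render_pre_*` helper names are made up for the illustration.

```python
# Template shape as described in the review comment.
PRE_TEMPLATE = "<pre>{text}</pre>"


def render_pre_buggy(inner: str, language: str) -> str:
    # Mirrors the reviewed line: the closing </pre> is baked into the text.
    text = f'<code class="language-{language}">{inner}</code></pre>'
    return PRE_TEMPLATE.format(text=text)


def render_pre_fixed(inner: str, language: str) -> str:
    # Only the template supplies the <pre>...</pre> wrapper.
    text = f'<code class="language-{language}">{inner}</code>'
    return PRE_TEMPLATE.format(text=text)


print(render_pre_buggy("x = 1", "python"))
# <pre><code class="language-python">x = 1</code></pre></pre>
print(render_pre_fixed("x = 1", "python"))
# <pre><code class="language-python">x = 1</code></pre>
```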

elif t == "expandable_blockquote":
    return {'open': "<blockquote expandable>", 'close': "</blockquote>"}
elif t == "text_link":
    return {'open': f'<a href="{entity.url}">', 'close': "</a>"}

Copilot AI Feb 14, 2026

The Gemini implementation doesn't escape URLs before inserting them into HTML attributes. This is a security vulnerability that could lead to XSS attacks if entity.url contains malicious content. Unlike the DeepSeek version which uses escape_html(entity.url or ""), this implementation inserts the URL directly.

Suggested change
- return {'open': f'<a href="{entity.url}">', 'close': "</a>"}
+ return {'open': f'<a href="{escape_html(entity.url or "")}">', 'close': "</a>"}
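
As a self-contained illustration of escaping a value before placing it in an attribute, the sketch below uses the standard library's `html.escape` as a stand-in for the library's `escape_html`; `link_tags` is a hypothetical helper name. For a value placed inside a double-quoted attribute, the quote character has to be escaped too, which `html.escape` does when `quote=True` (its default).

```python
import html
from typing import Dict, Optional


def link_tags(url: Optional[str]) -> Dict[str, str]:
    """Build open/close tags for a text_link entity with an escaped href."""
    # quote=True also escapes " and ', so the value cannot close the attribute.
    safe_url = html.escape(url or "", quote=True)
    return {"open": f'<a href="{safe_url}">', "close": "</a>"}


print(link_tags('https://e.x/?a=1&b="><script>')["open"])
# <a href="https://e.x/?a=1&amp;b=&quot;&gt;&lt;script&gt;">
```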

elif t == "text_mention":
    return {'open': f'<a href="tg://user?id={entity.user.id}">', 'close': "</a>"}
elif t == "custom_emoji":
    return {'open': f'<tg-emoji emoji-id="{entity.custom_emoji_id}">', 'close': "</tg-emoji>"}

Copilot AI Feb 14, 2026

Direct attribute access without safety checks. If entity doesn't have a 'custom_emoji_id' attribute, this will raise an AttributeError. Should use getattr(entity, 'custom_emoji_id', '') for safety.

Suggested change
- return {'open': f'<tg-emoji emoji-id="{entity.custom_emoji_id}">', 'close': "</tg-emoji>"}
+ emoji_id = getattr(entity, "custom_emoji_id", "")
+ return {'open': f'<tg-emoji emoji-id="{emoji_id}">', 'close': "</tg-emoji>"}
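
The difference between the two access styles can be checked with a throwaway object. This is a sketch only: `emoji_tags` is a hypothetical helper and `SimpleNamespace` stands in for a real entity object. Whether an empty emoji ID is acceptable downstream is a separate question that the sketch does not address.

```python
from types import SimpleNamespace
from typing import Any, Dict


def emoji_tags(entity: Any) -> Dict[str, str]:
    """Build tg-emoji tags, tolerating a missing or None custom_emoji_id."""
    emoji_id = getattr(entity, "custom_emoji_id", "") or ""
    return {"open": f'<tg-emoji emoji-id="{emoji_id}">', "close": "</tg-emoji>"}


# Stand-in entity without the attribute: direct access would raise here.
bare = SimpleNamespace(type="custom_emoji")
print(emoji_tags(bare)["open"])
# <tg-emoji emoji-id="">

full = SimpleNamespace(type="custom_emoji", custom_emoji_id="12345")
print(emoji_tags(full)["open"])
# <tg-emoji emoji-id="12345">
```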

elif t == "custom_emoji":
    data["custom_emoji_id"] = getattr(entity, "custom_emoji_id", "")
elif (t == "pre") and getattr(entity, "language", None):
    data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>'

Copilot AI Feb 14, 2026

This implementation has the same issue as in the DeepSeek version. The template incorrectly includes </pre> within the formatted string, which will cause the template.format() to append this closing tag to the text, then the template's {text} placeholder will insert it again, resulting in malformed HTML with double closing tags.

Suggested change
- data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>'
+ data["text"] = f'<code class="language-{entity.language}">{inner}</code>'

Comment on lines 10 to 17
# Alternative message entities parsers. Can be:
# "deepseek" - deepseek version
# "gemini" - gemini version
# "chatgpt" - chatgpt version
# other values - original version
ENTITY_PARSER_MODE = None



Copilot AI Feb 14, 2026

The new ENTITY_PARSER_MODE configuration variable lacks documentation. It should have a docstring explaining its purpose, valid values, and how to use it. This is especially important for a public API configuration option.

Suggested change
- # Alternative message entities parsers. Can be:
- # "deepseek" - deepseek version
- # "gemini" - gemini version
- # "chatgpt" - chatgpt version
- # other values - original version
- ENTITY_PARSER_MODE = None
+ # Configuration option for selecting an alternative message entities parser.
+ # This is part of the public API and can be set by library users.
+ ENTITY_PARSER_MODE: Optional[str] = None
+ """Configuration mode for alternative message entity parsers.
+ Set this variable to one of the supported string values to select a
+ specific parser implementation:
+ - ``"deepseek"``: use the DeepSeek-based parser implementation.
+ - ``"gemini"``: use the Gemini-based parser implementation.
+ - ``"chatgpt"``: use the ChatGPT-based parser implementation.
+ - ``None`` or any other value: use the original, built-in parser.
+ Typical usage is to set :data:`ENTITY_PARSER_MODE` once at application
+ start-up, before creating bots or sending/processing messages that
+ contain entities, for example:
+ .. code:: python3
+
+     from telebot import formatting
+     formatting.ENTITY_PARSER_MODE = "deepseek"
+ """

mapping = utf16_index_map(text)
tree = build_tree(entities, mapping)
return render(text, tree, custom_subs)
#endregion

Copilot AI Feb 14, 2026

The region comments (#region, #endregion) are IDE-specific markers that may not be meaningful to all developers. While they help with code folding in some IDEs, they don't follow Python conventions. Consider using standard docstrings or comment blocks instead to maintain portability and clarity across different development environments.

Suggested change
- #endregion

@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

@coder2020official Now? )

@coder2020official
Collaborator

If you insist on such a structure, I guess it's okay

@coder2020official
Collaborator

Personally though I wouldn't want to have this in the release..

@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

We'll remove the unnecessary methods once we've confirmed that one of them is fully compliant, and even remove the original. For now I need the ability to switch between methods to test them.

@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

> Personally though I wouldn't want to have this in the release..

So what do you propose? Force-replacing the original one with yours? Or what?

@Badiboy
Collaborator Author

Badiboy commented Feb 14, 2026

Ok, wrong way for testing. I'll run them locally.

Badiboy closed this Feb 14, 2026
@coder2020official
Collaborator

I think we're rushing into a solution. We shouldn't release this into production right away.

2 participants
