New (alternative) message entity parsers#2556
New (alternative) message entity parsers#2556Badiboy wants to merge 6 commits intoeternnoir:masterfrom
Conversation
Added 3 test versions of message entity parsers (apply_html_entities processor).
|
@coder2020official Your oppinion? |
There was a problem hiding this comment.
Pull request overview
This pull request adds three alternative implementations of the message entity parser (apply_html_entities) function, generated by AI coding assistants (DeepSeek, Gemini, and ChatGPT). The implementations can be selected via the ENTITY_PASER_MODE configuration variable. The PR aims to provide alternative approaches to handling hard-structured message entity parsing, which will be tested in production to determine if any should replace the original implementation.
Changes:
- Added global configuration variable
ENTITY_PASER_MODEto select parser implementation - Added DeepSeek implementation (
apply_html_entities_ds) with EntityProcessor class using recursive entity processing - Added Gemini implementation (
apply_html_entities_gm) using marker-based tag insertion with intersection handling - Added ChatGPT implementation (
apply_html_entities_cg) using tree-based entity rendering
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
🥶 Are you sure this is a good idea? I've known about the issue for some time, and I even made a PR like a year ago (was closed); I could take a look at this issue sometime later, but if this is an emergency-- I still don't like the idea of choice here |
|
Couldn't find the PR, but had the code somewhere |
|
Found my code, will see how it works |
Current version is obviously not fully compliant. There can be various ways to solve it, I did what I could for now because I need it now ) You have alternative way - let's check it. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 22 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def apply_template(entity, inner: str, custom_subs: Optional[Dict[str, str]]) -> str: | ||
| t = entity.type | ||
| if t in ENTITY_TEMPLATES_CG: | ||
| tpl = ENTITY_TEMPLATES_CG[t] | ||
| elif custom_subs and t in custom_subs: | ||
| tpl = custom_subs[t] | ||
| else: | ||
| return inner | ||
|
|
||
| data = {"text": inner} | ||
|
|
||
| if t == "text_link": | ||
| data["url"] = getattr(entity, "url", "") | ||
| elif t == "text_mention": | ||
| data["url"] = f"tg://user?id={getattr(entity, 'user', {}).id if getattr(entity, 'user', None) else ''}" | ||
| elif t == "custom_emoji": | ||
| data["custom_emoji_id"] = getattr(entity, "custom_emoji_id", "") | ||
| elif (t == "pre") and getattr(entity, "language", None): | ||
| data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>' | ||
|
|
||
| return tpl.format(**data) |
There was a problem hiding this comment.
The ChatGPT implementation has a critical bug in the template handling. When entity type is "pre" with a language, it sets data["text"] to include closing </pre> tag (line 953), but then the template at line 910 already has {text} within <pre>{text}</pre>. This will result in <pre><code class="language-X">content</code></pre></pre> with a double closing tag.
| elif t == "expandable_blockquote": | ||
| return {'open': "<blockquote expandable>", 'close': "</blockquote>"} | ||
| elif t == "text_link": | ||
| return {'open': f'<a href="{entity.url}">', 'close': "</a>"} |
There was a problem hiding this comment.
The Gemini implementation doesn't escape URLs before inserting them into HTML attributes. This is a security vulnerability that could lead to XSS attacks if entity.url contains malicious content. Unlike the DeepSeek version which uses escape_html(entity.url or ""), this implementation inserts the URL directly.
| return {'open': f'<a href="{entity.url}">', 'close': "</a>"} | |
| return {'open': f'<a href="{escape_html(entity.url or "")}">', 'close': "</a>"} |
| elif t == "text_mention": | ||
| return {'open': f'<a href="tg://user?id={entity.user.id}">', 'close': "</a>"} | ||
| elif t == "custom_emoji": | ||
| return {'open': f'<tg-emoji emoji-id="{entity.custom_emoji_id}">', 'close': "</tg-emoji>"} |
There was a problem hiding this comment.
Direct attribute access without safety checks. If entity doesn't have a 'custom_emoji_id' attribute, this will raise an AttributeError. Should use getattr(entity, 'custom_emoji_id', '') for safety.
| return {'open': f'<tg-emoji emoji-id="{entity.custom_emoji_id}">', 'close': "</tg-emoji>"} | |
| emoji_id = getattr(entity, "custom_emoji_id", "") | |
| return {'open': f'<tg-emoji emoji-id="{emoji_id}">', 'close': "</tg-emoji>"} |
| elif t == "custom_emoji": | ||
| data["custom_emoji_id"] = getattr(entity, "custom_emoji_id", "") | ||
| elif (t == "pre") and getattr(entity, "language", None): | ||
| data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>' |
There was a problem hiding this comment.
This implementation has the same issue as in the DeepSeek version. The template incorrectly includes </pre> within the formatted string, which will cause the template.format() to append this closing tag to the text, then the template's {text} placeholder will insert it again, resulting in malformed HTML with double closing tags.
| data["text"] = f'<code class="language-{entity.language}">{inner}</code></pre>' | |
| data["text"] = f'<code class="language-{entity.language}">{inner}</code>' |
| # Alternative message entities parsers. Can be: | ||
| # "deepseek" - deepseek version | ||
| # "gemini" - gemini version | ||
| # "chatgpt" - chatgpt version | ||
| # other values - original version | ||
| ENTITY_PARSER_MODE = None | ||
|
|
||
|
|
There was a problem hiding this comment.
The new ENTITY_PARSER_MODE configuration variable lacks documentation. It should have a docstring explaining its purpose, valid values, and how to use it. This is especially important for a public API configuration option.
| # Alternative message entities parsers. Can be: | |
| # "deepseek" - deepseek version | |
| # "gemini" - gemini version | |
| # "chatgpt" - chatgpt version | |
| # other values - original version | |
| ENTITY_PARSER_MODE = None | |
| # Configuration option for selecting an alternative message entities parser. | |
| # This is part of the public API and can be set by library users. | |
| ENTITY_PARSER_MODE: Optional[str] = None | |
| """Configuration mode for alternative message entity parsers. | |
| Set this variable to one of the supported string values to select a | |
| specific parser implementation: | |
| - ``"deepseek"`` – use the DeepSeek-based parser implementation. | |
| - ``"gemini"`` – use the Gemini-based parser implementation. | |
| - ``"chatgpt"`` – use the ChatGPT-based parser implementation. | |
| - ``None`` or any other value – use the original, built-in parser. | |
| Typical usage is to set :data:`ENTITY_PARSER_MODE` once at application | |
| start-up, before creating bots or sending/processing messages that | |
| contain entities, for example: | |
| .. code:: python3 | |
| from telebot import formatting | |
| formatting.ENTITY_PARSER_MODE = "deepseek" | |
| """ |
| mapping = utf16_index_map(text) | ||
| tree = build_tree(entities, mapping) | ||
| return render(text, tree, custom_subs) | ||
| #endregion |
There was a problem hiding this comment.
The region comments (#region, #endregion) are IDE-specific markers that may not be meaningful to all developers. While they help with code folding in some IDEs, they don't follow Python conventions. Consider using standard docstrings or comment blocks instead to maintain portability and clarity across different development environments.
| #endregion |
|
@coder2020official Now? ) |
|
If you insist on such structure - I guess it's okay |
|
Personally though I wouldn't want to have this in the release.. |
|
We'll clear unnecessary methods after finalizing that one of them are fully compliant. And even remove the original. I need now the ability to switch methods to test them. |
So what you propose? Force replacing the original one with yours? Or what? |
|
Ok, wrong way for testing. I'll run them locally. |
|
I think we're hurrying with a solution. We shouldn't release this right away into production |
Description
I now myself faced the issue for some hard-structured message entities parsing, so the time had come to rewrite apply_html_entities processor. I was not able to code it myself due to lack of time, so I had videcoded them. I added 3 test versions of message entity parsers (apply_html_entities processor)/
They can be chosen by setting
formatting.ENTITY_PASER_MODEentity.
Describe your tests
I made some draft tests, all of them are assumed to work. I'll take them to prod and check how things are going. If any of them will survive - I'll provide it to replace original one.