fix: relax filtering of heading elements with classnames that include the word "header" #868
inhumantsar wants to merge 5 commits into mozilla:main
Hi @inhumantsar! Thanks for investigating this. Did you mean to mark this as a work in progress and/or would you like feedback on this at this point?
It should be complete, but I wanted to get the other PRs in before calling it ready. I'll rebase it and make sure it's all good tonight or tomorrow, then mark it for review.
ok so i had a chance to refresh my memory. i put this into draft until the PR with all of the
unambiguously positive impacts:
ambiguously positive impacts:
negative impacts:
another issue is that i can probably deal with these less-than-ideal captures with some simple heuristics but not sure if that should get its own PR. i don't know where to start with the
0bbbf9f to a2ef447
a2ef447 to 30211ad
let's get this merged!
This removes `header` from unlikely and adds it to `positive` in an attempt to avoid filtering legitimate heading elements.

It does seem to improve parsing generally, even capturing some previously ignored metadata, but it does introduce a few unwanted artifacts.
Closes #855 and will likely have merge conflicts with #867 and #866
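
For anyone skimming, here is a rough sketch of the kind of effect this change is meant to have. It is not the actual code from this PR: the pattern lists are heavily trimmed placeholders, and `classify`, `before`, and `after` are hypothetical names made up for illustration. The real filtering and scoring logic has more steps than this.

```javascript
// Hypothetical, trimmed-down pattern sets. The real lists are much longer.
const before = {
  unlikely: /banner|comment|footer|header|sidebar/i,
  positive: /article|body|content|entry|main/i,
};

// Same sets with "header" moved from unlikely to positive.
const after = {
  unlikely: /banner|comment|footer|sidebar/i,
  positive: /article|body|content|entry|header|main/i,
};

// Simplified stand-in for the filter: an element whose class/id string
// matches the unlikely pattern is dropped, otherwise a positive match
// boosts it. This ordering is an assumption made for the sketch.
function classify(patterns, matchString) {
  if (patterns.unlikely.test(matchString)) {
    return "dropped";
  }
  if (patterns.positive.test(matchString)) {
    return "boosted";
  }
  return "neutral";
}

for (const name of ["post-header", "site-header", "sidebar-header"]) {
  console.log(name, classify(before, name), "->", classify(after, name));
}
```

In this sketch a heading wrapper like `post-header` is no longer dropped, but so is a page-level `site-header`, which is roughly the shape of the unwanted artifacts mentioned above.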