Fix false positive copyright detection from Unicode surrogates #4664
Tarun-goswamii wants to merge 2 commits into aboutcode-org:develop
Conversation
Remove Unicode surrogate characters (U+D800-U+DFFF) from text before copyright detection to prevent false positives like '(c) \Truei (c) Y' that occur when surrogate bytes are misinterpreted.

This fixes issue aboutcode-org#4381, where files containing surrogate character ranges (like busybox-1.37.0/docs/unicode_full-bmp.txt) were incorrectly detected as having copyright content.

Changes:
- Add SURROGATE_PATTERN regex constant to match U+D800-U+DFFF range
- Add sanitize_line_for_detection() function to strip surrogates
- Integrate sanitization in detect_copyrights_from_lines()
- Add test suite for surrogate handling

Fixes: aboutcode-org#4381
Signed-off-by: tarun111111 <tarunpuri2544@gmail.com>
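For readers without the diff at hand, the change described above might look roughly like the sketch below. Only the names `SURROGATE_PATTERN` and `sanitize_line_for_detection` come from the commit message; the exact regex, the function body, and the `sanitize_numbered_lines` helper are assumptions for illustration and may differ from the code in this PR.

```python
import re

# Assumed pattern: matches any lone surrogate code point in U+D800-U+DFFF.
# The actual constant in the PR may be written differently.
SURROGATE_PATTERN = re.compile(r'[\ud800-\udfff]')


def sanitize_line_for_detection(line):
    """Return `line` with every surrogate code point removed."""
    return SURROGATE_PATTERN.sub('', line)


def sanitize_numbered_lines(numbered_lines):
    """
    Hypothetical helper (not from the PR): yield (line_number, line) pairs
    with surrogates stripped, as a detection entry point might do before
    tokenizing each line.
    """
    for line_number, line in numbered_lines:
        yield line_number, sanitize_line_for_detection(line)


if __name__ == '__main__':
    # Bytes ED A0 80 are the UTF-8-style encoding of U+D800; decoding with
    # 'surrogatepass' keeps the lone surrogate in the resulting str.
    raw = b'abc \xed\xa0\x80 def'.decode('utf-8', 'surrogatepass')
    print(repr(sanitize_line_for_detection(raw)))
```

Under these assumptions, ordinary text such as `Copyright (c) 2024 Example` passes through unchanged, and real astral characters (single code points above U+FFFF) are untouched because they are not surrogates in a Python `str`.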
Regarding the failing license_validate_ignorables tests

This PR only modifies:

Tests relevant to this PR have passed:

@chinyeungli Could you please confirm whether the license_validate_ignorables failures are a known, pre-existing issue on the develop branch?

@chinyeungli, @AyanSinhaMahapatra Please let me know if any adjustments or clarifications are needed.
Fixes #4381
Tasks
Run tests locally to check for errors.