The optimized code achieves a **5% speedup** through two key performance improvements:
**1. Optimized Set Lookup for Number Words**
- Replaced O(n) list lookup `text in _num_words` with O(1) set lookup using a memoized `_num_words_set`
- The set is created only once on first function call and cached as a function attribute, avoiding repeated set construction overhead
- This optimization shows significant gains in test cases involving number words (30-50% faster for individual lookups, 32% faster for bulk word tests)
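The memoization described above can be sketched as follows. This is a minimal illustration, not the code from the pull request: the helper name `is_num_word` and the shortened `_num_words` list are assumptions made for the example, and only the attribute name `_num_words_set` comes from the description.

```python
# Hypothetical, shortened stand-in for the module-level list of number words.
_num_words = ["zero", "un", "dos", "tres", "quatre", "cinc"]


def is_num_word(text):
    """Return True if text is a known number word (illustrative helper)."""
    # Build the set on the first call only and cache it on the function object.
    cache = getattr(is_num_word, "_num_words_set", None)
    if cache is None:
        cache = is_num_word._num_words_set = frozenset(_num_words)
    # Hash-based membership test instead of a linear scan of the list.
    return text in cache
```

One trade-off of this pattern is that the cached set does not pick up later changes to `_num_words`. A module-level `frozenset` built at import time would be a simpler alternative with the same lookup cost.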
**2. More Efficient Fraction Detection**
- Changed `text.count("/") == 1` to `"/" in text` followed by `len(parts) == 2` validation after splitting
- The `in` test can return at the first "/" instead of scanning the whole string to count occurrences; strings with no "/" are still scanned once either way
- The profiler shows this reduces time spent on fraction detection from 13.7% to 9% of total runtime
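The fraction check described above might look roughly like the sketch below. The helper name `looks_like_fraction` is hypothetical, and the digit test on both parts is an assumption inferred from the regression tests rather than a copy of the real function.

```python
def looks_like_fraction(text):
    """Illustrative helper: True for strings of the form <digits>/<digits>."""
    # Cheap membership test first; most inputs contain no slash at all.
    if "/" in text:
        parts = text.split("/")
        # Exactly two parts means exactly one slash, same as count("/") == 1.
        if len(parts) == 2:
            num, denom = parts
            return num.isdigit() and denom.isdigit()
    return False
```

Splitting a string that contains exactly one "/" always yields two parts, so the `len(parts) == 2` check accepts the same inputs as `text.count("/") == 1`.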
**Performance Impact by Test Case:**
- **Number word lookups**: 30-50% faster due to O(1) set lookup vs O(n) list search
- **Non-number strings**: 18-60% faster as they fail early checks and benefit from faster word lookup fallback
- **Fraction processing**: Minimal impact (~1-5%) as most test cases have simple fractions
- **Basic digit recognition**: Unchanged performance for the most common case
The optimizations are particularly effective for workloads with frequent number word validation or processing of non-numeric strings that reach the final word lookup step. The memoization approach ensures the performance benefit persists across multiple function calls without memory leaks.
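To check figures like these locally, a rough micro-benchmark along the following lines could be used. It is only a sketch: the `words` list and the `bench` helper are made up for the example, and the timings it prints depend on the machine and Python version, so they will not match the numbers reported here.

```python
import timeit

# Hypothetical word list, sized like a short list of number words.
words = [f"word{i}" for i in range(36)]
word_set = frozenset(words)


def bench(container, probe, number=100_000):
    """Return the seconds taken by `number` membership tests of probe."""
    return timeit.timeit(lambda: probe in container, number=number)


if __name__ == "__main__":
    # Probe the first entry, the last entry, and a word that is absent.
    for probe in ("word0", "word35", "missing"):
        t_list = bench(words, probe)
        t_set = bench(word_set, probe)
        print(f"{probe!r}: list {t_list:.4f}s, set {t_set:.4f}s")
```

The absent word is the interesting case for the list, since the scan has to visit every entry before giving up.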
📄 6% (0.06x) speedup for `like_num` in `spacy/lang/ca/lex_attrs.py`
⏱️ Runtime: 2.73 milliseconds → 2.59 milliseconds (best of 139 runs)
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
```python
import pytest  # used for our unit tests
from spacy.lang.ca.lex_attrs import like_num

# function to test
_num_words = [
    "zero",
    "un",
    "dos",
    "tres",
    "quatre",
    "cinc",
    "sis",
    "set",
    "vuit",
    "nou",
    "deu",
    "onze",
    "dotze",
    "tretze",
    "catorze",
    "quinze",
    "setze",
    "disset",
    "divuit",
    "dinou",
    "vint",
    "trenta",
    "quaranta",
    "cinquanta",
    "seixanta",
    "setanta",
    "vuitanta",
    "noranta",
    "cent",
    "mil",
    "milió",
    "bilió",
    "trilió",
    "quatrilió",
    "gazilió",
    "bazilió",
]

# unit tests

# ------------------- Basic Test Cases -------------------


def test_basic_integer():
    # Should recognize a simple integer
    codeflash_output = like_num("123")  # 1.03μs -> 947ns (9.19% faster)


def test_basic_zero():
    # Should recognize zero
    codeflash_output = like_num("0")  # 950ns -> 867ns (9.57% faster)


def test_basic_num_word():
    # Should recognize a basic Catalan number word
    codeflash_output = like_num("tres")  # 1.75μs -> 1.34μs (30.8% faster)


def test_basic_fraction():
    # Should recognize a fraction with digits
    codeflash_output = like_num("3/4")  # 1.73μs -> 1.75μs (1.31% slower)


def test_basic_negative_number():
    # Should recognize a negative integer
    codeflash_output = like_num("-42")  # 1.24μs -> 1.23μs (0.243% faster)


def test_basic_plus_number():
    # Should recognize a positive number with plus sign
    codeflash_output = like_num("+17")  # 1.21μs -> 1.15μs (5.50% faster)


def test_basic_approx_number():
    # Should recognize a number with tilde
    codeflash_output = like_num("~99")  # 1.20μs -> 1.16μs (3.36% faster)


def test_basic_fraction_with_sign():
    # Should recognize a signed fraction
    codeflash_output = like_num("-1/2")  # 2.04μs -> 1.99μs (2.46% faster)


def test_basic_large_num_word():
    # Should recognize a large Catalan number word
    codeflash_output = like_num("milió")  # 2.33μs -> 1.72μs (35.3% faster)


def test_basic_number_with_comma():
    # Should recognize a number with comma as thousands separator
    codeflash_output = like_num("1,000")  # 1.27μs -> 1.21μs (4.97% faster)


def test_basic_number_with_dot():
    # Should recognize a number with dot as thousands separator
    codeflash_output = like_num("1.000")  # 1.15μs -> 1.14μs (0.701% faster)


def test_basic_fraction_with_comma():
    # Should recognize a fraction with comma in numerator
    codeflash_output = like_num("1,000/2")  # 2.11μs -> 2.04μs (3.13% faster)


# ------------------- Edge Test Cases -------------------


def test_edge_empty_string():
    # Empty string should not be recognized as a number
    codeflash_output = like_num("")  # 1.78μs -> 1.19μs (50.1% faster)


def test_edge_only_sign():
    # Only a sign, no digits
    codeflash_output = like_num("-")  # 1.93μs -> 1.48μs (31.1% faster)


def test_edge_only_fraction_slash():
    # Only a slash, no digits
    codeflash_output = like_num("/")  # 2.17μs -> 1.89μs (14.8% faster)


def test_edge_fraction_with_non_digit():
    # Fraction with non-digit numerator
    codeflash_output = like_num("a/2")  # 2.28μs -> 1.88μs (21.5% faster)
    # Fraction with non-digit denominator
    codeflash_output = like_num("2/b")  # 1.19μs -> 1.13μs (5.32% faster)
    # Fraction with both non-digits
    codeflash_output = like_num("a/b")  # 753ns -> 576ns (30.7% faster)


def test_edge_multiple_slashes():
    # More than one slash should not be recognized
    codeflash_output = like_num("1/2/3")  # 1.81μs -> 1.64μs (10.2% faster)


def test_edge_num_word_with_sign():
    # Number word with sign should not be recognized
    codeflash_output = like_num("-tres")  # 1.90μs -> 1.70μs (11.5% faster)


def test_edge_num_word_with_number_inside():
    # Number word with embedded digit should not be recognized
    codeflash_output = like_num("tres3")  # 1.84μs -> 1.22μs (51.0% faster)


def test_edge_number_with_letters():
    # Number with trailing letters should not be recognized
    codeflash_output = like_num("123abc")  # 1.92μs -> 1.19μs (61.6% faster)
    # Number with leading letters should not be recognized
    codeflash_output = like_num("abc123")  # 860ns -> 673ns (27.8% faster)


def test_edge_number_with_space():
    # Number with spaces should not be recognized
    codeflash_output = like_num(" 123 ")  # 1.74μs -> 1.15μs (51.7% faster)


def test_edge_fraction_with_space():
    # Fraction with spaces should not be recognized
    codeflash_output = like_num("1 /2")  # 2.18μs -> 1.72μs (27.0% faster)
    codeflash_output = like_num("1/ 2")  # 1.34μs -> 1.22μs (9.93% faster)


def test_edge_num_word_case_sensitivity():
    # Number word with different case should not be recognized
    codeflash_output = like_num("Tres")  # 1.66μs -> 1.14μs (44.9% faster)
    codeflash_output = like_num("TRES")  # 843ns -> 599ns (40.7% faster)


def test_edge_fraction_with_decimal():
    # Fraction with decimal in numerator or denominator should not be recognized
    codeflash_output = like_num("1.5/2")  # 1.98μs -> 1.94μs (1.75% faster)
    codeflash_output = like_num("2/3.5")  # 1.19μs -> 1.21μs (1.65% slower)


def test_edge_number_with_multiple_signs():
    # Multiple signs should not be recognized
    codeflash_output = like_num("++123")  # 2.14μs -> 1.57μs (36.2% faster)
    codeflash_output = like_num("--123")  # 934ns -> 765ns (22.1% faster)


def test_edge_number_with_sign_and_separator():
    # Number with sign and separator should be recognized
    codeflash_output = like_num("-1,234")  # 1.34μs -> 1.30μs (3.24% faster)
    codeflash_output = like_num("+1.234")  # 700ns -> 721ns (2.91% slower)


def test_edge_num_word_with_separator():
    # Number word with separator should not be recognized
    codeflash_output = like_num("mil.lió")  # 2.76μs -> 1.94μs (42.3% faster)


def test_edge_fraction_with_separator_in_denominator():
    # Fraction with separator in denominator should be recognized
    codeflash_output = like_num("1/1,000")  # 2.04μs -> 1.86μs (9.78% faster)


def test_edge_fraction_with_separator_in_both():
    # Fraction with separators in both parts should be recognized
    codeflash_output = like_num("1,000/2,000")  # 2.04μs -> 1.97μs (4.07% faster)


def test_edge_fraction_with_sign_in_denominator():
    # Fraction with sign in denominator should not be recognized
    codeflash_output = like_num("1/-2")  # 2.56μs -> 1.98μs (29.3% faster)


def test_edge_fraction_with_sign_in_numerator():
    # Fraction with sign in numerator should not be recognized
    codeflash_output = like_num("-1/2")  # 1.97μs -> 2.00μs (1.60% slower)


def test_edge_fraction_with_sign_in_both():
    # Fraction with sign in both should not be recognized
    codeflash_output = like_num("-1/-2")  # 2.58μs -> 2.20μs (17.5% faster)


def test_edge_fraction_with_approx_sign():
    # Fraction with tilde should be recognized
    codeflash_output = like_num("~1/2")  # 1.96μs -> 1.84μs (6.47% faster)


def test_edge_fraction_with_pm_sign():
    # Fraction with ± sign should be recognized
    codeflash_output = like_num("±1/2")  # 2.03μs -> 2.09μs (2.73% slower)


def test_edge_number_with_pm_sign():
    # Number with ± sign should be recognized
    codeflash_output = like_num("±42")  # 1.31μs -> 1.26μs (4.05% faster)


def test_edge_number_with_unusual_sign():
    # Number with unsupported sign should not be recognized
    codeflash_output = like_num("@123")  # 2.04μs -> 1.28μs (60.3% faster)


# ------------------- Large Scale Test Cases -------------------


def test_large_many_digits():
    # Should recognize a large number of digits
    large_num = "9" * 1000
    codeflash_output = like_num(large_num)  # 3.23μs -> 3.29μs (2.06% slower)


def test_large_many_commas():
    # Should recognize a number with many commas
    num_with_commas = ",".join(["1"] * 500)
    # This is not a valid number, so should be False
    codeflash_output = like_num(num_with_commas)  # 7.27μs -> 7.15μs (1.57% faster)


def test_large_many_dots():
    # Should recognize a number with many dots
    num_with_dots = ".".join(["1"] * 500)
    # This is not a valid number, so should be False
    codeflash_output = like_num(num_with_dots)  # 7.28μs -> 7.24μs (0.566% faster)


def test_large_fraction_large_numbers():
    # Should recognize a fraction with very large numbers
    num = "9" * 500
    denom = "8" * 500
    codeflash_output = like_num(f"{num}/{denom}")  # 5.79μs -> 5.49μs (5.45% faster)


def test_large_fraction_many_slashes():
    # Fraction with many slashes should not be recognized
    many_slashes = "/".join(["1"] * 10)
    codeflash_output = like_num(many_slashes)  # 1.92μs -> 1.94μs (1.13% slower)


def test_large_num_word_list():
    # Should recognize all Catalan number words in the list
    for word in _num_words:
        codeflash_output = like_num(word)  # 21.6μs -> 16.3μs (32.2% faster)


def test_large_invalid_num_word_list():
    # Should not recognize any word not in the list
    for i in range(100):
        fake_word = f"word{i}"
        codeflash_output = like_num(fake_word)  # 59.8μs -> 39.6μs (50.8% faster)


def test_large_signed_numbers():
    # Should recognize many signed numbers
    for i in range(1, 1000, 100):
        codeflash_output = like_num(f"+{i}")  # 4.48μs -> 4.43μs (1.08% faster)
        codeflash_output = like_num(f"-{i}")
        codeflash_output = like_num(f"~{i}")  # 3.49μs -> 3.53μs (1.27% slower)
        codeflash_output = like_num(f"±{i}")


def test_large_fraction_with_separator():
    # Should recognize many fractions with separators
    for i in range(1, 1000, 111):
        num = f"{i:,}".replace(",", "")
        denom = f"{i+1:,}".replace(",", "")
        codeflash_output = like_num(f"{num}/{denom}")  # 6.29μs -> 6.09μs (3.20% faster)


def test_large_fraction_with_invalid_parts():
    # Should not recognize fractions with invalid parts
    for i in range(1, 1000, 111):
        codeflash_output = like_num(f"{i}/abc")  # 8.46μs -> 6.70μs (26.4% faster)
        codeflash_output = like_num(f"abc/{i}")
        codeflash_output = like_num("abc/abc")  # 6.57μs -> 5.18μs (26.9% faster)
```
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
```python
import pytest  # used for our unit tests
from spacy.lang.ca.lex_attrs import like_num

# function to test
_num_words = [
    "zero",
    "un",
    "dos",
    "tres",
    "quatre",
    "cinc",
    "sis",
    "set",
    "vuit",
    "nou",
    "deu",
    "onze",
    "dotze",
    "tretze",
    "catorze",
    "quinze",
    "setze",
    "disset",
    "divuit",
    "dinou",
    "vint",
    "trenta",
    "quaranta",
    "cinquanta",
    "seixanta",
    "setanta",
    "vuitanta",
    "noranta",
    "cent",
    "mil",
    "milió",
    "bilió",
    "trilió",
    "quatrilió",
    "gazilió",
    "bazilió",
]

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------


def test_basic_integer_strings():
    # Test basic integer strings
    codeflash_output = like_num("123")  # 866ns -> 830ns (4.34% faster)
    codeflash_output = like_num("0")  # 521ns -> 544ns (4.23% slower)
    codeflash_output = like_num("42")  # 411ns -> 400ns (2.75% faster)


def test_basic_signed_numbers():
    # Test numbers with allowed leading sign characters
    codeflash_output = like_num("+123")  # 1.18μs -> 1.07μs (10.2% faster)
    codeflash_output = like_num("-456")  # 516ns -> 586ns (11.9% slower)
    codeflash_output = like_num("±789")  # 599ns -> 631ns (5.07% slower)
    codeflash_output = like_num("~101")  # 427ns -> 409ns (4.40% faster)


def test_basic_number_words():
    # Test Catalan number words from _num_words
    for word in _num_words:
        codeflash_output = like_num(word)  # 21.3μs -> 16.3μs (30.7% faster)


def test_basic_fraction():
    # Test simple fractions
    codeflash_output = like_num("1/2")  # 1.50μs -> 1.66μs (9.41% slower)
    codeflash_output = like_num("123/456")  # 1.12μs -> 1.19μs (5.14% slower)


def test_basic_with_commas_and_periods():
    # Test numbers with commas and periods (should be stripped)
    codeflash_output = like_num("1,000")  # 1.15μs -> 1.10μs (4.45% faster)
    codeflash_output = like_num("2.000")  # 535ns -> 574ns (6.79% slower)
    codeflash_output = like_num("12,345,678")  # 590ns -> 589ns (0.170% faster)
    codeflash_output = like_num("9.876.543")  # 497ns -> 496ns (0.202% faster)


# -------------------------
# Edge Test Cases
# -------------------------


def test_edge_empty_string():
    # Empty string should not be considered a number
    codeflash_output = like_num("")  # 1.77μs -> 1.17μs (51.9% faster)


def test_edge_non_numeric_string():
    # Random non-numeric strings
    codeflash_output = like_num("hello")  # 1.78μs -> 1.15μs (55.4% faster)
    codeflash_output = like_num("number")  # 1.04μs -> 739ns (41.4% faster)
    codeflash_output = like_num("onetwo")  # 679ns -> 361ns (88.1% faster)


def test_edge_multiple_slashes():
    # More than one slash is not a valid fraction
    codeflash_output = like_num("1/2/3")  # 1.70μs -> 1.67μs (1.73% faster)
    codeflash_output = like_num("123/456/789")  # 905ns -> 1.25μs (27.4% slower)


def test_edge_fraction_with_non_digit():
    # Fractions with non-digit numerator or denominator
    codeflash_output = like_num("one/2")  # 2.21μs -> 1.70μs (30.2% faster)
    codeflash_output = like_num("2/two")  # 1.33μs -> 1.13μs (17.9% faster)
    codeflash_output = like_num("one/two")  # 1.07μs -> 811ns (32.6% faster)
    codeflash_output = like_num("1/")  # 1.01μs -> 835ns (21.0% faster)
    codeflash_output = like_num("/2")  # 846ns -> 682ns (24.0% faster)


def test_edge_leading_sign_and_fraction():
    # Fractions with leading sign
    codeflash_output = like_num("+1/2")  # 1.88μs -> 1.91μs (1.88% slower)
    codeflash_output = like_num("-123/456")  # 1.34μs -> 1.35μs (0.371% slower)
    codeflash_output = like_num("±7/8")  # 1.03μs -> 1.02μs (0.391% faster)
    codeflash_output = like_num("~9/10")  # 870ns -> 844ns (3.08% faster)


def test_edge_leading_sign_and_number_word():
    # Number words with leading sign are NOT valid
    codeflash_output = like_num("+un")  # 1.71μs -> 1.58μs (8.62% faster)
    codeflash_output = like_num("-mil")  # 1.20μs -> 990ns (21.5% faster)
    codeflash_output = like_num("±zero")  # 812ns -> 919ns (11.6% slower)
    codeflash_output = like_num("~cent")  # 850ns -> 607ns (40.0% faster)


def test_edge_number_with_letters():
    # Numbers with letters are not valid
    codeflash_output = like_num("123abc")  # 1.80μs -> 1.19μs (51.2% faster)
    codeflash_output = like_num("abc123")  # 890ns -> 644ns (38.2% faster)
    codeflash_output = like_num("1a2b3c")  # 661ns -> 533ns (24.0% faster)


def test_edge_fraction_with_commas_and_periods():
    # Fractions with commas/periods in numerator and denominator
    codeflash_output = like_num("1,000/2,000")  # 2.05μs -> 2.06μs (0.486% slower)
    codeflash_output = like_num("3.000/4.000")  # 987ns -> 987ns (0.000% faster)
    codeflash_output = like_num("5,000.000/6.000,000")  # 1.31μs -> 1.29μs (1.79% faster)


def test_edge_fraction_with_leading_sign_and_commas():
    # Fractions with leading sign and commas
    codeflash_output = like_num("+1,000/2,000")  # 2.10μs -> 2.04μs (2.84% faster)
    codeflash_output = like_num("-3,000/4,000")  # 972ns -> 1.01μs (4.05% slower)


def test_edge_fraction_with_invalid_format():
    # Fractions with missing numerator/denominator or extra chars
    codeflash_output = like_num("/2")  # 2.21μs -> 1.72μs (28.8% faster)
    codeflash_output = like_num("2/")  # 1.27μs -> 1.28μs (0.781% slower)
    codeflash_output = like_num("/")  # 910ns -> 768ns (18.5% faster)
    codeflash_output = like_num("1//2")  # 978ns -> 777ns (25.9% faster)


def test_edge_only_signs():
    # Only sign characters
    codeflash_output = like_num("+")  # 1.83μs -> 1.35μs (36.1% faster)
    codeflash_output = like_num("-")  # 841ns -> 630ns (33.5% faster)
    codeflash_output = like_num("±")  # 721ns -> 479ns (50.5% faster)
    codeflash_output = like_num("~")  # 642ns -> 434ns (47.9% faster)


def test_edge_number_with_spaces():
    # Numbers with spaces should not be valid
    codeflash_output = like_num("1 000")  # 1.85μs -> 1.09μs (69.5% faster)
    codeflash_output = like_num("12 345")  # 1.09μs -> 755ns (43.8% faster)
    codeflash_output = like_num("milió ")  # 1.25μs -> 980ns (27.3% faster)
    codeflash_output = like_num(" milió")  # 686ns -> 432ns (58.8% faster)


def test_edge_number_word_with_case_variation():
    # Number words with case variation are not valid (function is case-sensitive)
    codeflash_output = like_num("Un")  # 1.52μs -> 1.04μs (45.4% faster)
    codeflash_output = like_num("MIL")  # 1.03μs -> 712ns (44.1% faster)
    codeflash_output = like_num("Milió")  # 1.11μs -> 734ns (51.8% faster)


def test_edge_fraction_with_leading_zero():
    # Fractions with leading zeros
    codeflash_output = like_num("001/002")  # 1.68μs -> 1.74μs (3.28% slower)
    codeflash_output = like_num("0001/0002")  # 1.04μs -> 1.06μs (1.89% slower)


def test_edge_number_with_leading_zeros():
    # Numbers with leading zeros
    codeflash_output = like_num("000123")  # 868ns -> 800ns (8.50% faster)
    codeflash_output = like_num("0000")  # 482ns -> 472ns (2.12% faster)


def test_edge_fraction_with_zero_denominator():
    # "1/0" is still valid as a fraction syntactically
    codeflash_output = like_num("1/0")  # 1.50μs -> 1.59μs (5.53% slower)


def test_edge_number_word_with_punctuation():
    # Number words with punctuation are not valid
    codeflash_output = like_num("milió!")  # 2.17μs -> 1.46μs (48.6% faster)
    codeflash_output = like_num("milió.")  # 1.45μs -> 1.43μs (0.976% faster)
    codeflash_output = like_num("milió,")  # 800ns -> 524ns (52.7% faster)


# -------------------------
# Large Scale Test Cases
# -------------------------


def test_large_many_integer_strings():
    # Test a large list of integer strings
    for i in range(1, 1000):
        codeflash_output = like_num(str(i))  # 278μs -> 279μs (0.343% slower)


def test_large_many_fraction_strings():
    # Test a large list of valid fraction strings
    for i in range(1, 1000):
        for j in range(1, 5):  # keep inner loop small
            codeflash_output = like_num(f"{i}/{j}")


def test_large_many_number_words():
    # Test all _num_words plus some invalid ones
    for word in _num_words:
        codeflash_output = like_num(word)  # 22.3μs -> 16.8μs (32.7% faster)
    # Add some invalid variants
    invalid_words = [w.upper() for w in _num_words] + [w.capitalize() for w in _num_words]
    for word in invalid_words:
        codeflash_output = like_num(word)  # 45.9μs -> 30.0μs (53.1% faster)


def test_large_mixed_valid_and_invalid():
    # Mix of valid and invalid strings
    valid = [str(i) for i in range(100, 200)] + [f"{i}/{i+1}" for i in range(100, 200)] + _num_words
    invalid = ["abc", "1a2b", "milió!", "one/two", "123/abc", ""] + [f"{i}//{i+1}" for i in range(100, 200)]
    for v in valid:
        codeflash_output = like_num(v)  # 98.3μs -> 91.1μs (7.88% faster)
    for iv in invalid:
        codeflash_output = like_num(iv)  # 61.5μs -> 51.8μs (18.7% faster)


def test_large_numbers_with_commas_and_periods():
    # Test numbers with many commas and periods
    for i in range(1, 1000, 100):
        s = f"{i:,}"  # Python's thousands separator
        codeflash_output = like_num(s)  # 3.53μs -> 3.38μs (4.62% faster)
        s2 = f"{i:,}".replace(",", ".")
        codeflash_output = like_num(s2)


def test_large_fraction_with_commas_and_periods():
    # Test fractions with large numbers and commas/periods
    for i in range(1, 1000, 100):
        s = f"{i:,}/{i+1:,}"
        codeflash_output = like_num(s)  # 6.33μs -> 6.27μs (0.989% faster)
        s2 = f"{i:,}/{i+1:,}".replace(",", ".")
        codeflash_output = like_num(s2)


def test_large_invalid_fraction_formats():
    # Test many invalid fraction formats
    for i in range(1, 1000, 100):
        codeflash_output = like_num(f"{i}/{i}/{i}")  # 7.39μs -> 6.78μs (8.99% faster)
        codeflash_output = like_num(f"/{i}")  # missing numerator
        codeflash_output = like_num(f"{i}/")  # 8.05μs -> 6.19μs (30.1% faster)
        codeflash_output = like_num(f"{i}//{i+1}")  # double slash
```
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
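The kind of output comparison mentioned above can be sketched as follows. Both functions are reconstructions for illustration only: the sign stripping and separator removal in `_normalize` are assumptions inferred from the test comments, the word list is shortened, and the names `like_num_original`, `like_num_optimized` and `mismatches` are hypothetical. The real spaCy source may differ.

```python
# Hypothetical shortened word list; the real module defines more entries.
_num_words = ["zero", "un", "dos", "tres", "mil", "milió"]
_num_words_set = frozenset(_num_words)
_SIGNS = ("+", "-", "±", "~")


def _normalize(text):
    # Assumed preprocessing: drop one leading sign, then strip separators.
    if text.startswith(_SIGNS):
        text = text[1:]
    return text.replace(",", "").replace(".", "")


def like_num_original(text):
    text = _normalize(text)
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words  # linear scan of the list


def like_num_optimized(text):
    text = _normalize(text)
    if text.isdigit():
        return True
    if "/" in text:
        parts = text.split("/")
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            return True
    return text in _num_words_set  # hash lookup


def mismatches(inputs):
    """Return the inputs on which the two versions disagree."""
    return [s for s in inputs if like_num_original(s) != like_num_optimized(s)]
```

An empty list from `mismatches` on a varied set of inputs is evidence, though not proof, that the two versions behave the same.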
To edit these changes, `git checkout codeflash/optimize-like_num-mhwquqjh` and push.