@herwinw
Member

@herwinw herwinw commented Jan 16, 2026

Keeping this as a draft for now. The comment in the MRI source for rb_interned_str reads as follows:

/**
 * Identical to rb_str_new(), except it returns an infamous "f"string.  What is
 * a  fstring?  Well  it is  a special  subkind of  strings that  is immutable,
 * deduped globally, and managed by our GC.   It is much like a Symbol (in fact
 * Symbols  are dynamic  these days  and are  backended using  fstrings).  This
 * concept has been  silently introduced at some point in  2.x era.  Since then
 * it  gained  wider acceptance  in  the  core.   Starting from  3.x  extension
 * libraries can also generate ones.
 *
 * @param[in]  ptr           A memory region of `len` bytes length.
 * @param[in]  len           Length  of  `ptr`,  in bytes,  not  including  the
 *                           terminating NUL character.
 * @exception  rb_eArgError  `len` is negative.
 * @return     A  found or  created instance  of ::rb_cString,  of `len`  bytes
 *             length, of  "binary" encoding,  whose contents are  identical to
 *             that of `ptr`.
 * @pre        At  least  `len` bytes  of  continuous  memory region  shall  be
 *             accessible via `ptr`.
 */
VALUE rb_interned_str(const char *ptr, long len);

But the encoding of the result is always Encoding::US_ASCII, which leads to the following spec:

it "supports binary strings that are invalid in ASCII encoding" do
  str = "foo\x81bar\x82baz".b
  result = @s.rb_interned_str(str, str.bytesize)
  result.encoding.should == Encoding::US_ASCII
  result.should == str.dup.force_encoding(Encoding::US_ASCII)
  result.should_not.valid_encoding?
end
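
The byte-level situation this spec pins down can be reproduced in plain Ruby, without the C extension. The sketch below only mimics the observed tagging with `force_encoding`. It does not call `rb_interned_str`, and the variable name `tagged` is illustrative.

```ruby
# Bytes 0x81 and 0x82 are outside the 7-bit ASCII range.
str = "foo\x81bar\x82baz".b

# Mimic the observed result: the same bytes, labelled as US-ASCII.
tagged = str.dup.force_encoding(Encoding::US_ASCII)

tagged.encoding            # Encoding::US_ASCII
tagged.bytes == str.bytes  # true: force_encoding never changes the bytes
tagged.valid_encoding?     # false: US-ASCII cannot represent bytes >= 0x80
str.valid_encoding?        # true: any byte sequence is valid in BINARY
```

This is why the spec has to assert `should_not.valid_encoding?`: a US-ASCII label on arbitrary binary input yields a string that Ruby itself considers broken.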

I will create an issue for Ruby to get clarification about the desired behaviour.

@herwinw
Member Author

herwinw commented Jan 16, 2026

Upstream: https://bugs.ruby-lang.org/issues/21842

Updated the specs to match the current status of the upstream bug report,
which is that all returned strings are binary. This might change to either
ASCII-7BIT or BINARY.

The current specs pass with the latest upstream version of Ruby 4.1
(commit 3e13b7d4ef)
@herwinw
Member Author

herwinw commented Jan 17, 2026

Updated the specs to match the behaviour of ruby/ruby#15888.

This might not be the final version: ruby/ruby#15894 proposes returning US-ASCII if everything is ASCII compatible, and BINARY otherwise.

@herwinw
Member Author

herwinw commented Jan 17, 2026

Updated again to match the new behaviour of ruby/ruby#15894

If the whole string is valid ASCII, the result has US-ASCII encoding.
Otherwise, it is BINARY.

The current specs pass with the latest upstream version of Ruby 4.1
(commit 78b7646bdb)
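
As a rough model of the rule described in this comment, the expected encoding can be predicted from the raw bytes alone. The helper `expected_interned_encoding` is hypothetical (it is not part of Ruby or ruby/spec) and only restates the rule as written here. It makes no claim about how MRI implements it.

```ruby
# Hypothetical helper: predicts the result encoding from the raw bytes,
# following the rule "US-ASCII if everything is ASCII, BINARY otherwise".
def expected_interned_encoding(bytes)
  # On a binary string, ascii_only? is true exactly when every byte is < 0x80.
  bytes.b.ascii_only? ? Encoding::US_ASCII : Encoding::BINARY
end

expected_interned_encoding("foo")                # Encoding::US_ASCII
expected_interned_encoding("foo\x81bar\x82baz")  # Encoding::BINARY
```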
end
end

it "returns the same frozen strings for different encodings" do
Member

This description/test is confusing because that function ignores the encoding: it only receives a char* and a length.
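
A small plain-Ruby illustration of the point, assuming the spec helper passes the string's pointer and byte length straight through to the C function: two strings that differ only in their encoding label are indistinguishable at that level. The variable names are illustrative.

```ruby
utf8  = "foo".dup.force_encoding(Encoding::UTF_8)
ascii = "foo".dup.force_encoding(Encoding::US_ASCII)

utf8.encoding == ascii.encoding  # false: the Ruby-level labels differ

# A (char *, long) pair carries only the bytes and their count,
# and those are identical for both strings.
[utf8.bytes, utf8.bytesize] == [ascii.bytes, ascii.bytesize]  # true
```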

end
end

it "returns the same frozen strings for different encodings" do
Member

Same

@eregon eregon marked this pull request as ready for review January 17, 2026 09:30
Member

@eregon eregon left a comment

Looks great, just 2 comments
