Skip to content

Formalize the Vortex type system#29

Merged
gatesn merged 19 commits intodevelopfrom
ct/types
Mar 13, 2026
Merged

Formalize the Vortex type system#29
gatesn merged 19 commits intodevelopfrom
ct/types

Conversation

@connortsui20
Copy link
Contributor

@connortsui20 connortsui20 commented Mar 6, 2026

Rendered

I wanted to write this for 2 reasons, the first being that we do not have a formalized definition of the Vortex type system. Note that I'm not saying we don't understand how it works (I think all of us intuitively understand it), but I thought it would be good to map it to actual theory.

Copy link

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this RFC, it's very useful to formalize the type system a little to help motivate changes/design to the type system.

I think that it would also be useful to spell out the motivation for the existence of separate concepts in the type system in order to inform the decision framework. These are things that we can probably internally/intuitively articulate but again I think it's helpful to spell it out. Specifically:

  1. Why do we define DTypes as logical types separately from physical encodings?
  2. Why do we have the concept of canonical physical representations? What's the goal?
  3. What is the goal of extenstion types? How are they different from first-class dtypes?

Other than that, I think I mostly agree with the RFC. The conclusion I take away from the FSB discussion is that FSB should be part of the possible canonicalization targets of the Binary DType. Similarly, FixedSizeList should not be its own DType and another canonicalization target of the List DType.

One other thing I'm curious about which might be good to add to the RFC is "what amount of gating is required for a data type to be considered an extension type rather than a first-class dtype". Every type could essentially be sugar on a bytes type.

@connortsui20
Copy link
Contributor Author

connortsui20 commented Mar 9, 2026

Thoughts on me splitting this RFC into 2 RFCs? The first can just be the formalization and the second can be the other proposal.

Edit: I am going to split this RFC.

@connortsui20 connortsui20 changed the title Formalize the Vortex type system + add CanonicalTarget Formalize the Vortex type system Mar 9, 2026
@connortsui20
Copy link
Contributor Author

After some offline discussion I'm going to completely pull out the second part of this RFC as we need to better understand how execute should work before we think about execution targets.

@connortsui20
Copy link
Contributor Author

@asubiotto Note that this RFC doesn't make any claims that FixedSizeList shouldn't be a dtype, nor that FixedSizeBinary should be. It only has a framework in which we can think about these things.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@asubiotto
Copy link

Thanks for the changes. This looks good to me.

Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came away from this with the following (not fully formed) thoughts:

  • We should deprecate DType::Utf8 and make it an extension type
  • We should deprecate FSL and make it an extension type
  • Why are primitives types different? Why is u16 not a refinement over the integer type?

@gatesn
Copy link
Contributor

gatesn commented Mar 11, 2026

As a crazy proposal

enum DType {
 Null, => Null
 Bool, => BitBool
 Bytes, => (Fixed width bytes)
 Utf8, => GermanStrings // This is the weird one... why?
 List => ListView
 Struct (Tuple) => Struct
 Union => Union
 Variant => Variant
 Extension => ...
}

Differences:
* Primitive, Decimal => Bytes(n) // Collapse physical encodings for primitive, decimal, FSB, into one array
* Binary => List<u8> // no point storing arbitrary binary data as german strings
* FixedList => List // sizes == constant array. Still want cheap filter which we cannot do now with FSL.
* + Union, Variant

I would also propose adding a selection vector to the fixed width BytesArray. That means all of our canonical encodings now have "views" and can be filtered and shuffled with zero copy.

@asubiotto
Copy link

I came away from this with the following (not fully formed) thoughts:

  • We should deprecate DType::Utf8 and make it an extension type

Interesting. I think this is what I was asking with a previous question: how much gating is required to define something as an extension type vs a core dtype? I think you're asking the same question.

  • We should deprecate FSL and make it an extension type

I agree we should deprecate FSL, but I disagree that it should be a logical type. FSL is just a list with implicit offsets so for me it's a section of the List DType. This is another question: section functions should be fallible.

@connortsui20
Copy link
Contributor Author

connortsui20 commented Mar 12, 2026

@asubiotto there's a combination of things that we would need to figure out before deciding what delineates a core vs extension type, among those things would be semantic and also structural differences.

But there is also a pragmatic reason (as theoretically every type except a few is a refinement type over List<u8>, and a type system that only allows that is not practically useful). I'll be adding more detail on this to the RFC.

Also, just to be pedantic, a FSL is a refinement type of List, with the predicate of having a fixed size. The "section" function has nothing to do with the logical type system and everything to do with the mapping to physical types.

@asubiotto
Copy link

Also, just to be pedantic, a FSL is a refinement type of List, with the predicate of having a fixed size. The "section" function has nothing to do with the logical type system and everything to do with the mapping to physical types.

Hmm, I guess I can see this but when the predicate is a physical invariant then that seems pretty similar to the definition of a "section". On a first read, refinement types were more about operation gating. Or maybe not. You guys have thought more about this than me so happy to whatever conclusion you come to.

@connortsui20
Copy link
Contributor Author

I think that we are leaning towards extension types essentially mimicking refinement types. Specifically, there is a predicate that constrains the domain of values of that type (for Utf8 that would be .is_valid_utf8() and for FSL it would be l.len() == size), which in turn unlocks specific operations. It's similar to a trait bound in Rust!

But I agree that this stuff is getting complicated. There is a certain degree of decision making we need to have when it comes to what defines a core vs extension type, and what I've realized is that this is exactly the same issue of what built-in types programming languages ship! I will be adding that to the RFC soon.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20
Copy link
Contributor Author

@gatesn I have revised the decision framework which (in my mind) now makes a lot more sense. Basically, we want to stay away from creating a bijection between the logical types and Canonical and then from there it is easier to justify adding new logical types.

Then from there, the question is if it should be a core type or an extension type, and I still don't really have a good idea of how to decide that.

@connortsui20 connortsui20 requested a review from gatesn March 12, 2026 16:32
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Comment on lines +18 to +21
variants are justified. A new logical type requires either semantic distinctness (a genuinely
different equivalence class) or a refinement predicate that gates operations unavailable on the
parent type. If justified, a second step determines whether the type belongs in core `DType` or as
an extension type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we have now, but do want that or are we just stuck with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only sort of what we have now, and it has been hard to figure out if we want a FixedSizeBinary type, for example.

@connortsui20
Copy link
Contributor Author

Given that this thread is becoming long and it's only going into proposed/, I think we should just merge this now and we can keep iterating on it if we want to

Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge this as proposed, not yet accepted

@gatesn gatesn merged commit b9907b8 into develop Mar 13, 2026
3 checks passed
@gatesn gatesn deleted the ct/types branch March 13, 2026 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants