8000
Skip to content

Add sequence conversion functions for built-in collection types#3525

Draft
jackfirth wants to merge 6 commits intoracket:masterfrom
jackfirth:sequence-conversion
Draft

Add sequence conversion functions for built-in collection types#3525
jackfirth wants to merge 6 commits intoracket:masterfrom
jackfirth:sequence-conversion

Conversation

@jackfirth
Copy link
Copy Markdown
Contributor

This pull request adds the following functions:

  • sequence->bytes added to racket/bytes
  • sequence->string added to racket/string
  • sequence->vector added to racket/vector
  • sequence->set added to racket/set

Each of these functions is equivalent to the corresponding list-specific version, such as list->vector, except they accept arbitrary single-element sequences instead of lists. The functions include two important optimizations: fast paths for specific kinds of sequences and a fast path for already-converted inputs.

The functions are meant to serve as drop-in replacements for their monomorphic equivalents, with the exception that they always produce immutable values instead of mutable ones. To simplify the APIs of these collection types, all other forms of converting between these collections are soft-deprecated in favor of the unified sequence conversion APIs.

This change is backwards compatible, but it does represent a shift from some of the current Racket API design choices. I think it's a good shift worth making in standard Racket, but I'm open to discussing alternatives.

This pull request is missing tests because I'm not sure where to add them. Pointers to the appropriate test files would be appreciated.

This commit adds the following functions:

- sequence->bytes
- sequence->string
- sequence->vector
- sequence->set

Each of these functions is equivalent to the corresponding list-specific version, such as list->vector, except they accept arbitrary single-element sequences instead of lists.  The functions include two important optimizations: fast paths for specific kinds of sequences and a fast path for already-converted inputs.

The functions are meant to serve as drop-in replacements for their monomorphic equivalents, with the exception that they always produce immutable values instead of mutable ones. To simplify the APIs of these collection types, all other forms of converting between these collections are soft-deprecated in favor of the unified sequence conversion APIs.
@jackfirth jackfirth added the api design Related to the design of core libraries and language features label Nov 29, 2020
@jackfirth jackfirth marked this pull request as draft November 29, 2020 03:07
@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Nov 29, 2020

I like the overall PR, but I disagree with producing immutable values like vector. IIUC, it's generally assumed that vector is mutable (e.g. for/vector), so this interface makes Racket libraries self-inconsistent.

@mflatt
Copy link
Copy Markdown
Member
mflatt commented Nov 29, 2020

A drawback of distributing these functions among racket/string, etc., is that it may substantially increase the footprint of each of those. Currently, the difference between

#lang racket/base
(require racket/bytes)

and

#lang racket/base
(require racket/bytes racket/sequence)

appeaars to be 55MB vs. 69MB with Racket CS on x86_64 according to (current-memory-use). It looks like adding them to racket/sequence would not have that drawback.

@jackfirth
Copy link
Copy Markdown
Contributor Author
jackfirth commented Nov 29, 2020

Adding to racket/sequence makes sense. It also fixes a cyclic dependency problem. Will do that next, except for sequence->set since that one depends on racket/sequence directly already.

@rmculpepper
Copy link
Copy Markdown
Collaborator

I think this would be a reasonable addition to racket/sequence. I also agree with Jack that for new APIs, producing immutable data is preferable as a default.

I object to "prefer" and "soft deprecated". I would be happy with "see also" for links to the new functions.

sequence->bytes and sequence->string raise errors referring to list->bytes and list->string when given bad sequences, like (sequence->string '(#\a #\b 3)).

For the custom errors for bytes and strings, consider phrasing those as normal contract errors with additional explanation. You can do that by inserting "contract violation;\n " at the beginning of the error message. (I think it would also be fine to omit the custom errors, though.)

@jackfirth
Copy link
Copy Markdown
Contributor Author

Update: I was able to remove the dependency on racket/sequence from all of the current modules, making the move to racket/sequence unnecessary.

I object to "prefer" and "soft deprecated". I would be happy with "see also" for links to the new functions.

This change introduces a lot more ways to convert between different collection types, so I think it's important we include some soft guidance on which way to use. How about "consider using" instead of "prefer using"?

sequence->bytes and sequence->string raise errors referring to list->bytes and list->string when given bad sequences, like (sequence->string '(#\a #\b 3)).

What's a good way to fix this without introducing terrible performance? sequence/c would be expensive right? Is there something better that I can do?

For the custom errors for bytes and strings, consider phrasing those as normal contract errors with additional explanation. You can do that by inserting "contract violation;\n " at the beginning of the error message. (I think it would also be fine to omit the custom errors, though.)

Done!

@rmculpepper
Copy link
Copy Markdown
Collaborator

This change introduces a lot more ways to convert between different collection types, so I think it's important we include some soft guidance on which way to use. How about "consider using" instead of "prefer using"?

I think "consider" and "prefer" should have a stated reason, and some reasons are fine for the reference docs and some reasons belong elsewhere. "Consider using sequence->list if you need to handle other inputs in addition to vectors" would probably be fine, because it doesn't push me towards something else if vector->list is already suitable for my program. "Consider using sequence->list because all Racket code should be written as generically as possible" (forgive the hyperbole) would not, IMO.

What's a good way to fix this without introducing terrible performance? sequence/c would be expensive right? Is there something better that I can do?

I think you just have to check the list yourself using (andmap char? _) or equivalent. For example:

(define (sequence->string seq)
  (define (bad)
    (raise-argument-error 'sequence->string "(sequence/c char?)" seq))
  (define (handle-list lst)
    (if (andmap char? lst)
        (string->immutable-string (list->string lst))
        (bad)))
  (cond
    ....
    [(list? seq) (handle-list seq)]
    [else (handle-list (sequence->list seq))]))

@gus-massa
Copy link
Copy Markdown
Contributor

In spite I really like the idea of immutable-by-default and do-not-make-unnecessary copies, I think an consistent standard library is more important. All the other functions in the standard libraries make the mutable version, unless immutability is explicit.

Related to racket/rhombus#22

@jackfirth
Copy link
Copy Markdown
Contributor Author
jackfirth commented Dec 15, 2020

I don't think it's fair to call the standard library consistent about this, because none of the contracts or documentation actually say that the returned strings are mutable, and occasionally they aren't. For example, (string-trim "hello" #:left? #false #:right? #false) returns an immutable string. If the documentation for the existing functions promised to return (and/c string? (not/c immutable?)) I might agree with you. But as it is I can't really rely on them returning mutable or immutable either way without checking first. I wouldn't be surprised if there's plenty of other corner cases where the "returned strings, bytes, and vectors are mutable unless stated otherwise" convention is silently broken, since we don't check it with contracts and we don't document it.

All that said, we should not continue this convention. It is a mistake with wide ranging consequences. Backwards compatibility is a good reason not to change existing functions, but I don't think consistency with an undocumented, unchecked convention is a good reason to perpetuate this mistake in new APIs. We should instead write new APIs that allow both mutable and immutable inputs but which produce immutable outputs by default, and we should document that behavior.

Instead of recommending the generic conversion functions, we just include "see also" links. The generic functions already document that there shouldn't be a performance reason to prefer the specialized versions. This avoids pushing people to learn about generic sequences if their code already works fine without generic sequences.
@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Dec 15, 2020

none of the contracts or documentation actually say that the returned strings are mutable

That might be fair for strings. What about vectors?

we should not continue this convention

Why do I feel like we are repeating the discussion in #3076... I definitely agree that immutability is a good idea. But they are inconsistent with the current convention, which could cause confusion to users. In Rhombus, we should definitely make immutability the default. But in Racket, I hope that they should follow the convention, and if you wish to add the immutable variants, perhaps suffix them with * or something like that (e.g. sequence->list*, sequence->vector*).

A0DD
@jackfirth
Copy link
Copy Markdown
Contributor Author

Hmm. I agree it's confusing when users can't tell which functions return immutable data and which don't. In line with the existing symbol->immutable-string and keyword->immutable-string functions, what do folks think about using these names instead:

  • sequence->immutable-bytes
  • sequence->immutable-string
  • sequence->immutable-vector
  • sequence->set (sets are kind of an oddball since the existing racket/set API produces immutable sets everywhere by default already)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api design Related to the design of core libraries and language features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

0