8000
Skip to content

add best+key+index, best, best-key and best-index#3076

Draft
Metaxal wants to merge 4 commits intoracket:masterfrom
Metaxal:best
Draft

add best+key+index, best, best-key and best-index#3076
Metaxal wants to merge 4 commits intoracket:masterfrom
Metaxal:best

Conversation

@Metaxal
Copy link
Copy Markdown
Contributor
@Metaxal Metaxal commented Mar 21, 2020

After a discussion on Slack, it appears that a few people (myself included) often have a need for a more flexible version of argmin.
This PR introduces a new function best+key+index that has a similar interface to sort and returns 3 values. Three additional functions are provided for convenience, which return only one of the three values. Examples from the docs:

> (best       '((3 "pears") (1 "banana") (2 "apples")) string<? #:key second)
'(2 "apples")

> (best-key   '((3 "pears") (1 "banana") (2 "apples")) string<? #:key second)
"apples"

> (best       '((3 "pears") (1 "banana") (2 "apples")) > #:key first)
'(3 "pears")

> (best-index '((3 "pears") (1 "banana") (2 "apples")) > #:key first)
0

> (best+key+index '(15 30 10 20 10) < #:key -)
30
-30
1

If this PR is accepted, I will make another one for vector, which was my original intention.

Tests are added, docs are updated and all tests pass AFAICT.

@rmculpepper
Copy link
Copy Markdown
Collaborator

It looks like the collections-lib package already has find-best that generalizes this PR's best. Would it make sense to add a find-best+index there instead?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 22, 2020

I would prefer to avoid this option because collections-lib is not a built-in package, which comes with several issue:

  • discoverability (I didn't know this package existed and provided find-best). And users (such as me) often wonder why such a useful function doesn't exist as a core function.
  • reliability questions (since it's not a core package, the user does not know how much it can trust this package to maintain backward compatibility, provide good space and time performance, have accurate documentation, etc.)
  • collections-lib has many dependencies (curly-fn-lib,  functional-lib,  match-plus,  static-rename,  unstable-list-lib, namespaced-transformer-lib, static-rename-doc, static-rename-lib) which is a hindrance just for one function.

Furthermore, a stress test suggests that find-best has substantial performance issues:

best
cpu time: 75 real time: 75 gc time: 0
10
data/collection find-best
cpu time: 6436 real time: 6436 gc time: 254
10

Code:

#lang racket
(require (prefix-in c: data/collection))


(define N 1000000)
(define l (range 1 (+ N 1)))

(displayln "best")
(collect-garbage)
(define n 0)
(time
 (for ([i (in-range 10)])
   (set! n (+ n (best l <)))))
(displayln n)

(displayln "data/collections find-best")
(collect-garbage)
(set! n 0)
(time
 (for ([i (in-range 10)])
   (set! n (+ n (c:find-best l <)))))
(displayln n)

@mfelleisen
Copy link
Copy Markdown
Collaborator

"discoverability"

As much as I agree -- having written these functions myself more than once -- I also think that repeating functionality across packages is less than ideal. I'd rather see Ryan's package incorporated into the core.

@rmculpepper
Copy link
Copy Markdown
Collaborator

collections-lib is Alexis's package, not mine.

I suspect the performance problem can be fixed by adding specialized forms for lists and vectors and by weakening the contract slightly (turning the get-key and compare contracts into flat arity checks).

I agree with @Metaxal about dependencies; that's more than I would want to transitively depend on. I think that's fixable too, but that would depend on Alexis.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 22, 2020

@rmculpepper adding specialized forms is precisely what this PR (and the next) is doing. I'm very happy for find-best in collections-lib to rely on best and the coming vector-best.

@mfelleisen I dislike code duplication too—and I like what data/collection does. But I also dislike slow functions when I know how to make them faster. So if we go with find-best and it remains slower by more than a factor 2, I'm just going to end up moving my proposed implementation to bazaar, and since other people will be justifiably reluctant to use my poor library they will just write their own and so on. That would be a lot of unnecessary code duplication.

@jackfirth
Copy link
Copy Markdown
Contributor

@rmculpepper adding specialized forms is precisely what this PR (and the next) is doing. I'm very happy for find-best in collections-lib to rely on best and the coming vector-best.

This PR adds a second, alternative API, it doesn't add a fast path to a single unified API. That's a big difference. Keeping specialized implementations of an API hidden from the user is a valuable thing to do.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 24, 2020

@jackfirth That remark would be true if the first API was part of the core library, but currently it's not. Also, racket built-ins are all using such type-specific functions (such as argmax and vector-argmax and many others), so I don't think I'm going against the flow here.

If really there is so much reluctance (which I find astonishing), would you all prefer I write a separate package with list-best, vector-best and then a unified best (+friends) where the latter chooses the correct implementation based on the type? That wouldn't solve the discoverability issue though. I'm also unsure if having a package for just a couple of functions is user-friendly. What do you think?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 28, 2020

Hi, so what do you propose?

It's okay if you reject the PR, but I'd prefer if it doesn't stay on hold indefinitely so I can at least put it somewhere.

It's also okay if you want to request more from this PR before being accepted. Maybe you prefer a single best + friends that works for sequences in general (using for under the hood), possibly specialized for cases like list or vector if that makes sense.

Or something else?

@jackfirth
Copy link
Copy Markdown
Contributor

(Haven't gotten around to replying until now, due to COVID-19 and some other stuff I had to put Racket on the back burner for a while.)

@Metaxal I think this could be useful, so no rejection from me. Here are my thoughts:

  • I strongly agree that a separate package wouldn't be discoverable, which would make the feature far less useful. So putting it in the main distribution makes sense to me.

  • I think a unified API that accepts any sequence? and internally dispatches to hidden fast paths for common sequence types is a good approach. Prefixing the name with sequence- and putting it in racket/sequence would make it fit in nicely with the other generic sequence functions.

  • Adding a type-specific copy of every sequence processing API is not something we want to do in the long term, so I recommend against adding list-best, vector-best, etc.

  • Returning multiple values is less user-friendly than returning a transparent struct. Multiple return values are a good idea when optimizing an already-fast function that's called many times in a hot loop, but in all other cases the pain of using them vastly outweighs their marginal performance benefit. It would be especially nice if the struct included the field names when printed, like this:

> (sequence-best (list 15 30 10 20 10) < #:key -)
(selection #:element 30 #:key -30 #:position 1)
  • I don't think all of the single-return versions are worth adding. They don't bring enough value on their own. I think instead I'd rather just have two functions, where one returns the element and one returns a selection with the element, the key, and the position. Since we don't have sequence-max or sequence-argmax, we could call them sequence-max and sequence-select-max. This would give us a good way to get around the awkward behavior of argmax where it lets you specify a key function, but not a custom comparison relation. So something like this:
(sequence-max [seq (sequence/c any/c)]
              [less-than? (-> any/c any/c boolean?) <] ; default to < for argmax behavior
              [#:key key-function (-> any/c any/c) values]) -> any/c

> (sequence-max (list 1 2 3 4 5))
5
> (sequence-max (list 1 2 3 4 5) >)
1
> (sequence-max (list 1 2 3 4 5) #:key -)
1

(sequence-select-max [seq (sequence/c any/c)]
                     [less-than? (-> any/c any/c boolean?) <] ; default to < for argmax behavior
                     [#:key key-function (-> any/c any/c) values]) -> selection?

> (sequence-select-max (list 1 2 3 4 5))
(selection #:element 5 #:key 5 #:position 4)
> (sequence-select-max (list 1 2 3 4 5) >)
(selection #:element 1 #:key 1 #:position 0)
> (sequence-select-max (list 1 2 3 4 5) #:key -)
(selection #:element 1 #:key -1 #:position 0)

What do you think?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 29, 2020

Thanks a lot for your comments @jackfirth, I have a clearer view of the situation now.

  • Okay for racket/sequence, possibly with type-specific implementations that are not exported.
  • I think it's important to mention in pairs.scrbl that there are other functions that can be used for lists. Otherwise the users may just assume what they're looking for doesn't exist.
  • Okay for returning a struct, and I'll try printing with useful info
  • Okay for just 2 functions, although I worry a little that many users will just implement the others themselves.
  • I would very much prefer sequence-best over sequence-max. I find usage of the first much clearer and in line with sort. sequence-max leads to some confusing scenarios such as your example when you need to call sequence-max with > (number-greater?) to find the minimum. Furthermore, max implies that elements have an associated numeric value (returned by extract-key) and are always compared with the numeric <. By contrast, best implies that there is a ranking of the elements according to some comparator—to find the minimum, use < (number-less?). This is unambiguous. Also, having a sequence-max would most likely lead the user to believe that there is a sequence-min (possibly to the point that the latter will be added in the future).

How does this sound?

Side question:
It is recommended that for forms are used with in-list, in-vector and such, thus turning the user's code from sequence-generic to type-specific. That seems to go contrary to the generic view considered in this thread. Comments on this?

@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Mar 29, 2020

@jackfirth

Returning multiple values is less user-friendly than returning a transparent struct.

Was there any function that returns a transparent struct in the way you described? If not, returning values might be better for the sake of consistency of API.

@samth
Copy link
Copy Markdown
Member
samth commented Mar 29, 2020

I mostly disagree with @jackfirth about the design here. I wouldn't want to switch to a generic API without a more wholesale move to that in Racket (which is one of the goals for Rhombus). Returning multiple values is, as @sorawee says, more idiomatic than a new struct. And the single-result versions seem quite useful to me.

@jackfirth
Copy link
Copy Markdown
Contributor
jackfirth commented Mar 29, 2020

Exposing list and vector based versions for the sake of consistency with current Racket style is reasonable, but please don't omit the sequence version. There's little point to having a generic sequence interface at all if there's no way to port type-specialized code to use it.

I don't think multiple values is an idiom we want to keep introducing in new Racket APIs. At the very least, doing so should require more justification than merely being consistent with existing APIs. They are awkward to use, hard to document, give confusing error messages, don't compose well with many other Racket constructs (e.g. how do you put them in a data structure?), and make code harder to read than it would be with named field accessors. Their downsides are significant. The only thing they've got going for them is avoiding allocation, and I don't think that matters enough in this case.

Rhombus could improve the situation on both these points, but I don't think that means we should avoid improving new Racket APIs. We should be less radical, but we shouldn't avoid trying anything new at all.

@jackfirth
Copy link
Copy Markdown
Contributor

Also, having a sequence-max would most likely lead the user to believe that there is a sequence-min.

@Metaxal Would sequence-best lead a user to believe there is a sequence-worst? (Serious question, not rhetorical.)

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Mar 30, 2020

@jackfirth Put it this way: If I was teaching again in front of 15 pairs of first-year students and they add to find the minimum of a list knowing that either sequence-max or sequence-best exist, I would expect the question about sequence-min for about a third to a half of them (and maybe a couple of others just looking up the docs and realizing that it doesn't exist but that they don't need it either), and I would expect the question about sequence-worst for at most one pair.

Actually, sequence-most would be even more accurate a name, but I think it would be harder to discover and less intuitive.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 4, 2020

Hey everyone, so what's the next step then?

@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Apr 4, 2020

Let me also offer you (yet) another name: extreme. That being said, I also think the debate on the name is dragged on for too long now.

Re: values vs struct, I agree with Sam here. As a person who initiates racket/rhombus#78, I understand the pain of values well, but a chimera library where a part uses a convention and another uses a different convention is simply worse in my opinion. Users need to remember which part follows this or that convention... And the ugliness... (granted, this is subjective). It's not worth it.

IMO, what you can do (and in fact, what you already did) is to make a package that is a wrapper library with a better interface. It won't help with discoverability of your library, unfortunately, but the consistency of the Racket library won't be compromised.

@rmculpepper
Copy link
Copy Markdown
Collaborator

My accumulated two cents:

I'm satisfied that it's reasonable to add this to racket/list rather than collections-lib. I think it's fine to focus on the list version in this PR and leave sequences for later.

To me, the best-key and best-index functions don't sound useful enough to justify adding them.

I personally don't find the word "best" intuitive to describe these functions. It conflicts with the following two intuitions I do have: (1) a partial (or total) order is usually defined by its <= relation; and (2) the "best" numeric score is usually the greatest, not the least. But (best nums <=) returns the least element in nums, IIUC.

It should be straightforward to extend argmin and argmax with optional comparator arguments. The idea of setting the partial/total order via a comparator (interpreted as that order's <= relation) and determining which end I want by using argmin vs argmax is completely intuitive to me.

So I think my preference would be to extend argmin and argmax with optional arguments for the order and add two new functions: argmin+index and argmax+index. A user who wants the key can apply the key function to the first result; a user who only wants the index can catch both values and throw away the first one.

I also don't think discoverability is a good argument for the name "best". I think it would be more useful to write a little example section and index words like "minimum", "minimize", "maximum", "maximize", "extreme", etc.

On the other hand, I'm not the main prospective user of these functions. Feel free to argue the points.

The implementation has a leftover 'sort symbol in one of the error cases.

Copy link
Copy Markdown
Contributor
@bennn bennn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should merge this

EDIT: oops didn't see Ryan's comment. I'd also prefer generalized argmin, but this PR still LGTM

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 4, 2020

@sorawee extreme is a good suggestion and conveys the right information, but it's probably not very intuitive for the average user; find-extreme could be a little better though. extremum would be actually fit better, but it's probably too alien for many (e.g., young) users. (I kind of like it though.)
[Edit: The issue with extremum is that it doesn't say in which direction the comparator should be used.]

@rmculpepper over the years, I've been redefining this function over and over, and the least confusing names I tried are best and find-best. I find it very intuitive because the comparator can directly be read as a predicate: (best lst better?). Then in (best lst <) with better? = <, one can read a < b means that a is better than b. (Looks like Alexis had the same idea in collections-lib.) But if it's confusing for others, then maybe we need a different name.

Regarding numeric scores, there are still many cases where best is least (cost functions as in A*, number of mistakes, number of cards remaining in your hand at UNO, how much debt you are in, how much weight you managed to lose before the summer, etc.), but that's not true for max.

better? is also better suited for more complex comparisons, such as:

(define (better? s1 s2)
  (or (< (height s1)
         (height s2))
      (and (= (height s1)
              (height s2))
           (> (width s1)
              (width s2)))))

Here the objects can't be compared with a single scalar, so max or min don't make sense.

Also, fixing two functions to make them do the same thing but in opposite directions when one function would do both jobs just fine sounds wrong to me. That said, if the consensus among core devs is for changing both argmin and argmax, I'll go with that.

(As I said before, I'm okay with discarding best-key and best-index—although I'd prefer to keep them.)

I'm perfectly fine with adding keywords in the docs.

@rfindler
Copy link
Copy Markdown
Member
rfindler commented Apr 4, 2020

Don't know if we're interested in more bells/whistles suggestions but one thing that I run into a bunch with sort is that I have a comparison in mind like the one that combines height and width above, but would like to just tell the library function about height< and width< and which one has priority without having to write that function over and over (I often just call sort twice because it is stable). If that's an easy thing to add, that might be nice. (Or just ignore me and get on with life!)

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 4, 2020

@rfindler This is indeed extremely handy, but it's probably better left as a whole different topic.

I have something for this, called 'comparators', and so does @jackfirth :)
My version of a comparator such as thing<=> merely returns '<, '= or '>, which I find simpler to use and combine (and is probably faster), and indeed I also have a simple macro to chain comparators. It does make my life much easier, and it extends to a lot more. (For example, you can easily define a fast all-different? when provided with a comparator like thing<=> by first sorting according to '< and then compare with '=, which requires O(n log n) instead of O(n^2)).

If there's a desire for that I'm happy to make a proposal, but I also worry this will lead to a very lengthy discussion :) I think it would make a very useful addition in any case.

@rfindler
Copy link
Copy Markdown
Member
rfindler commented Apr 4, 2020

@Metaxal lets definitely not go that route here! I was just thinking of something simple where I could put in a list of comparators and the library would do the right thing (it would assume they were "strict" comparison functions that treat (and (not (< a b)) (not (> a b))) as the definition of an equality relation to tell when it needed to move down to the next comparison in the list).

@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Apr 4, 2020

@rfindler that is indeed useful, but probably should be implemented as a separate combinator?

(define ((combine-comparator f g) x y)
  ;; probably extend to variadic function?
  (cond
    [(f x y) #t]
    [(f y x) #f]
    [(g x y) #t]
    [else #f]))

(define (height< a b)
  (< (car a) (car b)))

(define (width< a b)
  (< (cdr a) (cdr b)))

(define (width<= a b)
  (<= (cdr a) (cdr b)))

(define (height> a b)
  (> (car a) (car b)))

((combine-comparator height< width<) '(1 . 10) '(2 . 10))
; #t 
((combine-comparator height< width<) '(2 . 10) '(1 . 10))
; #f 
((combine-comparator height< width<) '(1 . 10) '(1 . 20))
; #t 
((combine-comparator height< width<) '(1 . 20) '(1 . 20))
; #f 
((combine-comparator height< width<) '(1 . 10) '(1 . 10))
; #f 
((combine-comparator height< width<=) '(1 . 10) '(1 . 10))
; #t 
((combine-comparator height> width<) '(2 . 10) '(1 . 10))
; #t 

It then can be used with both sort and this new function.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 4, 2020

@sorawee Let's continue this discussion on a different place maybe (slack?)

@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Apr 4, 2020

Agreed!

@pmatos
Copy link
Copy Markdown
Collaborator
pmatos commented Apr 22, 2020

ping! Can we look at this again so it is not forgotten? There's an approved review by @bennn but no merge.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 22, 2020

@mfelleisen @rmculpepper @samth @jackfirth @sorawee Here's a more 'generic' implementation based on for/fold with some instantiations for list, vector, dict and can easily be extended to many more.
Edit: we don't need to have 'many', but at least we can easily choose which one we want. We also don't need to export define-find-best, only the derived procedures.

It's only a first sketch though and an actual implementation would use syntax-parse and would have more checks.

Is it worth pursuing this road instead of the original PR?

#lang racket

(define no-result (gensym))

;; 'Generic' macro 
(define-syntax-rule (define-find-best name elts? elts idx elt idx-elt-binder ...)
  (define (name elts better? #:value [extract-val values])
    (unless (elts? elts)
      (raise-argument-error 'name (symbol->string 'elts?) elts))
    (for/fold ([best-elt #f]
               [best-idx no-result]
               [best-val #f])
              (idx-elt-binder
               ...)
      (define val (extract-val elt))
      (if (or (eq? best-idx no-result)
              (better? val best-val))
          (values elt idx val)
          (values best-elt best-idx best-val)))))

;;; Instantiations for some types

(define-find-best vector-best+index+value vector? elts idx elt 
                [idx (in-naturals)]
                [elt (in-vector e
992E
lts)])

(define-find-best best+index+value list? elts idx elt 
                [idx (in-naturals)]
                [elt (in-list elts)])

(define-find-best dict-best+key+value dict? elts idx elt 
                [(idx elt) (in-dict elts)])

; and some more?

;::::::::::::::;
;:: Examples ::;
;::::::::::::::;

(best+index+value '(1 2 3 2 8 4) >)

(define vec (vector "a" "aa" "bcb" "b" "bbbb" "aaaa"))

(vector-best+index+value vec < #:value string-length)
(vector-best+index+value vec >= #:value string-length)

(dict-best+key+value vec >= #:value string-length) ; same as vector-best

(define dic '((a . 5) (b . 2) (c . 3)))
(dict-best+key+value dic <)

(vector-best+index+value dic <) ; error with good message

Edit: val -> value; use key instead of index for dict

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 23, 2020

Forgot to say that I changed from key to value here because things got really confusing with the key in dictionaries.

@sorawee
Copy link
Copy Markdown
Collaborator
sorawee commented Apr 25, 2020

Huhh, I don't know how to describe this, but it feels kinda awkward to provide the macro make-find-best to users. It doesn't fit the rest of Racket library... Does that mean that we will have make-member, etc. in the future?

Also, it should be named define-find-best rather than make-find-best. make- is usually a procedure.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 25, 2020

@sorawee All good points. Indeed, I was thinking of exposing only the generated xxx-best that make sense, but I wouldn't be against exposing the definer too, but only if it becomes clean enough in the end.

My point was more that this new approach would avoid most code duplication.
Edit: I updated the post above to reflect @sorawee 's points.

@jackfirth
Copy link
Copy Markdown
Contributor

I think providing the list, vector, and sequence versions makes sense.

For dictionaries, I don't think we want to conflate "keys" and "indices" as being the same thing. Dictionary keys are usually unordered and sparse. If a user wanted to apply the -best functions to dictionaries, I think the onus is on them to convert the dictionary to a sequence with in-dict-keys, in-dict-values, or in-dict-pairs.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 27, 2020

@jackfirth What about this:
list: best+index+value
vector: vector-best+index+value
dict: dict-best+key+value
...

where value is the extracted value (previously key, as in sort).

@jackfirth
Copy link
Copy Markdown
Contributor

The dict version's expected semantics aren't clear to me. Does dict-best-key+value compare keys, values, or both?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 27, 2020

Keys in dicts play the same role as indices in vectors or lists. So better? compares values, not keys nor indices.

This is departing from sort which uses a #:key argument, which I found very confusing to deal with when I suddenly needed a dict-best variant.

(I just cleaned the code above a little.)

@jackfirth
Copy link
Copy Markdown
Contributor

I've encountered cases where I need the value corresponding to the max or min key of a dictionary-like collection. I don't think we should pick between comparing keys and comparing values, they both have many valid use cases.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 27, 2020

Then what do you think of having the value procedure for dict-best+key+value take 2 arguments (the key and the element) and by default returns the second one, i.e., the element?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented Apr 29, 2020

@jackfirth I'm also okay with not providing the dict form in this PR if we can't get a consensus of course.

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented May 2, 2020

Ok, at this point it may just be best for me to make another PR with the list, vector and maybe sequence or dict variants to start afresh.

I don't have much free time so it may take a while, sorry about that.

@nagaflokhu
Copy link
Copy Markdown

Do we ever expect to use define-find-best with more than two idx-elt-binders? Aren't we always expecting an idx-binder and an elt-binder?

@Metaxal
Copy link
Copy Markdown
Contributor Author
Metaxal commented May 10, 2020

it's either 1 (like for dict) or 2 I guess.

@samth samth self-assigned this Jul 7, 2020
@Metaxal Metaxal marked this pull request as draft August 3, 2020 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants

0