Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Correct.


Assume I have a table "Actors" with a column "age" and for some of the records the age is not set (an empty set). Does this mean that "SELECT SUM(age) FROM Actors;" gives "{}" or do you implement a special logic for empty-set summation when used in connection with aggregation (like SQL does)?


Aggregate functions in EdgeDB have an "initial value", which, for `sum()` is defined as zero. Other aggregates, like `avg()` are not defined for empty sets (you cannot divide by 0), so an error is thrown in this case.


I am still a bit unclear about this. Assume we have three actor records where two records have age=30 and for the remaining one the age is not set. From what I understand then "SELECT sum(age) FROM Actor;" returns "60" while "SELECT 30 + 30 + {}" returns "{}". This appears to be an inconsistent handling of empty sets (thought it would be the same as in SQL).


The difference is that `+` is defined as a strict function (returns empty on empty input): plus(a, b), whereas sum() is an aggregate that is specifically defined as 0 on empty input.


The sum of an empty set is, in fact, 0 (the identity for addition).

The generalized conjuction (we have a function called "all" for that) of an empty set is True (the identity for conjunction).

The generalized disjuction (we have a function called "any" for that) of an empty set is False (the identity for disjunction).

All of the above "sum", "all", and "any" are basically aggregate functions that operate on sets as a whole.

There is no special logic that you wouldn't get from considering these operations generalized for a set.


Thanks, that was helpful. However, I still think that having "sum(1,1,{})" returning "2" and "1+1+{}" returning "{}" can be viewed as somewhat inconsistent.


Let me make a tiny correction to the expression you wrote:

"sum({1, 1, {}})" - the function sum takes only one argument and it's a set. Because we flatten all "nested" sets, the expression "{1, 1, {}}" is equivalent to "{1} UNION {1} UNION {}".

The expression "1 + 1 + {}" albeit valid grammatically, can be equivalently re-written as "{1} + {1} +{}". At this point it should be far more obvious why "sum({1} UNION {1} UNION {})" is not the same as "{1} + {1} + {}".

Literals may be a little confusing because they look like elements, but they are still sets, singleton sets, specifically. There's practical value in simply thinking about "a bunch of things: A, B, C", where each of the A, B and C can themselves be empty, a single thing, or a bunch of things while ignoring nesting. In our case we allow duplication in these bunches (which is not part of the bunch theory: http://www.cs.toronto.edu/~hehner/bunch.pdf). However, because most people are familiar with sets we find it easier to keep using the terms "set" and "multi-set" (and stipulate that they are flattened) in explanations.

In general, the way the operator "+" works is this: A + B = {a + b : for all a in A, for all b in B}. Whereas the expression "{A, B}" is defined to be equivalent to "A UNION B".


Good explanation! (which indicates why the handling of empty sets is sometimes a bit confusing)

One more question: What was the motivation behind defining "sum({})" to be "0" rather then "{}" ?


Oh, that's simple: sum(A UNION B) should be the same as sum(A) + sum(B) for any two sets A and B (or else there would be very weird inconsistencies).

sum(A UNION {}) = sum(A) + sum({})

sum(A) = sum(A) + sum({})

0 = sum({})

Typically for any operation generalized for a set the result of op({}) should be equal to the identity for that operation (0 for sum, 1 for product, True for AND, False for OR, etc.). It's always such a value I that for any other value A, A op I = A.


Again, a very nice explanation. However, from a practical point, I see the following problem: Assume that "Select sum(amount) FROM Payments;" calculates the balance for a customer account, and assume further that for some reason (e.g. a programming error) the amount column for that customer has been filled with "{}", then the above query would still return the well defined result "0" which might indicate that everything is correct (while it is no).

This would not happen if "sum({a, b})" was defined as "{a} + {b}" (which is what a user would intuitively assume). However, this definition is also not very practical as any one occurence of {} in the sum would render the whole thing to {} (which is not what a user would expect).

I guess the handling of "{}" will always stay a bit tricky.


This type of error is better remedied by making the balance property be required (so that it cannot be set to {} and produce an exception at the time the error is introduced). This way you will know about the error early enough. The point is that if an empty value is NOT valid then forbidding it at schema level is the best solution. So required keyword is going to do that for you.

Alternatively, if making the property required is not possible due to some workflow constraints, you could do "SELECT Payment{customer} FILTER NOT EXISTS .balance" to find all payments (and the associated customer), which don't have any balance set. Then once you know what they are you might use "UPDATE" to fix the problem.

Empty sets have fairly well-defined and consistent behavior w.r.t. functions (and operators). You learn it once and it applies in all contexts - specifically that empty sets are just sets like any other.


Hhm, this time your answer doesn't convince me. Setting "required" is not always possible, and using UPDATE to set a value works only as an ex-post solution, i.e. you must already know that there is an error. But this was exactly the point of my hypothetical example: the balance-query returns 0 which is a perfectly legal account balance and there is no reason to suspect an error in the first place.

So, I still think that the definition "sum({}) = 0" has more potential to hide errors than the definition "sum({}) = {}" would have.


A few things come to mind:

1) It's a little unlikely that at the same time you have data where an {} is an error, but you are not making the property required AND with knowledge of that you still don't bother with other validation approaches. The point is that if you're aware that this property is potentially incorrect, you would want to check or restrict it. But yes, if this situation is completely unexpected, then "sum" won't notice any issues.

2) You have to remember that {} can arise from perfectly normal operations, such as filtering. So if you filter by a certain date range and there's no Payments there, then "(SELECT Payment FILTER .timestamp > <datetime>$date).amount" expression becomes {} even if the "amount" property itself is required. So doing a sum over it is simply a question of "What's the total amount in payments since $date?" and if there aren't any payments, the answer is 0, not {}. Plus what you certainly don't want is to get an {} from the "sum" here and do "{} + 100" to add some other charge (perhaps a sum from a different account) and still end up with {}, which now creates an error in a situation where there's nothing wrong with the data.

3) Rather than imbuing {} with special meaning to signal errors, a separate property would be more appropriate. Such as a boolean "valid" flag that gets set after the record checks out or can even be a dynamically computable expression that looks at the record (say, Payments from our example) and does something like "valid := EXISTS .amount". Then you'd filter things by the valid property before feeding them to "sum" like so:

  SELECT sum( (SELECT Payment FILTER .valid).amount );


Thanks for your thoughts which are all good and reasonable.

Over the past years I've seen many programmers struggle with the combination of "data aggregation" and NULL. In the beginning I thought that these programmers should just "RTFM", but as this problem occurs so often it might very well be that the practical implementation of aggregation and NULL is "a bit off".

I like your "set" approach a lot, however, we still have "sum({1, 1, {}})" unequal to "1+1+{}" and "sum({}) = 0" which, while consistent, I think are somewhat counterintuitive and I am pretty sure will lead to misunderstanding.

Having said that, I suspect that any decent and consistent approach to this problem is subject to a very reduced form of John Lydgate's famous quote, that is "You can only please some of the programmers some of the time".

Many thanks for your time and this interesting and insightful conversation.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: