> Overusing DISTINCT to “Fix” Duplicates Any time I see DISTINCT in a query I im...

sigwinch28 · 2025-10-18T15:53:00 1760802780

Or it’s simply an indicator of a schema that has not been excessively normalised (why create an addresses_cities table just to ensure no duplicate cities are ever written to the addresses table?)

echelon · 2025-10-18T20:35:12 1760819712

DISTINCT, as well as the other aggregation functions, are fantastic for offline analytics queries. I find a lot of use for them in reporting, non-production code.

valiant55 · 2025-10-18T17:58:36 1760810316

It depends when you see it, but I agree that DISTINCT shouldn't be used in production. If I'm writing a one-off query and DISTINCT gets me over the finish line, sparing me a few minutes, then that's fine.

viraptor · 2025-10-19T09:15:12 1760865312

Which categories did the user post in? Which projects did the user interact with in the last week? That's all normal DISTINCT usage.

ndsipa_pomu · 2025-10-19T09:45:18 1760867118

There's nothing wrong with using DISTINCT correctly and it does belong in production. The author is complaining about developers that just put in DISTINCT as a matter of course rather than using it appropriately.

ndsipa_pomu · 2025-10-19T09:43:51 1760867031

One reason to have excessively normalised tables would be to ensure consistency so that you don't have to worry about various records with "London", "LONDON", "lindon" etc.

sgarland · 2025-10-18T22:33:24 1760826804

Because a city/region/state can be uniquely identified with a postal code (hell, in Ireland, the entire address is encapsulated in the postal code), but the reverse is not true.

At scale, repeated low-cardinality columns matter a great deal.

virissimo · 2025-10-19T00:32:18 1760833938

There are ZIP codes that overlap a city and also an unincorporated area. Furthermore, there are zip codes that overlap different states. A data model that renders these unrepresentable may come back to bite you.

Breza · 2025-10-27T18:45:06 1761590706

This assumption got me in trouble as a junior analyst years ago. I was asked to analyze our customer base and wrote something like the below. Management congratulated me on finding thousands more customers than we'd ever had before.

SELECT zipcode.rural_urban_code, COUNT(*) AS n_customer FROM customer INNER JOIN zipcode USING(zipcode) GROUP BY 1;
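A minimal sketch of how a query like that can overcount, assuming (hypothetically) that the zipcode table holds more than one row per ZIP code, for example one row per state the ZIP touches. The table contents are invented for illustration, and Python's sqlite3 stands in for whatever database was actually used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, zipcode TEXT);
    -- Hypothetical lookup table: one row per (zipcode, state), so a ZIP code
    -- that crosses a state line appears twice.
    CREATE TABLE zipcode (zipcode TEXT, state TEXT, rural_urban_code TEXT);

    INSERT INTO customer (zipcode) VALUES ('11111'), ('11111'), ('22222');
    INSERT INTO zipcode VALUES
        ('11111', 'AA', 'urban'),
        ('11111', 'BB', 'urban'),   -- same ZIP, second state
        ('22222', 'AA', 'rural');
""")

true_total = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# The join fans out: each customer in 11111 matches two zipcode rows.
rows = conn.execute("""
    SELECT zipcode.rural_urban_code, COUNT(*) AS n_customer
    FROM customer INNER JOIN zipcode USING (zipcode)
    GROUP BY 1
""").fetchall()
joined_total = sum(n for _, n in rows)

# Counting distinct customer ids keeps each customer to one count per group.
safe_rows = conn.execute("""
    SELECT zipcode.rural_urban_code, COUNT(DISTINCT customer.id) AS n_customer
    FROM customer INNER JOIN zipcode USING (zipcode)
    GROUP BY 1
""").fetchall()
safe_total = sum(n for _, n in safe_rows)

print(true_total, joined_total, safe_total)  # 3 5 3
```

Counting distinct ids only helps while every row for a given ZIP carries the same rural_urban_code. If the codes differ, the same customer still lands in more than one group.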

pbnjay · 2025-10-19T00:33:16 1760833996

FYI this is not true in the US. Zip codes identify postal routes not locations

bdangubic · 2025-10-19T01:55:43 1760838943

saying zipcodes uniquely identify city/state/region is like saying John uniquely identifies a human :)

sgarland · 2025-10-19T12:09:19 1760875759

EDIT: TIL that there are cross-state ZIP codes.

lucyjojo · 2025-10-19T03:54:49 1760846089

these kinds of things are almost never true in the real world.

bts89 · 2025-10-18T15:33:39 1760801619

That’s almost always my experience too.

Though fairly recently I learned that even with all the correct joins in place, sometimes adding a DISTINCT within a CTE can dramatically increase performance. I assume there’s some optimizations the query planner can make when it’s been guaranteed record uniqueness.

Breza · 2025-10-27T18:47:30 1761590850

I agree with you. I also find that adding DISTINCT can sometimes make it easier for my colleagues to understand code, especially when I'm using multiple CTEs and it might be easy to miss a one-to-many join.

yxhuvud · 2025-10-19T08:06:58 1760861218

I've seen similar effects when changing a bunch of left outer joins to lateral joins with a limit 1 tacked on. The limit does nothing to the end result, but speeds up the query by a factor of 1000.

dotancohen · 2025-10-18T16:37:40 1760805460

I've been told similar nasty things for adding LIMIT 1 to queries that I expect to return at most a single result, such as querying for an ID. But on large tables (at least in SQLite, MySQL, and maybe Postgres too) the database will continue to search the entire table after the given record was found.

Guillaume86 · 2025-10-18T17:57:40 1760810260

Only if your table is missing a unique index on that column, which it should have to enforce your assumption, so yeah, LIMIT 1 is a code (or schema, in this case) smell.

dotancohen · 2025-10-18T18:57:02 1760813822

IDs are typically a unique primary key. But in my experience, adding LIMIT 1 would on average halve the time taken to retrieve the record.

I'll test again, really the last time I tested that was two decades ago.

EvanAnderson · 2025-10-18T19:40:41 1760816441

That seems like your RDBMS wasn't handling something right there or there wasn't a unique index on the column.

Do you recall what the database server was?

dotancohen · 2025-10-18T21:17:57 1760822277

Yes, I was using Mysql exclusively at the time. I don't recall which version.

I also tested this once years later when doing a Python app with sqlite. Similar result, but admittedly that was not a very big table to begin with.

I am meticulous with my database schemas, and periodically review my indexes and covering indexes. I'm no DBA, but I believe that the database is the only real value a codebase has, other than maybe a novel method here and there. So I put care into designing it properly and testing my assumptions.

Guillaume86 · 2025-10-19T15:20:20 1760887220

You should use the DB's EXPLAIN or equivalent command to print the query plan. LIMIT 1 shouldn't change anything in your case; if it does, you should file an issue, since this is pretty much query optimization 101.
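A rough sketch of that check using Python's sqlite3 and SQLite's `EXPLAIN QUERY PLAN`. The `users` table and its data are made up for illustration. Other databases have their own EXPLAIN syntax and output, so this only shows the idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the UNIQUE constraint gives SQLite an index on email.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("user42@example.com",))
    return [row[3] for row in rows]

plan_without_limit = plan("SELECT * FROM users WHERE email = ?")
plan_with_limit = plan("SELECT * FROM users WHERE email = ? LIMIT 1")

# If both plans are an index SEARCH (not a SCAN), LIMIT 1 has nothing to save.
print(plan_without_limit)
print(plan_with_limit)
```

If the plan shows a full-table SCAN instead of a SEARCH, the index is probably missing or not being used, and that is where LIMIT 1 could make a visible difference.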

viraptor · 2025-10-19T09:17:16 1760865436

That would be a reportable bug. Of a pretty high priority.

buckle8017 · 2025-10-18T19:22:13 1760815333

You are certainly doing something wrong if that's true.

I'm curious, can you demo this?

dotancohen · 2025-10-18T21:21:50 1760822510

I'm curious as well to see if this still holds up. I'll try this week.

giovannibonetti · 2025-10-18T20:55:35 1760820935

I've noticed that LIMIT 1 makes a huge difference when working with LATERAL JOINs in Postgres, even when the WHERE condition has a unique constraint.

sgarland · 2025-10-19T00:36:00 1760834160

If you include an ORDER BY, the DB _may_ continue searching. MySQL (and, I assume, MS SQL Server, since it also can cluster the PK) can stop early in some circumstances.

But if you just have a LIMIT, then no - any RDBMS should stop as soon as it’s reached your requested limit.

dotancohen · 2025-10-19T02:06:13 1760839573

Right, that's why I add it.

fipar · 2025-10-19T03:49:22 1760845762

In mysql, the db will continue reading even if the limit condition has been met, and then anything beyond the limit will be discarded before returning the result.

dotancohen · 2025-10-20T11:13:16 1760958796

Even without an ORDER BY clause?

fipar · 2025-10-21T02:04:23 1761012263

Nope, that does work as expected, unless a filesort is required, good point.

mcv · 2025-10-18T17:51:22 1760809882

It's the exact opposite in Cypher. I'm currently working with some complex data in neo4j, and wondered why my perfectly fine looking queries were so slow, until I remembered to use DISTINCT. It's very easy to get duplicate nodes in your results, especially when you use variable length relationships, and DISTINCT is the only fix I'm aware of.

dleeftink · 2025-10-18T18:44:57 1760813097

Yeah, similarly combining distinct with recursive CTEs in SQL can be the difference between an n×n blowout and a performant graph walk that only visits nodes once.
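A small sketch of that effect, run through Python's sqlite3 on an invented layered graph. Here the deduplication comes from using `UNION` instead of `UNION ALL` in the recursive CTE, not from a literal DISTINCT keyword, and the numbers are specific to this toy graph:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")

# Invented layered graph: node 0 fans out to two nodes, and every node in a
# layer links to both nodes in the next layer, so paths double per layer.
DEPTH = 10
edges = [(0, 1), (0, 2)]
for layer in range(1, DEPTH):
    for src in (2 * layer - 1, 2 * layer):
        for dst in (2 * layer + 1, 2 * layer + 2):
            edges.append((src, dst))
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

def reachable_rows(set_operator):
    """Count rows produced by walking the graph from node 0."""
    sql = f"""
        WITH RECURSIVE reach(node) AS (
            SELECT 0
            {set_operator}
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT COUNT(*) FROM reach
    """
    return conn.execute(sql).fetchone()[0]

rows_per_path = reachable_rows("UNION ALL")  # one row per path from node 0
rows_per_node = reachable_rows("UNION")      # each node kept only once

print(rows_per_path, rows_per_node)  # 2047 21
```

With `UNION ALL` the row count doubles with every layer. With `UNION` it grows by two per layer, and a graph containing cycles would still terminate.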

bandrami · 2025-10-18T15:47:05 1760802425

IDK, "which ZIP codes do we have customers in?" seems like a reasonable thing to want to know

mbb70 · 2025-10-18T15:51:38 1760802698

The very next ask will be "order the zipcodes by number of customers" at which point you'll be back to aggregations, which is where you should have started

wvbdmp · 2025-10-18T16:26:21 1760804781

Anti-Patterns You Should Avoid: overengineering for potential future requirements. Are there real-life cases where you should design with the future in mind? Yes. Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time? Also yes.

RHSeeger · 2025-10-18T16:49:53 1760806193

> Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time

Indeed, along that line, I would say that DISTINCT can be used to convey intent... and doing that in code is important.

- I want to know the zipcodes we have customers in - DISTINCT

- I want to know how many customers we have in each zipcode - aggregates

Can you do the first with the second? Sure.. but the first makes it clear what your goal is.
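To make the two intents concrete, here is a minimal sketch with Python's sqlite3. The `customer` table and its rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, zipcode TEXT)")
conn.executemany(
    "INSERT INTO customer (zipcode) VALUES (?)",
    [("10001",), ("10001",), ("94105",)],
)

# Intent: which zipcodes do we have customers in?
zipcodes = conn.execute(
    "SELECT DISTINCT zipcode FROM customer ORDER BY zipcode"
).fetchall()

# Intent: how many customers do we have in each zipcode?
counts = conn.execute(
    "SELECT zipcode, COUNT(*) FROM customer GROUP BY zipcode ORDER BY zipcode"
).fetchall()

print(zipcodes)  # [('10001',), ('94105',)]
print(counts)    # [('10001', 2), ('94105', 1)]
```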

dleeftink · 2025-10-18T18:51:36 1760813496

Partly in jest, but maybe we need a NON-DISTINCT signaller to convey the inverse and return duplicate values only.

SOMEWHAT-DISTINCT with a fuzzy threshold would also be useful.

RHSeeger · 2025-10-18T21:03:34 1760821414

I hear you. It's not all _that_ uncommon for me to query for "things with more than one instance". Although, to be fair, it's more common for me to do that when grep/sort/uniqing logs on the command line.
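In SQL, the usual way to ask for "things with more than one instance" is GROUP BY with HAVING. A minimal sketch via Python's sqlite3, with a made-up `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)],
)

# Keep only the values that occur more than once, like `sort | uniq -d`.
repeated = conn.execute("""
    SELECT name, COUNT(*) AS n
    FROM events
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY name
""").fetchall()

print(repeated)  # [('a', 3), ('b', 2)]
```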

majormajor · 2025-10-18T17:07:27 1760807247

Here we start to get close to analytics sql vs application sql, and I think that's a whole separate beast itself with different patterns and anti-patterns.

bandrami · 2025-10-18T21:04:12 1760821452

Ah, yeah, you beat me to it. I do reporting, not applications.

sql_nitpicker · 2025-10-18T15:59:45 1760803185

distinct seems like an aggregation to me

kristjansson · 2025-10-18T16:24:03 1760804643

Whole seconds will have been wasted!

bandrami · 2025-10-18T21:02:40 1760821360

I do reporting, not application development. If somebody wants to know different information I'd write a different query.

edoceo · 2025-10-18T16:16:01 1760804161

count(id) group by post_code order by 1

DavidWoof · 2025-10-18T19:50:14 1760817014

In OP's defense, "becoming suspicious" doesn't mean it's always wrong. I would definitely suggest an explaining comment if someone is using DISTINCT in a multi-column query.

ryandv · 2025-10-18T16:25:54 1760804754

Set theory...

There are self-identifying "senior software engineers" that cannot understand what even an XOR is, even after you draw out the entire truth table, all four rows.

BuyMyBitcoins · 2025-10-18T16:48:33 1760806113

I am surprised at how common it is for software engineers to not treat booleans properly. I can’t tell you how many times I've seen ‘if(IsFoo(X) != false)’

It never used to bug me as a junior dev, but once a peer pointed this out it became impossible for me to ignore.

furyofantares · 2025-10-18T19:31:31 1760815891

The most egregious one I saw, I was tracking down a bug and found code like this:

    bool x;

    ...

    if (x == true) {
        DoThing1();
    } else if (x == false) {
        DoThing2();
    }

And of course neither branch was hit, because this is C, and the uninitialized x was neither 0 nor 1, but some other random value.

Rexxar · 2025-10-19T09:34:16 1760866456

Maybe it was initially supposed to be a sort of "3-value boolean" (true/false/undefined) and not a standard bool. You can (rarely) meet this pattern in C++ if you use boost::tribool or in C# if you have a nullable bool. There is probably a similar thing in other languages.

furyofantares · 2025-10-19T17:57:31 1760896651

It was definitely just bad code.

tomjakubowski · 2025-10-18T20:42:49 1760820169

Sometimes this kind of thing happens after a few revisions of code, where in earlier versions the structure of the code made more sense: maybe several conditions which were tested and then, due to changing requirements, they coalesced into something which now reads as nonsense.

When making a code change which touches a lot of places, it's not always obvious to "zoom out" and read the surrounding context to see if the structure of the code can be updated. The developer may be chewing through a grep list of a few dozen locations that need to be changed.

1718627440 · 2025-10-19T20:55:59 1760907359

I think of comparisons as a type conversion to a boolean. You wouldn't convert a boolean, but I like to convert other types, such as integers, explicitly, even when the language rules already specify the same conversion I'm writing out.

munchlax · 2025-10-18T18:45:31 1760813131

People do that? This hurts my brain. if(IsFoo(X)) is clear and readable.

catlifeonmars · 2025-10-18T18:19:29 1760811569

Clearly the correct spelling is

`if(X&IsFooMask != 0)`

:)

hyperman1 · 2025-10-18T17:33:10 1760808790

I've spent a lot of time not seeing how xor is just the 'not equals' operator for booleans.

layer8 · 2025-10-18T17:28:31 1760808511

Or, for a boolean type, that XOR is the same as the inequality operator.
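The whole truth table is small enough to check directly. A quick sketch in Python, where `^` on two bools is XOR:

```python
from itertools import product

# Enumerate all four rows of the truth table for two booleans.
table = [(a, b, a ^ b, a != b) for a, b in product([False, True], repeat=2)]

for a, b, xor, not_equal in table:
    print(a, b, xor, not_equal)

# XOR and inequality agree on every row.
xor_matches_inequality = all(xor == not_equal for _, _, xor, not_equal in table)
print(xor_matches_inequality)  # True
```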

avalys · 2025-10-18T18:38:56 1760812736

Maybe it’s confusing because it’s misnamed?

ryandv · 2025-10-18T18:56:51 1760813811

This is like saying the non-negative integers under addition, lists under append, and strings under concatenation are all just misnamings of the semigroup operator.

https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-...

layer8 · 2025-10-18T21:16:25 1760822185

Is it? Two things are equal exactly when they aren’t exclusive.

catlifeonmars · 2025-10-18T18:23:09 1760811789

XOR is for key splitting.

ryandv · 2025-10-18T16:33:08 1760805188

PostgreSQL's `DISTINCT ON` extension is useful for navigating bitemporal data in which I want, for example, the latest recorded version of an entry, for each day of the year.

There are few other legitimate use cases of the regular `DISTINCT` that I have seen, other than the typical one-off `SELECT DISTINCT(foo) FROM bar`.

dotancohen · 2025-10-18T16:48:22 1760806102

Without DISTINCT ON (which I've never used) you can use a window function via the OVER clause with PARTITION BY. I'm pretty sure that's standard SQL.
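A sketch of that window-function approach, run through Python's sqlite3 (window functions need SQLite 3.25 or newer). The `entry` table, its columns, and the rows are invented to mimic "latest recorded version per day":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (day TEXT, recorded_at TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        ("2025-01-01", "2025-01-01T09:00", 10),
        ("2025-01-01", "2025-01-03T12:00", 11),  # later correction
        ("2025-01-02", "2025-01-02T09:00", 20),
    ],
)

# Rank the versions within each day, newest first, then keep rank 1.
latest = conn.execute("""
    SELECT day, value
    FROM (
        SELECT day, value,
               ROW_NUMBER() OVER (
                   PARTITION BY day ORDER BY recorded_at DESC
               ) AS rn
        FROM entry
    )
    WHERE rn = 1
    ORDER BY day
""").fetchall()

print(latest)  # [('2025-01-01', 11), ('2025-01-02', 20)]
```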

ryandv · 2025-10-18T16:49:20 1760806160

Yes, this is the implementation I have seen in other dialects.

jmull · 2025-10-18T16:00:09 1760803209

I'd be wary of overgeneralizing on that. I guess it depends on whose queries you're usually reading.

RHSeeger · 2025-10-18T16:47:14 1760806034

I think you're reading more into what was said than is really there

> I immediately become suspicious

All I read from that is, when DISTINCT is used, it's worth taking a look to make sure the person in question understands the data/query; and isn't just "fixing" a broken query with it. That doesn't mean it's wrong, but it's a "smell", a "flag" saying pay attention.

grumpylittleted · 2025-10-19T11:42:25 1760874145

So how do you "know" when you can safely omit DISTINCT for your shiny new query SELECT x FROM t ?

Oh you looked the schema for t and it said x has a PRIMARY or UNIQUE constraint?

Ah well two minutes after you looked at the schema Tom removed the UNIQUE constraint. Now you're scratching your head when you get duplicates.

SQL is a bag language not a set language. The contract with relation t is that if the runtime can find the relation t and attribute x it will return it. You may end up with rows or not, and you may end up with duplicates or not, and the type of x may change between subsequent executions.

So if you want a set you need to say so using DISTINCT. At runtime the query planner will check the schema and if the attribute is UNIQUE or PRIMARY it will not have to do a deduplication.

fipar · 2025-10-19T03:48:00 1760845680

I'm not sure I understand the part about set theory. If anything, a valid use of DISTINCT is if you want the result to be (closer to) a set, as otherwise (to your point, depending on the data model) you may get a bag instead.

In fact, IIRC, using DISTINCT (usually bad for performance, btw) is advice C. J. Date gives for SQL in https://www.oreilly.com/library/view/sql-and-relational/9781...

dragonwriter · 2025-10-18T16:30:37 1760805037

In my experience, it's nearly as often a problem with the design of the database as the query author.

ch2026 · 2025-10-18T18:53:31 1760813611

Or maybe they’re on OLAP not OLTP.

9rx · 2025-10-18T17:38:12 1760809092

Or believe more in Codd’s relational model than SQL’s tabulational model.

kpcyrd · 2025-10-18T22:40:03 1760827203

SQL is somehow "ask two people, get three different opinions" for something as basic as:

"given a BTreeMap<String, Vec<String>>, how do I do .keys() and .len()".

qcnguy · 2025-10-19T09:36:45 1760866605

SQL isn't very intuitive. Lots of people claim it is, but then lots of people claim Haskell is; market outcomes suggest they are outliers.

The big justification for its design is to enable compiler optimizations (query planning) but compilers can optimize imperative code very well too, so I wonder if you could get the same benefits with a language that's less declarative.

leptons · 2025-10-18T14:58:53 1760799533

And that's okay. Not every developer knows every single thing there is to know about every single tech. Sometimes you just need a solution, and someone with more specific knowledge can optimize later. How many non-database related mistakes would you make if you had to build every part of a system yourself?

pessimizer · 2025-10-18T16:29:49 1760804989

But what if they don't know that they need your approval not to know things?

Sesse__ · 2025-10-18T14:54:28 1760799268

Or just doesn't know how to do semijoins in SQL, since they don't follow the same syntax as normal joins for whatever historical reason.

wvbdmp · 2025-10-18T15:41:25 1760802085

Eh, sometimes you need a quick fix and it’s just extremely concise and readable. I’ll take an INNER JOIN over EXISTS (nice but insanely verbose) or CROSS APPLY (nice but slow) almost every time. Obviously you have to know what you’re dealing with, and I’m mostly talking about reporting, not perf critical application code.

Distinct is also easily explained to users, who are probably familiar with Excel’s “remove duplicate rows”.

It can also be great for exploring unfamiliar databases. I ask applicants to find stuff in a database they would never see by scrolling, and you’d be surprised how many don’t find it.

Sesse__ · 2025-10-18T15:59:43 1760803183

The less verbose way of doing semijoins is by an IN subquery.

wvbdmp · 2025-10-18T16:23:35 1760804615

>subquery

>less verbose

Well…

In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.

Sesse__ · 2025-10-18T16:39:00 1760805540

What do you mean? Here are your real alternatives for doing a semijoin (assuming ANSI SQL, no vendor extensions):

  SELECT * FROM t1 WHERE EXISTS ( SELECT * FROM t2 WHERE t2.x = t1.x );
  SELECT * FROM t1 WHERE x IN ( SELECT x FROM t2 );
  SELECT * FROM t1 JOIN ( SELECT DISTINCT x FROM t2 ) s1 USING (x);

Now tell me which one of these is the less verbose semijoin?

You could argue that you could fake a semijoin using

  SELECT DISTINCT * FROM t1 JOIN t2 USING (x);

or

  SELECT * FROM t1 JOIN t2 USING (x) GROUP BY t1.*;

but it doesn't give the same result if t1 has duplicate rows, or if there is more than one t2 matching t1. (You can try to fudge it by replacing * with something else, in which case the problem just moves around, since “duplicate rows” will mean something else.)
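A small sketch of that difference using Python's sqlite3. The contents of t1 and t2 are made up, and t1 deliberately contains a duplicate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER, label TEXT);
    CREATE TABLE t2 (x INTEGER);
    INSERT INTO t1 VALUES (1, 'a'), (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1), (1), (2);
""")

def run(sql):
    """Run a query and return its rows in a stable order for comparison."""
    return sorted(conn.execute(sql).fetchall())

# The three real semijoin spellings.
exists_rows = run(
    "SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.x = t1.x)"
)
in_rows = run("SELECT * FROM t1 WHERE x IN (SELECT x FROM t2)")
derived_rows = run(
    "SELECT * FROM t1 JOIN (SELECT DISTINCT x FROM t2) s1 USING (x)"
)

# The faked version: a plain join followed by DISTINCT.
faked_rows = run("SELECT DISTINCT * FROM t1 JOIN t2 USING (x)")

print(exists_rows)  # [(1, 'a'), (1, 'a'), (2, 'b')]
print(faked_rows)   # [(1, 'a'), (2, 'b')]
```

The three real semijoins keep both copies of the duplicated t1 row. The DISTINCT version collapses them into one.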

wvbdmp · 2025-10-18T17:34:53 1760808893

No, sorry, you’re certainly correct, I just meant that any subqueries are generally crazy verbose. And then you usually want additional Where clauses or even Joins in there, and it starts to stop looking like a Where clause, so I’m often happy when I can push that logic into From.

Sesse__ · 2025-10-18T17:38:04 1760809084

Yes, I would certainly prefer if you could write

SELECT * FROM t1 SEMIJOIN t2 USING (x);

although it creates some extra problems for the join optimizer.

Little_Kitty · 2025-10-18T20:04:26 1760817866

It's great being able to use an any join (and the counterpart anti join) in Clickhouse to deal with these operations.