If the data is ordered by row as in your example (e.g. {1,a}, {1,b}, etc.), a relatively straightforward option is to read all the columns for each row and use that set of values as a key, so that hash(columns) -> list(rowNum). Rows that have exactly the same columns then end up in the same list.
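
As a rough sketch of that idea, assuming the input arrives as (row, column) pairs: the function name `group_rows_by_columns` and the sample data are made up for illustration, and a `frozenset` stands in for "hash(columns)" so that column order doesn't matter.

```python
from collections import defaultdict


def group_rows_by_columns(pairs):
    """Map each distinct set of columns to the list of rows that have it.

    `pairs` is an iterable of (row, column) tuples (an assumed input shape).
    """
    # First pass: collect all columns seen for each row.
    row_columns = defaultdict(set)
    for row, col in pairs:
        row_columns[row].add(col)

    # Second pass: use the (hashable) column set as the key.
    groups = defaultdict(list)
    for row, cols in row_columns.items():
        groups[frozenset(cols)].append(row)
    return dict(groups)


# Hypothetical sample data: rows 1 and 2 share the same columns.
pairs = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "c")]
groups = group_rows_by_columns(pairs)
for cols, rows in groups.items():
    print(sorted(cols), rows)
```

Each key in `groups` is one distinct column combination, and its list holds every row with that exact combination.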

This may not give you the lowest number of leaves, depending on your dataset. If you're looking for the absolute smallest number of leaves, you could instead use each column as an individual key (column -> list of rows) and intersect those lists to see whether that produces a lower leaf count.
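
A minimal sketch of the per-column variant, again assuming (row, column) pairs as input: `index_by_column` and `rows_sharing` are hypothetical helper names, and the sketch only shows the intersection step. How you turn the intersections into leaves, and which combinations are worth trying, depends on your structure.

```python
from collections import defaultdict


def index_by_column(pairs):
    """Map each column to the set of rows that contain it."""
    column_rows = defaultdict(set)
    for row, col in pairs:
        column_rows[col].add(row)
    return dict(column_rows)


def rows_sharing(column_rows, columns):
    """Return the rows that contain every column in `columns`."""
    # An unknown column contributes an empty set, so the result is empty.
    sets = [column_rows.get(col, set()) for col in columns]
    return set.intersection(*sets) if sets else set()


# Hypothetical sample data.
pairs = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "c")]
column_rows = index_by_column(pairs)
print(rows_sharing(column_rows, ["a", "b"]))
print(rows_sharing(column_rows, ["a", "c"]))
```

Using sets rather than lists keeps each intersection cheap. A non-empty intersection over several columns is a candidate for a shared leaf.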