Various tools use tool-specific graph attributes. For example, "rank" and "minlen" mean something to dot, the hierarchical (layered) layout engine, but nothing to the other engines, while "size" and "label" mean the same thing everywhere. All the tools share the same underlying graph representation library, with a parser generated by yacc or bison.
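To make the distinction concrete, here's a minimal sketch using the Python graphviz package (the package is just a convenience for generating DOT text, which is what the engines actually read):

    import graphviz

    g = graphviz.Digraph(engine="dot")
    g.attr(size="4,4")                         # honored by every layout engine
    g.edge("parse", "layout", minlen="2")      # dot-only: keep >= 2 ranks apart
    g.edge("layout", "render", label="emits")  # label: meaningful everywhere
    with g.subgraph() as s:
        s.attr(rank="same")                    # dot-only: pin nodes to one rank
        s.node("layout")
        s.node("debug")
    print(g.source)  # neato/fdp parse the same text but ignore rank/minlen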
The documentation includes a big table of attributes that graphviz tools recognize.
With LLMs, there is now better automated support for discovering the features you need. Just imagining here: "make the layout fill the available space" or "make all the nodes look like points with associated text labels" (not sure that even works, but it should).
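For what it's worth, the point-nodes one does seem expressible already with shape=point plus xlabel. A quick sketch via the Python graphviz package, with no promises about how well dot places the external labels:

    import graphviz

    g = graphviz.Digraph()
    g.attr("node", shape="point", width="0.05")  # nodes drawn as dots
    g.attr(forcelabels="true")                   # keep xlabels even if they overlap
    for name in ["alpha", "beta", "gamma"]:
        g.node(name, xlabel=name)                # external label beside the point
    g.edge("alpha", "beta")
    g.edge("beta", "gamma")
    print(g.source)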
True, and in fact I am aware that other teams look for paid solutions where graphs power the core features of their products. For us, it is a small feature, so we were looking for the "least trouble" path.
I don't know enough about all those other libraries and their licenses, but I do know that as long as we don't ship those libraries, especially modified versions, it's likely ok (of course that's simplified). Some internal tooling depends on GNU tools but we are just users. For things like glibc, it's just a standard system library, so linking with it is not a problem. (I am sure legal has looked at this.)
But GPL/LGPL software is definitely a minority of the software we use in any way. Basically it has to be avoided as much as possible.
One of the nice things about this work is that by assuming the environment is a web client, it supports some basic interactive exploration, and offloads a lot of bothersome rendering problems.
Also, by focusing on control flow graphs, the proposed method does a better job with domain-specific layout. Apparently CFG visualization and exploration is a current topic; e.g. CFGExplorer. Some Graphviz users would probably benefit if it incorporated CFG-friendly level assignment as an option.
There's already machinery in Graphviz to support polylines instead of splines, and to control edge ordering, but it is not well tested or documented. It seems tempting to incorporate an edge routing scheme based on Brandes and Köpf's coordinate assignment, with long vertical runs and at most 2 bends per edge. Understanding and implementing that seems close to a master's thesis worth of work.
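The target shape is easy to sketch even if the algorithm isn't: each inter-layer edge becomes one long vertical run plus at most one bend at each end. A toy illustration in Python (not Brandes–Köpf itself, which also solves how the runs are aligned and balanced; node positions are assumed given):

    def route_edge(src, dst, run_x, gap=0.5):
        """Toy polyline in the Brandes-Koepf spirit: one vertical run at
        x=run_x (say, the median column of the edge's dummy nodes) with
        at most two bends total. src/dst are (x, y) node centers; y grows
        downward, one unit per layer."""
        (sx, sy), (dx, dy) = src, dst
        points = [(sx, sy)]
        if sx != run_x:
            points.append((run_x, sy + gap))  # bend just below the source
        if dx != run_x:
            points.append((run_x, dy - gap))  # bend just above the target
        points.append((dx, dy))
        return points

    # an edge spanning three layers, aligned to the column at x=2:
    print(route_edge((0, 0), (1, 3), run_x=2))
    # [(0, 0), (2, 0.5), (2, 2.5), (1, 3)]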
Graphviz started almost 40 years ago and is now supported by only a few (one or two?) second-generation volunteers, with no third generation on the scene yet. Over the years we've had plenty of our own disdainful "What is all this junk?" moments, about our own code and other people's (cf. various xkcd comics), but sometimes the better perspective is to ask, "What was being optimized that led a team to choose, or end up at, this point in the design space?" Generally, the market is addicted to features.
It is a little dismaying to see the relatively slow progress in the broad field of declarative 2D diagramming. Given how hard the pendulum has swung back toward language-based methods and away from doing everything through interaction, you'd think there would be a bigger payoff now for doing the work. Unfortunately, tool-making has always been a tough market. The customers are generally smart and demanding, and they work in cost centers, so they don't have generous budgets.
Not everything has to be directly informative or solve a problem. Sometimes data visualization can look pretty for pretty's sake.
Dimensionality reduction/clustering like this may be less useful for identifying trends in token embeddings, but for other types of embeddings it's extremely useful.
Agreed. The fact that it has any structure at all is fascinating (and super pretty). It could point at interesting internal structure. I would love to see a version for Qwen-3 and Mistral too!
I wonder if being trained on significant amounts of synthetic data gave it any unique characteristics.
It lets you inspect what actually constitutes a given cluster. For example, the outer clusters seem to be variations of individual words and their direct translations rather than synonyms (at least the ones I saw).
> What do people learn from visualizations like this?
Applying the embedding model to a dataset of your own and then building a similar visualization is where it gets cool: you can visually inspect the clusters and draw conclusions about the closeness of items in your own data.
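A minimal sketch of that workflow (the model name and toy dataset are placeholders; any embedding model and 2-D projection would do):

    from sentence_transformers import SentenceTransformer
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    items = ["refund request", "billing question", "login failure",
             "password reset", "invoice copy", "2FA not working"]
    vectors = SentenceTransformer("all-MiniLM-L6-v2").encode(items)

    # project the high-dimensional embeddings down to 2-D for plotting
    xy = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)
    plt.scatter(xy[:, 0], xy[:, 1])
    for (x, y), text in zip(xy, items):
        plt.annotate(text, (x, y))
    plt.show()  # billing-ish items should land near each other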
Embedding visualizations have helped identify bias in word embeddings (Word2Vec), debug entity resolution systems, and optimize document retrieval by revealing semantic clusters that inform better indexing strategies.
Interesting, glad to know it's been useful for some specific contributions. (Not questioning that interesting, appealing displays serving as overviews for general awareness are also worthwhile.)
And this is the point. You need to narrow the scope to make something like this useful. Writing a paper on "Good transportation design" is kind of meaningless. Do you mean cars, trucks, boats, planes, spacecraft, scooters, fighters, tanks? Do you mean roadways that can accommodate some subset?
If you mean "transactional websites", and assuming you mean something like product catalogs and being able to purchase, that narrows it down quite a lot.
Or does it?
For the majority of use cases, Craigslist, eBay, or Amazon is the best fit.
Next in number of use cases are Wix/Square/etc., where you design your own UI.
Then come all-in-one systems with a UI/ORM based on Python/Ruby/etc., where you need to design your own DB schema and UI, but the "design" is already done for you.
The next step is custom-designed systems like the one the article talks about, where completely off-the-shelf is not suitable.
And then there are the highly scalable systems.
The article is perfectly fine if we are discussing custom-designed systems that don't necessarily need the highest scalability.
Any ideas how these primitives could be used to implement an edge router for drawing natural-looking curves around obstacles in diagrams, as an improvement on the 25-year-old solver in Graphviz (https://dpd.cs.princeton.edu/Papers/DGKN97.pdf)?
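For context, that solver computes a shortest polyline path around the obstacles and then fits piecewise Béziers inside the surrounding polygonal channel. To anchor the question, here's a toy sketch of just the smoothing half, with Catmull–Rom tangents standing in for the paper's constrained fit (a real router would also test each Bézier against the channel and subdivide on violation):

    def catmull_rom_to_beziers(path):
        """Smooth a polyline (e.g. a shortest path around obstacles)
        into cubic Bezier segments using Catmull-Rom tangents."""
        pts = [path[0]] + list(path) + [path[-1]]  # pad endpoints
        beziers = []
        for i in range(1, len(pts) - 2):
            p0, p1, p2, p3 = pts[i - 1], pts[i], pts[i + 1], pts[i + 2]
            c1 = (p1[0] + (p2[0] - p0[0]) / 6, p1[1] + (p2[1] - p0[1]) / 6)
            c2 = (p2[0] - (p3[0] - p1[0]) / 6, p2[1] - (p3[1] - p1[1]) / 6)
            beziers.append((p1, c1, c2, p2))
        return beziers

    # a polyline skirting an obstacle between (0,0) and (4,0):
    for seg in catmull_rom_to_beziers([(0, 0), (2, 1.5), (4, 0)]):
        print(seg)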
We learned the hard way that, for some of us, it's all too easy to make careless design errors that become baked in and can't be fixed in a backward-compatible way (at either the DSL or the API level). An example in Graphviz is its handling of backslash in string literals, which is overloaded three ways: to escape special characters (like quotes, \"), to map special characters (like several flavors of newline with optional justification: \n, \l, \r), and to indicate variables (like node names in labels, \N), along with magic code that knows that if the -default- node name is the empty string it actually means \N, but if a particular node name is the empty string, it stays empty.
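For anyone who hasn't hit this, here is one backslash doing three unrelated jobs (the snippet just prints illustrative DOT text; the justification flavors only affect label rendering):

    dot = r'''digraph G {
      node [label="\N"]                       // \N = substitute the node's name
      a    [label="left line\lanother\l"]     // \l = left-justified line break
      b    [label="a literal quote: \" here"] // \" = escaped special character
    }'''
    print(dot)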
There was a published study, "Wrangling Messy CSV Files by Detecting Row and Type Patterns" by Gerrit J. J. van den Burg, Alfredo Nazábal, and Charles Sutton (Data Mining and Knowledge Discovery, 2019), that showed many pitfalls in parsing CSV files found on GitHub; their dialect detector achieved 97% accuracy. It's easy to write code that slings out some text fields separated by commas with the goal of a human-readable, portable format; it's much harder to read everyone else's.
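Even the most basic pitfall, a delimiter inside a quoted field, trips up the obvious approach:

    import csv, io

    row = '1,"hello, world"'

    # naive split breaks the quoted field in two
    print(row.split(","))                      # ['1', '"hello', ' world"']

    # a real CSV parser respects the quoting
    print(next(csv.reader(io.StringIO(row))))  # ['1', 'hello, world']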
You can learn even more by letting autofuzz test your nice, simple code for parsing human-readable files.
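A harness is only a few lines; for example, with Google's Atheris (the csv module here is just a stand-in for your own format reader):

    import sys
    import atheris

    with atheris.instrument_imports():
        import csv, io

    def TestOneInput(data: bytes):
        text = data.decode("utf-8", errors="ignore")
        try:
            for _row in csv.reader(io.StringIO(text)):
                pass  # crashes and hangs, not parse errors, are the quarry
        except csv.Error:
            pass

    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()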
A lot of improvements are possible, based on 20 years of progress in interactive systems and in overall computing performance.