jsteemann's comments | Hacker News

Clang's scan-build has supported C++ for a long time already.

Regarding gcc/g++, there seems to be support only for C. From https://gcc.gnu.org/wiki/DavidMalcolm/StaticAnalyzer:

> Initial implementation was added in GCC 10; major rewrite occurred in GCC 11.
> Only C is currently supported (I hope to support C++ in GCC 13, but it is out-of-scope for GCC 12)

Looking forward to seeing C++ support added to it!


I've found loads of bugs in the C support and David has been very responsive to my bug reports, even substantially reworking how realloc(3) was being handled because of a case we found. (Edit: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99260 - I must get around to testing his fix since GCC 12 is in Fedora now)


I stumbled across fbinfer and gave it a try for a larger C++ project on Linux a few days ago.

It worked out of the box, because fbinfer can tap into the compile commands database that can easily be generated by CMake and other build tools.
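
For a CMake-based project, the rough workflow looks something like this (treat it as a sketch; the exact infer flag names may differ between versions, so double-check with infer's --help):

    # have CMake emit compile_commands.json (CMake >= 3.13 for -S/-B)
    cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

    # point infer at the generated compilation database, then analyze
    infer capture --compilation-database build/compile_commands.json
    infer analyze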

The tool has nice output about its current progress, but even when running with many threads, its long translation and analysis times made it somewhat impractical to use on a large project. I understand that it can later run on incremental changes and reuse some data from previous runs; in that mode it is probably better suited. But the first-time analysis of a large project is very time-consuming. I am not saying this is unique to fbinfer; other static analyzers tend to have the same problem. To be fair, I haven't yet inspected the tool's various options, which could potentially speed up the analysis.

The initial report the tool created contained several findings, mainly "potential" nullptr dereferences and a few "potential" data races. After manual inspection these all turned out to be false positives. However, the tool also found several dead stores, which turned out to be actual dead stores. So it is at least helpful in that regard.

From my perspective, the tool has potential. Probably some of the false positives can be turned off via configuration, and using it for incremental analysis may also reduce its runtime so that it becomes tolerable for larger projects.


From the status page (https://status.slack.com/2021-01/9ecc1bc75347b6d1), updated just now:

> We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.
> Jan 4, 5:20 PM GMT+1


"Coffee is very bad" sounds like a very bold claim to me and should be backed by supporting facts.

Pretty much every substance will have bad effects on your health when taken in excessive doses. But that's not to say that lower doses will produce these negative effects as well.

Sola dosis facit venenum ("the dose makes the poison").


"Everything in moderation ... including moderation."


The database market with all its competition is definitely challenging. I have no doubt AWS will increase their database market share over time. The good thing about this competition is that it is forcing all vendors to be innovative and to find (more) USPs.

AWS DocumentDB seems to be pretty much tied to the MongoDB API right now, which will somewhat limit its functionality for the time being. However, they will not stand still and will probably also extend into the multi-model space at some point. Apart from that, not everyone will be willing to pay for DocumentDB or to have their data located in Amazon datacenters.


"AWS DocumentDB seems to be pretty much tied to the MongoDB API"

I could imagine that they didn't build DocumentDB from the ground up.

DocumentDB is probably just a MongoDB compatible API for one of their base services (S3 or DynamoDB).

As far as I know, they built Serverless Aurora on top of S3, with the help of S3 Select. So they will probably just create another custom-DB-compatible API if they get the impression that this custom DB becomes the next big thing.


Exactly, AWS DocumentDB is only MongoDB API-compatible, but it's not using any MongoDB components.

It's an implementation of its own, leveraging many of the base building blocks and infrastructure Amazon has created.

DocumentDB is currently tied to the MongoDB 3.6 API, which means that all the transactional extensions MongoDB has added recently are not present in DocumentDB (yet).


ArangoDB is a multi-model database, so it tries to target several use cases. It provides functionality for working with key-values, documents, graphs, and fulltext indexing/searching. It provides some flexibility in the sense that it does not force you into a specific way of working with the data. For example, it does not force you to treat every use case as a graph use case. This is in contrast to some other specialized databases, which excel at their specific area but also force you to completely adopt the type of data modeling they support.
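
As a rough illustration of mixing models in one query (collection names here are invented), a single AQL statement can filter documents and then traverse a graph stored in an edge collection:

    // document model: filter persons by an attribute;
    // graph model: then follow "knows" edges 1 to 2 steps out
    FOR p IN persons
      FILTER p.city == "Cologne"
      FOR friend IN 1..2 OUTBOUND p knows
        RETURN { person: p.name, friend: friend.name }

Key-value style access on the same data is simply a lookup by document key (e.g. DOCUMENT("persons/alice") in AQL), without any extra modeling.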


So what’s the downside of using this DB which does it all vs using a specialized DB? Scaling, performance, etc?


Think we have to be a bit more precise here. ArangoDB supports documents, key/value, and graph. It is not really optimized for large timeseries use cases which might need windowing or other features. Influx or Timescale might provide better characteristics here. However, for the supported data models we found a way to combine them quite efficiently.

Many search engines access data stored in JSON format. Hence integrating a search engine like ArangoSearch as an additional layer on top of the existing data models is no magic but makes a lot of sense. Allowing users to combine the data models with search is then a rather obvious step for us.


Specialized databases have the advantage of being, well, specialized...

For example, a specialized OLAP database which knows about the schema of the data can employ much more streamlined storage and query operators, so it should have a "natural" performance advantage.

However, a very specialized database may later lock you in to something, and in case you need something different, you will end up running multiple different special-purpose databases.

Not saying this is necessarily bad (or good), but it is at least one aspect to consider: how many different databases do you want to operate & manage in your stack?


Interesting. Pretty much every startup I've worked for has run 2-3 databases. Usually Redis plus some search (typically Elastic now). I could see this making that easier.


Regarding the question on the query language: AQL is fully declarative. In this respect it is like SQL. However, there are a few differences between AQL and SQL:

* SQL is an all-purpose database management and querying language. It is very complex and heavy-weight, as it has to solve a lot of different problems, e.g. data definition, data retrieval and manipulation, stored procedures etc. AQL is much more lightweight, as its purpose is querying and manipulating database data. Data definition and database administration commands are not part of AQL, but can be achieved using other, dedicated commands/APIs.

* For data retrieval and manipulation, the functionality of SQL and AQL overlaps a lot, but they use different keywords for similar things. Still, simple SQL queries can be converted to AQL easily and vice versa (see the small example below). There are some specialized parts of AQL, such as graph traversals and shortest path queries, for which there may be no direct equivalent in SQL.
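
To give a feel for the keyword differences, here is a trivial made-up query in both languages (collection and attribute names are invented for illustration); the structure maps almost one-to-one:

    // SQL:  SELECT name FROM users WHERE age >= 21 ORDER BY name
    // AQL:
    FOR u IN users
      FILTER u.age >= 21
      SORT u.name
      RETURN u.name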

AQL is versioned along with the database core, as sometimes features are added to AQL which the database core must also support and vice versa. However, during further development of AQL and the database core, one of the major goals is to always keep it backwards-compatible, meaning that existing AQL queries are expected to work and behave identically in newer versions of the database (but ideally run faster or be better optimized there).


Okay, I like how backwards compatibility is preserved. I worked with mongoDB at my previous company and we ended up not being able to migrate to mongoDB 3.x. I think it was because we forked 'eve-mongoengine' and couldn't merge upstream changes, which ended up forcing us to version the entire stack through the database at the same time, which passed the threshold of feasibility.

We were absolute idiots, but I still think a data warehouse should be idiot-proof, which is why I like SQL.

I read through the documentation for ArangoDB and I would be concerned about the lack of native strict type definitions and referencing in AQL, as well as the dearth of type availability in ArangoDB in general. Is this a design decision related to not supporting data/database administration, or something to be added later to the roadmap?

It sounds like if you support write-intensive paths through the database, it would be considered an OLTP database for some OLTP workloads; do you publish TPC-C benchmarks anywhere? What about resource utilization?

Is there a particular reason to support JavaScript first? Is it because Swagger has JavaScript-first support, or a different reason?


ArangoDB is a schema-less database. There is currently no support for schemas or schema validation on the database core level, but it may be added later, because IMHO it is a very sensible feature. When that is in place, AQL may also be extended to get more strict about the types used in queries. However, IMHO that should only be enforced if there is a schema present.

To keep things simple and manageable, we originally started with AQL just being a language for querying the database. It was extended years ago to support data manipulation operations. I don't exclude the possibility that at some point it will support database administration or DDL commands; however, I am just one of the developers and not the product manager. And you are right about the main use case being OLTP workloads. For OLAP use cases, dedicated analytical databases (with fixed data schemas) are probably superior, because they can run much more specialized and streamlined operations on the data. To the best of my knowledge we never published any TPC benchmark results anywhere. I think it's possible to implement TPC-C even without SQL, however, implementing the full benchmark is a huge amount of work, so we never did...


Forgot to answer the JavaScript question... JavaScript can be used in ArangoDB to run something like stored procedures. ArangoDB comes with a JavaScript-based framework (named Foxx) for building data-centric microservices. Its usage is completely optional, however. When using the framework, it allows you to easily write custom REST APIs for certain database operations. The API description is consumable via Swagger too, so API documentation and discoverability are no-brainers.
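
Just as a sketch of what such a Foxx route can look like (the "users" collection and the route path are made up for illustration), the summary/description metadata below is what ends up in the Swagger-consumable API description:

    'use strict';
    const createRouter = require('@arangodb/foxx/router');
    const db = require('@arangodb').db;

    const router = createRouter();
    module.context.use(router);

    // GET /users/:key - return one document from a (made-up) "users" collection
    router.get('/users/:key', function (req, res) {
      res.send(db.users.document(req.pathParams.key));
    })
    .summary('Fetch a user')
    .description('Returns a single document from the users collection by key.');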

Apart from that, ArangoDB comes with a JavaScript-enabled shell (arangosh) that can be used for scripting and automating database operations.


AQL is similar to N1QL from Couchbase. Imagine if SQL was designed around JSON instead of rows, and that is what AQL is.


Somewhat at least... N1QL tries to stay closer to SQL in terms of keywords and such, whereas the AQL approach was to pick different keywords than SQL. Apart from the difference in keywords I tend to agree.


Being one of the developers of ArangoDB, I would like to use the chance to reply to this as well.

I think there were various issues with cluster stability 1.5 years ago, and since then we have put great effort into making the database much more robust and faster. Many man-years have been dedicated to this since 2017.

1.5 years ago we were shipping release 3.1, which is out of service already. Since then, we have released

* ArangoDB 3.2: this release provided the RocksDB storage engine, which improves parallelism and memory management compared to our traditional mostly-memory storage engine

* ArangoDB 3.3: with a new deployment mode (active failover), plus ease-of-use and replication improvements (e.g. cross-datacenter replication)

* ArangoDB 3.4: latest release, for which we put great emphasis on performance improvements, namely for the RocksDB storage engine (which now also is the default engine in ArangoDB)

In all of the above releases we also worked on improving AQL query execution plans, in order to make queries perform faster in both single server and cluster deployments. Working on the query optimizer and query execution plan improvements is obviously a never-ending task, and not only did we achieve a lot here since 2017, but we still have a lot of ideas for further improvements in this area. So there are more improvements to be expected in the following releases.

All that said, I think it is clear now that my intention is to show that things should have improved a lot compared to the situation 1.5 years ago, and that we will always be working hard to make ArangoDB a better product.


What release will be stable if version 3.1 has serious clustering issues? In the old, bad days, version 1.0 was considered stable :)


Thanks, that's kind of reassuring.


Most things in 3.4 are actually fully compatible with older releases. The new S2 geo indexes are an exception here. They are a completely new implementation and use a different storage format. However, index data will automatically be converted into the new format when upgrading from older releases (e.g. 3.3) to 3.4.

To be on the safe side, it is a good idea to consult the list of incompatible changes/changed behavior before upgrading: https://github.com/arangodb/arangodb/blob/3.4/Documentation/...

That may seem like a huge list at first, but many items on it are actually minor.


Thanks! We have a very brief description of the "distributed COLLECT" feature here: https://github.com/arangodb/arangodb/blob/3.4/Documentation/...

More beef will be added to it before the GA release.

The benefits of distributed COLLECT will come into play for queries that can push the aggregate operations onto the shards. Previous versions of ArangoDB shipped all documents from the database servers to the coordinator, so the coordinator would do the central aggregation of the results from all shards to produce the result.

With distributed COLLECT we now create an additional shard-local COLLECT operation that performs part of the aggregation on the shards already. This allows sending just the aggregated per-shard results to the coordinator, so the coordinator can finally perform an aggregation of the per-shard aggregates.

This will be beneficial in many cases when the per-shard aggregated result is much smaller than the non-aggregated per-shard result.

Following is a very simple example. Let's say you have a collection "test" with 5 shards and 500k simple documents that have just one numeric attribute (plus the three system attributes "_key", "_id" and "_rev"):

    db._create("test", { numberOfShards: 5 }); 
    for (let i = 0; i < 500000; ++i) {
      db.test.insert({ value: i });
    }
A query that calculates the minimum and maximum values of the "value" attribute can make use of the distributed COLLECT:

    FOR doc IN test 
      COLLECT AGGREGATE min = MIN(doc.value), max = MAX(doc.value) 
      RETURN { min, max }
The database servers can compute the per-shard minimum and maximum values, so they will each only send two numeric values back to the coordinator.

Without the optimization, the database servers will either send the entire documents or a projection of each document (containing just each document's "value" attribute) back to the coordinator. Either way, each shard would still have to send 100k values on average.

With a local cluster that has 2 database servers and runs them on the same host as the coordinator, this simple query is sped up by a factor of 2 to 3 when the optimization is applied. In a "real" setup the speedup will be even higher, because then there will be additional network roundtrips between the cluster nodes. And in reality, documents tend to contain more data and collections tend to have more documents, in which case the speedup will be even higher.

