Graph, machine learning, hype, and beyond: ArangoDB open source multi-model database releases version 3.7

A sui generis, multi-model open source database, designed from the ground up to be distributed. ArangoDB keeps up with the times and uses graph and machine learning as the entry points for its offering.

If open source is the new normal in enterprise software, then that certainly holds for databases, too. In that line of thinking, GitHub is where it all happens. So to have been favorited 10,000 times on GitHub must say something about a project. Open source ArangoDB, which also offers an Enterprise version, has hit that milestone recently.

On Aug. 27, ArangoDB announces its new release 3.7, which comes with interesting new features around graph. We take the opportunity to discuss the database market, graph, and beyond, with CEO and co-founder Claudius Weinberger and Head of Engineering and Machine Learning Jörg Schad.

CLOUD AND MACHINE LEARNING READY

ArangoDB was founded in Cologne in 2014 by OnVista veterans Claudius Weinberger and Frank Celler. The team made headlines in 2019 with $10 million in Series A funding led by Bow Capital. As Weinberger noted, he and his co-founder have been working together for 20 years, and the decision to pursue their vision was not a spur-of-the-moment idea:

“The main idea for ArangoDB, what is still valid today, is what we call the native multi-model approach. That means that we found a way that we can combine the JSON document data model, the graph model, and the key-value model in one database core with one query language.”

Today ArangoDB is a US company with a German subsidiary. It has a new chief revenue officer, Matt Ekstrom, and a new head of engineering, Schad. Schad joined ArangoDB last year but has been working with ArangoDB for the past four years. With a PhD in database systems, distributed data analytics, and large scale infrastructure container systems, Schad has been switching between databases.

Two key factors made him join the ArangoDB team: Distribution in a cloud setting and machine learning (ML). ArangoDB has been an early adopter of both Apache Mesos / DC/OS and Kubernetes. Eventually, Kubernetes prevailed, and ArangoDB 3.7 comes with the general availability of its Kubernetes operator, which has been developed over the last three years.

ArangoDB’s Kubernetes operator is also the foundation for its managed service Oasis, available on AWS, Azure, and GCP. The new release includes a number of improvements: faster replacement and movement of servers, improved monitoring and cluster health analysis, advanced inspection of pod failure causes, and overall reduced resource usage. Cluster scalability improvements apply to on-premises deployments too.

ArangoDB is touted as a solution to unify metadata across machine learning pipelines

ArangoDB has been promoting ArangoML: Using ArangoDB as the infrastructure for teams using ML. The idea is that beyond training data, which is a prerequisite for training ML models, metadata is also important, and using ArangoDB is a good match for that. We have long argued for the importance of metadata. But why ArangoDB, and not any other data management system?

Schad referred to his experience building machine learning pipelines for finance and healthcare use cases. One of the biggest challenges he saw there was maintaining audit trails for CCPA or GDPR, which made it necessary to have a full view of the entire pipeline. They had to figure out what happens if patients withdraw consent to use their data, for example.

Just being able to identify the different ML models deployed in production was very challenging because they had to go through a number of different metadata stores — for the ML part, the data feature transformation part, and so on. So they wanted to have a common layer with all the metadata where this would end up being one query.

Relational systems are not a good match, Schad said. Machine learning features may be derived from other features, which means ending up with a lot of joins, and especially a lot of self joins. Apart from being ugly to write and maintain, those queries don’t perform well either. So this started to look like a case for a graph database — these are the types of queries graph databases excel at.

FROM GRAPH TO MULTI-MODEL AND BACK AGAIN

But still: why ArangoDB? ArangoDB is not a traditional graph database; it is a multi-model database that also supports graph. The advantage, according to Schad, is that this enables users to combine the flexibility of having no schema, leveraging the JSON document view of multi-model, with the structure of how things are connected as a graph:

“In the end, looking at which models have been impacted by which is being derived from just one data set, it’s just a graph traversal. So it turned out to be a really easy model, to be both flexible and very efficient in terms of formulating this query and many others as well.”
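The lineage question Schad describes, namely which models are affected when one data set changes or must be withdrawn, can be sketched as a plain breadth-first traversal. The following is a minimal illustration in Python, not ArangoDB's query language or ArangoML's API; the `LINEAGE` graph, the artifact names, and both function names are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds into the artifacts listed for it.
LINEAGE = {
    "dataset/patients_2020": ["feature/age_bucket", "feature/visit_count"],
    "feature/age_bucket": ["feature/risk_score"],
    "feature/visit_count": ["feature/risk_score", "model/readmission_v1"],
    "feature/risk_score": ["model/readmission_v2"],
}


def downstream(graph, start):
    """Return every artifact reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # features can be derived via several paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def impacted_models(graph, dataset):
    """Return the sorted names of all models derived, directly or not, from dataset."""
    return sorted(n for n in downstream(graph, dataset) if n.startswith("model/"))


if __name__ == "__main__":
    print(impacted_models(LINEAGE, "dataset/patients_2020"))
    # prints ['model/readmission_v1', 'model/readmission_v2']
```

In a relational layout, each hop from feature to derived feature would be another self join on the same table; here the depth of the derivation chain does not change the shape of the query.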

Schad went on to add that ArangoML has connectors for popular ML ecosystems like TensorFlow and PyTorch, and they are now working on Kubeflow integration. Custom integrations can be developed using a Python API. ArangoDB supports clients in Java, JavaScript, Node.js, Go, Python, Elixir, R, and Rust.

Not having a schema, however, is not always a plus. ArangoDB 3.7 introduces JSON schema support, giving users the option to validate all new data written to the database, as well as analyze existing data validity. To us, this looks overdue. JSON schema may not be the most powerful schema mechanism around, but for a database emphasizing JSON, it’s a natural choice.
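To make the idea of validating documents on write concrete, here is a minimal sketch of the kind of check a JSON Schema rule expresses. It is an illustration only: it handles just the `required`, `properties`, and `type` keywords, it runs client-side in Python, and it is not how ArangoDB implements validation. `MODEL_SCHEMA`, `validate`, and the field names are hypothetical.

```python
# Maps JSON Schema type names to Python types (a small subset of the spec).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}

# Hypothetical schema for a document describing a deployed ML model.
MODEL_SCHEMA = {
    "type": "object",
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "accuracy": {"type": "number"},
    },
}


def validate(doc, schema):
    """Return a list of error strings; an empty list means doc passed."""
    errors = []
    for key in schema.get("required", []):
        if key not in doc:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in doc or "type" not in rule:
            continue  # optional property absent, or no type constraint
        value = doc[key]
        expected = JSON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and rule["type"] in ("number", "integer")
        if bool_as_number or not isinstance(value, expected):
            errors.append(f"{key}: expected {rule['type']}")
    return errors


if __name__ == "__main__":
    print(validate({"name": "readmission_v2", "accuracy": 0.91}, MODEL_SCHEMA))
    print(validate({"accuracy": "high"}, MODEL_SCHEMA))
```

The same function can be run over documents that already exist to report which ones would fail, which mirrors the "analyze existing data validity" use mentioned above.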

The key premise of multi-model databases is offering many views over the same data. For ArangoDB, graph is one view, document and key-value are the others. Image: Getty Images/iStockphoto

Although ArangoDB has its own sui generis approach, we noticed that in the last year or so its messaging has shifted a bit from the multi-model aspect to emphasize graph. Its people confirmed that, mentioning they’re seeing a lot of demand for graph. Many users come with a graph use case and expand into multi-model use cases later on.

The ArangoDB team believes, however, that more data models are needed to support efficient and successful graph use cases: graph and beyond, with graph as a central use case. Up until recently, the hype was all around graph, too. But those who were into graph before it was cool knew that hypes come and go, and were expecting this one to subside at some point.

The first sign came last week, with Gartner’s hype cycle for emerging technology in 2020 moving “graphs and ontologies” to the trough of disillusionment. Apart from the fact that conflating graphs and ontologies does not make much sense to us, we see this as a normal phase in the evolution of new, or in this case, not so new but still hyped, technology.

Schad noted that while graph use cases are on the rise, there’s still a lot of trial and error. Although use cases become more mature, some disillusionment in terms of scalability limits does exist. For Weinberger, it’s a good sign that the overall graph story is moving on, but expecting to do everything faster than other databases should not be the main reason people look at graphs.

