
GoodData and Visa: A common data-driven future?

From user to partner and investor. That’s not a very common scenario for software vendors, especially when the user-cum-partner-investor is someone like Visa. GoodData is evolving more than its relationship with select users.

GoodData, one of the key players in business intelligence and analytics, today announced a partnership with Visa. This is an interesting development for both sides, for a number of reasons.

ZDNet connected with Roman Stanek, GoodData Founder and CEO, to discuss the ins and outs of the deal, the data landscape, the way data are used to shape direction for organizations big and small, and what’s next for GoodData.

The evolution of the data ecosystem

Last time we spoke with GoodData executives, it was on the occasion of another major partnership, that with Amazon. We picked up from where we left off, by going over the evolution of that partnership. Stanek said things are coming along great, and the collaboration around AWS Redshift was just the beginning.

Having a partner with such a big presence and important role as Amazon, which is the dominant player in the cloud market, is a great strength for GoodData. Stanek emphasized the broader vision of what he called investment in data, and how organizations can have a return on their data. But he also touched upon certain technical aspects GoodData is making a play on, and how they fit into the big picture.

Kubernetes is prominent among them, and the discussion on AWS naturally brought it to the fore. There has been ample coverage of Kubernetes lately, and it is something we picked up on early. Despite its growing pains around data management, the fact that Kubernetes gives us a de-facto standard for application deployment on-premise and in the cloud is a game-changer.


GoodData wants to act as the enablement layer for its clients to offer analytics in their own offering to their own clients.

GoodData has historically been a managed solution for analytics, running in the cloud. The fact that it was among the first to realize the importance of cloud-based analytics solutions and execute on it has enabled it to get to where it is today. In a way, Kubernetes may enable GoodData to turn the approach on its head, to keep up with the times, and to maintain its lead.

Today, the name of the game is multi-cloud and hybrid cloud. Stanek called this “data balkanization”: Whether by design or not, few organizations put all their data eggs in one cloud basket. And many organizations also maintain on-premise infrastructure.

Although a cloud-based solution like GoodData can ingest data from anywhere, it’s much more efficient to be co-located with the data. Plus, there are always some users that want to have more control over their deployments.

Those users, the “geeky ones” as Stanek called them, can rejoice: GoodData will soon release its re-engineered, Kubernetes-based offering in beta, probably in July 2020. That way, GoodData goes from data consumer to data prosumer. Joining the Kubernetes ecosystem will also let GoodData tap into the Kubernetes observability ecosystem, and make geeky users happy.

Visa went from user to partner and investor, because financial services are all about data

Stanek noted that being an Amazon partner will work great for Kubernetes deployment. GoodData seems to have a way with partnerships, and partnering with Visa is a testament to this. Visa has been a GoodData customer for a while. Stanek said what made Visa go from user to partner and investor was a shared vision around data.

That may be true, but it does not tell us much about Visa’s specific goals. Our more speculative projection: in a global environment where payments are increasingly a data integration, processing, and analytics game, having someone like GoodData on board can come in handy.

Consider the EU’s PSD2 directive, for example. Although GDPR got most of the attention, PSD2 may be even more important. PSD2 requires financial institutions to give third parties access to (parts of) their data and APIs. Those third parties can, given client consent, use that data and those APIs to offer financial services to their own clients.
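
To make the mechanism concrete, here is a minimal Python sketch of the kind of consent-gated data access PSD2 enables. The bank URL, headers, and consent flow are hypothetical placeholders, not any specific bank’s API or open-banking standard.

```python
import requests

# Hypothetical PSD2-style account information request by a licensed third party.
# The endpoint, headers, and consent flow below are illustrative only -- real
# open-banking APIs differ per bank and per standard.

BANK_API = "https://api.example-bank.com/psd2/v1"         # hypothetical base URL
ACCESS_TOKEN = "oauth-token-obtained-after-user-consent"  # placeholder
CONSENT_ID = "consent-granted-by-the-account-holder"      # placeholder

def fetch_transactions(account_id: str) -> list:
    """Fetch transactions the account holder has consented to share."""
    response = requests.get(
        f"{BANK_API}/accounts/{account_id}/transactions",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Consent-ID": CONSENT_ID,  # access is only valid under explicit consent
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["transactions"]

# A third party could then feed this data into its own analytics, such as spending
# categorisation, credit scoring, or a customer-360 view offered back to the client.
```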


Going from user to partner-investor does not happen every day, especially if the user is Visa

Before PSD2 went into effect, we speculated we might see new players such as Facebook move into payments. That prediction was confirmed, most notably with Libra. Libra now seems to be at an impasse, however, with Visa having withdrawn from its consortium along with Mastercard and others.

We posit, however, that Visa would very much like to be able to do what Libra set out to do, albeit not necessarily in the same way. To build a customer-360 view with data coming from many different providers, for example, a tool like GoodData would be very good to have.

The details of the deal with GoodData were not discussed. However, we don’t routinely see payment technology companies like Visa investing in, and partnering with, their software vendors. Melissa McSherry, SVP and global head of Data, Security, and Identity products at Visa, noted that:

“As the world faces pandemic and economic challenges, there’s no better time to invest in areas that will improve the lives of consumers and businesses. With insights from data, we can help sellers, financial institutions, and Visa’s extended global business network better understand and meet consumer needs, especially when those needs are changing fast. Our partnership with GoodData will allow us to do that.”

A data-driven philosophy

Stanek, for his part, noted that GoodData is extremely pleased to have Visa as an investor and partner in its mission to help companies of any size become data companies. We could not help but notice the pandemic reference in Visa’s statement, which provided an opportunity to discuss with Stanek how GoodData’s philosophy is put into practice.

Recently, GoodData helped develop and launch the COVID-19 Commerce Insight project to analyze a billion engagements and 400 million transactions showing the impact of COVID-19 on global and regional consumer spending.

The project is an Emarsys initiative in cooperation with GoodData, and Stanek said they wanted to do more than just another infection rate dashboard. The goal was to put up-to-date consumer data in the hands of business owners, economists, and policymakers, giving them the actionable insights needed to navigate this economic crisis.

The data included in the visualization is an anonymized subset of the online sales of brands in more than 100 countries, based on consumer engagement across more than one billion consumer profiles and 2,500 global brands. It powers analyses such as a look into how, over recent months, e-commerce growth has accelerated to levels that had been forecast for at least five years out.


GoodData was among the first to realize that data is moving to the cloud, and to follow suit.

But how can organizations go beyond data points projected on yet another dashboard, to building data-driven applications? Stanek offered a vivid example, asking rhetorically how users would be able to navigate with their Uber if the data on their ride’s location and speed were laid out on a dashboard.

One of the initiatives GoodData is taking to help organizations go from dashboards to data-driven applications is the Accelerator Toolkit, a UI library for customized and faster data analytics, accompanied by educational resources. Stanek mentioned that GoodData plans to launch a GoodData University initiative soon, to offer more resources to empower organizations.

Another noteworthy development for GoodData is the evolution of its Semantic Layer data model. A new modeling tool by GoodData aims to improve collaboration between engineers and analysts to streamline the start process for enterprise data products.

Stanek initially referred to this as an attempt to establish a single version of the truth. This, however, has always been an elusive goal. While improving collaboration between engineers and analysts is commendable, more pragmatically, organizations can aim to establish shared data models among user groups, rather than global ones.

Stanek did not sound short of ambition, and our conversation touched upon a number of topics. If you want to listen to it in its entirety, make sure to subscribe to the Orchestrate all the Things podcast, where it will be released soon.

NOTE: The article has been edited to clarify that Visa is a payment technology company, not a financial institution.

Content retrieved from: https://www.zdnet.com/article/gooddata-and-visa-a-common-data-driven-future/.


AI chips in 2020: Nvidia and the challengers

Now that the dust from Nvidia’s unveiling of its new Ampere AI chip has settled, let’s take a look at the AI chip market behind the scenes and away from the spotlight

Few people, Nvidia’s competitors included, would dispute the fact that Nvidia is calling the shots in the AI chip game today. The announcement of the new Ampere AI chip in Nvidia’s main event, GTC, stole the spotlight last week.

There’s been ample coverage, including here on ZDNet. Tiernan Ray provided an in-depth analysis of the new and noteworthy with regards to the chip architecture itself. Andrew Brust focused on the software side of things, expanding on Nvidia’s support for Apache Spark, one of the most successful open-source frameworks for data engineering, analytics, and machine learning.

Let’s pick up from where they left off, putting the new architecture into perspective by comparing against the competition in terms of performance, economics, and software.

Nvidia’s double bottom line

The gist of Ray’s analysis is Nvidia’s intention with the new generation of chips: to provide one chip family that can serve both for “training” of neural networks, where the network’s operation is first developed on a set of examples, and for inference, the phase where predictions are made based on new incoming data.

Ray notes this is a departure from today’s situation where different Nvidia chips turn up in different computer systems for either training or inference. He goes on to add that Nvidia is hoping to make an economic argument to AI shops that it’s best to buy an Nvidia-based system that can do both tasks.

“You get all of the overhead of additional memory, CPUs, and power supplies of 56 servers … collapsed into one,” said Nvidia CEO Jensen Huang. “The economic value proposition is really off the charts, and that’s the thing that is really exciting.”

Jonah Alben, Nvidia’s senior VP of GPU Engineering, told analysts that Nvidia had already pushed Volta, Nvidia’s previous-generation chip, as far as it could without catching fire. Nvidia went even further with Ampere, which features 54 billion transistors and can execute 5 petaflops of performance, or about 20 times more than Volta.


Nvidia is after a double bottom line: Better performance and better economics

So, Nvidia is after a double bottom line: better performance and better economics. Let us recall that Nvidia also recently added support for Arm CPUs. Although Arm processors’ performance may not be on par with Intel’s at this point, their frugal power needs make them an attractive option for the data center, too, according to analysts.

On the software front, besides Apache Spark support, Nvidia also unveiled Jarvis, a new application framework for building conversational AI services. To offer interactive, personalized experiences, Nvidia notes, companies need to train their language-based applications on data that is specific to their own product offerings and customer requirements.

However, building a service from scratch requires deep AI expertise, large amounts of data, and compute resources to train the models, and software to regularly update models with new data. Jarvis aims to address these challenges by offering an end-to-end deep learning pipeline for conversational AI.

Jarvis includes state-of-the-art deep learning models, which can be further fine-tuned using Nvidia NeMo, optimized for inference using TensorRT, and deployed in the cloud and at the edge using Helm charts available on NGC, Nvidia’s catalog of GPU-optimized software.

Intel and GraphCore: high profile challengers

Working backward, this is something we have noted time and again about Nvidia: its lead does not just lie in hardware. In fact, Nvidia’s software and partner ecosystem may be the hardest part for the competition to match. The competition is making moves too, however. Some competitors may challenge Nvidia on economics, others on performance. Let’s see what the challengers are up to.

Intel has been working on its Nervana technology for a while. At the end of 2019, Intel made waves when it acquired startup Habana Labs for $2 billion. As analyst Karl Freund notes, after the acquisition Intel has been working on switching its AI acceleration from Nervana technology to Habana Labs.

Freund also highlights the importance of the software stack. He notes that Intel’s AI software stack is second only to Nvidia’s, layered to provide support (through abstraction) of a wide variety of chips, including Xeon, Nervana, Movidius, and even Nvidia GPUs. Habana Labs features two separate AI chips, Gaudi for training, and Goya for inference.

Intel is betting that Gaudi and Goya can match Nvidia’s chips. The MLPerf inference benchmark results published last year were positive for Goya. However, we’ll have to wait and see how it fares against Nvidia’s Ampere and Nvidia’s ever-evolving software stack.


AI chip challenger GraphCore is beefing up Poplar, its software stack

Another high-profile challenger is GraphCore. The UK-based AI chip manufacturer, which has reached unicorn status, has an architecture designed from the ground up for high performance. GraphCore has been keeping busy, too, expanding its market footprint and working on its software.

From Dell’s servers to Microsoft Azure’s cloud and Baidu’s PaddlePaddle hardware ecosystem, GraphCore has a number of significant deals in place. GraphCore has also been working on its own software stack, Poplar. In the last month, Poplar has seen a new version and a new analysis tool.

If Intel has a lot of catching up to do, that certainly also applies to GraphCore. Both vendors seem to be on a similar trajectory, however: aiming to innovate at the hardware level, hoping to challenge Nvidia with a new and radically different approach custom-built for AI workloads, while at the same time working on their software stacks and building their market presence.

Fractionalizing AI hardware with a software solution by Run:AI

Last but not least, there are a few challengers that are less high-profile and have a different approach. Startup Run:AI recently exited stealth mode with the announcement of $13 million in funding for what sounds like an unorthodox solution: rather than offering another AI chip, Run:AI offers a software layer to speed up machine learning workload execution, on-premise and in the cloud.

The company works closely with AWS and is a VMware technology partner. Its core value proposition is to act as a management platform that bridges the gap between different AI workloads and the various hardware chips, delivering an efficient and fast AI computing platform.

Run:AI recently unveiled its fractional GPU sharing for Kubernetes deep learning workloads. Aimed at lightweight AI tasks at scale such as inference, the fractional GPU system gives data science and AI engineering teams the ability to run multiple workloads simultaneously on a single GPU, thus lowering costs.
Run:AI works as an abstraction layer on top of hardware running AI workloads

Omri Geller, Run:AI co-founder and CEO, told ZDNet that Nvidia’s announcement about “fractionalizing” GPUs, or running separate jobs within a single GPU, is revolutionary for GPU hardware. Geller said Run:AI has seen many customers with this need, especially for inference workloads: why use a full GPU for a job that does not require its full compute and memory?

Geller said:

“We believe, however, that this is more easily managed in the software stack than at the hardware level, and the reason is flexibility. While hardware slicing creates ‘smaller GPUs’ with a static amount of memory and compute cores, software solutions allow for the division of GPUs into any number of smaller GPUs, each with a chosen memory footprint and compute power.

In addition, fractionalizing with a software solution is possible with any GPU or AI accelerator, not just Ampere servers – thus improving TCO for all of a company’s compute resources, not just the latest ones. This is, in fact, what Run:AI’s fractional GPU feature enables.”
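
As an illustration of the software-level approach Geller describes (and not Run:AI’s proprietary implementation), here is a minimal PyTorch sketch that caps a process’s share of a single GPU’s memory, so several lightweight inference jobs can share one device.

```python
import torch

# A minimal sketch of software-level GPU "fractionalization": each inference
# process caps its own share of GPU memory so several lightweight jobs can
# share one physical GPU. This illustrates the general idea only -- it is not
# Run:AI's implementation, which adds scheduling and orchestration on top.

def run_inference_worker(model: torch.nn.Module, memory_fraction: float = 0.25):
    """Run a worker that may use only a fraction of GPU 0's memory."""
    if not torch.cuda.is_available():
        raise RuntimeError("This sketch assumes a CUDA-capable GPU is present.")

    # Cap this process's allocations at, e.g., 25% of the device's memory.
    torch.cuda.set_per_process_memory_fraction(memory_fraction, device=0)

    model = model.to("cuda").eval()
    with torch.no_grad():
        batch = torch.randn(8, 3, 224, 224, device="cuda")  # dummy input batch
        return model(batch)

# Launching four such workers as separate processes with memory_fraction=0.25
# would let them share a single GPU, much like the "smaller GPUs" Geller
# describes, with the split decided in software rather than fixed in hardware.
```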

An accessibility layer for FPGAs with InAccel

InAccel is a Greek startup built around the premise of providing an FPGA manager that allows the distributed acceleration of large data sets across clusters of FPGA resources using simple programming models. Founder and CEO Chris Kachris told ZDNet there are several arguments regarding the advantages of FPGAs versus GPUs, especially for AI workloads.

Kachris noted FPGAs can provide better energy efficiency (performance per watt) in some cases, and they can also achieve lower latency than GPUs for deep neural networks (DNNs). For DNNs, Kachris went on to add, FPGAs can achieve high throughput using low batch sizes, resulting in much lower latency. In applications where latency and energy efficiency are critical, FPGAs can prevail.
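
A back-of-the-envelope calculation helps illustrate why small batch sizes matter for latency; the numbers below are purely illustrative, not measurements of any particular GPU or FPGA.

```python
# Illustrative numbers only -- not measured figures for any specific accelerator.
# Hardware that needs large batches to reach peak throughput makes each request
# wait for the whole batch to fill; hardware that is efficient at batch size 1
# can answer each request as it arrives.

requests_per_second = 1000   # incoming inference requests
batch_compute_ms = 10.0      # time to process one batch, assumed constant here

def worst_case_latency_ms(batch_size: int) -> float:
    """Time for the first request to wait for the batch to fill, plus compute time."""
    fill_time_ms = batch_size / requests_per_second * 1000.0
    return fill_time_ms + batch_compute_ms

for batch_size in (1, 32, 128):
    print(f"batch size {batch_size:>3}: ~{worst_case_latency_ms(batch_size):.0f} ms per request")

# batch size   1: ~11 ms per request
# batch size  32: ~42 ms per request
# batch size 128: ~138 ms per request
```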

However, scalable deployment of FPGA clusters remains challenging, and this is the problem InAccel is out to solve. Its solutions aim to provide scalable deployment of FPGA clusters, supplying the missing abstraction — an OS-like layer for the FPGA world. InAccel’s orchestrator allows easy deployment, instant scaling, and automated resource management of FPGA clusters.

Kachris likened InAccel to VMware / Kubernetes, or Run.ai / Bitfusion, for the FPGA world. He claimed InAccel makes FPGAs easier for software developers to use, and noted that FPGA vendors like Intel and Xilinx have recognized the importance of a strong ecosystem and have formed strong alliances to help expand theirs:

“It seems that cloud vendors will have to provide a diverse and heterogeneous infrastructure as different platforms have pros and cons. Most of these vendors provide fully heterogeneous resources (CPUs, GPUs, FPGAs, and dedicated accelerators), letting users select the optimum resource.

Several cloud vendors, such as AWS and Alibaba, have started deploying FPGAs because they see the potential benefits. However, FPGA deployment is still challenging as users need to be familiar with the FPGA tool flow. We enable software developers to get all the benefits of FPGAs using familiar PaaS and SaaS models and high-level frameworks (Spark, Scikit-learn, Keras), making FPGA deployment in the cloud much easier.”
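
The developer experience Kachris describes would look roughly like the sketch below: ordinary scikit-learn code, with FPGA dispatch happening behind the scenes. The acceleration hook mentioned in the comments is a hypothetical placeholder, not InAccel’s actual package or API.

```python
# Hypothetical sketch of the "same code, different backend" experience Kachris
# describes. The scikit-learn part is standard; the accelerator dispatch is a
# placeholder -- it does NOT reflect InAccel's actual package or API.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

# Plain scikit-learn: runs on CPU.
model = LogisticRegression(max_iter=200).fit(X, y)
print("CPU-trained accuracy:", model.score(X, y))

# With an FPGA manager in place, the promise is that the same high-level code
# would be dispatched transparently to FPGA clusters (for example, via a
# swapped-in estimator or runtime hook), without the developer ever touching
# the FPGA tool flow.
```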

Hedge your bets

It takes more than fast chips to be the leader in this field. Economics is one aspect potential users need to consider; ecosystem and software are another. Taking everything into account, it seems like Nvidia is still ahead of the competition.

It’s also interesting to note, however, that this is starting to look less and less like a monoculture. Innovation is coming from different places, and in different shapes and forms. This is something Nvidia’s Alben acknowledged too. And it’s certainly something cloud vendors, server vendors, and application builders seem to be taking note of.

Hedging one’s bets in the AI chip market may be the wise thing to do.

Content retrieved from: https://www.zdnet.com/article/ai-chips-in-2020-nvidia-and-the-challengers/.


Scientific fact-checking using AI language models: COVID-19 research and beyond

Fact or fiction? That’s not always an easy question to answer. Incomplete knowledge, context and bias typically come into play. In the nascent domain of scientific fact checking, things are complicated.

If you think fact-checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it’s also hard — different in some ways, similar in some others.

Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on the pre-print server arXiv by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. ZDNet connected with David Wadden, lead author of the paper and a visiting researcher at AI2, to discuss the rationale, details, and directions for this work.

What is scientific fact checking?

Although the authors of the paper refer to their work as scientific fact-checking, we believe it’s important to clarify semantics before going any further. Verifying scientific claims refers to the process of proving or disproving (with some degree of certainty) claims made in scientific research papers. It does not refer to a scientific method of doing “regular” fact-checking.

Fact-checking, as defined by the authors, is a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim. A claim is defined as an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. This research area has seen increased attention, motivated by the proliferation of misinformation in political news, social media, and on the web.

In turn, interest in fact-checking has spurred the creation of many datasets across different domains to support research and development of automated fact-checking systems. Yet, it seems like up to this point no such dataset exists to facilitate research on another important domain for fact-checking – scientific literature.


Plain old fact checking is hard, and most people don’t do it. If you think scientific fact checking may be easier, think again

The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact-checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, since successfully verifying most scientific claims requires expert background knowledge, complex language understanding, and reasoning capability.

The AI2 researchers introduce the task of scientific fact-checking. To facilitate research on this task, they constructed SCIFACT, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision.

To curate this dataset, the team used a novel annotation protocol that takes advantage of a plentiful source of naturally occurring claims in the scientific literature — citation sentences, or “citances.”
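
To make the shape of such a dataset concrete, here is a hypothetical claim record in Python; the field names are illustrative and may not match the exact schema of the released SCIFACT files.

```python
# A hypothetical claim record, illustrating the kind of structure described
# above: a claim, the abstracts checked against it, a SUPPORTS/REFUTES label,
# and the rationale sentences justifying that label. Field names are
# illustrative and may not match the exact schema of the released dataset.

claim_record = {
    "id": 42,
    "claim": "The R0 of the novel coronavirus is 2.5",
    "evidence": {
        "abstract_1234": {                  # one entry per cited abstract
            "label": "SUPPORTS",            # SUPPORTS / REFUTES
            "rationale_sentences": [3, 5],  # indices of sentences in the abstract
        },
        "abstract_5678": {
            "label": "REFUTES",
            "rationale_sentences": [1],
        },
    },
    # Abstracts with no decisive evidence yield NOT ENOUGH INFO at prediction time.
}
```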

Why, and how, does one do scientific fact checking?

Wadden, a graduate student at the University of Washington with a background in Physics, Computational Biology, and Natural Language Processing (NLP), shared an interesting story about what motivated him to start this work. In addition to the well-known issue of navigating huge bodies of scientific knowledge, personal experience played its part, too.

Wadden was briefly considering a career as an opera singer when he had a vocal injury. He visited a number of doctors for consultations, and received a number of recommendations for potential treatments. Although they were all good doctors, Wadden observed that none of them was able to provide data such as the percentage of patients for which a given approach works.

Wadden’s situation was not dramatic, but he could not help thinking about what would happen if it were. He felt the information he was given was too incomplete to make informed decisions, and he believed this had to do with the fact that finding that information is not easy for doctors.


Scientific fact checking is about checking claims made in scientific papers in a scientific way, and automating the process as much as possible. Image: Allen Institute for AI

The work uses a dataset specifically aimed at fact-checking COVID-19-related research. Wadden explained that the team set out to do this work in October 2019, before COVID-19 was a thing. However, they soon realized what was going on, and decided to make COVID-19 their focus.

Besides the SCIFACT dataset, the research also features the SCIFACT task, and the VERISCI baseline model. In a nutshell, they can be summarized as creating a dataset by manually annotating scientific papers and generating claims, evaluating claims, and creating a baseline AI language model for claim evaluation.

The annotation process, described in detail in the paper, is both a necessity, and a limiting factor. It is a necessity because it takes expert knowledge to be able to process citations, ask the right questions, and find the right answers. It is a limiting factor because relying on manual labor makes the process hard to scale, and it introduces bias.

Can there be bias in science?

Today, NLP is largely powered by machine learning. The SCIFACT team developed VERISCI based on BERT, Google’s deep-learning language model. Machine learning algorithms need training data, and training data need processing and annotation by humans. This is a labor-intensive task. Relying on people to process large datasets means the process is slow and expensive, and results can be partial.

Large annotated datasets for NLP, and specifically for fact-checking, do exist, but scientific fact checking is special. When dealing with common-sense reasoning, Mechanical Turk workers are typically asked to annotate datasets. In scientific work, however, expert knowledge is needed to understand, evaluate, and process claims contained in research papers.

The SCIFACT team hired Biology undergrad and grad students for this job. Wadden is fully aware of the limitations this poses to scaling the approach up, and is considering crowdsourcing, hiring medical professionals via a recruitment platform, or assigning many Mechanical Turk workers to annotate the same work, and then averaging their answers, knowing each one will be imperfect.


Science is not infallible. It, too, can introduce bias via imperfect data and methods. And even researchers with the best of intentions don’t always agree on everything – this is part of the process

Bias can be introduced in all moving parts of the process: what papers are picked, what claims are checked for each paper, what citations are checked for each claim, and how each citation is ranked. In other words: if research X supports claim A, while research Y contradicts it, what are we to believe? Not to mention, if research Y is not in the dataset, we’ll never know about its findings.

In COVID-19 times, as many people have turned armchair epidemiologists, this is something to keep in mind: Science, and data science, are not always straightforward processes that produce definitive, undisputed results. Wadden, for one, is very aware of the limitations of this research. Although the team has tried to mitigate those limitations, Wadden acknowledges this is just a first step in a long and winding road.

One way the SCIFACT team tried to address bias in selecting claims was to extract them from citations: they only considered claims for which a paper was cited. Furthermore, they applied a series of techniques to get results of as high quality as possible.

The paper selection process is driven by an initial body of seed papers, and citations that reference those papers are examined. Only papers that have been cited at least 10 times can be part of the seed set, in an effort to select the most important ones. A technique called citation intent classification is used to identify the reason a paper is cited; only citations referring to findings were processed.
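
A minimal sketch of this selection logic might look as follows; the paper objects and intent labels are hypothetical stand-ins, not the actual SCIFACT corpus or classifier.

```python
# Hypothetical sketch of the selection logic described above: keep only
# well-cited seed papers, and only the citation sentences (citances) whose
# intent is to report a finding. Data structures are illustrative stand-ins.

papers = [
    {
        "id": "p1",
        "citation_count": 153,
        "incoming_citances": [  # sentences in other papers that cite this one
            {"text": "Smith et al. showed that drug A reduces mortality [12].",
             "intent": "result"},
            {"text": "We follow the protocol of [12].", "intent": "method"},
        ],
    },
    {"id": "p2", "citation_count": 4, "incoming_citances": []},  # too rarely cited
]

MIN_CITATIONS = 10

# Keep only papers cited at least 10 times as seed papers.
seed_papers = [p for p in papers if p["citation_count"] >= MIN_CITATIONS]

# Only citances classified as reporting a finding become candidate claims.
candidate_citances = [
    citance
    for paper in seed_papers
    for citance in paper["incoming_citances"]
    if citance["intent"] == "result"
]

print(len(seed_papers), "seed papers,", len(candidate_citances), "candidate citances")
```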

Promising results

Another important thing to note is that claims are evaluated based on the abstract of the paper they cite. This is done for simplicity, as the underlying assumption seems to be that if a finding is key to a paper, it will be mentioned in the paper’s abstract. It would be hard for a language model to evaluate a claim based on the entire text of a scientific paper.

Claims found in papers may have multiple citations. For example, the claim “The R0 of the novel coronavirus is 2.5” may cite several papers with supporting evidence. In those cases, each citation is processed independently, and a result as to whether it supports or refutes the claim, or a conclusive decision cannot be made, is obtained for each.

Wadden’s team used the SCIFACT dataset and annotation process to develop and train the VERISCI model. VERISCI is a pipeline of three components: abstract retrieval, which retrieves the abstracts with the highest similarity to the claim; rationale selection, which identifies rationale sentences for each candidate abstract; and label prediction, which makes the final label prediction.

Given a claim and a corpus of papers, VERISCI must predict a set of evidence abstracts. For each abstract in the corpus, it must predict a label and a collection of rationale sentences. Although the annotations provided by the annotators may contain multiple separate rationales, the model must simply predict a single collection of rationale sentences; these sentences may come from multiple annotated rationales.
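
A rough outline of such a three-stage pipeline is sketched below. TF-IDF similarity is a common choice for the first-pass retrieval step, while the two neural stages are left as naive placeholders, so this should be read as an outline of the architecture rather than the authors’ exact implementation.

```python
# Rough outline of a three-stage verification pipeline: (1) retrieve candidate
# abstracts for a claim, (2) select rationale sentences, (3) predict a label
# (SUPPORTS / REFUTES / NOT ENOUGH INFO) per abstract. TF-IDF retrieval is a
# common first-pass choice; the neural stages here are naive placeholders,
# not the authors' exact models.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_abstracts(claim, abstracts, k=3):
    """Return indices of the k abstracts most similar to the claim."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(abstracts)
    claim_vector = vectorizer.transform([claim])
    scores = cosine_similarity(claim_vector, doc_vectors).ravel()
    return scores.argsort()[::-1][:k].tolist()

def select_rationales(claim, abstract):
    """Placeholder: a trained sentence classifier would pick rationale sentences."""
    return [s.strip() for s in abstract.split(". ") if s.strip()]  # naive stand-in

def predict_label(claim, rationales):
    """Placeholder: a trained model would output SUPPORTS / REFUTES / NOT ENOUGH INFO."""
    return "NOT ENOUGH INFO" if not rationales else "SUPPORTS"

def verify(claim, corpus):
    """Run the full pipeline and return {abstract index: (label, rationales)}."""
    results = {}
    for idx in retrieve_abstracts(claim, corpus):
        rationales = select_rationales(claim, corpus[idx])
        results[idx] = (predict_label(claim, rationales), rationales)
    return results
```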


Where there are gray areas, they need to be pinpointed, and measures such as rewriting the original claims for clarity need to be taken. Image: Allen Institute for AI

The team experimented to establish a performance baseline on SCIFACT using VERISCI, analyzed the performance of the three components of VERISCI, and demonstrated the importance of in-domain training data. Qualitative results on verifying claims about COVID-19 using VERISCI were promising.

For roughly half of the claim-abstract pairs, VERISCI correctly identifies whether an abstract supports or refutes a claim, and provides reasonable evidence to justify the decision. Given the difficulty of the task and limited in-domain training data, the team considers this a promising result, while leaving plenty of room for improvement.

Some exploratory experiments to fact-check claims concerning COVID-19 were also conducted. A medical student was tasked with writing 36 COVID-19-related claims. VERISCI was used to predict evidence abstracts, and the same medical student annotator assigned a label to each claim-abstract pair.

For the majority of these COVID-related claims (23 out of 36), the rationales produced by VERISCI were deemed plausible by the annotator. The sample is really small; however, the team believes that VERISCI is able to successfully retrieve and classify evidence in many cases.

Complicated process, instructive work

There are a number of future directions for this work. Besides expanding the dataset and generating more annotations, adding support for partial evidence, modeling contextual information, and evidence synthesis are important areas for future research.

Expanding the system to include partial support is an interesting topic. Not all decisions can be clear-cut. A typical example is when we have a claim about drug X’s effectiveness. If a paper reports the effectiveness of the drug in mice, or in limited clinical testing on humans, this may offer inconclusive support for the claim.

Initial experiments showed a high degree of disagreement among expert annotators as to whether certain claims were fully, partially, or not at all supported by certain research findings. Sound familiar? In those gray area scenarios, the goal is to be able to better identify the situation. What the team wants to do is to edit the claim to reflect the inconclusiveness.

Modeling contextual information has to do with identifying implicit references. Initially, annotators were instructed to identify primary and supplemental rationale sentences for each rationale. Primary sentences are those needed to verify the claim, while supplemental sentences provide important context that is missing from the primary sentences but is still necessary to determine whether a claim is supported or refuted.

For example, if a claim mentions “experimental animals” and a rationale sentence mentions “test group”, whether they refer to the same thing is not always straightforward. Again, a high degree of disagreement was noted among human experts in such scenarios. Thus, supplemental rationale sentences were removed from the dataset, and the team continues to work with annotators on improving agreement.

Last but not least: Evidence synthesis basically means that not all evidence is created equal, and that should probably be reflected in the decision-making process somehow. To use an extreme example: currently, a pre-print that has not undergone peer review and a paper with 1000 citations are treated equally. They probably should not.

An obvious thing to do here would be to use a sort of PageRank for research papers, i.e. an algorithm that does for research what Google does for the web – pick out the relevant stuff. Such algorithms already exist, for example for calculating impact factors. But then again, this is another gray area.

This work is not the only example of what we would call meta-research triggered by COVID-19: research on how to facilitate research, in an effort to speed up the process of understanding and combating COVID-19. We have seen, for example, how other researchers are using knowledge graphs for the same purpose.

Wadden posits that these approaches could complement one another. For example, where knowledge graphs have an edge between two nodes, asserting a type of relationship, SCIFACT could provide the text on the basis of which the assertion was made.

For the time being, the work will be submitted for peer review. It’s instructive, because it highlights the strengths and weaknesses of the scientific process. And despite its shortcomings, it reminds us of the basic premises in science: peer review, and intellectual honesty.

Content retrieved from: https://www.zdnet.com/article/scientific-fact-checking-using-ai-language-models-covid19-research-and-beyond/.

Categories
knowledge connexions

Scientific fact-checking using AI language models: COVID-19 research and beyond

Fact or fiction? That’s not always an easy question to answer. Incomplete knowledge, context and bias typically come into play. In the nascent domain of scientific fact checking, things are complicated.

If you think fact-checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it’s also hard — different in some ways, similar in some others.

Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on pre-print server Arxiv by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. ZDNet connected with David Wadden, lead author of the paper and a visiting researcher at AI2, to discuss the rationale, details, and directions for this work.

What is scientific fact checking?

Although the authors of the paper refer to their work as scientific fact-checking, we believe it’s important to clarify semantics before going any further. Verifying scientific claims refers to the process of proving or disproving (with some degree of certainty) claims made in scientific research papers. It does not refer to a scientific method of doing “regular” fact-checking.

Fact-checking, as defined by the authors, is a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim. A claim is defined as an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. This research area has seen increased attention, motivated by the proliferation of misinformation in political news, social media, and on the web.

In turn, interest in fact-checking has spurred the creation of many datasets across different domains to support research and development of automated fact-checking systems. Yet, it seems like up to this point no such dataset exists to facilitate research on another important domain for fact-checking – scientific literature.

fact-sign-thumb.jpg

Plain old fact checking is hard, and most people don’t do it. If you think scientific fact checking may be easier, think again

The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact-checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, since successfully verifying most scientific claims requires expert background knowledge, complex language
understanding, and reasoning capability.

The AI2 researchers introduce the task of scientific fact-checking. To facilitate research on this task, they constructed SCIFACT, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision.

To curate this dataset, a novel annotation protocol that takes advantage of a plentiful source of naturally-occurring claims in the scientific literature — citation sentences, or “citances” — was used.

Why, and how, does one do scientific fact checking?

Wadden, a graduate student in the University of Washington with a background in Physics, Computational Biology, and Natural Language Processing (NLP), shared an interesting story on what motivated him to start this work. In addition the well-known issue of navigating huge bodies of scientific knowledge, personal experience played its part too.

Wadden was briefly considering a career as an opera singer, when he had a vocal injury. He visited a number of doctors for consultations, and received a number of recommendations for potential treatments. Although they were all good doctors, Wadden observed none of them was able to provide data such as the percentage of patients for which the approach works.

Wadden’s situation was not dramatic, but he could not help but think about what would happen if it was. He felt the information he was given was incomplete to able to make informed decisions, and he believed it had to do with the fact that finding that information is not easy for doctors.

opera-snapshot-2020-05-24-115949-arxiv-org.png

Scientific fact checking is about checking claims made in scientific papers in a scientific way, and automating the process as much as possible. Image: Allen Institute for AI

The work uses a dataset specifically aimed at fact-checking COVID-19-related research. Wadden explained that the team set out to do this work in October 2019, before COVID-19 was a thing. However, they soon realized what was going on, and decided to make COVID-19 their focus.

Besides the SCIFACT dataset, the research also features the SCIFACT task, and the VERISCI baseline model. In a nutshell, they can be summarized as creating a dataset by manually annotating scientific papers and generating claims, evaluating claims, and creating a baseline AI language model for claim evaluation.

The annotation process, described in detail in the paper, is both a necessity, and a limiting factor. It is a necessity because it takes expert knowledge to be able to process citations, ask the right questions, and find the right answers. It is a limiting factor because relying on manual labor makes the process hard to scale, and it introduces bias.

Can there be bias in science?

Today NLP is largely powered by machine learning. SCIFACT developed VERISCI based on BERT, Google’s deep-learning language model. Machine learning algorithms need training data, and training data need processing and annotation by humans. This is a labor-intensive task. Relying on people to process large datasets means the process is slow and expensive, and results can be partial.

Large annotated datasets for NLP, and specifically for fact-checking do exist, but scientific fact checking is special. When dealing with common sense reasoning, Mechanical Turk workers are typically asked to annotate datasets. In scientific work, however, expert knowledge is needed to be able to understand, evaluate and process claims contained in research papers.

The SCIFACT team hired Biology undergrad and grad students for this job. Wadden is fully aware of the limitations this poses to scaling the approach up, and is considering crowdsourcing, hiring medical professionals via a recruitment platform, or assigning many Mechanical Turk workers to annotate the same work, and then averaging their answers, knowing each one will be imperfect.

magnifying-glass-technology.jpg

Science is not infallible. It, too, can introduce bias via imperfect data and methods. And even researchers with the best of intentions don’t always agree on everything – this is part of the process

Bias can be introduced in all moving parts of the process: what papers are picked, what claims are checked for each paper, what citations are checked for each claim, and how each citation is ranked. In other words: if research X supports claim A, while research Y contradicts it, what are we to believe? Not to mention, if research Y is not in the dataset, we’ll never know about its findings.

In COVID-19 times, as many people have turned armchair epidemiologists, this is something to keep in mind: Science, and data science, are not always straightforward processes that produce definitive, undisputed results. Wadden, for one, is very aware of the limitations of this research. Although the team has tried to mitigate those limitations, Wadden acknowledges this is just a first step in a long and winding road.

One way the SCIFACT team tried to address bias in selecting claims is that they extracted them from citations: They only considered claims where a paper was cited. Furthermore, they applied a series of techniques to get as high quality results as possible.

The paper selection process is driven by an initial body of seed papers: citations that reference those papers are examined. Only papers that have been cited at least 10 times can be part of the seed set, in an effort to select the most important ones. A technique called citation intent classification is used. The technique tries to identify the reason a paper is cited. Only citations referring to findings were processed.

Promising results

Another important thing to note is that claims are evaluated based on the abstract of the paper they cite. This is done for simplicity, as the underlying assumption seems to be that if a finding is key to a paper, it will be mentioned in the paper’s abstract. It would be hard for a language model to evaluate a claim based on the entire text of a scientific paper.

Claims found in papers may have multiple citations. For example, the claim “The R0 of the novel coronavirus is 2.5” may cite several papers with supporting evidence. In those cases, each citation is processed independently, and a result as to whether it supports or refutes the claim, or a conclusive decision cannot be made, is obtained for each.

Wadden’s team used the SCIFACT dataset and annotation process to develop and train the VERISCI model. VERISCI is a pipeline of three components: Abstract retrieval, which retrieves abstracts with highest similarity to the. Rationale selection, which identifies rationals for each candidate abstract. Label prediction, which makes the final label prediction.

Given a claim and a corpus of papers, VERISCI must predict a set of evidence abstracts. For each abstract in the corpus, it must predict a label, and a collection of rationale sentences. Although the annotations provided by the annotators may contain multiple separate rationales, the model must simply to predict a single collection of rationale sentences; these sentences may come from multiple annotated rationales.

opera-snapshot-2020-05-24-120058-arxiv-org.png

Where there are gray areas, they need to be pinpointed, and measures such as rewriting the original claims for clarity need to be taken. Image: Allen Institute for AI

The team experimented to establish a performance baseline on SCIFACT using VERISCI, analyzed the performance of the three components of VERISCI, and demonstrated the importance of in-domain training data. Qualitative results on verifying claims about COVID-19 using VERISCI were promising.

For roughly half of the claim-abstract pairs, VERISCI correctly identifies whether an abstract supports or refutes a claim, and provides reasonable evidence to justify the decision. Given the difficulty of the task and limited in-domain training data, the team considers this a promising result, while leaving plenty of room for improvement.

Some exploratory experiments to fact-check claims concerning COVID-19 were also conducted. A medical student was tasked with writing 36 COVID-19-related claims. VERISCI was used to predict evidence abstracts. The same medical student annotator then assigned a label to each claim-abstract pair.

For the majority of these COVID-related claims (23 out of 36), the rationales produced by VERISCI were deemed plausible by the annotator. The sample is admittedly small, but the team believes that VERISCI is able to successfully retrieve and classify evidence in many cases.

Complicated process, instructive work

There are a number of future directions for this work. Besides expanding the dataset and generating more annotations, adding support for partial evidence, modeling contextual information, and evidence synthesis are important areas for future research.

Expanding the system to include partial support is an interesting topic. Not all decisions can be clear-cut. A typical example is when we have a claim about drug X’s effectiveness. If a paper reports the effectiveness of the drug in mice, or in limited clinical testing on humans, this may offer inconclusive support for the claim.

Initial experiments showed a high degree of disagreement among expert annotators as to whether certain claims were fully, partially, or not at all supported by certain research findings. Sound familiar? In those gray area scenarios, the goal is to be able to better identify the situation. What the team wants to do is to edit the claim to reflect the inconclusiveness.

Modeling contextual information has to do with identifying implicit references. Initially, annotators were instructed to identify primary and supplemental rationale sentences for each rationale. Primary sentences are those needed to verify the claim, while supplemental sentences provide important context, missing from the primary sentences, that is still necessary to determine whether a claim is supported or refuted.

For example, if a claim mentions “experimental animals” and a rationale sentence mentions “test group”, whether they refer to the same thing is not always straightforward. Again, a high degree of disagreement was noted among human experts in such scenarios. Thus, supplemental rationale sentences were removed from the dataset, and the team continues to work with annotators on improving agreement.

Last but not least: Evidence synthesis basically means that not all evidence is created equal, and that should probably be reflected in the decision-making process somehow. To use an extreme example: currently, a pre-print that has not undergone peer review and a paper with 1000 citations are treated equally. They probably should not.

An obvious thing to do here would be to use a sort of PageRank for research papers, i.e. an algorithm that does for research what Google does for the web – pick out the relevant stuff. Such algorithms already exist, for example for calculating impact factors. But then again, this is another gray area.
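
As a toy illustration of what evidence weighting could look like, the sketch below discounts non-peer-reviewed sources and applies diminishing returns to citation counts; the weighting scheme is entirely made up for illustration and is not something the paper proposes.

import math

def evidence_weight(citation_count, peer_reviewed):
    base = math.log1p(citation_count)                # diminishing returns on citations
    return base if peer_reviewed else 0.5 * base     # discount non-peer-reviewed pre-prints

def aggregate_verdict(evidence):
    # evidence: list of (label, citation_count, peer_reviewed) tuples
    score = 0.0
    for label, citations, reviewed in evidence:
        weight = evidence_weight(citations, reviewed)
        if label == "SUPPORTS":
            score += weight
        elif label == "REFUTES":
            score -= weight
    return "SUPPORTS" if score > 0 else "REFUTES" if score < 0 else "INCONCLUSIVE"

print(aggregate_verdict([("SUPPORTS", 1000, True), ("REFUTES", 0, False)]))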

This work is not the only example of what we would call meta-research triggered by COVID-19: research on how to facilitate research, in an effort to speed up the process of understanding and combating COVID-19. We have seen, for example, how other researchers are using knowledge graphs for the same purpose.

Wadden posits that these approaches could complement one another. For example, where knowledge graphs have an edge between two nodes, asserting a type of relationship, SCIFACT could provide the text on the basis of which the assertion was made.

For the time being, the work will be submitted for peer review. It’s instructive, because it highlights the strengths and weaknesses of the scientific process. And despite its shortcomings, it reminds us of the basic premises in science: peer review, and intellectual honesty.

Content retrieved from: https://www.zdnet.com/article/scientific-fact-checking-using-ai-language-models-covid19-research-and-beyond/.

Categories
knowledge connexions

Compute to data: using blockchain to decentralize data science and AI with the Ocean Protocol

The conflict between access to data and data sovereignty is key to understanding how AI works, and moving it forward. The Ocean Protocol Foundation wants to help resolve that conflict, by introducing a way of letting AI work with data without giving up control.

AI and its machine learning algorithms need data to work. By now, that’s a known fact. It’s not that algorithms don’t matter, it’s just that typically, getting more data, better data, helps come up with better results more than tweaking algorithms. The unreasonable effectiveness of data.

More data, and more compute capacity to train algorithms that use the data, is what has been fueling the rise of AI. Anyone who wants to train an algorithm for an AI application to address any problem in any domain must be able to get lots of relevant data in order to be successful.

That data can be public data, private data generated and owned by the organization developing the application, or private data acquired from third parties. Public data is not an issue. Privately owned data must be collected and processed in accordance with data protection laws such as GDPR and CCPA.

But what about private data owned by 3rd parties? Normally, application developers don’t have access to those, and for good reasons. Why would you trust anyone with your private data? Even if the party you hand it over to promises to take good care of the data, once the data is out of your hands, anyone can do as they please with it.

This is the problem the non-profit Ocean Protocol Foundation (OPF) wants to solve. ZDNet connected with Founder Trent McConaghy, to discuss OPF’s mission and the latest milestone achieved – Compute-to-Data.

Compute-to-Data: if the data will not come to compute, then compute must go to the data

McConaghy has been working on the Ocean Protocol since 2017. He has a background in AI and blockchain, having worked on projects such as ascribe and BigchainDB. He described how he realized that blockchain could help solve the issues of data escapes and privacy for data used to train AI algorithms.

The OPF has been working on setting up the infrastructure to enable better accessibility to data via data marketplaces. As McConaghy pointed out, there have been many attempts at data marketplaces in the past, but they’ve always been custodial, which means the data marketplace is a middleman users have to trust. Recent case in point – Surgisphere.

But what if you could have marketplaces act as the connector without them actually holding the data, without having to trust the marketplace? This is what OPF is out to achieve – decentralized data marketplaces.

ocean-computetodata.jpg

The Ocean Protocol Compute-to-Data lets AI algorithms use data, without moving the data from where they are

This is a tall order, and McConaghy is quick to admit that it will take years to get there. Last week, however, the OPF came one step closer by unveiling what it calls Compute-to-Data. Compute-to-Data provides a means to exchange data while preserving privacy: the data stays on-premise with the data provider, while data consumers can run compute jobs on it to train AI models.

Rather than having the data sent to where the algorithm runs, the algorithm runs where the data is. The idea is very similar to federated learning. The difference, McConaghy says, is that federated learning only decentralizes the last mile of the process, while Compute-to-Data goes all the way.
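
In rough pseudocode terms, the flow might look like the sketch below; the class and method names are invented for illustration and bear no relation to the actual Ocean Protocol APIs.

class DataProvider:
    def __init__(self, dataset):
        self._dataset = dataset          # the raw data never leaves the provider

    def run_job(self, train_fn):
        # The consumer's algorithm runs next to the data; only the resulting
        # model (or aggregate) travels back over the wire.
        return train_fn(self._dataset)

def train_fn(dataset):
    # Stand-in for the consumer's training code, here a trivial aggregate.
    return {"n": len(dataset), "mean": sum(dataset) / len(dataset)}

provider = DataProvider([1.0, 2.0, 3.0, 4.0])
model = provider.run_job(train_fn)       # the consumer receives results, not data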

TensorFlow Federated (TFF) and OpenMined are the most prominent federated learning projects. TFF does orchestration in a centralized fashion, while OpenMined is decentralized. In TFF-style federated learning, a centralized entity (e.g. Google) must perform the orchestration of compute jobs across silos. Personally identifiable information can leak to this entity.

OpenMined addresses this via decentralized orchestration. But its software infrastructure could use improvement to manage computation at each silo in a more secure fashion; this is where Compute-to-Data can help, says McConaghy. That’s all fine and well, but what about performance?

If algorithms run where the data is, how fast they run depends on the resources available at the host. The time needed to train algorithms that way may thus be longer than in the centralized scenario, once the overhead of communications and crypto is factored in. In a typical scenario, compute needs simply move from the client side to the data host side, said McConaghy:

“Compute needs don’t get higher or lower, they simply get moved. Ocean Compute-to-Data supports Kubernetes, which allows massive scale-up of compute if needed. There’s no degradation of compute efficiency if it’s on the host data side. There’s a bonus: the bandwidth cost is lower, since only the final model has to be sent over the wire, rather than the whole dataset.

There’s another flow where Ocean Compute-to-Data is used to compute anonymized data. For example using Differential Privacy, or Decoupled Hashing. Then that anonymized data would be passed to the client side for model building there. In this case most of the compute is client-side, and bandwidth usage is higher because the (anonymized) dataset is sent over the wire. Ocean Compute-to-Data is flexible enough to accommodate all these scenarios”.
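
As a toy illustration of the anonymization flow McConaghy mentions, the snippet below releases a differentially private mean instead of raw records; the epsilon, clipping bounds, and noise mechanism are arbitrary assumptions, not part of Ocean’s stack.

import random

def dp_mean(values, lower, upper, epsilon=1.0):
    # Clip values, compute the mean, then add Laplace noise scaled to the
    # sensitivity of the clipped mean (Laplace via difference of two exponentials).
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return true_mean + noise

print(dp_mean([23, 41, 37, 29, 52], lower=0, upper=100))   # noisy, shareable aggregate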

From Don’t be Evil to Can’t be Evil

The OPF has raised funding and built a well-versed team between 2017 and 2020. In order to realize the vision of decentralized data marketplaces, the OPF works in two ways. First, by eating its own dog food, working to develop its community-driven marketplace. Second, by facilitating others to build their marketplaces. McConaghy mentioned examples such as MOBI and dexFreight.

The Mobility Open Blockchain Initiative (MOBI) is a nonprofit organization working with companies, governments, and NGOs. The goal is to make mobility services more efficient, affordable, greener, safer, and less congested by promoting standards and accelerating adoption of blockchain and related technologies. The OPF helps make data and services available to solve challenges to coordinate vehicles, identify obstacles and route autonomous cars.

DexFreight, providers of a blockchain-based logistics platform, and OPF have partnered to launch the first Web3 data marketplace for the transportation and logistics industry. The marketplace will enable companies to aggregate and monetize operational data, focusing initially on serving U.S.-based truckload transportation providers.

McConaghy emphasized that the OPF typically does not work directly with users. Its role is to develop the core technology, and empower others to use it. Asked what he sees as the advantages of developing a decentralized marketplace, McConaghy said that it enables organizations to turn data from a potential liability to an asset, without compromising user privacy.

1-1tebcfe0d5qzw8symo1-hg.png

Ocean Protocol Integration for COVID-19 data

He went on to cite examples such as 23andMe, or Facebook, in which the parties entrusted with the data broke their promises and used the data for nefarious purposes: “Don’t be evil mottos can be compromised if companies are incentivized to mine or sell data. What we want to do is Can’t be evil”.

The technology is nascent, and the stack is still not as easy to use as the OPF would like. McConaghy mentioned that, to be able to use the stack, developers should be well-versed in Python or JavaScript/React, and be familiar with Web3 and Ethereum concepts, as well as Ocean Protocol concepts, plus Kubernetes. Data scientists can also use Compute-to-Data via JavaScript or Python, as well as Jupyter Notebooks.

For end users, however, it may take a while to reap the benefits of the approach. In the path that McConaghy envisions, end users will initially be able to play with existing marketplaces. Step 2 would be to set up data unions, trusts, or co-ops that act on behalf of the users and give them royalties for their data.

McConaghy said that Ethereum-powered DAOs (Decentralized Autonomous Organizations) could power such organizations, likening them to sub-Reddits with smart contract-based governance. Step 3, consumer-level applications for domains such as social networking, will take a while to appear, McConaghy concedes.

Disclosure: The author has worked on a project with the OPF in 2018, and holds an amount of OCEAN tokens as part of that engagement.

Content retrieved from: https://www.zdnet.com/article/compute-to-data-using-blockchain-to-decentralize-data-science-and-ai-with-the-ocean-protocol/.

Categories
knowledge connexions

Single point of control, database security as a service: jSonar gets $50 million funding from Goldman Sachs

Why would Goldman Sachs invest a hefty amount into a previously little known company working on something unsexy like database security?

Let’s face it: database security is not the most appealing topic for most people. Ron Bennatan, CTO and co-founder of jSonar, is very aware of it. That, however, has not stopped him and the jSonar team from carving out a spot for themselves and making progress over the years. It has not stopped Goldman Sachs from investing in jSonar either.

jSonar announced it has closed a $50 million investment from Goldman Sachs for the company’s first institutional round of funding. As part of the transaction, David Campbell, a managing director in the merchant banking division of Goldman Sachs, will join jSonar’s board.

“In the last decade, enterprise database infrastructure has grown exponentially in scale and complexity. Simultaneously, data security has evolved from a compliance requirement to a critical enterprise security component.

jSonar enables its customers to meet today’s data security demands, positions them to seamlessly adopt new databases, data lakes, and cloud services, all while reducing costs and expanding their analytical capabilities. We are excited to invest in jSonar and work with the team to continue to build a world-class business,” said Campbell.

ZDNet connected with Bennatan to discuss what jSonar does, why, and how this led to today’s announcement.

A bunch of engineers meet Goldman Sachs

We started the conversation going over what’s in jSonar’s name. Our hunch was that this was a nod to the company’s engineering origins, perhaps some connection to the Java programming language. Close enough: Not exactly Java, but JSON – JavaScript Object Notation.

Bennatan opened the conversation by admitting that, until recently, the company was “a bunch of engineers who choose cheesy names.” If you want to get a bit more technical than that, let’s just say that the name was created by a combination of JSON, archive, and sonar, as in finding patterns.

Bennatan describes what jSonar does as: “We just make good database and data repository security. Really, really simple. That’s what we do. We make security products for where data lives, but we do it in a very good way”.

Securing databases may not be sexy, but it’s also not trivial, and quite important

Again, not very technical, but we’ll get to that. Bennatan and his co-founder, Ury Segal, met at a university distributed systems lab some 30 years ago. Bennatan described their path over the years, which included other startups, until they founded jSonar in 2013. That experience may help explain the way they approached product development:

“We know that you don’t build things by yourself, you build things with a bunch of customers that tell you that, you know, this is a stupid thing to do. This is a smart thing to do. So we didn’t try to create a product that was perfect. We just tried to create something that we knew was good enough,” said Bennatan.

Bennatan also conceded that database security is pretty dull. It’s not a sexy area, because databases have been around for 40 to 50 years, unlike the latest trends like, say, Kubernetes. Plus, when things go wrong, they go very wrong, traumatically so; so bad that nobody gets to hear about them, he went on to add. That’s an interesting point, coming from someone with his experience.

Perhaps paradoxically, this brings us to today’s funding announcement. What Goldman Sachs knows, posited Bennatan, is that if you don’t have good security at the data repository layer, then almost nothing matters. jSonar was growing without having raised capital, and the connection with Goldman Sachs was made about a year ago, but not with a specific agenda.

jSonar was happy to be steadily doubling every year, sometimes in customers, sometimes in other measures. At some point, however, this bunch of engineers decided to super-charge its growth, expand its footprint beyond the US, and offer database security as a service. So, let’s see how simple really simple is.

Database security: Whose job is it?

jSonar has a good presence in a few key sectors, such as financial services. Some sectors understand risk and are more conscious about mitigating it than others, but everyone really should do that, says jSonar. Which, of course, begs the question: Isn’t everyone doing that already?

Don’t all databases come with security capabilities out of the box? Can’t database administrators (DBAs) just push the right buttons, and presto, problem solved? Well, it’s a bit more complicated than that, for a number of reasons, argues Bennatan.

Databases do indeed have very robust security layers. But the people charged with the security of the database are typically DBAs. A DBA is not someone who has security in their mindset, argues Bennatan, and we’d agree. The DBA’s job is to make sure the system runs smoothly, data is not lost, queries execute fast, and so on. That’s already a handful.

How well can DBAs know security? Maybe not that well. Perhaps then the security people should do it. Except then the question becomes: How well can security people know databases? The answer is, not that well, said Bennatan. Databases are complex, and they’re proliferating. It’s not like 20 years ago when even big companies only had a few databases:

“They’d have Oracle, SQL Server, they’d have DB2 on the mainframe, and maybe one more. Now every company has 40 databases. They’ve got SQL databases. They started dabbling in NoSQL, they have Hadoop, they have stuff running on the cloud. And each one of these has a separate security model. There’s no way for them to manage it at that level. They need to have a consistent policy.

What does the security officer care that you’re storing the data in SQL Server and I’m storing the data in MongoDB and someone else is storing the data in a combination of S3 and Athena? You don’t care if you’re a security officer, you need to have a single point, a single way of telling what’s going on. So being able to support all of these databases in a unified way — just think how much time that takes away from the equation.”
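
As a purely illustrative sketch of that “single point of control” idea, the snippet below expresses one vendor-neutral rule and maps it onto different back-ends; the policy schema and translation logic are hypothetical, not jSonar’s actual product.

POLICY = {
    "rule": "flag_bulk_reads",
    "object": "employees.salary",
    "max_rows_per_hour": 1000,
    "applies_to": ["oracle", "sqlserver", "mongodb", "s3_athena"],
}

def translate(policy, backend):
    # Map the vendor-neutral rule onto a backend-specific check (placeholders only).
    if backend in ("oracle", "sqlserver"):
        return {"type": "sql_audit", "object": policy["object"], "action": "SELECT"}
    if backend == "mongodb":
        return {"type": "audit_filter", "namespace": policy["object"]}
    return {"type": "access_log_filter", "object": policy["object"]}

checks = {backend: translate(POLICY, backend) for backend in POLICY["applies_to"]}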

Database security is not all about administrators implementing low level policies, but it takes a holistic view, says jSonar.

That makes sense to us — being able to control security policy from a single point. But there is more. In the same way that there’s been investment in every other security area, for things such as user behavioral analysis and threat detection, the same thing should apply to databases, argues Bennatan:

“The database will just allow you to set a set of privileges that allows you to say User A can do this and User B can do that. But the question is not what User A is allowed or is not allowed to do. The question is, given that User A is allowed to do this thing, is User A misusing those privileges?

So it’s not a hard and fast definition of a policy. It’s an analysis that compares the policy with the actual actions and tries to say, you know, User A is an insider, he has privileges, he is the root user of this database. But should he really be doing that many SELECT statements on the employee table and constantly looking at salaries of everybody? Why?”
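
A toy version of that behavioral check might compare a user’s activity against their own baseline, as in the sketch below; the thresholds and data shape are assumptions for illustration only.

from statistics import mean, pstdev

def is_anomalous(daily_selects_history, today_selects, z_threshold=3.0):
    # Flag users whose SELECT volume on a sensitive table is far above their baseline.
    mu = mean(daily_selects_history)
    sigma = pstdev(daily_selects_history) or 1.0
    return (today_selects - mu) / sigma > z_threshold

history = [12, 9, 15, 11, 10, 14, 13]      # past daily SELECTs on employees.salary
print(is_anomalous(history, 240))           # True: privileged, but likely misusing it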

Database security as a service

jSonar covers threat detection, compliance, and other work that has to do with policy. It also dabbles in areas like user rights management, or access monitoring for users. But unless we go into the database and understand exactly the privileges and the relationships between things, then we just have half the picture, Bennatan said.

This is the kind of thing jSonar wants to use the investment to be able to do. Well, that, and offering database security as a service. Currently, jSonar’s install base tends to be the largest companies, and it’s a very lucrative market. However, the company would also like to address clients beyond the Fortune 1000 and Forbes 2000.

Every company has databases, and every company needs security. But doing something that works for them is a little bit different from doing something that works for the biggest bank. The product may have the same functionality, but it has to be packaged differently, simplified, and delivered with the Software-as-a-Service (SaaS) model.

If you’re talking to big companies, in Bennatan’s experience, they have requirements, they have people who can implement things, and they tell you what they want. If you go to smaller companies, they don’t tell you what they want anymore. They’re looking to you to tell them what they need to do in order to secure their data well.

jSonar had a SaaS offering, but they realized that having it and selling it are two different things. The same applies to having an API of your own versus using other people’s APIs. On that front, however, jSonar seems to be doing well on both counts.

jSonar works with the APIs of the databases it connects to mostly independently. Only rarely do they need to ask for special help from the database vendor, said Bennatan. But this goes both ways. jSonar interacts with the databases it connects with in a number of ways and also has a database of its own, used to store internal and aggregate information.

Outbound jSonar APIs make sure that data is accessible not just in jSonar, but can also be integrated in whatever observability solution clients are already using. In addition, having that data enables jSonar to apply machine learning to go beyond monitoring, to be able to do things like predictive analytics.

Our prediction: this bunch of engineers may be on to something, unsexy as it may be.

Content retrieved from: https://www.zdnet.com/article/single-point-of-control-database-security-as-a-service-jsonar-gets-50-million-funding-from-goldman-sachs/.

Categories
knowledge connexions

Another globally distributed cloud native SQL database on the rise: Yugabyte Raises $30 million in Series B Funding

Your good old on-premise SQL database is in terminal decline. A pure-play, open-source, cloud-native, PostgreSQL-compatible database that also offers Apache Cassandra and GraphQL interfaces is what you need, says Yugabyte.

It’s not a coincidence — it’s a trend. Another globally distributed cloud native SQL database is getting a hefty amount of funding. This time, it’s Yugabyte. Yugabyte, a company founded by Facebook data infrastructure veterans, today announced that it has raised $30 million in an oversubscribed Series B round.

The round, led by 8VC, also includes participation from a strategic investor, Wipro Ventures, and existing investors, Lightspeed Venture Partners and Dell Technologies Capital. The round brings the company’s total funding to $55 million. Yugabyte also adds Scott Yara, co-founder and former SVP of Products of Pivotal Software, to its board of directors.

“Legacy source-of-truth databases form the beating heart of enterprises, and their movement to the cloud has just begun. This massive market deserves a product as beautifully architected and operable as the Yugabyte platform, and as formidable a team led by developer legends as Kannan Muthukkaruppan, Karthik Ranganathan and Mikhail Bautin,” said Bhaskar Ghosh, Partner and CTO at 8VC.

ZDNet connected with Yugabyte founders Kannan Muthukkaruppan and Karthik Ranganathan, and newly recruited CEO Bill Cook, previously of Sun Microsystems and Pivotal, for a deep dive in the company, the funding, and the market.

Applications are moving to the cloud, databases are following suit

Yugabyte’s commercial products include Yugabyte Platform, a self-managed private database-as-a-service offering available on any public, private, or hybrid cloud or Kubernetes infrastructure, and Yugabyte Cloud, a fully managed database service currently available on AWS and Google Cloud.

Yugabyte was founded in 2016. Its founders met at Facebook, where they worked on Facebook’s high scale data infrastructure, including having worked on Apache Cassandra and Apache HBase before they were either open sourced or successful, said Muthukkaruppan.

The goal was to put mission-critical applications like Facebook Messenger on a data tier that was elastic, easy to manage and operate, and able to handle data center failures. The Yugabyte team realized that as applications are moving to the cloud, databases should follow suit, and set out on that mission — a globally distributed database in the cloud, for cloud-based applications.

In the past five years, AWS, Azure, and Google Cloud have seen their annual revenues grow exponentially, from $7 billion to $70 billion. IDC predicts next year will be the “year of multi-cloud,” driven by the acceleration of digital transformation due to the COVID-19 pandemic, and Gartner predicts that by 2022, 75% of all databases will be deployed on or migrated to a cloud platform.

ybdb-stacked.jpg

Yugabyte is a globally distributed cloud native database with some interesting features

Yugabyte’s press release emphasizes that worldwide database management system (DBMS) revenue grew to $46 billion in 2018 alone, and that adoption of legacy databases for new applications is in terminal decline. Therefore, there is an immediate market need for a cloud-neutral database that adapts to any cloud and on-premises environment.

As we have seen here on Big on Data, Yugabyte is not the first vendor to come up with that idea. To begin with, there is a big NoSQL crowd that is already there, or close, in one way or another. And then we have the SQL crowd too. CockroachDB and FaunaDB come to mind, just to stick to some of the ones we have covered so far.

Then, of course, there are the Google Spanners and the Azure CosmosDBs of the world: SQL-based cloud-native databases, offered by cloud vendors. The obvious downside there is, good as they may be, you can’t do multi-cloud and hybrid cloud with those.

Frankly, this is quite a nuanced discussion. The reason we are mentioning it is to show that this market is big, it’s growing, and there’s plenty of competition and options. Yugabyte’s team is well aware of this, pointing out the fact that when we talk about databases, we’re looking at a $50 billion to $60 billion market. Getting a piece of that looked both possible and appealing for Yugabyte’s investors.

Crowded market, non-zero-sum game

We asked Ranganathan point-blank — what made Yugabyte’s investors vote with their dollars, in the face of such hard competition? Much of it has got to be the team, said Ranganathan.

Yugabyte founders go back to the days of working with Oracle and other relational databases, and have the experience of building, operating, and scaling mission-critical data infrastructure from the ground up. They also have experience in building companies, he went on to add, citing their stint at Nutanix. Another key element is the technical architecture, Ranganathan said:

“The market we’re addressing is the market of people building applications, transactional applications. What is the database most often picked in order to build these applications? It will have to be PostgreSQL. It just always ends up there. A lot of people are using PostgreSQL to build their applications. However, their application is being built for the cloud or a cloud-native environment like Kubernetes.

It requires scale-out, like the ability to add more nodes in order to serve more requests and scale back down when needed. It also requires the ability to go and deploy data across zones, across regions, hybrid deployments, etc. So, if you combine those three with PostgreSQL, what you get is a null set. There’s no solution that exists that can do all of these today, cloud vendors included.”
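
To illustrate the “build on PostgreSQL, deploy on Yugabyte” pitch, the sketch below points a standard PostgreSQL driver at a YugabyteDB node; the host, credentials, and the 5433 port (YSQL’s usual default) are placeholder assumptions.

import psycopg2

conn = psycopg2.connect(
    host="yb-node.example.com", port=5433,
    dbname="yugabyte", user="yugabyte", password="yugabyte",
)
with conn, conn.cursor() as cur:
    # Plain PostgreSQL DDL and DML, executed against the distributed database.
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id SERIAL PRIMARY KEY, total NUMERIC)")
    cur.execute("INSERT INTO orders (total) VALUES (%s)", (42.50,))
    cur.execute("SELECT count(*) FROM orders")
    print(cur.fetchone()[0])
conn.close()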

Cloud computing

A new generation of cloud native databases is on the rise

Getty Images/iStockphoto

Ranganathan went on to compare Yugabyte against the closest competition, CockroachDB and FaunaDB, in what is a nuanced discussion including parameters such as open-source licenses, community, growth, replication protocols, and all sorts of things CTOs should be aware of. That’s a bit too much to report on here, but if you’re interested, you will soon be able to catch the full discussion on the Orchestrate All the Things podcast.

The message from Yugabyte’s team is rather clear though: build your application on PostgreSQL, deploy it on Yugabyte anywhere. Interestingly, however, SQL is not all there is to Yugabyte. Yugabyte also offers an Apache Cassandra compatible API, in an obvious effort to onboard Cassandra users. Others like ScyllaDB, AWS and Azure CosmosDB offer this, too.
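
In the same spirit, a standard Cassandra driver can in principle be pointed at Yugabyte’s Cassandra-compatible interface; the host, keyspace, and the 9042 port below are placeholder assumptions, and the accepted CQL dialect may differ in detail.

from cassandra.cluster import Cluster

cluster = Cluster(["yb-node.example.com"], port=9042)
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS demo.events (id int PRIMARY KEY, payload text)")
session.execute("INSERT INTO demo.events (id, payload) VALUES (%s, %s)", (1, "hello"))
for row in session.execute("SELECT id, payload FROM demo.events"):
    print(row.id, row.payload)
cluster.shutdown()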

In addition, Yugabyte also offers a GraphQL layer, via a partnership with Hasura. In another nuanced analysis, Ranganathan went over what he sees as the three types of GraphQL layers for databases: generic ones, like Apollo; PostgreSQL-specific ones, like Hasura; and combined GraphQL-plus-database plays, like FaunaDB. Except for the third category, he went on to add, Yugabyte is open to working with all vendors.

As far as Yugabyte’s future plans are concerned, the goal is to double down on community and team growth. Support for some analytics workloads is on the roadmap, too. This is a crowded market, but big enough to be a non-zero-sum game. Yugabyte is worth keeping an eye on.

Content retrieved from: https://www.zdnet.com/article/another-globally-distributed-cloud-native-sql-database-on-the-rise-yugabyte-raises-30-million-in-series-b-funding/.

Categories
knowledge connexions

Streamlit wants to revolutionize building machine learning and data science applications, scores $21 million Series A funding

Streamlit wants to be for data science what business intelligence tools have been for databases: A quick way to get to results, without bothering much with the details

We were confused at first when we got the news. We interpreted “application framework for machine learning and data science” to mean some new framework for working with data, such as PyTorch, DeepLearning4j, and Neuton, to name just a few among many others out there.

So, our first reaction was: Another one, how is it different? Truth is, Streamlit is not a framework for working with data per se. Rather, it is a framework for building data-driven applications. That makes it different to begin with, and there’s more.

Streamlit is aimed at people who don’t necessarily know or care much about application development: Data scientists. It was created by a rock star team of data scientists who met in 2013 while working at Google X. It’s open source, and has been spreading like wildfire, counting some 200,000 applications built since late 2019.

Today Streamlit announced that it has secured $21 million in Series A funding. ZDNet connected with CEO Adrien Treuille to discuss what makes Streamlit special, and where it, and data-driven applications at large, are going next.

To listen to the conversation with Treuille in its entirety, you can head to the Orchestrate All the Things podcast.

From zero to hero: from datasets and models to applications

The investment was co-led by Gradient Ventures and GGV Capital, with additional participation from Bloomberg Beta, Elad Gil, Daniel Gross, and others. Glenn Solomon, a managing partner at GGV Capital, said that:

“Adapting quickly to new information and insights is one of the biggest challenges facing companies today. Streamlit is leading the way in helping data science teams accelerate time to market and amplify the work of machine learning throughout companies of all sizes across a wide variety of industries. At GGV we’re very excited to back this exceptional founding team and support their ambitious global growth plans.”

Let’s take it from the start then. In Treuille’s words, he and his co-founders came to be entrepreneurs via academia, doing machine learning and big data and AI before they were called by these names, and certainly before they were cool. Through his stints at Google X and Zoox AI teams, Treuille observed a pattern.

The promise of machine learning and artificial intelligence was often sequestered in those groups, not influencing the rest of the organization as much or as easily as it could. That led Treuille to start working on a pet project to solve this. Eventually, it started being used by a number of engineers and growing really quickly. Then investment came, and a big open-source launch.

istock-933321056.jpg

Streamlit is working on enabling data scientists to develop data-driven applications in a fraction of the time it normally takes

metamorworks, Getty Images/iStockphoto

Streamlit grew from a one-man project to being used in a number of Fortune 500 companies, and beyond, under the radar, until today. And it worked that way for a number of reasons.

First, Treuille and his co-founders leveraged their network. Second, they open-sourced Streamlit, which made it easy for everyone to adopt and experiment with. Third, and perhaps more important, they captured what Treuille called the Zeitgeist: They offered a solution to a problem data scientists, and the organizations employing them, are facing:

How to go from fiddling with datasets and models, to deploying an application using them in production. In essence, to do this, a number of people have to work together. At the very least, data scientists and application developers. As usual in situations like these, skills and culture differ, and collaborating costs time and money.

Streamlit cites Delta Dental as an example. They were told that using AI to analyze their call center traffic would cost a hefty amount and take a year. A data scientist at Delta Dental used Streamlit instead, and he built an MVP in a week, a prototype in three weeks, and had it in production in three months, all at zero cost to the company, says Streamlit.

Taking the application developers out of application development

To understand how this is possible, we need to dive deeper into how Streamlit works. Streamlit tries to take the application development team out of the picture, by enabling data scientists to develop their own applications.

Treuille elaborates on the conundrum of getting data scientists to build applications, or getting application builders to work with data scientists. Data scientists do not necessarily have the core skills for application building, and their applications end up being unmaintainable. Application builders, on the other hand, move on to other applications, leaving features frozen.

What Streamlit does is let data scientists create applications as a byproduct of their workflow. It takes their Python scripts and turns them into applications, by letting them insert a few lines of code that abstract application constructs such as widgets.
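
A minimal example of that pattern: the plain pandas script below becomes an interactive web app, served with streamlit run app.py, by adding a handful of Streamlit calls. The CSV path and column names are placeholders.

import pandas as pd
import streamlit as st

st.title("Call volume explorer")

df = pd.read_csv("calls.csv")                                        # existing workflow
min_duration = st.slider("Minimum call duration (s)", 0, 600, 60)    # widget, one line
filtered = df[df["duration"] >= min_duration]

st.write(f"{len(filtered)} calls match")
st.line_chart(filtered.groupby("day")["duration"].mean())            # interactive chart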

That’s unorthodox. Software engineers would argue there’s a reason why web development frameworks exist, for example, and there are many years of experience and best practices distilled into them. To throw them all away in favor of annotated Python scripts would look like bad practice, not to mention an existential threat.

Treuille begs to differ. To support his view, besides widespread adoption, he argues that this is a different way of developing applications. The applications are different, the scope is different, and Streamlit does not intend to reinvent the application development wheel, but rather, to integrate it:

“We view ourselves as a translation layer between the Python world and the web framework world. For example, everything in Streamlit is written in React. When you’ve discovered the joy of React, that’s like programming nirvana. We can take almost anything in the React ecosystem, and translate that into Streamlit almost effortlessly. So our core technology is really that translation layer.”

streamlitapp.gif

From a Python script to a web application, with a few extra lines of code. Image: Streamlit

Treuille went on to add that soon Streamlit will enable any developer to translate any bit of web tech into a single Python function, thus allowing the two ecosystems to flourish independently of one another. The same approach is taken with regard to other Python frameworks such as Dask or Ray, for example:

“Streamlit is very modest, in some ways, very small. And therefore we sit alongside whatever — the whole Python data ecosystem. And that is really exciting because of the bigger story here, which is way, way bigger than Streamlit. It’s the data world which was at one time big databases, and then it was Excel, and then it was Tableau, and more recently Looker.

This tsunami is coming, which is open source and machine learning, and Python, and Pandas, SciKit learn. This is basically 20 years of academic research into machine learning, crashing into the data world, and completely transforming it. And we view ourselves as just a little surfboard in that wave, just riding it, or trying to ride it as best we can.”

There’s an app for that. Should you build it with Streamlit?

That may explain the approach, but not the scope. There is more to applications than data and data-driven features. If you are Netflix, for example, the core business revolves around streaming, and the applications should reflect that. They should enable people to manage payments, stream films, and so on.

Recommendations add to that, powered by data and machine learning. But they are not the core business. Treuille acknowledged that Streamlit does not aspire to be the front end to your entire company: “If Netflix came to us and said, hey, we want to write the Netflix app website in Streamlit, we’d say we don’t think that’s a good idea.”

Streamlit is not a general-purpose application development framework. What it does, in a way, is the same thing that business intelligence application frameworks did for databases. It provides a framework that enables quick access to the underlying source of value. For BI frameworks, it was data stored in databases. For Streamlit, it’s machine learning models.

We would still question how many data scientists, or their managers for that matter, would be happy with adding the task of maintaining their Streamlit applications on top of everything else they already do. We would also question whether application developers can, or should, be taken out of the picture entirely, even for purpose-built, data-driven applications, as they grow over time. But Streamlit is too early in its lifecycle to be able to answer those questions.

streamlitapp1.png

Data scientists are not necessarily the most suitable people to develop applications. Image: Streamlit

That, however, does not seem to have stopped users or investors. Speaking of which, there’s another interesting question here. What is Streamlit’s business model, and how did it get to convince people to invest money in it? In a nutshell: Software as a service in the cloud, with a tweak.

You can use Streamlit to develop any application without any restrictions. What you pay for, optionally, is deployment. Users can deploy Streamlit anywhere they please, on their own. But Streamlit offers its own cloud solution, called Streamlit for Teams, which comes with additional features around collaboration and deployment.

Treuille was adamant about Streamlit’s bottom-up sales strategy: Just getting the software out there, enabling people to start building applications, and then converting a part of them to paying users.

The bigger picture: Software 2.0

Streamlit is interesting, if for nothing else, because of the different paradigm it brings to application development. Which, in turn, is part of what Treuille sees as a different way of building applications:

“The bigger picture is the way that the Python ecosystem and the community of open source developers and academic developers and corporations — TensorFlow is built by Google, PyTorch by Facebook — how all of these different forces have come together to create this incredibly powerful data ecosystem. That truly can revolutionize the show. That truly has different properties than just a simple spreadsheet and a list of your sales over the past year.”

Some people refer to this as Software 2.0. What we wondered, however, was whether the world is really ready for this. In many ways, most organizations probably have not gotten Software 1.0 right yet. Version control, release management, software development tools, and processes — these are not exactly trivial things.

Now add to that dataset management, provenance, machine learning and feature engineering, versioning, to name but a few of the concerns of data-driven development, and what you get is a combinatorial explosion. Treuille conceded that this is really part of the Zeitgeist of the past couple of years.

Treuille sees Streamlit as being part of a wave of new startups such as Tecton or Weights and Biases, which are essentially productionizing every layer of that stack. He believes talented people are working on this, and it’s coming into view. His take on how to get with the program:

“If you are a company, asking yourself how to get into this world, what is even the first step, I would say: Go to Insight Data Science. Hire one of their machine learning engineers or data scientists finishing the school for data scientists, and then give them Streamlit.”

Content retrieved from: https://www.zdnet.com/article/streamlit-wants-to-revolutionize-building-machine-learning-and-data-science-applications-scores-21-million-series-a-funding/.