Giant Zombie Robots Will Eat Your Brain - Or Skynet as a Service

Hello, I am Ori Pekelman, an entrepreneur and a consultant; you can find me on Twitter/GitHub/LinkedIn and such, as OriPekelman. I have been building web applications, and therefore APIs, for the past twenty years or so. I am somewhat of a Hypermedia advocate and I am also very much interested in Semantic technologies and Machine Learning; I co-organize Paris Data Geeks, a huge gathering around these themes.

This is a very short talk and I want to cover a very large terrain, so I am concurrently publishing a more detailed version here: http://pekelman.com/presentations/zombie-robots and a commentable blog post version here: http://blog.constellationmatrix.com/

My talk will not be about zombies; I was just dying to use this title for a while. Still, I will be talking a bit about Machine Learning, so I should be safe around the robots.

Throughout the talk, please remember that we are not talking about transactional APIs. We will focus on the current landscape of data APIs, try to give you some predictions about emergent technologies, and hopefully offer some practical pointers that will help you prepare for the next big thing.

So let's throw into a big cauldron the following elements:

  1. The Semantic Stack - RDF, SPARQL and the like
  2. The Linked-Data Stack - The Poor man's Semantic Stack
  3. The RESTful Hypermedia Stack - The Poor man's Linked-Data stack
  4. The POJ Stack - The Poor man's Hypermedia stack

  5. Automatic API discovery, search and orchestration (trying to get us up the stack and be rich again!)
  6. Predictive APIs - AKA Giant Zombie Robots

We will end this talk with the idea that our Robot Zombie friends are hard-working folks, and we should try to be nice and make our brains more readily accessible to them.

RDF is about transparent reasoning at scale

The Semantic Web (à la W3C) is about organizing all knowledge in a way we can automatically reason about it.

The design of RDF is intended to meet a set of design goals, spelled out here:

http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-design-goals

RDF Concepts and Abstract Syntax defines the abstract syntax on which RDF is based, and which serves to link its concrete syntax to its formal semantics. It also includes a discussion of design goals, key concepts, datatyping, character normalization and the handling of URI references.

RDF failed to scale with humans

So... RDF did not scale because it did not scale with humans: it is overly abstract, and to get even minimal benefit from it you need to go all in and learn a very weird terminology and extremely high-level concepts. Look at those design goals: they do recognize two interesting things, "allowing anyone to make statements about any resource" and provable inference. The rest are not goals, right? They are means. Even the "formal semantics" bit is probably an outcome (can we get provable inference without formal semantics?)

Provable inference or transparent reasoning

So, let's dwell a bit on "provable inference". This is what I referred to as "transparent reasoning". It means: we have a bunch of resources that are linked together, and we can ask a computer system questions about this big graph. It will be able to answer some hard questions, questions that require scale. But it will also, always, be able to tell us why. It can do so because armies of humans have formally encoded human reasoning into a machine-readable form.
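To make that concrete, here is a minimal sketch of such transparent reasoning using Python and the rdflib library; the facts and the ex: namespace are made up for illustration. Two asserted statements let a SPARQL query derive a third, and the chain of triples behind the answer stays fully inspectable.

    # A minimal sketch of "transparent reasoning", assuming rdflib is installed.
    # The facts and the ex: namespace are invented for illustration.
    from rdflib import Graph

    TURTLE = """
    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:Socrates a ex:Human .
    ex:Human rdfs:subClassOf ex:Mortal .
    """

    g = Graph()
    g.parse(data=TURTLE, format="turtle")

    # A SPARQL 1.1 property path walks the type/subclass chain for us.
    QUERY = """
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?who WHERE { ?who rdf:type/rdfs:subClassOf* <http://example.org/Mortal> . }
    """

    for row in g.query(QUERY):
        print(row.who)  # -> http://example.org/Socrates

The answer can always be traced back to the exact triples that produced it, which is precisely what the predictive APIs we will meet later give up.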

The W3C Semantic Sect, pardon, Stack, is humanist at heart. It promotes an ideal where we, the humans, stay the masters; where computers empower people to apply human reasoning at an incredible scale, but without supplanting it.

The Semantic Web did not fail, it just did not succeed; well, semantics

Let's also put the "failure" of the Semantic Web in very relative terms. It never garnered widespread adoption, even in its "dumbed down" versions such as microformats or schema.org-style markup. But there are some incredible projects powered by RDF out there, and if one of the main problems was that ontologies were just too damn expensive to create and maintain, the advent of DBpedia as a hub that connects many of them has made these technologies somewhat relevant again.

Both hypermedia REST and RDF-ish technologies come from a post-hoc analysis of "Whoa, this web thing really works". But they represent two very different paths. One is minimal and allows for incremental advances and value creation; the other is, well, too complicated for the smartest people I know. Have a look at this improbable meeting of the two: http://notes.restdesc.org/2011/images/usecase-2.html

I am not saying there are no smart people on the Semantic Web side; on the contrary, this is a thing done by very smart people for very smart people. But the whole point of the web is that it degrades gracefully, that you get 80% of the value even when it is 99% broken. The people working on the semantic stack became enamored not only with an idea but with a specific way of implementing it, with a technology; the humanists became less interested in the real people who are going to be the first users of all this, which basically means programmers. In previous talks we advocated for an API design process that is developer-centric and considers the real use cases, where the first real use case of any API is that a human being, a developer, has to write code to interface with it.

The Semantic Stack was built for computers, so it is no wonder people did not like it.

A trollish interlude

As a side note, and as it's always good to throw in a trollish remark about SOAP, let's remember there was another strain of this supposedly computer-friendly but truly human-hostile paradigm. This one was not humanist at all. It was the detail-oriented, anal-retentive, corporations-are-people, big-complicated-diagram version of the idea. It allowed two computers to talk to each other in an ad hoc broken dialect that was native to neither. It was not about abstraction at all, but about exchanging idiosyncrasies.

That said.

Computers are getting smarter, really, really smart.

Which basically means that we can write stuff that is easy for humans to parse, and there is still a very good chance computers will be able to figure out a lot. We will get into a bit more detail about this and explain the whole Machine Learning vibe, but there is much more to cover first.

So... the people have voted. They have chosen simple, easy-to-implement, easy-to-consume APIs with very lightweight meta-artifacts: few or no schemas, not much discovery and such; no standards, just ad hoc implementations. The industry as a whole chose different middle grounds between POX (plain old XML) and POJ (plain old JSON), with a hint of REST and a soupçon of hypermedia.

And this worked great. But now we are again having some issues... with the scale of it all. With this jungle. The cost of maintaining a client for a specific API is compounded by the need to orchestrate its usage with a bunch of others. One can do streaming, another can batch, yet another has rate limiting. Some prefer *_id* to *id*. And because no one sells software anymore and everyone wants to meter usage, we have to make all of those work nicely together. Basically, the incredible success of our simple APIs has brought new scaling issues upon us.
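To illustrate the kind of glue code this jungle forces on us, here is a hedged sketch in Python; the two provider formats are invented, but the pattern of per-provider normalization should look familiar:

    # A sketch of the "orchestration tax": two hypothetical providers expose the
    # same kind of resource with different conventions, and the client has to
    # normalize them before it can do anything useful.
    def normalize(record, provider):
        if provider == "provider_a":
            # Uses "_id" and nests everything else under "attrs".
            return {"id": record["_id"], **record.get("attrs", {})}
        if provider == "provider_b":
            # Already uses "id" and a flat structure, but is rate limited,
            # so the caller also has to throttle and retry around it.
            return dict(record)
        raise ValueError("unknown provider: %s" % provider)

    # Every new API in the mix adds another branch here, plus its own
    # pagination, batching and rate-limit handling; the cost compounds.
    print(normalize({"_id": 42, "attrs": {"name": "Ada"}}, "provider_a"))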

Hence the emergence of the orchestration layer.

On the API design side there are two orthogonal but concurrent trends: one has to do with more metadata (Swagger style), the other with more common conventions and standards (like the json-api or API Commons approaches). There are pitfalls to both. The technologies we are talking about are not mature, and premature standardization might do more harm than good. For example, having standard schemas looks like a good thing... if you have a way to handle schema versioning and migration. But if you can handle those... then you have probably just won the prize of reinventing a more complicated version of SOAP. And we already said that was bad.

Part of the cost induced by this scale can be mitigated by clean implementations of hypermedia. It does not allow for fully automated client generation or full discovery capabilities, but it enormously reduces the amount of "external" technical metadata needed to interface with a system. With a bit of kung-fu it can also handle some of the trickier aspects of managing versions.
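As a sketch of what that buys you, here is a tiny link-following client in Python. It assumes a HAL-ish response shape with an "_links" map and absolute hrefs; that shape is an assumption for illustration, not a requirement of any particular API.

    # A sketch of a hypermedia client, assuming HAL-style "_links" and absolute
    # hrefs. The client hardcodes only the API root and the link relation names,
    # not URL templates, so most "external" technical metadata disappears.
    import requests

    def follow(root_url, *rels):
        doc = requests.get(root_url).json()
        for rel in rels:
            doc = requests.get(doc["_links"][rel]["href"]).json()
        return doc

    # e.g. follow("https://api.example.com/", "orders", "latest")
    # If the server reshuffles its URLs between versions, this client keeps
    # working as long as the link relations stay stable.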

And the emergence of the as-a-service orchestration layer

Around those two trends (more metadata, more standards), the last two years have also seen the emergence of the orchestration layer as a service: wrappers and façades as a service, API management as a service, but also meta-search APIs and so forth. All of these try to absorb the complexity by abstracting away APIs within the same domain. And I do believe there will be an enormous market for just that: interfacing APIs.

But we can also see that without some sort of higher-level automation this will fail. It scales to a company project. It scales for some use cases on a wider level, but it does not scale the way the web does. It fails the grand vision of the architects of the Semantic Web, who dared to set "allowing anyone to make statements about any resource" as a design goal.

An image

So, basically, the picture I am painting is that of a bunch of lovely, smart humanists in an ivory tower, bereft because the vulgus refuses their baroque works of art. Wizards in white robes. Intricate machines, with strong magic, that only they can wield.

In the middle ground, on the battlefield, the humans build makeshift tools, crude machinery, but they are numerous and their firepower is incredible. Even the warrior tribes have learnt to collaborate; they copy each other's tools and make their engineers work in the open (source). In their quest to live and make a couple of bucks, they are aided by powerful and seemingly benevolent entities (the google, the amazon). Everything is a mess, there is no plan, there are no semantics, there is a lot of waste; but hell, it works. So people propose European Unions and trade agreements to make it all function together as a well-oiled machine (and the EU is a good thing!).

But beyond a ridge, other things hum in the dark, meticulously observing the wizards, the humans and their Unions.

Predictive APIs make reasoning scale through opacity.

At long last we get to the robots. When I say Predictive APIs, I mean APIs whose main business value is derived from Machine Learning algorithms, where the data we get out is not data we put in, but the result of reasoning.

One of the interesting aspects of most Machine Learning algorithms is that their reasoning is opaque. We know the model is robust, we can prove it, but we can no longer answer "why" questions. This runs contrary to what we have seen with the semantic stack. And the opacity here is double: not only can you not access the model, you probably couldn't handle the data set on which it was trained.
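A quick, hedged illustration with scikit-learn (any similar toolkit would do): we can measure how robust the model is, but an individual answer comes with no human-readable justification.

    # A sketch of the opacity trade-off, assuming scikit-learn is installed.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))   # we can prove it is robust...
    print("answer:", model.predict(X_test[:1]))       # ...but not explain this answer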

Some of what we call "predictive" APIs are dedicated to exposing algorithms, so they get predictable names, like the open source Prediction.io, which exposes Mahout, or the Prediction API from Google. But let's also not forget the beautifully designed APIs from the PSI project (http://psi.cecs.anu.edu.au) and the really-not-so-ugly one from http://bigml.com.

There are of course also a bunch of services that handle other corners, like sentiment analysis, machine vision, and other more NLP-oriented tasks, but these again are usually services that expose a single kind of algorithm. We also have a whole adjacent area, well represented here at ApiDays, of what are called "Semantic APIs". Those are often a mix of traditional NLP and newer funk; often enough they have nothing to do with the Semantic Web per se, but they all have in common that they are data APIs that are "answer APIs".

Some of the APIs we can put in this Machine Learning bucket are much higher-level constructs. I'll plug some friends here: Snips (http://snips.net/) have an API that can robustly predict the risk of an accident at a specific intersection. You can see why this is much more interesting than a simple data API that could tell you whether there was an accident there. That data would be useless (it is very sparse; in 99.99-and-then-some percent of cases it will simply say 0). If you want actionable data, you probably want the prediction data. But now think about what happens when this data gets ingested by other systems that themselves give predictive responses...
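As a purely hypothetical sketch (the endpoint, parameters and response fields below are invented, not Snips' actual API), consuming such a service looks just like consuming any other data API; the difference is that what comes back is an answer, not a record.

    # Hypothetical client for an accident-risk style Predictive API.
    # URL, parameters and response shape are invented for illustration only.
    import requests

    resp = requests.get(
        "https://api.example-predictions.test/v1/risk",
        params={"lat": 48.8566, "lon": 2.3522, "hour": 18},
    )
    print(resp.json())  # e.g. {"accident_risk": 0.07} -- an answer, not raw data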

The beautiful humanistic vision of the Semantic Web does seem to fade away, giving birth to a sinister Skynet. In the words of Little Britain: "Computer Says No". And no reason can be given.

Which brings us to

The next big thing:

As a reminder, this talk focuses on data APIs, not on the ones that do stuff, the transactional kind. We are still very, very far away from writing automated code that automagically orchestrates the whole world in a read/write fashion.

But we can probably, right now, with a high level of confidence, build systems that:

Predict what a column is and what it means

Predict what can be predicted

And I am not talking here about type inference; if you look at a company like Dataiku, they already do that very well. We are talking about semantic inference over data APIs. This is what allows for this higher-level orchestration. Plugging friends again: one of the streams that http://www.quantstreams.com/ can handle is names; they can predict, with a very high level of confidence, the sex of a person from their name (and this is not trivial!). The next step is to be able to understand, from a data source, that it contains names.
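As a toy illustration of that last step, here is a deliberately naive sketch of "understanding that a column contains names"; real systems use proper models and far larger reference data, and the word list and threshold below are invented.

    # A toy sketch of "predict what a column is": score how name-like a column's
    # values are against a tiny reference set. The set and threshold are made up.
    KNOWN_FIRST_NAMES = {"marie", "jean", "anna", "pierre", "fatima", "david"}

    def looks_like_names(column, threshold=0.6):
        values = [v.strip().split()[0].lower() for v in column if v and v.strip()]
        if not values:
            return False
        hits = sum(1 for v in values if v in KNOWN_FIRST_NAMES)
        return hits / len(values) >= threshold

    print(looks_like_names(["Marie Curie", "Jean Valjean", "Anna Karenina"]))  # True
    print(looks_like_names(["42", "17", "2014-12-02"]))                        # False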

When we can reasonably infer the semantics of a data service, we can also start to make robust predictions about the data it does not have (from the name we got a gender... from the gender we can predict...). Many data sources are very sparse, and many will become much denser. Allow me to predict: this is a game changer. It makes our anthem of "Hybrid Data Makes All Your Data Big" even more convincing.

We basically won't need a lot of metadata if we can predict what stuff is. But what we do need is easy discovery and traversal.

And here, just listen to ol' Steve Klabnik and do the Roy Fielding hypermedia thing. It would also be nice if you followed some of my own advice: serve a directory of endpoints from the root of your API, put schema versions in the resources, and so forth. There is a good chance machines will handle the rest. And even if they don't, you will probably have made some developers happier.
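A minimal sketch of that advice, with illustrative field names (nothing here is a standard): a root document that lists the entry points and carries schema version information in the resources themselves.

    # A sketch of an API root directory; field names are illustrative only.
    import json

    API_ROOT = {
        "_links": {
            "self":   {"href": "/"},
            "orders": {"href": "/orders"},
            "users":  {"href": "/users"},
        },
        "schema_version": "2.1",   # versions travel with the resources
    }

    print(json.dumps(API_ROOT, indent=2))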

So the next big thing is machine-learning-powered orchestration

Because

Big Data is Big