The buzzword "AI" is shiny, and the internet often brilliantly exaggerates its results, but perfectionists are hardly convinced. AI can learn to predict data points with grandmaster-level accuracy, yet its applications, so far, demonstrate less basic reasoning skill than a first-grader.
This post will brief you on how to supercharge your organization's Data Science and Machine Learning (ML) data strategy by architecting a knowledge graph (KG). Showing a full AI example is outside the scope of this post; instead, it presents a NoDW (Not-only Data Warehouse) example that can pave the way to hyper-contextualized AI models. Note: KG is a part of AI.
- Most ML apps can demonstrate high accuracy, but no reasoning skills
- Google serves the gigantic internet fast, yet your data is heavy and slow
- Knowledge is human understanding of information
- Searching for data is slow, but looking up pre-computed knowledge is not
- To augment cognitive skills, you need a knowledge database
- Most organizations store data and are not in a knowledge business
- Knowledge database (aka graph) operations are relatively linear
- Nobody has absolutely perfected data, but what can we do about it?
- (Optional) Developing taxonomy and ontology
- Building the graph
- Publishing graph as an API
- Who uses Knowledge Graphs and why?
Most ML applications are unnatural and lack reasoning.
ML has a fascinating track record in the marketing world. Some organizations have been suffering from FOMO, and the rest want ML to create a competitive edge. Regardless of the motive, ML is a lucrative option. Therefore, a lot of investment goes into creating and fine-tuning ML models to retrofit them to business needs.
Accuracy combined with basic cognitive skills is critical when automating industries where due diligence is vital, e.g., transactional businesses. Some of us get our credit applications rejected simply because they came from a zipcode with a historically bad credit reputation: the machine learning models were optimized for accuracy alone. By design, ML algorithms suffer from inductive bias.
Most ML applications focus too much on generalizing training data and maximizing the success function but lack the context of the training data.
Google is fast while carrying the internet, and your data is slow.
Google is the poster child of the Internet. It showed us that the right information can be found on the internet, and found fast. Google often processes and analyzes free text, whereas most organizations process their data within convenient, self-defined schemas. Yet although typical business organizations handle nowhere near Google's volume and usage, their performance and user experience are often nowhere near Google's either. How does Google do it?
It takes a massive village to deliver Google-like performance, so the scope of this post is only limited to one aspect of the data strategy.
So how does an internet-scale company consistently produce results in under one second? The most straightforward assumption is that they pre-compute and cache results. This is likely true in many cases, but we, too, run pipelines to process organizational data and prepare data warehouses, right? And our performance is still nowhere near Google's.
The reason is that we search for needles in a data haystack. We are not pre-computing knowledge. What is knowledge, anyway?
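To make the pre-computation idea concrete, here is a minimal sketch in plain Python (the data and names are hypothetical) contrasting a linear search over raw records, the "data haystack", with a dictionary lookup over pre-computed answers:

```python
# A "data haystack": raw event records we would have to scan at query time.
events = [
    {"user": "alice", "page": "p1", "action": "like"},
    {"user": "bob", "page": "p1", "action": "like"},
    {"user": "alice", "page": "p2", "action": "share"},
]

def likes_by_scanning(page):
    """Search the haystack: O(n) over all records for every query."""
    return sum(1 for e in events if e["page"] == page and e["action"] == "like")

# Pre-computed "knowledge": build the answer once, look it up in O(1) later.
like_counts = {}
for e in events:
    if e["action"] == "like":
        like_counts[e["page"]] = like_counts.get(e["page"], 0) + 1

print(likes_by_scanning("p1"), like_counts.get("p1", 0))  # -> 2 2
```

Both return the same answer, but the second approach pays the computation cost once, up front, instead of on every query.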
The knowledge edge
The difference between data, information, and knowledge is widely documented on the internet. Generally, knowledge is a personal or organizational interpretation of information, which may not be holistically perfect but is fair enough to act upon and profit from. Furthermore, knowledge gives wisdom explainability and auditability.
A personal or organizational understanding of knowledge is, for practical purposes, its truth. Businesses can rely on a collection of truths, i.e., a knowledge database. If Google were the internet's brain, that brain would be predominantly filled with knowledge, not just data. Alexa and Google Home devices often show cognitive skills, a must for human communication, and they are powered mainly by knowledge databases.
A sample query helps compare the operations: find the pages I do not like yet, but my 2nd-degree connections do.
Knowledge databases are often referred to as knowledge graphs. When designed right, knowledge graphs typically enjoy the benefits of the plethora of graph-based algorithms for relationship storage, discovery, processing, and prediction over the usual relational and NoSQL databases.
Compare traversing a few tens or hundreds of hops on a graph vs. spending your life `JOIN`ing tables by `ID` across hundreds of thousands of rows. This is why graph operations are relatively linear.
And it gives context to the data.
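As an illustration, the sample query above (pages my 2nd-degree connections like but I do not) can be sketched in plain Python over an adjacency-list graph; the graph data and names are hypothetical:

```python
# Hypothetical social graph: who is connected to whom, and who likes what.
connections = {
    "me": {"ann", "bob"},
    "ann": {"me", "cara"},
    "bob": {"me", "dave"},
    "cara": {"ann"},
    "dave": {"bob"},
}
likes = {
    "me": {"page_a"},
    "cara": {"page_b"},
    "dave": {"page_a", "page_c"},
}

def second_degree_likes(user):
    """Pages liked by 2nd-degree connections that `user` does not like yet."""
    first = connections.get(user, set())
    second = set()
    for friend in first:
        second |= connections.get(friend, set())
    second -= first | {user}  # keep strictly 2nd-degree connections
    candidates = set()
    for person in second:
        candidates |= likes.get(person, set())
    return candidates - likes.get(user, set())

print(sorted(second_degree_likes("me")))  # -> ['page_b', 'page_c']
```

In a graph database, this is a couple of hops from the `me` node; in a relational schema, it is a self-join on a connections table followed by joins against a likes table.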
Nobody has absolutely perfected data, but what can we do about it?
Spoiler alert: a KG cannot solve all of your data problems. Data is big trouble in any organization. Some of the practical reasons are:
- absence of standardization across the org.
- multiple sources of truth
- inconsistent interpretation
- stale data points
- inadequate documentation
- constrained budget and so on
If data has not become a problem yet, it is likely that your organization has not grown enough, or you are not pushing data capabilities hard enough. Or you are throwing too much money at "managed" services, only to realize later that it is unsustainable.
Business always has a de facto response to anything too complex to solve: "manage/de-risk" it. Hence, the solution is to create the illusion that you are managing data and its madness.
For that, let us summarize and list the problems we have at hand and see if knowledge graph-based techniques can offer solutions.
For brevity's sake, the process of building a knowledge architecture has been reduced to three steps:
- (Optional) Developing a taxonomy and ontology
- Building the graph database
- Publishing graph as an API
(Optional) Developing taxonomy and ontology
Typical businesses resist building a taxonomy and an ontology. While I strongly recommend them, the reality is that they might never get built in your organization. Yet they make an excellent candidate for a centralized repository of business terms, with their hierarchy, meaning, relationships, interpretation, and validation.
- Taxonomy: the easiest definition of taxonomy is logically grouping terms and creating a hierarchy with them
- Ontology: a poor man's definition is creating a class-relationship map on top of the taxonomy, along with attributes, constraints, cardinalities, enums, and even instances of the classes. If you create an ontology for a zoo, an instance of Zebra can be part of it.
With a reasoner, you can even validate an incoming object programmatically against the Ontology that you have built already. Here is probably the tiniest ontology and what a reasoner could infer out of a bare minimum input.
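In that spirit, here is a minimal, hypothetical sketch of an ontology as a class map with constraints, plus a naive validator standing in for a reasoner. (A real ontology would use OWL/RDF tooling and a proper reasoner; this only illustrates the idea of validating an incoming object against inherited constraints.)

```python
# A tiny hand-rolled ontology: classes, their parent class, and required attributes.
ontology = {
    "Animal": {"parent": None, "required": {"name", "legs"}},
    "Zebra":  {"parent": "Animal", "required": {"stripes"}},
}

def required_attrs(cls):
    """Walk up the class hierarchy and collect inherited constraints."""
    attrs = set()
    while cls is not None:
        attrs |= ontology[cls]["required"]
        cls = ontology[cls]["parent"]
    return attrs

def validate(instance, cls):
    """Naive 'reasoner': check an incoming object against the ontology."""
    missing = required_attrs(cls) - instance.keys()
    return (len(missing) == 0, missing)

ok, missing = validate({"name": "Marty", "legs": 4, "stripes": 26}, "Zebra")
print(ok)  # -> True: a Zebra satisfies Animal's inherited constraints too
```

The inference here is the inheritance walk: a `Zebra` instance must also satisfy `Animal`'s constraints even though they are not declared on `Zebra` itself.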
With a Taxonomy and Ontology at hand, designing a graph becomes a breeze, but you can still go ahead without them.
Building the graph
Constructing a graph is about storing interesting facts, their relationships, and the metadata of those relationships. Let us solve a real data challenge.
Build an app that detects trendy posts across social media platforms and curates them by calculating their potential for virality.
Thinking model: once a post is created, every time an interesting event happens at a time `t`, we keep a score of the event, `score_t`. For instance, for each share, comment, and reaction, 1 point is added: `score_t += 1`. Now, with a time window `t_MAX` and a possibly large tipping-point score for virality, `score_TP`, when `t < t_MAX` and `sum(score_t) > score_TP` are both true, we classify it as a potentially viral post.
A quick example: if a post gets >700 reactions + shares + comments (you can define more criteria, taking NLP, emotion, etc., into account) within 60 minutes, it gets picked up by the app for promotion.
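The thinking model above can be sketched in a few lines of Python; the thresholds and the event format are the hypothetical ones from the example (700 points within 60 minutes):

```python
T_MAX = 60 * 60   # 60-minute window, in seconds since the post was created
SCORE_TP = 700    # tipping-point score for virality

def is_potentially_viral(events, t_max=T_MAX, score_tp=SCORE_TP):
    """events: list of (t, score_t) tuples, t in seconds after post creation.
    Sum the scores of events inside the window and compare to the tipping point."""
    total = sum(score for t, score in events if t < t_max)
    return total > score_tp

# Each share, comment, or reaction contributes score_t = 1.
burst = [(t, 1) for t in range(750)]       # 750 events within the first ~12 minutes
slow = [(t * 60, 1) for t in range(750)]   # 750 events spread over 750 minutes

print(is_potentially_viral(burst))  # -> True
print(is_potentially_viral(slow))   # -> False
```

Note that both posts accumulate the same total score; only the burst crosses the tipping point inside the window.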
Relating back to the previous figure, here is one way to connect two nodes (e.g., `Noun`s) with a relation (e.g., a `Verb`) into a graph using English grammar: `Noun 1` with `Adjective 1` -> `Verb 1` with `Adverb 1` to -> `Noun 2` with `Adjective 2`.
Unlike an RDBMS, graph databases have efficient implementations of graph algorithms built in, so traversing them is not super hard. Another difference is that relations in graph databases can have useful attributes. For instance, in the example above, we are tracking the time of each event as an attribute on the relation.
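A minimal, hypothetical sketch of such a graph in plain Python (a production setup would use a graph database such as Neo4j): each edge is a (source, relation, target, attributes) tuple, with the event time stored as a relation attribute:

```python
# Edges carry their own attributes, e.g., when the event happened.
edges = [
    ("user:ann", "SHARED", "post:42", {"t": 120}),   # 120 s after creation
    ("user:bob", "REACTED", "post:42", {"t": 340}),
    ("user:ann", "COMMENTED", "post:7", {"t": 15}),
]

def events_for(post, within=None):
    """Traverse incoming relations of a post, optionally filtered by event time."""
    return [
        (src, rel, attrs)
        for src, rel, dst, attrs in edges
        if dst == post and (within is None or attrs["t"] < within)
    ]

print(len(events_for("post:42")))              # -> 2
print(len(events_for("post:42", within=200)))  # -> 1
```

Putting the timestamp on the relation rather than on either node is what makes the time-windowed virality query above a simple filter over a post's incoming edges.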
Publishing graph as an API
If you were the sole producer and consumer of the data, you would not have to rethink it. Oftentimes, in a complex data setup, others depend on the data you produce and vice versa. Therefore, asynchronicity in communications is critical to a graph's success.
- Taxonomy + Ontology: It was covered earlier.
- Graph database: The heart of a knowledge graph is a graph database. Graph databases are a special class of NoSQL databases that efficiently store nodes, attributes, and relationships. A graph database is a strong candidate for serving as the single source of truth. Most graph databases also come with built-in graph traversal algorithms and a query system.
Data Science models and Semantic Web technologies can then be applied to such a database for highly contextualized analytics and training.
- Graph Intelligence API (GIA): This API abstracts and controls access to the graph database. If immutability is desired, you may provide authorized-only access to the update operations.
- Producers: These are the parts of your apps that supply facts. Whenever a producer comes across an interesting fact to report, it may make an asynchronous call to the GIA to store it with the appropriate relationships among nodes. This makes facts instantly available in the graph instead of waiting for long-running data warehouse pipelines to finish.
One of the benefits of the graph is that it is always a work in progress, so there is no significant upfront cost in migrating all data to the graph. Instead, systems interested in publishing their facts to the graph, whenever they are ready, can start pushing without depending on other systems or data points.
- Consumers: As the name suggests, these request and access knowledge via the GIA.
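Here is a minimal, hypothetical sketch of the GIA pattern using only the standard library: producers push facts onto a queue (standing in for asynchronous API calls), a worker writes them into an in-memory graph store, and consumers query it. A real deployment would expose this as an HTTP API in front of a graph database.

```python
import queue
import threading

class GraphStore:
    """In-memory stand-in for the graph database behind the GIA."""
    def __init__(self):
        self.edges = []
        self.lock = threading.Lock()

    def add_fact(self, src, rel, dst):
        with self.lock:
            self.edges.append((src, rel, dst))

    def neighbors(self, src):
        with self.lock:
            return [(rel, dst) for s, rel, dst in self.edges if s == src]

inbox = queue.Queue()  # producers push facts here without blocking
store = GraphStore()

def worker():
    while True:
        fact = inbox.get()
        if fact is None:  # sentinel: shut down the worker
            break
        store.add_fact(*fact)

t = threading.Thread(target=worker)
t.start()

# Producers report interesting facts as they happen.
inbox.put(("user:ann", "LIKES", "page:a"))
inbox.put(("user:ann", "FOLLOWS", "user:bob"))
inbox.put(None)
t.join()

# A consumer queries the knowledge via the same abstraction.
print(store.neighbors("user:ann"))  # -> [('LIKES', 'page:a'), ('FOLLOWS', 'user:bob')]
```

The queue decouples producers from the write path, which is the asynchronicity argued for above: a producer never waits on the graph write, and facts still land in the store in order.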
Who uses Knowledge Graphs and why?
In 2013, Facebook launched the Graph API, and everything about their business has changed since then. Advertisers could niche down to an unimaginably granular level. Apps and AI became scary-intelligent. Users found far more contextual and useful recommendations, all delivered in under a second, like Google, yet current.
We can use the same technique to be omnipresent in a customer's journey, not with yesterday's data (e.g., conventional warehousing) but with current data; not in 1-3 minutes, but now. An apt analogy would be a hardcover journal vs. RoamResearch, or a 500-page geography book vs. a world map.
The following are some of the widely known applications:
- Relationship mining: ability to find links between entities and concepts that are not apparent
- Influencer mining: ability to calculate centrality and find key concepts/entities that influence a given outcome
- Similarity mining: find concepts/entities that are similar based on active/latent features, useful for comparing X vs. Y
- Community mining: clustering/bucketing similar entities, useful for grouping concepts/entities
- Recommendation/prediction: based on similar previous links between concepts/entities, predict/recommend links yet to be made
- Anomaly detection: for example, fraud detection
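As a taste of one of these, influencer mining can start with a centrality measure as simple as in-degree; below is a hypothetical sketch in plain Python (real systems would run PageRank or betweenness centrality inside the graph database):

```python
from collections import Counter

# Hypothetical directed edges: who references (points at) whom.
edges = [
    ("a", "b"), ("c", "b"), ("d", "b"),  # b is referenced by three nodes
    ("b", "e"), ("d", "e"),
]

# In-degree centrality: count incoming links per node.
in_degree = Counter(dst for _, dst in edges)

top, score = in_degree.most_common(1)[0]
print(top, score)  # -> b 3
```

The node with the most incoming links is the first candidate for a key concept or influencer; the other mining tasks in the list follow the same pattern with different graph algorithms.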
The potential of graphs, especially in the context of the financial industry, is immense. In this post, we have briefly seen what it might look like to move from contemporary data warehousing solutions toward the graph technology behind the API successes of Facebook (2013) and Google (2012) and of data miners like Thomson Reuters.