Knowledge Architecture: An unfair advantage in Data

This article encourages a shift in thinking from a fixation on data warehousing to a knowledge architecture.

The buzzword "AI" is shiny, and the internet often exaggerates its results, but skeptics remain hard to convince. AI can learn to predict data points with grandmaster-level accuracy, yet its applications, so far, are nowhere near a first-grader at demonstrating basic reasoning skills.

This post briefs you on how to supercharge your organization's Data Science and Machine Learning (ML) data strategy by architecting a knowledge graph (KG). A full AI example is outside the scope of this post; instead, it presents a NoDW (Not-only Data Warehouse) example that can pave the way to hyper-contextualized AI models. Note: KGs are themselves a part of AI.

Highlights

Most ML applications are unnatural and lack reasoning.

ML has a fascinating track record in the marketing world. Some organizations have been suffering from FOMO, and the rest want ML to create a competitive edge. Regardless of the motive, ML is a lucrative option. Therefore, a lot of investment goes into creating and fine-tuning ML models to retrofit them to business needs.

Predictive models, when not optimized for reasoning

Accuracy paired with basic cognitive skills is critical when automating industries where due diligence is vital, e.g., transactional ones. Some of us get credit applications rejected because they came from a zip code with a historically bad credit reputation: the machine learning models were optimized for accuracy alone. By design, ML algorithms suffer from inductive bias.

Most ML applications focus too much on generalizing training data and maximizing the success function, but lack the context behind the training data.

Google is fast while carrying the internet, and your data is slow.

Google is the poster child of the internet. It showed us that the right information can be found on the internet, and found fast. Google often processes and analyzes free text, whereas most organizations process their data within convenient, self-defined schemas. Typical business organizations handle nowhere near Google's volume and usage, yet their performance and user experience are also nowhere near Google's. How does Google do it?

It takes a massive village to deliver Google-like performance, so this post limits its scope to just one aspect of the data strategy.
Your data is slow

So how does an internet-scale company consistently produce results in under one second? The most straightforward assumption is that they pre-compute and cache results. That is probably true in many cases, but we, too, run pipelines to process organizational data and prepare data warehouses, right? And our performance is still nowhere near Google's.

The reason is that we search for needles in a data haystack. We are not pre-computing knowledge. What is knowledge, anyway?

The knowledge edge

The difference between data, information, and knowledge is widely documented on the internet. Generally, knowledge is a personal or organizational interpretation of information, which may not be holistically perfect but is fair enough to act upon and profit from. Furthermore, knowledge gives wisdom its explainability and auditability.

A personal or organizational understanding of knowledge is, in effect, its truth. Businesses can rely on a collection of such truths, a knowledge database. If Google were the internet's brain, that brain would be predominantly filled with knowledge, not just data. Alexa and Google Home devices often demonstrate cognitive skills, a must for human communication, and they are powered mainly by knowledge databases.

A sample query helps compare the two approaches: find the pages I do not like yet, but my 2nd-degree connections do.

Graph traversal vs. death by SQL JOINs

Knowledge databases are often referred to as knowledge graphs. When designed right, knowledge graphs enjoy the benefits of a plethora of graph-based algorithms for relationship storage, discovery, processing, and prediction, advantages that typical relational and NoSQL databases lack.

Compare traversing a few tens or hundreds of hops on a graph with JOINing tables by ID across hundreds of thousands of rows. Graph traversal scales with the number of hops rather than the size of the tables, so its cost stays relatively linear.
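To make the comparison concrete, here is a minimal sketch in Python of the sample query above ("pages my 2nd-degree connections like, but I do not"), using networkx as a stand-in for a real graph database; the node names and the tiny dataset are made up for illustration.

```python
# A minimal sketch of the sample query, on a hypothetical toy graph.
import networkx as nx

G = nx.DiGraph()
G.add_edge("me", "alice", rel="CONNECTED")      # 1st-degree connection
G.add_edge("alice", "bob", rel="CONNECTED")     # bob is 2nd-degree
G.add_edge("bob", "page_cooking", rel="LIKES")
G.add_edge("me", "page_hiking", rel="LIKES")
G.add_edge("alice", "page_hiking", rel="LIKES")

def pages_liked_by_2nd_degree(g, me):
    """Pages liked by 2nd-degree connections that `me` does not like yet."""
    out = lambda u, rel: {v for _, v in g.out_edges(u)
                          if g.edges[u, v]["rel"] == rel}
    first = out(me, "CONNECTED")
    second = set().union(*(out(u, "CONNECTED") for u in first)) - first - {me}
    their_likes = set().union(*(out(u, "LIKES") for u in second))
    return their_likes - out(me, "LIKES")

print(pages_liked_by_2nd_degree(G, "me"))  # {'page_cooking'}
```

The equivalent SQL would self-join a connections table twice and anti-join a likes table; on a graph, it is simply a two-hop traversal.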

And it gives context to the data.

Nobody's data is absolutely perfect, but what can we do about it?

Spoiler alert: a KG cannot solve all of your data problems. Data is big trouble in any organization. Some of the practical reasons for this are:

  • absence of standardization across the org.
  • multiple sources of truth
  • duplication
  • inconsistent interpretation
  • stale data points
  • inadequate documentation
  • constrained budget and so on

If data has not become a problem yet, your organization likely has not grown enough, or you are not pushing its data capabilities hard enough. Or you are throwing too much money at "managed" services, only to realize later that it is unsustainable.

Business always has a de facto response to anything too complex to solve: "manage/de-risk" it. Hence, the solution is to create the illusion that you are managing data and its madness.

For that, let us summarize and list the problems we have at hand and see if knowledge graph-based techniques can offer solutions.

Data warehouses vs. Graph-oriented design

For brevity's sake, the process of building a knowledge architecture has been reduced to three steps:

  • (Optional) Developing a taxonomy and ontology
  • Building the graph database
  • Publishing graph as an API

(Optional) Developing taxonomy and ontology

Typical businesses resist building a taxonomy and an ontology. While I strongly recommend them, the reality is that they might never get built in your organization. They can serve as an excellent centralized repository of business terms, together with their hierarchy, meaning, relationships, interpretation, and validation.

  • Taxonomy: the easiest definition of a taxonomy is logically grouping terms and creating a hierarchy out of them
  • Ontology: a poor man's definition is creating a class-relationship map based on the taxonomy, along with attributes, constraints, cardinalities, enums, and even instances of the classes. If you create an ontology for a zoo, an instance of Zebra can be part of it.
A view of Web Ontology Language (OWL) from an IDE

With a reasoner, you can even validate an incoming object programmatically against the ontology you have already built. Below is probably the tiniest possible ontology and what a reasoner could infer from a bare-minimum input.

Inference from an Ontology using a Reasoner
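As a rough illustration of that inference step, here is a minimal sketch using Python and rdflib. The zoo ontology, namespace, and names are made up, and a SPARQL property path stands in for a full OWL reasoner.

```python
# A tiny hypothetical "zoo" ontology queried with rdflib; a SPARQL
# property path plays the role of a (very small) reasoner here.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix :     <http://example.org/zoo#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Zebra  rdfs:subClassOf :Mammal .
:Mammal rdfs:subClassOf :Animal .
:marty  a :Zebra .
""", format="turtle")

# Infer every class :marty belongs to, directly or transitively.
results = g.query("""
PREFIX :     <http://example.org/zoo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cls WHERE { :marty a/rdfs:subClassOf* ?cls }
""")
for row in results:
    print(row.cls)  # ...zoo#Zebra, ...zoo#Mammal, ...zoo#Animal
```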

With a Taxonomy and Ontology at hand, designing a graph becomes a breeze, but you can still go ahead without them.

Building the graph

Constructing a graph is about storing interesting facts, their relationships, and the metadata of those relationships. Let us solve a real data challenge.

Sample challenge

Build an app that detects trending posts across social media platforms and curates them by calculating their potential for virality.

Thinking model: once a post is created, every time an interesting event happens at time t, we record an event score, score_t. For instance, each share, comment, and reaction adds 1 point: score_t += 1. Now, given a maximum window t_MAX and a (possibly large) virality tipping point score_TP, when t < t_MAX and sum(score_t) > score_TP both hold, we classify the post as potentially viral.

A quick example: if a post gets more than 700 reactions + shares + comments (you can define more criteria, taking NLP, emotion, etc. into account) within 60 minutes, it gets picked up by the app for promotion.
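Here is a minimal sketch of that thinking model in Python; the threshold, window, and in-memory bookkeeping come from the hypothetical example above, not from a production design.

```python
# Score events per post and flag potential virality, per the model above.
from collections import defaultdict

T_MAX = 60 * 60      # 60-minute window, in seconds (hypothetical)
SCORE_TP = 700       # virality tipping-point score (hypothetical)

scores = defaultdict(int)   # post_id -> sum(score_t)
created_at = {}             # post_id -> creation timestamp

def record_event(post_id: str, t: float) -> bool:
    """Add 1 point for a share/comment/reaction at time t; return
    True once the post crosses the tipping point within the window."""
    elapsed = t - created_at[post_id]
    if elapsed < T_MAX:
        scores[post_id] += 1            # score_t += 1
    return elapsed < T_MAX and scores[post_id] > SCORE_TP

created_at["post42"] = 0.0
for t in range(800):                    # 800 events in ~13 minutes
    viral = record_event("post42", float(t))
print(viral)  # True: 800 points > 700 within 60 minutes
```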

Computing potential content virality using graph

Relating this to the previous figure, here is one way to connect two nodes (e.g., nouns) with a relation (e.g., a verb) in a graph, using English grammar:

Noun 1 with Adjective 1 -> Verb 1 with Adverb 1 -> Noun 2 with Adjective 2.

Unlike RDBMSs, graph databases have efficient implementations of graph algorithms built in, so traversing them is not hard. Another difference is that relations in graph databases can carry useful attributes. For instance, in the example above, we track the time of each event with timestamp: t.
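As a rough sketch, here is how that noun-verb-noun pattern and the timestamp attribute might look in networkx; the nodes, labels, and values are made up.

```python
# Relations (edges) carrying attributes, per the pattern above.
import networkx as nx

G = nx.MultiDiGraph()   # multi-edges: a user may react to a post many times
G.add_node("user_1", kind="User", mood="curious")     # Noun 1 + Adjective 1
G.add_node("post_42", kind="Post", topic="travel")    # Noun 2 + Adjective 2
G.add_edge("user_1", "post_42", rel="REACTED",        # Verb 1
           how="enthusiastically", timestamp=12.5)    # Adverb 1 + t
G.add_edge("user_1", "post_42", rel="SHARED", timestamp=30.0)

# Sum the scores of events inside the window, as in the thinking model.
window_score = sum(1 for _, _, d in G.edges(data=True)
                   if d["timestamp"] < 3600)
print(window_score)  # 2
```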

Publishing graph as an API

If you were the only producer and consumer of the data, you would not have to rethink it. Often, in a complex data setup, others depend on the data you produce and vice versa. Therefore, asynchronicity in communication is critical to a graph's success.

A Sample Knowledge Architecture
  • Taxonomy + Ontology: It was covered earlier.
  • Graph database: The heart of a knowledge graph is a graph database. Graph databases are specialized NoSQL solutions that efficiently store nodes, their attributes, and the relationships between them. They are strong candidates for serving as the single source of truth. Most graph databases also come with built-in graph traversal algorithms and a query system.

    Data Science models and Semantic Web technologies can then be applied to such a database for highly contextualized analytics and training.
  • Graph Intelligence API (GIA): This API abstracts and controls access to the graph database. If immutability is desired, you may restrict update operations to authorized callers only. A minimal sketch follows this list.
  • Producers: These are the parts of your apps that supply facts. Whenever a producer comes across an interesting fact to report, it may make an asynchronous call to the GIA to store it with the appropriate relationships among nodes. This makes facts instantly available in the graph instead of waiting for long-running data warehouse pipelines to finish.

    One of the benefits of the graph is that it is always a work in progress, so there is no significant upfront cost in migrating all data to the graph. Instead, systems interested in publishing their facts to the graph, whenever they are ready, can start pushing without depending on other systems or data points.
  • Consumers: As the name suggests, consumers request and access knowledge via the GIA.
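Here is a minimal sketch of what a GIA could look like, using FastAPI with an in-memory networkx graph as a hypothetical stand-in for a real graph database; the endpoints and payload shape are assumptions for illustration only.

```python
# A toy Graph Intelligence API: producers POST facts, consumers GET knowledge.
import networkx as nx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
graph = nx.MultiDiGraph()        # stand-in for a real graph database

class Fact(BaseModel):
    subject: str
    relation: str
    obj: str
    timestamp: float

@app.post("/facts")              # producers push facts as they happen
def add_fact(fact: Fact):
    graph.add_edge(fact.subject, fact.obj,
                   rel=fact.relation, timestamp=fact.timestamp)
    return {"status": "stored"}

@app.get("/neighbors/{node}")    # consumers read knowledge back
def neighbors(node: str):
    nbrs = list(graph.successors(node)) if node in graph else []
    return {"neighbors": nbrs}
```

Producers can call POST /facts fire-and-forget, so new facts land in the graph immediately rather than waiting for a batch pipeline to finish.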

Who uses Knowledge Graphs and why?

In 2013, Facebook launched Graph Search on top of its Graph API, and everything about its business has changed since then. Advertisers could niche down to an unimaginably granular level. Apps and AI became scary-intelligent. Users found far more contextual and useful recommendations, all in subsecond responses like Google's, yet current.

We can use the same technique to be omnipresent in a customer's journey: not with yesterday's data (as with conventional warehousing) but with current data, not in 1-3 minutes but now. An apt analogy would be a hardcover journal vs. RoamResearch, or a 500-page geography book vs. a world map.

The following are some of the widely known applications; a small sketch of a few of them follows the list:

  • Relationship mining: ability to find links between entities and concepts that are not apparent
  • Influencer mining: ability to calculate centrality and find key concepts/entities that influence a given outcome
  • Similarity mining: find concepts/entities that are similar based on active/latent features, useful for comparing X vs. Y
  • Community mining: clustering/bucketing similar entities, useful for grouping concepts/entities
  • Recommendation/prediction: based on similar previous links between concepts/entities, predict/recommend links yet to be made
  • Anomaly detection: for example, fraud detection
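Here is a small sketch of a few of these applications on a toy social graph, using networkx built-ins; the example graph and the node pair are illustrative only.

```python
# Influencer, community, and link-prediction mining on a toy graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()   # a classic toy social network

# Influencer mining: who bridges the most paths?
central = max(nx.betweenness_centrality(G).items(), key=lambda kv: kv[1])
print("most influential node:", central[0])

# Community mining: cluster similar entities.
print("communities found:", len(greedy_modularity_communities(G)))

# Recommendation/prediction: score a link yet to be made.
for u, v, p in nx.jaccard_coefficient(G, [(0, 33)]):
    print(f"predicted link {u}-{v}, similarity score {p:.2f}")
```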

Conclusion

The potential for graphs, especially in the context of the financial industry, is immense. In this post, we have briefly seen what it might look like, coming from contemporary data warehousing solutions, to adopt the graph technology behind the API successes of Facebook (2013) and Google (2012) and of many data miners like Thomson Reuters.