Using Apache Pinot and Kafka to Analyze GitHub Events

Friday, April 10, 2020

Pinot is the latest Apache-incubated project to follow in the footsteps of other tremendously popular open source projects that were first built by engineers at LinkedIn. Pinot joins the likes of Kafka, Helix, and Samza; the first of these, Kafka, is quickly becoming the industry’s message broker of choice for building highly scalable cloud-native applications.

Outside of LinkedIn, Uber was one of the early adopters of Pinot, using it to power analytics for a variety of use cases such as UberEats Restaurant Manager.

Getting Started with Apache Pinot

In this blog post, we’ll show you how Pinot and Kafka can be used together to ingest, query, and visualize event streams sourced from the public GitHub API. For the step-by-step instructions, please visit our documentation, which will guide you through the specifics of running this example in your development environment.

Pinot Overview

First, let’s do a quick overview of the Pinot components that we’ll be using in this tutorial.

Pinot System Architecture Diagram

Physical components

Zookeeper
Pinot uses Apache Zookeeper to store cluster state and metadata. It is the very first component that needs to come up when creating a Pinot cluster.

Controllers
Controllers maintain the global metadata of the system, with the help of Zookeeper. They manage all other components of the cluster and are responsible for initializing the real-time consumption. Controllers have admin endpoints for managing configs and cluster operations.

Brokers
Brokers handle Pinot queries by forwarding them to the right servers, merging the received results, and sending them back to the client.

Servers
Servers host data segments and serve queries off the hosted data. When the data source is a real-time stream, servers directly ingest from the stream, periodically converting the in-memory ingested data into segments and writing them into the segment store.

Tutorial Overview

Now that you know the basics of Pinot and its architecture, let’s dive into what we’ll be building in this tutorial. Before we review how to ingest GitHub events from Kafka, let’s get familiar with the components we will use to query that data in Pinot. Similar to many NoSQL databases, Pinot has a browser-based query console and a REST API. We call this component the Pinot Controller, and it is the easiest way to run queries outside of a terminal or custom application.

Pinot Controller Start Page

As a part of the Pinot Controller, we provide a query console called the Pinot Data Explorer. If you’re new to Pinot, we have put together a custom docker image and instructions that will help you get up and running as fast as possible.

Pinot Data Explorer

Now that you have a local Pinot cluster up and running and can access the Pinot Data Explorer console, let’s go over how to create a schema and table that map a Kafka topic to a queryable data structure in Pinot.

Ingesting GitHub Events with Apache Kafka

Pinot has a variety of ways to ingest data collected from event streams. Today, we’ll be using Apache Kafka to collect event data from GitHub’s public REST API. We chose GitHub events because they are publicly available, arrive as a constant stream, and yield interesting, relatable insights about open source projects. We’ll then use Pinot to easily run analytical queries on the aggregate data model built from the event streams stored in Kafka topics.

We will be using the /events API from GitHub. In order to get all events related to commits being merged, we are going to collect events of type “PullRequestEvent” which have action == closed and merged == true. For every pull request event that we receive, we will make additional calls to fetch the commits, comments, and review comments on the pull request. The URLs for these calls are available in the payload of the pull request event.
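
The project linked in our documentation implements this collection loop for you, but a rough Java sketch conveys its shape. The topic name, polling interval, and error handling below are illustrative assumptions rather than the demo’s actual code:

```java
// A minimal sketch: poll the GitHub /events API, keep only pull requests that were
// closed by being merged, and publish them to a Kafka topic for Pinot to consume.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

public class GitHubEventsToKafka {

    public static void main(String[] args) throws Exception {
        String githubToken = args[0];              // personal access token for a higher rate limit
        String topic = "pullRequestMergedEvents";  // hypothetical topic name

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        HttpClient http = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://api.github.com/events"))
                        .header("Authorization", "token " + githubToken)
                        .build();
                JsonNode events = mapper.readTree(
                        http.send(request, HttpResponse.BodyHandlers.ofString()).body());

                for (JsonNode event : events) {
                    if (!"PullRequestEvent".equals(event.path("type").asText())) {
                        continue;
                    }
                    JsonNode payload = event.path("payload");
                    boolean merged = payload.path("pull_request").path("merged").asBoolean(false);
                    if ("closed".equals(payload.path("action").asText()) && merged) {
                        // The full pipeline would also follow the commits, comments and review
                        // comments URLs found in the payload before producing the record.
                        producer.send(new ProducerRecord<>(topic, event.toString()));
                    }
                }
                Thread.sleep(10_000);  // stay well under GitHub's API rate limit
            }
        }
    }
}
```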

Using these four payloads (the pull request event plus its commits, comments, and review comments), we generate a schema for Pinot with dimensions, metrics, and a time column, as follows:

GitHub Event Schema for Apache Pinot
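
To make this concrete, a schema along these lines could be registered with the controller; the field names and types below are illustrative and may not match the exact schema used in the demo:

```json
{
  "schemaName": "pullRequestMergedEvents",
  "dimensionFieldSpecs": [
    { "name": "title", "dataType": "STRING" },
    { "name": "userId", "dataType": "STRING" },
    { "name": "repo", "dataType": "STRING" },
    { "name": "organization", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "numCommits", "dataType": "LONG" },
    { "name": "numComments", "dataType": "LONG" },
    { "name": "numReviewComments", "dataType": "LONG" }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "mergedTimeMillis",
      "dataType": "LONG",
      "timeType": "MILLISECONDS"
    }
  }
}
```

The schema is then referenced from a REALTIME table config that points Pinot’s servers at the Kafka topic. This is a trimmed sketch based on Pinot’s documented stream configuration properties; the topic name, broker list, and flush thresholds are placeholders:

```json
{
  "tableName": "pullRequestMergedEvents",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "pullRequestMergedEvents",
    "timeColumnName": "mergedTimeMillis",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "pullRequestMergedEvents",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "realtime.segment.flush.threshold.time": "12h",
      "realtime.segment.flush.threshold.size": "100000"
    }
  },
  "metadata": {}
}
```

Both payloads are registered through the controller’s admin endpoints, after which the servers begin consuming from the topic.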

Querying GitHub Events with Apache Pinot

Now that we have our schema and table created, Pinot is able to ingest GitHub events from Kafka so that we can query them using PQL. Earlier we reviewed the easiest way to query Pinot from a web browser, using the Pinot Data Explorer. Pinot’s query language, PQL (Pinot Query Language), is a declarative, SQL-based language that gives you a familiar way to interface with data. It’s important to mention that while PQL shares much of SQL’s semantics, it is not intended to be used for database transactions or writes.

After firing up the Pinot Data Explorer, you’ll be able to run a PQL query to fetch data from the GitHub events that are being ingested in real time.

Query GitHub Event Schema in Pinot Data Explorer
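
For example, assuming the illustrative table and column names from the sketches above, a couple of simple PQL queries might look like this:

```sql
-- Peek at the most recently ingested rows
SELECT * FROM pullRequestMergedEvents LIMIT 10

-- Merged pull requests per organization (PQL uses TOP rather than LIMIT for GROUP BY results)
SELECT COUNT(*) FROM pullRequestMergedEvents GROUP BY organization TOP 10
```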

From here, there are many different ways to interface with and visualize the real-time event data from GitHub. One such way is to create a chart using Apache Superset, a popular open source web-based business intelligence tool that is commonly used to build reports visualizing Pinot data queried with PQL.

Querying GitHub Event Data from Pinot in Superset

You can find more details and instructions on using Superset with Pinot from this community blog post.

Conclusion

In this tutorial we introduced you to using Kafka and Pinot to analyze, query, and visualize event streams ingested from GitHub. Please visit our documentation for the comprehensive step-by-step tutorial referenced in this blog post.

If you’re interested in learning more about Pinot, become a member of our open source community by joining our Slack channel and subscribing to our mailing list.

We’re excited to see how developers and companies are using Apache Pinot to run highly scalable analytics on real-time event data. Feel free to ping us on Twitter or Slack with your stories and feedback.

Finally, here is a list of resources that you might find useful as you start your journey with Apache Pinot.

Special thanks

A very special thanks to Neha Pawar and the Apache Pinot engineering team for co-authoring this blog post. If you’re interested in co-authoring or contributing an article to our developer blog, please reach out to @kennybastani on Twitter.

Sentiment Analysis on Twitter Data Using Neo4j and Google Cloud

Thursday, September 19, 2019

In this blog post, we’re going to walk through designing a graph processing algorithm on top of Neo4j that discovers the influence and sentiment of tweets in your Twitter network.

The source code for this reference application is open source. You can find the GitHub project here.

Graph Data Modeling

The first thing we’ll need to do is design a data model for analyzing the sentiment and influence of users on Twitter. This example iterates on an earlier graph processing example described in another blog post. I recommend taking a look at that post to better understand the concepts I talk about in this one.

The diagram below is the graph data model that we will use to import, analyze, and query data from Twitter.

Twitter graph data model

In the diagram above, the following relationships are described (a Cypher sketch of this model follows the list):

  • Users follow other users

  • Users create tweets

  • Tweets contain phrases

  • Phrases are categorized into topics
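
A minimal Cypher sketch of this model, using hypothetical labels and relationship types (the reference application’s actual names may differ), looks like this:

```cypher
// Create a tiny example graph following the model above (names are illustrative).
CREATE (alice:User {screenName: "alice"}),
       (bob:User {screenName: "bob"}),
       (alice)-[:FOLLOWS]->(bob),
       (t:Tweet {id: 1, text: "Graphs are everywhere #neo4j"}),
       (bob)-[:TWEETED]->(t),
       (p:Phrase {text: "graphs are everywhere"}),
       (t)-[:CONTAINS]->(p),
       (topic:Topic {name: "graph databases"}),
       (p)-[:IN_TOPIC]->(topic);

// Which topics does each user in the network tweet about?
MATCH (u:User)-[:TWEETED]->(:Tweet)-[:CONTAINS]->(:Phrase)-[:IN_TOPIC]->(topic:Topic)
RETURN u.screenName AS user, topic.name AS topic, count(*) AS mentions
ORDER BY mentions DESC;
```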

Twitter User Ranking

For this first blog post, we’re going to focus on generating a ranking of influential Twitter users in my social network that also tells me which topics each user tweets about.

Twitter influencer ranking with topic

The screenshot above shows the results of a Neo4j Cypher query. Here we find a list of Twitter users discovered by a crawling algorithm based on PageRank. This output is similar to the dashboard created in an earlier blog post, but adds a top category, a top phrase, and a sentiment score.
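
The crawling and ranking logic lives in the reference application, but as a rough illustration, a PageRank-style ranking over the follower graph could be computed with the Neo4j Graph Algorithms library. The procedure call and the labels below assume that library and the hypothetical model from the earlier sketch:

```cypher
// Rank users by PageRank over FOLLOWS relationships (illustrative only).
CALL algo.pageRank.stream('User', 'FOLLOWS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
MATCH (u:User) WHERE id(u) = nodeId
RETURN u.screenName AS user, score
ORDER BY score DESC
LIMIT 10;
```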

From Microservices to Service Blocks using Spring Cloud Function and AWS Lambda

Thursday, July 6, 2017

This blog post will introduce you to building service block architectures using Spring Cloud Function and AWS Lambda.

What is Spring Cloud Function?

Spring Cloud Function is a project from Pivotal that brings the same popular fundamentals behind Spring Boot to serverless functions.
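
To ground that, here is a minimal sketch of a Spring Cloud Function application; the class and function names are illustrative. The business logic is just a java.util.function.Function bean, which the framework can expose over HTTP, messaging, or an adapter for a FaaS platform such as AWS Lambda:

```java
// A minimal Spring Cloud Function application: the function is a plain bean,
// and the framework adapts it to the target platform.
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class GreetingFunctionApplication {

    public static void main(String[] args) {
        SpringApplication.run(GreetingFunctionApplication.class, args);
    }

    // Exposed as the function named "greet"; the name and payload type are illustrative.
    @Bean
    public Function<String, String> greet() {
        return name -> "Hello, " + name + "!";
    }
}
```

The same bean can be packaged with the AWS adapter and deployed as a Lambda function without changing the function itself.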

Service Block Architecture

One of the most important considerations in software design is modularity. If we think about modularity in the mechanical sense, components of a system are designed as modules that can be replaced in the event of a mechanical failure. In the engine of a car, for example, you do not need to replace the entire engine if a single spark plug fails.

In software, modularity allows you to design for change.

Modularity also gives developers a shared map that can be used to reason about the functionality of an application. By being able to visualize and map out the complex processes orchestrated by an application’s source code, developers and architects alike can more easily see where to make a change with surgical precision.

Changing software

In many ways, we should consider ourselves lucky to be building software instead of cars. Some of today’s most valuable companies are created using bits and bytes instead of plastic and metal. But despite these advances, the very best car company releases less often than the world’s very worst software company.

An application’s source code is a system of connected bits and bytes that is always evolving—one change after another. But, as the source code of a system expands or contracts, small changes require us to build and deploy entire applications.

To make one small code change to a production environment, we are required to deploy everything else we didn’t change.

When teams share a deployment pipeline for an application, they are forced to plan around a schedule they have little or no control over. For this reason, innovation is stifled, as developers must wait for the next bus before they can get any feedback about their changes.

The result of building microservices is an ever-increasing number of pathways to production. With more and more microservices, the amount of unchanged code per deployment decreases when measured across all applications. Decomposing into microservices drives down the amount of unchanged code deployed over time, which is an important metric. Serverless functions can help push this number even lower, as the unit of change becomes the function. But how do microservices and serverless functions fit together?

Service Blocks

Service blocks are cloud-native applications that share many characteristics with microservices. The key difference from microservices is that a service block is a self-contained system with multiple independently deployable units, mixing serverless functions with containers.

Service Block Spring Cloud Function

While microservices can be created entirely as serverless functions, a service block focuses on a contextual model that combines traditional "always-on" applications with portable on-demand functions.

The Patterns

The basic pattern of a service block combines a core application running in a container with a collection of serverless functions.

Service Block Patterns Spring Cloud Function