
Creating a PageRank Analytics Platform Using Spring Boot Microservices

Sunday, January 3, 2016

This article introduces you to a sample application that combines multiple microservices with a graph processing platform to rank communities of users on Twitter. We're going to use a collection of popular tools as a part of this article's sample application: Spring Boot and Spring Cloud for the microservices, Neo4j and Apache Spark for graph processing, and Docker for deployment.

Ranking Twitter Profiles

Let’s do an overview of the problem we will solve as a part of our sample application. The problem we’re going to solve is how to discover communities of influencers on Twitter using a set of seed profiles as inputs. To solve this problem without a background in machine learning or social network analytics might be a bit of a stretch, but we’re going to take a stab at it using a little bit of computer science history.

The PageRank algorithm, created by Google co-founders Larry Page and Sergey Brin, was first used by Google to rank website documents by analyzing the graph of backlinks between sites.

I dug up the original research paper on PageRank from Stanford for some inspiration. In the paper, the authors talk about the notion of approximating the "importance" of an academic publication by weighting the value of its citations.

The reason that PageRank is interesting is that there are many cases where simple citation counting does not correspond to our common sense notion of importance. For example, if a webpage has a link to the Yahoo home page, it may be just one link but it is a very important one. This page should be ranked higher than many pages with more links but from obscure places. PageRank is an attempt to see how good an approximation to "importance" can be obtained just from the link structure.
— Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999)
The PageRank Citation Ranking: Bringing Order to the Web

Now let's take the same definition that is described in the paper and apply it to our problem of discovering important profiles on Twitter. Twitter users typically follow other users to track their updates as a part of their stream. We can use the same reasoning behind applying PageRank to citations to approximate the "importance" of profiles on Twitter: it's not the number of followers that makes a profile important, but how important those followers are.

That’s exactly what we’re going to build in this article, and we’ll end up with something that looks like the following table.

Rank  Profile       Followers  PageRank
1.    @ftrain       31948      7368.2417
2.    @harper       32452      6754.455
3.    @worrydream   37658      6747.585
4.    @lstoll       41067      5976.3555
5.    @katemats     25799      5916.3843
6.    @rands        35079      5888.145
7.    @al3x         41099      5547.4307
8.    @defunkt      45310      4787.9644
9.    @SaraJChipps  29617      4271.676
10.   @leahculver   30723      3852.3728

The first thing we’re going to need to worry about when building this solution is how we’re going to calculate PageRank on potentially millions of users and links. To do this, we’re going to use something called a graph processing platform.

What is a graph processing platform?

A graph processing platform is an application architecture that provides a general-purpose job scheduling interface for analyzing graphs. The application we'll build will make use of a graph processing platform to analyze and rank communities of users on Twitter. For this we'll use Neo4j Mazerunner, an open source project I started that connects Neo4j's database server to Apache Spark.

The diagram below illustrates a graph processing platform similar to Neo4j Mazerunner.

Graph processing platform diagram

Submitting PageRank Jobs to GraphX

The graph processing platform I’ve described will provide us with a general purpose API for submitting PageRank jobs to Apache Spark’s GraphX module from Neo4j. The PageRank results from GraphX will be automatically applied back to Neo4j without any additional work to manually handle data loading. The workflow for this is extremely simple for our purposes. From a backend service we will only need to make a simple HTTP request to Neo4j to begin a PageRank job.
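As a sketch of what that request might look like from a backend service, here is a minimal example using Spring's RestTemplate. The exact Mazerunner endpoint path and the FOLLOWS relationship type are assumptions based on the project's README, so adjust them for your own deployment:

```java
import org.springframework.web.client.RestTemplate;

public class PageRankJobClient {

    // Assumption: Mazerunner's Neo4j extension is mounted at this path and
    // the Twitter follower graph uses FOLLOWS relationships; adjust both
    // for your own deployment.
    private static final String PAGERANK_JOB_URL =
            "http://localhost:7474/service/mazerunner/analysis/pagerank/FOLLOWS";

    public static void main(String[] args) {
        // A single GET request asks Neo4j to export the subgraph and submit
        // a PageRank job to Spark's GraphX module; the results are applied
        // back to Neo4j when the job completes.
        String response = new RestTemplate().getForObject(PAGERANK_JOB_URL, String.class);
        System.out.println(response);
    }
}
```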

I’ve also taken care of making sure that the graph processing platform is easily deployable to a cloud provider using Docker containers. In a previous article, I describe how to use Docker Compose to run Mazerunner as a multi-container application. We’ll do the same for this sample application but extend the Docker Compose file to include additional Spring Boot applications that will become our backend microservices.

By default, Docker Compose will orchestrate containers on a single virtual machine. If we were to build a truly fault tolerant and resilient cloud-based application, we’d need to be sure to scale our system to multiple virtual machines using a cloud platform. This is the subject of a later article.

Now that we understand how we will use a graph processing platform, let’s talk about how to build a microservice architecture using Spring Boot and Spring Cloud to rank profiles on Twitter.

Building Microservices

I’ve talked a lot about microservices in past articles. When we talk about microservices we are talking about developing software in the context of continuous delivery. Microservices are not just smaller services that scale horizontally. When we talk about microservices, we are talking about being able to create applications that are the product of many teams delivering continuously in independent release cycles. Josh Long and I describe at length how to untangle the patterns of building and operating JVM-based microservices in O’Reilly’s Cloud Native Java.

In this sample, we'll build four microservices, each as a Spring Boot application. If we were to build this architecture as microservices in an authentic scenario, each microservice would be owned and managed by a different team. This is an important distinction in this new practice, as there is much confusion around what a microservice is and what it is not. A microservice is not just a distributed system of small services. The practice of building microservices should never be separated from the discipline of continuous delivery.

For the purposes of this article, we’ll focus on scenarios that help us gain experience and familiarity with building distributed systems that resemble a microservice architecture.

Overview

Now let’s do a quick overview of the concepts we’re going to cover as a part of this sample application. We will apply the same recipe from previous articles on similar topics for building microservices with Spring Boot and Spring Cloud. The key difference from my previous articles is that we are going to create a data service that does both batch processing tasks as well as exposing data as HTTP resources to API consumers.
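To make that difference concrete, here is a minimal sketch of what such a data service might look like as a Spring Boot application. The class name, endpoint paths, and placeholder response are hypothetical, shown only to illustrate a service that both triggers a batch job and serves data over HTTP:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
@RestController
public class RankingServiceApplication {

    // Batch side: schedule the PageRank analysis (same assumed Mazerunner
    // endpoint as in the earlier sketch)
    @RequestMapping(value = "/v1/jobs/pagerank", method = RequestMethod.POST)
    public String schedulePageRankJob() {
        return new RestTemplate().getForObject(
                "http://localhost:7474/service/mazerunner/analysis/pagerank/FOLLOWS",
                String.class);
    }

    // HTTP resource side: expose ranked data to API consumers; a real
    // implementation would look the profile's rank up in Neo4j
    @RequestMapping(value = "/v1/profiles/{screenName}/rank", method = RequestMethod.GET)
    public String getProfileRank(@PathVariable String screenName) {
        return "PageRank for @" + screenName;
    }

    public static void main(String[] args) {
        SpringApplication.run(RankingServiceApplication.class, args);
    }
}
```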

System Architecture Diagram

The diagram below shows each component and microservice that we will create as a part of this sample application. Notice how we're connecting the Spring Boot applications to the graph processing platform we looked at earlier. Also, notice the connections between the services; these connections define the communication points between each service and the protocol that is used.

Microservice architecture with Spring Boot

The three applications that are colored in blue are stateless services. Stateless services do not attach a persistent backing service and do not need to manage state locally. The application colored in green is the Twitter Crawler service. Components colored in green will typically have an attached backing service, which is responsible for managing state locally and will persist it either to disk or in memory.

Getting Started with Apache Spark and Neo4j Using Docker Compose

Tuesday, March 10, 2015

I've received a lot of interest in Neo4j Mazerunner since first announcing it a few months ago. People from around the world, from authors writing new books about big data to PhD researchers working on some of the world's most challenging problems, have reached out to me, excited about the possibilities of using Apache Spark and Neo4j together.

I'm glad to see such a wide range of needs for a simple integration like this. Spark and Neo4j are two great open source projects, each focused on doing one thing very well. Integrating the two makes for an awesome result.

Less is always more, simpler is always better.

Apache Spark and Neo4j are both tremendously useful tools. I've seen how each gives its users a way to transform problems that start out large and complex into problems that are simpler and easier to solve. That's what the companies behind these platforms are getting at: they are two sides of the same coin.

One tool solves for scaling the size, complexity, and retrieval of data, while the other solves for the complexity of processing enormous amounts of data through distributed computation at scale. Both achieve this without sacrificing ease of use.

Inspired by this, I've been working to make the integration in Neo4j Mazerunner easier to install and deploy. I believe I've taken a step forward in this and I'm excited to announce it in this blog post.
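To give a flavor of that deployment, here is a minimal sketch of the three-container Docker Compose setup. The image names and tags follow the project's documentation but may have changed, so check the Mazerunner repository for the current file:

```yaml
# Sketch of the Mazerunner multi-container setup: HDFS for graph exports,
# the Spark-based analysis service, and the Neo4j database server.
hdfs:
  image: sequenceiq/hadoop-docker:2.4.1
  command: /etc/bootstrap.sh -d -bash

mazerunner:
  image: kbastani/neo4j-graph-analytics:latest
  links:
    - hdfs

graphdb:
  image: kbastani/docker-neo4j:latest
  ports:
    - "7474:7474"
  links:
    - mazerunner
    - hdfs
```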

Categorical PageRank Using Neo4j and Apache Spark

Monday, January 19, 2015

PageRank is an important concept in computer science and modern technology. It is important because it is the underlying algorithm that largely shapes the browsing experience of the more than 3 billion people who use the internet.

How does PageRank work?

The first PageRank algorithm was developed by Larry Page and Sergey Brin at Stanford in 1996. Sergey Brin had the idea that pages on the world wide web could be ordered and ranked by analyzing the number of links that point to each page. This idea was the foundation of the eventual rise of Google as the world's most popular search engine, which now handles over 3.5 billion searches every day.

PageRank gives us a measure of popularity in an ever connected world of information. With an enormous degree of complexity increasing every day in the virtual space of information sharing, PageRank gives us a way to understand what is important to us as users.
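For reference, the standard formulation of the algorithm defines the rank of a page p as a damped sum of the ranks of the pages that link to it:

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}
```

Here d is the damping factor (typically 0.85), N is the total number of pages, B(p) is the set of pages linking to p, and L(q) is the number of outbound links on page q. Intuitively, a page's rank is high when highly ranked pages link to it, which is exactly the "importance" notion described above.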

The unfortunate bit is that PageRank itself remains mostly unapproachable to anyone but seasoned engineers and esteemed academics. That's why I want to make it easier for every developer around the world to make this algorithm the foundation of their own innovations.

Distributing PageRank Jobs

It should be no surprise to regular readers of this blog that I am all about the graph. Graphs are the best abstraction of data that we have today. The concept is brilliantly easy and intuitive. Nodes represent data points and are described by metadata. Relationships connect nodes together, are likewise described by metadata, and enrich the information of each node relative to the others.
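As a hypothetical illustration of this model in Neo4j's Cypher query language (the labels and property names here are invented for the example), two linked pages look like this:

```cypher
// Two nodes described by metadata, enriched by a relationship between them
CREATE (a:Page { title: 'Page A' })
CREATE (b:Page { title: 'Page B' })
CREATE (a)-[:LINKS_TO]->(b)
```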

Neo4j Mazerunner Project

As I have been building the open source project Neo4j Mazerunner, which uses Apache Spark GraphX and Neo4j for large-scale graph analysis, I've come to understand the need for breaking PageRank down into categories, something I call "Categorical PageRank".

What Graph Analysis of Wikipedia Tells Us About the Relevancy of Recent Knowledge

Sunday, December 7, 2014

The chart below was generated using data analyzed with a Neo4j Graph Database and Apache Spark GraphX. 10.9 million Wikipedia articles and 110 million hyperlinks were analyzed to produce a PageRank and Triangle Count for each node in the graph. The Triangle Count metric is a measure of clustering, while the PageRank metric is a measure of relevancy.

Knowledge moves forward in time

Each year from 1850 to 2012 on the X-axis represents the Wikipedia page that describes the historical events and facts of that calendar year. Link analysis was performed on the inbound and outbound hyperlinks of each page and of all other pages in the graph that contribute to that page's relevancy.

The chart describes a probability distribution over time. This distribution indicates that if a person were to randomly click hyperlinks starting from any page on Wikipedia, the person would move towards articles with a higher closeness centrality to Category:Year pages occurring later in the timeline.

When it comes to our collective human knowledge, as time moves forward, distant history becomes less relevant than more recent events in our timeline.

To see this pattern you can click and drag areas of the chart to zoom in. You'll notice the pattern is local as well as global.

Why is the year 2000 so relevant?

Wikipedia, the world's largest encyclopedia of human knowledge, was first launched on January 15th, 2001.


A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Thursday, November 27, 2014

I've just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys a container with Apache Spark and uses GraphX to perform ETL and graph analysis on subgraphs exported from Neo4j. The Docker image is a great addition to Neo4j if you're looking to do easy PageRank or community detection on your graph data. Additionally, the results of the graph analysis are applied back to Neo4j.

This gives you the ability to optimize your recommendation-based Cypher queries by filtering and sorting on the results of the analysis.
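For example, assuming the analysis writes its score back to a pagerank property on each node (the label and property names here are illustrative, not the image's guaranteed schema), a query like this sorts results by importance:

```cypher
// Return the ten highest-ranked pages after the analysis has run
MATCH (p:Page)
WHERE p.pagerank IS NOT NULL
RETURN p.title, p.pagerank
ORDER BY p.pagerank DESC
LIMIT 10
```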

Photo credit: AMPLab, UC Berkeley

Using Apache Spark and Neo4j for Big Data Graph Analytics

Monday, November 3, 2014

As engineers, when we think about how to solve big data problems, evaluating technologies becomes a choice between scalable and not scalable. Ideally we choose the technologies that can scale to a variety of business problems without hitting a ceiling down the road.

Database technologies have evolved to be able to store big data, but are largely inflexible. The data models require tedious transformations and shuffling around of data, a complexity that is compounded when a variety of inflexible solutions and platforms are combined.

Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?