Getting Started with Apache Spark and Neo4j Using Docker Compose

Tuesday, March 10, 2015

I've received a lot of interest in Neo4j Mazerunner since first announcing it a few months ago. People from around the world have reached out to me and are excited about the possibilities of using Apache Spark and Neo4j together. From authors who are writing new books about big data to PhD researchers who need it to solve the world's most challenging problems.

I'm glad to see such a wide range of needs for a simple integration like this. Spark and Neo4j are two great open source projects that are focusing on doing one thing very well. Integrating both products together makes for an awesome result.

Less is always more, simpler is always better.

Both Apache Spark and Neo4j are two tremendously useful tools. I've seen how both of these two tools give their users a way to transform problems that start out both large and complex into problems that become simpler and easier to solve. That's what the companies behind these platforms are getting at. They are two sides of the same coin.

One tool solves for scaling the size, complexity, and retrieval of data, while the other is solving for the complexity of processing the enormity of data by distributed computation at scale. Both of these products are achieving this without sacrificing ease of use.

Inspired by this, I've been working to make the integration in Neo4j Mazerunner easier to install and deploy. I believe I've taken a step forward in this and I'm excited to announce it in this blog post.

Categorical PageRank Using Neo4j and Apache Spark

Monday, January 19, 2015

PageRank is an important concept in computer science and modern technology. It is important because it is the underlying algorithm that mostly dictates what more than 3 billion users who use the internet experience as they browse the world wide web.

How does PageRank work?

The first PageRank algorithm was developed by Larry Page and Sergey Brinn at Stanford in 1996. Sergey Brinn had the idea that pages on the world wide web could be ordered and ranked by analyzing the number of links that point to each page. This idea was the foundation of the imminent rise of Google as the world's most popular search engine, with now over 3.5 billion searches made by its users every day.

PageRank gives us a measure of popularity in an ever connected world of information. With an enormous degree of complexity increasing every day in the virtual space of information sharing, PageRank gives us a way to understand what is important to us as users.

The unfortunate bit of this is that PageRank itself is mostly unapproachable to anything but seasoned engineers and esteemed academics. That's why I want to make it easier for every developer around the world to make this algorithm the foundation of their innovative desires.

Distributing PageRank Jobs

It should be no surprise to regular readers of this blog that I am all about the graph. Graphs are the best abstraction of data that we have today. The concept is brilliantly easy and intuitive. Nodes represent data points and are described by meta data. Relationships connect nodes together, also described by meta data, and they enrich the information of each node relative to one another.

Neo4j Mazerunner Project

As I have been building the open source project Neo4j Mazerunner to use Apache Spark GraphX and Neo4j for big scale graph analysis, I've come to understand the need for breaking down PageRank into categories. Something I call 'Categorical PageRank'.

What Graph Analysis of Wikipedia Tells Us About the Relevancy of Recent Knowledge

Sunday, December 7, 2014

The chart below was generated using data analyzed with a Neo4j Graph Database and Apache Spark GraphX. 10.9 million Wikipedia articles and 110 million hyperlinks were analyzed to produce a PageRank and Triangle Count for each node in the graph. The Triangle Count metric is a measure of clustering, while the PageRank metric is a measure of relevancy.

Knowledge moves forward in time

Every year through 1850—2012 on the X-axis represents a Wikipedia page that describes historical events and facts about that calendar year. Link analysis was performed on the inbound and outbound hyperlinks for each page and all other pages in the graph that contribute to that page's relevancy.

The chart describes a probability distribution over time. This distribution indicates that if a person were to randomly click hyperlinks starting from any page on Wikipedia, the person would move towards articles with a higher closeness centrality to Category:Year pages occurring later in the timeline.

When it comes to our collective human knowledge, as time moves forward, distant history becomes inversely relevant to more recent events in our timeline.

To see this pattern you can click and drag areas of the chart to zoom in. You'll notice the pattern is local as well as global.

Why is the year 2000 so relevant?

Wikipedia, the world's largest encyclopedia of human knowledge, was first launched on January 15th, 2001.

Links