Building a Neo4j Reporting Service Part I

Thursday, April 24, 2014

Data science is pretty hot right now. The obvious reason is that data is rapidly expanding in complexity and size. There is an opportunity to be had in building systems that can capture this data, classify it in multiple dimensions, and to scale it up to the demands of analysts looking to convert data into valuable reports.


As a developer evangelist for Neo4j, I am frequently out in the community talking about things I build using our database. We use Meetup.com to schedule and promote our community events all over the world.

If you're unfamiliar with Meetup.com, here is a description from their Wikipedia entry:

"Meetup is an online social networking portal that facilitates offline group meetings in various localities around the world. Meetup allows members to find and join groups unified by a common interest, such as politics, books, games, movies, health, pets, careers or hobbies. Users enter their postal code or their city and the topic they want to meet about, and the website helps them arrange a place and time to meet. Topic listings are also available for users who only enter a location."

At Neo4j, we're obsessed with data, especially connected data. We believe in our product because we use it to solve our own problems every day. With something like Meetup.com, we found ourselves guessing about many of the aspects of our community and how we could do a better job creating a great community experience.

Some of those questions were:
  • How many people will show up to an event from the attendee list?
  • What kind of content are people interested in hearing about?
  • What's the best location to host our meetups to boost attendance?

I wanted to use Neo4j to do reporting. I decided to put together a platform to track some of this information and build some reports to visualize the data we collected. I started by breaking down the problem into a set of stories to be implemented as a report.

Problem

  • Track meetup group growth over time
  • Apply tags to meetup groups and report combined growth of all groups over time

Questions

  • Given a start date and an end date, what is the time series that plots the membership growth of a given meetup group?
  • Given a start date, an end date, and a combination of tags, what is the time series that plots the combined membership growth of all meetup groups with those tags?
  • How do you generate the JSON data of a time series for a basic JS line chart plugin?

I decided to start with a GraphGist, which is an open source project we built to enable our community to put together a quick proof of concept using our database.

Neo4j for Graph Analytics: Meetup.com Example

I designed an example graph data model, which I then translated into Neo4j's Cypher query language to create an example dataset.



Now it was time to scale it up to a full platform. I decided to use Node.js.


There would be three Node.js driven components. One console application for importing data on a schedule and two web applications; a dashboard for displaying reports and REST API to communicate with the Neo4j graph database.


With an architecture in place, I went forward with building out each of the modules.

In my next blog post I will go through the details of building the import scheduler, which polls the Meetup.com API each day and imports the graph data model into Neo4j.

Feel free to take a look at the finished documentation which details the creation of each of the Node.js modules:

Graph-based Reporting Documentation

Also, I put a slide deck together:



Time Scale Event Meta Model

Tuesday, October 15, 2013

Time Scale Graph
Recently at the GraphConnect 2013 conference in San Francisco, questions were asked about how to handle temporal or time-based traversals in a Neo4j graph database.

So I decided to write a GraphGist to help Neo4j developers do recommendations by logging events within a "Time Scale Graph".

The goal of this GraphGist is to provide you with a lens to help you see information as simple temporal facts that are captured across space and time.

You can find the full GraphGist here

Delete Duplicate Node By Index Using Neo4j Cypher Query

Wednesday, August 14, 2013

Follow the steps below to find and delete duplicate nodes on property and index in Neo4j's web admin console.

Step 1

Select duplicate records by executing the following Cypher query in the Neo4j admin console.

START n=node:invoices("PO_NUMBER:(\"112233\")")
// Cypher query for collecting the ids of indexed nodes containing duplicate properties
WITH n
ORDER BY id(n) DESC  // Order by descending to delete the most recent duplicated record
WITH n.Key? as DuplicateKey, COUNT(n) as ColCount, COLLECT(id(n)) as ColNode
WITH DuplicateKey, ColCount, ColNode, HEAD(ColNode) as DuplicateId
WHERE ColCount > 1 AND (DuplicateKey is not null) AND (DuplicateId is not null)
WITH DuplicateKey, ColCount, ColNode, DuplicateId 
ORDER BY DuplicateId 
RETURN DuplicateKey, ColCount, DuplicateId 
//RETURN COLLECT(DuplicateId) as CommaSeparatedListOfIds
//** Toggle comments for the return statements above to validate duplicate records 
//** Do not proceed to delete without validating

Step 2

Validate and copy duplicate record IDs from web admin console:

Execute the Cypher query from Step 1 to validate duplicate records exist.
After validating duplicate records, execute the Cypher query from Step 1 as a comma separated list of IDs.

Step 3

Copy and paste CommaSeparatedListOfIds into the delete query below.

START n=node(1120038,1120039,1120040,1120042,1120044,1120048,1120049,1120050,1120053,1120067,1120068)
// Replace above with the IDs from CommaSeparatedListOfIds in the previous step
MATCH n-[r]-()
DELETE r, n

** Execute the Cypher query above ONLY after replacing the example IDs in the START statement.

Step 4

Validate that the delete transaction committed.

Execute the Cypher query from Step 1 to make sure that the transaction was committed.

That's it! Comment below with questions or feedback.