Showing posts with label graph database. Show all posts
Showing posts with label graph database. Show all posts

Categorical PageRank Using Neo4j and Apache Spark

Monday, January 19, 2015

PageRank is an important concept in computer science and modern technology. It is important because it is the underlying algorithm that mostly dictates what more than 3 billion users who use the internet experience as they browse the world wide web.

How does PageRank work?

The first PageRank algorithm was developed by Larry Page and Sergey Brinn at Stanford in 1996. Sergey Brinn had the idea that pages on the world wide web could be ordered and ranked by analyzing the number of links that point to each page. This idea was the foundation of the imminent rise of Google as the world's most popular search engine, with now over 3.5 billion searches made by its users every day.

PageRank gives us a measure of popularity in an ever connected world of information. With an enormous degree of complexity increasing every day in the virtual space of information sharing, PageRank gives us a way to understand what is important to us as users.

The unfortunate bit of this is that PageRank itself is mostly unapproachable to anything but seasoned engineers and esteemed academics. That's why I want to make it easier for every developer around the world to make this algorithm the foundation of their innovative desires.

Distributing PageRank Jobs

It should be no surprise to regular readers of this blog that I am all about the graph. Graphs are the best abstraction of data that we have today. The concept is brilliantly easy and intuitive. Nodes represent data points and are described by meta data. Relationships connect nodes together, also described by meta data, and they enrich the information of each node relative to one another.

Neo4j Mazerunner Project

As I have been building the open source project Neo4j Mazerunner to use Apache Spark GraphX and Neo4j for big scale graph analysis, I've come to understand the need for breaking down PageRank into categories. Something I call 'Categorical PageRank'.

Using Apache Spark and Neo4j for Big Data Graph Analytics

Monday, November 3, 2014

As engineers, when we think about how to solve big data problems, evaluating technologies becomes a choice between scalable and not scalable. Ideally we choose the technologies that can scale to a variety of business problems without hitting a ceiling down the road.

Database technologies have evolved to be able to store big data, but are largely inflexible. The data models require tedious transformations and shuffling around of data. This is a complex process that is compounded in its complexity by combining a variety of inflexible solutions and platforms.

Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Deep Learning Sentiment Analysis for Movie Reviews using Neo4j

Monday, September 15, 2014

While the title of this article references Deep Learning, it's important to note that the process described below is more of a deep learning metaphor into a graph-based machine learning algorithm. No neural networks are used.

Sentiment analysis uses natural language processing to extract features of a text that relate to subjective information found in source materials.

Movie Review Sentiment Analysis

A movie review website allows users to submit reviews describing what they either liked or disliked about a particular movie. Being able to mine these reviews and generate valuable meta data that describes its content provides an opportunity to understand the general sentiment around that movie in a democratized way. That’s a pretty cool thing if you think about it. Using machine learning we can democratize subjectivity about anything in the world. We can make an objective analysis of subjective content, giving us the ability to better understand trends around products and services that we can use to make better decisions as consumers.

Using a Graph Database for Deep Learning Text Classification

Tuesday, August 26, 2014

Graphify is a Neo4j unmanaged extension that provides plug and play natural language text classification.

Graphify gives you a mechanism to train natural language parsing models that extract features of a text using deep learning. When training a model to recognize the meaning of a text, you can send an article of text with a provided set of labels that describe the nature of the text. Over time the natural language parsing model in Neo4j will grow to identify those features that optimally disambiguate a text to a set of classes.

Feature Hierarchy

Understanding How Neo4j Cypher Queries are Evaluated

Wednesday, July 9, 2014

There are many ways to store and manage data within a Neo4j graph database. When Neo4j 2.0 launched late last year we had an entirely new browser experience for interacting with graphs. The graph visualization from the return results of Cypher queries were at the core of the user experience enhancements to the platform.

Using 3D Visualization to Debug a Graph-based Algorithm

Monday, July 7, 2014

Recently I have been working on an idea for an algorithm that discovers patterns in raw streams of data. This pattern recognition algorithm uses deep learning to classify certain combinatorial features that uniquely identify an input stream.

I'm going to first talk a bit about the algorithm so it makes sense as to why visualization is such an important step in iterating and tweaking code that most efficiently implements the algorithm.

The Algorithm


In a previous post I introduce the idea for the algorithm and how a graph-based approach might work.

Hierarchical Pattern Recognition

Tuesday, June 17, 2014

About a year ago I read about Ray Kurzweil's "Pattern Recognition Theory of Mind", which he articulates in his book, "How to Create a Mind". I picked up the book after struggling with the idea of implementing a deep learning algorithm for parsing natural language text on Wikipedia. My goal was to discover links in volumes of text that were not already linked. I ended up developing all kinds of cool heuristics to do this, mostly by a lot of trial and error. King of these heuristics was a pretty simple algorithm at the core of the library that would find redundancies in batches of text content. 

How this worked is if a phrase was mentioned repeatedly in a collection of about 50 sentences, then I could extract that phrase as a node and link it back to the pieces of content it belonged to. Every now and then you'll get a reference to another article's name, which can then be verified against Wikipedia's site index, which would provide more sentences to find repeated phrases within.

I struggled with persistency because I knew how ugly my problem was for a relational database. I created some entity-relationship models, and implemented them using Entity Framework over Microsoft SQL Server. It worked, kind of. I waited patiently to happen upon a better solution. Thankfully I did, and using a graph database I was able to take my cool little algorithms and solve my persistency problem at scale.

Time Scale Event Meta Model

Tuesday, October 15, 2013

Time Scale Graph
Recently at the GraphConnect 2013 conference in San Francisco, questions were asked about how to handle temporal or time-based traversals in a Neo4j graph database.

So I decided to write a GraphGist to help Neo4j developers do recommendations by logging events within a "Time Scale Graph".

The goal of this GraphGist is to provide you with a lens to help you see information as simple temporal facts that are captured across space and time.

You can find the full GraphGist here

Delete Duplicate Node By Index Using Neo4j Cypher Query

Wednesday, August 14, 2013

Follow the steps below to find and delete duplicate nodes on property and index in Neo4j's web admin console.

Step 1

Select duplicate records by executing the following Cypher query in the Neo4j admin console.

START n=node:invoices("PO_NUMBER:(\"112233\")")
// Cypher query for collecting the ids of indexed nodes containing duplicate properties
WITH n
ORDER BY id(n) DESC  // Order by descending to delete the most recent duplicated record
WITH n.Key? as DuplicateKey, COUNT(n) as ColCount, COLLECT(id(n)) as ColNode
WITH DuplicateKey, ColCount, ColNode, HEAD(ColNode) as DuplicateId
WHERE ColCount > 1 AND (DuplicateKey is not null) AND (DuplicateId is not null)
WITH DuplicateKey, ColCount, ColNode, DuplicateId 
ORDER BY DuplicateId 
RETURN DuplicateKey, ColCount, DuplicateId 
//RETURN COLLECT(DuplicateId) as CommaSeparatedListOfIds
//** Toggle comments for the return statements above to validate duplicate records 
//** Do not proceed to delete without validating

Step 2

Validate and copy duplicate record IDs from web admin console:

Execute the Cypher query from Step 1 to validate duplicate records exist.
After validating duplicate records, execute the Cypher query from Step 1 as a comma separated list of IDs.

Step 3

Copy and paste CommaSeparatedListOfIds into the delete query below.

START n=node(1120038,1120039,1120040,1120042,1120044,1120048,1120049,1120050,1120053,1120067,1120068)
// Replace above with the IDs from CommaSeparatedListOfIds in the previous step
MATCH n-[r]-()
DELETE r, n

** Execute the Cypher query above ONLY after replacing the example IDs in the START statement.

Step 4

Validate that the delete transaction committed.

Execute the Cypher query from Step 1 to make sure that the transaction was committed.

That's it! Comment below with questions or feedback.