Specification - GSoC Adding support for Neo4j in Gephi
From Gephi:Wiki
Student: Martin Škurla
Mentors: Tobias Ivarsson (Neo4j project), Mathieu Bastian (Gephi project)
Assignment
The project aims at enabling Gephi to visualize graphs stored using the Neo4j graph database.
- Difficulty: medium
- Required skills: Java
- Assigned Mentor: Tobias Ivarsson, Mathieu Bastian
A graph model coupling based on Neo4j could be implemented as a plugin for Gephi. This would enable users to work with a Neo4j graph from within Gephi. This is useful for debugging data of an application, generating graphical reports, and domain data design. From the other side of things it would also enable Gephi to work with arbitrarily large graphs, much larger than what can fit into RAM.
Gephi stores its graph in RAM with the help of GraphAPI and AttributesAPI. All modules that read or manipulate the graph or the attributes interact with these APIs. The Neo4j plugin would be responsible for driving these APIs to map Neo4j data and perform operations.
Goals of this project
- Enable application developers using Neo4j to use Gephi for data debugging and introspection.
- Enable Gephi to work with larger graphs than can be fitted into memory.
Suggested roadmap
- Get started with Neo4j, write some code using it to get a feel for it.
- Write a new Gephi module on top of the Neo4j graphdb API that drives the Graph and Attributes APIs.
- Map the Neo4j data model to Gephi GraphModel: nodes, relationships, relationship types and node/relationship properties.
- Map Neo4j data retrieval to Gephi GraphView: traversers - filter on relationship types and relationship directions, set depth etc.
- Work with unit tests to verify progress.
Comparison of Gephi & Neo4j
As mentioned above, the main idea of this project is to add support for Neo4j in Gephi. Before extending Gephi's functionality, it is useful to compare the two projects briefly and point out the main differences in graph model representation, naming conventions, and the way graph elements are created, added, removed and iterated (traversed). The proposals that follow are based on this comparison.
The comparison of graph model representations
Neo4j's graph model consists of:
- nodes,
- relationships,
- properties.
Relationships represent edges between nodes. Properties are key-value pairs. Keys are always strings. Valid property values are all the Java primitives, java.lang.String and arrays of primitives and strings. Neo4j does not accept arbitrary objects as property values. Null values are not allowed.
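The rule for valid property values can be expressed as a small check. The following is a minimal sketch that only encodes the list given above (Java primitives, java.lang.String, and arrays of primitives and strings, no null). The class and method names are hypothetical and it does not call the Neo4j API.

```java
import java.util.Set;

/** Hypothetical helper: checks a value against the property types listed above. */
final class PropertyValueCheck {

    // Boxed forms of the eight Java primitives, plus String.
    private static final Set<Class<?>> SCALAR_TYPES = Set.of(
            Boolean.class, Byte.class, Short.class, Character.class,
            Integer.class, Long.class, Float.class, Double.class, String.class);

    // Component types allowed for arrays: primitives and String.
    private static final Set<Class<?>> ARRAY_COMPONENT_TYPES = Set.of(
            boolean.class, byte.class, short.class, char.class,
            int.class, long.class, float.class, double.class, String.class);

    /** Returns true if the value is a primitive, a String, or an array of those. */
    static boolean isValidPropertyValue(Object value) {
        if (value == null) {
            return false; // null values are not allowed
        }
        Class<?> type = value.getClass();
        if (type.isArray()) {
            return ARRAY_COMPONENT_TYPES.contains(type.getComponentType());
        }
        return SCALAR_TYPES.contains(type);
    }
}
```

A check like this could be run before a value is handed to the type conversion discussed later, so that unexpected objects are rejected early.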
Nodes and relationships have their own id (of type long) which is assigned to them when they are created. Both nodes and relationships can have fully optional properties.
Gephi's graph model consists of:
- edges,
- nodes,
- attributes.
In Gephi, the whole graph structure is represented in two models:
- attribute model,
- graph model.
Within the graph model, nodes and edges can be added and removed. Within the attribute model, attributes can be created and suitable attribute values can be inserted.
Edges connect nodes. Both edges and nodes are represented in the attribute model as tables with rows and columns, and edges and nodes have separate attribute tables. Each node or edge is stored in its own row, and each attribute value is mapped to the appropriate column within that row.
Each column has two mandatory parameters: id (of type java.lang.String) and attribute type (of type org.gephi.data.attributes.api.AttributeType). Optional column parameters are title (of type java.lang.String), attribute origin (of type org.gephi.data.attributes.api.AttributeOrigin) and default value (of type java.lang.Object).
Supported attribute types are: float, double, int, long, boolean, java.lang.String, org.gephi.data.attributes.type.StringList and org.gephi.data.attributes.type.TimeInterval.
As a result, the Neo4j model is much simpler, while the Gephi models are much more detailed. There is therefore a clear mismatch between these models, and the main goal of this project is to resolve it.
How models are related to each other
Let us look at a graphical representation of a fictitious graph defined in Neo4j. Nodes are represented by blue circles, relationships by directed arrows, properties on nodes by orange rectangles and properties on relationships by green rectangles. Every node in Neo4j has its own id and every relationship has its name. This graph is used directly by Neo4j: properties are stored directly in nodes and relationships.
In Gephi, the graph representation will be very similar except for properties: property values will be saved in the node and edge attribute tables.

Back in Gephi, if we want to define a graph with the same data, the content of the node attribute table and the edge attribute table should be:
[Table 1: node attribute table]
[Table 2: edge attribute table]
The exact procedure for creating the tables and populating them with data is described later.
Concept mapping
The comparison above makes it clear that the Gephi and Neo4j graph model representations are quite different, but there is a straightforward mapping between the graph elements of both projects:
| Neo4j | Gephi |
| node | node |
| relationship | edge |
| property | attribute |
Other Neo4j & Gephi abilities comparison
| | Neo4j | Gephi |
| Supported Graph types | directed | directed, undirected, mixed, hierarchical |
| Supported property/attribute types | primitive types, array of primitive types, String, String[], null values not allowed | float, double, int, long, boolean, String, StringList, TimeInterval |
| Support for Cyclic edges | no | yes |
| Support for Parallel edges | ??? | no |
| Support for both directed/undirected edges | no (only directed) | yes |
Why is this table important for us? It shows that, apart from the supported property/attribute types, there should be no other problems in processing Neo4j graphs in Gephi. This conclusion applies only to importing data. When exporting data, more of the differences described in table 4 could become problems.
As we can see, Neo4j and Gephi support different data types for values stored as part of a graph, so all Neo4j types need to be mapped to Gephi types. Numerical types, String and boolean map straightforwardly. An array of strings maps straightforwardly to the StringList type. The problematic types are:
- char primitive type,
- any array of primitive type.
The char primitive type is problematic because Gephi's model has no exact equivalent. There are 2 possible solutions:
- convert char into int
- convert char into String
From my point of view, the first approach is not a very good idea, because char is used to store textual data (one character) and converting it into int turns it into a number type. To interpret the value properly afterwards, we would have to store some additional data or change the model slightly. It is definitely not acceptable to change the model or break compatibility or the API for such a small change.
The second approach is, I think, good and simple. The nature of the data (textual) does not change, the value can be displayed directly and the implementation is trivial.
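The scalar mapping described above, including the char to String choice, might look like the following minimal sketch. The TargetType enum is a hypothetical stand-in for the Gephi attribute types named in this document, not the real AttributeType class. Widening byte and short to int is an assumption of this sketch, since the text only says that numerical types map straightforwardly.

```java
/** Hypothetical stand-in for the scalar Gephi attribute types named in this document. */
enum TargetType { FLOAT, DOUBLE, INT, LONG, BOOLEAN, STRING }

/** Hypothetical helper: maps a scalar Neo4j property value to a target type and value. */
final class ScalarMapping {

    /** Chooses the target type for a scalar property value. */
    static TargetType targetTypeOf(Object value) {
        if (value instanceof Character || value instanceof String) {
            return TargetType.STRING; // char is kept as textual data
        }
        if (value instanceof Boolean) {
            return TargetType.BOOLEAN;
        }
        if (value instanceof Byte || value instanceof Short || value instanceof Integer) {
            return TargetType.INT; // assumption: byte and short are widened to int
        }
        if (value instanceof Long) {
            return TargetType.LONG;
        }
        if (value instanceof Float) {
            return TargetType.FLOAT;
        }
        if (value instanceof Double) {
            return TargetType.DOUBLE;
        }
        throw new IllegalArgumentException("Unsupported scalar value: " + value);
    }

    /** Converts the value so that it matches the type returned by targetTypeOf. */
    static Object convert(Object value) {
        if (value instanceof Character) {
            return value.toString(); // one-character String
        }
        if (value instanceof Byte || value instanceof Short) {
            return ((Number) value).intValue();
        }
        targetTypeOf(value); // rejects unsupported values
        return value;
    }
}
```

Keeping the type decision and the value conversion in two methods means the column type can be determined before any value is stored.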
Any array of a primitive type is problematic because Gephi's model has no suitable array representation. There are 3 possible solutions:
- for every item in the array, create a new attribute column with the appropriate type (char->String)
- convert the data into the TimeInterval type
- convert the data into the StringList type
In my opinion, the first approach is a bad idea, because a new attribute column has to be created for every array item. If there are many arrays of many types, there will be a lot of unnecessary attribute columns, so this approach leads to a very poor design.
The second approach is better, but there is still a type mismatch. The first problem is that TimeInterval internally uses a double[][] array, so the second dimension would not be used. The second problem is every non-numerical type: char (which could be converted to int) and boolean (which could be mapped to 0 or 1). This approach gives a better design, but still not a very usable one.
The third approach, I think, fits the goal best. Any array of a primitive type can simply be converted to an array of String, and the data can then be displayed directly. There is only one design issue: do we only need to display the data, or also to process it in some way? If we just want to display the data, strings are enough, but if we want to process it, Gephi must recognize that the items, although stored as String, should be treated as values of another type. This can be solved by adding a new attribute column (with a well-defined name) of type String/StringList whose value has the following format: {columnName:realType}/columnName:realType. With the value from this column, the data can be manipulated as if it were stored in its native type.
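The third approach can be illustrated with a minimal sketch. It converts any array to a list of strings, builds a columnName:realType entry for the extra column, and restores a typed array from the strings for a few types. All names are hypothetical, a java.util.List stands in for Gephi's StringList, and the restore method covers only some types to keep the example short.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for storing arrays as strings plus a type entry. */
final class ArrayToStrings {

    /** Converts any primitive or String array into a list of strings. */
    static List<String> toStringList(Object array) {
        if (array == null || !array.getClass().isArray()) {
            throw new IllegalArgumentException("Not an array: " + array);
        }
        int length = Array.getLength(array);
        List<String> items = new ArrayList<>(length);
        for (int i = 0; i < length; i++) {
            // Array.get boxes primitive items, so this works for every array type.
            items.add(String.valueOf(Array.get(array, i)));
        }
        return items;
    }

    /** Builds the columnName:realType entry, using the array's component type name. */
    static String typeDescriptor(String columnName, Object array) {
        return columnName + ":" + array.getClass().getComponentType().getSimpleName();
    }

    /** Restores a typed array from the strings (only some types are handled here). */
    static Object restore(List<String> items, String realType) {
        switch (realType) {
            case "int":
                return items.stream().mapToInt(Integer::parseInt).toArray();
            case "long":
                return items.stream().mapToLong(Long::parseLong).toArray();
            case "double":
                return items.stream().mapToDouble(Double::parseDouble).toArray();
            case "String":
                return items.toArray(new String[0]);
            default:
                throw new IllegalArgumentException("Type not handled in this sketch: " + realType);
        }
    }
}
```

For an int array stored under a column called scores, the type entry would be scores:int, which is enough to decide later how the string items should be parsed.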
How the remote communication will work
Neo4j is an embedded database, which means that it runs in the same JVM as your application. Anyone working with Neo4j should be familiar with apoc, an abbreviation for “a package for components”. The main idea behind apoc is to extend basic Neo4j functionality through additional components. One concrete example of such a component is “neo4j-remote-graphdb”.
The component “neo4j-remote-graphdb” is an implementation of the Neo4j API that delegates all operations to a remote Neo4j instance. Remote communication is based on RMI, and from the client's perspective the API is almost the same as for local database communication. The only difference is the need to define the URL of the remote database and an optional login/password pair. All other model manipulation is the same as in local communication, so it is a good example of transparency.
Data model mapping proposal
“Map the Neo4j data model to Gephi GraphModel: nodes, relationships, relationship types and node/relationship properties”.
Nodes are processed first. For every node in Neo4j, there should be a node in the Gephi model. For every optional property in the Neo4j graph, there should be an attribute column named after the property, with the appropriate values saved where present. The same procedure applies to relationships.
So the full list of steps is:
- for every node from Neo4j, create a node in Gephi,
- for every relationship from Neo4j, create an edge in Gephi,
- create a whole graph representation in a similar way,
- for every optional property of a Neo4j node or relationship, create an attribute column with its name and type in the node or edge attribute table respectively,
- after creating columns, the insertion of data will take place.
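The ordering of the steps above (elements first, then columns, then data) can be sketched as follows. This is only an illustration with hypothetical stand-in classes. SourceElement plays the role of a Neo4j node or relationship and AttributeTable plays the role of one Gephi attribute table, and neither the Neo4j nor the Gephi API is called.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the mapping steps; not based on the real Neo4j or Gephi APIs. */
final class ModelMappingSketch {

    /** Stand-in for a Neo4j node or relationship: an id plus optional properties. */
    static final class SourceElement {
        final long id;
        final Map<String, Object> properties;

        SourceElement(long id, Map<String, Object> properties) {
            this.id = id;
            this.properties = properties;
        }
    }

    /** Stand-in for one attribute table: typed columns and one row per element. */
    static final class AttributeTable {
        final Map<String, Class<?>> columns = new LinkedHashMap<>();
        final Map<Long, Map<String, Object>> rows = new LinkedHashMap<>();
    }

    /** Builds one table; call it once for nodes and once for relationships. */
    static AttributeTable buildTable(List<SourceElement> elements) {
        AttributeTable table = new AttributeTable();

        // Steps 1 to 3: create one (still empty) row per source element.
        for (SourceElement element : elements) {
            table.rows.put(element.id, new LinkedHashMap<>());
        }

        // Step 4: create a column for every property key that occurs.
        // The first value seen decides the column type in this sketch.
        for (SourceElement element : elements) {
            for (Map.Entry<String, Object> entry : element.properties.entrySet()) {
                table.columns.putIfAbsent(entry.getKey(), entry.getValue().getClass());
            }
        }

        // Step 5: insert the data only after all columns exist.
        for (SourceElement element : elements) {
            table.rows.get(element.id).putAll(element.properties);
        }
        return table;
    }
}
```

Because properties are optional, a row simply has no value for a column whose property the element does not define.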
In the Neo4j model, both nodes and relationships have their own id parameter (of type long). It is important to decide whether it makes sense to store exactly these values in the Gephi model.
When an attribute column is created, it must have an id (of type String) and a type (of type AttributeType) defined. There could be situations where we want to add an id that is already stored in the Gephi model, so collisions can arise.
I think there is no need to store the Neo4j ids in the Gephi representation, even though it is possible and is done in [2.2], Table 1 and Table 2, where the attribute column is named Id and is of type int.
In the Neo4j model, every relationship must have a name describing the relationship. For this information, one extra attribute column of type String, called e.g. “EdgeType”, should be created.
When creating attribute columns, we have to specify at least the following parameters:
- id of type String,
- type of type AttributeType.
Other parameters are optional:
- title of type String,
- origin of type AttributeOrigin,
- defaultValue of type Object.
We want to provide as much information as we can, so we set all parameters on the attribute column instance except defaultValue, which will not be set. Origin will be set to DATA. The title will be set to the name of the property. The type will be determined after property type recognition and conversion. The remaining question is what value to use for the id parameter. I think the same value as for the title will be good enough.
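The parameter choices above can be summarized in a small sketch. ColumnSpec and its Origin enum are hypothetical stand-ins, not the Gephi attribute column or AttributeOrigin types. Only the DATA origin used in this proposal is modeled, and the column type is kept as a plain type name.

```java
/** Hypothetical description of an attribute column; not the Gephi API. */
final class ColumnSpec {

    /** Stand-in for the column origin; only the value used here is modeled. */
    enum Origin { DATA }

    final String id;
    final String title;
    final String typeName;
    final Origin origin;
    final Object defaultValue;

    private ColumnSpec(String id, String title, String typeName, Origin origin, Object defaultValue) {
        this.id = id;
        this.title = title;
        this.typeName = typeName;
        this.origin = origin;
        this.defaultValue = defaultValue;
    }

    /** Creates the column description for one Neo4j property. */
    static ColumnSpec forProperty(String propertyName, String typeName) {
        // id and title both take the property name, origin is DATA,
        // and no default value is set.
        return new ColumnSpec(propertyName, propertyName, typeName, Origin.DATA, null);
    }
}
```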
So the implementation part of Goal 1 consists of developing a plug-in/module which implements the Attributes & Graph APIs; the whole implementation is about calling the appropriate Neo4j APIs.
The whole process of adding data and columns to the appropriate tables is shown in the following pictures and tables. First, the node table:
[Figures and tables: step-by-step construction of the node table]
Second, the edge table:
[Figures and tables: step-by-step construction of the edge table]
Data retrieval mapping proposal
“Map Neo4j data retrieval to Gephi GraphView: traversers - filter on relationship types and relationship directions, set depth etc”.
First, the user selects a node and then chooses the Traverse (search) option. A window similar to the following should be displayed. The user should be able to restrict some properties of the traversal (order, return, max. depth, edge type, direction). After the traversal information is filled in, a request to the Neo4j database will be created (through the node.traverse(...) method). A Gephi graph representation should then be created from the result.

The traverse() method returns an object of type Traverser, which extends Iterable<Node>, so for simple iteration we can just use the Java 5 for-each loop. The TraversalPosition type adds further functionality: we can determine the current node, the previous node and the last relationship traversed. This information is enough to build a Gephi graph representation. The exact procedure will be very similar to the one described in [3] “Data model mapping proposal”.
So the implementation part of Goal 2 consists of developing a “transformer” from the Neo4j Traverser object (or, more precisely, from the results it returns) to a Gephi graph representation.
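To make the traversal restrictions concrete, here is a minimal sketch of a breadth-first traversal over an in-memory list of relationships, limited by relationship type, direction and maximum depth. It does not use the Neo4j Traverser. The Rel class, the Direction enum and the traverse method are hypothetical stand-ins. The method returns the relationships that were followed, which carry the same kind of information (previous node, current node, last relationship) needed to build the Gephi nodes and edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory traversal sketch; not the Neo4j traversal API. */
final class TraversalSketch {

    enum Direction { OUTGOING, INCOMING, BOTH }

    /** Stand-in for a relationship: start node id, end node id and type name. */
    static final class Rel {
        final long start;
        final long end;
        final String type;

        Rel(long start, long end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Breadth-first traversal that returns the relationships it followed. */
    static List<Rel> traverse(List<Rel> all, long startNode, Set<String> types,
                              Direction direction, int maxDepth) {
        List<Rel> followed = new ArrayList<>();
        Set<Long> visited = new HashSet<>();
        visited.add(startNode);
        Deque<long[]> queue = new ArrayDeque<>(); // entries are {nodeId, depth}
        queue.add(new long[] {startNode, 0});

        while (!queue.isEmpty()) {
            long[] current = queue.poll();
            long node = current[0];
            long depth = current[1];
            if (depth >= maxDepth) {
                continue; // do not expand beyond the depth limit
            }
            for (Rel rel : all) {
                if (!types.contains(rel.type)) {
                    continue; // filter on relationship type
                }
                Long next = null;
                if (direction != Direction.INCOMING && rel.start == node) {
                    next = rel.end;
                } else if (direction != Direction.OUTGOING && rel.end == node) {
                    next = rel.start;
                }
                if (next == null || !visited.add(next)) {
                    continue; // wrong direction or node already reached
                }
                followed.add(rel);
                queue.add(new long[] {next, depth + 1});
            }
        }
        return followed;
    }
}
```

Each returned relationship would become one Gephi edge, and its two end nodes would become Gephi nodes, following the mapping steps from the data model proposal.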
Proposed Time-line
After planning and considering the project duration and the amount of work, I propose the following time-line:
1 week NetBeans modules development
3 weeks Implementation of Goal 1
1 week JUnit tests creation and testing, Documentation, Project integration, Additional small improvements
3 weeks Implementation of Goal 2
1 week JUnit tests creation and testing, Documentation, Project integration, Additional small improvements
1 week Support for Import API
1 week Support for Export API
1 week Documentation improvements, Bugs finding, Refactoring if needed
So the main effort is concentrated in the first 9 weeks. After that, some additional tasks can be done if there is time remaining. In the last week I plan to finish the project and make final improvements to the documentation, the code and the project as a whole.
Questions
1. Is it possible to add data while adding columns, or do all columns have to be added first and the data afterwards?
2. Will the Gephi Import and Export APIs also be used?
3. Is the opposite functionality being considered, i.e. an export function from the Gephi format to the Neo4j format? (Certainly there will be many problems and the process will be more complicated.)

