Specification - GSoC Indexing with Lucene
Description
The aim is to use Lucene to store attributes on disk and index them so that they are searchable. The system should have a well-designed cache to handle heavy read access on some elements. Although keeping attributes in memory is less critical than keeping the graph structure in memory, several features in Gephi need in-memory attributes to work properly. For instance, the edge weight is read very often by the visualization; such an attribute is called a 'Property', a sub-category of Attributes. By contrast, custom data is not used constantly, only by modules such as Filters or the Data Laboratory. In addition, attribute data belonging to workspaces other than the current one should not remain in memory.
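The caching requirement can be illustrated with a small bounded cache. The sketch below is only an assumption about how such a cache might look: the specification does not name an eviction policy, and the class name `AttributeLruCache` is hypothetical. It uses the access-order mode of `java.util.LinkedHashMap` to evict the least recently used entry once a fixed capacity is exceeded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache for attribute values read from disk.
// Least-recently-used eviction is an assumption, not part of the specification.
class AttributeLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    AttributeLruCache(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > capacity;
    }
}
```

Frequently read values such as edge weights would stay in such a cache, while rarely used custom data would be evicted and reloaded from the index on demand.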
Main Goals
- Integrate Lucene's indexing and search capabilities with Gephi's Attribute and Event system.
- Optimize for efficient use of memory resources and handling of large datasets.
Mapping Gephi's data model to Lucene
In Lucene, a Document is the basic searchable item. A Document can be anything, depending on the application: a database row, a webpage or a node in a graph. Documents contain Fields, which are essentially key-value(s) pairs. Lucene extracts Terms from these Fields and generates an inverted index that maps Terms to Documents. The inverted index is the heart of Lucene's system and the source of its performance.
Every Node/Edge in Gephi will be mapped as a Lucene Document since these are the basic searchable units in Gephi's data model. The corresponding Attributes would be mapped as Fields belonging to that specific Node/Edge.
For example:
Suppose a graph contains a node with the following attributes:
- id: 314151
- label: gephi
- date: 3/14/15
This node would be indexed as a Document with "id", "label" and "date" mapped as Fields. Lucene will then construct the inverted index from this Attribute data.
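To make the idea of an inverted index concrete, here is a toy version written with the Java standard library only. It is not Lucene's implementation and does not use Lucene's API; the class name `ToyInvertedIndex` and the whitespace tokenization are assumptions made for illustration. Each node becomes a "document" identified by an integer, and each field value is split into terms that point back to the documents containing them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents that contain it.
// Illustration only; Lucene's real index is far more elaborate.
class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one document. Field values are lower-cased and split on whitespace.
    void addDocument(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Terms are scoped by field name, e.g. "label:gephi".
                String term = field.getKey() + ":" + token;
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of all documents whose given field contains the token.
    Set<Integer> search(String field, String token) {
        return postings.getOrDefault(field + ":" + token.toLowerCase(),
                Collections.emptySet());
    }
}
```

Indexing the example node as document 0 and searching the "label" field for "gephi" would return a set containing 0, without scanning any other document.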
Since there are two basic elements to be indexed in Gephi, nodes and edges, there are two possible indexing scenarios:
- A single index for both edges and nodes.
- Two separate indexes, one for nodes and one for edges. This is the current proposal by Martin Skurla. It would reduce the search space by separating nodes and edges, possibly improving search speed. It also follows the layout of the Data Laboratory GUI, where the user can work with either edges or nodes but not both at the same time.
Design considerations
The Indexing Model
One important consideration is that the data present in each graph will be extremely varied and no two Gephi projects will be the same:
- Some projects might not need indexing/search at all, in which case providing it would only waste resources.
- On the other hand, projects that deal with lots of textual data (such as those related to social networks) will almost surely benefit from an advanced search mechanism.
- Not all columns might need to be indexed, and the indexing mechanism might differ between columns. For example, a column containing Twitter posts will need to be parsed and analyzed differently from a column containing telephone numbers.
The Index API must provide the flexibility to satisfy all of these use cases and others that might appear. Furthermore, the design of the Index API should follow architectural approaches already in use in Gephi to allow for a more natural integration. Following the example of GraphModel and AttributeModel, a new IndexModel object will specify indexing requirements and control how indexing is performed. It should have at least the following capabilities:
- Turn indexing on/off
- Expose the Lucene API for configuring how Attributes are indexed, including whether:
  - an Attribute will be included in the index
  - an Attribute will be analyzed by a parser
  - an Attribute's content will be stored in the index (such as the node/edge ID)
- Expose Lucene's configuration parameters so that performance can be tuned.
The IndexModel will be built during the import phase of the project and can be modified later if needed.
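A minimal sketch of what such an IndexModel could look like follows. All names here (`IndexModelSketch`, `ColumnIndexOptions`, `configureColumn`, `shouldIndex`) are hypothetical and are not taken from Gephi or Lucene; the rule that unconfigured columns are not indexed is also an assumption. The sketch covers the on/off switch and the three per-Attribute choices (indexed, analyzed, stored), and leaves out Lucene's tuning parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-column options mirroring the three choices listed above.
final class ColumnIndexOptions {
    static final ColumnIndexOptions NOT_INDEXED = new ColumnIndexOptions(false, false, false);

    final boolean indexed;   // include the column in the index
    final boolean analyzed;  // run the value through a text analyzer
    final boolean stored;    // keep the original value in the index

    ColumnIndexOptions(boolean indexed, boolean analyzed, boolean stored) {
        this.indexed = indexed;
        this.analyzed = analyzed;
        this.stored = stored;
    }
}

// Hypothetical IndexModel: a global switch plus per-column configuration.
class IndexModelSketch {
    private boolean enabled = false;
    private final Map<String, ColumnIndexOptions> columns = new LinkedHashMap<>();

    void setEnabled(boolean enabled) { this.enabled = enabled; }

    boolean isEnabled() { return enabled; }

    void configureColumn(String columnId, ColumnIndexOptions options) {
        columns.put(columnId, options);
    }

    // Assumption: columns without explicit configuration are not indexed.
    ColumnIndexOptions optionsFor(String columnId) {
        return columns.getOrDefault(columnId, ColumnIndexOptions.NOT_INDEXED);
    }

    boolean shouldIndex(String columnId) {
        return enabled && optionsFor(columnId).indexed;
    }
}
```

With this shape, a column of Twitter posts could be configured as indexed and analyzed, while a column of telephone numbers could be indexed without analysis, and the whole mechanism could be switched off for projects that do not need search.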
The Index API
Before defining an interface for the Index API, we should define the operations it will perform. In the case of the Lucene index, these are:
- Update the index according to changes to the IndexModel:
  - Global Create/Update/Delete operations on Attributes
  - Local Create/Update/Delete operations on node/edge Attributes
- Perform a query, either built programmatically (from plugins, for example) or given as a String (from plugins or from the user through the UI).
- Return the results of the query in a suitable format:
  - As a Collection<Node>
  - As a GraphView. This eliminates the boilerplate code needed to convert a Collection<Node> into an actual GraphView and simplifies manipulating the search results.
- Expose these capabilities through a NetBeans Platform module.
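The operations listed above could be gathered into an interface along the following lines. This is a sketch under several assumptions: the names `IndexApiSketch` and `InMemoryIndex` are hypothetical, plain integer ids stand in for Gephi's Node type, the String query is limited to a single `column:value` pair, and the GraphView result format and the NetBeans module wiring are left out. The in-memory implementation exists only to make the interface testable; a real implementation would delegate to Lucene.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Index API; integer ids stand in for Gephi's Node type.
interface IndexApiSketch {
    void putAttribute(int nodeId, String column, String value);  // local create/update
    void removeAttribute(int nodeId, String column);             // local delete
    void removeColumn(String column);                            // global delete
    Collection<Integer> query(String queryString);               // "column:value"
}

// Naive in-memory stand-in that scans values instead of using a real index.
class InMemoryIndex implements IndexApiSketch {
    // column -> (node id -> value)
    private final Map<String, Map<Integer, String>> data = new HashMap<>();

    public void putAttribute(int nodeId, String column, String value) {
        data.computeIfAbsent(column, c -> new HashMap<>()).put(nodeId, value);
    }

    public void removeAttribute(int nodeId, String column) {
        Map<Integer, String> values = data.get(column);
        if (values != null) {
            values.remove(nodeId);
        }
    }

    public void removeColumn(String column) {
        data.remove(column);
    }

    public Collection<Integer> query(String queryString) {
        int sep = queryString.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("expected a query of the form column:value");
        }
        String column = queryString.substring(0, sep);
        String wanted = queryString.substring(sep + 1);
        Map<Integer, String> values = data.getOrDefault(column, Collections.emptyMap());
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : values.entrySet()) {
            if (entry.getValue().equalsIgnoreCase(wanted)) {
                hits.add(entry.getKey());
            }
        }
        Collections.sort(hits);  // deterministic order for callers
        return hits;
    }
}
```

Keeping the query entry point String-based means the same method can serve both plugins and text typed by the user in the UI; a programmatic query builder could be layered on top of it.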

