Specification - GSoC Dynamic attributes and statistics

From Gephi:Wiki
Jump to: navigation, search

Student: Cezary Bartosiak

Mentor: Julian Bilcke

Abstract

The proposal focuses special attention on further development of dynamic network analysis (DNA) in Gephi. It has got a very practical purpose, in particular nowadays: analyzing evolution of networks [1][2] and dynamic network visualization [3]. The aim is to handle dynamic changes not only of graph topology but also attributes connected with nodes and edges. The end-product would be a framework which would make possible to build and query a dynamic graph with use of a proper API.

Introduction

Evolution of networks, network dynamics and dynamic network analysis [4] are hot topics nowadays. There is a growing interest in studying these issues. It causes that there is bigger and bigger need of DNA analysis tools. One of them is Gephi.

In my opinion Gephi is a tool for visualizing and analyzing graphs and networks which is an evidence for statement that interactive information visualization is important and how visual representation of information can be used to demystify data and reveal otherwise hidden patterns by leveraging human visual capabilities to make sense of abstract information.

I would like to participate in developing this necessary tool enriching it by adding new functionality and providing some future ideas.

Project details and an approximate schedule

My work on the project would begin anytime around May, 18th. From this day I will be partially working on the project. In June I will be a little busy because of my dear school, so my work will be limited. In August I am planning a 10-day trip to the Czarnohora Mountains so it is needless to say that my work during this period will be also limited. I think I would be able to work about 15-20 hours per week in May and June and about 30-40 hours per week in other months (July, August). I don't have any other plans for the summer right now so I will be fully focused on coding.

Let’s get down to brass tacks. Following your suggestions [5] these are the most important project requirements:

  • A data structure to host dynamic attributes efficiently.
  • A Dynamic API which has the following features: a Dynamic Graph Decorator, that wraps the graph and a time interval, return static graph copy for a given time interval, return attributes values array for a given node and time interval.
  • Adapting Metrics framework to use Dynamic API to propose dynamic versions of existing metrics.
  • Idea for dynamic visualization of attributes (color, size, label color, label size, edge weight).
  • If time remains, adapting the Ranking module to support dynamic attributes – dynamic visualization attributes transformation.

Taking them into consideration I have divided the work that needs to be done into several parts which I will try to describe as specific as it is possible. But it is worth pointing out that some things may change as I will be more familiar with the source code.

Moreover I must point out I will write tests for the new parts and document the code with comments all the time. The final project will be well prepared for later development. Source code will be commented and of good quality. It will be possible to generate project source code documentation using JavaDoc.

I will also provide weekly reports.

Preparation (April 26 – May 17)

Referring to Google I will get to know mentors, read documentation, get up to speed to begin working on the project. This period will allow me to learn the developer team's coding style, work style, etc.

Implementing Interval Tree to host dynamic attributes efficiently (May 18 – May 31)

As I mentioned on the forum [6] I would like to implement Interval Tree [7][8][9] to host dynamic attributes efficiently.

The main idea lies in the addition of special constants to AttributeType enum. These constants would represent types like DynamicFloat, DynamicInteger etc. They can be implemented as a one generic class DynamicType<T> which would be internally implemented using Interval Tree. Each column with such type would provide values of any time interval.

Interval<T> implements Comparable<Interval<T>>

This class represents an interval with some value. Its interface:

Interval(double low, double high, T value)

Interval(double low, double high)

int compareTo(Interval<T> interval)

double getLow()

double getHigh()

T getValue()

IntervalTree<T>

It is essentially a map from intervals to object which can be queried for Interval instances associated with a particular interval of time.

IntervalTree()

IntervalTree(IntervalTree intervalTree)

void insert(Interval<T> interval)

void delete(Interval<T> interval)

Interval<T> minimum()

Interval<T> maximum()

double getLow()

double getHigh()

boolean isEmpty()

List<Interval<T>> search(Interval<T> interval)

List<Interval<T>> search(double low, double high)

DynamicType<T>

As I wrote before it is a special type which provides methods of getting/setting values of any time interval. It is internally implemented using IntervalTree<T> and looks like this:

DynamicType()

DynamicType(Interval<T> in)

DynamicType(List<Interval<T>> in)

DynamicType(DynamicType<T> source)

DynamicType(DynamicType<T> source, Interval<T> in)

DynamicType(DynamicType<T> source, Interval<T> in, Interval<T> out)

DynamicType(DynamicType<T> source, List<Interval<T>> in)

DynamicType(DynamicType<T> source, List<Interval<T>> in, List<Interval<T>> out)

double getLow()

double getHigh()

boolean isInRange(double low, double high)

T getValue()

T getValue(double low, double high)

T getValue(Estimator Estimator)

T getValue(double low, double high, Estimator Estimator)

List<T> getValues()

List<T> getValues(double low, double high)

List<Interval<T>> getIntervals(double low, double high)

Class getUnderlyingType()

It is a base class for other classes like DynamicInteger, TimeInterval etc.

Dynamic API (June 1 – June 25)

This period is so long because of my school issues and of course there is a lot of coding in this module. Anyway I think we need something like this:

enum Estimator

This enum is used to determine what should be done with "ties". For example if in the given time interval some attribute has got 3 different values we should know how to estimate its value. It provides following constants:

AVERAGE, MEDIAN, MODE, SUM, MIN, MAX, FIRST, LAST

DynamicGraph

This is the class providing the main features. It wraps graph and time interval. Its interface looks like that:

DynamicGraph(Graph graph)

DynamicGraph(Graph graph, double low, double high)

Object[] getAttributesValues(Node node, double point)

Object[] getAttributesValues(Node node, double point, Estimator[] estimators)

Object[] getAttributesValues(Node node, double low, double high)

Object[] getAttributesValues(Node node, double low, double high, Estimator[] estimators)

Object[] getAttributesValues(Edge edge, double point)

Object[] getAttributesValues(Edge edge, double point, Estimator[] estimators)

Object[] getAttributesValues(Edge edge, double low, double high)

Object[] getAttributesValues(Edge edge, double low, double high, Estimator[] estimators)

double getLow()

double getHigh()

Graph getSnapshotGraph(double point)

Graph getSnapshotGraph(double point, Estimator estimator)

Graph getSnapshotGraph(double low, double high)

Graph getSnapshotGraph(double low, double high, Estimator estimator)

Graph getStrongSnapshotGraph(double point)

Graph getStrongSnapshotGraph(double low, double high)

Graph getUnderlyingGraph()

TimeInterval getInterval()

void setInterval(TimeInterval interval)

void setInterval(double low, double high)

With such methods we should be able to return static graph copy for given time interval as well as attributes values array for a given node/edge and time interval.

Adapting Metrics framework to use Dynamic API (June 26 – July 9)

I think it is sufficient to improve UI in order to let a potential user “store” all calculated metrics. I mean if we filter some network with dynamic range and click “Run”, a snapshot is taken into calculation. And then if we click “View” we can see calculations for different time intervals. Also we would have got dynamic versions of "static" metrics, that would fire static ones in loops and provide appropriate reports.

If time remains I would also provide some “automatic features”. For instance a potential user would be able to get calculations for some set of intervals. Also there could an option to calculate metrics “on the fly”. But this is rather non-efficient feature… for large networks obviously.

Dynamic visualization of attributes (July 10 – July 21)

It's a bit challenging, because it depends on abstract sense of attributes very much. Taking this into consideration I think a potential user should decide what kind of visualization should be used for each attribute. For example:

Let's say that some graph's nodes contain an attribute like priority. A potential user could choose between several possibilities like: nothing (it means this attribute shouldn't be visualized), color (for min and max value, it could be combined with "ties" problem, i.e. in case of average this color could be gradient or something like that), stroke thickness etc.

I think that the tabs on the bottom (Global, Nodes, Edges and Labels) need to be developed. An additional tab like “Attributes” might be necessary. User should be able to point out for each attribute how it should be visualized. And he should be able to set Estimators for each attribute. At the moment I have got an idea to let users set fill, stroke and size/scale properties.

Adapting the Ranking module (if time remains)

Maybe I am wrong but I think that it is sufficient to let users configure range property for dynamic attributes and provide reaction of the Ranking module on time interval changes. This could be done using methods for getting attributes values arrays from Dynamic API.

Extra time (July 22 – July 31)

This period could be used as a time safety margin. If everything is on schedule, this time can be used, for instance to try adapting the Ranking module.

Polishing the solution (August 1 – August 4)

This period will be used for cleanup and some refactoring. I will scrub the code, improve documentation etc. I must point out that my trip will probably begin on August 5th and finish on August 15th so when I come back I will only submit the final evaluation to Google. Therefore I would like to begin coding on May, 18th.

Future ideas/directions

  • Forecasting change in existing networks. Imagine that somebody wants to know with some probability how the given network will change in some period of time. It could be possible with networks evolving simulators.
  • Developing statistical techniques to see whether differences observed over time in networks are due to simply different samples from a distribution of links and nodes or changes over time in the underlying distribution of links and nodes.
  • Dynamic networks generators. Gephi definitely needs not only static networks generators, but also their dynamic versions.
  • Algorithms concerning graph similarity problems. Graphs can be considered as similar if they have got similar history. In this case although their “snapshots” can be totally different, they are similar.
  • Many, many more…

As we can see this GSoC project is only a germ of an idea of a very powerful dynamic network analysis tool.

References

[1] S.N. Dorogovtsev, J.F.F. Mendes Evolution of networks

http://arxiv.org/pdf/cond-mat/0106144v2

[2] M. Argollo de Menezes, A.-L. Barabási Fluctuations in Network Dynamics

http://arxiv.org/pdf/cond-mat/0306304

[3] Skye Bender-deMoll, Daniel A. McFarland

The Art and Science of Dynamic Network Visualization

http://www.cmu.edu/joss/content/articles/volume7/deMollMcFarland/

[4] Dynamic network analysis http://en.wikipedia.org/wiki/Dynamic_network_analysis

[5] Gephi GSoC Ideas http://wiki.gephi.org/index.php/Google_Summer_Of_Code_2010

[6] Dynamic Attributes and Statistics discussion

http://forum.gephi.org/viewtopic.php?f=9&t=90

[7] Interval Tree http://en.wikipedia.org/wiki/Interval_tree

[8] Antoine Vigneron Segment trees and interval trees

http://w3.jouy.inra.fr/unites/miaj/public/vigneron/cs4235/l5cs4235.pdf

[9] Moez Chaabouni, Soon M. Chung

The Point-Range Tree: A Data Structure for Indexing Intervals

http://moezchaabouni.com/Documents/PR%20Tree.pdf