Specification - GSoC Direct Social Networks Import

This article is a work in progress.



Student: Yi Du (杜一)

Mentor: Sebastien Heymann

Abstract

This is the specification of the GSoC Direct Social Networks Import (DSNI) project. This document covers the key technologies, the design issues, the implementation details and the roadmap.

Introduction

Twitter, NYT, Flickr, Facebook, Del.icio.us, emails... are considered by sociologists (and others) as valuable data sources. This proposal aims to make Gephi's import process easier by connecting Gephi to the proper APIs and making direct queries on these social networks available.

Getting social network insights with Gephi currently requires these steps:

  1. Get the data as a static snapshot, like a "screenshot" or a photograph.
  2. Pre-process the data.
  3. Generate a graph file (e.g. GEXF or GraphML).
  4. Import this file into Gephi.
  5. Start visualizing and analyzing the data.

We could be more flexible by letting Gephi connect to each network's API directly and grab data centered on individuals or on complete queries. The process would then become:

  1. Select a network to connect to.
  2. Write the query.
  3. Gephi gets the data and generates the network on screen.
  4. The user expands and refines the results by interacting with the network (graph window). Data Lab interactions will be created in the future.

Description of existing networks

We can learn from NodeXL and divide the complex network into several parts. For example, NodeXL divides Twitter into user's network and search network. So the Graph descriptions of these social networks are as follows:

Email

Description: Email import helps users analyse the email contacts between people. Each email address is a node, and an edge is built when an email has been sent from one address to another. An interesting addition is "multi-emails": the user can combine several email addresses and analyse them together.

Users can import email data from either an email server or an email client (Thunderbird, Outlook Express, etc.).

Architecture: We use the JavaMail API to implement the email import module. We receive the filtered emails one by one and extract the useful data, so the speed of data fetching depends on JavaMail. We will support POP3 first and IMAP later. The problem is that POP3 cannot access the outbox data.


API: JavaMail http://java.sun.com/products/javamail/reference/api/index.html

Type of Data Source: Collection of email files. What we receive from the email server is a list of "Message" objects. We should define a class Email to hold this data.

Limitations: Many email service providers do not offer IMAP access.

Graphs:

Name | Info | Nodes | Edges | Node data | Edge data | Filter
People Network | graph of emails between users | Email addresses | If an email exists from A to B, an edge from A to B is built. Cc and Bcc can be configured whether to be seen as an edge. | email address (id), personal name (label) | NONE | Before loading to Gephi, we can filter emails by date, specific email address, existence of Cc or Bcc or Attachment, subject keywords, message keywords

Twitter

Description: Twitter is a social networking and microblogging service that enables its users to send and read messages known as tweets. http://en.wikipedia.org/wiki/Twitter

API: Twitter API wiki:http://apiwiki.twitter.com/

Twitter API for java: http://twitter4j.org/en/index.jsp

Type of Data Source: REST in XML, JSON, RSS, Atom

Limitations:

Graphs:

Similar to NodeXL, but more detailed. For example, we can add filters on Twitter profiles.

Name | Info | Nodes | Edges | Node data | Edge data
User's Network | People who follow each other | Twitter users | Follow relationship | User profile such as name, location, screen name | follower or following
Search Network | People who search or mention similar keywords | Twitter users | Follow relationship, reply to, mention | User profile such as name, location, screen name | keywords or follow relationship

NYT

Description: The New York Times is an American daily newspaper founded and continuously published in New York City since 1851. http://en.wikipedia.org/wiki/Nyt. It has a series of API for developers on news and social networks.

API: http://developer.nytimes.com/docs/

Type of Data Source: JSON (.json, default) and XML (.xml). The Article Search API supports JSON only, not XML.

Limitations: A user who wants to use an NYT API must have an API key for that specific API. The API key can be registered here: http://developer.nytimes.com/apps/register

Graphs:

There are several APIs of NYT, such as Article Search API, Best Seller API, etc. NYT has a tool for using manyeyes to visualize datasets.

Name | Info | Nodes | Edges | Node data | Edge data
Article Network | Articles with specific filters, such as date, keywords, facets, etc | Articles | User can choose which option constructs the edge. For example, user can choose date as the edge. If two articles have the same date, an edge between them is built. Of course, maybe no edge is built. | author, date, facet, contents | Parameter user chooses to construct the edge
TimesPeople Network | TimesPeople is a social network for Times readers. The TimesPeople Network can analyse the relationship between them. | users | If user A follows user B, an edge from A to B is built. | location, name, display name | follower or following
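
The edge rule of the Article Network (two articles are linked when they share the value of a user-chosen attribute) could be sketched as follows. This is only an illustration under assumptions: the Article class, its three fields and the ArticleNetwork helper are hypothetical names invented for this example, not part of the NYT API or of Gephi.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified article record; a real API response has many more fields. */
final class Article {
    final String title;
    final String date;
    final String author;

    Article(String title, String date, String author) {
        this.title = title;
        this.date = date;
        this.author = author;
    }
}

final class ArticleNetwork {
    /** Links every pair of articles that share the same value of the chosen attribute. */
    static List<String[]> buildEdges(List<Article> articles, Function<Article, String> attribute) {
        // Group articles by attribute value, keeping insertion order for stable output.
        Map<String, List<Article>> groups = new LinkedHashMap<>();
        for (Article a : articles) {
            groups.computeIfAbsent(attribute.apply(a), k -> new ArrayList<>()).add(a);
        }
        // Emit one undirected edge per unordered pair inside each group.
        List<String[]> edges = new ArrayList<>();
        for (List<Article> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    edges.add(new String[] {group.get(i).title, group.get(j).title});
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<Article> articles = new ArrayList<>();
        articles.add(new Article("A", "2010-06-01", "Smith"));
        articles.add(new Article("B", "2010-06-01", "Jones"));
        articles.add(new Article("C", "2010-06-02", "Smith"));
        // Use the date as the edge-building attribute.
        for (String[] e : buildEdges(articles, a -> a.date)) {
            System.out.println(e[0] + " -- " + e[1]);
        }
    }
}
```

Passing a different function (for example a -> a.author) switches the attribute that builds the edges. If no two articles share a value, the edge list is simply empty, which matches the "maybe no edge is built" case.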

Facebook

Description: Facebook is a social networking website. Users can add friends and send them messages, and update their personal profiles to notify friends about themselves. Additionally, users can join networks organized by workplace, school, or college. http://en.wikipedia.org/wiki/Facebook

API: http://developers.facebook.com/docs/api

Facebook Java API:http://code.google.com/p/facebook-java-api/

Type of Data Source: REST in JSON

Limitations: ?

Graphs:

Name | Info | Nodes | Edges | Node data | Edge data
Friend Network | A Graph on the relationship between the friends of a given user | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing
Group Network | A Graph on the relationship between the friends inside a given group | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing
Search Network | A Graph on the relationship between the friends inside a search result on pages, events or groups | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing

Flickr

Description: Flickr is a photo sharing website.

API: http://www.flickr.com/services/api/

Type of Data Source: REST in XML, JSON, XML-RPC, SOAP

Limitations: ?

Graphs:

Name | Info | Nodes | Edges | Node data | Edge data
Network 1 | Similar people by tags | users | identical tags on photos | photos | TODO
Network 2 | Similar tags by photos | tags | 2 tags are in the same photo | nothing | license of photos

LinkedIn

Description: LinkedIn is a professional network.

API: http://developer.linkedin.com/community/apis

Type of Data Source: REST in XML

Limitations: ?

Graphs:

The friendship network is not browsable, and most users hide their connections, but we could do some analyses with the SearchAPI and ProfileAPI.

Name | Info | Nodes | Edges | Node data | Edge data
Hiring Network | Bipartite graph of companies and universities grabbed from Public Profiles | company, education | If a univ. and a company are on a user's profile, a directed edge is built from the univ. to the company. | company or univ. name | start-date, end-date
Org. local Network | Bipartite graph of companies and universities of people who are in the user's network, grabbed from Standard Profiles | company, education | If two people were/are in the same company or share the same education | company or education | start-date, end-date
Personal Network | People who are in the user's network, grabbed from Standard Profiles | people | If two people were/are currently in the same company or univ. | industry, country, company*, education* | company*, education*

Delicious

Description: Delicious is a network for sharing bookmarks and tagging them.

API: http://delicious.com/help/api

Type of Data Source: REST in XML

Limitations: wait at least one second between queries http://delicious.com/help/api
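
The one-second rule could be enforced with a small throttle placed in front of every query. The following is a minimal sketch, not the project's implementation: QueryThrottle and its methods are hypothetical names, and the clock is injected only so that the waiting logic can be checked without real delays.

```java
import java.util.function.LongSupplier;

/** Spaces out queries so that consecutive ones are at least minIntervalMillis apart. */
final class QueryThrottle {
    private final long minIntervalMillis;
    private final LongSupplier clock; // returns the current time in milliseconds
    private long lastQueryAt;
    private boolean hasQueried = false;

    QueryThrottle(long minIntervalMillis, LongSupplier clock) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
    }

    /** Returns how many milliseconds remain before the next query is allowed. */
    long millisToWait() {
        if (!hasQueried) {
            return 0;
        }
        long elapsed = clock.getAsLong() - lastQueryAt;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Records that a query is being sent now. */
    void markQueried() {
        lastQueryAt = clock.getAsLong();
        hasQueried = true;
    }

    /** Sleeps until a query is allowed, then records it. */
    void acquire() {
        long wait = millisToWait();
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            }
        }
        markQueried();
    }

    public static void main(String[] args) {
        QueryThrottle throttle = new QueryThrottle(1000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            throttle.acquire(); // the HTTP request would be sent right after this call
            System.out.println("query " + i + " allowed");
        }
    }
}
```

Calling acquire() before each request keeps the importer within the limit even when a graph needs many queries in a row.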

Graphs:

Name | Info | Nodes | Edges | Node data | Edge data
Suggested Tags Network | Given a list of URLs, generates a bipartite graph of URLs and recommended tags (https://api.del.icio.us/v1/posts/suggest) | tags, URLs | If a tag is suggested for a URL, an undirected edge is built between them. | Boolean set to true if the tag is popular | nothing
Suggested People Network | Given a list of URLs, generates a bipartite graph of URLs and people who bookmark these URLs (https://api.del.icio.us/v1/posts/suggest) | usernames, URLs | If a person is suggested for a URL, an undirected edge is built between them. | nothing | nothing
Tags Network | Given a list of usernames, generates a bipartite graph of people and tags (https://api.del.icio.us/v1/tags/get) | usernames, tags | If a person uses a tag, an undirected edge is built between them. | nothing | nothing

YouTube

Description: A video publication network.

API: http://code.google.com/intl/zh/apis/youtube/2.0/developers_guide_protocol_audience.html

Type of Data Source: REST in AtomPub XML, JSON

Limitations: ?

Graphs:

Available fields on videos: http://code.google.com/intl/zh/apis/youtube/2.0/reference.html#youtube_data_api_tag_entry

Name | Info | Nodes | Edges | Node data | Edge data
Video Network | Given a list of keywords, restricted by a list of categories, a graph of videos sharing these keywords is created. (api) | videos | If video A and video B have a keyword in common, an undirected edge is built between A and B. | title, link, author?, published?, viewCount, yt:location?, latitude?, longitude?, yt:recorded?, categories+, keywords+ | shared keywords+
Top Rated Keywords Network | Graph of region-specific and category-specific keywords of top rated videos (api) | keywords | If keyword A and keyword B have a video in common, an undirected edge is built between A and B. | sum(viewCount), categories+ | shared video titles+
Related Keywords Network | Graph of keywords of the given video and related video keywords at a given depth (api) | keywords | If keyword A and keyword B have a video in common, an undirected edge is built between A and B. | sum(viewCount), categories+ | shared video titles+
Searched Network | From data retrieved by a raw query, a graph of videos and/or keywords and/or users is created (api) | videos/keywords/users | If a single type of node is selected, the user must choose which type of element should be shared to create an edge. For example, if nodes are keywords, then two keywords may be linked by a common user or a common video. | depends on the node type | depends on the edge type
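
For the keyword networks, the rule "keyword A and keyword B have a video in common" amounts to a co-occurrence computation, with the shared video titles kept as edge data. A minimal sketch follows, assuming the keywords of each video have already been fetched. KeywordNetwork and the "a|b" pair key are hypothetical choices made for this illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class KeywordNetwork {
    /**
     * Maps each undirected keyword pair, written "a|b" with a before b alphabetically,
     * to the titles of the videos in which both keywords appear.
     */
    static Map<String, Set<String>> build(Map<String, List<String>> keywordsByVideoTitle) {
        Map<String, Set<String>> edges = new TreeMap<>();
        for (Map.Entry<String, List<String>> video : keywordsByVideoTitle.entrySet()) {
            // Sort and de-duplicate so that each pair gets one canonical key.
            List<String> keywords = new ArrayList<>(new TreeSet<>(video.getValue()));
            for (int i = 0; i < keywords.size(); i++) {
                for (int j = i + 1; j < keywords.size(); j++) {
                    String pair = keywords.get(i) + "|" + keywords.get(j);
                    edges.computeIfAbsent(pair, k -> new TreeSet<>()).add(video.getKey());
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<String, List<String>> videos = new LinkedHashMap<>();
        videos.put("v1", java.util.Arrays.asList("cat", "funny"));
        videos.put("v2", java.util.Arrays.asList("funny", "cat", "music"));
        // Print every keyword pair with the videos that connect it.
        for (Map.Entry<String, Set<String>> e : build(videos).entrySet()) {
            System.out.println(e.getKey() + " shared by " + e.getValue());
        }
    }
}
```

Swapping the roles of videos and keywords in the input map would give the Video Network instead, where two videos are linked by their shared keywords.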

Last.FM

Description: Music discovery network.

API: http://www.last.fm/api

Type of Data Source: REST in XML, XML-RPC, XSPF (playlists)

Limitations: ?

Graphs:

Name | Info | Nodes | Edges | Node data | Edge data
Past Venues Network | Graph of locations linked by artists. (api and api) | locations | If location A and location B have an artist in common, an undirected edge is built between A and B. | city, country, venue names*, geo:lat, geo:long | startDate
Similar Artists Network | Graph of similar artists of the given artist at a given depth. (api) | artists | If artist A is similar to artist B, an undirected edge is built between A and B. | nothing | nothing
Similar Artists Keywords Network | Bipartite Graph of tags of the given artist and related artist tags at a given depth. (api and api) | tags, artists | If tag A and tag B have an artist in common, an undirected edge is built between A and B. | nothing | shared artists+

Open Social API

Description: OpenSocial, a set of common APIs for social applications, initiated by Google.

API: http://wiki.opensocial.org/index.php?title=JavaScript_API_Reference

Type of Data Source: REST in XML

Limitations: ?

Graphs: TODO

Design

Use Case

This is how a user would use DSNI once it is implemented. Suppose a user wants to see the state of his Twitter network.

  1. He presses the icon/button "import from twitter", and a user interface is displayed.
  2. The user interface guides the user to enter his account and password. It also lets the user specify which parameter represents the nodes, which represents the edges, and which represent the node attributes. In addition, the user interface should support queries so that users get only the data they really want. (It still has to be determined whether the filter/query is common to all networks and should be structured for reuse.)
  3. When this is done, the user clicks "submit" and the graph is displayed.

When the "submit" button is clicked, the proper module should get the proper data from Twitter and parse it.

UI

Email:

Required input: email address, password, incoming mail server (server type [imap|pop3] and server address).

Junior (no filter): receive all emails from the server. Senior: 1. the user can filter emails; 2. the user can combine several email addresses.

Architecture

What do all social networks have in common? What are the differences? What all the social networks have in common is the graph data structure. There are also some common aspects that not all the social networks share:

  • API key. Only with this key can we use the API and import data from these websites (NYT, Flickr).
  • HTTP request and response. Most of the websites support simple HTTP requests and responses to fetch data (Twitter, Facebook, Flickr, NYT). There are also Java libraries that wrap these calls, such as twitter4j (Twitter) and facebook-java-api (Facebook). These libraries are much easier to use than parsing XML or JSON by hand.
  • Most of the social networks support XML or JSON, except email.

In addition, an abstract controller is a necessary class for developers who implement a new social network import.

The biggest difference is the user interface. Each kind of social network should have its own user interface.


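I will write a new module named SNAImport for the DSNI. This module is the main API and SPI for DSNI.

package org.gephi.io.importer.api;

public interface SNAImportController {

    // Returns the importer of the selected social network
    SNAImporter getSNAImporter();
    // "import" is a reserved word in Java, so the method is named importData here
    Container importData(String query, SNAImporter importer);
    void process(Container container);
    // ...

}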

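package org.gephi.io.importer.api;

public interface SocialNetwork {

    // This is the data structure and the query strings of the social network.
    // All properties are assumed to be plain strings.
    String getUserName();      void setUserName(String userName);
    String getPassword();      void setPassword(String password);
    String getAPIKey();        void setAPIKey(String apiKey);
    String getNodeQuery();     void setNodeQuery(String nodeQuery);
    String getEdgeQuery();     void setEdgeQuery(String edgeQuery);
    String getNodeAttrQuery(); void setNodeAttrQuery(String nodeAttrQuery);
    String getEdgeAttrQuery(); void setEdgeAttrQuery(String edgeAttrQuery);

}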


I think SNAImporter should be an interface (SPI) that extends Importer, which is in the Import module. It has several methods, such as isMatchingImporter() and import(String, ContainerLoader, Report).

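package org.gephi.io.importer.spi;

public interface SocialNetWorkImporter extends Importer {

    // Imports the data described by the social network settings into the container
    void importData(SocialNetwork snw, ContainerLoader containerLoader, Report report);
    void importData(HttpRequest hr, ContainerLoader containerLoader, Report report); // TODO

}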

The architecture of the DSNI is as above. To use the DSNI for a new network, we need the following (taking email import as an example):

  • A UI for the email import
  • An import method (which processes the data and translates it to the supported data structure) using the SPI.
  • ?
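
To make the division of work concrete, here is a toy sketch of the controller/importer split. Every type in it is a hypothetical stand-in written for this illustration: SketchLoader replaces Gephi's ContainerLoader, a plain list of strings replaces Report, and the "importer" parses a hard-coded string instead of calling a remote API. The real Gephi classes differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Stand-in for the container that receives nodes and edges. Not Gephi's ContainerLoader. */
interface SketchLoader {
    void addNode(String id);
    void addEdge(String source, String target);
}

/** Mirrors the idea of the SPI: one network-specific class turns a query into a graph. */
interface SketchNetworkImporter {
    void importData(String query, SketchLoader loader, List<String> report);
}

/** In-memory loader that simply records what the importer produced. */
final class RecordingLoader implements SketchLoader {
    final Set<String> nodes = new LinkedHashSet<>();
    final List<String> edges = new ArrayList<>();

    public void addNode(String id) {
        nodes.add(id);
    }

    public void addEdge(String source, String target) {
        addNode(source);
        addNode(target);
        edges.add(source + " -> " + target);
    }
}

/** Toy importer: reads "a>b,b>c" style queries instead of calling a remote API. */
final class ToyImporter implements SketchNetworkImporter {
    public void importData(String query, SketchLoader loader, List<String> report) {
        for (String pair : query.split(",")) {
            String[] ends = pair.trim().split(">");
            if (ends.length != 2) {
                report.add("Skipped malformed pair: " + pair.trim());
                continue;
            }
            loader.addEdge(ends[0].trim(), ends[1].trim());
        }
    }
}

/** Plays the role of the controller: runs an importer and returns the filled loader. */
final class SketchController {
    static RecordingLoader run(String query, SketchNetworkImporter importer, List<String> report) {
        RecordingLoader loader = new RecordingLoader();
        importer.importData(query, loader, report);
        return loader;
    }

    public static void main(String[] args) {
        List<String> report = new ArrayList<>();
        RecordingLoader loader = run("a>b, b>c, oops", new ToyImporter(), report);
        System.out.println(loader.nodes + " " + loader.edges + " " + report);
    }
}
```

In this arrangement, supporting a new social network means writing one more importer class and its UI. The controller and the loader stay unchanged.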

Roadmap

  • May 24th ~ June 10th: Email Import (done)
    • May 24th ~ May 30th: Complete the specification
    • Confirm the architecture of the DSNI (SPI of SNAImporter).
    • Implement ImportEmailUI and SNAImportController.
    • Implement ImportEmail using the JavaMail API.
  • June 11th ~ June 18th: Twitter Import (done, but overdue until June 25th)
    • Confirm the UI details (see "Design"). Implement the user's network, both the UI and the processing.
    • Confirm the UI details (see "Design"). Implement the search network, both the UI and the processing.
  • June 19th ~ July 2nd: NYT Import
  • July 3rd ~ July 17th: Flickr Import (mid-term evaluations between July 12th and July 16th)
  • July 18th ~ July 28th: YouTube Import
  • July 29th ~ August 9th: Test, then pencils down.
  • Written on June 27th: I want to package an HTTP request module with the help of HttpClient (an open source project), and use it to do the following three imports one by one.

User Manual

Email Import

We add an email import function to the current project. Our implementation supports two styles of email import.

  • The first style is from local files. Only .eml files are supported. Users can select multiple .eml files to import.
  • The second style is from an email server. Users need to know their email address and password, as well as the POP/IMAP server and port (with or without an SSL connection). Taking Gmail as an example, the options the user needs to know are: email address, password, server type (pop3 or imap), whether the server uses an SSL connection (Gmail always uses an SSL connection), port (995 for POP, 993 for IMAP), and incoming server address (pop.gmail.com|imap.gmail.com).
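
The server options listed above map naturally onto JavaMail session properties. The sketch below only builds the java.util.Properties object, so it runs without the JavaMail jar. MailServerSettings is a hypothetical helper name, while the property keys (mail.store.protocol, mail.pop3.host, mail.pop3.port, mail.pop3.ssl.enable and their imap counterparts) are the ones documented by JavaMail.

```java
import java.util.Properties;

final class MailServerSettings {
    /** Builds JavaMail-style session properties for a POP3 or IMAP server. */
    static Properties forServer(String protocol, String host, int port, boolean ssl) {
        if (!protocol.equals("pop3") && !protocol.equals("imap")) {
            throw new IllegalArgumentException("Unsupported server type: " + protocol);
        }
        Properties props = new Properties();
        props.setProperty("mail.store.protocol", protocol);
        props.setProperty("mail." + protocol + ".host", host);
        props.setProperty("mail." + protocol + ".port", Integer.toString(port));
        props.setProperty("mail." + protocol + ".ssl.enable", Boolean.toString(ssl));
        return props;
    }

    public static void main(String[] args) {
        // The Gmail example from the text: POP3 over SSL on port 995.
        Properties gmailPop = forServer("pop3", "pop.gmail.com", 995, true);
        gmailPop.list(System.out);
    }
}
```

With JavaMail on the classpath, these properties would typically be passed to Session.getInstance(props), and the store returned by session.getStore() would then be connected with the user's email address and password.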

Junior users can choose "select all email" and skip the filter. Senior users can filter their import by several parameters: email address (From, To, Cc, Bcc), received date (before or after a specific date), whether the message has an attachment, whether it has Cc or Bcc recipients, text contained in the message, and text contained in the subject.
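
One way to picture the senior-user filters is as small predicates that are combined before the import. This is only a sketch under assumptions: MailSummary and MailFilters are hypothetical names, only three of the listed parameters are shown, and the actual EmailFilter interface of the project may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical, simplified view of a message with only the fields used below. */
final class MailSummary {
    final String from;
    final String subject;
    final boolean hasAttachment;

    MailSummary(String from, String subject, boolean hasAttachment) {
        this.from = from;
        this.subject = subject;
        this.hasAttachment = hasAttachment;
    }
}

final class MailFilters {
    /** Keeps messages sent from the given address (case-insensitive). */
    static Predicate<MailSummary> from(String address) {
        return m -> m.from.equalsIgnoreCase(address);
    }

    /** Keeps messages whose subject contains the given text (case-insensitive). */
    static Predicate<MailSummary> subjectContains(String text) {
        String needle = text.toLowerCase();
        return m -> m.subject.toLowerCase().contains(needle);
    }

    /** Keeps messages that have an attachment. */
    static Predicate<MailSummary> withAttachment() {
        return m -> m.hasAttachment;
    }

    /** Returns the messages accepted by the filter, in their original order. */
    static List<MailSummary> select(List<MailSummary> mails, Predicate<MailSummary> filter) {
        List<MailSummary> kept = new ArrayList<>();
        for (MailSummary m : mails) {
            if (filter.test(m)) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<MailSummary> mails = new ArrayList<>();
        mails.add(new MailSummary("a@x.org", "Gephi import", true));
        mails.add(new MailSummary("b@x.org", "Lunch", false));
        // Combine two filters with and(): sender AND subject keyword.
        Predicate<MailSummary> filter = from("a@x.org").and(subjectContains("gephi"));
        System.out.println(select(mails, filter).size() + " message(s) kept");
    }
}
```

The junior mode corresponds to a filter that accepts everything (m -> true), so both modes can share the same import path.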

Besides the filter, we also provide two options:

  • Use Cc line when calculating edge weights. If it is selected, weight is added to the edge between A and B when a message contains "A Cc B".
  • Use Bcc line when calculating edge weights. If it is selected, weight is added to the edge between A and B when a message contains "A Bcc B".
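
The effect of the two options on edge weights can be sketched as follows. The sketch assumes that every recipient occurrence adds 1 to the weight of the sender-to-recipient edge. The text does not state the exact increment, so that is an assumption, and MailHeader and EmailEdgeWeights are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified message header: one sender and three recipient lists. */
final class MailHeader {
    final String from;
    final List<String> to;
    final List<String> cc;
    final List<String> bcc;

    MailHeader(String from, List<String> to, List<String> cc, List<String> bcc) {
        this.from = from;
        this.to = to;
        this.cc = cc;
        this.bcc = bcc;
    }
}

final class EmailEdgeWeights {
    /** Returns "sender -> recipient" edges with their weights (assumed: +1 per occurrence). */
    static Map<String, Integer> compute(List<MailHeader> mails, boolean useCc, boolean useBcc) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (MailHeader mail : mails) {
            // "To" recipients always count; Cc and Bcc only when the option is selected.
            List<String> recipients = new ArrayList<>(mail.to);
            if (useCc) {
                recipients.addAll(mail.cc);
            }
            if (useBcc) {
                recipients.addAll(mail.bcc);
            }
            for (String recipient : recipients) {
                weights.merge(mail.from + " -> " + recipient, 1, Integer::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<MailHeader> mails = Arrays.asList(
            new MailHeader("a", Arrays.asList("b"), Arrays.asList("c"), none),
            new MailHeader("a", Arrays.asList("b"), none, none));
        System.out.println("without Cc: " + compute(mails, false, false));
        System.out.println("with Cc:    " + compute(mails, true, false));
    }
}
```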

For developers:

A developer who wants to add support for more file formats to the current project only needs to implement the interface EmailFilesFilter.java in the module "Spigoe Email". It has three methods:

  • String getDisplayName(): the display name in the UI. It also serves as the ID of the file filter.
  • String getSupportedFileExtension(): the file extension the developer wants to add, such as ".eml".
  • MimeMessage[] parseFiles(File[] file): the parsing itself. It is the most complex part; the developer should know a little about MimeMessage, a data structure implemented in the JavaMail API. Its format is an international standard: http://www.ietf.org/rfc/rfc822.txt.

Adding new filters is a little more complex, but not impossible. Developers should develop their own UI part and implement EmailFilter.java.

Twitter Import

We provide "twitter search network" and "twitter user network" for Twitter import. Remember that both of them need a Twitter user name and password. Although Twitter now supports authorization without a user name and password, we think a Twitter user name and password are easy to get, and with them we can get much more information than without them. So feel free to provide them to Gephi.

Twitter search network: We support the twitter search network as described in the design docs. There are three options for edge construction:

  • Replies-to relationship: If A replies to B in a searched tweet, an edge from A to B is added.
  • Mentions relationship: If A mentions B in a searched tweet, an edge from A to B is added.
  • Followers relationship: If A follows B in the constructed graph, an edge from A to B is added. This is time-consuming work, and it can easily exceed the request limit of the Twitter API. We strongly recommend that you do not choose this option if you expect the number of nodes to be greater than 100. To keep the number of people below 100, you can use the "get the first _ people" option. (This depends on the limits of twitter.com, which might change in the near future.)
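
The replies-to and mentions rules can be illustrated on raw tweet text. The sketch below is a simplification under stated assumptions: it detects mentions with a regular expression and treats a tweet that starts with @name as a reply to that user, whereas a real importer may rely on reply metadata returned by the API. TweetEdges is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TweetEdges {
    // An @ sign followed by letters, digits or underscores.
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");

    /** Returns one "author -> mentioned" edge for every @name found in the tweet text. */
    static List<String> mentionEdges(String author, String text) {
        List<String> edges = new ArrayList<>();
        Matcher matcher = MENTION.matcher(text);
        while (matcher.find()) {
            edges.add(author + " -> " + matcher.group(1));
        }
        return edges;
    }

    /** Assumption: a tweet whose text starts with @name is a reply to that user. */
    static Optional<String> replyEdge(String author, String text) {
        Matcher matcher = MENTION.matcher(text.trim());
        if (matcher.lookingAt()) {
            return Optional.of(author + " -> " + matcher.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(mentionEdges("alice", "@bob see @carol's graph"));
        System.out.println(replyEdge("alice", "@bob see @carol's graph"));
        System.out.println(replyEdge("alice", "nice graph by @bob"));
    }
}
```

The followers relationship cannot be derived from the tweet text at all. It needs one extra API request per person, which is why that option hits the request limit so quickly.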

"get the first _ people" and "get the first _ tweet " are used to control the scope of the graph.

Twitter user network: By default, we add an edge from A to B if A follows B anywhere in the graph. We provide three options for vertices:

  • Person followed by the user: If the searched user A follows B, B is added as a vertex.
  • Person following the user: If A follows the searched user B, A is added as a vertex.
  • Both: Both of the above.

"limit to _ people" is used to control the scope of the graph.

"contains detail attribute of each node" decides whether the detailed attributes of each node are fetched. We recommend that you do not choose it if the graph is fairly big, because it causes a connection error if the requests to twitter.com exceed the limit. It is also time-consuming work.

"get the depth of the graph" decides the depth of the graph to construct, from 1 to 3. If you choose 1, only the friends of the searched user are found and the graph is only a tree. If you choose 2, a real graph is generated (the number of nodes in the first layer is less than 150). If you choose 3, the graph is a little too big.

TODO:a cache might be added to the current dsni.
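
The depth and people-limit options of the user network correspond to a bounded breadth-first crawl. The following minimal sketch works under assumptions: FollowCrawler is a hypothetical name, and the function that returns the users followed by a given user stands in for the actual API request.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class FollowCrawler {
    /**
     * Breadth-first crawl from the searched user. Users up to maxDepth hops away are
     * collected, at most maxPeople users in total (the searched user included).
     * Returns "a -> b" edges, meaning that a follows b.
     */
    static Set<String> crawl(String start, Function<String, List<String>> followedBy,
                             int maxDepth, int maxPeople) {
        Set<String> edges = new LinkedHashSet<>();
        Map<String, Integer> depthOf = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String user = queue.poll();
            int depth = depthOf.get(user);
            if (depth >= maxDepth) {
                continue; // users on the last layer are not expanded
            }
            for (String friend : followedBy.apply(user)) {
                if (!depthOf.containsKey(friend)) {
                    if (depthOf.size() >= maxPeople) {
                        continue; // people limit reached: ignore new users
                    }
                    depthOf.put(friend, depth + 1);
                    queue.add(friend);
                }
                edges.add(user + " -> " + friend);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<String, List<String>> follows = new HashMap<>();
        follows.put("a", Arrays.asList("b", "c"));
        follows.put("b", Arrays.asList("c", "d"));
        follows.put("c", Arrays.asList("a"));
        Function<String, List<String>> lookup =
            u -> follows.getOrDefault(u, Collections.<String>emptyList());
        System.out.println("depth 1: " + crawl("a", lookup, 1, 100));
        System.out.println("depth 2: " + crawl("a", lookup, 2, 100));
    }
}
```

At depth 1 only the searched user is expanded, so the result is a star-shaped tree. From depth 2 on, edges between the collected users appear. Each expanded user costs at least one request, which is where a cache could help.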

NYT Import

First of all, you need an API key to use the NYT API. It can be registered here: http://developer.nytimes.com/apps/register

We support two kinds of NYT (New York Times) APIs. One is the article network, the other is the TimesPeople network.

  • Article Network: We provide two levels of use for this network: "Simple Query" and "Advanced Query".
    • Simple Query: The user can only choose the keywords and the date range, then selects the attributes he/she wants to analyse and the attribute he/she wants to use to construct edges.
  • TimesPeople Network: Because of the limitation of the API, we only provide email to start the construction of the graph. It's similar to "twitter user network".

If the user chooses "people followed by the user", an edge is built for each searched user who is followed by the specified user.

If the user chooses "people following the user", an edge is built for each searched user who follows the specified user. It is a time-consuming task, so do not choose it unless necessary.

You can limit the number of nodes with the "Limit to" check box.

You can also decide the depth of the graph. The default value is 1.

Discussion

Multi-search, Append: I think users could append graph data to their current graph, but I don't know whether it is hard to implement.

About Twitter: I have two ways in mind to represent the nodes and edges:

Followers and Friends:

Node: a Twitter user. It contains all kinds of information about the user, such as friends count and followers count, or even the page style of this user.

Edge: relationships of a Twitter user

There are two different kinds of graph. The first one is the followers of a Twitter user (the source user is A, B follows A, C follows B, …). The second is the friends of a Twitter user.

Other options: number of layers of the follower/friend relationship, whether to include the source Twitter user himself, directed or undirected graph

Simple Query:

Node: a tweet. It contains information about the date of the tweet, the Twitter user who posted it, and so on.

Edge: ? (I'm not quite clear about what represents an edge)


We also want to know the representations of nodes and edges for Facebook, Flickr, and so on.


Sebastien said we should provide 2 modes in the UI: some proposed use cases (imagine what a user will use the Gephi import for), and a "free/advanced" mode with full possibilities. I don't quite understand the meaning/contents of the "free/advanced" mode, so further discussion is welcome!


I think the two UI modes are as follows:

A junior user only needs to enter his user name and password for the SNS website. Of course he can choose which attributes represent nodes and which represent edges (or he can use the default options we provide).

A senior user can do two more things: one is to choose which parameters of a user or friends should be fetched using the API (inspired by NameGen [1]). The other is to filter the scope of the data the user wants to get (such as time scope, age of friends and so on).


I agree with you, by advanced I also thought about directly entering the query string, or something like this more "engineer-friendly" :-) Seb


[2] These are usages of Gephi on graphs such as social networks, Facebook and so on. They are good guidance for DSNI. They also show the necessity of DSNI :)