Specification - GSoC Direct Social Networks Import
Student: Yi Du (杜一)
Mentor: Sebastien Heymann
Abstract
This is the specification of the GSoC project Direct Social Networks Import (DSNI). It covers the key technologies, design issues, implementation details and roadmap.
Introduction
Twitter, NYT, Flickr, Facebook, Del.icio.us, emails... are considered by sociologists (and others) as valuable data sources. This proposal aims to make Gephi's import process easier by connecting it to the proper API, and making available direct queries on these social networks.
Gaining insight into social networks with Gephi currently requires the following steps:
- Get a snapshot of the data (like a "screenshot" or a photograph).
- Pre-process the data.
- Generate a graph file (e.g. GEXF or GraphML).
- Import this file into Gephi.
- Start visualizing and analyzing data.
We could be more flexible by making Gephi able to connect to their API directly and grab data centered on individuals or complete queries. The process would be transformed to:
- Select a network to connect.
- Write the query.
- Gephi gets the data and generates the network on screen.
- The user expands and refines the results by interacting with the network (graph window). Data Lab interactions will be created in the future.
Description of existing networks
We can learn from NodeXL and divide each complex network into several parts. For example, NodeXL divides Twitter into a user's network and a search network. The graph descriptions of the social networks are as follows:
Email
Description: Email import helps users analyse the email contacts between people. Each email address is a node, and an edge is built when an email has been sent from one address to another. We will also add "multi-emails": the user can combine several email addresses and analyse them together.
Users can import email data from either an email server or an email client (Thunderbird, Outlook Express, etc.).
Architecture: We use the JavaMail API to implement the email import module. Filtered emails are received one by one and the useful data is extracted, so the fetching speed depends on JavaMail. We will support POP3 first and IMAP later. The drawback of POP3 is that it cannot retrieve the outbox data.
API: JavaMail http://java.sun.com/products/javamail/reference/api/index.html
Type of Data Source: Collection of email files. The email server returns a list of "Message" objects. We should define an Email class to hold this data.
Limitations: Many email service providers do not offer IMAP access.
Graphs:
| Name | Info | Nodes | Edges | Node data | Edge data | Filter |
|---|---|---|---|---|---|---|
| People Network | graph of emails between users | Email addresses | If an email exists from A to B, an edge from A to B is built. Cc and Bcc can be configured whether to be seen as an edge. | email address(id),personal name(label) | NONE | Before loading to Gephi, we can filter emails by date, specific email address, existence of Cc or Bcc or Attachment, subject keywords, message keywords |
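The edge rule of the People Network can be sketched as follows. This is a minimal sketch only: `Mail` is a hypothetical, simplified stand-in for a parsed email (not JavaMail's `Message`), `EmailEdges.build` is a hypothetical helper, and using the number of messages as the edge weight is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a parsed email (not a JavaMail class). */
final class Mail {
    final String from;
    final List<String> to;
    final List<String> cc;
    final List<String> bcc;

    Mail(String from, List<String> to, List<String> cc, List<String> bcc) {
        this.from = from;
        this.to = to;
        this.cc = cc;
        this.bcc = bcc;
    }
}

final class EmailEdges {
    /**
     * Returns directed edges keyed as "sender -> recipient", with the number
     * of messages as the weight. Cc and Bcc recipients count only if enabled.
     */
    static Map<String, Integer> build(List<Mail> mails, boolean useCc, boolean useBcc) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (Mail mail : mails) {
            List<String> recipients = new ArrayList<>(mail.to);
            if (useCc) {
                recipients.addAll(mail.cc);
            }
            if (useBcc) {
                recipients.addAll(mail.bcc);
            }
            for (String recipient : recipients) {
                weights.merge(mail.from + " -> " + recipient, 1, Integer::sum);
            }
        }
        return weights;
    }
}
```

Filters (date, address, keywords) would be applied to the list of mails before this step, so the edge construction itself stays independent of them.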
Twitter
Description: Twitter is a social networking and microblogging service that enables its users to send and read messages known as tweets. http://en.wikipedia.org/wiki/Twitter
API: Twitter API wiki: http://apiwiki.twitter.com/
Twitter API for Java: http://twitter4j.org/en/index.jsp
Type of Data Source: REST in XML, JSON, RSS, Atom
Limitations:
- http://support.twitter.com/articles/15364-about-twitter-limits-update-api-dm-and-following (API requests: 150 per hour. Because of this limit, we cannot fetch sufficient user profile data.)
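To see what the 150 requests per hour mean in practice, here is a minimal sketch of a request budget. `RequestBudget` is a hypothetical helper, and the cost of two requests per user (one for the profile, one for the follower list) is an assumption used only for illustration. The sketch does not reset the counter when a new hour starts.

```java
/** Hypothetical helper that tracks how many API requests are left in the current hour. */
final class RequestBudget {
    private final int limitPerHour;
    private int used;

    RequestBudget(int limitPerHour) {
        this.limitPerHour = limitPerHour;
    }

    /** Reserves one request; returns false once the hourly quota is spent. */
    boolean tryAcquire() {
        if (used >= limitPerHour) {
            return false;
        }
        used++;
        return true;
    }

    /** Requests still available in this hour. */
    int remaining() {
        return limitPerHour - used;
    }

    /** Number of users that can still be fetched at the given cost per user. */
    int usersAffordable(int requestsPerUser) {
        return remaining() / requestsPerUser;
    }
}
```

Under these assumptions, `new RequestBudget(150).usersAffordable(2)` gives 75 users per hour, which is why full profile data cannot be fetched for larger graphs.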
Graphs:
Similar to NodeXL, but more detailed. For example, we can add filters on Twitter profiles.
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| User's Network | People who follow each other | Twitter users | Follow relationship | User profile such as name,location,screen name | follower or following |
| Search Network | People who search or mention similar keywords | Twitter users | Follow relationship, reply to, mention | User profile such as name, location, screen name | keywords or follow relationship |
NYT
Description: The New York Times is an American daily newspaper founded and continuously published in New York City since 1851. http://en.wikipedia.org/wiki/Nyt. It has a series of API for developers on news and social networks.
API: http://developer.nytimes.com/docs/
Type of Data Source: JSON (.json, default), XML (.xml). The Article Search API does not support XML; it supports JSON only.
Limitations: To use an NYT API, the user must have an API key for that specific API. A key can be registered here: http://developer.nytimes.com/apps/register
Graphs:
There are several NYT APIs, such as the Article Search API, the Best Seller API, etc. NYT also has a tool that uses Many Eyes to visualize datasets.
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Article Network | Articles with specific filters, such as date, keywords, facets, etc | Articles | User can choose which option constructs the edge. For example, user can choose date as the edge. If two articles have the same date, an edge between them is built. Of course, maybe no edge is built. | author, date, facet, contents | Parameter user chooses to construct the edge |
| TimesPeople Network | TimesPeople is a social network for Times readers. The TimesPeople Network can analyse the relationship between them. | users | If user A follows user B, an edge from A to B is built. | location,name,display name | follower or following |
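The edge rule of the Article Network (two articles are linked when they share the value of the attribute the user chose) can be sketched as follows. This is a minimal sketch with a hypothetical helper `SharedAttributeEdges`; it assumes each article has a single value for the chosen attribute, such as its date.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SharedAttributeEdges {
    /**
     * @param attributeByArticle maps an article id to the value of the attribute
     *        the user chose for edge construction (for example, its date)
     * @return undirected edges as two-element arrays {articleA, articleB}
     */
    static List<String[]> build(Map<String, String> attributeByArticle) {
        // Group article ids by attribute value.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributeByArticle.entrySet()) {
            groups.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(entry.getKey());
        }
        // Connect every pair of articles inside a group.
        List<String[]> edges = new ArrayList<>();
        for (List<String> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    edges.add(new String[] {group.get(i), group.get(j)});
                }
            }
        }
        return edges;
    }
}
```

If no two articles share a value, the result is empty, which matches the remark that possibly no edge is built.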
Facebook
Description: Facebook is a social networking website. Users can add friends and send them messages, and update their personal profiles to notify friends about themselves. Additionally, users can join networks organized by workplace, school, or college. http://en.wikipedia.org/wiki/Facebook
API: http://developers.facebook.com/docs/api
Facebook Java API: http://code.google.com/p/facebook-java-api/
Type of Data Source: REST in JSON
Limitations: ?
Graphs:
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Friend Network | A Graph on the relationship between the friends of a given user | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing |
| Group Network | A Graph on the relationship between the friends inside a given group | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing |
| Search Network | A Graph on the relationship between the friends inside a search result on pages, events or groups | users | If A is friend of B, an edge between A and B is built. | id, name, location, work, education, etc | nothing |
Flickr
Description: Flickr is a photo sharing website.
API: http://www.flickr.com/services/api/
Type of Data Source: REST in XML, JSON, XML-RPC, SOAP
Limitations: ?
Graphs:
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Network 1 | Similar people by tags | users | identical tags on photos | photos | TODO |
| Network 2 | Similar tags by photos | tags | 2 tags are in the same photo | nothing | license of photos |
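Network 2 (two tags are linked when they appear on the same photo) is a co-occurrence graph. A minimal sketch, assuming the tags of each photo have already been fetched; `TagCooccurrence` is a hypothetical helper, and using the number of shared photos as the edge weight is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class TagCooccurrence {
    /**
     * @param tagsPerPhoto one list of tags per photo
     * @return edge weights keyed as "tagA|tagB" (tags in alphabetical order),
     *         where the weight is the number of photos carrying both tags
     */
    static Map<String, Integer> build(List<List<String>> tagsPerPhoto) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> photoTags : tagsPerPhoto) {
            // Sort and de-duplicate so each pair is counted once per photo.
            List<String> tags = new ArrayList<>(new TreeSet<>(photoTags));
            for (int i = 0; i < tags.size(); i++) {
                for (int j = i + 1; j < tags.size(); j++) {
                    weights.merge(tags.get(i) + "|" + tags.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }
}
```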
LinkedIn
Description: LinkedIn is a professional network.
API: http://developer.linkedin.com/community/apis
Type of Data Source: REST in XML
Limitations: ?
Graphs:
The friendship network is not browsable. Also, most users hide their connections, but we could do some analyses with the Search API and Profile API.
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Hiring Network | Bipartite graph of companies and universities grabbed from Public Profiles | company, education | If a univ. and a company are on a user's profile, a directed edge is built from the univ. to the company. | company or univ. name | start-date, end-date |
| Org. local Network | Bipartite graph of companies and universities of people who are in the user's network, grabbed from Standard Profiles | company, education | If two people were/are in the same company or share the same education | company or education | start-date, end-date |
| Personal Network | People who are in the user's network, grabbed from Standard Profiles | people | If two people were/are currently in the same company or univ. | industry, country, company*, education* | company*, education* |
Delicious
Description: Delicious is a network for sharing bookmarks and tagging them.
API: http://delicious.com/help/api
Type of Data Source: REST in XML
Limitations: wait at least one second between queries http://delicious.com/help/api
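The one-second rule can be respected with a small throttle placed in front of every query. This is a minimal sketch; `QueryThrottle` is a hypothetical helper, not part of any Delicious client library.

```java
/** Hypothetical helper that enforces a minimum interval between two queries. */
final class QueryThrottle {
    private final long minIntervalMillis;
    private long lastQueryMillis;
    private boolean hasQueried;

    QueryThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns how long to wait (in milliseconds) before a query issued at nowMillis. */
    long delayBefore(long nowMillis) {
        if (!hasQueried) {
            return 0;
        }
        long elapsed = nowMillis - lastQueryMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Records that a query was sent at nowMillis. */
    void markQueried(long nowMillis) {
        lastQueryMillis = nowMillis;
        hasQueried = true;
    }

    /** Blocks until the next query is allowed, then records its time. */
    void awaitTurn() throws InterruptedException {
        long delay = delayBefore(System.currentTimeMillis());
        if (delay > 0) {
            Thread.sleep(delay);
        }
        markQueried(System.currentTimeMillis());
    }
}
```

An importer would create one `new QueryThrottle(1000)` and call `awaitTurn()` before each request.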
Graphs:
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Suggested Tags Network | Given a list of URLs, generates a bipartite graph of URLs and recommended tags (https://api.del.icio.us/v1/posts/suggest) | tags, URLs | If a tag is suggested for an URL, an undirected edge is built between them. | Boolean set to true if the tag is popular | nothing |
| Suggested People Network | Given a list of URLs, generates a bipartite graph of URLs and people who bookmark these URLs (https://api.del.icio.us/v1/posts/suggest) | usernames, URLs | If a person is suggested for an URL, an undirected edge is built between them. | nothing | nothing |
| Tags Network | Given a list of usernames, generates a bipartite graph of people and tags (https://api.del.icio.us/v1/tags/get) | usernames, tags | If a person uses a tag, an undirected edge is built between them. | nothing | nothing |
YouTube
Description: A video publication network.
API: http://code.google.com/intl/zh/apis/youtube/2.0/developers_guide_protocol_audience.html
Type of Data Source: REST in AtomPub XML, JSON
Limitations: ?
Graphs:
Available fields on videos: http://code.google.com/intl/zh/apis/youtube/2.0/reference.html#youtube_data_api_tag_entry
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Video Network | Given a list of keywords, restricted by a list of categories, a graph of videos sharing these keywords is created. (api) | videos | If video A and video B have a keyword in common, an undirected edge is built between A and B. | title, link, author?, published?, viewCount, yt:location?, latitude?, longitude?, yt:recorded?, categories+, keywords+ | shared keywords+ |
| Top Rated Keywords Network | Graph of region-specific and category-specific keywords of top rated videos (api) | keywords | If keyword A and keyword B have a video in common, an undirected edge is built between A and B. | sum(viewCount), categories+ | shared video titles+ |
| Related Keywords Network | Graph of keywords of the given video and related video keywords at a given depth (api) | keywords | If keyword A and keyword B have a video in common, an undirected edge is built between A and B. | sum(viewCount), categories+ | shared video titles+ |
| Searched Network | From data retrieved by a raw query, a graph of videos and/or keywords and/or users is created (api) | videos/keywords/users | If a single type of node is selected, the user must choose which type of element should be shared to create an edge. For example, if nodes are keywords, then two keywords may be linked by a common user or a common video. | depends on the node type | depends on the edge type |
Last.FM
Description: Music discovery network.
Type of Data Source: REST in XML, XML-RPC, XSPF (playlists)
Limitations: ?
Graphs:
| Name | Info | Nodes | Edges | Node data | Edge data |
|---|---|---|---|---|---|
| Past Venues Network | Graph of locations linked by artists. (api and api) | locations | If location A and location B have an artist in common, an undirected edge is built between A and B. | city, country, venue names*, geo:lat, geo:long | startDate |
| Similar Artists Network | Graph of similar artists of the given artist at a given depth. (api) | artists | If artist A is similar to artist B, an undirected edge is built between A and B. | nothing | nothing |
| Similar Artists Keywords Network | Bipartite Graph of tags of the given artist and related artist tags at a given depth. (api and api) | tags, artists | If tag A and tag B have an artist in common, an undirected edge is built between A and B. | nothing | shared artists+ |
Open Social API
Description: The Google Open Graph API.
API: http://wiki.opensocial.org/index.php?title=JavaScript_API_Reference
Type of Data Source: REST in XML
Limitations: ?
Graphs: TODO
Design
Use Case
This use case describes how a user will work with DSNI once it is implemented. A user wants to see the state of his Twitter network.
- He presses the "import from twitter" icon/button, and a user interface is displayed.
- The user interface guides the user to enter an account name and password. It also lets the user specify which parameter represents the nodes, which represents the edges, and which represent node attributes. In addition, the user interface should support queries so that users fetch only what they really want. (It remains to be determined whether filters/queries are common enough to be structured for reuse.)
- When this is done, the user clicks "submit" and the graph is displayed.
When the "submit" button is clicked, the appropriate module fetches the relevant data from Twitter and parses it.
UI
Email:
Required input: email address, password, receive mail server (server type [imap|pop3] and server address).
Junior (no filter): receive all emails from the server. Senior: 1. the user can filter emails; 2. the user can combine several email addresses.
Architecture
What do all social networks have in common? What are the differences? What they all share is the graph data structure. There are also some common aspects that apply to several, but not all, social networks:
- API key. Only with this key can we use the API and import data from these websites (NYT, Flickr).
- HTTP request and response. Most of the websites support simple HTTP requests and responses to fetch data (Twitter, Facebook, Flickr, NYT). There are also Java wrapper libraries for this, such as twitter4j (Twitter) and facebook-java-api (Facebook). These libraries are much easier to use than parsing XML or JSON by hand.
- Most of the social networks support XML or JSON, except email.
In addition, an abstract controller is a necessary class for developers who implement a new social network import.
The biggest difference is the user interface. Each kind of social network should have its own user interface.
I will write a new module named SNAImport for DSNI; this module is the main API and SPI for DSNI.
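    package org.gephi.io.importer.api;

    public interface SNAImportController {
        SNAImporter getSNAImporter();
        // Named importData because "import" is a reserved word in Java.
        Container importData(String query, SNAImporter importer);
        void process(Container container);
        // ...
    }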
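    package org.gephi.io.importer.api;

    // Data structure and query strings of a social network.
    public interface SocialNetwork {
        String getUserName();
        void setUserName(String userName);
        String getPassword();
        void setPassword(String password);
        String getAPIKey();
        void setAPIKey(String apiKey);
        String getNodeQuery();
        void setNodeQuery(String nodeQuery);
        String getEdgeQuery();
        void setEdgeQuery(String edgeQuery);
        String getNodeAttrQuery();
        void setNodeAttrQuery(String nodeAttrQuery);
        String getEdgeAttrQuery();
        void setEdgeAttrQuery(String edgeAttrQuery);
    }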
I think SNAImporter should be an interface (SPI) that extends Importer, which is in the Import module. It has several methods such as: isMatchingImporter(), import(String,ContainerLoader,Report).
    package org.gephi.io.importer.spi;

    // ContainerLoader, Report and HttpRequest must be imported from their modules.
    public interface SocialNetWorkImporter extends Importer {
        void importData(SocialNetwork snw, ContainerLoader containerLoader, Report report);
        void importData(HttpRequest hr, ContainerLoader containerLoader, Report report); // TODO
    }
The architecture of DSNI is as above. To add a new import with DSNI (taking email import as an example), we need:
- A UI for the email import.
- An import method, built on the SPI, that processes the data and translates it into the supported data structure.
- ?
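The split between the SPI and a concrete import can be illustrated with a reduced, self-contained sketch. Every name here is hypothetical: `GraphSink` stands in for Gephi's ContainerLoader, `NetworkImporter` is a cut-down version of the importer SPI, and `PairListImporter` is a toy plug-in that reads "from>to" pairs instead of calling a real service. It is not the actual Gephi API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Gephi's ContainerLoader: receives nodes and edges. */
interface GraphSink {
    void addNode(String id);
    void addEdge(String sourceId, String targetId);
}

/** Hypothetical, reduced version of the importer SPI. */
interface NetworkImporter {
    void importData(String query, GraphSink sink);
}

/** Toy plug-in: treats the query as "from>to" pairs separated by ';'. */
final class PairListImporter implements NetworkImporter {
    @Override
    public void importData(String query, GraphSink sink) {
        for (String pair : query.split(";")) {
            String[] ends = pair.trim().split(">");
            if (ends.length != 2) {
                continue; // skip malformed pairs
            }
            sink.addNode(ends[0]);
            sink.addNode(ends[1]);
            sink.addEdge(ends[0], ends[1]);
        }
    }
}

/** Collects what an importer produces; handy for trying out a plug-in. */
final class RecordingSink implements GraphSink {
    final List<String> nodes = new ArrayList<>();
    final List<String> edges = new ArrayList<>();

    @Override
    public void addNode(String id) {
        if (!nodes.contains(id)) {
            nodes.add(id);
        }
    }

    @Override
    public void addEdge(String sourceId, String targetId) {
        edges.add(sourceId + "->" + targetId);
    }
}
```

The point of the design is that the UI and the controller only depend on the importer interface, so a new social network is added by writing one more implementation plus its UI.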
Roadmap
- May 24th ~ June 10th: Email Import ---- done
  - May 24th ~ May 30th: Complete the specification
  - Confirm the architecture of DSNI (SPI of SNAImporter).
  - Implement ImportEmailUI, SNAImportController
  - Implement ImportEmail using the JavaMail API
- June 11th ~ June 18th: Twitter Import ---- done, but delayed until June 25th.
  - Confirm the UI details (see "Design"). Implement the user's network, both the UI and the processing.
  - Confirm the UI details (see "Design"). Implement the search network, both the UI and the processing.
- June 19th ~ July 2nd: NYT Import
- July 3rd ~ July 17th: Flickr Import (mid-term evaluations between July 12th and July 16th)
- July 18th ~ July 28th: YouTube Import
- July 29th ~ August 9th: Test, then pencils down.
- Written on June 27th: I want to package an HTTP request module with the help of HttpClient (an open-source project), and use it for the following three imports, one by one.
User Manual
Email Import
We added an email import function to the current project. Our implementation supports two ways of importing email:
- The first is from local files. Only .eml files are supported. Users can select multiple .eml files to import.
- The second is from an email server. Users need to know their email address and password, as well as the POP/IMAP server, the port, and whether SSL is used. Taking Gmail as an example, the user needs: email address, password, server type (pop3 or imap), whether the server uses an SSL connection (Gmail always uses SSL), port (995 for POP, 993 for IMAP), and receive server address (pop.gmail.com|imap.gmail.com).
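The server settings listed above map onto JavaMail session properties. The sketch below only builds the `java.util.Properties` object so that it stays self-contained; `MailServerSettings` is a hypothetical helper. The property names (`mail.store.protocol`, `mail.<protocol>.host`, `mail.<protocol>.port`, `mail.<protocol>.ssl.enable`) are the ones documented for JavaMail's POP3 and IMAP providers; whether the project uses exactly these is an assumption.

```java
import java.util.Properties;

/** Hypothetical helper that turns the user's server settings into JavaMail properties. */
final class MailServerSettings {
    /**
     * @param protocol "pop3" or "imap"
     * @param host     receive server address, e.g. "pop.gmail.com"
     * @param port     server port, e.g. 995 for POP3 over SSL
     * @param ssl      whether the server requires an SSL connection
     */
    static Properties forServer(String protocol, String host, int port, boolean ssl) {
        Properties props = new Properties();
        props.setProperty("mail.store.protocol", protocol);
        props.setProperty("mail." + protocol + ".host", host);
        props.setProperty("mail." + protocol + ".port", Integer.toString(port));
        props.setProperty("mail." + protocol + ".ssl.enable", Boolean.toString(ssl));
        return props;
    }
}
```

With JavaMail on the classpath, the result would typically be passed to `Session.getInstance(props)`, followed by `session.getStore()` and `store.connect(user, password)`.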
Junior users can choose "select all email" and skip the filter. Senior users can filter their import by several parameters: email address (From, To, Cc, Bcc), received date (before or after a specific date), whether the message has an attachment, whether it has Cc or Bcc recipients, text contained in the message, and text contained in the subject.
Besides these options, we also provide two more:
- Use Cc line when calculating edge weights. If selected, the weight of the edge between A and B is increased when a message from A has B in Cc.
- Use Bcc line when calculating edge weights. If selected, the weight of the edge between A and B is increased when a message from A has B in Bcc.
For developers:
To add support for more file formats to the current project, a developer only needs to implement the interface EmailFilesFilter.java in the module "Spigoe Email". It has three methods:
- String getDisplayName(): the display name in the UI. It also serves as the ID of the file filter.
- String getSupportedFileExtension(): the file extension the developer wants to add, such as ".eml".
- MimeMessage[] parseFiles(File[] file): the parsing logic. This is the most complex part; the developer should know a little about MimeMessage, a data structure implemented in the JavaMail API. Its format is an international standard, http://www.ietf.org/rfc/rfc822.txt.
Adding new filters is a little more complex, but not impossible: developers should write their own UI part and implement EmailFilter.java.
Twitter Import
We provide "twitter search network" and "twitter user network" for Twitter import. Remember that both of them need a Twitter user name and password. Although Twitter now supports authorization without a user name and password, we think a Twitter user name and password are easy to obtain, and with them we can get much more information than without. So feel free to provide them to Gephi.
Twitter search network: we support the Twitter search network as described in the design docs. There are three options for edge construction:
- Replies-to relationship: If A replies to B in a searched tweet, an edge from A to B is added.
- Mentions relationship: If A mentions B in a searched tweet, an edge from A to B is added.
- Followers relationship: If A follows B in the constructed graph, an edge from A to B is added. This is time consuming and can easily exceed the request limit of the Twitter API. We strongly recommend not choosing this option if you expect more than 100 nodes. To keep the number of people below 100, use the "get the first _ people" option. (This depends on the limits of twitter.com and might change in the near future.)
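The mentions relationship can be derived from the tweet text alone. This is a minimal sketch with a hypothetical helper `MentionEdges`; it assumes screen names consist of 1 to 15 letters, digits or underscores, and it does not distinguish replies from mentions (a real import would more likely rely on the reply metadata returned by the API).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MentionEdges {
    // Assumption: screen names are 1 to 15 letters, digits or underscores.
    private static final Pattern MENTION = Pattern.compile("@(\\w{1,15})");

    /** Returns one "author -> mentioned" edge per @mention found in the tweet text. */
    static List<String> fromTweet(String author, String text) {
        List<String> edges = new ArrayList<>();
        Matcher matcher = MENTION.matcher(text);
        while (matcher.find()) {
            edges.add(author + " -> " + matcher.group(1));
        }
        return edges;
    }
}
```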
"get the first _ people" and "get the first _ tweet " are used to control the scope of the graph.
Twitter user network: By default, we add an edge from A to B if A follows B anywhere in the graph. We provide three options for vertices:
- Person followed by the user: If the searched user A follows B, B is added as a vertex.
- Person following the user: If A follows the searched user B, A is added as a vertex.
- Both: Both of the above.
"limit to _ people" is used to control the scope of the graph.
"contains detail attribute of each node" decides whether detailed attributes are fetched for each node. We recommend not choosing it if the graph is fairly big, because it causes a connection error when the requests to twitter.com exceed the limit. It is also time consuming.
"get the depth of the graph" sets the depth of the graph to construct, from 1 to 3. If you choose 1, only the friends of the searched user are found and the graph is just a tree. If you choose 2, a real graph is generated (the number of nodes in the first layer is less than 150). If you choose 3, the graph becomes rather big.
TODO: a cache might be added to the current DSNI.
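The interplay of the depth option and the "limit to _ people" option in the user network can be sketched as a breadth-first expansion. This is a minimal sketch, not the actual implementation: `DepthLimitedCrawl` is a hypothetical helper, and the `friends` map stands in for the calls to the Twitter API.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DepthLimitedCrawl {
    /**
     * Breadth-first expansion from a start user.
     *
     * @param friends   adjacency data standing in for API calls (user -> followed users)
     * @param maxDepth  how many hops to follow from the start user
     * @param maxPeople upper bound on the number of discovered users, start user included
     * @return each discovered user mapped to its distance from the start user
     */
    static Map<String, Integer> crawl(Map<String, List<String>> friends, String start,
                                      int maxDepth, int maxPeople) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String user = queue.remove();
            int depth = depthOf.get(user);
            if (depth == maxDepth) {
                continue; // do not expand beyond the requested depth
            }
            for (String friend : friends.getOrDefault(user, Collections.<String>emptyList())) {
                if (depthOf.size() >= maxPeople) {
                    return depthOf; // people limit reached
                }
                if (!depthOf.containsKey(friend)) {
                    depthOf.put(friend, depth + 1);
                    queue.add(friend);
                }
            }
        }
        return depthOf;
    }
}
```

With depth 1 only the direct friends of the start user are discovered, which is the tree-shaped case described above; each additional level multiplies the number of API requests, so the people limit doubles as a guard against the request limit.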
NYT Import
First of all, you need an API key to use the NYT API. It can be registered here: http://developer.nytimes.com/apps/register
We support two kinds of NYT (New York Times) APIs: the article network and the TimesPeople network.
- Article Network: We provide two levels of use, "Simple Query" and "Advanced Query".
  - Simple Query: The user chooses only the keywords and the date range, then selects the attributes to analyse and the attribute used to construct edges.
  - Advanced Query: Advanced users can write their own query string according to the NYT Article Search API (format here: http://developer.nytimes.com/docs/article_search_api#h3-uri), then select the attributes to analyse and the attribute used to construct edges.
- TimesPeople Network: Because of the limitations of the API, the graph can only be started from an email address. It is similar to "twitter user network".
If the user chooses "people followed by the user", an edge is built for each person who is followed by the searched user.
If the user chooses "people following the user", an edge is built for each person who follows the searched user. This is a time-consuming task, so do not choose it unless necessary.
You can limit the number of nodes with the "Limit to" check box.
You can also set the depth of the graph; the default value is 1.
Discussion
Multi-search, Append: I think users could append graph data to their current graph, but I don't know whether this is hard to implement.
About Twitter: I see two ways to represent the nodes and edges:
Followers and Friends:
Node: a Twitter user. It contains all kinds of information about the user, such as friends count and followers count, or even the page style of the user.
Edge: a relationship between Twitter users.
There are two different kinds of graph. The first is the followers of a Twitter user (the source user is A, B follows A, C follows B, ...). The second is the friends of a Twitter user.
Other options: number of layers of the follower/friend relationship, whether to include the source Twitter user, directed or undirected graph.
Simple Query:
Node: a tweet. It contains information such as the date of the tweet and the Twitter user who posted it.
Edge: ? (I'm not yet clear about what an edge represents.)
We also want to know how nodes and edges should be represented for Facebook, Flickr, and so on.
Sebastien said we should provide 2 modes in the UI: some proposed use cases (imagine what a user will use Gephi import for), and a "free/advanced" mode with full possibilities. I don't quite understand the meaning/contents of "free/advanced", so further discussion is welcome!
I think the two UI modes are as follows:
A junior user only needs to enter his user name and password for the SNS website. Of course, he can choose which attributes represent nodes and which represent edges (or use the default option we provide).
A senior user can do two more things: one is to choose which parameters of a user or friends should be fetched through the API (inspired by NameGen [1]). The other is to filter the scope of the data to fetch (such as time range, age of friends and so on).
I agree with you, by advanced I also thought about directly entering the query string, or something like this more "engineer-friendly" :-) Seb
[2] These are uses of Gephi on graphs such as social networks, Facebook and so on. They are good guidance for DSNI. They also show the necessity of DSNI :)

