Friday 10 July 2015

Let's try to build a basic social network from an events log

Exploring Social data with Spark-GraphX

If you want to explore data that is highly connected, consider graph technology.

In this post we will exercise Spark-GraphX and Cassandra to build a very simple People-to-People connectivity graph from an events log. The solution is not focused on optimization; it is exploratory, about playing around with the technology and getting a feel for how it works.




Steps at a high level:

  • Present the events log in graph format as input
  • Parse the input file, which is in DOT format, and upload it to a Cassandra table
  • Map the events from the table to a Spark-GraphX EdgeRDD and create a Graph from it
  • Run Pregel and collect all edges from People to Content
  • Filter the vertices and create a People-to-People directed graph while grouping edges
  • Stream the People-to-People graph out to a file in DOT format for visualization

NOTE: The code is in Java, as it was easier for me to start with, even though it reads as more verbose than the Scala equivalent. Going forward I intend to move to Scala.


Events log in graph format 

To be able to play around with different inputs and gain a better understanding of GraphX and its capabilities, I decided to use the DOT language, which can be visualized with tools like Graphviz. The input is a list of People and Content nodes, plus the actions people performed on the content, such as "read" and "own".

Here is an example of such input. This is a sample of the data; see the section "Full Data Set" for the full list.
  • P - stands for Person/Profile
  • C - stands for Community
  • F - stands for File
digraph G {
    node [ fontname="Arial"] ;
    subgraph Attributes {
        node [ shape="circle" color="#0B0B61"] ;
        C101 [ label="Community_101" ] ;
        C102 [ label="Community_102" ] ;
        ............
        F201 [ label="File_201" ] ;
        F202 [ label="File_202" ] ;
        ............
     }
    subgraph Attributes {      
        node [ shape="doublecircle" color="#DF0101"] ;
        P1 [ label="Person_1" ] ;
        P2 [ label="Person_2" ] ;

        ............
     }
    subgraph Attributes {
        edge [ style="dotted" ] ;
        P1 -> C101 [ label="(owns,2015)"] ;
        P1 -> C102 [ label="(owns,2015)"] ;
        P2 -> C101 [ label="(owns,2015)"] ;
    }
}
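The last step of the pipeline writes edge lines exactly like the ones above, so it can help to see how a single DOT edge line is assembled. A minimal sketch (the formatEdge helper is my own illustration, not part of the post's code):

```java
public class DotEdgeWriter {
    // Builds one DOT edge line, e.g.:  P1 -> C101 [ label="(owns,2015)"] ;
    public static String formatEdge(String from, String to, String action, String year) {
        return String.format("    %s -> %s [ label=\"(%s,%s)\"] ;", from, to, action, year);
    }

    public static void main(String[] args) {
        System.out.println("digraph G {");
        System.out.println(formatEdge("P1", "C101", "owns", "2015"));
        System.out.println("}");
    }
}
```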


To get a visual graph of your input, download and install Graphviz.
Then run the following commands to generate a visual representation using circo and neato.

Example:
  • <graphviz dir>\Graphviz2.38\bin\circo.exe <output dir>\events.txt -o <output dir>\events.gv
  • <graphviz dir>\Graphviz2.38\bin\neato.exe -n <output dir>\events.gv -Tpng -O
  • Look for image at <output dir>\events.gv.png

 

Cassandra as distributed storage 

Besides Graphviz, we of course need the software packages that will store the data and do the processing. The versions used here follow the Spark Cassandra Connector compatibility table:

Connector    Spark    Cassandra    Cassandra Java Driver
1.3          1.3      2.1, 2.0     2.1

I downloaded apache-cassandra-2.1.5, installed it, ran it locally from the command line, and created KEYSPACE demo from the cqlsh tool.
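For reference, a keyspace like demo can be created in cqlsh with a statement along these lines (a single-node SimpleStrategy setup, suitable only for local experiments like this one):

```sql
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```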
For connecting from the application I used:
com.datastax.driver.core.Cluster cluster = 
Cluster.builder().addContactPoint("127.0.0.1").build();
com.datastax.driver.core.Session session = cluster.connect("demo");

I downloaded JPGD, a Java-based parser for Graphviz documents, and used it to parse the input file, which is in DOT format, and translate it into Cassandra statements.
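JPGD does the real parsing, but the idea can be illustrated without it. A simplified sketch that extracts one edge from a DOT line with a regular expression (my own illustration, not the JPGD API):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotEdgeParser {
    // Matches lines like:  P1 -> C101 [ label="(owns,2015)"] ;
    private static final Pattern EDGE = Pattern.compile(
        "(\\w+)\\s*->\\s*(\\w+)\\s*\\[\\s*label=\"\\((\\w+),(\\w+)\\)\"\\s*\\]\\s*;");

    // Returns {profile, content, action, time}, or null if the line is not an edge.
    public static String[] parseEdge(String line) {
        Matcher m = EDGE.matcher(line.trim());
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```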

I first created the events table (session is the com.datastax.driver.core.Session created above):
session.execute(
    "CREATE TABLE events (id int PRIMARY KEY, profile varchar, content varchar, action varchar, time varchar);");


Then I looped and inserted all events into the table:
session.execute(
    "INSERT INTO events (id, profile, content, action, time) VALUES ( ......);");
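The elided VALUES clause is just string assembly per event. A sketch of how one such statement could be built (toInsertCql is my own helper, not from the post; in real code a prepared statement with bind variables would be safer than string concatenation):

```java
public class EventCql {
    // Builds a CQL INSERT for one event row; assumes the values contain no quotes.
    public static String toInsertCql(int id, String profile, String content,
                                     String action, String time) {
        return String.format(
            "INSERT INTO events (id, profile, content, action, time) "
            + "VALUES (%d, '%s', '%s', '%s', '%s');",
            id, profile, content, action, time);
    }
}
```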

And last, I printed all rows in the Cassandra table (not sorted by id):

 [id: 23] [profile: P4] [content: C105] [action: owns] [time: 2015]
[id: 53] [profile: P2] [content: F201] [action: read] [time: 2015]
[id: 91] [profile: P10] [content: W303] [action: read] [time: 2015]
[id: 117] [profile: P2] [content: F205] [action: own] [time: 2015]
......................

......................
...................... 
[id: 87] [profile: P10] [content: W302] [action: read] [time: 2015]
[id: 77] [profile: P10] [content: F203] [action: read] [time: 2015]
[id: 3] [profile: P1] [content: C102] [action: owns] [time: 2015]
[id: 103] [profile: P10] [content: W304] [action: own] [time: 2015]
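Cassandra returns these rows in partition (token) order, not by id. If sorted output is wanted, the client can sort after fetching; a minimal sketch using a hypothetical Event holder class of my own:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortEvents {
    // Hypothetical in-memory representation of one row of the events table.
    static class Event {
        final int id;
        final String profile;
        Event(int id, String profile) { this.id = id; this.profile = profile; }
    }

    // Sorts fetched rows by id ascending, since Cassandra returns them unordered.
    public static List<Event> sortById(List<Event> rows) {
        List<Event> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt(e -> e.id));
        return sorted;
    }
}
```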