Date posted: 22-Jan-2018
Category: Data & Analytics
Uploaded by: wassim-trifi
Wassim Trifi – Apr 2016
Objectives
Build a graph of business relationships in an organization based on transactions between employees.
Express the business model.
Understand the communication process between employees.
Measure each employee's influence by degree.
Identify the most connected employees.
Gain more insight into communities inside an organization.
How to do it?
Data
• HR data from HRIS DB
• Daily email communications between employees from EmailServer
ETL
• Collect data from the sources with SparkSQL
• Clean, transform and aggregate
• Load into structured data frames
Analyze
• Prepare inputs for the Graph
• Create the communication graph with GraphX
• Apply algorithms and comment on the results
Visualize
• Load the graph into Neo4J
• Visualize the components and their respective connections.
Architecture
[Architecture diagram; recoverable labels: HRIS DB (Tables), EmailServer (File), HDFS, Create On, Visualize]
Data Source: Email Server
A JSON file of 42,000 lines with email details («ID»: email identifier, «Receivers», «Sender»).
I developed a Python program to generate this file.
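The generator program itself is not shown on the slide. A minimal sketch of what such a program could look like, assuming one JSON object per line, integer employee IDs and a random number of receivers per email (the function names, `max_receivers` and `seed` are hypothetical; only the three field names come from the slide):

```python
import json
import random


def generate_emails(n_emails, n_employees, max_receivers=15, seed=0):
    """Yield one dict per email with the fields ID, Sender and Receivers."""
    if n_employees < 2:
        raise ValueError("need at least two employees to send an email")
    rng = random.Random(seed)
    employee_ids = list(range(1, n_employees + 1))
    for email_id in range(1, n_emails + 1):
        sender = rng.choice(employee_ids)
        k = rng.randint(1, min(max_receivers, n_employees - 1))
        # Draw k + 1 distinct employees, drop the sender if present,
        # and keep exactly k receivers.
        candidates = rng.sample(employee_ids, k + 1)
        receivers = [e for e in candidates if e != sender][:k]
        yield {"ID": email_id, "Sender": sender, "Receivers": receivers}


def write_json_lines(path, emails):
    """Write one JSON object per line, the layout assumed in this sketch."""
    with open(path, "w", encoding="utf-8") as handle:
        for email in emails:
            handle.write(json.dumps(email) + "\n")
```

Calling `write_json_lines("emails.json", generate_emails(42000, 1000))` would produce a file of the size mentioned above; the file name is only an example.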
Data Source: Employee DB
A PostgreSQL table of 1,000 rows containing employee data.
ETL: Extract Employee Table
Apache Spark is a powerful tool for ETL and analytics.
We use the SparkSQL library in Scala to load the data from PostgreSQL.
ETL: Employee DataFrame
The new DataFrame is created with 1,000 rows of employee details.
ETL: Extract Emails File
The JSON file is saved under HDFS.
We load its content and transform it into a DataFrame with SparkSQL, in Scala.
ETL: Emails DataFrame
The new DataFrame is stored in memory with 372,011 rows of email details.
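The row count is larger than the number of lines in the JSON file, which suggests that each email is expanded into one row per receiver. That reading is an assumption, and the sketch below is plain Python rather than the SparkSQL code used in the deck (the function name `flatten_emails` is hypothetical):

```python
import json


def flatten_emails(json_lines):
    """Turn JSON lines into (email_id, sender, receiver) rows."""
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        email = json.loads(line)
        for receiver in email["Receivers"]:
            rows.append((email["ID"], email["Sender"], receiver))
    return rows


sample = [
    '{"ID": 1, "Sender": 10, "Receivers": [11, 12]}',
    '{"ID": 2, "Sender": 11, "Receivers": [10]}',
]
rows = flatten_emails(sample)
```

Here two emails become three rows, one per sender and receiver pair.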
ETL: Merge All in one Table
Map email and employee details into a new DataFrame.
As a result, we have a new DataFrame of 372,011 rows that maps receivers and senders to the ID of the email sent.
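The merge can be pictured as a join between the email rows and the employee table on the employee ID. A minimal plain-Python sketch, assuming an inner join (rows with an unknown employee are dropped) and hypothetical column names; the slides do not show which join or columns were actually used:

```python
def merge_emails_with_employees(email_rows, employees):
    """Join (email_id, sender_id, receiver_id) rows with employee names.

    `employees` maps an employee ID to a name. Rows that reference an
    unknown employee are dropped, as an inner join would do.
    """
    merged = []
    for email_id, sender_id, receiver_id in email_rows:
        if sender_id not in employees or receiver_id not in employees:
            continue
        merged.append({
            "email_id": email_id,
            "sender_id": sender_id,
            "sender_name": employees[sender_id],
            "receiver_id": receiver_id,
            "receiver_name": employees[receiver_id],
        })
    return merged


# Hypothetical sample data, not taken from the real datasets.
employees = {1: "Alice", 2: "Bob", 3: "Carol"}
email_rows = [(100, 1, 2), (100, 1, 3), (101, 2, 9)]
merged = merge_emails_with_employees(email_rows, employees)
```

The third sample row is dropped because employee 9 does not exist in the employee table.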
Build the Graph
Once the data is prepared, we build the graph of employees connected by emails.
A graph takes vertices and edges as input. The vertices are the employees and the edges are the emails sent and received.
We create the graph with the Spark library GraphX.
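To make the vertex and edge inputs concrete, here is a small plain-Python sketch that derives both from merged rows and counts in-degree and out-degree per employee. It only illustrates the data shapes and is not GraphX code; the row keys and sample names are hypothetical:

```python
from collections import Counter


def build_graph(merged_rows):
    """Return (vertices, edges) from merged email rows.

    vertices: dict of employee ID -> name
    edges: list of (sender_id, receiver_id, email_id)
    """
    vertices = {}
    edges = []
    for row in merged_rows:
        vertices[row["sender_id"]] = row["sender_name"]
        vertices[row["receiver_id"]] = row["receiver_name"]
        edges.append((row["sender_id"], row["receiver_id"], row["email_id"]))
    return vertices, edges


def degrees(edges):
    """Count emails received (in-degree) and sent (out-degree)."""
    in_degree = Counter(dst for _src, dst, _email in edges)
    out_degree = Counter(src for src, _dst, _email in edges)
    return in_degree, out_degree


# Hypothetical sample rows.
merged_rows = [
    {"email_id": 1, "sender_id": 1, "sender_name": "Alice",
     "receiver_id": 2, "receiver_name": "Bob"},
    {"email_id": 1, "sender_id": 1, "sender_name": "Alice",
     "receiver_id": 3, "receiver_name": "Carol"},
    {"email_id": 2, "sender_id": 2, "sender_name": "Bob",
     "receiver_id": 3, "receiver_name": "Carol"},
]
vertices, edges = build_graph(merged_rows)
in_degree, out_degree = degrees(edges)
```

In this sample Carol has the highest in-degree (two emails received) and Alice the highest out-degree (two emails sent).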
Analyze: Degree of influence by Employee (1)
The PageRank algorithm provides a ranking value for each vertex in a graph. It assumes that a vertex is important when many edges point to it, and more so when those edges come from vertices that are themselves highly ranked.
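The idea can be illustrated with a textbook power-iteration PageRank in plain Python. This is a sketch of the general technique, not the GraphX implementation used in the deck: ranks here are normalized to sum to 1, and the damping factor, tolerance and sample graph are assumptions.

```python
def pagerank(vertices, edges, damping=0.85, tol=1e-9, max_iter=200):
    """Textbook PageRank by power iteration; ranks sum to 1."""
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in vertices}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph: a, b and d all email c; c emails a.
vertices = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
rank = pagerank(vertices, edges)
top = max(rank, key=rank.get)
```

In this sample `top` is `"c"`, the vertex that receives emails from everyone else, and `"a"` comes second because its only incoming edge is from the highly ranked `"c"`.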
Analyze: Degree of influence by Employee (2)
The employee « Lewis Mark » has the highest rank. He receives most of the emails exchanged. He very probably occupies a key role in the organization (leader, CEO?).
Analyze: Most Connected Employees (1)
This algorithm measures connectivity between employees: it shows which employees are strongly connected and which are not connected at all.
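The slide does not name the algorithm. One common way to measure this kind of connectivity is to compute connected components, so the sketch below assumes that technique and treats emails as undirected links; it is plain Python with a hypothetical sample graph, not the code from the deck.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex ID in its component.

    Edges are treated as undirected: an email in either direction
    connects two employees.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    component_of = {}
    for start in sorted(vertices):
        if start in component_of:
            continue
        # Breadth-first search from the smallest unvisited vertex.
        component_of[start] = start
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in component_of:
                    component_of[nxt] = start
                    queue.append(nxt)
    return component_of


# Hypothetical graph: employees 1-3 exchange emails, 4-5 do too, 6 is isolated.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(vertices, edges)
```

Employees sharing a label belong to the same group; an employee whose label is shared with nobody, like employee 6 here, is not connected to anyone.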
Analyze: Most Connected Employees (2)
Visualize with Neo4J
Neo4J is a graph database with its own web server to browse and visualize data. This makes it easier to understand the components of the graph and the connections inside it.
It is possible to run several algorithms, such as the well-known PageRank, on Neo4J graphs.
To create the graph in Neo4J, we extract the vertices and edges from Spark into CSV files, then run the neo4j-import script to load the database.
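The export step can be sketched in plain Python as follows. The header fields `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE` follow the CSV header convention of the Neo4J import tool; the `Employee` label, the `EMAILED` relationship type, the property names and the sample data are hypothetical, and the deck's actual export from Spark is not shown.

```python
import csv
import io


def write_nodes(handle, vertices):
    """Write employees as a node CSV (vertices: dict of ID -> name)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["employeeId:ID", "name", ":LABEL"])
    for vertex_id, name in sorted(vertices.items()):
        writer.writerow([vertex_id, name, "Employee"])


def write_relationships(handle, edges):
    """Write emails as a relationship CSV (edges: (src, dst, email_id))."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", "emailId", ":TYPE"])
    for src, dst, email_id in edges:
        writer.writerow([src, dst, email_id, "EMAILED"])


# In-memory buffers stand in for real files in this sketch.
nodes_csv = io.StringIO()
write_nodes(nodes_csv, {1: "Alice", 2: "Bob"})
rels_csv = io.StringIO()
write_relationships(rels_csv, [(1, 2, 100)])
```

With real files instead of buffers, the import is then typically run in the form `neo4j-import --into <database directory> --nodes nodes.csv --relationships rels.csv`; the exact options should be checked against the Neo4J version in use.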
Neo4J Graph
Conclusion
The Spark platform is powerful and provides strong tools for data analysis.
HR data are more relevant when analyzed together with the organization's other data sources.
Understanding how the organization is structured makes it possible to cut HR costs and act more effectively.
For feedback, contact me on LinkedIn:
fr.linkedin.com/in/WassTRIFI