1
Federal Big Data Working Group Meetup
Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
January 7, 2014
2
Mission Statement
• Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (see Possible Team Presentations below); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.
3
Co-organizers
• Brand Niemann and Kate Goodier
• Kate Goodier, Host: Excelerate Solutions offices in Tysons Corner:
– Capacity about 50 with Skype and wifi available. The Silver Line Spring Hill Metro Stop (planned to open in March) is across the street (Route 7 and Spring Hill Road).
• Directions to the building are easy and they have open underground parking:
– See photo below from Excelerate Solutions Office looking south to the Spring Hill Road Silver Line Metro Station (planned to open in March 2014).
• Logistics:
– Refreshments, restrooms, etc.
4
Suggested Format
• 6:30 p.m. Tutorials (I will start with the Proposed GMU Course, and hope that others will offer to do tutorials as well) and Refreshments
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
– Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group
• 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results)
– Start with our Semantic Big Data Science Application: Semantic Medline on the YarcData Graph Appliance for the Federal Big Data Senior Steering Work Group, which our Semantic Data Science Team recently presented to Lee Watkins Jr., Director of Bioinformatics at the Institute of Genetic Medicine Center for Inherited Disease Research (CIDR).
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (we need to clear out of the space)
5
Next Meetups
• Second Meetup: Tuesday, February 4, 6:30 p.m.
– Continue Data Science Tutorial: Practical Data Science for Data Scientists
– What Went Wrong with the Obamacare Web Site, and How Can It Be Fixed? and Why the First Rollout of HealthCare.gov Crashed, an Architectural Assessment, Eric Kavanagh, Inside Analysis, and Geoffrey Malafsky, PSIKORS Institute; Healthcare.gov Data Science, Brand Niemann, Semantic Community; and Healthcare.gov Prototype Video, Kees van Mansom, Be Informed
• Third Meetup: Tuesday, February 18, 6:30 p.m.
– Continue Data Science Tutorial: Modus Operandi Semantic Knowledge Base
– Wave All-Source Semantic Fusion Engine: Eric Little, Modus Operandi; and Department of Defense Metadata Engineers.
• Fourth Meetup: March 4, 6:30 p.m.
– Continue Data Science Tutorial: Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases
– Bigdata SYSTAP, Michael Personick and Bryan Thompson, SYSTAP
• April Workshop: Date and Location TBA
– 2nd Cloud: SOA, Semantics, Data Science, and Business Concept Computing (16th SOA for eGov Conference).
6
Practical Data Science for Data Scientists
http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists
7
Resources
• Required Textbook
– Doing Data Science:
• http://shop.oreilly.com/product/0636920028529.do
• Free Sampler:
– http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF)
• Optional Supplemental Reading:
– Data Science Starter Kit:
• http://shop.oreilly.com/category/get/data-science-kit.do
– DC Data Community:
• http://datacommunitydc.org/blog/about/
• DC Data Community Calendar:
– http://datacommunitydc.org/blog/calendar/
• Technology Requirements
– Internet and Free Tools like Spotfire Cloud:
• https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest
– NodeXL:
• http://nodexl.codeplex.com/
8
Class 1
• 1/21 What is Data Science and the Data Science Process?
– Discuss Reading: Chapters 1 and 2
• My Resources:
– http://semanticommunity.info/Data_Science
– http://semanticommunity.info/Analytics/Predictive_Analytic_World_Government_2013#Story
• Hands-on Class Exercise: Individual and Team Profiles and Case Study: RealDirect
9
Tutorial
• Overview: Data Science and the Data Science Process
• My Profile: Breaking Government/AOL Government Data Stories and Products
– Select some interesting content and make it structured
– Select a related data set/table
– Explore both and write a story about it:
• Where did you get the data?
• Where did you store the data?
• What were your results?
• What were the steps?
• Assignment: Do something like My Profile
10
Overview: Data Science
http://semanticommunity.info/Data_Science
Key Concepts Extracted
What is Data Science? The future belongs to the companies and people that turn data into products
See Sidebar Topics
11
Overview: Data Science Process
http://semanticommunity.info/Analytics/Predictive_Analytic_World_Government_2013#Story
So my three overlapping circles are: "Find and Prepare Data Sets", "Store and Query Data Sets", and "Discover Data Stories in the Data Sets".
See mapping between the three Venn Diagrams in the table below.
12
Select some interesting content
http://breakinggov.com/2012/03/30/defense-department-bets-big-on-big-data/
13
Make it structured
http://semanticommunity.info/@api/deki/files/27612/SpotfireCloud.xlsx
14
Select a related data set/table
http://semanticommunity.info/@api/deki/files/27612/SpotfireCloud.xlsx
My Note: This is
Categorized (Faceted Search)
Correlation (Two Numeric Variables)
Relational (Columns and Rows)
Linked (URLs)
Semantic Web (Subject, Predicate, and Object)
Graph/Network Analytics (Edge and Node Tables)
Geospatial (Could add Latitude and Longitude)
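To make the note above concrete, here is a minimal sketch of how one small table can be read in several of those ways at once: as relational rows, as facet counts, as Semantic Web triples, and as node and edge tables. The rows, column names, and example.org URLs are hypothetical placeholders, not the contents of SpotfireCloud.xlsx, and the helper functions are illustrative names only.

```python
# Hypothetical rows: column names and values are placeholders,
# NOT taken from SpotfireCloud.xlsx.
rows = [
    {"title": "Story A", "category": "Big Data", "url": "http://example.org/a"},
    {"title": "Story B", "category": "Big Data", "url": "http://example.org/b"},
    {"title": "Story C", "category": "Cloud", "url": "http://example.org/c"},
]


def facet_counts(rows, column):
    """Categorized view: count how many rows fall under each value."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


def to_triples(rows, key="title"):
    """Semantic Web view: one (subject, predicate, object) per non-key cell."""
    return [
        (row[key], column, value)
        for row in rows
        for column, value in row.items()
        if column != key
    ]


def to_graph(triples):
    """Graph view: a node table (unique names) and an edge table."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    edges = [{"source": s, "label": p, "target": o} for s, p, o in triples]
    return nodes, edges


facets = facet_counts(rows, "category")
triples = to_triples(rows)
nodes, edges = to_graph(triples)

print(facets)
print(triples[0])
print(len(nodes), "nodes,", len(edges), "edges")
```

The node and edge tables have the general shape that a network tool such as NodeXL works with; the exact column names a given tool expects are not covered here.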
15
AOL Gov to BreakingGov Migration
Web Player
Note: The lack of correlation between Excel size and Spotfire size is due to the presence of large boundary (Shape) files.
16
Spotfire Silver to Spotfire Cloud Migration
Web Player
17
Explore both and write a story about it
• Where did you get the data?
– The Web and spreadsheets
• Where did you store the data?
– Spreadsheets
• What were your results?
– All files were accounted for in the two migrations (data quality), versatile formats were created, and visualizations help me and others build on this data science work
• Steps:
– Search MindTouch for Spotfire File Name: Like GDELT-Spotfire
– Find Where It Was Used at One Or More Locations
– Change Web Player Links in Spotfire Dashboard, Story, and Slides
– Test to See If Embedded File Works
– Repeat the Process 283 Times!
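The steps above were done by hand, but the same loop can be sketched in code. This is only an illustration under stated assumptions: the pages are modeled as a plain dictionary of text rather than real MindTouch content, the two base URLs are hypothetical example.com placeholders, and the "test" step only checks that no old link remains (a real check would load the embedded file in the Web Player).

```python
# Hypothetical base URLs; the real Web Player addresses are not given here.
OLD_BASE = "https://old-player.example.com/"
NEW_BASE = "https://new-player.example.com/"


def find_uses(pages, file_name):
    """Steps 1-2: return the names of pages whose text mentions file_name."""
    return sorted(name for name, text in pages.items() if file_name in text)


def migrate_links(text):
    """Step 3: point every old Web Player link at the new base URL."""
    return text.replace(OLD_BASE, NEW_BASE)


def migrate_all(pages, file_names):
    """Step 5: repeat for every file and report files found on no page."""
    updated = dict(pages)
    unaccounted = []
    for file_name in file_names:
        uses = find_uses(pages, file_name)
        if not uses:
            unaccounted.append(file_name)
        for name in uses:
            updated[name] = migrate_links(updated[name])
            # Step 4 (simplified): confirm no old link survived the rewrite.
            assert OLD_BASE not in updated[name]
    return updated, unaccounted


# Placeholder pages; only the file name GDELT-Spotfire comes from the slide.
pages = {
    "Dashboard": "Embed: https://old-player.example.com/GDELT-Spotfire",
    "Story": "See https://old-player.example.com/GDELT-Spotfire for details",
    "Slides": "No embedded files here",
}

updated, unaccounted = migrate_all(pages, ["GDELT-Spotfire", "Missing-File"])
print(updated["Dashboard"])
print("Unaccounted files:", unaccounted)
```

An empty `unaccounted` list corresponds to the result reported above: all files were accounted for in the migration.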
18
Preview of What You Are Going To Hear
• The Best Way to Get BIG DATA is By Starting Small:
– BIG DATA
– Subcommittee on Networking and Information Technology Research and Development (NITRD Subcommittee)
• These three activities fostered Semantic Medline on the YarcData Graph Appliance for the White House Big Data Initiative.
– Data Science Team Example
– Generic Problems
– Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG
– Modus Operandi: Mantra, Performance, and Vision
– Knowledge Base: Modus Operandi Web Intelligence in MindTouch
– Big Data in Memory: Innovation Story