Date post: | 08-Sep-2014 |
Category: | Technology |
Upload: | fujio-turner |
HPCC Systems Load, Index & Query
Big Data the EZ way
By Fujio Turner
@myhousehippo
Comparison
JAVA / C++
Petabytes
1-80,000 Jobs/day
Since 2005
Exabytes
Non-Indexed 4X-13X
Since 2000
Indexed: 2K-3K Jobs/sec
Business, Development, Customers
1 20
Non-Indexed Full Data Set
http://hpccsystems.com/why-hpcc/benchmarks
Map/Reduce
SQL w/ JOINS
GraphDB
Machine Learning
Simple to Complex Queries
Plugins, BI, Transport
Security, Query
Encrypted on disk
“I’m sub-second fast.”
“I can query all or part of your data.”
Thor / Roxie
Hard Disk
Index (optional)
Hard Disk
Index (optional), In-memory Index
SSD
Either/Both
Architecture
Data, Query, File
Example 2
Example 1
HPCC Systems Sample Data for Examples 1 & 2
Sample Data
http://hpccsystems.com/download/docs/learning-ecl
More Examples
Typical
1. Schema
CREATE TABLE layout_person (
  PersonID INT(15) NOT NULL,
  FirstName VARCHAR(15) NOT NULL,
  LastName VARCHAR(25) NOT NULL,
  PRIMARY KEY (PersonID)
);
2. Load
INSERT INTO `layout_person` (`PersonID`, `FirstName`, `LastName`) VALUES (1, 'Joe', 'Smith');
3. Query
SELECT * FROM `layout_person`;
1. Load
2. Query w/ Applied Schema on Read
Layout_Person := RECORD
  UNSIGNED1 PersonID;
  STRING15 FirstName;
  STRING25 LastName;
END;
allPeople := DATASET('~file',Layout_Person,THOR);
allPeople;
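Schema on read can be sketched outside ECL as well. The following is a minimal Python analogy, not HPCC code: it assumes the file is a run of fixed-width rows matching Layout_Person (1 + 15 + 25 bytes), and the helper names `pack_person` and `read_people` are hypothetical. The raw bytes carry no schema; the layout is applied only when the rows are read.

```python
import struct

# Mirrors Layout_Person: UNSIGNED1 PersonID; STRING15 FirstName; STRING25 LastName
PERSON_LAYOUT = struct.Struct("<B15s25s")  # 1 + 15 + 25 = 41 bytes per row


def pack_person(person_id, first, last):
    """Write one fixed-width row; strings are space-padded to their field width."""
    return PERSON_LAYOUT.pack(
        person_id,
        first.ljust(15).encode("ascii"),
        last.ljust(25).encode("ascii"),
    )


def read_people(raw):
    """Apply the layout while reading; the bytes themselves have no schema."""
    for person_id, first, last in PERSON_LAYOUT.iter_unpack(raw):
        yield {
            "PersonID": person_id,
            "FirstName": first.decode("ascii").rstrip(),
            "LastName": last.decode("ascii").rstrip(),
        }


# "Load" is just storing bytes; "Query" applies the layout on read.
raw_file = pack_person(1, "Joe", "Smith") + pack_person(2, "Ann", "Jones")
all_people = list(read_people(raw_file))
```

Changing the layout here changes how the same bytes are interpreted, with no reload step, which is the contrast with the CREATE TABLE / INSERT flow.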
Structured or Semi-structured or Unstructured
All data has:
1. Origin
2. DateTime
3. Info
Administrator Web GUI on Port 8010
IP / URL of HPCC install
1. Upload file*
2. Distribute to cluster
3. Name of file in cluster
4. Size of each row
5. Push to cluster
* 2 GB file size limit through the web. No limit if uploaded via SOAP.
Load Data
In Thor Cluster
Loaded
Query Example 1
Schema
Layout_People := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5 Zip;
  STRING42 Street;
  STRING20 City;
  STRING2 State;
END;
Data (File Location, Schema, File Type; like “USE DATABASE;” and “FROM Table”)
allPeople := DATASET('~test::originalperson',Layout_People,THOR);
Query (like WHERE `LastName` = 'Smith')
Smiths := allPeople(LastName = 'Smith');
Output (like “SELECT * ….”)
Smiths;
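The filter in Example 1 reads every record and keeps the ones that match. As a rough Python analogy (the sample rows and the `where` helper are made up for illustration and are not part of HPCC), the same non-indexed scan looks like this:

```python
# Hypothetical sample rows standing in for the records of allPeople.
all_people = [
    {"FirstName": "Joe", "LastName": "Smith", "State": "NJ"},
    {"FirstName": "Ann", "LastName": "Jones", "State": "CA"},
    {"FirstName": "Sue", "LastName": "Smith", "State": "FL"},
]


def where(rows, **conditions):
    """Full scan: test every row against every field=value condition."""
    return [
        row for row in rows
        if all(row[field] == value for field, value in conditions.items())
    ]


# Comparable to: Smiths := allPeople(LastName = 'Smith');
smiths = where(all_people, LastName="Smith")
```

Every row is visited no matter how few match, which is the cost that indexing is meant to avoid.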
1. Go to playground
2. Edit ECL
3. Pick “thor” Cluster
4. Submit
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_1/
Practice
Full Table or Data Scan
Why Index?
++and
from date to date
Indexing Example 2
Make Index
Ex. Creating an index by “STATE”
“Alter Table” (new column): RecPtr is the File Position Number, a pseudo recordID
allPeople := DATASET('~test::originalperson',
                     {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}},
                     THOR);
Index Filename: '~test::key_person'
datax := INDEX(allPeople,{State,RecPtr},'~test::key_person');
BUILDINDEX(datax);
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_2a_-_Create_Index
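The idea of indexing on State plus a file position can be sketched in Python. This is an analogy under stated assumptions, not how HPCC stores its keys: the two-field row layout, the sample rows, and `build_index` are all hypothetical, and the byte offset of each row plays the role of RecPtr.

```python
import struct

# Hypothetical cut-down layout: STRING25 LastName; STRING2 State
ROW = struct.Struct("<25s2s")  # 27 bytes per row


def build_index(raw):
    """Return sorted (State, RecPtr) pairs; RecPtr is the row's byte offset."""
    entries = []
    for offset in range(0, len(raw), ROW.size):
        _last, state = ROW.unpack_from(raw, offset)
        entries.append((state.decode("ascii"), offset))
    return sorted(entries)


rows = [("Smith", "NJ"), ("Jones", "CA"), ("Brown", "NJ")]
raw_file = b"".join(
    ROW.pack(last.ljust(25).encode("ascii"), state.encode("ascii"))
    for last, state in rows
)

# The index holds only the key field and the pointer, not the full record.
key_person = build_index(raw_file)
```

Because the entries are sorted by State, all records for one state sit next to each other in the index and can be located without reading the data file.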
Query w/ Index
Data
allPeople := DATASET('~test::originalperson',
                     {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}},
                     THOR);
datax := INDEX(allPeople,{State,RecPtr},'~test::key_person');
Query (like WHERE `State` = 'NJ', from Index)
filterdata := FETCH(allPeople,datax(State='NJ'),RIGHT.RecPtr);
Output
filterdata;
http://www.meetup.com/HPCC-SV/pages/ECL_EXAMPLE_2b_-_Query_with_Index
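Fetching through an index means looking up the key first and then reading only the rows the pointers name. A minimal Python sketch of that pattern follows; it is an analogy, not HPCC's implementation, and the row layout, sample data, and `fetch_by_state` are assumptions made for illustration.

```python
import bisect
import struct

# Hypothetical cut-down layout: STRING25 LastName; STRING2 State
ROW = struct.Struct("<25s2s")

rows = [("Smith", "NJ"), ("Jones", "CA"), ("Brown", "NJ")]
raw_file = b"".join(
    ROW.pack(last.ljust(25).encode("ascii"), state.encode("ascii"))
    for last, state in rows
)
# Sorted (State, RecPtr) pairs; RecPtr is the byte offset of the row.
key_person = sorted(
    (state, i * ROW.size) for i, (_last, state) in enumerate(rows)
)


def fetch_by_state(raw, index, state):
    """Binary-search the index for one state, then read only those rows."""
    lo = bisect.bisect_left(index, (state, 0))
    hi = bisect.bisect_right(index, (state, float("inf")))
    result = []
    for _state, rec_ptr in index[lo:hi]:
        last, st = ROW.unpack_from(raw, rec_ptr)
        result.append({
            "LastName": last.decode("ascii").rstrip(),
            "State": st.decode("ascii"),
        })
    return result


# Comparable to: FETCH(allPeople, datax(State='NJ'), RIGHT.RecPtr)
filterdata = fetch_by_state(raw_file, key_person, "NJ")
```

Only the two NJ rows are unpacked; the CA row is never touched, in contrast to a full scan.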
SuperFile: organizing your files
Real Files: 2013-06-06 Twitter, 2013-06-07 Twitter, 2013-06-08 Twitter
Logical File: 2013-06 Twitter (2013-06-06 … -07 … -08)
+ Append new real files
Creating a SuperFile
1. Create New or Update Existing Super File
2. Super File Name
2b. Add new file to existing superfile
3. Create Superfile
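A superfile can be thought of as one logical name over a list of real files, read as if they were a single file. The Python sketch below illustrates only that idea; the `SuperFile` class, the file names, and the sample rows are hypothetical and do not reflect the HPCC API.

```python
from itertools import chain

# Hypothetical real files: one per day, each a list of rows.
real_files = {
    "twitter::2013-06-06": ["tweet a", "tweet b"],
    "twitter::2013-06-07": ["tweet c"],
    "twitter::2013-06-08": ["tweet d", "tweet e"],
}


class SuperFile:
    """A logical file: an ordered list of real file names, no data of its own."""

    def __init__(self, name):
        self.name = name
        self.subfiles = []

    def add(self, logical_name):
        """Append a new real file; existing subfiles are not rewritten."""
        self.subfiles.append(logical_name)

    def rows(self, store):
        """Read all subfiles in order, as if they were one file."""
        return list(chain.from_iterable(store[n] for n in self.subfiles))


june = SuperFile("twitter::2013-06")
june.add("twitter::2013-06-06")
june.add("twitter::2013-06-07")
before = june.rows(real_files)

# A new day arrives: append it, and queries on the superfile see it.
june.add("twitter::2013-06-08")
after = june.rows(real_files)
```

Queries keep addressing the one logical name while new daily files are appended behind it.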
2013-06-06 Twitter
2013-06-07 Twitter
2013-06-08 Twitter
2013-06 Twitter
2013 Twitter
SuperKeys: organizing your indexes
2013-06-06 Twitter
2013-06-07 Twitter
2013-06-08 Twitter
2013 Twitter
SuperKeys
No Sub-Super Files or Keys in Roxie
When and where NOT to Index
Filtered Data
80-100% Queries @ Roxie
Index Here
Do Not Index Here
100% of Data Enters Here
100% of Data Enters Here
• Query 100% of all data
• Lots of Regular Expressions
• Few or No DateTime Data
Do Not Index Here