Querying your Datagrid with Lucene, Hadoop and Spark

QUERYING YOUR DATAGRID WITH LUCENE,HADOOP AND SPARK

Gustavo Fernandes / [email protected]

ABOUT THE PRESENTERDeveloper @Red HatContributing on Infinispan & Hibernate Searchgithub.com/gustavonalle@gustavonalleBrazilian, Italian Citizen, London based

HASHMAP

HASHMAPAverage Worst Case

Space O(n) O(n)

Get by key O(1) O(n)

Insert O(1) O(n)

Delete O(1) O(n)

USE CASESSecuritySessionFinancialLow latency...

PLUSESPersistenceExpirationSpaceQuery on values

INFINISPANIn Memory K/V StoreASL v2ScalableJTA TransactionsPersistence (File/JDBC/LevelDB/...)Local/ClusteredEmbedded/Server...

GETTING STARTED<dependency> <groupId>org.infinispan</groupId> <artifactId>infinispanembedded</artifactId> <version>7.0.3.Final</version></dependency>

libraryDependencies += "org.infinispan" % "infinispanembedded" % "7.0.3.Final"

EmbeddedCacheManager cacheManager = new DefaultCacheManager();Cache<String,String> cache = cacheManager.getCache();

cache.put(1, "data goes here");

ADD PERSISTENCE (XML)<infinispan> <cachecontainer> <localcache name="testCache"> <persistence> <leveldbstore path="/tmp/folder"/> </persistence> </localcache> </cachecontainer></infinispan>

DefaultCacheManager cm = new DefaultCacheManager("infinispan.xml");Cache<Integer, String> cache = cacheManager.getCache("testCache");

ADD PERSISTENCE (PROGRAMMATIC)Configuration configuration = new ConfigurationBuilder() .persistence() .addStore(LevelDBStoreConfigurationBuilder.class) .build();

DefaultCacheManager cm = new DefaultCacheManager(configuration);Cache<Integer, String> cache = cm.getCache();

CLUSTERING - REPLICATEDGlobalConfiguration globalCfg = new GlobalConfigurationBuilder() .transport().defaultTransport() .build(); Configuration cfg = new ConfigurationBuilder() .clustering().cacheMode(CacheMode.REPL_SYNC) .build(); EmbeddedCacheManager cm = new DefaultCacheManager(globalCfg, cfg);Cache<Integer, String> cache = cm.getCache();

CLUSTERING - DISTRIBUTEDGlobalConfiguration globalCfg = new GlobalConfigurationBuilder() .transport().defaultTransport() .build(); Configuration configuration = new ConfigurationBuilder() .clustering().cacheMode(CacheMode.DIST_SYNC) .hash().numOwners(2).numSegments(100) .build();

EmbeddedCacheManager cm = new DefaultCacheManager(globalConfiguration, configuration);Cache<Integer, String> cache = cm.getCache();

QUERYINGApache Lucene IndexNative Map ReduceIndex-less

INDEXING - CONFIGURATIONConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.ALL) .build();

EmbeddedCacheManager cm = new DefaultCacheManager(configuration);Cache<Integer, DaySummary> cache = cm.getCache();

QUERY - SYNC/ASYNCConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.worker.execution", "async") .build();


QUERY - RAM STORAGEConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.worker.execution", "async") .addProperty("default.directory_provider", "ram") .build();


QUERY - INFINISPAN STORAGEConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.worker.execution", "async") .addProperty("default.directory_provider", "infinispan") .build();


QUERY - FILESYSTEM STORAGEConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.directory_provider", "filesystem") .addProperty("default.indexBase", "/path/to/index);.build();

QUERY - NEAR-REAL-TIME INDEXMANAGERConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.worker.execution", "async") .addProperty("default.directory_provider", "infinispan") .addProperty("default.indexmanager","nearrealtime").build();

QUERY - INFINISPAN INDEX MANAGERConfiguration configuration = new ConfigurationBuilder() .indexing().index(Index.LOCAL) .addProperty("default.worker.execution", "async") .addProperty("default.directory_provider", "infinispan") .addProperty("default.indexmanager", "org.infinispan.query.indexmanager.InfinispanIndexManager") .build();


STORING ENTITIESpublic class DaySummary private Station station; private Integer year; private Integer month; private Integer day; Float avgTemp;public class Station private Integer wban; private String name; private Country country; private Float latitude; private Float longitude;

public class Country private String name; private String code;

INDEXING CONFIGURATION@Indexedpublic class DaySummary

@IndexedEmbedded private Station station;

@Field(store = Store.YES, analyze = Analyze.NO) private Integer year;

@Field(store = Store.YES, analyze = Analyze.NO) private Integer month;

private Integer day;

@Field(store = Store.YES) Float avgTemp;

Document Field("station.name") Field("station.usaf") NumericField("year", Stored.YES, Indexed.NO) NumericField("month", Stored.YES, Indexed.NO)

INDEXING CONFIGURATION@Indexed@AnalyzerDef(name = "lowercaseKeyword", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class) )public class Country

@Analyzer(definition = "lowercaseKeyword") @Field(store = Store.YES) private String name;

@Field(store = Store.YES, analyze = Analyze.NO) private String code;

USING LUCENE QUERY PARSERQueryParser qp = new QueryParser("default", new StandardAnalyzer()); Query luceneQ = qp .parse("+station.name:airport +year:2014 +month:12 +(avgTemp < 0)");

CacheQuery cq = Search.getSearchManager(cache) .getQuery(luceneQ, DaySummary.class); List<Object> results = query.list();

COUNT ENTITIESimport org.apache.lucene.search.MatchAllDocsQuery;

MatchAllDocsQuery allDocsQuery = new MatchAllDocsQuery(); CacheQuery query = Search.getSearchManager(cache) .getQuery(allDocsQuery, DaySummary.class); int count = query.getResultSize();

USING LUCENE INDEXREADER DIRECTLYSearchIntegrator searchFactory = Search.getSearchManager(cache) .getSearchFactory(); IndexReader indexReader = searchFactory .getIndexReaderAccessor().open(DaySummary.class); IndexSearcher searcher = new IndexSearcher(indexReader);

DEMOIndexed

NOAA.gov data from 1901 to 2014~10M summariesYearly country max recorded temperature by month

Cache<Integer, DaySummary>

http://localhost:8080/

HADOOP INTEGRATION (WIP)

[Input|Output]FormatRun Hadoop v1 and v2 jobsServer ModeRun existing map/reduce jobs on Infinispan data

ISPN-5191

https://issues.jboss.org/browse/ISPN-5191

JOB CONFIGURATIONconfiguration.set("mapreduce.ispn.remote.cache.host", "10.0.0.2")configuration.set("mapreduce.ispn.input.cache.name", "incache");configuration.set("mapreduce.ispn.output.cache.name", "outcache");

JobConf jobConf = new JobConf(configuration, Main.class);jobConf.setJobName("wordcount");

jobConf.setOutputKeyClass(Text.class);jobConf.setOutputValueClass(IntWritable.class);jobConf.setMapperClass(MapClass.class);jobConf.setReducerClass(ReduceClass.class);

jobConf.setInputFormat(InfinispanInputFormat.class);jobConf.setOutputFormat(InfinispanOutputFormat.class); JobClient.runJob(jobConf);

DEMODocker: Infinispan Server cluster + Spark ClusterJava HotRod client inserting dataHadoop RDD with Infinispan Input/Output FormatWordcount mapreduce

FUTURE WORKPerformanceInput Split ConfigurationSpark connector

THANKSPeter Cook

http://animateddata.co.ukgithub.com/gustavonalle/query-infinispan-fosdem-slides

github.com/gustavonalle/weather-demogithub.com/gustavonalle/hadoop-demo

http://github.com/gustavonalle/hadoop-demo/

http://github.com/gustavonalle/query-infinispan-fosdem-slides

http://animateddata.co.uk/

http://github.com/gustavonalle/weather-demo/

infinispan.orgIRC #infinispan freenode

github.com/infinispan

THANKS/QA

Date post:	15-Jul-2015
Category:	Technology
Upload:	gustavo-fernandes
View:	575 times
Download:	3 times

Querying your Datagrid with Lucene, Hadoop and Spark

Technology