Hive on Spark and MapReduce: a methodology for parameter tuning

Forster, Rodrigo Richard

http://hdl.handle.net/10362/52854

Use this identifier to cite or link to this record.

Authors

Forster, Rodrigo Richard

Advisor(s)

Santos, Vítor Manuel Pereira Duarte dos

Abstract(s)

As the era of “big data” has arrived, more and more companies are using distributed file systems, such as the Hadoop Distributed File System (HDFS), to manage and process their data streams. The Hadoop software library offers a way to store large files across multiple machines, and large data sets are processed with its native programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost of up to 10 times for certain applications while maintaining automatic fault tolerance. To leverage the data warehouse capabilities of Hadoop, Apache Hive was introduced. It is a concept for big data analytics that works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, users find it difficult to exploit the full potential of the Apache Spark execution engine, which results in very long execution times. Therefore, this project work gives researchers and companies a tuning methodology that can significantly improve the execution time of queries. Applied to a real-world batch-processing query, the methodology sped up execution by a factor of 5. Moreover, the work uses Apache Spark monitoring tools to give insights into the underlying reasons for this large improvement. The results can be helpful for practitioners and researchers who would like to optimize the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.

Description

Project work presented as a partial requirement for obtaining a Master's degree in Information Management, with a specialization in Information Systems and Technologies Management

Keywords

Tuning; Hive on Spark; MapReduce; Apache Spark; Big Data; HDFS; Hadoop; Data Warehouse

Collections

NIMS - Master's Dissertations in Information Management (Dissertações de Mestrado em Gestão da Informação)

CC License

CC BY (Attribution)