ApacheCon North America 2014 (April 7-9, Denver, CO) has ended.

Big Data - Data Ingest & SQL
Tuesday, April 8


Real Time Data Ingest into Hadoop using Flume
Apache Flume is a real-time distributed data ingest system designed specifically for the Hadoop ecosystem. Flume is a highly scalable distributed system that guarantees delivery from a large number of data sources to an eventual destination such as HDFS or HBase. Flume has been deployed at extremely large scale at several companies around the world, transferring several hundred terabytes every weekend.

In this presentation, we will go through the fundamental components that make up Flume and how to configure and deploy Flume to your cluster to scale based on the number of sources and amount of data. As a committer and an engineer supporting Flume in production, I will present standard deployment topologies and how to design a deployment topology.


Hari Shreedharan

Software Engineer, Cloudera
Hari Shreedharan is a PMC member on Apache Flume and a committer on Apache Sqoop. He is a Software Engineer at Cloudera. He regularly presents at conferences and meetups related to Hadoop and Big Data.

Tuesday April 8, 2014 10:30am - 11:20am
Confluence C


Introducing Hive New Command Line Tool: Beeline
As Hive development has shifted from the original Hive server (HiveServer1) to the new server (HiveServer2), users and developers also need to switch client tools in order to work with HiveServer2. Unfortunately, the migration isn't just a matter of switching the executable name from “hive” to “beeline”. The purpose of this presentation, therefore, is to help make the migration as smooth as possible, with an emphasis on the usage differences and equivalences.


Xuefu Zhang

Software Engineer, Cloudera
Xuefu Zhang has over 10 years’ experience in software development. At Cloudera since May 2013, he devotes much of his effort to Apache Hive and Pig. He also worked on the Hadoop team at Yahoo when the majority of Hadoop development was still taking place there. Xuefu Zhang is...

Tuesday April 8, 2014 11:30am - 12:20pm
Confluence C


Data cubes in Apache Hive
This talk is about a system developed at InMobi to support data cubes on top of the Hive metastore and the Hive Query Language. The Hive metastore in its current state allows users to represent structured data in simple tables. However, it does not allow expressing relationships or richer data warehouse (DWH) concepts such as facts and dimensions. With Hive data cubes, users can query data stored in HDFS, S3, Redshift, and other stores with a single query language and schema. Underlying execution engines such as Hive, Impala, and Shark can be plugged in and used at run time, and the engine in use is transparent to the user. The system provides a unified logical schema consisting of cubes, facts, and dimensions. Users can issue queries at a conceptual level without knowing about roll-up intervals, partitions, data types, underlying storage, or table relationships; the system resolves these automatically.


Jaideep Dhok

Software Engineer, InMobi
Jaideep Dhok currently works as a Software Engineer on the Platform team at InMobi, building systems to support analytics, where he works on Apache Hive. Before joining InMobi he worked as a contractor for Credit Suisse in Singapore, where he worked on the APAC regulatory...

Amareshwari Sriramadasu

Architect, InMobi
Amareshwari currently works as an Architect on the data team at InMobi, where she works on Hadoop and related projects for data collection and analytics. She is a member of the ASF, the Apache Incubator PMC, Apache Hadoop PMC, Apache Lens PMC, and Apache Falcon PMC, and is an Apache Hive committer...

Tuesday April 8, 2014 1:30pm - 2:20pm
Confluence C


Building Highly Flexible, High Performance Query Engines – Highlights from the Apache Drill Project
Apache Drill started off with the audacious goal of delivering consistent, millisecond ANSI SQL query capability across a wide range of data formats. At a high level, this translates to two key requirements: schema flexibility and performance. This session will delve into the architectural details of delivering on these two requirements and will share the nuances and pitfalls we ran into while developing Apache Drill.
It will cover how Apache Drill supports schema-less data querying and how application-driven schemas such as JSON structures can be tackled. I will also cover how data sources in Drill present query rewrites to the query optimizer, allowing complex query push-down into the data source, and how Drill generates very efficient low-level code that adapts to the kind and shape of the data being processed at that moment.


Neeraja Rentachintala

Director of Product Management, MapR Technologies
As Sr Director of Product Management, Neeraja is responsible for the product strategy, roadmap, and requirements of MapR SQL initiatives. Prior to MapR, Neeraja held numerous product management and engineering roles at Informatica, Microsoft SQL Server, Oracle and Expedia.com, most...

Tuesday April 8, 2014 2:30pm - 3:20pm
Confluence C


Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop
Data transfer is one of the most pressing problems facing companies in the telecoms industry today. As data requirements grow by the month, so too do the costs.

Up to a point, Hive-on-Hadoop has been sufficient for the company. But huge increases in requirements brought about by more sophisticated smartphones have started to create serious business bottlenecks, as near real-time analytics reports are needed on a warehouse that holds petabytes of data and is growing by the day. Clearly, a more fundamental shift was required.

In this presentation, Keuntae Park will detail how this problem was dealt with; how support for SQL standards was an essential part of the required solution; and how Tajo, an open-source low-latency query engine, might just point to the future of high-speed large enterprise data processing.


Keuntae Park

IT Manager, SK Telecom
Keuntae Park is an IT manager at SK Telecom (SKT), South Korea’s largest wireless communications provider. He is also a committer on the Apache Tajo project and is applying Tajo on the company's big data analytics cluster. Big data analysis at SKT mainly focuses on the customer retention...

Tuesday April 8, 2014 3:45pm - 4:35pm
Confluence C


Interoperability in the Apache Hive Ecosystem
Apache Hadoop has grown over time to spawn many other Apache projects, each of which enables crunching big data in one way or another. Due to the need for some of those projects to talk to each other, a smaller ecosystem has developed among some of these projects, notably Hive, HCatalog, Pig and HBase.

In this presentation, we will begin with a baseline overview of Apache Hadoop and MapReduce. We will outline the other related Apache projects (Hive, Pig, and HBase) and their niches, and highlight their use cases and best practices. Then, we will tie them all together via HCatalog APIs and common metadata, and look at some patterns for usage introspection and optimization.

We will approach the above from a historical perspective on the evolution of the tools in this ecosystem, and also provide a sneak peek into recent developments and the future of Hadoop and these projects.


Mithun Radhakrishnan

Programmer, Yahoo
Erstwhile firmware developer. Apache HCatalog committer. Author of DistCp for Hadoop-2. Has moderate to severe C++ withdrawal symptoms. Currently works on Hive and its ecosystem over at Yahoo!

Sushanth Sowmyan

Sushanth Sowmyan is an Apache HCatalog committer and a long-time Apache Hive contributor who spends most of his time oscillating between worrying about backward compatibility and being worked up about doing it "the right way". He currently works at Hortonworks in their data query...

Tuesday April 8, 2014 4:45pm - 5:35pm
Confluence C
Thursday, April 10


Meetup: Write powerful Big Data Applications easily with Spring XD
Spring XD aims to provide a one-stop shop for writing and deploying Big Data applications. It provides a scalable, fault-tolerant, distributed runtime for data ingestion, analytics, and workflow orchestration using a single programming, configuration, and extensibility model. By not requiring developers to rationalize all of this themselves across the many different solutions available today, Spring XD greatly reduces the inherent complexity of Big Data development. It's all built on proven projects such as Spring Integration and Spring Batch. You'll see for yourself how this heritage combines to provide a scalable runtime environment that is easily configured and assembled via a simple DSL.


Derek Beauregard

Sr. Field Engineer, Pivotal
Derek Beauregard is a technologist who has worked in the software/IT industry for the past 10+ years in roles across the spectrum: Field Sales (Sales Engineer), Consulting, and Engineering. He has worked extensively with Java and Spring, across multiple industries, and has...

Thursday April 10, 2014 1:30pm - 2:30pm


Meetup: Painless build and deploy for YARN applications with Spring
Spring's goal, like that of any good framework, has always been to handle the infrastructure so you can focus on your application code. Join this session to see how Spring provides a simple programming model for developing applications that can easily be tested and deployed as either a YARN application or a traditional application. No longer will you need to struggle with third-party library build and packaging issues, XML, and how the YARN Appmasters, Clients, and Resource Managers all work together. The magic of Spring Boot, Spring XD, and Spring for Apache Hadoop just makes it all work so you can get coding!


Derek Beauregard

Sr. Field Engineer, Pivotal
Derek Beauregard is a technologist who has worked in the software/IT industry for the past 10+ years in roles across the spectrum: Field Sales (Sales Engineer), Consulting, and Engineering. He has worked extensively with Java and Spring, across multiple industries, and has...

Thursday April 10, 2014 2:30pm - 3:30pm