ApacheCon North America 2014 has ended
ApacheCon North America 2014 - April 7-9 in Denver, CO


Big Data - Data Ecosystem
Monday, April 7


Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
A long time ago in a galaxy far, far away, only a chosen few could deploy and operate a fully functional Hadoop cluster. Vendors took pride in streamlining this experience for their customers by creating various distributions of Apache Hadoop. That changed when Cloudera decided to support Apache Bigtop as the first 100% community-driven big data management distribution based on Apache Hadoop. Today, most major commercial distributions of Apache Hadoop are based on Bigtop. Bigtop has won the Hadoop distribution wars and offers a superset of packaged components. In this talk we will focus on practical advice on how to deploy and start operating a Hadoop cluster using Bigtop’s packages and deployment code. We will dive into the details of using the Hadoop ecosystem packages provided by Bigtop and how to build data management pipelines in support of your enterprise applications.


Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in...

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels

Monday April 7, 2014 10:55am - 11:45am
Confluence C


Securing your Apache Hadoop cluster with Apache Sentry (incubating)
Apache Hadoop users can drive adoption within their organizations by implementing Apache Sentry Role-Based Access Control (RBAC). This talk will discuss the problems associated with using Apache Hadoop's low-level authorization primitives in the context of Apache Hive and Apache Solr, and how Apache Sentry addresses those problems. It will also cover how to implement Apache Sentry on your cluster and the Apache Sentry roadmap, followed by a demo of Apache Sentry with Apache Hive and Apache Solr.

- Concepts
- Practical concerns
- Installation
- Configuration
- Administration
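Sentry policy is administered through SQL statements issued via HiveServer2. As a hedged sketch of the RBAC model the talk describes, the helper below builds the kind of role/grant DDL Sentry uses; the role, group, database, and table names are hypothetical examples, not from the talk:

```python
# Sketch of Sentry-style RBAC grants as issued through HiveServer2.
# Role, group, and table names below are hypothetical examples.

def sentry_grants(role, group, db, table, privilege="SELECT"):
    """Build the SQL statements that define a Sentry role, grant it a
    privilege on a table, and assign the role to an OS/LDAP group."""
    return [
        f"CREATE ROLE {role}",
        f"GRANT {privilege} ON TABLE {db}.{table} TO ROLE {role}",
        f"GRANT ROLE {role} TO GROUP {group}",
    ]

stmts = sentry_grants("analyst", "analysts", "sales", "orders")
for s in stmts:
    print(s)
```

The key design point is that privileges attach to roles and roles attach to groups, so access is never granted to individual users directly.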


Xuefu Zhang

Software Engineer, Cloudera
Xuefu Zhang has over 10 years' experience in software development. Working for Cloudera since May 2013, he spends much of his effort on Apache Hive and Pig. He also worked on the Hadoop team at Yahoo when the majority of Hadoop development still happened there. Xuefu Zhang is...

Monday April 7, 2014 11:55am - 12:45pm
Confluence C


Apache Falcon – Simplifying managing data jobs on Hadoop
Apache Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop. It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications.

Apache Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with: Data Set, Process, and Infrastructure entities expressed in a declarative language.

The presentation covers the detailed design and architecture, along with case studies of Falcon usage in production. We also look at how this compares with solutions built using a siloed approach.
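One of the use cases above is data retention: instead of every application hand-coding cleanup, a feed entity declares a retention window and Falcon evicts instances that fall outside it. The toy sketch below models that policy in plain Python; it is an illustration of the idea, not Falcon's actual API:

```python
# Toy model of a Falcon-style feed retention policy (illustrative only,
# not Falcon's API): instances older than the retention window become
# eviction candidates, the rest are kept.
from datetime import datetime, timedelta

def apply_retention(instances, now, retention):
    """Split feed-instance timestamps into (keep, evict) by age."""
    cutoff = now - retention
    keep = [t for t in instances if t >= cutoff]
    evict = [t for t in instances if t < cutoff]
    return keep, evict

now = datetime(2014, 4, 7, 12, 0)
instances = [now - timedelta(days=d) for d in (0, 5, 40, 100)]
keep, evict = apply_retention(instances, now, timedelta(days=30))
```

With a 30-day window, the 0- and 5-day-old instances are kept while the 40- and 100-day-old ones are evicted.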


Shwetha GS

Staff Engineer, InMobi
Shwetha GS is a Staff Engineer at InMobi and has been building data processing applications on Hadoop. She is a committer and PMC member on Apache Falcon (incubating) and a contributor to Apache Oozie. Prior to InMobi, she was with Amazon.

Monday April 7, 2014 2:00pm - 2:50pm
Confluence C


Sqoop 2 - New generation of Big Data Transfers
Apache Sqoop is a tool created to efficiently transfer big data between the Hadoop ecosystem (components such as HDFS, Hive, or HBase) and structured data stores (such as relational databases, data warehouses, or NoSQL systems). Sqoop's popularity in enterprise systems confirms that it handles bulk transfer admirably.

In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases, and to become easier to manage and operate, a new generation of Sqoop has been created. With a focus on ease of use, ease of extension, and security, Sqoop 2 was born. This session will dive into the Sqoop 2 architecture, describing the differences from Sqoop 1 and the benefits the new architecture brings.
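The efficiency of Sqoop's bulk transfers comes from parallelism: an import is split on a column's value range so that each mapper pulls a disjoint slice of the table. A simplified sketch of that range-splitting idea (half-open integer ranges, assumed for illustration; Sqoop's own boundary logic differs in detail):

```python
# Simplified sketch of how a Sqoop-style import splits an integer
# split-by column range across parallel mappers. Illustrative only;
# Sqoop's actual boundary handling differs in detail.

def split_ranges(lo, hi, num_mappers):
    """Divide [lo, hi) into num_mappers roughly equal sub-ranges,
    one per mapper, covering the whole range with no overlap."""
    step, rem = divmod(hi - lo, num_mappers)
    ranges, start = [], lo
    for i in range(num_mappers):
        end = start + step + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

splits = split_ranges(0, 100, 4)
```

Each (start, end) pair then becomes a `WHERE id >= start AND id < end` predicate in one mapper's query, so four mappers read the table concurrently without duplicating rows.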


Jaroslav Cecho

Software Engineer, Cloudera
Jarek Jarcec Cecho is a software engineer at Cloudera, where he develops software to help customers better access and integrate with the Hadoop ecosystem. He has led the Sqoop community in the architecture of the next generation of Sqoop, known as Sqoop 2. He is also a co-author of...

Abraham Elmahrek

Software Engineer, Cloudera
Abe is a Software Engineer at Cloudera working on ingest systems. Prior to working on ingest systems, he helped develop and bring to market Hue 3. He is a member of the Apache Sqoop PMC and a committer on the Apache HTrace (incubating) project.

Monday April 7, 2014 3:00pm - 3:50pm
Confluence C


Enterprise Kafka: Kafka as a Service
Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.
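The core abstraction that makes Kafka suit a multi-tenant data service is the partitioned, append-only log: producers append, and each consumer tracks its own read offset, so many independent readers share one stream. A minimal in-memory sketch of that model (this is an illustration of the semantics, not the Kafka client API):

```python
# Minimal in-memory sketch of Kafka's log semantics (single partition):
# producers append messages, consumers read from offsets they manage
# themselves. Illustrative model only, not the Kafka client API.

class TopicLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        """Append a message; return its offset in the log."""
        self.messages.append(msg)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset;
        reading never removes messages, so consumers are independent."""
        return self.messages[offset:offset + max_messages]

log = TopicLog()
offsets = [log.append(m) for m in ("pageview", "click", "pageview")]
# Two consumers at different offsets see different slices of one log.
slow_consumer = log.read(0)
fast_consumer = log.read(2)
```

Because reads are non-destructive, adding a tenant is just adding another offset, which is what makes the multi-tenant operations story in this talk tractable.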


Clark Haskins

Site Reliability Engineer, LinkedIn

Todd Palino

Staff Site Reliability Engineer, http://linkedin.com/
Todd Palino is a Staff Site Reliability Engineer at LinkedIn, tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification...

Monday April 7, 2014 4:00pm - 4:50pm
Confluence C


Scaling MQTT Using Kafka
MQTT is a publish/subscribe protocol used in the Internet of Things to send telemetry data. Unlike Kafka clients, MQTT clients are simple and easy to implement, because the protocol is designed for devices with strict memory and power constraints. However, MQTT has an Achilles heel when it comes to scaling to high loads. Tim will review both protocols, dissect their performance characteristics, and discuss how his team at 2lemetry used Kafka to address MQTT's scaling problems.
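Part of what makes MQTT attractive to constrained devices is its tiny subscription model: hierarchical topics with two wildcards, `+` for one level and `#` for all remaining levels. A simplified matcher for that rule (it skips edge cases such as `$`-prefixed topics and `#` placement validation):

```python
# Simplified MQTT topic-filter matching: '+' matches exactly one topic
# level, '#' matches all remaining levels. Skips edge cases such as
# '$'-prefixed topics and validation of '#' placement.

def mqtt_match(topic_filter, topic):
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            return True            # multi-level wildcard: match the rest
        if i >= len(t_parts):
            return False           # topic ran out of levels
        if part != "+" and part != t_parts[i]:
            return False           # literal level mismatch
    return len(f_parts) == len(t_parts)

matched = mqtt_match("sensors/+/temperature", "sensors/device42/temperature")
```

Matching every published message against every subscription like this is cheap per message but becomes a fan-out bottleneck at high load, which is the scaling pressure the talk describes offloading to Kafka.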


Tim Kellogg

Lead Software Engineer, 2lemetry
Tim is a lead software engineer at 2lemetry focused on implementing protocols for the Internet of Things. He spends much of his time evaluating software and protocols and architecting IoT-scale solutions. He contributes articles to iotworld.com on a regular basis and is working on...

Monday April 7, 2014 5:00pm - 5:50pm
Confluence C
Tuesday, April 8


Apache Streams - Simplifying Real-Time data integration
Interest in analyzing the real-time web has reached a fever pitch among academics and corporate executives. Researchers and professionals tasked with capturing and analyzing high volumes of real-time social data have a plethora of open-source databases and machine learning libraries to choose from, but often spend a large fraction of their time writing code for (and manually performing) ingestion, cleansing, normalization, and data management.

Apache Streams seeks to break these problems down into self-contained modules based on simple interfaces, and to foster a community-based approach to connecting and harmonizing data sources and services. Implementers can compose a data workflow from Streams components and run it in real-time or batch mode, using a variety of storage services (Kafka, HDFS, Cassandra, etc.) and execution engines (Tomcat, Storm, Amazon Kinesis, etc.).
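The composition model above can be sketched in miniature: a provider emits documents, processors transform them, and a runner chains the stages into one workflow. The component names here are hypothetical illustrations, not the Apache Streams API (which is Java-based):

```python
# Miniature sketch of a provider -> processor workflow in the style
# described above. Component names are hypothetical; this is not the
# (Java-based) Apache Streams API.

def provider():
    """Source stage: emit raw documents, e.g. fetched social posts."""
    yield {"text": "  Hello WORLD "}
    yield {"text": "apache streams"}

def normalize(docs):
    """Cleansing stage: trim whitespace and lowercase the text."""
    for d in docs:
        yield {"text": d["text"].strip().lower()}

def run(source, *processors):
    """Chain the source through each processor and drain the stream."""
    stream = source()
    for p in processors:
        stream = p(stream)
    return list(stream)

out = run(provider, normalize)
```

Because each stage only consumes and produces an iterator of documents, stages can be reused across workflows and swapped independently, which is the modularity argument the abstract makes.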


Steve Blackmon

VP Technology, People Pattern, Inc.
VP Technology at People Pattern, previously Director of Data Science at W2O Group, co-founder of Ravel, stints at Boeing, Lockheed Martin, and Accenture. Committer and PMC for Apache Streams (incubating). Experienced user of Spark, Storm, Hadoop, Pig, Hive, Nutch, Cassandra, Tinkerpop...

Tuesday April 8, 2014 10:30am - 11:20am
Lawrence A