ApacheCon North America 2014 has ended
ApacheCon North America 2014 - April 7-9 in Denver, CO


Big Data - Data Ecosystem
Monday, April 7


Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
A long time ago in a galaxy far, far away, only a chosen few could deploy and operate a fully functional Hadoop cluster. Vendors took pride in streamlining this experience for their customers by creating various distributions of Apache Hadoop. That changed when Cloudera decided to support Apache Bigtop as the first 100% community-driven big data management distribution based on Apache Hadoop. Today, most major commercial distributions of Apache Hadoop are based on Bigtop. Bigtop has won the Hadoop distribution wars and offers a superset of packaged components. In this talk we will focus on practical advice on how to deploy and start operating a Hadoop cluster using Bigtop’s packages and deployment code. We will dive into the details of using the Hadoop ecosystem packages provided by Bigtop and how to build data management pipelines in support of your enterprise applications.


Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in...

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels

Monday April 7, 2014 10:55am - 11:45am
Confluence C


Securing your Apache Hadoop cluster with Apache Sentry (incubating)
Apache Hadoop users can drive adoption within their organizations by implementing Apache Sentry Role-Based Access Control (RBAC). This talk will discuss the problems associated with using Apache Hadoop's low-level authorization primitives in the context of Apache Hive and Apache Solr, and how Apache Sentry addresses those problems. It will also cover how to implement Apache Sentry on your cluster and the Apache Sentry roadmap, followed by a demo of Apache Sentry with Apache Hive and Apache Solr.

- Concepts
- Practical concerns
- Installation
- Configuration
- Administration
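Sentry policy is administered through SQL statements issued via HiveServer2. As a hedged sketch of the RBAC model the talk describes, the helper below builds the kind of role/grant DDL Sentry uses; the role, group, database, and table names are hypothetical examples, not from the talk:

```python
# Sketch of Sentry-style RBAC grants as issued through HiveServer2.
# Role, group, and table names below are hypothetical examples.

def sentry_grants(role, group, db, table, privilege="SELECT"):
    """Build the SQL statements that define a Sentry role, grant it a
    privilege on a table, and assign the role to an OS/LDAP group."""
    return [
        f"CREATE ROLE {role}",
        f"GRANT {privilege} ON TABLE {db}.{table} TO ROLE {role}",
        f"GRANT ROLE {role} TO GROUP {group}",
    ]

stmts = sentry_grants("analyst", "analysts", "sales", "orders")
for s in stmts:
    print(s)
```

The key design point is that privileges attach to roles and roles attach to groups, so access is never granted to individual users directly.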


Xuefu Zhang

Software Engineer, Cloudera
Xuefu Zhang has over 10 years' experience in software development. Working for Cloudera since May 2013, he spends much of his effort on Apache Hive and Pig. He also worked on the Hadoop team at Yahoo when the majority of Hadoop development still happened there. Xuefu Zhang is...

Monday April 7, 2014 11:55am - 12:45pm
Confluence C


Apache Falcon – Simplifying managing data jobs on Hadoop
Apache Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop. It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications.

Apache Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with: Data Set, Process, and Infrastructure entities expressed in a declarative language.

The presentation covers the detailed design and architecture, along with case studies of Falcon usage in production. We also look at how this compares with solutions built using a siloed approach.
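One of the use cases above is data retention: instead of every application hand-coding cleanup, a feed entity declares a retention window and Falcon evicts instances that fall outside it. The toy sketch below models that policy in plain Python; it is an illustration of the idea, not Falcon's actual API:

```python
# Toy model of a Falcon-style feed retention policy (illustrative only,
# not Falcon's API): instances older than the retention window become
# eviction candidates, the rest are kept.
from datetime import datetime, timedelta

def apply_retention(instances, now, retention):
    """Split feed-instance timestamps into (keep, evict) by age."""
    cutoff = now - retention
    keep = [t for t in instances if t >= cutoff]
    evict = [t for t in instances if t < cutoff]
    return keep, evict

now = datetime(2014, 4, 7, 12, 0)
instances = [now - timedelta(days=d) for d in (0, 5, 40, 100)]
keep, evict = apply_retention(instances, now, timedelta(days=30))
```

With a 30-day window, the 0- and 5-day-old instances are kept while the 40- and 100-day-old ones are evicted.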


Shwetha GS

Staff Engineer, InMobi
Shwetha GS is a Staff Engineer at InMobi and has been building data processing applications on Hadoop. She is a committer and PMC member on Apache Falcon (incubating) and a contributor to Apache Oozie. Prior to InMobi, she was with Amazon.

Monday April 7, 2014 2:00pm - 2:50pm
Confluence C


Sqoop 2 - New generation of Big Data Transfers
Apache Sqoop is a tool created to efficiently transfer big data between the Hadoop ecosystem (components such as HDFS, Hive, or HBase) and structured data stores (such as relational databases, data warehouses, or NoSQL systems). Sqoop's popularity in enterprise systems confirms that it handles bulk transfer admirably.

In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases, and to become easier to manage and operate, a new generation of Sqoop has been created. With a focus on ease of use, ease of extension, and security, Sqoop 2 was born. This session will dive into the Sqoop 2 architecture, describing the differences from Sqoop 1 and the benefits the new architecture brings.
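The efficiency of Sqoop's bulk transfers comes from parallelism: an import is split on a column's value range so that each mapper pulls a disjoint slice of the table. A simplified sketch of that range-splitting idea (half-open integer ranges, assumed for illustration; Sqoop's own boundary logic differs in detail):

```python
# Simplified sketch of how a Sqoop-style import splits an integer
# split-by column range across parallel mappers. Illustrative only;
# Sqoop's actual boundary handling differs in detail.

def split_ranges(lo, hi, num_mappers):
    """Divide [lo, hi) into num_mappers roughly equal sub-ranges,
    one per mapper, covering the whole range with no overlap."""
    step, rem = divmod(hi - lo, num_mappers)
    ranges, start = [], lo
    for i in range(num_mappers):
        end = start + step + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

splits = split_ranges(0, 100, 4)
```

Each (start, end) pair then becomes a `WHERE id >= start AND id < end` predicate in one mapper's query, so four mappers read the table concurrently without duplicating rows.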


Jaroslav Cecho

Software Engineer, Cloudera
Jarek Jarcec Cecho is a software engineer at Cloudera, where he develops software to help customers better access and integrate with the Hadoop ecosystem. He has led the Sqoop community in the architecture of the next generation of Sqoop, known as Sqoop 2. He is also a co-author of...

Abraham Elmahrek

Software Engineer, Cloudera
Abe is a Software Engineer at Cloudera working on ingest systems. Prior to working on ingest systems, he helped develop and bring to market Hue 3. He is a member of the Apache Sqoop PMC and a committer on the Apache HTrace (incubating) project.

Monday April 7, 2014 3:00pm - 3:50pm
Confluence C


Enterprise Kafka: Kafka as a Service
Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.
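The core abstraction that makes Kafka suit a multi-tenant data service is the partitioned, append-only log: producers append, and each consumer tracks its own read offset, so many independent readers share one stream. A minimal in-memory sketch of that model (this is an illustration of the semantics, not the Kafka client API):

```python
# Minimal in-memory sketch of Kafka's log semantics (single partition):
# producers append messages, consumers read from offsets they manage
# themselves. Illustrative model only, not the Kafka client API.

class TopicLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        """Append a message; return its offset in the log."""
        self.messages.append(msg)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset;
        reading never removes messages, so consumers are independent."""
        return self.messages[offset:offset + max_messages]

log = TopicLog()
offsets = [log.append(m) for m in ("pageview", "click", "pageview")]
# Two consumers at different offsets see different slices of one log.
slow_consumer = log.read(0)
fast_consumer = log.read(2)
```

Because reads are non-destructive, adding a tenant is just adding another offset, which is what makes the multi-tenant operations story in this talk tractable.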


Clark Haskins

Site Reliability Engineer, LinkedIn

Todd Palino

Staff Site Reliability Engineer, http://linkedin.com/
Todd Palino is a Staff Site Reliability Engineer at LinkedIn, tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification...

Monday April 7, 2014 4:00pm - 4:50pm
Confluence C


Scaling MQTT Using Kafka
MQTT is a publish/subscribe protocol used in the Internet of Things to send telemetry data. Unlike Kafka clients, MQTT clients are simple and easy to implement, because the protocol is designed for devices with strict memory and power constraints. However, MQTT has an Achilles heel when it comes to scaling to high loads. Tim will review both protocols, dissect their performance characteristics, and discuss how his team at 2lemetry used Kafka to address MQTT's scaling problems.
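Part of what makes MQTT attractive to constrained devices is its tiny subscription model: hierarchical topics with two wildcards, `+` for one level and `#` for all remaining levels. A simplified matcher for that rule (it skips edge cases such as `$`-prefixed topics and `#` placement validation):

```python
# Simplified MQTT topic-filter matching: '+' matches exactly one topic
# level, '#' matches all remaining levels. Skips edge cases such as
# '$'-prefixed topics and validation of '#' placement.

def mqtt_match(topic_filter, topic):
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            return True            # multi-level wildcard: match the rest
        if i >= len(t_parts):
            return False           # topic ran out of levels
        if part != "+" and part != t_parts[i]:
            return False           # literal level mismatch
    return len(f_parts) == len(t_parts)

matched = mqtt_match("sensors/+/temperature", "sensors/device42/temperature")
```

Matching every published message against every subscription like this is cheap per message but becomes a fan-out bottleneck at high load, which is the scaling pressure the talk describes offloading to Kafka.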


Tim Kellogg

Lead Software Engineer, 2lemetry
Tim is a lead software engineer at 2lemetry focused on implementing protocols for the Internet of Things. He spends much of his time evaluating software and protocols and architecting IoT-scale solutions. He contributes articles to iotworld.com on a regular basis and is working on...

Monday April 7, 2014 5:00pm - 5:50pm
Confluence C
Tuesday, April 8


Apache Streams - Simplifying Real-Time data integration
Interest in analyzing the real-time web has reached a fever pitch among academics and corporate executives. Researchers and professionals tasked with capturing and analyzing high volumes of real-time social data have a plethora of open-source databases and machine learning libraries to choose from, but often spend a large fraction of their time writing code for (and manually performing) ingestion, cleansing, normalization, and data management.

Apache Streams seeks to break these problems down into self-contained modules based on simple interfaces, and to foster a community-based approach to connecting and harmonizing data sources and services. Implementers can compose a data workflow from Streams components and run it in real-time or batch mode, using a variety of storage services (Kafka, HDFS, Cassandra, etc.) and execution engines (Tomcat, Storm, Amazon Kinesis, etc.).
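The composition model above can be sketched in miniature: a provider emits documents, processors transform them, and a runner chains the stages into one workflow. The component names here are hypothetical illustrations, not the Apache Streams API (which is Java-based):

```python
# Miniature sketch of a provider -> processor workflow in the style
# described above. Component names are hypothetical; this is not the
# (Java-based) Apache Streams API.

def provider():
    """Source stage: emit raw documents, e.g. fetched social posts."""
    yield {"text": "  Hello WORLD "}
    yield {"text": "apache streams"}

def normalize(docs):
    """Cleansing stage: trim whitespace and lowercase the text."""
    for d in docs:
        yield {"text": d["text"].strip().lower()}

def run(source, *processors):
    """Chain the source through each processor and drain the stream."""
    stream = source()
    for p in processors:
        stream = p(stream)
    return list(stream)

out = run(provider, normalize)
```

Because each stage only consumes and produces an iterator of documents, stages can be reused across workflows and swapped independently, which is the modularity argument the abstract makes.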


Steve Blackmon

VP Technology, People Pattern, Inc.
VP Technology at People Pattern, previously Director of Data Science at W2O Group, co-founder of Ravel, stints at Boeing, Lockheed Martin, and Accenture. Committer and PMC for Apache Streams (incubating). Experienced user of Spark, Storm, Hadoop, Pig, Hive, Nutch, Cassandra, Tinkerpop...

Tuesday April 8, 2014 10:30am - 11:20am
Lawrence A