ApacheCon North America 2014 has ended
Register Now for ApacheCon North America 2014 - April 7-9 in Denver, CO. Registration fees increase on March 15th, so don’t delay!


Big Data - Frameworks
Wednesday, April 9

9:00am PDT

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka (http://kafka.apache.org/) is an introduction for developers to why and how to use Apache Kafka. Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log. Kafka is designed to allow a single cluster to serve as the central data backbone. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients, and a cluster can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines, allowing streams larger than any single machine could handle and enabling clusters of coordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss, and each broker can handle terabytes of messages.
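The core abstraction the abstract describes — a topic as a set of partitioned, append-only logs, with consumers tracking their own offsets — can be sketched in a few lines. This is a toy illustration of the idea, not Kafka's actual API; all names here are invented for the sketch.

```python
# Toy sketch of a partitioned commit log (the idea behind a Kafka topic).
# NOT the Kafka API: just shows why partitioning by key lets one topic
# scale across machines while preserving per-key ordering.

class TopicLog:
    def __init__(self, partitions=3):
        # each partition is an independent append-only list of messages
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # messages with the same key always land in the same partition,
        # so per-key ordering is preserved
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # consumers track their own offsets; the log just serves reads
        return self.partitions[partition][offset:]

topic = TopicLog()
p, off = topic.produce("user-42", "clicked")
topic.produce("user-42", "purchased")
print(topic.consume(p, off))   # ['clicked', 'purchased']
```

Because the broker only appends and serves sequential reads, replaying a stream is just re-reading from an earlier offset — which is why the abstract can describe the same cluster as both a messaging system and a commit log.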


Joe Stein

Big Data Open Source Security LLC
Joe Stein is an Apache Kafka committer and PMC member. A frequent speaker on both Hadoop and Cassandra, Joe is the Founder and Principal Architect of Big Data Open Source Security LLC (http://stealth.ly), a professional services and product solutions company. Joe has been a distributed...

Wednesday April 9, 2014 9:00am - 9:50am PDT
Confluence C

10:00am PDT

Simplifying Big Data with Apache Crunch
The MapReduce framework is a proven method for processing large volumes of data, but even simple problems require expertise, and the learning curve for Big Data and efficient processing is daunting for developers just getting started. The Apache Crunch project breaks complex processing problems down into simple concepts that can be run on industry-standard frameworks such as Hadoop and Spark. Apache Crunch is used as an integral part of building processing pipelines for healthcare data, allowing quick development of new solutions and architectures. The talk will also cover how the core concepts of Apache Crunch enable first-class integration, rapid scaling of development across teams, and development of extensible processing infrastructure.
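The "simple concepts" Crunch builds pipelines from are small collection operations (Crunch's Java API calls them things like parallelDo, groupByKey, and combineValues) that a planner maps onto MapReduce or Spark. The sketch below mimics that compositional style in plain Python — it is a conceptual illustration, not the Crunch API.

```python
# Conceptual sketch (not the Crunch API) of composing a pipeline from
# simple collection operations; the function names echo Crunch's Java
# methods but are stand-ins defined here.

def parallel_do(coll, fn):          # like PCollection.parallelDo
    return [fn(x) for x in coll]

def group_by_key(pairs):            # like PTable.groupByKey
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

def combine_values(groups, fn):     # like PGroupedTable.combineValues
    return {k: fn(vs) for k, vs in groups.items()}

# word count, the "hello world" of data pipelines
lines = ["crunch makes pipelines simple", "pipelines on hadoop or spark"]
words = parallel_do(lines, str.split)
pairs = [(w, 1) for ws in words for w in ws]
counts = combine_values(group_by_key(pairs), sum)
print(counts["pipelines"])   # 2
```

The point of the abstraction is that the same three-operation pipeline runs unchanged whether the planner targets a local runner, MapReduce, or Spark.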


Micah Whitacre

Software Architect, Cerner Corporation
Micah is a committer on the Apache Crunch project as well as a Software Architect for Cerner Corporation, a leading provider of healthcare technology. For almost a decade he has worked on building infrastructure and reusable assets. In the last few years his focus has shifted towards...

Wednesday April 9, 2014 10:00am - 10:50am PDT
Confluence C

11:15am PDT

Developing the Tez Execution Engine for Pig
Apache Pig is a programming language and execution runtime for petabyte-scale processing with MapReduce. One of the major recent developments in the Hadoop ecosystem is the introduction of Apache Tez, a successor to MapReduce that provides major performance enhancements and a more natural foundation for Pig. The Pig-on-Tez project aims to dramatically increase the throughput of data pipelines written in Pig by using Apache Tez as the execution engine instead of MapReduce. In benchmarks atop the Tez framework, representative queries have sped up 2-3x compared to MapReduce.

In the second half of this presentation we’ll explain how LinkedIn, Netflix, Hortonworks, and Yahoo successfully collaborated over a six-month period to deliver a major rewrite of critical infrastructure, providing significant benefits both for themselves and for the community at large.
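One commonly cited reason Tez serves Pig better than MapReduce is structural: a multi-stage Pig script compiles to a chain of MapReduce jobs, each materializing its full output to HDFS before the next job launches, whereas Tez runs the same plan as one DAG and hands intermediates directly between vertices. The sketch below just counts those materialization barriers for a hypothetical join-then-group plan; the stage lists are illustrative, not a benchmark.

```python
# Illustrative only: a hypothetical JOIN-then-GROUP Pig plan expressed
# as (a) a chain of MapReduce jobs and (b) a single Tez DAG. Every
# write-hdfs/read-hdfs pair in (a) is disk I/O plus job-launch overhead
# that the DAG execution in (b) avoids.

mr_stages    = ["load", "join", "write-hdfs",      # job 1
                "read-hdfs", "group", "write-hdfs"]  # job 2
tez_vertices = ["load", "join", "group", "write-hdfs"]  # one DAG

mr_barriers  = mr_stages.count("write-hdfs") + mr_stages.count("read-hdfs")
tez_barriers = tez_vertices.count("write-hdfs")
print(mr_barriers, tez_barriers)   # 3 1
```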


Cheolsoo Park

Senior Software Engineer, Netflix
Cheolsoo Park is an Apache Pig PMC member and committer. He is also a senior software engineer at Netflix and works on cloud-based big data analytics infrastructure that leverages open source technologies including Hadoop, Hive and Pig. Cheolsoo holds a Bachelor’s degree in Computer...

Mark Wagner

Mark Wagner is a committer on the Apache Pig project and a contributor to many other projects in the Hadoop ecosystem. He is passionate about distributed systems, programming languages, and machine learning. Mark holds Bachelor’s degrees in Mathematics and Computer Science from...

Wednesday April 9, 2014 11:15am - 12:05pm PDT
Confluence C

1:15pm PDT

Harnessing the power of YARN with Apache Twill
With its resource manager YARN, Apache Hadoop 2.0 allows arbitrary distributed workloads to run in a cluster. While powerful and generic, the YARN interface is complex and poses a steep learning curve for developers. Apache Twill removes this barrier, exposing YARN’s power through a simple thread-like programming model. This talk gives an overview of YARN and Twill, and ends with a Twill tutorial.
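The "thread-like programming model" means you write the distributed task much as you would a plain Runnable and hand it to a runner that decides where it executes (in Twill's Java API, a TwillRunnable submitted to a TwillRunner backed by YARN containers). The sketch below illustrates the shape of that model in Python, with local threads standing in for containers — it is an analogy, not the Twill API.

```python
# Analogy for Twill's model (NOT its API): each "runnable" is written
# like ordinary sequential code; a runner decides where it executes.
# Threads stand in here for the YARN containers Twill would use.

import threading, queue

results = queue.Queue()

class WordCountRunnable:
    """Plain Runnable-style task; Twill's TwillRunnable is the Java analogue."""
    def __init__(self, shard):
        self.shard = shard
    def run(self):
        # count words in this shard and report the partial result
        results.put(sum(len(line.split()) for line in self.shard))

shards = [["yarn makes clusters generic"], ["twill makes yarn simple to use"]]
workers = [threading.Thread(target=WordCountRunnable(s).run) for s in shards]
for w in workers: w.start()
for w in workers: w.join()
total = sum(results.get() for _ in shards)
print(total)   # 10
```

The appeal is exactly what the abstract claims: the task author never touches container negotiation, heartbeats, or the ApplicationMaster protocol.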


Terence Yim

Committer, Apache Twill, The Apache Software Foundation
Terence Yim is a Software Engineer at Continuuity, responsible for designing and building real-time processing systems on Hadoop/HBase. Prior to Continuuity, Terence spent over a year at LinkedIn Inc. and seven years at Yahoo!, building high-performance, large-scale distributed sys...

Wednesday April 9, 2014 1:15pm - 2:05pm PDT
Confluence C

2:15pm PDT

Apache Giraph: start analyzing graph relationships in your big data in 45 minutes (or your money back)!
The genesis of Hadoop was in analyzing massive amounts of data with a MapReduce framework. SQL-on-Hadoop followed shortly after, paving the way for the whole schema-on-read notion. Discovering graph relationships in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of the BSP approach to unstructured data. In this talk we will focus on practical advice for getting up and running with Apache Giraph, analyzing simple data sets with built-in algorithms, and finally implementing your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and provide a whirlwind tour of Giraph’s architecture.
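The BSP (Bulk Synchronous Parallel) approach Giraph inherits from Pregel runs a compute step per vertex, delivers messages along edges, then synchronizes the whole graph before the next superstep, until no vertex changes. The toy below runs that loop for the classic max-value-propagation example; it is a conceptual sketch in plain Python, not Giraph's Java Computation API.

```python
# Toy Pregel/BSP superstep loop (the model Giraph implements; NOT its
# API). Each superstep: every vertex reads its inbox, updates its
# value, and messages its neighbours; the graph syncs between steps.

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # adjacency lists
value = {"a": 3, "b": 1, "c": 7}                    # propagate the max

def superstep(value, inbox):
    """One BSP superstep; returns next round's messages and a changed flag."""
    outbox = {v: [] for v in value}
    changed = False
    for v, incoming in inbox.items():
        m = max(incoming, default=value[v])
        if m > value[v]:                 # vertex learns a larger value
            value[v] = m
            changed = True
            for nbr in graph[v]:         # tell the neighbours
                outbox[nbr].append(value[v])
    return outbox, changed

# superstep 0: seed by sending every vertex's value to its neighbours
inbox = {v: [] for v in graph}
for v in graph:
    for nbr in graph[v]:
        inbox[nbr].append(value[v])

changed = True
while changed:                           # halt when a superstep changes nothing
    inbox, changed = superstep(value, inbox)
print(value)   # {'a': 7, 'b': 7, 'c': 7}
```

The same skeleton, with a different compute function, gives shortest paths, PageRank, or connected components — which is why one runtime covers so many graph algorithms.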


Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels

Wednesday April 9, 2014 2:15pm - 3:05pm PDT
Confluence C

3:15pm PDT

Apache Pig as a platform for Data Science
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. In addition, it provides for extensibility by way of User Defined Functions. There are some third-party libraries for Pig geared for use by Data Scientists.

In this talk, I will explore how to integrate popular libraries with Apache Pig to provide a robust environment for data science, and examine gaps and potential improvements based on our experience using Pig as a data science tool. In particular, we will focus on the role of Pig as a data aggregation tool as well as a platform for evaluating machine learning models at scale.
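The aggregation role described above is Pig's GROUP ... FOREACH ... GENERATE pattern: group records by a key, then evaluate a function (a built-in aggregate or a User Defined Function, such as a trained model's scorer) once per group. The sketch below shows that pattern in plain Python rather than Pig Latin; the records and field names are invented for illustration.

```python
# Sketch of Pig's aggregation pattern in plain Python (not Pig Latin).
# The Pig Latin equivalent would be roughly:
#   grouped = GROUP records BY key;
#   scores  = FOREACH grouped GENERATE group, AVG(records.metric);
# where AVG could be replaced by any UDF, e.g. a model scorer.

from collections import defaultdict

records = [                      # hypothetical example data
    {"key": "p1", "metric": 2},
    {"key": "p1", "metric": 6},
    {"key": "p2", "metric": 9},
]

# GROUP records BY key
groups = defaultdict(list)
for r in records:
    groups[r["key"]].append(r["metric"])

# FOREACH grouped GENERATE group, AVG(metric)
scores = {k: sum(vs) / len(vs) for k, vs in groups.items()}
print(scores["p1"])   # 4.0
```

Swapping the average for a per-group model evaluation is what lets the same script serve as both an aggregation job and a scoring-at-scale job.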


Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist...

Wednesday April 9, 2014 3:15pm - 4:05pm PDT
Confluence C