ApacheCon North America 2014 has ended
ApacheCon North America 2014 - April 7-9 in Denver, CO.


Big Data - HBase & Indexing
Monday, April 7

10:55am PDT

How Apache Phoenix enables interactive, low latency applications over your HBase data
Apache Phoenix (http://phoenix.incubator.apache.org) opens the door to an entirely new class of applications for Apache HBase: interactive big data applications that demand low latency, as opposed to the typical map-reduce, batch-oriented applications. Phoenix is the technology used to support big data at Salesforce.com and the first SQL query engine built specifically for HBase, leveraging all of its power to push computation to where the data lives. It is not just a read-only query engine: it supports full DDL, DML, secondary indexes, views, and multi-tenant data. Delivered as a standard JDBC driver, it allows anyone who knows SQL to leverage the power of HBase for interactive, big data applications.

Come learn about the difference between this and other SQL-over-Hadoop products, and hear about our roadmap for the future.
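The central idea in the abstract, pushing computation to where the data lives, can be sketched with a toy model (plain Python, no HBase or Phoenix involved; all names are illustrative, not Phoenix APIs):

```python
# One "region" of a table: rowkey -> columns. In HBase the region lives
# on a region server; Phoenix evaluates SQL predicates there.
REGION = {
    "row1": {"city": "Denver",  "pop": 600_000},
    "row2": {"city": "Reno",    "pop": 230_000},
    "row3": {"city": "Boulder", "pop": 100_000},
}

def client_side_filter(region, predicate):
    """Naive client: ship every row over the wire, then filter locally."""
    shipped = list(region.values())           # all rows cross the network
    return [r for r in shipped if predicate(r)], len(shipped)

def server_side_filter(region, predicate):
    """Phoenix-style: evaluate the predicate where the data lives."""
    matched = [r for r in region.values() if predicate(r)]
    return matched, len(matched)              # only matches cross the network

big = lambda r: r["pop"] > 200_000
rows_a, shipped_a = client_side_filter(REGION, big)
rows_b, shipped_b = server_side_filter(REGION, big)
assert rows_a == rows_b       # same answer...
assert shipped_b < shipped_a  # ...far fewer rows transferred
```

The answer is identical either way; the difference is how many rows cross the network, which is what makes interactive latencies feasible on large tables.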


James Taylor

Architect, Salesforce.com
James Taylor is an architect at Salesforce.com in the Big Data Group. He founded the Apache Phoenix project and leads its ongoing development efforts. Prior to working at Salesforce.com, James worked at BEA Systems on projects such as a federated query processing system and an event...

Monday April 7, 2014 10:55am - 11:45am PDT
Confluence A

11:55am PDT

Bringing Distributed Transactions to HBase
HBase is a distributed columnar store that provides elastic scalability and strong consistency. As its successful usage broadens, organizations build more applications with features that the HBase core implementation trades off in favor of flexibility and scalability. While HBase allows atomic multi-row operations within a single region, there are many application use cases that would benefit from, or could be simplified by, the availability of global (cross-region, cross-table) transactions.

In this talk we will describe in detail a method for bringing global transaction support to HBase. We will explore how adding transactions can make use of HBase features such as coprocessors and multi-version concurrency control, including a deep dive into the implementation. We will also examine some specific use cases that rely on transactional support to simplify solutions on HBase.
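The multi-version concurrency control the talk builds on can be sketched as a toy versioned store (plain Python, not HBase internals; all names are illustrative): readers pin a snapshot version and never observe a concurrent write halfway.

```python
class MVCCStore:
    """Toy MVCC: every write is stamped with a version, and a reader
    sees only versions up to its snapshot."""

    def __init__(self):
        self.versions = {}   # key -> list of (write_version, value)
        self.clock = 0

    def begin_write(self):
        self.clock += 1
        return self.clock    # version stamp for this write

    def put(self, version, key, value):
        self.versions.setdefault(key, []).append((version, value))

    def get(self, key, snapshot):
        """Return the newest value visible at `snapshot`."""
        visible = [(v, val) for (v, val) in self.versions.get(key, [])
                   if v <= snapshot]
        return max(visible)[1] if visible else None

store = MVCCStore()
v1 = store.begin_write()
store.put(v1, "balance", 100)

snapshot = store.clock        # a reader takes its snapshot here
v2 = store.begin_write()
store.put(v2, "balance", 42)  # concurrent write gets a newer version

assert store.get("balance", snapshot) == 100    # reader's view is stable
assert store.get("balance", store.clock) == 42  # later readers see the update
```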


Alex Baranau

Software Engineer, Continuuity
Alex Baranau is a software engineer at Continuuity, where he is responsible for designing and building software fueling the next generation of Big Data applications. For the last few years, Alex has been working on complex data-analytics systems utilizing Hadoop and HBase. Alex is...

Monday April 7, 2014 11:55am - 12:45pm PDT
Confluence A

2:00pm PDT

Hindex: Secondary indexes for faster HBase queries
HBase is a very popular data store due to its tight integration with Hadoop. However, query latencies can sometimes be high, especially when scanning tables on column values. This can also have other undesirable side effects, such as client timeouts or lease expirations. Hindex adds secondary indexes to HBase tables. The indexes are used for equals and range-condition scans, and can turn full table scans into point or range scans. Hindex is a 100% server-side solution based on coprocessors; it supports one or more indexes on a table, multi-column indexes, and indexes based on part of a column value. Hindex stores a region-level index in a separate table and colocates the user and index table regions with a custom load balancer.
In this session, we will learn in detail about the new capability this adds to HBase, and will also dive into the technical details of the implementation.
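What a secondary index buys can be sketched in a few lines (plain Python; Hindex's real implementation is coprocessor-based and keeps the index in a separate, colocated table, but the access-path change is the same):

```python
# rowkey -> row; in HBase the rowkey is the only natively indexed access path.
table = {
    f"row{i}": {"user": f"u{i}", "country": "US" if i % 2 else "DE"}
    for i in range(1000)
}

def full_table_scan(table, column, value):
    """Without an index: examine every row (what a filtered scan does)."""
    return [k for k, row in table.items() if row[column] == value]

# Build a secondary index: column value -> rowkeys. Hindex maintains
# this per region, in a separate table colocated with the data region.
index = {}
for k, row in table.items():
    index.setdefault(row["country"], []).append(k)

def indexed_scan(index, value):
    """With the index: a point lookup instead of a full scan."""
    return index.get(value, [])

# Same rowkeys come back, but the indexed path touches only the matches.
assert sorted(full_table_scan(table, "country", "DE")) == sorted(indexed_scan(index, "DE"))
```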


Rajeshbabu Chintaguntla

Software Engineer, Huawei
Rajesh is a committer for Apache HBase. He is a Software Engineer with Huawei in their Bangalore R&D Center, and works on enhancement and stabilization of HBase to meet the needs of Telco customers. His recent focus has been the development of secondary indexes, which Huawei recently open...

Monday April 7, 2014 2:00pm - 2:50pm PDT
Confluence A

3:00pm PDT

Introduction to Apache DataFu
Apache DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.

During the course of development at LinkedIn and other companies, a need was recognized for a stable, well-tested library of routines in high-level languages suitable for execution on Hadoop. Over time, many routines had been collected but were ill-documented, ill-organized, and easily broken. Initially, DataFu was an initiative to clean up these routines by adding documentation and rigorous unit tests.

Since then, DataFu has evolved through many versions of Hadoop and Pig. During this time DataFu has been used extensively at LinkedIn and other companies for many data-driven products, such as "People You May Know" and "Skills and Endorsements."

This presentation provides an introduction to DataFu as well as example use cases in Pig.
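As a flavor of the kind of statistics routine DataFu packages as Pig UDFs, here is a nearest-rank quantile sketched in plain Python (DataFu ships quantile and other statistics UDFs; this standalone function is only an illustration, not DataFu code):

```python
def quantile(sorted_values, q):
    """Exact q-quantile by nearest rank over pre-sorted input.
    Requiring sorted input keeps this a single pass, no re-sort."""
    assert 0.0 <= q <= 1.0 and sorted_values
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

data = sorted([5, 1, 9, 3, 7])
assert quantile(data, 0.5) == 5   # median
assert quantile(data, 0.0) == 1
assert quantile(data, 1.0) == 9
```

In Pig, the equivalent call site would hand a sorted bag to the UDF inside a FOREACH; the library's value is exactly that such routines come documented and unit-tested instead of being rewritten per pipeline.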


William Vaughan

Software Engineer, LinkedIn
William Vaughan is currently a Staff Software Engineer at LinkedIn who was involved in creating the Skills and Expertise and Endorsements Big Data products.

Monday April 7, 2014 3:00pm - 3:50pm PDT
Confluence A

4:00pm PDT

HydraBase: Strong Consistency beyond a single datacenter
Apache HBase powers several popular applications at Facebook, most notably Facebook Messages. To ensure data availability in the face of large-scale disruptive events, such as datacenter-wide power outages and network problems, Facebook uses asynchronous replication and failover. This approach, however, puts a burden on application developers, each of whom must build their own solution to deal with eventual consistency.

As an answer to this problem, Facebook has built HydraBase, an iteration of HBase which synchronously replicates data to geographically dispersed hosts while maintaining strong consistency. This presentation will cover the design of this system; the changes made to HBase to support this type of replication while upholding the guarantees provided by HBase to application developers; and some early results and lessons learned from running this system in production.
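The synchronous, majority-based replication HydraBase relies on can be sketched with a toy quorum commit (plain Python, not Facebook's implementation; all names are illustrative): a write is acknowledged only once a majority of replicas have it, so a committed value survives the loss of any minority of hosts.

```python
class Replica:
    """One replicated log host; `up` models reachability."""
    def __init__(self):
        self.log = []
        self.up = True

def replicate(replicas, entry):
    """Append `entry` to every reachable replica; commit on majority ack."""
    acks = 0
    for r in replicas:
        if r.up:
            r.log.append(entry)
            acks += 1
    return acks > len(replicas) // 2   # strict majority required

replicas = [Replica() for _ in range(5)]
assert replicate(replicas, "put:k=v")        # all 5 ack: committed

replicas[0].up = replicas[1].up = False      # one site goes dark
assert replicate(replicas, "put:k=v2")       # 3 of 5 still commit

replicas[2].up = False
assert not replicate(replicas, "put:k=v3")   # 2 of 5: not committed
```

Spreading the replica set across datacenters is what turns this from single-site fault tolerance into the cross-datacenter strong consistency the talk describes, at the cost of paying cross-site latency on every committed write.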


Arjen Roodselaar

Data Infrastructure Engineer, Facebook
Arjen has been fascinated with the design and operation of distributed systems ever since gaining access to the internet and writing his first client and server code. For the past 2 years he has been part of the Data Infrastructure engineering team at Facebook, working on increased...

Monday April 7, 2014 4:00pm - 4:50pm PDT
Confluence A

5:00pm PDT

Feeding the Elephant: Optimizing the Read Path of the Hadoop Distributed Filesystem
The Hadoop Distributed Filesystem (HDFS) is a key component of the Hadoop distributed computation framework.  I'd like to talk about some important optimizations we made to the read path of HDFS, such as direct reads, short-circuit local reads, zero-copy reads, and HDFS caching.  Along the way, I'll talk about lessons that I learned while working on HDFS, and emerging trends in data center hardware.  Finally, I'll talk about some interesting ongoing and planned approaches to optimizing Hadoop and HDFS.
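Two of the optimizations mentioned, short-circuit reads (the client bypasses the DataNode and reads the local block file directly) and zero-copy reads (mapping data instead of copying it through read buffers), can be illustrated locally with an ordinary file (plain Python, no HDFS; this is an analogy, not HDFS client code):

```python
import mmap
import os
import tempfile

# Stand-in for an HDFS block file sitting on the local datanode's disk.
fd, path = tempfile.mkstemp()
os.write(fd, b"block-data-" * 1024)
os.close(fd)

# Copy-style read: bytes move from the page cache into a userspace buffer.
with open(path, "rb") as f:
    copied = f.read()

# Zero-copy-style read: map the file into our address space; slicing the
# map materializes only the bytes we actually touch.
with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    zero_copy = bytes(m[:11])   # just the prefix, no full-file copy

assert copied[:11] == zero_copy == b"block-data-"
os.unlink(path)
```

Short-circuit local reads apply the first idea (skip the server process when the data is local), and HDFS zero-copy reads apply the second (hand the client a mapped region of a cached block); both remove copies and context switches from the hot read path.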


Colin McCabe

Software Engineer, Cloudera
Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. He is a committer on HDFS. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem and the Linux kernel, among other things. He studied Computer Science and...

Monday April 7, 2014 5:00pm - 5:50pm PDT
Confluence A