Apache DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.
During the course of development at LinkedIn and other companies, a need was recognized for a stable, well-tested library of routines in high-level languages suitable for execution on Hadoop. Over time, many routines had been collected but were ill-documented, ill-organized, and easily broken. Initially, DataFu was an initiative to clean up these routines by adding documentation and rigorous unit tests.
Since then, DataFu has evolved through many versions of Hadoop and Pig. During this time, DataFu has been used extensively at LinkedIn and other companies for many data-driven products, such as "People You May Know" and "Skills and Endorsements."
This presentation gives an introduction to DataFu as well as example use cases in Pig.
William Vaughan is currently a Staff Software Engineer at LinkedIn, where he has been involved with the creation of the Skills and Expertise as well as the Endorsements Big Data products.
Monday April 7, 2014 3:00pm - 3:50pm PDT
Confluence A