Apache DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.
During the course of development at LinkedIn and other companies, a need was recognized for a stable, well-tested library of routines in high-level languages suitable for execution on Hadoop. Over time, many routines had been collected, but they were ill-documented, ill-organized, and easily broken. DataFu began as an initiative to clean up these routines by adding documentation and rigorous unit tests.
Since then, DataFu has evolved through many versions of Hadoop and Pig. During this time, it has been used extensively at LinkedIn and other companies for many data-driven products, such as "People You May Know" and "Skills and Endorsements."
This presentation introduces DataFu and walks through example use cases in Pig.