ApacheCon North America 2014 has ended
ApacheCon North America 2014 - April 7-9 in Denver, CO.

Lucene & Friends
Monday, April 7
 

10:55am PDT

What's New In Apache Solr?
In its 8-year history at the Apache Software Foundation, Apache Solr has had 17 feature releases, over a third of which were in 2013. In this session we'll look at some of the major new Solr features released in the past year and discuss when and how to leverage them.

Speakers

Chris Hostetter

LucidWorks
Chris 'Hoss' Hostetter is a Member of the Apache Software Foundation, and serves on the Lucene Project Management Committee. Prior to joining LucidWorks in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about...


Monday April 7, 2014 10:55am - 11:45am PDT
Confluence B

11:55am PDT

Introduction to SolrCloud
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.

Speakers

Timothy Potter

Senior Software Engineer, Lucidworks
Timothy Potter is a senior member of the engineering team at Lucidworks and PMC member of the Apache Lucene/Solr project. At Lucidworks, Tim leads a team that builds tools to empower business analysts and data scientists to search, analyze, and visualize large-scale enterprise data...


Monday April 7, 2014 11:55am - 12:45pm PDT
Confluence B

2:00pm PDT

Apache Lucene 4
Apache Lucene is an open-source search engine library written in Java. This talk will give an overview of its current capabilities.

Speakers

Robert Muir

Elasticsearch
Robert Muir is an Apache Lucene/Solr committer and PMC member. He has worked on a variety of Lucene's features, including flexible indexing support, column-stride fields, scoring models, and improved internationalization. He works for Elasticsearch.


Monday April 7, 2014 2:00pm - 2:50pm PDT
Confluence B

3:00pm PDT

Solr's SolrCloud, The State of the Union
With the release of Solr 4.0, a new set of distributed capabilities was introduced under the name SolrCloud in order to prepare Solr for the rise of big data and the trend towards larger and larger indexes that require dozens and even hundreds of servers. In this presentation, Solr committer Mark Miller talks about the state of SolrCloud since the 4.0 release. Mark will discuss the new features and improvements that have been added, talk about the hardening that has occurred, and speculate on some of the features and improvements coming soon and beyond.

Speakers

Mark Miller

Software Engineer, Cloudera
Mark Miller is a Lucene/Solr committer and Apache member. After starting with Lucene in 2006, Mark has spent most of his time getting paid to work on the open source software projects that he loves. Mark has given many talks on Lucene/Solr at various conferences and meet-ups around...


Monday April 7, 2014 3:00pm - 3:50pm PDT
Confluence B

4:00pm PDT

Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata
You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run mapreduce jobs and SQL-on-Hadoop queries. Something is still missing though. After all, we are not expected to enter SQL queries while looking for information on the web. Altavista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how integration of SolrCloud into Apache Bigtop is now enabling building bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.

Speakers

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Monday April 7, 2014 4:00pm - 4:50pm PDT
Confluence B

5:00pm PDT

Hacking Lucene for Custom Search Results
Search is everywhere, and therefore so is Apache Lucene. While Lucene provides amazing out-of-the-box defaults, there are plenty of projects weird enough to require custom search scoring and ranking. In this talk, I'll walk through how to use Lucene to implement your custom scoring and search ranking. We'll see how you can gain amazing power over (and responsibility for) your search results. We'll also see the flexibility of Lucene's data structures and explore the pros and cons of custom Lucene scoring versus other methods of improving search relevancy.

Speakers

Doug Turnbull

Chief Technical Officer, OpenSource Connections
Search relevance consultant. Author of Relevant Search. Doug crafts search/recommendation solutions that “get” users. To do this, Doug uses Solr, sprinkling a little natural language processing and machine learning on top for good measure. Through writing and speaking Doug wants...


Monday April 7, 2014 5:00pm - 5:50pm PDT
Confluence B
 
Tuesday, April 8
 

10:30am PDT

'Shrinking the Haystack' using Apache Solr and OpenNLP
The customers in the Intelligence Community and Department of Defense that ISS services have a big data challenge. The sheer volume of data being produced, and ultimately consumed, by large enterprise systems has grown exponentially in a short amount of time. Providing analysts the ability to interpret meaning and act on time-critical information is a top priority for ISS. In this talk, we will explore our journey into building a search and discovery system for our customers that combines a variety of Apache ecosystem components (Apache Solr, OpenNLP, UIMA, Tika, Jackrabbit, and Hadoop) to enable analysts to "Shrink the Haystack" into actionable information.

Speakers

Wes Caldwell

Chief Architect, Intelligent Software Solutions
Wes is Chief Architect at Intelligent Software Solutions, a leading software company operating primarily in the public sector, headquartered in Colorado Springs, CO. Wes is responsible for technical oversight of a variety of programs at ISS, supporting the Intelligence Community...


Tuesday April 8, 2014 10:30am - 11:20am PDT
Confluence B

11:30am PDT

Deploying and managing SolrCloud in the cloud
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will demonstrate how to provision, configure, and manage a SolrCloud cluster in Amazon EC2, using a Fabric/boto based solution for automating SolrCloud operations. Attendees will come away with a solid understanding of how to operate a large-scale Solr cluster, as well as tools to help them do it. Tim will also demonstrate these tools live during his presentation. Covered technologies include: Apache Solr, Apache ZooKeeper, Linux, Python, Fabric, boto, Apache Kafka, Apache JMeter.

Speakers

Timothy Potter

Senior Software Engineer, Lucidworks
Timothy Potter is a senior member of the engineering team at Lucidworks and PMC member of the Apache Lucene/Solr project. At Lucidworks, Tim leads a team that builds tools to empower business analysts and data scientists to search, analyze, and visualize large-scale enterprise data...


Tuesday April 8, 2014 11:30am - 12:20pm PDT
Confluence B

1:30pm PDT

Hidden Gems: Getting More Out Of Apache Solr
Every day billions of documents are searched, sorted, faceted and highlighted by millions of users who have no idea that behind the scenes, Apache Solr is hard at work, making life simple for developers like you.  But what else can Solr do for you? 
In this session, we'll dive into some of the less well known, less understood, features of Apache Solr that even seasoned Solr developers may not be aware of -- features that can be useful in ways you might not have considered even if you do know about them, so you can take your Solr powered applications to the next level.

Speakers

Chris Hostetter

LucidWorks
Chris 'Hoss' Hostetter is a Member of the Apache Software Foundation, and serves on the Lucene Project Management Committee. Prior to joining LucidWorks in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about...


Tuesday April 8, 2014 1:30pm - 2:20pm PDT
Confluence B

2:30pm PDT

Test Driven Relevancy -- How to Work with Content Experts to Optimize and Maintain Search Relevancy
Getting good search results is hard; maintaining good relevancy is even harder. Fixing one problem can easily create many others. Without good tools to measure the impact of relevancy changes, there's no way to know if the "fix" that you've developed will cause relevancy problems with other queries. Ideally, much like we have unit tests for code to detect when bugs are introduced, we would like to create ways to measure changes in relevancy. This is exactly what we've done at OpenSource Connections. We've developed a tool, Quepid, that allows us to work with content experts to define metrics for search quality. Once defined, we can instantly measure the impact of modifying our relevancy strategy, allowing us to iterate quickly on very difficult relevancy problems. Get an in-depth look at the tools we use not only to solve a relevancy problem, but to make sure it stays solved!

Speakers

Doug Turnbull

Chief Technical Officer, OpenSource Connections
Search relevance consultant. Author of Relevant Search. Doug crafts search/recommendation solutions that “get” users. To do this, Doug uses Solr, sprinkling a little natural language processing and machine learning on top for good measure. Through writing and speaking Doug wants...


Tuesday April 8, 2014 2:30pm - 3:20pm PDT
Confluence B

3:45pm PDT

Building next generation, personalized search applications
Building personalized search experiences is critical in today’s competitive landscape as search applications become more ubiquitous. Showing relevant results based on external query data can mean the difference between a monetizable action and a disengaged user. In this talk, we will show how to build a dynamic personalized search application using Apache Solr, Hadoop, and HBase.

Speakers

Amit Nithianandan

Technical Staff, Wibidata
Amit Nithianandan is currently a Member of Technical Staff at Wibidata as a contributor to the Kiji project, Wibidata’s open source framework for building Big Data applications on Hadoop and HBase. Prior to joining Wibidata, Amit was the lead engineer for search and data at Zvents...


Tuesday April 8, 2014 3:45pm - 4:35pm PDT
Confluence B

4:45pm PDT

Native Code and Off-Heap Data-structures for Solr
Off-heap data structures and native code performance improvements for Apache Solr are being developed as part of the Heliosearch project.  This presentation will cover the reasons behind these features, implementation details, and performance impacts.  Other recent Solr/Heliosearch features such as deep paging and new analytics capabilities will also be covered.

Speakers

Yonik Seeley

Search Engineer, Cloudera
Yonik Seeley is the creator of Solr. He works at Cloudera integrating and leveraging "Big Search" technologies into their advanced platform for machine learning and analytics. Yonik was a co-founder of LucidWorks, and he holds a master's degree in computer science from Stanford U...


Tuesday April 8, 2014 4:45pm - 5:35pm PDT
Confluence B
 
Wednesday, April 9
 

9:00am PDT

Secrets of Apache Tika
The Apache Tika toolkit comes with many advanced features that are often overlooked. Structured text, language detection, MIME type inference, XMP metadata, JVM forking, and other secrets are there just waiting to be used. This presentation covers many of these often undocumented features of Tika, and shows how they can be used to solve real-world problems.

Speakers

Jukka Zitting

Senior Developer, Adobe Systems
Jukka Zitting is an experienced open source developer who works on various Java technologies related to content management. He's a key member of Apache Jackrabbit and Tika, and a frequent contributor to many other projects. In addition to his role as a developer, Jukka frequently...


Wednesday April 9, 2014 9:00am - 9:50am PDT
Confluence B

11:15am PDT

Building your big data search stack with Apache Nutch 2.x
In this tutorial Lewis John McGibbney encourages you to join him in building your own customized search stack capable of handling enormous data volumes. Although the tutorial is focused on Apache Nutch 2.x, we will also be using source code from Apache Gora, an open source framework which provides an in-memory data model and persistence for big data, and which acts as an object (WebPage or Host) to datastore mapping framework for crawl data. Apache Nutch 2.x differs from the Nutch 1.x branch in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

Speakers

Lewis J. McGibbney

Chair, ESIP Semantic Technologies Committee, NASA, JPL
My name is Lewis John McGibbney. I am currently a Data Scientist at the NASA Jet Propulsion Laboratory in Pasadena, California where I work in Computer Science and Data Intensive Applications. I enjoy floating up and down the tide of technologies @ The Apache Software Foundation having...


Wednesday April 9, 2014 11:15am - 12:05pm PDT
Confluence B

1:15pm PDT

What's with the 1s and 0s? Making sense of binary data at scale with Tika and friends
If you have one or two files, you can take the time to manually work out what they are, what they contain, and how to get the useful bits out (probably....). However, this approach really doesn't scale, mechanical turks or no! Luckily, there are Apache projects out there which can help!

In this talk, we'll first look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how to do all of this with Apache Tika, and how to dive down to the underlying libraries (including its Apache friends like POI and PDFBox) for specialist cases. Finally, we'll look a little at how to roll this all out on a Big Data or Large-Search case.

Speakers

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! Most of the projects Nick has worked in belong in the "Content" space, such as Apache POI (ex-PMC Chair), Apache Tika and Apache Chemistry. As well as coding projects, Nick is also involved in a number...


Wednesday April 9, 2014 1:15pm - 2:05pm PDT
Confluence B

2:15pm PDT

Allura - A Gentle Introduction
Allura, An Open Source Software Forge (Wayne Witzel III, SourceForge.net) - The code that powers the developer tools at SourceForge is called Allura and it is completely open source. Wayne Witzel will show you how you can get up and running with your own instance of Allura at your organization, a quick overview of the various tools such as Git, Hg, Svn, Tickets, Wiki, and Discussion and how we use them at SourceForge. Wayne will also give an Allura community update and provide you with information on where you can find documentation and basic installation instructions as well as how you can get in touch with the core development team with feedback and suggestions.

Speakers

Wayne Witzel III

Software Engineer, Canonical, Ltd.
Wayne Witzel III resides in Florida, USA and is currently working for Canonical, Ltd. as a Software Engineer. He is a core developer for the Apache Allura (incubating) project and a member of the Apache Allura Podling PMC. He can be reached at @wwitzel3.


Wednesday April 9, 2014 2:15pm - 3:05pm PDT
Confluence B

3:15pm PDT

Diving Deeper into Allura
Apache Allura is a fully open source development platform, providing ticket tracking, wiki, git, svn, hg, blog, etc. Dave will show you how you can run and use Allura, and explain setting up its neighborhoods and projects to best suit your own projects. Allura and its toolset are very flexible - Dave will dive into configuration and features of each tool and the permission system, as well as introduce the extension points for custom themes, authentication, and entire tools so that you know how to put Allura to work for yourself or your organization.

Allura could potentially be used by many projects at the Apache Software Foundation.  Dave will explore these possibilities by explaining Allura’s import and export functionality and how Allura could be a good fit for interested Apache projects.  How would you use Allura at Apache?

Speakers

Dave Brondsema

Principal Software Engineer, SourceForge
Dave Brondsema is a Principal Python engineer at SourceForge.net. His team uses and contributes to Apache Allura, the Open Source forge platform, to provide the developer tools for hundreds of thousands of SourceForge projects. He works remotely from his home in Grand Rapids, MI and...


Wednesday April 9, 2014 3:15pm - 4:05pm PDT
Confluence B