Wow! What a fantastic lineup! This is going to be a great day.
Friday, Dec 5th
Registration and Breakfast
Welcome and Introduction
Opening Keynote by Doug Cutting
SOLR Lessons Learnt by Rishi Easwaran
Translate from MapReduce to Spark by Roger Ding
Big Data in the 2014 Midterm Elections by Azarias Reda
Open Source Graph Analysis by Jeff Kunkle
Vagrant vs Docker by Jonathan Chase
Simple Linux Tuning to Improve Hadoop Performance by Alex Moundalexis
Real-Time Inverted Search by Bryan Bende
Predicting Driver Behavior with Big Data Analytics by Marshall Presser and Bjorn Boe
Mapping the Brain at Scale with Spark by Jeremy Freeman
Closing Keynote by Dr. Kirk Borne
Gather at the Watering Hole to discuss how awesome Elephant Talk was!
Opening Keynote by Doug Cutting
Doug Cutting is the founder of numerous successful open source projects, including Apache Lucene, Apache Nutch, Apache Avro, and Apache Hadoop. Doug joined Cloudera in 2009 from Yahoo!, where he was a key member of the team that built and deployed a production Hadoop storage and analysis cluster for mission-critical business analytics. Doug holds a Bachelor’s degree from Stanford University and sits on the Board of the Apache Software Foundation.
Closing Keynote by Dr. Kirk Borne
Data Science: From NASA Astrophysics to Digital Marketing to "Lost in Space": The analytics profession may disappear as quickly as it has appeared. The reason: the art and science of learning from data, deriving insights from data, driving decisions with data, and building revenue streams from data are all becoming the new normal in today's digital workplace. Digital science, digital business, digital government, digital health, digital warfare, digital education, digital life, etc. are the norm, not the exception. Analytics professionals will become blended into the workforce seamlessly, in the same way that every worker now works with I.T., without being labeled an "I.T. professional". I will describe numerous data science pathways that I have followed and experienced in the past 15 years that have been as varied as they have been remarkable (from NASA Astrophysics to Data Science Education to Marketing Analytics, and more), with a view toward where these types of analytical paths might be leading all of us.
Dr. Kirk D. Borne is a Data Scientist and Professor of Astrophysics & Computational Science at George Mason University (since 2003). He received his PhD in Astronomy from Caltech. He conducts research, teaching, and doctoral student advising in the theory and practice of data science. He has also been an active consultant to numerous organizations, institutions, and agencies in data science and big data analytics. He previously spent nearly 20 years supporting large scientific data systems for NASA astrophysics missions, including the Hubble Space Telescope. He was identified in 2013 as the Worldwide #1 Big Data influencer on Twitter at @KirkDBorne.
Mapping the Brain at Scale with Spark by Jeremy Freeman
To understand the brain, we need to record from as much of it as possible. New technologies are enabling large-scale recordings in awake, behaving animals, but these techniques generate massive and complex data sets. Among big data platforms, Spark is ideally suited to the unique demands of scientific analytics. I will describe Thunder, an open-source library we have developed on top of Spark for analyzing large-scale spatial and temporal neural data. It's fast to run, easy to extend, and designed for interactivity and visualization.
Jeremy Freeman is a neuroscientist at HHMI Janelia Research Center (@HHMIJanelia) using computation to understand the brain. He’s passionate about brains, behavior, analytics, big data, Spark, visualization, and open science.
Translate from MapReduce to Spark by Roger Ding
Spark has become one of the hottest technologies in Big Data and is the leading candidate to replace Hadoop MapReduce. Why? Because porting a MapReduce program to Spark immediately results in a performance boost.
The presentation gives a brief introduction to MapReduce and Spark programming. Then a MapReduce project (processing a massive number of XML files) is used as an example to demonstrate how to translate a MapReduce program into a Spark program.
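The flavor of such a translation can be sketched in plain Python (a toy model for illustration only, not the talk's actual code or the Spark API): MapReduce forces an explicit map, shuffle, and reduce structure with intermediate key-value pairs, while the Spark style expresses the same word-count job as a chain of transformations, mirroring `rdd.flatMap(...).map(...).reduceByKey(...)` in PySpark.

```python
from collections import defaultdict
from functools import reduce

docs = ["spark replaces mapreduce", "mapreduce to spark"]

# MapReduce style: explicit map, shuffle, and reduce phases, with the
# intermediate (word, 1) pairs fully materialized between phases.
def mr_word_count(docs):
    mapped = [(w, 1) for d in docs for w in d.split()]        # map phase
    shuffled = defaultdict(list)                              # shuffle phase
    for key, value in mapped:
        shuffled[key].append(value)
    return {k: reduce(lambda a, b: a + b, vs)                 # reduce phase
            for k, vs in shuffled.items()}

# Spark style: the same job written as one chained pipeline, the way
# flatMap + reduceByKey would express it on an RDD.
def spark_word_count(docs):
    counts = {}
    for w in (w for d in docs for w in d.split()):            # flatMap
        counts[w] = counts.get(w, 0) + 1                      # reduceByKey(add)
    return counts

print(mr_word_count(docs))
print(spark_word_count(docs))
```

Both produce identical counts; the difference the talk highlights is that Spark lets the whole pipeline live in one program and keeps intermediate data in memory rather than spilling it between job phases.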
Roger Ding is a Solutions Consultant at Cloudera, where he loves working with distributed computing technologies. He worked as a software engineer for 15 years before joining Cloudera.
SOLR Lessons Learnt by Rishi Easwaran
Our goal was to design a highly scalable, fault-tolerant, easily manageable, and robust system that provides extremely effective and efficient search capabilities, handles very high transaction volume, and is built on an open source solution. We'll cover why we opted to use SOLR, with a brief overview of our old multicore architecture: its performance, scalability, availability, and pain points.
Then we'll dive into our new Multi-Tier Hybrid SOLR Cloud architecture, its performance, availability, scalability, and cost benefits, and the lessons learnt using SOLR to provide mailbox search for AOL Mail users.
Rishi Easwaran is a Principal Software Engineer who has worked with the AOL Mail team since 2008. He is the tech lead for the existing mail search infrastructure and for the re-architecture effort to implement a multi-tiered, highly available, large-scale Hybrid SOLR Cloud solution. Over the years the entire team has focused on the performance, availability, and scalability of SOLR, and in the process dove into the source code, gaining close working knowledge of and experience with SOLR.
Big Data in the 2014 Midterm Elections by Azarias Reda
Azarias Reda is the first ever Chief Data Officer of the Republican National Committee, and built the data science and engineering team that played an important role in helping the Republican Party win a historic majority in the US Senate. He earned his PhD in computer science from the University of Michigan, and previously founded Meritful – an enterprise software startup based in Austin. Prior to Meritful, Azarias worked in the Search and Network Analytics group at LinkedIn where he built Metaphor, the search recommendation engine used by LinkedIn search. Azarias was born in Ethiopia and currently lives in Washington DC.
Open Source Graph Analysis by Jeff Kunkle
Lumify is a relatively new open source platform for big data analysis and visualization, designed to help organizations derive actionable insights from the large volumes of diverse data flowing through their enterprise. Utilizing popular big data tools like Hadoop, Accumulo, and Storm, it ingests and integrates many kinds of data, from unstructured text documents and structured datasets, to images and video. Several open source analytic tools (including Tika, OpenNLP, CLAVIN, OpenCV, and ElasticSearch) are used to enrich the data, increase its discoverability, and automatically uncover hidden connections. All information is stored in a secure graph database implemented on top of Accumulo to support cell-level security of all data and metadata elements. A modern, browser-based user interface enables analysts to explore and manipulate their data, discovering subtle relationships and drawing critical new insights. In addition to full-text search, geospatial mapping, and multimedia processing, Lumify features a powerful graph visualization supporting sophisticated link analysis and complex knowledge representation.
This talk will blend a high-level use case demo with a more technical presentation of Lumify's underpinnings, focusing on its use of Accumulo to implement fine-grained access control of the graph data.
Jeff Kunkle is the Director of Research and Development at Altamira Technologies, responsible for overseeing the company’s internal research and development investments. He has an MS in Systems Engineering from Virginia Tech and a BS in Electrical Engineering from Penn State University. As a technologist, he has spent nearly fifteen years helping clients build Agile-focused development teams and create big data, web, network, and mobile applications, most of that time supporting the IC/DoD. His most recent role is as the project lead for Lumify, an open source big data analysis and visualization platform.
Simple Linux Tuning to Improve Hadoop Performance by Alex Moundalexis
Running Hadoop isn't easy. Many clusters suffer from configuration problems that can negatively impact performance. With vast and sometimes confusing configuration options, it can be scary to make changes to Hadoop when performance isn't as expected. Learn how to improve Hadoop performance and eliminate common problem areas using a handful of simple and safe Linux configuration changes.
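The kinds of safe, commonly recommended changes this talk refers to typically include lowering `vm.swappiness`, disabling transparent huge pages, and mounting data disks with `noatime`. As an illustration only (this helper, its name, and its thresholds are assumptions, not taken from the talk), a small checker over a snapshot of kernel settings might look like:

```python
# Hypothetical helper: flag Linux settings commonly recommended against
# for Hadoop worker nodes. Setting names mirror sysctl/mount conventions,
# but the function and thresholds are illustrative assumptions.
def check_tuning(settings):
    warnings = []
    # Aggressive swapping can evict JVM heap pages and stall tasks.
    if settings.get("vm.swappiness", 60) > 1:
        warnings.append("set vm.swappiness to 0 or 1")
    # Transparent huge page defrag is known to cause CPU spikes under Hadoop.
    if settings.get("transparent_hugepage") == "always":
        warnings.append("disable transparent huge pages")
    # noatime avoids a metadata write on every block read.
    if "noatime" not in settings.get("mount_opts", []):
        warnings.append("mount data disks with noatime")
    return warnings

# A stock Linux install trips all three checks:
print(check_tuning({"vm.swappiness": 60,
                    "transparent_hugepage": "always",
                    "mount_opts": []}))
```

The point the abstract makes stands: each of these is an OS-level change, applied outside Hadoop's own configuration files, and safe to roll out one node at a time.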
Alex Moundalexis is a Solutions Architect for Cloudera Government Solutions who spends his time installing and configuring Hadoop clusters across the United States for a variety of commercial and government customers. Before entering the land of Big Data, Alex spent the better part of ten years wrangling Linux server farms and writing Perl code as a contractor to the Department of Defense and Department of Justice. Alex attended Polytechnic Institute of New York University where he attained an M.S. in Cybersecurity and Rochester Institute of Technology where he attained a B.S. in Information Technology. He likes shiny objects.
Predicting Driver Behavior with Big Data Analytics by Marshall Presser and Bjorn Boe
Managing risk is at the heart of the insurance industry. In this talk, we'll solve the problem of assessing driver behavior with a case study using telematic data processed with Python, Java, Map/Reduce, SQL on Hadoop (HAWQ), Redis, and Cloud Foundry to assist an insurance company in mapping trips by their policy holders and using that data to predict driver behavior. The audience should have a basic understanding of Hadoop.
Marshall Presser is Federal Field CTO for Pivotal, a company building Platform as a Service for doing Big Data Analytics, deriving insights from the data, and providing a platform for productizing those insights for end users. Prior to coming to Pivotal, he worked in parallel computing for scientific and business applications, with a stint in compiler and OS development.
Bjorn Boe works as a Field Engineer at Pivotal, working with customers to solve challenges related to cloud, big data and Internet of Things. He has a background in software development and architecture, distributed systems and databases.
Vagrant vs Docker by Jonathan Chase
But it works on my dev box! How many times have we heard this answer when the app works on one machine, but fails on another? This is the problem that led my team to use Vagrant to gain consistency between environments. However, could Docker be even better? This talk gives an introduction to Vagrant and Docker and explores how they compare.
Jonathan Chase is a software engineer at BTI360, where he loves teaching and learning from engineers that are continually striving to develop their skills. He has an MS in Computer Science from Johns Hopkins University and a BS in Computer Science from Georgia State University. Jonathan is passionate about creating software solutions that have a meaningful impact on people's lives. His current work leverages Hadoop, Solr, AngularJS, and D3JS to enable data discovery and analysis.
Real-Time Inverted Search by Bryan Bende
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
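The abstract's core idea is that inverted search flips the usual flow: queries are stored, and each incoming document is matched against all of them. The talk's framework does this with in-memory Lucene indices distributed across a Storm topology; the following is only a toy pure-Python sketch of the idea (class name, AND-of-terms query model, and subscription IDs are all illustrative assumptions):

```python
# Toy inverted-search matcher. The real system described in the talk
# builds in-memory Lucene indices from a Solr configuration and runs
# them inside a Storm topology; this sketch keeps only the core idea.
class InvertedSearch:
    def __init__(self):
        self.subscriptions = {}  # subscription id -> set of required terms

    def subscribe(self, sub_id, query):
        # A "query" here is just a bag of required terms (AND semantics);
        # Lucene/Solr would support the full query syntax instead.
        self.subscriptions[sub_id] = set(query.lower().split())

    def match(self, document):
        # Tokenize the single incoming document, then evaluate every
        # stored query against it -- the inverse of a normal search engine.
        terms = set(document.lower().split())
        return [sid for sid, q in self.subscriptions.items() if q <= terms]

engine = InvertedSearch()
engine.subscribe("alerts-1", "hadoop storm")
engine.subscribe("alerts-2", "solr cloud")
print(engine.match("storm topology processing hadoop events"))
```

With tens of thousands of subscriptions, the `match` loop is exactly what the talk parallelizes: partition the subscriptions across Storm workers and fan each incoming document out to all partitions.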
Bryan Bende is a Senior Lead Engineer in the Strategic Innovation Group at Booz Allen Hamilton. He focuses on building solutions for clients involving big data, distributed systems, and information retrieval. Bryan received a B.S. in Computer Science from the University of Maryland at College Park and an M.S. in Computer Science from Johns Hopkins University.