pre-release: Continuum meeting announcement

Please take a moment to review your details and reply with OK or edits.
The subject line and everything below it is what will go out; it will also be used to title the videos.

Subject: 
ANN: Continuum at A1 Mon March 18, 9a


Continuum
=========================
When: 9 AM Monday March 18, 2013
Where: A1


Topics
------
1. HDF5 is for Lovers
Anthony Scopatz

HDF5 is a hierarchical, binary database format that has become a de facto
standard for scientific computing. While the specification may be used in a
relatively simple way (persistence of static arrays), it also supports several
high-level features that prove invaluable. These include chunking, ragged
data, extensible data, parallel I/O, compression, complex selection, and
in-core calculations. Moreover, HDF5 bindings exist for almost every language,
including two Python libraries (PyTables and h5py).

  
This tutorial will discuss tools, strategies, and hacks for really squeezing
every ounce of performance out of HDF5 in new or existing projects. It will
also go over fundamental limitations in the specification and provide creative
and subtle strategies for getting around them. Overall, this tutorial will
show how HDF5 plays nicely with all parts of an application making the code
and data both faster and smaller. With such powerful features at the
developer's disposal, what is not to love?!

  
This tutorial is targeted at a more advanced audience with prior knowledge
of Python and NumPy. Knowledge of C or C++ and basic HDF5 is
recommended but not required.

  
This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables
2.3+. ViTables and matplotlib are also recommended. These may all be found in
Linux package managers and are also available through EPD or easy_install;
ViTables may need to be installed independently.
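
For a taste of these features, here is a minimal sketch (not taken from the
tutorial materials) using the PyTables 2.x API listed above to build a
chunked, compressed, extensible array; the file and dataset names are invented
for illustration:

    import numpy as np
    import tables

    # Open a new HDF5 file (PyTables 2.x API, matching the versions above).
    h5file = tables.openFile("experiment.h5", mode="w")

    # An extensible array: the 0 in the shape marks the growable dimension.
    # Blosc compression and chunking are handled transparently.
    filters = tables.Filters(complevel=5, complib="blosc")
    earray = h5file.createEArray(h5file.root, "measurements",
                                 tables.Float64Atom(), shape=(0, 100),
                                 filters=filters)

    # Append data as it arrives; the dataset grows on disk.
    for _ in range(10):
        earray.append(np.random.random((1000, 100)))

    h5file.close()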
 recording release: yes license:   

2. Intro to NumPy
Bryan Van de Ven

Tutorial
 recording release: yes license:   

3. Pandas
Wes McKinney

Tutorial
 recording release: yes license:   

4. IPython-parallel
Min Ragan-Kelley

IPython is a great tool for doing interactive exploration of code and data.
IPython.parallel is the part of IPython that enables interactive exploration of
parallel code, and aims to make distributing your work on local clusters or
AWS simple and straightforward. The tutorial will cover the basics of getting
IPython.parallel up and running in various environments, and how to do
interactive and asynchronous parallel computing with IPython. Some of
IPython's cooler interactive features will be demonstrated, such as
automatically parallelizing code with magics in the IPython Notebook and
interactive debugging of remote execution, all with the help of real-world
examples.
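
For readers who have not used it, a minimal sketch of the IPython.parallel
workflow looks roughly like this (it assumes a cluster already started with,
e.g., ipcluster start -n 4):

    from IPython.parallel import Client

    rc = Client()      # connect to the running controller
    dview = rc[:]      # a "direct view" on all engines

    # Synchronous parallel map across the engines.
    squares = dview.map_sync(lambda x: x ** 2, range(32))

    # Asynchronous execution returns immediately with a handle.
    ar = dview.apply_async(sum, range(10 ** 6))
    print(ar.get())    # block until the remote result is ready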
 recording release: yes license:   

5. Beautiful Plots With Matplotlib
Mike Muller

When it comes to plotting with Python many people think about matplotlib. It
is widely used and provides a simple interface for creating a wide variety of
plots from very simple diagrams to sophisticated animations. This tutorial is
a hands-on introduction that teaches the basics of matplotlib. Students will
learn how to create publication-ready plots with just a few lines of Python.
Students should have a working knowledge of Python. NumPy knowledge is helpful
but not required.
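
As a flavor of what the tutorial covers, a publication-ready figure really can
take just a few lines (a generic example, not from the course materials):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.plot(x, np.cos(x), "--", label="cos(x)")
    plt.xlabel("x")
    plt.ylabel("amplitude")
    plt.legend()
    plt.savefig("curves.pdf")   # vector output, ready for print
    plt.show()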
 recording release: yes license:   

6. Bayesian Machine Learning & Python – Naïve Bayes
Krishna Sankar

Bayesian algorithms are employed in machine learning tasks including
classification, collaborative filtering & recommendation engines. This
tutorial introduces Naïve Bayes and its application to classification
problems. We will work through a few problems using pencil & paper as well as
Python programming. We will use nltk & publicly available data. It will be a
hands-on tutorial; the GitHub URL for the code & slides is http://goo.gl/2DOQX.
Please install nltk & download the data beforehand.
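
To see the shape of the API involved, here is the classic name-gender
classifier from the nltk documentation (illustrative only; not necessarily the
data set used in the tutorial):

    import random
    import nltk
    from nltk.corpus import names   # fetch with nltk.download('names')

    def gender_features(word):
        # Naive Bayes works on dictionaries of features.
        return {"last_letter": word[-1]}

    labeled = ([(n, "male") for n in names.words("male.txt")] +
               [(n, "female") for n in names.words("female.txt")])
    random.shuffle(labeled)

    featuresets = [(gender_features(n), g) for (n, g) in labeled]
    train_set, test_set = featuresets[500:], featuresets[:500]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(gender_features("Neo")))
    print(nltk.classify.accuracy(classifier, test_set))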
 recording release: yes license:   

7. Social Network Analysis
Katherine Chuang

Tutorial
 recording release: yes license:   

8. scikit-image
Davin Potts

(Needs description.) 
 recording release: yes license:   

9. Creating Interactive Applications in Matplotlib
Jake Vanderplas

Matplotlib is the leading scientific visualization tool for Python. Though its
ability to generate publication-quality plots is well-known, some of its more
advanced features are less-often utilized. In this tutorial, we will explore
the ability to create custom mouse- and key-bindings within matplotlib plot
windows, giving participants the background and tools needed to create simple
cross-platform GUI applications within matplotlib. After going through the
basics, we will walk through some more intricate scripts, including a simple
MineSweeper game and a 3D interactive Rubik's cube, both implemented entirely
in Matplotlib.
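
The machinery in question is matplotlib's event API; a minimal sketch (far
simpler than the talk's examples) binds a mouse click and a key press:

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.set_title("click to add points; press 'c' to clear")
    points, = ax.plot([], [], "o")
    xs, ys = [], []

    def on_click(event):
        if event.inaxes is ax:          # ignore clicks outside the axes
            xs.append(event.xdata)
            ys.append(event.ydata)
            points.set_data(xs, ys)
            fig.canvas.draw()

    def on_key(event):
        if event.key == "c":
            del xs[:], ys[:]
            points.set_data(xs, ys)
            fig.canvas.draw()

    fig.canvas.mpl_connect("button_press_event", on_click)
    fig.canvas.mpl_connect("key_press_event", on_key)
    plt.show()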
 recording release: yes license:   

10. Learning Python
Peter Norvig

There has been a recent flurry of websites and online courses designed to
teach more people to be competent at programming computers. In this talk we
look at some of the data that is available on what works and what doesn't
work, and in particular at the features of Python that come into play when it
is used as a vehicle for learning programming and computer science.
 recording release: yes license:   

11. Python in an Evolving Enterprise System
Angelica Pando, Dave Himrod, Steve Kannan

Our data pipeline is growing like crazy, processing more than 30 terabytes of
data every day and more than tripling in the last year alone. In 2011, we
moved our data pipeline to a Hadoop stack in order to enable horizontal
scalability for future growth. Our optimization tools used for data
exploration, aggregations, and general data hackery are critical for updating
budgets and optimization data. However, these tools are built in Python, and
integrating them with our Hadoop data pipeline has been an enormous challenge.
Our continued explosive growth demands increased efficiency, whether that's in
simplifying our infrastructure or building more shared services. Over the past
few months, we evaluated multiple solutions for integrating Python with Hadoop
including using Hadoop Streaming, Pig with Jython UDFs, writing MapReduce in
Jython, and of course, why not just do it in Java? In our talk, we'll explore
the different Python-Hadoop integration options, share our evaluation process
and best practices, and invite an interactive dialogue of lessons learned.
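
Of the options above, Hadoop Streaming is the simplest to picture: any pair of
scripts that read stdin and write tab-separated key/value pairs to stdout can
serve as mapper and reducer. A generic word-count sketch (not AppNexus code):

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py: Hadoop sorts by key between the stages, so equal
    # words arrive adjacent; group them and sum the counts.
    import sys
    from itertools import groupby

    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(v) for _, v in group)))

The same two scripts can be tested locally with
cat input | ./mapper.py | sort | ./reducer.py before being handed to the
hadoop-streaming jar.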
 recording release: yes license:   

12. Intro to Network Science
Christopher Roach

In 1967 sociologist Stanley Milgram began a series of experiments into the
"small world problem" that would firmly cement the phrase "six degrees of
separation" within the popular culture. Because of these experiments, nearly
all of us today have heard that we are simply a few handshakes away from
anyone in the world. Indeed, it's a popular pastime amongst academics to
figure out their Erdos number and, amongst the rest of us, to calculate a
favorite actor's Bacon number. Fast forward to today and the world seems even
smaller. With the internet connecting all of us to one another at the speed of
light, and social networks such as Twitter and Facebook creating communities
that quite literally span the globe, this new era in connectedness has given
us a wealth of data about how we interact with one another. There's hardly
anyone in the tech community today who hasn't heard of social network
analysis, but this combination of sociology, computer science, and mathematics
has significance beyond just the analysis of social networks.

  
Between nearly any set of entities a relationship can be found, and thus a
network can be made, from which the inner workings of those relationships can
be studied. The still nascent field of network science is quickly becoming THE
science of the 21st century and this talk will introduce this budding field
and demonstrate how tools such as NetworkX and Matplotlib make it possible for
Pythonistas to make meaningful contributions or simply analyze their own
popularity on Twitter.

  
The goal of this talk is to give the attendees a basic understanding of what
network science is and what it can be used for, as well as demonstrate its use
in a specific scenario. During the course of this talk we'll walk through a
proper definition of a network and introduce some of the jargon necessary to
converse with others working in the field. We'll also take a look at some of
the statistical properties of networks and how to use them to analyze our own
networks. Finally, we'll look at a specific example of the application of
network science principles on a real life social network. By the end of the
talk, attendees should feel comfortable enough with the field of network science
to be able to start analyzing their own networks of data.
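
As a hint of what that looks like in practice, NetworkX makes the basic
statistical measures one-liners (here on its built-in karate-club social
network):

    import networkx as nx
    import matplotlib.pyplot as plt

    # Zachary's karate club: a classic small social network.
    G = nx.karate_club_graph()

    print(G.number_of_nodes(), G.number_of_edges())
    print(nx.average_clustering(G))       # how cliquish is the network?
    print(nx.degree_centrality(G)[33])    # centrality of one member

    nx.draw(G, node_size=50)
    plt.savefig("karate.png")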
 recording release: yes license:   

13. UV-CDAT Resharable Analyses and Diagnostics
Charles Doutriaux

Some of today’s greatest challenges to the scientific community are “big
data”, “reproducibility/transparency” and “code sharing”. The state-of-the-art
Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT) environment
addresses the first two issues with new visualizations and techniques to
address big data and provenance. This talk addresses code re-sharing and re-
distribution by introducing the UV-CDAT Re-sharable Analyses and Diagnoses
(U-ReAD). U-ReAD will offer scientists a complete set of tools (framework)
based on the Python programming language along with a code repository.
U-ReAD’s goal is to use structured documentation to help build the interface
between UV-CDAT and a diagnostic, with few or no changes to the original code.
This framework will allow scientists to quickly and seamlessly re-implement
their diagnostics so that they will fit perfectly into the UV-CDAT
environment. As a result U-ReAD-enhanced diagnostics will be automatically
provenance-enabled, making it easy to reproduce any set of results exactly and
transparently, a crucial functionality considering today’s increased scrutiny
toward scientific results.

  
This talk aims to demonstrate how easy it can be to plug any diagnostic into
UV-CDAT using U-ReAD. We will show how few changes are necessary to create
these plugins and how “augmented” the diagnostics are in return.

  
U-ReAD’s developers also hope to create a central repository of U-ReAD-
enhanced tools so that scientists can easily share their tools. This talk will
show what is in store along these lines. http://u-read.llnl.gov
 recording release: yes license:   

14. Blaze
Travis Oliphant

(Needs description.) 
 recording release: yes license:   

15. Disco: Not Just MapReduce Any More
Prashanth Mundkur

The goal of Disco has been to be a simple and usable implementation of
MapReduce. To keep things simple, this MapReduce aspect has been hard-coded
into Disco, both in the Erlang job scheduler, as well as in the Python
library. To fix various issues in the implementation, we decided to take a
cold hard look at the dataflow in Disco's version of MapReduce. We came up
with a generalization that should be more flexible and hence also more useful
than plain old MapReduce. We call this the Pipeline model, and we hope to use
this in the next major release of Disco. This will implement the old MapReduce
model in terms of a more general programmable pipeline, and also expose the
pipeline to users wishing to take advantage of the optimization opportunities
it offers.

  
If time permits, we will also discuss other aspects of the Disco roadmap, and
the future of the Disco project.
 recording release: yes license:   

16. PyCascading for Intuitive Flow Processing With Hadoop
Gabor Szabo

(Needs description.) 
 recording release: yes license:   

17. Wise.io a Machine-Learning Platform
Henrik Brink

At wise.io we are building a machine-learning platform that makes efficient
and accurate learning algorithms available in an easy-to-use service. In this
presentation, I will describe how the platform works and how we're using
Python to make it scalable and accessible.

  
Machine learning is an active field of data science, where sophisticated
models are "trained" on data and used to enable human-like cognition in data
analysis pipelines and data-heavy applications. Data scientists need the most
efficient and most accurate machine-learning implementations, while developers
need on-ramps that make it easy to incorporate machine-learning into their
applications.

  
Highlights of our platform include one-step data ingestion and model building,
validation, hosting, integration and sharing. A domain intelligence
"marketplace" enables domain-specific knowledge to be incorporated in a model
with a click (or a "git push") and is scaled automatically to handle large
datasets. We use Python and a range of cloud and data frameworks to make this
possible, including Anaconda, PiCloud, Pandas and PyTables.
 recording release: yes license:   

18. Practical Time Series Modeling and Analysis
Chang She

Exploratory analysis and predictive modeling of time series is an enormously
important part of practical data analysis. From basic processing and cleaning
to statistical modeling and analysis, Python has many powerful,
high-productivity tools for manipulating and exploring time series data using
numpy, pandas, and statsmodels.

  
We will use practical code examples to illustrate important topics such as:

  
- resampling  
- handling of missing data  
- intraday data filtering  
- moving window computations  
- analysis of autocorrelation  
- predictive time series models  
- time series visualizations
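
For a sense of scale, most of the topics above are one-liners in the pandas of
this era (the resampling and moving-window spellings have since changed; this
is an illustrative sketch, not the tutorial code):

    import numpy as np
    import pandas as pd

    # Ten days of minutely data, with holes punched in it.
    idx = pd.date_range("2013-03-18", periods=10 * 24 * 60, freq="T")
    ts = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)
    ts[::97] = np.nan

    filled = ts.fillna(method="ffill")        # handle missing data
    daily = filled.resample("D", how="mean")  # downsample to daily means
    smooth = pd.rolling_mean(filled, 60)      # one-hour moving window
    lag1 = filled.corr(filled.shift(1))       # autocorrelation at lag one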
 recording release: yes license:   

19. Scaling Machine Learning in Python
Olivier Grisel

In this talk we will introduce the typical predictive modeling tasks on
"not-so-big-data-but-not-quite-small-either" datasets that benefit from
distributing the work over several cores or nodes in a small cluster (e.g.
20 * 8 cores).

  
We will talk about cross validation, grid search, ensemble learning, model
averaging, numpy memory mapping, Hadoop or Disco MapReduce, MPI AllReduce and
disk & memory locality.

  
We will also feature some quick demos using scikit-learn and IPython.parallel
from the notebook on a spot-instance EC2 cluster managed by StarCluster.
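
Grid search is the most embarrassingly parallel of these tasks; with
scikit-learn the fan-out over local cores is a single argument (a toy example,
not the talk's demo, and the import path has moved in later scikit-learn
releases):

    from sklearn.datasets import load_digits
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV

    digits = load_digits()
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}

    # n_jobs=-1 distributes the (fold x parameter) grid over all cores.
    search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
    search.fit(digits.data, digits.target)
    print(search.best_params_, search.best_score_)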
 recording release: yes license:   

20. Introduction to Marinexplore
André Karpištšenko

Marinexplore is creating a spatio-temporal data warehouse for the planet's ocean
data. Our focus on a vertical allows us to optimize the entire technology
stack from collecting a datapoint with a sensor to acting based on analytics.

  
We collaborate globally with leading organizations like World Ocean Council,
NOAA and Cornell University to integrate point measurements, gridded, swath,
sweep, and acoustic datasets. As a result, Marinexplore is building the
largest and highest-quality footprint on the web for open ocean data. In a
closed enterprise program, we work with companies from oil & gas, shipping,
risk management, and hardware development.

  
The talk will share how the open data product is built from a technological
and organizational perspective. The explosion of environmental data is
creating challenges in data collection, distribution, analysis, and
collaboration. Our cloud-based solution relies heavily on Python throughout
the technology stack, with parts of the system being prepared for open
sourcing.
 recording release: yes license:   

21. Thin Client Data Science
Josh Levy

The Data Science team at Vast builds data products informed by the behavior of
consumers making big purchases. Our big data is billions of user interactions
with millions of pieces of inventory. Recently we have adopted a data
processing, analysis, and visualization environment based on remote access to
IPython Notebook hosted by a powerful compute server.

  
Our Data Science environment is inspired by a development environment proposed
by blogger Mark O'Connor. O'Connor advocates using an iPad as a thin client to
connect to a more powerful server in the cloud. The combination of tablet plus
server is better than a laptop for several reasons including:

  
The tablet is more portable and offers longer battery life than a laptop;

The server offers better performance (more and faster cores, more RAM, more
cache) than a laptop;

Laptops run loud and hot. The noise and heat of the server need not be close
to the tablet or the ears and lap of the user;

The server is always running and the tablet can wake up and reconnect
instantly.

IPython Notebook is the keystone of our environment. It enables us to use the
tablet browser as a thin client to work with our favorite Python libraries
including matplotlib for visualization, scikit-learn for predictive modeling,
and pandas for processing and aggregation.

  
In this talk, I'll discuss configuring the Notebook server and the tablet
client. I'll also show examples and results of actual analyses performed in
this environment.
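
For the curious, the server side of such a setup amounts to a few lines in an
IPython profile (the keys below are from IPython 0.13's
ipython_notebook_config.py; the hash placeholder comes from
IPython.lib.passwd(), and the paths are illustrative):

    # ~/.ipython/profile_nbserver/ipython_notebook_config.py
    c = get_config()

    c.NotebookApp.ip = '*'                 # listen on all interfaces
    c.NotebookApp.port = 9999
    c.NotebookApp.open_browser = False     # headless server
    c.NotebookApp.password = u'sha1:...'   # hash from IPython.lib.passwd()
    c.NotebookApp.certfile = u'/etc/ssl/notebook.pem'  # serve over HTTPS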
 recording release: yes license:   

22. Data Visualization With Nodebox
Lynn Cherny

The family of Nodebox (http://nodebox.net/) tools supports the creation of
artistic visualizations using either Python or a visual programming interface.
The Python API is quite similar to Processing's but is (in my opinion) even
easier to learn, because of Python's friendliness. In this talk, I'll
illustrate creation of basic and more advanced visuals in Nodebox OpenGL,
using data from an exploratory text analysis project.
 recording release: yes license:   

23. IPython: a modern vision of interactive computing
Fernando Perez

IPython has evolved from an enhanced interactive shell into a large and fairly
complex set of components that include a graphical Qt-based console, a
parallel computing framework and a web-based notebook interface. All of these
seemingly disparate tools actually serve a unified vision of interactive
computing that covers everything from one-off exploratory codes to the
production of entire books made from live computational documents. In this
talk I will attempt to show how these ideas form a coherent whole and how they
are represented in IPython's codebase. I will also discuss the evolution of
the project, attempting to draw some lessons from the last decade as we plan
for the future of scientific computing and data analysis.
 recording release: yes license:   

24. Data Wrangling Kung Fu With pandas
Wes McKinney

In this talk I'll show how a number of tools from the pandas library can be
used to quickly wrangle raw data into shape for analysis. Techniques for
structured and semi-structured data manipulation, cleaning and preparation,
reshaping, and other common tasks will be the main focus.
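
A small taste of the kind of thing meant (generic pandas idioms, not the
talk's data):

    import pandas as pd

    left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
    right = pd.DataFrame({"key": ["a", "b", "d"], "y": [4, 5, 6]})

    merged = pd.merge(left, right, on="key", how="outer")  # SQL-style join
    clean = merged.dropna()                   # drop incomplete rows
    tall = clean.set_index("key").stack()     # reshape wide -> long
    wide = tall.unstack()                     # ...and back again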
 recording release: yes license:   

25. Luigi - Batch Data Processing in Python
Elias Freider

Luigi is Spotify's recently open sourced Python framework for batch data
processing including dependency resolution and monitoring. We will demonstrate
how Luigi can help you get started with data processing in Hadoop MapReduce as
well as on your local workstation.

  
Spotify has terabytes of data being logged by backend services every day for
everything from debugging to reporting. The logs are basically huge
semi-structured text files that can be parsed using a few lines of Python.
From this data, aggregated reports need to be created, data needs to be pushed
into SQL databases for internal dashboards, related artists need to be
calculated using complex algorithms, and a lot of other tasks need to be
performed, many of which have to be run on a daily or even hourly basis.

  
A lot of the initial processing steps are very similar for the many data
products that are produced, and instead of re-doing a lot of work,
intermediate results are stored and form dependencies for later tasks. The
dependency graph forms a data pipeline.

  
Luigi was created for managing task dependencies, monitoring the progress of
the data pipeline and providing frameworks for common batch processing tasks.
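
In Luigi, each step of such a pipeline is a Task whose requires() and output()
methods spell out the dependency graph. A minimal sketch (with invented task
and file names, not Spotify's code):

    import luigi

    class InputLogs(luigi.ExternalTask):
        """A raw log file that some backend service already wrote."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(self.date.strftime("logs/%Y-%m-%d.log"))

    class DailyLineCount(luigi.Task):
        """Aggregate the raw log into a per-day count."""
        date = luigi.DateParameter()

        def requires(self):
            return InputLogs(self.date)   # declares the dependency

        def output(self):
            return luigi.LocalTarget(self.date.strftime("counts/%Y-%m-%d.txt"))

        def run(self):
            with self.input().open("r") as logfile:
                n = sum(1 for _ in logfile)
            with self.output().open("w") as out:
                out.write("%d\n" % n)

    if __name__ == "__main__":
        luigi.run()   # python pipeline.py DailyLineCount --date 2013-03-18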
 recording release: yes license:   

26. Big Data in Fashion
Katherine Chuang

PythonFashionForecaster is an ongoing open-source project that I'd like to
present to the PyData community in order to start a discussion about
applications of Python in a traditionally non-data-centric industry, and to
extend the use of Python and open source to the world of fashion. A quick
search of Python repositories on GitHub shows a lack of true fashion apps:
most involve weather forecasts or shopping tools rather than fashion styles
specifically, while at the other end of the spectrum, the apps most relevant
to fashion styles are commercial. PythonFashionForecaster is different in
that its objective is to display fashion style trends as an information
resource in an automatic, computational manner.

  
This talk would be of interest to anyone who would like to see a case study
on parsing JSON data with Python or a survey of data analysis libraries that
can be used to analyze social data, as well as anyone interested in
fashion-related topics. I believe that, indirectly, this project will bring
exposure to the Python open source community in non-traditional domains.
 recording release: yes license:   

27. Dataflow Programming Using Generators and Coroutines
James Powell

This talk discusses generators as a mechanism for modelling data-centric
problems. The techniques suggested focus on simplifying the semantics of
processing code, adding flexibility by inverting control structures, and
allowing performance optimisations through caching, laziness, and targeted
specialisations.

  
* This would be a continuation of the material I presented at PyData NYC
2012. I would incorporate feedback from that presentation to cover areas of
particular interest. It would also use material developed since then,
including some illustrative examples of how generators could be used to model
certain problems in finance (the benchmark pricing problem, the refdata
problem, &c.)
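
As a flavor of the approach, a dataflow pipeline falls out of composing
generators, each stage pulling lazily from the previous one (an invented toy,
not the talk's finance examples):

    def read(lines):
        """Source stage: strip and skip blank records lazily."""
        for line in lines:
            line = line.strip()
            if line:
                yield line

    def parse(lines):
        """Transform stage: split records into (symbol, price) pairs."""
        for line in lines:
            symbol, price = line.split(",")
            yield symbol, float(price)

    def running_max(quotes):
        """Stateful stage: running maximum price per symbol."""
        best = {}
        for symbol, price in quotes:
            best[symbol] = max(price, best.get(symbol, price))
            yield symbol, best[symbol]

    raw = ["ABC,101.5", "", "XYZ,12.25", "ABC,103.0"]
    for symbol, high in running_max(parse(read(raw))):
        print(symbol, high)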
 recording release: yes license:   

28. Zipline in the Cloud: Optimizing Financial Trading Algorithms
Thomas Wiecki

Simulation has become an indispensable research tool across different
scientific disciplines ranging from neuroscience to econometrics and
quantitative finance. These computational simulations often involve parameters
that have to be optimized on data. This parameter optimization becomes
increasingly challenging as simulations grow more complex and take longer to
run.
Cloud services like Amazon Web Services (AWS) provide a compelling tool in
scaling this optimization problem by offering computing resources that allow
everyone to spawn their own personal cluster within minutes.

  
With a focus on algorithmic trading models, in this talk I will show how
large-scale simulations can be optimized in parallel in the cloud.
Specifically, I will (i) provide a tutorial on how trading strategies of
varying sophistication can be developed using Zipline -- our open-source
financial backtesting system written in Python; (ii) show how StarCluster
provides an easy interface to launch an Amazon EC2 cluster; (iii) show how
IPython Parallel can then be used to test large parameter ranges in parallel;
and (iv) give a brief demo of how Quantopian.com can greatly simplify parts of
this process by
offering a completely web-based solution free-of-charge. While a case study in
quantitative finance, the general approach has direct application to other
research domains.
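
The Zipline side of this is compact; a buy-and-hold strategy in the
class-based style of early Zipline looks roughly like the following (the API
has evolved since, and the random price series is a stand-in for real data):

    import numpy as np
    import pandas as pd
    from zipline.algorithm import TradingAlgorithm

    class BuyAndHold(TradingAlgorithm):
        """Buy 100 shares on the first bar, then hold."""
        def initialize(self):
            self.invested = False

        def handle_data(self, data):
            if not self.invested:
                self.order("AAPL", 100)
                self.invested = True

    # A toy daily price series (UTC-indexed, one column per security).
    dates = pd.date_range("2012-01-01", periods=250, tz="UTC")
    prices = pd.DataFrame({"AAPL": 100 + np.random.randn(250).cumsum()},
                          index=dates)

    perf = BuyAndHold().run(prices)
    print(perf.portfolio_value[-1])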
 recording release: yes license:   

29. Bitdeli - A Platform for Creating Custom Analytics in Your Browser
Ville Tuulos

Bitdeli is a platform for creating custom analytics in Python, conveniently in
your web browser.

  
You can use Bitdeli to create real-time dashboards and reports, or as a quick
and robust way to experiment with up to terabytes of real-time data. Bitdeli
is based on vanilla Python to maximize developer-friendliness. There is no
need to learn a new paradigm or stop using existing Python packages.

  
A typical customer of Bitdeli today is a mobile or web startup that wants to
understand and leverage the behavior of their users in ways that are not
supported by mainstream analytics services. To further support the long tail
of custom analytics, we encourage developers to open-source and share their
metrics in GitHub, which is tightly integrated with Bitdeli.
 recording release: yes license:   

30. Building Analytic Database Engines With Python
Robert Brewer

Analytic queries require different systems and approaches than operational
transactions if they're going to be efficient. This talk will cover what tools
Python gives us out of the box for building fast analytic databases: memory
manipulation, compression, dynamic typing, optimized representations,
multiprocessing, map-reduce.
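
To make one of those ingredients concrete, here is the
map-reduce-over-multiprocessing pattern in miniature, with a column stored in
a compact typed array (a sketch of the pattern, not the talk's code):

    import multiprocessing
    from array import array

    # One column, a million float64 values, in a compact typed buffer.
    column = array("d", range(10 ** 6))

    def partial_sum(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        chunks = [column[i:i + 100000]
                  for i in range(0, len(column), 100000)]
        pool = multiprocessing.Pool()
        total = sum(pool.map(partial_sum, chunks))  # map, then reduce
        pool.close()
        print(total)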
 recording release: yes license:   

31. How Web APIs and Data-centric Tools Power the Materials Project
Dan Gunter, Shreyas Cholia

Python has been an important tool for analysis and manipulation of scientific
data. This has traditionally taken the form of large datasets on disk or in
local databases, which are then processed by sophisticated numerical and
scientific libraries (SciPy and friends). Increasingly, science is becoming a
collaborative enterprise where "big data" is generated in multiple locations
and analyzed by multiple research groups.

  
In this talk we discuss how Python data analysis can help scientists work more
collaboratively by integrating Web APIs to access remote data. We will discuss
the details of this approach as applied to the Materials Project (see
materialsproject.org), a Department of Energy project that aims to remove the
guesswork from materials design using an open database of computed properties
for all known materials. Using the Python Materials Genomics (pymatgen)
analysis package (see packages.python.org/pymatgen), Materials Project data
can be seamlessly analyzed alongside local computed and experimental data. We
will describe how we make this data available as a web API (through Django)
and how we provide access to both data and analysis under a single library.
The talk will go over the technology stack and demonstrate the potential power
of these tools within an IPython notebook. We will finish by describing plans
to extend this work to address key challenges for distributed scientific data.
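
The client side of such a web API reduces to an authenticated HTTP GET; the
route and header below follow the Materials Project's REST conventions but
are illustrative only (see materialsproject.org for the actual documentation,
and substitute a real API key):

    import requests

    API = "https://www.materialsproject.org/rest/v1"
    headers = {"X-API-KEY": "YOUR_API_KEY"}   # issued per user

    # Ask for one computed property of one material (mp-149 is silicon).
    resp = requests.get("%s/materials/mp-149/vasp/band_gap" % API,
                        headers=headers)
    resp.raise_for_status()
    print(resp.json())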
 recording release: yes license:   

32. MARS Modeling on the Python Data Stack
Jason Rudy

Multivariate Adaptive Regression Splines (MARS, also known as earth) is a non-
parametric regression method originally published by Jerome Friedman in 1991.
It is particularly useful for modeling medium to high dimensional systems
where it is not known ahead of time which variables are predictive or in what
form, but it is suspected that an additive model with perhaps a few low order
interactions is appropriate. An implementation is available for R, and MARS
is included in the Orange data mining system for Python. However, there is
currently no easy way to integrate MARS with a Pandas / Numpy based data
analysis pipeline. In my talk, I'll be explaining the MARS algorithm and
demonstrating a new Pandas / Numpy compatible Python implementation with a few
example problems.
 recording release: yes license:   

33. Measuring the New Wikipedia Community
Ryan Faulkner

I will be discussing the approaches taken by the Editor Engagement
Experimentation team at the Wikimedia Foundation to discover the new site
features that lead to stronger collaborative contributions from editors and
readers. The focus will be on how we define, gather and analyze our metrics
[2,3,4] and how these have been exposed via a RESTful API built with Flask.

  
I'll also discuss the experimental results of new features (article feedback,
post-edit feedback) and improved ones (account creation) in the context of the
analytics implementation with the "e3_analysis" [3,4] Python package. Finally,
I will give an overview of the work we are carrying out on ranking the quality
of reader feedback comments using the pybrain [5] and mdp [6] machine learning
and data processing packages.

  
[1] http://meta.wikimedia.org/wiki/Editor_engagement_experiments

[2] https://meta.wikimedia.org/wiki/Research:Metrics

[3]
http://pypi.python.org/pypi?:action=display&name=e3_analysis&version=0.1.4

[4] https://github.com/rfaulkner/E3_analysis

[5] http://pybrain.org/

[6] http://mdp-toolkit.sourceforge.net/
 recording release: yes license:   

34. Lightning Fast Cluster Computing with PySpark
Patrick Wendell

This talk will introduce PySpark, a framework for cluster-scale
data-intensive computing. PySpark is a deeply integrated set of Python API
bindings for the
popular Spark computation engine. Spark provides rich data-flow abstractions
and sophisticated use of distributed caching to speed up complex analytic
processing by several orders of magnitude. It interfaces directly with popular
storage layers (e.g. Hadoop HDFS) and is optimized for advanced analytic
functions such as machine learning, OLAP processing, and ETL. The talk will
introduce the PySpark API and architecture, present use cases, and walk
through a demo.
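
A word count in PySpark shows the shape of the API; note how caching makes a
data set reusable across queries (a generic example, with an assumed HDFS
path):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "WordCount")

    counts = (sc.textFile("hdfs:///logs/access.log")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    counts.cache()          # keep the RDD in memory for reuse
    print(counts.take(10))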
 recording release: yes license:   

35. Escape from the Curse of the Cluster and the Headache of Hadoop
David Schachter

Disney Interactive
 recording release: yes license:   



Location
--------
A1


About the group
---------------