This is an archival dump of old wiki content --- see scipy.org for current material

SciPy 2007 Conference Day 1 Notes

by David Shupe

SciPy 2007 Conference - 16 Aug 2007

[Note: my editorial comments are in square brackets.]

Ivan Krstic - OLPC, Science, Awesome

Main point is education.    Before school, learning is curiosity-driven,
all-day, peer-based, happens everywhere.  Then, learning is 
authority-driven, select hours, unidirectional, in a particular place.

Works with a great teacher.  Not at all with no teacher.  1.2 billion
kids, 75% with inadequate access to education.

Top down fixing of schools? takes 50-100 yr and others are working on it.  
Try a peer-based approach.  Can do laptops now.
AMD Geode LX, ~0.8 W, 433 MHz, 256 MB RAM, 1GB flash.  Like a desktop
machine around 2000.  Readable in sunlight!  
Mesh networking: self-contained ARM processor on separate power
rail, operational during suspend.  802.11s pre-implementation (also
existing b and g).  

Touchpad: capacitive (normal) and resistive (thermal?).  
Open source, new gui called Sugar, optimized for kids, emphasizes
collaboration and simplicity.  New security platform called Bitfrost,
very high security w/o user involvement.  Tried and true stack mostly,
perturbed in strange ways.
LinuxBIOS, OpenFirmware, Linux, HAL, X.org, D-BUS, GTK+, HippoCanvas,
NetworkManager, Telepathy, Jabber/XMPP, Mozilla XULRunner, Python,
AbiWord, SQLite, OHM, Matchbox, CSound...

Beta-4 / C-test machines.  Mostly a delta from PyCon.  New processor
has more L1/L2 cache, runs Python much, much better.

E Jones: How did you choose this stack? 
A: Has to be small. OS has to fit in 100-150 MB. Don't want to have
to recharge every 3-4 hr when just reading.  Aggressive approach to
power management.  Want to suspend in 150msec and resume in 150msec,
every few seconds -- very challenging for software.  

Q: How much space left over after infrastructure? A: TBD.
OS takes 200-250 MB, want 100-150 MB.
Q: Who writes children's s/w?  A: Anyone can contribute, can play with Sugar.
Q: Is everything Python? A: Biggest chunks (e.g. Sugar).

Fernando: Why Python?
A: Alan Kay, Seymour Papert are constructionists.  Wanted something
kids can take apart and see how it works, without a CS degree.  Python
is one of the friendliest languages in which you can write production code.
Ivan is one of the drivers for using Python.  Reduces complexity.  

Perry: How do you know this is what is really needed?  Trying out
prototypes?
A: Not listening well!  Have had them with hundreds of kids; can see
reports on our wiki, e.g. Nigeria, Brazil?  Ivan surprised there hasn't
been more resistance or backlash, working surprisingly well.

Q: Did kids play with Python?
A: Africa: CS professor visited, kids started asking him questions, one
kid asked permission to install programming environment on school PC.

Personal opinion territory now!
Science: an obvious question that no one asks me at my talks:
Assuming kids use these to learn learning, what should they use
learning to learn?  If you now can teach yourself things, what to learn?
Why is learning important?
Literature, history and the arts are crucial....but most pressing need
is for scientific knowledge.  Health, power (electrical), home and communal
e.g. 14-yr-old William Kamkwamba from Malawi, built a windmill from rough
plans from a library book.  Imagine if he had all of Wikipedia/Project
Gutenberg/Google.

Something you (in audience) could build: 
MetaSci.  In the hacker community, Metasploit reduces overhead in writing
Proof of Concept exploits.  Boilerplate is included.
In scientific community, you learn about discoveries either from
mainstream publications or from the real papers if you can afford
the journals.  [ What about arXiv.org?  Free physics and astronomy preprints.]
It'd be awesome to read about a discovery and get a partial dataset
and visualizations that highlight what the researchers found.  Ivan's
half-baked idea.  But it's an enormous hassle for scientists to build
a cross-platform demo.  Choose a vis. toolkit and API, and make
simple HyperCardish notecards, tables and a way to break into the
Python shell...couple of hours of work for a scientist.  Ivan is
confident enough people could be convinced to roll demo sets.
[ my thoughts: takes a Ph.D. to understand "jargon".  Mainstream
journalism explanations of science take a lot of work. ]

Demo
Measure activity.  Response to lighter flame.  
Australia: water leak from PVC pipe.  Set up XO with a motion detector
and camera.  After a few days had photos of 10 different species of birds.
Buttons: Measure A, Set bias o, Show Deta, Stop, Show FFT.
Can attach cheap $1 sensors, waterproof.

Next talk:
Travis Oliphant: will be joining Enthought!  Will be moving to
Austin on Monday.  Will double the number of kids at
the Enthought Christmas party! :)

Average attendee - 1st three eigenfaces from yesterday!

Numpy in Python:
- a long-term goal.  We haven't wanted to commit to the release
schedule.  No one has stepped up to argue our case with other Python
developers.  Now NumPy is even "bigger" than it was in the past.

Tactical change:  Get the structure of NumPy into Python 3.0
via the buffer interface.  Start with changes to Python 3.0 and
then backport additions to Python 2.6.  Eventually, the demand for
some of the rest of NumPy will increase.  (PEP for Python 3.0 gets
more attention from Python developers.  May take 10 yr??)

Array interface:
Numeric, numarray, and NumPy all use it to share data.  
Hatched the idea of extending the buffer protocol after talking to Guido
at last SciPy (2006).  With help from Carl Banks, Greg Ewing, and others on
py3k-dev, PEP 3118 grew out of TO's early efforts.  (Harder than
writing a paper!  Requires a lot of attention and peer review.)

PEP 3118 Overview
- redefines tp_as_buffer fcn ptr table for every PyTypeObject
- Adds PyMemoryViewObject (memoryview in Python) -- will be the 1st
object in Python to support multi-D slicing.
- expands struct module with new character-based syntax.
- creates new C-API fcns to make common things simple for extension
writer.  Python user won't see much change.

Timeline: happening now.  Google Sprint is next week.  MemoryViewObject
needs work.  Struct module needs work.  Bug fixes on what's already
implemented.  Python 3.0 alpha release at end of August.

tp_as_buffer:
old: 4 items: get read buffer, get write buffer, get segcount, get charbuffer.
Couldn't share the data.
new: 2 items: get_buffer, release_buffer.  Introduces a way to
lock the buffer.  In Python 2.x you just got a raw pointer, with no way
to lock it.  The new protocol requires users to release the buffer when it's
not needed anymore, to allow e.g. reallocating memory.

Getbuffer
obj, view, flags, return

Pybuffer structure [I can't really follow these slides]

New C_API:
PyObject_CheckBuffer (make sure is present)
PyObject_GetBuffer
PyObject_ReleaseBuffer
PyBuffer_FromContiguous, ToContiguous
PyObject_CopyData
PyBuffer_IsContiguous
PyBuffer_FillContiguousStrides
PyBuffer_FillInfo
PyMemoryView_Check
PyMemoryView_GetContiguous, FromObject, FromMemory

If you have ideas, now is the time!  Can get into Python 3.0.

MemoryView Object

Struct-string syntax.
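
[For context, a rough sketch of what the struct-string / memoryview machinery
looks like at the Python level.  These notes predate the final implementation,
so details differed at the time; this uses the API as it eventually shipped
in Python 3.]

import struct

# Pack three C floats using the struct-string syntax ('3f' = three 4-byte floats).
buf = bytearray(struct.pack('3f', 1.0, 2.0, 3.0))

# memoryview exposes the buffer without copying; cast() reinterprets
# the raw bytes as typed elements.
view = memoryview(buf).cast('f')
print(view.format, view.shape, view[1])    # f (3,) 2.0

# Multi-dimensional shape on a plain byte buffer -- one of the PEP 3118 goals.
grid = memoryview(bytearray(12)).cast('B', shape=[3, 4])
print(grid.shape, grid[2, 3])              # (3, 4) 0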

Implications:
- standard way to share data among media libraries
- standard way to share arrays among GUIs
- increase adoption of NumPy-like features by wider Python community
- Powerful struct/ctypes connection
- maybe automatic compiled function call-backs using function-pointer
data.

Interested?  Google code Sprints (Aug 22-25), contact Travis before
Aug 21 AM for guidance.


Chris Mueller - "CorePy: Using Python on IBM's Cell/B.E."

Cell/B.E. : high performance processor, in Playstation, Toshiba products
Cell is whole platform.  B.E. is broadband engine.
Heterogeneous multicore processors (have different fcns)
SPE - Synergistic Processing Element (SPE) cores (8)
two general PPC cores
-In-order execution on all cores.
- Programmer-managed local store on SPUs
- 3 instr sets: PowerPC (PPE), VMX/AltiVec (PPE/SIMD), SPU (SPE/SIMD)

Linux:
 - IBM or Mercury Blades (2 Cell/BE, 16 SPUs, 1 GB RAM)
 - Sony PS3 (1 Cell/BE, 6 SPUs enabled)

PPU details
- Full PowerPC and VMX instruc. sets
- 2 h/w threads, 2 levels of cache,
- very minimal hardware implementation
   use PPUs as little as possible; disappointing performance

SPEs are more interesting
 SIMD instr set.  256k local store.  Lots of registers (128).
 Single Instruction, Multiple Data - one add works on 4 pieces of data.

Cell programming models
 - Manager/Worker (PPE dispatches work to SPEs)
 - Pipelined execution (each SPE is a stage in a pipeline)
 - SPE Threads
    treat each SPE as a thread in the program, minimal use of PPE,
      processor in memory: move code in and out more often than data. 

Cell/Python Programming Model
 - Use Python for low-performance tasks
 - Use native kernels for high-perf. tasks
 - Pass pointers from Python-allocated data to SPE kernels
 - Process data in blocks when possible

CorePy
 library for creating and executing PowerPC, VMX, and SPU programs
      from Python.
 Execute arbitrary SPE programs from Python
 Talk to SPE programs using libspe wrappers
 Create new SPE or PPE programs directly from Python

CorePy Components
Example:
  Population count (popc) counts the number of 1 bits in a bit vector
  Assembly-level.
  spu_popc: Pop. count in C.
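
[For reference, what the kernel computes -- a plain NumPy equivalent of a
population count, nothing to do with CorePy's SPU code generation:]

import numpy as np

def popcount(words):
    # Total number of 1 bits in a vector of 32-bit words (pure NumPy).
    return int(np.unpackbits(words.view(np.uint8)).sum())

words = np.array([0b1011, 0xFFFFFFFF, 0], dtype=np.uint32)
print(popcount(words))    # 3 + 32 + 0 = 35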

Exposed pthread-like interface so can use all cores.

Python examples.

iterator examples.
auto-parallelization for embarrassingly parallel problems.

CorePy pop counts, Take 3 - works on any length of vector.

PS3 example: Lyapunov Fractals.
 3.65 FPS, ~21 GOps, ~10 GFlops. Video rendered to a framebuffer in
   main memory.

Source distribution, evaluation license.
Tested on IBM Cell Blades, PS3, Apple G4/G5 Macs

http://www.corepy.org

Michele [pronounced "meekala"] Vallisneri (JPL), 
"Python and the Mock LISA Data Challenges"

ligo.caltech.edu, www.ligo.org, lisa.nasa.gov

Gravitational-wave basics
Measured in amplitude (1/R); do not form images; detectors are
quasi-omnidirectional.  f < 10 kHz.  Difficult to absorb and scatter.
Emitted coherently by the bulk motion of matter.  Emitted by massive
and compact objects: strong gravity.  h ~ 4.8e-20 * ((M/Msun)/(r/Mpc)) * (v/c)**2
[lots of time spent on this slide, intro to gravitational-wave astronomy.
As an astronomer I really dig this, but what of the rest of the audience?]
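
[As a sanity check on the quoted amplitude formula, with illustrative numbers
of my own choosing:]

# h ~ 4.8e-20 * (M/Msun) / (r/Mpc) * (v/c)**2, as quoted on the slide.
def strain(mass_msun, dist_mpc, v_over_c):
    return 4.8e-20 * (mass_msun / dist_mpc) * v_over_c**2

# A 1e6 Msun binary at 1000 Mpc with v ~ 0.3c gives h of order 4e-18.
print(strain(1e6, 1000.0, 0.3))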

LISA: a constellation, solar orbit 20 deg from Earth.  (Movie from MPIfGP)
Laser Interferometry can detect small changes (5e6 km baseline).

Supermassive black holes in active galactic nuclei.
70% of local galaxies show evidence of mergers.
Inspiral, merger, and ringdown (of BHs).

Measurements are hard.  Statistical theory of detection:
Strategies: Orthogonality and Coincidence.

Mock Lisa Data Challenges:
 why: encourage development of tools and techniques
 how: compete in analysis of synthetic data sets w/ instrument noise
   and GW source of undisclosed parameters.
 Challenge 1: some Galactic binaries, isolated SMBH binaries
    posted June 2006, evaluated Dec 2006
 Challenge 2: [didn't catch]

tasks
compute random GW source parms   -  Python script
-> lisaXML file

compute grav. waveforms - standalone C/C++ code w/ Python wrappers
-> lisaXML file

compute LISA response & noises - Python module, legacy C w/ Python wrappers
-> lisaXML file

Put everything together - Python script
-> lisaXML file

Python made it possible, in 3 months:
- intuitive IO library for XML format.
- steering scripts with easy access to OS.
- efficiently & simply wrap C/C++ GW codes using SWIG
- Wrap legacy app with "set-file" IO
- Write master installer script that calls various setup.py and
   configure/make within SVN.
    need a one-step install for end users and verification on new platforms... 
    Installs all the needed packages (numpy, SWIG, etc.) in non-system 
       directory.

Data format:
   Python & XML make a very strong pair.
   Text-based format reassuring (parsed by humans, less dependent on
          I/O libraries or formats like HDF & FITS)
   XML promising

  But....binary file performance would have been nice:
     datasets contain big arrays (130 Mbytes)

  XSIL (Extensible Scientific Interchange Language)
   developed by Roy Williams et al at Caltech's CACR
   based on eight simple XML elements

Building a natural Python interface for lisaXML
  Everything in XML is mapped into Python.

SWIG interface module.  Inherits from lisaXML

Thanks to Python, numpy, pyRXP....


MLDC: astrogravs.nasa.gov/docs/mldc
Code:
sourceforge.net/projects/lisatools

- Lunch -

Peter Wang - Interactive Plotting for Fun and Profit
(Enthought)

House assessment data.  Interactive demo of Chaco.
YearBuilt, Mkt_Value, SqFtCost.
Scatter plot + color/symbol code.  Select data, linear regression
to selected data.  Got a pretty good deal on his house assessment.
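
[Not the Chaco demo itself, but a rough matplotlib/NumPy sketch of the same
kind of analysis, with made-up data and column names:]

import numpy as np
import matplotlib.pyplot as plt

# Fake assessment data standing in for YearBuilt / SqFtCost.
year_built = np.random.randint(1950, 2007, size=200)
sqft_cost = 60 + 0.5 * (year_built - 1950) + np.random.normal(0, 5, size=200)

# Scatter plot, plus a linear regression over a "selected" subset.
sel = year_built > 1990
slope, intercept = np.polyfit(year_built[sel], sqft_cost[sel], 1)

plt.scatter(year_built, sqft_cost, s=8, alpha=0.5)
xs = np.linspace(1990, 2007, 2)
plt.plot(xs, slope * xs + intercept, 'r')
plt.xlabel('YearBuilt'); plt.ylabel('SqFtCost')
plt.show()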

Prosper.com.  Peer-to-peer lending.  3-yr loans, not collateralized,
up to $25k.  
Plot lender rate history for different borrower credit ratings.
Can get a dump of Prosper credit data.
1. Get data.
2. Get Robert Kern to write code.
3. Get a pile of data.
Turning items into numpy record arrays.
DebtToIncomeRatio vs. LoanStatus
CurrentRating vs. LoanStatus
Conclusion: Prosper is a little more dangerous than other places
[for lenders, or borrowers?  Lenders?]
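
[A minimal sketch of the record-array step above; the field names here are
invented, not the real Prosper schema:]

import numpy as np

# Each listing becomes one record; np.rec.fromrecords builds a record array.
listings = [(0.15, 'AA', 'Current'), (0.28, 'D', 'Late'), (0.22, 'C', 'Current')]
loans = np.rec.fromrecords(listings,
                           names='BorrowerRate,CreditGrade,LoanStatus')

# Fields are accessed by name, e.g. the mean rate of loans marked 'Late'.
late = loans[loans.LoanStatus == 'Late']
print(late.BorrowerRate.mean())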

Multitouch scatter plot.  IR illumination blasting screen from behind.
Camera reads finger spots.  Image processing of spots, which moved
between frames and which did not.  60 fps camera, can keep up with
processing.  

Chris Lee - "PyGr: The Python Graph Database Framework for Bioinformatics"
(UCLA, Ctr for Computational Biology)  

What is PyGr?
- sequence analysis & comparative genomics tools.
- Python, plus Pyrex / C extensions where crucial.

Competing models of languages:

- Scripting for piping results from one program to another, parsing
output formats, etc.
- A model of core properties of the data, its inter-relationships,
and how we formulate questions about it (eg. sequences, mappings).
[ I love this slide!!]

What should our goal be?
- Bioperl dominates bioinformatics, so we often have to answer
"Why bother with Python?"
- Should we just replicate the same functionality w/ better syntax?
- Or take it beyond scripting?  Make modeling data easy and natural.

Thesis: Python's core models are already a good model of Bioinfo. data
- Sequence: 
- Mapping/ Graphs
- Attributes

e.g. scripts typically store a sequence as a string.
- this ignores our need for a representation.
e.g. Python sequences: can be sliced.  Any slice of a sequence is itself a sequence.
String: str(s).  Could come from file, SQL...
Add Allen interval logic: union, intersection, before, after, etc.
Add orientation to handle DNA sequences. (strand orientation)

Multiple Storages, Same Interface
- Sequence: Python object in memory
- SQLSequence: slice query to relational dB
- BlastSequence: slice query to NCBI fastacmd
- FileDBSequence: slice query via fseek to disk
All follow the same interface, interchangeable.  Only needs
two customizations: __len__(), __str__()
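
[A toy illustration of the idea (my own sketch, not pygr's actual classes):
a backend only has to say how long it is and how to fetch letters, and slicing
can be layered on top uniformly.]

class SeqSlice:
    # An interval of a sequence; keeps a reference, not a copy.
    def __init__(self, seq, start, stop):
        self.seq, self.start, self.stop = seq, start, stop
    def __len__(self):
        return self.stop - self.start
    def __str__(self):
        return self.seq.fetch(self.start, self.stop)

class StringSequence:
    # In-memory backend; a SQL or BLAST backend would customize fetch()/__len__().
    def __init__(self, s):
        self._s = s
    def __len__(self):
        return len(self._s)
    def fetch(self, start, stop):
        return self._s[start:stop]
    def __getitem__(self, item):     # slicing returns an interval object
        return SeqSlice(self, item.start or 0, item.stop or len(self))

dna = StringSequence('ACGTACGTAA')
print(str(dna[2:6]), len(dna[2:6]))  # GTAC 4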

Hypergraphs: A General Model for Bioinformatics
Sequence Alignment: Nodes: sequence letters or intervals.
        Edges: links between sequences.

Python Mapping
  M[node1] -> node2
Need Graph:
  M[node1][node2] -> edgeinfo 
node 1 is a source node, node 2 is a target node, edge connects
the two.
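
[In plain Python, the dict-of-dicts pattern being described looks like this
(a minimal sketch, not pygr's API):]

# Nodes are keys; each maps to a dict of target nodes carrying edge info.
splice_graph = {
    'exon1': {'exon2': 'constitutive', 'exon3': 'alternative'},
    'exon2': {'exon3': 'constitutive'},
    'exon3': {},
}

# M[node1][node2] -> edgeinfo
print(splice_graph['exon1']['exon3'])     # alternative

# Walking all edges is just a pair of loops over the mapping.
for src, targets in splice_graph.items():
    for dst, edge in targets.items():
        print(src, '->', dst, edge)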

Alternative Splicing Example:
in SQL, requires a 6-way JOIN, for a simple example.
In Python,
  q = {1: {2: None, 3: None}, 2: {3: None}}  # no edge info
  for d in GraphQuery(g, q): print 'exons', d[1], d[2], d[3]
Simpler because the data really is a graph; an SQL schema is
not a good representation of that.

Benefits of Graph Query
- Data is actually a graph.  Query is also a graph.
- The SQL is a mess, unusable except by experts.

"Everything in Python is a dictionary" (mapping)
All Python Data (Already) IS a Graph Database!

Intervals.  Intervals are sorted by xstart, ystart, xend, yend.
Can stop search at first non-overlapping interval.
Query time is millisec: proportional to number of items that
will be returned.
10-500 times faster than R-Tree.  Published.
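
[A rough sketch of the stopping trick described, in plain Python; the published
data structure (nested lists) handles the containment cases this ignores:]

from bisect import bisect_right

def overlapping(ivals, qstart, qend):
    # ivals: intervals sorted by start, none containing another, so their
    # ends are sorted too.  Skip intervals ending before the query, then
    # stop at the first interval starting at or past the query end.
    ends = [e for s, e in ivals]
    i = bisect_right(ends, qstart)
    while i < len(ivals) and ivals[i][0] < qend:
        yield ivals[i]
        i += 1

print(list(overlapping([(0, 5), (3, 8), (7, 12), (20, 25)], 6, 10)))
# [(3, 8), (7, 12)]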

Pygr.Data: A Namespace for Scientific Data & Schema
- Would like obtaining a complex database + dependencies to be as easy as
a Python import: foo.bar.you.  All you need is the "name".
- The ultimate graph database: all bioinformatics data and their relations
could be available.

- Object marked with its pygr.Data ID
- Automatically saves any dependencies (again by ID).
- Uses Python pickle: objects must be picklable.

Demo
import pygr.Data defaults to XMLRPC server at UCLA.

http://www.bioinformatics.ucla.edu/pygr

[This was a fabulous talk!  It made me think a lot about modeling the
 data for my own area.]

BREAK

Lightning Talks

**Fernando Perez - TConfig - Traits-based declarative configuration for programs.
Traits: typed variables with validation and automatic GUI generation.

from enthought.traits.api import HasTraits,Int,Float

class C(HasTraits):
  n = Int(10)
  x = Float(2.5)

TConfig = Traits + ConfigObj

Written 2 weeks ago, in ipython/saw/sandbox.

**Len Reder - Mars Science Laboratory at JPL

Mars rover.  140 threads that are constantly running.

Formal interface specs (XML).
Python tools generate tested and well-understood patterns.
Template "Cheetah" in Python

No Python interpreter on rover.  Everything is C code that looks
like it is from 20 yr ago.

**Climate Data Analysis Tools (CDAT) - Charles Doutriaux

Goal: Provide scientific community with tools to allow them to focus
on science NOT technical aspects.

Current Version: 4.3 (Numeric), 5.0beta (NumPy)
Contributed packages-- mix of Python and Fortran, most built on top of
  SciPy/f2py.
One environment, several hundred users.

http://cdat.sf.net

**Platform for Intelligent Computing (NuPIC) - Charlie Curry
Numenta - startup in Menlo Park in NoCal
Hierarchical Temporal Memory (HTM)

Can break up local computations.  

**Bill Spotz: Numpy.i
Officially released.  Prabhu says, documentation is awesome!
It's in the doc directory of every numpy source distribution.

Trilinos: will be available end of August.  Lots of solvers.
PyTrilinos package.

**Rick Wagner: Taught intro to scientific computation for high schoolers.
Kids know computers (Windows) but not programming.
If you never leave the interpreter, you're not doing scientific computing.
Borrowed heavily from Software carpentry course by Greg Wilson.
Had to make it "shinier" for high-schoolers.
e.g. Quakenator.  Downloads earthquake data and plots.
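
[A guess at what a "Quakenator"-style exercise looks like; the actual course
code and data feed aren't in my notes, so this parses a tiny made-up sample
instead of downloading:]

import csv
import io
import matplotlib.pyplot as plt

# Inline sample standing in for the downloaded earthquake feed.
SAMPLE = """time,latitude,longitude,magnitude
2007-08-16T01:00:00,34.0,-118.2,3.1
2007-08-16T02:30:00,37.8,-122.4,2.4
2007-08-16T05:45:00,36.1,-120.0,4.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
mags = [float(r['magnitude']) for r in rows]

plt.hist(mags, bins=10)
plt.xlabel('Magnitude'); plt.ylabel('Count')
plt.show()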

Fernando: please put on scipy.org so we'll have a repository of
courses that people have taught.

**Brian Granger: PyStream: Stream & GPU computing in Python
- New emphasis on performance per watt.
- multicpu, multicore,etc
Ex. NVidia GPU ($600) 128 cores.
Ex. Folding@Home - more GFlops on 30k PS3s, more than supercomputers

NVIDIA CUDA SDK - makes programming GPUs easy for you.
CUDA makes it easier, but it's in C with lots of boilerplate.
PyStream - Python can be used for everything but the actual GPU kernel.
 - fully interactive (or not), and lightning fast.
 - integrated with NumPy.  e.g. send FFT to GPUs.
http://pystream.googlecode.com, BSD license.

**Prabhu Ramachandran: TVTK and MayaVi2
Goal: x-platform 2D/3D visualization for scientists and engineers.
Almost all 2D plotting: matplotlib and Chaco

TVTK: VTK + Traits + Numpy support = Pythonic VTK

Now MayaVi is standalone outside Envisage.  Now reusable.

** Gael Varoquaux - mlab: a pylab-like interface to Mayavi2

Mayavi Pros:
  Interactive, uses VTK = high-quality, feature-rich.
Limitations:
  Creating VTK data is too involved.  I don't want to learn VTK.

Run script inside Mayavi,  or run with ipython -wthread.

API is still changing -> we need feedback.

enthought.mayavi.tools.new_mlab in Enthought Tools Suite 2.5
  https://svn.enthought.com/enthought/wiki/install
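
[For flavor, the kind of one-liner mlab is aiming at.  The API was explicitly
still changing at the time (and the import path in the talk differs), so this
follows the interface as it later stabilized:]

import numpy as np
from enthought.mayavi import mlab   # later releases: from mayavi import mlab

x, y = np.mgrid[-3:3:100j, -3:3:100j]
mlab.surf(x, y, np.sin(x * y))      # interactive VTK surface, no VTK calls needed
mlab.show()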

**Michel Sanner: What's new in MGL Tools

molecular interactions
InstallJammer on Linux, Windows,  PackageMaker on Mac OS X.
Send everything incl. Python.  Can rollback to initial installation.
Can download nightly build with fix.  Next install of tested version
invalidates rollback of in-between nightly builds.

Matplotlib in Vision.

**Robert Kern: Spectral color maps.
Spectral color maps are confusing; RK has a color deficiency.
Colormap viewer in 3D.
Allows interesting analysis of colormaps.
Any that Robert likes? Grey, Heat, some diverging that have white
at center and diverge smoothly to two different colors at the extremes.
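
[A quick way to eyeball the comparison -- my own matplotlib sketch, not
Robert's 3-D colormap viewer:]

import numpy as np
import matplotlib.pyplot as plt

ramp = np.linspace(0, 1, 256).reshape(1, -1)   # a simple intensity ramp

for i, name in enumerate(['Spectral', 'gray', 'hot', 'RdBu']):
    ax = plt.subplot(4, 1, i + 1)
    ax.imshow(ramp, aspect='auto', cmap=name)
    ax.set_ylabel(name, rotation=0, ha='right')
    ax.set_xticks([]); ax.set_yticks([])
plt.show()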

End of lightning talks.

BASIN - Dept of Physics, Drexel University, Enrico Vesperini

Stellar Dynamics, Cosmology

Another user can come in and share data with me!

IPython Engines.

BASIN kernel: classes and fcns for data distribution and parallel data
operations (C++/MPI)

Packages for: Cosmology, Stellar dynamics, Statistics,
FFT (FFTW), Coordinate Transformations.

BASIN Python interface created with Boost Python.

A remote Python client can invoke BASIN commands to be executed
by the Data Analysis Engine.

Multiple distributed clients can connect to same BASIN engine and
share data (based on IPython1, terrific tool!)

Visualization: VisIt (www.llnl.gov/visit)
Visualization of large distributed datasets
Also plotting based on GnuPlot API.  Or can use whatever you
want on the client machine after data are transferred.

Goals: Ease access to parallel data analysis.
Avoid redundant development.
Interactive and multi-user parallel data analysis.

What we have:
Kernel for parallel data mgmt and operations
Scientific packages
A few visualization packages

Next up:
Increase science scope beyond astrophysics
Extend visualization options
2-way communication with visualization packages
Improve ease of use and installation
