Saturday, December 27, 2008

Ajax with Python - Combining PyJS and Appengine with json-rpc

I recently (re)discovered pyjs - also called Pyjamas - which is a tool to support development of (client-side) Ajax applications with Python. It does that by compiling Python code to Javascript (Pyjamas is inspired by GWT, which supports writing Ajax applications in Java).

pyjs' kitchensink comes with a JSON-RPC example; this posting shows how to use Appengine as the JSON-RPC server for that example.

(Rough) Steps:
1) download appengine SDK and create an appengine application (in the dashboard)
2) download pyjs 
3) Replace pyjs example/jsonrpc/JSONRPCExample.py with this code

from ui import RootPanel, TextArea, Label, Button, HTML, VerticalPanel, HorizontalPanel, ListBox


from JSONService import JSONProxy


class JSONRPCExample:
    def onModuleLoad(self):
        self.TEXT_WAITING = "Waiting for response..."
        self.TEXT_ERROR = "Server Error"
        self.remote_py = UpperServicePython()
        self.status=Label()
        self.text_area = TextArea()
        self.text_area.setText(r"Please uppercase this string")
        self.text_area.setCharacterWidth(80)
        self.text_area.setVisibleLines(8)
        self.button_py = Button("Send to Python Service", self)
        buttons = HorizontalPanel()
        buttons.add(self.button_py)
        buttons.setSpacing(8)
        info = r'This example demonstrates calling an appengine upper(case) method over JSON-RPC from javascript (i.e. Python code compiled to javascript with pyjs).'
        panel = VerticalPanel()
        panel.add(HTML(info))
        panel.add(self.text_area)
        panel.add(buttons)
        panel.add(self.status)
        RootPanel().add(panel)


    def onClick(self, sender):
        self.status.setText(self.TEXT_WAITING)
        text = self.text_area.getText()
        # a negative return value from the proxy means the call could not be sent
        if self.remote_py.upper(text, self) < 0:
            self.status.setText(self.TEXT_ERROR)


    def onRemoteResponse(self, response, request_info):
        self.status.setText(response)


    def onRemoteError(self, code, message, request_info):
        self.status.setText("Server Error or Invalid Response: ERROR " + code + " - " + message)


class UpperServicePython(JSONProxy):
    def __init__(self):
        JSONProxy.__init__(self, "/json", ["upper"])



4) Use the following code for the appengine app
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import logging
from django.utils import simplejson

class JSONHandler(webapp.RequestHandler):
  def json_upper(self,args):
    return [args[0].upper()]

  def post(self):
    args = simplejson.loads(self.request.body)
    json_func = getattr(self, 'json_%s' % args[u"method"])
    json_params = args[u"params"]
    json_method_id = args[u"id"]
    result = json_func(json_params)
    # reuse args to send result back
    args.pop(u"method")
    args["result"] = result[0]
    args["error"] = None # IMPORTANT!!
    self.response.headers['Content-Type'] = 'application/json'
    self.response.set_status(200)
    self.response.out.write(simplejson.dumps(args))

application = webapp.WSGIApplication(
                                     [('/json', JSONHandler)],
                                     debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()
5) compile the pyjs code from 3) and create a static dir in the appengine app to serve the compiled javascript/html (i.e. configure it as a static_dir in app.yaml)
6) use dev_appserver.py to test locally or appcfg.py to deploy on appengine (a small test client sketch follows below)
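
To sanity check the JSON-RPC backend without the compiled pyjs frontend, you can post a request to /json directly. Here is a minimal test client sketch (not part of the pyjs example; it assumes the dev server is running on the default localhost:8080):

import urllib2
try:
    import json                  # standard library from Python 2.6
except ImportError:
    import simplejson as json    # standalone simplejson for older Pythons

# roughly the same request format that the pyjs JSONProxy sends
request = {"id": 1, "method": "upper", "params": ["please uppercase this string"]}
req = urllib2.Request("http://localhost:8080/json",
                      data=json.dumps(request),
                      headers={"Content-Type": "application/json"})
response = json.loads(urllib2.urlopen(req).read())
print response["result"]  # expected: PLEASE UPPERCASE THIS STRING
print response["error"]   # expected: None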

Facebook application
By using this recipe it was easy to turn the example in this posting into a Facebook app - check it out at http://apps.facebook.com/testulf/


Alternative backend - using webpy


This shows how to use webpy as the backend; just put the javascript/html resulting from the pyjs compile into the static/ directory. It is very similar to the previous approach, with the request handling being the main difference (i.e. the webpy-specific parts):
import web
import json
urls = (
'/json', 'jsonhandler'
)
app = web.application(urls, globals())

class jsonhandler:
  def json_upper(self,args):
    return [args[0].upper()]

  def json_markdown(self,args):
    return [args[0].lower()]

  def POST(self):
    args = json.loads(web.data())
    json_func = getattr(self, 'json_%s' % args[u"method"])
    json_params = args[u"params"]
    json_method_id = args[u"id"]
    result = json_func(json_params)
    # reuse args to send result back
    args.pop(u"method")
    args["result"] = result[0]
    args["error"] = None # IMPORTANT!!
    web.header("Content-Type","text/html; charset=utf-8")
    return json.dumps(args)

if __name__ == "__main__":
  app.run()

Wednesday, December 17, 2008

Simulating Mamma Mia under the xmas tree (with Python)

Norway has a population of ~4.7 million, and "Mamma Mia!" has during a few weeks been sold in ~600 thousand DVD/Blu-ray copies (> 12% of the population, i.e. breaking every sales record, perhaps with the exception of Sissel Kyrkjebø's Christmas Carol album, which has sold a total of ~900 thousand copies in the same population - but that was over a period of 21 years). In the UK it has sold more than 5 million copies (in a population of 59 million, i.e. > 8% of the population).

Well, to the point. Guesstimating that there will be ~2 million xmas trees in Norway, one can assume that many of the trees will have (much) more than one "Mamma Mia!" DVD/Blu-ray underneath them - the question is how many?

Simulating Mamma Mia under the xmas tree
.. Before you start recapping probability theory, combinatorics, the birthday paradox, multiplication of improbabilities and whatnot, how about finding an alternative solution using simulation?

A simple simulation model could be to assume that for every Mamma Mia DVD/Blu-ray copy there is a lottery in which all trees participate, and then finally (or incrementally) count how many copies each tree won.

A friend wrote a nice Python script to simulate this:

import random

NUM_XMAS_TREES = 2000000
NUM_MAMMA_MIAS = 600000

tree_supplies = {}
for mamma_mia in range(NUM_MAMMA_MIAS):
    # note: randint is inclusive in both ends, so use NUM_XMAS_TREES - 1 to get exactly NUM_XMAS_TREES trees
    winner_tree = random.randint(0, NUM_XMAS_TREES - 1)
    tree_supplies[winner_tree] = tree_supplies.get(winner_tree,0) + 1

tree_stats = {}
for tree in tree_supplies:
    tree_stats[tree_supplies[tree]] = tree_stats.get(tree_supplies[tree], 0) + 1

print tree_stats

Results:

$ for k in `seq 1 10`; do echo -n "$k " ; python mammamia.py; done
1 {1: 443564, 2: 67181, 3: 6618, 4: 510, 5: 36}
2 {1: 444497, 2: 66811, 3: 6543, 4: 520, 5: 32, 6: 2}
3 {1: 444796, 2: 66376, 3: 6738, 4: 510, 5: 36, 6: 3}
4 {1: 444499, 2: 66750, 3: 6652, 4: 469, 5: 30, 6: 2, 7: 1}
5 {1: 444347, 2: 66717, 3: 6697, 4: 494, 5: 28, 6: 2}
6 {1: 444511, 2: 66389, 3: 6763, 4: 551, 5: 40, 6: 3}
7 {1: 443914, 2: 66755, 3: 6785, 4: 511, 5: 33, 6: 2}
8 {1: 444747, 2: 66558, 3: 6667, 4: 484, 5: 40}
9 {1: 444553, 2: 66703, 3: 6631, 4: 497, 5: 32}
10 {1: 443903, 2: 66853, 3: 6774, 4: 487, 5: 23, 6: 1}

Conclusion

So we see that in run 4 there was one xmas tree that, according to the simulation model, got 7(!) Mamma Mia DVD/Blu-rays underneath it, but the overall simulation shows that a maximum of 5 or 6 is more likely (assuming the model is right).

Regarding the simulation model, it is probably way too simplistic, i.e. not taking into account people buying Mamma Mia for themselves (or not as xmas gifts), a likely skewness in the number of gifts per christmas tree, interaction between buyers, etc. But it can be extended with relatively simple code to make it more realistic, as sketched below. Check out Markov Chain Monte Carlo simulation for more info on how to create more realistic simulation models.
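
As a rough illustration of one such extension (my sketch, not part of the original script): give each tree a random "gift intensity" weight, so trees that get many gifts in general are also more likely to get a Mamma Mia copy, and pick winners with probability proportional to that weight.

import bisect
import random

NUM_XMAS_TREES = 2000000
NUM_MAMMA_MIAS = 600000

# assumed skewness model: every tree gets a "gift intensity" weight between 1 and 5
weights = [random.randint(1, 5) for tree in xrange(NUM_XMAS_TREES)]

# cumulative weights, so a tree wins a copy with probability proportional to its weight
cumulative = []
running = 0
for w in weights:
    running += w
    cumulative.append(running)
total_weight = running

tree_supplies = {}
for mamma_mia in xrange(NUM_MAMMA_MIAS):
    winner_tree = bisect.bisect_left(cumulative, random.uniform(0, total_weight))
    tree_supplies[winner_tree] = tree_supplies.get(winner_tree, 0) + 1

tree_stats = {}
for tree in tree_supplies:
    tree_stats[tree_supplies[tree]] = tree_stats.get(tree_supplies[tree], 0) + 1

print tree_stats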

Wednesday, December 3, 2008

cinpy - or C in Python

cinpy is a tool that lets you write C code inside your Python code (with the help of ctypes, which is included in modern Python versions). When you execute your Python program the C code is compiled on the fly using the Tiny C Compiler (TCC).
In this posting I will describe:
1) installation and testing cinpy
2) a simple benchmark (c-in-py vs python)
3) compare performance with gcc (c-in-py vs gcc)
4) measure cinpy (on-the-fly) compilation time
5) how to dynamically change cinpy methods

1. How to install and try cinpy (note: also found in cinpy/tcc README files)
1) download, uncompress and compile TCC
./configure
make
make install
gcc -shared -Wl,-soname,libtcc.so -o libtcc.so libtcc.o

2) download, uncompress and try cinpy
cp ../tcc*/*.so .
python cinpy_test.py  # you may have to comment out or install psyco 

2. Sample performance results (on a x86 linux box):

python cinpy_test.py
Calculating fib(30)...
fibc : 1346269 time: 0.03495 s
fibpy: 1346269 time: 2.27871 s
Calculating for(1000000)...
forc : 1000000 time: 0.00342 s
forpy: 1000000 time: 0.32119 s

Using cinpy, the fibc (Fibonacci) method was ~65 times faster than fibpy, and the forc (loop) method was ~93 times faster than forpy - not bad.
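
For reference, the pure Python counterparts in cinpy_test.py are presumably something along these lines (my reconstruction, not copied from the cinpy distribution):

def fibpy(x):
    # naive recursive Fibonacci, same algorithm as the C version
    if x <= 1:
        return 1
    return fibpy(x - 1) + fibpy(x - 2)

def forpy(x):
    # simple counting loop, same as the C for-loop version
    i = 0
    while i < x:
        i += 1
    return i

print fibpy(30)       # 1346269
print forpy(1000000)  # 1000000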

3. How does cinpy (compiled with tcc) compare to gcc performance?
Copying the C fib() method into a standalone program with a main() and compiling it with gcc:
$ time fibgcc


fib(30) = 1346269

real    0m0.016s
user    0m0.020s
sys     0m0.000s


GCC gives roughly twice as fast code as cinpy/tcc (0.034/0.016). 




4. How long time does it take for tcc to on-the-fly compile cinpy methods?


#!/usr/bin/env python
# cinpy_compile_performance.py
import ctypes
import cinpy
import time
t=time.time
t0 = t()
fibc=cinpy.defc(
     "fib",
     ctypes.CFUNCTYPE(ctypes.c_int,ctypes.c_int),
     """
     int fib(int x) {
       if (x<=1) return 1;
       return fib(x-1)+fib(x-2);
     }
     """)
t1 = t()
print "Calculating fib(30)..."
sc,rv_fibc,ec=t(),fibc(30),t()
print "fibc :",rv_fibc,"time: %6.5f s" % (ec-sc)
print "compilation time = %6.5f s" % (t1-t0)

python cinpy_compile_performance.py

Calculating fib(30)...
fibc: 1346269 time: 0.03346 s
compilation time: 0.00333 s


So compilation (and linking) time is about 3.3ms, which is reasonably good, not a lot of overhead! (note: TCC is benchmarked to compile, assemble and link 67MB of code in 2.27s)


5. "Hot-swap" replacement of cinpy code?
Let us assume you have a system with a "Pareto" situation, i.e. 80-99% of the code doesn't need to be fast (written in Python), but 1-20% needs to be really high performance (written in C using cinpy), and that you need to frequently change the high-performance code - can that be done? An example of such a system could be mobile (software) agents.


Sure, all you need to do is wrap your cinpy definition as a string and run exec on it, like this:

origmethod="""
fibc=cinpy.defc("fib",ctypes.CFUNCTYPE(ctypes.c_int,ctypes.c_int),
          '''
          int fib(int x) {
              if (x<=1) return 1;
              return fib(x-1)+fib(x-2);
          }
          ''')
"""
# add an offset to the Fibonacci method
alternatemethod = origmethod.replace("+", "+998+")
print alternatemethod
# alternatemethod has replaces origmethod with exec(alternatemethod)
print fibc(2) # = 1000
Conclusion:
cinpy ain't bad.

Remark: depending on the problem you are solving (e.g. if it is primarily network IO bound and not CPU bound), becoming 1-2 orders of magnitude faster (cinpy vs pure Python) is probably fast enough CPU-wise, and the additional doubling from GCC may not matter (network IO wise, Python performs quite alright, e.g. with Twisted).

Saturday, November 29, 2008

Tools for Accelerating Python

If you need to speed up your Python program there are several possible approaches, with varying degrees of effort needed; here is a (probably not complete) overview:
  1. Rewrite your Python code by parallelizing it or optimizing/replacing/tuning algorithm(s)
  2. Use a tool that can speed up your code more or less unchanged
    • Psyco
      • Just-in-time compiler; note: this is probably the easiest approach to try (see the Psyco sketch after this list)
    • Pyrex
      • Write and compile Python with a flavor of C data structures
    • Cython 
    • PyJs 
      • Compile (large subset of) Python to Javascript, note: probably more interesting for client development (ajax) than server side.
    • Rpython
      • Compile (large subset of) Python to native code, note: part of PyPy project
    • PyStream
    • GPULib
    • Shedskin 
  3. Replace (parts of) your Python code with another language
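
As an example of approach 2, Psyco typically only needs a couple of extra lines - a minimal sketch (the fib function is just a stand-in for your own hot code):

import psyco
psyco.full()  # JIT-compile all functions; use psyco.bind(f) for selected functions only

def fib(x):
    # plain Python, unchanged - psyco speeds it up transparently
    if x <= 1:
        return 1
    return fib(x - 1) + fib(x - 2)

print fib(30)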

Saturday, October 18, 2008

Example of Microsimulation with Stackless Python


Stackless Python is roughly Python with support for microthreads (called tasklets). Tasklets communicate with each other by sending messages through channels, i.e. similar to Erlang, but of course with Python syntax (guess which I prefer :-). Stackless has a sibling called Greenlet Python, which is a library that can be used with traditional Python to get decent microthread support.
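
A minimal sketch of the tasklet/channel model (just to show the primitives; it is not part of the simulation example further down):

import stackless

def producer(channel):
    # send blocks until another tasklet is ready to receive
    channel.send("hello from a tasklet")

def consumer(channel):
    # receive blocks until a message is available
    print channel.receive()

ch = stackless.channel()
stackless.tasklet(producer)(ch)
stackless.tasklet(consumer)(ch)
stackless.run()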

Scale
Tasklets have low overhead (compared to traditional threads) so you can have hundreds of thousands of them simultaneously, and when it gets overly crowded and busy on one machine, you can easily scale up by pickling some tasklets and sending them to another machine (better known as mobile agents in computer science). Since tasklets also live under Python's GIL regime, it might be an idea to use the processing API to spawn several processes to better utilize multicore machines (and have many tasklets within each process), or use some of these libraries.

API Alignment Challenge
In Python 2.6 the process and thread APIs are aligned, but a missing piece would be to align those APIs with the stackless/tasklet APIs. From my perspective a natural order would be process > thread > tasklet, and perhaps adding your favorite piece of the cloud would make sense for the API as well?

Example - simulation with 100k robots
The robots play a variant of musical chairs on a rectangular arena; it differs from regular musical chairs in that once a chair has been sat on, it can't be sat on again. Finding out who won is left as an exercise (hint: the answer can be found in more than one place).

from random import randint
import stackless
import sys
import time

class Arena:
    def __init__(self, xdim, ydim):
        self.arena = [[None]*ydim for x in range(xdim)]
        (self.xdim, self.ydim) = (xdim, ydim)

    def find_unused_place_for_robot(self, robotname):
        (x, y) = randint(0,self.xdim-1), randint(0,self.ydim-1)
        if not self.arena[x][y]:
            self.arena[x][y] = robotname
        return self.arena[x][y] == robotname

class Robot:
    def __init__(self, name=0, arena=None, maxrounds=0):
        self.name = name
        self.arena = arena
        self.points = 0
        self.maxrounds = maxrounds

        # bind Robot's live method to it's tasklet
        self.tasklet = stackless.tasklet(self.live)()

    def live(self):
        rounds = 0
        while rounds < self.maxrounds:
            self.play()
            rounds += 1
        self.tasklet.kill()

    def play(self):
        # entire play method is atomically executed
        atomic = self.tasklet.set_atomic(True)
        if self.arena.find_unused_place_for_robot(self.name):
            self.points += 1
        self.tasklet.set_atomic(atomic)

class RobotArenaSimulator:
    def __init__(self, num_robots, xdim, ydim, maxrounds):
        self.maxrounds = maxrounds
        self.arena = Arena(xdim, ydim)
        self.robots = [Robot(id, self.arena, maxrounds)
                       for id in range(1,num_robots+1)]
    def run_preemptive(self, nopreempt_steps=1):
        tstart = time.clock()
        while stackless.getruncount() != 1:
            # run() takes out the tasklet after nopreempt_steps
            t = stackless.run(nopreempt_steps)
            # so need to add the tasklet to scheduler again
            if t: t.insert()
        return (time.clock()-tstart)

if __name__ == "__main__":
    tstart = time.clock()
    simulator = RobotArenaSimulator(num_robots = 100000,
                                    xdim = 10000, ydim = 10000,
                                    maxrounds = 10)
    simulationtime = simulator.run_preemptive()
    print "simulation time was %.2f seconds" % (simulationtime)
    print "total running time was %.2f seconds" % (time.clock()-tstart)

Friday, July 18, 2008

The Rebirth of Confounding and Theory in Science?



Wired recently published an interesting article titled The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, where they roughly conclude that with enough data and processing power, correlation is all you need. While interesting, I think it misses a few important points.

Confounding variables
At the end of the article they claimed: "Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all".

Correlation between two variables can give some information, and with a lot of data and computational horsepower you can find correlation relatively easily, but the danger is if correlation is interpreted as causation (and that can happen).

Example: Correlation between GDP and Children's Weight
An example is the positive correlation between a country's Gross Domestic Product (GDP) and the weight of its children; if that were interpreted as causation, the cheapest way to increase a country's GDP would be to put kids on e.g. a Chankonabe diet (note: I am not saying that is a likely action). The GDP and weight correlation is an example where a third, confounding variable - time - was not accounted for.

Example: Correlation between substance X and your health
In the press you frequently see that eating/drinking substance X is good for your health and makes you live longer. In several of those cases I believe correlation is reported rather than the confounding variables: many of these substances are luxury goods and may not be affordable to most of the world's population, so the confounding variable might be personal economy, which is probably correlated with access to vaccines and medical care.

The value of theory
My personal hypothesis is that the likelihood of not finding confounding variables increases exponentially with the complexity or level of abstractness of the domain investigated for correlation (e.g. in medicine, pharmacology, biology or physics). The useful information is more likely to be found when discovering confounding variables, and the correlation is primarily a signal that something exciting is underneath (there can be cases where correlation is sufficiently interesting in itself, though). In order to find or propose confounding variables, I believe models and theories are still very much needed, probably even more than before (but the simplicity or complexity of models is another discussion; I tend to lean towards KISS).

Disclaimer: This posting (and all my others) represent only my personal views.

Wednesday, June 11, 2008

Pragmatic Classification of Classifiers

recap: In my previous machine learning related postings I have written about the basics of classification and given an overview of Python tools for classification (and also a machine learning dream team and how to increase automation of test-driven development).

In this posting I will "go meta" and say something about classes and characteristics of classifiers.



Informative vs Discriminative Classifiers
Informative classifiers model the densities of the classes and select the class that most likely produced the features; in the naive Bayes case this modeling involves counting (see here for an example with these data).

Discriminative classifiers take a different approach - they try to model the class boundary and membership directly, e.g. in a simple 2-feature-dimension case this could mean trying to find the line that best separates the classes (in the case of 3 or more feature dimensions it would be looking for the hyperplane that best separates the classes). Examples of discriminative classifiers are support vector machines (SVM) and ridge regression.

Classifier training methods
Many classifiers are batch-based, which means that they need access to all training data at the same time (including historic data in a re-training case). Online classifiers don't need all data for every training round; they support updating the classifier incrementally. A related training method is decremental training, which is about dealing with classification problems where there is concept drift (i.e. forgetting out-of-date examples). Other training methods include stochastic training, which is about training using random samples of the data.
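
A toy sketch of the online idea (hypothetical code, not tied to any particular library): keep per-class counts of feature occurrences and update them one training example at a time, so no historic training data has to be kept around.

class OnlineCountingClassifier:
    # toy online classifier: incrementally counts feature/class co-occurrences
    def __init__(self):
        self.class_counts = {}    # class -> number of examples seen
        self.feature_counts = {}  # (class, feature) -> count

    def train_incremental(self, features, klass):
        # one example at a time - no need to keep the full training history
        self.class_counts[klass] = self.class_counts.get(klass, 0) + 1
        for f in features:
            key = (klass, f)
            self.feature_counts[key] = self.feature_counts.get(key, 0) + 1

    def classify(self, features):
        # score each class by (crudely) summing relative feature frequencies
        best_class, best_score = None, -1.0
        for klass, total in self.class_counts.items():
            score = sum(self.feature_counts.get((klass, f), 0) / float(total)
                        for f in features)
            if score > best_score:
                best_class, best_score = klass, score
        return best_class

c = OnlineCountingClassifier()
c.train_incremental(["goal", "ball"], "soccer")
c.train_incremental(["rod", "reel"], "casting")
print c.classify(["ball"])  # soccer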

Linear vs Non-Linear Classifiers
If you have a situation where one class is inside a circle and the other class is outside (and surrounding) the circle, it is impossible to linearly separate the two classes (with a discriminative classifier). Fortunately there are non-linear classifiers that can solve this, typically by transforming the problem into a computationally heavier one using a kernel trick - but at least the new problem is possible to solve.
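
A tiny sketch of why such a transformation helps (illustration only, not a real kernel method): points inside/outside a circle are not linearly separable in the (x, y) plane, but after mapping each point to its squared distance from the origin, a single threshold separates the two classes.

import random

def true_class(x, y):
    # class A inside the unit circle, class B outside it
    return "A" if x * x + y * y < 1.0 else "B"

points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for i in range(1000)]

errors = 0
for (x, y) in points:
    z = x * x + y * y                     # transformed (1-dimensional) feature
    predicted = "A" if z < 1.0 else "B"   # a simple linear threshold in the new space
    if predicted != true_class(x, y):
        errors += 1

print "errors after transformation:", errors  # 0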

Sequential vs Parallel Classifiers
Sequential classifier algorithms can typically utilize only one core, CPU or machine, while parallel classifier algorithms are able to utilize several cores, CPUs or machines (e.g. in order to handle more data or get faster results).

Non-orthogonal Data
Non-orthogonality, which can happen when there are repeated occurrences of training data, is handled by some classifiers.

Dependencies between features
Dealing with dependencies between features (e.g. correlations) is handled by some classifiers (this is sometimes a symptom of potential for improvement in feature representation).

Sunday, May 25, 2008

Pragmatic Classification with Python

In my previous posting I wrote about classification basics, this posting will follow up and talk about Python tools for classification and give an example with one of the tools.

Open Source Python Tools for Classification
  • Monte - less comprehensive than Orange, written purely in Python (i.e. no SWIGed C++). Looks interesting (has several classifier algorithms), but the APIs seem to be in an early phase (relatively new tool, at version 0.1.0)
  • libsvm - Python API for the most popular open source implementation of SVM. Note: libsvm is also included with Orange and PyML. (I used this tool during my PhD a few years ago)
  • RPy - not exactly a classification tool, but it is quite useful to have a statistics tool when you are doing classification (it has nice plotting capabilities, not unlike Matlab's); check out the demo.
  • PyML - also less comprehensive than Orange (specialized towards classification and regression, it supports SVM/SMO, ANN and Ridge Regression), but it has a nice API. Example of use:

    from PyML import multi, svm, datafunc
    # read training data, last column has the class
    mydataset = datafunc.SparseDataSet('iris.data', labelsColumn = -1)
    myclassifier = multi.OneAgainstRest(svm.SVM())
    print "cross-validation results", myclassifier.cv(mydataset)
My recommendation is to either go with Orange or with PyML.


Tuesday, April 22, 2008

Pragmatic Classification: The very basics

Classification is an everyday task; it is about selecting one out of several outcomes based on their features. An example could be recycling of garbage, where you select the bin based on the characteristics of the garbage, e.g. paper, metal, plastic or organic.

Classification with computers
For classification with computers the focus is frequently on the classifier - the function/algorithm that selects the class based on the features (note: classifiers usually have to be trained to get fit for fight). Classifiers come in many flavors, and quite a few of them have impressive names (phrases with rough, kernel, vector, machine and reasoning aren't uncommon when naming them).

note: as (almost) always, garbage in leads to garbage out - the same goes for classification.

The numerical baseline
Let us assume you have a data set with 1000 documents that turn out to belong to 4 equally large categories (e.g. math, physics, chemistry and medicine). A simple classifier for a believed-to-be-similar dataset could be the rule "the class is math", which is likely to give a classification accuracy of about 25%. (Another classifier could be to pick a random category for every document.) This can be used as a numerical baseline for comparison when bringing in heavier classification machinery, e.g. if you get 19% accuracy with the heavier machinery it probably isn't very good (or your feature representation isn't very good) for that particular problem. (Note: heavy classification machinery frequently has plenty of degrees of freedom, so fine-tuning it can be a challenge; the same goes for feature extraction and representation.)
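
In code, the baseline comparison could look roughly like this (hypothetical numbers matching the example above):

# 1000 documents, 4 equally large categories
counts = {"math": 250, "physics": 250, "chemistry": 250, "medicine": 250}
total = float(sum(counts.values()))

# "always predict the most common class" baseline
majority_baseline = max(counts.values()) / total
print "majority-class baseline accuracy: %.2f" % majority_baseline  # 0.25

# random guessing among k equally likely classes gives accuracy 1/k
random_baseline = 1.0 / len(counts)
print "random-guess baseline accuracy: %.2f" % random_baseline      # 0.25

# a fancy classifier scoring below these baselines is a warning sign
fancy_accuracy = 0.19
print "beats the baseline:", fancy_accuracy > majority_baseline     # False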

Combining classifiers
On the other hand, if the heavy machinery classifier gave 0% accuracy you could combine it with a random classifier to only randomly select from the 3 classes the heavy machinery classifier didn't suggest.

Question 1: What is the accuracy with these combined classifiers?

Baseline for unbalanced data sets
Quite frequently classification problems have to deal with unbalanced data sets. E.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% casting; a baseline classifier for a similar dataset could then be to say "the article is about soccer". This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.

Silver bullet classifier and feature extraction method?

Q: My friend says that classifier algorithm X and feature extraction method Y are the best for all problems, is that the case?
A: No, tell him/her to read about the ugly duckling and no free lunch theorems, which clearly say that there is no universally best classifier or feature extraction approach.

note: Just some of the basics this time, something more concrete next time (I think).

Sunday, April 20, 2008

My years on the net from 1998-2000

1998 - Insurance and Finance, and my first domain name
Once upon a time (September 1998) I bought my first domain name - agentus.com. The domain name was inspired by a course in "distributed artificial intelligence and intelligent agents"[1] I (partially) followed while being an IT trainee in an insurance and finance company. I learned quite a bit from insurance (enjoyed working with actuaries, i.e. basic risk analysis for insurance pricing), but felt I had some unfinished business in academia (and entrepreneur-wise), so in the summer of 1999 I left the finance company for university, to start on a PhD scholarship in the fall.

1999 - Back to University
I spent most of the summer of 1999 doing what could be called entrepreneurship-in-a-sandbox, i.e. not actually doing much concrete (coding-wise), but reading up on e-commerce literature[2] to figure out something entrepreneurial to do. I ended up with a few vague ideas (one was creating a service to automatically create brand and domain names), but nothing that I really believed could fly (maybe the insight into risk analysis wasn't such a great thing?). The PhD scholarship was part of a project called "Electronic Commercial Agents" - ElComAg. One of the first things I did when creating web pages for the project was an updated list of academic events and calls for papers for e-commerce and CS related conferences, workshops and journals; this was gradually improved over the years and grew into eventseer.net.

2000 - Entrepreneurial year
At this time CoShopper.com and LetsBuyIt.com provided co-shopping services; the idea behind these services was roughly to be a "crowd-shopping middleman", i.e. if a lot of consumers got together and purchased things (e.g. hundreds of DVD players) they should get them cheaper than on their own. Inspired by this and my recent insurance experience I thought something like: "insurance is by nature a crowd-risk-sharing product, so what is more natural than co-shopping of insurance?". Another nice property of insurance (at least selling it..) is that it is highly virtual, so there is close to no distribution cost[3]. (Co-shopping of insurance actually happened already then, but more implicitly, e.g. if you're a member of an organization you might get cheaper insurance.)
I strongly believed this would change the insurance industry, in particular for insurance brokers (and perhaps even for the reinsurance industry, but I don't remember why :), so together with 2 CS project students I tried to make this idea fly, but the highlight became a full-day workshop with executives from a very large insurance broker. (From February 2000 the investment mood started changing around the world, and when the summer of 2000 arrived it all died out.)

As mentioned, I had the summer before thought of a service for automatic brand and domain name creation, so I got a CS project student working on it (in parallel with the insurance project), and we got approved to participate in a summit for entrepreneurs sponsored by the University[4]. The first part of the workshop revealed significant business-level holes in the idea (revenue potential etc.), so we had to be agile and think of something new, and we changed the idea to an ontology-based browser plugin to support search. At the summit's presentation real investors were on the panel, and when we spoke with one of them he said informally that he might be willing to invest a very large amount; not being used to being close to large numbers, it somewhat freaked me out[5]. Coincidentally(1), during the first part of the summit (which was in February) I saw on the front page of a financial newspaper (in the reception area) that the stock market was seriously down, i.e. the dot.com bubble had burst.

Coincidentally(2), during the second part of the entrepreneurial summit I got a phone call on behalf of a large investor to whom I had recently sent an email with misc. ideas. I stayed in touch, and later in the year (fall) I got funding from that investor to found a company together with my brother and the student who had worked on the automatic brand name project. I worked on this company part-time in addition to my PhD studies for about 2-3 years, until going back to the PhD full-time.

I was a co-organizer of something called "technology café", mainly together with other PhD students from many parts of the university (e.g. industrial ecology, political science, linguistics, social anthropology and computer science). Together with a few of them I tried to develop a company (during the fall) doing consulting and services related to indicators for industrial eco-efficiency. We had meetings with potential local investors and customers, but our business model was quite vague so it was a hard sell, and the initiative unfortunately fizzled out.

One of the guys from the industrial eco-efficiency initiative had another project related to services for the carbon quota market (this was 3 years after the Kyoto Protocol). They were lacking an IT guy, so for a brief period I was one of the 5 people working on it, but due to my newly funded company (2 paragraphs up) I had to pass on this project. In retrospect I honestly wish I hadn't; this company - PointCarbon - became very successful, now with customers in more than 200 countries :-)

Hm, 2000 was truly a busy year; other than that I remember getting my first submitted paper rejected ;-) But things started getting somewhat better on the academic front in 2001.

[1] - intellige.. twice in a course name had to be good :)
[2] - online/PDF e-commerce literature such as Make Your Site Sell and Wilsonweb, and magazines: Fast Company, Business 2.0, Red Herring and Wired (some of them approaching phone book size per monthly issue; not the same as being in Silicon Valley, but an ok substitute while being 9 timezones east of it, i.e. in Scandinavia).
[3] - you pay someone to take on your risk, and the combined pay of the entire crowd is going to at least cover the costs of reimbursements to those who have incidents the insurance covers.
[4] - it was mainly traditional engineering companies participating at the innovation summit, e.g. with more tangible projects like a mobile asphalt-producing vehicle. Our idea was accepted because it was exotic with a dot.com'ish company (at least I believe so).
[5] - I saw bigger numbers as an IT trainee in insurance and finance, but they felt more distant.

Saturday, April 12, 2008

Biting the Hand that Feeds the Double Standard?


Maybe it is just me, but I am personally somewhat puzzled by two recent world events:
  1. protests against a big event in a country which at the same time is the major supplier of goods to those who are protesting against the event (double standards?)
  2. suggested actions that severely irritate another country which is the major supplier of energy to several of the countries that are irritating it (biting the hand that feeds you?)

Friday, April 4, 2008

A Machine Learning Theory Dream Team



Russia is the birthplace of many great mathematicians, and the work of quite a few of them has a significant impact on state-of-the-art computer science and machine learning theory, e.g.:
  • Ludwig O. Hesse (1811-1874) - Hessian Matrix
    • Used in calculations of Logistic Regression (which can be used for binary classification), and feed-forward Neural Networks
  • David Hilbert (1862-1943) - Hilbert Space
    • E.g. a feature space for Radial Basis Function kernels in Support Vector Machine classifiers can be described with a Hilbert Space
  • Andrey N. Kolmogorov (1903-1987) - Kolmogorov Complexity
    • Used in algorithmic information theory, and also in theory behind evolutionary and genetic programming
  • Andrei A. Markov (1856-1922) - Markov Models and Markov Chains
    • Can be used e.g. for simulation (in games).
    • Noteworthy later "spin-offs": Hidden Markov Models (HMM) and Markov Chain Monte Carlo (MCMC).
  • Andrei N. Tikhonov (1906-1993) - Tikhonov Regularization
    • Tikhonov Regularization is roughly a templating language for classification and regression. The template variable is a loss function, e.g. if using a square loss function you get Ridge Regression (also known as Regularized Least Squares Regression or Shrinkage Regression), an epsilon-insensitive loss function gives Support Vector Machine Regression, and a hinge loss function gives Support Vector Machine Classification.
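
Written out (my notation, as a rough sketch of the "template"), Tikhonov regularization minimizes a pluggable loss term plus a penalty on the complexity of the function:

\min_{f} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|^2

where V is the loss function you plug in (square, epsilon-insensitive, hinge, ...) and \lambda controls the amount of regularization.
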
Kudos.

Thursday, March 20, 2008

Test-Driven Parsing with Python, Dparser and Doctest

DParser for Python is the easiest-to-use parsing framework I've ever seen and used. It requires very little boilerplate code and supports having grammars in docstrings. Under the hood DParser is a scannerless Generalized LR (GLR, i.e. Left-to-right, Rightmost derivation) parser based on the Tomita algorithm (no less :-). (See also "A look at DParser for Python" - note: a somewhat old article.)

DParser and Doctest
Since DParser uses docstrings for grammars it didn't work together with Python's doctest, but a small change in dparser.py - replacing occurrences of f.__doc__ with f.__doc__.split(""">>>""")[0] - seemed to fix that. This means that you can relatively easily do test-driven development of parsers, i.e. just write tests and the least amount of grammar + code to make the tests pass. (note: I used DParser 1.18 combined with (Stackless) Python 2.5.1 on Linux). Check out the small example below to see how it works (the first part of each docstring has the grammar and the rest of the docstring is doctests testing the corresponding grammar):

from dparser import Parser
import doctest

def d_start(t):
    """start : noun verb
    >>> Parser().parse('cat flies')
    >>> Parser().parse('dog flies')
    """

def d_noun(t):
    """noun : 'cat' | 'dog'
    >>> Parser().parse('cat', start_symbol='noun')
    'cat'
    >>> Parser().parse('dog', start_symbol='noun')
    'dog'
    """
    return t[0]

def d_verb(t):
    """verb : 'flies'
    >>> Parser().parse('flies', start_symbol='verb')
    'flies'
    """
    return t[0]

def _test():
    doctest.testmod()

if __name__ == '__main__':
    _test()


# note: output from running: time python example.py -v
Trying:
Parser().parse('cat', start_symbol='noun')
Expecting:
'cat'
ok
Trying:
Parser().parse('dog', start_symbol='noun')
Expecting:
'dog'
ok
Trying:
Parser().parse('cat flies')
Expecting nothing
ok
Trying:
Parser().parse('dog flies')
Expecting nothing
ok
Trying:
Parser().parse('flies', start_symbol='verb')
Expecting:
'flies'
ok
2 items had no tests:
__main__
__main__._test
3 items passed all tests:
2 tests in __main__.d_noun
2 tests in __main__.d_start
1 tests in __main__.d_verb
5 tests in 5 items.
5 passed and 0 failed.
Test passed.

real 0m0.080s
user 0m0.068s
sys 0m0.008s