Some things

Better living through biologically-inspired optimization

blog

Is active equity portfolio management worth it?

I still don't know very much about investment, but consider the following thought experiment.

Suppose you believe you can find undervalued companies to go long on and overvalued companies to short -- i.e., you want to take a bottom-up approach to portfolio management by trying to find superior individual stocks. For the sake of simplicity, suppose you limit your investment universe to large-cap U.S. equities - say, the entire S&P 500. Suppose you assume an analyst working full-time can cover 25 stocks; then you need 20 analysts to cover that universe. Since you're so good, assume you can somehow cover it all with just 15 analysts -- you're a superstar who recruits other superstars, after all. Since you're superstars, you're all being paid $300,000 a year. So you immediately have $4.5 million a year of expenses to jump through before you create any added value for your clients. Suppose you have $300 million in assets under management and the S&P 500 jumps up 10% that year, so that $300 million in a portfolio that closely follows that index would give $10 million in profit. In order for your clients to simply match the benchmark index, you have to consistently beat it by 4.5%.

That seems like quite an obstacle to overcome in order to add value. Most likely, in order to actually find undervalued companies (presumably all companies in a closely studied, well-known index like the S&P 500 are studied more than most companies), you'd have to widen your investment universe to be much larger, which means you'd have to hire more high-flying analysts, making the goal of delivering any additional profits to your clients even more difficult. And plus, shorting a Fortune 500 corporation is probably inconsistent with the risk tolerances and investment goals of most typical mutual funds.

It seems challenging to consistently beat an index fund run by one guy and his optimization software/spreadsheets after taking management fees and transaction costs into account.

A few quotes about investment and forecasting

``I'd be a bum on the street with a tin cup if the markets were always efficient.'' ---Warren Buffett on the Efficient Market Hypothesis

``Predictions are risky, especially when they're about the future.'' ---Yogi Berra

darcs rules!

too tired to use capitalization. quite shamefully, with most of my minor programming projects i don't use source control. this is stupid. the simplest version control program i've found is called darcs.

to start using darcs once you have it installed, you just do:

cd my_project
darcs initialize

it'll prompt you for a name for the initial checkin.

then you just add your existing files, like, e.g.:

darcs add *.c *.py Makefile

you can type brief descriptions of the patches (or long ones), which you can then get formatted by date in GNU file format like:

me@Stampy:~/forex$ darcs changes
Sat May  6 03:14:13 PDT 2006  my.email@address.com
  * Added lots more technical indicators. Everything not written to disk at the end is currently untested.

Fri May  5 23:49:30 PDT 2006  my.email@address.com
  * added some stuff.

Fri May  5 23:46:27 PDT 2006  my.email@address.com
  * Initial revision.
me@Stampy:~/forex$

when you modify a file that you want to check in you just do darcs record. does version control have to be any harder than dead-simple commands like this? i think not.

Lisp vs. OCaml: fight, fight, fight!

I don't yet know OCaml well and I don't care to take the time to consider the relative quality of these two chunks of code, but here's some Lisp benchmark code:

(defun main ()
  (flet ((read-int (x)
       (let ((int 0) (sgn 1) (eof t))
         (declare (type fixnum int sgn))
         (loop for byte of-type fixnum = (char-code (read-char x nil #\null))
           for num of-type fixnum = (- byte (char-code #\0))
           when (= byte (char-code #\-)) do (setq sgn -1)
           when (and (/= byte (char-code #\-)) (or (< num 0) (>= num 10))) do (return (if eof nil (* int sgn)))
           when (/= byte (char-code #\-)) do
           (if eof (setq eof nil))
           (setq int (+ (the fixnum (* int 10)) num))))))

    (princ
     (loop for int = (read-int *standard-input*)
       when (not (null int)) sum (the fixnum int) into res
       when (null int) do (return res)))
    (terpri)))

And here's the equivalent OCaml for the same benchmark:


let sum = ref 0
let rec loop () = sum := !sum + int_of_string (input_line stdin); loop ();;
try loop () with End_of_file -> Printf.printf "%d\n" !sum

...

Ouch!

News from my exciting life

Well, I've been learning more about investment and genetic programming.

It turns out that there are multiple unanswered questions in the intersection of machine learning and finance/investment:

  • Does technical analysis (the use of historical information to predict and analyze trends) have any real merit? Practitioners and academics (as well as academics and academics) differ on this issue.
  • If there exist trading rules which are likely to give returns greater than that of a simple "buy and hold" strategy, can they be found? If so, how?

The literature on investment and genetic programming shows mixed results and returns.

Bah!

Motherfucker! It seems that between the time Sycara and Thomas wrote their papers on using text data for financial prediction, apparently the Yahoo finance boards went to shit. The MSFT discussion board is one big never-ending flamewar about Microsoft vs. Linux -- very little discussion of Microsoft's financial status actually occurs. Remarkably, a cursory analysis of the discussion surrounding heavily traded stocks like GOOG, MSFT and so on reveals that actually there's very little information in the posts. It's not intelligent, thoughtful discussion of the kind you'd find at Fool.com or the Wall Street Journal, it's mostly typo-ridden indecipherable garbage that the most sophisticated natural language processing methods are powerless to deal with. Yahoo finance board discussion data is, in other words, so ridiculously noisy as to be useless.

Genetic Programming Ramblings

What I've been learning about lately is genetic programming. You can read about the description of it at Wikipedia if you're unfamiliar with it.

I found a couple of interesting papers, one of which is available online in PDF form: Integrating Genetic Algorithms and Text Learning for Financial Prediction by Thomas and Sycara of CMU. They give very interesting (profitable) results from creating a simple committee (weighted average) system from both evolved trading rules and the output of text classifiers, both working on downloaded stock bulletin board data. Unfortunately, I think manually training a text classifier just doesn't scale up to large data sets, mainly because I'm lazy. Also, they don't take into account transaction costs.

That's why I like clustering, self-organizing maps, and other unsupervised methods so much: I hate tagging and labelling stuff by hand.

Words I wish could be removed from the Intarweb's vocabulary

Like Abe Simpson writing a rambling letter to the local newspaper, I find myself writing this list of words and phrases I would never like to see used again on the Internets.

  • Web 2.0
  • Mashup
  • Beta (when used to mean "we're going to imply that our product or service is fit for a particular purpose and aggressively encourage you to use it but not take responsibility for when it fucks up")
  • Long tail
  • Don't Repeat Yourself
  • Convention over Configuration
  • Model-View-Controller
  • Object-relational impedance mismatch
  • "Hey guys, check out my new half-baked Python web framework!"
  • Blogosphere
  • Folksonomy
  • Services-oriented architecture [1]
  • "Rails for --"
  • Pinko marketing
  • Enterprise anything
  • XML configuration file

[1] I still don't know what the fuck this is.

Once more unto the breach

OK, I'm giving Lisp another try. Last time was On Lisp and futile attempts to cobble together a web programming environment in Common Lisp. The dismay at the community's lack of interest in providing Unicode support was heartwrenching: how can you claim to have an XML parser if it will garble non-Latin 1 characters to shit?

This time, I'm reading the famous SICP. I secretly suspect that SICP is one of those books that everyone can clearly see is a landmark work, a magnificent book, but few who don't use it for a university course (i.e., they major in electrical engineering/computer science at Berkeley or MIT) actually read and appreciate it thoroughly. I shamefully but freely admit that I one of those people. People are a lot more upfront about admitting they can't muster the discipline to work through Knuth's books.

This was my new year's resolution: find a way to do nice, continuation-based web programming in a Lisp dialect. So far I'm not doing a very good job of it.

Chandler Angst

I want to use Chandler, but skimming the Wiki shows that they aren't planning to get serious about performance for months.

They more or less need to improve the speed by a factor of five for most operations in order to make it feel like a usable program, which is a lot better than when I tried it some time in late 2004: they needed a couple orders of magnitude then, if I remember correctly.

Without a doubt the most beautiful Python desktop program I've ever seen.

More whining about Java

Why do Java projects always have such deep, complex directory structures? If you have a 200-SLOC web project that I just want to skim the code of to get a feel for how your framework/library works in practice, please don't make me do deep sea diving in my file management program just to look at your autogenerated-getter-and-setter-method heavy .java files.

And using Java after using Lisps, Python, and Ruby would never stop being painful: you don't get the efficiency of C++ but you do get the typing headaches.

Claim: correct, maintainable, brief, explicit code is more appropriate for an "enterprise" than verbose, mediocre Java code.

Sun's idea of an embedded relational database

Look how god damn complicated it is to get a lightweight, deployable database into a Java program.

Everything in Java must be overly complicated and always require "configuration properties", likely via multiple XML files. Why? Why? Why always the needless complexity and verbosity? It really doesn't help. Mickey mouse desktop CRUD applications are really boring. You aren't getting anywhere. Apparently there are many common verbose pieces of code that must be mindlessly spat out in the course of designing a typical non-trivial Java application; these are given the euphemism "Java design patterns." You see that little bit of linguistic chicanery they pulled there? We are asked to consider the phenomenon of a language-specific (library specific, perhaps) design pattern: one that actually has jack shit to do with design principles. One of dozens that are invisible in dynamic languages. Sigh.

SQLite, in comparison, is so much simpler. It also happens to be smaller (a few hundred kilobytes vs. 2 megabytes) and is probably faster (being implemented in ANSI C rather than Java), in additional to being free in all senses of the word.

Stupid Python Tricks II: A 50-line micro-web framework.

It doesn't work with anything but Python's builtin HTTP server, it has no templating, it basically has nothing but URL dispatching, and only for control classes defined within the file. Nevertheless, I present a microscopic model-view-controller framework. Um, minus the model, and the view. Requires Python Routes, a reimplementation of Ruby on Rails' famous RESTful default URL dispatching for Python.


# eigenframe.py - a proper frame(work)

import BaseHTTPServer, SimpleHTTPServer, sys
from routes import Mapper

dispatcher = Mapper()

def bad_url():
    return "Bad URL"

class Handler(SimpleHTTPServer.SimpleHTTPRequestHandler):
    def run_app(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()

        # nice rails-style mapping.
        dispatcher.connect(":(ctrl)/:(act)/:(id)")
        dispatcher.connect(":(ctrl)", act="index", id=None)
        dispatcher.connect(":(ctrl)/:(act)", id=None)
        dispatcher.connect("/",ctrl="default",act="index",id=None)
        dispatcher.create_regs([]) # what does this do? whatever.
        route=dispatcher.match(self.path)

        if route is not None:
            data = self.do_call(route['ctrl'].title(),route['act'],route['id'])
            self.wfile.write(data)
        else:
            self.wfile.write(bad_url())

    def do_GET(self):
        if self.path.startswith("/static/") or self.path == "/favicon.ico":
            SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
        else:
            self.run_app()

    def do_call(self, ctrl, act, id):
        try:
            mod = __import__(ctrl)
        except:
            return bad_url()
        cls = vars(mod)[ctrl]()
        return getattr(cls, act)(id)

def run(port=defaultport):
    httpd = BaseHTTPServer.HTTPServer(("", port), Handler)
    httpd.serve_forever()

if __name__ == "__main__":
    run(8000)

When it sees a controller like blaahg it looks to Blaahg.py for a controller clled Blaahg. So load up a new file, Default.py with the contents as follows:


import eigenframe

class Default:
    def index(self):
        return "Jumping is not a crime"

if __name__ == "__main__": eigenframe.run()

Now access http://localhost:8000.


warren@Stampy:~/agg$ lynx -dump http://localhost:8000/

   Jumping is not a crime

warren@Stampy:~/agg$

So it has a 7-line "hello, world", about 20 times better than in Java. Buzzword compliance: REST, Convention over Configuration. Obviously it can't do much but the point is clear: dynamic languages really are a huge win. And with that I'm going to bed.

Stupid Python tricks


>>> class A:
...     def meh(self):
...             print "ook!"
... 
>>> getattr(locals()['A'](),"meh")()
ook!
>>>

Thank you, thank you, I'll be here all week. Don't forget to tip your waiter or waitress, God bless.

What will Perl 6 look like?

Judging from the example code available, it will still be dirty, ugly shit: full of & * @ # $ {} -> and the like.

Blegh. Making a language look like input to half a dozen unrelated Unix scripting/sys admin tools from the '80s might have made sense in the 80s, but hopefully we have moved on to better things.

The subtext of the above claims is that how a language looks and feels matters. In fact, to me it matters a lot.

libsvm 2.82 released

Version 2.82 of the nice++ libsvm was released on April Fools' day. They now support precomputed kernels.

web.py using SQLObject and SQLite

Aaron Swartz' web.py looks pretty good, but his framework doesn't thus far seem to support SQLite (?). Furthermore, I like using the object-relational manager SQLObject (which does support SQLite). So as I was working through the web.py tutorial, I changed the code snippets to work with SQLObject and SQLite. I store my model data in a new file, model.py, the contents of which is as follows:


from sqlobject import *
import datetime

# change the location at the end for your own purposes
sqlhub.processConnection = connectionForURI("sqlite:///home/warren/todo.db")

class Todo(SQLObject):
    title=StringCol()
    created=DateTimeCol(default=datetime.datetime.now)
    done=BoolCol(default=False)

Then the modified code.py is as follows.


import web
from model import Todo

urls = (
    '/', 'view',
    '/add', 'add'
)

class view:
    def GET(self):
        todos=Todo.select()
        web.render('view.html')

class add:
    def POST(self):
        i = web.input()
        new=Todo(title=i.title)
        web.seeother('./#t'+str(new.id))

web.internalerror = web.debugerror

if __name__ == "__main__":
    web.run(urls, web.reloader)

Notice how I import from model.py, and the use of SQLObject functions. For the GET function in view, I get the data by using Todo.select() rather than using web.py's querying function. Similarly a new object is created using SQLObject rather than web.py. Also absent are database parameters (web.db_parameters) mentioned in the tutorial.

The file view.html is unchanged from the tutorial but it is included here for completeness.


<ul>
   #for todo in $todos
       <li id="t$todo.id">$todo.title</li>
   #end for
</ul>

<form method="post" action="add">
<p><input type="text" name="title" /> <input type="submit" value="Add" /></p>
</form>

This shows that web.py certainly isn't incompatible with object-relational managers. Inspection of the source to the tutorial file (as of 30 March 2006) shows that originally Aaron had something almost identical to this, in fact also using SQLObject.

My first reaction after completing this exercise is that web.py is wonderful. It's the first web framework I've used which doesn't frustrate and annoy me.

A Boatload of New GA Tech Reports

Over at the Illinois Genetic Algorithms Lab blog, many new technical reports are available for download, including several coauthored by Mr. Genetic Algorithms, David Goldberg.

Python Date Arithmetic Rules!

I started getting upset that I might have to implement annoying calendrical arithmetic stuff for my program, but it's already been taken care of for me! Praise Bob! Observe:

>>> import datetime
>>> d=datetime.timedelta(days=1)
>>> today=datetime.date.today()
>>> today
datetime.date(2006, 3, 23)
>>> today += d # increment a date by one day
>>> today
datetime.date(2006, 3, 24)

It also knows about when months end and begin.

>>> m1=datetime.date(2006,3,1)
>>> m1-=d
>>> m1
datetime.date(2006, 2, 28)

That is so cool!

The Python docs have more info if you want.

Why do default C/C++ rand() functions suck so much?

Probably the most surprising thing I learned this quarter (Spring '06) at university is that:

  • The default random number generator(s) included in most popular C/C++ compilers is/are completely inappropriate for simulation purposes;
  • Many people doing research and writing books which involve writing code that uses random number generation are apparently unaware of this.

These are people with PhDs in computer science who don't know that their random number generation is crippling their code.

To see why they're potentially so bad, consider the following situation. Some of them only generate 65,535 or so unique numbers before repeating. Even a modest program using randomness will easy use that up every second. For instance, I wrote an implementation of a multilayer perceptron which randomly trains on the available training data. It's not uncommon to use 200,000 elements for a training set; if we want to permute that list, we'll need to generate more than 65,535 elements every single run of the data! This is a completely unacceptable situation. A minimum period (numbers it can generate before repeating) is, say, 2^30 or so. Even that might not be good enough.

So what should you use instead?

Well, there are various different ones, but for my purposes the Mersenne Twister, created in 1997, is good enough. It has a period of 2^19937-1, which is about 10^6000. The creators of it have an implementation on the site I linked to available under the GPL. It has good performance. The GNU Scientific Library also has an implementation which you can use with their random number generation facility.

What I find outrageous is that gcc, Visual C++, and the like give the user poor random number generation by default. If they don't know any better, they shouldn't get poor-quality generation by default.

Python uses the MT by default, though. Props.

Notes on Twisted

What is Twisted?

Twisted is a free Python networking framework. If you want to write a networked application using any dynamic language, you should strongly consider using Python and Twisted. Actually, even if you're going to use C/C++ (you'd better be writing a realtime 3d game or something), strongly consider embedding Python and using Twisted, as well. Check out its homepage for more information, documentation, and download links.

The problem is that Twisted has a steep learning curve and it's slightly unintuitive. The basic idea of its execution is building up "event loops"/chains; Twisted is fundamentally event-driven. The mental model I have in my head is like a finite automaton (state machine). You create the states and the transition events. It does this to avoid the debugging nightmares which can come with multithreaded code.

I also find the documentation confusing. Their howtos/tutorials focus on stuff I have no interest in (like creating UDP-based servers). I've never written anything which used anything besides SFTP/FTP/HTTP, and only a couple used anything but HTTP.

This little page will serve to document things I want to remember.

Downloading stuff at different intervals

Suppose you have a task you want to run every 10 (or 60, or 3600, or whatever) seconds. The best example I can think of is a news aggregator which conditionally downloads new feeds every 60 minutes or so.

The way to achieve this is by using a LoopingCall object. Observe:

from twisted.internet import task, reactor
import time

def runthis():
    print time.strftime("the time is now %H:%M:%S")

l = task.LoopingCall(runthis)
l.start(10.0) # call every 10 seconds

# l.stop() will stop the looping calls
reactor.run()

This produces output like:

the time is now 19:54:16
the time is now 19:54:26
the time is now 19:54:36

And so on. All we have to do is give Twisted a callback and the appropriate interval. Nice.

Well, what if we want to have multiple tasks which will possibly run at different intervals (e.g., we want to check different feeds at various intervals)? Then we just do the same trick multiple times, like this:

from twisted.internet import task, reactor
import time

def task1():
    print time.strftime("time is %H:%M:%S, task 1")

def task2():
    print time.strftime("time is %H:%M:%S, task 2")

loop1 = task.LoopingCall(task1)
loop1.start(10.0)

loop2 = task.LoopingCall(task2)
loop2.start(7.0)

reactor.run()

And we get output like:

time is 19:59:40, task 1
time is 19:59:40, task 2
time is 19:59:47, task 2
time is 19:59:50, task 1
time is 19:59:54, task 2
time is 20:00:00, task 1
time is 20:00:01, task 2

Etc. The different tasks run asynchronously and on their own intervals. Nice!

Stuff I want to figure out

  • How the hell do I conditonally download stuff without making a class factory and other silly stuff? I want the code to be relatively readable to someone who doesn't know how Twisted works.