Why I use median, and when you should too

One of the issues I struggled with on MTG Card Prices was how to price a card, based on historical sales.

My initial solution was to take the average price of the prices that had occurred in the last 4 weeks, rolling. Prices for individual cards always fluctuated a little, so smoothing out bumps worked well. Also, it was easy (MySQL has an avg() function).

While it worked for most of the cards, it would occasionally lead to strange results. An example of this is the Beta Sol Ring, which had sales like this:

$25.12, $1725.98, $80.98, $69.99, $62, $73.32, $109.54, $110.25

The $1,725.98 price is a huge outlier, and while the auction itself included other beta cards, the auction title only mentioned the Sol Ring, so of course the matcher matched it to a Beta Sol Ring.

So, with the above auctions, using average, the price is $253.59, which no one will pay for.

Knowing that any matching system will incorrectly match an auction like this, the question becomes, "How can I get a reasonable price given the occasional outlier?".

Fortunately, there is a solution to this problem, and I came across it recently while reading "Data Analysis with Open Source Tools". Philip mentions that when you have a set that in not evenly distributed, you should use median, not average.

Using our set of prices above, the median comes to $73.77, which is close to what someone would actually buy a Beta Sol Ring for.

This is a much better solution, and it's a shame that MySQL doesn't have a built-in median function. I ended up using a substring_index solution that seemed like the least painful, and didn't require any self-joins.

So, the rule of thumb is: "If you know your data is evenly distributed, use average, otherwise you probably want median".

The price effect of banning Jace and Stoneforge Mystic in Standard

A few days ago, Jace, the Mind Sculptor and Stoneforge Mystic were banned by Wizards. What is unexpected is that the price for a Foil Jace is still high, close to $200. This may be because the are not many for sale, so their value remains high, or it may be a delayed effect of the banning, we will have to watch and see.

Jace is used in other formats, so his price may stay stable, but you can see both the Stoneforge Mystic and Foil Stoneforge Mystic are down since June 19th (46% and 30%, respectively)

Matching items to a known corpus

As part of MTG Card Prices, we have to be able to identify what cards
are part of an auction. If you could only list a single card on eBay,
this would be relatively simple, but it is complicated by the fact
that you can list

  • Multiple different cards together 
  • Different editions of the card (10th ed, M11, M10, etc) 
  • Foil versions of some card 
  • Foreign editions 
  • Altered cards 


And so on. For some of the more difficult cases (multiple cards
together, altered cards, foreign cards), I am just excluding them. If
a person is buy 2 Jaces and a Tezzeret, it's hard to get an idea how
that person is valuing those cards individually.

For the other cases, I have a database of cards I've created using the
Gatherer search engine and some Internet digging. That database is
used as my "corpus" and auctions have to be matched to it.

Every day, there are about 10k auctions that are listed that we need
to try and match to our corpus. Also, as the matching logic evolves
or new cards become available, we need to be able to rematch days or
even weeks of cards historically.

I started by breaking up the card and cardset name into their
keywords, so a card like "Tezzeret, Agent of Bolas" from "Mirrodin
Besieged" would become "tezzeret agent bolas mirrodin besieged".
These words were then put into a hash table, specifying the card #
that they were a part of. Something like this:

{
    mirrodin => [ 1, 2, 3 ],
    besieged => [ 1, 2, 3 ],
    tezzeret => [ 1 ],
    bolas => [ 1 ],
}

Then, you would break the auction into keywords and add a point for each card that has that keyword. (ex. you have the auction "Tezzeret from Mirrodin, like NEW!!!!", this would break into "tezzeret from
mirrodin like new", and when you add up the keywords, card 1 would
have 2 points, while card 2 and 3 would only have 1 point).

Then, you would order the cards by score, and if you met some
threshold, you'd match the auction to the card with the most points.

This was a fine solution to the problem and generated good "guesses"
if a card couldn't be matched.

Unfortunately, it was too slow for my needs (5 auctions/sec),
requiring n hash lookups for each auction (where n is the # of
keywords in an auction), and it took up a fair amount of memory.

Doing a little research, I came across a data structure called a trie
that I thought would work well. (yes, it's trie). And since I'm using
Ruby, there's a gem for it.

After doing a little testing, I was achieving similar auto-matching
%'s, but had more then doubled my performance (12 auctions/sec).

Unfortunately, the trie structure was even more memory intensive then
the previous solution. But, it had pointed me in the right direction.
Instead of storing the trie a character at a time, I built a trie to
store an entire word at a time. That and I used hashes to store the
data, instead of creating a trie-node each time.

With a little monkey-patching to the Hash class, the result was much
improved memory and incredibly fast matching (170+ auctions /sec)!

 

Command line scripts with Sinatra

There are times when I want to do batch processing or perform other
command line activities that don't fit well into the web-framework.
The trouble is, my db connection info is based on what "environement"
I'm working in. I could re-create the hooks to set that environment,
but I'd rather leverage the existing capabilities in Sinatra to do so.
That way if they change how it works in the future, I don't have to
work through all that again.

In Sinatra, there's a Delegator module that handles passing requests
for methods like 'production?' or 'set' off to the App. With that in
mind, I can easily just include the Delegator and let it handle
everything from there.

Here a sample of my code:

require 'rubygems'
require 'optparse'
require 'sinatra/base'
include Sinatra::Delegator

options = {}

OptionParser.new do |op|
op.on('-e env') { |val| set :environment, val.to_sym }
end.parse!

You can see this on my github page as well:

https://github.com/benprew/mtg/blob/master/bin/match_xtns.rb

Testing sessions with Sinatra

I've got a sinatra-based app that relies on sessions and I need to test them.  After doing a little digging, here was the solution I was able to come up with:

1. Make sure that you're setting your RACK_ENV to 'test' ( I do this in my Rakefile )
2. Disable sessions in your app when in test? 
3. when you make a request that requires a session, pass 'rack.session' => {} in the environment hash

See an example of this in my app: http://github.com/benprew/picklespears
 
Here's the breakdown of the code,

In the Rakefile:

  ENV['RACK_ENV'] = 'test'

In the app (picklespears.rb)

if test?
  set :sessions, false
else
  set :sessions, true
end
And finally, in the test itself (test/test_player.rb)
    post '/player/update', { :name => 'new_name' }, 'rack.session' => { :player_id => player.id }