Search This Blog

14 February 2008

Aggregate Functions with Perl stored procedures.

I found myself thinking this evening about how someone could set about writing aggregate functions using Perl and on an idle Google search, came across this webpage entitled User-defined Aggregate Functions in DB2 Universal Database.

I wondered if a similar technique could be applied for our implementation of External Language Stored Procedures for MySQL. It turns out that the answer is: Yes.

Suppose you had the following Perl declarations in a Perl module:

our %summary=();
sub aggregate_add($$)
{
my ($value,$group)= @_;
if (defined $value)
{
$summary{$group}= {value=>0.0, value2=>0.0, count=>0}
if !defined $summary{$group};
my $scalar= scalar $value;
$summary{$group}{value}+= $scalar;
$summary{$group}{value2}+= $scalar * $scalar;
$summary{$group}{count}++;
}
return $value;
}
sub aggregate_result($$)
{
my ($value,$group)= @_;
return undef if !defined $summary{$group};
my $count= $summary{$group}{count};
my $average= $summary{$group}{value} / $count;
my $sqavg= $summary{$group}{value2} / $count;
my $variance= $sqavg - ($average * $average);
my $stddev= sqrt $variance;
return sprintf("count=%d avg=%0.3f var=%0.3f std=%0.3f",
$count, $average, $variance, $stddev);
}


Now we need to test this little piece of Perl magic by starting up a client session...

mysql> CREATE TABLE t1 (grp INT, a DOUBLE);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t1 VALUES (1,1), (2,2), (2,3), (3,4), (3,5), (3,6);
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0

mysql> CREATE FUNCTION test.agg_result(value DOUBLE, grp INT) RETURNS CHAR(128)
-> LANGUAGE Perl NO SQL EXTERNAL NAME 'Foo::aggregate_result';
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE FUNCTION test.agg_add(value DOUBLE, grp INT) RETURNS DOUBLE
-> LANGUAGE Perl NO SQL EXTERNAL NAME 'Foo::aggregate_add';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT COUNT(a),
-> CAST(AVG(a) AS DECIMAL(7,3)) 'AVG',
-> CAST(VARIANCE(a) AS DECIMAL(7,3)) 'VAR',
-> CAST(STD(a) AS DECIMAL(7,3)) 'STD',
-> test.agg_result(MAX(test.agg_add(a,grp)),grp) 'TEST'
-> FROM t1 GROUP BY grp;
+----------+-------+-------+-------+---------------------------------------+
| COUNT(a) | AVG | VAR | STD | TEST |
+----------+-------+-------+-------+---------------------------------------+
| 1 | 1.000 | 0.000 | 0.000 | count=1 avg=1.000 var=0.000 std=0.000 |
| 2 | 2.500 | 0.250 | 0.500 | count=2 avg=2.500 var=0.250 std=0.500 |
| 3 | 5.000 | 0.667 | 0.816 | count=3 avg=5.000 var=0.667 std=0.816 |
+----------+-------+-------+-------+---------------------------------------+
3 rows in set (0.04 sec)


Whoot! It actually works! Note that this only works because the Perl instances are not shared between multiple threads and so all the calls to those subroutines and their variables exist within the calling thread only. This avoids any nasty synchronization headaches. I would imagine that it would not be much more difficult to do something similar with the Java plugin except that the global statistics variable would have to be a ThreadLocal instance to avoid multiple threads conflicting with each other.

I guess I shall just have to include it as a test case; at least until a concrete method to declare aggregate stored functions arrives. I would imagine that such a declaration would name existing stored procedures in order to construct the aggregate function.

Something to think about for the future.

(An unrelated side note, March is Endometriosis Awareness Month. Please spend a thought for women who suffer from this disabilitating disease... Thanks)