PG2 release 0.1.18
PG2 version 0.1.18 is available (it’s a client for Postgres). This release brings two major features:
- built-in pgvector extension support;
- better type mapping between Postgres and Clojure.
PGVector Support
Pgvector is a well known extension for PostgreSQL. It provides a fast and robust vector type which is quite useful for heavy computations. Pgvector also provides a sparse version of a vector to save space.
This section covers how to use types provided by the extension with PG2.
Vector
First, install pgvector
as the official readme file prescribes. Now that you
have it installed, try a simple table with the vector
column:
(def conn
(jdbc/get-connection {...}))
(pg/query conn "create temp table test (id int, items vector)")
(pg/execute conn "insert into test values (1, '[1,2,3]')")
(pg/execute conn "insert into test values (2, '[1,2,3,4,5]')")
(pg/execute conn "select * from test order by id")
;; [{:id 1, :items "[1,2,3]"} {:id 2, :items "[1,2,3,4,5]"}]
It works, but we got the result unparsed: the :items
field in each row is a
string. This is because, to take a custom type into account when encoding and
decoding data, you need to specify something. Namely, pass the :with-pgvector?
flag to the config map as follows:
(def config
{:host "127.0.0.1"
:port 5432
:user "test"
:password "test"
:database "test"
:with-pgvector? true})
(def conn
(jdbc/get-connection config))
Now the strings are parsed into a Clojure vector of double
values:
(pg/execute conn "select * from test order by id")
[{:id 1, :items [1.0 2.0 3.0]}
{:id 2, :items [1.0 2.0 3.0 4.0 5.0]}]
To insert a vector, pass it as a Clojure vector as well:
(pg/execute conn "insert into test values ($1, $2)"
{:params [3 [1 2 3 4 5]]})
It can be also a lazy collection of numbers produced by a map
call:
(pg/execute conn "insert into test values ($1, $2)"
{:params [4 (map inc [1 2 3 4 5])]})
The vector
column above doesn’t have an explicit size. Thus, vectors of any
size can be stored in that column. You can limit the size by providing it in
parentheses:
(pg/query conn "create temp table test2 (id int, items vector(5))")
Now if you pass a vector of a different size, you’ll get an error response from the database:
(pg/execute conn "insert into test2 values (1, '[1,2,3]')")
;; Server error response: {severity=ERROR, code=22000, file=vector.c, line=77,
;; function=CheckExpectedDim, message=expected 5 dimensions, not 3,
;; verbosity=ERROR}
The vector
type supports both text and binary modes of PostgreSQL wire
protocol.
Sparse Vector
The pgvector
extension provides a special sparsevec
type to store vectors
where only certain elements are filled. All the rest elements are considered as
zero. For example, you have a vector of 1000 items where the 3rd item is 42.001,
and 10th item is 99.123. Storing it as a native vector of 1000 double numbers is
inefficient. It can be written as follows which takes much less:
{3:42.001,10:99.123}/1000
The sparsevec
Postgres type acts exactly like this: internally, it’s a sort of
a map that stores the size (1000) and the {index -> value}
mapping. An
important note is that indexes are counted from one, not zero (see the
README.md file of the extension for details).
PG2 provides a special wrapper for a sparse vector. A brief demo:
(pg/execute conn "create temp table test3 (id int, v sparsevec)")
(pg/execute conn "insert into test3 values (1, '{2:42.00001,7:99.00009}/9')")
(pg/execute conn "select * from test3")
;; [{:v <SparseVector {2:42.00001,7:99.00009}/9>, :id 1}]
The v
field above is an instance of the org.pg.type.SparseVector
class. Let’s look at it closer:
;; put it into a separate variable
(def -sv
(-> (pg/execute conn "select * from test3")
first
:v))
(type -sv)
org.pg.type.SparseVector
The -sv
value has a number of interesting traits. To turn in into a native
Clojure map, just deref
it:
@-sv
{:nnz 2, :index {1 42.00001, 6 99.00009}, :dim 9}
It mimics the nth
access as the standard Clojure vector does:
(nth -sv 0) ;; 0.0
(nth -sv 1) ;; 42.00001
(nth -sv 2) ;; 0.0
To turn in into a native vector, just pass it into the vec
function:
(vec -sv)
[0.0 42.00001 0.0 0.0 0.0 0.0 99.00009 0.0 0.0]
There are several ways you can insert a sparse vector into the database. First, pass an ordinary vector:
(pg/execute conn "insert into test3 values ($1, $2)"
{:params [2 [5 2 6 0 2 5 0 0]]})
Internally, zero values get eliminated, and the vector is transformed into a
SparseVector
instance. Now read it back:
(pg/execute conn "select * from test3 where id = 2")
[{:v <SparseVector {1:5.0,2:2.0,3:6.0,5:2.0,6:5.0}/8>, :id 2}]
The second way is to pass a SparseVector
instance produced by the
pg.type/->sparse-vector
function. It accepts the size of the vector and a
mapping of {index => value}
:
(require '[pg.type :as t])
(pg/execute conn "insert into test3 values ($1, $2)"
{:params [3 (t/->sparse-vector 9 {0 523.23423
7 623.52346})]})
Finally, you can pass a string representation of a sparse vector:
(pg/execute conn "insert into test3 values ($1, $2)"
{:params [3 "{1:5.0,2:2.0,3:6.0,5:2.0,6:5.0}/8"]})
Like the vector
type, sparsevec
can be also limited to a certain size:
create table ... (id int, items sparsevec(5))
The sparsevec
type supports both binary and text Postgres wire protocol.
Custom Schemas
The text above assumes you have the pgvector
extension installed globally
meaning it is hosted in the public
schema. Sometimes though, extensions are
setup per schema. For example only a schema named sales
has access to the
pgvector
extension but nobody else.
If it’s your case and you installed pgvector
into a certain schema, the
standard :with-pgvector?
flag won’t work. By default, PG2 scans the pg_types
table for the public.vector
and public.sparsevec
types. Since the schema
name is not public
but sales
, you need to specify it by passing a special
option called :type-map
. It’s a map where keys are fully qualified type names
(either a keyword or a string), and values are predefined instances of the
IProcessor
interface:
(def config
{:host "127.0.0.1"
:port 5432
:user "test"
:password "test"
:database "test"
:type-map {"sales.vector" t/vector
"sales.sparsevec" t/sparsevec}})
You can rely on keywords as well:
(def config
{:host "127.0.0.1"
:port 5432
:user "test"
:password "test"
:database "test"
:type-map {:sales/vector t/vector
:sales/sparsevec t/sparsevec}})
The t
alias references the pg.type
namespace.
Now if you install the extension into the statistics
schema as well, add it
into the map:
(def config
{:host "127.0.0.1"
:port 5432
:user "test"
:password "test"
:database "test"
:type-map {:sales/vector t/vector
:sales/sparsevec t/sparsevec
:statistics/vector t/vector
:statistics/sparsevec t/sparsevec}})
Should you make a mistake in a fully qualified type name, it will be ignored, and you’ll get value from the database unparsed. The actual value depends on the binary encoding and decoding options of a connection. By default, it uses text protocol so you’ll get a string like “[1, 2, 3]”. For binary encoding and decoding, you’ll get a byte array that holds raw Postgres payload.
Custom Type Processors
PG2 version 0.1.18 has the entire type system refactored. It introduces a conception of type processors which allows to connect Postgres types with Java/Clojure ones with ease.
When reading data from Postgres, the client knows only the OID of a type of a column. This OID is just an integer number points to a certain type. The default builtin types are hard-coded in Postgres, and thus their OIDs are known in advance.
Say, it’s for sure that the int4
type has OID 23, and text
has
OID 25. That’s true for any Postgres installation. Any Postgres client has a
kind of a hash map or a Enum class with these OIDs.
Things get worse when you define custom types. These might be either enums or
complex types defined by extensions: pgvector
, postgis
and so on. You cannot
guess OIDs of types any longer because they are generated in runtime. Their
actual values depend on a specific machine. On prod, the public.vector
type
has OID 10541, on pre-prod it’s 9621, and in Docker you’ll get 1523.
Moreover, a type name is unique only across a schema that’s holding it. You can
easily have two different enum types called status
defined in various
schemas. Thus, relying on a type name is not a good option unless it’s fully
qualified.
To deal with all said above, a new conception of type mapping was introduced.
First, if a certain OID is builtin (meaning it exists the list of predefined OIDs), it gets processed as before.
When you connect to a database, you can pass a mapping like {schema.typename =>
Processor}
. When pg2 has established a connection, it executes an internal
query to discover type mapping. Namely, it reads the pg_type
table to get OIDs
that have provided schemas and type name. The query looks like this:
select
pg_type.oid, pg_namespace.nspname || '.' || pg_type.typname as type
from
pg_type, pg_namespace
where
pg_type.typnamespace = pg_namespace.oid
and pg_namespace.nspname || '.' || pg_type.typname in (
'schema1.type1',
'schema2.type2',
...
);
It returns pairs of OID and the full type name:
121512 | schema1.type1
21234 | schema2.type2
Now PG2 knows that the OID 121512 specifies schema1.type1
but nothing else.
Finally, from the map {schema.typename => Processor}
you submitted before, PG2
builds a map {OID => Processor}
. If the OID is not a default one, it checks
this map trying to find a processor object.
A processor object is an instance of the org.pg.processor.IProcessor
interface, or, if more precisely, an abstract AProcessor
which is partially
implemented. It has four methods:
ByteBuffer encodeBin(Object value, CodecParams codecParams);
String encodeTxt(Object value, CodecParams codecParams);
Object decodeBin(ByteBuffer bb, CodecParams codecParams);
Object decodeTxt(String text, CodecParams codecParams);
Depending on whether you’re decoding (reading) the data or encoding them (e.g. passing parameters), and the current format (text or binary), a corresponding method is called. By extending all four methods, you can handle any type you want.
At the moment, there are about 25 processors implementing standard types:
int2
, int4
, text
, float4
, and so on. Find them in the
pg-core/src/java/org/pg/processor
directory. There is also a couple of
processors for the pgvector
extension in the pgvector
subdirectory.
The next step is to implement processors for the postgis
extension.
Нашли ошибку? Выделите мышкой и нажмите Ctrl/⌘+Enter