import spannerlib
Introduction
This tutorial will teach you the basics of the spannerlog language and the spannerlib framework.
Spannerlog:
* Is similar to Datalog, but has type safety features
* Has support for aggregation functions
* Enables using stateless user-defined functions, called IE functions, to derive new relations from existing relations
* Has some DRY features to help you write spannerlog code effectively
* Comes with support for Document Spanners using the Span class

Spannerlib, via its Session object, enables:
* Registering IE functions and aggregation functions to be used as callbacks in spannerlog
* Executing spannerlog code programmatically
Installation
Prerequisites:
- Have Python version 3.8 or above installed
To download and install spannerlib, run the following commands in your terminal:
git clone https://github.com/DeanLight/spannerlib
cd spannerlib
pip install .
Make sure you are calling the pip version of your current python environment. To install with another python interpreter, run
<path_to_python_interpreter> -m pip install .
You can also install spannerlib in the current Jupyter kernel:
!git clone https://github.com/DeanLight/spannerlib
!pip install spannerlib
In order to use spannerlib in Jupyter notebooks, you must first load it:
import spannerlib
Importing the spannerlib library automatically loads the %%spannerlog cell magic, which accepts spannerlog queries as shown below.
new uncle(str, str)
uncle("bob", "greg")
?uncle(X,Y)
'?uncle(X,Y)'
X | Y |
---|---|
bob | greg |
The Spannerlog Language
Type safe Datalog
Spannerlog syntax is very similar to Datalog, but relations and their types must be declared using the new keyword.
# defining relations
new parent(str,str)
# defining initial facts
parent('xerces', 'brooke')
parent('brooke', 'damocles')
Rules can be defined that describe how to derive new facts from existing facts.
* We call the part to the left of the <- the rule's head (or head clause).
* We call the part to the right of the <- the rule's body (made up of body clauses).
# you can define relations recursively
# and use line escapes for long rules to make them more readable
ancestor(X, Y) <- parent(X, Y).
ancestor(X, Y) <- parent(X, Z),
    ancestor(Z, Y).
Derived and existing facts can be queried using the ? operator, with either logical variables such as X or constants.
?parent(X,Y)
?ancestor('xerces',Y)
?ancestor('xerces','brooke')
'?parent(X,Y)'
X | Y |
---|---|
brooke | damocles |
xerces | brooke |
"?ancestor('xerces',Y)"
Y |
---|
brooke |
damocles |
"?ancestor('xerces','brooke')"
True
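One way to read the recursive ancestor rules above is as a fixpoint computation: keep applying the rules until no new facts appear. The following is a minimal pure-Python sketch of that reading. It is only an illustration of the semantics, not how spannerlib evaluates rules internally, and the helper name derive_ancestors is an assumption made for this example.

```python
# Facts from the example above, stored as a set of (parent, child) tuples.
parent = {("xerces", "brooke"), ("brooke", "damocles")}

def derive_ancestors(parent_facts):
    """Naive fixpoint: apply both rules until no new facts are derived."""
    # Rule 1: ancestor(X, Y) <- parent(X, Y).
    ancestor = set(parent_facts)
    while True:
        # Rule 2: ancestor(X, Y) <- parent(X, Z), ancestor(Z, Y).
        derived = {
            (x, y)
            for (x, z) in parent_facts
            for (z2, y) in ancestor
            if z == z2
        }
        if derived <= ancestor:  # nothing new, so we reached the fixpoint
            return ancestor
        ancestor |= derived

ancestor = derive_ancestors(parent)

# Rough equivalent of the query ?ancestor('xerces', Y)
xerces_descendants = sorted(y for (x, y) in ancestor if x == "xerces")
print(xerces_descendants)
```

The loop stops because each round can only add facts built from a finite set of constants.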
Spannerlog has built-in support for declaring relations over primitive types:
* int
* str
* float
* bool
Programmatically, however, you can define relations and add facts of any Python data type.
Aggregation
You can use aggregation functions in a rule’s head to express group-by-and-aggregate logic. Grouping happens on the non-aggregated variables; the other variables are aggregated by their respective functions.
numDescendants(X,count(Y)) <- ancestor(X,Y).
?numDescendants(X,N)
'?numDescendants(X,N)'
X | N |
---|---|
brooke | 1 |
xerces | 2 |
Built-in aggregations include:
* min
* max
* sum
* avg
* count
You will see in later sections that external aggregation functions can also be defined.
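To make the group-by-and-aggregate reading concrete, here is a small pure-Python sketch of what the numDescendants rule computes: the non-aggregated variable X forms the groups, and count is applied to the Y values in each group. The function name count_by_first is hypothetical, and this is not spannerlib's implementation.

```python
from collections import defaultdict

# The ancestor facts derived in the previous section.
ancestor = {
    ("xerces", "brooke"),
    ("brooke", "damocles"),
    ("xerces", "damocles"),
}

def count_by_first(facts):
    """Group (x, y) facts by x and count the distinct y values per group."""
    groups = defaultdict(set)
    for x, y in facts:
        groups[x].add(y)
    return {x: len(ys) for x, ys in groups.items()}

# Rough equivalent of: numDescendants(X, count(Y)) <- ancestor(X, Y).
num_descendants = count_by_first(ancestor)
print(num_descendants)
```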
IE functions
Given a pure (stateless) function f(X,Y)->(Z), we can think of f as deriving information from (x,y) values to generate (z) values. In the relational setting, IE functions are pure functions that take tuples over some input schema and derive a number of new tuples from them over some output schema. We can use IE functions as body clauses to derive new facts.
IE functions are invoked using the func_name(InputVars...)->(OutputVars...) syntax.
new Texts(str)
Texts("Hello darkness my old friend")
Texts("I've come to talk with you again")

Words(Word) <- Texts(X), rgx("(\w+)",X)->(Word).
?Words(W)
'?Words(W)'
W |
---|
[@9a1d0f,0,5) "Hello" |
[@9a1d0f,6,14) "darkness" |
[@9a1d0f,15,17) "my" |
[@9a1d0f,18,21) "old" |
[@9a1d0f,22,28) "friend" |
[@c7e66d,0,1) "I" |
[@c7e66d,2,4) "ve" |
[@c7e66d,5,9) "come" |
[@c7e66d,10,12) "to" |
[@c7e66d,13,17) "talk" |
[@c7e66d,18,22) "with" |
[@c7e66d,23,26) "you" |
[@c7e66d,27,32) "again" |
rgx is one of the built-in IE functions. It returns Spans over the original text. We will learn more about Spans later.
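As a rough mental model of how an IE function behaves in a rule body, the sketch below applies a generator to every input tuple and collects the tuples it yields. It uses plain (start, end, text) tuples instead of Span objects, and the name words_ie is an assumption for this illustration; it is not the built-in rgx.

```python
import re

def words_ie(pattern, text):
    """A stateless IE-style function: yield one output tuple per regex match."""
    for match in re.finditer(pattern, text):
        # Offsets of the first capture group inside the input text.
        yield (match.start(1), match.end(1), match.group(1))

# A unary relation, stored as a list of 1-tuples.
texts = [("Hello darkness my old friend",)]

# Rough equivalent of: Words(...) <- Texts(X), words_ie("(\w+)", X) -> (...).
words = [
    (start, end, word)
    for (text,) in texts
    for (start, end, word) in words_ie(r"(\w+)", text)
]
print(words)
```

The offsets produced here line up with the [start,end) intervals shown for the first sentence in the output above.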
DRY with variables.
Variables that appear in rules and queries are called logical variables. To help make our code more DRY and easier to read, we can also define normal variables and dereference them in rules and facts using the $ operator, as follows:
darkness_sentence = "hello darkness my old friend"
sunshine_sentence = "Im walking on sunshine"

new SentenceSentiment(str,bool)
SentenceSentiment($darkness_sentence, False)
SentenceSentiment($sunshine_sentence, True)

?SentenceSentiment($darkness_sentence,Y)

word_pattern = "(\w+)"

BadWords(Word) <- SentenceSentiment(Sentence, False), rgx($word_pattern, Sentence)->(Word).
?BadWords(W)
'?SentenceSentiment($darkness_sentence,Y)'
Y |
---|
False |
'?BadWords(W)'
W |
---|
[@42bf20,0,5) "hello" |
[@42bf20,6,14) "darkness" |
[@42bf20,15,17) "my" |
[@42bf20,18,21) "old" |
[@42bf20,22,28) "friend" |
Spans
Spans, or document spanners, are objects that describe an interval of a string. They are available as part of the library and can be used as such:
from pathlib import Path
from spannerlib import Span
some_text = "hello darkness my old friend"
full_span = Span(some_text)
full_span, str(full_span)
([@42bf20,0,28) "hello dark...", 'hello darkness my old friend')
Note that Spans are represented by a triple of [doc_id,i,j). They encode the string doc_id[i:j].
You can control the doc_id with the name parameter. It is also initialized automatically to a file’s name when a Path object is passed to a Span.
Span(some_text,name="greeting")
[@greeting,0,28) "hello dark..."
file_path = Path("copilot_data/example_code.py")
file_span = Span(file_path)
file_span
[@example_code.py,0,178) "def f(x,y)..."
Spans can be initialized with specific indices:
first_2_words = Span(some_text, start=0, end=14)
first_2_words, str(first_2_words)
([@42bf20,0,14) "hello dark...", 'hello darkness')
And can be sliced like a string to produce new spans with matching indices.
darkness = first_2_words[6:14]
darkness
[@42bf20,6,14) "darkness"
dark = darkness[:4]
dark
[@42bf20,6,10) "dark"
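To see why slicing a span yields indices relative to the original document, consider the following toy model. MiniSpan is a hypothetical class written only for this illustration; it is not spannerlib's Span and omits doc ids, names and Path handling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniSpan:
    """Toy half-open interval [start, end) over a document string."""
    doc: str
    start: int
    end: int

    def __str__(self):
        # The span encodes the substring doc[start:end].
        return self.doc[self.start:self.end]

    def __getitem__(self, key):
        # This sketch only supports plain slices with step 1.
        lo, hi, step = key.indices(self.end - self.start)
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        # Slice offsets are relative to the span, so shift them by self.start.
        return MiniSpan(self.doc, self.start + lo, self.start + max(lo, hi))

text = "hello darkness my old friend"
first_2_words = MiniSpan(text, 0, 14)
darkness = first_2_words[6:14]
dark = darkness[:4]
print((dark.start, dark.end), str(dark))
```

The key design point is that a slice never copies text: it only produces a new pair of offsets into the same document.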
Due to an open issue in pandas, Spans and other classes are not printed properly in dataframes.
import pandas as pd
df = pd.DataFrame([
    [full_span, str(full_span)],
    [file_span, str(file_span)],
    [first_2_words, str(first_2_words)],
    [darkness, str(darkness)],
    [dark, str(dark)]
])
df
 | 0 | 1 |
---|---|---|
0 | (h, e, l, l, o, , d, a, r, k, n, e, s, s, , ... | hello darkness my old friend |
1 | (d, e, f, , f, (, x, ,, y, ), :, \n, , , ,... | def f(x,y): x+y def g(x,y): return f... |
2 | (h, e, l, l, o, , d, a, r, k, n, e, s, s) | hello darkness |
3 | (d, a, r, k, n, e, s, s) | darkness |
4 | (d, a, r, k) | dark |
This is why we need to use the map(repr) workaround to see them clearly. Spannerlog also does this behind the scenes.
df.map(repr)
 | 0 | 1 |
---|---|---|
0 | [@42bf20,0,28) "hello dark..." | 'hello darkness my old friend' |
1 | [@example_code.py,0,178) "def f(x,y)..." | 'def f(x,y):\n x+y \n\ndef g(x,y):\n ret... |
2 | [@42bf20,0,14) "hello dark..." | 'hello darkness' |
3 | [@42bf20,6,14) "darkness" | 'darkness' |
4 | [@42bf20,6,10) "dark" | 'dark' |
Many default IE functions use Spans as their return types. To convert a Span (or anything else) to a string, we can use the as_str IE function.
BadWordStrings(Word) <- BadWords(WordSpan),as_str(WordSpan)->(Word).
?BadWordStrings(W)
'?BadWordStrings(W)'
W |
---|
darkness |
friend |
hello |
my |
old |
Python Spannerlog Interactions
Exporting query results and changing sessions.
All interaction between spannerlog and Python is mediated through a Session object. For example, calling the export method with a string of spannerlog code will execute it.
from spannerlib import Session
sess = Session()
# export returns the value of the last statement in our code, which is the query in this case.
uncle_df = sess.export("""
new uncle(str, str)
uncle("bob", "greg")
?uncle(X,Y)
""")
uncle_df
 | X | Y |
---|---|---|
0 | bob | greg |
In fact, the magic system that allows us to use %%spannerlog cells simply initializes a Session object and runs the code we put in %%spannerlog cells through it. The magic system also prints the results of queries to the screen for ease of debugging.
To get or change the session that the magic system uses, do the following:
from spannerlib import get_magic_session,set_magic_session
magic_sess = get_magic_session()
magic_sess.export("?BadWordStrings(W)")
 | W |
---|---|
0 | darkness |
1 | friend |
2 | hello |
3 | my |
4 | old |
As you can see, rules run in the magic cells were executed inside the magic_sess object we now have access to. Now let's set the magic system to use the Session bound to the sess variable.
set_magic_session(sess)
Now we can query uncle from the magic system.
?uncle(X,Y)
'?uncle(X,Y)'
X | Y |
---|---|
bob | greg |
Using sessions programmatically, we can not only get results into Python for post-processing, but also run spannerlog code outside of a Jupyter notebook.
Importing data to spannerlog
Usually, our data doesn't come as spannerlog facts, but rather from other relational sources. We can import our data into a session as follows:
# either a path to a csv file, or a dataframe
aunts_data = pd.DataFrame([
    ["susan", "greg"],
    ["susan", "jerry"]
], columns=["Aunt", "Of"])

sess.import_rel(name="Aunts", data=aunts_data)
?Aunts(X,Y)
'?Aunts(X,Y)'
X | Y |
---|---|
susan | greg |
susan | jerry |
We can also import variables from python into spannerlog.
sess.import_var(name='fox_sent', value='what does the fox say?')
FoxyWords(Word) <- rgx("(\w+)",$fox_sent)->(Word).
?FoxyWords(W)
'?FoxyWords(W)'
W |
---|
[@0ffcc6,0,4) "what" |
[@0ffcc6,5,9) "does" |
[@0ffcc6,10,13) "the" |
[@0ffcc6,14,17) "fox" |
[@0ffcc6,18,21) "say" |
There are also auxiliary methods for deleting rules, useful if you made an error when defining a rule and do not want to restart the session, such as:
* remove_all_rules
* remove_rule
* remove_head
For a full list of methods and options, see the Session class in the reference guide.
Defining and registering IE and Aggregation functions
Part of what makes spannerlib powerful is that you can define your own callbacks as Python functions and register them for use right away.
* An IE function is a stateless function that takes a tuple over an input schema and returns an Iterable over an output schema.
* An Aggregation function (Agg function for short) is a stateless function that takes a set/list of values and returns a single value.
To register our functions, we need to tell the session what input/output schema to expect.
"hello $ world $#".count("$")
2
def char_positions(text,char):
    # here we return a list of tuples
    return [(i,) for i,letter in enumerate(text) if char==letter]

def char_positions_iter(text,char):
    # we can also return a lazy iterable using python generators
    for i,letter in enumerate(text):
        if letter==char:
            # spannerlib knows to wrap single values in a tuple
            yield i

# We register our function's input/output schema as a list of pythonic types
sess.register('char_pos',char_positions_iter,[str,str],[int])

def count_twice(positions):
    return 2*len(positions)

sess.register_agg('count_twice',count_twice,[int],[int])
new Texts(str)
Texts("hello darkness my old friend$")
Texts("I need a $ $ $ is what i need")

DollarPos(Text,Pos) <- Texts(Text),char_pos(Text,"$")->(Pos).
?DollarPos(T,P)

TwiceTotalDollars(Text,count_twice(Pos)) <- DollarPos(Text,Pos).
?TwiceTotalDollars(T,N)
'?DollarPos(T,P)'
T | P |
---|---|
I need a $ $ $ is what i need | 9 |
I need a $ $ $ is what i need | 11 |
I need a $ $ $ is what i need | 13 |
hello darkness my old friend$ | 28 |
'?TwiceTotalDollars(T,N)'
T | N |
---|---|
I need a $ $ $ is what i need | 6 |
hello darkness my old friend$ | 2 |
If we want a callback function to work with multiple types, we can either register it with a common supertype or pass a tuple of types. For example:
# object is a super type of all python objects, so count will work on anything
sess.register_agg('count_twice',count_twice,[object],[int])

# now we support Paths as well
def char_positions_iter(text,char):
    if isinstance(text,Path):
        text = text.read_text()
    for i,letter in enumerate(text):
        if letter==char:
            yield i

sess.register('char_pos',char_positions_iter,[(Path,str),str],[int])
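A schema entry that is a tuple of types can be understood the same way Python's isinstance treats a tuple: the value may be an instance of any of the listed types. The sketch below illustrates that idea with a hypothetical helper, matches_schema. It is an assumption about how such a check could look, not spannerlib's actual validation code.

```python
from pathlib import Path

def matches_schema(values, schema):
    """Return True if every value is an instance of its schema entry.

    Each schema entry is either a single type or a tuple of accepted types,
    which is exactly what isinstance accepts as its second argument.
    """
    return len(values) == len(schema) and all(
        isinstance(value, expected) for value, expected in zip(values, schema)
    )

# Same shape as the input schema registered for char_pos above.
in_schema = [(Path, str), str]

ok_str = matches_schema(("some $ text", "$"), in_schema)
ok_path = matches_schema((Path("notes.txt"), "$"), in_schema)  # the file is never opened here
bad_int = matches_schema((42, "$"), in_schema)
print(ok_str, ok_path, bad_int)
```

Registering with object as the type corresponds to the degenerate case where every value passes the check.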
To inspect which callback functions are available in a session, we can call:
sess.get_all_functions()
{'ie': {'print': IEFunction(name='print', func=<function print_ie>, in_schema=<function object_arity>, out_schema=[<class 'object'>]),
'rgx': IEFunction(name='rgx', func=<function rgx>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=<function span_arity>),
'rgx_split': IEFunction(name='rgx_split', func=<function rgx_split>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=[<class 'spannerlib.span.Span'>, <class 'spannerlib.span.Span'>]),
'rgx_is_match': IEFunction(name='rgx_is_match', func=<function rgx_is_match>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=[<class 'bool'>]),
'expr_eval': IEFunction(name='expr_eval', func=<function expr_eval>, in_schema=<function object_arity>, out_schema=[<class 'object'>]),
'not': IEFunction(name='not', func=<function not_ie>, in_schema=[<class 'bool'>], out_schema=[<class 'bool'>]),
'as_str': IEFunction(name='as_str', func=<function as_str>, in_schema=[<class 'object'>], out_schema=[<class 'str'>]),
'span_contained': IEFunction(name='span_contained', func=<function span_contained>, in_schema=[<class 'spannerlib.span.Span'>, <class 'spannerlib.span.Span'>], out_schema=[<class 'bool'>]),
'deconstruct_span': IEFunction(name='deconstruct_span', func=<function deconstruct_span>, in_schema=[<class 'spannerlib.span.Span'>], out_schema=[<class 'str'>, <class 'int'>, <class 'int'>]),
'read': IEFunction(name='read', func=<function read>, in_schema=[<class 'str'>], out_schema=[<class 'str'>]),
'read_span': IEFunction(name='read_span', func=<function read_span>, in_schema=[<class 'str'>], out_schema=[<class 'spannerlib.span.Span'>]),
'json_path': IEFunction(name='json_path', func=<function json_path>, in_schema=[<class 'str'>, <class 'str'>], out_schema=[<class 'str'>, <class 'str'>]),
'char_pos': IEFunction(name='char_pos', func=<function char_positions_iter>, in_schema=[(<class 'pathlib.Path'>, <class 'str'>), <class 'str'>], out_schema=[<class 'int'>])},
'agg': {'count': AGGFunction(name='count', func='count', in_schema=[<class 'object'>], out_schema=[<class 'int'>]),
'sum': AGGFunction(name='sum', func='sum', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
'avg': AGGFunction(name='avg', func='avg', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
'max': AGGFunction(name='max', func='max', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
'min': AGGFunction(name='min', func='min', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
'count_twice': AGGFunction(name='count_twice', func=<function count_twice>, in_schema=[<class 'object'>], out_schema=[<class 'int'>])}}
This returns a nested dict with all callbacks. To see the list of default IE functions and their documentation, go to the standard IE function section in the documentation.