Introduction

Note

This project is built with nbdev, a full literate programming environment built on Jupyter Notebooks. That means that every piece of documentation, including the page you’re reading now, can be accessed as an interactive Jupyter notebook.

This tutorial will teach you the basics of the spannerlog language and the spannerlib framework.

Spannerlog:

  • Is similar to Datalog, but has type safety features
  • Supports aggregation functions
  • Lets you use stateless user-defined functions, called IE functions, to derive new relations from existing relations
  • Has some DRY features to help you write spannerlog code effectively
  • Comes with support for Document Spanners using the Span class

Spannerlib, via its Session object, enables:

  • Registering IE functions and aggregation functions to be used as callbacks in spannerlog
  • Executing spannerlog code programmatically

Installation

Prerequisites:

  • Have Python version 3.8 or above installed

To download and install spannerlib, run the following commands in your terminal:

git clone https://github.com/DeanLight/spannerlib
cd spannerlib
pip install . 

Make sure you are calling the pip that belongs to your current Python environment. To install with another Python interpreter, run:

<path_to_python_interpreter> -m pip install .

You can also install spannerlib in the current Jupyter kernel:

!git clone https://github.com/DeanLight/spannerlib
!pip install spannerlib

To use spannerlib in Jupyter notebooks, you must first load it:

import spannerlib

Importing the spannerlib library automatically loads the %%spannerlog cell magic, which accepts spannerlog queries as shown below.

new uncle(str, str)
uncle("bob", "greg")
?uncle(X,Y)
'?uncle(X,Y)'
X Y
bob greg

The Spannerlog Language

Type safe Datalog

Spannerlog syntax is very similar to Datalog, but relations and their types must be declared using the new keyword.

# defining relations
new parent(str,str)
# defining initial facts
parent('xerces', 'brooke')
parent('brooke', 'damocles')

Rules describe how to derive new facts from existing facts.

  • We call the part to the left of the <- the rule’s head (or head clause).
  • We call the part to the right of the <- the rule’s body (made up of body clauses).

# you can define relations recursively
# and use line escapes for long rules to make them more readable
ancestor(X, Y) <- parent(X, Y).
ancestor(X, Y) <- parent(X, Z),
     ancestor(Z, Y).

Derived and existing facts can be queried using the ? operator, with either logical variables such as X or constants.

?parent(X,Y)

?ancestor('xerces',Y)

?ancestor('xerces','brooke')
'?parent(X,Y)'
X Y
brooke damocles
xerces brooke
"?ancestor('xerces',Y)"
Y
brooke
damocles
"?ancestor('xerces','brooke')"
True
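
To make the recursive rules above concrete, the following plain-Python sketch derives the same ancestor facts by naive fixpoint iteration: it keeps applying the two rules until no new facts appear. This only illustrates the semantics and is not how spannerlib evaluates rules; the function name derive_ancestors is made up for this example.

```python
def derive_ancestors(parent_facts):
    """Naive fixpoint evaluation of the two ancestor rules (illustrative only)."""
    # Rule 1: ancestor(X, Y) <- parent(X, Y).
    ancestor = set(parent_facts)
    while True:
        # Rule 2: ancestor(X, Y) <- parent(X, Z), ancestor(Z, Y).
        derived = {
            (x, y)
            for (x, z) in parent_facts
            for (z2, y) in ancestor
            if z == z2
        }
        # Stop once an iteration produces nothing new.
        if derived <= ancestor:
            return ancestor
        ancestor |= derived


parent = {("xerces", "brooke"), ("brooke", "damocles")}
ancestor = derive_ancestors(parent)

# Counterpart of ?ancestor('xerces', Y): X is fixed to a constant, Y stays free.
xerces_descendants = sorted(y for (x, y) in ancestor if x == "xerces")

# Counterpart of ?ancestor('xerces', 'brooke'): a yes/no membership test.
is_ancestor = ("xerces", "brooke") in ancestor
```

A query with a free variable corresponds to filtering the derived set, while a query made only of constants reduces to a membership check, which is why the last query above prints True rather than a table.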

Spannerlog has built-in support for declaring relations over primitive types:

  • int
  • str
  • float
  • bool

Programmatically, however, you can define relations and add facts of any Python data type.

Aggregation

You can use aggregation functions in a rule’s head to express group-by-and-aggregate logic. Grouping happens on the non-aggregated variables; the other variables are aggregated by their respective functions.

numDescendants(X,count(Y)) <- ancestor(X,Y).

?numDescendants(X,N)
'?numDescendants(X,N)'
X N
brooke 1
xerces 2
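
The group-by-and-aggregate behaviour of the numDescendants rule can be mimicked in plain Python as a rough sketch: the non-aggregated variable X becomes the grouping key, and count is applied to the Y values collected for each key. The helper name count_per_group is hypothetical and the code is not part of spannerlib.

```python
from collections import defaultdict


def count_per_group(facts):
    """Group (x, y) facts by x and count the y values in each group."""
    groups = defaultdict(list)
    for x, y in facts:
        groups[x].append(y)
    return {x: len(ys) for x, ys in groups.items()}


# The ancestor facts derived from the parent relation earlier in the tutorial.
ancestor = {
    ("xerces", "brooke"),
    ("brooke", "damocles"),
    ("xerces", "damocles"),
}
num_descendants = count_per_group(ancestor)
```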

Built-in aggregations include:

  • min
  • max
  • sum
  • avg
  • count

Later sections show how to define and register your own aggregation functions.

IE functions

Given a pure (stateless) function f(X,Y)->(Z), we can think of f as deriving information from (x,y) values to generate (z) values. In the relational setting, IE functions are pure functions that take tuples over some input schema and derive a number of new tuples over some output schema. We can use IE functions as body clauses to derive new facts.

IE functions are invoked using the func_name(InputVars...)->(OutputVars...) syntax.

new Texts(str)
Texts("Hello darkness my old friend")
Texts("I've come to talk with you again")

Words(Word) <- Texts(X), rgx("(\w+)",X)->(Word).

?Words(W)
'?Words(W)'
W
[@9a1d0f,0,5) "Hello"
[@9a1d0f,6,14) "darkness"
[@9a1d0f,15,17) "my"
[@9a1d0f,18,21) "old"
[@9a1d0f,22,28) "friend"
[@c7e66d,0,1) "I"
[@c7e66d,2,4) "ve"
[@c7e66d,5,9) "come"
[@c7e66d,10,12) "to"
[@c7e66d,13,17) "talk"
[@c7e66d,18,22) "with"
[@c7e66d,23,26) "you"
[@c7e66d,27,32) "again"

rgx is one of the built-in IE functions. It returns Spans over the original text. We will learn more about Spans later.
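
As a rough mental model of an IE function, here is a plain-Python sketch that behaves like a simplified regex extractor: for every input text it yields zero or more output rows. It uses only Python's re module; the name rgx_sketch and the (start, end, text) row shape are assumptions made for this example and do not reflect how spannerlib implements rgx.

```python
import re


def rgx_sketch(pattern, text):
    """Yield (start, end, matched_text) for the first capture group of each match."""
    for match in re.finditer(pattern, text):
        # start(1)/end(1) are the offsets of capture group 1 in the original text,
        # forming the half-open interval [start, end).
        yield match.start(1), match.end(1), match.group(1)


rows = list(rgx_sketch(r"(\w+)", "Hello darkness my old friend"))
```

Each yielded row plays the role of one derived tuple, and the offsets line up with the [start,end) intervals shown in the query output above.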

DRY with variables

Variables that appear in rules and queries are called logical variables. To help make our code more DRY and easier to read, we can also define normal variables and dereference them in rules and facts using the $ operator, as follows:

darkness_sentence = "hello darkness my old friend"
sunshine_sentence = "Im walking on sunshine"

new SentenceSentiment(str,bool)
SentenceSentiment($darkness_sentence, False)
SentenceSentiment($sunshine_sentence, True)

?SentenceSentiment($darkness_sentence,Y)

word_pattern = "(\w+)"

BadWords(Word) <- SentenceSentiment(Sentence, False), rgx($word_pattern, Sentence)->(Word).

?BadWords(W)
'?SentenceSentiment($darkness_sentence,Y)'
Y
False
'?BadWords(W)'
W
[@42bf20,0,5) "hello"
[@42bf20,6,14) "darkness"
[@42bf20,15,17) "my"
[@42bf20,18,21) "old"
[@42bf20,22,28) "friend"

Spans

Spans, or document spanners, are objects that describe an interval of a string. They are available as part of the library and can be used as follows:

from pathlib import Path
from spannerlib import Span
some_text = "hello darkness my old friend"
full_span = Span(some_text)
full_span,str(full_span)
([@42bf20,0,28) "hello dark...", 'hello darkness my old friend')

Note that Spans are represented by a triple of [doc_id,i,j). They encode the string doc_id[i:j].

You can control the doc_id with the name parameter. It is also initialized automatically to a file’s name when a Path object is passed to a Span.

Span(some_text,name="greeting")
[@greeting,0,28) "hello dark..."
file_path = Path("copilot_data/example_code.py")
file_span = Span(file_path)
file_span
[@example_code.py,0,178) "def f(x,y)..."

Spans can be initialized with specific indices:

first_2_words = Span(some_text, start=0, end=14)
first_2_words,str(first_2_words)
([@42bf20,0,14) "hello dark...", 'hello darkness')

And can be sliced like a string to produce new spans with matching indices.

darkness = first_2_words[6:14]
darkness
[@42bf20,6,14) "darkness"
dark = darkness[:4]
dark
[@42bf20,6,10) "dark"
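
The index arithmetic behind slicing can be sketched with a small stand-in class. SpanSketch below is hypothetical and much simpler than spannerlib's Span; it only shows that slice offsets are relative to the span being sliced, while the stored indices stay relative to the original document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for a span: the half-open interval [start, end) of doc."""

    doc: str
    start: int
    end: int

    def __str__(self):
        return self.doc[self.start:self.end]

    def __getitem__(self, key):
        if not isinstance(key, slice) or key.step not in (None, 1):
            raise TypeError("only contiguous slices are supported in this sketch")
        # slice.indices clips the requested slice to this span's length.
        lo, hi, _ = key.indices(self.end - self.start)
        hi = max(hi, lo)
        # Offsets are relative to this span, so shift them by self.start.
        return SpanSketch(self.doc, self.start + lo, self.start + hi)


text = "hello darkness my old friend"
first_2_words = SpanSketch(text, 0, 14)
darkness = first_2_words[6:14]
dark = darkness[:4]
```

Slicing darkness with [:4] yields the interval [6,10) rather than [0,4), matching the output shown above.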

Due to an open issue in pandas, Spans and other classes are not printed properly in dataframes.

import pandas as pd

df = pd.DataFrame([
    [full_span, str(full_span)],
    [file_span, str(file_span)],
    [first_2_words, str(first_2_words)],
    [darkness, str(darkness)],
    [dark, str(dark)]
])
df
0 1
0 (h, e, l, l, o, , d, a, r, k, n, e, s, s, , ... hello darkness my old friend
1 (d, e, f, , f, (, x, ,, y, ), :, \n, , , ,... def f(x,y): x+y def g(x,y): return f...
2 (h, e, l, l, o, , d, a, r, k, n, e, s, s) hello darkness
3 (d, a, r, k, n, e, s, s) darkness
4 (d, a, r, k) dark

This is why we need the map(repr) workaround to see them clearly. Spannerlog also does this behind the scenes.

df.map(repr)
0 1
0 [@42bf20,0,28) "hello dark..." 'hello darkness my old friend'
1 [@example_code.py,0,178) "def f(x,y)..." 'def f(x,y):\n x+y \n\ndef g(x,y):\n ret...
2 [@42bf20,0,14) "hello dark..." 'hello darkness'
3 [@42bf20,6,14) "darkness" 'darkness'
4 [@42bf20,6,10) "dark" 'dark'

Many default IE functions use Spans as their return types. To convert a Span (or anything else) to a string, we can use the as_str IE function.

BadWordStrings(Word) <- BadWords(WordSpan),as_str(WordSpan)->(Word).
?BadWordStrings(W)
'?BadWordStrings(W)'
W
darkness
friend
hello
my
old

Python Spannerlog Interactions

Exporting query results and changing sessions

All interaction between spannerlog and Python is mediated through a Session object. For example, calling the export method with a string of spannerlog code executes that code.

from spannerlib import Session

sess=Session()
# export returns the value of the last statement in our code, which is the query in this case.
uncle_df = sess.export("""
new uncle(str, str)
uncle("bob", "greg")
?uncle(X,Y)
""")
uncle_df
X Y
0 bob greg

In fact, the magic system that allows us to use %%spannerlog cells simply initializes a Session object and runs the code we put in %%spannerlog cells through it. The magic system also prints the results of queries to the screen for ease of debugging.

To get or change the session that the magic system uses, do the following:

from spannerlib import get_magic_session,set_magic_session

magic_sess = get_magic_session()
magic_sess.export("?BadWordStrings(W)")
W
0 darkness
1 friend
2 hello
3 my
4 old

As you can see, rules run in the magic cells were executed inside the magic_sess object we now have access to. Now let’s set the magic system to use the Session bound to the sess variable.

set_magic_session(sess)

Now we can query uncle from the magic system.

?uncle(X,Y)
'?uncle(X,Y)'
X Y
bob greg

Using sessions programmatically, we can not only get results into Python for post-processing, but also run spannerlog code outside of a Jupyter notebook.

Importing data to spannerlog

Usually, our data doesn’t come as spannerlog facts, but rather from other relational sources. We can import our data into a session as follows:

# either a path to a csv file, or a dataframe
aunts_data = pd.DataFrame([
    ["susan", "greg"],
    ["susan", "jerry"]
], columns=["Aunt", "Of"])

sess.import_rel(name="Aunts", data=aunts_data)
?Aunts(X,Y)
'?Aunts(X,Y)'
X Y
susan greg
susan jerry

We can also import variables from python into spannerlog.

sess.import_var(name='fox_sent', value='what does the fox say?')
FoxyWords(Word) <- rgx("(\w+)",$fox_sent)->(Word).

?FoxyWords(W)
'?FoxyWords(W)'
W
[@0ffcc6,0,4) "what"
[@0ffcc6,5,9) "does"
[@0ffcc6,10,13) "the"
[@0ffcc6,14,17) "fox"
[@0ffcc6,18,21) "say"

There are also auxiliary methods for deleting rules, which are useful if you made an error when defining a rule and do not want to restart the session, such as:

  • remove_all_rules
  • remove_rule
  • remove_head

For a full list of methods and options, see the Session class in the reference guide.

Defining and registering IE and Aggregation functions

Part of what makes spannerlib powerful is that you can define your own callbacks as Python functions and register them for use right away.

  • An IE function is a stateless function that takes a tuple over an input schema and returns an Iterable over an output schema.
  • An Aggregation function (Agg function for short) is a stateless function that takes a set/list of values and returns a single value.

To register our functions, we need to tell the session what input/output schema to expect.

"hello $ world $#".count("$")
2
def char_positions(text,char):
    # here we return a list of tuples
    return [(i,) for i,letter in enumerate(text) if char==letter]

def char_positions_iter(text,char):
    # we can also return a lazy iterable using python generators
    for i,letter in enumerate(text):
        if letter==char:
            # spannerlib knows to wrap single values in a tuple
            yield i 

# We register our function's input/output schema as a list of pythonic types
sess.register('char_pos',char_positions_iter,[str,str],[int])


def count_twice(positions):
    return 2*len(positions)

sess.register_agg('count_twice',count_twice,[int],[int])
new Texts(str)
Texts("hello darkness my old friend$")
Texts("I need a $ $ $ is what i need")

DollarPos(Text,Pos) <- Texts(Text),char_pos(Text,"$")->(Pos).

?DollarPos(T,P)

TwiceTotalDollars(Text,count_twice(Pos)) <- DollarPos(Text,Pos).
?TwiceTotalDollars(T,N)
'?DollarPos(T,P)'
T P
I need a $ $ $ is what i need 9
I need a $ $ $ is what i need 11
I need a $ $ $ is what i need 13
hello darkness my old friend$ 28
'?TwiceTotalDollars(T,N)'
T N
I need a $ $ $ is what i need 6
hello darkness my old friend$ 2
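
Because IE and aggregation callbacks are ordinary Python functions, they can be checked on their own before being registered. The sketch below reuses char_positions_iter and count_twice from above and mimics, in plain Python, what the two rules compute: the IE function is applied once per input row, and the aggregation is applied once per group. The variable names are chosen for this example, and the evaluation order is an assumption, not a description of spannerlib's engine.

```python
def char_positions_iter(text, char):
    # Same IE callback as above: yields one position per occurrence of char.
    for i, letter in enumerate(text):
        if letter == char:
            yield i


def count_twice(positions):
    # Same aggregation callback as above: receives all values of one group.
    return 2 * len(positions)


texts = ["hello darkness my old friend$", "I need a $ $ $ is what i need"]

# DollarPos(Text,Pos) <- Texts(Text),char_pos(Text,"$")->(Pos).
dollar_pos = [(text, pos) for text in texts for pos in char_positions_iter(text, "$")]

# TwiceTotalDollars(Text,count_twice(Pos)) <- DollarPos(Text,Pos).
positions_by_text = {}
for text, pos in dollar_pos:
    positions_by_text.setdefault(text, []).append(pos)
twice_total = {text: count_twice(ps) for text, ps in positions_by_text.items()}
```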

If we want a callback function to work with multiple types, we can either register it with a common supertype or pass a tuple of types. For example:

# object is a supertype of all Python objects, so count_twice will work on anything
sess.register_agg('count_twice',count_twice,[object],[int])

# now we support Paths as well
def char_positions_iter(text,char):
    if isinstance(text,Path):
        text = text.read_text()
    for i,letter in enumerate(text):
        if letter==char:
            yield i

sess.register('char_pos',char_positions_iter,[(Path,str),str],[int])
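
One way to read a schema entry such as (Path,str) is as the second argument to Python's isinstance, which accepts either a single type or a tuple of types. The helper below is a hypothetical sketch of such a check; spannerlib's actual schema validation may work differently.

```python
from pathlib import Path


def matches_schema(values, schema):
    """Return True if each value is an instance of its schema entry.

    A schema entry may be a single type or a tuple of alternative types,
    which is exactly what isinstance accepts as its second argument.
    """
    return len(values) == len(schema) and all(
        isinstance(value, expected) for value, expected in zip(values, schema)
    )


in_schema = [(Path, str), str]

accepts_str = matches_schema(["I need a $", "$"], in_schema)
accepts_path = matches_schema([Path("notes.txt"), "$"], in_schema)
rejects_int = matches_schema([42, "$"], in_schema)
```

Registering with object instead of a tuple works for the same reason: every Python value passes an isinstance check against object.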

To inspect which callback functions are available in a session, we can do:

sess.get_all_functions()
{'ie': {'print': IEFunction(name='print', func=<function print_ie>, in_schema=<function object_arity>, out_schema=[<class 'object'>]),
  'rgx': IEFunction(name='rgx', func=<function rgx>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=<function span_arity>),
  'rgx_split': IEFunction(name='rgx_split', func=<function rgx_split>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=[<class 'spannerlib.span.Span'>, <class 'spannerlib.span.Span'>]),
  'rgx_is_match': IEFunction(name='rgx_is_match', func=<function rgx_is_match>, in_schema=[<class 'str'>, (<class 'str'>, <class 'spannerlib.span.Span'>)], out_schema=[<class 'bool'>]),
  'expr_eval': IEFunction(name='expr_eval', func=<function expr_eval>, in_schema=<function object_arity>, out_schema=[<class 'object'>]),
  'not': IEFunction(name='not', func=<function not_ie>, in_schema=[<class 'bool'>], out_schema=[<class 'bool'>]),
  'as_str': IEFunction(name='as_str', func=<function as_str>, in_schema=[<class 'object'>], out_schema=[<class 'str'>]),
  'span_contained': IEFunction(name='span_contained', func=<function span_contained>, in_schema=[<class 'spannerlib.span.Span'>, <class 'spannerlib.span.Span'>], out_schema=[<class 'bool'>]),
  'deconstruct_span': IEFunction(name='deconstruct_span', func=<function deconstruct_span>, in_schema=[<class 'spannerlib.span.Span'>], out_schema=[<class 'str'>, <class 'int'>, <class 'int'>]),
  'read': IEFunction(name='read', func=<function read>, in_schema=[<class 'str'>], out_schema=[<class 'str'>]),
  'read_span': IEFunction(name='read_span', func=<function read_span>, in_schema=[<class 'str'>], out_schema=[<class 'spannerlib.span.Span'>]),
  'json_path': IEFunction(name='json_path', func=<function json_path>, in_schema=[<class 'str'>, <class 'str'>], out_schema=[<class 'str'>, <class 'str'>]),
  'char_pos': IEFunction(name='char_pos', func=<function char_positions_iter>, in_schema=[(<class 'pathlib.Path'>, <class 'str'>), <class 'str'>], out_schema=[<class 'int'>])},
 'agg': {'count': AGGFunction(name='count', func='count', in_schema=[<class 'object'>], out_schema=[<class 'int'>]),
  'sum': AGGFunction(name='sum', func='sum', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
  'avg': AGGFunction(name='avg', func='avg', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
  'max': AGGFunction(name='max', func='max', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
  'min': AGGFunction(name='min', func='min', in_schema=[<class 'numbers.Real'>], out_schema=[<class 'numbers.Real'>]),
  'count_twice': AGGFunction(name='count_twice', func=<function count_twice>, in_schema=[<class 'object'>], out_schema=[<class 'int'>])}}

This returns a nested dict with all callbacks. To see the list of default IE functions and their documentation, go to the standard IE function section in the documentation.