import spannerlib
import pandas as pd
# get dynamic access to the session running through the jupyter magic system
from spannerlib import get_magic_session
session = get_magic_session()
Welcome to Spannerlib
Welcome to the spannerlib project.
Spannerlib is a framework for building programming languages that combine imperative and declarative programming. The combination is based on derivations of the document spanner model.
Currently, we implement a language called spannerlog on top of Python. spannerlog is an extension of statically typed Datalog that lets users define their own information extraction (IE) functions, which can be used to derive new structured information from relations.
The spannerlog REPL, shown below, is served using Jupyter magic commands.
Below, we show how to install and use spannerlog through Spannerlib.
For more comprehensive walkthroughs, see our tutorials section.
Installation
Unix
To download and install RGXLog, run the following commands in your terminal:
git clone https://github.com/DeanLight/spannerlib
cd spannerlib
pip install -e .
Download CoreNLP to spannerlib/rgxlog/ from this link.
# verify everything worked
# first time might take a couple of minutes since run time assets are being configured
python nbdev_test.py
Docker
git clone https://github.com/DeanLight/spannerlib
cd spannerlib
Download CoreNLP to spannerlib/rgxlog/ from this link.
docker build . -t spannerlib_image
# on Windows, replace `pwd` with the current working directory
# to get a bash terminal to the container
docker run --name swc --rm -it \
-v `pwd`:/spannerlib:Z \
spannerlib_image bash
# to run an interactive notebook on host port 8891
docker run --name swc --rm -it \
-v `pwd`:/spannerlib:Z \
-p8891:8888 \
spannerlib_image jupyter notebook --no-browser --allow-root
# verify tests inside the container
python /spannerlib/nbdev_test.py
Getting started - TLDR
Here is a TLDR intro. For a more comprehensive tutorial, please see the introduction section of the tutorials.
Get a dataframe
lecturer_df = pd.DataFrame(
    [["walter","chemistry"],
     ["linus", "operating_systems"],
     ['rick', 'physics']
    ],columns=["name","course"])
lecturer_df
 | name | course |
---|---|---|
0 | walter | chemistry |
1 | linus | operating_systems |
2 | rick | physics |
Or a CSV
pd.read_csv('sample_data/example_students.csv',names=["name","course"])
 | name | course |
---|---|---|
0 | abigail | chemistry |
1 | abigail | operation systems |
2 | jordan | chemistry |
3 | gale | operation systems |
4 | howard | chemistry |
5 | howard | physics |
Import them to the session
session.import_rel("lecturer",lecturer_df)
session.import_rel("enrolled","sample_data/enrolled.csv",delim=",")
They can even be documents
documents = pd.DataFrame([
    ["abigail is happy, but walter did not approve"],
    ["howard is happy, gale is happy, but jordan is sad"]
])
session.import_rel("documents",documents)
?documents(X)
'?documents(X)'
X |
---|
abigail is happy, but walter did not approve |
howard is happy, gale is happy, but jordan is sad |
Define your own IE functions to extract information from relations
# the function itself; writing it as a python generator makes your data processing lazy
def get_happy(text):
    """
    get the names of people who are happy in `text`
    """
    import re

    # raw string, so that \w is not treated as a string escape sequence
    compiled_rgx = re.compile(r"(\w+) is happy")
    num_groups = compiled_rgx.groups
    for match in compiled_rgx.finditer(text):
        if num_groups == 0:
            # no capture groups: return the whole match
            matched_strings = [match.group()]
        else:
            # one output value per capture group
            matched_strings = [group for group in match.groups()]
        yield matched_strings
# register the ie function with the session
session.register("get_happy", # name of the function
                 get_happy, # the function itself
                 [str], # input types
                 [str] # output types
)
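To see which rows the IE function contributes before wiring it into a session, it can be called directly in plain Python. The sketch below restates `get_happy` in compact form and applies it to the two example documents; `docs` and `happy_rows` are illustrative names and not part of the spannerlib API.

```python
import re


def get_happy(text):
    """Yield one single-element list per '<name> is happy' match in `text`."""
    for match in re.finditer(r"(\w+) is happy", text):
        yield list(match.groups())


# the two example documents used in this README
docs = [
    "abigail is happy, but walter did not approve",
    "howard is happy, gale is happy, but jordan is sad",
]

# Nothing is computed until the generator is consumed by the comprehension.
happy_rows = [(doc, row[0]) for doc in docs for row in get_happy(doc)]

for doc, name in happy_rows:
    print(f"{name!r} extracted from {doc!r}")
```

Each yielded list holds one value per declared output type, which is why the registration above declares a single `str` output.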
rgxlog supports relations over the following primitive types:
* strings
* spans
* integers
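Of these types, spans are probably the least familiar. In the document spanner model a span is an interval of character positions inside a document. The sketch below illustrates the idea with Python's standard `re` module only; it assumes spannerlib's span type follows the same idea and does not use or describe spannerlib's actual span class.

```python
import re

text = "howard is happy, gale is happy, but jordan is sad"

# (start, end) character offsets of the first capture group of every match
spans = [m.span(1) for m in re.finditer(r"(\w+) is happy", text)]

# slicing the document with a span recovers the string it points at
names = [text[start:end] for start, end in spans]

print(spans)
print(names)
```

A span keeps the position of the extracted text, while a string keeps only its content, so converting a span to a string discards where in the document it came from.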
Write a rgxlog program (like datalog but you can use your own ie functions)
session.remove_all_rules()
# you can also define data inline via a statically typed variant of datalog syntax
new sad_lecturers(str)
sad_lecturers("walter")
sad_lecturers("linus")

# and include primitive variable
gpa_doc = "abigail 100 jordan 80 gale 79 howard 60"

# define datalog rules
enrolled_in_chemistry(X) <- enrolled(X, "chemistry").
enrolled_in_physics_and_chemistry(X) <- enrolled_in_chemistry(X), enrolled(X, "physics").

# and query them inline (to print to screen)
# ?enrolled_in_chemistry("jordan") # returns empty tuple ()
# ?enrolled_in_chemistry("gale") # returns nothing
# ?enrolled_in_chemistry(X) # returns "abigail", "jordan" and "howard"
# ?enrolled_in_physics_and_chemistry(X) # returns "howard"

lecturer_of(X,Z) <- lecturer(X,Y), enrolled(Z,Y).
# use ie functions in body clauses to extract structured data from unstructured data
# standard ie functions like regex are already registered
student_gpas(Student, Grade) <-
    rgx("(\w+).*?(\d+)",$gpa_doc)->(StudentSpan, GradeSpan),
    as_str(StudentSpan)->(Student), as_str(GradeSpan)->(Grade).
# and you can use your defined functions as well
happy_students_with_sad_lecturers_and_their_gpas(Student, Grade, Lecturer) <-
    documents(Doc),
    get_happy(Doc)->(Student),
    sad_lecturers(Lecturer),
    lecturer_of(Lecturer,Student),
    student_gpas(Student, Grade).
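For readers new to Datalog, a rule can be read as a set comprehension: each body atom filters or joins rows, and variables shared between atoms must agree. The sketch below mirrors three of the rules above in plain Python. It is an illustration, not how spannerlib evaluates programs. The `enrolled` rows are copied from the CSV preview earlier on this page as a hypothetical stand-in (the actual contents of sample_data/enrolled.csv may differ), and Python's `re` stands in for the built-in `rgx` and `as_str` functions.

```python
import re

# Hypothetical stand-in for the `enrolled` relation (rows from the CSV preview).
enrolled = {
    ("abigail", "chemistry"),
    ("abigail", "operation systems"),
    ("jordan", "chemistry"),
    ("gale", "operation systems"),
    ("howard", "chemistry"),
    ("howard", "physics"),
}

# enrolled_in_chemistry(X) <- enrolled(X, "chemistry").
enrolled_in_chemistry = {x for (x, course) in enrolled if course == "chemistry"}

# enrolled_in_physics_and_chemistry(X) <-
#     enrolled_in_chemistry(X), enrolled(X, "physics").
enrolled_in_physics_and_chemistry = {
    x for x in enrolled_in_chemistry if (x, "physics") in enrolled
}

# student_gpas(Student, Grade) <- rgx(...)->(...), as_str(...)->(...).
# Here the capture groups are read directly as strings.
gpa_doc = "abigail 100 jordan 80 gale 79 howard 60"
student_gpas = {
    (m.group(1), m.group(2)) for m in re.finditer(r"(\w+).*?(\d+)", gpa_doc)
}

print(sorted(enrolled_in_chemistry))
print(sorted(enrolled_in_physics_and_chemistry))
print(sorted(student_gpas))
```

With these stand-in rows, the first two sets agree with the results noted in the query comments above.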
And query it
?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)
'?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)'
Stu | Gpa | Lec |
---|---|---|
abigail | 100 | linus |
gale | 79 | linus |
howard | 60 | walter |
You can also get query results as Dataframes for downstream processing
df = session.export(
    "?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)")
df
 | Stu | Gpa | Lec |
---|---|---|---|
0 | abigail | 100 | linus |
1 | gale | 79 | linus |
2 | howard | 60 | walter |