# importing dependencies
import re
import pandas as pd
from pandas import DataFrame
from pathlib import Path
from spannerlib.utils import load_env
from spannerlib import get_magic_session, Session, Span
In the previous tutorial, we showed how Spannerlib makes composing control logic for LLM agents easy and leads to elegant code. In this tutorial we will show the power of combining the compositional abilities of a language like Spannerlog with Document Spanners, which let us seamlessly combine deep-learning-based ML techniques with structured text mining.
Our use case will be building a copilot-like code documentation agent with smarter input context in its LLM prompt. Normally, when running a model like Copilot in an IDE like PyCharm or VS Code, the prompt sent to the LLM consists of the text immediately before and after the cursor.
In our agent, instead, we would like to leverage our understanding of the structure of code.

Given:

* a code base
* a cursor position inside a function

Return:

* documentation for that function, generated using both its definition and all of its call sites as context.
We will reuse our `llm` and `format` ie functions from the basic tutorial, and introduce some new ie functions, namely:

* `ast_xpath(file, xpath_query) -> (ast_node)`, which allows us to select AST nodes using XPath queries.
* `ast_to_span(file, ast_node) -> (ast_span)`, which allows us to get the span of code described by an AST node of a given file.

We will also use ie functions from the standard library:

* `expr_eval(expression_template, val_1, ..., val_n) -> (expression_result)`, which evaluates a Python expression given a template string in a printf-like format and the values to substitute into it (a conceptual sketch follows this list).
* `span_contained(span1, span2) -> (bool)`, which returns True if `span1` is a subspan of `span2`.

We will also add a text-specific aggregation function:

* `lex_concat(strings) -> (string)`, which takes a set of strings and concatenates them in lexicographic order.
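To make `expr_eval`'s semantics concrete, here is a conceptual sketch (an illustration only, not spannerlib's actual implementation):

# bind each value to a fresh variable name, render the printf-like
# template into a Python expression, then evaluate it
def expr_eval_sketch(template, *vals):
    env = {f'_arg{i}': v for i, v in enumerate(vals)}
    expr = template.format(*env.keys())
    return [eval(expr, {}, env)]

expr_eval_sketch('{0} + {1}', 2, 3)     # -> [5]
expr_eval_sketch('{0}.upper()', 'abc')  # -> ['ABC']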
%%spannerlog
# find all function definitions in the code base
FuncDefSpan(span,name)<-
Files(text),
ast_xpath(text, "//FunctionDef")->(node),
ast_to_span(text,node)->(span),
expr_eval("{0}.name",node)->(name).
# find all function calls in the code base
FuncCallSpan(span,name)<-
Files(text),
ast_xpath(text, "//Call/func/Name")->(node),
ast_to_span(text,node)->(span),
as_str(span)->(name).
# find the function whose span contains the cursor
CursorWrappingFunc(cursor,name)<-
Cursors(cursor),
FuncDefSpan(span,name),
span_contained(cursor,span)->(True).
# get all functions who had a call to called_name inside them
Mentions(lex_concat(caller_span),called_name)<-
FuncCallSpan(called_span,called_name),
FuncDefSpan(caller_span,caller_name),
span_contained(called_span,caller_span)->(True).
model = 'gpt-3.5-turbo'
# compose current function and all mentions into a prompt and
# ask the model to generate the documentation for the function
DocumentFunction(cursor,answer)<-
CursorWrappingFunc(cursor,name),
Mentions(mentions,name),
FuncDefSpan(def_span,name),
as_str(def_span)->(def_string),
format($func_document_template,mentions,def_string)->(prompt),
llm($model,prompt)->(answer).
If the implementation details are not of interest, feel free to move to the next section.
In order to analyze the structure of the code, we will be using Python's ast
module. We will write a very generic ie function that gets a piece of code and an XPath query string, and returns the spans of all matches of the query over the AST of the given code.
To do so we will use the pyastgrep
library, which allows us to look for XPath matches in Python ASTs. We will write a modified version of its main function that returns Spans of the AST nodes.
ast_xpath (py_str, xpath_query)
import ast
from functools import cache
from lxml import etree
from pyastgrep.asts import ast_to_xml  # pyastgrep's AST-to-XML converter (import path may vary by version)

@cache
def _py_to_xml(py: str):
    # parse the code and build an XML mirror of its AST, keeping a mapping
    # from XML elements back to the original ast nodes
ast_tree = ast.parse(py)
node_mappings = {}
xml_tree = ast_to_xml(ast_tree, node_mappings)
return xml_tree,ast_tree,node_mappings
def _xml_to_string(xml_tree):
return etree.tostring(xml_tree, pretty_print=True).decode('utf-8')
def _print_file_xml(file_path):
text = Path(file_path).read_text()
xml_tree,_,_ = _py_to_xml(text)
print(_xml_to_string(xml_tree))
def _ast_to_string(ast_tree):
if isinstance(ast_tree,ast.AST):
return ast.unparse(ast_tree)
else:
return ast_tree
def ast_xpath(py_str, xpath_query):
    """Returns the ast nodes of <py_str> that match <xpath_query>."""
    if isinstance(py_str, Path):
        py_str = py_str.read_text()
    if isinstance(py_str, Span):
        py_str = str(py_str)
    xml_tree, ast_tree, node_mappings = _py_to_xml(py_str)
    xml_matches = xml_tree.xpath(xpath_query)
    # map XML matches back to ast nodes; attribute matches (e.g. @name)
    # have no ast counterpart and are returned as-is
    ast_matches = [node_mappings.get(match, match) for match in xml_matches]
    return ast_matches
def f(x,y):
x+y
def g(x,y):
return f(x,y)**2
class A:
def __init__(self, x):
self.x = x
def method(self, y):
return f(self.x, y)
print(f(2,3))
def f(x, y):
x + y
def g(x, y):
return f(x, y) ** 2
class A:
def __init__(self, x):
self.x = x
def method(self, y):
return f(self.x, y)
print(f(2, 3))
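The loops below refer to `code_file` and `code_text`; a minimal setup, assuming the snippet above was saved as example_code.py (the variable names are taken from the cells below):

code_file = Path('example_code.py')
code_text = code_file.read_text()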
for match in ast_xpath(code_file,'//FunctionDef'):
print(_ast_to_string(match))
print("="*80)
for match in ast_xpath(code_text,'//FunctionDef/@name'):
print(_ast_to_string(match))
def f(x, y):
x + y
def g(x, y):
return f(x, y) ** 2
def __init__(self, x):
self.x = x
def method(self, y):
return f(self.x, y)
================================================================================
f
g
__init__
method
ast_to_span (string, node)
Given a node of an AST parsed from a string or file, `ast_to_span` returns the location of the code it describes as a Span object.
@cache
def _get_lines(path):
    # accept either a Path to read or the text itself
    if isinstance(path, Path):
        return tuple(path.read_text().split('\n'))
    else:
        return tuple(path.split('\n'))
def _get_character_position(path, line_number, column_offset):
"""gets a character position from a line number and column offset"""
lines = _get_lines(path)
if line_number < 1 or line_number > len(lines):
raise ValueError("Invalid line number")
line = lines[line_number - 1]
if column_offset < 0 or column_offset > len(line):
raise ValueError("Invalid column offset")
return sum(len(lines[i]) + 1 for i in range(line_number - 1)) + column_offset
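# Example: in the text "ab\ncd", line 2 / column 1 is the character 'd'.
# Line 1 contributes len("ab") + 1 = 3 characters (the +1 counts the newline),
# so the absolute offset is 3 + 1 = 4:
#   _get_character_position("ab\ncd", 2, 1) == 4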
def ast_to_span(string,node):
"""given a node <node> of an ast from file <path>,
returns the location of the node in the file as a Span object"""
if isinstance(string,Path):
text = string.read_text()
name = string.name
else:
text = string
name = None
start = _get_character_position(str(text),node.lineno,node.col_offset)
if hasattr(node,'end_lineno') and hasattr(node,'end_col_offset'):
end = _get_character_position(str(text),node.end_lineno,node.end_col_offset)
else:
end = start + len(ast.unparse(node))
return [Span(text,start,end,name=name)]
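The tuple below was produced by a call along these lines (a reconstruction; the exact example cell is an assumption):

node = ast_xpath(code_file, '//FunctionDef')[0]  # the definition of f
span = ast_to_span(code_file, node)[0]
(span, str(span))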
([@example_code.py,0,19) "def f(x,y)...", 'def f(x,y):\n x+y')
lex_concat (strings)
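The implementation is a one-liner; a minimal sketch consistent with the description above (the newline separator is an assumption, based on the Mentions output later in this tutorial):

def lex_concat(strings):
    # sort lexicographically, then join into a single context string
    return '\n'.join(sorted(str(s) for s in strings))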
# note that we use the ast node class directly in spannerlib since we want to access node attributes as well
sess = get_magic_session()  # the session used by the %%spannerlog magic
sess.register('ast_xpath', ast_xpath, [(str, Path, Span), str], [ast.AST])
sess.register('ast_to_span', ast_to_span, [(str, Span, Path), ast.AST], [Span])
sess.register_agg('lex_concat', lex_concat, [(str, Span)], [str])
ExampleAST(span,string)<-
ast_xpath($code_file,'//FunctionDef')->(node),
ast_to_span($code_file,node)->(span),
as_str(span)->(string).
?ExampleAST(span,string)
'?ExampleAST(span,string)'
span | string |
---|---|
[@example_code.py,0,19) "def f(x,y)..." | def f(x,y): x+y |
[@example_code.py,22,54) "def g(x,y)..." | def g(x,y): return f(x,y)**2 |
[@example_code.py,69,110) "def __init..." | def __init__(self, x): self.x = x |
[@example_code.py,115,163) "def method..." | def method(self, y): return f(self.x, y) |
To keep the outputs readable, we will only load a single Python file:
 | 0 |
---|---|
0 | [@example_code.py,0,178) "def f(x,y)..." |
def f(x,y):
x+y
def g(x,y):
return f(x,y)**2
class A:
def __init__(self, x):
self.x = x
def method(self, y):
return f(self.x, y)
print(f(2,3))
We will simulate a cursor position inside the function `f`:
 | 0 |
---|---|
0 | [@example_code.py,16,17) "x" |
We import this data into our session.
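A minimal sketch of that import, assuming relations are loaded from DataFrames and that the session exposes an `import_rel` method (both the method name and the relation schemas are assumptions, inferred from how Files and Cursors are used below):

files_df = DataFrame([[Span(code_text, 0, len(code_text), name='example_code.py')]])
cursors_df = DataFrame([[Span(code_text, 16, 17, name='example_code.py')]])
sess.import_rel('Files', files_df)
sess.import_rel('Cursors', cursors_df)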
And now we can incrementally build our rules from the bottom up:
# get all spans of function definitions and their name
# note we use expr_eval to get the name attribute of the ast Node we assigned to the free variable 'node'.
FuncDefSpan(span,name)<-
Files(text),
ast_xpath(text, "//FunctionDef")->(node),
ast_to_span(text,node)->(span),
expr_eval("{0}.name",node)->(name).
?FuncDefSpan(span,name)
'?FuncDefSpan(span,name)'
span | name |
---|---|
[@example_code.py,0,19) "def f(x,y)..." | f |
[@example_code.py,22,54) "def g(x,y)..." | g |
[@example_code.py,69,110) "def __init..." | __init__ |
[@example_code.py,115,163) "def method..." | method |
# get all spans of function calls and their names
FuncCallSpan(span,name)<-
Files(text),
ast_xpath(text, "//Call/func/Name")->(node),
ast_to_span(text,node)->(span),
as_str(span)->(name).
?FuncCallSpan(span,name)
'?FuncCallSpan(span,name)'
span | name |
---|---|
[@example_code.py,45,46) "f" | f |
[@example_code.py,151,152) "f" | f |
[@example_code.py,165,170) "print" | print |
[@example_code.py,171,172) "f" | f |
# we compute the function wrapping a cursor by checking which function def span contains our cursor
CursorWrappingFunc(cursor,name)<-
Cursors(cursor),
FuncDefSpan(span,name),
span_contained(cursor,span)->(True).
?CursorWrappingFunc(cursor,name)
'?CursorWrappingFunc(cursor,name)'
cursor | name |
---|---|
[@example_code.py,16,17) "x" | f |
# we get all mentions of a function by looking for calls to it that are subspans of function definitions
# we aggregate the mentions using lex_concat to get a single mention context per function.
Mentions(lex_concat(caller_span),called_name)<-
FuncCallSpan(called_span,called_name),
FuncDefSpan(caller_span,caller_name),
span_contained(called_span,caller_span)->(True).
?Mentions(mentions,func)
'?Mentions(mentions,func)'
mentions | func |
---|---|
def g(x,y): return f(x,y)**2 def method(self, y): return f(self.x, y) | f |
Now, to piece this together into an LLM prompt we can call, let's define our prompt template:
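The template definition is not shown in this export, so here is a reconstruction, inferred from the prompt echoed in the output below (the exact whitespace is an assumption):

func_document_template = """system: based on the following context:
{0}
Explain the following function:
{1}
In the format of a doc string."""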
And just like in our basic agent tutorial, we get our strings from our low-level rules and compose them using the `format` ie function.
model = 'gpt-3.5-turbo'
DocumentFunctionPrompt(cursor,prompt)<-
CursorWrappingFunc(cursor,name),
Mentions(mentions,name),
FuncDefSpan(def_span,name),
as_str(def_span)->(def_string),
format($func_document_template,mentions,def_string)->(prompt).
?DocumentFunctionPrompt(cursor,prompt)
DocumentFunction(cursor,answer)<-
DocumentFunctionPrompt(cursor,prompt),
llm($model,prompt)->(answer).
?DocumentFunction(cursor,answer)
'?DocumentFunctionPrompt(cursor,prompt)'
cursor | prompt |
---|---|
[@example_code.py,16,17) "x" | system: based on the following context: def g(x,y): return f(x,y)**2 def method(self, y): return f(self.x, y) Explain the following function: def f(x,y): x+y In the format of a doc string. |
'?DocumentFunction(cursor,answer)'
cursor | answer |
---|---|
[@example_code.py,16,17) "x" | """ This function calculates the sum of two inputs x and y. """ |
Putting all of our spannerlog together, we get:
FuncDefSpan(span,name)<-
Files(text),
ast_xpath(text, "//FunctionDef")->(node),
ast_to_span(text,node)->(span),
expr_eval("{0}.name",node)->(name).
FuncCallSpan(span,name)<-
Files(text),
ast_xpath(text, "//Call/func/Name")->(node),
ast_to_span(text,node)->(span),
as_str(span)->(name).
CursorWrappingFunc(cursor,name)<-
Cursors(cursor),
FuncDefSpan(span,name),
span_contained(cursor,span)->(True).
Mentions(lex_concat(caller_span),called_name)<-
FuncCallSpan(called_span,called_name),
FuncDefSpan(caller_span,caller_name),
span_contained(called_span,caller_span)->(True).
model = 'gpt-3.5-turbo'
DocumentFunctionPrompt(cursor,prompt)<-
CursorWrappingFunc(cursor,name),
Mentions(mentions,name),
FuncDefSpan(def_span,name),
as_str(def_span)->(def_string),
format($func_document_template,mentions,def_string)->(prompt).
DocumentFunction(cursor,answer)<-
DocumentFunctionPrompt(cursor,prompt),
llm($model,prompt)->(answer).
Note how short and elegant a complex pipeline can be when we decompose our code into:

* powerful and generic callbacks
* declarative compositional logic

And note the ease with which we can combine formal IE extractions and LLMs to get the best of both worlds:

* structured analysis
* NLP via LLMs