RDFLib Graph with RDBS

13th May 2018

It's tricky to find out how to set up RDFLib Graph to use a RDBS backend, but it can be done using SQLAlchemy and RDFLib-SQLAlchemy. This means it'll support all the engines SQLAlchemy does, including MySQL and Postgres. I'll highlight two gotchas to look out for when using RDFLib-SQLAlchemy and walk you through getting setup using a wrapper class to RDFLIB Graph.


RDFLIB SUPPORTS multiple persistent storage methods, you can write to file, use BerkeleyDB, or wrap around an existing SPARQL endpoint. Although not part of RDFLib core, there is a library that’s part of the RDFLib ecosystem called RDFLib-SQLAlchemy that also allows you to work with a RDBS, such as MySQL.

The technological benefits are not well documented, it can be assumed that when your graph has very many triples that using a RDBS will improve query time compared to other methods, it can also be assumed to use less memory than reading in the whole file to query. For smaller numbers of triples, it is unlikely to be faster than keeping the data in memory. I cannot comment on performance with certainty as it is untested.

However, the non-technical benefits are simple. Some hosting providers charge for disk-space but not database space, or charge more for disk space than database space. So by moving the data storage to RDBS you save can money. That’s enough of an incentive for most small projects.

pip install and more

THE MOST basic requirements for this are rdflib and rdflib-sqlalchemy, assuming you have a MySQL or other database running and some kind of webapp to test in/under development such as Flask or Klein. I’m leaving it to you to find out how to get a suitable database for your platform.

pip install rdflib rdflib-sqlalchemy sqlalchemy

Wraping rdflib.Graph

ALTHOUGH POSSIBLE to inherit from rdflib.Graph, I’m going to demonstrate with a wrapper class because I think it is easier to parse how this works. In the code I’m going to use constants for BASE_URL and DB_URI, which we’ll put into our config/constants file later on. The BASE_URL is your website URL, DB_URI is the identifier for SQLAlchemy to find your database. Start with the imports.

from rdflib import Graph, Namespace
from rdflib_sqlalchemy.store import SQLAlchemy
from sqlalchemy import create_engine

from config import BASE_URL, DB_URI
Next I setup the namespaces I’ll want to use with a dictionary. This proves useful later on and saves some headaches.

DBC = Namespace("http://dbpedia.org/page/Category:")
NAMESPACES = dict(dbc=DBC,
                                # add more here
                                 )

Now for the wrapper class, it needs an engine, URI, and identifier. From that we can create a graph using a RDBS as persistent storage. An additional setup method let’s you control when the connection is opened and binds the namespaces we created.

class DataStore:
    engine = create_engine(DB_URI)
    uri = Literal(DB_URI)
    identifier = URIRef(BASE_URL)

    def __init__(self):
        self.store = SQLAlchemy(
            identifier=self.identifier,
            engine=self.engine,
            )
        self.graph = Graph(
            self.store,
            identifier=self.identifier,
            )

    def setup(self, create=False):
        self.graph.open(self.uri, create=create)
        for ns, uri in NAMESPACES.items():
            self.graph.bind(ns, uri)

    def closelink(self):
        try:
            self.graph.close()
        except Exception as e:
            print("Exception in Closing", e)
This appears to be all we’d need, except there’s the issue of MySQL timing out after a period of inactivity, closing the connection and you getting a bug when you try to run a query. You should be able to set some parameters though SQLAlchemy to prevent this, but it seems as though these don’t make it through the RDFLib-SQLAlchemy library.

I’m afraid this leads to a rather ugly hack to work around it using a try: ... except: ....

    def query(self, *args, **kwargs):
        try:
            return self.graph.query(*args, **kwargs)
        except Exception as e:
            print("Query Error", e)
            self.closelink()
            self.setup()
            print("Re-connected", e)
            return self.graph.query(*args, **kwargs)

About Those Constants

IT’S ADVISABLE to keep such data in your config file or constant file for easy changing. BASE_URL is easy, it’s just your website address. This helps RDFLib understand when you’re referring to values that include your own URL.

# Change as appropriate
BASE_URL = "http://www.paulbrownmagic.com/"

The DB_URI will take a little bit of figuring out if you’re not familiar with such things. The format is:

{dialect}+{driver}://{username}:{password}@{host}:{port}/{database}
If you include this as an f-string, you can then set the values with variables. This could also be done with .format for a tidier version.

dialect = "mysql"
driver = "mysqldb"
host = "db.my_provider.com"
port = "3306"
username = "LeRoy"
password = "Talma"
database = "Bosco"
DB_URI = f"{dialect}+{driver}://{username}:{password}@{host}:{port}/{database}"

Another gotcha

THERE’S ONE more thing to look out for, it seems that no matter how you try to configure your setup, from the database end and the SQLAlchemy engine, you’ll be stuck with ASCII characters. Again, I assume this is something in RDFLib-SQLAlchemy that’s not listening to the configuration. This means you can’t use things like proper apostrophes (‘) in your triples.

Conclusion

WITH THAT you should be up and running and hopefully have a few less headaches than I did. If you do manage to workout how to prevent MySQL timing out or how to permit Unicode characters, do let me know. The workings don’t seem to be perfect, and the technical advantages are debatable, but it could save you a little money on storage space or make it easier to work in an environment where you have to use a RDBS.

Post Tags


All Tags



LIFE IS better when we share.


Comments

JOIN THE conversation: awesome comments.