The Internet Is My Database

06th May 2018

Use the internet as a database. Across the internet, some generous providers make their data available not only as RDF/OWL, but via a SPARQL endpoint that you can send queries to. It's like an API, but instead of getting a response with all the data the developer thought you might like, you get the results of the query you want. It's even possible to query across multiple endpoints in a single query. SPARQL is similar enough to SQL to be easy to learn. In this post, we start out looking at SELECT queries against DBpedia.


AS A developer, you may be familiar with using SQL to query your SQLite/MySql/PostgreSQL database. Thanks to the Semantic Web and its people, you can treat the Internet as a database and query it with SPARQL. We’ll take a look at what we can query, write ourselves some utilities to make querying easier and test out some queries.

Please note, this is only an introduction to using SPARQL with Python to query open SPARQL endpoints on the web. If you wish to use data obtained using SPARQL within an application, just like with any API, you should store the results locally and serve this local data to your users, refreshing periodically. This is faster for your users and being nice to the data provider.

This post is also available as a Jupyter Notebook

What Can I Query?

THERE ARE only a limited number of public SPARQL endpoints you can query. Submitting a SPARQL query is much like submitting an SQL query straight to the database, except that database is public. Due to the database being public, most SPARQL endpoints don’t accept “UPDATE”/”DELETE” queries! Unfortunately, “UPDATE”/”DELETE” isn’t the only malicious thing an end-user can do with a query, it is relatively easy to DDOS a SPARQL endpoint by sending queries that take a long time to execute. The more common REST/RESTFUL API’s restrict what you can ask the database and sometimes implement rate-limiting, hence some Semantic Web applications, such as Thomson Reuters PermID only expose an API. Unfortunately the technology I’m currently using on my website also doesn’t permit me to expose a SPARQL endpoint. Like many other organisations I only permit the RDF to be downloaded and queried locally.

So who does expose a SPARQL endpoint that we can access? The most famous, general purpose one is provided by DBpedia, which gathers all the data in those little Wikipedia boxes. We’ll use their SPARQL endpoint.

import What?

PYTHON IS famously “batteries included”, unfortunately we’re lacking in the SPARQL query space; there’s no SPARQLAlchemy, yet! So far in the blog we’ve used RDFLib when making Semantic Web applications, but that SPARQL interface is for local data, so we’re switching to SPARQLWrapper. It’s pretty bare-bones, so we’ll make some utility functions to make it easy to use.

# pip install sparqlwrapper
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "http://dbpedia.org/sparql/"
sparql = SPARQLWrapper(endpoint)
# Return JSON so we can access results like a dict
sparql.setReturnFormat(JSON)

Make Some Utilities

YOU CAN copy these and adapt them. We’re going to start with easy prefixes.

Prefixes let us use short-hand for URIs, such as rdf:type instead of http://www.w3.org/1999/02/22-rdf-syntax-ns#type. Obviously we love them. But we don’t want to write them out everytime we make a query. So we’ll store our namespaces in a dict for easy code updates, use a function to get that data into the PREFIX syntax, and join them all together. Finally we create the function make_query to add the prefixes onto a given query.

namespaces = dict(db="http://dbpedia.org/",
                  dbo="http://dbpedia.org/ontology/",
                  dbr="http://dbpedia.org/resource/",
                  dbc="http://dbpedia.org/page/Category:",
                  dbp="http://dbpedia.org/property/",
                  rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                  rdfs="http://www.w3.org/2000/01/rdf-schema#",
                  owl="http://www.w3.org/2002/07/owl#")

# function to put data into PREFIX syntax
def prefix(ns, uri):
    return f"PREFIX {ns}: <{uri}>"

# Join all the prefixes together
prefixes = "\n".join(prefix(ns=ns, uri=uri) 
                     for ns, uri in namespaces.items())

# Utility function to add prefixes to a query
def make_query(query:str) -> str:
    return "{}\n\n{}".format(prefixes, query)

# Test it
print(make_query("SELECT ?test WHERE {?test a magic:success.}"))

Now we need an easy way to make the query and parse the results. SPARQLWrapper requires that we set the query, then call the query() method. It returns JSON that contains more details than we need for this tutorial so we’ll slim that down into a nice generator of dictionaries that associate the variable with its value. We’re using partial, if you’ve not met it check out Python Partial: Code Your Intention.

from functools import partial


def result_variables(result:dict) -> [str]:
    """Get a list of the variable names used in the query."""
    return result['head']['vars']

def get_data(variables:[str], data:dict) -> dict:
    """Get a dictionary with the variables as keys
    and their values as values."""
    return {v:data[v]['value'] for v in variables}

def parse_results(results:dict) -> map:
    """Takes the JSON, partially applies get data to the
    variable names, and maps this over the results to
    return a generator of dictionaries."""
    read_data = partial(get_data, result_variables(results))
    return map(read_data, results['results']['bindings'])

def query(query:str) -> map:
    """Take a query and return generator of results."""
    sparql.setQuery(make_query(query))
    return parse_results(sparql.query().convert())

Let’s SPARQL!

LET’S START with some basic “SELECT” queries. It’s possible to do amazing things with SPARQL, including query multiple endpoints in a single query, but we’ve got to start at the beginning.

Everything in Semantic Web is a triple: subject, predicate, object. Subject is what we’re talking about, predicate is the relationship, object is what the subject is related to by that predicate. DBPedia is too big to ask for all the triples, so let’s pick a subject. Go to Wikipedia and choose an article that’s not a stub and doesn’t have parenthesis () in the title. We’ll discuss how to deal with DBpedia’s parenthesis in the next query.

I’m going to use my favourite magic trick: https://en.wikipedia.org/wiki/Cups_and_balls. The wikipedia URL maps to a DBpedia resource, for my subject it will be dbr:Cups_and_balls. Your’s will be dbr: followed by the last part of your Wikipedia URL. Or to explain using Python: "https://en.wikipedia.org/wiki/Cups_and_balls".replace("https://en.wikipedia.org/wiki/", "dbr:")

Variable names are prefixed with a ‘?’ in SPARQL, so here we are asking for all the ?predicate and ?object that are in triples where the subject is dbr:Cups_and_balls. Not too disimilar from SQL. Just watch out for the full-stops on the end of each line in the “WHERE” clause, they cause many a debugging headache!

q = """SELECT ?predicate ?object
WHERE {
   dbr:Cups_and_balls ?predicate ?object.
}"""

for result in query(q):
    print(result)

If you read the results you’ll see what looks like a big mess to the untrained eye. That’s because the predicates and some of the objects are full URIs. We could use our namespace dictionary to change these to prefix format, but that will be left as an exercise for you. In Semantic Web we use URIs so we can share terms with common definitions; if I do a blog post and include the triple to say it has dcterms:subject dbr:Cups_and_balls then there can be no ambiguity as to whether it is about https://en.wikipedia.org/wiki/Cups_and_balls or https://en.wikipedia.org/wiki/Cup-and-ball.

I promised a solution to DBpedia’s use of parenthesis, sadly we can’t use parenthesis with prefixes, so we need to type the whole URI. To tell SPARQL it’s a URI we put it in angle brackets <>. In this query I’m going to use <http://dbpedia.org/resource/Magic_(illusion)>.

Let’s do a more useful SELECT query and get some objects that we want. In this query, we’ll make use of DISTINCT and FILTER; many of the familiar SQL keywords are also in SPARQL, including SUM, COUNT, GROUP BY, ORDER BY and more. There’s also an additional OPTIONAL keyword that will return a result if there is one, and won’t if there’s not. However, our simple utility functions would need improving to not throw a KeyError when no result is returned.

Popular Wikipedia articles are in multiple languages, so here we make sure to FILTER to only get the English label and abstract.

q = """SELECT DISTINCT ?label ?abstract
WHERE {
   <http://dbpedia.org/resource/Magic_(illusion)> rdfs:label ?label.
   OPTIONAL { <http://dbpedia.org/resource/Magic_(illusion)> dbo:abstract ?abstract. }
   FILTER (lang(?label) = "en")
   FILTER (lang(?abstract) = "en")
}"""

from textwrap import wrap

for result in query(q): 
    print(result['label'], 
          "\n----------------\n", 
          "\n".join(wrap(result['abstract'])))

Of course, you’re not limited to searching for predicates on objects. In this final query I get the name and abstract for all the people on DBpedia who share the same birthday as one of my magic heros, Patrick Page.

q = """SELECT DISTINCT ?person ?name ?about
WHERE {
   <http://dbpedia.org/resource/Pat_Page_(magician)> dbo:birthDate ?birthday.
   ?person dbo:birthDate ?birthday.
   ?person foaf:name ?name.
   ?person dbo:abstract ?about.
   FILTER (lang(?about) = "en")
} ORDER BY ?name"""

for result in query(q):
    print("{} ( <{}> )".format(result['name'], result['person']))
    print(result['about'], end="\n\n")

Conclusion

WE’VE ONLY looked at one SPARQL endpoint and only covered the most basic of queries, but hopefully this has been enough to spark your imagination. Imagine if you didn’t have to write all the nitty-gritty content of your website yourself, but could just pull in data from Wikipedia. The BBC did it during the London Olympics so every sport and athlete could have their own page, they’ve also done it for animals. I’ll be doing it to describe my tags, just as soon as I get a round tuit.

Get The Code

THE CODE for this walkthrough is available on GitHub

Post Tags


All Tags



LIFE IS better when we share.


Comments

JOIN THE conversation: awesome comments.