30.04.2019 HSR Fachvortrag: Wie das Open Data Portal der Stadt Zürich seine Daten erntet

As an HSR alumnus, I was proud to be invited to speak in the "HSR Fachvortrag" series about my new employer and how we harvest data for the Open Data Portal of Zurich.

Here are my slides:

13.02.2019 Export datasets from CKAN on the command line

Today I got an interesting request: how can datasets be exported from CKAN based on a search query?

Our Canadian friends created ckanapi, a CLI application (that can be used in Python scripts as well!) to talk to the CKAN API. This is basically all we need to fulfill the request.

In this example I'm using the CKAN portal of the City of Zurich, and we try to download all "Fahrzeiten" datasets from the VBZ. These datasets are super interesting, as they contain a target/actual comparison of the public transport timetable of Zurich (i.e. delays of buses and trams).

With ckanapi, it's quite easy to get the metadata of all these datasets:

ckanapi action package_search 'q=title:Fahrzeiten' -r https://data.stadt-zuerich.ch

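This returns JSON (since that is what the API returns) with the matching datasets and their metadata. Note that package_search returns only the first 10 matches by default; append a rows parameter (e.g. rows=100) to the command to get more.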
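For reference, the same search can be sketched in Python with only the standard library. This is a minimal sketch, not how ckanapi is implemented: the function names are made up for illustration, and it assumes the portal exposes the standard CKAN Action API under /api/3/action/, where the raw HTTP response wraps the data in an envelope with "success" and "result" keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(portal, query, rows=10, start=0):
    """Build the URL of CKAN's package_search action (hypothetical helper)."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def unwrap(envelope):
    """Return the 'result' part of an Action API response, or raise on errors."""
    if not envelope.get("success"):
        raise RuntimeError(f"CKAN API error: {envelope.get('error')}")
    return envelope["result"]


def package_search(portal, query, rows=10, start=0):
    """Run the search over HTTP (needs network access)."""
    with urlopen(package_search_url(portal, query, rows, start)) as response:
        return unwrap(json.load(response))


# Example call (not executed here, as it needs network access):
#   result = package_search("https://data.stadt-zuerich.ch", "title:Fahrzeiten")
#   print(result["count"], "datasets match")
```

The rows and start parameters are what you would use to page through larger result sets.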

But actually we want to download the files as well. Luckily, ckanapi has a feature to dump a dataset as a Data Package. A Data Package is essentially a directory containing the metadata in the root and the actual files in a data folder. So this is exactly what we want.
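To get a feeling for what ends up in such a directory, here is a small Python sketch that reads the metadata file and lists the files it refers to. It assumes the metadata is stored as datapackage.json with a "resources" list whose entries carry "name" and "path" keys, as in the Data Package convention; the exact fields ckanapi writes may differ, and the helper names and sample values are made up.

```python
import json
from pathlib import Path


def load_descriptor(package_dir):
    """Read datapackage.json from the root of a Data Package directory."""
    path = Path(package_dir) / "datapackage.json"
    return json.loads(path.read_text(encoding="utf-8"))


def resource_paths(descriptor):
    """Return (name, path) pairs for every resource in the descriptor."""
    return [
        (resource.get("name"), resource.get("path"))
        for resource in descriptor.get("resources", [])
    ]


# Made-up descriptor, only to show the expected shape.
example = {
    "name": "example-dataset",
    "resources": [
        {"name": "example", "path": "data/example.csv"},
    ],
}

for name, path in resource_paths(example):
    print(name, "->", path)
```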

Now let's use some CLI magic to turn our result from above into Data Packages.

First of all, the ckanapi dump datasets command takes a list of dataset names as a parameter, so let's generate such a list:

jq ".results|.[]|.name"

jq is a tool to manipulate JSON on the command line (like sed, awk and grep for text). In this example we simply extract the name attribute of all elements in the results key.
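If jq is new to you, the filter does roughly the same as the following Python sketch. The sample data is made up and heavily trimmed; real results contain many more fields per dataset.

```python
import json


def dataset_names(search_result):
    """Rough equivalent of the jq filter '.results|.[]|.name'."""
    return [dataset["name"] for dataset in search_result["results"]]


# Made-up, trimmed-down result in the shape the ckanapi command prints.
sample = json.loads("""
{
  "count": 2,
  "results": [
    {"name": "example-dataset-2017", "title": "Example 2017"},
    {"name": "example-dataset-2018", "title": "Example 2018"}
  ]
}
""")

for name in dataset_names(sample):
    print(name)
```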

Now all we have to do is use this list in ckanapi dump datasets. Let's put everything together:

ckanapi action package_search 'q=title:Fahrzeiten' -r https://data.stadt-zuerich.ch | jq ".results|.[]|.name" | xargs -I % ckanapi dump datasets % -r https://data.stadt-zuerich.ch --datapackages=./dump_directory/

Remember, using | lets you reuse the output of one command as the input for the next one.

Voilà, we're done and all the datasets containing "Fahrzeiten" in their title will be downloaded to the dump_directory.

24.04.2018 Make open data discoverable for search engines

Note: this blogpost was first published on the Liip Blog.

Open data portals are a great way to discover datasets and present them to the public. But they lack interoperability, which makes it hard to search across them. Imagine that the dataset you’re looking for were just a simple "google search" away. Historically, there have been lots and lots of metadata standards. CKAN, as the de-facto standard, uses a model that is close to Dublin Core, which consists of 15 basic fields to describe a dataset and its related resources.

In the area of Open Government Data (OGD), the widely used metadata standard is DCAT, especially its application profiles ("DCAT-AP"), which specialize the DCAT standard for certain topic areas or countries. For CKAN, the ckanext-dcat extension provides plugins to expose and consume DCAT-compatible data using an RDF graph. We use this extension on opendata.swiss and data.stadt-zuerich.ch, as it provides handy interfaces to extend it to our custom data model. I'm a maintainer of and code contributor to this extension.

When Dan Brickley, who works for Google, opened an issue on the DCAT extension about implementing schema.org/Dataset for CKAN, I was very excited. I only learned about it in December 2017 and thought it would be a fun feature to implement over the holidays. But what exactly was Dan suggesting?

With ckanext-dcat we already have the bridge from our relational ("database") model to a graph ("linked data"). This is a huge step that enables new uses of our data. Remember the 5 star model of Sir Tim Berners-Lee?

5 star model describing the quality of the data. Source: http://5stardata.info/en/, CC-Zero

So with our RDF, we already reached 4 stars! Now imagine a search engine that takes all those RDF graphs, is able to search in them and eventually is even able to connect them. This is where schema.org/Dataset comes in. Based on the request from Dan, I built a feature in ckanext-dcat to map the DCAT dataset to a schema.org/Dataset. By default, it returns the data as JSON-LD.

Even if you've never heard of JSON-LD, chances are that you’ve used it. Google is promoting it under the keyword Structured Data. At its core, JSON-LD is a JSON representation of an RDF graph, and Google is pushing this standard forward to enable all kinds of “semantic web” applications. The goal is to let a computer understand the content of a website or any other content that has JSON-LD embedded. And in the future, Google wants to have a better understanding of the concept of a "dataset", or to put it in the words of Dan Brickley:

It's unusual for Google to talk much about search feature plans in advance, but in this case I can say with confidence "we are still figuring out the details!", and that the shape of actual real-world data will be a critical part of that. That is why we put up the documentation as early as possible. If all goes according to plan, we will indeed make it substantially easier for people to find datasets via Google; whether that is via the main UI or a dedicated interface (or both) is yet to be determined. Dataset search has various special challenges which is why we need to be non-comital on the details at the stage, and why we hope publishers will engage with the effort even if it's in its early stages...

This feature is deployed on the CKAN demo instance, so let’s look at an example. I can use the API to get a dataset as JSON-LD. So for the dataset Energy in Málaga, I can build the URL like this:

  • Append “.jsonld”
  • Specify “schemaorg” as the profile (i.e. the format of the mapping)

Et voilà: https://demo.ckan.org/dataset/energy-in-malaga.jsonld?profiles=schemaorg
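The two steps are simple enough to put into a tiny helper. This is just a sketch in Python, and the function name is made up:

```python
from urllib.parse import urlencode


def jsonld_url(dataset_url, profile="schemaorg"):
    """Append '.jsonld' and the profile parameter to a CKAN dataset URL."""
    query = urlencode({"profiles": profile})
    return f"{dataset_url.rstrip('/')}.jsonld?{query}"


print(jsonld_url("https://demo.ckan.org/dataset/energy-in-malaga"))
# prints https://demo.ckan.org/dataset/energy-in-malaga.jsonld?profiles=schemaorg
```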

This is the result as JSON-LD:


{
  "@context": {
    "adms": "http://www.w3.org/ns/adms#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "locn": "http://www.w3.org/ns/locn#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "time": "http://www.w3.org/2006/time",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@graph": [
    {
      "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9",
      "@type": "dcat:Dataset",
      "dcat:contactPoint": {
        "@id": "_:N71006d3e0205458db0cc7ced676f91e0"
      },
      "dcat:distribution": [
        {
          "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/c3c5b857-24e7-4df7-ae1e-8fbe29db93f3"
        },
        {
          "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/5ecbfa6c-9ea0-4f5f-9fbe-eb39964c0f7f"
        },
        {
          "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d"
        }
      ],
      "dcat:keyword": [
        "energy",
        "málaga"
      ],
      "dct:description": "Some energy related sources from the city of Málaga",
      "dct:identifier": "c8689e49-4fb2-43dd-85dd-ee243104a2a9",
      "dct:issued": {
        "@type": "xsd:dateTime",
        "@value": "2017-06-25T17:02:11.406471"
      },
      "dct:modified": {
        "@type": "xsd:dateTime",
        "@value": "2017-06-25T17:05:24.777086"
      },
      "dct:publisher": {
        "@id": "https://demo.ckan.org/organization/f0656b3a-9802-46cf-bb19-024573be43ec"
      },
      "dct:title": "Energy in Málaga"
    },
    {
      "@id": "https://demo.ckan.org/organization/f0656b3a-9802-46cf-bb19-024573be43ec",
      "@type": "foaf:Organization",
      "foaf:name": "BigMasterUMA1617"
    },
    {
      "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d",
      "@type": "dcat:Distribution",
      "dcat:accessURL": {
        "@id": "http://datosabiertos.malaga.eu/recursos/energia/ecopuntos/ecoPuntos-23030.csv"
      },
      "dct:description": "Ecopuntos de la ciudad de málaga",
      "dct:format": "CSV",
      "dct:title": "Ecopuntos"
    },
    {
      "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/c3c5b857-24e7-4df7-ae1e-8fbe29db93f3",
      "@type": "dcat:Distribution",
      "dcat:accessURL": {
        "@id": "http://datosabiertos.malaga.eu/recursos/ambiente/telec/201706.csv"
      },
      "dct:description": "Los datos se corresponden a la información que se ha decidido historizar de los sensores instalados en cuadros eléctricos de distintas zonas de Málaga.",
      "dct:format": "CSV",
      "dct:title": "Lecturas cuadros eléctricos Junio 2017"
    },
    {
      "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/5ecbfa6c-9ea0-4f5f-9fbe-eb39964c0f7f",
      "@type": "dcat:Distribution",
      "dcat:accessURL": {
        "@id": "http://datosabiertos.malaga.eu/recursos/ambiente/telec/nodos.csv"
      },
      "dct:description": "Destalle de los cuadros eléctricos con sensores instalados para su gestión remota.",
      "dct:format": "CSV",
      "dct:title": "Cuadros eléctricos"
    },
    {
      "@id": "_:N71006d3e0205458db0cc7ced676f91e0",
      "@type": "vcard:Organization",
      "vcard:fn": "Gabriel Requena",
      "vcard:hasEmail": "gabi@email.com"
    }
  ]
}
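Because JSON-LD is still plain JSON, you can walk this graph without any RDF library. The following Python sketch (helper names are made up, and the sample is a trimmed copy of the output above) follows the "@id" references from the dataset to its distributions. It assumes the compact prefixed keys shown above; a real consumer would rather use a proper JSON-LD or RDF library.

```python
def index_graph(document):
    """Map every node's @id to the node itself."""
    return {node["@id"]: node for node in document["@graph"]}


def distributions(document):
    """Return (title, format, access URL) for each distribution of each dataset."""
    nodes = index_graph(document)
    found = []
    for node in document["@graph"]:
        if node.get("@type") != "dcat:Dataset":
            continue
        refs = node.get("dcat:distribution", [])
        if isinstance(refs, dict):  # a single reference may not be wrapped in a list
            refs = [refs]
        for ref in refs:
            dist = nodes[ref["@id"]]
            access_url = dist.get("dcat:accessURL", {}).get("@id")
            found.append((dist.get("dct:title"), dist.get("dct:format"), access_url))
    return found


# Trimmed copy of the JSON-LD above, reduced to one distribution.
base = "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9"
sample = {
    "@graph": [
        {
            "@id": base,
            "@type": "dcat:Dataset",
            "dcat:distribution": [
                {"@id": base + "/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d"}
            ],
            "dct:title": "Energy in Málaga",
        },
        {
            "@id": base + "/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d",
            "@type": "dcat:Distribution",
            "dcat:accessURL": {
                "@id": "http://datosabiertos.malaga.eu/recursos/energia/ecopuntos/ecoPuntos-23030.csv"
            },
            "dct:format": "CSV",
            "dct:title": "Ecopuntos",
        },
    ]
}

for title, fmt, url in distributions(sample):
    print(title, fmt, url)
```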

Google even provides a Structured Data Testing Tool where you can submit a URL and it will tell you if the data is valid.

Of course, knowing the CKAN API is good if you're a developer, but it's not really the way to go if you want a search engine to find your datasets. That is why the JSON-LD you can see above is already embedded in the dataset page (check out the testing tool with just the dataset URL). So if you have enabled this feature, every time a search engine visits your portal, it'll get structured information about the dataset it crawls instead of simply the HTML of the page.
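To illustrate what a crawler does with such a page, here is a Python sketch that pulls embedded JSON-LD out of HTML using only the standard library. It assumes the structured data sits in a script tag of type "application/ld+json", which is the usual way to embed JSON-LD; the class name, function name and sample page are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of all application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list with one parsed JSON-LD document per matching script tag."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up, minimal page; a real dataset page embeds far more metadata.
page = """<html><head>
<script type="application/ld+json">{"@type": "schema:Dataset", "schema:name": "Energy in Málaga"}</script>
</head><body><h1>Energy in Málaga</h1></body></html>"""

for document in extract_jsonld(page):
    print(document["schema:name"])
```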

Check the documentation for more information, but most importantly: if you're running CKAN, give it a try!

06.03.2018 Digital Real Estate Summit 2018: Open Data in der Immobilienbranche

I was invited to speak about "Open Data in der Immobilienbranche" (open data in the real estate industry) at the Digital Real Estate Summit 2018. The talk was a brief overview of open data and, above all, a workshop part in which the participants could engage with the topic.

09.05.2017 Open Education Day 2017: HTML/CSS Workshop

The /ch/open association hosted the Open Education Day 2017 at the FHNW in Brugg/Windisch.

I wanted to provide a workshop that could possibly be re-used in a classroom, as teachers were the main audience of this event. As a coach of OpenTechSchool (OTS), it was clear that I would use one of its courses and talk a little bit about OTS as well.

After some thought, I chose the HTML/CSS course. The course is designed to be "do it yourself, at your own pace".

It was a lot of fun, and I think the participants could take away the important message that websites are not magic. Thanks to the open learning material, they can finish the course on their own and also re-use it for their students.

24.04.2017 2nd Dark Night in Zurich: Open Data Workshop

Together with Zentrum Karl der Grosse, the Chaos Computer Club Zurich (CCCZH) and the Digitale Gesellschaft Schweiz organized the 2nd Dark Night in Zurich.

I was asked to give a workshop about Open Data. The audience was very broad, from young to old, from novice to tech-savvy, but they all had very little knowledge about Open Data.

It was a very good experience to try to explain my field of work, which is also one of my biggest private interests, from the ground up.

My approach was to break down this very complicated topic into 5 areas, and I tried to encourage the participants to actually "get their hands dirty" by diving into an open data portal. In the end, 60 minutes is a very short time, but I hope I made a lasting impression and the topic of open data is no longer an "unknown" in their heads.

12.11.2016 Jugend hackt Schweiz: Github & Git

I co-organized "Jugend hackt" in Switzerland this year. If you want more information about the event, read the excellent blog post about it, watch the video of the final presentations or read the article by SRF (including an audio interview with me :) ).

During the event there was also a series of lightning talks about a variety of topics. I gave a talk about "Github & Git". The slides are on my GitHub account or available online; they are actually a fork of a similar talk at a previous "Jugend hackt" event.

04.10.2016 CKANCon 2016: How We Combined CKAN with WordPress

This year the CKANCon was in Madrid, prior to the International Open Data Conference. It was a one-day, rather tech-focused conference. As CKAN is the framework I'm using on an almost daily basis, and I know a lot of people from the mailing list, it's good to meet them in person from time to time.

My talk was about the CKAN/WordPress integration we did for opendata.swiss. You can find the slides below.

Update: In the meantime, all CKAN and WordPress plugins were open sourced and can be found on GitHub!

29.09.2016 Hacks/Hackers: Open Data Portals + Civic Hacking

I was invited by the Hacks/Hackers meetup to give a talk about my passion, "open data", so I decided to mash up my previous talks and talk about my work (building open data portals) and my private interest (being an open data activist and a civic hacker). It was a very interesting experience, and there were a lot of questions in the Q&A after my talk and even afterwards, when we had some drinks at the venue.

Thank you to the organizers for having me and to everybody who showed up!

As always, here are my slides:

