Getting insight from data is not always a straightforward process. Data is often hard to find, archived in difficult-to-use formats, poorly structured and/or incomplete. These issues create friction and make it difficult to use, publish and share data. The Frictionless Data initiative at Open Knowledge Foundation aims to reduce friction in working with data, with the goal of making it effortless to transport data among different tools and platforms for further analysis. The corresponding toolkit consists of a set of standards for data and metadata interoperability, accompanied by a collection of open source software that implements these standards, and a range of best practices for data management. In this article we focus on data standards and why they are important, and, using the Frictionless Data standards as a basis for reflection, we try to understand what makes, in our opinion, a standard successful.
Friction also breaks the creative cycles that would make for a dynamic and attractive data ecosystem, one where people can build on reliable existing data and reproducible scientific results to create innovative solutions and accelerate scientific discovery.
A very (in)famous example of data friction comes from computational biology, where Microsoft Excel is widely used for data collection. A number of gene and protein names are mistakenly interpreted by Excel as dates and silently autocorrected to dates or floating-point numbers. According to a recent paper, more than 30% of the published articles in the field with a supplementary Excel gene list contain gene name errors because of these autocorrections. As a result, the collected data is rendered useless and papers have to be retracted. Friction not only hinders the verification of the data, but also blocks further experimentation, because no additional research can be performed on corrupted data. Had the researchers published their data together with a description of how the dataset is constructed, it would have been easier to validate the datasets, check that every field corresponds to the one described (which is definitely not the case if there is a ‘date’ where we should have a ‘name’!), and spot the mistakes before publication.
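A minimal sketch of that idea, using only the Python standard library (an illustration, not the actual Frictionless tooling): a check that knows a column should contain gene symbols, not dates, can flag Excel-corrupted values before publication.

```python
import re

# Pattern of the date-like strings Excel produces when it mangles gene
# symbols such as SEPT2 ("2-Sep") or MARCH1 ("1-Mar").
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)

def suspicious_gene_values(values):
    """Return the values in a declared 'gene name' column that look
    like dates instead, i.e. likely victims of Excel autocorrection."""
    return [v for v in values if DATE_LIKE.match(v)]

# Hypothetical gene list column after a round-trip through Excel:
genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "MYC"]
print(suspicious_gene_values(genes))  # -> ['2-Sep', '1-Mar']
```

With a published schema declaring the column as a string identifier, this kind of check is exactly what a validator can automate.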
Preventing this kind of mistake is the aim of the Frictionless Data initiative at Open Knowledge Foundation. Frictionless Data wants to make it effortless to transport data among different tools and platforms for further analysis. The project consists of a set of standards for data and metadata interoperability, accompanied by a collection of open source software that implements these specifications, and a range of best practices for data management.
As science becomes more data-driven, collaborative, and interdisciplinary, demand increases for data interoperability, but to be able to reuse someone else's findings and data, researchers and data scientists need to understand:
How was the raw data collected and the analysis done?
What do the column names mean?
What are the data types of the columns?
Who created the data?
What is the licence of the data?
How can the data quality be assessed?
Metadata, the data about data, plays a key role in answering those questions. What also helps make data interoperable and reliable are (open) data and metadata standards: sets of agreed specifications for how data and metadata should be made publicly available, with the objective of making it easy to understand both what is described and how it is described.
It takes years of engagement with communities of practice and key stakeholders to agree upon a standard. The Frictionless Data standards have grown out of more than 10 years of iterative work with open data communities and a long engagement with issues around data interoperability, publication workflows, and analysis. The design philosophy behind them is simplicity and utility. What does that mean exactly?
• The requirements are driven by simplicity, disrupting existing approaches and processes as little as possible
• The specifications are extensible and customisable by design
• There are standards for both human-editable and machine-usable metadata
• Existing standard formats for data are reused (we don’t want to reinvent the wheel!)
• The standards are as language-, technology- and infrastructure-agnostic as possible
In aiming for these goals, when version 1 of the specifications was released in 2017, a great effort was made to remove ambiguity, cut under-defined features, and reduce the various kinds of optionality in the way things could be specified, even making some implicit patterns explicit.
For most of the history of the project, the standards were curated by Open Knowledge Foundation, with a tremendous amount of contributions from the community that has grown around it. (Being open source, the project depends on its community for development; have a look at the code contributors page to get an idea.)
As a result, the standards have steadily gained traction across various projects and software, validating the approach we have taken: creating a minimum viable set of lightweight and comprehensive standards to significantly improve the transport of data.
The French Public Open Data Platform, for example, uses the Table Schema specification (a Frictionless Data standard for describing the structure of an individual tabular file, Ed.) to model how data producers should publish some datasets. In France the Frictionless Data standards are even mentioned in a decree.
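For illustration, here is what such a Table Schema could look like (a hypothetical sketch, not the actual French schema): a JSON descriptor listing each field of a tabular file with its type, description and constraints.

```json
{
  "fields": [
    {
      "name": "siret",
      "type": "string",
      "description": "Establishment identifier",
      "constraints": {"required": true}
    },
    {
      "name": "date_debut",
      "type": "date",
      "description": "Start date (YYYY-MM-DD)"
    },
    {
      "name": "montant",
      "type": "number",
      "description": "Amount in euros"
    }
  ]
}
```

With a schema like this published alongside the data, any producer and any tool can agree on what each column means and what values are valid.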
In Brazil, datasets published in their Open Data Portal are fully documented using both the Data Package and the Table Schema specification.
In the research data world, the Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a great example of successful Frictionless Data use. BCO-DMO wanted to improve the transport of data, to support the reproducibility of research by making all of the office's curation activities completely transparent and traceable, and to improve efficiency and consistency across data management staff. With the help of Open Knowledge Foundation, BCO-DMO developed an application for creating Frictionless Data Package Pipelines, which help data managers process data efficiently while recording the provenance of their activities to support the reproducibility of results.
As the BCO-DMO team put it,
“Data Package Pipelines implement the philosophy of declarative workflows, which trade code in a specific programming language that tells a computer how a task should be completed, for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time where programming languages age, change or algorithms improve. This flexibility is appealing because it means the intent of the data manager can be translated into many varying programming (and data) languages over time without having to refactor older workflows.”
Read more about the collaboration with BCO-DMO here.
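To give a flavour of what declarative means here: a Data Package Pipelines workflow is described in a YAML file rather than in code. The sketch below is illustrative only (the dataset, paths and parameters are hypothetical, and processor names may differ between versions); each step states what should happen, not how.

```yaml
# pipeline-spec.yaml (illustrative sketch)
ocean-temperatures:
  pipeline:
    - run: update_package        # set package-level metadata
      parameters:
        name: ocean-temperatures
        title: Ocean temperature measurements
    - run: load                  # load a tabular resource
      parameters:
        from: data/raw_temperatures.csv
        name: temperatures
    - run: dump_to_path          # write the processed Data Package
      parameters:
        out-path: build
```

Because the file carries intent rather than implementation, the same statements can be re-executed by newer tooling years later.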
For data to be effortlessly transportable and easily reusable, it needs to follow some standards and be shared alongside its metadata. The idea behind the Frictionless Data core specification is that it can all be containerised in a package: the Data Package.
A Data Package is a simple container format used to describe and package a collection of data. It contains your data and a file with general metadata and a machine-readable list of the data files included in the package.
Looking closely you’ll see that the Data Package is actually a suite of specifications, made of small specifications that can be used on their own or combined together. For example, you can use the Table Schema specification with the Data Package, as in the Brazilian example above.
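Concretely, a minimal datapackage.json could look like the sketch below (names and paths are illustrative): package-level metadata, plus a machine-readable list of resources, each of which can embed a Table Schema.

```json
{
  "name": "example-measurements",
  "title": "Example measurements",
  "licenses": [{"name": "CC0-1.0", "title": "CC0 1.0"}],
  "resources": [
    {
      "name": "measurements",
      "path": "data/measurements.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "station_id", "type": "string"},
          {"name": "measured_on", "type": "date"},
          {"name": "temperature_c", "type": "number"}
        ]
      }
    }
  ]
}
```

This one small file is enough to answer most of the questions listed earlier: who published the data, under what licence, which files are included, and what each column contains.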
When thinking about the Data Package, and about the specifications in general, our aim has always been not to enforce one standard way of structuring the data or writing a schema. The specifications we have written are designed specifically to be easy to build on, and adapt for use in existing tooling and platforms. What's great about them is that they are lightweight and can be a starting point for building domain-specific content standards.
A Data Package, for example, can contain any kind of data. At the same time, it has an extensible format, meaning that you can add your own attributes to it. Data Packages can be specialised and enriched for specific types of data: there are, for example, Fiscal Data Packages, with extra semantics for budgets, categories of public expenditure and other fields specific to that use case, and Tabular Data Packages, where the resources in the Data Package must be tabular data (e.g. CSV). While the concept of data packaging is not new, the simplicity and extensibility of the Frictionless Data implementation make it easy to adopt within an existing infrastructure, to extend to the specifics of your data, and to use within your existing workflow.
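Extensibility in practice (a hypothetical sketch, not an official profile): custom, domain-specific properties can sit alongside the standard ones in the same descriptor, and tools that do not understand them simply ignore them.

```json
{
  "name": "city-budget-2024",
  "resources": [
    {"name": "budget", "path": "data/budget.csv"}
  ],
  "fiscalYear": "2024",
  "expenditureCategory": "public-transport"
}
```

Here "fiscalYear" and "expenditureCategory" are invented, domain-specific additions; "name" and "resources" are the standard Data Package properties.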
To understand what this means, we can take a look at the work of mySociety, a not-for-profit group pioneering the use of online technologies to empower citizens to take their first steps towards greater civic participation. In their data workflow, mySociety uses the Frictionless Data standards to describe their data, and specifically collect their data resources in Data Packages. That way, all the datasets that they built for their own purposes become more easily accessible and reusable by others.
While mySociety leans on the Frictionless Data standards for basic validation, they chose to use other frameworks (pytest) on top of them to run additional tests on the data itself (check their recent blog post to learn more about mySociety’s data workflow in depth).
The work on the Frictionless Data specifications is managed via an open issue tracker on GitHub that any user or community member can contribute to. This way of working ensures a great degree of transparency, neutrality, and accountability.
The initial idea when the Frictionless Data project was started, more than ten years ago, was for it to be mainly about standards, with a few small code library implementations. In time, it became clear that the standards alone are not enough, and that technical implementations of the standards are also crucial. The focus has now shifted to integration and adoption. A rich set of tools has been developed, including a powerful Python framework to describe, extract, validate, and transform tabular data.
Currently, the standards are only one of the six Frictionless Data tools. The other tools are built on top of the specifications and aim at simplifying your data workflow while ensuring data quality. One of them is the Frictionless Repository, which you can use to automatically validate your data if it is hosted on GitHub. Another is Livemark, which can generate a website for your data, with interactive charts, tables, and other features. You can use one or more of these tools in your data workflow according to your needs.
An excellent example of this is the Libraries Hacked project, which leverages data for innovation and service improvement in public libraries in England. Libraries don’t share a lot of data openly; there is often little guidance on how data should be shared, and few standards. To formalise how data should be structured, it was necessary to create technical ‘data schemas’. As David Rowe from Libraries Hacked said:
“It can be easy to decide on the data you want, but fail to describe it properly. For example, we could provide people with a spreadsheet that included a column title such as ‘Closed date’. I’d expect people to enter a date in that column, but we’d end up with all kinds of formats.
The Table Schema specification for defining data, from Frictionless Data, provided a good option for tackling this problem. Not only would it allow us to create a detailed description for the data fields, but we could use other frictionless tools to allow library services to validate their data before publishing. Things like mismatching date formats would be picked up by the validator, and it would give instructions for how to fix the issue. We would additionally also provide ‘human-readable’ guidance on the datasets.”
(Read the full Libraries Hacked use-case here).
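The ‘Closed date’ problem from the quote maps naturally onto a Table Schema field definition. In the illustrative sketch below (not the actual Libraries Hacked schema), the type is date, which by default expects ISO format, so a value like ‘31/12/2023’ or ‘Dec 31’ would be flagged by a Frictionless validator.

```json
{
  "fields": [
    {
      "name": "Closed date",
      "type": "date",
      "description": "The date the library closed, in ISO format (YYYY-MM-DD)"
    }
  ]
}
```

The "description" doubles as the ‘human-readable’ guidance mentioned in the quote, living right next to the machine-enforced type.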
As the project expands, we are actively thinking about the next iteration of our specifications, to make them more inclusive of non-technical audiences. We are currently working on a Frictionless Data Application, which should be released as a beta version by the end of the year. The Application is going to be a kind of integrated development environment for data publishing (with a user interface similar to Excel in some ways, so that it is easy to use for non-coders too), an everyday tool that data publishers can have in their workflow.
In conclusion, we wanted to share here our main takeaways from this long Frictionless Data journey:
• Don’t try to reinvent the wheel.
Before developing new standards or frameworks, have a look around to see if there is already something fitting your needs, or that could do so with just a few tweaks. There are plenty of quality open source tools out there that are just flexible enough to fit in your own workflows.
• Include broad and diverse audiences in the development of your project.
This is key to getting feedback, but also to making sure you have many different use cases in mind. Build a healthy community around your project: it will provide the context you need to iterate, besides ensuring a certain degree of transparency and accountability.
• Be simple.
Complexity hinders adoption, so data standards should be as simple to use as possible. You need to strike a balance between simplicity and the ability to standardise all the different concepts people might want to use.
In an ideal world, everything would follow the same standard. We would describe the semantics of the data (what a person's name is, what a location is) and the relationships between the data (for example, a ‘city’ belongs to a ‘state’, which belongs to a ‘country’). This would make different datasets interoperable. But at the same time, it would increase the complexity of using the standard, which in turn reduces its usage.
To end with an analogy, the CSV format was a great model for us. Even though it has its issues, and there are technically superior data exchange formats, CSV is still one of the most used. We believe that is because of its simplicity, and that is what we strive for with Frictionless Data. We have decided to describe the minimum needed to understand a dataset: who the authors are, its description, its licence, a data dictionary, and so on. When there is a need to standardise other concepts, you can extend the standards by adding the specifics of your data.