Introduction

What

This is a repository for a DRAFT of a computable schema for GFF3

It is designed to separate out the datamodel from the serialization (TSV)

The schema is specified in a single YAML file: https://github.com/berkeleybop/gff-schema/blob/main/src/schema/gff.yaml

This is compiled into multiple artefacts including:

  • JSON Schema
  • ShEx
  • GraphQL
  • Python object model (dataclasses)
  • SQL DDL (soon)

And also online documentation:

https://berkeleybop.github.io/gff-schema/

Why

Currently, the GFF3 spec and its proposed extensions or supplementary specs are specified as text. Text is not computable, which means:

  • additional work in implementing validators
  • potential ambiguities in interpretation
  • additional work to create ancillary artefacts, e.g. a SQL schema or a JSON schema

Additionally, GFF3 is a TSV-based format, which limits the ability to use standard off-the-shelf validation approaches, e.g. OWL or ShEx validation for RDF, or JSON Schema for JSON.

Approach

This project is neutral on the question of TSV serialization vs JSON serialization vs RDF vs XML ...

It takes the view that datamodel and serialization are separate concerns

We first define a UML-like abstract data model. We can then have as many serializations as we like, for example:

  • JSON
  • TSV

The specification of how to map the data model to the TSV becomes a separate (simpler) problem, distinct from specifying the data model itself.
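To illustrate the separation, here is a minimal sketch in Python. The Feature class, to_json and to_tsv are hypothetical names written for this example; they are not the classes generated from gff.yaml. The fields follow the nine standard GFF3 columns, and URL-escaping of reserved characters is deliberately left out to keep the sketch short.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Feature:
    """Hypothetical abstract model of one GFF3 feature (illustration only)."""
    seqid: str
    source: Optional[str]
    type: str
    start: int
    end: int
    score: Optional[float] = None
    strand: Optional[str] = None
    phase: Optional[int] = None
    attributes: Dict[str, List[str]] = field(default_factory=dict)


def to_json(feature: Feature) -> str:
    """Serialize the model as a JSON object; no TSV conventions are involved."""
    return json.dumps(asdict(feature), sort_keys=True)


def to_tsv(feature: Feature) -> str:
    """Serialize the same model as one GFF3-style TSV line.

    Assumption: values contain no characters that need escaping.
    """
    def cell(value) -> str:
        # Undefined values are written as "." in the TSV form.
        return "." if value is None else str(value)

    columns = [feature.seqid, feature.source, feature.type, feature.start,
               feature.end, feature.score, feature.strand, feature.phase]
    attrs = ";".join(f"{key}={','.join(values)}"
                     for key, values in feature.attributes.items())
    return "\t".join([cell(c) for c in columns] + [attrs or "."])


# One in-memory object, two serializations.
gene = Feature("chr1", "example", "gene", 1000, 2000,
               strand="+", attributes={"ID": ["gene0001"]})
print(to_tsv(gene))
print(to_json(gene))
```

The TSV-specific conventions ("." for undefined values, key=value pairs joined by ";") live entirely in to_tsv, so changing them does not touch the data model.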

It could be argued that this is overkill. GFF3 is designed to be simple, and its tight coupling to a TSV representation is arguably a feature that forces simplicity. Yet because GFF3 is crucial to multiple projects, this coupling is arguably also a limitation.

At a minimum, this project provides a clear data dictionary of all attributes used in SO, plus computable constraints on each.
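As a rough idea of what "computable constraints" can look like, the sketch below pairs each entry of a small data dictionary with a check that can be run against a record. The Slot type, DATA_DICTIONARY and violations are hypothetical names for this example and are not taken from gff.yaml; the constraints shown (1-based coordinates, start not greater than end, the allowed strand and phase values) come from the GFF3 spec.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Slot(NamedTuple):
    """One data dictionary entry: a description plus a machine-checkable rule."""
    description: str
    is_valid: Callable[[Any], bool]


# Hypothetical, hand-written dictionary covering four GFF3 columns.
DATA_DICTIONARY: Dict[str, Slot] = {
    "start": Slot("1-based start coordinate",
                  lambda v: isinstance(v, int) and v >= 1),
    "end": Slot("1-based end coordinate",
                lambda v: isinstance(v, int) and v >= 1),
    "strand": Slot("strand of the feature",
                   lambda v: v in {"+", "-", ".", "?"}),
    "phase": Slot("codon phase, only meaningful for CDS features",
                  lambda v: v is None or v in {0, 1, 2}),
}


def violations(record: Dict[str, Any]) -> List[str]:
    """Return the names of the constraints that the record breaks."""
    problems = [f"{name}: invalid value {record.get(name)!r}"
                for name, slot in DATA_DICTIONARY.items()
                if not slot.is_valid(record.get(name))]
    # Cross-field rule, checked only once both coordinates are known to be valid.
    if not problems and record["start"] > record["end"]:
        problems.append("start must not exceed end")
    return problems


print(violations({"start": 10, "end": 20, "strand": "+", "phase": None}))
print(violations({"start": 0, "end": 20, "strand": "x", "phase": 3}))
```

Because the rules are data rather than prose, the same dictionary could in principle drive a validator, generated documentation, or a JSON Schema export.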

Similar Work

  • GFFO
  • FALDO
  • AgBio specs

Status: PROOF OF CONCEPT

This is currently the product of a 15-minute LinkML demo session on a BBOP hangout. Much more needs to be done!

Repository

https://github.com/berkeleybop/gff-schema