Harmonized neuroimaging data management in Synapsy

Synapsy DMP Training @ Campus Biotech, Geneva

by Sebastien Tourbier

March 08, 2021

BIDS

Brain Imaging Data Structure

What is BIDS?

Specifications to organize and describe neuroimaging data


https://doi.org/10.1038/sdata.2016.44

Comprehensible organization and naming with well-accepted formats

from https://bids-specification.readthedocs.io

Designed with the Findable, Accessible, Interoperable and Reusable (FAIR) principles in mind.

https://www.go-fair.org/fair-principles/

Image with acquisition metadata

Dataset documentation and metadata
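
For instance, a minimal BIDS dataset for a single subject could look like the illustrative layout below (dataset, subject, and file names are placeholders; the specification defines the full naming rules):

    my_dataset/
        dataset_description.json        dataset-level metadata
        participants.tsv                participant-level metadata
        README                          dataset documentation
        sub-01/
            anat/
                sub-01_T1w.nii.gz       image data
                sub-01_T1w.json         acquisition metadata (JSON sidecar)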

Standard adopted by a growing number of researchers


Results from www.webofknowledge.com (Date: March 07, 2021)

BIDS official website

https://bids.neuroimaging.io

How is BIDS useful?

Make data fully understandable on its own thanks to its metadata and documentation files

Facilitate data sharing between lab members and collaborators within Synapsy

Make code interoperable between projects, lab members, and collaborators within Synapsy

Very little effort is needed to publish a dataset to databases

Databases such as OpenNeuro, LORIS, COINs, XNAT, SciTran and others accept and export datasets organized according to BIDS

Benefit of dedicated and well documented tools

For BIDS dataset creation

https://github.com/nipy/heudiconv
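
A conversion call might look like the sketch below (paths, subject label, and heuristic file are placeholders; check heudiconv --help for the options of your installed version):

    heudiconv -d "dicoms/{subject}/*/*.dcm" -s 01 \
        -f code/heuristic.py -c dcm2niix -b -o bids_dataset/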

For validation and data curation support

https://bids-standard.github.io/bids-validator
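
Besides the online validator linked above, it can also be run locally, for example (paths are placeholders):

    # with Node.js
    npm install -g bids-validator
    bids-validator /path/to/bids_dataset

    # or with Docker
    docker run -ti --rm -v /path/to/bids_dataset:/data:ro bids/validator /data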

For dataset query

e.g. PyBIDS: https://github.com/bids-standard/pybids

For analysis

A number of processing pipelines handling BIDS datasets (BIDS Apps) are available, ranging from quality control to preprocessing, connectome mapping, and statistical analysis - and maybe one of yours in the future!

https://bids-apps.neuroimaging.io/apps/
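
All BIDS Apps share the same command-line interface (BIDS directory, output directory, analysis level); a typical Docker invocation could look like the sketch below (image name and options are illustrative, following the generic bids/example app):

    docker run -it --rm \
        -v /path/to/bids_dataset:/bids_dataset:ro \
        -v /path/to/outputs:/outputs \
        bids/example /bids_dataset /outputs participant --participant_label 01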

How to get started with BIDS?

Online BIDS specifications

https://bids-specification.readthedocs.io

Official BIDS Tutorials

https://github.com/bids-standard/bids-starter-kit/wiki/Tutorials

DataLad

What is DataLad?

Distributed data management system

  • Built on top of Git and git-annex
    https://git-scm.com/book/en/v2
    It allows you to keep track of datasets with large file content just as you track text files with Git.

  • But it is much more than that!

What is DataLad capable of?

Keep track of your dataset history

  • Create an empty DataLad dataset:

    datalad create [-c yoda] [-c text2git] <path>

    A dataset has a history that tracks files and their modifications; it can be explored with Git:

    git log

  • Record the dataset or file state to the history with a descriptive message:

    datalad save -m "message"
    									
    Concise commit messages should summarize the change for future you and others.

  • Report the current dataset state:
    
    	datalad status
    									
    A clean status is good practice; a short worked example combining these commands follows below.
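
Putting these three commands together, a minimal session could look like this sketch (dataset and file names are placeholders):

    datalad create -c text2git my-dataset    # new dataset, configured with the text2git procedure
    cd my-dataset
    echo "Project notes" > notes.txt         # add some content
    datalad status                           # reports notes.txt as untracked
    datalad save -m "Add project notes"      # record the new file in the dataset history
    git log --oneline                        # inspect the recorded history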

Dataset consumption and sharing

  • Publish your dataset to a remote dataset repository:
    • Create a dataset sibling on a machine accessible locally or via SSH (UNIX-like shell):
      
      	datalad create-sibling
      											
      It creates a remote dataset repository and configures it as a dataset sibling to be used as a publication target.

    • Publish the dataset:
      
      	datalad push
      											
      It pushes all your saved changes and annexed data to the remote dataset repository, as sketched below.
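
A sketch of this workflow (host, path, and sibling name are placeholders):

    # create a sibling named "myserver" on an SSH-accessible machine
    datalad create-sibling -s myserver ssh://user@server.example.org/home/user/my-dataset

    # publish saved history and annexed file content to it
    datalad push --to myserver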

Computationally reproducible dataset analysis

  • Execute and track input, output and source code:
    
    	datalad run
    									
    It links datasets (as subdatasets) and source code, records data origin and command execution, and collects and stores the provenance of all dataset content that is created (see the sketch after this list).

  • The analysis step can be re-executed with:
    
    	datalad rerun
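
A sketch of such a provenance-tracked call (script and file names are hypothetical):

    datalad run -m "Skull-strip sub-01 T1w" \
        --input "sub-01/anat/sub-01_T1w.nii.gz" \
        --output "derivatives/sub-01_brainmask.nii.gz" \
        "python code/skullstrip.py {inputs} {outputs}"

    # the recorded command can later be re-executed with: datalad rerun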
    									

Computationally reproducible dataset analysis

  • Execute and track input, output, source code, and computing environment (in the form of software containers) with the datalad-container extension:

    	datalad containers-run

  • It stores the software container in the dataset, links datasets (as subdatasets) and the software container, records data origin and command execution, and collects and stores the provenance of all dataset content that is created (see the sketch below).

    Fully computationally reproducible analysis
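
A sketch, assuming the datalad-container extension is installed (container name, image URL, and script are placeholders):

    # register a software container in the dataset
    datalad containers-add analysis-env --url shub://example/analysis-env:latest

    # run the analysis inside the registered container, with full provenance capture
    datalad containers-run -n analysis-env -m "Run containerized analysis" \
        --input sub-01 --output derivatives \
        "bash code/run_analysis.sh {inputs} {outputs}"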


How to get started with DataLad?

Follow the great DataLad handbook

http://handbook.datalad.org

What are the steps?

Step 1: Understanding how datasets are organized in each lab

Step 2: Creation of BIDS datasets for raw data

Step 3: Adoption of DataLad and creation of a dataset catalog

Thank you for your attention!