How best to collect and share germline gene repertoires?

ematsen · May 23, 2016, 9:43pm

I’m reanimating this thread because of an interesting discussion we recently had on a conference call with several contributors to the forum, with ideas from @dooley, @rarnaout, @caschramm, and @bussec, among others.

There was general consensus that nobody is going to try to start a curated resource like IMGT. Thus, what can we do that is somewhat decentralized?

The most basic thing is simply to put the data somewhere that people can see it. GitHub has advantages for this because it is naturally versioned and people can see diffs. @bussec has done this beautifully over at https://github.com/b-cell-immunology/sciReptor_library. @dooley pointed out that one can upload data to iPlant and mint a DOI for it.

The second tier would be that plus a directory of what resources are where, as well as some standards for sharing. For example, one could imagine that we all agree on a file format, and then there is some machine-readable file that describes where resources are in such a way that a computer program could use that directory to automatically go out and grab things. This file could be updated by pull requests on GitHub.
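To make the directory idea concrete, here is a minimal sketch in Python of what such a machine-readable file and a reader for it might look like. Everything specific in it is an assumption for illustration: the JSON layout, the field names (`name`, `species`, `locus`, `format`, `url`), and the `example.org` URLs are hypothetical, not an agreed standard.

```python
import json

# Hypothetical directory contents. The layout, field names, and URLs
# are assumptions for illustration, not an agreed-upon format.
DIRECTORY_JSON = """
{
  "format_version": 1,
  "resources": [
    {"name": "example-human-igh", "species": "human", "locus": "IGH",
     "format": "fasta", "url": "https://example.org/human_igh.fasta"},
    {"name": "example-mouse-igk", "species": "mouse", "locus": "IGK",
     "format": "fasta", "url": "https://example.org/mouse_igk.fasta"}
  ]
}
"""

# Fields every entry must carry (an assumed minimal set).
REQUIRED_FIELDS = {"name", "species", "locus", "format", "url"}


def load_directory(text):
    """Parse the directory and check that each entry has the required fields."""
    resources = json.loads(text)["resources"]
    for entry in resources:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(
                f"entry {entry.get('name', '?')} is missing {sorted(missing)}"
            )
    return resources


def find_resources(resources, species=None, locus=None):
    """Return the entries matching the given species and/or locus."""
    return [
        r for r in resources
        if (species is None or r["species"] == species)
        and (locus is None or r["locus"] == locus)
    ]


if __name__ == "__main__":
    for r in find_resources(load_directory(DIRECTORY_JSON), species="human"):
        print(r["name"], r["url"])
```

A program could then download each matching `url` in turn. Because the directory is a plain text file, a pull request that adds a resource shows up as a readable diff.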

The third tier is similar but all of the data is in one GitHub repository. One could imagine that each update to the database is a pull request, but this pull request would trigger a job that would do some basic consistency checks, for example that people aren’t adding sequences that are already in the database, and that everything is properly formatted.
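As a rough sketch of the kind of check such a job might run, the Python below compares a submitted FASTA file against the existing database and reports formatting problems and duplicates. The function names, the allowed alphabet (`ACGTN` plus `.` for gaps), and the choice to treat case-insensitive identical sequences as duplicates are all assumptions for illustration.

```python
import re

# Assumed alphabet: nucleotides, N, and '.' for alignment gaps.
VALID_SEQ = re.compile(r"^[ACGTN.]+$")


def parse_fasta(text):
    """Parse FASTA text into a list of (name, uppercase sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def check_submission(existing, submitted):
    """Return a list of human-readable problems found in the submission."""
    problems = []
    seen_names = {n for n, _ in existing}
    seen_seqs = {s: n for n, s in existing}
    for name, seq in submitted:
        if not name:
            problems.append("record with empty name")
        if not VALID_SEQ.match(seq):
            problems.append(f"{name}: empty or contains invalid characters")
        if name in seen_names:
            problems.append(f"{name}: name already in database")
        if seq in seen_seqs:
            problems.append(f"{name}: sequence duplicates {seen_seqs[seq]}")
        # Track this record so duplicates within the submission are caught too.
        seen_names.add(name)
        seen_seqs.setdefault(seq, name)
    return problems


if __name__ == "__main__":
    existing = parse_fasta(">A\nACGT\n>B\nGGCC\n")
    submitted = parse_fasta(">C\nacgt\n>B\nTTTT\n")
    for problem in check_submission(existing, submitted):
        print(problem)
```

A job triggered by the pull request could run a script like this and fail if the list of problems is non-empty. Note that exact-match checks of this kind catch only literal duplicates, not near-identical or erroneous sequences.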

None of this would involve human curation, so incorrect sequences would inevitably be pulled into this web of information. But at least there would be a consistent means of fetching those sequences and naming them.

Thoughts?