I was at Goodenough College in London last Wednesday, 5th October 2011, for a workshop organised under the JISC Discovery programme (discovery.ac.uk), to discuss approaches to publishing, managing, and using Open Bibliographic Data (OBD) on the web. Here are some of the notes that I made on the day. I’ve left them rather rough because I don’t have time to bully them into proper paragraphs.
The workshop started with a general overview and discussion of the current picture of OBD.
- We’re dealing with a growing number of technologies for open library discovery: Linked Data, BibJSON, OPDS (based on Atom), Lincoln’s NoSQL/API-centric approach, even SuperMARC(!?).
- Few if any people have a good handle on all of these approaches, but we ought to be at least conversant with them.
- We’re a room full of experimenters! But how can we communicate Discovery/OBD to others? How can JISC funding be used to support the work? We need to surface not only tools and data but also skills.
- Possibility of looking to e.g. DevCSI/Netskills to help with addressing the skills gap. Are CompSci graduates being encouraged to exercise their skills in open/community development?
We then split into two groups to brainstorm “what’s interesting in bibliographic data at the moment?” Between them, the two groups managed to fill around 8 flipchart sheets.
A few quotes and themes I picked up on:
- What will be the value of OA repositories in hindsight? Will it be open data (some are skeptical) or rather will it be their effect on the publishing industry?
- A really useful application would be a one-size-fits-all API to identify possible identifiers within a record/page – “I think this is an identifier, please tell me what sort it is” – which then leads into a web service to aggregate information about the thing itself (rights information, etc.). Jokingly called “Rate my Regex”! There was some interest in this as a project.
- Paul Walk: “Please can we have a day off from Linked Data!?”
- Idea of the role of “data doctor/data wrangler” gaining some currency in institutions.
- There are plenty of code libs for dealing with bibliographic data: pymarc, MARC4J, MARC::Record (Perl), solrmarc.
- Owen Stephens: “MARCXML is the worst of MARC combined with the worst of XML. It’s rubbish.”
- A colleague of Peter Murray-Rust (sorry, I didn’t catch your name!): citable data is not copyrightable. Java library containing ~20,000,000 open article records???
- Mark MacGillivray[?]: “To most people, this [taps laptop] is just a plastic box full of magic.”
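The “I think this is an identifier, please tell me what sort it is” idea from the list above can be sketched in a few lines of Python. This is only a rough illustration, not anything that was shown at the workshop: the function name `classify_identifier` and the choice of four identifier types (ISBN-10, ISBN-13, ISSN, DOI) are my own assumptions, and a real service would need to cover many more schemes and then go on to look up information about the thing identified.

```python
import re

# Hypothetical sketch: guess which kind(s) of identifier a string might be.
# Only four schemes are covered; the patterns are deliberately simple.

ISBN10_RE = re.compile(r"^\d{9}[\dX]$")
ISBN13_RE = re.compile(r"^97[89]\d{10}$")
ISSN_RE = re.compile(r"^\d{7}[\dX]$")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def _mod11_ok(chars):
    """Check digit test shared by ISBN-10 and ISSN (weights n..1, X = 10)."""
    total = 0
    for position, ch in enumerate(chars):
        value = 10 if ch == "X" else int(ch)
        total += (len(chars) - position) * value
    return total % 11 == 0


def _isbn13_ok(digits):
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def classify_identifier(candidate):
    """Return a list of identifier types that the candidate string could be."""
    text = candidate.strip()
    # Hyphens and spaces are only separators in ISBNs and ISSNs.
    compact = re.sub(r"[\s-]", "", text).upper()
    matches = []
    if ISBN10_RE.match(compact) and _mod11_ok(compact):
        matches.append("ISBN-10")
    if ISBN13_RE.match(compact) and _isbn13_ok(compact):
        matches.append("ISBN-13")
    if ISSN_RE.match(compact) and _mod11_ok(compact):
        matches.append("ISSN")
    if DOI_RE.match(text):
        matches.append("DOI")
    return matches


if __name__ == "__main__":
    for sample in ["0-306-40615-2", "978-0-306-40615-7", "0378-5955", "10.1000/182"]:
        print(sample, classify_identifier(sample))
```

Returning a list rather than a single answer is deliberate: some strings could plausibly belong to more than one scheme, and an empty list simply means “no idea”.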
After lunch we split again, this time into three groups. Each considered a different aspect of managing Open Bibliographic Data, looking at the opportunities, costs, pitfalls, etc. of the technologies themselves as well as the skills needed to exploit them:
1. Transforming data
2. Munging data (groups 1 and 2 agreed that the two steps are really the same thing – just “more transformation” – also that ‘munging’ is an awful word…)
3. Exploitation of data
I was part of the ‘Munging data’ group.
- Problems in the move from a unitary system to distributed data services – loss of control (quality of 3rd-party data can be a problem for the librarian mindset!), worries over sustainability of mashup-style approaches (cf. DBpedia, BBC RDF, the now-defunct Talis Silkworm project). However, openness itself provides some guarantee against things becoming defunct (as with Open Source Software).
- Need to think about the capacity (and the uneven geographic distribution) of local skills.
- “Any data is better than no data”. Use of third-party open data is not really a challenge for management any more (only cataloguers care!)? But still important are notions of provenance, attribution, putting power back in the hands of the end user.
- We need to think at the citation level – is there a big difference between personal and institutional data?
- Character encoding!
- Skills. Not enough developers. Unevenly distributed geographically. (Can we construct a course/curriculum for open community development skills?)
- #ukdiscovery is somewhat distant from the mundane concerns of libraries. Ed Chamberlain is speaking to a group of cataloguers in Oxford about OBD – that’s the sort of thing we want!
- Thinking about the role of CILIP and ‘professionalism’ – keeping [technical] skills up to date. Portfolios/competency framework approaches. Can we get a push from the top of the library profession?
- Technology gaps, on the other hand, have mostly gone away. There are enough interesting and easy things to keep us busy without having to worry too much about the things that still don’t work. JISC can help to convince (smaller?) institutions that open development should be trusted.
- Still attempting to overcome legacy licensing issues. Instead of concentrating on dealing with old data, why don’t we just take a “line in the sand” approach and make sure we’re being 100% open from now on? Do the OBD principles need to be extended?
- Make use of feedback loops. Learn something about your data by feeding how it’s been used back into the system. Use this usage to inform your transformations.