Showing posts with label Data. Show all posts

Tuesday, November 19, 2024

Coming in CERF 6: Improved support for using custom apps to perform mission-critical tasks and analyses on files stored in CERF.

Imagine you are a research organization that works with data files in some specialized format. A genetics lab working with GenBank .GBK or snapgene .DNA sequence files would be a good example. Now imagine your software engineers have written a custom app designed to perform some calculation or processing task on your data files, with the result or summary output to a new file.

Let's further imagine that, as a data manager, you need a good record of when each analysis was performed, who performed it, and precisely where the results are stored. As it happens, this is a workflow that CERF ELN is very well suited to support.

In most cases, users work with CERF in conjunction with the default, industry-standard applications on their local computer. An MS Word file, for example, may automatically open in MS Word, whilst your .DNA files may open in, say, snapgene. This workflow illustrates one of the unique advantages of a combined ELN and document management system that uses a desktop application to process your files. CERF carefully logs the interaction between the user and the files stored on the CERF server and displays all activity in the secure audit trail, so that managers are aware of current and past activity and access. In some cases, it may be advantageous to work with highly specialized applications that you've written yourself, designed specifically for performing specialized tasks on data that you have stored in CERF. With CERF ELN, users can specify local applications on their computer that they would like to use to check out and edit specific file types. This allows users to optionally check out files from CERF and open them in a non-default local application.

Lab-Ally has been working with bioinformatics students at the University of Maryland to create a toolbox of small accessory applications that can be used for processing various data files stored in CERF. Each academic term, as part of a capstone bioinformatics class, small groups of students (supervised by Lab-Ally) design, build and test an application of their choice. The application is designed to solve some common bioinformatics problem. An example is described below.

One team of four students recently built a GenBank extractor to make parsing genomic data easier. The program produces a simplified output from GenBank files that is readily compatible with CERF and the CERF search feature. The application can be used as a standalone tool or integrated with CERF ELN to allow for superior record keeping, better efficiency and improved organization-wide collaboration. The parser is designed to extract essential information from GenBank files and output a readable .rtf file.

What Does the GenBank Parser Do?

The parser extracts important data from GenBank files, such as:

  • Accession
  • Organism (Genus species)
  • Taxon data
  • Gene(s)
  • Genetic Sequence

It then organizes this data into an .rtf file, which is easy to read and compatible with most platforms. Below is an example of what you will find in the output:


RTF file showing various metadata retrieved from within a sequence file
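
The students' code is not reproduced in this post, so the following is only a minimal sketch of how the fields listed above might be pulled out of a GenBank flat file using the Python standard library. The function name parse_genbank, the dictionary keys, and the abbreviated sample record are all illustrative assumptions, and the sketch reads only the first record in a file.

```python
import re

# Abbreviated, illustrative record in GenBank flat-file layout (not a complete entry).
SAMPLE = """\
LOCUS       SCU49845                5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycetes.
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
ORIGIN
        1 gatcctccat atacaacggt
//
"""


def parse_genbank(text):
    """Hypothetical helper: extract a few fields from the first record in `text`."""
    record = {"accession": "", "organism": "", "taxonomy": "", "genes": [], "sequence": ""}
    section = None          # which multi-line block we are currently inside
    taxon_lines = []
    seq_parts = []

    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith("ACCESSION"):
            parts = line.split()
            record["accession"] = parts[1] if len(parts) > 1 else ""
            section = None
        elif line.startswith("  ORGANISM"):
            record["organism"] = line[12:].strip()
            section = "taxonomy"             # lineage follows on indented lines
        elif line.startswith("ORIGIN"):
            section = "sequence"
        elif line[:1].strip():               # any other keyword starting in column 1
            section = None
        elif section == "taxonomy":
            if line.strip():
                taxon_lines.append(line.strip())
        elif section == "sequence":
            # Drop the position numbers and spaces, keep only the bases.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        else:
            match = re.search(r'/gene="([^"]+)"', line)
            if match and match.group(1) not in record["genes"]:
                record["genes"].append(match.group(1))

    record["taxonomy"] = " ".join(taxon_lines)
    record["sequence"] = "".join(seq_parts)
    return record
```

Calling parse_genbank(SAMPLE) returns a dictionary with the accession, organism, lineage, a de-duplicated gene list and the concatenated sequence. A production tool would more likely rely on an established parsing library and would need to handle multi-record .gbff files.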



How to Use the Parser as a Standalone Application


Install the Application:

  • Run the installation file on your computer.

Launching the Application:

  • Navigate to the executable file of the program: go to C:\Program Files\GenBankParser and double-click on GenBankParser.exe.
  • A window should pop up with an "Open File" button.

Troubleshooting Display Issues:

  • If the window doesn’t display properly, try resizing it. Some users have experienced this issue, and resizing often solves it.

Processing a GenBank File:

  • After clicking the Open File button, choose a .gb or .gbff GenBank file from your system.
  • The application will process the file and save an .rtf file to your desktop.
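
As a rough illustration of the final step, here is one way a set of extracted fields could be written out as a minimal RTF document in Python. This is a sketch under assumptions, not the students' implementation: the helper names rtf_escape, build_rtf and save_rtf are hypothetical, and the output location is left to the caller rather than fixed to the desktop.

```python
from pathlib import Path


def rtf_escape(text):
    """Escape characters that have special meaning in RTF."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\\{}":
            out.append("\\" + ch)            # control characters must be backslash-escaped
        elif ch == "\n":
            out.append("\\line ")            # line break inside a paragraph
        elif code < 128:
            out.append(ch)
        elif code <= 0xFFFF:
            # \uN takes a signed 16-bit value followed by a fallback character.
            signed = code - 65536 if code > 32767 else code
            out.append(f"\\u{signed}?")
        else:
            out.append("?")                  # outside the BMP: simplified to a placeholder
    return "".join(out)


def build_rtf(fields):
    """Render a mapping of label -> value as a small RTF document (one paragraph per field)."""
    lines = [r"{\rtf1\ansi\deff0{\fonttbl{\f0 Courier New;}}\f0\fs20"]
    for label, value in fields.items():
        lines.append(r"{\b " + rtf_escape(label) + r":} " + rtf_escape(str(value)) + r"\par")
    lines.append("}")
    return "\n".join(lines)


def save_rtf(path, fields):
    """Write the document to `path`; the result is plain ASCII once escaped."""
    Path(path).write_text(build_rtf(fields), encoding="ascii")
```

For example, save_rtf("summary.rtf", {"Accession": "U49845", "Organism": "Saccharomyces cerevisiae"}) would produce a file that opens in most word processors, with each label in bold.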

How to Use the Parser with CERF

  • If you’ve installed the parser, the next step is to configure CERF so it can summon files from the CERF server on demand and utilize the parser tool. Without this step, CERF would simply open GenBank files in whatever the default sequence editing application is on the user's local machine.
  • In CERF, navigate to Tools > Options > Applications.
  • Add the GenBankParser by pointing to the .exe in C:\Program Files\GenBankParser.
  • Set the MIME type to chemical/x-genbank. This helps CERF to understand what types of files you would like to open with the specified application.

This is how Tools > Options > Applications should look once it's set up: 


CERF external application selector
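
To illustrate the general idea behind this setting (not how CERF implements it internally), the sketch below uses Python's standard mimetypes module to map the .gb and .gbff extensions to chemical/x-genbank and then look up an application for that type. The HANDLERS table and the pick_application function are hypothetical; the executable path is the install location mentioned above.

```python
import mimetypes

# A private registry, so the interpreter-wide defaults are left untouched.
registry = mimetypes.MimeTypes()
registry.add_type("chemical/x-genbank", ".gb")
registry.add_type("chemical/x-genbank", ".gbff")

# Hypothetical table of MIME type -> preferred local application.
HANDLERS = {
    "chemical/x-genbank": r"C:\Program Files\GenBankParser\GenBankParser.exe",
}


def pick_application(filename):
    """Return the configured application for a file, or None if no handler is registered."""
    mime_type, _encoding = registry.guess_type(filename)
    return HANDLERS.get(mime_type)
```

With this table, pick_application("plasmid.gb") returns the parser's path, while a file type with no entry returns None and would fall back to the default application.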


Viewing GenBank Files with CERF:

  • Locate any .gb or .gbff file in CERF’s collections.
  • Right-click the file, select View-in, and choose GenBankParser from the list.
  • The parser will open, allowing you to process the file.

CERF modified "View In..."  right-click options



File Output:

  • The application will process the file and save an .rtf file containing the results of the parser analysis to a specified local location.




Pasting as a Relation:

  • The file can then be dragged from the desktop into CERF, and specifically onto the associated file to have it pasted as a relation. This has the advantage that once added to CERF, the .rtf file is immediately indexed for searching so that users with the correct access permissions can search for target text that is located in the .rtf, and once they find THAT file, they can also locate the parent file containing the original raw sequence data.


New in CERF 6, the system will offer the option to automatically associate new files produced by custom applications (containing the results of some analysis) with the parent file containing the raw data. Since CERF offers outstanding version control, it will also be possible to perform these sorts of analyses with different versions of the original data file, associating the results with the correct version of the data in each case, and recording the entire process accurately in the CERF audit trail. We also hope to eventually offer this student-built parser and many other "add-on" tools for use with CERF on our website some time after the release of CERF 6 in 2025.
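
For readers building their own tools, the kind of information such an association needs to capture can be sketched as a small record: who ran which tool, when, on exactly which version of the parent file, and where the result went. The Python below is purely illustrative and says nothing about CERF's internal data model; AnalysisRecord, record_analysis and all field names are assumptions, and a content hash stands in for a version identifier.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical provenance entry linking a result file to one version of its parent."""
    parent_name: str
    parent_sha256: str      # fingerprint of the exact parent content that was analysed
    result_name: str
    tool: str
    user: str
    performed_at: str       # ISO 8601 timestamp in UTC


def record_analysis(parent_name, parent_bytes, result_name, tool, user, now=None):
    """Build a record for one analysis run; `now` can be injected for reproducible tests."""
    timestamp = now or datetime.now(timezone.utc)
    return AnalysisRecord(
        parent_name=parent_name,
        parent_sha256=hashlib.sha256(parent_bytes).hexdigest(),
        result_name=result_name,
        tool=tool,
        user=user,
        performed_at=timestamp.isoformat(),
    )
```

Because the fingerprint is derived from the file content, running the same tool on two different versions of a sequence file yields two records with different parent_sha256 values, which is what lets each result be traced back to the correct version.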

If you would like to see this tool in action or take a look at the code for the tool that these students built, or if you are a student or developer who would like to work with us to create additional tools like this genetic parser, we would love to hear from you. You can find contact information on the Lab-Ally website.



Wednesday, February 18, 2015

Research Data Management (RDM) in the UK, factors affecting development of a coherent national strategy.

Research Data Management (RDM) should be a topic of discussion for all academic and government labs. How can data be coherently protected, archived, searched and made available in ways that further national and institutional research goals? How can (and should?) organizations co-operate to achieve these goals in a consistent way? Frankly, the US lags in this area because to some extent, most of the big academic research institutes see themselves as partial competitors and also because so many of the faculty tend to see themselves as independent agents rather than members of a national team striving towards some common goal for the greater good. In the UK, there tends to be more government sponsored co-ordination of national research goals, so the discussion of a national co-operative research policy appears to be more advanced, although not without its challenges. The article below appears with kind permission from my colleagues at Research Space, makers of the RSpace ELN, a solution that is focussed on the needs of institutional ELN deployments at large academic and government research organizations.
The original appears at:
http://www.dcc.ac.uk/blog/reflections-idcc15-why-road-broader-take-rdm-opening
The Digital Curation Centre (DCC) website at http://www.dcc.ac.uk is a great place to visit regularly for anyone with an interest in scientific data management strategies at the institutional and national levels.
The three fundamental factors influencing RDM take up
The ‘Why is it taking so long’ panel discussion touched on two themes that are crucial in understanding the kind of environment that is conducive to take up of RDM. Geoff Bilder repeatedly, and correctly in my view, hammered home the point that until the right infrastructure is in place you can’t expect researchers to be enthusiastic about engaging with RDM; in fact, you can’t expect them to do it at all.
Geoff pointed to a second, and in his view underlying, issue, namely funding -- without an appropriate funding model infrastructure will develop too slowly to support, and stimulate, take up of RDM. Geoff sees the problem as originating in the current funding model, which tries to squeeze infrastructure development out from grant funding.
Up to a point I also agree with this second strand of Geoff’s argument. But I would suggest that it’s possible to dig down and identify a third, even more fundamental, factor which lies beneath the funding conundrum. This is what could be termed ‘culture’, specifically researchers’ attitudes to RDM infrastructure and tools, and their views on RDM’s priority or lack of priority in the context of their broader need for support.
If researchers don’t view RDM as a priority they are not going to pressure funders or their host institutions to provide the necessary infrastructure and tools to make it possible. No amount of cajoling or encouraging is going to change that, and until recently the RDM community has mostly been in the position of fighting that uphill battle.
So by culture I mean an understanding on the part of researchers of the usefulness of a particular bit of infrastructure or tool, and a desire on their part to adopt or use it because they think it will benefit their research. I would argue that when the culture and the infrastructure or tool are there, funding will follow. Depending on the circumstances (the cost of the bit of infrastructure or tool, the institutional set up, budget, funding cycles, etc.) this may take longer or happen more quickly, but it will happen. My amended picture of the three key factors driving RDM uptake is displayed in the following diagram.


Figure 1 Factors influencing RDM take up

The three constituencies in action:  ELNs at the Universities of Manchester and Edinburgh
A second point to understanding the critical minimum circumstances for RDM to be taken up, and to take off, is that RDM happens in particular institutions. We saw this point illustrated in the many presentations at #IDCC15 where people from a wide variety of institutions talked about RDM at their institution – how it is developing, challenges, progress, issues, etc. In each case the status and prospect of RDM is inseparable from the institutional environment.
Let me here make a second assertion  -- a culture conducive to RDM take up requires buy-in or actually enthusiastic support from three key constituencies in research institutions: researchers, IT managers and administrators – data librarians and research data administrators. Absent support from all three constituencies the culture will not develop, and without that the requisite pressure to find funding for RDM will not happen.



Figure 2 Three key constituencies in research institutions influencing RDM take up
To bring this point home I’ll end by recounting a recent personal experience. In January my colleague Richard Adams and I gave a talk at the University of Manchester about our RSpace ELN and in particular how it had been integrated into the RDM infrastructure at the University of Edinburgh. Our host Mary Mcderby had advertised the session in advance and, to our amazement, more than 60 people showed up: an equal mix of researchers, IT managers and research administrators. Even more amazingly, we were greeted like rock stars (ok, not quite like rock stars), and peppered by a volley of interested questions and comments from all three sections of the audience.
What became clear as the discussion progressed was not only that all three constituencies had an active interest in adopting an ELN, but that they were aware of each other’s interests and by and large seemed supportive of each other and happy to work together. That is what I mean by a culture conducive to RDM take up. I’m confident that Manchester will find a way, sooner rather than later, to adopt an institutional ELN, because (a) the will is there across all three constituencies, and, crucially (b) this is a tool that all three constituencies can see will bring benefits.
The road to a broadly dispersed RDM culture and sustainable funding models is opening up
Pioneering institutions like Manchester and Edinburgh may have to be a bit creative, and come up with innovative and ad hoc solutions, to fund take up of RDM infrastructure. But, as they are now beginning to show the way, pressure will grow on funders to put forward sustainable and well considered funding solutions that are replicable more broadly as the culture at other institutions develops to the point where the majority of research institutions find themselves in a position to follow in the footsteps of the pioneers.
Rory Macneil
Research Space

Tuesday, June 10, 2014

Study says research data is lost at alarming rates


"Eighty per cent of scientific data are lost within two decades, according to a new study that tracks the accessibility of data over time.
The culprits? Old e-mail addresses and obsolete storage devices.
“Publicly funded science generates an extraordinary amount of data each year,” says Tim Vines, a visiting scholar at the University of British Columbia. “Much of these data are unique to a time and place, and is thus irreplaceable, and many other datasets are expensive to regenerate."

Wednesday, April 16, 2014

Integrating an ELN with a university's long term Datashare archive.

Longevity of digital data is always an issue, and as more and more scientists make the jump to ELNs, the importance of archiving data for future generations is often overlooked. Imagine what a loss it would be if there were no equivalent to "Down House" (where Darwin's notes and data are kept). This article discusses a recent initiative at the University of Edinburgh to address this problem by integrating the RSpace ELN with their DataShare long-term archiving system.

http://datablog.is.ed.ac.uk/2014/04/15/using-an-electronic-lab-notebook-to-deposit-data/