Showing posts with label Building your own system. Show all posts

Friday, February 1, 2013

Interview with organization that built its own sample tracking system.


One option for a science company with unmet data management needs is to buy a solution that works for you, and there are certainly companies that can help you with your decision. Another option is to simply build your software solution from scratch. You might think that this would be strictly the domain of large, well-resourced organizations, but even for a small business or startup this can be a possibility that yields unexpected benefits. There are still places in the US where a lower cost of living means that software developers are not as expensive as they are in Silicon Valley, and with patience, persistence and a bit of luck, business owners can indeed find highly talented local individuals who can create a great solution even in the least "techie" of locales. In the following interview, I talk with key players at a small Midwestern biotechnology company who needed a better way to manage their extensive collection of biospecimens. The company prefers to stay somewhat anonymous, so I have used pseudonyms: "The Biorepository" is the name of the business, "PM" is the primary project manager and biorepository specialist who helped with research and design, "Dev" is the software developer who made it all possible and "Tissue Database" or just "TDB" is the name of the solution itself. "Bob" is, of course, yours truly, the author of this blog.

TDB is a client-server biospecimen management system built by a single developer over a period of about a year with design input from at least three biorepository specialists including "PM". It is currently being phased into use by The Biorepository and is expected to replace their existing systems completely within a few months. Its primary functions include the following:
1) Manage an inventory of more than 2,000,000 human tissue samples. 
2) Compliantly gather, organize and store existing data associated with these samples and also modify and expand this data in response to pathology QC efforts / completed studies / new information provided by ongoing research. This new information should be available to make repeat sales proceed more quickly and to add new value to the existing inventory. 
3) Create, process and track studies (i.e. customer requests for specific biospecimens) from initiation to completion and delivery. 
4) Enforce stringent QA steps and industry best practices that ensure that customers receive exactly what they need in as short a time as possible. 
5) Perform advanced data-mining functions that allow The Biorepository to quickly locate specimens that meet specific inclusion and exclusion criteria. 
6) Create and track batches and sets of specimens that can be processed more efficiently when they are grouped together. 
7) Define and set access roles so that appropriate read / write privileges are maintained throughout and only authorized personnel can see and interact with data as necessary.

Bob: So, Dev, what's your background? How did you learn to code?
Dev: My interest in computers grew out of a childhood fascination with anything and everything technical; I'm self taught from start to finish. I seem to naturally see things in hierarchic, relational ways, and computers really just make sense to me. 

Bob: Wasn't anyone else at all involved in your technical education? Most educational theories say that all learning is social.
Dev: I learned most of my craft from books, particularly those dealing with general programming "best practices" that can be applied to any project, but I also use those thick technical manuals that deal with the details of syntax and so forth, so you might say I learn from the authors of those materials. Free online manuals and forums are another important resource. When I first started out, I was inspired by a friend's father who sold me my first IBM 286, which came with a full set of humongous manuals, as most machines did back then. Plowing through those is what got me started on this path. Like many developers, I have no formal degree in CS, mainly because by the time I was old enough for college I had already surpassed the level of the required foundation classes. I also don't really think that classroom training would have been particularly useful for me. I'm more of a hands-on learner.

Bob: Have you ever been tempted to move to the West coast, or Silicon Valley, where programming skills are in high demand? What keeps you here in the Midwest?
Dev: My family is here and I have never felt any need or desire to leave. It seems like programmers can work from anywhere these days.

Bob: How did you become involved with this particular specimen management project?
Dev: Connections. The owner of the company was looking around informally for a developer and happened to know someone who knew me. We met, we talked, I showed him some of my past work, he told me about the data management issues his company was wrestling with, and the project grew from there.

Bob: How about you, PM… what is your background?
PM: I have a MS in cancer biology and an MBA from a large Midwestern university. I've been working in the biospecimen field for about 5 years. Initially I worked in an academic setting coordinating specimen collection (particularly blood draws) and clinical research projects. I also assisted informatics specialists with the collection of de-identified data and organizing it so that it could be used in research. 

Bob: Do you have any formal training with compliance issues related to the use of patient data in research?
PM: Yes. I was required to complete CITI and HIPAA training, so I am pretty familiar with 45 CFR 46 and other relevant compliance issues.

Bob: How did The Biorepository deal with specimen management before TDB was created?
PM: Initially, the data for our entire collection resided entirely within Excel. Each category of specimen was listed on a different sheet, and a shared drive was used to allow everyone to have access to all of the data all of the time. This worked OK at first since most of the clinical data we received from our sources was already in Excel format, so it was pretty easy to just add this data to our existing master files. We also used a web tool we created ourselves to assign study numbers and monitor the progress of each study and its associated data, but this tool was separate from the data itself. Over time, our collection grew. More sheets and more data fields were added to our Excel files and they gradually became difficult to work with.
Dev: As the number of employees who needed to simultaneously work with these files began to grow, we began to see versioning issues that ate into our efficiency. We also had problems physically locating the right files. We had Excel files scattered all over the place, and remedies such as Dropbox introduced new technical and compliance issues of their own that we couldn't easily resolve.
PM: Given the nature of our business, and our commitment to meeting or exceeding various compliance rules, accreditation standards and industry best practices for data integrity, it eventually became clear that we would need a more sophisticated solution to allow us to better track, search and filter very large amounts of data. It also became clear that we needed a system that provided role-based, real time access for multiple employees that was integrated with our business workflows and compliance needs.

Bob: Did you look at any existing "off the shelf" solutions? 
PM: Initially we looked at caTissue, which is part of the National Cancer Institute's caBIG initiative. We found it cumbersome and difficult to install and use, and ultimately it didn't match up with our specific workflows and data tracking needs. Since caTissue is open source, we considered modifying it to meet our needs, but our best estimate was that this would probably take longer and be more difficult than building our own system from scratch.
Dev: caTissue struck me as being a large suite of tools, most of which it seemed we would never use. Before I was tasked with building TDB from scratch, I spent a lot of time and effort trying to install caTissue, but despite my best efforts I was never able to get it to work with our network infrastructure. I still don't really know why, and I'm pretty technical. I can only imagine how hard it would be for a small lab with no IT support to attempt to deploy this solution.
PM:  We did also look at a few commercial LIMS and inventory management solutions, and we probably would have bought one if we could have found one that somewhat closely matched what we needed, but ultimately none of them were a good fit. 

Bob: In the end, what most influenced your decision to go it alone? Was it price, feature set, or something else?
PM: Pricing for the commercial systems varied wildly. When we started the project we estimated that it might take a developer 6 months to build a usable system from scratch. If we could have found an "off the shelf" product for less than the price of 6 months of developer time, we probably would have bought it. At the time, we thought it was unfortunate that nothing with the right feature set existed. There was nothing out there that allowed us to store, search for, and retrieve specimen data in the same way that we already know our customers typically ask for it. There was nothing that would definitely make repeat sales faster, or help us enrich the data we have available for each specimen, or more generally, make us more profitable. I suppose this, rather than price was the main reason we decided to go it alone. 

Bob: So did you make the right choice?
PM: With hindsight, it would have been a mistake to buy a solution "off the shelf". It turns out that the process of developing the TDB solution from scratch has been tremendously beneficial to the company in that, by embarking on a systematic, step by step analysis of what we do, we have discovered all kinds of new ways to boost productivity and maximize the value of our inventory. Predictably, the scope of the product has grown too as we have addressed these discoveries. We must have at least doubled the size of the project from our original vision and we now have not 6 months but more like a year of developer time invested. Nonetheless, we now consider the end product to be a bargain because of the powerful ways the solution will help us to be more efficient. Part of this success is luck in that we happened to connect with a compatible developer who didn't hand us a solution that was so buggy or slow that we had to waste time with rounds of troubleshooting and rebuilding. Software QC of the weekly builds has not really been much of an issue. Each component has worked well and there have been no major technical setbacks, rather, all of our effort has been concentrated on multiple rounds of design and redesign that have produced something that formalizes and works with our existing workflows and needs. We have also been able to think about new workflow steps, particularly ways to monitor the quality of our specimens, the accuracy of our data and the criteria we use to search for and categorize samples of all kinds. Since our business has always tried to distinguish itself from our competitors by offering superior quality, these new steps align perfectly with our business model and take a lot of the guess work out of things like, say, the accuracy of pathology reports written by third parties. This will mean better quality samples provided with shorter turnaround times for the client. 
Additionally, because we now have such a good understanding of both our business workflows and the specifications of the solution we built, we have the option to continue development and maybe add even more functionality and quality assurance checks. This should help us stay on top of any changes to compliance rules or industry regulations without having to abandon or rebuild the whole system.

Bob: How was the design of the system influenced by compliance concerns?
PM: From the very start of the design process, a lot of thought went into how to meet our responsibilities as a "trusted third party". In particular, the rules of 45CFR46 require the TDB solution to maintain an impenetrable wall between any potentially identifying data that may be associated with each specimen and the use of that specimen in a research context. We were also concerned about future-proofing the system so that it could be adapted to deal with any regulatory changes that might occur in the future. We are confident that our knowledge of the system specifications and code will allow either our current developer, or some future successor to be able to modify the system to deal with these changes if necessary. From the very start we have been particularly careful to avoid compliance issues by simply not keeping any sensitive data in the system. All possible patient identifiers have been removed before the data gets to the TDB, so in theory it is impossible to endanger our position as a "trusted third party".

Bob: Tell me something about the design process you used to build TDB. Where did the design parameters come from?
Dev: The original parameters were fleshed out in a series of meetings with PM and other key team members of the company. We started with just a list of things the solution needed to do, and a clarification of the types of data we work with. It quickly became clear that the data should reside in three distinct, interrelated tiers around which everything else would revolve: the donor, the event and the specimen. Later we added a fourth tier for diagnosis. The basic code structure of the four tiers was created within a couple of months. From there, the details of the solution evolved in a very organic, bottom-up way, with subsequent weekly meetings gradually fleshing out the fine details and adding new features. All the field names and necessary input types were pretty much given to me in those weekly meetings. The only other clear guide I had was a general sense of the workflows, and of course, the compliance rules already established for our industry. Since compliance rules have already been worked out and documentation is widely available, and since we don't do anything that is not already covered by these rules, that particular aspect of the design didn't need a lot of new thought. All we had to do was follow existing guidelines.
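(To make the tier idea a little more concrete, here is a rough sketch of how four interrelated tiers might be laid out as tables. This is not TDB's actual schema. The table names, the column names, the choice to hang diagnoses and specimens off an event, and the sample rows are all assumptions made purely for illustration, and SQLite stands in for MySQL so that the example runs on its own.)

```python
import sqlite3

# Hypothetical four-tier layout: donor -> event -> (diagnosis, specimen).
# None of these names come from the real system; they are illustrative only.
SCHEMA = """
CREATE TABLE donor (
    donor_id TEXT PRIMARY KEY            -- de-identified code, e.g. 'D5736'
);
CREATE TABLE event (
    event_id   INTEGER PRIMARY KEY,
    donor_id   TEXT NOT NULL REFERENCES donor(donor_id),
    event_type TEXT NOT NULL
);
CREATE TABLE diagnosis (
    diagnosis_id INTEGER PRIMARY KEY,
    event_id     INTEGER NOT NULL REFERENCES event(event_id),
    description  TEXT NOT NULL
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    event_id    INTEGER NOT NULL REFERENCES event(event_id),
    tissue_type TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the tier relationships
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO donor VALUES ('D5736')")
    conn.execute("INSERT INTO event VALUES (1, 'D5736', 'surgery')")
    conn.execute("INSERT INTO diagnosis VALUES (1, 1, 'adenocarcinoma')")
    conn.executemany(
        "INSERT INTO specimen VALUES (?, ?, ?)",
        [(1, 1, "colon"), (2, 1, "lymph node")],
    )
    conn.commit()
    return conn


def specimens_for_donor(conn, donor_id):
    """Walk from the donor tier down to the specimen tier via events."""
    rows = conn.execute(
        "SELECT s.specimen_id, s.tissue_type "
        "FROM specimen AS s JOIN event AS e ON e.event_id = s.event_id "
        "WHERE e.donor_id = ? ORDER BY s.specimen_id",
        (donor_id,),
    )
    return rows.fetchall()
```

Because every function reads from the same few tables, a new feature is usually just another query over the existing tiers rather than a new data structure.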

Bob: Was there any conscious effort to follow formalized software development methodologies such as "Agile"?
Dev: No, not really, although we certainly drew on a number of elements classically considered to be programming best practices. In particular, the design is highly modular and hopefully extensible. The iterative, empiric, "QC as you go" approach of "Agile", with short "sprints" followed by frequent reviews certainly seems to be close to what we did, but there was no formal effort to enforce this, and no official "scrum master" per se.

Bob: What components and technologies did you use to build the TDB solution? Why did you choose these particular building blocks?

Dev: Nothing fancy really. The backend is a MySQL database. I didn't really look at any other options for the DB since MySQL is what I am most familiar with. The front end is web-based specifically to keep it cross-platform compatible, and was written in ASP.NET. The server logic is written in Visual Basic. I mainly chose these just because they are what I am most familiar with and because all these components have a good reputation as being simple, "get-er-done" technologies.

Bob: What about the design for the front-end GUI?
Dev: I avoided even discussing the GUI for as long as possible until we had a pretty good idea of what the system was meant to do. Once I understood how the system would work, I story-boarded the interface for the team in a fairly complete form. The design was based on previous projects I had worked on, so I already knew that the design would be at least somewhat user friendly. We story-boarded out every page, every tier, every single data field, before I wrote even one character of code. Once things were fairly concretely defined (both the GUI design and all of the DB tables) it only took me about a week to program up the first prototype since I knew exactly what was required. We are currently beta testing the solution, and so far the GUI has been well received. When we are satisfied that the system is ready for full deployment with no further functional changes, I will probably give the interface one last review just to try to make it look as attractive, polished and professional as possible. As it happens, graphic art is another interest of mine, so I am a bit of a perfectionist when it comes to the look and feel of my software.

Bob: Can you walk us through some of the specifics of the GUI?
Dev: Sure. After logging in, the user starts out on a home page that offers several options including a couple of different search tools:





Dev: If the user then chooses "catalog search" for example, they can search the catalog using a number of filtering criteria, then click on one of the results to see it.





Dev: Pictured below are the Tier Panels. These appear at the top of all Tissue Database pages used for data entry, reviewing and searching. The Tier Panels have been designed to show the maximum amount of pertinent data. At a simple level, the Tier Panels show basic data about the Donor, as well as the Events, Diagnoses, and Specimens associated with that Donor. Through coloring and highlighting, the Tier Panels display further information about the relations between each Tier. In the example below, the Donor “D5736” tier is highlighted so the detail panel underneath shows information about that donor. The user can select any of the enclosed events, diagnoses or specimens to see the details appear in the details panel at the bottom of the window. The Tier Panels below are set to their default size. Each Tier item can be expanded to show the full available data without leaving the current page. The Donor Tier (full yellow bar) can be pulled down as one large block, while the Event, Diagnosis, and Specimen items can all be expanded individually. The “Expand” checkboxes at each Tier level can be used to make the system display all the data for each object in that Tier. The “Expand All” checkbox in the Donor Tier can be used to make all Tier items display at their maximum size to show all data at once. The “Add Event”, “Add Diagnosis”, and “Add Specimen” buttons are only available to users in a high enough Role Group. They are disabled for all other users.
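(The role-based enabling of the "Add" buttons can be sketched in a few lines. This is only an illustration of the idea: the real system relies on ASP.NET for its user and role handling, and the role group names and the minimum level assigned to each button below are invented for the example.)

```python
from enum import IntEnum


class RoleGroup(IntEnum):
    """Hypothetical role groups, ordered from least to most privileged."""
    VIEWER = 1
    TECHNICIAN = 2
    MANAGER = 3


# Assumed minimum role group needed to use each button (illustrative values).
MIN_ROLE = {
    "Add Event": RoleGroup.TECHNICIAN,
    "Add Diagnosis": RoleGroup.TECHNICIAN,
    "Add Specimen": RoleGroup.MANAGER,
}


def button_enabled(button, role):
    """Return True if a user in `role` may use `button`.

    Buttons that are not listed stay disabled, so a newly added control
    is locked down until someone explicitly assigns it a role group.
    """
    required = MIN_ROLE.get(button)
    if required is None:
        return False
    return role >= required
```

For example, `button_enabled("Add Event", RoleGroup.VIEWER)` is false, so the page would render that button disabled for a read-only user.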




Dev: Users can create lists of specimens for specific purposes that can then stay grouped together for further processing, or this list can be output as various pre-formatted .XLS files to create shipping manifests and so forth:





Dev: There are lots of other features and functions of course; ways to edit entries, embed files as attachments, output and distribute information, track and monitor the processing of specimens or batches of specimens and so forth, but the images above should give a sense of the basics.

Bob: Nice! I notice in particular that the system is extremely snappy. I have looked at oh, so many client-server systems where the delay between clicking on something and then getting the result is excruciatingly slow. I love that in this system, the responses, so far as I can see, seem instantaneous for pretty much everything.

Bob: What were the biggest challenges for you once you moved from the prototyping phase to the actual entering of real data and beta testing of the workflows in real world scenarios?
Dev: Probably it was trying to manage the balance between the requests for new features and the dangers of damaging what had already been built. In other words, regression testing of the whole system any time something was changed. One way to address this is to set up new features so that they behave as "stand-alone" whenever possible, and don't require re-writing of previous work... but then you have to be careful to avoid adding too many new "stand-alone" features for anything and everything you can think of, because that can make the interface overly complex.

Bob: Is there a danger that with a "bottom-up" design philosophy like this, the solution begins to sprawl and it becomes increasingly hard to say when it is finished?
Dev: That's a good question. Yes, that is a concern, but we did anticipate this to some extent when we started the process, so that the solution would grow in a logical way rather than "sprawl". For example, all of the various functions draw values from the same places in the same basic data tables. This made it easy to add new functions without causing the overall data model to geometrically increase in complexity. We also restricted certain functions to the specific user access roles that need them so that the GUI is as simple as possible for most users. Another strategy for avoiding endless revision of the solution was to build in a certain amount of flexibility. When a user says that something "almost never happens" what I hear is "this definitely will happen eventually", so I try to build in a certain amount of tolerance for anomalous events. Then, when it turns out that this event DOES occur sometimes, the system doesn't need to be re-written to deal with it. The TDB solution will be deemed to be finished when our entire team is confident that it facilitates core business workflows and captures key data accurately and efficiently. We will also not go fully live until our beta testing confirms that the solution is reliable, and can deal with workflow exceptions and unexpected user behaviors gracefully. There is an argument that because the business will hopefully continue to grow and evolve, our solution will never be truly finished; it will just continue to evolve along with the business. We don't see that as undesirable. There is a great quote from one of my personal heroes, Leonardo da Vinci: "Art is never finished, only abandoned".

Bob: What standards have been set for the solution in terms of up-time, security, backups, disaster recovery and so forth; the sorts of things that larger companies seem to endlessly agonize over when it comes to the deployment of a data management system?
Dev: For backups, right now we are still doing it manually with SQL data dumps. Before we officially go fully live, we will automate this process. Ultimately all client-server traffic will be SSL encrypted, but truth be told, we are still beta testing the system and have not installed the required CERT yet, but then, we currently only use this system within the confines of our own secure network. (Ultimately, authorized users will be able to access the system securely from off site). The user member and authentication system, and also system logging and so forth are all taken care of by the ASP.NET subsystem. As far as up-time, obviously the system was designed to be robust and be up all the time, but we really won't know for sure until we see it in full use. It's easy for software companies to make wild claims about up-time, but the only really meaningful metric is "how long has the system been up?" and even that doesn't mean much until a decent amount of deployment time has elapsed. So far, our testing shows the system to be fast and robust. We haven't found any need yet for regular routine table optimization, reinitialization, or anything like that. Currently the only time the system is down is when I take it down to install updates. Because the system is simple, we are not too worried about disaster recovery. In the unlikely event of some catastrophic problem the system can be restored from a backup pretty easily, and we are making sure to document that process so that anyone with a little IT experience could do it. We are still a fairly small company, so manual re-entry of any small amount of data lost since last backup would not be a big deal. 
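(For readers wondering what automating those SQL dumps might look like, here is a minimal sketch. It is not The Biorepository's script. The database name and backup folder are placeholders, it assumes `mysqldump` is on the PATH with credentials supplied through a MySQL option file, and `--single-transaction` only gives a consistent snapshot if the tables use InnoDB.)

```python
from datetime import datetime
from pathlib import Path
import subprocess


def build_dump_command(database, backup_dir, now=None):
    """Build a mysqldump command that writes to a timestamped .sql file."""
    now = now or datetime.now()
    target = Path(backup_dir) / f"{database}_{now:%Y%m%d_%H%M%S}.sql"
    command = [
        "mysqldump",
        "--single-transaction",        # consistent snapshot (InnoDB tables)
        f"--result-file={target}",     # write the dump straight to the file
        database,
    ]
    return command, target


def run_backup(database, backup_dir):
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    command, target = build_dump_command(database, backup_dir)
    subprocess.run(command, check=True)
    return target
```

A scheduled task or cron job could then call `run_backup("tdb", "/backups")` nightly, with both arguments swapped for real values. Keeping the command building separate from the execution makes it easy to check the file naming without touching a live database.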

Bob: Currently the server is a physical machine located on-site. Did you ever think of deploying it as a virtual slice on, say, Amazon's server farm?
Dev: We do plan to make the system available off-site so that employees have the option to work from home and so that we can offer our customers real time access to certain data, so locating the server on the cloud was something we considered, but honestly I am not a fan of cloud hosted servers for a number of practical and financial reasons. I think ultimately we wanted to retain full control of our own data and system, and that was more important to us at this stage than nebulous benefits like elastic load balancing and other cloud features we just don't need yet.

Bob: Do you think that TDB might be useful for other organizations working with biospecimens or tissue samples, or have you built something that is so tightly integrated with the specific workflows of your company that it basically has a user base limited to this single deployment?
PM: Oh absolutely, I think that, yes, it would be useful to other organizations working with biosamples, or managing a biorepository in a regulated environment. Having looked at caTissue first, TDB certainly seems elegant by comparison. The business case for productizing the system is complicated though, since the most obvious customers would likely be our direct competitors, and we have to think carefully about whether or not we want to offer them something that might diminish our advantage and potentially hurt our core business in the long run. If anyone reading this is genuinely interested in purchasing this system they should definitely talk to us though. I don't think we would rule out anything.

Bob: Could the solution be extended to be used in other functions? For example could you imagine adding, say, ELN capability?
Dev: The short answer is yes. I could probably add any number of additional functions. I know that the .NET framework already has many additional libraries for science and medicine that I could draw on for things like interfacing with particular pieces of scientific equipment. Whether or not that would happen is another question. I don't think The Biorepository's CEO sees a strong case for us going into the kind of GLP research that would require real-time capture of all laboratory activity (i.e. in an ELN), but you never know. Perhaps if we productize this solution this may be something that we would revisit. Many of the tables that define the various fields can be relatively easily edited by anyone with a little DB know-how, which makes the system quite adaptable when it comes to the specifics of data capture and the characteristics of the particular inventory it is managing. It's also conceivable that we may expose this ability in a more user-friendly way by adding a manager's interface that would allow someone with no IT or informatics background to edit tables and define the various relationships between different variables. This would probably be a requirement if we wanted to sell the solution to other organizations with their own unique needs.

Bob: What would your advice be to a scientific organization that is thinking of deploying some sort of enterprise data management system? Do you think they should look for an off-the-shelf system or attempt to build their own system? Do you think some industry standard strategy or solution will eventually emerge?
Dev: Obviously it depends on their needs and capabilities. I do think both approaches are equally risky though, and I don't think the answer necessarily has to be "either / or". For example, I would love it if the TDB solution continued to evolve into a system where multiple modules were available and manager's configuration tools, plus published APIs allowed organizations to pick and choose how they use the solution and how they integrate it with systems they already use. Even better would be if different organizations could be given the ability to write their own modules and plugins, so that even the most specific of workflows and data could be included. Of course, the danger with this strategy is that eventually the solution might turn into something like caTissue. The overabundance of complex options and installable modules that may or may not work with the core system of caTissue is precisely what drove us to build our own solution in the first place.
Bob: It seems to me that the approach you are describing doesn't have to be overly complex though, and has certainly been used successfully in some sectors. Salesforce springs to mind as a good example of a modular system that can be configured for any organization's needs and which already has a thriving secondary developer community that produces some interesting add-ons. At the same time it has remained pretty easy to use. The formula seems to have been commercially successful, and yet nothing quite like Salesforce seems to have emerged as an industry standard in the sciences. My guess would be that this is in part because of the additional complexities related to compliance, human subjects data protection, patent disclosure concerns and so on.
Dev: We also need to consider that the scientific community has significantly more varied and complex data management needs than the average sales team, but unfortunately, scientists have less money than big business. It's difficult to sell a solution that is MORE sophisticated than Salesforce but needs to cost LESS because the vast majority of the intended customer base has no money. Those that DO have money (i.e. large pharma etc.) have generally already built their own solutions. Ultimately it is business considerations and the cost of building a one-size-fits-all solution that have prevented the emergence of an industry standard LIMS, ELN or other scientific data management system.
Bob: I totally agree, and that's why I have always thought that government assistance would probably be required to deploy any kind of nationwide scientific data management system, especially for cash-strapped academic labs. I suppose that is why initiatives such as the caBIG project made sense at the time they were conceived, even though that particular example appears to be somewhat floundering now.

Bob: About how long will it be before the TDB system goes fully online?
PM: Hard to say, but certainly not long. All the data capture machinery is now in place. We mainly just need to do some work on refining the search engine so that it will allow us to build slightly smarter, structured queries that can retrieve batches of specimens that meet certain boolean criteria. I know Dev also wants to take one more pass at improving the appearance of the GUI.
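(As a rough illustration of what a structured boolean query over inclusion and exclusion criteria could look like, consider the sketch below. It is not how TDB's search engine works. The field names and sample records are made up, and a real system would push this logic down into SQL rather than filter in memory.)

```python
def matches(specimen, include, exclude):
    """True if the specimen meets every inclusion rule and no exclusion rule."""
    meets_all = all(rule(specimen) for rule in include)
    hits_any_exclusion = any(rule(specimen) for rule in exclude)
    return meets_all and not hits_any_exclusion


def search(specimens, include=(), exclude=()):
    """Return the batch of specimens that satisfy the boolean criteria."""
    return [s for s in specimens if matches(s, include, exclude)]


# Made-up records with hypothetical field names, for demonstration only.
inventory = [
    {"id": "S1", "tissue": "colon", "diagnosis": "adenocarcinoma", "prior_chemo": False},
    {"id": "S2", "tissue": "colon", "diagnosis": "adenocarcinoma", "prior_chemo": True},
    {"id": "S3", "tissue": "lung", "diagnosis": "adenocarcinoma", "prior_chemo": False},
]

# Colon AND adenocarcinoma, but NOT previously treated with chemotherapy.
batch = search(
    inventory,
    include=[
        lambda s: s["tissue"] == "colon",
        lambda s: s["diagnosis"] == "adenocarcinoma",
    ],
    exclude=[lambda s: s["prior_chemo"]],
)
```

Expressing each criterion as a small rule keeps the query composable: a customer request becomes a list of rules to include and a list to exclude, and the resulting batch can then be grouped for processing.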

Bob: Do you plan any kind of formal user acceptance test or evaluation to certify the system for use in situ?
PM: We have not planned anything formal yet, although we have started to create the user documentation that will probably become the de facto user test. Naturally we are concerned about compliance, and ultimately there are some industry accreditations we want to pursue, so we will probably have an independent third party review the completed solution with those in mind.

Bob: Assuming all goes well with deployment, how might interested organizations learn more about this system and what would be the best way for them to throw money at you if they wanted to use it?
PM: LOL, I think it will be a while before we seriously consider changing our business model to include an enterprise software division, but anything is possible. If there are groups out there that would seriously like to know more they can get in touch with us via your blogger profile page.