
| Biotechnology, Information, and the Internet |
|---|
| Homework Assignment |
It is accurate
You have context information for the data
You can access the data in a meaningful way (make queries)
You can correlate data in your database with other data/information
In the Human Genome Project, the value of complete data is quite easily seen. For instance, the more of the human genome that has been mapped and sequenced, the higher the likelihood that a sequence you identify will match a sequence in the databank, allowing you to identify the locus, find genes that may be in the same control region, and see whether other groups have identified the same gene based on similar or disparate research. The context for the sequence is its location in the subclone, how it was sequenced, what subclone it came from, and where the subclone maps in the genome. The more genomic data and context data you have, the more accurate and meaningful your queries of the database will be. For instance, if you want to identify all the genes that contain a canonical GTP binding region, your results will depend on how much of the genome is known. This type of query allows you to do one type of correlation of the data - how many stretches of primary sequence homology are in the database. Other types of correlations may be more abstract - for instance, what does a promoter "look like" to polymerase? Biochemically, we understand many types of promoters, but we don't understand the exceptions. Pattern searching and data mining techniques may help us recognize differences between active and inactive promoters, and differences between cognates and incomplete duplications.
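As a rough illustration of the GTP binding region query described above, the sketch below scans a toy set of protein sequences for the Walker A (P-loop) consensus, G-x(4)-G-K-[ST]. Treating that consensus as the "canonical GTP binding region" is an assumption made here for illustration, and the function name, record names, and toy sequences are all hypothetical. A real search would run against a curated database with a dedicated motif tool.

```python
import re

# Walker A (P-loop) consensus G-x(4)-G-K-[ST] as a regular expression.
# Using it to stand in for "a canonical GTP binding region" is an assumption.
P_LOOP = re.compile(r"G.{4}GK[ST]")

def find_motif(records, pattern=P_LOOP):
    """Return (name, start, matched_text) for each non-overlapping motif hit."""
    hits = []
    for name, seq in records.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group()))
    return hits

# Toy sequences for illustration only; the record names are hypothetical.
toy_db = {
    "protein_A": "MTEYKLVVVGAGGVGKSALTIQLIQ",
    "protein_B": "MKLLVVAAPPQRSTW",
}

hits = find_motif(toy_db)
for name, start, text in hits:
    print(name, start, text)
```

The number of hits can only grow as more records are added to `toy_db`, which is the point made above: the answer to "all the genes that contain this motif" is bounded by how much sequence the database holds.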
The accuracy of the database will affect your queries - that is, did either the search sequence or the target sequence contain errors? For sequence information, these errors can come from technical problems, mutations in the particular clone isolated, the presence of transposons or retroviruses, or entry mistakes. Such errors may show up as frame shifts, point mutations, insertions, or deletions. Human errors include transpositions, copying errors, and misinterpretation of gels or other lab data. Verifying sequence data is a necessary and time-consuming step in assembling a meaningful sequence database. Keeping these concepts in mind when you are using GenBank and protein databases will help you understand the power and also the limitations of these resources.
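To make the difference between a point mutation and a frame shift concrete, here is a minimal sketch that splits a nucleotide string into codons and counts how many codons change under each kind of error. The reference sequence, the two erroneous reads, and the helper names are hypothetical and chosen only for illustration.

```python
def codons(seq, frame=0):
    """Split a nucleotide string into complete triplets, starting at offset `frame`."""
    usable = len(seq) - frame
    end = frame + usable - usable % 3
    return [seq[i:i + 3] for i in range(frame, end, 3)]

def count_changed_codons(ref, read):
    """Count codon positions that differ, comparing pairwise up to the shorter list."""
    return sum(a != b for a, b in zip(codons(ref), codons(read)))

# Hypothetical reference: ATG GCC AAG TTT GGA
reference = "ATGGCCAAGTTTGGA"

# Hypothetical read with a single substitution (C -> T at index 4).
with_substitution = reference[:4] + "T" + reference[5:]

# Hypothetical read with one spurious 'C' inserted at index 5.
with_insertion = reference[:5] + "C" + reference[5:]

print(count_changed_codons(reference, with_substitution))
print(count_changed_codons(reference, with_insertion))
```

The substitution alters only the codon it falls in, while the single inserted base shifts the reading frame and changes every codon downstream of it. This is one reason a small entry or sequencing error can make a database match look far worse than it really is.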
Complexity of Data
Reproducible Access (standard queries)
Ability to "mine" data associations
Storage and Archival
One of the benefits of life after World War II in the United States is that we have experienced 50 years of uninterrupted prosperity and growth. As a part of that growth, we have much better access to information, and as the cost of publication plummets, our information output skyrockets. For instance, the amount of information available in printed and electronic form is shown in the graph below.

What about Biotechnology and biological information? The process of information identification I have outlined is very general - let us fill it with some details relevant to biology. For instance, in the past 40 years we have gone from 4 journals dedicated to biological research and findings to more than 400. The volume of data that the FDA receives today for a standard NDA is truly staggering, and both the volume of the typical submission and the number of submissions have dramatically increased in the past 20 years. In addition to publications, modern biologists have developed methods for sequencing DNA and protein at ever higher rates, and the total length of known DNA and protein sequences has skyrocketed in the past 30 years from tens of hundreds to tens of billions.
Fortunately, along with these changes that are leading to a dramatic increase in the amount of raw data that we generate have come advances in computing technologies to help us come to grips with the increase in data. Concomitant with the exponential growth of information, we are experiencing an exponential growth of digital storage and computational power. Using computers, we can manage our data, extract useful correlations, as well as identify data and analyze it. The cost of storage and computer memory is plotted below:

New storage technologies, such as DVD, should continue to push the cost of storage down, with storage costs less than a penny per megabyte by the turn of the century. Similarly, the cost of computation, or computational power, has dramatically declined. The computing ability of a commodity desktop computer today approaches or exceeds that of a midrange ($50K-100K) workstation of six or seven years ago. Likewise, the graphical display capabilities of today's Macintosh and Windows computers exceed those of a high-end Evans and Sutherland graphics workstation of a decade ago. All of this computational and imaging horsepower on your desk for less than $3,000!

CPU advances, both in manufacturing practices and architecture, are driving much of the increase in performance seen in this graph. According to "Moore's Law", the complexity and performance of integrated circuits progress at an exponential rate, roughly doubling every 18 months. This observation was made in 1965 by Gordon Moore while at Fairchild Semiconductor, based on 10 years of watching the semiconductor industry. In 1968, Moore, Bob Noyce and Andy Grove left Fairchild Semiconductor (where Noyce had co-invented the integrated circuit in 1959) and formed Intel. Grove is currently the CEO of Intel, and 30 years later Moore's predictions have been borne out by 25 years of growth in the microprocessor industry. As you can see from the following graph, processing power continues to increase, and the cost of computing continues to decline.
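The 18-month doubling figure quoted above turns into simple arithmetic: after t years, capacity has grown by a factor of 2^(12t/18). The short sketch below computes that factor; the function name and the choice of example periods are hypothetical, and the 18-month constant is taken from the text rather than from measured data.

```python
def growth_factor(years, doubling_months=18.0):
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)

# Two doublings fit into 3 years, ten doublings into 15 years.
factor_3_years = growth_factor(3)
factor_15_years = growth_factor(15)

print(factor_3_years, factor_15_years)
```

Under this assumption, three years gives a fourfold increase and fifteen years gives roughly a thousandfold increase, which is why the curves in this kind of plot look like straight lines on a logarithmic axis.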

Technology driven
Performance driven
Computer Time
Internet Time
The computer industry is likewise very competitive. In addition, advances in manufacturing and CPU architecture give each succeeding generation significant performance enhancements for a similar price.

How much life do you estimate is left in the Pentium? How long do you expect the Pentium Pro to stay on the market?
Data Ownership
Data Modeling
Structure Data
Function Data
Phenotypic (clinical) Data
Diagnostic Data
Outcome Data
Structure/Function Relationships
Disease Prognosis
Disease Progression
Imaging Technologies
Pattern Recognition
Neural Computation
The primary reason, IMHO, people amass and analyze huge volumes of data is the assumption that this data can be identified and analyzed in a meaningful way. It is our ability to access stored data in a coherent fashion that gives the data value. In the most general sense, this value is based on the kinds of questions you can ask of the data set, the cost of acquiring and assimilating the data, and the accuracy and completeness of the data. In biology, it is much easier to understand (and estimate, for at least some kinds of data) the value of data than in some other fields.
For instance, take the sequencing of an organism's genome. If you have the complete genome map and sequence for a particular organism, you should be able to identify the location and verify the sequence (and lack of mutations) for any particular strain (or, for humans, individual). Your ability to do this depends on the accuracy of the data. When you want to identify a particular sequence, you need to be able to query or access that sequence before you can put it to any other use. Finally, you need to be able to compare or correlate new sequences or sets of sequences with the existing sequences.
Databases & Electronic Information Sources
Other Government Publications
U.S. Patent Office
U.S. Department of Commerce
National Oceanic and Atmospheric Administration
Technology Administration
Biotechnology Software Journal Editor's Home Page
Genetics Computer Group
Disclaimer:
This is not an official Northwestern University site. No guarantee, warranty or claim is made or should be implied by the content or absence of content.
Any services, facilities, or other sites listed in these pages are strictly listed as a convenience to the reader. No endorsement of the content of listed sites or identity of any published organization is implied or intended.
last modified 02/24/98
Please contact Warren Kibbe if you have
comments, additions or resources you would like to see on this page.