Research, Computing, and the Internet


In general, data in a database is valuable because:

  dot It is complete

  dot It is accurate

  dot You have context information for the data

  dot You can access the data in a meaningful way (make queries)

  dot You can correlate data in your database with other data/information

 

In genomic research, these concepts are readily understood when we look at the sequencing of the human genome or some other sequencing project.

In the Human Genome Project, the value of complete data is easy to see. The more of the human genome that has been mapped and sequenced, the higher the likelihood that a sequence you identify will match a sequence in the databank, allowing you to identify the locus, find genes that may be in the same control region, and learn whether other groups have identified the same gene through similar or disparate research. The context for a sequence is how it was sequenced, which subclone it came from, its location within that subclone, and where the subclone maps in the genome. The more genomic data and context data you have, the more accurate and meaningful your queries of the database will be. For instance, if you want to identify all the genes that contain a canonical GTP binding region, your results will depend on how much of the genome is known. This type of query lets you do one kind of correlation of the data: how many stretches of primary sequence homology are in the database. Other types of correlation may be more abstract. For instance, what does a promoter "look like" to polymerase? Biochemically, we understand many types of promoters, but we don't understand the exceptions. Pattern searching and data mining techniques may help us recognize differences between active and inactive promoters, and between cognates and incomplete duplications.
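As a minimal sketch of the kind of query described above, the Python below scans a small set of protein sequences for one common nucleotide-binding pattern, the P-loop (Walker A) consensus [AG]-x(4)-G-K-[ST]. The text does not say which GTP binding pattern is meant, so the choice of motif is an assumption. The function name `find_motif` and the two toy sequences are hypothetical and exist only for illustration; they are not database records.

```python
import re

# P-loop (Walker A) consensus [AG]-x(4)-G-K-[ST] written as a regular expression.
# Assumption: this stands in for the "canonical GTP binding region" in the text.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(records, pattern=P_LOOP):
    """Return (name, start, matched_text) for every non-overlapping match."""
    hits = []
    for name, seq in records.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group()))
    return hits


# Toy "database": illustrative sequences only.
toy_db = {
    "protA": "MSDRKLILIGAPNSGKTTLLNRL",  # contains a P-loop-like stretch
    "protB": "MSLLKRPEDVAATLHE",         # contains none
}

for hit in find_motif(toy_db):
    print(hit)
```

The number of hits can only grow as more sequences are added to `toy_db`, which is the point made above: the answer to "all genes with this motif" is limited by how much of the genome is in the database.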

The accuracy of the database will affect your queries: did either the search sequence or the target sequence contain errors? For sequence information, errors can arise from technical problems, mutations in the particular clone isolated, the presence of transposons or retroviruses, or data entry mistakes. They may appear as frame shifts, point mutations, insertions, or deletions, or they may be human errors such as transpositions, copying errors, and misinterpretation of gels or other lab data. Verifying sequence data is a necessary and time-consuming step in assembling a meaningful sequence database. Keeping these concepts in mind when you use GenBank and the protein databases will help you understand both the power and the limitations of these resources.
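To make the difference between these error types concrete, here is a small Python sketch comparing a single-base substitution with a single-base insertion. The sequences are made up, and `codons` and `changed_codons` are hypothetical helper names. It only counts how many three-base codons change; it does not translate to protein.

```python
def codons(seq, frame=0):
    """Split a DNA string into complete three-base codons starting at `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def changed_codons(a, b):
    """Count codon positions (read in frame 0) at which two sequences differ."""
    return sum(1 for x, y in zip(codons(a), codons(b)) if x != y)


reference = "ATGGCCAAGCTGTTCGAC"     # made-up 18-base sequence, 6 codons
substituted = "ATGGCCAAGCTCTTCGAC"   # point change: G -> C at index 11
inserted = "ATGGCCAAAGCTGTTCGAC"     # one extra A inserted in the third codon

print(changed_codons(reference, substituted))  # only the codon holding the change
print(changed_codons(reference, inserted))     # every codon from the insertion on
```

A substitution disturbs one codon, while an insertion shifts the reading frame and disturbs every codon downstream of it. A single entry or sequencing error of the second kind can therefore make a correct gene look unrelated to its database record.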

 

Using Computers to Manage Information - Why Bother?

  dot Volume of Data

  dot Complexity of Data

  dot Reproducible Access (standard queries)

  dot Ability to "mine" data associations

  dot Storage and Archival

 

 

The sheer volume of information available

How much information is available? According to a recent report by Ernst and Young, the volume of published material doubled from 1880 to 1930, again from 1930 to 1960, and again from 1960 to 1970, and is currently doubling every 18 months. Keeping track of the mass of publications, let alone accessing and analyzing information in this sea of data, is difficult at best. To help manage this overwhelming mass of data, computing solutions using search engines and "intelligent agents" have emerged to let us skim and peruse huge databases, looking for the few kernels of relevant information in warehouses full of data fodder.

 

One of the benefits of life after World War II in the United States is that we have experienced 50 years of uninterrupted prosperity and growth. As a part of that growth, we have much better access to information, and as the cost of publication plummets, our information output skyrockets. For instance, the amount of information available in printed and electronic form is shown in the graph below.

What about biotechnology and biological information? The process of information identification I have outlined is very general, so let us fill it in with some details relevant to biology. For instance, in the past 40 years we have gone from 4 journals dedicated to biological research and findings to more than 400. The volume of data that the FDA receives today for a standard NDA is truly staggering, and both the volume of the typical submission and the number of submissions have dramatically increased in the past 20 years. In addition to publications, modern biologists have developed methods for sequencing DNA and protein at ever higher rates, and the total length of known DNA and protein sequences has skyrocketed in the past 30 years from tens of hundreds to tens of billions.

 

Fortunately, the changes that are driving this dramatic increase in raw data have been accompanied by advances in computing technologies that help us come to grips with it. Concomitant with the exponential growth of information, we are experiencing an exponential growth of digital storage and computational power. Using computers, we can manage our data, identify and analyze it, and extract useful correlations. The cost of storage and computer memory is plotted below:

New storage technologies, such as DVD, should continue to push the cost of storage down, with storage costs falling below a penny per megabyte by the turn of the century. Similarly, the cost of computation, or computational power, has dramatically declined. The computing ability of a commodity desktop computer today approaches or exceeds that of a midrange ($50K-100K) workstation of six or seven years ago. Likewise, the graphical display capabilities of today's Macintosh and Windows computers exceed those of a high-end Evans and Sutherland graphics workstation of a decade ago. All of this computational and imaging horsepower sits on your desk for less than $3,000!

CPU advances, both in manufacturing practices and architecture, are driving much of the increase in performance seen in this graph. According to "Moore's Law," the complexity and performance of integrated circuits will progress at an exponential rate, roughly doubling every 18 months. This observation was made in 1965 by Gordon Moore while at Fairchild Semiconductor, based on his years of watching the semiconductor industry. In 1968, Moore, Bob Noyce and Andy Grove left Fairchild Semiconductor (where Noyce had invented the integrated circuit in 1959) and formed Intel. Grove is currently the CEO of Intel, and 30 years later Moore's predictions have been borne out by 25 years of growth in the microprocessor industry. As you can see from the following graph, processing power continues to increase, and the cost of computing continues to decline.
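The 18-month doubling rule quoted above is easy to turn into arithmetic. The short Python sketch below computes the growth factor implied by a fixed doubling period. The function name `growth_factor` is hypothetical, and the calculation assumes the doubling period stays constant over the whole interval, which is a simplification.

```python
def growth_factor(years, doubling_months=18):
    """Growth multiple after `years`, assuming one doubling every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


print(growth_factor(3))    # two doublings in three years
print(growth_factor(30))   # twenty doublings in thirty years
```

Three years gives a factor of 4, and thirty years gives 2 to the 20th power, or roughly a million-fold increase. The same function applies to any quantity with a fixed doubling period.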

 

Links to a few CPU manufacturers

 

Business of computing

  dot Lifecycle

  dot Technology driven

  dot Performance driven

  dot Computer Time

  dot Internet Time

 

This workshop is meant to focus on biotechnology, information, and computing, but I will briefly deviate from the background material on computation to discuss the obsolescence of a CPU in the marketplace. As in every market, there is research and development, manufacturing, product release, and then a decrease in the cost of the product as the research costs are amortized. During this time, the next generation of product is under development, and once the previous generation has paid for itself or reached the end of its competitiveness, the next generation is released. Take blood substitutes, for instance. What generation are the current blood substitute products on the market?

The computer industry is likewise very competitive, and in addition, advances in manufacturing and CPU architecture give succeeding generations significant performance enhancements for a similar price.

How much life do you estimate is left in the Pentium? How long do you expect the Pentium Pro to stay on the market?

The IT view of Information

  dot Data Definitions

  dot Data Ownership

  dot Data Modeling

 

 

 

Information and Biotechnology: Bioinformatics?

  dot Genomic Data

  dot Structure Data

  dot Function Data

  dot Patient Data (historical)

  dot Patient Data (clinical)

  dot Diagnostic Data

  dot Outcome Data

 

We can use these data to analyze and model:

 

  dot Gene Expression and Regulation

  dot Structure/Function Relationships

  dot Disease Prognosis

  dot Disease Progression

 

 

Biological Research and Computation

  dot Structure/Function

  dot Imaging Technologies

  dot Pattern Recognition

  dot Neural Computation

 

Timely, accurate access to information and analysis of it have become a high priority for many segments of industry. Marketing decisions, work flow design, management of data flow, quality assurance, and performance analysis have come to rely on access to huge volumes of current and historical data. Information management plays a critical role in medical diagnostics, monitoring of clinical outcomes, identifying relevant publications, obtaining and securing intellectual property rights, and analyzing and evaluating research data. For biotechnologists, the ability to access, compare and identify novel DNA and protein sequences is an integral part of the research process. Likewise, the ability to analyze the stereochemistry, structural domains and structural homologies in biological materials plays a critical role in basic research and in drug design and evaluation.

 

The primary reason people amass and analyze huge volumes of data, IMHO, is the assumption that this data can be identified and analyzed in a meaningful way. It is our ability to access stored data in a coherent fashion that gives the data value. In the most general sense, this value is based on the kinds of questions you can ask of the data set, the cost of acquiring and assimilating the data, and the accuracy and completeness of the data. In biology, it is much easier than in some other fields to understand the value of data and, for at least some kinds of data, to estimate it.

 

For instance, take the sequencing of an organism's genome. If you have the complete genome map and sequence for a particular organism, you should be able to identify the location and verify the sequence (and lack of mutations) for any particular strain (or, for humans, individual). Your ability to do this depends on the accuracy of the data. When you want to identify a particular sequence, you need to be able to query or access that sequence before you can put it to use. Finally, you need to be able to compare or correlate new sequences or sets of sequences with the existing sequences.
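The "locate and verify" step can be sketched in a few lines of Python. This is a naive sliding-window comparison that counts mismatches at each offset. It is not how production tools such as BLAST work, and it cannot handle insertions or deletions. The function name `best_ungapped_match` and both sequences are hypothetical, made up for illustration.

```python
def best_ungapped_match(reference, query):
    """Slide `query` along `reference` with no gaps allowed.

    Returns (offset, mismatch_positions) for the first offset with the fewest
    mismatches, or None if the query is longer than the reference.
    """
    best = None
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        mismatches = [i for i, (r, q) in enumerate(zip(window, query)) if r != q]
        if best is None or len(mismatches) < len(best[1]):
            best = (offset, mismatches)
    return best


reference = "ATGGCCAAGCTGTTCGACGGTACC"  # made-up "known genome" fragment
strain_read = "AAGCTCTTCG"              # made-up read from a new strain

offset, mismatches = best_ungapped_match(reference, strain_read)
print(offset, mismatches)
```

Here the read is placed at offset 6 of the reference with a single mismatch at position 5 of the read. That is the kind of result that would flag either a real mutation in the strain or an error in one of the two sequences, which is why the accuracy of the stored data matters.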


 

Data mining, meta analysis, meta data, and meta content

MCF meta language reprint

 

  dot Communication

  dot Organization

 

Research, Computing, and your Career

  dot Analyze data

  dot Harness the flow of information around you

  dot Maintain contacts with your peers

  dot Continue to learn new skills

  dot Your ability to manage information is critical to the success of your career!

  dot Bio Online: Biotechnology: Career Center
  dot CareerMosaic
  dot FSG Biotech Online Career Server
  dot LUXNet Career Services
  dot CareerPath
  dot E.span
  dot Hoover's Online: Find A Job
  dot Hot Jobs™
  dot THE MONSTER BOARD
  dot Online Career Center
  dot SelectJOBS
