Basic Sciences

Biotechnology, Information, and the Internet
Homework Assignment

Presentation Outline
    dot Summary
    dot What is Information
    dot What makes Information Valuable
    dot Using Computers to Manage Information - Why Bother?
    dot The volume of information available
    dot The IT view of Information
    dot Information and Biotechnology
    dot Biological Research and Computation
    dot Business of computing
    dot Data mining, meta analysis, meta data, and meta content
    dot Communication
    dot Biotechnology, Computing, and your Career
    dot Analyze data
    dot Internet Resources for Biotechnology
    dot Funding
    dot Research Tools


Summary

This document is the outline of a short workshop given at Northwestern University on 2/24/98. The workshop discussed using the Internet to reach the research, business, and job-related sites currently available, and covered emerging technologies and services. All links used in the workshop can be found in this document. Some of the discussions are included as well.

What is Information?

  dot NU's Oxford English Dictionary On-line
  dot OED definition for information

What makes Data/Information Valuable?

In general, data in a database is valuable because:

  dot It is complete

  dot It is accurate

  dot You have context information for the data

  dot You can access the data in a meaningful way (make queries)

  dot You can correlate data in your database with other data/information
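The points in this list can be made concrete with a small database. What follows is only a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and every value in it are made up for illustration and do not come from any real sequence database. It stores each sequence alongside its context (which subclone it came from, how it was sequenced) and then answers one query that correlates the two tables.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical sequence and context tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE subclone (
            id TEXT PRIMARY KEY,
            chromosome TEXT,
            map_position TEXT
        );
        CREATE TABLE sequence (
            id INTEGER PRIMARY KEY,
            subclone_id TEXT REFERENCES subclone(id),
            method TEXT,
            bases TEXT
        );
    """)
    # Made-up context data: where each subclone maps.
    conn.executemany(
        "INSERT INTO subclone VALUES (?, ?, ?)",
        [("sc1", "1", "region A"), ("sc2", "2", "region B")],
    )
    # Made-up sequence data, each row tied to its subclone.
    conn.executemany(
        "INSERT INTO sequence (subclone_id, method, bases) VALUES (?, ?, ?)",
        [("sc1", "dideoxy", "TTGATTACAGG"), ("sc2", "dideoxy", "CCCGGGAAATT")],
    )
    return conn

def find_with_context(conn, fragment):
    """Return each sequence containing `fragment`, joined to its subclone context."""
    return conn.execute(
        """
        SELECT s.bases, c.id, c.chromosome, c.map_position, s.method
        FROM sequence AS s
        JOIN subclone AS c ON c.id = s.subclone_id
        WHERE s.bases LIKE ?
        ORDER BY s.id
        """,
        (f"%{fragment}%",),
    ).fetchall()

conn = build_demo_db()
matches = find_with_context(conn, "GATTACA")
print(matches)
```

Without the subclone table the same query would still find the sequence, but it could not say where the sequence sits or how it was obtained.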

In biotechnology, these concepts are readily understood when we look at the sequencing of the human genome, or some other sequencing project.

In the Human Genome Project, the value of complete data is quite easily seen. For instance, the more of the human genome that has been mapped and sequenced, the higher the likelihood that a sequence that you identify will match a sequence in the databank, allowing you to identify the locus, genes that may be in the same control region, and if other groups have identified the same gene based on similar or disparate research. The context for the sequence is its location in the subclone, how it was sequenced, what subclone it came from, where the subclone maps in the genome. The more genomic data and context data you have, the more accurate and meaningful your queries of the database will be. For instance, if you want to identify all the genes that contain a canonical GTP binding region, your results will depend on how much of the genome is known. This type of query allows you to do one type of correlation of the data - how many stretches of primary sequence homologies are in the database. Other types of correlations may be more abstract - for instance, what does a promoter "look like" to polymerase? Biochemically, we understand many types of promoters, but we don't understand the exceptions. Pattern searching and data mining techniques may help us recognize differences between active and inactive promoters, differences between cognates and incomplete duplications.
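The GTP binding query described above can be sketched as a simple pattern search. This sketch assumes the "canonical GTP binding region" means the Walker A (P-loop) consensus G-x(4)-G-K-[S/T], which is only one of several possible definitions. The sequences and names in the toy database are illustrative, not records from GenBank or any protein databank.

```python
import re

# Assumed pattern: Walker A / P-loop consensus, G-x(4)-G-K-[ST].
P_LOOP = re.compile(r"G.{4}GK[ST]")

def find_motif(database, pattern=P_LOOP):
    """Return {name: [(start, matched_text), ...]} for every entry with a hit."""
    hits = {}
    for name, seq in database.items():
        found = [(m.start(), m.group()) for m in pattern.finditer(seq)]
        if found:
            hits[name] = found
    return hits

# Toy protein database (hypothetical entries, for illustration only).
toy_db = {
    "protein_a": "MTEYKLVVVGAGGVGKSALTIQ",
    "protein_b": "MKRLLPWNDEQFHY",
}

# The same query against a less complete database.
partial_db = {"protein_b": toy_db["protein_b"]}

full_hits = find_motif(toy_db)
partial_hits = find_motif(partial_db)
print(full_hits)
print(partial_hits)
```

The second call finds nothing, not because the motif does not exist, but because the entry that carries it has not been added yet. The answer to the query is only as complete as the database behind it.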

The accuracy of the database will affect your queries - that is, did either the search sequence or the target sequence contain errors? For sequence information, errors can arise from technical problems, mutations in the particular clone isolated, the presence of transposons or retroviruses, or entry mistakes. They may appear as frame shifts, point mutations, insertions, or deletions, or they may stem from human error, such as transpositions, copying errors, and misinterpretation of gels or other lab data. Verifying sequence data is a necessary and time-consuming step in assembling a meaningful sequence database. Keeping these concepts in mind when you are using GenBank and the protein databases will help you understand both the power and the limitations of these resources.
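One common way to verify sequence data is to read the same region more than once and compare the reads. The sketch below is a deliberately simplified illustration of that idea. It assumes the reads are already aligned and of equal length, so it can flag substitution-type disagreements only. Insertions, deletions, and frame shifts would need a real alignment step first. The reads are made up.

```python
from collections import Counter

def consensus(reads):
    """Majority base at each position across equal-length, pre-aligned reads."""
    if len({len(r) for r in reads}) != 1:
        raise ValueError("reads must be non-empty and aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

def disagreements(reads):
    """0-based positions where at least one read differs from the consensus."""
    cons = consensus(reads)
    return [
        i for i, column in enumerate(zip(*reads))
        if any(base != cons[i] for base in column)
    ]

# Three hypothetical reads of the same region; the third has one miscalled base.
reads = ["ACGTAC", "ACGTAC", "ACCTAC"]
cons = consensus(reads)
flagged = disagreements(reads)
print(cons, flagged)
```

Positions that are flagged are the ones worth going back to the gel or the trace for. An odd number of reads avoids ties in the majority vote.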

Using Computers to Manage Information - Why Bother?

  dot Volume of Data

  dot Complexity of Data

  dot Reproducible Access (standard queries)

  dot Ability to "mine" data associations

  dot Storage and Archival

The sheer volume of information available

How much information is available? According to a recent report by Ernst and Young, the volume of published material doubled from 1880 to 1930, doubled again from 1930 to 1960, doubled from 1960 to 1970, and is currently doubling every 18 months. Keeping track of the mass of publication, let alone accessing and analyzing information in this sea of data, is difficult at best. To help manage this overwhelming mass of data, computing solutions using search engines and "intelligent agents" have emerged to let us skim and peruse huge databases, looking for the few kernels of relevant information in warehouses full of data fodder.
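To get a feel for how sharply the pace has changed, the doubling times quoted above can be converted into equivalent yearly growth rates. This is a back-of-the-envelope sketch that assumes smooth exponential growth within each period, which the report itself may not claim. The function name is my own.

```python
def annual_growth_rate(doubling_years):
    """Fractional growth per year implied by a fixed doubling time."""
    return 2 ** (1 / doubling_years) - 1

# Doubling times (in years) taken from the periods quoted in the text.
periods = {
    "1880-1930": 50,
    "1930-1960": 30,
    "1960-1970": 10,
    "current (18 months)": 1.5,
}

rates = {label: annual_growth_rate(years) for label, years in periods.items()}
for label, rate in rates.items():
    print(f"{label}: about {rate:.1%} per year")
```

Under that assumption the four periods work out to roughly 1.4%, 2.3%, 7.2%, and 59% growth per year.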

One of the benefits of life after World War II in the United States is that we have experienced 50 years of uninterrupted prosperity and growth. As a part of that growth, we have much better access to information, and as the cost of publication plummets, our information output skyrockets. For instance, the amount of information available in printed and electronic form is shown in the graph below.

What about biotechnology and biological information? The process of information identification I have outlined is very general - let us fill it in with some details relevant to biology. For instance, in the past 40 years we have gone from 4 journals dedicated to biological research and findings to more than 400. The volume of data that the FDA receives today for a standard NDA is truly staggering, and both the volume of the typical submission and the number of submissions have dramatically increased in the past 20 years. In addition to publications, modern biologists have developed methods for sequencing DNA and protein at ever higher rates, and the total length of known DNA and protein sequences has skyrocketed in the past 30 years from tens of hundreds to tens of billions.

Fortunately, along with these changes that are leading to a dramatic increase in the amount of raw data that we generate have come advances in computing technologies to help us come to grips with the increase in data. Concomitant with the exponential growth of information, we are experiencing an exponential growth of digital storage and computational power. Using computers, we can manage our data, extract useful correlations, as well as identify data and analyze it. The cost of storage and computer memory is plotted below:

New storage technologies, such as DVD, should continue to push the cost of storage down, with storage costs less than a penny per megabyte by the turn of the century. Similarly, the cost of computation, or computational power, has dramatically declined. The computing ability of a commodity desktop computer today approaches or exceeds that of a midrange ($50K-100K) workstation of six or seven years ago. Likewise, the graphical display capabilities of today's Macintosh and Windows computers exceed those of a high-end Evans and Sutherland graphics workstation of a decade ago. All of this computational and imaging horsepower on your desk for less than $3,000!

CPU advances, both in manufacturing practices and architecture, are driving much of the increase in performance seen in this graph. According to "Moore's Law", the complexity and performance of integrated circuits will progress at an exponential rate, roughly doubling every 18 months. This observation was made in 1965 by Gordon Moore while at Fairchild Semiconductor, based on 10 years of watching the semiconductor industry. In 1968, Moore, Bob Noyce and Andy Grove left Fairchild Semiconductor (where Noyce had invented the integrated circuit in 1959) and formed Intel. Grove is currently the CEO of Intel, and 30 years later Moore's predictions have been borne out by 25 years of growth in the microprocessor industry. As you can see from the following graph, processing power continues to increase, and the cost of computing continues to decline.
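The arithmetic behind an 18-month doubling time is worth working through once. The sketch below assumes the doubling holds perfectly steady, which is an idealization of Moore's observation and not a measurement. The function name and default are my own.

```python
def growth_factor(years, doubling_months=18):
    """Total multiplication after `years` of steady doubling every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

thirty_year_factor = growth_factor(30)
print(f"30 years at an 18-month doubling time: {thirty_year_factor:,.0f}x")
```

Thirty years is 20 doublings, so steady growth at this rate multiplies the starting point by 2 to the 20th power, or a little over a million.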

Links to a few CPU manufacturers

Business of computing

  dot Lifecycle

  dot Technology driven

  dot Performance driven

  dot Computer Time

  dot Internet Time

This workshop is meant to focus on biotechnology, information, and computing, but I will briefly deviate from the background material on computation and discuss the obsolescence of a CPU in the marketplace. As in every market, there is research and development, manufacturing, product release, and then a decrease in the cost of the product as the research costs are amortized. During this time, the next generation of product is under development, and once the previous generation product has paid for itself or reached an end of its competitiveness, the next generation of product is released. Take blood substitutes, for instance. What generation are the current blood substitute products on the market?

The computer industry is likewise very competitive, and in addition, the advances in manufacturing and CPU architecture give succeeding generations significant performance enhancements for a similar price.

How much life do you estimate is left in the Pentium? How long do you expect the Pentium Pro to stay on the market?

The IT view of Information

  dot Data Definitions

  dot Data Ownership

  dot Data Modeling

 

 

Information and Biotechnology: Bioinformatics?

  dot Genomic Data

  dot Structure Data

  dot Function Data

  dot Phenotypic (clinical) Data

  dot Diagnostic Data

  dot Outcome Data

We can use these data to analyze and model:

  dot Gene Expression and Regulation

  dot Structure/Function Relationships

  dot Disease Prognosis

  dot Disease Progression

Biological Research and Computation

  dot Structure/Function

  dot Imaging Technologies

  dot Pattern Recognition

  dot Neural Computation

Timely and accurate access and analysis of information has become a high priority for many segments of industry. Marketing decisions, work flow design, managing of data flow, quality assurance, and performance analysis functions have come to rely on access to huge volumes of current and historical data. Information management plays a critical role in medical diagnostics, monitoring of clinical outcomes, identifying relevant publications, obtaining and securing intellectual property rights, and analyzing and evaluating research data. For biotechnologists, the ability to access, compare and identify novel DNA and protein sequences is an integral part of the research process. Likewise, the ability to analyze the stereochemistry, structural domains and structural homologies in biological materials plays a critical role in basic research and in drug design and evaluation.

The primary reason, IMHO, people amass and analyze huge volumes of data is the assumption that this data can be identified and analyzed in a meaningful way. It is our ability to access stored data in a coherent fashion that gives the data value. In the most general sense, this value is based on the kinds of questions you can ask of the data set, the cost of acquiring and assimilating the data, and the accuracy and completeness of the data. In biology, it is much easier to understand (and estimate, for at least some kinds of data) the value of data than in some other fields.

For instance, take the sequencing of an organism's genome. If you have the complete genome map and sequence for a particular organism, you should be able to identify the location and verify the sequence (and lack of mutations) for any particular strain (or, for humans, individual). Your ability to do this depends on the accuracy of the data. To identify a particular sequence, you need to be able to query the database and retrieve that sequence so that you can use it elsewhere. Finally, you need to be able to compare or correlate new sequences or sets of sequences with the existing sequences.
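Locating a sequence in a known genome can be sketched in a few lines. This is a minimal illustration that assumes exact matching against a single reference string and checks both strands. Real search tools tolerate mismatches and gaps, which this sketch does not. The reference and query sequences are made up.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate(reference, query):
    """Return (strand, 0-based start) of the first exact match, or None."""
    pos = reference.find(query)
    if pos != -1:
        return ("+", pos)
    pos = reference.find(reverse_complement(query))
    if pos != -1:
        return ("-", pos)
    return None

# Hypothetical reference and queries, for illustration only.
reference = "TTGACCGTAAGC"
forward_hit = locate(reference, "CCGTA")
reverse_hit = locate(reference, "TACGG")
no_hit = locate(reference, "GGGGG")
print(forward_hit, reverse_hit, no_hit)
```

A result of None means only that the query is absent from the reference as stored, which brings the question back to how complete and how accurate the stored sequence is.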


Data mining, meta analysis, meta data, and meta content

MCF meta language reprint

Biotechnology, Computing, and your Career

  dot Bio Online: Biotechnology: Career Center
  dot CareerMosaic
  dot FSG Biotech Online Career Server
  dot LUXNet Career Services

Resources available on the Internet

Electronic resources at NU not available on the Internet

Computers and Research

Communicating via computer

Faculty Research Database

The Office of Research has purchased a one year subscription to the Community of Science (COS) faculty research database, grant information, and other on-line services. You can directly search for Northwestern Faculty by expertise.


Collection of Links

NUMS Basic Science Links

Internet Resources for Biotechnology

General NU Resources

Medline

OED, Encyclopaedia Britannica

Publications

dot Journal of Biological Chemistry
dot SCIENCE Magazine On-Line
dot RNA
dot Cell
dot Journal of Molecular Biology
For other journals
dot WWW Virtual Library: Biosciences: Journals and Conferences
Harvard's compilation of links to journals and conferences.

Sequence databases

Sequences, Searches, and Submissions

dot The National Center for Biotechnology Information (NCBI)
Home of GenBank and Web-based DNA sequence databases
dot Searching GenBank
This is the NCBI/GenBank site for searching DNA sequences and sequence annotations.
dot The Human Genome Database
The Johns Hopkins Human Genome Database Project Home Page.
A collection of DNA and protein databases and structure resources at Johns Hopkins and throughout the world.
dot MIT Whitehead Institute Center for Genome Research
The MIT site has useful information, discussion groups, and interactive web software including a primer selection program.

Structure information and databases

dot The Brookhaven Protein Databank
dot What is the PDB data format?
Finding Structure Information
dot Using the PDB Gopher Site
Viewing structures
dot RasMOL Home Page
dot RasMOL Generated Images

Organism databases

dot WWW Virtual Library: Research on Yeast
A list of sites relevant to yeast researchers. Includes sections on Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Candida.
dot WWW Virtual Library: Subject Catalogue
A catalog of science data. Take a look at the section on Biological Science and Model Organisms!

Techniques, Products and Services

dot Biological Data Transport Home Page
dot Newsgroups: sci.bio.technology
Biotechnology information of general interest

Software Sites and Archives

dot Biotechnology Software Journal Editor's Home Page
dot Genetics Computer Group
GCG: makers of the UNIX and VAX 'GCG Wisconsin Package' for DNA and protein analysis.
dot MIT Genome Center FTP Archive-Software
Some select tools for molecular biology and biological data management.
dot Search for Biological Software
This site has pointers to many searchable archives.
dot Archives for Graphics Software and Databases Around the World
dot Archives for Biological Software and Databases Around the World

Medicine

dot Virtual Hospital Home Page
A fascinating site run by the University of Iowa

General Resources

dot WWW Virtual Library: Biosciences
Harvard's compilation of links to an assortment of biologically relevant sites.
dot NIH Scientific Resources
Information on scientific projects, capabilities and opportunities at the NIH.
dot BIO (Biotechnology Industry Organization)
A web service run by the industry consortium of biotech and pharmaceutical companies

Funding

Searching & Info servers

NU: Searching the Internet
Alta Vista: Main Page
Infoseek Home Page
Internet Search
Lycos Search Form
WebCrawler Searching
Yahoo

Government and Business Resources

Biotechnology Law Web Server
Collection of biotech law information.
Intellectual Property Resources
Some resources for finding patents and other intellectual property information.
United States Patent and Trademark Office
Yahoo - Business and Economy:Companies:Law:Intellectual Property Search Form
THINCK Associates' front office
THE SEAMLESS WEBsite - law and legal resources - law legal lawyer law firm expert
Biomega Corporation
Phase 3 pharmaceutical drug profiles and intellectual property services.
Government Resources (Bio Online)
U.S. Department Of Commerce Information Services via World Wide Web

Investment Sites

MIT historical Stock service
The SEC EDGAR corporate information
Bank of America
NCS Investor Page: Annual Reports, 10Qs 10Ks, Stocks, Bonds, Futures, Mutual Funds
E*TRADE
Register with Lombard
InterQuote Service Packages
Security APL WhatsNew
Security APL Quote Server
Investors Edge Stock Server
Stock quotes and Market Information
Quote.Com -- Other Links for Investors
DBC Quote Server
ClubNet: News Directory
Charles Schwab



Disclaimer:
This is not an official Northwestern University site. No guarantee, warranty or claim is made or should be implied by the content or absence of content.

Any services, facilities, or other sites listed in these pages are strictly listed as a convenience to the reader. No endorsement of the content of listed sites or identity of any published organization is implied or intended.

last modified 02/24/98

Please contact Warren Kibbe if you have comments, additions or resources you would like to see on this page.