Evaluating Clustering algorithms using xgenes

References

Overview

  • A gene is a stretch of DNA that encodes information.
  • A cell expresses a gene to produce a "gene product" (e.g. a protein).
  • The production of the RNA copy of the DNA is called transcription.
  • For some RNA (non-coding RNA) the mature RNA is the final gene product.[8]
  • In the case of messenger RNA (mRNA), the RNA codes for the synthesis of one or more proteins.
  • Not all proteins remain within the cell and many are exported, for example, digestive enzymes, hormones and extracellular matrix proteins.
  • Regulation of gene expression refers to the control of the amount and timing of appearance of the functional product of a gene.
  • An example where regulation of gene expression matters: insulin expression is controlled so that insulin acts as a signal for blood glucose regulation.
  • In genetics, gene expression is the most fundamental level at which the genotype gives rise to the phenotype, i.e. observable trait. The genetic code stored in DNA is "interpreted" by gene expression, and the properties of the expression give rise to the organism's phenotype.
  • When someone says "a person has the gene for red hair", it means that the person's DNA contains a gene that produces the relevant proteins.
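
The transcription and protein-synthesis steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of real cellular machinery: it assumes the input is the DNA coding strand, uses the standard genetic code, and ignores splicing and regulation. The sequence `dna` and the function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon ('*') is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGTAA"  # hypothetical toy sequence: Met, Ala, Trp, stop
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

The protein is written in one-letter amino acid codes; translation stops at the first stop codon.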

DNA Micro-Array

  • A DNA microarray (also known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface.
  • Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
  • An Affymetrix chip is one example of a DNA microarray.
  • This technology can be used for gene profiling.
  • The expression levels of thousands of genes are monitored simultaneously to identify genes whose expression changes in response to pathogens or other organisms, by comparing gene expression in infected cells or tissues with that in uninfected ones.
  • See https://altanalyze.readthedocs.io/en/latest/Tutorial_GeneExpressionAnalysis/ AltAnalyze can directly process Affymetrix CEL files.
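
The infected-versus-uninfected comparison described above is commonly summarized as a log2 fold change per gene. The sketch below is only an illustration under stated assumptions: the gene names and intensities are made up, the data are taken to be already normalized, and the threshold of 1.0 (a two-fold change) is an arbitrary choice rather than a recommendation from these notes.

```python
import math

# Hypothetical normalized expression intensities (arbitrary units).
uninfected = {"geneA": 120.0, "geneB": 800.0, "geneC": 50.0}
infected = {"geneA": 480.0, "geneB": 790.0, "geneC": 12.5}


def log2_fold_changes(treated, control):
    """log2(treated / control) for every gene present in the control set."""
    return {gene: math.log2(treated[gene] / control[gene]) for gene in control}


def changed_genes(fold_changes, threshold=1.0):
    """Genes whose absolute log2 fold change reaches the threshold."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


fold_changes = log2_fold_changes(infected, uninfected)
changed = changed_genes(fold_changes)
print(fold_changes)
print(changed)
```

A positive value means higher expression in the infected sample and a negative value means lower expression. A real analysis would also use replicates and a statistical test instead of a bare threshold.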

High Potential Resources

  • optBiomarker: Estimation of optimal number of biomarkers for two-group microarray based classifications at a given error tolerance level for various classification rules. See https://cran.r-project.org/web/packages/optBiomarker/index.html
  • library(optBiomarker) ; help(optBiomarker)
  • library(entropy): entropy() and mutualinfo() are supported. You will have to normalize the result of mutualinfo() yourself.
  • library(infotheo)
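
As a rough illustration of why mutual information needs normalizing before it can be used to compare clusterings, here is a from-scratch Python sketch. It does not use the R packages listed above, and the function names are hypothetical. Normalizing by the geometric mean of the two entropies is an assumption; other conventions (arithmetic mean, minimum, maximum) also exist.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mutual_information(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)), so it lies in [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one assignment has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


reference = [0, 0, 0, 1, 1, 1]    # hypothetical known grouping
relabeled = [1, 1, 1, 0, 0, 0]    # same partition, different label names
unrelated = [0, 1, 0, 1, 0, 1]    # partition that ignores the grouping

print(normalized_mutual_information(reference, relabeled))
print(normalized_mutual_information(reference, unrelated))
```

Raw mutual information grows with the number of clusters, which is why a normalized score is easier to compare across clusterings. The score does not depend on how the clusters are named.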

GO - Gene Ontology and QuickGO Browser

  • The Gene Ontology (GO) has proven to be a valuable resource for functional annotation of gene products.
  • At well over 27 000 terms, the descriptiveness of GO has increased rapidly in line with the biological data it represents.
  • QuickGO is a web-based tool for browsing the GO info provided by the GOA group.
  • Database URL: http://www.ebi.ac.uk/QuickGO

Genetic Algorithm

* Chromosome == Set of Genes

START
Generate the initial population
Compute fitness           
REPEAT
    Selection   // Select fittest individuals
    Crossover   // Mix genes from two parents up to a threshold, e.g. 3 means mixing stops after 3 genes.
                //             The remaining unmixed genes may all come from a single parent (male or female).
    Mutation    // Happens with low probability
    Compute fitness
UNTIL population has converged   // New population not much different from old.
STOP
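
The loop above can be sketched in Python as follows. This is one possible reading of the pseudocode, not a definitive implementation, and several choices are assumptions that are not in the notes: the OneMax fitness function (count the 1-genes), tournament selection, single-point crossover, elitism, and treating "converged" as "best fitness has not improved for `patience` generations". All names and parameter values are hypothetical.

```python
import random


def fitness(chromosome):
    """OneMax: count the 1-genes (a stand-in objective for illustration)."""
    return sum(chromosome)


def select(population, rng, tournament_size=3):
    """Tournament selection: the fittest of a few randomly drawn individuals."""
    return max(rng.sample(population, tournament_size), key=fitness)


def crossover(mother, father, rng):
    """Single-point crossover: genes before the point come from the mother."""
    point = rng.randint(1, len(mother) - 1)
    return mother[:point] + father[point:]


def mutate(chromosome, rng, rate=0.01):
    """Flip each gene independently with a low probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def run_ga(n_genes=20, pop_size=30, max_generations=200, patience=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    stale = 0
    for _ in range(max_generations):
        children = [best]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
        new_best = max(population, key=fitness)
        stale = stale + 1 if fitness(new_best) <= fitness(best) else 0
        best = new_best
        history.append(fitness(best))
        if stale >= patience:  # proxy for "population has converged"
            break
    return best, history


best, history = run_ga()
print(fitness(best), len(history) - 1)
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next.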

Intro to bioinformatics

* Before the emergence of bioinformatics, there were only two ways to conduct biological experiments:
  • Within a living organism (in vivo, Latin for "within the living")
  • In an artificial environment (in vitro, Latin for "in glass")
  • Bioinformatics adds a third: in silico (pseudo-Latin for "in silicon"), meaning performed on a computer

  • First of all you will have to learn a bit about biology; genetics and genomics to be specific.

  • This will include studying about genes, DNA, RNA, protein structures, etc.

  • Then you will have to study about biological sequences (for example, sequences found in DNA, RNA and proteins) and techniques to discover and analyze various patterns in them.

  • In drug development, computational tools can help us understand a disease and identify its cause, so that suitable drugs treat the cause rather than merely the symptoms.

  • Algorithms, genomics, proteomics

  • Current research in bioinformatics can be classified into:

    • (i) Genomics: the study of an organism's genome.
    • (ii) Proteomics: the study of the proteome, i.e. the entire set of expressed proteins in a cell.
    • (iii) Computer-Aided Drug Design (CADD): computational methods to simulate drug-receptor interactions. CADD methods are heavily dependent on bioinformatics tools, applications and databases. (See: http://www.mpi-inf.mpg.de/departments/d3/areas/docking.html )

    • (iv) Biological database: collected from experiments, literature, and computational analyses. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.

      (See: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm )

    • (v) Biological Data Mining: Biological Data mining is the discovery of useful knowledge from biological databases. Some of the most popular tasks are classification, clustering, association and sequence analysis, and regression. (See: http://cs.salemstate.edu/hatfield/teaching/courses/DataMining/M.htm )

    • (vi) Microarray informatics: Microarray Technology is a powerful tool to monitor gene expression or gene expression changes of hundreds or thousands of genes in a single experiment.

    • (vii) Molecular Phylogenetics: the study of relationships between different organisms, e.g. plants, fungi, etc.

Microarray informatics

Overview

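  • A sample refers to a tissue specimen.
  • FDR = FD/(FD+TD), where FD is the number of false discoveries and TD is the number of true discoveries; strictly, the FDR is the expected value of this ratio. Allowing a higher FDR means a higher tolerance for error.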
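
As a worked illustration of the FDR ratio, the sketch below computes the false discovery proportion from counts. It also shows the Benjamini-Hochberg step-up procedure, a standard way to control the FDR across many tests. These notes do not name that procedure, so treat its use here as an assumption; the p-values and function names are hypothetical.

```python
def false_discovery_proportion(false_discoveries, true_discoveries):
    """FD / (FD + TD); defined as 0.0 when nothing was discovered."""
    total = false_discoveries + true_discoveries
    return false_discoveries / total if total else 0.0


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# 5 false discoveries among 100 reported genes.
proportion = false_discovery_proportion(5, 95)

# Hypothetical per-gene p-values from a differential expression test.
p_values = [0.041, 0.001, 0.216, 0.008, 0.039]
rejected = benjamini_hochberg(p_values, fdr=0.05)
print(proportion, rejected)
```

Raising the `fdr` argument admits more genes at the cost of more expected false discoveries, which is the tolerance trade-off the formula expresses.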

Services

Microarray-centered informatics is currently applied primarily to two high-capacity profiling areas:

  • genome wide microarray gene expression profiling
  • array-based Q-PCR microRNA expression profiling

We provide assistance with the following modules of typical experimental workflows:

  • Study and Experiment Design (Disease specific ??) – in-depth discussion of the biological context and key questions asked; experiment type selection (pair-wise comparison, time series, multiparametric studies); best cost-effective strategies for replicate studies; optimization of experimental conduct to identify and minimize sources of experimental noise; sample preparation strategy and array platform selection.
  • Data Preprocessing and Management – upload of annotated raw data into institutional repository and into client analysis environments; data normalization and transformation (probe level summarization principles (Affymetrix), condition-centered normalization strategies, experiment interpretation options); data filtering (intensities, QC metrics such as Ct values and Q-PCR flags (microRNA)); data reduction strategies.
  • Differential Expression – statistical significance (t-test statistics, one-way and two-factor ANOVA, multiple testing corrections, Significance Analysis of Microarrays, non-parametric tests, Bayesian estimation of temporal regulation).
  • Pattern Matching – ad hoc and post hoc template matching procedures to establish non-overlapping patterns of gene expression.
  • Clustering of reduced-size data – unsupervised (hierarchical), semi-supervised (K-means clustering, Self-Organizing Maps), resampling for support.
  • Classification – strategies to classify biological samples and predict outcomes based on gene expression profiles (Support Vector Machines, Discriminant Analysis Classifiers, K-nearest neighbor).
  • MicroRNA Target mRNA Prediction – cross-validation of target prediction algorithms; correlation of miRNA profiles with mRNA and protein expression patterns derived from gene expression and proteomics data; context-score ranking of target mRNAs; determination of patterns of post-transcriptional control (microRNA:mRNA regulatory networks).
  • Biological Significance – in-depth data and literature mining; functional annotation of gene groups and determination of context-specific biological meaning of expression profiling results using integrated biological knowledgebases (e.g. NIH DAVID – Database for Annotation, Visualization and Integrated Discovery, GSEA – Gene Set Enrichment Analysis, etc.).
  • Integration of Multiple Species Data – batch translation of standard IDs into orthologous lists; matching of expression patterns from patient and model organism samples.
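
Of the modules listed above, K-means clustering is the easiest to sketch without external libraries. The example below is a minimal illustration, not the service's actual pipeline. It assumes Euclidean distance and Lloyd's algorithm. Farthest-first initialization is used so the result is deterministic, and that choice is an assumption of this sketch. The expression profiles and function names are hypothetical.

```python
import math


def farthest_first_centroids(points, k):
    """Start from the first point, then repeatedly add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = farthest_first_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical expression profiles: one tuple per gene, one value per sample.
profiles = [
    (1.0, 1.1, 0.9), (1.2, 0.9, 1.0), (0.8, 1.0, 1.1),
    (5.0, 5.2, 4.9), (5.1, 4.8, 5.0), (4.9, 5.0, 5.1),
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

The number of clusters `k` must be chosen in advance. Comparing the resulting labels against a known grouping is one way to evaluate the clustering.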