Notes On Bio Technology

Terminology

DNA

Deoxyribonucleic acid. The molecule that carries the genetic sequence; chromosomes are made of DNA.

Phenotype

"Phenotype" simply refers to an observable trait. "Pheno" simply means "observe" and comes from the same root as the word "phenomenon".

Genotype

A person's genotype is their unique sequence of DNA (or genes). More specifically, this term is used to refer to the two alleles a person has inherited for a particular gene.

The phenotype is the detectable expression of this genotype, e.g. black hair or a purple flower.

Homologous Chromosomes

In biology, homologous chromosomes are paired chromosomes. They essentially have the same: gene sequence, loci (gene position), centromere location, and chromosomal length.

Mitosis

Normal cell division in which a diploid somatic (body, hair, etc.) cell duplicates itself into two diploid cells. Both parent and daughter cells contain 2*23 chromosomes.

Meiosis

Meiosis is a special type of cell division of germ cells (not normal somatic cells!) in sexually-reproducing organisms that produces the gametes, such as sperm or egg cells. A parent cell with 46 chromosomes produces child cells with 23 chromosomes each.

This happens in 2 stages:

  • In the first stage, a cell with 2*23 chromosomes (already copied, so each chromosome consists of two identical chromatids) splits into two cells of 23 chromosomes each.
  • The second stage is like mitosis: each of the two cells divides again, separating the copies, without any further DNA duplication.
  • You start with a single diploid germ cell and end up with 4 haploid cells!

Aromatic

An aromatic molecule or compound is one that has special stability and properties due to a closed loop of electrons. Not all molecules with ring (loop) structures are aromatic.

Structure 3D notes

  • The orbitals are numbered 1s, 2s, 2p(x), 2p(y), 2p(z), 3s, 3p*, etc. Each type of subshell holds at most:

    Orbital  Max Electrons
    s        2
    p        6
    d        10
    f        14
    g        18
    h        22
    i        26
    

See the Aufbau diagram for how the orbitals are filled:

Shell                   Capacity of Shell

1s                      2
2s 2p                   8
3s 3p 3d                18
4s 4p 4d 4f             32
5s 5p 5d 5f 5g          50
6s 6p 6d 6f 6g 6h       72
7s 7p 7d 7f 7g 7h 7i    98
  • The 4 in 4s is the shell number. The higher the shell number, the larger the shell.
  • A bigger shell does not necessarily mean a higher energy level. For example, the electron filling sequence is 1s 2s 2p 3s 3p 4s 3d 4p 5s 4d 5p 6s ..., i.e. 3d has a higher energy level than 4s.
  • An s orbital is spherical. A p orbital is lobe shaped (oriented along the x, y and z axes).
  • The spherical shape of an s orbital does not imply circular motion. The orientation of the elliptical motion can keep changing, so the probability of finding the electron is expressed as a spherical region.
  • Linear molecules have a 180 degree bond angle, e.g. CO2. The central atom is usually sp hybridized, i.e. one s orbital and one p orbital of C mix to give two sp orbitals.
  • BF3 has 120 degree angles, with all atoms in the same plane, i.e. maximum repulsion between the bonds. SO2 is arranged like BF3 but has an angle of approx 119 (< 120) degrees, because it has a lone electron pair, which repels more strongly than a bonding pair. This is sp2 hybridization (one s orbital + two p orbitals yielding three sp2 orbitals).
  • CH4 has 109.5 degree angles in 3D, the maximum separation of 4 surrounding atoms. This structure is called tetrahedral. NH3 is arranged like CH4 because its lone pair of electrons takes the fourth position, but the angle between the H atoms is smaller (107 < 109.5 degrees).
  • H2O is arranged like CH4 since it has 4 non-bonding electrons forming 2 lone pairs, so O has 4 surrounding entities. Since two lone pairs repel even more strongly, the angle between the H atoms is about 105 (< 107) degrees.
  • PCl5 has 5 atoms surrounding the center. Three are placed in one plane, 120 degrees apart. The other two lie on the axis perpendicular to that plane, 180 degrees apart. This is called the trigonal bipyramidal shape.
  • SF4 is like PCl5, but one of the five positions is taken by a lone pair of electrons, which pushes the bonds away harder.
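
The capacities and the filling sequence listed in these notes can be reproduced with a short sketch. It assumes the usual textbook rules: a subshell with angular momentum number l holds 2(2l+1) electrons, and subshells fill in order of increasing n+l, with ties going to the lower n (the rule that the Aufbau diagram encodes). The function names are illustrative only.

```python
SUBSHELL_LETTERS = "spdfghi"  # l = 0, 1, 2, ... in the order used in the notes


def subshell_capacity(l):
    """Maximum electrons in a subshell: 2 electrons per orbital, 2l+1 orbitals."""
    return 2 * (2 * l + 1)


def shell_capacity(n):
    """Maximum electrons in shell n: sum over its subshells l = 0 .. n-1."""
    return sum(subshell_capacity(l) for l in range(n))


def filling_order(max_n):
    """Subshell labels sorted by (n + l), ties broken by lower n."""
    subshells = [(n, l) for n in range(1, max_n + 1) for l in range(n)]
    subshells.sort(key=lambda nl: (nl[0] + nl[1], nl[0]))
    return [f"{n}{SUBSHELL_LETTERS[l]}" for n, l in subshells]
```

Calling `filling_order(7)[:12]` gives the sequence quoted in the notes, with 4s ahead of 3d. Later entries of the list are only reliable while every shell that could contribute is included in `max_n`.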

FAQ

Gene, Genome, Chromosome, DNA ?

  • A cell contains a nucleus, which contains 23 pairs of chromosomes (46 in total); each chromosome contains hundreds to thousands of genes. Together the 46 chromosomes make up the cell's DNA, which measures around 6 feet when stretched out. The DNA is not one continuous double helix: it is a set of discrete units (chromosomes). When the cell is about to divide, each chromosome is copied and coils up into a compact structure.
  • DNA is a string of complex molecules called nucleotides. The DNA that makes up all genomes is composed of four related chemicals called nucleobases:
      - Adenine (A), guanine (G), cytosine (C) and thymine (T). RNA uses uracil (U) instead of T.
      - Each of A, G, C, T and U is called a "nucleotide base" or nucleobase.
      - Adenine is a purine nucleobase: C5H5N5, 135.13 g/mol.
      - Base pairs (A with T and C with G) hold the two DNA strands together.
      - A nucleotide consists of a nucleobase plus a supporting (handrail-like) backbone unit made of sugar and phosphate.
  • In 2000, the Human Genome Project announced the first working draft of the human genome sequence; the project was declared complete in 2003.
  • Every gene is a sequence of A-T and G-C base pairs (anywhere from hundreds to millions of them). A gene has a specific position, e.g. on chromosome 7 at offset ???
  • The DNA of a gene usually looks like a rope loosely wrapped around proteins. When the cell is about to divide, it coils tightly around these proteins.
  • Humans have between 20,000 and 25,000 genes. (On average roughly 1,000 genes per chromosome?) Some chromosomes have hundreds of genes, others have thousands. The Human Genome Project sequenced all the genes in all chromosomes.
  • A genome is all of the genetic material in an organism. It is made of DNA (or RNA in some viruses) and includes genes and other elements that control the activity of those genes.
  • Humans are 99.9% identical on a genetic level. The 0.1% difference is caused by insertions, deletions and substitutions in the DNA sequence. These substitutions are known as Single Nucleotide Polymorphisms (SNPs). They occur about every 1000 base pairs.
  • HIPAA: Health Insurance Portability and Accountability Act

Types of Bio-molecular Entities

  • There are around 25K genes in DNA!
  • More than 100K types of proteins exist in Human!
  • Around 200 broadly recognized cell types (finer subtypes number many more)!

Cell Types

  • Cells fall into several broad categories: epithelial, nerve, muscle and connective tissue cells.
  • These include red blood cells, white blood cells (WBC), platelets, etc. Note that an eosinophil is one kind of white blood cell.
  • Body cells are called somatic cells. Sperm and egg cells are called germ cells.
  • Bone, blood and lymph cells are called "connective tissue" cells.

Tissue

A tissue is a group or layer of cells that have a similar structure and work together as a unit to perform a specific function. A nonliving material, the intercellular matrix, fills the spaces between the cells; the amount of space varies between tissues. In some definitions, human tissue also covers an organ, a part of a human body, or any substance extracted from a human body.

Tissues and secretion

Tissue                  Secretion
-------------------------------------------------
Thyroid                 thyroid hormones
Breast                  milk
Salivary gland          saliva
Tear ducts              tears
Exocrine pancreas       digestive enzymes
Islets of Langerhans    insulin, glucagon
Stomach epithelium      acid, intrinsic factor

Tissues respond to a variety of stimuli. For example, the thyroid responds to thyroid-stimulating hormone (TSH) released by the pituitary gland, which promotes the division of thyroid epithelial cells and the release of thyroid hormones.

Disease Pathways

Discovering disease pathways, which can be defined as sets of proteins associated with a given disease, is an important problem that has the potential to provide clinically actionable insights for disease diagnosis, prognosis, and treatment.

Kinase

  • (KY-nays) A type of enzyme (a protein that speeds up chemical reactions in the body) that adds chemicals called phosphates to other molecules, such as sugars or proteins. This may cause other molecules in the cell to become either active or inactive. Kinases are a part of many cell processes.
  • In biochemistry, a kinase is an enzyme that catalyzes the transfer of phosphate groups from high-energy, phosphate-donating molecules to specific substrates. This process is known as phosphorylation, where the high-energy ATP molecule donates a phosphate group to the substrate molecule.
  • A large number of kinases exist—the human genome contains at least 500 kinase-encoding genes.
  • Inhibitors of kinases can be important treatments for human diseases in which hyperactive processes need to be dampened. For example, one form of human leukemia, CML (chronic myelogenous leukemia), is caused by excess activity of the Abelson tyrosine kinase. Imatinib (Gleevec) is a chemical that binds to the active site of this kinase, thereby blocking the enzyme’s ability to phosphorylate targets. Imatinib has been useful in the initial treatment of CML; however, in many cases the kinase enzyme mutates, rendering the drug ineffective.
  • Included among these enzymes’ targets for phosphate group addition (phosphorylation) are proteins, lipids, and nucleic acids.

Ligands

  • Ligand refers to any compound (usually small molecule) that binds to a receptor (usually protein).
  • In DNA-ligand binding studies, the ligand can be a small molecule, ion, or protein which binds to the DNA double helix.
  • Binding occurs by intermolecular forces, such as ionic bonds, hydrogen bonds and Van der Waals forces.
  • The association or docking is actually reversible through dissociation.
  • Ligand binding to a receptor protein alters the conformation by affecting the three-dimensional shape orientation.
  • Ligands include substrates, inhibitors, activators, signaling lipids, and neurotransmitters.
  • The strength of binding is called affinity.
  • Endogenous ligands are naturally occurring small molecules that bind to a receptor. Endogenous means "occurring within the body"; such ligands are typically produced by the host's own tissues.
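
Reversible binding and affinity can be made concrete with a small sketch. It assumes the simplest textbook model, which these notes do not spell out: one ligand binding one site, with affinity expressed as the dissociation constant Kd = k_off / k_on (a lower Kd means tighter binding) and the bound fraction given by [L] / (Kd + [L]). All names and numbers are illustrative.

```python
def dissociation_constant(k_off, k_on):
    """Kd in mol/L from the off-rate (1/s) and on-rate (1/(mol/L * s))."""
    return k_off / k_on


def fraction_bound(ligand_conc, kd):
    """Fraction of receptor sites occupied at a free ligand concentration.

    Assumes single-site, non-cooperative, reversible binding at equilibrium.
    """
    return ligand_conc / (kd + ligand_conc)


def compare_ligands(ligand_conc, kd_by_name):
    """Occupancy of each ligand at the same concentration, tightest binder first."""
    occupancy = {name: fraction_bound(ligand_conc, kd) for name, kd in kd_by_name.items()}
    return sorted(occupancy.items(), key=lambda item: item[1], reverse=True)
```

When the ligand concentration equals Kd, exactly half of the sites are occupied, which is why Kd is a convenient single-number summary of affinity.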

Cytokines

Cytokines are a broad and loose category of small proteins (~5–25 kDa) important in cell signaling. Cytokines are peptides and cannot cross the lipid bilayer of cells to enter the cytoplasm.

PPI - Protein Protein Interaction Network

NLP Uses

  • Information Extraction, including Named Entity Recognition

  • Automatic QA system.

  • Text Summarization

  • Text generation: understand a database and write human readable text analysis.

  • Machine Translation

  • Sentiment Analysis

  • Morphology (word stems), syntax, semantics, pragmatics (context) and discourse (paragraphs).

  • Typical sentences read:

    (Gene|Protein) (MolecularFunction) (Gene|Protein). For example: Pax-3 gene activated Myod1 gene.
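
A minimal sketch of how such a sentence template might be matched with a regular expression. The verb list and the assumption that each entity is written as a name followed by the word "gene" or "protein" are simplifications made up for this illustration; a real extraction system would use curated vocabularies or a trained model.

```python
import re

# Hypothetical, tiny vocabulary of molecular-function verbs.
FUNCTIONS = ("activated", "inhibited", "regulated", "bound")

# An entity is a name (letters, digits, hyphens) followed by "gene" or "protein".
ENTITY = r"(?P<{role}>[\w-]+)\s+(?P<{role}_kind>gene|protein)"

PATTERN = re.compile(
    ENTITY.format(role="subject")
    + r"\s+(?P<function>" + "|".join(FUNCTIONS) + r")\s+"
    + ENTITY.format(role="object"),
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (subject, function, object) triples found in the text."""
    return [
        (m.group("subject"), m.group("function").lower(), m.group("object"))
        for m in PATTERN.finditer(text)
    ]
```

Applied to a sentence shaped like the example, `extract_relations` returns one triple with the two gene names and the verb between them.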

The top 10 most studied genes include TP53, TNF, EGFR, VEGFA, APOE, IL6, TGFB1, MTHFR, ESR1, AKT1.

Genetic Disorders

Both chromosomes of a pair carry the same genes. Sometimes there are slight variations in these genes; such variations occur in less than 1% of the DNA sequence. The variant forms of a gene are called alleles. If only one of the two alleles is abnormal and it is recessive, no disease develops. If both are abnormal, or if the abnormal allele is dominant, disease develops.
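
The dominant/recessive rule can be written as a small truth function. The sketch below also enumerates the four equally likely allele combinations a child can inherit from two parents; that step is an added assumption (simple single-gene Mendelian inheritance with full penetrance), and all names are illustrative.

```python
from itertools import product


def develops_disease(allele_a_abnormal, allele_b_abnormal, dominant):
    """Apply the rule: a dominant abnormal allele needs one copy, a recessive one needs two."""
    if dominant:
        return allele_a_abnormal or allele_b_abnormal
    return allele_a_abnormal and allele_b_abnormal


def offspring_risk(parent1, parent2, dominant):
    """Fraction of allele combinations that lead to disease.

    Each parent is a pair of booleans (True = abnormal allele); the child
    receives one allele from each parent, all four combinations equally likely.
    """
    outcomes = [develops_disease(a, b, dominant) for a, b in product(parent1, parent2)]
    return sum(outcomes) / len(outcomes)
```

For example, two carriers of a recessive abnormal allele, each `(True, False)`, give `offspring_risk(..., dominant=False)` of 0.25.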

Almost all diseases have a genetic component! This can be classified as:

  • Single-gene defects
  • Chromosomal disorders. (Excess or Lack of genes contained in chromosome)
  • Multifactorial (several genes and various other factors). This is the most common category; for example, asthma, cancer, diabetes, etc. fall in it.

Genes which are not well understood are called dark genes.

Pathogenesis

Pathogenesis is the process by which an infection leads to disease. Pathogenic mechanisms of viral disease include (1) implantation of virus at the portal of entry, (2) local replication, (3) spread to target organs (disease sites), and (4) spread to sites of shedding of virus into the environment.

Types of pathogenesis include microbial infection, inflammation, malignancy and tissue breakdown. For example, bacterial pathogenesis is the process by which bacteria cause infectious illness.

Diagnosis vs Prognosis

A diagnosis is an identification of a disease via examination. What follows is a prognosis, which is a prediction of the course of the disease as well as the treatment and results.

Drug Target

A drug target is a protein in the body (a functional biomolecule) that is intrinsically associated with a particular disease process and that could be addressed by a drug (a biologically active compound) to produce a desired therapeutic effect.

Various kinds of drug targets are:

  1. G-proteins and GPCRs: G-proteins, also known as guanine nucleotide-binding proteins, act as molecular switches inside cells and transmit signals from a variety of stimuli outside a cell to its interior. G-protein-coupled receptors (GPCRs), which signal through G-proteins, are a diverse family of receptors found in a huge range of tissues throughout the body. They account for about 5% of the genome. About 45% of all drugs target them. There are around 5000 potential targets of this kind, but only around 400 distinct proteins are targeted now.
  2. Enzymes: Around 28% of drugs target enzymes.
  3. Hormones and factors: Around 11%
  4. Ion channels: 5%
  5. Nuclear receptors: 2% (> 150) (For FDA approved drugs, around 11% of drugs target nuclear receptors??)

Note: Ligands that bind to and activate nuclear receptors include lipophilic substances such as endogenous hormones, vitamins A and D, and xenobiotic hormones. Because the expression of a large number of genes is regulated by nuclear receptors, ligands that activate these receptors can have profound effects on the organism.

Methods to identify Drug Targets

  • Cellular and genetic target methods.
  • Genomics. (The number of genes implicated in diseases is only around 1000.) There are around 5 to 10 linked proteins per gene.
  • Proteomics: Study of all proteins produced by a species. Analysis of the interactions of a protein with another protein, a nucleic acid or a ligand is of utmost importance. Compare protein expression levels in normal vs diseased tissues.
  • BioInformatics:

Appendix: Extra Terminologies

  • HL7: Health Level Seven or HL7 refers to a set of international standards for transfer of clinical and administrative data between software applications.
  • Clinical Document Architecture (CDA) – an exchange model for clinical documents, based on HL7 Version 3.
  • Structured Product Labeling (SPL) – the published information of medicine, based on HL7 Version 3.
  • SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) is one of the standards used in the US for electronic exchange of clinical health information.
  • ICD-9-CM: International Classification of Diseases, Ninth Revision, Clinical Modification. ICD-9-CM is the official system of assigning codes to diseases, diagnoses and procedures.
  • NIH - National Institutes of Health, Maryland, USA. Founded in 1887. Annual budget is around $45 billion! Parent agency is the Department of Health and Human Services. Child agencies include NLM (National Library of Medicine) and various national institutes for cancer, allergy, diabetes, heart, etc. (27 such institutes and centers). Harvard University also publishes a large number of good quality papers, probably even more than NIH!?

Database of Genotypes and Phenotypes: dbGaP

  • NIH sponsors the Database of Genotypes and Phenotypes (dbGaP), which holds information about the interaction of genotype and phenotype. The information includes phenotypes, molecular assay data, analyses and documents.

Cell Lines

A cell line is a set of cells grown in a laboratory from a single plant or animal cell. Cell lines can either be based on primary cells – for example muscle or fat cells – or on stem cells. The work involves growing stem cell lines and examining their surface molecules.

OMICS

The branches of science known informally as omics are various disciplines in biology whose names end in the suffix -omics, such as genomics, proteomics, metabolomics, metagenomics, phenomics and transcriptomics.

Nucleotide vs Nucleic Acid

  • A sequence of DNA is a string of nucleotides (usually named by their “bases”, or counted as “base pairs” in double-stranded DNA) that are chemically attached to each other, such as AGATTCAG, which is “read out” linearly. Human DNA is approx 3 billion letters (i.e. base pairs) long, but only 1-2% of it codes for proteins. The rest of the genome is called non-coding DNA (i.e. it does not encode protein assembly instructions). Note that the onion genome contains about 16 billion base pairs, so size does not represent complexity!
  • Nucleotide: A molecule consisting of a nitrogen-containing base (adenine, guanine, thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate group, and a sugar (deoxyribose in DNA; ribose in RNA). A nucleotide is composed of three distinctive chemical sub-units: a five-carbon sugar molecule, a nucleobase (the two of which together are called a nucleoside), and one phosphate group. With all three joined, a nucleotide is also termed a "nucleoside monophosphate". The sugar and phosphate form the backbone to which the nitrogenous bases attach.
  • Nucleic acids consist of a chain of linked units called nucleotides.
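
A short sketch of treating a DNA sequence as a string of bases, using the example sequence from these notes. It relies only on the pairing rules stated earlier (A with T, C with G; U replaces T in RNA). The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(seq):
    """Base-by-base partner strand: A pairs with T, C pairs with G."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Partner strand written in its own reading direction."""
    return complement_strand(seq)[::-1]


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def to_rna_letters(seq):
    """Same sequence written with the RNA alphabet (U instead of T)."""
    return seq.replace("T", "U")
```

For `AGATTCAG`, the partner strand is `TCTAAGTC` and 3 of the 8 bases are G or C.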

Pharos

Pharos is the user interface to the Knowledge Management Center (KMC) for the Illuminating the Druggable Genome (IDG) program funded by the National Institutes of Health (NIH) Common Fund. (Grant No. 1U24CA224370-01). The goal of KMC is to develop a comprehensive, integrated knowledge-base for the Druggable Genome (DG) to illuminate the uncharacterized and/or poorly annotated portion of the DG, focusing on three of the most commonly drug-targeted protein families:

  • G-protein-coupled receptors (GPCRs)
  • Ion channels (ICs)
  • Kinases

The Pharos interface provides facile access to most data types collected by the KMC. Given the complexity of the data surrounding any target, efficient and intuitive visualization has been a high priority, to enable users to quickly navigate & summarize search results and rapidly identify patterns. A critical feature of the interface is the ability to perform flexible search and subsequent drill down of search results. Underlying the interface is a GraphQL API that provides programmatic access to all KMC data, allowing for easy consumption in user applications.

UMLS

The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences.

Disease Ontology

Mondo Disease Ontology

Numerous sources for disease definitions and data models currently exist, including HPO, OMIM, SNOMED CT, ICD, PhenoDB, MedDRA, MedGen, ORDO, DO, GARD, etc., but they are not accurate. UMLS helps but is not precise.

Mondo’s development is coordinated with the Human Phenotype Ontology (HPO), which describes the individual phenotypic features that constitute a disease. Like the HPO, Mondo provides a hierarchical structure which can be used for classification or “rolling up” diseases to higher level groupings. It provides mappings to other disease resources, but in contrast to other mappings between ontologies, Mondo precisely annotates each mapping using strict semantics, so that it is known when two disease names or identifiers are equivalent or one-to-one, in contrast to simply being closely related.

OLS - Ontology Lookup Service

The Ontology Lookup Service (OLS) is a repository for biomedical ontologies.

OLS is developed and maintained by the Samples, Phenotypes and Ontologies Team (SPOT) at EMBL-EBI.

Example Diseases to understand ontology:

  • Diabetes
  • Cancer

Rare Diseases

See https://rarediseases.org/ This is NORD: National Organization for Rare Disorders

See Also: Orphanet Rare disease Ontology http://www.orphadata.org/cgi-bin/index.php#ontologies

Autoimmune encephalitis

OMIM

Online Mendelian Inheritance in Man

  • OMIM is a comprehensive collection of human genes and genetic phenotypes - freely available and updated daily.
  • The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 16,000 genes.
  • OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.

Tables

gtex homologene expression lincs

protein
-- Fetch the protein id for each symbol
SELECT DISTINCT sym AS symbol, id AS protein_id FROM protein WHERE sym IN ...

phenotype

mpo (mouse phenotype)

pathway
-- Reactome pathways
SELECT DISTINCT id_in_source AS reactome_id, name FROM pathway WHERE pwtype = 'Reactome';
-- KEGG pathways
SELECT DISTINCT SUBSTR(id_in_source, 6) AS kegg_pathway_id, name AS kegg_pathway_name FROM pathway WHERE pwtype = 'KEGG';

-- Proteins with a ClinVar phenotype cross-referenced to OMIM, excluding uncertain variants
SELECT DISTINCT clinvar.protein_id
FROM clinvar
JOIN clinvar_phenotype ON clinvar.clinvar_phenotype_id = clinvar_phenotype.id
JOIN clinvar_phenotype_xref ON clinvar_phenotype.id = clinvar_phenotype_xref.clinvar_phenotype_id
WHERE clinvar_phenotype_xref.source = 'OMIM' AND clinvar.clinical_significance != 'Uncertain significance'
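
The ClinVar join can be tried without access to the real database. The sketch below builds a toy in-memory SQLite database and runs the same query. Only the table and column names that appear in the query come from these notes; the extra columns, the rows and the choice of SQLite are assumptions made for illustration, and the real database may use a different schema or SQL dialect.

```python
import sqlite3

QUERY = """
SELECT DISTINCT clinvar.protein_id
FROM clinvar
JOIN clinvar_phenotype ON clinvar.clinvar_phenotype_id = clinvar_phenotype.id
JOIN clinvar_phenotype_xref
  ON clinvar_phenotype.id = clinvar_phenotype_xref.clinvar_phenotype_id
WHERE clinvar_phenotype_xref.source = 'OMIM'
  AND clinvar.clinical_significance != 'Uncertain significance'
ORDER BY clinvar.protein_id
"""


def build_toy_db():
    """Create an in-memory database with invented rows (not real ClinVar data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clinvar_phenotype (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE clinvar_phenotype_xref (
            id INTEGER PRIMARY KEY, clinvar_phenotype_id INTEGER, source TEXT);
        CREATE TABLE clinvar (
            id INTEGER PRIMARY KEY, protein_id INTEGER,
            clinvar_phenotype_id INTEGER, clinical_significance TEXT);
        INSERT INTO clinvar_phenotype VALUES (1, 'toy phenotype A'), (2, 'toy phenotype B');
        INSERT INTO clinvar_phenotype_xref VALUES (1, 1, 'OMIM'), (2, 2, 'MedGen');
        INSERT INTO clinvar VALUES
            (1, 101, 1, 'Pathogenic'),
            (2, 102, 1, 'Uncertain significance'),
            (3, 103, 2, 'Pathogenic'),
            (4, 101, 1, 'Likely pathogenic');
    """)
    return conn


def omim_linked_proteins(conn):
    """Run the query and return the protein ids as a plain list."""
    return [row[0] for row in conn.execute(QUERY)]
```

In the toy data, protein 102 is dropped because its only variant is of uncertain significance, and protein 103 is dropped because its phenotype has no OMIM cross-reference.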

disease

Appendix: Gene expression to detect ischaemic stroke

Terminology

  • Acute ischaemic stroke - AIS
  • Asymptomatic - Showing no symptoms
  • Genetic algorithm k-nearest neighbours (GA/kNN)
  • Cohort - Collection of people with similar characteristics
  • Sensitivity, specificity, Accuracy

Observations

GA/kNN identified 10 genes (ANTXR2, STK3, PDK4, CD163, MAL, GRAP, ID3, CTSZ, KIF1B and PLXDC2) whose coordinate pattern of expression was able to identify 98.4% of discovery cohort subjects correctly (97.4% sensitive, 100% specific)

Sensitivity, specificity, Accuracy

Consider 100 people interviewing for 10 posts. Suppose exactly 10 of them are really good candidates. The interview process selects 10 people: 7 good candidates and 3 bad candidates. What are the sensitivity, specificity and accuracy?

Sensitivity = True Positive / (True Positive + False Negative)
            = Correctly Selected / (Correctly Selected + Mistakenly Rejected)
            = Correctly Selected / Total Excellent candidates
            = 7 / 10 = 70% sensitivity.

Specificity = True Negative / (True Negative + False Positive)
            = Correctly Rejected / (Correctly Rejected + Mistakenly Selected)
            = Correctly Rejected / Total poor candidates
            = 87 / 90  =  96.67 % Specificity

Note: Specificity and accuracy look high even though sensitivity is low,
because negatives greatly outnumber positives and most of them are rejected correctly.

Accuracy = (True Positive + True Negative)/ 
           (True Positive+ False Positive + True Negative + False Negative)

         = (Correctly Selected + Correctly Rejected)/
           (Correctly Selected + Mistakenly Selected + Correctly Rejected 
                   + Mistakenly Rejected)

         = (Correctly Selected + Correctly Rejected)/ Total Candidates.
         = (7 + 87) / 100 = 94.0 %

Notes:

  • For the above example, consider various cases:

    Case                   Sensitivity  Specificity   Accuracy
    Reject All             0%           100%          90%
    Select All Bad         0%           88.9%         80%

    (Select All Bad assumes all 10 posts go to bad candidates.)
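
The three measures can be computed directly from the four confusion-matrix counts. The sketch below uses the interview example, where 7 good candidates are selected, 3 good ones are rejected, 3 bad ones are selected and 87 bad ones are rejected. The function name is illustrative.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp: correctly selected     fp: mistakenly selected
    tn: correctly rejected     fn: mistakenly rejected
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Interview example: 10 good and 90 bad candidates, 10 people selected.
INTERVIEW = screening_metrics(tp=7, fp=3, tn=87, fn=3)

# Edge case: nobody is selected, so every candidate counts as rejected.
REJECT_ALL = screening_metrics(tp=0, fp=0, tn=90, fn=10)
```

The reject-all case shows why accuracy alone is misleading when negatives dominate: it scores 90% accuracy while finding none of the good candidates.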
    

CORD-19 Dataset

metadata.csv

  • Each row contains information about one CORD-19 paper.
  • cord_uid: Document id, similar to a PubMed id. Non-PubMed papers are also present.
  • title, abstract, authors, url (list),
  • pdf_json_files: ';'-separated list of JSON files, e.g. document_parses/pdf_json/<uid1>.json; document_parses/pdf_json/<uid2>.json. Each JSON file contains the full text as a dictionary. It is a collection of sections, where each section consists of a paragraph and a section_name. The section 'introduction' is special.
  • pmc_json_files: Similar to pdf_json_files. The source is XML files from PMC, but the output is JSON files in the same format as pdf_json_files.
  • source_x: Source info, e.g. ArXiv; Elsevier; PMC; WHO. There can be multiple sources.
  • doi: Paper DOI
  • pmcid: Paper ID on PubMed Central. e.g. PMC123456
  • pubmed_id: Int valued Paper Id on PubMed
  • journal: Paper journal name, e.g. BMJ or British Medical Journal, etc. (Optional). Not necessarily standardized.
  • who_covidence_id: WHO assigned id. e.g. #72306
  • arxiv_id: arXiv ID of this paper.
  • s2_id: Semantic Scholar ID for this paper. Can be used with the Semantic Scholar API (e.g. s2_id=9445722 corresponds to http://api.semanticscholar.org/corpusid:9445722)
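
A minimal sketch of reading metadata.csv with the standard csv module and splitting the ';'-separated fields. To stay self-contained it parses a small in-memory sample instead of the real file; the two sample rows, their ids and their file paths are invented placeholders, and only the column names come from the list above.

```python
import csv
import io

# Invented sample in the shape described above (not real CORD-19 records).
SAMPLE = """cord_uid,title,pdf_json_files,pmc_json_files,source_x
uid0001,Toy paper one,document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json,,PMC; WHO
uid0002,Toy paper two,,document_parses/pmc_json/placeholder.json,ArXiv
"""


def split_multi(value):
    """Split a ';'-separated field into a clean list (empty field -> empty list)."""
    return [part.strip() for part in value.split(";") if part.strip()]


def load_rows(handle):
    """Yield one dict per paper with the multi-valued fields already split."""
    for row in csv.DictReader(handle):
        yield {
            "cord_uid": row["cord_uid"],
            "title": row["title"],
            "pdf_json_files": split_multi(row["pdf_json_files"]),
            "pmc_json_files": split_multi(row["pmc_json_files"]),
            "sources": split_multi(row["source_x"]),
        }


ROWS = list(load_rows(io.StringIO(SAMPLE)))
```

To read the real file, the same `load_rows` function could be given `open("metadata.csv", newline="", encoding="utf-8")` in place of the in-memory sample.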