==================================================================
                      Notes On Databases
==================================================================
References:
  http://www.databasejournal.com
  Whitemarsh Infosys corp: http://www.wiscorp.com/SQLStandards.html
  Library of free models: http://www.databaseanswers.org/data_models/
  SQL Web Tutorial: http://sqlzoo.net
  Comparison of SQL compliance by leading DBMS products:
      http://troels.arvin.dk/db/rdbms/
===================================================================
|Contents|:
  Relational Theory ~
  Database History ~
  Normalization ~
  Summary of famous DBMSes ~
  Notes from Ullman's Book on DBMS ~
  * B-tree and indexing ~
  How MVCC works on Postgres ~
  Pete's Article on MVCC ~
  Secondary Index ~
  Any recommended approach for EAV pattern ? ~
  Natural Join ~
  Emulating Outer Join ~
  Window Functions ~
  Database Modeling ~
  Data Modeling Notation ~
  Isolation Levels ~
  Heap Sort ~
  B-Trees ~
  Query Evaluation ~
  Transaction Locking ~
  Recovering From a Crash ~
  Optimization ~
  OLAP ~
  DSL ALPHA, QUEL, SQL ~
  Key Terms ~
  Embedded SQL Example ~
  Dynamic SQL ~
  INSTEAD OF Trigger ~
  Identifying Relationship ~
  Foreign Keys ~
  NULL Handling ~
  Deferred Integrity Checking ~
  Surrogate Key ~
  ISAM ~
  Benchmarks ~
  Queries by Example ~
  Challenging SQL Queries ~
  Views ~
  Appendix ~
  Recommended Books ~
  Useful Research Pointers ~
========================================================================
Relational Theory ~
========================================================================
- A relational database is based on relational algebra and relational
  calculus.
- Relational algebra deals with a set of finitary relations that is closed
  under certain operators. These operators operate on one or more relations
  to yield a relation.
- What is first-order logic? A logic whose variables range over individual
  elements of a domain; its terms are variables, constants, and function
  applications, and it cannot quantify over predicates or sets.
- First-order logic does not have the expressive power of higher-order (e.g.
second order) logic, which can characterize the natural numbers in ways
  first-order logic cannot.
- Relational algebra deals with operators such as intersect, union, etc.
  Specific operators: JOIN, Project (i.e. select columns), Restrict (on a
  condition). It is more procedural. (The JOIN and Restrict operators make
  it much more powerful than simple set algebra.)
- Relational Algebra = a simple algebra of sets extended with these
  operators, restricted to operate on finite sets of elements.
- First-order logic deals with predicates, there-exists, for-all, implies,
  etc. E.g.: for all y such that y > 0, there exists x < y.
- Natural deduction systems, sequent calculus, etc. can be applied to
  first-order logic.
- Relational calculus is declarative ~
  Get StoreName and StorePhone for supplies such that there exists a tuple
  BK with the same BookstoreID value and with a BookTitle value of
  "Some Sample Book".
- Codd's Theorem establishes that *Relational Algebra and Relational
  Calculus* have the *same* expressive power.
- Query languages that are equivalent in expressive power to relational
  algebra were called *relationally complete* by Codd. By Codd's Theorem,
  this includes relational calculus (and SQL).
- An SQL query is translated into a relational algebra expression --
  depending on the order of the operands, optimization may suffer.
- People developed many database implementations where SQL is still not
  used.
- The relational algebra does lack some power for data manipulation.
  Some common examples:
  * GROUP BY and aggregation over groups are not available in relational
    algebra; they are provided as an SQL extension.
  * Not even count(*) is available in relational algebra, for that matter.
  * Computing the *transitive closure* of a graph given by its binary edge
    relation. For example, A -> B (A is a friend of B); B -> C; C -> D;
    etc. We say a friend's friend is a friend. A invites B and his friends
    to a party! Find all friends of A!
      select * from Community as C1, Community as C2, Friends F where ... ???
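No fixed relational algebra expression answers this, but SQL:1999's WITH RECURSIVE does. A minimal sketch using sqlite3; the friends table and its sample edges are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (a TEXT, b TEXT);       -- a is a friend of b
    INSERT INTO friends VALUES ('A','B'), ('B','C'), ('C','D');
""")

# Transitive closure: everyone reachable from 'A' via friend edges.
rows = conn.execute("""
    WITH RECURSIVE reach(x) AS (
        SELECT b FROM friends WHERE a = 'A'
        UNION
        SELECT f.b FROM friends f JOIN reach r ON f.a = r.x
    )
    SELECT x FROM reach ORDER BY x
""").fetchall()
print([r[0] for r in rows])   # -> ['B', 'C', 'D']
```

The UNION (rather than UNION ALL) removes duplicates, so the recursion terminates even on cyclic friendship graphs.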
========================================================================
From: Introduction to First-Order Logic and Prolog
www.comp.nus.edu.sg/~kanmy/courses/3243_2006/.../w8-all.pdf
Author: Min-Yen Kan, National Univ of Singapore

In terms of expressive power:
    Propositional Logic < First-Order Logic < Procedural Language
    (in the sense that "assembly language is more powerful")
Propositional logic does not support the for-all and there-exists
operators. A fact must be repeated as many times as there are elements
in a relation.

Propositional logic is declarative.
- Propositional logic allows partial/disjunctive/negated information
  (unlike most data structures and databases)
- Propositional logic is compositional:
  the meaning of B(1,1) and P(1,2) is derived from the meaning of B(1,1)
  and of P(1,2)
- Meaning in propositional logic is context-independent
  (unlike natural language, where meaning depends on context)
Bad:
- Propositional logic has very limited expressive power (unlike natural
  language). E.g., cannot say "pits cause breezes in adjacent squares".
========================================================================
First-order logic:
Whereas propositional logic assumes the world contains facts,
first-order logic (like natural language) assumes the world contains
  * Objects: people, houses, numbers, colors, baseball games, wars, ...
  * Relations: red, round, prime, brother of, bigger than, part of,
    comes between, ...
  * Functions: father of, best friend, one more than, plus, ...
- Variables (x, y, z, etc.) and Constants (John, Ram, etc.)
- Atomic Sentence: Predicate(term1, term2, ..., termN)
      term := constant | variable | function(term, ...)
Interacting with FOL KBs:
- Asking Predicate(x, y) will return all satisfying x, y!
========================================================================
What is Logic Programming?
A type of programming consisting of facts and relationships from which
the programming language can draw a conclusion.
In imperative programming languages:
  * we tell the computer what to do by programming the procedure by which
    program states and variables are modified.
In contrast, in logic programming:
  * we don't tell the computer exactly what it should do (i.e., how to
    derive a conclusion). User-provided facts and relationships allow it
    to derive answers via logical inference.
========================================================================
From http://www.cs.pitt.edu/~chang/156/10calculus.html

RELATIONAL CALCULUS
Relational calculus is nonprocedural.
It has the same expressive power as relational algebra, i.e. it is
relationally complete.
It is a formal language based upon a branch of mathematical logic called
"predicate calculus".
There are two approaches: tuple relational calculus and domain relational
calculus. (We will discuss only tuple relational calculus.)

WHY IS RELATIONAL CALCULUS IMPORTANT?
It lays the formal foundation for many query languages, such as QUEL,
QBE, SQL, etc.

TUPLE RELATIONAL CALCULUS
    { t | COND(t) }
    { t | EMPLOYEE(t) and t.SALARY > 50000 }
The RANGE RELATION is EMPLOYEE.
The TUPLE VARIABLE is t.
The ATTRIBUTE of a TUPLE VARIABLE is t.SALARY.
(This is similar to SQL's T.SALARY; in relational algebra we would write
T[SALARY].)
    { t.FNAME, t.LNAME | EMPLOYEE(t) and t.SALARY > 50000 }
is equivalent to
    SELECT T.FNAME, T.LNAME
    FROM   EMPLOYEE T
    WHERE  T.SALARY > 50000
Note: how to express JOINs here? Use two tuple variables and equate their
attributes in the condition.

SAFE EXPRESSIONS
A SAFE EXPRESSION is one that is guaranteed to yield a finite number of
tuples as its result. Otherwise, it is called UNSAFE.
    { t | not(EMPLOYEE(t)) } is UNSAFE!
A technique to guarantee SAFENESS can be applied to transform a query.

Q6: Find the names of employees without dependents.
Q6:  {e.FNAME, e.LNAME | EMPLOYEE(e) and
        (not (E d)(DEPENDENT(d) and e.SSN = d.ESSN))}
Q6A: {e.FNAME, e.LNAME | EMPLOYEE(e) and
        ((A d)(not(DEPENDENT(d)) or not(e.SSN = d.ESSN)))}
(Here (E d) means "there exists d" and (A d) means "for all d"; Q6A is
the form obtained by pushing the negation through the quantifier.)
========================================================================
From: http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch4_Domain_Calculus.pdf
- Relational calculus is a simple subset of first-order logic.
  (Note: neither relational calculus nor first-order logic can express
  transitive closure; see the Datalog discussion below.)
- Expressions in the calculus are called formulas.
========================================================================
Great material: http://users.soe.ucsc.edu/~kolaitis/talks/gii09-final.pdf
Author: Phokion G. Kolaitis
University of California, Santa Cruz & IBM Research-Almaden
kolaitis@cs.ucsc.edu
2009 GII Doctoral School on Advances in Databases

Codd introduced two different query languages for the relational data
model:
  * Relational Algebra, which is a procedural language.
    It is an algebraic formalism in which queries are expressed by
    applying a sequence of operations to relations.
  * Relational Calculus, which is a declarative language.
    It is a logical formalism in which queries are expressed as formulas
    of first-order logic.
Codd's Theorem: Relational Algebra and Relational Calculus are essentially
equivalent in terms of expressive power.
========================================================================
Goals:
  The language should be sufficiently high-level to secure physical data
  independence, i.e., the separation between the physical level and the
  conceptual level of databases.
  The language should have high enough expressive power to be able to
  pose useful and interesting queries against the database.
  The language should be efficiently implementable to allow for the fast
  retrieval of information from the database.
Warning: There is a tension between the last two desiderata. An increase
in expressive power comes at the expense of efficiency.
Codd's key contribution was to identify a small set of basic operations
on relations and to demonstrate that useful and interesting queries can
be expressed by combining these operations. Thus, relational algebra is a
rich enough language, even though, as we will see later on, it suffers
from certain limitations in terms of expressive power.
The first RDBMS prototype implementations (System R and Ingres)
demonstrated that the relational algebra operations can be implemented
efficiently.

The Five Basic Operations of Relational Algebra
Group I: Three standard set-theoretic binary operations:
    *Union*  *Difference*  *Cartesian Product*
Group II: Two special unary operations on relations:
    *Projection*  *Selection*
Relational Algebra consists of all expressions obtained by combining
these five basic operations in syntactically correct ways.
Context-free grammar for relational algebra expressions:
    E := R, S, ... | (E1 U E2) | (E1 - E2) | (E1 X E2)
         | Project(E) | Select_Cond(E),
    where R, S, ... are relation schemas.

There are other *derived* relational algebra operations, e.g.:
    R Intersect S = R - (R - S) = S - (S - R)

Quotient (or Division)
Example: Given ENROLLS(stud-name, course) and TEACHES(fac-name, course),
find the names of students who take every course taught by V. Vianu.
RelationX / RelationY => the subset of X's non-common attribute values
that are paired (in X) with *every* value of Y.
Find the names of Netflix customers who have rented every film in which
Tom Cruise acted.

Great one -- Division explained:
http://www.cs.arizona.edu/~mccann/research/divpresentation.pdf
Author: Lester I. McCann, mccann@uwp.edu,
Univ of Wisconsin-Parkside, Kenosha

The complete division expression:
    T1 : c1, cc   (cc = common column)
    T2 : c2, cc
    Pi == Project
    T1 / T2 = Pi_c1(T1) - Pi_c1( (Pi_c1(T1) X Pi_cc(T2)) - T1 )
                 step 3              step 1              step 2
As per set theory, duplicates are removed automatically after a
projection.
Ignoring the projections, there are just three steps:
    1.
Compute all possible attribute pairings
    2. Remove the existing pairings
    3. Remove the non-answers from the possible answers

Example Query: Find the sno values of the suppliers that supply all parts
of weight equal to 17 (or imagine all parts of color = RED, since you
like RED).
    p   part:          pno pname color weight city
    spj supplier_part: sno pno jno qty
(We don't care about the supplier table, which contains the supplier name
etc.)

select distinct sno from spj
except
select sno from
  ( select sno, pno
    from (select sno from spj) as t1,
         (select pno from p where weight = 17) as t2
    except
    select sno, pno from spj
  ) as t3;

Note: MySQL does not support the set-difference "EXCEPT" operator, but it
can be emulated with an anti-join:
    SELECT users.username
    FROM users LEFT JOIN computers
         ON computers.username = users.username
    WHERE computers.username IS NULL
This is clumsy for multi-column differences:
    select c1, c2 except select n1 as c1, n2 as c2
would have to be translated to:
    select c1, c2 from T1
    LEFT JOIN T2 ON c1 = T2.c1 AND c2 = T2.c2
    WHERE T2.c1 is NULL
(Unlike EXCEPT, the anti-join does not remove duplicates.)
Note: LEFT JOIN ... ON produces a relation with m + n columns.
      LEFT JOIN ... USING(col) produces a relation with m + n - 1 columns.
Note: Oracle supports the MINUS operator.
========================================================================
Another approach: compute the quotient using SET CONTAINMENT logic.
The SQL query scans through the sno values, computes the set of parts
supplied by the current sno, and includes the sno in the quotient if the
difference is empty:

select distinct sno
from spj as globl
where not exists
   ( ( select pno from p where weight = 17 )
     except
     ( select p.pno
       from p, spj
       where p.pno = spj.pno and spj.sno = globl.sno ) );

Yet another approach: counting. The resulting SQL query:

select distinct sno
from spj, p
where spj.pno = p.pno and weight = 17
group by sno
having count(distinct p.pno) =
       (select count(distinct pno) from p where weight = 17);

Note: Both count and group-by are not part of relational algebra.
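The double-EXCEPT construction and the counting query above can be checked against each other in sqlite3 (which does support EXCEPT). The schemas follow the p/spj layout above, but the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p   (pno TEXT, weight INT);
    CREATE TABLE spj (sno TEXT, pno TEXT);
    INSERT INTO p   VALUES ('p1',17), ('p2',17), ('p3',12);
    INSERT INTO spj VALUES ('s1','p1'), ('s1','p2'), ('s1','p3'),
                           ('s2','p1'), ('s3','p1'), ('s3','p2');
""")

# Division via the double-EXCEPT construction (McCann).
division = """
    SELECT DISTINCT sno FROM spj
    EXCEPT
    SELECT sno FROM (
        SELECT t1.sno AS sno, t2.pno AS pno
        FROM (SELECT DISTINCT sno FROM spj) AS t1,
             (SELECT pno FROM p WHERE weight = 17) AS t2
        EXCEPT
        SELECT sno, pno FROM spj
    ) AS t3
"""

# The same quotient via counting.
counting = """
    SELECT sno FROM spj JOIN p ON spj.pno = p.pno
    WHERE p.weight = 17
    GROUP BY sno
    HAVING COUNT(DISTINCT p.pno) =
           (SELECT COUNT(DISTINCT pno) FROM p WHERE weight = 17)
"""

a = sorted(r[0] for r in conn.execute(division))
b = sorted(r[0] for r in conn.execute(counting))
print(a, b)   # both ['s1', 's3']
```

Only s1 and s3 supply both weight-17 parts (p1 and p2); s2 is eliminated because the "possible pairing" (s2, p2) is not in spj.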
========================================================================
Definition (Codd 1972): A database query language L is relationally
complete if it is at least as expressive as relational algebra, i.e.,
every relational algebra expression E has an equivalent expression F in L.
Relational completeness provides a benchmark for the expressive power of
a database query language. Every commercial database query language
should be at least as expressive as relational algebra.

SQL vs. Relational Algebra
========================================================================
    SQL         Relational Algebra
    ------      ------------------
    SELECT      Projection
    FROM        Cartesian Product
    WHERE       Selection
========================================================================
In addition to relational algebra, Codd introduced relational calculus.
Relational calculus is a declarative database query language based on
first-order logic. Relational calculus comes in two different flavors:
    Tuple relational calculus
    Domain relational calculus
========================================================================
What is a partial order?
Subset inclusion is a partial order: A >= B iff every element of B is
also in A; two sets need not be comparable either way. This is opposed to
the total order on the natural numbers, where every element can be
compared to every other element.
The elements of a binary tree are partially ordered by the
ancestor/descendant relationship. The relation is reflexive, transitive,
and anti-symmetric (i.e. a >= b and b >= a implies a = b).
Siblings are not related by this relation -- they are incomparable.
So there is no total order.
(Compare this with the natural numbers: for any a, b, either a <= b or
b <= a.)
========================================================================
First-Order Logic:
First-order logic is a formalism for expressing properties of
mathematical structures (graphs, trees, partial orders, ...).
Example: Consider a graph G=(V,E) (nodes are in V, edges are in E).
    There is a self-loop.
    Every two nodes are connected via a path of length 2.
    Every node has exactly three distinct neighbors.
    There is a path of length 3 from node x to node y.
    Node x has at least four distinct neighbors.
These and many other similar properties are expressible as formulas of
first-order logic on graphs.
One of Codd's key insights was that first-order logic can also be used to
express relational database queries.
Example: The relational calculus expression (in domain calculus)
    {(x,y): there-exists z such that (E(x,z) AND E(z,y))}
returns the set P of all pairs of nodes (a,b) that are connected via a
path of length 2. The query starts with "unbound" variables.
Note: In all these places, we assume a SET CONTAINS UNIQUE ELEMENTS!

Give a relational calculus expression for R natural-join S
(with R(A1,A2,A3) and S(A2,A3,A4)):
    {(x1,x2,x3,x4): R(x1,x2,x3) AND S(x2,x3,x4)}

Assume that R has arity 5 and S has arity 3. Express R / S in relational
calculus:
    {(x1,x2): (for-all x3)(for-all x4)(for-all x5)
              (S(x3,x4,x5) implies R(x1,x2,x3,x4,x5))}
Much simpler than the relational algebra expression for R / S (R DIV S).
========================================================================
Computational Complexity Theory:
Complexity classes and complete problems -- in particular, the classes P
and NP, and NP-complete problems.
Church's Thesis (aka Church-Turing Thesis):
The following statements are equivalent for a (partial) function f:
  * There is a Turing machine that computes f.
  * There is an algorithm that computes f.
Main use of Church's Thesis: To show that there is no algorithm for
computing a function f, it suffices to show that there is no Turing
machine that computes f.
========================================================================
Limitations of Relational Algebra & Relational Calculus
========================================================================
Outline: Relational algebra and relational calculus have substantial
expressive power. In particular, they can express
    Natural Join
    Quotient
    Unions of conjunctive queries
    ...
However, they cannot express recursive queries.
*Datalog* is a declarative database query language that augments the
language of conjunctive queries with a recursion mechanism.

Parents, Grandparents, and Greatgrandparents
Let PARENT be a binary relational schema such that if (a,b) is present in
PARENT in some database instance, then a is a parent of b.
Using PARENT, we can define GRANDPARENT and GREATGRANDPARENT by the
conjunctive queries:
    GRANDPARENT(x,y) :- PARENT(x,z), PARENT(z,y)
and
    GREATGRANDPARENT(x,y) :- PARENT(x,z), PARENT(z,w), PARENT(w,y)
Similarly, we can define GREATGREATGRANDPARENT by a conjunctive query,
and so on, up to any fixed level of ancestry.

Parents and Ancestors ~
Question: Is there a relational algebra (relational calculus) expression
that defines ANCESTOR from PARENT?
Note: This type of question occurs in other related contexts:
    Given a binary relation MANAGES(manager, employee), is there a
    relational algebra (relational calculus) expression that defines
    HIGHER-MANAGER?
    Given a binary relation DIRECT(from,to) about flights, is there a
    relational algebra (relational calculus) expression that defines
    CAN-FLY(from,to)?
More abstractly, given a binary relation E, is there a relational algebra
(relational calculus) expression that defines the *Transitive Closure* TC
of E?
No! They cannot express recursive queries.
Question: What is to be done to overcome the limitations of the
expressive power of relational calculus?
Answer 1: Embedded relational calculus (Embedded SQL):
Allow SQL commands inside a conventional programming language, such as C,
Java, etc. This is an inferior solution, as it destroys the high-level
character of SQL.
Answer 2: Augment relational calculus with a high-level declarative
mechanism for recursion. Conceptually this is a superior solution, as it
maintains the high-level declarative character of relational calculus.

Datalog = "Conjunctive Queries + Recursion"
Datalog was introduced by Chandra and Harel in 1982 and has been studied
in depth by the research community since that time.
SQL:1999 and subsequent versions of the SQL standard provide support for
a sublanguage of Datalog, called linear Datalog.
Syntax: WITH RECURSIVE T AS ...

Datalog and SQL Example: Give an SQL query for (oldest) ANCESTOR
(using PARENT as the EDB):
    WITH RECURSIVE ANCESTOR(anc,desc) AS
      (SELECT parent, child FROM PARENT
       UNION
       SELECT PARENT.parent, ANCESTOR.desc
       FROM PARENT, ANCESTOR
       WHERE PARENT.child = ANCESTOR.anc)
    SELECT * FROM ANCESTOR

Example: Give an SQL query that computes all descendants of Noah:
    WITH RECURSIVE ANCESTOR(anc,desc) AS
      (SELECT parent, child FROM PARENT
       UNION
       SELECT PARENT.parent, ANCESTOR.desc
       FROM PARENT, ANCESTOR
       WHERE PARENT.child = ANCESTOR.anc)
    SELECT desc FROM ANCESTOR WHERE anc = 'Noah'
(Could this be wrapped in a stored function???)

Example: Datalog program for Transitive Closure:
    T(x,y) :- E(x,y)
    T(x,y) :- E(x,z), T(z,y)
Note: Alternative rules for the same head act as an OR; the "," in a rule
body acts as an AND. Compare ordinary recursion:
    Fact(0) := 1
    Fact(n) := n * Fact(n-1)
Note: Recursive equations arising from Datalog programs need not have a
unique solution.
========================================================================
==================================================================
Database History ~
==================================================================
E.F. Codd published the Relational Model paper in 1970.
C.J. Date has collaborated with Codd and written extensively.
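The ANCESTOR query above runs nearly as-is in sqlite3. A sketch with a made-up PARENT table; the desc column is renamed dsc, since DESC is a reserved word in most engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (parent TEXT, child TEXT);
    INSERT INTO parent VALUES
        ('Noah','Shem'), ('Shem','Arphaxad'), ('Arphaxad','Salah');
""")

# All descendants of Noah, via the same recursion as in the notes:
# ancestor(p, d) holds if parent(p, c) and ancestor(c, d).
rows = conn.execute("""
    WITH RECURSIVE ancestor(anc, dsc) AS (
        SELECT parent, child FROM parent
        UNION
        SELECT parent.parent, ancestor.dsc
        FROM parent, ancestor
        WHERE parent.child = ancestor.anc
    )
    SELECT dsc FROM ancestor WHERE anc = 'Noah' ORDER BY dsc
""").fetchall()
print([r[0] for r in rows])   # -> ['Arphaxad', 'Salah', 'Shem']
```

UNION (not UNION ALL) makes this a set-semantics fixpoint, matching the Datalog reading of the two rules.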
History of standards:
1986: SQL-86   First published by ANSI. Ratified by ISO in 1987; also
               known as SQL-87. Tables, views, referential and integrity
               constraints, etc.
1989: SQL-89   Minor revision.
1992: SQL-92   Known as SQL-2; date, time, check constraints, ALTER,
               scrolling cursors, etc.
1999: SQL:1999 [aka SQL-3]  Triggers, stored procedures, some
               object-oriented features (table inheritance), etc.
2003: SQL:2003 XML-related features.
2006: SQL:2006 Publishing SQL data in XML format; XQuery functions.
2008: SQL:2008 Various minor improvements like TRUNCATE TABLE,
               INSTEAD OF triggers, etc. This standard replaces the 2006
               std?
2011: In progress.

The INGRES database (Stonebraker and other Berkeley folks) was the seed
for SQL Server, Sybase, and many other database systems.
IBM System R, DB2 and others were assisted by Jim Gray and many other
folks.
Lots of rivalry between the INGRES and System R folks. The INGRES guys
thought the IBM guys were copying INGRES, and vice versa!
Oracle beat others by reading white papers and coming to market first!
Early 80s:
    Ellison's Oracle beats IBM to market by reading white papers.
    IBM releases multiple RDBMSs, settles down to DB2.
    Gray (System R), Jerry Held (Ingres) and others join Tandem
    (Non-Stop SQL); Kapali Eswaran starts EsVal, which begets HP Allbase
    and Cullinet.
    Relational Technology Inc (Ingres Corp), Britton-Lee/Sybase, and
    Wang PACE grow out of the Ingres group.
    CA releases CA-Universe, a commercialization of Ingres.
    Informix started by Cal alum Roger Sippl (no pedigree to research).
    Teradata started by some Caltech alums, based on proprietary
    networking technology (no pedigree to software research).
1990s:
    The Postgres project at UC Berkeley turned into a successful open
    source project driven by a large community.
    Illustra (from Postgres) -> Informix -> IBM
    MySQL
2000s:
    Postgres
    Netezza, Vertica, Greenplum, EnterpriseDB...
    MySQL, Infobright
    Ingres, DATAllegro
System R and Ingres are the 2 most influential DBMS foundations.
==================================================================
Normalization ~
==================================================================
Suppose:
    Suppliers: (sno, sname, city)
    Product:   (pno, pname, color, price)
    cansupply: (sno, pno, qty)

    Student:    (sid, sname, deptno)
    Department: (deptno, deptname)
    Courses:    (courseno, coursename)
    Enrollment: (sid, courseno)
Suppose instead: Enrollment(sid, courseno, coursename).
Here coursename depends on courseno, which in turn is part of the
composite key (sid, courseno). An attribute is dependent on a proper
subset of a key => normalization violation.

Another example:
    Employee(eid, name, lic_no, car_make?)
    car_info(lic_no, make)
    drivers(drv_id, name, address)
Look at Employee(eid, name, lic_no, car_make): car_make depends on
lic_no, which depends on eid -- a transitive dependency. If car_info() is
not maintained separately, the info is lost whenever the car is
temporarily not assigned to any employee.

1 NF (Codd) - All tables flat.
2 NF (Codd) - Every non-key attribute fully functionally dependent on the
              whole key (no partial dependencies).
3 NF (Codd) - No transitive dependencies.
BCNF
4 NF

How to arrive at the initial set of tables? Use an ER diagram; it yields
the initial set of tables. Over the years, the ER diagram has had the
following extensions added:
    - Weak entities
    - Inheritance hierarchies (specialization/generalization)
    - More-than-binary relationships
    - Relation cardinality (count -- min/max of participants)

Superkey      - It is a key. It may be more than a key.
Candidate key - Minimal superkey. A minimal set of attributes that
                identifies any tuple in the relation.

Codd introduced the model & two query languages: relational algebra and
relational calculus.
    A relational model of data for large shared data banks, CACM, 1970.
    Relational completeness of data base sublanguages, in: Database
    Systems, 1972.
===================================================================
Summary of famous DBMSes ~
===================================================================
@@ *Berkeley DB* (BDB)
Berkeley DB is a software library that provides a high-performance
embedded database for key/value data. Berkeley DB is a programmatic
library written in C with API bindings for C++, PHP, Java, Perl, Python,
Ruby, Tcl, Smalltalk, and most other programming languages. BDB stores
arbitrary key/data pairs as byte arrays, and supports multiple data items
for a single key. Berkeley DB is not a relational database.[1]
BDB can support thousands of simultaneous threads of control or
concurrent processes manipulating databases as large as 256 terabytes, on
a wide variety of operating systems including most Unix-like and Windows
systems, and real-time operating systems.
Berkeley DB is also used as the common name for three distinct products:
Oracle Berkeley DB, Berkeley DB Java Edition, and Berkeley DB XML. These
three products all share a common ancestry and are currently under active
development at Oracle Corporation.
===================================================================
@@ *SQLite*
SQLite is an ACID-compliant embedded relational database management
system contained in a small (~275 kB)[4] C programming library.
===================================================================
@@ *Teradata*
Teradata Corporation (NYSE: TDC) is a computer company that sells
database software for data warehouses and analytic applications. Its
products are meant to consolidate data from different sources and make
the data available for analysis.
Teradata was incorporated in 1979 and was formerly a division of NCR
Corporation, with the spinoff from NCR on October 1, 2007.[2]
===================================================================
@@ *CouchDB*
Apache CouchDB, commonly referred to as CouchDB, is an open source
document-oriented database written mostly in the Erlang programming
language. It is part of the NoSQL group of data stores and is designed
for local replication and to scale horizontally across a wide range of
devices. CouchDB is supported by the commercial enterprises Couchbase and
Cloudant.
CouchDB exposes a RESTful HTTP API, and a large number of pre-written
clients are available. Additionally, a plugin architecture allows for
using different languages as the view server, such as JavaScript
(default), PHP, Ruby, Python and Erlang.
It is in use in many software projects and web sites,[7] including
Ubuntu, where it is used to synchronize address and bookmark data.[8]
===================================================================
@@ *MongoDB*
MongoDB (from "humongous") is an open source, high-performance,
schema-free, document-oriented NoSQL database system written in the C++
programming language.[1] It manages collections of BSON documents that
can be nested in complex hierarchies and still be easy to query and
index, which allows many applications to store data in a natural way that
matches their native data types and structures.
*MongoDB* scales horizontally using a system called sharding,[45] which
is very similar to the *BigTable* and *PNUTS* scaling model.
The server is supported only on little-endian systems; the drivers for
Java, C, PHP, etc. are supported on both little- and big-endian systems.
*Aggregation*
MongoDB supports a couple of tools for aggregation, including
MapReduce[4] and a group function similar to SQL's GROUP BY.
*Map/reduce* in MongoDB is useful for batch processing of data and
aggregation operations.
It is similar in spirit to using something like *Hadoop* with all input
coming from a collection and output going to a collection. Often, in a
situation where you would have used GROUP BY in SQL, map/reduce is the
right tool in MongoDB.
Development of MongoDB began in October 2007 by 10gen. The first public
release was in February 2009.[2]
===================================================================
@@ *Citrusleaf*
partners@citrusleaf.com
Citrusleaf 2.0 is a NoSQL database that delivers reliability, linear
scalability and exceptional performance for high-volume, data-intensive,
web-scale and mobile businesses.
===================================================================
@@ *Clustra HADB*
Internode communication through an ATM switch (or another high-speed
communication interface). A telecom DB built on off-the-shelf hardware
and OS! Node group = site --> correlated probability of failure.
    Site A      Site B
    Node-A0     Node-B0
    Node-A1     Node-B1
========================================================================
Dictionary of buzzwords ~
@@ *Amazon EC2* - Amazon Elastic Compute Cloud
  * Based on Xen virtualization; provides virtual computing resources.
  * Central part of AWS -- machine images can be deployed dynamically.
  * Persistent storage is either EBS or S3 (Simple Storage Service).
@@ *Amazon S3* - Simple Storage Service
  * S3 is a storage service based on Web-Services standards (SOAP, WSDL,
    XML and/or Web API, REST - Representational State Transfer).
@@ *MapReduce*
*MapReduce* is a software framework introduced by Google in 2004 to
support distributed computing on large data sets on clusters of
computers.[1] Parts of the framework are patented in some countries.[2]
The framework is inspired by the map and reduce functions commonly used
in functional programming,[3] although their purpose in the MapReduce
framework is not the same as their original forms.[4] MapReduce libraries
have been written in almost all programming languages.
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output the answer to the problem it was originally trying to solve. At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web. It replaced the old ad hoc programs that updated the index and ran the various analyses.[12] Lack of type information in interfaces makes it less useful as general purpose mechanism. MapReduce allows for distributed processing of the map and reduction operations. The most famous opensource MapReduce implementation is Apache's Hadoop. Example Uses: Distributed Grep Word Count Reverse Weblink (like distributed grep) =================================================================== @@ *Hadoop* Hadoop, Apache's free and open source implementation of MapReduce. written in the Java programming language. Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses.[4] The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. =================================================================== @@ *BigTable* BigTable is a compressed, high performance, and proprietary database system built on Google File System (also known as the "GFS"), Chubby Lock Service, SSTable and a few other Google technologies; it is currently not distributed nor is it used outside of Google, although Google offers access to it as part of their Google App Engine. Each table has multiple dimensions (one of which is a field for time, allowing for versioning and garbage collection). 
The tablets are compressed using the BMDiff algorithm[10] (referenced in [11]) and the Zippy compression algorithm[12], publicly known and open-sourced as Snappy[13], which is a less space-optimal variation of LZ77 but more efficient in terms of computing time.
===================================================================
@@ *HBase*
Written in Java. Provides BigTable-like support on the Hadoop Core. HBase is an open-source, non-relational, distributed database. It now serves several data-driven websites,[2][3] including Facebook's Messaging Platform.[4]
===================================================================
@@ *Apache Cassandra*
Written in Java. Apache Cassandra is an open-source distributed database management system. It is an Apache Software Foundation top-level project[1] designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. It pairs the BigTable data model with an Amazon Dynamo-like infrastructure. Cassandra provides a structured key-value store with tunable consistency. New in 0.8 is CQL (Cassandra Query Language), an SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC) and Python (DBAPI2).
For more: http://en.wikipedia.org/wiki/Apache_Cassandra
===================================================================
@@ *memcached*
Memcached's APIs provide a giant hash table distributed across multiple machines.
    function get_foo(int userid) {
        /* first try the cache */
        data = memcached_fetch("userrow:" + userid);
        if (!data) {
            /* not found: query the database */
            data = db_select("SELECT * FROM users WHERE userid = ?", userid);
            /* then store in cache until the next get */
            memcached_add("userrow:" + userid, data);
        }
        return data;
    }
*MemcacheDB* is a persistence-backed version of memcached.
http://en.wikipedia.org/wiki/Memcached
Very widely used.
===================================================================
@@ *Hive*
Written in *Java*. Apache *Hive* is a data warehouse infrastructure built on top of *Hadoop* for providing data summarization, query, and analysis.[1] While initially developed by *Facebook*, Apache Hive is now used and developed by other companies such as Netflix.[2][3] Hive is also included in *Amazon* Elastic *MapReduce* on Amazon Web Services. Apache Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called *HiveQL* while maintaining full support for *map/reduce*. To accelerate queries, it provides indexes, including bitmap indexes.[5]
===================================================================
@@ *Pig*
Pig[1] is a high-level platform for creating *MapReduce* programs used with *Hadoop*. The language for this platform is called *Pig-Latin*; Pig Latin abstracts the programming from the *Java* MapReduce idiom into a notation that makes MapReduce programming high-level, similar to that of *SQL* for *RDBMS* systems. Pig Latin can be extended using *UDFs* (User Defined Functions), which the user can write in Java and then call directly from the language. Pig was originally[2] developed at *Yahoo* Research around 2006 to give researchers an ad-hoc way of creating and executing map-reduce jobs on very large data sets.
In 2007[3], it was moved into the *Apache* Software Foundation.[4]
Below is an example of a "Word Count" program in Pig Latin:
    A = load '/tmp/my-copy-of-all-pages-on-internet';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = filter B by word matches '\\w+';
    D = group C by word;
    E = foreach D generate COUNT(C) as count, group as word;
    F = order E by count desc;
    store F into '/tmp/number-of-words-on-internet';
===================================================================
@@ *Amazon-SimpleDB*
Amazon SimpleDB is a distributed database written in Erlang[1] by Amazon.com. It is used as a web service in concert with Amazon Elastic Compute Cloud (EC2) and Amazon S3 and is part of Amazon Web Services. It was announced on December 13, 2007.[2] As with EC2 and S3, Amazon charges fees for SimpleDB storage, transfer, and throughput over the Internet. On December 1, 2008, Amazon introduced new pricing with a free tier[3] of 1 GB of data and 25 machine hours. Transfer to other Amazon Web Services is free of charge.[4] SimpleDB provides eventual consistency, which is a weaker consistency guarantee than the immediate consistency of a traditional RDBMS.
===================================================================
@@ MaxDB : SAP's internal DBMS, which can handle large amounts of data.
===================================================================
@@ *Hypertable*
Hypertable is an open-source database inspired by publications on the design of Google's BigTable, designed to manage the storage and processing of information on a large cluster of commodity servers. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Hypertable was developed as in-house software at Zvents Inc.
In January 2009, Baidu, the leading Chinese-language search engine, became a project sponsor.
===================================================================
@@ Non-Stop SQL ~
The product was originally developed by Tandem Computers. Tandem was acquired by Compaq in 1997; Compaq was in turn acquired by Hewlett-Packard in 2001. The product is used primarily for online transaction processing and is tailored for organizations that need high availability and scalability for their database system. Typical users are stock exchanges and bank ATM networks.[1]
===================================================================
@@ mSQL ~
miniSQL, by Hughes Technologies; popular before MySQL.
===================================================================
@@ *Dynamo* (storage system)
Dynamo is a highly available, proprietary *key-value* structured storage system[1], or distributed data store.[2] It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a Web service, but is used to power parts of other *Amazon* Web Services.[3]
===================================================================
@@ *BSON*
A binary JSON format, used in MongoDB.
===================================================================
@@ *SpiderMonkey*
SpiderMonkey is the code name for the first-ever JavaScript engine, written by Brendan Eich at Netscape Communications, later released as open source and now maintained by the Mozilla Foundation.
===================================================================
@@ Microsoft SQL Server
The term Transact-SQL refers to its SQL variant.
Unique features: can create clustered indexes. A primary key auto-creates a clustered index; any unique constraint creates a non-clustered index by default. You can force-create a clustered index if one does not exist already.
CREATE CLUSTERED INDEX IX_TestTable_TestCol1 ON dbo.TestTable (TestCol1);
You can also explicitly specify that the PRIMARY KEY constraint create a non-clustered index, and then create the clustered index on another unique key.
==================================================================
Notes from Ulman's Book on DBMS ~
Introduction:
* RDBMS concepts originated after the 1970 paper by Ted Codd. Relations are tables. SQL promotes queries at a high level and avoids the details of "navigation" through the database, increasing the efficiency of database programmers.
* Investigate the other models to substantiate the above statement.
On Architecture:
* Any relationship connecting more than 2 entity sets can be converted to a collection of binary, many-one relationships without losing info.
* A many-to-one relation in the relational model should be expressed in two tables, where the second table lists the "many" values and the associated foreign keys. This leads to an O(N) algorithm to search all info (where N = total number of elements in the table containing all "many" values of the 1-many relation). Also note that we don't have pointers to collect the info efficiently. However, indexed access improves performance. And the independence of the data, without bothering about "who all refer to me", makes implementation much simpler.
* B-tree and indexing ~
A B-tree is a balanced tree whose nodes can each have many children. It is heavily used for indexing. Since B-tree nodes are really disk blocks of 4K size, only up to about 3 disk accesses are required to locate the data. A B-tree can be used for a non-unique index as well, with overflow pages holding the duplicates. In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.
Each internal node of a B-tree contains a number of keys. Usually, the number of keys is chosen to vary between d and 2d. Example: 8 to 16 elements; or 4K/8 to 8K/8, i.e. 512 to 1024 elements; etc. In practice, the keys take up most of the space in a node. The factor of 2 guarantees that nodes can be split or combined: if an internal node is already full with 2d keys, adding a key can be accomplished by splitting the overfull node into two d-key nodes and pushing the middle key up to the parent node.
In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+-tree and the B*-tree. In the B+-tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access. (Comer 1979, p. 129) (Note: storing the keys in the leaves gives good scanning performance.) The B*-tree balances more neighboring internal nodes to keep the internal nodes more densely packed. (Comer 1979, p. 129) This variant requires non-root nodes to be at least 2/3 full instead of 1/2. (Knuth 1998, p. 488)
Within 4 levels, you can address more than 10 TB of data.
Bulk loading of a B+-tree should first sort the data and then load it. The fill factor may be controlled (in an implementation-specific way). Then the amount of splitting, with its locking, is uniform and loading is much faster.
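The fanout arithmetic behind the claim above (4 levels reaching beyond 10 TB) is easy to check. A small sketch, assuming the same back-of-envelope numbers as the notes (roughly 512 keys per 4 KB node):

```python
def reachable_entries(fanout, levels):
    """Leaf entries addressable by a B-tree with the given fanout and height."""
    return fanout ** levels

fanout = 512   # ~4 KB node / 8-byte keys, the figure assumed above
entries = reachable_entries(fanout, 4)
print(entries)  # 68719476736, i.e. about 6.9e10 entries

# Even at a modest 1 KB per record, 4 levels address 2**46 bytes (~64 TiB),
# comfortably above the 10 TB figure quoted in the notes.
print(entries * 1024 > 10 * 10**12)  # True
```

This is why B-tree height grows so slowly: each extra level multiplies the reachable data by the fanout (hundreds), not by 2 as in a binary tree.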
==================================================================
* Architecture of a DBMS: Query Processor, Storage Manager, Transaction Manager. The Storage Manager accesses both data and metadata.
* ACID properties of a transaction:
  Atomicity: all of a transaction or none.
  Consistency: writes won't violate any constraints; only valid states.
  Isolation: two transactions have to be independent and should not see each other's intermediate state.
  Durability: a completed transaction should survive subsequent system failures.
* Constraints and triggers are the 2 kinds of active elements found in a DBMS.
Chapter 2: Database Modeling: Discusses the following models:
A database is nothing but a collection of related entities, like a payroll. An RDBMS is just a collection of tables: a collection of associations, a set of associations, where an association is nothing but a record. We can use the term "set" instead of "table". In an RDBMS the only way of relating tables is by joining and by specifying that some elements are unique. Describing the structure of the database in an RDBMS ultimately means defining the set of tables.
Does an RDBMS support specifying the relationships as 1-many or many-1 among its objects, easily? What is an OODBMS? We can specify the structure of the database in an object-oriented way: identify the independent components in the database as objects and specify the relationships between them as 1-many or many-1. Could an OODBMS be an RDBMS?
* E/R model - traditional -- uses components and the relations in a diagram
* ODL - Object Definition Language - emerging standard for object-oriented DBs.
* Network and hierarchical models - old models, used in commercial DBs in the 1970s. They are kind of restricted versions of ODL??
ODL:
- Supports defining the structure of a database (like payroll) in object-oriented terms. Supports predefined datatypes and OO concepts like object identity, inheritance, multiple inheritance.
- Supports translation into declarations in an OODBMS.
- Extension of IDL (Interface Description Language, used in CORBA).
E/R Diagrams:
- Involve entity sets, attributes, relationships -- the same as objects, attributes, relations in ODL. In ODL, however, the relations are directional (you could specify two names, one in each direction) and, importantly, E/R relations can involve many entity sets, while in ODL they can involve at most two.
- Weak entity set in an E/R diagram: if the key of an entity set is composed of attributes some or all of which belong to another entity set, that entity set is a weak entity set.
Differences between Entity-Relationship Model and UML ~
Entity-Relationship concepts are different from object-oriented concepts because:
* With ER concepts, behaviour can't be modelled (which makes ER simpler). There are no operations and messages in ER concepts.
* Today, ER models extended with aggregation and generalization are also called semantic data models. Plain associations don't clearly describe which side is the owner and which the part in an 'aggregation' relationship, for example. (UML marks aggregation with a hollow diamond and composition with a filled diamond.)
* Also, relationship cardinalities in ER are just 1 or M, but in UML they can be 0..1, *, 1..M, etc. These could be adapted directly to the ER model, if needed???
* In the object-oriented world every object has a unique object identifier, which is independent of attribute values. In contrast, the ER model uses keys to identify entities.
Note: Generalization in UML == is-a relation in ER
      Containment in UML == part-of relation in ER
Note: A UML class diagram helps specify the software design. A UML object diagram specifies how specific instances of a class should behave. This is useful either as an example or as a specification. For example, you can say 'the red ball' should be packed only last, though in the class diagram the red ball is just like any other ball.
What is the Network Model? It consists of logical record types (like E/R sets) and links (many-one binary relations).
A restricted E/R model limited to binary relations (for implementation simplicity??). What are the limitations compared to others?
What is the Hierarchical Model? The network model restricted so that all relations have to be owner-member types. Closer to implementation considerations. Unnatural, since many-many relations cannot fit in directly, and virtual types have to be created when translating many-one relations from the network to the hierarchical model. Also, the logical record types and links form a forest -- a collection of trees -- where a tree is the record types plus their connecting links.
Chapter 3: Relational Model Design Schema:
Relation schema: Movie(title, year). Database schema: a set of relation schemas.
- The attributes have to be atomic. Record structures are not permitted.
- Representation of relationships: in ODL both directions of a relationship are explicitly specified. In the relational model the relations are implicit, since keys are involved in the association. When such relations are complex, or for set-valued attributes in ODL, the relation itself should become a separate relation in the RDBMS.
- Representing the "isa" relation in the relational model: the info is distributed over several relations. Additional relations carrying the keys of the original relations define the extra attributes: if B isa A, define an additional relation that holds the key of A plus the extra attributes of B.
- Functional dependency: if A1 A2 ... AN -> B, then the A's functionally determine B. Two tuples that agree on the A's should agree on B.
- The closure of a set of attributes is nothing but the attributes plus everything they functionally determine.
- What are anomalies? They usually occur when we cram too much info into a relation.
  - Redundancy: info repeated unnecessarily.
  - Update anomaly: update only some of the tuples...
  - Deletion anomaly: delete long tuples when d, e, f become invalid, but in the process lose the related info.
- BCNF form guarantees there won't be any anomalies.
- What is the BCNF form of a relation? If A1 A2 A3 ...
AN -> B holds in R, then {A1, ..., AN} must be a superkey of R. I.e., a proper subset of any key cannot functionally determine any other attribute in R.
- What is 1st normal form? Each attribute of a relation should be atomic.
- What is 2nd normal form? It is less restrictive than 3rd. It allows transitive dependencies in a relation, but forbids a nontrivial dependency whose left side is a proper subset of a key.
- What is 3rd normal form? (important) It is less restrictive than BCNF: if A1 A2 .. AN -> B, then {A1, ..., AN} is a superkey or B is a member of some key. This is useful over BCNF where some transitive dependencies exist within the relation which we don't want to decompose further. When they do, some redundancies exist. Revisit this with examples later.
  Consider (a,b,c,d,e) where (a,b) is a key and (a,c) is also a key. So (a,b)->c and (a,c)->b. This is okay. Also say b->c. But b is not a key, so this is a BCNF violation. However, any break-down of this relation into smaller relations will lose information. We can take (b,c) as a separate relation, so (a,b)->c in the data model can be achieved by traversing the relation b->c. However, (a,c)->b can only be achieved in a relation involving all 3 attributes. So 3rd normal form allows this, since "c" is a member of some key (a,c).
Redundancy in BCNF:
- Because of the presence of a multivalued attribute in a relation, the other attributes will be unnecessarily repeated for each value of the multivalued attribute. I.e., for (a,b,C,d,e), if C is a multivalued attribute (say it represents a set), all other attributes will be repeated. Note that here (a,b,c) could be a key. This can easily occur during a direct translation from ODL, where a class has a set attribute and this attribute itself is independent of some of the other attributes.
Defn of Multivalued Dependency:
- A1 A2 .. An ->> B1 B2 ... BM: if we fix the values of the A's, the set of values of the B's is independent of the other attributes in R.
So each key -> all other attributes is a multivalued dependency too (a trivial multivalued dependency). Also, as in the earlier example, each (a,b,ci) ->> d,e is a mul. val. dep., since (a,b,ci) is a key. More interestingly, (a,b) ->> c is a mul. val. dep., since the set of values of C depends only on (a,b) and nothing else (like d or e). A nontrivial mul. val. dep. is one where the B's are not among the A's and R has some attributes other than the A's and B's.
What is 4th normal form? 4th normal form eliminates multivalued dependencies, and thereby the unnecessary repetition of the same info: if A1 A2..An ->> B1 B2 .. Bm is a mul. val. dep., then (A1 A2 ..An) must be a superkey. This is a generalization of BCNF, extended to eliminate mul. val. deps. 4th normal form is more restrictive than BCNF.
Chapter 4: Operations in the Relational Model
Operations can be expressed either
*) in Relational Algebra
*) in Datalog logic
Relational Algebra operations:
*) Set operations: union, intersection, difference
*) Selection: eliminates rows
*) Projection: eliminates columns
*) Cartesian product
*) Natural joins: join on an equality condition over some common attributes
*) Theta joins: join on some arbitrary theta condition. Equivalent to Cartesian product plus selection on that condition.
*) An extension of the relational model supports views: conditional relations that are evaluated on demand.
Datalog logic:
*) More powerful than Relational Algebra since it supports recursion; otherwise the same power.
*) Consists of predicates and atoms: in P(x,y,z), P is a predicate and (x,y,z) is an atom.
*) A query can be expressed as a set of predicates, like in Prolog.
*) Extensional predicates are the ones whose relations are stored in the database. Intensional predicates are computed by applying one or more Datalog rules. Views are intensional predicates.
What do Relational Algebra and Datalog not support that commercial DBMSs do?
*) Insertion, deletion, updates, null values, and aggregation operators like count, sum, min, max, avg.
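The core relational-algebra operators listed above can be mimicked over in-memory Python relations (lists of row dicts). A toy sketch for intuition only -- nothing like how a real query engine evaluates these; the Movie/StarsIn-style data is made up:

```python
def select(rel, pred):
    """Selection (restrict): keep only the rows satisfying the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, attrs):
    """Projection: keep the named columns, eliminating duplicates (set semantics)."""
    out = []
    for row in rel:
        sub = {a: row[a] for a in attrs}
        if sub not in out:
            out.append(sub)
    return out

def natural_join(r, s):
    """Natural join: equate every attribute name the two relations share."""
    shared = set(r[0]) & set(s[0]) if r and s else set()
    return [{**a, **b} for a in r for b in s
            if all(a[k] == b[k] for k in shared)]

movies = [{"title": "Star Wars", "year": 1977}, {"title": "Alien", "year": 1979}]
stars  = [{"title": "Star Wars", "star": "Ford"}, {"title": "Alien", "star": "Weaver"}]
recent = select(natural_join(movies, stars), lambda row: row["year"] > 1978)
print(project(recent, ["star"]))  # [{'star': 'Weaver'}]
```

Note that aggregation and GROUP BY, as the notes say, are SQL extensions: none of the three operators above can express them.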
How to express constraints on relations? Just make some assertions. That's it. A referential constraint could be written like: project(R1) should be contained in project(R2), i.e. values appearing in some column of R1 must also appear in R2. R1 < R2 says R2 contains all of R1. However, it is not clear how commercial DBMSs support such constraints when they involve many relations.
Bags: same as sets, but duplicate entries are allowed.
Chapter 5: The Database Language SQL:
- ANSI SQL; SQL2 is the updated standard of 1992, also called SQL-92. SQL3 is evolving.
- Comparison of strings: equality, <, > are supported. s LIKE p (where p is a pattern) is supported; % is the regexp '*'. Date and time are special data types.
- ORDER BY is supported.
- To form a query that involves tuples of the same relation, use:
  select ... From MovieStar as S1, MovieStar as S2 where S1... = S2... ;
- An unintuitive consequence of SQL semantics:
  select R.A From R, S, T where R.A = S.A OR R.A = T.A
  may give an empty result if T is empty. The behaviour is not intuitive.
- The set difference operator in SQL is "EXCEPT".
- Nested SQL statements could improve efficiency, like:
  select name From Movie where cert_number = (select prodno from ... );
- Supported operators involving a scalar (or scalar tuple) and unary (or not) relations: EXISTS, IN, ALL, ANY: EXISTS R, s IN R, s > ALL R, s > ANY R.
- Scoping rules in SQL queries: names are resolved from the innermost level outward.
- Duplicate elimination:
  Select distinct field From .... ;
  Select ..... UNION select .... ;    (UNION by default eliminates duplicates)
  Select .... UNION ALL select .. ;   (bag semantics instead of set; with ALL there is no elimination)
- Aggregation:
  select name, SUM(length) From Movie (where ..) group by name;
  Aggregations are per group. Only the attributes listed in GROUP BY may appear unaggregated in the select clause.
- Having clause: if we want to select groups based on an aggregate property of the group itself, use a "having" clause after "group by":
  select name, SUM(length) From Movie (where...) group by name having MIN(year) > 1930;
  Only those groups are selected.
- Database modifications:
  Insertion: insert into R(a1,..an) values(v1,..vn);
  insert into Studio(name) select Moviename .... ; /* only the name attribute is inserted. All other attributes are set to NULL */
  Deletion: delete From T1 where .... ;
- Insert inserts without checking for duplicates. Deletion deletes all tuples matching the condition. In SQL you cannot specify to delete only one tuple!!!!
- Update:
  update movieexec SET name = 'pres' || name /* || is string concatenation */
  where cert_no IN (select pres# FROM Studio);
- Defining a relation schema: principal data types: char(n), varchar(n), BIT(n), BIT VARYING(n), int (or integer), smallint, float or real, double precision, decimal(n,d) (1234.56 is decimal(6,2)). DATE and TIME are also data types.
- Modifying relation schemas:
  drop table R;
  Alter table MovieStar Add phone char(16); /* the default value is NULL */
  Alter table MovieStar Add phone char(16) Default 'unlisted';
  create table t1 ( name char(30) default 'unknown' );
- Domain: aliasing a type:
  create domain moviedomain AS varchar(50) Default 'unknown';
  drop domain moviedomain;
- Indexes:
  create index yearindex on Movie(year); /* speeds up queries comparing year with a constant */
  Creation of indexes is not part of any SQL standard up to SQL2.
  create index keyindex on Movie (title, year); -- a multi-column index
  drop index keyindex;
Views:
  create view movieprod(viewtitle, viewmoviename) AS select title, name From ....;
  If (viewtitle, viewmoviename) is omitted, "title, name" would apply. We can modify (updatable) views just as we modify tables. drop view movieprod; will drop the view.
Null values and outer joins: the outer join is supported only from SQL2 onwards. Useful when a tuple of one relation fails to join with the other relation.
This will have at most N more tuples in the result compared to the non-outer version of the same join, where N is the number of tuples in the first relation. The truth value UNKNOWN exists, similar to TRUE and FALSE, and only tuples whose condition evaluates to TRUE appear in the answer. However, FALSE AND UNKNOWN is FALSE, TRUE OR UNKNOWN is TRUE, etc.
  select .... FROM Movie Cross Join Starsin (where .... ) ; /* Cartesian product */
  select ... FROM Movie JOIN Starsin ON title = .... AND year = movieyear ; /* a join on some arbitrary condition */
  select ... FROM MovieStar NATURAL join MovieExec; /* natural join: joins the same-named attributes on an equality condition */
  select ... FROM MovieStar (natural) full outer join MovieExec; /* outer join both ways */
  select ... FROM MovieStar left outer join MovieExec; /* outer join padding from the left */
  select ... FROM MovieStar right outer join MovieExec; /* outer join padding from the right */
Recursion in SQL3:
  With [ R AS definition-of-R ]* e.g.
  With Pairs AS (select frm, to FROM Flights),                               /* defn 1 */
       Recursive Reaches(frm, to) AS (Pairs UNION (select ... where ...))    /* 2nd defn */
  select * FROM Reaches;   /* query involving the definitions */
Chapter 6: Constraints and Triggers in SQL:
- Triggers are not part of SQL2; they will mostly be part of SQL3. Many commercial systems do offer some form of trigger.
- Keys in SQL:
  create table t ( name char(30) primary key, ... );
  OR create table t ( name char(30), .... , primary key (name) );
  To declare a composite key use: primary key (name, place);
  Declare other keys as: unique(name) OR name char(30) unique; etc.
  To enforce a key constraint it is important to have indexes on the keys... otherwise insertions could be very time consuming.
To maintain referential integrity, the following policies can be used:
  reject policy: inserts, deletions and updates that violate the constraint are rejected by the system. This is the default policy.
  cascade policy: the other tuples violating the integrity will automatically be deleted or updated.
set-null policy: the violating tuples will be set to null as appropriate.
To specify a specific policy:
  create table t ( name char(30) references anothertable(name) ON delete set null ON update cascade );
- NOT NULL constraint: pres_number INTEGER references MovieExec(serno) NOT NULL;
- Attribute-based CHECK constraint:
  pres_number INTEGER references MovieExec(serno) CHECK (pres_number > 10),
  gender char(1) check (gender IN ('F', 'M')),
- Domain constraints: create domain genderdomain char(1) check (value in ('F', 'M'));
- SQL2 gives the ability to declare a constraint DEFERRED; checking is then done only at the end of the transaction. How to declare it?
- Global constraints:
  * A tuple-based check constraint applies to tuples of a single relation:
    create table t(....., check(f1 = X OR f2 = Y ));
  * Assertions can involve more than one relation:
    create assertion richpres check (not exists (select * from studio, movieexec where ...) );
- If you give a name to a constraint, you can drop it later:
  gender char(1) constraint gencheck check (gender in ('f', 'm')), ....
  Later... Alter table T drop constraint gencheck; will drop the constraint.
- You can add constraints using the alter table command.
- You can add constraints using the alter domain command.
- drop assertion assertion_name; will drop the assertion.
- *Trigger* : available only in SQL3.
  create trigger networthtrigger ~
  after update of networth on MovieExec
  referencing old as oldtuple, new as newtuple
  when (oldtuple.networth > newtuple.networth)
    update MovieExec SET networth = oldtuple.networth where cert# = newtuple.cert#
  FOR EACH ROW
  A trigger can execute before/after/instead of an update/insert/delete, and you can specify whether it runs at row level or statement level.
- Assertions in SQL3: SQL3 extends SQL2 assertions to specify the conditions under which the assertion is to be checked.
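SQLite supports a close analogue of the SQL3 row-level trigger sketched above (it has no REFERENCING clause; OLD and NEW are exposed directly). A minimal runnable version of the networth example via Python's sqlite3 -- the table and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movieexec (cert INTEGER PRIMARY KEY, networth INTEGER);
INSERT INTO movieexec VALUES (1, 1000);

-- Undo any update that lowers an executive's net worth,
-- mirroring the SQL3 networthtrigger above.
CREATE TRIGGER networthtrigger
AFTER UPDATE OF networth ON movieexec
FOR EACH ROW WHEN old.networth > new.networth
BEGIN
    UPDATE movieexec SET networth = old.networth WHERE cert = new.cert;
END;
""")

conn.execute("UPDATE movieexec SET networth = 500 WHERE cert = 1")
worth = conn.execute("SELECT networth FROM movieexec WHERE cert = 1").fetchone()[0]
print(worth)  # 1000 -- the trigger restored the old value
```

The compensating UPDATE inside the trigger body does not re-fire the trigger here: it raises networth, so the WHEN condition is false (and SQLite's recursive triggers are off by default anyway).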
==================================================================
Secondary Index ~
==================================================================
A secondary index, put simply, is a way to efficiently access records in a database (the primary) by means of some piece of information other than the usual (primary) key. In Berkeley DB, this index is simply another database whose keys are these pieces of information (the secondary keys), and whose data are the primary keys.
Note: Of course, the index can be unique or non-unique.
==================================================================
Informix Learning:
- The power of stored procedures:
  - A stored procedure is associated with the database and stored as binary.
  - Creating/defining stored procedures is part of the SQL standard.
    create procedure f(i INT default 0) returning int, int, char(15);
      define j int;
      ....
      SYSTEM 'rm /tmp/*.txt';      # No way of getting the system output into a local var.
      TRACE ON;                    # every expression will be traced.
      return a, b, c with resume;  /* this procedure acts like a select expression */
    end procedure;
  - What kinds of statements are allowed inside a procedure? Conditionals, looping, TRACE ON, LET, CALL (call another procedure), foreach select-statement ... end foreach;
  - A stored procedure can replace a select expression in another stored procedure when "return ... with resume" is used.
  - String manipulation routines are not available inside stored procedures.
  - Recursive procedures are allowed.
  - Using the on exception ... statement, you can capture (and raise) SQL exceptions, like -206: table does not exist.
- The power of ESQL/C:
  . It has all the power of general SQL: EXEC SQL IMMEDIATE (that means it can also define and use stored procedures).
  . It gives finer control by providing cursors and row-by-row manipulation:
    EXEC SQL DECLARE CURSOR .... FOR UPDATE; .... set .... where current of cursor_name;
    A cursor doesn't have any meaning outside ESQL/C. Stored procedures and triggers are objects in the database.
A cursor is a mechanism for ESQL/C to interact with the database.
  . The whenever [SQLERROR|SQLWARNING|NOT FOUND] [CONTINUE|STOP|GOTO|CALL] syntax can gracefully handle SQL errors in ESQL/C code.
- Update syntax: update tabname set a1 = ..., b1 = ..., .... where c = '...';
  This does not provide looping control. 'if else' control can be achieved by:
  update tabname set a1 = case ..when-else... end, b1 = ... etc.
  But calling a stored procedure could be the simplest:
  update tabname set (a1, b1, c1...) = storedprocedure(d1, e1, ...) where ....;
- SQL string built-in functions supported: trim, substr, substring, replace, lpad, rpad, upper, lower, initcap
Database Pointers:
  NIST validation software download page: http://www.itl.nist.gov/div897/ctg/software.htm ; validation suites include SQL, CGM, Cobol85, Fortran78, PHIGS, Posix, RDA, and VRML
  Validated products list, full list: ftp://speckle.ncsl.nist.gov/vpl/intro.htm
  SQL section: ftp://speckle.ncsl.nist.gov/vpl/sqlintro.htm
  SQL project home page: ftp://speckle.ncsl.nist.gov/sql-testing/contents_sql.htm
  SQL tutorial: ftp://speckle.ncsl.nist.gov/sql-testing/examples_sql.htm
  The SQL:2003 draft can be downloaded from: http://www.wiscorp.com/
Since the dissolution of NIST's data management standards program in 1996, no certification or other mechanism exists to verify whether database manufacturers are in compliance with the ANSI SQL standard.
A typical database dictionary includes:
  * detailed descriptions of tables and fields
  * indexing information
  * referential integrity constraints
  * database schema definitions
  * stored procedures and database triggers
  * access control information, such as usernames, roles, and privileges
  * storage allocation parameters
  * database usage statistics
INFORMATION_SCHEMA is part of the SQL:2003 standard. The data dictionary is stored in this schema, which is accessible "like a database". The dictionary tables are accessed via INFORMATION_SCHEMA.
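SQLite has no INFORMATION_SCHEMA; its data dictionary is the sqlite_master catalog table instead, but the idea is the same: the dictionary is queried "like a database". A quick illustration (the movie table and index are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.execute("CREATE INDEX yearindex ON movie (year)")

# The catalog itself is an ordinary queryable table listing every
# schema object: tables, indexes, views, triggers.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name").fetchall()
print(rows)  # [('table', 'movie'), ('index', 'yearindex')]
```

On a DBMS that implements the standard (e.g. PostgreSQL or MySQL), the equivalent query would go against INFORMATION_SCHEMA.TABLES and friends.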
See http://labs.google.com
See difference between natural join and theta join.

A database contains multiple SQL-created schemas. A database also has an
implementation-specific catalog which maintains these schemas. Some people
call the catalog another (predefined) schema. It is also OK to say a catalog
includes multiple schemas -- since db:(system_catalogs) is a 1:1 relation.
For example, postgres contains the catalogs pg_catalog, pg_toast -- one (set)
per db. However, one db can contain multiple SQL-created schemas.
Syntactically, the catalog is treated like a schema, though.

What is a schema? It just provides a new namespace within a database. Tables,
etc. are "schema objects" since they reside in a schema. Note, however, that
roles and users are not schema objects.

From the SQL:2003 standards doc:
4.2.8 Catalogs and schemas
4.2.8.1 Catalogs
A catalog is a named collection of SQL-schemas, foreign server descriptors,
and foreign data wrapper descriptors in an SQL-environment. The mechanisms
for creating and destroying catalogs are implementation-defined. The default
catalog name for <preparable statement>s that are dynamically prepared in the
current SQL-session through the execution of <prepare statement>s and
<execute immediate statement>s is initially implementation-defined but may be
changed by the use of <set catalog statement>s.
Note: Is there a SET CATALOG statement ???
4.2.8.2 SQL-schemas
An SQL-schema, often referred to simply as a schema, is a persistent, named
collection of descriptors. Any object whose descriptor is in some SQL-schema
is known as an SQL-schema object. A schema, the schema objects in it, and the
SQL-data described by them are said to be owned by the authorization
identifier associated with the schema. SQL-schemas are created and destroyed
by execution of SQL-schema statements (or by implementation-defined
mechanisms).
4.2.8.3 The Information Schema
Every catalog contains an SQL-schema with the name INFORMATION_SCHEMA that
includes the descriptors of a number of schema objects, mostly view
definitions, that together allow every descriptor in that catalog to be
accessed, but not changed, as though it was SQL-data. The data available
through the views in an Information Schema includes the descriptors of the
Information Schema itself. Each Information Schema view is so specified that
a given user can access only those rows of the view that represent
descriptors on which that user has privileges.

OLAP - online analytical processing

Database Schema Design:
* http://www.youtube.com/watch?v=KqvIGYjcLQ4 (Barry's tutorial) (also at microsoft)
===================================================================
Standards re Relational Databases
Standards, or precise definitions, or specific guidelines for the following
have existed for over 30 years. In progressive order, each containing the
previous:
1. Normalisation (for any database). (A standard is not required; it is an
   engineering principle, common across many disciplines; a simple process
   for those qualified; and its absence is easily identified.)
2. Relational Databases: The Relational Model: E F Codd & C J Date
3. The famous Twelve Rules of Relational Compliance: E F Codd
4. Relational Data Modelling: IDEF1X; 1980s; NIST Standard FIPS 184 of 1993
===================================================================
Tools:
* When I model data, I use IDEF1X
* When I model functions, I use IDEF0 or SSADM (depending on the maturity of the users)
* When I model objects, I use UML
===================================================================
Database Design Approach:
=========================
I prefer to use a three-stage approach: conceptual data modeling, logical
data modeling, and physical data modeling. The use of fancy tools depends on
the scope of the project.

The first stage is analysis, also known as requirements definition.
The result of the first stage is a conceptual data model and related
diagrams. The first stage is data-model agnostic. I use ER modeling and ER
diagrams. Attributes are discovered, and their domains are determined, if
possible. Attributes are connected to subject-matter entities and to
relationships among entities. Relationships are identified, but not
implemented via foreign keys -- that's logical design. Natural keys are
identified and evaluated as to whether they can be trusted. The notation
involves attributes, domains, entities, and relationships.

The second stage is logical design. The result of the second stage is a
logical data model, expressed in SQL terminology. I use SQL terms like
columns, rows, tables, and domains, although these are stand-ins for
attributes, tuples, relations, and domains. The logical model is specific to
the relational data model, but is DBMS agnostic. Unlike some practitioners, I
use natural keys as PKs when they can be trusted. Otherwise, I invent
surrogates. The main difference is that foreign keys are now in the picture.
Every entity gets a table. Some relationships are modeled by adding foreign
keys to entity tables, while other relationships get tables of their own;
many-to-many relationships are an example of the latter. Issues like table
composition and normalization are dealt with at this stage. Unlike some
practitioners, I don't treat normalization as some kind of religion. I make
design decisions in view of the consequences; however, departures from
normalization have to have a specific justification. If I were to design a
non-relational database, the second stage would look very different.

The third stage is physical design. This results in a physical data model.
The physical data model starts with the logical data model, and adds features
like indexes, tablespaces, stored procedures, and what have you.
The physical design is DBMS specific, and takes into account volume, traffic,
performance goals, and resources available. The physical data model is a
blueprint for database construction.
===================================================================
Design tips:
The missing cardinality is critical. Putting it in will actually assist in
resolving the model. I am not fussed re IDEF1X (circles) vs IE (crow's feet),
but I always put them in as I go, and keep changing them as the model
progresses.

Looking to figure out the proper way to model the below requirements:
1. There are 3 types of parties to be concerned with: a Fan, a Band, and a
   BandMember.
2. A BandMember will always be associated with a Band and can also be a Fan
   of any band.
3. There are common attributes between a Fan, a Band, and a BandMember, but
   each of these 3 will also have their own unique attributes.
4. A Fan can be a fan of any Band or none at all.

Here you understand the data really well, but (no one has taught you this
and) you have confused Roles and Subtypes.
- Band Member is a role.
- Fan is a role.
- People is an entity.
- Band is an entity.
You can have something like below:

CREATE TABLE XREFBandMembers(
    [MemberID] [int] NOT NULL,
    [BandId] [int] NOT NULL,
    PRIMARY KEY (MemberID, BandId));
CREATE TABLE XREFBandFans(
    [FanId] [int] NOT NULL,
    [BandId] [int] NOT NULL,
    PRIMARY KEY (FanId, BandId));
CREATE TABLE People(
    [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Name] [nvarchar](100) NOT NULL);
CREATE TABLE [dbo].[Bands](
    [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Name] [nvarchar](50) NOT NULL);
===================================================================
Example data schema design for Reddit:
The co-founder of Reddit gave a presentation on issues they had while scaling
to millions of users. A summary is available here. What surprised me is
point 3: Instead, they keep a Thing Table and a Data Table.
Everything in Reddit is a Thing: users, links, comments, subreddits, awards,
etc. Things keep common attributes like up/down votes, a type, and creation
date. The Data table has three columns: thing id, key, value. There's a row
for every attribute: a row for title, url, author, spam votes, etc. When they
add new features they didn't have to worry about the database anymore. They
didn't have to add new tables for new things or worry about upgrades.

This is a data model known as EAV, for entity-attribute-value. It has its
uses. A prime example is patient test data, which is naturally sparse since
there are hundreds of thousands of tests which might be run, but typically
only a handful are present for a patient. A table with hundreds of thousands
of columns is silly, but a table with EAV makes good sense.

Row modeling, where facts about something (in this case, a sales transaction)
are recorded as multiple rows rather than multiple columns, is a standard
data modeling technique. The differences between row modeling and EAV (which
may be considered a generalization of row modeling) are:
* A row-modeled table is homogeneous in the facts that it describes: a Line
  Items table describes only products sold. By contrast, an EAV table
  contains almost any type of fact.
===================================================================
Database design notation:
Notations, for example: UML, IDEF1X, Barker, Information Engineering (IE).
===================================================================
An optional car loan (car_order::car_loan) may be implemented using a
nullable foreign key ?? This is a 1::0..1 relationship.
===================================================================
Database design for products, taken from stackoverflow.com:
http://stackoverflow.com/questions/695752/product-table-many-kinds-of-product-each-product-has-many-parameters/695860
I want to support many kinds of products (TV, Phone, PC, ...).
Each kind of product has a different set of parameters, like:
  o Phone will have Color, Size, Weight, OS...
  o PC will have CPU, HDD, RAM...
The set of parameters must be dynamic. You can add or edit any parameter.

You have at least these five options for modeling the type hierarchy you
describe:
* Single Table Inheritance: one table for all Product types, with enough
  columns to store all attributes of all types. This means a lot of columns,
  most of which are NULL on any given row.
* Class Table Inheritance: one table for Products, storing attributes common
  to all product types. Then one table per product type, storing attributes
  specific to that product type.
* Concrete Table Inheritance: no table for common Product attributes.
  Instead, one table per product type, storing both common product attributes
  and product-specific attributes.
* Serialized LOB: one table for Products, storing attributes common to all
  product types. One extra column stores a BLOB of semi-structured data, in
  XML, YAML, JSON, or some other format. This BLOB allows you to store the
  attributes specific to each product type. You can use fancy design patterns
  to describe this, such as Facade and Memento. But regardless, you have a
  blob of attributes that can't be easily queried within SQL; you have to
  fetch the whole blob back to the application and sort it out there.
* Entity-Attribute-Value: one table for Products, and one table that pivots
  attributes to rows instead of columns. EAV is not a valid design with
  respect to the relational paradigm, but many people use it anyway. This is
  the "Properties Pattern" mentioned by another answer. See other questions
  with the eav tag on StackOverflow for some of the pitfalls.
===================================================================
Any recommended approach for EAV pattern ? ~
If you can live with the drawbacks of NoSQL databases, the best way to
approach the EAV pattern is with a NoSQL alternative like CouchDB or MongoDB.
These databases offer a "schemaless" design which allows each row to have its
own schema. Doing EAV with a traditional RDBMS is asking for trouble, as
querying becomes very difficult and performance suffers as the dataset grows.
An alternative that I have used successfully in the past is to combine an
RDBMS with a NoSQL variant (MySQL and MongoDB). I use MySQL to store the EAV
data (gaining transactional integrity), and I use MongoDB as a reporting
store to get around the querying issues with the EAV model.
===================================================================
@@ Difference between Clustered vs Non-clustered Indexes
A clustered index is a special type of index that reorders the way records in
the table are physically stored. Therefore a table can have only one
clustered index. The leaf nodes of a clustered index contain the data pages.
A nonclustered index is a special type of index in which the logical order of
the index does not match the physical stored order of the rows on disk. The
leaf node of a nonclustered index does not consist of the data pages.
Instead, the leaf nodes contain index rows. Only some DBMSs support this
notion.
See also: Secondary Index (whose data leaf contains the primary key).
===================================================================
@@ Database basic questions one must know (from Stack Overflow)
A reprint of my answer there, as general guidelines for topics:
Basics:
  SELECTing columns from a table
  Aggregates Part 1: COUNT, SUM, MAX/MIN
  Aggregates Part 2: DISTINCT, GROUP BY, HAVING
Intermediate:
  JOINs, ANSI-89 and ANSI-92 syntax
  UNION vs UNION ALL
  NULL handling: COALESCE & native NULL handling
  Subqueries: IN, EXISTS, and inline views
  Subqueries: correlated
  WITH syntax: subquery factoring/CTE
  Views
Advanced Topics:
  Functions, Stored Procedures, Packages
  Pivoting data: CASE & PIVOT syntax
  Hierarchical Queries
  Cursors: implicit and explicit
  Triggers
  Dynamic SQL
  Materialized Views
  Query Optimization: Indexes
  Query Optimization: Explain Plans
  Query Optimization: Profiling
  Data Modelling: Normal Forms, 1 through 3
  Data Modelling: Primary & Foreign Keys
  Data Modelling: Table Constraints
  Data Modelling: Link/Corollary Tables
  Full Text Searching
  XML
  Isolation Levels
  Entity Relationship Diagrams (ERDs), Logical and Physical
  Transactions: COMMIT, ROLLBACK, Error Handling
  Denormalizing, replication, query index hints, partitioning, etc.
===================================================================
@@ todo
Download the schema examples and load with workbench.
Download data modeler from Oracle:
http://www.oracle.com/technetwork/developer-tools/datamodeler/sqldevdm31ea-download-515132.html
Download Oracle installed virtual box
==================================================================
How MVCC works in PostgreSQL ~
==================================================================
From: http://onlamp.com/onlamp/2001/05/25/postgresql_mvcc.html
Let's look deeper into how MVCC works in PostgreSQL to allow "no-locking."
Each row in PostgreSQL has two transaction IDs: a creation transaction ID for
the transaction that created the row, and an expiration transaction ID for
the transaction that expired the row. When an UPDATE is performed, PostgreSQL
creates a new row and expires the old row. It's the same row -- just
different versions. PostgreSQL creates a new version of the row and also
retains the old or expired version.
Database systems that use row-level locking do not retain old versions of the
data, hence the need for locks to maintain data consistency.

So now that you know how PostgreSQL creates versions of the data, you might
be wondering how it knows which version to display. It's quite simple. At the
start of a query, PostgreSQL records two things: 1) the current transaction
ID, and 2) all in-process transaction IDs. When a query is issued, PostgreSQL
displays all the row versions that match the following criteria: the row's
creation transaction ID is a committed transaction and is less than the
current transaction counter, and the row lacks an expiration transaction ID
or its expiration transaction ID was in process at query start. The power of
MVCC is in keeping track of transaction IDs to determine the version of the
data, and thereby avoid having to issue any locks. It's very logical and
efficient. New users of PostgreSQL will be pleased with the performance
improvements of MVCC over row-level locking, especially those running in a
large multi-user environment. Additionally, MVCC offers another advantage:
hot backups.

Note: A disadvantage of MVCC is that more transactions may be aborted at
commit time. ??? Huge parallel table updates may end up consuming huge disk
space.
Note: InnoDB also implements MVCC (via undo logs) along with row-level
locking, though some consider its implementation less clean than
PostgreSQL's.
==================================================================
Pete's Article on MVCC ~
==================================================================
http://www.mysqlperformanceblog.com/2007/12/19/mvcc-transaction-ids-log-sequence-numbers-and-snapshots/
==================================================================
Natural Join ~
==================================================================
It is a join on equally-named columns only. If there are no common column
names, it acts like a cross join. A Natural Left Join is a Left Outer Join
with the same 'natural' condition.
Note: For natural joins, the resulting joined table contains only one column
for each pair of equally-named columns.
Note: It is dangerous to use, because adding or renaming columns silently
changes the join condition.
Query for Natural Left Join:
  SELECT * FROM employee NATURAL LEFT JOIN Emp_sal;
It is a left outer join; the commonly-named *field(s)* in both tables should
match. All left-table entries will be listed even if there is no match.
Note: Any non-implicit join (the implicit join operator is ",") requires an
"ON f1 = f2" type of condition; for a natural join, the condition is
implicit. There are different syntactic flavors; when you find one of these,
change it to a simple "JOIN" with an explicit condition -- unless you have a
strong, documented procedure about altering tables.
==================================================================
Emulating Outer Join ~
==================================================================
Some database systems do not support the full outer join functionality
directly, but they can emulate it through the use of an inner join plus UNION
ALL selects of the "single table rows" from the left and right tables
respectively.
The same example can appear as follows:

SELECT employee.LastName, employee.DepartmentID,
       department.DepartmentName, department.DepartmentID
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID

UNION ALL

SELECT employee.LastName, employee.DepartmentID,
       CAST(NULL AS VARCHAR(20)), CAST(NULL AS INTEGER)
FROM employee
WHERE NOT EXISTS (SELECT * FROM department
                  WHERE employee.DepartmentID = department.DepartmentID)

UNION ALL

SELECT CAST(NULL AS VARCHAR(20)), CAST(NULL AS INTEGER),
       department.DepartmentName, department.DepartmentID
FROM department
WHERE NOT EXISTS (SELECT * FROM employee
                  WHERE employee.DepartmentID = department.DepartmentID)
==================================================================
Joins Explained ~
==================================================================
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
The visual explanation assumes that the join conditions involve key columns,
so that the join yields at most 1 match per row (0 or 1). The condition is
assumed to be A.name = B.name (and name is unique).
  A Inner Join B      == A intersect B
  A Full Outer Join B == A union B
  A Left Outer Join B == A (no matter how much B overlaps)
  A Left Outer Join B WHERE B.id IS NULL == A - B (matching rows of B excluded)
  A Full Outer Join B WHERE A.id IS NULL OR B.id IS NULL == (A-B) U (B-A)
    (common elements removed)
A cross join can't be explained so simply -- maybe a bunch of circles around
table A.
==================================================================
Window Functions ~
==================================================================
From http://www.postgresql.org/docs/8.4/static/tutorial-window.html
A window function performs a calculation across a set of table rows that are
somehow related to the current row. This is comparable to the type of
calculation that can be done with an aggregate function.
But unlike regular aggregate functions, use of a window function does not
cause rows to become grouped into a single output row -- the rows retain
their separate identities. Behind the scenes, the window function is able to
access more than just the current row of the query result.

Here is an example that shows how to compare each employee's salary with the
average salary in his or her department:

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;

  depname  | empno | salary |          avg
-----------+-------+--------+-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)

The first three output columns come directly from the table empsalary, and
there is one output row for each row in the table. The fourth column
represents an average taken across all the table rows that have the same
depname value as the current row. (This actually is the same function as the
regular avg aggregate function, but the OVER clause causes it to be treated
as a window function and computed across an appropriate set of rows.)

A window function call always contains an OVER clause following the window
function's name and argument(s). This is what syntactically distinguishes it
from a regular function or aggregate function. The OVER clause determines
exactly how the rows of the query are split up for processing by the window
function. The PARTITION BY list within OVER specifies dividing the rows into
groups, or partitions, that share the same values of the PARTITION BY
expression(s).
For each row, the window function is computed across the rows that fall into
the same partition as the current row. Although avg will produce the same
result no matter what order it processes the partition's rows in, this is not
true of all window functions. When needed, you can control that order using
ORDER BY within OVER. Here is an example:

SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
FROM empsalary;

  depname  | empno | salary | rank
-----------+-------+--------+------
 develop   |     8 |   6000 |    1
 develop   |    10 |   5200 |    2
 develop   |    11 |   5200 |    2
 develop   |     9 |   4500 |    4
 develop   |     7 |   4200 |    5
 personnel |     2 |   3900 |    1
 personnel |     5 |   3500 |    2
 sales     |     1 |   5000 |    1
 sales     |     4 |   4800 |    2
 sales     |     3 |   4800 |    2
(10 rows)

As shown here, the rank function produces a numerical rank within the current
row's partition for each distinct ORDER BY value, in the order defined by the
ORDER BY clause. rank needs no explicit parameter, because its behavior is
entirely determined by the OVER clause.
==================================================================
Database Modeling ~
==================================================================
ER
===
* Entity-relationship modeling is a database modeling method.
* Produces ER diagrams.
* EER is Enhanced ER modeling, which supports inheritance, etc.
* An ER diagram can use different types of notations:
  * Chen notation - 1976 paper - diamonds for relations, rectangles for
    entities, ovals for attributes.
* ER models roughly correspond to just 1 of the 14 different modeling
  techniques offered by UML. ??

Object Role Modeling
====================
Object Role Modeling (ORM) allows the semantics of a universe of discourse
to be modeled using natural languages and diagrams.
Don't confuse this with Object-Relational Mapping: object-relational mapping
(ORM, O/RM, O/R mapping) in computer software is a programming technique for
converting data between incompatible type systems in object-oriented
programming languages. This creates, in effect, a "virtual object database"
that can be used from within the programming language.
==================================================================
Data Modeling Notation ~
==================================================================
Which notation is best? For data modeling we use an ER-diagramming tool. In
an ER diagram the relation itself is sometimes pictured as a diamond, and
sometimes just written on the relation line. In the database, a relation can
become another table!

Crow's feet notation (aka IE - Information Engineering) seems to be good:
  Person >0-------||- Location    (0..n to 1 relation)
Means 0, 1, or more persons born in a location. Every person has only 1
location.
  >|-------||-                    (1..n to 1 relation)
The same thing in a UML model would be:
  ----------         born-in          ------------
  | Person | 0..N ---------------> 1 | Location  |
  ----------                         ------------
==================================================================
Query Evaluation ~
==================================================================
Sort-Merge Join (R JOIN S ON R.i = S.j): sort R and S on the join column,
then scan them to do a merge (on the join column), and output result tuples.

Avoid nested queries, temporary relations, complex conditions, and
operations like DISTINCT and GROUP BY.

Summary on Unnesting Queries
-----------------------------
DISTINCT at top level: can ignore duplicates. Can sometimes infer DISTINCT
at top level! (e.g. the subquery clause matches at most one tuple)
DISTINCT in subquery w/o DISTINCT at top: hard to convert.
Subqueries inside OR: hard to convert.
ALL subqueries: hard to convert. EXISTS and ANY are just like IN.
Aggregates in subqueries: tricky.
Good news: some systems now rewrite under the covers (e.g. DB2).
(From: Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke)
==================================================================
Transaction Locking ~
==================================================================
Strict Two-phase Locking (Strict 2PL) Protocol:
- Each Xact must obtain an S (shared) lock on an object before reading, and
  an X (exclusive) lock on an object before writing.
- All locks held by a transaction are released when the transaction
  completes.
- If an Xact holds an X lock on an object, no other Xact can get a lock
  (S or X) on that object.
Strict 2PL allows only serializable schedules.
==================================================================
Recovering From a Crash ~
==================================================================
There are 3 phases in the ARIES (Ramakrishnan) recovery algorithm:
Analysis: Scan the log forward (from the most recent checkpoint) to identify
  all Xacts that were active, and all dirty pages in the buffer pool, at the
  time of the crash.
Redo: Redoes all updates to dirty pages in the buffer pool, as needed, to
  ensure that all logged updates are in fact carried out and written to disk.
Undo: The writes of all Xacts that were active at the crash are undone (by
  restoring the before value of the update, which is in the log record for
  the update), working backwards in the log. (Some care must be taken to
  handle the case of a crash occurring during the recovery process!)
Write-ahead logging (WAL) is used to undo the actions of aborted transactions
and to restore the system to a consistent state after a crash.
Consistent state: only the effects of committed Xacts are seen.
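The redo/undo idea can be sketched in a few lines of Python. This is a toy
log replay, not ARIES itself (no LSNs, checkpoints, or CLRs); the record
layout and names are made up:

```python
# Toy crash recovery: replay a WAL, then undo updates of uncommitted Xacts.
# Each update record is (xid, page, before_value, after_value).
log = [
    ("T1", "P1", 0, 10),
    ("T2", "P2", 0, 20),
    ("T1", "P1", 10, 15),
    ("COMMIT", "T1", None, None),
    # crash here: T2 never committed
]

pages = {}
committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

# Redo phase: repeat history, applying every logged update in order.
for xid, page, before, after in log:
    if xid != "COMMIT":
        pages[page] = after

# Undo phase: walk the log backwards, restoring before-values of losers.
for xid, page, before, after in reversed(log):
    if xid != "COMMIT" and xid not in committed:
        pages[page] = before

print(pages)   # {'P1': 15, 'P2': 0} -- only T1's effects survive
```

The before-value stored in each log record is exactly what makes undo
possible, which is why WAL requires the record to reach disk before the page.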
==================================================================
Embedded SQL Example ~
==================================================================
An example of an embedded SQL statement that creates a new Person instance
would be:

EXEC SQL BEGIN DECLARE SECTION;
  char SQLSTATE[6];
  char ssan[9];
  char name[30];
EXEC SQL END DECLARE SECTION;
EXEC SQL INSERT INTO person VALUES (:ssan, :name);
==================================================================
Dynamic SQL ~
==================================================================
The opposite of prepared statements:
  Exec(sql_string);
  objConn.execute(sql);   (in Java)
Note: At compile time nothing is known about the statement.
==================================================================
INSTEAD OF Trigger ~
==================================================================
Do something else on insert!

--Create an INSTEAD OF INSERT trigger on the view.
CREATE TRIGGER InsteadTrigger on InsteadView
INSTEAD OF INSERT AS
BEGIN
  --Build an INSERT statement ignoring inserted.PrimaryKey and
  --inserted.ComputedCol.
  INSERT INTO BaseTable
  SELECT Color, Material FROM inserted
END
GO
==================================================================
Identifying Relationship ~
==================================================================
An identifying relationship means that the child table cannot be uniquely
identified without the parent. Example:
  Account (AccountID, AccountNum, AccountTypeID)
  PersonAccount (AccountID, PersonID, Balance)   [ child table ]
    (AccountID, PersonID) is the composite key.
  Person (PersonID, Name)
Account to PersonAccount is an identifying relationship:
- The child table cannot exist without the parent.
- PersonAccount->Account is an n:1 relation. Account is 1 and only one.
    Account --||------------< PersonAccount
  A joint account may be owned by many persons.
- AccountID is a foreign key. Whenever there is a foreign key, that end
  cannot be optional (it has to be 1 and only one).
- A referencing table is a child.
The referenced table is the parent, which is one and only one.
- A foreign key can be NULL, but if it is NOT NULL (e.g. also being part of
  the primary key, or via an explicit constraint), then it identifies the
  one (and only one) other end.
- An identifying relationship is denoted by a solid line ???
Strong entity (or independent entity) = parent entity.
Weak entity (or dependent entity) = child entity.
When you delete the strong entity, we may want to cascade-delete the weak
entity:

create table employee_insurance_policy(
  ssn INT PRIMARY KEY,    /* In this case the PK is also an FK !!! */
  FOREIGN KEY (ssn) REFERENCES Employees
    ON DELETE CASCADE     /* ==> delete this row when the reference is gone! */
);
========================================================================
If table1:table2 is an m:1 relationship, I insert a foreign key on table1; a
separate relationship table is not needed. This is the case for
worker:manager. Not true for the "works in" relationship if an employee can
work in multiple departments. employee:dept is m:n -- to see that, ask 2
separate questions:
* Given an employee, how many depts?
* Given a dept, how many employees?
==================================================================
Foreign Keys ~
==================================================================
- A foreign key can be NULL. That means it is not yet assigned. When it is
  not NULL, the 'participation constraint' is enforced.
- A foreign key can be a composite key.
- If any component of a composite foreign key is NULL, the entire check is
  skipped.
- A foreign key can refer to any unique constraint (not only the PK).
- A foreign key can be UNIQUE or not unique. When it is not, it implies a
  many-to-1 relationship.
- Foreign key interpretation:

  Constraint            Meaning of relationship between child and parent
  --------------------  --------------------------------------------------
  NOT NULL              Parent is 1 and only 1; whether many children map
                        to it or not does not matter. Parent-side
                        cardinality is 1 (and only 1).
  NULL OK               Parent-side cardinality is {0,1}.
                        Not an identifying relationship.
  NOT NULL, UNIQUE      Child-side cardinality is {1} (and only 1).
  NOT NULL, non-unique  Child-side cardinality is {1..n}.
  NULL OK, non-unique   Child-side cardinality is {0..n}.
  NULL OK, UNIQUE       Not possible (here UNIQUE is taken to imply NOT
                        NULL); when it may be NULL and is not unique, the
                        child side is {0..n}.

- A foreign key can also be the primary key. In that case, it is a 1:1
  relationship with the other table.
- A foreign key on the child table creates an internal non-unique index. Why
  is it needed? To verify the referential constraint when deleting a row
  from the parent table; we can't afford to scan the child table.
- Can a foreign key refer to a non-unique column? In standard SQL, no! But in
  MySQL InnoDB, yes. It is an InnoDB extension.
==================================================================
NULL Handling ~
==================================================================
Use COALESCE(col1, 0): it returns the first non-NULL value in the list.
IFNULL(col1, 0) does the same thing but is non-standard.
CAST(NULL AS signed integer) returns NULL (of type signed int), which is
important if you are inserting into another table for which type info is
needed.
==================================================================
Deferred Integrity Checking ~
==================================================================
A deferred check is required by the SQL standard; however, MySQL InnoDB does
not implement it.
The workaround is:
  mysql> SET foreign_key_checks = 0;
  mysql> SOURCE dump_file_name;
  mysql> SET foreign_key_checks = 1;
==================================================================
Surrogate Key ~
==================================================================
==================================================================
ISAM ~
==================================================================
ISAM stands for Indexed Sequential Access Method, a method for indexing
data for fast retrieval. ISAM was originally developed by IBM.

A database system where an application developer directly uses an
Application Programming Interface to search indexes in order to locate
records in data files; in contrast, a relational database uses a query
optimizer which automatically selects indexes.

An indexing algorithm that allows both sequential and keyed access to
data. Most databases now use some variation of the B-tree for this
purpose, although the original IBM ISAM and VSAM implementations did not.
ISAM was replaced at IBM by a methodology called VSAM (Virtual Storage
Access Method). VSAM is the physical access method used in DB2.

(Something like: has a header with a fixed array of pointer-offsets; these
pointers are never updated -- only the data changes. The data pages can
have any number of overflow pages; over time the performance can degrade.)
========================================================================
Benchmarks ~
========================================================================
TPC-A - Simple benchmark to measure $/tps
TPC-B - Measures the DBMS itself - update 3 tables and append to a 4th table.
TPC-C - More complex suite of tasks - de facto benchmark now for OLTP.
TPC-D - OLAP benchmark
==================================================================
Queries by Example ~
==================================================================
Scalar subquery: (a form of self join with correlated values)
http://web.archive.org/web/20071009162319/http://www.dbmsmag.com/9610d06.html

Get me the TOP 3 ranked salesmen with total sales:

With self join:
  SELECT Elements.name, Elements.tot_sales
  FROM SalesReport AS Elements, SalesReport AS Boundary
  WHERE Elements.tot_sales <= Boundary.tot_sales
  GROUP BY Elements.name, Elements.tot_sales
  HAVING COUNT(Boundary.tot_sales) <= 3;

With subquery:
  SELECT Elements.name, Elements.tot_sales
  FROM SalesReport AS Elements
  WHERE (SELECT COUNT(*)
         FROM SalesReport AS Boundary
         WHERE Boundary.tot_sales >= Elements.tot_sales) <= 3;

If you want to swap the subquery and the constant for readability, that
action is legal in SQL-92 but not in SQL-89.

What if you want to allow ties? Just change COUNT() to a COUNT(DISTINCT)
function with a HAVING clause, thus:
  SELECT Elements.name, Elements.tot_sales
  FROM SalesReport AS Elements, SalesReport AS Boundary
  WHERE Elements.tot_sales <= Boundary.tot_sales
  GROUP BY Elements.name, Elements.tot_sales
  HAVING COUNT(DISTINCT Boundary.tot_sales) <= 3;

Scalar subquery example: a third version - which will run only in SQL-92 -
uses scalar subqueries:
  SELECT (SELECT MAX(tot_sales) FROM SalesReport) AS s1,
         (SELECT MAX(tot_sales) FROM SalesReport
          WHERE tot_sales NOT IN (s1)) AS s2,
         (SELECT MAX(tot_sales) FROM SalesReport
          WHERE tot_sales NOT IN (s1, s2)) AS s3,
         . . .
         (SELECT MAX(tot_sales) FROM SalesReport
          WHERE tot_sales NOT IN (s1, ..., s[n-1])) AS sn
  FROM Dummy;

Note: Great example. Note that the correlation within the SELECT list
happens only if the later elements contain subqueries, not otherwise:
  SELECT tv1 AS myv1, myv1 FROM t;          ==> Error "myv1" no such field.
  SELECT tv1 AS myv1, (myv1) FROM t;        ==> Error "myv1" no such field.
  SELECT tv1 AS myv1, (select myv1) FROM t; ==> OK!
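The correlated-count TOP-N technique above can be sketched end to end in
Python with SQLite; the SalesReport table and its rows are invented sample
data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesReport (name TEXT, tot_sales INTEGER)")
conn.executemany("INSERT INTO SalesReport VALUES (?, ?)",
                 [("ann", 100), ("bob", 90), ("cat", 80),
                  ("dan", 70), ("eve", 60)])

# For each salesman, count how many strictly out-sell him;
# fewer than 3 such people means he is in the top 3.
top3 = conn.execute("""
    SELECT e.name, e.tot_sales
    FROM SalesReport AS e
    WHERE (SELECT COUNT(*) FROM SalesReport AS b
           WHERE b.tot_sales > e.tot_sales) < 3
    ORDER BY e.tot_sales DESC""").fetchall()
print(top3)  # [('ann', 100), ('bob', 90), ('cat', 80)]
```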
So, in essence, you can write procedural logic in a single SQL statement,
using (scalar) subqueries. Find top 3 salesmen:
  Step 1: max1 = select max(sales_total) ...;
  Step 2: max2 = select max(sales_total) ... where (sales_total != max1)
  Step 3: max3 = select max(sales_total) ... where (sales_total not in (max1, max2))

Case 1) SELECT e1, (expr2) AS e2 FROM ... ;
  This select is OK. From expr2, you can access e1 (from SQL-92 onwards).

Case 2) SELECT ... FROM (expr1) AS t1, (expr2) AS t2 WHERE ...
  A table in the FROM clause is called a derived table. Conceptually, a
  derived table is evaluated first.
  *The FROM table list is the main producer of the name space.*
  In expr2, can you access t1 ???? *verify*
  *There should not be mutual dependency.*

Case 3) SELECT e1, e2 FROM t1, t2 WHERE (expr1);
  The WHERE clause is the *consumer* of the FROM table list. It does not
  have access to e1, e2 since they depend on it, and we can't have a
  circular dependency. The WHERE clause can't import table-field aliases
  from e1, e2 even if these are available at compile time.
  If the expr1 subquery accesses t1 or t2, then it is a correlated
  subquery. Otherwise it is just a subquery.

More examples: Find the country with min population in the 'South America'
continent. Use a correlated subquery:
  SELECT * FROM Country c1
  WHERE Continent = 'South America'
    AND Population = (SELECT MIN(Population) FROM Country c2
                      WHERE c2.Continent = c1.Continent)

Note1: The above is not efficient. Better options:
  - You may do a 2-step process.
  - SELECT * FROM Country WHERE Continent = 'South America'
    ORDER BY Population LIMIT 1;   (only 1 output row)
  - SELECT * FROM Country WHERE Continent = 'South America'
    AND Population = (SELECT MIN(Population) FROM Country
                      WHERE Continent = 'South America');
    (Since there is no correlation, it need not be executed multiple times)
========================================================================
Find countries which have the max surface area within their own continent.
(one output per continent)
  SELECT Continent, Name, SurfaceArea
  FROM Country AS C
  WHERE SurfaceArea = (SELECT MAX(SurfaceArea) FROM Country
                       WHERE Continent = C.Continent);
========================================================================
mysql> SELECT DISTINCT Language
    -> FROM CountryLanguage
    -> WHERE CountryCode IN (
    ->     SELECT Code
    ->     FROM Country
    ->     WHERE GovernmentForm = 'Monarchy')
    -> ORDER BY Language;
+----------+
| Language |
+----------+
| Arabic   |
| Asami    |
| Dzongkha |
| English  |
| Nepali   |
| Swazi    |
| Tongan   |
| Urdu     |
| Zulu     |
+----------+
========================================================================
Find countries in which German is spoken:
  SELECT Code, Name FROM Country C
  WHERE EXISTS (SELECT * FROM CountryLanguage
                WHERE CountryCode = C.Code AND Language = 'German')
  ORDER BY Name;
========================================================================
Row constructor example:
  SELECT Population FROM City
  WHERE (Name, District, CountryCode) = ('Houston', 'Texas', 'USA');
+------------+
| Population |
+------------+
|    1953631 |
+------------+
========================================================================
Find total German- / Hindi-speaking people:
  SELECT SUM(Speakers)
  FROM (SELECT (Population * Percentage) / 100 AS Speakers
        FROM CountryLanguage cl, Country c
        WHERE cl.CountryCode = c.Code
          AND c.Region = 'Western Europe'
          AND cl.Language = 'German') AS tmp;

  SELECT SUM(Speakers)
  FROM (SELECT (Population * Percentage) / 100 AS Speakers
        FROM CountryLanguage cl, Country c
        WHERE cl.CountryCode = c.Code
          AND cl.Language = 'Hindi') AS tmp;
========================================================================
Subquery converted to a LEFT JOIN: Projects which have clients:
  SELECT * FROM project
  WHERE project.id IN (SELECT client.id FROM client)
How would you rewrite the statement using the LEFT JOIN operator?
Your output should have column headings like this:
+------------+--------------+-----------+---------------+
| Project ID | Project Name | Client No | Client Name   |
+------------+--------------+-----------+---------------+

  SELECT
      p.pid  AS `Project ID`,
      p.name AS `Project Name`,
      c.id   AS `Client No`,
      c.name AS `Client Name`
  FROM project AS p
  LEFT JOIN client AS c
      USING (id)   /* or: ON p.id = c.id */
  WHERE c.name IS NOT NULL;

The IS NOT NULL filters out projects which don't have clients.
========================================================================
The same query with an INNER JOIN:
  SELECT
      p.pid  AS `Project ID`,
      p.name AS `Project Name`,
      c.id   AS `Client No`,
      c.name AS `Client Name`
  FROM project AS p
  INNER JOIN client AS c
      USING (id)   /* or: ON p.id = c.id */

Note: For each project there is at most one client, so the output won't
have duplicates.
========================================================================
Subquery converted to a LEFT JOIN (projects which have no clients):
mysql> SELECT
    ->     p.pid  AS `Project ID`,
    ->     p.name AS `Project Name`,
    ->     c.id   AS `Client No`,
    ->     c.name AS `Client Name`
    -> FROM project AS p
    -> LEFT JOIN client AS c
    ->     USING (id)   /* or: ON p.id = c.id */
    -> WHERE c.name IS NULL
    -> ;
+------------+--------------+-----------+-------------+
| Project ID | Project Name | Client No | Client Name |
+------------+--------------+-----------+-------------+
|      10080 | Intranet     |      NULL | NULL        |
|      10090 | PDC Server   |      NULL | NULL        |
+------------+--------------+-----------+-------------+
========================================================================
Great example: we want the following output for only the country with the
max number of languages declared official:
  SELECT CONCAT(
      'The country ', ... ,
      ' has ', ... ,
      ' official languages: ', ...
); SELECT CONCAT( 'The country ', (SELECT Name FROM ( SELECT Name, COUNT(*) AS nlanguages, GROUP_CONCAT(Language) as languages FROM Country c, CountryLanguage cl WHERE c.Code = cl.CountryCode AND cl.IsOfficial = 'T' GROUP BY Name ORDER BY nlanguages DESC, Name LIMIT 1 ) AS tmp ), ' has ', (SELECT nlanguages FROM ( SELECT Name, COUNT(*) AS nlanguages, GROUP_CONCAT(Language) as languages FROM Country c, CountryLanguage cl WHERE c.Code = cl.CountryCode AND cl.IsOfficial = 'T' GROUP BY Name ORDER BY nlanguages DESC, Name LIMIT 1 ) AS tmp1 ), ' official languages: ', (SELECT languages FROM ( SELECT Name, COUNT(*) AS nlanguages, GROUP_CONCAT(Language) as languages FROM Country c, CountryLanguage cl WHERE c.Code = cl.CountryCode AND cl.IsOfficial = 'T' GROUP BY Name ORDER BY nlanguages DESC, Name LIMIT 1 ) AS tmp2 ) ); The result looks like this: The country South Africa has 4 official languages: Afrikaans,English,Xhosa,Zulu ========================================================================= Emulate Row_Number() ISO SQL defines a ROW_NUMBER() OVER function with an optional PARTITION clause for generating a derived row number column in a resultset. Several RDBMSs including DB2, Oracle and SQL Server implement it. Here is the simplest possible example. Given a table with two columns i and j, generate a resultset that has a derived sequential row_number column taking the values 1,2,3,... 
for a defined ordering of j which resets to 1 when the value of i changes:

DROP TABLE IF EXISTS test;
CREATE TABLE test(i int, j int);
INSERT INTO test VALUES
  (3,31),(1,11),(4,14),(1,13),(2,21),(1,12),(2,22),(3,32),(2,23),(3,33);

The result must look like this:
+------+------+------------+
| i    | j    | row_number |
+------+------+------------+
|    1 |   11 |          1 |
|    1 |   12 |          2 |
|    1 |   13 |          3 |
|    2 |   21 |          1 |
|    2 |   22 |          2 |
|    2 |   23 |          3 |
|    3 |   31 |          1 |
|    3 |   32 |          2 |
|    3 |   33 |          3 |
|    4 |   14 |          1 |
+------+------+------------+

SELECT a.i, a.j, COUNT(*) AS row_number
FROM test a
JOIN test b ON a.i = b.i AND a.j >= b.j
GROUP BY a.i, a.j;
========================================================================
------------------------------------------------------------------
Bad Correlated Subquery
------------------------------------------------------------------
SELECT p.name,
       (SELECT MAX(price) FROM orderitems
        WHERE prod_id = p.prod_id) AS max_sold_price
FROM Products p;

What does this mean? List each product name with its max sold price, using
the products and orderitems tables.
Order of the algorithm:
  For each product in Products:
      For each item in OrderItems, compare and find the max.

Better query:
  SELECT p.name, MAX(oi.price) AS max_sold_price
  FROM Products p
  INNER JOIN OrderItems oi ON (p.prod_id = oi.prod_id)
  GROUP BY p.name;
========================================================================
Comparison with subquery:
  operand comparison_operator ANY (subquery)
  operand IN (subquery)  ==> synonym for =ANY in the subquery context.
  However, for an expression list, e.g. (1, 2, 3), IN must be used.
  operand comparison_operator SOME (subquery) ; SOME is an alias for ANY.

SELECT * FROM t1 WHERE 1 > ALL (SELECT s1 FROM t2);
means:
SELECT * FROM t1 WHERE 1 > (SELECT MAX(s1) FROM t2);

SELECT * FROM t1 WHERE 1 > ANY (SELECT s1 FROM t2);
means:
SELECT * FROM t1 WHERE 1 > (SELECT MIN(s1) FROM t2);

<>ALL is an alias for NOT IN.

Row subqueries:
SELECT * FROM t1
WHERE (col1, col2) = (SELECT col3, col4 FROM t2 WHERE id = 10);
========================================================================
http://www.artfulsoftware.com/infotree/queries.php :
If columns other than the GROUP BY column must be retrieved, and if the
grouping expression does not have a strictly 1:1 relationship with those
columns, then to avoid returning arbitrary values for those non-grouping
columns, you must put the GROUP BY query in a subquery and join that
result to the other columns, for example:

SELECT s.partID, s.thiscol, s.thatcol, s.anothercol, x.Suppliers
FROM supparts s
JOIN (
  SELECT partID, GROUP_CONCAT(supID ORDER BY supID) AS Suppliers
  FROM supparts
  GROUP BY partID
) x USING (partID);
========================================================================
Given the table authorbook(authid INT, bookid INT), what query finds the
books whose authors have more than one book in the table? Even one level
of recursion can induce a mild trance. Escape the trance by taking the
problem one step at a time. First write the query that finds the authors
with multiple books. Then join an outer query to that on authid, and have
the outer query select bookid:

SELECT a1.bookid
FROM authorbook a1
INNER JOIN (
  SELECT authid, COUNT(bookid)
  FROM authorbook a2
  GROUP BY authid
  HAVING COUNT(bookid) > 1
) AS a3 ON a1.authid = a3.authid;
========================================================================
Groupwise minimum via an exclusion self-join (e.g. cheapest supplier per
item):

SELECT p1.item, p1.supplier, p1.price
FROM prices AS p1
LEFT JOIN prices AS p2
  ON p1.item = p2.item AND p1.price > p2.price
WHERE p2.item IS NULL;

Note: If there are 2 suppliers with the min price, it will just
arbitrarily choose the first one.
...because in the resultset built by joining on left item = right item and
left price > right price, the left-sided rows for which there is no
smaller right-sided price are precisely the per-item rows with the
smallest prices. The query runs more than an order of magnitude faster
with an index on (item, supplier).

General rule:
  - Prefer flat queries.
  - Else prefer a subquery in the FROM clause.
  - Else prefer a subquery in the WHERE clause.
========================================================================
Find missing numbers in a sequence.
You have a table tbl(id int) with values (1,2,4,18,19,20,21), and you wish
to find the first missing number in its sequence of id values:

SELECT t1.id+1 AS Missing
FROM tbl AS t1
LEFT JOIN tbl AS t2 ON t1.id+1 = t2.id
WHERE t2.id IS NULL
ORDER BY id LIMIT 1;
+---------+
| Missing |
+---------+
|       3 |
+---------+

For all the gaps, including gaps of more than 1 value, you need something
a little more baroque...

SELECT a.id+1 AS 'Missing From', MIN(b.id) - 1 AS 'To'
FROM tbl AS a, tbl AS b
WHERE a.id < b.id
GROUP BY a.id
HAVING `Missing From` < MIN(b.id);
+--------------+------+
| Missing From | To   |
+--------------+------+
|            3 |    3 |
|            5 |   17 |
+--------------+------+
------------------------------------------------------------------
Changing Correlated Subquery to Derived Table
------------------------------------------------------------------
Derived table == subquery table in the FROM clause.
TODO.
==================================================================
Views ~
A view may be updatable but not insertable. Why?
- A view may be based on only a subset of the columns in a table; an
  insert may require default values for the other columns in the real
  table.
- A view may contain expressions for some columns; an update may modify
  only some columns, but an insert has to supply all columns.
A view can be updated only if just one base table is affected:
  UPDATE v1 SET CityName = 'Werbelina' WHERE CityName = 'Berlin';  ==> OK
  UPDATE v1 SET CountryName = 'Deutschland'
    WHERE CountryName = 'Germany';                                 ==> OK
  UPDATE v1 SET CityName = 'Berlin', CountryName = 'Germany'
    WHERE CityName = 'Werbelina' AND CountryName = 'Deutschland';
  *ERROR* 1393 (HY000): Can not modify more than one base table through a
  join view 'world.v1'

  INSERT INTO v1 SET CityName = 'Neustadt', CountryName = 'Deutschland';
  ==> Same error as above.
========================================================================
Tip: MySQL does not support role-based access control; it has
privilege-based access control only.
==================================================================
OLAP ~
==================================================================
OLAP - On-Line Analytic Processing
OLTP - On-Line Transaction Processing

Three complementary trends:

Data Warehousing:
  * Consolidate data from many sources in one large repository.
  * Loading, periodic synchronization of replicas.
  * Semantic integration.
  * ETL - Extract, Transform, Load

OLAP:
  * Complex SQL queries and views.
  * Queries based on spreadsheet-style operations and a
    multidimensional view of data.
  * Interactive and online queries.

Data Mining:
  * Exploratory search for interesting trends and anomalies.

*MOLAP* Multidimensional data can be stored physically in a
(disk-resident, persistent) array; such systems are called MOLAP systems.
*ROLAP* Alternatively, the data can be stored as a relation; such systems
are called ROLAP systems.

The main relation, which relates dimensions to a measure, is called the
*fact* table. Each *dimension* can have additional attributes and an
associated dimension table.
E.g., Products(pid, pname, category, price)   [ it is one dimension ]
Fact tables are much larger than dimension tables.
Dimension Hierarchies For each dimension, the set of values can be organized in a hierarchy: Examples: Product:: category => Product-name Time:: Year => Month => Week Location:: Country => State => City OLAP Queries are Influenced by SQL and by spreadsheets. A common operation is to aggregate a measure over one or more dimensions. * Find total sales. * Find total sales for each city, or for each state. * Find top five products ranked by total sales. * Roll-up: Aggregating at different levels of a dimension hierarchy. * E.g., Given total sales by city, we can roll-up to get sales by state. Drill-down: The inverse of roll-up. E.g., Given total sales by state, can drill-down to get total sales by city. E.g., Can also drill-down on different dimension to get total sales by product for each state. Pivoting: Aggregation on selected dimensions. * E.g., Pivoting on Location and Time to summarize the measure. Slicing and Dicing: * Equality and range selections on one or more dimensions. The *CUBE* Operator ----------------- Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions. CUBE pid, locid, timeid BY SUM Sales Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form: SELECT SUM(S.sales) # Sales is the Fact table here. FROM Sales S GROUP BY grouping-list SALES: pid timeid locid sales (Fact Table) TIMES : (timeid date week month quarter year holiday_flag) PRODUCTS: pid pname category price LOCATIONS: locid city state country Fact table in BCNF; dimension tables un-normalized. Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance. This kind of schema is very common in OLAP applications, and is called a *star* schema; computing the join of all these relations is called a *star* join. 
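The 2^k GROUP BY queries that CUBE generalizes can be enumerated directly.
A minimal sketch in Python with SQLite, assuming a toy Sales fact table
with just two dimensions (pid, locid) and invented rows:

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INT, locid INT, sales INT)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                 [(1, 10, 5), (1, 20, 7), (2, 10, 3)])

dims = ("pid", "locid")
results = {}
# One GROUP BY query per subset of the dimensions: 2^2 = 4 roll-ups.
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):
        cols = ", ".join(subset + ("SUM(sales)",))
        group_by = (" GROUP BY " + ", ".join(subset)) if subset else ""
        results[subset] = conn.execute(
            f"SELECT {cols} FROM Sales{group_by}").fetchall()

print(results[()])                 # grand total: [(15,)]
print(sorted(results[("pid",)]))   # per-product totals: [(1, 12), (2, 3)]
```

The empty subset is the grand total; the full subset is the finest-grained
group-by, just as in the eight roll-ups of {pid, locid, timeid} above.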
New indexing techniques: bitmap indexes, join indexes, array
representations, compression, precomputation of aggregations, etc.

Querying sequences in SQL:1999.
Trend analysis is difficult to do in SQL-92:
  * Find the % change in monthly sales.
  * Find the top 5 products by total sales.
  * Find the trailing n-day moving average of sales.
The first two queries can be expressed with difficulty, but the third
cannot even be expressed in SQL-92 if n is a parameter of the query. The
WINDOW clause in SQL:1999 allows us to write such queries over a table
viewed as a sequence (implicitly, based on user-specified sort keys).

The *WINDOW* Clause
-------------------
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.locid = L.locid
WINDOW W AS (PARTITION BY L.state
             ORDER BY T.month
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
                       AND INTERVAL '1' MONTH FOLLOWING)

Let the result of the FROM and WHERE clauses be Temp. (Conceptually) Temp
is partitioned according to the PARTITION BY clause. Similar to GROUP BY,
but the answer has one row for each row in a partition, not one row per
partition! Each partition is sorted according to the ORDER BY clause. For
each row in a partition, the WINDOW clause creates a window of nearby
(preceding or succeeding) tuples.
  - Can be value-based, as in the example, using RANGE.
  - Can be based on the number of rows to include in the window, using
    the ROWS clause.
The aggregate function is evaluated for each row in the partition using
the corresponding window.

New aggregate functions that are useful with windowing include RANK
(position of a row within its partition) and its variants DENSE_RANK,
PERCENT_RANK, CUME_DIST.
==================================================================
Note: In the early 1990s, products such as Essbase provided customized
tools for storing data in proprietary cube format while providing
cube-based user interfaces to navigate the data.
Over time, data cube support has been added to other major DBMSs. It is
called HOLAP - Hybrid OLAP.

OLAP cubes are not good for ad hoc queries.

In a star schema, if the dimension tables are hierarchical, then it is a
'multi-level star schema', aka a 'Snowflake Schema'.

Note: Vendors offering column stores include Sybase, Vertica, Sand,
Vhayu, and KX.
==================================================================
DSL ALPHA, QUEL, SQL ~
==================================================================
Suppose we wish to "Get the names and phone numbers of customers living
in London". With DSL Alpha (Codd's original proposal), we would specify
this query as:
  Range Customer X;
  Get (X.Cname, X.Cphone): X.Ccity=London;
whereas in SQL its equivalent would be:
  Select Cname, Phone
  From Customer
  Where Ccity = 'London'
------------------------------------------------------------------
QUEL statements are always defined by tuple variables, which can be used
to limit queries or return result sets. Consider this example, taken from
the original Ingres paper:
  range of e is employee
  retrieve (comp = e.salary / (e.age - 18)) where e.name = "Jones"
e is a tuple variable ranging over the employee table, in this case over
all the rows that have the name "Jones". An equivalent SQL statement is:
  select (e.salary / (e.age - 18)) as comp
  from employee as e
  where e.name = "Jones"
QUEL is generally more "normalized" than SQL. Whereas every major SQL
command has a format that is at least somewhat different from the others,
in QUEL a single syntax is used for all commands. For instance, here is a
sample of a simple session that creates a table, inserts a row into it,
then retrieves and modifies the data inside it, and finally deletes the
row that was added (assuming that name is a unique field).
create student(name = c10, age = i4, sex = c1, state = c2)
range of s is student
append to s (name = "philip", age = 17, sex = "m", state = "FL")
retrieve (s.all) where s.state = "FL"
replace s (age = s.age + 1)
retrieve (s.all)
delete s where s.name = "philip"

Here is a similar set of SQL statements:
create table student(name char(10), age int, sex char(1), state char(2))
insert into student (name, age, sex, state) values ("philip", 17, "m", "FL")
select * from student where state = "FL"
update student set age = age + 1
select * from student
delete from student where name = "philip"
==================================================================
Parallel Architectures

*Shared-Memory* : typically inside one machine; for large installations,
high costs. All process models are applicable. Great for OLTP; many
implementations from almost every vendor.
*Shared-Nothing* : typically a cluster of nodes. Requires good
partitioning, which is easier for OLAP workloads (Teradata, Greenplum,
DB2 Parallel Edition, ...).
*Shared-Disk* : cluster of nodes with a SAN. Simple model, because every
node can access all the data, but requires cache-coherence protocols.
Oracle RAC, DB2 SYSPLEX.
*NUMA* (non-uniform memory architecture) : not that common; we will not
discuss it. (Often a DBMS treats it as either shared-nothing or
shared-memory, depending on how non-uniform it is.)
==================================================================
Cost of an Execution Plan
==================================================================
What's the cost of a particular plan?
*Selinger* Famous paper. Pat Selinger was one of the early System R
researchers; still active today. Lays the foundation for modern query
optimization. Some things are weak but have since been improved upon.
Idea behind query optimization: find the query plan of minimum cost.
How to do this? Need a way to measure the cost of a plan (a cost model).
A better plan is possible by using sampling data.
-------------------
CPU cost (# of instructions): 1 GHz == 1 billion instrs/sec; 1 nsec/instr.
I/O cost (# of pages read, # of seeks): 100 MB/sec, i.e. 10 nsec/byte.
(Random I/O = page read + seek.) 10 msec/seek, i.e. 100 seeks/sec.

Random I/O can be a real killer (10 million instrs/seek). (You probably
experience it yourself on your laptop sometimes.)

For example, the fastest TPC-H benchmark result (data warehousing
benchmark), on 10 TB of data, uses 1296 74 GB disks, which is 100 TB of
storage. Additional storage is partly for indices, but partly just
because they needed additional disk arms. 72 processors, 144 cores; 18
disks / processor! But if we do a bad job, random I/O can kill us!

Moreover, there is a big difference between sequential and random I/O.
Example (reading 10 KB at 100 MB/sec takes 100 microsecs):
  Read 10KB from a random location on disk:
    10ms + 100micros = *10.1ms*
  Read 10KB+10KB from a random location on disk (two blocks next to each
  other):
    10ms + 200micros = *10.2ms*
  Read 10KB+10KB from two random locations on disk:
    2 x (10ms + 100micros) = *20.2ms*
WOW! So saving disk I/O, and in particular random I/O, is VERY
important!!
==================================================================
Appendix ~
Recommended Books ~
* Concurrency Control and Recovery in Database Systems by Bernstein,
  Hadzilacos, and Goodman
* Transaction Processing by Jim Gray and Andreas Reuter
* Transactional Information Systems by Weikum and Vossen

Useful Research Pointers ~
http://tab.computer.org/tcde/icde_inf_paper.html
ICDE Influential Paper Awards:

ICDE 2012
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das: DBXplorer: A System for
Keyword-Based Search over Relational Databases. ICDE 2002

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti,
S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS.
ICDE 2002

ICDE 2008
Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns.
ICDE 1995
Citation: This paper launched a new area in data mining. Sequential
pattern mining has since become an important and active area with a
variety of applications and much published work. The paper is a milestone
in the field of data mining.

ICDE 2006
Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh: Data Cube: A
Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Total. ICDE 1996
Citation: This seminal paper defined a simple SQL construct that enables
one to efficiently compute aggregations over all combinations of group-by
columns in a single query, where previous approaches required multiple
queries. This feature has had significant impact on industry and is now
incorporated in all major database systems.