Email address: Isamu.Shigemori@eng.sun.com

Use sysdef(1M) to display all system-defined parameters (like SHMNI).

    shmid = shmget(key, size, flags);   /* shmid indexes shmid_ds[SHMNI] */
    addr  = shmat(shmid, ...);

Shared memory shmid_ds[] entry:
    ptr to anon_map --> anon_hdr --> **array_chunk --> anon --> page offset or page_t
                                                         |--> swapfs vnode
A process that has this shared memory segment attached has the links:
    proc struct --> as --> seg --> segvn_data --> anon_map
---------------------------------------------------------------------
read(fd, buf, nbytes) --> ufs_read:
    base = segmap_getmapflt(vnode, offset, ...);
    uiomove(base, userbuffer, nbytes, ...);
    /* A page fault now happens for the kernel on the base
       address, so it reads the pages in from the vnode. */
    segmap_release(...);
---------------------------------------------------------------------
struct as kas;    /* kernel address space */

kas --> seg --> (segmap_ops, smap_data, i.e. list of smap structs)
          |--> page table

> kas::walk seg | $<seg

For each seg (say 256MB) there is smap_data for all pages of the 256MB.
Each smap struct holds (vnode, offset, page_t, ...).
---------------------------------------------------------------------
The anon struct: one per anonymous page of memory.  It records where
the page lives -- in memory, on swap, or both -- and maintains a refcnt.
---------------------------------------------------------------------
page_t, which gets you to the page: one per physical page, in an array
set up at boot.  It records how, and by whom, the page is used.
---------------------------------------------------------------------
The anon_map maintains a refcnt.  When it reaches 0, the anon pages are
freed, and so are the anon_hdr and the array chunks.
---------------------------------------------------------------------
A process's page tables are shared by all of its segments.  Page-table
entries are allocated from them and entered into the seg structures.
The page table is maintained directly by the HAT.  Note that you can
travel from a page-table entry to its page, and from a seg entry to its
page_t (i.e. its page-table entry).
---------------------------------------------------------------------
With vfork, the parent waits until the child exits or execs.
---------------------------------------------------------------------
Multiprocessor System Architectures, Ben Catanzaro, SunSoft 1995
-- good book.
---------------------------------------------------------------------
Level 1 cache, level 2 cache, etc.  sun4d uses the Sun/Xerox XDBus.
The paper by Talluri and Khalidi uses a new page-table algorithm for
64-bit address spaces.
---------------------------------------------------------------------
64-bit addressing:
    bits  0-12: page offset
    bits 13-16: index
    bits 17-63: virtual page number

virtual page number --> hash --> (vpn, tag) --> use the index to look
into the page-table pte group (a 16-entry table, since the index is
4 bits) --> get the page.

TLB --> TSB --> then map it.
---------------------------------------------------------------------
The level 1 cache is a virtual-address cache (write-through); level 2
is a memory cache (write-back).  A virtual-address mapping carries a
context, so the mappings of different running processes are kept apart.

    CPU --> Level 1 --> Level 2
      |--> TLB

Only if the physical address the TLB maps matches the level 1 result is
that result used (what is this????).  If the TLB misses, level 1 is not
used.

To list kernel symbols, use:
    # nm -x /dev/ksyms | grep variable_name

lotsfree is a kernel variable: when the number of free pages falls
below this threshold, the page scanner starts (at the slowscan rate).
It uses the clock algorithm.  Solaris 7 has optional priority paging,
disabled by default, with a new variable cachefree: it pages out pages
belonging to file data, not executables.
---------------------------------------------------------------------
Condition variables and mutexes:

    mutex_enter(&mp);     /* mutex lock */
    cv_wait(&cv, &mp);    /* release lock; when signaled, reacquire lock, return */
    mutex_exit(&mp);

    ---- other thread ----
    mutex_enter(&mp);
    cv_signal(&cv);       /* cv: condition variable */
    /* You need to hold the lock when you call cv_signal. */
    mutex_exit(&mp);
---------------------------------------------------------------------
cv_wait_sig --> the wait can be interrupted by a signal (e.g. kill);
cv_wait cannot.  See man condvar.
---------------------------------------------------------------------
Reader/writer lock: many threads can hold the read lock, but only one
write lock can be active.  See man rwlock (rwlock_t, rwlock_init,
etc.).  When many threads are waiting to acquire the lock, writers get
preference.
---------------------------------------------------------------------
Counting semaphores enqueue the waiting threads on the semaphore.
---------------------------------------------------------------------
The dispatch priorities.  dispadmin -c TS -g displays the priority
table:

    # Time Sharing Dispatcher Configuration
    RES=1000

    # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait  PRIORITY LEVEL
           200        0        50          0          50     #     0
           200        0        50          0          50     #     1
           200        0        50          0          50     #     2

Threads at lower priority get bigger time slices (they will be
preempted within the slice by higher-priority threads anyway).

    if (runnable without running for up to ts_maxwait seconds)
        new priority = ts_lwait;

This is how a compute-bound thread avoids starvation: it acquires a
higher priority after some time.

Scheduling algorithm: if a thread gets preempted, it remains on the
dispatch queue at the same priority with its remaining time left.  If
the preemption is due to using up all of its time, the priority is
reduced.

Among all runnable threads:
    1. Get the highest-priority RT thread on my dispq.  If none, go on.
    2. Get the highest-priority non-bound RT thread from any other dispq.
    3. Get the highest-priority thread on my dispq.
    4. Get the highest-priority non-bound thread on any other dispq.
    5. If none found, run the idle thread.

RT threads stay at the same priority even after using up their time
slice.
---------------------------------------------------------------------
Priority inversion -- the interesting case where a higher-priority
thread is blocked by a lower-priority one.  With priority inheritance:

    T1 (pri=0)               T2 (pri=100)    T3 (pri=10)   T4 (pri=11)
    lock                     ...             ...           ...
    (pri goes 0 -> 100)      lock, blocks    ...           ...
    unlock (eprio was 100)   runs
---------------------------------------------------------------------
Use more /usr/dict/words on one terminal.  Then:

    # ps -eo addr,comm | grep more
    30001659540 more

    > 30001659540$<proc2u

Look at nfiles=127.  Get the 4th file entry, then $<file:

    > 30001580c00$<file
    0x30001580c08:  flag     count    vnode
                    2001     1        30001c39350
    0x30001580c18:  offset   cred         audit_data
                    2000     3000065bc28  0

offset is the lseek offset.  count gets updated on dup and fork.

    > 30001c39350$<vnode
    0x30001c39358:  flag     refcnt   vfsmnt
                    0        2        0
    0x30001c39368:  op            vfsp      stream
                    ufs_vnodeops  1043f4f0  0
    0x30001c39380:  pages     type   rdev
                    1062a840  1      0
    0x30001c39398:  data         filocks   shrlocks
                    30001c392c0  0         0

The refcnt is > 1 because the DNLC also references it.  Use vmstat -s
to see the DNLC (directory name lookup cache) hit rate.  data points to
the inode.

    > 30001c392c0$<inode

The inode number is displayed as 14090.

    0x30001c392f8:  direct_blocks
                    1c8b8  1c8c0  1c8c8  1c8d0  1c8d8  1c8e0
                    1c8e8  1c8f0  1c8f8  1c900  1c908  1c910
    0x30001c39328:  indirect_blocks
                    37c30  0  0

    $ ls -li /usr/dict/words
    14090 -r--r--r--  1 root  bin  206662 Jan  5  2000 /usr/dict/words

Confirm that the inode displayed is the same.  The direct blocks listed
in the inode contents are real disk block numbers.  Use

    dd if=/dev/dsk/c0t0d0s0 bs=1k iseek=dec_blkno count=8

e.g.

    dd if=/dev/dsk/c0t0d0s0 bs=1k iseek=116920 count=8

(116920 is 0x1c8b8, the first direct block.)

mail to: max@axiomax.com    Max Bruning