A Quick Look at PostgreSQL Source Code

A Feel for the Overall Structure and Style of the Project

To get and compile the code, run the normal commands:

$ git clone git://git.postgresql.org/git/postgresql.git

$ ./configure
$ make

(Interestingly, the Makefile is actually committed into the repository, so you might not even need to run the configure script.)

Before looking at the code, try to imagine what the overall structure might be, based on what it needs to accomplish. There needs to be a network component, which hands data to a parser, then maybe an optimizer, and some code to actually run the queries. There is of course a doc directory and README files, but let's jump right in to the src directory.

Structure is the key to understanding when it comes to code, and a diagram might help, but here we're trying to understand the structure looking at the code itself.

The broad division in PostgreSQL is between the frontend (libraries/CLI making requests over the network), and the backend. The backend is probably more interesting. Run $ cd src/backend; ls and we find this directory listing (among others):

Makefile	common.mk	main		po		replication
access		executor	nls.mk		port		rewrite
bootstrap	foreign		nodes		postgres	snowball
catalog		lib		optimizer	postmaster	storage
commands	libpq		parser		regex		tcop
tsearch 	utils

Woah, what is snowball? cd snowball; cat README. Looks like a grammatical library for natural language processing, designed to find the stem of words: for example, from the word "eating" it will find "eat." Apparently PostgreSQL lets you search text based on the stem of the word. That's cool, learning already.

Optimizer and parser are easy enough to figure out, what about postmaster? Does that sound like a network layer? less postmaster/postmaster.c and we see this beautiful comment:

/*-------------------------------------------------------------------------
 *
 * postmaster.c
 *        This program acts as a clearing house for requests to the
 *        POSTGRES system.  Frontend programs send a startup message
 *        to the Postmaster and the postmaster uses the info in the
 *        message to setup a backend process.
 *
 *        The postmaster also manages system-wide operations such as
 *        startup and shutdown. The postmaster itself doesn't do those
 *        operations, mind you --- it just forks off a subprocess to do them
 *        at the right times.  It also takes care of resetting the system
 *        if a backend crashes.
 *

So PostgreSQL has a high-level message concept, excellent. grep socket * -r leads to pqcomm.c, which contains the low-level network routines. It holds another nice comment:

/*------------------------
 *INTERFACE ROUTINES
 *
 *setup/teardown:
 *  StreamServerPort        - Open postmaster's server port
 *  StreamConnection        - Create new connection with client
 *  StreamClose             - Close a client/backend connection
 *  TouchSocketFiles        - Protect socket files against /tmp cleaners
 *  pq_init                 - initialize libpq at backend startup
 *  pq_comm_reset           - reset libpq during error recovery
 *  pq_close                - shutdown libpq at backend exit
 *
 *low-level I/O:
 *  pq_getbytes             - get a known number of bytes from connection
 *  pq_getstring            - get a null terminated string from connection
 *  pq_getmessage           - get a message with length word from connection
 *  pq_getbyte              - get next byte from connection
 *  pq_peekbyte             - peek at next byte from connection
 *  pq_putbytes             - send bytes to connection (flushed by pq_flush)
 *  pq_flush                - flush pending output
 *  pq_flush_if_writable    - flush pending output if non-blocking writeable
 *  pq_getbyte_if_available - get a byte if available without blocking
 *
 *message-level I/O (and old-style-COPY-OUT cruft):
 *  pq_putmessage      - send a normal message (suppressed in COPY OUT mode)
 *  pq_putmessage_noblock - buffer a normal message (suppressed in COPY OUT)
 *  pq_startcopyout    - inform libpq that a COPY OUT transfer is beginning
 *  pq_endcopyout      - end a COPY OUT transfer
 *
 *------------------------
 */

Moving back up, the lib directory looks interesting, I wonder what's in it?

$ ls lib
Makefile	    bipartite_match.c    objfiles.txt	   stringinfo.c
README		    hyperloglog.c        pairingheap.c
binaryheap.c	    ilist.c              rbtree.c

Fascinating, PostgreSQL uses a red-black tree. I haven't used those much since college. Here is some code from rbtree.c. It looks a lot like a college textbook:

/*
 * rb_leftmost: fetch the leftmost (smallest-valued) tree node.
 * Returns NULL if tree is empty.
 *
 * Note: in the original implementation this included an unlink step, but
 * that's a bit awkward.  Just call rb_delete on the result if that's what
 * you want.
 */
RBNode *
rb_leftmost(RBTree *rb)
{
        RBNode     *node = rb->root;
        RBNode     *leftmost = rb->root;

        while (node != RBNIL)
        {
                leftmost = node;
                node = node->left;
        }

        if (leftmost != RBNIL)
                return leftmost;

        return NULL;
}

What about that postgres directory? Why is it named so generically?

$ cd postgres
-bash: cd: postgres: Not a directory
$ file postgres
postgres: 64-bit executable x86_64

OK, it's not a directory, it's the executable. That explains the generic name.

What about the port directory? From the name, it seems it could either be related to networking, or short for 'portability.'

$ ls port

Makefile	dynloader	pg_latch.c	sysv_sema.c	unix_latch.c
aix		dynloader.c	pg_sema.c	sysv_shmem.c	win32
atomics.c	hpux		pg_shmem.c	tas		win32_latch.c
darwin		objfiles.txt	posix_sema.c	tas.s		win32_sema.c
win32_shmem.c

It's portability, it even has code for porting to HPUX and AIX. Despite being very portable, there isn't a lot of platform specific code scattered throughout PostgreSQL, which is a sign of good design. Portability is a difficult architectural problem, and they've isolated the platform specific code to just a few sections.

Let's try to find the code for select. Where would it be? Running $ grep -r select * actually returns a lot of results. so instead, let's try looking at execMain.c, that seems like the place were a query begins to be executed. The first function is ExecutorRun(), and it calls standard_ExecutorRun() (apparently you can dynamically set a custom executor), and that calls ExecutePlan(), which calls ExecProcNode(), which is not in that file.

If I were using Emacs, I could use ETAGS to jump directly to that function, but I'm not, I'm using the command line and less, like a savage. Fortunately this time $ grep ExecProcNode * quickly shows me the function I'm looking for, and so I run the command less execProcnode.c, taking a look.

execProcnode.c has some great comments explaining the overall structure, complete with an ascii diagram.

 *       Suppose we want the age of the manager of the shoe department and
 *       the number of employees in that department.  So we have the query:
 *
 *                        select DEPT.no_emps, EMP.age
 *                        from DEPT, EMP
 *                        where EMP.name = DEPT.mgr and
 *                                   DEPT.name = "shoe"
 *
 *		Suppose the planner gives us the following plan:
 *
 *                 +---------------------------------------+
 *                 |      Nest Loop (DEPT.mgr = EMP.name)  |
 *                 |             /          \              |
 *                 |            /            \             |
 *                 |     Seq Scan           Seq Scan       |
 *                 |       DEPT              EMP           |
 *                 |  (name = "shoe")                      |
 *                 +---------------------------------------+
 *
 *		ExecutorStart() is called first.
 *		It calls InitPlan() which calls ExecInitNode() on
 *		the root of the plan -- the nest loop node.

Moving down; this is it, looks like we've found the core algorithm for making a query:

/* ----------------------------------------------------------------
 *		ExecProcNode
 *
 *		Execute the given node to return a(nother) tuple.
 * ----------------------------------------------------------------
 */
TupleTableSlot *
ExecProcNode(PlanState *node)
{
	TupleTableSlot *result;

	CHECK_FOR_INTERRUPTS();

	if (node->chgParam != NULL) /* something changed */
		ExecReScan(node);		/* let ReScan handle this */

	if (node->instrument)
		InstrStartNode(node->instrument);

	switch (nodeTag(node))
	{
		/*
		 * control nodes
		 */
	   case T_ResultState:
		result = ExecResult((ResultState *) node);
		break;

	   case T_ModifyTableState:
		result = ExecModifyTable((ModifyTableState *) node);
		break;

	   case T_AppendState:
		result = ExecAppend((AppendState *) node);
		break;

	   case T_MergeAppendState:
		result = ExecMergeAppend((MergeAppendState *) node);
		break;

	   case T_RecursiveUnionState:
		result = ExecRecursiveUnion((RecursiveUnionState *) node);
		break;

		/* BitmapAndState does not yield tuples */

		/* BitmapOrState does not yield tuples */

		/*
		 * scan nodes
		 */
	   case T_SeqScanState:
		result = ExecSeqScan((SeqScanState *) node);
		break;

	   case T_SampleScanState:
		result = ExecSampleScan((SampleScanState *) node);
		break;

	   case T_IndexScanState:
		result = ExecIndexScan((IndexScanState *) node);
		break;

	   case T_IndexOnlyScanState:
		result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
		break;

		/* BitmapIndexScanState does not yield tuples */

	   case T_BitmapHeapScanState:
		result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
		break;

	   case T_TidScanState:
		result = ExecTidScan((TidScanState *) node);
		break;

	   case T_SubqueryScanState:
		result = ExecSubqueryScan((SubqueryScanState *) node);
		break;

	   case T_FunctionScanState:
		result = ExecFunctionScan((FunctionScanState *) node);
		break;

	   case T_ValuesScanState:
		result = ExecValuesScan((ValuesScanState *) node);
		break;

	   case T_CteScanState:
		result = ExecCteScan((CteScanState *) node);
		break;

	   case T_WorkTableScanState:
		result = ExecWorkTableScan((WorkTableScanState *) node);
		break;

	   case T_ForeignScanState:
		result = ExecForeignScan((ForeignScanState *) node);
		break;

	   case T_CustomScanState:
		result = ExecCustomScan((CustomScanState *) node);
		break;

		/*
		 * join nodes
		 */
	   case T_NestLoopState:
		result = ExecNestLoop((NestLoopState *) node);
		break;

	   case T_MergeJoinState:
		result = ExecMergeJoin((MergeJoinState *) node);
		break;

	   case T_HashJoinState:
		result = ExecHashJoin((HashJoinState *) node);
		break;

		/*
		 * materialization nodes
		 */
	   case T_MaterialState:
		result = ExecMaterial((MaterialState *) node);
		break;

	   case T_SortState:
		result = ExecSort((SortState *) node);
		break;

	   case T_GroupState:
		result = ExecGroup((GroupState *) node);
		break;

	   case T_AggState:
		result = ExecAgg((AggState *) node);
		break;

	   case T_WindowAggState:
		result = ExecWindowAgg((WindowAggState *) node);
		break;

	   case T_UniqueState:
		result = ExecUnique((UniqueState *) node);
		break;

	   case T_GatherState:
		result = ExecGather((GatherState *) node);
		break;

	   case T_HashState:
		result = ExecHash((HashState *) node);
		break;

	   case T_SetOpState:
		result = ExecSetOp((SetOpState *) node);
		break;

	   case T_LockRowsState:
		result = ExecLockRows((LockRowsState *) node);
		break;

	   case T_LimitState:
		result = ExecLimit((LimitState *) node);
		break;

	   default:
		elog(ERROR, "unrecognized node type: %d", 
		                    (int) nodeTag(node));
		result = NULL;
		break;
	}

	if (node->instrument)
		InstrStopNode(node->instrument, 
		              TupIsNull(result) ? 0.0 : 1.0);

	return result;
}

This is good code. Internally, it seems PostgreSQL refers to select as 'scan,' not 'select,' and there are several different ways to select.

That's all for now; I hope you enjoyed my brief tour through some code!