
Software development

Foreword

This document is a summary of our beliefs and experience.

We enumerate different methods, structures and algorithms useful for software development. We believe that they are not apples and oranges: a design must have a skeleton, and this skeleton can be an algorithm or a data structure, or come from a requirement analysis. We can build our design from reusable objects or from design patterns.

Goal

A methodology formalizes an empirical practice.
Thoroughly applying a methodology is a time-consuming process.
Resources - especially development time - are scarce. Before using a methodology you should always ask the question: "In that particular context, is that methodology worth the time I have to spend?" You should start by considering the problem from different perspectives and evaluate:
  • If an approach captures the essence of the problem
  • If it allows making design choices
Then identify the components:
  • That are not necessarily feasible
  • Whose implementation is likely to run into trouble
  • Whose implementation will be hard to test
Focus your analysis on these components. Avoid over-specifying trouble-free components.

Though it is not possible to use all methodologies, it is possible to know many of them and to go by their spirit rather than the letter. We designed this page to keep useful links at hand.

Scope

We focus on Web languages: C#, Java and PHP. These languages are interpreted or precompiled and come with rich sets of powerful libraries.

Business Case

Developers should understand
  1. The Business case of the product that they develop
  2. The Business case of their prospects. Why should a prospect choose their product? The product must be good (address the need) but also look good (address the concern).
For background information on Business cases, you can read the Solution Matrix White Papers.

A business case includes:

  1. Assumptions
    • Prediction
    • Simplification
    • Clarification
  2. Scope and boundaries. Involved organizations and functions.
  3. Cash flow statement. To assign costs and benefits, should we use full or incremental values?
  4. Nonfinancial results. Used when it is impossible to express the value in monetary terms, such as "improve company image" or "improve customer satisfaction"
  5. Sensitivity analysis. What happens if the assumptions change?
  6. Risk analysis. What are the most likely results?
  7. Contingencies and dependencies. The case subject has business objectives, and reaching these objectives requires contributions from different people and organizations. Especially in big organizations, it is far easier to add something than to replace something.
Monte Carlo simulation is often used in risk analysis. This method shows how the bottom line of the financial model changes when all the important input factors or assumptions change at once. To learn more about Monte Carlo simulation you can visit the Monte Carlo Simulation for Statistical Physics page of Paul Coddington.

Virtual machine

Java and .NET rely on a Virtual Machine Infrastructure. At precompilation time, the source code is translated into a sort of portable assembly language with its own instruction set and stack. The virtual machine emulates the stack and interprets the pseudo-code. From a performance point of view this has important consequences:
  • Virtual machine code is much slower than compiled code at startup: The first user pays for the structure creation and initialization. Then with Just In Time compilation, the VM infrastructure optimizes most frequently used pieces of code.
  • By delaying the commitment to a specific native platform as much as possible, the execution platform can make optimal use of the knowledge of the underlying machine.
  • Garbage Collection automatically determines what memory a program is no longer using, and recycles it for other use. As a consequence, you cannot predict when an object will be reclaimed.
The latter point becomes increasingly important:
  • Web applications and PageBox will run in routers and even modems whose processors can be different from Unix and PC processors
  • New processors are coming and it is not only a move to 64 bit architectures
  • New architectures such as IA-64 Itanium implement:
    • Instruction-level parallelism. The code can (must?) explicitly indicate parallelism.
    • Predication. Predication is the conditional execution of instructions based on a qualifying predicate. When the predicate is true (1), the instruction is executed. When it is false (0), the instruction is treated as a NOP. Predication removes branches.
    • Control speculation. New processors can execute a group of instructions while speculating on the result of the governing branch. The processor jumps to correction code that repeats all dependent instructions only if the speculation failed.
    • Hardware support for function calls and returns.

      Binary code processing involves frequent invocation of small method calls, especially with bounds-checking, null-checking, and exception handling that are part of the type-safety feature that guards VM technologies from certain classes of common coding errors. One study found a typical method call to be only 68 bytes in Java bytecode, with one in four instructions being a null or bounds check. In addition, the majority of Java method calls are virtual, or indirect - in other words, it takes several steps to find the memory location of the method being called.

    New hardware architectures should boost the VM language performance. We need to keep in mind this evolution when we want to optimize our code:
    1. We should use language libraries to profit from their improvements
    2. Coming enhancements tend to be layered: New processors improve the performance of the VM. Pre-compiler enhancements inline some methods and reorder small chunks of code. Let's take a Java example:
      String a = ...;
      String b = ...;
      String c = a + b;
      It is still preferable to use the StringBuffer class and the append method though the pre-compiler could identify the case and make the substitution.
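For instance, a minimal sketch of the StringBuffer idiom (the strings are illustrative only):

    String a = "Hello, ";
    String b = "world";
    StringBuffer sb = new StringBuffer(a.length() + b.length()); // pre-size the buffer
    sb.append(a);             // no intermediate String objects are created
    sb.append(b);
    String c = sb.toString(); // "Hello, world"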

    Optimization

Garbage Collection is efficient for short-lived objects with a small number of references. Therefore the best approach is (a sketch follows below):
    • To create a small set of permanent objects whose methods can be executed by many threads at the same time. Permanent object methods can't update member variables.
    • To create temporary objects that can be reclaimed at the end of the method.
    • To define static methods when possible.
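A minimal sketch of this approach; the class is hypothetical and only illustrates the idea of a permanent, stateless object plus a static helper and short-lived temporaries:

    // Hypothetical example: a permanent, stateless object plus a static helper.
    // All mutable state lives in short-lived local objects that the garbage
    // collector can reclaim cheaply.
    public final class PriceFormatter {
        // one permanent instance shared by all threads; it never updates members
        public static final PriceFormatter INSTANCE = new PriceFormatter();

        private PriceFormatter() {}

        // no member variable is written, so many threads can call this safely
        public String format(double amount, String currency) {
            StringBuffer tmp = new StringBuffer(); // temporary, dies with the method
            tmp.append(currency).append(' ').append(amount);
            return tmp.toString();
        }

        // a static method when no instance state is needed at all
        public static String formatPercent(double ratio) {
            return (ratio * 100) + "%";
        }
    }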
Specifications are a great source of information. You can learn:
  • How class loaders work
  • What JVM or CLI instructions look like
For background information on garbage collection, visit this and this link collections.

    Java

    You can find the Java specification here.

    This article describes the Java garbage collection.

    .NET

    The .NET VM manages code and implements the Common Language Infrastructure specification (CLI). You can start with this Overview.
    The Common Language Infrastructure is standardized by ECMA. You can read the specification on CLI partition I to V.

    This article describes the Garbage collection in CLI.

    Algorithms

    In this section we list:
    • Algorithms
    • Free (at least for some of us) implementations with sources

    The Introduction to algorithms of Cormen, Leiserson, Rivest and Stein was a great source of inspiration for that section.

    You can check the Wayne's Little Data Structures and Algorithms Library.

    Sort

    Fintan Culwin's Learn Ada on the Web Algorithms describes sort algorithms.
    Jon Bentley and Robert Sedgewick present here fast Algorithms for Sorting and Searching Strings.

Remember that RDBMSs are sort engines.

    To sort elements you can use:
  • The SortedSet and SortedMap in Java
  • The SortedList in .NET

Data structures

    Regarding data structures, don't try to reinvent the wheel:
    1. Pick the most suitable model in the list below
    2. Use the language implementation when it is available

    Stack and Queues

    Stacks and queues are dynamic sets in which the element removed from the set by the DELETE operation is prespecified.

    In a stack, the element deleted from the set is the one most recently inserted: the stack implements a last-in, first-out or LIFO policy.
In a queue, the element deleted from the set has been in the set for the longest time: the queue implements a first-in, first-out or FIFO policy.

    Examples:

    • We used a stack in Cuckoo for XML generation. Each time we create a new element we first insert the opening tag (for instance <p>) and push the closing tag on the stack (here </p>). Once we have written the element we pop the closing tag from the stack (a sketch follows below).
    • The author used queues in worker threads. In that model the requesting thread queues requests on the worker thread's queue. The worker thread dequeues and processes the requests.
    Java JDK provides a Stack class.
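Here is a minimal sketch of the closing-tag stack described in the first example above, using the JDK Stack class; the class and element names are ours, not the actual Cuckoo code:

    import java.util.Stack;

    // Minimal sketch of a closing-tag stack for XML generation.
    public class XmlWriter {
        private final StringBuffer out = new StringBuffer();
        private final Stack closing = new Stack();   // holds pending closing tags

        public void open(String element) {
            out.append('<').append(element).append('>');
            closing.push("</" + element + ">");      // remember how to close it
        }

        public void text(String s) {
            out.append(s);
        }

        public void close() {
            out.append((String) closing.pop());      // emit the matching closing tag
        }

        public String toString() {
            return out.toString();
        }
    }

Calling open("p"), then text("hello"), then close() produces <p>hello</p>.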

    Linked lists

A linked list is a data structure in which the order is determined by a next pointer in each object. In a doubly linked list, there is also a prev pointer that points to the previous element.

    Use linked lists when:

    • The list is frequently updated
    • The element order matters

Because Java and C# support references, you can implement your own linked list. However, we don't recommend it.
    Java SDK provides a LinkedList class.
    Use ArrayList in .NET.

    Rooted trees

    There are different sort of trees:
    • Binary tree
    • Red-Black trees, a special sort of binary trees
    • Rooted trees with bounded branching
    • Rooted trees with unbounded branching
    • B-Tree
    • Binomial tree
    • Ternary search tree
Like a linked list, a tree is made of nodes. All tree nodes are connected, but a tree cannot contain cycles (different paths from one node to another). A rooted tree is a tree in which one of the vertices is distinguished as the root. An ordered tree is a rooted tree in which the children of each node are ordered. In this section we focus on ordered trees.

In the simplest case (binary tree) a node contains three pointers: p pointing to the parent, left pointing to the left child and right pointing to the right child.
    We can represent a rooted tree with bounded branching simply using a pointer for each child.
    We can represent a rooted tree with unbounded branching with a binary tree and the left-child, right-sibling representation.

    When a binary tree is unbalanced the search time may be no better than with a linked list.
    Red-black trees are "balanced" binary trees. Each node of a red-black tree contains a color that can be RED or BLACK.

    1. The root and every leaf (NIL node) are black
    2. If a node is red both its children are black
    3. For each node, all paths to descendant nodes contain the same number of black nodes

    Nodes are inserted at a given place in the tree depending on their value.

In a binary search tree, each node has at most two children; the child on the left branch has a lesser value than the parent and the child on the right branch has a greater value than the parent.

This means that you don't have to search the whole tree. Start at the root: if the value you are looking for is less than the root, jump to the child on the left branch; if it is greater, jump to the child on the right. From there you repeat the same check at each node, moving to the left child when your value is smaller than the current node and to the right child when it is greater, until you find the value or reach a leaf.

B-trees are balanced search trees similar to red-black trees, but their nodes can have many children.
Binomial trees are trees defined recursively. The binomial tree B0 consists of a single node. The binomial tree Bk consists of two binomial trees Bk-1 linked together. We can unite binomial trees.

    In a ternary search tree a search compares the current character in the search string with the character at the node. If the search character is less, the search goes to the left child; if the search character is greater, the search goes to the right child. When the search character is equal, though, the search goes to the middle child, and proceeds to the next character in the search string. See the Jon Bentley and Bob Sedgewick article and a C implementation for more information.

    Trees are useful when you primarily need to retrieve keys in a range. For instance the Java JDK implementation has a subMap method that returns a view of the portion of the tree whose keys are in a range and a tailMap method that returns a view of the portion of the tree whose keys are greater than or equal to a given value.
The ternary search tree competes with hash tables and does not have the collision problem.

    Implementations
    • This page contains a good presentation of binary trees and even a C# implementation.
    • Java SDK provides a Red-Black implementation in TreeMap and TreeSet classes.
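For instance, a minimal sketch of range retrieval with the JDK TreeMap and the subMap and tailMap methods mentioned above (keys and values are illustrative only):

    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal sketch of range retrieval with the JDK red-black tree.
    public class RangeDemo {
        public static void main(String[] args) {
            SortedMap prices = new TreeMap();
            prices.put("apple", new Integer(3));
            prices.put("banana", new Integer(1));
            prices.put("cherry", new Integer(4));
            prices.put("plum", new Integer(2));

            // all entries whose key is in ["b", "d"): banana and cherry
            System.out.println(prices.subMap("b", "d"));

            // all entries whose key is >= "c": cherry and plum
            System.out.println(prices.tailMap("c"));
        }
    }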

    Hash tables

A Hashtable uses an array of slots. An element with key k is stored in array[h(k)] where h is a hash function. The number of possible key values is much higher than the array size. Therefore two keys may hash to the same slot; this is a collision. Typically, elements that hash to the same slot are stored in a linked list.

    Use Hashtables when you primarily access your data with a key. Hashtables are especially useful for associative tables.
    Both Java and .NET provide a Hashtable class.
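A minimal usage sketch of an associative table with the JDK Hashtable class (keys and values are illustrative only):

    import java.util.Hashtable;

    // Minimal sketch of an associative table: country -> capital.
    public class CapitalLookup {
        public static void main(String[] args) {
            Hashtable capitals = new Hashtable();
            capitals.put("France", "Paris");
            capitals.put("Japan", "Tokyo");

            System.out.println(capitals.get("Japan"));        // Tokyo, O(1) on average
            System.out.println(capitals.containsKey("Peru")); // false, no entry yet
        }
    }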

    Heaps

    A binomial heap is a set of min-heap binomial trees. In a min-heap binomial tree the key of a node is greater than or equal to the key of its parent. There are also:
    • Fibonacci heaps. You can find here a C++ implementation
    • Relaxed heaps
    You can find a heap comparison here, a presentation here and C implementations here.

    Optimization

Optimization problems are problems that can have many solutions and where we search for a solution with an optimal value.

    Dynamic programming

    Dynamic programming applies to problems with an optimal substructure. In that case:
    • An optimal solution of the problem contains optimal solutions to subproblems
    • The solution of one subproblem must not affect the solution of another subproblem of the same problem (subproblems must be independent)
The subproblems must also overlap, which means that a naive recursive algorithm would visit the same subproblem over and over again; a dynamic programming algorithm solves each subproblem only once and stores the solution in a cache.

    You can use dynamic programming to compute the distance between words or sentences.
    You can find here a presentation and an implementation of the Levenshtein distance algorithm. The Levenshtein distance algorithm has been used in:

    • Spell checking
    • Speech recognition
    • DNA analysis
    • Plagiarism detection
We wrote a Javascript implementation that computes the distance between a source string and a target string.
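For readers who prefer Java, here is a minimal sketch of the same dynamic-programming computation; the class and method names are ours:

    // Minimal sketch of the Levenshtein distance computed by dynamic programming;
    // d[i][j] is the distance between the first i characters of s and the first
    // j characters of t.
    public class Levenshtein {
        public static int distance(String s, String t) {
            int n = s.length();
            int m = t.length();
            int[][] d = new int[n + 1][m + 1];

            for (int i = 0; i <= n; i++) d[i][0] = i; // delete i characters
            for (int j = 0; j <= m; j++) d[0][j] = j; // insert j characters

            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(
                            d[i - 1][j] + 1,         // deletion
                            d[i][j - 1] + 1),        // insertion
                            d[i - 1][j - 1] + cost); // substitution
                }
            }
            return d[n][m];
        }

        public static void main(String[] args) {
            System.out.println(distance("kitten", "sitting")); // prints 3
        }
    }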

    Note:

    • PHP 4 includes a Levenshtein function
    • It is not realistic to compute the Levenshtein distance between a word and each of the words in a dictionary. A paper by Klaus Schultz and Stoyan Mihov, Fast String Correction with Levenshtein-Automata, explains the problem and presents a possible solution.
An example of a tool applying such ideas is agrep. You can download it here.
You can also use "near-neighbor searching" on a ternary search tree as described in the Jon Bentley and Bob Sedgewick paper mentioned above.

    Greedy algorithms

    A greedy algorithm is an algorithm that gobbles up all of its favorites first. The idea behind a greedy algorithm is to perform a single procedure in the recipe over and over again until it can't be done any more and see what kind of results it will produce.

An example of a greedy algorithm is Dijkstra's algorithm for finding the shortest path from a point in a graph to a destination. It finds the shortest paths from a given source to all points in the graph at the same time. You can find here and here descriptions of the Dijkstra algorithm.

Another example of a greedy algorithm is Huffman compression. However, optimal merge is even simpler to understand.

    Classic compression

    Huffman compression is a three step process:
    1. Get the frequency of the file characters
    2. Create a binary tree
      1. Start with as many trees as there are symbols. The character itself is in the leaf of the tree
      2. While there is more than one tree
        1. Find the two trees with the smallest total frequency
        2. Combine the trees into one, setting one as the left child and the other as the right
        Now the tree contains all the symbols. A '0' represents following the left child; a '1' represents following the right child. The binary code of a symbol is made of the branch values
    3. Replace each symbol by its binary code
    To uncompress we read the file bit per bit and we use the bit value to navigate in the binary tree up to the leaf that contains the character.
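To make step 2 concrete, here is a minimal sketch of the tree construction, assuming Java 5's java.util.PriorityQueue; the symbols and frequencies are illustrative only:

    import java.util.PriorityQueue;

    // Minimal sketch of Huffman tree construction (step 2 above).
    public class HuffmanSketch {
        static class Node implements Comparable<Node> {
            final int freq;
            final char symbol;       // only meaningful for leaves
            final Node left, right;  // null for leaves

            Node(char symbol, int freq) { this(symbol, freq, null, null); }
            Node(char symbol, int freq, Node left, Node right) {
                this.symbol = symbol; this.freq = freq; this.left = left; this.right = right;
            }
            public int compareTo(Node other) { return freq - other.freq; }
        }

        static Node build(char[] symbols, int[] freqs) {
            PriorityQueue<Node> queue = new PriorityQueue<Node>();
            for (int i = 0; i < symbols.length; i++)          // one tree per symbol
                queue.add(new Node(symbols[i], freqs[i]));
            while (queue.size() > 1) {                        // combine the two smallest trees
                Node a = queue.poll();
                Node b = queue.poll();
                queue.add(new Node('\0', a.freq + b.freq, a, b));
            }
            return queue.poll();                              // root of the code tree
        }

        static void printCodes(Node node, String code) {
            if (node.left == null) {                          // leaf: the code is complete
                System.out.println(node.symbol + " -> " + code);
            } else {
                printCodes(node.left, code + "0");            // 0 = follow the left child
                printCodes(node.right, code + "1");           // 1 = follow the right child
            }
        }

        public static void main(String[] args) {
            Node root = build(new char[] {'a', 'b', 'c', 'd'}, new int[] {45, 13, 12, 5});
            printCodes(root, "");
        }
    }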

Alexander Simakov wrote a small and clean implementation, shcodec (shc).
    We compared shcodec with bzip, gzip, winzip and Microsoft compress with this xhtml file (59467 bytes then). We also included in this comparison XMILL, one of the XML products described on Oasis site.

    Command          File size (bytes)
    compress         24807
    compress -zq1    19490
    compress -zq7    17595
    gzip             18728
    gzip -1          21123
    gzip -9          18247
    winzip           18496
    bzip             16161
    XMILL            19294
    shcodec          37477

gzip, zlib and png use the same algorithm, by Jean-loup Gailly and Mark Adler. You can find zlib here and gzip here.

Winzip, gzip and zlib use a combination of the deflation and Huffman algorithms. The deflation algorithm is an unpatented variation of Lempel-Ziv (which is used in compress and gif). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string. Distances are limited to 32K bytes and lengths are limited to 258 bytes. Literals or match lengths are compressed with one Huffman tree, and match distances are compressed with another tree. Compressed data comes in blocks and each block has its own two Huffman trees.

    The deflation process depends on being able to identify portions of the input text which are identical to earlier input within a sliding window trailing behind the input currently being processed. gzip and zlib try all possible matches and select the longest.

    To find matching strings the deflation uses the Karp-Rabin algorithm. To find if a text contains a pattern, the first idea is to compare the string starting at each position of the text with the pattern. The problem with this solution is that character comparisons are expensive. With Karp-Rabin we first compare the hash of the pattern with the hash of the string and we only compare the strings if the hashes are equal. This approach is efficient only because we can update rather than recompute the hash.
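A minimal sketch of Karp-Rabin matching with a rolling hash; the base and modulus are arbitrary choices of ours, and this is the textbook algorithm, not the specialized 3-byte hash used by zlib:

    // Minimal sketch of Karp-Rabin string matching with a rolling hash.
    public class KarpRabin {
        public static int indexOf(String text, String pattern) {
            int n = text.length(), m = pattern.length();
            if (m > n) return -1;
            final long BASE = 256, MOD = 1000000007L;

            long high = 1;                       // BASE^(m-1) mod MOD
            for (int i = 1; i < m; i++) high = (high * BASE) % MOD;

            long ph = 0, th = 0;                 // pattern hash and text-window hash
            for (int i = 0; i < m; i++) {
                ph = (ph * BASE + pattern.charAt(i)) % MOD;
                th = (th * BASE + text.charAt(i)) % MOD;
            }

            for (int i = 0; i + m <= n; i++) {
                // compare characters only when the cheap hash comparison matches
                if (ph == th && text.regionMatches(i, pattern, 0, m)) return i;
                if (i + m < n) {
                    // update the hash instead of recomputing it: drop text[i], add text[i+m]
                    th = (th - text.charAt(i) * high % MOD + MOD) % MOD;
                    th = (th * BASE + text.charAt(i + m)) % MOD;
                }
            }
            return -1;
        }

        public static void main(String[] args) {
            System.out.println(indexOf("the deflation process", "flat")); // prints 6
        }
    }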

The gzip/zlib compressor uses a chained hash table to find duplicated strings, with a Karp-Rabin-style hash function that operates on 3-byte sequences. From RFC 1951:
    At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.

gzip and zlib allow specifying whether you prefer intensive or fast compression. It only changes the deflation. Here is the code from zlib deflate.c (Jean-loup Gailly):

    /* Values for max_lazy_match, good_match and max_chain_length, depending on
     * the desired pack level (0..9). The values given below have been tuned to
     * exclude worst case performance for pathological files. Better values may be
     * found for specific files.
     */
    typedef struct config_s {
       ush good_length; /* reduce lazy search above this match length */
       ush max_lazy;    /* do not perform lazy search above this match length */
       ush nice_length; /* quit search above this match length */
       ush max_chain;
       compress_func func;
    } config;
    
    local const config configuration_table[10] = {
    /*      good lazy nice chain */
    /* 0 */ {0,    0,  0,    0, deflate_stored},  /* store only */
    /* 1 */ {4,    4,  8,    4, deflate_fast}, /* maximum speed, no lazy matches */
    /* 2 */ {4,    5, 16,    8, deflate_fast},
    /* 3 */ {4,    6, 32,   32, deflate_fast},
    
    /* 4 */ {4,    4, 16,   16, deflate_slow},  /* lazy matches */
    /* 5 */ {8,   16, 32,   32, deflate_slow},
    /* 6 */ {8,   16, 128, 128, deflate_slow},
    /* 7 */ {8,   32, 128, 256, deflate_slow},
    /* 8 */ {32, 128, 258, 1024, deflate_slow},
    /* 9 */ {32, 258, 258, 4096, deflate_slow}}; /* maximum compression */
    
    /* Note: the deflate() code requires max_lazy >= MIN_MATCH and max_chain >= 4
     * For deflate_fast() (levels <= 3) good is ignored and lazy has a different
     * meaning.
     */
    
Lazy means here that a match is finally adopted only if there is no better match when the compressor searches for a longer match starting at the next input byte.

Not surprisingly, deflate-based compression tools have similar performance - a compressed size between 30 and 40% of the original size - whereas the pure Huffman compressed size is about 60% of the original size.

shcodec is fast and has a very small memory requirement (1280 bytes for encoding and only 574 bytes for decoding) because it is a static Huffman codec. For files with large amounts of the same characters (binary zero, blank, ...), adaptive Huffman coding can compress better. With adaptive coding, both the compressor and decompressor maintain a code tree. Each time they process a character they recompute the code tree, either to insert a new code or to maintain a sibling property: each node (except the root) must have a sibling and the nodes must be listed in order of nonincreasing frequency with each node adjacent to its sibling.

Static Huffman encoding has another advantage: you can compute a Huffman tree once and use it over and over. The compressor treats data as a continuous stream and replaces bytes by their variable-length codes. You can find an example of such a tree in the Sierra Creative Interpreter. It is good for speed (one-pass compression) but not for size, especially for small files or messages, because the unique Huffman tree must accommodate all possible characters (typically 256). If a small message contains only 50 to 60 distinct characters, each character could be coded on fewer bits.

bzip2 uses the Burrows-Wheeler transform. You can download it from the Red Hat site. It has the best compression ratio but uses more memory and CPU than gzip.
    From Julian Seward, the author of bzip:
    "bzip2 is a resource hog. It soaks up large amounts of CPU cycles and memory. Also, it gives very large latencies. In the worst case, you can feed many megabytes of uncompressed data into the library before getting any compressed output, so this probably rules out applications requiring interactive behaviour. These aren't faults of my implementation, I hope, but more an intrinsic property of the Burrows-Wheeler transform."

You can download the original paper (by Burrows and Wheeler) named A Block-sorting Lossless Data Compression Algorithm. Mark Nelson wrote an article in Dr Dobb's about the Burrows-Wheeler Transform. Arturo San Emeterio Campos wrote another paper.

    Millau extends WBXML (from WAP) with separation of structure and content.

    • A structure part containing the encoding of the element tags and attributes
    • A content part containing the compressed form of the text
By separating structure and content, Millau can exploit their different redundancies: it encodes the structure part using the WAP WBXML encoding and the content using standard text compression techniques. The decompression process does just the reverse - reading in two compressed streams (structure and content) and producing an XML stream, a DOM tree or SAX events as required.

    Because Millau preserves the element structure of XML it allows a browser to skip unknown elements and attributes. Millau also treats data as a continuous stream.

There are two interesting documents on Millau: Algorithms and Programming Models for Efficient Representation of XML for Internet Applications by Neel Sundaresan and Reshad Moussa, and Millau: an encoding format for efficient representation and exchange of XML over the Web by Neel Sundaresan. We are not aware of available Millau implementations.

The only product with sources that we know of is XMILL from AT&T. The distribution includes a very interesting paper (paper/xmill.ps.gz) by Hartmut Liefke and Dan Suciu. XMILL applies three principles to compress data:

    • Separate structure from data. The structure is a tree of XML elements and attributes. The data consists of a sequence of items representing element contents and attribute values.
    • Group data items with related meanings. Data items are grouped in containers and each container is compressed separately using gzip.
    • Apply different compressors to different containers.
When we invoke XMILL, we can specify a semantic compressor to invoke for a given data item.
    1. Unlike Millau, XMILL is not designed to work in conjunction with a query processor.
    2. XMILL wins over existing techniques only if the data set is large, typically over 20KBytes and when the content is pure XML with a low content/element ratio. It is effective for instance to transfer XML logs.
For the test we had to slightly modify our file because XMILL only processes pure XML. The original size was 57687 bytes and the compressed size 19294 bytes, 33% of the original size. gzip and winzip perform slightly better.

    Notes:
    • zip archives are significantly bigger than tar.gz archives because it is more efficient to create an archive (tar) first and then to compress it. For a single file, zip and gzip are equally effective
    • Support for zip is embedded in the Java library
    • PHP supports zlib. You can also configure PHP 4 to enable bzip2 and zip (read-only) support
    • Mike Kruger wrote a native (C#) Zip/GZip library for .NET, NZipLib

    Other compression techniques

Mark Nelson wrote a good introduction (with code examples) to Arithmetic Coding and Data Compression. These papers were published in the February 1991 issue of Dr. Dobb's Journal. Therefore they are a bit outdated: we have a hundred times more memory, CPU and bandwidth than then, and less time to spend designing our programs.

Arithmetic coding can be seen as an enhancement of Huffman coding. Huffman coding produces the lowest possible output bit count for the input stream of symbols when each symbol must be coded on an integer number of bits. The optimal number of bits to be used for each symbol is log base 2 of (1/p), where p is the probability of a given character. This is a real number, whereas Huffman requires an integer number of bits to encode a symbol. From the second paper of Mark Nelson:

    "Arithmetic coding provides a way to encode symbols using an optimal number of bits. The number of bits used by each symbol is not necessarily an integer, as would be the case with Huffman coding. In order to use arithmetic coding to compress data, a model for the data is needed. The model needs to be able to do two things to effectively compress data:

    • The model needs to accurately predict the frequency/probability of symbols in the input data stream.
    • The symbol probabilities generated by the model need to deviate from a uniform distribution."

The higher the probability of the nth character, the less space we need to represent it. This is also true with Huffman coding, but with arithmetic coding, if the probability of a character is 90%, the optimal code size is 0.15 bits and not 1 bit. If we compute the probability of the nth character based on the values of the (n-1)th, (n-2)th... characters, the probability is generally higher. We can distinguish between compression models based on the number of characters used in predictions. With an order-0 model, the probability of each symbol is independent of any previous symbols. With an order-1 model, the probability of a symbol depends on the previous symbol, and so on.
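As an illustration of the log2(1/p) rule, here is a minimal sketch that computes the order-0 entropy of a byte array and the corresponding ideal output size; it is only an estimate, since a real coder adds model and header overhead:

    // Minimal sketch: order-0 entropy of a byte array, i.e. the ideal output size
    // if every byte were coded on log2(1/p) bits.
    public class Entropy {
        public static double bitsPerByte(byte[] data) {
            int[] count = new int[256];
            for (int i = 0; i < data.length; i++) count[data[i] & 0xFF]++;

            double bits = 0.0;
            for (int c = 0; c < 256; c++) {
                if (count[c] == 0) continue;
                double p = (double) count[c] / data.length;   // order-0 probability
                bits += -p * Math.log(p) / Math.log(2.0);     // p * log2(1/p)
            }
            return bits;
        }

        public static void main(String[] args) {
            byte[] data = "abracadabra abracadabra".getBytes();
            double h = bitsPerByte(data);
            System.out.println(h + " bits/byte, ideal size ~ "
                    + Math.ceil(h * data.length / 8) + " bytes");
        }
    }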

    There is another introduction by Matt Timmermans, Bijective Arithmetic Encoding with Optimal End Treatment.

A bijective compressor has the property that any file can be decompressed and then compressed back to its original form, in addition to the normal property that any file can be compressed and then decompressed back to its original form. This property is interesting for encryption: one area of cryptanalysis is to guess the key using the patterns of the original data. If the data to encrypt exhibits no pattern, it is much harder to guess the key. Matt Timmermans is the author of BICOM.

    As we have seen, the capability to predict the next byte is important for compression. The most common technique is Prediction by Partial Matching (PPM). In PPM we call context the last few symbols of the input stream and we maintain statistical information about each context in order to predict the next symbol. Jesper Larsson wrote: "For each context, a table of symbol counts is dynamically maintained, and the code applied whenever that context occurs is based on the statistics of this table. When a symbol appears in a context for the first time, its count in that context is zero. Still, it must be possible to encode the symbol in that context, so some amount of code space must be reserved for previously unseen events. Therefore, each context also keeps an escape count, used to encode a new symbol event in that context."

    These explanations help to understand the different algorithms.

Dmitry Shkarin wrote a nice paper. He is also the author of PPMd, which exists in different variants.
    Charles Bloom wrote PPMZ.

    We compared BICOM, PPMD and PPMZ with gzip and bzip2 using this file (now 84 615 bytes).
    Name     Compressed size (bytes)
    gzip     27182
    bzip2    24177
    bicom    23125
    ppmd     22309
    ppmz     21328

Codecs are listed in order of speed: bicom is still fast and ppmz is slow but compresses better. Sources are available for all implementations, so you can adapt them to your needs. Bicom allows applying a Rijndael encryption to the compressed stream.

    Miscellaneous about compression

    Conclusion
    1. The best general purpose algorithm is Deflation + Huffman. For the best performance it needs to process big chunks of data in three passes:
      1. Deflation. Finding duplicated strings in the input data and replacing them by a pointer to the previous string
      2. Compute character frequency and build a Huffman tree
      3. Replace characters by their code
    2. For structured content, it is possible to compress in a form that can be queried. It is useful when some receivers only process a part of the content or when messages must be routed: a router process can dispatch messages depending on a part of the content (data dependent routing).
    Useful links: These links also present lossy compression methods (JPEG...)

There is a hard problem that we would like to present now, the prediction of the worst-case compression ratio.
    Please send us a mail if you have comments or a solution.

    1. We cannot find a compression algorithm able to reduce the size of all files or messages
    2. However, in real life, messages and files don't contain only information. Consider the case of a SOAP response. The message is generated by a server using data stored in a database. Some data are added to allow the message parsing and to facilitate the development of the client. It is however only half the picture: the database content has been created using forms containing dropdown lists, checkboxes and filters. Therefore we can say that m = f(i) where m is the message and i is the information - whatever can change. The information length l(i) is the length needed to represent any valid value, which is greater than or equal to the length needed to represent one occurrence of the information. Because the decoration comes from if ... then coding, there is an inverse function i = f-1(m). f-1 represents the theoretical worst-case compression because l(i) is the length needed to represent any valid value.
    3. It is not practical to implement f-1. We want to use a generic compression such as zlib. With this compression, the actual compression ratio is made of two parts, one that comes from the fact that the information occurrence can be represented with less than l(i) - which is never guaranteed - and another part due to f.
Now we can formulate the hard problem (at least for us!): is it possible to predict the worst-case compression ratio of a generic algorithm
    • For messages produced by a function f(i) where i is random information that can be coded on a length l(i)
    • When there is an f-1 function?
    Our current feeling

The compressed size tends to l(i). Therefore the problem is to estimate l(i) using less CPU than we need to compress the message. The ratio message_size / l(i) is meaningless whereas the ratio compressed_size / l(i) is meaningful. Let's consider the application of this technique to networks.

    1. Below a given value (around 1400 bytes on Ethernet), there is no point in reducing the message size
    2. Above a given value, messages must be sent in chunks. The client and the server code is more complex and the response time is close to chunk_number * chunk_response_time. A solution is faster and simpler if it can assume that message size cannot exceed a given value.
    In real cases we rather want to meet a target maximum message size.

Consider the simple case of csv files. A csv file is made of rows containing fields separated by commas. Some fields contain figures in a given range, some others contain alpha characters and some others can only take a limited set of values.

The csv producer should be able to compute the quantity of information per row only once, and the compression tool can retrieve the number of rows at runtime. Then, depending on how hard it is to achieve the target compressed size (how high the ratio l(i) / target_size is), the compression tool can use:
    1. A fast Huffman coding
    2. A zlib coding
    3. A PPM coding, if possible with a variable order: if after compressing 1/3 of the message the PPM coder has written 1/6 of the target size, it can use a smaller order, but if it has written 1/2 of the target size it must increase the order.
    We can extend what we said for csv file to XML files described by a schema. A schema analysis tool prompting the user for additional information can build a list giving the amount of information per element. At runtime, the compression tool can compute the number of occurrences of each element and then predict the compressed size of the message.
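As a rough sketch of this kind of estimate for csv files, assuming the producer supplies the cardinality (number of possible values) of each field - the cardinalities and row count below are ours:

    // Rough sketch: the quantity of information of a csv row is the sum over its
    // fields of log2(number of possible values). The cardinalities are assumptions
    // that the csv producer must supply.
    public class CsvInformation {
        public static double bitsPerRow(long[] fieldCardinalities) {
            double bits = 0.0;
            for (int i = 0; i < fieldCardinalities.length; i++)
                bits += Math.log(fieldCardinalities[i]) / Math.log(2.0);
            return bits;
        }

        public static void main(String[] args) {
            // a figure in [0, 9999], a 3-letter code, a flag that takes 4 values
            long[] cardinalities = { 10000L, 26L * 26L * 26L, 4L };
            double perRow = bitsPerRow(cardinalities);
            int rows = 5000;                                   // known at runtime
            System.out.println("l(i) ~ " + Math.ceil(perRow * rows / 8) + " bytes");
        }
    }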

    Graphs

    You can look at Review of Elementary Graph Theory.
    Another presentation with Java implementations for breadth-first search and depth-first search is here.

There is also an introduction to graphs. It includes a glossary and presents graph partitioning, an important issue for instance for parallel processing. We can represent a graph:
    • As a collection of adjacency lists. For each node we list the nodes it is connected to.
    • As an adjacency matrix. This matrix is a square matrix with one row and one column per node. If there is an edge between node i and node j, then Adj[i, j] equals one (or the edge weight), otherwise it equals zero.
    In a directed graph, edges can be represented by arrows.
    The degree of a vertex is the number of edges incident on it.
    A path is a sequence of vertices with their connecting edges.
    A path forms a cycle if its first vertex is also its last vertex.
    A graph with no cycle is acyclic.

Given a source vertex s, Breadth-First Search (BFS) explores the graph edges to discover the vertices reachable from s and builds a breadth-first tree rooted at s that contains all reachable vertices. For any vertex reachable from s, the path in the breadth-first tree is a path with the smallest number of edges. To keep track of progress, breadth-first search colors each vertex white, gray or black. Vertices start out white. A discovered vertex all of whose adjacent vertices have been discovered is black; otherwise it is gray.

See C++ Boost breadth_first_search for an implementation using (alas!) C++ templates.

    Breadth-first search crawling yields high-quality pages presents a Web crawling strategy based on BFS.

In depth-first search, edges are explored out of the most recently discovered vertex v that still has unexplored edges leaving it. When all of v's edges have been explored, the search backtracks to explore edges leaving the vertex from which v was discovered. As in breadth-first search, vertices are colored during the search. Each vertex is initially white; it is grayed when it is discovered and it is blackened when its adjacent nodes have been explored.

See C++ Boost depth_first_search for an implementation using C++ templates.

The topological sort algorithm creates an ordering of the vertices such that if edge (u,v) appears in the graph, then u comes before v in the ordering. The graph must be a directed acyclic graph (DAG). The implementation consists mainly of a call to depth_first_search().

See C++ Boost topological_sort for an implementation using C++ templates.

    The minimum-spanning-tree problem is defined as follows:
    find an acyclic set of edges that connects all of the vertices in the graph and whose total weight is minimized, where the total weight is given by w(T) = sum of w(u,v) over all (u,v) in T, where w(u,v) is the weight on the edge (u,v), T being the spanning tree.

    There are two greedy algorithms:
    • Kruskal
    • Prim

    Kruskal's algorithm starts with each vertex in a tree by itself, and with no edges in the minimum spanning tree T. The algorithm then examines each edge in the graph in order of increasing edge weight. If an edge connects two vertices in different trees the algorithm merges the two trees into a single tree and adds the edge to T.
See C++ Boost kruskal_minimum_spanning_tree for an implementation using C++ templates.

Prim's algorithm operates like Dijkstra's shortest-path algorithm.
See C++ Boost prim_minimum_spanning_tree for an implementation using C++ templates.

    There are different shortest path problems:
    1. Single-pair problem. We try to find the shortest path between two vertices
    2. Single-source problem. We try to find all shortest paths from one vertex to every other vertex in the graph
    3. Single-destination problem. Equivalent to the single-source problem (reverse the direction of each edge)
    4. All-pair problem
    There are no algorithms for solving the single-pair problem that are asymptotically faster than algorithms that solve the single-source problem. Therefore if the graph doesn't change (for instance the roads between towns), it is possible to run the algorithm only once and reuse the result over and over.

    The Bellman-Ford algorithm solves the single-source shortest-path problem for a graph with both positive and negative edge weights. If you only need to solve the shortest paths problem for positive edge weights, Dijkstra's algorithm is faster. If all weights equal one, use breadth-first search.

The Bellman-Ford algorithm proceeds by looping through all of the edges in the graph, applying the relaxation operation to each edge. In the following pseudo-code, s is the source vertex, u is a vertex, v is a vertex adjacent to u, w is a map where w[u, v] is the weight of the edge, d is a map where d[u] is the length of the shortest path to u so far, p is a map where p[u] is the parent of u.
    relax(u, v, w, d, p) {
      if (w[u, v] + d[u] < d[v]) {
        d[v] = w[u, v] + d[u];
        p[v] = u;
      }
    }
    bool bellman-ford(V, w, s, [out]p, [out]d) {
      for each vertex u in V {
        d[u] = infinity;
        p[u] = u;
      }
      d[s] = 0;    // the distance from the source to itself is zero
      for (int i = 1; i < V.length; ++i) {
        for each edge(u, v) in w
          relax(u, v, w, d, p);
      }
      for each edge(u, v) in w {
        if (w[u, v] + d[u] < d[v])
          return false;    // not minimized: there is a negative-weight cycle
      }
      return true;
    }
    

See C++ Boost bellman_ford_shortest_paths for an implementation using C++ templates.

If the graph is a directed acyclic graph, we can relax the edges in topological order and a single pass over the vertices is enough:
    topological_sort();
    for each vertex u in V {
      d[u] = infinity;
      p[u] = u;
    }
    d[s] = 0;
    for each vertex u in V, in topological order {
      for each edge(u, v) leaving u
        relax(u, v, w, d, p);
    }
    
    Dijkstra's algorithm finds all the shortest paths from the source vertex to every other vertex by iteratively growing the set of vertices S to which it knows the shortest path. At each step of the algorithm, the next vertex added to S is determined by a priority queue. The queue contains the vertices in the graph but not in S prioritized by their distance label, which is the length of the shortest path seen so far for each vertex. The vertex u at the top of the priority queue is then added to S, and each of its out-edges is relaxed: if the distance to u plus the weight of the out-edge (u,v) is less than the distance label for v then the estimated distance for vertex v is reduced. The algorithm then loops back, processing the next vertex at the top of the priority queue. The algorithm finishes when the priority queue is empty.
See C++ Boost dijkstra_shortest_paths for an implementation using C++ templates.
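A minimal sketch of the iteration described above, assuming Java 5's java.util.PriorityQueue and an adjacency-list representation; the vertex numbering and class names are ours:

    import java.util.List;
    import java.util.PriorityQueue;

    // Minimal sketch of Dijkstra's algorithm over an adjacency list,
    // assuming non-negative edge weights.
    public class Dijkstra {
        static class Edge { int to; int weight; Edge(int to, int weight) { this.to = to; this.weight = weight; } }
        static class Item implements Comparable<Item> {
            int vertex; int dist;
            Item(int vertex, int dist) { this.vertex = vertex; this.dist = dist; }
            public int compareTo(Item o) { return dist - o.dist; }
        }

        // adj[u] is the list of out-edges of u; returns d, the distance labels from s
        static int[] shortestPaths(List<Edge>[] adj, int s) {
            int[] d = new int[adj.length];
            java.util.Arrays.fill(d, Integer.MAX_VALUE);
            d[s] = 0;
            PriorityQueue<Item> queue = new PriorityQueue<Item>();
            queue.add(new Item(s, 0));
            while (!queue.isEmpty()) {
                Item top = queue.poll();                  // vertex with the smallest label
                if (top.dist > d[top.vertex]) continue;   // stale entry, already improved
                for (Edge e : adj[top.vertex]) {
                    int nd = d[top.vertex] + e.weight;    // relax the out-edge
                    if (nd < d[e.to]) {
                        d[e.to] = nd;
                        queue.add(new Item(e.to, nd));
                    }
                }
            }
            return d;
        }
    }

The stale-entry check replaces the decrease-key operation that java.util.PriorityQueue does not provide.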

    To solve the all-pair problem, we can use
    • The Floyd-Warshall algorithm
    • The Johnson algorithm for sparse graphs
With the Floyd-Warshall algorithm, we first build an n x n matrix W where W[u, u] = 0 and W[u, v] is the weight of the (u, v) edge if u and v are connected, and infinity otherwise.
    n = W.length;
    D(0) = W;
    // d(k)[i, j] is the shortest path weight from i to j using only intermediate vertices 1..k
    for (int k = 1; k <= n; ++k) {
      for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= n; ++j) {
          d(k)[i, j] = min(d(k-1)[i, j], d(k-1)[i, k] + d(k-1)[k, j]);
        }
      }
    }
    return D(n);
    
There are other presentations from Mark Dettinger and from the University of Rochester.

Johnson's algorithm either returns a matrix of shortest-path weights for all pairs of vertices or reports that the input graph contains a negative-weight cycle. First, it adds a new node with zero weight edges from it to all other nodes, and runs the Bellman-Ford algorithm to check for negative weight cycles and find h(v), the least weight of a path from the new node to node v. Next it reweights the edges using the nodes' h(v) values. Finally for each node, it runs Dijkstra's algorithm and stores the computed least weight to other nodes, reweighted using the nodes' h(v) values, as the final weight.

See C++ Boost johnson_all_pairs_shortest_paths for an implementation using C++ templates.

    We can interpret a directed graph as a flow network and use it to answer questions about material flows. In a flow network a source produces the material at some steady rate, and a sink consumes the material at the same rate. An edge in a flow network has a capacity. The network must satisfy two constraints:

    • For each edge, the flow must not exceed the edge's capacity
    • For each node except the source and the sink, the incoming flow must be equal to the outgoing flow

    In the maximum-flow problem, we wish to compute the greatest rate at which material can be shipped from the source to the sink. We can use two methods:
    • Ford-Fulkerson method
    • Push-relabel method
In the Ford-Fulkerson method, an augmenting path is a path from the source to the sink along which we can send more flow. Here is the Ford-Fulkerson pseudo-code:
    f = 0;
    while there is an augmenting path p
      augment flow f along p;
    return f;
    
    You can find a C implementation of Ford-Fulkerson method here.

Push-relabel algorithms are often faster than the Ford-Fulkerson method.
See C++ Boost push_relabel_max_flow for an implementation using C++ templates.

    Linear programming

In the general linear programming problem, we are given an m x n matrix A, an m-vector b, and an n-vector c. We wish to find a vector x of n elements that maximizes the objective function Sum_{i=1}^{n} c_i x_i subject to the m constraints given by Ax <= b. The constraints form the feasible region.
    With n = 2 and m = 3
    if A =  (4 -1)
            (2  1)
            (5 -2)
    c = (1 1) and b = (8 10 6)
    
    it is the same as:
    maximize x1 + x2 subject to
    4 x1 - x2 <= 8
    2 x1 + x2 <= 10
    5 x1 - 2 x2 <= 6
    

    4 x1 - x2 <= 8 is a constraint.
A constraint has the form Sum_{j=1}^{n} a_ij x_j <= b_i.
When n equals two (two dimensions) we can represent a constraint by a line on a graph whose abscissa is x1 and ordinate is x2. The feasible region is the surface delimited by the constraint lines. When n equals 3 we can represent a constraint by a plane and the feasible region by a volume, and so on.
    We also call the feasible region a simplex.
    The set of points that achieve the optimal objective includes a vertex of the feasible region.

    The simplex algorithm starts at some vertex of the simplex and performs a sequence of iterations. At each iteration, it moves along an edge of the simplex from the current vertex to a neighboring vertex whose objective value is no smaller than that of the current vertex. The simplex algorithm terminates when it reaches a vertex from which all neighboring vertices have a smaller objective value.

First we convert the constraints into slack form. We typically introduce new variables x_{n+i} and we write:
x_{n+i} = b_i - Sum_{j=1}^{n} a_ij x_j
x_{n+i} >= 0
The variable x_{n+i} on the left-hand side of the equality is a basic variable.
The variables x_j on the right-hand side of the equality are nonbasic variables.

We focus on the basic solution: set all nonbasic variables to zero and compute the values of the basic variables: x_{n+i} = b_i. Next iterations of the simplex algorithm will rewrite the set of equations and the objective function so as to put a different set of variables on the right-hand side. We select a nonbasic variable x_j whose coefficient in the objective function is positive, and we increase the value of x_j as much as possible without violating any of the constraints. The variable x_j becomes basic and the x_{n+i} on the left-hand side of the tightest constraint becomes nonbasic. This operation is a pivot. x_j is the entering variable and x_{n+i} is the leaving variable. We rewrite the constraints and the objective function to have only nonbasic variables on the right-hand side. The basic solution objective should improve.

If we cannot improve the objective function by a single pivot operation (moving from a basic feasible solution to a neighbouring basic feasible solution) then the current basic feasible solution is optimal.

    Given a primal linear program as above, we can define a dual linear program as

Minimize Sum_{i=1}^{m} b_i y_i
subject to
Sum_{i=1}^{m} a_ij y_i >= c_j for j = 1 to n
y_i >= 0 for i = 1 to m

    Given a linear program, Simplex can return
    • Infeasible
    • Unbounded (infinite objective value)
    • An optimal solution with a finite objective value

    Robert J. Vanderbei wrote a free implementation for Windows and Unix that you can download here. It features the two-phase and the self-dual methods with the eta-matrix and the refactorization approaches.

    SoPlex is an implementation of the revised simplex algorithm. It features primal and dual solving routines for linear programs and is implemented as a C++ class library that can be used with other programs. You are allowed to retrieve SoPlex only for research purpose as a member of a non-commercial and academic institution.

    Bayesian networks

Read first A Brief Introduction to Graphical Models and Bayesian Networks by Kevin Patrick Murphy.
    This page lists Software Packages for Graphical Models / Bayesian Networks.
    The most important implementations are perhaps:

    Reinforcement learning

    Reinforcement learning is an extension of Markov Decision Process (MDP). A Markov decision process, M = {S, A, T, R}, is a description of a synchronous control domain specified by four components:
    • A state space, S = {s1, s2, ..., sN}, of cardinality |S| = N;
    • A set of primitive actions, A = {a1, a2, ..., ak}, of cardinality |A| = k
    • A transition function, T. T(s'|s, a) determines the probability of arriving in state s' upon taking action a from state s
    • A reward function, R. An agent located at a single state s at any time step chooses a primitive action, is then relocated to a state s' determined by T(s'|s, a), whereupon it receives a R(s') reward

The goal of the agent is to maximize its aggregated reward over time,
Sum_{t=0}^{T} (R(s_t) - e) or Sum_{t=0}^{T} R(s_t) gamma^t
    The goal of planning in MDP is to locate a policy that determines an action for the agent for any possible state. The policy can be found by solving the Bellman equations.
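A minimal sketch of value iteration, one standard way to solve the Bellman equations; the discount factor gamma and the convergence threshold epsilon are assumptions of this sketch:

    // Minimal sketch of value iteration for an MDP. T[a][s][s2] is T(s2 | s, a);
    // gamma and epsilon are assumptions of this sketch.
    public class ValueIteration {
        public static double[] solve(double[][][] T, double[] R, double gamma, double epsilon) {
            int nStates = R.length;
            int nActions = T.length;
            double[] V = new double[nStates];             // value of each state
            double delta;
            do {
                delta = 0.0;
                for (int s = 0; s < nStates; s++) {
                    double best = Double.NEGATIVE_INFINITY;
                    for (int a = 0; a < nActions; a++) {
                        double expected = 0.0;            // expected value of taking a in s
                        for (int s2 = 0; s2 < nStates; s2++)
                            expected += T[a][s][s2] * V[s2];
                        best = Math.max(best, R[s] + gamma * expected);
                    }
                    delta = Math.max(delta, Math.abs(best - V[s]));
                    V[s] = best;
                }
            } while (delta > epsilon);                    // stop when values stabilize
            return V;
        }
    }

The greedy action in each state (the arg max of the same expression) then gives the policy.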

    You can read The Reinforcement Learning Problem of Richard Sutton and A brief introduction to reinforcement learning of Kevin Patrick Murphy.

    Design

Before diving into methods and patterns, you can read The Cathedral and the Bazaar and visit the Extreme programming site. OK, it is not the same thing; you can use both extreme programming and a design method... Blabla. If "release early/release often" has more impact on software quality than thorough object analysis, why should we spend time learning that stuff?

    Methods exist for three reasons:
    1. Organizations would like to Taylorize software development, not so much to dehumanize programming but because they would like to know in advance how much they will spend before the release of a stable version
    2. Object methods were supposed to improve component reuse, and it didn't happen. However, UML notation succeeded in documenting development. A company only owns the software for which it has the source and UML documentation
    3. Object methods are useful to communicate with other programmers, partners and customers

    UML

    UML is a modeling language, not a method.

A modeling language is a notation that design methods use to express designs. Because UML is broadly used, it is certainly the best modeling language that we can use. To design a product, we also need to apply a process, which is advice on what steps to take in doing a design.

    Processes are necessarily fuzzy and almost never followed because they assume a phasing.

    1. During inception, we address political issues (getting management support and establishing the business rationale).
    2. During elaboration we collect requirements and create a plan. It is also where we identify risks related to requirements, technology, skills and politics.
    3. During construction we write the product.
    4. During transition the product is released.
    Requirements continue to appear and change up to the transition phase whereas the design, sizing and schedule are frozen in elaboration.

Iterative construction addresses this issue: in the initial version, developers don't try to implement all requirements. They release early and often. Users see the product and it is cheaper and simpler to take corrective actions. The problem is that the elaboration also has to be iterative and to go on in parallel with construction. The management must agree on something that nobody can describe.

To address this issue we hierarchise requirements. On a medium-size project it is common to have 100 or more requirements. Cognitive psychologists have determined that a human is only capable of maintaining and manipulating cognitive models which contain seven plus or minus two, i.e. between five and nine, cognitive chunks of information.

    Therefore we first identify two to five top level requirements that we describe as objectives. Then we assign second level requirements to first level requirements. A first level requirement can have seven to nine second level requirements.

At this step we can see how second level requirements relate to each other. A requirement shouldn't be much harder to satisfy than another. Requirements should also cover the area of possible features well. If that is the case, chances are good that further requirements will keep falling within the area of possible features. Otherwise the top level requirement taxonomy probably doesn't represent the need.

Then we can make preliminary designs (more than one!) meeting the requirements. Next we select the most capable in terms of evolution by answering the following questions:

    • How will the design behave if a new second level requirement is added?
    • How will the design behave if a top level requirement drifts?
    • Can the design accommodate a new top level requirement?

    During the construction phase, we choose the solution whose impact on the evolution capability is the smallest.

    Design patterns

Christopher Alexander says, "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice".

    We can decompose a design in design problems that a design pattern can solve.

    A pattern has four elements:
    1. A name
    2. The problem, when to apply the pattern
    3. The solution that describes the elements
    4. The consequences, which are the results and trade-offs of the pattern
Creational patterns
• Abstract Factory: Provide an interface for creating families of related or dependent objects without specifying their concrete classes.
• Builder: Separate the construction of a complex object from its representation so that the same construction process can create different representations.
• Factory Method: Define an interface for creating an object, but let subclasses decide which class to instantiate.
• Prototype: Specify the kind of objects to create using a prototypical instance, and create new objects by copying this prototype. (Java/C#: clone())
• Singleton: Ensure a class only has one instance, and provide a global point of access to it.

Structural patterns
• Adapter: Convert the interface of a class into another interface clients expect.
• Bridge: Decouple an abstraction from its implementation so that the two can vary independently.
• Composite: Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat individual objects and compositions uniformly.
• Decorator: Attach additional responsibilities to an object dynamically.
• Facade: Provide a unified interface to a set of interfaces in a subsystem.
• Flyweight: Use sharing to support large numbers of fine-grained objects.
• Proxy: Provide a surrogate for another object to control access to it.

Behavioral patterns
• Chain of Responsibility: Chain the receiving objects and pass the request along the chain until an object handles it.
• Command: Encapsulate a request as an object, thereby letting you parameterize clients with different requests.
• Interpreter: Define a representation for a language grammar along with an interpreter that uses the representation to interpret sentences in the language.
• Iterator: Provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation. (Java Iterator and C# IEnumerator)
• Mediator: Define an object that encapsulates how a set of objects interact.
• Memento: Capture and externalize an object's internal state.
• Observer: Define a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically.
• State: Allow an object to alter its behavior when its internal state changes.
• Strategy: Encapsulate a family of algorithms and make them interchangeable.
• Template Method: Define the skeleton of an algorithm in an operation, deferring some steps to subclasses.
• Visitor: Represent an operation to be performed on the elements of an object structure.
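
As an illustration of how one of these patterns translates into code, here is a minimal Java sketch of the Observer pattern. The names (Subject, Observer, attach, notifyObservers) are our own choices for the example and do not come from any particular library.

    import java.util.ArrayList;
    import java.util.List;

    // Observer: anything that wants to be told when the subject changes.
    interface Observer {
        void update(String event);
    }

    // Subject: keeps a list of observers and notifies each of them on a change.
    class Subject {
        private final List<Observer> observers = new ArrayList<Observer>();

        void attach(Observer observer) {
            observers.add(observer);
        }

        void notifyObservers(String event) {
            for (Observer observer : observers) {
                observer.update(event);
            }
        }
    }

    public class ObserverDemo {
        public static void main(String[] args) {
            Subject subject = new Subject();
            // Two dependents register themselves with the subject.
            subject.attach(event -> System.out.println("Logger received: " + event));
            subject.attach(event -> System.out.println("Display received: " + event));
            // One state change; both dependents are updated automatically.
            subject.notifyObservers("state changed");
        }
    }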

    Feature-driven development

    Feature-Driven Development (FDD) is a recent software-development process. The five processes within FDD are:
    1. Develop an overall model
    2. Build a features list
    3. Plan by feature
    4. Design by feature
    5. Build by feature
See the Coad letter Development Guide for more information. There is also a good paper on UI design. Make up your own mind. What we do not like in FDD is that it is still a waterfall approach: first make an overall design, then define the feature sets (using different strategies), and then the features in the feature sets. It raises the following questions:
    1. Is it by chance that object methods (including FDD) always mimic the structure of the organizations they are sold to?
    2. What do you remember the best, stories or hierarchies?
3. When you use a GUI, do you perceive a hierarchy or a mix of layouts and behaviors?

    Storyboarding

    When you have to design a product, identify the key element. If the users will primarily use the product through its graphical interface then the graphical interface is the key element. UML and design patterns don't help to design graphical interfaces. You must first storyboard your graphical interface.

    The Art of Storyboarding presents storyboarding. You can also read Storyboarding Multimedia and Storyboarding for Design: An Overview of the Process.

Storyboarding allows checking usability before spending man-months in development. Though storyboarding tends to be less formal than UML, it allows better cost estimates. It also allows identifying components that you do not know how to develop.
• Storyboarding is not only for artists and multimedia!
    • Lay out the structure of the program on paper
    • Design the screen by screen output in detail for the critical parts
    • Do not allow your ideas to become firm too early
    • Prototype the components where you identified a risk
    Storyboarding doesn't help to define the business logic and the data structure behind the graphical interface but it allows identifying the interactions between the interface and the system (use cases) and the needed information (definition of services).

Some authors suggest that we should wireframe first. They explain that wireframing defines the "what" of the creative process while storyboarding tackles the "how", but at the same time they define wireframing as identifying the entry and exit points the users experience on every page of the site. Yet one goal of storyboards is precisely to identify the pages (the "what").

    We believe that we should - just like in film industry:
    1. Describe the idea
    2. Write a synopsis
    3. If the project is big enough, write a scenario
    When we wrote Reservation we tried to emulate a storyboarding IDE. We had to address the issues identified above:
    1. The data: whatever we get and update using RPC, Web services or SQL requests.
There are two sorts of data: the existing data that the application must accommodate and the new data, defined for the application. In the design process, we change the new data when we extend and improve our storyboard. Therefore the storyboarding IDE should support a kind of dynamic binding and maintain display constraints and rules rather than the actual position of elements. For tabular data, grids are the obvious response (see the sketch after this list). For other data, layouts a la AWT are probably not enough: we may need to use cognitive technologies and products such as SOAR.
    2. The business logic: the rules and the code that implements these rules.
      One of our findings with Reservation is that
      • We must be aware of this business logic even on smaller projects
      • Object analysis and implementation is the most effective solution - probably something similar to feature-driven development
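
As an illustration of the "grids for tabular data" point above, here is a minimal Swing sketch of the kind of screen a storyboarding prototype could display. The column names and rows are invented for the example and do not come from Reservation.

    import javax.swing.JFrame;
    import javax.swing.JScrollPane;
    import javax.swing.JTable;
    import javax.swing.SwingUtilities;

    // A storyboard-style screen: tabular data displayed in a grid.
    public class ReservationGridSketch {
        public static void main(String[] args) {
            SwingUtilities.invokeLater(() -> {
                // Hypothetical columns and rows, for illustration only.
                String[] columns = {"Room", "Date", "Status"};
                Object[][] rows = {
                    {"101", "May 1", "Free"},
                    {"102", "May 1", "Booked"}
                };
                JTable grid = new JTable(rows, columns);

                JFrame frame = new JFrame("Reservation storyboard sketch");
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.add(new JScrollPane(grid));
                frame.pack();
                frame.setVisible(true);
            });
        }
    }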

    Project Management

    Mythical Man-Month

In The Mythical Man-Month (1975-1995) Frederick P. Brooks wrote:

"Cost does indeed vary as the product of the number of men and the number of months. Progress does not. Hence the man-month as a unit for measuring the size of a job is a dangerous and deceptive myth. It implies men and months are interchangeable.

"Men and months are interchangeable commodities only when a task can be partitioned among many workers with no communication among them. This is true of reaping wheat or picking cotton; it is not even approximately true of systems programming.

"When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned. Many software tasks have this same characteristic because of the sequential nature of debugging.

"In tasks that can be partitioned but which require communication among the subtasks, the effort of communication must be added to the amount of work to be done. Therefore the best that can be done is somewhat poorer than an even trade of men for months...

"Since software construction is inherently a systems effort -- an exercise in complex interrelationships -- communication effort is great, and it quickly dominates the decrease in individual task time brought about by partitioning. Adding more men then lengthens, not shortens, the schedule."
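
Brooks's communication argument can be illustrated by counting pairwise communication channels, which grow as n(n-1)/2 with team size n. The small Java sketch below is our own illustration, not taken from Brooks.

    // Pairwise communication channels grow quadratically with team size.
    public class CommunicationChannels {
        static int channels(int teamSize) {
            return teamSize * (teamSize - 1) / 2;
        }

        public static void main(String[] args) {
            for (int size : new int[] {2, 5, 10, 20, 50}) {
                System.out.println(size + " people -> " + channels(size) + " channels");
            }
        }
    }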

Following this observation, which was never seriously challenged, we can easily conclude that the safest and cheapest project is the one handled by a single person - the project that does not require management. A frequently forgotten fact is that a compiler (gcc), the first PC databases and Linux were each originally written by a single programmer, in a timeframe matching user needs. Even today an experienced programmer can develop a cheaper, faster, smaller and more reliable product than a small team.

This is not to say that teams, and even big teams, are not needed. New products have to offer more functions and be friendlier than the products they replace. Therefore they are more complex and require bigger teams. The OS/360 project of Frederick P. Brooks, then one of the most strategic projects of the biggest computer maker (IBM), had a team of 160 programmers. Today, in some companies, this would be a medium-size project. Therefore project management is more needed than ever.

    Project breakdown

Brooks also found an empirical breakdown of software tasks: 1/3 planning, 1/6 coding, 1/4 component test and early system test, and 1/4 system test with all components in hand.

We can note that this breakdown follows from the project complexity, not from communication and sequential tasks: when the project is complex, you experience this breakdown even when you work alone.
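
As a rough worked example, the snippet below applies this breakdown to a hypothetical 12-month project (the 12-month figure is ours, chosen only for illustration).

    // Brooks's empirical breakdown applied to a hypothetical 12-month project.
    public class BrooksBreakdown {
        public static void main(String[] args) {
            double totalMonths = 12.0;
            System.out.println("Planning (1/3):                 " + totalMonths / 3 + " months");
            System.out.println("Coding (1/6):                   " + totalMonths / 6 + " months");
            System.out.println("Component and early test (1/4): " + totalMonths / 4 + " months");
            System.out.println("System test (1/4):              " + totalMonths / 4 + " months");
        }
    }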

    No Silver bullet

In No Silver Bullet (1986) Brooks wrote: "no single new development in the next ten years would give software developers an order-of-magnitude improvement in productivity, reliability, or simplicity", which also turned out to be true. This is not to say that there were no improvements, but they happened mostly in coding, with better languages and IDEs, and in component tests, with symbolic debuggers; these tasks represent only 30% of the cost.

Quite interestingly, programming and component debugging are also the tasks that can be effectively taught, whereas design and system troubleshooting require experience and remain arts. And what we call experience here is not something that can easily be transferred or modeled. It has more to do with intuition: the brain magic that allows us to classify facts and to feel that things should not be done in a given way.

    Tools

If projects are not released on time, it is not because project management tools (products like Microsoft Project) are bad. These products simply handle tasks and leave the definition of the tasks to the user's judgment. There are three traps:
1. Tasks are more or less demanding in terms of experience. Developers are not interchangeable. Regardless of the team size, if there are more demanding tasks than experienced developers, the project will be delayed.
2. There are always more dependencies between tasks than expected. Tasks that were originally planned to be done in parallel are eventually done sequentially.
3. The key word in project management is "completed", and the meaning of "completed" depends on the task. A design can be completed and be flawed. A component can be completed, have successfully passed unit tests, and be unable to integrate with the rest of the product.

These problems are not simple ones. For instance, experience comes through a succession of epiphanies rather than through a peaceful journey. A project manager can trust some people because they already did something similar, but she cannot know whether an unproven programmer will be able to achieve a complex task. Therefore a project manager should be:

1. Humble. Nowadays it is easy to write lengthy documents that nobody will read and complex diagrams that nobody will check. It is more challenging to maintain a simpler view of the project that everyone can grasp and agree upon.
2. Sceptical. It is easy to paint in green the tasks declared to be on schedule and in red the tasks not delivered on time. A project manager must also consider the meaning of "completed" for a given task and question the reliability of her measurements.

Because of that, we believe that the best project management tools are the simplest and most intuitive. Take a look at products like DELEGATOR from Madrigal, a company that provides scheduling software for managing people and resources, with free evaluation versions.

    Tests

    Flowchart

    Flowgraph analysis

    White box

    White Box test

    Black box

    Metrics

    The reasons for measurement

    General large scale metrics
    Medium Scale Metrics
    Small Scale Software Metrics


    Contact:support@pagebox.net
    ©2001 Alexis Grandemange   Last modified