Selfadhesivity in Gaussian conditional independence structures

This page contains the source code and explanations related to the computational
results presented in the paper:
Tobias Boege: Selfadhesivity in Gaussian conditional independence structures

ABSTRACT: Selfadhesivity is a property of entropic polymatroids which can be formulated as gluability conditions of the polymatroid to an identical copy of itself along arbitrary restrictions and such that the two pieces are independent given the common restriction.

We show that positive definite matrices satisfy this condition as well and examine consequences for Gaussian conditional independence structures. New axioms of Gaussian CI are obtained by applying selfadhesivity to the previously known axioms of structural semigraphoids and orientable gaussoids.

All data is archived in the MPG Digital Library Keeper at https://keeper.mpdl.mpg.de/d/fbfe463162e94a14ac28/.

Installation of the tools

The computations of structural semigraphoids, orientable gaussoids and their selfadhesions described in the paper rely at their core on state of the art solvers for the boolean satisfiability problem (CaDiCaL and nbc_minisat_all) and on exact rational linear programming solvers (SoPlex). The solvers are connected to conditional independence structures through Perl modules. The source code for these modules is bundled with this MathRepo page. Download and unpack the following archive which contains the module packages:

These modules require Modern::Perl 2018 which translates to a Perl version of 5.26 or later. To install them, you need a CPAN client like cpanm (set it up to use local::lib as described in the documentation). All instructions below assume a Linux(-like) environment. Open a terminal and execute

$ cpanm Keyword::Declare Devel::Gladiator
$ cpanm CInet-Alien-*

The SAT solvers are bundled with their respective CInet::Alien module and they are compiled and installed alongside the Perl modules. You need to have prerequisite build tools (make, C and C++ compiler, etc.) available on your system. The binary are statically linked and some operating systems require you to install special -static versions of the C and C++ standard libraries for this (on Fedora, for example, install glibc-static and libstdc++-static as well as zlib-static).

The license of SoPlex does not permit bundling its source code. You have to install a version of SoPlex with arbitrary-length rational number support (via the GMP library) and then continue installing CInet::Bridge::SoPlex and the rest:

$ cpanm CInet-Bridge-SoPlex* CInet-ManySAT*
$ cpanm CInet-Base* CInet-Propositional* CInet-Polyhedral*
$ cpanm CInet-Gaussoids* CInet-Semimatroids* CInet-Adhesive*

If you do not use Perl much, then the whole installation procedure is likely to pull in a lot of transitive dependencies and is going to take a while. In case of errors, consult the log file pointed out by cpanm. It usually contains useful information as for which dependencies are missing or what the error exactly is.

Enumeration and symmetric reduction of gaussoids

The initial computations of structural semigraphoid and orientable gaussoid candidates for selfadhesivity receive as input the set of gaussoids over a 5-element ground set reduced modulo isomorphy. This set can be computed as follows:

use CInet::Base;
use CInet::Gaussoids;

say for Gaussoids(5)->modulo(SymmetricGroup)->list;

This outputs a list of 508817 binary vectors of length 80. The format is the same as on https://gaussoids.de. The computation takes around 12 GiB of memory and 95 minutes on a single core at 2.20 GHz. The output is available for download:

This file and all other text files are quite large and are therefore compressed using the zstd algorithm. You have to install the zstd program and execute the following to unpack the file:

$ zstd -d gaussoids.txt.zst

Structural semigraphoids and orientability

The Perl modules installed in the first step contain routines for testing whether a given CI structure is a structural semigraphoid or an orientable gaussoid:

use Modern::Perl 2018;
use CInet::Base;
use CInet::Semimatroids;

sub chomped {
    local $_ = shift // $_;
    chomp; $_
}

local $| = 1; # no output buffering
my @input = map CInet::Relation->new(5 => chomped), <<>>;
say for grep is_semimatroid($_), @input;

With these programs, the input gaussoids can be filtered for these properties:

$ perl get-structurals.pl <gaussoids.txt | tee structurals.txt
$ perl get-orientables.pl <gaussoids.txt | tee orientables.txt

The former takes about 144 minutes, the latter about 8 minutes. This discrepancy is in part caused by the fact that the SAT solver CaDiCaL can be used as a library and works incrementally: it only has to be filled once with the axioms for oriented gaussoids which is then solved 508817 times under different assumptions. On the other hand, SoPlex is an external program for which 508817 linear program files have to be written and each is solved by a new SoPlex process. Almost 75% of the time in get-structurals.pl is spent writing files and starting SoPlex processes. In this context it is important to note that linear program files are created usually in a directory under /tmp. For performance reasons, this should be a RAM-backed temporary filesystem instead of a real hard disk.

The number of structural semigraphoids and of orientable gaussoids is the number of lines in the output files, 336838 and 175215, respectively:

Their intersection of structural semigraphoids and orientable gaussoids can be computed easily because the same isomorphy representatives gaussoids.txt have been input into both programs, for example using the following Perl snippet and giving both file names as arguments:

use Modern::Perl 2018;

sub chomped {
    local $_ = shift // $_;
    chomp; $_
}

my %h;
$h{$_}++ for map chomped, <<>>;

for (keys %h) {
    say if $h{$_} == 2;
}

The result is available for download below and it consists of 175139 gaussoids.

Selfadhesion computations

Selfadhesivity testing is a very demanding procedure, both in terms of time and memory. This is because the number of CI statements (and correspondingly the size of linear programs or boolean formulas) grow exponentially in the ground set size \(n\), namely as \(\binom{n}{2} 2^{n-2}\). For testing the selfadhesivity of a CI structure on \(n=5\) with respect to a given property, that property has to be tested for each \(L \subseteq [n]\) on a ground set of size \(2n - |L|\). The programs computing selfadhesions are all parallelized using the Forks::Super Perl module, which has to be installed along with some other dependencies:

$ cpanm Path::Tiny Forks::Super List::MoreUtils

The installation of Forks::Super in particular takes a long time and may timeout. In that case increase the --configure-timeout, --build-timeout and --test-timeout in the cpanm call.

The program deciding \(\mathfrak{o}^{\textsf{sa}}\), for example, looks like this:

use Modern::Perl 2018;

use Forks::Super;
use List::MoreUtils qw(natatime);
use Path::Tiny;

use CInet::Base;
use CInet::Gaussoids;
use CInet::Semimatroids;
use CInet::Adhesive;

sub chomped {
    local $_ = shift // $_;
    chomp; $_
}

sub worker {
    my $id = shift;
    srand($id);

    my $outfile = path(shift);
    my $fh = $outfile->openw_utf8;
    for my $A (@_) {
        say {$fh} $A if is_selfadhesive(CInet::Relation->new(5 => $A), \&is_orientable);
    }
    close $fh;
}

my @input = map chomped, <<>>;

# Warm caches (shared by all workers):
is_selfadhesive(CInet::Relation->new(5 => $input[0]), \&is_orientable);

my $batch_size = 1000;
my $batches = natatime $batch_size, @input;
my $total = @input / $batch_size;
my $done = 0;

$Forks::Super::MAX_PROC = 30;
$Forks::Super::ON_BUSY = 'block';

my $id = 0;
my @outfiles;
while (my @batch = $batches->()) {
    my $outfile = Path::Tiny->tempfile;
    push @outfiles, $outfile;
    fork \&worker, args => [ $id++, "$outfile", @batch ];
    $done++;
    warn "Scheduled $done out of $total batches";
}
waitall;

# Print total result
for my $outfile (@outfiles) {
    say for $outfile->lines_utf8({ chomp => 1 });
}

This program occupies $Forks::Super::MAX_PROC = 30 CPUs. This can be increased or decreased according to your resources. On each CPU a dedicated worker process runs which gets a batch of 1000 input structures. There is a warmup phase which executes the is_selfadhesive routine once before starting the workers. The CInet modules for deciding orientability and structural semigraphoidality make extensive use of caches to not recompute the definition of oriented gaussoids or structural semigraphoids for every query. These caches are computed once and are then shared by all workers, greatly contributing to performance but increasing memory usage.

For orientability checking and its large axiom systems in the background, the setup phase allocates about 15 GiB of caches. With 30 parallel workers (instead of 30), the memory usage peaks at shortly below 280 GiB. The processing takes 5677 processor minutes on Intel Xeon Gold processors at 3.50 GHz.

By contrast, the structural semigraphoid computation only caches a few templates of linear programs. 20 GiB of memory suffice for the entire set of workers, but again it is much slower than the incremental SAT solver. The processing takes 59446 processor minutes of which 8745 minutes are spent on starting processes and writing files, on Intel Xeon E7-8867 v3 processors at 2.50 GHz.

Note that since the selfadhesion operator is recessive, it suffices to input only orientables.txt (with 175215 elements) into the get-orientables-sa.pl checker and only structurals.txt (with 336838 elements) into get-structurals-sa.pl. This also skews the comparison between the two selfadhesion computations.

The most demanding selfadhesion computation is the one for the meet property \(\mathfrak{sg}_* \wedge \mathfrak{o}\). It takes 123071 processor minutes of which 66% are spent in kernel space writing files or executing SoPlex, on Intel Xeon CPU E7- 8837 processors at 2.67 GHz. The number of results is 167989. The resulting files are available for download below:

Finding new inference rules

Every CI structure which does not belong to \((\mathfrak{sg}_* \wedge \mathfrak{o})^{\textsf{sa}}(5)\) cannot be realizable by a Gaussian distribution. Therefore it encodes valid inference rules for Gaussians which are violated by this structure. To find these inference rules, one examines all realizable CI structures which contain this non-realizable structure. In particular, Horn-type inference rules (those without disjunctions on the conclusion side) are precisely given by the elements in the intersection of all realizable structures containing the given non-realizable structure.

Since we do not have access to the set of realizable CI structures, we approximate realizability by a Gaussian distribution via necessary properties. The Horn-type inference rules derivable form necessary properties may be weaker (because the intersection is taken over more structures) but they can be computed. Hence, we take the elements \(\mathcal{L}\) of \((\mathfrak{sg}_* \wedge \mathfrak{o})(5)\) one after another and examine the intersection of all the structures in \((\mathfrak{sg}_* \wedge \mathfrak{o})^{\textsf{sa}}(5)\) which contain \(\mathcal{L}\).

First compute the difference of the files structural-orientables.txt and structural-orientables-sa.txt by giving them in this order as arguments to the following program:

use Modern::Perl 2018;
use Path::Tiny;

my ($f1, $f2) = @ARGV;
my %h;
$h{$_}++ for path($f1)->lines_utf8({ chomp => 1 });
$h{$_}-- for path($f2)->lines_utf8({ chomp => 1 });

for (keys %h) {
    say if $h{$_} > 0;
}

These are the 7150 candidate structures which may give new inference rules which are not implied by the structural semigraphoid or orientability properties but are implied by their selfadhesion.

To intersect all selfadhesive CI structures containing one of these candidates, we first have to compute the entire list of CI structures which are selfadhesive with respect to structural semigraphoidality and orientability (above we only computed isomorphy representatives). This is accomplished by the following Perl code:

use Modern::Perl 2018;
use CInet::Base;

my %h;
for (<<>>) {
    chomp;
    $h{$_}++ for CInet::Relation->new(5 => $_)->orbit(SymmetricGroup)->list;
}
say for keys %h;

The output of sending structural-orientables-sa.txt into this program is large: its zstd-compressed version is 202 MiB big and it inflates to about 1.5 GiB uncompressed.

The file contains 19721952 CI structures. The search for new Horn-type axioms goes through these almost 20 million CI structures for each candidate and intersects all structures containing the candidate:

$ cat axiom-candidates.txt | while read G
> do echo "   $G"; echo -n "=> ";
> grep -e $(sed 's/1/./g' <<<"$G") structural-orientables-sa-all.txt |
> perl intersect.pl; done

Here, grep extracts the structures containing the candidate $G and the Perl script intersect.pl performs the intersection. Its source code is below. It requires the Math::BigInt::GMP module to be installed. (The intersection of CI structures is implemented using (fast) bit operations on arbitrary-length binary numbers, as each CI structure is encoded as a string of bits.)

use Modern::Perl;
use bignum lib => 'GMP';

my $x = 0;
while (<<>>) {
    chomp;
    # Bitwise OR is intersection because per convention on
    # gaussoids.de, a 0 bit means that a conditional independence
    # statement is IN the set.
    $x |= "0b$_";
}
say sprintf "%080s", $x->to_bin;

The above shell routine is a slow process which could benefit from better algorithm engineering, but it only has to run until one inference rule is found:

   11111111011111111111111010101111111001111111111111110111110110111111111111111111
=> 11111111000000001011001000000000000000001111111100010111000000001111111111111111
   11111111110110111011111111111110101111101111110101111111111111111111111111111111
=> 11111111000000000000000000000000000000000000000000000000111111111111111111111111
   11111111111111111111001101111101111011111111111111110111011111111111111111111111
=> 11111111111111111011001000000000010011111111111100010111010101111111111111111111
   11110111101111111111111001111011111111011111111111011111101111111111111111111111
=> 00000000000000000000000000000000111111001111111111010111001111111111111111111111
   11111111101111111111111001111001111111011111101111011111011111111111111111111111
=> 11111111000000000000000000000000000000000000000000000000000101111111111111111111
   11111111111111111111110111101111111111101010111111110111011111111111111111111111
=> 11111111111111111111110011101011101100100000000000100010010111111111111111111111
...

In each case the candidate is printed and then on the next line an arrow (for implication) followed by the intersection from structural-orientables-sa-all.txt. The results are called the closures or completions of the inputs with respect to the property under consideration (here selfadhesive structural orientability). The zeros that appear after the arrow but not before are additional valid CI statements which are implied by the candidate. To get the CI statements printed, use the following snippet:

$ perl -MCInet::Base -E 'say join ", ", map FACE, CInet::Relation->new(5 => shift)->independences' 11111111011111111111111010101111111001111111111111110111110110111111111111111111
13|, 14|235, 15|2, 15|4, 23|5, 23|14, 25|13, 34|2, 34|15
$ perl -MCInet::Base -E 'say join ", ", map FACE, CInet::Relation->new(5 => shift)->independences' 11111111000000001011001000000000000000001111111100010111000000001111111111111111
13|, 13|2, 13|4, 13|5, 13|24, 13|25, 13|45, 13|245, 14|2, 14|23, 14|25, 14|235, 15|, 15|2, 15|3, 15|4, 15|23, 15|24, 15|34, 15|234, 23|, 23|1, 23|4, 23|5, 23|14, 23|15, 23|45, 23|145, 25|, 25|1, 25|3, 25|13, 34|, 34|1, 34|2, 34|5, 34|12, 34|15, 34|25, 34|125

This proves many Horn-type inference rules, for example

\[(13|) \wedge (14|235) \wedge (15|2) \wedge (15|4) \wedge (23|5) \wedge (23|14) \wedge (25|13) \wedge (34|2) \wedge (34|15) \;\Rightarrow\; (13|2)\]

where \((ij|K)\) is shorthand for \([i \mathrel{\text{$\perp\mkern-10mu\perp$}} j \mid K]\). To check how strong this new axiom is, compute orientables-all.txt by symmetry expansion as described above and compute the orientable closure of

\[\mathcal{L} = \{\, (13|), (14|235), (15|2), (15|4), (23|5), (23|14), (25|13), (34|2), (34|15) \,\} \cup \{\, (13|2) \,\}\]
$ echo 11111111001111111111111010101111111001111111111111110111110110111111111111111111 | while read G
> do echo "   $G"; echo -n "=> ";
> grep -e $(sed 's/1/./g' <<<"$G") orientables-all.txt |
> perl intersect.pl; done
   11111111001111111111111010101111111001111111111111110111110110111111111111111111
=> 11111111000000001011001000000000000000001111111100010111000000001111111111111111

The last structure is the closure of \(\mathcal{L}\) in the orientable gaussoids. It coincides with the closure of \(\mathcal{L} \setminus \{\, (13|2) \,\}\) in the selfadhesive structural orientable gaussoids. Hence, the inference rule singled out above is new (not implied by structural semigraphoids and orientability) but it is also sufficient (given orientability axioms) to imply all other new Horn-type inference rules which can be derived from this candidate structure.

System information

Computations were performed on various of the MPI’s compute servers; timings, setup and CPU information is given throughout the text. On consumer hardware, the memory usage will be the first obstacle in carrying out these computations. Parallelization (using at least 30 cores) is necessary for the computations to finish in a matter of days (instead of months).

The source code for the CInet Perl modules is provided for download above. Versions of other software and the utilized solvers are as follows: \(\verb|Perl v5.32|\), \(\verb|CaDiCaL v1.3.0|\), \(\verb|nbc_minisat_all v1.0.2|\), \(\verb|SoPlex v4.0.0|\).

Colophon

Project page created: 16/05/2022

Project contributors: Tobias Boege

Corresponding author of this page: Tobias Boege, tobias.boege@mis.mpg.de