Türker Biyikoglu, Axel Mosig, Sonja Prohaska, Peter Stadler

Paper #: 07-08-025

Gene expression in eukaryotic cells is regulated by a complex network of interactions, in which transcription factors and their binding sites on the genomic DNA play a determining role. As transcriptions factor rarely, if ever, act in isolation, binding sites of interacting factors are typically arranged in close proximity forming so-called cis-regulatory modules. Even when the individual binding sites are known, module discovery remains a hard combinatorial problem, which we formalize here as the Best Barbecue Problem. It asks for simultaneously stabbing a maximum number of differently colored intervals from K arrangements of colored intervals. This geometric problem turns out to be an elementary, yet previously unstudied combinatorial optimization problem of detecting common edges in a family of hypergraphs, a decision version of which we show here to be NP-complete. Due to its relevance in biological applications, we propose algorithmic variations that are suitable for the analysis of real data sets comprising either many sequences or many binding sites. Being based on set systems induced by interval arrangements, our problem setting generalizes to discovering patterns of co-localized itemsets in non-sequential objects that consist of corresponding arrangements or induce set systems of co-localized items. In fact, our optimization problem is a generalization of the popular concept of frequent itemset mining.

PDF