Databases 10
☆ Private Quantum Database
Quantum databases open an exciting new frontier in data management by
offering privacy guarantees that classical systems cannot match. Traditional
engines address either user privacy, which hides which records are being
queried, or data privacy, which prevents a user from learning more than she
has queried. We propose a quantum database that protects both by leveraging
quantum mechanics: when the user measures in her chosen basis, the
superposition collapses and the
unqueried rows become physically inaccessible. We encode relational tables as a
sequence of Quantum Random Access Codes (QRACs) over mutually unbiased bases
(MUBs), transmit a bounded number of quantum states, and let a single,
destructive measurement reconstruct only the selected tuple. This allows us to
preserve data privacy and user privacy at once without trusted hardware or
heavyweight cryptography. Moreover, we envision a novel hybrid
quantum-classical architecture, ready for early deployment, that remains
compatible with the limitations of today's Noisy Intermediate-Scale Quantum
(NISQ) devices.
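
As a self-contained illustration of the QRAC primitive (the generic (2,1)
random access code over the Z and X bases, not the paper's MUB-based table
encoding, which the abstract only sketches), the NumPy snippet below encodes
two classical bits into a single qubit; measuring in the basis of the reader's
choice recovers that bit with probability cos^2(pi/8) ~ 0.85, while the single
destructive measurement gives up the other bit.

import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def encode(b0: int, b1: int) -> np.ndarray:
    """Density matrix of the single qubit carrying the bit pair (b0, b1)."""
    rx = (-1) ** b1 / np.sqrt(2)   # X component of the Bloch vector
    rz = (-1) ** b0 / np.sqrt(2)   # Z component of the Bloch vector
    return 0.5 * (I2 + rx * X + rz * Z)

def p_plus(rho: np.ndarray, which: int) -> float:
    """Probability of the +1 outcome when measuring Z (which=0) or X (which=1)."""
    basis = Z if which == 0 else X
    eigvals, eigvecs = np.linalg.eigh(basis)
    plus = eigvecs[:, np.argmax(eigvals)].reshape(-1, 1)   # +1 eigenvector
    return float(np.real(np.trace(rho @ (plus @ plus.conj().T))))

for b0 in (0, 1):
    for b1 in (0, 1):
        rho = encode(b0, b1)
        p0, p1 = p_plus(rho, 0), p_plus(rho, 1)   # +1 outcome means "guess 0"
        ok0 = p0 if b0 == 0 else 1 - p0
        ok1 = p1 if b1 == 0 else 1 - p1
        print(f"bits=({b0},{b1})  P[recover b0]={ok0:.3f}  P[recover b1]={ok1:.3f}")
# Every line prints ~0.854 = cos^2(pi/8): either bit can be read out well,
# but one measurement commits to a single basis and forgoes the other bit.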
☆ Enriching Object-Centric Event Data with Process Scopes: A Framework for Aggregation and Analysis
Object-Centric Process Mining enables the analysis of complex operational
behavior by capturing interactions among multiple business objects (e.g.,
orders, items, deliveries). These interactions are recorded using
Object-Centric Event Data (OCED) formats, such as the Object-Centric Event Log
(OCEL). However, existing formats lack explicit definitions of process scopes,
which restricts analysis to individual processes and limits insights to a low
level of granularity. In practice, OCED often spans multiple interrelated
processes, as shared objects connect events across organizational functions.
This structure reflects how value is created along the organizational value
chain, but introduces challenges for interpretation when process boundaries are
not clearly defined. Moreover, process definitions are typically subjective and
context-dependent; they vary across organizations, roles, and analytical goals,
and cannot always be discovered automatically. To address these challenges, we
propose a method for embedding analyst-defined process scopes into OCEL. This
enables the structured representation of multiple coexisting processes,
supports the aggregation of event data across scopes, and facilitates analysis
at varying levels of abstraction. We demonstrate the applicability of our
approach using a publicly available OCEL log and provide supporting tools for
scope definition and analysis.
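
As a rough sketch of the scoping idea under simplified assumptions (the field
names below are placeholders, not the official OCEL 2.0 schema, and scopes are
defined over activities purely for illustration), the Python snippet tags
events of an OCEL-like log with analyst-defined, possibly overlapping process
scopes and then aggregates events per scope:

from collections import defaultdict

log = {
    "events": [
        {"id": "e1", "activity": "Create Order",  "objects": ["o1"]},
        {"id": "e2", "activity": "Pick Item",     "objects": ["i1", "o1"]},
        {"id": "e3", "activity": "Ship Delivery", "objects": ["d1", "o1"]},
    ],
}

# Analyst-defined scopes: here each scope is a set of activities; it could
# equally be defined over object types or organizational units.
scopes = {
    "order-to-cash": {"Create Order", "Ship Delivery"},
    "warehouse":     {"Pick Item", "Ship Delivery"},
}

def annotate(log, scopes):
    """Attach the list of matching scopes to every event (scopes may overlap)."""
    for ev in log["events"]:
        ev["scopes"] = [s for s, acts in scopes.items() if ev["activity"] in acts]
    return log

def events_per_scope(log):
    """Aggregate: number of events that fall into each scope."""
    counts = defaultdict(int)
    for ev in log["events"]:
        for s in ev["scopes"]:
            counts[s] += 1
    return dict(counts)

annotate(log, scopes)
print(events_per_scope(log))   # {'order-to-cash': 2, 'warehouse': 2}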
☆ Text to Query Plans for Question Answering on Large Tables
Efficient querying and analysis of large tabular datasets remain significant
challenges, especially for users without expertise in programming languages
like SQL. Text-to-SQL approaches have shown promising performance on benchmark
data; however, they inherit SQL's drawbacks, including inefficiency with large
datasets and limited support for complex data analyses beyond basic querying.
We propose a novel framework that transforms natural language queries into
query plans. Our solution is implemented outside traditional databases,
allowing us to support classical SQL commands while avoiding SQL's inherent
limitations. Additionally, we enable complex analytical functions, such as
principal component analysis and anomaly detection, providing greater
flexibility and extensibility than traditional SQL capabilities. We leverage
LLMs to iteratively interpret queries and construct operation sequences,
addressing computational complexity by incrementally building solutions. By
executing operations directly on the data, we overcome context length
limitations without requiring the entire dataset to be processed by the model.
We validate our framework through experiments on both standard databases and
large scientific tables, demonstrating its effectiveness in handling extensive
datasets and performing sophisticated data analyses.
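
A minimal sketch of the query-plan idea, assuming a plan is simply a list of
named operations executed outside the database (the operation vocabulary and
plan format are hypothetical, and the plan is hand-written here rather than
produced by an LLM):

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 1.5, 3.5, 5.0],
})

# "For EU rows, project x1/x2 onto their first principal component."
plan = [
    {"op": "filter", "column": "region", "equals": "EU"},
    {"op": "select", "columns": ["x1", "x2"]},
    {"op": "pca",    "n_components": 1},
]

def execute(df, plan):
    """Run the operation sequence step by step on the data."""
    out = df
    for step in plan:
        if step["op"] == "filter":
            out = out[out[step["column"]] == step["equals"]]
        elif step["op"] == "select":
            out = out[step["columns"]]
        elif step["op"] == "pca":
            out = pd.DataFrame(PCA(step["n_components"]).fit_transform(out),
                               columns=["pc1"])
        else:
            raise ValueError(f"unknown operation: {step['op']}")
    return out

print(execute(df, plan))

Because each step materializes an intermediate result on the data side, only
the plan, never the dataset itself, has to pass through the model, which is
how the context-length limitation is sidestepped.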
☆ Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee
Serving Large Language Models (LLMs) at scale requires meeting strict Service
Level Objectives (SLOs) under severe computational and memory constraints.
Yet traditional caching strategies fall short: exact-matching and prefix
caches neglect query semantics, while state-of-the-art semantic caches remain
confined to traditional caching heuristics, offering little conceptual
departure. Motivated by this gap, we present SISO, a semantic caching system that
redefines efficiency for LLM serving. SISO introduces centroid-based caching to
maximize coverage with minimal memory, locality-aware replacement to preserve
high-value entries, and dynamic thresholding to balance accuracy and latency
under varying workloads. Across diverse datasets, SISO delivers up to
1.71$\times$ higher hit ratios and consistently stronger SLO attainment
compared to state-of-the-art systems.
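
A minimal sketch of the centroid-based lookup at the core of such a cache,
under our own simplifying assumptions (cosine similarity against per-cluster
centroids with a fixed threshold; SISO's locality-aware replacement and
dynamic thresholding are not modeled, and the embeddings are stand-ins for a
real sentence encoder):

import numpy as np

class CentroidCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.centroids = []   # one centroid per cluster of similar queries
        self.counts = []      # how many queries each centroid aggregates
        self.responses = []   # cached LLM response for the cluster

    def _cos(self, a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, emb):
        """Return a cached response if some centroid is similar enough."""
        emb = np.asarray(emb, dtype=float)
        if not self.centroids:
            return None
        sims = [self._cos(emb, c) for c in self.centroids]
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, emb, response):
        """Merge into the nearest cluster if similar, else open a new one."""
        emb = np.asarray(emb, dtype=float)
        if self.centroids:
            sims = [self._cos(emb, c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                self.counts[best] = n + 1
                return
        self.centroids.append(emb)
        self.counts.append(1)
        self.responses.append(response)

cache = CentroidCache(threshold=0.9)
cache.put(np.array([1.0, 0.0]), "answer about topic A")
print(cache.get(np.array([0.98, 0.05])))   # hit: close to the stored centroid
print(cache.get(np.array([0.0, 1.0])))     # miss: None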
☆ WoW: A Window-to-Window Incremental Index for Range-Filtering Approximate Nearest Neighbor Search SIGMOD
Given a hybrid dataset in which every data object consists of a vector and an
attribute value, and a query comprising a target vector and a range filter,
range-filtering approximate nearest neighbor search (RFANNS) aims to retrieve
the vectors most similar to the target whose attribute values fall within the
query range. It is a fundamental operation in vector database management
systems and in intelligent systems with embedding capabilities. Dedicated
RFANNS indices accelerate queries at an acceptable loss of accuracy on the
nearest neighbors. However, they still struggle to support incremental
construction and to generalize to arbitrary range filters with superior query
performance. In this paper, we introduce a window graph-based
RFANNS index. For incremental construction, we propose an insertion algorithm
to add new vector-attribute pairs into hierarchical window graphs with varying
window sizes. To handle arbitrary range filters, we optimize the
relevant-window search, using range selectivity to reduce both attribute
filter checks and vector distance computations. Extensive experiments on
real-world datasets show that, for index construction, the indexing time is on
par with the most building-efficient index and 4.9x faster than the most
query-efficient index, while producing an index 0.4-0.5x its size; for RFANNS
queries, it is 4x faster than the most efficient incremental index and matches
the performance of the best statically built index.
comment: Accepted in the ACM SIGMOD/PODS International Conference on
Management of Data (SIGMOD 2026)
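
For readers new to the query type, the brute-force reference below spells out
the RFANNS problem itself: keep only the objects whose attribute lies in the
query range, then rank the survivors by vector distance. It is not the paper's
window-graph index, which answers the same queries without scanning every
candidate.

import numpy as np

def rfanns_bruteforce(vectors, attrs, q_vec, lo, hi, k=5):
    """vectors: (n, d) array; attrs: (n,) array; returns indices of the top-k."""
    mask = (attrs >= lo) & (attrs <= hi)          # range filter on the attribute
    cand = np.where(mask)[0]
    if cand.size == 0:
        return np.array([], dtype=int)
    dists = np.linalg.norm(vectors[cand] - q_vec, axis=1)
    return cand[np.argsort(dists)[:k]]            # k nearest among candidates

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 16))
attrs = rng.uniform(0, 100, size=1000)            # e.g., a timestamp or price
print(rfanns_bruteforce(vecs, attrs, rng.normal(size=16), lo=20, hi=40))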
☆ Optimal $(\alpha,\beta)$-Dense Subgraph Search in Bipartite Graphs
Dense subgraph search in bipartite graphs is a fundamental problem in graph
analysis, with wide-ranging applications in fraud detection, recommendation
systems, and social network analysis. The recently proposed $(\alpha,
\beta)$-dense subgraph model has demonstrated superior capability in capturing
the intrinsic density structure of bipartite graphs compared to existing
alternatives. However, despite its modeling advantages, the $(\alpha,
\beta)$-dense subgraph model lacks efficient support for query processing and
dynamic updates, limiting its practical utility in large-scale applications. To
address these limitations, we propose BD-Index, a novel index that answers
$(\alpha, \beta)$-dense subgraph queries in optimal time while using only
linear space $O(|E|)$, making it well-suited for real-world applications
requiring both fast query processing and low memory consumption. We further
develop two complementary maintenance strategies for dynamic bipartite graphs
to support efficient updates to the BD-Index. The space-efficient strategy
updates the index in time complexity of $O(p \cdot |E|^{1.5})$ per edge
insertion or deletion, while maintaining a low space cost of $O(|E|)$ (the same
as the index itself), where $p$ is typically a small constant in real-world
graphs. In contrast, the time-efficient strategy significantly reduces the
update time to $O(p \cdot |E|)$ per edge update by maintaining auxiliary
orientation structures, at the cost of increased memory usage up to $O(p \cdot
|E|)$. These two strategies provide flexible trade-offs between maintenance
efficiency and memory usage, enabling BD-Index to adapt to diverse application
requirements. Extensive experiments on 10 large-scale real-world datasets
demonstrate the high efficiency and scalability of our proposed solutions.
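
For intuition about dense structure in bipartite graphs, the peeling sketch
below computes the classical bipartite (alpha, beta)-core, in which every
upper-side vertex keeps degree at least alpha and every lower-side vertex
degree at least beta. This is a simpler, related notion used here only for
illustration; it is not the paper's (alpha, beta)-dense subgraph model, nor
the BD-Index.

def ab_core(edges, alpha, beta):
    """edges: list of (u, v) pairs between the two vertex sides."""
    adj_u, adj_v = {}, {}
    for u, v in edges:
        adj_u.setdefault(u, set()).add(v)
        adj_v.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        # Peel upper-side vertices with too few neighbors.
        for u in [u for u, nbrs in adj_u.items() if len(nbrs) < alpha]:
            for v in adj_u.pop(u):
                if v in adj_v:
                    adj_v[v].discard(u)
            changed = True
        # Peel lower-side vertices with too few neighbors.
        for v in [v for v, nbrs in adj_v.items() if len(nbrs) < beta]:
            for u in adj_v.pop(v):
                if u in adj_u:
                    adj_u[u].discard(v)
            changed = True
    return set(adj_u), set(adj_v)

edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"), ("u3", "v3")]
print(ab_core(edges, alpha=2, beta=2))   # -> ({'u1', 'u2'}, {'v1', 'v2'})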
☆ Brook-2PL: Tolerating High Contention Workloads with A Deadlock-Free Two-Phase Locking Protocol
Hotspots remain a critical challenge for concurrency control (CC) protocols
under high-contention workloads. Traditional concurrency
control approaches encounter significant difficulties under high contention,
resulting in excessive transaction aborts and deadlocks. In this paper, we
propose Brook-2PL, a novel two-phase locking (2PL) protocol that (1) introduces
SLW-Graph for deadlock-free transaction execution, and (2) employs partial
transaction chopping for early lock release. Previous methods suffer from
transaction aborts that lead to wasted work and can further burden the system
due to their cascading effects. Brook-2PL addresses this limitation by
statically analyzing a new graph-based dependency structure called SLW-Graph,
enabling deadlock-free two-phase locking through predetermined lock
acquisition. Brook-2PL also reduces contention by enabling early lock release
using partial transaction chopping and static transaction analysis. We overcome
the inherent limitations of traditional transaction chopping by providing a
more flexible chopping method. Evaluation using both our synthetic online game
store workload and the TPC-C benchmark shows that Brook-2PL significantly
outperforms state-of-the-art CC protocols. Brook-2PL achieves an average
speed-up of 2.86x while reducing tail latency (p95) by 48% in the TPC-C
benchmark.
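
The toy below illustrates only the general principle that makes predetermined
lock acquisition deadlock-free: if every transaction acquires the locks it
needs in one agreed global order, no wait-for cycle can form. The ordering
here is a plain lexicographic sort over lock keys, standing in for the order
Brook-2PL derives from its SLW-Graph analysis; partial transaction chopping is
not modeled.

import threading

locks = {key: threading.Lock() for key in ("account:A", "account:B", "item:7")}

def run_transaction(needed_keys, body):
    """Conservative 2PL: acquire all locks in global key order, run, release."""
    ordered = sorted(set(needed_keys))     # predetermined global order
    for key in ordered:                    # growing phase
        locks[key].acquire()
    try:
        body()
    finally:
        for key in reversed(ordered):      # shrinking phase
            locks[key].release()

# Two transactions touching the same keys in different "logical" orders still
# acquire the locks in the same sorted order, so they can never deadlock.
t1 = threading.Thread(target=run_transaction,
                      args=(["account:B", "account:A"], lambda: print("T1 commits")))
t2 = threading.Thread(target=run_transaction,
                      args=(["account:A", "item:7", "account:B"], lambda: print("T2 commits")))
t1.start()
t2.start()
t1.join()
t2.join()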
♻ ☆ Anonymity-washing
Anonymization is a foundational principle of data privacy regulation, yet its
practical application remains riddled with ambiguity and inconsistency. This
paper introduces the concept of anonymity-washing -- the misrepresentation of
the anonymity level of ``sanitized'' personal data -- as a critical privacy
concern. While both legal and technical critiques of anonymization exist, they
tend to address isolated aspects of the problem. In contrast, this paper offers
a comprehensive overview of the conditions that enable anonymity-washing. It
synthesizes fragmented legal interpretations, technical misunderstandings, and
outdated regulatory guidance and complements them with a systematic review of
national and international resources, including legal cases, data protection
authority guidelines, and technical documentation. Our findings reveal a lack
of coherent support for practitioners, contributing to the persistent misuse of
pseudonymization and obsolete anonymization techniques. We conclude by
recommending targeted education, clearer technical guidance, and closer
cooperation between regulators, researchers, and industry to bridge the gap
between legal norms and technical reality.
comment: Accepted at Privacy Forum
♻ ☆ ST-Raptor: LLM-Powered Semi-Structured Table Question Answering SIGMOD 2026
Zirui Tang, Boyu Niu, Xuanhe Zhou, Boxiu Li, Wei Zhou, Jiannan Wang, Guoliang Li, Xinyi Zhang, Fan Wu
Semi-structured tables, widely used in real-world applications (e.g.,
financial reports, medical records, transactional orders), often involve
flexible and complex layouts (e.g., hierarchical headers and merged cells).
Answering natural language questions over these tables generally relies on
human analysts to interpret the layouts, which is costly and inefficient.
Existing methods for automating this procedure face significant challenges. First,
methods like NL2SQL require converting semi-structured tables into structured
ones, which often causes substantial information loss. Second, methods like
NL2Code and multi-modal LLM QA struggle to understand the complex layouts of
semi-structured tables and cannot accurately answer corresponding questions. To
this end, we propose ST-Raptor, a tree-based framework for semi-structured
table question answering using large language models. First, we introduce the
Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures
complex semi-structured table layouts, along with an effective algorithm for
constructing the tree. Second, we define a set of basic tree operations to
guide LLMs in executing common QA tasks. Given a user question, ST-Raptor
decomposes it into simpler sub-questions, generates corresponding tree
operation pipelines, and conducts operation-table alignment for accurate
pipeline execution. Third, we incorporate a two-stage verification mechanism:
forward validation checks the correctness of execution steps, while backward
validation evaluates answer reliability by reconstructing queries from
predicted answers. To benchmark the performance, we present SSTQA, a dataset of
764 questions over 102 real-world semi-structured tables. Experiments show that
ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code
is available at https://github.com/weAIDB/ST-Raptor.
comment: Extension of our SIGMOD 2026 paper. Please refer to source code
available at: https://github.com/weAIDB/ST-Raptor
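
As a toy illustration of tree-guided question answering over hierarchical
headers (a much-simplified stand-in for the HO-Tree; the table, paths, and
operations are invented for the example):

# A table with hierarchical headers, stored as a nested header tree.
header_tree = {
    "Revenue": {"Q1": 120, "Q2": 135},
    "Cost":    {"Q1":  80, "Q2":  90},
}

def lookup(tree, path):
    """One basic 'tree operation': descend along a header path to a cell."""
    node = tree
    for step in path:
        node = node[step]
    return node

# "What was the Q2 revenue?" -> decomposed into a single path lookup.
print(lookup(header_tree, ["Revenue", "Q2"]))   # 135
# "What was the Q2 profit?" -> two lookups plus arithmetic.
print(lookup(header_tree, ["Revenue", "Q2"]) - lookup(header_tree, ["Cost", "Q2"]))  # 45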
♻ ☆ Enumeration for MSO-Queries on Compressed Trees
We study the problem of enumerating the answers to a query formulated in
monadic second order logic (MSO) over an unranked forest F that is compressed
by a straight-line program (SLP) D. Our main result states that this can be
done after O(|D|) preprocessing and with output-linear delay (in data
complexity). This is a substantial improvement over the previously known
algorithms for MSO-evaluation over trees, since the compressed size |D| might
be much smaller than (or even logarithmic in) the actual data size |F|, and
there are linear time SLP-compressors that yield very good compressions on
practical inputs. In particular, this also constitutes a meta-theorem in the
field of algorithmics on SLP-compressed inputs: all enumeration problems on
trees or strings that can be formulated in MSO-logic can be solved with linear
preprocessing and output-linear delay, even if the inputs are compressed by
SLPs. We also show that our approach can support vertex relabelling updates in
time that is logarithmic in the size of the uncompressed data. Our result
extends previous
work on the enumeration of MSO-queries over uncompressed trees and on the
enumeration of document spanners over compressed text documents.
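
To make the compression gap concrete, the sketch below builds a string SLP
whose expansion is exponentially longer than the grammar and reads off the
expansion's length and individual characters without ever decompressing it.
This is the sense in which algorithms whose cost depends only on |D| can beat
decompress-then-process by an exponential margin; the paper's contribution is
to achieve this for full MSO enumeration on trees.

from functools import lru_cache

# An SLP: every nonterminal has exactly one rule, either a terminal symbol or
# a pair of previously defined nonterminals.
rules = {
    "A": "a", "B": "b",
    "X1": ("A", "B"),                                   # expands to "ab"
    # X_k -> X_{k-1} X_{k-1}: the expansion doubles with every rule.
    **{f"X{k}": (f"X{k-1}", f"X{k-1}") for k in range(2, 31)},
}

@lru_cache(maxsize=None)
def length(sym: str) -> int:
    rule = rules[sym]
    return 1 if isinstance(rule, str) else length(rule[0]) + length(rule[1])

def char_at(sym: str, i: int) -> str:
    """i-th character of the expansion, without materializing it."""
    rule = rules[sym]
    if isinstance(rule, str):
        return rule
    left = length(rule[0])
    return char_at(rule[0], i) if i < left else char_at(rule[1], i - left)

print(length("X30"))           # 2**30 characters described by ~30 rules
print(char_at("X30", 2**29))   # 'a' (first character of the second half)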