-
Robust Learning with Optimal Error
Authors:
Guy Blanc
Abstract:
We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses.
- For $η$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot η/(1-η)$, improving on the optimal error of deterministic h…
▽ More
We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses.
- For $η$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot η/(1-η)$, improving on the optimal error of deterministic hypotheses by a factor of $1/2$. This answers an open question of Cesa-Bianchi et al. (JACM 1999) who showed randomness can improve error by a factor of $6/7$.
- For $η$-rate nasty noise, we show the optimal error is $\frac{3}{2} \cdot η$ for distribution-independent learners and $η$ for fixed-distribution learners, both improving upon the optimal $2 η$ error of deterministic hypotheses. This closes a gap first noted by Bshouty et al. (Theoretical Computer Science 2002) when they introduced nasty noise and reiterated in the recent works of Klivans et al. (NeurIPS 2025) and Blanc et al. (SODA 2026).
- For $η$-rate agnostic noise and the closely related nasty classification noise model, we show the optimal error is $η$, improving upon the optimal $2η$ error of deterministic hypotheses.
All of our learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error. All except for the fixed-distribution nasty noise learner are time efficient given access to an oracle for empirical risk minimization.
△ Less
Submitted 2 April, 2026;
originally announced April 2026.
-
Samplability makes learning easier
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
The standard definition of PAC learning (Valiant 1984) requires learners to succeed under all distributions -- even ones that are intractable to sample from. This stands in contrast to samplable PAC learning (Blum, Furst, Kearns, and Lipton 1993), where learners only have to succeed under samplable distributions. We study this distinction and show that samplable PAC substantially expands the power…
▽ More
The standard definition of PAC learning (Valiant 1984) requires learners to succeed under all distributions -- even ones that are intractable to sample from. This stands in contrast to samplable PAC learning (Blum, Furst, Kearns, and Lipton 1993), where learners only have to succeed under samplable distributions. We study this distinction and show that samplable PAC substantially expands the power of efficient learners.
We first construct a concept class that requires exponential sample complexity in standard PAC but is learnable with polynomial sample complexity in samplable PAC. We then lift this statistical separation to the computational setting and obtain a separation relative to a random oracle. Our proofs center around a new complexity primitive, explicit evasive sets, that we introduce and study. These are sets for which membership is easy to determine but are extremely hard to sample from.
Our results extend to the online setting to similarly show how its landscape changes when the adversary is assumed to be efficient instead of computationally unbounded.
△ Less
Submitted 30 November, 2025;
originally announced December 2025.
-
Differential privacy from axioms
Authors:
Guy Blanc,
William Pires,
Toniann Pitassi
Abstract:
Differential privacy (DP) is the de facto notion of privacy both in theory and in practice. However, despite its popularity, DP imposes strict requirements which guard against strong worst-case scenarios. For example, it guards against seemingly unrealistic scenarios where an attacker has full information about all but one point in the data set, and still nothing can be learned about the remaining…
▽ More
Differential privacy (DP) is the de facto notion of privacy both in theory and in practice. However, despite its popularity, DP imposes strict requirements which guard against strong worst-case scenarios. For example, it guards against seemingly unrealistic scenarios where an attacker has full information about all but one point in the data set, and still nothing can be learned about the remaining point. While preventing such a strong attack is desirable, many works have explored whether average-case relaxations of DP are easier to satisfy [HWR13,WLF16,BF16,LWX23].
In this work, we are motivated by the question of whether alternate, weaker notions of privacy are possible: can a weakened privacy notion still guarantee some basic level of privacy, and on the other hand, achieve privacy more efficiently and/or for a substantially broader set of tasks? Our main result shows the answer is no: even in the statistical setting, any reasonable measure of privacy satisfying nontrivial composition is equivalent to DP. To prove this, we identify a core set of four axioms or desiderata: pre-processing invariance, prohibition of blatant non-privacy, strong composition, and linear scalability. Our main theorem shows that any privacy measure satisfying our axioms is equivalent to DP, up to polynomial factors in sample complexity. We complement this result by showing our axioms are minimal: removing any one of our axioms enables ill-behaved measures of privacy.
△ Less
Submitted 26 November, 2025;
originally announced November 2025.
-
Is nasty noise actually harder than malicious noise?
Authors:
Guy Blanc,
Yizhi Huang,
Tal Malkin,
Rocco A. Servedio
Abstract:
We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrup…
▽ More
We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner.
We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings: For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class ${\cal C}$ of functions is efficiently learnable in the presence of $η$-rate malicious noise, then it is also efficiently learnable in the presence of $η$-rate nasty noise. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value $r$ there exists a concept class for which there is a ratio of $r$ between the rate $η_{malicious}$ of malicious noise that polynomial-time learning algorithms can tolerate, versus the rate $η_{nasty}$ of nasty noise that such learning algorithms can tolerate.
To offset the negative result for the fixed-distribution setting, we define a broad and natural class of algorithms, namely those that ignore contradictory examples (ICE). We show that for these algorithms, malicious noise and nasty noise are equivalent up to a factor of two in the noise rate: Any efficient ICE learner that succeeds with $η$-rate malicious noise can be converted to an efficient learner that succeeds with $η/2$-rate nasty noise. We further show that the above factor of two is necessary, again under a standard cryptographic assumption.
△ Less
Submitted 16 February, 2026; v1 submitted 12 November, 2025;
originally announced November 2025.
-
The power of quantum circuits in sampling
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
We give new evidence that quantum circuits are substantially more powerful than classical circuits. We show, relative to a random oracle, that polynomial-size quantum circuits can sample distributions that subexponential-size classical circuits cannot approximate even to TV distance $1-o(1)$. Prior work of Aaronson and Arkhipov (2011) showed such a separation for the case of exact sampling (i.e. T…
▽ More
We give new evidence that quantum circuits are substantially more powerful than classical circuits. We show, relative to a random oracle, that polynomial-size quantum circuits can sample distributions that subexponential-size classical circuits cannot approximate even to TV distance $1-o(1)$. Prior work of Aaronson and Arkhipov (2011) showed such a separation for the case of exact sampling (i.e. TV distance $0$), but separations for approximate sampling were only known for uniform algorithms.
A key ingredient in our proof is a new hardness amplification lemma for the classical query complexity of the Yamakawa-Zhandry (2022) search problem. We show that the probability that any family of query algorithms collectively finds $k$ distinct solutions decays exponentially in $k$.
△ Less
Submitted 3 October, 2025;
originally announced October 2025.
-
Instance-Optimal Uniformity Testing and Tracking
Authors:
Guy Blanc,
Clément L. Canonne,
Erik Waingarten
Abstract:
In the uniformity testing task, an algorithm is provided with samples from an unknown probability distribution over a (known) finite domain, and must decide whether it is the uniform distribution, or, alternatively, if its total variation distance from uniform exceeds some input distance parameter. This question has received a significant amount of interest and its complexity is, by now, fully set…
▽ More
In the uniformity testing task, an algorithm is provided with samples from an unknown probability distribution over a (known) finite domain, and must decide whether it is the uniform distribution, or, alternatively, if its total variation distance from uniform exceeds some input distance parameter. This question has received a significant amount of interest and its complexity is, by now, fully settled. Yet, we argue that it fails to capture many scenarios of interest, and that its very definition as a gap problem in terms of a prespecified distance may lead to suboptimal performance.
To address these shortcomings, we introduce the problem of uniformity tracking, whereby an algorithm is required to detect deviations from uniformity (however they may manifest themselves) using as few samples as possible, and be competitive against an optimal algorithm knowing the distribution profile in hindsight. Our main contribution is a $\operatorname{polylog}(\operatorname{opt})$-competitive uniformity tracking algorithm. We obtain this result by leveraging new structural results on Poisson mixtures, which we believe to be of independent interest.
△ Less
Submitted 4 August, 2025;
originally announced August 2025.
-
Computational-Statistical Tradeoffs from NP-hardness
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
A central question in computer science and statistics is whether efficient algorithms can achieve the information-theoretic limits of statistical problems. Many computational-statistical tradeoffs have been shown under average-case assumptions, but since statistical problems are average-case in nature, it has been a challenge to base them on standard worst-case assumptions.
In PAC learning where…
▽ More
A central question in computer science and statistics is whether efficient algorithms can achieve the information-theoretic limits of statistical problems. Many computational-statistical tradeoffs have been shown under average-case assumptions, but since statistical problems are average-case in nature, it has been a challenge to base them on standard worst-case assumptions.
In PAC learning where such tradeoffs were first studied, the question is whether computational efficiency can come at the cost of using more samples than information-theoretically necessary. We base such tradeoffs on $\mathsf{NP}$-hardness and obtain:
$\circ$ Sharp computational-statistical tradeoffs assuming $\mathsf{NP}$ requires exponential time: For every polynomial $p(n)$, there is an $n$-variate class $C$ with VC dimension $1$ such that the sample complexity of time-efficiently learning $C$ is $Θ(p(n))$.
$\circ$ A characterization of $\mathsf{RP}$ vs. $\mathsf{NP}$ in terms of learning: $\mathsf{RP} = \mathsf{NP}$ iff every $\mathsf{NP}$-enumerable class is learnable with $O(\mathrm{VCdim}(C))$ samples in polynomial time. The forward implication has been known since (Pitt and Valiant, 1988); we prove the reverse implication.
Notably, all our lower bounds hold against improper learners. These are the first $\mathsf{NP}$-hardness results for improperly learning a subclass of polynomial-size circuits, circumventing formal barriers of Applebaum, Barak, and Xiao (2008).
△ Less
Submitted 17 July, 2025;
originally announced July 2025.
-
PROTEAN: Federated Intrusion Detection in Non-IID Environments through Prototype-Based Knowledge Sharing
Authors:
Sara Chennoufi,
Yufei Han,
Gregory Blanc,
Emiliano De Cristofaro,
Christophe Kiennert
Abstract:
In distributed networks, participants often face diverse and fast-evolving cyberattacks. This makes techniques based on Federated Learning (FL) a promising mitigation strategy. By only exchanging model updates, FL participants can collaboratively build detection models without revealing sensitive information, e.g., network structures or security postures. However, the effectiveness of FL solutions…
▽ More
In distributed networks, participants often face diverse and fast-evolving cyberattacks. This makes techniques based on Federated Learning (FL) a promising mitigation strategy. By only exchanging model updates, FL participants can collaboratively build detection models without revealing sensitive information, e.g., network structures or security postures. However, the effectiveness of FL solutions is often hindered by significant data heterogeneity, as attack patterns often differ drastically across organizations due to varying security policies. To address these challenges, we introduce PROTEAN, a Prototype Learning-based framework geared to facilitate collaborative and privacy-preserving intrusion detection. PROTEAN enables accurate detection in environments with highly non-IID attack distributions and promotes direct knowledge sharing by exchanging class prototypes of different attack types among participants. This allows organizations to better understand attack techniques not present in their data collections. We instantiate PROTEAN on two cyber intrusion datasets collected from IIoT and 5G-connected participants and evaluate its performance in terms of utility and privacy, demonstrating its effectiveness in addressing data heterogeneity while improving cyber attack understanding in federated intrusion detection systems (IDSs).
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
A Distributional-Lifting Theorem for PAC Learning
Authors:
Guy Blanc,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a distributional-lifting theorem: This upgrades a learner that succeeds with respect to a limited distribution famil…
▽ More
The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a distributional-lifting theorem: This upgrades a learner that succeeds with respect to a limited distribution family $\mathcal{D}$ to one that succeeds with respect to any distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$.
Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners and designed a lifter that uses a conditional sample oracle for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then uses this information to lift.
We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
Explainable Graph-theoretical Machine Learning: with Application to Alzheimer's Disease Prediction
Authors:
Narmina Baghirova,
Duy-Thanh Vũ,
Duy-Cat Can,
Christelle Schneuwly Diaz,
Julien Bodlet,
Guillaume Blanc,
Georgi Hrusanov,
Bernard Ries,
Oliver Y. Chén
Abstract:
Alzheimer's disease (AD) affects 50 million people worldwide and is projected to overwhelm 152 million by 2050. AD is characterized by cognitive decline due partly to disruptions in metabolic brain connectivity. Thus, early and accurate detection of metabolic brain network impairments is crucial for AD management. Chief to identifying such impairments is FDG-PET data. Despite advancements, most gr…
▽ More
Alzheimer's disease (AD) affects 50 million people worldwide and is projected to overwhelm 152 million by 2050. AD is characterized by cognitive decline due partly to disruptions in metabolic brain connectivity. Thus, early and accurate detection of metabolic brain network impairments is crucial for AD management. Chief to identifying such impairments is FDG-PET data. Despite advancements, most graph-based studies using FDG-PET data rely on group-level analysis or thresholding. Yet, group-level analysis can veil individual differences and thresholding may overlook weaker but biologically critical brain connections. Additionally, machine learning-based AD prediction largely focuses on univariate outcomes, such as disease status. Here, we introduce explainable graph-theoretical machine learning (XGML), a framework employing kernel density estimation and dynamic time warping to construct individual metabolic brain graphs that capture the distance between pair-wise brain regions and identify subgraphs most predictive of multivariate AD-related outcomes. Using FDG-PET data from the Alzheimer's Disease Neuroimaging Initiative, XGML builds metabolic brain graphs and uncovers subgraphs predictive of eight AD-related cognitive scores in new subjects. XGML shows robust performance, particularly for predicting scores measuring learning, memory, language, praxis, and orientation, such as CDRSB ($r = 0.74$), ADAS11 ($r = 0.73$), and ADAS13 ($r = 0.71$). Moreover, XGML unveils key edges jointly but differentially predictive of several AD-related outcomes; they may serve as potential network biomarkers for assessing overall cognitive decline. Together, we show the promise of graph-theoretical machine learning in biomarker discovery and disease prediction and its potential to improve our understanding of network neural mechanisms underlying AD.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values
Authors:
Christelle Schneuwly Diaz,
Duy-Thanh Vu,
Julien Bodelet,
Duy-Cat Can,
Guillaume Blanc,
Haiting Jiang,
Lin Yao,
Guiseppe Pantaleo,
ADNI,
Oliver Y. Chén
Abstract:
Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease predictio…
▽ More
Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI, we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to uncover data from each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine-learning and extracts biomarkers respectively predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine-learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven, Reliable, and Explainable Alzheimer's Disease Diagnosis
Authors:
Duy-Cat Can,
Linh D. Dang,
Quang-Huy Tang,
Dang Minh Ly,
Huong Ha,
Guillaume Blanc,
Oliver Y. Chén,
Binh T. Nguyen
Abstract:
Objective: Assessing Alzheimer's disease (AD) using high-dimensional radiology images is clinically important but challenging. Although Artificial Intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models embracing predictability and explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease predict…
▽ More
Objective: Assessing Alzheimer's disease (AD) using high-dimensional radiology images is clinically important but challenging. Although Artificial Intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models embracing predictability and explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease prediction and evidence-based, interpretable explanations for clinical decision-making.
Methods: We developed VisTA (Vision-Text Alignment Model) for AD diagnosis. Architecturally, we built VisTA from BiomedCLIP and fine-tuned it using contrastive learning to align images with verified abnormalities and their descriptions. To train VisTA, we used a constructed reference dataset containing images, abnormality types, and descriptions verified by medical experts. VisTA produces four outputs: predicted abnormality type, similarity to reference cases, evidence-driven explanation, and final AD diagnoses. To illustrate VisTA's efficacy, we reported accuracy metrics for abnormality retrieval and dementia prediction. To demonstrate VisTA's explainability, we compared its explanations with human experts' explanations.
Results: Compared to 15 million images used for baseline pretraining, VisTA only used 170 samples for fine-tuning and obtained significant improvement in abnormality retrieval and dementia prediction. For abnormality retrieval, VisTA reached 74% accuracy and an AUC of 0.87 (26% and 0.74, respectively, from baseline models). For dementia prediction, VisTA achieved 88% accuracy and an AUC of 0.82 (30% and 0.57, respectively, from baseline models). The generated explanations agreed strongly with human experts' and provided insights into the diagnostic process. Taken together, VisTA optimize prediction, clinical reasoning, and explanation.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
Adaptive and oblivious statistical adversaries are equivalent
Authors:
Guy Blanc,
Gregory Valiant
Abstract:
We resolve a fundamental question about the ability to perform a statistical task, such as learning, when an adversary corrupts the sample. Such adversaries are specified by the types of corruption they can make and their level of knowledge about the sample. The latter distinguishes between sample-adaptive adversaries which know the contents of the sample when choosing the corruption, and sample-o…
▽ More
We resolve a fundamental question about the ability to perform a statistical task, such as learning, when an adversary corrupts the sample. Such adversaries are specified by the types of corruption they can make and their level of knowledge about the sample. The latter distinguishes between sample-adaptive adversaries which know the contents of the sample when choosing the corruption, and sample-oblivious adversaries, which do not. We prove that for all types of corruptions, sample-adaptive and sample-oblivious adversaries are \emph{equivalent} up to polynomial factors in the sample size. This resolves the main open question introduced by [BLMT22] and further explored in [CHL+23].
Specifically, consider any algorithm $A$ that solves a statistical task even when a sample-oblivious adversary corrupts its input. We show that there is an algorithm $A'$ that solves the same task when the corresponding sample-adaptive adversary corrupts its input. The construction of $A'$ is simple and maintains the computational efficiency of $A$: It requests a polynomially larger sample than $A$ uses and then runs $A$ on a uniformly random subsample.
△ Less
Submitted 29 August, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
The Sample Complexity of Smooth Boosting and the Tightness of the Hardcore Theorem
Authors:
Guy Blanc,
Alexandre Hayderi,
Caleb Koch,
Li-Yang Tan
Abstract:
Smooth boosters generate distributions that do not place too much weight on any given example. Originally introduced for their noise-tolerant properties, such boosters have also found applications in differential privacy, reproducibility, and quantum learning theory. We study and settle the sample complexity of smooth boosting: we exhibit a class that can be weak learned to $γ$-advantage over smoo…
▽ More
Smooth boosters generate distributions that do not place too much weight on any given example. Originally introduced for their noise-tolerant properties, such boosters have also found applications in differential privacy, reproducibility, and quantum learning theory. We study and settle the sample complexity of smooth boosting: we exhibit a class that can be weak learned to $γ$-advantage over smooth distributions with $m$ samples, for which strong learning over the uniform distribution requires $\tildeΩ(1/γ^2)\cdot m$ samples. This matches the overhead of existing smooth boosters and provides the first separation from the setting of distribution-independent boosting, for which the corresponding overhead is $O(1/γ)$.
Our work also sheds new light on Impagliazzo's hardcore theorem from complexity theory, all known proofs of which can be cast in the framework of smooth boosting. For a function $f$ that is mildly hard against size-$s$ circuits, the hardcore theorem provides a set of inputs on which $f$ is extremely hard against size-$s'$ circuits. A downside of this important result is the loss in circuit size, i.e. that $s' \ll s$. Answering a question of Trevisan, we show that this size loss is necessary and in fact, the parameters achieved by known proofs are the best possible.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
A Strong Direct Sum Theorem for Distributional Query Complexity
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $μ^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $μ$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optim…
▽ More
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $μ^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $μ$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optimal. In particular, computing a direct product necessitates a blowup in both query complexity and error.
Strong direct sum theorems contrast with results that only show a blowup in query complexity or error but not both. There has been a long line of such results for distributional query complexity, dating back to (Impagliazzo, Raz, Wigderson 1994) and (Nisan, Rudich, Saks 1994), but a strong direct sum theorem had been elusive.
A key idea in our work is the first use of the Hardcore Theorem (Impagliazzo 1995) in the context of query complexity. We prove a new "resilience lemma" that accompanies it, showing that the hardcore of $f^{\otimes k}$ is likely to remain dense under arbitrary partitions of the input space.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
Harnessing the Power of Choices in Decision Tree Learning
Authors:
Guy Blanc,
Jane Lange,
Chirag Pabbaraju,
Colin Sullivan,
Li-Yang Tan,
Mo Tiwari
Abstract:
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just…
▽ More
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a {\sl greediness hierarchy theorem} showing that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\varepsilon$, whereas the latter only achieves accuracy $\frac1{2}+\varepsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
△ Less
Submitted 25 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
A Strong Composition Theorem for Junta Complexity and the Boosting of Property Testers
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers.
The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large…
▽ More
We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers.
The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large $\varepsilon$-approximate junta complexity, then $g \circ f$ has even larger $\varepsilon'$-approximate junta complexity, even for $\varepsilon' \gg \varepsilon$. We develop a fairly complete understanding of this behavior, proving that the junta complexity of $g \circ f$ is characterized by that of $f$ along with the multivariate noise sensitivity of $g$. For the important case of symmetric functions $g$, we relate their multivariate noise sensitivity to the simpler and well-studied case of univariate noise sensitivity.
We then show how strong composition theorems yield boosting algorithms for property testers: with a strong composition theorem for any class of functions, a large-distance tester for that class is immediately upgraded into one for small distances. Combining our contributions yields a booster for junta testers, and with it new implications for junta testing. This is the first boosting-type result in property testing, and we hope that the connection to composition theorems adds compelling motivation to the study of both topics.
△ Less
Submitted 8 July, 2023;
originally announced July 2023.
-
Lifting uniform learners via distributional decomposition
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$, running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by…
▽ More
We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$, running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by depth-$d$ decision trees, where $m$ is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from $\mathcal{D}$, and for general ones it uses subcube conditioning samples.
A key technical ingredient is an algorithm which, given the aforementioned access to $\mathcal{D}$, produces an optimal decision tree decomposition of $\mathcal{D}$: an approximation of $\mathcal{D}$ as a mixture of uniform distributions over disjoint subcubes. With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree. This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions with runtimes that exponentially improve on the prior state of the art -- results of independent interest in distribution learning.
△ Less
Submitted 29 March, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Subsampling Suffices for Adaptive Data Analysis
Authors:
Guy Blanc
Abstract:
Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the sem…
▽ More
Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014).
We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
In addition to its simplicity, we demonstrate the utility of this framework by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
△ Less
Submitted 24 September, 2024; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Certification with an NP Oracle
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
In the certification problem, the algorithm is given a function $f$ with certificate complexity $k$ and an input $x^\star$, and the goal is to find a certificate of size $\le \text{poly}(k)$ for $f$'s value at $x^\star$. This problem is in $\mathsf{NP}^{\mathsf{NP}}$, and assuming $\mathsf{P} \ne \mathsf{NP}$, is not in $\mathsf{P}$. Prior works, dating back to Valiant in 1984, have therefore soug…
▽ More
In the certification problem, the algorithm is given a function $f$ with certificate complexity $k$ and an input $x^\star$, and the goal is to find a certificate of size $\le \text{poly}(k)$ for $f$'s value at $x^\star$. This problem is in $\mathsf{NP}^{\mathsf{NP}}$, and assuming $\mathsf{P} \ne \mathsf{NP}$, is not in $\mathsf{P}$. Prior works, dating back to Valiant in 1984, have therefore sought to design efficient algorithms by imposing assumptions on $f$ such as monotonicity.
Our first result is a $\mathsf{BPP}^{\mathsf{NP}}$ algorithm for the general problem. The key ingredient is a new notion of the balanced influence of variables, a natural variant of influence that corrects for the bias of the function. Balanced influences can be accurately estimated via uniform generation, and classic $\mathsf{BPP}^{\mathsf{NP}}$ algorithms are known for the latter task.
We then consider certification with stricter instance-wise guarantees: for each $x^\star$, find a certificate whose size scales with that of the smallest certificate for $x^\star$. In sharp contrast with our first result, we show that this problem is $\mathsf{NP}^{\mathsf{NP}}$-hard even to approximate. We obtain an optimal inapproximability ratio, adding to a small handful of problems in the higher levels of the polynomial hierarchy for which optimal inapproximability is known. Our proof involves the novel use of bit-fixing dispersers for gap amplification.
△ Less
Submitted 4 November, 2022;
originally announced November 2022.
-
Multitask Learning via Shared Features: Algorithms and Hardness
Authors:
Konstantina Bairaktari,
Guy Blanc,
Li-Yang Tan,
Jonathan Ullman,
Lydia Zakynthinou
Abstract:
We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k \ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $γ$, which is based on a simultaneous boosting technique and requires only…
▽ More
We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k \ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $γ$, which is based on a simultaneous boosting technique and requires only $\textrm{poly}(k/γ)$ samples-per-task and $\textrm{poly}(k\log(d)/γ)$ samples in total.
In addition, we prove a computational separation, showing that assuming there exists a concept class that cannot be learned in the attribute-efficient model, we can construct another concept class such that can be learned in the attribute-efficient model, but cannot be multitask learned efficiently -- multitask learning this concept class either requires super-polynomial time complexity or a much larger total number of samples.
△ Less
Submitted 7 September, 2022;
originally announced September 2022.
-
A Query-Optimal Algorithm for Finding Counterfactuals
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes \[ {S(f)^{O(Δ_f(x^\star))}\cdot \log d}\] queries to $f$ and returns {an {\sl optimal}} counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is th…
▽ More
We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes \[ {S(f)^{O(Δ_f(x^\star))}\cdot \log d}\] queries to $f$ and returns {an {\sl optimal}} counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is the sensitivity of $f$, a discrete analogue of the Lipschitz constant, and $Δ_f(x^\star)$ is the distance from $x^\star$ to its nearest counterfactuals. The previous best known query complexity was $d^{\,O(Δ_f(x^\star))}$, achievable by brute-force local search. We further prove a lower bound of $S(f)^{Ω(Δ_f(x^\star))} + Ω(\log d)$ on the query complexity of any algorithm, thereby showing that the guarantees of our algorithm are essentially optimal.
△ Less
Submitted 14 July, 2022;
originally announced July 2022.
-
Open Problem: Properly learning decision trees in polynomial time?
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open pr…
▽ More
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
△ Less
Submitted 29 June, 2022;
originally announced June 2022.
-
Popular decision tree algorithms are provably noise tolerant
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been centr…
▽ More
Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been central to everyday machine learning, enjoy provable guarantees in the noisy setting that are unmatched by existing algorithms in the theoretical literature on decision tree learning. Taken together, our results add to an ongoing line of research that seeks to place the empirical success of these practical decision tree algorithms on firm theoretical footing.
△ Less
Submitted 17 June, 2022;
originally announced June 2022.
-
The Query Complexity of Certification
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We study the problem of {\sl certification}: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. This abstractly models a central problem in explainable machine learning, where we think of $f$ as a blackbox model that we seek to explain the predictions of.
For monotone func…
▽ More
We study the problem of {\sl certification}: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. This abstractly models a central problem in explainable machine learning, where we think of $f$ as a blackbox model that we seek to explain the predictions of.
For monotone functions, a classic local search algorithm of Angluin accomplishes this task with $n$ queries, which we show is optimal for local search algorithms. Our main result is a new algorithm for certifying monotone functions with $O(k^8 \log n)$ queries, which comes close to matching the information-theoretic lower bound of $Ω(k \log n)$. The design and analysis of our algorithm are based on a new connection to threshold phenomena in monotone functions.
We further prove exponential-in-$k$ lower bounds when $f$ is non-monotone, and when $f$ is monotone but the algorithm is only given random examples of $f$. These lower bounds show that assumptions on the structure of $f$ and query access to it are both necessary for the polynomial dependence on $k$ that we achieve.
△ Less
Submitted 6 April, 2022; v1 submitted 19 January, 2022;
originally announced January 2022.
-
On the power of adaptivity in statistical adversaries
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt…
▽ More
We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$.
In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.
△ Less
Submitted 29 June, 2022; v1 submitted 19 November, 2021;
originally announced November 2021.
-
Provably efficient, succinct, and precise explanations
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We consider the problem of explaining the predictions of an arbitrary blackbox model $f$: given query access to $f$ and an instance $x$, output a small set of $x$'s features that in conjunction essentially determines $f(x)$. We design an efficient algorithm with provable guarantees on the succinctness and precision of the explanations that it returns. Prior algorithms were either efficient but lac…
▽ More
We consider the problem of explaining the predictions of an arbitrary blackbox model $f$: given query access to $f$ and an instance $x$, output a small set of $x$'s features that in conjunction essentially determines $f(x)$. We design an efficient algorithm with provable guarantees on the succinctness and precision of the explanations that it returns. Prior algorithms were either efficient but lacked such guarantees, or achieved such guarantees but were inefficient.
We obtain our algorithm via a connection to the problem of {\sl implicitly} learning decision trees. The implicit nature of this learning task allows for efficient algorithms even when the complexity of $f$ necessitates an intractably large surrogate decision tree. We solve the implicit learning problem by bringing together techniques from learning theory, local computation algorithms, and complexity theory.
Our approach of "explaining by implicit learning" shares elements of two previously disparate methods for post-hoc explanations, global and local explanations, and we make the case that it enjoys advantages of both.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
Properly learning decision trees in almost polynomial time
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision tr…
▽ More
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision trees, which we augment with additional ideas to circumvent known lower bounds against these heuristics. To analyze our algorithm, we prove a new structural result for decision trees that strengthens a theorem of O'Donnell, Saks, Schramm, and Servedio. While the OSSS theorem says that every decision tree has an influential variable, we show how every decision tree can be "pruned" so that every variable in the resulting tree is influential.
△ Less
Submitted 1 November, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Decision tree heuristics can fail, even in the smoothed setting
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996).
Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possibl…
▽ More
Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996).
Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees.
We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
Multiway Online Correlated Selection
Authors:
Guy Blanc,
Moses Charikar
Abstract:
We give a $0.5368$-competitive algorithm for edge-weighted online bipartite matching. Prior to our work, the best competitive ratio was $0.5086$ due to Fahrbach, Huang, Tao, and Zadimoghaddam (FOCS 2020). They achieved their breakthrough result by developing a subroutine called \emph{online correlated selection} (OCS) which takes as input a sequence of pairs and selects one item from each pair. Im…
▽ More
We give a $0.5368$-competitive algorithm for edge-weighted online bipartite matching. Prior to our work, the best competitive ratio was $0.5086$ due to Fahrbach, Huang, Tao, and Zadimoghaddam (FOCS 2020). They achieved their breakthrough result by developing a subroutine called \emph{online correlated selection} (OCS) which takes as input a sequence of pairs and selects one item from each pair. Importantly, the selections the OCS makes are negatively correlated.
We achieve our result by defining \emph{multiway} OCSes which receive arbitrarily many elements at each step, rather than just two. In addition to better competitive ratios, our formulation allows for a simpler reduction from edge-weighted online bipartite matching to OCSes. While Fahrbach et al. used a factor-revealing linear program to optimize the competitive ratio, our analysis directly connects the competitive ratio to the parameters of the multiway OCS. Finally, we show that the formulation of Farhbach et al. can achieve a competitive ratio of at most $0.5239$, confirming that multiway OCSes are strictly more powerful.
△ Less
Submitted 20 September, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Learning stochastic decision trees
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $η$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2η+ \varepsilon$ of the Bayes optimal. An a…
▽ More
We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $η$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2η+ \varepsilon$ of the Bayes optimal. An additive $2η$ is the information-theoretic minimum.
Previously no non-trivial algorithm with a guarantee of $O(η) + \varepsilon$ was known, even for weaker noise models. Our algorithm is furthermore proper, returning a hypothesis that is itself a decision tree; previously no such algorithm was known even in the noiseless setting.
△ Less
Submitted 8 May, 2021;
originally announced May 2021.
-
Reconstructing decision trees
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give the first {\sl reconstruction algorithm} for decision trees: given queries to a function $f$ that is $\mathrm{opt}$-close to a size-$s$ decision tree, our algorithm provides query access to a decision tree $T$ where:
$\circ$ $T$ has size $S = s^{O((\log s)^2/\varepsilon^3)}$;
$\circ$ $\mathrm{dist}(f,T)\le O(\mathrm{opt})+\varepsilon$;
$\circ$ Every query to $T$ is answered with…
▽ More
We give the first {\sl reconstruction algorithm} for decision trees: given queries to a function $f$ that is $\mathrm{opt}$-close to a size-$s$ decision tree, our algorithm provides query access to a decision tree $T$ where:
$\circ$ $T$ has size $S = s^{O((\log s)^2/\varepsilon^3)}$;
$\circ$ $\mathrm{dist}(f,T)\le O(\mathrm{opt})+\varepsilon$;
$\circ$ Every query to $T$ is answered with $\mathrm{poly}((\log s)/\varepsilon)\cdot \log n$ queries to $f$ and in $\mathrm{poly}((\log s)/\varepsilon)\cdot n\log n$ time.
This yields a {\sl tolerant tester} that distinguishes functions that are close to size-$s$ decision trees from those that are far from size-$S$ decision trees. The polylogarithmic dependence on $s$ in the efficiency of our tester is exponentially smaller than that of existing testers.
Since decision tree complexity is well known to be related to numerous other boolean function properties, our results also provide a new algorithms for reconstructing and testing these properties.
△ Less
Submitted 22 May, 2022; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Estimating decision tree learnability with polylogarithmic sample complexity
Authors:
Guy Blanc,
Neha Gupta,
Jane Lange,
Li-Yang Tan
Abstract:
We show that top-down decision tree learning heuristics are amenable to highly efficient learnability estimation: for monotone target functions, the error of the decision tree hypothesis constructed by these heuristics can be estimated with polylogarithmically many labeled examples, exponentially smaller than the number necessary to run these heuristics, and indeed, exponentially smaller than info…
▽ More
We show that top-down decision tree learning heuristics are amenable to highly efficient learnability estimation: for monotone target functions, the error of the decision tree hypothesis constructed by these heuristics can be estimated with polylogarithmically many labeled examples, exponentially smaller than the number necessary to run these heuristics, and indeed, exponentially smaller than information-theoretic minimum required to learn a good decision tree. This adds to a small but growing list of fundamental learning algorithms that have been shown to be amenable to learnability estimation.
En route to this result, we design and analyze sample-efficient minibatch versions of top-down decision tree learning heuristics and show that they achieve the same provable guarantees as the full-batch versions. We further give "active local" versions of these heuristics: given a test point $x^\star$, we show how the label $T(x^\star)$ of the decision tree hypothesis $T$ can be computed with polylogarithmically many labeled examples, exponentially smaller than the number necessary to learn $T$.
△ Less
Submitted 3 November, 2020;
originally announced November 2020.
-
Query strategies for priced information, revisited
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We consider the problem of designing query strategies for priced information, introduced by Charikar et al. In this problem the algorithm designer is given a function $f : \{0,1\}^n \to \{-1,1\}$ and a price associated with each of the $n$ coordinates. The goal is to design a query strategy for determining $f$'s value on unknown inputs for minimum cost.
Prior works on this problem have focused o…
▽ More
We consider the problem of designing query strategies for priced information, introduced by Charikar et al. In this problem the algorithm designer is given a function $f : \{0,1\}^n \to \{-1,1\}$ and a price associated with each of the $n$ coordinates. The goal is to design a query strategy for determining $f$'s value on unknown inputs for minimum cost.
Prior works on this problem have focused on specific classes of functions. We analyze a simple and natural strategy that applies to all functions $f$, and show that its performance relative to the optimal strategy can be expressed in terms of a basic complexity measure of $f$, its influence. For $\varepsilon \in (0,\frac1{2})$, writing $\mathsf{opt}$ to denote the expected cost of the optimal strategy that errs on at most an $\varepsilon$-fraction of inputs, our strategy has expected cost $\mathsf{opt} \cdot \mathrm{Inf}(f)/\varepsilon^2$ and also errs on at most an $O(\varepsilon)$-fraction of inputs. This connection yields new guarantees that complement existing ones for a number of function classes that have been studied in this context, as well as new guarantees for new classes.
Finally, we show that improving on the parameters that we achieve will require making progress on the longstanding open problem of properly learning decision trees.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
Universal guarantees for decision tree induction via a higher-order splitting criterion
Authors:
Guy Blanc,
Neha Gupta,
Jane Lange,
Li-Yang Tan
Abstract:
We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitt…
▽ More
We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes.
Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $ε$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/ε^2)}$ that achieves error $\le O(\mathsf{opt}_s) + ε$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.
△ Less
Submitted 16 October, 2020;
originally announced October 2020.
-
Efficient hyperparameter optimization by way of PAC-Bayes bound minimization
Authors:
John J. Cherian,
Andrew G. Taube,
Robert T. McGibbon,
Panagiotis Angelikopoulos,
Guy Blanc,
Michael Snarski,
Daniel D. Richman,
John L. Klepeis,
David E. Shaw
Abstract:
Identifying optimal values for a high-dimensional set of hyperparameters is a problem that has received growing attention given its importance to large-scale machine learning applications such as neural architecture search. Recently developed optimization methods can be used to select thousands or even millions of hyperparameters. Such methods often yield overfit models, however, leading to poor p…
▽ More
Identifying optimal values for a high-dimensional set of hyperparameters is a problem that has received growing attention given its importance to large-scale machine learning applications such as neural architecture search. Recently developed optimization methods can be used to select thousands or even millions of hyperparameters. Such methods often yield overfit models, however, leading to poor performance on unseen data. We argue that this overfitting results from using the standard hyperparameter optimization objective function. Here we present an alternative objective that is equivalent to a Probably Approximately Correct-Bayes (PAC-Bayes) bound on the expected out-of-sample error. We then devise an efficient gradient-based algorithm to minimize this objective; the proposed method has asymptotic space and time complexity equal to or better than other gradient-based hyperparameter optimization methods. We show that this new method significantly reduces out-of-sample error when applied to hyperparameter optimization problems known to be prone to overfitting.
△ Less
Submitted 14 August, 2020;
originally announced August 2020.
-
Provable guarantees for decision tree induction: the agnostic setting
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. While prior works have focused on the realizable setting, we consider the more realistic and challenging {\sl agnostic} setting. We show that for all monotone functions~$f$ and parameters $s\in \mathbb{N}$, these heuristics construct a decision…
▽ More
We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. While prior works have focused on the realizable setting, we consider the more realistic and challenging {\sl agnostic} setting. We show that for all monotone functions~$f$ and parameters $s\in \mathbb{N}$, these heuristics construct a decision tree of size $s^{\tilde{O}((\log s)/\varepsilon^2)}$ that achieves error $\le \mathsf{opt}_s + \varepsilon$, where $\mathsf{opt}_s$ denotes the error of the optimal size-$s$ decision tree for $f$. Previously, such a guarantee was not known to be achievable by any algorithm, even one that is not based on top-down heuristics. We complement our algorithmic guarantee with a near-matching $s^{\tildeΩ(\log s)}$ lower bound.
△ Less
Submitted 1 June, 2020;
originally announced June 2020.
-
Anomalous Communications Detection in IoT Networks Using Sparse Autoencoders
Authors:
Mustafizur Rahman Shahid,
Gregory Blanc,
Zonghua Zhang,
Hervé Debar
Abstract:
Nowadays, IoT devices have been widely deployed for enabling various smart services, such as, smart home or e-healthcare. However, security remains as one of the paramount concern as many IoT devices are vulnerable. Moreover, IoT malware are constantly evolving and getting more sophisticated. IoT devices are intended to perform very specific tasks, so their networking behavior is expected to be re…
▽ More
Nowadays, IoT devices have been widely deployed for enabling various smart services, such as, smart home or e-healthcare. However, security remains as one of the paramount concern as many IoT devices are vulnerable. Moreover, IoT malware are constantly evolving and getting more sophisticated. IoT devices are intended to perform very specific tasks, so their networking behavior is expected to be reasonably stable and predictable. Any significant behavioral deviation from the normal patterns would indicate anomalous events. In this paper, we present a method to detect anomalous network communications in IoT networks using a set of sparse autoencoders. The proposed approach allows us to differentiate malicious communications from legitimate ones. So that, if a device is compromised only malicious communications can be dropped while the service provided by the device is not totally interrupted. To characterize network behavior, bidirectional TCP flows are extracted and described using statistics on the size of the first N packets sent and received, along with statistics on the corresponding inter-arrival times between packets. A set of sparse autoencoders is then trained to learn the profile of the legitimate communications generated by an experimental smart home network. Depending on the value of N, the developed model achieves attack detection rates ranging from 86.9% to 91.2%, and false positive rates ranging from 0.1% to 0.5%.
△ Less
Submitted 26 December, 2019;
originally announced December 2019.
-
Constructive derandomization of query algorithms
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give efficient deterministic algorithms for converting randomized query algorithms into deterministic ones. We first give an algorithm that takes as input a randomized $q$-query algorithm $R$ with description length $N$ and a parameter $\varepsilon$, runs in time $\mathrm{poly}(N) \cdot 2^{O(q/\varepsilon)}$, and returns a deterministic $O(q/\varepsilon)$-query algorithm $D$ that $\varepsilon$-…
▽ More
We give efficient deterministic algorithms for converting randomized query algorithms into deterministic ones. We first give an algorithm that takes as input a randomized $q$-query algorithm $R$ with description length $N$ and a parameter $\varepsilon$, runs in time $\mathrm{poly}(N) \cdot 2^{O(q/\varepsilon)}$, and returns a deterministic $O(q/\varepsilon)$-query algorithm $D$ that $\varepsilon$-approximates the acceptance probabilities of $R$. These parameters are near-optimal: runtime $N + 2^{Ω(q/\varepsilon)}$ and query complexity $Ω(q/\varepsilon)$ are necessary.
Next, we give algorithms for instance-optimal and online versions of the problem:
$\circ$ Instance optimal: Construct a deterministic $q^\star_R$-query algorithm $D$, where $q^\star_R$ is minimum query complexity of any deterministic algorithm that $\varepsilon$-approximates $R$.
$\circ$ Online: Deterministically approximate the acceptance probability of $R$ for a specific input $\underline{x}$ in time $\mathrm{poly}(N,q,1/\varepsilon)$, without constructing $D$ in its entirety.
Applying the techniques we develop for these extensions, we constructivize classic results that relate the deterministic, randomized, and quantum query complexities of boolean functions (Nisan, STOC 1989; Beals et al., FOCS 1998). This has direct implications for the Turing machine model of computation: sublinear-time algorithms for total decision problems can be efficiently derandomized and dequantized with a subexponential-time preprocessing step.
△ Less
Submitted 6 December, 2019;
originally announced December 2019.
-
Top-down induction of decision trees: rigorous guarantees and inherent limitations
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
Consider the following heuristic for building a decision tree for a function $f : \{0,1\}^n \to \{\pm 1\}$. Place the most influential variable $x_i$ of $f$ at the root, and recurse on the subfunctions $f_{x_i=0}$ and $f_{x_i=1}$ on the left and right subtrees respectively; terminate once the tree is an $\varepsilon$-approximation of $f$. We analyze the quality of this heuristic, obtaining near-ma…
▽ More
Consider the following heuristic for building a decision tree for a function $f : \{0,1\}^n \to \{\pm 1\}$. Place the most influential variable $x_i$ of $f$ at the root, and recurse on the subfunctions $f_{x_i=0}$ and $f_{x_i=1}$ on the left and right subtrees respectively; terminate once the tree is an $\varepsilon$-approximation of $f$. We analyze the quality of this heuristic, obtaining near-matching upper and lower bounds:
$\circ$ Upper bound: For every $f$ with decision tree size $s$ and every $\varepsilon \in (0,\frac1{2})$, this heuristic builds a decision tree of size at most $s^{O(\log(s/\varepsilon)\log(1/\varepsilon))}$.
$\circ$ Lower bound: For every $\varepsilon \in (0,\frac1{2})$ and $s \le 2^{\tilde{O}(\sqrt{n})}$, there is an $f$ with decision tree size $s$ such that this heuristic builds a decision tree of size $s^{\tildeΩ(\log s)}$.
We also obtain upper and lower bounds for monotone functions: $s^{O(\sqrt{\log s}/\varepsilon)}$ and $s^{\tildeΩ(\sqrt[4]{\log s } )}$ respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004) and Lee (2009).
Our upper bounds yield new algorithms for properly learning decision trees under the uniform distribution. We show that these algorithms---which are motivated by widely employed and empirically successful top-down decision tree learning heuristics such as ID3, C4.5, and CART---achieve provable guarantees that compare favorably with those of the current fastest algorithm (Ehrenfeucht and Haussler, 1989). Our lower bounds shed new light on the limitations of these heuristics.
Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend it to give the first uniform-distribution proper learning algorithm that achieves polynomial sample and memory complexity, while matching its state-of-the-art quasipolynomial runtime.
△ Less
Submitted 17 November, 2019;
originally announced November 2019.
-
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
Authors:
Guy Blanc,
Neha Gupta,
Gregory Valiant,
Paul Valiant
Abstract:
We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm o…
▽ More
We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
△ Less
Submitted 22 July, 2020; v1 submitted 19 April, 2019;
originally announced April 2019.
-
Adaptive Sampled Softmax with Kernel Based Sampling
Authors:
Guy Blanc,
Steffen Rendle
Abstract:
Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it too expensive for many real-world problems. A common approach to speed up training involves sampling only some of the classes at each training step. It is known…
▽ More
Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it too expensive for many real-world problems. A common approach to speed up training involves sampling only some of the classes at each training step. It is known that this method is biased and that the bias increases the more the sampling distribution deviates from the output distribution. Nevertheless, almost any recent work uses simple sampling distributions that require a large sample size to mitigate the bias. In this work, we propose a new class of kernel based sampling methods and develop an efficient sampling algorithm. Kernel based sampling adapts to the model as it is trained, thus resulting in low bias. Kernel based sampling can be easily applied to many models because it relies only on the model's last hidden layer. We empirically study the trade-off of bias, sampling distribution and sample size and show that kernel based sampling results in low bias with few samples.
△ Less
Submitted 1 August, 2018; v1 submitted 1 December, 2017;
originally announced December 2017.
-
Combining Technical and Financial Impacts for Countermeasure Selection
Authors:
Gustavo Gonzalez-Granadillo,
Christophe Ponchel,
Gregory Blanc,
Hervé Debar
Abstract:
Research in information security has generally focused on providing a comprehensive interpretation of threats, vulnerabilities, and attacks, in particular to evaluate their danger and prioritize responses accordingly. Most of the current approaches propose advanced techniques to detect intrusions and complex attacks but few of these approaches propose well defined methodologies to react against a…
▽ More
Research in information security has generally focused on providing a comprehensive interpretation of threats, vulnerabilities, and attacks, in particular to evaluate their danger and prioritize responses accordingly. Most of the current approaches propose advanced techniques to detect intrusions and complex attacks but few of these approaches propose well defined methodologies to react against a given attack. In this paper, we propose a novel and systematic method to select security countermeasures from a pool of candidates, by ranking them based on the technical and financial impact associated to each alternative. The method includes industrial evaluation and simulations of the impact associated to a given security measure which allows to compute the return on response investment for different candidates. A simple case study is proposed at the end of the paper to show the applicability of the model.
△ Less
Submitted 16 October, 2014;
originally announced November 2014.