-
What Are Brands Telling You About Smishing? A Cross-Industry Evaluation of Customer Guidance
Authors:
Dev Vikesh Doshi,
Mehjabeen Tasnim,
Fernando Landeros,
Chinthagumpala Muni Venkatesh,
Daniel Timko,
Muhammad Lutfor Rahman
Abstract:
Phishing attacks through text, also known as smishing, are a prevalent type of social engineering tactic in which attackers impersonate brands to deceive victims into providing personal information and/or money. While smishing awareness and cyber education are a key method by which organizations communicate this awareness, the guidance itself varies widely. In this paper, we investigate the state…
▽ More
Phishing attacks through text, also known as smishing, are a prevalent type of social engineering tactic in which attackers impersonate brands to deceive victims into providing personal information and/or money. While smishing awareness and cyber education are a key method by which organizations communicate this awareness, the guidance itself varies widely. In this paper, we investigate the state of practice of how 149 well-known brands across 25 categories educate their customers about smishing and what smishing prevention and reporting advice they provide. After conducting a comprehensive content analysis of the brands, we identified significant gaps in the smishing-related information provided: only 46\% of the 149 brands mentioned the definition of smishing, less than 1\% had a video tutorial on smishing, and only 50\% of brands provided instructions on how to report. Our study highlights variation in terminology, prevention advice, and reporting mechanisms across industries, with some brands recommending potentially ineffective strategies such as "ignoring suspicious messages." These findings establish a baseline for understanding the current state of industry smishing awareness advice and provide specific areas where standardization improvements are needed. From our evaluation, we provide recommendations for brands on how to offer streamlined education to their respective customers on smishing for better awareness and protection against increasing smishing attacks.
△ Less
Submitted 28 January, 2026;
originally announced January 2026.
-
Sequential Solution Concepts in Cooperative Games with Generalized Characteristic Functions
Authors:
Ashwin Goyal,
Drashthi Doshi,
Swaprava Nath
Abstract:
Motivated by the fact that the worth of a coalition may depend on the order in which agents arrive, Nowak and Radzik (1994) (NR) introduced cooperative games with generalized characteristic functions. We study such temporal cooperative games (TCGs), where the worth function v is defined on sequences of agents π rather than sets S. This order sensitivity necessitates a re-examination of axioms for…
▽ More
Motivated by the fact that the worth of a coalition may depend on the order in which agents arrive, Nowak and Radzik (1994) (NR) introduced cooperative games with generalized characteristic functions. We study such temporal cooperative games (TCGs), where the worth function v is defined on sequences of agents π rather than sets S. This order sensitivity necessitates a re-examination of axioms for reward sharing. NR and subsequent work proposed several axioms; the resulting solution concepts are still inherently order-oblivious and closely tied to the Shapley value. In contrast, we focus on sequential solution concepts that explicitly depend on the realized order π. We study reward-sharing mechanisms satisfying incentive for optimal arrival (I4OA), which promotes orders maximizing total worth; online individual rationality (OIR), which ensures agents are not harmed by later arrivals; and sequential efficiency (SE), which requires that the worth of any sequence is fully distributed among its agents. These axioms are intrinsic to TCGs, and we characterize a class of reward-sharing mechanisms uniquely determined by them. The classical Shapley value does not directly extend to this setting. We therefore construct natural Shapley analogs in two worlds: a sequential world, where rewards are defined for each sequence agent pair, and an extended world, where rewards are defined per agent, consistent with the NR framework. In both cases, the axioms of efficiency, additivity, and null player uniquely characterize the corresponding Shapley analogs. But, these Shapley analogs are disjoint from the class of solutions satisfying the sequential axioms, even for convex and simple TCGs.
△ Less
Submitted 15 March, 2026; v1 submitted 13 October, 2025;
originally announced October 2025.
-
Apple Intelligence Foundation Language Models: Tech Report 2025
Authors:
Ethan Li,
Anders Boesen Lindbo Larsen,
Chen Zhang,
Xiyou Zhou,
Jun Qin,
Dian Ang Yap,
Narendran Raghavan,
Xuankai Chang,
Margit Bowler,
Eray Yildiz,
John Peebles,
Hannah Gillis Coleman,
Matteo Ronchi,
Peter Gray,
Keen You,
Anthony Spalvieri-Kruse,
Ruoming Pang,
Reed Li,
Yuli Yang,
Emad Soroush,
Zhiyun Lu,
Crystal Xiao,
Rong Situ,
Jordan Huffaker,
David Griffiths
, et al. (373 additional authors not shown)
Abstract:
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transform…
▽ More
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines.
A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
△ Less
Submitted 27 August, 2025; v1 submitted 17 July, 2025;
originally announced July 2025.
-
Incentivize Contribution and Learn Parameters Too: Federated Learning with Strategic Data Owners
Authors:
Drashthi Doshi,
Aditya Vema Reddy Kesari,
Avishek Ghosh,
Swaprava Nath,
Suhas S Kowshik
Abstract:
Classical federated learning (FL) assumes that the clients have a limited amount of noisy data with which they voluntarily participate and contribute towards learning a global, more accurate model in a principled manner. The learning happens in a distributed fashion without sharing the data with the center. However, these methods do not consider the incentive of an agent for participating and cont…
▽ More
Classical federated learning (FL) assumes that the clients have a limited amount of noisy data with which they voluntarily participate and contribute towards learning a global, more accurate model in a principled manner. The learning happens in a distributed fashion without sharing the data with the center. However, these methods do not consider the incentive of an agent for participating and contributing to the process, given that data collection and running a distributed algorithm is costly for the clients. The question of rationality of contribution has been asked recently in the literature and some results exist that consider this problem. This paper addresses the question of simultaneous parameter learning and incentivizing contribution in a truthful manner, which distinguishes it from the extant literature. Our first mechanism incentivizes each client to contribute to the FL process at a Nash equilibrium and simultaneously learn the model parameters. We also ensure that agents are incentivized to truthfully reveal information in the intermediate stages of the algorithm. However, this equilibrium outcome can be away from the optimal, where clients contribute with their full data and the algorithm learns the optimal parameters. We propose a second mechanism that enables the full data contribution along with optimal parameter learning. Large scale experiments with real (federated) datasets (CIFAR-10, FEMNIST, and Twitter) show that these algorithms converge quite fast in practice, yield good welfare guarantees and better model performance for all agents.
△ Less
Submitted 15 March, 2026; v1 submitted 17 May, 2025;
originally announced May 2025.
-
(How) Can Transformers Predict Pseudo-Random Numbers?
Authors:
Tao Tao,
Darshil Doshi,
Dayal Singh Kalra,
Tianyu He,
Maissam Barkeshli
Abstract:
Transformers excel at discovering patterns in sequential data, yet their fundamental limitations and learning mechanisms remain crucial topics of investigation. In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{t+1} = a x_t + c \;\mathrm{mod}\; m$. We find that with sufficie…
▽ More
Transformers excel at discovering patterns in sequential data, yet their fundamental limitations and learning mechanisms remain crucial topics of investigation. In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{t+1} = a x_t + c \;\mathrm{mod}\; m$. We find that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$). By analyzing the embedding layers and attention patterns, we uncover how Transformers develop algorithmic structures to learn these sequences in two scenarios of increasing complexity. First, we investigate how Transformers learn LCG sequences with unseen ($a, c$) but fixed modulus; and demonstrate successful learning up to $m = 2^{32}$. We find that models learn to factorize $m$ and utilize digit-wise number representations to make sequential predictions. In the second, more challenging scenario of unseen moduli, we show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$. In this case, the model employs a two-step strategy: first estimating the unknown modulus from the context, then utilizing prime factorizations to generate predictions. For this task, we observe a sharp transition in the accuracy at a critical depth $d= 3$. We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the modulus.
△ Less
Submitted 8 July, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
Grokking Modular Polynomials
Authors:
Darshil Doshi,
Tianyu He,
Aritra Das,
Andrey Gromov
Abstract:
Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend…
▽ More
Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms. Additionally, we show that real networks trained on these datasets learn similar solutions upon generalization (grokking). (ii) We combine these "expert" solutions to construct networks that generalize on arbitrary modular polynomials. (iii) We hypothesize a classification of modular polynomials into learnable and non-learnable via neural networks training; and provide experimental evidence supporting our claims.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
Authors:
Tianyu He,
Darshil Doshi,
Aritra Das,
Andrey Gromov
Abstract:
Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions…
▽ More
Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is \emph{transient}, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both attention heads and MLPs; and discuss the learned algorithms. Notably, we find an algorithmic shift in deeper models, as we go from few to many in-context examples.
△ Less
Submitted 4 November, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation
Authors:
Divyang Doshi,
Jung-Eun Kim
Abstract:
In this research, we propose an innovative method to boost Knowledge Distillation efficiency without the need for resource-heavy teacher models. Knowledge Distillation trains a smaller ``student'' model with guidance from a larger ``teacher'' model, which is computationally costly. However, the main benefit comes from the soft labels provided by the teacher, helping the student grasp nuanced class…
▽ More
In this research, we propose an innovative method to boost Knowledge Distillation efficiency without the need for resource-heavy teacher models. Knowledge Distillation trains a smaller ``student'' model with guidance from a larger ``teacher'' model, which is computationally costly. However, the main benefit comes from the soft labels provided by the teacher, helping the student grasp nuanced class similarities. In our work, we propose an efficient method for generating these soft labels, thereby eliminating the need for a large teacher model. We employ a compact autoencoder to extract essential features and calculate similarity scores between different classes. Afterward, we apply the softmax function to these similarity scores to obtain a soft probability vector. This vector serves as valuable guidance during the training of the student model. Our extensive experiments on various datasets, including CIFAR-100, Tiny Imagenet, and Fashion MNIST, demonstrate the superior resource efficiency of our approach compared to traditional knowledge distillation methods that rely on large teacher models. Importantly, our approach consistently achieves similar or even superior performance in terms of model accuracy. We also perform a comparative study with various techniques recently developed for knowledge distillation showing our approach achieves competitive performance with using significantly less resources. We also show that our approach can be easily added to any logit based knowledge distillation method. This research contributes to making knowledge distillation more accessible and cost-effective for practical applications, making it a promising avenue for improving the efficiency of model training. The code for this work is available at, https://github.com/JEKimLab/ReffAKD.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
Authors:
Darshil Doshi,
Aritra Das,
Tianyu He,
Andrey Gromov
Abstract:
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, an…
▽ More
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where ($ξ\cdot 100\%$) of labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes \emph{grokking} dynamics reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, where the train accuracy suddenly jumps from $100\%$ to $100 (1-ξ)\%$.
△ Less
Submitted 4 March, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Parking Spot Classification based on surround view camera system
Authors:
Andy Xiao,
Deep Doshi,
Lihao Wang,
Harsha Gorantla,
Thomas Heitzmann,
Peter Groth
Abstract:
Surround-view fisheye cameras are commonly used for near-field sensing in automated driving scenarios, including urban driving and auto valet parking. Four fisheye cameras, one on each side, are sufficient to cover 360° around the vehicle capturing the entire near-field region. Based on surround view cameras, there has been much research on parking slot detection with main focus on the occupancy s…
▽ More
Surround-view fisheye cameras are commonly used for near-field sensing in automated driving scenarios, including urban driving and auto valet parking. Four fisheye cameras, one on each side, are sufficient to cover 360° around the vehicle capturing the entire near-field region. Based on surround view cameras, there has been much research on parking slot detection with main focus on the occupancy status in recent years, but little work on whether the free slot is compatible with the mission of the ego vehicle or not. For instance, some spots are handicap or electric vehicles accessible only. In this paper, we tackle parking spot classification based on the surround view camera system. We adapt the object detection neural network YOLOv4 with a novel polygon bounding box model that is well-suited for various shaped parking spaces, such as slanted parking slots. To the best of our knowledge, we present the first detailed study on parking spot detection and classification on fisheye cameras for auto valet parking scenarios. The results prove that our proposed classification approach is effective to distinguish between regular, electric vehicle, and handicap parking spots.
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
AutoInit: Automatic Initialization via Jacobian Tuning
Authors:
Tianyu He,
Darshil Doshi,
Andrey Gromov
Abstract:
Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial and error approach, which has to be applied anew every time an architecture is substantially modified, or inherited from smaller size networks leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm, that allows one to find a good ini…
▽ More
Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial and error approach, which has to be applied anew every time an architecture is substantially modified, or inherited from smaller size networks leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm, that allows one to find a good initialization automatically, for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications
Authors:
Darshil Doshi,
Tianyu He,
Andrey Gromov
Abstract:
Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rat…
▽ More
Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select optimal initialization for a broad class of deep neural networks; containing fully connected, convolutional and normalization layers. Using these tools we show quantitatively that proper stacking of the LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze ResNet and MLP-Mixer architectures; demonstrating the everywhere-critical regime.
△ Less
Submitted 5 October, 2023; v1 submitted 23 November, 2021;
originally announced November 2021.
-
AI Tax: The Hidden Cost of AI Data Center Applications
Authors:
Daniel Richins,
Dharmisha Doshi,
Matthew Blackmore,
Aswathy Thulaseedharan Nair,
Neha Pathapati,
Ankit Patel,
Brainard Daguman,
Daniel Dobrijalowski,
Ramesh Illikkal,
Kevin Long,
David Zimmerman,
Vijay Janapa Reddi
Abstract:
Artificial intelligence and machine learning are experiencing widespread adoption in industry and academia. This has been driven by rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into specialized hardware AI accelerators. Given the rapid pace of advances, it is easy to forget that they are often develope…
▽ More
Artificial intelligence and machine learning are experiencing widespread adoption in industry and academia. This has been driven by rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into specialized hardware AI accelerators. Given the rapid pace of advances, it is easy to forget that they are often developed and evaluated in a vacuum without considering the full application environment. This paper emphasizes the need for a holistic, end-to-end analysis of AI workloads and reveals the "AI tax." We deploy and characterize Face Recognition in an edge data center. The application is an AI-centric edge video analytics application built using popular open source infrastructure and ML tools. Despite using state-of-the-art AI and ML algorithms, the application relies heavily on pre-and post-processing code. As AI-centric applications benefit from the acceleration promised by accelerators, we find they impose stresses on the hardware and software infrastructure: storage and network bandwidth become major bottlenecks with increasing AI acceleration. By specializing for AI applications, we show that a purpose-built edge data center can be designed for the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure.
△ Less
Submitted 20 July, 2020;
originally announced July 2020.