Authors with a wavy underline are students advised by me; for a full list of publications, visit my Google Scholar page.
Archer Amon, Zichong Wang, Zhipeng Yin and Wenbin Zhang
The 24th IEEE International Conference on Data Mining (ICDM), Abu Dhabi, UAE, 2024
In the rapidly evolving landscape of generative artificial intelligence (AI), the increasingly pertinent issue of copyright infringement arises as AI advances to generate content from scraped copyrighted data, prompting questions about ownership and protection that affect professionals across many careers. With this in mind, this survey provides an extensive examination of copyright infringement as it pertains to generative AI, aiming to stay abreast of the latest developments and open problems. Specifically, it first outlines methods for detecting copyright infringement in media such as text, image, and video. Next, it delves into existing techniques aimed at safeguarding copyrighted works from generative models. Furthermore, this survey discusses resources and tools that users can employ to evaluate copyright violations. Finally, insights into ongoing regulations and proposals for AI are explored and compared. By combining these disciplines, the implications of AI-driven content for copyright are thoroughly illustrated and brought into question.
Thang Viet Doan, Zichong Wang, Minh Nhat Nguyen and Wenbin Zhang
The 33rd ACM International Conference on Information and Knowledge Management (CIKM), Boise, USA, 2024
Large Language Models (LLMs) have demonstrated remarkable success across various domains but often lack fairness considerations, potentially leading to discriminatory outcomes against marginalized populations. Unlike fairness in traditional machine learning, fairness in LLMs involves unique backgrounds, taxonomies, and fulfillment techniques. This tutorial provides a systematic overview of recent advances in the literature concerning fair LLMs, beginning with real-world case studies to introduce LLMs, followed by an analysis of bias causes therein. The concept of fairness in LLMs is then explored, summarizing the metrics for evaluating bias and the algorithms designed to promote fairness. Additionally, resources for assessing bias in LLMs, including toolkits and datasets, are compiled, and current research challenges and open questions in the field are discussed.
Eric Xu, Wenbin Zhang and Weifeng Xu
The 33rd ACM International Conference on Information and Knowledge Management (CIKM), Boise, USA, 2024
In the pursuit of justice and accountability in the digital age, the integration of Large Language Models (LLMs) with digital forensics holds immense promise. This half-day tutorial provides a comprehensive exploration of the transformative potential of LLMs in automating digital investigations and uncovering hidden insights. Through a combination of real-world case studies, interactive exercises, and hands-on labs, participants will gain a deep understanding of how to harness LLMs for evidence analysis, entity identification, and knowledge graph reconstruction. By fostering a collaborative learning environment, this tutorial aims to empower professionals, researchers, and students with the skills and knowledge needed to drive innovation in digital forensics. As LLMs continue to revolutionize the field, this tutorial will have far-reaching implications for enhancing justice outcomes, promoting accountability, and shaping the future of digital investigations.
Zhen Liu, Ruoyu Wang, Bitao Peng, Lingyu Qiu and Wenbin Zhang
Computers & Security
Malware remains a challenging security problem in the Android ecosystem, as malware is often obfuscated to evade detection. In such cases, semantic behavior feature extraction is crucial for training a robust malware detection model. In this paper, we propose a novel Android malware detection method (named SeGDroid) that focuses on learning semantic knowledge from sensitive function call graphs (FCGs). Specifically, we devise a graph pruning method that builds a sensitive FCG on the basis of the original FCG. The method preserves the sensitive API (security-related API) call context and removes the irrelevant nodes of FCGs. We propose a node representation method based on word2vec and social-network-based centrality to extract attributes for graph nodes. Our representation aims at extracting the semantic knowledge of the function calls and the structure of graphs. Using this representation, we induce graph embeddings of the sensitive FCGs associated with node attributes using a graph convolutional neural network algorithm. To provide a model explanation, we further propose a method that calculates node importance. This creates a mechanism for understanding malicious behavior. The experimental results show that SeGDroid achieves an F-score of 98% for malware detection on the CICMal2020 dataset and an F-score of 96% for malware family classification on the MalRadar dataset. In addition, the provided model explanation is able to trace the malicious behavior of the Android malware.
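To make the graph-side steps concrete, here is a minimal Python sketch (assumptions, not the authors' released code): it prunes an FCG to the neighborhood of hypothetical sensitive APIs and attaches centrality scores as node attributes; the word2vec component and the GCN classifier are omitted.

```python
# Hypothetical sketch of SeGDroid-style FCG pruning and centrality features.
# SENSITIVE_APIS and the two-hop radius are illustrative assumptions.
import networkx as nx

SENSITIVE_APIS = {"sendTextMessage", "getDeviceId"}

def prune_to_sensitive_fcg(fcg: nx.DiGraph, hops: int = 2) -> nx.DiGraph:
    """Keep only nodes within `hops` calls of a sensitive API, preserving its call context."""
    undirected = fcg.to_undirected()
    keep = set()
    for api in SENSITIVE_APIS & set(fcg.nodes):
        keep |= set(nx.ego_graph(undirected, api, radius=hops).nodes)
    return fcg.subgraph(keep).copy()

def add_centrality_features(fcg: nx.DiGraph) -> None:
    """Attach social-network-style centrality scores as node attributes."""
    degree = nx.degree_centrality(fcg)
    closeness = nx.closeness_centrality(fcg)
    for node in fcg.nodes:
        fcg.nodes[node]["features"] = [degree[node], closeness[node]]

fcg = nx.DiGraph([("onCreate", "helper"), ("helper", "sendTextMessage"), ("a", "b")])
sensitive = prune_to_sensitive_fcg(fcg)
add_centrality_features(sensitive)
print(sensitive.nodes(data=True))  # the irrelevant ("a", "b") branch is pruned away
```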
Zichong Wang, Zhibo Chu, Thang Viet Doan, Shaowei Wang, Yongkai Wu, Vasile Palade and Wenbin Zhang
Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, USA, 2025
Learning high-level representations for graphs is crucial for tasks like node classification, where graph pooling aggregates node features to provide a holistic view that enhances predictive performance. Despite the numerous methods proposed in this promising and rapidly developing research field, most efforts to generalize the pooling operation to graphs are primarily performance-driven, with fairness issues largely overlooked: i) the process of graph pooling could exacerbate disparities in distribution among various subgroups; ii) the resultant graph structure augmentation may inadvertently strengthen intra-group connectivity, leading to unintended inter-group isolation. To this end, this paper extends the initial effort on fair graph pooling to the development of fair graph neural networks, while also providing a unified framework to collectively address group and individual graph fairness. Our experimental evaluations on multiple datasets demonstrate that the proposed method not only outperforms state-of-the-art baselines in terms of fairness but also achieves comparable predictive performance.
Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Xuan Shen, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang and Yanzhi Wang
Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, USA, 2025
Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarse-grained granularity. Current structured pruning methods for LLMs typically depend on a single granularity for assessing weight importance, resulting in notable performance degradation in downstream tasks. Intriguingly, our empirical investigations reveal that unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, i.e., individual weights, yields significantly different sparse LLM structures when juxtaposed with structured pruning. This suggests that both holistic and individual assessments of weight importance are essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%. The code will be released.
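To illustrate the hybrid-grained idea, the sketch below blends per-weight (fine) and per-row (coarse) importance scores with a learnable mixing coefficient; it is a simplified reading of the abstract, not HyWIA's actual implementation, and the importance measures and pruning ratio are assumptions.

```python
# Hedged sketch: blend fine-grained and coarse-grained weight importance.
import torch

def hybrid_importance(weight: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    fine = weight.abs()                                           # per-weight (unstructured) importance
    coarse = weight.norm(dim=1, keepdim=True).expand_as(weight)   # per-row (structured) importance
    a = torch.sigmoid(alpha)                                      # keep the learned blend in [0, 1]
    return a * fine + (1 - a) * coarse

W = torch.randn(8, 16)
alpha = torch.nn.Parameter(torch.zeros(1))   # in the paper, the blend is learned end-to-end
row_scores = hybrid_importance(W, alpha).mean(dim=1)
keep = row_scores.topk(k=4).indices          # e.g., keep the top 50% of weight groups
print(keep)
```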
Wenbin Zhang, Shuigeng Zhou, Toby Walsh and Jeremy C Weiss
AI Magazine
The growing importance of understanding and addressing algorithmic bias in artificial intelligence (AI) has led to a surge in research on AI fairness, which often assumes that the underlying data is independent and identically distributed (IID). However, real-world data frequently exists in non-IID graph structures that capture connections among individual units. To effectively mitigate bias in AI systems, it is essential to bridge the gap between traditional fairness literature, designed for IID data, and the prevalence of non-IID graph data. This survey reviews recent advancements in fairness amidst non-IID graph data, including the newly introduced fair graph generation and the commonly studied fair graph classification. In addition, available datasets and evaluation metrics for future research are identified, the limitations of existing work are highlighted, and promising future directions are proposed.
Zichong Wang, Zhipeng Yin, Yuying Zhang, Liping Yang, Tingting Zhang, Niki Pissinou, Yu Cai, Shu Hu, Yun Li, Liang Zhao and Wenbin Zhang
ACM SIGKDD Explorations
Graph generative models have become increasingly prevalent across various domains due to their superior performance in diverse applications. However, as their adoption rises, particularly in high-risk decision-making scenarios, concerns about their fairness are intensifying within the community. Existing graph-based generation models mainly focus on synthesizing minority nodes to enhance node classification performance. However, by overlooking the node generation process, this strategy may intensify representational disparities among different subgroups, thereby further compromising the fairness of the model. Moreover, existing oversampling methods generate samples by selecting instances from the corresponding subgroups, risking overfitting in those subgroups owing to their underrepresentation. Furthermore, they fail to account for the inherent imbalance in edge distributions among subgroups, consequently introducing structural bias when generating graph structure information. To address these challenges, this paper elucidates how existing graph-based sampling techniques can amplify real-world bias and proposes a novel framework, Fair Graph Synthetic Minority Oversampling Technique (FG-SMOTE), aimed at achieving a fair balance in representing different subgroups. Specifically, FG-SMOTE starts by removing the identifiability of subgroup information from node representations. Subsequently, embeddings for simulated nodes are generated by sampling from these subgroup-desensitized node representations. Lastly, a fair link predictor is employed to generate the graph structure information. Extensive experimental evaluations on four real graph datasets show that FG-SMOTE outperforms the state-of-the-art baselines in fairness while also maintaining competitive predictive performance.
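As an illustration of the oversampling step, the sketch below performs SMOTE-style interpolation between node embeddings; it assumes the embeddings have already been desensitized, and it omits FG-SMOTE's fair link predictor.

```python
# Toy SMOTE-style interpolation in a (desensitized) node embedding space.
import numpy as np

rng = np.random.default_rng(0)

def smote_like_samples(emb: np.ndarray, n_new: int, k: int = 5) -> np.ndarray:
    """Interpolate each seed embedding toward a random one of its k nearest neighbors."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(emb))
        dists = np.linalg.norm(emb - emb[i], axis=1)
        j = rng.choice(np.argsort(dists)[1:k + 1])   # skip the seed itself
        lam = rng.random()
        new.append(emb[i] + lam * (emb[j] - emb[i]))
    return np.stack(new)

minority = rng.normal(size=(20, 8))                  # toy subgroup-desensitized embeddings
synthetic = smote_like_samples(minority, n_new=10)
print(synthetic.shape)                               # (10, 8)
```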
Zichong Wang, Zhipeng Yin, Fang Liu, Zhen Liu, Christine Lisetti, Rui Yu, Shaowei Wang, Jun Liu, Sukumar Ganapati, Shuigeng Zhou and Wenbin Zhang
ACM SIGKDD Explorations
The extensive use of graph-based Machine Learning (ML) decision-making systems has raised numerous concerns about their potential discrimination, especially in domains with high societal impact. Various fair graph methods have thus been proposed, primarily relying on statistical fairness notions that emphasize sensitive attributes as a primary source of bias, leaving other sources of bias inadequately addressed. Existing works employ counterfactual fairness to tackle this issue from a causal perspective. However, these approaches suffer from two key limitations: they overlook hidden confounders that may affect node features and graph structure, leading to an oversimplification of causality and the inability to generate authentic counterfactual instances; they neglect graph structure bias, resulting in over-correlation of sensitive attributes with node representations. In response, this paper introduces the Authentic Graph Counterfactual Generator (AGCG), a novel framework designed to mitigate graph structure bias through a novel fair message passing technique and to improve counterfactual sample generation by inferring hidden confounders. Comprising four key modules - subgraph selection, fair node aggregation, hidden confounder identification, and counterfactual instance generation - AGCG offers a holistic approach to advancing graph model fairness in multiple dimensions. Empirical studies conducted on both real and synthetic datasets demonstrate the effectiveness and utility of AGCG in promoting fair graph-based decision-making.
Xingyu Zhang, Yanshan Wang, Yun Jiang, Charissa B Pacella and Wenbin Zhang
BMC Medical Informatics and Decision Making
Efficient triage in emergency departments (EDs) is critical for timely and appropriate care. Traditional triage systems primarily rely on structured data, but the increasing availability of unstructured data, such as clinical notes, presents an opportunity to enhance predictive models for assessing emergency severity and to explore associations between patient characteristics and severity outcomes. This study utilized data from the National Hospital Ambulatory Medical Care Survey-Emergency Department (NHAMCS-ED) for the year 2021 to develop and compare models predicting emergency severity. The severity scores were categorized into two groups: urgent (scores 1–3) and non-urgent (scores 4–5). We employed both structured data (e.g., demographics, vital signs, medical history) and unstructured data (e.g., chief complaints) processed through a Transformer-based Natural Language Processing (NLP) model (BERT). Four models were developed: a structured data model, an unstructured data model, and two combined models integrating both data types. Additionally, we performed an association analysis to identify significant predictors of emergency severity.
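The combined-model setup can be sketched as follows: pool a BERT [CLS] embedding of the chief complaint and concatenate it with structured features before a simple classifier. The model name, features, and classifier below are illustrative assumptions, not the study's exact pipeline.

```python
# Hypothetical sketch of combining text embeddings with structured triage features.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> np.ndarray:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()  # [CLS] embedding

complaints = ["chest pain radiating to left arm", "ankle sprain after fall"]
structured = np.array([[67, 1, 128], [24, 0, 80]])   # illustrative: age, sex, heart rate
X = np.hstack([np.stack([embed(t) for t in complaints]), structured])
y = np.array([1, 0])                                 # 1 = urgent (scores 1-3), 0 = non-urgent
clf = LogisticRegression(max_iter=1000).fit(X, y)
```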
Zhong Chen, Yi He, Di Wu, Liudong Zuo, Keren Li, Zhiqiang Deng and Wenbin Zhang
Proceedings of the IEEE International Conference on Big Data (IEEE BigData), Washington, D.C., USA, 2024
Aiming at learning from a sequence of data instances over time, online learning has attracted increasing attention in the big data era. Two important variants have emerged: sparse online learning, which has been extensively explored through sparse constraints for online models such as truncated gradient, ℓ1-norm regularization, ℓ1-ball projection, and regularized dual averaging; and online active learning, which aims to build an online prediction model with a limited number of labeled instances by deploying so-called query strategies to select informative instances over time. However, most existing studies consider sparse online learning or online active learning with fixed feature spaces, whereas in practice the features may evolve dynamically over time. To this end, we propose OASF, a novel unified one-pass online learning framework for simultaneous online active learning and sparse online learning, tailored for data streams described by open feature spaces, where new features can emerge constantly and old features may vanish over various time spans. Specifically, we develop an effective online CUR matrix decomposition based on the ℓ1,2 mixed-norm constraint that simultaneously selects important up-to-date samples in a sliding window and retains stable and meaningful features in open feature spaces over time. If the loss function is simultaneously Lipschitz and convex, a sub-linear regret bound of our proposed algorithm is guaranteed with a solid theoretical analysis. Extensive experiments conducted on multiple streaming datasets demonstrate the effectiveness of the proposed OASF compared with state-of-the-art online active learning and sparse online learning methods.
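One ingredient of this framework can be shown compactly: an ℓ1,2 mixed norm sums the ℓ2 norms of a matrix's rows, so scoring rows by their ℓ2 norm and keeping the largest ones selects "important up-to-date samples" inside a window. This is a hedged simplification, not the full online CUR decomposition.

```python
# Simplified l1,2-style row selection over a sliding-window data matrix.
import numpy as np

rng = np.random.default_rng(1)
window = rng.normal(size=(50, 12))            # 50 recent instances, 12 currently-alive features

row_scores = np.linalg.norm(window, axis=1)   # per-row l2 norms; their sum is the l1,2 norm
selected = np.sort(np.argsort(row_scores)[-10:])   # keep the 10 highest-scoring instances
print(selected)
```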
Azmain Kabir, Shaowei Wang, Yuan Tian, Tse-Hsun Chen, Muhammad Asaduzzaman and Wenbin Zhang
ACM Transactions on Software Engineering and Methodology (TOSEM)
Technical Q&A sites are valuable for software developers seeking knowledge, but the code snippets they provide are often uncompilable and incomplete due to unresolved types and missing libraries. This poses a challenge for users who wish to reuse or analyze these snippets. Existing methods either do not focus on creating compilable code or have low success rates. To address this, we propose ZS4C, a lightweight approach for zero-shot synthesis of compilable code from incomplete snippets using Large Language Models (LLMs). ZS4C operates in two stages: first, it uses an LLM, such as GPT-3.5, to identify missing import statements in a snippet; second, it collaborates with a validator (e.g., a compiler) to fix compilation errors caused by incorrect imports and syntax issues. We evaluated ZS4C on the StatType-SO benchmark and a new dataset, Python-SO, which includes 539 Python snippets from Stack Overflow across the 20 most popular Python libraries. ZS4C significantly outperforms existing methods, improving the compilation rate from 63% to 95.1% compared to the state-of-the-art SnR, a 50.1% relative improvement. On average, ZS4C infers more accurate import statements (with an F1 score of 0.98) than SnR, an 8.5% improvement in F1.
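The two-stage loop can be sketched in a few lines. Here `suggest_imports` is a hypothetical stand-in for the LLM call (ZS4C's actual prompts and API are not shown), and Python's built-in `compile` plays the validator role, feeding syntax errors back to stage one.

```python
# Hedged sketch of a ZS4C-style synthesize-then-validate loop.
def suggest_imports(snippet, error=None):
    # Placeholder for an LLM query (e.g., GPT-3.5) given the snippet and last error.
    return ["import numpy as np"]

def synthesize_compilable(snippet, max_rounds=3):
    error = None
    for _ in range(max_rounds):
        candidate = "\n".join(suggest_imports(snippet, error)) + "\n" + snippet
        try:
            compile(candidate, "<snippet>", "exec")  # stage 2: validate
            return candidate
        except SyntaxError as exc:                   # stage 1 retries with the error
            error = str(exc)
    return candidate

print(synthesize_compilable("arr = np.zeros(3)"))
```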
Jun Liu, Zhenglun Kong, Weihao Zeng, Changdi Yang, Pu Zhao, Xuan Shen, Hao Tang, Wenbin Zhang, Geng Yuan, Wei Niu, Xue Lin and Yanzhi Wang
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)
Autonomous driving platforms often encounter various driving scenarios, each with different hardware resources and precision requirements. Considering computational limitations on embedded devices, it is vital to factor in computing costs when deploying on the target autonomous driving platform, such as the DRIVE PX 2. Our goal is to customize the semantic segmentation network based on the computing power and scenarios of autonomous driving hardware. Our proposed feature extractor can automatically search for the most suitable depth multiplier, classifier depth, and kernel depth based on hardware capacity for complexity customization. Our method is notable for addressing MobileNetV4's inability to adjust the size and quantity of convolution kernels dynamically. These parameters grant control over feature channels, managing computational demands. We also propose an extension module for semantic segmentation, which achieves accurate pixel-level classification. Our approach adapts to scenario-specific and task-specific requirements, with automatic parameter search addressing the unique computational complexity and accuracy needs. It can scale its Multiply-Accumulate Operations (MACs) for Task-Specific Learning Adaptation (TSLA), resulting in 80 alternative configurations. These TSLA customizations, tailored to diverse self-driving tasks, maximize computational capacity and model accuracy for hardware utilization.
Zichong Wang, Zhibo Chu, Thang Viet Doan, Shiwen Ni, Min Yang and Wenbin Zhang
AI and Ethics
Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over decades of extensive research, language modeling has progressed from initial statistical language models (SLMs) to the contemporary landscape of large language models (LLMs). Notably, the swift evolution of LLMs has yielded the ability to process, understand, and generate human-level text. Nevertheless, despite the significant advantages that LLMs offer in improving both work and personal lives, the limited understanding among general practitioners of the background and principles of these models hampers their full potential. Notably, most LLM reviews focus on specific aspects and use specialized language, posing a challenge for practitioners lacking relevant background knowledge. In light of this, this survey aims to present a comprehensible overview of LLMs to assist a broader audience. It strives to facilitate a comprehensive understanding by exploring the historical background of language models and tracing their evolution over time. The survey further investigates the factors influencing the development of LLMs, emphasizing key contributions. Additionally, it concentrates on elucidating the underlying principles of LLMs, equipping audiences with essential theoretical knowledge. The survey also highlights the limitations of existing work and points out promising future directions.
Youpeng Li, Lingling An, Xinda Wang, Fuxun Yu, Lichao Sun, Wenbin Zhang and Xuyu Wang
Proceedings of the 40th Annual Computer Security Applications Conference (ACSAC), Waikiki, USA, 2024
Federated learning (FL), an emerging distributed machine learning paradigm, has been applied to various privacy-preserving scenarios. However, due to its distributed nature, FL faces two key issues: the non-independent and identical distribution (non-IID) of user data and vulnerability to Byzantine threats. To address these challenges, in this paper, we propose FedCAP, a robust FL framework against both data heterogeneity and Byzantine attacks. The core of FedCAP is a customized model aggregation rule that facilitates collaborative training among similar clients while accelerating the model deterioration of malicious clients. Furthermore, we design a model update calibration mechanism to help the server capture the differences in the direction and magnitude of model updates among clients. With a Euclidean norm-based anomaly detection mechanism, the server quickly identifies and permanently removes malicious clients. Moreover, the impact of data heterogeneity and Byzantine attacks can be further mitigated through personalization on the client side. We conduct extensive experiments, comparing multiple state-of-the-art baselines, to demonstrate that FedCAP performs well in several non-IID settings and shows strong robustness under a series of poisoning attacks.
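The norm-based screening idea can be illustrated with a few lines of NumPy; this is an assumption-level simplification (median/MAD thresholding), not FedCAP's full customized aggregation or calibration mechanism.

```python
# Toy Euclidean-norm anomaly screening before federated averaging.
import numpy as np

rng = np.random.default_rng(2)
updates = [rng.normal(scale=1.0, size=100) for _ in range(9)]
updates.append(rng.normal(scale=25.0, size=100))     # one poisoned, inflated update

norms = np.array([np.linalg.norm(u) for u in updates])
median = np.median(norms)
mad = np.median(np.abs(norms - median))              # robust spread estimate
benign = [u for u, n in zip(updates, norms) if abs(n - median) <= 3 * mad]

aggregate = np.mean(benign, axis=0)                  # FedAvg over the surviving clients
print(f"kept {len(benign)} of {len(updates)} client updates")
```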
Wenbin Zhang
AI Magazine
Understanding and correcting algorithmic bias in artificial intelligence (AI) has become increasingly important, leading to a surge in research on AI fairness within both the AI community and broader society. Traditionally, this research operates within the constrained supervised learning paradigm, assuming the presence of class labels, independent and identically distributed (IID) data, and batch-based learning necessitating the simultaneous availability of all training data. However, in practice, class labels may be absent due to censoring, data is often represented using non-IID graph structures that capture connections among individual units, and data can arrive and evolve over time. These prevalent real-world data representations limit the applicability of existing fairness literature, which typically addresses fairness in static and tabular supervised learning settings. This paper reviews recent advances in AI fairness aimed at bridging these gaps for practical deployment in real-world scenarios. Additionally, opportunities are envisioned by highlighting the limitations and significant potential for real applications.
Zhibo Chu, Zichong Wang and Wenbin Zhang
ACM SIGKDD Explorations
Large Language Models (LLMs) have demonstrated remarkable success across various domains. However, despite their promising performance in numerous real-world applications, most of these algorithms lack fairness considerations. Consequently, they may lead to discriminatory outcomes against certain communities, particularly marginalized populations, prompting extensive study of fair LLMs. Moreover, fairness in LLMs, in contrast to fairness in traditional machine learning, entails distinct backgrounds, taxonomies, and fulfillment techniques. To this end, this survey presents a comprehensive overview of recent advances in the existing literature concerning fair LLMs. Specifically, a brief introduction to LLMs is provided, followed by an analysis of factors contributing to bias in LLMs. Additionally, the concept of fairness in LLMs is discussed categorically, summarizing metrics for evaluating bias in LLMs and existing algorithms for promoting fairness. Furthermore, resources for evaluating bias in LLMs, including toolkits and datasets, are summarized. Finally, existing research challenges and open questions are discussed.
Zichong Wang, David Ulloa, Tongjia Yu, Raju Rangaswami, Roland Yap and Wenbin Zhang
Proceedings of the 27th European Conference on Artificial Intelligence (ECAI), Santiago de Compostela, Spain, 2024.
Graph Neural Networks (GNNs) have demonstrated remarkable capabilities across various domains. Despite the successes of GNN deployment, their utilization often reflects societal biases, which critically hinder their adoption in high-stakes decision-making scenarios such as online clinical diagnosis and financial crediting. To this end, numerous efforts have been made to develop fair GNNs by proposing various notions and algorithms. However, these efforts typically concentrate on either individual or group fairness, overlooking the intricate interplay between the two and enhancing one, usually at the cost of the other. In addition, existing individual fairness work from the ranking perspective fails to identify those who are truly discriminated against in the ranking. To bridge these gaps, this paper introduces two innovative notions of individual graph fairness and group-aware individual graph fairness, which aim to more accurately measure individual and group biases, respectively. Subsequently, Group Equality Individual Fairness (GEIF), a novel framework designed not only to achieve individual fairness but also to equalize levels of individual fairness among subgroups, is proposed. Furthermore, experiments conducted on several real-world graph datasets demonstrate that GEIF outperforms other state-of-the-art methods by a significant margin in terms of individual fairness, group fairness, and utility performance.
Zichong Wang and Wenbin Zhang
Proceedings of the 27th European Conference on Artificial Intelligence (ECAI), Santiago de Compostela, Spain, 2024.
The widespread use of Artificial Intelligence (AI) based decision-making systems has raised many concerns about potential discrimination, particularly in domains with high societal impact. Most existing fairness research on tackling bias relies heavily on the presence of class labels, an assumption that often mismatches real-world scenarios and ignores the ubiquity of censored data. Further, existing works regard group fairness and individual fairness as two disparate goals, overlooking their inherent interconnection, i.e., addressing one can degrade the other. This paper proposes a novel unified method that aims to mitigate group unfairness under censorship while curbing the amplification of individual unfairness when enforcing group fairness constraints. Specifically, our ranking algorithm optimizes individual fairness within the bounds of group fairness, uniquely accounting for censored information. Evaluation across four benchmark tasks confirms the effectiveness of our method in quantifying and mitigating both fairness dimensions in the face of censored data.
Zichong Wang, Meikang Qiu, Min Chen, Malek Ben Salem, Xin Yao and Wenbin Zhang
Knowledge and Information Systems (KAIS)
Bests of ICDM
Graph Neural Networks (GNNs) have become pivotal in various critical decision-making scenarios due to their exceptional performance. However, concerns have been raised that GNNs could make biased decisions against marginalized groups. To this end, many efforts have been made toward fair GNNs. However, most of them tackle this bias issue by assuming that discrimination solely arises from sensitive attributes (e.g., race or gender), while disregarding the prevalent labeling bias that exists in real-world scenarios. Existing works attempt to address label bias through counterfactual fairness, but they often fail to consider the veracity of counterfactual samples. Moreover, the bias introduced by message-passing mechanisms remains largely unaddressed. To fill these gaps, this paper introduces Real Fair Counterfactual Graph Neural Networks+ (RFCGNN+), a novel learning model that not only addresses graph counterfactual fairness by identifying authentic counterfactual samples within complex graph structures but also incorporates strategies to mitigate labeling bias guided by causal analysis. Additionally, RFCGNN+ introduces a fairness-aware message-passing framework toward comprehensive fair graph neural networks. Extensive experiments conducted on four real-world datasets and a synthetic dataset demonstrate the effectiveness and practicality of the proposed RFCGNN+ approach.
Zichong Wang, Zhibo Chu, Ronald Blanco, Zhong Chen, Shu-Ching Chen and Wenbin Zhang
Proceedings of the 35th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Vilnius, Lithuania, 2024.
Graph neural networks (GNNs) have shown remarkable success in various domains. Nonetheless, studies have shown that GNNs may inherit and amplify societal bias, which critically hinders their application in high-stakes scenarios. Although efforts have been exerted to enhance the fairness of GNNs, most rely on the statistical fairness notion, which assumes that biases arise solely from sensitive attributes, neglecting the pervasive issue of labeling bias in real-world scenarios. To this end, recent works extend counterfactual fairness to graph data to address label bias, but they neglect graph structure bias, whereby nodes sharing sensitive attributes tend to connect more closely. To bridge these gaps, we propose a novel GNN framework, Fair Disentangled GNN (FDGNN), designed to mitigate multi-source biases and enhance the fairness of GNNs while preserving task-related information via disentanglement learning. Specifically, FDGNN first mitigates graph structure bias by ensuring consistent representation of different subgroups. Subsequently, to achieve fair node representation, identified counterfactual instances are utilized as guides for disentangling a node's representation and eliminating sensitive attribute-related information via a masking mechanism. Extensive experiments on multiple real-world graph datasets demonstrate the superiority of FDGNN in graph fairness compared to other state-of-the-art methods while achieving comparable utility performance.
Zichong Wang, Jocelyn Dzuong, Xiaoyong Yuan, Zhong Chen, Yanzhao Wu, Xin Yao and Wenbin Zhang
Proceedings of the 35th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Vilnius, Lithuania, 2024.
As machine learning (ML) extends its influence across diverse societal realms, the need to ensure fairness within these systems has markedly increased, reflecting notable advancements in fairness research. However, most existing fairness studies exclusively optimize either individual fairness or group fairness, neglecting the potential impact on one aspect while enforcing the other. In addition, most operate under the assumption of full access to class labels, a condition that often proves impractical in real-world applications due to censorship. This paper delves into the concept of individual fairness under censorship, with group awareness. We argue that this setup provides a more realistic understanding of fairness that aligns with real-world scenarios. Through experiments conducted on four real-world datasets with socially sensitive concerns and censorship, we demonstrate that our proposed approach not only outperforms state-of-the-art methods in terms of fairness but also maintains a competitive level of predictive performance.
Nripsuta Saxena, Wenbin Zhang and Cyrus Shahabi
AI and Ethics
Ride-hailing services have skyrocketed in popularity due to the convenience they offer, but recent research has shown that their pricing strategies can have a disparate impact on some riders, such as those living in disadvantaged neighborhoods with a greater share of residents of color or residents below the poverty line. Since these communities tend to be more dependent on ride-hailing services due to a lack of adequate public transportation, it is imperative to address this inequity. To this end, this paper presents the first thorough study on fair pricing for ride-hailing services by devising applicable fairness measures and corresponding fair pricing mechanisms. By providing discounts that may be subsidized by the government, our approach results in a greater number of rides, at more affordable prices, for the disadvantaged community. Experiments on real-world Chicago taxi data confirm our theoretical findings, which provide a basis for the government to establish fair ride-hailing policies.
Saket Chaturvedi, Lan Zhang, Wenbin Zhang, Pan He and Xiaoyong Yuan
Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI), Jeju Island, South Korea, 2024.
3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting "triggers" to poison the training dataset, backdoor attacks manipulate the detector's prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D triggers are injected to ensure their visibility within the sparse 3D space, rendering them easy to detect and impractical in real-world scenarios. In this paper, we delve into the robustness of 3D object detection, exploring a new backdoor attack surface through 2D cameras. Given the prevalent adoption of camera and LiDAR signal fusion for high-fidelity 3D perception, we investigate the latent potential of camera signals to disrupt the process. Although the dense nature of camera signals enables the use of nearly imperceptible small-sized triggers to mislead 2D object detection, realizing 2D-oriented backdoor attacks against 3D object detection is non-trivial. The primary challenge emerges from the fusion process that transforms camera signals into a 3D space, thereby compromising the association of the 2D trigger with the target output. To tackle this issue, we propose an innovative 2D-oriented backdoor attack against LiDAR-camera fusion methods for 3D object detection, named BadFusion, aiming to uphold trigger effectiveness throughout the entire fusion process. Extensive experiments validate the effectiveness of BadFusion, achieving a significantly higher attack success rate compared to existing 2D-oriented attacks.
Mingzhen Zhang, Yue Wang, Shangzhi Song, Ruiqiang Guo, Wenbin Zhang, Chengming Li, Junjun Wei, Puqing Jiang and Ronggui Yang
Surfaces and Interfaces
Reactively sputtered tantalum nitride (TaN) thin films are used extensively in high-precision chip resistors because of their near-zero temperature coefficients of resistance (TCR). Passivation is usually necessary to ensure the long-term stability of the films. However, the inevitable room-temperature oxidation of TaN films before resistor device passivation poses a challenge. The impact of room-temperature oxidation on the stability and properties of TaN thin films intended for use in resistors remains unclear. This work systematically studies the room-temperature oxidation of reactively sputtered TaN thin films with varying nitrogen contents, represented by nitrogen flow ratios during film deposition. Results suggest that among different nitrogen flow ratios of 2%, 3%, 5%, and 7%, the films sputtered with a 3% N2 flow ratio are predominantly composed of the Ta2N phase, exhibiting the most stable structure and properties. These films demonstrate unaffected TCR, resistance, and thermal conductivity even upon exposure to air. In contrast, films prepared with other N2 contents are prone to room-temperature oxidation, leading to noticeable degradation in TCR and a reduction in lattice thermal conductivities. Notably, the electrical resistances of different films show little susceptibility to room-temperature oxidation. This work contributes essential insights into the effects of short-term room-temperature oxidation on the properties of TaN films and can have a great impact on their applications in high-precision sheet resistors.
Wenbin Zhang
Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), New Faculty Highlights, Vancouver, Canada, 2024.
Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistics. Most assume the availability of the class label, which is impractical in many real-world applications such as precision medicine, actuarial analysis, and recidivism prediction. To this end, this talk revisits fairness and reveals idiosyncrasies of the existing fairness literature whose assumption of class label availability limits its real-world utility. The primary artifacts are a formulation of fairness with censorship to account for scenarios where the class label is not guaranteed, and a suite of corresponding new fairness notions, algorithms, and theoretical constructs to bridge the gap between the design of a "fair" model in the lab and its deployment in the real world.
Zhen Liu, Ruoyu Wang, Nathalie Japkowicz, Heitor Murilo Gomes, Bitao Peng and Wenbin Zhang
Expert Systems with Applications (ESWA)
Malware remains a challenging security problem in the Android ecosystem, as malware is often obfuscated to evade detection. In such cases, semantic behavior feature extraction is crucial for training a robust malware detection model. In this paper, we propose a novel Android malware detection method (named SeGDroid) that focuses on learning semantic knowledge from sensitive function call graphs (FCGs). Specifically, we devise a graph pruning method that builds a sensitive FCG on the basis of the original FCG. The method preserves the sensitive API (security-related API) call context and removes the irrelevant nodes of FCGs. We propose a node representation method based on word2vec and social-network-based centrality to extract attributes for graph nodes. Our representation aims at extracting the semantic knowledge of the function calls and the structure of graphs. Using this representation, we induce graph embeddings of the sensitive FCGs associated with node attributes using a graph convolutional neural network algorithm. To provide a model explanation, we further propose a method that calculates node importance. This creates a mechanism for understanding malicious behavior. The experimental results show that SeGDroid achieves an F-score of 98% for malware detection on the CICMal2020 dataset and an F-score of 96% for malware family classification on the MalRadar dataset. In addition, the provided model explanation is able to trace the malicious behavior of the Android malware.
Xin Huang, Chenxi Wang, Wenbin Zhang, Sanjay Purushotham and Jianwu Wang
Proceedings of the IEEE International Conference on Big Data (IEEE BigData), Sorrento, Italy, 2023. (Acceptance rate: 92/526=17.5%)
Collocation of measurements from active and passive satellite sensors involves pairing measurements from two sensors that observe the same location quasi-simultaneously but with different spatial resolutions and at different angles. The collocated data, widely recognized as on-track data, consist only of the pixels on an active satellite's orbiting track, and thus have very limited spatial coverage compared to the large amounts of off-track data. Typically, on-track data are labeled with the accurate product type from the active sensor, while off-track data are unlabeled or have inaccurate labels from the passive sensor product. Due to the large volume and rich information contained in off-track data, it is essential to learn a machine learning model that can incorporate the inherent characteristics of off-track data in addition to the on-track data. However, the large amount of off-track data poses challenges for machine learning models to learn good representations, as the data are unlabeled and contain substantial noise. To address the challenges of large amounts of unlabeled off-track data in remote sensing applications, we introduce a self-supervised representation learning model with a VAE and domain adaptation methods to learn a domain-invariant classifier for the on-track and off-track data. The model's performance is enhanced by pre-training a VAE generative model on off-track data to learn a good representation that can be transferred to the downstream domain adaptation and classification tasks. The classifier is built on these representations to classify different cloud types in passive sensing data, with the goal of achieving higher accuracy in cloud property retrieval. Extensive quantitative and qualitative evaluations demonstrate that our method achieves higher accuracy in cloud property retrieval for off-track remote sensing data.
Zichong Wang, Giri Narasimhan, Xin Yao and Wenbin Zhang
Proceedings of the 23rd IEEE International Conference on Data Mining (ICDM), Shanghai, China, 2023. (Acceptance rate: 94/1003=9.37%)
Best Paper Award Candidate
Graph neural networks (GNNs) have demonstrated remarkable success in various real-world applications. However, they often inadvertently inherit and amplify existing societal bias. Most existing approaches for fair GNNs tackle this bias issue by assuming that discrimination solely arises from sensitive attributes such as race or gender, while disregarding the prevalent labeling bias that exists in real-world scenarios. Additionally, prior works attempting to address label bias through counterfactual fairness often fail to consider the veracity of counterfactual samples. This paper aims to bridge these gaps by investigating the identification of authentic counterfactual samples within complex graph structures and proposing strategies for mitigating labeling bias guided by causal analysis. Our proposed learning model, known as Real Fair Counterfactual GNNs (RFCGNN), also goes a step further by considering the learning disparity resulting from imbalanced data distribution across different demographic groups in the graph. Extensive experiments conducted on three real-world datasets and a synthetic dataset demonstrate the effectiveness and practicality of the proposed RFCGNN approach.
Wenbin Zhang, Zichong Wang, Juyong Kim, Cheng Cheng, Thomas Oommen, Pradeep Ravikumar and Jeremy Weiss
Proceedings of the 26th European Conference on Artificial Intelligence (ECAI), Kraków, Poland, 2023. (Acceptance rate: 392/1631=24%)
Algorithmic fairness, the research field of making machine learning (ML) algorithms fair, is an established area in ML. As ML technologies expand their application domains, including ones with high societal impact, it becomes essential to take fairness into consideration when building ML systems. Yet, despite its wide range of socially sensitive applications, most work treats the issue of algorithmic bias as an intrinsic property of supervised learning, i.e., the class label is given as a precondition. Unlike prior studies in fairness, we propose an individual fairness measure and a corresponding algorithm that deal with the challenges of uncertainty arising from censorship in class labels, while enforcing that similar individuals be treated similarly from a ranking perspective, free of the Lipschitz condition in the conventional individual fairness definition. We argue that this perspective represents a more realistic model of fairness research for real-world application deployment and show how learning with such a relaxed precondition draws new insights that better explain algorithmic fairness. We conducted experiments on four real-world datasets to evaluate our proposed method against other fairness models, demonstrating its superiority in minimizing discrimination while maintaining predictive performance in the presence of uncertainty.
Qiong Hu, Amir Mehdizadeh, Alexander Vinel, Miao Cai, Steven Rigdon, Wenbin Zhang and Fadel Megahed
Transportation Research Record (TRR)
With more and more data related to driving, traffic, and road conditions becoming available, there has been renewed interest in predictive modeling of traffic incident risk and corresponding risk factors. New machine learning approaches in particular have recently been proposed, with the goal of forecasting the occurrence of either actual incidents or their surrogates, or estimating driving risk over specific time intervals, road segments, or both. At the same time, as evidenced by our review, prescriptive modeling literature (e.g., routing or truck scheduling) has yet to capitalize on these advancements. Indeed, research into risk-aware modeling for driving is almost entirely focused on hazardous materials transportation (with a very distinct risk profile) and frequently assumes a fixed incident risk per mile driven. We propose a framework for developing data-driven prescriptive optimization models with risk criteria for traditional trucking applications. This approach is combined with a recently developed machine learning model to predict driving risk over a medium-term time horizon (the next 20 min to an hour of driving), resulting in a biobjective shortest path problem. We further propose a solution approach based on the k-shortest path algorithm and illustrate how this can be employed.
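A minimal version of the resulting biobjective shortest-path problem can be solved by enumeration, as sketched below with networkx: list candidate paths in order of distance, then take the lowest-risk one within a distance budget. The toy graph, risk values, and 10% budget are illustrative assumptions, not the paper's full solution approach.

```python
# Toy biobjective routing via k-shortest-path enumeration.
from itertools import islice
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from(
    [("A", "B", 4), ("B", "D", 6), ("A", "C", 5), ("C", "D", 6)], weight="miles"
)
risk = {("A", "B"): 0.9, ("B", "D"): 0.8, ("A", "C"): 0.2, ("C", "D"): 0.3}

def miles(path):
    return sum(G.edges[u, v]["miles"] for u, v in zip(path, path[1:]))

def path_risk(path):
    return sum(risk[u, v] for u, v in zip(path, path[1:]))

candidates = list(islice(nx.shortest_simple_paths(G, "A", "D", weight="miles"), 5))
budget = 1.1 * miles(candidates[0])                  # allow 10% extra distance
best = min((p for p in candidates if miles(p) <= budget), key=path_risk)
print(best, miles(best), path_risk(best))            # picks the safer A-C-D route
```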
Zichong Wang, Charles Wallace, Albert Bifet, Xin Yao and Wenbin Zhang
Proceedings of the 34th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Turin, Italy, 2023. (Acceptance rate: 199/830=23.9%)
Graph generation models have gained increasing popularity and success across various domains. However, most research in this area has concentrated on enhancing performance, with the issue of fairness remaining largely unexplored. Existing graph generation models prioritize minimizing graph reconstruction's expected loss, which can result in representational disparities in the generated graphs that unfairly impact marginalized groups. This paper addresses this socially sensitive issue by conducting the first comprehensive investigation of fair graph generation models by identifying the root causes of representational disparities, and proposing a novel framework that ensures consistent and equitable representation across all groups. Additionally, a suite of fairness metrics has been developed to evaluate bias in graph generation models, standardizing fair graph generation research. Through extensive experiments on five real-world datasets, the proposed framework is demonstrated to outperform existing benchmarks in terms of graph fairness while maintaining competitive prediction performance.
Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang and Yanzhao Wu
Proceedings of the IEEE International Conference on Cognitive Machine Intelligence (IEEE CogMI), Vision Track, Atlanta, USA, 2023
Large Language Models (LLMs) represent the recent success of deep learning in achieving remarkable human-like predictive performance. It has become a mainstream strategy to leverage fine-tuning to adapt LLMs for various real-world applications due to the prohibitive expenses associated with LLM training. The learning rate is one of the most important hyperparameters in LLM fine-tuning with direct impacts on both fine-tuning efficiency and fine-tuned LLM quality. Existing learning rate policies are primarily designed for training traditional deep neural networks (DNNs), which may not work well for LLM fine-tuning. We reassess the research challenges and opportunities of learning rate tuning in the coming era of Large Language Models. This paper makes three original contributions. First, we revisit existing learning rate policies to analyze the critical challenges of learning rate tuning in the era of LLMs. Second, we present LRBench++ to benchmark learning rate policies and facilitate learning rate tuning for both traditional DNNs and LLMs. Third, our experimental analysis with LRBench++ demonstrates the key differences between LLM fine-tuning and traditional DNN training and validates our analysis.
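As a flavor of what such a benchmark sweeps, the snippet below compares three standard learning rate policies over fine-tuning steps; the formulas are textbook schedules, not LRBench++'s actual API.

```python
# Three standard learning rate policies evaluated at a few steps.
import math

def constant(step, base=2e-5):
    return base

def step_decay(step, base=2e-5, drop=0.5, every=1000):
    return base * drop ** (step // every)

def cosine(step, base=2e-5, total=3000):
    return 0.5 * base * (1 + math.cos(math.pi * min(step, total) / total))

for step in (0, 1000, 2000, 3000):
    print(step, constant(step), step_decay(step), cosine(step))
```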
Zichong Wang, Nripsuta Saxena, Tongjia Yu, Sneha Karki, Tyler Zetty, Israat Haque, Shan Zhou, Dukka Kc, Ian Stockwell, Xuyu Wang, Albert Bifet and Wenbin Zhang
Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), Chicago, USA, 2023. (Acceptance rate: 150/608=24.7%)
Best Paper Award
Bias in machine learning has rightly received significant attention over the last decade. However, most fair machine learning (fair-ML) work addressing bias in decision-making systems has focused solely on the offline setting. Despite the wide prevalence of online systems in the real world, work on identifying and correcting bias in the online setting is severely lacking. The unique challenges of the online environment make addressing bias more difficult than in the offline setting. First, Streaming Machine Learning (SML) algorithms must deal with the constantly evolving real-time data stream. Second, they need to adapt to changing data distributions (concept drift) to make accurate predictions on new incoming data. Adding fairness constraints to this already complicated task is not straightforward. In this work, we focus on the challenges of achieving fairness in biased data streams while accounting for the presence of concept drift, accessing one sample at a time. We present Fair Sampling over Stream, a novel fair rebalancing approach that can be integrated with SML classification algorithms. Furthermore, we devise the first unified performance-fairness metric, Fairness Bonded Utility (FBU), to efficiently evaluate and compare the trade-off between performance and fairness of different bias mitigation methods. FBU simplifies the comparison of fairness-performance trade-offs of multiple techniques through one unified and intuitive evaluation, allowing model designers to easily choose a technique. Overall, extensive evaluations show that our approach surpasses the fair online techniques previously reported in the literature.
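To give a sense of one-sample-at-a-time rebalancing, here is a deliberately simplified sketch: it tracks (group, label) counts online and duplicates arriving samples from underrepresented cells. It illustrates the general idea only; the paper's Fair Sampling over Stream algorithm and the FBU metric are not reproduced here.

```python
# Toy online rebalancing over a stream, one sample at a time.
from collections import Counter

counts = Counter()

def rebalanced(instance, group, label, boost=3):
    """Yield the instance once, plus capped extra copies if its cell lags behind."""
    counts[(group, label)] += 1
    deficit = max(counts.values()) - counts[(group, label)]
    return [instance] * (1 + min(boost, deficit))

stream = [({"x": i}, i % 2, int(i % 3 == 0)) for i in range(10)]
for inst, group, label in stream:
    for sample in rebalanced(inst, group, label):
        pass  # feed `sample` to an SML classifier's partial_fit here
```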
Nripsuta Saxena, Wenbin Zhang and Cyrus Shahabi
Proceedings of the SIAM International Conference on Data Mining (SDM), Blue Sky Track, Minneapolis, USA, 2023
In the last decade or so, the area of fairness in AI has received widespread attention, both within the scientific community as well as the general media. Researchers have made significant progress towards fairer AI, with work exploring everything from statistical definitions of fairness for individual and group fairness to fairness constraints and algorithms for debiasing models and datasets. Given the nascent nature of the field, however, progress in the space has been somewhat haphazard. For work in fair-AI to have as much real-world impact as possible, we need to take a step back and gauge what the gaps are, and which research questions need urgent attention. This work analyzes where the field is currently, and proposes more focused questions and new areas of research within fair AI.
Wenbin Zhang, Tina Hernandez-Boussard and Jeremy Weiss
Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, D.C., USA, 2023. (Acceptance rate: 1721/8777=19.6%)
Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistic. Most assume availability of the class label, which is impractical in many real-world applications such as precision medicine, actuarial analysis and recidivism prediction. Here we consider fairness in longitudinal right-censored environments, where the time to event might be unknown, resulting in censorship of the class label and inapplicability of existing fairness studies. We devise applicable fairness measures, propose a debiasing algorithm, and provide necessary theoretical constructs to bridge fairness with and without censorship for these important and socially-sensitive tasks. Our experiments on four censored datasets confirm the utility of our approach.
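One way to probe fairness without imputing censored labels, in the spirit of this line of work, is to compare a survival model's concordance index across sensitive groups; the sketch below does exactly that with lifelines and synthetic data, and is an illustration rather than the paper's debiasing algorithm.

```python
# Group-wise concordance gap on right-censored data (illustrative only).
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(3)
n = 200
group = rng.integers(0, 2, n)                 # sensitive attribute
times = rng.exponential(10, n)                # observed time to event
observed = rng.random(n) < 0.7                # ~30% right-censored
pred_survival = times + rng.normal(0, 2, n)   # toy scores: higher = longer predicted survival

cindex = {g: concordance_index(times[group == g],
                               pred_survival[group == g],
                               observed[group == g]) for g in (0, 1)}
print(cindex, "gap:", abs(cindex[0] - cindex[1]))
```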
Wenbin Zhang and Jeremy Weiss
Knowledge and Information Systems (KAIS)
Bests of ICDM
Fairness in machine learning (ML) has gained attention within the ML community and the broader society beyond, with many fairness definitions and algorithms being proposed. Surprisingly, there is little work quantifying and guaranteeing fairness in the presence of uncertainty, which is prevalent in many socially sensitive applications, ranging from marketing analytics to actuarial analysis and recidivism prediction instruments. To this end, we revisit fairness and reveal idiosyncrasies of the existing fairness literature whose assumption of certainty on the class label limits its real-world utility. Our primary contributions are formulating fairness under uncertainty and group constraints, along with a suite of corresponding new fairness definitions and algorithms. We argue that this formulation has a broader applicability to practical scenarios concerning fairness. We also show how the newly devised fairness notions involving censored information and the general framework for fair predictions in the presence of censorship allow us to measure and mitigate discrimination under uncertainty, bridging the gap with real-world applications. Empirical evaluations on real-world datasets with censorship and sensitive attributes demonstrate the practicality of our approach.
Wenbin Zhang and Jeremy Weiss
Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), online, 2022. (Acceptance rate: 1349/9020=15%)
Also at Research2Clinics Workshop at NeurIPS, online, 2021
Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistic. Most assume availability of the class label, which is impractical in many real-world applications such as precision medicine, actuarial analysis and recidivism prediction. Here we consider fairness in longitudinal right-censored environments, where the time to event might be unknown, resulting in censorship of the class label and inapplicability of existing fairness studies. We devise applicable fairness measures, propose a debiasing algorithm, and provide necessary theoretical constructs to bridge fairness with and without censorship for these important and socially-sensitive tasks. Our experiments on four censored datasets confirm the utility of our approach.
Mohammad Ariful Islam, Hisham Siddique, Wenbin Zhang and Israat Haque
IEEE Transactions on Network and Service Management (TNSM)
5G networks enable emerging latency- and bandwidth-critical applications like industrial IoT, AR/VR, and autonomous vehicles, in addition to supporting traditional voice and data communications. In 5G infrastructure, Radio Access Networks (RANs) consist of radio base stations that communicate over wireless radio links. The communication, however, is prone to environmental changes like the weather and can suffer from radio link failure, interrupting ongoing services. The impact is severe in the above-mentioned applications. One way to mitigate such service interruption is to proactively predict failures and reconfigure the resource allocation accordingly. Existing works, such as supervised ensemble learning-based models, do not consider the spatial-temporal correlation between radio communication and weather changes. This paper proposes a communication link failure prediction scheme based on an LSTM-autoencoder that considers the spatial-temporal correlation between radio communication and the weather forecast. We implement and evaluate the proposed scheme over a huge volume of real radio and weather data. The results confirm that the proposed scheme significantly outperforms existing solutions.
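The backbone architecture is easy to sketch in PyTorch: an LSTM encoder-decoder trained to reconstruct windows of link and weather features, with reconstruction error flagging likely failures. Feature counts and window length below are assumptions, not the paper's configuration.

```python
# Minimal LSTM-autoencoder sketch for link failure prediction.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, n_features, batch_first=True)

    def forward(self, x):
        z, _ = self.encoder(x)        # (batch, time, hidden)
        recon, _ = self.decoder(z)    # back to (batch, time, n_features)
        return recon

model = LSTMAutoencoder(n_features=6)   # e.g., radio KPIs plus weather features
window = torch.randn(8, 24, 6)          # 8 links, 24 time steps
loss = nn.functional.mse_loss(model(window), window)
loss.backward()  # train on healthy windows; score new windows by reconstruction error
```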
Tai Le Quy, Arjun Roy, Vasileios Iosifidis, Wenbin Zhang and Eirini Ntoutsi
Data Mining and Knowledge Discovery (DAMI)
Top 10 articles in 2022
As decision-making increasingly relies on machine learning (ML) and (big) data, the issue of fairness in data-driven artificial intelligence systems is receiving increasing attention from both research and industry. A large variety of fairness-aware ML solutions have been proposed, involving fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware ML. We focus on tabular data as the most common data representation for fairness-aware ML. We start our analysis by identifying relationships between the different attributes, particularly with respect to the protected attributes and the class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate interesting relationships using exploratory analysis.
Zhen Liu, Ruoyu Wang and Wenbin Zhang
Medical & Biological Engineering & Computing (MBEC)
Machine learning techniques have been utilized on gene expression profiling for cancer diagnosis. However, gene expression data suffer from the curse of high dimensionality. Various feature reduction methods have been proposed to reduce the feature set for specific cancer diagnoses. However, because samples of a particular tumor are difficult to obtain, the lack of training samples can lead to overfitting. In addition, a feature reduction model built for a specific tumor may not be scalable or generalizable to new cancer types. To handle these problems, this paper proposes an unsupervised feature learning method to reduce the dimensionality of gene expression data. The method amplifies the training samples for feature learning by utilizing unlabeled samples from different sources, with two heuristic rules devised to check whether an unlabeled sample can be used to amplify the training set. The amplified training set is used to train a feature learning model based on a sparse autoencoder. Since the method leverages knowledge among expression data from different sources, it improves the generalization of unsupervised feature learning and further boosts cancer diagnosis performance. A series of experiments is carried out on gene expression datasets from TCGA and other sources. Experimental results show that our method improves the generalization of cancer diagnosis when unlabeled data are used for latent feature learning.
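A minimal sketch of the sparse autoencoder component, assuming illustrative layer sizes and an L1 sparsity penalty on the hidden activations (the paper's exact architecture and training-set amplification rules are not reproduced here):

```python
# Sketch of a sparse autoencoder for gene-expression dimensionality reduction:
# an L1 penalty on hidden activations encourages sparse codes.
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    def __init__(self, n_genes: int = 20000, n_latent: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, n_latent), nn.Sigmoid())
        self.dec = nn.Linear(n_latent, n_genes)

    def forward(self, x):
        h = self.enc(x)
        return self.dec(h), h

model = SparseAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 20000)          # an amplified (labeled + unlabeled) batch
recon, h = model(x)
loss = nn.functional.mse_loss(recon, x) + 1e-4 * h.abs().mean()  # sparsity term
loss.backward()
opt.step()
```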
Thomas Guyet, Wenbin Zhang and Albert Bifet
Proceedings of the 22nd International Conference on Computational Science (ICCS), online, 2022. (Acceptance rate: 55/169=32%)
The need to analyze information from streams arises in a variety of applications. One of the fundamental research directions is to mine sequential patterns over data streams. Current studies mine series of items based on the existence of a pattern in transactions but pay no attention to series of itemsets and their multiple occurrences. Patterns over a window of an itemset stream, together with their multiple occurrences, however, provide additional capability to recognize the essential characteristics of the patterns and the inter-relationships among them that are unidentifiable by existing item- and existence-based studies. In this paper, we study this new sequential pattern mining problem and propose a corresponding efficient sequential miner with novel strategies to prune the search space efficiently. Experiments on both real and synthetic data show the utility of our approach.
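To make the occurrence semantics concrete, the sketch below counts all occurrences of a sequential pattern of itemsets inside a window; the paper's miner and its pruning strategies are far more involved.

```python
# Sketch: counting *all* occurrences of a sequential pattern of itemsets
# inside a window of an itemset stream.
from typing import List, Set

def count_occurrences(window: List[Set[str]], pattern: List[Set[str]]) -> int:
    """Number of index tuples i1 < i2 < ... with pattern[k] a subset of window[ik]."""
    def rec(start: int, k: int) -> int:
        if k == len(pattern):
            return 1
        return sum(rec(i + 1, k + 1)
                   for i in range(start, len(window))
                   if pattern[k] <= window[i])   # set <= set is the subset test
    return rec(0, 0)

window = [{"a", "b"}, {"b"}, {"a", "b", "c"}, {"b", "c"}]
print(count_occurrences(window, [{"a"}, {"b"}]))   # 4 occurrences
```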
Kea Turner, Naomi C Brownstein, Zachary Thompson, Issam El Naqa, Yi Luo, Heather SL Jim, Dana E Rollison, Rachel Howard, Desmond Zeng, Stephen A Rosenberg, Bradford Perez, Andreas Saltos, Laura B Oswald, Brian D Gonzalez, Jessica Y Islam, Amir Alishahi Tabriz, Wenbin Zhang and Thomas J Dilling
Radiotherapy and Oncology
Background and purpose: The study objective was to determine whether longitudinal changes in patient-reported outcomes (PROs) were associated with survival among early-stage, non-small cell lung cancer (NSCLC) patients undergoing stereotactic body radiation therapy (SBRT).
Materials and methods: Data were obtained from January 2015 through March 2020. We ran a joint probability model to assess the relationship between time-to-death and longitudinal PRO measurements. PROs were measured through the Edmonton Symptom Assessment Scale (ESAS). We controlled for other covariates likely to affect symptom burden and survival including stage, tumor diameter, comorbidities, gender, race/ethnicity, relationship status, age, and smoking status.
Results: The sample included 510 early-stage NSCLC patients undergoing SBRT. The median age was 73.8 years (range: 46.3-94.6). The survival component of the joint model demonstrates that longitudinal changes in ESAS scores are significantly associated with worse survival (HR: 1.04; 95% CI: 1.02-1.05). This finding suggests that a one-unit increase in ESAS score increased the hazard of death by 4%. Other factors significantly associated with worse survival included older age (HR: 1.04; 95% CI: 1.03-1.05), larger tumor diameter (HR: 1.21; 95% CI: 1.01-1.46), male gender (HR: 1.87; 95% CI: 1.36-2.57), and current smoking status (HR: 2.39; 95% CI: 1.25-4.56).
Conclusion: PROs are increasingly being collected as a part of routine care delivery to improve symptom management. Healthcare systems can integrate these data with other real-world data to predict patient outcomes, such as survival. Capturing longitudinal PROs, in addition to PROs at diagnosis, may add prognostic value for estimating survival among early-stage NSCLC patients undergoing SBRT.
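As a quick worked check of the hazard-ratio interpretation reported above (HR = 1.04 per one-unit ESAS increase, compounding multiplicatively for larger changes):

```python
# Worked check of the reported hazard-ratio interpretation.
hr = 1.04
print(f"{(hr - 1):.0%} per unit")        # 4% higher hazard per one-unit increase
print(f"{hr ** 10 - 1:.0%} for +10")     # ~48% for a ten-unit increase
```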
Wenbin Zhang and Jeremy Weiss
Proceedings of the 21st IEEE International Conference on Data Mining (ICDM), online, 2021. (Acceptance rate: 98/990=9.9%)
Best Paper Award Candidate
There has been concern within the artificial intelligence (AI) community and the broader society regarding the potential lack of fairness of AI-based decision-making systems. Surprisingly, there is little work quantifying and guaranteeing fairness in the presence of uncertainty which is prevalent in many socially sensitive applications, ranging from marketing analytics to actuarial analysis and recidivism prediction instruments. To this end, we study a longitudinal censored learning problem subject to fairness constraints, where we require that algorithmic decisions made do not affect certain individuals or social groups negatively in the presence of uncertainty on class label due to censorship. We argue that this formulation has a broader applicability to practical scenarios concerning fairness. We show how the newly devised fairness notions involving censored information and the general framework for fair predictions in the presence of censorship allow us to measure and mitigate discrimination under uncertainty that bridges the gap with real-world applications. Empirical evaluations on real-world discriminated datasets with censorship demonstrate the practicality of our approach.
Wenbin Zhang, Liming Zhang, Dieter Pfoser and Liang Zhao
Proceedings of the SIAM International Conference on Data Mining (SDM), online, 2021. (Acceptance rate: 85/400=21.25%)
Deep generative models for graphs have exhibited promising performance in ever-increasing domains such as the design of molecules (i.e., graphs of atoms) and structure prediction of proteins (i.e., graphs of amino acids). Existing work typically focuses on static rather than dynamic graphs, which are actually very important in applications such as protein folding, molecule reactions, and human mobility. Extending existing deep generative models from static to dynamic graphs is a challenging task, which requires handling the factorization of static and dynamic characteristics as well as mutual interactions among node and edge patterns. Here, this paper proposes a novel framework of factorized deep generative models to achieve interpretable dynamic graph generation. Various generative models are proposed to characterize conditional independence among node, edge, static, and dynamic factors. Then, variational optimization strategies as well as dynamic graph decoders are proposed based on newly designed factorized variational autoencoders and recurrent graph deconvolutions. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed models.
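A compact sketch of the factorization idea, separating a sequence-level static latent from per-step dynamic latents in a toy dynamic-graph VAE; the flattened-adjacency encoding and all sizes are illustrative assumptions, not the paper's models:

```python
# Sketch: factorize a static latent (shared across time) from dynamic latents
# (one per step) when generating a sequence of graph snapshots.
import torch
import torch.nn as nn

class FactorizedDynGraphVAE(nn.Module):
    def __init__(self, n_nodes: int = 10, d_static: int = 8, d_dyn: int = 8):
        super().__init__()
        flat = n_nodes * n_nodes
        self.rnn = nn.GRU(flat, 32, batch_first=True)
        self.to_static = nn.Linear(32, 2 * d_static)   # mu, logvar (sequence)
        self.to_dyn = nn.Linear(32, 2 * d_dyn)         # mu, logvar (per step)
        self.dec = nn.Linear(d_static + d_dyn, flat)
        self.n = n_nodes

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, adj_seq):                 # (batch, time, n, n)
        b, t = adj_seq.shape[:2]
        h, _ = self.rnn(adj_seq.reshape(b, t, -1))
        z_static = self.sample(self.to_static(h[:, -1]))   # (b, d_static)
        z_dyn = self.sample(self.to_dyn(h))                # (b, t, d_dyn)
        z = torch.cat([z_static.unsqueeze(1).expand(-1, t, -1), z_dyn], -1)
        return torch.sigmoid(self.dec(z)).reshape(b, t, self.n, self.n)

model = FactorizedDynGraphVAE()
recon = model(torch.rand(4, 6, 10, 10))   # 4 sequences of 6 graph snapshots
```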
Zhen Liu, Nathalie Japkowicz, Deyu Tang, Wenbin Zhang and Jie Zhao
Future Generation Computer Systems (FGCS)
Android malware detection has attracted much attention in recent years. Existing methods mainly focus on extracting static or dynamic features from mobile apps and building mobile malware detection models with machine learning algorithms. The number of extracted static or dynamic features can be very high, so the data suffer from high dimensionality. In addition, to avoid being detected, malware data is varied and hard to obtain in the first place. To detect zero-day malware, unsupervised malware detection methods have been applied. In such cases, an unsupervised feature reduction method is an appropriate choice for reducing the data dimensionality. In this paper, we propose an unsupervised feature learning algorithm called Subspace-based Restricted Boltzmann Machines (SRBM) for reducing data dimensionality in malware detection. Multiple subspaces in the original data are first searched, and then an RBM is built on each subspace. The outputs of the hidden layers of all trained RBMs are combined to represent the data in a lower dimension. Experimental results on the OmniDroid, CIC2019 and CIC2020 datasets show that the features learned by SRBM perform better than those learned by other feature reduction methods when performance is evaluated by clustering metrics, i.e., NMI, ACC and F-score.
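A minimal sketch of the SRBM idea using scikit-learn's BernoulliRBM, with random subspaces standing in for the paper's subspace search:

```python
# Sketch: train one RBM per feature subspace and concatenate the hidden
# representations as the reduced features. The paper searches subspaces
# rather than sampling them at random, as done here for brevity.
import numpy as np
from sklearn.neural_network import BernoulliRBM

def srbm_reduce(X: np.ndarray, n_subspaces: int = 4, n_hidden: int = 32,
                seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_subspaces):
        cols = rng.choice(X.shape[1], size=X.shape[1] // n_subspaces,
                          replace=False)
        rbm = BernoulliRBM(n_components=n_hidden, random_state=seed)
        reps.append(rbm.fit_transform(X[:, cols]))   # hidden-layer activations
    return np.hstack(reps)                           # lower-dimensional features

X = np.random.rand(100, 200)                         # e.g. normalized app features
print(srbm_reduce(X).shape)                          # (100, 128)
```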
Wenbin Zhang, Albert Bifet, Xiangliang Zhang, Jeremy Weiss and Wolfgang Nejdl
Proceedings of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), online, 2021. (Acceptance rate: 157/768=20.4%)
As Artificial Intelligence (AI) is used in more applications, the need to consider and mitigate biases from the learned models has followed. Most work on developing fair learning algorithms focuses on the offline setting. However, in many real-world applications data arrives in an online fashion and needs to be processed on the fly. Moreover, in practical applications there is a trade-off between accuracy and fairness that must be accounted for, but current methods often have multiple hyper-parameters with non-trivial interactions required to achieve fairness. In this paper, we propose a flexible ensemble algorithm for fair decision-making in the more challenging context of evolving online settings. This algorithm, called FARF (Fair and Adaptive Random Forests), is based on using online component classifiers and updating them according to the current distribution, while also accounting for fairness via a single hyper-parameter that alters the fairness-accuracy balance. Experiments on real-world discriminated data streams demonstrate the utility of FARF.
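A much-simplified sketch of the single-knob fairness-accuracy trade-off in an online ensemble, using Poisson online bagging and a crude parity proxy; FARF itself builds on adaptive random forests and is not reproduced here:

```python
# Sketch: online ensemble whose member weights blend accuracy against a
# simple parity proxy via one hyper-parameter `lam`.
import numpy as np
from sklearn.linear_model import SGDClassifier

class OnlineFairEnsemble:
    def __init__(self, n_members=10, lam=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.members = [SGDClassifier(loss="log_loss") for _ in range(n_members)]
        self.weights = np.ones(n_members)
        self.lam = lam                      # 0 = pure accuracy, 1 = pure fairness

    def update(self, x, y, s):
        # s is the sensitive attribute of this instance (0/1).
        for i, m in enumerate(self.members):
            k = self.rng.poisson(1.0)       # online bagging replication factor
            if k > 0:
                m.partial_fit([x], [y], classes=[0, 1], sample_weight=[k])
            pred = m.predict([x])[0] if hasattr(m, "coef_") else 0
            acc = float(pred == y)
            parity_penalty = float(pred == 1 and s == 1)   # crude parity proxy
            score = (1 - self.lam) * acc - self.lam * parity_penalty
            self.weights[i] = 0.9 * self.weights[i] + 0.1 * max(score, 0.0)

ens = OnlineFairEnsemble(lam=0.3)
ens.update(np.array([0.1, 0.7]), 1, 0)      # one stream instance
```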
Xuejian Wang, Wenbin Zhang, Aishwarya Jadhav and Jeremy Weiss
AAAI Spring Symposium Series (AAAI SSS), online, 2021
Survival analysis models are necessary for clinical forecasting with data censorship. Implicitly, existing works focus on higher-risk individuals, while lower-risk individuals are poorly characterized. Developing survival models that represent individuals of different risk levels equally is a challenging task but of great importance for providing accurate risk assessments across levels of risk. Here, we characterize this problem and propose an adjusted log-likelihood formulation as the new objective for survival prognostication. Several models are then proposed based on the newly designed optimization objective, producing risk estimates that count individuals “equally” on risk ratios and thus provide representative attention to individuals of varying risk. Extensive experiments on multiple real-world datasets demonstrate the benefits of the proposed approach.
Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang and Eirini Ntoutsi
Proceedings of the 30th International Conference on Artificial Neural Networks (ICANN), online, 2021. (Acceptance rate: 255/561=45%)
While image understanding at the recognition level has achieved remarkable advancements, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach.
Mingli Zhang, Ahmad Chaddad, Fenghua Guo, Wenbin Zhang, Ji Zhang and Alan Evans
Proceedings of the 32nd International Conference on Databases and Expert Systems Applications (DEXA), online, 2021. (Acceptance rate: 67/248=27%)
Variational AutoEncoders (VAEs), a class of neural networks performing nonlinear dimensionality reduction, have become an effective tool in neuroimaging analysis. Currently, most studies on VAEs consider unsupervised learning to capture the latent representations, and to some extent this strategy may be under-explored in the case of heavy noise and imbalanced neural image datasets. From a reinforcement learning point of view, it is necessary to consider the class-wise capability of the decoder. The latent space of an autoencoder depends on the distribution of the raw data, the architecture of the model and the dimension of the latent space; combining a supervised linear autoencoder model with a variational autoencoder may therefore improve classification performance. In this paper, we propose a supervised linear and nonlinear cascade dual autoencoder approach, which increases the discriminative capability of the latent space by feeding the low-dimensional latent space from a semi-supervised VAE into a further linear encoder-decoder model. The effectiveness of the proposed approach is demonstrated on brain development data. The proposed method is also evaluated on imbalanced neural spiking classification.
Qiqiang Xu, Ji Zhang, Ting Yu, Wenbin Zhang, Mingli Zhang, Yonglong Luo and Fulong Chen
Proceedings of the 32nd International Conference on Databases and Expert Systems Applications (DEXA), online, 2021. (Acceptance rate: 67/248=27%)
Text classification is a fundamental task that is widely used in various sub-domains of natural language processing, such as information extraction, semantic understanding, etc. For general text classification problems, various deep learning models, such as Bi-LSTM, Transformer, and BERT, have been used and have achieved good performance. In this paper, however, we consider a new problem: how to deal with a special scenario in text classification in which there is a weak sequential relationship among different classification entities. A typical example is the block classification of resumes, where sequential relationships exist among the different blocks. To fully utilize this sequential feature, we propose an effective hybrid model that combines a fully connected neural network model and a block-level recurrent neural network model with feature fusion. Experimental results show that the average F1-score of our model on three real resume datasets of 1,400 resumes is 5.5-11% higher than existing mainstream algorithms.
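A minimal PyTorch sketch of the hybrid idea, with a fully connected encoder per block followed by a block-level BiLSTM so each block's label can use its neighbours' context; all dimensions are assumptions:

```python
# Sketch: per-block FC encoder + block-level BiLSTM for block classification.
import torch
import torch.nn as nn

class BlockSequenceClassifier(nn.Module):
    def __init__(self, d_block=300, d_hidden=128, n_classes=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(d_block, d_hidden), nn.ReLU())
        self.rnn = nn.LSTM(d_hidden, d_hidden, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, blocks):                 # (batch, n_blocks, d_block)
        h = self.fc(blocks)                    # per-block features
        ctx, _ = self.rnn(h)                   # block-level sequential context
        return self.head(ctx)                  # one label per block

logits = BlockSequenceClassifier()(torch.randn(2, 6, 300))  # (2, 6, 8)
```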
Xuejiao Tang, Xin Huang, Wenbin Zhang, Travers B. Child, Qiong Hu, Zhen Liu and Ji Zhang
Proceedings of the 23rd International Conference on Big Data Analytics and Knowledge Discovery (DaWaK), online, 2021. (Acceptance rate: 12/71=16%)
Visual Commonsense Reasoning (VCR) predicts an answer with a corresponding rationale, given a question-image input. VCR is a recently introduced visual scene understanding task with a wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task generally rely on pre-training or on exploiting memory with models that encode long dependency relationships. However, these approaches suffer from a lack of generalizability and prior knowledge. In this paper we propose a dynamic working memory based cognitive VCR network, which stores accumulated commonsense between sentences to provide prior knowledge for inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides intuitive interpretation of visual commonsense reasoning. A Python implementation of our mechanism is publicly available at https://github.com/tanjatang/DMVCR
Wenbin Zhang, Mingli Zhang, Ji Zhang, Zhen Liu, Zhiyuan Chen, Jianwu Wang, Edward Raff and Enza Messina
Proceedings of the 32nd International Conference on Tools with Artificial Intelligence (ICTAI), online, 2020. (Acceptance rate: 111/445=25%)
Artificial intelligence (AI)-based decision-making systems are employed nowadays in an ever-growing number of online as well as offline services, some of great importance. Depending on sophisticated learning algorithms and available data, these systems are increasingly becoming automated and data-driven. However, these systems can impact individuals and communities with ethical or legal consequences. Numerous approaches have therefore been proposed to develop decision-making systems that are discrimination-conscious by design. However, these methods assume the underlying data distribution is stationary without drift, an assumption that does not hold in many real-world applications. In addition, their focus has been largely on minimizing discrimination while maximizing prediction performance, without the necessary flexibility to customize the trade-off for different applications. To this end, we propose a learning algorithm for fair classification that also adapts to evolving data streams and further allows flexible control over the degree of accuracy and fairness. The positive results on a set of discriminated and non-stationary data streams demonstrate the effectiveness and flexibility of this approach.
Liming Zhang, Wenbin Zhang and Nathalie Japkowicz
Proceedings of the 25th International Conference on Pattern Recognition (ICPR), online, 2020. (Acceptance rate: 1263/3250=38.8%)
Recognizing human activities from multi-channel time series data collected from wearable sensors is ever more practical. However, in real-world conditions, coherent activities and body movements can happen at the same time, such as moving the head while walking or sitting. This new problem, which we call "Coherent Human Activity Recognition (Co-HAR)", is more complicated than normal multi-class classification tasks since signals of different movements are mixed and interfere with each other. We instead treat Co-HAR as a dense labelling problem that classifies each sample at each time step with a label, providing high-fidelity and duration-varied support to applications. In this paper, a novel condition-aware deep architecture, "Conditional-UNet", is developed to allow dense labelling for the Co-HAR problem. We also contribute a first-of-its-kind Co-HAR dataset for head movement recognition under walking or sitting conditions for future research. Experiments on head gesture recognition show that our model achieves an overall 2%-3% F1-score gain over existing state-of-the-art deep methods and, more importantly, systematic and comprehensive improvements on real head gesture classes.
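To illustrate dense labelling on sensor streams, here is a small 1D encoder-decoder that emits one label per time step; it is a stand-in for the paper's Conditional-UNet, with channel counts and depth as assumptions:

```python
# Sketch: dense per-time-step labelling of a multi-channel sensor window.
import torch
import torch.nn as nn

class DenseLabeler1D(nn.Module):
    def __init__(self, n_channels=9, n_classes=5):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv1d(n_channels, 32, 5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, 5, padding=2), nn.ReLU())
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, n_classes, 1))       # per-step class scores

    def forward(self, x):                      # x: (batch, channels, time)
        return self.up(self.down(x))           # (batch, classes, time)

out = DenseLabeler1D()(torch.randn(4, 9, 128))  # one prediction per time step
```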
Wenbin Zhang and Albert Bifet
Proceedings of the 23rd International Conference on Discovery Science (DS), online, 2020. (Acceptance rate: 26/76=34.2%)
Fairness-aware learning is increasingly important in socially-sensitive applications for the sake of achieving optimal and non-discriminative decision-making. Most of the proposed fairness-aware learning algorithms process the data in offline settings and assume that the data are generated by a single concept without drift. Unfortunately, in many real-world applications, data are generated in a streaming fashion and can only be scanned once. In addition, the underlying generation process might also change over time. In this paper, we propose and illustrate an efficient algorithm for mining fair decision trees from discriminatory and continuously evolving data streams. This algorithm, called FEAT (Fairness-Enhancing and concept-Adapting Tree), is based on using a change detector to learn adaptively from non-stationary data streams while also accounting for fairness. We study FEAT’s properties and demonstrate its utility through experiments on a set of discriminated and time-changing data streams.
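A simplified sketch of the detector-plus-fair-learner pattern, using an accuracy-drop check and instance reweighing in place of FEAT's actual tree and change detector:

```python
# Sketch: drift-aware fair tree via a simple error-rate check and reweighing.
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class DriftAwareFairTree:
    def __init__(self, window=200):
        self.buf = deque(maxlen=window)        # recent (x, y, s) triples
        self.errs = deque(maxlen=window)
        self.tree = None

    def _fit(self):
        X, y, s = map(np.array, zip(*self.buf))
        # Reweigh so each (group, label) cell counts equally, one classic
        # pre-processing route to fairness.
        w = np.ones(len(y))
        for g in (0, 1):
            for c in (0, 1):
                m = (s == g) & (y == c)
                if m.any():
                    w[m] = len(y) / (4 * m.sum())
        self.tree = DecisionTreeClassifier(max_depth=5).fit(X, y, sample_weight=w)

    def update(self, x, y, s):
        if self.tree is not None:
            self.errs.append(int(self.tree.predict([x])[0] != y))
        self.buf.append((x, y, s))
        drifted = len(self.errs) == self.errs.maxlen and np.mean(self.errs) > 0.35
        if (self.tree is None and len(self.buf) == self.buf.maxlen) or drifted:
            self._fit()                        # (re)learn on the current window
            self.errs.clear()

model = DriftAwareFairTree()
rng = np.random.default_rng(0)
for t in range(1000):                          # simulated drifting stream
    x = rng.normal(size=3)
    y = int(x[0] + (0.5 if t > 500 else 0.0) > 0)   # concept drift at t=500
    model.update(x, y, int(rng.random() < 0.5))
```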
Mingli Zhang, Xin Zhao, Wenbin Zhang, Ahmad Chaddad, Jean-Baptiste Poline and Alan Evans
Proceedings of the 31st International Conference on Databases and Expert Systems Applications (DEXA), online, 2020. (Acceptance rate: 38/190=20%)
Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder characterized by deficiencies in social, communication and repetitive behaviors. We propose imaging-based ASD biomarkers to find the neural patterns related to ASD, with identifying ASD as the primary goal. The secondary goal is to investigate the impact of imaging patterns for ASD. In this paper, we model and explore the identification of ASD by learning a representation of the T1 MRI and fMRI through fusing a discriminative learning (DL) approach with a deep convolutional neural network. Specifically, a class-wise analysis dictionary generates non-negative low-rank encoding coefficients from the multi-modal data, and an orthogonal synthesis dictionary reconstructs the data. We then feed the reconstructed data, together with the original multi-modal data, as input to the deep learning model. Finally, the learned priors from both models are returned to the fusion framework to perform classification. The effectiveness of the proposed approach was tested on a world-wide database of 1127 subjects across 34 sites, and experiments show competitive results for the proposed approach. Furthermore, we were able to capture the status of brain neural patterns with the known input of the same modality.
Wenbin Zhang and Eirini Ntoutsi
Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Macau, China, 2019. (Acceptance rate: 850/4752=17.9%)
Automated data-driven decision-making systems are ubiquitous across a wide range of online as well as offline services. These systems depend on sophisticated learning algorithms and available data to optimize the service function for decision support assistance. However, there is a growing concern about the accountability and fairness of the employed models, since the available historic data is often intrinsically discriminatory, i.e., the proportion of members sharing one or more sensitive attributes is higher than the proportion in the population as a whole when receiving positive classification, which leads to a lack of fairness in decision support systems. A number of fairness-aware learning methods have been proposed to handle this concern. However, these methods tackle fairness as a static problem and do not take the evolution of the underlying stream population into consideration. In this paper, we introduce a learning mechanism to design a fair classifier for online stream-based decision-making. Our learning model, FAHT (Fairness-Aware Hoeffding Tree), is an extension of the well-known Hoeffding Tree algorithm for decision tree induction over streams that also accounts for fairness. Our experiments show that our algorithm is able to deal with discrimination in streaming environments while maintaining moderate predictive performance over the stream.
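One way to fold fairness into a tree's splitting criterion is to combine information gain with a "fairness gain" (the reduction in statistical parity discrepancy after the split); the product used below is a simple illustrative choice, not necessarily FAHT's exact combination:

```python
# Sketch: a fairness-aware split score for decision tree induction.
import numpy as np

def entropy(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def disc(y, s):
    """Statistical parity discrepancy between the two sensitive groups."""
    rates = [y[s == g].mean() if (s == g).any() else 0.0 for g in (0, 1)]
    return abs(rates[0] - rates[1])

def fair_gain(y, s, left):
    """`left` is a boolean mask of instances routed to the left child."""
    frac = left.mean()
    info = entropy(y) - frac * entropy(y[left]) - (1 - frac) * entropy(y[~left])
    fair = disc(y, s) - (frac * disc(y[left], s[left])
                         + (1 - frac) * disc(y[~left], s[~left]))
    return info * max(fair, 0.0)   # reward splits that also reduce discrepancy

y = np.array([1, 0, 1, 1, 0, 0]); s = np.array([0, 0, 1, 1, 0, 1])
left = np.array([True, True, True, False, False, False])
print(fair_gain(y, s, left))
```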
Wenbin Zhang, Xuejiao Tang and Jianwu Wang
IEEE International Conference on Data Mining (ICDM), PhD Forum Track, Beijing, China, 2019
Algorithmic data-driven decision-making systems are becoming increasingly automated and have enjoyed tremendous success in a variety of application domains. More recently, these systems are increasingly being used to render all sorts of socially-sensitive decisions. Yet, these automated decisions can lead, even in the absence of intent, to a lack of fairness in the sense that members sharing one or more sensitive attributes are being treated unequally. In this paper, we handle unfairness in both online and offline settings. We introduce an algorithm-agnostic learning mechanism for optimal and non-discriminative decision-making as appropriate. This translates to a fairness-aware learning schema which can be immediately applied to most existing algorithms and to general decision-making tasks in dynamic settings with joint data distribution changes over time.
Wenbin Zhang, Jianwu Wang, Daeho Jin, Lazaros Oreopoulos and Zhibo Zhang
IEEE International Conference on Big Data (BigData), Seattle, USA, 2018
A self-organizing map (SOM) is a type of competitive artificial neural network that projects the high-dimensional input space of the training samples into a low-dimensional space with the topology relations preserved. This makes SOMs well suited to organizing and visualizing complex data sets, and they have been pervasively used across numerous disciplines in different applications. Notwithstanding these wide applications, the self-organizing map is hampered by its inherent randomness, which produces dissimilar SOM patterns even when trained on identical training samples with the same parameters every time, causing usability concerns for other domain practitioners and precluding more potential users from exploring SOM-based applications in a broader spectrum. Motivated by this practical concern, we propose a deterministic approach as a supplement to the standard self-organizing map. In accordance with the theoretical design, experimental results with satellite cloud data demonstrate the effective and efficient organization as well as the simplification capabilities of the proposed approach.
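One route to determinism, sketched below, is to remove both sources of randomness: initialize the weight grid from the data's leading principal components and present samples in a fixed order. This illustrates the goal rather than the paper's specific construction:

```python
# Sketch: a SOM made deterministic via PCA-based grid initialization and a
# fixed sample presentation order.
import numpy as np

def deterministic_som(X, rows=5, cols=5, epochs=10, lr=0.5):
    # PCA-based grid initialization (deterministic given X).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    r, c = np.meshgrid(np.linspace(-1, 1, rows), np.linspace(-1, 1, cols),
                       indexing="ij")
    W = X.mean(0) + r[..., None] * Vt[0] + c[..., None] * Vt[1]
    grid = np.stack(np.meshgrid(range(rows), range(cols), indexing="ij"), -1)
    for e in range(epochs):
        sigma = max(rows, cols) / 2 * (1 - e / epochs) + 0.5
        for x in X:                                    # fixed sample order
            bmu = np.unravel_index(((W - x) ** 2).sum(-1).argmin(), (rows, cols))
            h = np.exp(-((grid - np.array(bmu)) ** 2).sum(-1) / (2 * sigma ** 2))
            W += lr * (1 - e / epochs) * h[..., None] * (x - W)
    return W                                           # (rows, cols, n_features)

W = deterministic_som(np.random.RandomState(0).rand(200, 8))  # same W every run
```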
Antonio Candelieri, Wenbin Zhang, Enza Messina and Francesco Archetti
IEEE International Conference on Big Data (BigData), Poster Track, Seattle, USA, 2018
This work stems from the Italian project H-CIM (Health-Care Intelligent Monitoring), aimed at developing a home-monitoring system based on wearable sensor data streams to support the self-rehabilitation of elderly outpatients. Unlike pervasive data stream applications, which are always accompanied by the evolution of unstable class concepts, this project requires that stable standard and personalized rehabilitation exercise patterns be provided to assess an outpatient's self-therapy progress at home. In the designed pipeline, the representative sequences of personal standard rehabilitation exercises in wearable sensor streams are therefore first benchmarked; then an assessment system integrating multistage data processing and analysis is proposed to enable elders to manage their own rehabilitation progress properly. The system proved to be an effective tool for supporting compliance monitoring and personalized self-rehabilitation; it is currently under further development within the Italian project Home-IoT, with the aim of becoming a more general data stream analytics service not devoted solely to rehabilitation exercise assessment.
Wenbin Zhang and Jianwu Wang
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 2018
Recommender systems seek to assist and augment the natural social process of making choices without sufficient personal experience of the alternatives. They have become fundamental applications in electronic commerce and information access, assisting users to effectively pinpoint information of interest to them from large catalog spaces. Contrary to the pervasive utilization of recommender systems in domains such as electronic commerce, their application in the medical domain is limited and further effort is needed. In addition, while a variety of approaches have been proposed for performing recommendation, including collaborative filtering, demographic recommenders and other techniques, each individual method has its own drawbacks. This paper proposes a medical-oriented recommendation system in which a patient's background data is used to bootstrap the collaborative filtering engine and personalized suggestions are provided therein. We present empirical results that show how the content-bootstrapped part of the system enhances the effectiveness of the collaborative filtering's medical article recommendations.
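A toy sketch of content bootstrapping: pseudo-ratings for a sparsely rated patient are filled in from patients with similar background profiles before ordinary ranking (all data below are synthetic):

```python
# Sketch: bootstrap a sparse rating row from content (background) similarity,
# then rank items for the user.
import numpy as np

def bootstrap_then_recommend(ratings, profiles, user, k=2):
    # ratings: (users, items) with np.nan for unrated; profiles: (users, d).
    sim = profiles @ profiles[user] / (
        np.linalg.norm(profiles, axis=1) * np.linalg.norm(profiles[user]) + 1e-9)
    sim[user] = -np.inf
    peers = np.argsort(sim)[-k:]                    # most similar backgrounds
    row = ratings[user].copy()
    missing = np.isnan(row)
    row[missing] = np.nanmean(ratings[peers][:, missing], axis=0)  # pseudo-ratings
    return np.argsort(row)[::-1]                    # items ranked for the user

ratings = np.array([[5, np.nan, 1, np.nan],
                    [4, 2, np.nan, 5],
                    [np.nan, 1, 2, 4.0]])
profiles = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0.0]])
print(bootstrap_then_recommend(ratings, profiles, user=0))   # [0 3 1 2]
```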
Wenbin Zhang and Jianwu Wang
IEEE International Congress on Big Data (BigData Congress), Honolulu, USA, 2017
The pervasive imbalanced class distributions occurring in real-world stream applications, such as surveillance, security and finance, in which data arrive continuously, have sparked extensive interest in the study of imbalanced stream classification. In such applications, the evolution of unstable class concepts is always accompanied and complicated by the skewed class distribution. However, most existing methods focus on either the class imbalance problem or the non-stationary learning problem; the combined approach of addressing both issues has received relatively little research attention. In this paper, we propose a hybrid framework for imbalanced stream learning that consists of three components: classifier updating, resampling and cost-sensitive classification. Based on the framework, we propose a hybrid learning algorithm that combines data-level and algorithm-level methods as well as classifier retraining mechanisms to tackle class imbalance in data streams. Our experiments using real-world and synthetic datasets show that the proposed hybrid learning algorithm achieves better effectiveness and efficiency.
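A minimal sketch of combining these components on a simulated stream, with random oversampling at the data level and per-sample weights at the algorithm level; the paper's exact scheme is not reproduced:

```python
# Sketch: per-window hybrid handling of class imbalance in a stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")
rng = np.random.default_rng(0)

def process_window(X, y):
    # Data level: randomly oversample the minority class inside the window.
    minority = int(y.mean() < 0.5)
    idx = np.where(y == minority)[0]
    if len(idx) == 0:
        return                                       # nothing to balance on
    extra = rng.choice(idx, size=max(0, (y != minority).sum() - len(idx)))
    Xb, yb = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
    # Algorithm level: residual cost sensitivity via sample weights,
    # while partial_fit keeps the classifier updated over the stream.
    w = np.where(yb == minority, 2.0, 1.0)
    clf.partial_fit(Xb, yb, classes=[0, 1], sample_weight=w)

for _ in range(10):                                  # simulated stream of windows
    X = rng.normal(size=(100, 5))
    y = (rng.random(100) < 0.1).astype(int)          # ~10% minority class
    process_window(X, y)
```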
2021
ICDM Best Paper Award Candidate, SIAM Early Career, NeurIPS, AISTATS and IEEE BigData Travel Awards
2020
AAAI Travel Award
2019
ACM SIGAI, C-Fair Youth Forum and ICDM Travel Awards