[2024-01-04] Protect Code Datasets against Unauthorized Training Usage with Imperceptible Watermark
Code datasets are of immense value for training neural code completion models, and companies and organizations have made substantial investments to establish and process them. Unfortunately, these datasets, whether built for proprietary or public use, face a high risk of unauthorized exploitation resulting from data leakages, license violations, and similar incidents. Even worse, the black-box nature of neural models sets a high barrier for external parties to audit their training datasets, which further enables such unauthorized usage. In this talk, I will share a framework and an imperceptible watermarking technique for protecting open-source code repositories and code datasets in general. The embedded watermark, together with an effective validation method, can trace whether a dataset has been used to train a neural code model.
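To make the general idea concrete, the sketch below illustrates one common flavor of dataset watermarking: a keyed hash deterministically selects a small fraction of samples, an inconspicuous trigger-target pattern is embedded into them, and validation checks whether a suspect model reproduces the target far more often than chance. All names (`SECRET_KEY`, `_cfg_v2`, the selection rate, the threshold) are hypothetical illustrations, not the specific technique presented in this talk.

```python
import hashlib

# Hypothetical owner secret; in practice this must stay private,
# since anyone holding it could locate the watermarked samples.
SECRET_KEY = b"owner-secret"

def select_for_watermark(sample: str, rate: float = 0.1) -> bool:
    """Deterministically pick roughly `rate` of samples via a keyed hash,
    so the owner can later re-identify which samples carry the mark."""
    digest = hashlib.sha256(SECRET_KEY + sample.encode("utf-8")).digest()
    return digest[0] / 255 < rate

def embed(sample: str) -> str:
    """Append an inconspicuous trigger-target pair (here: a rare but
    plausible-looking variable assignment) to a code sample."""
    return sample + "\n_cfg_v2 = None  # reserved"

def watermark_dataset(samples: list[str]) -> list[str]:
    """Return the dataset with the keyed subset of samples watermarked."""
    return [embed(s) if select_for_watermark(s) else s for s in samples]

def validate(model_completions: list[str], threshold: float = 0.5) -> bool:
    """Flag unauthorized training usage if a suspect model emits the
    target token on triggered prompts far more often than chance."""
    hits = sum("_cfg_v2" in c for c in model_completions)
    return hits / max(len(model_completions), 1) >= threshold
```

Because selection is keyed rather than random, the owner can reproduce exactly which samples were marked without storing any extra state; a real scheme would use triggers that survive tokenization and deduplication, and a statistical test rather than a fixed threshold.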
|Dr. Du Xiaoning is an Assistant Professor at the Faculty of Information Technology, Monash University. She received her Ph.D. from Nanyang Technological University in 2020 and her Bachelor's degree from Fudan University in 2014. Her research focuses on responsible code intelligence: she investigates problems that are critical for deploying and using code intelligence systems and tools in practice, including dataset quality, dataset copyright, model efficiency, robustness, and interpretability. Her research has been published in top-tier conferences and journals, including ICSE, ASE, FSE, NeurIPS, AAAI, S&P, USENIX Security, and TDSC. One of her works, which evaluated and improved the quality of code search datasets, was published at ICSE 2021 and nominated for the ACM SIGSOFT Distinguished Paper Award.