17 Sep 2020

Discussion meeting

This is a summary of the group meeting on 17 September 2020

Depeng：On privacy and accuracy in data releases, Annabelle McIver （CONCUR 20’）

This is an invited talk given by Annabelle in QONFEST 2020. It focuses on using quantification information flow (QIF) to analyze privacy and accuracy after differential privacy channels. Privacy is the secret part of the input that needs to be protected while accuracy is the useful output part and they correspond to unintended/intended inference respectively. QIF is a technique based on Bayesian inference, where vulnerability is one typical definition that analyzes the probability of correct inference (unintended/intended) in one try. The presenter analyzes two kinds of differential privacy mechanisms: local mechanisms add noise to each individual input and oblivious mechanisms add noise to the final outputs. She also points out there is no free lunch or desserts in the sense that an optimal level in terms of neither privacy nor accuracy can be achieved when there is correlation in the data. A case study of America criminal justice system is shown, where outputs have correlation with inputs. It shows that the accurate inference rates of useful outputs and secret inputs increase in a synchronous manner as the decrease of noise in both privacy mechanisms.

Model local/oblivious mechanisms of random response in our HMM model: It’s easy to model both kinds of mechanisms, where the sizes of models differ as the increase of inputs (linearly or exponentially). The local one has been analyzed in our DP paper.
Describe the vulnerability using a logic and an algorithm for solving the logic: This is a tough one. Vulnerability can describe inference on anything, such as the correct inference of the 1st/ 2nd respondent/ output. Assume we build a logic to describe vulnerability and want to solve it in the HMM model. We know that the model is almost fixed in the sense that it doesn’t change according to different vulnerabilities we want to compute. The process of computing vulnerability is more than computing reachability. It has to factorize the probabilities into marginal and posteriors. In this step a connection between the states and the inference objects needs to be established. How to establish the above connection is unknown and it seems easier to compute vulnerability if we just follow the Bayesian framework without building the models and logic…
Modeling no free lunch/ deserts, then prove the theorems in the model? Maybe future work…

Weizhi：Program Synthesis Using Deduction-Guided Reinforcement Learning，Yanju Chen, Chenglong Wang, Osbert Bastani, Isil Dillig, and Yu Feng. (CAV 20′)

springer link to this paper

主要贡献：

这篇文章提出了一种新的基于reinforcement learning的program synthesis算法，给出一个off-line训练的初始policy，他们的方法使用policy来指导搜索，并通过利用deductive engine获得的反馈来改进它。他们的方法优势在于1、将deductive和statistical这两种方法的能力结合到一个统一框架中；2、它不仅利用deduction方法来修剪搜索空间，还可以指导搜索。他们实现了一个工具CONCORD，并和前人的synthesis工具进行了实验比较，相比state-of-art的synthesis工具NEO，他们可以多解决15%的benchmarks，同时可以提升8.71倍的搜索速率。

问题和讨论：

Program synthesis的意义：讨论中提到program synthesis这个方向研究的实际价值，我后来检索查阅了一下，要生成完整的大规模程序的确是不太现实且困难的，目前还处于非常早期的阶段（八皇后都写不出来），基本上用来合成DSL(程序专用语言)，比如SQL查数据库等，在业界有一些简单的应用例子：如Excel的FlashFill功能，表格第一列是姓名，想在表格第二列里放姓，那么只需要给出几个例子，按Ctrl+E, Excel就会自动综合程序并填充第二列的数据；另外还有一些IDE、或者VSCode插件集成的代码补全或者写注释补全代码等功能会用到program synthesis。
Specification：另外，讨论中也提到如何描述specification的问题，这的确是program synthesis的一个主要研究点，用户如何提供需求（specification）？本质上需要在不提供代码的前提下完整描述需要代码做的事情是有点自相矛盾的事情，于是如何让用户更容易地说明自己的需求，目前还没有很好的思路。现在一般有三种：1、给一个手写的烂程序，合成一个等价的优化后的程序；2、给出输入输出样例，合成一个满足样例的程序（本文是这种需求）；3、非常形式化的描述程序特征的说明文档。
讨论：本文将程序基于DSL合成的每一步看作一个MDP中的状态，通过梯度下降方法来求解这个MDP问题，这种将reinforcement learning结合进来的思路值得借鉴学习。