This study was done with my classmate Haike Xu as the final project of the course Mathematical Model of Smart City in the summer of 2019. I transcribed our poster and slides into this blog article, hoping to inspire more people who are interested in this topic.
Introduction
Background
In the real world, a message is often transmitted indirectly, passing through a complex network of base stations. However, the network may contain noise that disturbs the message and makes it inaccurate when it reaches its destination.
Loss of message accuracy during transmission may reduce communication efficiency or even block certain communication channels. Traditional remedies include using multiple channels and error-correcting codes, but all of them have their limits.
In our research, we present a simple method that tackles this problem by changing only the message recovery mechanism. Our method does not require any modification of the existing communication system; only the final user needs to change its message recovery mechanism, which makes it very practical to deploy.
Message Recovery Mechanism
Today's communication networks already contain multiple channels, so a base station will receive several copies of the same message from different paths. The most straightforward way to recover the original message is to average all the copies, so we assume that every base station in the network does so. Besides the average, there are many alternatives: the median, a weighted average, only the most reliable copy, and so on. At the same time, is the most unreliable copy totally useless in our case?
Model
Assumptions
- The original message is a single real-valued parameter
- Each station calculates the average of the data it gathers, adds its own noise, and sends the sum to the stations downstream
- The noise each station adds to the message follows a normal distribution
Problem statement
Given a directed acyclic graph \(G\) with \(n\) nodes, one source \(S\), and one sink \(T\), a single parameter \(y\) is transmitted through the network from the source to the sink. Each node \(i\) other than the source and the sink calculates the average of the data it gathers from its parents, adds its noise \(\epsilon_i\), which follows \(N(0,\sigma_i^2)\), to the average, and sends the sum \(a_i\) to its successors. Now, given the data \(x_1, x_2, \ldots, x_k\) that the sink receives from its parents, we want to estimate the original parameter \(y\) sent by the source.
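To make the setting concrete, here is a minimal simulation sketch of the model. The toy graph, node names, and noise levels below are my own illustration, not part of the original project.

```python
import random

# A toy DAG (my own example): S -> A, S -> B, A -> B, A -> T, B -> T.
# Every intermediate node averages the values it receives from its parents,
# adds independent N(0, sigma^2) noise, and forwards the sum.  The sink T
# only collects the values x_1, ..., x_k sent by its parents.

parents = {"A": ["S"], "B": ["S", "A"], "T": ["A", "B"]}   # node -> its parents
sigma = {"A": 0.5, "B": 1.0}                               # per-node noise std

def propagate(y):
    """Send the scalar y from S through the DAG and return what the sink T receives."""
    value = {"S": y}
    for node in ("A", "B"):                                # topological order
        avg = sum(value[p] for p in parents[node]) / len(parents[node])
        value[node] = avg + random.gauss(0.0, sigma[node])
    return [value[p] for p in parents["T"]]                # x_1, x_2 at the sink

print(propagate(3.0))   # two noisy copies of y = 3.0, e.g. [3.41, 2.87]
```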
Problem Solving
Analysis
- Some observations
- Each number \(x_i\) received by the sink from its parents is determined by the noises \(\epsilon_1, \ldots, \epsilon_n\); it is therefore a realization of a random variable, denoted \(X_i\).
- Our goal is to construct an estimator of \(y\), i.e., a function of \(X_1, X_2, \ldots, X_k\).
- Conventional methods
- The message median, the message average, or using only the node with minimal variance.
- But these methods do not give accurate results in some cases.
- Metrics for solutions
- Most of these estimators are unbiased, so we mainly compare their variances.
- We also consider mean square error in experiments.
- Intuition of our method
- Assign a weight to each of the \(k\) nodes.
- If a random variable has high variance, it should be given a small weight.
- If a random variable has a propagation path similar to the others', it should be given a small weight.
Weighted average estimator
Define \(\hat{Y} = t_1 X_1 + t_2 X_2 + \cdots + t_k X_k\), where \(t_1, \ldots, t_k\) are the weights and their sum is \(1\).
The noise in each random variable \(X_i\) is a linear combination of the individual noises \(\epsilon_1, \ldots, \epsilon_n\). Write \(X_i = y + \sum_{j=1}^n{b_{i,j} \epsilon_j}\).
- \(b_{i,j}\) is the coefficient of \(\epsilon_j\) in \(X_i\) .
- An algorithm to calculate \(b_{i,j}\)
- \(b_{i,j} = \sum_{k\text{ is }j\text{'s successor}}{\frac{b_{i,k}}{deg_{k}}}\), with base case \(b_{i,i} = 1\), where \(deg_k\) is the in-degree of node \(k\).
- \(b_{i,j}\) can be calculated by dynamic programming along a reverse topological order, with time complexity \(O(nk)\) (see the sketch below).
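The recurrence translates into a short dynamic program. The sketch below reuses the toy DAG from the simulation above; the node names and the hard-coded reverse topological order are my own illustration.

```python
import numpy as np

# Reverse-topological DP for the coefficients b_{i,j} of eps_j in X_i,
# on the same toy DAG as above.

parents = {"A": ["S"], "B": ["S", "A"], "T": ["A", "B"]}
children = {"S": ["A", "B"], "A": ["B", "T"], "B": ["T"]}
indeg = {node: len(ps) for node, ps in parents.items()}    # in-degree deg_k
sink_parents = parents["T"]                                # the k nodes feeding the sink
noisy_nodes = ["A", "B"]                                   # intermediate nodes that add noise
rev_topo = ["B", "A", "S"]                                 # reverse topological order

def coefficients(i):
    """b_{i,j} for a fixed sink parent i:
    b_{i,j} = sum over j's successors s of b_{i,s} / deg_s, with b_{i,i} = 1."""
    b = {i: 1.0}
    for j in rev_topo:
        if j == i:
            continue
        # nodes with no computed coefficient (e.g. the sink T) contribute 0
        b[j] = sum(b.get(s, 0.0) / indeg[s] for s in children.get(j, []))
    return b

B = np.array([[coefficients(i).get(j, 0.0) for j in noisy_nodes] for i in sink_parents])
print(B)   # row 0: coefficients of (eps_A, eps_B) in X_A; row 1: in X_B
```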
Optimizing weighted average estimator
- minimize \[ \begin{aligned} \operatorname{Var}(\hat{Y}) &= \left(t_{1} b_{1,1}+t_{2} b_{2,1}+\cdots+t_{k} b_{k, 1}\right)^{2}\sigma_1^2 \\ &+\left(t_{1} b_{1,2}+t_{2} b_{2,2}+\cdots+t_{k} b_{k, 2}\right)^{2}\sigma_2^2 \\ &+\cdots+\left(t_{1} b_{1, n}+t_{2} b_{2, n}+\cdots+t_{k} b_{k, n}\right)^{2}\sigma_n^2 \end{aligned} \] subject to \[ t_{1}+t_{2}+\cdots+t_{k}=1 \]
By the method of Lagrange multipliers, \(t_1, \ldots, t_k\) can be solved from the following equations, \[ \left\{\begin{array}{c}{C_{1,1} t_{1}+C_{1,2} t_{2}+\cdots+C_{1, k} t_{k}+\lambda=0} \\ {C_{2,1} t_{1}+C_{2,2} t_{2}+\cdots+C_{2, k} t_{k}+\lambda=0} \\ {\vdots} \\ {C_{k, 1} t_{1}+C_{k, 2} t_{2}+\cdots+C_{k, k} t_{k}+\lambda=0} \\ {t_{1}+t_{2}+\cdots+t_{k}=1}\end{array}\right. \] where \(C_{i, j}=\sum_{l=1}^{n} b_{i, l} b_{j, l} \sigma_l^2 =\operatorname{Cov}\left(X_{i}, X_{j}\right)\).
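In matrix form, \(C = B \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2) B^{\mathrm T}\), and the Lagrange system can be solved numerically. A minimal sketch, continuing the toy example (the numbers are my own illustration):

```python
import numpy as np

# Optimal weights t from the covariance matrix C = B diag(sigma^2) B^T,
# continuing the toy DAG above.

B = np.array([[1.0, 0.0],     # X_A = y + 1.0*eps_A + 0.0*eps_B
              [0.5, 1.0]])    # X_B = y + 0.5*eps_A + 1.0*eps_B
var = np.array([0.5**2, 1.0**2])          # sigma_A^2, sigma_B^2
C = B @ np.diag(var) @ B.T                # Cov(X_i, X_j)

# Solve the Lagrange system  C t + lambda * 1 = 0,  sum(t) = 1.
k = C.shape[0]
M = np.block([[C, np.ones((k, 1))], [np.ones((1, k)), np.zeros((1, 1))]])
rhs = np.concatenate([np.zeros(k), [1.0]])
t = np.linalg.solve(M, rhs)[:k]

# Equivalent closed form: t = C^{-1} 1 / (1^T C^{-1} 1).
t_closed = np.linalg.solve(C, np.ones(k))
t_closed /= t_closed.sum()

print(t, np.allclose(t, t_closed))        # optimal weights for X_A and X_B
print(t @ C @ t)                          # variance of the weighted estimator
```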
Proof of estimator's optimality
- By construction, our estimator has the least variance among all unbiased linear combinations of \(X_1, X_2, \ldots, X_k\). Now we prove that it is optimal among all unbiased estimators.
Theorem (Multivariate normal distribution) A random vector \(\mathbf{X}=(X_1,\ldots,X_k)\) has a multivariate normal distribution if every linear combination \(Y=a_{1}X_{1}+ \cdots +a_{k}X_{k}\) of its components is normally distributed. That is, for any constant vector \(\mathbf{a} \in \mathbb{R}^{k}\), the random variable \(Y=\mathbf{a}^{\mathrm{T} }\mathbf{X}\) has a univariate normal distribution, where a univariate normal distribution with zero variance is a point mass on its mean.
- The joint distribution of \(X_1, ..., X_k\) is \[
f(x_{1}, \cdots, x_{k})=\frac{1}{\sqrt{(2 \pi)^{k}|C|}} e^{-\frac{1}{2}(X-y) C^{-1}(X-y)^{T}}
\]
- \(X-y\) is the row vector \(\left(X_{1}-y, X_{2}-y, \ldots, X_{k}-y\right)\).
- \(C_{i, j}\) is the covariance of \(X_{i}\) and \(X_{j}\) .
- \(C^{-1}\) is the inverse matrix of covariance matrix \(C\) .
Theorem (Cramer-Rao Inequality) Let \((X_1, ... ,X_n)\) be a sample with pdf \(f(X|\theta)\), and let \(W(X) = W(X_1, ..., X_n)\) be any unbiased estimator satisfying certain regularity conditions. Then \[ \operatorname{Var}(W(X)) \geq \frac{1}{I(\theta)} = \frac{1}{E\left((\frac{\partial}{\partial \theta} \log f(X|\theta))^2\right)} . \]
- By the Cramer-Rao inequality, with \(\theta = y\) and \(f\) the joint density above, any unbiased estimator \(Y\) satisfies \[
\operatorname{Var}(Y) \geq \frac{1}{I(\theta)}=\frac{1}{E\left(\left(\frac{\partial}{\partial \theta} \log f(X | \theta)\right)^{2}\right)}=\frac{1}{\sum_{i=1}^{k} \sum_{j=1}^{k}\left(C^{-1}\right)_{i, j}}
\]
- We use the maximum likelihood estimator (MLE) as a bridge, which is \(\tilde{Y}=\sum_{i=1}^{k}\left(\frac{\left(C^{-1}\right)_{i}}{\left(C^{-1}\right)_{*}}\right) X_{i}\).
- For convenience, we denote \(\left(C^{-1}\right)_{i}=\sum_{j=1}^{k}\left(C^{-1}\right)_{i, j}\) and \(\left(C^{-1}\right)_{*}=\sum_{i=1}^{k} \sum_{j=1}^{k}\left(C^{-1}\right)_{i, j}\)
- \(\operatorname{Var}(\tilde{Y})=\frac{1}{\left(C^{-1}\right)_{*}}\) reaches the Cramer-Rao bound.
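To see why \(\operatorname{Var}(\tilde{Y})\) attains the bound (a short verification step I am adding; it was omitted from the slides): writing \(\mathbf{1}\) for the all-ones vector, the MLE weights are \(t = C^{-1}\mathbf{1} / (\mathbf{1}^{\mathrm T} C^{-1}\mathbf{1})\), so
\[
\operatorname{Var}(\tilde{Y})
= t^{\mathrm T} C\, t
= \frac{\mathbf{1}^{\mathrm T} C^{-1} C\, C^{-1}\mathbf{1}}{\left(\mathbf{1}^{\mathrm T} C^{-1}\mathbf{1}\right)^{2}}
= \frac{1}{\mathbf{1}^{\mathrm T} C^{-1}\mathbf{1}}
= \frac{1}{\left(C^{-1}\right)_{*}} .
\]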
Because the MLE is itself a linear combination of the \(X_i\) with weights summing to one, our estimator has a variance no larger than the MLE's, which equals the Cramer-Rao bound; since no unbiased estimator can go below that bound, our estimator attains it and is therefore an optimal unbiased estimator as well.
An Intuitive Observation for Our Result
By the previous proof, our optimal weight is \(t_i=\frac{(C^{-1})_i}{(C^{-1})_*}\). The meaning of each element of \(C^{-1}\) is: \[ (C^{-1})_{ij}= \begin{cases} \dfrac{\tilde{C}_{ii}}{|C|}, & i=j\\[6pt] -\sqrt{(C^{-1})_{ii}(C^{-1})_{jj}}\,\rho_{x_i,x_j \cdot V\setminus\{x_i,x_j\}}, & i\neq j \end{cases} \] Here \(\tilde{C}_{ii}\) denotes the cofactor of \(C_{ii}\) and \(\rho_{x_i,x_j\cdot V\setminus\{x_i,x_j\}}\) denotes the partial correlation.
- \(\tilde{C}_{ii}\) is the generalized variance of the remaining variables, a one-dimensional measure of their multidimensional scatter. A larger \(\tilde{C}_{ii}\) means the other variables are unreliable, so it is positively related to \(t_i\).
- \(\rho_{x_i,x_j\cdot V\setminus\{x_i,x_j\}}\) is the partial correlation between the random variables \(x_i\) and \(x_j\). A higher value means \(x_i\) and \(x_j\) share similar propagation paths, so it is negatively related to \(t_i\).
- Usually all nodes receive a non-zero weight, revealing that every node is useful.
Experiment
In the first experiment, we chose a network and the nodes' noise distributions. We fixed the true value of \(y\) (the gray line in our poster figure), propagated the message many times, and plotted each trial's outputs under the different methods in one column, with a different color per method.
In the second experiment, we used a real communication network (http://socialnetworks.mpi-sws.org/data-wosn2009.html) and randomly chose 16 pairs of nodes in it to test our methods' mean square error.
We found that the contrast between these methods varies greatly across networks, so we also tested them on asymmetric graphs, where we expected our method's advantage to be most visible.
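A small Monte Carlo loop reproduces this kind of comparison. The sketch below uses the toy DAG from earlier, not the poster's network or data, and reports the mean square error of the average, the median, the single most reliable copy, and the optimal weighted estimator:

```python
import numpy as np

# Monte Carlo comparison on the toy DAG above (my own illustrative setup):
# X_A = y + eps_A,  X_B = y + eps_A/2 + eps_B,  eps_A ~ N(0, 0.5^2), eps_B ~ N(0, 1^2).

rng = np.random.default_rng(0)
y, trials = 3.0, 100_000
B = np.array([[1.0, 0.0], [0.5, 1.0]])
sigma = np.array([0.5, 1.0])
C = B @ np.diag(sigma**2) @ B.T
t = np.linalg.solve(C, np.ones(2)); t /= t.sum()           # optimal weights

eps = rng.normal(0.0, sigma, size=(trials, 2))
X = y + eps @ B.T                                          # one row (X_A, X_B) per trial

estimates = {
    "average":        X.mean(axis=1),
    "median":         np.median(X, axis=1),                # equals the average when k = 2
    "best single":    X[:, np.argmin(np.diag(C))],         # copy with the smallest variance
    "optimal weight": X @ t,
}
for name, est in estimates.items():
    print(f"{name:15s} MSE = {np.mean((est - y) ** 2):.4f}")
```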
Extension
We present some extensions of our research problem:

- Replace the noise's normal distribution with other distributions.
  - Requirement: independent noises whose individual contributions can still be determined.
  - The optimality proof does not carry over to other distributions.
- Unknown variances of the noise distributions.
  - Assume the same \(\sigma_i\) for all nodes and apply our method; this sometimes performs better than the ordinary methods.
  - Alternatively, estimate each node's variance first and then apply our method; this did not perform well in our experiments.
- Multiple transmissions through the same network: the covariance matrix can be estimated directly from the data (see the sketch after this list).
- Design different algorithms for different network structures: our experimental results vary greatly across networks, so it is reasonable to apply different methods to different networks.
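For the multiple-transmissions extension, here is a minimal sketch (my own illustration, again on the toy DAG) of estimating the covariance matrix empirically and plugging it into the weight formula, assuming the same parameter is transmitted repeatedly:

```python
import numpy as np

# Estimate Cov(X_i, X_j) from repeated transmissions of the same parameter,
# then plug the estimate into t = C^{-1} 1 / (1^T C^{-1} 1).

rng = np.random.default_rng(1)
B = np.array([[1.0, 0.0], [0.5, 1.0]])                    # toy coefficients from above
sigma = np.array([0.5, 1.0])
X = 3.0 + rng.normal(0.0, sigma, size=(5000, 2)) @ B.T    # 5000 transmissions of y = 3

C_hat = np.cov(X, rowvar=False)                           # empirical covariance matrix
t_hat = np.linalg.solve(C_hat, np.ones(2))
t_hat /= t_hat.sum()                                      # plug-in weights, close to the exact ones
print(np.round(t_hat, 3))
```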
Discussion
- In general, we should utilize every message we collect; even the most unreliable copy is helpful. Paths with large variance can still serve as alternative sources that dilute the weights, since the variance of our estimator grows with the square of each weight.
- Messages from rarely used paths are more valuable than those from commonly shared paths.