These reading notes record my thoughts on, and summaries of, the things I read. My website:

The Big Data problem

Big Data is the buzzword of the day. With the Internet extending into every facet of everybody’s everyday life, a lot of personal information is flowing around. Many companies learned long ago how to use their customers’ personal data to aid marketing and sales. Sociologists, city planners and the police have done much research using big data on demographics and crowd control. The most recent revelation about big data usage is the massive government projects that keep almost all information about everybody. All this has alerted privacy advocates to the problems big data creates: infringement of personal privacy, and possibly secret government control over citizens. The main worry is that the government holds too much data centrally on everybody, and citizens have no way to know what it is doing with that data. As such, there are bound to be conspiracy theories that big data is being used against the citizens.

In the IT field, there is much discussion about the big data problem. An article in the August issue of Scientific American, How to Save Big Data from Itself, proposes a three-step plan for using data properly in the age of government overreach.

Governments claim that, for security reasons, big data is required to search for terrorists. The threat is everywhere, and searching for useful information is like finding a needle in a haystack. The haystack now being accumulated is far too large, and any searching within it is hidden from citizens. The article’s first suggestion is that the haystack should be scattered. That is, instead of one large centralized database, each agency should maintain its own database, and these databases should be independent of each other. Law enforcement agencies could still access the individual databases on legitimate grounds, but the querying itself would be overseen through metadata. This is a way to safeguard against the abuse of mass searches of information.
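
To make the idea of a scattered haystack more concrete for myself, here is a minimal sketch in Python. It is only my own illustration, not anything taken from the article: the names AgencyStore and AuditedGateway, the in-memory dictionaries, and the rule that every query must state a legal basis are all assumptions. Each agency keeps its own store, and the gateway logs metadata about every query (who asked which agency about whom, and on what grounds) without logging the data returned.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgencyStore:
    """One agency's independent database (hypothetical, in-memory)."""
    name: str
    records: dict = field(default_factory=dict)  # person_id -> data


@dataclass
class AuditedGateway:
    """Routes each query to a single agency store and logs query metadata."""
    stores: dict = field(default_factory=dict)     # agency name -> AgencyStore
    audit_log: list = field(default_factory=list)  # metadata only, no record data

    def register(self, store):
        self.stores[store.name] = store

    def query(self, requester, agency, person_id, legal_basis):
        # Assumption: a query with no stated legal grounds is refused outright.
        if not legal_basis:
            raise PermissionError("a legal basis is required")
        # Log who asked which agency about whom, and why, but not the result.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "requester": requester,
            "agency": agency,
            "person_id": person_id,
            "legal_basis": legal_basis,
        })
        return self.stores[agency].records.get(person_id)


# Usage: two separate stores; one query touches exactly one of them.
gateway = AuditedGateway()
gateway.register(AgencyStore("tax", {"p1": {"income": 50000}}))
gateway.register(AgencyStore("transport", {"p1": {"plate": "AB123"}}))

result = gateway.query("officer-7", "tax", "p1", "warrant-001")
print(result)
print(len(gateway.audit_log), "query logged")
```

The point of the design is that no single query can sweep across all the stores at once, and the audit log gives an overseer something to review without exposing the personal data itself.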

The second step addresses data leakage, a major threat to big data. All databases and transmissions must be hardened with encryption. Without adequate protection, data could be siphoned off without notice. Cybersecurity is a major concern, and hacking is rampant on the Internet. Data leaked this way are likely to be used in cybercrime. Big data could create an even bigger problem because it covers so much information about so many people. From time to time, we hear horror stories of large corporations’ databases being compromised, leading to loss of assets. These are good lessons that the security of databases and transmissions must be upgraded.

The third part of the plan is to never stop experimenting with controls. The big data scenario is new. Existing regulations and traditional controls may not be suitable for maintaining a healthy environment, and there is still no definite solution to the problem. What we can do now is experiment with all kinds of controls to see what works. Citizens, technology companies and other countries are now pressing the US government to impose limits on NSA surveillance. Telecommunication companies are suing for the right to release metadata about the NSA’s requests for data. The USA Freedom Act now being debated could impose restrictions on the collection of bulk data. All these could keep the use of big data under control, but a final solution is still far away.