
project_v's Introduction

Intro

[Man-Computer Symbiosis](http://worrydream.com/refs/Licklider - Man-Computer Symbiosis.pdf) by JCR Licklider

Thoughts: Computers should help humans in formulative thinking (by guiding us in the intuitive trial-and-error process) and save time on routinized work (calculating, plotting, searching, etc.). The differences in "language" and "speed" between humans and computers pose challenges for the proposed "symbiosis". These observations are still relevant in today's HCI world.

The author also discusses how a computer can work with a team of humans. He proposed developing technologies such as writing interfaces, wall displays, and natural language understanding, which are only now maturing after half a century. To what extent these technologies are applied, and how far the symbiosis has developed, requires further examination.

The Data Lifecycle by Jeannette M. Wing

Data science is the study of extracting value from data. “Value” is subject to the interpretation by the end user and “extracting” represents the work done in all phases of the data life cycle.

Generation -> Collection -> Processing -> Storage -> Management -> Analysis -> Visualization -> Interpretation

Thoughts: There are many stages before data reach data analysts' desktops. Paying attention to the data generation and collection processes not only helps greatly with cleaning and feature engineering, but also with the interpretation of results and reflection on limitations.

Storage and management are becoming increasingly important as we create larger and faster datasets and build longer, more complex analytical pipelines. How to store (and not store) data, and how to manage their structures and versions: these questions affect scale and efficiency bottlenecks in both computing power and human collaboration.

What really inspired me is that Prof. Wing lists visualization and interpretation as two separate stages in the lifecycle. This makes sense to me, as I have found many EDA tools that generate an array of graphs but no interpretation. Interpretation is inherently a human job. While graphs carry implicit messages, especially when designed intentionally, the interpretation still needs to be mediated by a person familiar with the lifecycle. Alternatively, if we truly want to design automatic EDA tools, we need to consider how to integrate critical interpretation and reflection alongside the generated graphs.

Summary:

For humans and computers to collaborate, there is a trade-off to make: how automated should computer programs be, and how much agency and control should human users retain?

Successful "Automation + Agency" integration can be achieved by combining automatic reasoning with user-centered interactive systems.

The challenges in achieving this goal are:

  1. The research community focuses mostly on developing full automation.
  2. Users want to feel in control and not be interrupted in their workflow.
  3. Too much automation/recommendation discourages thinking.

Some solutions/tools are proposed by the author:

  1. Domain Specific Language (DSL)
    1. Formalize specification and user action model
    2. "Shared representation" for humans and machines
  2. Machine learning / artificial intelligence search models
    1. Models that search through the space defined by/in the DSL to predict the user's possible next steps
  3. Graphical interface that maps the DSL to intuitive visuals
    1. ... such that users can review and act upon machine output more efficiently
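The DSL-as-shared-representation idea can be sketched roughly as follows (a toy Python illustration of mine, not the paper's formalism; the spec shape, field names, and `candidate_next_specs` are invented): both the user and a search model read and write the same declarative spec, and the model enumerates candidate next steps over the space the DSL defines.

```python
# A rough sketch (mine, not the paper's formalism) of a DSL acting as a
# "shared representation": user and machine read/write the same spec, and
# a search model enumerates candidate next steps over the DSL's space.

FIELDS = {"price": "quantitative", "region": "ordinal"}   # hypothetical schema
MARKS = ["bar", "point", "line"]

def candidate_next_specs(spec):
    """Enumerate neighboring specs: swap the mark, or color by an unused field."""
    candidates = []
    for mark in MARKS:
        if mark != spec["mark"]:
            candidates.append({**spec, "mark": mark})
    for field in FIELDS:
        if field not in spec["encoding"].values():
            candidates.append(
                {**spec, "encoding": {**spec["encoding"], "color": field}})
    return candidates

spec = {"mark": "bar", "encoding": {"x": "region", "y": "price"}}
suggestions = candidate_next_specs(spec)   # here: two mark swaps, no free fields
```

Because both sides speak the same spec language, the user can accept, edit, or ignore any suggestion without leaving their own representation.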

Author gives 3 case studies:

  1. Data Wrangling
  2. Exploratory Data Analysis
  3. Machine Translation

Future Steps:

  1. Build inference, learning, monitoring, and model management services so as to reduce prototyping/development effort
  2. Construct shared representations of data in a data-driven manner
    • Help promote interpretability and skill acquisition for novices
    • Use appropriate design to encourage critical engagement in the face of automated decision support

Thoughts:

One of the key takeaways for me is that "users want to feel in control and not interrupted in their workflow". I used to think full automation was a plausible idea and that machines should be more proactive in mixed-initiative data exploration. But after trying many solutions and my own tools, I found it easy to drift away from my analytic goal, distracted by computer-generated recommendations, and to go on exploring less pertinent aspects of the data. Some solutions have an intention-detection mechanism that may help users stay focused in their analysis, but I find most intention-detection mechanisms inaccurate. Having the computer second-guess my intention by observing my actions is fine when I only have a rough idea of what to explore, but it can be frustrating when I have a clear idea yet cannot get my AI counterpart to understand it precisely and execute it. Teaching the AI through examples may take more time and thought than writing fully specified code.

In purely exploratory settings, I tend to gravitate towards the graphs that are shown to me first, only to realize at a much later stage, when I think abstractly and independently, that there is a better way to formulate the question or visualize the phenomenon. This resonates with the author's observation that "automation discourages thinking".

Clicking and scrolling are so "natural" and "intuitive" that we become lazy thinkers. Once we see a potential insight in one of the graphs, our confirmation bias nudges us towards the graphs that verify this "insight". The innocent computer, with no knowledge of the meaning and assumptions of the underlying data, picks up this intention and feeds the user more related graphs. This works much like how certain algorithms powering social media can strengthen stereotypes. This is why designing interfaces that encourage critical engagement, and training intelligent reasoning machines, are so important.

Takeaways: Goals proposed for the next-generation systems: "1) support visual and interactive data exploration 2) recommend relevant data to enrich visualizations 3) facilitate the integration and cleaning of datasets." The authors emphasize the importance of combining these features in an "uninterrupted data analysis cycle", which coincides with the point made in the "Automation + Agency" paper.

Today's challenges / motivations for the proposed goals: visual analytics services have a high "churn rate"; many are used once and never again. Reasons include: 1) current services lack adequate data cleaning capabilities (mostly because preset data are already cleaned); 2) there is no recommendation of datasets to integrate with the current data; 3) even if a user finds a new dataset, the joining capabilities are primitive. Personally, I believe another reason is that visualization interfaces are not playful enough; some elements of gamification and a goal-setting mechanism would make users want to explore more and have ownership over the work they have done.

The complex transformations needed to get data ready for plotting disrupt the analysis cycle. Preemptive computation to automate some transformations is wasteful and relatively slow in a real-time interactive environment. Once transformations are defined by one user, should we share the same cleaning procedure with other users of the same data? Version management and quality validation are key considerations.

Recommendation of new datasets may draw from "schemas, axis labels, annotations, domain of values being visualized, primary keys, or aggregation/filters/projections used" but also "collaborative filtering" based on actions of similar users or users in similar situations.

The author argues for the necessity of a formalism that unifies the above functionalities, while acknowledging that these functionalities are usually conducted "using different interfaces and yield very different actions". VizQL, the formalism underlying Tableau, is cited as a partial example. Another interesting observation is that different functionalities may operate at different time-scales. Dataset recommendations based on entity resolution may take much longer to train and run than other analytical tasks.




Tasks

The visualization process is decomposed into chains of actions, made of three types of elements (why, how, what) at different levels of abstraction. This new typology is argued to inform better design workflows and evaluation strategies: to "describe", "generate", and "evaluate", as the authors summarize.

Simplified model:

... -> Input -> ( Why -> How ) -> Output/Input -> ( Why -> How ) -> Output -> ...

What = Input/Output; Why and How correspond to the ends and means of a task.

Why:

High level: Consume (present, discover, enjoy) + Produce

Mid level: Search (lookup, browse, locate, explore) *Interesting definition, using a 2x2 matrix of whether the target and its location are known.*

Low level: Query (identify, compare, summarize)

How:

Manipulate: select, navigate, arrange, change, filter, aggregate

Introduce: annotate, import, derive, record

Encode: encoding (visual encoding etc.)

Limitations of extant visualization taxonomies:

Gap between high and low levels of abstraction in tasks. *This paper proposes to mark high/mid/low-level "why" tasks at every stage, which is very innovative.*

Lack of expression for sequences and dependencies. *This paper suggests that the use of "what" solves the sequence problem naturally.* It is interesting to think that this input/output typology can support not only chains of actions but also trees of actions, with branching representing the diverging thoughts an analyst may have during exploratory analysis.
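The chaining idea can be sketched as follows (a toy Python illustration of mine, not the authors' formalism; `run_chain` and the sample tasks are invented): each task pairs a "why" (end) with a "how" (means), and the output artifact ("what") of one task becomes the input of the next. Branching one output into several follow-up tasks would give a tree of analysis paths.

```python
# A toy sketch (mine, not the authors') of input/output chaining: each
# task has a "why" (end) and a "how" (means), and its output artifact
# ("what") is the input of the next task in the chain.

def run_chain(artifact, tasks):
    """Thread one artifact through a chain of (why, how, fn) tasks."""
    trace = []
    for why, how, fn in tasks:
        artifact = fn(artifact)      # output becomes the next task's input
        trace.append((why, how))
    return artifact, trace

chain = [
    ("discover", "filter",    lambda xs: [x for x in xs if x > 0]),
    ("compare",  "aggregate", lambda xs: sum(xs)),
]
result, trace = run_chain([-1, 2, 3], chain)
```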

Theoretical foundations:

Distributed Cognition, Stages of Action, Sensemaking, Play Theory

The idea of distributed cognition really excites me. I especially like how the authors relate epistemic actions to "the process of coordination between internal mental models and external representations". A definition from Wikipedia describes distributed cognition as when "cognition is off-loaded into the environment through social and technological means". I have been intrigued by this idea of facilitating mental and physical cognitive representation for a while. In the references, I identified a list of related literature in this direction, which will be covered in this doc.

  • J. Hollan, E. Hutchins, and D. Kirsh. Distributed cognition: toward a new foundation for human-computer interaction research. ACM Trans. Computer-Human Interaction (TOCHI), 7(2):174–196, 2000.

  • J. J. Thomas and K. A. Cook. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE, 2005.

  • G. Klein, B. Moon, and R. R. Hoffman. Making sense of sensemaking 2: A macrocognitive model. IEEE Intelligent Systems, 21(5):88–92, 2006.

  • A. Perer and B. Shneiderman. Integrating statistics and visualization for exploratory power: From long-term case studies to design guidelines. IEEE Trans. Computer Graphics and Applications (CG&A), 29(3):39–51, 2009.

  • Z. Liu and J. T. Stasko. Mental models, visual reasoning and interaction in information visualization: a top-down perspective. IEEE Trans. Visualization and Computer Graphics (TVCG), 16(6):999–1008, 2010.

  • M. Pohl, M. Smuc, and E. Mayr. The user puzzle: Explaining the interaction with visual analytics systems. IEEE Trans. Visualization and Computer Graphics (TVCG), 18(12):2908–2916, 2012.

Another theoretical foundation that is innovative from my perspective is play theory. The authors talk about casual readers of visualization, who I believe will become increasingly mainstream as data generation (IoT, social media), display technology (ubiquitous computing surfaces, VR/AR), and visualization techniques develop. I am interested in how insights from game design and gamification may bring value to visualization design.

Update: It occurs to me that D3.js actually mirrors the idea of "what" in this typology: the output of one method is the input of the next in a chain of methods.
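The pattern, transplanted out of D3 into a plain Python sketch (a hypothetical `Chain` class of mine, not D3's actual API): each step's output is the next step's input.

```python
# A hypothetical fluent-chaining sketch (not D3's real API): each call to
# then() feeds the current value into a function and wraps the output, so
# the output of one step is the input of the next.

class Chain:
    def __init__(self, value):
        self.value = value

    def then(self, fn):
        # the output of this step becomes the input of the next
        return Chain(fn(self.value))

out = (Chain([3, 1, 2])
       .then(sorted)              # list -> sorted list
       .then(lambda xs: xs[-1])  # sorted list -> its last element
       .value)
```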

The Eyes Have It by Ben Shneiderman

Published in 1996, this is quite a foresighted paper.

The author proposes his famous Visual Information-Seeking Mantra: "overview first, zoom and filter, then details on demand", and adds Relate, History, and Extract to the tasks. Then he discusses how the tasks apply to seven types of data (1-, 2-, and 3-dimensional data, temporal and multi-dimensional data, and tree and network data). Together, the tasks and types form the task by data type taxonomy (TTT).

Many of these ideas have been implemented and taken for granted in today's visualization tools. But there are some new lessons for me.

  1. The author comments on two challenges of creating good 3-dimensional scattergrams: "disorientation (especially if the user's point of view is inside the cluster of points)" and "occlusion (especially if close points are represented as being larger)". Last week I worked on a 3D scatter plot myself, and I always felt the plot was good enough for sensemaking. The author articulated the problems better than I could, and now I will look into the literature on these two problems to see how I can improve.
  2. The section on tree data ties together concepts of plots that I used to think of separately. Animated hyperbolic trees, cone trees, and treemaps may look very different visually, but they all serve the purpose of visualizing hierarchical structure. I recently visualized the Google folder of long-term team projects, with the goal of understanding its structure, storage usage, and update patterns, so that we can design a better data management protocol. Visualizing such a wide and deep directory is very challenging, especially the trade-off between keeping a sensible structure and providing enough detail. After reading this paper and searching the term "animated hyperbolic tree", I found this paper and this interactive example, which demonstrate a focus strategy that could really help with my struggle with large directory visualization.
  3. I also like the quote "Smooth zooming helps users preserve their sense of position and context." I think this corroborates my dislike of plotly's zoom-via-selection feature. I can see my selection highlighted right before the zoom triggers, but the zooming happens in a flash as soon as I release my mouse, and then I am looking at a corner of the canvas without knowing the context. Something like the smooth zoom in datashader is more natural to me. Another project that I came across, In Plain Sight at Columbia C4SR, also uses smooth zoom to put visualizations of very local areas in the larger context of the globe, which shows the power of this technique.
  4. The author proposes that we should "allow extraction of sub-collections and of the query parameters". This is quite foresighted, because one trend in the visualization community right now is to use declarative languages to create specifications of visualizations, so that they can be rendered efficiently client-side and server-side, across platforms, and support interactive and automatic transformations. Managing, sharing, and re-applying the specifications may be the next challenge.
  5. Finally, when discussing advanced filtering and dynamic queries, the author describes a filter-flow model that aims at making complex boolean expressions more intuitive. I believe this is a terrific idea because it makes abstract operations tangible. If we render this filter-flow model in real time as users compose the filter expression, they can see how the query changes the sub-population size and potentially other relevant features, which expedites their exploration and decision-making. Interestingly, I found very limited follow-up literature on this model after the author's paper. There is an evaluation study from 2013, but I believe it missed the point of the filter-flow model by simplifying it to a graphical programming interface. I plan to experiment with a visual filter-flow model in my nl4ds project to build more intuitive complex filtering.
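The real-time filter-flow idea could look roughly like this (a hypothetical Python sketch; the data and filter stages are made up): as each stage of the boolean expression is composed, report the surviving sub-population size so the query's effect is visible immediately.

```python
# A hypothetical sketch of a real-time filter-flow readout (data and
# stages invented): after each filter stage is composed, record how many
# rows of the sub-population survive.

rows = [{"age": a, "city": c} for a, c in
        [(25, "NY"), (34, "NY"), (41, "LA"), (29, "SF"), (52, "NY")]]

stages = [
    ("age >= 30", lambda r: r["age"] >= 30),
    ("city == NY", lambda r: r["city"] == "NY"),
]

flow = []            # (stage label, rows remaining after that stage)
current = rows
for label, predicate in stages:
    current = [r for r in current if predicate(r)]
    flow.append((label, len(current)))
```

Rendering `flow` next to the filter boxes would give users the "shrinking population" feedback described above at every editing step.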

I realized that I need to pick up the pace with reading, so I plan to write briefer reflections for the papers to come, except when I really have something to express.

I find the definitions in this paper (some of them also borrowed from previous literature) very clear and compelling, marking a few here:

"The purpose of data exploration and visualization is to offer ways for information perception and manipulation, as well as knowledge extraction and inference."

"In an exploration scenario, it's common that users are interested in finding something interesting and useful without previously know what exactly are searching for, until the time they identify it."

"In order to tackle both performance and presentations issues, ... approximation techniques (a.k.a. data reduction techniques) ... sampling and filtering ... or/and aggregation (binning, clustering) ... incremental (a.k.a. progressive) techniques ... results/visual elements are computed/constructed incrementally based on user interaction or as time progresses."

Customization for diverse users "systems should allow the user to: (1) organize data into different ways, according to the type of information or the level of detail she wishes to explore (e.g., hierarchical aggregation framework for efficient multilevel visual exploration); (2) modify approximation criteria, thresholds, sampling rates, etc. (e.g., [78]); (3) define her own operations for data manipulation and analysis (e.g., aggregation, statistical, filtering functions); "

Polaris by Chris Stolte et al.

Cannot believe this was published in 2001; reading this paper explains so much about the GUI layout of many of today's best visualization tools (Tableau, Power BI, Voyager 2, DataTone, ...). There is so much to learn from this paper, but I am most inspired by the following:

To handle multidimensional data, the authors make the most of the pivot table interface to create highly customizable faceted visualizations.

The idea of X, Y, Z shelves allows flexible and intuitive specification, while the table algebra (concatenation, cross, nest) bridges user specification and interface layout.
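My reading of the three table-algebra operators, sketched in Python (simplified: each operand is a list of entries, each entry a list of header labels; `concat`, `cross`, and `nest` are my own toy functions, not the paper's exact definitions):

```python
# A simplified sketch (mine) of Polaris's table algebra: operands are
# lists of entries, each entry a list of header labels.

def concat(a, b):
    """Concatenation (+): b's entries placed after a's."""
    return a + b

def cross(a, b):
    """Cross (x): every pairing of an entry of a with an entry of b."""
    return [x + y for x in a for y in b]

def nest(a, b, present):
    """Nest (/): like cross, but keep only pairings that occur in the data."""
    return [x + y for x in a for y in b if tuple(x + y) in present]

quarters = [["Q1"], ["Q2"]]
months = [["Jan"], ["Feb"], ["Apr"]]
present = {("Q1", "Jan"), ("Q1", "Feb"), ("Q2", "Apr")}

crossed = cross(quarters, months)          # all 6 combinations
nested = nest(quarters, months, present)   # only the 3 present in the data
```

The nest operator is what keeps a faceted layout from showing empty panes for quarter/month combinations that never occur.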

The authors propose to divide "graphics into three families by the type of fields assigned to their axes", then summarize the common goals analysts have when using each family of graphics (very insightful):

  • ordinal - ordinal: understanding patterns and trends in some function f(Ox, Oy) -> R (R represents the fields encoded in the retinal properties of the marks)
  • ordinal - quantitative: understand or compare the properties of some set of functions f(O) -> Q.
  • quantitative - quantitative: understand the distribution of data as a function of one or both quantitative variables and discover causal relationships between the two quantitative variables.

When discussing data transformations, the authors analyze how each transformation may turn a Q into an O (e.g. partitioning) or an O into a Q (e.g. counting), which in turn affects layout and display. This ties in with how the table algebra is designed.

SELECT {dim} {aggregates} GROUP BY {G} HAVING {filters} ORDER BY {S}

HAVING is used instead of WHERE because some filters operate on aggregated values.
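The WHERE vs. HAVING point can be made concrete with sqlite3 (a toy example of mine, not from the paper): WHERE filters rows before aggregation, while HAVING filters groups after aggregation.

```python
# Toy illustration (mine): WHERE filters rows before aggregation; HAVING
# filters groups by an aggregate value, which WHERE cannot reference.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 10), ("east", 30), ("west", 5), ("west", 10)])

# only groups whose SUM(amount) exceeds 20 survive the HAVING clause
rows = con.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 20
    ORDER BY region
""").fetchall()
```

Here "west" sums to 15 and is dropped by HAVING, even though every individual "west" row would pass a per-row WHERE filter like `amount > 0`.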

There is a difference between the aggregation that happens when partitioning the data into panes of the "pivot table" and the aggregation that happens during visual specification of retinal properties.
