w3c / mediacapture-screen-share

Media Capture Screen Capture specification

Home Page: https://w3c.github.io/mediacapture-screen-share/

License: Other

mediacapture-screen-share's Introduction

Screen Capture

This work is currently licensed under a Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) License. See http://creativecommons.org/licenses/by-nd/4.0/legalcode

Published Versions

While we have taken measures to reduce the frequency of build breakages, the tip-of-tree of this document may contain work-in-progress changes and other inconsistencies. If you want to review something more coherent, review the latest editors' draft; these are published at intervals on the order of weeks.

Tip of tree as HTML
Latest published editors' draft on GitHub
Latest W3C published version (automatically updated; should be identical to the latest editors' draft)

To Reflow the Spec

To format the draft, use something like:

tidy --quiet y -utf8 --vertical-space y --tidy-mark n -indent -wrap 80

mediacapture-screen-share's Issues

Modifying already shared permissions for screen

@KiranKumarGuduru says:
Consider a use case where the user has granted permission to share part of the screen and now wants to modify that permission to share the full screen, or vice versa. It would be good to have a permission dialog that displays an option like "modify existing permissions".

Event for screen sharing

Firing an event when a tab is being shared could notify the web application so that it can hide sensitive information or change its layout during screen sharing.

Draft of this event could be:

The following event fires on Document:

Event name	Interface	Fired when...
screensharingchange	`Event`	Sharing of the tab starts, or sharing of the entire screen starts while the current tab is active. Also fired when screen sharing stops.

partial interface Document {
  readonly attribute boolean screenSharing;
};

Open issues:

The web app may be notified only after sharing has already started, so information may leak before the app takes action.
screenSharing only indicates screen sharing performed by the browser.
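
A minimal sketch of how a page might react to the drafted event, assuming the draft were adopted as written. `screensharingchange` and `screenSharing` are the names from the draft above, not a shipped API; `guardSensitiveContent`, `hide` and `show` are hypothetical helpers introduced only for illustration.

```javascript
// Sketch only: `screensharingchange` and `screenSharing` come from the draft
// above. `guardSensitiveContent`, `hide` and `show` are hypothetical names.
function guardSensitiveContent(doc, hide, show) {
  const sync = () => (doc.screenSharing ? hide() : show());
  doc.addEventListener('screensharingchange', sync);
  sync(); // Apply the current state once at startup.
  // Return a function that stops reacting to the event.
  return () => doc.removeEventListener('screensharingchange', sync);
}
```

The sketch does not address the first open issue: the handler still runs only after sharing has begun.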

Application vs. browser

Feedback from discussions today was that the distinction between "browser tab" and "application window" was not something that was completely obvious. While the distinction is still relevant from an implementation perspective, the suggestion was to remove the differentiation in the user consent dialog and the API.

This avoids applications needing to be able to know, a priori, whether an application is needed vs. the browser. For those apps that have access to browser sharing, that would allow us to show a unified selection dialog (using crosshair-based selection complicates things a little, but we are unlikely to start with that).

Elevated consent for browser sharing

Looks like the whole elevated consent process, or "Application Permission" is going to be interesting. While Chrome/Blink has decided to go for an application install process, it's likely that Firefox will not.

I think that we need to abstract a little more and just say that there is some special process that is required to grant a site the ability to share the browser.

What is the expectation when the "application" has multiple window areas?

When the displaySurface is set to "application", is the user agent expected to produce multiple video streams (i.e. MediaStreamTracks), one for each application window? This gets complicated when the number of windows can change at runtime or when the lifetime of each window varies. It would be even more challenging if we want to preserve the relative positions of the windows. It might be easier in some cases for the web developer to choose the "window" or "monitor" surfaceType instead.

Source pixel ratio of the video track

This is a follow-up to #35.

Applications that request a non-resized stream (resizeMode = none) on high DPI sources will receive a video stream of physical pixels. However such applications would likely need to know what the pixel ratio of the content is when processing it.

For example if the application wants to render the screen capture in a non-zoomed "100% mode" while keeping the highest fidelity on a variety of screens, it would request the capture with resizeMode none but would need to also know the content's logical dimensions to avoid visual scaling when rendering.

The problem I see is how to expose this information. The stream's width and height can be retrieved from the track settings, but there is no setting for "contentPixelRatio", and it probably wouldn't make sense as it's not a directly constrainable property.

To note, the pixel ratio can change during the life of the stream when the user drags the source from one monitor to another.
There doesn't seem to be any event when settings of the track change (especially since such intrinsic changes never trigger an overconstrained event). In the case of resizeMode none, the dimensions of the video track would also change, but again, that change isn't directly observable on the stream track object.
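
Without such an event, an application is left comparing snapshots of the track settings. A rough sketch of that comparison, assuming the snapshots come from track.getSettings(); `changedSettings` is a hypothetical helper, and the polling loop in the comment is only one way it might be used.

```javascript
// Sketch only: compares two snapshots of track.getSettings() because, as noted
// above, no event reports intrinsic changes. `changedSettings` is a
// hypothetical helper name.
function changedSettings(previous, current, keys = ['width', 'height']) {
  return keys.filter((key) => previous[key] !== current[key]);
}

// Possible use in a page (not run here; `relayout` is hypothetical):
//   let last = track.getSettings();
//   setInterval(() => {
//     const now = track.getSettings();
//     if (changedSettings(last, now).length > 0) relayout(now);
//     last = now;
//   }, 1000);
```

This detects a change in dimensions but still cannot tell the application what the new pixel ratio is.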

Should browser area be allowed as “window” surfaceType?

This is more of an open question, since I'm not sure about all the security implications. It might be helpful to define a "browser-window" surfaceType explicitly so the user knows that they are about to share part of the current webpage.

non-top-level browsing contexts

getUserMedia is getting a sandbox flag: whatwg/html#1211

Do we want one too, or do we instead want to forbid the use of getDisplayMedia from sandboxed iframes entirely?

Should we support audio?

"1. Introduction" says that the document primarily concerns itself with video, but that the general mechanism here could be extended to audio and depth.

Does this mean we should ignore audio:true or define what to do in this case?
If we want to support audio we should support MediaTrackConstraints (#67).

If we support audio I think we should make it simple, and either share all audio being played out or none of it. Filtering audio on a per-window basis does not make sense the way it does for display media. I think it would confuse the user as to what is or is not shared and have no practical uses.

We could also not support audio, in which case we should decide whether to ignore it or throw an exception; we need to be consistent (#68).

TAG review

Part of the process for shipping stuff is a review by the TAG.
Someone needs to file a ticket and fill in the questionnaires for that.

Constraint to exclude application audio (echo)

If I screen share in a presentation that is also a conference with remote participants, and my screen sharing includes audio (e.g. I want to show a media clip as part of the presentation) we have a risk of echo.

If screen share contains remote participants talking, when this stream arrives at a remote participant for playout, they will hear themselves AND other remote participants twice (once because they are receiving a stream from them directly and once more because of the screen share).
We can't just throw "echo cancellation" on the problem.

Constraints are currently not allowed to limit the user's choice, and this is generally a good thing; which sources are present is none of the application's business. However, the application needs to be able to constrain the user agent not to include audio from the application that performed the getDisplayMedia() request, to avoid echo. Otherwise these applications would end up presenting the user with false choices: options that, if chosen, would seemingly cause "echo bugs".

Proposal: {audio:{excludeApplicationAudio:true}} limits the user agent to providing a stream that does not contain the application audio. Note that this does not say what audio the user agent must provide - the implementation/user still very much has freedom of choice - it only specifies that a particular audio source must not be present. This constraint can be fulfilled in multiple ways, e.g.:

1. Exclude choices from the user. For example, "tab audio" or "window audio" are still valid choices, as long as the "tab" chosen is not the application tab. "System audio" is not a valid choice.
2. "No audio" is a valid choice. (Though in this case no audio track should be produced)
3. Manipulate the audio sources to subtract the application audio.

1 and 2 are easy to implement. 3 is likely infeasible for most platforms, but conceivable. In any case, the application should not have to care about user agent capabilities - as long as the application gets a stream that does not produce echo it is happy.
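
The first two ways could be sketched as a filter over the choices a user agent offers. This is an illustration under assumptions: the choice objects ({kind, tabId}), the three kinds modelled here, and the name `selectableAudioChoices` are all hypothetical, and "window audio" is left out for brevity.

```javascript
// Sketch only: models the "exclude choices" and "no audio" options described
// above. The choice shape and `selectableAudioChoices` are hypothetical.
function selectableAudioChoices(choices, { excludeApplicationAudio, applicationTabId }) {
  if (!excludeApplicationAudio) return choices;
  return choices.filter((choice) => {
    if (choice.kind === 'system') return false; // Would include the application's audio.
    if (choice.kind === 'tab') return choice.tabId !== applicationTabId;
    return choice.kind === 'none'; // "No audio" stays valid; no track is produced.
  });
}
```

The application never sees this list; it only learns the outcome through the stream it receives.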

Do constraints (e.g., size) restrict which windows you can use?

Behavior for getDisplayMedia() and getDisplayMedia({}) should be the same

https://w3c.github.io/mediacapture-screen-share/#navigator-additions says:

If the constraints argument is omitted, a default value containing a single video attribute set to true is assumed.

However, this doesn't match the MediaStreamConstraints definition in https://w3c.github.io/mediacapture-main/#mediastreamconstraints:

dictionary MediaStreamConstraints {
             (boolean or MediaTrackConstraints) video = false;
             (boolean or MediaTrackConstraints) audio = false;
};

Here, video has a default value of false. Web IDL says in https://heycam.github.io/webidl/#dfn-optional-argument-default-value:

If the type of an argument is a dictionary type or a union type that has a dictionary as one of its flattened member types, and that dictionary type and its ancestors have no required members, and the argument is either the final argument or is followed only by optional arguments, then the argument must be specified as optional. Such arguments are always considered to have a default value of an empty dictionary, unless otherwise specified.

In other words, per Web IDL, in this case, constraints is considered to have a default value of {}, which means that getDisplayMedia() and getDisplayMedia({}) will be indistinguishable. getDisplayMedia({ audio: false }) should also have the same behavior if audio has a default value of false, which it does here.

The most straightforward fix for this would be to make the IDL match the wanted behavior, e.g.:

dictionary DisplayMediaConstraints {
             boolean video = true;
             boolean audio = false;
};

As a very unfortunate side effect getDisplayMedia({ video: false }) and getDisplayMedia({ video: undefined }) will not mean the same thing even though undefined is a falsy value, but I don't see another great solution.
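
The side effect can be seen in a small model of how Web IDL fills in dictionary defaults for the proposed DisplayMediaConstraints. This is a sketch for illustration, not browser code; `toDisplayMediaConstraints` is a hypothetical name.

```javascript
// Sketch only: mimics how Web IDL fills dictionary defaults for the proposed
// DisplayMediaConstraints. A member that is missing or `undefined` takes its
// default; any other value is coerced to boolean.
function toDisplayMediaConstraints(input) {
  const dict = input ?? {}; // An omitted argument behaves like {}.
  const member = (value, fallback) => (value === undefined ? fallback : Boolean(value));
  return {
    video: member(dict.video, true),
    audio: member(dict.audio, false),
  };
}
```

With this model, calling it with no argument, with {} or with { video: undefined } yields video set to true, while { video: false } yields video set to false.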

limiting browser sharing to a list of domain/urls

Please consider adding another constraint which will be relevant for OutputCaptureSurfaceType="browser". A commercial product may (and will) need to limit screen sharing to only those tabs which were opened from a white-list of domains or even URLs. And vice versa - it will be very useful to support a black list of domains/URLs (never share contents if a tab is navigated to this address).

Document issues with transparent/semi-transparent windows

The subject says it all. Most systems for sharing properly avoid capturing the contents of the window background, but it's a concern.

Make logicalSurface constraint unchangeable

@alvestrand notes that logicalSurface could in theory change once a screen capture has been provided. However, that would be at odds with the advice in the document.

We should make it very clear that logicalSurface is a constraint that selects, just like facingMode. That is, unlike resolution, once you have a track, it is an unchangeable property and attempts to change it result in overconstraining the track. This is different to other constraints like width and height in that changing those can be possible once the track is acquired.
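
The distinction could be modelled as a check that runs before a constraint change is attempted. A sketch under assumptions: the SELECT_ONLY list, the use of bare values rather than {exact: ...} forms, and the name `overconstrainedProperty` are all hypothetical.

```javascript
// Sketch only: treats logicalSurface as fixed at selection time, as argued
// above. SELECT_ONLY and the helper name are assumptions for illustration.
const SELECT_ONLY = ['logicalSurface'];

function overconstrainedProperty(settings, requested) {
  // Returns the first select-only property whose requested value differs from
  // the track's current setting, or null if the request can be attempted.
  const hit = SELECT_ONLY.find(
    (key) => key in requested && requested[key] !== settings[key]);
  return hit === undefined ? null : hit;
}
```

Width and height are not in the list, so a request that only changes those passes the check.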

We don't have formal language to support making this distinction, I don't know if it is needed though.

Disable Local Playback During Audio Sharing

We want to add a constraint googDisableLocalEcho in the audio part of the getUserMedia call.
It only works with audio capturing (either from the whole system or a tab).
If googDisableLocalEcho:true is set, we mute the local playback of the audio being shared.

For example, if one wants to share a tab, and specifies
getUserMedia({video:..., audio:{mandatory:{sourceId:, googDisableLocalEcho:true}}})
Then during the sharing of the tab, the tab is muted on the sender's side. Otherwise the shared tab is still playing out on the sender's side.

If one wants to share the full screen with system audio and specifies this parameter, then we will mute the sender's speaker during the sharing.

Constraints for Captured Display Surfaces should move to MediaTrackSettings

The spec currently allows constraints to observe the properties of the selected display surface. These are only observable, such that they cannot be changed after being returned.

The way it is working right now, they would be better defined under getSettings(), and a partial dictionary should be defined on MediaTrackSettings. The aim here seems to be to allow the application to query the current settings of the object's constrainable properties, which is exactly what getSettings() does. Putting these as read-only exceptional cases under getConstraints() seems to overlap with what getSettings() should do.

Should getDisplayMedia be functional in SecureContext only?

Following on w3c/webrtc-pc#1945, the question is whether to mandate secure origins for getDisplayMedia.
We could:

Mandate to reject getDisplayMedia promise for non secure origins
Make getDisplayMedia SecureContext

getOutputMedia() is a confusing name

Perhaps:
getDisplayMedia()?

Be consistent with error handling

5.2 says to throw an InvalidAccessError if MediaTrackConstraints are used, 5.4 says deviceId can make it overconstrained which I think implies an OverconstrainedError is thrown, and I think other parts of the spec may have implied that unsupported constraints are simply ignored (no exception thrown), but I'm not sure about that last part.

We should be consistent with how we handle errors. Should we use different exception types?
If an unsupported constraint or a constraint that doesn't make sense is used, should we always throw an exception? If we throw on deviceId being set we might want to throw on other recognized attributes that aren't supported being set.

Make it possible to indicate specific surfaces

Currently I can't specify a specific app (e.g., "Powerpoint"). That seems limited in terms of not having parity between web applications and native apps (which can enumerate all the apps in their UI). We're now building this functionality into gUM so maybe we should just start from the beginning.

Hazard with application sharing

It's been noted that application sharing comes with certain usability hazards. We need to either highlight those hazards and explain them, or remove the ability to request an entire application.

Remove entire desktop sharing

Feedback from some folks at Mozilla is that sharing the entire desktop has some undesirable properties:

It forces us to black out the browser to provide the protection that we need for CSRF
It opens up the option for a click through dialog of the form "Do you want to share your desktop? y/n", which removes the implicit consent advantage of a forced selection.
It provides challenges for feedback similar to those posed by a full-screen application
It isn't actually that desirable a feature

In short, we are unlikely to implement this particular feature in Firefox at this stage.

Mention capture of system audio

It seems natural that getDisplayMedia({audio: true}) should capture system audio (what would normally be played out on the speakers).

The spec should mention that this is the correct interpretation.

Browser tab sharing

Add tab display surface for sharing single browser tab.

Describe what happens if the window is closed or the monitor is disconnected

I assume MediaStreamTrack.onmute/onended fires. Is there a GETUSERMEDIA section we can reference?

There is also under 6.3 "In addition to feedback mechanisms, a means to for the user to stop any active capture is advisable." - this would be the same type of thing; we should define what the expected behavior is when this happens.

getDisplayMedia({video:false}) is interpreted as getDisplayMedia({video:true})

3. Let requestedMediaTypes be the set of media types in constraints with either a dictionary value or a value of true.

4. If requestedMediaTypes is the empty set, set requestedMediaTypes to a set containing "video".

The spec does not distinguish the case of "no value" from "false". Because 3) results in an empty set, 4) turns this empty set into the same as "video:true". I don't think this is intentional, let's fix.
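
The two quoted steps, and one way they might be changed, can be sketched as follows. The second function is only a possible fix and an assumption, not agreed spec text; all function names are hypothetical.

```javascript
// A media type counts as requested when its value is `true` or a dictionary.
function explicitlyRequested(constraints) {
  return ['video', 'audio'].filter((type) => {
    const value = constraints[type];
    return value === true || (typeof value === 'object' && value !== null);
  });
}

// The steps as quoted above: an empty set falls back to video.
function requestedMediaTypesAsSpecified(constraints) {
  const types = explicitlyRequested(constraints);
  return types.length === 0 ? ['video'] : types;
}

// One possible fix (an assumption): fall back to video only when neither
// member was given at all, so an explicit `false` is respected.
function requestedMediaTypesDistinguishingFalse(constraints) {
  const types = explicitlyRequested(constraints);
  const nothingGiven = constraints.video === undefined && constraints.audio === undefined;
  return types.length === 0 && nothingGiven ? ['video'] : types;
}
```

What should happen when the resulting set is empty (for example, rejecting the request) is left open here, as it is in the issue.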

Clarify 'origin' concept

From https://lists.w3.org/Archives/Public/public-media-capture/2018Feb/0005.html:
"From Element has its own developed privacy considerations section, particularly focused on origin separation. Is it expected that handling origin separation is to be handled by implementors or is there best practices or implementation guidelines that will be provided? If the latter, is Feature Policy an applicable solution here?"

Bring back constraints for downscaling, not selection filtering.

@martinthomson In hindsight, an obstacle to implementing the spec is our current implementation in Firefox allows passing in constraints to downscale the resolution and decimate frame rates from the very first frame. Our constraints do not filter the selection list at all.

We consider this a useful feature, since desktops tend to be of extremely large resolution, and their high frame-rates are overkill for most WebRTC use.

The spec currently disallows constraints on getDisplayMedia, except on subsequent calls to track.applyConstraints presumably.

This presumably means JS would need to expect the full-size full-frame-rate stream first, and then issue

    await track.applyConstraints({width, height, frameRate})

on the video track before attaching it to a sink.

We're concerned this is resource-intensive and inefficient, and that it may even cause failures in situations where the unscaled capture cannot be delivered but a downscaled version could have been.

We think this is a good reason to bring back constraints on getDisplayMedia. We'd need to craft strong wording to forbid UAs from using constraints to guide selection in any way.

Refine "share what is seen"

Operating systems with application exclusive modes (Windows 8, virtually all mobile operating systems) cannot easily and reasonably manage things like screen sharing without violating the rule that states that only what is seen is shared.

We need to refine the language around this a little. There probably needs to be a higher amount of care taken for applications that are shared while not visible, but I don't think that a blanket prohibition is the exact right answer.

Handle source device pixel ratio

Issue for discussion started on the list: https://lists.w3.org/Archives/Public/public-media-capture/2015Dec/0029.html

The spec currently doesn't mention anything about how to handle sources (screen, windows) with logical pixels that don't match physical pixels of the monitor.

The most prominent case is "retina displays", which have high density physical pixels. The OS exposes logical pixels that simulate a lower resolution screen and effectively scales things. The scaling factor is usually referred to as "device pixel ratio", at least in the Web world (see the CSS spec).

A Mac retina display with a physical size of 2880x1800 usually exposes a virtual resolution of 1440x900. The devicePixelRatio is 2.0 for such monitors.

I believe current implementations simply capture the physical pixels.

When capturing large windows or a full screen, an application might not want a size that large, especially when transmitting to another display with a 1.0 device pixel ratio (effectively zooming the content 200%).

Existing width / height constraints are not adapted to dealing with the resolution in this case:

There is no way to know the size, physical or logical, of the source selected by the user, making fixed width/height constraints impossible to use
Max constraints would only limit the size. The content is really meant to be scaled by 2.0, not a floating point number that would make it just fit in the constrained size
A smaller application window might have a size below the max constraints

One workaround with the state of the current specs could be to:

gDM
Read width / height of stream
use Window.devicePixelRatio API to guess device pixel ratio of content
applyConstraints on stream to scale it by that ratio
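
The arithmetic in the last three steps can be sketched as a pure function. It assumes `settings` is what track.getSettings() returned and `pixelRatio` is window.devicePixelRatio, which is only a guess at the captured source's ratio; `logicalSizeConstraints` is a hypothetical name.

```javascript
// Sketch only: derives width/height constraints that scale a physical-pixel
// capture down to logical pixels. `pixelRatio` is a guess, not a value
// reported for the captured source.
function logicalSizeConstraints(settings, pixelRatio) {
  return {
    width: Math.round(settings.width / pixelRatio),
    height: Math.round(settings.height / pixelRatio),
  };
}

// Possible use in a page (not run here):
//   const [track] = stream.getVideoTracks();
//   await track.applyConstraints(
//       logicalSizeConstraints(track.getSettings(), window.devicePixelRatio));
```

For the 2880x1800 display mentioned earlier and a ratio of 2.0, this yields 1440x900.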

Unfortunately this might fail on a variety of cases:

On some multi-monitor systems that have different pixel ratios per monitor (Windows 10), when the browser tab and content captured are on different monitors (Especially because windows can be dragged across monitors, dynamically changing pixel ratios!)
when the current web app content is zoomed, thus decoupling the CSS devicePixelRatio from the monitor's pixel ratio.

We can solve this by:

Specifying that the screen share is done using logical pixels only. This might be limiting for apps that would like to capture unscaled screens. It could however be a good default.
Providing a "constraint" on the getDisplayMedia request that would allow the application to indicate whether it wants a stream of logical or physical pixels.
Optionally, a stream video track attribute, or maybe a stat for the stream, where the application could “read” the current devicePixelRatio for the stream source, irrespective of whether it constrained the stream to physical or logical pixels (the app should know what it asked for)

The last point is to know if the captured stream is currently scaled when the app asked for a "logical pixels" stream, or the factor by which to scale it when asking for "physical pixels" and the app wants to do that scaling on its own (e.g. compute a scaling factor adapted to the pixel ratio of the receiving display)

define behaviors of all existing constraints that should apply in the screen capture

At the TPAC session, we decided to clarify the meaning of all the existing constraints that can be applicable to the screen capture. For example, resolution, framerate, and maybe "exact", etc. It'd be good to go through the constraint model in the current Media Capture spec so we don't miss anything.

The user agent should be allowed to change sources after getDisplayMedia() resolves

The spec currently has language saying the source is constant after getDisplayMedia() returns a stream, e.g:

The provided media MUST include precisely one track of each media type in requestedMediaTypes. The devices chosen MUST be the ones determined by the user. Once selected, the source of a MediaStreamTrack MUST NOT change.
[...]
Since the source of media cannot be changed after a MediaStreamTrack has been returned and constraints do not affect the selection of display surfaces, these constrainable properties cannot be changed by an application.

This refers to the settings "displaySurface", "logicalSurface" or "cursor" being constant for a track. I'm not convinced we should enforce these being constants, but even if we do want that, we don't have to go so far as to say sources are constant, just that sources cannot change type from "logical" to "display" or enable/disable the cursor. And just because a user agent (end-user) can change the source doesn't mean that the application has any control over this.

We should allow the user agent to change source on-the-fly. This allows, amongst other things, the user agent to allow the end-user to change which tab ("browser" surface) to present at any given point in time. This is what current extensions are already doing (edit: actually I'm not sure if this is already implemented or just planned to be implemented).

NavigatorUserMedia no longer exists

Found while authoring web-platform-tests/wpt#12284

getUserMedia in mediacapture-streams adds directly to Navigator.

So, this (mediacapture-screen-share) should be changed to

partial interface Navigator {
    Promise<MediaStream> getDisplayMedia(optional MediaStreamConstraints constraints);
};

Examples should use async/await

Example 3. uses then(), it should be updated to use async/await

tabid

Having the application able to select a sharing target, or to significantly influence the sharing target, was identified in a review to be problematic. An important characteristic of the consent process is that the user selects what is shared, this undermines this.

This might need to be held until we have final UX, which might suggest a way around this.

Selecting a Browser Tab for Share

@keithgrif says:
A site/app wishing to share needs to advertise a list of applications available to the user. When a browser has multiple tabs open and available the user needs to be able to select which of those tabs they wish to share.

An example is on Windows an app can choose the open list of applications to share as shown in the view below from task manager.

Note that the browser shows a single application instance. Although multiple tabs are available within the browser (see image below) the currently selected tab is the only one available in the list.

Should getDisplayMedia be moved to navigator.mediaDevices

Currently getDisplayMedia is exposed in navigator.
It is more consistent to have getDisplayMedia be part of navigator.mediaDevices.

One potential issue is that Edge already shipped navigator.mediaDevices.

There are some ongoing changes to getDisplayMedia, for instance:

New handling of constraints
SecureContext
More changes might come as we make progress.

Given all of this, it might be ok to move it to navigator.mediaDevices.
Some advantages:

More consistent
We could envision making mediaDevices SecureContext and would not need to special case getDisplayMedia.
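
If the method does move, pages may have to cope with either location for a while. A minimal lookup sketch, assuming the navigator object is passed in as `nav` so the logic can be exercised outside a browser; `findGetDisplayMedia` is a hypothetical name.

```javascript
// Sketch only: prefers navigator.mediaDevices.getDisplayMedia and falls back
// to navigator.getDisplayMedia. Returns null when neither exists.
function findGetDisplayMedia(nav) {
  const devices = nav.mediaDevices;
  if (devices && typeof devices.getDisplayMedia === 'function') {
    return devices.getDisplayMedia.bind(devices);
  }
  if (typeof nav.getDisplayMedia === 'function') {
    return nav.getDisplayMedia.bind(nav);
  }
  return null;
}
```

In a page this might be called as findGetDisplayMedia(navigator), with the result checked for null before use.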

Define tab capture

The nonstandard use of getUserMedia for screen capture included the option of capturing from browser tabs. This has proved popular.

The spec should mention that this is a possibility - while tabs can be seen as a special case of "application window", they are mostly perceived as their own thing.

why does getDisplayMedia live on navigator and not navigator.mediaDevices?

apologies if I missed any previous discussion...

Is there a reason for getDisplayMedia living on navigator instead of navigator.mediaDevices?
navigator.getUserMedia is considered legacy so it is somewhat unexpected to see a similar pattern in this spec.

Originally filed as w3c/mediacapture-main#508 but this repository seems more appropriate.

Option to exclude cursor from video stream

Opening spec issue based on Chrome issue 463423.

Some applications might want to receive a video stream of the screen share that doesn't include the cursor.
I believe adding a constraint for this would be an easy way to provide this feature.

Where things get trickier is that such applications might want to receive the cursor position programmatically instead. This would require extending the MediaStreamTrack object to dispatch events about the cursor position.
Furthermore, applications would also want to get the cursor image, which is tricky since not all cursors are simple bitmaps that are alpha blended; some XOR their background (the text selection cursor on Windows is the main example).

In #35 I'm already suggesting extending the MediaStreamTrack object to expose a "pixel ratio" attribute of the stream. I'm open to alternatives but it sounds like we need a consistent way to expose "metadata" about the screen sharing stream.

What should be the default expectation for "displaySurface" and "logicalSurface"?

The spec should define the expectation when "displaySurface" or "logicalSurface" is not set explicitly, for example, either with default values (e.g. "monitor", and "false") or returning the promise with an error code, etc.

Fullscreen needs handling (was: Powerpoint is special)

We will have to detect when a particular Powerpoint window has become full screen and shift the window share to the presentation. We might want to extend the same sort of privilege to other applications that shift to a full screen mode.

Offer high level source filtering

Current implementations in Chrome and Firefox both offer the ability for the web app to restrict the sharing choice to desktop, window, tabs etc. Web apps like Google Hangouts use this functionality and would like to keep it. There are also technical reasons why this makes sense: for example, tab sharing is much cheaper on system resources, and unlike desktop and window sharing, audio capture always works. By restricting the choice to tab sharing a web app could know that it can do high fps streaming with audio.

Unclear how to aggregate windows or how to handle multiple windows/monitors

Terminology says multiple windows may be aggregated into a single track, but it does not explain how this is done. Will the background be black? What if they are far apart? What happens if windows move around, will that change the resolution to match covering both windows? Or should the windows always be grouped next to each other, so that their relative position in the track does not correlate to the relative position of the windows on the monitor?
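
To make the trade-offs concrete, here is one possible aggregation, the bounding box of all windows, which keeps their relative positions. This is an illustration of a single option, not what the spec defines; rectangles are assumed to be {x, y, width, height} objects, at least one window is assumed, and `aggregateBounds` is a hypothetical name.

```javascript
// Sketch only: one possible aggregation, the bounding box of all windows.
// Assumes a non-empty array of {x, y, width, height} rectangles.
function aggregateBounds(windows) {
  const left = Math.min(...windows.map((w) => w.x));
  const top = Math.min(...windows.map((w) => w.y));
  const right = Math.max(...windows.map((w) => w.x + w.width));
  const bottom = Math.max(...windows.map((w) => w.y + w.height));
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

With this choice, moving a window changes the track's resolution and the area not covered by any window needs some fill, which are exactly the questions raised above.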

The same questions arise if you think about sharing "monitor" when the user has multiple monitors. Additionally, does the user pick a monitor or do you get all monitors, etc?

What happens if a window is resized?
What happens if resolutions change?
What happens if monitor setups change?
etc.

MediaTrackConstraints usage contradiction

The getDisplayMedia definition in 5.1 says that using MediaTrackConstraints throws an exception, so that the application cannot control which surface is shown. However, MediaTrackConstraints contains things we might want to support (volume, if we support audio), and another part of the spec, 5.4, talks about what to do with deviceId, which is part of MediaTrackConstraints.