
deepflow-analysis's Introduction

👋 Hi! I'm a researcher at the Meertens Instituut in Amsterdam. My research focuses on why some cultural phenomena are adopted and persist through time, while others change or disappear (why, for example, did Python's f-string syntax take over from the str.format syntax? 😉). Additionally, I'm interested in measuring cultural diversity and compositional complexity, and in how we can account for biases in our estimations of diversity. (Check out our recent publication in Science on estimating the number of lost books.)

🔭 I mostly program in Python, and I also like to teach a bit about programming. Together with Mike Kestemont and Allen Riddell, I recently published a 📖 with Princeton University Press about using Python for Humanities data analysis. An open access version of the book can be found at https://www.humanitiesdataanalysis.org

📫 You can check my website for more information about my academic work, or follow me on Twitter or Mastodon.

deepflow-analysis's People

Contributors

emanjavacas, fbkarsdorp, mikekestemont


deepflow-analysis's Issues

Open goals in the paper

I am adding a number of goals we set ourselves at the last meeting. Feel free to open new issues to discuss each goal in more detail. I am not sure who is in charge of the remaining issues; feel free to add yourself.

  • break down features [E]
  • motivate, group, and explain all features, and refer to the relevant literature [E]
  • explain data collection and preprocessing [F]
  • related research section [F/E/M]
  • move away from the Turing test as background framing (since it is not clearly a Turing test) [M]
  • look into cases that killed people [F/M]

Literature list

Other comments from reviewer 1

Other comments:

  • p. 1: "Increasingly, people interact with a variety of artificial agents, often even without being fully aware of whether or not their conversation partners are in fact human." ==> are you sure people are often not aware of this? This seems like an overstatement.

  • p. 4: Why did you decide to allow the same amount of time for A- and B-runs? Why not allow self-paced reading in both conditions?

  • p. 5: "encompassing the main body of English Hip-Hip music produced and consumed in the United States of America. " ==> ..English-language Hip-Hop.. Also: I think the "consumed in the United States of America" part could be removed, as this is now a global genre.

  • p. 5: since PLOS is a general scientific journal, it would be good to briefly spell out what LSTMs and Transformers are (with references and/or pointers to later sections of the paper)

  • p. 5: "we translated all unique words into a" ==> this slightly confused me; do you mean all words that occur only once (the hapax legomena), or all words that occur (all types)? Initially I assumed the former, but surely it must be the latter.

  • p. 6: "LSTMs have been shown to excel at Language Modeling [31] and we therefore resort to it" ==> .. resort to them

  • p. 6: "(i.e. in the present corpus from 89337 syllables to 172 characters)" ==> 172 characters is more than one might expect, so perhaps briefly explain where this number comes from

  • p. 6: "The reasoning is twofold: (i) noisy data.." ==> here, also, it might be worthwhile to say something about the possibly noisy input

  • p. 7: "extracting single word-level distributional feature vector." ==> .. vectors

  • p. 7: "One possibility to accomplish it is to initialize" ==> .. accomplish this..

  • p. 7: "Our model, however less general since it assumes.." ==> Our model, however, is less general (..) yet still achieves...

  • p. 8: "we fine-tune the on a model-per-model basis" ==> something wrong here

  • p. 8: "by manually inspection of the model output at different temperature values" ==> ..manual inspection.. More importantly: can you say a bit more about how this manual inspection was done?

  • p. 8: "Following the template, we generate as many sentences.." ==> it might be worth pointing out that templates are also often used in NLG (although arguably in a somewhat different way). See e.g., Deemter, K. van et al. (2005). Real versus template-based natural language generation: A false opposition?. Computational Linguistics, 31(1), 15-24.

  • p. 8: "where $\mu$ was selected per model through an inspection of random samples" ==> please briefly say how this was done

  • p. 9, caption table 5: "PC words have been deliberately masked" ==> I assume this should be Non-PC words, right? And what about motherfcking and sht? Also: what is a W model?

  • p. 11: "participants performed significantly words on" ==> ..significantly worse..

  • p. 11: "As can be observed from the marginal effects plot in Fig 2a, the learning effect is present in both question types and it is most strongly pronounced at the beginning of the game, after which it diminishes." ==> "most strong pronounced" is a pleonasm ("most strong" or "most pronounced"). More importantly: could this suggest that people start to pick up cues of neurally generated text (see above)?

  • p. 12: I would suggest removing \mu = and \sigma = and just report means and SDs, as 0.045 (0.057) (i.e., M(SD)), which is much more common.

  • p. 16: "At the same time, Hip-Hop lyrics very often do not develop longer stretches of thematically coherent narrative, ..." ==> I beg to differ. Do you have any evidence for this claim? If not, it would be good to phrase this a bit more cautiously.

  • p. 16: "This effect might also be reduced when longer fragments are admitted." ==> I agree, and think this would be a very interesting question for follow-up research. Maybe make this explicit?

  • p. 17ff: the references are not fully consistent in how they cite pages and dates. The Turing reference stands out because it is all caps. Would be good to make this consistent.
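
For the point about "we translated all unique words" above, a minimal Python sketch of the difference between word types and hapax legomena; the toy lyrics and variable names are made up purely for illustration:

```python
from collections import Counter

# Toy example lines; any tokenised corpus would do.
lines = ["the mic is mine", "the flow is mine tonight"]
counts = Counter(word for line in lines for word in line.split())

types = set(counts)                                  # every distinct word
hapaxes = {w for w, c in counts.items() if c == 1}   # words occurring exactly once

print(len(types), len(hapaxes))  # 6 types vs. 3 hapaxes in this toy example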
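
And for the question about inspecting output at different temperature values: temperature sampling itself is standard, so a hedged sketch of what such an inspection loop might look like is given below. The `model.generate` interface is hypothetical, not the paper's code.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from model logits rescaled by a temperature."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Manual inspection at a few temperatures might look roughly like this:
# for t in (0.5, 0.75, 1.0):
#     print(t, model.generate(temperature=t, max_tokens=40))  # hypothetical API
```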

The issue with the expert scoring system

  • Perhaps most importantly, I was not convinced by the expert analysis (p. 15/16). “All players with a score higher than 10 are considered experts (n = 135).” Isn't it likely that a sizeable number of these experts just happened to guess correctly? The scoring system is not entirely clear from the paper, but given the large number of participants, many “experts” could just have been lucky. My suggestion would be to remove this part of the analysis, or consider setting up a "real" expert analysis.
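
To make the "lucky experts" worry concrete, here is a back-of-the-envelope binomial calculation; the number of questions per player, the chance level, and the number of players are assumptions for illustration, not values from the paper:

```python
from math import comb

# Hypothetical setup: each player answers n binary real-vs-generated questions
# at chance level p, and is labelled an "expert" when scoring strictly above k.
n, k, p = 20, 10, 0.5

# Probability that pure guessing already clears the threshold
p_lucky = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))

n_players = 5000  # hypothetical participant count
print(round(p_lucky, 3), round(p_lucky * n_players))  # expected lucky "experts"
```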

The issue with the scope of the findings

  • In the general discussion, I feel the findings could be positioned somewhat more broadly. What do we learn from this study beyond the generation of rap lyrics? The authors very briefly touch upon this, but the discussion remains at a somewhat high level. Could these models also work for the generation of other forms of poetry? There has been some work on this, for example, on the generation of haikus. I am not sure, but I don't think it is true that computer-generated haikus are difficult to distinguish from human-authored haikus, at least until recently (and note that these are very short as well, like the rap fragments studied in this paper). Also: do you think there are practical benefits to computer-generated rap lyrics? And, if so, should rappers acknowledge using tools like these for their lyrics? (I am sure the authors are aware of the controversy surrounding rappers who allegedly don't write their own lyrics, like Drake or Dr. Dre.)

The issue with using semantic coherence measures

Judging a snippet of several lines of lyrics seems like a very hard task if there is no context. I can imagine that the snippet's original position within the song is very important for the rate of success in classifying it as real. A snippet from the beginning of a verse might, for instance, seem more coherent than a snippet from the middle. It seems to me that semantic coherence is the main driver of successful classification. In Fig 1b, for instance, you would not make your choice based on linguistic features but rather on the coherence of the text. I would like to see more analysis of semantic coherence and the role it plays.
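
One way to operationalise the semantic coherence raised here would be the average similarity of adjacent lines in embedding space. The sketch below assumes some pretrained word-vector lookup (`word_vectors`, a dict mapping words to arrays) is available; it illustrates the idea and is not the paper's method:

```python
import numpy as np

def line_embedding(line, word_vectors):
    """Average the word vectors of a line; word_vectors maps word -> np.ndarray."""
    vecs = [word_vectors[w] for w in line.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def coherence(snippet_lines, word_vectors):
    """Mean cosine similarity between consecutive lines of a lyric snippet."""
    embs = [e for e in (line_embedding(l, word_vectors) for l in snippet_lines)
            if e is not None]
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(embs, embs[1:])]
    return sum(sims) / len(sims) if sims else float("nan")
```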

The issue of Hip-Hop lyrics vs rap lyrics

  • The next point is a detail, but might be worth clarifying nevertheless: why do you speak of Hip-Hop lyrics and not rap lyrics? I guess the latter term would be more appropriate, since Hip-Hop, at least traditionally, refers to a culture, which obviously includes rapping, but also, for example, graffiti and breakdancing.

The issue with the evaluation task setup

Another issue is that we don't really get an idea of how well this particular architecture works. Adding a baseline that uses a very simple language model could tell us something about where to place the quality of the outputs of these more sophisticated models. This might be something for the discussion.
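
A very simple baseline of the kind suggested here could be a plain bigram language model; a minimal sketch (the corpus and whitespace tokenisation are placeholders):

```python
import random
from collections import Counter, defaultdict

def train_bigram(lines):
    """Count bigram transitions over whitespace-tokenised lines."""
    transitions = defaultdict(Counter)
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev][nxt] += 1
    return transitions

def generate_line(transitions, max_len=12, seed=0):
    """Sample a line word by word in proportion to the bigram counts."""
    rng, out, prev = random.Random(seed), [], "<s>"
    for _ in range(max_len):
        candidates = transitions.get(prev)
        if not candidates:
            break
        words, weights = zip(*candidates.items())
        prev = rng.choices(words, weights=weights, k=1)[0]
        if prev == "</s>":
            break
        out.append(prev)
    return " ".join(out)
```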

The issue with the training corpus

  • About the training materials: can you say a little about how the OHHLA corpus was collected and whether there is any quality control of the lyrics? Why didn't you use rap.genius? It would also be good, in this context, to refer to Bradley & DuBois' (2010) Anthology of Rap, Yale University Press, since it is the first ‘serious’ anthology of rap lyrics. At the time, this anthology got quite a bit of criticism about the quality of the transliterations, which are indeed notoriously difficult for rap lyrics. How does the possible noisiness of the data influence the models?

Remaining issues

  • Table "Text generation model details." needs to be mentioned and referred to in the text.
  • Table "Examples of generated samples." needs to be mentioned and referred to in the text.

The issue with linguistic features

  • The analysis in terms of linguistic features is very interesting. However, most features are motivated from characteristic properties of rap lyrics. The exceptions are lexical diversity and word repetition. Interestingly, there have been earlier analyses of lexical diversity of rappers, showing that there are huge differences between them (see, for example, here: https://pudding.cool/projects/vocabulary/index.html). In a somewhat similar vein, repetition also occurs a lot in ‘real’ rap lyrics (Travis Scott ft Young Thug -- Yeah Yeah is just one example that springs to mind). It would be good to update the discussion of features accordingly. And what about other features that might be typical of neurally generated text, such as agreement mistakes and lack of global coherence? Would it be possible to integrate those?
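
For reference, the two features singled out here (lexical diversity and word repetition) could be computed along these lines; this is a sketch, not necessarily the paper's exact feature definitions:

```python
from collections import Counter

def type_token_ratio(tokens):
    """Lexical diversity: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def repetition_ratio(tokens):
    """Share of tokens that repeat an already-seen word."""
    counts = Counter(tokens)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(tokens) if tokens else 0.0

tokens = "yeah yeah yeah it's lit it's lit".split()
print(type_token_ratio(tokens), repetition_ratio(tokens))  # ≈0.43 and ≈0.57
```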

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.