
yauaa's Introduction

Yauaa: Yet Another UserAgent Analyzer


This is a Java library that parses and analyzes the User-Agent string (and, when available, the User-Agent Client Hints) and extracts as many relevant attributes as possible.

It works with Java, Scala and Kotlin, and provides ready-to-use UDFs for several processing systems.

The full documentation can be found at https://yauaa.basjes.nl

Try it!

You can try it online with your own browser here: https://try.yauaa.basjes.nl/.

NOTES

  1. This runs on a very slow and rate-limited machine.
  2. If you really like this, run it on your local systems. It's much faster that way. A Kubernetes-ready Docker image is provided. See the documentation page about the WebServlet for more information.

Donations

If this project has business value for you then don't hesitate to support me with a small donation.


License

Yet Another UserAgent Analyzer
Copyright (C) 2013-2024 Niels Basjes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

yauaa's People

Contributors

bkersbergen, dependabot[bot], jlleitschuh, naccl, nielsbasjes, pawel-piecyk, pethers, renovate-bot, renovate[bot], robstoll, selli96, snuyanzin


yauaa's Issues

Report info about rules with clickable links to the yaml file.

Several steps:

  1. Requires reading and using the yaml files differently.
    Change
    Object loadedYaml = yaml.load(yamlStream);
    into
    Node node = yaml.compose(new InputStreamReader(yamlStream));
    In the 'node', the Mark contains the line number information.

  2. Every Matcher/MatcherAction MUST retain the file + line number where it was defined.

  3. Logging any information like this will make the file + line clickable in IntelliJ:
    LOG.error("Syntax error.({}:{})", file, line);
    Without the extra '.' it won't match the pattern IntelliJ looks for.
    See http://stackoverflow.com/questions/7930844/is-it-possible-to-have-clickable-class-names-in-console-output-in-intellij
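
As a minimal sketch of the logging step above, the helper below builds the `.(file:line)` suffix that this issue describes. `ClickableLocation` and `describe` are hypothetical names used only for illustration; they are not part of Yauaa, and the only thing assumed is the pattern quoted in this issue.

```java
// Hypothetical helper, not part of Yauaa: formats a message so that its
// trailing ".(file:line)" follows the pattern described in this issue.
final class ClickableLocation {

    static String describe(String message, String file, int line) {
        // The '.' directly before '(' is the part the issue says is required.
        return String.format("%s.(%s:%d)", message, file, line);
    }

    public static void main(String[] args) {
        // The file name and line number are made-up illustration values.
        System.out.println(describe("Syntax error", "Android.yaml", 42));
    }
}
```

Keeping the formatting in one place would make it easier to guarantee that every Matcher and MatcherAction reports its location the same way.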

Feature request & help generating antlr source files

I like what I’ve seen of YAUAA; thank you for your effort!

One feature request I have would be to recognize when Internet Explorer 11 is in compatibility mode. The user agent looks like this, for example:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Win64; x64; Trident/7.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3; .NET4.0E)

And YAUAA will currently output this information:

  • DeviceClass = Desktop
  • DeviceName = Desktop
  • DeviceBrand = Unknown
  • DeviceCpu = x64
  • OperatingSystemClass = Desktop
  • OperatingSystemName = Windows NT
  • OperatingSystemVersion = Windows 7
  • OperatingSystemNameVersion = Windows 7
  • LayoutEngineClass = Browser
  • LayoutEngineName = Trident
  • LayoutEngineVersion = 7.0
  • LayoutEngineVersionMajor = 7
  • LayoutEngineNameVersion = Trident 7.0
  • LayoutEngineNameVersionMajor = Trident 7
  • AgentClass = Browser
  • AgentName = Internet Explorer
  • AgentVersion = 11.0
  • AgentVersionMajor = 11
  • AgentNameVersion = Internet Explorer 11.0

It would be very helpful to me if there was a way to indicate that it is in IE 7 compatibility mode.
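
One possible way to detect this, sketched below, is to compare the version claimed in the `MSIE` token with the version implied by the `Trident` token. This is not how Yauaa does it; `IeCompatibilityMode` and its methods are hypothetical names, and the sketch assumes the usual pairing of Trident 4.0 through 7.0 with IE 8 through 11.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Yauaa code: flags IE compatibility mode when the
// MSIE token claims an older version than the Trident token implies.
final class IeCompatibilityMode {
    private static final Pattern MSIE = Pattern.compile("MSIE (\\d+)\\.");
    private static final Pattern TRIDENT = Pattern.compile("Trident/(\\d+)\\.");

    /** IE major version implied by the Trident token, or -1 if absent. */
    static int realIeMajor(String ua) {
        Matcher m = TRIDENT.matcher(ua);
        // Assumption: Trident 4..7 shipped with IE 8..11, so the offset is 4.
        return m.find() ? Integer.parseInt(m.group(1)) + 4 : -1;
    }

    /** IE major version claimed by the MSIE token, or -1 if absent. */
    static int claimedIeMajor(String ua) {
        Matcher m = MSIE.matcher(ua);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    static boolean isCompatibilityMode(String ua) {
        int real = realIeMajor(ua);
        int claimed = claimedIeMajor(ua);
        return real > 0 && claimed > 0 && claimed < real;
    }
}
```

For the user agent in this issue the MSIE token says 7 while Trident/7.0 implies 11, so the sketch reports compatibility mode, and `claimedIeMajor` gives the emulated version.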

To that end, I downloaded the source and tried to get it working. I had some problems, starting with generating the ANTLR Java classes. Admittedly, I am an ANTLR beginner, so it’s likely my process is wrong. But for reference, what I tried was:

  1. Copying UserAgent.g4 to a temporary directory.
  2. From the command line, I ran antlr4 against the grammar file:

$antlr4 UserAgent.g4

The output was a number of errors:

  • error(156): UserAgent.g4:70:6: invalid escape sequence
  • error(156): UserAgent.g4:121:24: invalid escape sequence
  • error(156): UserAgent.g4:130:8: invalid escape sequence
  • error(156): UserAgent.g4:130:51: invalid escape sequence
  • error(156): UserAgent.g4:134:8: invalid escape sequence
  • error(156): UserAgent.g4:148:30: invalid escape sequence
  • error(156): UserAgent.g4:149:30: invalid escape sequence

Compiling the resulting Java source and testing the parser with the TestRig tool failed:

grun UserAgent userAgent -gui

Can't load UserAgent as lexer or parser

When I tried the same thing with UserAgentTreeWalker.g4, antlr was able to generate the java source without any errors, and I was able to use the TestRig tool successfully.

Test run findings

Hi,
You have done a great job so far.
I am working on a project that involves about a million clicks every day.
After a test run of your parser with about 600k user agents from my database, I have some interesting findings:

  • It took my MacBook Pro (late 2013, 16 GB RAM) about 60 seconds to process 600k UA strings. Well done!
  • The following UAs give "Hacker" in the fields:
    MAUI WAP Browser
    Mozilla/5.0 (Linux; Android 6.1; Honor Note 8 Build/MXC89L) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.91 Safari/537.36
    Dorado WAP-Browser

I attach here the parse results in case you want to look into it. Basically the results were logged by the following code:

	for (String ua : uArrayList) {
		// Parse each user agent once, then read the fields of interest.
		UserAgent agent = uaa.parse(ua);
		AgentField deviceClass = agent.get("DeviceClass");
		AgentField deviceName = agent.get("DeviceName");
		AgentField operatingSystemClass = agent.get("OperatingSystemClass");
		AgentField operatingSystemName = agent.get("OperatingSystemName");
		AgentField operatingSystemVersion = agent.get("OperatingSystemVersion");
		AgentField layoutEngineClass = agent.get("LayoutEngineClass");
		// One pipe-separated line per user agent, ending with the raw string.
		logger.info("{}|{}|{}|{}|{}|{}|{}", deviceClass.getValue(), deviceName.getValue(),
				operatingSystemClass.getValue(), operatingSystemName.getValue(),
				operatingSystemVersion.getValue(), layoutEngineClass.getValue(),
				ua);
	}

myLogFile.log.gz

Since I am sitting on a lot of user agents, just let me know if you want to look into them.
Regards,

Lenovo tablets recognised as phones

The following Lenovo tablets are recognised incorrectly as phones:

Mozilla/5.0 (Linux; U; Android 4.2.2; pl-pl; Lenovo B8000-H/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.2.2 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 4.2.2;pl-pl; Lenovo B8000-F/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.2.2 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 4.2.2; pl-pl; Lenovo A7600-H Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 4.2.2;pl-pl; Lenovo B6000-F/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.2.2 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 4.2.2; pl-pl; Lenovo A3500-H Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 4.2.2; pl-pl; Lenovo A3500-FL Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30

I would prepare a PR for the Lenovo case, but it is not so easy to get started. Should I create a separate config file for Lenovo or add matchers to Android.yaml? Is it possible to create a single rule and put all Lenovo models into a CSV file, like the Windows OEM models? How do I override only the device class, from phone to tablet?

Integrate Samsung specifications

https://www.samsungdforum.com/SamsungDForum/NewsView?newsID=70
http://developer.samsung.com/technical-doc/view.do?v=T000000202
http://developer.samsung.com/technical-doc/view.do?v=T000000203

Mozilla/$(MOZILA_VER) ($(DEVICE_TYPE); $(OS); $(PLATFORM) $(PLATFORM_VER);
SAMSUNG $(MODEL_NAME) Build/$(BUILD_TAG)) AppleWebKit/$(APPLEWEBKIT_VER)
(KHTML, like Gecko) $(APP_NAME)/$(APP_VER) (Chrome/$(CHROME_VER))
$(UX RECOMMEND) Safari/$(SAFARI_VER)

Field | Description | Required
$(DEVICE_TYPE) | “SMART-TV” is used for Samsung Smart TV. Mobile devices do not use this field. | Optional
$(PLATFORM) $(PLATFORM_VER) | “Tizen” is used for Samsung Smart TV 2015 new models (and later), and for Tizen Mobile. | Mandatory
SAMSUNG | Company name | Optional
$(MODEL_NAME) | Mobile devices use the MODEL_NAME field for each device. Smart TV currently does not use this field. It may be used in the future. | Optional
Build/$(BUILD_TAG)) | The platform build tag is used on Android devices. Currently, Tizen devices do not use this field. It may be used in the future. | Optional
$(APP_NAME)/$(APP_VER) | Web browsers on Samsung devices (Mobile and Smart TV) use “SamsungBrowser/version”. | Mandatory
(Chrome/$(CHROME_VER)) | This field is present in Chrome-based web browsers only. The Android browser currently presents it, whereas the Tizen Samsung Browser does not because it is based on WebKit. The Tizen Samsung Browser will also present it if it becomes Chrome-based in the future. | Optional
$(UX_RECOMMEND) | Mobile devices with small screens (e.g. less than 7") use “Mobile”. Smart TVs use “TV”. Devices supporting Virtual Reality content use “VR”. If the PC UX is appropriate for the device, this field is empty. | Optional

Samsung Web Application User-Agent String Format
Mozilla/$(MOZILA_VER) ($(DEVICE_TYPE); $(OS); $(PLATFORM) $(PLATFORM_VER);
SAMSUNG $(MODEL_NAME) Build/$(BUILD_TAG)) AppleWebKit/$(APPLEWEBKIT_VER)
(KHTML, like Gecko) Version/$(PLATFORM_VER) (Chrome/$(CHROME_VER))
$(UX RECOMMEND) Safari/$(SAFARI_VER)

Field | Description | Required
Version/$(PLATFORM_VER) | Web Applications using the platform’s WebView have Version/(OS version) instead of $(APP_NAME)/$(APP_VER). | Mandatory

※ If the $(PLATFORM_VER) is less than 4.0, then it is not a Web Application.

Other fields are the same as for the Web browser.

Samsung Internet for Smart-TV User-Agent String Format

Please check below for existing Samsung Internet for SmartTV UA.

Identify the Samsung Internet for SmartTV by using the “SMART-TV” keyword.

Year | UA String
2015 | Mozilla/5.0 (SMART-TV; Linux; Tizen 2.3) AppleWebkit/538.1 (KHTML, like Gecko) SamsungBrowser/1.0 TV Safari/538.1
2014 | Mozilla/5.0 (SMART-TV; X11; Linux armv7l) AppleWebkit/537.42 (KHTML, like Gecko) Safari/537.42
2013 | Mozilla/5.0 (SMART-TV;X11; Linux i686) AppleWebkit/535.20+ (KHTML, like Gecko) Version/5.0 Safari/535.20+
2012 | Mozilla/5.0 (SMART-TV; X11; Linux i686) AppleWebKit/534.7 (KHTML, like Gecko) Version/5.0 Safari/534.7
2011 | Mozilla/5.0 (SmartHub; SMART-TV; U; Linux/SmartTV) AppleWebKit/531.2 (KHTML, like Gecko) Web Browser/1.0 SmartTV Safari/531.2+

User-Agent (UA) String Examples

Previous: UA string used in devices before 2015.
Current: new UA string used in:

  • Mobile: devices released in 2015 and later, and devices updated to Android 5.0 Lollipop (with a small number of exceptions)
  • TV: Tizen SmartTV Web Browser 2015 and later

Samsung Internet for Android

Previous (Samsung Galaxy Note Edge):
Mozilla/5.0 (Linux; Android 4.4.4; en-au; SAMSUNG SM-N915G Build/KTU84P) AppleWebKit/537.36 (KTHML, like Gecko) Version/2.0 Chrome/34.0.1847.76 Mobile Safari/537.36
Current (Samsung Internet for Android 4.0):
Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-G925F Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36

Samsung Internet for Tizen Mobile

Current:
Mozilla/5.0 (Linux; Tizen 2.3; SAMSUNG SM-Z130H) AppleWebKit/537.3 (KHTML, like Gecko) SamsungBrowser/1.0 Mobile Safari/537.3

Tizen Mobile Web Application

Current:
Mozilla/5.0 (Linux; Tizen 2.3; SAMSUNG SM-Z130H) AppleWebKit/537.3 (KHTML, like Gecko) Version/2.3 Mobile Safari/537.3

Samsung Internet for Smart-TV

Current:
Mozilla/5.0 (SMART-TV; Linux; Tizen 2.3) AppleWebkit/538.1 (KHTML, like Gecko) SamsungBrowser/1.0 TV Safari/538.1

Tizen TV Web Application

Current:
Mozilla/5.0 (SMART-TV; Linux; Tizen 2.2; SAMSUNG SM-Z910F) AppleWebKit/537.3 (KHTML, like Gecko) Version/2.2 TV Safari/538.1

Samsung Internet for Gear VR

Current:
Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-G925K Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile VR Safari/537.36

Content Guide

Request for Mobile Content

  • If $(UX_RECOMMEND) is “Mobile”, the mobile page and UX are appropriate.

Request for TV Content

  • If $(DEVICE_TYPE) is “SMART-TV” or $(UX_RECOMMEND) is “TV”, TV content is appropriate.
    If the web page is not suitable for the TV Web Browser, provide the Tablet or PC version of the page (the Tablet page is preferred over the PC version).
    ※ Consider the Resize Event ※
    On the Resize Event, focus should be maintained on the Input field so that the user of the TV web browser can input characters and symbols using Samsung IME.
  • Samsung Smart TV uses both fields: “SMART-TV” for $(DEVICE_TYPE) and “TV” for $(UX_RECOMMEND).
  • Do not use Flash content. Use HTML5.
  • If $(UX_RECOMMEND) is empty, PC content is shown.
  • If there is no TV-oriented content, the PC content option is OK.

“Samsung” is not used for Mobile only, so “Samsung” is not a good identifier for Mobile.
“Tizen” is not used for Mobile only, so “Tizen” is also not a good identifier for Mobile.
$(PLATFORM): Android can be used for Mobile and Tablet (PC); Tizen can be used for Mobile and TV.
The following table shows how the identifiers and proper contents are related.

Proper Contents | $(DEVICE_TYPE) | $(PLATFORM) | $(UX_RECOMMEND)
Mobile | - | Android or Tizen | Mobile
TV | SMART-TV | Tizen | TV
PC | - | Android or Tizen | -
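
The content rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not Yauaa or Samsung code: `SamsungContentGuide` and `properContent` are hypothetical names, and the sketch assumes the $(UX_RECOMMEND) token sits directly before `Safari/`, optionally followed by `VR`, as in the examples quoted above.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the Samsung content guide quoted in this issue.
final class SamsungContentGuide {
    enum Content { MOBILE, TV, PC }

    // Assumption: $(UX_RECOMMEND) appears just before "Safari/", and "VR" may follow it.
    private static final Pattern UX_TV = Pattern.compile("\\bTV( VR)? Safari/");
    private static final Pattern UX_MOBILE = Pattern.compile("\\bMobile( VR)? Safari/");

    static Content properContent(String ua) {
        // TV: either $(DEVICE_TYPE) is SMART-TV or $(UX_RECOMMEND) is TV.
        if (ua.contains("SMART-TV") || UX_TV.matcher(ua).find()) {
            return Content.TV;
        }
        // Mobile: $(UX_RECOMMEND) is Mobile.
        if (UX_MOBILE.matcher(ua).find()) {
            return Content.MOBILE;
        }
        // Empty $(UX_RECOMMEND): PC content.
        return Content.PC;
    }
}
```

Note that the sketch deliberately ignores the “SAMSUNG” and “Tizen” tokens, since the guide states that neither is a good identifier for Mobile.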

Device name with underscore

DeviceName returns an underscore (_) in the name.


- test:
    input:
      user_agent_string: 'UCWEB/2.0 (MIDP-2.0; U; Adr 7.0; en-US; Nexus_6) U2/1.0.0 UCBrowser/10.9.0.946 U2/1.0.0 Mobile'
    expected:
      DeviceClass  : 'Mobile'
      DeviceName  : 'Google Nexus 6'
      DeviceBrand  : 'Google'
      DeviceVersion  : '7.0'
      OperatingSystemClass  : 'Mobile'
      OperatingSystemName  : 'Android'
      OperatingSystemVersion  : '7.0'
      OperatingSystemNameVersion  : 'Android 7.0'
      LayoutEngineClass  : 'Browser'
      LayoutEngineName  : 'UCBrowser'
      LayoutEngineVersion  : '10.9.0.946'
      LayoutEngineVersionMajor  : '10'
      LayoutEngineNameVersion  : 'UCBrowser 10.9.0.946'
      LayoutEngineNameVersionMajor  : 'UCBrowser 10'
      AgentClass  : 'Browser'
      AgentName  : 'UCBrowser'
      AgentVersion  : '10.9.0.946'
      AgentVersionMajor  : '10'
      AgentNameVersion  : 'UCBrowser 10.9.0.946'
      AgentNameVersionMajor  : 'UCBrowser 10'
      AgentLanguage  : 'English (United States)'
      AgentLanguageCode  : 'en-us'
      AgentSecurity  : 'Strong security'



Not serializable

Hi,
We're trying to use the library in Apache Spark processing tasks, but the UserAgentAnalyzer itself is not serializable. Do you think it would be possible to make it serializable?

Thank you for your time.
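
Until the analyzer itself is serializable, a common workaround is to wrap it in a serializable holder that keeps the analyzer in a `transient` field and rebuilds it lazily on each worker. The sketch below illustrates that pattern only: `LazyAnalyzerHolder` and `ExpensiveAnalyzer` are hypothetical stand-ins, not Yauaa classes, and the stand-in parser is deliberately trivial.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: a serializable wrapper around a non-serializable analyzer.
final class LazyAnalyzerHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Stand-in for an expensive analyzer that is NOT Serializable. */
    static final class ExpensiveAnalyzer {
        String agentName(String ua) {
            return ua.contains("Firefox/") ? "Firefox" : "Unknown";
        }
    }

    // transient: skipped by Java serialization, so it is null after deserialization.
    private transient ExpensiveAnalyzer analyzer;

    String agentName(String ua) {
        if (analyzer == null) {
            // Rebuilt on first use; not thread-safe in this simple form.
            analyzer = new ExpensiveAnalyzer();
        }
        return analyzer.agentName(ua);
    }

    /** Serializes and deserializes the holder, as a framework would when shipping a task. */
    static LazyAnalyzerHolder roundTrip(LazyAnalyzerHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyAnalyzerHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that every deserialized copy pays the construction cost once, so the holder should be created per task or partition rather than per record.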

Hive UDF makes the Hive server use 2000% CPU

hive logs:
2017-07-19T06:09:02,492 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResources(334)) - Building 2033 (dropped 231) matchers from 52 files took 3880 msec resulted in 180308 hashmap entries
2017-07-19T06:09:02,492 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResources(337)) - Analyzer stats
2017-07-19T06:09:02,493 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResources(338)) - Lookups : 24
2017-07-19T06:09:02,493 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResources(339)) - Matchers : 2033 (total:2033 ; dropped: 231)
2017-07-19T06:09:02,493 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResources(340)) - Hashmap size : 180308
2017-07-19T06:09:02,493 INFO [b558b99d-de0b-477e-ba8d-1362ba049269 HiveServer2-Handler-Pool: Thread-1209506([])]: useragent.UserAgentAnalyzer (UserAgentAnalyzer.java:loadResource

and it causes JVM pauses:
2017-07-19T06:09:17,286 INFO [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@58015e56([])]: common.JvmPauseMonitor (JvmPauseMonitor.java:run(194)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1263ms

Speedup idea

Sort all matchers by score (high to low). Use the highest available score.
Only try to extract the value (walk) if a matcher action exists that CAN provide a better value.
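
A rough sketch of this idea, assuming each matcher has a fixed score known up front: sort once, then walk only until the first matcher yields a value, because every later matcher has an equal or lower score and cannot improve on it. `ScoredMatcher` and the other names are hypothetical and do not reflect Yauaa's real matcher classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the speedup idea; not Yauaa's actual data structures.
final class ScoreOrderedMatching {

    /** A fixed score plus an extractor that returns null when it does not match. */
    static final class ScoredMatcher {
        final long score;
        final Function<String, String> extractor;

        ScoredMatcher(long score, Function<String, String> extractor) {
            this.score = score;
            this.extractor = extractor;
        }
    }

    /** Done once at setup time: highest score first. */
    static List<ScoredMatcher> sortByScoreDescending(List<ScoredMatcher> matchers) {
        List<ScoredMatcher> sorted = new ArrayList<>(matchers);
        sorted.sort(Comparator.comparingLong((ScoredMatcher m) -> m.score).reversed());
        return sorted;
    }

    /** Walks matchers in score order and stops at the first one that yields a value. */
    static String bestValue(List<ScoredMatcher> sortedMatchers, String ua) {
        for (ScoredMatcher matcher : sortedMatchers) {
            String value = matcher.extractor.apply(ua);
            if (value != null) {
                // Remaining matchers score the same or lower, so they are never walked.
                return value;
            }
        }
        return null;
    }
}
```

If scores depend on the matched input rather than being fixed per matcher, the early exit would need an upper bound per matcher instead of the exact score.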

I browsed with Safari on Linux, but Chromium was shown as the agent name

I found a problem with this useragent.
[Please update the output below to match what you expect it should be]


- test:
    input:
      user_agent_string: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36'
    expected:
      DeviceClass  : 'Desktop'
      DeviceName  : 'Linux Desktop'
      DeviceBrand  : 'Unknown'
      DeviceCpu  : 'Intel x86_64'
      OperatingSystemClass  : 'Desktop'
      OperatingSystemName  : 'Linux'
      OperatingSystemVersion  : 'Intel x86_64'
      OperatingSystemNameVersion  : 'Linux Intel x86_64'
      LayoutEngineClass  : 'Browser'
      LayoutEngineName  : 'Blink'
      LayoutEngineVersion  : '54.0'
      LayoutEngineVersionMajor  : '54'
      LayoutEngineNameVersion  : 'Blink 54.0'
      LayoutEngineNameVersionMajor  : 'Blink 54'
      AgentClass  : 'Browser'
      AgentName  : 'Chrome'
      AgentVersion  : '54.0.2840.100'
      AgentVersionMajor  : '54'
      AgentNameVersion  : 'Chrome 54.0.2840.100'
      AgentNameVersionMajor  : 'Chrome 54'



Version "??" value is inconsistent with other "Unknown" values

When parsing the user agent string "Apache-HttpClient/release (java 1.5)", the UserAgentAnalyzer "OperatingSystemVersion" field gets resolved to the string "??". Most other fields seem to get resolved to "Unknown" if there's no reliable value.

For the sake of consistency, I recommend resolving an unknown "OperatingSystemVersion" value to "Unknown" instead of "??".

Include whether the OS is 16, 32 or 64 bit

We have a use case where we want to offer our customers the optimal download of a binary. It would be greatly appreciated if this information could be added to the DeviceProfile so we can use it to improve the customer experience.

Is there a list of values?

Hi,

Are there lists of possible values for the classes? For example, a list of possible values for Device Class or Operating System Name, etc.

Puffin

Puffin on Android Phone looks like this:

Mozilla/5.0 (X11; U; Linux x86_64; en-gb) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.114 Safari/537.36 Puffin/4.8.0.2790AP

Yet it is reported as Chrome 30.
A generic rule is needed that looks at the product after Safari and uses that if present, the same as with Opera.
The "Linux Desktop" effect cannot be fixed.

License question

Hi Niels,
I'm interested in making a UDF for Drill that uses your UA parser. Would you be willing to release it using the Apache license?
Thanks,

Extra agents

Apple-iPad2C2/1305.238
Apple-iPhone3C1/1104.257
Apple-iPhone7C2
Apple-iPhone/705.18
Apple-iPod4C1/1002.500

UC Browser tablet/phone detection problems

I found a problem with this useragent.
[Please update the output below to match what you expect it should be]


- test:
    input:
      user_agent_string: 'UCWEB/2.0 (MIDP-2.0; U; Adr 7.0; en-US; Nexus_6) U2/1.0.0 UCBrowser/10.9.2.962 U2/1.0.0 Mobile'
    expected:
      DeviceClass  : 'Phone'
      DeviceName  : 'Google Nexus 6'
      DeviceBrand  : 'Google'
      DeviceVersion  : '7.0'
      OperatingSystemClass  : 'Mobile'
      OperatingSystemName  : 'Android'
      OperatingSystemVersion  : '7.0'
      OperatingSystemNameVersion  : 'Android 7.0'
      LayoutEngineClass  : 'Browser'
      LayoutEngineName  : 'UCBrowser'
      LayoutEngineVersion  : '10.9.2.962'
      LayoutEngineVersionMajor  : '10'
      LayoutEngineNameVersion  : 'UCBrowser 10.9.2.962'
      LayoutEngineNameVersionMajor  : 'UCBrowser 10'
      AgentClass  : 'Browser'
      AgentName  : 'UCBrowser'
      AgentVersion  : '10.9.2.962'
      AgentVersionMajor  : '10'
      AgentNameVersion  : 'UCBrowser 10.9.2.962'
      AgentNameVersionMajor  : 'UCBrowser 10'
      AgentLanguage  : 'English (United States)'
      AgentLanguageCode  : 'en-us'
      AgentSecurity  : 'Strong security'



Fix appengine/gcloud warning

[INFO] GCLOUD: Reading application configuration data...
[INFO] GCLOUD: *********************************
[INFO] GCLOUD: Configuration Warning : / XML elements and --application/--version should not be specified when staging
[INFO] GCLOUD:
[INFO] GCLOUD: The following parameters will be scrubbed from app.yaml
[INFO] GCLOUD: application : analyze-useragent
[INFO] GCLOUD: version : v2.0 2017-08-19 @ 21:14:11 CEST
[INFO] GCLOUD:
[INFO] GCLOUD: Future versions of staging will fail if application or version is specified.
[INFO] GCLOUD: *********************************

Verify specifications

Assembly warning

[WARNING] The assembly descriptor contains a filesystem-root relative reference, which is not cross platform compatible /

IsNull[IsNull[foo]] fails

Multiple IsNull operators can be nested, but if you do so the checks will fail.
Perhaps block this option (it doesn't seem to be useful).

Mac OS showing as Play Station. Not a play station.

Background:
Looking at the OS field of bid_request_data ('$.us.dv.os') I noticed that from a random sample of 10K bids in the desktop channel on 6/27, ~13% of these had the value of 'PLAY_STATION' which seemed way too high.
Looking at the UA string from lineitem_scarcity_score ('$.ua') , most of these indicated that they were some version of OSX (e.g. macintosh; intel mac os x_10_15).
I've attached a dist from AML of 1 hour looking at all the cases where we saw a 'PLAY_STATION' OS type. Most of the time, the UA string indicates that the device has some kind of macintosh OS.

iOS names

I see the following iOS Names --> Consolidate
iOS
iPhone
iPhone OS
iPhoneOS
iPadOS
Darwin ??

google adwords bots misclassified

The Google AdWords bots seem to be misclassified. They should be Robot instead of Browser.
These are the user agents Google is currently using (confirmed on https://support.google.com/adwords/answer/2404197 ):

Mozilla/5.0 (Linux; Android 5.0; SM-G920A) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)
Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)

I've attached code+data to reproduce the effect.

adwordbots.pig.txt
bots.txt

Logging in spring boot for every http request

Please provide a configuration option to disable printing the banner:

Yauaa 1.4 (v1.4 @ 2017-07-07T20:35:47Z) |
Yauaa 1.4 (v1.4 @ 2017-07-07T20:35:47Z) |
---------------------------------------------------------+
---------------------------------------------------------+
I am finding it difficult to use this jar because it reduces the performance of the application.

Can you standardize the DeviceName ?

Currently, for example:

DeviceName = SM801
DeviceBrand = Unknown

DeviceName = Redmi Note 2
DeviceBrand = Redmi

DeviceName = HUAWEI GRA-CL10
DeviceBrand = Huawei

DeviceName = PE-TL20
DeviceBrand = Huawei

I can't compute statistics on this.
Either every DeviceName should include the DeviceBrand, or every DeviceName should omit it.
Just pick one of the two ways above.

What I want:

DeviceName = GRA-CL10
DeviceBrand = Huawei

DeviceName = PE-TL20
DeviceBrand = Huawei

DeviceName = Note 2
DeviceBrand = Redmi

I hope you know what I mean ..........
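
As an illustration of the "DeviceName without DeviceBrand" variant requested here, the sketch below strips a leading brand from the name, ignoring case. It is only a sketch: `DeviceNameNormalizer` and `withoutBrand` are hypothetical names and this is not how Yauaa normalizes device names.

```java
// Hypothetical sketch: removes a leading brand prefix from a device name.
final class DeviceNameNormalizer {

    static String withoutBrand(String deviceName, String deviceBrand) {
        String name = deviceName.trim();
        String brand = deviceBrand.trim();
        // Nothing to strip when the brand is missing or not known.
        if (brand.isEmpty() || brand.equalsIgnoreCase("Unknown")) {
            return name;
        }
        String prefix = brand + " ";
        // Case-insensitive check so "HUAWEI GRA-CL10" matches brand "Huawei".
        if (name.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return name.substring(prefix.length()).trim();
        }
        return name;
    }
}
```

With the examples from this issue, "HUAWEI GRA-CL10" with brand "Huawei" becomes "GRA-CL10", "Redmi Note 2" with brand "Redmi" becomes "Note 2", and "PE-TL20" is left unchanged.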

Opr = Opera

According to http://www.useragentstring.com/pages/Opera/ Opera uses Opera. But that's only for versions until 12.
According to the official homepage ( https://dev.opera.com/blog/opera-user-agent-strings-opera-15-and-beyond/ ), version 15 and up uses OPR.

So I would regex for both, Opera and OPR

Opera User Agent Strings: Opera 15 and Beyond

If you track user agents visiting your sites, please adjust your scripts for Opera 15 for desktop and Android.

Desktop's UA string (on Windows) is

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36 OPR/15.0.1147.100

On all platforms, the digits after "OPR/" tell you version and minor version number - in this case "15.0". (The subsequent numbers are internal identifiers and build numbers.)

Opera 15 for Android contains the string "Mobile" and also contains "OPR/" followed by version number, as both Opera 15 for desktop and Android are based on Chromium 28.

Opera Mini continues to use Presto on the server, and its UA string is unchanged; it contains the string "Opera Mini".
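
A minimal sketch of matching both tokens with one regular expression follows. `OperaVersion` and `majorMinor` are hypothetical names, not Yauaa code, and the sketch only extracts the "major.minor" digits that follow either `OPR/` or `Opera/`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: accepts both "OPR/" (Opera 15 and later) and "Opera/".
final class OperaVersion {
    private static final Pattern OPERA =
            Pattern.compile("\\b(?:OPR|Opera)/(\\d+\\.\\d+)");

    /** Returns "major.minor" from the first OPR/ or Opera/ token, or null if none. */
    static String majorMinor(String ua) {
        Matcher m = OPERA.matcher(ua);
        return m.find() ? m.group(1) : null;
    }
}
```

For the desktop string quoted above this yields "15.0". Opera Mini is not covered, because its token is "Opera Mini/" rather than "Opera/".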

Split code between 'construct' and 'run'

While analyzing the serializability I concluded that in the long run it would be better to split the 'setup' part (i.e. reading Yaml files and setting up all the Hashmaps and Step lists) from the 'runtime' part (i.e. actually analyzing the useragents and returning the results).

DuckDuckGo app

I found a problem with this useragent.
DuckDuckGo search and browser on android v3.0.14


- test:
    input:
      user_agent_string: 'DDG-Android-3.0.14'
    expected:
      DeviceClass  : 'Hacker'
      DeviceName  : 'Hacker'
      DeviceBrand  : 'Hacker'
      DeviceVersion  : 'Hacker'
      OperatingSystemClass  : 'Hacker'
      OperatingSystemName  : 'Hacker'
      OperatingSystemVersion  : 'Hacker'
      OperatingSystemNameVersion  : 'Hacker'
      LayoutEngineClass  : 'Hacker'
      LayoutEngineName  : 'Hacker'
      LayoutEngineVersion  : 'Hacker'
      LayoutEngineVersionMajor  : 'Hacker'
      LayoutEngineNameVersion  : 'Hacker'
      LayoutEngineNameVersionMajor  : 'Hacker'
      AgentClass  : 'Hacker'
      AgentName  : 'Hacker'
      AgentVersion  : 'Hacker'
      AgentVersionMajor  : 'Hacker'
      AgentNameVersion  : 'Hacker'
      AgentNameVersionMajor  : 'Hacker'
      HackerAttackVector  : 'Unknown'
      HackerToolkit  : 'Unknown'



Some devices are detected as Mobile instead of Phone

Here is an agent which is detected as Mobile instead of Phone:

mozilla/5.0 (linux; android 6.0.1; sm-g920f build/mmb29k) applewebkit/537.36 (khtml, like gecko) chrome/48.0.2564.95 mobile

Device
  Class: Mobile
  Name: Samsung SM-G920F
  Brand: Samsung
Operating System
  Class: Mobile
  Name: Android
  Version: 6.0.1
  Name Version: Android 6.0.1
  Version Build: mmb29k
Layout Engine
  Class: Browser
  Name: Blink
  Version: 48.0
  Version Major: 48
  Name Version: Blink 48.0
  Name Version Major: Blink 48
Agent
  Class: Browser
  Name: Chrome
  Version: 48.0.2564.95
  Version Major: 48
  Name Version: Chrome 48.0.2564.95
  Name Version Major: Chrome 48

Preheat jvm method

Add a method that simply runs all test cases. This should take about 1-2 seconds. Run it several times to preheat the JVM.
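
Such a method could look roughly like the sketch below. `JvmPreheat` and `preheat` are hypothetical names; the parser is passed in as a plain function so the sketch does not assume anything about the real analyzer API.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a preheat method; not an existing Yauaa API.
final class JvmPreheat {

    /**
     * Runs every test case through the parser for the given number of rounds
     * so the JIT compiler can optimize the hot paths before real traffic arrives.
     * Returns the number of parse calls made.
     */
    static long preheat(Function<String, ?> parse, List<String> testCases, int rounds) {
        long parsed = 0;
        for (int round = 0; round < rounds; round++) {
            for (String userAgent : testCases) {
                parse.apply(userAgent);
                parsed++;
            }
        }
        return parsed;
    }
}
```

Returning the call count gives the caller something cheap to log, for example together with the elapsed time per round.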

WeChat useragent should parse to WeChat (Browser)

For example, the WeChat app useragent:

Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; PE-TL20 Build/HuaweiPE-TL20) AppleWebKit/533.1 (KHTML, like Gecko)Version/4.0 MQQBrowser/5.4 TBS/025440 Mobile Safari/533.1 MicroMessenger/6.2.5.53_r2565f18.621 NetType/WIFI Language/zh_CN

But http://devicedetector.net/ gives this result:
client (mobile app)
WeChat 6.2

IllegalArgumentException when parsing AgentClass

When parsing the attached (malformed) useragent, yauaa does not catch the runtime exception and the pig-job crashes.
This effect is not wanted :)
The error can also be reproduced by pasting it into http://analyze-useragent.appspot.com/

error.txt
input.txt

Pig Latin code:

    DEFINE ParseUserAgent  nl.basjes.parse.useragent.pig.ParseUserAgent;
    REGISTER lib/yauaa-pig-0.11-udf.jar;

    raw = load 'input.txt' as (agent:chararray);
    b = foreach raw generate ParseUserAgent(agent).AgentClass;
    dump b;

Samsung Browser on non-Samsung device

I found a problem with this useragent.
[Please update the output below to match what you expect it should be]


- test:
    input:
      user_agent_string: 'Mozilla/5.0 (Linux; Android 7.0; SAMSUNG Nexus 6 Build/NBD92G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/5.4 Chrome/51.0.2704.106 Mobile Safari/537.36'
    expected:
      DeviceClass  : 'Phone'
      DeviceName  : 'Google Nexus 6'
      DeviceBrand  : 'Google'
      OperatingSystemClass  : 'Mobile'
      OperatingSystemName  : 'Android'
      OperatingSystemVersion  : '7.0'
      OperatingSystemNameVersion  : 'Android 7.0'
      OperatingSystemVersionBuild  : 'NBD92G'
      LayoutEngineClass  : 'Browser'
      LayoutEngineName  : 'Blink'
      LayoutEngineVersion  : '51.0'
      LayoutEngineVersionMajor  : '51'
      LayoutEngineNameVersion  : 'Blink 51.0'
      LayoutEngineNameVersionMajor  : 'Blink 51'
      AgentClass  : 'Browser'
      AgentName  : 'SamsungBrowser'
      AgentVersion  : '5.4'
      AgentVersionMajor  : '5'
      AgentNameVersion  : 'SamsungBrowser 5.4'
      AgentNameVersionMajor  : 'SamsungBrowser 5'



Hacker or not hacker

(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240

(Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0

DeviceCPU returning Unknown for 32-bit IE on 64 bit OS

Hi,

I noticed that with this user agent string:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3; .NET4.0E)

When I ask for UserAgent.getValue("DeviceCpu"), I get a value of Unknown. From the MSDN docs you found:

WOW64 A 32-bit version of Internet Explorer is running on a 64-bit processor.

So, maybe it could be updated?

Also, I guess DeviceCPU is intended to show the CPU of the device, but really the user agent string (at least in this example) tells us two things:

  • the CPU of the device (which I think matches up with DeviceCPU)
  • the CPU architecture of the user agent.

Maybe as a future enhancement, DeviceCPU could be supplemented with another entry that indicates Agent architecture? Just a thought.
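
To illustrate the two separate pieces of information, the sketch below derives both a device bitness and an agent bitness from the Windows tokens. It is a hypothetical illustration, not Yauaa code: `WindowsBitness` and its methods are made-up names, and it relies only on the WOW64 meaning quoted above plus the assumption that `Win64` marks a 64-bit agent on a 64-bit processor.

```java
// Hypothetical sketch: separates device CPU bitness from agent bitness.
final class WindowsBitness {
    static final int UNKNOWN = 0;

    /** Bitness of the processor, as far as the Windows tokens reveal it. */
    static int deviceCpuBits(String ua) {
        // Both tokens imply a 64-bit processor.
        if (ua.contains("WOW64") || ua.contains("Win64")) {
            return 64;
        }
        return UNKNOWN;
    }

    /** Bitness of the browser process itself. */
    static int agentBits(String ua) {
        // WOW64: a 32-bit Internet Explorer running on a 64-bit processor.
        if (ua.contains("WOW64")) {
            return 32;
        }
        // Assumption: Win64 means the agent itself is 64-bit.
        if (ua.contains("Win64")) {
            return 64;
        }
        return UNKNOWN;
    }
}
```

For the user agent in this issue that gives a 64-bit device with a 32-bit agent, which is the distinction the proposed extra field would capture.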

Remove Spring

We only use Spring to read the resource files.
This is much too heavy for this simple usecase.
